Monitoring and Prediction of Thermal Emergencies in High Performance Computing Systems

Seyedkazemi Ardebili, Mohsen (2022) Monitoring and Prediction of Thermal Emergencies in High Performance Computing Systems, [Dissertation thesis], Alma Mater Studiorum Università di Bologna. Dottorato di ricerca in Ingegneria elettronica, telecomunicazioni e tecnologie dell'informazione, 34 Ciclo. DOI 10.48676/unibo/amsdottorato/9891.
Documenti full-text disponibili:
[img] Documento PDF (English) - Richiede un lettore di PDF come Xpdf o Adobe Acrobat Reader
Disponibile con Licenza: Salvo eventuali più ampie autorizzazioni dell'autore, la tesi può essere liberamente consultata e può essere effettuato il salvataggio e la stampa di una copia per fini strettamente personali di studio, di ricerca e di insegnamento, con espresso divieto di qualunque utilizzo direttamente o indirettamente commerciale. Ogni altro diritto sul materiale è riservato.
Download (12MB)

Abstract

Modern scientific discoveries are driven by an unsatisfiable demand for computational resources. High-Performance Computing (HPC) systems are an aggregation of computing power to deliver considerably higher performance than one typical desktop computer can provide, to solve large problems in science, engineering, or business. An HPC room in the datacenter is a complex controlled environment that hosts thousands of computing nodes that consume electrical power in the range of megawatts, which gets completely transformed into heat. Although a datacenter contains sophisticated cooling systems, our studies indicate quantitative evidence of thermal bottlenecks in real-life production workload, showing the presence of significant spatial and temporal thermal and power heterogeneity. Therefore minor thermal issues/anomalies can potentially start a chain of events that leads to an unbalance between the amount of heat generated by the computing nodes and the heat removed by the cooling system originating thermal hazards. Although thermal anomalies are rare events, anomaly detection/prediction in time is vital to avoid IT and facility equipment damage and outage of the datacenter, with severe societal and business losses. For this reason, automated approaches to detect thermal anomalies in datacenters have considerable potential. This thesis analyzed and characterized the power and thermal characteristics of a Tier0 datacenter (CINECA) during production and under abnormal thermal conditions. Then, a Deep Learning (DL)-powered thermal hazard prediction framework is proposed. The proposed models are validated against real thermal hazard events reported for the studied HPC cluster while in production. This thesis is the first empirical study of thermal anomaly detection and prediction techniques of a real large-scale HPC system to the best of my knowledge. For this thesis, I used a large-scale dataset, monitoring data of tens of thousands of sensors for around 24 months with a data collection rate of around 20 seconds.

Abstract
Tipologia del documento
Tesi di dottorato
Autore
Seyedkazemi Ardebili, Mohsen
Supervisore
Co-supervisore
Dottorato di ricerca
Ciclo
34
Coordinatore
Settore disciplinare
Settore concorsuale
Parole chiave
Datacenter, High Performance Computing Systems, HPC, Anomaly Detection, Anomaly Prediction, Thermal Hazard, Deep Learning, TCN, LSTM
URN:NBN
DOI
10.48676/unibo/amsdottorato/9891
Data di discussione
12 Luglio 2022
URI

Altri metadati

Statistica sui download

Gestione del documento: Visualizza la tesi

^