Machine learning methodologies for supporting HPC systems operations

Molan, Martin (2025) Machine learning methodologies for supporting HPC systems operations, [Dissertation thesis], Alma Mater Studiorum Università di Bologna. PhD programme in Data science and computation, 36th cycle. DOI 10.48676/unibo/amsdottorato/11873.
Full-text documents available:
PDF document (English)
License: Unless the author has granted broader permissions, the thesis may be freely consulted, and one copy may be saved and printed strictly for personal study, research, and teaching purposes; any directly or indirectly commercial use is expressly prohibited. All other rights to the material are reserved.

Abstract

The growing size and complexity of modern high-performance computing (HPC) systems demand advanced data collection, monitoring, and machine learning methodologies for effective management and operations, collectively referred to in the literature as operational data analytics (ODA). This thesis introduces a comprehensive ODA framework that addresses three key challenges: open-ended data exploration, unsupervised anomaly detection, and long-term anomaly prediction. The first component of the framework is the data exploration model (DEM), a methodology for open-ended data exploration and analysis. DEM forms the foundation of the ODA framework: because it requires no structured or labeled data, it is well suited as the first machine-learning model deployed on an HPC system. It provides operational insights, helping administrators and stakeholders identify metrics for further analysis with specialized machine-learning models. The second component is RUAD (Recurrent Unsupervised Anomaly Detection), a novel model that addresses the limitations of current state-of-the-art anomaly detection methods. Unlike traditional approaches, which require labeled data or perform poorly in unsupervised settings, RUAD outperforms all previous state-of-the-art semi-supervised and unsupervised techniques. RUAD achieves an AUC of 0.763 with semi-supervised training and 0.767 with unsupervised training, surpassing the state-of-the-art method (AUC 0.747 semi-supervised, 0.734 unsupervised). It also significantly outperforms clustering-based unsupervised anomaly detection (AUC 0.548). The third component, GRAAFE (GRaph anomaly anticipation framework), extends anomaly detection to anomaly prediction using graph neural networks (GNNs). The physical layout of compute nodes in a compute room is modeled as a graph, with compute nodes as vertices and edges representing the physical distances between them. By leveraging spatial information ignored by per-node models, GRAAFE's GNN surpasses state-of-the-art anomaly prediction methods, achieving AUCs ranging from 0.91 to 0.78, compared to 0.64 to 0.5 for existing approaches. GRAAFE also pioneers long-term (over eight hours ahead) node failure predictions for high-performance computing systems.
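
The graph representation described in the abstract (compute nodes as vertices, edges carrying physical distances) can be illustrated with a minimal sketch. This is not the thesis's implementation: the function name `build_distance_graph`, the 2D room coordinates, and the `max_distance` cutoff used to decide which pairs are connected are all assumptions made for illustration only.

```python
import math


def build_distance_graph(positions, max_distance):
    """Return undirected edges weighted by Euclidean distance.

    positions: dict mapping a node name to hypothetical (x, y) room coordinates.
    max_distance: hypothetical cutoff; pairs farther apart get no edge.
    """
    names = sorted(positions)
    edges = {}
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            d = math.dist(positions[a], positions[b])
            if d <= max_distance:
                # Key is an ordered pair so each undirected edge appears once.
                edges[(a, b)] = d
    return edges


# Made-up coordinates for three compute nodes (not real data).
positions = {
    "node0": (0.0, 0.0),
    "node1": (0.0, 1.0),
    "node2": (3.0, 4.0),
}
edges = build_distance_graph(positions, max_distance=2.0)
print(edges)
```

With these made-up coordinates only `node0` and `node1` are close enough to be connected. A graph neural network would then consume such an edge list together with per-node monitoring features; how the thesis selects and weights edges is not stated in the abstract.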

Document type
Doctoral thesis
Author
Molan, Martin
Supervisor
Co-supervisor
PhD programme
Data science and computation
Cycle
36
Coordinator
Disciplinary sector
Competition sector
Keywords
Machine Learning, Operational Data Analytics (ODA), High-Performance Computing Systems (HPC), Anomaly Detection, Anomaly Prediction, Data Exploration, Unsupervised Learning, Predictive Models, Graph Methodologies, Graph Neural Networks, Self-Supervised Learning
DOI
10.48676/unibo/amsdottorato/11873
Defence date
26 March 2025
URI
