Machine learning methodologies for supporting HPC systems operations

Molan, Martin (2025) Machine learning methodologies for supporting HPC systems operations, [Dissertation thesis], Alma Mater Studiorum Università di Bologna. PhD programme in Data science and computation, 36th cycle. DOI 10.48676/unibo/amsdottorato/11873.

Full-text documents available:

PDF document (English, 14MB)
Available under license: Unless the author grants broader permissions, the thesis may be freely consulted, and one copy may be saved and printed strictly for personal study, research, and teaching purposes. Any directly or indirectly commercial use is expressly prohibited. All other rights to the material are reserved.

Abstract

The growing size and complexity of modern high-performance computing (HPC) systems demand advanced data collection, monitoring, and machine learning methodologies for effective management and operations, collectively referred to in the literature as operational data analytics (ODA). This thesis introduces a comprehensive ODA framework that addresses three key challenges: open-ended data exploration, unsupervised anomaly detection, and long-term anomaly prediction. The first component of the framework is the data exploration model (DEM), a methodology for open-ended data exploration and analysis. DEM forms the foundation of the ODA framework: because it requires no structured or labeled data, it is well suited as the first machine-learning model for an HPC system. It provides operational insights that help administrators and stakeholders identify metrics for further analysis with specialized machine-learning models. The second component is RUAD (Recurrent Unsupervised Anomaly Detection), a novel model that addresses the limitations of current state-of-the-art anomaly detection methods, which either require labeled data or perform poorly in unsupervised settings. RUAD outperforms all previous state-of-the-art semi-supervised and unsupervised techniques: it achieves an AUC of 0.763 with semi-supervised training and 0.767 with unsupervised training, surpassing the previous state-of-the-art method (AUC 0.747 semi-supervised, 0.734 unsupervised). It also significantly outperforms clustering-based unsupervised anomaly detection (AUC 0.548). The third component, GRAAFE (GRaph anomaly anticipation framework), extends anomaly detection to anomaly prediction using graph neural networks (GNNs). The physical layout of compute nodes in a compute room is modeled as a graph, with compute nodes as vertices and edges representing the physical distances between them. By leveraging spatial information ignored by per-node models, GRAAFE's GNN surpasses state-of-the-art anomaly prediction methods, achieving AUCs ranging from 0.91 to 0.78, compared to 0.64 to 0.5 for existing approaches. GRAAFE also pioneers long-term (over eight hours ahead) node failure prediction for HPC systems.
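
The graph modeling described in the abstract (compute nodes as vertices, edges carrying physical distances) can be illustrated with a minimal sketch. This is not the thesis implementation: the node names, the coordinates, the `build_layout_graph` helper, and the `max_distance` cutoff used to decide which pairs are connected are all assumptions made for illustration only.

```python
import math
from itertools import combinations


def build_layout_graph(positions, max_distance):
    """Build an undirected, distance-weighted graph from node coordinates.

    positions: dict mapping a node name to its (x, y) position in the
        compute room (hypothetical coordinates, arbitrary units).
    max_distance: pairs farther apart than this are left unconnected
        (an illustrative assumption, not taken from the thesis).
    Returns an adjacency dict {node: {neighbour: distance}}.
    """
    graph = {node: {} for node in positions}
    for a, b in combinations(positions, 2):
        distance = math.dist(positions[a], positions[b])
        if distance <= max_distance:
            # Store the edge in both directions: the graph is undirected.
            graph[a][b] = distance
            graph[b][a] = distance
    return graph


# Hypothetical layout: three nodes in a row and one farther away.
positions = {
    "node-0": (0.0, 0.0),
    "node-1": (1.0, 0.0),
    "node-2": (2.0, 0.0),
    "node-3": (0.0, 3.0),
}
graph = build_layout_graph(positions, max_distance=1.5)

for node, neighbours in graph.items():
    print(node, neighbours)
```

An adjacency structure of this kind is the sort of input a GNN library would consume as an edge list with edge weights; how GRAAFE actually constructs and feeds its graph is described in the thesis itself.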

Document type

Doctoral thesis

Author

Molan, Martin

Supervisor

Bartolini, Andrea

Co-supervisor

Benini, Luca

PhD programme

Data science and computation

Cycle

36

Coordinator

Bonacorsi, Daniele

Scientific-disciplinary sector

Area 09 - Industrial and information engineering > ING-INF/05 Information processing systems

Competition sector

Area 09 - Industrial and information engineering > 09/H - Computer engineering > 09/H1 Information processing systems

Keywords

Machine Learning, Operational Data Analytics (ODA), High-Performance Computing Systems (HPC), Anomaly Detection, Anomaly Prediction, Data Exploration, Unsupervised Learning, Predictive Models, Graph Methodologies, Graph Neural Networks, Self-Supervised Learning

DOI

10.48676/unibo/amsdottorato/11873

Defence date

26 March 2025

URI

https://amsdottorato.unibo.it/id/eprint/11873
