Machine learning for software engineering

Balla, Stefano (2025) Machine learning for software engineering, [Dissertation thesis], Alma Mater Studiorum Università di Bologna. PhD programme in Computer science and engineering, 37th cycle.
Full-text documents available:
Balla_Stefano_thesis.pdf: PDF document (English) - Restricted access until 1 January 2026
Available under licence: Creative Commons Attribution 4.0 (CC BY 4.0).

Abstract

The explosive growth of open-source repositories creates both opportunities and challenges for Machine Learning for Software Engineering (ML4SE). Current methods struggle with three problems: (i) code is frequently reformatted or minified, which obscures stylistic signals; (ii) repository recommendation lacks standardised benchmarks; and (iii) methods must scale to billions of files in archives such as Software Heritage.

Objectives. This thesis aims to (1) develop an authorship-attribution technique resilient to common code transformations, (2) conduct the first large-scale systematic mapping of repository-recommendation research, and (3) design a multi-label classifier that operates at archive scale.

Methods. A language-agnostic stylometric representation based on Concrete Syntax Tree (CST) path-contexts is introduced and contrasted with representations based on Abstract Syntax Trees (ASTs). A systematic mapping study screens over 1 700 papers and distils 43 primary studies, revealing gaps in benchmark standardisation and scalability. To address these gaps, DRAGON is proposed: a sentence-pair BERT model with focal loss and adaptive thresholding, trained on 825 k repositories and 239 GitRanking topics.

Results. On untransformed code, CST-based stylometry lifts top-1 author-recognition accuracy from 51 % to 68 %, a gain of 17 percentage points over AST baselines. After formatting or minification, recognition falls for both representations, yet CST still leads, which shows how little privacy such transformations afford. DRAGON raises F1@5 by 11 % over prior work and maintains this performance even when 34 % of projects lack a README. All datasets, model checkpoints, and evaluation scripts are released under permissive licences.

Contributions. (i) A transformation-resilient stylometry pipeline; (ii) the largest systematic map of repository-recommendation research; (iii) the first repository classifier evaluated on Software-Heritage-scale data; and (iv) practical guidelines for representation, scale-aware engineering, and responsible deployment.

Impact. The findings enable accurate topic tagging, strengthen forensic analysis, and guide ML4SE systems that keep pace with open-source ecosystem growth.
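
The abstract names focal loss as part of DRAGON's training objective but does not give its formulation. As a minimal sketch only, the block below shows the standard focal loss applied independently to each label of a multi-label prediction. The function names and the defaults `gamma=2.0` and `alpha=0.25` are assumptions chosen for illustration, not values taken from the thesis.

```python
import math


def binary_focal_loss(p, y, gamma=2.0, alpha=0.25):
    """Focal loss for a single label.

    p: predicted probability that the label applies (between 0 and 1).
    y: ground truth, 1 if the label applies, otherwise 0.
    gamma, alpha: illustrative defaults, not the thesis's settings.
    """
    # Probability the model assigned to the correct outcome.
    p_t = p if y == 1 else 1.0 - p
    # Class-balancing weight: alpha for positives, 1 - alpha for negatives.
    alpha_t = alpha if y == 1 else 1.0 - alpha
    # Clamp to avoid log(0).
    p_t = min(max(p_t, 1e-12), 1.0)
    # The factor (1 - p_t) ** gamma shrinks the loss of well-classified labels.
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


def multilabel_focal_loss(probs, targets, gamma=2.0, alpha=0.25):
    """Mean focal loss over all labels of one repository (hypothetical helper)."""
    losses = [binary_focal_loss(p, y, gamma, alpha) for p, y in zip(probs, targets)]
    return sum(losses) / len(losses)
```

With `gamma=0` and `alpha=0.5` the expression reduces to half the ordinary binary cross-entropy, so any effect of the loss comes from the two extra factors. When only a few of many topics apply to a repository, the modulating factor keeps the many easy negative labels from dominating the average.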

Document type: Doctoral thesis
Author: Balla, Stefano
Supervisor:
Co-supervisor:
PhD programme:
Cycle: 37
Coordinator:
Disciplinary sector:
Competition sector:
Keywords: Machine Learning, Artificial Intelligence, Software Engineering, Software Heritage, authorship attribution, topic tagging, repository classification, software archives, privacy
Defence date: 23 October 2025
URI:
