Machine learning for software engineering

Balla, Stefano (2025) Machine learning for software engineering, [Dissertation thesis], Alma Mater Studiorum Università di Bologna. PhD programme in Computer science and engineering, 37th cycle. DOI 10.48676/unibo/amsdottorato/12493.

Full-text documents available:

PDF document (English)
Available under licence: Creative Commons Attribution 4.0 (CC BY 4.0).

Abstract

The explosive growth of open-source repositories creates both opportunities and challenges for Machine Learning for Software Engineering (ML4SE). Current methods struggle with three problems: (i) code that is frequently reformatted or minified, which obscures stylistic signals; (ii) the lack of standardised benchmarks for repository recommendation; and (iii) the need to scale to billions of files in archives such as Software Heritage.
Objectives. This thesis aims to (1) develop an authorship-attribution technique resilient to common code transformations, (2) conduct the first large-scale systematic mapping of repository-recommendation research, and (3) design a multi-label classifier that operates at archive scale.
Methods. A language-agnostic stylometric representation based on Concrete Syntax Tree (CST) path-contexts is introduced and contrasted with representations based on Abstract Syntax Trees (ASTs). A systematic mapping study screens over 1 700 papers and distils 43 primary studies, revealing gaps in benchmark standardisation and scalability. To address these gaps, the thesis proposes DRAGON, a sentence-pair BERT model with focal loss and adaptive thresholding, trained on 825 k repositories and 239 GitRanking topics.
Results. On untransformed code, CST-based stylometry lifts top-1 author-recognition accuracy from 51 % to 68 %, a gain of 17 percentage points over AST baselines. After formatting or minification, recognition accuracy falls for both representations, yet CST still leads, which shows how little privacy such transformations afford. DRAGON raises F1@5 by 11 % over prior work and maintains this performance even when 34 % of projects lack a README. All datasets, model checkpoints, and evaluation scripts are released under permissive licences.
Contributions. (i) A transformation-resilient stylometry pipeline; (ii) the largest systematic map of repository-recommendation research; (iii) the first repository classifier evaluated on Software-Heritage-scale data; and (iv) practical guidelines for representation, scale-aware engineering, and responsible deployment.
Impact. The findings enable accurate topic tagging, strengthen forensic analysis, and guide ML4SE systems that keep pace with open-source ecosystem growth.
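
As a rough illustration of two terms used in the abstract, the sketch below computes a binary focal loss over the labels of one repository and an F1@k score from the top-k predicted topics. It is not DRAGON's implementation: the abstract does not give the exact loss formulation, so this follows the commonly used focal loss definition, and the function names, the `gamma` and `alpha` defaults, and the toy inputs are all assumptions made for the example.

```python
import math


def focal_loss(probs, targets, gamma=2.0, alpha=0.25, eps=1e-12):
    """Mean binary focal loss over the labels of one example.

    probs   -- predicted probability per label (hypothetical model output)
    targets -- 1 if the label applies, else 0
    gamma, alpha -- assumed hyperparameters, not values from the thesis
    """
    total = 0.0
    for p, y in zip(probs, targets):
        # p_t is the probability the model assigns to the correct outcome.
        p_t = p if y == 1 else 1.0 - p
        alpha_t = alpha if y == 1 else 1.0 - alpha
        # (1 - p_t) ** gamma down-weights labels that are already easy.
        total += -alpha_t * (1.0 - p_t) ** gamma * math.log(max(p_t, eps))
    return total / len(probs)


def f1_at_k(probs, targets, k=5):
    """F1 computed on the k highest-scoring labels of one example."""
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    top = ranked[:k]
    true = {i for i, y in enumerate(targets) if y == 1}
    hits = len(true.intersection(top))
    if hits == 0:
        return 0.0
    precision = hits / len(top)
    recall = hits / len(true)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Toy scores for four hypothetical topics; labels 0 and 3 are correct.
    scores = [0.9, 0.8, 0.1, 0.2]
    labels = [1, 0, 0, 1]
    print(focal_loss(scores, labels))
    print(f1_at_k(scores, labels, k=2))
```

With `gamma=0` and `alpha=0.5` the loss reduces to half the ordinary binary cross-entropy, which is a convenient sanity check when experimenting with other settings.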

Document type

Doctoral thesis

Author

Balla, Stefano

Supervisor

Gabbrielli, Maurizio

Co-supervisor

Zacchiroli, Stefano

PhD programme

Computer science and engineering

Cycle

37

Coordinator

Bartolini, Ilaria

Scientific disciplinary sector

Area 01 - Mathematics and computer science > INF/01 Computer science

Academic recruitment field

Area 01 - Mathematics and computer science > 01/B - Computer science > 01/B1 Computer science

Keywords

Machine Learning, Artificial Intelligence, Software Engineering, Software Heritage, authorship attribution, topic tagging, repository classification, software archives, privacy

DOI

10.48676/unibo/amsdottorato/12493

Defence date

23 October 2025

URI

https://amsdottorato.unibo.it/id/eprint/12493