Protein functional annotation with embeddings and computational approach

Vazzana, Gabriele (2026) Protein functional annotation with embeddings and computational approach, [Dissertation thesis], Alma Mater Studiorum Università di Bologna. PhD programme in Archaeology, 38th cycle. DOI 10.48676/unibo/amsdottorato/12612.
Full-text documents available:
PDF document (English): vazzana_gabriele_tesi.pdf (24 MB)
License: Unless the author has granted broader permissions, the thesis may be freely consulted, and one copy may be saved and printed strictly for personal purposes of study, research and teaching; any direct or indirect commercial use is expressly forbidden. All other rights to the material are reserved.

Abstract

The emergence of Next Generation Sequencing technologies has created an urgent need for reliable and scalable methods of protein functional annotation. At the same time, recent advances in machine and deep learning, including protein language models (PLMs), offer unprecedented opportunities to develop methods that generalize the information collected in databases to the annotation of uncharacterized proteins. My PhD research focused on leveraging deep learning to enhance annotation strategies, with particular attention to the Glutathione S-transferase (GST) superfamily, a multifunctional and hard-to-annotate enzyme group. First, I tested the potential of PLM-based protein sequence encodings (embeddings) for functional annotation. Using an alignment algorithm designed for embeddings, I demonstrated that embeddings capture structural information and enable accurate classification of GSTs. Furthermore, I characterized new multifunctional traits of GSTs, providing computational evidence that canonical GSTs can bind RNA, as suggested by recent large-scale studies. By applying deep learning methods and validating the results with molecular docking, I showed that GST–RNA interactions are theoretically possible, and proposed that these interactions occur at the glutathione-binding site. Because deep learning procedures now drive protein structure modeling, even in low-homology scenarios, comparing the models they produce is an active research field. I analyzed protein structure models from the Alpha&ESMhFold database, which collects AlphaFold2 and ESMFold predicted models of human proteins. By mapping Pfam domains, I found that functionally relevant regions are consistently well predicted by both methods, even when the global structures diverge. During my period abroad in Barcelona, I analyzed the evolutionary information captured by embeddings and found that, given a dataset of remote homologs, the embedding vectors of residues aligned in a multiple structural alignment cluster together.
Overall, these studies show that embeddings and structural predictions can enhance the annotation of challenging protein families, reveal novel functional roles, and facilitate the integration of large-scale data into annotation pipelines.

Document type: Doctoral thesis
Author: Vazzana, Gabriele
Supervisor:
Co-supervisor:
PhD programme:
Cycle: 38
Coordinator:
Scientific-disciplinary sector:
Competition sector:
Keywords: Protein Functional Annotation, Machine Learning, Protein Language Models, Glutathione S-transferases, Embedding-based Alignment, RNA-binding proteins, Multifunctional proteins, Molecular Docking, Pfam domains, Enzyme Active site, AlphaFold2, ESMFold
DOI: 10.48676/unibo/amsdottorato/12612
Defence date: 18 March 2026
URI:
