Protein functional annotation with embeddings and computational approach

Vazzana, Gabriele (2026) Protein functional annotation with embeddings and computational approach, [Dissertation thesis], Alma Mater Studiorum Università di Bologna. PhD programme in Archeologia, 38th cycle. DOI 10.48676/unibo/amsdottorato/12612.

Full-text documents available:

PDF document (English) - Requires a PDF reader such as Xpdf or Adobe Acrobat Reader
Available under licence: Unless the author has granted broader permissions, the thesis may be freely consulted, and one copy may be saved and printed strictly for personal study, research and teaching purposes; any direct or indirect commercial use is expressly prohibited. All other rights to the material are reserved.
Download (24MB)

Abstract

The emergence of Next Generation Sequencing technologies has created an urgent need for reliable and scalable methods of protein functional annotation. Recent advances in machine and deep learning, including protein language models (PLMs), offer unprecedented opportunities to develop methods that generalize the information collected in databases to annotate uncharacterized proteins. My PhD research leveraged deep learning to enhance annotation strategies, with a focus on the Glutathione S-transferase (GST) superfamily, a multifunctional and hard-to-annotate enzyme group. First, I tested the potential of PLM-based protein sequence encodings (embeddings) for functional annotation. Using an alignment algorithm designed for embeddings, I demonstrated that they capture structural information and enable accurate classification of GSTs. Furthermore, I characterized new multifunctional traits of GSTs, providing computational evidence that canonical GSTs can bind RNA, as suggested by recent large-scale studies. By applying deep learning methods and validating the results with molecular docking, I showed that GST–RNA interactions are theoretically possible and proposed that they occur at the glutathione binding site. Because deep learning procedures drive modern protein structure modeling, even in low-homology scenarios, comparing these procedures is an active research field. I analyzed protein structure models from the Alpha&ESMhFold database, which collects AlphaFold2 and ESMFold predicted models of human proteins. By mapping Pfam domains, I found that functionally relevant regions are consistently well predicted by both methods, even when the global structures diverge. During my period abroad in Barcelona, I analyzed the evolutionary information captured by embeddings and found that, given a dataset of remote homologs, the embedding vectors of residues that are aligned in a multiple structural alignment cluster together.
Overall, these studies show that embeddings and structural predictions can enhance the annotation of challenging protein families, reveal novel functional roles, and facilitate the integration of large-scale data into annotation pipelines.

Document type

Doctoral thesis

Author

Vazzana, Gabriele

Supervisor

Martelli, Pier Luigi

Co-supervisors

Casadio, Rita; Savojardo, Castrense

PhD programme

Archeologia

Cycle

38

Coordinator

Bolognesi, Maria Laura

Scientific-disciplinary sector

Area 05 - Biological sciences > BIO/10 Biochemistry

Competition sector

Area 05 - Biological sciences > 05/E - Experimental and clinical biochemistry and molecular biology > 05/E1 General biochemistry and clinical biochemistry

Keywords

Protein Functional Annotation, Machine Learning, Protein Language Models, Glutathione S-transferases, Embedding-based Alignment, RNA-binding proteins, Multifunctional proteins, Molecular Docking, Pfam domains, Enzyme Active site, AlphaFold2, ESMFold

DOI

10.48676/unibo/amsdottorato/12612

Defence date

18 March 2026

URI

https://amsdottorato.unibo.it/id/eprint/12612
