Deep learning and embeddings for problems of computational biology

Manfredi, Matteo (2023) Deep learning and embeddings for problems of computational biology, [Dissertation thesis], Alma Mater Studiorum Università di Bologna. Dottorato di ricerca in Scienze biotecnologiche, biocomputazionali, farmaceutiche e farmacologiche, 35 Ciclo. DOI 10.48676/unibo/amsdottorato/10884.
Documenti full-text disponibili:
[img] Documento PDF (English) - Richiede un lettore di PDF come Xpdf o Adobe Acrobat Reader
Disponibile con Licenza: Creative Commons Attribution Non-commercial No Derivatives 4.0 (CC BY-NC-ND 4.0) .
Download (5MB)


The development of Next Generation Sequencing promotes Biology in the Big Data era. The ever-increasing gap between proteins with known sequences and those with a complete functional annotation requires computational methods for automatic structure and functional annotation. My research has been focusing on proteins and led so far to the development of three novel tools, DeepREx, E-SNPs&GO and ISPRED-SEQ, based on Machine and Deep Learning approaches. DeepREx computes the solvent exposure of residues in a protein chain. This problem is relevant for the definition of structural constraints regarding the possible folding of the protein. DeepREx exploits Long Short-Term Memory layers to capture residue-level interactions between positions distant in the sequence, achieving state-of-the-art performances. With DeepRex, I conducted a large-scale analysis investigating the relationship between solvent exposure of a residue and its probability to be pathogenic upon mutation. E-SNPs&GO predicts the pathogenicity of a Single Residue Variation. Variations occurring on a protein sequence can have different effects, possibly leading to the onset of diseases. E-SNPs&GO exploits protein embeddings generated by two novel Protein Language Models (PLMs), as well as a new way of representing functional information coming from the Gene Ontology. The method achieves state-of-the-art performances and is extremely time-efficient when compared to traditional approaches. ISPRED-SEQ predicts the presence of Protein-Protein Interaction sites in a protein sequence. Knowing how a protein interacts with other molecules is crucial for accurate functional characterization. ISPRED-SEQ exploits a convolutional layer to parse local context after embedding the protein sequence with two novel PLMs, greatly surpassing the current state-of-the-art. All methods are published in international journals and are available as user-friendly web servers. They have been developed keeping in mind standard guidelines for FAIRness (FAIR: Findable, Accessible, Interoperable, Reusable) and are integrated into the public collection of tools provided by ELIXIR, the European infrastructure for Bioinformatics.

Tipologia del documento
Tesi di dottorato
Manfredi, Matteo
Dottorato di ricerca
Settore disciplinare
Settore concorsuale
Parole chiave
Deep Learning, Machine Learning, Protein Embeddings, Protein Language Models, Accessible Surface Area, Single Residue Variations, Pathogenicity, Protein-Protein Interaction
Data di discussione
23 Giugno 2023

Altri metadati

Statistica sui download

Gestione del documento: Visualizza la tesi