Deep learning and embeddings for problems of computational biology

Manfredi, Matteo (2023) Deep learning and embeddings for problems of computational biology, [Dissertation thesis], Alma Mater Studiorum Università di Bologna. PhD programme in Scienze biotecnologiche, biocomputazionali, farmaceutiche e farmacologiche, 35th cycle. DOI 10.48676/unibo/amsdottorato/10884.

Full-text documents available:

PDF document (English)
Available under licence: Creative Commons Attribution - NonCommercial - NoDerivatives 4.0 (CC BY-NC-ND 4.0).

Abstract

The development of Next Generation Sequencing has brought biology into the Big Data era. The ever-widening gap between proteins with known sequences and those with a complete functional annotation calls for computational methods for automatic structural and functional annotation. My research focuses on proteins and has so far led to the development of three novel tools, DeepREx, E-SNPs&GO and ISPRED-SEQ, based on Machine and Deep Learning approaches. DeepREx computes the solvent exposure of residues in a protein chain. This problem is relevant for defining structural constraints on the possible folding of the protein. DeepREx exploits Long Short-Term Memory layers to capture residue-level interactions between positions that are distant in the sequence, achieving state-of-the-art performance. With DeepREx, I conducted a large-scale analysis of the relationship between the solvent exposure of a residue and the probability that its mutation is pathogenic. E-SNPs&GO predicts the pathogenicity of a Single Residue Variation. Variations occurring in a protein sequence can have different effects, possibly leading to the onset of disease. E-SNPs&GO exploits protein embeddings generated by two novel Protein Language Models (PLMs), as well as a new way of representing functional information from the Gene Ontology. The method achieves state-of-the-art performance and is extremely time-efficient compared with traditional approaches. ISPRED-SEQ predicts the presence of Protein-Protein Interaction sites in a protein sequence. Knowing how a protein interacts with other molecules is crucial for accurate functional characterization. ISPRED-SEQ exploits a convolutional layer to parse local context after embedding the protein sequence with two novel PLMs, greatly surpassing the current state of the art. All methods are published in international journals and are available as user-friendly web servers. They have been developed following standard guidelines for FAIRness (FAIR: Findable, Accessible, Interoperable, Reusable) and are integrated into the public collection of tools provided by ELIXIR, the European infrastructure for Bioinformatics.

Document type

Doctoral thesis

Author

Manfredi, Matteo

Supervisor

Martelli, Pier Luigi

PhD programme

Scienze biotecnologiche, biocomputazionali, farmaceutiche e farmacologiche

Cycle

35

Coordinator

Bolognesi, Maria Laura

Disciplinary sector

Area 05 - Biological sciences > BIO/10 Biochemistry

Academic recruitment field

Area 05 - Biological sciences > 05/E - Experimental and clinical biochemistry and molecular biology > 05/E1 General biochemistry and clinical biochemistry

Keywords

Deep Learning, Machine Learning, Protein Embeddings, Protein Language Models, Accessible Surface Area, Single Residue Variations, Pathogenicity, Protein-Protein Interaction

URN:NBN

urn:nbn:it:unibo-29341

DOI

10.48676/unibo/amsdottorato/10884

Defence date

23 June 2023

URI

https://amsdottorato.unibo.it/id/eprint/10884