A clustering method for robust and reliable large scale functional and structural protein sequence annotation

Piovesan, Damiano (2013) A clustering method for robust and reliable large scale functional and structural protein sequence annotation, [Dissertation thesis], Alma Mater Studiorum Università di Bologna. PhD programme in Biotecnologie, farmacologia e tossicologia: project no. 1 "Biotecnologie cellulari e molecolari", 25th cycle. DOI 10.6092/unibo/amsdottorato/5627.


Over the last few decades, bioinformatics has played a fundamental role in making sense of the huge amount of data produced. Once the complete sequence of a genome has been obtained, the major problem is to learn as much as possible about its coding regions. Protein sequence annotation is challenging and, due to the size of the problem, only computational approaches can provide a feasible solution. As recently pointed out by the Critical Assessment of Function Annotations (CAFA), the most accurate methods are those based on the transfer-by-homology approach, and the most significant contribution comes from cross-genome comparisons. This thesis describes a non-hierarchical sequence clustering method for automatic large-scale protein annotation, called “The Bologna Annotation Resource Plus” (BAR+). The method is based on an all-against-all alignment of more than 13 million protein sequences, subject to a very stringent metric. BAR+ can safely transfer functional features (Gene Ontology and Pfam terms) within clusters by means of statistical validation, even in the case of multi-domain proteins. Within BAR+ clusters it is also possible to transfer the three-dimensional structure (when a template is available). This is made possible by cluster-specific HMM profiles, which can be used to compute reliable template-to-target alignments even for distantly related proteins (sequence identity < 30%). Other BAR+-based applications were developed during my doctorate, including the prediction of magnesium binding sites in human proteins, the classification of the ABC transporter superfamily, and the functional prediction (GO terms) of the CAFA targets. Remarkably, in the CAFA assessment, BAR+ placed among the ten most accurate methods. BAR+ is currently freely available as a web server for functional and structural protein sequence annotation at http://bar.biocomp.unibo.it/bar2.0.
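As a rough illustration of what a non-hierarchical clustering built on all-against-all alignments can look like, the sketch below groups sequences into the connected components of a graph whose edges are pairwise alignments that pass a stringent filter. It is a minimal sketch under stated assumptions, not the BAR+ implementation: the function name `cluster_sequences`, the `(id_a, id_b, identity, coverage)` tuple layout, and the two threshold values are all hypothetical, since the abstract does not specify the metric.

```python
from collections import defaultdict


def cluster_sequences(ids, alignments, min_identity=0.40, min_coverage=0.90):
    """Group sequences into connected components of the similarity graph.

    `alignments` holds (id_a, id_b, identity, coverage) tuples from an
    all-against-all comparison. The thresholds are illustrative assumptions,
    not values taken from the abstract.
    """
    # Union-find forest: every sequence starts as its own cluster.
    parent = {seq_id: seq_id for seq_id in ids}

    def find(x):
        # Follow parent links to the root, halving the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b, identity, coverage in alignments:
        # Only alignments passing BOTH thresholds link two sequences.
        if identity >= min_identity and coverage >= min_coverage:
            root_a, root_b = find(a), find(b)
            if root_a != root_b:
                parent[root_b] = root_a

    clusters = defaultdict(set)
    for seq_id in ids:
        clusters[find(seq_id)].add(seq_id)
    # Sort members and clusters so the result is deterministic.
    return sorted((sorted(c) for c in clusters.values()), key=lambda c: c[0])


# Toy input: P3-P4 is highly identical but covers too little of the sequences.
example_ids = ["P1", "P2", "P3", "P4"]
example_alignments = [
    ("P1", "P2", 0.55, 0.95),
    ("P2", "P3", 0.45, 0.92),
    ("P3", "P4", 0.80, 0.50),
]
example_clusters = cluster_sequences(example_ids, example_alignments)
print(example_clusters)
```

In this toy input, P1 and P3 end up in the same cluster through P2 even though they were never aligned directly, while P4 stays a singleton because its only alignment fails the coverage filter. Requiring high coverage in addition to identity is one plausible way to keep multi-domain proteins from being linked through a single shared domain.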

Document type
Doctoral thesis
Author
Piovesan, Damiano
PhD programme
Doctoral school
Scienze biologiche, biomediche e biotecnologiche
Scientific disciplinary sector
Competition sector
Keywords
Bioinformatics, protein annotation, function prediction, sequence clustering, Hidden Markov Model, cross-genome comparisons, protein families
Defence date
18 April 2013
