Pierleoni, Andrea
(2008)
Design and implementation of bioinformatics tools for large scale genome annotation, [Dissertation thesis], Alma Mater Studiorum Università di Bologna.
PhD programme in
Cellular and Molecular Biotechnology, 20th cycle.
Abstract
The continuous increase in genome sequencing projects has produced a huge amount of data in the
last 10 years: currently more than 600 prokaryotic and 80 eukaryotic genomes are fully sequenced
and publicly available. However, sequencing a genome determines only its raw nucleotide
sequences. This is just the first step of the genome annotation process, which deals with
assigning biological information to each sequence. Annotation is carried out at every level of
the biological information processing mechanism, from DNA to protein, and cannot be accomplished
by in vitro analysis alone, which is extremely expensive and time consuming when applied at this
scale. Thus, in silico methods are needed to accomplish the task.
The aim of this work was the implementation of predictive computational methods to allow a
fast, reliable, and automated annotation of genomes and proteins starting from amino acid sequences.
The first part of the work focused on the implementation of a new machine-learning-based
method for predicting the subcellular localization of soluble eukaryotic proteins. The method,
called BaCelLo, was developed in 2006. Its main distinguishing feature is that it is independent
of the biases present in the training dataset, which cause the over-prediction of the most
represented classes in all the other available predictors developed so far. This result was
achieved through a modification that I made to the standard Support Vector Machine (SVM)
algorithm, creating the so-called Balanced SVM. BaCelLo predicts the most important subcellular
localizations in eukaryotic cells, and three kingdom-specific predictors were implemented. In two
extensive comparisons, carried out in 2006 and 2008, BaCelLo was reported to outperform all the
state-of-the-art methods then available for this prediction task.
BaCelLo was subsequently used to completely annotate 5 eukaryotic genomes, by integrating it
into a pipeline of predictors developed at the Bologna Biocomputing group by Dr. Pier Luigi Martelli and
Dr. Piero Fariselli. An online database, called eSLDB, was developed by integrating, for each
amino acid sequence extracted from the genome, the predicted subcellular localization with
experimental and similarity-based annotations.
In the second part of the work, a new machine-learning-based method was implemented for the
prediction of GPI-anchored proteins. From the raw amino acid sequence, the method predicts both
the presence of the GPI anchor (by means of an SVM) and the position in the sequence of the
post-translational modification event, the so-called ω-site (by means of a Hidden Markov Model
(HMM)). The method, called GPIPE, was reported to greatly improve prediction performance for
GPI-anchored proteins over all previously developed methods. GPIPE was able to predict up to 88%
of the experimentally annotated GPI-anchored proteins while maintaining a false positive rate as
low as 0.1%.
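The two figures quoted above correspond to the standard definitions of sensitivity (the fraction of true GPI-anchored proteins that are recovered) and false positive rate (the fraction of non-anchored proteins wrongly flagged). The minimal sketch below only illustrates how such rates are computed from confusion-matrix counts. The function names and the counts are hypothetical, chosen to reproduce 88% and 0.1%; they are not taken from the thesis.

```python
def sensitivity(true_positives: int, false_negatives: int) -> float:
    """Fraction of actual positives that the predictor recovers."""
    return true_positives / (true_positives + false_negatives)


def false_positive_rate(false_positives: int, true_negatives: int) -> float:
    """Fraction of actual negatives that the predictor wrongly flags."""
    return false_positives / (false_positives + true_negatives)


if __name__ == "__main__":
    # Hypothetical counts, for illustration only (not data from the thesis).
    tp, fn = 88, 12    # annotated GPI-anchored proteins: found vs. missed
    fp, tn = 1, 999    # non-anchored proteins: wrongly flagged vs. rejected
    print(f"sensitivity         = {sensitivity(tp, fn):.1%}")
    print(f"false positive rate = {false_positive_rate(fp, tn):.1%}")
```

Because GPI-anchored proteins are a small fraction of a proteome, a very low false positive rate matters as much as high sensitivity when whole genomes are annotated.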
GPIPE was used to completely annotate 81 eukaryotic genomes, and more than 15000 putative
GPI-anchored proteins were predicted, 561 of which are found in H. sapiens. On average, 1% of a
proteome is predicted to be GPI-anchored. A statistical analysis of the composition of the
regions surrounding the ω-site allowed the definition of specific amino acid abundances in
the different regions considered. Furthermore, the hypothesis, proposed in the literature, that
compositional biases are present among the four major eukaryotic kingdoms was tested and rejected.
All the developed predictors and databases are freely available at:
BaCelLo http://gpcr.biocomp.unibo.it/bacello
eSLDB http://gpcr.biocomp.unibo.it/esldb
GPIPE http://gpcr.biocomp.unibo.it/gpipe
Document type
Doctoral thesis
Author
Pierleoni, Andrea
Supervisor
Co-supervisor
PhD programme
Cycle
20
Coordinator
Scientific disciplinary sector
Competition sector
Keywords
subcellular localization gpi-anchor svm hmm
URN:NBN
Defence date
6 June 2008
URI