Montanucci, Ludovica
(2008)
Computational methods for genome screening, [Dissertation thesis], Alma Mater Studiorum Università di Bologna.
PhD programme in
Physics, 20th cycle.
Abstract
Motivation A current issue of great interest, from both a theoretical and an
applied perspective, is the analysis of biological sequences to uncover the information
they encode. The development of new genome sequencing technologies
in recent years has opened new fundamental problems, since huge amounts of biological
data still await interpretation. Indeed, sequencing is only the first step
of the genome annotation process, which consists in assigning biological information
to each sequence. Hence, given the large amount of available data, in silico
methods have become both useful and necessary for extracting relevant information from
sequences. The availability of data from genome projects has given rise to new strategies
for tackling the basic problems of computational biology, such as determining
the three-dimensional structures of proteins, their biological functions and their mutual
interactions.
Results The aim of this work was to implement predictive methods
that extract information on the properties of genomes and proteins
from nucleotide and amino acid sequences, taking advantage of the
information provided by comparing genome sequences from different species.
The first part of the work describes a comprehensive large-scale genome comparison of 599
organisms. A total of 2.6 million sequences from 551 prokaryotic and 48
eukaryotic genomes were aligned and clustered on the basis of their sequence identity.
This procedure led to the identification of classes of proteins that are specific to
different groups of organisms. Moreover, the adopted similarity threshold produced
clusters that are structurally homogeneous and that can be used
for the structural annotation of uncharacterized sequences.
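As a rough illustration of clustering sequences by identity, the sketch below groups pre-aligned sequences whose pairwise identity reaches a threshold. It is not the procedure used in the thesis: the function names, the single-linkage rule, the toy sequences and the 50% threshold are all assumptions made for this example, and the abstract does not state the threshold actually adopted.

```python
from itertools import combinations


def percent_identity(a, b):
    """Percent identity of two pre-aligned sequences ('-' marks a gap).

    Assumption: both strings come from the same alignment, so they
    have equal length. Columns gapped in both sequences are ignored.
    """
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    columns = [(x, y) for x, y in zip(a, b) if not (x == "-" and y == "-")]
    if not columns:
        return 0.0
    matches = sum(1 for x, y in columns if x == y and x != "-")
    return 100.0 * matches / len(columns)


def cluster_by_identity(seqs, threshold):
    """Single-linkage clustering: link any pair at or above the threshold.

    seqs maps a (hypothetical) sequence name to its aligned string.
    Returns a list of sets of names.
    """
    parent = {name: name for name in seqs}

    def find(x):
        # Follow parent links to the cluster representative.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in combinations(seqs, 2):
        if percent_identity(seqs[a], seqs[b]) >= threshold:
            parent[find(a)] = find(b)

    clusters = {}
    for name in seqs:
        clusters.setdefault(find(name), set()).add(name)
    return list(clusters.values())


# Toy input, invented for illustration only.
toy = {"p1": "MKTAYIAK", "p2": "MKTAYIAR", "p3": "GGSWQLNP"}
clusters = cluster_by_identity(toy, threshold=50.0)
```

The all-against-all loop is quadratic in the number of sequences, so a comparison at the scale of millions of sequences would need a far more efficient search strategy than this sketch.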
The second part of the work focuses on the characterization of thermostable proteins
and on the development of tools to predict the thermostability of a protein
from its sequence. The codon composition of a non-redundant database comprising
116 prokaryotic genomes was analyzed by means of Principal Component Analysis,
showing that a cross-genomic approach allows the
extraction of common determinants of thermostability at the genome level, leading
to an overall accuracy of 95% in discriminating thermophilic coding sequences.
This result outperforms those obtained in previous studies. Moreover, we investigated
the effect of multiple mutations on protein thermostability. This issue is of great importance
in protein engineering, since thermostable proteins are generally
more suitable than their mesostable counterparts for technological applications. A Support
Vector Machine-based method was trained to predict whether a set of mutations
can enhance the thermostability of a given protein sequence. The resulting predictor
achieves 88% accuracy.
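To make "codon composition" concrete, the sketch below turns a coding sequence into a 64-dimensional vector of relative codon frequencies. Stacked as rows of a matrix, such vectors are the kind of input a Principal Component Analysis could take. The function name, the codon ordering and the handling of ambiguous bases are assumptions made for this example; the abstract does not specify the features or preprocessing used in the thesis.

```python
from collections import Counter
from itertools import product

# All 64 codons in a fixed order, so every vector has the same layout.
CODONS = ["".join(c) for c in product("ACGT", repeat=3)]


def codon_frequencies(cds):
    """Relative codon frequencies of a coding sequence, read in frame 0.

    Assumptions: the sequence starts in frame, trailing bases that do
    not fill a codon are dropped, and codons containing anything other
    than A, C, G or T are ignored.
    """
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3
    counts = Counter(cds[i:i + 3] for i in range(0, usable, 3))
    total = sum(counts[c] for c in CODONS)
    if total == 0:
        return [0.0] * len(CODONS)
    return [counts[c] / total for c in CODONS]


# Toy sequence, invented for illustration: ATG AAA AAA TAA
vec = codon_frequencies("ATGAAAAAATAA")
```

Here `vec[CODONS.index("AAA")]` is 0.5, since two of the four codons are AAA.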
Document type
Doctoral thesis
Author
Montanucci, Ludovica
Supervisor
Co-supervisor
PhD programme
Cycle
20
Coordinator
Disciplinary sector
Competition sector
Keywords
genome comparison, alignments, thermostability, support vector machines, principal component analysis
URN:NBN
Defence date
12 June 2008
URI