Trombetti, Gabriele Antonio
(2008)
Enabling computationally intensive bioinformatics applications on the Grid platform, [Dissertation thesis], Alma Mater Studiorum Università di Bologna.
Doctoral programme in Electronics, Computer Science and Telecommunications Engineering, 20th cycle. DOI 10.6092/unibo/amsdottorato/922.
Abstract
Bioinformatics is a recent and emerging discipline which aims at studying
biological problems through computational approaches. Most branches of
bioinformatics, such as Genomics, Proteomics and Molecular Dynamics, are
particularly computationally intensive, requiring huge amounts of
computational resources to run algorithms of ever-increasing complexity
over data of ever-increasing size.
In the search for computational power, the EGEE Grid platform, the world's
largest community of interconnected clusters load-balanced as a whole,
seems particularly promising and is considered the new hope for satisfying
the ever-increasing computational requirements of bioinformatics, as well
as of physics and other computational sciences.
The EGEE platform, however, is rather new and not yet free of problems. In
addition, specific requirements of bioinformatics need to be addressed in
order to use this new platform effectively for bioinformatics tasks.
During my three years of Ph.D. work I addressed numerous aspects of this
Grid platform, with particular attention to those needed by the
bioinformatics domain.
I hence created three major frameworks, Vnas, GridDBManager and
SETest, plus an additional smaller standalone solution, to enhance the
support for bioinformatics applications in the Grid environment and to
reduce the effort needed to create new applications, additionally addressing
numerous existing Grid issues and performing a series of optimizations.
The Vnas framework is an advanced system for the submission and
monitoring of Grid jobs that provides an abstraction layer adding
reliability on top of the Grid platform. In addition, Vnas greatly
simplifies the development of new Grid applications by providing a callback
system for building arbitrarily complex multistage computational pipelines,
and an abstracted virtual sandbox which bypasses Grid limitations. Vnas
also reduces the usage of Grid bandwidth and storage resources by
transparently detecting when virtual sandbox files have identical content,
across different submissions, even when performed by different users.
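As an illustration only, content-based equality detection of sandbox files can be sketched by keying each file on a hash of its bytes. This is not the Vnas implementation: the `ContentStore` class, the `submit_sandbox` function and the choice of SHA-256 are all assumptions made for the example.

```python
import hashlib


class ContentStore:
    """Hypothetical store that keeps one copy of each distinct file content."""

    def __init__(self):
        self._blobs = {}   # digest -> file content
        self.uploads = 0   # how many times content was actually stored

    def put(self, data: bytes) -> str:
        # Files are identified by content, not by name or by submitting user.
        digest = hashlib.sha256(data).hexdigest()
        if digest not in self._blobs:
            self._blobs[digest] = data
            self.uploads += 1
        return digest

    def get(self, digest: str) -> bytes:
        return self._blobs[digest]


def submit_sandbox(store, files):
    """Map each sandbox file name to the digest of its content."""
    return {name: store.put(data) for name, data in files.items()}


store = ContentStore()
first = submit_sandbox(store, {"blast.sh": b"run", "db.fasta": b"ACGT"})
second = submit_sandbox(store, {"other_name.fasta": b"ACGT"})
```

In this sketch the second submission reuses the stored copy of the identical file even though its name differs, so only distinct contents consume storage and transfer.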
BGBlast, an evolution of the earlier project GridBlast, now provides a Grid
Database Manager (GridDBManager) component for managing and
automatically updating biological flat-file databases in the Grid
environment.
GridDBManager offers novel features such as an adaptive replication
algorithm that constantly optimizes the number of replicas of the managed
databases in the Grid environment, balancing response times
(performance) against storage costs according to a programmed cost formula.
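The trade-off can be pictured with a toy cost function. The formula below is an assumption made purely for illustration, not the one used by GridDBManager: it supposes a response-time term that shrinks as replicas are added and a storage term that grows with them. The names `total_cost` and `best_replica_count`, and the weights, are hypothetical.

```python
def total_cost(replicas, request_rate, db_size_gb,
               time_weight=1.0, storage_weight=0.1):
    """Hypothetical cost: response-time term plus storage term."""
    # Assumed model: load is spread evenly, so waiting shrinks with replicas.
    response_term = time_weight * request_rate / replicas
    # Every replica stores a full copy of the database.
    storage_term = storage_weight * db_size_gb * replicas
    return response_term + storage_term


def best_replica_count(request_rate, db_size_gb, max_replicas=50, **weights):
    """Return the replica count in 1..max_replicas with the lowest cost."""
    return min(
        range(1, max_replicas + 1),
        key=lambda n: total_cost(n, request_rate, db_size_gb, **weights),
    )
```

Re-evaluating `best_replica_count` as the observed request rate changes gives the adaptive behaviour in miniature: under this assumed model, a heavily used database is assigned more replicas than a rarely used one of the same size.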
GridDBManager also provides highly optimized, automated management of
older versions of the databases based on reverse delta files, which reduces
the storage costs required to keep such older versions available in the Grid
environment by two orders of magnitude.
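The general idea of a reverse delta is to keep the newest version in full and to store, for each older version, only what is needed to rebuild it from the next newer one. The sketch below shows this on lists of lines using Python's `difflib`; the delta format and the function names are assumptions for illustration and do not reflect the actual GridDBManager file format.

```python
import difflib


def make_reverse_delta(new_lines, old_lines):
    """Record how to turn new_lines back into old_lines.

    Unchanged runs are stored as index ranges into new_lines; only lines
    that exist solely in the old version are stored as text.
    """
    matcher = difflib.SequenceMatcher(a=new_lines, b=old_lines, autojunk=False)
    delta = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            delta.append(("copy", i1, i2))
        elif tag in ("replace", "insert"):
            delta.append(("add", old_lines[j1:j2]))
        # "delete": lines present only in the new version need no entry
    return delta


def apply_reverse_delta(new_lines, delta):
    """Rebuild the older version from the newer one and its reverse delta."""
    old_lines = []
    for op in delta:
        if op[0] == "copy":
            old_lines.extend(new_lines[op[1]:op[2]])
        else:
            old_lines.extend(op[1])
    return old_lines


v1 = [">seq1", "ACGT", ">seq2", "GGCC"]
v2 = [">seq1", "ACGT", ">seq2", "GGCA", ">seq3", "TTAA"]
v3 = v2 + [">seq4", "CCGG"]

# Only v3 is kept in full; v2 and v1 are kept as reverse deltas.
delta_3_to_2 = make_reverse_delta(v3, v2)
delta_2_to_1 = make_reverse_delta(v2, v1)
```

Because successive releases of a flat-file database tend to share most of their content, each delta stays small relative to a full copy, and any older version can be rebuilt by applying the deltas in sequence starting from the newest one.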
The SETest framework gives the user a way to test and regression-test
Python applications that are heavily laden with side effects (a common
case with Grid computational pipelines), which could not easily be tested
using the more standard methods of unit testing or test cases. The
technique is based on a new concept of datasets containing invocations and
results of filtered calls. The framework hence significantly accelerates
the development of new applications and computational pipelines for the
Grid environment, and reduces the effort required for maintenance.
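One way to picture a dataset of invocations and results is a record-and-replay wrapper: on a first run the side-effecting calls are executed and logged, and on later runs the logged results are returned instead. The sketch below is a generic illustration of that idea and does not reproduce the SETest API; `CallDataset`, `filtered`, `start_replay` and the `submit_job` stand-in are hypothetical names.

```python
class CallDataset:
    """Hypothetical dataset of recorded invocations and their results."""

    def __init__(self):
        self.records = []      # (function name, args, kwargs, result)
        self.replaying = False
        self._cursor = 0

    def filtered(self, func):
        """Wrap a side-effecting function so its calls go through the dataset."""
        def wrapper(*args, **kwargs):
            if not self.replaying:
                # Record mode: perform the real call and log it.
                result = func(*args, **kwargs)
                self.records.append((func.__name__, args, kwargs, result))
                return result
            # Replay mode: check the call matches the log, skip the side effect.
            name, r_args, r_kwargs, result = self.records[self._cursor]
            if (name, r_args, r_kwargs) != (func.__name__, args, kwargs):
                raise AssertionError(f"unexpected call to {func.__name__}")
            self._cursor += 1
            return result
        return wrapper

    def start_replay(self):
        self.replaying = True
        self._cursor = 0


dataset = CallDataset()
real_calls = []


@dataset.filtered
def submit_job(script):
    real_calls.append(script)          # stands in for a real Grid submission
    return f"job-{len(real_calls)}"


def pipeline():
    first = submit_job("align.sh")
    second = submit_job("merge.sh")
    return [first, second]


recorded = pipeline()      # first run: real calls are made and recorded
dataset.start_replay()
replayed = pipeline()      # second run: results come from the dataset
```

In replay mode the pipeline logic is exercised without touching the Grid, and a change in the sequence or arguments of the filtered calls surfaces as a regression.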
An analysis of the impact of these solutions will be provided in this thesis.
This Ph.D. work resulted in various publications in journals and conference
proceedings, as reported in the Appendix. I also presented my work orally
at numerous international conferences related to Grid and bioinformatics.
Document type
Doctoral thesis
Author
Trombetti, Gabriele Antonio
Supervisor
Co-supervisor
Doctoral programme
Cycle
20
Coordinator
Disciplinary sector
Competition sector
Keywords
grid bioinformatics adaptive replication optimization regression testing
URN:NBN
DOI
10.6092/unibo/amsdottorato/922
Defence date
7 April 2008
URI