Marconi, Daniela
(2008)
New approaches to open problems in gene expression microarray data, [Dissertation thesis], Alma Mater Studiorum Università di Bologna.
PhD programme in
Physics, 20th cycle. DOI 10.6092/unibo/amsdottorato/842.
Abstract
In the past decade, the advent of efficient genome sequencing tools and high-throughput
experimental biotechnology has led to enormous progress in the life sciences. Among
the most important innovations is microarray technology. It makes it possible to quantify the
expression of thousands of genes simultaneously by measuring the hybridization from a
tissue of interest to probes on a small glass or plastic slide. These data are characterized
by a fair amount of random noise, a predictor dimension in the thousands, and
a sample size in the dozens.
One of the most exciting areas to which microarray technology has been applied is
the challenge of deciphering complex diseases such as cancer. In these studies, samples are
taken from two or more groups of individuals with heterogeneous phenotypes, pathologies, or clinical outcomes. These samples are hybridized to microarrays in an effort to find a small number of genes that are strongly correlated with group membership. Even though methods for analysing these data are now well developed and close to
a standard organization (through the efforts of international projects such as the
Microarray Gene Expression Data (MGED) Society [1]), it is not uncommon to encounter
a clinician's question for which no compelling statistical method is
available. The contribution of this dissertation to deciphering disease is
the development of new approaches aimed at handling open problems posed by clinicians
in specific experimental designs.
In Chapter 1, starting from a necessary biological introduction, we review microarray technologies and all the important steps involved in an experiment, from the
production of the array to quality control, ending with the preprocessing steps that are
used in the data analysis in the rest of the dissertation. In Chapter 2, a critical
review of standard analysis methods is provided, stressing most of the problems that
Chapter 3 introduces a method to address the issue of unbalanced design in
microarray experiments. In microarray experiments, experimental design is a crucial
starting point for obtaining reasonable results. In a two-class problem, an equal or
similar number of samples should be collected for the two classes. However, in
some cases, e.g. rare pathologies, the approach to be taken is less evident. We propose
to address this issue by applying a modified version of SAM [2]. MultiSAM consists of
a reiterated application of a SAM analysis, comparing the less populated class (LPC)
with 1,000 random samplings of the same size from the more populated class (MPC). A
list of the differentially expressed genes is generated for each SAM application. After
1,000 reiterations, each single probe is given a "score" ranging from 0 to 1,000 based on its
recurrence in the 1,000 lists as differentially expressed. The performance of MultiSAM
was compared to the performance of SAM and LIMMA [3] over two simulated data
sets generated from beta and exponential distributions. The results of all three algorithms over
low-noise data sets seem acceptable. However, on a real unbalanced two-channel data set
regarding Chronic Lymphocytic Leukemia, LIMMA finds no significant probe, SAM finds
23 significantly changed probes but cannot separate the two classes, while MultiSAM finds 122 probes with score >300 and separates the data into two clusters by hierarchical
clustering. We also report extra-assay validation in terms of differentially expressed
genes. Although standard algorithms perform well over low-noise simulated data sets,
MultiSAM seems to be the only one able to reveal subtle differences in gene expression
profiles on real unbalanced data.
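The resampling-and-scoring idea behind MultiSAM can be illustrated with a minimal Python sketch. This is not the thesis implementation: the function names, the fudge constant `s0`, and the fixed `cutoff` are assumptions made for illustration, and a simplified SAM-style relative difference stands in for a full SAM analysis (which selects significant probes through a permutation-based false discovery rate rather than a fixed threshold).

```python
import numpy as np

def sam_like_stat(a, b, s0=0.1):
    """SAM-style relative difference per probe: (mean_a - mean_b) / (s + s0).

    a, b: arrays of shape (n_probes, n_samples).
    s0 is a small stabilising constant; its value here is an assumption.
    """
    na, nb = a.shape[1], b.shape[1]
    diff = a.mean(axis=1) - b.mean(axis=1)
    # Pooled within-class sum of squares for each probe
    ss = (((a - a.mean(axis=1, keepdims=True)) ** 2).sum(axis=1)
          + ((b - b.mean(axis=1, keepdims=True)) ** 2).sum(axis=1))
    s = np.sqrt((1.0 / na + 1.0 / nb) * ss / (na + nb - 2))
    return diff / (s + s0)

def multisam_scores(lpc, mpc, n_iter=1000, cutoff=3.0, seed=0):
    """Count, per probe, how often it is called differentially expressed
    when the less populated class (lpc) is compared with random
    same-size subsamples of the more populated class (mpc).

    The fixed |d| > cutoff rule is a stand-in for SAM's own selection step.
    Returns an integer score between 0 and n_iter for each probe.
    """
    rng = np.random.default_rng(seed)
    n_small = lpc.shape[1]
    scores = np.zeros(lpc.shape[0], dtype=int)
    for _ in range(n_iter):
        # Draw a balanced subsample of the larger class, without replacement
        cols = rng.choice(mpc.shape[1], size=n_small, replace=False)
        d = sam_like_stat(lpc, mpc[:, cols])
        scores += (np.abs(d) > cutoff).astype(int)
    return scores
```

Probes can then be ranked by score and filtered with a threshold of the kind quoted above (score >300 out of 1,000 iterations).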
Chapter 4 describes a method to address the evaluation of similarities in a three-class problem by
means of the Relevance Vector Machine [4]. When looking at microarray data in
a prognostic and diagnostic clinical framework, differences are not the only thing that can play a crucial
role. In some cases similarities can give useful, and sometimes even more important,
information. The goal, given three classes, could be to establish, with a certain level
of confidence, whether the third one is similar to the first or to the second. In this work
we show that the Relevance Vector Machine (RVM) [2] could be a possible solution to the
limitations of standard supervised classification. RVM offers many advantages
compared, for example, with its well-known precursor, the Support Vector Machine (SVM)
[3]. Among these advantages, the estimate of the posterior probability of class membership
is a key feature for addressing the similarity issue. This is a highly important, but
often overlooked, option of any practical pattern recognition system. We focused on
a three-class tumor grade problem, with 67 samples of grade I (G1), 54 samples of
grade III (G3) and 100 samples of grade II (G2). The goal is to find a model able to separate
G1 from G3, and then to evaluate the third class, G2, as a test set to obtain the probability that
each G2 sample is a member of class G1 or class G3. The analysis showed that breast
cancer samples of grade II have a molecular profile more similar to breast cancer samples
of grade I. This result had been conjectured in the literature, but no measure of
significance had been given before.
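The workflow described for the tumor grade problem (train on G1 versus G3, then read off class-membership probabilities for G2) can be sketched as follows. This is only an illustration under stated assumptions: a plain L2-regularised logistic regression is used as a stand-in probabilistic classifier, not an RVM, and the function names and hyperparameters (`l2`, `lr`, `n_steps`) are hypothetical choices.

```python
import numpy as np

def _sigmoid(z):
    # Clip to avoid overflow in exp for extreme scores
    return 1.0 / (1.0 + np.exp(-np.clip(z, -30.0, 30.0)))

def fit_logistic(X, y, l2=1.0, lr=0.1, n_steps=2000):
    """Fit L2-regularised logistic regression by gradient descent.

    X: (n_samples, n_features) expression matrix.
    y: labels, 0 for G1 and 1 for G3.
    Stand-in for an RVM; returns (weights, bias).
    """
    n = len(y)
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(n_steps):
        p = _sigmoid(X @ w + b)
        w -= lr * (X.T @ (p - y) / n + l2 * w / n)
        b -= lr * np.mean(p - y)
    return w, b

def posterior_g3(X, w, b):
    """Estimated probability that each sample belongs to class G3."""
    return _sigmoid(X @ w + b)

def fraction_closer_to_g1(X_g2, w, b):
    """Share of G2 samples whose estimated P(G3 | x) is below 0.5."""
    return float(np.mean(posterior_g3(X_g2, w, b) < 0.5))
```

Applied to the held-out G2 samples, the per-sample probabilities (rather than hard class labels) are what allow a statement such as "grade II is more similar to grade I" to be quantified.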
Document type
Doctoral thesis
Author
Marconi, Daniela
Supervisor
Co-supervisor
PhD programme
Cycle
20
Coordinator
Disciplinary sector
Competition sector
Keywords
gene expression relevance vector machine
URN:NBN
DOI
10.6092/unibo/amsdottorato/842
Defence date
12 June 2008
URI