Di Lascio, Francesca Marta Lilja
(2008)
Analyzing the dependence structure of microarray data: a copula–based approach, [Dissertation thesis], Alma Mater Studiorum Università di Bologna.
PhD programme in Statistical Methodology for Scientific Research (Metodologia statistica per la ricerca scientifica), 20th cycle. DOI 10.6092/unibo/amsdottorato/670.
Abstract
The main aim of this Ph.D. dissertation is to study the clustering of dependent data by means of copula
functions, with particular emphasis on microarray data. Copula functions are a popular multivariate modeling
tool in every field where multivariate dependence is of interest, yet their use in clustering has not yet
been investigated.
The first part of this work reviews the literature on clustering methods, copula functions
and microarray experiments. Attention focuses on the K-means (Hartigan, 1975; Hartigan and Wong,
1979), hierarchical (Everitt, 1974) and model-based (Fraley and Raftery, 1998, 1999, 2000, 2007)
clustering techniques, because these are the methods whose performance is compared. Then the probabilistic
interpretation of Sklar’s theorem (Sklar, 1959), estimation methods for copulas such as Inference for
Margins (Joe and Xu, 1996), and the Archimedean and Elliptical copula families are presented. Finally,
applications of clustering methods and copulas to genetic and microarray experiments are highlighted.
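The two-step idea behind Inference for Margins can be illustrated with a minimal sketch. It is not the code developed in the thesis (which is written in R); it assumes normal margins and a bivariate Gaussian copula purely for illustration, and all function names (`simulate_pairs`, `gaussian_copula_loglik`, `to_normal_scores`, `ifm_fit`) are hypothetical. Step 1 estimates each margin separately; step 2 plugs the fitted margins in and maximizes the copula log-likelihood over the dependence parameter, here by a simple grid search.

```python
import math
import random
import statistics
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal, used to compute normal scores


def simulate_pairs(n, rho, seed=0):
    """Draw n pairs with margins N(10, 2) and N(-3, 0.5) and correlation rho.

    The margin parameters are arbitrary illustration values.
    """
    rng = random.Random(seed)
    xs, ys = [], []
    for _ in range(n):
        e1, e2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
        xs.append(10.0 + 2.0 * e1)
        ys.append(-3.0 + 0.5 * (rho * e1 + math.sqrt(1.0 - rho * rho) * e2))
    return xs, ys


def gaussian_copula_loglik(rho, z1, z2):
    """Log-likelihood of a bivariate Gaussian copula at normal scores z1, z2."""
    one_minus = 1.0 - rho * rho
    total = 0.0
    for a, b in zip(z1, z2):
        quad = rho * rho * (a * a + b * b) - 2.0 * rho * a * b
        total += -0.5 * math.log(one_minus) - quad / (2.0 * one_minus)
    return total


def to_normal_scores(values, margin):
    """Map data to uniforms with the fitted margin, then to normal scores."""
    eps = 1e-12  # keep uniforms strictly inside (0, 1) so inv_cdf is defined
    return [
        STD_NORMAL.inv_cdf(min(max(margin.cdf(v), eps), 1.0 - eps))
        for v in values
    ]


def ifm_fit(xs, ys):
    """Two-step fit: margins first, then the copula parameter."""
    # Step 1: maximum-likelihood estimates of each (assumed normal) margin.
    margin_x = NormalDist(statistics.fmean(xs), statistics.pstdev(xs))
    margin_y = NormalDist(statistics.fmean(ys), statistics.pstdev(ys))
    z1 = to_normal_scores(xs, margin_x)
    z2 = to_normal_scores(ys, margin_y)
    # Step 2: maximize the copula log-likelihood over a grid of rho values.
    grid = [k / 100.0 for k in range(-99, 100)]
    rho_hat = max(grid, key=lambda r: gaussian_copula_loglik(r, z1, z2))
    return margin_x, margin_y, rho_hat


if __name__ == "__main__":
    xs, ys = simulate_pairs(2000, 0.7, seed=1)
    margin_x, margin_y, rho_hat = ifm_fit(xs, ys)
    print(margin_x, margin_y, rho_hat)
```

The grid search keeps the sketch dependency-free; a numerical optimizer would be the usual choice in practice, and other margin or copula families would replace the normal and Gaussian assumptions made here.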
The second part contains the original contribution. A simulation study is performed to
evaluate how well the K-means and the hierarchical bottom-up clustering methods identify
clusters according to the dependence structure of the data-generating process. Simulations are
run under varying conditions (e.g., the kind of margins (distinct, overlapping or nested) and
the value of the dependence parameter), and the results are evaluated by means of several
performance measures.
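To make the simulation setting concrete, the following minimal sketch draws dependent pairs from a Gaussian copula and then attaches different margins. It is an illustration only, not the simulation design of the thesis: the Gaussian copula, the normal margins chosen to stand for "distinct", "overlapping" and "nested", and the function names are all assumptions. It shows why margins and dependence can be varied separately: a rank-based measure such as Kendall's tau depends on the copula, not on the margins.

```python
import math
import random
from statistics import NormalDist


def gaussian_copula_sample(n, rho, seed=0):
    """Draw n pairs (u, v) of uniforms linked by a Gaussian copula."""
    rng = random.Random(seed)
    std = NormalDist()
    pairs = []
    for _ in range(n):
        e1, e2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
        z1 = e1
        z2 = rho * e1 + math.sqrt(1.0 - rho * rho) * e2
        pairs.append((std.cdf(z1), std.cdf(z2)))
    return pairs


def apply_margins(pairs, margin_1, margin_2):
    """Turn uniforms into data via the quantile function of each margin."""
    return [(margin_1.inv_cdf(u), margin_2.inv_cdf(v)) for u, v in pairs]


def kendall_tau(data):
    """Kendall's tau from concordant and discordant pairs (O(n^2))."""
    n = len(data)
    concordant = discordant = 0
    for i in range(n):
        xi, yi = data[i]
        for j in range(i + 1, n):
            xj, yj = data[j]
            s = (xi - xj) * (yi - yj)
            if s > 0:
                concordant += 1
            elif s < 0:
                discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)


# Hypothetical margin settings, chosen only to mimic the three cases.
MARGIN_SETTINGS = {
    "distinct": (NormalDist(0.0, 1.0), NormalDist(10.0, 1.0)),
    "overlapping": (NormalDist(0.0, 1.0), NormalDist(1.0, 1.0)),
    "nested": (NormalDist(0.0, 1.0), NormalDist(0.0, 3.0)),
}


def tau_by_setting(n, rho, seed=0):
    """Kendall's tau of the same copula sample under each margin setting."""
    pairs = gaussian_copula_sample(n, rho, seed)
    return {
        name: kendall_tau(apply_margins(pairs, m1, m2))
        for name, (m1, m2) in MARGIN_SETTINGS.items()
    }


if __name__ == "__main__":
    for name, tau in tau_by_setting(600, 0.7, seed=1).items():
        print(name, round(tau, 3))
```

For a Gaussian copula with correlation rho, Kendall's tau equals (2/π)·arcsin(rho), so the three settings should give essentially the same value, close to that figure, even though the raw data look very different.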
In light of the simulation results and of the limits of the two investigated clustering methods, a new
clustering algorithm based on copula functions (‘CoClust’ for short) is proposed. The basic idea, the iterative
procedure of CoClust and a description of the R functions written for it, together with their output, are given.
The CoClust algorithm is tested on simulated data (by varying the number of clusters, the copula model, the
dependence parameter value and the degree of overlap of the margins) and is compared with model-based
clustering using different performance measures, such as the percentage of cases in which the number of
clusters is correctly identified and the percentage of non-rejections of H0.
It is shown that the CoClust algorithm overcomes all the observed limits of the other investigated
clustering techniques and is able to identify clusters according to the dependence structure of the data,
independently of the degree of overlap of the margins and of the strength of the dependence. CoClust uses
a criterion based on the maximized log-likelihood function of the copula and can account for virtually
any dependence relationship between observations. Several distinctive characteristics of CoClust are
shown, e.g. its ability to identify the true number of clusters and the fact that it does not require a
starting classification.
Finally, the CoClust algorithm is applied to the real microarray data of Hedenfalk et al. (2001), both to
the gene expressions observed in three different cancer samples and to the columns (tumor samples) of the
whole data matrix.
Document type: Doctoral thesis
Author: Di Lascio, Francesca Marta Lilja
PhD programme: Metodologia statistica per la ricerca scientifica
Cycle: 20
Keywords: copula functions, clustering methods, IFM estimation method, CoClust algorithm, microarray data
DOI: 10.6092/unibo/amsdottorato/670
Defence date: 2 April 2008