Ventrucci, Massimo (2009)
Multiple testing in spatial epidemiology: a Bayesian approach, [Dissertation thesis], Alma Mater Studiorum Università di Bologna.
PhD programme in Metodologia statistica per la ricerca scientifica, 21st cycle. DOI 10.6092/unibo/amsdottorato/1564.
Abstract
In this work we propose a new approach for preliminary epidemiological
studies on Standardized Mortality Ratios (SMRs) collected in many spatial
regions. A preliminary study on SMRs aims to formulate hypotheses to be
investigated via individual epidemiological studies that avoid the bias carried
by aggregated analyses. Starting from the observed disease counts and the
expected disease counts, calculated by means of reference population disease
rates, an SMR is derived in each area as the MLE under the Poisson assumption
on each observation. Such estimators have high standard errors in small areas,
i.e. where the expected count is low either because of the small population
underlying the area or the rarity of the disease under study. Disease mapping
models and other techniques for screening disease rates across the map, aiming
to detect anomalies and possible high-risk areas, have been proposed in the
literature under both the classical and the Bayesian paradigm. Our proposal
approaches this issue with a decision-oriented method, which focuses on
multiple testing control, without however leaving the preliminary-study
perspective that an analysis of SMR indicators is expected to take. We
implement the control of the False Discovery Rate (FDR), a quantity widely
used to address multiple comparison problems in the field of microarray data
analysis but not usually employed in disease mapping. Controlling the FDR
means providing an estimate of the FDR for a set of rejected null hypotheses.
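As a minimal sketch of why small areas are problematic, the snippet below computes the SMR and an approximate standard error for one area, assuming the observed count O follows a Poisson distribution with mean theta * E, so that the MLE is O / E with variance theta / E. The function name and the two example areas are hypothetical and are not taken from the thesis.

```python
import math

def smr_with_se(observed, expected):
    """Return (SMR, approximate standard error) for one area.

    Assumes O ~ Poisson(theta * E): the MLE of theta is O / E and
    Var(O / E) = theta / E, estimated here by plugging in SMR / E.
    """
    if expected <= 0:
        raise ValueError("expected count must be positive")
    smr = observed / expected
    se = math.sqrt(smr / expected)
    return smr, se

# Two hypothetical areas with the same SMR but very different sizes.
small = smr_with_se(observed=3, expected=2.0)
large = smr_with_se(observed=150, expected=100.0)
```

Both areas have the same SMR, but the standard error in the small area is several times larger, which is the instability described above.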
The small-area issue raises difficulties in applying traditional methods
for FDR estimation, which are usually based only on knowledge of the p-values
(Benjamini and Hochberg, 1995; Storey, 2003). Tests evaluated by a traditional
p-value provide weak power in small areas, where the expected number of disease
cases is small. Moreover, tests cannot be assumed to be independent when spatial
correlation between SMRs is expected, nor are they identically distributed
when the population underlying the map is heterogeneous.
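For reference, the p-value based approach mentioned above can be illustrated with the standard Benjamini-Hochberg step-up rule. This is a generic sketch of that well-known procedure, not code from the thesis, and the p-values are made up for illustration.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return the indices rejected by the Benjamini-Hochberg step-up rule."""
    m = len(p_values)
    # Indices of the hypotheses, ordered by increasing p-value.
    order = sorted(range(m), key=lambda i: p_values[i])
    k = 0
    for rank, i in enumerate(order, start=1):
        # Keep the largest rank whose p-value is below alpha * rank / m.
        if p_values[i] <= alpha * rank / m:
            k = rank
    return sorted(order[:k])

# Hypothetical p-values, one per area.
p = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
rejected = benjamini_hochberg(p, alpha=0.05)
```

The rule uses nothing but the p-values themselves, so it cannot borrow information from neighbouring areas, which is the limitation the text points to.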
The Bayesian paradigm offers a way to overcome the inappropriateness of
p-value based methods. Another peculiarity of the present work is to propose
a fully Bayesian hierarchical model for FDR estimation in testing many null
hypotheses of absence of risk. We use concepts from Bayesian models for disease
mapping, referring in particular to the Besag, York and Mollié model (1991),
often used in practice for its flexible prior assumption on the distribution of
risks across regions. The borrowing of strength between prior and likelihood,
typical of a hierarchical Bayesian model, has the advantage of evaluating a
single test (i.e. a test in a single area) by means of all observations in the
map under study, rather than just by means of the single observation. This
improves the power of the test in small areas and addresses more appropriately
the spatial correlation issue, which suggests that relative risks are closer in
spatially contiguous regions.
The proposed model aims to estimate the FDR by means of the MCMC-estimated
posterior probabilities $\hat{b}_i$ of the null hypothesis (absence of risk)
for each area. An estimate of the expected FDR conditional on the data
($\widehat{FDR}$) can be calculated for any set of $\hat{b}_i$'s relative to
areas declared high-risk (where the null hypothesis is rejected) by averaging
the $\hat{b}_i$'s themselves. The $\widehat{FDR}$ can be used to provide an easy
decision rule for selecting high-risk areas, i.e. selecting as many areas as
possible such that the $\widehat{FDR}$ does not exceed a prefixed value; we call
these $\widehat{FDR}$-based decision (or selection) rules. The sensitivity and
specificity of such rules depend on the accuracy of the FDR estimate:
over-estimation of the FDR causes a loss of power, and under-estimation of the
FDR produces a loss of specificity. Moreover, our model has the interesting
feature of still being able to provide an estimate of the relative risk values,
as in the Besag, York and Mollié model (1991).
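The averaging step and the resulting selection rule can be sketched as follows. This is an illustrative reading of the rule, assuming the posterior null probabilities are already available (for example from MCMC output) and that the rule keeps the estimated FDR at or below the prefixed value. The function names and the example probabilities are hypothetical.

```python
def estimated_fdr(null_probs, selected):
    """Average posterior null probability over the selected areas."""
    if not selected:
        return 0.0
    return sum(null_probs[i] for i in selected) / len(selected)

def select_high_risk(null_probs, alpha):
    """Largest set of areas whose estimated FDR does not exceed alpha."""
    # Areas with the smallest null probability are the strongest candidates.
    order = sorted(range(len(null_probs)), key=lambda i: null_probs[i])
    chosen, running_sum = [], 0.0
    for i in order:
        # The running mean is non-decreasing, so stop at the first excess.
        if (running_sum + null_probs[i]) / (len(chosen) + 1) > alpha:
            break
        running_sum += null_probs[i]
        chosen.append(i)
    return chosen

# Hypothetical posterior probabilities of "no excess risk" for six areas.
null_probs = [0.01, 0.40, 0.03, 0.90, 0.08, 0.20]
areas_10 = select_high_risk(null_probs, alpha=0.10)
areas_05 = select_high_risk(null_probs, alpha=0.05)
```

Because the areas are added in increasing order of null probability, the first set whose average exceeds the prefixed value marks the stopping point, and a stricter value yields a smaller selection.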
A simulation study was set up to evaluate the model performance in terms of
FDR estimation accuracy, sensitivity and specificity of the decision rule, and
goodness of estimation of the relative risks. We chose a real map from which we
generated several spatial scenarios whose disease counts vary according to
the degree of spatial correlation, the size of the areas, the number of areas
where the null hypothesis is true and the risk level in the latter areas. In
summarizing the simulation results we always consider the FDR estimation in
sets constituted by all $\hat{b}_i$'s lower than a threshold t. We show graphs
of the $\widehat{FDR}$ and the true FDR (known by simulation) plotted against
the threshold t to assess the FDR estimation. By varying the threshold we can
learn which FDR values can be accurately estimated by the practitioner willing
to apply the model (by the closeness between $\widehat{FDR}$ and the true FDR).
By plotting the calculated sensitivity and specificity (both known by
simulation) against the $\widehat{FDR}$ we can check the sensitivity and
specificity of the corresponding $\widehat{FDR}$-based decision rules. To
investigate the degree of over-smoothing of the relative risk estimates we
compare box-plots of such estimates in high-risk areas (known by simulation),
obtained by both our model and the classic Besag, York and Mollié model. All the
summary tools are worked out for all simulated scenarios (54 scenarios in total).
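In a simulation the truly high-risk areas are known, so the quantities compared above can be computed directly for each simulated data set. The sketch below assumes a selection and a set of true high-risk areas given as index lists; the function name and example values are hypothetical, and the "fdr" entry is the realized false discovery proportion for a single data set.

```python
def evaluate_selection(selected, truly_high_risk, n_areas):
    """Realized FDR, sensitivity and specificity of one selection."""
    selected, truth = set(selected), set(truly_high_risk)
    tp = len(selected & truth)        # high-risk areas correctly selected
    fp = len(selected - truth)        # null areas wrongly selected
    fn = len(truth - selected)        # high-risk areas missed
    tn = n_areas - tp - fp - fn       # null areas correctly left out
    return {
        "fdr": fp / len(selected) if selected else 0.0,
        "sensitivity": tp / len(truth) if truth else float("nan"),
        "specificity": tn / (tn + fp) if (tn + fp) else float("nan"),
    }

# Hypothetical map of 10 areas, 4 of which are truly high-risk.
result = evaluate_selection(
    selected=[0, 2, 4, 5], truly_high_risk=[0, 2, 4, 7], n_areas=10
)
```

Comparing the realized value with the estimate obtained from the posterior probabilities of the same selection shows whether the estimate is conservative (too high) or optimistic (too low).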
Results show that the FDR is well estimated (in the worst case we get an
over-estimation, hence a conservative FDR control) in scenarios with small
areas, low risk levels and spatially correlated risks, which are our primary
aims. In such scenarios we have good estimates of the FDR for all values less
than or equal to 0.10. The sensitivity of $\widehat{FDR}$-based decision rules
is generally low but the specificity is high. In such scenarios the use of a
$\widehat{FDR}$ = 0.05 or $\widehat{FDR}$ = 0.10 based selection rule can be
suggested. In cases where the number of true alternative hypotheses (number
of true high-risk areas) is small, FDR = 0.15 values are also well estimated,
and a $\widehat{FDR}$ = 0.15 based decision rule gains power while maintaining
a high specificity. On the other hand, in scenarios with non-small areas and
non-small risk levels the FDR is under-estimated except for very small values
of it (much lower than 0.05), resulting in a loss of specificity of a
$\widehat{FDR}$ = 0.05 based decision rule. In such scenarios
$\widehat{FDR}$ = 0.05 or, even worse, $\widehat{FDR}$ = 0.10 based decision
rules cannot be suggested because the true FDR is actually much higher. As
regards the relative risk estimation, our model achieves almost the same
results as the classic Besag, York and Mollié model. For this reason, our model
is interesting for its ability to perform both the estimation of relative risk
values and the FDR control, except in scenarios with non-small areas and large
risk levels. A case study is finally presented to show how the method can be
used in epidemiology.
Document type: Doctoral thesis
Author: Ventrucci, Massimo
PhD programme: Metodologia statistica per la ricerca scientifica
Doctoral school: Scienze economiche e statistiche
Cycle: 21
DOI: 10.6092/unibo/amsdottorato/1564
Defence date: 19 March 2009