
| HOME | HELP | FEEDBACK | SUBSCRIPTIONS | ARCHIVE | SEARCH | TABLE OF CONTENTS |
1 International Agency for Research on Cancer, Lyon, France; 2 Department of Epidemiology, University of California at Los Angeles, Los Angeles, California; Institutes of 3 Occupational Health and 4 Hygiene, University of Brescia, Brescia, Italy; 5 German Cancer Research Center, Heidelberg, Germany; and 6 Department of Epidemiology and Biostatistics, University of California at San Francisco, San Francisco, California
Requests for reprints: Rayjean J. Hung, Internal Agency for Research on Cancer Epidemiology, 150 cours Albert-Thomas, 69008 Lyon, France. Fax: 33-4-72738320. E-mail: hung{at}iarc.fr
| Abstract |
|---|
|
|
|---|
Key Words: hierarchical model genetic polymorphism gene-environment interaction
| Background |
|---|
|
|
|---|
The common conventional analytic approaches currently in practice, such as (1) treating all exposures independently (i.e., estimating the parameters of interest one at a time), (2) including all exposures in a single model, and (3) using algebraic selection procedures, such as stepwise regression, each have limitations (2). The first approach ignores the biological relevance among the exposures. The second approach often results in over-parameterized models, and the third approach selects the parameters arbitrarily based on "statistical significance" and can result in biased point and variance estimates (3). Furthermore, none of these approaches properly address the issues of multiple comparisons and potential false-positive associations. In contrast, hierarchical modeling may provide an avenue for addressing these issues in genetic association studies. In particular, such studies generally collect much information on interrelated genetic variables, which can be incorporated into a hierarchical model.
A false-positive association is suspected to occur when the point estimate is far away from other estimates, and unstable (i.e., with a large SE; ref. 4). This problem can be reduced by "correcting" the overestimation of the observed variance with the goal of estimating the "true" variance. The true variance (VT) can be approximated by subtracting the mean of the variance of the individual estimates (VM) from the observed sample variance of the log relative risk estimate (VO; ref. 4). Because the variance should be nonnegative, the estimate of VT is the maximum of VO VM and zero.
![]() | (A) |
There are three assumptions when using this adjustment. First, there is no systematic bias in the conventional estimates that threatens the validity of the "shrinkage to the mean" (i.e., if the systematic bias is present, the estimates would be shrunk toward the "wrong mean" after adjustment). Second, the true values and random errors in each ensemble are both roughly normally distributed. And third, the true values of the relative risk within each ensemble are exchangeable. To satisfy the last assumption, the parameters must be grouped in a manner that minimizes the ESE (4). For example, consider a study of smoking, genetic polymorphisms, and lung cancer. If smoking (a strong risk factor) is included in one ensemble with multiple low penetrance genes, large but unstable estimates for a few genetic polymorphisms (which are more likely to be produced by sampling error) would not be pulled down, as would happen without smoking in the same ensemble.
The adjustment by shrinkage estimation can be undertaken with hierarchical modeling. This approach unifies the seemly separate concepts of frequentist and Bayesian analysis, and represents a valid alternative to a conventional analysis (5). It adds in two or more levels in the model to specify the relations among study variables. General properties of hierarchical modeling have been reviewed elsewhere (6, 7).
While numerous authors have advocated the advantages of using hierarchical modeling, they remain little used in the health sciences (7-10). Few other authors used hierarchical modeling to analyze gene-gene or gene-environment interactions (11, 12). Here, we further demonstrate how hierarchical modeling can be applied to genetic epidemiologic studies, including several low penetrance genes and addressing gene-environment interactions. The present study is different from the previous studies in that we employed a data set with multiple markers in different genes, and also provided a possible construction of the prior multivariate second-stage model for such a data set. To evaluate the performance of our model, we contrasted the results based on hierarchical modeling to those from conventional analyses. We used the data from a case-control study of bladder cancer, which is representative of current genetic epidemiologic studies, containing a modest sample size and multiple markers.
| Materials and Methods |
|---|
|
|
|---|
Hierarchical Modeling Overview
Hierarchical modeling extends and complements conventional analyses. In a two-stage model, one can use a conventional logistic regression as the first stage:
![]() | (B) |
is the vector of covariate coefficients.
Hierarchical modeling adds a second stage to the conventional model to try and improve the estimation of the ßj. For example, one could use the following linear model
![]() | (C) |
![]() | (D) |
is a column vector of p prior coefficients. Z is the n-by-p matrix of second-stage covariates, with rows zj. The Z matrix is a key component of the hierarchical approach: it is defined by the investigators to reflect the similarities between the first-stage factors. For example, in a case-control study of multiple dietary exposures and breast cancer risk (8), the Z matrix was based on the nutritional contents of the food items. In an analysis of genetic factors, it was based on physical distance between polymorphisms (16). In a study of gene-environment interactions in polyp formation, the Z matrix incorporated the conversion rates for heterocyclic amines (12).
In Eq. C,
is the column vector of linear effects of the second-stage covariates (Z) on ßj, and
j are the residual effects of item j (e.g., specific genetic factors). Residual effects can arise from interactions among second-stage covariates, or from the covariates not included in the Z matrix, and are assumed to have mean of zero and variance of
2. The
can either be estimated from the data [empirical-Bayes (EB) approach] using a variance model based on the concept outlined above (Eq. A; refs. 2, 4, 17), or pre-specified by the investigators using background information [semi-Bayes (SB) approach] (7, 17).
2 reflects the range of the potential residual effect that remain after accounting for all first- and second-level covariates, and can be viewed as an approximation of VT.
Substituting Eq. C into B gives a mixed model Eq. (E), which contains fixed coefficients (
and
) and random coefficients
.
![]() | (E) |
Application of Hierarchical Modeling to the Case-Control Study
For comparison's sake, the data were initially analyzed with the conventional approaches using the following two logistic regression models: treating all the genetic factors independently and constructing one-at-a-time models; and including all genetic factors in one model.
Our first hierarchical model was constructed under the assumption that all genes are exchangeable, with the EB approach (allowing
2 to be estimated from the data). Because the assumption of full exchangeability is unlikely to reflect the underlying biological relations among genes, we further estimated the main effects of genetic polymorphisms by using a Z matrix to specify the similarity between genetic factors in terms of their function.
Construction of Z Matrix
The genetic factors were broadly classified into four main pathways related to bladder carcinogenesis, based on the biological functions of the genes: (1) reduced carcinogen detoxification; (2) enhanced procarcinogen activation; (3) increased oxidative stress; and (4) decreased DNA repair capacity. The risk genotype of each gene is designated based on the functional significance of the polymorphism. For each pathway, a score was assigned to each genetic polymorphism according to biological function of the gene and the functional significance of its polymorphism. A score of 0 was assigned to the genetic polymorphism if the gene is not involved in such pathway (the coefficient of the factor was expected to be zero). For example, MPO was assigned a score of 0 in the pathway of reduced detoxification and DNA repair, because it is not expected to have roles in such pathways; whereas a score of 1 was assigned if the gene is involved in such pathway, and the functional significances of the risk genotypes correspond to the pathway that contribute to the risk of cancer (the factor was expected to have a positive coefficient). Take XRCC1, for example, it has an important role in repair of single strand break. 399Gln allele was associated with higher levels of aflatoxin B1-DNA adduct and higher bleomycin sensitivity; therefore, the designated risk genotype of XRCC1 is Arg/Gln, Gln/Gln, and a score of 1 was assigned in the pathway of reduced DNA repair (18, 19).
The genes assigned to the same pathway were assumed to be exchangeable, that is, for their effects to arise from a common distribution. The four pathways were not mutually exclusive. For example, MPO activates procarcinogens in tobacco smoke, such as benzo[a]pyrene, through the release of reactive oxygen species (21-23). 463A allele is associated with reduced mRNA expression and its transcription activity is about 25 times lower than G allele in vitro (24). A score of 1 was assigned to MPO G/G, G/A in the pathway of enhanced activation and increased oxidative stress due to MPO's role in both pathways. Another example, NQO1 Pro187Ser variant genotypes (Pro/Ser, Ser/Ser) was given a score of 1 in both pathways of reduced carcinogen detoxification and enhanced procarcinogen activation, because NQO1 converts the highly genotoxic benzoquinine to the less toxic hydroxy metabolites (24), whereas it is also involved in bioactivation of some procarcinogens, such as 4-nitroquinoline-1-oxide (25).
The basic Z matrix is reported in Table 1. In summary, the biological implication of this Z matrix is that GSTs, NAT2, SULT1A1, and NQO1 play roles in carcinogen detoxification, and their designated risk genotypes reduce the detoxification; NAT1, SULT1A1, CYP1B1, MPO, and NQO1 are involved in procarcinogen activation, and their designated risk genotypes enhance the activation; CYP1B1, MPO, COMT, MnSOD, and NQO1 modulate the oxidative stress, and the risk genotypes reduce the oxidative stress; XRCC1, XRCC3, and XPD repair DNA via different mechanisms, and their risk genotypes decreased the DNA repair capacity.
|
In Z-matrix variant 2, scores were allowed to range from 1 to 1. Genetic polymorphisms were assigned a score of 1 when hypothesizing that the prior risk genotypes have opposite effect in the given pathway (e.g., increased detoxification, instead of reduced detoxification). In other words, the factor was expected to have a negative coefficient. For example, NAT1 have been shown to be involved in both the activation and detoxification of aromatic amines. In general, NAT1 has a major role in the O-acetylation of N-hydroxy aromatic amines, and lead to the activation of aromatic amines (32). NAT1 *10 and *11 are associated with faster enzyme activity and were, therefore, assigned a score of 1 in the pathway of enhanced activation. On the other hand, because NAT1 is also involved in detoxification process, it is possible that a higher enzyme activity conferred by *10 and *11 might be associated with an increased detoxification via N-acetylation, although the affinity for which is less. We, therefore, assigned a score of 1 in the pathway of reduced detoxification to indicate the opposite effect via same pathway.
We assumed that the basic Z matrix was the most comprehensive prior for our purposes, and the analyses of gene-environment interaction were based on the basic Z matrix. The details of Z matrix applied to gene-environment interaction is described in the following section.
We used an EB approach, which estimated
2 from the data. When this approach estimated
2 as being less than or equal to zero, we used a SB approach setting
2 equal to 0.05, a value similar to that estimated for main effects from the data in the EB approach. This assumes that the true odds ratio (OR) of each parameter would fall within a 2.4-fold range, that is,
. To see the how sensitive the results were to
2, we increased the pre-specified value to 0.35, which assumes with 95% probability that the true ORs fall within a 10-fold range [i.e., (ln(10)/3.92)2
0.35].
Assessment of Gene-Environment Interactions
Three environmental agents were included in the analysis: tobacco smoking and occupational exposure to polycyclic aromatic hydrocarbons and to aromatic amines. These exposures are known to cause of bladder cancer (32). Two approaches were used for assessing the gene-environment interactions. The first approach added product terms to the first-stage model (see Eq. F).
![]() | (F) |
Attempting to include one environmental factor, with all the genetic factors and their product terms in a single model (a total of 29 parameters), resulted in the over-parameterization of the model, which could not be fitted. Therefore, we grouped genetic polymorphisms based on the biological functions of the genes (four groups: GSTs, oxidative stress, DNA repair, and others). The basic Z matrix was then divided into four corresponding groups. One column was added to the Z matrix to indicate the environmental exposure. All gene-environment product terms were assigned a score of 1 in the relevant genetic pathways and environmental exposure. The Z matrices used in gene-environment interaction analyses are available from the authors on request.
The second approach used three indicators for subjects exposed to high-risk genotype only, environmental factors only, and to both factors (see Eq. G below). Because there were three indicators for each pair of gene-environment interactions, analyzing the interaction by the group also resulted in an over-parameterized model. Therefore, gene-environment interactions were analyzed one pair at the time while using this approach. The Z matrix applied in this approach was also simplified to a three-by-two matrix with two columns indicating genetic and environmental factors. Consistent with the notation in Eq. G: D1 was assigned a score of 1 in genetic column, D2 was assigned a score of 1 in environmental column, and D3 was assigned a score of 1 in both columns.
![]() | (G) |
Consistent with the analyses for the genetic main effects, in gene-environment interaction analyses, we first used an EB approach to estimate
2 from the data. When this approach estimated
2 as being less than or equal to zero, we set
2 equal to 0.05, a value similar to that estimated from the data in the EB approach. To see how sensitive the results were to
2, we increased the pre-specified value to 0.35. All statistical analyses were conducted with SAS IML code in conjunction with GLIMMIX macro. An example program is available at URL http://darwin.cwru.edu/~witte/hm.html (17).
| Results |
|---|
|
|
|---|
|
2 was pre-specified, the larger the pre-specified value, the lesser the "shrinkage" (results based on hierarchical models 5 and 6 in Table 2, and hierarchical models 1 and 2 in Tables 3 and 4).
|
|
Table 4 shows the noteworthy results of gene-environment interaction based on the indicator-term approach. Similar to the results in Table 3, the hierarchical estimates were more stable, with the amount of shrinkage depending on the value of
2. For example, when looking at the interaction between the GSTT1 null genotype and exposure to aromatic amines, the conventional analysis resulted in an unstable OR that was likely overestimated and suggested the presence of interaction. The hierarchical modeling, on the other hand, gave more stable estimates and suggested no interaction.
| Discussion |
|---|
|
|
|---|
Potential false-positive results are characterized by increased imprecision in effect estimation, which results in less weight in the hierarchical models, and are, therefore, more liable to the "shrinkage" effect of hierarchical modeling than true positive results (4). Although hierarchical modeling may also increase the potential for false-negative findings, it should be noted that even initial positive findings that are later replicated in subsequent studies are also usually overestimated (34). Thus, the shrinkage effect typically seen in hierarchical modeling is also relevant for true positive effects.
Other authors had presented the use of hierarchical models for looking at gene-gene or gene-environment interactions. Aragaki et al. (12) studied the interaction between diet and a single candidate gene NAT2, using a different data structure than the one presented here. We had similar data as in De Roos et al. (11), though their second-stage model assumed that all effects (i.e., main genetic, environmental, and when modeled, their interactions) should be shrunk toward a single prior mean. However, this may not hold in practice. Therefore, in the present report, we extended the previous work by using a second-stage (Z) design matrix that reflects the similarity of different genetic factors in a more detailed manner. Furthermore, in our interaction analysis, we provided results from both indicator-term and product-term approaches. In the product-term approach, we went beyond looking at one interaction at a time. In addition, we undertook a sensitivity analysis of the design of the Z matrix and the pre-specified
2.
There are a number of assumptions underlying the use of hierarchical models; when these are violated, the resulting estimates can be worse than obtained with a conventional analysis (35). While hierarchical modeling attempts to improve estimation accuracy, it may sometimes do so at the expense of increasing the bias (6). Recall that estimation accuracy reflects both validity (systematic error) and precision (random error). In the presence of systematic bias, and when the mean of the estimates is not reasonably close to the true mean, the hierarchical estimates would be pulled to the "wrong" mean. Furthermore, in the case when the second-stage model does not provide a reasonable presentation of the biological phenomenon, the estimates from the hierarchical model would also be biased. For example, if one assumes the effect of GSTM1 null (risk allele) and MPO A/A genotype (protective allele) come from the same mean, the estimate of GSTM1 null may be biased downward and the estimate of MPO A/A genotype may be biased upward (5). However, under a reasonably specified prior, the hierarchical models are expected to provide more precise estimates of risks.
Slight alterations of the Z matrix did not have a major influence on the risk estimates. This suggests that the model is relatively robust, instead of implying that one particular Z matrix is "correct." One should aim to construct a Z matrix that reasonably reflects the underlying biological phenomena; however, it would not be feasible to estimate how well the Z matrix reflects "biological truth" by a statistical method, including sensitivity analyses.
Note that we used values of 0 or 1 in our second-stage design (Z) matrix. While such a crude design matrix has been shown to improve over conventional estimates (9), it is unlikely to fully capture the true differences between genetic factors. This is more likely to be on a continuous scale, reflecting aspects such as enzyme kinetics, the rate or the level of the adduct formation, etc. A study of NAT2 genotype-specific dietary effects on adenomatous polyps provided a fine example, in which Z matrix is the calculated conversion rates for heterocyclic amines, derived from the estimated concentrations of heterocyclic amines in the selected foods, and Michaelis-Menten constant estimates for NATs genotypes (12). Our knowledge on the effects of the products of the genes included in our study is far from complete. As this information becomes available, one can further refine the Z matrix. In the context of genetic polymorphisms, the main obstacle for researchers to apply hierarchical modeling may be the construction of prior (Z matrix). Our results suggest that the model is relatively robust, and that a Z matrix which reasonably represents the biological mechanisms can be helpful in improving the risk estimates.
One can estimate
2 from the data (EB approach) when the information of the residual effects is unknown, because when using a SB approach,
2 is pre-specified and hence potentially subject to criticism. However, previous work has shown that SB generally often outperforms EB, which may estimate
2 as equaling zero (2). Regardless, one should consider varying
2 to see how sensitive results are to this value, as in our example. Simulation studies have shown that the sensitivity of SB estimates to the pre-specified
2 increases with increasing study size (2).
We used two different approaches to investigate gene-environment interactions (i.e., product terms versus indicator variables). When considering more than one pair of interaction, the product-term approach requires fewer terms in one's model, and so allowed us to consider more markers simultaneously than the indicator variable approach. In particular, when looking at the interaction analysis based on the product-term approach, the number of models to test, including genetic main effects and gene-environment interactions was reduced from 56 (considering polymorphisms of 14 genes and 3 environmental exposures) to 13 [1 for genetic main effect, and 12 (4 x 3) models for gene-environment interactions]. However, the indicator-term approach may more directly exhibit the joint effect of the genetic and environmental factors.
In summary, association studies are commonly using high-throughput genotyping techniques, which generate observations on several hundreds or thousands of genetic markers in large populations. One must be very cautious about multiple comparison and potential false-positive associations when analyzing and interpreting such data. Hierarchical modeling can help address these issues, provide more plausible and stable estimates, and, therefore, represents an alternative to conventional methods for genetic association studies.
| Acknowledgments |
|---|
| Footnotes |
|---|
The costs of publication of this article were defrayed in part by the payment of page charges. This article must therefore be hereby marked advertisement in accordance with 18 U.S.C. Section 1734 solely to indicate this fact.
Note: R. Hung worked on this study under the tenure of a Special Training Award from the International Agency for Research on Cancer.
Received 11/ 4/03; revised 2/ 3/04; accepted 2/ 9/04.
| References |
|---|
|
|
|---|
This article has been cited by other articles:
![]() |
M. Y. Park and T. Hastie Penalized logistic regression for detecting gene interactions Biostat., January 1, 2008; 9(1): 30 - 50. [Abstract] [Full Text] [PDF] |
||||
![]() |
R. J. Hung, M. Baragatti, D. Thomas, J. McKay, N. Szeszenia-Dabrowska, D. Zaridze, J. Lissowska, P. Rudnai, E. Fabianova, D. Mates, et al. Inherited Predisposition of Lung Cancer: A Hierarchical Modeling Approach to DNA Repair and Cell Cycle Control Pathways Cancer Epidemiol. Biomarkers Prev., December 1, 2007; 16(12): 2736 - 2744. [Abstract] [Full Text] [PDF] |
||||
![]() |
C. M. Ulrich, H. F. Nijhout, and M. C. Reed Mathematical modeling: epidemiology meets systems biology. Cancer Epidemiol. Biomarkers Prev., May 1, 2006; 15(5): 827 - 829. [Full Text] [PDF] |
||||
![]() |
D. C. Thomas Are We Ready for Genome-wide Association Studies? Cancer Epidemiol. Biomarkers Prev., April 1, 2006; 15(4): 595 - 598. [Full Text] [PDF] |
||||
![]() |
D. C. Thomas The Need for a Systematic Approach to Complex Pathways in Molecular Epidemiology Cancer Epidemiol. Biomarkers Prev., March 1, 2005; 14(3): 557 - 559. [Full Text] [PDF] |
||||
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| HOME | HELP | FEEDBACK | SUBSCRIPTIONS | ARCHIVE | SEARCH | TABLE OF CONTENTS |
| Cancer Research | Clinical Cancer Research |
| Cancer Epidemiology Biomarkers & Prevention | Molecular Cancer Therapeutics |
| Molecular Cancer Research | Cancer Prevention Research |
| Cancer Prevention Journals Portal | Cancer Reviews Online |
| Annual Meeting Education Book | Cell Growth & Differentiation |