Abstract
Confounding by ethnicity (i.e. population stratification) can result in bias and incorrect inferences in genotype-disease association studies, but the effect of population stratification in gene-gene or gene-environment interaction studies has not been addressed. We used logistic regression models to fit multiplicative interactions between two dichotomous variables that represented genetic and/or environmental factors for a binary disease outcome in a hypothetical cohort of multiple ethnicities. Biases in main effects and interactions due to population stratification were evaluated by comparing regression coefficients in misspecified models that ignored ethnicities with their counterparts in models that accounted for ethnicities. We showed that biases in main effects and interactions were constrained by the differences in disease risks across the ethnicities. Therefore, large biases due to population stratification are not possible when baseline disease risk differences among ethnicities are small or moderate. Numerical examples of biases in genotype-genotype and/or genotype-environment interactions suggested that biases due to population stratification for main effects were generally small but could become large for studies of interactions, particularly when strong linkage disequilibrium between genes or large correlations between genetic and environmental factors existed. However, when linkage disequilibrium among genes or correlations among genes and environments were small, biases to main effects or interaction odds ratios were small to nonexistent. (Cancer Epidemiol Biomarkers Prev 2006;15(1):124–32)
 Population Stratification
 Interaction
 Bias
 Association Studies
Introduction
When the risk of disease varies between ethnic groups and the statistical distribution of an exposure variable (whether genetic or environmental) also varies between these groups, the association between the exposure and disease may be confounded by the ethnic background of the studied groups. This form of confounding by ethnicity is called population stratification in epidemiologic studies of genetic risk factors (1).
The effect of population stratification in epidemiologic studies of disease association with a single candidate gene has been extensively studied (2-9). Although various approaches using unlinked markers have been proposed to deal with the effects of population stratification on association tests of candidate genes (10-15), epidemiologic studies are increasingly concerned with interactions between genetic and/or environmental factors. None of the prior literature explicitly addresses the pattern or extent of bias due to population stratification in association studies involving interactions between genes and environmental factors.
To evaluate bias due to population stratification, we employed two dichotomous variables to represent the genetic and/or environmental factors and used logistic regression models to fit multiplicative interactions between the two variables for a binary disease outcome in a hypothetical cohort of multiple ethnicities. We derived algebraic solutions for asymptotic biases in the maximum likelihood estimates of the interaction variables under corresponding models that ignored ethnicities and identified conditions for the biases to reach their maximum or minimum expected values. We also provided numerical examples of biases due to population stratification under a wide range of conditions that may be observed in epidemiologic studies.
Materials and Methods
Data Structure and Model Specification
Assume that the interactions between two categorical variables V_{1} and V_{2} are studied for a binary disease outcome Y during a certain time in a closed cohort of subjects. Either V_{1} or V_{2} or both can be genetic or environmental variables (i.e., either gene-gene or gene-environment interactions can be modeled using the following approach). V_{1} = 0 denotes the reference category of V_{1}, and V_{1} = 1 denotes the comparison category. V_{2} is coded in the same fashion as V_{1}. Y = 1 indicates the presence of disease, and Y = 0 indicates the absence of disease. Assume the cohort comprises k ethnicities represented by indicator variables E_{1}, E_{2},…, E_{k}, so that a subject in the ith ethnicity is coded by E_{i} = 1 and E_{j} = 0 for all j ≠ i (i, j = 1, 2,…, k). Assume associations between disease and V_{1} and V_{2} are modeled with logistic regression as
logit(π) = α_{1} + α_{2}E_{2} + … + α_{k}E_{k} + β_{1}V_{1} + β_{2}V_{2} + β_{3}V_{1}V_{2}   (Eq. A)
where π is the conditional probability of the disease (Y = 1) given V_{1}, V_{2}, and ethnicity, and binomial random error is assumed. Without loss of generality, let α_{1} specify the log odds of disease (i.e., logit function of the baseline disease risk) in the lowest-risk ethnicity E_{1}, and let 0 < α_{2} < … < α_{k}, where α_{2} specifies the log odds ratio (OR) of disease risks comparing ethnicity E_{2} versus E_{1}. Similarly, α_{k} specifies the log OR of disease risks comparing ethnicity E_{k} versus E_{1}; β_{1} and β_{2} specify the main effects associated with the comparison of V_{1} and V_{2} relative to their reference categories, respectively; and β_{3} specifies the multiplicative interaction between V_{1} and V_{2}.
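Population stratification may be present when the joint distributions of V_{1} and V_{2} are different across the ethnicities. Biases due to population stratification can be evaluated by omitting all ethnicity indicator variables from model Eq. A to fit the misspecified model
logit(π^{*}) = α^{*} + β_{1}^{*}V_{1} + β_{2}^{*}V_{2} + β_{3}^{*}V_{1}V_{2}   (Eq. B)
where π^{*} is the probability of disease given V_{1} and V_{2} only.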
We define the asymptotic biases due to population stratification to be β_{1}^{*} − β_{1}, β_{2}^{*} − β_{2} for the main effects of V_{1} and V_{2}, respectively, and define β_{3}^{*} − β_{3} as the asymptotic bias for the interaction effects between V_{1} and V_{2}. Here, we do not deal with issues of variance or precision of estimates, assuming model coefficient estimates obtained from sufficiently large samples satisfy E(β̂_{i}^{*}) = β_{i}^{*}, where i = 1, 2, or 3.
Two Ethnicities
We obtained numerical estimates of large sample biases due to population stratification by fitting logistic regression models to simulated data generated under a wide range of conditions using Stata 7.0 (College Station, TX). Baseline disease risk in the low-risk ethnicity was specified to be 1%, whereas baseline risk in the high-risk ethnicity was specified to be 2% or 10%. To specifically assess genotype-genotype interactions, subsequent text assumes V_{1} = G_{1} and V_{2} = G_{2} to denote the study of interactions involving two candidate genes G_{1} and G_{2}. For simplicity, we assumed that candidate gene G_{i} (i = 1, 2) had two alleles A_{i} and a_{i} with frequency p_{i} and 1 − p_{i}, respectively. Genotype a_{i}a_{i} was coded as the reference category (i.e., G_{i} = 0), whereas genotypes A_{i}A_{i} and A_{i}a_{i} were coded as the comparison (i.e., “at-risk” genotype) category (i.e., G_{i} = 1). The at-risk genotype frequencies of G_{i} were specified in the range of 5% to 95%, assuming single-locus Hardy-Weinberg equilibrium, corresponding to p_{i} ranging from 3% to 78%.
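The mapping between at-risk genotype frequency and allele frequency stated above follows directly from Hardy-Weinberg equilibrium: the reference genotype a_{i}a_{i} has frequency (1 − p_{i})^2, so the at-risk frequency is 1 − (1 − p_{i})^2. The short Python sketch below inverts that relation; the function name is illustrative and is not taken from the original analysis, which used Stata.

```python
import math

def allele_freq_from_at_risk(genotype_freq):
    """Allele frequency p of A given the frequency of carrying AA or Aa.

    Assumes single-locus Hardy-Weinberg equilibrium, so that
    P(aa) = (1 - p)^2 and P(at risk) = 1 - (1 - p)^2.
    """
    return 1.0 - math.sqrt(1.0 - genotype_freq)

# Endpoints of the range used in the text, plus one intermediate value.
for g in (0.05, 0.50, 0.95):
    p = allele_freq_from_at_risk(g)
    print(f"at-risk genotype frequency {g:.2f} -> allele frequency {p:.3f}")
```

Rounded to whole percentages, the two endpoints give the 3% and 78% allele frequencies quoted in the text.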
In some cases, G_{1} and G_{2} may not be independently distributed due to linkage disequilibrium between the two loci, denoted here as D. Because D between G_{1} and G_{2} is constrained by D_{max} and D_{min}, where D_{max} = min[p_{1} × (1 − p_{2}),(1 − p_{1}) × p_{2}] and D_{min} = max[−p_{1} × p_{2},−(1 − p_{1}) × (1 − p_{2})] (16), we specified the degree of linkage disequilibrium between the two genes by D = D′ × D_{max} or D′ × D_{min}, where D′ took on the values of 0 (no linkage disequilibrium), 0.2 (small linkage disequilibrium), 0.5 (moderate linkage disequilibrium), and 0.9 (strong linkage disequilibrium). Main effects and interactions of genes were specified by assigning β_{j} = 0 (no effect, OR = 1), β_{j} = ±0.4 (small effect, OR = 1.49 or 0.67), β_{j} = ±0.8 (moderate effect, OR = 2.23 or 0.45), or β_{j} = ±1.6 (large effect, OR = 4.95 or 0.20). Using the expected frequency of each genotype-disease category under the specified conditions, a hypothetical cohort of 100,000 observations was generated. Disease status for each observation was determined by comparing its disease risk with a standard uniform random variable. A total of 5,000 case-control samples were randomly drawn from the hypothetical cohort. Each sample consisted of 95% of diseased individuals as cases and an equal number of nondiseased individuals as controls. Both correctly specified models and their corresponding misspecified models were fitted to each sample to obtain point estimates of all relevant regression coefficients. The corresponding point estimates from the 5,000 samples were averaged to obtain large sample point estimates for biases due to population stratification on both main effects and interactions. Results are presented in Figs. 1, 2, and 3.
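The linkage disequilibrium bounds and effect sizes above can be illustrated with a minimal Python sketch. The bounds D_{max} and D_{min} are coded exactly as given in the text; the haplotype frequencies use the standard relation P(A_{1}A_{2}) = p_{1}p_{2} + D. The function names and the allele frequencies 0.30 and 0.20 are illustrative assumptions, not values from the study.

```python
import math

def ld_bounds(p1, p2):
    """Return (D_min, D_max) for allele frequencies p1 and p2."""
    d_max = min(p1 * (1 - p2), (1 - p1) * p2)
    d_min = max(-p1 * p2, -(1 - p1) * (1 - p2))
    return d_min, d_max

def haplotype_freqs(p1, p2, d):
    """Two-locus haplotype frequencies for a given disequilibrium D."""
    return {
        "A1A2": p1 * p2 + d,
        "A1a2": p1 * (1 - p2) - d,
        "a1A2": (1 - p1) * p2 - d,
        "a1a2": (1 - p1) * (1 - p2) + d,
    }

# Illustrative (assumed) allele frequencies with strong positive LD.
p1, p2, d_prime = 0.30, 0.20, 0.9
d_min, d_max = ld_bounds(p1, p2)
haps = haplotype_freqs(p1, p2, d_prime * d_max)
print("D_min, D_max:", d_min, d_max)
print("haplotype frequencies:", haps)

# Odds ratios implied by the regression coefficients used in the text.
odds_ratios = {beta: math.exp(beta) for beta in (0.4, 0.8, 1.6)}
for beta, oratio in odds_ratios.items():
    print(f"beta = {beta}: OR = {oratio:.2f}, 1/OR = {1 / oratio:.2f}")
```

Any D between D_{min} and D_{max} keeps all four haplotype frequencies non-negative, which is why D′ scales these bounds rather than being applied to D directly.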
More than Two Ethnicities
For more than two ethnicities (k > 2), we specified the baseline risks of the k ethnicities to fall within certain ranges. Specifically, we assumed that baseline disease risk of the ith ethnicity π_{i} ∼ Uniform [0.01, 0.02] or π_{i} ∼ Uniform [0.01, 0.1] for i = 1,…, k. Similarly, we assumed that genotype frequencies were uniformly distributed and were consistent with single-locus Hardy-Weinberg equilibrium. We considered the at-risk genotype frequency within ranges of 5% to 10%, 5% to 20%,…, up to 5% to 95%. Linkage disequilibrium ranged from 90% of the minimum possible value (i.e., D_{min}) to 90% of the maximum possible value (i.e., D_{max}). Under these assumptions, we generated 5,000 sets of variables for a hypothetical cohort. Each set of disease risks and genotype frequencies of the k ethnicities was randomly assigned from the distributions specified above, assuming k = 2, 5, or 10, respectively, and β_{1} = β_{2} = β_{3} = 0 (i.e., OR = 1) or β_{1} = β_{2} = β_{3} = 0.693 (i.e., OR = 2). Next, case-control samples were randomly drawn from the hypothetical cohort under each set of variables to obtain bias estimates as described in the previous paragraph. The range and average of the 5,000 sets of bias estimates are presented in Fig. 4.
Results
Determination of Maximal Possible Bias
We have determined the maximal biases that could result from population stratification (see Appendix 1 for derivations). In the simplest case when the cohort comprises two ethnicities (i.e., k = 2: one “higher risk” ethnicity and one “lower risk” ethnicity), the maximal bias that can be attained under conditions of population stratification is determined by the difference in log odds of disease risks in the two populations (i.e., the log of OR of disease risks in the two ethnicities) being compared. Specifically, the asymptotic bias to the main effect estimates for V_{1} or V_{2} is bounded by −α_{2} and α_{2}, where α_{2} is the difference in log odds of the baseline risk of disease in the higher risk population compared with the lower risk population (see Eqs. A and B and Appendix 1). For the main effect of V_{1}, this maximum bias is reached only under extreme conditions, such as when the joint occurrence of V_{1} = 1 and V_{2} = 0 is never observed in the low-risk ethnicity, and the joint occurrence of V_{1} = 0 and V_{2} = 0 is never observed in the high-risk ethnicity.
Asymptotic biases to estimates of interactions between V_{1} and V_{2} are bounded by −2α_{2} and 2α_{2}. That is, the maximal bias due to population stratification that can be attained for an interaction between two factors is bounded by twice the log OR of the disease risks in the two populations being compared. However, the upper bound 2α_{2} can be reached only when all of the following four conditions hold: (a) the joint occurrence of V_{1} = 1 and V_{2} = 0 is never observed in the high-risk ethnicity; (b) the joint occurrence of V_{1} = 0 and V_{2} = 1 is never observed in the high-risk ethnicity; (c) the joint occurrence of V_{1} = 1 and V_{2} = 1 is never observed in the low-risk ethnicity; (d) the joint occurrence of V_{1} = 0 and V_{2} = 0 is never observed in the low-risk ethnicity. Similarly, the maximal bias in the other direction, −2α_{2}, is reached only when the reverse of the four conditions described above holds (see Appendix 1).
Similarly, when there are k > 2 ethnicities in the cohort, the most extreme biases that could result from population stratification are ±α_{k} and ±2α_{k} to main effects and interaction estimates, respectively, where α_{k} represents the maximum log OR among baseline risks of the k ethnicities (see Eqs. A and B and Appendix 1). To our knowledge, this is the first demonstration of the bounds on the magnitude of biases to interaction estimates for genotype-genotype or genotype-environment interaction studies. However, the boundary conditions that result in the theoretical extremes are unlikely to represent the conditions observed in most studies. Therefore, we next undertook numerical evaluations to consider situations that are more likely to be encountered in actual studies.
Two Ethnicities
We have summarized results of large sample biases to main effects and genotype-genotype or genotype-environment interactions using six sets of conditions that may be encountered in association studies (Table 1). Note that bias can only arise when differences in baseline disease risk among ethnicities exist (1), which is a necessary condition for confounding to occur (17). Condition a represented the situation where no population stratification existed for G_{1} or G_{2}: G_{1} and G_{2} each had the same marginal distribution in each ethnicity, and linkage disequilibrium between G_{1} and G_{2} was the same in each ethnicity (i.e., joint distributions of G_{1} and G_{2} were the same across the ethnicities). Under these conditions, ignoring ethnicity did not result in large sample bias due to population stratification in any of these estimates when β_{1} = β_{2} = β_{3} = 0. When β_{1} = β_{2} = 0 but β_{3} ≠ 0, ignoring ethnicity did not result in large sample bias to β_{1} or β_{2} but resulted in negligible biases toward the null hypothesis to the interaction term β_{3}. When β_{1} ≠ 0 or β_{2} ≠ 0, a slight bias toward the null hypothesis was observed in the main effects, whereas biases to β_{3} were no longer always toward the null. Instead, the bias to β_{3} depended on both the main and interaction effects between the two genes. In all cases, the magnitude of these large sample biases was negligible and reflected the nonlinearity of the logistic regression model (18-20) rather than biases due to population stratification.
Condition b (Fig. 1) represented the situation when the marginal genotype distributions of both genes were the same across both ethnicities, such that the frequency of the at-risk genotype of G_{1} in the high-risk ethnicity was equal to the frequency of the at-risk genotype of G_{1} in the low-risk ethnicity, and the frequency of the at-risk genotype of G_{2} in the high-risk ethnicity was equal to the frequency of the at-risk genotype of G_{2} in the low-risk ethnicity. However, condition b also specified that the joint genotype distributions of G_{1} and G_{2} were different due to unequal linkage disequilibrium between the two genes across ethnicities. In these circumstances, ignoring ethnicity could result in large sample biases in both main effects and interaction estimates. Linkage disequilibrium between G_{1} and G_{2} was D′ × D_{max} in the high-risk ethnicity and D′ × D_{min} in the low-risk ethnicity (Fig. 1). Thus, at-risk genotypes of both genes appeared together more frequently in the high-risk ethnicity than in the low-risk ethnicity. For simplicity, it was assumed that the frequencies of the at-risk genotypes of G_{1} and G_{2} were equal in both ethnicities and that β_{1} = β_{2} = β_{3} = 0. When ethnicity was ignored, large sample biases to interaction and main-effect estimates depended on baseline disease risks, the proportions of the two ethnicities in the cohort, and the frequencies of the at-risk genotypes (Fig. 1). The magnitude of bias increased with greater differences in baseline disease risks and/or linkage disequilibrium between the two genes. Figure 1 suggests that there was no simple pattern of bias corresponding to the variables considered here. However, large biases in interaction estimates tended to occur when large biases in main effects also occurred.
Under condition c, the marginal genotype distribution of G_{1} was constant across ethnicities, and G_{1} and G_{2} were in linkage equilibrium in both ethnicities (D = 0). When β_{1} = 0 and β_{3} = 0, there were no large sample biases to their estimates. As shown in Fig. 2A and B for “No Linkage Disequilibrium in Either Ethnicity,” biases in G_{2} main effect estimates followed the same patterns as biases to a single candidate gene under conditions of population stratification (1, 6). Bias was positive if the at-risk genotype frequency of G_{2} was greater in the high-risk ethnicity (Fig. 2A). Bias was negative if the at-risk genotype frequency of G_{2} was greater in the low-risk ethnicity (Fig. 2B). Condition d differed from condition c only in that linkage disequilibrium was specified to be different across ethnicities. Under condition d, biases to main effects and interactions depended on the ethnicity-specific genotype frequencies of both genes and baseline disease risks as well as the main effects of the genes and their interactions. For example, we considered the situation when baseline disease risks by ethnicity were 10% versus 1%, main effects and interaction were 0, and linkage disequilibrium was D′ × D_{max} in the high-risk ethnicity and D′ × D_{min} in the low-risk ethnicity. Figure 2A presents the results where the frequency of the at-risk genotype of G_{2} varied within the high-risk ethnicity, whereas Fig. 2B presents the results where the frequency of the at-risk genotype of G_{2} varied within the low-risk ethnicity. As shown in these figures for D′ = 0.2 and D′ = 0.5, large sample biases occurred to G_{1} main effects, G_{2} main effects, and interaction effects. These biases became more pronounced with increasing degrees of linkage disequilibrium. However, the biases did not follow a simple monotonic pattern with changing genotype frequencies.
Conditions e and f (Fig. 3) occurred when the marginal genotype distributions of both genes differed across ethnicities. If G_{1} and G_{2} were in linkage equilibrium (condition e), the patterns for the direction and magnitude of biases to main effects were similar to those reported for condition c, but biases to interaction effect estimates did not follow simple patterns corresponding to the marginal genotype frequencies of either gene. If G_{1} and G_{2} were in linkage disequilibrium (condition f), biases to main effects no longer followed simple patterns. For example, in Fig. 3A, baseline disease risks by ethnicity were 10% versus 1%, and both main effects and interactions were assumed to be 0. Bias depended on at-risk genotype frequencies of both genes as well as on the degree of linkage disequilibrium between the two genes. Large biases were observed even when genotype frequencies were not very different across ethnicities. Again, no simple relationship of bias was observed with respect to genotype frequency. Therefore, biases due to population stratification can be large in relatively unpredictable ways when marginal genotype frequencies of both genes differ by ethnicity and the two genes are in linkage disequilibrium.
More than Two Ethnicities
When we expanded our analyses to consider cohorts consisting of k = 5 or 10 ethnicities, we observed that (on average) large sample biases to main and interaction effects were either nonexistent or negligible (Fig. 4A-D), even if the conditions for population stratification (Table 1) were met. This result follows because we assumed that baseline disease risks and the joint genotype distributions of both genes were uncorrelated. Biases to both main effects and interaction were greatest when k = 2 (i.e., when there were only two ethnicities in the cohort) but decreased with increasing number of component ethnicities. Biases to both main effects and interaction were smaller when baseline disease risks of the k ethnicities were all within the range of 1% to 2% and larger when baseline risks of the k ethnicities were all within the range of 1% to 10%. Biases to main effects tended to increase as the differences in genotype frequencies across ethnicities increased. For example, in G_{1} main effects and G_{2} main effects in Fig. 4A-D, the range of biases for main effects increased as the range of at-risk genotype frequencies of both genes across the k ethnicities increased from 5% to 10% up to 5% to 95%. Under the latter more extreme conditions, biases to main effects approached their theoretical bounds.
For example, based on our algebraic derivation of bounds to the biases of the estimates, when baseline risks of k ethnicities were within 1% to 2% (Fig. 4A and B), biases to main effects approached their bounds of −0.7 and 0.7 (on the natural log scale). Although biases to interaction estimates were bounded by −1.4 and 1.4, the actual biases observed were far from reaching these bounds. On the other hand, biases to interaction did not show monotonic relationships with increasing or decreasing ranges of marginal genotype frequencies for either gene. Even when the range of marginal genotype frequencies of both genes was small across ethnicities, biases to interaction estimates could still be very large. The patterns were similar when baseline risks of k ethnicities were within 1% to 10% (Fig. 4C and D), where biases to main effects were bounded by −2.4 and 2.4 and biases to interaction by −4.8 and 4.8. Similar patterns were observed when OR = 1 (Fig. 4A and C) and when OR = 2 (Fig. 4B and D) for both main effects and interactions.
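The numerical bounds quoted above can be reproduced from the baseline risk ranges alone, since the main-effect bound is the log OR between the highest and lowest baseline risks and the interaction bound is twice that value. The following Python sketch is a simple check under that reading; the function name is illustrative.

```python
import math

def log_odds_ratio(risk_high, risk_low):
    """Log OR of disease comparing a higher-risk with a lower-risk group."""
    return (math.log(risk_high / (1 - risk_high))
            - math.log(risk_low / (1 - risk_low)))

bounds = {}
for risk_low, risk_high in ((0.01, 0.02), (0.01, 0.10)):
    alpha = log_odds_ratio(risk_high, risk_low)
    # Main-effect bias is bounded by +/- alpha, interaction bias by +/- 2 * alpha.
    bounds[(risk_low, risk_high)] = (alpha, 2 * alpha)
    print(f"risks {risk_low:.0%} to {risk_high:.0%}: "
          f"main-effect bound {alpha:.2f}, interaction bound {2 * alpha:.2f}")
```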
Discussion
We have evaluated the potential for bias due to population stratification in studies of genotype-genotype or genotype-environment interactions by deriving bounds for the maximal biases that may be conferred by population stratification. We also provided numerical examples for the potential biases in main effects and interactions under a variety of conditions commonly met in association studies. Bias can occur in gene-gene interaction studies if linkage disequilibrium differs across ethnicities and the sample consists of the mixed ethnicities, even when the conditions of population stratification for either gene alone are absent. When the two genes were in linkage equilibrium, both main effects and interactions were unbiased if marginal distributions of genes did not vary by ethnicity. When the two genes were in linkage disequilibrium, biases could be substantial for all effect estimates. Genes may seem to be in linkage equilibrium in the entire cohort, whereas within each ethnicity, they are not. Thus, when multiple genes are jointly studied across ethnicities, pooling ethnicities may induce biases that depend on the joint distributions or correlations between the genes across ethnicities, as well as their marginal distributions.
These arguments can be extended to studies of genotype-environment interactions, where correlation between the gene and environment factors exists. However, it is more likely that differences across ethnicities in the joint distributions of genetic and environmental factors result from different marginal distributions rather than from correlations between the genes and environments of interest. Although severe biases were only observed under extreme stratification conditions (e.g., large differences in linkage disequilibrium, in the marginal frequencies of two genes, and in disease risks across ethnicities), population stratification may result in larger biases in genotype-genotype interaction studies than in studies involving only one gene. Our analytic derivation showed that in most settings of interaction where the joint distributions of two factors differ across the levels of a third factor, and where disease risk also varies across those levels, the omission of that third factor as a covariate will produce biases in the interaction estimates that can be 2-fold as large as biases to the estimates of main effects. To our knowledge, this is the first time such a quantitative evaluation has been reported. We anticipate that our findings on the effect of population stratification in interaction studies will be useful for studies of both gene-gene interactions and gene-environment interactions (21, 22) in different populations.
Additional studies are required to address other aspects of bias due to population stratification in genotype-genotype interaction studies. For example, we have not considered the effect of deviations from Hardy-Weinberg equilibrium. Such deviations could also confer biases to interaction terms involving two or more genes. Nonetheless, the magnitude of biases would be bounded as shown in Appendix 1. Similarly, we have focused on large sample biases to point estimates from logistic regression in genotype-genotype interaction studies. Another future challenge is to assess the effect of population stratification on variance estimation and hypothesis testing. In the case of studies involving single candidate genes, Heiman et al. (8) addressed both issues by evaluating false-positive rates and comparing with confounding risk ratios due to population stratification. However, these evaluations have not been undertaken for genotype-genotype interactions. Marchini et al. (22) considered power issues due to population stratification, which have not been considered here. Finally, we have not evaluated additional situations of potential interest, such as the case of a main effect of a gene in one population but not in another.
When genotype distributions are the same across ethnicities or independent of ethnicity, ignoring ethnicity may still result in attenuation of the OR due to the nonlinearity of the logit link function in logistic regression (18-20). In the context of a single gene having the same distribution in two or more ethnicities, bias is absent if the gene has no effect, even when ethnicities are ignored. Otherwise, attenuation of the estimate toward the null will increase with the magnitude of the gene's effect. The magnitude of the biases would be negligible unless disease risks were also extremely different across ethnicities. We found that when interactions between two genes are considered, the same rules applied to biases in the main effects of the two genes. However, bias may occur for interaction estimates even absent interaction (i.e., β_{3} = 0). In addition, the direction of the bias is not always predictable when interaction is present (i.e., β_{3} ≠ 0). Nonetheless, the magnitude of these biases to main effects or interactions was generally negligible, unless β_{1}, β_{2}, and β_{3} were all large (>1.6), and the relative disease risk between the two ethnicities was >10-fold.
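The attenuation described above for a single gene can be illustrated with a short population-level calculation. The Python sketch below assumes two ethnicities with the same at-risk genotype frequency (so that frequency cancels from the pooled risks) and computes the log OR obtained when ethnicity is ignored. The function names, the 50:50 ethnicity mix, and the 1% versus 10% baseline risks are illustrative assumptions.

```python
import math

def expit(x):
    return 1.0 / (1.0 + math.exp(-x))

def logit(p):
    return math.log(p / (1.0 - p))

def pooled_log_or(beta, risk_low, risk_high, eps):
    """Log OR for one gene when ethnicity is ignored.

    Assumes the genotype frequency is identical in both ethnicities,
    so the pooled risk in each genotype group is a mixture weighted
    only by eps, the proportion of the high-risk ethnicity.
    """
    a_low, a_high = logit(risk_low), logit(risk_high)
    risk_ref = (1 - eps) * expit(a_low) + eps * expit(a_high)
    risk_cmp = (1 - eps) * expit(a_low + beta) + eps * expit(a_high + beta)
    return logit(risk_cmp) - logit(risk_ref)

for beta in (0.0, 0.4, 0.8, 1.6):
    b_star = pooled_log_or(beta, risk_low=0.01, risk_high=0.10, eps=0.5)
    print(f"true beta = {beta:.1f}, pooled beta = {b_star:.3f}, "
          f"bias = {b_star - beta:+.3f}")
```

With no gene effect the pooled estimate is unbiased, and the shortfall relative to the true coefficient grows as the effect grows, which matches the qualitative pattern described in the text.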
We have evaluated biases under relatively extreme conditions of population stratification with respect to disease risk and allele frequencies. The ranges of variables employed here were similar to or more extreme than those in other studies (10, 23). In real situations, biases on genotype-genotype interactions should be smaller when ethnicity strata are more numerous and the range of disease risk is narrower than considered here. Furthermore, our results are consistent with those of Wacholder et al. (1) and Wang et al. (6) as to the smaller potential for bias in the presence of larger numbers of ethnicities. Similar arguments hold for gene-environment interaction studies. For studies of gene-gene interactions, linkage disequilibrium patterns differ by ethnicity (13). Therefore, studies of genotype-genotype interactions should specifically consider potential linkage disequilibrium, baseline disease risks, and genotype frequency differences by ethnicity. This issue takes on special significance in light of suggestions that population-specific linkage disequilibrium might contribute to nonreplication of association study results, including studies of genotype-genotype interactions (24).
The data presented here support the hypothesis that bias due to population stratification can occur in association studies involving genotype-genotype interactions, particularly if the two genes are in strong linkage disequilibrium. However, our results show that the magnitude of potential bias is constrained by the differences in disease risk among populations. Thus, when these disease risk differences among populations are small, population stratification cannot lead to large biases. Furthermore, our empirical results show that population stratification causes relatively small biases even under extreme conditions and is unlikely to cause large biases to estimates of main effects and interactions under usual study conditions, particularly when the correlation (i.e., linkage disequilibrium) among the interacting factors is small. Therefore, if population stratification is not a major concern, studies of interaction involving unlinked genes (e.g., genes in common metabolic pathways that are located on different chromosomes) might be appropriate for case-control association studies, whereas haplotype-based approaches might be more appropriate for genes in linkage disequilibrium.
Appendix 1: Algebraic Analyses of Asymptotic Bounds on Biases Due to Population Stratification
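A. Algebraic Analyses of Biases when k = 2 Ethnicities. The log-likelihood for a cohort sample of N independent observations can be written as
l = Σ_{i=1}^{N} [Y_{i} × log(π_{i}) + (1 − Y_{i}) × log(1 − π_{i})]
where i = 1, 2, …, N, and Y_{i} and π_{i} are the disease outcome and the modeled disease probability of the ith observation. When there are two ethnicities E_{1} and E_{2} in the cohort, they can be referred to as the low-risk ethnicity and high-risk ethnicity, respectively. Assume associations between disease and V_{1} and V_{2} are modeled with logistic regression as
logit(π) = α_{1} + α_{2}E_{2} + β_{1}V_{1} + β_{2}V_{2} + β_{3}V_{1}V_{2}   (Eq. A)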
In the presence of population stratification, we assumed the following misspecified model was fitted:
logit(π^{*}) = α^{*} + β_{1}^{*}V_{1} + β_{2}^{*}V_{2} + β_{3}^{*}V_{1}V_{2}   (Eq. B)
Let f_{V1V2E2} represent the expected fraction of joint occurrence of V_{1} and V_{2} in ethnicity E_{1} (E_{2} = 0) or E_{2} (E_{2} = 1), where the subscripts V_{1}, V_{2}, and E_{2} each take values 0 or 1. For example, f_{000} represents the fraction of observations having joint reference categories of V_{1} and V_{2} in the low-risk ethnicity E_{1} (i.e., V_{1} = 0, V_{2} = 0, E_{2} = 0); likewise, f_{001} represents the fraction of observations having the joint reference categories of V_{1} and V_{2} in the high-risk ethnicity E_{2} (i.e., V_{1} = 0, V_{2} = 0, E_{2} = 1). Correspondingly, the expected values of Y are π_{000} = P(α_{1}) and π_{001} = P(α_{1} + α_{2}) by Eq. A, where P(x) is the logistic function defined by P(x) = exp(x) / [1 + exp(x)]. Let ε and 1 − ε represent the proportions of the high-risk and low-risk ethnicity in the cohort, respectively; then the expected fraction of observations having the joint reference categories in the entire cohort is f_{00·} = (1 − ε) × f_{000} + ε × f_{001}, where the “·” subscript indicates that observations are pooled over the associated index (ethnicity in this case). Let π̂_{00·} be the estimated expected value of Y for these observations under the misspecified model (Eq. B); then π̂_{00·} = P(α̂^{*}), etc. The expected values of the maximum likelihood estimates of the parameters in Eq. B are then found by
P(α^{*}) = (1 − ε) × (f_{000}/f_{00·}) × P(α_{1}) + ε × (f_{001}/f_{00·}) × P(α_{1} + α_{2})
P(α^{*} + β_{1}^{*}) = (1 − ε) × (f_{100}/f_{10·}) × P(α_{1} + β_{1}) + ε × (f_{101}/f_{10·}) × P(α_{1} + α_{2} + β_{1})
P(α^{*} + β_{2}^{*}) = (1 − ε) × (f_{010}/f_{01·}) × P(α_{1} + β_{2}) + ε × (f_{011}/f_{01·}) × P(α_{1} + α_{2} + β_{2})
P(α^{*} + β_{1}^{*} + β_{2}^{*} + β_{3}^{*}) = (1 − ε) × (f_{110}/f_{11·}) × P(α_{1} + β_{1} + β_{2} + β_{3}) + ε × (f_{111}/f_{11·}) × P(α_{1} + α_{2} + β_{1} + β_{2} + β_{3})
Because (1 − ε) × (f_{000}/f_{00·}) + ε × (f_{001}/f_{00·}) = 1 and P(x) is a monotonic function, α_{1} ≤ α^{*} ≤ α_{1} + α_{2}. α^{*} = α_{1} only when f_{001} = 0 (i.e., when no observations in the high-risk ethnicity fall into the joint reference categories of V_{1} and V_{2}); and α^{*} = α_{1} + α_{2} only when f_{000} = 0 (i.e., when no observations in the low-risk ethnicity fall into the joint reference categories of V_{1} and V_{2}).
Similarly, (1 − ε) × (f_{100}/f_{10·}) + ε × (f_{101}/f_{10·}) = 1; therefore, α_{1} + β_{1} ≤ α^{*} + β_{1}^{*} ≤ α_{1} + β_{1} + α_{2}. Because α_{1} ≤ α^{*} ≤ α_{1} + α_{2}, it follows that β_{1} − α_{2} ≤ β_{1}^{*} ≤ β_{1} + α_{2}. Therefore, the biases to main effect estimates of V_{1} are bounded by −α_{2} ≤ β_{1}^{*} − β_{1} ≤ α_{2}. β_{1}^{*} − β_{1} = α_{2} only when f_{100} = 0 and f_{001} = 0 [i.e., when no observations in the low-risk ethnicity fall into the comparison category of V_{1} (V_{1} = 1) and reference category of V_{2} (V_{2} = 0), and no observations in the high-risk ethnicity fall into the joint reference categories of V_{1} and V_{2}]; β_{1}^{*} − β_{1} = −α_{2} only when f_{101} = 0 and f_{000} = 0 (i.e., when no observations in the high-risk ethnicity fall into the comparison category of V_{1} and reference category of V_{2}, and no observations in the low-risk ethnicity fall into the joint reference categories of V_{1} and V_{2}).
In the same fashion, the biases to main effect estimates of V_{2} are bounded by −α_{2} ≤ β_{2}^{*} − β_{2} ≤ α_{2}. β_{2}^{*} − β_{2} = α_{2} only when f_{010} = 0 and f_{001} = 0; β_{2}^{*} − β_{2} = −α_{2} only when f_{011} = 0 and f_{000} = 0.
Next, using the above derivations, β_{3} − 2α_{2} ≤ β_{3}^{*} ≤ β_{3} + 2α_{2}; thus, the biases to estimates of interaction between V_{1} and V_{2} are bounded by −2α_{2} ≤ β_{3}^{*} − β_{3} ≤ 2α_{2}. However, β_{3}^{*} − β_{3} = −2α_{2} only when f_{100} = f_{010} = f_{001} = f_{111} = 0; that is, all observations in the low-risk ethnicity fall into either both reference categories or both comparison categories of V_{1} and V_{2}, and no observations in the high-risk ethnicity fall into either both reference categories or both comparison categories of V_{1} and V_{2}. Likewise, β_{3}^{*} − β_{3} = 2α_{2} only when f_{110} = f_{000} = f_{011} = f_{101} = 0.
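The factor of 2 in the interaction bound can be made explicit by tracking how far the pooled logit in each joint category is shifted from its value in the low-risk ethnicity. The following is a sketch of that step; the symbols L^{*}_{ij} (pooled logit in the category V_{1} = i, V_{2} = j) and δ_{ij} (its shift) are introduced here for illustration and are not part of the original notation.

```latex
\begin{aligned}
\delta_{ij} &= L^{*}_{ij} - (\alpha_1 + \beta_1 i + \beta_2 j + \beta_3 ij),
  \qquad 0 \le \delta_{ij} \le \alpha_2, \quad i, j \in \{0, 1\},\\
\beta_3^{*} &= L^{*}_{11} - L^{*}_{10} - L^{*}_{01} + L^{*}_{00},\\
\beta_3^{*} - \beta_3 &= \delta_{11} - \delta_{10} - \delta_{01} + \delta_{00},\\
-2\alpha_2 &\le \beta_3^{*} - \beta_3 \le 2\alpha_2 .
\end{aligned}
```

The upper bound requires δ_{11} = δ_{00} = α_{2} and δ_{10} = δ_{01} = 0, which corresponds to f_{110} = f_{000} = f_{011} = f_{101} = 0; the lower bound requires the reverse pattern, f_{100} = f_{010} = f_{001} = f_{111} = 0.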
In words, the upper bound on the bias in the main effect of V_{1} is reached only when the joint occurrence of V_{1} = 1 and V_{2} = 0 is never observed in the low-risk ethnicity, and the joint occurrence of V_{1} = 0 and V_{2} = 0 is never observed in the high-risk ethnicity. Similarly, the lower bound on this bias is reached only when the joint occurrence of V_{1} = 1 and V_{2} = 0 is never observed in the high-risk ethnicity, and the joint occurrence of V_{1} = 0 and V_{2} = 0 is never observed in the low-risk ethnicity.
The maximal bias for interactions, 2α_{2}, can be reached only when all of the following four conditions hold: (a) the joint occurrence of V_{1} = 1 and V_{2} = 0 is never observed in the high-risk ethnicity; (b) the joint occurrence of V_{1} = 0 and V_{2} = 1 is never observed in the high-risk ethnicity; (c) the joint occurrence of V_{1} = 1 and V_{2} = 1 is never observed in the low-risk ethnicity; (d) the joint occurrence of V_{1} = 0 and V_{2} = 0 is never observed in the low-risk ethnicity. Similarly, the maximal bias in the other direction, −2α_{2}, can be reached only when all of the following four conditions hold: (a) the joint occurrence of V_{1} = 1 and V_{2} = 0 is never observed in the low-risk ethnicity; (b) the joint occurrence of V_{1} = 0 and V_{2} = 1 is never observed in the low-risk ethnicity; (c) the joint occurrence of V_{1} = 1 and V_{2} = 1 is never observed in the high-risk ethnicity; (d) the joint occurrence of V_{1} = 0 and V_{2} = 0 is never observed in the high-risk ethnicity.
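These conditions can be illustrated numerically for two ethnicities. The sketch below assumes that the misspecified model is saturated in V_{1} and V_{2}, so that its expected fitted risk in each joint category equals the pooled risk over ethnicities. The function name, the dictionary layout, and all numerical values are hypothetical choices made for illustration.

```python
import math

def logistic(x):
    return 1.0 / (1.0 + math.exp(-x))

def logit(p):
    return math.log(p / (1.0 - p))

def coefficient_biases(alpha1, alpha2, betas, eps, f_low, f_high):
    """Return (b1* - b1, b2* - b2, b3* - b3) when ethnicity is ignored.

    f_low[(v1, v2)] and f_high[(v1, v2)] are the fractions of the low- and
    high-risk ethnicity in each joint category; eps is the cohort proportion
    of the high-risk ethnicity.
    """
    b1, b2, b3 = betas
    pooled = {}
    for v1 in (0, 1):
        for v2 in (0, 1):
            eta = alpha1 + b1 * v1 + b2 * v2 + b3 * v1 * v2
            w_low = (1.0 - eps) * f_low[(v1, v2)]
            w_high = eps * f_high[(v1, v2)]
            risk = (w_low * logistic(eta)
                    + w_high * logistic(eta + alpha2)) / (w_low + w_high)
            pooled[(v1, v2)] = logit(risk)
    b1_star = pooled[(1, 0)] - pooled[(0, 0)]
    b2_star = pooled[(0, 1)] - pooled[(0, 0)]
    b3_star = pooled[(1, 1)] - pooled[(1, 0)] - pooled[(0, 1)] + pooled[(0, 0)]
    return b1_star - b1, b2_star - b2, b3_star - b3

alpha1, alpha2 = logit(0.05), math.log(2.0)
betas = (math.log(1.5), math.log(1.2), math.log(1.3))

# Moderate stratification: every joint category occurs in both ethnicities.
f_low = {(0, 0): 0.4, (1, 0): 0.3, (0, 1): 0.2, (1, 1): 0.1}
f_high = {(0, 0): 0.1, (1, 0): 0.2, (0, 1): 0.3, (1, 1): 0.4}
moderate = coefficient_biases(alpha1, alpha2, betas, 0.3, f_low, f_high)

# Extreme case: conditions (a) to (d) for the upper bound all hold.
f_low_ext = {(0, 0): 0.0, (1, 0): 0.5, (0, 1): 0.5, (1, 1): 0.0}
f_high_ext = {(0, 0): 0.5, (1, 0): 0.0, (0, 1): 0.0, (1, 1): 0.5}
extreme = coefficient_biases(alpha1, alpha2, betas, 0.3, f_low_ext, f_high_ext)

print(moderate, extreme, 2.0 * alpha2)
```

In the extreme configuration the interaction bias equals 2α_{2} up to floating-point error, whereas the moderate configuration stays strictly inside the bounds.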
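B. Algebraic Analyses of Biases when k > 2 Ethnicities. Assume k ethnicities E_{1}, E_{2},…, E_{k} comprise a cohort, with expected fractions ε_{1}, ε_{2},…, ε_{k}, respectively, where ε_{1} + ε_{2} + … + ε_{k} = 1. Assume the underlying disease associations can be fit with a logistic regression written as logit(π) = α_{1} + β_{1} × V_{1} + β_{2} × V_{2} + β_{3} × V_{1} × V_{2} + α_{2} × E_{2} + α_{3} × E_{3} + … + α_{k} × E_{k} (A). Maximum likelihood estimates for the misspecified model that ignores all k ethnicities E_{1}, E_{2},…, E_{k} are again found by equating the fitted value in each joint category of V_{1} and V_{2} to the corresponding pooled disease risk; for the joint reference categories, P(α^{*}) = [ε_{1} × f_{001} × P(α_{1}) + ε_{2} × f_{002} × P(α_{1} + α_{2}) + … + ε_{k} × f_{00k} × P(α_{1} + α_{k})] / f_{00·}.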
As an example of the notation, the expected fraction of observations in the joint reference category in the entire cohort is f_{00·} = ε_{1} × f_{001} + ε_{2} × f_{002} + … + ε_{k} × f_{00k}, where the third subscript now indexes ethnicities 1 through k and the “·” subscript indicates that observations are pooled over the associated index (ethnicity, in this case).
Without loss of generality, assume baseline risks π_{001} < π_{002} < … < π_{00k}, so that 0 < α_{2} < α_{3} < … < α_{k}. Then the intercept term satisfies α_{1} ≤ α^{*} ≤ α_{1} + α_{k}. α^{*} = α_{1} only when f_{002} = f_{003} = … = f_{00k} = 0 (i.e., ethnicities 2 to k have no observations in the joint reference category of V_{1} = 0 and V_{2} = 0), and α^{*} = α_{1} + α_{k} only when f_{001} = f_{002} = … = f_{00(k−1)} = 0 [i.e., ethnicities 1 to (k − 1) have no observations in the joint reference category of V_{1} = 0 and V_{2} = 0]. The V_{1} effect estimates satisfy β_{1} − α_{k} ≤ β_{1}^{*} ≤ β_{1} + α_{k}. β_{1}^{*} = β_{1} + α_{k} only when f_{002} = f_{003} = … = f_{00k} = 0 and f_{101} = f_{102} = … = f_{10(k−1)} = 0 [i.e., ethnicities 2 to k have no observations in the joint category of V_{1} = 0 and V_{2} = 0, and ethnicities 1 to (k − 1) have no observations in the joint category of V_{1} = 1 and V_{2} = 0]. In addition, β_{1}^{*} = β_{1} − α_{k} only when f_{001} = f_{002} = … = f_{00(k−1)} = 0 and f_{102} = f_{103} = … = f_{10k} = 0 [i.e., ethnicities 2 to k have no observations in the joint category of V_{1} = 1 and V_{2} = 0, and ethnicities 1 to (k − 1) have no observations in the joint category of V_{1} = 0 and V_{2} = 0].
There will be no bias on V_{1} effect estimates (i.e., β_{1}^{*} = β_{1}) if f_{001} = f_{002} = … = f_{00(k−1)} = 0 and f_{101} = f_{102} = … = f_{10(k−1)} = 0, or if f_{002} = f_{003} = … = f_{00k} = 0 and f_{102} = f_{103} = … = f_{10k} = 0; in either case, the two joint categories that determine β_{1}^{*} contain observations from a single ethnicity only. Similarly, β_{2} − α_{k} ≤ β_{2}^{*} ≤ β_{2} + α_{k}. β_{2}^{*} = β_{2} + α_{k} only when f_{002} = f_{003} = … = f_{00k} = 0 and f_{011} = f_{012} = … = f_{01(k−1)} = 0; and β_{2}^{*} = β_{2} − α_{k} only when f_{001} = f_{002} = … = f_{00(k−1)} = 0 and f_{012} = f_{013} = … = f_{01k} = 0. In addition, β_{3} − 2α_{k} ≤ β_{3}^{*} ≤ β_{3} + 2α_{k}, and β_{3}^{*} = β_{3} + 2α_{k} only when f_{00i} = f_{01i} = f_{10i} = f_{11i} = 0 for i = 2,…,(k − 1) [i.e., ethnicities 2 to (k − 1) contribute no observations to any joint category].
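The k-ethnicity bounds can be explored with the same pooling argument. The following sketch assumes, as before, that the expected fitted risk of the misspecified model in each joint category equals the pooled risk over ethnicities. The function name, the list-based layout (with a shift of 0 for the reference ethnicity E_{1}), and the three-ethnicity example are hypothetical.

```python
import math

def logistic(x):
    return 1.0 / (1.0 + math.exp(-x))

def logit(p):
    return math.log(p / (1.0 - p))

def pooled_coefficients(alpha1, shifts, betas, eps, f):
    """Expected (alpha*, b1*, b2*, b3*) when all k ethnicities are ignored.

    shifts = [0, alpha_2, ..., alpha_k] are the log-odds shifts per ethnicity,
    eps the cohort fractions (summing to 1), and f[i][(v1, v2)] the fraction
    of ethnicity i in each joint category of V1 and V2.
    """
    b1, b2, b3 = betas
    cell = {}
    for v1 in (0, 1):
        for v2 in (0, 1):
            eta = alpha1 + b1 * v1 + b2 * v2 + b3 * v1 * v2
            weights = [e * fi[(v1, v2)] for e, fi in zip(eps, f)]
            risk = sum(w * logistic(eta + a)
                       for w, a in zip(weights, shifts)) / sum(weights)
            cell[(v1, v2)] = logit(risk)
    a_star = cell[(0, 0)]
    b1_star = cell[(1, 0)] - cell[(0, 0)]
    b2_star = cell[(0, 1)] - cell[(0, 0)]
    b3_star = cell[(1, 1)] - cell[(1, 0)] - cell[(0, 1)] + cell[(0, 0)]
    return a_star, b1_star, b2_star, b3_star

alpha1 = logit(0.05)
shifts = [0.0, math.log(1.5), math.log(3.0)]   # alpha_k = log(3)
betas = (math.log(1.5), math.log(1.2), math.log(1.3))
eps = [0.5, 0.3, 0.2]

# Only ethnicity 3 has observations with V2 = 0, so the categories
# (V1 = 0, V2 = 0) and (V1 = 1, V2 = 0) each contain a single ethnicity.
only_v2 = {(0, 0): 0.0, (1, 0): 0.0, (0, 1): 0.5, (1, 1): 0.5}
uniform = {(0, 0): 0.25, (1, 0): 0.25, (0, 1): 0.25, (1, 1): 0.25}
a_star, b1_star, b2_star, b3_star = pooled_coefficients(
    alpha1, shifts, betas, eps, [only_v2, only_v2, uniform])

print(b1_star - betas[0], b2_star - betas[1], b3_star - betas[2])
```

In this configuration the V_{1} main effect is unbiased, because both categories that determine it contain only ethnicity k, while the biases in β_{2}^{*} and β_{3}^{*} remain within α_{k} and 2α_{k}, respectively.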
Acknowledgments
We thank Warren Ewens, Peter Kanetsky, Caryn Lerman, Richard Spielman, and Thomas Ten Have for their helpful comments during the development and execution of this research.
Footnotes

Grant support: USPHS grants R21ES11658, P50CA105641, and R01CA85074 (T.R. Rebbeck).

The costs of publication of this article were defrayed in part by the payment of page charges. This article must therefore be hereby marked advertisement in accordance with 18 U.S.C. Section 1734 solely to indicate this fact.
 Accepted November 9, 2005.
 Received April 28, 2005.
 Revision received October 18, 2005.