Abstract
Case-control studies of genetic factors are prone to a special form of confounding called population stratification, in which the existence of one or more subpopulations may lead to a false association, be it positive or negative. We quantify both the bias (in terms of confounding risk ratio) and the probability of false association (type I error) in the most unfavorable situation in which only one high-risk subpopulation is hidden within the studied population, considering different scenarios of population structuring and varying sample sizes. In accord with previous work, we find that the bias is likely to be small in most cases. In addition, we show that the same applies to the associated type I error whenever the subpopulation is small in proportion. For instance, when the hidden subpopulation makes up 5% of the entire population, with an allele frequency of 0.25 (versus 0.10 in the rest of the population) and a disease rate twice as high, the estimated bias is 1.07 and the type I error associated with a sample of 500 cases and 500 controls is 8% (instead of 5%). We also show that the type I error is substantially greater for a rare allele (frequency of 0.1) than for a common allele (frequency of 0.5) and analyze how vulnerability to stratification bias increases with sample size. Based on our findings, we conclude that with moderate sample sizes the type I error associated with population stratification remains very limited in most realistic scenarios.
Introduction
Epidemiologic studies of genetic factors and disease are sensitive to population stratification into racial or ethnic groups, as spurious associations may arise whenever one or more subpopulations carry both a different prevalence of an allele and a different risk for the disease under study. Concerns related to stratification biases have raised doubts about the credibility of reported findings and led some authors to advise routine use of related controls in case-control studies of genetic factors (1, 2). Yet, the potential impact of this type of bias on the outcome of association studies has not been thoroughly investigated and a “systematic program of research to understand the magnitude of the problem in general” has been called for (3). A recent study has provided evidence that, assuming the epidemiologic principles of study design, conduct, and analysis are rigorously applied, the stratification bias is unlikely to be substantial in case-control studies of cancer among non-Hispanic Caucasians of European origin in the United States, and called for further work on other populations (4, 5). Those conclusions have been extended to African Americans and Whites in the United States (6).
Both studies focused on the bias in estimation, as measured by the confounding risk ratio (CRR), which is the ratio of the crude to the ethnic group-adjusted relative risk for the effect of genotype on disease. The CRR reflects the magnitude of the bias attached to stratification across a wide range of genotype frequencies and disease rates, and the above-mentioned studies have shown that the bias from population stratification is likely to be small except under extreme conditions. Yet, given that critics who have argued that stratification undermines the credibility of epidemiologic studies of genetic factors are especially concerned with spurious associations, a more meaningful measure of the impact of the bias would be the probability of false-positive findings (i.e., type I error). For a given magnitude of the bias, the type I error depends partly on the prevalence of the genotype of interest in the subpopulations and the size of the sample, and the question is: Even if there is little bias, can the probability of false-positive findings be appreciable?
In this article, we investigate the probability of a spurious association between a genetic factor and a disease in relation to the characteristics of the stratification present in a large population. Based on an additive model of inheritance for the disease, estimates of the CRR and the associated type I error are provided for the most unfavorable situation in which only one high-risk subpopulation is hidden within the studied population, considering different scenarios in terms of population variables (frequency of genotype of interest, disease rate ratio, and relative size of hidden subpopulation) and varying sample sizes. On the basis of our findings, we discuss the likelihood of type I error in current epidemiologic studies of genetic factors.
Materials and Methods
Let us consider a case-control study with N_{a} cases and N_{u} controls sampled from a population that is divided into two subpopulations, Pop_{1} and Pop_{2}. Let f be the proportion of Pop_{1}, and let us assume that Pop_{1} is the subpopulation having the highest risk of disease. Let us consider a biallelic gene and let p_{1} and p_{2} be the respective frequencies of the variant allele in the two subpopulations.
We want to derive the expected genotype distribution for the marker gene in cases and controls, under the null hypothesis that this marker has no influence on the risk of disease. Provided the disease is rare enough, we may assume that the genotype distribution in controls is roughly the same as that found in the population as a whole. We can write:
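As a sketch of what this distribution looks like, assume Hardy-Weinberg proportions within each subpopulation and take A as the variant allele (both are assumptions made here for illustration; the text above states only the rare-disease approximation). The control distribution is then the f-weighted mixture of the two subpopulation distributions:

```latex
\begin{align*}
P(AA \mid \text{control}) &= f\,p_1^{2} + (1-f)\,p_2^{2} \\
P(Aa \mid \text{control}) &= 2f\,p_1(1-p_1) + 2(1-f)\,p_2(1-p_2) \\
P(aa \mid \text{control}) &= f\,(1-p_1)^{2} + (1-f)\,(1-p_2)^{2}
\end{align*}
```

The three probabilities sum to 1 for any f, p_{1}, and p_{2} between 0 and 1.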
Let K be the ratio of the disease rate in Pop_{1} to the disease rate in Pop_{2}. The expected genotype distribution in cases may be written as:
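One way to sketch this distribution: because the disease rate in Pop_{1} is K times that in Pop_{2}, the expected fraction of cases drawn from Pop_{1} is fK/(fK + 1 − f) rather than f. Assuming Hardy-Weinberg proportions within each subpopulation and taking A as the variant allele (assumptions made here for illustration, with f_{a} a symbol introduced only for this sketch), the case distribution is the corresponding mixture:

```latex
\begin{align*}
f_a &= \frac{fK}{fK + (1-f)} \\
P(AA \mid \text{case}) &= f_a\,p_1^{2} + (1-f_a)\,p_2^{2} \\
P(Aa \mid \text{case}) &= 2f_a\,p_1(1-p_1) + 2(1-f_a)\,p_2(1-p_2) \\
P(aa \mid \text{case}) &= f_a\,(1-p_1)^{2} + (1-f_a)\,(1-p_2)^{2}
\end{align*}
```

When K = 1, f_{a} = f and the case and control distributions coincide, so no spurious association is expected.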
To test the effect of the genotype on disease risk, we used a logistic regression analysis in which the genotype was incorporated as a quantitative variable coded 0 (aa), 1 (Aa), or 2 (AA) (i.e., assuming an additive model of inheritance). Under the hypothesis of no relation between genotype and disease risk, the CRR reduces to the crude relative risk for the effect of genotype on disease. The type I error of the likelihood ratio test of association between the marker and the disease was evaluated by using a noncentral χ^{2} distribution with 1 df and a noncentrality parameter λ (7) that depends on the values of the population and sample size parameters:
This procedure provides an asymptotic approximation of the type I error that was found to be very accurate by simulation (data not shown).
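The calculation can be sketched in Python as follows. This is not the expression for λ used above: the sketch substitutes the noncentrality of a score (trend) test on the 0/1/2 genotype code, which is asymptotically equivalent to the likelihood ratio test for small effects, and it assumes Hardy-Weinberg proportions within each subpopulation. The function names are hypothetical. For 1 df, the noncentral χ^{2} tail probability has an exact expression in terms of the standard normal distribution, so only the standard library is needed.

```python
import math


def genotype_probs(p):
    """Hardy-Weinberg probabilities of (aa, Aa, AA) for variant-allele frequency p."""
    q = 1.0 - p
    return (q * q, 2.0 * p * q, p * p)


def mixture(weight, p1, p2):
    """Genotype distribution when a fraction `weight` of subjects comes from Pop1."""
    g1, g2 = genotype_probs(p1), genotype_probs(p2)
    return tuple(weight * a + (1.0 - weight) * b for a, b in zip(g1, g2))


def normal_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def type1_error(f, p1, p2, K, n_cases, n_controls, z_crit=1.959964):
    """Approximate probability of rejecting the null at the nominal 5% level.

    Assumption: lambda is the trend-test noncentrality, i.e. the squared
    difference in mean genotype score between cases and controls divided by
    its variance under the pooled genotype distribution.
    """
    controls = mixture(f, p1, p2)
    cases = mixture(f * K / (f * K + 1.0 - f), p1, p2)
    n = n_cases + n_controls
    pooled = [(n_cases * c + n_controls * u) / n for c, u in zip(cases, controls)]

    mean_cases = sum(x * c for x, c in enumerate(cases))
    mean_controls = sum(x * u for x, u in enumerate(controls))
    pooled_mean = sum(x * g for x, g in enumerate(pooled))
    pooled_var = sum(x * x * g for x, g in enumerate(pooled)) - pooled_mean ** 2

    lam = (mean_cases - mean_controls) ** 2 / (
        pooled_var * (1.0 / n_cases + 1.0 / n_controls)
    )
    # A noncentral chi-square with 1 df is (Z + sqrt(lam))^2 with Z standard normal.
    root = math.sqrt(lam)
    return normal_cdf(-z_crit - root) + 1.0 - normal_cdf(z_crit - root)


if __name__ == "__main__":
    # Hidden subpopulation of 5%, allele frequency 0.25 versus 0.10, rate ratio 2.
    print(round(type1_error(0.05, 0.25, 0.10, 2.0, 500, 500), 3))
```

With the scenario in the `__main__` block, the sketch returns a value close to the 8% reported for 500 cases and 500 controls, and it returns the nominal 5% when K = 1.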
Different scenarios were considered: disease rate ratio of 2 as opposed to 10; uncommon marker allele (frequency in larger subpopulation = 0.1) as opposed to common marker allele (frequency = 0.5); and proportion of the hidden subpopulation of 1%, 5%, 10%, or 50%. For each scenario, we estimated both the CRR and the type I error. To compute type I errors, a sample size of 500 cases and 500 controls was considered but the impact of the sample size was also studied by varying it from 100 to 1,000.
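For a given scenario, the CRR under the null can be approximated as the per-allele odds ratio that an additive logistic model would return if it were fitted to the expected genotype distributions of cases and controls. The sketch below does this with a few Newton-Raphson steps. It assumes Hardy-Weinberg proportions within each subpopulation and equal numbers of cases and controls; the function names are hypothetical, and the code is an illustration rather than the Stata procedure used for the reported results.

```python
import math


def hwe(p):
    """Hardy-Weinberg probabilities of (aa, Aa, AA) for variant-allele frequency p."""
    q = 1.0 - p
    return (q * q, 2.0 * p * q, p * p)


def mix(weight, p1, p2):
    """Genotype distribution when a fraction `weight` of subjects comes from Pop1."""
    return tuple(weight * a + (1.0 - weight) * b for a, b in zip(hwe(p1), hwe(p2)))


def per_allele_odds_ratio(f, p1, p2, K, max_iter=50):
    """Large-sample per-allele odds ratio from a logistic model logit = a + b*x.

    x is the genotype code 0 (aa), 1 (Aa), 2 (AA). Cases and controls are
    given equal weight (assumption: N_a = N_u).
    """
    controls = mix(f, p1, p2)
    cases = mix(f * K / (f * K + 1.0 - f), p1, p2)
    a, b = 0.0, 0.0
    for _ in range(max_iter):
        grad_a = grad_b = h_aa = h_ab = h_bb = 0.0
        for x in (0, 1, 2):
            weight = cases[x] + controls[x]
            prob = 1.0 / (1.0 + math.exp(-(a + b * x)))
            resid = cases[x] - weight * prob      # score contribution
            grad_a += resid
            grad_b += resid * x
            v = weight * prob * (1.0 - prob)      # information contribution
            h_aa += v
            h_ab += v * x
            h_bb += v * x * x
        det = h_aa * h_bb - h_ab * h_ab
        step_a = (h_bb * grad_a - h_ab * grad_b) / det
        step_b = (h_aa * grad_b - h_ab * grad_a) / det
        a += step_a
        b += step_b
        if abs(step_a) + abs(step_b) < 1e-12:
            break
    return math.exp(b)


if __name__ == "__main__":
    for f in (0.01, 0.05, 0.10, 0.50):
        print(f, round(per_allele_odds_ratio(f, 0.25, 0.10, 2.0), 3))
```

The sketch requires that both alleles be present in the population (otherwise the information matrix is singular), and it returns exactly 1 when K = 1 or when p_{1} = p_{2}.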
All computations were done using the Stata software (version 7).
Results
Based on a disease rate ratio of 2 and a sample of 500 cases and 500 controls, we have investigated the influence on the CRR and the type I error of (a) the allele frequency difference between the two populations, (b) the proportion of the higher-risk subpopulation, and (c) the allele frequency in the lower-risk subpopulation (rare as opposed to common; Fig. 1).
When the study allele is less frequent in the subpopulation with the highest prevalence of the disease (called the “higher-risk” subpopulation) than in the other one (left-hand side of the graphs), the CRR takes values lower than 1, as if the genotype of interest had a protective role with respect to the disease (false-negative association). Conversely, when the allele is more frequent in the higher-risk subpopulation than in the other one, the CRR is higher than 1, potentially reflecting a false-positive association. The pattern of variation of the CRR within each of the four curves indicates that (a) the further the allele frequency in the higher-risk subpopulation moves away from that in the other one, the greater the departure of the CRR value from 1 (i.e., the greater the bias), and (b) the greater the proportion of the higher-risk subpopulation, the greater the bias. Therefore, the maximum bias is obtained here with a 50% subpopulation, and in that case, for rare alleles, the CRR goes up to 1.4 when the allele frequency difference is 0.9 (0.1 versus 1) and, for common alleles, it varies from 0.7 to 1.4 when the allele frequency difference shifts from −0.5 to 0.5. Let us consider the more realistic scenarios in which the subpopulation makes up only a small part of the study population (1%, 5%, or 10%): focusing on the bias and type I error estimates corresponding to an allele frequency difference of ≤0.2 (see Table 1 for detailed figures), we find that in that case the bias is contained within much more reasonable limits (i.e., at most 1.15 for rare alleles and 1.07 for common alleles).
Looking at the relation between the CRR and the type I error, we find, as expected, that the larger the CRR value, the greater the type I error, and for the extreme scenarios providing the maximum CRR values, the type I error is 100% (i.e., one is certain to conclude falsely that there is an association between the study allele and the disease). Again, in the more realistic scenarios defined above, the estimated type I errors remain within an acceptable range for samples of 500 cases and 500 controls (Table 1): for instance, for rare alleles, the type I error reaches at most 19% (with a CRR of 1.15), when the allele frequency difference is 0.2 and the disease rate ratio is as high as 2. For common alleles, that ceiling type I error is 11% (CRR of 1.07).
The same patterns of variation are found when the disease rate in the higher-risk subpopulation is 10-fold that in the other one (data not shown), but in that case both the magnitude of the bias and the type I error are considerably larger. Under the most extreme scenario (50% subpopulation and maximal allele frequency difference), the bias is 4 for common alleles and 3.3 for rare ones, whereas the type I error with a sample of 500 cases and 500 controls is 100%. In terms of type I error, the advantage of common alleles over rare ones is much less evident in that case, except for the smallest subpopulations (f = 1%). For f = 5% or 10%, false associations are almost certain to be found for rare alleles, starting from an allele frequency difference of 0.2.
The influence of sample size on type I error was examined based on the scenario with a disease rate ratio of 2 and an allele frequency of 0.3 in the higher-risk subpopulation as opposed to 0.1 in the other one, corresponding to a CRR of 1.02, 1.09, 1.15, or 1.20 for values of f of 1%, 5%, 10%, or 50%, respectively (Fig. 2). Clearly, when the hidden higher-risk subpopulation is a very small portion of the larger one (1%), the probability of false association remains close to the expected 5% for sample sizes ranging up to 1,000 cases and 1,000 controls. When the proportion of the higher-risk subpopulation is larger, the type I error increases with sample size, and the greater that proportion, the faster the increase. For a subpopulation amounting to 10% of the total, the type I error reaches 10% for a sample size of 200, whereas it takes a sample size of 550 to reach that level when the proportion is 5%.
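The shape of these curves follows from the 1-df noncentral χ^{2} approximation. As a sketch, let λ_{1} (a symbol introduced here for illustration) denote the noncentrality contributed by one case-control pair, and assume the usual large-sample behavior in which the noncentrality grows in proportion to the number N of cases and of controls. Because a noncentral χ^{2} variate with 1 df is the square of a shifted standard normal variate, the type I error at the nominal 5% level is:

```latex
\begin{align*}
\lambda(N) &= N\,\lambda_1 \\
\alpha(N) &= \Phi\!\left(-z_{0.975} - \sqrt{\lambda(N)}\right)
           + 1 - \Phi\!\left(z_{0.975} - \sqrt{\lambda(N)}\right),
\qquad z_{0.975} \approx 1.96
\end{align*}
```

With λ_{1} = 0 the expression returns 0.05 for every N; with any λ_{1} > 0 it increases monotonically toward 1 as N grows, and a larger λ_{1} (larger f, at fixed allele frequencies and rate ratio, up to the proportions considered here) makes the increase faster.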
Discussion
Understanding the magnitude of the bias and the role of the factors underlying population stratification is helpful for properly interpreting the findings of case-control studies of genetic factors based on a thorough knowledge of the characteristics of the underlying population structure. Recently, Wacholder et al. (4) have quantified the bias in estimation attached to population stratification based on the CRR. Given that the main concern in epidemiologic studies of genetic factors is the reliability of positive findings, we have extended the perspective by examining simultaneously the CRR and the probability of false-positive findings (type I error) attached to different population structure scenarios, knowing that, for a given amount of bias, the type I error depends on the frequency of the at-risk genotype and on the sample size.
Clearly, the 50% subpopulation case is the one that provides the maximum bias and type I error, but whether this type of population structuring is likely to be found is open to question. Furthermore, large allele frequency differences between subpopulations are likely to occur mainly between distinguishable ethnic groups and therefore could be handled easily by matching, adjustment, or other standard methods. More subtle differences in admixed populations may be trickier, as they cannot be dealt with analytically. Based on the International Project on Genetic Susceptibility to Environmental Carcinogens database, Garte et al. (8, 9) have found that there were major and significant differences in the frequency of the more commonly studied metabolic genes among Caucasians, Asians, Africans, and African Americans and some, but much smaller, differences within Caucasian populations from different countries. One example of the latter is the shift of the frequency of GSTT1*0 from ∼0.13 in northern Europeans to ∼0.17 in Caucasians from the rest of Europe (i.e., a 0.04 difference between subpopulations from different origins within Europe). With a disease prevalence in the higher-risk subpopulation as high as double that in the other subpopulation (rate ratio = 2), we show that, for the 1%, 5%, and 10% scenarios in which the allele frequency differences do not exceed 0.2, both bias and type I errors are contained within reasonable limits.
Our findings are therefore in accord with those of Ardlie et al. (10), who suggest that “carefully matched, moderate-sized case-control samples in cosmopolitan U.S. and European populations are unlikely to contain levels of structure that would result in significantly inflated numbers of false-positive associations.” In addition, we present original findings that clarify the separate impact of the different components of population structure and sample size: (a) the larger the difference between the study population and the hidden subpopulation in terms of either allele frequencies or disease prevalence, the larger the type I error; (b) the larger the relative size of the hidden subpopulation within the study population, the larger the type I error; and (c) the type I error is larger for uncommon alleles than for common ones: for instance, a relatively common allele like GSTM1*0 (frequency around 50%) would be much less exposed to stratification bias than a rare one like CYP1A1*2A (frequency around 5% to 6%).
We also illustrate the pattern of variation of type I error according to sample size, for different subpopulation scenarios, and underline the greater vulnerability of large study samples to stratification bias, in agreement with the findings of Pritchard and Donnelly (11), who argue that the extent of bias caused by population structure increases as the sample gets larger due to greater power to detect both real associations and spurious ones. Yet, even with large samples, it must be stressed that the type I error only reaches appreciable levels under quite unrealistic population scenarios, given that, as discussed above, the variability to be considered in terms of disease rates and allele frequencies is the variability within ethnic groups, after accounting for known risk factors. Furthermore, we have deliberately limited our study to a scenario comprising two subpopulations, which is the situation most exposed to stratification bias. Wacholder et al. (4) have shown that ethnic diversity reduces the bias from population stratification: the potential for bias is greatest with two or three ethnicities and tends to decline as the number goes up, with single biases canceling each other out.
In conclusion, this study sheds light on the patterns of variation of type I error in relation to the different components of population structuring and shows that both bias and type I error resulting from population stratification are likely to be limited in methodologically sound case-control studies of moderate size, except in quite unrealistic scenarios. Whenever a statistical association is likely to result from stratification, other approaches should be used to confirm the association.
Acknowledgments
We thank Dr. Catherine Bonaïti for very helpful comments on a first draft of the article.
Footnotes

The costs of publication of this article were defrayed in part by the payment of page charges. This article must therefore be hereby marked advertisement in accordance with 18 U.S.C. Section 1734 solely to indicate this fact.
Received December 29, 2003.
Revision received April 30, 2004.
Accepted May 6, 2004.