
Short Communication |
1 Institut National d'Etudes Démographiques, Paris, France and 2 Institut National de la Santé et de la Recherche Médicale U535, Villejuif, France
Requests for reprints: Myriam Khlat, Institut National d'Etudes Démographiques, 133 Boulevard Davout, 75980 Paris Cedex 20, France. Phone: 33-1-56-06-21-49; Fax: 33-1-56-06-21-94. Email: Khlat{at}ined.fr
| Abstract |
|---|
| Introduction |
|---|
The rationale in both studies focused on the bias in estimation, as measured by the confounding risk ratio (CRR), which is the ratio of the crude relative risk to the ethnic group-adjusted relative risk for the effect of genotype on disease. The CRR reflects the magnitude of the bias attributable to stratification across a wide range of genotype frequencies and disease rates, and the above-mentioned studies have shown that the bias from population stratification is likely to be small except under extreme conditions. Yet critics who have argued that stratification undermines the credibility of epidemiologic studies of genetic factors are especially concerned with spurious associations, so a more meaningful measure of the impact of the bias would be the probability of false-positive findings (i.e., type I error). For a given magnitude of the bias, the type I error depends partly on the prevalence of the genotype of interest in the subpopulations and on the sample size, and the question is: even if there is little bias, can the probability of false-positive findings be appreciable?
In this article, we investigate the probability of a spurious association between a genetic factor and a disease in relation to the characteristics of the stratification present in a large population. Based on an additive model of inheritance for the disease, estimates of the CRR and the associated type I error are provided for the most unfavorable situation, in which only one high-risk subpopulation is hidden within the studied population. We consider different scenarios in terms of population variables (frequency of the genotype of interest, disease rate ratio, and relative size of the hidden subpopulation) and varying sample sizes. On the basis of our findings, we discuss the likelihood of type I error in current epidemiologic studies of genetic factors.
| Materials and Methods |
|---|
We want to derive the expected genotype distribution for the marker gene in cases and controls, under the null hypothesis that this marker has no influence on the risk of disease. Provided the disease is rare enough, we may assume that the genotype distribution in controls is roughly the same as that found in the population as a whole. We can write:
[Equation 1: expected genotype distribution in controls (equation images not recovered)] | (1) |
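As an illustrative sketch rather than the authors' exact notation, the control distribution can be written as a mixture of the two subpopulations. The sketch assumes Hardy-Weinberg proportions within each subpopulation, takes $f$ to be the proportion of Pop1 in the total population, and introduces $p_1$ and $p_2$ (assumed symbols) for the frequency of allele A in Pop1 and Pop2:

```latex
\begin{aligned}
P(AA \mid \text{control}) &= f\,p_1^2 + (1-f)\,p_2^2 \\
P(Aa \mid \text{control}) &= 2f\,p_1(1-p_1) + 2(1-f)\,p_2(1-p_2) \\
P(aa \mid \text{control}) &= f\,(1-p_1)^2 + (1-f)\,(1-p_2)^2
\end{aligned}
```

For example, with $f = 0.5$, $p_1 = 0.3$, and $p_2 = 0.1$, these expressions give 0.05, 0.30, and 0.65 for AA, Aa, and aa, respectively.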
Let K be the ratio of the disease rate in Pop1 to the disease rate in Pop2. The expected genotype distribution in cases may be written as:
[Equation 2: expected genotype distribution in cases (equation images not recovered)] | (2) |
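A possible form for the case distribution, again a sketch under assumptions rather than the authors' own expressions: under the null hypothesis, a case arises from Pop1 with probability $Kf/(Kf + 1 - f)$, so each genotype frequency is a $K$-weighted mixture. Here $f$ is taken as the proportion of Pop1, $p_1$ and $p_2$ are assumed symbols for the frequency of allele A in Pop1 and Pop2, and Hardy-Weinberg proportions are assumed within each subpopulation:

```latex
\begin{aligned}
P(AA \mid \text{case}) &= \frac{K f\,p_1^2 + (1-f)\,p_2^2}{K f + (1-f)} \\
P(Aa \mid \text{case}) &= \frac{2K f\,p_1(1-p_1) + 2(1-f)\,p_2(1-p_2)}{K f + (1-f)} \\
P(aa \mid \text{case}) &= \frac{K f\,(1-p_1)^2 + (1-f)\,(1-p_2)^2}{K f + (1-f)}
\end{aligned}
```

Setting $K = 1$ recovers the distribution in the population as a whole, so cases and controls differ only when the disease rates differ between the subpopulations.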
To test the effect of the genotype on disease risk, we used a logistic regression analysis in which the genotype was incorporated as a quantitative variable coded 0 (aa), 1 (Aa), or 2 (AA) (i.e., assuming an additive model of inheritance). Under the hypothesis of no relation between genotype and disease risk, the CRR reduces to the crude relative risk for the effect of genotype on disease. The type I error of the likelihood ratio test of association between the marker and the disease was evaluated by using a noncentral χ² distribution with 1 df and a noncentrality parameter (7) that depends on the values of the different variables:
[Equation 3: noncentrality parameter (equation image not recovered)] | (3) |
This procedure provides an asymptotic approximation of the type I error that was found to be very accurate by simulation (data not shown).
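The calculation can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' Stata code: it assumes Hardy-Weinberg proportions within each subpopulation, fits the additive logistic model to the expected (noninteger) genotype counts, and takes the resulting likelihood ratio statistic as the noncentrality parameter. All function and argument names (`p1`, `p2`, `crr_and_type1`, and so on) are hypothetical.

```python
import math
from statistics import NormalDist


def hwe(p):
    """Genotype proportions (aa, Aa, AA) under Hardy-Weinberg for allele A frequency p."""
    return [(1 - p) ** 2, 2 * p * (1 - p), p ** 2]


def expected_distributions(f, k, p1, p2):
    """Null genotype distributions in controls and cases.

    f: proportion of Pop1; k: disease rate ratio Pop1/Pop2;
    p1, p2: allele A frequencies in Pop1 and Pop2 (assumed names).
    """
    g1, g2 = hwe(p1), hwe(p2)
    controls = [f * a + (1 - f) * b for a, b in zip(g1, g2)]
    w = k * f / (k * f + 1 - f)  # share of cases arising from Pop1
    cases = [w * a + (1 - w) * b for a, b in zip(g1, g2)]
    return controls, cases


def log_lik(a, b, n_case, n_ctrl):
    """Log-likelihood of logit P(case | x) = a + b*x for genotype score x = 0, 1, 2."""
    total = 0.0
    for x, (y1, y0) in enumerate(zip(n_case, n_ctrl)):
        eta = a + b * x
        total += y1 * eta - (y1 + y0) * math.log1p(math.exp(eta))
    return total


def fit_additive(n_case, n_ctrl, iterations=25):
    """Newton-Raphson fit of the additive logistic model on grouped counts."""
    a = b = 0.0
    for _ in range(iterations):
        ga = gb = haa = hab = hbb = 0.0
        for x, (y1, y0) in enumerate(zip(n_case, n_ctrl)):
            pi = 1 / (1 + math.exp(-(a + b * x)))
            resid = y1 - (y1 + y0) * pi
            wt = (y1 + y0) * pi * (1 - pi)
            ga += resid
            gb += x * resid
            haa += wt
            hab += x * wt
            hbb += x * x * wt
        det = haa * hbb - hab * hab
        a += (hbb * ga - hab * gb) / det
        b += (haa * gb - hab * ga) / det
    return a, b


def crr_and_type1(f, k, p1, p2, n_cases=500, n_controls=500, alpha=0.05):
    """Per-allele odds ratio under the null (the CRR) and approximate type I error."""
    controls, cases = expected_distributions(f, k, p1, p2)
    n_case = [n_cases * g for g in cases]
    n_ctrl = [n_controls * g for g in controls]
    a, b = fit_additive(n_case, n_ctrl)
    a0 = math.log(n_cases / n_controls)  # intercept-only (no genotype effect) fit
    lam = max(0.0, 2 * (log_lik(a, b, n_case, n_ctrl) - log_lik(a0, 0.0, n_case, n_ctrl)))
    # A noncentral chi-square with 1 df is the square of a normal variate with
    # mean sqrt(lam), so its upper tail follows from the normal CDF.
    nd = NormalDist()
    z = nd.inv_cdf(1 - alpha / 2)
    root = math.sqrt(lam)
    type1 = 1 - nd.cdf(z - root) + nd.cdf(-z - root)
    return math.exp(b), type1


for f in (0.01, 0.05, 0.10, 0.50):
    crr, t1 = crr_and_type1(f, k=2, p1=0.3, p2=0.1)
    print(f"f={f:.2f}  CRR={crr:.2f}  type I error={t1:.3f}")
```

The loop uses the scenario with a disease rate ratio of 2 and allele frequencies of 0.3 and 0.1, so its output can be compared with the CRR values quoted in the Results for that scenario. When the two allele frequencies are equal, the noncentrality is zero and the function returns the nominal 5%.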
Different scenarios were considered: disease rate ratio of 2 as opposed to 10; uncommon marker allele (frequency in larger subpopulation = 0.1) as opposed to common marker allele (frequency = 0.5); and proportion of the hidden subpopulation of 1%, 5%, 10%, or 50%. For each scenario, we estimated both the CRR and the type I error. To compute type I errors, a sample size of 500 cases and 500 controls was considered but the impact of the sample size was also studied by varying it from 100 to 1,000.
All computations were done using the Stata software (version 7).
| Results |
|---|
0.2 (see Table 1 for detailed figures), we find that in that case the bias is contained within much more reasonable limits (i.e., at most 1.15 for rare alleles and 1.07 for common alleles).
The same patterns of variation are found when the disease rate in the higher-risk subpopulation is 10-fold that in the other one (data not shown), but in that case both the magnitude of the bias and the type I error are considerably larger. Under the most extreme scenario (50% subpopulation and maximal allele frequency difference), the bias is 4 for common alleles and 3.3 for rare ones, whereas the type I error with a sample of 500 cases and 500 controls is 100%. In terms of type I error, the advantage of common alleles over rare ones is much less evident in that case, except for the smallest subpopulations (f = 1%). For f = 5% or 10%, false associations are almost certain to be found for rare alleles, starting from an allele frequency difference of 0.2.
The influence of sample size on type I error was examined using the scenario with a disease rate ratio of 2 and an allele frequency of 0.3 in the higher-risk subpopulation as opposed to 0.1 in the other one, corresponding to a CRR of 1.02, 1.09, 1.15, or 1.20 for values of f of 1%, 5%, 10%, or 50%, respectively (Fig. 2). Clearly, when the hidden higher-risk subpopulation is a very small portion of the larger one (1%), the probability of false association remains close to the expected 5% for sample sizes ranging up to 1,000 cases and 1,000 controls. When the proportion of the higher-risk subpopulation is larger, the type I error increases with sample size, and the greater that proportion, the faster the increase. For a subpopulation amounting to 10% of the total, the type I error reaches 10% for samples of size 200, whereas it takes a sample of 550 to reach that level when the proportion is 5%.
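The dependence on sample size can be illustrated with a rougher closed-form approximation. The sketch below is not the likelihood ratio calculation used in this article: it swaps in the noncentrality of the Cochran-Armitage trend (score) test, which for n cases and n controls is n times the squared difference in mean genotype score divided by twice the pooled score variance. It also assumes Hardy-Weinberg proportions within each subpopulation, and all names (`p1`, `p2`, `type1_trend`, `smallest_n`) are hypothetical.

```python
import math
from statistics import NormalDist


def score_moments(w, p1, p2):
    """Mean and mean square of the 0/1/2 genotype score when a share w comes from Pop1."""
    mean = 2 * (w * p1 + (1 - w) * p2)
    mean_sq = (w * (2 * p1 * (1 - p1) + 4 * p1 ** 2)
               + (1 - w) * (2 * p2 * (1 - p2) + 4 * p2 ** 2))
    return mean, mean_sq


def type1_trend(n, f, k, p1, p2, alpha=0.05):
    """Approximate type I error with n cases and n controls (trend-test noncentrality)."""
    w_case = k * f / (k * f + 1 - f)          # share of cases arising from Pop1
    m0, s0 = score_moments(f, p1, p2)         # controls mirror the whole population
    m1, s1 = score_moments(w_case, p1, p2)    # cases over-represent Pop1 when k > 1
    var_x = (s0 + s1) / 2 - ((m0 + m1) / 2) ** 2  # score variance in the pooled sample
    lam = n * (m1 - m0) ** 2 / (2 * var_x)
    nd = NormalDist()
    z = nd.inv_cdf(1 - alpha / 2)
    root = math.sqrt(lam)
    return 1 - nd.cdf(z - root) + nd.cdf(-z - root)


def smallest_n(target, f, k, p1, p2, n_max=5000, step=10):
    """Smallest n (in steps of `step`) at which the type I error reaches `target`."""
    for n in range(step, n_max + 1, step):
        if type1_trend(n, f, k, p1, p2) >= target:
            return n
    return None


for f in (0.01, 0.05, 0.10, 0.50):
    print(f, smallest_n(0.10, f, k=2, p1=0.3, p2=0.1))
```

Because the noncentrality grows linearly with n, any nonzero difference in mean genotype score between cases and controls eventually produces an inflated type I error; the size of the hidden subpopulation only determines how quickly. `smallest_n` returns `None` when the target is not reached within `n_max`.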
| Discussion |
|---|
Clearly, the 50% subpopulation case is the one that provides the maximum bias and type I error, but whether this type of population structuring is likely to be found is open to question. Furthermore, large allelic frequency differences between subpopulations are likely to occur mainly between distinguishable ethnic groups and therefore could be handled easily by matching, adjustment, or other standard methods. More subtle differences in admixed populations may be trickier, as they are impossible to deal with analytically. Based on the International Project on Genetic Susceptibility to Environmental Carcinogens database, Garte et al. (8, 9) found major and significant differences in the frequency of the more commonly studied metabolic genes among Caucasians, Asians, Africans, and African Americans and some, but much smaller, differences within Caucasian populations from different countries. One example of the latter is the shift of the frequency of GSTT1*0 from 0.13 in northern Europeans to 0.17 in Caucasians from the rest of Europe (i.e., a 0.04 difference between subpopulations from different origins within Europe). With a disease prevalence in the higher-risk subpopulation as high as double that in the other subpopulation (rate ratio = 2), we show that, for the 1%, 5%, and 10% scenarios in which the allele frequency differences do not exceed 0.2, both bias and type I error are contained within reasonable limits.
Our findings are therefore in accord with those of Ardlie et al. (10), who suggest that "carefully matched, moderate-sized case-control samples in cosmopolitan U.S. and European populations are unlikely to contain levels of structure that would result in significantly inflated numbers of false-positive associations." In addition, we present original findings that clarify the separate impact of the different components of population structure and sample size: (a) the larger the difference between the study population and the hidden subpopulation in terms of either allele frequencies or disease prevalence, the larger the type I error; (b) the larger the relative size of the hidden subpopulation within the study population, the larger the type I error; and (c) the type I error is larger for uncommon alleles than for common ones. For instance, a relatively common allele like GSTM1*0 (frequency around 50%) would be much less exposed to stratification bias than a rare one like CYP1A1*2A (frequency around 5-6%).
We also illustrate the pattern of variation of type I error according to sample size for different subpopulation scenarios and underline the greater vulnerability of large study samples to stratification bias. This agrees with the findings of Pritchard and Donnelly (11), who argue that the extent of bias caused by population structure increases as the sample gets larger because of the greater power to detect both real associations and spurious ones. Yet, even with large samples, it should be stressed that the type I error only reaches appreciable levels under quite unrealistic population scenarios, given that, as discussed above, the variability to be considered in terms of disease rates and allele frequencies is the variability within ethnic groups, after accounting for known risk factors. Furthermore, we have deliberately limited our study to a scenario comprising two subpopulations, which is the situation most exposed to stratification bias. Wacholder et al. (4) have shown that ethnic diversity reduces the bias from population stratification: the potential for bias is greatest with two or three ethnicities and tends to decline as the number goes up, with individual biases canceling each other out.
In conclusion, this study sheds light on the patterns of variation of type I error in relation to the different components of population structuring and shows that both bias and type I error resulting from population stratification are likely to be limited in methodologically sound case-control studies of moderate size, except in quite unrealistic scenarios. Whenever a statistical association is likely to result from stratification, other approaches should be used to confirm the association.
| Acknowledgments |
|---|
| Footnotes |
|---|
Received 12/29/03; revised 4/30/04; accepted 5/6/04.
| References |
|---|