
1 Cancer Prevention and Control Program, H. Lee Moffitt Cancer Center and Research Institute; 2 Department of Interdisciplinary Oncology, University of South Florida College of Medicine, Tampa, Florida; 3 Center for Genome Research, Department of Environmental Health, University of Cincinnati, Cincinnati, Ohio; and 4 Population Studies and Prevention Program, Karmanos Cancer Institute and Department of Internal Medicine, Wayne State University School of Medicine, Detroit, Michigan
Requests for reprints: Jill Barnholtz-Sloan, Cancer Prevention and Control Program, H. Lee Moffitt Cancer Center and Research Institute, 12902 Magnolia Drive, Tampa, FL 33612. Phone: 813-745-6531; Fax: 813-632-1334. E-mail: barnhojs{at}moffitt.usf.edu
| Abstract |
|---|
| Introduction |
|---|
Studies of disease risk associated with candidate susceptibility genes are potentially vulnerable to bias due to population stratification. For important bias from population stratification to exist, the following must be true: (a) the frequency of the genotype of interest varies substantially by race and ancestry, (b) the disease rate varies substantially by race and ancestry, and (c) the disease rates and genotype frequencies vary together, which occurs when the genotype is related to a true risk factor or is itself the true risk factor with a high attributable risk (5). Methods have therefore been developed to assess population stratification in case-control studies (6-12). Some of these methods involve using DNA markers to estimate genetic ancestry at the individual level, thereby allowing studies of its association with various complex traits (9, 13-16). Ancestry-informative markers generally show large allele frequency differences between ancestral populations (7).
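The three conditions can be made concrete with a small numeric sketch. The counts below are hypothetical and are not taken from this study: two ancestry strata in which the genotype has no effect on disease (odds ratio of 1 within each stratum), but in which both the genotype frequency and the case-to-control ratio differ between strata. The Mantel-Haenszel estimator is used here only as a simple stand-in for stratified adjustment; it is not the method used in this article.

```python
def odds_ratio(a, b, c, d):
    """Odds ratio for a 2x2 table.

    a, b = carrier / noncarrier cases; c, d = carrier / noncarrier controls.
    """
    return (a * d) / (b * c)


def mantel_haenszel_or(strata):
    """Mantel-Haenszel summary odds ratio across 2x2 tables (a, b, c, d)."""
    num = sum(a * d / (a + b + c + d) for a, b, c, d in strata)
    den = sum(b * c / (a + b + c + d) for a, b, c, d in strata)
    return num / den


# Hypothetical strata (illustrative numbers only):
# stratum 1: genotype frequency 0.8, 100 cases and 200 controls
# stratum 2: genotype frequency 0.2, 200 cases and 100 controls
strata = [(80, 20, 160, 40), (40, 160, 20, 80)]

stratum_ors = [odds_ratio(*s) for s in strata]       # no effect within strata
pooled = tuple(sum(col) for col in zip(*strata))     # ignore ancestry entirely
crude_or = odds_ratio(*pooled)                       # biased away from 1
adjusted_or = mantel_haenszel_or(strata)             # stratified estimate

print(stratum_ors, round(crude_or, 3), adjusted_or)
```

In this constructed example the pooled analysis suggests a protective association even though none exists within either stratum, which is the pattern that conditions (a) through (c) describe.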
Estimating individual ancestry requires genotyping an additional set of DNA markers for each individual in the study population. Whether this is necessary, or whether doing stratified analyses by self-reported race is adequate to control for population stratification, is unresolved (5). For example, it has been found that commonly used racial labels were insufficient and inaccurate representations of ancestral clusters and that genotype profiles, defined by the distribution of variants in drug-metabolizing genes (CYP1A2, CYP2C19, CYP2D6, NAT2, GSTM1, and DIA4), differed significantly among ancestral clusters (17). Individuals of mixed race are becoming increasingly common in the United States (18). Most study subjects simply cannot report via questionnaire what percentage of their genome derives from European, West African, or Native American ancestry. Populations in the United States are generally formed from more recent admixture, which causes interindividual differences in genetic ancestry to become more pronounced (19, 20). The West African contribution to ancestry for African Americans in the United States is on average 80%, but it can range from 20% to 100%, and 30% of United States Caucasian, non-Hispanics have >90% European ancestry (7, 21). African Americans also have significant admixture from European and Native American ancestral populations, whereas United States Caucasian, non-Hispanics have significant admixture from West African and Native American populations (22).
To assess the utility of using individual genetic ancestry estimates to better understand population stratification in a standard epidemiologic case-control study, we genotyped early-onset lung cancer cases (i.e., diagnosed before age 50) and population-based controls for a panel of ancestry informative markers. We then estimated individual ancestry from these markers using two different methods and used these estimates to assess population stratification within this case-control sample. We used the glutathione S-transferase µ (GSTM1) locus, a candidate gene for lung cancer risk (23-27), as an example of how using individual ancestry estimates versus self-reported race can affect estimates of disease risk associated with genotype in groups of individuals.
| Materials and Methods |
|---|
Genotyping
Each individual was genotyped for the lung cancer candidate gene, GSTM1 (null or present; refs. 28, 29). In addition, all individuals were genotyped for the U.S. Federal Bureau of Investigation CODIS Core short tandem repeat (STR) set of 13 loci for analysis of individual ancestry (30). A list of these 13 loci, with the chromosomal location and the number of alleles, is shown in Table 1. The 13 CODIS loci were tested for Hardy-Weinberg equilibrium and linkage disequilibrium and were found not to violate Hardy-Weinberg equilibrium within loci or show linkage disequilibrium between loci (data not shown; P = 0.08-0.75 for tests of Hardy-Weinberg equilibrium and linkage disequilibrium). For the maximum likelihood estimations (MLE), the average of German and Polish parental frequencies was used to represent European ancestry (31, 32) and the average of Rwandan and Nigerian parental frequencies was used to represent West African ancestry (32, 33). Detroit, MI was originally settled by the Polish and Germans, with African ancestral populations settling in over time (34), making these parental populations appropriate for estimating individual ancestry in this study population.
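As an illustration of the within-locus Hardy-Weinberg check mentioned above, the following sketch computes a Pearson chi-square statistic for a single multiallelic locus. The article does not state which test was used, so the chi-square form, the function name `hwe_chi_square`, and the toy genotypes are all assumptions made for illustration; highly polymorphic STR loci are often tested with exact or permutation methods instead.

```python
from collections import Counter
from itertools import combinations_with_replacement


def hwe_chi_square(genotypes):
    """Chi-square goodness-of-fit statistic for Hardy-Weinberg proportions.

    genotypes: list of (allele, allele) pairs for one locus.
    Returns (statistic, degrees_of_freedom), with df = k(k - 1)/2 for k alleles.
    """
    n = len(genotypes)
    allele_counts = Counter(a for g in genotypes for a in g)
    freqs = {a: c / (2 * n) for a, c in allele_counts.items()}
    observed = Counter(tuple(sorted(g)) for g in genotypes)

    stat = 0.0
    for a, b in combinations_with_replacement(sorted(freqs), 2):
        # Expected count: n*p^2 for homozygotes, 2*n*p*q for heterozygotes.
        expected = n * freqs[a] ** 2 if a == b else 2 * n * freqs[a] * freqs[b]
        stat += (observed.get((a, b), 0) - expected) ** 2 / expected

    k = len(freqs)
    return stat, k * (k - 1) // 2


# Toy data (hypothetical): genotype counts exactly at Hardy-Weinberg proportions.
in_equilibrium = [("A", "A")] * 25 + [("A", "B")] * 50 + [("B", "B")] * 25
stat_eq, df_eq = hwe_chi_square(in_equilibrium)

# Toy data (hypothetical): heterozygotes only, a strong departure.
all_het = [("A", "B")] * 100
stat_het, df_het = hwe_chi_square(all_het)
```

A statistic near zero is consistent with equilibrium; converting it to a P value requires the chi-square distribution with the returned degrees of freedom.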
Considering a population that was formed by admixture between two genetically distinct ancestral populations (this can be easily extended to any number of populations), the frequency of the kth allele at the gth locus in the admixed population, A, is

$$p_{gAk} = \sum_{j=1}^{2} m_j\, p_{gjk} = p_{g2k} + m_1\, \delta_{g1k} \tag{A}$$

where $m_j$ is the proportional contribution of the jth parental population, $p_{gjk}$ is the frequency of the kth allele at the gth locus in parental population j, and the $\delta$ coefficients (i.e., allele frequency differences between parental populations) are defined as $\delta_{g1k} = p_{g1k} - p_{g2k}$. The constraint $\sum_j m_j = 1$ ensures that the outcome of analysis is unaffected by the way parental populations are numbered, or which population is subtracted from the others. The log-likelihood function for an individual is

$$\ln L = \sum_{g} \sum_{k} n_{gk} \ln p_{gAk} \tag{B}$$

where $n_{gk}$ is the number of copies of the kth allele at the gth locus carried by the individual. Maximum likelihood estimates for the ancestral contributions were obtained from the log-likelihood function by setting the partial derivatives, with respect to $m_j$,

$$\frac{\partial \ln L}{\partial m_1} = \sum_{g} \sum_{k} n_{gk}\, \frac{\delta_{g1k}}{p_{gAk}} \tag{C}$$

equal to zero and solving for $m_1$, using the Newton-Raphson method (36). The MLE of $m_2$ equals $1 - m_1$ (37).

For the second method, individual ancestry for two "clusters" (i.e., ancestral European and West African populations) was calculated from each individual's CODIS Core STR loci genotypes. The STRUCTURE method assigns each individual to clusters by calculating a posterior probability that an individual belongs to a cluster, given the observed marker genotypes (i.e., the CODIS STR genotypes). The number of clusters can either be inferred by the program or be given as an initial variable. In this case, we set the number of clusters to two to compare with the MLE estimates. In the presence of admixture and hence correlated allele frequencies, the STRUCTURE method also estimates the proportion of an individual's genome that derives from each of the two cluster subpopulations.
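The two-population maximum likelihood step can be sketched in a few lines. This is a minimal illustration rather than the authors' implementation. It assumes that the admixed allele frequency is the mixture p2 + m1 * (p1 - p2) of the two parental frequencies, that each allele copy carried by the individual contributes one log term to the likelihood, and that every carried allele has a nonzero frequency in at least one parental population. The function name `ancestry_mle`, the clamping bound `eps`, and the toy frequencies are hypothetical.

```python
def ancestry_mle(genotype, freqs_pop1, freqs_pop2, tol=1e-10, max_iter=100):
    """Newton-Raphson estimate of (m1, m2), the two ancestral proportions.

    genotype:   list of (allele, allele) pairs, one per locus
    freqs_popX: list of dicts (one per locus) mapping allele -> frequency
    """
    eps = 1e-6  # keep m1 strictly inside (0, 1) so frequencies stay positive

    # One (p2, delta) pair per allele copy, with delta = p1 - p2.
    terms = []
    for (a, b), f1, f2 in zip(genotype, freqs_pop1, freqs_pop2):
        for allele in (a, b):
            terms.append((f2[allele], f1[allele] - f2[allele]))

    m1 = 0.5
    for _ in range(max_iter):
        # First and second derivatives of the log-likelihood in m1.
        score = sum(d / (p2 + m1 * d) for p2, d in terms)
        hessian = -sum((d / (p2 + m1 * d)) ** 2 for p2, d in terms)
        if hessian == 0.0:  # no informative markers: nothing to update
            break
        new_m1 = min(max(m1 - score / hessian, eps), 1.0 - eps)
        converged = abs(new_m1 - m1) < tol
        m1 = new_m1
        if converged:
            break
    return m1, 1.0 - m1


# Toy example (hypothetical frequencies): two loci with identical profiles.
pop1 = [{"A": 0.9, "B": 0.1}, {"A": 0.9, "B": 0.1}]
pop2 = [{"A": 0.1, "B": 0.9}, {"A": 0.1, "B": 0.9}]

# Three copies of the allele common in population 1, one copy of the other.
m1_mixed, m2_mixed = ancestry_mle([("A", "A"), ("A", "B")], pop1, pop2)

# Only the allele common in population 1: the estimate runs to the boundary.
m1_edge, m2_edge = ancestry_mle([("A", "A"), ("A", "A")], pop1, pop2)
```

For the first toy genotype the score equation can be solved by hand, 3(0.9 - 0.8 m1) = 0.1 + 0.8 m1, giving m1 = 0.8125, which is what the iteration returns. The clamp to the interval (eps, 1 - eps) is a design choice of this sketch for handling boundary estimates; other implementations may treat them differently.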
Statistical Analysis
Composite delta ($\delta_c$) was calculated for each of the 13 CODIS loci for the European and West African ancestral combination. Composite $\delta$ was calculated as half the sum, across all alleles at a locus, of the absolute allele frequency differences between the two populations, a definition that accommodates multiple alleles at a locus. Spearman correlation coefficients were calculated for MLE individual ancestry compared with STRUCTURE individual ancestry. Only European ancestry estimates were used for further analyses because the West African estimates were equal to one minus the European estimates. Median European MLE and STRUCTURE ancestry were compared within self-reported racial group by case-control status using a t test; the frequency of the GSTM1 null genotype was also compared within self-reported racial group by case-control status using a $\chi^2$ test. Histograms of individual European ancestry for both MLE and STRUCTURE estimates by self-reported race were generated. To assess differences in ancestry between cases and controls related to the GSTM1 null risk genotype, histograms were generated to compare the frequency of the risk genotype by case-control status within European MLE or STRUCTURE ancestral group, stratified by self-reported race. Unconditional logistic regression models were used to estimate odds ratios and 95% confidence intervals to measure the association between early-onset lung cancer and the GSTM1 null genotype. Potential confounders, including gender, age at diagnosis for cases or age at interview for controls (continuous), family history of lung cancer, and pack-years of smoking (continuous), were included in all models. To test the effects of self-reported race and individual ancestry on genetic risk, models were additionally adjusted for self-reported race or individual European MLE or STRUCTURE ancestry and were compared with the general model using the likelihood ratio test. Additionally, models were compared using the Akaike Information Criterion, which adjusts the -2 log-likelihood for the model by twice the number of estimated parameters in the model (38). All statistical analyses were done using SAS version 9.1 (39).
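Two of the quantities described above are simple enough to sketch directly. The block below computes composite delta for one locus as half the sum of absolute allele frequency differences, the Akaike Information Criterion as -2 log-likelihood plus twice the number of estimated parameters, and a likelihood ratio test P value for nested models that differ by a single parameter. The function names, the toy frequencies, and the log-likelihood values are hypothetical; the one-degree-of-freedom restriction is an assumption of this sketch, and the analyses in the article were run in SAS.

```python
import math


def composite_delta(freqs_a, freqs_b):
    """Half the sum of absolute allele frequency differences at one locus.

    freqs_a, freqs_b: dicts mapping allele -> frequency in each population.
    Alleles missing from one population are treated as frequency 0.
    """
    alleles = set(freqs_a) | set(freqs_b)
    return 0.5 * sum(abs(freqs_a.get(k, 0.0) - freqs_b.get(k, 0.0)) for k in alleles)


def aic(log_likelihood, n_params):
    """Akaike Information Criterion: -2 ln L + 2 * (number of parameters)."""
    return -2.0 * log_likelihood + 2.0 * n_params


def lrt_pvalue_1df(loglik_reduced, loglik_full):
    """Likelihood ratio test P value for nested models differing by ONE parameter.

    Uses the identity P(chi-square with 1 df > x) = erfc(sqrt(x / 2)).
    """
    statistic = 2.0 * (loglik_full - loglik_reduced)
    return math.erfc(math.sqrt(max(statistic, 0.0) / 2.0))


# Toy locus (hypothetical frequencies): three alleles, two populations.
delta_c = composite_delta({"8": 0.6, "9": 0.3, "10": 0.1},
                          {"8": 0.2, "9": 0.3, "10": 0.5})

# Toy model comparison (hypothetical log-likelihoods).
aic_base = aic(-100.0, 5)
aic_with_ancestry = aic(-98.0, 6)
p_value = lrt_pvalue_1df(-100.0, -98.0)
```

In the toy comparison, adding one parameter raises the log-likelihood by 2, so the statistic is 4 on one degree of freedom and the model with the extra term also has the lower Akaike Information Criterion.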
| Results |
|---|
$\delta_c$ values for the majority of the CODIS loci were >0.2, making them appropriate for ancestry estimation analysis (40, 41). Because the ancestral allele frequencies were used in the MLE, it was clear which of the two MLE estimates corresponded to which ancestral group; however, this correspondence was less clear with the STRUCTURE results. Spearman correlation coefficients showed that the European individual MLE ancestry estimates were highly positively correlated (+0.80) with the cluster 2 estimates from STRUCTURE, whereas the West African individual MLE ancestry estimates were highly positively correlated (+0.80) with the cluster 1 estimates from STRUCTURE. Therefore, we denoted the cluster 2 STRUCTURE estimates as European and the cluster 1 STRUCTURE estimates as West African, to compare with the MLE results in subsequent analyses. Within self-reported racial group, neither median individual European ancestry estimates nor GSTM1 null genotype frequency differed significantly by case-control status (Table 2). However, the distribution of European ancestry values was significantly different by self-reported race, whether using MLE or STRUCTURE estimates (Fig. 1A and B). The GSTM1 null genotype frequency in Caucasian, non-Hispanic controls was 48.3%, whereas in African American controls it was 28.8% (Table 2).
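The cluster-labeling step described above, in which unlabeled STRUCTURE clusters are matched to ancestral groups through their rank correlation with the MLE estimates, might look like the following sketch. The helper names and the five illustrative individuals are hypothetical and are not study data; Spearman correlation is computed here as the Pearson correlation of average ranks.

```python
import math


def average_ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical per-individual estimates (not study data).
mle_european = [0.10, 0.25, 0.60, 0.85, 0.95]
structure_cluster1 = [0.88, 0.70, 0.45, 0.20, 0.07]
structure_cluster2 = [1.0 - q for q in structure_cluster1]

rho1 = spearman(mle_european, structure_cluster1)
rho2 = spearman(mle_european, structure_cluster2)

# Label as "European" the cluster that tracks the MLE European estimate.
european_cluster = "cluster 2" if rho2 > rho1 else "cluster 1"
```

With two clusters whose proportions sum to one, the two correlations are mirror images, so a single comparison is enough to assign both labels.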
| Discussion |
|---|
Genetic ancestry estimation is not commonly used in studies of complex diseases because of the difficulty and expense of genotyping additional markers. However, many studies use race as an eligibility criterion. Genetic ancestry proportions seem to vary not only between groups of individuals who would self-identify with the same racial group but also among individuals within a group (22). Assortative mating, patterns of linkage and linkage disequilibrium among loci, and random genetic drift can all contribute to variability in ancestry among individuals (43). Allele frequencies have been shown to vary substantially across populations that have mixed ancestry from different continents (44) and within the same continent (45). Even if common variants are shared among racial groups, the frequencies can often differ substantially (46). Although the error caused by population stratification seems stronger for rare variants than for common ones, the greatest bias and type I error caused by population stratification occur when there is a hidden subpopulation of at least 50% in the study population of interest and the allele frequencies differ by at least 25% (47). Therefore, it has been recommended that information on ethnic origin be collected in the greatest detail possible (48).
Few studies have analyzed empirical (i.e., nonsimulated) data to compare self-reported race with individual ancestry estimates to assess population substructure. A study by Wacholder et al. (5) examined NAT2 and incidence of bladder and breast cancers in relation to ancestry. They concluded that genetic markers of ancestry were unlikely to create a better proxy than self-reported race for cultural practices that could strongly affect cancer risk. In simulation studies, it has been shown that errors that occur from using self-reported race instead of genetic ancestry would be more problematic in large studies searching for susceptibility loci with small effects (19), as the adverse effects of this stratification seem to increase with increasing sample size (49, 50). Although not a case-control design, a recent study by Wilson et al. observed that the frequency of risk genotypes in six drug-metabolizing genes, including GSTM1, varied by ancestral group and that self-reported race was an insufficient and inaccurate representation of these ancestral clusters (17). The conclusions from the study by Wilson et al. and from our study suggest that using individual ancestry information could enhance the validity of epidemiologic studies and improve precision of estimated effects.
The present study suffers from limitations that are common to all studies involving ancestry estimation. The precise estimation of the ancestral proportions is highly dependent on four factors: (a) choice of parental populations, (b) choice of markers for ancestry estimation, (c) precise estimation of the parental allele frequencies, and (d) choice of method for ancestry estimation. Studies show that human populations worldwide can be subdivided based on parental/ancestral population combinations from five continents: Africa, Europe and the Middle East, Asia, Pacific Islands, and America (Native American; ref. 51). Groups of Caucasian and African American individuals in the United States today, like those used in this study, have been shown to have a combination of parental/ancestral genes from West Africa and Europe (7, 16, 21). We chose to estimate these two ancestral proportions for each individual. However, choosing appropriate parental populations with available allele frequencies for the ancestry markers requires a well-described history of the immigration and migration of the study population, which is not always readily available. The settlement history of Detroit is well described and is available through the Center for Michigan Studies (34). Detroit was first settled by the Polish, followed by the Germans, and still has a large population of individuals who identify themselves as having Polish or German ancestry (34, 52). It is estimated that 20% to 30% of African Americans in the United States have origins in Nigeria, whose population is believed to be the most homogeneous group in Western Africa (53), making Nigerians a rational choice for the estimation of African ancestry. Rwandan frequencies were also used because Rwanda is in the most populated area of sub-Saharan Africa and its population has the same linguistic affiliation as many other groups in Africa, indicating recent mixture of this group with other groups in Africa (33).
The choice of markers for ancestry estimation depends on the marker's informativeness for ancestry, which has generally been thought to depend solely on allele frequency differences between parental/ancestral populations, or $\delta$ (7, 40, 41, 54). However, recent studies show that informativeness for ancestry can also depend on other population genetic events (55), such as which population is the contributor of genes and which population is the acceptor of genes (56). Because there currently is no "standard" set of markers available for ancestry estimation, we chose to use the Federal Bureau of Investigation CODIS Core set of 13 STR loci. This set of markers is readily available in an easy-to-use, reasonably priced laboratory kit. These markers show considerable allele frequency variation among racial and ancestral groups from around the world (57-59). In particular, for the two ancestral groups used in this analysis, the majority of the 13 markers had composite $\delta$ values of >0.2. In addition, the CODIS loci were unlinked to each other (59) and were unlinked to the candidate gene of interest in this study, GSTM1, which is a key assumption for ancestry analysis (8, 10).
If the size of the parental population is small, then the precision of the estimation of the allele frequencies is poor (22). Estimation of the allele frequencies from the parental populations used in this study, however, was based on large parental sample sizes.
Controversy about the best method for estimating individual ancestry still exists. Therefore, individual ancestry was estimated using both MLE and STRUCTURE to compare the ability of each estimation technique to assess population structure. From the MLE estimates, it was clear which estimates were European and which were West African, because ancestral allele frequencies were specified to perform the MLE calculations. Although the STRUCTURE estimates eventually showed population structure results similar to those from the MLE estimates, it was difficult to determine which ancestral population each cluster proportion described, because prior ancestral allele frequency information is not used in the estimation algorithm. However, the STRUCTURE estimates did give a better fit to the data for the modeling of genetic risk than the MLE estimates, possibly because STRUCTURE does not rely on prior allele frequency information for the estimation of ancestry.
In summary, this study is one of the first to evaluate the association of individual genetic ancestry with self-reported race in a case-control study of cancer. We found that individual European ancestry did not completely correlate with self-reported race and that there was significant overlap by individual ancestry between the Caucasian, non-Hispanics and African Americans. We also observed that the frequency of a risk genotype, GSTM1 null, varied substantially within self-reported racial group by individual ancestry and case-control status, thereby affecting models of disease risk. We conclude that individual ancestry may confound associations between disease status and a candidate gene risk genotype and could have a direct effect on accuracy of risk estimation for early-onset cancer in this Detroit population.
| Acknowledgments |
|---|
| Footnotes |
|---|
The costs of publication of this article were defrayed in part by the payment of page charges. This article must therefore be hereby marked advertisement in accordance with 18 U.S.C. Section 1734 solely to indicate this fact.
Received 11/11/04; revised 2/24/05; accepted 3/28/05.
| References |
|---|