Abstract
To examine the joint effect of multiple loci on disease risk, many casecontrol association studies used “genedose analyses.” However, some researchers defined highrisk genotypes (or alleles) as those that have higher genotypic (allelic) frequencies in the case group compared with the control group in the study. This will lead to the total number of the “highrisk” genotypes (alleles) tending to be higher for the cases than for the controls as well, even if none of the studied loci were related to the disease. MonteCarlo simulations done in this study showed that such a “datadredging” genedose analysis could produce grossly biased results. A permutation correction method was proposed which could correct the biases very effectively. (Cancer Epidemiol Biomarkers Prev 2005;14(12):3004–6)
 epidemiologic methods
 genetic epidemiology
 doseresponse relation
Introduction
To examine the joint effect of multiple loci on disease risk, many casecontrol association studies used “genedose analyses” (we found 40 articles with genedose analysis from January 1998 to May 2005 in six esteemed cancer journals: Cancer Epidemiology Biomarkers and Prevention, Cancer Research, Carcinogenesis, Clinical Cancer Research, International Journal of Cancer, and the Journal of the National Cancer Institute). In these studies, the researchers first defined the highrisk genotype (or allele) at each locus. They then calculated the total number of the highrisk genotypes (alleles) a subject has for the cases and for the controls in their study. They categorized the total number of the highrisk genotypes (alleles) into a number of categories, say, “0 to 2”, “3 to 4”, and “5+”, and they calculated the odds ratios (OR) and the associated confidence intervals (CI) for these categories. Finally, they did a trend test to see if the OR increases as the total number of the highrisk genotypes (alleles) increases. “Doseresponse relationship” is one of the causal criteria put forward by Hill (1). A genedose effect is corroborative of a causeandeffect relation between the genes and the disease.
However, some researchers defined the highrisk genotypes (alleles) as those that have higher genotypic (allelic) frequencies in the case group than in the control group in the study, or equivalently, as those with OR > 1 in the study [among the aforementioned 40 articles, researchers from 10 of these articles defined their highrisk genotypes (alleles) in this way; see Table 1 for a summary]. They then did the usual genedose analysis as if those defined are the bona fide highrisk genotypes (alleles)—the “datadredging” genedose analysis. The definition guarantees the frequency of the “highrisk” genotypes (alleles) being higher in the case group than in the control group at each locus. Intuitively, this will lead to the total number of the highrisk genotypes (alleles) tending to be higher for the cases than for the controls as well, even if, in fact, none of the studied loci were related to the disease. Consequently, an excess of falsepositive genedose effects may result.
Methods and Results
A simulation study is presented in Table 2. The loci are assumed to be unlinked or in linkage equilibrium with one another. We see that although none of the simulated loci is related to the disease, the OR increases as the total number of the highrisk genotypes increases, displaying a false sense of genedose effects. The coverage probabilities of the CIs are too low and the type I error rates of the trend test are inflated (α = 0.05). As the number of the studied loci increases, the problem becomes more acute; with 10 loci, the OR can be as high as ∼2.00, the coverage probability, as low as ∼0.38, and the type I error rate, as high as ∼0.64 (the datadredging articles in Table 1 with ≥5 loci and borderline level of significance are thus particularly vulnerable). The overestimation of the OR is less severe when sample sizes reached 1,000 as compared with studies with 500 samples. However, the undercoverage of the CIs and the inflation of the type I error rates are irrespective of sample size. Similar biases can be found when the studied loci are in linkage disequilibrium (less severe) and when the analysis is based on alleles (more severe; results not shown).
To correct the biases, we let each and every study subject retain his/her genetic data. Whereas we randomly permute the disease status (cases or controls) of all the study subjects (keeping the total number of cases and the total number of controls constant). A new round of datadredging genedose analyses is done for this permuted data—to determine a new set of the “highrisk” genotypes (alleles), and then to calculate a new set of the ORs and a new χ^{2} statistic of the trend test. The procedure is to be repeated 1,000 times (from our limited simulation experiences, this number of permutations is sufficient).
To get a permutationcorrected OR for a particular category of the total number of highrisk genotypes (alleles), we divide the OR of that category in the original data by the geometric mean of the ORs of the same category in the 1,000 permuted data (the original log OR minus the arithmetic mean of the permutation log ORs). To get a permutationcorrected upper (lower) limit of the 95% CI for a particular category, we divide the OR of that category in the original data by the 2.5th (97.5) percentile of the ORs of the same category in the 1,000 permuted data (the 95% CI constructed in this way will cover 1.00, if the original OR does not differ significantly from 1.00, using permutation tests at α = 0.05; it will not cover 1.00, if the OR differs significantly from 1.00). To get a permutationcorrected P value for the trend test, we calculate the proportion in the 1,000 permuted data that have the χ^{2} statistic of the trend test larger than the same statistic in the original data (alternatively, we declare the trend test to be significant at α = 0.05, if the original χ^{2} statistic is larger than the 95th percentile of the permutation χ^{2} statistics).
With regard to the data in Table 2, it can now be seen in Table 3 that the permutationcorrected ORs are very close to their true values of 1.00, the permutationcorrected CIs have coverage probabilities of ∼0.95, and the permutationcorrected trend tests have type I error rates of ∼0.05. The permutation correction achieves similarly satisfactory results when the studied loci are in linkage disequilibrium and when the analysis is based on alleles (data not shown).
Discussion
In a study in which we have a priori knowledge about the highrisk genotypes (alleles) for some of the loci but not for the rest, we may resort to performing a genedose analysis with a mixture of the “predetermined” highrisks and the “datadredged” highrisks. The correction method can deal with this; by copying the predetermining and the datadredging (of the original data) to each and every round of the permutations. The method also works when other covariates beyond the studied loci (i.e., sex, age, smoking, etc.) need to be considered; by performing a logistic regression to derive the “covariatesadjusted” ORs (2) in each and every round of the permutations. Finally, the method can also be used for a matched casecontrol study; the ORs are to be calculated using the matcheddata methods (2) and the permutations (of the disease status) are to be restricted to subjects in the same matching set.
In conclusion, the datadredging genedose analysis as commonly employed in the literature will produce grossly biased results. The biases, however, could very effectively be corrected by the permutation correction method as described in this report.
Footnotes

Grant support: Supported in part by the National Science Council, Republic of China.

The costs of publication of this article were defrayed in part by the payment of page charges. This article must therefore be hereby marked advertisement in accordance with 18 U.S.C. Section 1734 solely to indicate this fact.
 Accepted October 10, 2005.
 Received August 8, 2005.
 Revision received September 29, 2005.