
Division of Cancer Epidemiology and Genetics, National Cancer Institute, Bethesda, Maryland 20892
| Abstract |
|---|
| Introduction |
|---|
The increase in sample size for fixed misclassification probabilities depends on the prevalence of the environmental and genetic factors and on the type and magnitude of the interaction being evaluated (4, 10). Because studies to detect interactions typically require large sample sizes (12, 13, 14, 15, 16), further increases in sample size due to exposure misclassification could compromise the feasibility of the study (10). The evaluation of the effects of misclassification at the study design phase allows investigators an opportunity to consider alternative measures of exposure with different levels of accuracy and to identify situations where high-quality exposure assessment is crucial. The objective of this paper is to describe a relatively simple approach to quantify the impact of misclassification on bias in the estimation of interaction effects and on the required sample sizes. In the next sections, we describe and illustrate the approach with examples.
| Materials and Methods |
|---|
The multiplicative interaction parameter is defined as the ratio of the joint odds ratio to the product of the odds ratios for each factor at the reference level of the other factor, namely, OR11/(OR10 x OR01). In the absence of a multiplicative interaction, this parameter equals 1.0 and OR11 = OR10 x OR01.
The additive interaction parameter is defined as the ratio of the joint excess risk (OR11 - 1) to the sum of the excess risks for each factor at the reference level of the other factor, namely, (OR11 - 1)/[(OR10 - 1) + (OR01 - 1)]. Other definitions for additive interaction parameters are possible (17) but will not be discussed in this paper. In the absence of an additive interaction, this parameter equals 1.0 and (OR11 - 1) = (OR10 - 1) + (OR01 - 1). It should be noted that the additive interaction parameter is undefined when both OR10 and OR01 are 1.0, and that whereas the multiplicative interaction parameter takes values from 0 to +∞, the additive interaction parameter takes values from -∞ to +∞.
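The two definitions can be made concrete with a short sketch. The function names below are illustrative and do not come from the paper or from the EXPECT spreadsheet; the arithmetic follows the definitions of the multiplicative and additive interaction parameters given in this section.

```python
def multiplicative_interaction(or10, or01, or11):
    """Joint odds ratio divided by the product of the single-factor odds ratios."""
    return or11 / (or10 * or01)


def additive_interaction(or10, or01, or11):
    """Joint excess risk divided by the sum of the single-factor excess risks."""
    denominator = (or10 - 1.0) + (or01 - 1.0)
    if denominator == 0.0:
        # Covers the case named in the text (OR10 = OR01 = 1.0) and any other
        # pair of odds ratios whose excess risks cancel.
        raise ValueError("additive interaction parameter is undefined")
    return (or11 - 1.0) / denominator


# Odds ratios quoted in the additive example of this paper:
# OR10 = 0.5, OR01 = 2.0, OR11 = 2.0
print(additive_interaction(0.5, 2.0, 2.0))
```

With the odds ratios quoted for the additive example (OR10 = 0.5, OR01 = 2.0, OR11 = 2.0), the second function returns the 2-fold additive interaction stated in the text.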
Misclassification of a dichotomous exposure is defined by the misclassification probabilities sensitivity (se) and specificity (sp; Ref. 18). Sensitivity is the probability that a truly exposed subject is classified as exposed, and specificity is the probability that a truly unexposed subject is classified as unexposed. Nondifferential misclassification occurs when the misclassification probabilities are independent of the disease status, whereas differential misclassification occurs when the misclassification probabilities are dependent on the disease status. In the examples presented in this paper, we assume nondifferential misclassification; however, the approach described in this section can be used for both nondifferential and differential misclassification. Because nearly all instruments in epidemiology have some degree of error, sensitivity and specificity can also be defined for two instruments with different degrees of accuracy rather than for an error-free and an error-prone instrument. We will refer to the more accurate instrument as "gold standard" and the less accurate instrument as "error prone".
| Sample Size Calculations. |
|---|
To calculate the sample size required to detect a multiplicative or an additive interaction of a given magnitude, values for OR10 and OR01 need to be specified. These parameters are often difficult to specify, and the marginal odds ratios for the environmental factor (ORE) and genetic factor (ORG), i.e., the odds ratios for each factor when the other factor is ignored, are often better known. The relationship between OR10, OR01, and the interaction parameter (multiplicative or additive) and the marginal odds ratios (ORE and ORG) is given in "Appendix 1". Sample size calculations in two of the examples of multiplicative interactions presented in this paper are based on estimates of marginal effects from previous studies and the interaction parameter, rather than on OR10, OR01, and the interaction parameter.
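To give a feel for how the inputs drive sample size, the sketch below uses a simple Wald-type approximation for the log of the multiplicative interaction parameter in an unmatched study with equal numbers of cases and controls. This is an assumption made for illustration and is not claimed to be the sample size method used in this paper, so its output will not match the tabulated values exactly. It also assumes that the two factors are independent among controls and that control prevalences equal the specified prevalences. All function names are hypothetical.

```python
from math import log
from statistics import NormalDist


def cell_probabilities(p_e, p_g, or10, or01, or11):
    """Cell probabilities keyed by (e, g), returned as (cases, controls)."""
    controls = {(e, g): (p_e if e else 1 - p_e) * (p_g if g else 1 - p_g)
                for e in (0, 1) for g in (0, 1)}
    odds_ratio = {(0, 0): 1.0, (1, 0): or10, (0, 1): or01, (1, 1): or11}
    weights = {cell: controls[cell] * odds_ratio[cell] for cell in controls}
    total = sum(weights.values())
    cases = {cell: weight / total for cell, weight in weights.items()}
    return cases, controls


def wald_cases_needed(p_e, p_g, or10, or01, or11, alpha=0.05, power=0.80):
    """Approximate number of cases (with an equal number of controls)."""
    cases, controls = cell_probabilities(p_e, p_g, or10, or01, or11)
    psi = or11 / (or10 * or01)
    # Variance of the log interaction estimate for one case and one control:
    # sum of reciprocal cell probabilities over all eight cells.
    unit_variance = (sum(1 / p for p in cases.values())
                     + sum(1 / p for p in controls.values()))
    normal = NormalDist()
    z_total = normal.inv_cdf(1 - alpha / 2) + normal.inv_cdf(power)
    return z_total ** 2 * unit_variance / log(psi) ** 2


# Inputs quoted for the obesity and COMT LL example in this paper.
print(round(wald_cases_needed(0.15, 0.25, 1.10, 1.72, 3.78)))
```

For the obesity and COMT LL inputs quoted later in the paper, this approximation gives a figure of the same order as the 1016 cases reported there, which is about as much agreement as a different approximation can be expected to show.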
| An Approach to Assess the Impact of Misclassification on Bias and Sample Size. |
|---|
(a) Specify values for P(E = 1), P(G = 1), OR10, OR01, and OR11 in the absence of misclassification or when using a "gold standard" instrument.
(b) Calculate the required sample size for a given power to detect the multiplicative or additive interaction effect.
(c) Calculate P(E* = 1), P(G* = 1), OR*10, OR*01, and OR*11 for values of sensitivity and specificity of the environmental and genetic factors as indicated in "Appendix 2", where "*" denotes the observed parameters in the presence of misclassification or when using an "error prone" instrument.
(d) Calculate the sample size using the observed parameters in the presence of misclassification.
It should be noted that this methodology is not applicable to ordered categorical or continuous exposure variables because without restrictions on the disease rate and on the form of the odds ratio function, the shape of the relationship with disease will generally be distorted by the measurement error (20).
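Steps (a) and (c) can be sketched in a few lines. The code below is a minimal illustration and is not the EXPECT spreadsheet. It assumes nondifferential misclassification (the same sensitivity and specificity in cases and controls), errors in the two factors that are independent of each other, and factors that are independent among controls. The function names and the numerical inputs in the usage lines are hypothetical.

```python
def expected_cells(p_e, p_g, or10, or01, or11):
    """Step (a): relative cell frequencies keyed by (e, g), as (cases, controls)."""
    controls = {(e, g): (p_e if e else 1 - p_e) * (p_g if g else 1 - p_g)
                for e in (0, 1) for g in (0, 1)}
    odds_ratio = {(0, 0): 1.0, (1, 0): or10, (0, 1): or01, (1, 1): or11}
    # Case cells are left unnormalised; the common constant cancels in odds ratios.
    cases = {cell: controls[cell] * odds_ratio[cell] for cell in controls}
    return cases, controls


def classification_probability(observed, true, se, sp):
    """P(observed level | true level) for a binary factor."""
    if true == 1:
        return se if observed == 1 else 1 - se
    return 1 - sp if observed == 1 else sp


def misclassify(cells, se_e, sp_e, se_g, sp_g):
    """Step (c): redistribute true cells into observed cells."""
    return {
        (e_obs, g_obs): sum(
            cells[(e, g)]
            * classification_probability(e_obs, e, se_e, sp_e)
            * classification_probability(g_obs, g, se_g, sp_g)
            for e in (0, 1) for g in (0, 1))
        for e_obs in (0, 1) for g_obs in (0, 1)
    }


def odds_ratios(cases, controls):
    """Return (OR10, OR01, OR11) relative to the doubly unexposed cell."""
    reference = cases[(0, 0)] / controls[(0, 0)]
    return tuple(cases[cell] / controls[cell] / reference
                 for cell in ((1, 0), (0, 1), (1, 1)))


def observed_parameters(p_e, p_g, or10, or01, or11,
                        se_e=1.0, sp_e=1.0, se_g=1.0, sp_g=1.0):
    cases, controls = expected_cells(p_e, p_g, or10, or01, or11)
    return odds_ratios(misclassify(cases, se_e, sp_e, se_g, sp_g),
                       misclassify(controls, se_e, sp_e, se_g, sp_g))


# Hypothetical inputs, not taken from the paper's tables:
# a 2-fold multiplicative interaction with environmental sensitivity 0.8.
or10_obs, or01_obs, or11_obs = observed_parameters(0.5, 0.5, 2.0, 2.0, 8.0, se_e=0.8)
print(round(or11_obs / (or10_obs * or01_obs), 3))  # 1.556
```

In this hypothetical case the observed multiplicative interaction falls between 1 and the true value of 2, in line with the attenuation toward the null described for multiplicative interactions. The observed odds ratios can then be passed to whichever sample size routine is used for steps (b) and (d).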
| Effect of Misclassification on Bias and Sample Size. |
|---|
> 1 (i.e., the classification instrument is better than random) (11). Under these circumstances, the sample size required to reject the null hypothesis of no multiplicative interaction with a given statistical power will be increased.
The direction of the bias to the additive interaction parameter in the presence of misclassification is more difficult to predict because we do not have a general rule as in the case of multiplicative interactions (11). Using the method described in the previous section, we explored empirically the direction of the bias to the additive interaction parameter under a range of parameter values, assuming the same conditions indicated above for a multiplicative interaction. We found that under these conditions, nondifferential misclassification of the genetic or environmental factor generally tends to bias the additive interaction parameter toward the null value. However, we did find several examples where the additive interaction is biased away from the null in the presence of nondifferential misclassification in the environmental or genetic factor assessment. Although most of these scenarios were extreme situations, we found examples that can be encountered in practice. These examples followed a pattern where a protective factor measured with reduced specificity interacts with a risk factor of disease. We illustrate this situation in an example presented in Table 5. However, the approach described in this section can be used to assess the direction of the bias in each particular situation.
| Results |
|---|
= 2.0). This example corresponds to a pattern of interaction where both the genetic and the environmental factors increase the risk of disease by themselves, and the joint effect is different from the effect of each factor acting alone [pattern 4 as described by Khoury et al. (21) and model E as described by Ottman et al. (22)]. We chose an example of this pattern because we believe that it is reasonable in the context of complex multifactorial diseases like cancer, where environmental and genetic factors are likely to influence the risk of cancer through multiple pathways.
Table 2 illustrates the impact of reducing sensitivity of the environmental factor assessment from 1.0 to 0.80, both in the absence and presence of reduced sensitivity in the assessment of the genetic factor (from 1.0 to 0.95). Although measures of genetic markers are generally considered less prone to error than measures of environmental exposures, some degree of error may be present due to technical errors in determining the genotype or due to failure to analyze or identify relevant alleles (8, 10). In Table 2, the prevalence for both factors is 0.5, and the specificity for the assessment of both the genetic and environmental factors is 1.0.
In Fig. 1, we explore in more detail the effects of misclassification on sample size. The solid lines in Fig. 1 represent the sample size required to detect the specified 2-fold interaction in the absence of misclassification as a function of the true prevalence of the environmental factor for 0.5 (Panels 1-3) and 0.1 (Panels 4-6) prevalence of the genetic factor. The dashed lines in Fig. 1 illustrate the impact of misclassification of the environmental factor on sample size for selected values of sensitivity and specificity of exposure assessment.
Fig. 1, Panels 4-6, shows patterns similar to those in Panels 1-3 for a genetic factor with 0.1 prevalence. For environmental factors with 0.5 true prevalence, reducing the environmental factor sensitivity from 1.0 to 0.8 and 0.6 (while holding specificity at 1.0) will increase the sample size from 1200 to 2700 (2.3-fold) and to 5390 (4.5-fold) cases, respectively. It should be noted that although the baseline sample size in the absence of misclassification is increased, the percent increase in sample size is very similar to that in Panels 1-3. The reason is that in Fig. 1, we assumed that the genetic factor is perfectly measured and independent from the environmental factor. Therefore, the impact of misclassification on the environmental factor does not depend on the prevalence of the genetic factor.
Example of a 2-fold Additive Gene-Environment Interaction.
Generally, when both the genetic and the environmental factors increase the risk of disease by themselves and in combination, as in the previous example, nondifferential misclassification tends to bias the additive interaction effect toward the null value. However, the direction of the bias to the additive interaction parameter due to nondifferential misclassification cannot be easily predicted. In this section, we provide an example of an additive gene-environment interaction where nondifferential misclassification of the environmental factor biases the additive interaction parameter away from the null value, even though the factors are binary and independent, and the misclassification probabilities for the environmental factor are independent of the genetic factor. In this example, the prevalence for the environmental factor is 0.3 and for the genetic factor is 0.5; the odds ratio for the effect of the environmental factor alone (OR10) is 0.5 and for the genetic factor alone (OR01) is 2.0; and the joint odds ratio for both factors (OR11) is 2.0. These values represent a 2-fold additive interaction (additive interaction parameter = 2.0).
Table 3 illustrates the impact of reducing the specificity of the environmental factor assessment from 1.0 to 0.8, both in the absence and presence of reduced sensitivity in the genetic factor assessment. The sensitivity for the environmental factor and the specificity for the genetic factor are both 1.0. In the absence of misclassification, the required sample size for 80% power is 2930 cases and 2930 controls. Reducing exposure specificity from 1.0 to 0.80 results in bias of the additive interaction parameter away from the null from 2.0 to 2.88 while increasing the required sample size to 3486 cases (1.19-fold increase). Paradoxically, when the sensitivity of the genetic factor is 0.95 rather than 1.0, the same amount of environmental factor error results in a smaller bias to the additive interaction parameter (from 2.0 to 2.46) while further increasing the required sample size to 3993 cases. Thus, reduced specificity in measuring a protective environmental factor can bias the additive interaction parameter away from the null value while increasing the sample size to reject the null hypothesis of no additive interaction.
Table 4 shows the minimum number of women needed to detect a 2-fold interaction (multiplicative interaction parameter = 2.0) between obesity, defined as BMI ≥ 30 kg/m2, and the COMT LL genotype using two alternative methods to estimate a woman's BMI: actual measurements of weight and height, and self-reported weight and height. Assuming a prevalence of obesity of 0.15, a marginal odds ratio for obesity of 1.5, a prevalence of COMT LL genotype of 0.25, and a marginal odds ratio of 2.0 for COMT LL genotype (25), one would need to study 1016 cases and 1016 controls to detect a 2-fold interaction. These marginal odds ratios and interaction parameter imply: OR10 = 1.10, OR01 = 1.72, and OR11 = 3.78 (calculated as indicated in "Appendix 1").
, will be 1.83 rather than 2.0, and the required sample size to detect the interaction will be increased from 1016 cases to 1548 cases and an equal number of controls. Although obtaining actual measurements of weight and height may increase the total cost of data collection, the savings from enrolling, collecting biological samples, and determining the genotype in 532 fewer cases and 532 fewer controls may offset the increased cost of data collection. Moreover, using actual measurements of weight will provide unbiased estimates for the "true" interaction parameter and the obesity and COMT LL odds ratios.
Benzo(a)pyrene, GSTM1 Genotype, and Lung Cancer Risk Among Nonsmokers.
Occupational exposure to benzo(a)pyrene has been associated with about a 2-fold increase in lung cancer risk among nonsmokers (29). Detoxification of benzo(a)pyrene by conjugation to glutathione is catalyzed by the GSTM1 enzyme (glutathione S-transferase M1). A homozygous deletion of the GSTM1 gene is responsible for a lack of enzyme activity and has been associated with about a 1.5-fold increase in lung cancer risk (30). Thus, subjects exposed to benzo(a)pyrene who have the homozygous deletion in the GSTM1 gene could be at a particularly high risk of lung cancer. Dewar et al. (31) have estimated a sensitivity of 0.6 and a specificity of 0.99 for the classification of exposure to benzo(a)pyrene based on a job-exposure matrix applied to job titles from a personal interview, as compared to exposure based on a more complex procedure involving the evaluation of a detailed job history by a trained team of chemists and industrial hygienists. Based on these estimates, a study to detect a 2-fold interaction (multiplicative interaction parameter = 2.00, OR10 = 1.03, OR01 = 1.20, OR11 = 2.49) between the GSTM1 null genotype and exposure to benzo(a)pyrene assessed by the evaluation of a detailed job history would need to include about 672 cases and 672 controls (Table 5). In contrast, using a job-exposure matrix to estimate benzo(a)pyrene exposure biases the interaction parameter to 1.76, and the required sample size is more than twice the previous estimate (1413 cases and 1413 controls).
| Discussion |
|---|
Both differential and nondifferential misclassification of the environmental factor bias a multiplicative interaction effect toward the null value provided that the environmental and genetic factors are binary and independent, misclassification is independent of the genetic factor, and the sum of sensitivity and specificity is > 1 (i.e., the classification instrument is better than random; Ref. 11). However, bias to the additive interaction parameter cannot be easily predicted, even under this set of conditions. In fact, we provide an example of an additive interaction between a genetic susceptibility factor and a protective environmental factor, where reduced specificity in the assessment of the environmental factor results in an overestimation of the additive interaction parameter and an increase in sample size. Although this and all other examples in this paper assume nondifferential misclassification with respect to the disease status, our procedure can also be used for differential misclassification.
The observations in this paper point out the trade-off between using more accurate and usually more expensive measures of exposure assessment in a smaller number of subjects or using less-accurate but usually cheaper measures in a larger number of subjects. When making these choices, it should be borne in mind that increasing sample size increases the study power to detect the attenuated interaction; however, the interaction effect is still biased. In this case, adjustments based on estimates of sensitivity and specificity are required to obtain an unbiased estimate of the true interaction effect. It should be noted that if the conditions used in our paper are not satisfied (i.e., binary genetic and environmental factors, independent of each other in the population and independence of misclassification probabilities for both factors), there may be unpredictable effects of misclassification on the direction of the bias to the multiplicative interaction (11). Moreover, as indicated above, the direction of the bias to the additive interaction cannot be generally predicted, even under the conditions used in our paper.
In conclusion, efforts to improve the accuracy of exposure assessment for both the environmental and genetic factors can greatly reduce sample size requirements to study interactions and are critical for accurate assessment of gene-environment interactions in case-control studies. Our examples also illustrate the importance of routine assessment of accuracy in genotype assays through quality control procedures because of the large impact of small degrees of error.
| Appendix 1 |
|---|
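For given values of the effects of the environmental and genetic factors alone (OR10 and OR01), interaction effect (multiplicative or additive), and prevalence of the environmental and genetic factors (P(E = 1) and P(G = 1)), the environmental and genetic marginal effects (ORE and ORG) can be calculated by using the following set of equations.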
For multiplicative interactions:
![]() |
![]() |
![]() |
![]() |
For given values of the environmental and genetic marginal effects (ORE and ORG), interaction effect (multiplicative or additive), and prevalence of the environmental and genetic factors (P(E = 1) and P(G = 1)), the effects of the environmental and genetic factors alone (OR10 and OR01) can be calculated by solving for OR10 and OR01 in the above set of equations.
All calculations in this Appendix can be performed easily using a spreadsheet (EXPECT) that can be obtained by e-mail from connorj{at}mail.nih.gov
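The following sketch shows one way such a calculation could be set up for a multiplicative interaction. The expressions for the marginal odds ratios used here are an assumption: they are what results from collapsing the table over the other factor when the two factors are independent among controls and control prevalences equal the specified prevalences. The function names, the bisection search, and its bounds are illustrative and are not taken from EXPECT.

```python
def marginal_odds_ratios(p_e, p_g, or10, or01, or11):
    """Return (ORE, ORG), each obtained by ignoring the other factor."""
    or_e = (p_g * or11 + (1 - p_g) * or10) / (p_g * or01 + (1 - p_g))
    or_g = (p_e * or11 + (1 - p_e) * or01) / (p_e * or10 + (1 - p_e))
    return or_e, or_g


def solve_single_factor_ors(p_e, p_g, or_e, or_g, psi,
                            lower=1e-6, upper=1e3, iterations=200):
    """Find (OR10, OR01, OR11) that reproduce ORE and ORG when OR11 = psi * OR10 * OR01."""

    def or10_given(or01):
        # The ORE equation rearranged for OR10 after substituting OR11.
        return or_e * (p_g * or01 + (1 - p_g)) / (p_g * psi * or01 + (1 - p_g))

    def gap(or01):
        or10 = or10_given(or01)
        implied_or_g = marginal_odds_ratios(p_e, p_g, or10, or01, psi * or10 * or01)[1]
        return implied_or_g - or_g

    if gap(lower) > 0 or gap(upper) < 0:
        raise ValueError("no solution bracketed by the search bounds")
    for _ in range(iterations):  # plain bisection on OR01
        middle = (lower + upper) / 2
        if gap(middle) > 0:
            upper = middle
        else:
            lower = middle
    or01 = (lower + upper) / 2
    or10 = or10_given(or01)
    return or10, or01, psi * or10 * or01


# Inputs quoted for the obesity and COMT LL example in this paper.
print([round(value, 2) for value in solve_single_factor_ors(0.15, 0.25, 1.5, 2.0, 2.0)])
```

With the inputs quoted for the obesity and COMT LL example (prevalences 0.15 and 0.25, marginal odds ratios 1.5 and 2.0, a 2-fold multiplicative interaction), this sketch returns values that round to the OR10 = 1.10, OR01 = 1.72, and OR11 = 3.78 reported in the text.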
| Appendix 2 |
|---|
Below are the formulae used in calculations performed by EXPECT.
Given P(G = 1), P(E = 1), OR10, OR01, and OR11, the expected cell counts in Table 2 are:
a1 = k x P(E = 1) x P(G = 1) x OR11
b1 = k x (1 - P(E = 1)) x P(G = 1) x OR01
c1 = P(E = 1) x P(G = 1)
d1 = (1 - P(E = 1)) x P(G = 1)
a0 = k x P(E = 1) x (1 - P(G = 1)) x OR10
b0 = k x (1 - P(E = 1)) x (1 - P(G = 1))
c0 = P(E = 1) x (1 - P(G = 1))
d0 = (1 - P(E = 1)) x (1 - P(G = 1))
where k is a constant common to the four case cells (a1, b1, a0, and b0).
![]() |
Let se0E, sp0E and se1E, sp1E be the sensitivity and specificity of the environmental factor among controls and cases, respectively, and se0G, sp0G and se1G, sp1G be the sensitivity and specificity of the genetic factor among controls and cases, respectively. The expected cell counts among the controls in the presence of misclassification (denoted by an asterisk *) are calculated as:
![]() |
The expected cell counts among cases are:
![]() |
The observed parameters in the presence of misclassification, P(G* = 1), P(E*= 1), OR*10, OR*01, and OR*11, are then calculated from the expected cell counts. Note that for nondifferential misclassification of the environmental and genetic factors se0E = se1E, sp0E = sp1E, and se0G = se1G, sp0G = sp1G, respectively.
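One plausible written-out form of the control cell counts is given below as an illustration. It assumes that errors in the environmental and genetic factors occur independently of each other, so that each true cell is redistributed according to the product of the two classification probabilities. The case cells follow the same pattern, with a1, b1, a0, b0 in place of c1, d1, c0, d0 and with se1E, sp1E, se1G, sp1G in place of the control values.

```latex
% Assumption: E and G are misclassified independently of each other.
% c_1, d_1, c_0, d_0 are the true control cells defined above.
\begin{align*}
c_1^{*} &= se_{0G}\,[\,se_{0E}\,c_1 + (1 - sp_{0E})\,d_1\,]
         + (1 - sp_{0G})\,[\,se_{0E}\,c_0 + (1 - sp_{0E})\,d_0\,] \\
d_1^{*} &= se_{0G}\,[\,(1 - se_{0E})\,c_1 + sp_{0E}\,d_1\,]
         + (1 - sp_{0G})\,[\,(1 - se_{0E})\,c_0 + sp_{0E}\,d_0\,] \\
c_0^{*} &= (1 - se_{0G})\,[\,se_{0E}\,c_1 + (1 - sp_{0E})\,d_1\,]
         + sp_{0G}\,[\,se_{0E}\,c_0 + (1 - sp_{0E})\,d_0\,] \\
d_0^{*} &= (1 - se_{0G})\,[\,(1 - se_{0E})\,c_1 + sp_{0E}\,d_1\,]
         + sp_{0G}\,[\,(1 - se_{0E})\,c_0 + sp_{0E}\,d_0\,]
\end{align*}
```

The four observed cells sum to the same total as the four true cells, which is a quick check that no subjects are gained or lost in the reclassification.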
| Footnotes |
|---|
1 To whom requests for reprints should be addressed, at National Cancer Institute, DCEG, EPS, Room 7076, 6120 Executive Boulevard, Bethesda, MD.
2 The abbreviations used are: COMT, catechol-O-methyltransferase; BMI, body mass index.
Received 2/11/99; revised 9/1/99; accepted 9/9/99.
| References |
|---|