Abstract
The association between the genotypic frequencies of the cytochrome P450 1A1 polymorphism and the risk of childhood leukemia is explored with the data from a matched case-control study. The data are displayed in a 3 × 3 case-control array, and the discordant pair counts are assessed for quasi-independence, homogeneity, and symmetry. This statistical approach is contrasted to the more typical analysis of matched data based on a conditional logistic model and estimated odds ratios. The statistical analysis of 175 matched pairs (part of a large study of potential environmental/genetic influences on the risk of childhood leukemia) showed no evidence of an association between cytochrome P450 1A1 genotype frequencies and case-control status.
Introduction
The possible association between cytochrome P450 1A1 (CYP1A1) genetic polymorphism and disease has been explored by several investigators. Some examples are the investigations of Ladonna et al. (1) on Spanish toxic oil syndrome, Ishibe et al. (2) on breast cancer, and Kim et al. (3) on cervical cancer, as well as several reports on the association with childhood leukemia (e.g., refs. 4, 5). These analyses, as well as most analyses of matched case-control genetic data, are summarized in terms of ratios of discordant pairs of observations (odds ratios) estimated from conditional logistic regression models. The following statistical approach to the analysis of the CYP1A1/leukemia matched data is based on a 3 × 3 case-control array that allows an assessment of three fundamental statistical issues (i.e., independence, homogeneity, and symmetry of the discordant pairs). In addition, this approach is contrasted to the more typical use of odds ratios estimated from a conditional logistic model.
Data
To describe the CYP1A1 polymorphism and the risk of acute lymphoblastic leukemia, data collected as part of the Northern California Childhood Leukemia Study are used. These matched case-control observations were abstracted from a large number of genetic/environmental variables that potentially influence the risk of childhood leukemia. The cases are children ages 0 to 14 years old with newly diagnosed leukemia (1995 to 1999) obtained from major hospitals in the San Francisco Bay Area. Comparison with California State Cancer Registry data shows that >90% of the eligible children were ascertained. The control children were randomly selected from birth certificate records and matched to cases with respect to sex, age, race, and county of birth. A more extensive description of this far-ranging study is found elsewhere (6). The CYP1A1/leukemia data consisting of 175 matched pairs of acute lymphoblastic leukemia cases and their controls (117 concordant and 58 discordant pairs) are given in Table 1.
Quasi-Independence
Genotype data classified into a square array are quasi-independent when the categorical variables (rows and columns) are independent with respect to only the discordant pairs. No restrictions are placed on the concordant pairs (the main diagonal of the case-control array). In symbols, the six expected cell counts of the discordant pairs (denoted F_{ij}) are F_{ij} = Np_{i}q_{j} for i ≠ j = 1, 2, and 3. The values p_{i} represent the case genotypic frequencies, and the values q_{j} represent the control genotypic frequencies. The quantity N represents the “total number of pairs” that would have occurred if the genotypic frequencies were independent and the data were randomly sampled. Specifically, the value N = n / ∑p_{i}q_{j} for i ≠ j = 1, 2, and 3, where n represents the total observed number of discordant pairs.
The maximum likelihood estimates of the p_{i} and q_{j} frequencies are found by iterative techniques (7) or an application of a specialized log-linear model (8). Both procedures are designed to estimate the genotypic frequencies excluding the concordant pairs from consideration (truncated). For the Northern California Childhood Leukemia Study data (Table 1), these estimated genotypic frequencies are
A natural estimate of the expected number of case-control discordant pairs based on the quasi-independence model is F̂_{ij} = N̂p̂_{i}q̂_{j} where N̂ = n / ∑p̂_{i}q̂_{j} for i ≠ j = 1, 2, and 3. These expected counts estimated from the data in Table 1 are displayed in Table 2.
The Pearson χ^{2} goodness-of-fit test statistic (1 df) summarizing the deviations of the observed values from the expected values (Table 1 versus Table 2) is X_{Q}^{2} = 0.712 (P = 0.399). The df values are the number of observations (off-diagonal cell frequencies = 6) minus the number of independent parameters necessary to estimate the expected values or to specify the appropriate log-linear model. In the case of quasi-independence, five independent estimated parameters establish the expected values in Table 2, making the df equal to 1 (6 − 5 = 1).
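The fitting step can be sketched with iterative proportional fitting restricted to the off-diagonal cells, which is one common way to obtain quasi-independence expected counts; it is not necessarily the procedure of refs. 7 or 8. The counts in `table` are hypothetical (Table 1 is not reproduced here), and the function names are illustrative.

```python
# Hypothetical 3 x 3 case-control counts (rows = case genotype, columns =
# control genotype). NOT the Table 1 data; only the totals (117 concordant,
# 58 discordant pairs) were chosen to match the article.
table = [
    [60, 12, 5],
    [15, 40, 9],
    [4, 13, 17],
]


def fit_quasi_independence(counts, tol=1e-10, max_iter=10000):
    """Fit F_ij = a_i * b_j to the off-diagonal cells by iterative proportional fitting."""
    k = len(counts)
    # Observed margins of the discordant pairs only (diagonal excluded).
    row_obs = [sum(counts[i][j] for j in range(k) if j != i) for i in range(k)]
    col_obs = [sum(counts[i][j] for i in range(k) if i != j) for j in range(k)]
    # Start from 1 off the diagonal; the diagonal stays at 0 throughout.
    fit = [[0.0 if i == j else 1.0 for j in range(k)] for i in range(k)]
    for _ in range(max_iter):
        for i in range(k):  # rescale rows to the observed row totals
            scale = row_obs[i] / sum(fit[i])
            fit[i] = [scale * value for value in fit[i]]
        for j in range(k):  # rescale columns to the observed column totals
            scale = col_obs[j] / sum(fit[i][j] for i in range(k))
            for i in range(k):
                fit[i][j] *= scale
        if max(abs(sum(fit[i]) - row_obs[i]) for i in range(k)) < tol:
            break
    return fit


def pearson_off_diagonal(counts, fit):
    """Pearson chi-square over the six discordant cells."""
    k = len(counts)
    return sum((counts[i][j] - fit[i][j]) ** 2 / fit[i][j]
               for i in range(k) for j in range(k) if i != j)


expected = fit_quasi_independence(table)
x2_q = pearson_off_diagonal(table, expected)
df_q = 6 - 5  # six off-diagonal cells minus five independent parameters
print(round(x2_q, 3), df_q)
```

The fitted counts reproduce the observed off-diagonal row and column totals, which is the defining property of the maximum likelihood fit for this log-linear model.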
In the context of the analysis of matched case-control data, another form of this same χ^{2} test is called a “test for the consistency of the odds ratios” (9). The noninformative concordant pairs are excluded because the increased correlation within pairs due to the matching process tends to increase the number of concordant pairs relative to the number predicted by a model postulating independence.
Marginal Homogeneity
Quasi-independence does not imply marginal homogeneity (identical case and control genotypic frequency distributions). The issue of marginal homogeneity in general is the subject of several statistical articles (e.g., refs. 10-12). The expected row/column totals, under the conjecture of marginal homogeneity, are estimated by n̂_{i} = (n_{i.} + n_{.i}) / 2, where n_{i.} = ∑_{j}f_{ij}, n_{.j} = ∑_{i}f_{ij}, and f_{ij} represents the number of pairs in the (i,j)^{th} cell. For the CYP1A1/leukemia data, the estimated homogeneous marginal totals are
Symmetry
When the case-control discordant pairs are independent and the marginal frequencies are homogeneous, the expected counts of case-control pairs create a symmetrical array; that is, when case-control status is unrelated to the genotypic frequencies, the counts within the three kinds of discordant pairs (f_{ij} versus f_{ji}) differ by chance alone. Under these conditions (independence + homogeneity = symmetry), the maximum likelihood estimates of the genotypic frequencies (denoted P̂_{i}) are
P̂_{1} = f_{{12}}f_{{13}} / T, P̂_{2} = f_{{12}}f_{{23}} / T, and P̂_{3} = f_{{13}}f_{{23}} / T,
where f_{{ij}} = f_{ij} + f_{ji} and T = f_{{12}}f_{{13}} + f_{{12}}f_{{23}} + f_{{13}}f_{{23}}. Because the number of independent parameters equals the number of independent observations (2), the maximum likelihood estimates P̂_{i} can be derived by equating expected and observed frequencies (13). The specific estimated genotype frequencies from the acute lymphoblastic leukemia data are
These estimated genotypic frequencies produce an estimate of the expected counts of matched case-control discordant pairs (denoted F_{ij}′) where F̂_{ij}′ = N̂′P̂_{i}P̂_{j} for i ≠ j = 1, 2, and 3.
The estimated “sample size” is N̂′ = n / ∑P̂_{i}P̂_{j} for i ≠ j = 1, 2, and 3. The Northern California Childhood Leukemia Study matched pairs data (Table 1) produce the estimates displayed in Table 3. These expected discordant pairs are quasi-independent, and the marginal frequencies are homogeneous.
The Pearson goodness-of-fit test statistic (Table 1 versus Table 3) is X_{S}^{2} = 1.184 (P = 0.757) and has an approximate χ^{2} distribution with 3 df when the observed counts randomly differ from the expected counts. The comparison of these observed and expected numbers of discordant pairs is identical to the sum of three McNemar-like test statistics (14). In symbols, X_{S}^{2} = ∑(f_{ij} − f_{ji})^{2} / (f_{ij} + f_{ji}), where the sum is over the three kinds of discordant pairs (i < j).
The χ^{2} test for symmetry requires three independent estimated parameters giving a test statistic with 3 df (6 − 3 = 3). In addition, this test statistic partitions into three independent components each with 1 df.
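As a minimal sketch of the symmetry calculations, the code below computes the pooled counts f_{{ij}}, one closed form for P̂_{i} that follows from equating expected and observed pooled counts, the symmetric expected counts, and the three McNemar-like components. The counts are hypothetical (not the Table 1 data), and the function and variable names are illustrative.

```python
# Hypothetical counts (rows = case genotype, columns = control genotype);
# NOT the Table 1 data.
table = [
    [60, 12, 5],
    [15, 40, 9],
    [4, 13, 17],
]


def symmetry_analysis(counts):
    """Return (P-hat list, expected counts per pair, McNemar-like components, X_S^2)."""
    pairs = [(0, 1), (0, 2), (1, 2)]
    # Pooled discordant counts f_{ij} + f_{ji} for each kind of pair.
    pooled = {(i, j): counts[i][j] + counts[j][i] for i, j in pairs}
    f12, f13, f23 = pooled[(0, 1)], pooled[(0, 2)], pooled[(1, 2)]
    total = f12 * f13 + f12 * f23 + f13 * f23
    # P_i * P_j is then proportional to the pooled count of the (i, j) pairs.
    p_hat = [f12 * f13 / total, f12 * f23 / total, f13 * f23 / total]
    n = f12 + f13 + f23  # number of discordant pairs
    n_prime = n / (2 * sum(p_hat[i] * p_hat[j] for i, j in pairs))
    # Expected count in cell (i, j) and, by symmetry, in cell (j, i).
    expected = {(i, j): n_prime * p_hat[i] * p_hat[j] for i, j in pairs}
    components = {
        (i, j): (counts[i][j] - counts[j][i]) ** 2 / pooled[(i, j)]
        for i, j in pairs
    }
    return p_hat, expected, components, sum(components.values())


p_hat, expected, components, x2_s = symmetry_analysis(table)

# Pearson form over all six cells, for comparison with the McNemar-like sum.
pearson = sum(
    (table[i][j] - e) ** 2 / e + (table[j][i] - e) ** 2 / e
    for (i, j), e in expected.items()
)
print(round(x2_s, 4), round(pearson, 4))
```

For these hypothetical counts the symmetric expected count for each kind of pair is simply half the pooled count, which is why the six-cell Pearson statistic and the sum of the three 1-df components agree.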
In addition, the three estimated genotypic frequencies P̂_{i} lead to a compact form of a χ^{2} test statistic to evaluate marginal homogeneity in a 3 × 3 case-control array. The test statistic is
The test statistic X_{H}^{2} summarizes deviations from marginal homogeneity and has an approximate χ^{2} distribution with 2 df when the expected marginal frequencies are homogeneous. The χ^{2} expression for testing homogeneity requires four independent estimated parameters yielding 2 df (6 − 4 = 2). For the Northern California Childhood Leukemia Study data, the χ^{2} value is X_{H}^{2} = 0.479 (P = 0.787).
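For readers who want to compute a 2-df marginal homogeneity statistic, the sketch below uses the standard Stuart-Maxwell quadratic form together with its well-known closed form for a 3 × 3 table. Whether this matches the article's compact expression term for term is an assumption, since that expression is written in terms of P̂_{i}. The counts are hypothetical, not the Table 1 data.

```python
# Hypothetical counts (rows = case genotype, columns = control genotype);
# NOT the Table 1 data.
table = [
    [60, 12, 5],
    [15, 40, 9],
    [4, 13, 17],
]


def stuart_maxwell_3x3(counts):
    """Stuart-Maxwell chi-square (2 df) via the 2 x 2 quadratic form d' S^{-1} d."""
    row = [sum(counts[i]) for i in range(3)]
    col = [sum(counts[i][j] for i in range(3)) for j in range(3)]
    d1, d2 = row[0] - col[0], row[1] - col[1]  # marginal differences
    s11 = row[0] + col[0] - 2 * counts[0][0]
    s22 = row[1] + col[1] - 2 * counts[1][1]
    s12 = -(counts[0][1] + counts[1][0])
    det = s11 * s22 - s12 * s12
    return (s22 * d1 * d1 - 2 * s12 * d1 * d2 + s11 * d2 * d2) / det


def compact_3x3(counts):
    """Closed form of the same statistic using pooled discordant counts."""
    row = [sum(counts[i]) for i in range(3)]
    col = [sum(counts[i][j] for i in range(3)) for j in range(3)]
    d = [row[i] - col[i] for i in range(3)]
    f12 = counts[0][1] + counts[1][0]
    f13 = counts[0][2] + counts[2][0]
    f23 = counts[1][2] + counts[2][1]
    numerator = f23 * d[0] ** 2 + f13 * d[1] ** 2 + f12 * d[2] ** 2
    return numerator / (f12 * f13 + f12 * f23 + f13 * f23)


x2_h = stuart_maxwell_3x3(table)
x2_h_compact = compact_3x3(table)
print(round(x2_h, 4), round(x2_h_compact, 4))
```

Only the discordant counts enter either form; the diagonal cancels out of the marginal differences and of the variance terms.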
Conditional Logistic Model
The additive conditional logistic model applied to genotype frequency data collected in a matched design yields estimates of the logarithms of the three ratios of discordant pairs (denoted b_{i}). When the genotypic frequencies are the same for both cases and controls, the model estimated ratio within all three kinds of discordant pairs is 1 (b_{1} = b_{2} = b_{3} = 0). Furthermore, the additive model requires that b_{1} + b_{3} = b_{2}. For the CYP1A1/leukemia data, these estimated log-ratios are
b̂_{1} = −0.170 for AA/AG discordant pairs,
b̂_{2} = 0.110 for AA/GG discordant pairs, and
b̂_{3} = 0.280 for AG/GG discordant pairs.
The corresponding estimated odds ratios (e^{b̂_{i}}) are 0.843, 1.116, and 1.323, respectively.
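The arithmetic linking the reported log-ratios, the additivity constraint, and the odds ratios can be checked in a few lines. The variable names are illustrative; the three coefficients are the rounded values printed above, so the first odds ratio differs from the reported 0.843 in the third decimal place.

```python
import math

# Rounded log-ratios as reported in the text.
b1, b2, b3 = -0.170, 0.110, 0.280

# Additivity of the conditional logistic model: b1 + b3 = b2.
additivity_gap = (b1 + b3) - b2

# Odds ratios are the exponentiated log-ratios.
odds_ratios = [math.exp(b) for b in (b1, b2, b3)]
print([round(value, 3) for value in odds_ratios], round(additivity_gap, 6))
```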
The estimated log-ratios are directly related to the expected cell counts generated by the quasi-independence model. In symbols, the estimates from the quasi-independence model give
In other words, both models generate identical expected counts contained in a 3 × 3 casecontrol array (Table 2).
From another perspective, both models necessarily conform to the relationship that D = (F_{12}F_{23}F_{31}) / (F_{21}F_{32}F_{13}) = 1 or log(D) = 0.
The requirement that D = 1 is essentially the definition of quasi-independence or model additivity (no interaction) within a 3 × 3 array.
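The equivalence between quasi-independence and additivity can be written out directly from F_{ij} = Np_{i}q_{j}. The orientation of each ratio (F_{ij} over F_{ji}) and the arrangement of the product D below are assumptions made for illustration; reversing every ratio changes only the sign of each logarithm and leaves both conclusions intact.

```latex
\begin{align*}
\log\frac{F_{12}}{F_{21}} + \log\frac{F_{23}}{F_{32}}
  &= \log\frac{p_1 q_2}{p_2 q_1} + \log\frac{p_2 q_3}{p_3 q_2}
   = \log\frac{p_1 q_3}{p_3 q_1}
   = \log\frac{F_{13}}{F_{31}}, \\[4pt]
D = \frac{F_{12}\,F_{23}\,F_{31}}{F_{21}\,F_{32}\,F_{13}}
  &= \frac{(p_1 q_2)(p_2 q_3)(p_3 q_1)}{(p_2 q_1)(p_3 q_2)(p_1 q_3)} = 1 .
\end{align*}
```

The first line is the additivity constraint b_{1} + b_{3} = b_{2} expressed in expected counts; the second shows that the factor N and every p_{i} and q_{j} cancel, so the six discordant cells carry exactly one constraint, in agreement with the 1 df of the quasi-independence test.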
The estimated value log(D̂), calculated from the observed discordant pair counts, has estimated variance given by v̂ = ∑1/f_{ij} for i ≠ j = 1, 2, and 3. These two estimates provide a computationally easy additional assessment of the correspondence between the observed and the expected counts generated by the quasi-independence or the additive conditional logistic models.
The estimate of log(D̂) and its estimated variance from the CYP1A1/leukemia data are log(D̂) = −1.264 and v̂ = 2.332, yielding the test statistic X_{Q}′^{2} = [log(D̂)]^{2} / v̂ = (−1.264)^{2} / 2.332 = 0.685. The value X_{Q}′^{2} has an approximate χ^{2} distribution with 1 df when the data (Table 1) randomly differ from the expected counts generated by the quasi-independence or the conditional logistic models (Table 2). In general, this χ^{2} statistic (X_{Q}′^{2}) and the χ^{2} statistic (X_{Q}^{2}) will be similar, particularly for case-control arrays with many discordant pairs.
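A short sketch of this 1-df check follows. The first part uses only the two values reported above; the second part applies the same calculation to hypothetical counts, assuming the usual large-sample variance (the sum of reciprocal off-diagonal counts) and one particular arrangement of the product D. The statistic is unchanged if D is inverted, because the logarithm is squared.

```python
import math


def chi2_sf_1df(x):
    """Upper-tail probability of a chi-square variable with 1 df."""
    return math.erfc(math.sqrt(x / 2))


# Part 1: values reported in the text.
log_d_reported, v_reported = -1.264, 2.332
x2_reported = log_d_reported ** 2 / v_reported
p_reported = chi2_sf_1df(x2_reported)

# Part 2: hypothetical counts (NOT the Table 1 data).
table = [
    [60, 12, 5],
    [15, 40, 9],
    [4, 13, 17],
]


def log_d_test(f):
    """Return (log D-hat, estimated variance, chi-square) from the six discordant cells."""
    log_d = math.log((f[0][1] * f[1][2] * f[2][0]) / (f[1][0] * f[2][1] * f[0][2]))
    variance = sum(1 / f[i][j] for i in range(3) for j in range(3) if i != j)
    return log_d, variance, log_d ** 2 / variance


log_d, variance, x2 = log_d_test(table)
print(round(x2_reported, 3), round(p_reported, 3), round(x2, 3))
```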
It should also be noted that the likelihood ratio test based on the additive conditional logistic model addresses only the hypothesis that b_{1} = b_{2} = b_{3} = 0 or, equivalently, that the ratios of the corresponding discordant pairs are 1.0. The score likelihood ratio test statistic is identical to the previously described test for marginal homogeneity X_{H}^{2}.
Discussion
A symmetrical case-control array of matched pairs data indicates that no association likely exists between case-control status and genotypic frequencies. When statistical evidence of nonsymmetry emerges, two issues become important (i.e., independence and marginal homogeneity). That is, a significant lack of symmetry has two possible sources: the failure of the genotypic frequencies to be independent (failure of the additive conditional logistic model to reflect the data), the failure of the matched data to have the same case and control genotypic frequencies, or both. These two sources of deviation from a symmetrical array are easily identified and indicate different dimensions of the association between genotypic frequencies and disease risk.
In general, the likelihood ratio statistic estimated from an additive conditional logistic model addresses only the issue of marginal homogeneity of the case-control array because an additive model explicitly requires the discordant matched pairs to be independent. That is, substantial differences in the ratios of discordant pairs can exist in a 3 × 3 case-control array with perfectly homogeneous marginal frequencies (X_{H}^{2} = 0) when the genotypic frequencies that determine the numbers of discordant pairs are not independent. Without an assessment of the quasi-independence model (X_{Q}^{2}) or, equivalently, the consistency of the odds ratios (b_{1} + b_{3} = b_{2}), inferences from matched case-control data are potentially biased and even misleading. As with statistical models in general, goodness-of-fit is a critical issue.
The interpretation of quasi-independence of two variables is not different in principle from the interpretation in most contingency tables. Quasi-independence becomes an issue when specific cell frequencies are truncated from consideration. In a matched pairs design, the frequencies on the diagonal cells of the case-control array (the concordant pairs) are not included in the analysis. Nevertheless, two variables are not independent (or not quasi-independent) when the occurrence of one changes the probability of the occurrence of the other. For example, case-control status and genotypic frequencies are not quasi-independent when P(AA | AG is a control) is not equal to P(AA case). A phenotype frequency among the matched pair cases will not be quasi-independent when, for example, cases and controls have differing racial compositions and the phenotypic frequencies under investigation differ among races. In fact, nonindependence potentially arises whenever the controls fail to be a random sample of the population from which the cases were selected.
The χ^{2} test of symmetry is a simultaneous evaluation of both independence and homogeneity. Applied to the CYP1A1/leukemia data, this test produces no evidence of an association between genotypic frequencies and case-control status (X_{S}^{2} = 1.184 with P = 0.757). It then becomes a foregone conclusion that the χ^{2} tests of quasi-independence and marginal homogeneity (X_{Q}^{2} = 0.712 and X_{H}^{2} = 0.479) consist of two nonsignificant pieces.
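The three reported P values can be reproduced from the reported test statistics with closed-form χ^{2} tail probabilities for 1, 2, and 3 df, using only the Python standard library. The helper name `chi2_sf` is illustrative.

```python
import math


def chi2_sf(x, df):
    """Upper-tail probability of a chi-square variable for df = 1, 2, or 3."""
    if df == 1:
        return math.erfc(math.sqrt(x / 2))
    if df == 2:
        return math.exp(-x / 2)
    if df == 3:
        return (math.erfc(math.sqrt(x / 2))
                + math.sqrt(2 * x / math.pi) * math.exp(-x / 2))
    raise ValueError("this sketch handles only df = 1, 2, or 3")


# Test statistics and df as reported in the text.
p_q = chi2_sf(0.712, 1)  # quasi-independence
p_h = chi2_sf(0.479, 2)  # marginal homogeneity
p_s = chi2_sf(1.184, 3)  # symmetry
print(round(p_q, 3), round(p_h, 3), round(p_s, 3))
```

The df also add up across the two components (1 + 2 = 3), mirroring the reading of symmetry as independence plus homogeneity.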
Footnotes

Grant support: U.S. Environmental Health Sciences research grants R01 ES09137 and PS42 ES04705.

 Accepted February 9, 2004.
 Received September 24, 2003.
 Revision received December 22, 2003.