Abstract
Genome-wide association scans for disease susceptibility genes of complex diseases require genotyping on a massive scale. A DNA pooling strategy for family-based association studies is described, which is robust to population stratification biases and to errors in pooling. It can achieve a statistical efficiency of 0.95 with ∼1 of 8 or fewer genotyping efforts, and an efficiency of 0.90 with ∼1 of 16 or fewer efforts compared with individual genotyping. The pooling method described in this article provides a tradeoff between genotyping efforts and subject recruitment efforts.
 DNA pooling
 family study
 genetic association study
Introduction
A genome-wide association study to search for the disease susceptibility gene(s) of a complex disease requires genotyping on a massive scale, in terms either of the number of markers that have to be scanned or of the number of individuals that have to be recruited (1-3). A practical way to reduce genotyping efforts is to carry out analyses on pools made up of DNA from many individuals rather than on individual samples (4-9).
Previous studies on DNA pooling considered the simplest case-control design, in which the cases and the unrelated controls were sampled from a homogeneous population in Hardy-Weinberg equilibrium (6-9). However, family-based association designs have become popular in the recent decade, in part because they are robust to population stratification biases (10-15). These designs recruit relatives of the cases as the control subjects. One can distinguish between two lines of relatives (15): line I, the parents or, if the parents are missing, the unaffected siblings; and line II, the spouse or, if the spouse is missing, the offspring. For early-onset diseases, the sample of line I relatives should be simple to collect, whereas for late-onset diseases, one could easily turn to line II relatives instead. Furthermore, the information from both lines of relatives (if available) can be integrated to improve the study power (15). (Note that the relatives here include those related to the affected cases genetically and those related only by marriage. However, we still refer to the design as a “family”-based study, because these subjects literally come from the same family.)
Risch and Teng (4) have constructed statistical tests for case-parents and case-siblings DNA pooling studies. However, the tests are valid only if the study population is in Hardy-Weinberg equilibrium. This defeats one purpose of a family-based study, which is meant to be robust to population structure. Furthermore, they assumed the allele frequency measurement for a DNA pool to be 100% accurate, which is not possible with current technologies. In this article, a robust and efficient DNA pooling strategy for family-based association studies is described.
DNA Pooling Strategy
Assume that there is only one affected case in a (nuclear) family. Let two numbers (x and y) distinguish the various family configurations. The x is the number of line I relatives (parents or unaffected siblings) an affected case has (parents are treated as if x = ∞). The y is the number of line II relatives (spouse or offspring) an affected case has (spouse is treated as if y = ∞). Assume that we form a total of J (j = 1, …, J) “pooling sets” in a study. Each pooling set contains a certain number of families exclusively of the same family configuration. Let (x_{j}, y_{j}) represent the family configuration of the jth set, and n_{j}, the number of families in the jth set. (The problem of how many families should be put into a pooling set will be discussed later.)
In the jth pooling set, the affected cases are pooled into a single “case pool”. The allele frequency of the pool is measured using quantitative PCR. The result is denoted as C_{j}. Note that here and hereinafter, we do not attempt to correct for the unequal allelic amplification often associated with quantitative PCR (6, 7). If parents are available (x_{j} = ∞), the fathers in this jth set are pooled into a single “father pool”, and the mothers into a single “mother pool”. The measured allele frequency of the father (mother) pool is denoted as F_{j} (M_{j}). Then we calculate, for all the families in the jth pooling set, D_{j}^{I} = total allele counts for the transmitted genotypes (the affected cases) − total allele counts for the nontransmitted genotypes = 2n_{j}C_{j} − 2n_{j}(F_{j} + M_{j} − C_{j}) = 4n_{j}[C_{j} − (F_{j} + M_{j}) / 2]. If parents are not available but unaffected siblings are (0 < x_{j} < ∞), we form a total of x_{j} “sibling pools” (k = 1, …, x_{j}), with the kth pool containing each and every one of the kth eldest unaffected siblings of the families in the jth set. The measured allele frequency for the kth sibling pool in the jth set is denoted as B_{jk}. We then calculate, for all the families in the jth pooling set, D_{j}^{I} = total allele counts for the transmitted genotypes (the affected cases) − total allele counts for the “imputed nontransmitted” genotypes (see ref. 15).
Next, let us turn to the line II relatives. If spouses are available (y_{j} = ∞), the spouses are pooled into a single “spouse pool”. The measured allele frequency of the spouse pool is denoted as S_{j}. We calculate, for all the families in the jth pooling set, D_{j}^{II} = total allele counts for the affected cases − total allele counts for the spouses = 2n_{j}(C_{j} − S_{j}). If spouses are not available but offspring are (0 < y_{j} < ∞), we form a total of y_{j} “offspring pools” (l = 1, …, y_{j}), with the lth pool containing each and every one of the lth eldest offspring of the families in the jth set. The measured allele frequency for the lth offspring pool in the jth set is denoted as O_{jl}. We then calculate, for all the families in the jth pooling set, D_{j}^{II} = total allele counts for the affected cases − total allele counts for the “imputed spouses” (see refs. 13, 15).
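The two closed-form differences given above (parents available, spouses available) can be sketched in a few lines. This is only an illustration: the function names and the numeric pool frequencies are hypothetical, and the sibling and offspring cases are left out because their imputation formulas are given in refs. 13 and 15 rather than here.

```python
def line1_difference_parents(n, case_freq, father_freq, mother_freq):
    """D^I for a pooling set with parents: 4n[C - (F + M)/2]."""
    return 4 * n * (case_freq - (father_freq + mother_freq) / 2)


def line2_difference_spouses(n, case_freq, spouse_freq):
    """D^II for a pooling set with spouses: 2n(C - S)."""
    return 2 * n * (case_freq - spouse_freq)


if __name__ == "__main__":
    # Hypothetical pooling set: 10 families, made-up measured pool frequencies.
    n = 10
    C, F, M, S = 0.45, 0.40, 0.40, 0.40
    print("D^I  =", line1_difference_parents(n, C, F, M))
    print("D^II =", line2_difference_spouses(n, C, S))
```

Each function takes only the pool-level measurements and the number of families, which is all the pooled design makes available.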
Expected Total Allele Differences
If the study population is a homogeneous population, or is a stratified population in which mating is restricted to subjects in the same stratum (13), the probability distribution of the allele frequency for the case pool under the null will be exactly the same as those for the father pool, the mother pool, the spouse pool, and each and every one of the sibling and the offspring pools in the same pooling set. Therefore, the expected values of D_{j}^{I} and D_{j}^{II} are both zero under the null hypothesis of no genetic association, no matter how “biased” a quantitative PCR may be (a complex error structure that is asymmetric for the alleles in a pool, or an allelic amplification that is a nonlinear function of allele frequency, etc.). By contrast, a “greedy” strategy that puts all line I relatives and line II relatives into a “grand control” pool does not have this property. A grand control pool so formed will contain more subjects than the corresponding case pool. Consequently, even under the null hypothesis, the distributions of the allele frequencies will not be the same for the case pool and the grand control pool in a pooling set: the larger pool has a true allele frequency that is more tightly concentrated around the population value (the law of large numbers), so a nonlinear measurement bias affects the two pools differently.
To better illustrate the point, Table 1 presents the probability distributions for the various DNA pools in a hypothetical pooling set, which contains two cases and their parents recruited from a Hardy-Weinberg population with allele frequency of 0.4. The quantitative PCR was assumed to have a very complex error structure (column 2 of Table 1) in that it will amplify an allele if that allele wins a majority in a pool. Moreover, the amplification is not a linear function of the allele frequency and is not symmetric for the two alleles. It can be seen that the expected allele frequencies measured by this grossly biased PCR are exactly the same (0.3885) for the case pool, the father pool, and the mother pool (although the values are not equal to the true value of 0.4). By contrast, a grand control pool formed by putting the father and the mother pools together has a different expected value (0.3587) by the same PCR.
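The same effect can be reproduced by exact enumeration. The sketch below uses a deliberately simple, hypothetical bias function (the reading is the square of the true pool frequency), not the error structure of Table 1, so its numbers differ from 0.3885 and 0.3587. It assumes the null hypothesis and Hardy-Weinberg equilibrium, under which the allele count in a pool of m subjects is binomial with 2m trials.

```python
from math import comb


def pool_count_distribution(n_subjects, p):
    """Binomial distribution of the allele count in a pool (null, Hardy-Weinberg)."""
    n_alleles = 2 * n_subjects
    return [comb(n_alleles, k) * p**k * (1 - p) ** (n_alleles - k)
            for k in range(n_alleles + 1)]


def biased_reading(true_freq):
    """Hypothetical nonlinear, asymmetric PCR response (an assumption, not Table 1)."""
    return true_freq ** 2


def expected_reading(n_subjects, p):
    """Expected biased reading for a pool of n_subjects, by exact enumeration."""
    n_alleles = 2 * n_subjects
    dist = pool_count_distribution(n_subjects, p)
    return sum(prob * biased_reading(k / n_alleles) for k, prob in enumerate(dist))


if __name__ == "__main__":
    p = 0.4
    # Case, father, and mother pools each hold 2 subjects: identical expectation.
    print("2-subject pool (case/father/mother):", expected_reading(2, p))
    # A grand control pool of fathers plus mothers holds 4 subjects.
    print("4-subject grand control pool:       ", expected_reading(4, p))
```

Because the expectation depends only on the pool size, pools of equal size have equal expected readings whatever the bias function, whereas the larger grand control pool does not.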
Disequilibrium Test for Pooled Data
Because E(D_{j}^{I}) = E(D_{j}^{II}) = 0, we have E(w_{j}^{I}D_{j}^{I} + w_{j}^{II}D_{j}^{II}) = 0 for arbitrary coefficients w_{j}^{I} and w_{j}^{II} under the null (condition 1). Furthermore, because subjects from the same family appear in one and only one pooling set and the measurements of the various DNA pools are done independently, the w_{j}^{I}D_{j}^{I} + w_{j}^{II}D_{j}^{II} for j = 1, …, J are independent of one another (condition 2). A disequilibrium test statistic, Z^{2}, is constructed for pooled data from these J quantities. For large J (e.g., J > 30), Z^{2} is distributed as a one-degree-of-freedom χ^{2} distribution under the null, for whatever values are chosen for the w_{j}^{I} and w_{j}^{II} (as long as they are not both zero). Note that conditions 1 and 2 stated above are sufficient to ensure that Z^{2} is a valid test for genetic association. It does not matter whether or not the PCR is unbiased or has a simple error structure (e.g., can be described using a single parameter), or whether or not the population under study is in Hardy-Weinberg equilibrium.
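As a sketch only, one statistic that needs nothing beyond conditions 1 and 2 is the empirical-variance form below. It is an assumed form, chosen because it uses the J independent, mean-zero terms to estimate their own variance; the symbol T_j is introduced here for illustration.

```latex
T_j = w_j^{I} D_j^{I} + w_j^{II} D_j^{II}, \qquad
Z^{2} = \frac{\left( \sum_{j=1}^{J} T_j \right)^{2}}{\sum_{j=1}^{J} T_j^{2}}
```

Under the null, the numerator sum has mean zero and the denominator estimates its variance, which is why a large J is needed for the χ^{2} approximation.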
In this article, the same coefficients w_{j}^{I} and w_{j}^{II} previously proposed for individually genotyped data (15) are used for pooled data. These coefficients are optimal under the null hypothesis and in a Hardy-Weinberg population. Note that the sum of the two coefficients, w_{j}^{I} + w_{j}^{II}, is in general not equal to one. Rather, it reflects the statistical efficiency for family configuration (x_{j}, y_{j}), relative to case-parents data (∞, 0).
Statistical Efficiency
Assume that the study population is a random-mating population in Hardy-Weinberg equilibrium with allele frequency of P. Lee (15) derived the variance of w_{j}^{I}D_{j}^{I} + w_{j}^{II}D_{j}^{II} under the null hypothesis for allele counts based on individual genotyping (assuming no genotyping error); denote it V_{j}^{individual}. With pooled genotyping, the variance is composed of two terms (i.e., V_{j}^{pooled} = V_{j}^{individual} + V_{j}^{quantitative PCR}). To calculate V_{j}^{quantitative PCR}, we assume that the measurement error of the quantitative PCR in estimating the allele frequency of a DNA pool is a constant σ, irrespective of the actual allele frequency. As an example, in a pooling set with affected cases together with their fathers, their mothers, and their three offspring (∞, 3), w_{j}^{I}D_{j}^{I} + w_{j}^{II}D_{j}^{II} is computed from six measured frequencies (C_{j}, F_{j}, M_{j}, O_{j1}, O_{j2}, and O_{j3}), and the variance associated with the quantitative PCR follows from the independent measurement errors of these six pools.
For a general pooling set with family configuration (x_{j}, y_{j}), V_{j}^{quantitative PCR} can be written in closed form with the help of an indicator function I_{(statement)} (value = 1 if the statement is true; value = 0 otherwise). The statistical efficiency of the pooled study relative to individual genotyping can thus be approximated by V_{j}^{individual} / V_{j}^{pooled}. To achieve a statistical efficiency of c (0 < c < 1), the number of families in a pooling set must be kept below a certain limit, which depends on the family configuration of the set.
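For the two configurations whose differences are given explicitly earlier, the calculation can be worked through directly. The derivation below is a sketch: it assumes that σ is the standard deviation of each pool measurement, that pool measurements are independent, and that efficiency is the ratio V^{individual} / V^{pooled}. With a single line of relatives the coefficient w_{j} cancels from the ratio, so it is omitted.

```latex
% Case-parents (\infty, 0): D_j^{I} = 4 n_j C_j - 2 n_j F_j - 2 n_j M_j
V^{\mathrm{individual}} = 4 n_j P(1-P), \qquad
V^{\mathrm{PCR}} = (16 + 4 + 4)\, n_j^{2} \sigma^{2} = 24\, n_j^{2} \sigma^{2}

\text{efficiency} = \frac{1}{1 + 6 n_j \sigma^{2} / [P(1-P)]}
\;\ge\; c
\iff
n_j \le \frac{(1-c)\, P(1-P)}{6\, c\, \sigma^{2}}

% Case-spouse (0, \infty): D_j^{II} = 2 n_j C_j - 2 n_j S_j
V^{\mathrm{individual}} = 4 n_j P(1-P), \qquad
V^{\mathrm{PCR}} = (4 + 4)\, n_j^{2} \sigma^{2} = 8\, n_j^{2} \sigma^{2}

\text{efficiency} = \frac{1}{1 + 2 n_j \sigma^{2} / [P(1-P)]}
\;\ge\; c
\iff
n_j \le \frac{(1-c)\, P(1-P)}{2\, c\, \sigma^{2}}
```

In both cases V^{individual} grows linearly in n_j while V^{PCR} grows quadratically, which is why the pool size has to be capped.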
Note that the above calculations were carried out under the null hypothesis, with a constant σ, and in a homogeneous population in Hardy-Weinberg equilibrium. A more relevant comparison of the methods should be made with respect to the power under a specific alternative hypothesis, for a more complex but realistic error structure, and in a structured population. Yet, the simple formulas presented above provide a concrete guideline for forming DNA pools. Adhering to the guideline should keep the study efficiency close to optimal for a gene with a weak effect, a PCR with nearly constant error, and a population that does not deviate much from Hardy-Weinberg equilibrium.
Assuming σ = 0.01 (corresponding to the method of mass spectrometry; ref. 6), we found that the pooling strategy can achieve a statistical efficiency of 0.95 with ∼1 of 8 or fewer genotyping efforts (Table 2), and a statistical efficiency of 0.90 with ∼1 of 16 or fewer efforts (Table 3), compared with individual genotyping. The reduction in genotyping efforts is greater when the frequency of the minor allele is higher, and smaller when it is lower. In a Hardy-Weinberg population, a case-control study with equal numbers of cases and controls is equivalent to a case-spouse study in terms of statistical efficiency. Tables 2 and 3 (right upper corners) indicate that a pooled case-control study (and a pooled case-spouse study) has the greatest reduction in genotyping efforts (a statistical efficiency of 0.95 with ∼1 of 20 or fewer genotyping efforts and a statistical efficiency of 0.90 with ∼1 of 50 or fewer efforts). However, it should be pointed out again that a simple case-control study, with individual genotyping and pooled genotyping alike, offers no protection against population stratification biases.
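A rough numerical check of these orders of magnitude is possible for the two simplest designs. The sketch below is built on assumptions rather than on the tables: efficiency is taken as 1 / (1 + k·n·σ² / [P(1 − P)]), with k = 6 for case-parents pools (D^{I} = 4n[C − (F + M)/2]) and k = 2 for case-spouse pools (D^{II} = 2n(C − S)), σ is treated as the standard deviation of each independent pool measurement, and all names are hypothetical.

```python
import math

# Assumed PCR variance factor k for each design (see lead-in).
PCR_FACTOR = {"case-parents": 6, "case-spouse": 2}


def efficiency(n, p, sigma, design):
    """Approximate efficiency of a pooled set of n families versus individual typing."""
    k = PCR_FACTOR[design]
    return 1.0 / (1.0 + k * n * sigma**2 / (p * (1 - p)))


def max_families(c, p, sigma, design):
    """Largest n whose approximate efficiency is still at least c."""
    k = PCR_FACTOR[design]
    return math.floor((1 - c) * p * (1 - p) / (k * c * sigma**2))


if __name__ == "__main__":
    sigma = 0.01
    for design in PCR_FACTOR:
        for c in (0.95, 0.90):
            for p in (0.1, 0.3, 0.5):
                n = max_families(c, p, sigma, design)
                print(f"{design:12s} c={c:.2f} P={p:.1f} -> up to {n} families per pool")
```

In these designs each pool replaces n individual genotypings, so the limit on n is also the approximate reduction factor in genotyping effort. Under these assumptions, an allele frequency of 0.1 gives limits of 7 (case-parents) and 23 (case-spouse) families at an efficiency of 0.95, the same order as the ∼1 of 8 and ∼1 of 20 figures quoted above.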
Discussion
For most complex diseases, such as non-insulin-dependent diabetes, cardiovascular diseases, Alzheimer's disease, and many forms of cancer, the incidence rates in the population are very low, so that in the great majority of families there is at most one affected case. In practice, one does occasionally encounter families with multiple affected siblings, or with affected subjects across multiple generations, etc. Such complex family configurations cannot be described using the (x, y) system in this article, and it is difficult to design a pooling strategy for them. However, because such families represent a very small portion of the data, subjects in these families can be individually typed without too much inconvenience.
Previously, DNA pooling has usually been embedded in a two-stage procedure (5, 8, 9). The first stage is a DNA pooling study, which scans across the genome for markers that have different frequencies in the pools of cases and controls. The markers identified by the first-stage study are then followed up by a second-stage individual-typing study. Without the follow-up, DNA pooling alone will produce an excess of false-positive results, because (i) correction of the measurement errors of DNA pools and the unequal allelic amplification (6, 7) might not be sufficiently accurate and (ii) estimation of the allele frequencies in DNA pools is valid only if Hardy-Weinberg equilibrium holds. The method in this article will maintain the nominal α level even for an error-prone PCR and in a stratified population. This is a property not shared by previous pooling methods (4-9). One may instead argue that some inflation of the type I error rate in the first-stage pooling study is tolerable, or even welcome, because this would identify more markers for the second-stage individual genotyping and hence decrease the overall type II error rate. This is a misconception, however. In fact, one can raise the nominal α level of the present method to a level comparable with the inflated α level of the previous pooling methods and achieve the same or an even lower overall type II error rate.
As shown in this article, the proposed DNA pooling method can achieve a very high statistical efficiency relative to individual typing. This means that if one can recruit just a few more families for study (e.g., 5-10% more), one can rightfully resort to DNA pooling and do away with the costly and laborious individual follow-ups entirely. This single-stage pooling study needs 1 of 8 to 1 of 16 or fewer genotyping efforts compared with an individual-typing study. It should be noted that, by trading more subjects for fewer genotypings, neither the type I nor the type II error rate has to be compromised using the DNA pooling strategy in this article. With the rapid progress in biotechnology, pooled genotyping is expected to become even more accurate in the future and thus to have even higher statistical efficiency. This means that, compared with the pooling sets described in Tables 2 and 3, more families can be put together and fewer genotyping efforts will be required. On the other hand, it may be that today's costly and laborious individual genotyping will become an easy matter in the future, leaving no cause for a DNA pooling study. Until that day comes, the pooling strategy presented in this article provides a tradeoff between genotyping efforts and subject recruitment efforts.
Footnotes

Grant support: National Science Council, Republic of China.

The costs of publication of this article were defrayed in part by the payment of page charges. This article must therefore be hereby marked advertisement in accordance with 18 U.S.C. Section 1734 solely to indicate this fact.
 Received July 7, 2004.
 Revision received November 11, 2004.
 Accepted November 23, 2004.