## Abstract

**Background:** There is uncertainty about the benefits of using genome-wide sequencing to implement personalized preventive strategies at the population level, with some projections suggesting little benefit. We used data for all currently known breast cancer susceptibility variants to assess the benefits and harms of targeting preventive efforts to a population subgroup at highest genomic risk of breast cancer.

**Methods:** We used the allele frequencies and effect sizes of 86 known breast cancer variants to estimate the population distribution of breast cancer risks and evaluate the strategy of targeting preventive efforts to those at highest risk. We compared the efficacy of this strategy with that of a “best-case” strategy based on a risk distribution estimated from breast cancer concordance in monozygous twins, and with strategies based on previously estimated risk distributions.

**Results:** Targeting those in the top 25% of the risk distribution would include approximately half of all future breast cancer cases, compared with 70% captured by the best-case strategy and 35% based on previously known variants. In addition, current evidence suggests that reducing exposure to modifiable nongenetic risk factors will have greatest benefit for those at highest genetic risk.

**Conclusions:** These estimates suggest that personalized breast cancer preventive strategies based on genome sequencing will bring greater gains in disease prevention than previously projected. Moreover, these gains will grow as the genetic etiology of breast cancer becomes better understood.

**Impact:** These results support the feasibility of using genome-wide sequencing to target the women who would benefit from mammography screening. *Cancer Epidemiol Biomarkers Prev; 23(11); 2322–7. ©2014 AACR*.

This article is featured in Highlights of This Issue, p. 2199

## Introduction

Describing an article in *Science Translational Medicine*, entitled “The predictive capacity of personal genome sequencing,” by Roberts and colleagues (1), an editor wrote:

> Imagine that everyone at birth could have their whole genome sequenced at negligible cost. Surely, this must be a worthwhile endeavor, given the list of luminaries who have already had this sequencing completed. But how well will such tests perform?

The authors address this question by using the incidence data of twins for a range of diseases to estimate the percentage of the population and the percentage of diseased cases whose genomes would classify them at high risk, and the risk among those whose genomes put them at low risk relative to the population. They conclude that the genomes of most people would classify them at low risk for most of the diseases, and that the predictive value of such a negative test result would generally be small, because the disease risks among those who test negative would be similar to those in the general population. These conclusions have been criticized as deriving from assumptions that involve no information about the genetic factors relevant to the diseases studied (2). Here, we compare the breast cancer findings of Roberts and colleagues (1) with those obtained using information on the currently known breast cancer susceptibility loci.

To be specific, consider the strategy of sequencing the genomes of all young women, and using their genotypes at breast cancer loci to construct for each a genomic risk score, which specifies the rank of her inherited breast cancer risk relative to that of others in the population. These ranks can then be used to classify women into high- and low-risk categories, with high-risk women targeted for more intensive screening and preventive efforts. At present we do not fully know a person's breast cancer risk score, but genome sequencing coupled with current knowledge would allow us to assign her a partial score by combining her genotypes at all known breast cancer loci with the effect sizes of the risk alleles at these loci. Here, we shall evaluate and compare the efficacy of this strategy with that of (i) the best-case classification theoretically possible if we knew the full risk scores and (ii) the estimates of Roberts and colleagues (1).

## Materials and Methods

### Population studied

As the frequencies and effect sizes of breast cancer susceptibility loci may differ by race/ethnicity, we restrict analyses to women of European ancestry. This study did not require approval from an ethical review board, as it involved only the use of published summary data.

### Statistical analysis

We modeled the lifetime probability of developing breast cancer (i.e., the absolute risk) for an individual with genomic risk score *s* as $R(s) = 1 - \exp(-c\,e^{s})$, where *c* is a positive constant. This model specifies a monotonically increasing relationship between risk *R* and risk score *s*. Therefore, the population percentile of the risk of an individual equals that of her risk score, and the efficacy of percentile-based stratification depends on the variances of the risk score distributions in the population and among future breast cancer cases. For the theoretically derived distribution of fully known risk scores, these variances can be estimated using the arguments of Pharoah and colleagues (3, 4) and Begg and Pike (5), as described in the Supplementary Materials and Methods.

To estimate the variance of the partially known risk scores, we modeled them as linear combinations of genotypes at a set of uncorrelated breast cancer loci from the literature, with coefficients given by their estimated effect sizes. Specifically, we identified 93 breast cancer susceptibility loci with validated associations in women of European ancestry by reviewing the literature for established, replicated associations (6–15). We then selected a subset of 86 uncorrelated loci by computing all 4,278 pairwise correlation coefficients among the 93 loci using the SNP Annotation and Proxy Search tool (<http://www.broadinstitute.org/mpg/snap/ldsearch.php>) for the CEU population from the 1000 Genomes Project. We found seven pairs of loci with squared correlation coefficient exceeding 0.25 (16), and for each pair we retained the locus with the larger value of *βp*(1 − *p*), where *β* is the log relative risk associated with the variant and *p* is its allele frequency. The seven SNPs that we excluded were rs10069690, rs3215401, rs2943559, rs10759243, rs11199914, rs494406, and rs75915166.
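The pruning rule described above can be sketched in Python as follows. This is a minimal illustration rather than the authors' code: the function name `prune_correlated`, the locus names, the effect sizes, the allele frequencies, and the r² value are all hypothetical, and the handling of a locus that appears in more than one correlated pair is an assumption, because the text does not say how such cases were resolved.

```python
from itertools import combinations
from math import comb

# Number of distinct pairs among 93 candidate loci.
print(comb(93, 2))  # 4278


def prune_correlated(loci, r2, threshold=0.25):
    """Drop the weaker locus from every pair whose r^2 exceeds the threshold.

    loci: dict mapping locus name -> (beta, p), with beta the log relative
          risk and p the risk allele frequency.
    r2:   dict mapping frozenset({name_a, name_b}) -> squared correlation;
          pairs that are missing are treated as uncorrelated (assumption).
    """
    def weight(name):
        beta, p = loci[name]
        return beta * p * (1 - p)  # ranking criterion quoted in the text

    dropped = set()
    for a, b in combinations(loci, 2):
        if r2.get(frozenset((a, b)), 0.0) > threshold:
            # Keep the locus with the larger weight, drop the other one.
            dropped.add(min((a, b), key=weight))
    return [name for name in loci if name not in dropped]


# Hypothetical toy input, not taken from Table 1.
example_loci = {"snpA": (0.10, 0.30), "snpB": (0.20, 0.30), "snpC": (0.05, 0.50)}
example_r2 = {frozenset(("snpA", "snpB")): 0.80}
print(prune_correlated(example_loci, example_r2))
```

In the toy input only `snpA` and `snpB` are correlated, so the sketch keeps `snpB` (the larger weight) together with the uncorrelated `snpC`.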

The remaining 86 loci consist of (i) genes containing rare variants of high and moderate penetrance and (ii) SNPs identified in genome-wide association studies of breast cancer. We used the cumulative allele frequency and took the relative risk estimates for rare variants in breast cancer susceptibility genes to be the midpoints of the ranges spanned by the published studies. We also used the averages of the risk allele frequencies and relative risk (OR) estimates for SNPs that were associated with breast cancer in multiple genome-wide association studies.

Table 1 shows the 86 loci, and the frequencies and effect sizes of their risk alleles (6–15). We modeled the combined effects of these 86 loci by assuming that they act multiplicatively on a woman's cumulative hazard for breast cancer. As shown in the Supplementary Materials and Methods, this implies that her partially known risk score has the additive form $s = \beta_1 g_1 + \cdots + \beta_{86} g_{86}$, where $g_k$ = 0, 1, 2 denotes her count of risk alleles at locus *k*, and $\beta_k$ denotes the effect size of the risk allele at locus *k* as obtained from the literature, *k* = 1, …, 86. Determining the variance of the resulting partial scores is infeasible, as it would require summing over all $3^{86} \approx 10^{41}$ possible multilocus genotypes. However, it can be approximated by random genotype sampling as described in the Supplementary Materials and Methods.
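A minimal sketch of what random genotype sampling could look like is given below. It assumes Hardy-Weinberg proportions at each locus and independence between loci; both are assumptions made here for illustration, and the authors' actual procedure is the one in their Supplementary Materials and Methods. The function names and the three effect sizes and allele frequencies are hypothetical and are not taken from Table 1.

```python
import random
import statistics


def sample_score(betas, freqs, rng):
    """Draw one multilocus genotype and return its additive risk score."""
    score = 0.0
    for beta, p in zip(betas, freqs):
        # Risk allele count g in {0, 1, 2}: two independent allele draws,
        # i.e. Hardy-Weinberg proportions (assumption).
        g = int(rng.random() < p) + int(rng.random() < p)
        score += beta * g
    return score


def estimate_score_variance(betas, freqs, n_samples=20000, seed=0):
    """Monte Carlo estimate of the population variance of the risk score."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    scores = [sample_score(betas, freqs, rng) for _ in range(n_samples)]
    return statistics.variance(scores)


# Hypothetical inputs: three loci instead of the 86 used in the article.
toy_betas = [0.10, 0.20, 0.30]   # log relative risks per risk allele
toy_freqs = [0.30, 0.50, 0.10]   # risk allele frequencies
print(estimate_score_variance(toy_betas, toy_freqs))
```

Sampling replaces the sum over all multilocus genotypes with an average over randomly drawn ones, so the cost grows with the number of samples rather than with the number of possible genotypes.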

### Performance of risk score–based classification

We estimated how well we can identify future breast cancer cases by classifying women into high-risk (targeted) and low-risk (untargeted) subgroups based on the percentiles of their fully and partially known risk scores (see Supplementary Materials and Methods for details). Specifically, we estimated the sensitivity (Sn), specificity (Sp), positive predictive value (PPV), negative predictive value (NPV), and risk in untargeted women relative to that of the population.
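The five measures listed above are linked by simple bookkeeping once the targeted fraction, the sensitivity, and the population lifetime risk are fixed. The sketch below shows those relationships; it is an illustration only, the names `Performance` and `classification_performance` are hypothetical, and the lifetime risk of 0.12 used in the example is an illustrative value that does not come from the article.

```python
from dataclasses import dataclass


@dataclass
class Performance:
    sensitivity: float
    specificity: float
    ppv: float
    npv: float
    relative_risk_untargeted: float


def classification_performance(targeted_fraction, sensitivity, lifetime_risk):
    """Derive Sp, PPV, NPV and the untargeted relative risk from Sn.

    All quantities are expressed as fractions of the whole population.
    """
    cases_targeted = sensitivity * lifetime_risk          # true positives
    cases_untargeted = (1 - sensitivity) * lifetime_risk  # false negatives
    noncases_untargeted = (1 - targeted_fraction) - cases_untargeted

    specificity = noncases_untargeted / (1 - lifetime_risk)
    ppv = cases_targeted / targeted_fraction
    npv = noncases_untargeted / (1 - targeted_fraction)
    # Risk among untargeted women divided by the population risk.
    risk_untargeted = cases_untargeted / (1 - targeted_fraction)
    return Performance(sensitivity, specificity, ppv, npv,
                       risk_untargeted / lifetime_risk)


# Illustrative inputs: 10% of women targeted, Sn = 0.47, and a hypothetical
# lifetime risk of 0.12.
print(classification_performance(0.10, 0.47, 0.12))
```

The relative risk in untargeted women reduces to (1 − Sn)/(1 − targeted fraction) and so does not depend on the lifetime risk; with Sn = 0.47 and 10% of women targeted it is 0.53/0.90, or about 0.59.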

## Results

We estimated the population variance of the risk scores based on the 86 currently known breast cancer susceptibility variants to be 0.35. This variance, while lower than the estimate of 1.44 for the variance of the fully known risk scores determined using the arguments of Pharoah and colleagues (3, 4), is nevertheless considerably higher than the value 0.07 obtained for the risk scores based on the seven loci known in 2008 (4).

Figure 1 shows the percentage of breast cancer cases included among women whose risk scores exceed the 100(1 − α)th percentile, for 0 < α < 1. The curves correspond to the best-case classification with risk score variance equal to 1.44 (solid curve), the currently feasible classification based on partially known risk scores, with variance equal to 0.35 (dashed curve), and the classification based on the seven loci known in 2008 (4) with variance equal to 0.07 (dotted curve). As the efficacy of risk stratification for prevention depends on the population variance of the risk scores, these results indicate that (i) current genetic knowledge far exceeds that in 2008 (4); and (ii) despite these gains, considerably better stratification should still be possible in the future, as we better understand the etiology of this disease.
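The dependence of case capture on the risk score variance can be illustrated with a short calculation. The sketch below assumes a centered Gaussian risk score with variance σ² and a risk proportional to exp(*s*) (a rare-disease approximation); under those assumptions the scores of future cases are Gaussian with mean σ² and the same variance, so the fraction of cases in the top fraction α of scores is 1 − Φ(z − σ), where z is the standard normal quantile at 1 − α. This is a simplification for illustration, not the authors' exact calculation, and the function name is hypothetical.

```python
from statistics import NormalDist


def fraction_of_cases_targeted(variance, top_fraction):
    """Share of future cases whose scores fall in the top `top_fraction`.

    Assumes Gaussian scores and risk proportional to exp(score).
    """
    std_normal = NormalDist()
    sigma = variance ** 0.5
    # Standardized score threshold for the targeted group.
    z = std_normal.inv_cdf(1 - top_fraction)
    # Case scores are shifted upward by sigma in standardized units.
    return 1 - std_normal.cdf(z - sigma)


for label, variance in [("best case", 1.44),
                        ("86 known loci", 0.35),
                        ("seven loci known in 2008", 0.07)]:
    print(label, round(fraction_of_cases_targeted(variance, 0.25), 2))
```

For the top 25% of scores this approximation gives roughly 0.70, 0.47, and 0.34 for variances of 1.44, 0.35, and 0.07, which is close to the 70%, approximately half, and 35% quoted in the Abstract; small differences are expected because the approximation ignores the saturation of risk at high scores.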

Table 2 shows additional measures of discrimination obtained by classifying women as high risk or low risk on the basis of their risk scores, where the high-risk group is defined as those whose scores exceed the 100(1 − α)th percentile of the centered Gaussian risk score distribution. How do these results compare with the breast cancer predictions of Roberts and colleagues (1)? The latter were based on a high-risk group defined as those whose risks exceed the 90th to 95th percentile of the population distribution. The authors estimated that this classification would target between 10% and 35% of all future breast cancer cases. In contrast, Table 2 shows that the percentage of cases targeted would be approximately 47% using the best-case classification and 32% using the currently feasible classification. In addition, the authors estimated that the ratio of risk among women classified at low risk relative to the population would be as high as 0.72 to 0.90, indicating poor specificity. Yet, Table 2 suggests that this relative risk is lower: 0.59 with the best-case classification and 0.75 with the currently feasible classification. Thus, the present estimates provide more optimistic projections than those obtained using the theoretical model of Roberts and colleagues (1).

## Discussion

We have estimated the variance of breast cancer risk scores among women of European ancestry by using a multiplicative model for the joint effects of currently known breast cancer loci. We found that the distribution of these partially known risk scores has variance 0.35, which is similar in magnitude to the estimate of 0.28 obtained independently by Pashayan and colleagues (17). We have compared the performance of targeted preventive measures based on the currently feasible partially known risk scores with those obtained using the theoretical distribution of fully known risk scores derived by Pharoah and colleagues (3, 4). The results suggest that the predictive power of genome sequencing to determine breast cancer risk is considerably greater than that described by Roberts and colleagues (1), and the estimates contradict the authors' statement that “…our conclusions […] represent an absolute upper bound that cannot be improved by improvements in technology or genetic knowledge.”

To achieve the optimal predictive power represented by the “best-case” classification, we will need to identify the combined effects of all causal alleles for breast cancer. Moreover, better understanding of gene–environment interactions could further improve predictive power. Indeed, the performance measures described here underestimate the potential value of genomic-based risk classification. This is because a child's lifetime breast cancer risk is determined not only by her genome, but also by her future levels of nongenetic (lifestyle, environmental, and epigenetic) risk factors. Epidemiologic data support a multiplicative model for the joint effects of genetic and nongenetic factors on breast cancer risk (18). Under this model, a modifiable nongenetic factor associated with an overall 50% increase in risk would add considerably more to the absolute risk of a woman whose lifetime genetic risk is 36% (increasing it to 54%) than to that of a woman whose lifetime genetic risk is 4% (increasing it only to 6%). Thus, high-risk women have considerably more to gain by appropriate choices of lifestyle factors than do low-risk women.
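The arithmetic behind this comparison, written out under the multiplicative model with a relative risk of 1.5 for the nongenetic factor, is:

$$
0.36 \times 1.5 = 0.54 \quad (\text{absolute increase } 0.18), \qquad 0.04 \times 1.5 = 0.06 \quad (\text{absolute increase } 0.02).
$$

The same relative increase therefore corresponds to an absolute increase nine times larger for the higher-risk woman (0.18 versus 0.02), which is simply the ratio of the two baseline risks (0.36/0.04 = 9).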

In conclusion, the data-based estimates presented here suggest that personalized breast cancer preventive strategies informed by genome sequencing may bring greater gains in cost-efficient disease prevention than previously projected. Moreover, these gains will grow as the etiology of breast cancer becomes better understood.

## Disclosure of Potential Conflicts of Interest

No potential conflicts of interest were disclosed.

## Authors' Contributions

**Conception and design:** W. Sieh, A.S. Whittemore

**Development of methodology:** W. Sieh, J.H. Rothstein, A.S. Whittemore

**Acquisition of data (provided animals, acquired and managed patients, provided facilities, etc.):** V. McGuire

**Analysis and interpretation of data (e.g., statistical analysis, biostatistics, computational analysis):** W. Sieh, J.H. Rothstein, A.S. Whittemore

**Writing, review, and/or revision of the manuscript:** W. Sieh, J.H. Rothstein, V. McGuire, A.S. Whittemore

**Administrative, technical, or material support (i.e., reporting or organizing data, constructing databases):** V. McGuire

**Study supervision:** W. Sieh, A.S. Whittemore

## Grant Support

This research is supported by grants K07CA143047 (to W. Sieh) and R01CA094069 (to A.S. Whittemore and J.H. Rothstein) from the U.S. National Cancer Institute.

The costs of publication of this article were defrayed in part by the payment of page charges. This article must therefore be hereby marked *advertisement* in accordance with 18 U.S.C. Section 1734 solely to indicate this fact.

- Received May 22, 2014.
- Revision received August 1, 2014.
- Accepted August 2, 2014.

- ©2014 American Association for Cancer Research.