## Abstract

**Background:** Biomarker discovery research has yielded few biomarkers that validate for clinical use. A contributing factor may be poor study designs.

**Methods:** The goal in discovery research is to identify a subset of potentially useful markers from a large set of candidates assayed on case and control samples. We recommend the PRoBE design for selecting samples. We propose sample size calculations that require specifying: (i) a definition for biomarker performance; (ii) the proportion of useful markers the study should identify (Discovery Power); and (iii) the tolerable number of useless markers amongst those identified (False Leads Expected, FLE).

**Results:** We apply the methodology to a study of 9,000 candidate biomarkers for risk of colon cancer recurrence where a useful biomarker has positive predictive value ≥ 30%. We find that 40 patients with recurrence and 160 without recurrence suffice to filter out 98% of useless markers (2% FLE) while identifying 95% of useful biomarkers (95% Discovery Power). Alternative methods for sample size calculation required more assumptions.

**Conclusions:** Biomarker discovery research should utilize quality biospecimen repositories and include sample sizes that enable markers meeting prespecified performance characteristics for well-defined clinical applications to be identified.

**Impact:** The scientific rigor of discovery research should be improved. *Cancer Epidemiol Biomarkers Prev; 24(6); 944–50. ©2015 AACR*.

## Introduction

The search for biomarkers for early detection, diagnosis, or prognosis is a major avenue of cancer research. Molecular technologies continue to evolve and collaborative frameworks that foster the research have been established. To date, however, progress has been disappointing, at least when measured by the number of biomarker discoveries that have reached clinical application. A contributing factor is undoubtedly the poor quality of study designs used for biomarker discovery. With the goal of improving the quality of discovery research, this paper sets forth guidelines for the selection of samples, including their sources and numbers.

Discovery research studies typically investigate many candidate biomarkers. Samples from cases (with disease or having poor outcome) and from controls (without disease or having good outcome) are assayed for the candidate biomarkers and a statistical measure of association between each biomarker and case–control status is used to rank the biomarkers. The objective of a discovery study is to filter the many candidate markers to arrive at a short list for validation. In a validation study, one or a handful of markers are evaluated for performance in a specific clinical application.

It is widely accepted that a high degree of rigor is needed in the conduct of validation studies (1). Discovery research on the other hand tends to be less rigorous. In particular, conveniently available biologic samples are often used (2). But convenience samples are often inherently biased. For example, blood samples from breast cancer cases at surgery may be compared with blood samples drawn at mammography from healthy women controls. Measurements that are affected by any factor that differs between these groups of women will show up as biomarkers. One such factor is breast cancer but these groups also differ on other factors including, for example, stress levels and medications. Biomarkers of stress, biomarkers of medication use, and biomarkers of any other factor that differs between the two groups can be mistaken for biomarkers of cancer. Another common problem in discovery research is use of a small number of samples to address the statistically very challenging problem of finding a few good markers in a haystack of many candidates.

Why is it that so little attention has been devoted to the design of biomarker discovery studies? One reason may be that discovery studies are by definition exploratory in nature and traditionally scientists, and statisticians in particular, have not demanded rigor in exploratory research. Another reason may be that the principles of population science are not well entrenched in the culture of basic science. Finally, good quality biospecimen repositories have generally not been made available for discovery research. Their use has instead been prioritized for validation studies.

In this paper, we first draw attention to existing guidelines for choosing biologic samples in biomarker research (3). The guidelines are now applied routinely in validation research but they are equally relevant to discovery research and they should be routinely applied there too in order to elevate the quality and fruitfulness of discovery research. Second, we offer an approach to sample size calculation for discovery studies that is technically very similar to traditional sample size calculations for validation studies but that addresses the unique objective of discovery studies, namely to filter through many candidate biomarkers. We contrast our approach with other approaches that are both more complex and less robust (4–8). Finally, we emphasize some of the most important points already made by others (1, 3, 5, 9–11) about tying the design of the discovery study to the clinical application of interest.

## Materials and Methods

### Choosing specimens

It is now well recognized that the selection of case and control specimens for a biomarker validation study should follow the PRoBE design principles (ref. 3; Fig. 1). The PRoBE principles should also be used for discovery studies. The key components of PRoBE are: (i) that a prospective cohort is identified that pertains to the clinical setting envisioned for applying the biomarkers; biologic samples are drawn and stored prospectively using standard operating procedures that ensure no knowledge about case–control status; (ii) that cases and controls are selected randomly from all the cases and controls in the cohort among those meeting eligibility criteria tied to the potential clinical application; (iii) that the biomarker(s) are measured in a fashion that ensures blinding to case–control status; and (iv) that the performance of the markers is evaluated using a measure of performance that pertains to the clinical setting. The PRoBE guidelines are an application of standard principles of good population science to the special case of biomarker research (1). Although the original PRoBE paper emphasized the need for PRoBE in validation studies, it also strongly recommended that PRoBE principles be applied in discovery research because bias is an equally important issue in biomarker discovery science. In Supplementary Materials and Methods and associated Supplementary Figs. S1, S2, S3, and S4, we provide examples that demonstrate how deviations from the PRoBE study design can result in false leads and/or missed biomarkers in discovery research.

### Sample sizes

Sample sizes should be large enough to make reliable recommendations about which markers are identified as promising for further development. Because recommendations are largely based on observed performance, choosing a pertinent measure of biomarker performance is a crucial preliminary step. This important point has been raised previously (4, 5, 12) but needs to be better appreciated by biomarker discovery scientists. We write the pertinent biomarker performance measure generically as M. For example, in the context of ovarian cancer screening, M might represent the proportion of cases detected (M = TPR = sensitivity) when, for a continuous biomarker, the positivity threshold is chosen for very high specificity, that is, setting the proportion of controls testing positive (FPR = 1-specificity) very small. For a binary marker, M might represent the TPR if the FPR is low enough but equal to 0 if the FPR is too high. In the context of identifying colon cancer patients for more or less intense treatment, M might represent the risk of recurrence for biomarker positive patients (M = PPV = positive predictive value).

In practice, the *t* statistic, the area under the ROC curve (AUC) or the OR, and their *P* values are commonly used to rank and filter biomarkers (8, 13). Unfortunately, these measures do not quantify biomarker performance in a clinically relevant way. For example, the *t* statistic is based on the difference in mean biomarker values between cases and controls but this does not quantify the usefulness of the biomarker for clinical application. Moreover, filtering in markers with the largest mean difference may not identify the markers with the best clinical utility. Instead, one must choose an M that is relevant to the clinical application.

The goal of the study is to filter in potentially “useful” markers and to filter out “useless” markers. A key step in designing the biomarker discovery study is to define the meanings of “useful” and “useless” in the intended clinical application. A useful biomarker has a high level of performance, high enough that investigators require the study to have a good chance of discovering it (Discovery Power). We write the level of performance corresponding to a “useful” marker as M_{1}. Some examples might be: TPR = 35% when FPR is set at 1% in discovery of ovarian cancer screening biomarkers (i.e., M_{1} = TPR = 35%); or PPV = 30% in discovery of biomarkers for predicting recurrence in stage I colon cancer patients (i.e., M_{1} = PPV = 30%). When investigators wish to identify biomarkers that may be useful members of a biomarker panel, the target performance level M_{1} for individual markers could be set at a value less than that endowing clinical usefulness. Deciding on the performance level M_{1} is often a difficult but insightful exercise.

“Useless markers” are those that have no association with case–control status. We write the corresponding level of performance as M_{0}. Useless markers in the context of the ovarian cancer screening example have M_{0} = TPR = 1% since TPR = FPR when markers have no association with case–control status. In the colon cancer example, M_{0} = PPV = 10% because the overall recurrence rate for stage I colon cancer is 10% and for “useless” markers, the recurrence rate in biomarker positive patients is the same as the overall rate.

When study data are analyzed, markers are identified as promising if their performance passes some criterion, that is, they “filter in.” The filter-in criterion is a key element of the study design. We consider criteria that are marker specific, using an estimate of M and its sampling variability to filter the corresponding marker. Some examples are:

1. The *P* value associated with testing H_{0}: M = M_{0} is sufficiently small;
2. The estimated value of M is sufficiently large;
3. The lower limit of the 95% confidence interval for M exceeds a critical value.

Identifying statistically significant markers (*P* value < 0.05) as promising is a special case of criterion (1). However, by definition, this criterion filters in 5% of useless markers. That may be too many. For example, with 9,000 candidate biomarkers, approximately 0.05 × 9,000 = 450 useless markers would be identified as promising. A smaller *P* value criterion may be preferred in practice to ensure that a smaller number of useless markers filter in.

Sample sizes should be chosen so that a high proportion of good markers filter in. The expected proportion is called “*Discovery Power*” in Fig. 1. The sample size should also guarantee that the number of useless markers expected to filter in (FLE in Fig. 1) is adequately small. Suppose, for example, that we can tolerate 180 useless markers filtering in, that is, 180 FLE, if we are guaranteed that almost all good markers are also filtered in, that is, high *Discovery Power*. If there are 9,000 candidate markers and we assume the vast majority are useless, we need to ensure that the proportion of useless markers filtering in is approximately 180/9,000 = 0.02, written *FLE %* = 2%. For a given sample size and filter-in criterion, one can calculate the *Discovery Power* and the *FLE %*. In practice after using formulas for initial sample size calculations, the sample size and the stringency of the filter-in criterion can be varied in simulation studies until acceptable values for *Discovery Power* and *FLE %* are arrived at (details in Supplementary Materials and Methods SB.2). A worked example with technical details is provided in Supplementary Materials and Methods SB.3.
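The arithmetic that links a tolerated number of false leads to *FLE %* can be sketched in a few lines of Python. The function names are hypothetical and introduced only for illustration. The sketch assumes, as in the text, that essentially all candidate markers are useless.

```python
def expected_false_leads(n_candidates: int, fle_fraction: float) -> float:
    """Expected number of useless markers that filter in.

    Assumption: nearly all of the n_candidates are useless, so the
    count of useless markers is approximated by n_candidates.
    """
    return n_candidates * fle_fraction


def required_fle_fraction(tolerated_false_leads: float, n_candidates: int) -> float:
    """FLE % (as a fraction) needed to keep false leads at the tolerated number."""
    return tolerated_false_leads / n_candidates


if __name__ == "__main__":
    # P value < 0.05 criterion applied to 9,000 candidates
    print(expected_false_leads(9000, 0.05))
    # Tolerating 180 false leads among 9,000 candidates
    print(required_fle_fraction(180, 9000))
```

The two calls mirror the examples in the text: a 5% filter applied to 9,000 candidates, and a tolerance of 180 false leads among 9,000 candidates.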

It is important to note that we make no assumptions about the composition of the candidate marker set. We do not assume particular proportions of useless, intermediate, and useful markers in the set and we do not assume that the markers are uncorrelated. A technical argument is presented in Supplementary Materials and Methods SB.1.

### Alternative approaches

Popular alternative criteria for filtering markers are based on the false discovery rate (FDR; refs. 4, 7, 8) or the k-best markers (5, 6) where k is some acceptable number based perhaps on resources available for developing clinical assays. An illustration of different filtering criteria is provided in Supplementary Table S1. Our approach to sample size calculations does not apply directly when these alternative filtering criteria will be applied. We use marker-specific criteria that depend only on the observed performance of that individual marker, whereas selection of a marker according to these alternative criteria depends in part on how other markers perform in comparison with that marker. Therefore, for FDR and best-k criteria, the mix of biomarkers studied determines the chance that an individual marker will filter in. Calculations of *Discovery Power* and *FLE %* can be undertaken using numerical calculations or simulation studies but the calculations must make assumptions about the strengths of all markers in the candidate set and about their correlations. This is a tall order. Our approach in contrast makes no assumptions about the mix of markers studied and requires only that definitions for useful and useless markers are made.
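The dependence of an FDR-based filter on the mix of markers can be illustrated with a small sketch. It assumes the Benjamini-Hochberg step-up procedure as the FDR method, which is one common choice and not necessarily the procedure used in the cited references. The function name `bh_filter` and the *P* values are made up for illustration.

```python
def bh_filter(p_values, q):
    """Indices of markers filtered in by the Benjamini-Hochberg step-up rule.

    Assumption: BH is used as the FDR-controlling procedure at level q.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    k_max = 0
    for rank, idx in enumerate(order, start=1):
        # Compare the rank-th smallest P value with its BH threshold
        if p_values[idx] <= rank / m * q:
            k_max = rank
    return set(order[:k_max])


# Marker 0 has P = 0.03 in both hypothetical candidate sets.
mostly_useless = [0.03, 0.35, 0.45, 0.5, 0.6, 0.7, 0.8, 0.85, 0.9, 0.95]
with_strong_markers = [0.03, 0.001, 0.002, 0.003, 0.5, 0.6, 0.7, 0.8, 0.9, 0.95]

if __name__ == "__main__":
    print(0 in bh_filter(mostly_useless, q=0.2))
    print(0 in bh_filter(with_strong_markers, q=0.2))
    # A marker-specific rule gives the same answer in both sets
    print(mostly_useless[0] < 0.05, with_strong_markers[0] < 0.05)
```

In this toy example the same marker, with the same *P* value, is filtered in by the FDR rule only when stronger markers are present in the candidate set, whereas a fixed *P* value threshold treats it identically in both sets.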

## Results

### Application to the Colocare study

Stage I colon cancer patients have a recurrence rate of 10% and these patients generally do not undergo chemotherapy. However, if a subset of patients at elevated risk for recurrence could be identified, they might benefit from chemotherapy and the overall recurrence rate could be reduced further. How high should the recurrence risk be in order to warrant chemotherapy? We are not aware of formal cost/benefit analysis but since stage III patients have a recurrence rate of 30% and they are routinely offered chemotherapy, we take 30% as the threshold for defining high risk. We propose a study to find biomarkers that will identify a subgroup of stage I colon cancer patients in which the recurrence rate is ≥30%.

The discovery platform we will use enables the simultaneous assessment of 9,000 unique candidate biomarkers using a small volume of plasma. We propose a two-stage study where the initial discovery will assess these 9,000 biomarkers in a set of cases and controls. Up to 300 promising candidates that are identified will move forward to an independent verification study that includes a different set of cases and controls. We now describe the key elements of the design of the initial biomarker discovery study (Fig. 2).

The Colocare cohort is composed of patients diagnosed with stage I-III colon cancer at several high volume medical centers in the United States and Germany. The study involves comprehensive collection of specimens and data including blood specimens at baseline, 6, and 12 months after diagnosis and follow-up for recurrence up to 5 years post-surgery. Here, we focus on patients with stage I disease and their blood specimens obtained at diagnosis to estimate risk of recurrence.

The top panel of Fig. 2 summarizes the PRoBE compliant nature of samples that will be used for biomarker discovery. The cohort fits the intended clinical application exactly. Most recurrences occur within 2 years of diagnosis. Therefore, cases are defined as those patients that recur by 2 years postdiagnosis. Controls are those alive and colon cancer free 2 years postdiagnosis.

The only factors likely to be predictive of recurrence are age and treatment center. Treatment centers may have undocumented differences in specimen collection and processing. Therefore, cases and controls will be selected randomly but the selection mechanism will match 4 controls to each case on age and treatment center to avoid possible confounding by these factors.

The final issue in item (i) of Fig. 2 concerns ensuring that specimens from cases and controls are handled in exactly the same way. Because case–control status will not be known at the time of specimen collection, collection and storage procedures will be the same for cases and controls. Samples will be blinded for the purposes of measuring the biomarkers by using labels that do not indicate case–control status. The order in which samples are assayed will be randomized.

The next set of considerations concerns the measure of performance that will be used to quantify biomarker performance. This is relatively straightforward given the earlier discussion about the intended use of the biomarker. The recurrence rate for biomarker positive patients (PPV) is the key measure of biomarker performance, that is, M = PPV. Controls that are biomarker positive receive unnecessary toxic treatment so the proportion of controls that are biomarker positive (FPR) should be kept low. We target 10% for the biomarker positivity rate and therefore we set the biomarker positivity threshold accordingly at the 90th percentile of biomarker values among controls.
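A minimal sketch of how the positivity threshold and the corresponding TPR might be computed empirically for one marker is given below. The function names and the tie-handling convention (a sample is positive when its value strictly exceeds the threshold) are assumptions made for illustration and are not taken from the study protocol.

```python
def positivity_threshold(control_values, f):
    """Threshold such that at most a fraction f of controls lie strictly above it."""
    ordered = sorted(control_values)
    n = len(ordered)
    # Number of controls allowed above the threshold (small epsilon guards
    # against floating-point error in n * f)
    n_allowed = int(n * f + 1e-9)
    return ordered[n - n_allowed - 1]


def empirical_tpr(case_values, control_values, f):
    """Proportion of cases above the control-based threshold, i.e. empirical ROC(f)."""
    threshold = positivity_threshold(control_values, f)
    positives = sum(1 for value in case_values if value > threshold)
    return positives / len(case_values)


if __name__ == "__main__":
    toy_controls = [float(i) for i in range(10)]   # hypothetical marker values
    toy_cases = [9.0, 10.0, 3.0, 8.5, 1.0]
    print(positivity_threshold(toy_controls, 0.1))
    print(empirical_tpr(toy_cases, toy_controls, 0.1))
```

With f = 10% the threshold sits at the 90th percentile of the control values, as described in the text.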

A “useful” single marker has a positive predictive value of 30% or more (M_{1} = 30%) in the intended clinical population (alternative weaker criteria might be considered if one sought markers for inclusion in a panel). A “useless” biomarker has no association with disease so its positive predictive value is 10% (M_{0} = 10%).

The next item is how the positive predictive value will be estimated in the case–control study. We use the following formula (14) that incorporates the population recurrence rate for stage I patients, ρ = 10%, the desired biomarker positivity rate in controls, f = 10%, and the TPR corresponding to FPR = *f* which is denoted by ROC(*f*):

$$\mathrm{PPV} = \frac{\rho\,\mathrm{ROC}(f)}{\rho\,\mathrm{ROC}(f) + (1-\rho)\,f}$$

The ROC curve is estimated empirically. Note that given values for f and ρ, the null hypothesis H_{0}: PPV = M_{0} can now be restated in terms of a null hypothesis about ROC(f), specifically H_{0}: ROC(0.1) = 0.1. We calculate *P* values by comparing (logit(ROC(0.1))−logit(0.1))/se(logit(ROC(0.1))) with the standard normal distribution.
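The relationship between PPV and ROC(f), and the test statistic described above, can be sketched as follows. The sketch assumes the Bayes' theorem form PPV = ρ·ROC(f) / (ρ·ROC(f) + (1 − ρ)·f) and a one-sided (upper-tail) *P* value. The standard error is taken as a supplied input because its estimation is not described in this section. All function names are hypothetical.

```python
import math


def ppv_from_roc(roc_f, f, rho):
    """PPV implied by TPR = roc_f at FPR = f and prevalence rho (Bayes' theorem)."""
    return rho * roc_f / (rho * roc_f + (1.0 - rho) * f)


def roc_from_ppv(ppv, f, rho):
    """Inverse of ppv_from_roc: the TPR at FPR = f needed to reach a target PPV."""
    return ppv * (1.0 - rho) * f / (rho * (1.0 - ppv))


def logit(p):
    return math.log(p / (1.0 - p))


def z_statistic(roc_hat, se_logit_roc, null_roc=0.1):
    """(logit(ROC_hat) - logit(null)) / se, with the se supplied by the user."""
    return (logit(roc_hat) - logit(null_roc)) / se_logit_roc


def upper_tail_p_value(z):
    """One-sided P value from the standard normal distribution (an assumption)."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))


if __name__ == "__main__":
    print(ppv_from_roc(0.1, f=0.1, rho=0.1))   # useless marker
    print(roc_from_ppv(0.3, f=0.1, rho=0.1))   # useful marker
```

Under these assumptions, a useless marker with ROC(0.1) = 0.1 has PPV = 10%, and a useful marker with PPV = 30% corresponds to ROC(0.1) of about 0.386.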

Next, consider the criterion used to filter in biomarkers. The investigators want to allow no more than five useless biomarkers to filter through both stages of our discovery study. Assuming that the vast majority of the 9,000 markers are not associated with recurrence, we can accomplish this by filtering out 98% of useless markers in each stage because by doing so approximately 0.02 × 0.02 × 9,000 ≈ 4 useless markers filter through both stages. In the initial discovery study that is the focus of this paper, we therefore propose using a *P* value <0.02 criterion to filter biomarkers. By definition, this criterion should allow only 2% of useless markers to filter in (*FLE %* = 2%), that is, approximately 180 markers. The investigators also require that the two-stage discovery study filters in 90% of useful markers. This can be accomplished by filtering in 95% of useful markers in each stage as 0.90 ≈ 0.95 × 0.95. That is, we require *Discovery Power* = 95% in the initial discovery stage.
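The two-stage arithmetic in this paragraph can be checked with a short sketch. It assumes that the two stages use independent samples, so that per-stage proportions multiply, and that nearly all candidates are useless. The function name is hypothetical.

```python
def two_stage_summary(n_candidates, fle_per_stage, power_per_stage):
    """Expected false leads and overall power for two independent filtering stages.

    Assumptions: both stages use the same per-stage FLE % and Discovery Power,
    and nearly all candidates are useless.
    """
    return {
        "false_leads_after_stage_1": n_candidates * fle_per_stage,
        "false_leads_after_stage_2": n_candidates * fle_per_stage ** 2,
        "overall_discovery_power": power_per_stage ** 2,
    }


if __name__ == "__main__":
    for name, value in two_stage_summary(9000, 0.02, 0.95).items():
        print(f"{name}: {value:.4g}")
```

With 9,000 candidates, a per-stage *FLE %* of 2%, and a per-stage *Discovery Power* of 95%, this reproduces the figures in the text: about 180 false leads after the first stage, fewer than five after both stages, and overall power of about 90%.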

Given the target values for *Discovery Power* and *FLE %* and definitions for useful and useless markers, we now set about sample size calculations. Technical details of the calculations are provided in Supplementary Materials and Methods. Table 1 shows the chance that a useful marker (M_{1} = 30%) will pass the “*P* value < 0.02” criterion for various choices of sample size. The *Discovery Power* varies from 0.80 with *n* = 20 cases to >0.99 with *n* = 60 cases.

In theory, exactly 2% of useless markers should filter in with the *P* value < 0.02 criterion. However, using simulations we found that the actual *FLE %* varies from 2.6% to 4.2% (Table 1). The problem here is that small *P* values are notoriously inaccurate especially with nonparametric statistics unless sample sizes are extremely large. We tried a *P* value criterion of 0.01 to see whether a more stringent criterion would result in an *FLE %* closer to the target 2% value. Table 1 shows that the corresponding *FLE %* values were all very close to 2%. The *Discovery Power* corresponding to the *P* value < 0.01 criterion is therefore the relevant power column in Table 1. We see that with 40 cases and 160 controls and a *P* value criterion of 0.01, 95% of useful markers are filtered in along with 2.1% of useless markers. A reasonable sample size for this study therefore is 40 cases and 160 controls. Because only 10% of stage I colon cancer patients suffer recurrence, it will be necessary to recruit 400 stage I colon cancer patients in the Colocare cohort for the initial discovery study. In summary, this number should successfully identify 95% of useful markers while also allowing about 0.021 × 9,000 = 189 useless markers to be considered for further evaluation.
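A rough first-pass calculation of *Discovery Power* and *FLE %* can be sketched with exact binomial probabilities. This is a deliberately simplified stand-in for the calculations in the Supplementary Materials and Methods, not a reproduction of them. It treats the positivity threshold as known (ignoring sampling variability in the 160 controls) and uses an exact one-sided binomial test on the number of biomarker-positive cases rather than the logit-based statistic. It is therefore not expected to reproduce Table 1. All function names are hypothetical.

```python
from math import comb


def binom_upper_tail(n, p, k):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, j) * p ** j * (1 - p) ** (n - j) for j in range(k, n + 1))


def critical_count(n_cases, null_tpr, alpha):
    """Smallest number of positive cases whose null tail probability is <= alpha."""
    for k in range(n_cases + 2):
        if binom_upper_tail(n_cases, null_tpr, k) <= alpha:
            return k


def discovery_power_and_fle(n_cases, useful_tpr, null_tpr, alpha):
    """Return (Discovery Power, FLE fraction) under the simplifying assumptions."""
    k = critical_count(n_cases, null_tpr, alpha)
    power = binom_upper_tail(n_cases, useful_tpr, k)
    fle = binom_upper_tail(n_cases, null_tpr, k)
    return power, fle


if __name__ == "__main__":
    # A PPV of 30% with rho = f = 10% corresponds to a TPR of 0.027 / 0.07 at FPR = 10%
    useful_tpr = 0.027 / 0.07
    for n_cases in (20, 30, 40, 50, 60):
        power, fle = discovery_power_and_fle(n_cases, useful_tpr, 0.1, 0.02)
        print(n_cases, round(power, 3), round(fle, 4))
```

Because variability in the estimated threshold is ignored, this sketch tends to be optimistic relative to simulation. Simulation studies of the kind described in the text remain the appropriate tool for settling on a final sample size.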

Additional power calculations were done to investigate the *Discovery Power* for identifying markers in the candidate set with performances that are neither useless nor useful (Table 2). Using the proposed sample size of 40 cases and 160 controls, we see that 45% of intermediate markers whose PPVs are 20% are expected to filter in while 77% of markers whose PPVs are 25% are expected to filter in. Markers with performances that exceed the “useful” criterion are almost certain to filter through the initial discovery study.

Table 2 also displays *Discovery Power* for various case–control ratios. Our study cohort includes a very large number of controls. However, we see that there is little to be gained by using more than four controls per case because *Discovery Power* for useful markers increases only by 1% with five controls per case and by 2% with six controls per case. Our choice of four controls per case seems sufficient.

### Power using alternative filtering criteria

Using the sample size of 40 cases and 160 controls, we calculated the *FLE %* and *Discovery Power* obtained with the *P* value < 0.02 criterion and contrasted them with those obtained with an alternative filtering criterion, namely FDR < 60% (Table 3). All scenarios included only 900 markers for computational feasibility and useful markers were defined as having PPV = 25% so that power values were sufficiently less than 100% to make sensible comparisons between methods. The most interesting aspect of the results concerns how *Discovery Power* and *FLE %* vary across scenarios. For the *P* value criterion, they do not vary with the numbers of useful and intermediate markers studied. On the other hand, with the FDR criterion, *Discovery Power* and *FLE %* vary, from 77% and 3.7% when there are 10 useful and 890 useless markers, to 91% and 11.6% when there are 10 useful, 100 intermediate, and 790 useless markers studied. In other words, we see that using the FDR filtering criterion, the *Discovery Power* and *FLE %* depend on the mix of markers in the study whereas using the *P* value criterion, the *Discovery Power* and *FLE %* are robust to the numbers and strengths of the candidate markers.

## Discussion

Figure 1 presents the key components of design for a typical biomarker discovery study. Regarding selection of biospecimens, considerations are the same as those for designing a validation study: the intended clinical application must drive the design and adherence to the PRoBE design principles is recommended to eliminate the inherent biases that are prevalent in most discovery studies undertaken to date.

Deviations from the PRoBE design are sometimes unavoidable. For example, for early detection of ovarian cancer, the ideal samples would be from healthy women, some of whom subsequently are diagnosed with cancer. But such preclinical samples are extremely scarce. There are few large cohorts of healthy women with stored specimens who are followed for outcomes. Samples from a cohort of women undergoing diagnostic laparoscopy may be tolerable in these circumstances at least if we believe that biomarkers elevated in preclinical disease should also be elevated in clinical disease. However, the *Discovery Power* should be set very high to ensure all biomarkers are detected because the preclinical markers of interest may not be the strongest amongst the full set of markers elevated in clinical disease (10).

Sample size calculations presented here for discovery research are technically very similar to those for validation research albeit that the objectives are very different. In validation research, traditional notions of *statistical power* and α-*level* are relevant in making conclusions about the performance of a single marker (14). The analogous entities in discovery research are *Discovery Power* and *FLE %*. Importantly, *Discovery Power* and *FLE %* pertain to proportions of candidate markers identified as promising, whereas *statistical power* and α-*level* pertain to definitive conclusions about a single marker in validation research, so the notions are quite different. Their unique objectives imply that the values chosen for *Discovery Power* and *FLE %* are likely to be quite different from the traditional choices of 80% for *statistical power* and 5% for α-*level*. We chose 95% for *Discovery Power* and 2% for *FLE %* in our application.

Technology used in discovery research, such as array technology, may differ from that for validation, such as ELISA. One implication is that a marker with performance level M_{1} when measured with validation technology may not have this level of performance when evaluated with discovery technology. Adjustments to sample size calculations can be undertaken if the reduced level of performance due to technical variation is known. See ref. (5) for a detailed examination of this issue.

Deciding which set of promising markers should go forward for subsequent study involves many considerations including: biologic rationale; availability of assays; estimated magnitude of biomarker performance; evidence of redundancy or complementarity among markers; and the estimated FDR of the panel of markers going forward. The initial evaluation of markers often centers on their *P* values and we believe that it is prudent and easy to design a discovery study so that most useful markers will be identified with a *P* value threshold. Nevertheless, one should acknowledge that additional statistical and nonstatistical criteria will be examined for those markers that pass the *P* value filter. For example, if the FDR is very high (i.e., there appear to be few non-null markers), one might decide to do another discovery study using alternative technology. Alternatively, if many markers pass the *P* value criterion and the FDR appears acceptable, one might elect to further develop only the top markers with a view to returning to the remaining candidates if the top ones do not pan out. If FDR will be a key driver in selecting markers that go forward, one should perform additional *Discovery Power* calculations for the FDR-based criterion. We have shown here that those calculations require strong assumptions about anticipated numbers, strengths, and correlations of all the candidate markers. Our marker-specific *P* value–based calculations of *Discovery Power* do not depend on those assumptions and are far easier to implement.

A single biomarker may not exist with sufficient performance for clinical application. Our approach allows for selecting a panel of markers where each member meets a performance criterion that may be less than the clinical goal. A more ambitious task is to derive combinations within the discovery study using regression or classification techniques. The statistical challenge of such a task is, however, enormous, a fact that is widely underappreciated (12). Although discovery power calculations can be performed for an analysis and selection plan that incorporates combining markers in some predefined fashion, like FDR-based discovery power calculations, they require specifying the anticipated numbers, strengths, and correlations of the candidate markers. See Dobbin and Simon (15) for an insightful paper on this topic.

Discovery research might be more fruitful with use of specimen sets that adhere to PRoBE design principles and that are large enough to reliably identify useful biomarkers. One step toward this goal would be to make existing well-designed biospecimen repositories available for discovery research. Unfortunately, current practice is to reserve their use for validation studies and to use conveniently available non-PRoBE sets for discovery research. One could argue that this is an inefficient strategy because use of convenience samples fosters discovery of false leads that go to validation thereby wasting precious resources. At the same time, small convenience sets miss potentially useful markers (11, 16). An alternative strategy would use well-designed PRoBE sets for discovery research and convenience sets, if necessary, for verification purposes before embarking on a fully fledged validation phase of research. Such a new strategy may lead to more efficient and successful biomarker discoveries.

## Disclosure of Potential Conflicts of Interest

No potential conflicts of interest were disclosed.

## Authors' Contributions

**Conception and design:** M.S. Pepe, C.I. Li, Z. Feng

**Development of methodology:** M.S. Pepe, C.I. Li, Z. Feng

**Acquisition of data (provided animals, acquired and managed patients, provided facilities, etc.):** M.S. Pepe, C.I. Li

**Analysis and interpretation of data (e.g., statistical analysis, biostatistics, computational analysis):** M.S. Pepe, C.I. Li, Z. Feng

**Writing, review, and/or revision of the manuscript:** M.S. Pepe, C.I. Li, Z. Feng

**Administrative, technical, or material support (i.e., reporting or organizing data, constructing databases):** M.S. Pepe

**Study supervision:** M.S. Pepe

## Grant Support

This work was supported by grants from the National Cancer Institute at the NIH (R01 GM054438 to M.S. Pepe; U01 CA152637 to C.I. Li; U24 CA086368 to M.S. Pepe and Z. Feng).

The costs of publication of this article were defrayed in part by the payment of page charges. This article must therefore be hereby marked *advertisement* in accordance with 18 U.S.C. Section 1734 solely to indicate this fact.

## Acknowledgments

The authors thank the reviewers for thoughtful suggestions.

## Footnotes

**Note:** Supplementary data for this article are available at Cancer Epidemiology, Biomarkers & Prevention Online (http://cebp.aacrjournals.org/).

- Received October 31, 2014.
- Revision received March 17, 2015.
- Accepted March 17, 2015.

- ©2015 American Association for Cancer Research.