1 Fred Hutchinson Cancer Research Center, Seattle, Washington; 2 Feinberg School of Medicine, Northwestern University, Chicago, Illinois; 3 Department of Urology, University of Washington, Seattle, Washington; and 4 Harvard School of Public Health, Boston, Massachusetts
Requests for reprints: Ruth Etzioni, Program in Biostatistics, Fred Hutchinson Cancer Research Center, 1100 Fairview Avenue North, M2-B230, Seattle, WA 98109-1024. Phone: 206-667-6561; Fax: 206-667-7004. E-mail: retzioni{at}fhcrc.org.
| Abstract |
|---|
| Introduction |
|---|
70% to 80% among men within 4 years prior to clinical diagnosis of prostate cancer, and the overall specificity is close to 90% (1). However, false-positive tests are not uncommon, particularly among older men and those with benign prostate conditions. This phenomenon argues for a more stringent, or specific, test. At the same time, several studies have established the presence of prostate cancer in some men with PSA levels below 4.0 ng/mL (2), suggesting a need for a more sensitive test (3). Recent attempts to improve the performance of PSA have focused on the different molecular forms of PSA in serum: total PSA (TPSA), free PSA (not complexed to serum proteins), and complexed PSA (CPSA). Because the ratio of free PSA to TPSA (RPSA) tends to decline in men with prostate cancer, combination tests have generally used a threshold for RPSA within an interval of moderately elevated TPSA values, termed the reflex range. Early studies focused on a reflex range for TPSA of 4 to 10 ng/mL, with the goal of reducing false-positive rates (4, 5). Subsequent studies suggested that RPSA might be useful when TPSA is even lower than 4.0 ng/mL (2). A recent report by Gann et al. (6) observed that use of RPSA within a TPSA reflex range of 3 to 10 ng/mL could actually improve both specificity and sensitivity simultaneously relative to the conventional test. Uncertainty about the optimal reflex range has been compounded by recent results suggesting that CPSA may be preferable to both TPSA and RPSA (7). However, results concerning the utility of CPSA are also not consistent across studies (8). Current PSA guidelines do not, to our knowledge, provide any direction as to how free PSA and CPSA should be used in the early detection of prostate cancer.
In this article, we undertake a systematic analysis of the diagnostic performance of different strategies based on TPSA, CPSA, and tests combining free PSA with TPSA in banked plasma samples from the Physicians' Health Study (1, 6). This nested case-control study represents one of the earliest and most extensive sources of information on serum PSA levels prior to diagnosis of prostate cancer.
Our analysis differs from prior studies in that we do not begin by selecting a specific reference range for TPSA or a threshold for RPSA within this range. Rather, our goals are to (a) systematically evaluate a wide range of clinically interpretable tests combining TPSA and RPSA and (b) determine whether this class of tests provides significant improvements in diagnostic performance relative to TPSA-based tests.
The ability to optimally combine information on multiple markers is important because single markers typically lack the sensitivity and specificity to be useful for mass screening. With genomic and proteomic studies yielding many novel markers for cancer detection (9), a statistically coherent framework will be needed to identify useful combination tests and to evaluate whether these tests provide statistically and clinically significant improvements over existing tests. The methods presented herein represent a broadly applicable framework that addresses this need.
| Materials and Methods |
|---|
Statistical Analysis
Overview. Our analytic approach consists of two key components: (a) identification of potentially useful TPSA/RPSA combination tests and (b) statistical comparison of these tests with tests based on TPSA and CPSA. The second component is a comparison of receiver operating characteristic (ROC) curves (10) for the different types of tests. Statistical methods that extend this technique to tests using RPSA within intervals of TPSA have only recently been developed (11). In addition to estimating ROC curves for the three different types of tests considered (TPSA, CPSA, and TPSA/RPSA combination), we also evaluate the impact of time prior to diagnosis and subject age on the relative performance of the tests. This allows us to address, for example, whether tests that include information on percent free PSA can identify prostate cancer cases earlier than those based on TPSA.
Definition of Combination Tests. In combining information on TPSA and RPSA, we consider the set of and-or combinations of tests in each marker. We refer to tests of this type as logic rules. Logic rules are of particular interest because of their flexibility and clinical interpretability. The test that uses RPSA within a specified TPSA reflex range is an example of a logic rule; however, the set of logic rules is far more general. In practice, we define a set of possible cutoffs for TPSA and RPSA and consider the collection of logic rules based on these cutoffs. For TPSA, we use cutoffs {1, 1.25, 1.5, 1.75, ..., 9.75, 10}, where all measurements are in nanograms per milliliter. For RPSA, we define cutoffs of {0.05, 0.1, 0.15, 0.2, 0.25, 0.3, 0.35, 0.4}. Thus, the set of combination rules that we consider consists of all and-or combinations of threshold conditions in TPSA and RPSA using these cutoffs.
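As an illustration only, the reflex-range test can be written as a small and-or rule over the two cutoff grids given above. The sketch below is not the authors' implementation: the function names `reflex_rule` and `all_reflex_rules` are hypothetical, the strictness of the inequalities is an assumption, and reflex-range rules are only one subfamily of the logic rules considered in the text.

```python
from itertools import combinations

# Cutoff grids quoted in the text (TPSA in ng/mL; RPSA is the free-to-total ratio).
TPSA_CUTOFFS = [1 + 0.25 * k for k in range(37)]          # 1, 1.25, ..., 10
RPSA_CUTOFFS = [round(0.05 * k, 2) for k in range(1, 9)]  # 0.05, ..., 0.40


def reflex_rule(lower, upper, rpsa_cut):
    """Build a reflex-range test (hypothetical helper).

    The test is positive if TPSA exceeds `upper`, OR if TPSA exceeds
    `lower` AND RPSA falls below `rpsa_cut`. Strict inequalities are
    an assumption; the text does not specify them.
    """
    def is_positive(tpsa, rpsa):
        return tpsa > upper or (tpsa > lower and rpsa < rpsa_cut)
    return is_positive


def all_reflex_rules():
    """Enumerate (lower, upper, rpsa_cut) triples with lower < upper."""
    return [
        (lower, upper, r)
        for lower, upper in combinations(TPSA_CUTOFFS, 2)
        for r in RPSA_CUTOFFS
    ]


# Example: RPSA < 0.25 applied within a TPSA reflex range of 2.5 to 4.0 ng/mL.
rule = reflex_rule(2.5, 4.0, 0.25)
```

Under these assumptions, a man with TPSA of 3.0 ng/mL is test-positive only if his RPSA is below 0.25, whereas any TPSA above 4.0 ng/mL is positive regardless of RPSA.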
ROC Analysis. To construct the logic rule ROC curve, we use the methods described by Etzioni et al. (11). With a single marker, each point on the ROC curve represents the true-positive rate for a specific marker threshold versus the false-positive rate for that threshold. With multiple markers, every point on the ROC curve corresponds to a different rule (11). Each rule on the curve maximizes the true-positive rate given the corresponding false-positive rate; this collection of rules is "optimal" in the sense that for any rule that is not on the curve, there exists a rule on the curve that has higher true-positive and lower false-positive rates (12). The rules are selected by a classification algorithm called logic regression (13).
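The notion of an "optimal" collection of rules can be sketched as a non-dominated frontier in the (false-positive rate, true-positive rate) plane. The example below is a simplified stand-in that assumes the candidate rules have already been enumerated and evaluated; it does not implement logic regression, which is the algorithm the authors actually use to select rules. The names `rates` and `roc_frontier` are hypothetical.

```python
def rates(rule, cases, controls):
    """Return (false-positive rate, true-positive rate) for one rule.

    `cases` and `controls` are lists of (tpsa, rpsa) tuples; `rule` is
    any function mapping (tpsa, rpsa) to True/False.
    """
    tpr = sum(rule(*subject) for subject in cases) / len(cases)
    fpr = sum(rule(*subject) for subject in controls) / len(controls)
    return fpr, tpr


def roc_frontier(points):
    """Keep only the (fpr, tpr) points that trace the upper-left frontier.

    Points are scanned in order of increasing false-positive rate, and a
    point is kept only if it strictly improves the true-positive rate over
    every point with a lower or equal false-positive rate.
    """
    frontier = []
    best_tpr = -1.0
    for fpr, tpr in sorted(set(points), key=lambda p: (p[0], -p[1])):
        if tpr > best_tpr:
            frontier.append((fpr, tpr))
            best_tpr = tpr
    return frontier


# Toy (fpr, tpr) values, invented purely for illustration.
toy_points = [(0.10, 0.40), (0.10, 0.30), (0.20, 0.35),
              (0.20, 0.60), (0.05, 0.20), (0.30, 0.60)]
frontier = roc_frontier(toy_points)
```

In the toy data, the rule at (0.30, 0.60) is dropped because the rule at (0.20, 0.60) achieves the same true-positive rate with fewer false positives.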
To obtain an assessment of comparative predictive performance that is not overly optimistic, the logic rules are identified using a training data set, consisting of a randomly selected two-thirds of the original sample. We use a test data set, consisting of the remaining one-third, to evaluate the corresponding true-positive and false-positive rates and construct the ROC curve. We also construct ROC curves for CPSA and TPSA on the test data. For robustness, we implement analyses for 35 different runs, each corresponding to a different test-train split. In general, we present results from all the runs; where necessary (e.g., in plots of the ROC curves), we present results for the run in which the area under the logic rule ROC curve (AUC) is the median over all the runs.
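The repeated test-train procedure can be sketched as follows. This is a minimal illustration under stated assumptions: the split is drawn without stratifying by case status (the text does not say whether the authors stratified), the seed and sample size are arbitrary, and `split_two_thirds` and `median_run` are hypothetical names.

```python
import random
import statistics


def split_two_thirds(n_subjects, rng):
    """Randomly assign two-thirds of subject indices to training, the rest to testing."""
    indices = list(range(n_subjects))
    rng.shuffle(indices)
    cut = (2 * n_subjects) // 3
    return indices[:cut], indices[cut:]


def median_run(aucs):
    """Index of the run whose AUC is closest to the median AUC across runs."""
    med = statistics.median(aucs)
    return min(range(len(aucs)), key=lambda i: abs(aucs[i] - med))


rng = random.Random(2004)  # arbitrary seed, chosen only for reproducibility
splits = [split_two_thirds(30, rng) for _ in range(35)]  # 35 runs, 30 toy subjects
```

In an actual analysis, rules would be selected on each training set, evaluated on the matching test set, and the run returned by `median_run` would be the one plotted.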
Estimating the AUC. To test whether apparent differences in the ROC curves are statistically significant, we compare the AUCs. The AUC is a general measure of diagnostic performance, interpretable as an average true-positive rate over the full range of false-positive rates (10). For multiple markers, the AUC is interpretable as the probability that any pair of case-control observations will be correctly classified by at least one rule on the curve.
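For a single marker, the probabilistic interpretation of the AUC corresponds to the proportion of case-control pairs in which the case has the higher marker value. A minimal sketch follows; counting ties as one-half is a common convention and is an assumption here, and `empirical_auc` is a hypothetical name.

```python
def empirical_auc(case_values, control_values):
    """Proportion of case-control pairs ranked correctly by the marker.

    A pair counts 1 if the case value exceeds the control value and
    0.5 if the two values are tied (assumed convention).
    """
    wins = 0.0
    for case in case_values:
        for control in control_values:
            if case > control:
                wins += 1.0
            elif case == control:
                wins += 0.5
    return wins / (len(case_values) * len(control_values))


# Toy marker values, invented for illustration: 6.5 of 9 pairs are concordant.
toy_auc = empirical_auc([5, 3, 2], [1, 2, 4])
```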
Because the AUC is interpretable as a probability, logistic regression may be used to determine whether it differs according to test type (TPSA, CPSA, logic rule; refs. 11, 14). For example, to compare TPSA with CPSA, an indicator of test type (1 for TPSA, 0 for CPSA) would be entered as an independent variable in the appropriate logistic regression model (11, 14). SEs for the regression coefficients can be determined by bootstrapping (14). To evaluate whether independent variables such as age and time from test to diagnosis affect the relative diagnostic performance of the different tests, we include interactions between these factors and indicators of test type in the regression models.
In practice, we conduct three separate logistic regression analyses, the first comparing TPSA with CPSA, the second comparing TPSA with the TPSA/RPSA combination, and the third comparing CPSA with the combination rule. Each test-train split of the data yields a different set of results for each analysis. We summarize results by reporting mean coefficient estimates as well as the number of times for which coefficients of interest are statistically significant. A result that is consistently significant across runs indicates a robust association of the corresponding covariate with the AUC. All statistical significance tests are conducted at the (two-sided) 0.05 level.
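In the simplest case, with the test-type indicator as the only independent variable, the fitted logistic regression coefficient reduces to the difference between the logits of the two AUCs. The sketch below illustrates that special case with a bootstrap SE; it is not the procedure of refs. 11 and 14, it omits the age and time-to-diagnosis interactions, and it assumes that cases and controls are resampled separately with each subject's paired TPSA and CPSA values kept together. All function names are hypothetical.

```python
import math
import random
import statistics


def auc(case_values, control_values):
    """Proportion of correctly ordered case-control pairs (ties count 0.5)."""
    wins = 0.0
    for c in case_values:
        for d in control_values:
            wins += 1.0 if c > d else 0.5 if c == d else 0.0
    return wins / (len(case_values) * len(control_values))


def logit(p, eps=1e-6):
    """Log-odds, with p clipped away from 0 and 1 so the result stays finite."""
    p = min(max(p, eps), 1 - eps)
    return math.log(p / (1 - p))


def coefficient_for_test_type(cases, controls):
    """logit(AUC for TPSA) minus logit(AUC for CPSA).

    `cases` and `controls` are lists of (tpsa, cpsa) pairs, one per subject.
    """
    auc_tpsa = auc([t for t, _ in cases], [t for t, _ in controls])
    auc_cpsa = auc([c for _, c in cases], [c for _, c in controls])
    return logit(auc_tpsa) - logit(auc_cpsa)


def bootstrap_se(cases, controls, n_boot=200, seed=0):
    """Bootstrap SE of the coefficient, resampling subjects within each group."""
    rng = random.Random(seed)
    draws = []
    for _ in range(n_boot):
        boot_cases = [rng.choice(cases) for _ in cases]
        boot_controls = [rng.choice(controls) for _ in controls]
        draws.append(coefficient_for_test_type(boot_cases, boot_controls))
    return statistics.stdev(draws)
```

A positive coefficient would indicate a higher AUC for TPSA than for CPSA; comparing the estimate with its bootstrap SE gives a rough significance check for a single test-train split.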
| Results |
|---|
Figure 2A plots the TPSA, CPSA, and logic rule ROC curves for the test data from a representative run, namely, one in which the AUC was closest to "average" across the 35 runs. We considered the following thresholds when plotting the ROC curves for the TPSA-based and CPSA-based rules: {0.25, 0.5, 0.75, 1, 1.25, 1.5, ..., 10}. Because high specificity is important in cancer screening studies, Fig. 2B also shows the ROC curves restricted to false-positive rates below 20%. Table 2 lists the logic rules on which Fig. 2B is based.
The AUCs for TPSA across the 35 runs ranged from 0.70 to 0.78, with a mean of 0.74. The average AUC for the logic rule and CPSA was 0.76. In the logistic regressions, interaction terms were rarely statistically significant, with the exception of the TPSA-CPSA comparison where the test type: age interaction term was significant in 14 runs, suggesting that any improvements in diagnostic performance associated with CPSA might be restricted to older men. Results are presented in Table 3, which indicates similar diagnostic performance (as measured by the AUC) for the three types of tests. For example, in the comparison of TPSA with the logic rule, the coefficient for test type was statistically significant in only 3 of 35 runs; similarly, for CPSA and the logic rule, the indicator of test type was statistically significant in only 10 runs. The coefficient estimates for the CPSA-TPSA comparison suggest a slight degradation in diagnostic performance associated with the use of CPSA in younger men and a corresponding improvement in older men.
| Discussion |
|---|
Although the different rules showed similar overall diagnostic performance, the ROC curves indicated that specific combination tests could provide improvements over the standard TPSA-based test. Across test-train splits, we consistently identified logic combination tests with lower false-positive and higher true-positive rates than TPSA > 4.0 ng/mL. Given the wide prevalence of PSA testing in the population, use of these tests could translate into a practically important reduction in unnecessary biopsies without sacrificing cancers detected (6).
All of the TPSA/RPSA combinations with higher sensitivity and specificity than TPSA > 4.0 ng/mL extended the TPSA reflex range to below 4.0 ng/mL. Combination tests that improved specificity with only small losses in cancers detected also were of this form. This is consistent with several prior studies of TPSA and RPSA (6, 17, 18) as well as studies that have identified disease cases with TPSA levels below 4.0 ng/mL (2, 3, 17). Of note, these combination tests all used RPSA at low TPSA levels, indicating that simply lowering the threshold for TPSA, as has been recently suggested (3), may not be an optimal approach. If detection of cases with low PSA levels is important, but limiting false-positive tests is a priority, then our results suggest that a lowering of the TPSA threshold should also be accompanied by a threshold criterion on RPSA (or some other discriminating marker); otherwise, false-positive rates could become prohibitively high.
We found that lowering the TPSA threshold to 2.5 ng/mL, as has been suggested (3), led to an average false-positive rate of 18.9% and a corresponding true-positive rate of 50.5%. Assuming that the prevalence of latent, biopsy-detectable prostate cancer is 25%, this translates into 2.13 biopsies per cancer detected. In contrast, the logic rules that lowered the TPSA threshold but used RPSA in this range had false-positive rates of 6.91% and true-positive rates of 36.06% on average, which translates into 1.57 biopsies per cancer detected, a 26% reduction. Of note, the standard TPSA > 4.0 ng/mL rule led in our data set to average false-positive and false-negative rates of 10% and 36%, respectively, which translates into 1.83 biopsies per cancer detected. Thus, the logic rules that used RPSA within a lower TPSA reflex range reduced false-positive rates by 30% on average and could result in practice in a 37% reduction in the number of biopsies per cancer detected.
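The biopsies-per-cancer figures can be approximately reproduced from the quoted rates and the assumed 25% prevalence. The sketch below assumes that every positive test leads to a biopsy and that biopsy detects every cancer present; both are simplifying assumptions, and `biopsies_per_cancer` is a hypothetical name. With the rounded inputs from the text, the first figure comes out near 2.12 rather than the reported 2.13, a gap that may reflect rounding or averaging across runs.

```python
def biopsies_per_cancer(tpr, fpr, prevalence):
    """Expected number of biopsies for each cancer detected.

    All test-positive men are assumed to be biopsied: true positives
    (prevalence * tpr) plus false positives ((1 - prevalence) * fpr),
    divided by the true positives.
    """
    true_positives = prevalence * tpr
    false_positives = (1 - prevalence) * fpr
    return (true_positives + false_positives) / true_positives


PREVALENCE = 0.25  # assumption stated in the text

# Lowering the TPSA threshold to 2.5 ng/mL.
lowered_threshold = biopsies_per_cancer(tpr=0.505, fpr=0.189, prevalence=PREVALENCE)

# Logic rules that use RPSA within the lowered TPSA range.
logic_rules = biopsies_per_cancer(tpr=0.3606, fpr=0.0691, prevalence=PREVALENCE)

# Relative reduction in biopsies per cancer detected (about 26%).
relative_reduction = 1 - logic_rules / lowered_threshold
```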
A key advantage of the Physicians' Health Study data set is that the majority of prostate cancers are clinically significant in the sense that they were at some point diagnosed prior to the PSA era, within the lifetime of the patient. The design of the present study (nested, case-control) contrasts with that of prospective screening studies (e.g., ref. 3), in which prostate cancer cases consist of men with a positive PSA and biopsy-detectable disease. The differences between the case populations in case-control and prospective screening studies lead to different definitions of sensitivity in the two types of studies, which may account for differences between study results. For example, the sensitivity of the test TPSA > 4.0 ng/mL among participants in the Physicians' Health Study within 4 years prior to diagnosis was estimated by Gann et al. (1) to be 73%; however, Punglia et al. (3) estimated sensitivity among prospectively screened cases to be only 19% for men younger than 60 and 35% for men over 60. In that our estimates of sensitivity pertain to cases whose disease will become apparent during their lifetimes (non-overdiagnosed cases), these estimates may be more relevant for clinical practice.
In this article, we have focused on diagnostic properties of PSA-based tests and not on the value of PSA testing in terms of its benefits or costs. Given that some of the controversy about PSA testing centers on morbidity of false-positive tests, improving false-positive rates is clearly worthwhile, although there may be some cost implications associated with the additional tests. However, the value of improving true-positive rates is not clear, particularly in light of concerns about overdiagnosis associated with PSA screening. Although we identify tests that seem to provide modest improvements in sensitivity, our results pertain only to non-overdiagnosed cases. It is not clear whether these tests will increase the likelihood of overdiagnosis in a prospective screening setting, nor whether any such increases will be outweighed by the survival benefits that may accrue as a result of improved sensitivity.
To summarize, our findings indicate that discrimination between asymptomatic prostate cancer cases and controls may be enhanced by the use of information on the different molecular forms of PSA. The specific combination rules that outperform the standard TPSA-based rule in terms of both sensitivity and specificity all lower the reflex range for TPSA but use a threshold criterion for RPSA within this range. Our approach illustrates how use of multiple markers can be guided by systematic consideration of a wide range of combination tests coupled with a coherent statistical framework for evaluating and comparing diagnostic performance.
| Footnotes |
|---|
The costs of publication of this article were defrayed in part by the payment of page charges. This article must therefore be hereby marked advertisement in accordance with 18 U.S.C. Section 1734 solely to indicate this fact.
Received 1/22/04; revised 4/1/04; accepted 5/3/04.
| References |
|---|