| HOME | HELP | FEEDBACK | SUBSCRIPTIONS | ARCHIVE | SEARCH | TABLE OF CONTENTS |
1 Fred Hutchinson Cancer Research Center; 2 Department of Pathology, University of Washington, Seattle, Washington; and 3 Division of Biostatistics, Vanderbilt-Ingram Cancer Center, Nashville, Tennessee
Requests for reprints: Ruth Etzioni, Translational and Outcomes Research, Fred Hutchinson Cancer Research Center, Mailstop M2-B230, 1100 Fairview Avenue North, P.O. Box 19024, Seattle, WA 98109. Phone: 206-667-6561; Fax: 206-667-7264. E-mail: retzioni{at}fhcrc.org
| Abstract |
|---|
|
|
|---|
Methods: Compositional data analysis is a technique used to analyze vectors of unit-sum proportions, such as those obtained from soil sample studies or species abundance surveys. We summarized specimen staining patterns by the proportion of cells staining at mild, moderate, and intense levels and used compositional data analysis to summarize and compare the resulting staining profiles.
Results: In a study of Syndecan-1 staining patterns among 44 localized prostate cancer cases with Gleason score 7 disease, compositional data analysis did not detect a statistically significant difference between the staining patterns in recurrent (n = 22) versus nonrecurrent (n = 22) patients. Results indicated only modest increases in the proportion of cells staining at a moderate intensity in the recurrent group. In contrast, an analysis that compared quantitative scores across groups indicated a (borderline) significant increase in staining in the recurrent group (P = 0.05, t test).
Conclusions: Compositional data analysis offers a novel analytic approach for immunohistochemical studies, providing greater insight into differences in staining patterns between groups, but possibly lower statistical power than existing, score-based methods. When appropriate, we recommend conducting a compositional data analysis in addition to a standard score-based analysis.
| Background |
|---|
|
|
|---|
Immunohistochemical studies have become an integral part of the process of biomarker development, where they are used to determine whether overexpression or underexpression of potential markers predicts disease status. For example, several novel proteins have been shown to be markedly overexpressed in prostate cancer tissue, including
-methylacyl-CoA Racemase (1), EZH2 (2), and pim-1 kinase and hepsin (3). cDNA microarray studies typically yield many candidate genes that seem to differentiate between disease groups at the RNA level. However, only a small portion of these findings will ultimately be confirmed by immunohistochemistry at the protein level.
Immunohistochemical studies of prostate cancer reveal tremendous within-specimen heterogeneity.Not only are normal cells typically present in cancer specimens, but staining among tumor cells can be highly variable. Figure 1 shows a section of formalin-fixed, paraffin-embedded prostate tissue stained with the syndecan-1 antibody (4). Both normal and cancer cells are visible; moreover, some areas of the tumor show mild to moderate staining, whereas others show very intense staining. Because of variability, like that illustrated in Fig. 1, specimens are often not coded as positive or negative for a particular antibody; rather, a summary score is calculated (1, 4). First, the percent of cells staining at a fixed number of intensity levels (e.g. none, faint, moderate, strong) is determined. We refer to this vector of percents as the staining profile associated with a specific specimen. Then the score is given as the sum over intensity levels of the percent staining multiplied by an ordinal value corresponding to the intensity level (0 = no staining, 1 = mild staining, etc.). With four intensity levels, the resulting score ranges from 0 (no staining in the entire specimen) to 300 (intense staining uniformly within the specimen). Variants of this system have been proposed; for example, instead of the percent staining at each intensity level, an integer may be recorded with values (0, 1, 2, ...) indicating whether the percent staining at a given level falls within a specified range (0-20%, 20-40%, 40-60%, ..., ref. 5). Qualitative approaches, which classify specimens as positive/negative or weak-strong, have also been used (2, 3).
|
|
Since a tissue specimen that has been stained with an antibody represents a mixture of cells with different staining intensities, compositional data methods are applicable. We show how the compositions represented by staining profiles may be graphically representedand statistically comparedacross groups sustaining different outcomes of interest. We use these methods to analyze data from an immunohistochemical study of a potential biomarker for prostate cancer recurrence, and compare results to those obtained with the score-based approach.
| Materials and Methods |
|---|
|
|
|---|
For three-part compositions, the basic graphical tool for representing data is the ternary diagram (Fig. 2A). A ternary diagram consists of an equilateral triangle, with unit altitude and vertices corresponding to the components of the composition (e.g., 0, 1, and 2 for a three-level staining profile reflecting none-mild, moderate, and intense staining). A staining profile (a, b, and c) is represented in the figure by a point P such that the perpendicular distances from P to the edges of the triangle opposite vertices 0, 1, and 2 equal a, b, and c, respectively. Thus, staining profiles that are mostly null (no stain) would be represented by points close to vertex 0; similarly, profiles that are balanced between moderate and intense staining would be on the perpendicular line from vertex 0 to the edge connecting vertices 1 and 2.
|
To compare staining profiles across two samples, we need to determine whether the perturbation that captures any observed between-sample differences is statistically significantly different from the identity value (1/3, 1/3, 1/3). To estimate the best-fitting perturbation, we fit a regression model to a transformation of the observed staining profiles. The transformation is called the additive log-ratio (alr) transformation, and it converts any k-part composition from a vector of observations lying between 0 and 1 to a (k 1) vector of observations lying between
and +
. The regression model is then applied to the transformed data. Appendix 1 details the alr transformation and its inverse (the ialr transformation), and describes the regression model. In the absence of other covariates that may affect staining patterns, the regression model has only one independent variable namely, an indicator of group membership. Applying the inverse of the alr transformation to the estimated coefficient of this independent variable yields the perturbation that best captures between-sample differences.
Utilization of the alr transformation requires that all elements of the observed compositions be greater than zero. In practice this is not always the case; some specimens may have no cells staining at a given intensity level. When some of the compositions have zero elements, Aitchison (6) recommends adding a small positive constant to the zero-valued cells, but we have found that adding a small amount of normally distributed random noise produces transformed data that are more likely to satisfy the assumptions required for the regression analysis (specifically, multivariate normality of the alr-transformed observations). We detail how this is done in practice when we present our analysis of the Syndecan-1 data.
The key outputs of the regression model are (a) an estimate of the median composition among observations in the baseline group, and (b) an estimate of the perturbation that best describes the differential between the medians of the baseline group and the comparison group (6, 7). As we describe in Appendix 1, methods developed by Billheimer et al. (7) can be used to construct a 95% interval around this estimated differential. The 95% interval is represented as a region in the ternary diagram; if it excludes the identity point (1/3, 1/3, 1/3) we conclude that the data indicate that there is a statistically significant difference between the two groups.
To fit the regression model and construct the 95% interval, we use WinBUGS (the WinBUGS web site can be accessed at http://www.mrc-bsu.cam.ac.uk/bugs/welcome.shtml). WinBUGS is publicly available; code to fit the regression model and construct the 95% interval is available from the authors. One of the advantages of using this framework for modeling is that it allows us to extend the regression model to the setting where multiple cores are taken from the same patient, as might, for example, be the case in a paired study (both cancerous and healthy samples analyzed within each participant). Appendix 1 describes how this may be done using WinBUGS.
Syndecan-1 and Prostate Cancer Recurrence
We previously described (4) the results of a study to investigate the association between prostate cancer progression and expression of Syndecan-1, a cell surface proteoglycan implicated in sequestering growth factors and enhancing their activity (8). One section of prostate cancer tissue from each of 76 patients diagnosed with clinically localized disease was analyzed for Syndecan-1. All tumors were Gleason score 6 or 7 and were treated with radical prostatectomy. All subjects were followed for at least 5 years. The study end point (time of recurrence) was defined as two consecutive rising serum prostate-specific antigens greater than 0.2 ng/mL. In the group with Gleason score 7 disease, 22 experienced prostate-specific antigen recurrence within 5 years, and 22 survived, recurrence free, for at least 5 years.
We analyzed the immunohistochemical data for Gleason score 7 patients using both the summary score approach and compositional data analysis. Our goal was to determine whether staining profiles differed for patients who did and did not recur within 5 years. Three categories of staining intensity (1 = no/faint staining; 2 = moderate staining; 3 = strong staining) were defined. To analyze the resulting (univariate) scores, we used a t test.
For the compositional data analysis, we first added a small amount of random noise to the cells within each staining profile that had zero values, and then renormalized the resulting compositions to sum to 1. The random noise was generated as the absolute value of a Normal(0,
2) random variable. We considered several values for
(i.e., 0.01, 0.015, 0.02, and 0.025). The smallest value of
for which the resulting transformed observations satisfied the regression assumptions was 0.02, and we used this value in the analysis. Our method for dealing with zero components is similar to that of Aitchison (6), reflecting the notion that zero components represent observations that fall below some minimal detectable limit. However, rather than replacing zeros by a small, fixed number, we used a random value which we found more likely to satisfy the assumptions required for the regression analysis. We assessed sensitivity to the specific random values generated by re-running the analysis several times, with different random imputations each time, and found results across the runs to be similar.
A Simulation Study
As we have already noted, in practice it is possible that a perturbation may be very different from the identity, but with a relatively small effect on the score. Such perturbations include shifts that reflect movement out of moderate staining to both more intense and less intense staining, and vice versa. These differences may arise in practice, for example, when comparing relatively homogeneous tissues in which the majority of cells stain moderately, with tissues that consist of a mixture of disease phenotypes with a variety of staining profiles including none or mild staining and some intense staining. Because the score is linear, simultaneous increases (or decreases) in intense and mild staining tend to cancel each other out, so these perturbations are generally not reflected in the score. As an example, Fig. 2A shows data reflecting two groups of 25 compositions each, generated from two independent distributions. The median of the first distribution is (0.35, 0.50, 0.15), and that of the second distribution (0.55, 0.10, 0.35). The resulting perturbation is (0.38, 0.05, 0.57), which reflects a lesser amount of moderate staining in the second group, with a corresponding increase in the proportion of cells staining mildly and intensely. This perturbation is quite different from (1/3, 1/3, 1/3). We conducted 100 simulations from the model generating the data in Fig. 2A, and compared the number of runs for which the compositional approach detected a difference between the two groups with the number of runs for which the standard score-based approach (t test on the scores within each group) detected a difference at a significance level of 0.05. We also conducted a second set of simulations, using a model reflected by the data in Fig. 2B. As in Fig. 2A, the first group has median (0.35, 0.50, 0.15), but the median of the second group is (0.05, 0.65, 0.30), reflecting a greater frequency of cells staining at both the moderate and intense levels.
| Results |
|---|
|
|
|---|
= 0.05 level. This example confirms that certain differences between groups, such as the one represented in Fig. 2A, will tend to be missed by the score-based approach but identified by the more general compositional analysis approach. In contrast, the difference represented in Fig. 2B was detected equally well by the compositional and score-based methods, with compositional data analysis identifying a difference in 95 of 100 simulations, and a score-based t test identifying a difference in 96 of 100 simulations, at the
= 0.05 level.
Analysis of Syndecan-1 Expression and Prostate Cancer Recurrence
Figure 3A shows the staining profiles for Gleason score 7 patients by recurrence status. Figure 3B shows a box plot for the scores within each group. A t test applied to the scores yielded a P value of 0.05 (recurrent versus non-recurrent cases).
|
|
| Discussion |
|---|
|
|
|---|
Compositional data methods are designed to be applied in settings where there is a considerable amount of heterogeneity in the intensity of staining within a single specimen. In the absence of heterogeneity, the staining profiles will contain many zero components, and the assumptions of multivariate normality in the regression analysis may not be satisfied; in such settings, qualitative approaches may be preferable.
Our analyses consider the staining profile as a dependent variable and the disease status indicator as an independent variable. However, the goal of many analyses is to predict outcome as the dependent variable, often on the basis of more than one marker. In this setting, our approach would be most useful in the exploratory phase of the analysis, to identify those markers that might be useful for predictive purposes, and to determine how staining profile differences should be quantified in the final predictor panel.
The greater flexibility provided by compositional methods may in some cases confer a cost (i.e., loss of statistical power). In the analysis of Syndecan-1 staining, we found that the compositional analysis did not identify a significant difference between the recurrent and non-recurrent groups, whereas a difference was just apparent using the more standard score-based analysis (t test on the scores within each group; P = 0.05). Given the limitations of the different approaches, it may be a useful practical strategy to implement both score-based and compositional analyses on any particular data set. At the very least, the compositional data approach will be a useful complement to the score analysis, allowing the investigator to confirm any results obtained and providing greater detail regarding any staining profile differences that may exist.
In conclusion, compositional data analysis provides a novel, flexible, and intuitive analytic approach that accounts for heterogeneity in specimen staining profiles. We anticipate that it will prove useful in many studies that aim to differentiate between patient groups on the basis of biomarker staining patterns.
| Appendix 1 A statistical model for compositional data analysis |
|---|
|
|
|---|
Given an observation in the simplex, X = (X1, X2, ..., XK), and a perturbation,
= (
1,
2, ...,
K), the perturbed composition is given by (Xj
j / D),j = 1, ..., K, where D =
. The null perturbation is given by (1, 1, ..., 1) / K.
To estimate the best-fitting perturbation, we first transform the compositions from the simplex to
K 1, using the additive log-ratio (alr) transformation, which we denote by
. Let Yi =
(Xi); Yi = (Yi1, Y12 ..., Yi K1), where Yij = log(Xij / XiK). We model Yi as multivariate normal i.e., Yi
N(µi,
).
To model the dependence of the staining profile on disease status, where observations are available for both normal and diseased specimens, a linear model is assumed for the means µij of the Yij. Thus, we have µij = ß0j + ß1jZi, where Zi is an indicator of disease status for observation i. The ß's may be estimated using any package for multivariate linear regression; we have used WinBUGS, which, together with software developed by Billheimer et al. (7), conveniently allows for the computation of a 95% region around the estimated perturbation corresponding to ß1. This perturbation, which we denote
, is given by the inverse of the alr (ialr) transformation applied to ß1 i.e.
for j = 1, ..., K 1, and
.
The WinBUGS framework requires specification of prior distributions on the vectors ß0, ß1, and
. In our data analysis, we use a fairly noninformative bivariate normal prior for the ß's (mean 0, variance 1, and covariance 0.5) and a noninformative inverse Wishart prior for (
-1(
WR,4)), where R is the prior variance-covariance matrix specified for the ß's. An advantage of the WinBUGS framework is that it allows us to extend the model to accommodate settings in which multiple observations have been taken from the same patient. This may be particularly important in the case of prostate cancer because of within-tumor heterogeneity. To accommodate multiple observations from patient i, we can specify µij = ßoij + ß1jZi,ij incorporating a random effect in the model so that each individual has his own median staining profile. The effect of the disease status indicator is then measured against this individual-specific baseline. To implement this extension in Win BUGS requires specification of a hyper prior on ß0ij, such as ß
N (ß0j,
2). Then, priors for ß0j and ß1j can be specified described above.
| Footnotes |
|---|
The costs of publication of this article were defrayed in part by the payment of page charges. This article must therefore be hereby marked advertisement in accordance with 18 U.S.C. Section 1734 solely to indicate this fact.
Received 8/ 4/04; revised 2/ 9/05; accepted 3/ 2/05.
| References |
|---|
|
|
|---|
-Methylacyl-CoA racemase: a new molecular marker for prostate cancer. Cancer Res 2004;13:22206.
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| HOME | HELP | FEEDBACK | SUBSCRIPTIONS | ARCHIVE | SEARCH | TABLE OF CONTENTS |
| Cancer Research | Clinical Cancer Research |
| Cancer Epidemiology Biomarkers & Prevention | Molecular Cancer Therapeutics |
| Molecular Cancer Research | Cancer Prevention Research |
| Cancer Prevention Journals Portal | Cancer Reviews Online |
| Annual Meeting Education Book | Cell Growth & Differentiation |