Abstract
Background: Immunohistochemical studies use antibodies to stain tissues with the goal of quantifying protein expression. However, protein expression is often heterogeneous resulting in variable degrees and patterns of staining. This problem is particularly acute in prostate cancer, where tumors are infiltrative and heterogeneous in nature. In this article, we introduce analytic approaches that explicitly consider both the frequency and intensity of tissue staining.
Methods: Compositional data analysis is a technique used to analyze vectors of unitsum proportions, such as those obtained from soil sample studies or species abundance surveys. We summarized specimen staining patterns by the proportion of cells staining at mild, moderate, and intense levels and used compositional data analysis to summarize and compare the resulting staining profiles.
Results: In a study of Syndecan1 staining patterns among 44 localized prostate cancer cases with Gleason score 7 disease, compositional data analysis did not detect a statistically significant difference between the staining patterns in recurrent (n = 22) versus nonrecurrent (n = 22) patients. Results indicated only modest increases in the proportion of cells staining at a moderate intensity in the recurrent group. In contrast, an analysis that compared quantitative scores across groups indicated a (borderline) significant increase in staining in the recurrent group (P = 0.05, t test).
Conclusions: Compositional data analysis offers a novel analytic approach for immunohistochemical studies, providing greater insight into differences in staining patterns between groups, but possibly lower statistical power than existing, scorebased methods. When appropriate, we recommend conducting a compositional data analysis in addition to a standard scorebased analysis.
 Immunodiagnosis
 Tumor markers and detection of metastasis
 New algorithms
 Genitourinary cancers: prostate
 Molecular markers of metastasis and progression
Background
Immunohistochemical studies are designed to detect the expression of proteins at the cellular level. In such studies, an antibody is applied to a tissue specimen, and antibody binding is detected using a chromogenic substrate that results in a color change where the protein is localized. The specimen is then examined under a microscope and the extent and intensity of color staining is assessed by an observer.
Immunohistochemical studies have become an integral part of the process of biomarker development, where they are used to determine whether overexpression or underexpression of potential markers predicts disease status. For example, several novel proteins have been shown to be markedly overexpressed in prostate cancer tissue, including αmethylacylCoA Racemase (1), EZH2 (2), and pim1 kinase and hepsin (3). cDNA microarray studies typically yield many candidate genes that seem to differentiate between disease groups at the RNA level. However, only a small portion of these findings will ultimately be confirmed by immunohistochemistry at the protein level.
Immunohistochemical studies of prostate cancer reveal tremendous withinspecimen heterogeneity.Not only are normal cells typically present in cancer specimens, but staining among tumor cells can be highly variable. Figure 1 shows a section of formalinfixed, paraffinembedded prostate tissue stained with the syndecan1 antibody (4). Both normal and cancer cells are visible; moreover, some areas of the tumor show mild to moderate staining, whereas others show very intense staining. Because of variability, like that illustrated in Fig. 1, specimens are often not coded as positive or negative for a particular antibody; rather, a summary score is calculated (1, 4). First, the percent of cells staining at a fixed number of intensity levels (e.g. none, faint, moderate, strong) is determined. We refer to this vector of percents as the staining profile associated with a specific specimen. Then the score is given as the sum over intensity levels of the percent staining multiplied by an ordinal value corresponding to the intensity level (0 = no staining, 1 = mild staining, etc.). With four intensity levels, the resulting score ranges from 0 (no staining in the entire specimen) to 300 (intense staining uniformly within the specimen). Variants of this system have been proposed; for example, instead of the percent staining at each intensity level, an integer may be recorded with values (0, 1, 2, …) indicating whether the percent staining at a given level falls within a specified range (020%, 2040%, 4060%, …, ref. 5). Qualitative approaches, which classify specimens as positive/negative or weakstrong, have also been used (2, 3).
The transformation from staining profile to score is many to one. Consider, for example, the two specimen profiles in Table 1. The first profile shows increased mild to moderate staining relative to the second profile, which has more intense staining. However, the two sample specimens in Table 1 yield the same score (80), because the increased frequency of cells staining moderately in the first sample compensates for the small absolute increase in intense expression in the second sample. Therefore, if differences such as those represented in Table 1 are of interest, a scorebased analysis may not be adequate.
This article presents an approach for directly analyzing data on the staining profiles recorded in immunohistochemical studies. Our goal is to provide an analytic method that takes advantage of the richness of the information in the individual profiles. We use a class of methods collectively referred to as compositional data analysis. A composition consists of a collection of probabilities which sum to 1, and which generally represent proportions of components of a mixture. Compositional data analyses are common in soil science where interest may focus on the various components (sand, silt, and clay) that make up a sample (6, 7). Compositional data methods have also been applied in biodiversity studies where the relative frequencies of different species within a specific geographic area are of interest.
Since a tissue specimen that has been stained with an antibody represents a mixture of cells with different staining intensities, compositional data methods are applicable. We show how the compositions represented by staining profiles may be graphically represented—and statistically compared—across groups sustaining different outcomes of interest. We use these methods to analyze data from an immunohistochemical study of a potential biomarker for prostate cancer recurrence, and compare results to those obtained with the scorebased approach.
Materials and Methods
Compositional Data Analysis: Basic Concepts
A composition is a set of proportions, representing components of a mixture. The elements of a composition all lie between 0 and 1 and their sum is equal to 1. In immunohistochemical studies, we can think of the data from each specimen as a composition, where the different components are the discrete staining intensity levels and the data for a given specimen consist of the proportion of cells staining at each intensity level. The goal is to compare the compositions for one group of tissues (e.g., normal tissues) with the compositions for a second group (tumor tissues). In the rest of this article, we will focus on threepart compositions because we have found them to be adequate for the immunohistochemical applications that we have considered, and because they have a convenient graphical representation.
For threepart compositions, the basic graphical tool for representing data is the ternary diagram (Fig. 2A). A ternary diagram consists of an equilateral triangle, with unit altitude and vertices corresponding to the components of the composition (e.g., 0, 1, and 2 for a threelevel staining profile reflecting nonemild, moderate, and intense staining). A staining profile (a, b, and c) is represented in the figure by a point P such that the perpendicular distances from P to the edges of the triangle opposite vertices 0, 1, and 2 equal a, b, and c, respectively. Thus, staining profiles that are mostly null (no stain) would be represented by points close to vertex 0; similarly, profiles that are balanced between moderate and intense staining would be on the perpendicular line from vertex 0 to the edge connecting vertices 1 and 2.
Figure 2A shows a ternary diagram for a set of threelevel staining profiles from a simulated data set consisting of 25 observations in each of two groups. In the first group, specimens tend to show a mixture of light/no staining and moderate staining as evidenced by their location along the edge of the simplex joining vertices 0 and 1. In the second group, there is less moderate staining, with a corresponding increase in light/no staining together with some intense staining. The points in the second group are said to be perturbed relative to the points in the first group. A perturbation is a composition that shifts points within the ternary diagram. The perturbation operation may be thought of as the addition operation for compositions, but the identity element (analogous to zero) is the value (1/3, 1/3, 1/3). The perturbation accounting for the difference between the medians of the two samples in Fig. 2A is (0.38, 0.05, 0.57), reflecting a shift out of level 1 (moderate) to both levels 0 (none) and 2 (intense). This type of shift is particularly difficult to detect using scorebased methods, because although it reflects a real biological difference, it does not lead to a marked change in the score.
To compare staining profiles across two samples, we need to determine whether the perturbation that captures any observed betweensample differences is statistically significantly different from the identity value (1/3, 1/3, 1/3). To estimate the bestfitting perturbation, we fit a regression model to a transformation of the observed staining profiles. The transformation is called the additive logratio (alr) transformation, and it converts any kpart composition from a vector of observations lying between 0 and 1 to a (k − 1) vector of observations lying between −∞ and +∞. The regression model is then applied to the transformed data. Appendix 1 details the alr transformation and its inverse (the ialr transformation), and describes the regression model. In the absence of other covariates that may affect staining patterns, the regression model has only one independent variable namely, an indicator of group membership. Applying the inverse of the alr transformation to the estimated coefficient of this independent variable yields the perturbation that best captures betweensample differences.
Utilization of the alr transformation requires that all elements of the observed compositions be greater than zero. In practice this is not always the case; some specimens may have no cells staining at a given intensity level. When some of the compositions have zero elements, Aitchison (6) recommends adding a small positive constant to the zerovalued cells, but we have found that adding a small amount of normally distributed random noise produces transformed data that are more likely to satisfy the assumptions required for the regression analysis (specifically, multivariate normality of the alrtransformed observations). We detail how this is done in practice when we present our analysis of the Syndecan1 data.
The key outputs of the regression model are (a) an estimate of the median composition among observations in the baseline group, and (b) an estimate of the perturbation that best describes the differential between the medians of the baseline group and the comparison group (6, 7). As we describe in Appendix 1, methods developed by Billheimer et al. (7) can be used to construct a 95% interval around this estimated differential. The 95% interval is represented as a region in the ternary diagram; if it excludes the identity point (1/3, 1/3, 1/3) we conclude that the data indicate that there is a statistically significant difference between the two groups.
To fit the regression model and construct the 95% interval, we use WinBUGS (the WinBUGS web site can be accessed at http://www.mrcbsu.cam.ac.uk/bugs/welcome.shtml). WinBUGS is publicly available; code to fit the regression model and construct the 95% interval is available from the authors. One of the advantages of using this framework for modeling is that it allows us to extend the regression model to the setting where multiple cores are taken from the same patient, as might, for example, be the case in a paired study (both cancerous and healthy samples analyzed within each participant). Appendix 1 describes how this may be done using WinBUGS.
Syndecan1 and Prostate Cancer Recurrence
We previously described (4) the results of a study to investigate the association between prostate cancer progression and expression of Syndecan1, a cell surface proteoglycan implicated in sequestering growth factors and enhancing their activity (8). One section of prostate cancer tissue from each of 76 patients diagnosed with clinically localized disease was analyzed for Syndecan1. All tumors were Gleason score 6 or 7 and were treated with radical prostatectomy. All subjects were followed for at least 5 years. The study end point (time of recurrence) was defined as two consecutive rising serum prostatespecific antigens greater than 0.2 ng/mL. In the group with Gleason score 7 disease, 22 experienced prostatespecific antigen recurrence within 5 years, and 22 survived, recurrence free, for at least 5 years.
We analyzed the immunohistochemical data for Gleason score 7 patients using both the summary score approach and compositional data analysis. Our goal was to determine whether staining profiles differed for patients who did and did not recur within 5 years. Three categories of staining intensity (1 = no/faint staining; 2 = moderate staining; 3 = strong staining) were defined. To analyze the resulting (univariate) scores, we used a t test.
For the compositional data analysis, we first added a small amount of random noise to the cells within each staining profile that had zero values, and then renormalized the resulting compositions to sum to 1. The random noise was generated as the absolute value of a Normal(0, σ^{2}) random variable. We considered several values for σ (i.e., 0.01, 0.015, 0.02, and 0.025). The smallest value of σ for which the resulting transformed observations satisfied the regression assumptions was 0.02, and we used this value in the analysis. Our method for dealing with zero components is similar to that of Aitchison (6), reflecting the notion that zero components represent observations that fall below some minimal detectable limit. However, rather than replacing zeros by a small, fixed number, we used a random value which we found more likely to satisfy the assumptions required for the regression analysis. We assessed sensitivity to the specific random values generated by rerunning the analysis several times, with different random imputations each time, and found results across the runs to be similar.
A Simulation Study
As we have already noted, in practice it is possible that a perturbation may be very different from the identity, but with a relatively small effect on the score. Such perturbations include shifts that reflect movement out of moderate staining to both more intense and less intense staining, and vice versa. These differences may arise in practice, for example, when comparing relatively homogeneous tissues in which the majority of cells stain moderately, with tissues that consist of a mixture of disease phenotypes with a variety of staining profiles including none or mild staining and some intense staining. Because the score is linear, simultaneous increases (or decreases) in intense and mild staining tend to cancel each other out, so these perturbations are generally not reflected in the score. As an example, Fig. 2A shows data reflecting two groups of 25 compositions each, generated from two independent distributions. The median of the first distribution is (0.35, 0.50, 0.15), and that of the second distribution (0.55, 0.10, 0.35). The resulting perturbation is (0.38, 0.05, 0.57), which reflects a lesser amount of moderate staining in the second group, with a corresponding increase in the proportion of cells staining mildly and intensely. This perturbation is quite different from (1/3, 1/3, 1/3). We conducted 100 simulations from the model generating the data in Fig. 2A, and compared the number of runs for which the compositional approach detected a difference between the two groups with the number of runs for which the standard scorebased approach (t test on the scores within each group) detected a difference at a significance level of 0.05. We also conducted a second set of simulations, using a model reflected by the data in Fig. 2B. As in Fig. 2A, the first group has median (0.35, 0.50, 0.15), but the median of the second group is (0.05, 0.65, 0.30), reflecting a greater frequency of cells staining at both the moderate and intense levels.
Results
Simulation Results
Over 100 simulations from the model that generated the data in Fig. 2A, compositional data analysis identified a significant difference between groups 84% of the time, whereas a t test comparing the scores within each group identified a difference only 9% of the time at the α = 0.05 level. This example confirms that certain differences between groups, such as the one represented in Fig. 2A, will tend to be missed by the scorebased approach but identified by the more general compositional analysis approach. In contrast, the difference represented in Fig. 2B was detected equally well by the compositional and scorebased methods, with compositional data analysis identifying a difference in 95 of 100 simulations, and a scorebased t test identifying a difference in 96 of 100 simulations, at the α = 0.05 level.
Analysis of Syndecan1 Expression and Prostate Cancer Recurrence
Figure 3A shows the staining profiles for Gleason score 7 patients by recurrence status. Figure 3B shows a box plot for the scores within each group. A t test applied to the scores yielded a P value of 0.05 (recurrent versus nonrecurrent cases).
Figure 4A shows the 95% region for the estimate of the perturbation that best describes the difference between the median compositions within each group. This region includes the identity point, suggesting lack of evidence of a significant difference between the staining profiles in the two groups. The bestfitting perturbation is (0.28, 0.41, 0.31), reflecting a modest increase in the frequency of moderate staining, but little change in the frequency of intense staining among recurrent relative to nonrecurrent cancer cases. For a run of 10,000 iterations, the run time was less than 30 seconds to fit this model.
Figure 4B shows the estimated median staining profiles within each group (recurrent versus nonrecurrent). The point for the nonrecurrent group shows that there is very little staining in general in this group. The figure also provides a quantitative representation of the information lost when conducting a scorebased analysis, assuming scores of 1, 2, and 3 corresponding to the three levels of staining intensity represented in the figure (4). These numerical values for each staining level imply a score range with a lower limit of 100 and an upper limit of 300. The parallel lines on the figure represent points with equal score differences relative to the median staining profile for the nonrecurrent group. Thus, for example, the points on the topmost line, which includes the observed median profile for the recurrent group, all have a score difference of 13 when compared with the nonrecurrent group. Similarly, the points on the lowest line all have a difference of 100 relative to the nonrecurrent group. However, the staining profiles represented by different points on the same line often represent quite different phenomena. For example, a score difference of 75 could imply that samples from recurrent patients tend to show mostly moderate staining, or a balance between negative and intense staining and very little moderate staining. As shown by our first set of simulations, information to distinguish between these possibilities would be provided by a compositional analysis.
Discussion
This study presents a novel application of compositional data methods to the field of immunohistochemistry. Until now, pathologists analyzing immunohistochemical data have had access to a limited set of analytic approaches. Compositional data methods extend the available analytic options, providing the ability to detect a larger range of betweenspecimen differences and greater insight into how staining profiles change with disease status. The ability to detect small differences between tissues as represented for example in Table 1 is particularly important, because biological behavior of a tumor may reside in a small clone of cells, which may ultimately be those that metastasize or lead to an unfavorable outcome.
Compositional data methods are designed to be applied in settings where there is a considerable amount of heterogeneity in the intensity of staining within a single specimen. In the absence of heterogeneity, the staining profiles will contain many zero components, and the assumptions of multivariate normality in the regression analysis may not be satisfied; in such settings, qualitative approaches may be preferable.
Our analyses consider the staining profile as a dependent variable and the disease status indicator as an independent variable. However, the goal of many analyses is to predict outcome as the dependent variable, often on the basis of more than one marker. In this setting, our approach would be most useful in the exploratory phase of the analysis, to identify those markers that might be useful for predictive purposes, and to determine how staining profile differences should be quantified in the final predictor panel.
The greater flexibility provided by compositional methods may in some cases confer a cost (i.e., loss of statistical power). In the analysis of Syndecan1 staining, we found that the compositional analysis did not identify a significant difference between the recurrent and nonrecurrent groups, whereas a difference was just apparent using the more standard scorebased analysis (t test on the scores within each group; P = 0.05). Given the limitations of the different approaches, it may be a useful practical strategy to implement both scorebased and compositional analyses on any particular data set. At the very least, the compositional data approach will be a useful complement to the score analysis, allowing the investigator to confirm any results obtained and providing greater detail regarding any staining profile differences that may exist.
In conclusion, compositional data analysis provides a novel, flexible, and intuitive analytic approach that accounts for heterogeneity in specimen staining profiles. We anticipate that it will prove useful in many studies that aim to differentiate between patient groups on the basis of biomarker staining patterns.
Appendix 1 A statistical model for compositional data analysis
In comparing staining profiles across samples, the goal is to determine whether the perturbation that captures any observed betweensample differences is statistically significantly different from the null value.
Given an observation in the simplex, X = (X_{1}, X_{2}, …, X_{K}), and a perturbation, α = (α_{1}, α_{2}, …, α_{K}), the perturbed composition is given by (X_{j}α_{j} / D),j = 1, …, K, where D = . The null perturbation is given by (1, 1, …, 1) / K.
To estimate the bestfitting perturbation, we first transform the compositions from the simplex to ℜ^{K − 1}, using the additive logratio (alr) transformation, which we denote by ϕ. Let Y_{i} = ϕ(X_{i}); Y_{i} = (Y_{i1}, Y_{12} …, Y_{i K−1}), where Y_{ij} = log(X_{ij} / XiK). We model Y_{i} as multivariate normal i.e., Y_{i} ≈ N(μ_{i}, ∑).
To model the dependence of the staining profile on disease status, where observations are available for both normal and diseased specimens, a linear model is assumed for the means μ_{ij} of the Y_{ij}. Thus, we have μ_{ij} = β_{0j} + β_{1j}Z_{i}, where Z_{i} is an indicator of disease status for observation i. The β's may be estimated using any package for multivariate linear regression; we have used WinBUGS, which, together with software developed by Billheimer et al. (7), conveniently allows for the computation of a 95% region around the estimated perturbation corresponding to β_{1}. This perturbation, which we denote α, is given by the inverse of the alr (ialr) transformation applied to β_{1} i.e. for j = 1, …, K − 1, and .
The WinBUGS framework requires specification of prior distributions on the vectors β_{0}, β_{1}, and σ. In our data analysis, we use a fairly noninformative bivariate normal prior for the β's (mean 0, variance 1, and covariance 0.5) and a noninformative inverse Wishart prior for (∑_{1}(≈WR,4)), where R is the prior variancecovariance matrix specified for the β's. An advantage of the WinBUGS framework is that it allows us to extend the model to accommodate settings in which multiple observations have been taken from the same patient. This may be particularly important in the case of prostate cancer because of withintumor heterogeneity. To accommodate multiple observations from patient i, we can specify μ_{ij} = β_{oij} + β_{1j}Z_{i},_{ij} incorporating a random effect in the model so that each individual has his own median staining profile. The effect of the disease status indicator is then measured against this individualspecific baseline. To implement this extension in Win BUGS requires specification of a hyper prior on β_{0ij}, such as β ≈ N (β_{0j}, σ^{2}). Then, priors for β_{0j} and β_{1j} can be specified described above.
Footnotes

Grant support: Grant P50 CA97186.

The costs of publication of this article were defrayed in part by the payment of page charges. This article must therefore be hereby marked advertisement in accordance with 18 U.S.C. Section 1734 solely to indicate this fact.
 Accepted March 2, 2005.
 Received August 4, 2004.
 Revision received February 9, 2005.