
| HOME | HELP | FEEDBACK | SUBSCRIPTIONS | ARCHIVE | SEARCH | TABLE OF CONTENTS |
1 Cancer Prevention Laboratory, Colorado State University, Fort Collins, Colorado and 2 Departments of Biometrics and Occupational Medicine, National Jewish Medical and Research Center, Denver, Colorado
Requests for reprints: Pamela Wolfe, Cancer Prevention Laboratory, Colorado State University, 111 Shepardson Building, 1173 Campus Delivery, Fort Collins, CO 80523. Phone: 303-730-3236; Fax: 970-491-3542. E-mail: wolfe{at}earthlink.net
| Abstract |
|---|
|
|
|---|
| Introduction |
|---|
|
|
|---|
While adaptation of quantitative morphometric analysis as a clinical tool has been discussed in the literature for over 30 years, the usefulness of this approach has not received widespread acceptance. One reason for this is that there has been limited evidence that this approach can provide stable, specific, and selective algorithms for use in diagnosis, prognosis, or risk assessment. This is reflected in the literature by less than optimal classification rates and the lack of consistency among reports in the morphometric variables selected for use in diagnostic, prognostic, or disease risk algorithms (10, 11, 13, 14). The situation has been further complicated by the emergence in the last few years of programs that compute an increasing number of morphometric parameters derived from the same basic set of image data, which raises the question of whether the additional parameters are providing new information or simply additional measurement variables. If it is the latter, data reduction should be among the steps taken prior to modeling.
Various statistical methods have been applied to the analysis of morphometric data, but our search did not reveal any work in comparative discrimination methods. The number of morphometric variables is not so substantial as to discourage attempts to evaluate the data one parameter at a time and not yet enough to suggest that the application of artificial intelligence, or machine learning techniques, would be the method of choice. The data are not multinormal, but the performance of classic statistical tests, such as Student's t, is optimal when the assumption of normality holds. The power of nonparametric methods depends on just how far from normal the distribution strays; nonparametric multivariate methods require of the analyst some statistical savvy and patience with trial and error if they are to be applied appropriately. Artificial intelligence algorithms tend to work best with high-dimensional data sets, and neural networks, in particular, operate in a black box, producing an answer but failing to show their work. In an effort to evaluate the relative merits of a variety of statistical methods using an experimental approach that was simple and designed to limit individual variation among cells being evaluated, this study was performed using cells grown in culture, the biological characteristics of which relative to malignancy were known.
R.A. Fisher's seminal article in the Annals of Eugenics (1936) described a linear discriminant function for the taxonomy of three Iris sp. based on four parameters: sepal length and width, petal length and width. Although the method he described was general and did not assume normality, it has since been shown to perform optimally when that assumption holds. Comparison of linear discriminant analysis (LDA) with logistic regression (LR) has shown that when the data are not multivariate normal, LR is a more reliable classifier, and more so as the distributions diverge from normal. Some examples are Efron's (15) theoretical treatment, an application to more than two groups by Bull and Donner (16), and a simulation study that compared the performance of multiple group linear discriminant function, rank-based linear discriminant function, kernel density (KD) function, and multiple logistic function by Barón (17). The advent of high-throughput devices that is revolutionizing biomedical research is also challenging statisticians; an empirical comparison of the performance of classic discriminant analysis with several computer-intensive, nonparametric techniques applied to genomic data was published in the Journal of the American Statistical Society in 2002 (18). A theoretical treatment of discrimination and classification is given by Hand (19); computer-intensive machine learning techniques are described by Hastie et al. (20).
This article is organized into a Materials and Methods section, which describes the lung cell culture and image acquisition; a description of the Statistical Methods we applied to the image data; the Results of the classification analysis; and a Discussion of the results with an emphasis on the strengths of those we deem most appropriate for classification based on morphometric data.
| Materials and Methods |
|---|
|
|
|---|
|
Image Acquisition and Analysis
The nuclear morphometry of the cells was evaluated by computer-assisted image analysis using a CAS-200 image analysis system (Becton-Dickinson, San Jose, CA) equipped with CellSheet version 2.0 software (Bacus Laboratories, Lombard, IL). For analysis, 100 diploid cells from each cell line were selected; no S-phase, tetraploid, or aneuploid cells were evaluated. In addition, irregularly shaped diploid nuclei were excluded from analysis. Cells were selected for acquisition via digital filter. Cell images were screened by a technician following acquisition for quality control purposes. Overlapping or oddly shaped nuclei were deleted prior to nuclear morphometric analysis. Parameters derived from the size, shape, and absorbance of each nucleus were measured. In addition to these 12 measurements, 6 general texture and 21 Markovian texture features of the nucleus were evaluated (Table 2). Markovian texture features characterize variations between adjacent pixels in an image based on mathematical algorithms using a gray-level transition probability matrix. Markovian texture data provide a quantitative assessment of fine changes in chromatin structure classically used by the pathologist for grading and staging of lesions. However, the use of computer-assisted image analysis removes the subjectivity and greatly improves the sensitivity to detect subtle early changes in chromatin structure that are associated with cancer. All 39 features are listed in Table 2 and are described in detail at http://cpl.colostate.edu/morph. Differences measured by morphometric analysis are generally subtle and are frequently not detected on visual inspection of stained cells. Images of the cells measured in this study and of the cells that were typically misclassified statistically are provided at http://cpl.colostate.edu/morph.
|
| Statistical Methods |
|---|
|
|
|---|
i) /
l, where 1 designates the reference group, the nontumorigenic BEAS-2B cell line. This customized z-score measures the direction and distance from the mean of the reference group in SD units. The mean z-scores for the reference group are all 0 by definition.
Classification
The following notation and terms are used throughout the discussion of classification methods. Let X be a P-dimensional measurement space containing all possible ordered vectors of morphometry features; for our data, the first coordinate, x1, is nuclear area; the second, x2, is PgDNA; ...; the last, xp, is triangular symmetry (refer to Table 2). A classifier d(x) is a function defined on X that will, for J cell types, partition the measurement space into J pairwise disjoint subsets A1, A2, ..., AJ; d(x) assigns each vector x = (x1, x2, ..., xp) from X to exactly one subset, Aj. For every case with a measurement vector x in Aj, the class assignment is j. Classifiers are constructed from a learning sample consisting of N vectors, where xn = (xn1, xn2, ..., xnp) is drawn from X with known class membership, yn. Define a set of N cases with an indicator, y, for class membership to be a learning set L = {(x1, y1), (x2, y2), ..., (xN, yN)}. The error rate computed by counting the misclassified elements in L for a classifier developed on L is the resubstitution rate and is overly optimistic. The validity, or performance, of a classifier is estimated from a test set, T = (x1, x2, ..., xM), where ym is known but used only after the classifier has been applied to compute the misclassification rate. In our data set, N = 200, P = 39, and J = 2; Y is coded 1 for the cancer cell line, 0 otherwise.
LDA, quadratic discriminant analysis, and LR belong to a class of methods that estimate a discriminant function, d(x), and classify an observation x based either on threshold values of d(x) or on the minimum over j of dj(x), which is the distance between x and the center of group j.
Suppose the class probability densities are multivariate normal, that is, x|yj
Np(µj,
j) and
j is the proportion of observations in class j. Then, a discriminant function that achieves maximum separation of means is given by dj(x) = (1/2)logë
jë (1/2)(x µj)T
j1(x µj) + log
j and its application is called quadratic discriminant analysis. The second term, (x µj)T
j1(x µj), is the Mahalanobis distance from x to the center of the class. If
j =
for all j, the discriminant function simplifies to dj(x) = xT
1µj (1/2)µjT
j1µj + log
j, which is linear in x and is called LDA. The population parameters are rarely known and sample estimates for µ and
are substituted into the equations in the development and application of dj(x). Although LDA and quadratic discriminant analysis perform optimally with multinormal random variables, in practice, the normality assumption is often ignored. For a detailed discussion of discriminant analysis under the assumption of normality, see Morrison (23).
LR (24) models the posterior probability that a case is in one of two or more classes conditional on x. Let
(x) = E(Y|x), the expected value of Y conditional on x. The LR model is specified:
(x) = exp(ß0 + ß1x1 + ... + ßpxp) / [1 + exp(ß0 + ß1x1 + ... + ßpxp] which is nonlinear in x.
The logit transformation g(x) = ln[
(x) / (1
(x))] = ß0 + ß1x1 + ... + ßpxp gives a function that is linear in x and ranges from
to +
. Estimates for
(x) can now be seen as an example of d(x), described above, and when J = 2 classes, we assign x to class 2 if ð(x) > c and class 1 if
(x)
c. A common choice for c is 0.5, but this may not maximize sensitivity and specificity. While this model is still linear in x, the distributional assumption is on Y (binomial) rather than x, as it was in LDA and quadratic discriminant analysis; the model is easily extended to multinomial outcomes.
Binary tree structured classifiers begin by splitting the sample space into two partitions; each partition, or node, is split into two more partitions, and so on until further splitting cannot improve the performance of the classifier. The construction requires three elements: a rule for determining the splits, a rule for stopping splitting, and a rule for class assignment to terminal nodes or leaves. The criterion for splitting is a reduction in impurity, a function of the class representation at a node, which is at its maximum value when all classes are equally represented at a node and at its minimum when a node contains only one class. The size of the tree can be limited in two ways, stopping rules and pruning. Stopping rules are based on a combination of impurity and the number of cases at a node. Pruning can be based on a cost complexity algorithm or the misclassification rate. Class assignment is determined by plurality at a node. The method we used was Classification and Regression Trees (CART), described by Breiman et al. (25) and implemented in the tree() function in S-Plus.
K nearest neighbor (KNN) clustering is a memory-based classifier and imposes no distributional assumptions on the data. The method requires only a metric ||x|| on X and the specification of K, the number of neighbors that contribute to the decision on class assignment of x. A range of values for K is tested by cross-validation. Using the K that minimizes the misclassification rate, the procedure is to select the KNN of xi based on the metric and classify xi by majority vote. The resulting partition of X is not linear. The method is sensitive to the choice of metric and provides no information about the structure of the data. We selected the Mahalanobis distance as the metric, after some trial and error, and set K = 5. KNN is implemented in SAS as an option in PROC DISCRIM and in the R function knn().
The KD method, like KNN, generates a nonlinear partition of the data, is nonparametric, and provides little information about the structure of the data. The method incorporates a specific weighting function, Kj(ìxì), the kernel, and a fixed radius, r, to estimate the probability that xi belongs to group j. The posterior probability that xi belongs to j is given by Bayes' rule: p(j|x) =
jfj(x) / f(x), where the probability density estimate fj(x) = nj1
yKj(x y), summing over every observation y in group j, and f(x) is the unconditional density,
u
ufu(x), summing over all groups. xi is assigned to the group where p(j|x) is greatest. The uniform kernel method is easily visualized: p(j|x) is a function of the number of observations, y, from group j falling in a p-dimensional ellipsoid of radius r centered on xi. This is different from KNN only in that the radius is fixed. Other KD functions weight the distance between xi and each y in j. Optimal classification for the lung cell culture data was given by r = 2 and the normal kernel: Kj(||x||) = [1 / c(j)]exp(0.5||x||T
j1||x|| / r2), where c(j) = (2
)p / 2rpë
j1ë1/2.
Other Methods
Newer methods for exploring higher-dimensional data, such as neural networks (14, 26, 27) and the learning vector quantization neural network (13), have been applied to morphometry data. We tried several specifications of a neural network and a support vector machine on this data set; the results (available from http://cpl.colostate.edu/morph) were poor. For a comprehensive discussion of statistical learning, see Hastie et al. (20).
Software
We used SYSTAT version 10 for graphics; SAS version 8.2, S-Plus 4.0, R,3 and iMiner 1.01 were used for all other analyses. In most cases, default parameters were used: stepwise LR in SAS adds or removes variables at the 0.05 significance level and stepwise discriminant analysis adds or removes variables when their partial correlation coefficient is 0.01. The CART procedure in S-Plus uses deviance (stop splitting when impurity is <1%) and the number of observations at a node to stop splitting. We reduced the minimum number of cells per leaf for a split to 2 (the default is 10) and specified the minimum deviance (the specific measure of impurity) to be 0.001. We did a search from 1 to 10 for the number of neighbors that minimized errors for the KNN application; misclassification within the learning set was minimized with sets of five neighbors; the default metric is the Mahalanobis distance computed using the full covariance matrix. We also tested the Euclidean distance and the Mahalanobis distance computed using only the diagonal of the covariance matrix in an effort to reduce the misclassification rate. Mahalanobis distance using the full covariance matrix was the best metric. KD also required an iterative approach to the best specification.
Criteria for Comparison of Statistical Methods
The purpose of classification is prediction and understanding; while these goals are not mutually exclusive, not every method reveals the structure of the data. A method that elucidates structure or patterns in a data set, other things equal, would be preferred over one that does not. One straightforward criterion for selecting one method over another is prediction accuracy or the proportion of cases that are misclassified. Because misclassification rates do not indicate the direction of the errors (the false-positive and false-negative rates), sensitivity and specificity are also informative; in a particular application, it may be more important to minimize error in one direction than the other.
The validity of a classifier should be assessed by computing the misclassification rate on an independent sample of data. For large data sets, this is accomplished by holding out half or a third of the data to be used as a validation set, T. For smaller data sets such as ours, where a more parsimonious use of data is required, v-fold cross-validation works well (25). Briefly, the cases in L are divided randomly into V subsets of equal size designated L1, L2, ..., LV. The classifier is constructed on L Lv and tested on Lv for all v. The resulting misclassification rates are averaged over V, and the final classifier is constructed on L. Cross-validation is parsimonious with data; each case is used to construct the classifier and each case is used once in a test sample. For analyses in SAS and R, the data were partitioned into 10 sets of 20 observations, balanced between cell lines. The same partition was used for each classification method, and for stepwise methods, variable selection was done on each partition to avoid the bias that would be introduced by using any information from the test set, Lv.
Misclassification has two components: sensitivity and specificity. In the context of our data, sensitivity is the ability of a classifier to correctly identify cells in the cancer cell line and specificity is the ability of a classifier to correctly identify cells in the normal cell line. These were computed along with the misclassification rate in the cross-validation process and are reported in Table 3. Although there is no cost associated with false positives in our experiment, in many settings, this would be an additional consideration.
|
| Results |
|---|
|
|
|---|
|
|
The z-scores are consistently greater in absolute value for the parameters in the categories of size, shape, and general texture, all of which are more familiar to the histologist than the Markovian texture parameters. It is not surprising that the Markovian texture parameters, which are mathematical constructions, appear to carry somewhat less information. However, the feature selected first in the stepwise procedures tends to belong to the Markovian texture subset (coefficient of variation). Good separation and a high z-score are not synonymous. Figure 2 illustrates this: for sum absorbance (sum optical density), z = 2.93; for the coefficient of variation, z = 1.87; however, the sum absorbance values for the cancer cells lie entirely within the distribution of values for the normal cells.
|
|
Parameter selection in both LDA and LR was dependent on the data and the choice of classifier; different subsets of parameters were selected for each subset of data, L Lv, in the cross-validation models. Only PgDNA appeared in all 10 partitions for discriminant analysis, for example, and in 8 of 10 models for LR. This is due to the strong correlations among the parameters. Backward selection, the preferred method because variables behave differently in concert than singly, does not work well with these data, again due to the strong correlations. In the cross-validation macro for LR for example, 10 to 20 variables were selected depending on the partition, and the misclassification rate was high due to overfitting. Table 3 shows the set of parameters for each method based on stepwise selection on the full 200 observations.
CART, technically known as binary recursive partitioning, is widely used in the behavioral sciences and has recently been applied to the high-dimensional data of proteomic assays (30). The misclassification was the same as that for LDA after adjusting the default parameters for stopping. We present the graphic results for the process with a tree grown on L and pruned based on the minimum error criterion (see Fig. 4). The importance of each element of x is evident in the vertical distance between nodes. Interactions among parameters can be found by observing which parameters split the data along the left and right branches of the tree. For example, PgDNA and peak are predictors only when product moment is >194, while difference variance is a predictor only when product moment is <194.
|
Because the area under the ROC curve was similar for LR, LDA, and KD, we examined the ROC data to find a cut point that would optimize the misclassification rate for KD. Setting the cut point at 0.42 gives 87% specificity and 91% sensitivity and reduced the misclassification rate by 2%. The ROC curves for these three methods, which had the greatest area under the ROC curve, are shown in Fig. 5.
|
| Discussion |
|---|
|
|
|---|
Standardization
The benefit of standardizing data prior to analysis is less obvious in our example than it is in theory. The transformation of centering alone improves the numerical properties of the data, and further division by the SD of the referent group, the calculation of the z-score, provides an intuitively appealing measure in SD units. The improvement in numerical stability achieved by this process would be more apparent in a larger data set. Some applications, KNN, for example, require standardization of the inputs to avoid the problem of dissimilarities of scale. The drawback to the procedure we used, standardizing with respect to a particular reference group, is that while most statistical software has an option to produce z-scores, they will be column specific. That is, for each element xi, the transformation is done with respect to the mean and SD for the parameter over all classes, so that some programming is required to accomplish the transformation where the mean and SD are restricted to a particular subset of the data. The performance of the classifiers was not affected by the choice of transformation in our data set.
Multiple Comparisons
Typically, analyses in the morphometry literature ignore the problem of multiple endpoints as well as the distribution of the morphometric measures. Adjustment for multiple comparisons in a nonparametric procedure leads to more conservative conclusions than those we would have drawn by examining the P values on a list of 39 t tests. The step-down bootstrap is a better choice than the traditional Bonferroni adjustment on theoretical grounds; the advantages increase dramatically as the number of tests increases, divergence from the normal distribution increases, and in the presence of correlation among the test statistics. The feature-by-feature analysis presented in Table 4 is particularly useful in situations where it is not known which parameters might be important (32) and should be considered with appropriate adjustment for multiple comparisons (column 6).
Methods of variable selection that use information on the known classes perform multiple comparisons iteratively; the resulting models, if not adjusted for shrinkage, are characterized by P values that are too small and by biased parameter estimates. The stepwise selection we used for LR and LDA would lead to highly suspect parameters if the objective were model building and parameter estimation rather than predicting class membership.
Data Reduction
The morphometry data are inherently redundant and multivariate in nature. A single summary measure, data reduction techniques, or computer methods that select only the relevant information hold an obvious appeal. Bacus et al. (10) presented a summary measure, histologic grade. Its construction was the result of an iterative, labor-intensive process in which more than 100 possible combinations of features were examined. Among the characteristics of the resulting measure were "a high histologic grade at the end of the neoplastic process, ... a monotonically increasing histologic grade from 0 to the endpoint over the period of preinvasive neoplastic growth, [and] a generally Gaussian distribution of the histologic grade for the distribution of samples taken from ... normal tissues." They were combining data reduction and variable selection. The same authors proposed a similar but more objective summary measure in 1999 (22): they used LDA to select the important features from a collection of features standardized against the corresponding features from normal tissue and combined them into a weighed mean, where the weights are the canonical coefficients estimated for the discriminant function. The resulting NG inherits its predictive strength from the underlying linear discriminant function, however, and the probabilities for class membership estimated from LR can be used in the same way with more reliable results.
Data reduction, as opposed to variable selection, does not use information on the outcome or the known classes. The best approach would be to eliminate unimportant variable from the analysis based on prior knowledge or other criteria not requiring reference to the known classes. This approach has limited usefulness for the morphometry data because many of the features have no apparent clinical interpretation. Statistical approaches, such as variable clustering and nonlinear generalizations of principal components, have not been applied to these data to our knowledge.
Classification
LR has not been widely applied to morphometric data. A few exceptions are Kavantzas et al. (33), Veltri et al. (6), and Acker et al. (34). LR performs the same task as LDA, but the discrimination is conditional on the observed values of the random vectors in the training set. Efron (15) shows that while LR is less efficient (estimates will have greater variance) than LDA when the data are multivariate normal, it is robust to departures from normality. When the assumption of multivariate normal is violated, parameter estimates in the normal discriminant procedure are biased away from zero when the true parameter is not zero, which means that the tendency will be to overestimate the importance of the underlying variable. The estimated probabilities for membership in the event category, defined in our model to be the cancer cell line, can be interpreted in the same way we interpret the NG. Figure 3b shows probability density plots of the estimated probability of class membership from the regression of cell type on valley, coefficient of variation, diagonal moment, and sum absorbance. The misclassification rate with the cut point set at 0.5 argues in favor of the use of LR over the other methods we tested. The departures from multinormal distribution in the morphometric data are nontrivial, which suggests that the information about the importance of each parameter used in the classifier will be more reliably assessed by LR in cases where information about the contributing parameters is of interest. A limitation of LR is the requirement that any interactions to be tested must be specified by the analyst. Another is that the predictors should be linear in the logit, which can often be accomplished with an appropriate transformation.
An alternative measure of classification accuracy, the area under the ROC curve, groups LDA, LR, and KD more closely than the misclassification rate. The differences are very small, and given the SD of each, it is expected that the order would shift with different data sets. Optimizing the misclassification rate by adjusting the cut point produced a lower misclassification rate for each method, with the greatest reductions for LDA and KD, giving LDA the lowest misclassification rate (9%). Clearly, in settings where there is particular interest in sensitivity or specificity, adjusting the cut point based on exploration of the ROC data will be useful.
We found only one example of CART models applied to morphometric data (35). The misclassification rate for the S-Plus version of CART was not markedly different from LR and the same as LDA in SAS. CART has two distinct advantages over LR: the classifier is not assumed to be linear in x, and it is not necessary for the analyst to specify which interactions are to be tested, assuming the model is not simply additive. In a data set with 39 predictors, there are 741 possible two-way interactions and more than 10,000 three-way interactions. CART analysis can be used to complement LR; Nelson et al. (36) argue for the use of the two methods in concert: CART for elucidating the important subgroups and LR for estimating odd ratios in the subgroups. We examined the performance of LR with the variables and implied interactions from the CART model shown in Fig. 4; the interaction terms were not significant. As in LR, the classification variable can be continuous, categorical, or binary in CART; there are no restrictions on or assumptions about the distribution of the variables to be used in the classification process, nor is there any difficulty in handling complex, high-dimensional data sets containing mixed data types. In our data, the misclassification rate for new observations was slightly higher with CART than with LR, but studies in other fields have shown the performance of CART models to improve with larger training sets and more complex structure, especially if bagging or boosting can be used to improve their accuracy (18).
Limitations
Complete separation at the cell level is not possible in these data irrespective of the classification method used. A review of published articles in the morphometric literature reveals similar findings for in vivo data, across diverse model systems, and for different sampling procedures. See, for example, Markopoulos et al. (13), Pantazopoulos et al. (14), Boone et al. (9, 38), Bacus (37), and Poulin et al. (39). These results are central to the problem of determining the applications in which morphometry data are likely to be useful. Consider the difference between the problem of determining whether two populations are distributed differently and the problem of assigning to one population a single observation drawn at random from a mixture of the two populations based on a single characteristic. For the sake of argument, suppose the characteristic for the two populations is normally distributed and differs only in its mean (i.e., the variance is the same in both populations). If the distributions overlap, as they do in Figs. 2 and 3, the assignment of a new case selected at random will have a probability for misclassification that is a function of the amount of overlap; without information on other characteristics of the population, 100% correct classification is not possible. Conversely, if we want to estimate a confidence interval around either mean, the width of that interval can be made as narrow as we please by increasing the number of observations drawn from the mixture of the two populations. If the mean of a parameter (or combination of parameters) tends to shift in the presence of, say, a chemopreventive agent administered to a subject over time, then the hypothesis that a competing agent has the same effect can be accepted (or rejected) at any size test and power of the test that the investigator is comfortable with, simply by selecting an adequate number of subjects at the beginning of the experiment. The only way to improve the misclassification rate for new observations, however, is to identify another clinical characteristic that has discriminatory power. An example of this is given in Veltri et al. (40), where biochemical recurrence of prostate cancer survivors was equally well predicted by quantitative NG or the Gleason score. When quantitative NG and the Gleason score were combined, sensitivity, specificity, and accuracy increased.
The stepwise methods we used select four or five parameters, depending on where the threshold for entry into the model is set. A different subset of features emerged for each of the 10 partitions used for cross-validation, illustrating the point that variable selection is made arbitrary by collinearity (41). Models for classification in a variety of tissue or cancer types are characterized by even more striking differences and may have only one or two features in common (10). Given the redundancy in the data, it is not surprising but does imply that, for each new application, an extensive feature-finding trial should be performed, analogous to dose finding in a new pharmaceutical agent.
Our data set was carefully selected to maximize the signal. With less careful cell selection, noise levels will be higher and may introduce bias. How the various classifiers perform under varying conditions has been much studied, and if one message is clear, it is that no one method is universally better than the others. The results of several methods we tested have been reported for nuclear morphometry data on a variety of tissue types, and many studies report misclassification rates in the 10% to 15% range, not markedly higher than the rates we achieved in an ideal setting with optimal thresholds. The differences between classification of cells and whole tissue samples may be small. Poulin et al. (39), for example, found little difference in classification accuracy between single cell analysis and histometric analysis.
Conclusions
The value of the classifiers that produce an estimated probability for group membership, which can be used to establish a threshold for group assignment, is apparent in their potential for interpretation as surrogate endpoint biomarkers or risk scores. The examples in Fig. 3 establish the extremes for disease versus no disease for a particular function. The function, or algorithm, could be applied to similarly acquired populations of cells from subjects at various levels of known risk or disease progression. We expect that the number of cells with characteristics similar to the most extreme disease class will increase with risk or disease progression, causing the function to shift toward the disease distribution. This has been done with NG (22) but not, to our knowledge, with the LR classifier, which has a lower misclassification rate at the default threshold. Predictions from a good CART model or KD could also be used in this way. Further research is needed to extend the methods comparison with in vivo data, larger data sets, and other cancers; the clustering of methods based on the area under the ROC curve strongly suggests the need for exploration with each new tissue or model. Interindividual variability, which was not a factor in this experiment, is an important component of in vivo data. For many applications, the prevalence of a condition in the population is relevant, and appropriate prior probabilities can be assigned based on expected event rates. In our data, adjusting the prior probability for the cancer cell line to 0.05 in LDA increased the misclassification rate to 16.5%. The development of risk indices and their statistical properties is a necessary step toward incorporating morphometric features into chemoprevention and risk reduction trials.
| Acknowledgments |
|---|
| Footnotes |
|---|
3 R is an open source software environment for statistical computing and graphics, similar to the S system, which was developed at Bell Laboratories by J. Chambers et al. A variety of sophisticated analysis packages can be downloaded from http://www.r-project.org. Analysis packages for S-Plus are available at http://lib.stat.cmu.edu. ![]()
Received 10/ 6/03; revised 12/23/03; accepted 2/12/04.
| References |
|---|
|
|
|---|
-difluoromethylornithine. Cytometry (Commun Clin Cytom) 1999;38:214-23.
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||