Abstract
The treatment effect in a colorectal polyp prevention trial is often evaluated from the colorectal adenoma recurrence status at the end of the trial. However, early colonoscopy in some participants complicates estimation of the recurrence rate at the end of the study. Early colonoscopy could be informative of recurrence status and induce informative differential follow-up in the data. In this article, we use midpoint imputation to handle interval-censored observations. We then apply a weighted Kaplan-Meier method to the imputed data to adjust for potential informative differential follow-up while estimating the recurrence rate at the end of the trial. In addition, we modify the weighted Kaplan-Meier method to handle a situation with multiple prognostic covariates by deriving a risk score of recurrence from a working logistic regression model and then using the risk score to define risk groups for weighted Kaplan-Meier estimation. We argue that midpoint imputation will produce an unbiased estimate of the recurrence rate at the end of the trial under the assumption that censoring depends only on the status of early colonoscopy. The method described here is illustrated with an example from a colon polyp prevention study. (Cancer Epidemiol Biomarkers Prev 2009;18(3):712–7)
 current status data
 midpoint imputation
 weighted Kaplan-Meier estimator
Introduction
Colorectal cancer is one of the most common malignancies in the United States, with a projected 148,610 new cases and 68,840 deaths from this disease in 2006 (1). It is believed that most colorectal carcinomas arise from adenomas (2). Hence, most colorectal cancer prevention trials use colorectal adenomas to study preventive agents. In a typical colorectal polyp prevention trial, individuals who had undergone removal of a colorectal adenoma within 6 months before the study are randomly assigned to the treatment group or the matched placebo group and treated for 3 or more years. The treatment effect is then evaluated on the occurrence of newly discovered adenomas by performing colonoscopy at follow-up to remove all new colorectal polyps and then checking whether any of the polyps is adenomatous. The follow-up colonoscopy is scheduled to be done once at the end of the trial (e.g., at 3 years after the start of the intervention) to evaluate the recurrence status. The actual event time for each participant is then known only as occurring either before 3 years or not. A reasonable statistical method for this type of data is logistic regression (Logit), which simply analyzes the binary outcome, i.e., recurrence status at the end of the trial.
Because of family history of colorectal cancer and other health conditions, some participants could have their only follow-up colonoscopy before 3 years (the scheduled time) or even have more than one follow-up colonoscopy. A follow-up colonoscopy done before 3 years is considered an “early colonoscopy.” As a result, the participants with early colonoscopy could have differential follow-up lengths in contrast to those participants who adhered strictly to the schedule of follow-up colonoscopy. Due to differential follow-up, the recurrent adenoma data can be considered as current status data (“case 1” interval-censored data), in which each participant is observed only once and the recurrence time is known only to be either less than or greater than the observation time. In addition to interval-censored observations, there could be right-censored observations as well if a participant did not have any recurrent adenomas at any follow-up colonoscopy. Logistic regression could produce a biased estimate of the recurrence rate at the end of the study when right-censored observations occur before the end of the study (3).
A potential estimator is the nonparametric maximum likelihood estimator (NPMLE; refs. 4, 5), which is often used to analyze interval-censored data. The NPMLE requires a computationally intensive algorithm to obtain measures of uncertainty. Our previous work modified Logit using a weight function, a function of follow-up length, to propose a less computationally intensive approach to account for censoring due to differential follow-up, while still estimating the recurrence rate at the end of the trial (3). Both the weighted Logit and the NPMLE approach assume noninformative differential follow-up, i.e., the reasons a participant had early colonoscopy are not associated with risk of recurrence. However, often the reasons (e.g., family history of colorectal cancer and previous polyp history) are associated with the risk of recurrence and, therefore, induce informative differential follow-up. When the early colonoscopy is informative of risk of recurrence, a method such as the weighted Logit approach, which simply treats informative differential follow-up as noninformative, could produce a biased estimate of the recurrence rate at the end of the trial.
In this article, we assume that every participant with early colonoscopy had the same distribution of follow-up colonoscopy and propose to use midpoint imputation to simplify interval-censored data to right-censored data. We then combine this with the weighted Kaplan-Meier (WKM) estimator (6, 7), which incorporates prognostic factors into survival analysis, to adjust for informative differential follow-up when estimating the recurrence rate at the end of the trial. The midpoint imputation approach is attractive because it does not require a distribution for the imputation. Such a distribution would be hard to estimate in a typical colorectal polyp prevention trial because often over 50% of the participants have their only follow-up colonoscopy at the end of the trial, and those participants provide very little information about their actual time of recurrence. Although it has been shown that midpoint imputation could produce biased survival estimates (8–10), we will show that it will not produce a biased estimate when the primary interest is in estimating the recurrence rate at the end of the trial. In addition, we modify the WKM method, which is limited to a single categorical prognostic factor, to handle multiple prognostic factors. We propose to derive a risk score of recurrence from a working Logit model with all prognostic factors as covariates and then categorize the risk score to perform the WKM method. The risk score summarizes the complex structure of prognostic factors into a scalar. In this article, we focus on estimating the recurrence rate at the end of the trial and are interested in comparing the performance of the WKM method derived from the midpoint-imputed data with the NPMLE under informative early colonoscopy. In addition, logistic regression, which is often used for this type of study, and weighted logistic regression (WLogit) will also be explored.
This article is organized as follows: In the methods section, we show the theoretical properties of midpoint imputation in estimating the recurrence rate at the end of the trial, review and describe the WKM estimator derived from midpoint-imputed data for interval-censored data, and modify the WKM estimator to handle a situation with multiple prognostic covariates. In the application section, we apply the WKM method to a data set from a ursodeoxycholic acid colorectal polyp prevention (UDCA) study. In the discussion section, we discuss the performance and limitations of the proposed WKM estimator for estimating the recurrence rate at the end of a colon polyp prevention trial.
Materials and Methods
Notation
Let X denote the time to first occurrence of a polyp after randomization, T_{k} denote the kth follow-up colonoscopy time, where k = 1,…,K, τ (e.g., 3 y) denote the maximum follow-up time, M denote an early colonoscopy indicator variable showing whether the first follow-up colonoscopy was done before the end of the trial (i.e., I(T_{1}<τ)), and Z denote a baseline covariate. We only know that X lies in some interval (L, R), where L<X<R. Right censoring is equivalent to R = ∞. Let (L_{i}, R_{i}) denote the observable random interval, (l_{i}, r_{i}) denote the observed time interval, and Δ_{i} = I(R_{i}< ∞) denote the recurrence indicator for each subject. Suppose there are n participants in a study. The observed data are thus O = (l_{1}, r_{1}, δ_{1}, m_{1}, z_{1}),…,(l_{n}, r_{n}, δ_{n}, m_{n}, z_{n}). We assume that these n subjects are an independent random sample. Each participant could have only one follow-up colonoscopy (i.e., K = 1) or more than one follow-up colonoscopy (i.e., K>1). For a participant with only one follow-up colonoscopy, the recurrence time is either right-censored at L = T_{1} or interval-censored into an interval (0, R = T_{1}), where T_{1} = τ if M = 0 and T_{1}<τ if M = 1. Of the participants with multiple colonoscopies, some could have their final follow-up colonoscopy at the end of the trial (i.e., T_{K} = τ). Therefore, the recurrence time is either right-censored at L = T_{K} (last follow-up colonoscopy time) or interval-censored into an interval (L, R), where L<X<R = T_{K} ≤ τ.
For participant i with an interval-censored recurrence time, i.e., (l_{i}, r_{i}), where r_{i} ≤ τ, midpoint imputation is used to impute time to recurrence by (l_{i}+r_{i})/2, the midpoint of (l_{i}, r_{i}). For participant j with a right-censored recurrence time, i.e., r_{j} = ∞, time to recurrence is treated as right-censored at l_{j}, where l_{j}≤τ. Let X^{*} denote the observed time to recurrence derived from midpoint imputation. That is, X^{*} = l if δ = 0 and X^{*} = (l+r)/2 if δ = 1. The KM and WKM estimates can then be derived from the imputed data set.
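The imputation rule above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `midpoint_impute` and the example intervals are hypothetical, and right censoring is encoded as r = ∞, as in the notation section.

```python
import math

def midpoint_impute(l, r):
    """Return (time, event) for one participant's observed interval (l, r).

    r = math.inf marks a right-censored record: the time is l and event = 0.
    Otherwise the recurrence time is imputed as the midpoint and event = 1.
    """
    if math.isinf(r):
        return l, 0
    return (l + r) / 2.0, 1

# Hypothetical intervals in years, with tau = 3 (not taken from any real trial).
intervals = [(0.0, 3.0), (3.0, math.inf), (1.0, 2.5), (1.5, math.inf)]
imputed = [midpoint_impute(l, r) for l, r in intervals]
print(imputed)
```

Each record becomes an ordinary right-censored observation, so standard KM software could be applied to the imputed pairs.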
Properties of Midpoint Imputation
The survival rate at the end of the trial based on the imputed data set can be written as
Under the condition that some of the participants with multiple colonoscopies could have their final follow-up colonoscopy at the end of the trial (i.e., P(T_{K} = τ | M = 1, K>1)>0), the third equality holds because midpoint imputation only affects S(x), where x<τ. Therefore, under the assumption that differential censoring depends only on the status of early colonoscopy, i.e., M, the imputed data X^{*} can be used to give an unbiased estimate of the recurrence rate at the end of the trial. This provides a theoretical foundation for using midpoint-imputed data to replace the interval-censored data and can be generalized to a situation in which censoring depends on more than one covariate, when the main interest is in estimating the recurrence rate at the end of the trial.
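One chain of equalities consistent with this argument is sketched below. It is an assumption offered for orientation, not necessarily the authors' exact display: it conditions on M, and its third equality replaces X^{*} by X because imputation moves event times only within [0, τ).

```latex
\begin{align*}
S^{*}(\tau) &= P(X^{*} > \tau) \\
            &= \sum_{m=0}^{1} P(X^{*} > \tau \mid M = m)\, P(M = m) \\
            &= \sum_{m=0}^{1} P(X > \tau \mid M = m)\, P(M = m) \\
            &= P(X > \tau) = S(\tau).
\end{align*}
```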
WKM Estimator
For illustration, we assume Z is a categorical covariate and takes on values 1,…,K. The survival function derived from the imputed data can be written as S^{*}(t) = Σ_{k=1}^{K} θ_{k}S^{*}_{k}(t), where θ_{k} is the probability a subject has covariate value k (k = 1,…,K) and S^{*}_{k}(t) is the probability of survival conditional on having covariate value k. Based on the above expression, the WKM estimator (6, 7) is defined as WKM(t) = Σ_{k=1}^{K} θ̂_{k}Ŝ^{*}_{k}(t), where Ŝ^{*}_{k}(t) is the KM estimator among those with covariate value k, n_{k} is the number of subjects with covariate value k, and θ̂_{k} = n_{k}/n. The recurrence rate at time t is then equal to 1 − WKM(t). The associated variance is equal to the sum of the weighted averages of within-variation and between-variation (see below), where λ(.) and H(.) are hazard and cumulative hazard functions, respectively. The first term of the variance can be easily estimated by calculating the weighted average of the variances derived from Greenwood's formula for those K groups, and the second term can be estimated by plugging in estimates of each component (7).
In a situation with early colonoscopy, regarding M as the only covariate, the WKM estimator (denoted as WKM^{C}) at the end of the trial can be expressed as WKM^{C}(τ) = (n_{e}/n)Ŝ_{e}^{*}(τ) + (n_{ne}/n)Ŝ_{ne}(τ), where Ŝ_{e}^{*} and Ŝ_{ne} are the KM estimators for the participants with early colonoscopy and without early colonoscopy, respectively, n_{e} and n_{ne} are the numbers of participants with early colonoscopy and without early colonoscopy, respectively, and n = n_{e} + n_{ne}. Because all of the participants without early colonoscopy had their only follow-up colonoscopy at the end of the trial, Ŝ_{ne}(τ) reduces to 1 − p̂_{ne}, where p̂_{ne} is the sample proportion of recurrence among those participants. All of the participants with early colonoscopy had their first follow-up colonoscopy before τ. If the largest observed time among those participants was censored, we propose to complete the tail of Ŝ_{e}^{*} by an exponential curve to estimate S_{e}^{*} (11).
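A minimal sketch of the stratified estimator described above follows, written in plain Python. The helper names `km_survival` and `wkm` and the toy data are hypothetical, and the exponential tail completion (11) is deliberately omitted, so the sketch applies only when each stratum is followed to t.

```python
def km_survival(times, events, t):
    """Kaplan-Meier estimate of S(t); event = 1 marks an (imputed) recurrence."""
    s = 1.0
    event_times = sorted({x for x, d in zip(times, events) if d == 1 and x <= t})
    for u in event_times:
        at_risk = sum(1 for x in times if x >= u)
        failed = sum(1 for x, d in zip(times, events) if x == u and d == 1)
        s *= 1.0 - failed / at_risk
    return s

def wkm(strata, t):
    """Weighted KM: sum over strata of (n_k / n) * KM_k(t)."""
    n = sum(len(times) for times, _ in strata)
    return sum(len(times) / n * km_survival(times, events, t)
               for times, events in strata)

# Hypothetical midpoint-imputed data in years, tau = 3.
early = ([0.5, 1.0, 2.25, 3.0], [1, 0, 1, 0])          # early colonoscopy
not_early = ([1.5, 1.5, 3.0, 3.0, 3.0, 3.0], [1, 1, 0, 0, 0, 0])
survival_at_tau = wkm([early, not_early], 3.0)
recurrence_rate = 1.0 - survival_at_tau
```

In the stratum without early colonoscopy, every recurrence is imputed at τ/2 and every non-recurrence is censored at τ, so the KM estimate at τ equals one minus the sample proportion of recurrence, matching the reduction noted in the text.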
WKM with Multiple Covariates
In addition to early colonoscopy, often there is more than one covariate associated with the risk of recurrence in a colorectal polyp prevention trial. They could be either categorical or continuous. Let Z = (z_{1},…,z_{p}) denote the p covariates associated with risk of recurrence. The WKM method, because it requires a single categorical covariate, cannot directly incorporate those p covariates into estimation. To use the information from the covariates to improve the marginal survival estimate, Hsu et al. (12) considered a situation of possibly multiple time-independent or time-dependent continuous covariates and proposed deriving risk scores. These risk scores summarize the associations between the covariates and the failure and censoring times, from two working proportional hazards models, one for the failure time and one for the censoring time. By incorporating predictive covariates into survival analysis, one can both increase efficiency and reduce bias due to dependent censoring in the estimate of the marginal survival distribution.
In this article, we adapt and modify the ideas in Hsu et al. (12) to incorporate multiple covariates into the WKM method. We propose to fit a working Logit model for recurrence of the form logit[Pr(Δ = 1 | Z)] = Zβ to reduce the covariates to a risk score, which provides an indicator of an individual's risk of recurrence. We propose to fit one working model for the participants with early colonoscopy and one for the participants without early colonoscopy because we believe they might have a different association between the covariates and risk of recurrence. The risk scores are then defined as RŜ_{e} = Zβ̂_{e} for the participants with early colonoscopy and RŜ_{ne} = Zβ̂_{ne} for the participants without early colonoscopy, where β̂_{e} and β̂_{ne} denote the estimates of the regression coefficients for the Logit models. The risk scores will be continuous and can be categorized into groups based on dichotomization or quartiles. The WKM estimator can then be easily derived based on the categorized groups of the risk scores for participants with and without early colonoscopy (denoted as WKM_{e} and WKM_{ne}, respectively). The WKM method using both early colonoscopy and prognostic covariates of recurrence is denoted as WKM^{R+C}. Note that if there is only one prognostic covariate, then there is no need to fit the working Logit model.
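As a rough illustration of the last step, the sketch below turns risk scores into two risk groups within one stratum. It assumes the working Logit model has already been fitted elsewhere: the coefficient vector `beta_hat`, the covariate coding, the median cut point, and the function names are all hypothetical choices, not values or rules taken from the study.

```python
import statistics

def risk_scores(covariates, beta):
    """Linear predictor Z * beta for each participant (intercept omitted)."""
    return [sum(z * b for z, b in zip(row, beta)) for row in covariates]

def dichotomize(scores):
    """Label a score 'high' if it exceeds the median score, else 'low'."""
    cut = statistics.median(scores)
    return ["high" if s > cut else "low" for s in scores]

# Hypothetical covariates: age in decades, male indicator, previous polyp indicator.
Z = [[6.0, 1, 0], [7.2, 0, 1], [5.5, 0, 0], [6.8, 1, 1]]
beta_hat = [0.3, 0.4, 0.5]  # assumed coefficients, not estimates from any data
scores = risk_scores(Z, beta_hat)
groups = dichotomize(scores)
print(groups)
```

The resulting group labels would then play the role of the single categorical covariate in the WKM estimator, separately for participants with and without early colonoscopy.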
Both WKM^{C} and WKM^{R+C} treat the status of early colonoscopy, i.e., M, as a baseline covariate rather than a post hoc variable and might underestimate the variabilities of estimators of recurrence rate. In addition, midpoint imputation simplifies the complexity of the interval-censored data and Greenwood's variance formula is approximate and known to underestimate the variance, especially with heavy censoring and in the right tail of the survival distribution. Hence, we propose to use the bootstrap technique to estimate the SEs of the estimators for WKM^{C} and WKM^{R+C}.
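A generic nonparametric bootstrap of the kind proposed here might look like the following sketch. The function `bootstrap_se`, the number of replicates, and the toy outcomes are assumptions for illustration; a simple proportion stands in for the WKM estimators, which would be passed in as `estimator` in practice.

```python
import random
import statistics

def bootstrap_se(data, estimator, n_boot=500, seed=0):
    """Bootstrap SE: resample participants with replacement, re-estimate,
    and take the standard deviation of the replicate estimates."""
    rng = random.Random(seed)
    n = len(data)
    estimates = []
    for _ in range(n_boot):
        sample = [data[rng.randrange(n)] for _ in range(n)]
        estimates.append(estimator(sample))
    return statistics.stdev(estimates)

# Hypothetical binary recurrence outcomes (40 recurrences among 100 participants).
outcomes = [1] * 40 + [0] * 60
se = bootstrap_se(outcomes, lambda s: sum(s) / len(s))
print(round(se, 3))
```

Because whole participant records are resampled, the early colonoscopy status M would be redrawn in each replicate along with the outcome, so the stratum sizes vary across replicates instead of being treated as fixed.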
Application to UDCA Data
In 1996, the Arizona Cancer Center initiated a multicenter trial to determine whether UDCA can prevent the recurrence of colorectal adenomas (13). A total of 1,285 subjects with identified colorectal adenomas at the qualifying examination were recruited and randomly assigned to one of the two treatment groups, placebo and UDCA (8–10 mg/kg/d). Of the 1,285 subjects, a total of 1,192 underwent at least one follow-up colonoscopy and were thus considered for the end point analysis, 579 in the placebo group and 613 in the UDCA group. For each of the 1,192 subjects, recurrence status was measured, as well as baseline covariates such as age (mean, 66.2; SD, 8.5), gender (67.4% male), body mass index (mean, 27.4; SD, 4.6), family history of colorectal cancer (27.4% with family history of colorectal cancer), and previous polyp history (before the qualifying examination; 47.3% with previous polyp history). According to the baseline covariates, on average, the UDCA participants were slightly overweight and had a higher risk of recurrence compared with the general population.
Initially, the follow-up colonoscopy was planned to be done only once, no earlier than 6 mo before the 3-y anniversary date after randomization (i.e., 30 mo). However, some participants underwent their follow-up colonoscopy before the planned time. The number of participants who had early colonoscopy was 233 (40.2%) in the placebo group and 260 (42.4%) in the UDCA group. Some of those participants had multiple follow-up colonoscopies. Table 1 displays the frequency and recurrence results of follow-up colonoscopy for the participants with early colonoscopy and indicates that the participants with multiple follow-up colonoscopies tended to have their first follow-up colonoscopy earlier than those who had only one follow-up colonoscopy. Of the participants with multiple follow-up colonoscopies (n = 327), 297 (90.8%) had at least one follow-up colonoscopy at least 30 mo after randomization and 138 (42.2%) had at least one recurrent adenoma at their first follow-up colonoscopy. Based on Table 1, a participant could have recurrent adenomas at the first colonoscopy and no recurrent adenomas at the second colonoscopy. This is because at each colonoscopy the participant's colorectal polyps were removed and tested to see whether any of them was adenomatous. Instead of fixing the end of the trial exactly at 3 y, for each participant, the actual time of the colonoscopy was used to define the interval of time to first recurrence. The midpoint imputation method was then applied to the interval-censored observations.
Table 2 explores the covariates associated with having early colonoscopy and risk of recurrence. According to the table, early colonoscopy is highly associated with risk of recurrence and marginally associated with gender (male). The participants with early colonoscopy had a significantly higher recurrence rate (49.3%) compared with the participants without any early colonoscopy (37.5%) with an odds ratio of 1.621 and a 95% confidence interval (CI) of 1.283 to 2.048. This indicates informative early colonoscopy for the UDCA study. Age, body mass index, gender, early colonoscopy, and previous polyp history are significantly associated with risk of recurrence. In this article, we calculate the WKM^{C} estimator using early colonoscopy as the only covariate and the WKM^{R+C} estimator using both early colonoscopy and a prognostic categorical covariate derived from a linear combination (risk scores) of the covariates (age, gender, body mass index, and previous polyp history), which are associated with risk of recurrence. The risk scores of recurrence are obtained by fitting Logit models for the participants with and without early colonoscopy separately. Each risk score is then dichotomized into two groups (low versus high) to calculate the WKM estimate. We repeated the analyses using four groups instead of two and it gave similar results.
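As a quick arithmetic check, the odds ratio for recurrence can be approximately recovered from the two rounded recurrence rates quoted above. This is a sketch only; the reported 1.621 was presumably computed from the underlying counts, so small rounding differences are expected.

```latex
\[
\mathrm{OR} \;=\; \frac{0.493 / (1 - 0.493)}{0.375 / (1 - 0.375)}
           \;=\; \frac{0.9724}{0.6000}
           \;\approx\; 1.62 .
\]
```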
In this article, we are interested in estimating the recurrence rate of adenomas at 3 years for both the placebo and UDCA groups based on the UDCA study protocol. A sample proportion of recurrence (Logit), WLogit (with an exponential weight function truncated at 3 y; ref. 3), NPMLE, WKM^{C}, and WKM^{R+C} methods are calculated from the data. The results are provided in Table 3. Logit produces a lower recurrence rate compared with the NPMLE and WKM methods for both placebo and UDCA groups. This supports our previous findings (3). WLogit produces a higher recurrence rate compared with the other methods (Logit, NPMLE, and WKM) for both placebo and UDCA groups. The WKM^{C} method, which incorporates early colonoscopy directly into the analysis to adjust for informative differential follow-up, produces a slightly higher recurrence rate for the placebo group and a similar recurrence rate for the UDCA group compared with the NPMLE method. This results in a lower odds ratio of 0.747 with a 95% CI of 0.737 to 0.992. The WKM^{R+C} method, which incorporates both early colonoscopy and prognostic covariates into the analysis, produces a much higher recurrence rate for the placebo group and a similar recurrence rate for the UDCA group compared with the NPMLE method. As a result, the WKM^{R+C} method yields the lowest odds ratio, 0.725 with a 95% CI of 0.541 to 0.981, compared with the other methods. The 95% CIs for both WKM^{C} and WKM^{R+C} do not cover one, indicating that UDCA is associated with a lower risk of recurrence, in contrast to the results from the NPMLE, Logit, and WLogit methods. In addition, we observe a counterintuitive phenomenon: the WKM^{R+C} method has a slightly higher estimate of SE compared with the WKM^{C} method. This could be because one of the covariates used in the WKM^{R+C} method (previous polyp history) has missing observations, so that a smaller data set (only the nonmissing data) is used for the WKM^{R+C} method than for the WKM^{C} method, or because the WKM^{R+C} method gives a more accurate estimate of SE than the WKM^{C} method. In summary, the WKM method could provide an adjustment for informative differential follow-up due to early colonoscopy. We also perform simulations to investigate the properties of the WKM methods. The simulation results will be published as supplementary material, including Supplementary Tables S1 and S2.
Discussion
The research in this article uses midpoint imputation to handle interval-censored observations and then combines it with a WKM approach that adjusts for informative early colonoscopy through the status of early colonoscopy and gains efficiency by incorporating prognostic covariates of recurrence, when estimating the recurrence rate at the end of a colorectal polyp prevention trial. This approach can handle a situation with multiple prognostic covariates by deriving a risk score from a Logit model. Although the idea of this approach might seem simple, the results based on simulations (not provided here) do show that the WKM approach can provide a reasonable recurrence rate estimate under informative differential follow-up and can gain efficiency when prognostic covariates exist. In contrast, conventional statistical methods such as Logit, which simply ignores differential follow-up, and the WLogit and NPMLE methods, which depend on the assumption of noninformative differential follow-up, could produce biased estimates. Hence, a method that does not incorporate informative differential follow-up into estimation of the recurrence rate could produce misleading conclusions, as indicated in the data analysis section.
In this article, we treat the early colonoscopy status as known at baseline and use it directly to define risk groups for the WKM estimator to adjust for informative differential follow-up while estimating the recurrence rate. This might seem unrealistic and might underestimate the variability of the WKM estimator. However, based on the guidelines for screening colorectal cancer (14), the chance of a participant having early colonoscopy during follow-up was very likely to be determined by his/her family history of colorectal cancer, previous polyp history, and baseline polyp characteristics (e.g., size ≥1 cm), which were known at baseline. Hence, treating the early colonoscopy status as known at baseline might not be unrealistic. In addition, the risk score approach in this article can be generalized to a situation in which the early colonoscopy status and time to the first follow-up colonoscopy are treated as end points and known to be associated with some baseline covariates (e.g., family history, previous polyp history, and baseline polyp characteristics). In that case, a working proportional hazards model can be fitted to the time to the first follow-up colonoscopy to derive a risk score that summarizes the association between time to the first follow-up colonoscopy and the covariates. This risk score and the risk score derived from the working Logit model for recurrence can then be jointly used to define the risk groups for the WKM estimator, instead of using the status of early colonoscopy directly to define the risk groups. Although the approach is motivated by data from colorectal polyp prevention trials, the WKM approach can also be generalized to data from other types of clinical studies in which each participant is scheduled to be followed only for a fixed standard time period, instead of being monitored regularly throughout the study, and the main interest is in estimating the event rate at the end of the study.
The performance of midpoint imputation for interval-censored observations depends heavily on the lengths of the intervals, and the method might produce biased survival estimates and misleading results, especially at early time points. However, in this article, we focus on estimating the recurrence rate at the end of the trial and have shown that midpoint imputation will not produce a biased estimate of the recurrence rate at the end of the trial under the assumption that censoring depends only on the status of early colonoscopy.
The recurrence rate at the end of the study corresponds to the tail of the survival curve. It is well recognized that the survival estimate in the tail is often unstable. Hence, one might feel that survival analysis techniques are not good choices for estimating the recurrence rate at the end of the study in this setting. However, in a colorectal polyp prevention trial, often over 50% of participants have their only follow-up colonoscopy at the end of the trial. Those participants are either interval-censored or right-censored at the end of the study. For those participants, midpoint imputation is less likely to contribute additional variation to the estimate of the recurrence rate at the end of the study: the information they provide toward that estimate simply reduces to a binary outcome, and their follow-up lengths provide little information about the actual recurrence time. We suspect this might stabilize the tail problem.
In a situation with multiple covariates, we fit a working Logit model for recurrence to reduce multiple covariates into a scalar. The model is only used as a convenience in calculating the risk scores to create a categorical variable, which is predictive of risk of recurrence, to implement the WKM method. More sophisticated and computationally intensive approaches for fitting the working model could be used, such as a proportional hazards model for interval-censored data, but we suspect that would not lead to a significant reduction in the bias, which is the major concern under a situation with informative differential follow-up for the WKM method. In addition, parametric assumptions connected with the statistical model are only used to define the risk scores. As a result, the reliance on the Logit model is weaker for the WKM approach. However, the performance of the WKM method using predictive covariates of recurrence to improve efficiency in estimation of the recurrence rate in an informative differential follow-up situation will depend on the strength of the association between these covariates and recurrence.
The research in this article assumes that every participant with early colonoscopy had the same distribution of follow-up colonoscopy. However, the distribution of follow-up colonoscopy might depend on a participant's health condition or family history of colorectal cancer. In addition, the research mainly focuses on estimating the recurrence rate at the end of the study. Often one of the main interests is in testing the prevention effect. Future research can focus on developing approaches that can handle complex distributional assumptions for the follow-up colonoscopy and perform two-sample tests as well.
Disclosure of Potential Conflicts of Interest
No potential conflicts of interest were disclosed.
Acknowledgments
The costs of publication of this article were defrayed in part by the payment of page charges. This article must therefore be hereby marked advertisement in accordance with 18 U.S.C. Section 1734 solely to indicate this fact.
Footnotes

Grant support: NIH (CA23074; CA41108) and American Cancer Society (IRG7400128).

Note: Supplementary data for this article are available at Cancer Epidemiology Biomarkers and Prevention Online (http://cebp.aacrjournals.org/).

This original work has not been presented or submitted elsewhere.
 Accepted December 22, 2008.
 Received September 16, 2008.
 Revision received December 8, 2008.