Abstract
Many approaches have been taken to adjust for smoking in modeling cancer risk. In case-control studies, these metrics are often chosen arbitrarily rather than on the basis of their properties in the context of the study. Depending on the underlying study design, hypotheses, and base population, different metrics may be most appropriate. We present our approach to evaluating different smoking metrics. We examine the properties of a new metric, “logcig-years”, that we initially derived from a biological model of DNA adduct formation. We compare this metric to three other smoking metrics, namely pack-years, square-root pack-years, and a model in which smoking duration and intensity are separate variables. Our comparisons use generalized additive models and logistic regression to examine the relationship between the logit probability of cancer and each of the metrics, adjusting for other covariates. All models were fit using data from a lung cancer study of 1,275 cases and 1,269 controls that has focused on gene-smoking relationships. There was a highly significant, linear relationship between logcig-years and the logit probability of lung cancer in this sample, without any need to adjust for smoking status. These properties together were not shared by the other metrics. In this sample, logcig-years captured more information about smoking that is important in lung cancer risk than the other metrics. In conclusion, we provide a general framework for evaluating different smoking metrics in studies where smoking is a critical variable.
 lung cancer
 smoking
 dose metric
 biological model
Introduction
The nature of the relationship between smoking and lung cancer as estimated from a statistical model depends in large part on how “smoking” itself is coded. Coding methods include using indicator variables to differentiate between current, former, and never smokers, using pack-year categories, or using a continuous variable such as pack-years itself or its constituent factors.
Models that use categories assume that, conditional on other model covariates, the risk of lung cancer within a category is constant. Categorizing a continuous variable does not make full use of all the available data (1) and the choice of cutpoints between categories may influence the estimated smoking-lung cancer risk relationship (2). Furthermore, if the underlying variable used to define the categories is measured with error, then the categorization may create nondifferential measurement error because a value close to the cutpoint is more likely to be misclassified than a value in the midrange of the category (3).
Continuous smoking metrics used in the literature include pack-years, the square root of pack-years (4-7), and smoking duration and intensity as separate variables (8-10). Previous reports have found nonlinear relationships between pack-years and lung cancer (11, 12). In our own lung cancer case-control sample, an approximately linear relationship between square-root pack-years and lung cancer risk was found, but indicator variables to distinguish between current, former, and never smokers were necessary for improved model performance (4). The multistage model of carcinogenesis (13-15) has motivated several authors to separate smoking duration and intensity in modeling lung cancer risk. However, when never smokers are included in the model, relative risks associated with duration for a fixed intensity and vice versa are difficult to interpret (16), because duration and intensity are always zero for never smokers. Other continuous variables that may be important include age of smoking initiation and years since smoking cessation; however, these variables, together with smoking duration, are highly collinear with age, another variable commonly included in cancer risk models.
In addition to these issues, some studies appropriately limit their populations to current smokers (8) or ever smokers (17), or use separate models for current and former smokers (10). Such analyses require defining cutpoints in smoking duration, timing, and/or intensity to define these samples. Not only does the choice of cutpoints determine which subjects are excluded, but studies also differ in their choice of cutpoints, which can ultimately affect results (12).
In many circumstances, different smoking metrics may provide reasonably similar results such that the choice of metric is not critical. However, when smoking becomes integral to the study hypothesis, as is the case with gene-smoking analyses (4-7, 10, 18-20), it may be important to compare how different metrics perform within the study population. The primary aim of this article is to provide a general approach for evaluating the performance of different metrics, using a concrete example of how this approach can be applied in a specific study. Our comparison includes a new metric that we call “logcig-years”, which we define to be log(cigarettes smoked per day + 1) × years of smoking. We compare the performance of four different metrics using data from a large lung cancer case-control study, and also explore how the performance of these metrics compares to results from Doll and Peto's model of smoking and cancer (14). Throughout this article, we use “log” to mean the natural logarithm. We define cigpday as cigarettes smoked per day, logcigp as log(cigpday + 1), cigtime as years of smoking, yrsquit as years since smoking cessation, and agestart as age of smoking initiation.
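The definitions above can be made concrete with a short sketch. It is written in Python purely for illustration and is not the software used for the analyses; the function name `smoking_metrics` is hypothetical, and the conversion to pack-years assumes the conventional 20 cigarettes per pack, which the text does not state explicitly.

```python
import math

CIGS_PER_PACK = 20  # assumption: conventional pack size, not stated in the text


def smoking_metrics(cigpday, cigtime):
    """Return the continuous smoking metrics compared in this article.

    cigpday: cigarettes smoked per day (0 for never smokers)
    cigtime: years of smoking (0 for never smokers)
    """
    packyears = (cigpday / CIGS_PER_PACK) * cigtime
    logcigp = math.log(cigpday + 1)  # natural logarithm
    return {
        "packyears": packyears,
        "sqrt_packyears": math.sqrt(packyears),
        "logcigyears": logcigp * cigtime,
        # the two-metrics model keeps duration and intensity separate
        "two_metrics": (cigtime, logcigp),
    }


never = smoking_metrics(0, 0)    # hypothetical never smoker
smoker = smoking_metrics(20, 30)  # hypothetical smoker: 20 per day for 30 years
print(never["logcigyears"], smoker["packyears"], smoker["logcigyears"])
```

Because one is added to cigpday before the logarithm is taken, every metric in this sketch equals zero for a never smoker.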
Materials and Methods
Study Population
The data used in this analysis were derived from a case-control study approved by the Human Subjects Committees of Harvard School of Public Health and Massachusetts General Hospital. Details of the study design have been described previously (11). Briefly, the sample consists of histologically confirmed, newly diagnosed lung cancer patients presenting at Massachusetts General Hospital between December 1992 and September 2000. Controls were friends or non–blood-related family members of the cases, and were not specifically matched to cases. When the above potential controls were not available, controls were recruited from friends or family members of non–lung cancer patients.
As we have done previously, in this article we included data from all Caucasians with complete data on age, gender, smoking status, cigpday and cigtime (for ever smokers), and yrsquit (for former smokers).
Motivation of the Logcig-Years Metric
The logcig-years metric was derived in part from a model relating smoking to DNA adducts. The formation of DNA adducts from polycyclic aromatic hydrocarbons (such as benzo[a]pyrene) in tobacco smoke is widely believed to be on the causal pathway from smoking to lung cancer (21-24). Given certain assumptions, it follows from the solution to a set of differential equations relating adducts to smoking that the logarithm of the number of DNA adducts can be modeled as an additive function of the logarithm of smoking intensity. The logarithmic transformation of adduct numbers, although not universal, is fairly standard both in models relating smoking and adducts (25, 26) and in models relating adducts to lung cancer risk (27). Because adduct formation is believed to be on the causal pathway to lung cancer, one could model the probability of cancer initiation as a function of the number of adducts, on some scale. If the logarithmic transformation of smoking intensity is useful for a model of the logarithm of DNA adducts, and if the cumulative log(adduct) burden is directly related to cancer risk, this suggests that cumulative log(smoking intensity) may be a useful smoking metric. Pack-years, which is cumulative smoking intensity on the untransformed scale, is widely used but does not necessarily represent the best way to combine smoking intensity and duration into a single cumulative metric. The logcig-years metric is one alternative to pack-years, and is also a cumulative smoking metric.
Like other simple smoking metrics, the logcig-years metric does not take into account all of the many steps that occur between cancer initiation and tumor detection. These steps may depend on factors such as the age at which an individual started smoking and the age at which the individual stopped smoking (if ever). In this article, we do not attempt to model the process of carcinogenesis or to better understand the true complexity of how smoking leads to cancer development. For this, we refer the interested reader to articles on the multistage model of carcinogenesis (13-15) and related articles (28-32). Instead, our goal in this report is simply to compare the performance of the logcig-years metric with more standard smoking metrics.
Statistical Analyses
We examined the relationship between the logit probability of cancer and each of four continuous smoking metrics separately, using first generalized additive models (GAM; ref. 33), and then logistic regression. The smoking metrics we considered were pack-years, square-root pack-years, logcig-years, and the “two-metrics” model in which smoking duration and intensity were separate metrics in the same model. In the two-metrics model, we used cigtime as the duration variable, and logcigp as the intensity variable. This transformation of smoking intensity was chosen in part because of the nonlinearity between the logit probability of lung cancer and the untransformed smoking intensity observed here (data not shown) and in a previous article using data from this study (10). This nonlinearity has also been noted by Rachet et al. (9), who used GAM to develop models relating smoking to lung cancer risk in a case-control study that used duration of smoking and smoking intensity as separate variables.
GAM is a powerful statistical tool that extends the generalized linear models framework to allow the shape of the relationship between the outcome and each continuous variable to be an arbitrary smooth function with the shape determined by the data. GAM was used to examine the nature of the relationship between cancer risk and each smoking metric separately, in a model that adjusted for age, years since quitting smoking (defined here and in other reports to be zero for never smokers; refs. 4-7), smoking status (as two indicator variables to distinguish between never, former, and current smokers), and gender. Each continuous variable was allowed to have a possibly nonlinear effect on cancer risk. Specifically, the GAM models we fit to ever and never smokers together are of the form
logit[P(cancer)] = β_{0} + s(smoking metric) + s(age) + s(yrsquit) + β_{1} former + β_{2} current + β_{3} gender,
where “smoking metric” is one of the four smoking metrics mentioned above, and former, current, and gender are indicator variables for former smokers, current smokers, and female, respectively. The notation “s(.)” indicates a smooth term that we fit using a smoothing spline with 4 df. In the two-metrics model, s(smoking metric) was replaced by s(cigtime) + s(logcigp).
We also used GAM to examine similar adjusted relationships among smokers only. In the smokers-only model, we can potentially adjust for age of smoking initiation, a variable that is meaningless for never smokers. However, due to the collinearity between this variable, years of smoking, age, and years since smoking cessation, it is not possible to adjust for all these variables in the two-metrics model. Instead, in all smokers-only models, we categorized age of smoking initiation and included an indicator variable for whether or not the smoker started smoking prior to age 18. The value of 18 was chosen to represent the approximate age at which lung development is nearing completion. In the smokers-only models, we did not include the current smoking indicator because the former smoker indicator was sufficient to distinguish between current and former smokers.
All GAM models were fit using the S-Plus software (34, 35). In addition to examining the GAM plots, we tested for nonlinearity between the outcome and each continuous variable using the approximate χ^{2} test for the nonlinear contribution of the nonparametric terms (36), supplied by S-Plus.
Any smoking metric that did not have a significant departure from a linear relationship with the logit probability of cancer in the adjusted model was then considered further in logistic regression models, also fit in S-Plus. Any covariate other than the smoking metric that had a nonlinear relationship with cancer risk was transformed such that the relationship using the transformed variable was approximately linear. The transformed covariate was then used in the logistic regression models. Two logistic regression models were fit using these smoking metrics. In the first logistic regression model (the “full model”), in addition to adjusting for the covariates as described above, we also included an interaction term between smoking status and the smoking metric, to allow the slope relating the smoking metric and cancer risk to differ for current versus former smokers. For the two-metrics model, this meant that we included a pair of interaction terms, one for smoking intensity and one for duration. The second logistic regression model (the “all covariates” model) included all covariates described, but did not include the interaction term(s). The necessity of considering these interactions is motivated by our earlier work (4, 6).
Results
Baseline Characteristics
After excluding non-Caucasians and individuals missing key model covariates, the resulting sample contained 2,544 observations: 1,275 lung cancer cases and 1,269 controls. Among the cases, there were 85 never smokers, 675 former smokers, and 515 current smokers, whereas among the controls there were 445 never smokers, 578 former smokers, and 246 current smokers. Never smokers were defined to have smoked fewer than 100 cigarettes in their lifetime, and former smokers were defined to have quit smoking one or more years ago. The 1,190 ever-smoking cases tended to be heavier smokers than the 824 ever-smoking controls, with mean (SD) pack-years of 59.8 (36.8) and 31.8 (27.2), respectively.
Results from Assessing Linearity Between the Smoking Metrics and Risk, Using GAM
In our sample, the adjusted relationship between pack-years and the logit probability of cancer was significantly nonlinear (P < 0.001 for the nonlinear contribution) both in a model fit using all individuals (i.e., both ever and never smokers; Fig. 1), and in a model fit using only smokers. This indicates that in our sample, pack-years is not appropriate to use as a continuous variable in logistic regression models. In separate models, square-root pack-years and logcig-years were linearly related to the logit probability of cancer, after adjusting for other model covariates (P > 0.10 for the nonlinear contribution), when all individuals were included (Fig. 1), and when only ever smokers were included. The corresponding plots for the smokers-only models were very similar (data not shown).
In the two-metrics model, the adjusted relationships between the logit probability of cancer and both logcigp and cigtime in a model fit using all individuals were approximately linear (see bottom plots in Fig. 1). In the model fit using smokers only, there was weak evidence of nonlinearity between the logit probability of cancer and logcigp (P ≈ 0.07 for the contribution of the nonlinear terms).
In all models just described, the adjusted relationship between the logit probability of cancer and years since quitting smoking was approximately linear. However, in our sample the adjusted relationship between the logit probability of cancer and age was significantly nonlinear in all models (P < 0.001). The corresponding GAM plots indicated that the relationship with age was approximately linear both before and after about age 70, with a change in slope at about age 70 (see Fig. 2). This observed age effect is partly due to the difference in age distribution among cases and controls in this sample.
Results from Modeling Smoking and Lung Cancer Risk, Using Logistic Regression
For the logistic regression models, we focus on three metrics that are linearly related to the logit probability of cancer: square-root pack-years, logcig-years, and the two-metrics model. In the models using square-root pack-years and logcig-years, these smoking metrics were very significant predictors of cancer risk (P < 0.001). In the two-metrics model, logcigp was a very significant predictor (P < 0.001), but cigtime was a significant predictor only in the model using all individuals (P < 0.01).
Due to the nonlinearity associated with the age effect, in all logistic regression models, we adjusted for age using a piecewise linear model, in which we allowed one slope for age <70, and a different slope for age >70, with the constraint that the two line segments join at age 70. In all cases, the slopes before and after age 70 were significantly different from each other (P < 0.001). Gender was not significant in any of the models. We started by considering the “full model”, which includes the interaction between smoking status and the smoking metric, for each of the three remaining smoking metrics. In models using all individuals and in the smokers-only models, the interactions between smoking status and the smoking metric were significantly different from zero for the square-root pack-years models, and for the two-metrics models (in which the interactions were only significant for logcigp but not for cigtime), but not for the logcig-years models.
Next we considered the “all covariates” models that did not include the interactions mentioned above, but adjusted for all remaining covariates. In models fit using all individuals, and ever smokers only, yrsquit was a significant predictor (P < 0.01) in models with square-root pack-years and in the two-metrics model, but was of borderline significance or not significant in the models with logcig-years (P ≈ 0.06 for all individuals, P ≈ 0.64 for smokers only). In the model fit using all individuals, the smoking status indicator variables were significant predictors in the model using square-root pack-years and the two-metrics model, but not in the model using logcig-years. For models fit using smokers only, smoking status was not significant for any of the three metrics. The indicator variable for starting smoking before age 18, only included in smokers-only models, was significantly different from zero only in the model using logcig-years as the smoking metric. The logistic regression models are summarized in Table 1.
The last two columns of Table 1 give the residual deviances of the models, both for an unadjusted model including only the smoking metric (two variables for the two-metrics model), and for the adjusted model which also adjusts for the covariates in the “all covariates” model. The residual deviance can be thought of as a measure of discrepancy of a generalized linear model (37) such as logistic regression, analogous to the sum of squared residuals in a normal linear regression. In the unadjusted models, the residual deviances for the logcig-years models were substantially smaller than for all other comparable models, except for the smokers-only models in which the residual deviance using logcig-years was approximately the same as for the two-metrics model. This suggests that as a single metric, logcig-years explains more of the variability in lung cancer than the other metrics (except possibly the two-metrics model for smokers only). However, when adjusted for the other model covariates, the residual deviances for the logcig-years models were somewhat larger than for the corresponding models using the other metrics. This suggests that in models using all the covariates considered here, models other than the logcig-years model explained more of the variability in lung cancer. When the model requires smoking status indicator variables, the smaller deviance comes with a price of abrupt changes in estimated cancer risk upon changes in smoking status.
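One common way to build such a joined piecewise linear age term is with a hinge at the knot; the sketch below assumes that parameterization, which the text does not specify. The function names and the coefficient values used in the check are hypothetical and are not estimates from this study.

```python
def age_terms(age, knot=70.0):
    """Return (age, excess) for a continuous piecewise linear age effect.

    With coefficients b1 and b2, the age contribution is b1*age + b2*excess,
    so the slope is b1 below the knot and b1 + b2 above it, and the two
    line segments meet at the knot.
    """
    excess = max(age - knot, 0.0)
    return age, excess


def age_effect(age, b1, b2, knot=70.0):
    """Contribution of age to the linear predictor (logit scale)."""
    a, excess = age_terms(age, knot)
    return b1 * a + b2 * excess


# hypothetical coefficients, chosen only to show the change in slope
b1, b2 = 0.10, -0.05
slope_below = age_effect(60, b1, b2) - age_effect(59, b1, b2)
slope_above = age_effect(76, b1, b2) - age_effect(75, b1, b2)
print(slope_below, slope_above)
```

Under this parameterization, testing whether the slopes before and after age 70 differ amounts to testing whether b2 is zero.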
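For readers less familiar with the residual deviance, the following minimal sketch computes it for a model with a binary outcome. The function name and the toy data are invented for illustration and have no connection to the values in Table 1.

```python
import math


def binomial_deviance(y, p):
    """Residual deviance for binary outcomes y (0/1) and fitted probabilities p.

    For ungrouped 0/1 data the saturated model has log-likelihood zero, so the
    deviance reduces to -2 times the fitted model's log-likelihood.
    """
    loglik = 0.0
    for yi, pi in zip(y, p):
        loglik += math.log(pi) if yi == 1 else math.log(1.0 - pi)
    return -2.0 * loglik


# toy data, invented purely for illustration
y = [1, 0, 1, 0]
sharp = [0.9, 0.1, 0.8, 0.2]  # fitted probabilities close to the outcomes
vague = [0.5, 0.5, 0.5, 0.5]  # uninformative fitted probabilities
dev_sharp = binomial_deviance(y, sharp)
dev_vague = binomial_deviance(y, vague)
print(dev_sharp < dev_vague)
```

A smaller deviance indicates fitted probabilities that agree more closely with the observed case and control status, which is the sense in which deviances are compared across metrics here.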
Sensitivity of the Logcig-Years Metric
We explored the sensitivity of the logcig-years metric to the scale on which smoking intensity is measured. Specifically, we considered generalized metrics of the form log(α cigpday + 1) × cigtime, for a range of values of α. We found that the residual deviance of the unadjusted model is smallest for metrics based on values of α between 0.5 and 1.5, but the residual deviance for the adjusted model is smallest for metrics based on values of α < 1, suggesting that a metric based on α between 0.5 and 1 may be somewhat better than the logcig-years metric which uses α = 1. In the adjusted smokers-only model using α = 0.5, smoking status and yrsquit remained statistically insignificant, whereas in the model based on all individuals using α = 0.5, yrsquit and the current smoking indicator both became borderline significant.
We also investigated the sensitivity of the logcig-years metric to adding the constant of one to cigpday before taking the logarithm. For all individuals and separately for smokers only, we fit three additional logistic regression models (and three analogous GAM models) in which logcig-years was replaced with log(cigpday + k) × cigtime, for k = 2, 3, and 4 in turn. For smokers only, we also fit a fourth model in which logcig-years was replaced with log(cigpday) × cigtime. Each model adjusted for the same covariates as the logcig-years model. In all cases, the GAM plot was visually indistinguishable from the GAM plot using logcig-years, neither smoking status nor yrsquit was statistically significant, and the coefficient for the alternative metric continued to be ∼0.02.
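The two families of alternative metrics described in this section can be written as a single function. This is only a sketch: the function name is hypothetical, and combining α and k in one expression is a convenience of the example, since the analyses varied them separately.

```python
import math


def generalized_logcigyears(cigpday, cigtime, alpha=1.0, k=1.0):
    """log(alpha * cigpday + k) * cigtime; alpha = k = 1 gives logcig-years."""
    return math.log(alpha * cigpday + k) * cigtime


# hypothetical smoker: 20 cigarettes per day for 30 years
base = generalized_logcigyears(20, 30)                # logcig-years
half_scale = generalized_logcigyears(20, 30, alpha=0.5)
shifted = [generalized_logcigyears(20, 30, k=k) for k in (2, 3, 4)]
no_constant = generalized_logcigyears(20, 30, k=0.0)  # smokers only
print(base, half_scale, shifted, no_constant)
```

With k = 0 the logarithm is undefined when cigpday is zero, which is why that variant can only be fit to smokers.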
Addressing Possible Confounding by Age
In our sample, the median case age was almost 7 years larger than the median control age. Thus, the observed age effect in this study, as in any case-control study which is not perfectly age-matched, reflects a combination of the direct age effect and the difference in age distribution between cases and controls.
In order to remove the possible confounding with age, we fit the logcig-years model to current, former, and never smokers together, separately within four age strata. Following the example of Flanders et al. (8), we fit separate GAM and logistic regression models within the 10-year age groups 40 to 49, 50 to 59, 60 to 69, and 70 to 79. Each model included covariates of logcig-years, current and former smoking indicator variables, yrsquit, age, and gender. The reason for including age was to allow for a possible age effect within each age group. Age was only significant in the 70 to 79 year group. Logcig-years was statistically significant in all four age strata models (P < 0.005), and the coefficient for logcig-years ranged from 0.014 to 0.023 within age strata. This coefficient was smallest (0.014-0.015) for the 40 to 49 and 60 to 69 age groups, and largest (0.022-0.023) for the 50 to 59 and 70 to 79 age groups. Our results imply reasonable robustness of our metric across age strata.
Addressing Possible Confounding by Age of Smoking Initiation
Among ever smokers, we also explored models in which age of smoking initiation was included as a continuous variable (results not shown). In the two-metrics model, this meant that we were not able to adjust for age, and in this model, larger values of age of smoking initiation and larger values of years since quitting smoking were both associated with increased cancer risk. Under the multistage model of carcinogenesis, the effect of a carcinogen will depend on age of smoking initiation, time since initial exposure, or both, depending on the stage(s) in which the carcinogen has an effect (38). The results just described are consistent with cigarette smoke carcinogens acting on both early and late stage transitions (38), as other studies have suggested. However, the implication that years since smoking cessation is positively related to lung cancer risk is neither biologically reasonable nor consistent with other studies. In these data, age of smoking initiation ranged from 6 to 61 years, with 78 smokers starting at age 30 or greater, including 8 who started smoking after age 50. In the two-metrics model, age of smoking initiation as a continuous variable, years of smoking, and years since quitting smoking together comprise the overall age effect, possibly explaining the apparent positive association between cancer risk and years since quitting smoking in this model.
Our decision to dichotomize age of smoking initiation allows us to also adjust for age in models using each smoking metric. It has been suggested that the lung is most sensitive to the effects of smoking during lung development (26, 39). Dichotomizing age of smoking initiation at age 18 is meant to capture whether smoking started before or after lung development was essentially complete. However, this dichotomization does not capture smoking initiation effects which may be important at a later age, such as cancer promotion in intermediate-stage cancer cells. Individuals who started smoking earlier were on average heavier smokers who smoked longer than those who started smoking later. There was no evidence of an interaction between this indicator variable and logcig-years.
Addressing the Definition of Years Since Quitting Smoking for Never Smokers
We defined yrsquit to be zero for never smokers, yet it could be argued that yrsquit, like agestart, is not meaningful for never smokers. For smokers, the variable age is the sum of agestart, cigtime, and yrsquit. For never smokers, this suggests defining yrsquit to be zero and agestart to be age. In a model that includes never smokers and adjusts for yrsquit (defined to be zero for never smokers), whether or not never smokers are influential in determining the coefficient for yrsquit can be visually assessed by examining the GAM plot for yrsquit. In all models discussed here, the adjusted relationship between yrsquit and the logit probability of cancer for never smokers was consistent with the relationship for ever smokers.
Exploring Smoking-Lung Cancer Risk Implications of Each Metric
Here we compare what each smoking metric implies about lung cancer risk predictions over a range of different values of smoking intensity and duration. For pack-years, the increase in predicted cancer risk is the same for a doubling in smoking intensity (for fixed duration) as it is for a doubling in number of years smoked (for fixed intensity). The same is true for square-root pack-years. In contrast, for logcig-years, the predicted increase in cancer risk for a doubling of smoking duration (for fixed intensity) is much larger than it is for a doubling in smoking intensity (for a fixed duration).
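The contrast can be seen by writing out the factor by which each metric changes under the two kinds of doubling. In the sketch below, the symbols c (for cigpday) and t (for cigtime) are shorthand introduced only for this illustration, and pack-years is taken to be (c/20) t, assuming 20 cigarettes per pack.

```latex
\begin{align*}
\text{pack-years:}\quad
  & \frac{(2c/20)\,t}{(c/20)\,t} = 2
  \quad\text{and}\quad
  \frac{(c/20)\,(2t)}{(c/20)\,t} = 2, \\
\text{square-root pack-years:}\quad
  & \sqrt{2} \ \text{in both cases}, \\
\text{logcig-years:}\quad
  & \frac{\log(c+1)\,(2t)}{\log(c+1)\,t} = 2
  \quad\text{but}\quad
  \frac{\log(2c+1)\,t}{\log(c+1)\,t} = \frac{\log(2c+1)}{\log(c+1)} .
\end{align*}
```

For example, at c = 20 the last ratio is log(41)/log(21), approximately 1.22, so doubling intensity raises logcig-years by about 22%, whereas doubling duration raises it by 100%.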
In Fig. 3, we give contours of these three smoking metrics, as well as a two-dimensional smoothed estimate of cancer risk as a function of smoking intensity and smoking duration, estimated from the lung cancer data. In the three contour plots, for fixed values of other model covariates, the estimated cancer risk is constant along any given contour of the smoking metric (shown as curves in the plot), and cancer risk is estimated to increase when moving from one contour to another with a larger value of the smoking metric. The contour plot for the two-metrics model depends on the coefficients for cigtime and logcigp. For the values given in Table 1, the contour plot for the two-metrics model (data not shown) is similar to that for the logcig-years model. The two-dimensional smoothed estimates from the data using all individuals (Fig. 3, bottom right) and using smokers only (very similar to Fig. 3, bottom right; data not shown) differ from the three contour plots most dramatically where smoking intensity is zero, and to a lesser extent where years of smoking is zero. Any two-dimensional estimate is less accurate at the plot edges, where extrapolation is needed, than in the center of the plot. The two-dimensional smooth fit from the data does suggest that cancer risk increases more rapidly with increasing years of smoking (for fixed intensity) than it does with increasing intensity (for fixed duration), consistent with the logcig-years and two-metrics models, but not with the pack-years or square-root pack-years models.
In the two-metrics model, adjusting for duration and intensity separately assumes that the effect of smoking intensity (on the logarithmic scale) and smoking duration are additive, an assumption which is not made for the other smoking metrics considered here. Under the two-metrics model, a specific increase in years of smoking is predicted to increase cancer risk by the same amount for light smokers as for heavy smokers, whereas under the logcig-years model, the predicted increase is greater for heavier smokers. A similar conclusion can be reached about differences in estimated cancer risks for a specific increase in smoking intensity for a fixed smoking duration.
In Fig. 4, we show the estimated lung cancer relative risk on the logarithmic scale, for ever smokers relative to never smokers using the logcig-years model, including only the significant or borderline significant covariates (logcig-years, age as a piecewise linear term, and years since quitting smoking). The estimated log relative risk was 0.019 × logcig-years − 0.009 × yrsquit. Because smoking status was not needed in this model, the estimated relative risk does not change abruptly when smoking cessation occurs. This feature is not shared by any of the other smoking metrics when fit to data using all individuals.
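As an illustration of how the fitted expression can be evaluated, the sketch below applies the two reported coefficients to a hypothetical smoking history. The function name and the example smoker are assumptions made for the example, and age is taken to be held fixed so that only the smoking terms contribute.

```python
import math

B_LOGCIGYEARS = 0.019  # coefficient reported in the text
B_YRSQUIT = -0.009     # coefficient reported in the text


def log_relative_risk(cigpday, cigtime, yrsquit=0.0):
    """Estimated log relative risk versus a never smoker, age held fixed."""
    logcigyears = math.log(cigpday + 1) * cigtime
    return B_LOGCIGYEARS * logcigyears + B_YRSQUIT * yrsquit


# hypothetical history: 20 cigarettes per day for 30 years
current = log_relative_risk(20, 30)              # still smoking
quit10 = log_relative_risk(20, 30, yrsquit=10)   # quit 10 years ago
print(math.exp(current), math.exp(quit10))
```

Because yrsquit enters linearly and no smoking status indicator is present, the estimate declines smoothly from the current-smoker value as yrsquit increases from zero.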
Assessing Our Sample Using the Doll and Peto Equation
We now compare our choice of metrics with the gold-standard model of Doll and Peto (14) from their landmark study. In a cohort study, Doll and Peto found that among male never smokers and current smokers aged 40 to 79 who started smoking between age 16 and 25 and who smoked 40 or fewer cigarettes per day, the annual lung cancer incidence was proportional to (cigpday + 6)^{2} × (age − 22.5)^{4.5}, where age − 22.5 was used as a proxy for smoking duration (cigtime). We tried to fit an analogous model in our data by using the log odds ratio to approximate the log incidence rate ratio among the n = 177 male never smokers and n = 137 male current smokers in our sample who met Doll and Peto's criteria. We assumed a baseline risk of age^{4.5} for never smokers. Because we are modeling the log odds ratio rather than incidence itself, and furthermore we are using case-control data rather than cohort data, our model results are not strictly comparable to Doll and Peto's results. One consequence of using the logarithmic scale is that we must add a constant (we chose to add one) to cigtime so as to not exclude never smokers when taking the logarithm. In a model for incidence this is not necessary. Among this subsample, we examined the relationship between the logit probability of cancer and log(cigpday + 6), log(cigtime + 1) and log(age), using both GAM and logistic regression. Our results would be consistent with Doll and Peto's model if the coefficients in the logistic regression model were 2, 4.5, and −4.5, respectively.
Adding one to cigtime before taking the logarithm resulted in a very bimodal distribution for this variable. Among the n = 137 current male smokers meeting Doll and Peto's criteria, the smallest cigtime was 9, so the variable log(cigtime + 1) had n = 177 values of 0, and n = 137 values between 2.30 and 4.11. The GAM plot indicated that the logit probability of cancer was strongly and positively related to log(cigtime + 1) among current smokers, but that the adjusted relationship for never smokers did not fit this pattern. As a result, the inclusion of never smokers caused the overall relationship to be extremely nonlinear. In the adjusted logistic regression model, the coefficient (SE) for log(cigpday + 6) was 2.70 (0.70), consistent with the Doll and Peto model. The coefficient (SE) for log(age) was 1.35 (0.85), whereas the coefficient of −0.35 for log(cigtime + 1) is not meaningful due to the nonlinearity noted above. The age distributions among cases and controls in our overall case-control study and in the subset used for this analysis are not necessarily representative of the corresponding age distributions in the population. The age effect seen here partly reflects this difference, which could explain why our estimated age effect differs from Doll and Peto's estimate. Other reasons why our results did not more closely match Doll and Peto's could include our assumption of the baseline risk among never smokers, and the fact that cigarette smoke exposure characteristics have changed over the past four decades, which may also affect the smoking metric. In fact, Flanders et al. (8) did a more recent cohort analysis, and also found major differences with Doll and Peto. In this data subset, logcig-years continued to be linearly related to the logit probability of cancer with a regression coefficient (SE) of 0.02 (0.002).
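The target coefficients can be traced through a short piece of algebra. The sketch below is one reading of the assumptions stated above: the subscripted incidence symbols are introduced only for this illustration, and the proportionality constants are absorbed into a single constant.

```latex
\begin{align*}
I_{\text{smoker}} &\propto (\text{cigpday} + 6)^{2}\,\text{cigtime}^{4.5},
  \qquad \text{cigtime} \approx \text{age} - 22.5, \\
I_{\text{never}} &\propto \text{age}^{4.5}
  \quad \text{(assumed baseline)}, \\
\log\frac{I_{\text{smoker}}}{I_{\text{never}}}
  &= \text{constant} + 2\log(\text{cigpday} + 6)
     + 4.5\log(\text{cigtime}) - 4.5\log(\text{age}).
\end{align*}
```

In the fitted model, log(cigtime) is replaced by log(cigtime + 1) so that never smokers, for whom cigtime is zero, can be retained.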
Doll and Peto did not include former smokers in their base population, whereas former smokers comprised over half of the ever smokers in our sample. Differences in the epidemiology of lung cancer between former smokers and current smokers (for example, the proportion of adenocarcinomas, or of peripheral versus central lung cancers) suggest possible differences in lung carcinogenesis, and this too may affect the smoking metric. Thus, the choice of an appropriate metric may be affected by differences in study design and population.
Discussion
We compared the performance of several metrics in a large case-control study to illustrate how we evaluate smoking metric(s) for use in our gene-smoking models. Three of the metrics have been used in published studies (packyears, squareroot packyears, and the twometrics model), whereas the fourth, logcigyears, has not been considered previously. In our sample, we showed that the contribution of packyears to the logit probability of lung cancer was highly nonlinear. The remaining three metrics passed this first hurdle and were approximately linearly related to the logit probability of lung cancer. Both the squareroot packyears model and the twometrics model require inclusion of smoking status as a covariate, especially in models that include never smokers, implying that risk estimates may change drastically upon smoking initiation and smoking cessation. The model using logcigyears did not have this drawback because smoking status and its interaction with logcigyears were not significant predictors of cancer risk after adjusting for other model covariates. Models that include smoking status may be sensitive to the cutpoints used to differentiate between never, former, and current smokers. Such sensitivity has been noted by Leffondre et al. (12) for the estimated hazard ratio for lung cancer in a Cox model. We note that although the logcigyears metric did well in our data, whether it performs well in other data sets would need to be determined on the basis of its performance relative to other metrics in those data sets.
Our general approach to evaluating continuous smoking metrics can be summarized as follows: (a) evaluate each metric for linearity with the disease outcome in one's study population, using the appropriate link function (e.g., the logit probability of cancer for logistic regression); (b) evaluate the effect on risk estimates of including other potentially clinically important variables along with the metric(s) of interest, such as smoking status (never, current, or former smoker), years since quitting smoking, age at smoking initiation, and/or age; (c) compare the implications of the different smoking metrics for lung cancer risk predictions; and (d) explore possible reasons why the metric that performs best in one's study population may differ from metrics chosen in other studies or for other hypotheses. The best-performing continuous smoking metrics seem to have the following three properties: (a) a linear relationship with disease risk using the appropriate link function, because this is a model assumption; (b) the ability to include or exclude never smokers from the model without substantial changes in the choice of model covariates or in estimated disease risk in smokers; and (c) an insensitivity of disease risk estimates to changes in smoking status for fixed values of other model covariates. Models that include smoking status imply a jump in estimated risk at the age of smoking initiation and/or smoking cessation, an assumption that is appropriate for certain types of analyses but not for others, and one that is somewhat implausible from a biological perspective.
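Step (a) can be illustrated with a rough Python sketch. The analyses in this article used generalized additive models and logistic regression; the code below swaps in a simpler stand-in, binned empirical logits summarized by the R^2 of a straight-line fit, and it runs on simulated data only. The function names, the simulated exposure variable, and its coefficients are hypothetical and are not taken from the study.

```python
import math
import random


def empirical_logits(metric, outcome, n_bins=10):
    """Return one (mean metric, empirical logit) point per equal-count bin."""
    pairs = sorted(zip(metric, outcome))
    size = len(pairs) // n_bins
    points = []
    for b in range(n_bins):
        # The last bin absorbs any remainder.
        if b == n_bins - 1:
            chunk = pairs[b * size:]
        else:
            chunk = pairs[b * size:(b + 1) * size]
        cases = sum(y for _, y in chunk)
        controls = len(chunk) - cases
        # Adding 0.5 to each count avoids log(0) in sparse bins.
        logit = math.log((cases + 0.5) / (controls + 0.5))
        mean_x = sum(x for x, _ in chunk) / len(chunk)
        points.append((mean_x, logit))
    return points


def r_squared(points):
    """R^2 of the least-squares straight line through (x, y) points."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    syy = sum((y - mean_y) ** 2 for _, y in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    return sxy ** 2 / (sxx * syy)


# Hypothetical simulated data: the true logit is linear in sqrt(exposure),
# so the raw exposure scale should look curved by comparison.
random.seed(0)
exposure = [random.uniform(0.0, 100.0) for _ in range(20000)]
status = []
for x in exposure:
    p = 1.0 / (1.0 + math.exp(-(-3.0 + 0.6 * math.sqrt(x))))
    status.append(1 if random.random() < p else 0)

r2_raw = r_squared(empirical_logits(exposure, status))
r2_sqrt = r_squared(empirical_logits([math.sqrt(x) for x in exposure], status))
print(f"R^2 of binned logits vs raw exposure:  {r2_raw:.3f}")
print(f"R^2 of binned logits vs sqrt exposure: {r2_sqrt:.3f}")
```

A metric whose binned logits fall close to a straight line is a candidate for entering the logistic model as a single linear term. This check is unadjusted, so it does not replace a covariate-adjusted GAM plot.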
A limitation of this study concerns the derivation of the logcigyears metric, which was based on several simplifying assumptions. Our derivation considered only polycyclic aromatic hydrocarbon formation from smoking, but other sources, such as well-done red meat, also contribute polycyclic aromatic hydrocarbons. We did not account for these other possible sources in this article [but see Cortessis and Thomas (40), who model smoking and well-done red meat consumption jointly]. Although we have stated various limitations of the logcigyears metric, it should be noted that all metrics suffer from an inability to explain or account for many biological premises associated with tobacco carcinogenesis. Although initially motivated by a DNA adducts model, our metric was chosen mainly because it did better than other metrics in this sample data set. It should be understood that in other contexts, other metrics, including those not mentioned in this article, might be most appropriate for analysis. In all circumstances, the derived metric should have at least face validity.
In summary, we recommend that a process such as the one outlined here be followed before assuming that a particular smoking metric suitably adjusts for or evaluates smoking in a statistical model. Different studies may use different metrics because base populations and study designs may differ between studies. We do not recommend that this comprehensive approach be used for all studies that incorporate smoking variables, but rather that the process be adapted to evaluate smoking metrics in studies where smoking is an integral part of the biology of the disease or of the study hypothesis.
Acknowledgments
We acknowledge very helpful discussions with Edwin van Wijngaarden, Wei Zhou, and George Thurston, and suggestions from two anonymous reviewers. We are especially indebted to the senior editor, Duncan Thomas, for raising several important issues and for his concrete suggestions, all of which improved the manuscript substantially.
Footnotes

Grant support: Grant K22 ES11027 from the National Institute of Environmental Health Sciences, NIH. Its contents are solely the responsibility of the authors and do not necessarily represent the official views of the National Institute of Environmental Health Sciences or NIH. Additional support was provided by NIH grants CA092824, CA74386, and CA90578, and by the Doris Duke Charitable Foundation.

The costs of publication of this article were defrayed in part by the payment of page charges. This article must therefore be hereby marked advertisement in accordance with 18 U.S.C. Section 1734 solely to indicate this fact.

Note: This manuscript was presented in part at the 2004 AACR Annual Meeting in Orlando, Florida. S. Thurston, G. Liu, D.P. Miller, D.C. Christiani. “Modeling cancer risk in case-control studies using a new dose metric of smoking based on a DNA-adduct model of carcinogenesis.”
 Accepted June 13, 2005.
 Received May 31, 2004.
 Revision received May 19, 2005.