
| HOME | HELP | FEEDBACK | SUBSCRIPTIONS | ARCHIVE | SEARCH | TABLE OF CONTENTS |
1 International Agency for Research on Cancer, Lyon, France; 2 Department of Community and Family Medicine, 3 Norris Cotton Cancer Center, and 4 Department of Genetics, Computational Genetics Laboratory, Dartmouth Medical School, Lebanon, New Hampshire; 5 Cancer Epidemiology Studies, Department of Epidemiology and Biostatistics, University of California, San Francisco, California; 6 Departments of Pediatrics, Microbiology and Immunology, Epidemiology and Population Health, and Obstetrics, Gynecology and Women's Health, Albert Einstein College of Medicine, Bronx, New York; and 7 Departments of Community Health and Pathology and Laboratory Medicine, Center for Environmental Health and Technology, Brown University, Providence, Rhode Island
Requests for reprints: Eric J. Duell, International Agency for Research on Cancer, 150 Cours Albert Thomas, 69008 Lyon, France. Phone: 33-472-738670; Fax: 33-472-738320. E-mail: duelle{at}iarc.fr
| Abstract |
|---|
| Introduction |
|---|
Multifactor dimensionality reduction was developed as a nonparametric method that, in case-control studies of complex disease, does not require specification of a genetic model to detect gene-gene interactions without main gene effects (1). Multifactor dimensionality reduction software has existed since 2001 and has evolved over multiple versions. The core algorithm used to collapse high-dimensional data and to assess cross-validation consistency has remained unchanged, whereas recent improvements have centered on the user interface and the addition of graphical features. Newer versions of multifactor dimensionality reduction have incorporated data filtering methods such as Tuned ReliefF to assist in the analysis of interactions in genome-wide association studies (see Web site: www.epistasis.org; ref. 7).
Focused interaction testing framework software was developed to identify markers for gene-gene interaction and uses a parametric search algorithm in a pooled case-control group that reduces the number of tests done. The screening algorithm uses a goodness-of-fit χ² statistic that compares observed to expected genotype frequencies in the pooled cases and controls (assuming no marginal or main gene effects). The main interaction testing framework then uses a likelihood ratio test to simultaneously test for higher-order multilocus effects while adjusting the threshold for significance by controlling false discovery rates. Thus, multifactor dimensionality reduction and focused interaction testing framework use fundamentally different strategies to detect interactions.
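As a rough illustration of the screening idea, the sketch below computes a Pearson χ² statistic that compares observed two-locus genotype counts in a pooled sample with the counts expected if the two markers were independent. This is not the focused interaction testing framework implementation, whose exact statistic is not given here; the function name `pooled_chi_square` and the input layout are assumptions made for this example.

```python
from collections import Counter


def pooled_chi_square(genotypes_a, genotypes_b):
    """Pearson chi-square for independence of two markers in a pooled sample.

    genotypes_a and genotypes_b are equal-length, non-empty lists of genotype
    codes (e.g., 0, 1, 2) for cases and controls combined. Hypothetical
    helper, not part of any published software.
    """
    n = len(genotypes_a)
    joint = Counter(zip(genotypes_a, genotypes_b))
    margin_a = Counter(genotypes_a)
    margin_b = Counter(genotypes_b)
    chi2 = 0.0
    for a, count_a in margin_a.items():
        for b, count_b in margin_b.items():
            # Expected count if the two markers were independent
            expected = count_a * count_b / n
            observed = joint[(a, b)]
            chi2 += (observed - expected) ** 2 / expected
    return chi2
```

A large value suggests that the joint genotype distribution in the pooled sample departs from what the single-marker frequencies alone would predict.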
Pancreatic cancer is the fourth leading cause of cancer-related death in men and women in the United States (8). With the exception of cigarette smoking, few environmental risk factors for pancreatic cancer are known, and most cases (>90%) do not aggregate in families (9). Thus, it is reasonable to suggest that most pancreatic cancers are the result of complex interactions and cross-talk between alleles or gene products and environmental exposures such as tobacco smoke.
The main objectives of this article are to identify pathway-based higher-order interactions (gene-gene and gene-smoking) important in pancreatic cancer etiology and to compare the results of multifactor dimensionality reduction and focused interaction testing framework software packages with traditional multivariable logistic regression methods using genotype data from participants in a population-based case-control study of pancreatic adenocarcinoma. From 7 biological pathways, 26 polymorphisms in 20 genes were evaluated in this study (Table 1). The following pathways and genes were included in the analyses: base excision repair (APE1, OGG1, XRCC1), nucleotide excision repair (XPA, XPC, XPD, ERCC1), double-strand break repair (XRCC3), carcinogen metabolism and oxidant stress (GSTM1, GSTT1, GSTP1, UGT1A7, SOD2), hormone metabolism (CYP1A1, CYP1B1, CCK), inflammation (TNF-A, RANTES, CCR5), and extracellular matrix (MMP3).
| Materials and Methods |
|---|
Control participants were identified within the six San Francisco Bay Area counties using random-digit dial and were frequency matched to cases in an approximately 3:1 ratio by sex and 5-year age group. Eligibility criteria were identical for case and control participants, except for pancreatic cancer status. Control recruitment for those older than 65 years was supplemented by random sampling of the Health Care Finance Administration (now the Centers for Medicare and Medicaid Services) lists for the 6 Bay Area counties. A total of 1,701 eligible control participants completed the study interview for a 67% response rate (10-12).
The analyses for this study are based on 308 cases and 964 controls who gave blood as part of the laboratory portion of the study. Detailed methods on case and control selection and the laboratory portion of the study have been published (12-14, 16). Eligible participants were those who had no portacath (a medical device surgically inserted under the skin that typically is used to deliver chemotherapy for cancer patients or for patients requiring long-term parenteral nutrition) in place and had no history of bleeding disorders. Blood was not requested from the out-of-area cases. It was not obtained from the remainder of the case participants for the following reasons: patient was too ill, had died, or refused; the blood draw was unsuccessful or insufficient; or the study had ended. An analysis comparing participants who provided blood with those who did not provide blood has been previously described (14). Among cases, age, sex, race, education, smoking status (never, former, current), and pack-years of smoking were not different between those who did and did not provide a blood specimen (all P values were greater than 0.05). Blood was not requested from out-of-area controls. It was not obtained from the remainder of the control participants for the following reasons: participant refused, had died, was lost to contact, or was too ill; the blood draw was unsuccessful or insufficient; or the study had ended. Among controls, age, education, and pack-years of smoking were independent of venipuncture (all P values were greater than 0.05), whereas those who provided blood were more likely to be white, men, and ever smokers. Overall, case or control status was not related to providing blood (P = 0.60). The study interviewers obtained separate written informed consent from all participants before interview and venipuncture. Study methods and protocols were approved by the University of California Committee on Human Research.
Exposure and demographic information were obtained from participants during in-person interviews conducted by trained interviewers using structured questionnaires. No proxy interviews were conducted. Self-reported race was broadly defined as white or Caucasian, black or African American, or Asian. Of the participants, 5 cases and 15 controls did not fall into any of these 3 categories and were classified as "other race" for these analyses. Participants were defined as never smokers if they never had smoked more than 100 cigarettes in their lifetime and had not smoked cigars or pipes at least once per month for 6 months or more. Because there was a substantial number of participants who had never smoked and who reported a history of passive smoke exposure at home as an adult (women: 32 cases, 95 controls; men: 5 cases, 21 controls), these individuals were removed from the reference group of never smokers. In analyses of smoking status, passive smokers were combined with former active smokers and pipe or cigar smokers to form three groups (never, former or passive, and current). Smoking intensity (pack-years) was defined as the number of packs of cigarettes smoked per day multiplied by the number of years smoked. For gene-smoking interaction analyses using multifactor dimensionality reduction and focused interaction testing framework, pack-years were categorized to form three groups as follows: (a) never active or passive smokers; (b) former smokers, passive smokers, pipe or cigar smokers, or less than 41 pack-years; and (c) 41 or more pack-years. Age and sex were used to determine sampling probabilities and were therefore included in multivariable logistic models.
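The pack-year definition and the three-group categorization used for the interaction analyses can be sketched as follows. The function names and the boolean flag are hypothetical. The sketch also assumes that the 41 pack-year cutoff alone separates groups (b) and (c) for anyone who is not a never smoker, so that passive smokers and pipe or cigar smokers with no cigarette pack-years fall into group (b).

```python
def pack_years(packs_per_day, years_smoked):
    """Smoking intensity: packs of cigarettes per day times years smoked."""
    return packs_per_day * years_smoked


def pack_year_group(never_active_or_passive, cumulative_pack_years):
    """Assign one of three groups (labels are illustrative).

    "never": never active or passive smokers
    "<41":   former, passive, pipe or cigar smokers, or fewer than 41 pack-years
    "41+":   41 or more pack-years
    """
    if never_active_or_passive:
        return "never"
    if cumulative_pack_years >= 41:
        return "41+"
    return "<41"
```

For example, a participant who smoked 1.5 packs per day for 20 years has 30 pack-years and would be placed in the middle group.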
Genotyping
All genotyping was done on germ-line DNA (~50 ng) extracted from peripheral blood lymphocytes using the QIAamp DNA Blood Mini kit (Qiagen Inc.) according to the instructions of the manufacturer. PCR-RFLP analysis was used to genotype CYP1A1 m1 (T→C, nucleotide 6235 in 3' flanking region), m2 (A→G, nucleotide 4889), and m4 (C→A) alleles. Genotypes for GSTM1-null (homozygous gene deletion), GSTT1-null, XPC-PAT+ [intron 9 poly(AT)], and CCR5-Δ32 (32-bp deletion) were determined using PCR amplification and visualization on agarose gels. Detailed methods and results from our earlier analyses of polymorphisms in CYP1A1, GSTM1, and GSTT1 in this subset of the San Francisco Bay Area pancreatic cancer study have been published (13). CCR5-Δ32 was genotyped according to a gel-based PCR method and primers published by Martinson et al. (21). XPC-PAT+ was genotyped using primers and an optimized protocol from Khan et al. (22). XRCC1.194, XPA, ERCC1, XPD, and SOD2 variants were genotyped using validated Taqman assays (Applied Biosystems).
The CYP1B1 and UGT1A7 variants were detected via allele-specific oligonucleotide hybridization using methods similar to those developed to distinguish different human papillomavirus DNA genotypes (23). The CYP1B1 (Val432Leu) and CYP1B1 (Asn453Ser) variants were examined by amplifying genomic DNA using the primers listed in the footnotes to Table 1 (24). The biotinylated probe sequences used for the CYP1B1 hybridizations are listed in footnotes to Table 1. The UGT1A7.208 variant was examined by amplifying genomic DNA using primers listed in the footnotes to Table 1. The biotinylated probe sequences used for the UGT1A7.208 hybridizations also are listed in Table 1 footnotes. All PCRs were done in a final volume of 50 µL consisting of DNA (20 ng), deoxynucleotide triphosphate (0.2 mmol/L each; Invitrogen), MgCl2 (2.5 mmol/L), 1 µmol/L each primer, AmplitaqGold polymerase (1.25 U; Perkin-Elmer), and 1x reaction buffer. Amplification was done with an initial denaturation at 94°C for 10 minutes, followed by 35 cycles of amplification at 94°C for 30 seconds, 55°C for 1 minute and 72°C for 1.5 minutes, and a final extension at 72°C for 5 minutes using a GeneAmp 9700 thermal cycler (Perkin-Elmer). The PCR products for each of the variants were denatured and blotted in duplicate onto Biodyne B membrane filters (pore size, 0.45 µm; Pall Biodyne) using a Robbins Hydra 96. The filters were treated with 3% hydrogen peroxide solution (Sigma Chemicals) at room temperature for 15 minutes then washed at 65°C for 30 minutes {wash solution consisted of 0.1x saline–sodium phosphate–EDTA [180 mmol/L NaCl, 10 mmol/L NaH2PO4, and 1 mmol/L EDTA (pH 7.4)] and 0.5% sodium dodecyl sulfate}. After the treatment, the filters were hybridized with the biotin-labeled probes overnight (hybridization temperatures for the CYP1B1 and UGT1A7.208 variants were 57°C and 44°C, respectively), followed by two 1-hour washes at the hybridization temperature.
Enhanced chemiluminescence reagent (Amersham) was used to detect hybridization, followed by exposure to autoradiography. All of the results were interpreted by two experienced investigators, and discrepancies were resolved by consensus.
APE1, OGG1, XRCC1.399, GSTP1, CCK, TNF-α, RANTES, XRCC3, and MMP3 (Table 1) were genotyped using the Masscode assay (BioServe Inc.; ref. 25). Methods and results for XRCC1.399, TNF-α, and RANTES have been published (14, 16). For participants in whom the mass spectrometry method failed to yield a conclusive genotype for TNF-α and RANTES, missing data were completed using PCR-RFLP assays according to Wilson et al. (26) and Hajeer et al. (27). A random sample of the data (3%) for TNF-α-308 and RANTES-403 were repeated using Masscode and PCR-RFLP and were found to agree for both genotyping methods. DNA samples that yielded "no calls" after three genotyping attempts were reported as missing.
Statistical Methods
Only markers with less than 7% missing data in cases or controls were included in these analyses. For multifactor dimensionality reduction and focused interaction testing framework analyses, genetic markers with missing genotype values were imputed to the most common genotype for that marker. Tests for Hardy-Weinberg equilibrium among all or white or Caucasian control participants were conducted by comparing observed with expected genotype frequencies using a χ² test with 1 degree of freedom. Expected genotype frequencies were estimated from allele frequencies. Multifactor dimensionality reduction and focused interaction testing framework analyses were run with and without variables for smoking status or pack-years.
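The two preprocessing steps described here, imputation to the most common genotype and a Hardy-Weinberg χ² test with 1 degree of freedom for a biallelic marker, can be sketched with the standard library alone. The function names, the use of `None` for a missing call, and the genotype-count inputs are assumptions for illustration; the P value uses the identity that the upper tail of a 1-df χ² distribution equals erfc(√(χ²/2)).

```python
import math
from collections import Counter


def impute_most_common(genotypes):
    """Replace missing calls (None) with the most common observed genotype."""
    observed = [g for g in genotypes if g is not None]
    mode = Counter(observed).most_common(1)[0][0]
    return [mode if g is None else g for g in genotypes]


def hardy_weinberg_chi_square(n_aa, n_ab, n_bb):
    """Chi-square (1 df) comparing observed genotype counts with HWE expectations.

    Expected counts are derived from the allele frequencies estimated
    from the same genotype counts.
    """
    n = n_aa + n_ab + n_bb
    p = (2 * n_aa + n_ab) / (2 * n)  # frequency of allele A
    q = 1.0 - p
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_aa, n_ab, n_bb)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)
    p_value = math.erfc(math.sqrt(chi2 / 2))  # upper tail, 1 degree of freedom
    return chi2, p_value
```

Genotype counts of 25, 50, and 25 match the expected proportions exactly and give a statistic of zero, whereas a complete absence of heterozygotes (50, 0, 50) gives a very large statistic.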
Multifactor Dimensionality Reduction Analysis
Multifactor dimensionality reduction version 1.0.0rc1 was used, and best models were reported for interactions with up to three factors until the total cross-validation consistency was five or more. Multifactor dimensionality reduction uses cross-validation by dividing the data into a training set (e.g., 9/10 of the data) and a testing set (e.g., the remaining 1/10 of the data) to derive estimates of cross-validation consistency and testing accuracy. Multifactor dimensionality reduction models were considered statistically significant if the testing accuracy was greater than the cutoff based on a 1,000-fold permutation test. For permutation testing, the data were randomized 1,000 times by case or control status consistent with the null hypothesis of no association (testing accuracy, 0.5). The multifactor dimensionality reduction model-fitting procedure was run for each randomized data set to determine expected values for testing accuracy. Testing accuracies greater than the expected values based on the permuted data sets were considered statistically significant at the 0.05 level. This approach also accounts for multiple hypothesis testing. High- or low-risk summary graphics provided by multifactor dimensionality reduction were used to visualize potentially interacting genotypes and to build combined variables for testing in logistic regression models. Interaction dendrograms provided by multifactor dimensionality reduction were examined to assist in the visualization and interpretation of potential interactions (6). Connected red or orange lines indicate genetic markers that may interact synergistically, whereas blue or green lines indicate genetic markers that are redundant or do not interact. Shorter lines or leaves indicate stronger synergistic or redundant relations between variables.
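The procedure described above can be illustrated with a minimal sketch. It is not the multifactor dimensionality reduction software itself: the function names, the use of balanced accuracy as the accuracy measure, the rule that a genotype cell is high risk when its case:control ratio exceeds the overall ratio in the training data, and the (exceedances + 1)/(permutations + 1) form of the permutation P value are all assumptions made for this example. Labels are coded 1 for cases and 0 for controls.

```python
import random
from collections import Counter
from itertools import combinations


def high_risk_cells(rows, labels, combo):
    """Genotype cells whose case:control ratio exceeds the overall ratio."""
    cases, controls = Counter(), Counter()
    for row, y in zip(rows, labels):
        cell = tuple(row[i] for i in combo)
        (cases if y == 1 else controls)[cell] += 1
    n_cases = sum(labels)
    n_controls = len(labels) - n_cases
    # Cross-multiplied comparison avoids dividing by zero in empty cells.
    return {cell for cell in set(cases) | set(controls)
            if cases[cell] * n_controls > controls[cell] * n_cases}


def balanced_accuracy(rows, labels, combo, high):
    """Mean of sensitivity and specificity; unseen cells count as low risk."""
    tp = tn = pos = neg = 0
    for row, y in zip(rows, labels):
        predicted_high = tuple(row[i] for i in combo) in high
        if y == 1:
            pos += 1
            tp += predicted_high
        else:
            neg += 1
            tn += not predicted_high
    if pos == 0 or neg == 0:  # degenerate fold: fall back to plain accuracy
        return (tp + tn) / (pos + neg)
    return 0.5 * (tp / pos + tn / neg)


def mdr(rows, labels, order, n_folds=10, seed=0):
    """Return (best combination, cross-validation consistency, mean testing accuracy)."""
    idx = list(range(len(rows)))
    random.Random(seed).shuffle(idx)
    wins, test_acc = Counter(), {}
    for f in range(n_folds):
        test = idx[f::n_folds]
        held_out = set(test)
        train = [i for i in idx if i not in held_out]
        tr_rows, tr_y = [rows[i] for i in train], [labels[i] for i in train]
        te_rows, te_y = [rows[i] for i in test], [labels[i] for i in test]
        # Best model in this fold: highest training accuracy among all combinations.
        best = max(
            combinations(range(len(rows[0])), order),
            key=lambda c: balanced_accuracy(
                tr_rows, tr_y, c, high_risk_cells(tr_rows, tr_y, c)),
        )
        wins[best] += 1
        high = high_risk_cells(tr_rows, tr_y, best)
        test_acc.setdefault(best, []).append(
            balanced_accuracy(te_rows, te_y, best, high))
    best, consistency = wins.most_common(1)[0]
    return best, consistency, sum(test_acc[best]) / len(test_acc[best])


def permutation_p_value(rows, labels, order, n_perm=1000, seed=0):
    """Share of label permutations whose testing accuracy reaches the observed one."""
    observed = mdr(rows, labels, order)[2]
    rng = random.Random(seed)
    shuffled = list(labels)
    exceed = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if mdr(rows, shuffled, order)[2] >= observed:
            exceed += 1
    return (exceed + 1) / (n_perm + 1)
```

Because the whole model search is repeated on every permuted data set, the resulting null distribution reflects the search over factor combinations, which is why the permutation approach is described as accounting for multiple hypothesis testing.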
Focused Interaction Testing Framework Analysis
Focused interaction testing framework software was downloaded in July 2006 using the Web site http://hydra.usc.edu/fitf (5). Data for interactions with up to three factors were evaluated. The overall α level of 0.05 was partitioned as follows: 0.01 in the first stage and 0.02 in the second and third stages. The χ² subset (chi-square subset) statistical cutoff values were set to three and six for these analyses. All statistically significant interaction models were reported based on the false discovery rate P value. The model with the lowest false discovery rate P value was reported if no models were statistically significant.
Unconditional Logistic Regression
Unconditional multiple logistic regression with PROC LOGISTIC in SAS (version 9.1; SAS Institute) was used to compute covariate-adjusted odds ratios and 95% confidence intervals (95% CI) for genetic factors and pancreatic cancer risk. Interactions between variables were assessed in logistic regression models by forming a new variable with a common reference group (e.g., XRCC3.241 TT+TC+ never smoker) from two or three individual variables. For gene-gene interaction variables in logistic models, either previous information on variant function (if known) or interaction graph output from multifactor dimensionality reduction software was used to form the new combined variable (high versus low risk).
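For a combined variable with a common reference group, the basic quantity being estimated is an odds ratio for each combined category relative to that reference. The sketch below computes a crude (unadjusted) odds ratio with a Woolf-type 95% CI from a 2x2 table. It is a simplification for illustration only: the study used covariate-adjusted estimates from PROC LOGISTIC, and the function name and arguments here are hypothetical.

```python
import math


def crude_odds_ratio(exposed_cases, exposed_controls, ref_cases, ref_controls, z=1.96):
    """Unadjusted odds ratio and Woolf (log-based) confidence interval.

    "Exposed" denotes one combined category (e.g., a high-risk genotype
    and smoking combination); "ref" denotes the common reference group.
    All four counts must be greater than zero.
    """
    odds_ratio = (exposed_cases * ref_controls) / (exposed_controls * ref_cases)
    se_log_or = math.sqrt(
        1 / exposed_cases + 1 / exposed_controls + 1 / ref_cases + 1 / ref_controls
    )
    lower = math.exp(math.log(odds_ratio) - z * se_log_or)
    upper = math.exp(math.log(odds_ratio) + z * se_log_or)
    return odds_ratio, lower, upper
```

With made-up counts of 20 exposed cases, 10 exposed controls, 10 reference cases, and 20 reference controls, the crude odds ratio is 4.0; the wide interval that results from such small cells illustrates why combined-category estimates can lack precision.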
| Results |
|---|
In two-loci models among all participants, multifactor dimensionality reduction favored a possible OGG1.326*XPC.PAT interaction, whereas focused interaction testing framework identified XPD.312*XRCC3.241, although neither of these was considered statistically significant (Table 2). Focused interaction testing framework also identified a possible interaction between XPD.312 and XPA among whites or Caucasians that was not identified by multifactor dimensionality reduction. Thus, potential two-loci interactions OGG1.326*XPC.PAT and XPD.312*XPA were tested in subsequent logistic regression models.
In general, similar results were obtained from models that evaluated tobacco smoking using smoking status (never, former, current) or smoking pack-years. In these models, multifactor dimensionality reduction and focused interaction testing framework identified smoking as the most important single risk factor for pancreatic cancer (Table 3). When smoking was included in multifactor dimensionality reduction and focused interaction testing framework models, multifactor dimensionality reduction tended to include smoking in all the best two- and three-factor combinations, whereas none of the best two- and three-factor combinations identified by focused interaction testing framework included smoking (Table 3). Multifactor dimensionality reduction results suggested that OGG1 and XPD and smoking interact (consistent among all participants and among whites or Caucasians), whereas focused interaction testing framework results suggested that OGG1 and XPD interact (among all participants, with or without smoking).
χ² of 4.2 (P = 0.04). Among all participants, the likelihood ratio test χ² value testing for interaction between XRCC3.241 and smoking was 0.87 (P = 0.35). None of the likelihood ratio tests for XPD.312*XPA and OGG1.326*XPC.PAT were statistically significant (all P > 0.05). We ran the final confirmatory logistic regression model in whites or Caucasians separately for men and for women (data not shown). In general, the direction and magnitude of odds ratios were similar between men and women. None of the likelihood ratio tests for XRCC3.241*smoking by sex (men or women) or by race or ethnicity (in whites or Caucasians or in all participants) were statistically significant (all P > 0.2). In logistic models, we found some evidence for an interaction between smoking and XPD.751 in men and not in women (data not shown). The magnitude of the combined odds ratios (for XPD.751 and smoking) was not as strong as those observed for XRCC3.241*smoking (data not shown).
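The relation between a likelihood ratio test χ² value and its P value can be checked directly. The sketch below assumes a reference distribution with 1 degree of freedom, which the text does not state explicitly; under that assumption the upper-tail probability has the closed form erfc(√(χ²/2)). The function name is hypothetical.

```python
import math


def chi_square_p_value_1df(chi2):
    """Upper-tail probability of a chi-square statistic with 1 degree of freedom."""
    return math.erfc(math.sqrt(chi2 / 2))


for statistic in (4.2, 0.87):
    print(statistic, round(chi_square_p_value_1df(statistic), 2))
```

Under the 1-df assumption, the two statistics reported above (4.2 and 0.87) give P values that round to 0.04 and 0.35, in agreement with the reported values.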
A graphical depiction of the combined effect of XRCC3.241 and smoking as high- and low-risk groups and statistical interactions determined by multifactor dimensionality reduction are shown (Fig. 1C). As portrayed in the dendrogram (Fig. 2), OGG1 (OGG1.326) and XPC9 (XPC.PAT) may interact (connected red lines), whereas XPD.312 and XPD.751 are redundant or do not interact (blue lines).
| Discussion |
|---|
For three-factor combinations, there was some evidence for an interaction between XPD*OGG1*smoking, but estimates based on logistic regression models lacked precision. In a recent analysis of a hospital-based case-control study from the MD Anderson Cancer Center, Jiao et al. (29) reported an interaction between XPD codon 312 variants and smoking in relation to risk of pancreatic cancer. Their analysis did not evaluate polymorphisms in OGG1. Overall, our results support the hypothesis that some common genetic variants in base excision repair, nucleotide excision repair, and double-strand break repair pathways define subgroups at higher risk for smoking-associated pancreatic cancer.
Although multifactor dimensionality reduction and focused interaction testing framework identified smoking as the best single-factor predictor of pancreatic cancer in our study, unlike multifactor dimensionality reduction, focused interaction testing framework did not identify any interactions involving smoking. Multifactor dimensionality reduction and focused interaction testing framework rarely agreed on the interaction factors, which may reflect the different methods each program uses to identify interactions. Focused interaction testing framework screens markers based on a χ² test that compares observed with expected frequencies in a pooled group of cases and controls, which is equivalent to assuming no marginal or main gene effects. In comparison, multifactor dimensionality reduction is less constrained than focused interaction testing framework: it does not estimate model parameters or interaction terms and instead does cross-validation as part of the algorithm. With multifactor dimensionality reduction, any combination of genotypes (or environmental exposures) between two or more factors that results in an excess of cases compared with controls will be considered a potential interaction. Because the biology of many genes and of interactions between genes or gene products and environmental exposures is often unknown, the more "agnostic approach" of multifactor dimensionality reduction may be more appropriate for data mining. In contrast, focused interaction testing framework may have more power to detect interactions because of the prescreening procedure implemented in pooled cases and controls.
Because of the low magnitude of combined odds ratios from logistic models based on multifactor dimensionality reduction categories of high risk versus low risk, it is unclear if these categories represent risks due to true interactions (departures from additive or multiplicative effects of individual factors) or if they represent increased risks from multiple alleles or factors from distinct or overlapping pathways. There is considerable redundancy in DNA repair pathways. Because of this, the population-level effects of weakly or moderately interacting alleles and gene products are likely to be difficult to detect and interpret. Two markers (GSTP.105 and CYP1A1.m2) were not included in the analyses because they were not in Hardy-Weinberg equilibrium in Caucasian controls. Although the precise reasons for not satisfying Hardy-Weinberg equilibrium in our study are not known, they could include genotyping error (however, we repeated genotyping on 5% of the samples and found no differences), recent mutations that have not yet reached equilibrium in our population, or an artifact of admixture due to subgroups that differ in allele frequency. It is important to note that our study had limited power to assess interactions; thus, our results require confirmation in larger epidemiologic studies and in studies from different populations.
Multifactor dimensionality reduction and focused interaction testing framework are tools that can provide analysts with guidelines for the evaluation of interactions, but neither method can compensate for a lack of precision in the data. Although traditional logistic regression is inadequate to analyze the large amount of data that are produced by genome-wide methods, it is useful for covariate adjustment and to describe relative risks for disease in association with various combinations of genetic and environmental factors. No single method is likely to identify all of the potential interacting alleles or genotypes in a data set. Based on our experience, it seems appropriate to recommend that researchers and analysts use more than one approach to screen for potential gene combinations or interactions. Our approach was to use multifactor dimensionality reduction and focused interaction testing framework as tools to identify potential interacting candidate alleles or genotypes for more efficient testing using traditional logistic regression techniques. Our result showing an increased association for pancreatic cancer with a "checkerboard-like" combination of XPC.PAT and OGG1.326 genotypes (Fig. 1A) may not have been observed using more traditional logistic regression methods such as interaction terms and -2 log-likelihood ratio tests. It remains to be seen whether this pattern (XPC.PAT*OGG1.326) is observed in other study populations. It is becoming apparent that some interactions and allelic effects may be context dependent (31). Mutual collaborations and exchange of ideas among epidemiologists, computational biologists, mouse geneticists, and other basic scientists are necessary to further our understanding of these potentially important processes in complex human diseases.
| Disclosure of Potential Conflicts of Interest |
|---|
| Acknowledgments |
|---|
| Footnotes |
|---|
Received 11/13/07; revised 2/20/08; accepted 3/21/08.
| References |
|---|