
1 Department of Epidemiology and Biostatistics; 2 Department of Neurological Surgery, Division of Neuroepidemiology; 3 Gladstone Institute of Cardiovascular Disease; and 4 Department of Pathology, University of California, San Francisco, San Francisco, California
Requests for reprints: Jeffrey S. Chang, Department of Epidemiology and Biostatistics, University of California, San Francisco, 44 Page Street, Suite 503, San Francisco, CA 94143-1215. Phone: 510-642-6299; Fax: 510-643-1735. E-mail: jeffrey.chang{at}ucsf.edu
Introduction
Random forests (RF) is a tree-based classification method developed by Breiman (1, 2). It has several key features ideal for analyzing multiple SNPs: (a) the ability to analyze a data set that has a high ratio of number of predictor variables to observations; (b) the ability to detect a SNP that has a weak main effect but has significant interaction with other SNPs (3); (c) the importance measure from RF, which gives a natural ranking of SNPs; (d) having no requirement to specify the mode of inheritance (dominant, codominant, or recessive); and (e) the effect of a risk allele not canceling the effect of a protective allele in a RF analysis, as the RF algorithm does not assume any directionality for the risk association for each allele. Diaz-Uriarte et al. (4) showed that classification error rates of RF are equivalent to those produced by other classification methods including support vector machines and K-nearest neighbor. In addition, Diaz-Uriarte developed a RF-based gene selection procedure that matches or outperforms alternative methods by selecting fewer genes with equal or lower prediction errors (4). A pathway approach analysis using RF has been applied successfully to gene expression data by Pang et al. (5). The current analysis applies a similar strategy to analyze associations between SNPs and glioblastoma status in a pilot case-control study to evaluate the feasibility of using RF as an analytic tool to study SNP-disease association.
Materials and Methods
This study was approved by the University of California, San Francisco Committee on Human Research. All participants signed detailed consent forms in accordance with the Helsinki Doctrine.
Genotyping
Genotyping was done using the commercially available ParAllele (now part of Affymetrix) SNP panel, which contains 10,177 nonsynonymous coding SNPs; this represented all such known SNPs spanning the genome that could be accurately genotyped with the ParAllele genotyping method at the time of the study (8). A list of the 10,177 SNPs on the assay panel is provided in Supplementary Table S1.
Statistical and Bioinformatic Methods
Choosing SNPs within Pathways. Because this analysis is a feasibility study for using RF to analyze SNPs, we only selected a few pathways that have some biological plausibility for being involved in gliomagenesis. The 10 pathways chosen included several commonly involved in cancers [phase I and II carcinogen metabolism (9, 10), DNA repair (11), cell cycle (12), apoptosis (13), cell adhesion (14, 15), and calcium signaling (16)], two pathways that are dysregulated in brain tumors [mitogen-activated protein kinase signaling (17) and WNT signaling (17)], and two immune function pathways potentially involved in gliomagenesis [transforming growth factor-β (18) and interleukin-6 (19)]. We then selected genes and SNPs from the panel belonging to those pathways using the review article by Wood et al. (11) for the DNA repair pathway, and pathway websites including KEGG (20), BioCarta (21), and GenMAPP (22) for the other pathways. We also used an interactive web resource, called SNPLogic, which we developed to help identify and categorize genes and SNPs potentially related to glioma (23). SNPLogic provides an integrated view across multiple pathway resources together with a variety of SNP annotations, haplotype information, and functional predictions.
Identification of Important Pathways. Within each pathway, RF analyses were done with SNPs using the randomForest R package version 4.5-18 (by Liaw and Wiener), available through the Comprehensive R Archive Network (CRAN) website. Of the 227 SNPs examined in this analysis (33 SNPs belong to more than one pathway), 149 SNPs had some missing values. Among these 149 SNPs, 134 (90%) had <5% missing values. Missing values of SNPs were imputed based on the proximity measure (2, 24), and this procedure was carried out using the rfImpute function in the randomForest R package.
RF is a tree-based classification algorithm similar to Classification and Regression Tree (CART; refs. 1, 2, 24). In contrast to CART, which builds only one classification tree, RF builds a collection of trees to produce a more stable prediction error (1, 2, 24); 20,000 trees were built for each pathway in the current analysis. RF builds each individual tree by taking a bootstrap sample (sampling with replacement) of the original data, and on average, about one third of the original data are not sampled [out of bag (OOB)]. Those sampled are used as the training set to grow the trees, and the OOB data are used as the test set. At each node of the tree, a random sample of m of the total M variables is chosen and the best split is found among the m variables. The default value for m in the randomForest R package is the square root of M. In the current analysis, we tested a range of m from half of the square root of M to two times the square root of M and used the m that gave the lowest prediction error. Each tree in the RF analysis is grown fully without pruning. Each classification tree of the forest gets one vote for each OOB observation, and for each observation the class (case versus control status in our study) with the most votes is the RF prediction. The OOB error rate is the percentage of time the RF prediction is incorrect. In addition to OOB error rate, RF also produces importance scores that can be used to rank variables. The importance scores are determined by permuting the values of each predictor variable in the testing set; the more important a variable is, the larger the increase in the OOB error rate will be due to the permutation.
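The two numerical statements in the paragraph above can be checked with a short sketch. It is illustrative Python, not the R code used in the study: the function names are hypothetical, and rounding the half and double square-root bounds down to integers is an assumption, since the text does not say how the tested values of m were rounded.

```python
import math


def expected_oob_fraction(n):
    """Chance that one given observation is never drawn in a bootstrap sample of size n."""
    if n < 1:
        raise ValueError("n must be a positive integer")
    # Each of the n draws misses the observation with probability 1 - 1/n.
    return (1.0 - 1.0 / n) ** n


def mtry_bounds(total_variables):
    """Return (low, default, high) values of m for a pathway with M variables.

    default is floor(sqrt(M)); low and high are half and twice the square
    root, rounded down here (the rounding is an assumption for illustration).
    """
    if total_variables < 1:
        raise ValueError("total_variables must be a positive integer")
    root = math.sqrt(total_variables)
    low = max(1, math.floor(root / 2))
    default = max(1, math.floor(root))
    high = min(total_variables, math.floor(2 * root))
    return low, default, high


# The OOB fraction approaches exp(-1), roughly 0.368, as n grows,
# which is the "about one third" quoted in the text.
oob = expected_oob_fraction(1000)

# For a pathway with 57 SNPs this gives the candidate range for m.
low, default, high = mtry_bounds(57)
```

Under these assumptions, a pathway with 57 SNPs would be tuned over values of m between 3 and 15, with 7 as the default.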
For each pathway with a prediction error <50%, 100 data sets were generated in which the relationship between the case-control status and the SNPs was randomly shuffled. RF analysis was done with each of the 100 data sets to generate a null distribution of the prediction error; the P value was then determined as the percentage of these null prediction errors that were equal to or lower than the prediction error of the original data set.
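The P value described here is a simple proportion, which the following sketch makes explicit. It is a minimal Python illustration with a hypothetical function name; the error rates in the usage lines are made-up numbers, not study data.

```python
def permutation_p_value(observed_error, null_errors):
    """Share of permuted-label prediction errors at or below the observed error."""
    null_errors = list(null_errors)
    if not null_errors:
        raise ValueError("null_errors must contain at least one value")
    # Count shuffled data sets that classify at least as well as the real data.
    at_or_below = sum(1 for error in null_errors if error <= observed_error)
    return at_or_below / len(null_errors)


# Hypothetical example: 10 shuffled data sets, 2 of which reach the observed error.
null = [0.38, 0.40, 0.45, 0.50, 0.52, 0.55, 0.47, 0.49, 0.51, 0.53]
p = permutation_p_value(0.40, null)  # 2 of 10, i.e. 0.2
```

With 100 shuffled data sets the smallest nonzero P value is 0.01, and a P value of 0.09 would correspond to 9 of the 100 shuffled data sets having a prediction error at or below that of the original data.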
Identification of Important SNPs. For the pathways with a prediction error rate better than chance (prediction error <50% and P ≤ 0.10), further analyses were done using varSelRF R package (4) to reduce the number of SNPs down to the "best" set. Because this is a pilot study with a small sample size mainly used to evaluate the feasibility of using RF to analyze SNPs by pathways, it was felt that P ≤ 0.10 was a reasonable cutoff for further investigating a pathway. The best set of SNPs was determined using an iterative process of fitting RF and dropping the lowest ranked SNP. The smallest set of SNPs with the lowest OOB error rate was considered the best set. Although OOB error rate was used to select the best set of SNPs, it cannot be used as the prediction error rate when the iterative fitting of RF is done because this process leads to overfitting, causing downward bias of the prediction error rate (4). The prediction error rate was estimated by the .632+ bootstrap method using 200 bootstrap samples to produce an unbiased estimate of the prediction error (4, 25). In addition, the stability of the selected SNPs was measured by the frequency of their inclusion in the best set of SNPs by each of the 200 bootstrap samples (4). We also conducted logistic regression of case-control status with SNPs using a log-additive model, adjusting for age and gender, to determine P values and compared ranking of SNPs by logistic regression and RF. SNPs ranked higher by RF than logistic regression may suggest interaction. Such interaction between SNPs was subsequently tested by logistic regression. The P value for interaction was derived by log-likelihood ratio test comparing the full model with interaction terms with the submodel without the interaction terms.
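The likelihood ratio test at the end of this paragraph can be sketched as follows. This is an illustrative Python version that assumes a single SNP-by-SNP product term, so that the test statistic is compared with a chi-square distribution with 1 degree of freedom; the text does not state the number of interaction terms, and the function name and log-likelihood values are hypothetical.

```python
import math


def interaction_lrt_p_value(loglik_full, loglik_reduced):
    """Likelihood ratio test for one interaction term (1 degree of freedom assumed).

    loglik_full:    maximized log-likelihood of the model with the interaction term
    loglik_reduced: maximized log-likelihood of the submodel without it
    Returns the test statistic and its P value.
    """
    statistic = 2.0 * (loglik_full - loglik_reduced)
    # Nested models cannot fit worse with more terms; guard against rounding noise.
    statistic = max(statistic, 0.0)
    # Upper tail of a chi-square with 1 df equals erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(statistic / 2.0))
    return statistic, p_value


# Hypothetical log-likelihoods from two fitted logistic regression models.
stat, p = interaction_lrt_p_value(-100.0, -101.9207)
```

A statistic near 3.84 gives a P value near 0.05, the familiar 1-degree-of-freedom threshold. A model with more than one interaction term would need the chi-square distribution with the matching degrees of freedom instead.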
Results
Among the 10 pathways examined, four pathways (DNA repair, mitogen-activated protein kinase signaling, calcium signaling, and transforming growth factor-β pathways) had a prediction error <50% in classifying case-control status (Table 1); however, only the prediction error of the DNA repair pathway had a P value of <0.10 (P = 0.09).
Discussion
Rs1047840 is a nonsynonymous SNP located in the coding region of EXO1, a double-stranded DNA exonuclease (28). The polymorphism results in a dramatic amino acid change from a negatively charged glutamate to a positively charged lysine residue. This change could potentially have an effect on internal structure or a protein-protein binding interface. Rs12450550 is a nonsynonymous SNP located in the coding region of EME1 within the crossover junction endonuclease domain. EME1 interacts with MUS81 to form a DNA structure-specific endonuclease implicated in DNA repair (29). The resulting amino acid change is, on its own, relatively conservative, from an aliphatic isoleucine to a polar threonine; however, this SNP disrupts a potential binding site for the transcription factor MYB according to Delta-MATCH (dif-z score = 0.2; dif-z score predicts the effect on transcription factor site binding due to allele substitution), a computer program that predicts the importance of SNPs in the transcription factor binding sites (30). It is also in linkage disequilibrium with rs3744526 (pairwise Tagger with r² > 0.8), which enhances a potential binding site for the MSX1 transcription factor (dif-z score = –0.3148; ref. 30).
An attractive feature of the RF method is its ability to account for interaction between SNPs. Lunetta et al. (3) showed that RF has greater power than Fisher's exact test to detect important SNPs when there are SNP-SNP interactions. Among the SNPs we analyzed for the DNA repair pathway (Table 2), neither rs799917 nor rs1799966 of BRCA1 had a strong main effect, but both received a high ranking (second and fourth among all 57 DNA repair SNPs, respectively) by the RF. The tests for interaction between rs799917 and rs1047840 and between rs1799966 and rs1047840 by unconditional logistic regression were statistically significant. Rs799917 is a nonsynonymous SNP located in the coding region of BRCA1, another critical DNA repair gene. The polymorphism leads to an amino acid change from proline to leucine at position 871 in the BRCA1 protein. This is a nonconservative change as proline conveys unique structural properties to the polypeptide. Furthermore, this polymorphism lies in the middle of a strongly conserved region of the gene as measured by phastCons analysis of >28 species (31). Rs799917 is also in linkage disequilibrium with rs1799966. Notably, rs1799966 is a nonsynonymous SNP in the coding region of the COOH-terminal domain of BRCA1, referred to as BRCT. A recent study has shown that the BRCT homologue in yeast, Brc1, mediates suppression of the Smc6-74 allele in concert with Exo1 and Eme1 (27). This suppression is essential for homologous recombination in processes such as repairing double-stranded breaks in DNA. The interaction between rs1799966 and rs12450550 of EME1 was suggestive, but not statistically significant, in the current analysis.
In the current analysis, rs9352 of CHAF1A was repeatedly included (78% of the 200 bootstrap samples) in the best set of SNPs. A recent study showed that another SNP (rs243356) of CHAF1A was associated with glioma risk (32). This suggests that CHAF1A may contribute to gliomagenesis and warrants further investigation.
In this study, genes were grouped by pathways for analysis. Analyzing SNP data at a pathway level may have several advantages over analysis of single SNPs or multiple SNPs within a single gene. Because the SNPs were grouped by pathways, the number of comparisons was greatly reduced (227 SNPs versus 10 pathways in the current analysis), decreasing the probability of false positives. In addition, grouping the SNPs by biological pathways allows for a biologically meaningful interpretation of the results. Finally, it is often difficult to replicate the findings of individual SNPs or haplotypes due to different linkage disequilibrium patterns or different allele frequencies (important when the SNPs being studied are not causal) among different study populations. Thus, it may be more feasible and meaningful to perform replication at the level of biological pathways, although there have been too few studies using this type of pathway analysis to show this.
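The effect of reducing 227 comparisons to 10 can be illustrated with a rough calculation. The sketch below assumes independent tests at a nominal level of 0.05, which is an idealization introduced here for illustration only: SNPs within a pathway are correlated and some belong to more than one pathway, and the study itself does not report such a calculation. The function names are hypothetical.

```python
def familywise_error_rate(alpha, n_tests):
    """Chance of at least one false positive among n independent tests at level alpha."""
    return 1.0 - (1.0 - alpha) ** n_tests


def bonferroni_threshold(alpha, n_tests):
    """Per-test significance level that keeps the familywise rate at or below alpha."""
    return alpha / n_tests


# Testing 10 pathways versus 227 individual SNPs (independence assumed).
fwer_pathways = familywise_error_rate(0.05, 10)   # about 0.40
fwer_snps = familywise_error_rate(0.05, 227)      # very close to 1
```

Under these simplifying assumptions, a Bonferroni-style per-test threshold would be 0.005 for 10 pathways, compared with roughly 0.0002 for 227 SNPs.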
Several limitations of this study should be noted: (a) Because this was a pilot study, the genotyped SNPs were from a commercially available SNP chip that was not specifically designed for detecting important SNPs associated with glioma. Furthermore, the limited inclusion of genes on the panel precluded us from analyzing some important pathways associated with glioma (e.g., pathways of allergic disorders such as IL-4 and IL-13). In addition, SNPs on the assay panel only included nonsynonymous SNPs, and thus the coverage for variation in each gene was far from complete. Furthermore, even the number of nonsynonymous SNPs in the assay was lower than that on more recent panels. The results of our analysis do depend on the completeness of the genes and SNPs included in the pathway because the function of a gene may depend on its interaction with other genes in the same pathway; therefore, the null findings with some of the pathways examined by this study do not preclude their importance in gliomagenesis. (b) The small sample size may have limited the statistical power. (c) We did not adjust for genetic ancestry to account for potential population stratification, although the inclusion of only Caucasian subjects makes an effect of population stratification less likely (33).
Despite less than complete inclusion of genes and SNPs relevant to glioma and a small sample size for this pilot study, RF analysis was able to identify a potentially important biological pathway that distinguished glioblastoma cases and controls better than chance. By incorporating information on biological pathways and using statistical methods that can account for interaction between genes or SNPs, the power for detecting gene-disease association can be increased.
Footnotes
Note: Supplementary data for this article are available at Cancer Epidemiology, Biomarkers & Prevention Online (http://cebp.aacrjournals.org/).
Received 11/26/07; revised 3/4/08; accepted 3/20/08.