
1 Fred A. Litwin Centre for Cancer Genetics, Samuel Lunenfeld Research Institute, 2 Department of Pathology and Laboratory Medicine, Mount Sinai Hospital, Ontario, Canada and 3 Department of Laboratory Medicine and Pathobiology, University of Toronto, Ontario, Canada
Requests for reprints: Hilmi Ozcelik, Mount Sinai Hospital Samuel Lunenfeld Research Institute, 600 University Avenue Room 992A, Toronto, ON M5G 1X5, Canada. Phone: (416) 586-4996; Fax: (416) 586-8869. E-mail: ozcelik{at}mshri.on.ca
| Abstract |
|---|
|
|
|---|
| Introduction |
|---|
|
|
|---|
To date, linkage analysis has been highly successful in identifying high-penetrance cancer predisposition genes. The remaining challenge is to identify alleles that confer low to moderate cancer risk. It is hypothesized that genetic variation contributes to susceptibility to complex traits such as cancer (4–6). Molecular epidemiological and genetic approaches use single nucleotide polymorphisms (SNPs) in the human genome to study disease susceptibility. Because genome-wide scans are still challenging, a candidate gene/pathway approach may often prove more efficient. Given the enormous number of SNPs, systematic prioritization on the basis of biological function and relevance to cancer will accelerate the identification of such susceptibility alleles (4).
SNPs are the most common form of genetic variation in the human genome (5–8). They are relatively stably inherited genomic variations with an estimated density of 1 in 1000 bp. SNPs are usually bi-allelic, their occurrence rates vary across genomic regions, and their allelic frequencies may differ among ethnic groups. A fraction of SNPs alter the encoded amino acid sequence (non-synonymous SNPs; nsSNPs) and have the potential to affect the structure, function, and interactions of proteins. Thus, nsSNPs are excellent candidates for candidate-gene association studies (7). However, not all nsSNPs are anticipated to have functional consequences, so it is essential to develop strategies to select the variations that may alter or disrupt the proper function of the proteins. Studying the functional consequences of genetic variants has been challenging because of the enormous number of variants present in the genome. Although there is an increasing effort to establish in vivo functional strategies for studying the effects of variants, such strategies are still far from being available for a large number of variants of interest. Recently, several approaches have been developed and used to study the nature of genetic variants (9–15). Among these, computational tools provide an efficient and high-throughput source of candidates for in vivo functional analyses and/or population studies. SIFT (Sorting Intolerant From Tolerant) (10, 11) is a powerful tool that predicts the functional importance of an amino acid based on the alignment of highly similar proteins (orthologous, paralogous, or both) with the protein of interest. The predictions rely on whether an amino acid is conserved (or substituted only by a similar amino acid) in the protein family, which can suggest its importance for the function/structure of the protein.
Here, using the public SNP databases, we have identified a wide range of DNA repair nsSNPs, and we have carried out a computational study to characterize the evolutionary importance of these DNA repair nsSNPs. This study has the potential to provide a pool of functional SNPs, which may play important roles in the predisposition to cancer as well as other DNA repair-associated genetic diseases.
| Methods |
|---|
|
|
|---|
Mutation Data Set
Mutations with known functional consequences were retrieved from the SWISS-PROT database (21; see footnote 11) using the key words "Human AND mutations AND functional" in September 2002. Following a manual inspection, a total of 231 mutations in 55 human genes constituted the final list. The mutations in this list were characterized by complete/partial loss of activity, gain of function, effects on protein-macromolecule interactions, interference with cellular localization of the mutant protein, altered protein stability, alteration of a protein-critical site, or interference with protein dimerization, as indicated in the feature table of each SWISS-PROT entry.
Evolutionary Conservation Analysis
Protein conservation analysis was performed using the SIFT software (see footnote 12) developed by Ng and Henikoff (10, 11). The SIFT algorithm predicts whether an amino acid substitution may have an impact on protein function by aligning similar proteins and calculating a score that is used to determine the evolutionary conservation status of the amino acid of interest. It evaluates identity (for example, if only a single amino acid is observed at that position in all aligned proteins, its alteration is predicted to affect the protein) and two physicochemical characteristics of amino acids, hydrophobicity and polarity (if the substituted amino acid differs in these characteristics from the wild-type amino acid, and this kind of substitution is not observed at that position in the other proteins in the alignment, it is likewise predicted to affect the protein). These predictions are based on the assumption that amino acids conserved within protein families are important for the function of the proteins. Whenever frequency information was available, the conservation analysis was performed for the common allele. Because we considered them unreliable, we did not consider SIFT predictions based on fewer than six proteins in the alignment. We used the default median sequence conservation value of 3.0. In no case was the median sequence conservation score found to be ≤2.25. However, there were many amino acid substitutions where the score was calculated as >3.25. Such scores indicate that substitutions at that position might not have had time to arise yet, and consequently, the prediction may be misleading (11). Thus, we designated the predictions with a median sequence conservation score of >3.25 as either possibly affecting or possibly tolerated. This evaluation differs from that of Ng and Henikoff (11), where such predictions were not accepted at all.
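The modified interpretation described above can be sketched as a small decision rule. This is only an illustration under stated assumptions, not the authors' code: the `SiftResult` container and its field names are hypothetical, and the 0.05 score cutoff is SIFT's usual default rather than a value given in this article.

```python
from dataclasses import dataclass

SIFT_SCORE_CUTOFF = 0.05        # assumption: SIFT's usual default, not stated in this article
MIN_SEQUENCES = 6               # from the text: fewer than six aligned proteins -> no prediction
MAX_MEDIAN_CONSERVATION = 3.25  # from the text: above this, predictions are only "possibly"


@dataclass
class SiftResult:
    """Hypothetical container for one SIFT output row."""
    sift_score: float            # normalized probability of the substitution
    n_sequences: int             # proteins in the alignment at this position
    median_conservation: float   # median sequence conservation score


def classify(result: SiftResult) -> str:
    """Apply the modified interpretation to a single substitution."""
    if result.n_sequences < MIN_SEQUENCES:
        return "no prediction"
    affecting = result.sift_score < SIFT_SCORE_CUTOFF
    low_divergence = result.median_conservation > MAX_MEDIAN_CONSERVATION
    if affecting:
        return "possibly affecting" if low_divergence else "affecting"
    return "possibly tolerated" if low_divergence else "tolerated"


if __name__ == "__main__":
    # Made-up example values, for illustration only.
    print(classify(SiftResult(sift_score=0.01, n_sequences=12, median_conservation=3.40)))
```

The point of the design is that a high median conservation score downgrades a prediction instead of discarding it, which is where this scheme departs from Ng and Henikoff (11).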
Statistical Analysis
The statistical analyses were done using a χ² test (22). We applied the Yates correction for approximation of 2 x 2 tables. The test was conducted at the α = 0.05 level of significance. This test was applied to examine possible significant differences of the evolutionary conservation status of the amino acids altered in mutation and DNA repair nsSNP data sets, and between the rare and common DNA repair nsSNPs.
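A Yates-corrected test on a 2 x 2 table, as described above, can be computed with the standard library alone. The sketch below is a generic textbook formulation, not the authors' procedure, and the counts in the usage example are invented for illustration; the function name `yates_chi_square` is hypothetical.

```python
import math
from typing import Tuple


def yates_chi_square(a: int, b: int, c: int, d: int) -> Tuple[float, float]:
    """Return (statistic, P value) for the 2 x 2 table [[a, b], [c, d]]."""
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    if 0 in (row1, row2, col1, col2):
        raise ValueError("every row and column total must be non-zero")
    # Yates continuity correction: shrink |ad - bc| by n/2, never below zero.
    corrected = max(abs(a * d - b * c) - n / 2, 0.0)
    statistic = n * corrected ** 2 / (row1 * row2 * col1 * col2)
    # With 1 degree of freedom, the upper-tail probability is erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(statistic / 2))
    return statistic, p_value


if __name__ == "__main__":
    # Invented counts: rows = rare/common, columns = conserved/not conserved.
    statistic, p_value = yates_chi_square(10, 20, 30, 40)
    print(f"chi-square = {statistic:.4f}, P = {p_value:.4f}")
    print("significant at alpha = 0.05:", p_value < 0.05)
```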
| Results |
|---|
|
|
|---|
In this study, we have used a modified interpretation of the SIFT algorithm results to define the nature of the variations (see "Methods"). To determine the sensitivity of the modified SIFT interpretation, we have used a panel of 231 missense mutations supported with functional evidence (see "Methods"; Table 1). Except for one mutation, the number of proteins in all the alignments was at least six (n = 230). Mutations in this group were predicted as either damaging (57.39%) or possibly damaging (19.13%), whereas 17.83% and 5.65% of the mutations were predicted as tolerated or possibly tolerated, respectively. Thus, the sensitivity of the modified SIFT predictions (damaging together with possibly damaging) reported in this study was 76.52%.
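As a quick arithmetic check of the sensitivity figure, the reported percentages can be summed directly. The sketch uses only the four percentages quoted above; the dictionary and function names are illustrative.

```python
# Percentages reported in the text for the 230 mutations with usable alignments.
predicted = {
    "damaging": 57.39,
    "possibly damaging": 19.13,
    "tolerated": 17.83,
    "possibly tolerated": 5.65,
}


def sensitivity(percentages: dict) -> float:
    """Share of known functional mutations flagged as damaging or possibly damaging."""
    return round(percentages["damaging"] + percentages["possibly damaging"], 2)


if __name__ == "__main__":
    print("sensitivity (%):", sensitivity(predicted))
    print("all categories (%):", round(sum(predicted.values()), 2))
```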
Frequency information was available for 102 (68.0%) of the 150 nsSNPs (see footnote 13), either in the SNP databases or in Mohrenweiser et al. (20) (herein called validated/proven SNPs). For 68 of the validated nsSNPs, there were reliable SIFT predictions (Table 1). We classified the nsSNPs as rare or common if the frequencies of the minor allele fell in the ≤5% and >5% ranges, respectively. There were a total of seven nsSNPs (6.86%) that were reported in independent submissions as both common and rare according to our classification. In the remaining cases, we have categorized 76.47% (78 of 102) and 16.67% (17 of 102) as rarely and commonly occurring nsSNPs, respectively. The comparisons of evolutionary conservation status of the amino acids and the frequency ranges of the nsSNPs substituting these amino acids are depicted in Table 2. In addition, there were >20 nsSNPs in our set with minor allele frequencies ≥10% in at least one submission (see http://www.ozceliklab.com/savas2004a/).
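The rare/common bookkeeping, including nsSNPs reported on both sides of the cutoff in independent submissions, might look like the sketch below. It assumes that "rare" means a minor allele frequency at or below 5% and "common" means above 5%; the function name and the example frequencies are hypothetical.

```python
from typing import Iterable

COMMON_CUTOFF = 0.05  # assumption: rare at or below 5%, common above 5%


def classify_frequency(minor_allele_frequencies: Iterable[float]) -> str:
    """Classify one nsSNP from the minor allele frequencies of its submissions."""
    labels = {
        "common" if maf > COMMON_CUTOFF else "rare"
        for maf in minor_allele_frequencies
    }
    if not labels:
        return "no frequency data"
    if len(labels) == 2:
        return "both"  # rare in one submission, common in another
    return labels.pop()


if __name__ == "__main__":
    # Invented frequencies, for illustration only.
    print(classify_frequency([0.01, 0.03]))
    print(classify_frequency([0.02, 0.08]))
```

The three categories reported in the text (7 both, 78 rare, 17 common) add up to the 102 nsSNPs with frequency information.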
| Discussion |
|---|
|
|
|---|
Among 1000 entries in the SNP databases, we have extracted a total of 150 nsSNPs resulting in an amino acid substitution from 51.1% (45 of 88) of the DNA repair genes analyzed in this study. The number of SNPs in these genes is likely to increase as more SNPs are discovered and the SNP databases continue to be updated. Several factors may lead to underestimation of the number of SNPs in genes of interest. For example, a considerable number of SNPs in these databases are not validated to distinguish them from sequencing errors, and thus these nsSNPs represent "suspected" or "non-proven" SNPs. In terms of suspected SNPs, which are described based on DNA/RNA sequence alignments, there may be a bias toward genetic variations near the 3' end of the transcripts as well as toward abundant transcripts, common variations, and variations in less complex regions of the genome (24–26). Therefore, sequencing the entire coding region of the genes of interest in a significant number of DNA samples may reveal additional SNPs in these genes. Sequencing might especially help to demonstrate whether the genes found to have no nsSNPs during this study are really devoid of nsSNPs. This information could be useful for assessing the conservation status of the genes, or the different mutation/recombination rates at genomic regions containing the genes of interest (7, 8, 26).
Protein conservation analyses based on the alignment of similar proteins (either among species or within species) can reveal those amino acids that are important for the function, and probably for the structure, of the protein families. Although such analyses would not indicate newly evolved critical amino acids with a particular function, or amino acids that are under positive selection under today's conditions, they may still be critical in assigning evolutionarily conserved residues along the proteins. SIFT (10, 11) is an automated tool that calculates the conservation scores of each amino acid residue along the given protein sequences. Originally, the prediction sensitivity of SIFT for damaging amino acid substitutions was found to be 69% (10). Our SIFT predictions reported in this paper differ in some aspects from those of Ng and Henikoff (11). First, in this study, we have modified the SIFT predictions by only considering predictions that are based on at least six protein sequences in the alignment at the amino acid position of interest. Second, whenever the median sequence conservation was >3.25, Ng and Henikoff (11) did not accept any predictions (a median sequence conservation score >3.25 indicates that the proteins in the alignment have not yet diverged, and thus the predictions would not be as reliable as those obtained from alignments of diverged proteins, where conserved residues are more easily identified) (11). However, considering the fact that 19.13% of the mutations were also found with median sequence conservation scores of >3.25 (Table 1), we preferred to include such predictions in our results, only stating that they were either "possibly tolerated" or "possibly damaging."
The sensitivity of the modified SIFT prediction system was tested on a mutation set with experimentally determined functional consequences (see "Methods"). According to our results, it can be concluded that approximately 57.39% of the mutations occurred at amino acids that are conserved within the protein family in our set (median sequence conservation score 2.75–3.25). On the other hand, 19.13% of the mutations occurred either at regions of proteins that are highly conserved, or in proteins for which homologous proteins from only closely related species were available (median sequence conservation score >3.25). Further analyses may be performed to investigate the latter possibility. The mutations that were not detected by SIFT as damaging could be those that occurred at query-specific functional residues, or variations in linkage disequilibrium with yet unidentified causative mutations (10). As far as DNA repair genes are concerned, over one third of the nsSNPs turned out to be likely to have functional consequences (i.e., found damaging or possibly damaging). Eleven DNA repair nsSNPs were found damaging, suggesting that they are excellent candidates for disease-predisposition studies. Another 28 nsSNPs were predicted as possibly damaging. We suggest that, along with the damaging SNPs, these possibly damaging nsSNPs may also be good candidates for functional and association studies.
We were not able to make predictions for 44 DNA repair nsSNPs because of the lack of sufficient sequence information available from homologous proteins (<6 proteins in the alignment at the position of the nsSNPs). As these analyses depend on the availability of similar proteins in the public databases, we believe that as the number of curated proteins in protein databases increases, predictions will become possible for these nsSNPs, and the reliability of the predictions for other nsSNPs will also improve.
Classification of the proven (validated) nsSNPs based on allele frequencies showed that only 16.2% of the nsSNPs were present in the population(s) with an allele frequency of >5%, suggesting that most of the nsSNPs presented here are actually rare nsSNPs. These nsSNPs may be rare because they are either under negative selection, or newly evolved and thus not fixed in the population yet. None of the common nsSNPs investigated in this study were found to be truly damaging, whereas three of them were predicted to be possibly damaging (Table 3). We were unable to find any published reports regarding the analysis of the IGHMBP2-T671A variant, which was found to be possibly damaging in this study. IGHMBP2 (immunoglobulin µ binding protein 2) protein is presumably involved in a variety of cellular functions such as immunoglobulin-class switching, pre-mRNA processing, and transcription, and mutations in this protein have been shown to result in a neurodegenerative disease (27). On the other hand, the XRCC1-R399Q and XRCC3-T241M variants have been intensively studied in the context of cancer association. The XRCC1-R399Q SNP was shown to be associated with altered breast (28, 29) and lung (30) cancer risk. XRCC3-T241M has also been shown to confer increased risk of breast cancer (31; see footnote 14), bladder cancer (32), and melanoma (33). Both the XRCC1-R399 and XRCC3-T241 residues were conserved in mammalian orthologues, suggesting that they may be important for the function of these proteins (see footnote 15). There were two nsSNPs (ERCC4-P379S and XRCC1-R280H) for which the minor allele frequencies were reported as both lower and higher than the 5% cutoff. The ERCC4-P379S variation was reported in SNP databases as well as in the literature (23) as both rare and common in different sample sets. Our SIFT analysis predicted ERCC4-P379S to be damaging. To our knowledge, the association of this SNP with cancer risk has not been studied yet.
On the other hand, the XRCC1-R280H nsSNP was predicted to be possibly damaging by SIFT analysis and has already been found to be associated with nasopharyngeal carcinoma (34), prostate cancer (35), and lung cancer (36), and to have a role in mutagen sensitivity (37). There were 4 and 11 nsSNPs that were both rare and either damaging or possibly damaging, respectively (Table 3). A literature search for these nsSNPs did not reveal any association with cancer risk. To sum up, our strategy has the ability to select potentially disease-related SNPs, and we propose that the other nsSNPs found to be evolutionarily conserved during this study are good candidates for further cancer-association studies.
Mutations that reduce the fitness of individuals will be subject to purifying selection that eventually eliminates them from the gene pool of a population, and thus they never reach high frequencies (38), unless they confer a selective advantage because of disease resistance in carriers of such mutations (39). Therefore, we analyzed the common and rare DNA repair nsSNPs for their conservation status. As a result, we could not detect any statistically significant difference (P < 0.0001, Table 3). Thus, it is tempting to speculate that some deleterious nsSNPs with moderate to high frequencies do not reduce the fitness of individuals. In this context, the nature of such proteins with deleterious variations can be explained if (a) the protein's function can be compensated by other proteins, (b) the protein's function is required only under certain environmental exposures/conditions, or (c) the protein is a rapidly evolving one, thus accumulating more mutations without affecting the fitness of the individual. Alternatively, these new substitutions may be either neutral or even positively selected. Analysis of a much larger data set will be helpful to fully characterize the relation between frequency and conservation status of genetic variations.
Genetic variation has been suggested to alter disease-susceptibility risk. SNPs being the most common variation in the human genome have been extensively studied in the context of disease predisposition. SNPs that alter important molecular features such as the expression, function, structure, stability, and interaction of candidate proteins are excellent candidates to study a possible association/direct involvement of a SNP and a phenotypic expression. However, both the presence of an enormous number of SNPs and the search for biologically relevant SNPs in candidate gene approaches require the application of reliable and logical selection systems. Here we presented results obtained using a highly stringent SNP mining strategy and a modified version of the previously developed SIFT tool to select DNA repair nsSNPs that are conserved within the protein family. Our results suggest that more than one third of the nsSNPs in the DNA repair genes are likely to have functional consequences. These nsSNPs are excellent candidates for cancer association as well as for experimental functional studies. In addition, these genetic variations are likely to be critical in studies aiming to elucidate the disparity in cancer-treatment responses among patients as well as to improve the effectiveness of the cancer treatments (40).
| Acknowledgments |
|---|
| Footnotes |
|---|
The costs of publication of this article were defrayed in part by the payment of page charges. This article must therefore be hereby marked advertisement in accordance with 18 U.S.C. Section 1734 solely to indicate this fact.
2 Internet address: http://lpgws.nci.nih.gov/html-cgap/cgl/DNA_damage.html.
3 Internet address: http://www.ncbi.nlm.nih.gov/SNP/.
4 Internet address: http://hgvbase.cgb.ki.se/.
5 Internet address: http://lpgws.nci.nih.gov/.
6 Internet address: http://snp500cancer.nci.nih.gov/home.cfm.
7 Internet address: http://www.genome.utah.edu/genesnps/.
8 M. Edmenson, K. Buetow. The BLAST against gene transcripts tool (unpublished). Internet address: http://lpgws.nci.nih.gov:80/perl/blast2.
9 Internet address: http://www.ncbi.nlm.nih.gov/BLAST/.
10 Internet address: http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=unigene.
11 Internet address: http://us.expasy.org/sprot/.
12 Internet address: http://blocks.fhcrc.org/sift/SIFT_seq_submit2.html.
13 A small number of nsSNPs were screened in population(s) but could not be detected; we still report them because these nsSNPs may have escaped validation by being either ethnic group specific or rare.
14 J. C. Figueiredo, J. A. Knight, L. Briollais, I. L. Andrulis, H. Ozcelik. Polymorphisms XRCC1-R399Q and XRCC3-T241M and the risk of breast cancer at the Ontario site of the breast cancer family registry, in press.
15 J. C. Figueiredo, N. Diaz-Granados, J. A. Knight, S. Savas, L. Briollais, H. Ozcelik. XRCC1-R399Q and XRCC3-T241M: a systematic review of biological importance and role in cancer, in preparation.
Received 9/24/03; revised 12/4/03; accepted 12/24/03.
| References |
|---|
|
|
|---|