
Short Communication |
1 Program in Epidemiology, Fred Hutchinson Cancer Research Center and 2 Department of Epidemiology, University of Washington, Seattle, Washington
Requests for reprints: Chu Chen, Program in Epidemiology, Fred Hutchinson Cancer Research Center, P.O. Box 19024, Mailstop M5-C800, 1100 Fairview Avenue North, Seattle, WA 98109-1024. Phone: 206-667-6644; Fax: 206-667-2537. E-mail: cchen{at}fhcrc.org
| Abstract |
|---|
| Introduction |
|---|
SIFT uses sequence homology among related genes and domains across species to predict the impact of all 20 possible amino acids at a given position, allowing users to determine which nsSNPs would be of most interest to study by sorting variants by this prediction score (1, 2, 4-9). The SIFT algorithm has been shown to predict a phenotype for a nsSNP more accurately than previously used substitution scoring matrices, such as BLOSUM62, as these matrices do not incorporate information specific to the protein of interest (4, 10). Another advantage to using SIFT is the potential to analyze a larger number of nsSNPs than methods that are dependent on the availability of protein structure alone (4, 11).
The PolyPhen algorithm, like SIFT, takes an evolutionary approach to distinguishing deleterious nsSNPs from functionally neutral ones. PolyPhen differs from SIFT in that it predicts how damaging a particular variant may be by using a set of empirical rules based on sequence, phylogenetic, and structural information characterizing that variant. In addition to using sequence alignments, PolyPhen utilizes protein structure databases, such as PDB (Protein Data Bank) or PQS (Protein Quaternary Structure), DSSP (Dictionary of Secondary Structure in Proteins), and three-dimensional structure databases to determine if a variant may have an effect on the protein's secondary structure, interchain contacts, functional sites, and binding sites (3, 7-9).
These two algorithms have great potential for screening potentially damaging nsSNPs in genes that may be associated with disease risk, including the risk of hormone-related cancers such as breast, prostate, and endometrial cancer. Interindividual variation in the biosynthesis and metabolism of sex hormones has been shown in population studies and may be related to risk of these types of cancers. For instance, Dunning et al. (12) found that circulating levels of several steroid hormones, such as estrogen, androgen, and their precursors, are directly related to the risk of breast cancer. There is evidence that variation in circulating levels of steroid hormones may be associated with polymorphisms of genes involved in the steroid hormone metabolism and response pathways (13-15). In this study, we screened 137 nsSNPs identified in genes in these pathways, using the SIFT and PolyPhen algorithms, to predict which of these nsSNPs would most likely have a damaging effect on the function of the encoded proteins.
| Materials and Methods |
|---|
Obtaining nsSNP Tolerance Scores with SIFT and PolyPhen
For each gene of interest, we analyzed the protein sequence identified from dbSNP using the SIFT-version 2 database. In general, for the peptide sequence, SIFT performs multiple alignments of a number of sequences until a median conservation for the peptide sequence is reached at the default of 3.0 and predicts whether substitution with any of the other amino acids is tolerated or deleterious for every position in the submitted sequence. The SIFT prediction is given as a tolerance index (TI) score ranging from 0.0 to 1.0, which is the normalized probability that the amino acid change is tolerated. A nsSNP with a TI score of ≤0.05 is considered to be deleterious. The confidence measure of this result is given as the median sequence conservation score (MSCS) for the given codon. Deleterious predictions with a MSCS ≥3.25 were made with low confidence because the sequences used to determine the score were not diverse enough.
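The two SIFT cutoffs described above can be written as a small decision rule. The following Python sketch is illustrative only and is not part of SIFT: the names `SiftCall` and `classify_sift` are hypothetical, and the inequality directions are taken from the thresholds used elsewhere in this article (tolerated when TI > 0.05, scored with confidence when MSCS < 3.25).

```python
from dataclasses import dataclass

# Cutoffs as read from the article; the constant names are hypothetical.
TI_DELETERIOUS_MAX = 0.05       # TI <= 0.05 is treated as deleterious
MSCS_LOW_CONFIDENCE_MIN = 3.25  # MSCS >= 3.25 is treated as low confidence


@dataclass(frozen=True)
class SiftCall:
    prediction: str   # "deleterious" or "tolerated"
    confident: bool   # False when the aligned sequences were not diverse enough


def classify_sift(ti: float, mscs: float) -> SiftCall:
    """Turn a tolerance index (TI) and median sequence conservation score
    (MSCS) into a call, using the two cutoffs above."""
    if not 0.0 <= ti <= 1.0:
        raise ValueError("TI is a normalized probability and must lie in [0, 1]")
    prediction = "deleterious" if ti <= TI_DELETERIOUS_MAX else "tolerated"
    confident = mscs < MSCS_LOW_CONFIDENCE_MIN
    return SiftCall(prediction, confident)
```

Under these assumptions, a variant with TI = 0.01 and MSCS = 3.0 would be called deleterious with confidence, whereas the same TI with MSCS = 3.40 would be flagged as a low-confidence call.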
We adjusted the PolyPhen algorithm settings to include all sequences homologous to the peptide sequence of interest for the calculation of the difference in structural parameters instead of using the single most homologous sequence as a default. All other settings of PolyPhen were set as the default. Predictions of how a particular nsSNP may affect protein structure by PolyPhen are assigned as "probably damaging," a score made with high confidence that the nsSNP should affect protein structure and/or function; "possibly damaging," where it may affect protein function and/or structure; and "benign," as most likely having no phenotypic effect (3). It arrives at these predictions by determining the position-specific independent count difference of the two allelic variants in the polymorphic position and also measures the degree of damaging effect a variant may have on structural parameters of a protein. The position-specific independent count is a logarithmic ratio of the likelihood of a given amino acid occurring at a particular position to the likelihood of the same amino acid occurring at any position in the sequence, also known as background frequency. Thus, PolyPhen, as mentioned before, uses a set of empirical rules based on sequence, phylogenetic, and structural information to determine if protein structure may be compromised.
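The verbal definition of the position-specific independent count (PSIC) above can be written compactly. The notation below is our own shorthand rather than PolyPhen's: $p(a, i)$ denotes the likelihood of amino acid $a$ occurring at position $i$, $q(a)$ its background frequency, and $a_1$, $a_2$ the two allelic variants. The natural logarithm and the absolute value in the difference are assumptions made for illustration.

```latex
\mathrm{PSIC}(a, i) = \ln \frac{p(a, i)}{q(a)},
\qquad
\Delta\mathrm{PSIC}(i) = \bigl|\, \mathrm{PSIC}(a_1, i) - \mathrm{PSIC}(a_2, i) \,\bigr|
```

Read this way, a large difference indicates that one allele is much less expected than the other at that position given the aligned homologous sequences.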
| Results |
|---|
≤0.05 are considered to be deleterious (TI ≤ 0.05) and SIFT scores with low confidence (MSCS ≥ 3.25) are in parentheses. Of the 137 nsSNPs screened with SIFT and PolyPhen, the 111 nsSNPs that were scored with confidence by SIFT (MSCS < 3.25) were used in determining concordance between SIFT and PolyPhen.
≤0.05 with confidence; PolyPhen: probably damaging/possibly damaging) and 50 (45%) were predicted to be tolerated variants (TI > 0.05 with confidence; benign). The remaining 30 nsSNPs (27%) had conflicting results between the SIFT and PolyPhen algorithms.
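The concordance arithmetic can be checked with a short script. This is a sketch under stated assumptions rather than the authors' analysis code: the function names are hypothetical, the count of variants called deleterious by both tools (31) is derived here as 111 - 50 - 30 rather than quoted from the text, and the direction of each conflicting call is a placeholder because the per-variant calls are not reproduced in this section.

```python
from collections import Counter

POLYPHEN_DAMAGING = {"probably damaging", "possibly damaging"}


def agreement(sift_call: str, polyphen_call: str) -> str:
    """Label one nsSNP by whether SIFT and PolyPhen agree on it."""
    sift_deleterious = sift_call == "deleterious"
    polyphen_damaging = polyphen_call in POLYPHEN_DAMAGING
    if sift_deleterious and polyphen_damaging:
        return "both_deleterious"
    if not sift_deleterious and not polyphen_damaging:
        return "both_tolerated"
    return "conflicting"


def concordance_summary(pairs):
    """Count agreement labels and report percent concordance."""
    counts = Counter(agreement(s, p) for s, p in pairs)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no variants to compare")
    concordant = counts["both_deleterious"] + counts["both_tolerated"]
    return {
        "total": total,
        "concordant": concordant,
        "conflicting": counts["conflicting"],
        "percent_concordant": round(100 * concordant / total),
    }


# Placeholder data that reproduces only the reported totals.
example_pairs = (
    [("deleterious", "probably damaging")] * 31  # derived: 111 - 50 - 30
    + [("tolerated", "benign")] * 50
    + [("deleterious", "benign")] * 30           # conflict direction is assumed
)
summary = concordance_summary(example_pairs)
print(summary)
```

With these counts, 81 of the 111 confidently scored nsSNPs are concordant, which matches the 73% concordance reported in the Discussion.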
| Discussion |
|---|
≤0.05; probably damaging/possibly damaging). It would be interesting to study these particular variants in an epidemiologic study to assess possible associations with the risk of hormone-related cancers. Of the 111 nsSNPs scored with confidence by SIFT, 30 nsSNPs (27%) were found to have conflicting results. A possible explanation as to why SIFT and PolyPhen may differ in their prediction scores could be that PolyPhen, in addition to using sequence information, incorporates structural information of the protein (obtained from PQS and DSSP).
Several studies have used SIFT and/or PolyPhen to assess function of nsSNPs in candidate genes for association studies, but none have reported such assessment of the steroid hormone metabolism pathway (2, 9). Based on an evaluation of published associations of nsSNPs with cancer risk, Zhu et al. (1) concluded that the risk of a number of cancers was significantly inversely correlated with SIFT predictions. Their results also suggest that variants that occur in conserved sequences are more likely to be associated with cancer susceptibility. Two studies thus far have combined both the SIFT and PolyPhen algorithms to screen for deleterious nsSNPs. In one paper, Xi et al. (7) examined the DNA repair pathways and found that the predictions made by the two programs are highly correlated, with 62% concordance in the predictions of deleterious or potentially damaging nsSNPs. We observed a similar concordance of 73% between SIFT and PolyPhen predictions, but in genes involved in steroid hormone metabolism and response. In a second, more recent, study, Livingston et al. (8) used SIFT and PolyPhen to identify 57 potentially deleterious nsSNPs involved in DNA repair, cell cycle regulation, apoptosis, and drug metabolism.
Although useful, SIFT and PolyPhen are nonetheless somewhat limited, because both require available sequence data. However, as more whole genome sequences become publicly available, prediction of functionality of more nsSNPs will become possible and the results more reliable (5, 9). When there are too few sequences available on a particular gene for comparison, for example, the PGR gene, or when the sequences seem to be too homologous to one another, a nonfunctional variant may be predicted to be "intolerant" (6, 9). When the MSCS is >3.25, Savas et al. (2) interpreted the SIFT predictions as "possibly affecting" or "possibly tolerated." However, like Ng and Henikoff (4, 5), we chose to disregard nsSNPs that were scored with low confidence by SIFT until more sequence data become publicly available to allow more reliable predictions to be made. Another limitation of these algorithms is that the impact of a combination of variants is not assessed, and the dependence of the functional impact of a variant on the genotype of other genes or on exposure risk is not addressed (9). One final restriction of SIFT and PolyPhen is that the algorithms are unable to predict the impact of SNPs that occur outside of the coding region, such as promoter and enhancer regions, and splice sites that may affect protein levels or protein function.
To those conducting large-scale population-based epidemiologic studies, the idea of prioritizing nsSNPs in the investigation of association of SNPs with disease risk is of great interest. The use of SIFT and PolyPhen to select potentially intolerant nsSNPs for epidemiology studies can be an efficient way to explore the role of genetic variation in disease risk and to contain cost. Furthermore, predicted impact of these nsSNPs can be tested with the use of animal models and/or cell line systems to determine if functionality of the protein has indeed been altered.
| Footnotes |
|---|
The costs of publication of this article were defrayed in part by the payment of page charges. This article must therefore be hereby marked advertisement in accordance with 18 U.S.C. Section 1734 solely to indicate this fact.
Received 11/8/04; revised 12/28/04; accepted 1/6/05.
| References |
|---|