Analysis of Coding Region SNPs and Its Propensity to Cause Disease

Kulkarni, Vinayak Vaman

Files

kulkarnivinayak.pdf (659.44 KB)

Date

2007-12-17

Authors

Kulkarni, Vinayak Vaman

Abstract

Single nucleotide polymorphisms (SNPs) are the most abundant form of variation in the human genome. These variations are thought to cause disease, alter response to treatment, or affect disease susceptibility, or they may have no impact. Association studies aim to correlate an observed disease or phenotype with these sequence variations. However, very few SNPs have actually been characterized with respect to the disease or phenotype in which they are implicated. Currently, it is not possible to test and validate every SNP in the coding region of the human genome. Hence, the real challenge in association studies lies in carefully selecting reliable marker alleles that are most likely responsible for the observed phenotype or disease. This thesis addresses the problem by assigning every nucleotide in the human genome a probability of being involved in a disease or an important phenotype. Our hypothesis rests on the premise that evolutionarily conserved nucleotides are the most important for gene function and are therefore more likely than non-conserved nucleotides to cause disease when altered. By calculating the conservation of each base in all human RefSeq exons and correlating the results with all SNPs in the Human Gene Mutation Database (HGMD), a database of known disease-causing SNPs, and in the Database of Single Nucleotide Polymorphisms (dbSNP), we have exhaustively confirmed that the most conserved bases are indeed the most sensitive to variation. Other factors known to be responsible for disease-like alleles were also investigated. All factors found to be associated with disease alleles were used to design a classifier, which assigned a disease probability score to each coding base. This probability score represents the potential sensitivity of each base to variation.
This score will help researchers rank SNPs and select candidate SNPs from a cohort for SNP-disease association studies. Identifying SNPs with disease-like signatures in SNP databases could give researchers and clinicians valuable information for designing and interpreting epidemiological and genetic studies, especially for databases that lack such annotation.
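
The abstract does not specify how conservation was measured or how the classifier was built, so the following is only a toy sketch of the general idea: score each base by how often other aligned sequences agree with the reference, then compare how often known disease positions fall on conserved versus non-conserved bases. The function names, the 0.9 threshold, the alignment, and the disease positions are all hypothetical and are not taken from the thesis.

```python
def column_conservation(alignment):
    """Per-position fraction of aligned sequences matching the reference base.

    The first row is treated as the reference (e.g. human) sequence.
    This simple identity fraction is an assumption, not the thesis's measure.
    """
    reference = alignment[0]
    scores = []
    for i, ref_base in enumerate(reference):
        column = [seq[i] for seq in alignment]
        matches = sum(1 for base in column if base == ref_base)
        scores.append(matches / len(column))
    return scores


def disease_rate_by_conservation(scores, disease_positions, threshold=0.9):
    """Fraction of conserved and of non-conserved positions that carry a disease SNP."""
    conserved = [i for i, s in enumerate(scores) if s >= threshold]
    variable = [i for i, s in enumerate(scores) if s < threshold]

    def rate(positions):
        if not positions:
            return 0.0
        hits = sum(1 for i in positions if i in disease_positions)
        return hits / len(positions)

    return rate(conserved), rate(variable)


# Made-up toy alignment; the first row stands in for the human sequence.
alignment = [
    "ATGGCC",
    "ATGGCA",
    "ATGACT",
    "ATGGCG",
]
# Made-up positions of known disease SNPs (0-based).
disease_positions = {0, 2, 4}

scores = column_conservation(alignment)
conserved_rate, variable_rate = disease_rate_by_conservation(scores, disease_positions)
print(scores)
print(conserved_rate, variable_rate)
```

In this toy data the disease positions all fall on fully conserved columns, which is the pattern the hypothesis predicts. A real analysis would need genome-scale alignments and a proper statistical test.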