Selecting tagging SNPs

Polymorphisms are variations in DNA sequence between individuals. A significant number of these variations involve only a single nucleotide and hence are called single nucleotide polymorphisms (SNPs, pronounced "snips"). It is conjectured that human DNA has more than ten million SNPs, but only a few million have been discovered so far. These variations are important because, although they do not themselves cause diseases, they do influence an individual's susceptibility to them. Strong correlations have been observed between certain genetic variations and diseases such as heart disease, diabetes, and different types of cancer.

Recent developments have significantly reduced the cost of assaying SNPs. However, their sheer number and the strong correlation among neighboring SNPs suggest compression. If we can select a much smaller representative set of SNPs, that set can then be used to reconstruct the complete original set. The SNPs in this smaller set are known as tagging SNPs (or tSNPs) in the literature. We use derandomized counterparts of recent linear algebraic algorithms to select the tSNPs that best capture the SNP variance.
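To make the idea concrete, here is a minimal sketch of one deterministic way to pick tSNPs and reconstruct the remaining SNPs from them. It is an illustration under stated assumptions, not the implementation from the papers listed below. It assumes genotypes are encoded as -1, 0, +1 in a subjects-by-SNPs matrix, scores each SNP by its squared weight in the top k principal directions, keeps the c highest-scoring SNPs, and predicts the full genotype matrix by least squares. The function names select_tsnps and reconstruct and the parameters k and c are hypothetical.

```python
import numpy as np

def select_tsnps(A, k, c):
    """Return the sorted indices of the c SNPs (columns) with the largest
    squared weights in the top-k right singular vectors of centered A."""
    A_centered = A - A.mean(axis=0)
    _, _, Vt = np.linalg.svd(A_centered, full_matrices=False)
    scores = (Vt[:k] ** 2).sum(axis=0)      # one score per SNP
    top = np.argsort(scores)[::-1][:c]      # highest scores first
    return np.sort(top)

def reconstruct(A_train, tsnps, X_tags):
    """Predict all SNPs from tSNP genotypes using a least-squares map
    fitted on training data, rounded back to the -1/0/+1 encoding."""
    C = A_train[:, tsnps]
    coef, *_ = np.linalg.lstsq(C, A_train, rcond=None)
    return np.clip(np.rint(X_tags @ coef), -1, 1)

# Synthetic toy data (an assumption, not real genotypes):
# 3 independent SNPs, each copied 4 times, giving 12 SNPs in 3 blocks.
rng = np.random.default_rng(0)
base = rng.integers(-1, 2, size=(50, 3))
A = np.repeat(base, 4, axis=1).astype(float)

tags = select_tsnps(A, k=3, c=3)
A_hat = reconstruct(A, tags, A[:, tags])
print("selected tSNPs:", tags)
print("fraction of genotypes recovered:", (A_hat == A).mean())
```

Ranking by score alone can select several nearly identical SNPs from the same correlated block, so a practical version of this sketch would also need some way to avoid redundant picks.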


A. Javed, P. Drineas, M.W. Mahoney, and P. Paschou, Efficient genomewide selection of PCA-correlated tSNPs for genotype imputation, Annals of Human Genetics

A. Javed and P. Paschou, Extracting tagging SNPs from Genome-wide Datasets, Data Mining for Biomedical Informatics, workshop held in conjunction with 7th SIAM Conference on Data Mining, April 2007

P. Paschou, M.W. Mahoney, A. Javed, J.R. Kidd, A.J. Pakstis, S. Gu, K.K. Kidd, and P. Drineas, Intra- and inter-population genotype reconstruction from tagging SNPs, Genome Research, January 2007. [pubmed, dataset and code]