Recent developments have significantly reduced the cost of assaying SNPs. However, their sheer number and the strong correlation among neighboring SNPs invite compression: if we can select a much smaller representative set of SNPs, that set can then be used to reconstruct the complete original set. Such a representative set is known in the literature as tagging SNPs (tSNPs). We use de-randomized counterparts of recent linear algebraic algorithms to select the tSNPs that best capture the SNP variance.
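The selection-and-reconstruction idea can be illustrated with a minimal sketch. This is not the de-randomized algorithm from the publications listed here; it is a simplified stand-in that assumes a numerically encoded genotype matrix (individuals in rows, SNPs in columns), ranks SNPs by their squared weight in the top-k principal directions, and reconstructs all SNPs from the chosen ones by ordinary least squares. The function names `select_tsnps` and `reconstruct` and the parameters `k` and `c` are hypothetical.

```python
import numpy as np

def select_tsnps(X, k, c):
    """Return the indices of c candidate tagging SNPs (columns of X).

    Assumption: a SNP is scored by the sum of its squared entries in the
    top-k right singular vectors of the column-centered matrix, and the
    c highest-scoring SNPs are kept deterministically.
    """
    Xc = X - X.mean(axis=0)                       # center each SNP
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    scores = np.sum(Vt[:k] ** 2, axis=0)          # one score per SNP
    top = np.argsort(scores)[::-1][:c]            # c largest scores
    return np.sort(top)

def reconstruct(X_train, tags, X_tags_new):
    """Predict all SNPs for new individuals from their tSNP genotypes.

    A linear map from the tSNP columns to every column is fitted on the
    training individuals by least squares, then applied to X_tags_new.
    """
    mean = X_train.mean(axis=0)
    A = X_train[:, tags] - mean[tags]             # centered tSNP columns
    B = X_train - mean                            # centered full matrix
    W, *_ = np.linalg.lstsq(A, B, rcond=None)     # A @ W approximates B
    return (X_tags_new - mean[tags]) @ W + mean

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Synthetic, strongly correlated "SNP" columns (illustrative only).
    latent = rng.normal(size=(60, 3))
    X = latent @ rng.normal(size=(3, 25)) + 0.05 * rng.normal(size=(60, 25))
    tags = select_tsnps(X[:40], k=3, c=5)
    X_hat = reconstruct(X[:40], tags, X[40:][:, tags])
    err = np.linalg.norm(X_hat - X[40:]) / np.linalg.norm(X[40:])
    print("selected tSNPs:", tags, "relative error:", err)
```

In practice the predicted values would still need to be rounded to valid genotype codes, and the choices of `k` and `c` would have to be tuned against held-out individuals.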
A. Javed, P. Drineas, M.W. Mahoney, and P. Paschou, Efficient genomewide selection of PCA-correlated tSNPs for genotype imputation, Annals of Human Genetics
A. Javed and P. Paschou, Extracting tagging SNPs from Genome-wide Datasets, Data Mining for Biomedical Informatics, workshop held in conjunction with 7th SIAM Conference on Data Mining, April 2007
P. Paschou, M.W. Mahoney, A. Javed, J.R. Kidd, A.J. Pakstis, S. Gu, K.K. Kidd, and P. Drineas, Intra- and inter-population genotype reconstruction from tagging SNPs, Genome Research, January 2007. [pubmed, dataset and code]