The optimal method for tagging SNP (tSNP) selection, the applicability of a reference linkage disequilibrium (LD) map to unassayed populations, and the scalability of these methods to genome-wide analysis all remain subjects of debate. We propose novel, scalable matrix algorithms that address these issues, and we evaluate them on genotypic data from four genomic regions (248 SNPs typed for approximately 2000 individuals from 38 worldwide populations). We also evaluate these algorithms on a second dataset consisting of genotypes from the HapMap database (1336 SNPs for four populations) over the same genomic regions. Furthermore, we test these methods in the setting of a real association study, using a publicly available family dataset. The algorithms we employ for tSNP selection and unassayed SNP reconstruction do not require haplotype inference, and they are, in principle, scalable even to genome-wide analysis. Moreover, they are greedy variants of recently developed matrix algorithms with provable performance guarantees. Using a small set of carefully selected tSNPs, we achieve very good reconstruction accuracy of 'untyped' genotypes for most of the populations studied. Additionally, we demonstrate quantitatively that the chosen tSNPs exhibit substantial transferability, both within and across geographic regions. Finally, we show that reconstruction can be applied to retrieve significant SNP associations with disease, with important savings in genotyping.
The datasets and Matlab code used for this project are available