Pearson Correlation Coefficient

Pearson's correlation coefficient (r2) is widely used in genetics. It lies at the heart of Carlson et. al. and De Bakker et. al.'s tagging SNP selection methods. It is also frequently used in disease association studies to reveal SNPs in high linkage with various diseases. r2 can easily be computed from haplotype data. However most currently available genetic datasets contain genotype data; which makes computing r2 more expensive.

All currently available software utilize expectation maximization to compute r2. This is computationally expensive and hence does not scale to chromosome wide data. Most publicly available implementations go around this problem by restricting the pair wise computations to sliding windows of neighboring SNPs only. For one of our studies we needed to look at long range linkage. We developed a numeric method to compute r2.

The implementation with detailed explanation can be downloaded from this link. Please send me an email if you need the C code or for any suggestions to add another input format.

The plot to the right was computed for 541 individuals. Since we are computing r2 for all pairs of SNPs, the curve is quadratic. But our implementation takes very little time for each pair. In this plot, it took roughly 50-100 μsec for each pair on a commodity desktop.