MATLAB Code

Genetics, a MATLAB library, implements the functions used in the driver files (provided with the datasets). Genetics uses some of the mathematical tools developed in MCMatrix. Both libraries need to be unzipped and their respective paths added to MATLAB for the driver files to work. Please contact Asif Javed if you encounter any problems executing the code. Note that the functions may be updated as more efficient techniques are developed. Suggestions for improvement and general comments on the code are much appreciated.

The high-level functions are explained below. Once the libraries have been added to the MATLAB path, help on these functions can also be accessed using

help <function_name>

linear_structure
This function takes as input a target accuracy, expressed as a fraction between 0 and 1, and returns statistics on the number of eigenSNPs and actual SNPs needed to recover the dataset with at most (1-target)*100 percent erroneous entries.
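The error criterion above can be illustrated with a minimal sketch. It is written in Python rather than MATLAB purely for illustration and is not the library's code. The names error_fraction and meets_target are hypothetical, and the genotype matrices are assumed to be plain nested lists of equal shape.

```python
def error_fraction(original, reconstructed):
    """Fraction of entries where the reconstruction differs from the original."""
    total = 0
    wrong = 0
    for row_o, row_r in zip(original, reconstructed):
        for a, b in zip(row_o, row_r):
            total += 1
            if a != b:
                wrong += 1
    if total == 0:
        raise ValueError("matrices must contain at least one entry")
    return wrong / total


def meets_target(original, reconstructed, target):
    """True if at least a `target` fraction of entries was recovered correctly,
    i.e. at most (1 - target) * 100 percent of entries are erroneous."""
    accuracy = 1.0 - error_fraction(original, reconstructed)
    return accuracy >= target
```

With a target of 0.99, for example, a reconstruction passes only if no more than 1 percent of its entries differ from the original data.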

intra_population
This function takes as input a fraction between 0 and 1 that denotes the proportion of individuals to be used as training data; the rest comprise the test data. For each population, the function splits the data randomly into training and test sets and attempts to predict the test data using only the training data and the CUR algorithm. Statistics are reported in a return variable and stored in a .mat file. For each population, multiple splits of the data are evaluated, and the reconstruction accuracy is measured for numbers of SNPs that are multiples of SNP_interval (an input parameter).
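The random split and the SNP counts described above can be sketched as follows. This is an illustrative Python sketch, not the MATLAB implementation. The helper names split_individuals and snp_counts are hypothetical, and rounding the training-set size to the nearest integer is an assumption.

```python
import random


def split_individuals(n_individuals, training_fraction, rng=None):
    """Randomly partition individual indices into training and test sets."""
    rng = rng or random.Random()
    indices = list(range(n_individuals))
    rng.shuffle(indices)
    # Assumption: the training-set size is rounded to the nearest integer.
    n_train = int(round(training_fraction * n_individuals))
    return sorted(indices[:n_train]), sorted(indices[n_train:])


def snp_counts(snp_interval, n_snps):
    """SNP counts to evaluate: multiples of snp_interval up to n_snps."""
    return list(range(snp_interval, n_snps + 1, snp_interval))
```

Repeating split_individuals with different random seeds gives the multiple splits per population, and each split would then be evaluated once for every value returned by snp_counts.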

inter_population
This function takes as input a fraction between 0 and 1 that determines how many actual SNPs to choose from the source population (via the coverage that they provide) and estimates the prediction error after assaying the selected SNPs in the other population. The function returns a populations-by-populations matrix whose (i,j)-th entry stores various statistics on the error when the i-th population is used to predict the j-th population. It also returns the number of SNPs retained for each population when predicting every other population.
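The layout of the returned matrix can be sketched as below, again in Python for illustration only. The function pairwise_error_matrix and the callback estimate_error are hypothetical stand-ins for the actual SNP selection and prediction steps. Leaving the diagonal empty is an assumption, since the description only mentions predicting other populations.

```python
def pairwise_error_matrix(populations, estimate_error):
    """Build a matrix whose (i, j) entry holds the statistics obtained when
    population i (source) is used to predict population j (target)."""
    n = len(populations)
    matrix = [[None] * n for _ in range(n)]
    for i, source in enumerate(populations):
        for j, target in enumerate(populations):
            if i == j:
                continue  # assumption: a population does not predict itself
            matrix[i][j] = estimate_error(source, target)
    return matrix
```

Reading row i of the result shows how well population i serves as a source for every other population, while column j shows how well population j is predicted by each source.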

Some of the functions used in the above are explained here.