Datasets

Datasets from Yale and Hapmap were used in this project. We restricted our study to four genomic regions: 17q25, HOXB, PAH, and SORC3. The .mat files were created using MATLAB 7.0 and may not load in earlier MATLAB versions. Please contact Asif Javed if you run into a problem or need the original data files.

Yale Dataset

The Yale Dataset consists of 1962 individuals from 38 diverse populations from across the globe. The SNPs are available as a .mat file that can be loaded in MATLAB, generating the variables described below (including F and begs).

The data format is best explained with an example: the first population occupies rows begs(1) to begs(2)-1 of F. For ease of computation, each dataset is bundled with a driver file that generates linear structure, as well as inter- and intra-population results (for explanation see here). The dataset can be downloaded using this link. The original files from which this dataset was extracted are available in ALFRED.
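The row-range convention can be sketched in a few lines of Python, for readers working outside MATLAB. This is a minimal illustration and not the bundled driver file. It assumes that begs holds 1-based starting rows, that population k occupies rows begs(k) to begs(k+1)-1 in the same way as the first population, and that the last population runs to the final row of F unless begs already ends with a sentinel entry. The helper population_rows and the toy values are hypothetical; the genotype encoding of the real F is not specified here.

```python
def population_rows(begs, n_rows):
    """Return a 0-based, half-open (start, stop) pair for each population.

    begs holds 1-based MATLAB start rows. Assumption: population k spans
    rows begs[k] .. begs[k+1]-1, and the last one runs to row n_rows.
    """
    bounds = list(begs)
    # Append a sentinel one past the last row unless begs already ends with it
    if bounds[-1] != n_rows + 1:
        bounds.append(n_rows + 1)
    # Convert 1-based inclusive starts to 0-based half-open slices
    return [(lo - 1, hi - 1) for lo, hi in zip(bounds, bounds[1:])]


# Toy stand-ins for F and begs (not real data): 6 individuals, 3 SNPs
F = [[0, 1, 2], [1, 1, 0], [2, 0, 0], [0, 0, 1], [1, 2, 2], [2, 2, 1]]
begs = [1, 3, 4]

slices = population_rows(begs, len(F))
start, stop = slices[0]
first_population = F[start:stop]  # rows begs(1) to begs(2)-1 in MATLAB terms

print(slices)
print(first_population)
```

With the toy values, the first population is the first two rows of F, which matches rows 1 to 2 in MATLAB's 1-based indexing.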

Hapmap Dataset

The same genomic regions were extracted from the International Hapmap Project data. SNPs that have the same allele pair for every individual in the dataset are trivial to predict and would give an unfair advantage to any prediction scheme. For the sake of fair comparison, such SNPs were removed. An additional variable, to_keep, is included to provide a backward mapping to the original Hapmap data files. The data is stored in the above-mentioned format with this additional variable, and can be downloaded using this link.
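The filtering step and the role of to_keep can be illustrated with a short Python sketch. It is an assumption-laden example rather than the code used to prepare the dataset: it assumes rows are individuals and columns are SNPs, and that to_keep lists the original column index of each retained SNP. The helper drop_constant_snps and the toy genotypes are hypothetical, and the indices here are 0-based, whereas the MATLAB variable is presumably 1-based.

```python
def drop_constant_snps(genotypes):
    """Remove SNP columns whose value is identical for every individual.

    Returns (filtered_rows, to_keep), where to_keep[j] is the 0-based index
    of filtered column j in the original matrix (assumed meaning of to_keep).
    """
    n_snps = len(genotypes[0]) if genotypes else 0
    # Keep a column only if at least two distinct allele pairs occur in it
    to_keep = [j for j in range(n_snps)
               if len({row[j] for row in genotypes}) > 1]
    filtered = [[row[j] for j in to_keep] for row in genotypes]
    return filtered, to_keep


# Toy data (not real Hapmap genotypes): 3 individuals, 4 SNPs
genotypes = [
    ["AA", "AG", "CC", "TT"],
    ["AA", "GG", "CC", "CT"],
    ["AA", "AG", "CC", "TT"],
]

filtered, to_keep = drop_constant_snps(genotypes)
print(to_keep)
print(filtered)
```

In the toy data the first and third SNPs carry the same allele pair for everyone, so they are dropped, and to_keep records that the two remaining columns came from original columns 1 and 3 (0-based).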