The data for Xing et al.'s paper "Toward a more uniform sampling of human genetic diversity: a survey of worldwide populations by high-density genotyping" is available online.
This dataset consists of 850 individuals, but 259 of them overlap with HapMap. Another 15 samples had to be removed because they were too similar to other samples. I also removed the Native American samples. This leaves us with 529 samples.
Ethnic group | Count |
---|---|
Slovenian | 25 |
Punjabi Arain | 25 |
N. European | 25 |
Nepalese | 25 |
Kyrgyzstani | 25 |
Iban | 25 |
Buryat | 25 |
Bambaran | 25 |
Andhra Pradesh Brahmin | 25 |
Kurd | 24 |
Dogon | 24 |
Irula | 23 |
Thai | 22 |
Pygmy | 22 |
Urkarah | 18 |
Tamil Nadu Brahmin | 14 |
Hema | 14 |
Tongan | 13 |
Tamil Nadu Dalit | 13 |
Samoan | 13 |
!Kung | 13 |
Japanese | 13 |
Andhra Pradesh Mala | 11 |
Pedi | 10 |
Andhra Pradesh Madiga | 10 |
Alur | 10 |
Nguni | 9 |
Sotho/Tswana | 8 |
Vietnamese | 7 |
Stalskoe | 5 |
Chinese | 5 |
Khmer Cambodian | 3 |
This dataset is valuable because it contains several South Asian, Central Asian, Southeast Asian and Caucasus groups. However, it does not have good SNP overlap with 23andMe and the other datasets: it has only about 29,000 SNPs in common with 23andMe v2 data. Combining HapMap, HGDP, SGVP, Behar et al. and Xing et al. with 23andMe data leaves us with 25,000 SNPs. Because of this low overlap, I'll be using the Xing et al. data for only a few analyses.
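For readers who want to check this kind of overlap themselves, here is a minimal sketch of counting the SNPs shared across datasets. It assumes each dataset has already been converted to PLINK format, so that each has a `.bim` file with the SNP ID in the second column. The function names and the file names in the usage comment are hypothetical, and this is not necessarily how the counts above were obtained.

```python
def read_bim_snp_ids(bim_path):
    """Return the set of SNP IDs (second column) from a PLINK .bim file."""
    snp_ids = set()
    with open(bim_path) as handle:
        for line in handle:
            fields = line.split()
            # .bim columns: chromosome, SNP ID, genetic distance,
            # base-pair position, allele 1, allele 2
            if len(fields) >= 2:
                snp_ids.add(fields[1])
    return snp_ids


def shared_snps(bim_paths):
    """Return the SNP IDs present in every one of the given .bim files."""
    id_sets = [read_bim_snp_ids(path) for path in bim_paths]
    if not id_sets:
        return set()
    return set.intersection(*id_sets)


# Hypothetical usage (file names are placeholders, not real paths):
# common = shared_snps(["xing.bim", "23andme_v2.bim", "hapmap.bim"])
# print(len(common))
```

Matching on SNP ID alone is the simplest approach. It does not catch SNPs that were renamed between dataset builds, and it says nothing about strand or allele coding differences, which still have to be reconciled before merging.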
Any chance you can post the PLINK files of the public datasets you've assembled for your analysis?
I won't provide the PLINK-format data files, since I am not sure I have the right to distribute these datasets. But I can do the next best thing, which is to provide the scripts I am using to convert the downloaded data to PLINK format.
Thanks. Yes, please upload them somewhere.
Regards