To my standard reference 3 (list of populations), I added the Yunusbayev et al. Caucasus samples, which include the following:
- 20 abhkasians
- 16 armenians
- 19 balkars
- 13 bulgarians
- 20 chechens
- 14 kumyks
- 6 kurds
- 15 mordovians
- 16 nogais
- 15 north-ossetians
- 15 tajiks
- 15 turkmens
- 20 ukranians
These 204 samples increased the total to 4,090.
Then I applied a stricter IBD relationship cutoff than I have used before. Previously my focus was on removing relatives, but now I also wanted to remove samples that seemed highly inbred or belonged to small, highly bottlenecked groups, so that they would not form their own clusters in Admixture. This process removed the following 164 samples:
- maasai 30
- papuan 15
- karitiana 12
- pima 12
- onge 8
- surui 7
- luhya 6
- melanesian 6
- colombian 5
- hadza 5
- koryaks 5
- sandawe 5
- san 4
- turkmens 4
- african-americans 3
- east-greenlanders 3
- great-andamanese 3
- nganassans 3
- chenchu 2
- evenkis 2
- han-chinese-south 2
- maya 2
- mbutipygmy 2
- mexicans 2
- utahn-whites 2
- aus 1
- bantukenya 1
- british 1
- chinese-americans 1
- gujaratis-b 1
- iranians 1
- naxi 1
- north-kannadi 1
- samaritians 1
- she 1
- tuvinians 1
- yemenese 1
- yoruba 1
- yukaghirs 1
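The post does not say which tool or cutoff value was used for the IBD filter. As a rough illustration only, the sketch below assumes pairwise IBD estimates in the column layout of a PLINK `--genome` file and a hypothetical cutoff of 0.2 on `PI_HAT`. The sample IDs, the values, and the greedy removal rule are all made up for the example and are not the method actually used here.

```python
import io
from collections import defaultdict

# Hypothetical excerpt in the column layout of a PLINK --genome file.
# Sample IDs and all values are invented for illustration.
GENOME_TEXT = """\
FID1 IID1 FID2 IID2 RT EZ Z0 Z1 Z2 PI_HAT PHE DST PPC RATIO
onge O1 onge O2 UN NA 0.40 0.40 0.20 0.40 -1 0.80 1.0 NA
onge O1 onge O3 UN NA 0.50 0.35 0.15 0.325 -1 0.79 1.0 NA
onge O2 onge O3 UN NA 0.95 0.05 0.00 0.025 -1 0.72 0.6 NA
british B1 british B2 UN NA 1.00 0.00 0.00 0.00 -1 0.70 0.5 NA
"""


def samples_to_remove(genome_file, pi_hat_cutoff):
    """Greedily pick samples so that no remaining pair exceeds the cutoff.

    Assumption: individual IDs (IID) are unique across families, so the
    family ID is ignored. The greedy rule is one simple choice among many.
    """
    header = genome_file.readline().split()
    i1 = header.index("IID1")
    i2 = header.index("IID2")
    ip = header.index("PI_HAT")

    # Map each sample to the set of samples it shares too much IBD with.
    partners = defaultdict(set)
    for line in genome_file:
        fields = line.split()
        if not fields:
            continue
        if float(fields[ip]) > pi_hat_cutoff:
            a, b = fields[i1], fields[i2]
            partners[a].add(b)
            partners[b].add(a)

    removed = set()
    while partners:
        # Remove the sample involved in the most flagged pairs first;
        # sorting makes tie-breaking deterministic.
        worst = max(sorted(partners), key=lambda s: len(partners[s]))
        removed.add(worst)
        for other in partners.pop(worst):
            partners[other].discard(worst)
            if not partners[other]:
                del partners[other]
    return removed


flagged = samples_to_remove(io.StringIO(GENOME_TEXT), pi_hat_cutoff=0.2)
print(sorted(flagged))
```

Lowering the cutoff flags more pairs, which is why small, bottlenecked groups lose many samples under a stricter threshold: most pairs within such a group share a lot of IBD.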
Finally, I added the 165 founders from the Harappa Project participants (up to HRP0180).
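As a quick sanity check on the counts above, the short sketch below re-adds the per-population numbers from the two lists and works out the resulting dataset size. It assumes no samples were added or removed other than those listed in this post, so the final figure is an inference, not a number stated here.

```python
# Counts copied from the two lists in the post, in the order given.
caucasus_added = [20, 16, 19, 13, 20, 14, 6, 15, 16, 15, 15, 15, 20]
ibd_removed = [30, 15, 12, 12, 8, 7, 6, 6, 5, 5, 5, 5, 4, 4, 3, 3, 3, 3,
               2, 2, 2, 2, 2, 2, 2] + [1] * 14

total_after_caucasus = 4090  # stated in the post
harappa_founders = 165       # stated in the post

# Assumption: nothing else changed between these steps.
final_total = total_after_caucasus - sum(ibd_removed) + harappa_founders

print(sum(caucasus_added), sum(ibd_removed), final_total)
```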
The cross-validation error for the admixture results with K (number of ancestral components) from 2 to 20 is plotted here.
The lowest cross-validation errors are for K=17 and K=12.
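For anyone wanting to reproduce this kind of comparison, here is a minimal sketch of ranking K values by cross-validation error. It assumes the errors come from lines of the form `CV error (K=12): 0.53010`, which is what ADMIXTURE prints when run with `--cv`. The error values in the example are invented and are not the ones plotted in this post.

```python
import re

# Hypothetical log lines in the format ADMIXTURE prints with --cv.
# The error values are invented for illustration.
LOG_TEXT = """\
CV error (K=11): 0.53120
CV error (K=12): 0.53010
CV error (K=13): 0.53080
"""

CV_LINE = re.compile(r"CV error \(K=(\d+)\): ([0-9.]+)")


def cv_errors(log_text):
    """Return a dict mapping K to its cross-validation error."""
    return {int(k): float(err) for k, err in CV_LINE.findall(log_text)}


def rank_by_cv_error(errors):
    """Return the K values sorted from lowest to highest CV error."""
    return sorted(errors, key=errors.get)


errors = cv_errors(LOG_TEXT)
ranking = rank_by_cv_error(errors)
print(ranking)
```

The differences between neighboring K values are often small, so it is reasonable to look at the few best-ranked values instead of only the single minimum.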
The admixture results are in a spreadsheet.
In addition to K=17 and K=12, take a look at the results for K=15.
PS. I should point out that the names for the ancestral components are just useful mnemonics based on the current distribution of that component. Also, a component with the same name at one value of K is different from a similarly named component at another K.
Where are the Punjabis represented in the reference list? I see Kashmiri Pandits, Sindhis, Iranians, and Pathans. I see Singapore Indians, but what about Indians in general?
The only autosomal DNA study that has used a Punjabi reference sample is "Toward a more uniform sampling of human genetic diversity: a survey of worldwide populations by high-density genotyping" by Jinchuan Xing et al. They used 25 Punjabi Arains from Pakistan in the study. Zack doesn't use the Xing et al. populations in most of his admixture runs because, if I remember correctly, their SNP overlap with 23andMe data is poor and limited.
Zack said:
"Finally, I added the 165 founders from the Harappa Project participants (up to HRP0180)."
In light of this, will you be posting individual admixture proportions for all these values of K sometime? I'm assuming this run is only experimental, but that would still be cool.
Yes, I'll post the participant results in a few days.
Nicely done, Zack, thanks! Also, what interpretation does one give to the "South European" component after K = 11? It kind of looks like a mix of South West Asian and European.
It's the Sardinian-centered component that other genome bloggers have also found. While Sardinians are a fairly isolated population, this component seems to be present across a fairly wide range, especially near the Mediterranean.