Since we have established that none of the Harappa participants so far have African admixture, except for HRP0001 (me) and HRP0027 (Caribbean Indian), and since African populations are the most genetically diverse, it's best to remove the African populations from our Reference I dataset and run some analyses on the Eurasian subset.
One option is to exclude the 517 samples of sub-Saharan African populations in our dataset:
- Bantu Kenya: 11
- Bantu South Africa: 8
- Ethiopian Jews: 12
- Ethiopians: 19
- Kenyan Luhya: 101
- Maasai: 135
- Mandenka: 22
- African Americans: 48
- Yoruba: 161
However, in addition to the above, I decided to remove anyone from the Reference I dataset who had more than x% African ancestry (the sum of the East African, East African Bantu and West African components) in the K=12 ADMIXTURE run. I created two Eurasian datasets: Eurasian90 and Eurasian95.
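As a rough illustration of the threshold rule, here is a minimal Python sketch. It assumes each sample's K=12 proportions are available as a mapping from component name to fraction; the function names, the sample IDs and the non-African component names are hypothetical, and the numbers are made up rather than taken from the project data.

```python
# The three African components summed in the rule described above.
AFRICAN_COMPONENTS = ("East African", "East African Bantu", "West African")


def african_fraction(proportions):
    """Sum the African components of one sample's admixture proportions."""
    return sum(proportions.get(name, 0.0) for name in AFRICAN_COMPONENTS)


def eurasian_subset(samples, max_african):
    """Return the IDs of samples with no more than max_african African ancestry."""
    return [
        sample_id
        for sample_id, proportions in samples.items()
        if african_fraction(proportions) <= max_african
    ]


# Made-up samples for illustration only (hypothetical IDs and values).
samples = {
    "S1": {"West African": 0.02, "South Asian": 0.98},
    "S2": {"East African": 0.04, "West African": 0.04, "Southwest Asian": 0.92},
    "S3": {"East African": 0.10, "East African Bantu": 0.05, "Southwest Asian": 0.85},
}

eurasian90 = eurasian_subset(samples, 0.10)  # drops samples above 10% African
eurasian95 = eurasian_subset(samples, 0.05)  # drops samples above 5% African
```

In this toy input, S2 (8% African) stays in the 10% subset but falls out of the 5% subset, which mirrors how Eurasian95 is a stricter subset of Eurasian90.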
Eurasian90 excludes all samples with more than 10% African admixture. That completely removes the following populations in addition to the above:
- Egyptians: 12
- Moroccans: 10
- Mozabite: 29
Also, some samples from the following populations were removed for Eurasian90:
- Balochi: 3/24
- Bedouin: 19/46
- Brahui: 2/25
- Iranians: 3/19
- Jordanians: 6/20
- Lebanese: 2/7
- Makrani: 3/25
- Palestinian: 10/46
- Saudis: 2/20
- Sindhi: 2/24
- Syrians: 2/16
- Yemenese: 7/8
That's a total of 629 samples in the Reference I dataset that had more than 10% African admixture. Thus Eurasian90 has 2,025 samples. The complete list is here.
The other dataset, Eurasian95, excludes everyone with more than 5% African admixture. Thus, in addition to the samples listed above, it excludes the following:
- Balochi: 1
- Bedouin: 19
- Brahui: 1
- Druze: 1
- Iranians: 1
- Jordanians: 14 (completely removed)
- Makrani: 8
- Morocco Jews: 2
- Palestinian: 36 (completely removed)
- Saudis: 16
- Sindhi: 2
- Syrians: 7
- Yemenese: 1 (completely removed)
- Yemen Jews: 15 (completely removed)
Eurasian95 is thus left with 1,901 samples, whose breakdown is listed here.
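The sample counts above can be cross-checked with a short tally. This Python sketch only re-adds the numbers given in the lists; the variable names are my own, and the Reference I total is not stated in the post but is implied by 2,025 + 629.

```python
# Sub-Saharan African populations removed outright.
sub_saharan = {
    "Bantu Kenya": 11, "Bantu South Africa": 8, "Ethiopian Jews": 12,
    "Ethiopians": 19, "Kenyan Luhya": 101, "Maasai": 135,
    "Mandenka": 22, "African Americans": 48, "Yoruba": 161,
}

# Populations completely removed by the 10% cutoff.
fully_removed_90 = {"Egyptians": 12, "Moroccans": 10, "Mozabite": 29}

# Samples removed from partially affected populations by the 10% cutoff.
partially_removed_90 = {
    "Balochi": 3, "Bedouin": 19, "Brahui": 2, "Iranians": 3,
    "Jordanians": 6, "Lebanese": 2, "Makrani": 3, "Palestinian": 10,
    "Saudis": 2, "Sindhi": 2, "Syrians": 2, "Yemenese": 7,
}

# Additional samples removed by the 5% cutoff.
additional_95 = {
    "Balochi": 1, "Bedouin": 19, "Brahui": 1, "Druze": 1, "Iranians": 1,
    "Jordanians": 14, "Makrani": 8, "Morocco Jews": 2, "Palestinian": 36,
    "Saudis": 16, "Sindhi": 2, "Syrians": 7, "Yemenese": 1, "Yemen Jews": 15,
}

removed_90 = (
    sum(sub_saharan.values())
    + sum(fully_removed_90.values())
    + sum(partially_removed_90.values())
)
eurasian90_size = 2025                      # stated in the post
reference_size = eurasian90_size + removed_90  # implied total, not stated
eurasian95_size = eurasian90_size - sum(additional_95.values())

print(sum(sub_saharan.values()), removed_90, eurasian95_size)
```

The tallies agree with the figures quoted in the post: 517 sub-Saharan samples, 629 removed for Eurasian90, and 1,901 left in Eurasian95.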
I'll be experimenting with both Eurasian90 and Eurasian95.
Do you think it might be worth removing the more isolated populations? I realize we should be careful about throwing out data, but there seems to be a tendency for these populations to define their own cluster and possibly result in misleading percentages for the other individuals. The problem might solve itself at higher Ks, as with the Kalash component, but I wonder if the results will look more interesting without populations that are over 95% from a single component, like the Kalash and Papuans (and admittedly also the Japanese and Yoruba).
I think removing outlier populations is a good idea. However, I want to do it in a systematic way, understanding what happens at each step.
Because ADMIXTURE takes so long to run, experimentation like this is slow.
Also, the Japanese are a different case from the Kalash and Papuans. I don't expect the Japanese to have their own isolated ancestral component. But for our South Asian-focused admixture analysis, it'll be better to have the Northeast Asian component maximized among the Han. There is a way to do that.
removing outliers is good. i started doing it in my own private runs, but one thing to note is that often a new outlier pops out from the remaining populations.... 🙂 i got rid of kalash, and the lahu then emerged as their own cluster....
Hmmm .... the Lahu? Is that because removing an outlier has the effect of basically "zooming-in" on the remaining populations, which makes other groups look like outliers? This feels odd also because one would think South Asians are more Kalash-like than Lahu-like, i.e. the Lahu should have been more outlying than the Kalash to start with.
I don't think it has to do with the similarity of the Lahu or Kalash to South Asians. If we ran ADMIXTURE with my sibling and me, we'd probably form a component even though we're both well within the South Asian mainstream.
Is that because removing an outlier has the effect of basically “zooming-in” on the remaining populations, which makes other groups look like outliers? This feels odd also because one would think South Asians are more Kalash-like than Lahu-like, i.e. the Lahu should have been more outlying than the Kalash to start with.
i think the lahu are the most "outlier" population after you remove the kalash or mozabites. if i removed the lahu, probably something similar would happen.