Category Archives: Admixture - Page 16

Reference I Admixture Analysis K=10-12

A week later, here is some more Admixture analysis of the Reference I dataset.

As usual, the results are available in a spreadsheet, which is also listed on my sidebar.

Let's start with K=10.

Admixture: Reference I populations K=10

C1 South Asian
C2 Kalash
C3 Southwest Asian
C4 Southeast Asian
C5 European
C6 Papuan
C7 Northeast Asian
C8 Siberian
C9 West African
C10 East African

The addition here is basically the Siberian component, which is highest among the Yakut.

Fst divergences between estimated populations for K=10:

C1 C2 C3 C4 C5 C6 C7 C8 C9
C2 0.057
C3 0.064 0.073
C4 0.089 0.127 0.136
C5 0.063 0.061 0.038 0.131
C6 0.167 0.209 0.215 0.202 0.210
C7 0.080 0.120 0.129 0.032 0.123 0.190
C8 0.085 0.117 0.127 0.059 0.118 0.203 0.039
C9 0.152 0.174 0.161 0.201 0.171 0.266 0.195 0.199
C10 0.115 0.133 0.117 0.166 0.128 0.233 0.160 0.163 0.036

Now for K=11,

Admixture: Reference I populations K=11

C1 South Asian
C2 Kalash
C3 Southwest Asian
C4 Southeast Asian
C5 European
C6 Papuan
C7 Siberian
C8 Northeast Asian
C9 East African Bantus
C10 West African
C11 East African

C8 at K=11 is now modal among the Han instead of the Japanese. This affected the Southeast Asian C4 component which is now more of a real Southeast Asian one.

The new ancestral component C9 is found among the Bantus of eastern and southern Africa. It is highest among the Luhya and Bantus of Kenya.

Fst divergences between estimated populations for K=11:

C1 C2 C3 C4 C5 C6 C7 C8 C9 C10
C2 0.055
C3 0.062 0.072
C4 0.081 0.120 0.128
C5 0.063 0.063 0.038 0.124
C6 0.169 0.211 0.215 0.195 0.213
C7 0.089 0.128 0.135 0.057 0.130 0.203
C8 0.083 0.122 0.131 0.031 0.127 0.194 0.039
C9 0.143 0.165 0.150 0.185 0.162 0.259 0.195 0.189
C10 0.152 0.174 0.160 0.194 0.172 0.268 0.203 0.198 0.014
C11 0.104 0.122 0.101 0.149 0.115 0.226 0.158 0.152 0.037 0.043

At K=12,

Admixture: Reference I populations K=12

C1 South Asian
C2 Balochistan/Caucasus
C3 Kalash
C4 Southeast Asian
C5 Southwest Asian
C6 European
C7 Papuan
C8 Northeast Asian
C9 Siberian
C10 East African Bantus
C11 West African
C12 East African

The Kalash component has split, with an assist from the Southwest Asian component, into a pure Kalash component (C3) and a Balochistan/Caucasus component (C2). C2 is highest in southwestern Pakistan (Brahui, Makrani, Balochi) at 60-57%, followed by Georgians, Lezgin, Adygei, Azerbaijan Jews and Iranian Jews (56-50%).

The Southwest Asian component (C5) is now more of a Southwest Asian and North/Northwest African component. The West Asian element in it has been reduced.

The Northeast Asian component (C8) is now again centered on Japan. I have a solution for this movement which I'll apply in my next round of analysis.

Fst divergences between estimated populations for K=12:

C1 C2 C3 C4 C5 C6 C7 C8 C9 C10 C11
C2 0.057
C3 0.066 0.060
C4 0.089 0.124 0.136
C5 0.075 0.057 0.087 0.142
C6 0.066 0.040 0.073 0.130 0.048
C7 0.167 0.205 0.219 0.202 0.220 0.210
C8 0.080 0.117 0.128 0.032 0.134 0.122 0.190
C9 0.085 0.114 0.126 0.059 0.133 0.117 0.203 0.039
C10 0.145 0.154 0.176 0.192 0.154 0.162 0.258 0.187 0.190
C11 0.154 0.163 0.186 0.201 0.164 0.172 0.266 0.195 0.199 0.014
C12 0.107 0.109 0.135 0.157 0.105 0.116 0.225 0.151 0.154 0.035 0.041

Higher K value admixture analysis will continue.

Chinese Samples

Mithra asked:

Almost all the Chinese are now around 50% SE Asian, didn’t see this before is it right.

So I decided to look at the Chinese samples in Reference I dataset.

I ran Admixture on the whole Reference I dataset for K=10 ancestral populations. The green component is what I call Southeast Asian, blue is Northeast Asian (highest among the Japanese) and violet is Siberian (highest among the Yakut).

Here is the plot for the 106 HapMap Chinese samples from Denver (label: us chinese):

HapMap US Chinese

For the 137 HapMap samples from Beijing, China (label: han chinese):

HapMap Han Chinese

For the 34 HGDP Han samples (label: han):

HGDP Han

For the 10 HGDP Han samples from North China (label: han-nchina):

HGDP Han North China

As you can see, the "Southeast Asian" component goes down from the top group to the bottom one, which is as expected.

I wasn't satisfied with these results, so I decided to run Admixture on the East Asian samples in Reference I separately.

East Asian Admixture K=3

At K=3, the results are about the same as at K=10 for the whole Reference I dataset. The Han all have a significant amount of the blue component, which is highest among the Southeast Asians.

East Asian Admixture K=4

At K=4, we get a Chinese ("East Asian") component. So we have Japanese, Chinese, Yakut and Southeast Asian components. This is what most of you were probably expecting.

Why did the Japanese become the modal population for the Northeast Asian component? I ran a PCA on the East Asian data to see how the different populations looked on a PCA plot. Remember that eigenvector 1 explains 1.49 times the variance of eigenvector 2 and 1.9 times the variance of eigenvector 3. Thus, eigenvector 2 explains 1.28 times the variation explained by eigenvector 3.
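The last ratio follows from the first two. Writing λ1, λ2 and λ3 for the variance explained by eigenvectors 1 to 3 (symbols introduced here only for illustration), the arithmetic is:

```latex
\frac{\lambda_2}{\lambda_3}
  = \frac{\lambda_1/\lambda_3}{\lambda_1/\lambda_2}
  = \frac{1.9}{1.49}
  \approx 1.28
```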

East Asian PCA eig1 vs eig2


East Asian PCA eig1 vs eig3


East Asian PCA eig2 vs eig3

As you can see, the Yakut are the farthest away, but the Japanese are also fairly well-separated from the Chinese populations.

If I didn't have the 141 Japanese samples in my reference dataset, the Northeast Asian component would be centered on the Han most likely, which is the case for Dodecad.

I think this shows that it is not correct to think of the ancestral components inferred from admixture as some pure ancestral population.

Admixture K=4,7,9, HRP0021 to HRP0030

Here's the spreadsheet with their admixture results. And you can check their ethnic backgrounds.

You might also want to refer to the reference dataset I admixture analyses for K=2-5 and K=6-9.

I did not run admixture for all values of K this time. So let's start with K=4. For quick reference,

C1 South Asian
C2 European
C3 East Asian
C4 African

Batch 3 Admixture K=4

Now, for K=7, the ancestral components are:

C1 South Asian
C2 European
C3 Southeast Asian
C4 Southwest Asian
C5 Papuan
C6 Northeast Asian
C7 African

Batch 3 Admixture K=7

And finally, here's K=9.

C1 South Asian
C2 Kalash
C3 Southwest Asian
C4 Southeast Asian
C5 European
C6 Papuan
C7 Northeast Asian
C8 West African
C9 East African

Batch 3 Admixture K=9

Reference II Admixture Analysis K=6-9

Continuing with admixture analysis of Reference II dataset, here's the spreadsheet.

Other than the differences with Reference I analysis, do take a look at the additional ethnic groups included in this dataset, especially the 8 South Asian groups: Tamil Nadu Dalit, Irula, Andhra Pradesh Madiga, Andhra Pradesh Mala, Tamil Nadu Brahmin, Andhra Pradesh Brahmin, Punjabi Arain, Nepali.

Let's start with K=6.

Reference II Admixture K=6

Note the difference between Tamil Nadu Dalits and Brahmins. The Dalits lack the European ancestral component of the Brahmins.

For K=7, the East Asian component splits into Northeast Asian and Southeast Asian.

Reference II Admixture K=7

Punjabi Arain are about the same as Sindhis (excluding those with some African ancestry) in terms of their ancestral components.

Comparing the Andhra Brahmins to the Mala and Madiga, we see the same pattern as in Tamil Nadu: Brahmins have more European and Southwest/West Asian while Mala and Madiga have more Southeast Asian and South Asian.

At K=8, the African component splits into West African and East African.

Reference II Admixture K=8

The Nepalese samples are interesting. They have about 49% South Asian, 19% Northeast Asian, 16% European and 10% Southeast Asian. So they look like a mix of South Asian and East Asian.

Similar to the previous post, here's a comparison of K=8 admixture analysis between Reference I and Reference II datasets.

Here's the average absolute difference between the two datasets for each ancestral component:

Ancestral Component Mean(Abs(Ref1-Ref2))
South Asian (C1) 2.17%
Southwest Asian (C2) 1.32%
European (C3) 1.70%
Southeast Asian (C4) 2.16%
Papuan (C5) 0.33%
Northeast Asian (C6) 1.93%
West African (C7) 0.27%
East African (C8) 0.48%
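For anyone who wants to reproduce this kind of comparison from the spreadsheets, here is a minimal sketch of the Mean(Abs(Ref1-Ref2)) calculation. The population names and percentages below are made up for illustration and are not the real values; the function name `mean_abs_diff` is also just a placeholder. It assumes each run is stored as per-population average percentages and that only populations present in both datasets are compared.

```python
from statistics import mean

# Hypothetical per-population averages (in percent) from two admixture runs.
# These numbers are illustrative only, not the Reference I / II results.
ref1 = {
    "Pop A": {"South Asian": 50.0, "European": 10.0},
    "Pop B": {"South Asian": 5.0, "European": 80.0},
}
ref2 = {
    "Pop A": {"South Asian": 52.0, "European": 9.0},
    "Pop B": {"South Asian": 4.0, "European": 80.5},
    "Pop C": {"South Asian": 70.0, "European": 3.0},  # only in the second run
}


def mean_abs_diff(run_a, run_b):
    """Mean absolute difference per component over populations in both runs."""
    shared = sorted(set(run_a) & set(run_b))
    components = list(run_a[shared[0]])
    return {
        comp: mean(abs(run_a[pop][comp] - run_b[pop][comp]) for pop in shared)
        for comp in components
    }


result = mean_abs_diff(ref1, ref2)
for comp, diff in result.items():
    print(f"{comp}: {diff:.2f}%")
```

Populations that exist in only one dataset (like "Pop C" above) are skipped, since there is nothing to compare them against.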

The larger differences are for Balochi, Cambodian, Dai, Han, Kalash, Lahu, Miao, Naxi, She, Singapore Chinese, Tu, Tujia, US Chinese, and Yi. Thus, it's mostly East Asian groups.

For K=9, we see some divergence between the ancestral components inferred from Reference II as compared to Reference I. Instead of the Kalash component in Reference I analysis, we get the Polynesian component here. This is likely due to the inclusion of Tongan and Samoan samples.

Reference II Admixture K=9

Here's a summary of the ancestral components inferred from Reference II dataset:

K=2: Eurasian, African
K=3: European, E Asian, African
K=4: S Asian, European, E Asian, African
K=5: S Asian, European, E Asian, SW Asian, African
K=6: S Asian, European, E Asian, SW Asian, Papuan, African
K=7: S Asian, European, SE Asian, SW Asian, Papuan, NE Asian, African
K=8: S Asian, SW Asian, European, SE Asian, Papuan, NE Asian, W African, E African
K=9: S Asian, European, SW Asian, SE Asian, Papuan, NE Asian, Polynesian, W African, E African

I might do some admixture runs for Reference II with Harappa participants later.

Reference II Admixture Analysis K=2-5

Our Reference II Dataset has 3,161 samples with 544 South Asians belonging to 24 ethnic groups. Unfortunately, we can do our admixture analysis on only about 23,000 SNPs.

The ancestral population averages for each ethnic group from the admixture analysis can be seen in this spreadsheet. I have also calculated the standard deviation of the ancestral components for the samples in each ethnic group.
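As a rough sketch of how such group averages and standard deviations can be computed: Admixture writes a Q file with one row per sample and one column per ancestral component, in the same order as the samples in the input. The proportions and group labels below are made up, the name `group_stats` is a placeholder, and the use of the sample standard deviation is an assumption (the spreadsheet may use a different convention).

```python
from collections import defaultdict
from statistics import mean, stdev

# Hypothetical K=2 proportions laid out like a Q file: one row per sample.
Q_TEXT = """\
0.90 0.10
0.80 0.20
0.30 0.70
0.10 0.90
"""
# Hypothetical ethnic group label for each row, in the same order.
GROUPS = ["group_a", "group_a", "group_b", "group_b"]


def group_stats(q_text, groups):
    """Return {group: (means, sds)} with one entry per ancestral component."""
    rows = [[float(x) for x in line.split()]
            for line in q_text.strip().splitlines()]
    if len(rows) != len(groups):
        raise ValueError("need exactly one group label per sample row")

    by_group = defaultdict(list)
    for group, row in zip(groups, rows):
        by_group[group].append(row)

    stats = {}
    for group, members in by_group.items():
        columns = list(zip(*members))  # one tuple of values per component
        means = [mean(col) for col in columns]
        # Sample standard deviation; undefined for a single sample, so use 0.
        sds = [stdev(col) if len(col) > 1 else 0.0 for col in columns]
        stats[group] = (means, sds)
    return stats


stats = group_stats(Q_TEXT, GROUPS)
for group, (means, sds) in stats.items():
    print(group, [f"{m:.2f}" for m in means], [f"{s:.3f}" for s in sds])
```

A large standard deviation within a group is a hint that the group average hides quite different individuals, which is worth checking before reading too much into the mean.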

Here are the results for K=2.

Reference II Admixture K=2

For K=3, we get the ancestral populations: European, E Asian, African.

Reference II Admixture K=3

For K=4, the ancestral populations are South Asian, European, East Asian and African.

Reference II Admixture K=4

Let's compare the results of K=4 admixture analysis of Reference I and Reference II datasets.

While there is some difference in the average percentages of ancestral components computed with the two reference datasets, most of the differences are 1% or less. The mean absolute difference for the four components is as follows:

Ancestral Component Mean(Abs(Ref1-Ref2))
South Asian (C1) 0.92%
European (C2) 0.58%
East Asian (C3) 0.52%
African (C4) 0.32%

I have highlighted the larger differences which affect: Balochi, Kalash, Malayan, Melanesian, Papuan, and Samaritans. Even then the largest change is about 5%.

Let's also look at the Fst divergences. Here's for Reference I admixture results:

C1 C2 C3
C2 0.071
C3 0.083 0.109
C4 0.152 0.152 0.184

And for Reference II:

C1 C2 C3
C2 0.074
C3 0.086 0.118
C4 0.156 0.159 0.194

The Fst numbers for Reference II are somewhat higher.
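To put a number on "somewhat higher", here is a small sketch that takes the two K=4 Fst tables above and subtracts them pair by pair. The values are copied from the tables; the variable names are just for illustration.

```python
# Pairwise Fst at K=4, copied from the two lower-triangular tables above.
fst_ref1 = {
    ("C1", "C2"): 0.071,
    ("C1", "C3"): 0.083, ("C2", "C3"): 0.109,
    ("C1", "C4"): 0.152, ("C2", "C4"): 0.152, ("C3", "C4"): 0.184,
}
fst_ref2 = {
    ("C1", "C2"): 0.074,
    ("C1", "C3"): 0.086, ("C2", "C3"): 0.118,
    ("C1", "C4"): 0.156, ("C2", "C4"): 0.159, ("C3", "C4"): 0.194,
}

# Reference II minus Reference I for every pair of components.
diffs = {pair: round(fst_ref2[pair] - fst_ref1[pair], 3) for pair in fst_ref1}
mean_diff = sum(diffs.values()) / len(diffs)

for (a, b), d in diffs.items():
    print(f"{a}-{b}: {d:+.3f}")
print(f"mean increase: {mean_diff:.3f}")
```

Every pair comes out higher in Reference II, with the East Asian (C3) comparisons showing the biggest increases.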

Considering that Reference II has only one-eighth of the SNPs of Reference I, the results are fairly good.

Here's K=5 admixture analysis for Reference II:

Reference II Admixture K=5

Higher K values to follow.

Admixture K=4,7,9, HRP0011 to HRP0020

We'll go to higher values of K (number of ancestral populations) for batch 1 later, but let's not keep the other batches waiting.

Here's the spreadsheet with their admixture results. And you can check their ethnic backgrounds.

You might also want to refer to the reference dataset I admixture analyses for K=2-5 and K=6-9.

I did not run admixture for all values of K this time. So let's start with K=4. For quick reference,

C1 South Asian
C2 European
C3 East Asian
C4 African

Batch 2 Admixture K=4

Now, for K=7, the ancestral components are:

C1 South Asian
C2 European
C3 Southeast Asian
C4 Southwest Asian
C5 Papuan
C6 Northeast Asian
C7 African

Batch 2 Admixture K=7

And finally, here's K=9.

C1 South Asian
C2 Kalash
C3 Southwest Asian
C4 Southeast Asian
C5 European
C6 Papuan
C7 Northeast Asian
C8 West African
C9 East African

Batch 2 Admixture K=9

What do you guys think?

Higher values of K will be coming when admixture is done taking its sweet time to run. But more analysis and results are coming fast and furious now.

Reference Admixture Analysis K=6-9

Continuing the admixture analysis on my reference dataset I, let's look at K=6 ancestral components.

As before, all the results are listed in a spreadsheet.

For K=6, we get the following plot:

Admixture: Reference populations K=6

C1 (red) is the South Asian ancestral component. However, the Austronesian (Papuan/Melanesian) component has now separated from it as C5 (blue). You can see small proportions of the Papuan component among South Indian and Southeast Asian (Malay and Cambodian) populations.

C3 (green) is exactly the same as C3 in K=5 run and represents East Asian populations. C6 (magenta) is exactly the same component as C5 in the K=5 run and represents African ancestry. C4 (cyan) is the same as C4 in the K=5 run and represents Southwest/West Asia.

C2 (yellow) is the European component (maximum among North Europeans) almost the same as C2 in K=5 analysis. The major difference is that C2 (in K=6) is reduced among South Asians as compared to K=5. This is due to the South Asian component being higher for them.

Fst divergences between estimated populations for K=6:

C1 C2 C3 C4 C5
C2 0.053
C3 0.084 0.114
C4 0.068 0.052 0.130
C5 0.178 0.205 0.184 0.218
C6 0.148 0.165 0.186 0.157 0.260

When we increase the ancestral components to K=7,

Admixture: Reference populations K=7

The South Asian component (C1/red) is the same. Note that there is a significant drop from about 51% to 29% from Makranis to Iranians (ignore the Paniya as there are only 4 samples with one being very different). Looking at the 19 individual Iranian samples from our reference dataset, their South Asian ancestral component values vary from 17% to 33%.

The Southwest/West Asian component (C2/yellow) is now higher among West Asians and lower among East Africans compared to K=6 run. C3/green is the European component which now almost disappears from the Southwest Asian populations.

The East Asian component (C4/bluish green) is the same as before as is the Papuan C5/light blue component.

The African ancestry breaks into West African (C6/blue) and East African (C7/magenta).

Note that the split here is different from the batch 1 run where the East Asian split into two for K=7 and the African split happened at K=8, the opposite of what happened here.

Fst divergences between estimated populations for K=7:

C1 C2 C3 C4 C5 C6
C2 0.058
C3 0.052 0.034
C4 0.082 0.122 0.113
C5 0.176 0.210 0.204 0.184
C6 0.152 0.159 0.167 0.190 0.264
C7 0.112 0.113 0.122 0.153 0.229 0.037

At K=8, the East Asian component forks into two: a Southeast Asian one (C4/bright green) that is highest among the Dai, Malay, Cambodians and Lahu; and a Northeast Asian one (C6/blue) that is maximum among the Yakut, Oroqen, Japanese, Hezhen and Daur.

Admixture: Reference populations K=8

Among most South Asian groups in our reference dataset, the Southeast Asian component is much more common than the Northeast Asian one.

Fst divergences between estimated populations for K=8:

C1 C2 C3 C4 C5 C6 C7
C2 0.058
C3 0.052 0.034
C4 0.096 0.133 0.125
C5 0.177 0.211 0.205 0.201
C6 0.093 0.131 0.122 0.046 0.195
C7 0.152 0.159 0.167 0.200 0.266 0.201
C8 0.113 0.113 0.113 0.163 0.231 0.163 0.037

Here's the plot for K=9 ancestral components:

Reference Populations Admixture K=9

The new component here is the Kalash component which is at 94% among the Kalash but is in the 30-40% range for Caucasian and Pakistani populations. It is also present among West Asians, Europeans and Central Asians to a small degree.

The Kalash samples look fairly uniform: most have only a little of the other ancestral components, except for one sample which has only 70% Kalash component.

Kalash Admixture K=9

Fst divergences between estimated populations for K=9:

C1 C2 C3 C4 C5 C6 C7 C8
C2 0.056
C3 0.064 0.072
C4 0.088 0.126 0.136
C5 0.064 0.061 0.039 0.131
C6 0.167 0.208 0.214 0.202 0.211
C7 0.084 0.124 0.133 0.045 0.127 0.195
C8 0.152 0.173 0.161 0.201 0.172 0.266 0.200
C9 0.115 0.133 0.117 0.165 0.129 0.233 0.164 0.036

From the Fst values, we can see that the Kalash component is closest to the South Asian component and then to the European component.
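If you want to check this ordering yourself, here is a minimal sketch that reads the K=9 lower-triangular table above and ranks the other components by their Fst distance from the Kalash component (C2). The table text and component names are copied from this post; the helper names `parse_lower_triangle` and `nearest` are made up for the example.

```python
# Lower-triangular Fst table for K=9, copied from above (header row omitted).
FST_K9 = """\
C2 0.056
C3 0.064 0.072
C4 0.088 0.126 0.136
C5 0.064 0.061 0.039 0.131
C6 0.167 0.208 0.214 0.202 0.211
C7 0.084 0.124 0.133 0.045 0.127 0.195
C8 0.152 0.173 0.161 0.201 0.172 0.266 0.200
C9 0.115 0.133 0.117 0.165 0.129 0.233 0.164 0.036
"""

NAMES_K9 = {
    "C1": "South Asian", "C2": "Kalash", "C3": "Southwest Asian",
    "C4": "Southeast Asian", "C5": "European", "C6": "Papuan",
    "C7": "Northeast Asian", "C8": "West African", "C9": "East African",
}


def parse_lower_triangle(text):
    """Map each unordered pair of components to its Fst value."""
    pairs = {}
    for line in text.strip().splitlines():
        row_label, *values = line.split()
        # The n-th value on a row is the Fst against component Cn.
        for col, value in enumerate(values, start=1):
            pairs[frozenset((row_label, f"C{col}"))] = float(value)
    return pairs


def nearest(pairs, component):
    """Other components sorted by increasing Fst from `component`."""
    ranked = []
    for pair, fst in pairs.items():
        if component in pair:
            (other,) = pair - {component}
            ranked.append((fst, other))
    return sorted(ranked)


pairs = parse_lower_triangle(FST_K9)
ranking = nearest(pairs, "C2")
for fst, other in ranking[:3]:
    print(f"Kalash vs {NAMES_K9[other]}: {fst:.3f}")
```

The same two helpers work on any of the other Fst tables in these posts, as long as the header row is left out and the rows start at C2.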

To summarize, here are the ancestral components inferred for different values of K.

K=2: Eurasian, African
K=3: European, E Asian, African
K=4: S Asian, European, E Asian, African
K=5: S Asian, European, E Asian, SW Asian, African
K=6: S Asian, European, E Asian, SW Asian, Papuan, African
K=7: S Asian, SW Asian, European, E Asian, Papuan, W African, E African
K=8: S Asian, SW Asian, European, SE Asian, Papuan, NE Asian, W African, E African
K=9: S Asian, Kalash, SW Asian, SE Asian, European, Papuan, NE Asian, W African, E African

Note that for a specific value of K, they are listed in approximately decreasing average percentage among the South Asian samples in our reference dataset I.

Admixture K=6-9, HRP0001 to HRP0010

Let's continue our admixture analysis of the first 10 Harappa Project participants.

Here are their ethnic backgrounds and their admixture analysis results.

You might want to refer to the admixture analysis of the reference dataset.

Let's look at K=6 ancestral components. As seen in the reference admixture results, we got a Papuan ancestral component (C5/blue).

Batch 1 Admixture K=6

You can see the increase in the C1/red South Asian component in all the participants. The Papuan component (C5/blue) is present in all except our Assyrian sample. It is lower among the Punjabis though.

The East Asian (C3/green) is about the same as in K=5 analysis. C6/magenta, the African component, is only present in HRP0001 (me) at the same proportion as K=5. The Southwest/West Asian component (C4/cyan) is the same as C4 in K=5 with no changes.

The European component (C2/yellow) reduced in magnitude among the South Asian participants by about 14-19%. My guess about that is that the South Asian component became more "pure" for K=6 due to the separate Papuan component which was merged in the South Asian one in K=5. So it better represents the South Asians now compared to K=5, thus reducing the European proportion.

Batch 1 Admixture K=7

For K=7, C1 is South Asian, C2 European, C4 Southwest/West Asian, C5 Papuan and C7 African. These are all the same as before.

The East Asian component has split into two: C3 Southeast Asian and C6 Northeast Asian. For this batch of Harappa participants, most of their East Asian ancestry falls into the Southeast Asian component.

Batch 1 Admixture K=8

For K=8, C1 is South Asian, C4 is Southeast Asian, C5 is Papuan, C6 is Northeast Asian and these have stayed about the same.

The C2 (Southwest/West Asian) component has increased for most Harappa members, especially for HRP0010 (Assyrian Iranian). This change in the West Asian component is balanced a bit by a decrease in the C3 (European) component, but the main reason for the West Asian change is that the East African component has split from the Southwest/West Asian and African components.

The African component has split into C7 West African and C8 East African. As usual, HRP0001 (me) is the only one with any West or East African component, though I have more of East African than West which makes sense due to my (part-)Egyptian ancestry.

Batch 1 Admixture K=9

For K=9, C1 is the South Asian component and it decreased in all project members except for South Indians and Bengalis. It even decreased in the Bihari sample (HRP0003) and almost disappeared from the Assyrian Iranian one (HRP0010).

The reason is the appearance of what I am calling the Kalash ancestral component (C2). This component is at 94% among the Kalash reference population, followed by 41% among Lezgin (a Caucasian group). It is also high among the Pakistani reference populations and other Caucasian populations. Among our first batch of Harappa participants, this Kalash component is high (27-31%) among the Punjabis and Assyrian Iranian.

C3 is the Southwest/West Asian component which hasn't changed a lot among the project members. The Southeast Asian component (C4) has decreased, as has C5 (European).

The Papuan component (C6) has remained small.

C7 (Northeast Asian), C8 (West African), and C9 (East African) have stayed the same.

I am running admixture for even higher values of K, but it takes a long time. While those are running, I am going to go ahead and start the 2nd batch (HRP0011 to HRP0020). For those, I am not going to run all K values. Instead I'll do only a few. If you have any suggestions on which specific K values I should focus on for the latter batches, please let me know.

PS. I have added the names of components to the spreadsheet for ease of use, but these should be thought of as useful mnemonics rather than these components representing some "pure" ancient population. Also remember that the South Asian (or other) component from one K value to the next might not be the same.

HapMap Gujaratis

Razib is wondering what's going on with the HapMap Houston Gujaratis.

As you can see, the Chinese simply do not vary much, and are a tight cluster. But, there is a somewhat equivalent Gujarati cluster too! The HapMap sample was collected from Gujaratis in Houston. To me, it looks like that Houston population can be divided into two groups: one of the tight cluster, and the rest of the population, which is all over the place. [...] What’s more interesting is to try and understanding what’s going on with Houston Gujaratis. Anyone in the audience know?

And his 3-dimensional PCA plot: (Those on the right are Gujaratis)
PCA Plot of Gujaratis and Chinese

So I thought I would share the admixture results for the Gujaratis for K=8. Here's the spreadsheet of the admixture proportions for Gujaratis. And here is the plot:

Gujaratis Admixture K=8

The ancestral components and their statistics are as follows:

Component Range Mean Median
C1 South Asian 64-89% 81.9% 85.8%
C2 West Asian 0-13% 2.3% 1.6%
C3 European 2-22% 7.6% 5.0%
C4 Southeast Asian 0-9% 4.9% 5.0%
C5 Austronesian 1-6% 2.8% 2.9%
C6 Northeast Asian 0-3% 0.4% 0.0%
C7 West African 0-1% 0.0% 0.0%
C8 East African 0-0% 0.0% 0.0%

It looks like a majority of the Gujarati samples have mostly South Asian ancestral component with small amounts of West Asian, European and Southeast Asian, but some Gujarati samples have much larger West Asian and/or European ancestral components.

Changes due to San/Pygmy Removal

As mentioned earlier, I removed San and Pygmy groups from my reference datasets.

For the admixture runs on Reference Dataset I, the only major changes are for K=2 ancestral components where most European, Middle Eastern and South/Central Asian groups increase their African component. The changes for K=3,4,5 were minor as shown by these statistics:

K Median abs. change Maximum abs. change
3 0.01% 0.22%
4 0.02% 0.26%
5 0.02% 0.71%

I have updated the spreadsheet and the plots in the original post.

Looking at the changes in the admixture results I already posted for Harappa Project participants HRP0001 to HRP0010, there is a major change for K=2. The African component (C1/red) increased by a lot among all project participants. This seems to be due to the African component now best representing West Africans instead of Pygmies as it did before.

For K=3,4,5, the changes are very minor. Let's look at the absolute value of the changes in the percentages of ancestral components for the ten project participants.

K Median abs. change Maximum abs. change
3 0.05% 0.19%
4 0.05% 0.22%
5 0.09% 0.60%

I have updated the spreadsheets and the charts in the original post.