
Admixture K=12, HRP0011-HRP0020

Here are their ethnic backgrounds and the results spreadsheet. Also relevant are the reference I admixture results and this batch's results at lower K.

Batch 2 Admixture K=12

PS. This was run using Admixture version 1.04.

Admixture K=10-12, HRP0001 to HRP0010

Let's continue our admixture analysis of the first batch of Harappa participants.

Here are their ethnic backgrounds and their admixture analysis results.

You might want to refer to the admixture analysis of the reference dataset.

At K=10,

Batch 1 Admixture K=10

C1 South Asian C2 Kalash
C3 Southwest Asian C4 Southeast Asian
C5 European C6 Papuan
C7 Northeast Asian C8 Siberian
C9 West African C10 East African

At K=11,

Batch 1 Admixture K=11

C1 South Asian C2 Balochistan/Caucasus
C3 Kalash C4 Southeast Asian
C5 Southwest Asian C6 European
C7 Papuan C8 Northeast Asian
C9 Siberian C10 West African
C11 East African

Note the C2 component: it sounds a bit like the ANI (Ancestral North Indian) of Reich et al., though hold off on your conclusions and your excitement for now.

Also, note that this split is different from the results of the Reference I K=11 admixture run, where the East African split happened. However, at K=12 we get similar components.

At K=12,

Batch 1 Admixture K=12

C1 South Asian C2 Balochistan/Caucasus
C3 Kalash C4 Southeast Asian
C5 Southwest Asian C6 European
C7 Papuan C8 Northeast Asian
C9 Siberian C10 East African Bantus
C11 West African C12 East African

I am going to explore even higher values of K since the crossvalidation errors are still decreasing.
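As a rough sketch of how one might track those cross-validation errors, the snippet below pulls the CV error for each K out of admixture log text and checks whether the largest K still has the lowest error. It assumes the logs were produced with admixture's --cv option and contain lines of the form "CV error (K=10): 0.5530". The function names and the numbers in the sample log are made up for illustration; they are not this project's values.

```python
import re

# Matches lines such as "CV error (K=12): 0.5515" (assumed log format).
CV_LINE = re.compile(r"CV error \(K=(\d+)\):\s*([0-9.]+)")

def parse_cv_errors(log_text):
    """Return {K: cv_error} for every CV line found in the log text."""
    return {int(k): float(err) for k, err in CV_LINE.findall(log_text)}

def still_decreasing(cv_errors):
    """True if the error at the largest K is lower than at every smaller K."""
    ks = sorted(cv_errors)
    last = cv_errors[ks[-1]]
    return all(last < cv_errors[k] for k in ks[:-1])

# Hypothetical log excerpt; these values are placeholders, not real results.
sample_log = """\
CV error (K=10): 0.5530
CV error (K=11): 0.5521
CV error (K=12): 0.5515
"""

cv = parse_cv_errors(sample_log)
for k in sorted(cv):
    print(k, cv[k])
print("keep going to higher K:", still_decreasing(cv))
```

If the check comes back False, the K with the lowest error is the usual candidate for the best-supported number of components.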

Dodecad vs Harappa

Some participants in the Harappa Ancestry Project have also submitted their data to the Dodecad Project, and they were curious how the ancestry components here line up with the Dodecad ones.

So I decided to compare the two. I took the ancestral component percentages for the reference populations from the Dodecad population spreadsheet (K=10) and the Harappa Reference I spreadsheet (K=9).

I selected the 36 populations that are present in both. Some of these are still not strictly comparable, because Dodecad and Harappa may have included different samples from the same population in their reference datasets. But since we are using mean values, we can compare them barring any big outliers.

I decided to find a solution to linear equations of the form:

C1 = a11*D1 + a12*D2 + a13*D3 + a14*D4 + a15*D5 + a16*D6 + a17*D7 + a18*D8 + a19*D9 + a1A*D10
C2 = a21*D1 + a22*D2 + a23*D3 + a24*D4 + a25*D5 + a26*D6 + a27*D7 + a28*D8 + a29*D9 + a2A*D10
C3 = a31*D1 + a32*D2 + a33*D3 + a34*D4 + a35*D5 + a36*D6 + a37*D7 + a38*D8 + a39*D9 + a3A*D10
C4 = a41*D1 + a42*D2 + a43*D3 + a44*D4 + a45*D5 + a46*D6 + a47*D7 + a48*D8 + a49*D9 + a4A*D10
C5 = a51*D1 + a52*D2 + a53*D3 + a54*D4 + a55*D5 + a56*D6 + a57*D7 + a58*D8 + a59*D9 + a5A*D10
C6 = a61*D1 + a62*D2 + a63*D3 + a64*D4 + a65*D5 + a66*D6 + a67*D7 + a68*D8 + a69*D9 + a6A*D10
C7 = a71*D1 + a72*D2 + a73*D3 + a74*D4 + a75*D5 + a76*D6 + a77*D7 + a78*D8 + a79*D9 + a7A*D10
C8 = a81*D1 + a82*D2 + a83*D3 + a84*D4 + a85*D5 + a86*D6 + a87*D7 + a88*D8 + a89*D9 + a8A*D10
C9 = a91*D1 + a92*D2 + a93*D3 + a94*D4 + a95*D5 + a96*D6 + a97*D7 + a98*D8 + a99*D9 + a9A*D10

For each of the 36 populations, we'll have these 9 equations where C1 through C9 are the ancestral component percentages of that population in Harappa Project and D1 through D10 are the ancestral percentages in Dodecad Project.

The unknowns are the coefficients "a". There are 90 of them (9 x 10). Since we have 36 populations, the number of equations is 36*9=324. Therefore, this is an overdetermined system of linear equations and we can find a least squares solution to it.
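The least squares fit described above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the matrices D and C below are random stand-ins for the Dodecad and Harappa population means (not the real spreadsheets), and the variable names are mine. With the real data, D would be the 36 x 10 table of Dodecad percentages and C the 36 x 9 table of Harappa percentages, with rows in the same population order.

```python
import numpy as np

rng = np.random.default_rng(0)
n_pops, n_dodecad, n_harappa = 36, 10, 9

# Synthetic stand-ins (NOT the real data): one row per population,
# each row of D sums to 1 like a set of admixture proportions.
D = rng.dirichlet(np.ones(n_dodecad), size=n_pops)           # 36 x 10
A_true = rng.uniform(0.0, 1.0, size=(n_harappa, n_dodecad))  # 9 x 10
C = D @ A_true.T                                             # 36 x 9

# Solve D @ X ~= C in the least squares sense. X is 10 x 9, so
# A_est = X.T holds a_ij with row i = Harappa Ci, column j = Dodecad Dj.
X, residuals, rank, singular_values = np.linalg.lstsq(D, C, rcond=None)
A_est = X.T

print(A_est.shape)   # 9 rows (C1..C9) by 10 columns (D1..D10)
print(rank)          # number of independent Dodecad columns used
```

Because each Harappa component has its own row of coefficients, the 324 equations split into 9 independent fits of 36 equations and 10 unknowns each; passing the whole C matrix to lstsq solves all 9 at once. Nothing in this fit forces the coefficients to be non-negative or to sum to one, so small negative values can show up.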

Here is the solution:

Component     D1 W Asian  D2 NW African  D3 S Euro  D4 NE Asian  D5 SW Asian  D6 E Asian  D7 N Euro  D8 W African  D9 E African  D10 S Asian
C1 S Asian    0           0              0          0            0            0           0          0             0             0.92
C2 Kalash     0.54        0              -0.05      0.12         0.07         0           0.2        0             0             0.1
C3 SW Asian   0.46        0.56           0.44       0            0.9          0           -0.09      0             0.09          -0.07
C4 SE Asian   0           0              0          0            0            0.6         0          0             0             0
C5 Euro       0           0.19           0.6        0.05         -0.05        0           0.88       0             0             0
C6 Papuan     0           0              0          0            0            0           0          0             0             0
C7 NE Asian   0           0              0          0.85         0            0.4         0          0             0             0
C8 W African  0           0.12           0          0            0            0           0          1             0             0
C9 E African  0           0.12           0          0            0.05         0           0          0             0.89          0

Don't take the exact values to heart but this shows the general relationship between the Dodecad and Harappa (K=9) ancestral components.

The South Asian components are about the same in both projects.

The Kalash component is a mix but is primarily Dodecad West Asian.

The Harappa Southwest Asian component has contributions from Northwest African, West Asian and South European in addition to the Dodecad Southwest Asian component.

The Southeast Asian component corresponds partially to the Dodecad East Asian component.

The Harappa European component is more Dodecad North European than South European.

If enough Harappa-Dodecad participants are willing to let me know their IDs for both projects, I can do a similar analysis using individual data.

Admixture K=4,7,9, HRP0021 to HRP0030

Here's the spreadsheet with their admixture results. And you can check their ethnic backgrounds.

You might also want to refer to the reference dataset I admixture analyses for K=2-5 and K=6-9.

I did not run admixture for all values of K this time. So let's start with K=4. For quick reference,

C1 South Asian
C2 European
C3 East Asian
C4 African

Batch 3 Admixture K=4

Now, for K=7, the ancestral components are:

C1 South Asian
C2 European
C3 Southeast Asian
C4 Southwest Asian
C5 Papuan
C6 Northeast Asian
C7 African

Batch 3 Admixture K=7

And finally, here's K=9.

C1 South Asian
C2 Kalash
C3 Southwest Asian
C4 Southeast Asian
C5 European
C6 Papuan
C7 Northeast Asian
C8 West African
C9 East African

Batch 3 Admixture K=9

South Asian PCA

I used Eigensoft to create a PCA plot of the South Asians in our Reference I dataset (a total of 398 samples) along with the first batch of South Asian Harappa Project participants (HRP0001 to HRP0009).

The PCA software removed 2 Makranis, 1 Sindhi, 1 Balochi and 1 Brahui as outliers, thus leaving us with 402 samples to perform a PCA on.

Here are the plots for the first four eigenvectors. Click to see bigger images.

South Asian PCA eig1 vs eig2

South Asian PCA eig1 vs eig3

South Asian PCA eig2 vs eig3

South Asian PCA eig1 vs eig4

South Asian PCA eig2 vs eig4

South Asian PCA eig3 vs eig4

If you have seen the South Asian plot at 23andme, the first plot here isn't very different except that it seems rotated.

UPDATE: Eigenvectors 1 through 4 explain 1.12%, 0.77%, 0.71% and 0.44% of the total variance.
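For anyone curious how "percent of total variance explained" can be computed, here is a minimal sketch. It is plain NumPy on random stand-in data, not Eigensoft's smartpca, and it assumes the percentages are each eigenvalue divided by the sum of all eigenvalues. The function name and the demo matrix are hypothetical.

```python
import numpy as np

def explained_variance_fractions(X, n_components=4):
    """Fraction of total variance carried by the leading principal components.

    X is a samples x features matrix (for genotypes: samples x SNPs).
    """
    Xc = X - X.mean(axis=0)                  # center each column
    s = np.linalg.svd(Xc, compute_uv=False)  # singular values, descending
    variances = s ** 2                       # proportional to the eigenvalues
    return variances[:n_components] / variances.sum()

# Random stand-in data: 50 samples, 200 features (not real genotypes).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 200))

fractions = explained_variance_fractions(X_demo)
print(np.round(100 * fractions, 2))  # percentages for eigenvectors 1-4
```

Small percentages per eigenvector are normal for genome-wide data within one region, since most of the variance is spread thinly across many dimensions.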

Admixture K=4,7,9, HRP0011 to HRP0020

We'll go to higher values of K (number of ancestral populations) for batch 1 later, but let's not keep the other batches waiting.

Here's the spreadsheet with their admixture results. And you can check their ethnic backgrounds.

You might also want to refer to the reference dataset I admixture analyses for K=2-5 and K=6-9.

I did not run admixture for all values of K this time. So let's start with K=4. For quick reference,

C1 South Asian
C2 European
C3 East Asian
C4 African

Batch 2 Admixture K=4

Now, for K=7, the ancestral components are:

C1 South Asian
C2 European
C3 Southeast Asian
C4 Southwest Asian
C5 Papuan
C6 Northeast Asian
C7 African

Batch 2 Admixture K=7

And finally, here's K=9.

C1 South Asian
C2 Kalash
C3 Southwest Asian
C4 Southeast Asian
C5 European
C6 Papuan
C7 Northeast Asian
C8 West African
C9 East African

Batch 2 Admixture K=9

What do you guys think?

Higher values of K will be coming when admixture is done taking its sweet time to run. But more analysis and results are coming fast and furious now.

Admixture K=6-9, HRP0001 to HRP0010

Let's continue our admixture analysis of the first 10 Harappa Project participants.

Here are their ethnic backgrounds and their admixture analysis results.

You might want to refer to the admixture analysis of the reference dataset.

Let's look at K=6 ancestral components. As seen in the reference admixture results, we got a Papuan ancestral component (C5/blue).

Batch 1 Admixture K=6

You can see the increase in the C1/red South Asian component in all the participants. The Papuan component (C5/blue) is present in all except our Assyrian sample. It is lower among the Punjabis though.

The East Asian (C3/green) is about the same as in K=5 analysis. C6/magenta, the African component, is only present in HRP0001 (me) at the same proportion as K=5. The Southwest/West Asian component (C4/cyan) is the same as C4 in K=5 with no changes.

The European component (C2/yellow) reduced in magnitude among the South Asian participants by about 14-19%. My guess is that the South Asian component became more "pure" for K=6 because the Papuan component, which was merged into the South Asian one at K=5, is now separate. So it better represents the South Asians now compared to K=5, thus reducing the European proportion.

Batch 1 Admixture K=7

For K=7, C1 is South Asian, C2 European, C4 Southwest/West Asian, C5 Papuan and C7 African. These are all the same as before.

The East Asian component has split into two: C3 Southeast Asian and C6 Northeast Asian. For this batch of Harappa participants, most of their East Asian ancestry falls into the Southeast Asian component.

Batch 1 Admixture K=8

For K=8, C1 is South Asian, C4 is Southeast Asian, C5 is Papuan, C6 is Northeast Asian and these have stayed about the same.

The C2 (Southwest/West Asian) component has increased for most Harappa members, especially for HRP0010 (Assyrian Iranian). This increase is balanced a bit by a decrease in the C3 (European) component, but the main reason for it is that the East African component has split off from the Southwest/West Asian and African components.

The African component has split into C7 West African and C8 East African. As usual, HRP0001 (me) is the only one with any West or East African component, though I have more of East African than West which makes sense due to my (part-)Egyptian ancestry.

Batch 1 Admixture K=9

For K=9, C1 is the South Asian component and it decreased in all project members except for South Indians and Bengalis. It even decreased in the Bihari sample (HRP0003) and almost disappeared from the Assyrian Iranian one (HRP0010).

The reason is the appearance of what I am calling the Kalash ancestral component (C2). This component is at 94% among the Kalash reference population, followed by 41% among Lezgin (a Caucasian group). It is also high among the Pakistani reference populations and other Caucasian populations. Among our first batch of Harappa participants, this Kalash component is high (27-31%) among the Punjabis and the Assyrian Iranian.

C3 is the Southwest/West Asian component which hasn't changed a lot among the project members. The Southeast Asian component (C4) has decreased, as has C5 (European).

The Papuan component (C6) has remained small.

C7 (Northeast Asian), C8 (West African), and C9 (East African) have stayed the same.

I am running admixture for even higher values of K, but it takes a long time. While those are running, I am going to go ahead and start the 2nd batch (HRP0011 to HRP0020). For those, I am not going to run all K values. Instead I'll do only a few. If you have any suggestions on which specific K values I should focus on for the latter batches, please let me know.

PS. I have added the names of components to the spreadsheet for ease of use, but these should be thought of as useful mnemonics rather than these components representing some "pure" ancient population. Also remember that the South Asian (or other) component from one K value to the next might not be the same.
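One rough way to check how a component at one K relates to the components at the next K is to correlate the columns of the two result tables across the same individuals. The sketch below does that on made-up proportions; the function name, the toy matrices and the use of simple Pearson correlation are all my assumptions, not part of the admixture software.

```python
import numpy as np

def best_matches(Q_low, Q_high):
    """For each component (column) of Q_low, find the most correlated
    component of Q_high. Rows are the same individuals in both tables.
    Returns a list of (low_index, high_index, correlation).
    A column with zero variance would give NaN correlations, so this
    assumes every component varies across individuals.
    """
    k_low = Q_low.shape[1]
    corr = np.corrcoef(Q_low.T, Q_high.T)[:k_low, k_low:]
    best = corr.argmax(axis=1)
    return [(i, int(j), float(corr[i, j])) for i, j in enumerate(best)]

# Toy example (NOT real results): at the higher K the first two components
# swap positions and the third one splits into two parts.
rng = np.random.default_rng(1)
Q3 = rng.dirichlet(np.ones(3), size=20)
w = rng.uniform(0.2, 0.8, size=20)
Q4 = np.column_stack([Q3[:, 1], Q3[:, 0], Q3[:, 2] * w, Q3[:, 2] * (1 - w)])

matches = best_matches(Q3, Q4)
for low, high, r in matches:
    print(f"C{low + 1} at K=3 is closest to C{high + 1} at K=4 (r = {r:.2f})")
```

A correlation well below 1 for the best match is a hint that a component carrying the same name at two K values is not really the same thing.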

Changes due to San/Pygmy Removal

As mentioned earlier, I removed San and Pygmy groups from my reference datasets.

For the admixture runs on Reference Dataset I, the only major changes are for K=2 ancestral components where most European, Middle Eastern and South/Central Asian groups increase their African component. The changes for K=3,4,5 were minor as shown by these statistics:

K   Median abs. change   Maximum abs. change
3   0.01%                0.22%
4   0.02%                0.26%
5   0.02%                0.71%

I have updated the spreadsheet and the plots in the original post.

Looking at the changes in the admixture results I already posted for Harappa Project participants HRP0001 to HRP0010, there is a major change for K=2. The African component (C1/red) increased by a lot among all project participants. This seems to be due to the African component best representing West Africans now instead of Pygmies as it did before.

For K=3,4,5, the changes are very minor. Let's look at the absolute value of the changes in the percentages of ancestral components for the ten project participants.

K   Median abs. change   Maximum abs. change
3   0.05%                0.19%
4   0.05%                0.22%
5   0.09%                0.60%
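The median and maximum absolute changes in these tables can be computed along the following lines. This is a sketch that assumes the statistics are pooled over every participant (or population) and every component for a given K; the function name and the small before/after tables are invented for illustration.

```python
import numpy as np

def change_summary(before, after):
    """Median and maximum absolute change between two tables of percentages.

    Both tables have one row per sample and one column per component,
    in the same order.
    """
    diff = np.abs(np.asarray(after, dtype=float) - np.asarray(before, dtype=float))
    return float(np.median(diff)), float(diff.max())

# Made-up percentages for two samples and three components.
before = [[50.0, 30.0, 20.0],
          [60.0, 25.0, 15.0]]
after = [[50.1, 29.8, 20.1],
         [60.0, 25.05, 14.95]]

median_change, max_change = change_summary(before, after)
print(f"median {median_change:.3f}%, maximum {max_change:.3f}%")
```

This only works if the components are lined up the same way in both runs, which is worth checking first since admixture can order components differently from run to run.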

I have updated the spreadsheets and the charts in the original post.

Admixture K=2-5, HRP0001 to HRP0010

Finally, it's time to analyze the genomes of project participants. Admixture analysis is going to be done in batches of ten so that the ancestral components are stable from one run to another.

My choice of calling them "ancestral components" is deliberate. Please do not think of them as pure ancestral populations.

First, the ethnic background of the participants in this batch. I'll give the ethnicity only if I have explicit permission from the participant to make such information public. By default, I assume it to be private. Here's the summary:

Ethnicity Count
Punjab 5
Bengal 1
Bihar 1
Tamil 1
Andhra Pradesh 1
Iran 1

Since this is the first batch, I am running admixture for all values of K to get a better handle on how things shake out. With later batches, I will run only a few specific values of K since admixture takes a long time to run.

The ancestral component percentages for project participants can be found in this spreadsheet.

It might be good to refer to the admixture runs for the reference (spreadsheet) to get a better idea of what the different ancestral components represent.

Let's start with K=2 ancestral components.

Batch 1 Admixture K=2

Cyan/African (C2) component varies from 29-51% among participants which is about what you would expect from the results for South Asian reference populations.

With K=3 where the ancestral components roughly represent European (C1/red), East Asian (C2/green) and African (C3/blue), we see the following:

Batch 1 Admixture K=3

I am HRP0001 and my numbers for K=3 are 77% European, 18% Asian and 5% African. This contrasts with my 23andme ancestry painting of 91.22% European, 8.69% Asian and 0.09% African. However, HRP0002 has closer numbers:

HRP0002 European Asian African
HAP 55% 43% 1%
23andme 57% 43% 0%

We (HAP) are using a much more diverse reference population while 23andme ancestry painting is based on the basic three populations of HapMap. Also, since I am a quarter Egyptian, the likelihood of some African ancestry is high in my case.

Note that the Asian (C2) percentages vary from 18% to 44% for the South Asians in this batch, but it's low (18-22%) in Punjabis and higher in southern and eastern South Asians. It's almost negligible in our Iranian Assyrian sample.

With K=4, we finally get our South Asian ancestral component (C1/red).

Batch 1 Admixture K=4

I (HRP0001) am the only one with any noticeable African component (C4/violet), while HRP0002 has some East Asian ancestry (C3/cyan). The two South Indians have a lower European component (C2/green), as does HRP0002, who is from East Bengal.

Finally, let's take a look at K=5 ancestral components.

Batch 1 Admixture K=5

The South Asian (C1/red), East Asian (C3/green) and African (C5/magenta) components are about the same as in K=4. The new component here is C4/blue, which is the Southwest/West Asian component. This is basically a split from the K=4 European (C2/yellow) component. Our Assyrian sample has the highest Southwest/West Asian component while I also have it higher than the South Asians due to my quarter Egyptian ancestry.

Let's continue with higher values of K next time.