Fst for Reference I Admixture K=12

I had posted the Fst divergences between the estimated ancestral populations for the admixture analysis on the Reference I dataset. But a picture is worth a thousand words, and this dendrogram (built using complete linkage) shows the Fst numbers fairly clearly.

Remember this is not a phylogeny.

Admixture K=9, HRP0001 to HRP0040

Here are their ethnic backgrounds and the results spreadsheet. Also relevant are the reference I admixture results.

In case you guys are wondering, the new results here are those for HRP0031 to HRP0040.

PS. This was run using Admixture version 1.04.

Reference I Dendrogram

Handschar created a dendrogram using a hierarchical classifier based on K=12 admixture results and wondered:

When I run a classification based on simple euclidean distances (not a phylogeny), the Armenians and Turks, as they were, prior to the removal of the four North European admixed Behar samples in David's runs, cluster together. The North European component, in Dodecad Armenians, is practically nonexistent. I am not sure how the Harappa project "European" component translates to Dodecad components. If the admixed Armenians are included, it is possible their inclusion is impacting the Armenian population component percentages. Then again, even if included, perhaps your runs are picking up on something not previously detected. The Armenians, in previous classification runs, ordinarily matched one or more of the Caucasian Jewish groups.

While looking into his question, I figured that I would create some dendrograms too. The ones here are based on the K=12 admixture results of the Reference I dataset (spreadsheet). I am using the pairwise Euclidean distance between the admixture results of the population groups to do a complete linkage hierarchical classification. So these dendrograms show which groups are closest in terms of their admixture percentages and do not show shared ancestry. In other words, this is not a phylogeny or a family tree.
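As a rough illustration of the classification described above, here is a minimal sketch of complete linkage clustering on Euclidean distances. The population names and admixture fractions are hypothetical placeholders (not values from the spreadsheet), and the function names are made up for this example. It is not necessarily how the dendrograms in this post were produced.

```python
import math
from itertools import combinations

def euclidean(p, q):
    """Euclidean distance between two admixture profiles of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def complete_linkage(profiles):
    """Agglomerative clustering with complete linkage.

    profiles: dict mapping group name -> list of admixture fractions.
    Returns the merges in order as (members_a, members_b, distance).
    """
    clusters = [frozenset([name]) for name in profiles]
    merges = []

    def dist(c1, c2):
        # Complete linkage: distance between clusters is the LARGEST
        # pairwise distance between their members.
        return max(euclidean(profiles[a], profiles[b]) for a in c1 for b in c2)

    while len(clusters) > 1:
        a, b = min(combinations(clusters, 2), key=lambda pair: dist(*pair))
        merges.append((sorted(a), sorted(b), dist(a, b)))
        clusters = [c for c in clusters if c not in (a, b)] + [a | b]
    return merges

# Hypothetical groups with K=3 fractions, for illustration only.
profiles = {
    "pop_a": [0.80, 0.15, 0.05],
    "pop_b": [0.75, 0.20, 0.05],
    "pop_c": [0.10, 0.10, 0.80],
    "pop_d": [0.20, 0.05, 0.75],
}
merges = complete_linkage(profiles)
for left, right, d in merges:
    print(left, "+", right, "at distance", round(d, 3))
```

The merge order is what a dendrogram draws: groups joined early have similar admixture percentages, which says nothing about whether they share ancestry.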

First, I used the mean admixture percentages for each group, as given in the spreadsheet.

Reference 1 Mean Admixture Complete Linkage Dendrogram

There are a number of outliers in the dataset. For example, some Arabs and Sindhis with African admixture, some Armenians with a lot more European component than the rest, etc. Therefore, I thought a better approach would be to do the same classification using the median admixture percentages for each population group.
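To see why the median is less sensitive to outliers than the mean, here is a small sketch. It assumes the median is taken component by component across the samples of a population, and the sample values are hypothetical, chosen only to include one obvious outlier.

```python
from statistics import mean, median

def summarize(samples, stat):
    """Apply stat to each ancestral component across all samples.

    samples: list of per-individual admixture fractions (equal lengths).
    """
    return [stat(column) for column in zip(*samples)]

# Hypothetical population: three similar individuals and one outlier
# with a much larger first component.
samples = [
    [0.10, 0.90],
    [0.12, 0.88],
    [0.11, 0.89],
    [0.45, 0.55],
]

group_mean = summarize(samples, mean)
group_median = summarize(samples, median)
print("mean:  ", [round(v, 3) for v in group_mean])
print("median:", [round(v, 3) for v in group_median])
```

The single outlier pulls the mean of the first component well above what the other three individuals show, while the median stays close to them. One caveat of this component-wise approach is that the medians are not guaranteed to add up to exactly 100%.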

Reference 1 Median Admixture Complete Linkage Dendrogram

Using the median sample from each population, Handschar was correct that the Armenians match the Caucasian Jewish groups.

UPDATE: Here's another dendrogram in which I take the mean of the ancestral components for each population after removing outliers.

Reference 1 Mean (No Outliers) Admixture Complete Linkage Dendrogram

Again, don't take these dendrograms to heart. All they show is the distance between the admixture results of different populations.

Admixture K=4, HRP0001-HRP0040

Here are their ethnic backgrounds and the results spreadsheet. Also relevant are the reference I admixture results.

In case you guys are wondering, the new results here are those for HRP0031 to HRP0040.

PS. This was run using Admixture version 1.04.

Improved Admixture Bar Charts

I have improved the Admixture bar charts further. As per your demands, ethnicity information is now available in a table right below the bar plot, in the same order as the bar plot IDs.

Also, you can click on any of the legend color rectangles on the right to sort the bar chart and the table by that ancestral component. Similarly, click on the header row of the table to sort by a column.

I might make some minor tweaks to this one.

Admixture K=12, HRP0021-HRP0030

Here are their ethnic backgrounds and the results spreadsheet. Also relevant are the reference I admixture results and this batch's results at lower K.

Batch 3 Admixture K=12

If you guys can confirm that the interactive bar chart is working well for you, then this is the last static bar plot.

PS. This was run using Admixture version 1.04.

Admixture K=12, HRP0011-HRP0020

Here are their ethnic backgrounds and the results spreadsheet. Also relevant are the reference I admixture results and this batch's results at lower K.

Batch 2 Admixture K=12

PS. This was run using Admixture version 1.04.

Admixture Upgrade

I noticed a few days ago that Admixture had an update available:

1.1 (2/8/2011): Parallel processing, supervised analysis. Minor speedups and cleanups.

There were two important new features in version 1.1 that I started salivating over. One was parallel processing so I could utilize all the cores of my machine and thus run Admixture faster. The other was more important though I have yet to experiment with it. It's the ability to assign some ancestral components to specific samples, i.e. assign some individuals in the data specific 100% ancestry as a starting assumption and calculate admixture from that.

Of course, these two features made me forget the cardinal rule: Never upgrade in the middle of an analysis. But I did upgrade and things have changed subtly, making some comparisons between admixture v1.04 and v1.1 difficult.

For example, previously (admixture v1.04), at K=12, admixture was giving me the ancestral components: South Asian, Balochistan/Caucasus, Kalash, Southeast Asian, Southwest Asian, European, Papuan, Northeast Asian, Siberian, East African Bantus, West African, and East African.

With Admixture v1.1, I am getting the ancestral components: South Asian, Balochistan/Caucasus, Kalash, Southeast Asian, European, Mediterranean (maximum among Mozabite and Sardinians), Papuan, Northeast Asian, Southwest Asian, Siberian, West African, and East African.

So now I am running Admixture with different random seeds and trying to compare the old version results vs the new. Of course since we are talking K=12, just one admixture run takes a whole day.
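One practical wrinkle when comparing runs is that the component order (and labels) can differ from run to run. Here is a minimal sketch of one way to line components up, assuming both runs cover the same samples in the same order. It greedily pairs components by Pearson correlation of their per-sample fractions. The run data and labels are hypothetical, and this is only one possible approach, not the method used for the comparison in this post.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length lists (neither constant)."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def match_components(run_a, run_b):
    """Greedily pair components of two runs by highest correlation.

    run_a, run_b: dict mapping component label -> per-sample fractions.
    Returns dict: label in run_a -> (label in run_b, correlation).
    """
    pairs = sorted(
        ((pearson(a_vals, b_vals), a, b)
         for a, a_vals in run_a.items()
         for b, b_vals in run_b.items()),
        reverse=True,
    )
    matched, used_a, used_b = {}, set(), set()
    for r, a, b in pairs:
        if a not in used_a and b not in used_b:
            matched[a] = (b, r)
            used_a.add(a)
            used_b.add(b)
    return matched

# Hypothetical results for four samples from two runs (K=3).
run_a = {
    "A1": [0.70, 0.20, 0.10, 0.50],
    "A2": [0.20, 0.70, 0.10, 0.30],
    "A3": [0.10, 0.10, 0.80, 0.20],
}
run_b = {
    "B1": [0.12, 0.08, 0.78, 0.22],
    "B2": [0.68, 0.22, 0.12, 0.48],
    "B3": [0.20, 0.70, 0.10, 0.30],
}
matched = match_components(run_a, run_b)
for a, (b, r) in sorted(matched.items()):
    print(a, "->", b, "correlation", round(r, 3))
```

A component that has no highly correlated partner in the other run (like Mediterranean versus East African Bantus above) would show up here as a pairing with a low correlation.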

Anyway, while that's going on, I have more things in process which can go forward, like reporting the results of Batch 4. And working on the Eurasian dataset.

Admixture K=10-12, HRP0001 to HRP0010

Let's continue our admixture analysis of the first batch of Harappa participants.

Here are their ethnic backgrounds and their admixture analysis results.

You might want to refer to the admixture analysis of the reference dataset.

At K=10,

Batch 1 Admixture K=10

C1 South Asian C2 Kalash
C3 Southwest Asian C4 Southeast Asian
C5 European C6 Papuan
C7 Northeast Asian C8 Siberian
C9 West African C10 East African

At K=11,

Batch 1 Admixture K=11

C1 South Asian C2 Balochistan/Caucasus
C3 Kalash C4 Southeast Asian
C5 Southwest Asian C6 European
C7 Papuan C8 Northeast Asian
C9 Siberian C10 West African
C11 East African

Note the C2 component. It sounds a bit like the ANI (Ancestral North Indian) of Reich et al., though hold off on your conclusions and your excitement for now.

Also, note that this split is different from the results of Reference I K=11 admixture run where the East African split happened. However, at K=12 we get similar components.

At K=12,

Batch 1 Admixture K=12

C1 South Asian C2 Balochistan/Caucasus
C3 Kalash C4 Southeast Asian
C5 Southwest Asian C6 European
C7 Papuan C8 Northeast Asian
C9 Siberian C10 East African Bantus
C11 West African C12 East African

I am going to explore even higher values of K since the cross-validation errors are still decreasing.

Dodecad vs Harappa

Some participants in the Harappa Ancestry Project have also submitted their data to the Dodecad Project, and they were curious how the ancestry components here line up with the Dodecad ones.

So I decided to compare the two. I took the ancestral component percentages for the reference populations from the Dodecad population spreadsheet (K=10) and the Harappa Reference I spreadsheet (K=9).

I selected the 36 populations that are present in both. Some of these are still not strictly comparable, because Dodecad and Harappa did not necessarily select the same samples from these populations for their reference datasets. But we are using mean values, so barring any big outliers we can compare them.

I decided to find a solution to linear equations of the form:

C1 = a11*D1 + a12*D2 + a13*D3 + a14*D4 + a15*D5 + a16*D6 + a17*D7 + a18*D8 + a19*D9 + a1A*D10
C2 = a21*D1 + a22*D2 + a23*D3 + a24*D4 + a25*D5 + a26*D6 + a27*D7 + a28*D8 + a29*D9 + a2A*D10
C3 = a31*D1 + a32*D2 + a33*D3 + a34*D4 + a35*D5 + a36*D6 + a37*D7 + a38*D8 + a39*D9 + a3A*D10
C4 = a41*D1 + a42*D2 + a43*D3 + a44*D4 + a45*D5 + a46*D6 + a47*D7 + a48*D8 + a49*D9 + a4A*D10
C5 = a51*D1 + a52*D2 + a53*D3 + a54*D4 + a55*D5 + a56*D6 + a57*D7 + a58*D8 + a59*D9 + a5A*D10
C6 = a61*D1 + a62*D2 + a63*D3 + a64*D4 + a65*D5 + a66*D6 + a67*D7 + a68*D8 + a69*D9 + a6A*D10
C7 = a71*D1 + a72*D2 + a73*D3 + a74*D4 + a75*D5 + a76*D6 + a77*D7 + a78*D8 + a79*D9 + a7A*D10
C8 = a81*D1 + a82*D2 + a83*D3 + a84*D4 + a85*D5 + a86*D6 + a87*D7 + a88*D8 + a89*D9 + a8A*D10
C9 = a91*D1 + a92*D2 + a93*D3 + a94*D4 + a95*D5 + a96*D6 + a97*D7 + a98*D8 + a99*D9 + a9A*D10

For each of the 36 populations, we'll have these 9 equations where C1 through C9 are the ancestral component percentages of that population in Harappa Project and D1 through D10 are the ancestral percentages in Dodecad Project.

The unknowns are the coefficients "a". There are 9*10=90 of them. Since we have 36 populations, the number of equations is 36*9=324. Therefore, this is an overdetermined system of linear equations and we can find a least squares solution to it.
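To make the least squares step concrete, here is a minimal sketch. Each equation for a given Harappa component only involves that component's own row of coefficients, so the 90 unknowns split into 9 independent problems of 10 unknowns and 36 equations each. The sketch solves one such row through the normal equations, using a tiny hypothetical example (2 Dodecad components, 4 populations) in which the true coefficients are known. The function names and data are made up for illustration, and this is not necessarily how the solution below was computed.

```python
def solve_linear(matrix, rhs):
    """Solve a square linear system by Gaussian elimination with pivoting."""
    n = len(rhs)
    aug = [row[:] + [b] for row, b in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for k in range(col, n + 1):
                aug[r][k] -= factor * aug[col][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(aug[r][k] * x[k] for k in range(r + 1, n))
        x[r] = (aug[r][n] - s) / aug[r][r]
    return x

def least_squares_row(d_rows, c_values):
    """Coefficients a minimizing sum over populations of (c - a.d)^2.

    d_rows: one list of Dodecad fractions per population.
    c_values: the Harappa fraction of one component per population.
    """
    m = len(d_rows[0])
    dtd = [[sum(d[i] * d[j] for d in d_rows) for j in range(m)] for i in range(m)]
    dtc = [sum(d[i] * c for d, c in zip(d_rows, c_values)) for i in range(m)]
    return solve_linear(dtd, dtc)

# Size of the full problem described above.
unknowns = 9 * 10
equations = 36 * 9

# Hypothetical data: 4 populations, 2 Dodecad components, and one
# Harappa component built from known coefficients 0.6 and 0.1.
d_rows = [[0.9, 0.1], [0.5, 0.5], [0.2, 0.8], [0.7, 0.3]]
c_values = [0.6 * d[0] + 0.1 * d[1] for d in d_rows]

coeffs = least_squares_row(d_rows, c_values)
print(unknowns, "unknowns,", equations, "equations")
print("recovered coefficients:", [round(a, 6) for a in coeffs])
```

With real data the fit is not exact, so the recovered coefficients are the ones that minimize the squared error rather than the ones that reproduce every population perfectly.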

Here is the solution:

              D1       D2          D3      D4        D5        D6       D7      D8         D9         D10
              W Asian  NW African  S Euro  NE Asian  SW Asian  E Asian  N Euro  W African  E African  S Asian
C1 S Asian    0        0           0       0         0         0        0       0          0          0.92
C2 Kalash     0.54     0           -0.05   0.12      0.07      0        0.2     0          0          0.1
C3 SW Asian   0.46     0.56        0.44    0         0.9       0        -0.09   0          0.09       -0.07
C4 SE Asian   0        0           0       0         0         0.6      0       0          0          0
C5 Euro       0        0.19        0.6     0.05      -0.05     0        0.88    0          0          0
C6 Papuan     0        0           0       0         0         0        0       0          0          0
C7 NE Asian   0        0           0       0.85      0         0.4      0       0          0          0
C8 W African  0        0.12        0       0         0         0        0       1          0          0
C9 E African  0        0.12        0       0         0.05      0        0       0          0.89       0

Don't take the exact values to heart but this shows the general relationship between the Dodecad and Harappa (K=9) ancestral components.

The South Asian components are about the same in both projects.

The Kalash component is a mix but is primarily Dodecad West Asian.

The Harappa Southwest Asian component has contributions from Northwest African, West Asian and South European in addition to the Dodecad Southwest Asian component.

The Southeast Asian component corresponds partially to the Dodecad East Asian component.

The Harappa European component is more Dodecad North European than South European.
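The observations above can be read straight off the coefficient table. As a small sketch, the snippet below copies the coefficients from the solution above and lists the largest positive Dodecad contributors for each Harappa component. The helper name `top_contributors` is made up for this example.

```python
dodecad = ["W Asian", "NW African", "S Euro", "NE Asian", "SW Asian",
           "E Asian", "N Euro", "W African", "E African", "S Asian"]

# Rows are Harappa K=9 components; values are copied from the table above.
coefficients = {
    "S Asian":   [0, 0, 0, 0, 0, 0, 0, 0, 0, 0.92],
    "Kalash":    [0.54, 0, -0.05, 0.12, 0.07, 0, 0.2, 0, 0, 0.1],
    "SW Asian":  [0.46, 0.56, 0.44, 0, 0.9, 0, -0.09, 0, 0.09, -0.07],
    "SE Asian":  [0, 0, 0, 0, 0, 0.6, 0, 0, 0, 0],
    "Euro":      [0, 0.19, 0.6, 0.05, -0.05, 0, 0.88, 0, 0, 0],
    "Papuan":    [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
    "NE Asian":  [0, 0, 0, 0.85, 0, 0.4, 0, 0, 0, 0],
    "W African": [0, 0.12, 0, 0, 0, 0, 0, 1, 0, 0],
    "E African": [0, 0.12, 0, 0, 0.05, 0, 0, 0, 0.89, 0],
}

def top_contributors(row, names, n=2):
    """Return up to n (name, coefficient) pairs with the largest positive values."""
    ranked = sorted(zip(names, row), key=lambda pair: pair[1], reverse=True)
    return [(name, value) for name, value in ranked[:n] if value > 0]

for harappa, row in coefficients.items():
    print(harappa, "<-", top_contributors(row, dodecad))
```

Papuan comes out with no contributors at all, which fits the fact that none of the ten Dodecad components listed in the table is a Papuan one.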

If enough Harappa-Dodecad participants are willing to let me know their IDs for both projects, I can do a similar analysis using individual data.