Author Archives: Zack - Page 25

Harappa and Reference I Dendrograms

After looking at the Harappa dendrogram and the dendrogram for Reference I, I decided to combine them to see where our project participants fit.

Then I got more curious. I wanted to see a similarity tree of all the samples in Reference I (2,654) plus the 40 Harappa participants I have processed so far. That tree turned out to be so large that I could not save it in a legible form. In the end I compromised by selecting only the South Asian samples from the Reference I dataset and combining them with the Harappa data. Unfortunately, that tells the Iranian and European-admixed participants nothing about where they fit. I'll have to analyze those separately.

Anyway, here's the South Asian Admixture Dendrogram in PDF format. That means you can search for "HRP" to find all the project members, which is why I like PDF in this case better than an image.

Note how good a stand-in the Singapore Indians are for South Indians.

Harappa Admixture Dendrogram

Using the ancestral component percentages from the Admixture run at K=12 for Harappa Project participants, we can calculate the pairwise Euclidean distance between them. These distances can then be used to build a complete-linkage (i.e. furthest-neighbor) hierarchical clustering, which you see below.
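To make the calculation concrete, here is a minimal Python sketch of pairwise Euclidean distances followed by complete-linkage clustering. It is only an illustration: the sample names and admixture fractions are made up, it uses three components instead of twelve, and it is not the code or the tool used to produce the dendrogram in this post.

```python
import math
from itertools import combinations

# Hypothetical admixture fractions (each row sums to 1); NOT real project data.
samples = {
    "A": [0.70, 0.20, 0.10],
    "B": [0.65, 0.25, 0.10],
    "C": [0.10, 0.30, 0.60],
    "D": [0.20, 0.20, 0.60],
}

def euclidean(p, q):
    """Euclidean distance between two admixture vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def complete_linkage(data):
    """Agglomerate clusters; return merges as (cluster_a, cluster_b, height)."""
    def dist(c1, c2):
        # Complete linkage: distance between clusters is the LARGEST
        # distance between any member of one and any member of the other.
        return max(euclidean(data[a], data[b]) for a in c1 for b in c2)

    clusters = [(name,) for name in data]
    merges = []
    while len(clusters) > 1:
        # Merge the two clusters that are closest under complete linkage.
        c1, c2 = min(combinations(clusters, 2), key=lambda pair: dist(*pair))
        merges.append((c1, c2, dist(c1, c2)))
        clusters = [c for c in clusters if c not in (c1, c2)] + [c1 + c2]
    return merges

merges = complete_linkage(samples)
for left, right, height in merges:
    print(left, "+", right, "at height", round(height, 3))
```

The merge heights are what a dendrogram draws on its vertical axis. With real data, each vector would hold the twelve component values for one participant.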

Note that this is not a phylogeny. It just visualizes the closeness of your admixture results to others.

Thus in terms of admixture results, the Punjabis mostly cluster together along with the Rajasthani (HRP0033), except for my family (HRP0001 and HRP0035) who cluster (not so closely) with the Sindhi-Balochi guy (HRP0039) likely due to the Southwest Asian and African components.

Interestingly, the Bihari Brahmin (HRP0003) is very different from the Bihari Kayastha participant (HRP0032). The Caribbean Indian samples (HRP0027 & HRP0028) cluster with the Bihari Kayastha, so we can't really say for sure where in India their ancestors originated.

The South Indian Brahmin samples seem to differ consistently from the non-Brahmin ones.

The Iranians cluster closely except for the Khorasanian HRP0034 and Assyrian HRP0010. The Assyrian Iranian sample is actually closer to the Iraqi/Egyptian Jewish sample (HRP0037) than to other Iranians.

The participants with recent European admixture cluster very loosely with each other. Other techniques will need to be used to pinpoint their specific South Asian origins.

If we make a cut at about 0.3 on this tree, we get 3 South Asian clusters:

  • the Northwest of South Asia
  • South Indian Brahmins, Bihari Brahmin, UP Brahmin
  • South Indian non-Brahmin, Bihari non-Brahmin, Bengalis, Caribbean Indians
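Cutting a complete-linkage tree at a given height amounts to merging clusters until the next merge would exceed that height. Here is a small Python sketch of that idea. The sample names and values are hypothetical, and the 0.3 threshold only means something relative to the scale of these made-up fractions, so treat it as an illustration rather than a reproduction of the cut above.

```python
import math
from itertools import combinations

# Hypothetical admixture fractions; NOT real project data.
data = {
    "S1": [0.80, 0.10, 0.10],
    "S2": [0.75, 0.15, 0.10],
    "S3": [0.40, 0.50, 0.10],
    "S4": [0.45, 0.45, 0.10],
    "S5": [0.10, 0.20, 0.70],
}

def euclidean(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def clusters_at_height(data, height):
    """Return the clusters obtained by cutting a complete-linkage tree at `height`."""
    def dist(c1, c2):
        # Complete linkage: largest member-to-member distance.
        return max(euclidean(data[a], data[b]) for a in c1 for b in c2)

    clusters = [[name] for name in data]
    while len(clusters) > 1:
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: dist(clusters[ij[0]], clusters[ij[1]]),
        )
        # Stop once even the closest pair lies above the cut.
        if dist(clusters[i], clusters[j]) > height:
            break
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return sorted(sorted(c) for c in clusters)

print(clusters_at_height(data, 0.3))
# [['S1', 'S2'], ['S3', 'S4'], ['S5']]
```

Raising the cut merges more groups together, and lowering it splits them apart, so the number of clusters depends entirely on where the line is drawn.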

I wish I had a thousand South Asian samples to play with. I wonder how this dendrogram would look in that case.

Admixture K=12, HRP0001 to HRP0040

Here are their ethnic backgrounds and the results spreadsheet. Also relevant are the reference I admixture results.

In case you guys are wondering, the new results here are the ones for HRP0031 to HRP0040.

If you can't see the interactive charts above, JavaScript might be disabled in your browser. Here's a static image of the admixture run for HRP0031 to HRP0040.

PS. This was run using Admixture version 1.04.

Fst for Reference I Admixture K=12

I had earlier posted the Fst divergences between the estimated ancestral populations from the admixture analysis of the Reference I dataset. But a picture is worth a thousand words, and this dendrogram (using complete linkage) shows the Fst numbers fairly clearly.

Remember this is not a phylogeny.

Admixture K=9, HRP0001 to HRP0040

Here are their ethnic backgrounds and the results spreadsheet. Also relevant are the reference I admixture results.

In case you guys are wondering, the new results here are the ones for HRP0031 to HRP0040.

PS. This was run using Admixture version 1.04.

Reference I Dendrogram

Handschar created a dendrogram using a hierarchical classifier based on K=12 admixture results and wondered:

When I run a classification based on simple euclidean distances (not a phylogeny), the Armenians and Turks, as they were, prior to the removal of the four North European admixed Behar samples in David's runs, cluster together. The North European component, in Dodecad Armenians, is practically nonexistent. I am not sure how the Harappa project "European" component translates to Dodecad components. If the admixed Armenians are included, it is possible their inclusion is impacting the Armenian population component percentages. Then again, even if included, perhaps your runs are picking up on something not previously detected. The Armenians, in previous classification runs, ordinarily matched one or more of the Caucasian Jewish groups.

While looking into his question, I figured that I would create some dendrograms too. The ones here are based on the K=12 admixture results of the Reference I dataset (spreadsheet). I am using the pairwise Euclidean distance between the admixture results of the population groups to do a complete linkage hierarchical classification. So these dendrograms show which groups are closest in terms of their admixture percentages; they do not show shared ancestry. In other words, this is not a phylogeny or a family tree.

First, I used the mean admixture percentages for each group, as given in the spreadsheet.

Reference 1 Mean Admixture Complete Linkage Dendrogram

There are a number of outliers in the dataset. For example, some Arabs and Sindhis with African admixture, some Armenians with a lot more European component than the rest, etc. Therefore, I thought a better approach would be to do the same classification using the median admixture percentages for each population group.
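To see why the median is less sensitive to outliers than the mean, here is a small Python sketch. It assumes the summary is taken component by component within a group. The numbers are hypothetical, with one deliberately odd sample in the last row.

```python
from statistics import mean, median

# Hypothetical group: five samples, three ancestral components each.
# The last row plays the role of an outlier. NOT real data.
group = [
    [0.50, 0.45, 0.05],
    [0.52, 0.44, 0.04],
    [0.48, 0.47, 0.05],
    [0.51, 0.45, 0.04],
    [0.30, 0.30, 0.40],
]

def summarize(rows, stat):
    """Apply `stat` to each component (column) across all samples in the group."""
    return [stat(column) for column in zip(*rows)]

group_mean = summarize(group, mean)
group_median = summarize(group, median)

print("mean:  ", [round(v, 3) for v in group_mean])
print("median:", group_median)
```

The single outlier pulls the mean of the third component well above what the other four samples show, while the median stays with the bulk of the group. One caveat: component-wise medians are not guaranteed to sum to exactly 100%, though here they happen to.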

Reference 1 Median Admixture Complete Linkage Dendrogram

Using the median sample from each population, Handschar was correct that the Armenians match the Caucasian Jewish groups.

UPDATE: Here's another dendrogram in which I take the mean of the ancestral components for each population after removing outliers.

Reference 1 Mean (No Outliers) Admixture Complete Linkage Dendrogram

Again, don't read too much into these dendrograms. All they show is the distance between the admixture results of different populations.

Admixture K=4, HRP0001-HRP0040

Here are their ethnic backgrounds and the results spreadsheet. Also relevant are the reference I admixture results.

In case you guys are wondering, the new results here are the ones for HRP0031 to HRP0040.

PS. This was run using Admixture version 1.04.

Improved Admixture Bar Charts

I have improved the Admixture bar charts further. As per your demands, ethnicity information is now available in a table right below the bar plot, in the same order as the bar plot IDs.

Also, you can click on any of the legend color rectangles on the right to sort the bar chart and the table by that ancestral component. Similarly, click on the header row of the table to sort by a column.

I might make some minor tweaks to this one.

Admixture K=12, HRP0021-HRP0030

Here are their ethnic backgrounds and the results spreadsheet. Also relevant are the reference I admixture results and this batch's results at lower K.

Batch 3 Admixture K=12

If you guys can confirm that the interactive bar chart is working well for you, then this is the last static bar plot.

PS. This was run using Admixture version 1.04.

Google Charts

Here's a chart using Google Visualization API.

In case you are wondering, the individuals are ordered by the sum of their South Asian, Pakistan/Caucasian and Kalash component percentages.
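That ordering is just a sort on a derived value. A minimal Python sketch, with made-up IDs and percentages, and assuming a descending order (the chart's actual sort direction isn't stated here):

```python
# Hypothetical rows; the IDs and percentages are invented for illustration.
rows = [
    {"id": "P1", "South Asian": 45.0, "Pakistan/Caucasian": 30.0, "Kalash": 2.0},
    {"id": "P2", "South Asian": 20.0, "Pakistan/Caucasian": 35.0, "Kalash": 5.0},
    {"id": "P3", "South Asian": 60.0, "Pakistan/Caucasian": 25.0, "Kalash": 1.0},
]

# The three components whose sum determines the order.
SORT_COMPONENTS = ("South Asian", "Pakistan/Caucasian", "Kalash")

def combined(row):
    """Sum of the three components used for ordering."""
    return sum(row[name] for name in SORT_COMPONENTS)

# Assumption: largest combined share first.
ordered = sorted(rows, key=combined, reverse=True)
ordered_ids = [row["id"] for row in ordered]
print(ordered_ids)  # ['P3', 'P1', 'P2']
```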

If it works well for everyone, using Internet Explorer, Firefox, Chrome, or Safari on Windows, Linux, iOS or Mac OS, then I'll start using these interactive bar charts instead of the ones I have been creating in R. These just use the data from the spreadsheet directly.

I am also looking into interactive scatter plots for the PCA plots, but I am not sure if it will handle a lot of data points without running your computer into the ground.

The Google Visualization API also has a geographical map feature that uses Flash. There is also a static map chart, which I am looking into.