South Asian Map

Simranjit has another map: http://dekor-okno.ru

I am working on improving the interpolation algorithm to take into account barriers such as oceans and even terrain features like mountain ranges. However, this process takes a long time.

Anyway, in the meantime, this is one I think the participants would be interested in. It has several things in it, including an isopleth layer for C1 - South Asian (12 gradations for more impact). It also has the other Components (C1, C2, C3, C4, C5, C6, C8) represented in the form of pie charts. The base map is a topographic one this time.

Ref1 South Asians + Harappa PCA Clusters

Using the PCA results of the South Asians in Reference I as well as Harappa participants, I ran a couple of clustering algorithms.

First, I scaled the principal components by the respective eigenvalues.

Using Euclidean distance for hierarchical clustering with complete linkage, here's the dendrogram for the Harappa Project participants.
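For readers who want to try something similar, here is a minimal sketch of the two steps just described: scaling each principal component by its eigenvalue, then complete-linkage clustering on Euclidean distances. It is not the code used for these results. The sample IDs and coordinates are made up, the function names are hypothetical, and "scaling" is read here as plain multiplication by the eigenvalue.

```python
from itertools import combinations
from math import dist


def scale_pcs(pcs, eigenvalues):
    """Multiply each principal-component coordinate by its eigenvalue.

    Assumption: "scaled by the eigenvalues" means plain multiplication.
    """
    return {sid: [v * ev for v, ev in zip(vec, eigenvalues)]
            for sid, vec in pcs.items()}


def complete_linkage(points):
    """Agglomerative clustering with complete linkage.

    Returns the merge history as (cluster_a, cluster_b, distance) tuples,
    in the order the merges happen.
    """
    clusters = [(sid,) for sid in points]
    merges = []
    while len(clusters) > 1:
        best = None
        for a, b in combinations(clusters, 2):
            # Complete linkage: the distance between two clusters is the
            # largest Euclidean distance between any pair of their members.
            d = max(dist(points[i], points[j]) for i in a for j in b)
            if best is None or d < best[0]:
                best = (d, a, b)
        d, a, b = best
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
        merges.append((a, b, d))
    return merges


# Hypothetical samples with two principal components each.
eigenvalues = [3.874124, 1.819077]
pcs = {
    "S1": [0.10, 0.02],
    "S2": [0.11, 0.03],
    "S3": [-0.20, 0.05],
    "S4": [-0.21, 0.05],
}

merges = complete_linkage(scale_pcs(pcs, eigenvalues))
for a, b, d in merges:
    print(a, "+", b, "at distance", round(d, 4))
```

The merge history is what a dendrogram draws: each merge becomes a join at the height given by its distance. The brute-force search is fine for a few dozen samples but would be slow for a large dataset.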

You can compare this to the Admixture-based dendrogram:

The most obvious thing is that I (HRP0001) am an outlier by far.

We inferred three major clusters with the admixture results. Those are intact, though changed a little.

I also ran MClust on the PCA data. The optimum number of clusters was 14. The resulting cluster assignments can be seen in a spreadsheet.

For the Harappa Project participants, the numbers give the probability of assignment to a cluster. For example, for HRP0009 there is a 72% probability of belonging to cluster 4. For the reference populations, the numbers give the expected number of samples assigned to a cluster.
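To make the two kinds of numbers concrete, here is a small sketch of how per-sample cluster probabilities can be rolled up into an expected count per population. I am assuming the expected number is simply the sum of the membership probabilities over the samples in a population. The sample IDs, population labels, and probabilities below are invented for illustration.

```python
from collections import defaultdict


def expected_counts(memberships, populations):
    """Sum per-sample cluster probabilities within each population.

    memberships: sample ID -> {cluster number: probability}
    populations: sample ID -> population label
    """
    totals = defaultdict(lambda: defaultdict(float))
    for sample_id, probs in memberships.items():
        pop = populations[sample_id]
        for cluster, p in probs.items():
            totals[pop][cluster] += p
    return {pop: dict(clusters) for pop, clusters in totals.items()}


# Hypothetical reference samples and their cluster probabilities.
memberships = {
    "ref_a1": {4: 0.9, 7: 0.1},
    "ref_a2": {4: 0.6, 7: 0.4},
    "ref_b1": {7: 1.0},
}
populations = {"ref_a1": "PopA", "ref_a2": "PopA", "ref_b1": "PopB"}

counts = expected_counts(memberships, populations)
print(counts)
```

With these made-up inputs, PopA ends up with an expected 1.5 samples in cluster 4 and 0.5 in cluster 7, even though no single sample is split that way.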

Harappa Participant Admixture Maps

Following maps of the reference populations, Simranjit has gone ahead and included the Harappa Project participants in these maps as well.

Here's what he said:

I'm now incorporating project participants into the maps. I had to drop admixed individuals, however, and I made some choices: I dropped the Bihari Kayastha and the Tamil Nadu non-Brahmin for now, as they differ a fair bit. Take note that we don't have reference samples for some countries, which can sometimes throw the interpolation off (e.g. the lack of Central Asian republics other than Uzbekistan is skewing Central Asia to be more South Asian than it really is).

These maps are based on K=12 admixture run.

The gradation is from Dark green (low) to Dark red (high) for most of them.

Basically, the percentages for each Component are divided into 32 equal intervals to create the contour effect. Take note that the colors represent relative values, not absolute ones.
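As a rough illustration of the 32-interval idea, the sketch below bins a component's percentages into equal-width levels between the lowest and highest observed value. That min-to-max scaling is my reading of "relative values" and may not match exactly how the maps were produced. The function name and the sample percentages are hypothetical.

```python
def contour_level(value, lo, hi, n_levels=32):
    """Map a value in [lo, hi] to an integer level from 0 to n_levels - 1."""
    if hi == lo:
        return 0  # all values identical: only one level is possible
    frac = (value - lo) / (hi - lo)
    # The maximum value would land on n_levels, so clamp it to the top level.
    return min(int(frac * n_levels), n_levels - 1)


# Made-up percentages of one component across four locations.
percentages = [0.0, 12.5, 40.0, 80.0]
lo, hi = min(percentages), max(percentages)
levels = [contour_level(p, lo, hi) for p in percentages]
print(levels)  # [0, 5, 16, 31]
```

Because the levels are relative to each component's own range, the darkest red on one map does not correspond to the same percentage as the darkest red on another.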

C1 South Asian component:

C2 Balochistan/Caucasus component:

C6 European component:

Distance Measures

Referring to the dendrogram computed from the admixture results of Harappa Project participants, Thorfinn asked a long time ago:

Interesting that South Indian/Cow Belt Brahmins cluster together; while Punjabi Brahmins are closer to Punjabis.

I can understand the first clustering, assuming that Southern Brahmin communities are a spinoff of northern communities and have maintained relative genetic isolation; and the source Northern Brahmin population differed in original origin from other Cow Belt populations.

But how do both Brahmin communities differ equally from Punjabi/Rajasthani Brahmins; and why is that community closer to other Punjabi populations?

In terms of admixture results, that is correct in the case of the project participants. Why this is the case, I have no idea.

However, there is an issue here that we have to consider and nsriram commented about it:

The euclidean distance doesn’t seem to be the appropriate metric to capture the pairwise similarities. Once you make a commitment to the distance measure then the side effects carry-over into the tree construction.

What is a good distance measure to compute the similarity or dissimilarity of the admixture results of two people? Is the Euclidean distance a good one in this case? It is certainly the most common and probably the easiest to use, so we usually default to it.

However, if we look at the Fst divergences of the ancestral components, we see that some pairs of components are much more divergent from each other than others. So a 5% difference in C1 might not mean the same thing as a 5% difference in C10.

A solution might be to use a weighted distance, but how to weight it? The Fst numbers give pairwise distances for the different ancestral populations. If you are focused on a specific population (e.g., South Asians), we could try weighting by the Fst values between that component and the others. But I am not sure if that's a good solution either.
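Here is a minimal sketch of what such a weighted distance could look like, only to show the effect and not as a recommendation. The weights for C2 and C10 are their Fst values against C1 (0.053 and 0.184), borrowed from the K=16 Fst table elsewhere on this page. The weight for C1 itself is an arbitrary placeholder (the mean of the other two), since a component's Fst with itself is zero. The three admixture profiles are invented.

```python
from math import sqrt


def weighted_distance(p, q, weights=None):
    """Euclidean distance between two admixture profiles.

    If weights is given, each component's squared difference is multiplied
    by that component's weight before summing.
    """
    total = 0.0
    for k in set(p) | set(q):
        w = 1.0 if weights is None else weights[k]
        total += w * (p.get(k, 0.0) - q.get(k, 0.0)) ** 2
    return sqrt(total)


# Hypothetical profiles: 5% moved from C1 to C2, or from C1 to C10.
base = {"C1": 0.60, "C2": 0.30, "C10": 0.10}
shift_to_c2 = {"C1": 0.55, "C2": 0.35, "C10": 0.10}
shift_to_c10 = {"C1": 0.55, "C2": 0.30, "C10": 0.15}

# Fst against C1 used as weights; the C1 weight is an assumed placeholder.
weights = {"C1": (0.053 + 0.184) / 2, "C2": 0.053, "C10": 0.184}

plain_c2 = weighted_distance(base, shift_to_c2)
plain_c10 = weighted_distance(base, shift_to_c10)
fst_c2 = weighted_distance(base, shift_to_c2, weights)
fst_c10 = weighted_distance(base, shift_to_c10, weights)
print(plain_c2, plain_c10)  # the unweighted distances are essentially equal
print(fst_c2, fst_c10)      # the weighted distance is larger for the C10 shift
```

Plain Euclidean distance treats the two shifts as the same size, while the weighted version counts the shift toward the more divergent component as larger. The need for a placeholder weight on the focal component shows why this weighting scheme is questionable.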

In the end, a Euclidean distance measure gives us a rough idea of the differences between admixture results, but it should not be used to explain minor differences or to consider phylogenies.

Reich et al and Pan-Asian Datasets

I got access to the Reich et al (Nature 2009) dataset used in their paper "Reconstructing Indian population history".

It has the following populations:

Aonaga Aus Bhil
Chenchu Great_Andamanese Hallaki
Kamsali Kashmiri_Pandit Kharia
Kurumba Lodi Madiga
Mala Meghawal Naidu
Nysha Onge Sahariya
Santhal Satnami Siddi
Somali Srivastava Tharu
Vaish Velama Vysya

Their dataset has 141 individuals and 587,753 SNPs and is conveniently in PED format.

Also, Blaise pointed me to the Pan-Asian SNP data used in the Dec 2009 Science paper "Mapping Human Genetic Diversity in Asia".

It includes the following 71 populations:

Maya Auca Quechua Karitiana Pima
Ami Atayal Melanesians Zhuang Han_Cantonese
Hmong Jiamao Jinuo Han_Shanghai Uyghur
Wa Alorese Dayak Javanese Batak_Karo
Lamaholot Lembata Malay Mentawai Manggarai
Kambera Sunda Batak_Toba Toraja Andhra_Pradesh
Karnataka Bengali-Assamese Rajasthan Uttaranchal Uttar Pradesh
Haryana Spiti Bhili Marathi Japanese
Ryukyuan Korean Bidayuh Jehai Kelantan
Kensiu Temuan Ayta Agta Ati
Iraya Minanubu Mamanwa Filipino Singapore_Chinese
Singapore_Indian Singapore_Malay Hmong (Miao) Karen Lawa
Mlabri Mon Paluang Plang Tai_Khuen
Tai_Lue H'tin Tai_Yuan Tai_Yong Yao
Hakka Minnan

It has 1,719 individuals with 54,794 SNPs. I wish it had more SNPs considering the wealth of populations.

Also, the Pan-Asian data is in the form of minor allele counts, so I need to convert that back to A/C/G/T. Since there are some HapMap populations included in the dataset, that shouldn't be too hard.
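The conversion itself is simple once the two alleles of each SNP are known. The sketch below assumes the counts are coded 0, 1, or 2 copies of the minor allele, and that the minor and major allele letters have already been worked out for each SNP (for example, from the overlapping HapMap samples). The SNP name and alleles are hypothetical, and "0" is used for missing calls as in PED files.

```python
def count_to_genotype(minor_count, minor, major, missing="0"):
    """Turn a minor-allele count (0, 1, or 2) into a pair of allele letters.

    None is treated as a missing genotype and returned as ("0", "0").
    """
    if minor_count is None:
        return (missing, missing)
    if minor_count not in (0, 1, 2):
        raise ValueError(f"unexpected minor allele count: {minor_count!r}")
    # minor_count copies of the minor allele, the rest are the major allele.
    return (minor,) * minor_count + (major,) * (2 - minor_count)


# Hypothetical SNP: minor allele A, major allele G (assumed known).
snp_alleles = {"rs_example_1": ("A", "G")}
minor, major = snp_alleles["rs_example_1"]

genotypes = [count_to_genotype(c, minor, major) for c in (0, 1, 2, None)]
print(genotypes)
```

The hard part is the allele lookup, including strand, not this step. The sketch takes that lookup as given.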

I am going to include both these datasets into my big reference set.

Harappa Participants on 3-D PCA

sv wanted to see where he was on the South Asian 3-D PCA plot, so I obliged.

It's a quick and dirty method, but you should see a dropdown select box under the 3-D plot. Just select one of the participant IDs from there and that person's dot on the 3-D plot should increase in size so that it's easier to spot.

Isopleths

Simranjit has done a great job of creating some maps showing the distribution of the various ancestral components at K=16. He has posted them on DNA Forums and sent them to me.

The gradation is from Dark green (low) to Dark red (high) for most of them.

Basically, the percentages for each Component are divided into 32 equal intervals to create the contour effect. Take note that the colors represent relative values, not absolute ones.

Here is C1 South Asian:

C2 Balochistan/Caucasus:

C5 Southwest Asian:

C6 European:

C12 Siberian:

Great job, Simranjit!

Ref 1 South Asians + Harappa PCA

I ran PCA on the South Asian populations included in the Reference I dataset (excluding Kalash and Hazara) as well as 38 South Asian participants of the Harappa Project. I excluded Kalash and Hazara because, being so distinct, they usually dominate a South Asian PCA plot.

The reference populations included are: Balochi, Bnei Menashe Jews, Brahui, Burusho, Cochin Jews, Gujaratis (divided into two groups), Makrani, Malayan, North Kannadi, Paniya, Pathan, Sakilli, Sindhi, and Singapore Indians.

Here's the spreadsheet showing the eigenvalues and the first 15 principal components for each sample.

I computed the PCA using Eigensoft, which removed 26 samples as outliers. The Tracy-Widom statistics show that about 30 eigenvectors are significant.

Here are the first 15 eigenvalues:

1 3.874124
2 1.819077
3 1.663232
4 1.335721
5 1.293500
6 1.242984
7 1.230921
8 1.225775
9 1.222177
10 1.214539
11 1.212808
12 1.204000
13 1.198930
14 1.195450
15 1.192848

Here is a 3-D PCA plot (hat tip: Doug McDonald) showing the first three eigenvectors. The plot is rotating about the 1st eigenvector which is vertical. Also, I have stretched the principal components based on the corresponding eigenvalues.

Now here are plots of the first 14 eigenvectors. In this case, I have not stretched the principal components, so keep in mind that the first eigenvector explains 3.874124/1.819077 = 2.13 times as much variation as the 2nd eigenvector.
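The same comparison can be made for the other eigenvectors. This small snippet takes the first five eigenvalues listed above and computes how many times more variation the first eigenvector explains than each of them.

```python
# First five eigenvalues from the list above.
eigenvalues = [3.874124, 1.819077, 1.663232, 1.335721, 1.293500]

# Ratio of the first eigenvalue to each eigenvalue in turn.
ratios = [eigenvalues[0] / ev for ev in eigenvalues]

for i, r in enumerate(ratios, start=1):
    print(f"eigenvector 1 vs eigenvector {i}: {r:.2f}x")
```

These ratios are a reminder of how much the unstretched plots exaggerate the later eigenvectors relative to the first.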

UPDATE: At the bottom of the 3-D plot, you can see a dropdown. Just select one of the project participants from there and that participant's dot in the plot will become bigger so that it is easy to spot.

Balochistan/Caucasian

There has been some discussion in the comments about the C2 ancestral component in the K=12 admixture runs, which I called Pakistani/Caucasian.

First of all, we should remember that these "names" of ancestral populations are just rough mnemonics. They are chosen based on the frequencies of the component among modern reference samples. So the names have nothing at all to do with history.

In the case of Pakistani/Caucasian component, I wanted to emphasize the peaks of the component in Pakistan and the Caucasus. As commenters pointed out, the component is also quite high among the Iranians.

However, I have realized that this name, Pakistani/Caucasian, is a hindrance rather than a help for understanding the Admixture results. Also, this component is lower among the Pathan, Sindhis, and Punjabis than it is among Iranians etc. Therefore, the Pakistani part of the name is a bit of a misnomer, since the Pakistani populations in which the component is high make up only about 5% of the country's population.

On the other hand, I do not like the name "Iranian" for this component. While it was suggested based on the geographical Iranian plateau which extends from the Caucasus to Balochistan, it still is confusing and it doesn't emphasize the peak areas.

Thus, I have renamed "Pakistani/Caucasian" as "Balochistan/Caucasus". I didn't use the shorter Baloch as this component is equally high among the Baloch, Brahui and Makrani, all populations living in the province of Balochistan.

Reference I Admixture Analysis K=16

Continuing with Reference I admixture analysis, here is the results spreadsheet.

You can click on the legend to the right of the bar chart to sort by different ancestral components.

If you can't see the interactive chart above, here's a static image.

C1 South Asian
C2 Balochistan/Caucasus
C3 Kalash
C4 Southeast Asian
C5 Southwest Asian
C6 European
C7 Melanesian
C8 Naxi/Yi
C9 Japanese
C10 Papuan
C11 She
C12 Siberian
C13 Eastern Bantu
C14 Northwest African
C15 West African
C16 East African

Things are breaking down now, with the East Asian components breaking up. The usefulness of higher K's is doubtful. I am going to run K=17 on this dataset and then focus on more filtered data.

Fst divergences between estimated populations for K=16:

Here are the Fst numbers:

Fst     C1     C2     C3     C4     C5     C6     C7     C8     C9    C10    C11    C12
C2   0.053
C3   0.064  0.060
C4   0.076  0.112  0.123
C5   0.073  0.056  0.085  0.130
C6   0.064  0.040  0.073  0.118  0.048
C7   0.164  0.200  0.215  0.165  0.217  0.206
C8   0.087  0.122  0.133  0.045  0.140  0.127  0.181
C9   0.081  0.117  0.128  0.036  0.135  0.122  0.172  0.021
C10  0.184  0.222  0.237  0.200  0.238  0.227  0.145  0.215  0.207
C11  0.083  0.119  0.130  0.023  0.137  0.125  0.171  0.025  0.017  0.209
C12  0.086  0.114  0.127  0.063  0.133  0.118  0.189  0.048  0.041  0.221  0.048
C13  0.145  0.153  0.177  0.181  0.156  0.162  0.257  0.192  0.186  0.275  0.188  0.191
C14  0.079  0.063  0.096  0.127  0.052  0.056  0.211  0.138  0.132  0.232  0.134  0.132
C15  0.153  0.162  0.186  0.189  0.166  0.172  0.265  0.201  0.195  0.283  0.197  0.200
C16  0.106  0.108  0.135  0.145  0.106  0.116  0.223  0.156  0.150  0.241  0.152  0.154

Fst    C13    C14    C15
C14  0.116
C15  0.013  0.122
C16  0.034  0.079  0.041

PS. This was run using Admixture version 1.04 so I can make an apples-to-apples comparison with the previous runs.