Monthly Archives: March 2011 - Page 3

Admixture K=4, HRP0051-HRP0060

Here are their ethnic backgrounds and the results spreadsheet. Also relevant are the reference I admixture results.

The interesting samples here are the Gujaratis and the Kurd.

If you can't see the interactive bar chart above, here's a static image.

PS. This was run using Admixture version 1.04.

Austroasiatic Dataset

About 36 hours ago, Razib pointed me to the paper "Population Genetic Structure in Indian Austroasiatic speakers: The Role of Landscape Barriers and Sex-specific Admixture" by Gyaneshwer Chaubey, Mait Metspalu, Ying Choi, Reedik Mägi, Irene Gallego Romero, Pedro Soares, Mannis van Oven, Doron M. Behar, Siiri Rootsi, Georgi Hudjashov, Chandana Basu Mallick, Monika Karmin, Mari Nelis, Jüri Parik, Alla Goverdhana Reddy, Ene Metspalu, George van Driem, Yali Xue, Chris Tyler-Smith, Kumarasamy Thangaraj, Lalji Singh, Maido Remm, Martin B. Richards, Marta Mirazon Lahr, Manfred Kayser, Richard Villems and Toomas Kivisild. I now have their dataset.

I have been told that the data will hopefully be in the NCBI GEO database soon.

There are a total of 41 samples with 527,319 SNPs in the data. There are Bonda, Savara, Juang and Gadaba from Orissa; Santhal and Asur from Jharkhand; Kharia from Chhattisgarh; Ho from Bihar; Khasi and Garo from Meghalaya; and 15 Burmese.

PS. I have created a separate page for references where I link to the papers which led to the datasets I am using.

South Asian Map

Simranjit has another map: http://dekor-okno.ru

I am working on improving the interpolation algorithm to take into account barriers such as oceans and even terrain features like mountain ranges. However, this process takes a long time.

Anyway, in the meantime, this is one I think the participants would be interested in. It has several things in it: an isopleth layer for C1 - South Asian (12 gradations for more impact), plus the other components (C1, C2, C3, C4, C5, C6, C8) represented as pie charts. The base map is a topographic one this time.

Ref1 South Asians + Harappa PCA Clusters

Using the PCA results of the South Asians in Reference I as well as Harappa participants, I ran a couple of clustering algorithms.

First, I scaled the principal components by the respective eigenvalues.

Using Euclidean distance for hierarchical clustering with complete linkage, here's the dendrogram for the Harappa Project participants.

You can compare this to the Admixture-based dendrogram:

The most obvious thing is that I (HRP0001) am an outlier by far.

We inferred three major clusters with the admixture results. Those clusters are still intact here, though their membership has changed a little.
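For readers who want to try the hierarchical clustering step themselves, here is a minimal sketch in plain Python. It is not the code used for the dendrogram above. It assumes that "scaling by the eigenvalues" means multiplying each principal component coordinate by its eigenvalue, and the sample labels and coordinates are made up for illustration.

```python
import math

def scale_by_eigenvalues(pcs, eigenvalues):
    """Multiply each PC coordinate by its eigenvalue (assumed form of the scaling)."""
    return [[x * ev for x, ev in zip(row, eigenvalues)] for row in pcs]

def complete_linkage(points, labels):
    """Agglomerative clustering with Euclidean distance and complete linkage.

    Returns the merges in order as (members_a, members_b, distance).
    """
    clusters = [[i] for i in range(len(points))]
    merges = []
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Complete linkage: cluster distance is the largest pairwise distance.
                d = max(math.dist(points[i], points[j])
                        for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append(([labels[i] for i in clusters[a]],
                       [labels[i] for i in clusters[b]], d))
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return merges

# Hypothetical samples with two principal components each (not project data).
labels = ["S1", "S2", "S3", "S4"]
pcs = [[0.10, 0.02], [0.11, 0.01], [-0.20, 0.05], [-0.21, 0.30]]
eigenvalues = [3.874124, 1.819077]

merges = complete_linkage(scale_by_eigenvalues(pcs, eigenvalues), labels)
for left, right, dist in merges:
    print(left, "+", right, "at distance", round(dist, 3))
```

The order of the merges is what a dendrogram draws from the bottom up. For real data a library routine would be faster than this quadratic loop, but the logic is the same.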

I also ran MClust on the PCA data. The optimum number of clusters was 14. The resulting cluster assignments can be seen in a spreadsheet.

For the Harappa Project participants, the numbers give the probability of assignment to a cluster. For example, HRP0009 has a 72% probability of belonging to cluster 4. For the reference populations, the numbers give the expected number of samples assigned to a cluster.
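To make the two kinds of numbers concrete, here is a small sketch. It assumes that the "expected number of samples" for a reference population is the sum of the per-sample membership probabilities for each cluster. The probabilities below are invented for illustration and are not taken from the spreadsheet.

```python
def expected_cluster_counts(memberships):
    """Sum per-sample membership probabilities to get expected samples per cluster."""
    n_clusters = len(memberships[0])
    return [sum(row[k] for row in memberships) for k in range(n_clusters)]

# Hypothetical membership probabilities for three samples of one reference
# population over three clusters. Each row sums to 1.
population = [
    [0.72, 0.28, 0.00],
    [0.10, 0.90, 0.00],
    [0.00, 0.50, 0.50],
]

counts = expected_cluster_counts(population)
print([round(c, 2) for c in counts])
```

The expected counts add up to the number of samples in the population, so they need not be whole numbers.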

Harappa Participant Admixture Maps

Following maps of the reference populations, Simranjit has gone ahead and included the Harappa Project participants in these maps as well.

Here's what he said:

I'm now incorporating project participants into the maps. I had to drop admixed individuals, however, and I made some choices: I dropped the Bihari Kayastha and the Tamil Nadu non-Brahmin for now, as they differ a fair bit. Take note that we don't have reference samples for some countries, so this can sometimes cause the interpolation to be off (e.g. the lack of Central Asian republics other than Uzbekistan is skewing Central Asia to be more South Asian than it really is).

These maps are based on K=12 admixture run.

The gradation is from Dark green (low) to Dark red (high) for most of them.

Basically the percentages for each Component are divided into 32 equal intervals, to create the contour effect. Take note that it represents relative values not absolute.
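The binning described above can be sketched as follows. This is only an illustration, not Simranjit's actual procedure: it assumes "relative" means the 32 equal intervals span the range from the lowest to the highest value of that component on the map, and the percentages are made up.

```python
def relative_bins(values, n_bins=32):
    """Assign each value to one of n_bins equal-width intervals.

    The intervals run from the smallest to the largest value in the data,
    so the bin index is relative to this component, not an absolute percentage.
    """
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0 for _ in values]
    width = (hi - lo) / n_bins
    # The maximum value would land at index n_bins, so clamp it into the last bin.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]

# Hypothetical component percentages at four locations.
c1_percent = [0.0, 5.0, 40.0, 80.0]
bins = relative_bins(c1_percent)
print(bins)
```

Under this assumption, the darkest red on one map and the darkest red on another can stand for very different percentages, which is why the colors should not be compared across components.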

C1 South Asian component:

C2 Balochistan/Caucasus component:

C6 European component:

Distance Measures

Referring to the dendrogram computed from the admixture results of Harappa Project participants, Thorfinn asked a long time ago:

Interesting that South Indian/Cow Belt Brahmins cluster together; while Punjabi Brahmins are closer to Punjabis.

I can understand the first clustering, assuming that Southern Brahmin communities are a spinoff of northern communities and have maintained relative genetic isolation; and the source Northern Brahmin population differed in original origin from other Cow Belt populations.

But how do both Brahmin communities differ equally from Punjabi/Rajasthani Brahmins; and why is that community closer to other Punjabi populations?

In terms of admixture results, that is correct in the case of the project participants. Why this is the case, I have no idea.

However, there is an issue here that we have to consider and nsriram commented about it:

The euclidean distance doesn’t seem to be the appropriate metric to capture the pairwise similarities. Once you make a commitment to the distance measure then the side effects carry-over into the tree construction.

What is a good distance measure to compute the similarity or dissimilarity of the admixture results of two people? Is the Euclidean distance a good one in this case? It certainly is the most common and the easiest to use I guess. So we usually default to it.

However, if we look at the Fst divergences of the ancestral components, we see that some pairs of components are much more divergent from each other than others. So a 5% difference in C1 might not be the same as a 5% difference in C10.

A solution might be to use a weighted distance, but how to weight it? The Fst numbers give pairwise distances for the different ancestral populations. If you are focused on a specific population (e.g., South Asians), we could try weighting by the Fst values between that component and the others. But I am not sure if that's a good solution either.
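As a rough illustration of what an Fst-aware distance could look like, here is a sketch of one possible scheme. It is not the weighting proposed above and not something that has been run on the project data. It assumes the pairwise Fst values can be treated like squared distances between the ancestral components, so that the squared distance between two admixture vectors is -1/2 times the sum over i, j of d_i * d_j * Fst[i][j], where d is the difference between the two vectors. The Fst matrix and admixture proportions are hypothetical.

```python
import math

def fst_weighted_distance(a, b, fst):
    """Distance between two admixture vectors using the full pairwise Fst matrix.

    Assumes fst[i][j] behaves like a squared distance between components i and j.
    """
    delta = [x - y for x, y in zip(a, b)]
    k = len(delta)
    squared = -0.5 * sum(delta[i] * delta[j] * fst[i][j]
                         for i in range(k) for j in range(k))
    # Real Fst matrices need not be exactly Euclidean, so guard against small negatives.
    return math.sqrt(max(squared, 0.0))

# Hypothetical Fst values: components 0 and 1 are close, component 2 is distant.
fst = [
    [0.00, 0.04, 0.16],
    [0.04, 0.00, 0.16],
    [0.16, 0.16, 0.00],
]

base = [0.50, 0.50, 0.00]
shift_near = [0.55, 0.45, 0.00]  # 5% moved between the two close components
shift_far = [0.50, 0.45, 0.05]   # 5% moved to the distant component

d_near = fst_weighted_distance(base, shift_near, fst)
d_far = fst_weighted_distance(base, shift_far, fst)
print(round(d_near, 4), round(d_far, 4))
```

With plain Euclidean distance both shifts would be the same size. Here the 5% shift toward the distant component counts for more, which is the behavior the Fst argument asks for. Whether this is a sensible measure for real admixture results is an open question.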

In the end, a Euclidean distance measure gives us a rough idea of the differences between admixture results, but it should not be used to explain minor differences or to consider phylogenies.

Reich et al and Pan-Asian Datasets

I got access to the Reich et al (Nature 2009) dataset used in their paper "Reconstructing Indian population history".

It has the following populations:

Aonaga Aus Bhil
Chenchu Great_Andamanese Hallaki
Kamsali Kashmiri_Pandit Kharia
Kurumba Lodi Madiga
Mala Meghawal Naidu
Nysha Onge Sahariya
Santhal Satnami Siddi
Somali Srivastava Tharu
Vaish Velama Vysya

There are 141 individuals with 587,753 SNPs in their dataset which conveniently is in PED format.

Also, Blaise pointed me to the Pan-Asian SNP data used in the Dec 2009 Science paper "Mapping Human Genetic Diversity in Asia".

It includes the following 71 populations:

Maya Auca Quechua Karitiana Pima
Ami Atayal Melanesians Zhuang Han_Cantonese
Hmong Jiamao Jinuo Han_Shanghai Uyghur
Wa Alorese Dayak Javanese Batak_Karo
Lamaholot Lembata Malay Mentawai Manggarai
Kambera Sunda Batak_Toba Toraja Andhra_Pradesh
Karnataka Bengali-Assamese Rajasthan Uttaranchal Uttar Pradesh
Haryana Spiti Bhili Marathi Japanese
Ryukyuan Korean Bidayuh Jehai Kelantan
Kensiu Temuan Ayta Agta Ati
Iraya Minanubu Mamanwa Filipino Singapore_Chinese
Singapore_Indian Singapore_Malay Hmong (Miao) Karen Lawa
Mlabri Mon Paluang Plang Tai_Khuen
Tai_Lue H'tin Tai_Yuan Tai_Yong Yao
Hakka Minnan

It has 1,719 individuals with 54,794 SNPs. I wish it had more SNPs considering the wealth of populations.

Also, the Pan-Asian data is in the form of minor allele counts, so I need to convert that back to A/C/G/T. Since there are some HapMap populations included in the dataset, that shouldn't be too hard.
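The conversion itself is simple once the major and minor allele of each SNP are known. Here is a minimal sketch, assuming that table has already been worked out (for example from the HapMap samples in the dataset). The SNP names, alleles and counts below are placeholders, and "00" for a missing genotype is just a convention chosen for this example.

```python
def count_to_genotype(minor_count, major, minor):
    """Turn a minor-allele count (0, 1 or 2) into a two-letter genotype."""
    if minor_count is None:
        return "00"  # placeholder for a missing genotype (assumed convention)
    if minor_count not in (0, 1, 2):
        raise ValueError("minor allele count must be 0, 1 or 2")
    return major * (2 - minor_count) + minor * minor_count

# Hypothetical lookup table: SNP name -> (major allele, minor allele).
alleles = {"snp_1": ("A", "G"), "snp_2": ("C", "T"), "snp_3": ("G", "T")}

# Hypothetical minor-allele counts for one individual.
counts = {"snp_1": 1, "snp_2": 2, "snp_3": None}

genotypes = {snp: count_to_genotype(c, *alleles[snp]) for snp, c in counts.items()}
print(genotypes)
```

The harder part is building the lookup table, since which allele is the minor one has to be inferred by matching samples that also have A/C/G/T genotypes elsewhere.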

I am going to include both these datasets into my big reference set.

Harappa Participants on 3-D PCA

sv wanted to see where he was on the South Asian 3-D PCA plot, so I obliged.

It's a quick and dirty method, but you should see a dropdown select box under the 3-D plot. Just select one of the participant IDs from there and that person's dot on the 3-D plot should increase in size so that it's easier to spot.

Isopleths

Simranjit has done a great job of creating some maps showing the distribution of the various ancestral components at K=16. He has posted them on DNA Forums and sent them to me.

The gradation is from Dark green (low) to Dark red (high) for most of them.

Basically the percentages for each Component are divided into 32 equal intervals, to create the contour effect. Take note that it represents relative values not absolute.

Here is C1 South Asian:

C2 Balochistan/Caucasus:

C5 Southwest Asian:

C6 European:

C12 Siberian:

Great job, Simranjit!

Ref 1 South Asians + Harappa PCA

I ran PCA on the South Asian populations included in the Reference I dataset (excluding Kalash and Hazara) as well as 38 South Asian participants of Harappa Project. I excluded Kalash and Hazara because they are so distinct that they usually dominate a South Asian PCA plot.

The reference populations included are: Balochi, Bnei Menashe Jews, Brahui, Burusho, Cochin Jews, Gujaratis (divided into two groups), Makrani, Malayan, North Kannadi, Paniya, Pathan, Sakilli, Sindhi, and Singapore Indians.

Here's the spreadsheet showing the eigenvalues and the first 15 principal components for each sample.

I computed the PCA using Eigensoft, which removed 26 samples as outliers. The Tracy-Widom statistics show that about 30 eigenvectors are significant.

Here are the first 15 eigenvalues:

1 3.874124
2 1.819077
3 1.663232
4 1.335721
5 1.293500
6 1.242984
7 1.230921
8 1.225775
9 1.222177
10 1.214539
11 1.212808
12 1.204000
13 1.198930
14 1.195450
15 1.192848

Here is a 3-D PCA plot (hat tip: Doug McDonald) showing the first three eigenvectors. The plot is rotating about the 1st eigenvector which is vertical. Also, I have stretched the principal components based on the corresponding eigenvalues.

Now here are plots of the first 14 eigenvectors. In this case, I have not stretched the principal components, so keep in mind that the first eigenvector explains 3.874124/1.819077 = 2.13 times as much variation as the 2nd eigenvector.
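The same comparison can be made for the other eigenvectors when reading the unstretched plots. A small sketch using the first five eigenvalues listed above:

```python
# First five eigenvalues from the list above.
eigenvalues = [3.874124, 1.819077, 1.663232, 1.335721, 1.293500]

# How many times more variation the 1st eigenvector explains than each of the others.
ratios = [eigenvalues[0] / ev for ev in eigenvalues]
ratio_1_to_2 = round(ratios[1], 2)

for rank, ratio in enumerate(ratios, start=1):
    print("PC1 vs PC%d: %.2f" % (rank, ratio))
```

These are only ratios between eigenvalues. Turning them into a percentage of total variation would need the sum of all the eigenvalues, which is not given here.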

UPDATE: At the bottom of the 3-D plot, you can see a dropdown. Just select one of the project participants from there and that participant's dot in the plot will become bigger so that it is easy to spot.