Ref 2 South Asians + Harappa PCA

I ran PCA on the South Asian populations in the Reference II dataset together with 38 South Asian participants of the Harappa Project. This is a complementary analysis to the Ref1 South Asian one, as this one includes Kalash, Hazara and the additional South Asian groups in Xing et al.

The reference populations included are: Andhra Brahmin, Andhra Madiga, Andhra Mala, Balochi, Bnei Menashe Jews, Brahui, Burusho, Cochin Jews, Gujaratis, Gujaratis-B, Hazara, Irula, Kalash, Makrani, Malayan, Nepalese, North Kannadi, Paniya, Pathan, Punjabi Arain, Sakilli, Sindhi, Singapore Indians, Tamil Nadu Brahmin, and Tamil Nadu Dalit.

Here's the spreadsheet showing the eigenvalues and the first 15 principal components for each sample.

I computed the PCA using Eigensoft which removed 13 samples as outliers. The Tracy-Widom statistics show that about 25 eigenvectors are significant.

Here are the first 15 eigenvalues:

1 6.374483
2 3.650626
3 3.270121
4 2.999767
5 1.937818
6 1.713315
7 1.538295
8 1.503051
9 1.458331
10 1.448079
11 1.433288
12 1.414678
13 1.408943
14 1.390791
15 1.38101

Here is a 3-D PCA plot (hat tip: Doug McDonald) showing the first three eigenvectors. The plot is rotating about the 1st eigenvector which is vertical. I have stretched the principal components based on the corresponding eigenvalues. You can also highlight the individual project participants in the plot by using the dropdown list below the plot.

Now here are plots of the first 14 eigenvectors. In this case, I have not stretched the principal components, so keep in mind that the first eigenvector explains about 1.75 times as much variation as the 2nd eigenvector.
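As a rough illustration of the 1.75 figure and of what "stretching" might look like, here is a minimal Python sketch. The eigenvalues are the first three from the list above. The `stretch` helper and the sample coordinates are hypothetical, and the post does not say whether the 3-D plot scales by the eigenvalue itself or by its square root, so both options are shown as assumptions.

```python
import math

# First three eigenvalues, copied from the list in this post.
eigenvalues = [6.374483, 3.650626, 3.270121]


def stretch(pcs, eigenvalues, use_sqrt=True):
    """Scale each principal-component coordinate by its eigenvalue.

    Hypothetical helper: use_sqrt=True multiplies by sqrt(eigenvalue),
    use_sqrt=False multiplies by the eigenvalue itself. Which of the two
    was used for the 3-D plot is not stated in the post.
    """
    scale = [math.sqrt(ev) if use_sqrt else ev for ev in eigenvalues]
    return [coord * s for coord, s in zip(pcs, scale)]


# Ratio of the 1st eigenvalue to the 2nd, rounded to two decimals.
ratio_1_to_2 = eigenvalues[0] / eigenvalues[1]
print(round(ratio_1_to_2, 2))

# Made-up coordinates for one sample on PC1..PC3, for illustration only.
sample = [0.02, -0.01, 0.03]
print(stretch(sample, eigenvalues))
```

On an unstretched plot every axis is drawn to the same scale, which is why the ratio has to be kept in mind when reading distances along PC1 versus PC2.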

Two Steps Forward, Two Steps Back

I got my daughter a netbook, so now my computer is doing Harappa Project work 24x7.

Also, Simranjit was nice enough to offer me the use of a server. For privacy reasons, I am not going to upload any of the participants' data there but it is much faster than my machine and hence very useful for running Admixture on the reference data (especially with crossvalidation).

As for steps back, I downloaded the current 1000genomes data (1,212 samples, 2.4 million SNPs). It's in vcf format. Using vcftools to convert it to ped format will take about 3 weeks. Yes, you heard that right. BTW, the good stuff from a South Asian point of view will come later this year with 100 Assamese Ahom, 100 Kayastha from Calcutta, 100 Reddys from Hyderabad, 100 Maratha from Bombay and 100 Lahori Punjabis.

Also, I spent most of Sunday evening and night in the ER and got a diagnosis of ureterolithiasis for my efforts. All I can say is: Three cheers for Percocet!!

UPDATE: Dienekes was kind enough to send me his conversion code which, judging from the source, should run really fast.

I am still astonished that the vcftools conversion code is so slow. Maybe I should look at their source code.
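For anyone curious what such a conversion involves, here is a minimal Python sketch. It is not Dienekes's code and not what vcftools does internally. VCF stores one variant per line while ped stores one sample per line, so a straightforward converter has to transpose the data. PLINK's transposed format (.tped) keeps the variant-per-line layout, so it can be written in a single streaming pass. The function names are my own, and the sketch assumes diploid genotype calls and a GT entry in the FORMAT column.

```python
def vcf_line_to_tped(line):
    """Convert one VCF data line into one PLINK .tped line (diploid calls assumed)."""
    fields = line.rstrip("\n").split("\t")
    chrom, pos, snp_id, ref, alt = fields[:5]
    alleles = [ref] + alt.split(",")          # index 0 = REF, 1.. = ALT alleles
    gt_index = fields[8].split(":").index("GT")
    calls = []
    for sample in fields[9:]:
        gt = sample.split(":")[gt_index]
        # Phased ("|") and unphased ("/") calls are treated the same way here.
        for code in gt.replace("|", "/").split("/"):
            # "." is a missing call; "0" is PLINK's default missing-allele code.
            calls.append("0" if code == "." else alleles[int(code)])
    if snp_id == ".":
        snp_id = f"{chrom}:{pos}"             # made-up id for unnamed variants
    # Columns: chromosome, SNP id, genetic distance (unknown, so 0), position, alleles.
    return " ".join([chrom, snp_id, "0", pos] + calls)


def convert(vcf_lines):
    """Yield .tped lines for every data line, skipping headers and blank lines."""
    for line in vcf_lines:
        if line.startswith("#") or not line.strip():
            continue
        yield vcf_line_to_tped(line)
```

A matching .tfam file with one row per sample would still be needed before PLINK can read the output. The sample ids for it come from the `#CHROM` header line of the VCF.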

Admixture K=12, HRP0061-HRP0070

Here are their ethnic backgrounds and the results spreadsheet. Also relevant are the reference I admixture results.

If you can't see the interactive bar chart above, here's a static image.

I dare you to generalize!

PS. This was run using Admixture version 1.04.

Admixture K=4, HRP0061-HRP0070

Here are their ethnic backgrounds and the results spreadsheet. Also relevant are the reference I admixture results.

The interesting samples here are the Gujarati and the Punjabi. HRP0064 is very different from the other Punjabis so far.

If you can't see the interactive bar chart above, here's a static image.

PS. This was run using Admixture version 1.04.

End of March Update

I have a total of 67 participants in the project right now who have sent me their raw data. This is not counting those who have relatives participating and thus have to be filtered out for most analysis other than individual admixture percentages etc where I divide participants into small groups.

The following groups are represented:

I need to post analyses of Tamils, Bengalis and Punjabis soon.

Reference II PCA

I ran PCA on the Reference II dataset which includes 3,161 samples from various populations but with only 23,000 SNPs in common.

Here are the top ten eigenvalues:

  • 219.225396
  • 146.835968
  • 20.719760
  • 9.721733
  • 7.552482
  • 6.216977
  • 3.991663
  • 3.484690
  • 3.106919
  • 2.805874

The first two eigenvalues are much bigger than the rest: the first explains 7.12% of the variation and the second 4.77%. Even so, the Tracy-Widom stats show that about 54 eigenvectors are significant.

Here are the plots for the first 10 principal components. Remember that the 1st eigenvalue is about 1.5 times the 2nd.
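The percentages and the eigenvalues can be cross-checked against each other. If the share of variation is an eigenvalue divided by the sum of all eigenvalues, then each (eigenvalue, percentage) pair implies a total, and the two totals should roughly agree. The Python sketch below does that arithmetic with the numbers from this post. The function names are hypothetical, and the share computed for the 3rd eigenvalue is my own back-of-the-envelope figure, not something Eigensoft reported.

```python
def implied_total(eigenvalue, percent):
    """Sum of all eigenvalues implied by one eigenvalue and its stated share."""
    return eigenvalue / (percent / 100.0)


def percent_explained(eigenvalue, total):
    """Share of the variation (in %) attributed to one eigenvalue."""
    return 100.0 * eigenvalue / total


# Eigenvalues and percentages as quoted in this post.
total_from_first = implied_total(219.225396, 7.12)
total_from_second = implied_total(146.835968, 4.77)
print(total_from_first, total_from_second)

# Rough share for the 3rd eigenvalue, assuming the same total applies.
print(percent_explained(20.719760, total_from_first))
```

The two implied totals differ by less than one unit, which is within what rounding the percentages to two decimals would produce.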

Here is a 3-D PCA plot (hat tip: Doug McDonald) showing the first three eigenvectors. The plot is rotating about the 1st eigenvector which is vertical. Also, I have stretched the principal components based on the corresponding eigenvalues.

I also ran MClust on the PCA data and got 17 clusters. The results are in a spreadsheet. I am sure with more principal components than the 10 I used, I would be able to deduce finer population structure.

Do take a look at the clusters assigned to the South Asian populations from Xing et al.

Admixture K=17 maps: Mediterranean and Southwest Asian

From the Reference I K=17 Admixture results, Simranjit has created more isopleth maps.

Mediterranean component:

Southwest Asian component:

My Genetic Journey II

While my computer's busy running K=12 admixture on batch 7, K=17 admixture on batch 1, some MClust experiments and converting 1000genomes data from vcf to ped and I am reeling from the pollen count (3,939 yesterday), here are some links to my personal genetics blogging.

For the record, my daughter complains about all the "Trantor windows" open on the computer all the time. She calls the terminal windows "Trantor" because of the shell prompt. My desktop is named Trantor. Now who can guess what my laptop, my other desktop and my wireless network are named?

Reference I PCA

I ran PCA on the Reference I dataset which includes 2,654 samples from various populations.

Here are the top ten eigenvalues:

  • 178.727040
  • 118.884690
  • 15.014072
  • 9.346602
  • 5.983225
  • 5.140090
  • 3.322723
  • 2.739313
  • 2.559640
  • 2.475389

The first two eigenvalues are much bigger than the rest: the first explains 6.82% of the variation and the second 4.54%. Even so, the Tracy-Widom stats show that about 70-something eigenvectors are significant.

Here are the plots for the first 10 principal components. Remember that the 1st eigenvalue is about 1.5 times the 2nd.

Here is a 3-D PCA plot (hat tip: Doug McDonald) showing the first three eigenvectors. The plot is rotating about the 1st eigenvector which is vertical. Also, I have stretched the principal components based on the corresponding eigenvalues.

I also ran MClust on the PCA data and got 16 clusters. The results are in a spreadsheet. I am sure with more principal components than the 10 I used, I would be able to deduce finer population structure.

Note that African Americans cluster with East Africans in CL1. That's because African Americans have some European ancestry (20% on average) and that pulls them away from West Africans and towards Europeans. East Africans also lie in that direction, so they cluster together in a PCA. However, that doesn't mean that African Americans have East African ancestry. If you look at the Admixture results for African Americans, you see that their East African ancestry is negligible.
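The geometry behind this can be sketched in a few lines of Python. To a first approximation, an admixed individual lands at the ancestry-weighted average of the source populations' positions in PCA space. The centroid coordinates below are made up purely to illustrate the argument and are not taken from the Reference I results. Only the 20% European figure comes from the text above.

```python
import math

# Hypothetical PC1/PC2 centroids, invented only to illustrate the geometry.
west_african = (-0.060, 0.000)
european = (0.020, 0.000)
east_african = (-0.040, 0.004)


def mix(a, b, frac_b):
    """Ancestry-weighted average position: (1 - frac_b) of a plus frac_b of b."""
    return tuple((1 - frac_b) * x + frac_b * y for x, y in zip(a, b))


# 80% West African + 20% European, the average quoted in the text.
african_american = mix(west_african, european, 0.20)

print(math.dist(african_american, west_african))
print(math.dist(african_american, east_african))
```

With these invented centroids the admixed point ends up nearer to the East African centroid than to the West African one, even though no East African ancestry went into the mix. That is the sense in which proximity in a PCA, or membership in the same cluster, does not by itself show shared ancestry.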

Iranians

Since we have 7 Iranians in the project, it's time to look at them as a group. We also have 19 Iranians from the Behar et al dataset.

Let's look at their admixture results at K=12.

The big difference between Harappa Project Iranians and Behar et al Iranians is African admixture. Only one Harappa Iranian (HRP0046) has 1% African admixture while three Behar Iranians have more than 10%.

Let's do hierarchical clustering with complete linkage using the Euclidean distance between admixture components. First a caveat or two. This is not a phylogeny. Also, the Euclidean distance measure is not a good one for measuring differences in admixture but I am not sure what would be better.
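For readers who want to see what complete linkage on admixture components means in practice, here is a minimal pure-Python sketch. It is not the code behind the dendrogram in this post. The function name, the sample labels and the three-component proportions are all hypothetical. Under complete linkage, the distance between two clusters is the largest Euclidean distance between any pair of members, and at every step the two closest clusters are merged.

```python
import math


def complete_linkage(points, n_clusters):
    """Agglomerative clustering with complete linkage and Euclidean distance.

    points maps a label to a tuple of admixture proportions. Merging stops
    when n_clusters clusters remain; the result is a list of label lists.
    """
    clusters = [[label] for label in points]

    def cluster_distance(a, b):
        # Complete linkage: the farthest pair of members defines the distance.
        return max(math.dist(points[i], points[j]) for i in a for j in b)

    while len(clusters) > n_clusters:
        best = None
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                d = cluster_distance(clusters[x], clusters[y])
                if best is None or d < best[0]:
                    best = (d, x, y)
        _, x, y = best
        clusters[x] = clusters[x] + clusters[y]   # merge the closest pair
        del clusters[y]
    return clusters


# Made-up admixture proportions for four hypothetical samples.
demo = {
    "S1": (0.50, 0.50, 0.00),
    "S2": (0.52, 0.48, 0.00),
    "S3": (0.10, 0.20, 0.70),
    "S4": (0.15, 0.15, 0.70),
}
print(complete_linkage(demo, 2))
```

This returns a flat cut rather than a full tree. For a real dendrogram, SciPy's `scipy.cluster.hierarchy.linkage` with `method="complete"` does the same kind of merging and records every step.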

HRP0010, who is an Assyrian, actually clusters better with Caucasian, Iranian and Iraqi Jews than with Iranians.

I'll run an MDS or PCA of the whole region from Punjab/Kashmir to the Levant and Caucasus soon which should be more interesting for clustering.

UPDATE: Since Palisto wondered, I checked and found out that he, an Iraqi Kurd, is very similar to the Iranians in his admixture result. So I have included him (HRP0059).