Reference I: Eurasian Subsets

We have established that none of the Harappa participants so far have African admixture except for HRP0001 (me) and HRP0027 (Caribbean Indian). Since African populations are also the most diverse, it's best to remove them from our Reference I dataset and do some analysis using the Eurasian subset.

One option is to exclude the 517 samples of sub-Saharan African populations in our dataset:

  • Bantu Kenya: 11
  • Bantu South Africa: 8
  • Ethiopian Jews: 12
  • Ethiopians: 19
  • Kenyan Luhya: 101
  • Maasai: 135
  • Mandenka: 22
  • African Americans: 48
  • Yoruba: 161

However, in addition to the above, I decided to remove anyone from the Reference I dataset who had more than x% African ancestry (the sum of the East African, East African Bantu and West African components) in the K=12 admixture run. I created two Eurasian datasets: Eurasian90 and Eurasian95.
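As a minimal sketch of that filter, assume each sample carries its K=12 admixture fractions in a dictionary. The sample IDs and fractions below are invented for illustration, and the function names are hypothetical, not anything from my actual pipeline.

```python
# Components summed to get "African ancestry", as named in the post.
AFRICAN_COMPONENTS = ("East African", "East African Bantu", "West African")

def african_fraction(sample):
    """Sum the three African components of one sample (missing ones count as 0)."""
    return sum(sample["components"].get(name, 0.0) for name in AFRICAN_COMPONENTS)

def eurasian_subset(samples, max_african):
    """Keep samples whose summed African ancestry is not more than max_african."""
    return [s for s in samples if african_fraction(s) <= max_african]

# Hypothetical samples; the fractions are made up.
samples = [
    {"id": "S1", "components": {"South Asian": 0.97, "West African": 0.03}},
    {"id": "S2", "components": {"Southwest Asian": 0.92, "East African": 0.05,
                                "West African": 0.03}},
    {"id": "S3", "components": {"Southwest Asian": 0.80, "East African": 0.15,
                                "West African": 0.05}},
]

eurasian90 = eurasian_subset(samples, 0.10)  # drop anyone above 10% African
eurasian95 = eurasian_subset(samples, 0.05)  # drop anyone above 5% African

print([s["id"] for s in eurasian90])
print([s["id"] for s in eurasian95])
```

The same function produces both datasets; only the threshold changes.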

Eurasian90 excludes all samples with more than 10% African admixture. That completely removes the following populations in addition to the above:

  • Egyptians: 12
  • Moroccans: 10
  • Mozabite: 29

Also, some samples from the following populations were removed for Eurasian90:

  • Balochi: 3/24
  • Bedouin: 19/46
  • Brahui: 2/25
  • Iranians: 3/19
  • Jordanians: 6/20
  • Lebanese: 2/7
  • Makrani: 3/25
  • Palestinian: 10/46
  • Saudis: 2/20
  • Sindhi: 2/24
  • Syrians: 2/16
  • Yemenese: 7/8

That's a total of 629 samples in the Reference I dataset that had more than 10% African admixture. Thus Eurasian90 has 2,025 samples. The complete list is here.

The other dataset, Eurasian95, excludes everyone with more than 5% African admixture. Thus, in addition to the samples listed above, it excludes the following:

  • Balochi: 1
  • Bedouin: 19
  • Brahui: 1
  • Druze: 1
  • Iranians: 1
  • Jordanians: 14 (completely removed)
  • Makrani: 8
  • Morocco Jews: 2
  • Palestinian: 36 (completely removed)
  • Saudis: 16
  • Sindhi: 2
  • Syrians: 7
  • Yemenese: 1 (completely removed)
  • Yemen Jews: 15 (completely removed)

Eurasian95 is thus left with 1,901 samples, whose breakdown is listed here.
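For anyone who wants to check the bookkeeping, here is a short tally using only the counts listed in this post. The full Reference I size is not stated above; the sketch derives it from the Eurasian90 size plus the removed samples, so treat that number as implied rather than quoted.

```python
# Counts copied from the lists in this post.
sub_saharan = {
    "Bantu Kenya": 11, "Bantu South Africa": 8, "Ethiopian Jews": 12,
    "Ethiopians": 19, "Kenyan Luhya": 101, "Maasai": 135,
    "Mandenka": 22, "African Americans": 48, "Yoruba": 161,
}
fully_removed_90 = {"Egyptians": 12, "Moroccans": 10, "Mozabite": 29}
# population: (samples removed for Eurasian90, population size)
partly_removed_90 = {
    "Balochi": (3, 24), "Bedouin": (19, 46), "Brahui": (2, 25),
    "Iranians": (3, 19), "Jordanians": (6, 20), "Lebanese": (2, 7),
    "Makrani": (3, 25), "Palestinian": (10, 46), "Saudis": (2, 20),
    "Sindhi": (2, 24), "Syrians": (2, 16), "Yemenese": (7, 8),
}
extra_removed_95 = {
    "Balochi": 1, "Bedouin": 19, "Brahui": 1, "Druze": 1, "Iranians": 1,
    "Jordanians": 14, "Makrani": 8, "Morocco Jews": 2, "Palestinian": 36,
    "Saudis": 16, "Sindhi": 2, "Syrians": 7, "Yemenese": 1, "Yemen Jews": 15,
}

removed_90 = (sum(sub_saharan.values())
              + sum(fully_removed_90.values())
              + sum(n for n, _ in partly_removed_90.values()))
eurasian90_size = 2025                          # stated in the post
reference_size = eurasian90_size + removed_90   # implied, not stated
eurasian95_size = eurasian90_size - sum(extra_removed_95.values())

# Populations from the "3/24"-style list that end up with no samples in Eurasian95.
emptied = [name for name, (n90, size) in partly_removed_90.items()
           if n90 + extra_removed_95.get(name, 0) == size]

print(removed_90, reference_size, eurasian95_size)
print(emptied)
```

The three populations it reports as emptied are the ones marked "completely removed" above whose sizes appear in the Eurasian90 list.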

I'll be experimenting with both Eurasian90 and Eurasian95.

Project Update

I have a total of 42 participants in the project right now who have sent me their raw data. This does not count two people who have relatives participating and thus have to be filtered out of most analyses, other than individual admixture percentages and the like, where I divide participants into small groups.

The following groups are represented:

  • Punjab: 7
  • Iran: 6
  • Tamil: 5
  • Andhra Pradesh: 2
  • Bengal: 2
  • Bihar: 2
  • Karnataka: 2
  • Caribbean Indian: 2
  • Kashmir: 2
  • Anglo-Indian: 1
  • Roma: 1
  • Goa: 1
  • Uttar Pradesh: 1
  • Sri Lankan: 1
  • Rajasthan: 1
  • Kerala: 1
  • Baloch: 1
  • Unknown: 1

The unknown is Manu Sporny who has put his genetic data in the public domain and I have drafted him into our project.

In addition, out of curiosity, I have accepted data from the following:

  • Iraqi Arab: 2
  • Egyptian/Iraqi Jew: 1

I know a bunch of you have done a lot to make this project known and gotten people to submit their data. But we really do need more participants of every ethnicity and geographic region in and around South Asia. So keep on!

I am working on K=12 admixture runs for the batches we have already done. In addition, the reference I dataset will be used for even higher values of K admixture components to see where the limit is.

Also, I am looking into doing chromosome by chromosome admixture (and other analysis). I have done some experimental runs and once I have pored over that data, I'll have something to report.

As we have seen, even with the removal of the San and Pygmy, the Africans take up 3 ancestral components and most South Asians (excepting me of course) do not have any African admixture. So I am working on a reference dataset without any Africans. I have my own take on how to do that which I'll share in the next few days.

In short, my home computer is running admixture, plink, eigensoft, etc. 24x7.

Admixture K=10-12, HRP0001 to HRP0010

Let's continue our admixture analysis of the first batch of Harappa participants.

Here are their ethnic backgrounds and their admixture analysis results.

You might want to refer to the admixture analysis of the reference dataset.

At K=10,

Batch 1 Admixture K=10

C1 South Asian C2 Kalash
C3 Southwest Asian C4 Southeast Asian
C5 European C6 Papuan
C7 Northeast Asian C8 Siberian
C9 West African C10 East African

At K=11,

Batch 1 Admixture K=11

C1 South Asian C2 Balochistan/Caucasus
C3 Kalash C4 Southeast Asian
C5 Southwest Asian C6 European
C7 Papuan C8 Northeast Asian
C9 Siberian C10 West African
C11 East African

Note the C2 component. It sounds a bit like the ANI (Ancestral North Indian) of Reich et al., though hold off on your conclusions and your excitement for now.

Also, note that this split is different from the results of Reference I K=11 admixture run where the East African split happened. However, at K=12 we get similar components.

At K=12,

Batch 1 Admixture K=12

C1 South Asian C2 Balochistan/Caucasus
C3 Kalash C4 Southeast Asian
C5 Southwest Asian C6 European
C7 Papuan C8 Northeast Asian
C9 Siberian C10 East African Bantus
C11 West African C12 East African

I am going to explore even higher values of K since the crossvalidation errors are still decreasing.
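To make "still decreasing" concrete, here is a small sketch that reads cross-validation errors from log lines and checks whether each higher K has a lower error. The numbers are invented, and the "CV error (K=...)" line format is an assumption about what the admixture log prints when run with its cross-validation option; adjust the pattern if your logs differ.

```python
import re

# Hypothetical log lines: both the values and the exact format are assumptions.
log_lines = [
    "CV error (K=10): 0.51230",
    "CV error (K=11): 0.51180",
    "CV error (K=12): 0.51150",
]

CV_LINE = re.compile(r"CV error \(K=(\d+)\): ([0-9.]+)")

def parse_cv_errors(lines):
    """Return {K: cross-validation error} for every matching line."""
    errors = {}
    for line in lines:
        match = CV_LINE.search(line)
        if match:
            errors[int(match.group(1))] = float(match.group(2))
    return errors

def still_decreasing(errors):
    """True if the error drops at every step from the lowest K to the highest."""
    ks = sorted(errors)
    return all(errors[a] > errors[b] for a, b in zip(ks, ks[1:]))

cv_errors = parse_cv_errors(log_lines)
print(cv_errors)
print(still_decreasing(cv_errors))
```

If the check ever returns False, the K just before the error turns upward is the usual candidate for where to stop.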

Dodecad vs Harappa

Some participants in the Harappa Ancestry Project have also submitted their data to the Dodecad Project, and they were curious how the ancestry components here line up with the Dodecad ones.

So I decided to compare the two. I took the ancestral component percentages for the reference populations from the Dodecad population spreadsheet (K=10) and the Harappa Reference I spreadsheet (K=9).

I selected the 36 populations that are present in both. While some of these are still not comparable because of which samples out of these populations were selected to be included in the reference datasets for Dodecad and Harappa, we are using mean values, so barring any big outliers we can compare them.

I decided to find a solution to linear equations of the form:

C1 = a11*D1 + a12*D2 + a13*D3 + a14*D4 + a15*D5 + a16*D6 + a17*D7 + a18*D8 + a19*D9 + a1A*D10
C2 = a21*D1 + a22*D2 + a23*D3 + a24*D4 + a25*D5 + a26*D6 + a27*D7 + a28*D8 + a29*D9 + a2A*D10
C3 = a31*D1 + a32*D2 + a33*D3 + a34*D4 + a35*D5 + a36*D6 + a37*D7 + a38*D8 + a39*D9 + a3A*D10
C4 = a41*D1 + a42*D2 + a43*D3 + a44*D4 + a45*D5 + a46*D6 + a47*D7 + a48*D8 + a49*D9 + a4A*D10
C5 = a51*D1 + a52*D2 + a53*D3 + a54*D4 + a55*D5 + a56*D6 + a57*D7 + a58*D8 + a59*D9 + a5A*D10
C6 = a61*D1 + a62*D2 + a63*D3 + a64*D4 + a65*D5 + a66*D6 + a67*D7 + a68*D8 + a69*D9 + a6A*D10
C7 = a71*D1 + a72*D2 + a73*D3 + a74*D4 + a75*D5 + a76*D6 + a77*D7 + a78*D8 + a79*D9 + a7A*D10
C8 = a81*D1 + a82*D2 + a83*D3 + a84*D4 + a85*D5 + a86*D6 + a87*D7 + a88*D8 + a89*D9 + a8A*D10
C9 = a91*D1 + a92*D2 + a93*D3 + a94*D4 + a95*D5 + a96*D6 + a97*D7 + a98*D8 + a99*D9 + a9A*D10

For each of the 36 populations, we'll have these 9 equations where C1 through C9 are the ancestral component percentages of that population in Harappa Project and D1 through D10 are the ancestral percentages in Dodecad Project.

The unknowns are the coefficients "a". There are 90 unknowns (9 Harappa components times 10 Dodecad components). Since we have 36 populations, the number of equations is 36*9=324. Therefore, this is an overdetermined system of linear equations and we can find a least squares solution to it.
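Here is a minimal sketch of how such a fit can be set up. It is not the exact procedure I ran: it uses a tiny made-up problem (5 populations, 3 "Dodecad" components, 2 "Harappa" components) and solves the normal equations in plain Python so that it is self-contained. In practice a library routine such as numpy.linalg.lstsq would be the usual choice. All names and numbers are illustrative.

```python
def transpose(m):
    return [list(row) for row in zip(*m)]

def matmul(a, b):
    bt = transpose(b)
    return [[sum(x * y for x, y in zip(row, col)) for col in bt] for row in a]

def solve(a, b):
    """Solve a @ x = b (a square, b a matrix) by Gauss-Jordan with partial pivoting."""
    n = len(a)
    aug = [list(a[i]) + list(b[i]) for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        pivot_value = aug[col][col]
        aug[col] = [v / pivot_value for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [v - factor * p for v, p in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]

def fit_coefficients(d, c):
    """Least squares fit of c ~ d @ a^T via the normal equations.

    d: populations x Dodecad components, c: populations x Harappa components.
    Returns a with one row per Harappa component, matching the C = a*D equations.
    """
    dt = transpose(d)
    x = solve(matmul(dt, d), matmul(dt, c))
    return transpose(x)

# Made-up data: 5 populations, 3 source components.
d = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1], [0.2, 0.2, 0.6],
     [0.5, 0.5, 0.0], [0.3, 0.1, 0.6]]
a_true = [[0.9, 0.0, 0.1], [0.1, 1.0, 0.0]]   # 2 target components
c = matmul(d, transpose(a_true))              # exact targets, no noise

a_fit = fit_coefficients(d, c)
for row in a_fit:
    print([round(v, 3) for v in row])
```

Because the made-up targets are generated without noise, the fit recovers a_true; with real population means the result is only the best compromise across all 324 equations.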

Here is the solution:

              D1       D2          D3      D4        D5        D6       D7      D8         D9         D10
              W Asian  NW African  S Euro  NE Asian  SW Asian  E Asian  N Euro  W African  E African  S Asian
C1 S Asian    0        0           0       0         0         0        0       0          0          0.92
C2 Kalash     0.54     0           -0.05   0.12      0.07      0        0.2     0          0          0.1
C3 SW Asian   0.46     0.56        0.44    0         0.9       0        -0.09   0          0.09       -0.07
C4 SE Asian   0        0           0       0         0         0.6      0       0          0          0
C5 Euro       0        0.19        0.6     0.05      -0.05     0        0.88    0          0          0
C6 Papuan     0        0           0       0         0         0        0       0          0          0
C7 NE Asian   0        0           0       0.85      0         0.4      0       0          0          0
C8 W African  0        0.12        0       0         0         0        0       1          0          0
C9 E African  0        0.12        0       0         0.05      0        0       0          0.89       0

Don't take the exact values to heart but this shows the general relationship between the Dodecad and Harappa (K=9) ancestral components.

The South Asian components are about the same in both projects.

The Kalash component is a mix but is primarily Dodecad West Asian.

The Harappa Southwest Asian has contributions from Northwest African, West Asian and South European in addition to the Dodecad Southwest Asian component.

The Southeast Asian component corresponds partially to the Dodecad East Asian component.

The Harappa European component is more Dodecad North European than South European.
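To show how the coefficient table is meant to be read, here is a sketch that applies it to a Dodecad-style profile. The coefficients are copied from the solution above; the example profile (50% S Asian, 30% W Asian, 20% N Euro) is invented, and the function name is hypothetical. Given the caveat about exact values, the output is a rough translation, not a prediction.

```python
DODECAD = ["W Asian", "NW African", "S Euro", "NE Asian", "SW Asian",
           "E Asian", "N Euro", "W African", "E African", "S Asian"]
HARAPPA = ["S Asian", "Kalash", "SW Asian", "SE Asian", "Euro",
           "Papuan", "NE Asian", "W African", "E African"]

# One row per Harappa component (C1..C9), one column per Dodecad component (D1..D10).
A = [
    [0,    0,    0,     0,    0,     0,   0,     0, 0,    0.92],
    [0.54, 0,    -0.05, 0.12, 0.07,  0,   0.2,   0, 0,    0.1],
    [0.46, 0.56, 0.44,  0,    0.9,   0,   -0.09, 0, 0.09, -0.07],
    [0,    0,    0,     0,    0,     0.6, 0,     0, 0,    0],
    [0,    0.19, 0.6,   0.05, -0.05, 0,   0.88,  0, 0,    0],
    [0,    0,    0,     0,    0,     0,   0,     0, 0,    0],
    [0,    0,    0,     0.85, 0,     0.4, 0,     0, 0,    0],
    [0,    0.12, 0,     0,    0,     0,   0,     1, 0,    0],
    [0,    0.12, 0,     0,    0.05,  0,   0,     0, 0.89, 0],
]

def dodecad_to_harappa(profile):
    """Apply C_i = sum_j a_ij * D_j to a {Dodecad component: fraction} profile."""
    d = [profile.get(name, 0.0) for name in DODECAD]
    return {h: sum(a * x for a, x in zip(row, d)) for h, row in zip(HARAPPA, A)}

# Invented Dodecad profile for illustration.
profile = {"S Asian": 0.50, "W Asian": 0.30, "N Euro": 0.20}
estimate = dodecad_to_harappa(profile)
for name, value in estimate.items():
    print(f"{name:10s} {value:6.3f}")
```

Note that the estimated fractions need not sum to exactly 1, and a negative coefficient can even push a component slightly below zero, which is one more reason not to take the exact values to heart.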

If enough Harappa-Dodecad participants are willing to let me know their IDs for both projects, I can do a similar analysis using individual data.

Reference I Admixture Analysis K=10-12

A week later, some more Admixture analysis of Reference I dataset.

As usual, the results are available in a spreadsheet, which is also listed on my sidebar.

Let's start with K=10.

Admixture: Reference I populations K=10

C1 South Asian C2 Kalash
C3 Southwest Asian C4 Southeast Asian
C5 European C6 Papuan
C7 Northeast Asian C8 Siberian
C9 West African C10 East African

The addition here is basically of the Siberian component which is highest among the Yakut.

Fst divergences between estimated populations for K=10:

C1 C2 C3 C4 C5 C6 C7 C8 C9
C2 0.057
C3 0.064 0.073
C4 0.089 0.127 0.136
C5 0.063 0.061 0.038 0.131
C6 0.167 0.209 0.215 0.202 0.210
C7 0.080 0.120 0.129 0.032 0.123 0.190
C8 0.085 0.117 0.127 0.059 0.118 0.203 0.039
C9 0.152 0.174 0.161 0.201 0.171 0.266 0.195 0.199
C10 0.115 0.133 0.117 0.166 0.128 0.233 0.160 0.163 0.036
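A lower-triangular table like this is easier to scan once the pairs are sorted. The sketch below copies the K=10 values from the table above and lists the closest and most distant pairs of components; the variable names are just for illustration.

```python
labels = ["South Asian", "Kalash", "Southwest Asian", "Southeast Asian",
          "European", "Papuan", "Northeast Asian", "Siberian",
          "West African", "East African"]

# Row i holds the Fst between component i+2 and components C1..C(i+1).
lower = [
    [0.057],
    [0.064, 0.073],
    [0.089, 0.127, 0.136],
    [0.063, 0.061, 0.038, 0.131],
    [0.167, 0.209, 0.215, 0.202, 0.210],
    [0.080, 0.120, 0.129, 0.032, 0.123, 0.190],
    [0.085, 0.117, 0.127, 0.059, 0.118, 0.203, 0.039],
    [0.152, 0.174, 0.161, 0.201, 0.171, 0.266, 0.195, 0.199],
    [0.115, 0.133, 0.117, 0.166, 0.128, 0.233, 0.160, 0.163, 0.036],
]

# Flatten into (Fst, column component, row component) and sort by Fst.
pairs = sorted(
    (value, labels[j], labels[i + 1])
    for i, row in enumerate(lower)
    for j, value in enumerate(row)
)

print("Closest pairs:")
for value, a, b in pairs[:3]:
    print(f"  {a} / {b}: {value:.3f}")

print("Most distant pair:")
value, a, b = pairs[-1]
print(f"  {a} / {b}: {value:.3f}")
```

The closest pairs are the ones you would expect to merge first at a lower K.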

Now for K=11,

Admixture: Reference I populations K=11

C1 South Asian C2 Kalash
C3 Southwest Asian C4 Southeast Asian
C5 European C6 Papuan
C7 Siberian C8 Northeast Asian
C9 East African Bantus C10 West African
C11 East African

C8 at K=11 is now modal among the Han instead of the Japanese. This affected the Southeast Asian C4 component which is now more of a real Southeast Asian one.

The new ancestral component C9 is among the Bantus of eastern and southern Africa. It is highest among the Luhya and Bantus of Kenya.

Fst divergences between estimated populations for K=11:

C1 C2 C3 C4 C5 C6 C7 C8 C9 C10
C2 0.055
C3 0.062 0.072
C4 0.081 0.120 0.128
C5 0.063 0.063 0.038 0.124
C6 0.169 0.211 0.215 0.195 0.213
C7 0.089 0.128 0.135 0.057 0.130 0.203
C8 0.083 0.122 0.131 0.031 0.127 0.194 0.039
C9 0.143 0.165 0.150 0.185 0.162 0.259 0.195 0.189
C10 0.152 0.174 0.160 0.194 0.172 0.268 0.203 0.198 0.014
C11 0.104 0.122 0.101 0.149 0.115 0.226 0.158 0.152 0.037 0.043

At K=12,

Admixture: Reference I populations K=12

C1 South Asian C2 Balochistan/Caucasus
C3 Kalash C4 Southeast Asian
C5 Southwest Asian C6 European
C7 Papuan C8 Northeast Asian
C9 Siberian C10 East African Bantus
C11 West African C12 East African

The Kalash component has split, with an assist from Southwest Asian, into a pure Kalash component (C3) and a Balochistan/Caucasus component (C2), which is highest in Southwestern Pakistan (Brahui, Makrani, Balochi) at 60-57%, followed by Georgians, Lezgin, Adygei, Azerbaijan Jews and Iranian Jews (56-50%).

The Southwest Asian component (C5) is now more of a Southwest Asian and North/Northwest African component. The West Asian element in it has been reduced.

The Northeast Asian component (C8) is now again centered on Japan. I have a solution for this movement which I'll apply in my next round of analysis.

Fst divergences between estimated populations for K=12:

C1 C2 C3 C4 C5 C6 C7 C8 C9 C10 C11
C2 0.057
C3 0.066 0.060
C4 0.089 0.124 0.136
C5 0.075 0.057 0.087 0.142
C6 0.066 0.040 0.073 0.130 0.048
C7 0.167 0.205 0.219 0.202 0.220 0.210
C8 0.080 0.117 0.128 0.032 0.134 0.122 0.190
C9 0.085 0.114 0.126 0.059 0.133 0.117 0.203 0.039
C10 0.145 0.154 0.176 0.192 0.154 0.162 0.258 0.187 0.190
C11 0.154 0.163 0.186 0.201 0.164 0.172 0.266 0.195 0.199 0.014
C12 0.107 0.109 0.135 0.157 0.105 0.116 0.225 0.151 0.154 0.035 0.041

Higher K value admixture analysis will continue.

Chinese Samples

Mithra asked:

Almost all the Chinese are now around 50% SE Asian, didn’t see this before is it right.

So I decided to look at the Chinese samples in Reference I dataset.

I ran Admixture on the whole Reference I dataset for K=10 ancestral populations. The green component is what I call Southeast Asian, blue is Northeast Asian (highest among the Japanese) and violet is Siberian (highest among the Yakut).

Here is the plot for the 106 HapMap Chinese samples from Denver (label: us chinese):

HapMap US Chinese

For the 137 HapMap samples from Beijing, China (label: han chinese):

HapMap Han Chinese

For the 34 HGDP Han samples (label: han):

HGDP Han

For the 10 HGDP Han samples from North China (label: han-nchina):

HGDP Han North China

As you can see, the "Southeast Asian" component goes down from the top group to the bottom one, which is as expected.

I wasn't satisfied with these results, so I decided to run Admixture on the East Asian samples in Reference I separately.

East Asian Admixture K=3

At K=3, the results are about the same as at K=10 for the whole reference I population. The Han all have a significant amount of blue component which is highest among the Southeast Asians.

East Asian Admixture K=4

At K=4, we get a Chinese ("East Asian") component. So we have Japanese, Chinese, Yakut and Southeast Asian components. This is what most of you were probably expecting.

Why did the Japanese become the modal population for the Northeast Asian component? I ran a PCA on the East Asian data to see how the different populations looked on a PCA plot. Remember that eigenvector 1 explains 1.49 times the variance of eigenvector 2 and 1.9 times the variance of eigenvector 3. Thus, eigenvector 2 explains 1.28 times the variation explained by eigenvector 3.
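The last ratio follows from dividing the other two, as this quick check shows (the variable names are only for illustration):

```python
# Variance ratios quoted in the paragraph above.
eig1_over_eig2 = 1.49
eig1_over_eig3 = 1.9

# (eig1 / eig3) / (eig1 / eig2) = eig2 / eig3
eig2_over_eig3 = eig1_over_eig3 / eig1_over_eig2
print(round(eig2_over_eig3, 2))
```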

East Asian PCA eig1 vs eig2


East Asian PCA eig1 vs eig3


East Asian PCA eig2 vs eig3

As you can see, the Yakut are the farthest away, but the Japanese are also fairly well-separated from the Chinese populations.

If I didn't have the 141 Japanese samples in my reference dataset, the Northeast Asian component would be centered on the Han most likely, which is the case for Dodecad.

I think this shows that it is not correct to think of the ancestral components inferred from admixture as some pure ancestral population.

Admixture K=4,7,9, HRP0021 to HRP0030

Here's the spreadsheet with their admixture results. And you can check their ethnic backgrounds.

You might also want to refer to the reference dataset I admixture analyses for K=2-5 and K=6-9.

I did not run admixture for all values of K this time. So let's start with K=4. For quick reference,

C1 South Asian
C2 European
C3 East Asian
C4 African

Batch 3 Admixture K=4

Now, for K=7, the ancestral components are:

C1 South Asian
C2 European
C3 Southeast Asian
C4 Southwest Asian
C5 Papuan
C6 Northeast Asian
C7 African

Batch 3 Admixture K=7

And finally, here's K=9.

C1 South Asian
C2 Kalash
C3 Southwest Asian
C4 Southeast Asian
C5 European
C6 Papuan
C7 Northeast Asian
C8 West African
C9 East African

Batch 3 Admixture K=9

My Genetic Journey

It all started on DNA Day 2010 when Razib tweeted about a $99 sale for the DNA test at 23andme. I ordered one immediately. Over the next few months, a lot of my free time was spent poring over and analyzing my genomic results.

While the health and physical traits information was interesting, I found the ancestry information that can be deduced from your genome to be fascinating. That might be because I was working on collecting together and digitizing our family tree at the time.

So to beat Razib's record of writing about his personal genome, I have started blogging about mine.

There's much more to come, including: What's wrong with my chromosome 9 and who did I get it from?; my results from Doug McDonald, Dodecad and Eurogenes; why do I have low similarity scores with everyone?; where exactly was my great-grandmother from?; and more.

PS. Since it's Valentine's Day, I should probably mention that my top match among the people (excluding my sibling of course) I am sharing genomes with on 23andme is my dear wife, Amber.

South Asian PCA

I used Eigensoft to create a PCA plot of the South Asians in our Reference I dataset (a total of 398 samples) along with the first batch of South Asian Harappa Project participants (HRP0001 to HRP0009).

The PCA software removed 2 Makranis, 1 Sindhi, 1 Balochi and 1 Brahui as outliers, thus leaving us with 402 samples to perform a PCA on.

Here are the plots for the first four eigenvectors.

South Asian PCA eig1 vs eig2

South Asian PCA eig1 vs eig3

South Asian PCA eig2 vs eig3

South Asian PCA eig1 vs eig4

South Asian PCA eig2 vs eig4

South Asian PCA eig3 vs eig4

If you have seen the South Asian plot at 23andme, the first plot here isn't very different except that it seems rotated.

UPDATE: Eigenvectors 1 through 4 explain 1.12%, 0.77%, 0.71% and 0.44% of the total variance.

Reference II Admixture Analysis K=6-9

Continuing with admixture analysis of Reference II dataset, here's the spreadsheet.

Other than the differences with Reference I analysis, do take a look at the additional ethnic groups included in this dataset, especially the 8 South Asian groups: Tamil Nadu Dalit, Irula, Andhra Pradesh Madiga, Andhra Pradesh Mala, Tamil Nadu Brahmin, Andhra Pradesh Brahmin, Punjabi Arain, Nepali.

Let's start with K=6.

Reference II Admixture K=6

Note the difference between Tamil Nadu Dalits and Brahmins. The Dalits lack the European ancestral component of the Brahmins.

For K=7, the East Asian component splits into Northeast Asian and Southeast Asian.

Reference II Admixture K=7

Punjabi Arain are about the same as Sindhis (excluding those with some African ancestry) in terms of their ancestral components.

Comparing the Andhra Brahmins to the Mala and Madiga, we see the same pattern as in Tamil Nadu: Brahmins have more European and Southwest/West Asian while Mala and Madiga have more Southeast Asian and South Asian.

At K=8, the African component splits into West African and East African.

Reference II Admixture K=8

The Nepalese samples are interesting. They have about 49% South Asian, 19% Northeast Asian, 16% European and 10% Southeast Asian. So they look like a mix of South Asian and East Asian.

Similar to the previous post, here's a comparison of K=8 admixture analysis between Reference I and Reference II datasets.

Here's the average absolute difference between the two datasets for each ancestral component:

Ancestral Component Mean(Abs(Ref1-Ref2))
South Asian (C1) 2.17%
Southwest Asian (C2) 1.32%
European (C3) 1.70%
Southeast Asian (C4) 2.16%
Papuan (C5) 0.33%
Northeast Asian (C6) 1.93%
West African (C7) 0.27%
East African (C8) 0.48%

The larger differences are for Balochi, Cambodian, Dai, Han, Kalash, Lahu, Miao, Naxi, She, Singapore Chinese, Tu, Tujia, US Chinese, and Yi. Thus, it's mostly East Asian groups.
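For reference, the per-component figure in the table is a mean of absolute differences over the populations the two datasets share. Here is a minimal sketch with invented population names and percentages; only the formula, Mean(Abs(Ref1-Ref2)), comes from the table header.

```python
def mean_abs_diff(ref1, ref2):
    """Per-component mean of |ref1 - ref2| over populations present in both."""
    shared = sorted(set(ref1) & set(ref2))
    components = ref1[shared[0]].keys()
    return {
        c: sum(abs(ref1[p][c] - ref2[p][c]) for p in shared) / len(shared)
        for c in components
    }

# Hypothetical population means (percent) for two components only.
ref1 = {
    "Pop A": {"South Asian": 60.0, "European": 40.0},
    "Pop B": {"South Asian": 10.0, "European": 90.0},
    "Pop C": {"South Asian": 50.0, "European": 50.0},  # only in Reference I
}
ref2 = {
    "Pop A": {"South Asian": 63.0, "European": 37.0},
    "Pop B": {"South Asian": 11.0, "European": 89.0},
}

diffs = mean_abs_diff(ref1, ref2)
for component, value in diffs.items():
    print(f"{component}: {value:.2f}%")
```

Populations that appear in only one dataset, like "Pop C" here, are left out of the average.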

For K=9, we see some divergence between the ancestral components inferred from Reference II as compared to Reference I. Instead of the Kalash component in Reference I analysis, we get the Polynesian component here. This is likely due to the inclusion of Tongan and Samoan samples.

Reference II Admixture K=9

Here's a summary of the ancestral components inferred from Reference II dataset:

K=2        K=3        K=4        K=5        K=6        K=7        K=8        K=9
Eurasian   European   S Asian    S Asian    S Asian    S Asian    S Asian    S Asian
African    E Asian    European   European   European   European   SW Asian   European
           African    E Asian    E Asian    E Asian    SE Asian   European   SW Asian
                      African    SW Asian   SW Asian   SW Asian   SE Asian   SE Asian
                                 African    Papuan     Papuan     Papuan     Papuan
                                            African    NE Asian   NE Asian   NE Asian
                                                       African    W African  Polynesian
                                                                  E African  W African
                                                                             E African

I might do some admixture runs for Reference II with Harappa participants later.