Author Archives: Zack - Page 26

Admixture K=12, HRP0011-HRP0020

Here are their ethnic backgrounds and the results spreadsheet. Also relevant are the reference I admixture results and this batch's results at lower K.

Batch 2 Admixture K=12

PS. This was run using Admixture version 1.04.

Admixture Upgrade

I noticed a few days ago that Admixture had an update available:

1.1 (2/8/2011): Parallel processing, supervised analysis. Minor speedups and cleanups.

There were two important new features in version 1.1 that I started salivating over. One was parallel processing so I could utilize all the cores of my machine and thus run Admixture faster. The other was more important though I have yet to experiment with it. It's the ability to assign some ancestral components to specific samples, i.e. assign some individuals in the data specific 100% ancestry as a starting assumption and calculate admixture from that.

Of course, these two features made me forget the cardinal rule: Never upgrade in the middle of an analysis. But I did upgrade and things have changed subtly, making some comparisons between admixture v1.04 and v1.1 difficult.

For example, previously (admixture v1.04), at K=12, admixture was giving me the ancestral components: South Asian, Balochistan/Caucasus, Kalash, Southeast Asian, Southwest Asian, European, Papuan, Northeast Asian, Siberian, East African Bantus, West African, and East African.

With Admixture v1.1, I am getting the ancestral components: South Asian, Balochistan/Caucasus, Kalash, Southeast Asian, European, Mediterranean (maximum among Mozabite and Sardinians), Papuan, Northeast Asian, Southwest Asian, Siberian, West African, and East African.

So now I am running Admixture with different random seeds and trying to compare the old version results vs the new. Of course since we are talking K=12, just one admixture run takes a whole day.
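One way to line up components across two runs (different seeds or different versions) is to pair the columns of the two ancestry-fraction matrices by correlation. The following is a minimal sketch, assuming each run's output has been loaded as a list of rows with one row per sample and one column per component. The function names, variable names and toy numbers are hypothetical, and the greedy pairing is a simplification rather than anything Admixture itself provides.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / sqrt(vx * vy)


def match_components(q_old, q_new):
    """Greedily pair each column of q_old with its best-correlated unused column of q_new."""
    cols_old = list(zip(*q_old))
    cols_new = list(zip(*q_new))
    scores = sorted(
        ((pearson(a, b), i, j)
         for i, a in enumerate(cols_old)
         for j, b in enumerate(cols_new)),
        reverse=True,
    )
    pairs, used_old, used_new = {}, set(), set()
    for r, i, j in scores:
        if i not in used_old and j not in used_new:
            pairs[i] = (j, r)
            used_old.add(i)
            used_new.add(j)
    return pairs


# Toy data: 4 samples, 3 components; the second run has its columns permuted.
q_v104 = [
    [0.8, 0.1, 0.1],
    [0.1, 0.7, 0.2],
    [0.2, 0.2, 0.6],
    [0.5, 0.4, 0.1],
]
q_v11 = [[row[2], row[0], row[1]] for row in q_v104]

matches = match_components(q_v104, q_v11)
for old, (new, r) in sorted(matches.items()):
    print(f"old C{old + 1} -> new C{new + 1} (r = {r:.2f})")
```

A low correlation for the best available match would flag a component that has genuinely changed between runs, such as the Mediterranean component replacing East African Bantus above.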

Anyway, while that's going on, I have more things in process which can go forward, like reporting the results of Batch 4. And working on the Eurasian dataset.

Reference I: Eurasian Subsets

Since we have established that none of the Harappa participants so far have African admixture except for HRP0001 (me) and HRP0027 (Caribbean Indian) and African populations are the most diverse, it's best to remove the African populations from our Reference I dataset and do some analysis using the Eurasian subset.

One option is to exclude the 517 samples of sub-Saharan African populations in our dataset:

  • Bantu Kenya: 11
  • Bantu South Africa: 8
  • Ethiopian Jews: 12
  • Ethiopians: 19
  • Kenyan Luhya: 101
  • Maasai: 135
  • Mandenka: 22
  • African Americans: 48
  • Yoruba: 161

However, in addition to the above, I decided to remove anyone from the reference I dataset who had more than x% African ancestry (sum of East African, East African Bantu and West African) at K=12 admixture run. I created two Eurasian datasets: Eurasian90 and Eurasian95.
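The thresholding rule can be sketched as follows. This is a minimal illustration, assuming each sample's K=12 results are available as a mapping from component name to fraction; the sample IDs and values are made up, and the helper names are hypothetical.

```python
AFRICAN_COMPONENTS = ("East African", "East African Bantu", "West African")


def african_fraction(sample):
    """Sum the three African components for one sample (fractions in 0..1)."""
    return sum(sample["components"].get(name, 0.0) for name in AFRICAN_COMPONENTS)


def eurasian_subset(samples, max_african):
    """Keep samples whose African ancestry does not exceed max_african."""
    return [s for s in samples if african_fraction(s) <= max_african]


# Made-up samples for illustration only; IDs and values are not real data.
samples = [
    {"id": "S1", "components": {"South Asian": 0.97, "West African": 0.03}},
    {"id": "S2", "components": {"Southwest Asian": 0.92, "East African": 0.05,
                                "West African": 0.03}},
    {"id": "S3", "components": {"Southwest Asian": 0.85, "East African": 0.10,
                                "East African Bantu": 0.05}},
]

eurasian90 = eurasian_subset(samples, 0.10)  # drop anyone above 10% African
eurasian95 = eurasian_subset(samples, 0.05)  # drop anyone above 5% African
print([s["id"] for s in eurasian90])
print([s["id"] for s in eurasian95])
```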

Eurasian90 excludes all samples with more than 10% African admixture. That completely removes the following populations in addition to the above:

  • Egyptians: 12
  • Moroccans: 10
  • Mozabite: 29

Also, some samples from the following populations were removed for Eurasian90:

  • Balochi: 3/24
  • Bedouin: 19/46
  • Brahui: 2/25
  • Iranians: 3/19
  • Jordanians: 6/20
  • Lebanese: 2/7
  • Makrani: 3/25
  • Palestinian: 10/46
  • Saudis: 2/20
  • Sindhi: 2/24
  • Syrians: 2/16
  • Yemenese: 7/8

That's a total of 629 samples in the Reference I dataset that had more than 10% African admixture. Thus Eurasian90 has 2,025 samples. The complete list is here.

The other dataset, Eurasian95, excludes everyone with more than 5% African admixture. Thus, in addition to the samples listed above, it excludes the following:

  • Balochi: 1
  • Bedouin: 19
  • Brahui: 1
  • Druze: 1
  • Iranians: 1
  • Jordanians: 14 (completely removed)
  • Makrani: 8
  • Morocco Jews: 2
  • Palestinian: 36 (completely removed)
  • Saudis: 16
  • Sindhi: 2
  • Syrians: 7
  • Yemenese: 1 (completely removed)
  • Yemen Jews: 15 (completely removed)

Eurasian95 is thus left with 1,901 samples, whose breakdown is listed here.
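As a quick sanity check, the counts in the lists above can be added up. The numbers are copied from this post; the total size of Reference I is not stated here, so the value computed below is only what the other figures imply.

```python
# Counts copied from the lists above.
sub_saharan = [11, 8, 12, 19, 101, 135, 22, 48, 161]
fully_removed_90 = [12, 10, 29]
partly_removed_90 = [3, 19, 2, 3, 6, 2, 3, 10, 2, 2, 2, 7]
extra_removed_95 = [1, 19, 1, 1, 1, 14, 8, 2, 36, 16, 2, 7, 1, 15]

removed_90 = sum(sub_saharan) + sum(fully_removed_90) + sum(partly_removed_90)
eurasian90_size = 2025  # stated in the post
reference_size = eurasian90_size + removed_90  # implied, not stated
eurasian95_size = eurasian90_size - sum(extra_removed_95)

print(sum(sub_saharan))  # 517
print(removed_90)        # 629
print(reference_size)    # 2654
print(eurasian95_size)   # 1901
```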

I'll be experimenting with both Eurasian90 and Eurasian95.

Project Update

I have a total of 42 participants in the project right now who have sent me their raw data. This does not count two people whose relatives are also participating; they have to be filtered out of most analyses other than individual admixture percentages, where I divide participants into small groups.

The following groups are represented:

  • Punjab: 7
  • Iran: 6
  • Tamil: 5
  • Andhra Pradesh: 2
  • Bengal: 2
  • Bihar: 2
  • Karnataka: 2
  • Caribbean Indian: 2
  • Kashmir: 2
  • Anglo-Indian: 1
  • Roma: 1
  • Goa: 1
  • Uttar Pradesh: 1
  • Sri Lankan: 1
  • Rajasthan: 1
  • Kerala: 1
  • Baloch: 1
  • Unknown: 1

The unknown is Manu Sporny who has put his genetic data in the public domain and I have drafted him into our project.

In addition, out of curiosity, I have accepted data from the following:

  • Iraqi Arab: 2
  • Egyptian/Iraqi Jew: 1

I know a bunch of you have done a lot to make this project known and gotten people to submit their data. But we really do need more participants of every ethnicity and geographic region in and around South Asia. So keep on!

I am working on K=12 admixture runs for the batches we have already done. In addition, the reference I dataset will be used for even higher values of K admixture components to see where the limit is.

Also, I am looking into doing chromosome by chromosome admixture (and other analysis). I have done some experimental runs and once I have pored over that data, I'll have something to report.

As we have seen, even with the removal of the San and Pygmy, the Africans take up 3 ancestral components and most South Asians (excepting me of course) do not have any African admixture. So I am working on a reference dataset without any Africans. I have my own take on how to do that which I'll share in the next few days.

In short, my home computer is running admixture, plink, eigensoft, etc. 24x7.

Admixture K=10-12, HRP0001 to HRP0010

Let's continue our admixture analysis of the first batch of Harappa participants.

Here are their ethnic backgrounds and their admixture analysis results.

You might want to refer to the admixture analysis of the reference dataset.

At K=10,

Batch 1 Admixture K=10

C1 South Asian C2 Kalash
C3 Southwest Asian C4 Southeast Asian
C5 European C6 Papuan
C7 Northeast Asian C8 Siberian
C9 West African C10 East African

At K=11,

Batch 1 Admixture K=11

C1 South Asian C2 Balochistan/Caucasus
C3 Kalash C4 Southeast Asian
C5 Southwest Asian C6 European
C7 Papuan C8 Northeast Asian
C9 Siberian C10 West African
C11 East African

Note the C2 component: it sounds a bit like the ANI (Ancestral North Indian) of Reich et al., though hold off on your conclusions and your excitement for now.

Also, note that this split is different from the results of Reference I K=11 admixture run where the East African split happened. However, at K=12 we get similar components.

At K=12,

Batch 1 Admixture K=12

C1 South Asian C2 Balochistan/Caucasus
C3 Kalash C4 Southeast Asian
C5 Southwest Asian C6 European
C7 Papuan C8 Northeast Asian
C9 Siberian C10 East African Bantus
C11 West African C12 East African

I am going to explore even higher values of K since the crossvalidation errors are still decreasing.

Dodecad vs Harappa

We know that some participants in Harappa Ancestry Project had also submitted their data to Dodecad Project. And they were curious how the different ancestry components here lined up with the Dodecad ones.

So I decided to compare the two. I took the ancestral component percentages for the reference populations from the Dodecad population spreadsheet K=10 and the Harappa Reference I spreadsheet K=9.

I selected the 36 populations that are present in both. While some of these are still not comparable because of which samples out of these populations were selected to be included in the reference datasets for Dodecad and Harappa, we are using mean values, so barring any big outliers we can compare them.

I decided to find a solution to linear equations of the form:

C1 = a11*D1 + a12*D2 + a13*D3 + a14*D4 + a15*D5 + a16*D6 + a17*D7 + a18*D8 + a19*D9 + a1A*D10
C2 = a21*D1 + a22*D2 + a23*D3 + a24*D4 + a25*D5 + a26*D6 + a27*D7 + a28*D8 + a29*D9 + a2A*D10
C3 = a31*D1 + a32*D2 + a33*D3 + a34*D4 + a35*D5 + a36*D6 + a37*D7 + a38*D8 + a39*D9 + a3A*D10
C4 = a41*D1 + a42*D2 + a43*D3 + a44*D4 + a45*D5 + a46*D6 + a47*D7 + a48*D8 + a49*D9 + a4A*D10
C5 = a51*D1 + a52*D2 + a53*D3 + a54*D4 + a55*D5 + a56*D6 + a57*D7 + a58*D8 + a59*D9 + a5A*D10
C6 = a61*D1 + a62*D2 + a63*D3 + a64*D4 + a65*D5 + a66*D6 + a67*D7 + a68*D8 + a69*D9 + a6A*D10
C7 = a71*D1 + a72*D2 + a73*D3 + a74*D4 + a75*D5 + a76*D6 + a77*D7 + a78*D8 + a79*D9 + a7A*D10
C8 = a81*D1 + a82*D2 + a83*D3 + a84*D4 + a85*D5 + a86*D6 + a87*D7 + a88*D8 + a89*D9 + a8A*D10
C9 = a91*D1 + a92*D2 + a93*D3 + a94*D4 + a95*D5 + a96*D6 + a97*D7 + a98*D8 + a99*D9 + a9A*D10

For each of the 36 populations, we'll have these 9 equations where C1 through C9 are the ancestral component percentages of that population in Harappa Project and D1 through D10 are the ancestral percentages in Dodecad Project.

The unknowns are the coefficients "a". There are 90 unknowns. Since we have 36 populations, the number of equations is 36*9=324. Therefore, this is an overdetermined system of linear equations and we can find a least squares solution to it.
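Note that each equation for a given C only involves that component's own ten coefficients, so the system splits into 9 independent fits of 36 equations in 10 unknowns each. A minimal sketch of one such fit follows, using the normal equations and plain Python. It is an illustration with toy numbers (2 Dodecad-style components, 4 populations), not the actual computation behind the results in this post, and the helper names are hypothetical. For real data a library routine such as NumPy's `numpy.linalg.lstsq` would be the more robust choice, since the normal equations can be poorly conditioned.

```python
def solve_linear(a, b):
    """Solve the square system a x = b by Gaussian elimination with partial pivoting."""
    n = len(a)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def least_squares(d_rows, c_values):
    """Minimize sum((c - d.a)^2) over a via the normal equations (D^T D) a = D^T c."""
    k = len(d_rows[0])
    dtd = [[sum(row[i] * row[j] for row in d_rows) for j in range(k)]
           for i in range(k)]
    dtc = [sum(row[i] * c for row, c in zip(d_rows, c_values)) for i in range(k)]
    return solve_linear(dtd, dtc)


# Toy data: 4 populations, 2 Dodecad-style components (each row sums to 1).
d = [[0.9, 0.1], [0.6, 0.4], [0.3, 0.7], [0.1, 0.9]]
# One Harappa-style component constructed as 0.8*D1 + 0.2*D2,
# so the fit should recover coefficients close to (0.8, 0.2).
c = [0.8 * d1 + 0.2 * d2 for d1, d2 in d]

coeffs = least_squares(d, c)
print([round(a, 3) for a in coeffs])
```

Repeating the fit for each of C1 through C9 against the 36 populations gives a 9 by 10 matrix of coefficients.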

Here is the solution:

               D1       D2          D3      D4        D5        D6       D7      D8         D9         D10
               W Asian  NW African  S Euro  NE Asian  SW Asian  E Asian  N Euro  W African  E African  S Asian
C1 S Asian     0        0           0       0         0         0        0       0          0          0.92
C2 Kalash      0.54     0           -0.05   0.12      0.07      0        0.2     0          0          0.1
C3 SW Asian    0.46     0.56        0.44    0         0.9       0        -0.09   0          0.09       -0.07
C4 SE Asian    0        0           0       0         0         0.6      0       0          0          0
C5 Euro        0        0.19        0.6     0.05      -0.05     0        0.88    0          0          0
C6 Papuan      0        0           0       0         0         0        0       0          0          0
C7 NE Asian    0        0           0       0.85      0         0.4      0       0          0          0
C8 W African   0        0.12        0       0         0         0        0       1          0          0
C9 E African   0        0.12        0       0         0.05      0        0       0          0.89       0

Don't take the exact values to heart but this shows the general relationship between the Dodecad and Harappa (K=9) ancestral components.

The South Asian components are about the same in both projects.

The Kalash component is a mix but is primarily Dodecad West Asian.

The Harappa Southwest Asian component has contributions from Dodecad Northwest African, West Asian and South European in addition to the Dodecad Southwest Asian component.

The Southeast Asian component corresponds partially to the Dodecad East Asian component.

The Harappa European component is more Dodecad North European than South European.

If enough Harappa-Dodecad participants are willing to let me know their IDs for both projects, I can do a similar analysis using individual data.

Reference I Admixture Analysis K=10-12

A week later, some more Admixture analysis of Reference I dataset.

As usual, the results are available in a spreadsheet, which is also listed on my sidebar.

Let's start with K=10.

Admixture: Reference I populations K=10

C1 South Asian C2 Kalash
C3 Southwest Asian C4 Southeast Asian
C5 European C6 Papuan
C7 Northeast Asian C8 Siberian
C9 West African C10 East African

The addition here is basically of the Siberian component which is highest among the Yakut.

Fst divergences between estimated populations for K=10:

       C1     C2     C3     C4     C5     C6     C7     C8     C9
C2     0.057
C3     0.064  0.073
C4     0.089  0.127  0.136
C5     0.063  0.061  0.038  0.131
C6     0.167  0.209  0.215  0.202  0.210
C7     0.080  0.120  0.129  0.032  0.123  0.190
C8     0.085  0.117  0.127  0.059  0.118  0.203  0.039
C9     0.152  0.174  0.161  0.201  0.171  0.266  0.195  0.199
C10    0.115  0.133  0.117  0.166  0.128  0.233  0.160  0.163  0.036
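A lower-triangular matrix like the one above is easier to scan programmatically. The sketch below parses the K=10 values as printed and reports the closest and the most divergent pair of components. The values and labels are copied from this post; the function and variable names are made up for illustration.

```python
FST_K10 = """\
C2 0.057
C3 0.064 0.073
C4 0.089 0.127 0.136
C5 0.063 0.061 0.038 0.131
C6 0.167 0.209 0.215 0.202 0.210
C7 0.080 0.120 0.129 0.032 0.123 0.190
C8 0.085 0.117 0.127 0.059 0.118 0.203 0.039
C9 0.152 0.174 0.161 0.201 0.171 0.266 0.195 0.199
C10 0.115 0.133 0.117 0.166 0.128 0.233 0.160 0.163 0.036
"""

LABELS = {
    "C1": "South Asian", "C2": "Kalash", "C3": "Southwest Asian",
    "C4": "Southeast Asian", "C5": "European", "C6": "Papuan",
    "C7": "Northeast Asian", "C8": "Siberian", "C9": "West African",
    "C10": "East African",
}


def parse_lower_triangle(text):
    """Return {(row, column): fst} from rows like 'C3 0.064 0.073'."""
    pairs = {}
    for line in text.strip().splitlines():
        name, *values = line.split()
        # The k-th value on a row is the divergence from component Ck.
        for k, value in enumerate(values, start=1):
            pairs[(name, f"C{k}")] = float(value)
    return pairs


pairs = parse_lower_triangle(FST_K10)
closest = min(pairs, key=pairs.get)
farthest = max(pairs, key=pairs.get)

for tag, (a, b) in (("closest", closest), ("farthest", farthest)):
    print(f"{tag}: {LABELS[a]} / {LABELS[b]} (Fst = {pairs[(a, b)]:.3f})")
```

At K=10 this picks out Northeast Asian and Southeast Asian as the closest pair (0.032) and West African and Papuan as the most divergent (0.266).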

Now for K=11,

Admixture: Reference I populations K=11

C1 South Asian C2 Kalash
C3 Southwest Asian C4 Southeast Asian
C5 European C6 Papuan
C7 Siberian C8 Northeast Asian
C9 East African Bantus C10 West African
C11 East African

C8 at K=11 is now modal among the Han instead of the Japanese. This affected the Southeast Asian C4 component which is now more of a real Southeast Asian one.

The new ancestral component C9 is among the Bantus of eastern and southern Africa. It is highest among the Luhya and Bantus of Kenya.

Fst divergences between estimated populations for K=11:

       C1     C2     C3     C4     C5     C6     C7     C8     C9     C10
C2     0.055
C3     0.062  0.072
C4     0.081  0.120  0.128
C5     0.063  0.063  0.038  0.124
C6     0.169  0.211  0.215  0.195  0.213
C7     0.089  0.128  0.135  0.057  0.130  0.203
C8     0.083  0.122  0.131  0.031  0.127  0.194  0.039
C9     0.143  0.165  0.150  0.185  0.162  0.259  0.195  0.189
C10    0.152  0.174  0.160  0.194  0.172  0.268  0.203  0.198  0.014
C11    0.104  0.122  0.101  0.149  0.115  0.226  0.158  0.152  0.037  0.043

At K=12,

Admixture: Reference I populations K=12

C1 South Asian C2 Balochistan/Caucasus
C3 Kalash C4 Southeast Asian
C5 Southwest Asian C6 European
C7 Papuan C8 Northeast Asian
C9 Siberian C10 East African Bantus
C11 West African C12 East African

The Kalash component has split, with an assist from Southwest Asian, into a pure Kalash component (C3) and a Balochistan/Caucasus component (C2), which is highest in Southwestern Pakistan (Brahui, Makrani, Balochi) at 60-57%, followed by Georgians, Lezgin, Adygei, Azerbaijan Jews and Iranian Jews (56-50%).

The Southwest Asian component (C5) is now more of a Southwest Asian and North/Northwest African component. The West Asian element in it has been reduced.

The Northeast Asian component (C8) is now again centered on Japan. I have a solution for this movement which I'll apply in my next round of analysis.

Fst divergences between estimated populations for K=12:

       C1     C2     C3     C4     C5     C6     C7     C8     C9     C10    C11
C2     0.057
C3     0.066  0.060
C4     0.089  0.124  0.136
C5     0.075  0.057  0.087  0.142
C6     0.066  0.040  0.073  0.130  0.048
C7     0.167  0.205  0.219  0.202  0.220  0.210
C8     0.080  0.117  0.128  0.032  0.134  0.122  0.190
C9     0.085  0.114  0.126  0.059  0.133  0.117  0.203  0.039
C10    0.145  0.154  0.176  0.192  0.154  0.162  0.258  0.187  0.190
C11    0.154  0.163  0.186  0.201  0.164  0.172  0.266  0.195  0.199  0.014
C12    0.107  0.109  0.135  0.157  0.105  0.116  0.225  0.151  0.154  0.035  0.041

Higher K value admixture analysis will continue.

Chinese Samples

Mithra asked:

Almost all the Chinese are now around 50% SE Asian, didn’t see this before is it right.

So I decided to look at the Chinese samples in Reference I dataset.

I ran Admixture on the whole Reference I dataset for K=10 ancestral populations. The green component is what I call Southeast Asian, blue is Northeast Asian (highest among the Japanese) and violet is Siberian (highest among the Yakut).

Here is the plot for the 106 HapMap Chinese samples from Denver (label: us chinese):

HapMap US Chinese

For the 137 HapMap samples from Beijing, China (label: han chinese):

HapMap Han Chinese

For the 34 HGDP Han samples (label: han):

HGDP Han

For the 10 HGDP Han samples from North China (label: han-nchina):

HGDP Han North China

As you can see, the "Southeast Asian" component goes down from the top group to the bottom one, which is as expected.

I wasn't satisfied with these results, so I decided to run Admixture on the East Asian samples in Reference I separately.

East Asian Admixture K=3

At K=3, the results are about the same as at K=10 for the whole reference I population. The Han all have a significant amount of blue component which is highest among the Southeast Asians.

East Asian Admixture K=4

At K=4, we get a Chinese ("East Asian") component. So we have Japanese, Chinese, Yakut and Southeast Asian components. This is what most of you were probably expecting.

Why did the Japanese become the modal population for the Northeast Asian component? I ran a PCA on the East Asian data to see how the different populations looked on a PCA plot. Remember that eigenvector 1 explains 1.49 times the variance of eigenvector 2 and 1.9 times the variance of eigenvector 3. Thus, eigenvector 2 explains 1.28 times the variation explained by eigenvector 3.
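The third ratio follows from dividing the first two, as this short check shows (the two input ratios are the ones quoted above; the variable names are just for illustration):

```python
ratio_1_to_2 = 1.49  # variance of eigenvector 1 relative to eigenvector 2
ratio_1_to_3 = 1.9   # variance of eigenvector 1 relative to eigenvector 3

# (var1 / var3) / (var1 / var2) = var2 / var3
ratio_2_to_3 = ratio_1_to_3 / ratio_1_to_2
print(round(ratio_2_to_3, 2))  # 1.28
```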

East Asian PCA eig1 vs eig2


East Asian PCA eig1 vs eig3


East Asian PCA eig2 vs eig3

As you can see, the Yakut are far away, but the Japanese are also fairly well-separated from the Chinese populations.

If I didn't have the 141 Japanese samples in my reference dataset, the Northeast Asian component would be centered on the Han most likely, which is the case for Dodecad.

I think this shows that it is not correct to think of the ancestral components inferred from admixture as some pure ancestral population.

Admixture K=4,7,9, HRP0021 to HRP0030

Here's the spreadsheet with their admixture results. And you can check their ethnic backgrounds.

You might also want to refer to the reference dataset I admixture analyses for K=2-5 and K=6-9.

I did not run admixture for all values of K this time. So let's start with K=4. For quick reference,

C1 South Asian
C2 European
C3 East Asian
C4 African

Batch 3 Admixture K=4

Now, for K=7, the ancestral components are:

C1 South Asian
C2 European
C3 Southeast Asian
C4 Southwest Asian
C5 Papuan
C6 Northeast Asian
C7 African

Batch 3 Admixture K=7

And finally, here's K=9.

C1 South Asian
C2 Kalash
C3 Southwest Asian
C4 Southeast Asian
C5 European
C6 Papuan
C7 Northeast Asian
C8 West African
C9 East African

Batch 3 Admixture K=9

My Genetic Journey

It all started on DNA Day 2010 when Razib tweeted about a $99 sale for the DNA test at 23andme. I ordered one immediately. Over the next few months, a lot of my free time was spent poring over and analyzing my genomic results.

While the health and physical traits information was interesting, I found the ancestry information that can be deduced from your genome to be fascinating. That might be because I was working on collecting together and digitizing our family tree at the time.

So to beat Razib's record of writing about his personal genome, I have started blogging about mine:

There's much more to come, including: What's wrong with my chromosome 9 and who did I get it from?; my results from Doug McDonald, Dodecad and Eurogenes; why do I have low similarity scores with everyone?; where exactly was my great-grandmother from?; and more.

PS. Since it's Valentine's Day, I should probably mention that my top match among the people (excluding my sibling of course) I am sharing genomes with on 23andme is my dear wife, Amber.