Here are their ethnic backgrounds and the results spreadsheet. Also relevant are the reference I admixture results and this batch's results at lower K.
PS. This was run using Admixture version 1.04.
I noticed a few days ago that Admixture had an update available:
1.1 (2/8/2011): Parallel processing, supervised analysis. Minor speedups and cleanups.
There were two important new features in version 1.1 that I started salivating over. One was parallel processing so I could utilize all the cores of my machine and thus run Admixture faster. The other was more important though I have yet to experiment with it. It's the ability to assign some ancestral components to specific samples, i.e. assign some individuals in the data specific 100% ancestry as a starting assumption and calculate admixture from that.
Of course, these two features made me forget the cardinal rule: Never upgrade in the middle of an analysis. But I did upgrade and things have changed subtly, making some comparisons between admixture v1.04 and v1.1 difficult.
For example, previously (admixture v1.04), at K=12, admixture was giving me the ancestral components: South Asian, Balochistan/Caucasus, Kalash, Southeast Asian, Southwest Asian, European, Papuan, Northeast Asian, Siberian, East African Bantus, West African, and East African.
With Admixture v1.1, I am getting the ancestral components: South Asian, Balochistan/Caucasus, Kalash, Southeast Asian, European, Mediterranean (maximum among Mozabite and Sardinians), Papuan, Northeast Asian, Southwest Asian, Siberian, West African, and East African.
So now I am running Admixture with different random seeds and trying to compare the old version results vs the new. Of course since we are talking K=12, just one admixture run takes a whole day.
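One way to line up components between two runs (old version vs new, or two random seeds) is to correlate each component's per-sample percentages across the same samples and pair up the best matches. The sketch below is only an illustration of that idea, not necessarily the comparison used here; the function names and the toy numbers are hypothetical, and the greedy pairing is a simplification.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length lists (0.0 if either is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def match_components(q_old, q_new):
    """Greedily pair columns of two admixture result tables by correlation.

    q_old and q_new hold one row per sample (same samples, same order)
    and one column per ancestral component.
    Returns a sorted list of (old_column, new_column, correlation).
    """
    cols_old = [list(col) for col in zip(*q_old)]
    cols_new = [list(col) for col in zip(*q_new)]
    scored = sorted(
        ((pearson(co, cn), i, j)
         for i, co in enumerate(cols_old)
         for j, cn in enumerate(cols_new)),
        reverse=True,
    )
    used_old, used_new, pairs = set(), set(), []
    for r, i, j in scored:
        if i not in used_old and j not in used_new:
            pairs.append((i, j, r))
            used_old.add(i)
            used_new.add(j)
    return sorted(pairs)


# Toy data (hypothetical): the "new" run has the same components in a different order.
q_old = [[0.8, 0.1, 0.1], [0.1, 0.7, 0.2], [0.2, 0.2, 0.6], [0.5, 0.4, 0.1]]
q_new = [[row[1], row[2], row[0]] for row in q_old]
pairs = match_components(q_old, q_new)
print([(i, j) for i, j, _ in pairs])
```

A pair with a low correlation would flag a component that has no real counterpart in the other run, which is the situation described above for Mediterranean and East African Bantus.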
Anyway, while that's going on, I have more things in process which can go forward, like reporting the results of Batch 4. And working on the Eurasian dataset.
We have established that none of the Harappa participants so far have African admixture, except for HRP0001 (me) and HRP0027 (Caribbean Indian). Since African populations are also the most diverse, it's best to remove them from our Reference I dataset and do some analysis using the Eurasian subset.
One option is to exclude the 517 samples of sub-Saharan African populations in our dataset:
However, in addition to the above, I decided to remove anyone from the Reference I dataset who had more than x% African ancestry (the sum of East African, East African Bantu and West African) in the K=12 admixture run. I created two Eurasian datasets: Eurasian90 and Eurasian95.
Eurasian90 excludes all samples with more than 10% African admixture. That completely removes the following populations in addition to the above:
Also, some samples from the following populations were removed for Eurasian90:
That's a total of 629 samples in the Reference I dataset that had more than 10% African admixture. Thus Eurasian90 has 2,025 samples. The complete list is here.
The other dataset, Eurasian95, excludes everyone with more than 5% African admixture. Thus, in addition to the samples listed above, it excludes the following:
Eurasian95 is thus left with 1,901 samples, whose breakdown is listed here.
I'll be experimenting with both Eurasian90 and Eurasian95.
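The thresholding rule behind the two datasets can be sketched in a few lines of Python. This is a minimal illustration under assumptions: the sample IDs and percentages below are made up, the function names are hypothetical, and a sample is kept when its summed African share is at or below the cutoff.

```python
# The three African components summed for the cutoff (names as used in the K=12 run).
AFRICAN_COMPONENTS = ("East African", "East African Bantu", "West African")


def african_share(components):
    """Sum the African components of one sample (missing components count as 0)."""
    return sum(components.get(name, 0.0) for name in AFRICAN_COMPONENTS)


def eurasian_subset(samples, max_african):
    """Return the IDs of samples whose African share does not exceed max_african."""
    return [sid for sid, comps in samples.items()
            if african_share(comps) <= max_african]


# Hypothetical samples, with ancestry given as fractions rather than percentages.
samples = {
    "sample_a": {"South Asian": 0.70, "European": 0.30},
    "sample_b": {"Southwest Asian": 0.93, "East African": 0.04, "West African": 0.03},
    "sample_c": {"West African": 0.80, "East African Bantu": 0.20},
}

eurasian90 = eurasian_subset(samples, 0.10)  # drops more than 10% African
eurasian95 = eurasian_subset(samples, 0.05)  # drops more than 5% African
print(eurasian90, eurasian95)
```

In this toy example sample_b (7% African) survives the Eurasian90 cut but not the Eurasian95 one.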
I have a total of 42 participants in the project right now who have sent me their raw data. This is not counting two people who have relatives participating and thus have to be filtered out of most analyses, other than individual admixture percentages and the like, where I divide participants into small groups.
The following groups are represented:
The unknown is Manu Sporny who has put his genetic data in the public domain and I have drafted him into our project.
In addition, out of curiosity, I have accepted data from the following:
I know a bunch of you have done a lot to make this project known and gotten people to submit their data. But we really do need more participants of every ethnicity and geographic region in and around South Asia. So keep on!
I am working on K=12 admixture runs for the batches we have already done. In addition, I will run admixture on the Reference I dataset at even higher values of K to see where the limit is.
Also, I am looking into doing chromosome-by-chromosome admixture (and other analyses). I have done some experimental runs, and once I have pored over that data, I'll have something to report.
As we have seen, even with the removal of the San and Pygmy, the Africans take up 3 ancestral components and most South Asians (excepting me of course) do not have any African admixture. So I am working on a reference dataset without any Africans. I have my own take on how to do that which I'll share in the next few days.
In short, my home computer is running admixture, plink, eigensoft, etc. 24x7.
Let's continue our admixture analysis of the first batch of Harappa participants.
Here are their ethnic backgrounds and their admixture analysis results.
You might want to refer to the admixture analysis of the reference dataset.
At K=10,
Component | Ancestry | Component | Ancestry |
---|---|---|---|
C1 | South Asian | C2 | Kalash |
C3 | Southwest Asian | C4 | Southeast Asian |
C5 | European | C6 | Papuan |
C7 | Northeast Asian | C8 | Siberian |
C9 | West African | C10 | East African |
At K=11,
Component | Ancestry | Component | Ancestry |
---|---|---|---|
C1 | South Asian | C2 | Balochistan/Caucasus |
C3 | Kalash | C4 | Southeast Asian |
C5 | Southwest Asian | C6 | European |
C7 | Papuan | C8 | Northeast Asian |
C9 | Siberian | C10 | West African |
C11 | East African | | |
Note the C2 component: it sounds a bit like the ANI (Ancestral North Indian) of Reich et al., though hold off on your conclusions and your excitement for now.
Also, note that this split is different from the results of Reference I K=11 admixture run where the East African split happened. However, at K=12 we get similar components.
At K=12,
Component | Ancestry | Component | Ancestry |
---|---|---|---|
C1 | South Asian | C2 | Balochistan/Caucasus |
C3 | Kalash | C4 | Southeast Asian |
C5 | Southwest Asian | C6 | European |
C7 | Papuan | C8 | Northeast Asian |
C9 | Siberian | C10 | East African Bantus |
C11 | West African | C12 | East African |
I am going to explore even higher values of K since the cross-validation errors are still decreasing.
We know that some participants in the Harappa Ancestry Project had also submitted their data to the Dodecad Project. And they were curious how the different ancestry components here lined up with the Dodecad ones.
So I decided to compare the two. I took the ancestral component percentages for the reference populations from the Dodecad population spreadsheet K=10 and the Harappa Reference I spreadsheet K=9.
I selected the 36 populations that are present in both. Some of these are still not strictly comparable, because the two projects did not necessarily select the same samples from each population for their reference datasets. But we are using mean values, so barring any big outliers we can compare them.
I decided to find a solution to linear equations of the form:
C1 = a11*D1 + a12*D2 + a13*D3 + a14*D4 + a15*D5 + a16*D6 + a17*D7 + a18*D8 + a19*D9 + a1A*D10
C2 = a21*D1 + a22*D2 + a23*D3 + a24*D4 + a25*D5 + a26*D6 + a27*D7 + a28*D8 + a29*D9 + a2A*D10
C3 = a31*D1 + a32*D2 + a33*D3 + a34*D4 + a35*D5 + a36*D6 + a37*D7 + a38*D8 + a39*D9 + a3A*D10
C4 = a41*D1 + a42*D2 + a43*D3 + a44*D4 + a45*D5 + a46*D6 + a47*D7 + a48*D8 + a49*D9 + a4A*D10
C5 = a51*D1 + a52*D2 + a53*D3 + a54*D4 + a55*D5 + a56*D6 + a57*D7 + a58*D8 + a59*D9 + a5A*D10
C6 = a61*D1 + a62*D2 + a63*D3 + a64*D4 + a65*D5 + a66*D6 + a67*D7 + a68*D8 + a69*D9 + a6A*D10
C7 = a71*D1 + a72*D2 + a73*D3 + a74*D4 + a75*D5 + a76*D6 + a77*D7 + a78*D8 + a79*D9 + a7A*D10
C8 = a81*D1 + a82*D2 + a83*D3 + a84*D4 + a85*D5 + a86*D6 + a87*D7 + a88*D8 + a89*D9 + a8A*D10
C9 = a91*D1 + a92*D2 + a93*D3 + a94*D4 + a95*D5 + a96*D6 + a97*D7 + a98*D8 + a99*D9 + a9A*D10
For each of the 36 populations, we'll have these 9 equations where C1 through C9 are the ancestral component percentages of that population in Harappa Project and D1 through D10 are the ancestral percentages in Dodecad Project.
The unknowns are the coefficients "a". There are 9*10=90 unknowns. Since we have 36 populations, the number of equations is 36*9=324. Therefore, this is an overdetermined system of linear equations and we can find a least squares solution to it.
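For readers who want to try this kind of fit themselves, here is a minimal sketch of an ordinary least squares solution via the normal equations, written in plain Python. It is an illustration under assumptions, not necessarily the exact procedure behind the numbers in this post: the function names are hypothetical, and the toy data uses 2 components per project and 4 populations instead of 9, 10 and 36.

```python
def solve(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting.

    a is n x n and b is n x m, both as lists of rows. Returns x as n x m.
    """
    n = len(a)
    aug = [list(a[i]) + list(b[i]) for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("normal equations are singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [rv - factor * cv for rv, cv in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def fit_coefficients(d_rows, c_rows):
    """Least squares fit of C = a * D over all populations.

    d_rows[p] holds D1..Dn and c_rows[p] holds C1..Cm for population p.
    Returns a, where a[i][j] is the weight of D(j+1) in C(i+1).
    """
    n_d, n_c = len(d_rows[0]), len(c_rows[0])
    # Normal equations: (D^T D) X = D^T C, where X is the transpose of a.
    dtd = [[sum(d[i] * d[j] for d in d_rows) for j in range(n_d)]
           for i in range(n_d)]
    dtc = [[sum(d[i] * c[k] for d, c in zip(d_rows, c_rows)) for k in range(n_c)]
           for i in range(n_d)]
    x = solve(dtd, dtc)
    return [[x[j][i] for j in range(n_d)] for i in range(n_c)]


# Toy check (hypothetical numbers): build C from known coefficients, then recover them.
true_a = [[0.9, 0.1], [0.1, 0.9]]
d_rows = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [0.2, 0.8]]
c_rows = [[sum(true_a[i][j] * d[j] for j in range(2)) for i in range(2)]
          for d in d_rows]
a_hat = fit_coefficients(d_rows, c_rows)
print(a_hat)
```

On real data the fit will not be exact, and nothing in plain least squares keeps the coefficients between 0 and 1, so small negative values can appear.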
Here is the least squares solution for the 36 shared populations:
Harappa component | D1 W Asian | D2 NW African | D3 S Euro | D4 NE Asian | D5 SW Asian | D6 E Asian | D7 N Euro | D8 W African | D9 E African | D10 S Asian |
---|---|---|---|---|---|---|---|---|---|---|
C1 S Asian | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.92 |
C2 Kalash | 0.54 | 0 | -0.05 | 0.12 | 0.07 | 0 | 0.2 | 0 | 0 | 0.1 |
C3 SW Asian | 0.46 | 0.56 | 0.44 | 0 | 0.9 | 0 | -0.09 | 0 | 0.09 | -0.07 |
C4 SE Asian | 0 | 0 | 0 | 0 | 0 | 0.6 | 0 | 0 | 0 | 0 |
C5 Euro | 0 | 0.19 | 0.6 | 0.05 | -0.05 | 0 | 0.88 | 0 | 0 | 0 |
C6 Papuan | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
C7 NE Asian | 0 | 0 | 0 | 0.85 | 0 | 0.4 | 0 | 0 | 0 | 0 |
C8 W African | 0 | 0.12 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
C9 E African | 0 | 0.12 | 0 | 0 | 0.05 | 0 | 0 | 0 | 0.89 | 0 |
Don't take the exact values to heart but this shows the general relationship between the Dodecad and Harappa (K=9) ancestral components.
The South Asian components are about the same in both projects.
The Kalash component is a mix but is primarily Dodecad West Asian.
The Harappa Southwest Asian has contributions from Northwest African, West Asian and South European in addition to the Dodecad Southwest Asian component.
The Southeast Asian component corresponds partially to the Dodecad East Asian component.
The Harappa European component is more Dodecad North European than South European.
If enough Harappa-Dodecad participants are willing to let me know their IDs for both projects, I can do a similar analysis using individual data.
A week later, some more Admixture analysis of Reference I dataset.
As usual, the results are available in a spreadsheet, which is also listed on my sidebar.
Let's start with K=10.
Component | Ancestry | Component | Ancestry |
---|---|---|---|
C1 | South Asian | C2 | Kalash |
C3 | Southwest Asian | C4 | Southeast Asian |
C5 | European | C6 | Papuan |
C7 | Northeast Asian | C8 | Siberian |
C9 | West African | C10 | East African |
The addition here is basically of the Siberian component which is highest among the Yakut.
Fst divergences between estimated populations for K=10:
Fst | C1 | C2 | C3 | C4 | C5 | C6 | C7 | C8 | C9 |
---|---|---|---|---|---|---|---|---|---|
C2 | 0.057 | | | | | | | | |
C3 | 0.064 | 0.073 | | | | | | | |
C4 | 0.089 | 0.127 | 0.136 | | | | | | |
C5 | 0.063 | 0.061 | 0.038 | 0.131 | | | | | |
C6 | 0.167 | 0.209 | 0.215 | 0.202 | 0.210 | | | | |
C7 | 0.080 | 0.120 | 0.129 | 0.032 | 0.123 | 0.190 | | | |
C8 | 0.085 | 0.117 | 0.127 | 0.059 | 0.118 | 0.203 | 0.039 | | |
C9 | 0.152 | 0.174 | 0.161 | 0.201 | 0.171 | 0.266 | 0.195 | 0.199 | |
C10 | 0.115 | 0.133 | 0.117 | 0.166 | 0.128 | 0.233 | 0.160 | 0.163 | 0.036 |
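A table like this is easier to read once the pairs are sorted. The following small Python sketch does that for the K=10 values above; the variable and function names are just illustrative choices.

```python
# Lower triangle of the K=10 Fst table above.
# Row i holds Fst between component C(i+2) and components C1..C(i+1).
FST_K10 = [
    [0.057],
    [0.064, 0.073],
    [0.089, 0.127, 0.136],
    [0.063, 0.061, 0.038, 0.131],
    [0.167, 0.209, 0.215, 0.202, 0.210],
    [0.080, 0.120, 0.129, 0.032, 0.123, 0.190],
    [0.085, 0.117, 0.127, 0.059, 0.118, 0.203, 0.039],
    [0.152, 0.174, 0.161, 0.201, 0.171, 0.266, 0.195, 0.199],
    [0.115, 0.133, 0.117, 0.166, 0.128, 0.233, 0.160, 0.163, 0.036],
]


def fst_pairs(lower):
    """Flatten a lower-triangular Fst table into sorted (fst, column, row) tuples."""
    pairs = []
    for i, row in enumerate(lower):
        for j, value in enumerate(row):
            pairs.append((value, f"C{j + 1}", f"C{i + 2}"))
    return sorted(pairs)


pairs = fst_pairs(FST_K10)
closest = pairs[0]    # least diverged pair of components
farthest = pairs[-1]  # most diverged pair of components
print(closest, farthest)
```

For K=10 the closest pair is C4 and C7 (Southeast Asian and Northeast Asian, Fst 0.032) and the most distant is C6 and C9 (Papuan and West African, Fst 0.266).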
Now for K=11,
Component | Ancestry | Component | Ancestry |
---|---|---|---|
C1 | South Asian | C2 | Kalash |
C3 | Southwest Asian | C4 | Southeast Asian |
C5 | European | C6 | Papuan |
C7 | Siberian | C8 | Northeast Asian |
C9 | East African Bantus | C10 | West African |
C11 | East African | | |
C8 at K=11 is now modal among the Han instead of the Japanese. This affected the Southeast Asian C4 component which is now more of a real Southeast Asian one.
The new ancestral component C9 is found among the Bantus of eastern and southern Africa. It is highest among the Luhya and Bantus of Kenya.
Fst divergences between estimated populations for K=11:
Fst | C1 | C2 | C3 | C4 | C5 | C6 | C7 | C8 | C9 | C10 |
---|---|---|---|---|---|---|---|---|---|---|
C2 | 0.055 | | | | | | | | | |
C3 | 0.062 | 0.072 | | | | | | | | |
C4 | 0.081 | 0.120 | 0.128 | | | | | | | |
C5 | 0.063 | 0.063 | 0.038 | 0.124 | | | | | | |
C6 | 0.169 | 0.211 | 0.215 | 0.195 | 0.213 | | | | | |
C7 | 0.089 | 0.128 | 0.135 | 0.057 | 0.130 | 0.203 | | | | |
C8 | 0.083 | 0.122 | 0.131 | 0.031 | 0.127 | 0.194 | 0.039 | | | |
C9 | 0.143 | 0.165 | 0.150 | 0.185 | 0.162 | 0.259 | 0.195 | 0.189 | | |
C10 | 0.152 | 0.174 | 0.160 | 0.194 | 0.172 | 0.268 | 0.203 | 0.198 | 0.014 | |
C11 | 0.104 | 0.122 | 0.101 | 0.149 | 0.115 | 0.226 | 0.158 | 0.152 | 0.037 | 0.043 |
At K=12,
Component | Ancestry | Component | Ancestry |
---|---|---|---|
C1 | South Asian | C2 | Balochistan/Caucasus |
C3 | Kalash | C4 | Southeast Asian |
C5 | Southwest Asian | C6 | European |
C7 | Papuan | C8 | Northeast Asian |
C9 | Siberian | C10 | East African Bantus |
C11 | West African | C12 | East African |
The Kalash component has split, with an assist from Southwest Asian, into a pure Kalash component (C3) and a Balochistan/Caucasus component (C2). C2 is highest in Southwestern Pakistan (Brahui, Makrani, Balochi) at 60-57%, followed by Georgians, Lezgin, Adygei, Azerbaijan Jews and Iranian Jews (56-50%).
The Southwest Asian component (C5) is now more of a Southwest Asian and North/Northwest African component. The West Asian element in it has been reduced.
The Northeast Asian component (C8) is now again centered on Japan. I have a solution for this movement which I'll apply in my next round of analysis.
Fst divergences between estimated populations for K=12:
Fst | C1 | C2 | C3 | C4 | C5 | C6 | C7 | C8 | C9 | C10 | C11 |
---|---|---|---|---|---|---|---|---|---|---|---|
C2 | 0.057 | | | | | | | | | | |
C3 | 0.066 | 0.060 | | | | | | | | | |
C4 | 0.089 | 0.124 | 0.136 | | | | | | | | |
C5 | 0.075 | 0.057 | 0.087 | 0.142 | | | | | | | |
C6 | 0.066 | 0.040 | 0.073 | 0.130 | 0.048 | | | | | | |
C7 | 0.167 | 0.205 | 0.219 | 0.202 | 0.220 | 0.210 | | | | | |
C8 | 0.080 | 0.117 | 0.128 | 0.032 | 0.134 | 0.122 | 0.190 | | | | |
C9 | 0.085 | 0.114 | 0.126 | 0.059 | 0.133 | 0.117 | 0.203 | 0.039 | | | |
C10 | 0.145 | 0.154 | 0.176 | 0.192 | 0.154 | 0.162 | 0.258 | 0.187 | 0.190 | | |
C11 | 0.154 | 0.163 | 0.186 | 0.201 | 0.164 | 0.172 | 0.266 | 0.195 | 0.199 | 0.014 | |
C12 | 0.107 | 0.109 | 0.135 | 0.157 | 0.105 | 0.116 | 0.225 | 0.151 | 0.154 | 0.035 | 0.041 |
Higher K value admixture analysis will continue.
Mithra asked:
Almost all the Chinese are now around 50% SE Asian, didn’t see this before is it right.
So I decided to look at the Chinese samples in Reference I dataset.
I ran Admixture on the whole Reference I dataset for K=10 ancestral populations. The green component is what I call Southeast Asian, blue is Northeast Asian (highest among the Japanese) and violet is Siberian (highest among the Yakut).
Here is the plot for the 106 HapMap Chinese samples from Denver (label: us chinese):
For the 137 HapMap samples from Beijing, China (label: han chinese):
For the 34 HGDP Han samples (label: han):
For the 10 HGDP Han samples from North China (label: han-nchina):
As you can see, the "Southeast Asian" component goes down from the top group to the bottom one, which is as expected.
I wasn't satisfied with these results, so I decided to run Admixture on the East Asian samples in Reference I separately.
At K=3, the results are about the same as at K=10 for the whole reference I population. The Han all have a significant amount of blue component which is highest among the Southeast Asians.
At K=4, we get a Chinese ("East Asian") component. So we have Japanese, Chinese, Yakut and Southeast Asian components. This is what most of you were probably expecting.
Why did the Japanese become the modal population for the Northeast Asian component? I ran a PCA on the East Asian data to see how the different populations looked on a PCA plot. Remember that eigenvector 1 explains 1.49 times the variance of eigenvector 2 and 1.9 times the variance of eigenvector 3. Thus, eigenvector 2 explains 1.28 times the variation explained by eigenvector 3.
As you can see, the Yakut are the farthest away, but the Japanese are also fairly well-separated from the Chinese populations.
If I didn't have the 141 Japanese samples in my reference dataset, the Northeast Asian component would most likely be centered on the Han, as is the case for Dodecad.
I think this shows that it is not correct to think of the ancestral components inferred from admixture as some pure ancestral population.
Here's the spreadsheet with their admixture results. And you can check their ethnic backgrounds.
You might also want to refer to the reference dataset I admixture analyses for K=2-5 and K=6-9.
I did not run admixture for all values of K this time. So let's start with K=4. For quick reference,
Component | Ancestry |
---|---|
C1 | South Asian |
C2 | European |
C3 | East Asian |
C4 | African |
Now, for K=7, the ancestral components are:
Component | Ancestry |
---|---|
C1 | South Asian |
C2 | European |
C3 | Southeast Asian |
C4 | Southwest Asian |
C5 | Papuan |
C6 | Northeast Asian |
C7 | African |
And finally, here's K=9.
Component | Ancestry |
---|---|
C1 | South Asian |
C2 | Kalash |
C3 | Southwest Asian |
C4 | Southeast Asian |
C5 | European |
C6 | Papuan |
C7 | Northeast Asian |
C8 | West African |
C9 | East African |
It all started on DNA Day 2010, when Razib tweeted about a $99 sale for the DNA test at 23andme. I ordered one immediately. Over the next few months, a lot of my free time was spent poring over and analyzing my genomic results.
While the health and physical traits information was interesting, I found the ancestry information that can be deduced from your genome to be fascinating. That might be because I was working on collecting together and digitizing our family tree at the time.
So to beat Razib's record of writing about his personal genome, I have started blogging about mine:
There's much more to come, including: What's wrong with my chromosome 9 and who did I get it from?; my results from Doug McDonald, Dodecad and Eurogenes; why do I have low similarity scores with everyone?; where exactly was my great-grandmother from?; and more.
PS. Since it's Valentine's Day, I should probably mention that my top match among the people (excluding my sibling of course) I am sharing genomes with on 23andme is my dear wife, Amber.