Monthly Archives: April 2011

Harappa Reference 2 IBS Concordance

Vasishta asked:

would it be possible to repeat the same exercise with the Reference II populations? These results seem to be far more plausible for every participant as compared to the previous ones.

Since it took only a few minutes, I calculated the scores, as detailed in a previous post, from the IBS measures between Harappa participants (1-80 only) and Reference 2.

The spreadsheet is here.

Harappa Ref3 Admixture Dendrograms

Now that we have the admixture results for project participants using Reference 3, let's take a look at a tree based on Euclidean distance of the admixture proportions for each participant.

Compare it to the earlier one based on the Reference 1 admixture results.

And here is a dendrogram combining the average reference population results with the Harappa participants.
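For readers who want to try the same kind of tree on the spreadsheet data, here is a minimal sketch of clustering participants by the Euclidean distance between their admixture proportions. It is not the code behind the dendrograms in this post: the linkage rule (complete linkage) is an assumption, the function names are hypothetical, and the four samples and their K=3 proportions are invented purely for illustration.

```python
import math
from itertools import combinations


def euclidean(p, q):
    """Euclidean distance between two admixture proportion vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(p, q)))


def build_tree(proportions):
    """Agglomerative clustering with complete linkage (an assumption).

    proportions maps a sample ID to its list of K admixture fractions.
    Returns (tree, merges): tree is a nested tuple of IDs, and merges lists
    (left, right, distance) in the order the clusters were joined.
    """
    # Each cluster is (subtree, list of member IDs).
    clusters = [(name, [name]) for name in sorted(proportions)]
    merges = []
    while len(clusters) > 1:
        best = None
        for i, j in combinations(range(len(clusters)), 2):
            # Complete linkage: distance between the two farthest members.
            d = max(euclidean(proportions[a], proportions[b])
                    for a in clusters[i][1] for b in clusters[j][1])
            if best is None or d < best[0]:
                best = (d, i, j)
        d, i, j = best
        left, right = clusters[i], clusters[j]
        merges.append((left[0], right[0], d))
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(((left[0], right[0]), left[1] + right[1]))
    return clusters[0][0], merges


# Invented K=3 proportions for four hypothetical samples.
toy = {
    "A": [0.70, 0.20, 0.10],
    "B": [0.68, 0.22, 0.10],
    "C": [0.10, 0.10, 0.80],
    "D": [0.15, 0.05, 0.80],
}
tree, merges = build_tree(toy)
print(tree)
```

Samples with nearly identical proportions are joined first, so the nested tuple reads like the dendrogram from the leaves upward. A different linkage rule can change the shape of the tree, especially near the root.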

Harappa (1-90) K=11 Admixture Ref3

Here's my first admixture run using Reference 3 for Harappa participants. Since K=11 was the run with the Onge-ASI connection, I ran admixture at K=11 with all 90 Harappa participants.

You can see the participant results in a spreadsheet as well as their ethnic breakdowns and the reference population results.

Here's our bar chart and table. Remember you can click on the legend or the table headers to sort.

Using the comparison between the Onge component here and Reich et al.'s Ancestral South Indian one, I get the following linear regression.

The correlation is 0.9949 which is probably as high as it can get. So let's calculate the ASI percentage for all the Harappa participants.

Note that I didn't calculate the ASI percentage for those who had a really low Onge component since the linear regression above would not be valid outside the range we have in our original data.

You can see the percentages in a spreadsheet too.
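The regression step described above can be sketched as follows: fit a straight line from the Onge component to published ASI values, then apply it only to participants whose Onge component falls inside the range of the fitted data. This is an illustration, not the calculation behind the spreadsheet. The numbers are invented, the function names are hypothetical, and refusing to estimate above the fitted range as well as below it is an extra assumption.

```python
import math


def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept.

    Also returns the Pearson correlation r between xs and ys.
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r = sxy / math.sqrt(sxx * syy)
    return slope, intercept, r


def estimate_asi(onge, slope, intercept, fitted_range):
    """Return an ASI estimate, or None when onge is outside the fitted range."""
    low, high = fitted_range
    if not low <= onge <= high:
        return None  # extrapolating the line would not be justified
    return slope * onge + intercept


# Invented calibration points: Onge component (%) vs. published ASI (%).
onge_pct = [10.0, 20.0, 30.0, 40.0]
asi_pct = [25.0, 45.0, 65.0, 85.0]

slope, intercept, r = fit_line(onge_pct, asi_pct)
fitted_range = (min(onge_pct), max(onge_pct))

print(estimate_asi(25.0, slope, intercept, fitted_range))
print(estimate_asi(5.0, slope, intercept, fitted_range))
```

Returning None for out-of-range values mirrors the choice not to report an ASI percentage for participants with a very low Onge component.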

Let's compare with the Dodecad ANI-ASI results. I have 22.5% ASI here while it was 20.6% in the Dodecad analysis. Overall, it seems like my technique results in about 2% more ASI than Dodecad's, with a few exceptions, like Razib, who jumps from 34.3% to 43.3% (averaging his parents, who are very close).

Harappa Reference Population Similarity

I was not satisfied with the median IBS with reference populations method for checking how similar you are to different populations. So I took inspiration from Dienekes' population concordance ratio to compute another measure.

Let's say we have a Harappa participant h and we want to compare h to a reference population A. We can then divide our reference dataset into the in-group A and the out-group A' (which consists of everyone not in A). Now for every individual a belonging to group A and every individual a' belonging to group A', we can compare the IBS similarities and score the pair as:

$$ s(h, a, a') = \begin{cases} 1 & \text{if } IBS(h, a) > IBS(h, a') \text{ and } IBS(h, a) > IBS(a, a') \\ 0 & \text{otherwise} \end{cases} $$

The condition in this equation is true when Harappa participant h is more similar to individual a in population A than he is to individual a', who is not in population A, and h and a are closer to each other than a is to a'.

We can then sum these values over all pairs (a, a'), with a in A and a' in A', and divide by the number of pairs:

$$ S(h, A) = \frac{1}{|A| \, |A'|} \sum_{a \in A} \sum_{a' \in A'} s(h, a, a') $$

This score tells us how similar h is to population A compared to all the reference samples not in population A, and it varies from 0 (most dissimilar) to 1 (most similar).
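As a rough sketch of this score: for each pair made of one in-group individual a and one out-group individual a', count the pair when h is more similar to a than to a' and also more similar to a than a is to a', then divide the count by the number of pairs. The strict comparisons, the function names, and the IBS values below are assumptions made for illustration and are not taken from the actual data.

```python
def make_ibs(pairs):
    """Build a symmetric IBS lookup from {(x, y): value} entries."""
    table = {frozenset(k): v for k, v in pairs.items()}
    return lambda x, y: table[frozenset((x, y))]


def similarity_score(ibs, h, in_group, out_group):
    """Fraction of (a, a') pairs for which h sits closer to a than a' does.

    A pair counts when IBS(h, a) > IBS(h, a') and IBS(h, a) > IBS(a, a').
    """
    hits = 0
    for a in in_group:
        for a_out in out_group:
            h_to_a = ibs(h, a)
            if h_to_a > ibs(h, a_out) and h_to_a > ibs(a, a_out):
                hits += 1
    return hits / (len(in_group) * len(out_group))


# Invented IBS values for one participant h, in-group {a1, a2}, out-group {b1, b2}.
ibs = make_ibs({
    ("h", "a1"): 0.80, ("h", "a2"): 0.78,
    ("h", "b1"): 0.70, ("h", "b2"): 0.79,
    ("a1", "b1"): 0.71, ("a1", "b2"): 0.72,
    ("a2", "b1"): 0.69, ("a2", "b2"): 0.77,
})

score = similarity_score(ibs, "h", ["a1", "a2"], ["b1", "b2"])
print(score)  # 0.75: three of the four pairs satisfy the condition
```

In this toy case only the pair (a2, b2) fails, because h is slightly closer to b2 than to a2.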

Let's see how the Harappa participants HRP0001 to HRP0089 score with the different Reference 3 populations.

Go to the spreadsheet and click on your Harappa ID to sort the populations by your similarity score with them (click two times if you want to sort in decreasing order which I like better).

The first sheet Sheet1 has all the populations. In the Filtered 1 sheet, I removed 13 African populations that had really low similarity scores with all participants and recomputed the scores.

In Filtered 2, I further removed 9 populations (East Africa, America, Oceania) with low scores for everyone.

In Filtered 3, another 40 populations with low scores with at least 88 (out of 89) Harappa participants were removed. The reason I removed populations and recomputed is that this made the out-group not as different from the in-group as it was before. So we can check if this algorithm can provide us with some meaningful difference in scores with close populations.

In Filtered 4, another 25 populations were removed making it more South Asian centered.

Finally, I used the 68 unmixed South Asian Harappa participants and did a South Asian specific run (though I cheated a bit and kept myself, HRP0001, and my sister, HRP0035, in). The most interesting thing here is the really high score the Patel Gujaratis get with the Gujarati-A reference population.

Reference 3 K=11 Admixture Dendrogram

Laredo asked:

Is it possible for you to create an unrooted similarity tree of all the populations in your “Reference 3” dataset?

So here's a dendrogram of the average K=11 admixture results for the reference 3 populations.

Harappa Median IBS with Reference 3

You guys didn't like it the last time I did this and you are not going to like this either, but while I am thinking of solutions for posting closest individual IBS neighbors, here's another go at which reference populations have the best median IBS matches with you.

I used Reference 3 with about 100,000 SNPs for this IBS run.

Go to the spreadsheet and click on your ID in the column headers to sort by your similarity to the different reference populations.

UPDATE: I have added a transposed spreadsheet too, at Onur's request, so that you can sort which Harappa participant has higher or lower scores with a specific reference population.
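The median IBS measure used in this post can be sketched like this: group one participant's IBS values by the population of each reference sample, take the median within each group, and sort the populations by that median. The function name, sample IDs, population labels, and IBS values are all invented for illustration.

```python
from statistics import median


def median_ibs_by_population(ibs_row, population_of):
    """Median IBS between one participant and each reference population.

    ibs_row maps reference sample ID -> IBS with the participant.
    population_of maps reference sample ID -> population label.
    """
    groups = {}
    for sample, value in ibs_row.items():
        groups.setdefault(population_of[sample], []).append(value)
    return {pop: median(values) for pop, values in groups.items()}


# Invented IBS values between one participant and five reference samples.
ibs_row = {"r1": 0.70, "r2": 0.72, "r3": 0.74, "r4": 0.80, "r5": 0.82}
population_of = {"r1": "PopX", "r2": "PopX", "r3": "PopX",
                 "r4": "PopY", "r5": "PopY"}

medians = median_ibs_by_population(ibs_row, population_of)

# Populations sorted from best to worst median match.
ranking = sorted(medians, key=medians.get, reverse=True)
print(ranking)
```

Because the median ignores how spread out a population is, two populations with similar medians can still differ a lot in their closest individual matches.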

Admixture K=12, HRP0081-HRP0090

Here are their ethnic backgrounds and the results spreadsheet. Also relevant are the reference I admixture results.

If you can't see the interactive bar chart above, here's a static image.

The two new Assyrians (HRP0081 & HRP0082) are pretty similar to the earlier Assyrian participant HRP0010.

HRP0087 is an interesting case with ancestry from France, Martinique, Madagascar and India. I can't be certain but the ratio of South Asian to Balochistan/Caucasus components seems to point in the direction of northern Indian ancestry. I definitely need to do a supervised admixture run for the mixed participants.

HRP0089 is Kazakh and has a one-third Siberian component. That's higher than the Uygurs (21%) and Uzbeks (23%) in my reference set. HRP0089 also has a little bit more of the European component than the average Uygur or Uzbek in my reference.

PS. This was run using Admixture version 1.04.

Admixture K=4, HRP0081-HRP0090

Here are their ethnic backgrounds and the results spreadsheet. Also relevant are the reference I admixture results.

It would be interesting to see how the Kazakh and the mixed French/Madagascar/Martinique/Indian participants get on K=12.

If you can't see the interactive bar chart above, here's a static image.

PS. This was run using Admixture version 1.04 and using reference I. Probably the last batch for both.

PPS. For some reason, my efforts to reduce the font in the table are unsuccessful. Since we are close to 100 participants now, I need to find a better way for you guys to visualize these results. Maybe one slice at a time.

Harappa Nearest IBS Presentation

Since Dodecad posted nearest IBS (identity by state) neighbors, I have had requests to do the same for Harappa participants.

I have the data ready but I am not sure how to present it. I don't want to post an R object since I suspect most of you don't have R installed.

The idea is to give you a list of your closest IBS neighbors as well as your match percentage with them. How would you present that for 90 people who might match any of several hundred (thousand?) reference samples too? Give me some ideas.
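One plain option, offered only as a sketch and not as what was eventually used, is a long-format table with one row per participant and neighbor, keeping just the top few matches for each person. The column names, IDs, and IBS values below are invented, and treating the match percentage as IBS times 100 is an assumption.

```python
import heapq


def nearest_neighbors(ibs_row, n=3):
    """Return the n reference samples with the highest IBS, best first."""
    return heapq.nlargest(n, ibs_row.items(), key=lambda item: item[1])


def format_table(participant, neighbors):
    """Render the neighbors as CSV text, one row per neighbor."""
    lines = ["participant,rank,reference_sample,ibs_percent"]
    for rank, (sample, ibs) in enumerate(neighbors, start=1):
        lines.append(f"{participant},{rank},{sample},{ibs * 100:.2f}")
    return "\n".join(lines)


# Invented IBS values between a hypothetical participant and four samples.
ibs_row = {"ref_a": 0.7312, "ref_b": 0.7450, "ref_c": 0.7105, "ref_d": 0.7398}

top_two = nearest_neighbors(ibs_row, n=2)
table = format_table("P001", top_two)
print(table)
```

A table like this stays small (participants times n rows) no matter how many reference samples there are, and it can be sorted or filtered in any spreadsheet.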

Reference 3 Admixture Data

Onur asked:

BTW, Zack, are you planning to publish (in this blog or in a medium like Rapidshare) the ADMIXTURE results of the reference populations on an individual by individual basis like Dienekes?

So, I have uploaded a zip file which contains admixture results from K=2 to K=17 for all individual samples in the Reference 3 dataset. Do note that K=14 had the lowest cross-validation error and I actually prefer even lower values of K.

I have plotted the population averages on the blog but this contains the individual level data. There are two files for each value of K. One is ref3.K.Q which has the admixture proportions for each individual. The other is ref3.K.F which has the allele frequencies for the inferred ancestral components. I haven't been able to look at the allele frequency files at all, so if you find anything interesting there, do let me know.

There is also an info file (ref3_info.csv) in the archive which has the information about the samples in the same order as their results are listed in the admixture output.
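Since the info file lists the samples in the same order as the admixture output, the two can be paired row by row. Here is a minimal sketch that does this and computes population averages. It assumes the Q file is whitespace-separated with one row per individual and no header, which is the usual ADMIXTURE output format. The column names in the info file are not described in this post, so `population` and `sample_id` are hypothetical, as are the function names and the tiny inline data.

```python
import csv
import io


def read_q(lines):
    """Parse Q-file rows: one individual per line, K fractions per row."""
    return [[float(x) for x in line.split()] for line in lines if line.strip()]


def population_means(q_rows, info_rows, pop_column="population"):
    """Average admixture proportions per population.

    Rows are paired by position, so both inputs must be in the same order.
    pop_column is a hypothetical column name in the info file.
    """
    if len(q_rows) != len(info_rows):
        raise ValueError("Q file and info file have different row counts")
    sums, counts = {}, {}
    for props, info in zip(q_rows, info_rows):
        pop = info[pop_column]
        total = sums.setdefault(pop, [0.0] * len(props))
        sums[pop] = [t + p for t, p in zip(total, props)]
        counts[pop] = counts.get(pop, 0) + 1
    return {pop: [t / counts[pop] for t in total] for pop, total in sums.items()}


# Tiny invented stand-ins for a K=2 Q file and the info file.
q_text = "0.9 0.1\n0.7 0.3\n0.2 0.8\n"
info_text = "sample_id,population\nS1,PopX\nS2,PopX\nS3,PopY\n"

q_rows = read_q(io.StringIO(q_text))
info_rows = list(csv.DictReader(io.StringIO(info_text)))
means = population_means(q_rows, info_rows)
print(means)
```

With the real archive, the `io.StringIO` objects would be replaced by opened files, for example the Q file for the chosen K and ref3_info.csv, after checking what the info file's header actually contains.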