
Admixture: Choice of K

Admixture lets you choose the number of ancestral populations, K. This choice matters a great deal, and in many cases we do not know how many ancestral populations our samples descend from. The Admixture manual advises:

Use ADMIXTURE's cross-validation procedure. A good value of K will exhibit a low cross-validation error compared to other K values. Cross-validation is enabled by simply adding the --cv flag to the ADMIXTURE command line. In this default setting, the cross-validation procedure will do 10 repetitions, each time holding out 10% of the genotypes at random.

I like this idea better than using the BIC (Bayesian Information Criterion), but I am plotting all of these measures for the various values of K below.

For our Reference I dataset, which I have used for most of the analysis so far, here is the spreadsheet of log likelihood, BIC, AIC and CV (cross-validation error) values. The plots follow.

Using the cross-validation error, the optimum value of K so far is 17, which is also the largest I have run. It now takes days to run Admixture with cross-validation, which almost doubles the run time.
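A minimal sketch of how the CV comparison could be automated, assuming the output of each run was saved as text and contains the "CV error (K=...)" line that ADMIXTURE prints when --cv is used. The function names and the example log excerpts are made up for illustration; they are not results from Reference I.

```python
import re

# ADMIXTURE prints one line such as "CV error (K=3): 0.52305" per --cv run.
CV_LINE = re.compile(r"CV error \(K=(\d+)\): ([0-9.]+)")

def cv_errors(log_texts):
    """Map each K to its cross-validation error, given the text of each run's log."""
    errors = {}
    for text in log_texts:
        match = CV_LINE.search(text)
        if match:
            errors[int(match.group(1))] = float(match.group(2))
    return errors

def best_k(errors):
    """Return the K with the lowest cross-validation error."""
    return min(errors, key=errors.get)

# Made-up log excerpts, for illustration only.
example_logs = [
    "Summary:\nCV error (K=2): 0.60\n",
    "Summary:\nCV error (K=3): 0.55\n",
    "Summary:\nCV error (K=4): 0.57\n",
]
example_errors = cv_errors(example_logs)
print(best_k(example_errors))
```

If the lowest error sits at the largest K tried, as it does here for Reference I, the true minimum may lie beyond the range that was run.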

For Reference II, here are the spreadsheet and graphs.

The cross-validation error is lowest at K=16, which is the highest I have run, so it is likely to decrease further for higher K.

Ref1 South Asian + Harappa Admixture

Since I was working on this dataset consisting of South Asians (minus Kalash and Hazara) from Reference I and some Harappa participants, I thought I would run Admixture on it.

The optimum value for the number of ancestral populations K is 3 in this case. Roughly, the three ancestral components correspond to South India, Balochistan and Gujarat.

The spreadsheet showing the admixture results is here. The first sheet shows the individual results for reference samples as well as project participants.

The 2nd sheet shows the average (and standard deviation) for the reference populations.
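As a rough sketch of that per-population summary, assuming each sample is a population label plus a list of admixture fractions. The sample standard deviation is used here; which variant the spreadsheet uses is not stated, and the population names and numbers below are hypothetical.

```python
from collections import defaultdict
from statistics import mean, stdev

def population_summary(samples):
    """samples: list of (population, [component fractions]).
    Returns {population: [(mean, sd), ...]} with one pair per component."""
    grouped = defaultdict(list)
    for population, fractions in samples:
        grouped[population].append(fractions)
    summary = {}
    for population, rows in grouped.items():
        columns = list(zip(*rows))  # one tuple per ancestral component
        summary[population] = [
            # stdev needs at least two samples; report 0.0 for singletons
            (mean(col), stdev(col) if len(col) > 1 else 0.0)
            for col in columns
        ]
    return summary

# Hypothetical samples, for illustration only.
example_samples = [
    ("PopA", [0.6, 0.3, 0.1]),
    ("PopA", [0.4, 0.5, 0.1]),
    ("PopB", [0.2, 0.2, 0.6]),
]
example_summary = population_summary(example_samples)
```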

The 3rd sheet shows the average and standard deviation for each cluster computed by MClust. I included only the samples which had at least 90% probability of belonging to a cluster.
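The 90% filter can be sketched as follows, assuming the clustering step's membership probabilities have been exported as one list per sample. The data layout, function name and sample IDs are assumptions for illustration, and cluster indices are 0-based here.

```python
def confident_assignments(posteriors, threshold=0.9):
    """posteriors: {sample_id: [probability of belonging to each cluster]}.
    Returns {sample_id: cluster_index} for samples whose highest probability
    is at least `threshold`; all other samples are left out."""
    kept = {}
    for sample_id, probs in posteriors.items():
        top = max(range(len(probs)), key=probs.__getitem__)
        if probs[top] >= threshold:
            kept[sample_id] = top
    return kept

# Hypothetical membership probabilities, for illustration only.
example_posteriors = {
    "S1": [0.95, 0.05],
    "S2": [0.60, 0.40],
    "S3": [0.08, 0.92],
}
kept = confident_assignments(example_posteriors)
print(kept)
```

Samples like S2, which sit between two clusters, are dropped so they do not inflate the cluster averages and standard deviations.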

Note how clusters CL8, CL9 and CL13 have a lot more variation than the others. Of course, I am in CL9 along with some fairly eclectic samples.

Reference I Admixture Analysis K=17

Continuing with Reference I admixture analysis, here is the results spreadsheet.

You can click on the legend to the right of the bar chart to sort by different ancestral components.

If you can't see the interactive chart above, here's a static image.

C1 South Asian
C2 Balochistan/Caucasus
C3 Gujarati
C4 Kalash
C5 Southeast Asian
C6 European
C7 Mediterranean
C8 Japanese
C9 Southwest Asian
C10 Melanesian
C11 Siberian
C12 Papuan
C13 Chinese
C14 Eastern Bantu
C15 Northwest African
C16 West African
C17 East African

The new ancestral component is the tightly clustered Gujarati one. It accounts for almost two-thirds of the Gujaratis sampled by HapMap in Houston, TX. So my question is: does anyone know which Gujarati communities are the largest in Houston? I know that Patel is a very common name, probably the most common South Asian last name in the US, and most Patels I know have been from Gujarat. Are Patels a tightly knit, endogamous community whose members likely don't marry close cousins? Are there different Patel subcommunities?

Fst divergences between estimated populations for K=17:

Here are the Fst numbers:

    C1    C2    C3    C4    C5    C6    C7    C8    C9    C10   C11   C12
C2  0.072
C3  0.032 0.044
C4  0.076 0.061 0.062
C5  0.085 0.120 0.085 0.129
C6  0.076 0.045 0.059 0.072 0.123
C7  0.085 0.062 0.073 0.088 0.138 0.050
C8  0.084 0.119 0.084 0.128 0.035 0.122 0.138
C9  0.091 0.059 0.076 0.095 0.139 0.062 0.058 0.139
C10 0.168 0.203 0.168 0.215 0.171 0.206 0.220 0.172 0.221
C11 0.090 0.116 0.088 0.127 0.064 0.117 0.135 0.039 0.138 0.188
C12 0.188 0.225 0.189 0.237 0.209 0.228 0.242 0.207 0.243 0.145 0.220
C13 0.086 0.122 0.087 0.130 0.030 0.125 0.140 0.014 0.142 0.173 0.044 0.210
C14 0.151 0.155 0.146 0.177 0.186 0.163 0.164 0.186 0.152 0.257 0.190 0.275
C15 0.089 0.066 0.076 0.096 0.133 0.060 0.054 0.132 0.063 0.211 0.131 0.232
C16 0.160 0.164 0.155 0.186 0.194 0.173 0.173 0.195 0.162 0.265 0.199 0.283
C17 0.114 0.111 0.107 0.136 0.150 0.119 0.114 0.151 0.106 0.223 0.154 0.242

    C13   C14   C15   C16
C14 0.188
C15 0.135 0.115
C16 0.197 0.013 0.122
C17 0.153 0.034 0.079 0.041
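For anyone who wants to work with these numbers, here is a small sketch that reads a lower-triangular block like the one above into a symmetric lookup. It assumes whitespace-separated text with the column labels on the first line; the function name is hypothetical, and the excerpt uses only the first three rows of the K=17 table.

```python
def parse_lower_triangle(text):
    """Parse rows like 'C3 0.032 0.044' into a symmetric {(a, b): fst} lookup.
    The first non-empty line lists the column labels."""
    lines = [line.split() for line in text.strip().splitlines()]
    columns = lines[0]
    fst = {}
    for row in lines[1:]:
        label, values = row[0], row[1:]
        # zip stops at the shorter list, which handles the triangular shape
        for column, value in zip(columns, values):
            fst[(label, column)] = fst[(column, label)] = float(value)
    return fst

# First three rows of the K=17 table (columns C1 to C3 only).
excerpt = """
C1 C2 C3
C2 0.072
C3 0.032 0.044
C4 0.076 0.061 0.062
"""
fst = parse_lower_triangle(excerpt)
closest_to_c1 = min(("C2", "C3", "C4"), key=lambda c: fst[("C1", c)])
print(closest_to_c1)  # C3
```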

PS. This was run using Admixture version 1.04 so I can make an apples-to-apples comparison with the previous runs.

More Admixture Maps

Simranjit has sent more maps incorporating the latest admixture results.

C1 South Asian:

C2 Balochistan/Caucasus:

C6 European:

Admixture K=12, HRP0051-HRP0060

Here are their ethnic backgrounds and the results spreadsheet. Also relevant are the Reference I admixture results.

If you can't see the interactive bar chart above, here's a static image.

Look at how different the two Gujaratis are. Also, the Iraqi Kurd is more like our Iranian participants than the two Iraqi Arab participants.

PS. This was run using Admixture version 1.04.

Admixture K=4, HRP0051-HRP0060

Here are their ethnic backgrounds and the results spreadsheet. Also relevant are the Reference I admixture results.

The interesting samples here are the Gujaratis and the Kurd.

If you can't see the interactive bar chart above, here's a static image.

PS. This was run using Admixture version 1.04.

South Asian Map

Simranjit has another map:

I am working on improving the interpolation algorithm to take into account barriers such as oceans and even terrain features like mountain ranges. However, this process takes a long time.

Anyway, in the meantime, this is one I think the participants would be interested in. It has several things in it: an isopleth layer for C1 - South Asian (12 gradations for more impact), and the other components (C1, C2, C3, C4, C5, C6, C8) represented in the form of pie charts. The base map is a topographic one this time.

Harappa Participant Admixture Maps

Following maps of the reference populations, Simranjit has gone ahead and included the Harappa Project participants in these maps as well.

Here's what he said:

I'm now incorporating project participants into the maps. I had to drop admixed individuals, however, and I made some choices: I dropped the Bihari Kayastha and the Tamil Nadu non-Brahmin for now, as they differ a fair bit. Take note that we don't have reference samples for some countries, which can sometimes cause the interpolation to be off (e.g. the lack of Central Asian republics other than Uzbekistan is skewing Central Asia to be more South Asian than it really is).

These maps are based on K=12 admixture run.

The gradation is from Dark green (low) to Dark red (high) for most of them.

Basically, the percentages for each component are divided into 32 equal intervals to create the contour effect. Take note that the maps represent relative values, not absolute ones.
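A minimal sketch of that binning, assuming "relative" means each component is scaled between its own observed minimum and maximum before being cut into 32 equal-width intervals. That reading, the function names and the example percentages are assumptions, not a description of the mapping software actually used.

```python
def contour_level(value, low, high, levels=32):
    """Return the 0-based interval (0 .. levels-1) that `value` falls into
    when the range [low, high] is cut into `levels` equal-width intervals."""
    if high <= low:
        return 0
    position = (value - low) / (high - low)
    # the maximum itself belongs to the top interval
    return min(int(position * levels), levels - 1)

def contour_levels(values, levels=32):
    """Bin every value relative to the observed minimum and maximum."""
    low, high = min(values), max(values)
    return [contour_level(v, low, high, levels) for v in values]

# Hypothetical component percentages at three locations.
example_levels = contour_levels([10.0, 35.0, 60.0])
print(example_levels)  # [0, 16, 31]
```

Under this reading, dark red on one map and dark red on another need not correspond to the same percentage, since each component is scaled to its own range.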

C1 South Asian component:

C2 Balochistan/Caucasus component:

C6 European component:

Distance Measures

Referring to the dendrogram computed from the admixture results of Harappa Project participants, Thorfinn asked a long time ago:

Interesting that South Indian/Cow Belt Brahmins cluster together; while Punjabi Brahmins are closer to Punjabis.

I can understand the first clustering, assuming that Southern Brahmin communities are a spinoff of northern communities and have maintained relative genetic isolation; and the source Northern Brahmin population differed in original origin from other Cow Belt populations.

But how do both Brahmin communities differ equally from Punjabi/Rajasthani Brahmins; and why is that community closer to other Punjabi populations?

In terms of admixture results, that is correct in the case of the project participants. Why this is the case, I have no idea.

However, there is an issue here that we have to consider and nsriram commented about it:

The Euclidean distance doesn’t seem to be the appropriate metric to capture the pairwise similarities. Once you make a commitment to the distance measure then the side effects carry over into the tree construction.

What is a good distance measure for the similarity or dissimilarity of the admixture results of two people? Is the Euclidean distance a good one in this case? It is certainly the most common and probably the easiest to use, so we usually default to it.

However, if we look at the Fst divergences of the ancestral components, we see that the components are not equally distant from one another: some pairs are close and others are far apart. So a 5% difference in C1 might not mean the same thing as a 5% difference in C10.

A solution might be to use a weighted distance, but how should we weight it? The Fst numbers give pairwise distances between the different ancestral populations. If we are focused on a specific population (e.g., South Asians), we could try weighting by the Fst values between that component and the others. But I am not sure if that's a good solution either.
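To make the two options concrete, here is a small sketch of the plain Euclidean distance and a weighted variant with one weight per component. The weights in the example are arbitrary placeholders, not derived from any Fst table, and the admixture vectors are made up; how to choose the weights is exactly the open question.

```python
import math

def euclidean(a, b):
    """Plain Euclidean distance between two vectors of admixture fractions."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def weighted_euclidean(a, b, weights):
    """Euclidean distance with one non-negative weight per component."""
    return math.sqrt(sum(w * (x - y) ** 2 for x, y, w in zip(a, b, weights)))

# Two hypothetical individuals over three components.
person_a = [0.70, 0.25, 0.05]
person_b = [0.65, 0.25, 0.10]

# Placeholder weights: differences in the third component count four times as much.
placeholder_weights = [1.0, 1.0, 4.0]

plain = euclidean(person_a, person_b)
weighted = weighted_euclidean(person_a, person_b, placeholder_weights)
print(round(plain, 4), round(weighted, 4))
```

The same 5% shift contributes more to the weighted distance when it falls in a heavily weighted component, which is the effect the Fst argument above is after.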

In the end, a Euclidean distance measure gives us a rough idea of the differences between admixture results, but it should not be used to explain minor differences or to consider phylogenies.

Isopleths

Simranjit has done a great job of creating some maps showing the distribution of the various ancestral components at K=16. He has posted them on DNA Forums and sent them to me.

The gradation is from Dark green (low) to Dark red (high) for most of them.

Basically, the percentages for each component are divided into 32 equal intervals to create the contour effect. Take note that the maps represent relative values, not absolute ones.

Here is C1 South Asian:

C2 Balochistan/Caucasus:

C5 Southwest Asian:

C6 European:

C12 Siberian:

Great job, Simranjit!