Reference I Admixture Errors

I have been thinking about error estimation for Admixture results for some time, since I have heard a lot of arguments about how even a 0.1% result is significant. I was skeptical of that and have been rounding off my admixture run results to the nearest percent.

There was a memory leak issue in the bootstrapping code for admixture which crashed it every time I tried running it. I emailed David Alexander and he fixed it in version 1.12.

So I ran the default 200 bootstrap replicates to measure the standard error in our old Reference I K=12 admixture. A spreadsheet with population-level results is here, and participant results are here.
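For readers unfamiliar with bootstrapping, here is a minimal Python sketch of how a standard error and a bias estimate fall out of a set of bootstrap replicates. It is a toy illustration and not Admixture's own code: Admixture resamples the genotype data and refits the model, whereas this sketch resamples a small made-up list of values and uses their mean as the estimator. The data and the function name are hypothetical.

```python
import random
import statistics

def bootstrap_se_and_bias(values, estimator, replicates=200, seed=0):
    """Return (standard error, bias) of `estimator` from bootstrap resampling."""
    rng = random.Random(seed)
    original = estimator(values)
    estimates = []
    for _ in range(replicates):
        # Resample with replacement, keeping the sample size fixed.
        resample = [rng.choice(values) for _ in values]
        estimates.append(estimator(resample))
    se = statistics.stdev(estimates)              # spread of the replicate estimates
    bias = statistics.mean(estimates) - original  # average replicate minus original estimate
    return se, bias

# Made-up values standing in for real data.
toy_values = [0.0, 0.1, 0.0, 0.3, 0.2, 0.0, 0.1, 0.4, 0.0, 0.2] * 20
se, bias = bootstrap_se_and_bias(toy_values, statistics.mean)
print(f"standard error: {se:.4f}, bias: {bias:+.4f}")
```

The standard error is the spread of the replicate estimates, and the bias is how far their average sits from the original estimate.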

Here are some statistics for the standard error estimates:

Component          Min.  1st Qu.   Median     Mean  3rd Qu.     Max.
C1 S Asian        0.00%    0.02%    0.33%    0.52%    0.96%    1.93%
C2 Blch/Cauc      0.00%    0.00%    1.02%    0.79%    1.45%    2.63%
C3 Kalash         0.00%    0.01%    0.40%    0.50%    0.99%    3.76%
C4 SE Asian       0.00%    0.09%    0.37%    0.60%    1.27%    1.92%
C5 SW Asian       0.00%    0.00%    0.60%    0.66%    1.28%    2.90%
C6 Euro           0.00%    0.00%    0.35%    0.56%    1.12%    1.82%
C7 Papuan         0.00%    0.07%    0.22%    0.23%    0.36%    1.08%
C8 NE Asian       0.00%    0.07%    0.36%    0.67%    1.36%    2.45%
C9 Siberian       0.00%    0.08%    0.37%    0.51%    0.82%    2.29%
C10 E Bantu       0.00%    0.00%    0.00%    0.35%    0.72%    1.93%
C11 W Afr         0.00%    0.00%    0.00%    0.28%    0.50%    1.51%
C12 E Afr         0.00%    0.00%    0.05%    0.31%    0.60%    1.79%

You can see the mean value of the standard errors per population and realize how many are over 1% (marked in red).

And statistics for bias estimates:

Component          Min.  1st Qu.   Median     Mean  3rd Qu.     Max.
C1 S Asian      -1.104%  -0.031%   0.000%  -0.024%   0.075%   1.026%
C2 Blch/Cauc    -0.835%  -0.280%  -0.009%  -0.133%   0.000%   1.049%
C3 Kalash       -1.575%   0.000%   0.020%   0.076%   0.147%   0.615%
C4 SE Asian     -0.629%  -0.021%   0.011%   0.018%   0.087%   0.478%
C5 SW Asian     -0.691%  -0.094%   0.000%  -0.020%   0.035%   0.613%
C6 Euro         -0.572%  -0.086%   0.000%  -0.039%   0.004%   0.468%
C7 Papuan       -0.171%   0.008%   0.059%   0.070%   0.120%   0.312%
C8 NE Asian     -0.739%   0.000%   0.016%   0.034%   0.107%   0.679%
C9 Siberian     -1.044%   0.000%   0.015%   0.035%   0.103%   0.692%
C10 E Bantu     -0.412%   0.000%   0.000%  -0.007%   0.001%   0.370%
C11 W Afr       -0.261%   0.000%   0.000%   0.009%   0.005%   0.304%
C12 E Afr       -0.635%   0.000%   0.000%  -0.017%   0.010%   0.405%

You can also see the average value of the bias in each ancestral component for each population.

Since the bias is lower than the standard error and distributed around zero, if a large number of samples of a population group have some small percentage of an ancestral component, the likelihood of that not being noise is higher.
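That reasoning can be made concrete with a small Python sketch. It assumes, as a simplification that the bootstrap output does not guarantee, that the errors of different individuals are independent and roughly equal in size, so the standard error of a group average shrinks with the square root of the group size. The 0.5% per-person figure is just a round number near the means in the first table, and the function name is hypothetical.

```python
import math

def se_of_group_mean(per_person_se, n):
    """Standard error of the mean of n independent estimates with equal SE."""
    return per_person_se / math.sqrt(n)

per_person_se = 0.5  # percent; an illustrative round number, not a measured value
for n in (1, 4, 25, 100):
    print(f"n={n:3d}: SE of group mean = {se_of_group_mean(per_person_se, n):.2f}%")
```

Under these assumptions, a component averaging 0.3% across 25 individuals would be about three standard errors from zero, while the same 0.3% in a single individual would be well within noise. If the errors are correlated across individuals (for example, because everyone is run against the same reference set), the real improvement would be smaller.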

Reference 3F(iltered) Admixture

I removed all American populations and San and Pygmy (i.e., South and Central African) from Reference 3 for a better focus on our target populations.

Here are the admixture results. You can choose the number of ancestral components, K, from the dropdown below.

K=13, 14, 15 (in that order) have the lowest cross-validation error.
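If you want to rank K values by cross-validation error yourself, something like the following Python sketch would do it. It assumes the Admixture logs contain lines of the form `CV error (K=13): 0.53`, which is what I understand Admixture prints when cross-validation is enabled. The error values in the sample log are made up for illustration and are not the project's actual numbers.

```python
import re

CV_LINE = re.compile(r"CV error \(K=(\d+)\): ([0-9.]+)")

def rank_k_by_cv_error(log_text):
    """Return (K, cv_error) pairs sorted from lowest to highest CV error."""
    pairs = [(int(k), float(err)) for k, err in CV_LINE.findall(log_text)]
    return sorted(pairs, key=lambda pair: pair[1])

# Hypothetical log lines; the error values are invented.
sample_log = """\
CV error (K=12): 0.5310
CV error (K=13): 0.5301
CV error (K=14): 0.5303
CV error (K=15): 0.5305
CV error (K=16): 0.5312
"""
ranking = rank_k_by_cv_error(sample_log)
print([k for k, _ in ranking[:3]])
```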

There are a number of interesting results in there. For example, the split into northern and southern European, and the split of Siberian into Siberian and Russian Far East (or Bering Strait). However, the Onge component as a proxy for the ASI does not appear. Also, we don't get as much breakdown of the South Asian populations as we would like.

Harappa Nearest IBS Neighbors

After a long tease, here is the spreadsheet containing the top 500 nearest neighbors (using IBS similarity percentages) for the Harappa participants from HRP0001 to HRP0089.

I am also providing an R data object with the same data (except that it contains all 3,975 individuals from Reference 3 and Harappa). To use this data,

  1. Download R
  2. Install R on your computer
  3. When you start R, type
    load('harappa_ibs.RData')

    to load the data

  4. Type
    closest("HRP0001")

    to find the 20 closest IBS neighbors of HRP0001. You can use any of the Harappa IDs here.

  5. You can set the number of IBS neighbors (50, for example) to show using
    closest("HRP0010",50)
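For those who would rather not use R, the idea behind a nearest-neighbor lookup like `closest` can be sketched in Python. This is not the code stored in the R data object, which is not shown here. It assumes the IBS similarities for one participant are available as a mapping from sample ID to similarity, and the IDs and values below are invented.

```python
def closest(similarities, n=20):
    """Return the n (sample ID, similarity) pairs with the highest IBS similarity."""
    ranked = sorted(similarities.items(), key=lambda item: item[1], reverse=True)
    return ranked[:n]

# Invented similarities (in percent) between one participant and other samples.
ibs_to_hrp0001 = {
    "REF_A": 73.91,
    "REF_B": 74.25,
    "HRP0002": 74.10,
    "REF_C": 73.40,
}
for sample_id, similarity in closest(ibs_to_hrp0001, n=2):
    print(sample_id, similarity)
```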

Enjoy!

100!

Yesterday, we got to 100 participants in the Harappa Ancestry Project.

I made the project public on January 17, 2011. So, 100 submissions in 106 days. That's pretty good.

I am surprised at the speed and quantity of submissions. I probably have the largest dataset of South Asians right now.

Keep spreading the word and encouraging everyone to participate.

Accepting FTDNA Family Finder

In addition to 23andme data, I am now accepting the autosomal data from FTDNA Family Finder too.

This is due to the recent switch to Illumina Omni chip by FamilyTreeDNA which has a lot more markers in common with the 23andme data.

Since FTDNA is retesting all its current customers on the new chip, even if you tested with them earlier, you should have autosomal data from the new chip which you can download and email to me at harappa@zackvision.com.

I am basically looking for participants who have at least some ancestry from the following countries/regions:

  • Afghanistan
  • Bangladesh
  • Bhutan
  • Burma
  • India
  • Iran
  • Maldives
  • Nepal
  • Pakistan
  • Sri Lanka
  • Tibet

But if you have ancestry from West or Central Asia or Caucasus, I am likely to accept your data too.

Details of participation are here.

April Update

I have a total of 97 participants in the project right now who have sent me their raw data. Six of those have relatives participating and thus have to be filtered out of most analyses, other than individual admixture percentages and the like, where I divide participants into small groups.

The following groups are represented:

Let's try to get to a hundred soon.

And yes, I am accepting FTDNA Family Finder (new Illumina chip) now.

Ref3 + Harappa Maps

More maps from The Jatt Gene using the Reference 3 and Harappa participants K=11 admixture results.

C1 South Asian Isopleth

C2 Onge Isopleth

C1 South Asian Choropleth at state/province level

C2 Onge Choropleth

As usual, Simranjit has more maps on his blog.

Harappa Reference 2 IBS Concordance

Vasishta asked:

would it be possible to repeat the same exercise with the Reference II populations? These results seem to be far more plausible for every participant as compared to the previous ones.

Since it took only a few minutes, I calculated the scores as detailed in a previous post from the IBS measures between Harappa participants (1-80 only) and Reference 2.

The spreadsheet is here.

Harappa Ref3 Admixture Dendrograms

Now that we have the admixture results for project participants using Reference 3, let's take a look at a tree based on Euclidean distance of the admixture proportions for each participant.
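For anyone curious how such a tree is built, here is a minimal Python sketch of agglomerative clustering on Euclidean distances between admixture profiles. The linkage method is an assumption: the sketch uses complete linkage, which is the default of R's `hclust`, but the method actually used for the tree is not stated here. The sample names and the three-component percentages are invented.

```python
import math
from itertools import combinations

def euclidean(p, q):
    """Euclidean distance between two vectors of admixture proportions."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def cluster_distance(c1, c2, profiles):
    """Complete linkage: distance between the two farthest members."""
    return max(euclidean(profiles[a], profiles[b]) for a in c1 for b in c2)

def agglomerate(profiles):
    """Repeatedly merge the two closest clusters; return the merge history."""
    clusters = [(name,) for name in profiles]
    merges = []
    while len(clusters) > 1:
        c1, c2 = min(
            combinations(clusters, 2),
            key=lambda pair: cluster_distance(pair[0], pair[1], profiles),
        )
        height = cluster_distance(c1, c2, profiles)
        clusters = [c for c in clusters if c not in (c1, c2)] + [c1 + c2]
        merges.append((c1, c2, round(height, 3)))
    return merges

# Invented three-component admixture percentages for four made-up samples.
profiles = {
    "S1": (70.0, 20.0, 10.0),
    "S2": (68.0, 22.0, 10.0),
    "S3": (30.0, 60.0, 10.0),
    "S4": (28.0, 61.0, 11.0),
}
merges = agglomerate(profiles)
for left, right, height in merges:
    print(left, "+", right, "at height", height)
```

Each merge corresponds to one join in the dendrogram, and the height is the distance at which the two branches meet.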

Compare it to the earlier one with reference 1 admixture results.

And here is a dendrogram combining the average reference population results with the Harappa participants.

Harappa (1-90) K=11 Admixture Ref3

Here's my first admixture run using Reference 3 for Harappa participants. Since K=11 was the run with the Onge-ASI connection, I ran admixture at K=11 with all the 90 Harappa participants.

You can see the participant results in a spreadsheet as well as their ethnic breakdowns and the reference population results.

Here's our bar chart and table. Remember you can click on the legend or the table headers to sort.

Using the comparison between the Onge component here and Reich et al.'s Ancestral South Indian one, I get the following linear regression.

The correlation is 0.9949, which is probably as high as it can get. So let's calculate the ASI percentage for all the Harappa participants.

Note that I didn't calculate the ASI percentage for those who had a really low Onge component since the linear regression above would not be valid outside the range we have in our original data.
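The procedure can be sketched in Python as an ordinary least squares fit followed by a prediction that refuses to extrapolate. The calibration points below are invented to keep the arithmetic easy to follow; they are not Reich et al.'s figures or the project's Onge percentages, and the function names are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

def predict_asi(onge, intercept, slope, lo, hi):
    """Predict ASI % from Onge %, or return None outside the fitted range."""
    if not lo <= onge <= hi:
        return None
    return intercept + slope * onge

# Invented calibration points: (Onge component %, ASI %).
onge_pct = [10.0, 15.0, 20.0, 25.0, 30.0]
asi_pct = [25.0, 35.0, 45.0, 55.0, 65.0]
intercept, slope = fit_line(onge_pct, asi_pct)
lo, hi = min(onge_pct), max(onge_pct)

print(predict_asi(18.0, intercept, slope, lo, hi))
print(predict_asi(2.0, intercept, slope, lo, hi))  # None: below the fitted range
```

Returning nothing for values outside the calibration range mirrors the decision not to report ASI for participants with a very low Onge component.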

You can see the percentages in a spreadsheet too.

Let's compare with the Dodecad ANI-ASI results. I have 22.5% ASI here, while it was 20.6% in the Dodecad analysis. Overall, it seems like my technique results in about 2 percentage points more ASI than Dodecad's, with a few exceptions, such as Razib, who jumps from 34.3% to 43.3% (averaging his parents, who are very close).