Monthly Archives: April 2011 - Page 5

Harappa Admixture Dendrogram 1-80

It's time to update the Harappa admixture dendrogram since the last time I created one we had only 40 participants.

Note that this is not a phylogeny. It just visualizes the closeness of your admixture results to others.

Also note that I am using Euclidean distance between admixture proportions which has problems and using complete linkage hierarchical clustering.

Henn Duplicates

As part of my effort to create one big reference dataset for my use, I have been going over all the datasets I have and make sure there's no duplicates or relatives or any other strange things that could cause issues with my analysis.

So I went back to the Henn et al dataset, which you can download from their website.

There are 107 samples common from the HapMap (IDs start with NA) and 131 from HGDP (IDs start with HGDP).

Henn et al has two PED files. One for the Khoisan data and one for all Africa 55k SNP set. Unfortunately they have 31 San duplicated in both these PED files with same individual IDs but different family IDs (SAN and SAN_SA). So they do not get automatically merged per Plink procedures. Just remove all the ones with SAN_SA FID since they have fewer SNPs. All the IBD info etc is in this spreadsheet.

Harappa Maps

Simranjit has generated new isopleth maps from the latest K=12 admixture run.

C1 South Asian:

C2 Balochistan/Caucasus:

C5 Southwest Asian:

C6 European:

Simranjit's also creating isoclusters now which classify different points/regions into clusters based on the admixture results Simran is using from here. Here's an isocluster map with 15 clusters inferred from the K=12 admixture results of reference populations and Harappa participants.

You can see the dendrogram showing the distance between the clusters on his blog.

Admixture K=12, HRP0071-HRP0080

Here are their ethnic backgrounds and the results spreadsheet. Also relevant are the reference I admixture results.

If you can't see the interactive bar chart above, here's a static image.

Since I don't have any Native American samples in my reference populations, the Brazilian participant (HRP0074) shows up as having Northeast Asian and Siberian.

PS. This was run using Admixture version 1.04.

Admixture K=4, HRP0071-HRP0080

Here are their ethnic backgrounds and the results spreadsheet. Also relevant are the reference I admixture results.

The interesting samples here are Gujarati (HRP0071), Bengali Brahmin (HRP0077) and Brazilian (HRP0074).

If you can't see the interactive bar chart above, here's a static image.

PS. This was run using Admixture version 1.04.

Ref2 South Asian + Harappa Admixture

Using the reference II dataset of 548 South Asians and 38 Harappa Project South Asians that I have been working on, I ran Admixture.

The optimum number of ancestral components was 5-6. So I used K=6. The components are highest among the following groups:

C1 Brahui, Makrani, Balochi C2 TN Dalit, North Kannadi
C3 Irula C4 Gujaratis
C5 Hazara C6 Kalash

I consider the Irulas, a Scheduled tribe from Tamil Nadu, to be problematic in a similar way to the Kalash except that the Irulas are well-scattered in their own space in the PCA plot.

Also, note that all the European, West Asian, etc is being represented by C1. Similarly, all the East Asian ancestry is being collected in C5.

The spreadsheet showing the admixture results is here. The first sheet shows the individual results for the project participants.

The 2nd sheet shows the average (and standard deviation) for the reference populations.

The 3rd sheet shows the average and standard deviation for each cluster computed by MClust from PCA.

The 4th sheet shows the average and standard deviation for each cluster computed by MClust from MDS.

Also, take a look at the admixture percentage standard deviations. You'll notice that those are generally lower for the clusters compared to the population groups.

Xing Redo

As part of my effort to create one big reference dataset for my use, I have been going over all the datasets I have and make sure there's no duplicates or relatives or any other strange things that could cause issues with my analysis.

So I went back to the Xing et al dataset, which you can download from their website.

I found no duplicates within the Xing et al data but there are 259 samples common from the HapMap. Since they are not assigned any family IDs they will pass through the ped files without being merged into HapMap samples. So you need to remove any samples with IDs starting with "NA".

Xing et al also contains 6 duplicates from HGDP with completely different IDs and two Xing samples look to be related to HGDP samples.

There are also three pairs with very high identity-by-descent values, which I calculated using Plink. You can see the samples with PI_HAT greater than 0.5 in this spreadsheet. PI_HAT is the proportion IBD estimated by plink. Notice also that all these pairs also have high IBS similarity (the DSC column), more than 85% similar.

The samples I have removed as a result of this (other than HapMap) are listed in this spreadsheet.

Supervised Continental Admixture

Since the version 1.1 of Admixture with supervised option came almost two months ago, I have been salivating over it.

My original use case for it is not possible (for now). I wanted to be able to assign a few of the K ancestral components to specific reference populations and let the other ancestral components fall where they may. But we can do supervised admixture only by assigning all K ancestral components.

So I decided to test this supervised option by mimicking the three continental percentages 23andme assigns you on their ancestry painting page. Mine are:

Europe 91.22%
Asia 8.69%
Africa 0.09%

You can get the extra precision (and false sense of accuracy) here.

Regarding the reference populations used for ancestry painting, 23andme says:

23andMe takes advantage of publicly available data for four populations studied extensively via the International HapMap project (hapmap.org). That project obtained the genotypes for 60 individuals of western European descent from Utah, 60 western African individuals from Nigeria, and 90 eastern Asian individuals, 45 from each of Japan and China. Because the two eastern Asian populations are geographically near one another and relatively similar at the genetic level, 23andMe combines these to form a single eastern Asian reference population.

So I dug up my reference admixture run at K=3 and found the same number of samples of these HapMap populations by looking for those samples which had the highest percentage in the respective component.

Then I combined these 210 samples from the HapMap with 74 Harappa Project participants (HRP0001 to HRP0079, excluding 5 who are related to others).

The results of the supervised admixture run are in a spreadsheet and also shown in a bar chart below.

Since I did run an unsupervised K=3 admixture analysis of the first Harappa batch with the whole reference I populations, you can compare these results to those.

Behar Redo

As part of my effort to create one big reference dataset for my use, I have been going over all the datasets I have and make sure there's no duplicates or relatives or any other strange things that could cause issues with my analysis.

So I went back to the Behar et al dataset, which you can download from the GEO Accession website.

I found three set of duplicates and two pairs with very high identity-by-descent values, which I calculated using Plink. You can see the samples with PI_HAT greater than 0.5 in this spreadsheet. PI_HAT is the proportion IBD estimated by plink. Notice also that all these pairs also have high IBS similarity (the DSC column), more than 83% similar.

The five samples I have removed as a result of this are listed in this spreadsheet.

HapMap Redo

As part of my effort to create one big reference dataset for my use, I have been going over all the datasets I have and make sure there's no duplicates or relatives or any other strange things that could cause issues with my analysis.

So I went back to HapMap, which you can download from their website. I am using HapMap 3 public release #3 from May 28, 2010.

I found one set of duplicates, NA21344 is identical to NA21737. And a whole bunch of pairs with high identity-by-descent values, which I calculated using Plink. You can see the samples with PI_HAT greater than 0.5 in this spreadsheet. PI_HAT is the proportion IBD estimated by plink. Notice also that all these pairs also have high IBS similarity (the DSC column), more than 85% similar in fact.

All the 41 samples I have removed as a result of this are listed in this spreadsheet.