Author Archives: Zack - Page 19

Henn Duplicates

As part of my effort to create one big reference dataset for my use, I have been going over all the datasets I have to make sure there are no duplicates, relatives, or other strange things that could cause issues with my analysis.

So I went back to the Henn et al dataset, which you can download from their website.

There are 107 samples in common with HapMap (IDs start with NA) and 131 in common with HGDP (IDs start with HGDP).

Henn et al has two PED files: one for the Khoisan data and one for the all-Africa 55k SNP set. Unfortunately, 31 San samples are duplicated across these two PED files, with the same individual IDs but different family IDs (SAN and SAN_SA), so Plink does not merge them automatically. Just remove all the ones with the SAN_SA FID, since they have fewer SNPs. All the IBD info etc. is in this spreadsheet.
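Here is a minimal sketch of that removal step, assuming a standard whitespace-delimited PED file whose first two columns are the family ID and the individual ID. The function names, the sample IDs, and the output file name are hypothetical. It builds the two-column FID/IID list that Plink's --remove option reads.

```python
def samples_to_remove(ped_lines, drop_fid="SAN_SA"):
    """Return (FID, IID) pairs whose family ID equals drop_fid."""
    pairs = []
    for line in ped_lines:
        fields = line.split(None, 2)  # only the first two columns are needed
        if len(fields) >= 2 and fields[0] == drop_fid:
            pairs.append((fields[0], fields[1]))
    return pairs


def write_remove_list(pairs, path):
    """Write one 'FID IID' pair per line, the layout Plink's --remove reads."""
    with open(path, "w") as out:
        for fid, iid in pairs:
            out.write(f"{fid} {iid}\n")


# Toy PED rows (made-up IDs, genotype columns truncated) to show the behaviour.
example = [
    "SAN SA001 0 0 1 -9 A A",
    "SAN_SA SA001 0 0 1 -9 A A",
    "SAN_SA SA002 0 0 2 -9 G G",
]
removed = samples_to_remove(example)
```

The list written by write_remove_list could then be handed to Plink along the lines of plink --file (prefix) --remove remove.txt --recode --out (new prefix), where the prefixes are whatever your files are called.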

Harappa Maps

Simranjit has generated new isopleth maps from the latest K=12 admixture run.

C1 South Asian:

C2 Balochistan/Caucasus:

C5 Southwest Asian:

C6 European:

Simranjit is also creating isoclusters now, which classify different points/regions into clusters based on the admixture results; Simran is using the results from here. Here's an isocluster map with 15 clusters inferred from the K=12 admixture results of reference populations and Harappa participants.

You can see the dendrogram showing the distance between the clusters on his blog.

Admixture K=12, HRP0071-HRP0080

Here are their ethnic backgrounds and the results spreadsheet. Also relevant are the reference I admixture results.

If you can't see the interactive bar chart above, here's a static image.

Since I don't have any Native American samples in my reference populations, the Brazilian participant (HRP0074) shows up as having Northeast Asian and Siberian ancestry.

PS. This was run using Admixture version 1.04.

Admixture K=4, HRP0071-HRP0080

Here are their ethnic backgrounds and the results spreadsheet. Also relevant are the reference I admixture results.

The interesting samples here are Gujarati (HRP0071), Bengali Brahmin (HRP0077) and Brazilian (HRP0074).

If you can't see the interactive bar chart above, here's a static image.

PS. This was run using Admixture version 1.04.

Ref2 South Asian + Harappa Admixture

Using the reference II dataset of 548 South Asians and 38 Harappa Project South Asians that I have been working on, I ran Admixture.

The optimum number of ancestral components was 5-6. So I used K=6. The components are highest among the following groups:

C1 Brahui, Makrani, Balochi
C2 TN Dalit, North Kannadi
C3 Irula
C4 Gujaratis
C5 Hazara
C6 Kalash

I consider the Irulas, a Scheduled tribe from Tamil Nadu, to be problematic in a similar way to the Kalash except that the Irulas are well-scattered in their own space in the PCA plot.

Also, note that all the European, West Asian, etc. ancestry is being represented by C1. Similarly, all the East Asian ancestry is being collected in C5.

The spreadsheet showing the admixture results is here. The first sheet shows the individual results for the project participants.

The 2nd sheet shows the average (and standard deviation) for the reference populations.

The 3rd sheet shows the average and standard deviation for each cluster computed by MClust from PCA.

The 4th sheet shows the average and standard deviation for each cluster computed by MClust from MDS.

Also, take a look at the admixture percentage standard deviations. You'll notice that those are generally lower for the clusters compared to the population groups.

Xing Redo

As part of my effort to create one big reference dataset for my use, I have been going over all the datasets I have to make sure there are no duplicates, relatives, or other strange things that could cause issues with my analysis.

So I went back to the Xing et al dataset, which you can download from their website.

I found no duplicates within the Xing et al data, but there are 259 samples in common with HapMap. Since they are not assigned any family IDs, they will pass through the ped files without being merged into the HapMap samples. So you need to remove any samples with IDs starting with "NA".

Xing et al also contains 6 duplicates from HGDP with completely different IDs and two Xing samples look to be related to HGDP samples.

There are also three pairs with very high identity-by-descent values, which I calculated using Plink. You can see the samples with PI_HAT greater than 0.5 in this spreadsheet. PI_HAT is the proportion IBD estimated by Plink. Notice that all these pairs also have high IBS similarity (the DSC column), more than 85% similar.
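The PI_HAT filter can be sketched roughly as follows, assuming the input is the whitespace-delimited text of a Plink .genome file with its header row. Plink itself labels the IBS distance column DST, and I am assuming that is the same quantity as the DSC column in the spreadsheet. The function name and the toy rows are made up for illustration.

```python
def high_ibd_pairs(genome_lines, threshold=0.5):
    """Return (IID1, IID2, PI_HAT, DST) for pairs with PI_HAT above threshold."""
    rows = iter(genome_lines)
    header = next(rows).split()
    col = {name: i for i, name in enumerate(header)}  # look columns up by name
    hits = []
    for line in rows:
        f = line.split()
        if not f:
            continue  # skip blank lines
        pi_hat = float(f[col["PI_HAT"]])
        if pi_hat > threshold:
            hits.append(
                (f[col["IID1"]], f[col["IID2"]], pi_hat, float(f[col["DST"]]))
            )
    return hits


# Toy .genome content (made-up samples and values, not from any real dataset).
toy_genome = [
    "FID1 IID1 FID2 IID2 RT EZ Z0 Z1 Z2 PI_HAT PHE DST PPC RATIO",
    "F1 A F2 B UN NA 0.0000 0.0000 1.0000 1.0000 -1 0.999 1.0000 NA",
    "F1 A F3 C UN NA 1.0000 0.0000 0.0000 0.0000 -1 0.720 0.5000 2.0",
]
pairs = high_ibd_pairs(toy_genome)
```

A pair with PI_HAT near 1 is a duplicate (or identical twin), while values around 0.5 point to first-degree relatives, which is why 0.5 is a convenient cutoff for both.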

The samples I have removed as a result of this (other than HapMap) are listed in this spreadsheet.

Supervised Continental Admixture

Ever since version 1.1 of Admixture, with its supervised option, came out almost two months ago, I have been salivating over it.

My original use case for it is not possible (for now). I wanted to be able to assign a few of the K ancestral components to specific reference populations and let the other ancestral components fall where they may. But we can do supervised admixture only by assigning all K ancestral components.

So I decided to test this supervised option by mimicking the three continental percentages 23andme assigns you on their ancestry painting page. Mine are:

Europe 91.22%
Asia 8.69%
Africa 0.09%

You can get the extra precision (and false sense of accuracy) here.

Regarding the reference populations used for ancestry painting, 23andme says:

23andMe takes advantage of publicly available data for four populations studied extensively via the International HapMap project (hapmap.org). That project obtained the genotypes for 60 individuals of western European descent from Utah, 60 western African individuals from Nigeria, and 90 eastern Asian individuals, 45 from each of Japan and China. Because the two eastern Asian populations are geographically near one another and relatively similar at the genetic level, 23andMe combines these to form a single eastern Asian reference population.

So I dug up my reference admixture run at K=3 and picked the same number of samples from these HapMap populations by taking those samples which had the highest percentage in the respective component.
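A rough sketch of that selection, assuming the K=3 results are available as (sample ID, list of proportions) tuples per HapMap population. The function names, sample IDs, and the "Europe" label are hypothetical. My understanding of the Admixture manual is that supervised mode reads a .pop file with one line per individual in .fam order, with "-" for individuals whose ancestry is to be estimated; treat that layout as an assumption and check it against the manual.

```python
def pick_reference(samples, component, n):
    """Return IDs of the n samples with the largest value in one component.

    samples: list of (sample_id, proportions) tuples, where proportions holds
    the per-component admixture fractions for that sample.
    """
    ranked = sorted(samples, key=lambda s: s[1][component], reverse=True)
    return [sample_id for sample_id, _ in ranked[:n]]


def pop_labels(fam_ids, reference):
    """One label per individual in .fam order; '-' marks samples to estimate."""
    return [reference.get(iid, "-") for iid in fam_ids]


# Toy K=3 proportions for three made-up samples from one population.
ceu = [
    ("NA_a", [0.99, 0.01, 0.00]),
    ("NA_b", [0.90, 0.05, 0.05]),
    ("NA_c", [0.97, 0.02, 0.01]),
]
chosen = pick_reference(ceu, component=0, n=2)
reference = {iid: "Europe" for iid in chosen}
labels = pop_labels(["NA_a", "HRP_x", "NA_c"], reference)
```

In the real run n would be 60, 60 and 90 for the three reference groups, and the project participants would all get the "-" label.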

Then I combined these 210 samples from the HapMap with 74 Harappa Project participants (HRP0001 to HRP0079, excluding 5 who are related to others).

The results of the supervised admixture run are in a spreadsheet and also shown in a bar chart below.

Since I did run an unsupervised K=3 admixture analysis of the first Harappa batch with the whole reference I populations, you can compare these results to those.

Behar Redo

As part of my effort to create one big reference dataset for my use, I have been going over all the datasets I have to make sure there are no duplicates, relatives, or other strange things that could cause issues with my analysis.

So I went back to the Behar et al dataset, which you can download from the GEO Accession website.

I found three sets of duplicates and two pairs with very high identity-by-descent values, which I calculated using Plink. You can see the samples with PI_HAT greater than 0.5 in this spreadsheet. PI_HAT is the proportion IBD estimated by Plink. Notice that all these pairs also have high IBS similarity (the DSC column), more than 83% similar.

The five samples I have removed as a result of this are listed in this spreadsheet.

HapMap Redo

As part of my effort to create one big reference dataset for my use, I have been going over all the datasets I have to make sure there are no duplicates, relatives, or other strange things that could cause issues with my analysis.

So I went back to HapMap, which you can download from their website. I am using HapMap 3 public release #3 from May 28, 2010.

I found one set of duplicates: NA21344 is identical to NA21737. I also found a whole bunch of pairs with high identity-by-descent values, which I calculated using Plink. You can see the samples with PI_HAT greater than 0.5 in this spreadsheet. PI_HAT is the proportion IBD estimated by Plink. Notice that all these pairs also have high IBS similarity (the DSC column), more than 85% similar in fact.

All 41 samples I have removed as a result of this are listed in this spreadsheet.

Ref 2 South Asians + Harappa MDS Clusters

Why do MDS clusters when we already did PCA-based clustering for this data?

You guys probably know about Dienekes' Clusters Galore approach. The way it works is that you vary the number of MDS dimensions retained, compute the number of clusters inferred (using Mclust) for each choice, and then use the number of MDS dimensions that gives you the maximum number of clusters.

This sounded a little unsatisfactory to me. So I ran an experiment. I computed 100 MDS dimensions for the samples in this dataset, which includes South Asians from Reference II as well as 38 Harappa participants. Then I kept 2,3,4,...,100 dimensions and ran NNClean (to get an initial noise/outlier estimate) and Mclust on them.

This first graph shows the number of outliers NNclean computed from 586 samples.

Things go crazy with NNclean when 64 or more MDS dimensions are retained since it considers most of the samples to be noise then.

Now let's look at the number of outliers identified after Mclust's clustering procedure.

This shows us that probably somewhere between 8 and 65 MDS dimensions might be useful to keep.

Finally, a plot of the number of clusters inferred by Mclust versus the number of MDS dimensions used.

There are two big jumps here to consider. One is around 12 MDS dimensions and the other after 52. So we are looking at an optimum number of MDS dimensions between 12 and 52. However, in that range, the number of clusters computed is fairly noisy between 18 and 26. The only pattern I can discern with some smoothed fitting is that we should likely be looking at somewhere between 20 and 30 MDS dimensions.

But why choose the maximum number of clusters (26 clusters when 24 MDS dimensions are kept)? That could be the result of noise too.
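To illustrate the worry, here is a small sketch comparing the raw maximum with the maximum of a smoothed curve. The counts are made-up toy numbers, not my actual results, and the centered moving average is only one simple stand-in for a smoothed fit, not necessarily the smoothing used for the plot.

```python
def moving_average(values, window=3):
    """Centered moving average; the window shrinks at the two ends."""
    half = window // 2
    smoothed = []
    for i in range(len(values)):
        chunk = values[max(0, i - half): i + half + 1]
        smoothed.append(sum(chunk) / len(chunk))
    return smoothed


def best_dimension(dims, counts, window=3):
    """Return (dimension with raw maximum, dimension with smoothed maximum)."""
    smoothed = moving_average(counts, window)
    raw_idx = max(range(len(counts)), key=counts.__getitem__)
    smooth_idx = max(range(len(smoothed)), key=smoothed.__getitem__)
    return dims[raw_idx], dims[smooth_idx]


# Hypothetical cluster counts for a handful of MDS dimensions (toy data only).
dims = [20, 21, 22, 23, 24, 25, 26]
counts = [22, 23, 22, 21, 26, 20, 19]
raw_best, smooth_best = best_dimension(dims, counts)
```

With these toy numbers the single spike wins on the raw counts, but the smoothed maximum lands on a neighbouring dimension, which is the kind of disagreement that makes me distrust a lone peak.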

Is there some other way to figure out how many MDS dimensions are significant for population structure? It turns out there is. Patterson, Price and Reich proposed Tracy-Widom statistics for Principal Component Analysis in their paper "Population Structure and Eigenanalysis". We also know that the MDS analysis we are performing is the classical metric MDS, which is in some ways equivalent to a PCA. Looking at the Tracy-Widom stats then, we see that about 25 eigenvalues are significant. Thus, keeping 24 MDS dimensions to maximize the number of clusters seems defensible.

Finally, here are the clustering results.