Monthly Archives: April 2011 - Page 4

Introducing Reference 3

Having collected 12 datasets, I have gone through them and finally selected the samples and SNPs I want to include in my new dataset, which I'll call Reference 3.

It has 3,889 individuals and 217,957 SNPs. Since this is a South Asia focused blog, there are a total of 558 South Asians in this reference set (compared to 398 in my Reference I).

You can see how many SNPs each dataset has in common with 23andme version 2, 23andme version 3 and FTDNA Family Finder (Illumina chip).

The following datasets had more than 280,000 SNPs common with all three platforms and hence were included in Reference 3:

  1. HapMap
  2. HGDP
  3. SGVP
  4. Behar
  5. Henn (Khoisan data)
  6. Rasmussen
  7. Austroasiatic
  8. Latino
  9. 1000genomes
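The overlap check behind this list boils down to set intersections over SNP IDs. Here is a minimal sketch in Python, assuming each dataset and each platform is available as a set of rsIDs. The function name, the toy IDs and the tiny cutoff are made up for illustration (the real cutoff was 280,000 SNPs).

```python
def shared_snps(dataset_snps, platform_snps):
    """Return the SNP IDs present in the dataset and in every platform."""
    common = set(dataset_snps)
    for snps in platform_snps.values():
        common &= set(snps)
    return common

# Toy rsIDs, invented for this example only.
platforms = {
    "23andme_v2": {"rs1", "rs2", "rs3", "rs4"},
    "23andme_v3": {"rs1", "rs2", "rs3", "rs5"},
    "ftdna_ff": {"rs1", "rs2", "rs4", "rs5"},
}
dataset = {"rs1", "rs2", "rs3", "rs9"}

common = shared_snps(dataset, platforms)
MIN_SHARED = 1  # stand-in for the 280,000 cutoff
include = len(common) > MIN_SHARED
print(sorted(common), include)
```

The same function also gives the pairwise counts (for example against 23andme v3 only) if you pass a dictionary with a single platform.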

Reich et al had about 100,000 SNPs in common with 23andme (v2 & v3 intersection) and 137,000 with FTDNA, but there was not a great overlap. Only 59,000 Reich et al SNPs were present in all three platforms. Since I really wanted Reich et al data in Reference 3, I included it but the SNPs used for FTDNA comparisons won't be the same as for the 23andme comparisons.

Of the datasets I could not include, I am most disappointed about the Pan-Asian dataset since it has a good coverage of South and Southeast Asia. Unfortunately, it has only 19,000 SNPs in common with 23andme v2 and 23,000 with 23andme v3. I am going to have to do some analyses with the Pan-Asian data but it just can't be included in my Reference 3.

I am also interested in doing some analysis with the Henn et al African data with about 52,000 SNPs for personal reasons.

Xing et al has about 71,000 SNPs in common with 23andme v3, so some good work could be done with that, though I'll have to use only 23andme version 3 participants.

The information about the populations included in Reference 3 is in a spreadsheet as usual.

Reich et al Duplicates

As part of my effort to create one big reference dataset for my use, I have been going over all the datasets I have to make sure there are no duplicates, relatives or any other strange things that could cause issues with my analysis.

So I went back to the Reich et al Indian dataset.

The dataset doesn't have any duplicate or likely relative samples itself. However, there are two Kharia samples that also appear in the Austroasiatic dataset. Since the Austroasiatic dataset has more SNPs in common with 23andme, I removed these two samples from Reich et al.

The IBS/IBD analysis and the sample IDs are in a spreadsheet as usual.

Harappa Genome Similarity MDS/Dendrogram

I computed the IBS similarity matrix for the Harappa participants HRP0001 to HRP0080 over 500,000 SNPs. This is exactly the same thing as the genome-wide gene comparison at 23andme.

Then, I converted the similarity matrix to a dissimilarity/distance matrix with the standard formula:

dij = sqrt(2 - 2 * sij)

where sij is the similarity between individuals i and j and dij is the distance/dissimilarity between the two.

Using the dissimilarity matrix, I classified all the participants (excluding close relatives) using hierarchical clustering with complete linkage. You can see the dendrogram below.
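For anyone who wants to try this at home, here is a minimal sketch of the two steps in plain Python: the conversion d_ij = sqrt(2 - 2 * s_ij) followed by a naive complete-linkage clustering. The 4x4 similarity matrix is invented toy data, and the function names are my own; for real data a library routine would be faster.

```python
import math

def similarity_to_distance(sim):
    """Apply d_ij = sqrt(2 - 2 * s_ij) to a square similarity matrix."""
    n = len(sim)
    return [[math.sqrt(max(0.0, 2.0 - 2.0 * sim[i][j])) for j in range(n)]
            for i in range(n)]

def complete_linkage(dist):
    """Naive agglomerative clustering.

    Returns the merges in order as (cluster_a, cluster_b, height),
    where clusters are tuples of row indices.
    """
    clusters = [(i,) for i in range(len(dist))]
    merges = []
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Complete linkage: distance between two clusters is the
                # largest distance between any pair of their members.
                d = max(dist[i][j] for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((clusters[a], clusters[b], d))
        merged = clusters[a] + clusters[b]
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)]
        clusters.append(merged)
    return merges

# Toy similarities: individuals 0 and 1 are close, as are 2 and 3.
toy_sim = [
    [1.0, 0.9, 0.7, 0.7],
    [0.9, 1.0, 0.7, 0.7],
    [0.7, 0.7, 1.0, 0.8],
    [0.7, 0.7, 0.8, 1.0],
]
merges = complete_linkage(similarity_to_distance(toy_sim))
for left, right, height in merges:
    print(left, right, round(height, 3))
```

The merge heights are what a dendrogram draws on its vertical axis.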

Then I used the same dissimilarity matrix to calculate 6-dimensional MDS. You can see the MDS plots below. The numbers on the plots are your Harappa IDs.

MDS Dimensions 1 & 2:

MDS Dimensions 3 & 4:

MDS Dimensions 5 & 6:

As you can see I (HRP0001) and my sister (HRP0035) are far away in the first four dimensions.

I'll let you guys speculate on what each dimension represents.

Now why create an MDS this way instead of directly using Plink's MDS functionality? Well, I needed to check if I could do it using only the similarity matrix because that would be really useful for something else. Tune in on my other blog for more later this week.
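To show that the similarity matrix alone is enough, here is a minimal sketch using classical (Torgerson) MDS with NumPy. This is one common MDS variant and an assumption on my part, not necessarily the exact procedure used for the plots above; the function name and the toy matrix are invented for illustration.

```python
import numpy as np

def classical_mds(sim, n_dims=2):
    """Classical MDS coordinates computed from a similarity matrix only."""
    sim = np.asarray(sim, dtype=float)
    # Squared distances, from d_ij = sqrt(2 - 2 * s_ij).
    d2 = np.clip(2.0 - 2.0 * sim, 0.0, None)
    n = d2.shape[0]
    centering = np.eye(n) - np.ones((n, n)) / n
    # Double-centred Gram matrix.
    gram = -0.5 * centering @ d2 @ centering
    eigvals, eigvecs = np.linalg.eigh(gram)
    # Keep the n_dims largest eigenvalues; clip small negatives to zero.
    order = np.argsort(eigvals)[::-1][:n_dims]
    scale = np.sqrt(np.clip(eigvals[order], 0.0, None))
    return eigvecs[:, order] * scale

# Toy similarities: individuals 0 and 1 are close, as are 2 and 3.
toy_sim = [
    [1.0, 0.9, 0.7, 0.7],
    [0.9, 1.0, 0.7, 0.7],
    [0.7, 0.7, 1.0, 0.8],
    [0.7, 0.7, 0.8, 1.0],
]
coords = classical_mds(toy_sim, n_dims=2)
print(coords)
```

In this toy case the first dimension separates the pair (0, 1) from the pair (2, 3). The sign of each dimension is arbitrary, so plots can come out mirrored between runs or tools.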

23andme Sale

23andme is having the DNA Day sale early.

Monday April 11, 2011, they are selling the kits for FREE with a $9/month 1 year commitment. So basically a total of $108. This is compared to $199 + $9/month (=$307) regular price. It's even less than the Christmas sale (assuming you cancel subscription after a year).

The sale starts at midnight Pacific time tonight (3am Eastern time or 7am GMT) and will end April 11 at 11:59pm Pacific time (2:59am Eastern time or 6:59am GMT April 12).

Spread the word and get people to participate in our Harappa Project too.

1000genomes

I got the 1000genomes data a couple of weeks ago. Trying to convert it from VCF to PED format using vcftools was a complete disaster. Then Dienekes sent me a conversion script which was more than a hundred times faster.

1000genomes will have 100 Assamese Ahom, 100 Kayastha from Calcutta, 100 Reddys from Hyderabad, 100 Maratha from Bombay and 100 Lahori Punjabis later this year. Right now, the new populations (other than HapMap) are British, Finns, Han Chinese South, Puerto Ricans, Colombians, and Spaniards.

I removed all 660 samples which were common with the HapMap data. Also, there were 31 pairs with high IBD values. The list of IBD/IBS values and the samples I removed can be seen in the spreadsheet.

Latino Dataset

Razib mentioned a Latino/Hispanic dataset to me a few days ago.

The relevant paper is "Genome-wide patterns of population structure and admixture among Hispanic/Latino populations" by Katarzyna Bryc, Christopher Velez, Tatiana Karafet, Andres Moreno-Estrada, Andy Reynolds, Adam Auton, Michael Hammer, Carlos D. Bustamante, and Harry Ostrer. And the data is available on the GEO Accession viewer.

The dataset has 100 samples from Colombia, Dominican Republic, Ecuador, and Puerto Rico.

It's in the same format and uses the same chip as Behar et al and Rasmussen et al. So it was really easy to download and convert it to Plink PED format.

Now what does a Hispanic dataset have to do with a South Asian genetics project? Nothing, for now. But I am collecting all genotyping data. And also I am hoping that we get more participants of South Asian origin from the Caribbean and other countries of the region where there has been a longer presence of South Asians. In that case, it would be interesting to compare them against other populations of the Americas.

In keeping with my effort to clean the data of any relatives, here are the IBD/IBS analysis results. The 2nd sheet shows the two samples I removed.

Pan-Asian Dataset Duplicates and Relatives

As part of my effort to create one big reference dataset for my use, I have been going over all the datasets I have to make sure there are no duplicates, relatives or any other strange things that could cause issues with my analysis.

Looking at the Pan-Asian dataset, I found 3 pairs of duplicate samples and 82 pairs that could be closely related. I have removed 64 samples from the dataset.

You can see the IBD results from plink as well as the list of sample IDs I removed in a spreadsheet.

UPDATE: I found 4 Melanesians in the Pan-Asian dataset who were the same as those in HGDP. So I have removed those as well and added them in the list in the spreadsheet.

Austroasiatic Dataset Duplicates

As part of my effort to create one big reference dataset for my use, I have been going over all the datasets I have to make sure there are no duplicates, relatives or any other strange things that could cause issues with my analysis.

So I went back to the Chaubey et al Austroasiatic Indians dataset.

The dataset doesn't have any duplicate or likely relative samples itself. Of course, I had to remove the 632 HGDP samples it had, but that's easy to do since they have the same IDs (starting with HGDP).

As their paper mentions, the dataset also has 19 Dravidian speaking Indian samples from Behar et al. Since I got the Behar et al data from the GEO site, I had different IDs for them than what they use in this dataset. So I had to figure out which samples were the same in both. The IBS/IBD results of duplicates as well as the list of sample IDs I removed are given in a spreadsheet.
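Matching samples that carry different IDs in two datasets comes down to looking for pairs with near-identical genotypes. Here is a minimal sketch in plain Python, assuming genotypes are coded as 0/1/2 copies of an allele (None for missing) over the same SNPs in the same order. The sample IDs, the genotypes and the 0.99 threshold are all made up for illustration; the actual analysis used Plink's IBS/IBD output.

```python
def ibs_similarity(g1, g2):
    """Proportion of alleles shared identical-by-state over SNPs typed in both."""
    shared, total = 0, 0
    for a, b in zip(g1, g2):
        if a is None or b is None:
            continue  # skip SNPs missing in either sample
        shared += 2 - abs(a - b)
        total += 2
    return shared / total if total else float("nan")

def find_duplicates(dataset_a, dataset_b, threshold=0.99):
    """Return (id_a, id_b) pairs whose IBS similarity reaches the threshold."""
    matches = []
    for id_a, g_a in dataset_a.items():
        for id_b, g_b in dataset_b.items():
            if ibs_similarity(g_a, g_b) >= threshold:
                matches.append((id_a, id_b))
    return matches

# Hypothetical IDs and toy genotypes over six SNPs.
geo_samples = {
    "GSM_x1": [0, 1, 2, 2, 1, 0],
    "GSM_x2": [2, 2, 0, 1, 0, 1],
}
other_samples = {
    "IND_a": [0, 1, 2, 2, 1, 0],
    "IND_b": [1, 1, 1, 0, 2, 2],
}
matches = find_duplicates(geo_samples, other_samples)
print(matches)
```

With real data the threshold needs some slack below 1.0, since genotyping errors mean even true duplicates rarely match at every SNP.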

Checking this out resolved an issue I had with Behar et al. Behar et al has 4 Paniya samples from South India. One of those four has admixture proportions similar to Indians but three seem very East Asian. I had always suspected that those three samples were mislabeled. Now the Austroasiatic dataset also has those four Paniya samples. However, only one of them is identical to the Behar et al one. The other three are different. I haven't checked yet which one of the Behar samples matches Austroasiatic, but my guess is that it is the more Indian admixture one. So I am keeping the other three Paniya samples from the Austroasiatic dataset and hoping that they are the correct ones.

Harappa Gene Similarity

I was looking at Simranjit's DNA Tribes results and I thought I could provide you guys a list of how similar (part of) your genome is to different reference populations in a somewhat similar way to DNA Tribes results.

Basically, I computed an IBS (identity by state) matrix for all Harappa participants from HRP0001 to HRP0080 and my Reference II samples (info). This is the same as the Genome-wide comparison feature at 23andme.

Then I took the median similarity percentage between you and a reference population group. I found that median worked better here than mean as the mean was affected a lot by some outlier samples in the reference data.
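Here is a minimal sketch of that per-population summary in plain Python, with a toy example of how a single outlier drags the mean but not the median. The sample IDs, population labels and similarity values are invented for illustration.

```python
from statistics import mean, median

def group_by_population(similarities, populations):
    """Group one participant's similarity values by reference population."""
    by_pop = {}
    for sample_id, value in similarities.items():
        by_pop.setdefault(populations[sample_id], []).append(value)
    return by_pop

# Similarity of one participant to each reference sample (toy values).
similarities = {"A1": 0.74, "A2": 0.75, "A3": 0.60, "B1": 0.72, "B2": 0.73}
populations = {"A1": "PopA", "A2": "PopA", "A3": "PopA", "B1": "PopB", "B2": "PopB"}

by_pop = group_by_population(similarities, populations)
medians = {pop: median(values) for pop, values in by_pop.items()}
means = {pop: mean(values) for pop, values in by_pop.items()}
print(medians)
print(means)
```

In this toy case PopA ranks above PopB by median, but the outlier A3 pulls PopA's mean below PopB's.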

Of course, since I am giving you a big discount compared to DNA Tribes, I am not doing a nice individual report. Instead, all you get is one spreadsheet including everyone. Click on your ID in the column headers to sort by your similarity to the different reference populations.

We see four outliers among the project participants who don't match any reference populations very well. One is HRP0074, a Brazilian, which is expected since I don't have any Native American populations. Then there are my sister (HRP0035) and me (HRP0001), which was already well known. Finally, there is HRP0044, a Kashmiri.

Do note that this analysis was done using about 20,000 SNPs.

Rasmussen Likely Relatives

As part of my effort to create one big reference dataset for my use, I have been going over all the datasets I have to make sure there are no duplicates, relatives or any other strange things that could cause issues with my analysis.

So I went back to the Rasmussen et al dataset, which you can download from here.

While there are no duplicates, 9 pairs of samples have high IBS values (85% similar or more) and seem to be related (Plink PI_HAT > 0.5). You can see the IBD results in a spreadsheet, along with the 8 samples I removed.
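Here is a minimal sketch of turning flagged pairs into a removal list, in plain Python. It assumes the rows of Plink's .genome output have already been parsed into dictionaries with the IID1, IID2 and PI_HAT columns. The greedy rule for choosing which member of a pair to drop is an assumption of mine, not necessarily how the samples above were picked, and the sample IDs are invented.

```python
from collections import Counter

def related_pairs(genome_rows, min_pi_hat=0.5):
    """Keep the sample pairs whose PI_HAT exceeds the cutoff."""
    return [(row["IID1"], row["IID2"]) for row in genome_rows
            if float(row["PI_HAT"]) > min_pi_hat]

def samples_to_remove(pairs):
    """Greedily pick samples so that no flagged pair keeps both members."""
    remaining = [tuple(pair) for pair in pairs]
    removed = []
    while remaining:
        counts = Counter(sample for pair in remaining for sample in pair)
        # Drop the sample involved in the most remaining pairs
        # (ties broken by the smallest ID so the result is deterministic).
        worst = min(counts, key=lambda sample: (-counts[sample], sample))
        removed.append(worst)
        remaining = [pair for pair in remaining if worst not in pair]
    return removed

# Toy rows with hypothetical sample IDs.
rows = [
    {"IID1": "S1", "IID2": "S2", "PI_HAT": "0.98"},
    {"IID1": "S2", "IID2": "S3", "PI_HAT": "0.52"},
    {"IID1": "S4", "IID2": "S5", "PI_HAT": "0.61"},
    {"IID1": "S6", "IID2": "S7", "PI_HAT": "0.12"},
]
removed = samples_to_remove(related_pairs(rows))
print(removed)
```

A sample that shows up in more than one flagged pair only needs to be removed once, which is how the number of removed samples can be smaller than the number of pairs.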