As per AV's comment, here are the individual results for Xing et al South Asians.
Xing Ref3 K=11 Admixture
Xing et al dataset is interesting because it has a number of South Asian populations:
- 25 Andhra Pradesh Brahmin
- 10 Andhra Pradesh Madiga
- 11 Andhra Pradesh Mala
- 22 Irula
- 25 Nepalese
- 25 Punjabi Arain
- 14 Tamil Nadu Brahmin
- 12 Tamil Nadu Dalit
Unfortunately, the dataset does not have a lot of common SNPs with 23andme, FTDNA and the other data I am using.
However, I did run a reference 3 admixture on the Xing data using about 30,000 SNPs. Since this is far fewer than the usual 118,000 SNPs, the noise levels are much higher.
Here is the spreadsheet with the Xing group averages for reference 3 admixture at K=11 ancestral components.
Xing to PED Conversion
Following mallu's request, here is the code I used to convert Xing et al data to Plink's PED format.
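    #!/bin/bash
    # Normalise line endings in the downloaded CSV files.
    dos2unix *.csv
    # The genotype header row holds the sample IDs; write one TFAM row per sample.
    head --lines=1 JHS_Genotype.csv > header.txt
    awk -F, '{for (i=2;i<=NF;i++) print "0",$i,"0","0","0", "0"}' header.txt > xing.tfam
    # Drop the header row, then sort both files on the first column so join can match them.
    sed '1d' JHS_Genotype.csv > genotype.csv
    sort -t',' -k 1b,1 genotype.csv > genotype_sorted.csv
    sort -t',' -k 1b,1 JHS_SNP.csv > snp_sorted.csv
    join -t',' -j 1 snp_sorted.csv genotype_sorted.csv > xing_compound.csv
    # Build the TPED: column 2 minus its first three characters, SNP ID, 0, column 3, then two alleles per sample.
    awk -F, '{printf("%s %s 0 %s ",substr($2,4),$1,$3); for (i=6;i<=NF;i++) printf("%s %s ",substr($i,1,1),substr($i,2,1)); printf("\n")}' xing_compound.csv > xing.tped
    # Convert to binary PED, reading N as the missing genotype and writing it as 0.
    plink --tfile xing --out xing --make-bed --missing-genotype N --output-missing-genotype 0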
I make no guarantees that it will work for you. I used it on my Ubuntu box, but I am sure it'll have trouble on Mac OS.
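To make the TPED step easier to follow, here is a toy run of the same awk command on two made-up SNPs and three made-up samples. The column layout (SNP ID, chromosome written as `chrN`, position, two further columns that the script skips, then one two-letter genotype per sample) is an assumption inferred from the script itself, not copied from the Xing et al files, and the `toy_` file names are hypothetical.

```shell
#!/bin/bash
# Toy stand-in for xing_compound.csv; every value below is invented.
# Assumed columns: SNP ID, chromosome ("chr1"), position, two columns the
# script skips, then one two-letter genotype string per sample.
cat > toy_compound.csv <<'EOF'
rs0001,chr1,1000,A,G,AA,AG,NN
rs0002,chr2,2000,C,T,CT,TT,CC
EOF

# Same awk logic as the conversion script: drop the first three characters
# of column 2, then split each genotype string into two allele columns.
awk -F, '{printf("%s %s 0 %s ",substr($2,4),$1,$3);
          for (i=6;i<=NF;i++) printf("%s %s ",substr($i,1,1),substr($i,2,1));
          printf("\n")}' toy_compound.csv > toy.tped

cat toy.tped
```

Each output row has four leading columns followed by two allele columns per sample. The `NN` genotype comes out as `N N`, which is why the plink call in the script needs `--missing-genotype N`.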
Xing Redo
As part of my effort to create one big reference dataset for my use, I have been going over all the datasets I have, making sure there are no duplicates, relatives or other oddities that could cause issues with my analysis.
So I went back to the Xing et al dataset, which you can download from their website.
I found no duplicates within the Xing et al data, but 259 samples are shared with HapMap. Since they are not assigned any family IDs, they will pass through the PED files without being merged into the HapMap samples. So you need to remove any samples with IDs starting with "NA".
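A minimal sketch of how such a removal list could be built, assuming the sample IDs sit in the second column of a TFAM-style file as produced by the conversion script. The file names and sample IDs here are invented for illustration.

```shell
#!/bin/bash
# Toy stand-in for xing.tfam (family ID, sample ID, father, mother, sex,
# phenotype); the sample IDs are invented.
cat > toy.tfam <<'EOF'
0 NA00001 0 0 0 0
0 XG0001 0 0 0 0
0 NA00002 0 0 0 0
0 XG0002 0 0 0 0
EOF

# Write family ID and sample ID for every sample whose ID starts with "NA".
awk '$2 ~ /^NA/ {print $1, $2}' toy.tfam > toy_hapmap_overlap.txt

cat toy_hapmap_overlap.txt
```

A two-column list like this is the format plink's `--remove` option expects, so on the real data something along the lines of `plink --bfile xing --remove hapmap_overlap.txt --make-bed --out xing_nohapmap` should drop those samples (the output name is hypothetical).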
Xing et al also contains 6 duplicates from HGDP with completely different IDs and two Xing samples look to be related to HGDP samples.
There are also three pairs with very high identity-by-descent values, which I calculated using Plink. You can see the samples with PI_HAT greater than 0.5 in this spreadsheet. PI_HAT is the proportion IBD estimated by Plink. Notice that all these pairs also have high IBS similarity (the DST column), more than 85% similar.
The samples I have removed as a result of this (other than HapMap) are listed in this spreadsheet.
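For anyone repeating the check, here is a sketch of pulling the high-PI_HAT pairs out of a plink `--genome` report with awk. The rows are invented, and the header follows the usual plink `.genome` layout; the script looks columns up by name, so it does not depend on their exact order.

```shell
#!/bin/bash
# Toy stand-in for a plink --genome report; all rows are invented.
cat > toy.genome <<'EOF'
FID1 IID1 FID2 IID2 RT EZ Z0 Z1 Z2 PI_HAT PHE DST PPC RATIO
0 S1 0 S2 UN NA 0.0000 0.0000 1.0000 1.0000 -1 1.000000 1.0000 NA
0 S1 0 S3 UN NA 0.9500 0.0500 0.0000 0.0250 -1 0.720000 0.4000 1.9500
0 S4 0 S5 UN NA 0.0100 0.9800 0.0100 0.5000 -1 0.860000 1.0000 NA
0 S6 0 S7 UN NA 0.2000 0.5000 0.3000 0.5500 -1 0.870000 1.0000 15.2000
EOF

# Read the column positions from the header row, then keep only the pairs
# whose PI_HAT is strictly greater than 0.5.
awk 'NR == 1 { for (i = 1; i <= NF; i++) col[$i] = i
               a = col["IID1"]; b = col["IID2"]; p = col["PI_HAT"]; d = col["DST"]
               next }
     $p > 0.5 { print $a, $b, $p, $d }' toy.genome > toy_related_pairs.txt

cat toy_related_pairs.txt
```

The S4/S5 pair sits exactly at 0.5 and is therefore not reported, matching the "greater than 0.5" cutoff. Real `.genome` files are padded with leading spaces, which awk's default field splitting ignores.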
One PED File to Rule Them All
I am interested in North African populations due to my own heritage. So when Razib alerted me that Henn et al had a paper out about a southern African origin of modern humans, and that their publicly available dataset included populations from all over Africa, I immediately downloaded it.
I have also been considering looking into the East Asian admixture in South Asians and Iranians in some detail to see where it originates from: Southeast Asia, Chinese/Japanese/Koreans, or the Turkic/Mongolian/Siberian populations of interior northeastern Asia. At a quick glance, Razib is correct:
The eastern Asian components are enriched among Bengalis, as you’d expect, but they’re found in different proportions among many individuals who hail from the northern fringe of South Asia more generally. It seems clear that the further west you go, the more likely the “eastern” element is going to be Turk, while the further east (and to some extent south) the more likely it is to be more southernly in provenance.
To do a better job though, it would be better to have more than the Yakut as an exemplar of the Siberian component, as I have done until now. Therefore, I downloaded the arctic populations dataset from Rasmussen et al.
Combining Henn et al and Rasmussen et al with my previous datasets (HapMap, HGDP, SGVP, Behar et al and Xing et al), I got 3,970 samples with a total of 1,716,031 SNPs represented, though requiring a 99% genotyping rate reduces that to about 27,000 SNPs.
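The sharp drop in SNP count is what a genotyping-rate filter does to a merged dataset: a SNP typed in only some of the source datasets is missing for everyone else, so it fails the cutoff. In plink this kind of filter is `--geno 0.01`, though whether that exact option was used here is an assumption. The toy sketch below applies the same idea with awk to an invented TPED file, treating `0` as the missing allele code.

```shell
#!/bin/bash
# Toy TPED (chromosome, SNP ID, genetic distance, position, then two
# alleles per sample); all values are invented and "0" marks a missing allele.
cat > toy_merged.tped <<'EOF'
1 rs0001 0 1000 A A A G G G A G
1 rs0002 0 2000 C C 0 0 C T T T
2 rs0003 0 3000 G G G T 0 0 0 0
EOF

# Per-SNP missing rate = missing genotypes / samples. A SNP is kept only if
# the rate does not exceed max_missing (0.01 for a 99% genotyping rate).
awk -v max_missing=0.01 '{
    samples = (NF - 4) / 2; missing = 0
    for (i = 5; i < NF; i += 2) if ($i == "0" && $(i + 1) == "0") missing++
    rate = missing / samples
    printf("%s %.2f %s\n", $2, rate, (rate <= max_missing ? "keep" : "drop"))
}' toy_merged.tped > toy_call_rate.txt

cat toy_call_rate.txt
```

With only four toy samples, a single missing genotype already puts a SNP at 25% missing, so only the fully typed SNP survives.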
I did not remove any populations or individuals except for any duplicates and non-founders.
Here's the information on the populations represented in this dataset.
Now I am on the lookout for more datasets that are public, have enough SNPs in common with this set and can easily be converted into the Plink PED format. So if you know of any, let me know. Maybe with your help I will end up with the biggest and most diverse dataset.
San and Pygmy
I have removed San and Pygmy groups from my reference datasets. That meant removing 39 samples from Reference Data I and 61 samples from Reference Data II.
The presence of those groups was creating some weird effects in admixture runs at K=8,9. Basically, the African ancestral components I was getting were not stable. Instead, they varied depending on which batches of Harappa participants were included. Also, at K=10,11, there were too many Africa-only ancestral components, forcing me to run even higher values of K.
Since we are not really interested in African diversity in this project and any African admixture among South Asians is most likely to be East, West or North African instead of Pygmy or San, the removal of these groups should not have any implications for the Harappa Ancestry Project.
To make sure that the above assertion is true, I'll re-run admixture analysis for K=2-5 and update later with the results.
Reference Dataset II
Combining my reference population with Xing et al data gets me 3,161 samples (3,222 before the update below), but with only about 23,000 SNPs after LD-pruning.
The good thing is that this dataset has 544 South Asian samples from 24 ethnic groups. So it'll be useful for some analyses despite the low number of SNPs. I'll try to run parallel analyses on my reference population and this dataset so we can compare the pros and cons of both.
UPDATE: I removed 61 pygmy and San samples.
Xing et al Data
The data for Xing et al's paper "Toward a more uniform sampling of human genetic diversity: a survey of worldwide populations by high-density genotyping" is available online.
This dataset consists of 850 individuals, but 259 of them overlap with the HapMap. Another 15 samples had to be removed because they were too similar to others. I also removed Native American samples. This leaves us with 529 samples.
Ethnic group | Count |
---|---|
Slovenian | 25 |
Punjabi Arain | 25 |
N. European | 25 |
Nepalese | 25 |
Kyrgyzstani | 25 |
Iban | 25 |
Buryat | 25 |
Bambaran | 25 |
Andhra Pradesh Brahmin | 25 |
Kurd | 24 |
Dogon | 24 |
Irula | 23 |
Thai | 22 |
Pygmy | 22 |
Urkarah | 18 |
Tamil Nadu Brahmin | 14 |
Hema | 14 |
Tongan | 13 |
Tamil Nadu Dalit | 13 |
Samoan | 13 |
!Kung | 13 |
Japanese | 13 |
Andhra Pradesh Mala | 11 |
Pedi | 10 |
Andhra Pradesh Madiga | 10 |
Alur | 10 |
Nguni | 9 |
Sotho/Tswana | 8 |
Vietnamese | 7 |
Stalskoe | 5 |
Chinese | 5 |
Khmer Cambodian | 3 |
This dataset is valuable because it contains several South Asian, Central Asian, Southeast Asian and Caucasus groups. However, it does not have a good SNP overlap with 23andme and the other datasets. It has only about 29,000 SNPs in common with 23andme v2 data. Combining HapMap, HGDP, SGVP, Behar et al and Xing et al with 23andme data leaves us with 25,000 SNPs. Because of that, I'll be using Xing et al data for only a few analyses.