For regular admixture analysis, I am using HapMap, HGDP, SGVP and Behar datasets with some samples removed as I wrote earlier.
For each of these datasets,
- I first filtered to keep only the SNPs present on the 23andMe v2 chip:
plink --bfile data --extract 23andmev2.snplist
- I also filtered for founders:
plink --bfile data --filter-founders
- And excluded SNPs with missing rates greater than 1%:
plink --bfile data --geno 0.01
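The three per-dataset filters above could also be combined into a single PLINK run per dataset. The following is only a sketch: the dataset prefixes (hapmap, hgdp, sgvp, behar) and the `_filtered` output suffix are hypothetical names, not the files actually used. It prints the commands rather than running them. PLINK applies filters in its own internal order rather than in flag order, so it is worth checking the log against the three separate runs.

```shell
#!/bin/sh
# Hypothetical dataset prefixes; replace with the real file names.
DATASETS="hapmap hgdp sgvp behar"
SNPLIST="23andmev2.snplist"

# Print one PLINK command that applies all three filters to a dataset
# and writes a new binary fileset named <prefix>_filtered.
filter_cmd() {
    printf 'plink --bfile %s --extract %s --filter-founders --geno 0.01 --make-bed --out %s_filtered\n' \
        "$1" "$SNPLIST" "$1"
}

# Dry run: show the command for every dataset.
for d in $DATASETS; do
    filter_cmd "$d"
done
```

Piping the script's output to `sh` would execute the printed commands once they look right.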
Then, I merged the datasets one by one. The reason for doing it one by one was that there were conflicts of strand orientation (forward or reverse) between the different datasets. If the merge operation gave an error, I had to flip those strands in one dataset and try the merge again.
plink --bfile data1 --bmerge data2.bed data2.bim data2.fam --make-bed
plink --bfile data2 --flip plink.missnp --make-bed --out data2flip
plink --bfile data1 --bmerge data2flip.bed data2flip.bim data2flip.fam --make-bed
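The merge, flip and retry cycle can be wrapped in a small shell function. This is a sketch under assumptions, not the exact procedure used here: the function name `merge_with_flip` is made up, it assumes PLINK exits with a non-zero status when the merge fails, and it assumes the list of mismatched SNPs is written as either `<out>.missnp` or `<out>-merge.missnp` depending on the PLINK version.

```shell
# merge_with_flip BASE OTHER OUT
# Try to merge OTHER into BASE. If the merge fails and PLINK left a
# list of mismatched SNPs, flip those SNPs in OTHER and retry once.
merge_with_flip() {
    base=$1; other=$2; out=$3

    # First attempt: succeed immediately if there are no strand conflicts.
    if plink --bfile "$base" --bmerge "$other.bed" "$other.bim" "$other.fam" \
             --make-bed --out "$out"; then
        return 0
    fi

    # Assumption: the failed merge wrote the problem SNPs to one of these files.
    missnp=
    for f in "$out.missnp" "$out-merge.missnp"; do
        if [ -f "$f" ]; then missnp=$f; fi
    done
    [ -n "$missnp" ] || return 1   # failed for some other reason

    # Flip the listed SNPs in the second dataset, then merge again.
    plink --bfile "$other" --flip "$missnp" --make-bed --out "${other}flip" || return 1
    plink --bfile "$base" --bmerge "${other}flip.bed" "${other}flip.bim" "${other}flip.fam" \
          --make-bed --out "$out"
}
```

A call such as `merge_with_flip data1 data2 merged12` would then stand in for the three commands above. If the second merge still fails, the remaining SNPs are probably not simple strand flips and need a manual look.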
Once all four datasets were merged, I processed the combined data file:
- Removed SNPs with a missing rate of more than 1% in the combined dataset:
plink --bfile data --geno 0.01
- Then I performed linkage-disequilibrium-based pruning using a window size of 50 SNPs, a step of 5 SNPs and an r^2 threshold of 0.3:
plink --bfile data --indep-pairwise 50 5 0.3
plink --bfile data --extract plink.prune.in --make-bed
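To make explicit how each post-merge step feeds the next, the commands can be written with `--out` prefixes. This is a sketch: the prefix `merged` and the `_geno` and `_pruned` suffixes are hypothetical names chosen for illustration. It prints the commands instead of running them.

```shell
#!/bin/sh
# Print the post-merge PLINK commands for a merged fileset PREFIX,
# chaining the outputs so each step reads the previous step's result.
postmerge_cmds() {
    prefix=$1
    # 1. Drop SNPs with more than 1% missing genotypes.
    printf 'plink --bfile %s --geno 0.01 --make-bed --out %s_geno\n' "$prefix" "$prefix"
    # 2. LD pruning: writes <prefix>_geno.prune.in and <prefix>_geno.prune.out.
    printf 'plink --bfile %s_geno --indep-pairwise 50 5 0.3 --out %s_geno\n' "$prefix" "$prefix"
    # 3. Keep only the SNPs that survived pruning.
    printf 'plink --bfile %s_geno --extract %s_geno.prune.in --make-bed --out %s_pruned\n' \
        "$prefix" "$prefix" "$prefix"
}

postmerge_cmds merged
```

Note that `--indep-pairwise` only produces the SNP lists; the fileset is actually reduced by the final `--extract` step.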
This gave me a reference population of 2,654 individuals (2,693 before the removal described in the update below), with each sample having about 186,000 SNPs. Of these 2,654 individuals, 398 are South Asians belonging to 16 ethnic groups.
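Counts like these can be checked directly from a PLINK binary fileset, since the .fam file has one line per individual and the .bim file has one line per SNP. A minimal sketch, where the function name `summarize` and the prefix passed to it are placeholders:

```shell
# summarize PREFIX
# Report the number of individuals and SNPs in PREFIX.fam / PREFIX.bim.
summarize() {
    prefix=$1
    n_ind=$(wc -l < "$prefix.fam")   # one line per individual
    n_snp=$(wc -l < "$prefix.bim")   # one line per SNP
    # Arithmetic expansion strips any padding that wc adds on some systems.
    echo "individuals: $((n_ind)) SNPs: $((n_snp))"
}
```

For example, `summarize data` reports the totals for data.fam and data.bim.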
Finally, it's time to start having some fun!
UPDATE: I removed 39 Pygmy and San samples because they were causing some trouble with the African ancestral components. Since we are not interested in detailed African ancestry, and African admixture among South Asians is not likely to be Pygmy or San, I decided it would be best to remove them.