For converting the HGDP data (from Stanford University) to Plink PED format, I used the following code.
#!/bin/bash dos2unix HGDP_FinalReport_Forward.txt dos2unix HGDP_Map.txt dos2unix SampleInformation.txt head --lines=1 HGDP_FinalReport_Forward.txt > header.txt awk '{for (i=1;i<=NF;i++) print "0",$i,"0","0"}' header.txt > hgdp_nosex.tfam sed '1d' HGDP_FinalReport_Forward.txt > HGDP_Data_NoHeader.txt sort -k 1b,1 HGDP_Data_NoHeader.txt > HGDP_Data_Sorted.txt sort -k 1b,1 HGDP_Map.txt > HGDP_Map_Sorted.txt join -j 1 HGDP_Map_Sorted.txt HGDP_Data_Sorted.txt > HGDP_compound.txt awk '{if ($2=="M") $2="MT";printf("%s %s 0 %s ",$2,$1,$3); for (i=4;i<=NF;i++) printf("%s %s ",substr($i,1,1),substr($i,2,1)); printf("\n")}' HGDP_compound.txt > hgdp.tped # Add sex info sed '1d' SampleInformation.txt > temp.txt sed '$d' temp.txt > SampleInfo_noheader.txt awk '{printf("HGDP%05d ",$1); if ($6=="m") print "1"; else if ($6=="f") print "2"; else print "0";}' SampleInfo_noheader.txt > Sample_sex.txt awk 'BEGIN { while ((getline < "Sample_sex.txt") > 0) f2array[$1] = $2} {if (f2array[$2]) print $1, $2, $3, $4, f2array[$2], "0" else print $2 "not listed in file2" > "unmatched" }' hgdp_nosex.tfam > hgdp.tfam # convert to ped plink --tfile hgdp --out hgdp --make-bed --missing-genotype - --output-missing-genotype 0 # Filter to 952 (or 940) people using the SampleInformation.txt file awk '{if ($16=="1") printf("0 HGDP%05d\n",$1);}' SampleInfo_noheader.txt > Sample_keep.txt plink --bfile hgdp --keep Sample_keep.txt --make-bed --out hgdp940 |
This (and more) could easily be done in Perl. You can look at the SPSmart code for some help along these lines.
Thank you for sharing this, Zack 🙂 .
If you find the time, could you also do one for the Behar et al. dataset? Since that one is tad more difficult to convert.
I will, though I had some hints earlier.
It worked for me. Thanks, no need for it anymore.
Please also help to convert PAn Asian data in to plink ped format.
thanks
I have converted it. But it has only 18,000 SNPs in common with 23andme. So I won't use it for Admixture. Let's see if I can get some value out of it for PCA.
Can you please upload the script for that somewhere?
Thanks a lot
Pardon this ignorant question, but what language is this? I tried to run this shell script in terminal on my mac and it didn't recognize the first 4 or so commands... Also, where did you get a SampleInformation.txt file? The only comparable file I saw when I downloaded the Stanford HGDP data was SampleList.txt...but that's just a list of IDs...
If you could reply I would really appreciate it!
I can't guarantee it will work on a Mac. The dos2unix utility (which you are likely missing) just changes line endings from DOS to Unix format so later parsing doesn't run into any problems. You can easily change the files to Mac endings if needed.
The sample list file can be downloaded here.
Thanks for the quick reply! I ran your program from the UNIX cluster at my school rather than my desktop and it worked just fine. Also that file was key. MANY thanks! =)
Please tell me how to download hgdp imputed snps data.
I know this post is from last year but if I can speak for those noobs trying to catch up, MANY THANKS ZACK!!!
Thank you for sharing your script Zack. I tried it but it blocks at the conversion to bed (i changed: SampleInformation.txt --> HGDP_SampleList.txt). Thanks
everything works