HGDP to PED Conversion

Posted by Zack on February 5, 2011

For converting the HGDP data (from Stanford University) to Plink PED format, I used the following code.

#!/bin/bash
dos2unix HGDP_FinalReport_Forward.txt
dos2unix HGDP_Map.txt
dos2unix SampleInformation.txt
head --lines=1 HGDP_FinalReport_Forward.txt > header.txt
awk '{for (i=1;i<=NF;i++) print "0",$i,"0","0"}' header.txt > hgdp_nosex.tfam
sed '1d' HGDP_FinalReport_Forward.txt > HGDP_Data_NoHeader.txt
sort -k 1b,1 HGDP_Data_NoHeader.txt > HGDP_Data_Sorted.txt
sort -k 1b,1 HGDP_Map.txt > HGDP_Map_Sorted.txt
join -j 1 HGDP_Map_Sorted.txt HGDP_Data_Sorted.txt > HGDP_compound.txt
 
awk '{if ($2=="M") $2="MT";printf("%s %s 0 %s ",$2,$1,$3);
    for (i=4;i<=NF;i++)
        printf("%s %s ",substr($i,1,1),substr($i,2,1));
    printf("\n")}' HGDP_compound.txt > hgdp.tped
 
# Add sex info
sed '1d' SampleInformation.txt > temp.txt
sed '$d' temp.txt > SampleInfo_noheader.txt
awk '{printf("HGDP%05d ",$1);
    if ($6=="m") print "1";
    else if ($6=="f") print "2";
    else print "0";}' SampleInfo_noheader.txt > Sample_sex.txt
awk 'BEGIN {
	while ((getline < "Sample_sex.txt") > 0)
		f2array[$1] = $2}
	{if (f2array[$2])
		print $1, $2, $3, $4, f2array[$2], "0"
	else
		print $2 "not listed in file2" > "unmatched"
	}' hgdp_nosex.tfam > hgdp.tfam
 
# convert to ped
plink --tfile hgdp --out hgdp --make-bed --missing-genotype - --output-missing-genotype 0
 
# Filter to 952 (or 940) people using the SampleInformation.txt file
awk '{if ($16=="1") printf("0 HGDP%05d\n",$1);}' SampleInfo_noheader.txt > Sample_keep.txt
plink --bfile hgdp --keep Sample_keep.txt --make-bed --out hgdp940

This (and more) could easily be done in Perl. You can look at the SPSmart code for some help along these lines.

Codeconversion, hgdp, plink

← Changes due to San/Pygmy Removal

Latest on Participants →

13 Comments.

Ez February 6, 2011 at 12:09 pm

Thank you for sharing this, Zack 🙂 .

If you find the time, could you also do one for the Behar et al. dataset? Since that one is tad more difficult to convert.
- Zack February 6, 2011 at 1:34 pm
  
  I will, though I had some hints earlier.
  - Ez April 14, 2011 at 12:56 pm
    
    It worked for me. Thanks, no need for it anymore.
mallu April 8, 2011 at 8:47 am

Please also help to convert PAn Asian data in to plink ped format.

thanks
- Zack April 8, 2011 at 9:11 am
  
  I have converted it. But it has only 18,000 SNPs in common with 23andme. So I won't use it for Admixture. Let's see if I can get some value out of it for PCA.
  - mallu April 14, 2011 at 11:54 am
    
    Can you please upload the script for that somewhere?
    
    Thanks a lot
Eden April 30, 2012 at 4:14 pm

Pardon this ignorant question, but what language is this? I tried to run this shell script in terminal on my mac and it didn't recognize the first 4 or so commands... Also, where did you get a SampleInformation.txt file? The only comparable file I saw when I downloaded the Stanford HGDP data was SampleList.txt...but that's just a list of IDs...

If you could reply I would really appreciate it!
- Zack April 30, 2012 at 4:58 pm
  
  I can't guarantee it will work on a Mac. The dos2unix utility (which you are likely missing) just changes line endings from DOS to Unix format so later parsing doesn't run into any problems. You can easily change the files to Mac endings if needed.
  
  The sample list file can be downloaded here.
Eden April 30, 2012 at 5:09 pm

Thanks for the quick reply! I ran your program from the UNIX cluster at my school rather than my desktop and it worked just fine. Also that file was key. MANY thanks! =)
harpreet kaur June 18, 2013 at 4:22 am

Please tell me how to download hgdp imputed snps data.
manglalayag August 9, 2013 at 6:14 pm

I know this post is from last year but if I can speak for those noobs trying to catch up, MANY THANKS ZACK!!!
KhaledLas November 19, 2013 at 8:30 am

Thank you for sharing your script Zack. I tried it but it blocks at the conversion to bed (i changed: SampleInformation.txt --> HGDP_SampleList.txt). Thanks
- KhaledLas November 19, 2013 at 9:36 am
  
  everything works

Harappa Ancestry Project

Genetics and South Asia

HGDP to PED Conversion

Related

13 Comments.

Contact

My Sites

Data

Affiliate DNA Tests

Categories

Archives

Recent Comments

Blogroll

Harappa Ancestry Project

Genetics and South Asia

HGDP to PED Conversion

Share this:

Related

13 Comments.

Contact

My Sites

Data

Affiliate DNA Tests

Categories

Tags

Archives

Recent Comments

Blogroll