Someone asked about how to convert a 23andme raw data file to the Plink format. I threw together a very simple Unix script to do that and I am sharing it here:
#!/bin/bash if test -z "$1" then echo "23andme raw data filename not supplied as argument." exit 0 fi echo "Family ID: " read fid echo "Individual ID: " read id echo "Paternal ID: " read pid echo "Maternal ID: " read mid echo "Sex (m/f/u): " read sexchr if [[ $sexchr == m* ]] then sex=1 elif [[ $sexchr == f* ]] then sex=2 else sex=0 fi pheno=0 echo "$fid $id $pid $mid $sex $pheno" > $id.tfam dos2unix $1 sed '/^\#/d' $1 > $id.nocomment awk '{ if (length($4)==1) print $2,$1,"0",$3,substr($4,1,1),substr($4,1,1); else print $2,$1,"0",$3,substr($4,1,1),substr($4,2,1) }' $id.nocomment > $id.tped plink --tfile $id --out $id --make-bed --missing-genotype - --output-missing-genotype 0 |
Well that's it! You can easily create a Perl script to do the same but this was faster for me.
This script creates three files: *.bed, *.bim and *.fam, which are the binary format files for Plink. You can then use Plink to merge multiple files, filter SNPs or individuals and do other processing.
UPDATE: A Perl script to do the same:
#!/usr/bin/perl -w $numArgs = $#ARGV + 1; if ($numArgs < 1) { print "23andme raw data filename not provided.\n"; exit 0; } $file = $ARGV[0]; print "Family ID: "; $fid = <STDIN>; chomp $fid; print "Individual ID: "; $id = <STDIN>; chomp $id; print "Paternal ID: "; $pid = <STDIN>; chomp $pid; print "Maternal ID: "; $mid = <STDIN>; chomp $mid; print "Sex (m/f/u): "; $sexchr = <STDIN>; if (lc(substr($sexchr,0,1)) eq "m") { $sex = 1; } elsif (lc(substr($sexchr,0,1)) eq "f") { $sex = 2; } else { $sex = 0; } $pheno = 0; $tfamname = ">" . $id . ".tfam"; open(TFAM, $tfamname); print TFAM "$fid $id $pid $mid $sex $pheno"; close TFAM; open(INFILE,$file); open(TPED,">" . $id . ".tped"); while (<INFILE>) { next if /#.*/; chomp; my($rsid,$chr,$pos,$geno) = split(/\s/); if (length($geno)==1) { $geno1 = $geno; $geno2 = $geno; } else { $geno1 = substr($geno,0,1); $geno2 = substr($geno,1,1); } print TPED "$chr $rsid 0 $pos $geno1 $geno2\n"; } close TPED; close INFILE; |
That should work on Linux, Microsoft Windows and Mac OS X.
Recent Comments