Someone asked about how to convert a 23andme raw data file to the Plink format. I threw together a very simple Unix script to do that and I am sharing it here:
#!/bin/bash
if test -z "$1"
then
echo "23andme raw data filename not supplied as argument."
exit 0
fi
echo "Family ID: "
read fid
echo "Individual ID: "
read id
echo "Paternal ID: "
read pid
echo "Maternal ID: "
read mid
echo "Sex (m/f/u): "
read sexchr
if [[ $sexchr == m* ]]
then
sex=1
elif [[ $sexchr == f* ]]
then
sex=2
else
sex=0
fi
pheno=0
echo "$fid $id $pid $mid $sex $pheno" > $id.tfam
dos2unix $1
sed '/^\#/d' $1 > $id.nocomment
awk '{ if (length($4)==1) print $2,$1,"0",$3,substr($4,1,1),substr($4,1,1); else
print $2,$1,"0",$3,substr($4,1,1),substr($4,2,1) }' $id.nocomment > $id.tped
plink --tfile $id --out $id --make-bed --missing-genotype - --output-missing-genotype 0 |
#!/bin/bash
if test -z "$1"
then
echo "23andme raw data filename not supplied as argument."
exit 0
fi
echo "Family ID: "
read fid
echo "Individual ID: "
read id
echo "Paternal ID: "
read pid
echo "Maternal ID: "
read mid
echo "Sex (m/f/u): "
read sexchr
if [[ $sexchr == m* ]]
then
sex=1
elif [[ $sexchr == f* ]]
then
sex=2
else
sex=0
fi
pheno=0
echo "$fid $id $pid $mid $sex $pheno" > $id.tfam
dos2unix $1
sed '/^\#/d' $1 > $id.nocomment
awk '{ if (length($4)==1) print $2,$1,"0",$3,substr($4,1,1),substr($4,1,1); else
print $2,$1,"0",$3,substr($4,1,1),substr($4,2,1) }' $id.nocomment > $id.tped
plink --tfile $id --out $id --make-bed --missing-genotype - --output-missing-genotype 0
Well that's it! You can easily create a Perl script to do the same but this was faster for me.
This script creates three files: *.bed, *.bim and *.fam, which are the binary format files for Plink. You can then use Plink to merge multiple files, filter SNPs or individuals and do other processing.
UPDATE: A Perl script to do the same:
#!/usr/bin/perl -w
$numArgs = $#ARGV + 1;
if ($numArgs < 1) {
print "23andme raw data filename not provided.\n";
exit 0;
}
$file = $ARGV[0];
print "Family ID: ";
$fid = <STDIN>;
chomp $fid;
print "Individual ID: ";
$id = <STDIN>;
chomp $id;
print "Paternal ID: ";
$pid = <STDIN>;
chomp $pid;
print "Maternal ID: ";
$mid = <STDIN>;
chomp $mid;
print "Sex (m/f/u): ";
$sexchr = <STDIN>;
if (lc(substr($sexchr,0,1)) eq "m") {
$sex = 1; }
elsif (lc(substr($sexchr,0,1)) eq "f") {
$sex = 2; }
else {
$sex = 0; }
$pheno = 0;
$tfamname = ">" . $id . ".tfam";
open(TFAM, $tfamname);
print TFAM "$fid $id $pid $mid $sex $pheno";
close TFAM;
open(INFILE,$file);
open(TPED,">" . $id . ".tped");
while (<INFILE>) {
next if /#.*/;
chomp;
my($rsid,$chr,$pos,$geno) = split(/\s/);
if (length($geno)==1) {
$geno1 = $geno;
$geno2 = $geno;
}
else {
$geno1 = substr($geno,0,1);
$geno2 = substr($geno,1,1);
}
print TPED "$chr $rsid 0 $pos $geno1 $geno2\n";
}
close TPED;
close INFILE; |
#!/usr/bin/perl -w
$numArgs = $#ARGV + 1;
if ($numArgs < 1) {
print "23andme raw data filename not provided.\n";
exit 0;
}
$file = $ARGV[0];
print "Family ID: ";
$fid = <STDIN>;
chomp $fid;
print "Individual ID: ";
$id = <STDIN>;
chomp $id;
print "Paternal ID: ";
$pid = <STDIN>;
chomp $pid;
print "Maternal ID: ";
$mid = <STDIN>;
chomp $mid;
print "Sex (m/f/u): ";
$sexchr = <STDIN>;
if (lc(substr($sexchr,0,1)) eq "m") {
$sex = 1; }
elsif (lc(substr($sexchr,0,1)) eq "f") {
$sex = 2; }
else {
$sex = 0; }
$pheno = 0;
$tfamname = ">" . $id . ".tfam";
open(TFAM, $tfamname);
print TFAM "$fid $id $pid $mid $sex $pheno";
close TFAM;
open(INFILE,$file);
open(TPED,">" . $id . ".tped");
while (<INFILE>) {
next if /#.*/;
chomp;
my($rsid,$chr,$pos,$geno) = split(/\s/);
if (length($geno)==1) {
$geno1 = $geno;
$geno2 = $geno;
}
else {
$geno1 = substr($geno,0,1);
$geno2 = substr($geno,1,1);
}
print TPED "$chr $rsid 0 $pos $geno1 $geno2\n";
}
close TPED;
close INFILE;
That should work on Linux, Microsoft Windows and Mac OS X.
Recent Comments