Someone asked about how to convert a 23andme raw data file to the Plink format. I threw together a very simple Unix script to do that and I am sharing it here:
#!/bin/bash if test -z "$1" then echo "23andme raw data filename not supplied as argument." exit 0 fi echo "Family ID: " read fid echo "Individual ID: " read id echo "Paternal ID: " read pid echo "Maternal ID: " read mid echo "Sex (m/f/u): " read sexchr if [[ $sexchr == m* ]] then sex=1 elif [[ $sexchr == f* ]] then sex=2 else sex=0 fi pheno=0 echo "$fid $id $pid $mid $sex $pheno" > $id.tfam dos2unix $1 sed '/^\#/d' $1 > $id.nocomment awk '{ if (length($4)==1) print $2,$1,"0",$3,substr($4,1,1),substr($4,1,1); else print $2,$1,"0",$3,substr($4,1,1),substr($4,2,1) }' $id.nocomment > $id.tped plink --tfile $id --out $id --make-bed --missing-genotype - --output-missing-genotype 0 |
Well that's it! You can easily create a Perl script to do the same but this was faster for me.
This script creates three files: *.bed, *.bim and *.fam, which are the binary format files for Plink. You can then use Plink to merge multiple files, filter SNPs or individuals and do other processing.
UPDATE: A Perl script to do the same:
#!/usr/bin/perl -w $numArgs = $#ARGV + 1; if ($numArgs < 1) { print "23andme raw data filename not provided.\n"; exit 0; } $file = $ARGV[0]; print "Family ID: "; $fid = <STDIN>; chomp $fid; print "Individual ID: "; $id = <STDIN>; chomp $id; print "Paternal ID: "; $pid = <STDIN>; chomp $pid; print "Maternal ID: "; $mid = <STDIN>; chomp $mid; print "Sex (m/f/u): "; $sexchr = <STDIN>; if (lc(substr($sexchr,0,1)) eq "m") { $sex = 1; } elsif (lc(substr($sexchr,0,1)) eq "f") { $sex = 2; } else { $sex = 0; } $pheno = 0; $tfamname = ">" . $id . ".tfam"; open(TFAM, $tfamname); print TFAM "$fid $id $pid $mid $sex $pheno"; close TFAM; open(INFILE,$file); open(TPED,">" . $id . ".tped"); while (<INFILE>) { next if /#.*/; chomp; my($rsid,$chr,$pos,$geno) = split(/\s/); if (length($geno)==1) { $geno1 = $geno; $geno2 = $geno; } else { $geno1 = substr($geno,0,1); $geno2 = substr($geno,1,1); } print TPED "$chr $rsid 0 $pos $geno1 $geno2\n"; } close TPED; close INFILE; |
That should work on Linux, Microsoft Windows and Mac OS X.
Thanks for posting this! However, when I use my 23andMe file plink complains about my X chromosome which has only 5 columns. It looks like:
X rs5939319 0 2710157 A
Any ideas on how to fix this?
Which OS are you running it on?
OS X 10.6
Aaron: The different line ending characters of Unix, Mac OS and DOS/Windows trip up length($4) in awk.
Once my kid's asleep, I can quickly throw together a Perl script which shouldn't have this problem.
Thanks!
i have a perl script which filters X, Y & mtDNA. why u want dat anyway?
Hi,
I'm getting the following error when I run admixture on my raw data pre-processed with your format-conversion code and plink:
'Error: detected that all genotypes are missing for a SNP locus.
Please apply quality-control filters to remove such loci.'
I would like to try filtering my data following the instructions you've given in your post on Jan 29, but could you tell me where I can get the SNP list file for 23andMe?
Thanks!
PS: I'm not a geneticist, so I have no clue what most of the technical terms mean!
Which of the two scripts are you using, the first or the Perl one?
Is that an error from Admixture or plink? Can you show me plink's output?
Hi,
I used the Perl script for this run. The given error is from admixture.
This is how I tried to do the run:
1. Ran the Perl script to create file 'b.tped'.
2. Ran plink to create 'b.bed':
plink --tfile b --out b --make-bed --missing-genotype - --output-missing-genotype 0
Output:
...
Web-based version check ( --noweb to skip )
Recent cached web-check found...Problem connecting to web
Writing this text to log file [ b.log ]
Analysis started: Sun Feb 27 19:06:16 2011
Options in effect:
--tfile b
--out b
--make-bed
--missing-genotype -
--output-missing-genotype 0
Reading pedigree information from [ b.tfam ]
1 individuals read from [ b.tfam ]
0 individuals with nonmissing phenotypes
Assuming a disease phenotype (1=unaff, 2=aff, 0=miss)
Missing phenotype value is also -9
0 cases, 0 controls and 1 missing
1 males, 0 females, and 0 of unspecified sex
966977 (of 966977) markers to be included from [ b.tped ]
Before frequency and genotyping pruning, there are 966977 SNPs
0 founders and 1 non-founders found
168 heterozygous haploid genotypes; set to missing
Writing list of heterozygous haploid genotypes to [ b.hh ]
966977 SNPs with no founder genotypes observed
Warning, MAF set to 0 for these SNPs (see --nonfounders)
Writing list of these SNPs to [ b.nof ]
Total genotyping rate in remaining individuals is 0.994775
0 SNPs failed missingness test ( GENO > 1 )
0 SNPs failed frequency test ( MAF < 0 )
After frequency and genotyping pruning, there are 966977 SNPs
After filtering, 0 cases, 0 controls and 1 missing
After filtering, 1 males, 0 females, and 0 of unspecified sex
Writing pedigree information to [ b.fam ]
Writing map (extended format) information to [ b.bim ]
Writing genotype bitfile to [ b.bed ]
Using (default) SNP-major mode
Analysis finished: Sun Feb 27 19:06:28 2011
3. Ran admixture:
admixture b.bed 1
...
Random seed: 43
Point estimation method: Block relaxation algorithm
Convergence acceleration algorithm: QuasiNewton, 3 secant conditions
Point estimation will terminate when objective function delta < 0.0001
Estimation of standard errors disabled; will compute point estimates only.
Error: detected that all genotypes are missing for a SNP locus.
Please apply quality-control filters to remove such loci.
Thanks!
Admixture needs a file with multiple individuals, not a single individual.
Also K is the number of ancestral populations Admixture should find, so setting it to 1 is not useful at all.
If you want an an admixture analysis, have you sent me your data?
Okay, I'll try out your suggestions. Thanks!
I haven't sent you my data yet - I was thinking of playing around with admixture, figure out how it works, etc. But I might, in the near future. 🙂
Hi Zack, I know this is an old post but I've ran out of luck and need help. I'm trying to merge my SNPs to the Pan-Asian data. Basically:
1. Converted 23nMe to bed, bim, bam
2. Converted Pan-Asian to bed, bim, bam
3. -extracted the Pan-Asian SNP
4. Filtered my 23nMe using Pan-Asian.snplist
5. This is where the problem begins. I tried to merged 23nMe to Pan-Asian.
==> plink --bfile PASNP --bmerge 23nMeFil.bed 23nMeFil.bim 23nMeFil.fam --make-bed --out PAnMe
Error comes outlike this:
*********************
Found 479 SNPs that do not match in terms of allele codes
Might include strand flips, although flipped A/T and C/G SNPs will be undetected)
Writing problem SNPs to [ PAnGlenn.missnp ]
479 markers with 3+ alleles present.
*********************
6. I tried to --flip option
==> plink --bfile 23nMe --flip PAnMe-merge.missnp --make-bed --out 23nMeFlip
Tried --bmerge again and the same exact error.
Tried --exclude option and the same exact error.
Any help would be greatly appreciated. Also, thank you very much for all your blogs. It's a gold mine!
It's working now. I tried to filter the PASNP data before combining both and it worked.