Author Archives: Zack - Page 28

23andme Conversion to PED

Someone asked about how to convert a 23andme raw data file to the Plink format. I threw together a very simple Unix script to do that and I am sharing it here:

#!/bin/bash
if test -z "$1"
then
	echo "23andme raw data filename not supplied as argument."
	exit 0
fi
echo "Family ID: "
read fid
echo "Individual ID: "
read id
echo "Paternal ID: "
read pid
echo "Maternal ID: "
read mid
echo "Sex (m/f/u): "
read sexchr
if [[ $sexchr == m* ]]
then
	sex=1
elif [[ $sexchr == f* ]]
then
	sex=2
else
	sex=0
fi
pheno=0
echo "$fid $id $pid $mid $sex $pheno" > $id.tfam
dos2unix $1
sed '/^\#/d' $1 > $id.nocomment
awk '{ if (length($4)==1) print $2,$1,"0",$3,substr($4,1,1),substr($4,1,1); else
    print $2,$1,"0",$3,substr($4,1,1),substr($4,2,1) }' $id.nocomment > $id.tped
plink --tfile $id --out $id --make-bed --missing-genotype - --output-missing-genotype 0

Well that's it! You can easily create a Perl script to do the same but this was faster for me.

This script creates three files: *.bed, *.bim and *.fam, which are the binary format files for Plink. You can then use Plink to merge multiple files, filter SNPs or individuals and do other processing.

UPDATE: A Perl script to do the same:

#!/usr/bin/perl -w
 
$numArgs = $#ARGV + 1;
if ($numArgs < 1) {
        print "23andme raw data filename not provided.\n";
        exit 0;
}
 
$file = $ARGV[0];
 
print "Family ID: ";
$fid = <STDIN>;
chomp $fid;
print "Individual ID: ";
$id = <STDIN>;
chomp $id;
print "Paternal ID: ";
$pid = <STDIN>;
chomp $pid;
print "Maternal ID: ";
$mid = <STDIN>;
chomp $mid;
print "Sex (m/f/u): ";
$sexchr = <STDIN>;
if (lc(substr($sexchr,0,1)) eq "m") {
        $sex = 1; }
elsif (lc(substr($sexchr,0,1)) eq "f") {
        $sex = 2; }
else {
        $sex = 0; }
$pheno = 0;
$tfamname = ">" . $id . ".tfam";
open(TFAM, $tfamname);
print TFAM "$fid $id $pid $mid $sex $pheno";
close TFAM;
 
open(INFILE,$file);
open(TPED,">" . $id . ".tped");
while (<INFILE>) {
        next if /#.*/;
        chomp;
        my($rsid,$chr,$pos,$geno) = split(/\s/);
        if (length($geno)==1) {
                $geno1 = $geno;
                $geno2 = $geno;
        }
        else {
                $geno1 = substr($geno,0,1);
                $geno2 = substr($geno,1,1);
        }
        print TPED "$chr $rsid 0 $pos $geno1 $geno2\n";
}
close TPED;
close INFILE;

That should work on Linux, Microsoft Windows and Mac OS X.

San and Pygmy

I have removed San and Pygmy groups from my reference datasets. That meant removing 39 samples from Reference Data I and 61 samples from Reference Data II.

The presence of those groups was creating some weird effects in admixture runs at K=8,9. Basically, the ancestral components for Africans I was getting were not stable. Instead they were varying with/without different Harappa participant batches. Also, at K=10,11, there were too many Africa-only ancestral components, forcing me to run even higher values of K.

Since we are not really interested in African diversity in this project and any African admixture among South Asians is most likely to be East, West or North African instead of Pygmy or San, the removal of these groups should not have any implications for the Harappa Ancestry Project.

To make sure that the above assertion is true, I'll re-run admixture analysis for K=2-5 and update later with the results.

Admixture: Note on Precision

As you might have seen in the spreadsheets for the reference and for participants, I am rounding off the percentages to the nearest integer.

There is a reason for that. For one thing, there are lots of factors that can influence these results. If I choose somewhat different reference samples, the ancestral components as well as their proportions in different individuals would vary from the current case. This is especially true for minor ancestral components.

I am running admixture for project participants in batches of 10 along with all of my reference dataset. Thus I am be sure that the ancestral components inferred stay the same from one batch to the next.

While the percentages do not vary much for the reference samples from one admixture run to another (with different project participant samples), they do change a little. And I have seen a few changes by as much as 1-2%.

Therefore, these ancestry percentages are at most accurate up to the nearest whole number. There is absolutely no difference between 11.7% and 12.4% for example in my opinion.

Admixture K=2-5, HRP0001 to HRP0010

Finally, it's time to analyze the genomes of project participants. Admixture analysis is going to be done in batches of ten so that the ancestral components are stable from one run to another.

My choice of calling them "ancestral components" is deliberate. Please do not think of them as pure ancestral populations.

First, the ethnic background of the participants in this batch. I'll give the ethnicity only if I have explicit permission from the participant to make such information public. By default, I assume it to be private. Here's the summary:

Ethnicity Count
Punjab 5
Bengal 1
Bihar 1
Tamil 1
Andhra Pradesh 1
Iran 1

Since this is the first batch, I am running admixture for all values of K to get a better handle on how things shake out. With later batches, I will run only a few specific values of K since admixture takes a long time to run.

The ancestral component percentages for project participants can be found in this spreadsheet.

It might be good to refer to the admixture runs for the reference (spreadsheet) to get a better idea of what the different ancestral components represent.

Let's start with K=2 ancestral components.

Batch 1 Admixture K=2

Cyan/African (C2) component varies from 29-51% among participants which is about what you would expect from the results for South Asian reference populations.

With K=3 where the ancestral components roughly represent European (C1/red), East Asian (C2/green) and African (C3/blue), we see the following:

Batch 1 Admixture K=3

I am HRP0001 and my number for K=3 are 77% European, 18% Asian and 5% African. This contrasts with my 23andme ancestry painting of 91.22% European, 8.69% Asian and 0.09% African. However, HRP0002 has closer numbers:

HRP0002 European Asian African
HAP 55% 43% 1%
23andme 57% 43% 0%

We (HAP) are using a much more diverse reference population while 23andme ancestry painting is based on the basic three populations of HapMap. Also, since I am a quarter Egyptian, the likelihood of some African ancestry is high in my case.

Note that the Asian (C2) percentages vary from 18% to 44% for the South Asians in this batch, but it's low (18-22%) in Punjabis and higher in southern and eastern South Asians. It's almost negligible in our Iranian Assyrian sample.

With K=4, we finally get our South Asian ancestral component (C1/red).

Batch 1 Admixture K=4

I (HRP0001) am the only one with any noticeable African component (C4/violet) while HRP0002 has some East Asian ancestry (C3/cyan). The two South Indians have lower European component (C2/green) along with HRP0002 who is from East Bengal.

Finally, let's take a look at K=5 ancestral components.

Batch 1 Admixture K=5

The South Asian (C1/red), East Asian (C3/green) and African (C5/magenta) components are about the same as in K=4. The new component here is C4/blue, which is the Southwest/West Asian component. This is basically a split from the K=4 European (C2/yellow) component. Our Assyrian sample has the highest Southwest/West Asian component while I also have it higher than the South Asians due to my quarter Egyptian ancestry.

Let's continue higher values of K next time.

Reference Admixture Analysis K=2-5

Let's do admixture analysis on my reference population.

Since I wasn't sure what value of K would be appropriate, I ran admixture with different values of K, which defines the number of ancestral populations.

The proportion of ancestral populations for each ethnic group is given in this spreadsheet. These are the mean values for that group, calculated by averaging the ancestral proportion across all the samples belonging to that group. I have also calculated the standard deviation across each ethnic group and that's included in the spreadsheet. The higher values of standard deviation are highlighted in blue (>1%) and red (>5%). Those population groups have samples that have somewhat different ancestries.

Let's start with two ancestral populations, i.e. K = 2.

Admixture: Reference populations K=2

The second ancestral component C2 (cyan) seems to be African and the 1st one C1 (red) is maximum among East Asians. Since all populations are constrained to be made of these two ancestral components, Europeans, Middle Easterners and South Asians all have about half African ancestral component (C2) and the rest East Asian (C1). This is as I expected with the classification of humanity into African and non-African.

The Fst divergences between estimated ancestral populations are as follows:

C1
C2 0.157

The K=3 analysis ancestral components can be roughly said to be European, East Asian and African.

Admixture: Reference populations K=3

The component C1 (red) is maximum among Europeans and is the major ancestry component for Middle Easterners, Central Asians and South Asians. Ancestral component C2 (green) is East Asian. South Asians also have a significant fraction of C2. African populations are represented by C3 (blue). Yemenese, Mozabits and Ethiopian Jews also have appreciable proportions of this African ancestral component.

Looking at the standard deviations of ancestral components for our sample groups, we see that while the Bedouin, Jordanians, Makrani, Moroccons, Mozabite, Saudis and Yemenese are mostly West Eurasian, their proportion of African ancestry vary quite a bit. The large standard deviation in Paniya is due to one sample (C1=55%, C2=42%, C3=3%) being very different (i.e. much more West Eurasian) from the other three (C1=11%, C2=85%, C3=4%).

There are also a couple of Sindhis with some African admixture. These are possibly partly or wholly Siddi.

HGDP Sindhi Samples Admixture K=3

Fst divergences between estimated populations for K=3:

C1 C2
C2 0.102
C3 0.144 0.182

With four ancestral components (K=4), component C1 (red) is a South Asian ancestral component. It is maximum among central and south Indians as well as among Papuans and Melanesians. It could thus possibly related to the ASI (Ancestral South Indian) component. C4 (violet) is the African component. C3 (cyan) is the East Asian component and C2 (green) is the European component.

Admixture: Reference populations K=4

Fst divergences between estimated populations for K=4:

C1 C2 C3
C2 0.071
C3 0.083 0.109
C4 0.152 0.152 0.184

When we increase K to 5, we get the following graph:

Admixture: Reference populations K=5

Ancestral component C1 (red) is Austronesian/South Asian. It is maximum among the Papuans at 75% and is higher among South Indians as compared to Pakistanis. It is about the same component as C1 in K=4.

C4 (blue) is Southwest Asian/West Asian. It peaks in Yemeni Jews at 66% and is high among Saudis, Bedouin, Samaritans, Egyptians, and Palestinians. It's 32% among Turks, so the Southwest Asian part is dominating the West Asian in this component. Notice how Ethiopians and Ethiopian jews have about half of their ancestry from this component.

C3 (green) is the East Asian component and is the same as C3 in the K=4 analysis.

C5 (magenta) is the African ancestry component and is about the same as C4 in the K=4 analysis.

C2 (yellow) is the European component. In K=4, the European component was high among both southern and northern Europeans. Now in K=5, we have the C4 (Southwest/West Asian) component among southern Europeans, so this European component has taken on more of a north European outlook.

Fst divergences between estimated populations for K=5:

C1 C2 C3 C4
C2 0.081
C3 0.084 0.114
C4 0.085 0.054 0.129
C5 0.154 0.165 0.186 0.155

Let's continue this admixture analysis for higher values of K.

Participation Update

I have a total of 23 participants in the project right now who have sent me their raw data. The following groups are represented:

  • Punjab: 7
  • Tamil: 4
  • Iran: 3
  • Bengal: 2
  • Andhra Pradesh: 2
  • Bihar: 1
  • Anglo-Indian: 1
  • Roma: 1
  • Karnataka: 1
  • Kashmir: 1

There is still a lot of ethnicities and regions missing. Uttar Pradesh comes to mind as the biggest one.

Reference Dataset II

Combining my reference population with Xing et al data gets me 3,222 3,161 samples but with only about 23,000 SNPs after LD-pruning.

The good thing is that this dataset has 544 South Asian samples from 24 ethnic groups. So it'll be useful for some analyses despite the low number of SNPs. I'll try to run parallel analyses on my reference population and this dataset so we can compare the pros and cons of both.

UPDATE: I removed 61 pygmy and San samples.

Admixture: Reference Population

For regular admixture analysis, I am using HapMap, HGDP, SGVP and Behar datasets with some samples removed as I wrote earlier.

For each of these datasets,

  1. I first filtered to keep only the list of SNPs present in 23andme v2 chip.
    plink --bfile data --extract 23andmev2.snplist
  2. I also filtered for founders:
    plink --bfile data --filter-founders
  3. And excluded SNPs with missing rates greater than 1%:
    plink --bfile data --geno 0.01

Then, I merged the datasets one by one. The reason for doing it one by one was that there were conflicts of strand orientation (forward or reverse) between the different datasets. If the merge operation gave an error, I had to flip those strands in one dataset and try the merge again.

plink --bfile data1 --bmerge data2.bed data2.bim data2.fam --make-bed
plink --bfile data2 --flip plink.missnp --make-bed --out data2flip
plink --bfile data1 --bmerge data2flip.bed data2flip.bim data2flip.fam --make-bed

Once all the four datasets were merged, I processed the combined data file:

  1. Removed SNPs with a missing rate of more than 1% in the combined dataset
    plink --bfile data --geno 0.01
  2. Then i performed linkage disequilibrium based pruning using a window size of 50, a step of 5 and r^2 threshold of 0.3:
    plink --bfile data --indep-pairwise 50 5 0.3
    plink --bfile data --extract plink.prune.in --make-bed

This gave me a reference population of 2,693 2,654 individuals with each sample having about 186,000 SNPs. Out of these 2,693 2,654 individuals, we have a total of 398 South Asians belonging to 16 ethnic groups.

Finally, it's time to start having some fun!

UPDATE: I removed 39 Pygmy and San samples because they were causing some trouble with African ancestral components. Since we are not interested in detailed African ancestry and African admixture among South Asians is not likely to be pygmy or San, I decided it would be best to remove them.

Xing et al Data

The data for Xing et al's paper "Toward a more uniform sampling of human genetic diversity: a survey of worldwide populations by high-density genotyping" is available online.

This dataset consists of 850 individuals, but 259 of them overlap with the HapMap. Another 15 samples had to be removed because they were too similar to others. I also removed Native American samples. This leaves us with 529 samples.

Ethnic group Count
Slovenian 25
Punjabi Arain 25
N. European 25
Nepalese 25
Kyrgyzstani 25
Iban 25
Buryat 25
Bambaran 25
Andhra Pradesh Brahmin 25
Kurd 24
Dogon 24
Irula 23
Thai 22
Pygmy 22
Urkarah 18
Tamil Nadu Brahmin 14
Hema 14
Tongan 13
Tamil Nadu Dalit 13
Samoan 13
!Kung 13
Japanese 13
Andhra Pradesh Mala 11
Pedi 10
Andhra Pradesh Madiga 10
Alur 10
Nguni 9
Sotho/Tswana 8
Vietnamese 7
Stalskoe 5
Chinese 5
Khmer Cambodian 3

This dataset is valuable because it contains several South Asian, Central Asian, Southeast Asian and Caucasian groups. However, it does not have a good SNP overlap with 23andme and the other datasets. It has only about 29,000 SNPs in common with 23andme v2 data. Combining HapMap, HGDP, SGVP, Behar et al and Xing et al with 23andme data leaves us with 25,000 SNPs. Due to that, I'll be using Xing et al data for only a few analyses.

Behar et al Data

In their paper "The genome-wide structure of the Jewish people", Behar et al analyzed the genomes of some Jewish groups. More important than the Jewish samples (which include two South Asian Jewish groups) for us are the different South Asian, Middle Eastern, and European groups they sampled:

Ethnic group Count
Saudis 20
Jordanians 20
Georgians 20
Turks 19
Iranians 19
Hungarians 19
Ethiopians 19
Armenians 19
Lezgins 18
Chuvashs 17
Syrians 16
Romanians 16
Uzbeks 15
Spaniards 12
Egyptians 12
Cypriots 12
Moroccans 10
Lithuanians 10
North Kannadi 9
Belorussian 9
Yemenese 8
Lebanese 7
Sakilli 4
Paniya 4
Cochin Jews 4
Bene Israel 4
Samaritians 2
Russian 2
Malayan 2

Of the 466 samples, I excluded 8 because they were either duplicates or too similar in their genomes to others.

The series matrix files that I downloaded were in a somewhat different format. To convert them to Plink format, I had to look up the platform file for the Illumina genotyping BeadChip they used. Also, Illumina used an A/B alleles and Top/Bot strands system instead of the regular ACGT alleles and forward/reverse strands. This Illumina Technote explained it and I found a Perl script to convert between the two.