Author Archives: Zack - Page 5

FTDNA FF to PED Conversion

Posted by Zack on June 23, 2012 Comments Off

Someone asked how to convert an FTDNA Family Finder CSV data file to the Plink format. I threw together a very simple Unix shell script to do that and I am sharing it here:

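#!/bin/bash
# Convert an FTDNA Family Finder CSV file to Plink binary files.
if test -z "$1"
then
        echo "FTDNA raw data filename not supplied as argument." >&2
        exit 1
fi
# Ask for the pedigree fields that go into the .tfam file.
echo "Family ID: "
read -r fid
echo "Individual ID: "
read -r id
echo "Paternal ID: "
read -r pid
echo "Maternal ID: "
read -r mid
echo "Sex (m/f/u): "
read -r sexchr
# Plink sex codes: 1 = male, 2 = female, 0 = unknown.
if [[ $sexchr == m* ]]
then
        sex=1
elif [[ $sexchr == f* ]]
then
        sex=2
else
        sex=0
fi
pheno=0
 
echo "$fid $id $pid $mid $sex $pheno" > "$id.tfam"
 
# Note: dos2unix rewrites the input file in place.
dos2unix "$1"
# Drop the header line, then write chromosome, SNP ID, genetic distance (0),
# position and the two alleles for each SNP.
sed '1d' "$1" > "$id.nocomment"
awk -F, '{gsub(/"/,""); print $2,$1,"0",$3,substr($4,1,1),substr($4,2,1)}' "$id.nocomment" > "$id.tped"
rm "$id.nocomment"
 
# Treat "-" as the missing genotype on input and write it as 0 on output.
plink --tfile "$id" --out "$id" --make-bed --missing-genotype - --output-missing-genotype 0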

This script creates three files: *.bed, *.bim and *.fam, which are the binary format files for Plink. You can then use Plink to merge multiple files, filter SNPs or individuals and do other processing.
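To see what the sed and awk steps in the script produce, here is a small sketch that runs the same transformation on two made-up rows. The header line and the rsIDs, positions and genotypes in `sample` are illustrative assumptions, not real Family Finder data, and `ftdna_to_tped` is just a name chosen for this example.

```shell
# Two made-up Family Finder-style rows plus a header (illustrative values only).
sample='RSID,CHROMOSOME,POSITION,RESULT
"rs0000001","1","1000","AG"
"rs0000002","1","2000","--"'

ftdna_to_tped() {
    # Drop the header, strip the quotes, then print: chromosome, SNP ID,
    # genetic distance (0 = unknown), position, allele 1, allele 2.
    sed '1d' | awk -F, '{gsub(/"/,""); print $2,$1,"0",$3,substr($4,1,1),substr($4,2,1)}'
}

printf '%s\n' "$sample" | ftdna_to_tped
```

Each output line is one SNP in the transposed (.tped) layout, and a no-call shows up as two "-" alleles, which is why the plink command is told to read "-" as the missing genotype.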

HarappaWorld Ancestral South Indian

Posted by Zack on June 19, 2012 12 comments

Using the same method as I used for reference 3 admixture, I decided to guesstimate the Ancestral South Indian proportions, as given by Reich et al, for my HarappaWorld admixture run.

Basically, I used 92 of the 96 samples that Reich et al used to find population averages for the South Indian component. Then, I ran a linear regression of Reich et al's estimate of Ancestral South Indian (ASI) ancestry on the South Indian component average. Since Reich et al list Ancestral North Indian (ANI) percentages in their paper and their model is a two-ancestry ANI+ASI one, I simply calculated the ASI percentages as 100% minus ANI.

The correlation between Reich et al ASI and my HarappaWorld South Indian component for the relevant populations turns out to be 0.99277086.

And the linear regression fit for the data is:

ASI = 2.5218942 + 0.8104836 * S_INDIAN

where both ASI (Reich et al) and S_INDIAN (HarappaWorld) are given in percentages.

Of the individuals in HarappaWorld, I kept only those who had a South Indian component of at least 20% for computing the ASI proportions.
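As a rough illustration, the fit and the 20% cutoff can be applied to a single South Indian percentage like this. The function name `asi_from_s_indian` and the choice to print NA below the cutoff are assumptions made for this sketch; the coefficients are the ones quoted above.

```shell
# Apply the regression fit to a HarappaWorld S Indian percentage.
# Prints NA below the 20% cutoff described in the post.
asi_from_s_indian() {
    awk -v s="$1" 'BEGIN {
        if (s + 0 < 20) { print "NA"; exit }
        printf "%.1f\n", 2.5218942 + 0.8104836 * s
    }'
}

asi_from_s_indian 50   # an S Indian component of 50%
asi_from_s_indian 7    # below the cutoff, so no estimate is given
```

Because the fit was made on population averages, applying it to one person is only a rough guide.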

The resulting ASI percentages can be seen in a spreadsheet.

Please note that in the Group sheet, the averages are based on the samples which met the 20% South Indian component threshold. Thus, the 20% ASI in the Romanians is the average of the two Romanians who met the threshold out of a total of 16 Romanian samples.

The individual results are available in the Individual sheet. These results are a little different from the estimates using reference 3. Thus, I would point out that these should be taken only as a rough estimate.

HarappaOracle Limitations

Posted by Zack on June 12, 2012 10 comments

While HarappaOracle is a great tool, it has its limitations.

First of all, do not think of the mixed mode results as showing which populations you are descended from. Use HarappaOracle to get an idea of which populations are similar to you in their admixture results. This function is especially important since admixture results should be understood in relative terms, as I have been stressing.

For mixed-race people, the Oracle might sometimes provide a correct result, like it does for me. For others, the known ancestral mix might not show up.

There is also the fact that the Oracle calculator is sensitive to your admixture percentages and sometimes small changes can change the Oracle mixed mode results radically.

Let's look at three siblings as an example. (My thanks to them for letting me use their results for this post.) Here are their admixture results:

	Sibling 1	Sibling 2	Sibling 3
NE Euro	43.8%	43.1%	43.9%
Mediterranean	27.6%	26.8%	27.0%
Baloch	11.2%	11.1%	12.2%
Caucasian	9.4%	10.8%	8.6%
S Indian	5.3%	6.5%	7.0%
SW Asian	1.5%	1.0%	0.5%
American	0.7%	0.3%	0.7%
NE Asian	0.4%	0.1%	0.1%
Beringian	0.0%	0.3%	0.0%
San	0.0%	0.1%	0.0%

Their admixture results are broadly similar, as expected. Some of you might think that a difference of 1% or so is very significant, but do consider what we know of DNA inheritance and the error margins in ADMIXTURE.

Now let's see their HarappaWorld Oracle results.

Sibling 1		Sibling 2		Sibling 3
romany	3.16	romany	2.85	romany	4.45
hungarian	9.47	hungarian	9.65	utahn-white	10.74
utahn-white	9.88	french	11.05	n-european	10.79
n-european	9.9	slovenian	11.09	hungarian	10.97
french	9.95	n-european	11.38	utahn-white	11.35
utahn-white	10.62	utahn-white	11.54	french	11.54
slovenian	10.95	utahn-white	12.21	british	11.94
british	11.29	british	12.99	slovenian	12.21
orcadian	13.17	orcadian	14.86	orcadian	13.51
ukranian	17.73	romanian	17.14	ukranian	18.24

Again, not unexpected. The top 10 population matches are not too different for the siblings. There are some differences, but nothing extraordinary.

Finally let's look at mixed mode Oracle, where we try to find the 10 closest matches (based on admixture results) assuming that these individuals are mixed from two populations.

Sibling 1		Sibling 2		Sibling 3
91.3% romany + 8.7% lithuanian	1.58	93.3% romany + 6.7% lithuanian	1.93	82.8% utahn-white + 17.2% bene-israel	1.99
78.4% romany + 21.6% n-european	1.67	95.4% romany + 4.6% finnish	1.99	83.7% n-european + 16.3% bene-israel	2.37
79.7% romany + 20.3% utahn-white	1.69	91.9% romany + 8.1% belorussian	2.01	83.8% utahn-white + 16.2% bene-israel	2.47
83.3% romany + 16.7% orcadian	1.76	92.4% romany + 7.6% russian	2.02	84.8% n-european + 15.2% cochin-jew	2.52
94.2% romany + 5.8% finnish	1.88	92.0% romany + 8.0% mordovian	2.06	86.0% n-european + 14.0% kerala-christian	2.93
79.8% romany + 20.2% utahn-white	1.99	90.4% romany + 9.6% ukranian	2.14	85.0% utahn-white + 15.0% cochin-jew	2.98
90.2% romany + 9.8% belorussian	2.00	88.7% romany + 11.3% slovenian	2.5	85.9% n-european + 14.1% ap-hyderabad	2.99
82.1% romany + 17.9% british	2.03	89.2% romany + 10.8% n-european	2.52	85.1% n-european + 14.9% up	3.02
91.4% romany + 8.6% russian	2.06	95.7% romany + 4.3% chuvash	2.58	85.5% n-european + 14.5% tn-brahmin	3.13
85.0% n-european + 15.0% bene-israel	2.16	90.7% romany + 9.3% utahn-white	2.58	85.5% n-european + 14.5% brahmin-tamil-nadu	3.15

Sibling 1 and Sibling 2 are again not too different from each other: mostly Romany with some European. However, Sibling 3 gets vastly different results. Why? No, Sibling 3 wasn't adopted! The reason is simple. Sibling 3 has a larger South Indian component than the average Romany in our dataset. This means that Sibling 3 cannot be represented as a mix of Romany and a European ethnicity without a large error. Instead, a mix of mostly northwest European with a little Indian, especially Indian Jewish, comes closest to Sibling 3's results. However, this does not make Sibling 3 Jewish or Indian Jewish; Indian Jews are themselves quite mixed with the local Indian populations.
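To make the mixed mode idea concrete, here is a toy sketch of a two-population search. It is not the actual Oracle code: it assumes Euclidean distance between admixture vectors and a clipped least-squares mixing fraction, the function name `mixed_mode` is made up for this example, and the three populations and the target are invented numbers with only three components.

```shell
# Toy mixed-mode search: for every pair of populations, find the mixing
# fraction that brings the blend closest to the target (Euclidean distance,
# an assumption), then list the pairs from best to worst.
mixed_mode() {
    # $1 = target percentages, space separated; stdin = "name p1 p2 ..." lines
    awk -v target="$1" '
    BEGIN { k = split(target, t, " ") }
    { n++; name[n] = $1; for (c = 1; c <= k; c++) p[n, c] = $(c + 1) }
    END {
        for (i = 1; i < n; i++) for (j = i + 1; j <= n; j++) {
            dot = 0; len2 = 0
            for (c = 1; c <= k; c++) {
                d = p[i, c] - p[j, c]
                dot += (t[c] - p[j, c]) * d
                len2 += d * d
            }
            a = (len2 > 0) ? dot / len2 : 0       # best fraction of population i
            if (a < 0) a = 0; if (a > 1) a = 1    # keep the mix between 0% and 100%
            dist2 = 0
            for (c = 1; c <= k; c++) {
                e = t[c] - (a * p[i, c] + (1 - a) * p[j, c])
                dist2 += e * e
            }
            printf "%.1f%% %s + %.1f%% %s\t%.2f\n", 100 * a, name[i], 100 * (1 - a), name[j], sqrt(dist2)
        }
    }' | sort -t "$(printf '\t')" -k2,2n
}

# Made-up population averages (three components each) and a made-up target.
pops='pop_a 80 20 0
pop_b 0 20 80
pop_c 50 50 0'
printf '%s\n' "$pops" | mixed_mode "60 20 20"
```

In this toy data the target has 20% of the third component, which neither pop_a nor pop_c carries, so the pop_a + pop_c pair is left with a large distance no matter how the fraction is chosen. That is the same situation as Sibling 3 and the Romany + European mixes.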

HarappaWorld HRP0240-HRP0244

Posted by Zack on June 5, 2012 5 comments

From now on, instead of waiting till I have a batch of 10 new participants to compute their Admixture results, I'll run admixture at the start of the month for those who submitted their data during the previous month.

So I have added the HarappaWorld Admixture results for HRP0240-HRP0244 to the individual spreadsheet.

I have also recomputed the weighted averages for Bengalis (from 3 to 5 now), Kerala Muslims (from 1 to 2), and Georgians (from 3 to 4) while adding a new one for our first North Ossetian participant.

Do note that the admixture components do not necessarily represent real ancestral populations. Also, the names I have chosen for the components should be thought of as mnemonics to ease discussion. I chose them based on which populations in my data these components peaked in. They do not tell us anything directly about ancestral populations. The best way to look at these admixture results is by comparing individuals and populations. Finally, the standard error estimates on these results can be about 1%. Therefore, it is entirely possible that your 1% exotic admixture result is just noise.

HarappaWorld Tweaks

Posted by Zack on May 29, 2012 18 comments

First of all, I wanted to draw your attention to the fact that I am using weighted means for population averages for HarappaWorld instead of just averaging all samples' results. The weighting gives less importance to outliers. I find this to be a better solution than a simple average or median. A median removes all outliers but it also rejects a lot of information.
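The post does not spell out how the weights are computed, so the following is only a sketch of the general idea with a hypothetical weighting: each sample gets the weight 1 / (1 + distance from the median), so values far from the middle count for less. The function name `robust_mean` and the five input values are made up for illustration.

```shell
# Compare a plain mean with an outlier-resistant weighted mean.
robust_mean() {
    # stdin: one value per line. Prints "plain_mean weighted_mean".
    sort -n | awk '
    { x[NR] = $1; sum += $1 }
    END {
        if (NR == 0) exit
        med = (NR % 2) ? x[(NR + 1) / 2] : (x[NR / 2] + x[NR / 2 + 1]) / 2
        for (i = 1; i <= NR; i++) {
            dev = x[i] - med; if (dev < 0) dev = -dev
            w = 1 / (1 + dev)      # hypothetical weight: shrinks away from the median
            wsum += w; wx += w * x[i]
        }
        printf "%.2f %.2f\n", sum / NR, wx / wsum
    }'
}

# Four similar values and one outlier (made-up numbers).
printf '%s\n' 2 3 3 4 20 | robust_mean
```

With these numbers the outlier drags the plain mean well above the bulk of the samples, while the weighted mean stays close to them, and unlike a median it still uses every sample.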

An example of the weighted mean effect can be seen in the Behar et al Armenian samples. Four of the samples have higher NE European percentages than the rest. As you can see in the table below, the weighting makes their impact on the population results low.

	Mean		Weighted Mean
Ethnicity	armenian	armenian	armenian	armenian
Dataset	behar	yunusbayev	behar	yunusbayev
N	19	16	19	16
S Indian	0.37%	0.52%	0.41%	0.52%
Baloch	16.57%	17.73%	17.07%	17.65%
Caucasian	54.35%	56.43%	57.29%	56.61%
NE Euro	8.96%	2.98%	5.35%	2.95%
SE Asian	0.10%	0.12%	0.10%	0.13%
Siberian	0.49%	0.09%	0.29%	0.09%
NE Asian	0.14%	0.08%	0.16%	0.09%
Papuan	0.28%	0.27%	0.26%	0.27%
American	0.19%	0.18%	0.22%	0.18%
Beringian	0.26%	0.19%	0.23%	0.20%
Mediterranean	8.46%	8.37%	8.21%	8.40%
SW Asian	9.81%	13.03%	10.40%	12.91%
San	0.00%	0.00%	0.00%	0.00%
E African	0.02%	0.00%	0.01%	0.00%
Pygmy	0.00%	0.00%	0.00%	0.00%
W African	0.00%	0.00%	0.00%	0.00%

Another example is the Somali samples in Reich et al data. There is one sample (out of 6) who seems to be eastern Bantu. Let's compare the unweighted mean and weighted mean for Somalis in Reich et al and Harappa participants.

	Mean		Weighted Mean
Ethnicity	somali	somali	somali	somali
Dataset	harappa	reich	harappa	reich
N	2	6	2	6
S Indian	0.00%	1.62%	0.00%	1.49%
Baloch	0.00%	0.00%	0.00%	0.00%
Caucasian	2.76%	0.00%	2.76%	0.00%
NE Euro	0.00%	0.11%	0.00%	0.04%
SE Asian	0.27%	0.05%	0.27%	0.06%
Siberian	0.00%	0.04%	0.00%	0.05%
NE Asian	0.00%	0.41%	0.00%	0.46%
Papuan	0.26%	0.10%	0.26%	0.11%
American	0.14%	0.17%	0.14%	0.19%
Beringian	0.23%	0.33%	0.23%	0.38%
Mediterranean	2.12%	3.25%	2.12%	3.65%
SW Asian	31.73%	24.48%	31.73%	27.33%
San	1.96%	1.48%	1.96%	1.37%
E African	60.37%	56.75%	60.37%	60.13%
Pygmy	0.15%	1.78%	0.15%	1.23%
W African	0.00%	9.43%	0.00%	3.51%

Also, I have divided Singapore Indians into 4 groups (actually 3 groups and 1 outlier) since they are so heterogeneous. Here are the weighted mean admixture proportions for all Singapore Indians and the four subgroups.

Ethnicity	singapore-indian	singapore-indian-1	singapore-indian-2	singapore-indian-3	singapore-indian-4
Dataset	sgvp	sgvp	sgvp	sgvp	sgvp
N	83	31	41	10	1
S Indian	53.57%	61.95%	50.39%	33.68%	27.81%
Baloch	33.97%	30.24%	36.00%	40.72%	14.27%
Caucasian	3.55%	1.92%	4.03%	9.32%	4.53%
NE Euro	2.93%	0.08%	3.89%	9.84%	35.38%
SE Asian	1.31%	1.30%	1.23%	0.63%	1.20%
Siberian	0.45%	0.47%	0.44%	0.43%	1.19%
NE Asian	0.92%	0.91%	0.80%	1.19%	3.26%
Papuan	0.72%	1.09%	0.50%	0.35%	0.62%
American	0.42%	0.35%	0.44%	0.69%	1.29%
Beringian	0.56%	0.38%	0.65%	0.76%	0.00%
Mediterranean	0.67%	0.40%	0.72%	1.33%	10.38%
SW Asian	0.90%	0.86%	0.87%	1.05%	0.06%
San	0.01%	0.00%	0.01%	0.00%	0.00%
E African	0.03%	0.02%	0.04%	0.00%	0.00%
Pygmy	0.00%	0.00%	0.00%	0.00%	0.00%
W African	0.01%	0.01%	0.00%	0.00%	0.00%

I have updated the spreadsheet as well as HarappaWorld Oracle.

HarappaWorld on GEDmatch

Posted by Zack on May 21, 2012 2 comments

The HarappaWorld Admixture calculator is now available on GEDmatch.

You can compute:

Admixture Proportions
Admixture Proportions by Chromosome
Chromosome Painting
Paint differences between 2 kits, 1 chromosome
Paint differences between 2 kits, 22 chromosomes, reduced size

You do have to upload your genetic data to GEDmatch to use it.

If you are a Harappa participant and try GEDmatch too, please let me know if there's any difference between your admixture results.

UPDATE: Now you can even get your HarappaWorld Oracle results after getting the admixture results, thanks to John.

Participation Changes

Posted by Zack on May 16, 2012 5 comments

Now that I have DIY HarappaWorld out, I am changing the participation requirements a little bit with somewhat different requirements for South Asians compared to other regions.

If you have any real ancestry from South Asia, you are eligible to participate. Partial South Asian ancestry is okay. The countries of origin I count as South Asian are as follows:

Afghanistan
Bangladesh
Bhutan
India
Maldives
Nepal
Pakistan
Sri Lanka

Note that 2-3% South Asian from Dr. McDonald's BGA or Dodecad Project does not count as South Asian ancestry.

If you have all four of your grandparents from one of the following countries or regions, you can also send me your data.

Burma
Tibet
Uyghur from Xinjiang, China
Tajikistan
Kyrgyzstan
Kazakhstan
Uzbekistan
Turkmenistan
Iran
Turkey
Azerbaijan
Armenia
Georgia
North Caucasian Federal District, Russia
Iraq
Syria
Lebanon
Jordan

Relatives will only be accepted when they are a better replacement for current participants. For example, replacing a participant by his/her parents, or by his/her maternal uncle and paternal aunt, gets us two unrelated participants (assuming, of course, that the two sides of the family are not related by blood). Another example could be if a participant is of partial South Asian ancestry and gets replaced by a relative who has more South Asian ancestry.

Everyone else can use DIY HarappaWorld. It's fairly easy to use on both Windows and Linux. The only hard part right now is that you have to install R to standardize your genome file. I might look into creating an executable for that to make it easier.

Finally, please be honest.

HarappaWorld Oracle

Posted by Zack on May 11, 2012 17 comments

Here's the HarappaWorld Oracle to go with the HarappaWorld admixture results and DIYHarappaWorld.

It works similarly to the old Ref3 Harappa Oracle, with a couple of differences. One, there is no panasian switch since the Pan-Asian dataset is not included in this calculator.

Two, I have added an optional mincount argument. It picks only those groups where the number of individuals is equal to or more than mincount for the Oracle calculation. By default mincount is 2, so only those groups which have 2 or more samples are used to compute your Oracle results.
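The effect of mincount can be pictured as a simple filter on the group labels. This sketch assumes that the number at the end of a label such as punjabi_harappa_7 is the sample count for that group; the function name `filter_by_mincount` and the label example_group_1 are made up for illustration.

```shell
# Keep only the groups whose trailing sample count is at least mincount.
filter_by_mincount() {
    # $1 = mincount; stdin = one group label per line, e.g. punjabi_harappa_7
    awk -F_ -v min="$1" '$NF + 0 >= min + 0'
}

printf '%s\n' punjabi_harappa_7 sindhi_hgdp_24 example_group_1 | filter_by_mincount 2
```

With the default of 2, the single-sample group is dropped and the other two are kept. Raising mincount removes more of the small groups, whose averages are the least reliable.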

Let's look at my top 20 Oracle results in mixed mode, excluding population groups with fewer than 4 individuals.

HarappaOracle(c(26.46,36.82,14.22,4.78,0.00,1.32,0.86,0.04,0.19,0.06,3.63,8.07,0.00,2.44,0.43,0.67),k=20,mincount=4,mixedmode=T)

[,1] [,2]
[1,] "18.1% egyptian_behar_12 + 81.9% punjabi-arain_xing_25" "2.3361"
[2,] "18.1% egypt_henn2012_19 + 81.9% punjabi-arain_xing_25" "2.5615"
[3,] "80.7% punjabi-arain_xing_25 + 19.3% yemenese_behar_8" "2.8388"
[4,] "18.4% palestinian_hgdp_46 + 81.6% punjabi-arain_xing_25" "2.9944"
[5,] "84.7% punjabi-arain_xing_25 + 15.3% yemen-jew_behar_15" "3.0923"
[6,] "19.1% jordanian_behar_20 + 80.9% punjabi-arain_xing_25" "3.1877"
[7,] "18% egypt_henn2012_19 + 82% sindhi_hgdp_24" "3.4814"
[8,] "17.9% egyptian_behar_12 + 82.1% sindhi_hgdp_24" "3.5554"
[9,] "20.3% jordanian_behar_20 + 79.7% punjabi_harappa_7" "3.6161"
[10,] "18.9% egyptian_behar_12 + 81.1% punjabi_harappa_7" "3.6587"
[11,] "19.5% palestinian_hgdp_46 + 80.5% punjabi_harappa_7" "3.7079"
[12,] "19% egypt_henn2012_19 + 81% punjabi_harappa_7" "3.8303"
[13,] "18.3% palestinian_hgdp_46 + 81.7% sindhi_hgdp_24" "3.8762"
[14,] "80.4% punjabi-arain_xing_25 + 19.6% syrian_behar_16" "3.8908"
[15,] "19% lebanese_behar_7 + 81% punjabi-arain_xing_25" "4.0494"
[16,] "18.9% jordanian_behar_20 + 81.1% sindhi_hgdp_24" "4.078"
[17,] "79.9% punjabi_harappa_7 + 20.1% yemenese_behar_8" "4.1222"
[18,] "15.1% bedouin_hgdp_46 + 84.9% punjabi-arain_xing_25" "4.1522"
[19,] "85.3% punjabi-arain_xing_25 + 14.7% saudi_behar_20" "4.2014"
[20,] "79.1% punjabi_harappa_7 + 20.9% syrian_behar_16" "4.2191"

These results are closer to my actual reported ancestry than the ones from reference 3 oracle.

DIY HarappaWorld

Posted by Zack on May 7, 2012 154 comments

Based on Dienekes' instructions, I have created DIYHarappaWorld for anyone to compute their admixture results for my HarappaWorld calculator.

Here's what you need to do:

Download DIYHarappaWorld files and unzip them.
Download DIYDodecad v2.1 (File->Download).
Unpack DIYDodecad2.1.rar by using 7-zip, WinRAR, or Linux rar/unrar command.
Start R and change the working directory to where you have the DIY files.
Enter the following command in R:
```
source('standardize.r')
```
If you have your 23andMe raw data, run the following command in R:
```
standardize('genome_john_doe.txt', company='23andMe')
```
where genome_john_doe.txt is the filename for your raw data file.
If you have your FTDNA Family Finder data in a file named johndoe.csv, run the following in R:
```
standardize('johndoe.csv', company='ftdna')
```

From your operating system command prompt, run the appropriate command:

DIYDodecadWin harappaworld.par
./DIYDodecadLinux32 harappaworld.par
./DIYDodecadLinux64 harappaworld.par

The program will start computing the admixture percentages. It took about 5-10 minutes on my computer.
The best way to understand your results is to compare them with other populations and individuals. Do not take the component names seriously. They do not represent true ancestral populations.

You can also change the last line of the harappaworld.par file to one of genomewide, bychr, byseg or target to calculate the admixture percentages for the whole genome, by chromosome, by segment or for a target region respectively. Do note that the last three will be noisier.

UPDATE: I should also point out that this DIY calculator will work better for those individuals whose genetic variation was included in computing the admixture model. Those belonging to a group not included at all in the set of samples I used might get somewhat odd results.

HarappaWorld Admixture

Posted by Zack on May 4, 2012 18 comments

Here is a new admixture calculator. This uses populations from all over the world and I got the best results (i.e., lowest cross-validation error) at K=16.

You can see the admixture results for different ethnic groups as well as results for individual (founder-only) project participants.

UPDATE: The population results have been calculated using weighted means.

The group results are also shown in the usual interactive bar chart below. You can click on the component labels to sort by that ancestral component.

I used about 188,173 SNPs for this run. The results reported above for the Henn2011 (181,223 SNPs for Hadza, Sandawe and San, 26,494 SNPs for other groups), Henn2012 (26,494 SNPs), Reich (48,967 SNPs) and Xing (18,986 SNPs) datasets were, however, calculated using a smaller number of common SNPs. Hence caution should be exercised in interpreting those results.

You can also see the Fst distances between the ancestral components.

I should have HarappaWorldOracle and DIYHarappaWorld calculators out in the next few days.

Also, I am working on another calculator which will focus more closely on South Asia.

Harappa Ancestry Project

Genetics and South Asia
