Monthly Archives: May 2012

HarappaWorld Tweaks

First of all, I wanted to draw your attention to the fact that I am using weighted means for population averages for HarappaWorld instead of just averaging all samples' results. The weighting gives less importance to outliers. I find this to be a better solution than a simple average or median. A median removes all outliers but it also rejects a lot of information.
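To make the idea concrete, here is a minimal sketch of one way a weighted mean can downweight outliers. The post does not say how the HarappaWorld weights are computed, so the weighting scheme (weights that shrink with distance from the median), the function name `robust_weighted_mean`, and the sample values are all assumptions made for illustration.

```r
# Hypothetical robust weighted mean: samples far from the median get less weight.
# This weighting scheme is an assumption, not necessarily the one used for HarappaWorld.
robust_weighted_mean <- function(x) {
  centre <- median(x)
  spread <- mad(x)                   # scaled median absolute deviation
  if (spread == 0) return(centre)    # at least half the samples sit exactly on the median
  z <- (x - centre) / spread         # robust z-scores
  w <- 1 / (1 + z^2)                 # weights in (0, 1], small for outliers
  sum(w * x) / sum(w)
}

# Made-up percentages for one component: eight samples near 3%, four outliers
ne_euro <- c(2.5, 3.1, 2.8, 3.4, 2.9, 3.0, 3.3, 2.7, 25.0, 28.0, 30.0, 27.0)

plain    <- mean(ne_euro)                  # pulled up strongly by the four outliers
weighted <- robust_weighted_mean(ne_euro)  # stays close to the bulk of the samples
```

Unlike a median, every sample still contributes to the result, just with a weight that depends on how typical it is.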

An example of the weighted mean effect can be seen in the Behar et al Armenian samples. Four of the samples have higher NE European percentages than the rest. As you can see in the table below, the weighting reduces their impact on the population results: the Behar NE Euro average drops from 8.96% to 5.35%.

Statistic      Mean        Mean        Weighted Mean  Weighted Mean
Ethnicity      armenian    armenian    armenian       armenian
Dataset        behar       yunusbayev  behar          yunusbayev
N              19          16          19             16
S Indian       0.37%       0.52%       0.41%          0.52%
Baloch         16.57%      17.73%      17.07%         17.65%
Caucasian      54.35%      56.43%      57.29%         56.61%
NE Euro        8.96%       2.98%       5.35%          2.95%
SE Asian       0.10%       0.12%       0.10%          0.13%
Siberian       0.49%       0.09%       0.29%          0.09%
NE Asian       0.14%       0.08%       0.16%          0.09%
Papuan         0.28%       0.27%       0.26%          0.27%
American       0.19%       0.18%       0.22%          0.18%
Beringian      0.26%       0.19%       0.23%          0.20%
Mediterranean  8.46%       8.37%       8.21%          8.40%
SW Asian       9.81%       13.03%      10.40%         12.91%
San            0.00%       0.00%       0.00%          0.00%
E African      0.02%       0.00%       0.01%          0.00%
Pygmy          0.00%       0.00%       0.00%          0.00%
W African      0.00%       0.00%       0.00%          0.00%

Another example is the Somali samples in the Reich et al data. One of the 6 samples seems to be eastern Bantu. Let's compare the unweighted mean and weighted mean for Somalis in Reich et al and among Harappa participants. The weighting cuts the Reich W African figure from 9.43% to 3.51%.

Statistic      Mean        Mean        Weighted Mean  Weighted Mean
Ethnicity      somali      somali      somali         somali
Dataset        harappa     reich       harappa        reich
N              2           6           2              6
S Indian       0.00%       1.62%       0.00%          1.49%
Baloch         0.00%       0.00%       0.00%          0.00%
Caucasian      2.76%       0.00%       2.76%          0.00%
NE Euro        0.00%       0.11%       0.00%          0.04%
SE Asian       0.27%       0.05%       0.27%          0.06%
Siberian       0.00%       0.04%       0.00%          0.05%
NE Asian       0.00%       0.41%       0.00%          0.46%
Papuan         0.26%       0.10%       0.26%          0.11%
American       0.14%       0.17%       0.14%          0.19%
Beringian      0.23%       0.33%       0.23%          0.38%
Mediterranean  2.12%       3.25%       2.12%          3.65%
SW Asian       31.73%      24.48%      31.73%         27.33%
San            1.96%       1.48%       1.96%          1.37%
E African      60.37%      56.75%      60.37%         60.13%
Pygmy          0.15%       1.78%       0.15%          1.23%
W African      0.00%       9.43%       0.00%          3.51%

Also, I have divided Singapore Indians into 4 groups (actually 3 groups and 1 outlier) since they are so heterogeneous. Here are the weighted mean admixture proportions for all Singapore Indians and the four subgroups.

Ethnicity      singapore-indian    singapore-indian-1  singapore-indian-2  singapore-indian-3  singapore-indian-4
Dataset        sgvp                sgvp                sgvp                sgvp                sgvp
N              83                  31                  41                  10                  1
S Indian       53.57%              61.95%              50.39%              33.68%              27.81%
Baloch         33.97%              30.24%              36.00%              40.72%              14.27%
Caucasian      3.55%               1.92%               4.03%               9.32%               4.53%
NE Euro        2.93%               0.08%               3.89%               9.84%               35.38%
SE Asian       1.31%               1.30%               1.23%               0.63%               1.20%
Siberian       0.45%               0.47%               0.44%               0.43%               1.19%
NE Asian       0.92%               0.91%               0.80%               1.19%               3.26%
Papuan         0.72%               1.09%               0.50%               0.35%               0.62%
American       0.42%               0.35%               0.44%               0.69%               1.29%
Beringian      0.56%               0.38%               0.65%               0.76%               0.00%
Mediterranean  0.67%               0.40%               0.72%               1.33%               10.38%
SW Asian       0.90%               0.86%               0.87%               1.05%               0.06%
San            0.01%               0.00%               0.01%               0.00%               0.00%
E African      0.03%               0.02%               0.04%               0.00%               0.00%
Pygmy          0.00%               0.00%               0.00%               0.00%               0.00%
W African      0.01%               0.01%               0.00%               0.00%               0.00%

I have updated the spreadsheet as well as HarappaWorld Oracle.

HarappaWorld on GEDmatch

The HarappaWorld Admixture calculator is now available on GEDmatch.

You can compute:

  • Admixture Proportions
  • Admixture Proportions by Chromosome
  • Chromosome Painting
  • Paint differences between 2 kits, 1 chromosome
  • Paint differences between 2 kits, 22 chromosomes, reduced size

You do have to upload your genetic data to GEDmatch to use it.

If you are a Harappa participant and try GEDmatch too, please let me know if there's any difference between your admixture results.

UPDATE: Now you can even get your HarappaWorld Oracle results after getting the admixture results, thanks to John.

Participation Changes

Now that I have DIY HarappaWorld out, I am changing the participation requirements a little bit with somewhat different requirements for South Asians compared to other regions.

If you have any real ancestry from South Asia, you are eligible to participate. Partial South Asian ancestry is okay. The countries of origin I count as South Asian are as follows:

  • Afghanistan
  • Bangladesh
  • Bhutan
  • India
  • Maldives
  • Nepal
  • Pakistan
  • Sri Lanka

Note that 2-3% South Asian from Dr. McDonald's BGA or the Dodecad Project does not count as South Asian ancestry.

If you have all four of your grandparents from one of the following countries or regions, you can also send me your data.

  • Burma
  • Tibet
  • Uyghur from Xinjiang, China
  • Tajikistan
  • Kyrgyzstan
  • Kazakhstan
  • Uzbekistan
  • Turkmenistan
  • Iran
  • Turkey
  • Azerbaijan
  • Armenia
  • Georgia
  • North Caucasian Federal District, Russia
  • Iraq
  • Syria
  • Lebanon
  • Jordan

Relatives will only be accepted when they are a better replacement for current participants. For example, replacing a participant with their parents, or with a maternal uncle and a paternal aunt, gets us two unrelated participants (assuming, of course, that the two sides of the family are not related by blood). Another example is a participant of partial South Asian ancestry being replaced by a relative who has more South Asian ancestry.

Everyone else can use DIY HarappaWorld. It's fairly easy to use on both Windows and Linux. The only hard part right now is that you have to install R to standardize your genome file. I might look into creating an executable for that to make it easier.

Finally, please be honest.

HarappaWorld Oracle

Here's the HarappaWorld Oracle to go with the HarappaWorld admixture results and DIYHarappaWorld.

It works similarly to the old Ref3 Harappa Oracle, with a couple of differences. First, there is no panasian switch, since the Pan-Asian dataset is not included in this calculator.

Second, I have added an optional mincount argument. Only groups with at least mincount individuals are used in the Oracle calculation. The default is mincount=2, so by default only groups with 2 or more samples are used to compute your Oracle results.
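The effect of mincount can be sketched in a few lines of R. This is an illustration rather than the Oracle's actual code: it assumes that the trailing number in labels such as `egyptian_behar_12` is the group's sample count, and the helper `filter_by_mincount` and the label `example-group_demo_1` are hypothetical.

```r
# Labels in the "<group>_<dataset>_<N>" style seen in Oracle output;
# the last one is a made-up single-sample group.
groups <- c("egyptian_behar_12", "punjabi-arain_xing_25",
            "yemenese_behar_8", "example-group_demo_1")

# Assumption: the trailing number is the group's sample count
group_n <- as.integer(sub(".*_([0-9]+)$", "\\1", groups))

# Keep only groups with at least `mincount` individuals
filter_by_mincount <- function(labels, counts, mincount = 2) {
  labels[counts >= mincount]
}

kept_default <- filter_by_mincount(groups, group_n)                 # drops the N=1 group
kept_strict  <- filter_by_mincount(groups, group_n, mincount = 10)  # keeps N >= 10 only
```

A higher mincount trades breadth of reference groups for averages that are less dependent on one or two individuals.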

Let's look at my top 20 Oracle results in mixed mode, excluding population groups with fewer than 4 individuals.

HarappaOracle(c(26.46,36.82,14.22,4.78,0.00,1.32,0.86,0.04,0.19,0.06,3.63,8.07,0.00,2.44,0.43,0.67),k=20,mincount=4,mixedmode=T)

[,1] [,2]
[1,] "18.1% egyptian_behar_12 + 81.9% punjabi-arain_xing_25" "2.3361"
[2,] "18.1% egypt_henn2012_19 + 81.9% punjabi-arain_xing_25" "2.5615"
[3,] "80.7% punjabi-arain_xing_25 + 19.3% yemenese_behar_8" "2.8388"
[4,] "18.4% palestinian_hgdp_46 + 81.6% punjabi-arain_xing_25" "2.9944"
[5,] "84.7% punjabi-arain_xing_25 + 15.3% yemen-jew_behar_15" "3.0923"
[6,] "19.1% jordanian_behar_20 + 80.9% punjabi-arain_xing_25" "3.1877"
[7,] "18% egypt_henn2012_19 + 82% sindhi_hgdp_24" "3.4814"
[8,] "17.9% egyptian_behar_12 + 82.1% sindhi_hgdp_24" "3.5554"
[9,] "20.3% jordanian_behar_20 + 79.7% punjabi_harappa_7" "3.6161"
[10,] "18.9% egyptian_behar_12 + 81.1% punjabi_harappa_7" "3.6587"
[11,] "19.5% palestinian_hgdp_46 + 80.5% punjabi_harappa_7" "3.7079"
[12,] "19% egypt_henn2012_19 + 81% punjabi_harappa_7" "3.8303"
[13,] "18.3% palestinian_hgdp_46 + 81.7% sindhi_hgdp_24" "3.8762"
[14,] "80.4% punjabi-arain_xing_25 + 19.6% syrian_behar_16" "3.8908"
[15,] "19% lebanese_behar_7 + 81% punjabi-arain_xing_25" "4.0494"
[16,] "18.9% jordanian_behar_20 + 81.1% sindhi_hgdp_24" "4.078"
[17,] "79.9% punjabi_harappa_7 + 20.1% yemenese_behar_8" "4.1222"
[18,] "15.1% bedouin_hgdp_46 + 84.9% punjabi-arain_xing_25" "4.1522"
[19,] "85.3% punjabi-arain_xing_25 + 14.7% saudi_behar_20" "4.2014"
[20,] "79.1% punjabi_harappa_7 + 20.9% syrian_behar_16" "4.2191"

These results are closer to my actual reported ancestry than the ones from the Ref3 Harappa Oracle.
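For readers wondering what a mixed-mode line such as "18.1% A + 81.9% B" with a distance next to it means, here is a rough sketch of fitting a two-way mix. It assumes plain Euclidean distance between admixture vectors, which this post does not confirm; the function `best_mix` and the three-component profiles are made up for illustration.

```r
# Best two-way mix of reference profiles p and q for a target profile (in percent).
# Assumes Euclidean distance; the Oracle's actual distance measure is not stated here.
best_mix <- function(target, p, q) {
  d <- p - q
  if (all(d == 0)) {
    alpha <- 1                                 # identical references: any split fits equally
  } else {
    alpha <- sum((target - q) * d) / sum(d^2)  # unconstrained least-squares proportion of p
    alpha <- min(max(alpha, 0), 1)             # clamp to a valid proportion
  }
  fitted <- alpha * p + (1 - alpha) * q
  list(alpha = alpha, distance = sqrt(sum((target - fitted)^2)))
}

# Made-up three-component profiles, for illustration only
pop_a      <- c(60, 30, 10)
pop_b      <- c(10, 20, 70)
mix_target <- 0.8 * pop_a + 0.2 * pop_b   # an exact 80/20 mix of the two

fit <- best_mix(mix_target, pop_a, pop_b)  # recovers the 80/20 split with distance near 0
```

Under this reading, a mixed-mode search would run such a fit for every pair of reference groups and list the pairs with the smallest distances first.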

DIY HarappaWorld

Based on Dienekes' instructions, I have created DIYHarappaWorld for anyone to compute their admixture results for my HarappaWorld calculator.

Here's what you need to do:

  1. Download DIYHarappaWorld files and unzip them.
  2. Download DIYDodecad v2.1 (File->Download).
  3. Unpack DIYDodecad2.1.rar by using 7-zip, WinRAR, or Linux rar/unrar command.
  4. Start R and change the working directory to where you have the DIY files.
  5. Enter the following command in R:
    source('standardize.r')
  6. If you have your 23andMe raw data, run the following command in R:
    standardize('genome_john_doe.txt', company='23andMe')

    where genome_john_doe.txt is the filename for your raw data file.

  7. If you have your FTDNA Family Finder data in a file named johndoe.csv, run the following in R:
    standardize('johndoe.csv', company='ftdna')
  8. From your operating system command prompt, run the command for your platform (Windows, 32-bit Linux, or 64-bit Linux, in that order):
    DIYDodecadWin harappaworld.par
    ./DIYDodecadLinux32 harappaworld.par
    ./DIYDodecadLinux64 harappaworld.par
  9. The program will start computing the admixture percentages. It took about 5-10 minutes on my computer.
  10. The best way to understand your results is to compare them with other populations and individuals. Do not take the component names seriously. They do not represent true ancestral populations.
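If you are unsure which of the three commands in step 8 applies to you, a small helper along these lines can suggest one from within R. The function `diy_command` is hypothetical, and it uses the bitness of your R build as a stand-in for the bitness of your operating system, which is an assumption that will not hold on every machine.

```r
# Pick the step 8 command line for a given platform.
# `sysname` is what Sys.info()[["sysname"]] returns; `pointer_bytes` is
# .Machine$sizeof.pointer (8 on a 64-bit R build, 4 on a 32-bit one).
diy_command <- function(sysname = Sys.info()[["sysname"]],
                        pointer_bytes = .Machine$sizeof.pointer,
                        parfile = "harappaworld.par") {
  if (sysname == "Windows") {
    paste("DIYDodecadWin", parfile)
  } else if (sysname == "Linux") {
    binary <- if (pointer_bytes == 8) "./DIYDodecadLinux64" else "./DIYDodecadLinux32"
    paste(binary, parfile)
  } else {
    stop("No command is listed for this platform: ", sysname)
  }
}

cmd <- diy_command("Linux", 8)   # the 64-bit Linux command from step 8
```

Calling `diy_command()` with no arguments uses the values detected for the machine R is running on.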

You can also change the last line of the harappaworld.par file to one of genomewide, bychr, byseg, or target to calculate the admixture percentages for the whole genome, by chromosome, by segment, or for a target region, respectively. Do note that the last three will be noisier.
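If you would rather not edit the file by hand, the change can be scripted. The sketch below is only an illustration: `set_par_mode` is a hypothetical helper, it assumes the mode really is the final line of the file with no blank line after it, and the demonstration runs on a throwaway file with placeholder contents rather than on the real harappaworld.par.

```r
# Replace the last line of a parameter file with one of the four modes.
set_par_mode <- function(parfile,
                         mode = c("genomewide", "bychr", "byseg", "target")) {
  mode <- match.arg(mode)         # rejects anything outside the four modes
  lines <- readLines(parfile)
  lines[length(lines)] <- mode    # assumes the mode is the last line
  writeLines(lines, parfile)
  invisible(lines)
}

# Demonstration on a temporary file (placeholder contents, not a real .par file)
demo <- tempfile(fileext = ".par")
writeLines(c("placeholder line 1", "placeholder line 2", "genomewide"), demo)
set_par_mode(demo, "bychr")
result <- readLines(demo)         # the last line now reads "bychr"
```

Keeping a backup copy of the original .par file before overwriting it is a sensible precaution.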

UPDATE: I should also point out that this DIY calculator will work better for those individuals whose genetic variation was included in computing the admixture model. Those belonging to a group not included at all in the set of samples I used might get somewhat odd results.

HarappaWorld Admixture

Here is a new admixture calculator. It uses populations from all over the world, and I got the best results (i.e., lowest cross-validation error) at K=16.
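Choosing K by cross-validation error boils down to comparing one number per run and taking the smallest. The sketch below assumes log lines in the style the ADMIXTURE program prints when run with its --cv option; the post does not say which software or options were used, and the error values are invented purely to show the mechanics.

```r
# Made-up log lines in the style of ADMIXTURE's --cv output (not real results)
log_lines <- c("CV error (K=14): 0.5125",
               "CV error (K=15): 0.5118",
               "CV error (K=16): 0.5113",
               "CV error (K=17): 0.5121")

# Pull K and the error out of each line
pattern <- "^CV error \\(K=([0-9]+)\\): ([0-9.]+)$"
k   <- as.integer(sub(pattern, "\\1", log_lines))
err <- as.numeric(sub(pattern, "\\2", log_lines))

best_k <- k[which.min(err)]   # K with the lowest cross-validation error
```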

You can see the admixture results for different ethnic groups as well as results for individual (founder-only) project participants.

UPDATE: The population results have been calculated using weighted means.

The group results are also shown in the usual interactive bar chart below. You can click on the component labels to sort by that ancestral component.

Do note that the admixture components do not necessarily represent real ancestral populations. Also, the names I have chosen for the components should be thought of as mnemonics to ease discussion. I chose them based on which populations in my data these components peaked in. They do not tell anything directly about ancestral populations. The best way to look at these admixture results is by comparing individuals and populations.

I used about 188,173 SNPs for this run. However, the results reported above for the Henn2011 (181,223 SNPs for Hadza, Sandawe and San, 26,494 SNPs for other groups), Henn2012 (26,494 SNPs), Reich (48,967 SNPs) and Xing (18,986 SNPs) datasets were calculated using a smaller number of common SNPs. Hence caution should be exercised in interpreting those results.
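To put those counts in perspective, the following snippet turns them into the share of the full SNP set that each dataset had in common. The numbers are the ones quoted above; the vector names are just labels chosen for this example.

```r
# SNP counts quoted in the text
total_snps <- 188173
used_snps <- c(henn2011_hadza_sandawe_san = 181223,
               henn2011_other             = 26494,
               henn2012                   = 26494,
               reich                      = 48967,
               xing                       = 18986)

# Share of the full SNP set available for each dataset, in percent
coverage_pct <- round(100 * used_snps / total_snps, 1)
```

The Xing results, for example, rest on roughly a tenth of the SNPs used for the main run.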

You can also see the Fst distances between the ancestral components.

I should have HarappaWorldOracle and DIYHarappaWorld calculators out in the next few days.

Also, I am working on another calculator which will focus more closely on South Asia.