Tag Archives: india

Participation Changes

Now that I have DIY HarappaWorld out, I am changing the participation requirements a little bit with somewhat different requirements for South Asians compared to other regions.

If you have any real ancestry from a South Asian origin, you are eligible to participate. Partial South Asian ancestry is okay. The list of countries of origin I count as South Asian are as follows:

  • Afghanistan
  • Bangladesh
  • Bhutan
  • India
  • Maldives
  • Nepal
  • Pakistan
  • Sri Lanka

Note that 2-3% South Asian from Dr. McDonald's BGA or Dodecad Project does not count as South Asian ancestry.

If you have all four of your grandparents from one of the following countries or regions, you can also send me your data.

  • Burma
  • Tibet
  • Uyghur from Xinjiang, China
  • Tajikistan
  • Kyrgyzstan
  • Kazakhstan
  • Uzbekistan
  • Turkmenistan
  • Iran
  • Turkey
  • Azerbaijan
  • Armenia
  • Georgia
  • North Caucasian Federal District, Russia
  • Iraq
  • Syria
  • Lebanon
  • Jordan

Relatives will only be accepted when they are a better replacement for current participants. For example, replacing a participant by his/her parents or his maternal uncle and paternal aunt gets us two unrelated participants (assuming, of course, that the two sides of the family are not related by blood). Another example could be if a participant is of partial South Asian ancestry and they get replaced by a relative who has more South Asian ancestry.

Everyone else can use DIY HarappaWorld. It's fairly easy to use on both Windows and Linux. The only hard part right now is that you have to install R to standardize your genome file. I might look into creating an executable for that to make it easier.

Finally, please be honest.

Indian Cline

I had used linear regression to estimate Ancestral South Indian (ASI) component from Reference 3 K=11 admixture run. Now here are a couple more exercises along the same lines but much simpler.

Just using the 96 Indian cline samples from Reich et al to compute PCA or admixture doesn't work as the Chenchu separate out in both analyses from the rest. So I added the Utahn White (CEU) samples from HapMap and the Onge from Reich et al.

First, I ran supervised admixture with two ancestral components, Utahn Whites and Onge. Here's the Onge component plotted against Reich et al's ASI estimate along with a linear regression estimate. The correlation between the two is 0.9908.

Second, I ran Principal Component Analysis (PCA) on the Indian cline samples plus Utahn Whites and Onge. Here are the first two PCA dimensions plotted. The first eigenvector explains 4.04% of the total variation and the 2nd explains 1.94%.

The first principal component is mostly along the Indian cline while the second one basically separates the Onge from everyone else.

Using the 1st principal component to estimate ASI, here's the plot with Reich et al's ASI estimate along with a regression line. The correlation between pc1 and ASI is 0.9929.

Note that both these methods work only if the samples are on the Indian cline, i.e., they don't have any other admixture.

And now for comparison, here's the linear regression for the Reference 3 K=11 admixture Onge component and ASI. The correlation here is 0.9949. Note that this is a little different than my previous analysis since I calculated the population averages using only the 96 samples recommended by Reich et al.

Here's a spreadsheet containing the data for these three runs.

There are a couple more tricks I have to figure out some things regarding Ancestral South Indian admixture. Let's hope they provide us some insight.

Participants' Help Needed

A journalist from Times of India has contacted me to talk about the Harappa Ancestry Project. She is interested in talking to some Indian participants about it.http://focuz.ru

If any of you are interested, let me know in comments or by email and I'll forward your contact information to the journalist.

UPDATE: I have sent the contact info of the six people who volunteered.

Reich et al and Pan-Asian Datasets

I got access to the Reich et al (Nature 2009) dataset used in their paper "Reconstructing Indian population history".

It has the following populations:

Aonaga Aus Bhil
Chenchu Great_Andamanese Hallaki
Kamsali Kashmiri_Pandit Kharia
Kurumba Lodi Madiga
Mala Meghawal Naidu
Nysha Onge Sahariya
Santhal Satnami Siddi
Somali Srivastava Tharu
Vaish Velama Vysya

There are 141 individuals with 587,753 SNPs in their dataset which conveniently is in PED format.

Also, Blaise pointed me to the Pan-Asian SNP data used in the Dec 2009 Science paper "Mapping Human Genetic Diversity in Asia".

It includes the following 71 populations:

Maya Auca Quechua Karitiana Pima
Ami Atayal Melanesians Zhuang Han_Cantonese
Hmong Jiamao Jinuo Han_Shanghai Uyghur
Wa Alorese Dayak Javanese Batak_Karo
Lamaholot Lembata Malay Mentawai Manggarai
Kambera Sunda Batak_Toba Toraja Andhra_Pradesh
Karnataka Bengali-Assamese Rajasthan Uttaranchal Uttar Pradesh
Haryana Spiti Bhili Marathi Japanese
Ryukyuan Korean Bidayuh Jehai Kelantan
Kensiu Temuan Ayta Agta Ati
Iraya Minanubu Mamanwa Filipino Singapore_Chinese
Singapore_Indian Singapore_Malay Hmong (Miao) Karen Lawa
Mlabri Mon Paluang Plang Tai_Khuen
Tai_Lue H'tin Tai_Yuan Tai_Yong Yao
Hakka Minnan

It has 1,719 individuals with 54,794 SNPs. I wish it had more SNPs considering the wealth of populations.

Also, the Pan-Asian data is in the form of minor allele counts, so I need to convert that back to A/C/G/T. Since there are some HapMap populations included in the dataset, that shouldn't be too hard.

I am going to include both these datasets into my big reference set.

Singapore Indians

In the South Asian PCA plot, we saw that Singapore Indian samples from the SGVP dataset had a lot of diversity. Let's zoom into that plot so it's not dominated by the distinctiveness of the Kalash.

Eigenvector 1 explains 1.45 times the variation compared to eigenvector 2.

We see that Singapore Indians are spread in the whole region from Sindhis to North Kanaddi.

Now let's look at the individual admixture results (at K=12 ancestral populations) for the Singapore Indians. I have added some South Asian reference population averages so you can place them in context.

You can click on the legend to the right of the bar chart to sort by different ancestral components.

From these results, a majority of the Singapore Indian samples look South Indian but there are definitely a few from the northwest of the subcontinent (Punjabis or Sindhis?) There are also a few who could be from the Hindi belt.

There are 2-3 samples who have a significant amount of Southeast Asian. Could they be originally from Bengal? Or could they have partial Singapore Malay ancestry?

SGVP

SGVP is the Singapore Genome Variation Project. It sampled the following groups:

Ethnicity Sample Count SNP Count
Singapore Chinese 96 1,405,417
Singapore Malay 89 1,402,256
Singapore Indian 83 1,404,699

Singapore Indians are generally likely to be South Indians, especially Tamils.

These 268 samples were easy to convert to Plink format

Introduction

I have become interested (some would say obsessed) with genetics recently. I wrote about getting my DNA test done and there's a lot more about my own results that I plan to bore you with.

One fun application of genetic testing is inferring ancestry: Which ancestral group are you descended from? Can we estimate the admixture of the different population groups you are descended from?

Most DNA testing companies provide information about ancestry and genetic genealogy has taken off. With several genome databases (HapMap, HGDP, etc) and software (like plink, admixture, Structure) publicly available, the days of the genome bloggers are here. And I am trying to be the latest one.

In starting this project, I have been inspired by the Dodecad Ancestry Project by Dienekes Pontikos and Eurogenes Ancestry Project by David Wesolowski. The catalyst for this project was my friend Razib who I bug whenever I need to talk genetics.

What is Harappa Ancestry Project?
It is a project to analyze (autosomal) genetic data of participants of South Asian origin for the purpose of providing detailed ancestry information. So the focus of the project is on South Asians: Indians, Pakistanis, Bangladeshis and Sri Lankans.

The project will collect 23andme raw genetic data from participants to better understand the ancestry relationships of different South Asian ethnicities.

I have named it after Harappa, an archaeological site of the Indus Valley Civilization in Punjab, Pakistan.

Participation
People of South Asian origin, or from neighboring countries, are eligible to participate. The list of countries of origin I am accepting are as follows:

  • Afghanistan
  • Bangladesh
  • Bhutan
  • Burma
  • India
  • Iran
  • Maldives
  • Nepal
  • Pakistan
  • Sri Lanka
  • Tibet

Right now, I am only accepting raw data samples from people who have tested with 23andme.

Please do not send samples from close relatives. I define close relatives as 2nd cousins or closer. If you have data from yourself and your parents, it might be better to send the samples from your parents (assuming they are not related to each other) and not send your own sample.

If you are unsure if you are eligible to participate, please send me an email (harappa@zackvision.com) to inquire about it before sending off your raw data.

What to send?
Please send your All DNA raw data text file (zipped is better) downloaded from 23andme to harappa@zackvision.com along with ancestral background information about you and all four of your grandparents. Background information would include where they were born, mother tongue, caste/community to which they belonged, etc. Please provide as much ancestry information as possible and try to be specific. Do especially include information about any ancestry from outside South Asia.

Data Privacy
The raw genetic data and ancestry information that you send me will not be shared with anyone.

Your data will be used only for ancestry analysis. No analysis of physical or health/medical traits will be performed.

The individual ancestry analysis published on this blog will be done using an ID of the form HRPnnnn known to only you and me.

What do you get?
All results of ancestry analysis (individual and group) will be posted on this blog under the Harappa Ancestry Project category. This will include admixture analysis as well as clustering into population groups etc.

I suggest you read about Dienekes' analysis on South Asians for an idea about what to expect.

You can access all blog posts related to this project from the Harappa Ancestry Project link on the navigation menu on every page of my website. You can also subscribe to the project feed.