Author Archives: Zack - Page 29

SGVP

SGVP is the Singapore Genome Variation Project. It sampled the following groups:

Ethnicity	Sample Count	SNP Count
Singapore Chinese	96	1,405,417
Singapore Malay	89	1,402,256
Singapore Indian	83	1,404,699

Singapore Indians are mostly of South Indian origin, especially Tamil.

These 268 samples were easy to convert to Plink format.

HGDP

Posted by Zack on January 25, 2011 3 comments

The Human Genome Diversity Project (HGDP) is the best resource for a diverse set of genomic data. It has 1,050 individuals from 52 different populations.

I got the Stanford University data, which covers 660,918 SNPs for 1,043 samples. The data is claimed to be on the forward strand, but that turned out not to be true, so I had to flip strands and make sure I didn't include any ambiguous A/T or C/G SNPs in my dataset.
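The strand handling described above can be sketched roughly as follows. This is only an illustration of the general idea, not the code used for this project; the function names and the representation of a SNP as a pair of allele letters are assumptions made for the example.

```python
# Hypothetical helpers illustrating strand handling; not the project's actual code.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def is_ambiguous(a1, a2):
    """True for A/T and C/G SNPs: the two alleles are complements of each
    other, so the pair looks identical on both strands."""
    return COMPLEMENT[a1] == a2


def flip(a1, a2):
    """Return the allele pair as read from the opposite strand."""
    return COMPLEMENT[a1], COMPLEMENT[a2]


def harmonize(alleles, reference):
    """Match a SNP's alleles to a reference allele pair.

    Returns the alleles on the reference strand, or None if the SNP is
    ambiguous or cannot be matched even after flipping.
    """
    if is_ambiguous(*alleles):
        return None  # strand cannot be determined from the alleles alone
    if set(alleles) == set(reference):
        return tuple(alleles)  # already on the reference strand
    flipped = flip(*alleles)
    if set(flipped) == set(reference):
        return flipped  # opposite strand: use the complemented alleles
    return None  # alleles disagree with the reference


print(harmonize(("T", "C"), ("A", "G")))  # a SNP reported on the opposite strand
print(harmonize(("A", "T"), ("A", "T")))  # an ambiguous SNP is dropped
```

An A/T or C/G SNP is dropped rather than flipped because complementing its alleles gives back the same pair, so the alleles alone cannot tell you which strand the data is on.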

I followed the recommendations of Rosenberg (spreadsheet) in excluding some atypical samples and relatives, leaving me with 940 samples.

I also excluded the Native American samples because we are not interested in them and they are very closely related either due to recent endogamy or ancient bottlenecks. (yeah I had the nerve to write that.)

Of the remaining 876 samples, here are the numbers for our populations of interest:

Total South Asians	190
Balochi	24
Brahui	25
Burusho	25
Hazara	22
Kalash	23
Makrani	25
Pathan	22
Sindhi	24

These samples have about 541,560 SNPs in common with 23andme v2.

23andme v3 Data

Posted by Zack on January 25, 2011 1 comment

The results from 23andme's new version 3 chip started coming in yesterday and I have already received three samples from the new chip.

I counted 966,977 SNPs on the new chip. It seems to have about 547,000 SNPs in common with version 2 (which had about 578,000). Also, the version 3 data has about 230,000 SNPs in common with my reference dataset (out of a total of 241,000). Which is a long way of saying that the v3 data is very usable for my project.
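Counting the SNPs shared between a chip and a reference dataset boils down to a set intersection on SNP identifiers. The sketch below assumes the raw data is a tab-separated text file with `#` comment lines and the rsID in the first column; the function names and the toy data are made up for illustration and are not the scripts used for the counts above.

```python
def read_rsids(lines):
    """Collect SNP identifiers from the lines of a raw data file.

    Assumes tab-separated columns with the rsID first, and '#' comment lines.
    """
    rsids = set()
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and header comments
        rsids.add(line.split("\t")[0])
    return rsids


def overlap_fraction(chip, reference):
    """Fraction of reference SNPs that are also present on the chip."""
    return len(chip & reference) / len(reference)


# Toy data only, standing in for a real raw data file and reference SNP list.
chip_lines = [
    "# rsid\tchromosome\tposition\tgenotype",
    "rs1\t1\t100\tAA",
    "rs2\t1\t200\tCT",
    "rs3\t2\t300\tGG",
]
reference = {"rs2", "rs3", "rs4", "rs5"}

chip = read_rsids(chip_lines)
print(len(chip & reference), "shared SNPs")
print(f"{overlap_fraction(chip, reference):.1%} of the reference is covered")
```

Plugging in the rounded figures from this post, 230,000 shared SNPs out of 241,000 in the reference dataset works out to roughly 95% coverage.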

Therefore, if you are from South Asia or neighboring countries and got your spanking new results, please participate and send your data over.

HapMap

Posted by Zack on January 24, 2011 7 comments

I am using several datasets in the public domain for my reference population samples. HapMap is one of those datasets.

According to its website,

The goal of the International HapMap Project is to develop a haplotype map of the human genome, the HapMap, which will describe the common patterns of human DNA sequence variation. The HapMap is expected to be a key resource for researchers to use to find genes affecting health, disease, and responses to drugs and environmental factors. The information produced by the Project will be made freely available.

In the first phase, it genotyped

30 Yoruba adult-and-both-parents trios from Ibadan, Nigeria, 30 trios of U.S. (Utah) residents of northern and western European ancestry, 44 unrelated individuals from Tokyo, Japan and 45 unrelated Han Chinese individuals from Beijing, China.

In their HapMap phase 3 release #3 (NCBI build 36, dbSNP b126), there are 1,397 samples with about 1,457,897 SNPs each.

I removed related individuals as well as individuals whose genomes were too similar. This left me with a total of 1,149 samples with about 474,606 SNPs in common with 23andme's version 2 data.

Since we are not interested in Native American ancestry, I also removed 58 Mexican samples, thus leaving me with 1,091 samples.

Here are the samples I am using from the HapMap data:

Ethnicity	Region	Count
African Americans	Africa	48
European Americans (Utahns)	Europe	111
Han Chinese	East Asia	137
US Chinese	East Asia	106
Gujaratis	South Asia	98
Japanese	East Asia	113
Kenyan Luhya	East Africa	101
Maasai	East Africa	135
Tuscans	Europe	102
Yoruba	West Africa	140

The region assignments are mine to aid me in the analysis, by including/excluding samples by region or by aggregating results by region to find patterns etc.
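As a small illustration of how the region labels help, the table above can be aggregated or filtered by region in a few lines. The rows are copied from the table; the function names are hypothetical and chosen only for this example.

```python
from collections import defaultdict

# (ethnicity, region, sample count) rows copied from the table above.
HAPMAP_SAMPLES = [
    ("African Americans", "Africa", 48),
    ("European Americans (Utahns)", "Europe", 111),
    ("Han Chinese", "East Asia", 137),
    ("US Chinese", "East Asia", 106),
    ("Gujaratis", "South Asia", 98),
    ("Japanese", "East Asia", 113),
    ("Kenyan Luhya", "East Africa", 101),
    ("Maasai", "East Africa", 135),
    ("Tuscans", "Europe", 102),
    ("Yoruba", "West Africa", 140),
]


def counts_by_region(rows):
    """Sum the sample counts for each region."""
    totals = defaultdict(int)
    for _ethnicity, region, count in rows:
        totals[region] += count
    return dict(totals)


def select_regions(rows, regions):
    """Keep only the rows whose region is in the given collection."""
    return [row for row in rows if row[1] in regions]


for region, total in counts_by_region(HAPMAP_SAMPLES).items():
    print(region, total)

# Example: restrict an analysis to the Asian reference populations.
asian_rows = select_regions(HAPMAP_SAMPLES, {"East Asia", "South Asia"})
print(sum(count for _, _, count in asian_rows), "Asian samples")
```

The per-region totals add up to the 1,091 samples mentioned above, which is a handy sanity check on the table.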

It was easiest to use the HapMap data since it's available for download in Plink format.

Participants So Far

Posted by Zack on January 24, 2011 11 comments

While I am analyzing the data, checking for errors and making sure the results I am getting are valid, here is some information about participants till now.

So far, 11 participants have sent me their raw data. Of these eleven, ten have some South Asian ancestry.

The regions/ethnicities they cover are:

Punjab
Bengal
Bihar
Tamil Nadu
Telugu
Anglo-Indian

Of these, Punjabis are the only ones I have multiple samples of. So I definitely need more samples of the other ethnicities. And there are lots of ethnicities/regions I haven't gotten any participants in.

It would be great for this project if we got a few participants from each state/province of India and Pakistan. So if you know someone who is from our target regions and has tested with 23andme, please spread the word.

If you tested with 23andme during their Christmas sale, I am hearing that results will start coming in today.

Welcome!

Posted by Zack on January 20, 2011 1 comment

As several people had asked, I have set up a separate website for the Harappa Ancestry Project here at http://www.harappadna.org/.

I might also cross-post some items from the project on my regular blog.

I have also set up a Facebook page for the Harappa Ancestry Project. Please like it on Facebook so I can get a nice short name for the Facebook page URL.

I have received several samples and will be reporting some analysis results soon. However, I do need lots of participants, so please spread the word.

Cross-posted at Procrastination.

Introduction

Posted by Zack on January 19, 2011 5 comments

I have recently become interested in (some would say obsessed with) genetics. I wrote about getting my DNA test done and there's a lot more about my own results that I plan to bore you with.

One fun application of genetic testing is inferring ancestry: Which ancestral group are you descended from? Can we estimate the admixture of the different population groups you are descended from?

Most DNA testing companies provide information about ancestry and genetic genealogy has taken off. With several genome databases (HapMap, HGDP, etc) and software (like plink, admixture, Structure) publicly available, the days of the genome bloggers are here. And I am trying to be the latest one.

In starting this project, I have been inspired by the Dodecad Ancestry Project by Dienekes Pontikos and the Eurogenes Ancestry Project by David Wesolowski. The catalyst for this project was my friend Razib, whom I bug whenever I need to talk genetics.

What is the Harappa Ancestry Project?
It is a project to analyze (autosomal) genetic data of participants of South Asian origin for the purpose of providing detailed ancestry information. So the focus of the project is on South Asians: Indians, Pakistanis, Bangladeshis and Sri Lankans.

The project will collect 23andme raw genetic data from participants to better understand the ancestry relationships of different South Asian ethnicities.

I have named it after Harappa, an archaeological site of the Indus Valley Civilization in Punjab, Pakistan.

Participation
People of South Asian origin, or from neighboring countries, are eligible to participate. The countries of origin I am accepting are as follows:

Afghanistan
Bangladesh
Bhutan
Burma
India
Iran
Maldives
Nepal
Pakistan
Sri Lanka
Tibet

Right now, I am only accepting raw data samples from people who have tested with 23andme.

Please do not send samples from close relatives. I define close relatives as 2nd cousins or closer. If you have data from yourself and your parents, it might be better to send the samples from your parents (assuming they are not related to each other) and not send your own sample.

If you are unsure if you are eligible to participate, please send me an email (harappa@zackvision.com) to inquire about it before sending off your raw data.

What to send?
Please send your All DNA raw data text file (zipped is better) downloaded from 23andme to harappa@zackvision.com along with ancestral background information about you and all four of your grandparents. Background information would include where they were born, mother tongue, caste/community to which they belonged, etc. Please provide as much ancestry information as possible and try to be specific. Do especially include information about any ancestry from outside South Asia.

Data Privacy
The raw genetic data and ancestry information that you send me will not be shared with anyone.

Your data will be used only for ancestry analysis. No analysis of physical or health/medical traits will be performed.

The individual ancestry analysis published on this blog will be done using an ID of the form HRPnnnn known to only you and me.

What do you get?
All results of ancestry analysis (individual and group) will be posted on this blog under the Harappa Ancestry Project category. This will include admixture analysis as well as clustering into population groups etc.

I suggest you read about Dienekes' analysis on South Asians for an idea about what to expect.

You can access all blog posts related to this project from the Harappa Ancestry Project link on the navigation menu on every page of my website. You can also subscribe to the project feed.

Harappa Ancestry Project

Genetics and South Asia
