Category Archives: Dataset - Page 4

Reference Dataset II

Posted by Zack on January 30, 2011 9 comments

Combining my reference population with Xing et al data gets me ~~3,222~~ 3,161 samples but with only about 23,000 SNPs after LD-pruning.

The good thing is that this dataset has 544 South Asian samples from 24 ethnic groups. So it'll be useful for some analyses despite the low number of SNPs. I'll try to run parallel analyses on my reference population and this dataset so we can compare the pros and cons of both.

UPDATE: I removed 61 pygmy and San samples.
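The sample counts quoted in this post can be cross-checked against the numbers from the other posts on this page. The sketch below is only a sanity check. The choice of which Xing et al groups count as South Asian is an assumption made here because it reproduces the quoted totals; the post does not list them.

```python
# Counts quoted in the posts on this page.
reference_before_update = 2693  # reference population before its Pygmy/San removal
xing_samples = 529              # Xing et al samples kept after filtering

combined_before_update = reference_before_update + xing_samples
combined_after_update = combined_before_update - 61  # 61 samples removed in the UPDATE

# ASSUMPTION: these are the Xing et al groups treated as South Asian.
xing_south_asians = {
    "Punjabi Arain": 25,
    "Nepalese": 25,
    "Andhra Pradesh Brahmin": 25,
    "Irula": 23,
    "Tamil Nadu Brahmin": 14,
    "Tamil Nadu Dalit": 13,
    "Andhra Pradesh Mala": 11,
    "Andhra Pradesh Madiga": 10,
}

# The reference population contributes 398 South Asians from 16 groups.
south_asian_samples = 398 + sum(xing_south_asians.values())
south_asian_groups = 16 + len(xing_south_asians)

print(combined_before_update, combined_after_update)
print(south_asian_samples, south_asian_groups)
```

Under that assumption the totals come out to 3,222 and 3,161 samples, with 544 South Asians in 24 groups.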

Admixture: Reference Population

Posted by Zack on January 29, 2011 12 comments

For regular admixture analysis, I am using HapMap, HGDP, SGVP and Behar datasets with some samples removed as I wrote earlier.

For each of these datasets,

I first filtered to keep only the SNPs present on the 23andme v2 chip:
plink --bfile data --extract 23andmev2.snplist
I also filtered for founders:
plink --bfile data --filter-founders
And excluded SNPs with missing rates greater than 1%:
plink --bfile data --geno 0.01
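To make the SNP-list and missingness filters concrete, here is a minimal Python sketch of the idea behind them. It is not how PLINK is implemented. The function names, the dictionary layout and the toy data are all hypothetical.

```python
def extract_snps(genotypes, snp_list):
    """Keep only SNPs named in snp_list (the idea behind --extract)."""
    wanted = set(snp_list)
    return {snp: calls for snp, calls in genotypes.items() if snp in wanted}


def filter_by_missingness(genotypes, max_missing=0.01):
    """Drop SNPs whose missing-call rate exceeds max_missing (the idea behind --geno)."""
    kept = {}
    for snp, calls in genotypes.items():
        missing_rate = sum(call is None for call in calls) / len(calls)
        if missing_rate <= max_missing:
            kept[snp] = calls
    return kept


# Toy data (hypothetical): 200 samples per SNP, None marks a missing call.
toy = {
    "rs1": ["AA"] * 199 + [None],      # 0.5% missing, kept
    "rs2": ["AG"] * 195 + [None] * 5,  # 2.5% missing, dropped by the missingness filter
    "rs3": ["GG"] * 200,               # not on the chip list, dropped by extract
}
chip_snps = ["rs1", "rs2"]

filtered = filter_by_missingness(extract_snps(toy, chip_snps))
print(sorted(filtered))
```

Only rs1 survives both filters in this toy example.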

Then, I merged the datasets one by one. The reason for doing it one by one was that there were conflicts of strand orientation (forward or reverse) between the different datasets. If the merge operation gave an error, I had to flip those strands in one dataset and try the merge again.

plink --bfile data1 --bmerge data2.bed data2.bim data2.fam --make-bed
plink --bfile data2 --flip plink.missnp --make-bed --out data2flip
plink --bfile data1 --bmerge data2flip.bed data2flip.bim data2flip.fam --make-bed
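The strand conflicts can be pictured with a small Python sketch. It only illustrates the logic: a SNP whose alleles disagree between two datasets may agree once one side is complemented. The function names are hypothetical and this is not PLINK's code.

```python
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def flip(alleles):
    """Return the allele set as read from the opposite strand."""
    return frozenset(COMPLEMENT[a] for a in alleles)


def reconcile(alleles1, alleles2):
    """Classify a SNP as 'same', 'flip' (fixable by flipping dataset 2) or 'mismatch'."""
    set1, set2 = frozenset(alleles1), frozenset(alleles2)
    if set1 == set2:
        return "same"
    if flip(set2) == set1:
        return "flip"
    return "mismatch"


print(reconcile("AG", "AG"))  # alleles already agree
print(reconcile("AG", "TC"))  # agree after flipping the second dataset
print(reconcile("AG", "AC"))  # flipping does not help
```

Note that an A/T or C/G SNP always comes out as "same" here, because its allele set is its own complement. A check like this cannot tell which strand such a SNP is on.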

Once all four datasets were merged, I processed the combined data file:

Removed SNPs with a missing rate of more than 1% in the combined dataset:
plink --bfile data --geno 0.01
Then I performed linkage disequilibrium based pruning using a window size of 50 SNPs, a step of 5 SNPs and an r^2 threshold of 0.3:
plink --bfile data --indep-pairwise 50 5 0.3
plink --bfile data --extract plink.prune.in --make-bed
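For readers unfamiliar with LD pruning, the following Python sketch shows the general idea of sliding-window pairwise pruning. It is a simplification and not PLINK's algorithm. It assumes genotypes coded as 0/1/2 allele counts, and it always drops the later SNP of a correlated pair, whereas PLINK has its own rule for choosing which one to remove. All names are hypothetical.

```python
from statistics import mean


def r_squared(x, y):
    """Squared Pearson correlation between two genotype vectors (0/1/2 allele counts)."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # a monomorphic SNP carries no correlation information
    return cov * cov / (var_x * var_y)


def prune(genotypes, window=50, step=5, threshold=0.3):
    """genotypes: per-SNP vectors in chromosomal order. Returns indices of kept SNPs."""
    n = len(genotypes)
    keep = [True] * n
    start = 0
    while start < n:
        in_window = [i for i in range(start, min(start + window, n)) if keep[i]]
        for pos, i in enumerate(in_window):
            if not keep[i]:
                continue
            for j in in_window[pos + 1:]:
                if keep[j] and r_squared(genotypes[i], genotypes[j]) > threshold:
                    keep[j] = False  # simplification: drop the later SNP of the pair
        start += step
    return [i for i in range(n) if keep[i]]


# Toy data (hypothetical): SNP 1 duplicates SNP 0, SNP 2 is uncorrelated with SNP 0.
toy = [
    [0, 1, 2, 0, 1, 2],
    [0, 1, 2, 0, 1, 2],
    [2, 0, 1, 1, 0, 2],
]
print(prune(toy))
```

In the toy data the duplicated SNP is removed and the other two are kept.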

This gave me a reference population of ~~2,693~~ 2,654 individuals with each sample having about 186,000 SNPs. Out of these ~~2,693~~ 2,654 individuals, we have a total of 398 South Asians belonging to 16 ethnic groups.
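These totals follow from the per-dataset counts given in the individual dataset posts. The tally below is a quick check. The list of Behar et al groups counted as South Asian is an assumption made here because it reproduces the quoted 398 and 16; the post does not spell it out.

```python
# Samples kept from each dataset, as quoted in the dataset posts.
dataset_sizes = {
    "HapMap": 1091,
    "HGDP": 876,
    "SGVP": 268,
    "Behar et al": 466 - 8,  # 8 duplicates or near-duplicates excluded
}
merged_total = sum(dataset_sizes.values())
after_update = merged_total - 39  # 39 Pygmy and San samples removed later

# ASSUMPTION: these Behar et al groups are the ones counted as South Asian.
behar_south_asians = {
    "North Kannadi": 9,
    "Sakilli": 4,
    "Paniya": 4,
    "Cochin Jews": 4,
    "Bene Israel": 4,
    "Malayan": 2,
}
south_asian_samples = 190 + 98 + 83 + sum(behar_south_asians.values())
# 8 HGDP groups, HapMap Gujaratis, SGVP Singapore Indians, plus the Behar et al groups
south_asian_groups = 8 + 1 + 1 + len(behar_south_asians)

print(merged_total, after_update)
print(south_asian_samples, south_asian_groups)
```

With those inputs the sketch gives 2,693 and 2,654 samples, with 398 South Asians in 16 groups.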

Finally, it's time to start having some fun!

UPDATE: I removed 39 Pygmy and San samples because they were causing some trouble with African ancestral components. Since we are not interested in detailed African ancestry and African admixture among South Asians is not likely to be pygmy or San, I decided it would be best to remove them.

Xing et al Data

Posted by Zack on January 28, 2011 6 comments

The data for Xing et al's paper "Toward a more uniform sampling of human genetic diversity: a survey of worldwide populations by high-density genotyping" is available online.

This dataset consists of 850 individuals, but 259 of them overlap with the HapMap. Another 15 samples had to be removed because they were too similar to others. I also removed Native American samples. This leaves us with 529 samples.

Ethnic group	Count
Slovenian	25
Punjabi Arain	25
N. European	25
Nepalese	25
Kyrgyzstani	25
Iban	25
Buryat	25
Bambaran	25
Andhra Pradesh Brahmin	25
Kurd	24
Dogon	24
Irula	23
Thai	22
Pygmy	22
Urkarah	18
Tamil Nadu Brahmin	14
Hema	14
Tongan	13
Tamil Nadu Dalit	13
Samoan	13
!Kung	13
Japanese	13
Andhra Pradesh Mala	11
Pedi	10
Andhra Pradesh Madiga	10
Alur	10
Nguni	9
Sotho/Tswana	8
Vietnamese	7
Stalskoe	5
Chinese	5
Khmer Cambodian	3

This dataset is valuable because it contains several South Asian, Central Asian, Southeast Asian and Caucasian groups. However, it does not have a good SNP overlap with 23andme and the other datasets. It has only about 29,000 SNPs in common with 23andme v2 data. Combining HapMap, HGDP, SGVP, Behar et al and Xing et al with 23andme data leaves us with 25,000 SNPs. Due to that, I'll be using Xing et al data for only a few analyses.
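The shrinking SNP count is just repeated set intersection: every dataset added can only remove SNPs from the common set. A small Python sketch with made-up SNP IDs (all names and data here are hypothetical) shows the effect.

```python
from functools import reduce


def common_snps(*snp_sets):
    """Return the SNP IDs present in every one of the given collections."""
    return reduce(set.intersection, (set(s) for s in snp_sets))


# Toy SNP lists (hypothetical IDs, not real chip content).
chip = {"rs1", "rs2", "rs3", "rs4", "rs5", "rs6"}
dense_dataset = {"rs1", "rs2", "rs3", "rs4", "rs5", "rs9"}
sparse_dataset = {"rs1", "rs2", "rs7", "rs8"}

without_sparse = common_snps(chip, dense_dataset)
with_sparse = common_snps(chip, dense_dataset, sparse_dataset)
print(len(without_sparse), len(with_sparse))
```

Adding the sparse dataset cuts the common set from five SNPs to two in this toy case, which is the same trade-off as gaining samples at the cost of SNPs.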

Behar et al Data

Posted by Zack on January 28, 2011 16 comments

In their paper "The genome-wide structure of the Jewish people", Behar et al analyzed the genomes of some Jewish groups. More important than the Jewish samples (which include two South Asian Jewish groups) for us are the different South Asian, Middle Eastern, and European groups they sampled:

Ethnic group	Count
Saudis	20
Jordanians	20
Georgians	20
Turks	19
Iranians	19
Hungarians	19
Ethiopians	19
Armenians	19
Lezgins	18
Chuvashs	17
Syrians	16
Romanians	16
Uzbeks	15
Spaniards	12
Egyptians	12
Cypriots	12
Moroccans	10
Lithuanians	10
North Kannadi	9
Belorussian	9
Yemenese	8
Lebanese	7
Sakilli	4
Paniya	4
Cochin Jews	4
Bene Israel	4
Samaritians	2
Russian	2
Malayan	2

Of the 466 samples, I excluded 8 because they were either duplicates or too similar in their genomes to others.

The series matrix files that I downloaded were in a somewhat different format. To convert them to Plink format, I had to look up the platform file for the Illumina genotyping BeadChip they used. Also, Illumina uses an A/B allele and Top/Bot strand system instead of the regular ACGT alleles and forward/reverse strands. This Illumina Technote explained it, and I found a Perl script to convert between the two.

SGVP

Posted by Zack on January 26, 2011 4 comments

SGVP is the Singapore Genome Variation Project. It sampled the following groups:

Ethnicity	Sample Count	SNP Count
Singapore Chinese	96	1,405,417
Singapore Malay	89	1,402,256
Singapore Indian	83	1,404,699

Singapore Indians are generally likely to be South Indians, especially Tamils.

These 268 samples were easy to convert to Plink format.

HGDP

Posted by Zack on January 25, 2011 3 comments

Human Genome Diversity Project (HGDP) is the best resource for a diverse set of genomic data. It has 1050 individuals from 52 different populations.

I got the Stanford University data, which covers 660,918 SNPs in 1,043 samples. The data is claimed to be on the forward strand, but that turned out not to be true, so I had to flip strands and make sure I didn't include any strand-ambiguous A/T or C/G SNPs in my dataset.
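A/T and C/G SNPs are a problem because their two alleles are complements of each other, so the alleles alone cannot reveal which strand was reported. A minimal Python sketch of screening them out is below. The function names and the toy rows are hypothetical; the row layout follows the six columns of a Plink .bim file (chromosome, SNP ID, genetic distance, base-pair position, allele 1, allele 2).

```python
AMBIGUOUS_PAIRS = {frozenset("AT"), frozenset("CG")}


def is_strand_ambiguous(allele1, allele2):
    """True for A/T and C/G SNPs, which look identical on either strand."""
    return frozenset((allele1.upper(), allele2.upper())) in AMBIGUOUS_PAIRS


def drop_ambiguous(bim_rows):
    """Keep only rows whose alleles (columns 5 and 6) are not strand-ambiguous."""
    return [row for row in bim_rows if not is_strand_ambiguous(row[4], row[5])]


# Toy .bim-style rows (hypothetical SNPs).
toy_bim = [
    ("1", "rsA", "0", "1000", "A", "G"),
    ("1", "rsB", "0", "2000", "A", "T"),
    ("2", "rsC", "0", "3000", "C", "G"),
    ("2", "rsD", "0", "4000", "T", "C"),
]
kept_ids = [row[1] for row in drop_ambiguous(toy_bim)]
print(kept_ids)
```

Here rsB and rsC are dropped, while rsA and rsD remain usable.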

I followed the recommendations of Rosenberg (spreadsheet) in excluding some atypical samples and relatives, leaving me with 940 samples.

I also excluded the Native American samples because we are not interested in them and they are very closely related either due to recent endogamy or ancient bottlenecks. (yeah I had the nerve to write that.)

Of the total of 876 samples, here are the numbers for our populations of interest:

Total South Asians	190
Balochi	24
Brahui	25
Burusho	25
Hazara	22
Kalash	23
Makrani	25
Pathan	22
Sindhi	24

These samples have about 541,560 SNPs in common with 23andme v2.

23andme v3 Data

Posted by Zack on January 25, 2011 1 comment

The results from 23andme's new version 3 chip started coming in yesterday and I have already got three samples of the new chip.

I counted 966,977 SNPs on the new chip. It seems to have about 547,000 SNPs in common with version 2 (which had about 578,000). Also, the version 3 data has about 230,000 SNPs in common with my reference dataset (out of a total of 241,000). Which is a long way of saying that the v3 data is very usable for my project.
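Expressed as percentages, the approximate counts above work out as follows. This is only arithmetic on the rounded figures quoted in the post.

```python
def overlap_share(shared, total):
    """Percentage of `total` SNPs that are also in the other set."""
    return 100.0 * shared / total


# Approximate counts quoted above.
v2_covered_by_v3 = overlap_share(547_000, 578_000)         # v2 SNPs also on v3
reference_covered_by_v3 = overlap_share(230_000, 241_000)  # reference SNPs also on v3

print(f"v2 SNPs also on v3: {v2_covered_by_v3:.1f}%")
print(f"reference SNPs also on v3: {reference_covered_by_v3:.1f}%")
```

Both shares come out at roughly 95%.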

Therefore, if you are from South Asia or neighboring countries and got your spanking new results, please participate and send your data over.

HapMap

Posted by Zack on January 24, 2011 7 comments

I am using several datasets in the public domain for my reference population samples. HapMap is one of those datasets.

According to its website,

The goal of the International HapMap Project is to develop a haplotype map of the human genome, the HapMap, which will describe the common patterns of human DNA sequence variation. The HapMap is expected to be a key resource for researchers to use to find genes affecting health, disease, and responses to drugs and environmental factors. The information produced by the Project will be made freely available.

In the first phase, it genotyped

30 Yoruba adult-and-both-parents trios from Ibadan, Nigeria, 30 trios of U.S. (Utah) residents of northern and western European ancestry, 44 unrelated individuals from Tokyo, Japan and 45 unrelated Han Chinese individuals from Beijing, China.

In their HapMap phase 3 release #3 (NCBI build 36, dbSNP b126), there are 1,397 samples with about 1,457,897 SNPs each.

I removed related individuals as well as individuals whose genomes were too similar. This left me with a total of 1,149 samples with about 474,606 SNPs in common with 23andme's version 2 data.

Since we are not interested in Native American ancestry, I also removed 58 Mexican samples, thus leaving me with 1,091 samples.

Here are the samples I am using from the HapMap data:

Ethnicity	Region	Count
African Americans	Africa	48
European Americans (Utahns)	Europe	111
Han Chinese	East Asia	137
US Chinese	East Asia	106
Gujaratis	South Asia	98
Japanese	East Asia	113
Kenyan Luhya	East Africa	101
Maasai	East Africa	135
Tuscans	Europe	102
Yoruba	West Africa	140

The region assignments are mine to aid me in the analysis, by including/excluding samples by region or by aggregating results by region to find patterns etc.
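As an illustration of aggregating by region, the table above can be summed per region in a few lines of Python. The data structure is just one hypothetical way to hold the table.

```python
from collections import defaultdict

# (ethnicity, region, count) rows copied from the table above.
hapmap_samples = [
    ("African Americans", "Africa", 48),
    ("European Americans (Utahns)", "Europe", 111),
    ("Han Chinese", "East Asia", 137),
    ("US Chinese", "East Asia", 106),
    ("Gujaratis", "South Asia", 98),
    ("Japanese", "East Asia", 113),
    ("Kenyan Luhya", "East Africa", 101),
    ("Maasai", "East Africa", 135),
    ("Tuscans", "Europe", 102),
    ("Yoruba", "West Africa", 140),
]

region_totals = defaultdict(int)
for _ethnicity, region, count in hapmap_samples:
    region_totals[region] += count

for region, total in sorted(region_totals.items()):
    print(region, total)
print("Total", sum(region_totals.values()))
```

The regional totals add up to the 1,091 samples mentioned earlier.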

It was easiest to use the HapMap data since it's available for download in Plink format.

Harappa Ancestry Project

Genetics and South Asia
