
Reference Dataset II

Combining my reference population with the Xing et al data gives me 3,161 samples (3,222 before the update below), but only about 23,000 SNPs remain after LD-pruning.

The good thing is that this dataset has 544 South Asian samples from 24 ethnic groups. So it'll be useful for some analyses despite the low number of SNPs. I'll try to run parallel analyses on my reference population and this dataset so we can compare the pros and cons of both.

UPDATE: I removed 61 Pygmy and San samples.

Admixture: Reference Population

For regular admixture analysis, I am using HapMap, HGDP, SGVP and Behar datasets with some samples removed as I wrote earlier.

For each of these datasets,

  1. I first filtered to keep only the SNPs present on the 23andme v2 chip.
    plink --bfile data --extract 23andmev2.snplist
  2. I also filtered for founders:
    plink --bfile data --filter-founders
  3. And excluded SNPs with missing rates greater than 1%:
    plink --bfile data --geno 0.01
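
The three per-dataset steps above can be chained so that each one reads the output of the previous one. This is a minimal sketch in Python that only builds the command lines and does not run PLINK. The helper name `filter_commands`, the `hgdp` prefix and the output suffixes are hypothetical; the `--make-bed` and `--out` flags are added on the assumption that each step wrote a new binary fileset.

```python
def filter_commands(prefix, snplist="23andmev2.snplist", max_missing=0.01):
    """Return three PLINK command lines, one per filtering step."""
    steps = [
        ("snps", ["--extract", snplist]),         # keep only SNPs on the chip
        ("founders", ["--filter-founders"]),      # keep founders only
        ("geno", ["--geno", str(max_missing)]),   # drop SNPs missing in > 1%
    ]
    commands = []
    current = prefix
    for suffix, flags in steps:
        out = f"{current}_{suffix}"  # hypothetical naming scheme
        commands.append(
            ["plink", "--bfile", current, *flags, "--make-bed", "--out", out]
        )
        current = out  # the next step reads what this step wrote
    return commands


for cmd in filter_commands("hgdp"):
    print(" ".join(cmd))
```

Each list can be passed to `subprocess.run` if PLINK is on the path.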

Then, I merged the datasets one by one. The reason for doing it one by one was that there were conflicts of strand orientation (forward or reverse) between the different datasets. If the merge operation gave an error, I had to flip those strands in one dataset and try the merge again.

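# First attempt: on a strand conflict PLINK stops and lists the offending SNPs in plink.missnp
plink --bfile data1 --bmerge data2.bed data2.bim data2.fam --make-bed
# Flip those SNPs in the second dataset
plink --bfile data2 --flip plink.missnp --make-bed --out data2flip
# Retry the merge with the flipped copy
plink --bfile data1 --bmerge data2flip.bed data2flip.bim data2flip.fam --make-bed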
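
To make the strand conflict concrete, here is a small Python sketch of the comparison behind the flip. It illustrates the idea only and is not how PLINK is implemented; the function names are hypothetical, and it assumes biallelic SNPs with both alleles coded as A, C, G or T.

```python
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def flip(alleles):
    """Complement each allele, i.e. read the SNP off the opposite strand."""
    return frozenset(COMPLEMENT[a] for a in alleles)


def compare_strands(alleles1, alleles2):
    """Classify how one SNP's alleles in dataset 1 relate to dataset 2."""
    a1, a2 = frozenset(alleles1), frozenset(alleles2)
    if a1 == flip(a1) or a2 == flip(a2):
        return "ambiguous"  # A/T or C/G: looks identical on both strands
    if a1 == a2:
        return "same"       # already on the same strand
    if a1 == flip(a2):
        return "flip"       # flipping dataset 2 resolves the conflict
    return "mismatch"       # different alleles, flipping will not help


print(compare_strands("AG", "AG"))
print(compare_strands("AG", "TC"))
print(compare_strands("AT", "AT"))
print(compare_strands("AG", "AC"))
```

The "ambiguous" case is why a flip cannot be verified for A/T and C/G SNPs from the alleles alone.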

Once all the four datasets were merged, I processed the combined data file:

  1. Removed SNPs with a missing rate of more than 1% in the combined dataset
    plink --bfile data --geno 0.01
  2. Then I performed linkage-disequilibrium-based pruning using a window size of 50 SNPs, a step of 5 and an r^2 threshold of 0.3:
    plink --bfile data --indep-pairwise 50 5 0.3
    plink --bfile data --extract plink.prune.in --make-bed
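
For readers unfamiliar with LD pruning, the following Python sketch shows the general idea behind the window, step and r^2 parameters. It is a simplified illustration and not PLINK's actual algorithm: it assumes genotypes coded as 0/1/2 allele counts with no missing values, and it always drops the later SNP of a correlated pair. The function names are hypothetical.

```python
from statistics import mean


def r_squared(x, y):
    """Squared Pearson correlation between two genotype vectors."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0  # monomorphic SNP: correlation undefined, treat as unlinked
    return cov * cov / (vx * vy)


def prune(genotypes, window=50, step=5, threshold=0.3):
    """Return indices of the SNPs kept after greedy pairwise pruning."""
    n = len(genotypes)
    keep = [True] * n
    start = 0
    while start < n:
        # SNPs in the current window that have not been pruned yet
        idx = [i for i in range(start, min(start + window, n)) if keep[i]]
        for pos, i in enumerate(idx):
            if not keep[i]:
                continue
            for j in idx[pos + 1:]:
                if keep[j] and r_squared(genotypes[i], genotypes[j]) > threshold:
                    keep[j] = False  # simplification: drop the later SNP
        start += step  # slide the window forward
    return [i for i in range(n) if keep[i]]


# Three toy SNPs across six samples; SNP 1 duplicates SNP 0.
toy = [
    [0, 1, 2, 0, 1, 2],
    [0, 1, 2, 0, 1, 2],
    [2, 0, 1, 1, 0, 2],
]
print(prune(toy))
```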

This gave me a reference population of 2,654 individuals (2,693 before the update below), with each sample having about 186,000 SNPs. Of these 2,654 individuals, 398 are South Asians belonging to 16 ethnic groups.

Finally, it's time to start having some fun!

UPDATE: I removed 39 Pygmy and San samples because they were causing some trouble with African ancestral components. Since we are not interested in detailed African ancestry and African admixture among South Asians is not likely to be pygmy or San, I decided it would be best to remove them.

Xing et al Data

The data for Xing et al's paper "Toward a more uniform sampling of human genetic diversity: a survey of worldwide populations by high-density genotyping" is available online.

This dataset consists of 850 individuals, but 259 of them overlap with the HapMap. Another 15 samples had to be removed because they were too similar to others. I also removed Native American samples. This leaves us with 529 samples.

Ethnic group Count
Slovenian 25
Punjabi Arain 25
N. European 25
Nepalese 25
Kyrgyzstani 25
Iban 25
Buryat 25
Bambaran 25
Andhra Pradesh Brahmin 25
Kurd 24
Dogon 24
Irula 23
Thai 22
Pygmy 22
Urkarah 18
Tamil Nadu Brahmin 14
Hema 14
Tongan 13
Tamil Nadu Dalit 13
Samoan 13
!Kung 13
Japanese 13
Andhra Pradesh Mala 11
Pedi 10
Andhra Pradesh Madiga 10
Alur 10
Nguni 9
Sotho/Tswana 8
Vietnamese 7
Stalskoe 5
Chinese 5
Khmer Cambodian 3

This dataset is valuable because it contains several South Asian, Central Asian, Southeast Asian and Caucasus groups. However, it does not overlap well with 23andme and the other datasets: it has only about 29,000 SNPs in common with 23andme v2 data, and combining HapMap, HGDP, SGVP, Behar et al and Xing et al with 23andme data leaves us with about 25,000 SNPs. Because of that, I'll be using the Xing et al data for only a few analyses.

Behar et al Data

In their paper "The genome-wide structure of the Jewish people", Behar et al analyzed the genomes of some Jewish groups. More important than the Jewish samples (which include two South Asian Jewish groups) for us are the different South Asian, Middle Eastern, and European groups they sampled:

Ethnic group Count
Saudis 20
Jordanians 20
Georgians 20
Turks 19
Iranians 19
Hungarians 19
Ethiopians 19
Armenians 19
Lezgins 18
Chuvashs 17
Syrians 16
Romanians 16
Uzbeks 15
Spaniards 12
Egyptians 12
Cypriots 12
Moroccans 10
Lithuanians 10
North Kannadi 9
Belorussian 9
Yemenese 8
Lebanese 7
Sakilli 4
Paniya 4
Cochin Jews 4
Bene Israel 4
Samaritans 2
Russian 2
Malayan 2

Of the 466 samples, I excluded 8 because they were either duplicates or too similar in their genomes to others.

The series matrix files that I downloaded were in a somewhat different format. To convert them to Plink format, I had to look up the platform file for the Illumina genotyping BeadChip they used. Also, Illumina uses an A/B allele and TOP/BOT strand system instead of the regular ACGT alleles and forward/reverse strands. This Illumina Technote explained it, and I found a Perl script to convert between the two.
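
The first half of that conversion, turning A/B calls into nucleotide letters, can be sketched in Python as follows. This is not the Perl script mentioned above. It assumes the platform file has already been parsed into the two bases that "A" and "B" stand for at each SNP, and that no-calls appear as `NC` or `--`; both are assumptions about the file layout. The function name is hypothetical.

```python
def ab_to_bases(call, allele_a, allele_b):
    """Translate an Illumina A/B genotype call into nucleotide letters.

    allele_a / allele_b are the bases that "A" and "B" denote for this SNP,
    as looked up in the platform file.
    """
    if call in ("NC", "--", ""):
        return ("0", "0")  # "0" is PLINK's code for a missing allele
    lookup = {"A": allele_a, "B": allele_b}
    return tuple(lookup[c] for c in call)


# Hypothetical SNP where A stands for T and B stands for C.
print(ab_to_bases("AB", "T", "C"))
print(ab_to_bases("BB", "T", "C"))
print(ab_to_bases("NC", "T", "C"))
```

The letters produced this way are still on whatever strand the platform file reports (TOP or BOT), so bringing them to the forward strand is a separate step.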

SGVP

SGVP is the Singapore Genome Variation Project. It sampled the following groups:

Ethnicity | Sample Count | SNP Count
Singapore Chinese | 96 | 1,405,417
Singapore Malay | 89 | 1,402,256
Singapore Indian | 83 | 1,404,699

Singapore Indians are generally likely to be South Indians, especially Tamils.

These 268 samples were easy to convert to Plink format.

HGDP

The Human Genome Diversity Project (HGDP) is the best resource for a diverse set of genomic data. It has 1,050 individuals from 52 different populations.

I got the Stanford University data, which covers 660,918 SNPs from 1,043 samples. It is claimed that the forward strand is given, but that turned out not to be true, so I had to flip strands and make sure I didn't include any strand-ambiguous A/T or C/G SNPs in my dataset.
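
One way to find the strand-ambiguous SNPs is to scan the Plink .bim file, whose six columns are chromosome, SNP ID, genetic distance, position and the two alleles. The Python sketch below does that; the function name and the SNP IDs in the sample lines are hypothetical.

```python
def ambiguous_snp_ids(bim_lines):
    """Return IDs of SNPs whose two alleles are A/T or C/G."""
    ambiguous_pairs = {frozenset("AT"), frozenset("CG")}
    ids = []
    for line in bim_lines:
        fields = line.split()
        if len(fields) < 6:
            continue  # skip blank or malformed lines
        snp_id, a1, a2 = fields[1], fields[4].upper(), fields[5].upper()
        if frozenset((a1, a2)) in ambiguous_pairs:
            ids.append(snp_id)
    return ids


# Hypothetical .bim lines: chromosome, ID, cM, position, allele 1, allele 2
sample_bim = [
    "1 rsA 0 1000 A G",
    "1 rsB 0 2000 A T",
    "2 rsC 0 3000 G C",
    "2 rsD 0 4000 T C",
]
print(ambiguous_snp_ids(sample_bim))
```

Writing the returned IDs to a text file, one per line, gives a list that can be passed to `plink --exclude`.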

I followed the recommendations of Rosenberg (spreadsheet) in excluding some atypical samples and relatives, leaving me with 940 samples.

I also excluded the Native American samples because we are not interested in them and they are very closely related either due to recent endogamy or ancient bottlenecks. (yeah I had the nerve to write that.)

Of the total of 876 samples, here are the numbers for our populations of interest:

Balochi 24
Brahui 25
Burusho 25
Hazara 22
Kalash 23
Makrani 25
Pathan 22
Sindhi 24
Total South Asians 190

These samples have about 541,560 SNPs in common with 23andme v2.

23andme v3 Data

The results from 23andme's new version 3 chip started coming in yesterday and I have already received three samples from the new chip.

I counted 966,977 SNPs on the new chip. It seems to have about 547,000 SNPs in common with version 2 (which had about 578,000). Also, the version 3 data has about 230,000 SNPs in common with my reference dataset (out of a total of 241,000). Which is a long way of saying that the v3 data is very usable for my project.
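
Overlap counts like these come down to intersecting two lists of SNP IDs. Here is a minimal Python sketch with hypothetical rsIDs, followed by the approximate counts quoted above expressed as fractions. The function name is hypothetical.

```python
def snp_overlap(ids_a, ids_b):
    """Return (number of shared SNP IDs, fraction of ids_a that is shared)."""
    a, b = set(ids_a), set(ids_b)
    shared = a & b
    fraction = len(shared) / len(a) if a else 0.0
    return len(shared), fraction


# Toy example with hypothetical rsIDs.
v2 = ["rs1", "rs2", "rs3", "rs4"]
v3 = ["rs2", "rs3", "rs4", "rs5", "rs6"]
print(snp_overlap(v2, v3))

# The approximate counts from the text, as fractions.
print(round(547_000 / 578_000, 3))  # share of v2 SNPs also on v3
print(round(230_000 / 241_000, 3))  # share of reference SNPs also on v3
```

So roughly 95% of both the v2 SNPs and the reference SNPs are also on the v3 chip.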

Therefore, if you are from South Asia or neighboring countries and got your spanking new results, please participate and send your data over.

HapMap

I am using several datasets in the public domain for my reference population samples. HapMap is one of those datasets.

According to its website,

The goal of the International HapMap Project is to develop a haplotype map of the human genome, the HapMap, which will describe the common patterns of human DNA sequence variation. The HapMap is expected to be a key resource for researchers to use to find genes affecting health, disease, and responses to drugs and environmental factors. The information produced by the Project will be made freely available.

In the first phase, it genotyped

30 Yoruba adult-and-both-parents trios from Ibadan, Nigeria, 30 trios of U.S. (Utah) residents of northern and western European ancestry, 44 unrelated individuals from Tokyo, Japan and 45 unrelated Han Chinese individuals from Beijing, China.

In their HapMap phase 3 release #3 (NCBI build 36, dbSNP b126), there are 1,397 samples with about 1,457,897 SNPs each.

I removed related individuals as well as individuals whose genomes were too similar. This left me with a total of 1,149 samples with about 474,606 SNPs in common with 23andme's version 2 data.

Since we are not interested in Native American ancestry, I also removed 58 Mexican samples, thus leaving me with 1,091 samples.

Here are the samples I am using from the HapMap data:

Ethnicity | Region | Count
African Americans | Africa | 48
European Americans (Utahns) | Europe | 111
Han Chinese | East Asia | 137
US Chinese | East Asia | 106
Gujaratis | South Asia | 98
Japanese | East Asia | 113
Kenyan Luhya | East Africa | 101
Maasai | East Africa | 135
Tuscans | Europe | 102
Yoruba | West Africa | 140

The region assignments are mine to aid me in the analysis, by including/excluding samples by region or by aggregating results by region to find patterns etc.
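
As a small illustration of aggregating by region, this Python sketch totals the HapMap sample counts from the table above. The region labels are the ones assigned in the table; the function and variable names are hypothetical.

```python
from collections import defaultdict

# (ethnicity, region, count) rows copied from the HapMap table
HAPMAP = [
    ("African Americans", "Africa", 48),
    ("European Americans (Utahns)", "Europe", 111),
    ("Han Chinese", "East Asia", 137),
    ("US Chinese", "East Asia", 106),
    ("Gujaratis", "South Asia", 98),
    ("Japanese", "East Asia", 113),
    ("Kenyan Luhya", "East Africa", 101),
    ("Maasai", "East Africa", 135),
    ("Tuscans", "Europe", 102),
    ("Yoruba", "West Africa", 140),
]


def count_by_region(rows):
    """Sum sample counts per region."""
    totals = defaultdict(int)
    for _, region, count in rows:
        totals[region] += count
    return dict(totals)


totals = count_by_region(HAPMAP)
for region, count in sorted(totals.items()):
    print(region, count)
print("Total", sum(totals.values()))
```

The grand total matches the 1,091 HapMap samples kept after removing the Mexican samples.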

It was easiest to use the HapMap data since it's available for download in Plink format.