Austroasiatic Dataset

Razib pointed me to the paper "Population Genetic Structure in Indian Austroasiatic speakers: The Role of Landscape Barriers and Sex-specific Admixture" by Gyaneshwer Chaubey, Mait Metspalu, Ying Choi, Reedik Mägi, Irene Gallego Romero, Pedro Soares, Mannis van Oven, Doron M. Behar, Siiri Rootsi, Georgi Hudjashov, Chandana Basu Mallick, Monika Karmin, Mari Nelis, Jüri Parik, Alla Goverdhana Reddy, Ene Metspalu, George van Driem, Yali Xue, Chris Tyler-Smith, Kumarasamy Thangaraj, Lalji Singh, Maido Remm, Martin B. Richards, Marta Mirazon Lahr, Manfred Kayser, Richard Villems and Toomas Kivisild about 36 hours ago, and I now have their dataset.

I have been told that the data will hopefully be in the NCBI GEO database soon.

There are a total of 41 samples with 527,319 SNPs in the data. There are Bonda, Savara, Juang and Gadaba from Orissa; Santhal and Asur from Jharkhand; Kharia from Chhattisgarh; Ho from Bihar; Khasi and Garo from Meghalaya; and some (15) Burmese.

PS. I have created a separate page for references where I link to the papers which led to the datasets I am using.

Reich et al and Pan-Asian Datasets

I got access to the Reich et al (Nature 2009) dataset used in their paper "Reconstructing Indian population history".

It has the following populations:

Aonaga Aus Bhil
Chenchu Great_Andamanese Hallaki
Kamsali Kashmiri_Pandit Kharia
Kurumba Lodi Madiga
Mala Meghawal Naidu
Nysha Onge Sahariya
Santhal Satnami Siddi
Somali Srivastava Tharu
Vaish Velama Vysya

There are 141 individuals with 587,753 SNPs in their dataset, which is conveniently in PED format.

Also, Blaise pointed me to the Pan-Asian SNP data used in the Dec 2009 Science paper "Mapping Human Genetic Diversity in Asia".

It includes the following 71 populations:

Maya Auca Quechua Karitiana Pima
Ami Atayal Melanesians Zhuang Han_Cantonese
Hmong Jiamao Jinuo Han_Shanghai Uyghur
Wa Alorese Dayak Javanese Batak_Karo
Lamaholot Lembata Malay Mentawai Manggarai
Kambera Sunda Batak_Toba Toraja Andhra_Pradesh
Karnataka Bengali-Assamese Rajasthan Uttaranchal Uttar_Pradesh
Haryana Spiti Bhili Marathi Japanese
Ryukyuan Korean Bidayuh Jehai Kelantan
Kensiu Temuan Ayta Agta Ati
Iraya Minanubu Mamanwa Filipino Singapore_Chinese
Singapore_Indian Singapore_Malay Hmong (Miao) Karen Lawa
Mlabri Mon Paluang Plang Tai_Khuen
Tai_Lue H'tin Tai_Yuan Tai_Yong Yao
Hakka Minnan

It has 1,719 individuals with 54,794 SNPs. I wish it had more SNPs considering the wealth of populations.

Also, the Pan-Asian data is in the form of minor allele counts, so I need to convert that back to A/C/G/T. Since there are some HapMap populations included in the dataset, that shouldn't be too hard.
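A minimal sketch of that conversion, assuming each genotype is coded as the number of copies of the minor allele (0, 1 or 2, with anything else treated as missing) and that the minor and major allele letters for each SNP are already known, for instance from the HapMap populations in the data. The function name and the "0" missing code (borrowed from the PED convention) are my own choices for illustration, not part of the Pan-Asian release, and strand would still need to be checked separately.

```python
def count_to_genotype(minor_count, minor, major, missing="0"):
    """Turn a minor allele count into a pair of allele letters."""
    if minor_count == 0:
        return (major, major)
    if minor_count == 1:
        # Allele order within a heterozygote is arbitrary for unphased data.
        return (minor, major)
    if minor_count == 2:
        return (minor, minor)
    # Anything else (e.g. -1 or None) is treated as a missing call.
    return (missing, missing)


# Hypothetical SNP where A is the minor allele and G is the major allele.
counts = [0, 1, 2, -1]
genotypes = [" ".join(count_to_genotype(c, "A", "G")) for c in counts]
print(genotypes)
```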

I am going to add both of these datasets to my big reference set.

Isopleths

Simranjit has done a great job of creating some maps showing the distribution of the various ancestral components at K=16. He has posted them on DNA Forums and sent them to me.

The gradation is from dark green (low) to dark red (high) for most of them.

For each component, the percentages are divided into 32 equal intervals to create the contour effect. Note that the colors represent relative values, not absolute ones.
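To make the "32 equal intervals" idea concrete, here is a small sketch. It assumes the intervals span the lowest to the highest observed value of a component, which is my reading of "relative values"; the exact scaling Simranjit used is not stated. The function name and the sample percentages are hypothetical.

```python
def to_interval(value, lowest, highest, n_intervals=32):
    """Map a percentage onto one of n equal-width intervals (0-based)."""
    if highest == lowest:
        return 0
    fraction = (value - lowest) / (highest - lowest)
    # The maximum itself falls in the last interval rather than a new one.
    return min(int(fraction * n_intervals), n_intervals - 1)


# Hypothetical percentages of one component in a handful of populations.
percentages = [2.0, 10.0, 33.5, 65.0]
lowest, highest = min(percentages), max(percentages)
levels = [to_interval(p, lowest, highest) for p in percentages]
print(levels)
```

Because the scale is stretched between the observed minimum and maximum, the same color on two different maps does not mean the same percentage.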

Here is C1 South Asian:

C2 Balochistan/Caucasus:

C5 Southwest Asian:

C6 European:

C12 Siberian:

Great job, Simranjit!

Reference I Admixture Analysis K=16

Continuing with Reference I admixture analysis, here is the results spreadsheet.

You can click on the legend to the right of the bar chart to sort by different ancestral components.

If you can't see the interactive chart above, here's a static image.

C1 South Asian C2 Balochistan/Caucasus
C3 Kalash C4 Southeast Asian
C5 Southwest Asian C6 European
C7 Melanesian C8 Naxi/Yi
C9 Japanese C10 Papuan
C11 She C12 Siberian
C13 Eastern Bantu C14 Northwest African
C15 West African C16 East African

The results are starting to break down now, with the East Asian components splitting further. The usefulness of higher values of K is doubtful. I am going to run K=17 on this dataset and then focus on more filtered data.

Fst divergences between estimated populations for K=16:

Here are the Fst numbers:

     C1     C2     C3     C4     C5     C6     C7     C8     C9     C10    C11    C12    C13    C14    C15
C2   0.053
C3   0.064  0.060
C4   0.076  0.112  0.123
C5   0.073  0.056  0.085  0.130
C6   0.064  0.040  0.073  0.118  0.048
C7   0.164  0.200  0.215  0.165  0.217  0.206
C8   0.087  0.122  0.133  0.045  0.140  0.127  0.181
C9   0.081  0.117  0.128  0.036  0.135  0.122  0.172  0.021
C10  0.184  0.222  0.237  0.200  0.238  0.227  0.145  0.215  0.207
C11  0.083  0.119  0.130  0.023  0.137  0.125  0.171  0.025  0.017  0.209
C12  0.086  0.114  0.127  0.063  0.133  0.118  0.189  0.048  0.041  0.221  0.048
C13  0.145  0.153  0.177  0.181  0.156  0.162  0.257  0.192  0.186  0.275  0.188  0.191
C14  0.079  0.063  0.096  0.127  0.052  0.056  0.211  0.138  0.132  0.232  0.134  0.132  0.116
C15  0.153  0.162  0.186  0.189  0.166  0.172  0.265  0.201  0.195  0.283  0.197  0.200  0.013  0.122
C16  0.106  0.108  0.135  0.145  0.106  0.116  0.223  0.156  0.150  0.241  0.152  0.154  0.034  0.079  0.041
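
The table above is lower-triangular, so each pair of components appears only once. A rough sketch of how it could be read into a symmetric lookup, using just the first four components of the K=16 numbers; the function name and data layout are my own, for illustration only.

```python
def read_lower_triangle(first_label, rows):
    """Build a symmetric {(a, b): fst} lookup from lower-triangular rows.

    Each row is (label, values), where the values run across the earlier
    labels in order, as in the Fst table.
    """
    labels = [first_label] + [label for label, _ in rows]
    fst = {}
    for label, values in rows:
        # zip stops at the shorter sequence, i.e. at the earlier labels.
        for other, value in zip(labels, values):
            fst[(label, other)] = value
            fst[(other, label)] = value
    return labels, fst


rows = [
    ("C2", [0.053]),
    ("C3", [0.064, 0.060]),
    ("C4", [0.076, 0.112, 0.123]),
]
labels, fst = read_lower_triangle("C1", rows)

# Look at each unordered pair once and pick the smallest divergence.
closest = min((pair for pair in fst if pair[0] < pair[1]), key=fst.get)
print(closest, fst[closest])
```

With the full table loaded the same way, any pair can be looked up in either order, for example fst[("C9", "C11")] or fst[("C11", "C9")].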

PS. This was run using Admixture version 1.04 so I can make an apples-to-apples comparison with the previous runs.

One PED File to Rule Them All

I am interested in North African populations because of my own heritage. So when Razib alerted me that Henn et al had a paper out about the southern African origins of humans, and that their publicly available dataset included populations from all over Africa, I immediately downloaded it.

I have also been considering looking into the East Asian admixture in South Asians and Iranians in some detail to see where it originates from: Southeast Asia, Chinese/Japanese/Koreans, or the Turkic/Mongolian/Siberian populations of interior northeastern Asia. At a quick glance, Razib is correct:

The eastern Asian components are enriched among Bengalis, as you’d expect, but they’re found in different proportions among many individuals who hail from the northern fringe of South Asia more generally. It seems clear that the further west you go, the more likely the “eastern” element is going to be Turk, while the further east (and to some extent south) the more likely it is to be more southernly in provenance.

To do a better job, though, I need more than the Yakut as an exemplar of the Siberian component, which is all I have used until now. Therefore, I downloaded the arctic populations dataset from Rasmussen et al.

Combining Henn et al and Rasmussen et al with my previous datasets (HapMap, HGDP, SGVP, Behar et al and Xing et al), I got 3,970 samples with a total of 1,716,031 SNPs represented, though at 99% genotyping rate it gets reduced to about 27,000 SNPs.
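
The drop in SNP count comes from the genotyping-rate cutoff: a SNP that is not on every array is missing for all the samples of the datasets that lack it. A minimal sketch of what a 99% cutoff does, assuming genotypes are stored per SNP as a list of calls with "0 0" for missing. This is only an illustration with hypothetical SNP names and made-up data, not the actual PLINK run.

```python
def call_rate(calls, missing="0 0"):
    """Fraction of samples with a non-missing genotype at one SNP."""
    called = sum(1 for call in calls if call != missing)
    return called / len(calls)


def filter_by_call_rate(genotypes_by_snp, min_rate=0.99):
    """Return the SNPs whose call rate is at least min_rate."""
    return [
        snp for snp, calls in genotypes_by_snp.items()
        if call_rate(calls) >= min_rate
    ]


# Hypothetical merged data for 200 samples, 100 from each of two datasets.
genotypes_by_snp = {
    "snp_on_both_arrays": ["A G"] * 200,
    # Absent from the second dataset, so half the samples are missing.
    "snp_on_one_array": ["C C"] * 100 + ["0 0"] * 100,
    # Typed everywhere, but with a few failed calls.
    "snp_with_failures": ["T T"] * 195 + ["0 0"] * 5,
}
kept = filter_by_call_rate(genotypes_by_snp)
print(kept)
```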

I did not remove any populations or individuals except for any duplicates and non-founders.
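
As an illustration of that kind of filter, here is a sketch over PED lines, where the first six columns are family ID, individual ID, paternal ID, maternal ID, sex and phenotype. It assumes a founder is a sample with both parental IDs set to 0 and that a duplicate is a repeated individual ID; duplicates that carry different IDs in different datasets would need a genotype comparison instead. The function name and the sample lines are hypothetical.

```python
def keep_unique_founders(ped_lines):
    """Keep the first line per individual ID, skipping non-founders."""
    seen = set()
    kept = []
    for line in ped_lines:
        fields = line.split()
        individual, father, mother = fields[1], fields[2], fields[3]
        if father != "0" or mother != "0":
            continue  # a parent is recorded, so this is not a founder
        if individual in seen:
            continue  # same individual ID already kept
        seen.add(individual)
        kept.append(line)
    return kept


# Hypothetical PED lines with two SNPs each.
ped_lines = [
    "FAM1 ID1 0 0 1 -9 A A G G",
    "FAM1 ID2 ID1 ID3 2 -9 A G G G",  # child of ID1 and ID3
    "FAM2 ID1 0 0 1 -9 A A G G",      # ID1 again, from another dataset
    "FAM3 ID4 0 0 2 -9 A G G G",
]
kept_ids = [line.split()[1] for line in keep_unique_founders(ped_lines)]
print(kept_ids)
```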

Here's the information on the populations represented in this dataset.

Now I am on the lookout for more datasets that are public, have enough SNPs in common with this set and can easily be converted into the Plink PED format. So if you know of any, let me know. Maybe I will have the biggest and most diverse dataset with your help.

Singapore Indians

In the South Asian PCA plot, we saw that Singapore Indian samples from the SGVP dataset had a lot of diversity. Let's zoom into that plot so it's not dominated by the distinctiveness of the Kalash.

Eigenvector 1 explains 1.45 times the variation compared to eigenvector 2.

We see that Singapore Indians are spread across the whole region from Sindhis to North Kannadi.

Now let's look at the individual admixture results (at K=12 ancestral populations) for the Singapore Indians. I have added some South Asian reference population averages so you can place them in context.
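
A reference population average is just the mean of each component over the samples of that population. A minimal sketch, assuming the individual results are available as (population, proportions) pairs; the function name and the K=3 numbers below are made up for illustration and are not actual results.

```python
from collections import defaultdict


def population_averages(samples):
    """Average admixture proportions per population.

    samples is a list of (population, [proportions]) tuples, one per
    individual, with the same number of components in every list.
    """
    totals = {}
    counts = defaultdict(int)
    for population, proportions in samples:
        if population not in totals:
            totals[population] = [0.0] * len(proportions)
        for i, p in enumerate(proportions):
            totals[population][i] += p
        counts[population] += 1
    return {
        pop: [total / counts[pop] for total in sums]
        for pop, sums in totals.items()
    }


# Hypothetical individual results with three components.
samples = [
    ("Sindhi", [0.5, 0.25, 0.25]),
    ("Sindhi", [0.25, 0.5, 0.25]),
    ("North Kannadi", [0.75, 0.25, 0.0]),
]
averages = population_averages(samples)
print(averages)
```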

You can click on the legend to the right of the bar chart to sort by different ancestral components.

From these results, a majority of the Singapore Indian samples look South Indian, but there are definitely a few from the northwest of the subcontinent (Punjabis or Sindhis?). There are also a few who could be from the Hindi belt.

There are 2-3 samples with a significant amount of Southeast Asian ancestry. Could they be originally from Bengal? Or could they have partial Singapore Malay ancestry?

Reference I Admixture Analysis K=15

Continuing with Reference I admixture analysis, here is the results spreadsheet.

You can click on the legend to the right of the bar chart to sort by different ancestral components.

If you can't see the interactive chart above, here's a static image.

C1 South Asian C2 Balochistan/Caucasus
C3 Kalash C4 Southeast Asian
C5 Southwest Asian C6 European
C7 Melanesian C8 Japanese
C9 Siberian C10 Papuan
C11 Chinese C12 Eastern Bantu
C13 Northwest African C14 West African
C15 East African

The new Northwest African component is mostly Mozabite, though it is present among Moroccans too.

Fst divergences between estimated populations for K=15:

PS. This was run using Admixture version 1.04 so I can make an apples-to-apples comparison with the previous runs.

UPDATE: Here are the Fst numbers:

     C1     C2     C3     C4     C5     C6     C7     C8     C9     C10    C11    C12    C13    C14
C2   0.053
C3   0.064  0.060
C4   0.081  0.116  0.129
C5   0.073  0.056  0.085  0.135
C6   0.065  0.040  0.073  0.123  0.048
C7   0.164  0.200  0.215  0.171  0.217  0.205
C8   0.080  0.116  0.128  0.035  0.135  0.122  0.172
C9   0.084  0.113  0.126  0.064  0.133  0.117  0.188  0.040
C10  0.184  0.222  0.237  0.208  0.238  0.227  0.145  0.207  0.219
C11  0.083  0.119  0.130  0.030  0.137  0.125  0.173  0.014  0.044  0.209
C12  0.145  0.153  0.177  0.185  0.156  0.162  0.257  0.186  0.190  0.275  0.188
C13  0.079  0.063  0.096  0.132  0.052  0.056  0.210  0.132  0.131  0.232  0.135  0.116
C14  0.153  0.162  0.186  0.194  0.166  0.172  0.265  0.195  0.199  0.283  0.197  0.013  0.122
C15  0.106  0.108  0.135  0.149  0.106  0.116  0.223  0.150  0.153  0.241  0.152  0.034  0.079  0.041

Reference II Admixture Analysis K=16

Continuing with Reference II admixture analysis, here is the results spreadsheet.

You can click on the legend to the right of the bar chart to sort by different ancestral components.

If you can't see the interactive chart above, here's a static image.

C1 South Asian C2 Balochistan/Caucasus
C3 Irula C4 Kalash
C5 European C6 Southwest Asian
C7 Southeast Asian C8 Chinese
C9 Polynesian C10 Siberian
C11 Papuan C12 Japanese
C13 Eastern Bantu C14 Bushman
C15 East African C16 West African

Fst divergences dendrogram between estimated ancestral populations for K=16:

PS. This was run using Admixture version 1.04 so I can make an apples-to-apples comparison with the previous runs.

Reference II Admixture Analysis K=15

Continuing with Reference II admixture analysis, here is the results spreadsheet.

You can click on the legend to the right of the bar chart to sort by different ancestral components.

If you can't see the interactive chart above, here's a static image.

C1 South Asian C2 Balochistan/Caucasus
C3 Kalash C4 Southwest Asian
C5 European C6 Southeast Asian
C7 Chinese C8 Polynesian
C9 Siberian C10 Papuan
C11 Japanese C12 Eastern Bantu
C13 Bushman C14 East African
C15 West African

Fst divergences dendrogram between estimated ancestral populations for K=15:

PS. This was run using Admixture version 1.04 so I can make an apples-to-apples comparison with the previous runs.

Reference II Admixture Analysis K=14

Continuing with Reference II admixture analysis, here is the results spreadsheet.

You can click on the legend to the right of the bar chart to sort by different ancestral components.

If you can't see the interactive chart above, here's a static image.

C1 South Asian C2 Balochistan/Caucasus
C3 Kalash C4 Southwest Asian
C5 European C6 Southeast Asian
C7 Chinese C8 Polynesian
C9 Siberian C10 Papuan
C11 Japanese C12 West African
C13 East African C14 Bushman

Fst divergences dendrogram between estimated ancestral populations for K=14:

PS. This was run using Admixture version 1.04 so I can make an apples-to-apples comparison with the previous runs.