Author Archives: Zack - Page 23

Balochistan/Caucasian

There has been some discussion in the comments about the C2 ancestral component at K=12 admixture runs which I called Pakistani/Caucasian.

First of all, we should remember that these "names" of ancestral populations are just rough mnemonics. They are chosen based on the frequencies of the component among modern reference samples. So the names have nothing at all to do with history.
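To make the naming convention concrete, here is a minimal sketch of how a mnemonic label could be derived from the populations where a component peaks. The population names (`pop_A` etc.), the numbers, and the helper functions `peak_populations` and `mnemonic_label` are all invented for illustration; they are not actual results or the procedure used for these runs.

```python
# Toy population averages (fraction of each component per population).
# The population names and numbers are invented for illustration only.
toy_averages = {
    "pop_A": {"C1": 0.70, "C2": 0.30},
    "pop_B": {"C1": 0.20, "C2": 0.80},
    "pop_C": {"C1": 0.45, "C2": 0.55},
    "pop_D": {"C1": 0.90, "C2": 0.10},
}


def peak_populations(averages, component, top=2):
    """Return the `top` populations where `component` is most frequent."""
    ranked = sorted(
        averages,
        key=lambda pop: averages[pop].get(component, 0.0),
        reverse=True,
    )
    return ranked[:top]


def mnemonic_label(averages, component, top=2):
    """Join the peak populations into a rough label; it implies nothing about history."""
    return "/".join(peak_populations(averages, component, top))


for component in ("C1", "C2"):
    print(component, mnemonic_label(toy_averages, component))
```

The label only reflects where the component is most frequent in the chosen reference set, so adding or removing reference populations can change it.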

In the case of the Pakistani/Caucasian component, I wanted to emphasize the peaks of the component in Pakistan and the Caucasus. As commenters pointed out, the component is also quite high among the Iranians.

However, I have realized that this name, Pakistani/Caucasian, is a hindrance rather than a help in understanding the Admixture results. The component is also lower among the Pathans, Sindhis, and Punjabis than among Iranians and others. The Pakistani part of the name is therefore a bit of a misnomer, considering that the Pakistani populations in which it is high make up only about 5% of the country's population.

On the other hand, I do not like the name "Iranian" for this component. While it was suggested based on the geographical Iranian plateau, which extends from the Caucasus to Balochistan, it is still confusing and it doesn't emphasize the peak areas.

Thus, I have renamed "Pakistani/Caucasian" as "Balochistan/Caucasus". I didn't use the shorter Baloch as this component is equally high among the Baloch, Brahui and Makrani, all populations living in the province of Balochistan.

Reference I Admixture Analysis K=16

Continuing with Reference I admixture analysis, here is the results spreadsheet.

You can click on the legend to the right of the bar chart to sort by different ancestral components.

If you can't see the interactive chart above, here's a static image.

  • C1: South Asian
  • C2: Balochistan/Caucasus
  • C3: Kalash
  • C4: Southeast Asian
  • C5: Southwest Asian
  • C6: European
  • C7: Melanesian
  • C8: Naxi/Yi
  • C9: Japanese
  • C10: Papuan
  • C11: She
  • C12: Siberian
  • C13: Eastern Bantu
  • C14: Northwest African
  • C15: West African
  • C16: East African

Things are breaking down now, with the East Asian components breaking up. The usefulness of higher K's is doubtful. I am going to run K=17 on this dataset and then focus on more filtered data.

Fst divergences between estimated populations for K=16:

Here are the Fst numbers:

C1 C2 C3 C4 C5 C6 C7 C8 C9 C10 C11 C12 C13 C14 C15
C2 0.053
C3 0.064 0.060
C4 0.076 0.112 0.123
C5 0.073 0.056 0.085 0.130
C6 0.064 0.040 0.073 0.118 0.048
C7 0.164 0.200 0.215 0.165 0.217 0.206
C8 0.087 0.122 0.133 0.045 0.140 0.127 0.181
C9 0.081 0.117 0.128 0.036 0.135 0.122 0.172 0.021
C10 0.184 0.222 0.237 0.200 0.238 0.227 0.145 0.215 0.207
C11 0.083 0.119 0.130 0.023 0.137 0.125 0.171 0.025 0.017 0.209
C12 0.086 0.114 0.127 0.063 0.133 0.118 0.189 0.048 0.041 0.221 0.048
C13 0.145 0.153 0.177 0.181 0.156 0.162 0.257 0.192 0.186 0.275 0.188 0.191
C14 0.079 0.063 0.096 0.127 0.052 0.056 0.211 0.138 0.132 0.232 0.134 0.132 0.116
C15 0.153 0.162 0.186 0.189 0.166 0.172 0.265 0.201 0.195 0.283 0.197 0.200 0.013 0.122
C16 0.106 0.108 0.135 0.145 0.106 0.116 0.223 0.156 0.150 0.241 0.152 0.154 0.034 0.079 0.041
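A lower-triangular listing like the one above is easier to query once it is loaded into a lookup table. Here is a minimal sketch that parses such a listing and finds the least diverged pairs. It uses only the first four components of the K=16 numbers as an excerpt, and the function names (`parse_lower_triangle`, `lookup`, `closest_pairs`) are my own for illustration, not part of Admixture.

```python
# Excerpt of the K=16 Fst listing (first four components only).
FST_TEXT = """
C1 C2 C3
C2 0.053
C3 0.064 0.060
C4 0.076 0.112 0.123
"""


def parse_lower_triangle(text):
    """Parse a header line of column labels plus rows of 'label v1 v2 ...'."""
    rows = [line.split() for line in text.strip().splitlines()]
    columns = rows[0]
    fst = {}
    for row in rows[1:]:
        row_label, values = row[0], row[1:]
        # zip stops at the shorter sequence, so each row only fills
        # the columns it actually has values for.
        for col_label, value in zip(columns, values):
            fst[(col_label, row_label)] = float(value)
    return fst


def lookup(fst, a, b):
    """Fst is symmetric, so accept the pair in either order."""
    if a == b:
        return 0.0
    return fst[(a, b)] if (a, b) in fst else fst[(b, a)]


def closest_pairs(fst, n=3):
    """Return the n pairs with the smallest Fst."""
    return sorted(fst.items(), key=lambda item: item[1])[:n]


fst = parse_lower_triangle(FST_TEXT)
print(lookup(fst, "C4", "C2"))
print(closest_pairs(fst, 2))
```

Pasting the full listing into `FST_TEXT` should work the same way, as long as the header row lists the column labels in order.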

PS. This was run using Admixture version 1.04 so I can make an apples-to-apples comparison with the previous runs.

One PED File to Rule Them All

I am interested in North African populations because of my own heritage. So when Razib alerted me that Henn et al had a paper out on a southern African origin of modern humans, and that their publicly available dataset included populations from all over Africa, I immediately downloaded it.

I have also been considering looking into the East Asian admixture in South Asians and Iranians in some detail to see where it originates from: Southeast Asia, Chinese/Japanese/Koreans, or the Turkic/Mongolian/Siberian populations of interior northeastern Asia. At a quick glance, Razib is correct:

The eastern Asian components are enriched among Bengalis, as you’d expect, but they’re found in different proportions among many individuals who hail from the northern fringe of South Asia more generally. It seems clear that the further west you go, the more likely the “eastern” element is going to be Turk, while the further east (and to some extent south) the more likely it is to be more southernly in provenance.

To do a better job, though, it would help to have more than the Yakut as an exemplar of the Siberian component, which is all I have used until now. Therefore, I downloaded the Arctic populations dataset from Rasmussen et al.

Combining Henn et al and Rasmussen et al with my previous datasets (HapMap, HGDP, SGVP, Behar et al and Xing et al), I got 3,970 samples with a total of 1,716,031 SNPs represented, though at 99% genotyping rate it gets reduced to about 27,000 SNPs.
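To illustrate why a 99% genotyping rate cuts the SNP count so sharply after merging datasets from different chips, here is a minimal sketch of a per-SNP call-rate filter. The genotype data, SNP names, and helper functions are toy values made up for the example. In PLINK a threshold like this is typically set with `--geno 0.01`, though that this exact option was used here is an assumption.

```python
def call_rate(calls):
    """Fraction of samples with a non-missing genotype call (None = missing)."""
    return sum(call is not None for call in calls) / len(calls)


def filter_by_call_rate(genotypes, min_rate=0.99):
    """Keep only SNPs genotyped in at least `min_rate` of the samples."""
    return {
        snp: calls
        for snp, calls in genotypes.items()
        if call_rate(calls) >= min_rate
    }


def toy_snp(n_samples, n_missing):
    """Build a toy genotype column with `n_missing` missing calls."""
    return [None] * n_missing + ["AG"] * (n_samples - n_missing)


# Toy data: a SNP missing from one source dataset shows up as a block
# of missing calls once the datasets are merged.
toy_genotypes = {
    "snp_1": toy_snp(200, 0),   # typed in every sample
    "snp_2": toy_snp(200, 1),   # one failed call
    "snp_3": toy_snp(200, 10),  # absent from a small dataset
}

kept = filter_by_call_rate(toy_genotypes)
print(sorted(kept))
```

A SNP that is absent from even one of the merged datasets falls below the threshold, which is why the intersection ends up so much smaller than the union.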

I did not remove any populations or individuals except for any duplicates and non-founders.

Here's the information on the populations represented in this dataset.

Now I am on the lookout for more datasets that are public, have enough SNPs in common with this set and can easily be converted into the Plink PED format. So if you know of any, let me know. Maybe I will have the biggest and most diverse dataset with your help.

Another Update

I have a total of 51 participants in the project right now who have sent me their raw data. This does not count three people who have relatives participating and thus have to be filtered out of most analyses, other than individual admixture percentages etc., where I divide participants into small groups.

The following groups are represented:

  • Punjab: 7
  • Iran: 7
  • Tamil: 6
  • Bengal: 5
  • Andhra Pradesh: 2
  • Bihar: 2
  • Karnataka: 2
  • Caribbean Indian: 2
  • Kashmir: 2
  • Uttar Pradesh: 2
  • Sri Lankan: 2
  • Kerala: 2
  • Iraqi Arab: 2
  • Anglo-Indian: 1
  • Roma: 1
  • Goa: 1
  • Rajasthan: 1
  • Baloch: 1
  • Unknown: 1
  • Egyptian/Iraqi Jew: 1
  • Maharashtra: 1

I haven't received data from any new participants for more than a week, which is the longest lull since I started the Harappa Ancestry Project. So go out there and get people to send me their 23andMe raw data.

Also, does anyone know if there are a significant number of South Asians who have done FamilyTreeDNA's Family Finder test? Is there a good overlap of SNPs between their test and 23andMe's?

We have enough Punjabis, Iranians, Tamils and Bengalis that they deserve separate analysis posts.

Singapore Indians

In the South Asian PCA plot, we saw that Singapore Indian samples from the SGVP dataset had a lot of diversity. Let's zoom into that plot so it's not dominated by the distinctiveness of the Kalash.

Eigenvector 1 explains 1.45 times as much variation as eigenvector 2.

We see that Singapore Indians are spread across the whole region from Sindhis to North Kannadi.

Now let's look at the individual admixture results (at K=12 ancestral populations) for the Singapore Indians. I have added some South Asian reference population averages so you can place them in context.
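For context, a reference population average is just the mean of the individual admixture proportions within that population. Here is a minimal sketch with made-up individuals and only two components; the population labels, numbers, and the `population_averages` helper are assumptions for illustration, not the actual results.

```python
from collections import defaultdict

# Toy individual results: (population label, [proportion per component]).
# Invented numbers; each row sums to 1 like a row of admixture output.
toy_individuals = [
    ("pop_A", [0.50, 0.50]),
    ("pop_A", [0.75, 0.25]),
    ("pop_B", [0.25, 0.75]),
]


def population_averages(individuals):
    """Average each component's proportion over the individuals in a population."""
    grouped = defaultdict(list)
    for population, proportions in individuals:
        grouped[population].append(proportions)
    return {
        population: [sum(column) / len(rows) for column in zip(*rows)]
        for population, rows in grouped.items()
    }


averages = population_averages(toy_individuals)
for population, means in averages.items():
    print(population, means)
```

Because every individual row sums to 1, each averaged row does too, so the averages can be plotted on the same bar chart as the individuals.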

You can click on the legend to the right of the bar chart to sort by different ancestral components.

From these results, a majority of the Singapore Indian samples look South Indian, but there are definitely a few from the northwest of the subcontinent (Punjabis or Sindhis?). There are also a few who could be from the Hindi belt.

There are 2-3 samples who have a significant amount of Southeast Asian. Could they be originally from Bengal? Or could they have partial Singapore Malay ancestry?

Your Genes, Regulated?

The FDA had a meeting the last two days:

FDA is convening this two-day meeting to seek the Panel’s expert opinion and input on scientific issues concerning Direct to Consumer (DTC) genetic tests that make medical claims.

This meeting is focused specifically on issues regarding clinical genetic tests that are marketed directly to consumers (DTC clinical genetic tests), where a consumer can order tests and receive test results without the involvement of a clinician.

The American Medical Association of course wants to limit genetic testing so that you would need a doctor to supervise everything.

We urge the Panel to offer clear findings and recommendations that genetic testing, except under the most limited circumstances, should be carried out under the personal supervision of a qualified health care professional, and provide individuals interested in obtaining genetic testing access to qualified health care professionals for further information.

23andMe had two presentations at the meeting, which they have posted on their blog.

In our presentations, we take the position that all genetic testing services, whether ordered by a physician or offered through direct access, should adhere to the same standards. We simultaneously request that the FDA consider redefining and establishing regulatory standards, including some fundamental definitions, to accommodate large-scale genetic testing and support innovation of its technologies and applications. We also request that regulation be based upon evidence and not fear of potential harm to individuals which, to date, has not been demonstrated. In fact, growing numbers of participating individuals and independent studies focused on this issue provide preliminary evidence that the vast majority of people understand the information presented and experience no significant negative effects.

Genomics Law Report had an overview of the issues beforehand as well as a Twitter roundup of the meeting. Here are the author's thoughts after the first day:

First and foremost, I fully expect the MCGP (Molecular and Clinical Genetics Panel) to note, likely more than once, that given the complexity of the questions put to it by the FDA it should be afforded far more time to deliberate and research prior to making any recommendations.

If taking time out for further debate isn’t an option, what is the MCGP likely to recommend? Based on today’s deliberations, I think it’s a safe bet that the MCGP will advise the FDA to (1) demand clear proof of analytical and clinical validity for all genetic tests and (2) require that most, or perhaps even all, genetic tests with demonstrated or potential clinical significance be (to use the FDA’s terminology) “routed through a clinician.”

In other words, I think the odds strongly favor an MCGP recommendation to the FDA that clinical (as defined by the FDA, which is itself a separate issue) direct-to-consumer genetic testing, when offered without a requirement that a clinician participate in the ordering, receipt and interpretation of the test, be removed from the marketplace. At least for the time being.

If you read my blog, you probably know my politics. I do, however, think that any regulation has to be shown to provide an actual, tangible benefit or to prevent actual harm. A regular Joe simply misinterpreting genetic results and causing hypothetical harm is not enough justification.

So what can you do? Razib Khan is already on the task.

1) I am going to release my own 23andMe sequence into the public domain soon. I encourage everyone to download it. I would rather have someone off the street know my own genetic information than be made invisible by the government. That is my right. For now that right is not barred by law. I will exercise it.

2) Spread word of this video via social networking websites and twitter. The media needs to get the word out, but they only will if they know you care. Do you care? I hope you do. This is a power grab, this is not about safety or ethics. If it was, I assume that the “interpretative services” would be provided for free. I doubt they will be.

3) Contact your local representative in congress. I’ve never done this myself, but am going to draft a quick note. They need to be aware that people care, that this isn’t just a minor regulatory issue.

4) The online community needs to get organized. We’re not as powerful as a million doctors and a Leviathan government, but we have right on our side. They’re trying to take from us what is ours.

5) Plan B’s. We need to prepare for the worst. Which nations have the least onerous regulatory regimes? Is genomic tourism going to be necessary? How about DIYgenomics? The cost of the technology to genotype and sequence is going to crash. I know that the Los Angeles DIYbio group has a cheap cast-off sequencer. For those who can’t afford to go abroad soon we’ll be able to get access to our information in our homes. Let’s prepare for that day.

Here are the links to contact your House Representative and your Senators.

Reference I Admixture Analysis K=15

Continuing with Reference I admixture analysis, here is the results spreadsheet.

You can click on the legend to the right of the bar chart to sort by different ancestral components.

If you can't see the interactive chart above, here's a static image.

  • C1: South Asian
  • C2: Balochistan/Caucasus
  • C3: Kalash
  • C4: Southeast Asian
  • C5: Southwest Asian
  • C6: European
  • C7: Melanesian
  • C8: Japanese
  • C9: Siberian
  • C10: Papuan
  • C11: Chinese
  • C12: Eastern Bantu
  • C13: Northwest African
  • C14: West African
  • C15: East African

The new Northwest African component is mostly Mozabite, though it is present among Moroccans too.

Fst divergences between estimated populations for K=15:

PS. This was run using Admixture version 1.04 so I can make an apples-to-apples comparison with the previous runs.

UPDATE: Here are the Fst numbers:

C1 C2 C3 C4 C5 C6 C7 C8 C9 C10 C11 C12 C13 C14
C2 0.053
C3 0.064 0.060
C4 0.081 0.116 0.129
C5 0.073 0.056 0.085 0.135
C6 0.065 0.040 0.073 0.123 0.048
C7 0.164 0.200 0.215 0.171 0.217 0.205
C8 0.080 0.116 0.128 0.035 0.135 0.122 0.172
C9 0.084 0.113 0.126 0.064 0.133 0.117 0.188 0.040
C10 0.184 0.222 0.237 0.208 0.238 0.227 0.145 0.207 0.219
C11 0.083 0.119 0.130 0.030 0.137 0.125 0.173 0.014 0.044 0.209
C12 0.145 0.153 0.177 0.185 0.156 0.162 0.257 0.186 0.190 0.275 0.188
C13 0.079 0.063 0.096 0.132 0.052 0.056 0.210 0.132 0.131 0.232 0.135 0.116
C14 0.153 0.162 0.186 0.194 0.166 0.172 0.265 0.195 0.199 0.283 0.197 0.013 0.122
C15 0.106 0.108 0.135 0.149 0.106 0.116 0.223 0.150 0.153 0.241 0.152 0.034 0.079 0.041

Admixture K=12, HRP0041-HRP0050

Here are their ethnic backgrounds and the results spreadsheet. Also relevant are the reference I admixture results.

If you can't see the interactive bar chart above, here's a static image.

PS. This was run using Admixture version 1.04.

Admixture K=9, HRP0041-HRP0050

Here are their ethnic backgrounds and the results spreadsheet. Also relevant are the reference I admixture results.

The interesting samples here are the two Iraqi Arabs (HRP0042 & HRP0043) who have some African admixture.

If you can't see the interactive bar chart above, here's a static image.

PS. This was run using Admixture version 1.04.

Admixture K=4, HRP0041-HRP0050

Here are their ethnic backgrounds and the results spreadsheet. Also relevant are the reference I admixture results.

The interesting samples here are the two Iraqi Arabs (HRP0042 & HRP0043) who have some African admixture.

Also, we finally have a couple of Bengalis (HRP0049 & HRP0050) with 13% East Asian in this run, which is less than the 19-20% of Razib (HRP0002) and his parents but still higher than the others.

If you can't see the interactive bar chart above, here's a static image.

PS. This was run using Admixture version 1.04.