Tag Archives: reference - Page 10

Dodecad vs Harappa

Posted by Zack on February 18, 2011 9 comments

We know that some participants in the Harappa Ancestry Project have also submitted their data to the Dodecad Project, and they were curious how the different ancestry components here line up with the Dodecad ones.

So I decided to compare the two. I took the ancestral component percentages for the reference populations from the Dodecad population spreadsheet (K=10) and the Harappa Reference I spreadsheet (K=9).

I selected the 36 populations that are present in both. Some of these are still not strictly comparable, because Dodecad and Harappa may have included different samples from the same population in their reference datasets. However, we are using mean values, so barring any big outliers we can compare them.

I decided to find a solution to linear equations of the form:

C1 = a₁₁*D1 + a₁₂*D2 + a₁₃*D3 + a₁₄*D4 + a₁₅*D5 + a₁₆*D6 + a₁₇*D7 + a₁₈*D8 + a₁₉*D9 + a_1A*D10
C2 = a₂₁*D1 + a₂₂*D2 + a₂₃*D3 + a₂₄*D4 + a₂₅*D5 + a₂₆*D6 + a₂₇*D7 + a₂₈*D8 + a₂₉*D9 + a_2A*D10
C3 = a₃₁*D1 + a₃₂*D2 + a₃₃*D3 + a₃₄*D4 + a₃₅*D5 + a₃₆*D6 + a₃₇*D7 + a₃₈*D8 + a₃₉*D9 + a_3A*D10
C4 = a₄₁*D1 + a₄₂*D2 + a₄₃*D3 + a₄₄*D4 + a₄₅*D5 + a₄₆*D6 + a₄₇*D7 + a₄₈*D8 + a₄₉*D9 + a_4A*D10
C5 = a₅₁*D1 + a₅₂*D2 + a₅₃*D3 + a₅₄*D4 + a₅₅*D5 + a₅₆*D6 + a₅₇*D7 + a₅₈*D8 + a₅₉*D9 + a_5A*D10
C6 = a₆₁*D1 + a₆₂*D2 + a₆₃*D3 + a₆₄*D4 + a₆₅*D5 + a₆₆*D6 + a₆₇*D7 + a₆₈*D8 + a₆₉*D9 + a_6A*D10
C7 = a₇₁*D1 + a₇₂*D2 + a₇₃*D3 + a₇₄*D4 + a₇₅*D5 + a₇₆*D6 + a₇₇*D7 + a₇₈*D8 + a₇₉*D9 + a_7A*D10
C8 = a₈₁*D1 + a₈₂*D2 + a₈₃*D3 + a₈₄*D4 + a₈₅*D5 + a₈₆*D6 + a₈₇*D7 + a₈₈*D8 + a₈₉*D9 + a_8A*D10
C9 = a₉₁*D1 + a₉₂*D2 + a₉₃*D3 + a₉₄*D4 + a₉₅*D5 + a₉₆*D6 + a₉₇*D7 + a₉₈*D8 + a₉₉*D9 + a_9A*D10

For each of the 36 populations, we'll have these 9 equations, where C1 through C9 are the ancestral component percentages of that population in the Harappa Project and D1 through D10 are the ancestral percentages in the Dodecad Project.

The unknowns are the coefficients "a" (the subscript A in the last term of each equation stands for 10), so there are 9 × 10 = 90 unknowns. Since we have 36 populations, the number of equations is 36 × 9 = 324. Therefore, this is an overdetermined system of linear equations and we can find a least squares solution to it.
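
To make the fitting step concrete, here is a minimal sketch of an ordinary least-squares fit on toy data. It is not necessarily how the table below was produced: the function names, the toy matrices `D` and `C`, and the choice of solving the normal equations in pure Python are all assumptions made for illustration (in practice a routine such as `numpy.linalg.lstsq` would be the usual tool). Note that each equation only involves one row of coefficients, so the 324 equations in 90 unknowns split into 9 independent fits of 36 equations in 10 unknowns each.

```python
def solve_linear(matrix, rhs):
    """Solve the square system matrix @ x = rhs by Gaussian elimination."""
    n = len(matrix)
    # Augmented copy, so the caller's lists are not modified.
    aug = [list(row) + [value] for row, value in zip(matrix, rhs)]
    for col in range(n):
        # Partial pivoting: bring the largest remaining entry onto the diagonal.
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("singular system: some components are collinear")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    # Back substitution.
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (aug[i][n] - tail) / aug[i][i]
    return x


def fit_coefficients(D, C):
    """Least-squares a[i][j] such that C[p][i] ~ sum_j a[i][j] * D[p][j]."""
    n_d, n_c = len(D[0]), len(C[0])
    # Normal equations (D^T D) a_i = D^T c_i; D^T D is shared by every row i.
    dtd = [[sum(p[j] * p[k] for p in D) for k in range(n_d)] for j in range(n_d)]
    coeffs = []
    for i in range(n_c):
        dtc = [sum(d[j] * c[i] for d, c in zip(D, C)) for j in range(n_d)]
        coeffs.append(solve_linear(dtd, dtc))
    return coeffs


# Hypothetical toy data: 5 populations, 3 "Dodecad-like" components (rows of D)
# and 2 "Harappa-like" components generated from known coefficients true_a.
D = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1], [0.2, 0.2, 0.6], [0.4, 0.4, 0.2], [0.3, 0.1, 0.6]]
true_a = [[0.9, 0.0, 0.1], [0.1, 1.0, 0.9]]
C = [[sum(a * d for a, d in zip(row, pop)) for row in true_a] for pop in D]

fitted = fit_coefficients(D, C)
for row in fitted:
    print([round(value, 3) for value in row])
```

Because the toy `C` is generated exactly from `true_a`, the fit recovers those coefficients up to rounding. With real population averages the residuals are not zero, and nothing in this sketch constrains the coefficients to be non-negative.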

Here is the solution:

	D1 W Asian	D2 NW African	D3 S Euro	D4 NE Asian	D5 SW Asian	D6 E Asian	D7 N Euro	D8 W African	D9 E African	D10 S Asian
C1 S Asian	0	0	0	0	0	0	0	0	0	0.92
C2 Kalash	0.54	0	-0.05	0.12	0.07	0	0.2	0	0	0.1
C3 SW Asian	0.46	0.56	0.44	0	0.9	0	-0.09	0	0.09	-0.07
C4 SE Asian	0	0	0	0	0	0.6	0	0	0	0
C5 Euro	0	0.19	0.6	0.05	-0.05	0	0.88	0	0	0
C6 Papuan	0	0	0	0	0	0	0	0	0	0
C7 NE Asian	0	0	0	0.85	0	0.4	0	0	0	0
C8 W African	0	0.12	0	0	0	0	0	1	0	0
C9 E African	0	0.12	0	0	0.05	0	0	0	0.89	0

Don't take the exact values to heart but this shows the general relationship between the Dodecad and Harappa (K=9) ancestral components.

The South Asian components are about the same in both projects.

The Kalash component is a mix but is primarily Dodecad West Asian.

The Harappa Southwest Asian component has contributions from Northwest African, West Asian and South European in addition to the Dodecad Southwest Asian component.

The Southeast Asian component corresponds partially to the Dodecad East Asian component.

The Harappa European component is more Dodecad North European than South European.

If enough Harappa-Dodecad participants are willing to let me know their IDs for both projects, I can do a similar analysis using individual data.

Reference I Admixture Analysis K=10-12

Posted by Zack on February 17, 2011 26 comments

A week later, here is some more admixture analysis of the Reference I dataset.

As usual, the results are available in a spreadsheet, which is also listed on my sidebar.

Let's start with K=10.

Admixture: Reference I populations K=10

C1	South Asian	C2	Kalash
C3	Southwest Asian	C4	Southeast Asian
C5	European	C6	Papuan
C7	Northeast Asian	C8	Siberian
C9	West African	C10	East African

The addition here is basically of the Siberian component which is highest among the Yakut.

Fst divergences between estimated populations for K=10:

	C1	C2	C3	C4	C5	C6	C7	C8	C9
C2	0.057
C3	0.064	0.073
C4	0.089	0.127	0.136
C5	0.063	0.061	0.038	0.131
C6	0.167	0.209	0.215	0.202	0.210
C7	0.080	0.120	0.129	0.032	0.123	0.190
C8	0.085	0.117	0.127	0.059	0.118	0.203	0.039
C9	0.152	0.174	0.161	0.201	0.171	0.266	0.195	0.199
C10	0.115	0.133	0.117	0.166	0.128	0.233	0.160	0.163	0.036

Now for K=11,

Admixture: Reference I populations K=11

C1	South Asian	C2	Kalash
C3	Southwest Asian	C4	Southeast Asian
C5	European	C6	Papuan
C7	Siberian	C8	Northeast Asian
C9	East African Bantus	C10	West African
C11	East African

C8 at K=11 is now modal among the Han instead of the Japanese. This affected the Southeast Asian C4 component which is now more of a real Southeast Asian one.

The new ancestral component C9 is among the Bantus of eastern and southern Africa. It is highest among the Luhya and Bantus of Kenya.

Fst divergences between estimated populations for K=11:

	C1	C2	C3	C4	C5	C6	C7	C8	C9	C10
C2	0.055
C3	0.062	0.072
C4	0.081	0.120	0.128
C5	0.063	0.063	0.038	0.124
C6	0.169	0.211	0.215	0.195	0.213
C7	0.089	0.128	0.135	0.057	0.130	0.203
C8	0.083	0.122	0.131	0.031	0.127	0.194	0.039
C9	0.143	0.165	0.150	0.185	0.162	0.259	0.195	0.189
C10	0.152	0.174	0.160	0.194	0.172	0.268	0.203	0.198	0.014
C11	0.104	0.122	0.101	0.149	0.115	0.226	0.158	0.152	0.037	0.043

At K=12,

Admixture: Reference I populations K=12

C1	South Asian	C2	Balochistan/Caucasus
C3	Kalash	C4	Southeast Asian
C5	Southwest Asian	C6	European
C7	Papuan	C8	Northeast Asian
C9	Siberian	C10	East African Bantus
C11	West African	C12	East African

The Kalash component has split, with an assist from Southwest Asian, into a pure Kalash component (C3) and a Balochistan/Caucasus component (C2), which is highest in southwestern Pakistan (Brahui, Makrani, Balochi) at 60-57%, followed by Georgians, Lezgin, Adygei, Azerbaijan Jews and Iranian Jews (56-50%).

The Southwest Asian component (C5) is now more of a Southwest Asian and North/Northwest African component. The West Asian element in it has been reduced.

The Northeast Asian component (C8) is now again centered on Japan. I have a solution for this movement which I'll apply in my next round of analysis.

Fst divergences between estimated populations for K=12:

	C1	C2	C3	C4	C5	C6	C7	C8	C9	C10	C11
C2	0.057
C3	0.066	0.060
C4	0.089	0.124	0.136
C5	0.075	0.057	0.087	0.142
C6	0.066	0.040	0.073	0.130	0.048
C7	0.167	0.205	0.219	0.202	0.220	0.210
C8	0.080	0.117	0.128	0.032	0.134	0.122	0.190
C9	0.085	0.114	0.126	0.059	0.133	0.117	0.203	0.039
C10	0.145	0.154	0.176	0.192	0.154	0.162	0.258	0.187	0.190
C11	0.154	0.163	0.186	0.201	0.164	0.172	0.266	0.195	0.199	0.014
C12	0.107	0.109	0.135	0.157	0.105	0.116	0.225	0.151	0.154	0.035	0.041

Higher K value admixture analysis will continue.

Reference II Admixture Analysis K=6-9

Posted by Zack on February 12, 2011 11 comments

Continuing with admixture analysis of Reference II dataset, here's the spreadsheet.

Other than the differences with Reference I analysis, do take a look at the additional ethnic groups included in this dataset, especially the 8 South Asian groups: Tamil Nadu Dalit, Irula, Andhra Pradesh Madiga, Andhra Pradesh Mala, Tamil Nadu Brahmin, Andhra Pradesh Brahmin, Punjabi Arain, Nepali.

Let's start with K=6.

Reference II Admixture K=6

Note the difference between Tamil Nadu Dalits and Brahmins. The Dalits lack the European ancestral component of the Brahmins.

For K=7, the East Asian component splits into Northeast Asian and Southeast Asian.

Reference II Admixture K=7

Punjabi Arain are about the same as Sindhis (excluding those with some African ancestry) in terms of their ancestral components.

Comparing the Andhra Brahmins to the Mala and Madiga, we see the same pattern as in Tamil Nadu: Brahmins have more European and Southwest/West Asian while Mala and Madiga have more Southeast Asian and South Asian.

At K=8, the African component splits into West African and East African.

Reference II Admixture K=8

The Nepalese samples are interesting. They have about 49% South Asian, 19% Northeast Asian, 16% European and 10% Southeast Asian. So they look like a mix of South Asian and East Asian.

Here's the average absolute difference between the two datasets for each ancestral component:

Ancestral Component	Mean(Abs(Ref1-Ref2))
South Asian (C1)	2.17%
Southwest Asian (C2)	1.32%
European (C3)	1.70%
Southeast Asian (C4)	2.16%
Papuan (C5)	0.33%
Northeast Asian (C6)	1.93%
West African (C7)	0.27%
East African (C8)	0.48%

The larger differences are for Balochi, Cambodian, Dai, Han, Kalash, Lahu, Miao, Naxi, She, Singapore Chinese, Tu, Tujia, US Chinese, and Yi. Thus, it's mostly East Asian groups.
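
For readers who want to reproduce this kind of comparison from the two spreadsheets, here is a minimal sketch. The population names, percentages and the 5-point cutoff are hypothetical placeholders, not values from the project; only the statistic itself, Mean(Abs(Ref1-Ref2)) per component, comes from the table above.

```python
from statistics import mean

# Hypothetical mean percentages per population: {population: [C1, C2, C3]}.
ref1 = {"PopA": [60.0, 30.0, 10.0], "PopB": [20.0, 50.0, 30.0], "PopC": [5.0, 15.0, 80.0]}
ref2 = {"PopA": [58.0, 31.0, 11.0], "PopB": [21.0, 50.0, 29.0], "PopC": [5.0, 21.0, 74.0]}

# Only populations present in both datasets can be compared.
shared = sorted(set(ref1) & set(ref2))
n_components = len(ref1[shared[0]])

# Mean(Abs(Ref1 - Ref2)) for each component, averaged over the shared populations.
mean_abs_diff = [
    mean(abs(ref1[pop][k] - ref2[pop][k]) for pop in shared)
    for k in range(n_components)
]

# Populations whose largest single-component difference exceeds a chosen cutoff.
cutoff = 5.0  # percentage points; an arbitrary illustrative threshold
flagged = [
    pop for pop in shared
    if max(abs(a - b) for a, b in zip(ref1[pop], ref2[pop])) > cutoff
]

for k, value in enumerate(mean_abs_diff, start=1):
    print(f"C{k}: {value:.2f}%")
print("larger differences:", flagged)
```

The per-component mean hides which populations drive the difference, which is why the list of flagged populations is worth computing alongside it.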

For K=9, we see some divergence between the ancestral components inferred from Reference II as compared to Reference I. Instead of the Kalash component in Reference I analysis, we get the Polynesian component here. This is likely due to the inclusion of Tongan and Samoan samples.

Reference II Admixture K=9

Here's a summary of the ancestral components inferred from Reference II dataset:

K=2	K=3	K=4	K=5	K=6	K=7	K=8	K=9
Eurasian	European	S Asian	S Asian	S Asian	S Asian	S Asian	S Asian
African	E Asian	European	European	European	European	SW Asian	European
	African	E Asian	E Asian	E Asian	SE Asian	European	SW Asian
		African	SW Asian	SW Asian	SW Asian	SE Asian	SE Asian
			African	Papuan	Papuan	Papuan	Papuan
				African	NE Asian	NE Asian	NE Asian
					African	W African	Polynesian
						E African	W African
							E African

I might do some admixture runs for Reference II with Harappa participants later.

Reference II Admixture Analysis K=2-5

Posted by Zack on February 11, 2011 1 comment

Our Reference II Dataset has 3,161 samples with 544 South Asians belonging to 24 ethnic groups. Unfortunately, we can do our admixture analysis on only about 23,000 SNPs.

The ancestral population averages for each ethnic group from the admixture analysis can be seen in this spreadsheet. I have also calculated the standard deviation of the ancestral components for the samples in each ethnic group.

Here are the results for K=2.

Reference II Admixture K=2

For K=3, we get the ancestral populations: European, E Asian, African.

Reference II Admixture K=3

For K=4, the ancestral populations are South Asian, European, East Asian and African.

Reference II Admixture K=4

Let's compare the results of K=4 admixture analysis of Reference I and Reference II datasets.

While there is some difference in the average percentages of ancestral components computed with the two reference datasets, most of the differences are 1% or less. The mean absolute difference for the four components is as follows:

Ancestral Component	Mean(Abs(Ref1-Ref2))
South Asian (C1)	0.92%
European (C2)	0.58%
East Asian (C3)	0.52%
African (C4)	0.32%

I have highlighted the larger differences, which affect Balochi, Kalash, Malayan, Melanesian, Papuan, and Samaritans. Even then the largest change is about 5%.

Let's also look at the Fst divergences. Here's for Reference I admixture results:

	C1	C2	C3
C2	0.071
C3	0.083	0.109
C4	0.152	0.152	0.184

And for Reference II:

	C1	C2	C3
C2	0.074
C3	0.086	0.118
C4	0.156	0.159	0.194

The Fst numbers for Reference II are somewhat higher.

Considering that Reference II has only one-eighth of the SNPs of Reference I, the results are fairly good.

Here's K=5 admixture analysis for Reference II:

Reference II Admixture K=5

Higher K values to follow.

Reference Admixture Analysis K=6-9

Posted by Zack on February 9, 2011 16 comments

Continuing the admixture analysis on my reference dataset I, let's look at K=6 ancestral components.

As before, all the results are listed in a spreadsheet.

For K=6, we get the following plot:

Admixture: Reference populations K=6

C1 (red) is the South Asian ancestral component. However, the Austronesian (Papuan/Melanesian) component has now separated from it as C5 (blue). You can see small proportions of the Papuan component among South Indian and Southeast Asian (Malay and Cambodian) populations.

C3 (green) is exactly the same as C3 in K=5 run and represents East Asian populations. C6 (magenta) is exactly the same component as C5 in the K=5 run and represents African ancestry. C4 (cyan) is the same as C4 in the K=5 run and represents Southwest/West Asia.

C2 (yellow) is the European component (maximum among North Europeans) almost the same as C2 in K=5 analysis. The major difference is that C2 (in K=6) is reduced among South Asians as compared to K=5. This is due to the South Asian component being higher for them.

Fst divergences between estimated populations for K=6:

	C1	C2	C3	C4	C5
C2	0.053
C3	0.084	0.114
C4	0.068	0.052	0.130
C5	0.178	0.205	0.184	0.218
C6	0.148	0.165	0.186	0.157	0.260

When we increase the ancestral components to K=7,

Admixture: Reference populations K=7

The South Asian component (C1/red) is the same. Note that there is a significant drop from about 51% to 29% from Makranis to Iranians (ignore the Paniya as there are only 4 samples with one being very different). Looking at the 19 individual Iranian samples from our reference dataset, their South Asian ancestral component values vary from 17% to 33%.

The Southwest/West Asian component (C2/yellow) is now higher among West Asians and lower among East Africans compared to K=6 run. C3/green is the European component which now almost disappears from the Southwest Asian populations.

The East Asian component (C4/bluish green) is the same as before as is the Papuan C5/light blue component.

The African ancestry breaks into West African (C6/blue) and East African (C7/magenta).

Note that the split here is different from the batch 1 run where the East Asian split into two for K=7 and the African split happened at K=8, the opposite of what happened here.

Fst divergences between estimated populations for K=7:

	C1	C2	C3	C4	C5	C6
C2	0.058
C3	0.052	0.034
C4	0.082	0.122	0.113
C5	0.176	0.210	0.204	0.184
C6	0.152	0.159	0.167	0.190	0.264
C7	0.112	0.113	0.122	0.153	0.229	0.037

At K=8, the East Asian component forks into two: a Southeast Asian one (C4/bright green) that is highest among the Dai, Malay, Cambodians and Lahu; and a Northeast Asian one (C6/blue) that is maximum among the Yakut, Oroqen, Japanese, Hezhen and Daur.

Admixture: Reference populations K=8

Among most South Asian groups in our reference dataset, the Southeast Asian component is much more common than the Northeast Asian one.

Fst divergences between estimated populations for K=8:

	C1	C2	C3	C4	C5	C6	C7
C2	0.058
C3	0.052	0.034
C4	0.096	0.133	0.125
C5	0.177	0.211	0.205	0.201
C6	0.093	0.131	0.122	0.046	0.195
C7	0.152	0.159	0.167	0.200	0.266	0.201
C8	0.113	0.113	0.113	0.163	0.231	0.163	0.037

Here's the plot for K=9 ancestral components:

Reference Populations Admixture K=9

The new component here is the Kalash component which is at 94% among the Kalash but is in the 30-40% range for Caucasian and Pakistani populations. It is also present among West Asians, Europeans and Central Asians to a small degree.

Looking at the Kalash samples, they seem fairly uniform and mostly with only a little of the other ancestral components except for one sample which has 70% Kalash component.

Kalash Admixture K=9

Fst divergences between estimated populations for K=9:

	C1	C2	C3	C4	C5	C6	C7	C8
C2	0.056
C3	0.064	0.072
C4	0.088	0.126	0.136
C5	0.064	0.061	0.039	0.131
C6	0.167	0.208	0.214	0.202	0.211
C7	0.084	0.124	0.133	0.045	0.127	0.195
C8	0.152	0.173	0.161	0.201	0.172	0.266	0.200
C9	0.115	0.133	0.117	0.165	0.129	0.233	0.164	0.036

From the Fst values, we can see that the Kalash component is closest to the South Asian component and then to the European component.
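
The ranking can be read off the K=9 table mechanically. The sketch below copies the lower-triangular Fst values from the table above and sorts the other components by their distance to Kalash; the helper names `fst` and `nearest` are made up for this example, and the labels follow the K=9 component order (S Asian, Kalash, SW Asian, SE Asian, European, Papuan, NE Asian, W African, E African).

```python
# Component labels for C1..C9 at K=9.
labels = ["S Asian", "Kalash", "SW Asian", "SE Asian", "European",
          "Papuan", "NE Asian", "W African", "E African"]

# Lower-triangular Fst table for K=9; row r holds C(r+2) against C1..C(r+1).
lower = [
    [0.056],
    [0.064, 0.072],
    [0.088, 0.126, 0.136],
    [0.064, 0.061, 0.039, 0.131],
    [0.167, 0.208, 0.214, 0.202, 0.211],
    [0.084, 0.124, 0.133, 0.045, 0.127, 0.195],
    [0.152, 0.173, 0.161, 0.201, 0.172, 0.266, 0.200],
    [0.115, 0.133, 0.117, 0.165, 0.129, 0.233, 0.164, 0.036],
]


def fst(i, j):
    """Fst between components i and j (0-based); the matrix is symmetric."""
    if i == j:
        return 0.0
    high, low = max(i, j), min(i, j)
    return lower[high - 1][low]


def nearest(name):
    """All other components sorted by increasing Fst from the named one."""
    i = labels.index(name)
    return sorted((fst(i, j), labels[j]) for j in range(len(labels)) if j != i)


for value, other in nearest("Kalash")[:3]:
    print(f"{other}: {value:.3f}")
# S Asian: 0.056
# European: 0.061
# SW Asian: 0.072
```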

To summarize, here are the ancestral components inferred for different values of K.

K=2	K=3	K=4	K=5	K=6	K=7	K=8	K=9
Eurasian	European	S Asian	S Asian	S Asian	S Asian	S Asian	S Asian
African	E Asian	European	European	European	SW Asian	SW Asian	Kalash
	African	E Asian	E Asian	E Asian	European	European	SW Asian
		African	SW Asian	SW Asian	E Asian	SE Asian	SE Asian
			African	Papuan	Papuan	Papuan	European
				African	W African	NE Asian	Papuan
					E African	W African	NE Asian
						E African	W African
							E African

Note that for a specific value of K, they are listed in approximately decreasing average percentage among the South Asian samples in our reference dataset I.

Changes due to San/Pygmy Removal

Posted by Zack on February 4, 2011 Comments Off

As mentioned earlier, I removed San and Pygmy groups from my reference datasets.

For the admixture runs on Reference Dataset I, the only major changes are for K=2 ancestral components where most European, Middle Eastern and South/Central Asian groups increase their African component. The changes for K=3,4,5 were minor as shown by these statistics:

K	Median Abs	Maximum Abs
3	0.01%	0.22%
4	0.02%	0.26%
5	0.02%	0.71%
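
As an illustration of how such summary statistics could be computed, here is a small sketch. The group names and percentages are hypothetical, and it assumes the median and maximum are taken over every (group, component) cell, which the post does not state explicitly.

```python
from statistics import median


def change_summary(before, after):
    """Median and maximum absolute change across all (group, component) cells."""
    diffs = [
        abs(b - a)
        for group in before
        for b, a in zip(before[group], after[group])
    ]
    return median(diffs), max(diffs)


# Hypothetical K=3 percentages for two groups, before and after removing samples.
before = {"GroupA": [70.00, 25.00, 5.00], "GroupB": [10.00, 88.00, 2.00]}
after = {"GroupA": [70.02, 24.99, 4.99], "GroupB": [10.20, 87.85, 1.95]}

med, largest = change_summary(before, after)
print(f"median abs change: {med:.3f}%, maximum abs change: {largest:.3f}%")
```

Reporting the maximum next to the median is useful here because a single unstable group would barely move the median but would show up immediately in the maximum.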

I have updated the spreadsheet and the plots in the original post.

Looking at the changes in the admixture results I already posted for Harappa Project participants HRP0001 to HRP0010, there is a major change for K=2. The African component (C1/red) increased by a lot among all project participants. This seems to be due to the African component best representing West Africans now instead of Pygmies as it did before.

For K=3,4,5, the changes are very minor. Let's look at the absolute value of the changes in the percentages of ancestral components for the ten project participants.

K	Median Abs	Maximum Abs
3	0.05%	0.19%
4	0.05%	0.22%
5	0.09%	0.60%

I have updated the spreadsheets and the charts in the original post.

San and Pygmy

Posted by Zack on February 3, 2011 11 comments

I have removed San and Pygmy groups from my reference datasets. That meant removing 39 samples from Reference Data I and 61 samples from Reference Data II.

The presence of those groups was creating some weird effects in admixture runs at K=8,9. Basically, the ancestral components for Africans I was getting were not stable. Instead they were varying with/without different Harappa participant batches. Also, at K=10,11, there were too many Africa-only ancestral components, forcing me to run even higher values of K.

Since we are not really interested in African diversity in this project and any African admixture among South Asians is most likely to be East, West or North African instead of Pygmy or San, the removal of these groups should not have any implications for the Harappa Ancestry Project.

To make sure that the above assertion is true, I'll re-run admixture analysis for K=2-5 and update later with the results.

Reference Admixture Analysis K=2-5

Posted by Zack on February 1, 2011 37 comments

Let's do admixture analysis on my reference population.

Since I wasn't sure what value of K would be appropriate, I ran admixture with different values of K, which defines the number of ancestral populations.

The proportion of ancestral populations for each ethnic group is given in this spreadsheet. These are the mean values for that group, calculated by averaging the ancestral proportion across all the samples belonging to that group. I have also calculated the standard deviation across each ethnic group and that's included in the spreadsheet. The higher values of standard deviation are highlighted in blue (>1%) and red (>5%). Those population groups have samples that have somewhat different ancestries.
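
The group summary described above can be sketched as follows. The sample values are hypothetical, and the use of the sample standard deviation (`statistics.stdev`) rather than the population one is an assumption, since the post does not say which was used. The colour thresholds (blue above 1%, red above 5%) are the ones stated above.

```python
from statistics import mean, stdev


def summarize(samples):
    """Per-component mean and sample standard deviation for one ethnic group."""
    columns = list(zip(*samples))  # one tuple of values per ancestral component
    return [mean(col) for col in columns], [stdev(col) for col in columns]


def highlight(sd):
    """Colour rule from the spreadsheet: red above 5%, blue above 1%, else none."""
    if sd > 5.0:
        return "red"
    if sd > 1.0:
        return "blue"
    return None


# Hypothetical K=3 percentages for four samples of one group,
# where the first sample is very different from the other three.
group = [[55.0, 42.0, 3.0], [12.0, 84.0, 4.0], [10.0, 86.0, 4.0], [11.0, 85.0, 4.0]]

means, sds = summarize(group)
flags = [highlight(sd) for sd in sds]
for k, (m, sd, flag) in enumerate(zip(means, sds, flags), start=1):
    print(f"C{k}: mean={m:.2f}% sd={sd:.2f}% highlight={flag}")
```

With only four samples, one outlier is enough to push the standard deviation of two components far above the 5% threshold, which is the kind of group the highlighting is meant to expose.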

Let's start with two ancestral populations, i.e. K = 2.

Admixture: Reference populations K=2

The second ancestral component C2 (cyan) seems to be African and the 1st one C1 (red) is maximum among East Asians. Since all populations are constrained to be made of these two ancestral components, Europeans, Middle Easterners and South Asians all have about half African ancestral component (C2) and the rest East Asian (C1). This is as I expected with the classification of humanity into African and non-African.

The Fst divergences between estimated ancestral populations are as follows:

	C1
C2	0.157

The K=3 analysis ancestral components can be roughly said to be European, East Asian and African.

Admixture: Reference populations K=3

The component C1 (red) is maximum among Europeans and is the major ancestry component for Middle Easterners, Central Asians and South Asians. Ancestral component C2 (green) is East Asian. South Asians also have a significant fraction of C2. African populations are represented by C3 (blue). Yemenis, Mozabites and Ethiopian Jews also have appreciable proportions of this African ancestral component.

Looking at the standard deviations of ancestral components for our sample groups, we see that while the Bedouin, Jordanians, Makrani, Moroccans, Mozabites, Saudis and Yemenis are mostly West Eurasian, their proportion of African ancestry varies quite a bit. The large standard deviation in Paniya is due to one sample (C1=55%, C2=42%, C3=3%) being very different (i.e. much more West Eurasian) from the other three (C1=11%, C2=85%, C3=4%).

There are also a couple of Sindhis with some African admixture. These are possibly partly or wholly Siddi.

HGDP Sindhi Samples Admixture K=3

Fst divergences between estimated populations for K=3:

	C1	C2
C2	0.102
C3	0.144	0.182

With four ancestral components (K=4), component C1 (red) is a South Asian ancestral component. It is maximum among central and south Indians as well as among Papuans and Melanesians. It could thus possibly be related to the ASI (Ancestral South Indian) component. C4 (violet) is the African component. C3 (cyan) is the East Asian component and C2 (green) is the European component.

Admixture: Reference populations K=4

Fst divergences between estimated populations for K=4:

	C1	C2	C3
C2	0.071
C3	0.083	0.109
C4	0.152	0.152	0.184

When we increase K to 5, we get the following graph:

Admixture: Reference populations K=5

Ancestral component C1 (red) is Austronesian/South Asian. It is maximum among the Papuans at 75% and is higher among South Indians as compared to Pakistanis. It is about the same component as C1 in K=4.

C4 (blue) is Southwest Asian/West Asian. It peaks in Yemeni Jews at 66% and is high among Saudis, Bedouin, Samaritans, Egyptians, and Palestinians. It's 32% among Turks, so the Southwest Asian part is dominating the West Asian in this component. Notice how Ethiopians and Ethiopian Jews have about half of their ancestry from this component.

C3 (green) is the East Asian component and is the same as C3 in the K=4 analysis.

C5 (magenta) is the African ancestry component and is about the same as C4 in the K=4 analysis.

C2 (yellow) is the European component. In K=4, the European component was high among both southern and northern Europeans. Now in K=5, we have the C4 (Southwest/West Asian) component among southern Europeans, so this European component has taken on more of a north European outlook.

Fst divergences between estimated populations for K=5:

	C1	C2	C3	C4
C2	0.081
C3	0.084	0.114
C4	0.085	0.054	0.129
C5	0.154	0.165	0.186	0.155

Let's continue this admixture analysis for higher values of K.

Reference Dataset II

Posted by Zack on January 30, 2011 9 comments

Combining my reference population with Xing et al data gets me ~~3,222~~ 3,161 samples but with only about 23,000 SNPs after LD-pruning.

The good thing is that this dataset has 544 South Asian samples from 24 ethnic groups. So it'll be useful for some analyses despite the low number of SNPs. I'll try to run parallel analyses on my reference population and this dataset so we can compare the pros and cons of both.

UPDATE: I removed 61 Pygmy and San samples.

Admixture: Reference Population

Posted by Zack on January 29, 2011 12 comments

For regular admixture analysis, I am using HapMap, HGDP, SGVP and Behar datasets with some samples removed as I wrote earlier.

For each of these datasets,

I first filtered to keep only the list of SNPs present in 23andme v2 chip.
plink --bfile data --extract 23andmev2.snplist
I also filtered for founders:
plink --bfile data --filter-founders
And excluded SNPs with missing rates greater than 1%:
plink --bfile data --geno 0.01

Then, I merged the datasets one by one. The reason for doing it one by one was that there were conflicts of strand orientation (forward or reverse) between the different datasets. If the merge operation gave an error, I had to flip those strands in one dataset and try the merge again.

plink --bfile data1 --bmerge data2.bed data2.bim data2.fam --make-bed
plink --bfile data2 --flip plink.missnp --make-bed --out data2flip
plink --bfile data1 --bmerge data2flip.bed data2flip.bim data2flip.fam --make-bed
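
To show why such strand conflicts arise, here is a small sketch that classifies one SNP by comparing its allele pair in two datasets. It is only an illustration of the idea, not what plink does internally; the function names and return labels are made up for this example. Flipping a strand means replacing each allele with its complement (A with T, C with G), so A/T and C/G SNPs look identical on both strands and cannot be resolved from the alleles alone.

```python
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def flip(alleles):
    """Return the allele pair as read from the opposite strand."""
    return tuple(COMPLEMENT[a] for a in alleles)


def compare_strand(alleles1, alleles2):
    """Classify how one SNP's alleles in dataset 2 relate to dataset 1."""
    s1, s2 = set(alleles1), set(alleles2)
    if s1 == set(flip(alleles1)):
        return "ambiguous"  # A/T or C/G: flipping gives the same pair
    if s1 == s2:
        return "same"       # same strand, nothing to do
    if s1 == set(flip(alleles2)):
        return "flip"       # dataset 2 reports the opposite strand
    return "mismatch"       # alleles disagree even after flipping


print(compare_strand(("A", "G"), ("G", "A")))  # same
print(compare_strand(("A", "G"), ("T", "C")))  # flip
print(compare_strand(("A", "T"), ("A", "T")))  # ambiguous
print(compare_strand(("A", "G"), ("A", "C")))  # mismatch
```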

Once all the four datasets were merged, I processed the combined data file:

Removed SNPs with a missing rate of more than 1% in the combined dataset:
plink --bfile data --geno 0.01
Then I performed linkage disequilibrium based pruning using a window size of 50, a step of 5 and an r^2 threshold of 0.3:
plink --bfile data --indep-pairwise 50 5 0.3
plink --bfile data --extract plink.prune.in --make-bed
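
For intuition about what r^2-based pruning does, here is a deliberately simplified sketch. It is not plink's algorithm: it ignores the step parameter and simply walks through the SNPs in order, keeping one only if its squared genotype correlation with every already-kept SNP among the previous `window` SNPs is at or below the threshold. The function names and the toy genotypes (allele counts 0/1/2) are hypothetical.

```python
def r_squared(x, y):
    """Squared Pearson correlation between two genotype vectors (0/1/2 counts)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # a monomorphic SNP carries no correlation information
    return sxy * sxy / (sxx * syy)


def greedy_prune(genotypes, window=50, threshold=0.3):
    """Indices of SNPs kept after a simplified, order-dependent LD pruning."""
    kept = []
    for i, snp in enumerate(genotypes):
        # Only compare against kept SNPs that lie within the window.
        recent = [k for k in kept if i - k < window]
        if all(r_squared(snp, genotypes[k]) <= threshold for k in recent):
            kept.append(i)
    return kept


# Hypothetical genotypes: rows are SNPs, columns are 6 individuals.
genotypes = [
    [0, 1, 2, 0, 1, 2],
    [0, 1, 2, 0, 1, 2],  # identical to SNP 0, so r^2 = 1 and it is pruned
    [2, 0, 1, 1, 2, 0],
    [1, 1, 0, 2, 0, 2],
]

kept = greedy_prune(genotypes)
print(kept)  # [0, 2, 3]
```

The point of pruning is visible even in this toy case: SNP 1 adds no information beyond SNP 0, so dropping it reduces redundancy without losing signal.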

This gave me a reference population of ~~2,693~~ 2,654 individuals with each sample having about 186,000 SNPs. Out of these ~~2,693~~ 2,654 individuals, we have a total of 398 South Asians belonging to 16 ethnic groups.

Finally, it's time to start having some fun!

UPDATE: I removed 39 Pygmy and San samples because they were causing some trouble with African ancestral components. Since we are not interested in detailed African ancestry and African admixture among South Asians is not likely to be Pygmy or San, I decided it would be best to remove them.


Harappa Ancestry Project

Genetics and South Asia
