You have probably heard of ChromoPainter/fineSTRUCTURE by now (Eurogenes, Dienekes, MDLP and Razib).
So I decided to run the South Asian sample data, on which I had earlier done PCA/MClust, through ChromoPainter and fineSTRUCTURE.
Here is the coancestry matrix among the 715 participants visualized as a heat map.
UPDATE: Here's a huge image showing the same.
fineSTRUCTURE can use this coancestry matrix to classify individuals into clusters, 52 in this case (compared to 38 using PCA and MClust). You can check the cluster assignments in a spreadsheet.
Note that I have named the clusters. That's just shorthand so we don't have to refer to them by cluster number. I labeled each cluster with the population that has the largest number of individuals in it.
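To illustrate the naming rule, here is a minimal Python sketch. It is not the script I actually used; the function name `label_clusters` and the example data are made up for illustration, and it assumes you already have one (cluster number, population) pair per individual.

```python
from collections import Counter, defaultdict


def label_clusters(assignments):
    """Map each cluster number to the population with the most members in it.

    `assignments` is an iterable of (cluster_id, population) pairs,
    one pair per individual. (Hypothetical input format.)
    """
    counts = defaultdict(Counter)
    for cluster_id, population in assignments:
        counts[cluster_id][population] += 1
    # most_common(1) gives the top (population, count) pair; on a tie the
    # population that was encountered first wins.
    return {cid: c.most_common(1)[0][0] for cid, c in counts.items()}


# Made-up example data, not the real cluster assignments.
example = [
    (1, "Pathan"), (1, "Pathan"), (1, "Sindhi"),
    (2, "Vysya"), (2, "Vysya"), (2, "Kanjar"),
]
labels = label_clusters(example)
print(labels)
```

A label chosen this way only says which population is most numerous in a cluster; the cluster can still contain individuals from several other populations.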
Here's the cluster-level coancestry heat map.
And the pairwise coincidence:
And finally PCA plots for the first 10 dimensions from fineSTRUCTURE.
UPDATE (Feb 9, 2012): New PCA plots with better markers for the clusters.
What was the overall running time, and how is it broken down into components (phasing/ChromoPainter/fineSTRUCTURE)?
6 hours for phasing.
22 hours for ChromoPainter.
48-72 hours for fineSTRUCTURE, since that can't be parallelized.
That doesn't seem too bad. How many SNPs and threads, and which software did you use for phasing? I have phased a dataset of similar size with 3 threads using SHAPEIT and that took days, so I am wondering whether I should try something else.
90,000 SNPs and 8 threads using BEAGLE.
I am curious what sort of processing "oomph" you have for compute-intensive stuff like this. Performance is always relative to things like the number of processors, processor speed, cache, memory, etc., no?
I recently upgraded my computer.
Great! Thanks Zack!
Can you put up a higher-resolution version of figure 1, or provide a link to download the figure? The axes are unreadable in the PNG.
Done.
Zack,
Can I ask you for more specifics on your hardware? For instance:
1. Quad-processor or above?
2. How much RAM?
3. What OS, Windows 7 or some variant of Linux, like Ubuntu or something?
4. How is storage organized, RAID array, SATA or whatever?
Any other optimization tweaks?
Thanks in advance
He is running a quad-core Core i7 with 8GB RAM, and Ubuntu/Win XP.
http://www.zackvision.com/weblog/2011/11/computer-upgrade/
I have finally ditched XP for Windows 7. But all the Harappa Project work is done in Ubuntu.
Zack,
BTW, Doug McDonald finds that I have 3.1% Pathan or 3% Sindhi ancestry, while Dienekes has found that my father and mother have 10.3 and 9.3 of his Gedrosia component.
Am I eligible to join the Harappa Project as a result?
Well, eligibility is in the eye of the beholder. 🙂
I don't usually refuse potential participants, but at the same time a number of my analyses, other than basic admixture, do not include many non-South-Asians.
So you have to figure out if you will get anything useful from submitting to the Harappa Ancestry Project.
Well, 23andMe tells me that two of my relatives are Indian - both 5th cousins, 4-10 range - one called Bennett, one called Thakrar - who now live in NZ and Kenya respectively.
So it may be that I actually have some recent South Asian ancestry?! When I search for my name, Conroy, in an online database of the British Army stationed in India, I see that there were 38 enlisted men and 2 officers called Conroy in Bengal alone. Of course Bengal is not Pakistan, but the Connaught Rangers - an all-Irish battalion of the British Army - battled the Pathans and others in the region. I'm wondering if one of them took a "war bride" back to Ireland?
http://en.wikipedia.org/wiki/Connaught_Rangers
Pconroy, if you're referring to Dr. McDonald's analysis as far as your South Asian admixture is concerned, don't take it too seriously. McDonald uses what may be deemed "mixed" samples. For instance, you scored around 3.1% and 3% with the Pakistani Pashtun and Pakistani Sindhi respectively. It is very likely that these small percentages are popping up due to shared ancient ancestry with those groups, rather than any real South Asian admixture. The Pathan and the Sindhi both have appreciable levels of North(-east) European admixture. It seems unlikely to me that you'd have any real, non-trivial and recent South Asian ancestry. We could say the same for Gedrosia: it is found at non-trace levels in most West Eurasian populations and is probably just a signature of generic West Eurasian ancestry rather than of recent South Asian admixture.
Thanks!
I guess the coancestry plot doesn't say much. The fineSTRUCTURE PCA plots are easier to read. It looks like the plots are symmetric with respect to the transpose. Is there a way to figure out which populations are donors and which are the acceptors? For example, from the PCA plot, the Vysya group (on the vertical axis on the left) has a blue line corresponding to Kanjar, Singapore 3 and Dharkar, and this is mirrored in the transpose (if you look at Vysya on the horizontal axis on top). So does this mean that the genes flowed both ways?
ChromoPainter can be run two ways. One is to define specific populations as donors and compute the results for everyone based on those donors.
The other is an all-against-all mode. Here you assume that, for each individual, all other samples are donors. This is what I did in this analysis. So you cannot find out the direction of gene flow, but you can make inferences about clustering, haplotype similarity, etc.
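The difference between the two modes comes down to which samples are allowed to act as donors for each recipient. Here is a rough Python sketch of just that bookkeeping. It is not ChromoPainter code or its interface; the function names, sample IDs and population labels are all hypothetical.

```python
def fixed_donors(samples, donor_populations):
    """Donor-population mode: every recipient is painted using only
    individuals from the chosen donor populations (never itself)."""
    donors = [sid for sid, pop in samples.items() if pop in donor_populations]
    return {sid: [d for d in donors if d != sid] for sid in samples}


def all_against_all(samples):
    """All-against-all mode: every other sample is a donor for each recipient."""
    return {sid: [d for d in samples if d != sid] for sid in samples}


# Hypothetical sample IDs and population labels, for illustration only.
samples = {"ind1": "Pathan", "ind2": "Pathan", "ind3": "Sindhi", "ind4": "Vysya"}

print(fixed_donors(samples, {"Sindhi", "Vysya"})["ind1"])
print(all_against_all(samples)["ind1"])
```

In the all-against-all setup every sample is both a recipient and a donor, so with 715 participants each one has 714 possible donors. No group is singled out as a source, which is why the output supports clustering and similarity comparisons but not statements about who contributed to whom.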
Thanks Zack.