As part of my effort to create one big reference dataset for my own use, I have been going over all the datasets I have to make sure there are no duplicates, relatives, or other oddities that could cause issues with my analysis.
So I went back to the Chaubey et al Austroasiatic Indians dataset.
The dataset itself doesn't contain any duplicates or likely relatives. Of course, I had to remove the 632 HGDP samples it included, but that's easy to do since they have the same IDs (starting with HGDP).
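For illustration, here is a minimal sketch of that kind of ID-based filter. It assumes the samples are listed in a PLINK-style .fam file (family ID in the first column, individual ID in the second), which is an assumption on my part for the example. The function name `hgdp_remove_list` and the sample IDs are made up.

```python
def hgdp_remove_list(fam_lines, prefix="HGDP"):
    """Return (FID, IID) pairs whose individual ID starts with `prefix`.

    Assumes PLINK-style .fam lines: FID IID PAT MAT SEX PHENO.
    """
    pairs = []
    for line in fam_lines:
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        fid, iid = fields[0], fields[1]
        if iid.startswith(prefix):
            pairs.append((fid, iid))
    return pairs


# Hypothetical rows, not taken from the real dataset.
sample_fam = [
    "FAM1 HGDP00001 0 0 1 -9",
    "FAM2 SAMPLE_A 0 0 2 -9",
    "FAM3 HGDP00002 0 0 1 -9",
]
to_remove = hgdp_remove_list(sample_fam)
for fid, iid in to_remove:
    print(fid, iid)
```

The resulting two-column list has the shape PLINK expects for its `--remove` option, so it could be written to a text file and used to drop those samples.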
As their paper mentions, the dataset also has 19 Dravidian-speaking Indian samples from Behar et al. Since I got the Behar et al data from the GEO site, I had different IDs for them than the ones used in this dataset, so I had to figure out which samples were the same in both. The IBS/IBD results for the duplicates, as well as the list of sample IDs I removed, are given in a spreadsheet.
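As a rough sketch of how duplicates with different IDs can be picked out of pairwise IBS/IBD results: the example below assumes a table in the format of PLINK's `--genome` output, where `PI_HAT` is the estimated proportion of the genome shared IBD and is close to 1 for the same individual genotyped twice. The 0.9 cutoff, the function name `find_duplicate_pairs`, and the sample rows are all assumptions made for the example, not the actual values or IDs from the spreadsheet.

```python
def find_duplicate_pairs(genome_lines, min_pi_hat=0.9):
    """Return (IID1, IID2, PI_HAT) for pairs at or above `min_pi_hat`.

    Expects whitespace-delimited lines with a header row that contains
    at least the columns IID1, IID2 and PI_HAT.
    """
    rows = [line.split() for line in genome_lines if line.strip()]
    header, data = rows[0], rows[1:]
    col = {name: i for i, name in enumerate(header)}
    dupes = []
    for r in data:
        pi_hat = float(r[col["PI_HAT"]])
        if pi_hat >= min_pi_hat:
            dupes.append((r[col["IID1"]], r[col["IID2"]], pi_hat))
    return dupes


# Hypothetical rows; the IDs and values are invented for illustration.
sample_genome = [
    "FID1 IID1 FID2 IID2 RT EZ Z0 Z1 Z2 PI_HAT PHE DST PPC RATIO",
    "A GSM_X1 B AA_P1 UN NA 0.0000 0.0000 1.0000 1.0000 -1 1.000000 1.0000 NA",
    "A GSM_X2 B AA_P2 UN NA 0.9500 0.0500 0.0000 0.0250 -1 0.720000 0.6000 2.1",
]
for iid1, iid2, pi_hat in find_duplicate_pairs(sample_genome):
    print(iid1, iid2, pi_hat)
```

Each pair returned links an ID from one dataset to its counterpart in the other, which is what is needed to decide which copy to drop.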
Checking this resolved an issue I had with Behar et al, which has four Paniya samples from South India. One of those four has admixture proportions similar to other Indians, but the other three look very East Asian. I had always suspected that those three samples were mislabeled. The Austroasiatic dataset also has four Paniya samples. However, only one of them is identical to a Behar et al sample; the other three are different. I haven't checked yet which of the Behar samples matches the Austroasiatic one, but my guess is that it is the one with the more Indian-like admixture. So I am keeping the three non-matching Paniya samples from the Austroasiatic dataset and hoping that they are the correct ones.