I performed IBD analysis on the Metspalu dataset using plink and found the relatedness of the following samples to be too high.
ID1 | Source1 | Population1 | ID2 | Source2 | Population2 | IBD Estimate |
---|---|---|---|---|---|---|
Mawasi1 | Metspalu | Mawasi | Mawasi1 | Chaubey | Mawasi | 100% |
VELZ260 | Metspalu | Velama | Velama_184_R2 | Reich | Velama | 99% |
VELZ260 | Metspalu | Velama | VELZ265 | Metspalu | Velama | 19% |
VELZ265 | Metspalu | Velama | Velama_184_R2 | Reich | Velama | 19% |
D254 | Metspalu | Tharu | Tharu_107_R1 | Reich | Tharu | 99% |
D260 | Metspalu | Tharu | Tharu_108_R1 | Reich | Tharu | 98% |
evo_32 | Metspalu | Kanjar | 321e | Metspalu | Kol | 53% |
HA030 | Metspalu | Dharkar | HA039 | Metspalu | Dharkar | 52% |
A387 | Metspalu | Dusadh | A388 | Metspalu | Dusadh | 52% |
A394 | Metspalu | Dusadh | A395 | Metspalu | Dusadh | 52% |
A395 | Metspalu | Dusadh | A393 | Metspalu | Dusadh | 46% |
A394 | Metspalu | Dusadh | A393 | Metspalu | Dusadh | 45% |
A392 | Metspalu | Dusadh | A393 | Metspalu | Dusadh | 32% |
A392 | Metspalu | Dusadh | A395 | Metspalu | Dusadh | 31% |
A392 | Metspalu | Dusadh | A394 | Metspalu | Dusadh | 28% |
evo_37 | Metspalu | Kanjar | HA023 | Metspalu | Dharkar | 27% |
HA039 | Metspalu | Dharkar | HA041 | Metspalu | Dharkar | 24% |
HLKP245 | Metspalu | Hakkipikki | Hallaki_137_R2 | Reich | Hallaki | 22% |
PULD160 | Metspalu | Pulliyar | PULD162 | Metspalu | Pulliyar | 20% |
As you can see, three samples from Reich et al seem to be the same individuals as samples in Metspalu et al. In addition, two Reich samples seem to be related to Metspalu samples.
Some Metspalu samples are likely related to one another. An estimate around 50% likely indicates a parent-child or sibling relationship. A 45-46% relatedness is most likely siblings, in my opinion. An estimate of 18-19% could be a first-cousin relationship in an endogamous community; it could also just be the background relatedness in a small, bottlenecked, endogamous community.
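To make that reading of the percentages concrete, here is a rough sketch that maps an IBD proportion to a relationship class. It assumes the estimate is a genome-wide proportion like the PI_HAT value PLINK reports from `--genome`. The cut-offs are my own illustrative midpoints between the textbook expectations (100%, 50%, 25%, 12.5%), not PLINK defaults, and the function name is hypothetical. In an endogamous group the background sharing is elevated, so these labels will tend to overstate how close a pair really is.

```python
# Expected IBD proportions for textbook relationships in an outbred population.
# Each cut-off is the midpoint between two neighbouring expectations.
# These thresholds are illustrative assumptions, not PLINK defaults.
CUTOFFS = [
    (0.75, "duplicate or identical twin"),                    # expected 1.0
    (0.375, "first degree (parent-child or full sibling)"),   # expected 0.5
    (0.1875, "second degree (e.g. half sibling)"),            # expected 0.25
    (0.09375, "third degree (e.g. first cousin)"),            # expected 0.125
]


def classify_ibd(pi_hat):
    """Return a coarse relationship label for an IBD proportion in [0, 1]."""
    for cutoff, label in CUTOFFS:
        if pi_hat >= cutoff:
            return label
    return "unrelated or distant"


if __name__ == "__main__":
    # A few pairs taken from the table above (percentages as proportions).
    for pair, pi_hat in [("VELZ260 / Velama_184_R2", 0.99),
                         ("HA030 / HA039", 0.52),
                         ("A392 / A394", 0.28)]:
        print(pair, "->", classify_ibd(pi_hat))
```

Values near a boundary, such as the 19% Velama pairs, should not be read too literally: with these cut-offs they fall just on the second-degree side, while a first-cousin relationship in an endogamous community fits equally well.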
It looks like about half of the Dusadh in the Metspalu dataset are related.
I am surprised at the close relationship of a Kanjar and a Kol in the dataset, though both are from Uttar Pradesh.
Is it so hard to do sampling without incorporating relatives? Most ethnic or regional groups have millions of members.
With endogamy, the relatedness within groups is very high. In our relatively insular caste (total population ~7 million), in spite of gotra and clan exogamy, almost everyone is related to one another within a few degrees.
I think it is hard.
Why? You have the option of collecting all samples of an ethnic group from different locales of a country. This way all samples of the ethnic group will likely be non-relatives.
That requires extra resources, both in terms of money and knowledge.
I think universities and research institutes have enough resources, especially in wealthy countries, to afford such sampling tasks.
While I can see the difficulty in collecting unrelated samples, what I can't understand is the lack of first-level pruning/cleaning-up of the data, just as you have done above. A careful reading of the Material and Methods doesn't show any specific issue here. It's not rocket science and there is no reason to doubt your numbers above. (Perhaps the data you were given was the unpruned set? The dataset is password-protected at NCBI/GEO.)
Assuming your days for the next week had 48h each (:)) it shouldn't be too hard to compute whether this data did indeed skew the relevant haplotype frequencies and/or Fst's, right? (These two strike me as the most probable; perhaps I am mistaken in this.)
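For what it's worth, the kind of first-level pruning meant here can be sketched in a few lines. This is only an illustration under assumptions: the function name, the 18.5% threshold and the greedy strategy (repeatedly drop the sample with the most remaining relatives, ties broken alphabetically) are my choices, not anything described in the paper. The pairs are the Dusadh rows from the table above.

```python
from collections import defaultdict


def prune_relatives(pairs, threshold=0.185):
    """Greedily pick samples to drop so that no related pair remains.

    pairs: iterable of (id1, id2, ibd_proportion).
    threshold: assumed cut-off above which a pair counts as related.
    """
    neighbours = defaultdict(set)
    for id1, id2, pi_hat in pairs:
        if pi_hat >= threshold:
            neighbours[id1].add(id2)
            neighbours[id2].add(id1)

    removed = set()
    while True:
        # Samples that still have at least one related partner left.
        degrees = {s: len(n) for s, n in neighbours.items() if n}
        if not degrees:
            break
        # Most relatives first; alphabetical order breaks ties.
        worst = min(degrees, key=lambda s: (-degrees[s], s))
        for other in neighbours.pop(worst):
            neighbours[other].discard(worst)
        removed.add(worst)
    return removed


# Dusadh pairs from the table above, percentages written as proportions.
DUSADH_PAIRS = [
    ("A387", "A388", 0.52), ("A394", "A395", 0.52), ("A395", "A393", 0.46),
    ("A394", "A393", 0.45), ("A392", "A393", 0.32), ("A392", "A395", 0.31),
    ("A392", "A394", 0.28),
]

if __name__ == "__main__":
    print(sorted(prune_relatives(DUSADH_PAIRS)))
```

Greedy removal does not guarantee the smallest possible set of dropped samples, but it is simple and leaves no pair above the threshold.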
The blood/saliva samples are collected by one (or more) groups, the genotyping is done by another lab, and the analysis (that brought us the paper) is done by others still. Usually it's at the last step that you find out about samples being related. It's not possible to go back into the field then.
My guess is the results in the paper are not impacted much by those relatives. The Reich et al data was used in only a couple of analyses. The only population whose results are likely to be affected is Dusadh.
I am going to compute Fst to compare with the Metspalu et al paper.
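As a sketch of what such a computation can look like, the function below implements Hudson's Fst estimator, combined across SNPs as a ratio of averages. This is an assumption on my part for illustration: it is not necessarily the estimator used in the paper, and the function name and inputs are hypothetical. It takes per-SNP allele frequencies for two populations and the number of sampled chromosomes (twice the number of individuals) in each.

```python
def hudson_fst(p1, p2, n1, n2):
    """Hudson's Fst for two populations, as a ratio of averages over SNPs.

    p1, p2: per-SNP frequencies of the same allele in each population.
    n1, n2: number of sampled chromosomes (2 x individuals), assumed
            constant across SNPs for simplicity (no missing genotypes).
    """
    num_sum = 0.0
    den_sum = 0.0
    for a, b in zip(p1, p2):
        # Squared frequency difference, corrected for finite sample size.
        num_sum += (a - b) ** 2 - a * (1 - a) / (n1 - 1) - b * (1 - b) / (n2 - 1)
        # Probability that two alleles drawn from different populations differ.
        den_sum += a * (1 - b) + b * (1 - a)
    return num_sum / den_sum


if __name__ == "__main__":
    # Toy frequencies, made up purely to show the call.
    pop_a = [0.10, 0.55, 0.80, 0.30]
    pop_b = [0.20, 0.50, 0.60, 0.35]
    print(hudson_fst(pop_a, pop_b, n1=40, n2=40))
```

Running it once with the related samples included and once with them pruned would show directly how much the relatives move the estimate for a population like Dusadh.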
Yes, given the size of the related set above relative to the set used in the analyses, I expect it to have minimal impact on the results, if any. I was a bit surprised there was no mention of this related set in the paper. I agree with your first paragraph.