Metspalu et al Data Relatedness

Posted by Zack on December 11, 2011

I performed IBD analysis on the Metspalu dataset using plink and found the relatedness of the following samples to be too high.

ID1	Source1	Population1	ID2	Source2	Population2	IBD Estimate
Mawasi1	Metspalu	Mawasi	Mawasi1	Chaubey	Mawasi	100%
VELZ260	Metspalu	Velama	Velama_184_R2	Reich	Velama	99%
VELZ260	Metspalu	Velama	VELZ265	Metspalu	Velama	19%
VELZ265	Metspalu	Velama	Velama_184_R2	Reich	Velama	19%
D254	Metspalu	Tharu	Tharu_107_R1	Reich	Tharu	99%
D260	Metspalu	Tharu	Tharu_108_R1	Reich	Tharu	98%
evo_32	Metspalu	Kanjar	321e	Metspalu	Kol	53%
HA030	Metspalu	Dharkar	HA039	Metspalu	Dharkar	52%
A387	Metspalu	Dusadh	A388	Metspalu	Dusadh	52%
A394	Metspalu	Dusadh	A395	Metspalu	Dusadh	52%
A395	Metspalu	Dusadh	A393	Metspalu	Dusadh	46%
A394	Metspalu	Dusadh	A393	Metspalu	Dusadh	45%
A392	Metspalu	Dusadh	A393	Metspalu	Dusadh	32%
A392	Metspalu	Dusadh	A395	Metspalu	Dusadh	31%
A392	Metspalu	Dusadh	A394	Metspalu	Dusadh	28%
evo_37	Metspalu	Kanjar	HA023	Metspalu	Dharkar	27%
HA039	Metspalu	Dharkar	HA041	Metspalu	Dharkar	24%
HLKP245	Metspalu	Hakkipikki	Hallaki_137_R2	Reich	Hallaki	22%
PULD160	Metspalu	Pulliyar	PULD162	Metspalu	Pulliyar	20%

As you can see, three samples from Reich et al seem to be the same individuals as Metspalu et al samples, and Mawasi1 appears in both the Metspalu and Chaubey data. In addition, two Reich samples seem to be related to Metspalu samples.
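
A minimal sketch of how a listing like the one above could be pulled from PLINK output. It assumes the pairwise estimates come from PLINK's `--genome` report and that the IBD estimate is its `PI_HAT` column; the 0.18 cutoff, the function name, and the trimmed sample input (including the family IDs and the `S1`/`S2` pair) are illustrative assumptions, not the settings used for this post.

```python
def related_pairs(genome_text, min_pi_hat=0.18):
    """Return (IID1, IID2, PI_HAT) tuples at or above min_pi_hat, highest first."""
    rows = [line.split() for line in genome_text.splitlines() if line.strip()]
    header, records = rows[0], rows[1:]
    # Look columns up by name so a full .genome file (with Z0, Z1, Z2, ...) also works.
    id1, id2, pi = (header.index(name) for name in ("IID1", "IID2", "PI_HAT"))
    pairs = [(r[id1], r[id2], float(r[pi])) for r in records]
    flagged = [p for p in pairs if p[2] >= min_pi_hat]
    return sorted(flagged, key=lambda p: p[2], reverse=True)


# Trimmed, illustrative input: only the columns the function needs.
# PI_HAT values are the rounded percentages from the table; the family IDs
# and the S1/S2 pair are hypothetical.
SAMPLE = """\
 FID1 IID1 FID2 IID2 PI_HAT
 Velama VELZ260 Velama Velama_184_R2 0.99
 Dusadh A387 Dusadh A388 0.52
 Velama VELZ260 Velama VELZ265 0.19
 Hypo S1 Hypo S2 0.01
"""

for iid1, iid2, pi_hat in related_pairs(SAMPLE):
    print(f"{iid1}\t{iid2}\t{pi_hat:.0%}")
```

Reading the column positions from the header keeps the sketch independent of the exact column order in the report.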

Some Metspalu samples are likely related to one another. An estimate around 50% likely indicates a parent-child or sibling relationship. A relatedness of 45-46% is most likely siblings, in my opinion. An estimate of 18-19% could be a first-cousin relationship in an endogamous community. It could also just be the background relatedness in a small, bottlenecked, endogamous community.
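
To make that reasoning concrete, here is a rough sketch that maps an estimate to the nearest textbook expectation for an outbred pedigree (100% for duplicates or identical twins, 50% for first-degree relatives, 25% for second-degree, 12.5% for first cousins). The nearest-value rule and the function name are my own illustrative assumptions, not a standard classifier, and endogamy will push real estimates above these expectations.

```python
# Expected genome-wide IBD proportions under simple outbred pedigrees.
EXPECTED = [
    (1.0, "duplicate sample or identical twin"),
    (0.5, "first degree (parent-child or full siblings)"),
    (0.25, "second degree (e.g. half siblings, uncle-niece)"),
    (0.125, "third degree (e.g. first cousins)"),
    (0.0, "unrelated"),
]


def nearest_relationship(pi_hat):
    """Return the label whose expected IBD proportion is closest to pi_hat."""
    return min(EXPECTED, key=lambda item: abs(item[0] - pi_hat))[1]


# Estimates taken from the table above, as proportions.
for estimate in (0.99, 0.52, 0.46, 0.19, 0.18):
    print(f"{estimate:.0%}: {nearest_relationship(estimate)}")
```

Note that 18% and 19% land on opposite sides of the midpoint between the first-cousin and second-degree expectations, which is why such values are hard to label with confidence in a bottlenecked group.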

It looks like about half of the Dusadh in the Metspalu dataset are related.

I am surprised at the close relationship of a Kanjar and a Kol in the dataset, though both are from Uttar Pradesh.



9 Comments.

Onur December 11, 2011 at 5:11 pm

Is it so hard to do sampling without incorporating relatives? Most ethnic or regional groups have millions of members.
- Parasar December 11, 2011 at 5:28 pm
  
  With endogamy, the relatedness within groups is very high. In our relatively insular caste (total population ~7 million), in spite of gotra and clan exogamy, almost everyone is related within a few degrees to one another.
- Zack December 11, 2011 at 7:06 pm
  
  I think it is hard.
  - Onur December 11, 2011 at 7:13 pm
    
    Why? You have the option of collecting all samples of an ethnic group from different locales of a country. This way all samples of the ethnic group will likely be non-relatives.
    - Zack December 11, 2011 at 7:29 pm
      
      That requires extra resources. Both in terms of money and knowledge.
      - Onur December 11, 2011 at 7:36 pm
        
        I think universities and research institutes have enough resources, especially in wealthy countries, to afford such sampling tasks.
Vitasta December 11, 2011 at 10:47 pm

While I can see the difficulty in collecting unrelated samples, what I can't understand is the lack of first-level pruning/cleaning-up of the data, just as you have done above. A careful reading of the Material and Methods doesn't show any specific issue here. It's not rocket science and there is no reason to doubt your numbers above. (Perhaps the data you were given was the unpruned set? The dataset is password-protected at NCBI/GEO.)

Assuming your days for the next week had 48h each (:)) it shouldn't be too hard to compute whether this data did indeed skew the relevant haplotype frequencies and/or Fst values, right? (These two strike me as the most probable, though perhaps I am mistaken in this.)
- Zack December 13, 2011 at 9:19 am
  
  The blood/saliva samples are collected by one (or more) groups, the genotyping is done by another lab, and the analysis (that brought us the paper) is done by others still. Usually it's at the last step that you find out about samples being related. It's not possible to go back into the field then.
  
  My guess is the results in the paper are not impacted much by those relatives. The Reich et al data was used in only a couple of analyses. The only population whose results are likely to be affected is Dusadh.
  
  I am going to compute Fst to compare with the Metspalu et al paper.
  - Vitasta December 13, 2011 at 1:06 pm
    
    Yes, given the size of the related set above relative to the set used in the analyses, I expect it to have minimal impact on the results, if any. I was a bit surprised there was no mention of this related set in the paper. Agree with your first paragraph.

Harappa Ancestry Project

Genetics and South Asia