ADMIXTURE Seed and Cross-Validation

I have been running some ADMIXTURE experiments recently. These runs use my world dataset with about 180,000 SNPs.

I ran ADMIXTURE at K=15 ancestral components with different random seeds. Let's take a look at the final log likelihoods and cross-validation errors I got for 11 such runs.
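To make the setup concrete, here is a minimal Python sketch of how such a batch of runs could be scripted. It assumes the usual ADMIXTURE command line (--cv for cross-validation, -s for the random seed) and a hypothetical world.bed input; check your ADMIXTURE version's manual for the exact options and the exact wording of the output lines.

```python
# Sketch: run ADMIXTURE at K=15 with several random seeds and collect the
# final log likelihood and CV error from each run's output. The input name
# (world.bed) and the seed list are placeholders; --cv and -s are the usual
# ADMIXTURE options, but verify against your version's manual.
import re
import subprocess

K = 15
seeds = [1, 43, 1234, 20120424]   # any distinct integers will do

for seed in seeds:
    out = subprocess.run(
        ["admixture", "--cv", "-s", str(seed), "world.bed", str(K)],
        capture_output=True, text=True, check=True,
    ).stdout

    # ADMIXTURE prints summary lines like "Loglikelihood: -1234567.8"
    # and "CV error (K=15): 0.52200" near the end of its output.
    loglik = float(re.search(r"^Loglikelihood:\s*(\S+)", out, re.M).group(1))
    cv = float(re.search(r"CV error \(K=\d+\):\s*(\S+)", out).group(1))
    print(f"seed={seed}  loglik={loglik:.1f}  CV={cv:.5f}")

    # Note: each run overwrites world.15.Q and world.15.P, so copy them
    # aside here if you want to keep every run's ancestry estimates.
```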

As you can see, as the log likelihood increases, the cross-validation error decreases, though there is a fair bit of variation there.
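A quick matplotlib sketch of that kind of comparison; the (log likelihood, CV error) pairs below are made-up placeholders purely to illustrate the plot, not my actual numbers.

```python
# Plot CV error against final log likelihood for a set of seeded runs.
# The values here are illustrative placeholders, not real results.
import matplotlib.pyplot as plt

runs = [(-1.0200e7, 0.52241), (-1.0195e7, 0.52225), (-1.0190e7, 0.52212),
        (-1.0187e7, 0.52206), (-1.0183e7, 0.52203), (-1.0180e7, 0.52200)]

loglik, cv = zip(*runs)
plt.scatter(loglik, cv)
plt.xlabel("Final log likelihood")
plt.ylabel("CV error")
plt.title("ADMIXTURE K=15 runs with different random seeds")
plt.show()
```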

For different runs, I got fairly different ancestral components. Remember that the only difference between the different runs was the random seed used to initialize the algorithm. Some ancestral components stayed very similar across the runs but others appeared and disappeared or switched subtly between different populations in a broad region.

The cross-validation (CV) error is important in my opinion since it gives you an idea of which run's results generalize better. Basically, ADMIXTURE masks a random portion of the genotypes in the dataset, estimates the model from what remains, and then measures how well it predicts the masked genotypes.
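Here is a toy numpy sketch of that idea. The genotypes, Q (ancestry fractions), and P (allele frequencies) below are random placeholders standing in for ADMIXTURE's actual estimates, and the error measure is a plain RMSE rather than the deviance-based statistic ADMIXTURE reports; it only illustrates the hold-out logic.

```python
# Toy illustration of the idea behind ADMIXTURE's cross-validation: hide a
# random subset of genotype entries, fit the model on what is left, then see
# how well the fitted Q (ancestry fractions) and P (allele frequencies)
# predict the hidden entries. The fit itself is stubbed out with random
# placeholders; ADMIXTURE does the real estimation, masks genotypes in
# folds, and reports a deviance-based error rather than this RMSE.
import numpy as np

rng = np.random.default_rng(0)
n_ind, n_snp, K = 50, 1000, 3

# Fake genotype matrix (0/1/2 minor-allele counts), for illustration only.
G = rng.integers(0, 3, size=(n_ind, n_snp)).astype(float)

# Mask ~10% of entries; these play the role of a held-out fold.
mask = rng.random(G.shape) < 0.10

# Placeholders for the fit on the unmasked data; in reality ADMIXTURE
# estimates Q (n_ind x K, rows summing to 1) and P (K x n_snp) here.
Q = rng.dirichlet(np.ones(K), size=n_ind)
P = rng.random((K, n_snp))

# Expected genotype under the admixture model: E[g] = 2 * Q @ P.
G_hat = 2.0 * Q @ P

# Prediction error on the held-out entries.
cv_error = np.sqrt(np.mean((G[mask] - G_hat[mask]) ** 2))
print(f"held-out prediction error: {cv_error:.4f}")
```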

At K=15, the minimum CV error I got was 0.52200 and the median was 0.52206. The maximum CV error was 0.52241, which is pretty large for this data. Let's superimpose this maximum CV value on a graph showing how CV error varies for different values of K (number of ancestral components).

The set of runs in this graph (other than the red line for the maximum CV error at K=15) used the default random seed for ADMIXTURE.
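A matplotlib sketch of the layout of that graph; the per-K CV values below are placeholders, and only the 0.52241 reference line comes from the actual K=15 seed runs.

```python
# Sketch of the described graph: CV error across a sweep of K from single
# default-seed runs, with the worst K=15 CV error from the seed experiment
# drawn as a red reference line. The per-K values are placeholders.
import matplotlib.pyplot as plt

ks = list(range(10, 21))
cv_by_k = [0.5245, 0.5240, 0.5236, 0.5232, 0.5228, 0.5226,
           0.5224, 0.5223, 0.5222, 0.5222, 0.5223]   # illustrative only
max_cv_k15 = 0.52241   # actual maximum from the K=15 seed runs

plt.plot(ks, cv_by_k, marker="o", label="default-seed runs")
plt.axhline(max_cv_k15, color="red", label="max CV at K=15 (varied seeds)")
plt.xlabel("K (number of ancestral components)")
plt.ylabel("CV error")
plt.legend()
plt.show()
```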

What this shows is that running ADMIXTURE only once using the default random seed (or any other single seed) is fraught with problems. A better approach is to run it multiple times with different seeds so you can be more confident that you have arrived at, or near, the best solution the algorithm can find.
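In practice this can be as simple as saving each seeded run's console output to a log file and then picking the run with the lowest CV error (or highest final log likelihood). The file naming scheme below is a hypothetical convention; the "CV error" line format is how ADMIXTURE reports it.

```python
# Pick the best run among saved ADMIXTURE logs by lowest CV error.
# "logs/K15_seed*.log" is a hypothetical naming convention.
import glob
import re

best_path, best_cv = None, None
for path in glob.glob("logs/K15_seed*.log"):
    with open(path) as fh:
        match = re.search(r"CV error \(K=\d+\):\s*(\S+)", fh.read())
    if match is None:
        continue
    cv = float(match.group(1))
    if best_cv is None or cv < best_cv:
        best_path, best_cv = path, cv

if best_path is not None:
    print(f"best run: {best_path} (CV error {best_cv:.5f})")
```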

Comments:

  1. Dude, it's cool, but try to estimate the age of the components, which I think is the real deal now.

  2. The point here is two-fold.

    One is that we shouldn't think of ADMIXTURE results as giving us real ancestral populations. Instead, it's more useful to interpret the results relative to our individuals and populations. This is now fairly well-known, though sometimes we do tend to ignore it.

    The other point is that running ADMIXTURE once not only fails to give you the optimal historical result, it doesn't even provide you with the optimal mathematical result. There is a large variation between runs using different random seeds. Some published papers that used ADMIXTURE did run it with different seeds and kept the best or most common result, but I don't think this issue is well-known outside academia.

  3. Very important observation! I see people going around claiming ancient ancestry just because they have x% Caucasian, y% European ancestry, etc. While there may be similarity with those other populations in certain regions of the chromosome, the source could be completely different. To me, admixture results can only be used to bin people into a set of different populations based on similarity. I am skeptical of interpreting ancestral origins from autosomal results. Y-DNA and mtDNA are much more reliable in that sense.

  4. Informative post. I have experimented with CV errors before but not with random seeds; I will do so in future runs.

  5. Lately, I have been really interested in the Uighur people and their diverse genetic makeup, considering that a large percentage of their mtDNA is Western Eurasian.

    Could the same scenario have played out among the Pashtuns, Punjabi and other Northern Indian peoples?

    http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2790568/?tool=pmcentrez

  6. Yes, for relations, or better put, genetic similarities among groups, components are cool. But for judging historical and prehistoric genetic impact they are not direct or conclusive by any means, especially in the case of unseen/unrecorded eras.
