Modeling Linkage Disequilibrium Increases Accuracy of Polygenic Risk Scores

Bjarni J Vilhjalmsson, postdoc at Bioinformatics Research Centre, Aarhus University

About the study

The ability to predict disease risk and other traits, such as drug side effects, from individual genotypes is an important goal in precision medicine. A large number of genetic tests are already available in clinical settings; these test for rare traits using a handful of carefully chosen genetic markers. However, most common diseases, such as coronary artery disease, type 2 diabetes, and schizophrenia, have been shown to have a highly polygenic genetic architecture. For such diseases, tests based on individual genetic variants have little predictive value for most individuals. Polygenic risk scores (PRS) are a promising alternative for polygenic diseases because they combine a genome-wide set of variants into a single risk score. PRS typically require only summary-level information from genome-wide association studies (GWAS) as training data, plus a set of genotypes from which linkage disequilibrium patterns can be estimated. This allows PRS to be trained on larger sample sizes than would otherwise be possible, because GWAS summary statistics are typically made publicly available even when individual genotypes are not. Besides predicting risk, PRS can also be used to study the genetic architecture of polygenic diseases and traits, and to estimate heritability, polygenicity, and the genetic correlations between diseases and traits.
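In its simplest form, a risk score of the kind described above is a weighted sum of an individual's allele counts. The following is a minimal sketch of that idea, not code from the study: the function name, the effect sizes, and the genotypes are all hypothetical and chosen only for illustration.

```python
def polygenic_risk_score(dosages, weights):
    """Weighted sum of allele dosages (0, 1 or 2 copies of the effect allele)."""
    if len(dosages) != len(weights):
        raise ValueError("one weight per variant is required")
    return sum(g * w for g, w in zip(dosages, weights))


# Hypothetical per-variant effect sizes, as would be read from GWAS summary statistics.
weights = [0.12, -0.05, 0.30, 0.02]

# Hypothetical genotypes (effect-allele counts) for two individuals.
individuals = {
    "person_a": [2, 0, 1, 1],
    "person_b": [0, 2, 0, 1],
}

scores = {name: polygenic_risk_score(g, weights) for name, g in individuals.items()}
```

How the weights are chosen is what distinguishes one PRS method from another; the summation step itself stays the same.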

The standard PRS currently in use is calculated on a set of markers that have been pruned with respect to linkage disequilibrium (LD) and association p-value. The p-value threshold is typically optimized for prediction accuracy in a validation sample. However, these two pruning steps (LD-pruning and p-value thresholding) limit the predictive accuracy of PRS, especially when training sample sizes are very large. To address this problem, we proposed a novel Bayesian PRS, LDpred. The LDpred PRS has the desirable property that it converges to the optimal prediction accuracy (as determined by the heritability of the trait or disease) as sample sizes increase. We applied LDpred to both simulated and real datasets and observed improvements in both prediction accuracy and bias. For example, when applied to five diseases and height, for which we had large training samples, we observed a relative improvement in prediction accuracy of about 10-30%. For schizophrenia, the Nagelkerke R2 improved from 20.1% to 25.3%, and for multiple sclerosis it improved from 9.8% to 12.0%, compared to the standard PRS. For height, Pearson’s R2 improved from 6.3% to 8.5%, a relative improvement of more than 30% in prediction accuracy.
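The two pruning steps of the standard approach can be sketched as a greedy procedure: visit variants from most to least significant, keep a variant only if it is not in strong LD with one already kept, and stop at the p-value threshold. This is an illustrative sketch under assumptions, not the pipeline used in the study: the dictionary keys, the variant IDs, the toy LD values, and the r2 cutoff of 0.2 are all hypothetical.

```python
def prune_and_threshold(variants, ld_r2, p_threshold, r2_threshold=0.2):
    """Greedy LD-pruning followed by p-value thresholding.

    variants: list of dicts with hypothetical keys "id", "beta", "p".
    ld_r2: function (id_a, id_b) -> squared correlation between two variants.
    Returns {variant id: weight} for the variants kept in the score.
    """
    kept = []
    # Most significant first, so the strongest signal in each LD block survives.
    for v in sorted(variants, key=lambda v: v["p"]):
        if v["p"] > p_threshold:
            break  # all remaining variants are less significant
        if all(ld_r2(v["id"], k["id"]) < r2_threshold for k in kept):
            kept.append(v)
    return {v["id"]: v["beta"] for v in kept}


# Toy summary statistics (hypothetical values).
variants = [
    {"id": "snp1", "beta": 0.10, "p": 1e-8},
    {"id": "snp2", "beta": 0.09, "p": 1e-6},   # in high LD with snp1
    {"id": "snp3", "beta": -0.04, "p": 0.003},
    {"id": "snp4", "beta": 0.01, "p": 0.40},
]

# Toy LD table: only snp1 and snp2 are correlated.
toy_r2 = {frozenset(("snp1", "snp2")): 0.9}


def ld_r2(a, b):
    return toy_r2.get(frozenset((a, b)), 0.0)


weights = prune_and_threshold(variants, ld_r2, p_threshold=0.01)
```

In this toy example snp2 is discarded because of its LD with snp1, and snp4 is discarded by the threshold. Any information those variants carry is lost to the score, which is the limitation the paragraph above describes.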

As training sample sizes increase, PRS will become more accurate and more valuable, both in clinical settings and for understanding genetics. Although LDpred offers significant improvements over existing approaches, PRS can still be improved further. One future direction would be to use different priors for different genomic regions; another would be to account for GWAS summary statistics from correlated traits.

The article "Modeling Linkage Disequilibrium Increases Accuracy of Polygenic Risk Scores" was published in The American Journal of Human Genetics, 2015 Oct 1;97(4):576-92.

Facts about the study

  • PRS are useful in precision medicine as they may enable us to improve current genetic risk predictions.
  • PRS are also useful for understanding the genetic architecture of diseases and traits, as well as identifying shared genetic components.
  • LDpred is a novel Bayesian PRS that adjusts variant effect estimates for LD instead of pruning variants. This results in a more accurate PRS.
  • LDpred PRS was found to provide more accurate predictions than standard PRS, both in simulations and when applied to real datasets.
  • When using large GWAS summary statistics as training data, we observed an improvement in Nagelkerke R2 from 20.1% to 25.3% for schizophrenia and from 9.8% to 12.0% in a large multiple sclerosis dataset.
  • The advantage of LDpred over existing methods will grow as sample sizes increase.
  • The study was carried out in collaboration with researchers at Aarhus University, Harvard University, the Broad Institute, the Schizophrenia Working Group of the Psychiatric Genomics Consortium, and others. 
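
The idea of adjusting effect estimates for LD rather than pruning, mentioned in the list above, can be illustrated with a small sketch. It assumes a Gaussian (infinitesimal) prior on standardized effects, under which the posterior mean is approximately (D + M/(N h2) I)^-1 applied to the marginal estimates, where D is the local LD matrix, M the number of variants, N the training sample size, and h2 the heritability. That formula, the function names, and all numbers below are assumptions made for illustration; this is not the LDpred software.

```python
def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def ld_adjusted_effects(marginal_betas, ld, n_samples, heritability):
    """Posterior mean effects under an assumed Gaussian prior."""
    m = len(marginal_betas)
    ridge = m / (n_samples * heritability)  # shrinkage term M / (N h2)
    a = [[ld[i][j] + (ridge if i == j else 0.0) for j in range(m)]
         for i in range(m)]
    return solve(a, marginal_betas)


# Two variants in strong LD (r = 0.9). Only the first is causal, but the
# second shows a marginal association because it is correlated with the first.
ld = [[1.0, 0.9], [0.9, 1.0]]
marginal = [0.10, 0.09]

adjusted = ld_adjusted_effects(marginal, ld, n_samples=100_000, heritability=0.5)
```

In this toy case the adjusted effect of the second variant falls to nearly zero while the first stays close to 0.10, so the shared signal is counted once and no variant has to be thrown away.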

Further information:

Bjarni J Vilhjalmsson, PhD, postdoc, Bioinformatics Research Centre, Aarhus University. Mobile: +45 2420 1279, Email: