What is a Polygenic Risk Score?

    For decades, most genetic research focused on single, common mutations significantly associated with certain diseases like breast cancer or Huntington's disease.

    The majority of complex traits, however, are determined by multiple variants with smaller effect sizes.

    A polygenic risk score estimates an individual's overall genetic risk for a condition. It is the sum of the risk variants across their genome, each weighted by its effect size on the disease or trait of interest.



Calculating a Polygenic Risk Score

A polygenic risk score is derived from the data collected in genome-wide association (GWA) studies. The odds ratios or beta values from the studies in the PRSKB database are used as weights for each variant to compute a trait- and study-specific cumulative risk score for each sample.

The equation used by the PRSKB for calculating polygenic risk scores is the same one used by PLINK and is comparable to the 'average' score option used by PRSice-2.
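
As a rough illustration, an averaged score of this kind can be written as the sum of each variant's weight times the individual's effect-allele count, divided by the ploidy times the number of variants scored. The Python sketch below is not the PRSKB's code. The function name, the dictionary inputs, the made-up variant IDs, the choice to skip variants with missing genotypes, and the log transformation of odds ratios are all assumptions made for the example.

```python
import math


def average_prs(allele_counts, weights, ploidy=2):
    """Return an averaged polygenic risk score for one sample.

    allele_counts: dict of variant ID -> number of effect alleles (0, 1, or 2),
                   or None when the genotype is missing (hypothetical format).
    weights:       dict of variant ID -> beta value (or log odds ratio).
    """
    total = 0.0
    scored = 0
    for variant, weight in weights.items():
        count = allele_counts.get(variant)
        if count is None:
            continue  # assumption: missing genotypes are simply skipped
        total += weight * count
        scored += 1
    if scored == 0:
        return float("nan")
    return total / (ploidy * scored)


# Assumption: odds ratios are converted to log odds before use as weights.
weights = {"rs1": math.log(1.5), "rs2": 0.10, "rs3": -0.05}
sample = {"rs1": 2, "rs2": 1, "rs3": None}  # rs3 genotype is missing
score = average_prs(sample, weights)
```

Dividing by the number of scored variants keeps scores comparable between samples that have different numbers of missing genotypes.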

Best Practices

Although polygenic risk scores have become increasingly prevalent in genetic research, only minimal guidelines have historically existed for performing polygenic risk score analyses. To reduce the variability in current polygenic risk score research, we adhere to the following standard protocols set forth by Choi et al. (2020):

Quality Control

Quality Control Performed by the PRSKB

  • Any allele that has been reported on the reverse strand is automatically detected and flipped to the forward strand.
  • We ensure that summary data and query data are from the same reference genome.
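
Strand flipping of the kind described above can be sketched as follows. How the PRSKB detects reverse-strand alleles is not described here, so this example assumes a simple rule: if an allele does not match the known forward-strand alleles but its reverse complement does, the reverse complement is used. The function names and the `forward_alleles` input are hypothetical.

```python
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def reverse_complement(allele):
    """Return the allele as read from the opposite strand."""
    return "".join(COMPLEMENT[base] for base in reversed(allele.upper()))


def to_forward_strand(allele, forward_alleles):
    """Return the allele on the forward strand, or None if it cannot be matched.

    forward_alleles: set of alleles known on the forward strand for this
                     variant (hypothetical input, e.g. from reference data).
    """
    allele = allele.upper()
    if allele in forward_alleles:
        return allele
    flipped = reverse_complement(allele)
    if flipped in forward_alleles:
        return flipped
    return None
```

Note that this rule cannot resolve strand-ambiguous variants (A/T or C/G), because an allele and its complement are both valid forward-strand alleles there; such variants need extra information, such as allele frequencies.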

Quality Control Recommendations for the User

  • Ensure that GWA summary data and query samples are from the same population.
  • Avoid overlapping or highly related samples between summary data and query samples.
  • Aim for a target query sample size of at least 100.
  • Seek to use GWA study data with a SNP heritability > 0.05.

Linkage Disequilibrium Adjustment

Linkage disequilibrium (LD) clumping reduces the inflation of polygenic risk scores by ensuring that no more than one genetic variant from each LD region is included in the risk score calculations. We calculated LD clumps for each population in the 1000 Genomes dataset using an r-squared value of 0.25 and a distance threshold of 500 kb. The clumping analysis results were used to assign each 1000 Genomes variant to an LD clump ID for each population. The clump IDs facilitate the dynamic retrieval of LD clumps from the PRSKB database so that only the genetic variant with the most significant p-value in the GWA study of interest is used in the polygenic risk score calculation.
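
The selection step can be sketched in Python as below, assuming each association record already carries the clump ID for the chosen population. The record layout, the function name, and the example values are hypothetical and are not taken from the PRSKB database.

```python
def select_lead_variants(associations):
    """Keep the variant with the smallest p-value in each LD clump.

    associations: iterable of dicts with 'snp', 'clump_id', and 'p_value'
                  keys (hypothetical record layout).
    """
    best = {}
    for record in associations:
        current = best.get(record["clump_id"])
        if current is None or record["p_value"] < current["p_value"]:
            best[record["clump_id"]] = record
    return list(best.values())


associations = [
    {"snp": "rs1", "clump_id": 10, "p_value": 3e-9},
    {"snp": "rs2", "clump_id": 10, "p_value": 5e-12},
    {"snp": "rs3", "clump_id": 11, "p_value": 1e-8},
]
lead = select_lead_variants(associations)  # one record per clump ID
```

Here rs1 and rs2 share a clump, so only rs2, which has the smaller p-value, is kept alongside rs3.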

P-value Thresholding

We allow the user to set the p-value threshold, which determines which GWA study variants are included in the PRS calculations. Additionally, we recommend that users who run bulk PRS analyses with the PRSKB for post-hoc hypothesizing account for multiple testing when choosing a significance threshold for those analyses.
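
Both ideas can be illustrated with a short Python sketch. The inclusive comparison in the filter and the use of a Bonferroni correction are assumptions chosen for the example; other multiple-testing corrections may be more appropriate for a given analysis, and the function names and record layout are hypothetical.

```python
def filter_by_p_value(associations, threshold):
    """Keep GWA study variants whose p-value is at or below the threshold."""
    return [r for r in associations if r["p_value"] <= threshold]


def bonferroni_threshold(alpha, n_tests):
    """Return a per-test significance threshold for n_tests comparisons."""
    if n_tests < 1:
        raise ValueError("n_tests must be at least 1")
    return alpha / n_tests


associations = [
    {"snp": "rs1", "p_value": 5e-8},
    {"snp": "rs2", "p_value": 1e-3},
]
included = filter_by_p_value(associations, 1e-5)

# Example: 200 trait/study scores tested in one bulk analysis.
per_test_alpha = bonferroni_threshold(0.05, 200)
```

The first function controls which variants enter a score; the second applies to the downstream statistical tests run on many scores at once.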


In addition to the above best practices, we also impute SNPs not found in the user's files. Instead of imputing the reference allele for the population, we impute the alleles as unknown. We believe this is a more accurate approach because the reference allele is not always the most common allele in the population. By imputing the alleles as unknown, we can use the frequency of the risk allele in the user's selected cohort to estimate the contribution of the SNP to overall risk.
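
One way to read this approach is that a missing SNP contributes its expected risk-allele count under the cohort allele frequency. The sketch below assumes exactly that (expected count = ploidy × frequency); the exact formula used by the PRSKB is not stated here, and the function and parameter names are hypothetical.

```python
def variant_contribution(allele_count, weight, risk_allele_frequency, ploidy=2):
    """Return one variant's contribution to the weighted sum.

    allele_count: observed number of risk alleles, or None if the SNP is
                  absent from the user's file (treated as unknown).
    """
    if allele_count is None:
        # Assumption: use the expected number of risk alleles in the cohort.
        expected_count = ploidy * risk_allele_frequency
        return weight * expected_count
    return weight * allele_count


observed = variant_contribution(2, weight=0.5, risk_allele_frequency=0.3)
unknown = variant_contribution(None, weight=0.5, risk_allele_frequency=0.3)
```

By contrast, imputing the reference allele would amount to assuming a count of zero risk alleles whenever the risk allele is not the reference, regardless of how common the risk allele is in the cohort.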

Using a Polygenic Risk Score

  • Disease Prediction
  • Polygenic risk scores can be used to assess the risk of expressing a trait. An individual with a risk score that is higher than the population average is more likely to express that trait. Predicting late onset diseases with this method is particularly useful because preventative treatment can be applied sooner.

  • Clinical Trial Screening
  • Clinical trials, such as drug trials, rely on removing confounding factors. Selecting individuals with similar risk scores for many traits reduces potential confounding between these traits.

  • Additional Filtering in Genome-wide Association Studies
  • Polygenic risk scores can act as an additional filter for novel loci associated with a trait. If an individual has a trait but a low risk score, or lacks a trait but has a high risk score, the individual likely has novel genetic polymorphisms that need further study.

  • Mendelian Randomization Studies
  • Mendelian randomization studies can use polygenic risk scores as genetic instruments to estimate the effect that a particular treatment or exposure has on an outcome.

  • Infer Disease Status in Cohorts Lacking Phenotypic Data
  • An individual's phenotype can be inferred by comparing their risk scores to those of other individuals in their population or ethnicity.
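
Several of the uses above depend on comparing one individual's score with the scores of a reference cohort. A minimal Python sketch of such a comparison is a percentile rank; the function name and the example scores are made up for illustration, and the cohort should come from the same population as the individual.

```python
def percentile_rank(score, cohort_scores):
    """Return the percentage of cohort scores strictly below the given score."""
    if not cohort_scores:
        raise ValueError("cohort_scores must not be empty")
    below = sum(1 for s in cohort_scores if s < score)
    return 100.0 * below / len(cohort_scores)


cohort = [0.1, 0.2, 0.5, 0.9]  # made-up scores for a reference cohort
rank = percentile_rank(0.5, cohort)
```

A high percentile indicates elevated genetic risk relative to the cohort, not a diagnosis.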

Benefits of the PRS Knowledge Base

Additional Comments

The power of a polygenic risk score is contingent on the power and scope of the corresponding genome-wide association study data. One significant problem is the lack of diversity in current genetic studies. As of April 2022, 79% of all genome-wide association study participants are of European descent. Scores for an individual are most accurate when computed from data of the same ethnicity. As such, the PRS Knowledge Base is currently more robust for studying subjects of European ancestry. As future studies incorporate greater diversity, the overall accuracy of our Knowledge Base will similarly increase.
