Polygenic Risk Scores

Calculating a Polygenic Risk Score

A polygenic risk score is derived from the data collected in genome-wide association (GWA) studies. The odds ratios or beta values from the studies in the PRSKB database are used as weights for each variant to compute a trait- and study-specific cumulative risk score for each sample.

The equation used by the PRSKB for calculating polygenic risk scores is the same one used by PLINK and is comparable to the 'average' score option used by PRSice-2.
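A minimal sketch of an averaged score of this kind follows. It sums each variant's weight multiplied by the sample's risk-allele count, then divides by the ploidy times the number of variants with a known genotype. The function name `polygenic_risk_score`, the dictionary layout, the use of ln(odds ratio) as the weight for odds-ratio studies, and the choice to skip missing genotypes are assumptions made for illustration, not the PRSKB source code.

```python
import math


def polygenic_risk_score(weights, dosages, ploidy=2):
    """Averaged per-allele score for one sample (illustrative sketch).

    weights: dict of variant ID -> effect size (beta, or ln(odds ratio))
    dosages: dict of variant ID -> risk-allele count (0, 1, 2), or None if missing
    """
    total = 0.0
    observed = 0
    for variant, weight in weights.items():
        dosage = dosages.get(variant)
        if dosage is None:
            continue  # assumption: missing genotypes are simply skipped here
        total += weight * dosage
        observed += 1
    if observed == 0:
        return float("nan")  # no usable variants for this sample
    return total / (ploidy * observed)


# Hypothetical variants and effect sizes, for demonstration only.
weights = {"rs1": math.log(1.2), "rs2": 0.05, "rs3": -0.1}
dosages = {"rs1": 2, "rs2": 1, "rs3": None}
score = polygenic_risk_score(weights, dosages)
print(round(score, 4))
```

Dividing by the number of observed alleles keeps scores comparable between samples that have different numbers of genotyped variants.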

Best Practices

Although polygenic risk scores have become increasingly prevalent in genetic research, only minimal guidelines have historically existed for performing polygenic risk score analyses. To reduce the variability seen in current polygenic risk score research, we adhere to the following standard protocols set forth by Choi et al. (2020):

Quality Control

Quality Control Performed by the PRSKB

Any allele that has been reported on the reverse strand is automatically detected and flipped to the forward strand.
We ensure that summary data and query data are from the same reference genome.
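The strand check described above can be sketched as follows, assuming that a reverse-strand report is recognized when a sample's alleles do not match the expected alleles but their complements do. The helper names `complement` and `align_to_forward` are hypothetical, and the PRSKB's actual detection logic may differ.

```python
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def complement(allele):
    """Reverse complement of an allele string (a single base maps to its complement)."""
    return "".join(COMPLEMENT[base] for base in reversed(allele.upper()))


def align_to_forward(sample_alleles, expected_alleles):
    """Return the sample alleles on the expected strand, or None if no match.

    Note: palindromic pairs (A/T, C/G) look identical on both strands,
    so this simple check cannot tell whether they were flipped.
    """
    expected = set(expected_alleles)
    if set(sample_alleles) <= expected:
        return tuple(sample_alleles)  # already on the expected strand
    flipped = tuple(complement(a) for a in sample_alleles)
    if set(flipped) <= expected:
        return flipped  # reported on the reverse strand; flipped
    return None  # alleles do not match on either strand


print(align_to_forward(("T", "C"), ("A", "G")))
```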

Quality Control Recommendations for the User

Ensure that GWA summary data and query samples are from the same population.
Avoid overlap or highly related samples between summary data and query samples.
Aim for a target query sample size of at least 100.
Seek to use GWA study data with a SNP heritability > 0.05.

Linkage Disequilibrium Adjustment

Linkage disequilibrium (LD) clumping reduces the inflation of polygenic risk scores by ensuring that no more than one genetic variant from each LD region is included in the risk score calculations. We calculated LD clumps for each population in the 1000 Genomes dataset using an r-squared value of 0.25 and a distance threshold of 500 kb. The clumping analysis results were used to assign each 1000 Genomes variant to an LD clump ID for each population. The clump IDs facilitate the dynamic retrieval of LD clumps from the PRSKB database so that only the genetic variant with the most significant p-value in the GWA study of interest is used in the polygenic risk score calculation.
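The selection step can be sketched as follows, assuming each variant already carries a precomputed clump ID. The function name `select_clump_leads`, the dictionary layout, and the decision to treat variants without a clump ID as their own clump are assumptions made for illustration.

```python
def select_clump_leads(p_values, clump_ids):
    """Keep the variant with the smallest p-value in each LD clump.

    p_values:  dict of variant ID -> GWA study p-value
    clump_ids: dict of variant ID -> precomputed LD clump ID
    """
    best = {}  # clump key -> variant currently leading that clump
    for variant, p_value in p_values.items():
        # Assumption: a variant with no clump assignment forms its own clump.
        clump = clump_ids.get(variant, ("unclumped", variant))
        leader = best.get(clump)
        if leader is None or p_value < p_values[leader]:
            best[clump] = variant
    return sorted(best.values())


# Hypothetical variants and clump IDs, for demonstration only.
p_values = {"rs1": 1e-8, "rs2": 3e-10, "rs3": 5e-6, "rs4": 1e-3}
clump_ids = {"rs1": 17, "rs2": 17, "rs3": 42}
print(select_clump_leads(p_values, clump_ids))
```

Because clump membership is precomputed per population, only the comparison of p-values has to happen at query time.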

P-value Thresholding

We allow the user to set the p-value threshold, which determines which GWA study variants are included in the PRS calculations. Additionally, we recommend that users who run bulk PRS analyses with the PRSKB for post-hoc hypothesizing account for multiple testing when determining a significance threshold for those analyses.
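Both ideas can be sketched briefly. The first function keeps only variants at or below a user-chosen p-value threshold. The second applies a Bonferroni correction, which is one common way to account for multiple testing and is used here as an assumption, since no specific correction is prescribed. The function names are hypothetical.

```python
def filter_by_p_value(p_values, threshold):
    """Keep variants whose GWA study p-value is at or below the threshold."""
    return {variant: p for variant, p in p_values.items() if p <= threshold}


def bonferroni_alpha(alpha, n_tests):
    """Per-test significance level after a Bonferroni correction."""
    if n_tests < 1:
        raise ValueError("n_tests must be at least 1")
    return alpha / n_tests


# Hypothetical study variants, for demonstration only.
study = {"rs1": 1e-8, "rs2": 3e-10, "rs3": 5e-6}
included = filter_by_p_value(study, threshold=5e-8)
print(sorted(included))

# Testing 100 trait/study scores at an overall alpha of 0.05.
print(bonferroni_alpha(0.05, 100))
```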

Imputation

In addition to the above best practices, we also impute SNPs not found in the user's files. Instead of imputing the reference allele for the population, we impute the alleles as unknown. We believe this to be a more accurate approach because the reference allele is not always the most common allele in the population. By imputing the alleles as unknown, we are able to use the frequency of the risk allele in the user's selected cohort to estimate the contribution of the SNP to overall risk.
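One plausible reading of this approach is that an unknown genotype contributes its expected value: the variant's weight multiplied by the expected number of risk alleles, which is the ploidy times the risk-allele frequency in the cohort. The sketch below encodes that reading. It is an assumption about the calculation, not a description of the PRSKB implementation, and the function name is hypothetical.

```python
def expected_contribution(weight, risk_allele_frequency, ploidy=2):
    """Expected score contribution of a variant with an unknown genotype.

    Assumes the expected risk-allele count is ploidy * frequency.
    """
    if not 0.0 <= risk_allele_frequency <= 1.0:
        raise ValueError("risk_allele_frequency must be between 0 and 1")
    return weight * ploidy * risk_allele_frequency


# A variant with weight 0.1 whose risk allele has frequency 0.25 in the cohort.
print(expected_contribution(0.1, 0.25))
```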

Using a Polygenic Risk Score

Disease Prediction

Polygenic risk scores can be used to assess the risk of expressing a trait. An individual with a risk score that is higher than the population average is more likely to express that trait. Predicting late onset diseases with this method is particularly useful because preventative treatment can be applied sooner.

Clinical Trial Screening

Clinical trials, such as drug testing, rely on removing confounding factors. Selecting individuals with similar risk scores for many traits can help reduce potential confounding between these traits.

Additional Filtering in Genome-wide Association Studies

Mendelian Randomization Studies

Mendelian randomization studies use polygenic risk scores to predict the effect a particular treatment or exposure will have on an individual.

Infer Disease Status in Cohorts Lacking Phenotypic Data

An individual's phenotype can be inferred by comparing their risk scores to those of other individuals in their population or ethnicity.

Benefits of the PRS Knowledge Base

User‑Friendly

Until now, the only tools available for calculating polygenic risk scores were command-line tools. Our Knowledge Base offers a more user-friendly medium for computing and outputting these scores.

Centralized Database

The PRSKB has a centralized database for all of the genome-wide association studies used to compute the scores. As new studies are performed and added to the NHGRI-EBI GWAS Catalog, they are added to the PRSKB database.

Contextualization

We present polygenic risk score distributions and summary statistics for each of the studies in the PRSKB database, generated from individual genetic data in the 1000 Genomes, UK Biobank, and ADNI datasets. Users can utilize this data as an approximate contextualization for their own reported risk scores.
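As an illustration of this kind of contextualization, the sketch below places a single score within a reference distribution by computing its percentile rank. The function name `percentile_rank` and the example scores are hypothetical. In practice, the reference scores would come from the same study and a comparable cohort.

```python
from bisect import bisect_right


def percentile_rank(score, reference_scores):
    """Percentage of reference scores that are less than or equal to score."""
    ordered = sorted(reference_scores)
    if not ordered:
        raise ValueError("reference_scores must not be empty")
    return 100.0 * bisect_right(ordered, score) / len(ordered)


# Hypothetical reference distribution, for demonstration only.
reference = [0.10, 0.20, 0.30, 0.40]
print(percentile_rank(0.25, reference))
```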

Additional Comments

The power of a polygenic risk score is contingent on the power and scope of the corresponding genome-wide association study data. One significant problem is the lack of diversity in current genetic studies. As of April 2022, 79% of all genome-wide association study participants are of European descent. Scores for an individual are most accurate when computed from data of the same ethnicity. As such, the PRS Knowledge Base is currently more robust for studying subjects of European ancestry. As future studies incorporate greater diversity, the overall accuracy of our Knowledge Base will similarly increase.

What is a Polygenic Risk Score?