For decades, most genetic research focused on single, common mutations significantly associated with certain diseases like breast cancer or Huntington's disease.
The majority of complex traits, however, are determined by multiple variants with smaller effect sizes.
A polygenic risk score estimates an individual's overall genetic risk for a condition by summing the risk variants across their genome, each weighted by its effect size on the disease or trait of interest.
A polygenic risk score is derived from the data collected in genome-wide
association studies. The odds ratios or beta values from the studies in the PRSKB database are used as weights for each
variant in order to compute a trait- and study-specific cumulative risk score for each sample.
The equation used by the PRSKB for calculating polygenic risk scores is the same one used by PLINK and is comparable to the 'average' score option used by PRSice-2.
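As a minimal sketch of an 'average' score of this kind, the Python below sums each variant's weight multiplied by the sample's effect-allele count and divides by the ploidy times the number of non-missing variants. This is the form described for the PRSice-2 'average' option; it is not the PRSKB's own code. The function names, the sample data, and the conversion of odds ratios to the log scale are assumptions made for illustration.

```python
import math


def log_odds(odds_ratio):
    # Assumption: odds ratios are moved to the log scale so that they
    # add across variants in the same way beta values do.
    return math.log(odds_ratio)


def average_prs(variants, ploidy=2):
    """Average-style polygenic risk score for one sample.

    variants: list of (weight, effect_allele_count) pairs, where the
    count is None when the genotype is missing for that variant.
    """
    observed = [(w, g) for w, g in variants if g is not None]
    if not observed:
        return None  # no usable variants, so no score can be computed
    total = sum(w * g for w, g in observed)
    return total / (ploidy * len(observed))


# Hypothetical sample: two odds-ratio variants, one beta variant,
# and one variant with a missing genotype.
sample = [(log_odds(1.20), 2), (log_odds(0.90), 1), (0.05, 0), (0.10, None)]
score = average_prs(sample)
```

The missing variant is simply skipped here, so the denominator counts three variants rather than four.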
Although polygenic risk scores have become increasingly prevalent in genetic research, only minimal guidelines have historically existed for performing polygenic risk score analyses. To reduce the variability seen in current polygenic risk score research, we adhere to the following standard protocols set forth by Choi et al. (2020):
Linkage disequilibrium (LD) clumping reduces the inflation of polygenic risk scores by ensuring that no more
than one genetic variant from each LD region is included in the risk score calculations. We calculated LD clumps
for each population in the 1000 Genomes dataset using an r-squared value of 0.25 and a distance threshold of 500 kb.
The clumping analysis results were used to assign each 1000 Genomes variant to an LD clump ID for each population.
The clump IDs facilitate the dynamic retrieval of LD clumps from the PRSKB database so that only the genetic variant
with the most significant p-value in the GWA study of interest is used in the polygenic risk score calculation.
We allow the user to set the p-value threshold, which determines which GWA study variants will be included in
the PRS calculations. Additionally, we recommend that users who utilize the PRSKB to run bulk PRS analyses for
post-hoc hypothesizing account for multiple testing when determining a significance threshold for those analyses.
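The two practices above can be sketched together in Python: variants are first filtered by the user's p-value threshold, and then only the most significant variant in each LD clump is kept. The data structures, the function names, and the `rs1` to `rs4` identifiers are hypothetical and do not reflect the PRSKB database schema. The Bonferroni helper is one common multiple-testing correction, shown as an example rather than as the correction the PRSKB prescribes.

```python
def select_variants(associations, clump_ids, p_threshold):
    """Keep the lowest p-value variant per LD clump among those passing the threshold.

    associations: dict mapping variant ID -> (p_value, weight)
    clump_ids:    dict mapping variant ID -> LD clump ID
    """
    best = {}
    for snp, (p_value, weight) in associations.items():
        if p_value > p_threshold:
            continue  # fails the user-chosen p-value threshold
        # Assumption: a variant with no clump assignment is its own clump.
        clump = clump_ids.get(snp, ("unclumped", snp))
        if clump not in best or p_value < best[clump][1]:
            best[clump] = (snp, p_value, weight)
    return {snp: weight for snp, _, weight in best.values()}


def bonferroni_alpha(alpha, n_tests):
    # Per-test significance level when n_tests analyses are run in bulk.
    return alpha / n_tests


# Hypothetical GWA study associations and clump assignments.
associations = {
    "rs1": (1e-9, 0.20),
    "rs2": (1e-12, 0.30),
    "rs3": (1e-6, 0.10),
    "rs4": (0.01, 0.40),
}
clump_ids = {"rs1": 7, "rs2": 7, "rs3": 8}

strict = select_variants(associations, clump_ids, 5e-8)
lenient = select_variants(associations, clump_ids, 1e-5)
```

With the stricter threshold only `rs2` survives, because `rs1` shares its clump and has a weaker p-value; the looser threshold also admits `rs3` from a different clump.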
In addition to the above best practices, we account for SNPs that are missing from the user's files. Rather than imputing the population's reference allele, we treat the missing alleles as unknown. We believe this approach is more accurate because the reference allele is not always the most common allele in the population. Treating the alleles as unknown lets us use the frequency of the risk allele in the user's selected cohort to estimate the SNP's contribution to overall risk.
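One way to picture this treatment of missing SNPs is the Python sketch below. It assumes that an unknown genotype is replaced by its expected risk-allele count, that is, the ploidy multiplied by the cohort's risk-allele frequency. The function name, its arguments, and that exact formula are illustrative assumptions, not a description of the PRSKB's internal implementation.

```python
def risk_allele_dosage(genotype, risk_allele, cohort_risk_freq, ploidy=2):
    """Risk-allele count for one SNP in one sample.

    genotype: tuple of observed alleles such as ("A", "G"), or None when
    the SNP is absent from the user's file.
    """
    if genotype is None:
        # Assumption: an unknown genotype contributes its expected dosage
        # under the cohort's risk-allele frequency.
        return ploidy * cohort_risk_freq
    return sum(1 for allele in genotype if allele == risk_allele)


observed = risk_allele_dosage(("A", "G"), "A", cohort_risk_freq=0.30)
missing = risk_allele_dosage(None, "A", cohort_risk_freq=0.30)
```

Under this assumption, a missing SNP whose risk allele is rare in the cohort adds little to the score, whereas defaulting to the reference allele would add either nothing or a full dose regardless of how common the risk allele is.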
Polygenic risk scores can be used to assess the risk of expressing a trait. An individual with a risk score that is higher than the population average is more likely to express that trait. Predicting late-onset diseases with this method is particularly useful because preventative treatment can begin sooner.
Clinical trials, such as drug testing, rely on removing confounding factors. Selecting individuals
with similar risk scores for many traits eliminates potential confounding factors between these
traits.
Polygenic risk scores can act as an additional filter for novel loci associated with a trait. If an
individual has a trait but a low risk score, or lacks the trait but has a high risk score,
the individual likely has novel genetic polymorphisms that need further study.
Mendelian randomization studies can use polygenic risk scores as genetic instruments to estimate the
effect a particular treatment or exposure has on an outcome.
An individual's phenotype can be inferred by comparing their risk scores to those of other individuals in their
population or ethnicity.
Until now, the only tools available for calculating polygenic risk scores were command-line programs. Our Knowledge Base offers a more user-friendly medium for computing and outputting these scores.
The PRSKB has a centralized database for all of the genome-wide association studies used to compute the scores. As new studies are performed and added to the NHGRI-EBI GWAS Catalog, they are added to the PRSKB database.
We present polygenic risk score distributions and summary statistics for each of the studies in the PRSKB database, generated from individual genetic data in the 1000 Genomes, UK Biobank, and ADNI datasets. Users can utilize this data as an approximate contextualization for their own reported risk scores.
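As a rough illustration of that contextualization, the Python sketch below places a user's score within a reference distribution by computing its percentile rank. The reference values are made-up numbers, not scores from the 1000 Genomes, UK Biobank, or ADNI datasets, and the function is a hypothetical helper rather than part of the PRSKB.

```python
from bisect import bisect_right


def percentile_rank(reference_scores, user_score):
    """Percentage of reference scores less than or equal to user_score."""
    ordered = sorted(reference_scores)
    return 100.0 * bisect_right(ordered, user_score) / len(ordered)


# Hypothetical reference scores for one study and cohort.
reference = [0.10, 0.20, 0.30, 0.40]
rank = percentile_rank(reference, 0.25)
```

A rank near 50 places the user close to the middle of the reference cohort. The comparison is most informative when the cohort matches the user's ancestry.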
The power of a polygenic risk score is contingent on the power and scope of the corresponding genome-wide association study data. One significant problem is the lack of diversity in current genetic studies. As of April 2022, 79% of all genome-wide association study participants are of European descent. Scores for an individual are most accurate when computed from data of the same ethnicity. As such, the PRS Knowledge Base is currently more robust for studying subjects of European ancestry. As future studies incorporate greater diversity, the overall accuracy of our Knowledge Base will increase accordingly.