3 Summary Statistics

To create the PGSs we need summary statistics, that is, the results of Genome-Wide Association Studies (GWASs). One overall requirement is that the summary statistics we use cannot have utilized the UK Biobank in their production. If they did, the sample used to estimate the effects would overlap with the sample we score, and the resulting overfitting would inflate the apparent performance of the PGSs. Each set of summary statistics must also have the following features (or columns):

  1. Chromosome
  2. Position
  3. Variant ID (the rsID)
  4. Effect Allele
  5. Alternate Allele
  6. Standard Error of the Effect
  7. Effect (Beta, Odds Ratio)
  8. P-Value
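
The column requirement above can be checked mechanically before a file is used. The following is a minimal sketch, not the pipeline's actual code: the header names in REQUIRED_COLUMNS are hypothetical placeholders, since real GWAS files label these eight features in many different ways.

```python
# Hypothetical header names for the eight required features;
# real summary statistic files use many different labels.
REQUIRED_COLUMNS = ["CHR", "BP", "SNP", "A1", "A2", "SE", "BETA", "P"]


def missing_columns(header):
    """Return the required columns that are absent from a header row."""
    # Compare case-insensitively and ignore surrounding whitespace.
    present = {name.strip().upper() for name in header}
    return [col for col in REQUIRED_COLUMNS if col not in present]


# Example: a header row that lacks the p-value column.
example_header = "chr bp snp a1 a2 se beta".split()
print(missing_columns(example_header))
```

A file would be accepted only when the returned list is empty.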

The reasons we need these specific columns are as follows. Chromosome and position give the location of the variant on the genome, which is important in determining variant proximity for LD-aware methods. The variant ID is important for basic recognition of which variants make it into the final score. The effect allele is necessary by the definition of the polygenic risk score. While the alternate allele is not strictly necessary for scoring, it is needed to determine whether the variant is ambiguous. The standard error and p-value are used by many methods for thresholding purposes. Lastly, the effect is the other basic component needed in the polygenic risk score equation.
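
To illustrate why both alleles are needed for the ambiguity check, here is a small sketch. It assumes that "ambiguous" means strand-ambiguous, that is, the two alleles are complements of each other (A/T or C/G), so the variant reads the same on either strand. The function name is hypothetical and this is not necessarily the exact rule applied elsewhere in this work.

```python
# Watson-Crick complements for single-nucleotide alleles.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def is_strand_ambiguous(effect_allele, alt_allele):
    """Return True when the two alleles are complements (A/T or C/G).

    Assumption: "ambiguous" means strand-ambiguous. Indels and other
    multi-base alleles are never flagged, because they have no entry
    in COMPLEMENT.
    """
    a1 = effect_allele.upper()
    a2 = alt_allele.upper()
    return COMPLEMENT.get(a1) == a2


# An A/T variant cannot be assigned to a strand from its alleles alone,
# whereas an A/G variant can.
print(is_strand_ambiguous("A", "T"))
print(is_strand_ambiguous("A", "G"))
```

With only the effect allele available, this check could not be made, which is why the alternate allele is required even though it does not enter the score itself.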