5 Adjusting Summary Statistics

The main thrust of this project, and a key step in any polygenic risk scoring process, is adjusting summary statistics. In short, adjusting summary statistics is necessary because, as they stand, linkage disequilibrium and a possibly high prevalence of false positives will severely limit the predictive ability contained within the underlying GWAS. There are many ways to actually carry out this adjustment. In fact, quite a few people have spent considerable time working on methods to adjust summary statistics with the highest possible accuracy. The problem is that these people do not always leave the greatest documentation or instructions. In the following text I will attempt to translate the documentation, while pulling from the original publication and possibly other sources, to produce clear scripts that carry out the intended summary statistic adjustment.

5.0.1 Computational Framework

A key consideration in setting up a summary statistic adjustment workflow is time. Many of the adjustment scripts are slow. Therefore, parallelization is a near necessity to get things done. While I am sure there are fancy systems or clever linux commands, I went with a low-tech bash script. The key idea of the bash script is that it continuously runs, beginning a summary statistic adjustment on one chromosome with one method whenever there is space available. The space determination is made by constantly checking a poss_dirs file. If there is a directory number within it, a script is begun in the background to carry out the adjustment. On its completion it will return the name of its directory to the poss_dirs file. In the meantime a while loop with a wait keeps everything active until this happens. The other important consideration is independence. At the top of this script the directories that contain intermediate files are emptied, leaving a clean starting place.


maxDirs=8


    rm temp_files/*
    rm logs/*
    rm done_check/*
    rm -r comp_zone/*

    for (( i=1; i<=$maxDirs; i++ )); do
        mkdir comp_zone/dir$i
        echo $i >> temp_files/poss_dirs
    done



    cat all_specs/chr_specs | while read chr; do

        cat all_specs/author_specs | while read author; do      

            ./setup_data.sh $author $chr

            cat all_specs/method_specs | while read method; do


                echo about to hermes
                ./hermes.sh $chr $method $author &> logs/${author}.${chr}.${method}.log &
                sleep $(( ( RANDOM % 30 )  + 1 ))

                goOn=False
                while [ $goOn == "False" ]; do
                    openSlots=`cat temp_files/poss_dirs | wc -l`
                    if [ $openSlots -gt 0 ]; then
                        echo NOW WE CAN GO
                        goOn=True
                    else
                        echo MUST WAIT FOR ROOM TO GO
                        sleep $(( ( RANDOM % 30 )  + 1 ))
                        openSlots=`cat temp_files/poss_dirs | wc -l`
                    fi
                done

            done        

            ./check_to_delete.sh

        done
    done

Note that if, instead of a multi-core laptop or other large computer, you have a cluster submission system, you can likely change this script slightly so that hermes.sh is not sent to the background but is instead submitted to your cluster management system. Perhaps I will write this option in soon.
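
For example, with a SLURM scheduler (an assumption on my part; any scheduler with a submit command would work similarly) the launch line inside the loop might become something like:

# hypothetical SLURM version of the hermes.sh launch line
sbatch --job-name=${author}.${chr}.${method} \
       --output=logs/${author}.${chr}.${method}.log \
       --wrap="./hermes.sh $chr $method $author"

In that case the poss_dirs bookkeeping and the waiting loop could likely be dropped, since the scheduler itself manages how many jobs run at once.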


As can be seen, the actual adjustment is launched through the script hermes.sh (an ode to swift completion). At the top of hermes.sh some quick file re-writing claims the directory, and therefore the slot, to run an adjustment. Then the method-specific script is called. An important consideration is the taskset command, which limits the ensuing computation to a select number of CPU cores. Finally, once the method is complete the directory name is returned to poss_dirs and the directory that contained all of the computation is emptied. Just as before everything was a wrapper for hermes.sh, here everything is a wrapper for the details within the method-specific script.
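
I do not show hermes.sh itself, but a minimal sketch of its structure, based only on the description above (the exact slot bookkeeping, the core list, and the name of the method-specific script, here adjust_${method}.sh, are my assumptions), would look something like:

chr=$1
method=$2
author=$3

# claim the first open directory and remove it from poss_dirs so no other job takes it
dir=`head -1 temp_files/poss_dirs`
sed -i '1d' temp_files/poss_dirs
d=comp_zone/dir${dir}

# pin the method to a few cores so one adjustment cannot take over the machine
# (in practice the core list would depend on which slot was claimed)
taskset -c 0-3 ./adjust_${method}.sh $chr $author $dir

# hand the slot back and leave a clean directory behind
echo $dir >> temp_files/poss_dirs
rm -r ${d}/*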

5.1 Simple

The simple methods are not really methods at all, but rather the most primitive subsetting that has been considered in the past, and therefore for the sake of completeness will be considered here as well. Specifically, these methods simply threshold the summary statistics based on a p-value cut-off. The first uses the genome-wide significance threshold of 5e-8, the second uses the sometimes-cited genome-wide suggestive level of 1e-6, the third keeps every SNP possible, and the fourth keeps only SNPs with p-values above 0.5. The idea is that the final method can act as a negative control, i.e. if it returns any predictive value we know something is likely going wrong.

chr=$1
author=$2
dir=$3
low_author=`echo "$author" | tr '[:upper:]' '[:lower:]'`
d=comp_zone/dir${dir}



if [ ! -e ~/athena/doc_score/mod_sets/${author}/${low_author}.${chr}.simple.1.ss ]; then
  cat temp_files/ss.${low_author}.${chr} | awk '$8 < 5e-8 {print $0}' > ~/athena/doc_score/mod_sets/${author}/${low_author}.${chr}.simple.1.ss
fi

if [ ! -e ~/athena/doc_score/mod_sets/${author}/${low_author}.${chr}.simple.2.ss ]; then
  cat temp_files/ss.${low_author}.${chr} | awk '$8 < 1e-6 {print $0}' > ~/athena/doc_score/mod_sets/${author}/${low_author}.${chr}.simple.2.ss
fi

if [ ! -e ~/athena/doc_score/mod_sets/${author}/${low_author}.${chr}.simple.3.ss ]; then
  cat temp_files/ss.${low_author}.${chr} > ~/athena/doc_score/mod_sets/${author}/${low_author}.${chr}.simple.3.ss
fi

if [ ! -e ~/athena/doc_score/mod_sets/${author}/${low_author}.${chr}.simple.4.ss ]; then
  cat temp_files/ss.${low_author}.${chr} | awk '$8 > 0.5 {print $0}' > ~/athena/doc_score/mod_sets/${author}/${low_author}.${chr}.simple.4.ss
fi

Note there is not any single publication or documentation linked to these “methods”; they are just a natural starting place for making scores.

5.2 Clumping

The next, most common method involves clumping, which is an intelligent way to threshold SNPs based on their p-value while limiting false positives caused by linkage disequilibrium. Specifically, this is done by keeping only one SNP per significant locus in the adjusted summary statistics. The size of this locus is determined by an r-squared parameter. While it is likely possible to write your own clumping algorithm, I, and I assume most other people, use PLINK, a tool with amazing documentation. The two obvious clumping parameters are the p-value and r2. An array of three values for each is chosen, leading to nine total clumped summary statistics. The final parameter is the genotype file that provides the LD information. Many methods that I will be using require LD information. Frustratingly, the exact choice of LD information often varies based on the originating publication. As there is not actually any publication that goes along with clumping specifically, the choice here is not well informed. I have assumed that using the UKBB is a good choice, and because it is computationally easy to do so I will employ the full possible sample size.

A final consideration that is not always apparent is that sometimes the clumping algorithm leaves zero SNPs. In this case we need an if statement to recognize that nothing should be done. If there are SNPs remaining then we simply subset the input to these SNPs and pass them on.
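
For reference, all_specs/clump_param_specs (read by the script below) is just a small space-delimited file with a header row and one data row per p-value/r2 combination, nine rows in my case. The values here are placeholders meant only to illustrate the format:

pval r2
5e-8 0.1
1e-5 0.2
1e-3 0.5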

chr=$1
author=$2
dir=$3
low_author=`echo "$author" | tr '[:upper:]' '[:lower:]'`
d=comp_zone/dir${dir}


i=1
cat all_specs/clump_param_specs | tail -n +2 | while read spec;do
  plim=`echo $spec | cut -f1 -d' '`
  r2lim=`echo $spec | cut -f2 -d' '`

  if [ ! -e ~/athena/doc_score/mod_sets/${author}/${low_author}.${chr}.clump.${i}.ss ]; then

    plink --memory 4000 --threads 1 --bfile geno_files/${low_author}.${chr} --clump temp_files/ss.${low_author}.${chr} --clump-snp-field RSID --clump-p1 $plim --clump-r2 $r2lim --out ${d}/out

    if [ -f ${d}/out.clumped ]; then
      sed -e 's/ [ ]*/\t/g' ${d}/out.clumped | sed '/^\s*$/d' | cut -f4 | tail -n +2 > ${d}/done_rsids
      fgrep -w -f ${d}/done_rsids temp_files/ss.${low_author}.${chr} > ~/athena/doc_score/mod_sets/${author}/${low_author}.${chr}.clump.${i}.ss
    fi
  fi

  let i=i+1
done

More specifics are described in the documentation: https://www.cog-genomics.org/plink/1.9/postproc#clump

5.3 Double Weight

Increasing in complexity, the next method is the double-weight score. The motivation is that whenever a hard threshold is chosen the SNPs that are included within the score are hit by a winner’s curse. The corresponding publication explains “by selecting only SNPs with estimated p-values below a certain threshold, one systematically selects SNPs with effect overestimated by chance. Thus, for a large proportion of selected SNPs, betas will be a biased estimate for the true weight”. Mathematically, we can think about this by comparing the standard polygenic risk score format:

\[PGS=\sum_{i=1}^{k}I(p_i<p^*)\hat{\beta}X_i\]

to a winner’s curse aware format:

\[PGS=\sum_{i=1}^{k}\hat{\pi}(X_i)\hat{\beta}X_i\]

Here we replace an indicator function that is either 0 or 1 with the probability of inclusion within a set of the top Z SNPs, where Z can be any positive integer less than your total number of available SNPs. The next decision is determining how to calculate the \(\hat{\pi}\) term. The originating paper, within the supplement, states that we should generate a sample of values for each SNP from a normal distribution specified by \(N(\hat{\beta}, \hat{SE}^2)\) (or at least that is how I interpreted the variable definitions). We then estimate how many times our sample for the SNP is within the top Z SNPs. The paper goes on to mention a Wald-type statistic, but I could not fully determine how one would best calculate this statistic. The approach I took is likely best described within the following R script.

args <- commandArgs(trailingOnly=TRUE)

d <- args[1]
top_choice <- as.numeric(args[2])

#must make sure the number of requested top-ranked SNPs is less than the total number of available SNPs
ss <- read.table(paste0(d, "/specific_ss"), stringsAsFactors=F)

if(top_choice < nrow(ss)){

  #construct a matrix with ncols = number of SNPs, nrows = 100 (sample size for each SNP)
  norm_samples <- apply(ss, 1, function(x) rnorm(100, mean = as.numeric(x[7]), sd = as.numeric(x[6])))

  #for each snp generate a rank of the SNP, note that the dimensions are now swapped from above
  rank_samples <- apply(norm_samples, 1, function(x) rank(x * -1))

  #calculate, for each SNP, the fraction of the 100 samples in which its rank falls within the top Z
  prob_ranks <- apply(rank_samples, 1, function(x) sum(x < top_choice)/ncol(rank_samples))

  #multiply the original beta values by the winner's curse probability to get updated beta values
  ss[,7] <- ss[,7] * prob_ranks

  write.table(ss, paste0(d, "/adjust_ss"), col.names = F, row.names = F, sep = '\t', quote = F)
}

Note that this R script is just launched by a short shell script. While short, one key component of it is the PLINK line that subsets the summary statistics down to a reduced set of SNPs that are ideally not impacted by linkage disequilibrium. The exact parameters utilized are 0.5 for the p-value and 0.05 for R2. In both of the publications I could find that have implemented this algorithm these are the parameters that were utilized. I therefore did not find the need to change them, especially since their values are not as intrinsic to the idea of double weighting as the parameter Z, the number of important SNPs.
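
That controlling shell script is not shown above; a minimal sketch of it, following the pattern of the other method scripts (the name of the parameter spec file, the helper script name and the output naming are my assumptions), is roughly:

chr=$1
author=$2
dir=$3
low_author=`echo "$author" | tr '[:upper:]' '[:lower:]'`
d=comp_zone/dir${dir}

# clump down to a roughly independent set of SNPs (p < 0.5, r2 < 0.05, as described above)
plink --memory 4000 --threads 1 --bfile geno_files/${low_author}.${chr} --clump temp_files/ss.${low_author}.${chr} --clump-snp-field RSID --clump-p1 0.5 --clump-r2 0.05 --out ${d}/out
sed -e 's/ [ ]*/\t/g' ${d}/out.clumped | sed '/^\s*$/d' | cut -f4 | tail -n +2 > ${d}/done_rsids
fgrep -w -f ${d}/done_rsids temp_files/ss.${low_author}.${chr} > ${d}/specific_ss

i=1
cat all_specs/double_weight_param_specs | while read top_choice;do
  if [ ! -e ~/athena/doc_score/mod_sets/${author}/${low_author}.${chr}.double_weight.${i}.ss ]; then
    Rscript helper_scripts/double_weight.R $d $top_choice
    if [ -e ${d}/adjust_ss ]; then
      mv ${d}/adjust_ss ~/athena/doc_score/mod_sets/${author}/${low_author}.${chr}.double_weight.${i}.ss
    fi
  fi
  let i=i+1
done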

There may be a variation that more fully agrees with the intended algorithm, but I believe that this is a faithful enough reproduction to accurately evaluate the method’s performance.

For more information on the algorithm please see: https://pubmed.ncbi.nlm.nih.gov/27513194/, specifically the supplemental for implementation details.

5.4 LDpred

LDpred is one of the premier polygenic risk score adjustment algorithms. Rather than using a heuristic, LDpred makes a rigorous attempt to recreate the effect estimates that would be produced if the full genetic information were used in the original estimation process. Further considerations are made to determine the proportion of SNPs that are truly causal, and how linkage disequilibrium changes the estimates. The mathematical details are intense, with MCMC algorithms involved, but for the point of application we actually only need to know the installation, application and troubleshooting processes required (this will be a theme throughout).

LDpred runs in two steps, coordination and the MCMC algorithm. A key consideration in the initial coordination step is choosing a genotypic file to be the LD reference. The GitHub page reports that a file with over 1,000 unrelated individuals should be used. Within the publication they utilize their validation data set, which varies from about 100,000 to over a million SNPs. For a few of my summary statistics this will present a possible problem, but I think it is fair to test the limits of their algorithm in this way. To speed execution I limited my validation data set to 5,000 individuals for the LD reference. This 5,000 value is five times the recommended value, so I hope it is a fair step (I am also encouraged to think so due to the similar sample sizes used for the reference within the original publication; in fact, they regularly had data sets with fewer than 5,000 individuals). Within another well-cited publication (“Genome-wide polygenic scores for common diseases identify individuals with risk equivalent to monogenic mutations”) the validation data set was not used, and instead the European individuals of the 1000 Genomes Project were. While this group falls below the 1,000-person recommendation, the performance and accolades of that paper motivate this reference group as a strong alternative. The few remaining considerations concerning the coordination are the sample size given, which was the effective sample size recommended by PLINK, and the type of effect size specified, which is LOGOR, or the beta values that were standardized in the previous chapter.

The next step is entitled gibbs, referring to the Gibbs-sampler subtype of MCMC algorithm used. The first two arguments of this step are the outputs of the previous one, easy enough. The next is the estimated heritability, which was taken from the consensus value determined in the previous chapter. Next are the f values, or the proportions of SNPs that are believed to be causal. As it is rather difficult to know this a priori, a range of parameters is attempted. The last value is the ld-radius. The recommended value is the total number of SNPs (across all chromosomes) divided by 3000. This value, and this value alone, was utilized.

Over the f value and reference genotype options a series of LDpred estimates were generated. The adjusted effect sizes were then substituted into the original summary statistics to complete the process. An important note is that the software does not overwrite the coordination file by default, so if you are trying multiple coordination set-ups you must manually delete it upon changing. The full code for the process is:

chr=$1
author=$2
dir=$3
low_author=`echo "$author" | tr '[:upper:]' '[:lower:]'`
d=comp_zone/dir${dir}

prev_reftype="none"

i=1
cat all_specs/ldpred_param_specs | tail -n +2 | while read spec;do
  fval=`echo $spec | cut -f1 -d' '`
  reftype=`echo $spec | cut -f2 -d' '`

  echo fval $fval
  echo reftype $reftype

  num_snps=`cat ../raw_ss/meta_stats | fgrep $author | cut -f7 -d','`
  h2=`cat ../raw_ss/meta_stats | fgrep $author | cut -f12 -d','`
  rad=`echo $num_snps/3000 | bc`

  if [ $reftype != $prev_reftype ];then
    echo removing ldradius and coordoutput
    rm ${d}/*ldradius*
    rm ${d}/coord_output
  fi

  if [ ! -e ~/athena/doc_score/mod_sets/${author}/${low_author}.${chr}.ldpred.${i}.ss ]; then
    echo no ldpred score already exists
    present_cord=`ls ${d}/*ldradius* | wc -l`

    if [ $present_cord -eq 0 ];then
        echo there is not already a ldradius file present

        if [ $reftype == "TGP" ];then
        #1000 genomes
        echo went with TGP
            ldpred coord --gf ~/athena/refs/1000genomes/eur.${chr} --ssf temp_files/ss.${low_author}.${chr} --ssf-format CUSTOM --rs RSID --A1 A1 --A2 A2 --eff BETA --pos BP --chr CHR --pval P --se SE --ncol ESS --eff_type LOGOR --out ${d}/coord_output

        else
        #UKBB
        echo went with UKB
            head -5000 geno_files/${low_author}.${chr}.fam | cut -f1 > ${d}/subset_inds
            plink --bfile geno_files/${low_author}.${chr} --keep-fam ${d}/subset_inds --make-bed --out ${d}/for_ldpred
            ldpred coord --gf ${d}/for_ldpred --ssf temp_files/ss.${low_author}.${chr} --ssf-format CUSTOM --rs RSID --A1 A1 --A2 A2 --eff BETA --pos BP --chr CHR --pval P --se SE --ncol ESS --eff_type LOGOR --out ${d}/coord_output
        fi
    fi

    ldpred gibbs --cf ${d}/coord_output --ldf ${d}/ld_file --h2 $h2 --ldr $rad --f $fval --out ${d}/ldpred.${i}.res
    new_file=`ls ${d}/ldpred.${i}.res_LDpred_p* | wc -l`
    if [ $new_file -gt 0 ];then
      echo found new file
      new_file_name=`ls ${d}/ldpred.${i}.res_LDpred_p*`
      echo it is $new_file_name
      Rscript helper_scripts/ldpred_beta_switch.R dir$dir $new_file_name $author $chr $i
    fi
    prev_reftype=$reftype

  fi

  let i=i+1
done
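
The substitution helper, ldpred_beta_switch.R, is not reproduced here. A minimal sketch of what it does, assuming the LDpred output keeps the rsID in its third column and the adjusted effect in its last column (those positions are an assumption on my part), is:

#arguments are: dir name, ldpred output file, author, chr, i
args <- commandArgs(trailingOnly=TRUE)
#args[1] (the directory name) is not needed in this sketch
ldpred_file <- args[2]
author <- args[3]
chr <- args[4]
i <- args[5]
low_author <- tolower(author)

#assumed layout of the LDpred output: rsID in column 3, adjusted effect in the final column
adj <- read.table(ldpred_file, stringsAsFactors=F, header=T)
ss <- read.table(paste0("temp_files/ss.", low_author, ".", chr), stringsAsFactors=F, header=T)

#keep shared SNPs, align the adjusted effects to the summary statistic order, then swap them in
ss <- ss[ss$RSID %in% adj[,3],]
adj <- adj[order(adj[,3])[rank(ss$RSID)],]
ss$BETA <- adj[,ncol(adj)]

write.table(ss, paste0("~/athena/doc_score/mod_sets/", author, "/", low_author, ".", chr, ".ldpred.", i, ".ss"), row.names = F, col.names = T, sep = '\t', quote = F)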

For more information I would (and have) read the original publication, https://www.cell.com/ajhg/fulltext/S0002-9297(15)00365-1, follow the GitHub instructions, https://github.com/bvilhjal/ldpred, or use the bitbucket instructions, https://bitbucket.org/bjarni_vilhjalmsson/ldpred/src/master/. I think the best instructions are from the direct documentation accessed with -h.

5.5 LDpred2

While LDpred is a premier method, there are a few additional options and computational limitations that the authors believed should be resolved in an entirely new implementation. Specifically, LDpred2 is implemented within the bigsnpr package, and therefore requires all files to be read into an R environment (compared to LDpred, where things were managed through calls to a python script). I will briefly describe some of the important components of my implementation, which is based on the online vignette.

The first step is loading in the genotype file that will be used as the linkage disequilibrium reference. While there was some difficulty in picking a single reference for LDpred, in this implementation the associated publication clearly states that 10,000 UK Biobank individuals were utilized, specifically those of white, British ancestry and with imputed SNPs. However, I should note that on initial applications I received errors about the genotypic correlation matrix exceeding the maximum vector size. To fix this problem I followed the publication and restricted the possible SNPs to those from HapMap (specifically those within a file from the vignette). As I already have all of those pieces I can easily and quickly load them in.

The next big step is calculating the genotypic correlation matrix. This is by far the longest step in the process. The window size of the correlation was chosen as 3 cM, as recommended in the vignette. Following this, I read in the consensus heritability, then set up a grid of parameters to analyze. The vignette contained over 100 sets of parameters; to prevent overfitting on my larger validation set I did not want to create this number of scores for each method, so I reduced the parameters investigated. With the parameters, genotypic correlation and read-in summary statistics, the snp_ldpred2_grid command can actually be called to adjust the effect sizes. The output from this command can then be substituted into the original summary statistics and written to produce the final adjusted summary statistics.

As in the double weight set-up, this method is carried out with an initial shell script that prepares the genotypic files.

chr=$1
author=$2
dir=$3
d=comp_zone/dir${dir}
low_author=`echo "$author" | tr '[:upper:]' '[:lower:]'`

head -10000 geno_files/${low_author}.${chr}.fam | cut -f1 > ${d}/subset_inds
plink --bfile geno_files/${low_author}.${chr} --keep-fam ${d}/subset_inds --extract ~/athena/refs/hapmap_from_ldpred2 --make-bed --out ${d}/for_ldpred2

Rscript helper_scripts/ldpred2.R $chr $author $d

The main point of this shell script is calling a R script, which looks like:

library(bigsnpr)
#source("helper_scripts/snp_as_genetic.R")

args <- commandArgs(trailingOnly=TRUE)

chr <- args[1]
author <- args[2]
d <- args[3]

lowauthor <- tolower(author)

#read in the genotypic data
if(!file.exists(paste0(d, "/for_ldpred2.rds"))){
  snp_readBed(paste0(d, "/for_ldpred2.bed"))
}
obj.bigSNP <- snp_attach(paste0(d, "/for_ldpred2.rds"))

#split up genotypic data as described
G   <- obj.bigSNP$genotypes
CHR <- obj.bigSNP$map$chromosome
POS <- obj.bigSNP$map$physical.pos
y   <- obj.bigSNP$fam$affection - 1
NCORES <- nb_cores()

#read in the summary stats
sumstats <- bigreadr::fread2(paste0("temp_files/ss.", lowauthor, ".", chr))
colnames(sumstats) <- c("chr", "pos", "rsid", "a0", "a1", "beta_se", "beta", "p", "n_eff")
sumstats <- sumstats[sumstats$beta_se > 0,]

#run snp_match (likely can skip since I do this earlier)
map <- obj.bigSNP$map[-(2:3)]
names(map) <- c("chr", "pos", "a0", "a1")
info_snp <- snp_match(sumstats, map, join_by_pos = T)

#Calculate the LD correlation matrix
print("calc the corr matrix")
POS2 <- snp_asGeneticPos(CHR, POS, dir = "temp_files", ncores = 3)
df_beta <- info_snp[, c("beta", "beta_se", "n_eff")]
corr0 <- snp_cor(G, ncores = NCORES, infos.pos = POS2, size = 3 / 1000)
corr <- bigsparser::as_SFBM(as(corr0, "dgCMatrix"))

#Get the heritability
print("getting h2")
h2 <- read.table("../raw_ss/meta_stats", stringsAsFactors=F, sep = ",", header=T)
h2 <- h2$h2[h2$author == author]

#Actually run ldpred2
print("run ldpred2 inf")
beta_inf <- snp_ldpred2_inf(corr, df_beta, h2 = h2)

#set up parameters for the non-inf version
h2_seq <- round(h2 * c(0.7, 1, 1.4), 3)
p_seq <- signif(seq_log(1e-4, 1, length.out = 4), 2)
params <- expand.grid(p = p_seq, h2 = h2_seq, sparse = c(FALSE, TRUE))

#run ldpred2 over all the parameters
print("run ldpred2 over grid")
beta_grid <- snp_ldpred2_grid(corr, df_beta, params, ncores = NCORES)

#write the results to mod_sets
print("writing the results")
ss <- info_snp[,1:9]
ss$beta <- beta_inf
write.table(ss, paste0("../mod_sets/", author, "/", lowauthor, ".", chr,  ".ldpred2.1.ss"), row.names = F, col.names = T, quote = F, sep = '\t')

for(i in 1:ncol(beta_grid)){
  ss$beta <- beta_grid[,i]
  write.table(ss, paste0("../mod_sets/", author, "/", lowauthor, ".", chr,  ".ldpred2.", i+1, ".ss"), row.names = F, col.names = T, quote = F, sep = '\t')
}

The original vignette is https://privefl.github.io/bigsnpr/articles/LDpred2.html, and the original publication is https://www.biorxiv.org/content/10.1101/2020.04.28.066720v1.

5.6 SBLUP

SBLUP, which stands for summary-data based best linear unbiased prediction, attempts to use summary statistics alone to generate BLUP estimates of SNP effect sizes (a goal similar to LDpred’s). The application of SBLUP is easily completed within the genome-wide complex trait analysis (GCTA) application, a piece of software that is only rivaled by PLINK for its documentation and dependability. The first argument of sblup is a genotypic file used for determining an LD correlation matrix. On initial applications I attempted to use all people within my training set (80,000). While the application did sometimes complete, it was very memory intensive and slow. Investigating the originating publication I found that in the supplementary material (specifically figure 2) 10,000 individuals were used. In the main text over 7,000 couples were mentioned within an analysis. While more people could have been used in SBLUP, I figured an analysis of 20,000 individuals would be a safe margin on the couples analysis and would give a good picture of whether it performs significantly better than 10,000.

The second argument of SBLUP is termed cojo-sblup, which is specified to be \(m(1/h^2 - 1)\). This term seems rather important, and turns out to be one of the few parameters worth tuning, therefore I parameterized over three variations in which the heritability term is multiplied by 0.7, 1, and 1.3. This way, if the heritability was slightly off and the cojo-sblup term is very sensitive, I still have a chance of obtaining a good set of adjusted summary statistics. The next argument is the cojo-wind term, or the LD radius term. The website recommends 1 Mb even though the default value is 10 Mb. For ease of computation I went with 1 Mb. The final term is the summary statistics, which have to be modified to fit the “ma” format. A key addition in this format is the allele frequency, which was empirically calculated with the full UK Biobank training sample.
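
For concreteness, with \(m = 10^6\) SNPs and \(h^2 = 0.25\) (purely illustrative numbers) the term is \(10^6 \times (1/0.25 - 1) = 3 \times 10^6\); this is the value handed to the cojo-sblup flag, and the script below computes it with bc in exactly this way.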

As before, all of the steps were controlled in the following bash script:

chr=$1
author=$2
dir=$3
low_author=`echo "$author" | tr '[:upper:]' '[:lower:]'`
d=comp_zone/dir${dir}

num_snps=`cat ../raw_ss/meta_stats | fgrep $author | cut -f7 -d','`
h2=`cat ../raw_ss/meta_stats | fgrep $author | cut -f12 -d','`

i=1
cat all_specs/sblup_param_specs | while read spec;do
  coef=`echo $spec | cut -f1 -d' '`
  fam_num=`echo $spec | cut -f2 -d' '`
  use_h2=`echo "$h2 * $coef" | bc -l`
  lambda=`echo "$num_snps * (1/$use_h2 - 1)" | bc`

  if [ ! -e ~/athena/doc_score/mod_sets/${author}/${low_author}.${chr}.sblup.${i}.ss ]; then

    plink --bfile geno_files/${low_author}.${chr} --freq --out temp_files/${low_author}.${chr}
    Rscript helper_scripts/add_ma.R $author $chr

    head -$fam_num geno_files/${low_author}.${chr}.fam | cut -f1 > ${d}/subset_inds
    plink --bfile geno_files/${low_author}.${chr} --keep-fam ${d}/subset_inds --make-bed --out ${d}/for_sblup

    gcta64 --bfile ${d}/for_sblup --cojo-file temp_files/ss.${low_author}.${chr}.ma --cojo-sblup $lambda --cojo-wind 100 --thread-num 20 --out ${d}/out
    Rscript helper_scripts/reformat_sblup.R $chr $author $d $i
    rm ${d}/for_sblup*

  fi
  let i=i+1
done

The first needed R script for calculating the “ma” summary statistics is:

args <- commandArgs(trailingOnly=TRUE)
lowauthor <- tolower(args[1])

bim <- read.table(paste0("geno_files/", lowauthor, ".", args[2], ".bim"), stringsAsFactors=F)
ss <- read.table(paste0("temp_files/ss.", lowauthor, ".", args[2]), stringsAsFactors=F, header=T)
ss <- ss[ss$RSID %in% bim[,2],]
frq <- read.table(paste0("temp_files/", lowauthor, ".", args[2], ".frq"), stringsAsFactors=F, header=T)
frq <- frq[match(ss$RSID, frq$SNP),]  #align the frequency rows to the summary statistic rows

ss$frq <- rep(0, nrow(ss))
ss$frq[ss$A1 == frq$A1] <- frq$MAF[ss$A1 == frq$A1]
ss$frq[ss$A2 == frq$A1] <- frq$MAF[ss$A2 == frq$A1]

ss <- ss[,c(3,4,5,10,7,6,8,9)]
colnames(ss) <- c("SNP", "A1", "A2", "freq", "b", "se", "p", "N")
write.table(ss, paste0("temp_files/ss.", lowauthor, ".", args[2],".ma"), row.names = F, col.names = T, sep = '\t', quote = F)

The output file looks a little funky, so a final R script substitutes the adjusted effect into the original summary statistics.

args <- commandArgs(trailingOnly=TRUE)

chr <- args[1]
author <- args[2]
d <- args[3]
i <- args[4]
lowauthor <- tolower(author)

sblup <- read.table(paste0(d,"/out.sblup.cojo"), stringsAsFactors=F)
ss <- read.table(paste0("temp_files/ss.", lowauthor, ".", chr), stringsAsFactors=F, header=T)

ss <- ss[ss$RSID %in% sblup[,1],]
sblup <- sblup[order(sblup[,1])[rank(ss$RSID)],]
ss$BETA <- sblup[,4]

write.table(ss, paste0("../mod_sets/", author, "/", lowauthor, ".", chr, ".sblup.", i, ".ss"), row.names = F, col.names = T, sep = '\t', quote = F)

The original publication is https://www.nature.com/articles/s41562-016-0016#MOESM11, and the documentation comes from https://cnsgenomics.com/software/gcta/#SBLUP.

5.7 SMTpred

SMTpred, which I believe stands for summary multi-trait prediction, is a very interesting method that leverages correlated traits to effectively increase the sample size of the original effect estimates. While the weighting system is actually relatively simple, I will leave that information to the interested reader and instead focus on the actual implementation. The SMTpred python script requires quite a few types of inputs, including heritabilities, genetic correlations, sample sizes and summary statistics. Luckily for us, most of these inputs are easily accessible, fixed information. The two items that we can change are the summary statistics and the genetic correlations. Specifically, the SMTpred algorithm accepts either normal least-squares effect estimates or SBLUP outputs. As we are already producing SBLUP-adjusted summary statistics we might as well try both. The genetic correlations are a little trickier. Many genetic correlation estimates have very high standard errors, and in my opinion this means they should not be used. Similarly, I could in theory include 24 genetic correlations in an implementation; however, this seems problematic, as comparisons across applications likely become more difficult and some of the weaker correlations may be more likely to harm the final output than help it. Therefore, I will limit the number of correlations to the top 3, 5 and 10 to see whether a more conservative or liberal approach is best.

With all of the implementation theory settled, I may now actually write the code, which is really just a bit of data-pulling and re-arranging. The overall shell script is:

chr=$1
author=$2
dir=$3
low_author=`echo "$author" | tr '[:upper:]' '[:lower:]'`
d=comp_zone/dir${dir}


#NOTE!!!!!!!!!!!! Need to implement sblup option

i=1
cat all_specs/smtpred_param_specs | tail -n +2 | while read spec;do
  max_corr=`echo $spec | cut -f1 -d' '`
  est_type=`echo $spec | cut -f2 -d' '`

  if [ ! -e ~/athena/doc_score/mod_sets/${author}/${low_author}.${chr}.smtpred.${i}.ss ]; then

      Rscript helper_scripts/set_up_smtpred.R $low_author $d $max_corr
   
      cat ${d}/ss_files | cut -f1 -d'.' | while read new_author;do
        if [ $new_author == "imsgc" ];then 
          upper_author=`echo $new_author | tr [a-z] [A-Z]`
        else
          upper_author=${new_author^}
        fi
        speclow_author=`echo $new_author | tr '[:upper:]' '[:lower:]'`
        if [ $est_type == "normal" ];then
          zcat ~/athena/doc_score/raw_ss/${upper_author}/chr_ss/${new_author}_${chr}.ss.gz > ${d}/${new_author}.ss
        else 
          zcat ~/athena/doc_score/mod_sets/${upper_author}/${speclow_author}.${chr}.sblup.5.ss > ${d}/${new_author}.ss
        fi
      done

      ss_line=`cat ${d}/ss_files | while read line;do echo ${d}/${line}; done | tr '\n' ' '`

      python ~/Programs/smtpred/smtpred.py --betafiles $ss_line --nfile ${d}/samp_size --h2file ${d}/herit --rgfile ${d}/rg

      Rscript helper_scripts/finish_smtpred.R $d $i $low_author $chr

    fi

  let i=i+1
done

The first R script within this shell script does much of the heavy lifting, reading all of the genetic correlations and additional meta-stats, and ultimately writing all of the stats files that are required in the actual python call of smtpred.

args <- commandArgs(trailingOnly=TRUE)

author <- args[1]
d <- args[2]
toprank <- as.numeric(args[3])

gencorr <- read.table("~/athena/doc_score/raw_ss/genetic_correlations", stringsAsFactors=F)
metastats <- read.table("~/athena/doc_score/raw_ss/meta_stats", stringsAsFactors=F, sep = ",", header = T)

gencorr <- gencorr[gencorr[,1] == author,]
gencorr <- gencorr[abs(gencorr[,8]) < gencorr[,7] & !is.na(gencorr[,7]) & gencorr[,2] != author,]
gencorr <- gencorr[order(gencorr[,7], decreasing=T),]

if(nrow(gencorr) > toprank){
  gencorr <- gencorr[1:toprank,]
}

metastats <- metastats[tolower(metastats[,1]) %in% c(author, gencorr[,2]),]
sampsize <- cbind(tolower(metastats$author), 4/(1/metastats$cases + 1/metastats$controls))
herit <- cbind(tolower(metastats$author), metastats$h2)
ss_names <- paste0(tolower(metastats$author), ".ss")

write.table(sampsize, paste0(d, "/samp_size"), row.names = F, col.names = F, quote = F, sep = '\t')
write.table(herit, paste0(d, "/herit"), row.names = F, col.names = F, quote = F, sep = '\t')
write.table(gencorr[,c(1,2,7)], paste0(d, "/rg"), row.names = F, col.names = F, quote = F, sep = '\t')
write.table(ss_names, paste0(d, "/ss_files"), row.names = F, col.names = F, quote = F, sep = '\t')

The final R script completes a role similar to that in many other methods, substituting the effects adjusted by the method into the original summary statistics.

#arguments are: d variable, i number, low author, chr
args <- commandArgs(trailingOnly=TRUE)
low_author <- tolower(args[3])

smtpred_ss <- read.table(paste0(args[1], "/multi_trait.beta"), stringsAsFactors=F, header = T)
raw_ss <- read.table(paste0(args[1], "/", args[3], ".ss"), stringsAsFactors=F, header=T)

raw_ss <- raw_ss[raw_ss$RSID %in% smtpred_ss$snpid,]
smtpred_ss <- smtpred_ss[order(smtpred_ss$snpid)[rank(raw_ss$RSID)],]

raw_ss$BETA <- smtpred_ss[,3]
write.table(raw_ss, paste0("~/athena/doc_score/mod_sets/", tools::toTitleCase(args[3]), "/", args[3], ".", args[4],  ".smtpred.", args[2] ,".ss"), row.names = F, col.names = T, sep = '\t', quote = F)

The original publication is https://www.nature.com/articles/s41467-017-02769-6, and the documentation comes from https://github.com/uqrmaie1/smtpred.

5.8 prsCS

The prsCS algorithm is actually very similar to LDpred, in that it aims to shrink the least-squares effect estimates, thereby generating adjusted estimates that are better poised for polygenic risk scoring. The main difference (I believe) is that LDpred uses a point-normal prior on the probability of a SNP being causal whereas prsCS goes with a continuous shrinkage prior (naturally).

Luckily for us, many of the parameters that have confused us in past methods are easily and simply provided within this method. Specifically, I am referring to the genotypic files for the linkage disequilibrium reference, which are provided on the GitHub page based on the ancestry of the desired polygenic risk score. In fact, the only parameter that we are specifying is phi, which I believe is related to the proportion of underlying causal SNPs for the trait, also known as the polygenicity. While the algorithm states that it can learn this parameter itself, there are some caveats that lead me to believe it would be best to just try a scale of phi values. There are other parameters relating to the MCMC algorithm, but I will not touch those.

The actual mechanics of the algorithm include changing the formatting of the summary statistics, running a single python script, and finally substituting the output adjusted effect sizes for the original effect sizes. This process was controlled by the shell script:

chr=$1
author=$2
dir=$3
low_author=`echo "$author" | tr '[:upper:]' '[:lower:]'`
d=comp_zone/dir${dir}

ess=`cat temp_files/ss.${low_author}.${chr} | head -2 | tail -n +2 | cut -f9`

cp helper_scripts/prscs_header ${d}/ss
cat temp_files/ss.${low_author}.${chr} | tail -n+2 | cut -f3,4,5,7,8 >> ${d}/ss

i=1
cat all_specs/prscs_param_specs | while read phi;do

  if [ ! -e ~/athena/doc_score/mod_sets/${author}/${low_author}.${chr}.prscs.${i}.ss ]; then

    python ~/Programs/PRScs/PRScs.py --ref_dir=/home/kulmsc/athena/refs/ldblk_1kg_eur --bim_prefix=geno_files/${low_author}.${chr} --sst_file=${d}/ss --phi=${phi} --n_gwas=${ess} --chrom=$chr --out_dir=${d}/output.$i

    Rscript helper_scripts/prscs_beta_switch.R $d $i $author $chr

  fi
  let i=i+1

done

The R script needed for effect substitution is:

#arguments are: d, i, author, chr 
args <- commandArgs(trailingOnly=TRUE)
d <- args[1]
i <- args[2]
author <- args[3]
chr <- args[4]
low_author <- tolower(author)


file_name <- list.files(d, paste0("output.", i, "_pst"))
beta <- read.table(paste0(d, "/", file_name), stringsAsFactors = F)
ss <- read.table(paste0("temp_files/ss.", low_author, ".", chr), stringsAsFactors=F, header=T)

ss <- ss[ss$RSID %in% beta[,2],]
beta <- beta[order(beta[,2])[rank(ss$RSID)],]
ss$BETA <- beta[,6]

write.table(ss, paste0("~/athena/doc_score/mod_sets/", author, "/", low_author, ".", chr,  ".prscs.", i ,".ss"), row.names = F, col.names = T, sep = '\t', quote = F)

The original publication can be found at https://www.nature.com/articles/s41467-019-09718-5, and the documentation comes from https://github.com/getian107/PRScs.

5.9 lassosum

While lassosum may appear to be a classic method like LDpred, SBLUP, or prsCS, its underlying methodology is actually far different. Rather than following a theoretical basis that recreates the effect estimates that would be produced with full genetic information, lassosum tries to recreate the process of lasso regression, a far more heuristic or goal-oriented process. The process of actually implementing lassosum is relatively easy, as everything can be carried out within an R script, leaving the controlling shell script to do very little.

chr=$1
author=$2
dir=$3
low_author=`echo "$author" | tr '[:upper:]' '[:lower:]'`
d=comp_zone/dir${dir}




Rscript helper_scripts/lassosum.R $d $chr $low_author

The R script it launches does three sets of things. First, it sets up a few important objects: it reads in the summary statistics and sets the name of the reference genotypes. Second, it converts the summary statistic P-values into a correlation-type statistic and then runs the lassosum pipeline. A few of the important parameters set in the pipeline include: the name of the LD block reference, which is already mostly specified by the software and just needs an ancestry-matching name; the sample number, or the number of people to keep in the LD reference, which I chose to be 5,000 since it exceeds the sample size used within the originating publication (from my reading either the 1000 Genomes or cases from the WTCCC were used, both of which I think are close to 5,000, or at least not an order of magnitude higher); the s parameter (I am unclear on what it represents), which is given values ranging from 0.1 to 1 in the paper, and since performance was better at lower values I test 0.1, 0.4 and 0.8; and finally the lambda value, which I believe represents the strength of the lasso penalty, and which is given default values ranging from 0.001 to 0.1, so I test a similar range but with fewer values in my own sample vector. The third thing the script does is substitute the results into the original summary statistics and write them out. Altogether it looks like:

library(lassosum)

#args are: d, chr, low_author
args <- commandArgs(trailingOnly=TRUE)

ss <- read.table(paste0("temp_files/ss.", args[3], ".", args[2]), stringsAsFactors=F, header=T)
ref.bfile <- paste0("geno_files/", args[3], ".", args[2])

cor <- p2cor(p = ss$P, n = ss$ESS[1], sign=ss$BETA)

out <- lassosum.pipeline(cor=cor, chr=ss$CHR, pos=ss$BP, 
                         A1=ss$A1, A2=ss$A2, # A2 is not required but advised
                         ref.bfile=ref.bfile, LDblocks = "EUR.hg19", sample = 5000,
                         s = c(0.1, 0.4, 0.8), lambda = c(0.002, 0.004, 0.007, 0.1))

new_ss <- ss[ss$BP %in% out$sumstats[,2],]

k <- 1
for(i in 1:3){
  for(j in 1:4){
    new_ss$BETA <- out$beta[[i]][,j]
    write.table(new_ss, paste0("~/athena/doc_score/mod_sets/", tools::toTitleCase(args[3]), "/", args[3], ".", args[2],  ".lassosum.", k ,".ss"), row.names = F, col.names = T, sep = '\t', quote = F)
    k <- k + 1
  }
} 

The original (accessible) publication can be found at https://www.biorxiv.org/content/10.1101/058214v2.full.pdf, and the documentation comes from https://github.com/tshmak/lassosum.

5.10 Tweedie

The Tweedie method is fundamentally concerned with the winner’s curse idea that has similarly appeared in several other methods. Tweedie specifically handles the curse by positing a true distribution of Z-statistics, and then attempting to approximate this distribution by placing a kernel over the estimated statistics. However, before we can begin this winner’s curse correction process the corresponding publication recommends we try to limit the amount of linkage disequilibrium within the underlying summary statistics. They use a consistent R2 cut-off of 0.25 and let the P-value parameter take a range of values. The limited summary statistics are then converted into Z-statistics, combined with empirically estimated allele frequencies, and sent into a supplied R script. The output adjusted effect values can then be substituted back into the original summary statistics and saved. The controlling script for this process looks like:

chr=$1
author=$2
dir=$3
low_author=`echo "$author" | tr '[:upper:]' '[:lower:]'`
d=comp_zone/dir${dir}


i=1
cat all_specs/tweedie_param_specs | tail -n +2 | while read spec;do
  plim=`echo $spec | cut -f1 -d' '`

  if [ ! -e ~/athena/doc_score/mod_sets/${author}/${low_author}.${chr}.tweedie.${i}.ss ]; then

    plink --memory 4000 --threads 1 --bfile geno_files/${low_author}.${chr} --clump temp_files/ss.${low_author}.${chr} --clump-snp-field RSID --clump-p1 $plim --clump-r2 0.25 --out ${d}/out

    if [ -f ${d}/out.clumped ]; then

      plink --bfile geno_files/${low_author}.${chr} --freq --out ${d}/freq

      Rscript helper_scripts/add_tweedie.R $author $chr $d

      Rscript helper_scripts/tweedie.R $author $chr $d $i

    fi

  fi

  let i=i+3

done

I won’t show the R script because it is rather long and almost entirely what was originally downloaded. The only changes are to read in the summary statistics at the top, convert the P-values to Z-statistics, and then separate out the necessary objects.
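
For reference, the conversion from P-values to Z-statistics is just the standard normal quantile of the two-sided P-value, signed by the direction of the effect; in R (the variable names here are mine) that is:

#beta and p are the effect and P-value columns of the summary statistics
z <- sign(beta) * qnorm(p / 2, lower.tail = FALSE)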

The original publication comes from https://www.nature.com/articles/srep41262.pdf, and the code comes from https://sites.google.com/site/honcheongso/software/empirical-bayes-risk-prediction.

5.11 SBayesR

The SBayesR method is somewhat similar to the SBLUP method in that it uses an LD reference matrix to attempt to recover the effect estimates that would be produced from a full genotype-aware regression. Any more description would get pretty complicated, so I will leave it to the interested reader. The largest component of an SBayesR run, in terms of computation and memory, is the construction of the LD reference matrix. There are solid descriptions within the tutorial; the ultimate message is that a sparse and shrunk matrix is best. Although what’s even better is downloading these matrices directly without any computation. You can find the links to these matrices on the FAQ page. There is a 1 million HapMap SNP version and a 2.8 million common SNP version. The corresponding publication saw slightly better results with 2.8 million SNPs; however, this larger version looks like it is well over 100 GB, compared to the ~50 GB of the HapMap version. While losing some accuracy, in the effort to make this an attainable guide for all I went with the HapMap version. So please know that if you are looking for maximum accuracy, have a lot of storage and want to use SBayesR, then go for the 2.8 million version. The remaining parameters can likely be left at their defaults, as that seems to be what was done in the publication. However, an important step is to set a P-value and rsq cut-off as recommended in their FAQ. Finally, I ran two models, one with the R model and another with the C model.

The controlling script for this entire process is actually quite simple and looks like:

chr=$1
author=$2
dir=$3
low_author=`echo "$author" | tr '[:upper:]' '[:lower:]'`
d=comp_zone/dir${dir}




plink --bfile geno_files/${low_author}.${chr} --freq --out temp_files/${low_author}.${chr}
Rscript helper_scripts/add_ma.R $author $chr

gctb2 --sbayes R --ldm  ~/athena/refs/ukbEURu_hm3_shrunk_sparse/ukbEURu_hm3_chr${chr}_v3_50k.ldm.sparse --gwas-summary temp_files/ss.${low_author}.${chr}.ma --p-value 0.05 --rsq 0.99 --out-freq 100 --out ${d}/sbay_out

Rscript helper_scripts/reformat_sbayesr.R $chr $author $d 1
rm ${d}/sbay_out*


gctb2 --sbayes C --ldm  ~/athena/refs/ukbEURu_hm3_shrunk_sparse/ukbEURu_hm3_chr${chr}_v3_50k.ldm.sparse --gwas-summary temp_files/ss.${low_author}.${chr}.ma --p-value 0.05 --rsq 0.99 --out-freq 100 --out ${d}/sbay_out

Rscript helper_scripts/reformat_sbayesr.R $chr $author $d 2
rm ${d}/sbay_out*

At the top of the script is the execution of an R script, which is exactly the same as in the SBLUP description. And at the bottom, as I often have to do, I exchange the adjusted effects for those in the original summary statistics. This script looks like:

args <- commandArgs(trailingOnly=TRUE)

chr <- args[1]
author <- args[2]
d <- args[3]
i <- args[4]
lowauthor <- tolower(author)

sbay <- read.table(paste0(d,"/sbay_out.snpRes"), stringsAsFactors=F, header = T)
ss <- read.table(paste0("temp_files/ss.", lowauthor, ".", chr), stringsAsFactors=F, header=T)

ss <- ss[ss$RSID %in% sbay[,2],]
sbay <- sbay[order(sbay[,2])[rank(ss$RSID)],]
ss$BETA <- sbay$A1Effect

write.table(ss, paste0("../mod_sets/", author, "/", lowauthor, ".", chr, ".sbayesr.", i, ".ss"), row.names = F, col.names = T, sep = '\t', quote = F)
5.12 DBSLMM

The next method is DBSLMM, which I believe stands for Deterministic Bayesian Sparse Linear Mixed Model. The controlling script is below, and the main steps are described after the code.

chr=$1
author=$2
dir=$3
low_author=`echo "$author" | tr '[:upper:]' '[:lower:]'`
d=comp_zone/dir${dir}

#spack load /g7nwnaf
#spack load gcc@8.2.0

num_snps=`cat ../raw_ss/meta_stats | fgrep $author | cut -f7 -d','`
h2=`cat ../raw_ss/meta_stats | fgrep $author | cut -f12 -d','`
samp_size=`cat temp_files/ss.${low_author}.${chr} | head -2 | tail -n +2 | cut -f9`

i=1
cat all_specs/dbslmm_param_specs | tail -n+2 | while read spec;do

  pval=`echo $spec | cut -f1 -d' '`
  r2=`echo $spec | cut -f2 -d' '`

  if [ ! -e ~/athena/doc_score/mod_sets/${author}/${low_author}.${chr}.dbslmm.${i}.ss ]; then

    plink --bfile geno_files/${low_author}.${chr} --freq --out temp_files/${low_author}.${chr}
    Rscript helper_scripts/add_gemma.R $author $chr

    head -5000 geno_files/${low_author}.${chr}.fam > ${d}/keep_people
    plink --bfile geno_files/${low_author}.${chr} --keep-fam ${d}/keep_people --make-bed --out ${d}/short

    dbslmm_bash=~/Programs/DBSLMM/dbslmm
    dbslmm_r=~/Programs/DBSLMM/software/DBSLMM.R
    ss=temp_files/ss.${low_author}.${chr}.gemma
    ref=${d}/short
    blockf=~/athena/refs/ldsplits/chr${chr}.bed 
    
    Rscript ${dbslmm_r} --summary $ss --outPath ./${d}/ --plink ~/bin/plink --dbslmm ${dbslmm_bash} --ref ${ref} --n ${samp_size} --nsnp ${num_snps} --block ${blockf} --h2 ${h2} --pv=${pval} --r2=${r2}

    if [ -e ${d}/ss.${low_author}.dbslmm.txt ];then
      Rscript helper_scripts/reformat_dbslmm.R $chr $author $d $i
      rm ${d}/ss.${low_author}.dbslmm.txt
    fi

  fi

  let i=i+1
done

The main steps involve converting my summary statistics format into the desired GEMMA format, extracting a large number of meta-statistics (including heritability, sample size, and number of SNPs), and then actually launching the main R script. I have had some trouble getting this program to work, as some specific GCC compilers and additional libraries are needed. The final step in this process substitutes the adjusted beta values into the original summary statistics. The only parameters needed in this process are the p-value and R2 values, which are used in a preliminary clumping process within the DBSLMM algorithm. The only other parameter, unclear but common across methods, is the number of individuals to include in the LD reference panel. While the original paper only specifies 500 individuals in their reference panel, I thought that value was very low and that increasing it would not decrease accuracy. Therefore I have 5,000 individuals in the reference panel, bringing the sample size closer in line with the other methods.
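
The substitution helper, reformat_dbslmm.R, is not shown; a sketch along the lines of the other reformatting scripts, assuming the DBSLMM output keeps the rsID in its first column and the adjusted effect in its last column (I am not certain of that layout), is:

#arguments are: chr, author, d, i
args <- commandArgs(trailingOnly=TRUE)

chr <- args[1]
author <- args[2]
d <- args[3]
i <- args[4]
low_author <- tolower(author)

#assumed layout of the DBSLMM output: rsID in column 1, adjusted effect in the final column
dbslmm <- read.table(paste0(d, "/ss.", low_author, ".dbslmm.txt"), stringsAsFactors=F)
ss <- read.table(paste0("temp_files/ss.", low_author, ".", chr), stringsAsFactors=F, header=T)

#keep shared SNPs, align the adjusted effects to the summary statistic order, then swap them in
ss <- ss[ss$RSID %in% dbslmm[,1],]
dbslmm <- dbslmm[order(dbslmm[,1])[rank(ss$RSID)],]
ss$BETA <- dbslmm[,ncol(dbslmm)]

write.table(ss, paste0("../mod_sets/", author, "/", low_author, ".", chr, ".dbslmm.", i, ".ss"), row.names = F, col.names = T, sep = '\t', quote = F)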

The original publication comes from https://www.sciencedirect.com/science/article/pii/S0002929720301099, and great documentation comes from https://biostat0903.github.io/DBSLMM/index.html.

5.13 JAMPred

JAMPred, which stands for Joint Analysis of Marginal Summary Statistics Prediction, is a polygenic risk score generating model that is based upon the JAM algorithm. The original algorithm was made to smartly perform meta-analyses for the purpose of fine-mapping. The prediction aspect has been added on, adjusting for both normal and long-range linkage disequilibrium through the induction of sparsity in the effects. A block approach allows for the long-range correction while also allowing for easy parallelization. Further details on the algorithm are interesting, but here we only need to implement it.

chr=$1
author=$2
dir=$3
d=comp_zone/dir${dir}
low_author=`echo "$author" | tr '[:upper:]' '[:lower:]'`

#can limit to 5,000 samples and R2 < 0.95

head -5000 geno_files/${low_author}.${chr}.fam | cut -f1 > ${d}/subset_inds
plink --bfile geno_files/${low_author}.${chr} --keep-fam ${d}/subset_inds --indep-pairwise 500 100 0.95 --out ${d}/snps
plink --bfile geno_files/${low_author}.${chr} --keep-fam ${d}/subset_inds --extract ${d}/snps.prune.in --make-bed --out ${d}/for_jampred

Rscript helper_scripts/jampred.R $chr $author $d

The primary implementation of JAMPred takes place within an R script. The only prior steps needed are setting up the LD reference data. Following the original publication, the sample size of the LD reference is set to 5,000 individuals, slightly more than what is used in the primary paper. An interesting step specific to this method is that the markers in the LD reference are specified to have pairwise R2 less than 95%, meaning that I had to prune my markers using PLINK.

.libPaths( c( "/home/kulmsc/R/x86_64-pc-linux-gnu-library/3.6",  .libPaths() ) )
library(bigsnpr)
library(R2BGLiMS)

args <- commandArgs(trailingOnly=TRUE)

chr <- args[1]
author <- args[2]
d <- args[3]

if(!file.exists(paste0(d, "/for_jampred.rds"))){
  snp_readBed(paste0(d, "/for_jampred.bed"))
}
obj.bigSNP <- snp_attach(paste0(d, "/for_jampred.rds"))
ped <- big_copy(snp_fastImputeSimple(obj.bigSNP$genotypes, "mean0"), type = "integer")
ped <- ped$bm()
options(bigmemory.allow.dimnames=TRUE)
colnames(ped) <- obj.bigSNP$map$marker.ID
rm(obj.bigSNP)

ss <- read.table(paste0("temp_files/ss.", tolower(author), ".", chr), stringsAsFactors=F, header=T)
ss <- ss[ss$RSID %in% colnames(ped),]
marg_beta <- ss$BETA
names(marg_beta) <- ss$RSID
marg_se <- ss$SE
names(marg_se) <- ss$RSID

meta_stats <- read.table("~/athena/doc_score/raw_ss/meta_stats", sep = ",", stringsAsFactors=F, header = T)
meta_line <- meta_stats[meta_stats[,1] == tools::toTitleCase(author),]

lambdas <- read.table("all_specs/jampred_param_specs", header = T)
try_ind <- unname(which(marg_beta != 0))

for(i in 1:nrow(lambdas)){
  jampred.res.bin <- JAMPred(
   marginal.betas = marg_beta[try_ind],
   n.training = as.numeric(meta_line$sampe_size),
   marginal.logor.ses = marg_se[try_ind], # Only necessary for a binary trait
   p.cases.training = meta_line$cases/meta_line$sampe_size, # Only necessary for a binary trait
   ref.geno = ped[1:nrow(ped),try_ind],
   total.snps.genome.wide = meta_line$snps, # Total SNPs across all chromosomes
   n.mil = 0.2,
   n.cores = 1,
   beta.binom.b.lambda = lambdas[i,1],
   debug = TRUE,
   save.path = paste0("/home/kulmsc/athena/doc_score/adjust_ss/", d),
   seed = 1 # For re-producibility. If not set a random seed is used
  )

  newss <- ss[ss$RSID %in% jampred.res.bin$snps,]
  newss$BETA <- jampred.res.bin$step2.posterior.mean.snp.weights

  write.table(newss, paste0("../mod_sets/", author, "/", tolower(author), ".", chr,  ".jampred.", i, ".ss"), row.names = F, col.names = T, quote = F, sep = '\t')
}

After the genotypic data is prepared, the actual R script can be carried out. While the demands of JAMPred may seem simple, it requires that the genotypic data, coded as 0, 1, and 2, be read into R. This can be a very large object, easily exceeding the RAM of even a nice cluster. Therefore I took the approach of using the bigsnpr package to read the data in a memory-smart way. Some funky code is then used to change the column names to the rsids in the summary statistics and to impute the missing data. The memory-smart genotypic file can then be directly inserted into the main JAMPred function call, along with other meta-statistics and the summary statistics. The only parameter needing adjustment is lambda. The four lambda values chosen in this trial are the four used within the example on the associated GitHub page. The output of this function is a set of adjusted betas, which can then be substituted into the main summary statistics.

The original publication comes from https://onlinelibrary.wiley.com/doi/full/10.1002/gepi.22245, and the documentation is at https://github.com/pjnewcombe/R2BGLiMS.

5.14 Winner’s Curse Lasso

The next series of methods all come from the same publication, which attempts to improve GWAS summary statistics for scoring by correcting the winner’s curse. As a short review, the winner’s curse is inflicted upon the SNPs that barely surpass whatever hard (often p-value) threshold is made for inclusion in scoring. The economic definition, which is a little easier to follow, describes the curse of the person who places the winning bid at an auction: while the person wins the item, their bid is often too high and does not reflect the true price or the average bid. To fix this curse the general idea is to make a smooth threshold for SNP inclusion.

The first way the corresponding publication does this is with the LASSO algorithm. As we do not have the raw genotypic data we obviously cannot do a true LASSO. However, by treating the summary statistic effect size as the quantity that becomes penalized, we can construct an ad hoc LASSO approach that limits SNPs to those that are the most powerful. To implement the associated equation the publication first requires limiting SNPs so they are not within LD. They use clumping with R2 less than 0.1, so we do the same in all possible LASSO executions (we also set the p-value limit to 0.5 to remove the SNPs that clearly will not make any following cut-off). While there is no need to match any LD reference panel, because one does not exist, we do need to specify lambda values. While I cannot find any mention of values in the publication, I have found that 0.001, 0.01, and 0.1 give a good array of adjusted effect sizes. With all of this in mind we can now launch the actual scoring script.
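
Concretely, the penalty applied to each effect size in the python script used below is the usual lasso soft-thresholding operator,

\[\hat{\beta}_{new}=\text{sign}(\hat{\beta})\,(|\hat{\beta}|-\lambda)\,I(|\hat{\beta}|>\lambda)\]

so SNPs whose effects do not exceed \(\lambda\) are zeroed out entirely, while the remaining effects are shrunk toward zero by \(\lambda\).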

chr=$1
author=$2
dir=$3
low_author=`echo "$author" | tr '[:upper:]' '[:lower:]'`
d=comp_zone/dir${dir}


plink --memory 4000 --threads 1 --bfile geno_files/${low_author}.${chr} --clump temp_files/ss.${low_author}.${chr} --clump-snp-field RSID --clump-p1 0.5 --clump-r2 0.05 --out ${d}/out
sed -e 's/ [ ]*/\t/g' ${d}/out.clumped | sed '/^\s*$/d' | cut -f4 | tail -n +2 > ${d}/done_rsids
cat temp_files/ss.${low_author}.${chr} | fgrep -w -f ${d}/done_rsids > ${d}/specific_ss


i=1
cat all_specs/wc_lasso_param_specs | while read lambda;do

  if [ ! -e ~/athena/doc_score/mod_sets/${author}/${low_author}.${chr}.wc_lasso.${i}.ss ]; then

    python helper_scripts/winners_curse.py lasso $lambda $d $i

    mv ${d}/summStat ~/athena/doc_score/mod_sets/${author}/${low_author}.${chr}.wc_lasso.${i}.ss

  fi

  let i=i+1

done

The key line within this script launches the winners_curse python script. While not the shortest script, it is actually quite simple. The summary statistics are read in, each SNP is then iterated through with the necessary values being sent to the lasso function, and the output adjusted effect size is then substituted into the new summary statistics. I should note that this script was not provided by the publication but written by myself, leaving plenty of room for error. After the script runs I simply save the adjusted summary statistics to finish the process.

import numpy as np
import gzip
import os
import sys
import pdb
from scipy.stats import norm
from scipy.optimize import minimize

method=sys.argv[1]
inputLambda=float(sys.argv[2])
dd=sys.argv[3]

ssName=dd+"/specific_ss"
pLim=0.05

def normRead(fileName,delimiter):
    totalData=[]
    with open(fileName,"r") as f:
        for line in f.read().splitlines():
            totalData.append(line.split())
    totalData=np.array(totalData)
    return(totalData)


def justRead(fileName,delimiter):
    rList=[]
    with open(fileName,"r") as f:
        for line in f.read().splitlines():
            rList.append(line.split(delimiter))
    return(np.array(rList))


def likelihood(beta,betaHat,se,alpha):
    density=norm.pdf((betaHat-beta)/se)
    lambdaVal=norm.ppf(1-alpha/2)*se
    cumu1=norm.cdf(beta/se - lambdaVal/se)
    cumu2=norm.cdf(-beta/se - lambdaVal/se)
    if abs(beta)>lambdaVal:
        indi=1
    else:
        indi=0
    like=((density/se)/(cumu1+cumu2))*indi
    return(like)


def lasso(beta,lambdaVal):
    newBeta=np.sign(beta)*abs(abs(beta)-lambdaVal)*(1 if abs(beta)>lambdaVal else 0)
    return(newBeta)


ss=normRead(ssName,'\t')
ss=ss[ss[:,5] != "Inf", :]
ss=ss[ss[:,5] != "0", :]
#pdb.set_trace()


ssText=ss[:,2:5]
ssNums=ss[:,(0,1,5,6,7,8)]
ssNums=ssNums.astype(float)
goodBetaHolder=np.zeros(ss.shape[0])


for i in range(ss.shape[0]):
    ssTextRow=ssText[i,:]
    ssNumsRow=ssNums[i,:]
    zVal=abs(norm.ppf(ssNumsRow[4]))
    betaSE=ssNumsRow[3]/zVal

    if method=="like":
        y=minimize(likelihood,1,args=(ssNumsRow[3],betaSE,pLim),method='nelder-mead')
        betaAns=y.x
    elif method=="lasso":
        betaAns=lasso(ssNumsRow[3],inputLambda)
    goodBetaHolder[i]=betaAns


goodBetaHolder=goodBetaHolder.astype("str")
ss[:,6]=goodBetaHolder

np.savetxt(dd+'/summStat', ss, fmt='%s', delimiter='\t')

The original publication comes from https://journals.plos.org/plosgenetics/article?id=10.1371/journal.pgen.1006493; there is no supporting documentation.

5.15 Winner’s Curse Likelihood

Following directly from the previous method, the corresponding publication came up with a second way to correct for the winner’s curse, specifically through maximum likelihood. Prior theorizing on determining true effect sizes in regression has produced an equation that gives a point estimate of the likelihood of an effect size. This new likelihood can be thresholded the same way the original P-value was. Once again the publication does not give any recommended values for setting a new series of thresholds; however, I have found that the values of 0.05, 0.01, and 0.001 work relatively well. The script that controls the process looks very similar to the previous lasso method.
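
For reference, the point estimate evaluated in the script (see the likelihood function in the python code shown in the lasso section) is the conditional, or truncated, likelihood

\[L(\beta)=\frac{\phi\left((\hat{\beta}-\beta)/SE\right)/SE}{\Phi\left(\beta/SE-\lambda/SE\right)+\Phi\left(-\beta/SE-\lambda/SE\right)}\,I(|\beta|>\lambda),\qquad \lambda=\Phi^{-1}(1-\alpha/2)\,SE\]

where \(\phi\) and \(\Phi\) are the standard normal density and distribution functions and \(\alpha\) is the significance threshold (0.05 in the script).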

chr=$1
author=$2
dir=$3
low_author=`echo "$author" | tr '[:upper:]' '[:lower:]'`
d=comp_zone/dir${dir}


plink --memory 4000 --threads 1 --bfile geno_files/${low_author}.${chr} --clump temp_files/ss.${low_author}.${chr} --clump-snp-field RSID --clump-p1 0.5 --clump-r2 0.05 --out ${d}/out
sed -e 's/ [ ]*/\t/g' ${d}/out.clumped | sed '/^\s*$/d' | cut -f4 | tail -n +2 > ${d}/done_rsids
cat temp_files/ss.${low_author}.${chr} | fgrep -w -f ${d}/done_rsids > ${d}/specific_ss


i=1
if [ ! -e ~/athena/doc_score/mod_sets/${author}/${low_author}.${chr}.wc_like.${i}.ss ]; then

  python helper_scripts/winners_curse.py like 0 $d

  mv ${d}/summStat ~/athena/doc_score/mod_sets/${author}/${low_author}.${chr}.wc_like.${i}.ss

fi

let i=i+1

Again, the key step is the python script, which is exactly the same as the one above. For simplicity I will not print it again.

5.16 Winner’s Curse 2D

Following mostly in a direct line from the previous methods, the 2D method attempts to remove the winner’s curse from summary statistics. But instead of creating an equation that makes a new thresholding statistic, the 2D method carries out clumping just as in the original clumping method, except that it additionally determines which SNPs are clumped at which cut-off. Specifically, SNPs that are within special regions that have been deemed to be more important get a more lenient p-value cut-off. The original publication attempts multiple region types; I will use those that seemed to work best: conserved SNPs, blood eQTL SNPs and pleiotropic SNPs. The other important specifications are the respective p-value cut-offs. Luckily for us, the original publication does give some advice on this question by providing a nice mountain plot. Three values around the peak, for both the important and unimportant SNP regions, were taken from this plot.

To carry out the 2D approach we iterate through each of the different p-value and region parameters. For each set of parameters we split the summary statistics into an important and an unimportant set, clump each set based on its respective p-value, subset the respective summary statistics, then join the two sets back together for the final adjusted summary statistics. The script to do this looks like:

chr=$1
author=$2
dir=$3
low_author=`echo "$author" | tr '[:upper:]' '[:lower:]'`
d=comp_zone/dir${dir}




i=1
cat all_specs/wc_2d_param_specs | tail -n +2 | while read spec;do
  anno_type=`echo $spec | cut -f1 -d' '`
  pelse=`echo $spec | cut -f2 -d' '`
  panno=`echo $spec | cut -f3 -d' '`

  if [ ! -e ~/athena/doc_score/mod_sets/${author}/${low_author}.${chr}.wc_2d.${i}.ss ]; then

    head -1 temp_files/ss.${low_author}.${chr} > ${d}/anno_ss
    head -1 temp_files/ss.${low_author}.${chr} > ${d}/else_ss

    cat temp_files/ss.${low_author}.${chr} | fgrep -w -f ~/athena/refs/genome_annotations/${anno_type}_snps >> ${d}/anno_ss
    cat temp_files/ss.${low_author}.${chr} | fgrep -v -w -f ~/athena/refs/genome_annotations/${anno_type}_snps >> ${d}/else_ss

    plink --memory 4000 --threads 1 --bfile geno_files/${low_author}.${chr} --clump ${d}/anno_ss --clump-snp-field RSID --clump-p1 $panno --clump-r2 0.1 --out ${d}/out.anno
    plink --memory 4000 --threads 1 --bfile geno_files/${low_author}.${chr} --clump ${d}/else_ss --clump-snp-field RSID --clump-p1 $pelse --clump-r2 0.1 --out ${d}/out.else

    if [ -e ${d}/out.anno.clumped ];then
      sed -e 's/ [ ]*/\t/g' ${d}/out.anno.clumped | sed '/^\s*$/d' | cut -f4 | tail -n +2 > ${d}/done_rsids
      fgrep -w -f ${d}/done_rsids temp_files/ss.${low_author}.${chr} > ${d}/ss.out
    fi
    if [ -e ${d}/out.else.clumped ];then
      sed -e 's/ [ ]*/\t/g' ${d}/out.else.clumped | sed '/^\s*$/d' | cut -f4 | tail -n +2 > ${d}/done_rsids
      fgrep -w -f ${d}/done_rsids temp_files/ss.${low_author}.${chr} >> ${d}/ss.out
    fi

    if [ -e ${d}/ss.out ];then
      mv ${d}/ss.out ~/athena/doc_score/mod_sets/${author}/${low_author}.${chr}.wc_2d.${i}.ss
    fi

    rm ${d}/done_rsids
    rm ${d}/ss.out
    rm ${d}/out.anno*
    rm ${d}/out.else*
    rm ${d}/anno_ss
    rm ${d}/else_ss

  fi

  let i=i+1
done

Once again this is the same publication as the likelihood and lasso methods.

5.16.1 Additional Methods

These are all of the methods that I could find that adjust summary statistics for the purpose of producing a polygenic risk score. I am aware that there are a few other methods, including GraBLD, LDPred-herit, stackCT and others that look similar to what has been presented. However, to the best of my knowledge these methods require phenotypic information. I have decided that if I allow the inclusion of phenotypic information into score generation then I should also test regression methods such as BLUP or the Bayesian alphabet techniques. Since those are fundamentally not what I am trying to do, I am excluding any method that requires phenotypic information.

If you know another method that I should evaluate please let me know at .