6 Compiling the Scores

Converting the adjusted summary statistics into actual polygenic risk scores should be a trivial task. However, a few important computational aspects of this step can either significantly slow down or altogether ruin the scoring process. Specific to the UK Biobank, the key computational (and therefore time) problem is converting the provided bgen files into something smaller that can be easily read into PLINK. Doing this for all several million SNPs and ~500,000 individuals at once will almost certainly exhaust your RAM, and is slow because it is not parallelizable. Therefore, I have split up scoring over each of the chromosomes (as it is already natural) and over each adjusted summary statistic file (as it is simplest).

6.1 Simple Score Approach

As already alluded to, the approach I took is simple in nature. The code that processes the scoring of each adjusted summary statistic set is:


#!/bin/bash
#Arguments: summary statistic file name, author, adjustment method, chromosome, job index
ss_name=$1
author=$2
method=$3
chr=$4
i=$5

echo $ss_name $author $method $chr $i

#Just logistics around controlling parallelism
go_num=`head -1 temp_files/poss_go`
grep -v -w $go_num temp_files/poss_go > temp_files/temp_poss
mv temp_files/temp_poss temp_files/poss_go

low_author=`echo "$author" | tr '[:upper:]' '[:lower:]'`
ver=`echo $ss_name | cut -f4 -d'.'`


if [ ! -e small_score_files/score.${low_author}.${chr}.${ver}.${method}.profile.zst ];then

  cat ../mod_sets/${author}/${ss_name} | cut -f3 > temp_files/rsids.${i}

  bgenix -g ~/athena/ukbiobank/imputed/ukbb.${chr}.bgen -incl-rsids temp_files/rsids.${i} > temp_files/temp.${i}.bgen

  plink2_new --memory 12000 --threads 12 --bgen temp_files/temp.${i}.bgen ref-first --sample ~/athena/ukbiobank/imputed/ukbb.${chr}.sample --keep-fam temp_files/brit_eid --make-bed --out temp_files/geno.${i}

  plink --memory 12000 --threads 12 --bfile temp_files/geno.${i} --keep-allele-order --score ../mod_sets/${author}/${ss_name} 3 4 7 sum --out small_score_files/score.${low_author}.${chr}.${ver}.${method}

  zstd --rm small_score_files/score.${low_author}.${chr}.${ver}.${method}.profile

  rm temp_files/rsids.${i}
  rm temp_files/temp.${i}.bgen
  rm temp_files/geno.${i}.*

else

  echo already exists

fi

#Just logistics around controlling parallelism
echo $go_num >> temp_files/poss_go

The key and simple steps are: one, we extract the rsids included in the adjusted summary statistic set; two, we use bgenix to subset the original UKBB imputation file down to a (hopefully) much smaller file; three, we convert that subset into a bed/bim/fam fileset with PLINK2; four, we produce the actual scores with PLINK. There is one other important statement, namely the check for whether the score already exists (as a compressed profile in the small_score_files directory) before trying to make it anew.

6.1.1 What to Worry About

There are a few things that I did not think of originally but that should be considered in order to make accurate, and even remotely realistic, scores.

1 - Matching Alleles - As we carefully went over when creating clean summary statistics from those originally downloaded, the alleles in the summary statistics must match those in the genotype file you are working with. This holds when dealing with the original bgen files. But when we subset (to, say, only British individuals), an allele may be flipped if the major allele flips in the subset (importantly, this is true for PLINK but not for PLINK2). To prevent this behavior in PLINK we use the --keep-allele-order flag.
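As a quick sanity check, one can compare the effect allele in the summary statistics against the A1 allele PLINK actually recorded. This is only a sketch with toy two-column files and made-up rsids (the real .bim and summary statistic files have more columns, so the cut fields would differ):

```shell
# Toy files: rsid + effect allele from the summary statistics,
# and rsid + A1 allele as PLINK recorded it in the .bim.
printf 'rs1 A\nrs2 C\nrs3 G\n' > ss_alleles.txt
printf 'rs1 A\nrs2 T\nrs3 G\n' > bim_alleles.txt

# Join on rsid and count rows where the alleles disagree;
# anything above zero means a flip slipped through.
join ss_alleles.txt bim_alleles.txt | awk '$2 != $3' | wc -l

rm ss_alleles.txt bim_alleles.txt
```

Here rs2 disagrees, so the count is 1; on real data the count should be exactly zero before any scores are trusted.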

2 - Sum - The default PLINK behavior is to report an averaged score in its output, i.e. the score divided by the number of observed alleles. This becomes a problem if we are creating a larger score by running a separate PLINK routine on each chromosome and adding up the results: each per-chromosome score gets weighted by the number of SNPs on that chromosome, not just the beta values. To fix this we use the "sum" modifier, which skips the averaging entirely.
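To see why averaging breaks per-chromosome addition, here is a toy calculation with made-up numbers: chromosome A has three SNPs contributing 0.9 in total, chromosome B a single SNP contributing 0.5.

```shell
# Correct combined score: just the sum of the contributions.
awk 'BEGIN { print "sum:", 0.9 + 0.5 }'                     # sum: 1.4

# Default PLINK averaging divides each per-chromosome score by the number
# of observed alleles (2 * SNP count), so adding the two .profile outputs
# weights chromosome B's lone SNP three times as heavily as each SNP on A.
awk 'BEGIN { print "avg:", 0.9 / (2 * 3) + 0.5 / (2 * 1) }' # avg: 0.4
```

With the "sum" modifier both chromosomes contribute exactly their summed beta-weighted dosages, and adding the per-chromosome outputs recovers the single-file score.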

3 - Allele Frequency - The default behavior of the PLINK scoring routine is that when a genotype is missing, an amount proportional to the allele frequency is added instead. This can be pretty confusing, so please check out https://zzz.bwh.harvard.edu/plink/profile.shtml, which has a simple example. The problem here is exactly which allele frequency is used. If we use --keep-fam and --score in the same PLINK call, then the allele frequency is not derived from the kept subset. So instead we first subset into the bed/bim/fam files, and then do the scoring in a separate PLINK call.
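A sketch of why the frequency matters, with made-up numbers (beta = 0.2, effect allele frequency 0.25 in the full cohort vs 0.40 in the kept subset): a missing genotype is replaced by its expected dosage, 2 × frequency, so the imputed contribution shifts with the sample the frequency is computed on.

```shell
awk 'BEGIN {
  beta = 0.2
  # Mean imputation of a missing genotype: expected dosage (2 * freq) * beta
  print "full cohort:", 2 * 0.25 * beta   # 0.1
  print "kept subset:", 2 * 0.40 * beta   # 0.16
}'
```

A 0.06 difference per missing genotype is small, but summed over many missing calls and many SNPs it can noticeably move individuals' scores.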

4 - Round-Off Error - This is a much smaller issue, but in nearly any computational task involving thousands of multiplications between small values, there is bound to be round-off error. While I admit I have not done anything to correct for it, it could be a good idea in the future to multiply the beta values by a fixed constant to move everything further away from very small values. Although, I should clarify, while the scores themselves change slightly, I have not noticed any changes in the ordering of individuals.
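A toy illustration of the output-precision side of this (made-up numbers; PLINK's actual .profile precision differs): rounding each of 22 per-chromosome partial scores to four decimals before adding them shifts the total slightly, which is exactly what happens when summing 22 .profile files.

```shell
awk 'BEGIN {
  for (c = 1; c <= 22; c++) {
    part = 0.123456                      # pretend per-chromosome SCORESUM
    exact += part                        # full-precision running total
    rounded += sprintf("%.4f", part)     # total after per-file rounding
  }
  printf "exact: %.6f  rounded: %.6f\n", exact, rounded
}'
```

The two totals differ in the fourth decimal, a shift far too small to reorder individuals, which matches what I observed.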

6.1.2 Controlling and Assembling

All of this scoring, for each chromosome, author, etc., is controlled by the following script. The process is very similar to the one that controls the summary statistic adjustment, so I won't go into too much detail explaining it.


maxGo=18
rm temp_files/poss_go
cat ../qc/cv_files/train_eid.0.2.txt | cut -f1 > temp_files/brit_eid
cat ../qc/cv_files/test_eid.0.8.txt | cut -f1 >> temp_files/brit_eid

for (( i=1; i<=$maxGo; i++ )); do
  echo $i >> temp_files/poss_go
done

echo 1 > temp_files/counter

cat all_specs/author_specs | while read cauthor;do
  cat all_specs/method_specs | while read cmethod;do
    for cchr in {1..22};do

      ls ../mod_sets/${cauthor}/ | fgrep -w ${cmethod} | awk -v var="$cchr" -F. '$2 == var {print $0}' | while read cname;do

        counter_var=`cat temp_files/counter`
        echo $counter_var
        ./simple_score.sh $cname $cauthor $cmethod $cchr $counter_var &> logs/log.${counter_var}.log &
        echo $counter_var + 1 | bc > temp_files/counter

        sleep $(( ( RANDOM % 10 )  + 1 ))

        goOn=False
        while [ $goOn == "False" ]; do
          openSlots=`cat temp_files/poss_go | wc -l`
          sizedir=`du temp_files/ | cut -f1`
          if [ $openSlots -gt 0 ] && [ $sizedir -lt 50164096 ]; then
            echo NOW WE CAN GO
            goOn=True
          else
            echo MUST WAIT FOR ROOM TO GO
            sleep $(( ( RANDOM % 5 )  + 1 ))
          fi
        done

      done
    done
  done
done

The final script needed in this process does the assembling. It begins by getting the names of all the small score files, from which the method, version, and author are extracted. Then we simply line up the files that share every quality except the chromosome, read them in, and add them together. Note that the system() command must be used to zstd-decompress each file and then remove the uncompressed copy; there may be a faster way to do this directly within R. There very well may be a better approach altogether, since this is a lot of IO work, but it gets the job done. Again, note that PLINK rounds off all of the per-chromosome scores, leading to possible round-off error.

library(data.table)

all_files <- list.files("small_score_files/", pattern = "profile")
split_files <- strsplit(all_files, ".", fixed = T)
all_author <- unlist(lapply(split_files, function(x) x[2]))
all_chr <- unlist(lapply(split_files, function(x) x[3]))
all_ind <- unlist(lapply(split_files, function(x) x[4]))
all_method <- unlist(lapply(split_files, function(x) x[5]))

brit_eid <- read.table("temp_files/brit_eid", stringsAsFactors=F)
new_name <- unique(paste0(all_author, ".", all_ind, ".", all_method))
all_score <- matrix(0, nrow = nrow(brit_eid), ncol = length(new_name))
almost_new_name <- paste0(all_author, ".", all_ind, ".", all_method)
colnames(all_score) <- new_name

firstup <- function(x) {
  substr(x, 1, 1) <- toupper(substr(x, 1, 1))
  return(x)
}

all_upauthor <- firstup(all_author)
all_upauthor[all_upauthor == "Imsgc"] <- "IMSGC"

for(i in 1:length(all_files)){

      real_mod_set <- c(grep(paste0(tolower(all_author[i]), ".[[:digit:]][[:digit:]].", all_method[i], ".", all_ind[i], ".ss"), list.files(paste0("../mod_sets/", all_upauthor[i],"/")), value = T),
                         grep(paste0(tolower(all_author[i]), ".[[:digit:]].", all_method[i], ".", all_ind[i], ".ss"), list.files(paste0("../mod_sets/", all_upauthor[i], "/")), value = T))

      real_scored <- c(grep(paste0("score.", tolower(all_author[i]), ".[[:digit:]].", all_ind[i], ".", all_method[i], ".profile.zst"), all_files, value = T),
                        grep(paste0("score.", tolower(all_author[i]), ".[[:digit:]][[:digit:]].", all_ind[i], ".", all_method[i], ".profile.zst"), all_files, value = T))

      if(length(real_mod_set) == length(real_scored)){

          system(paste0("zstd -d small_score_files/", all_files[i]))
          sub_score <- as.data.frame(fread(paste0("small_score_files/", substr(all_files[i], 1, nchar(all_files[i])-4))))
          system(paste0("rm small_score_files/", substr(all_files[i], 1, nchar(all_files[i])-4)))
 
          all_score[,new_name == almost_new_name[i]] <- all_score[,new_name == almost_new_name[i]] + sub_score$SCORESUM 
    
      }
}

all_score <- all_score[,colSums(all_score) != 0]

next_file <- length(grep("RDS", list.files("final_scores", "all_score"))) + 1

write.table(all_score, paste0("final_scores/all_score.", next_file), row.names = F, col.names = T, quote = F, sep = ' ')
saveRDS(all_score, paste0("final_scores/all_score.", next_file, ".RDS"))
write.table(colnames(all_score), paste0("final_scores/done_names.", next_file), row.names = F, col.names = F, quote = F)
system(paste0("gzip final_scores/all_score.", next_file))

6.1.3 Verifying Everything Worked

As I have perhaps alluded to, scoring is actually not as straightforward as it seems. The steps taken to make it computationally easier may let inaccuracies infiltrate your final scores. Therefore, one should check whether the scores made with whatever nice and fancy approach are the same as scores made with the simplest, most fool-proof approach. Specifically, I mean creating one large PLINK fileset with all of the SNPs needed for the desired score, then creating the score in a single PLINK command. This process requires iterating through all 22 chromosomes to create small PLINK filesets, merging them all together at the end, and finally creating the actual score desired. The code looks like:

rm try_single_score/ss
rm try_single_score/rsids
rm try_single_score/geno_list

Author=IMSGC
author=imsgc

for chr in {1..22};do
  cat ../mod_sets/${Author}/${author}.${chr}.clump.1.ss >> try_single_score/ss
  cat ../mod_sets/${Author}/${author}.${chr}.clump.1.ss | cut -f3 > try_single_score/rsids
  bgenix -g ~/athena/ukbiobank/imputed/ukbb.${chr}.bgen -incl-rsids try_single_score/rsids > try_single_score/temp.bgen
  plink2_new --memory 12000 --threads 12 --bgen try_single_score/temp.bgen ref-first --sample ~/athena/ukbiobank/imputed/ukbb.${chr}.sample --keep-fam temp_files/brit_eid --make-bed --out try_single_score/geno.${chr}
  echo try_single_score/geno.${chr} >> try_single_score/geno_list
done

plink --merge-list try_single_score/geno_list --make-bed --out try_single_score/all
plink --bfile try_single_score/all --keep-allele-order --score try_single_score/ss 3 4 7 no-sum --out try_single_score/score.${author}.clump.1

After the score is created we can compare it to what we have already done. This simply requires reading both scores in and taking the correlation between the two. However, the correlation is not the only thing that matters. Many analyses rely on ranks, therefore I also checked to see if the ranks of both scores are the same. Alas they are not exactly the same, but they are very similar. Seeing as this difference is possibly (likely?) due to round-off error, which I cannot eliminate, and the difference is very likely negligible, I will call my scoring system a success. See the code below for the actual score checking.

author <- "imsgc"
Author <- "IMSGC"

all_score <- readRDS("final_scores/all_score.1.RDS")
template_score <- read.table("template.profile", stringsAsFactors=F, header=T)

fileind <- which(colnames(all_score) == paste0(author, ".1.clump"))
new_score <- read.table(paste0("try_single_score/score.", author, ".clump.1.profile"), stringsAsFactors=F, header=T)
new_score <- new_score[order(new_score[,1])[rank(template_score[,1])],]
cor(all_score[,fileind], new_score$SCORE)
cor(rank(new_score$SCORE, ties.method="average"), rank(all_score[,fileind], ties.method="average"))
mean_dif <- mean(abs(rank(new_score$SCORE, ties.method="average") - rank(all_score[,fileind], ties.method="average")))
max_dif <- max(abs(rank(new_score$SCORE, ties.method="average") - rank(all_score[,fileind], ties.method="average")))

6.1.4 Other Options

I also want to quickly note that this scoring implementation was not my first, or even third, attempt to score the files. The first attempt eventually used PLINK2, which allows you to score multiple columns of beta values simultaneously. While faster, I noticed some scoring errors, seemingly caused by the many zero values that would line up between the scoring files. Perhaps I was reading it wrong, but I decided it was not worth risking bad scores. Briefly, a second attempt used the bigsnpr package, which turned out to be too slow within R.