7 Score Analysis Set-Up

Now that we have the scores we are nearly ready for analysis. However, several more things still need to be done, namely setting up all of the phenotypes and covariates. As with creating the scores, this is an often underrated process that must be done carefully, or everything downstream fails.

7.1 Defining the Phenotypes

By phenotype I mean the status of whether an individual has the disease corresponding to the polygenic risk score. This is a tricky problem, since first of all each underlying GWAS of the polygenic risk scores may construct phenotypes differently. Adjusting our phenotypes to match each GWAS does not seem easy, or even possible, given the limited information within the UK Biobank. Nevertheless, the UK Biobank does hold a great deal of information. The sources that we will use to construct phenotypes are as follows:

  1. Self-Reported - Each individual was interviewed by a trained nurse and asked about any illnesses they may have. The nurse then placed any description by the individual into a pre-specified tree of diseases. This data can be suspect, as an individual may report self-diagnoses that were never confirmed by a doctor and do not really exist, but overall the reports are likely to line up with actual conditions for most individuals. Further description of the self-reported data type can be found here, http://biobank.ndph.ox.ac.uk/showcase/field.cgi?id=20002, and all possible self-reported codings can be found here, http://biobank.ndph.ox.ac.uk/showcase/coding.cgi?id=3, and here, http://biobank.ndph.ox.ac.uk/showcase/coding.cgi?id=6 (for cancer and non-cancer, respectively).

  2. ICD - The UK Biobank has pulled in the electronic health record of every individual. Specifically, when a doctor makes a diagnosis they record the ICD code in the computer. These should be highly accurate, although there is always the chance of a typo, or of a doctor recording a similar diagnosis that does not quite match what we want, making ICD codes rather fickle. These records have been parsed into hesin files (hospital episode statistics). In short, there is one file, hesin_diag, with ICD codes on every line, and another file, hesin, with dates of hospital episodes. One can look up the dates for a given ICD code by using the EID and instance index. We will go over this more later. More information on the hesin data can be found here, https://biobank.ndph.ox.ac.uk/showcase/showcase/docs/HospitalEpisodeStatistics.pdf. The coding for ICD9 codes can be found here, http://biobank.ndph.ox.ac.uk/showcase/coding.cgi?id=87. The coding for ICD10 codes can be found here, http://biobank.ndph.ox.ac.uk/showcase/coding.cgi?id=19. Note that ICD9 codes are older than ICD10.

  3. OPCS - Similar to ICD, OPCS codes are generally part of the electronic health record and are accessed from the hesin set of files. However, instead of recording diagnoses they record operations. While generally not helpful for our purposes, some publications have used specific OPCS codes to infer that an underlying phenotype was the cause. Following precedent I will do the same. More information on OPCS coding can be found here, http://biobank.ndph.ox.ac.uk/showcase/coding.cgi?id=240. There are also OPCS3 codes, but they are very, very sparse.

  4. Medications - This is the least orthodox way of determining phenotype. But I figure that if someone is taking a medication that is only used for one illness, then there is a very good chance the person has that illness. Unfortunately the UK Biobank does not have day-to-day records of medications, only inventories taken when individuals come in for assessments. By looking over CDC, respective professional society, and Mayo Clinic websites I determined the associated medications and converted them into UK Biobank codes. More information on medication coding can be found here, http://biobank.ndph.ox.ac.uk/showcase/coding.cgi?id=4. More information on how medication data was recorded can be found here, http://biobank.ndph.ox.ac.uk/showcase/field.cgi?id=20003.
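The hesin lookup described in point 2 can be sketched in miniature: take the EID and instance index attached to a diagnosis row in hesin_diag, then find the episode with that same EID and instance index in hesin to recover the date. The field names and values below are illustrative toy data, not the exact UK Biobank column headers.

```python
# Toy hesin_diag: one row per recorded ICD code (illustrative fields only)
hesin_diag = [
    {"eid": "100001", "ins_index": "0", "icd10": "I48"},
    {"eid": "100001", "ins_index": "1", "icd10": "J45"},
]
# Toy hesin: one row per hospital episode, carrying the dates
hesin = [
    {"eid": "100001", "ins_index": "0", "epistart": "2010-03-01"},
    {"eid": "100001", "ins_index": "1", "epistart": "2014-07-22"},
]

def date_of_code(code, diag_rows, episode_rows):
    """Return the episode date for the first diagnosis row matching `code`."""
    for d in diag_rows:
        if d["icd10"] == code:
            for e in episode_rows:
                if e["eid"] == d["eid"] and e["ins_index"] == d["ins_index"]:
                    return e["epistart"]
    return None

print(date_of_code("J45", hesin_diag, hesin))  # date of the J45 episode
```

The real hesin files hold one row per code and per episode respectively, so the EID plus instance index pair is what joins them.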

The table that actually holds the conversion from the many possible codes of each category into a 1/0 phenotype is as follows.

  pheno_defs <- read.table("../analyze_score/descript_defs/author_defs", stringsAsFactors=F, header=T)
  print(pheno_defs)
##            author                             name sex cancer noncancer
## 1         Bentham                              sle   A     NA      1381
## 2  Christophersen               atrial_fibrilation   A     NA      1471
## 3        Demenais                           asthma   A     NA      1111
## 4        Demontis                             adhd   A     NA      <NA>
## 5          Dubois                   celiac_disease   A     NA      1456
## 6         Gormley                         migraine   A     NA      1265
## 7           IMSGC                               ms   A     NA      1261
## 8             Jin                         vitiligo   A     NA      1661
## 9         Kottgen                             gout   A     NA      1466
## 10          Liu-1               ulcerative_colitis   A     NA      1463
## 11          Liu-2                   crohns_disease   A     NA      1462
## 12        Mahajan                  type_2_diabetes   A     NA      1223
## 13          Malik                           stroke   A     NA      1081
## 14    Michailidou                    breast_cancer   F   1002      <NA>
## 15         Namjou nonalcoholic_fatty_liver_disease   A     NA      <NA>
## 16         Nikpay          coronary_artery_disease   A     NA 1075|1076
## 17          Okada             rheumatoid_arthritis   A     NA      1464
## 18        Onengut                  type_1_diabetes   A     NA      1222
## 19         Phelan                   ovarian_cancer   F   1039      <NA>
## 20        Rheenen                              als   A     NA      <NA>
## 21     Schumacher                  prostate_cancer   M   1044      <NA>
## 22           Shah                    heart_failure   A     NA      1076
## 23          Sklar                  bipolar_disease   A     NA      1291
## 24           Tsoi                        psoriaris   A     NA      1453
## 25           Wray        major_depressive_disorder   A     NA      1286
## 26            Xie    membranous_glomerulonephritis   A     NA      <NA>
##                               icd9                                    icd10
## 1                              710                           M321|M328|M329
## 2                             4273                                      I48
## 3                              493                                  J45|J46
## 4                              314                                      F90
## 5                              579                                     <NA>
## 6                              346                                      G43
## 7                              340                                      G35
## 8                             7091                                      L80
## 9                              274                                      M10
## 10                             556                                      K51
## 11                             555                                      K50
## 12                            <NA>                                      E11
## 13 431|432|433|434|435|436|437|438 I60|I61|I62|I633|I64|I65|I66|I67|I68|I69
## 14                             174                                      C50
## 15                            5718                                     K760
## 16                     410|411|412                    I21|I22|I23|I241|I252
## 17                             714                                  M05|M06
## 18                            <NA>                                      E10
## 19                             183                                      C56
## 20                            3352                                     G122
## 21                             185                                      C61
## 22                             428                                      I50
## 23                             296                                      F31
## 24                       6960|6961                                      L40
## 25                            <NA>                                      F33
## 26                            5831                                     N002
##                       opcs
## 1                     <NA>
## 2                K622|K623
## 3                     <NA>
## 4                     <NA>
## 5                     <NA>
## 6                     <NA>
## 7                     <NA>
## 8                     <NA>
## 9                     <NA>
## 10                    <NA>
## 11                    <NA>
## 12                    <NA>
## 13                U543|Z35
## 14             B27|B28|B29
## 15                    <NA>
## 16 K40|K41|K45|K49|K50|K75
## 17                    U504
## 18                    <NA>
## 19                    <NA>
## 20                    X852
## 21                    <NA>
## 22                    <NA>
## 23                    <NA>
## 24                    <NA>
## 25                    <NA>
## 26                    <NA>
##                                                                                       meds
## 1                                                                                     <NA>
## 2                                                                               1140888482
## 3                                                                               1141168340
## 4                                                                                     <NA>
## 5                                                                                     <NA>
## 6             1141163670|1141167932|1141172728|1141185436|1141192666|1141150620|1141151284
## 7  1141188640|1141188642|1141150596|1141172620|1141165546|1141167618|1141189254|1141189256
## 8                                                                                     <NA>
## 9                                                                               1140875408
## 10                                                                              1141153242
## 11                                                                                    <NA>
## 12                                                                                    <NA>
## 13                                                                                    <NA>
## 14                                                                   1140923018|1141190734
## 15                                                                                    <NA>
## 16                                                                                    <NA>
## 17                                  1141145896|1140909702|1141188588|1141180070|1140871188
## 18                                                                                    <NA>
## 19                                                                                    <NA>
## 20                                                                              1141195974
## 21                                                        1141150594|1140870274|1140921100
## 22                                                                                    <NA>
## 23                                                                                    <NA>
## 24                                                                                    <NA>
## 25                                                                                    <NA>
## 26                                                                                    <NA>

The pipe symbol separates alternative codes: if any of the codes can be found in the respective data type, the phenotype will be called positive.
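In miniature, that "split on pipe, match on any" rule looks like the following sketch. The codes are taken from the Nikpay (coronary_artery_disease) row of the table above; the function name is my own.

```python
# Split each pipe-delimited cell into a list of codes, one list per data type
defs = {
    "noncancer": "1075|1076".split("|"),
    "icd10": "I21|I22|I23|I241|I252".split("|"),
}

def is_positive(observed_codes, definitions):
    """True if any observed code appears in any data type's definition list."""
    return any(code in codes
               for codes in definitions.values()
               for code in observed_codes)

print(is_positive(["1076"], defs))  # True: self-report code 1076 matches
print(is_positive(["E11"], defs))   # False: no definition matches
```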

7.2 Making the Phenotype

Now that we know which codes go to which phenotype, we simply need to go through the respective data files and check off when we find one for each EID. This is easier said than done, so I have broken things down into parts. The first part is preparing the data, specifically breaking the large phenotype file downloaded from the UK Biobank into parts that are easy to inspect and read in. Along with these files, the hesin files are also necessary, although they do not need any breaking down. I would display these files but I do not think I am allowed to. Anyway, the breaking-down script looks like the following (note that if you are following along your indices are likely different).


#self diag from 472 to 573
#date attend assessment center from 41 to 43
#medication from 574 to 717

zcat ~/athena/ukbiobank/phenotypes/ukb26867.csv.gz | cut -d',' -f1 > eid.csv
zcat ~/athena/ukbiobank/phenotypes/ukb26867.csv.gz | cut -d',' -f472-573 > self_report_diag.csv
zcat ~/athena/ukbiobank/phenotypes/ukb26867.csv.gz | cut -d',' -f454-471 > self_report_cancer.csv
zcat ~/athena/ukbiobank/phenotypes/ukb26867.csv.gz | cut -d',' -f41-43 > date_assessed.csv
zcat ~/athena/ukbiobank/phenotypes/ukb26867.csv.gz | cut -d',' -f574-717 > medications.csv

Now that everything is prepared data-wise, we can get on with inspecting and calling phenotypes. I have tried this in many different ways, and the fastest option seems to be breaking the total of roughly 500,000 people into groups and calling phenotypes for each group in parallel. This parallel component is controlled by the following script, which is straightforward.

cat parallel_splits | while read line;do
  start=`echo $line | cut -d' ' -f1`
  end=`echo $line | cut -d' ' -f2`
  echo $start
  python better_pheno.py $start $end &
  sleep 80
done
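The parallel_splits file itself is just space-separated start/end index pairs, one group per line, matching the two `cut` calls in the loop above. How it was originally generated is not shown, but a hypothetical sketch could be:

```python
# Partition the individuals into equal-sized index groups and write one
# "start end" pair per line. The group size of 50,000 is an assumption.
n_individuals = 502535
group_size = 50000

with open("parallel_splits", "w") as f:
    for start in range(0, n_individuals, group_size):
        end = min(start + group_size, n_individuals)
        f.write(f"{start} {end}\n")
```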

The key script here is better_pheno.py. It has a few different parts. First it reads in all of the prepared data. Next it subsets down to the group of individuals specified by the command-line arguments passed from the controlling script, and sorts each file so the EIDs line up. This sorting is important as it allows for quick indexing of EIDs (just iterate to the next unique EID rather than searching for the index). Lastly I check each type of phenotype (self-report, ICD, etc.) for the coding that was specified. For the ICD and OPCS codes this process required me to check not just the exact code but also a more general code, due to the hierarchical nature of their coding. For example, if I am looking for G35 and an individual has G351, I will still count that as a match. Further details on how this script works can be found in the comments of the script itself, which begins below.
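That hierarchical matching can be distilled to a single check: a recorded code counts as a match if either the full code or its three-character parent appears in the definition list. This is a simplified sketch of the logic carried out in better_pheno.py; the function name is mine.

```python
def icd_matches(recorded, definitions):
    """True if the recorded ICD code, or its 3-character parent, is defined."""
    return recorded in definitions or recorded[:3] in definitions

print(icd_matches("G351", ["G35"]))  # True: G351 falls under G35
print(icd_matches("G35", ["G35"]))   # True: exact match
print(icd_matches("G360", ["G35"]))  # False: different parent code
```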

import numpy as np
import datetime
import time
import pickle
import pdb
import gzip
import os
import sys
import re

#Declare the author that corresponds to the phenotype you want
#Also the indices of the group we want phenotypes for
author = "Christophersen"
start_ind = int(sys.argv[1])
end_ind = int(sys.argv[2])

#Function to read in the data
def normRead(fileName, withHeader = True, delim = '\t', removeQuote = False):
    with open(fileName,"r",encoding = "latin-1") as f:
        totalData=[]
        for line in f.read().splitlines():
            if removeQuote:
                totalData.append(line.replace('"', '').strip().split(delim))
            else:
                totalData.append(line.split(delim))
    if withHeader:
        header=totalData[0]
        del totalData[0]
    else:
        header = None
    totalData=np.array(totalData)
    return(totalData,header)

#Read in the files that were split from the main UKBB phenotype file
date_ass,ass_header = normRead("date_assessed.csv", True, ",", True)
meds, meds_header = normRead("medications.csv", True, ",", True)
self_rep, self_rep_header = normRead("self_report_diag.csv", True, ",", True)
cancer_rep, cancer_rep_header = normRead("self_report_cancer.csv", True, ",", True)
big_eid, eid_head = normRead("eid.csv", True, ",", True)

#sort the files just read in so they are in EID order
date_ass = date_ass[big_eid[:,0].argsort(),:]
meds = meds[big_eid[:,0].argsort(),:]
self_rep = self_rep[big_eid[:,0].argsort(),:]
cancer_rep = cancer_rep[big_eid[:,0].argsort(),:]
big_eid = big_eid[big_eid[:,0].argsort(),:]

#Get the phase, or the assessment type for cancer, self-report and medication report
#Phase meaning what index, or what time the person came in to be assessed (first, second or third assessment)
cancer_phase = [x.split("-")[1].split(".")[0] for x in cancer_rep_header]
self_phase = [x.split("-")[1].split(".")[0] for x in self_rep_header]
meds_phase = [x.split("-")[1].split(".")[0] for x in meds_header]

#Read in the hesin type of data and again sort by EID
diag, head_diag = normRead("/home/kulmsc/athena/ukbiobank/hesin/hesin_diag.txt")
oper, head_oper = normRead("/home/kulmsc/athena/ukbiobank/hesin/hesin_oper.txt")
hesin, head_hesin = normRead("/home/kulmsc/athena/ukbiobank/hesin/hesin.txt")
diag = diag[diag[:,0].argsort(),:]
oper = oper[oper[:,0].argsort(),:]
hesin = hesin[hesin[:,0].argsort(),:]
diag_eid = np.unique(diag[:,0])
oper_eid = np.unique(oper[:,0])
diag_max = diag.shape[0] - 1
oper_max = oper.shape[0] - 1
diag_eid_list = diag[:,0].tolist()
oper_eid_list = oper[:,0].tolist()


#Read in the disease definitions and split the types with multiple conditions, then make into a dict
defs, head_defs = normRead("../descript_defs/author_defs")
def_ind = defs[:,0].tolist().index(author)
split_defs = list()
for type_def in defs[def_ind, 3:9]:
    split_defs.append(type_def.split("|"))

#Produce a dictionary so we can easily pull out the codes for each data type
use_defs = dict(zip(head_defs[3:9], split_defs))

#This is for timing, which I do not do anymore
#cancer_sr_time = 0
#noncancer_sr_time = 0
#hesin_setup_time = 0
#icd9_time = 0
#icd10_time = 0
#hesin_oper_time = 0
#med_time = 0

#Find the splits of the input data, so RAM will be less
#Simply find the eid corresponding to the start and stop index
#skipping EIDs that do not appear in the given data type
def subset_big_data(eid_list, big_data):

    for ind in range(start_ind, end_ind):
        if big_eid[ind][0] in eid_list:
            start_eid = eid_list.index(big_eid[ind][0])
            break

    for ind in range(end_ind, start_ind,-1):
        if big_eid[ind][0] in eid_list:
            end_eid = np.where(big_data[:,0] == big_eid[ind][0])[0][-1]
            break

    return(big_data[start_eid:end_eid,:])
    

#I'm still bad at python indexing so I have to include this line to fix a single edge case
if end_ind > big_eid.shape[0]:
    end_ind = big_eid.shape[0] - 1

#Apply the subset_big_data, and convert some things to lists for easier indexing
diag = subset_big_data(diag_eid_list, diag)
oper = subset_big_data(oper_eid_list, oper)
hesin = subset_big_data(hesin[:,0].tolist(), hesin)
diag_eid_list = diag[:,0].tolist()
oper_eid_list = oper[:,0].tolist()

#Establish indices of diag and oper for easy indexing (we do not need to look up and make an index
#every time, only add the extent of the current eid to this index)
start_eid_diag = 0
start_eid_oper = 0

#Subset the things that do not require the subset_big_data function
meds = meds[start_ind:end_ind,:]
self_rep = self_rep[start_ind:end_ind,:]
date_ass = date_ass[start_ind:end_ind,:]
cancer_rep = cancer_rep[start_ind:end_ind,:]
big_eid = big_eid[start_ind:end_ind,:]


#Prepare the data objects that will hold diagnosis results
df_occur = np.zeros((big_eid.shape[0], 6))
df_date = np.tile("__________", (big_eid.shape[0], 6))

#Iterate through the indices of each unique EID
#Note for lessons learned in speed-up:
# - do not subset objects (I think copies are made in RAM which really slows things down)
# - always try to just iterate to the next index rather than searching for some keyword
for ind in range(0, (end_ind - start_ind)):
    if ind % 10000 == 0:
        print(ind) #So I know how fast things are running

    #Set the value of the actual eid
    curr_eid = big_eid[ind][0]

    #Now we go through and fill in each column of df_occur and df_date, one data type at a time

    #CANCER SELF REPORT (this is the name of the column in df_occur and df_date I am filling in)
    if use_defs["cancer"][0] != "NA": #check to see if there is a cancer definition
        #t1 = time.time()
        if any(np.isin(use_defs["cancer"], cancer_rep[ind,:])): #if the cancer definition is in the cancer data
            locs = np.where(np.isin(cancer_rep[ind,:], use_defs["cancer"]))[0] #get assessment index of the cancer occurence
            try_date = date_ass[ind, int(cancer_phase[locs[0]])] #record the date based on index of ind. assessment
            if try_date == "": #sometimes the date is empty so we fall back to the first assessment date
                try_date = date_ass[ind,0]
            df_date[ind,0] = try_date
            df_occur[ind,0] = 1

        #t2 = time.time()
        #cancer_sr_time += t2-t1

    #NON-CANCER SELF REPORT #this is the same process as with the cancer data, but now with non-cancer data
    if use_defs["noncancer"][0] != "NA":
        #t22 = time.time()
        if any(np.isin(use_defs["noncancer"], self_rep[ind,:])):
            locs = np.where(np.isin(self_rep[ind,:], use_defs["noncancer"]))[0]
            try_date = date_ass[ind, int(self_phase[locs[0]])]
            if try_date == "":
                try_date = date_ass[ind, 0]
            df_date[ind,1] = try_date
            df_occur[ind,1] = 1

        #t3 = time.time()
        #noncancer_sr_time += t3-t22


    #HESIN DIAG
    if curr_eid in diag_eid_list: #check if the eid has ever had any ICD codes
        #t4 = time.time()
        #set up the hesin indexing, since multiple rows can have the same eid we use this
        #use this technique to get the index of the next unique EID
        if curr_eid != diag_eid_list[-1]:
            next_eid = np.unique(diag[start_eid_diag:(start_eid_diag+7000),0])[1]
            ext_eid = diag_eid_list.index(next_eid)
        else:
            ext_eid = diag_max

        #t5 = time.time()
        #hesin_setup_time += t5-t4

        #ICD - 9
        #as was explained either the full or slightly shorter ICD code may match what we want, so we create both
        use_short_icd9_diag = [x[0:3] for x in diag[start_eid_diag:ext_eid,4]]
        #Then we check to see if either the short or full ICD codes (from the hesin) match any of the definitions
        if any(np.isin(use_defs["icd9"], use_short_icd9_diag)):
            #If there is a match we get all of the locations in the hesin of the match
            short_icd9_locs = np.where(np.isin(use_short_icd9_diag, use_defs["icd9"]))[0][0]
        else:
            short_icd9_locs = 10000
        if any(np.isin(use_defs["icd9"], diag[start_eid_diag:ext_eid,4])):
            long_icd9_locs = np.where(np.isin(diag[start_eid_diag:ext_eid,4], use_defs["icd9"]))[0][0]
        else:
            long_icd9_locs = 10000
        #There cannot possibly be a match greater than 10000, so I set it to the impossible value
        if long_icd9_locs != 10000 or short_icd9_locs != 10000:
            #Find the minimum matching location as it is the earliest time
            icd9_locs = min((long_icd9_locs, short_icd9_locs))
            #Then we get the instance or the numbered time the EID went to the hospital
            icd9_ins_index = diag[start_eid_diag+icd9_locs,1]

            #We carry the ins_index and EID to the larger hesin array to get the date
            #Sometimes the most accurate form of the date is empty so we go to the next best
            raw_date = hesin[np.logical_and(hesin[:,0] == curr_eid, hesin[:,1] == icd9_ins_index),5][0]
            if raw_date == "":
                raw_date = hesin[np.logical_and(hesin[:,0] == curr_eid, hesin[:,1] == icd9_ins_index),21][0]
                if raw_date == "":
                    raw_date = hesin[np.logical_and(hesin[:,0] == curr_eid, hesin[:,1] == icd9_ins_index),4][0]

            df_occur[ind,2] = 1
            df_date[ind,2] = raw_date
        
        #t6 = time.time()
        #icd9_time += t6-t5

        #The process for ICD10 is nearly exactly the same as for ICD9
        #ICD - 10
        use_short_icd10_diag = [x[0:3] for x in diag[start_eid_diag:ext_eid,6]]
        if any(np.isin(use_defs["icd10"], use_short_icd10_diag)):
            short_icd10_locs = np.where(np.isin(use_short_icd10_diag, use_defs["icd10"]))[0][0]
        else:
            short_icd10_locs = 10000
        if any(np.isin(use_defs["icd10"], diag[start_eid_diag:ext_eid,6])):
            long_icd10_locs = np.where(np.isin(diag[start_eid_diag:ext_eid,6], use_defs["icd10"]))[0][0]
        else:
            long_icd10_locs = 10000
        if long_icd10_locs != 10000 or short_icd10_locs != 10000:
            icd10_locs = min((long_icd10_locs, short_icd10_locs))
            icd10_ins_index = diag[start_eid_diag+icd10_locs,1]

            raw_date = hesin[np.logical_and(hesin[:,0] == curr_eid, hesin[:,1] == icd10_ins_index),5][0]
            if raw_date == "":
                raw_date = hesin[np.logical_and(hesin[:,0] == curr_eid, hesin[:,1] == icd10_ins_index),21][0]
                if raw_date == "":
                    raw_date = hesin[np.logical_and(hesin[:,0] == curr_eid, hesin[:,1] == icd10_ins_index),4][0]

            df_occur[ind,3] = 1
            df_date[ind,3] = raw_date
        start_eid_diag = ext_eid

        #t7 = time.time()
        #icd10_time += t7-t6


    #The process for HESIN OPER is nearly exactly the same as for HESIN DIAG (which covers ICD9 and ICD10)
    #HESIN OPER
    if curr_eid in oper_eid_list and use_defs["opcs"][0] != "NA":
        #t8 = time.time()
        if curr_eid != oper_eid_list[-1]: #if it is not the last in the list
            next_eid = np.unique(oper[start_eid_oper:(start_eid_oper+7000),0])[1]
            ext_eid = oper_eid_list.index(next_eid)
        else:
            ext_eid = oper_max

        use_short_oper = [x[0:3] for x in oper[start_eid_oper:ext_eid,7]]
        if any(np.isin(use_defs["opcs"], use_short_oper)):
            short_oper_locs = np.where(np.isin(use_short_oper, use_defs["opcs"]))[0][0]
        else:
            short_oper_locs = 10000
        if any(np.isin(use_defs["opcs"], oper[start_eid_oper:ext_eid,7])):
            long_oper_locs = np.where(np.isin(oper[start_eid_oper:ext_eid,7], use_defs["opcs"]))[0][0]
        else:
            long_oper_locs = 10000
        if long_oper_locs != 10000 or short_oper_locs != 10000:
            oper_locs = min((long_oper_locs, short_oper_locs))
            oper_ins_index = oper[start_eid_oper+oper_locs,1]

            raw_date = hesin[np.logical_and(hesin[:,0] == curr_eid, hesin[:,1] == oper_ins_index),5][0]
            if raw_date == "":
                raw_date = hesin[np.logical_and(hesin[:,0] == curr_eid, hesin[:,1] == oper_ins_index),21][0]
                if raw_date == "":
                    raw_date = hesin[np.logical_and(hesin[:,0] == curr_eid, hesin[:,1] == oper_ins_index),4][0]

            df_occur[ind,4] = 1
            df_date[ind,4] = raw_date
        start_eid_oper = ext_eid

        #t9 = time.time()
        #hesin_oper_time += t9-t8

    #The medication process is nearly the same as for cancer and non-cancer
    #MEDICATION
    if use_defs["meds"][0] != "NA":
        #t10 = time.time()
        if any(np.isin(use_defs["meds"], meds[ind,:])):
            locs = np.where(np.isin(meds[ind,:], use_defs["meds"]))[0]
            df_date[ind,5] = date_ass[ind, int(meds_phase[locs[0]])]
            df_occur[ind,5] = 1
        #t11 = time.time()
        #med_time += t11 - t10



#save the output phenos
np.savetxt("raw_output/diag.coding." + author.lower() + "." + str(start_ind) + ".txt.gz", df_occur, fmt='%i')
np.savetxt("raw_output/diag.time." + author.lower() + "." + str(start_ind) + ".txt.gz", df_date, fmt='%s')



print("end")

The final step in the process is combining all of the small subset phenotype files into one large phenotype file per phenotype. The mechanics of this script are very simple: all that needs to be done is iterating over the subsets and concatenating them together.

ls raw_output/ | cut -f3 -d'.' | sort | uniq | while read author;do
  rm -f pheno_defs/diag.${author}.txt.gz pheno_defs/diag.${author}.txt
  rm -f pheno_defs/time.${author}.txt.gz pheno_defs/time.${author}.txt
  ls raw_output/diag.coding.${author}* | cut -f2 -d'/' | cut -f4 -d'.' | sort -n | while read num;do
    zcat raw_output/diag.coding.${author}.${num}.txt.gz >> pheno_defs/diag.${author}.txt
    zcat raw_output/diag.time.${author}.${num}.txt.gz >> pheno_defs/time.${author}.txt
  done
  gzip pheno_defs/diag.${author}.txt
  gzip pheno_defs/time.${author}.txt
done

The final output is two files for each phenotype. Both files contain six columns, one for each definition type, and 502,535 rows, one for each individual. One file contains 0's and 1's indicating whether the phenotype is absent or present. The other file contains a series of dashes wherever the corresponding entry in the first file is 0, and the date the phenotype started wherever the corresponding entry is 1.
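Given that layout, collapsing the six definition columns into a single case/control phenotype follows the "any code counts" rule from earlier: an individual is a case if any data type flagged them. A hedged sketch, using a toy matrix in place of one of the diag files produced above (which could be read with np.loadtxt):

```python
import numpy as np

# Toy occurrence matrix: rows are individuals, columns are the six definition
# types (cancer SR, non-cancer SR, ICD9, ICD10, OPCS, meds)
occur = np.array([
    [0, 1, 0, 0, 0, 0],   # case via self-report only
    [0, 0, 0, 0, 0, 0],   # control
    [0, 0, 1, 1, 0, 0],   # case via ICD9 and ICD10
])

# An individual is a case if any of the six columns is 1
pheno = occur.max(axis=1)
print(pheno)  # [1 0 1]
```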

7.3 Other Scores

To evaluate the prior methods against more curated scores I will download scoring summary statistics and create scores here. The problem with this aim is that scoring summary statistics are often hard to find and come in various formats that are difficult to standardize. For example, scoring summary statistics may be located only within figures, may not contain rsIDs or a genome reference, may not label the type of effect, or may be saved in a difficult format. Luckily a great resource has recently come into existence called the PGS Catalog, https://www.pgscatalog.org/browse/all/. To ensure that the other scores I create are all reasonable, I will only use scores within the PGS Catalog. While there are other scores that I probably could access, I think this is a good compromise to keep things accurate while still carrying out the mission of checking other scores (and just maybe this will motivate people to upload their scoring summary statistics to the PGS Catalog with all of the important information necessary to recreate them).

To get the scoring summary statistics I simply searched for all of the phenotypes that I am researching within the PGS Catalog. I checked all available scores to ensure they were not derived from the UK Biobank. I then saved the names of all the scores that met both of these requirements. Iterating through the list, each score was downloaded with a call to wget and the address http://ftp.ebi.ac.uk/pub/databases/spot/pgs/scores/{SCORING_NAME}/ScoringFiles/{SCORING_NAME}.txt.gz. Each of the summary statistics then had to be checked, just as the summary statistics from the GWAS Catalog were. The process is rather similar to before: making sure that the rsIDs in the score exist within the UK Biobank, removing ambiguous SNPs, and flipping SNPs where necessary. The full script looks like:
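The download loop itself can be sketched as follows. The score IDs are hypothetical examples (the real list was assembled by hand from the PGS Catalog search), and the sketch only prints the wget calls rather than running them, so it stays inert.

```python
import shlex

# Hypothetical example IDs; the real list came from the PGS Catalog search
score_names = ["PGS000011", "PGS000013"]

def score_url(name):
    """Fill in the download address template given in the text."""
    return ("http://ftp.ebi.ac.uk/pub/databases/spot/pgs/scores/"
            f"{name}/ScoringFiles/{name}.txt.gz")

# Print one wget command per score instead of executing it
for name in score_names:
    print("wget " + shlex.quote(score_url(name)))
```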

library(vroom)
library(stringr)
library(bigsnpr)

options(warn=2)

#Read in the UKBB summary statistics
impute <- read.table("../../raw_ss/common_files/impute_rsids", header = F, stringsAsFactors = F)
colnames(impute) <- c("LOC", "SNP", "POS", "A1", "A2", "MAF", "AX", "INFO")

#Make sure the UKBB summary statistics have alleles that match ACGT and are not ambiguous
impute <- impute[nchar(impute$A1) == 1 & nchar(impute$A2) == 1,]
impute <- impute[impute$A1 %in% c("A", "C", "G", "T") & impute$A2 %in% c("A", "C", "G", "T"),]
impute <- impute[!((impute$A1 == "A" & impute$A2 == "T") |
             (impute$A1 == "T" & impute$A2 == "A") |
             (impute$A1 == "G" & impute$A2 == "C") |
             (impute$A1 == "C" & impute$A2 == "G")),]

#Read in other UKBB data that comes with chromosomes so I can give a chromosome to each SNP
impute$CHR <- "X"
for(i in 1:22){
  temp <- readRDS(paste0("~/athena/ukbiobank/qc/imputed/chr", i, ".RDS"))
  impute$CHR[impute$SNP %in% temp[,2]] <- i
}
impute <- impute[impute$CHR %in% 1:22,]
impute$SUPERPOS <- paste0(impute$CHR, "_", impute$POS)

#Now iterating through each of the score summary statistics
###
for(filename in list.files("raw_other_ss/")){
  #read the summary statistic in
  ss <- read.table(paste0("raw_other_ss/", filename), stringsAsFactors=F, header=T, sep='\t')

  #require columns for both the effect and reference allele; if only one exists refining is not possible
  if("effect_allele" %in% colnames(ss) & "reference_allele" %in% colnames(ss)){

    #limit SNPs to those with ACGT
    ss$effect_allele <- toupper(ss$effect_allele)
    ss$reference_allele <- toupper(ss$reference_allele)
    ss <- ss[nchar(ss$effect_allele) == 1 & nchar(ss$reference_allele) == 1,]
    ss <- ss[ss$effect_allele %in% c("A", "C", "G", "T") & ss$reference_allele %in% c("A", "C", "G", "T"),]

    #remove ambiguous SNPs
    ss <- ss[!((ss$effect_allele == "A" & ss$reference_allele == "T") |
               (ss$effect_allele == "T" & ss$reference_allele == "A") |
               (ss$effect_allele == "G" & ss$reference_allele == "C") |
               (ss$effect_allele == "C" & ss$reference_allele == "G")),]

    #if rsID in the ss use it to match and sort to the UKBB
    #if not then use the chromosome and position (I already checked that all scores' ref genomes match UKBB)
    if("rsID" %in% colnames(ss)){
      ss <- ss[!is.na(ss$rsID),]
      ss$rsID <- tolower(ss$rsID)
      ss <- ss[ss$rsID %in% impute$SNP,]
      sub_impute <- impute[impute$SNP %in% ss$rsID,]
      ss <- ss[order(ss$rsID)[rank(sub_impute$SNP)],]
    } else {
      ss$SUPERPOS <- paste0(ss$chr_name, "_", ss$chr_position)
      ss <- ss[ss$SUPERPOS %in% impute$SUPERPOS,]
      sub_impute <- impute[impute$SUPERPOS %in% ss$SUPERPOS,]
      ss <- ss[order(ss$SUPERPOS)[rank(sub_impute$SUPERPOS)],]
      ss$rsID <- sub_impute$SNP    
    }

    #Change the column names and then use the snp_match function from bigsnpr to flip alleles
    colnames(sub_impute) <- c("loc", "rsid", "pos", "a0", "a1", "maf", "ax", "info", "chr", "superpos")
    colnames(ss)[colnames(ss) == "effect_allele"] <- "a0"
    colnames(ss)[colnames(ss) == "reference_allele"] <- "a1"
    colnames(ss)[colnames(ss) == "rsID"] <- "rsid"
    colnames(ss)[colnames(ss) == "effect_weight"] <- "beta"
    ss$pos <- sub_impute$pos
    ss$chr <- sub_impute$chr
    ss <- snp_match(ss, sub_impute)

    #write out the refined summary statistic
    out_ss <- data.frame(chr = ss$chr, rsid = ss$rsid.ss, A1 = ss$a0, beta = ss$beta)
    new_name <- strsplit(filename, ".", fixed = T)[[1]][1]
    write.table(out_ss, paste0("refine_other_ss/", new_name, ".new.ss"), col.names = T, row.names = F, sep = '\t', quote=F)
  }
}

A key limitation within this process is the requirement that both the effect and reference allele be present. Approximately 30% of scores did not have both and therefore could not be used in scoring. While I considered relaxing this requirement, making sure the alleles are flipped correctly (which requires both alleles) seemed too important a step to throw out. In the end 24 summary statistics were refined. For the record their names are:

score_names <- read.table("../analyze_score/other_scores/score_names")
print(score_names)
##                 V1               V2               V3               V4
## 1 PGS000007.new.ss PGS000014.new.ss PGS000020.new.ss PGS000039.new.ss
## 2 PGS000011.new.ss PGS000015.new.ss PGS000036.new.ss PGS000044.new.ss
## 3 PGS000013.new.ss PGS000016.new.ss PGS000037.new.ss PGS000045.new.ss
##                 V5               V6               V7               V8
## 1 PGS000046.new.ss PGS000058.new.ss PGS000082.new.ss PGS000195.new.ss
## 2 PGS000047.new.ss PGS000067.new.ss PGS000084.new.ss PGS000196.new.ss
## 3 PGS000052.new.ss PGS000072.new.ss PGS000194.new.ss PGS000199.new.ss

With the summary statistics made I continued on with scoring, following nearly the same process as in the previous scoring section. In short, I iterated through each score and chromosome, generating a partial score, and at the end summed together all of the chromosomes to generate one score per summary statistic. To run in parallel a control script is necessary, which looks very similar to the control script for adjusting summary statistics.


maxGo=3
rm -f temp_files/poss_go
cat ~/athena/doc_score/qc/cv_files/train_eid.0.2.txt | cut -f1 > temp_files/brit_eid
cat ~/athena/doc_score/qc/cv_files/test_eid.0.8.txt | cut -f1 >> temp_files/brit_eid

for (( i=1; i<=$maxGo; i++ )); do
  echo $i >> temp_files/poss_go
done

echo 1 > temp_files/counter

ls refine_other_ss/ | while read scorename;do
    for cchr in {1..22};do

        counter_var=`cat temp_files/counter`
        echo $counter_var
        ./simple_score.sh $scorename $cchr $counter_var &> logs/log.${counter_var}.log &
        echo $counter_var + 1 | bc > temp_files/counter

        sleep $(( ( RANDOM % 10 )  + 1 ))

        goOn=False
        while [ $goOn == "False" ]; do
          openSlots=`cat temp_files/poss_go | wc -l`
          sizedir=`du temp_files/ | cut -f1`
          #only launch when a slot is free and temp_files is not too large
          if [ $openSlots -gt 0 ] && [ $sizedir -lt 20164096 ]; then
            echo NOW WE CAN GO
            goOn=True
          else
            echo MUST WAIT FOR ROOM TO GO
            sleep $(( ( RANDOM % 5 )  + 1 ))
          fi
        done

  done
done

The key line within this control script is the one that launches simple_score.sh. This sub-script extracts the necessary genotypic information from the larger UKBB imputed file, and then uses PLINK to create the score.


ss_name=$1
chr=$2
i=$3

echo $ss_name $chr $i

#Just logistics around controlling parallelism
go_num=`head -1 temp_files/poss_go`
grep -v -w $go_num temp_files/poss_go > temp_files/temp_poss
mv temp_files/temp_poss temp_files/poss_go

if [ ! -e small_score_files/score.${ss_name}.${chr}.profile.zst ];then

  len=`cat refine_other_ss/${ss_name} | awk -v var="$chr" '$1 == var {print $0}' | wc -l`
  if [ $len != 0 ];then
    cat refine_other_ss/${ss_name} | awk -v var="$chr" '$1 == var {print $0}' > temp_files/ss.${i}

    cat temp_files/ss.${i} | cut -f2 > temp_files/rsids.${i}

    bgenix -g ~/athena/ukbiobank/imputed/ukbb.${chr}.bgen -incl-rsids temp_files/rsids.${i} > temp_files/temp.${i}.bgen

    plink2_new --memory 12000 --threads 12 --bgen temp_files/temp.${i}.bgen ref-first --sample ~/athena/ukbiobank/imputed/ukbb.${chr}.sample --keep-fam temp_files/brit_eid --make-bed --out temp_files/geno.${i}

    plink --memory 12000 --threads 12 --bfile temp_files/geno.${i} --keep-allele-order --score temp_files/ss.${i} 2 3 4 sum --out small_score_files/score.${ss_name}.${chr}

    zstd --rm small_score_files/score.${ss_name}.${chr}.profile
  fi
fi

#remove temporary files (-f in case the score already existed and they were never made)
rm -f temp_files/ss.${i}
rm -f temp_files/rsids.${i}
rm -f temp_files/temp.${i}.bgen
rm -f temp_files/geno.${i}.*


#Just logistics around controlling parallelism
echo $go_num >> temp_files/poss_go

Finally, all of the per-chromosome scores are added together.

library(data.table)

all_files <- list.files("small_score_files/", pattern = "profile")
split_files <- strsplit(all_files, ".", fixed = T)
all_author <- unlist(lapply(split_files, function(x) x[2]))
all_chr <- unlist(lapply(split_files, function(x) x[5]))

brit_eid <- read.table("temp_files/brit_eid", stringsAsFactors=F)
new_name <- unique(all_author)
all_score <- matrix(0, nrow = nrow(brit_eid), ncol = length(new_name))
colnames(all_score) <- new_name


for(i in 1:length(all_files)){
  system(paste0("zstd -d small_score_files/", all_files[i]))
  sub_score <- as.data.frame(fread(paste0("small_score_files/", substr(all_files[i], 1, nchar(all_files[i])-4))))
  system(paste0("rm small_score_files/", substr(all_files[i], 1, nchar(all_files[i])-4)))
  all_score[,new_name == all_author[i]] <- all_score[,new_name == all_author[i]] + sub_score$SCORESUM
}

all_score <- all_score[,colSums(all_score) != 0]

next_file <- length(grep("RDS", list.files("final_scores", "all_score"))) + 1

write.table(all_score, paste0("final_scores/all_score.", next_file), row.names = F, col.names = T, quote = F, sep = ' ')
saveRDS(all_score, paste0("final_scores/all_score.", next_file, ".RDS"))
write.table(colnames(all_score), paste0("final_scores/done_names.", next_file), row.names = F, col.names = F, quote = F)
system(paste0("gzip final_scores/all_score.", next_file))

If this process seemed a little too fast I encourage you to look at the 5th chapter, which goes over many scoring processes and caveats. The end product is 24 polygenic risk scores that may be used to supplement future analyses of the phenotypes already discussed.
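The aggregation step above is just a grouped sum keyed on the score name embedded in each file name; a stripped-down sketch of the same idea (file names and SCORESUM values are made up):

```python
from collections import defaultdict

# made-up per-chromosome SCORESUM columns for two individuals,
# keyed by profile file name in the score.{name}.{chr}.profile layout
chrom_scores = {
    "score.PGS000007.new.ss.1.profile": [1.0, 2.0],
    "score.PGS000007.new.ss.2.profile": [0.5, 0.5],
    "score.PGS000011.new.ss.1.profile": [3.0, 1.0],
}

# the score name is the second dot-separated field of the file name
totals = defaultdict(lambda: [0.0] * 2)
for fname, sums in chrom_scores.items():
    key = fname.split(".")[1]
    totals[key] = [t + s for t, s in zip(totals[key], sums)]

print(dict(totals))  # {'PGS000007': [1.5, 2.5], 'PGS000011': [3.0, 1.0]}
```

The R script does the same thing with a matrix column per score name, which avoids holding every decompressed profile in memory at once.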

7.4 Other Covariates

In order to get a better picture of how well the PGS is working, we collect other covariates that may be included in a linear prediction model. While this is a good idea, complications quickly emerge in determining which covariates should be included. One way forward is simply including a wide variety of covariates and then paring them away with feature selection. While that is an approach I may follow in another project, it makes overfitting quite possible (which is not good) and is a lot of work (which is also not good). Therefore I will only include covariates that have been declared significant, with the Mayo Clinic, the NIH, and the respective professional organizations as my sources for that significance. The list of covariates for each disease under analysis includes:

covar_names <- read.table("../analyze_score/descript_defs/author_covar", sep = "\t")
print(covar_names)
##                V1
## 1         Bentham
## 2  Christophersen
## 3        Demenais
## 4        Demontis
## 5          Dubois
## 6         Gormley
## 7           IMSGC
## 8             Jin
## 9         Kottgen
## 10           Liu1
## 11           Liu2
## 12        Mahajan
## 13          Malik
## 14    Michailidou
## 15         Namjou
## 16         Nikpay
## 17          Okada
## 18        Onengut
## 19         Phelan
## 20        Rheenen
## 21     Schumacher
## 22           Shah
## 23          Sklar
## 24           Tsoi
## 25           Wray
## 26            Xie
##                                                                                                      V2
## 1                                                                                                   sex
## 2  age,hypertension,congenital heart disease,cardiac arrest,coronary artery disease,alcohol,sleep apnea
## 3                                                                         allergic rhinitis,smoking,bmi
## 4                                                                                                  none
## 5                                                                                       type 1 diabetes
## 6                                                     age,sex,age menopause,hormone replacement therapy
## 7                                                                            age,sex,epstein barr virus
## 8                                                  use of sun protection,melanona,non-hodgkins lymphoma
## 9                                                                 age,sex,obesity,hypertension,diabetes
## 10                                                                                              smoking
## 11                                                                                              smoking
## 12                                                 age,sex,bmi,exercise,hypertension,hypocholesteroemia
## 13  age,sex,bmi,age started oral contraceptive,hypertension,hypocholesteroemia,smoking,alcohol,diabetes
## 14                                              age,sex,bmi,alcohol,age menarche,age menopause,pregnant
## 15                                                 age,obesity,hypertension,hypocholesteroemia,diabetes
## 16                                         age,sex,bmi,hypocholesteroemia,hypertension,diabetes,smoking
## 17                                                                     age,sex,smoking,obesity,pregnant
## 18                                                                                                 none
## 19                                           age,bmi,hormone replacement therapy,pregnant,breast cancer
## 20                                                                                 age,diabetes,obesity
## 21                                                                                          age,obesity
## 22                      cardiac arrest,hypertension,congenital heart defects,obesity,diabetes,arrythmia
## 23                                                                                                 none
## 24                                                                                              smoking
## 25                                                    alcohol,hormone replacement therapy,age menopause
## 26                                                                                          age,sex,sle

These other covariates are obtained in two very different ways, as there are two very different types of covariates: those from the ICD records and those that are not. We will start with the non-ICD covariates, as they are simpler. We begin by reading in all of the UK Biobank phenotypes, subsetting so the EIDs are held within all of the files, then sorting so all of the EIDs are in the same order. We then extract the columns that are within the author covariates shown above. The second part of the non-ICD covariate extraction process is some manual data cleaning. Some of the covariates have NA entries, which we either fill in with the mean of the remaining non-NA entries, or with the mode or another obvious answer that assumes the person did not do something. For example, we take the mean for BMI and assume hormone replacement therapy was not taken. A final data-cleaning step sets female-specific features (such as age of menopause) to NA for male individuals. We lastly write out the covariates to a text file. The exact process is shown below.

library(vroom)

defs <- read.table("../descript_defs/covar_defs_singlecol", stringsAsFactors=F, header = T)
preg_codes <- strsplit(defs[which(defs[,1] == "pregnant"),2], ",")[[1]]
use_codes <- c(defs$single_col[-which(defs[,1] == "pregnant")], preg_codes)

phen <- as.data.frame(vroom("~/athena/ukbiobank/phenotypes/ukb26867.csv.gz", delim = ","))
sub_phen_1 <- phen[,colnames(phen) %in% c(use_codes, "eid")]

phen <- as.data.frame(vroom("~/athena/ukbiobank/phenotypes/ukb33822.csv.gz", delim = ","))
sub_phen_2 <- phen[,colnames(phen) %in% c(use_codes, "eid")]

phen <- as.data.frame(vroom("~/athena/ukbiobank/phenotypes/ukb42385.csv.gz", delim = ","))
sub_phen_3 <- phen[,colnames(phen) %in% c(use_codes, "eid")]

sub_phen_1 <- sub_phen_1[sub_phen_1$eid %in% sub_phen_2$eid & sub_phen_1$eid %in% sub_phen_3$eid,]
sub_phen_2 <- sub_phen_2[sub_phen_2$eid %in% sub_phen_1$eid & sub_phen_2$eid %in% sub_phen_3$eid,]
sub_phen_3 <- sub_phen_3[sub_phen_3$eid %in% sub_phen_2$eid & sub_phen_3$eid %in% sub_phen_1$eid,]

sub_phen_2 <- sub_phen_2[order(sub_phen_2$eid)[rank(sub_phen_1$eid)],]
sub_phen_3 <- sub_phen_3[order(sub_phen_3$eid)[rank(sub_phen_1$eid)],]

new_phen <- cbind(sub_phen_1, sub_phen_2[,-1, drop = F], sub_phen_3[,-1, drop = F])
preg_ans <- (new_phen[["3140-0.0"]] == 1 & !is.na(new_phen[["3140-0.0"]])) | (!is.na(new_phen[["2754-0.0"]]))
preg_ans <- preg_ans*1

inds <- (1:nrow(defs))[-10]  #skip the pregnant row, which was handled separately above
for(i in inds){
  colnames(new_phen)[colnames(new_phen) == defs[i,2]] <- defs[i,1]
}
new_phen <- new_phen[,!grepl("0.0", colnames(new_phen))]
new_phen$pregnant <- preg_ans


##################### CLEAN BY HAND ##############################

new_phen$alcohol[is.na(new_phen$alcohol) | new_phen$alcohol == -3] <- 4
new_phen$smoking[is.na(new_phen$smoking) | new_phen$smoking == -3] <- 0
new_phen <- new_phen[,-which(colnames(new_phen) == "alcohol.1")]
new_phen <- new_phen[,-which(colnames(new_phen) == "smoking.1")]

new_phen$bmi[is.na(new_phen$bmi)] <- mean(new_phen$bmi, na.rm = T)

new_phen$exercise[is.na(new_phen$exercise) | new_phen$exercise == -3 | new_phen$exercise == -1] <- 2

new_phen$use_of_sun_protection[is.na(new_phen$use_of_sun_protection) | new_phen$use_of_sun_protection == -3 | new_phen$use_of_sun_protection == -1] <- 2

new_phen$age_menarche[is.na(new_phen$age_menarche) | new_phen$age_menarche == -1 | new_phen$age_menarche == -3] <- mean(new_phen$age_menarche[!(is.na(new_phen$age_menarche) | new_phen$age_menarche == -1 | new_phen$age_menarche == -3 | new_phen$sex == 1)])
new_phen$age_menarche[new_phen$sex == 1] <- NA

new_phen$hormone_replacement_therapy[is.na(new_phen$hormone_replacement_therapy) | new_phen$hormone_replacement_therapy == -1 | new_phen$hormone_replacement_therapy == -3] <- 0
new_phen$hormone_replacement_therapy[new_phen$sex == 1] <- NA

new_phen$had_menopause[is.na(new_phen$had_menopause) | new_phen$had_menopause == 2 | new_phen$had_menopause == 3] <- 0
new_phen$had_menopause[new_phen$sex == 1] <- NA
new_phen$age_menopause[new_phen$had_menopause == 0] <- NA
new_phen$age_menopause[new_phen$sex == 0 & is.na(new_phen$age_menopause)] <- mean(new_phen$age_menopause, na.rm = T)
new_phen$age_menopause[new_phen$sex == 1] <- NA

new_phen$ever_used_oral_contraceptive[is.na(new_phen$ever_used_oral_contraceptive) | new_phen$ever_used_oral_contraceptive == -3 | new_phen$ever_used_oral_contraceptive == -1] <- 0
new_phen$ever_used_oral_contraceptive[new_phen$sex == 1] <- NA
new_phen$age_started_oral_contraceptive[new_phen$ever_used_oral_contraceptive == 0 | new_phen$age_started_oral_contraceptive < 10] <- NA
new_phen$age_started_oral_contraceptive[new_phen$sex == 0 & is.na(new_phen$age_started_oral_contraceptive)] <- mean(new_phen$age_started_oral_contraceptive, na.rm = T)
new_phen$age_started_oral_contraceptive[new_phen$sex == 1] <- NA

new_phen$epstein_barr_virus[is.na(new_phen$epstein_barr_virus)] <- 0.5

new_phen$obesity <- 0
new_phen$obesity[new_phen$bmi > 30] <- 1

##########################################################################

write.table(new_phen, "covar_data/single_col_covars", row.names = F, col.names = T, quote = F, sep = "\t")
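The recurring clean-by-hand pattern for female-specific fields (mean-impute missing or declined entries using the valid female values, then blank the field for males) can be distilled into a single helper. This is an illustrative re-expression, not the code above; sex follows the UK Biobank coding of 0 = female, 1 = male, and -1/-3 are the "do not know"/"prefer not to answer" codes:

```python
def clean_female_field(values, sex, missing=(None, -1, -3)):
    # mean of the valid female entries only
    known = [v for v, s in zip(values, sex) if s == 0 and v not in missing]
    mean = sum(known) / len(known)
    out = []
    for v, s in zip(values, sex):
        if s == 1:               # male: field does not apply
            out.append(None)
        elif v in missing:       # female with a missing/declined entry
            out.append(mean)
        else:
            out.append(v)
    return out

print(clean_female_field([13, -3, 11, 15], [0, 0, 0, 1]))  # [13, 12.0, 11, None]
```

The order of operations matters: imputing before masking males would pull male placeholder values into the mean.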

The second group of other covariates comes from the ICD (and self-reported disease) records. For each of the disease traits that are considered covariates for any of the diseases under the main analysis we need an ICD definition. This process was simply carried out through table and internet look-ups. The definitions used are as follows:

covar_names <- read.table("../analyze_score/descript_defs/covar_defs", sep = "\t", header = T)
print(covar_names)
##                              name        single_col   ICD10     ICD9
## 1                             age            34-0.0    <NA>     <NA>
## 2                             sex            31-0.0    <NA>     <NA>
## 3                    hypertension              <NA>     I10      401
## 4        congenital_heart_disease              <NA>     Q24      746
## 5                  cardiac_arrest              <NA> I21,I46 410,4275
## 6         coronary_artery_disease              <NA>     I25      414
## 7                         alcohol          1558-0.0    <NA>     <NA>
## 8                     sleep_apnea              <NA>    G473     <NA>
## 9               allergic_rhinitis              <NA>     J30      477
## 10                        smoking         20116-0.0    <NA>     <NA>
## 11                            bmi         21001-0.0    <NA>     <NA>
## 12                type_1_diabetes              <NA>     E10      250
## 13                       exercise           884-0.0    <NA>     <NA>
## 14             hypocholesteroemia              <NA>    E780     <NA>
## 15 age_started_oral_contraceptive          2784-0.0    <NA>     <NA>
## 16                   age_menarche          2714-0.0    <NA>     <NA>
## 17                  age_menopause          3581-0.0    <NA>     <NA>
## 18                       pregnant 2754-0.0,3140-0.0    <NA>     <NA>
## 19    hormone_replacement_therapy          2814-0.0    <NA>     <NA>
## 20                  breast_cancer              <NA>     C50     2330
## 21             epstein_barr_virus         23053-0.0    <NA>     <NA>
## 22          use_of_sun_protection          2267-0.0    <NA>     <NA>
## 23                       diabetes              <NA> E10,E11      250
## 24                     arrhythmia              <NA> I47,I48      427
## 25                            sle              <NA>     M32     7100
## 26                       melanoma              <NA>     C43      172
## 27          non-hodgkins_lymphoma              <NA> C82,C83     <NA>
##         self_diag self_diag_cancer
## 1            <NA>               NA
## 2            <NA>               NA
## 3       1065,1072               NA
## 4            <NA>               NA
## 5            <NA>               NA
## 6            <NA>               NA
## 7            <NA>               NA
## 8            1123               NA
## 9            1387               NA
## 10           <NA>               NA
## 11           <NA>               NA
## 12           1222               NA
## 13           <NA>               NA
## 14           1473               NA
## 15           <NA>               NA
## 16           <NA>               NA
## 17           <NA>               NA
## 18           <NA>               NA
## 19           <NA>               NA
## 20           <NA>             1002
## 21           <NA>               NA
## 22           <NA>               NA
## 23 1220,1222,1223               NA
## 24           1077               NA
## 25           1381               NA
## 26           <NA>             1059
## 27           <NA>             1053

While it would be possible to get the ICD covariates for all individuals for all diseases, this process is quite time intensive and therefore I have decided to only get the relevant disease covariates for each main disease under analysis. The specific covariates matched to each disease under analysis are:

covar_names <- read.table("../analyze_score/descript_defs/author_to_covar_hesin", sep = "\t", header = F)
print(covar_names)
##                V1
## 1         Bentham
## 2  Christophersen
## 3        Demenais
## 4        Demontis
## 5          Dubois
## 6         Gormley
## 7           IMSGC
## 8             Jin
## 9         Kottgen
## 10           Liu1
## 11           Liu2
## 12        Mahajan
## 13          Malik
## 14    Michailidou
## 15         Namjou
## 16         Nikpay
## 17          Okada
## 18        Onengut
## 19         Phelan
## 20        Rheenen
## 21     Schumacher
## 22           Shah
## 23          Sklar
## 24           Tsoi
## 25           Wray
## 26            Xie
##                                                                                          V2
## 1                                                                                      <NA>
## 2  hypertension,congenital_heart_disease,cardiac_arrest,coronary_artery_disease,sleep_apnea
## 3                                                                         allergic_rhinitis
## 4                                                                                      <NA>
## 5                                                                           type_1_diabetes
## 6                                                                                      <NA>
## 7                                                                                      <NA>
## 8                                                            melanona,non-hodgkins_lymphoma
## 9                                                                     hypertension,diabetes
## 10                                                                                     <NA>
## 11                                                                                     <NA>
## 12                                                          hypertension,hypocholesteroemia
## 13                                                 hypertension,hypocholesteroemia,diabetes
## 14                                                                                     <NA>
## 15                                                 hypertension,hypocholesteroemia,diabetes
## 16                                                 hypocholesteroemia,hypertension,diabetes
## 17                                                                                     <NA>
## 18                                                                                     <NA>
## 19                                                                            breast_cancer
## 20                                                                                 diabetes
## 21                                                                                     <NA>
## 22                   cardiac_arrest,hypertension,congenital_heart_disese,diabetes,arrythmia
## 23                                                                                     <NA>
## 24                                                                                     <NA>
## 25                                                                                     <NA>
## 26                                                                                     <NA>

Now that we know exactly what we are getting, let’s actually get the values. As I am looping through a lot of data I have opted to use a Python script rather than R. The script itself is highly similar to the phenotype extraction shown in the previous section, so I won’t go into too much detail on how it operates and will instead just highlight a few things. First, I want to make sure that all of the disease covariates take place before the possible date of the primary disease. Therefore I read in the dates of the primary disease and use them to limit the disease diagnoses that can count toward the covariates. Second, in this instance I am possibly looking at more than one disease definition (whereas before I only needed to know one definition), so within the loop over individuals I add an inner loop over all possible covariate definitions. Third, whereas in making the phenotype definition I kept six columns, one for each possible diagnosis source, in this instance I assume that if the individual has any positive covariate within either the ICD or self-reported sources then they do have the disease. Other than these changes everything is the same, as I attempt to intelligently iterate through all individuals, pulling out the meaningful ICD codes and subsetting by the date. The exact code is shown below:
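Before the full script, two of the small rules buried inside it are worth pulling out: a covariate diagnosis only counts if it predates the primary disease, and an ICD code matches a definition either exactly or by its 3-character prefix. In isolation (an illustrative re-expression, not the script itself):

```python
from datetime import datetime

def counts_as_covariate(covar_date, primary_date):
    # covariate diagnoses on or after the primary disease date are ignored
    return covar_date < primary_date

def icd_matches(recorded, targets):
    # a recorded code matches a definition exactly or by its
    # 3-character prefix (e.g. "I219" matches the target "I21")
    return any(c in targets or c[:3] in targets for c in recorded)

print(counts_as_covariate(datetime(2010, 1, 1), datetime(2015, 6, 1)))  # True
print(icd_matches(["I219", "E780"], ["I21", "I46"]))                    # True
print(icd_matches(["I251"], ["I21", "I46"]))                            # False
```

The prefix match mirrors the script truncating each hesin diagnosis to three characters before comparing it against the definitions table, while still allowing four-character definitions such as 4275 or E780 to match in full.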

import numpy as np
import datetime
import time
import pickle
import pdb
import gzip
import os
import sys
import re

author = "Bentham"
phen_method = "all"          
#Still need to sort everything so that the extEid process works

def normRead(fileName, withHeader = True, delim = '\t', removeQuote = False):
    totalData = []
    if fileName[-2:] == "gz":
        with gzip.open(fileName,"r") as f:
            for line in f:
                totalData.append(line.decode().strip().split(delim))

    else:
        with open(fileName,"r",encoding = "latin-1") as f:
            for line in f.read().splitlines():
                #for line in f:
                if removeQuote:
                    line = line.replace('"', '').strip()
                totalData.append(line.split(delim))
    if withHeader:
        header=totalData[0]
        del totalData[0]
    else:
        header = None
    totalData=np.array(totalData)
    return(totalData,header)



covar_defs, covar_header = normRead("../descript_defs/covar_defs_hesin", False)
use_defs = []
for i in range(covar_defs.shape[0]):
    split_defs = [j.split(",") for j in covar_defs[i,2:]]
    use_defs.append(dict(zip(["ICD10", "ICD9", "selfrep", "cancer"], split_defs)))


date_ass,ass_header = normRead("../construct_defs/date_assessed.csv", True, ",", True)
self_rep, self_rep_header = normRead("../construct_defs/self_report_diag.csv", True, ",", True)
cancer_rep, cancer_rep_header = normRead("../construct_defs/self_report_cancer.csv", True, ",", True)
big_eid, eid_head = normRead("../construct_defs/eid.csv", True, ",", True)

date_ass = date_ass[big_eid[:,0].argsort(),:]
self_rep = self_rep[big_eid[:,0].argsort(),:]
cancer_rep = cancer_rep[big_eid[:,0].argsort(),:]
big_eid = big_eid[big_eid[:,0].argsort(),:]

cancer_phase = [x.split("-")[1].split(".")[0] for x in cancer_rep_header]
self_phase = [x.split("-")[1].split(".")[0] for x in self_rep_header]

diag, head_diag = normRead("/home/kulmsc/athena/ukbiobank/hesin/hesin_diag.txt")
hesin, head_hesin = normRead("/home/kulmsc/athena/ukbiobank/hesin/hesin.txt")
diag = diag[diag[:,0].argsort(),:]
hesin = hesin[hesin[:,0].argsort(),:]
diag_eid = np.unique(diag[:,0])
diag_max = diag.shape[0] - 1

diag_eid_list = diag[:,0].tolist()


print("reading")
known_diag, known_head = normRead("../construct_defs/pheno_defs/diag." + author.lower() + ".txt.gz", False, " ")
known_time, known_head = normRead("../construct_defs/pheno_defs/time." + author.lower() + ".txt.gz", False, " ")
new_time = np.tile(datetime.datetime.strptime("31/12/2020", '%d/%m/%Y'), (known_time.shape[0], known_time.shape[1]))
print("read in")
for i in range(known_time.shape[0]):
    for j in range(known_time.shape[1]):
        if known_time[i,j] != '__________':
            if j == 0 or j == 1 or j == 5:
                new_time[i,j] = datetime.datetime.strptime(known_time[i,j], "%Y-%m-%d")
            else:
                new_time[i,j] = datetime.datetime.strptime(known_time[i,j], "%d/%m/%Y")
known_time = new_time
    
if phen_method == "icd":
    known_diag = known_diag[:,2:4]
    known_time = known_time[:,2:4]
elif phen_method == "selfrep":
    known_diag = known_diag[:,0:2]
    known_time = known_time[:,0:2]
elif phen_method == "icd_selfrep":
    known_diag = known_diag[:,0:4]
    known_time = known_time[:,0:4]
elif phen_method == "double":
    known_diag = known_diag.astype("int")
    for_double = np.sum(known_diag, axis = 1)

final_diag = []
final_time = []
for i in range(known_diag.shape[0]):
    if phen_method == "double":
        if for_double[i] > 1:
            final_diag.append(1)
            final_time.append(min(known_time[i,:]))
        else:
            final_diag.append(0)
            final_time.append("NA")
    else:
        if "1" in known_diag[i,:]:
            final_diag.append(1)
            final_time.append(min(known_time[i,:]))
        else:
            final_diag.append(0)
            final_time.append("NA")


#STILL NEED TO FIGURE OUT HOW I WANT TO SPLIT THIS - right now assume I dont

start_eid_diag = 0
df_occur = np.zeros((big_eid.shape[0], covar_defs.shape[0]))
df_date = np.tile("_________", (big_eid.shape[0], covar_defs.shape[0]))

#pdb.set_trace()

for ind in range(big_eid.shape[0]):
    curr_eid = big_eid[ind][0]
    if ind % 10000 == 0:
        print(ind)

    for trait_ind in range(covar_defs.shape[0]):

        #CANCER SELF REPORT
        if covar_defs[trait_ind, 5] != "NA":
            if any(np.isin(use_defs[trait_ind]["cancer"], cancer_rep[ind,:])):
                locs = np.where(np.isin(cancer_rep[ind,:], use_defs[trait_ind]["cancer"]))[0]
                poss_date = date_ass[ind, int(cancer_phase[locs[0]])]
                poss_date = datetime.datetime.strptime(poss_date, "%Y-%m-%d")
                if poss_date < final_time[ind]:
                    df_occur[ind, trait_ind] = 1
                    df_date[ind, trait_ind] = poss_date


        #NON-CANCER SELF REPORT
        if covar_defs[trait_ind, 4] != "NA":
            if any(np.isin(use_defs[trait_ind]["selfrep"], self_rep[ind,:])):
                locs = np.where(np.isin(self_rep[ind,:], use_defs[trait_ind]["selfrep"]))[0]
                poss_date = date_ass[ind, int(self_phase[locs[0]])]
                poss_date = datetime.datetime.strptime(poss_date, "%Y-%m-%d")
                if poss_date < final_time[ind]:
                    df_occur[ind, trait_ind] = 1
                    df_date[ind, trait_ind] = poss_date




        #HESIN DIAG
        if curr_eid in diag_eid_list:
            if curr_eid != diag_eid_list[-1]:
                # assumes diag is sorted by eid and no individual has more than
                # 7000 records, so the next eid falls inside this window
                next_eid = np.unique(diag[start_eid_diag:(start_eid_diag+7000),0])[1]
                ext_eid = diag_eid_list.index(next_eid)
            else:
                ext_eid = diag_max

            #ICD-9
            # match on both the full code and its 3-character prefix
            use_short_icd9_diag = [x[0:3] for x in diag[start_eid_diag:ext_eid,4]]
            if any(np.isin(use_defs[trait_ind]["ICD9"], use_short_icd9_diag)):
                short_icd9_locs = np.where(np.isin(use_short_icd9_diag, use_defs[trait_ind]["ICD9"]))[0][0]
            else:
                short_icd9_locs = 10000
            if any(np.isin(use_defs[trait_ind]["ICD9"], diag[start_eid_diag:ext_eid,4])):
                long_icd9_locs = np.where(np.isin(diag[start_eid_diag:ext_eid,4], use_defs[trait_ind]["ICD9"]))[0][0]
            else:
                long_icd9_locs = 10000
            if long_icd9_locs != 10000 or short_icd9_locs != 10000:
                icd9_locs = min((long_icd9_locs, short_icd9_locs))
                icd9_ins_index = diag[start_eid_diag+icd9_locs,1]

                # try the episode start date first, then fall back to the
                # alternative date columns when it is missing
                raw_date = hesin[np.logical_and(hesin[:,0] == curr_eid, hesin[:,1] == icd9_ins_index),5][0]
                if raw_date == "":
                    raw_date = hesin[np.logical_and(hesin[:,0] == curr_eid, hesin[:,1] == icd9_ins_index),21][0]
                    if raw_date == "":
                        raw_date = hesin[np.logical_and(hesin[:,0] == curr_eid, hesin[:,1] == icd9_ins_index),4][0]

                # record under the current trait's column, not a hard-coded index
                df_occur[ind, trait_ind] = 1
                df_date[ind, trait_ind] = raw_date



            #ICD-10
            use_short_icd10_diag = [x[0:3] for x in diag[start_eid_diag:ext_eid,6]]
            if any(np.isin(use_defs[trait_ind]["ICD10"], use_short_icd10_diag)):
                short_icd10_locs = np.where(np.isin(use_short_icd10_diag, use_defs[trait_ind]["ICD10"]))[0][0]
            else:
                short_icd10_locs = 10000
            if any(np.isin(use_defs[trait_ind]["ICD10"], diag[start_eid_diag:ext_eid,6])):
                long_icd10_locs = np.where(np.isin(diag[start_eid_diag:ext_eid,6], use_defs[trait_ind]["ICD10"]))[0][0]
            else:
                long_icd10_locs = 10000
            if long_icd10_locs != 10000 or short_icd10_locs != 10000:
                icd10_locs = min((long_icd10_locs, short_icd10_locs))
                icd10_ins_index = diag[start_eid_diag+icd10_locs,1]

                raw_date = hesin[np.logical_and(hesin[:,0] == curr_eid, hesin[:,1] == icd10_ins_index),5][0]
                if raw_date == "":
                    raw_date = hesin[np.logical_and(hesin[:,0] == curr_eid, hesin[:,1] == icd10_ins_index),21][0]
                    if raw_date == "":
                        raw_date = hesin[np.logical_and(hesin[:,0] == curr_eid, hesin[:,1] == icd10_ins_index),4][0]

                df_occur[ind, trait_ind] = 1
                df_date[ind, trait_ind] = raw_date

    # advance the hesin pointer once per individual, after all traits are checked;
    # doing this inside the trait loop would empty the slice for every trait after
    # the first, and ext_eid is only defined when the eid appears in the hesin data
    if curr_eid in diag_eid_list:
        start_eid_diag = ext_eid

print("end")
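The 3-character truncation used in the ICD matching above can be seen in isolation. This sketch uses made-up codes and variable names; it shows why both the full recorded code and its truncated prefix are checked, since a trait definition may list either a specific code or a whole 3-character block:

```python
# hypothetical ICD-10 codes recorded for one individual
recorded = ["I251", "E119", "C509"]
# a trait definition may contain a full code or a 3-character prefix
trait_codes = {"I25", "C509"}

short = [code[:3] for code in recorded]
# a record matches if either its full code or its 3-character
# prefix appears in the trait definition
hits = [i for i, (full, pre) in enumerate(zip(recorded, short))
        if full in trait_codes or pre in trait_codes]
print(hits)  # [0, 2]
```

Record 0 matches through its prefix ("I25") and record 2 through its full code, while "E119" matches neither way.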
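One pitfall worth knowing about when preallocating a date matrix with `np.tile`, as done above: numpy infers a fixed-width string dtype from the placeholder, and any longer string assigned later is silently truncated rather than raising an error. A minimal demonstration:

```python
import numpy as np

# a 9-character placeholder yields dtype '<U9'; a 10-character
# date like "2010-05-12" is then silently truncated on assignment
dates9 = np.tile("_" * 9, (2, 2))
dates9[0, 0] = "2010-05-12"
print(dates9[0, 0])   # "2010-05-1"

# widening the placeholder to 10 characters preserves the full date
dates10 = np.tile("_" * 10, (2, 2))
dates10[0, 0] = "2010-05-12"
print(dates10[0, 0])  # "2010-05-12"
```

Because the truncation is silent, it is worth checking the `dtype` of any string array that will hold dates before filling it.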