7 Score Analysis Set-Up
Now that we have the scores we are nearly ready for analysis. However, several more steps remain, namely setting up all of the phenotypes and covariates. As with creating the scores, this is an often underrated process that must be done carefully, or everything that follows will fail.
7.1 Defining the Phenotypes
By phenotype I mean the status of whether an individual has the disease corresponding to the polygenic risk score. This is a tricky problem, first of all because each underlying GWAS of the polygenic risk scores may construct phenotypes differently. Adjusting our phenotypes to each GWAS does not seem easy, or even possible, given the limited information within the UK Biobank. Still, the UK Biobank does have a great deal of information. The sources that we will use to construct phenotypes are as follows:
Self-Reported - Each individual was interviewed by a trained nurse and asked about any illnesses they may have. The nurse then placed each description given by the individual into a pre-specified tree of diseases. This data can be suspect, as an individual may report a self-diagnosis that was never confirmed by a doctor, but overall the reports likely line up with actual conditions for most individuals. Further description of the self-reported data type can be found at http://biobank.ndph.ox.ac.uk/showcase/field.cgi?id=20002, and all possible self-reported codings can be found at http://biobank.ndph.ox.ac.uk/showcase/coding.cgi?id=3 and http://biobank.ndph.ox.ac.uk/showcase/coding.cgi?id=6 (for cancer and non-cancer, respectively).
ICD - The UK Biobank has pulled in the electronic health record of every individual. Specifically, when a doctor makes a diagnosis they record the ICD code in the computer. These codes should be highly accurate, although there is always the chance of a typo, or of a doctor recording a similar diagnosis that does not quite match what we want, making ICD codes somewhat fickle. These records have been parsed into the hesin files (hospital episode statistics). In short, there is one file, hesin_diag, with an ICD code on every line, and another file, hesin, with the dates of hospital episodes. One can look up the dates for a given ICD code by using the EID and instance index; we will go over this more later. More information on the hesin data can be found at https://biobank.ndph.ox.ac.uk/showcase/showcase/docs/HospitalEpisodeStatistics.pdf. The coding for ICD9 codes can be found at http://biobank.ndph.ox.ac.uk/showcase/coding.cgi?id=87, and the coding for ICD10 codes at http://biobank.ndph.ox.ac.uk/showcase/coding.cgi?id=19. Note that ICD9 is the older coding system, since superseded by ICD10.
OPCS - Similar to ICD, OPCS codes are generally part of the electronic health record and are accessed from the hesin set of files. However, instead of recording diagnoses they record operations. While generally not helpful for our purposes, some publications have used specific OPCS codes to infer that an underlying phenotype was the cause of the operation. Following precedent I will do the same. More information on OPCS coding can be found at http://biobank.ndph.ox.ac.uk/showcase/coding.cgi?id=240. There are also OPCS3 codes, but they are very sparse.
Medications - This is the least orthodox way of determining phenotype, but I figure that if someone is taking a medication used for only one illness, there is a very good chance the person has that illness. Unfortunately the UK Biobank does not have day-to-day records of medications, only the inventories taken when individuals come in for assessments. By looking over the CDC, respective professional society, and Mayo Clinic websites I determined the associated medications and converted them into UK Biobank codes. More information on medication coding can be found at http://biobank.ndph.ox.ac.uk/showcase/coding.cgi?id=4, and more information on how the medication data were recorded at http://biobank.ndph.ox.ac.uk/showcase/field.cgi?id=20003.
The table that holds the conversion from the many possible codes of each category into a 1/0 phenotype is as follows.
pheno_defs <- read.table("../analyze_score/descript_defs/author_defs", stringsAsFactors=F, header=T)
print(pheno_defs)
## author name sex cancer noncancer
## 1 Bentham sle A NA 1381
## 2 Christophersen atrial_fibrilation A NA 1471
## 3 Demenais asthma A NA 1111
## 4 Demontis adhd A NA <NA>
## 5 Dubois celiac_disease A NA 1456
## 6 Gormley migraine A NA 1265
## 7 IMSGC ms A NA 1261
## 8 Jin vitiligo A NA 1661
## 9 Kottgen gout A NA 1466
## 10 Liu-1 ulcerative_colitis A NA 1463
## 11 Liu-2 crohns_disease A NA 1462
## 12 Mahajan type_2_diabetes A NA 1223
## 13 Malik stroke A NA 1081
## 14 Michailidou breast_cancer F 1002 <NA>
## 15 Namjou nonalcoholic_fatty_liver_disease A NA <NA>
## 16 Nikpay coronary_artery_disease A NA 1075|1076
## 17 Okada rheumatoid_arthritis A NA 1464
## 18 Onengut type_1_diabetes A NA 1222
## 19 Phelan ovarian_cancer F 1039 <NA>
## 20 Rheenen als A NA <NA>
## 21 Schumacher prostate_cancer M 1044 <NA>
## 22 Shah heart_failure A NA 1076
## 23 Sklar bipolar_disease A NA 1291
## 24 Tsoi psoriaris A NA 1453
## 25 Wray major_depressive_disorder A NA 1286
## 26 Xie membranous_glomerulonephritis A NA <NA>
## icd9 icd10
## 1 710 M321|M328|M329
## 2 4273 I48
## 3 493 J45|J46
## 4 314 F90
## 5 579 <NA>
## 6 346 G43
## 7 340 G35
## 8 7091 L80
## 9 274 M10
## 10 556 K51
## 11 555 K50
## 12 <NA> E11
## 13 431|432|433|434|435|436|437|438 I60|I61|I62|I633|I64|I65|I66|I67|I68|I69
## 14 174 C50
## 15 5718 K760
## 16 410|411|412 I21|I22|I23|I241|I252
## 17 714 M05|M06
## 18 <NA> E10
## 19 183 C56
## 20 3352 G122
## 21 185 C61
## 22 428 I50
## 23 296 F31
## 24 6960|6961 L40
## 25 <NA> F33
## 26 5831 N002
## opcs
## 1 <NA>
## 2 K622|K623
## 3 <NA>
## 4 <NA>
## 5 <NA>
## 6 <NA>
## 7 <NA>
## 8 <NA>
## 9 <NA>
## 10 <NA>
## 11 <NA>
## 12 <NA>
## 13 U543|Z35
## 14 B27|B28|B29
## 15 <NA>
## 16 K40|K41|K45|K49|K50|K75
## 17 U504
## 18 <NA>
## 19 <NA>
## 20 X852
## 21 <NA>
## 22 <NA>
## 23 <NA>
## 24 <NA>
## 25 <NA>
## 26 <NA>
## meds
## 1 <NA>
## 2 1140888482
## 3 1141168340
## 4 <NA>
## 5 <NA>
## 6 1141163670|1141167932|1141172728|1141185436|1141192666|1141150620|1141151284
## 7 1141188640|1141188642|1141150596|1141172620|1141165546|1141167618|1141189254|1141189256
## 8 <NA>
## 9 1140875408
## 10 1141153242
## 11 <NA>
## 12 <NA>
## 13 <NA>
## 14 1140923018|1141190734
## 15 <NA>
## 16 <NA>
## 17 1141145896|1140909702|1141188588|1141180070|1140871188
## 18 <NA>
## 19 <NA>
## 20 1141195974
## 21 1141150594|1140870274|1140921100
## 22 <NA>
## 23 <NA>
## 24 <NA>
## 25 <NA>
## 26 <NA>
The pipe symbol indicates that if any of the listed codes is found in the respective data type, the phenotype will be called positive.
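As a concrete illustration of this pipe logic, the sketch below splits two cells copied from the stroke (Malik) row of the table and calls an individual positive if any code from any source matches. The records passed in are made up, and this helper is an illustration, not part of the pipeline code.

```python
# Sketch of the pipe ("|") semantics in the definitions table.
# The code strings are from the stroke (Malik) row; the records are invented.
definition = {
    "icd10": "I60|I61|I62|I633|I64|I65|I66|I67|I68|I69".split("|"),
    "noncancer": "1081".split("|"),
}

def is_case(records_by_source):
    """A person is called positive if ANY code from ANY source matches."""
    return any(
        code in records_by_source.get(source, [])
        for source, codes in definition.items()
        for code in codes
    )

print(is_case({"icd10": ["I61"]}))                    # matching ICD10 code -> True
print(is_case({"noncancer": ["1111"], "icd10": []}))  # no matching code -> False
```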
7.2 Making the Phenotype
Now that we know which codes map to which phenotype, we simply need to go through the respective data files and check off when we find one for each EID. This is easier said than done, so I have broken things down into parts. The first part is preparing the data, specifically breaking down the large phenotype file downloaded from the UK Biobank into parts that are easy to inspect and read in. Along with these files, the hesin files are also necessary, although they do not need any breaking down. I would display these files, but I do not think I am allowed to. The breaking-down script looks like the following (note that if you are following along your indices are likely different).
#self diag from 472 to 573
#date attend assessment center from 41 to 43
#medication from 574 to 717
zcat ~/athena/ukbiobank/phenotypes/ukb26867.csv.gz | cut -d',' -f1 > eid.csv
zcat ~/athena/ukbiobank/phenotypes/ukb26867.csv.gz | cut -d',' -f472-573 > self_report_diag.csv
zcat ~/athena/ukbiobank/phenotypes/ukb26867.csv.gz | cut -d',' -f454-471 > self_report_cancer.csv
zcat ~/athena/ukbiobank/phenotypes/ukb26867.csv.gz | cut -d',' -f41-43 > date_assessed.csv
zcat ~/athena/ukbiobank/phenotypes/ukb26867.csv.gz | cut -d',' -f574-717 > medications.csv
Now that the data is prepared we can get on with inspecting and calling phenotypes. I have tried this in many different ways, and the fastest option seems to be breaking the total of roughly 500,000 people into groups and calling phenotypes for each group in parallel. This parallelism is controlled by the following script, which is straightforward.
cat parallel_splits | while read line;do
  start=`echo $line | cut -d' ' -f1`
  end=`echo $line | cut -d' ' -f2`
  echo $start
  python better_pheno.py $start $end &
  sleep 80
done
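The parallel_splits file consumed above is just a list of start/end index pairs, one per parallel job. A minimal sketch of how such a file could be generated follows; the group size of 50,000 is an arbitrary assumption, not the value used in the original analysis.

```python
# Hypothetical generator for the parallel_splits file: each line holds the
# start and end index of one group of individuals to phenotype in parallel.
total_n = 502535    # number of individuals in the phenotype file
group_size = 50000  # people per parallel job (an arbitrary choice)

lines = []
for start in range(0, total_n, group_size):
    end = min(start + group_size, total_n)
    lines.append(str(start) + " " + str(end))

with open("parallel_splits", "w") as f:
    f.write("\n".join(lines) + "\n")

print(len(lines))  # number of parallel jobs -> 11
```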
As you may be able to see, the key script here is better_pheno.py. This script has a few parts. First it reads in all of the prepared data. Next it subsets down to the group of individuals specified by the control_make_pheno script, and sorts each file so the EIDs line up. This sorting is important as it allows for quick indexing of EIDs (just iterate to the next unique EID rather than searching for the index). Lastly I check each type of phenotype (self-report, ICD, etc.) for the coding that was specified. For ICD and OPCS this process requires checking not just the exact code, but also a more general code, due to the hierarchical nature of their coding. For example, if I am looking for G35 and an individual has G351, I will still count that as a match. Further details on how this script works can be found in its comments; the script starts below.
import numpy as np
import datetime
import time
import pickle
import pdb
import gzip
import os
import sys
import re

#Declare the author that corresponds to the phenotype you want
#Also the indices of the group we want phenotypes for
author = "Christophersen"
start_ind = int(sys.argv[1])
end_ind = int(sys.argv[2])
#Function to read in the data
def normRead(fileName, withHeader = True, delim = '\t', removeQuote = False):
    with open(fileName, "r", encoding = "latin-1") as f:
        totalData = []
        for line in f.read().splitlines():
            if removeQuote:
                totalData.append(line.replace('"', '').strip().split(delim))
            else:
                totalData.append(line.split(delim))
    if withHeader:
        header = totalData[0]
        del totalData[0]
    else:
        header = None
    totalData = np.array(totalData)
    return(totalData, header)
#Read in the files that were split from the main UKBB phenotype file
date_ass, ass_header = normRead("date_assessed.csv", True, ",", True)
meds, meds_header = normRead("medications.csv", True, ",", True)
self_rep, self_rep_header = normRead("self_report_diag.csv", True, ",", True)
cancer_rep, cancer_rep_header = normRead("self_report_cancer.csv", True, ",", True)
big_eid, eid_head = normRead("eid.csv", True, ",", True)

#sort the files just read in so they are in EID order
date_ass = date_ass[big_eid[:,0].argsort(),:]
meds = meds[big_eid[:,0].argsort(),:]
self_rep = self_rep[big_eid[:,0].argsort(),:]
cancer_rep = cancer_rep[big_eid[:,0].argsort(),:]
big_eid = big_eid[big_eid[:,0].argsort(),:]

#Get the phase, or the assessment type, for the cancer, self-report and medication reports
#Phase meaning the numbered time the person came in to be assessed (first, second or third assessment)
cancer_phase = [x.split("-")[1].split(".")[0] for x in cancer_rep_header]
self_phase = [x.split("-")[1].split(".")[0] for x in self_rep_header]
meds_phase = [x.split("-")[1].split(".")[0] for x in meds_header]
#Read in the hesin type of data and again sort by EID
diag, head_diag = normRead("/home/kulmsc/athena/ukbiobank/hesin/hesin_diag.txt")
oper, head_oper = normRead("/home/kulmsc/athena/ukbiobank/hesin/hesin_oper.txt")
hesin, head_hesin = normRead("/home/kulmsc/athena/ukbiobank/hesin/hesin.txt")
diag = diag[diag[:,0].argsort(),:]
oper = oper[oper[:,0].argsort(),:]
hesin = hesin[hesin[:,0].argsort(),:]
diag_eid = np.unique(diag[:,0])
oper_eid = np.unique(oper[:,0])
diag_max = diag.shape[0] - 1
oper_max = oper.shape[0] - 1
diag_eid_list = diag[:,0].tolist()
oper_eid_list = oper[:,0].tolist()
#Read in the disease definitions and split the types with multiple conditions, then make into a dict
defs, head_defs = normRead("../descript_defs/author_defs")
def_ind = defs[:,0].tolist().index(author)
split_defs = list()
for type_def in defs[def_ind, 3:9]:
    split_defs.append(type_def.split("|"))

#Produce a dictionary so we can easily pull out the codes for each data type
use_defs = dict(zip(head_defs[3:9], split_defs))
use_defs
#This is for timing, which I do not do anymore
#cancer_sr_time = 0
#noncancer_sr_time = 0
#hesin_setup_time = 0
#icd9_time = 0
#icd10_time = 0
#hesin_oper_time = 0
#med_time = 0
#Find the splits of the input data, so less RAM is used
#Simply find the eid corresponding to the start and stop index,
#skipping eids if they do not appear in the given data type
def subset_big_data(eid_list, big_data):
    for ind in range(start_ind, end_ind):
        if big_eid[ind][0] in eid_list:
            start_eid = eid_list.index(big_eid[ind][0])
            break
    for ind in range(end_ind, start_ind, -1):
        if big_eid[ind][0] in eid_list:
            end_eid = np.where(big_data[:,0] == big_eid[ind][0])[0][-1]
            break
    return(big_data[start_eid:end_eid,:])

#I'm still bad at python indexing so I have to include this line to fix a single edge case
if end_ind > big_eid.shape[0]:
    end_ind = big_eid.shape[0] - 1
#Apply subset_big_data, and convert some things to lists for easier indexing
diag = subset_big_data(diag_eid_list, diag)
oper = subset_big_data(oper_eid_list, oper)
hesin = subset_big_data(hesin[:,0].tolist(), hesin)
diag_eid_list = diag[:,0].tolist()
oper_eid_list = oper[:,0].tolist()

#Establish indices of diag and oper for easy indexing (we do not need to look up an index every time,
#only add the extent of the current eid to this index)
start_eid_diag = 0
start_eid_oper = 0

#Subset the things that do not require the subset_big_data function
meds = meds[start_ind:end_ind,:]
self_rep = self_rep[start_ind:end_ind,:]
date_ass = date_ass[start_ind:end_ind,:]
cancer_rep = cancer_rep[start_ind:end_ind,:]
big_eid = big_eid[start_ind:end_ind,:]

#Prepare the data objects that will hold diagnosis results
df_occur = np.zeros((big_eid.shape[0], 6))
df_date = np.tile("__________", (big_eid.shape[0], 6))
df_date
#Iterate through the indices of each unique EID
#Note for lessons learned in speed-up:
# - do not subset objects (I think copies are made in RAM which really slows things down)
# - always try to just iterate to the next index rather than searching for some keyword
for ind in range(0, (end_ind - start_ind)):
    if ind % 10000 == 0:
        print(ind) #So I know how fast things are running

    #Set the value of the actual eid
    curr_eid = big_eid[ind][0]

    #Now we go through and fill in each column of df_occur and df_date, one data type at a time
    #CANCER SELF REPORT (this is the name of the column in df_occur and df_date I am filling in)
    if use_defs["cancer"][0] != "NA": #check to see if there is a cancer definition
        if any(np.isin(use_defs["cancer"], cancer_rep[ind,:])): #if the cancer definition is in the cancer data
            locs = np.where(np.isin(cancer_rep[ind,:], use_defs["cancer"]))[0] #get assessment index of the cancer occurence
            try_date = date_ass[ind, int(cancer_phase[locs[0]])] #record the date based on index of ind. assessment
            if try_date == "": #sometimes the date is empty so we go back to the last date we know
                try_date = date_ass[ind,0]
            df_date[ind,0] = try_date
            df_occur[ind,0] = 1

    #NON-CANCER SELF REPORT (the same process as with the cancer data, but now with non-cancer data)
    if use_defs["noncancer"][0] != "NA":
        if any(np.isin(use_defs["noncancer"], self_rep[ind,:])):
            locs = np.where(np.isin(self_rep[ind,:], use_defs["noncancer"]))[0]
            try_date = date_ass[ind, int(self_phase[locs[0]])]
            if try_date == "":
                try_date = date_ass[ind, 0]
            df_date[ind,1] = try_date
            df_occur[ind,1] = 1
    #HESIN DIAG
    if curr_eid in diag_eid_list: #check if the eid has ever had any ICD codes
        #set up the hesin indexing, since multiple rows can have the same eid we
        #use this technique to get the index of the next unique EID
        if curr_eid != diag_eid_list[-1]:
            next_eid = np.unique(diag[start_eid_diag:(start_eid_diag+7000),0])[1]
            ext_eid = diag_eid_list.index(next_eid)
        else:
            ext_eid = diag_max

        #ICD - 9
        #as was explained, either the full or slightly shorter ICD code may match what we want, so we create both
        use_short_icd9_diag = [x[0:3] for x in diag[start_eid_diag:ext_eid,4]]
        #Then we check to see if either the short or long ICD codes (from the hesin) match any of the definitions
        if any(np.isin(use_defs["icd9"], use_short_icd9_diag)):
            #If there is a match we get the first location in the hesin of the match
            short_icd9_locs = np.where(np.isin(use_short_icd9_diag, use_defs["icd9"]))[0][0]
        else:
            #There cannot possibly be a match at location 10000, so it serves as the impossible value
            short_icd9_locs = 10000
        if any(np.isin(use_defs["icd9"], diag[start_eid_diag:ext_eid,4])):
            long_icd9_locs = np.where(np.isin(diag[start_eid_diag:ext_eid,4], use_defs["icd9"]))[0][0]
        else:
            long_icd9_locs = 10000
        if long_icd9_locs != 10000 or short_icd9_locs != 10000:
            #Find the minimum matching location as it is the earliest time
            icd9_locs = min((long_icd9_locs, short_icd9_locs))
            #Then we get the instance, or the numbered time the EID went to the hospital
            icd9_ins_index = diag[start_eid_diag+icd9_locs,1]

            #We carry the ins_index and EID to the larger hesin array to get the date
            #Sometimes the most accurate form of the date is empty so we go to the next best
            raw_date = hesin[np.logical_and(hesin[:,0] == curr_eid, hesin[:,1] == icd9_ins_index),5][0]
            if raw_date == "":
                raw_date = hesin[np.logical_and(hesin[:,0] == curr_eid, hesin[:,1] == icd9_ins_index),21][0]
            if raw_date == "":
                raw_date = hesin[np.logical_and(hesin[:,0] == curr_eid, hesin[:,1] == icd9_ins_index),4][0]

            df_occur[ind,2] = 1
            df_date[ind,2] = raw_date
        #The process for ICD10 is nearly exactly the same as for ICD9
        #ICD - 10
        use_short_icd10_diag = [x[0:3] for x in diag[start_eid_diag:ext_eid,6]]
        if any(np.isin(use_defs["icd10"], use_short_icd10_diag)):
            short_icd10_locs = np.where(np.isin(use_short_icd10_diag, use_defs["icd10"]))[0][0]
        else:
            short_icd10_locs = 10000
        if any(np.isin(use_defs["icd10"], diag[start_eid_diag:ext_eid,6])):
            long_icd10_locs = np.where(np.isin(diag[start_eid_diag:ext_eid,6], use_defs["icd10"]))[0][0]
        else:
            long_icd10_locs = 10000
        if long_icd10_locs != 10000 or short_icd10_locs != 10000:
            icd10_locs = min((long_icd10_locs, short_icd10_locs))
            icd10_ins_index = diag[start_eid_diag+icd10_locs,1]

            raw_date = hesin[np.logical_and(hesin[:,0] == curr_eid, hesin[:,1] == icd10_ins_index),5][0]
            if raw_date == "":
                raw_date = hesin[np.logical_and(hesin[:,0] == curr_eid, hesin[:,1] == icd10_ins_index),21][0]
            if raw_date == "":
                raw_date = hesin[np.logical_and(hesin[:,0] == curr_eid, hesin[:,1] == icd10_ins_index),4][0]

            df_occur[ind,3] = 1
            df_date[ind,3] = raw_date
        start_eid_diag = ext_eid
    #The process for HESIN OPER is nearly exactly the same as for HESIN DIAG (which covers ICD9 and ICD10)
    #HESIN OPER
    if curr_eid in oper_eid_list and use_defs["opcs"][0] != "NA":
        if curr_eid != oper_eid_list[-1]: #if it is not the last in the list
            next_eid = np.unique(oper[start_eid_oper:(start_eid_oper+7000),0])[1]
            ext_eid = oper_eid_list.index(next_eid)
        else:
            ext_eid = oper_max

        use_short_oper = [x[0:3] for x in oper[start_eid_oper:ext_eid,7]]
        if any(np.isin(use_defs["opcs"], use_short_oper)):
            short_oper_locs = np.where(np.isin(use_short_oper, use_defs["opcs"]))[0][0]
        else:
            short_oper_locs = 10000
        if any(np.isin(use_defs["opcs"], oper[start_eid_oper:ext_eid,7])):
            long_oper_locs = np.where(np.isin(oper[start_eid_oper:ext_eid,7], use_defs["opcs"]))[0][0]
        else:
            long_oper_locs = 10000
        if long_oper_locs != 10000 or short_oper_locs != 10000:
            oper_locs = min((long_oper_locs, short_oper_locs))
            oper_ins_index = oper[start_eid_oper+oper_locs,1]

            raw_date = hesin[np.logical_and(hesin[:,0] == curr_eid, hesin[:,1] == oper_ins_index),5][0]
            if raw_date == "":
                raw_date = hesin[np.logical_and(hesin[:,0] == curr_eid, hesin[:,1] == oper_ins_index),21][0]
            if raw_date == "":
                raw_date = hesin[np.logical_and(hesin[:,0] == curr_eid, hesin[:,1] == oper_ins_index),4][0]

            df_occur[ind,4] = 1
            df_date[ind,4] = raw_date
        start_eid_oper = ext_eid
    #The medication process is nearly the same as for cancer and non-cancer
    #MEDICATION
    if use_defs["meds"][0] != "NA":
        if any(np.isin(use_defs["meds"], meds[ind,:])):
            locs = np.where(np.isin(meds[ind,:], use_defs["meds"]))[0]
            df_date[ind,5] = date_ass[ind, int(meds_phase[locs[0]])]
            df_occur[ind,5] = 1

#save the output phenos
np.savetxt("raw_output/diag.coding." + author.lower() + "." + str(start_ind) + ".txt.gz", df_occur, fmt='%i')
np.savetxt("raw_output/diag.time." + author.lower() + "." + str(start_ind) + ".txt.gz", df_date, fmt='%s')
print("end")
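The hierarchical ICD/OPCS matching used above (a record counts if either its full code or its 3-character prefix is in the definitions) can be distilled into a toy sketch. The helper below is an illustration, not part of better_pheno.py.

```python
# Sketch of hierarchical ICD matching: a record matches a set of target codes
# if either the full code or its 3-character prefix appears in the targets.
def icd_matches(target_codes, record):
    short = record[0:3]  # e.g. "G351" -> "G35"
    return record in target_codes or short in target_codes

targets = ["G35"]  # looking for multiple sclerosis
print(icd_matches(targets, "G351"))  # more specific sub-code still counts -> True
print(icd_matches(targets, "G35"))   # exact code counts -> True
print(icd_matches(targets, "G43"))   # unrelated (migraine) code -> False
```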
The final step in the process is combining all of the small subset phenotype files into one large phenotype file. The mechanics of this script are very simple: iterate over the subsets and concatenate them together.
ls raw_output/ | cut -f3 -d'.' | sort | uniq | while read author;do
  rm -f pheno_defs/diag.${author}.txt.gz
  rm -f pheno_defs/time.${author}.txt.gz
  ls raw_output/diag.coding.${author}* | cut -f2 -d'/' | cut -f4 -d'.' | sort -n | while read num;do
    zcat raw_output/diag.coding.${author}.${num}.txt.gz >> pheno_defs/diag.${author}.txt
    zcat raw_output/diag.time.${author}.${num}.txt.gz >> pheno_defs/time.${author}.txt
  done
  gzip pheno_defs/diag.${author}.txt
  gzip pheno_defs/time.${author}.txt
done
The final output is two files for each phenotype. Both files contain six columns, one for each definition type, and 502,535 rows, one for each individual. One file contains 0's and 1's indicating whether the phenotype is absent or present. The other file contains a series of dashes wherever the corresponding file holds a 0, and the date the phenotype started wherever it holds a 1.
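Downstream, a single case/control call per person can be made by ORing the six definition columns of the occurrence file. A minimal sketch with a toy occurrence matrix (not real data) follows.

```python
import numpy as np

# Toy df_occur: rows are individuals, columns are the six definition types
# (cancer self-report, non-cancer self-report, ICD9, ICD10, OPCS, medications)
df_occur = np.array([
    [0, 0, 0, 0, 0, 0],  # no evidence from any source -> control
    [0, 1, 0, 1, 0, 0],  # self-report and ICD10 agree  -> case
    [0, 0, 0, 0, 0, 1],  # medication evidence only     -> case
])

# A person is a case if any source flagged the phenotype
pheno = (df_occur.sum(axis=1) > 0).astype(int)
print(pheno)  # -> [0 1 1]
```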
7.3 Other Scores
To evaluate the prior methods against more curated scores, I will download scoring summary statistics and create scores here. The problem with this aim is that scoring summary statistics are often hard to find and come in various formats that are difficult to standardize. For example, scoring summary statistics may only be available within figures, may not contain rsIDs or a genome reference, may not label the type of effect, or may be saved in a difficult format. Luckily, a great resource called the PGS Catalog, https://www.pgscatalog.org/browse/all/, has recently come into existence. To ensure that the other scores I create are all reasonable, I will only use scores within the PGS Catalog. While there are other scores I could probably access, I think this is a good compromise that keeps things accurate while still carrying out the mission of checking other scores (and just maybe this will motivate people to upload their scoring summary statistics to the PGS Catalog with all of the important information necessary to recreate them).
To get the scoring summary statistics I simply searched the PGS Catalog for all of the phenotypes I am researching. I checked all available scores to ensure they were not derived from the UK Biobank, and saved the names of all the scores that met both of these requirements. Then, iterating through the list, each score was downloaded with a call to wget and the address http://ftp.ebi.ac.uk/pub/databases/spot/pgs/scores/{SCORING_NAME}/ScoringFiles/{SCORING_NAME}.txt.gz. Each set of summary statistics then had to be checked just as the summary statistics from the GWAS Catalog were. The process is similar to before: making sure the rsIDs in the score exist within the UK Biobank, removing ambiguous SNPs, and flipping SNPs where necessary. The full script looks like:
library(vroom)
library(stringr)
library(bigsnpr)
options(warn=2)

#Read in the UKBB summary statistics
impute <- read.table("../../raw_ss/common_files/impute_rsids", header = F, stringsAsFactors = F)
colnames(impute) <- c("LOC", "SNP", "POS", "A1", "A2", "MAF", "AX", "INFO")

#Make sure the UKBB summary statistics have alleles that match ACGT and are not ambiguous
impute <- impute[nchar(impute$A1) == 1 & nchar(impute$A2) == 1,]
impute <- impute[impute$A1 %in% c("A", "C", "G", "T") & impute$A2 %in% c("A", "C", "G", "T"),]
impute <- impute[!((impute$A1 == "A" & impute$A2 == "T") |
                   (impute$A1 == "T" & impute$A2 == "A") |
                   (impute$A1 == "G" & impute$A2 == "C") |
                   (impute$A1 == "C" & impute$A2 == "G")),]

#Read in other UKBB data that comes with chromosomes so I can give a chromosome to each SNP
impute$CHR <- "X"
for(i in 1:22){
  temp <- readRDS(paste0("~/athena/ukbiobank/qc/imputed/chr", i, ".RDS"))
  impute$CHR[impute$SNP %in% temp[,2]] <- i
}
impute <- impute[impute$CHR %in% 1:22,]
impute$SUPERPOS <- paste0(impute$CHR, "_", impute$POS)
#Now iterating through each of the score summary statistics
for(filename in list.files("raw_other_ss/")){
  #read the summary statistic in
  ss <- read.table(paste0("raw_other_ss/", filename), stringsAsFactors=F, header=T, sep='\t')

  #require both a column for major and minor allele, if only one exists refining is not possible
  if("effect_allele" %in% colnames(ss) & "reference_allele" %in% colnames(ss)){
    #limit SNPs to those with ACGT
    ss$effect_allele <- toupper(ss$effect_allele)
    ss$reference_allele <- toupper(ss$reference_allele)
    ss <- ss[nchar(ss$effect_allele) == 1 & nchar(ss$reference_allele) == 1,]
    ss <- ss[ss$effect_allele %in% c("A", "C", "G", "T") & ss$reference_allele %in% c("A", "C", "G", "T"),]

    #remove ambiguous SNPs
    ss <- ss[!((ss$effect_allele == "A" & ss$reference_allele == "T") |
               (ss$effect_allele == "T" & ss$reference_allele == "A") |
               (ss$effect_allele == "G" & ss$reference_allele == "C") |
               (ss$effect_allele == "C" & ss$reference_allele == "G")),]

    #if rsID is in the ss use it to match and sort to the UKBB
    #if not then use the chromosome and position (I already checked that all scores' ref genomes match UKBB)
    if("rsID" %in% colnames(ss)){
      ss <- ss[!is.na(ss$rsID),]
      ss$rsID <- tolower(ss$rsID)
      ss <- ss[ss$rsID %in% impute$SNP,]
      sub_impute <- impute[impute$SNP %in% ss$rsID,]
      ss <- ss[order(ss$rsID)[rank(sub_impute$SNP)],]
    } else {
      ss$SUPERPOS <- paste0(ss$chr_name, "_", ss$chr_position)
      ss <- ss[ss$SUPERPOS %in% impute$SUPERPOS,]
      sub_impute <- impute[impute$SUPERPOS %in% ss$SUPERPOS,]
      ss <- ss[order(ss$SUPERPOS)[rank(sub_impute$SUPERPOS)],]
      ss$rsID <- sub_impute$SNP
    }

    #Change the column names and then use the snp_match function from bigsnpr to flip alleles
    colnames(sub_impute) <- c("loc", "rsid", "pos", "a0", "a1", "maf", "ax", "info", "chr", "superpos")
    colnames(ss)[colnames(ss) == "effect_allele"] <- "a0"
    colnames(ss)[colnames(ss) == "reference_allele"] <- "a1"
    colnames(ss)[colnames(ss) == "rsID"] <- "rsid"
    colnames(ss)[colnames(ss) == "effect_weight"] <- "beta"
    ss$pos <- sub_impute$pos
    ss$chr <- sub_impute$chr
    ss <- snp_match(ss, sub_impute)

    #write out the refined summary statistic
    out_ss <- data.frame(chr = ss$chr, rsid = ss$rsid.ss, A1 = ss$a0, beta = ss$beta)
    new_name <- strsplit(filename, ".", fixed = T)[[1]][1]
    write.table(out_ss, paste0("refine_other_ss/", new_name, ".new.ss"), col.names = T, row.names = F, sep = '\t', quote=F)
  }
}
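For reference, the download step described before the script (a wget call against the PGS Catalog FTP address for each saved score name) can be sketched as follows. The two IDs are real entries from the score list below; the loop itself is an illustration, not the exact script used.

```python
# Hypothetical sketch of the PGS Catalog download loop: build the FTP address
# for each saved score name and fetch it with wget.
import subprocess

score_names = ["PGS000007", "PGS000011"]  # names saved from the PGS Catalog search

for name in score_names:
    url = ("http://ftp.ebi.ac.uk/pub/databases/spot/pgs/scores/"
           + name + "/ScoringFiles/" + name + ".txt.gz")
    print(url)
    # subprocess.run(["wget", url])  # uncomment to actually download
```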
Within this process a key limitation was the requirement that both the major and minor allele be present. Approximately 30% of scores did not have both and therefore could not be used in scoring. While I considered relaxing this requirement, making sure the alleles are flipped correctly (which requires both alleles) seemed too important a step to throw out. In the end 24 sets of summary statistics were refined. For the record their names are:
score_names <- read.table("../analyze_score/other_scores/score_names")
print(score_names)
## V1 V2 V3 V4
## 1 PGS000007.new.ss PGS000014.new.ss PGS000020.new.ss PGS000039.new.ss
## 2 PGS000011.new.ss PGS000015.new.ss PGS000036.new.ss PGS000044.new.ss
## 3 PGS000013.new.ss PGS000016.new.ss PGS000037.new.ss PGS000045.new.ss
## V5 V6 V7 V8
## 1 PGS000046.new.ss PGS000058.new.ss PGS000082.new.ss PGS000195.new.ss
## 2 PGS000047.new.ss PGS000067.new.ss PGS000084.new.ss PGS000196.new.ss
## 3 PGS000052.new.ss PGS000072.new.ss PGS000194.new.ss PGS000199.new.ss
With the summary statistics made, I continued on with scoring, following nearly the same process as in the previous scoring section. In short, I iterated through each score and chromosome, generating a per-chromosome score, and at the end summed all of the chromosomes together to generate one score. To run in parallel a control script is necessary, which looks very similar to the control script for adjusting summary statistics.
maxGo=3
rm temp_files/poss_go
cat ~/athena/doc_score/qc/cv_files/train_eid.0.2.txt | cut -f1 > temp_files/brit_eid
cat ~/athena/doc_score/qc/cv_files/test_eid.0.8.txt | cut -f1 >> temp_files/brit_eid
for (( i=1; i<=$maxGo; i++ )); do
echo $i >> temp_files/poss_go
done
echo 1 > temp_files/counter
ls refine_other_ss/ | while read scorename;do
for cchr in {1..22};do
counter_var=`cat temp_files/counter`
echo $counter_var
./simple_score.sh $scorename $cchr $counter_var &> logs/log.${counter_var}.log &
echo $counter_var + 1 | bc > temp_files/counter
sleep $(( ( RANDOM % 10 ) + 1 ))
goOn=False
while [ $goOn == "False" ]; do
openSlots=`cat temp_files/poss_go | wc -l`
sizedir=`du temp_files/ | cut -f1`
if [ $openSlots -gt 0 ]; then
if [ $sizedir -lt 20164096 ];then
echo NOW WE CAN GO
goOn=True
fi
else
echo MUST WAIT FOR ROOM TO GO
sleep $(( ( RANDOM % 5 ) + 1 ))
fi
done
done
done
The key line within this control script is the one that launches simple_score.sh. This sub-script extracts the necessary genotypic information from the larger UKBB imputed file, and then uses PLINK to create the score.
ss_name=$1
chr=$2
i=$3
echo $ss_name $chr $i
#Just logistics around controlling parallelism
go_num=`head -1 temp_files/poss_go`
grep -v -w $go_num temp_files/poss_go > temp_files/temp_poss
mv temp_files/temp_poss temp_files/poss_go
if [ ! -e small_score_files/score.${ss_name}.${chr}.profile.zst ];then
len=`cat refine_other_ss/${ss_name} | awk -v var="$chr" '$1 == var {print $0}' | wc -l`
if [ $len != 0 ];then
cat refine_other_ss/${ss_name} | awk -v var="$chr" '$1 == var {print $0}' > temp_files/ss.${i}
cat temp_files/ss.${i} | cut -f2 > temp_files/rsids.${i}
bgenix -g ~/athena/ukbiobank/imputed/ukbb.${chr}.bgen -incl-rsids temp_files/rsids.${i} > temp_files/temp.${i}.bgen
plink2_new --memory 12000 --threads 12 --bgen temp_files/temp.${i}.bgen ref-first --sample ~/athena/ukbiobank/imputed/ukbb.${chr}.sample --keep-fam temp_files/brit_eid --make-bed --out temp_files/geno.${i}
plink --memory 12000 --threads 12 --bfile temp_files/geno.${i} --keep-allele-order --score temp_files/ss.${i} 2 3 4 sum --out small_score_files/score.${ss_name}.${chr}
zstd --rm small_score_files/score.${ss_name}.${chr}.profile
fi
fi
rm -f temp_files/ss.${i}
rm -f temp_files/rsids.${i}
rm -f temp_files/temp.${i}.bgen
rm -f temp_files/geno.${i}.*
#Just logistics around controlling parallelism
echo $go_num >> temp_files/poss_go
Finally, all of the scores are added together.
library(data.table)

all_files <- list.files("small_score_files/", pattern = "profile")
split_files <- strsplit(all_files, ".", fixed = T)
all_author <- unlist(lapply(split_files, function(x) x[2]))
all_chr <- unlist(lapply(split_files, function(x) x[5]))

brit_eid <- read.table("temp_files/brit_eid", stringsAsFactors=F)
new_name <- unique(all_author)
all_score <- matrix(0, nrow = nrow(brit_eid), ncol = length(new_name))
colnames(all_score) <- new_name

for(i in 1:length(all_files)){
  system(paste0("zstd -d small_score_files/", all_files[i]))
  sub_score <- as.data.frame(fread(paste0("small_score_files/", substr(all_files[i], 1, nchar(all_files[i])-4))))
  system(paste0("rm small_score_files/", substr(all_files[i], 1, nchar(all_files[i])-4)))
  all_score[,new_name == all_author[i]] <- all_score[,new_name == all_author[i]] + sub_score$SCORESUM
}

all_score <- all_score[,colSums(all_score) != 0]

next_file <- length(grep("RDS", list.files("final_scores", "all_score"))) + 1

write.table(all_score, paste0("final_scores/all_score.", next_file), row.names = F, col.names = T, quote = F, sep = ' ')
saveRDS(all_score, paste0("final_scores/all_score.", next_file, ".RDS"))
write.table(colnames(all_score), paste0("final_scores/done_names.", next_file), row.names = F, col.names = F, quote = F)
system(paste0("gzip final_scores/all_score.", next_file))
If this process seemed a little too fast, I encourage you to look at Chapter 5, which goes over many scoring processes and caveats. The end products are 24 polygenic risk scores that may be used to supplement future analyses of the phenotypes already discussed.
7.4 Other Covariates
In order to get a better picture of how well the PGS is working, we collect other covariates that may be included in a linear prediction model. While this is a good idea, complications quickly emerge in determining which covariates should be included. One way forward is simply including a wide variety of covariates and then paring them away with feature selection. While that is a reasonable approach, and one I may follow in another project, it makes overfitting quite possible and is a lot of work. Therefore I will only include covariates that have been declared significant. The sources I am consulting for this significance are the Mayo Clinic, the NIH and the respective professional organizations. The list of covariates for each disease under analysis is:
covar_names <- read.table("../analyze_score/descript_defs/author_covar", sep = "\t")
print(covar_names)
## V1
## 1 Bentham
## 2 Christophersen
## 3 Demenais
## 4 Demontis
## 5 Dubois
## 6 Gormley
## 7 IMSGC
## 8 Jin
## 9 Kottgen
## 10 Liu1
## 11 Liu2
## 12 Mahajan
## 13 Malik
## 14 Michailidou
## 15 Namjou
## 16 Nikpay
## 17 Okada
## 18 Onengut
## 19 Phelan
## 20 Rheenen
## 21 Schumacher
## 22 Shah
## 23 Sklar
## 24 Tsoi
## 25 Wray
## 26 Xie
## V2
## 1 sex
## 2 age,hypertension,congenital heart disease,cardiac arrest,coronary artery disease,alcohol,sleep apnea
## 3 allergic rhinitis,smoking,bmi
## 4 none
## 5 type 1 diabetes
## 6 age,sex,age menopause,hormone replacement therapy
## 7 age,sex,epstein barr virus
## 8 use of sun protection,melanona,non-hodgkins lymphoma
## 9 age,sex,obesity,hypertension,diabetes
## 10 smoking
## 11 smoking
## 12 age,sex,bmi,exercise,hypertension,hypocholesteroemia
## 13 age,sex,bmi,age started oral contraceptive,hypertension,hypocholesteroemia,smoking,alcohol,diabetes
## 14 age,sex,bmi,alcohol,age menarche,age menopause,pregnant
## 15 age,obesity,hypertension,hypocholesteroemia,diabetes
## 16 age,sex,bmi,hypocholesteroemia,hypertension,diabetes,smoking
## 17 age,sex,smoking,obesity,pregnant
## 18 none
## 19 age,bmi,hormone replacement therapy,pregnant,breast cancer
## 20 age,diabetes,obesity
## 21 age,obesity
## 22 cardiac arrest,hypertension,congenital heart defects,obesity,diabetes,arrythmia
## 23 none
## 24 smoking
## 25 alcohol,hormone replacement therapy,age menopause
## 26 age,sex,sle
These other covariates are obtained in two very different ways, as there are two very different types of covariates: those from the ICD records and those that are not. We will start with the non-ICD covariates, as they are simpler. We begin by reading in all of the UK Biobank phenotypes, subsetting so the EIDs are held within all of the files, then sorting so all of the EIDs are in the same order. We then extract the columns that correspond to the author covariates shown above. The second part of the non-ICD covariate extraction process is some manual data cleaning. Some of the covariates have NA entries, which we either fill in with the mean of the remaining non-NA entries, or we select the mode or another obvious answer that assumes the person did not do something. For example, we take the mean for BMI and assume hormone replacement therapy was not taken. A final data-cleaning step sets female-specific features (such as age of menopause) to NA for male individuals. We lastly write the covariates out to a text file. The exact process is shown below.
library(vroom)

defs <- read.table("../descript_defs/covar_defs_singlecol", stringsAsFactors=F, header = T)
preg_codes <- strsplit(defs[which(defs[,1] == "pregnant"),2], ",")[[1]]
use_codes <- c(defs$single_col[-which(defs[,1] == "pregnant")], preg_codes)

phen <- as.data.frame(vroom("~/athena/ukbiobank/phenotypes/ukb26867.csv.gz", delim = ","))
sub_phen_1 <- phen[,colnames(phen) %in% c(use_codes, "eid")]

phen <- as.data.frame(vroom("~/athena/ukbiobank/phenotypes/ukb33822.csv.gz", delim = ","))
sub_phen_2 <- phen[,colnames(phen) %in% c(use_codes, "eid")]

phen <- as.data.frame(vroom("~/athena/ukbiobank/phenotypes/ukb42385.csv.gz", delim = ","))
sub_phen_3 <- phen[,colnames(phen) %in% c(use_codes, "eid")]

sub_phen_1 <- sub_phen_1[sub_phen_1$eid %in% sub_phen_2$eid & sub_phen_1$eid %in% sub_phen_3$eid,]
sub_phen_2 <- sub_phen_2[sub_phen_2$eid %in% sub_phen_1$eid & sub_phen_2$eid %in% sub_phen_3$eid,]
sub_phen_3 <- sub_phen_3[sub_phen_3$eid %in% sub_phen_2$eid & sub_phen_3$eid %in% sub_phen_1$eid,]

sub_phen_2 <- sub_phen_2[order(sub_phen_2$eid)[rank(sub_phen_1$eid)],]
sub_phen_3 <- sub_phen_3[order(sub_phen_3$eid)[rank(sub_phen_1$eid)],]

new_phen <- cbind(sub_phen_1, sub_phen_2[,-1, drop = F], sub_phen_3[,-1, drop = F])
preg_ans <- (new_phen[["3140-0.0"]] == 1 & !is.na(new_phen[["3140-0.0"]])) | (!is.na(new_phen[["2754-0.0"]]))
preg_ans <- preg_ans*1

inds <- (1:nrow(defs))[-10]
for(i in inds){
  colnames(new_phen)[colnames(new_phen) == defs[i,2]] <- defs[i,1]
}
new_phen <- new_phen[,!grepl("0.0", colnames(new_phen))]
new_phen$pregnant <- preg_ans

##################### CLEAN BY HAND ##############################
new_phen$alcohol[is.na(new_phen$alcohol) | new_phen$alcohol == -3] <- 4
new_phen$smoking[is.na(new_phen$smoking) | new_phen$smoking == -3] <- 0
new_phen <- new_phen[,-which(colnames(new_phen) == "alcohol.1")]
new_phen <- new_phen[,-which(colnames(new_phen) == "smoking.1")]

new_phen$bmi[is.na(new_phen$bmi)] <- mean(new_phen$bmi, na.rm = T)

new_phen$exercise[is.na(new_phen$exercise) | new_phen$exercise == -3 | new_phen$exercise == -1] <- 2

new_phen$use_of_sun_protection[is.na(new_phen$use_of_sun_protection) | new_phen$use_of_sun_protection == -3 | new_phen$use_of_sun_protection == -1] <- 2

new_phen$age_menarche[is.na(new_phen$age_menarche) | new_phen$age_menarche == -1 | new_phen$age_menarche == -3] <- mean(new_phen$age_menarche[!(is.na(new_phen$age_menarche) | new_phen$age_menarche == -1 | new_phen$age_menarche == -3 | new_phen$sex == 1)])
new_phen$age_menarche[new_phen$sex == 1] <- NA

new_phen$hormone_replacement_therapy[is.na(new_phen$hormone_replacement_therapy) | new_phen$hormone_replacement_therapy == -1 | new_phen$hormone_replacement_therapy == -3] <- 0
new_phen$hormone_replacement_therapy[new_phen$sex == 1] <- NA

new_phen$had_menopause[is.na(new_phen$had_menopause) | new_phen$had_menopause == 2 | new_phen$had_menopause == 3] <- 0
new_phen$had_menopause[new_phen$sex == 1] <- NA
new_phen$age_menopause[new_phen$had_menopause == 0] <- NA
new_phen$age_menopause[new_phen$sex == 0 & is.na(new_phen$age_menopause)] <- mean(new_phen$age_menopause, na.rm = T)
new_phen$age_menopause[new_phen$sex == 1] <- NA

new_phen$ever_used_oral_contraceptive[is.na(new_phen$ever_used_oral_contraceptive) | new_phen$ever_used_oral_contraceptive == -3 | new_phen$ever_used_oral_contraceptive == -1] <- 0
new_phen$ever_used_oral_contraceptive[new_phen$sex == 1] <- NA
new_phen$age_started_oral_contraceptive[new_phen$ever_used_oral_contraceptive == 0 | new_phen$age_started_oral_contraceptive < 10] <- NA
new_phen$age_started_oral_contraceptive[new_phen$sex == 0 & is.na(new_phen$age_started_oral_contraceptive)] <- mean(new_phen$age_started_oral_contraceptive, na.rm = T)
new_phen$age_started_oral_contraceptive[new_phen$sex == 1] <- NA

new_phen$epstein_barr_virus[is.na(new_phen$epstein_barr_virus)] <- 0.5

new_phen$obesity <- 0
new_phen$obesity[new_phen$bmi > 30] <- 1
##########################################################################
write.table(new_phen, "covar_data/single_col_covars", row.names = F, col.names = T, quote = F, sep = "\t")
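The `order()[rank()]` trick above aligns the rows of `sub_phen_2` and `sub_phen_3` to the EID order of `sub_phen_1`. The same alignment can be sketched in Python with `argsort` (toy IDs, purely illustrative):

```python
import numpy as np

# Toy IDs: align the rows of table b to the ID order of table a,
# assuming both contain exactly the same set of IDs.
a_ids = np.array([5, 1, 3])
b_ids = np.array([1, 3, 5])
b_vals = np.array([10, 30, 50])   # row i of b belongs to b_ids[i]

# argsort(argsort(x)) gives ranks, so this mirrors R's order(b)[rank(a)]:
# position in b of the element whose sorted rank matches that of a[i].
idx = np.argsort(b_ids)[np.argsort(np.argsort(a_ids))]
b_aligned = b_vals[idx]
print(b_aligned)  # b's rows now follow a_ids: 5, 1, 3
```

The same idiom generalizes to any two tables sharing a key column, which is why it appears twice in the R block above.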
The second group of the other covariates comes from the ICD (and self-reported disease) records. For each of the disease traits that are considered covariates for any of the diseases under the main analysis we need an ICD definition. This process was simply carried out through table and internet look-ups. The definitions used are as follows:
covar_names <- read.table("../analyze_score/descript_defs/covar_defs", sep = "\t", header = T)
print(covar_names)
## name single_col ICD10 ICD9
## 1 age 34-0.0 <NA> <NA>
## 2 sex 31-0.0 <NA> <NA>
## 3 hypertension <NA> I10 401
## 4 congenital_heart_disease <NA> Q24 746
## 5 cardiac_arrest <NA> I21,I46 410,4275
## 6 coronary_artery_disease <NA> I25 414
## 7 alcohol 1558-0.0 <NA> <NA>
## 8 sleep_apnea <NA> G473 <NA>
## 9 allergic_rhinitis <NA> J30 477
## 10 smoking 20116-0.0 <NA> <NA>
## 11 bmi 21001-0.0 <NA> <NA>
## 12 type_1_diabetes <NA> E10 250
## 13 exercise 884-0.0 <NA> <NA>
## 14 hypocholesteroemia <NA> E780 <NA>
## 15 age_started_oral_contraceptive 2784-0.0 <NA> <NA>
## 16 age_menarche 2714-0.0 <NA> <NA>
## 17 age_menopause 3581-0.0 <NA> <NA>
## 18 pregnant 2754-0.0,3140-0.0 <NA> <NA>
## 19 hormone_replacement_therapy 2814-0.0 <NA> <NA>
## 20 breast_cancer <NA> C50 2330
## 21 epstein_barr_virus 23053-0.0 <NA> <NA>
## 22 use_of_sun_protection 2267-0.0 <NA> <NA>
## 23 diabetes <NA> E10,E11 250
## 24 arrhythmia <NA> I47,I48 427
## 25 sle <NA> M32 7100
## 26 melanoma <NA> C43 172
## 27 non-hodgkins_lymphoma <NA> C82,C83 <NA>
## self_diag self_diag_cancer
## 1 <NA> NA
## 2 <NA> NA
## 3 1065,1072 NA
## 4 <NA> NA
## 5 <NA> NA
## 6 <NA> NA
## 7 <NA> NA
## 8 1123 NA
## 9 1387 NA
## 10 <NA> NA
## 11 <NA> NA
## 12 1222 NA
## 13 <NA> NA
## 14 1473 NA
## 15 <NA> NA
## 16 <NA> NA
## 17 <NA> NA
## 18 <NA> NA
## 19 <NA> NA
## 20 <NA> 1002
## 21 <NA> NA
## 22 <NA> NA
## 23 1220,1222,1223 NA
## 24 1077 NA
## 25 1381 NA
## 26 <NA> 1059
## 27 <NA> 1053
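Note that the ICD columns above list 3-character prefixes (e.g. I21 for cardiac arrest), while hesin records carry more specific codes such as I219. Matching therefore truncates each record to its first three characters, as in this small sketch (the record codes are hypothetical):

```python
# Definition for cardiac arrest from the table above; records are made up.
icd10_def = ["I21", "I46"]
records = ["I219", "E119", "I461", "I250"]

# Truncate each record to a 3-character prefix before matching,
# mirroring the x[0:3] slicing used in the extraction script.
hits = [code for code in records if code[0:3] in icd10_def]
print(hits)
```

The extraction script also checks the full untruncated codes, so definitions given at finer granularity (such as G473 for sleep apnea) still match exactly.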
While it would be possible to get the ICD covariates for all individuals for all diseases, this process is quite time intensive, and therefore I have decided to only get the relevant disease covariates for each main disease under analysis. The specific covariates matched to each disease under analysis are:
covar_names <- read.table("../analyze_score/descript_defs/author_to_covar_hesin", sep = "\t", header = F)
print(covar_names)
## V1
## 1 Bentham
## 2 Christophersen
## 3 Demenais
## 4 Demontis
## 5 Dubois
## 6 Gormley
## 7 IMSGC
## 8 Jin
## 9 Kottgen
## 10 Liu1
## 11 Liu2
## 12 Mahajan
## 13 Malik
## 14 Michailidou
## 15 Namjou
## 16 Nikpay
## 17 Okada
## 18 Onengut
## 19 Phelan
## 20 Rheenen
## 21 Schumacher
## 22 Shah
## 23 Sklar
## 24 Tsoi
## 25 Wray
## 26 Xie
## V2
## 1 <NA>
## 2 hypertension,congenital_heart_disease,cardiac_arrest,coronary_artery_disease,sleep_apnea
## 3 allergic_rhinitis
## 4 <NA>
## 5 type_1_diabetes
## 6 <NA>
## 7 <NA>
## 8 melanona,non-hodgkins_lymphoma
## 9 hypertension,diabetes
## 10 <NA>
## 11 <NA>
## 12 hypertension,hypocholesteroemia
## 13 hypertension,hypocholesteroemia,diabetes
## 14 <NA>
## 15 hypertension,hypocholesteroemia,diabetes
## 16 hypocholesteroemia,hypertension,diabetes
## 17 <NA>
## 18 <NA>
## 19 breast_cancer
## 20 diabetes
## 21 <NA>
## 22 cardiac_arrest,hypertension,congenital_heart_disese,diabetes,arrythmia
## 23 <NA>
## 24 <NA>
## 25 <NA>
## 26 <NA>
Now that we know exactly what we are getting, let’s actually get the values. As I am looping through a lot of data I have opted to use a Python script rather than R. The script itself is highly similar to the phenotype extraction shown in the previous section, so I won’t go into too much detail on how it operates; I will instead just highlight a few things. First, I want to make sure that each disease covariate occurs before the possible date of the primary disease. Therefore I read in the dates of the primary disease and use them to limit which diagnoses can count toward the covariates. Second, in this instance I may be looking at more than one disease definition (whereas before I only needed one definition), so within the loop over individuals I add a loop over all possible covariate definitions. Third, whereas the phenotype definitions kept six columns, one for each possible diagnosis source, here I assume that an individual has the disease if any positive covariate appears in either the ICD or self-reported sources. Other than these changes everything is the same, as I attempt to intelligently iterate through all individuals, pulling out the meaningful ICD codes and subsetting by date. The exact code is shown below:
import numpy as np
import datetime
import time
import pickle
import gzip
import os
import sys
import re

author = "Bentham"
phen_method = "all"

#Still need to sort everything so that the extEid process works
def normRead(fileName, withHeader = True, delim = '\t', removeQuote = False):
    totalData = []
    if fileName[-2:] == "gz":
        with gzip.open(fileName,"r") as f:
            for line in f:
                totalData.append(line.decode().strip().split(delim))
    else:
        with open(fileName,"r",encoding = "latin-1") as f:
            for line in f.read().splitlines():
                if removeQuote:
                    line = line.replace('"', '').strip()
                totalData.append(line.split(delim))
    if withHeader:
        header = totalData[0]
        del totalData[0]
    else:
        header = None
    totalData = np.array(totalData)
    return(totalData,header)

covar_defs, covar_header = normRead("../descript_defs/covar_defs_hesin", False)
use_defs = []
for i in range(covar_defs.shape[0]):
    split_defs = [j.split(",") for j in covar_defs[i,2:]]
    use_defs.append(dict(zip(["ICD10", "ICD9", "selfrep", "cancer"], split_defs)))

date_ass, ass_header = normRead("../construct_defs/date_assessed.csv", True, ",", True)
self_rep, self_rep_header = normRead("../construct_defs/self_report_diag.csv", True, ",", True)
cancer_rep, cancer_rep_header = normRead("../construct_defs/self_report_cancer.csv", True, ",", True)
big_eid, eid_head = normRead("../construct_defs/eid.csv", True, ",", True)

date_ass = date_ass[big_eid[:,0].argsort(),:]
self_rep = self_rep[big_eid[:,0].argsort(),:]
cancer_rep = cancer_rep[big_eid[:,0].argsort(),:]
big_eid = big_eid[big_eid[:,0].argsort(),:]

cancer_phase = [x.split("-")[1].split(".")[0] for x in cancer_rep_header]
self_phase = [x.split("-")[1].split(".")[0] for x in self_rep_header]

diag, head_diag = normRead("/home/kulmsc/athena/ukbiobank/hesin/hesin_diag.txt")
hesin, head_hesin = normRead("/home/kulmsc/athena/ukbiobank/hesin/hesin.txt")
diag = diag[diag[:,0].argsort(),:]
hesin = hesin[hesin[:,0].argsort(),:]
diag_eid = np.unique(diag[:,0])
diag_max = diag.shape[0] - 1

diag_eid_list = diag[:,0].tolist()

print("reading")
known_diag, known_head = normRead("../construct_defs/pheno_defs/diag." + author.lower() + ".txt.gz", False, " ")
known_time, known_head = normRead("../construct_defs/pheno_defs/time." + author.lower() + ".txt.gz", False, " ")
new_time = np.tile(datetime.datetime.strptime("31/12/2020", '%d/%m/%Y'), (known_time.shape[0], known_time.shape[1]))
print("read in")
for i in range(known_time.shape[0]):
    for j in range(known_time.shape[1]):
        if known_time[i,j] != '__________':
            if j == 0 or j == 1 or j == 5:
                new_time[i,j] = datetime.datetime.strptime(known_time[i,j], "%Y-%m-%d")
            else:
                new_time[i,j] = datetime.datetime.strptime(known_time[i,j], "%d/%m/%Y")
known_time = new_time

if phen_method == "icd":
    known_diag = known_diag[:,2:4]
    known_time = known_time[:,2:4]
elif phen_method == "selfrep":
    known_diag = known_diag[:,0:2]
    known_time = known_time[:,0:2]
elif phen_method == "icd_selfrep":
    known_diag = known_diag[:,0:4]
    known_time = known_time[:,0:4]
elif phen_method == "double":
    known_diag = known_diag.astype("int")
    for_double = np.sum(known_diag, axis = 1)

final_diag = []
final_time = []
for i in range(known_diag.shape[0]):
    if phen_method == "double":
        if for_double[i] > 1:
            final_diag.append(1)
            final_time.append(min(known_time[i,:]))
        else:
            final_diag.append(0)
            final_time.append("NA")
    else:
        if "1" in known_diag[i,:]:
            final_diag.append(1)
            final_time.append(min(known_time[i,:]))
        else:
            final_diag.append(0)
            final_time.append("NA")

#STILL NEED TO FIGURE OUT HOW I WANT TO SPLIT THIS - right now assume I dont
start_eid_diag = 0
df_occur = np.zeros((big_eid.shape[0], covar_defs.shape[0]))
df_date = np.tile("_________", (big_eid.shape[0], covar_defs.shape[0]))

for ind in range(big_eid.shape[0]):
    curr_eid = big_eid[ind][0]
    if ind % 10000 == 0:
        print(ind)
    for trait_ind in range(covar_defs.shape[0]):
        #CANCER SELF REPORT
        if covar_defs[trait_ind, 5] != "NA":
            if any(np.isin(use_defs[trait_ind]["cancer"], cancer_rep[ind,:])):
                locs = np.where(np.isin(cancer_rep[ind,:], covar_defs[trait_ind, 5]))[0]
                poss_date = date_ass[ind, int(cancer_phase[locs[0]])]
                poss_date = datetime.datetime.strptime(poss_date, "%Y-%m-%d")
                if poss_date < final_time[ind]:
                    df_occur[ind, trait_ind] = 1
                    df_date[ind, trait_ind] = poss_date

        #NON-CANCER SELF REPORT
        if covar_defs[trait_ind, 4] != "NA":
            if any(np.isin(use_defs[trait_ind]["selfrep"], self_rep[ind,:])):
                locs = np.where(np.isin(self_rep[ind,:], covar_defs[trait_ind, 4]))[0]
                poss_date = date_ass[ind, int(self_phase[locs[0]])]
                poss_date = datetime.datetime.strptime(poss_date, "%Y-%m-%d")
                if poss_date < final_time[ind]:
                    df_occur[ind, trait_ind] = 1
                    df_date[ind, trait_ind] = poss_date

        #HESIN DIAG
        if curr_eid in diag_eid_list:
            if curr_eid != diag_eid_list[-1]:
                next_eid = np.unique(diag[start_eid_diag:(start_eid_diag+7000),0])[1]
                ext_eid = diag_eid_list.index(next_eid)
            else:
                ext_eid = diag_max

            #ICD-9
            use_short_icd9_diag = [x[0:3] for x in diag[start_eid_diag:ext_eid,4]]
            if any(np.isin(use_defs[trait_ind]["ICD9"], use_short_icd9_diag)):
                short_icd9_locs = np.where(np.isin(use_short_icd9_diag, use_defs[trait_ind]["ICD9"]))[0][0]
            else:
                short_icd9_locs = 10000
            if any(np.isin(use_defs[trait_ind]["ICD9"], diag[start_eid_diag:ext_eid,4])):
                long_icd9_locs = np.where(np.isin(diag[start_eid_diag:ext_eid,4], use_defs[trait_ind]["ICD9"]))[0][0]
            else:
                long_icd9_locs = 10000
            if long_icd9_locs != 10000 or short_icd9_locs != 10000:
                icd9_locs = min((long_icd9_locs, short_icd9_locs))
                icd9_ins_index = diag[start_eid_diag+icd9_locs,1]

                raw_date = hesin[np.logical_and(hesin[:,0] == curr_eid, hesin[:,1] == icd9_ins_index),5][0]
                if raw_date == "":
                    raw_date = hesin[np.logical_and(hesin[:,0] == curr_eid, hesin[:,1] == icd9_ins_index),21][0]
                if raw_date == "":
                    raw_date = hesin[np.logical_and(hesin[:,0] == curr_eid, hesin[:,1] == icd9_ins_index),4][0]

                df_occur[ind, trait_ind] = 1
                df_date[ind, trait_ind] = raw_date

            #ICD-10
            use_short_icd10_diag = [x[0:3] for x in diag[start_eid_diag:ext_eid,6]]
            if any(np.isin(use_defs[trait_ind]["ICD10"], use_short_icd10_diag)):
                short_icd10_locs = np.where(np.isin(use_short_icd10_diag, use_defs[trait_ind]["ICD10"]))[0][0]
            else:
                short_icd10_locs = 10000
            if any(np.isin(use_defs[trait_ind]["ICD10"], diag[start_eid_diag:ext_eid,6])):
                long_icd10_locs = np.where(np.isin(diag[start_eid_diag:ext_eid,6], use_defs[trait_ind]["ICD10"]))[0][0]
            else:
                long_icd10_locs = 10000
            if long_icd10_locs != 10000 or short_icd10_locs != 10000:
                icd10_locs = min((long_icd10_locs, short_icd10_locs))
                icd10_ins_index = diag[start_eid_diag+icd10_locs,1]

                raw_date = hesin[np.logical_and(hesin[:,0] == curr_eid, hesin[:,1] == icd10_ins_index),5][0]
                if raw_date == "":
                    raw_date = hesin[np.logical_and(hesin[:,0] == curr_eid, hesin[:,1] == icd10_ins_index),21][0]
                if raw_date == "":
                    raw_date = hesin[np.logical_and(hesin[:,0] == curr_eid, hesin[:,1] == icd10_ins_index),4][0]

                df_occur[ind, trait_ind] = 1
                df_date[ind, trait_ind] = raw_date

    if curr_eid in diag_eid_list:
        start_eid_diag = ext_eid

print("end")
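The core date restriction in the script above can be distilled to a few lines: a covariate diagnosis only counts when its date precedes the date of the primary disease. A minimal sketch with hypothetical dates:

```python
import datetime

# Hypothetical dates -- the primary disease and two covariate diagnoses.
primary_date = datetime.date(2010, 6, 1)
covar_dates = [datetime.date(2008, 3, 15), datetime.date(2012, 1, 4)]

# The covariate is recorded (1) only if some diagnosis precedes the
# primary disease date, as the script enforces for each individual.
has_covar = int(any(d < primary_date for d in covar_dates))
print(has_covar)
```

Restricting covariates this way keeps the prediction model honest: a diagnosis made after the primary disease cannot act as a pre-disease risk factor.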