title | output | ||||||
---|---|---|---|---|---|---|---|
Coxph - A guide on how to conduct Cox Survival analysis in R |
|
In this example we will analyse the survival profile of breast cancer patients from the TCGA (BRCA TCGA). Specifically, we will look at the impact of age @ diagnosis and cancer stage (I, II, III, IV or Terminal) on survival.
When analysis survival differentials in a data set, a vast range of variables will be important in creating sound and significant models. Largely, these fall in 3 categories:
- Covariates: any exogenous factor that enrich the analysis. Ex: smoking status, sex, blood biomarker levels, disease status, etc...
- Events: the outcome of interest, has to be binary (0=outcome not happened, 1=outcome happened) or at the very least it has to be ordinal (Ex: outcome is disease severity: 1,2 .., n for some very small n). In the example of this document, the outcome of interest is death by breast cancer. Any patient that dies of breast cancer related complications within the specified observation time is marked with outcome=1, otherwise they are marked with outcome=0.
- If a patient dies of anything other than breast cancer (ex: CVD, respiratory disease, accident,...), they have outcome=0
- If a patient dies of breast cancer after the end of the observation period, they also have outcome=0
- If a patient is lost in follow up (ex: they disappeared, changed hospitals and records got lost,...) before the end of the observation period, no matter their fate, they are marked with outcome=0 - we refer to that as censoring (more on that below)
- Time line & Cencoring: the variables that delimitate the start and end of the observation period:
- The start is set to when patients/subjects enter the study. In the context of a RCT or experiment, that's when the experiment start, and that time is identical for everyone (T_start = 0). In the context of a longitudinal study, with data coming from various sources like TCGA, this is set as the time patients are put under observations, i.e. when they are diagnosed and begin their treament journey.
- The end is set to be the end of the observation period. For a RCT/experiment, this is set by the scientists who made the experimental design (T_end =
$\gamma$ for everyone). For longitudinal analyses like ours, we have to look at when most subjects are lost to follow-up. That won't capture absolutely everyone, but that's what we want by design - by doing that we implicitly exclude "outliers".- Note: we can (should) also do that with the start of the observation period when doing left-censoring.
- The time unit of the study has to be well specified and uniform across the data set of interest. Depending on the experimental design, it can be hours, days, months or even years. Making it too coarse will render unprecise results, while making it too thin will make the model over-complicated. For a cancer survival analysis, months or years are appropriate time scales.
Example: censoring from a Truck durability analysis. Here, the outcome of interest is 1st ever service; The observation period is set between June 2010 and June 2013. T_start = purchase time, note that these can also fall outside the observation period - we then "left-censor" them. T_end = time of first ever service, or end of observation period, whichever comes first - that's "right-censoring"
Any survival analysis for a factor on an outcome needs to be adjusted for confounding variables. In the context of death, **Age and gender always need to be included **. The only exceptions are if:
- everyone in the study cohort is the same age,
- the outcome of interest is sex-specific. This is the case for us here, as we are looking at women with breast cancer
It is essential to start with an overview of the study cohort with a summary table with a couple of basic statistics
library(tableone)
library(kableExtra)
#
df<-df[!is.na(df$days_to_birth),] # exclude people with no Age
df$Age.yr<-(-1)*df$days_to_birth/365 # Convert days to birth into age in yrs
#
# Define censoring time based on survival outcome - in days
df$end<-ifelse(df$vital_status=="Alive",df$days_to_last_follow_up,df$days_to_death)
# exclude people with no censoring time + those who died in negative time
df<-df[(!is.na(df$end))&(df$end>0),]
# censoring time - in years of life
df$end.yr<-df$end/365+df$Age.yr
#
#####
#
Df.table<-df[,colnames(df) != "submitter_id"]
Df.table<-Df.table %>%
rename( Censoring_Time=end, Censoring_age=end.yr)
varstofact<-c("vital_status","TumourStage")
Df.table[varstofact] <- lapply(Df.table[varstofact], factor)
Vars<-c(varstofact,"Age.yr","Censoring_Time","Censoring_age")
###
tableOne <- CreateTableOne(vars = Vars,data=Df.table)
T1<-kableone(tableOne)
T1 %>%
kable_styling(bootstrap_options = "striped", full_width = F)
Overall | |
---|---|
n | 1050 |
vital_status = Dead (%) | 145 (13.8) |
TumourStage (%) | |
I | 182 (17.3) |
II | 596 (56.8) |
III | 240 (22.9) |
IV | 20 ( 1.9) |
T | 12 ( 1.1) |
Age.yr (mean (SD)) | 58.80 (12.93) |
Censoring_Time (mean (SD)) | 1256.86 (1179.64) |
Censoring_age (mean (SD)) | 62.25 (12.78) |
With the package tableone, you can also compute summary statistics across specific groups of the cohort, complete with group means comparison p-values for each variable.
Vars<-c("TumourStage","Age.yr","Censoring_Time","Censoring_age")
tableTwo<-CreateTableOne(vars = Vars,strata="vital_status",data=Df.table)
T2<-kableone(tableTwo)
T2 %>%
kable_styling(bootstrap_options = "striped", full_width = F)
Alive | Dead | p | test | |
---|---|---|---|---|
n | 905 | 145 | ||
TumourStage (%) | <0.001 | |||
I | 166 (18.3) | 16 (11.0) | ||
II | 532 (58.8) | 64 (44.1) | ||
III | 196 (21.7) | 44 (30.3) | ||
IV | 5 ( 0.6) | 15 (10.3) | ||
T | 6 ( 0.7) | 6 ( 4.1) | ||
Age.yr (mean (SD)) | 58.36 (12.58) | 61.56 (14.72) | 0.006 | |
Censoring_Time (mean (SD)) | 1212.41 (1169.06) | 1534.27 (1211.30) | 0.002 | |
Censoring_age (mean (SD)) | 61.68 (12.42) | 65.76 (14.40) | <0.001 |
Here we will only look at refining the end of the observation period.
To evaluate an optimal obsertaion period for our cohort, we need to look at the distribution of right-censoring times across the cohort
cf<-round(quantile(df$end,0.99)) #1% top percentile
ggplot(df, aes(x=end)) +
geom_histogram() +
geom_vline(aes(xintercept=cf), color="blue", linetype="dashed", size=1) +
xlab("Time in days")
99% of our cohort die / stop treatment 5463 days from their initial diagnosis. We will set this to be the end of our observation period. Thus, we have to update the definition of the event for those who fall outside of that
df[df$end>cf,"vital_status"]<-"Alive"
table(df$vital_status) %>%
kable() %>%
kable_styling(bootstrap_options = "striped", full_width = F)
Var1 | Freq |
---|---|
Alive | 907 |
Dead | 143 |
- Tidy up the data frame and re-level the outcome variable. Coxph in R only accepts binary variables coded as
${0,1}$ in numeric format. 0= no outcome, 1=outcome.
df.model<-df[,c('TumourStage','Age.yr','end','end.yr','vital_status')]
levels(df.model$vital_status)<-c(0,1)
df.model$vital_status<-as.numeric(df.model$vital_status)-1 # outcome converted from factor to numeric
# Age variable
df.model$over65<-ifelse(df.model$Age.yr>=65,1,0)
- Check for data missingness. Make sure you have no /as few Na's as possible
kable(summary(df.model)) %>%
kable_styling(bootstrap_options = "striped", full_width = F)
TumourStage | Age.yr | end | end.yr | vital_status | over65 | |
---|---|---|---|---|---|---|
I :182 | Min. :26.59 | Min. : 1.0 | Min. :29.64 | Min. :0.0000 | Min. :0.0000 | |
II :596 | 1st Qu.:49.33 | 1st Qu.: 463.8 | 1st Qu.:52.84 | 1st Qu.:0.0000 | 1st Qu.:0.0000 | |
III:240 | Median :58.86 | Median : 871.0 | Median :62.50 | Median :0.0000 | Median :0.0000 | |
IV : 20 | Mean :58.80 | Mean :1256.9 | Mean :62.25 | Mean :0.1362 | Mean :0.3057 | |
T : 12 | 3rd Qu.:67.86 | 3rd Qu.:1691.0 | 3rd Qu.:70.67 | 3rd Qu.:0.0000 | 3rd Qu.:1.0000 | |
NA | Max. :90.06 | Max. :8605.0 | Max. :90.06 | Max. :1.0000 | Max. :1.0000 |
- Define Surv item. It needs to have at least:
- event
- rigth censoring
This is where you can also decide whether to include left-cencoring or not - this is optional however, the model can run without it.
event="vital_status"
### if right censoring only
#t="end,"
### if left + right censoring
t="Age.yr,time2=end.yr"
#
surv.gen<-paste("Surv(time=",t,",event=",event,")~",sep='')
Here we choose to do left+right censoring, where left censoring = age @ diagnosis, and right censoring = age @ censoring (death or end of follow-up, whichever comes first)
- To adjust for the effect of Age - a very strong confounder on death. This could either be included as a linear covariate in the model. But we can do better than that;
- To model the impact of Age in survival profiles as a non-linear, non-parametric effect
library(survival)
cox.0<-coxph(as.formula(paste(surv.gen,'TumourStage',sep='')), data = df.model)
summary(cox.0)
## Call:
## coxph(formula = as.formula(paste(surv.gen, "TumourStage", sep = "")),
## data = df.model)
##
## n= 1050, number of events= 143
##
## coef exp(coef) se(coef) z Pr(>|z|)
## TumourStageII 0.5112 1.6672 0.2833 1.805 0.071144 .
## TumourStageIII 1.0993 3.0022 0.2937 3.743 0.000182 ***
## TumourStageIV 2.5305 12.5596 0.3688 6.861 6.83e-12 ***
## TumourStageT 1.6335 5.1218 0.4859 3.362 0.000774 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## exp(coef) exp(-coef) lower .95 upper .95
## TumourStageII 1.667 0.59979 0.9569 2.905
## TumourStageIII 3.002 0.33309 1.6882 5.339
## TumourStageIV 12.560 0.07962 6.0960 25.877
## TumourStageT 5.122 0.19524 1.9762 13.274
##
## Concordance= 0.666 (se = 0.025 )
## Likelihood ratio test= 51.25 on 4 df, p=2e-10
## Wald test = 63.87 on 4 df, p=4e-13
## Score (logrank) test = 83.32 on 4 df, p=<2e-16
Cancer stage at diagnosis is very significant in determining survival from breast cancer, even after adjusting for age.
pretty_pres_coxph<-function(x){
x <- summary(x)
p.value<-signif(x$coefficients[,5], digits=2)
wald.test<-signif(x$coefficients[,4], digits=2)
beta<-signif(x$coefficients[,1], digits=3);#coeficient beta
HR <-signif(x$coefficients[,2], digits=3);#exp(beta)
HR.confint.lower <- signif(x$conf.int[,"lower .95"],3) #signif( ,2)
HR.confint.upper <- signif(x$conf.int[,"upper .95"],3)
HR <- paste0(HR, " (",
HR.confint.lower, "-", HR.confint.upper, ")")
res<-as.data.frame(cbind(HR, p.value))
colnames(res)<-c("HR (95% CI for HR)",
"p.value")
return(res)
}
pretty_pres_coxph(cox.0) %>%
kable() %>%
kable_styling()
HR (95% CI for HR) | p.value | |
---|---|---|
TumourStageII | 1.67 (0.957-2.9) | 0.071 |
TumourStageIII | 3 (1.69-5.34) | 0.00018 |
TumourStageIV | 12.6 (6.1-25.9) | 6.8e-12 |
TumourStageT | 5.12 (1.98-13.3) | 0.00077 |
Forest plots
library(survminer)
ggforest(cox.0,data=df.model)
KM curves can be very useful for viewing survival profiles in a graphical manner, for computing numbers at risk by group X time, and for evaluating median survival (50% survival threshold) by group.
Unfortunately KM curves also have quite a few drawbacks that limit their usefulness:
- They do not accomodate for left-censoring - at least in R
- They do not cater for continuous covariates - only categorical covariates can be fed into KM models
- They dont allow for confounder adjustments, unlike cox models
t='end'
my.formula<-as.formula("Surv(time=end,event=vital_status)~TumourStage") #
sd <- survdiff(my.formula, data = df.model)
p.val.KM<-1 - pchisq(sd$chisq, length(sd$n) - 1) # exact log-rank test p-value
p.val.KM<-formatC(p.val.KM, format = "e", digits = 2)
surv_item= surv_fit(my.formula,data=df.model)
ggsurv<-ggsurvplot(surv_item, data=df.model,
#fun = "cumhaz", # for cumul. hazard
pval = TRUE, # p-value of log-rank test of diff between strata
# conf.int = TRUE, not great if you have multiple strata
surv.plot.height = 0.7,
risk.table = TRUE, # Add risk table
risk.table.height = 0.35, # Useful when you have multiple groups
xlab = "Time in days", # customize X axis label.
surv.median.line = "hv", # add the median survival pointer.
# ncensor.plot = TRUE, # plot the number of censored subjects at time t
# ncensor.plot.height = 0.6,
ggtheme = theme_bw(),xlim=c(0,round(max(df.model[,t],na.rm = TRUE))),
legend.labs = c("I", "II", "III", "IV", "T"))
ggsurv