In this document, we show how to use the \(\texttt{SurvGPR}\) R package. If you encounter any errors or strange behavior, please report the issue at https://github.com/ajmolstad/SurvGPR. First, we download the package from GitHub.

library(devtools)
devtools::install_github("ajmolstad/SurvGPR")

Then, we create survival data from the Gaussian process regression model.

sessionInfo()
## R version 3.6.1 (2019-07-05)
## Platform: x86_64-apple-darwin15.6.0 (64-bit)
## Running under: macOS Mojave 10.14.6
## 
## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRlapack.dylib
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## loaded via a namespace (and not attached):
##  [1] compiler_3.6.1  magrittr_1.5    tools_3.6.1     htmltools_0.3.6
##  [5] yaml_2.2.0      Rcpp_1.0.2      stringi_1.4.3   rmarkdown_1.14 
##  [9] knitr_1.24      stringr_1.4.0   xfun_0.8        digest_0.6.20  
## [13] evaluate_0.14
library(SurvGPR)
library(MASS)
# set dimesions
set.seed(20)
n = 500
p = 1000
q = 2
train.inds <- sample(1:n, 300)
test.inds <- c(1:n)[-train.inds]

# generate gene expression kernel
SigmaX <- matrix(.5, nrow=p, ncol=p)
diag(SigmaX) <- 1
X <- mvrnorm(n = n, mu = rep(0, p), Sigma = SigmaX, tol = 1e-6)
Kout <- as.matrix(dist(X, method="euclidean"))
K1 <- exp(-Kout/max(Kout[train.inds,train.inds]))
K1.full <- exp(-Kout/max(Kout))

Z <- cbind(rep(1, n), matrix(rnorm(n*q), nrow=n))
beta <- c(6.2, -0.5, -1.2)
Sigma <- 2*K1.full
G <- mvrnorm(n = 1, mu = rep(0, n), Sigma = Sigma, tol = 1e-6)
    
# generate failure times 
log_time <- Z%*%beta + c(G) + rnorm(n, sd=sqrt(.5))     
C <- log(rexp(n=n, rate=1/quantile(exp(log_time), .8)))
y <- pmin(log_time, C)
time <- exp(y)
status <- 1*(y != C)

Note that the \(M\) candidate kernels should be organized into an array of dimension \(n_{\rm train} \times n_{\rm train} \times M\). Note that even if \(M = 1\), this must still be entered as an array. In addition, each of the candidate the kernels should be positive definite. In the code above, the candidate kernels are stored as \(\texttt{K1}\).

Next, we fit the \(\texttt{SurvGPR}\) model using the following. Please note that this may take a few minutes to run.

time.train <- time[train.inds]
status.train <- status[train.inds]
Z.train <- Z[train.inds,]
K.train <- array(K1[train.inds,train.inds], dim=c(length(train.inds), length(train.inds), 1))

# ------------------------
# display inputs 
# ------------------------
time.train[1:10]
##  [1] 446.29118  11.19027 370.64129 210.12920  78.34528 206.31260 277.91410
##  [8] 451.16314 531.66789  28.28487
status.train[1:10]
##  [1] 1 0 1 0 0 1 1 0 1 0
Z.train[1:10, ]
##       [,1]       [,2]        [,3]
##  [1,]    1  1.6047070  0.16164251
##  [2,]    1 -1.0999800  0.42764294
##  [3,]    1 -0.2965829  0.62351308
##  [4,]    1  0.4775841 -0.44029827
##  [5,]    1  0.9999098 -0.75503343
##  [6,]    1  1.8769275  0.59915343
##  [7,]    1  2.0396633  0.02066611
##  [8,]    1 -0.3490389 -0.63380672
##  [9,]    1  0.2104500 -0.23658373
## [10,]    1  0.5167755  0.37743189
str(K.train)
##  num [1:300, 1:300, 1] 1 0.639 0.746 0.748 0.679 ...

Finally, we fit the model using the \(\texttt{SurvGPR}\) function.

set.seed(21)
results <- SurvGPR(time = time.train, status = status.train, Z = Z.train, K = K.train, 
                   tol = 1e-7, max.iter.MM = 100, max.iter = 100, quiet=FALSE, 
                   max.samples = 1e5, initializer = 0)
## 1 : ASE = 0.107 ; sigma2 = 1.457 ; resid = 8.39069 ; sk = 500 
## 2 : ASE = 0.034 ; sigma2 = 1.786 ; resid = 0.85956 ; sk = 500 
## 3 : ASE = 0.012 ; sigma2 = 2.071 ; resid = 0.13419 ; sk = 500 
## 4 : ASE = 0.005 ; sigma2 = 2.232 ; resid = 0.02412 ; sk = 500 
## 5 : ASE = 0.002 ; sigma2 = 2.349 ; resid = 0.00749 ; sk = 500 
## 6 : ASE = 0.002 ; sigma2 = 2.441 ; resid = 0.00208 ; sk = 500 
## 7 : ASE = 0.001 ; sigma2 = 2.464 ; resid = -0.00028 ; sk = 500 
## Adding samples for ascent: 1000  samples 
## 7 : ASE = 0.001 ; sigma2 = 2.47 ; resid = 3e-04 ; sk = 1000 
## 8 : ASE = 0 ; sigma2 = 2.469 ; resid = -0.00062 ; sk = 1000 
## Adding samples for ascent: 2000  samples 
## 8 : ASE = 0 ; sigma2 = 2.47 ; resid = -0.00038 ; sk = 2000 
## Adding samples for ascent: 4000  samples 
## 8 : ASE = 0 ; sigma2 = 2.477 ; resid = 0 ; sk = 4000 
## Adding samples for ascent: 8000  samples 
## 8 : ASE = 0 ; sigma2 = 2.5 ; resid = 8e-04 ; sk = 8000 
## 9 : ASE = 0 ; sigma2 = 2.523 ; resid = 0.00051 ; sk = 8000 
## 10 : ASE = 0 ; sigma2 = 2.53 ; resid = 4e-05 ; sk = 8000 
## Adding samples for ascent: 16000  samples 
## 10 : ASE = 0 ; sigma2 = 2.531 ; resid = 1e-04 ; sk = 16000 
## Adding samples for ascent: 32000  samples 
## 10 : ASE = 0 ; sigma2 = 2.531 ; resid = 0.00014 ; sk = 32000 
## 11 : ASE = 0 ; sigma2 = 2.531 ; resid = 0 ; sk = 32000 
## Adding samples for ascent: 64000  samples 
## 11 : ASE = 0 ; sigma2 = 2.532 ; resid = 3e-05 ; sk = 64000 
## Adding samples for ascent: 128000  samples

Checking the output, we have a list of parameter estimates and other information needed for subsequent prediction tasks.

str(results)
## List of 4
##  $ beta   : num [1:3, 1] 6.937 -0.599 -1.142
##  $ sigma2 : num [1:2] 2.532 0.274
##  $ Tout   : num [1:300] 6.1 6.75 5.92 6.58 6.53 ...
##  $ Yimpute: num [1:300] 446 1199 371 1612 1335 ...

For example, we can look at the estimated variance components, which are fairly accurate in this particular dataset:

results$sigma2
## [1] 2.5324571 0.2738176

With the fitted model, we can then predict for new subjects. Note that we can also get predicted survival probabilities at certain times (on the original scale, not log-scale) by specifying times in the \(\texttt{times}\) argument. The first two arguments should be input as shown below: \(\texttt{Z} \in \mathbb{R}^{n \times p+1}\), \(\texttt{K.testt} \in \mathbb{R}^{n \times n \times M}\) where \(\texttt{train.inds}\) and \(\texttt{test.inds}\) are a partition of \(\left\{1, \dots, n\right\}\) where \(\texttt{train.inds}\) denote the rows of \(\texttt{Z}\) and row/columns of \(\texttt{K}\) corresponding to the training subjects; and \(\texttt{test.inds}\) corresponds to the testing subjects.

K.test <- array(K1.full, dim = c(n, n, 1))
pred_results <- SurvGPR_Predict(results = results, barT = results$Tout, Z.full = Z, 
    K.full = K.test, train.inds = train.inds, test.inds = test.inds, times = seq(0, 
        max(time)/5, length = 1000))
str(pred_results)
## List of 3
##  $ test.inds: int [1:200] 1 4 7 11 12 15 19 22 24 25 ...
##  $ log.pred : num [1:200, 1] 4.17 4.75 3.46 7.26 8.78 ...
##  $ survFunc : num [1:200, 1:1000] 1 1 1 1 1 1 1 1 1 1 ...

The output contains the predicted survival times on the log-scale, the testing indices, and the survival probabilities for all test set patients at the values given for . We plot the true survival curve (dashed) and the estimated survival curve (solid) for two subjects below (differentiated by colors).

dim(pred_results$survFunc)
## [1]  200 1000
plot(x = seq(0, max(time)/5, length = 1000), y = pred_results$survFunc[4, ], 
    ylab = "Survival probability", xlab = "Survival time", type = "l", ylim = c(0, 
        1))
lines(x = seq(0, max(time)/5, length = 1000), y = 1 - pnorm(log(seq(1, max(time)/5, 
    length = 1000)), mean = c(Z[test.inds[4], ] %*% beta), sd = sqrt(3.5)), 
    lty = 2)
lines(x = seq(0, max(time)/5, length = 1000), y = 1 - pnorm(log(seq(1, max(time)/5, 
    length = 1000)), mean = c(Z[test.inds[1], ] %*% beta), sd = sqrt(3.5)), 
    lty = 2, col = "red")
lines(x = seq(0, max(time)/5, length = 1000), y = pred_results$survFunc[1, ], 
    lty = 1, col = "red")

We also plot our predictions versus the true log-survival times.

plot(x = pred_results$log.pred, y = log_time[pred_results$test.inds], xlab = "Predicted log-survival times", 
    ylab = "True log-survival times", pch = 20)
abline(0, 1, lty = 2)