1. Introduction

In this document, we detail how to use the penAFT R package through a number of short simulated examples. It is meant as a user guide accompanying the documentation in the penAFT R package. The main purpose of the package is to fit, tune, and apply the penalized Gehan estimator in high-dimensional settings.

For \(i = 1, \dots, n\), let \(y_i = \min(t_i, c_i)\), where \(t_i\) is the time-to-event and \(c_i\) is the censoring time for the \(i\)th subject; let \(x_i \in \mathbb{R}^p\) be the measured predictors for the \(i\)th subject; and let \(\delta_i = \mathbf{1}(y_i = t_i)\) be the indicator that the event time was observed (i.e., that the \(i\)th subject was not censored). The penalized Gehan estimator is defined as \[\begin{equation} \label{eq:penalizedGehan} \arg\min_{{\beta} \in \mathbb{R}^p} \left\{ \frac{1}{n^2} \sum_{i=1}^n \sum_{j=1}^n \delta_i \{\log y_i - \log y_j - \beta'(x_i - x_j) \}^{-} + \lambda g({\beta})\right\} \end{equation}\] where \(\{a\}^{-} = \max(-a, 0)\), \(\lambda > 0\) is a tuning parameter, and \(g\) is a convex penalty function. The penAFT package can fit this model and perform cross-validation for two classes of penalties. The weighted elastic net penalty is defined as \[\texttt{(penalty = "EN")}~~~~~~ g(\beta) = \alpha \| w \circ \beta\|_1 + \frac{(1-\alpha)}{2}\|\beta\|_2^2\] where \(w \in \mathbb{R}^p\) is a set of non-negative weights (which can be specified in the \(\texttt{weight.set}\) argument). The weighted sparse group lasso penalty is defined as \[\texttt{(penalty = "SG")}~~~~~~ g(\beta)= \alpha \| w \circ \beta\|_1 + (1-\alpha)\sum_{l=1}^G v_l\|\beta_{\mathcal{G}_l}\|_2\] where again \(w \in \mathbb{R}^p\) is a set of non-negative weights, and \(v_1, \dots, v_G\) are non-negative weights applied to each of the \(G\) (user-specified) groups \(\mathcal{G}_1, \dots, \mathcal{G}_G\).
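To build intuition for the loss in the display above, the following is a minimal base-R sketch of the (unpenalized) Gehan loss. This naive \(O(n^2)\) version is for illustration only; the penAFT package evaluates this loss internally with its own, more efficient routines.

```r
# Naive O(n^2) evaluation of the Gehan loss defined above.
# Illustrative sketch only; penAFT computes this loss internally.
gehan_loss <- function(beta, X, logY, delta) {
  e <- as.vector(logY - X %*% beta)  # residuals on the log scale
  D <- outer(e, e, "-")              # D[i, j] = e_i - e_j
  # delta recycles down the rows of D, so entry (i, j) is weighted by delta[i]
  sum(delta * pmax(-D, 0)) / length(e)^2
}

# Tiny check: two subjects, both uncensored, residuals 0 and 1
gehan_loss(0, matrix(0, 2, 1), c(0, 1), c(1, 1))  # 0.25
```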

It is important to note that the defaults are \(\alpha = 1\) and \(\texttt{penalty = "EN"}\). With \(\alpha = 1\), the sparse group lasso penalty \(\texttt{penalty = "SG"}\) reduces to the lasso and is thus identical to \(\texttt{penalty = "EN"}\) with \(\alpha = 1\). To make use of the group structure under \(\texttt{penalty = "SG"}\), one must therefore set \(\alpha\) to some value less than one.

First, install the penAFT package from GitHub. This requires the devtools package in R.

install.packages("devtools")
library(devtools)
devtools::install_github("ajmolstad/penAFT", force=TRUE)

To begin, we generate data from a log-logistic accelerated failure time model.

library(penAFT)
set.seed(1)
n <- 200
p <- 250

# AR(1) covariance for the predictors: (SigmaX)_{jk} = 0.5^|j - k|
SigmaX <- 0.5^abs(outer(1:p, 1:p, "-"))

# Generate predictors with covariance SigmaX via its eigendecomposition
eo <- eigen(SigmaX)
X <- matrix(rnorm(n*p), nrow=n) %*% eo$vectors %*% diag(eo$values^.5) %*% t(eo$vectors)

# Sparse true coefficient vector
beta <- rep(0, p)
beta[c(1, 20, 200)] <- 2

# Log event times from a log-logistic AFT model with exponential censoring
logT <- X %*% beta + rlogis(n, loc=0, scale=2)
C <- rexp(n, rate = 1/quantile(exp(logT), 0.65))
logY <- pmin(logT, log(C))
delta <- 1*(logY == logT)
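As a quick sanity check on the censoring mechanism, one can estimate the fraction of uncensored observations it produces. The snippet below is a standalone re-simulation using the same construction (it does not reuse the data generated above):

```r
# Standalone check of the censoring construction used above: exponential
# censoring with rate 1/(65th percentile of the event times) yields
# moderate censoring, with well over half of the subjects uncensored here.
set.seed(1)
logT <- rlogis(1e4, loc = 0, scale = 2)
C <- rexp(1e4, rate = 1 / quantile(exp(logT), 0.65))
mean(exp(logT) <= C)  # observed event rate
```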

2. Elastic net penalized Gehan estimator

2.1 Basics

Let us begin by fitting the model with the elastic net penalty.

fit.en <- penAFT(X, logY, delta, alpha = 0.5, nlambda = 100)
str(fit.en)
## List of 6
##  $ beta       :Formal class 'dgCMatrix' [package "Matrix"] with 6 slots
##   .. ..@ i       : int [1:2269] 19 19 19 19 19 199 19 199 19 199 ...
##   .. ..@ p       : int [1:101] 0 1 2 3 4 6 8 10 12 14 ...
##   .. ..@ Dim     : int [1:2] 250 100
##   .. ..@ Dimnames:List of 2
##   .. .. ..$ : NULL
##   .. .. ..$ : NULL
##   .. ..@ x       : num [1:2269] 0.00124 0.02285 0.03876 0.05623 0.07831 ...
##   .. ..@ factors : list()
##  $ lambda     : num [1:100] 0.51 0.499 0.487 0.476 0.465 ...
##  $ standardize: logi TRUE
##  $ X.mean     : num [1:250] 0.0361 0.0266 -0.0517 -0.1021 -0.0357 ...
##  $ X.sd       : num [1:250] 0.94 1.01 1.06 1.1 1.07 ...
##  $ alpha      : num 0.5
##  - attr(*, "class")= chr "penAFT"

The fitted object \(\texttt{fit.en}\) is a list containing

  • \(\texttt{beta}\): A sparse matrix containing the estimated regression coefficients, one column per candidate tuning parameter. To extract coefficients for a particular tuning parameter, use the \(\texttt{penAFT.coef}\) function.
  • \(\texttt{lambda}\): The set of candidate tuning parameter values.
  • \(\texttt{standardize}\): An indicator of whether the predictors were standardized for model fitting.
  • \(\texttt{X.mean}\), \(\texttt{X.sd}\): The means and standard deviations of the predictors. These are used to return coefficients and linear predictors on the appropriate scale in the \(\texttt{penAFT.coef}\) and \(\texttt{penAFT.predict}\) functions.
  • \(\texttt{alpha}\): The value of \(\alpha\) used in the penalty.
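Because \(\texttt{beta}\) is stored as a \(\texttt{dgCMatrix}\) from the Matrix package, standard sparse-matrix operations apply to it directly. For instance, one can count the number of active predictors at each candidate tuning parameter with \(\texttt{colSums}\). The toy matrix below is made up for illustration and is not penAFT output:

```r
library(Matrix)  # ships with R; penAFT stores beta in this class

# Toy dgCMatrix with p = 4 rows (predictors) and 3 columns (tuning
# parameters); the entries are made up for illustration.
B <- sparseMatrix(i = c(1, 1, 3), j = c(2, 3, 3), x = c(0.5, 0.8, -0.2),
                  dims = c(4, 3))
colSums(B != 0)  # number of active predictors at each column: 0 1 2
```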

We can look at a trace plot for the fitted model.

penAFT.trace(fit.en)

Using the functions \(\texttt{penAFT.predict}\) and \(\texttt{penAFT.coef}\), we can also obtain the linear predictor for new subjects and extract the estimated coefficients at a particular \(\lambda\):

preds <- penAFT.predict(fit.en, Xnew = matrix(rnorm(p), nrow = 1), lambda = fit.en$lambda[20])
str(preds)
##  num [1, 1] 0.667
coef.est20 <- penAFT.coef(fit.en, lambda = fit.en$lambda[20])
coef.est20$beta[1:20]
##  [1] 0.1449862 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
##  [8] 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
## [15] 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.4225436
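Notice that the nonzero estimates at positions 1 and 20 line up with the true support used in the simulation (\(\texttt{beta}\) was nonzero at indices 1, 20, and 200). A simple way to assess support recovery is to compare index sets; the estimated indices below are made up for illustration, whereas in practice one would use \(\texttt{which(coef.est20\$beta != 0)}\):

```r
# Comparing an estimated active set against the true support from the
# simulation. The vector est is hypothetical, for illustration only.
truth <- c(1, 20, 200)
est <- c(1, 20, 150, 200)
intersect(est, truth)  # true positives: 1 20 200
setdiff(est, truth)    # false positives: 150
```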

To select the tuning parameter, we can perform cross-validation using \(\texttt{penAFT.cv}\).

fit.en.cv <- penAFT.cv(X, logY, delta, alpha = 0.5, nlambda = 100, nfolds = 5)
## CV through:  ###                 20 % 
## CV through:  ### ###             40 % 
## CV through:  ### ### ###         60 % 
## CV through:  ### ### ### ###     80 % 
## CV through:  ### ### ### ### ###  100 %
str(fit.en.cv)
## List of 5
##  $ full.fit      :List of 6
##   ..$ beta       :Formal class 'dgCMatrix' [package "Matrix"] with 6 slots
##   .. .. ..@ i       : int [1:2269] 19 19 19 19 19 199 19 199 19 199 ...
##   .. .. ..@ p       : int [1:101] 0 1 2 3 4 6 8 10 12 14 ...
##   .. .. ..@ Dim     : int [1:2] 250 100
##   .. .. ..@ Dimnames:List of 2
##   .. .. .. ..$ : NULL
##   .. .. .. ..$ : NULL
##   .. .. ..@ x       : num [1:2269] 0.00124 0.02285 0.03876 0.05623 0.07831 ...
##   .. .. ..@ factors : list()
##   ..$ lambda     : num [1:100] 0.51 0.499 0.487 0.476 0.465 ...
##   ..$ standardize: logi TRUE
##   ..$ X.mean     : num [1:250] 0.0361 0.0266 -0.0517 -0.1021 -0.0357 ...
##   ..$ X.sd       : num [1:250] 0.94 1.01 1.06 1.1 1.07 ...
##   ..$ alpha      : num 0.5
##  $ cv.err.linPred: num [1:100] 1.87 1.86 1.86 1.86 1.85 ...
##  $ cv.err.obj    : num [1:5, 1:100] 1.59 2.02 1.66 1.94 1.97 ...
##  $ cv.index      :List of 5
##   ..$ : int [1:41] 6 18 19 34 36 49 67 70 73 74 ...
##   ..$ : int [1:41] 1 3 14 21 24 29 32 58 59 69 ...
##   ..$ : int [1:40] 11 16 28 30 31 35 37 47 54 63 ...
##   ..$ : int [1:39] 10 13 44 45 46 56 57 62 82 91 ...
##   ..$ : int [1:39] 9 50 52 53 65 75 94 95 97 98 ...
##  $ lambda.min    : num 0.51
##  - attr(*, "class")= chr "penAFT.cv"

We see that the object \(\texttt{fit.en.cv}\) contains the model fit to the full dataset (exactly the \(\texttt{fit.en}\) object from before) in the sublist \(\texttt{full.fit}\). The other output includes

  • \(\texttt{cv.err.linPred}\): the cross-validated linear predictor scores (the default cross-validation error metric)
  • \(\texttt{cv.err.obj}\): the Gehan loss evaluated on each held-out fold at each candidate tuning parameter
  • \(\texttt{lambda.min}\): the tuning parameter value which minimized the cross-validated linear predictor scores
  • \(\texttt{cv.index}\): a list giving the indices of subjects belonging to each fold.
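The \(\texttt{lambda.min}\) entry is simply the candidate value at which the cross-validated error curve is minimized. With made-up numbers (not penAFT output), the selection rule amounts to:

```r
# Generic illustration of how lambda.min is chosen from a cross-validation
# error curve; the values here are hypothetical.
lambda <- c(0.51, 0.40, 0.30, 0.20)
cv.err <- c(1.87, 1.70, 1.75, 1.90)
lambda[which.min(cv.err)]  # 0.4
```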

We can plot the cross-validation error curves using the \(\texttt{penAFT.plot}\) function.

penAFT.plot(fit.en.cv)