In this document, we provide a short tutorial on how to use the \(\texttt{BvCategorical}\) R package. If you encounter any errors or strange behavior, please report the issue at https://github.com/ajmolstad/BvCategorical.
First, we install the \(\texttt{BvCategorical}\) R package from GitHub.
install.packages("devtools")
library(devtools)
devtools::install_github("ajmolstad/BvCategorical")
library(BvCategorical)
Next, we generate data from the bivariate categorical response regression model. Here, the responses are \(J\)-category and \(K\)-category multinomial random variables with a single trial per subject. Generalizing to multiple trials per subject is straightforward: users need only append additional rows to \(X\) and \(Y\) so that the data are formatted as if each trial came from a separate subject (see the sketch below). To keep computing times short, we use \(J=3\) and \(K=2\) in the following example.
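For instance, a subject observed for three trials would contribute three rows, one per trial. The sketch below (using a hypothetical \(\texttt{x.subject}\) and hand-picked class pairs, not package code) illustrates the formatting.
# hypothetical subject with p = 5 predictors and three trials whose
# observed response pairs (Y1, Y2) are (1, 2), (3, 2), and (1, 1)
x.subject <- rnorm(5)
X.expanded <- matrix(x.subject, nrow=3, ncol=5, byrow=TRUE) # one row per trial
Y.expanded <- list(c(1, 3, 1), # Y1 class for each trial
                   c(2, 2, 1)) # Y2 class for each trial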
Recall that under the bivariate multinomial logistic regression model, \(\boldsymbol{\beta} \in \mathbb{R}^{(p + 1) \times J \times K}\) and \[ P(Y_{1i} = j, Y_{2i} = k \mid x_i) = \frac{\exp\left(x_i'\boldsymbol{\beta}_{\cdot, j, k} \right)}{\sum_{s,t} \exp\left(x_i'\boldsymbol{\beta}_{\cdot, s, t} \right)}, \quad\quad(i,j,k) \in \left\{1, \dots, n\right\} \times \left\{1, \dots, J\right\} \times \left\{1,\dots, K\right\},\] where \(x_i = (1, x_{1i}, \dots, x_{pi})'\) for \(i=1, \dots, n\).
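To make the array indexing concrete, the following minimal sketch (with arbitrary illustrative values, not package code) evaluates this probability mass function for a single subject when \(J = 3\) and \(K = 2\).
# evaluate P(Y1 = j, Y2 = k | x) for one subject; values are illustrative
J <- 3; K <- 2; p <- 4
beta.array <- array(rnorm((p + 1)*J*K), dim=c(p + 1, J, K))
xi <- c(1, rnorm(p)) # leading one for the intercept
eta <- apply(beta.array, c(2, 3), function(b){sum(xi*b)}) # entries x_i' beta_{.,j,k}
probs <- exp(eta)/sum(exp(eta)) # J x K table of joint probabilities; sums to one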
In the following, we generate \(\boldsymbol{\beta}\) so that six variables modify the joint probability mass function, i.e., they uniquely modify \(P(Y_{1i} = j, Y_{2i} = k \mid x_i)\); four modify only the marginal probabilities \(P(Y_{1i} = j\mid x_i)\) and \(P(Y_{2i} = k \mid x_i)\); and the remaining variables do not affect the response.
sessionInfo()
## R version 3.6.1 (2019-07-05)
## Platform: x86_64-apple-darwin15.6.0 (64-bit)
## Running under: macOS Catalina 10.15
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRlapack.dylib
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] MASS_7.3-51.5 Matrix_1.2-18
##
## loaded via a namespace (and not attached):
## [1] Rcpp_1.0.4 lattice_0.20-40 digest_0.6.25 grid_3.6.1
## [5] magrittr_1.5 evaluate_0.14 rlang_0.4.5 stringi_1.4.6
## [9] rmarkdown_2.1 tools_3.6.1 stringr_1.4.0 xfun_0.12
## [13] yaml_2.2.1 compiler_3.6.1 htmltools_0.4.0 knitr_1.28
# --------------------------------
# Preliminaries
# --------------------------------
set.seed(10)
p <- 200
n <- 500
J <- 3
K <- 2
ntest <- 1e4
Results <- NULL
SigmaX <- matrix(0, nrow=p, ncol=p)
for(jj in 1:p){
  for(kk in 1:p){
    SigmaX[jj,kk] <- .5^abs(jj-kk)
  }
}
eo <- eigen(SigmaX)
SigmaXsqrt <- eo$vec%*%diag(eo$val^.5)%*%t(eo$vec)
# generated data -------------
X <- matrix(rnorm(n*p), nrow=n)%*%SigmaXsqrt
Xtest <- matrix(rnorm(ntest*p), nrow=ntest)%*%SigmaXsqrt
beta <- matrix(0, nrow=p, ncol=J*K)
temp <- sample(1:p, 10)
# joint probabilities -------------
for(j in 1:6){
  beta[temp[j],] <- runif(6, -3, 3)
}
# marginal probabilities only ---
for(j in 7:10){
  temp1 <- runif(4, -3, 3)
  beta[temp[j],] <- c(-temp1[4] + temp1[3] + temp1[1], temp1[1], temp1[2],
                      temp1[3], temp1[4], -temp1[1] + temp1[4] + temp1[2])
}
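The second loop constructs rows that are additively separable across the two responses, i.e., \(\boldsymbol{\beta}_{m,j,k} = \mu_{m,j} + \nu_{m,k}\): such variables shift the marginal probabilities but contribute nothing to the log odds ratios between the two responses. We can verify this structure directly (a check of our own, not package code); the \(k=2\) and \(k=1\) columns of any such row differ by a constant.
# reshape one marginal-only row into a J x K table; additive separability
# means its second column minus its first is constant across j
B <- matrix(beta[temp[10],], nrow=J)
B[,2] - B[,1] # all entries equal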
# linear predictors and joint class probabilities for the training data
lleval <- exp(crossprod(t(X), beta))
Pmat <- lleval/(rowSums(lleval)%*%t(rep(1, J*K)))
Ymat <- matrix(0, nrow=n, ncol=J*K)
Y <- list(NA)
Y[[1]] <- rep(0, n)
Y[[2]] <- rep(0, n)
# column c of Ymat encodes the class pair (refClasses[1,c], refClasses[2,c])
refClasses <- rbind(c(1:J, 1:J), c(rep(1, J), rep(2, J)))
for(k in 1:n){
  Ymat[k,] <- rmultinom(1, 1, Pmat[k,])
  Y[[1]][k] <- refClasses[1,which(Ymat[k,]==1)]
  Y[[2]][k] <- refClasses[2,which(Ymat[k,]==1)]
}
# test data, generated in the same way -------------
lleval <- exp(crossprod(t(Xtest), beta))
PmatTest <- lleval/(rowSums(lleval)%*%t(rep(1, J*K)))
YmatTest <- matrix(0, nrow=ntest, ncol=J*K)
Ytest <- list(NA)
Ytest[[1]] <- rep(0, ntest)
Ytest[[2]] <- rep(0, ntest)
for(k in 1:ntest){
  YmatTest[k,] <- rmultinom(1, 1, PmatTest[k,])
  Ytest[[1]][k] <- refClasses[1,which(YmatTest[k,]==1)]
  Ytest[[2]][k] <- refClasses[2,which(YmatTest[k,]==1)]
}
To fit the model, use the function \(\texttt{BvCat.cv}\). Its arguments can be seen in the call below: the predictor matrix \(\texttt{X}\), the response list \(\texttt{Y}\), the candidate tuning parameters, and options controlling cross-validation, standardization, convergence, and printed progress. For more on formatting \(\texttt{X}\) and \(\texttt{Y}\), we suggest following the example data generating procedure above.
modfit <- BvCat.cv(X, Y, ngamma = 10,
lambda.vec = 10^seq(-1, -4, length=5),
nfolds = 5, delta = .05, standardize = TRUE,
tol = 1e-10, quiet = FALSE, inner.quiet = TRUE)
## lambda = 0.1 ; gamma = 0.2558115 ; nonzero = 6
## lambda = 0.1 ; gamma = 0.1833839 ; nonzero = 12
## lambda = 0.1 ; gamma = 0.1314626 ; nonzero = 30
## lambda = 0.1 ; gamma = 0.09424177 ; nonzero = 48
## lambda = 0.1 ; gamma = 0.0675592 ; nonzero = 60
## lambda = 0.1 ; gamma = 0.04843125 ; nonzero = 66
## lambda = 0.1 ; gamma = 0.03471896 ; nonzero = 144
## lambda = 0.1 ; gamma = 0.02488902 ; nonzero = 312
## lambda = 0.1 ; gamma = 0.01784222 ; nonzero = 504
## lambda = 0.1 ; gamma = 0.01279058 ; nonzero = 714
## lambda = 0.01778279 ; gamma = 0.2558115 ; nonzero = 6
## lambda = 0.01778279 ; gamma = 0.1833839 ; nonzero = 18
## lambda = 0.01778279 ; gamma = 0.1314626 ; nonzero = 42
## lambda = 0.01778279 ; gamma = 0.09424177 ; nonzero = 48
## lambda = 0.01778279 ; gamma = 0.0675592 ; nonzero = 60
## lambda = 0.01778279 ; gamma = 0.04843125 ; nonzero = 66
## lambda = 0.01778279 ; gamma = 0.03471896 ; nonzero = 120
## lambda = 0.01778279 ; gamma = 0.02488902 ; nonzero = 240
## lambda = 0.01778279 ; gamma = 0.01784222 ; nonzero = 426
## lambda = 0.01778279 ; gamma = 0.01279058 ; nonzero = 588
## lambda = 0.003162278 ; gamma = 0.2558115 ; nonzero = 6
## lambda = 0.003162278 ; gamma = 0.1833839 ; nonzero = 24
## lambda = 0.003162278 ; gamma = 0.1314626 ; nonzero = 42
## lambda = 0.003162278 ; gamma = 0.09424177 ; nonzero = 54
## lambda = 0.003162278 ; gamma = 0.0675592 ; nonzero = 60
## lambda = 0.003162278 ; gamma = 0.04843125 ; nonzero = 66
## lambda = 0.003162278 ; gamma = 0.03471896 ; nonzero = 144
## lambda = 0.003162278 ; gamma = 0.02488902 ; nonzero = 294
## lambda = 0.003162278 ; gamma = 0.01784222 ; nonzero = 450
## lambda = 0.003162278 ; gamma = 0.01279058 ; nonzero = 618
## lambda = 0.0005623413 ; gamma = 0.2558115 ; nonzero = 6
## lambda = 0.0005623413 ; gamma = 0.1833839 ; nonzero = 24
## lambda = 0.0005623413 ; gamma = 0.1314626 ; nonzero = 42
## lambda = 0.0005623413 ; gamma = 0.09424177 ; nonzero = 54
## lambda = 0.0005623413 ; gamma = 0.0675592 ; nonzero = 60
## lambda = 0.0005623413 ; gamma = 0.04843125 ; nonzero = 66
## lambda = 0.0005623413 ; gamma = 0.03471896 ; nonzero = 204
## lambda = 0.0005623413 ; gamma = 0.02488902 ; nonzero = 366
## lambda = 0.0005623413 ; gamma = 0.01784222 ; nonzero = 534
## lambda = 0.0005623413 ; gamma = 0.01279058 ; nonzero = 708
## lambda = 1e-04 ; gamma = 0.2558115 ; nonzero = 6
## lambda = 1e-04 ; gamma = 0.1833839 ; nonzero = 24
## lambda = 1e-04 ; gamma = 0.1314626 ; nonzero = 42
## lambda = 1e-04 ; gamma = 0.09424177 ; nonzero = 54
## lambda = 1e-04 ; gamma = 0.0675592 ; nonzero = 60
## lambda = 1e-04 ; gamma = 0.04843125 ; nonzero = 72
## lambda = 1e-04 ; gamma = 0.03471896 ; nonzero = 210
## lambda = 1e-04 ; gamma = 0.02488902 ; nonzero = 378
## lambda = 1e-04 ; gamma = 0.01784222 ; nonzero = 552
## lambda = 1e-04 ; gamma = 0.01279058 ; nonzero = 726
## Through CV fold 1
## Through CV fold 2
## Through CV fold 3
## Through CV fold 4
## Through CV fold 5
Once we have fit the model, we can examine the output.
str(modfit)
## List of 15
## $ beta :Formal class 'dgCMatrix' [package "Matrix"] with 6 slots
## .. ..@ i : int [1:9456] 0 201 402 603 804 1005 0 159 201 360 ...
## .. ..@ p : int [1:51] 0 6 18 48 96 156 222 366 678 1182 ...
## .. ..@ Dim : int [1:2] 1206 50
## .. ..@ Dimnames:List of 2
## .. .. ..$ : NULL
## .. .. ..$ : NULL
## .. ..@ x : num [1:9456] 0.0579 0.1679 -0.106 0.3572 -0.106 ...
## .. ..@ factors : list()
## $ J : int 3
## $ K : int 2
## $ p : int 200
## $ D : num [1:6, 1:3] 1 -1 0 -1 1 0 1 0 -1 -1 ...
## $ standardize : logi TRUE
## $ errs.joint : num [1:5, 1:10, 1:5] 0.769 0.769 0.769 0.769 0.769 ...
## $ deviance : num [1:5, 1:10, 1:5] 368 368 368 368 368 ...
## $ lambda.min.joint: num 0.00316
## $ gamma.min.joint : num 0.0249
## $ X.mean : num [1:200] -0.0315 -0.0502 -0.0639 0.0422 0.0317 ...
## $ X.sd : num [1:200] 1.02 1.008 0.967 0.971 1.039 ...
## $ cv.index :List of 5
## ..$ : int [1:104] 2 28 52 83 174 180 182 190 219 231 ...
## ..$ : int [1:100] 15 26 38 77 186 195 205 247 263 340 ...
## ..$ : int [1:100] 5 13 31 120 153 203 235 242 257 283 ...
## ..$ : int [1:98] 17 41 82 162 210 218 249 258 290 311 ...
## ..$ : int [1:98] 111 116 123 132 185 194 196 225 270 330 ...
## $ lambda.vec : num [1:5] 0.1 0.017783 0.003162 0.000562 0.0001
## $ gamma.vec : num [1:10] 0.2558 0.1834 0.1315 0.0942 0.0676 ...
## - attr(*, "class")= chr "BvCat"
The output contains the following (omitting components with obvious meaning): \(\texttt{errs.joint}\) and \(\texttt{deviance}\) store the held-out misclassification errors and deviances for each candidate \((\lambda, \gamma)\) pair and fold; \(\texttt{lambda.min.joint}\) and \(\texttt{gamma.min.joint}\) are the tuning parameter pair minimizing the average cross-validated misclassification rate; and \(\texttt{cv.index}\) gives the indices of the subjects held out in each fold.
With the fitted model, an object of class \(\texttt{BvCat}\), we can first examine the misclassification errors averaged over the folds:
library(lattice)
levelplot(t(apply(modfit$errs.joint, c(1,2), mean)),
          xlab=expression(paste(gamma)), ylab=expression(paste(lambda)),
          main="Cross-validated misclassification rate",
          col.regions=grey((100:0)/100))
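The minimizing pair is also stored in the fitted object as \(\texttt{lambda.min.joint}\) and \(\texttt{gamma.min.joint}\). Assuming the first dimension of \(\texttt{errs.joint}\) indexes \(\lambda\) and the second \(\gamma\) (the same convention used for \(\texttt{deviance}\) below), we can recover the pair directly:
errs.avg <- apply(modfit$errs.joint, c(1,2), mean)
errInds <- which(errs.avg == min(errs.avg), arr.ind=TRUE)
c(modfit$lambda.vec[errInds[1,1]], modfit$gamma.vec[errInds[1,2]]) # should match lambda.min.joint and gamma.min.joint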
Then, we can predict the classes for a new set of predictors, \(\texttt{Xtest}\), using the \(\texttt{BvCat.predict}\) function. By default, \(\texttt{BvCat.predict}\) uses the tuning parameter pair which had the minimum average misclassification rate in cross-validation.
outPred <- BvCat.predict(Xtest = Xtest, modfit, type="class") # prediction using tuning parameters which minimized classification error
We can also perform prediction using the fitted model for other tuning parameter pairs:
devInds <- which(apply(modfit$deviance, c(1,2), mean) ==
                 min(apply(modfit$deviance, c(1,2), mean)), arr.ind=TRUE)
outPred.dev <- BvCat.predict(Xtest = Xtest, modfit,
                             lambda = modfit$lambda.vec[devInds[1,1]],
                             gamma = modfit$gamma.vec[devInds[1,2]]) # prediction using tuning parameters which minimized the cross-validated deviance
# map each predicted pair (j, k) to its joint class index (k-1)*J + j
preds.mat <- apply(outPred$preds, 1, function(x){(x[1]+(J*(x[2]-1)))})
misclass.joint <- sum(apply(YmatTest, 1, which.max) != preds.mat)/length(Ytest[[1]])
cat(misclass.joint, "\n")
## 0.2487
preds.dev.mat <- apply(outPred.dev$preds, 1, function(x){(x[1]+(J*(x[2]-1)))})
misclass.joint.dev <- sum(apply(YmatTest, 1, which.max) != preds.dev.mat)/length(Ytest[[1]])
cat(misclass.joint.dev, "\n")
## 0.2558
Finally, we can extract the regression coefficients using \(\texttt{BvCat.coef}\). When we use the argument \(\texttt{type="matrix"}\), we get the matricized version of \(\hat{\boldsymbol{\beta}}\), denoted \(\hat\beta\), which is given by \[ \hat\beta_{m,f(j,k)} = \hat{\boldsymbol{\beta}}_{m,j,k}, \quad\quad (m,j,k) \in \left\{1, \dots, p+1\right\} \times \left\{1, \dots, J \right\} \times \left\{1, \dots, K\right\},\] where \(f(j,k) = (k-1)J + j\) for \((j,k) \in \left\{1, \dots, J\right\} \times \left\{1, \dots, K\right\}\). Note that the intercepts are returned separately as \(\texttt{b0}\).
outBeta <- BvCat.coef(modfit, type="matrix")
str(outBeta)
## List of 3
## $ b0 : num [1:6] 0.0862 0.1196 -0.1713 0.187 0.0102 ...
## $ beta : num [1:200, 1:6] 0.19 0 0 0 -1.11 ...
## $ classref: num [1:2, 1:6] 1 1 2 1 3 1 1 2 2 2 ...
Using the argument \(\texttt{type="array"}\) instead returns \(\hat{\boldsymbol{\beta}}\), the array-valued estimate.
outBetaArray <- BvCat.coef(modfit, type="array")
str(outBetaArray)
## List of 3
## $ b0 : num [1, 1:3, 1:2] 0.0862 0.1196 -0.1713 0.187 0.0102 ...
## $ beta : num [1:200, 1:3, 1:2] 0.19 0 0 0 -1.11 ...
## $ classref: num [1:2, 1:6] 1 1 2 1 3 1 1 2 2 2 ...
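As a quick sanity check (our own, using the index map \(f\) defined above), column \(f(j,k) = (k-1)J + j\) of the matricized estimate should match slice \((j,k)\) of the array-valued estimate:
# compare the two forms at, e.g., (j, k) = (2, 2)
j <- 2; k <- 2
all.equal(outBeta$beta[,(k - 1)*J + j], outBetaArray$beta[,j,k])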