In this document, we provide a short tutorial on how to use the \(\texttt{HierMultinom}\) software in R. If you encounter any errors or strange behavior, please report the issue at https://github.com/ajmolstad/HierMultinom.

First, you need to set the path to where the R scripts and RDS files will be. Call this path \(\texttt{dir}\). For example,

dir <- "/Users/amolstad/scRNAseq/Example/"

To begin, download the dataset \(\texttt{pbmc3k_final.RDS}\) from this Dropbox link. This is a PBMC single-cell RNA-seq dataset consisting of 2700 cells with labels. These data were already processed following the \(\texttt{Seurat}\) tutorial.

library(Seurat)
pbmc <- readRDS(paste(dir, "pbmc3k_final.RDS", sep=""))
# update to newer Seurat format
pbmc <- UpdateSeuratObject(pbmc)

In this particular dataset, there are eight cell types.

table(pbmc@meta.data$ClusterNames_0.6)
## 
##           B cells   CD14+ Monocytes       CD4 T cells       CD8 T cells 
##               342               479              1151               308 
##   Dendritic cells FCGR3A+ Monocytes    Megakaryocytes          NK cells 
##                32               157                14               155

We will create the expression matrix needed for our software. In the first step, we select the top five hundred most variable genes using the \(\texttt{FindVariableFeatures}\) function from \(\texttt{Seurat}\).

topGenes <- sort(FindVariableFeatures(pbmc[["RNA"]]@data)$vst.variance.standardized, decreasing=TRUE, index=TRUE)$ix[1:500]
Xfull <- as.matrix(pbmc[["RNA"]]@data[topGenes,])

Then, we will divide the data into training, testing, and a validation set. Note that our function takes the design matrix \(X\) as an \(n \times p\) matrix where \(n\) is the number of samples (here, cells) and \(p\) is the number of predictors (here, genes). Similarly, \(Y\) needs to be a \(n \times K\) matrix where \(K\) is the number of response categories (here. cell type labels). Each row of \(Y\) must be all zeros except one–in the column corresponding to that particular sample’s label.

set.seed(1)
train.inds <- sample(1:dim(Xfull)[2], 1750)
val.inds <- sample(c(1:dim(Xfull)[2])[-train.inds], 750)
test.inds <- c(1:dim(Xfull)[2])[-c(train.inds, val.inds)]

Yfull <- pbmc@meta.data$ClusterNames_0.6
categories = sort(unique(Yfull))
Yfull = t(sapply(Yfull, function(y) categories == y) * 1)
colnames(Yfull) = categories

Ytrain <- Yfull[train.inds,]
Yval <- Yfull[val.inds,]
Ytest <- Yfull[test.inds,]

Xtrain <- t(Xfull[,train.inds])
# remove any genes which do not vary
rm.genes <- which(apply(Xtrain, 2, sd)==0)
if(length(rm.genes) > 0){
  Xtrain <- Xtrain[,-rm.genes]
  Xval <- t(Xfull[-rm.genes,val.inds])
  Xtest <- t(Xfull[-rm.genes,test.inds])
} else {
  Xval <- t(Xfull[,val.inds])
  Xtest <- t(Xfull[,test.inds])
}

To keep matters simple, we will use only two coarse categories: we will define T cells = \(\{\)CD4 T cells, CD8 T cells\(\}\) and will define Monocytes = \(\{\)CD14+ Monocytes, FCGR3A+ Monocytes\(\}\).

Nonoverlapping coarse categories

To define coarse categories with our method is straightforward. We need only construct a list named \(\texttt{groups}\) where each element of this list contains the set of categories comprising the coarse category. We must then convert this to numeric values to match the columns of \(Y\).

groups <- list(
  c("CD4 T cells", "CD8 T cells"), 
  c("FCGR3A+ Monocytes", "CD14+ Monocytes"))
groups <- lapply(groups, function(group) sapply(group, function(x) which(categories == x)))

Now we are ready to fit the model using the function \(\texttt{HeirMult.Path}\). Note that this function only works when no coarse categories overlap. By definition, an overlap occurs when a fine category is contained in two or more coarse categories. This function fits the model for a grid of candidate tuning parameters. Note that this function will take a few minutes to run (~3 minutes).

# install.packages("devtools")
# library(devtools)
# devtools::install_github("ajmolstad/HierMultinom")
library(HierMultinom)
mod.fit <- HierMultinom.path(X = Xtrain, Y = Ytrain, groups = groups, 
    ngamma = 20, delta = 1e-4, tol = 1e-7,
    lambda.vec = 10^seq(-5,-1, length=5), 
    max.iter = 1e4,
    Xval = Xval, Yval = Yval, quiet=FALSE)
## 2455.665 
## 1818.435 
## 1202.063 
## 763.4395 
## 483.9877 
## 339.27 
## 256.537 
## 200.5227 
## 165.2464 
## 150.1199 
## 141.3543 
## 140.6914 
## 147.7326 
## 160.2219 
## 176.0888 
## 193.2139 
## 2455.665 
## 1818.792 
## 1202.508 
## 763.7526 
## 484.2905 
## 339.5822 
## 256.8916 
## 200.9783 
## 165.6944 
## 150.5556 
## 141.7418 
## 140.0782 
## 146.1794 
## 156.9805 
## 170.879 
## 185.6726 
## 2455.665 
## 1822.366 
## 1206.966 
## 766.868 
## 487.2689 
## 342.5318 
## 260.3365 
## 205.1919 
## 170.3281 
## 154.3191 
## 145.2043 
## 140.1832 
## 138.4966 
## 141.165 
## 145.8809 
## 152.5679 
## 158.3652 
## 2455.665 
## 1857.833 
## 1247.387 
## 796.976 
## 515.2625 
## 369.8371 
## 287.6836 
## 236.7299 
## 201.7794 
## 181.3614 
## 169.6495 
## 162.2054 
## 157.559 
## 155.075 
## 153.9056 
## 153.6908 
## 154.1005 
## 154.7801 
## 156.1484 
## 158.5355 
## 2455.665 
## 2018.84 
## 1425.126 
## 1054.062 
## 785.8806 
## 610.1676 
## 510.5935 
## 449.782 
## 409.661 
## 384.8718 
## 372.3854 
## 365.0436 
## 360.932 
## 358.6736 
## 357.8131 
## 357.8482 
## 358.824 
## 360.3664 
## 362.6779

The arguments are

With our fitted model object, we can then look at the validation errors, extract coefficients, and predict. Validation errors are saved as \(\texttt{val.errs}\).

str(mod.fit)
## List of 5
##  $ beta.est    : num [1:499, 1:8] -2.2545 0 0.3849 -0.0945 0.092 ...
##  $ beta.array  : num [1:499, 1:8, 1:5, 1:20] -2.02 0 0 0 0 ...
##  $ val.errs    : num [1:5, 1:20] 2456 2456 2456 2456 2456 ...
##  $ X.train.mean: Named num [1:498] 1.83 1.76 1.17 2.56 1.25 ...
##   ..- attr(*, "names")= chr [1:498] "LYZ" "HLA-DRA" "S100A9" "CD74" ...
##  $ X.train.sd  : Named num [1:498] 1.9 1.81 1.82 1.54 1.71 ...
##   ..- attr(*, "names")= chr [1:498] "LYZ" "HLA-DRA" "S100A9" "CD74" ...

For prediction, one can use \(\texttt{HierMult.predict}\). Here, let us look at the confusion matrix.

out.preds <- HierMultinom.predict(mod.fit, Xtest)
str(out.preds)
## List of 2
##  $ probs: num [1:138, 1:8] 9.99e-01 8.31e-04 3.36e-04 1.64e-05 3.18e-04 ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : chr [1:138] "AAACATTGAGCTAC" "AAATCAACCAGGAG" "AAATGTTGAACGAA" "AACCGATGTTCTAC" ...
##   .. ..$ : NULL
##  $ preds: Named int [1:138] 1 3 2 2 3 3 1 2 2 2 ...
##   ..- attr(*, "names")= chr [1:138] "AAACATTGAGCTAC" "AAATCAACCAGGAG" "AAATGTTGAACGAA" "AACCGATGTTCTAC" ...
sum(out.preds$preds == apply(Ytest, 1, which.max))/dim(Ytest)[1]
## [1] 0.9637681
tmp <- data.frame("preds" = categories[out.preds$preds],  "truth" = categories[apply(Ytest, 1, which.max)])
tab <- table(tmp$preds, tmp$truth)
colnames(tab) <- 1:8
tab
##                    
##                      1  2  3  4  5  6  7  8
##   B cells           25  0  0  0  0  0  0  0
##   CD14+ Monocytes    0 25  0  0  0  1  0  0
##   CD4 T cells        0  0 47  0  0  0  0  0
##   CD8 T cells        0  0  0 11  0  0  0  2
##   Dendritic cells    0  1  0  0  1  0  0  0
##   FCGR3A+ Monocytes  0  1  0  0  0 15  0  0
##   Megakaryocytes     0  0  0  0  0  0  1  0
##   NK cells           0  0  0  0  0  0  0  8

Classification accuracy on the testing set is quite high here. Now, let us extract the estimated regression coefficients on the original \(X\) scale. This can be done with \(\texttt{HierMult.coef}\)

out.coefs <- HierMultinom.coef(mod.fit)
str(out.coefs)
## List of 2
##  $ beta.hat     : num [1:498, 1:8] 0 0.2124 -0.0518 0.0596 0 ...
##  $ intercept.hat: num [1, 1:8] -2.01 -1.94 5.09 -5.27 -5.2 ...

Overlapping coarse categories

A different function must be used when the coarse categories overlap. Consider the case that we wanted another category: lymphocytes, which consists of the T cells and the B cells.

groups <- list(
  c("CD4 T cells", "CD8 T cells"),  # T cells
  c("FCGR3A+ Monocytes", "CD14+ Monocytes"), # monocytes
   c("CD4 T cells", "CD8 T cells", "B cells")) # lymphocytes
groups <- lapply(groups, function(group) sapply(group, function(x) which(categories == x)))

Using the same inputs, we now need only use the function \(\texttt{HeirMultinomOverlap.Path}\). This function is somewhat slower than that for nonoverlapping coarse categories, so again, please be patient when using this software.

mod.fit.overlap <- HierMultinomOverlap.path(X = Xtrain, Y = Ytrain, groups = groups,
    ngamma = 20, delta = 1e-4, tol = 1e-7,
    lambda.vec = 10^seq(-5,-1, length=5), 
    max.iter = 1e4,
    Xval = Xval, Yval = Yval, quiet=FALSE)
## 2455.665 
## 2257.553 
## 1514.245 
## 969.5351 
## 636.0262 
## 408.3419 
## 297.6927 
## 229.8533 
## 182.2917 
## 155.6731 
## 145.1965 
## 139.3667 
## 143.0494 
## 152.4337 
## 166.5858 
## 2455.665 
## 2257.894 
## 1514.8 
## 970.1885 
## 636.6623 
## 408.8972 
## 298.2727 
## 230.5392 
## 183.0235 
## 156.3001 
## 145.9213 
## 139.8431 
## 141.4585 
## 148.3939 
## 160.0375 
## 2455.665 
## 2261.304 
## 1520.323 
## 976.7176 
## 642.9966 
## 414.2898 
## 303.9176 
## 237.1542 
## 189.3713 
## 163.8599 
## 153.3328 
## 145.3864 
## 141.166 
## 139.6076 
## 139.0147 
## 141.0248 
## 144.3568 
## 147.3747 
## 2455.665 
## 2294.774 
## 1576.511 
## 1040.812 
## 700.5577 
## 464.9641 
## 352.0848 
## 285.0404 
## 242.3561 
## 213.601 
## 198.6639 
## 190.666 
## 186.0027 
## 183.2864 
## 181.9982 
## 181.9788 
## 183.115 
## 185.0644 
## 187.4934 
## 2455.665 
## 2443.818 
## 1998.826 
## 1512.807 
## 1252.034 
## 998.2198 
## 856.4364 
## 768.004 
## 710.3814 
## 674.0325 
## 654.943 
## 643.622 
## 637.1307 
## 633.9736 
## 632.6853 
## 632.8625 
## 634.1071 
## 636.1547

We can use the fitted model object in the exact same way before. Let us quickly check the prediction under the new grouping structure.

out.preds.overlap <- HierMultinom.predict(mod.fit.overlap, Xtest)
sum(out.preds.overlap$preds == apply(Ytest, 1, which.max))/dim(Ytest)[1]
## [1] 0.9710145
tmp <- data.frame("preds" = categories[out.preds.overlap$preds],  "truth" = categories[apply(Ytest, 1, which.max)])
tab <- table(tmp$preds, tmp$truth)
colnames(tab) <- 1:8
tab
##                    
##                      1  2  3  4  5  6  7  8
##   B cells           25  0  0  0  0  0  0  0
##   CD14+ Monocytes    0 25  0  0  0  1  0  0
##   CD4 T cells        0  0 47  0  0  0  0  0
##   CD8 T cells        0  0  0 11  0  0  0  1
##   Dendritic cells    0  1  0  0  1  0  0  0
##   FCGR3A+ Monocytes  0  1  0  0  0 15  0  0
##   Megakaryocytes     0  0  0  0  0  0  1  0
##   NK cells           0  0  0  0  0  0  0  9

Here, we see the classification accuracy is even higher, with one NK that was misclassified before correctly classified now.