In this document, we provide a short tutorial on how to use the \(\texttt{SpPDCC}\) (sparse and positive definite compositional data covariance estimation) software in R. If you encounter any errors or strange behavior, please report it as an issue at https://github.com/ajmolstad/SpPDCC.

First, you need to download the R package from the GitHub repository. This can be done using \(\texttt{devtools}\).

# install.packages("devtools")
# library(devtools)
# devtools::install_github("ajmolstad/SpPDCC")
library(SpPDCC)

Main functions

There are four main functions in the R package: \(\texttt{SCCpath}\) and \(\texttt{SCCcv}\), which estimate a single basis covariance matrix (computing the full solution path and performing cross-validation, respectively), and \(\texttt{gSCCpath}\) and \(\texttt{gSCCcv}\), which do the same when jointly estimating basis covariance matrices for \(H \geq 2\) populations.
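Since the package is attached above, you can confirm what it exports directly from your R session:

# list the objects exported by the attached SpPDCC package
ls("package:SpPDCC")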

Estimating a basis covariance matrix for a single population

First, we will focus on the case where we have a single population, i.e., \(H=1\). We set the basis covariance to be moderately sparse with \(p = 20\) and \(n = 100\).

# ------------------------------   
# Basis covariance matrix  
# ----------------------------- 
p <- 20
Omega <- matrix(0, nrow = p, ncol = p)
for (jj in 1:p) {
  for (kk in 1:p) {
    Omega[jj,kk] <- 0.5^abs(jj-kk) * (abs(jj - kk) < 4)
  }
}
diag(Omega) <- 1
eo <- eigen(Omega)
OmegaSqrt <- eo$vec%*%diag(eo$val^0.5)%*%t(eo$vec)

Let us then create synthetic compositional data such that the log-abundances have covariance \(\texttt{Omega}\).

# -----------------------------  
# Generate log-abundances  
# -----------------------------
set.seed(1)
n <- 100
# create training set
logW <- matrix(rnorm(n*p), nrow = n, ncol = p)%*%OmegaSqrt
W <- exp(logW)
dat <- W/rowSums(W)
dat[1:5, 1:5]
##            [,1]       [,2]        [,3]       [,4]       [,5]
## [1,] 0.01287899 0.01563206 0.042359236 0.07830466 0.08638637
## [2,] 0.02506078 0.03115090 0.083556701 0.01753852 0.12151947
## [3,] 0.01633654 0.02167786 0.164953919 0.23030754 0.02863038
## [4,] 0.12491072 0.04071250 0.020884353 0.01773854 0.01717146
## [5,] 0.03889782 0.01459166 0.006095849 0.09859948 0.03639834
# create validation set
logWval <- matrix(rnorm(n*p), nrow = n, ncol = p)%*%OmegaSqrt
Wval <- exp(logWval)
datval <- Wval/rowSums(Wval)

Now, we will show how to estimate a single basis covariance using the \(\texttt{SCCpath}\) function. This function takes two sets of data as inputs: a training set \(\texttt{dat}\) and an optional validation set \(\texttt{datval}\). If the validation set is included, the output will also include the validation errors. It computes the solution path for \[\operatorname*{arg \ min}_{\Omega = \Omega^\top} \left( \| \widehat{\Theta} - \omega \mathbf{1}_p^\top - \mathbf{1}_p\omega^\top + 2 \Omega \|_F^2 + \lambda_1 \|\Omega^{-} \|_1 \right) ~~~\text{subject to }~\Omega \succcurlyeq \tau I_p,\] where \(\omega = {\rm diag}(\Omega)\).
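The matrix \(\widehat{\Theta}\) is an estimate of the variation matrix, whose \((j,k)\)th entry is the variance of \(\log(x_j/x_k)\); under the log-basis model this equals \(\Omega^*_{jj} + \Omega^*_{kk} - 2\Omega^*_{jk}\), which is exactly the quantity matched by the Frobenius-norm term above. A natural estimate, and presumably close to what the package computes internally, is the sample version, sketched below with our own helper name \(\texttt{varMat}\) (this is not a package function):

# sketch only: sample variation matrix, Theta.hat[j,k] = var(log(X[,j]/X[,k]))
varMat <- function(X) {
  logX <- log(X)
  p <- ncol(X)
  Theta <- matrix(0, p, p)
  for (j in 1:p) {
    for (k in 1:p) {
      Theta[j, k] <- var(logX[, j] - logX[, k])
    }
  }
  Theta
}
Theta.hat <- varMat(dat)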

The arguments for this function are as follows: \(\texttt{dat}\) is the \(n \times p\) matrix of compositional training data; \(\texttt{datval}\) is the optional validation set; \(\texttt{nlambda1}\) is the number of candidate \(\lambda_1\) values; \(\texttt{tau}\) is the constant \(\tau\) appearing in the eigenvalue constraint above; and the remaining arguments (\(\texttt{deltalambda1}\), \(\texttt{alpha0}\), \(\texttt{max.iter}\), \(\texttt{tol.obj}\), \(\texttt{tol.diff}\), \(\texttt{silent}\), \(\texttt{quiet}\), \(\texttt{path.requires}\)) control the range of candidate tuning parameters, the optimization algorithm's iteration limit and convergence tolerances, and what is printed as the path is computed. Their usage is shown in the call below.

Computing the solution path

Let us now compute the solution path and plot the validation errors as a function of the tuning parameter value. We leave \(\texttt{silent = FALSE}\), which means our function will print the validation errors as they are computed.

t0 <- SCCpath(dat = dat, datval = datval, 
      nlambda1 = 25, deltalambda1 = 0.01, 
      alpha0 = 10,   tau = 1e-4,
      max.iter = 5e3,
      tol.obj = 1e-6,
      tol.diff = 1e-6,
      silent = FALSE,
      quiet = TRUE,
      path.requires = NULL)
## 80.78847 
## 79.71636 
## 76.92104 
## 71.98883 
## 67.06506 
## 62.75614 
## 58.7971 
## 55.59699 
## 53.59258 
## 52.53062 
## 52.16434 
## 52.22194 
## 52.51772 
## 52.94404 
## 53.52762 
## 54.24214 
## 54.97954 
## 55.64945 
## 56.2625 
## 56.80949 
## 57.30656 
## 57.73241 
## 58.10649 
## 58.42519 
## 58.69122
str(t0)
## List of 3
##  $ Omega      : num [1, 1:20, 1:20, 1:25] 0.821 0 0 0 0 ...
##  $ lambda1.vec: num [1:25] 5.77 4.76 3.93 3.24 2.68 ...
##  $ valErrs    : num [1:25] 80.8 79.7 76.9 72 67.1 ...
plot(log10(t0$lambda1.vec), t0$valErrs, pch=20, xlab=expression(log[10](lambda[1])), 
     ylab="Validation error")
abline(v = log10(t0$lambda1.vec[which.min(t0$valErrs)]))

The output includes \(\texttt{Omega}\), an array of dimension \(H \times p \times p \times \texttt{nlambda1}\) containing the estimated basis covariance matrix at each candidate tuning parameter value; \(\texttt{lambda1.vec}\), the candidate \(\lambda_1\) values; and \(\texttt{valErrs}\), the corresponding validation errors.

Finally, we will extract our covariance matrix estimate and compare it to the truth using levelplots.

best.ind <- which.min(t0$valErrs)
library(lattice)
levelplot(t0$Omega[1,,,best.ind], col.regions = gray(100:0/100))

levelplot(Omega, col.regions = gray(100:0/100))

Cross-validation for tuning parameter selection

Since one will often not have a validation set on hand, it may be preferable to select tuning parameters using cross-validation. For this, we can use the \(\texttt{SCCcv}\) function. The arguments are nearly identical, except now we need to specify \(\texttt{nfolds}\), the number of folds to be used for cross-validation; no separate validation set is needed here.

Below, we perform cross-validation and plot the average cross-validation errors.

set.seed(1)
t0.cv <- SCCcv(dat, 
      nlambda1 = 25,
      deltalambda1 = 0.01, 
      alpha0 = 10,
      nfolds = 5,   
      tau = 1e-4,
      max.iter = 5e3,
      tol.obj = 1e-6,
      tol.diff = 1e-6,
      tol.obj.con = 1e-6,
      tol.diff.con = 1e-6,
      silent = FALSE,
      quiet = TRUE)
## Beginning cross-validation; printing left-out fold errors.... 
## 231.636 
## 230.8632 
## 227.377 
## 221.0684 
## 215.5186 
## 211.1204 
## 208.1691 
## 206.0648 
## 204.665 
## 204.4741 
## 205.0803 
## 206.2522 
## 207.5486 
## 209.0241 
## 210.5688 
## 212.0686 
## 213.4028 
## 214.6677 
## 215.813 
## 216.7836 
## 217.6258 
## 218.3326 
## 218.9265 
## 219.4145 
## 219.8246 
## # ---------------------------------- 
## Through fold  1  of  5 
## # ---------------------------------- 
## 215.0265 
## 214.8831 
## 210.1442 
## 202.1143 
## 194.118 
## 187.6732 
## 181.8499 
## 176.6455 
## 172.3088 
## 169.0315 
## 166.8674 
## 165.2738 
## 164.206 
## 163.4954 
## 163.1601 
## 162.9455 
## 162.9004 
## 162.9641 
## 163.0605 
## 163.1577 
## 163.2801 
## 163.4173 
## 163.5426 
## 163.6571 
## 163.7714 
## # ---------------------------------- 
## Through fold  2  of  5 
## # ---------------------------------- 
## 192.6032 
## 192.241 
## 189.3281 
## 184.7279 
## 180.0912 
## 176.1877 
## 172.3503 
## 169.057 
## 166.4314 
## 164.8126 
## 164.2978 
## 164.2123 
## 164.5575 
## 165.1308 
## 165.7126 
## 166.3253 
## 166.844 
## 167.3108 
## 167.6712 
## 168.0018 
## 168.3087 
## 168.5725 
## 168.7987 
## 168.9902 
## 169.1551 
## # ---------------------------------- 
## Through fold  3  of  5 
## # ---------------------------------- 
## 188.0394 
## 185.8124 
## 183.1556 
## 178.8035 
## 174.3133 
## 169.1547 
## 163.9407 
## 158.8996 
## 154.9253 
## 152.374 
## 150.6616 
## 149.6926 
## 149.112 
## 148.9113 
## 149.0141 
## 149.274 
## 149.6183 
## 149.9913 
## 150.3287 
## 150.606 
## 150.8732 
## 151.1085 
## 151.3104 
## 151.4866 
## 151.6381 
## # ---------------------------------- 
## Through fold  4  of  5 
## # ---------------------------------- 
## 186.7963 
## 185.6682 
## 184.4874 
## 181.4998 
## 177.5056 
## 174.7906 
## 173.0222 
## 172 
## 171.613 
## 172.1221 
## 173.4472 
## 174.6643 
## 175.9057 
## 177.0752 
## 178.2779 
## 179.3991 
## 180.4935 
## 181.4877 
## 182.4023 
## 183.1904 
## 183.8622 
## 184.447 
## 184.9386 
## 185.3497 
## 185.7035 
## # ---------------------------------- 
## Through fold  5  of  5 
## # ----------------------------------
str(t0.cv)
## List of 6
##  $ Omega      : num [1, 1:20, 1:20, 1:25] 0.821 0 0 0 0 ...
##  $ valErrs    : num [1:5, 1:25] 232 215 193 188 187 ...
##  $ avg.valErrs: num [1:25] 203 202 199 194 188 ...
##  $ Omega.min  : num [1:20, 1:20] 0.83108 0.41062 0.21663 0 0.00418 ...
##  $ fold.id    : int [1:100] 3 4 1 4 2 3 4 2 4 1 ...
##  $ lambda1.vec: num [1:25] 5.77 4.76 3.93 3.24 2.68 ...
plot(log10(t0.cv$lambda1.vec), t0.cv$avg.valErrs, pch=20, xlab=expression(log[10](lambda[1])), ylab="Average CV error")

The output closely matches that from \(\texttt{SCCpath}\), except it also includes the fold assignments (\(\texttt{fold.id}\)), the per-fold validation errors (\(\texttt{valErrs}\)), their average across folds (\(\texttt{avg.valErrs}\)), and \(\texttt{Omega.min}\), the estimate of \(\Omega^*\) computed at the tuning parameter minimizing the average validation error.
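As a quick sanity check, the averaged errors should match the per-fold errors averaged across folds, and the selected tuning parameter can be recovered from \(\texttt{avg.valErrs}\) directly (output not shown):

# average the per-fold validation errors and compare to avg.valErrs
all.equal(colMeans(t0.cv$valErrs), t0.cv$avg.valErrs)
# tuning parameter value selected by cross-validation
t0.cv$lambda1.vec[which.min(t0.cv$avg.valErrs)]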

Estimating \(H \geq 2\) basis covariance matrices jointly

Next, we will focus on the case where we have \(H=2\) basis covariance matrices to estimate. Generalization to \(H > 2\) is straightforward. We will use the same structure as before, except the correlations will differ slightly across the two populations.

# ------------------------------   
# Basis covariance matrix  
# ----------------------------- 
p <- 20
Omega <- array(0, dim=c(2, p, p))
for (jj in 1:p) {
  for (kk in 1:p) {
    Omega[1,jj,kk] <- 0.5^abs(jj-kk) * (abs(jj - kk) < 4)
    Omega[2,jj,kk] <- 0.7^abs(jj-kk) * (abs(jj - kk) < 5)
  }
}

diag(Omega[1,,]) <- 1; diag(Omega[2,,]) <- 1
eo <- eigen(Omega[1,,])
OmegaSqrt1 <- eo$vec%*%diag(eo$val^0.5)%*%t(eo$vec)
eo <- eigen(Omega[2,,])
OmegaSqrt2 <- eo$vec%*%diag(eo$val^0.5)%*%t(eo$vec)

# -----------------------------  
# Generate log-abundances  
# -----------------------------
set.seed(1)
n <- 100
dat <- list(); datval <- list()
# create training set
logW <- matrix(rnorm(n*p), nrow = n, ncol = p)%*%OmegaSqrt1
W <- exp(logW)
dat[[1]] <- W/rowSums(W)
logW <- matrix(rnorm(n*p), nrow = n, ncol = p)%*%OmegaSqrt2
W <- exp(logW)
dat[[2]] <- W/rowSums(W)

# create validation set
logWval <- matrix(rnorm(n*p), nrow = n, ncol = p)%*%OmegaSqrt1
Wval <- exp(logWval)
datval[[1]] <- Wval/rowSums(Wval)
logWval <- matrix(rnorm(n*p), nrow = n, ncol = p)%*%OmegaSqrt2
Wval <- exp(logWval)
datval[[2]] <- Wval/rowSums(Wval)

Using \(\texttt{gSCCpath}\), we fit the entire solution path. This function takes essentially the same arguments as \(\texttt{SCCpath}\), except we also have \(\texttt{nlambda2}\), \(\texttt{deltalambda2}\), and \(\texttt{weighted}\). The first two control the grid of candidate \(\lambda_2\) values, just as \(\texttt{nlambda1}\) and \(\texttt{deltalambda1}\) do for \(\lambda_1\); \(\texttt{weighted}\) indicates whether each population's loss should be weighted by its sample size (see the weighted criterion at the end of this section).

Note that here, \(\lambda_1\) and \(\lambda_2\) refer to the tuning parameters from this version of our optimization problem: \[\operatorname*{arg \min}_{\boldsymbol{\Omega} \in \mathbb{R}^{H \times p \times p}} \left\{ \sum_{h=1}^H \left(\| \widehat{\Theta}_{(h)} - \omega_{(h)} \mathbf{1}_p^\top - \mathbf{1}_p\omega_{(h)}^\top + 2 \Omega_{(h)} \|_F^2 + \lambda_1 \|\Omega_{(h)}^{-}\|_1 \right) + \lambda_2 \sum_{j\neq k} \|\boldsymbol{\Omega}_{\cdot jk}\|_2 \right\}\] \[\text{subject to } ~\Omega_{(h)} = \Omega_{(h)}^\top,~ \Omega_{(h)} \succcurlyeq \tau I_p, ~~ \text{ for all } ~h \in [H], \]
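To make the role of each penalty concrete, the following small illustration (not package code) evaluates both penalty terms at the true array \(\texttt{Omega}\) defined above: \(\lambda_1\) multiplies the sum of absolute off-diagonal entries of each \(\Omega_{(h)}\) separately, while \(\lambda_2\) multiplies the sum of the \(\ell_2\)-norms of the \((j,k)\)th entries taken across populations, which is what encourages a shared sparsity pattern.

# sketch only: the two penalty terms from the display above, at the true Omega (H = 2)
offdiag <- row(Omega[1,,]) != col(Omega[1,,])
lasso.term <- sum(abs(Omega[1,,][offdiag])) + sum(abs(Omega[2,,][offdiag]))  # multiplied by lambda1
group.term <- sum(sqrt(Omega[1,,][offdiag]^2 + Omega[2,,][offdiag]^2))       # multiplied by lambda2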

Again, we can evaluate the validation error on the validation set.

t1 <- gSCCpath(dat = dat, 
      datval = datval, 
      weighted = FALSE,
      nlambda1 = 25, nlambda2 = 15, 
      deltalambda1 = 0.001, 
      deltalambda2 = 0.01, 
      alpha0 = 1,  
      tau = 1e-4,
      max.iter = 5e3,
      tol.obj = 1e-6,
      tol.diff = 1e-6,
      tol.obj.con = 1e-6,
      tol.diff.con = 1e-6,
      silent = TRUE,
      quiet = TRUE)
str(t1)
## List of 4
##  $ Omega      : num [1:2, 1:20, 1:20, 1:25, 1:15] 0.821 0.995 0 0 0 ...
##  $ valErrs    : num [1:25, 1:15] 223 221 216 211 205 ...
##  $ lambda1.mat: num [1:15, 1:25] 2.38 3.15 3.79 4.16 4.57 ...
##  $ lambda2.vec: num [1:15] 4.24 3.18 2.38 1.79 1.34 ...
levelplot(t1$valErrs, col.regions = gray(100:0/100))

inds <- which(t1$valErrs == min(t1$valErrs), arr.ind=TRUE)
Omega.est <- t1$Omega[,,,inds[1,1], inds[1,2]]
sum((Omega.est[1,,] - Omega[1,,])^2); sum((Omega.est[2,,] - Omega[2,,])^2)
## [1] 2.982092
## [1] 4.345359

Looking at the levelplot, we see that a relatively small \(\lambda_1\) appears to be best. This agrees with intuition: since \(\Omega_{(1)}^*\) and \(\Omega_{(2)}^*\) have reasonably similar sparsity patterns, the group penalty controlled by \(\lambda_2\) can account for most of the sparsity they share, so little additional elementwise penalization is needed.
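We can check this intuition directly since we constructed the truth: the two supports differ only on the entries with \(|j - k| = 4\), so the overwhelming majority of entries agree (output not shown).

# fraction of entries where the two true basis covariances share the same support
mean((Omega[1,,] != 0) == (Omega[2,,] != 0))

We can also select both tuning parameters by cross-validation using \(\texttt{gSCCcv}\):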

t1.cv <- gSCCcv(dat = dat, 
      nfolds = 5, 
      nlambda1 = 20, nlambda2 = 10, 
      deltalambda1 = 0.001, 
      deltalambda2 = 0.01, 
      alpha0 = 1,  
      tau = 1e-4,
      max.iter = 5e3,
      tol.obj = 1e-6,
      tol.diff = 1e-6,
      tol.obj.con = 1e-6,
      tol.diff.con = 1e-6,
      silent = TRUE,
      quiet = TRUE)
str(t1.cv)
## List of 7
##  $ Omega      : num [1:2, 1:20, 1:20, 1:20, 1:10] 0.821 0.995 0 0 0 ...
##  $ valErrs    : num [1:5, 1:20, 1:10] 444 488 543 409 449 ...
##  $ avg.valErrs: num [1:20, 1:10] 467 463 453 441 432 ...
##  $ fold.ids   :List of 2
##   ..$ : int [1:100] 2 4 5 1 2 2 2 2 5 5 ...
##   ..$ : int [1:100] 5 1 4 3 3 3 3 5 2 4 ...
##  $ lambda1.mat: num [1:10, 1:20] 3.15 3.98 4.57 5.02 5.25 ...
##  $ lambda2.vec: num [1:10] 3.263 2.147 1.413 0.929 0.611 ...
##  $ Omega.min  : num [1:2, 1:20, 1:20] 0.839 1.026 0.456 0.629 0.273 ...
sum((t1.cv$Omega.min[1,,] - Omega[1,,])^2); sum((t1.cv$Omega.min[2,,] - Omega[2,,])^2)
## [1] 2.950094
## [1] 4.360383

We can compare our estimate of \(\Omega_{(1)}^*\) to the truth:

library(lattice)
levelplot(t1.cv$Omega.min[1,,], col.regions = gray(100:0/100))

levelplot(Omega[1,,], col.regions = gray(100:0/100))

Finally, we can instead use the weighted version and see how our estimate differs. Note that this estimator solves \[\operatorname*{arg \min}_{\boldsymbol{\Omega} \in \mathbb{R}^{H \times p \times p}} \left\{ \sum_{h=1}^H \left( \frac{n_{(h)}}{N}\| \widehat{\Theta}_{(h)} - \omega_{(h)} \mathbf{1}_p^\top - \mathbf{1}_p\omega_{(h)}^\top + 2 \Omega_{(h)} \|_F^2 + \lambda_1 \|\Omega_{(h)}^{-}\|_1 \right) + \lambda_2 \sum_{j\neq k} \|\boldsymbol{\Omega}_{\cdot jk}\|_2 \right\}\] \[\text{subject to } ~\Omega_{(h)} = \Omega_{(h)}^\top,~ \Omega_{(h)} \succcurlyeq \tau I_p, ~~ \text{ for all } ~h \in [H], \] where \(n_{(h)}\) is the sample size for the \(h\)th population and \(N = \sum_{h=1}^H n_{(h)}\); it similarly uses a weighted version of our cross-validation criterion.

t1w.path <- gSCCpath(dat = dat, datval = datval,
      nlambda1 = 20, nlambda2 = 10, 
      deltalambda1 = 0.001, 
      deltalambda2 = 0.01, 
      weighted = TRUE,
      alpha0 = 1,  
      tau = 1e-4,
      max.iter = 5e3,
      tol.obj = 1e-6,
      tol.diff = 1e-6,
      tol.obj.con = 1e-6,
      tol.diff.con = 1e-6,
      silent = TRUE,
      quiet = TRUE)
str(t1w.path)
## List of 4
##  $ Omega      : num [1:2, 1:20, 1:20, 1:20, 1:10] 0.821 0.995 0 0 0 ...
##  $ valErrs    : num [1:20, 1:10] 111.3 109.1 103.2 96.8 91.6 ...
##  $ lambda1.mat: num [1:10, 1:20] 1.58 1.99 2.29 2.51 2.63 ...
##  $ lambda2.vec: num [1:10] 1.632 1.074 0.706 0.465 0.306 ...
# extract the best weighted estimate (by validation error) and compare it to the truth
indsw <- which(t1w.path$valErrs == min(t1w.path$valErrs), arr.ind = TRUE)
Omega.est.w <- t1w.path$Omega[,,,indsw[1,1], indsw[1,2]]
sum((Omega.est.w[1,,] - Omega[1,,])^2); sum((Omega.est.w[2,,] - Omega[2,,])^2)