In this document, we provide a short tutorial on how to use the \(\texttt{SpPDCC}\) (sparse and positive definite compositional data covariance estimation) software in R. If you encounter any errors or strange behavior, please report it as an issue at https://github.com/ajmolstad/SpPDCC.

First, you need to download the R package from the GitHub repository. This can be done using \(\texttt{devtools}\).

# install.packages("devtools")
# library(devtools)
# devtools::install_github("ajmolstad/SpPDCC")
library(SpPDCC)

Main functions

There are four main functions in the R package: \(\texttt{SCCpath}\) and \(\texttt{SCCcv}\), which estimate a single basis covariance matrix (computing the full solution path and performing cross-validation, respectively), and \(\texttt{gSCCpath}\) and \(\texttt{gSCCcv}\), which do the same when jointly estimating basis covariance matrices for \(H \geq 2\) populations.
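Since the package is attached above, you can confirm what it exports directly from your R session:

# list the objects exported by the attached SpPDCC package
ls("package:SpPDCC")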

Estimating a basis covariance matrix for a single population

First, we will focus on the case where we have a single population, i.e., \(H=1\). We set the basis covariance to be moderately sparse with \(p = 20\) and \(n = 100\).

# ------------------------------   
# Basis covariance matrix  
# ----------------------------- 
p <- 20
Omega <- matrix(0, nrow = p, ncol = p)
for (jj in 1:p) {
  for (kk in 1:p) {
    Omega[jj,kk] <- 0.5^abs(jj-kk) * (abs(jj - kk) < 4)
  }
}
diag(Omega) <- 1
eo <- eigen(Omega)
OmegaSqrt <- eo$vec%*%diag(eo$val^0.5)%*%t(eo$vec)

Let us then create synthetic compositional data such that the log-abundances have covariance \(\texttt{Omega}\).

# -----------------------------  
# Generate log-abundances  
# -----------------------------
set.seed(1)
n <- 100
# create training set
logW <- matrix(rnorm(n*p), nrow = n, ncol = p)%*%OmegaSqrt
W <- exp(logW)
dat <- W/rowSums(W)
dat[1:5, 1:5]
##            [,1]       [,2]        [,3]       [,4]       [,5]
## [1,] 0.01287899 0.01563206 0.042359236 0.07830466 0.08638637
## [2,] 0.02506078 0.03115090 0.083556701 0.01753852 0.12151947
## [3,] 0.01633654 0.02167786 0.164953919 0.23030754 0.02863038
## [4,] 0.12491072 0.04071250 0.020884353 0.01773854 0.01717146
## [5,] 0.03889782 0.01459166 0.006095849 0.09859948 0.03639834
# create validation set
logWval <- matrix(rnorm(n*p), nrow = n, ncol = p)%*%OmegaSqrt
Wval <- exp(logWval)
datval <- Wval/rowSums(Wval)

Now, we will show how to estimate a single basis covariance using the \(\texttt{SCCpath}\) function. This function takes two sets of data as inputs: a training set \(\texttt{dat}\) and an optional validation set \(\texttt{datval}\). If the validation set is included, the output will also include the validation errors. It computes the solution path for \[\operatorname*{arg \ min}_{\Omega = \Omega^\top} \left( \| \widehat{\Theta} - \omega \mathbf{1}_p^\top - \mathbf{1}_p\omega^\top + 2 \Omega \|_F^2 + \lambda_1 \|\Omega^{-} \|_1 \right) ~~~\text{subject to }~\Omega \succcurlyeq \tau I_p,\] where \(\omega = {\rm diag}(\Omega)\).
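The matrix \(\widehat{\Theta}\) is an estimate of the variation matrix, whose \((j,k)\)th entry is the variance of \(\log(x_j/x_k)\); under the log-basis model this equals \(\Omega^*_{jj} + \Omega^*_{kk} - 2\Omega^*_{jk}\), which is exactly the quantity matched by the Frobenius-norm term above. A natural estimate, and presumably close to what the package computes internally, is the sample version, sketched below with our own helper name \(\texttt{varMat}\) (this is not a package function):

# sketch only: sample variation matrix, Theta.hat[j,k] = var(log(X[,j]/X[,k]))
varMat <- function(X) {
  logX <- log(X)
  p <- ncol(X)
  Theta <- matrix(0, p, p)
  for (j in 1:p) {
    for (k in 1:p) {
      Theta[j, k] <- var(logX[, j] - logX[, k])
    }
  }
  Theta
}
Theta.hat <- varMat(dat)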

The arguments for this function are as follows: \(\texttt{dat}\) is the \(n \times p\) matrix of compositional training data; \(\texttt{datval}\) is the optional validation set; \(\texttt{nlambda1}\) is the number of candidate \(\lambda_1\) values; \(\texttt{tau}\) is the constant \(\tau\) appearing in the eigenvalue constraint above; and the remaining arguments (\(\texttt{deltalambda1}\), \(\texttt{alpha0}\), \(\texttt{max.iter}\), \(\texttt{tol.obj}\), \(\texttt{tol.diff}\), \(\texttt{silent}\), \(\texttt{quiet}\), \(\texttt{path.requires}\)) control the range of candidate tuning parameters, the optimization algorithm's iteration limit and convergence tolerances, and what is printed as the path is computed. Their usage is shown in the call below.

Computing the solution path

Let us now compute the solution path and plot the validation errors as a function of the tuning parameter value. We leave \(\texttt{silent = FALSE}\), which means our function will print the validation errors as they are computed.

t0 <- SCCpath(dat = dat, datval = datval, 
      nlambda1 = 25, deltalambda1 = 0.01, 
      alpha0 = 10,   tau = 1e-4,
      max.iter = 5e3,
      tol.obj = 1e-6,
      tol.diff = 1e-6,
      silent = FALSE,
      quiet = TRUE,
      path.requires = NULL)
## 80.78847 
## 79.71636 
## 76.92104 
## 71.98883 
## 67.06506 
## 62.75614 
## 58.7971 
## 55.59699 
## 53.59258 
## 52.53062 
## 52.16434 
## 52.22194 
## 52.51772 
## 52.94404 
## 53.52762 
## 54.24214 
## 54.97954 
## 55.64945 
## 56.2625 
## 56.80949 
## 57.30656 
## 57.73241 
## 58.10649 
## 58.42519 
## 58.69122
str(t0)
## List of 3
##  $ Omega      : num [1, 1:20, 1:20, 1:25] 0.821 0 0 0 0 ...
##  $ lambda1.vec: num [1:25] 5.77 4.76 3.93 3.24 2.68 ...
##  $ valErrs    : num [1:25] 80.8 79.7 76.9 72 67.1 ...
plot(log10(t0$lambda1.vec), t0$valErrs, pch=20, xlab=expression(log[10](lambda[1])), 
     ylab="Validation error")
abline(v = log10(t0$lambda1.vec[which.min(t0$valErrs)]))

The output includes \(\texttt{Omega}\), an array of dimension \(H \times p \times p \times \texttt{nlambda1}\) containing the estimated basis covariance matrix at each candidate tuning parameter value; \(\texttt{lambda1.vec}\), the candidate \(\lambda_1\) values; and \(\texttt{valErrs}\), the corresponding validation errors.

Finally, we will extract our covariance matrix estimate and compare it to the truth using levelplots.

best.ind <- which.min(t0$valErrs)
library(lattice)
levelplot(t0$Omega[1,,,best.ind], col.regions = gray(100:0/100))

levelplot(Omega, col.regions = gray(100:0/100))

Cross-validation for tuning parameter selection

Since one will often not have a validation set on hand, it may be preferable to select tuning parameters using cross-validation. For this, we can use the \(\texttt{SCCcv}\) function. The arguments are nearly identical, except now we need to specify \(\texttt{nfolds}\), the number of folds to be used for cross-validation; no separate validation set is needed here.

Below, we perform cross-validation and plot the average cross-validation errors.

set.seed(1)
t0.cv <- SCCcv(dat, 
      nlambda1 = 25,
      deltalambda1 = 0.01, 
      alpha0 = 10,
      nfolds = 5,   
      tau = 1e-4,
      max.iter = 5e3,
      tol.obj = 1e-6,
      tol.diff = 1e-6,
      tol.obj.con = 1e-6,
      tol.diff.con = 1e-6,
      silent = FALSE,
      quiet = TRUE)
## Beginning cross-validation; printing left-out fold errors.... 
## 231.636 
## 230.8632 
## 227.377 
## 221.0684 
## 215.5186 
## 211.1204 
## 208.1691 
## 206.0648 
## 204.665 
## 204.4741 
## 205.0803 
## 206.2522 
## 207.5486 
## 209.0241 
## 210.5688 
## 212.0686 
## 213.4028 
## 214.6677 
## 215.813 
## 216.7836 
## 217.6258 
## 218.3326 
## 218.9265 
## 219.4145 
## 219.8246 
## # ---------------------------------- 
## Through fold  1  of  5 
## # ---------------------------------- 
## 215.0265 
## 214.8831 
## 210.1442 
## 202.1143 
## 194.118 
## 187.6732 
## 181.8499 
## 176.6455 
## 172.3088 
## 169.0315 
## 166.8674 
## 165.2738 
## 164.206 
## 163.4954 
## 163.1601 
## 162.9455 
## 162.9004 
## 162.9641 
## 163.0605 
## 163.1577 
## 163.2801 
## 163.4173 
## 163.5426 
## 163.6571 
## 163.7714 
## # ---------------------------------- 
## Through fold  2  of  5 
## # ---------------------------------- 
## 192.6032 
## 192.241 
## 189.3281 
## 184.7279 
## 180.0912 
## 176.1877 
## 172.3503 
## 169.057 
## 166.4314 
## 164.8126 
## 164.2978 
## 164.2123 
## 164.5575 
## 165.1308 
## 165.7126 
## 166.3253 
## 166.844 
## 167.3108 
## 167.6712 
## 168.0018 
## 168.3087 
## 168.5725 
## 168.7987 
## 168.9902 
## 169.1551 
## # ---------------------------------- 
## Through fold  3  of  5 
## # ---------------------------------- 
## 188.0394 
## 185.8124 
## 183.1556 
## 178.8035 
## 174.3133 
## 169.1547 
## 163.9407 
## 158.8996 
## 154.9253 
## 152.374 
## 150.6616 
## 149.6926 
## 149.112 
## 148.9113 
## 149.0141 
## 149.274 
## 149.6183 
## 149.9913 
## 150.3287 
## 150.606 
## 150.8732 
## 151.1085 
## 151.3104 
## 151.4866 
## 151.6381 
## # ---------------------------------- 
## Through fold  4  of  5 
## # ---------------------------------- 
## 186.7963 
## 185.6682 
## 184.4874 
## 181.4998 
## 177.5056 
## 174.7906 
## 173.0222 
## 172 
## 171.613 
## 172.1221 
## 173.4472 
## 174.6643 
## 175.9057 
## 177.0752 
## 178.2779 
## 179.3991 
## 180.4935 
## 181.4877 
## 182.4023 
## 183.1904 
## 183.8622 
## 184.447 
## 184.9386 
## 185.3497 
## 185.7035 
## # ---------------------------------- 
## Through fold  5  of  5 
## # ----------------------------------
str(t0.cv)
## List of 6
##  $ Omega      : num [1, 1:20, 1:20, 1:25] 0.821 0 0 0 0 ...
##  $ valErrs    : num [1:5, 1:25] 232 215 193 188 187 ...
##  $ avg.valErrs: num [1:25] 203 202 199 194 188 ...
##  $ Omega.min  : num [1:20, 1:20] 0.83108 0.41062 0.21663 0 0.00418 ...
##  $ fold.id    : int [1:100] 3 4 1 4 2 3 4 2 4 1 ...
##  $ lambda1.vec: num [1:25] 5.77 4.76 3.93 3.24 2.68 ...
plot(log10(t0.cv$lambda1.vec), t0.cv$avg.valErrs, pch=20, xlab=expression(log[10](lambda[1])), ylab="Average CV error")

The output closely matches that from \(\texttt{SCCpath}\), except it also includes the fold assignments (\(\texttt{fold.id}\)), the per-fold validation errors (\(\texttt{valErrs}\)), their average across folds (\(\texttt{avg.valErrs}\)), and \(\texttt{Omega.min}\), the estimate of \(\Omega^*\) computed at the tuning parameter minimizing the average validation error.
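As a quick sanity check, the averaged errors should match the per-fold errors averaged across folds, and the selected tuning parameter can be recovered from \(\texttt{avg.valErrs}\) directly (output not shown):

# average the per-fold validation errors and compare to avg.valErrs
all.equal(colMeans(t0.cv$valErrs), t0.cv$avg.valErrs)
# tuning parameter value selected by cross-validation
t0.cv$lambda1.vec[which.min(t0.cv$avg.valErrs)]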

Estimating \(H \geq 2\) basis covariance matrices jointly

Next, we will focus on the case where we have \(H=2\) basis covariance matrices to estimate. Generalization to \(H > 2\) is straightforward. We will use the same structure as before, except the correlations will differ slightly across the two populations.

# ------------------------------   
# Basis covariance matrix  
# ----------------------------- 
p <- 20
Omega <- array(0, dim=c(2, p, p))
for (jj in 1:p) {
  for (kk in 1:p) {
    Omega[1,jj,kk] <- 0.5^abs(jj-kk) * (abs(jj - kk) < 4)
    Omega[2,jj,kk] <- 0.7^abs(jj-kk) * (abs(jj - kk) < 5)
  }
}

diag(Omega[1,,]) <- 1; diag(Omega[2,,]) <- 1
eo <- eigen(Omega[1,,])
OmegaSqrt1 <- eo$vec%*%diag(eo$val^0.5)%*%t(eo$vec)
eo <- eigen(Omega[2,,])
OmegaSqrt2 <- eo$vec%*%diag(eo$val^0.5)%*%t(eo$vec)

# -----------------------------  
# Generate log-abundances  
# -----------------------------
set.seed(1)
n <- 100
dat <- list(); datval <- list()
# create training set
logW <- matrix(rnorm(n*p), nrow = n, ncol = p)%*%OmegaSqrt1
W <- exp(logW)
dat[[1]] <- W/rowSums(W)
logW <- matrix(rnorm(n*p), nrow = n, ncol = p)%*%OmegaSqrt2
W <- exp(logW)
dat[[2]] <- W/rowSums(W)

# create validation set
logWval <- matrix(rnorm(n*p), nrow = n, ncol = p)%*%OmegaSqrt1
Wval <- exp(logWval)
datval[[1]] <- Wval/rowSums(Wval)
logWval <- matrix(rnorm(n*p), nrow = n, ncol = p)%*%OmegaSqrt2
Wval <- exp(logWval)
datval[[2]] <- Wval/rowSums(Wval)

Using \(\texttt{gSCCpath}\), we fit the entire solution path. This function takes essentially the same arguments as \(\texttt{SCCpath}\), except we also have \(\texttt{nlambda2}\), \(\texttt{deltalambda2}\), and \(\texttt{weighted}\). The first two control the grid of candidate \(\lambda_2\) values, just as \(\texttt{nlambda1}\) and \(\texttt{deltalambda1}\) do for \(\lambda_1\); \(\texttt{weighted}\) indicates whether each population's loss should be weighted by its sample size (see the weighted criterion at the end of this section).

Note that here, \(\lambda_1\) and \(\lambda_2\) refer to the tuning parameters from this version of our optimization problem: \[\operatorname*{arg \min}_{\boldsymbol{\Omega} \in \mathbb{R}^{H \times p \times p}} \left\{ \sum_{h=1}^H \left(\| \widehat{\Theta}_{(h)} - \omega_{(h)} \mathbf{1}_p^\top - \mathbf{1}_p\omega_{(h)}^\top + 2 \Omega_{(h)} \|_F^2 + \lambda_1 \|\Omega_{(h)}^{-}\|_1 \right) + \lambda_2 \sum_{j\neq k} \|\boldsymbol{\Omega}_{\cdot jk}\|_2 \right\}\] \[\text{subject to } ~\Omega_{(h)} = \Omega_{(h)}^\top,~ \Omega_{(h)} \succcurlyeq \tau I_p, ~~ \text{ for all } ~h \in [H], \]
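To make the role of each penalty concrete, the following small illustration (not package code) evaluates both penalty terms at the true array \(\texttt{Omega}\) defined above: \(\lambda_1\) multiplies the sum of absolute off-diagonal entries of each \(\Omega_{(h)}\) separately, while \(\lambda_2\) multiplies the sum of the \(\ell_2\)-norms of the \((j,k)\)th entries taken across populations, which is what encourages a shared sparsity pattern.

# sketch only: the two penalty terms from the display above, at the true Omega (H = 2)
offdiag <- row(Omega[1,,]) != col(Omega[1,,])
lasso.term <- sum(abs(Omega[1,,][offdiag])) + sum(abs(Omega[2,,][offdiag]))  # multiplied by lambda1
group.term <- sum(sqrt(Omega[1,,][offdiag]^2 + Omega[2,,][offdiag]^2))       # multiplied by lambda2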

Again, we can evaluate the validation error on the validation set.

t1 <- gSCCpath(dat = dat, 
      datval = datval, 
      weighted = FALSE,
      nlambda1 = 25, nlambda2 = 15, 
      deltalambda1 = 0.001, 
      deltalambda2 = 0.01, 
      alpha0 = 1,  
      tau = 1e-4,
      max.iter = 5e3,
      tol.obj = 1e-6,
      tol.diff = 1e-6,
      tol.obj.con = 1e-6,
      tol.diff.con = 1e-6,
      silent = TRUE,
      quiet = TRUE)
str(t1)
## List of 4
##  $ Omega      : num [1:2, 1:20, 1:20, 1:25, 1:15] 0.821 0.995 0 0 0 ...
##  $ valErrs    : num [1:25, 1:15] 223 221 216 211 205 ...
##  $ lambda1.mat: num [1:15, 1:25] 2.38 3.15 3.79 4.16 4.57 ...
##  $ lambda2.vec: num [1:15] 4.24 3.18 2.38 1.79 1.34 ...
levelplot(t1$valErrs, col.regions = gray(100:0/100))

inds <- which(t1$valErrs == min(t1$valErrs), arr.ind=TRUE)
Omega.est <- t1$Omega[,,,inds[1,1], inds[1,2]]
sum((Omega.est[1,,] - Omega[1,,])^2); sum((Omega.est[2,,] - Omega[2,,])^2)
## [1] 2.982092
## [1] 4.345359

Looking at the levelplot, we see that a relatively small \(\lambda_1\) appears to be best. This agrees with intuition: since \(\Omega_{(1)}^*\) and \(\Omega_{(2)}^*\) have reasonably similar sparsity patterns, the group penalty controlled by \(\lambda_2\) can account for most of the sparsity they share, so little additional elementwise penalization is needed.
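We can check this intuition directly since we constructed the truth: the two supports differ only on the entries with \(|j - k| = 4\), so the overwhelming majority of entries agree (output not shown).

# fraction of entries where the two true basis covariances share the same support
mean((Omega[1,,] != 0) == (Omega[2,,] != 0))

We can also select both tuning parameters by cross-validation using \(\texttt{gSCCcv}\):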

t1.cv <- gSCCcv(dat = dat, 
      nfolds = 5, 
      nlambda1 = 20, nlambda2 = 10, 
      deltalambda1 = 0.001, 
      deltalambda2 = 0.01, 
      alpha0 = 1,  
      tau = 1e-4,
      max.iter = 5e3,
      tol.obj = 1e-6,
      tol.diff = 1e-6,
      tol.obj.con = 1e-6,
      tol.diff.con = 1e-6,
      silent = TRUE,
      quiet = TRUE)
str(t1.cv)
## List of 7
##  $ Omega      : num [1:2, 1:20, 1:20, 1:20, 1:10] 0.821 0.995 0 0 0 ...
##  $ valErrs    : num [1:5, 1:20, 1:10] 444 488 543 409 449 ...
##  $ avg.valErrs: num [1:20, 1:10] 467 463 453 441 432 ...
##  $ fold.ids   :List of 2
##   ..$ : int [1:100] 2 4 5 1 2 2 2 2 5 5 ...
##   ..$ : int [1:100] 5 1 4 3 3 3 3 5 2 4 ...
##  $ lambda1.mat: num [1:10, 1:20] 3.15 3.98 4.57 5.02 5.25 ...
##  $ lambda2.vec: num [1:10] 3.263 2.147 1.413 0.929 0.611 ...
##  $ Omega.min  : num [1:2, 1:20, 1:20] 0.839 1.026 0.456 0.629 0.273 ...
sum((t1.cv$Omega.min[1,,] - Omega[1,,])^2); sum((t1.cv$Omega.min[2,,] - Omega[2,,])^2)
## [1] 2.950094
## [1] 4.360383

We can compare our estimate of \(\Omega_{(1)}^*\) to the truth:

library(lattice)
levelplot(t1.cv$Omega.min[1,,], col.regions = gray(100:0/100))

levelplot(Omega[1,,], col.regions = gray(100:0/100))

Finally, we can instead use the weighted version and see how our estimate differs. Note that this estimator solves \[\operatorname*{arg \min}_{\boldsymbol{\Omega} \in \mathbb{R}^{H \times p \times p}} \left\{ \sum_{h=1}^H \left( \frac{n_{(h)}}{N}\| \widehat{\Theta}_{(h)} - \omega_{(h)} \mathbf{1}_p^\top - \mathbf{1}_p\omega_{(h)}^\top + 2 \Omega_{(h)} \|_F^2 + \lambda_1 \|\Omega_{(h)}^{-}\|_1 \right) + \lambda_2 \sum_{j\neq k} \|\boldsymbol{\Omega}_{\cdot jk}\|_2 \right\}\] \[\text{subject to } ~\Omega_{(h)} = \Omega_{(h)}^\top,~ \Omega_{(h)} \succcurlyeq \tau I_p, ~~ \text{ for all } ~h \in [H], \] where \(n_{(h)}\) is the sample size for the \(h\)th population and \(N = \sum_{h=1}^H n_{(h)}\); it similarly uses a weighted version of our cross-validation criterion.

t1w.path <- gSCCpath(dat = dat, datval = datval,
      nlambda1 = 20, nlambda2 = 10, 
      deltalambda1 = 0.001, 
      deltalambda2 = 0.01, 
      weighted = TRUE,
      alpha0 = 1,  
      tau = 1e-4,
      max.iter = 5e3,
      tol.obj = 1e-6,
      tol.diff = 1e-6,
      tol.obj.con = 1e-6,
      tol.diff.con = 1e-6,
      silent = TRUE,
      quiet = TRUE)
str(t1w.path)
## List of 4
##  $ Omega      : num [1:2, 1:20, 1:20, 1:20, 1:10] 0.821 0.995 0 0 0 ...
##  $ valErrs    : num [1:20, 1:10] 111.3 109.1 103.2 96.8 91.6 ...
##  $ lambda1.mat: num [1:10, 1:20] 1.58 1.99 2.29 2.51 2.63 ...
##  $ lambda2.vec: num [1:10] 1.632 1.074 0.706 0.465 0.306 ...
# extract the best weighted estimate (by validation error) and compare it to the truth
indsw <- which(t1w.path$valErrs == min(t1w.path$valErrs), arr.ind = TRUE)
Omega.est.w <- t1w.path$Omega[,,,indsw[1,1], indsw[1,2]]
sum((Omega.est.w[1,,] - Omega[1,,])^2); sum((Omega.est.w[2,,] - Omega[2,,])^2)