In this document, we provide a short tutorial on how to use the \(\texttt{SpPDCC}\) (sparse and positive
definite compositional data covariance estimation) software in R. If you
encounter any errors or strange behavior, please report it as an issue
at https://github.com/ajmolstad/SpPDCC.
First, you need to download the R package from the GitHub repository.
This can be done using \(\texttt{devtools}\).
# install.packages("devtools")
# library(devtools)
# devtools::install_github("ajmolstad/SpPDCC")
library(SpPDCC)
Main functions
There are four main functions in the R package. They are
- \(\texttt{SCCpath}\): Fits the
solution path for estimating the basis covariance in \(H = 1\) population.
- \(\texttt{SCCcv}\): Performs \(\texttt{nfolds}\) cross-validation to
select tuning parameters in order to estimate the basis covariance in
\(H = 1\) population. Note that this
function also returns the solution path on the complete training
data.
- \(\texttt{gSCCpath}\): Fits the
solution path for estimating basis covariance matrices jointly in \(H \geq 2\) populations.
- \(\texttt{gSCCcv}\): Performs \(\texttt{nfolds}\) cross-validation to
select tuning parameters in order to estimate basis covariance matrices
in \(H \geq 2\) population. Note that
this function also returns the solution path on the complete training
data.
Estimating a basis covariance matrix for a single population
First, we will focus on the case where we have a single population,
i.e., \(H=1\). We set the basis
covariance to be moderately sparse with \(p =
20\) and \(n = 100\).
# ------------------------------
# Basis covariance matrix
# -----------------------------
p <- 20
Omega <- matrix(0, nrow = p, ncol = p)
for (jj in 1:p) {
for (kk in 1:p) {
Omega[jj,kk] <- 0.5^abs(jj-kk) * (abs(jj - kk) < 4)
}
}
diag(Omega) <- 1
eo <- eigen(Omega)
OmegaSqrt <- eo$vec%*%diag(eo$val^0.5)%*%t(eo$vec)
Let us then create synthetic compositional data such that the
log-abundances have covariance \(\texttt{Omega}\).
# -----------------------------
# Generate log-abundances
# -----------------------------
set.seed(1)
n <- 100
# create training set
logW <- matrix(rnorm(n*p), nrow = n, ncol = p)%*%OmegaSqrt
W <- exp(logW)
dat <- W/rowSums(W)
dat[1:5, 1:5]
## [,1] [,2] [,3] [,4] [,5]
## [1,] 0.01287899 0.01563206 0.042359236 0.07830466 0.08638637
## [2,] 0.02506078 0.03115090 0.083556701 0.01753852 0.12151947
## [3,] 0.01633654 0.02167786 0.164953919 0.23030754 0.02863038
## [4,] 0.12491072 0.04071250 0.020884353 0.01773854 0.01717146
## [5,] 0.03889782 0.01459166 0.006095849 0.09859948 0.03639834
# create validation set
logWval <- matrix(rnorm(n*p), nrow = n, ncol = p)%*%OmegaSqrt
Wval <- exp(logWval)
datval <- Wval/rowSums(Wval)
Now, we will show how to estimate a single basis covariance using the
function. This function takes two sets of data as inputs: a training set
, and an optional validation set . If the validation set is included,
the output will also include the validation errors. It computes the
solution path for \[\operatorname*{arg \
min}_{\Omega = \Omega^\top} \left( \| \widehat{\Theta} - \omega
\mathbf{1}_p^\top - \mathbf{1}_p\omega^\top + 2 \Omega \|_F^2 +
\lambda_1 \|\Omega^{-} \|_1 \right) ~~~\text{subject to }~\Omega
\succcurlyeq \tau I_p,\] where \(\omega
= {\rm diag}(\Omega)\).
The arguments for this function are as follows:
- \(\texttt{dat}\): an \(n \times p\) matrix of compositional data
with each \(p\)-dimensional
compositional observation organized by row
- \(\texttt{datval}\): an \(n_{\rm val} \times p\) matrix of
compositional data with each \(p\)-dimensional compositional observation
organized by row
- \(\texttt{nlambda1}\): number of
candidate \(\lambda_1\) values
- \(\texttt{deltalambda1}\): ratio of
minimum to maximum candidate values of \(\lambda_1\)
- \(\texttt{alpha0}\): starting step
size (default 10)
- \(\texttt{tau}\): lower bound on
eigenvalues of solution
- \(\texttt{max.iter}\): maximum
AccPGD or PPGD iterations
- \(\texttt{tol.obj}\): % change in
objective function for convergence for AccPGD
- \(\texttt{tol.diff}\): max-abs
change in iterates for convergence for AccPGD
- \(\texttt{tol.obj.con}\): % change
in objective function for convergence for PPGD (recommended to be lower
than \(\texttt{tol.obj}\))
- \(\texttt{tol.diff.con}\): max-abs
change in iterates for convergence for PPGD
- \(\texttt{silent}\): TRUE/FALSE
suppress all progress messages?
- \(\texttt{quiet}\): TRUE/FALSE
suppress objective function value printing after each iteration (to be
used for diagnostics only)
- \(\texttt{path.requires}\): how
many tuning parameters must be tried we terminate path when validation
loss increases for five successive candidate tuning parameters
Computing the solution path
Let us now compute the solution path and plot the validation errors
as a function of the tuning parameter value. We leave , which means our
function will print the validation errors as they are computed.
t0 <- SCCpath(dat = dat, datval = datval,
nlambda1 = 25, deltalambda1 = 0.01,
alpha0 = 10, tau = 1e-4,
max.iter = 5e3,
tol.obj = 1e-6,
tol.diff = 1e-6,
silent = FALSE,
quiet = TRUE,
path.requires = NULL)
## 80.78847
## 79.71636
## 76.92104
## 71.98883
## 67.06506
## 62.75614
## 58.7971
## 55.59699
## 53.59258
## 52.53062
## 52.16434
## 52.22194
## 52.51772
## 52.94404
## 53.52762
## 54.24214
## 54.97954
## 55.64945
## 56.2625
## 56.80949
## 57.30656
## 57.73241
## 58.10649
## 58.42519
## 58.69122
str(t0)
## List of 3
## $ Omega : num [1, 1:20, 1:20, 1:25] 0.821 0 0 0 0 ...
## $ lambda1.vec: num [1:25] 5.77 4.76 3.93 3.24 2.68 ...
## $ valErrs : num [1:25] 80.8 79.7 76.9 72 67.1 ...
plot(log10(t0$lambda1.vec), t0$valErrs, pch=20, xlab=expression(log[10](lambda[1])),
ylab="Validation error")
abline(v = log10(t0$lambda1.vec[which.min(t0$valErrs)]))
The output includes
- \(\texttt{Omega}\): A \(1 \times p \times p \times
\texttt{nlambda1}\) array, including basis covariance matrix
estimates along the solution path
- \(\texttt{lambda1.vec}\): The
candidate tuning parameters. The \(k\)th basis covariance estimate \(\texttt{Omega[1,,,k]}\) is that fit with
\(k\)th element of \(\texttt{lambda1.vec}\).
- \(\texttt{valErrs}\): If a
validation set was included, returns the validation error for each
element of \(\texttt{lambda1.vec}\).
Finally, we will extract our covariance matrix estimate and compare
it to the truth using leveplots.
best.ind <- which.min(t0$valErrs)
library(lattice)
levelplot(t0$Omega[1,,,best.ind], col.regions = gray(100:0/100))
levelplot(Omega, col.regions = gray(100:0/100))
Cross-validation for tuning parameter selection
Since one will often not have a validation set on hand, it may be
preferable to select tuning parameters using cross validation. For this,
we can use the \(\texttt{SCCcv}\)
function. The arguments are nearly identical, except now we need to
specify \(\texttt{nfolds}\), the number
of folds to be used for cross-validation. Also, no validation set is
needed here.
Below, we perform cross-validation and plot the average
cross-validation errors.
set.seed(1)
t0.cv <- SCCcv(dat,
nlambda1 = 25,
deltalambda1 = 0.01,
alpha0 = 10,
nfolds = 5,
tau = 1e-4,
max.iter = 5e3,
tol.obj = 1e-6,
tol.diff = 1e-6,
tol.obj.con = 1e-6,
tol.diff.con = 1e-6,
silent = FALSE,
quiet = TRUE)
## Beginning cross-validation; printing left-out fold errors....
## 231.636
## 230.8632
## 227.377
## 221.0684
## 215.5186
## 211.1204
## 208.1691
## 206.0648
## 204.665
## 204.4741
## 205.0803
## 206.2522
## 207.5486
## 209.0241
## 210.5688
## 212.0686
## 213.4028
## 214.6677
## 215.813
## 216.7836
## 217.6258
## 218.3326
## 218.9265
## 219.4145
## 219.8246
## # ----------------------------------
## Through fold 1 of 5
## # ----------------------------------
## 215.0265
## 214.8831
## 210.1442
## 202.1143
## 194.118
## 187.6732
## 181.8499
## 176.6455
## 172.3088
## 169.0315
## 166.8674
## 165.2738
## 164.206
## 163.4954
## 163.1601
## 162.9455
## 162.9004
## 162.9641
## 163.0605
## 163.1577
## 163.2801
## 163.4173
## 163.5426
## 163.6571
## 163.7714
## # ----------------------------------
## Through fold 2 of 5
## # ----------------------------------
## 192.6032
## 192.241
## 189.3281
## 184.7279
## 180.0912
## 176.1877
## 172.3503
## 169.057
## 166.4314
## 164.8126
## 164.2978
## 164.2123
## 164.5575
## 165.1308
## 165.7126
## 166.3253
## 166.844
## 167.3108
## 167.6712
## 168.0018
## 168.3087
## 168.5725
## 168.7987
## 168.9902
## 169.1551
## # ----------------------------------
## Through fold 3 of 5
## # ----------------------------------
## 188.0394
## 185.8124
## 183.1556
## 178.8035
## 174.3133
## 169.1547
## 163.9407
## 158.8996
## 154.9253
## 152.374
## 150.6616
## 149.6926
## 149.112
## 148.9113
## 149.0141
## 149.274
## 149.6183
## 149.9913
## 150.3287
## 150.606
## 150.8732
## 151.1085
## 151.3104
## 151.4866
## 151.6381
## # ----------------------------------
## Through fold 4 of 5
## # ----------------------------------
## 186.7963
## 185.6682
## 184.4874
## 181.4998
## 177.5056
## 174.7906
## 173.0222
## 172
## 171.613
## 172.1221
## 173.4472
## 174.6643
## 175.9057
## 177.0752
## 178.2779
## 179.3991
## 180.4935
## 181.4877
## 182.4023
## 183.1904
## 183.8622
## 184.447
## 184.9386
## 185.3497
## 185.7035
## # ----------------------------------
## Through fold 5 of 5
## # ----------------------------------
str(t0.cv)
## List of 6
## $ Omega : num [1, 1:20, 1:20, 1:25] 0.821 0 0 0 0 ...
## $ valErrs : num [1:5, 1:25] 232 215 193 188 187 ...
## $ avg.valErrs: num [1:25] 203 202 199 194 188 ...
## $ Omega.min : num [1:20, 1:20] 0.83108 0.41062 0.21663 0 0.00418 ...
## $ fold.id : int [1:100] 3 4 1 4 2 3 4 2 4 1 ...
## $ lambda1.vec: num [1:25] 5.77 4.76 3.93 3.24 2.68 ...
plot(log10(t0.cv$lambda1.vec), t0.cv$avg.valErrs, pch=20, xlab=expression(log[10](lambda[1])), ylab="Average CV error")
The output closely matches that from \(\texttt{SCCpath}\), except we include fold
IDs, the validation errors, the averaged validation errors, and the
estimate of \(\Omega^*\) which uses the
tuning parameter minimizing the average validation error (\(\texttt{Omega.min}\)).
Estimating \(H \geq 2\) basis
covariance matrices jointly
Next, we will focus on the case where we have \(H=2\) basis covariance matrices to
estimate. Generalization to \(H >
2\) is straightforward. We will use the same structure as before,
except the correlations will differ slightly across the two
populations.
# ------------------------------
# Basis covariance matrix
# -----------------------------
p <- 20
Omega <- array(0, dim=c(2, p, p))
for (jj in 1:p) {
for (kk in 1:p) {
Omega[1,jj,kk] <- 0.5^abs(jj-kk) * (abs(jj - kk) < 4)
Omega[2,jj,kk] <- 0.7^abs(jj-kk) * (abs(jj - kk) < 5)
}
}
diag(Omega[1,,]) <- 1; diag(Omega[2,,]) <- 1
eo <- eigen(Omega[1,,])
OmegaSqrt1 <- eo$vec%*%diag(eo$val^0.5)%*%t(eo$vec)
eo <- eigen(Omega[2,,])
OmegaSqrt2 <- eo$vec%*%diag(eo$val^0.5)%*%t(eo$vec)
# -----------------------------
# Generate log-abundances
# -----------------------------
set.seed(1)
n <- 100
dat <- list(); datval <- list()
# create training set
logW <- matrix(rnorm(n*p), nrow = n, ncol = p)%*%OmegaSqrt1
W <- exp(logW)
dat[[1]] <- W/rowSums(W)
logW <- matrix(rnorm(n*p), nrow = n, ncol = p)%*%OmegaSqrt2
W <- exp(logW)
dat[[2]] <- W/rowSums(W)
# create validation set
logWval <- matrix(rnorm(n*p), nrow = n, ncol = p)%*%OmegaSqrt1
Wval <- exp(logWval)
datval[[1]] <- Wval/rowSums(Wval)
logWval <- matrix(rnorm(n*p), nrow = n, ncol = p)%*%OmegaSqrt2
Wval <- exp(logWval)
datval[[2]] <- Wval/rowSums(Wval)
Using \(\texttt{gSCCpath}\), we fit
the entire solution path. This takes essentially identical arguments as
\(\texttt{SCCpath}\), except we also
have \(\texttt{nlambda2}\), \(\texttt{deltalambda2}\), and \(\texttt{weighted}\). These new arguments
are
- \(\texttt{nlambda2}\). Number of
candidate \(\lambda_2\)
- \(\texttt{deltalambda2}\). Ratio of
the smallest to largest \(\lambda_2\).
- \(\texttt{weighted}\). Whether to
use the version of our estimator where the loss function is weighted by
each populations’ sample size. Default is FALSE.
Note that here, \(\lambda_1\) and
\(\lambda_2\) refer to the tuning
parameters from this version of our optimization problem: \[\operatorname*{arg \min}_{\boldsymbol{\Omega} \in
\mathbb{R}^{H \times p \times p}} \left\{ \sum_{h=1}^H \left(\|
\widehat{\Theta}_{(h)} - \omega_{(h)} \mathbf{1}_p^\top -
\mathbf{1}_p\omega_{(h)}^\top + 2 \Omega_{(h)} \|_F^2 + \lambda_1
\|\Omega_{(h)}^{-}\|_1 \right) + \lambda_2 \sum_{j\neq k}
\|\boldsymbol{\Omega}_{\cdot jk}\|_2 \right\}\] \[\text{subject to } ~\Omega_{(h)} =
\Omega_{(h)}^\top,~ \Omega_{(h)} \succcurlyeq \tau I_p, ~~ \text{ for
all } ~h \in [H], \]
Again, we can evaluate the validation error on the validation
set.
t1 <- gSCCpath(dat = dat,
datval = datval,
weighted = FALSE,
nlambda1 = 25, nlambda2 = 15,
deltalambda1 = 0.001,
deltalambda2 = 0.01,
alpha0 = 1,
tau = 1e-4,
max.iter = 5e3,
tol.obj = 1e-6,
tol.diff = 1e-6,
tol.obj.con = 1e-6,
tol.diff.con = 1e-6,
silent = TRUE,
quiet = TRUE)
str(t1)
## List of 4
## $ Omega : num [1:2, 1:20, 1:20, 1:25, 1:15] 0.821 0.995 0 0 0 ...
## $ valErrs : num [1:25, 1:15] 223 221 216 211 205 ...
## $ lambda1.mat: num [1:15, 1:25] 2.38 3.15 3.79 4.16 4.57 ...
## $ lambda2.vec: num [1:15] 4.24 3.18 2.38 1.79 1.34 ...
levelplot(t1$valErrs, col.regions = gray(100:0/100))
inds <- which(t1$valErrs == min(t1$valErrs), arr.ind=TRUE)
Omega.est <- t1$Omega[,,,inds[1,1], inds[1,2]]
sum((Omega.est[1,,] - Omega[1,,])^2); sum((Omega.est[2,,] - Omega[2,,])^2)
## [1] 2.982092
## [1] 4.345359
Looking at the levelplot, we see that \(\lambda_1\) small appears to be best. This
agrees with the intuition, since \(\Omega_{(1)}^*\) and \(\Omega_{(2)}^*\) have reasonably similar
sparsity patterns.
t1.cv <- gSCCcv(dat = dat,
nfolds = 5,
nlambda1 = 20, nlambda2 = 10,
deltalambda1 = 0.001,
deltalambda2 = 0.01,
alpha0 = 1,
tau = 1e-4,
max.iter = 5e3,
tol.obj = 1e-6,
tol.diff = 1e-6,
tol.obj.con = 1e-6,
tol.diff.con = 1e-6,
silent = TRUE,
quiet = TRUE)
str(t1.cv)
## List of 7
## $ Omega : num [1:2, 1:20, 1:20, 1:20, 1:10] 0.821 0.995 0 0 0 ...
## $ valErrs : num [1:5, 1:20, 1:10] 444 488 543 409 449 ...
## $ avg.valErrs: num [1:20, 1:10] 467 463 453 441 432 ...
## $ fold.ids :List of 2
## ..$ : int [1:100] 2 4 5 1 2 2 2 2 5 5 ...
## ..$ : int [1:100] 5 1 4 3 3 3 3 5 2 4 ...
## $ lambda1.mat: num [1:10, 1:20] 3.15 3.98 4.57 5.02 5.25 ...
## $ lambda2.vec: num [1:10] 3.263 2.147 1.413 0.929 0.611 ...
## $ Omega.min : num [1:2, 1:20, 1:20] 0.839 1.026 0.456 0.629 0.273 ...
sum((t1.cv$Omega.min[1,,] - Omega[1,,])^2); sum((t1.cv$Omega.min[2,,] - Omega[2,,])^2)
## [1] 2.950094
## [1] 4.360383
We can compare our estimate of \(\Omega_{(1)}^*\) to the truth
library(lattice)
levelplot(t1.cv$Omega.min[1,,], col.regions = gray(100:0/100))
levelplot(Omega[1,,], col.regions = gray(100:0/100))
Finally, we can instead use the weighted version and see how our
estimate differs. Note that this estimator solves \[\operatorname*{arg \min}_{\boldsymbol{\Omega} \in
\mathbb{R}^{H \times p \times p}} \left\{ \sum_{h=1}^H \left(
\frac{n_{(h)}}{N}\| \widehat{\Theta}_{(h)} - \omega_{(h)}
\mathbf{1}_p^\top - \mathbf{1}_p\omega_{(h)}^\top + 2 \Omega_{(h)}
\|_F^2 + \lambda_1 \|\Omega_{(h)}^{-}\|_1 \right) + \lambda_2
\sum_{j\neq k} \|\boldsymbol{\Omega}_{\cdot jk}\|_2 \right\}\]
\[\text{subject to } ~\Omega_{(h)} =
\Omega_{(h)}^\top,~ \Omega_{(h)} \succcurlyeq \tau I_p, ~~ \text{ for
all } ~h \in [H], \] and similarly, uses a weighted version of
our cross-validation criterion.
t1w.path <- gSCCpath(dat = dat, datval = datval,
nlambda1 = 20, nlambda2 = 10,
deltalambda1 = 0.001,
deltalambda2 = 0.01,
weighted = TRUE,
alpha0 = 1,
tau = 1e-4,
max.iter = 5e3,
tol.obj = 1e-6,
tol.diff = 1e-6,
tol.obj.con = 1e-6,
tol.diff.con = 1e-6,
silent = TRUE,
quiet = TRUE)
str(t1w.path)
## List of 4
## $ Omega : num [1:2, 1:20, 1:20, 1:20, 1:10] 0.821 0.995 0 0 0 ...
## $ valErrs : num [1:20, 1:10] 111.3 109.1 103.2 96.8 91.6 ...
## $ lambda1.mat: num [1:10, 1:20] 1.58 1.99 2.29 2.51 2.63 ...
## $ lambda2.vec: num [1:10] 1.632 1.074 0.706 0.465 0.306 ...
sum((t1.cv$Omega.min[1,,] - Omega[1,,])^2); sum((t1.cv$Omega.min[2,,] - Omega[2,,])^2)
## [1] 2.950094
## [1] 4.360383