This vignette provides an introduction to the R package
SC.MEB
, where the function SC.MEB
implements
the model SC-MEB
, spatial clustering with hidden Markov
random field using empirical Bayes. The package can be installed with
the command:
library(devtools)
install_github("Shufeyangyi2015310117/SC.MEB")
The package can be loaded with the command:
We first set the basic parameter:
library(mvtnorm)
library(GiRaF)
library(SingleCellExperiment)
set.seed(100)
G <- 4
Bet <- 1
KK <- 5
p <- 15
mu <- matrix(c( c(-6, rep(-1.5, 14)),
rep(0, 15),
c(6, rep(1.5, 14)),
c(rep(-1.5, 7), rep(1.5, 7), 6),
c(rep(1.5, 7), rep(-1.5, 7), -6)), ncol = KK)
height <- 70
width <- 70
n <- height * width # # of cell in each indviduals
Then, we generate the true clustering label, 15-dimensional PCA and position of each spot.
X <- sampler.mrf(iter = n, sampler = "Gibbs", h = height, w = width, ncolors = KK, nei = G, param = Bet,initialise = FALSE, view = TRUE)
x <- c(X) + 1
y <- matrix(0, nrow = n, ncol = p)
for(i in 1:n) { # cell
mu_i <- mu[, x[i]]
Sigma_i <- ((x[i]==1)*2 + (x[i]==2)*2.5 + (x[i]==3)*3 +
(x[i]==4)*3.5 + (x[i]==5)*4)*diag(1, p)*0.3
y[i, ] <- rmvnorm(1, mu_i, Sigma_i)
}
pos <- cbind(rep(1:height, width), rep(1:height, each=width))
Subsequently, we construct the SingleCellExperiment object based on the above PCA and position.
# -------------------------------------------------
# make SC-MEB metadata used in SC-MEB
counts <- t(y)
rownames(counts) <- paste0("gene_", seq_len(p))
colnames(counts) <- paste0("spot_", seq_len(n))
## Make array coordinates - filled rectangle
cdata <- list()
nrow <- height; ncol <- width
cdata$row <- rep(seq_len(nrow), each=ncol)
cdata$col <- rep(seq_len(ncol), nrow)
cdata <- as.data.frame(do.call(cbind, cdata))
## Scale and jitter image coordinates
#scale.factor <- rnorm(1, 8); n_spots <- n
#cdata$imagerow <- scale.factor * cdata$row + rnorm(n_spots)
#cdata$imagecol <- scale.factor * cdata$col + rnorm(n_spots)
cdata$imagerow <- cdata$row
cdata$imagecol <- cdata$col
## Make SCE
## note: scater::runPCA throws warning on our small sim data, so use prcomp
sce <- SingleCellExperiment(assays=list(counts=counts), colData=cdata)
reducedDim(sce, "PCA") <- y
# sce$spatial.cluster <- floor(runif(ncol(sce), 1, 3))
metadata(sce)$SCMEB.data <- list()
metadata(sce)$SCMEB.data$platform <- "ST"
metadata(sce)$SCMEB.data$is.enhanced <- FALSE
Here, we set the basic paramters for our function
SC.MEB
platform = "ST"
beta_grid = seq(0,4,0.2)
K_set= 2:10
parallel=TRUE
num_core = 3
PX = TRUE
maxIter_ICM = 10
maxIter = 50
Here, we briefly explain these parameters.
‘parallel’ is a logical value specifing the run the model in parallel. The default is TRUE.
‘num_core’ is a interger value for the number of cores used for parallel. The default is 5.
‘PX’ is a logical value for paramter expansion in EM algorithm. The default is True.
‘K_set’ is an integer vector specifying the numbers of mixture components (clusters) for which the BIC is to be calculated. The default is K = 2:10.
‘platform’ is the name of spatial transcriptomic platform. Specify ‘Visium’ for hex lattice geometry or ‘ST’ for square lattice geometry. Specifying this parameter is optional as this information is included in their metadata.
‘beta_grid’ is a numeric vector specifying the smoothness of Random Markov Field. The default is seq(0,5,0.2).
‘maxIter_ICM’ is the maximum iteration of ICM algorithm. The default is 10.
‘maxIter’ is the maximum iteration of EM algorithm. The default is 50.
First, we require to find the neighborhood
or
The output ‘Adj_sp’ is a sparse matrix to describe the neighorhood. There two functions are both used to calculate the nrighborhood. The first function ‘getneighobrhood_fast’ is a general function used for general data. However, the second function ‘find_neighbors2’ is a specific function for ST and Visium data.
Finally, we run our model SC-MEB
by the function
SC.MEB
.
fit = SC.MEB(y, Adj_sp, beta_grid = beta_grid, K_set= K_set, parallel=parallel, num_core = num_core, PX = PX, maxIter_ICM=maxIter_ICM, maxIter=maxIter)
str(fit[,1])
Here, We briefly explain the output of the SC.MEB
.
The item ‘x’ is clustering label.
The item ‘ell’ is the opposite log-likelihood for each beta and K.
The item ‘mu’ is the mean of each component.
The item ‘sigma’ is the variance of each component.
The item ‘gam’ is the posterior probability.
The item ‘beta’ is the estimated smoothing parameter.
Here we briefly explain how to choose the parameter c in the modified BIC. In general, For the ST or Visium dataset, it often ranges from 0.4 to 1 while for the MERFISH dataset with large number of cells, it often becomes larger, for example 10,20. Most importantly, SC-MEB is fast, scaling well in terms of sample size, which allow the user to tune the c based on their prior knowledge about the tissues or cells.
We can plot the cluster assignments over the spatial locations of the spots with ClusterPlot().
As ClusterPlot() returns a ggplot object, it can be customized by composing with familiar ggplot2 functions.