Title:  Spatial Clustering with Hidden Markov Random Field using Empirical Bayes 

Description:  Spatial clustering with hidden markov random field fitted via EM algorithm, details of which can be found in Yi Yang (2021) <doi:10.1101/2021.06.05.447181>. It is not only computationally efficient and scalable to the sample size increment, but also is capable of choosing the smoothness parameter and the number of clusters as well. 
Authors:  Yi Yang [aut, cre], Xingjie Shi [aut], Jin Liu [aut] 
Maintainer:  Yi Yang <[email protected]> 
License:  GPL3 
Version:  1.1 
Built:  20240921 05:10:18 UTC 
Source:  https://github.com/cran/SC.MEB 
The function ClusterPlot is used to Visualize spatial clusters.
ClusterPlot(out, pos, size = 5, shape = 15)
ClusterPlot(out, pos, size = 5, shape = 15)
out 
is the output of function selectK. 
pos 
is a nby2 matrix of position. 
size 
is a positive value for characterizing the size of point in the plot, which is the same as size in ggplot2. 
shape 
is a positive value for characterizing the shape of point in the plot, which is the same as shape in ggplot2. 
The function ClusterPlot is used to Visualize spatial clusters.
a ggplot2 object.
pos = cbind(rep(1:5, each=5), rep(1:5, 5)) out = list() out[[1]] = "" out[[2]] = rep(1:5, each = 5) ClusterPlot(out, pos)
pos = cbind(rep(1:5, each=5), rep(1:5, 5)) out = list() out[[1]] = "" out[[2]] = rep(1:5, each = 5) ClusterPlot(out, pos)
find_neighbors2 was used to find the neighborhood of spot.
find_neighbors2(sce, platform)
find_neighbors2(sce, platform)
sce 
is a SingleCellExperiment object containing PCA and position informatin. 
platform 
is the name of spatial transcriptomic platform. Specify 'Visium' for hex lattice geometry or 'ST' for square lattice geometry. Specifying this parameter is optional as this information is included in their metadata. 
find_neighbors2 was used to find the neighborhood of spot.
a sparse matrix recording the information of neighborhood.
data(sce) platform = "ST" Adj < find_neighbors2(sce, platform)
data(sce) platform = "ST" Adj < find_neighbors2(sce, platform)
an efficient function to find the neighborhood based on the matrix of position and a predefined cutoff
getneighborhood_fast(x, cutoff)
getneighborhood_fast(x, cutoff)
x 
is a nby2 matrix of position. 
cutoff 
is a threashold of Euclidean distance to decide whether a spot is an neighborhood of another spot. For example, if the Euclidean distance between spot A and B is less than cutoff, then A is taken as the neighbourhood of B. 
A sparse matrix containing the neighbourhood
pos = cbind(rep(1:5, each=5), rep(1:5, 5)) Adj = getneighborhood_fast(pos, 2)
pos = cbind(rep(1:5, each=5), rep(1:5, 5)) Adj = getneighborhood_fast(pos, 2)
The function ICMEM was used to conduct spatial clustering with hidden Markov random field for a sequence of beta and fixed number of clusters
ICMEM( y, x_int, Adj, mu_int, sigma_int, alpha, beta_grid, PX, maxIter_ICM, maxIter )
ICMEM( y, x_int, Adj, mu_int, sigma_int, alpha, beta_grid, PX, maxIter_ICM, maxIter )
y 
is a matrix of PCs containing gene expression. 
x_int 
is a vector of initial cluster label. 
Adj 
is a matrix containing neighborhood information generated by find_neighbors2. 
mu_int 
is a initial mean vector. we often generated it by Gaussian mixture model. 
sigma_int 
is a initial covariance matrix. we often generated it by Gaussian mixture model. 
alpha 
is a intercept. 
beta_grid 
is a sequence of smoothing parameter that can be specified by user. 
PX 
is a logical value specifying the parameter expansion in EM algorithm. 
maxIter_ICM 
is the maximum iteration of ICM algorithm. 
maxIter 
is the maximum iteration of EM algorithm. 
The function ICMEM was used to conduct spatial clustering with hidden Markov random field for fixed beta and fixed number of clusters
a list.
The item 'x' is the clustering result.
The item 'gam' is the posterior probability matrix.
The item 'ell' is the opposite loglikelihood.
The item 'mu' is the mean of each component.
The item 'sigma' is the variance of each component.
y = matrix(rnorm(50, 0, 1), 25,2) pos = cbind(rep(1:5, each=5), rep(1:5, 5)) Adj = getneighborhood_fast(pos, 1.2) beta_grid = c(0.5,1) G = 2 fit_int = Mclust(y, G = G) x_gmm < fit_int$classification mu_int < unname(fit_int$parameter$mean) sigma_int < unname(fit_int$parameter$variance$sigma) alpha < log(fit_int$parameter$pro)*0 reslist < ICMEM(y = y, x_int = x_gmm, Adj = Adj, mu_int = mu_int, sigma_int = sigma_int, alpha = alpha, beta_grid = beta_grid, PX = TRUE, maxIter_ICM = 10, maxIter = 50)
y = matrix(rnorm(50, 0, 1), 25,2) pos = cbind(rep(1:5, each=5), rep(1:5, 5)) Adj = getneighborhood_fast(pos, 1.2) beta_grid = c(0.5,1) G = 2 fit_int = Mclust(y, G = G) x_gmm < fit_int$classification mu_int < unname(fit_int$parameter$mean) sigma_int < unname(fit_int$parameter$variance$sigma) alpha < log(fit_int$parameter$pro)*0 reslist < ICMEM(y = y, x_int = x_gmm, Adj = Adj, mu_int = mu_int, sigma_int = sigma_int, alpha = alpha, beta_grid = beta_grid, PX = TRUE, maxIter_ICM = 10, maxIter = 50)
The function parafun implements the model SCMEB for fixed number of clusters and a sequence of beta with initial value from Gaussian mixture model
parafun( y, Adj, G, beta_grid = seq(0, 4, 0.2), PX = TRUE, maxIter_ICM = 10, maxIter = 50 )
parafun( y, Adj, G, beta_grid = seq(0, 4, 0.2), PX = TRUE, maxIter_ICM = 10, maxIter = 50 )
y 
is nbyd PCs. 
Adj 
is a sparse matrix of neighborhood. 
G 
is an integer specifying the numbers of clusters. 
beta_grid 
is a numeric vector specifying the smoothness parameter of Random Markov Field. The default is seq(0,4,0.2). 
PX 
is a logical value specifying the parameter expansion in EM algorithm. 
maxIter_ICM 
is the maximum iteration of ICM algorithm. The default is 10. 
maxIter 
is the maximum iteration of EM algorithm. The default is 50. 
The function parafun implements the model SCMEB for fixed number of clusters and a sequence of beta with initial value from Gaussian mixture model
a list, We briefly explain the output of the SC.MEB.
The item 'x' storing clustering results.
The item 'gam' is the posterior probability matrix.
The item 'ell' is the opposite loglikelihood.
The item 'mu' is the mean of each component.
The item 'sigma' is the variance of each component.
y = matrix(rnorm(50, 0, 1), 25,2) pos = cbind(rep(1:5, each=5), rep(1:5, 5)) Adj_sp = getneighborhood_fast(pos, 1.2) beta_grid = c(0.5,1) G = 2 out = parafun(y, Adj_sp, G, beta_grid)
y = matrix(rnorm(50, 0, 1), 25,2) pos = cbind(rep(1:5, each=5), rep(1:5, 5)) Adj_sp = getneighborhood_fast(pos, 1.2) beta_grid = c(0.5,1) G = 2 out = parafun(y, Adj_sp, G, beta_grid)
A dataset containing PCs
data(PC)
data(PC)
It is a matrix containing 5 PCs
the variables are listed as following
The 1th PC
The 2th PC
...
The 5th PC
## run the PC with the Gaussian mixture model data(PC) out1 = mclust::Mclust(PC,G = 2)
## run the PC with the Gaussian mixture model data(PC) out1 = mclust::Mclust(PC,G = 2)
SC.MEB implements the model SCMEB, spatial clustering with hidden Markov random field using empirical Bayes.
SC.MEB( y, Adj_sp, beta_grid = seq(0, 4, 0.2), K_set = 2:10, parallel = TRUE, num_core = 5, PX = TRUE, maxIter_ICM = 10, maxIter = 50 )
SC.MEB( y, Adj_sp, beta_grid = seq(0, 4, 0.2), K_set = 2:10, parallel = TRUE, num_core = 5, PX = TRUE, maxIter_ICM = 10, maxIter = 50 )
y 
is nbyd PCs. 
Adj_sp 
is a sparse matrix of neighborhood. It is often generated from function find_neighbors2 or getneighborhood_fast. 
beta_grid 
is a numeric vector specifying the smoothness parameter of Random Markov Field. The default is seq(0,4,0.2). 
K_set 
is an integer vector specifying the numbers of mixture components (clusters) for which the BIC is to be calculated. The default is K = 2:10. 
parallel 
is a logical value to decide whether the function SC.MEB run in parallel. The default is TRUE. 
num_core 
is an integer value to decide how many cores are used to run SC.MEB in parallel. 
PX 
is a logical value to decide whether to use parameter expansion in EM algorithm 
maxIter_ICM 
is the maximum iteration of ICM algorithm. The default is 10. 
maxIter 
is the maximum iteration of EM algorithm. The default is 50. 
SC.MEB can implements the model SCMEB in parallel which can improve the speed of the computation.
a list, We briefly explain the output of the SC.MEB.
The item 'x' contains clustering results.
The item 'gam' is the posterior probability matrix.
The item 'ell' is the opposite loglikelihood.
The item 'mu' is the mean of each component.
The item 'sigma' is the variance of each component.
Yang Y, Shi X, Zhou Q, et al. SCMEB: spatial clustering with hidden Markov random field using empirical Bayes[J]. bioRxiv, 2021.
y = matrix(rnorm(50, 0, 1), 25,2) pos = cbind(rep(1:5, each=5), rep(1:5, 5)) Adj_sp = getneighborhood_fast(pos, 1.2) beta_grid = c(0.5,1) K_set = 2:3 out = SC.MEB(y, Adj_sp, beta_grid, K_set, TRUE, 2)
y = matrix(rnorm(50, 0, 1), 25,2) pos = cbind(rep(1:5, each=5), rep(1:5, 5)) Adj_sp = getneighborhood_fast(pos, 1.2) beta_grid = c(0.5,1) K_set = 2:3 out = SC.MEB(y, Adj_sp, beta_grid, K_set, TRUE, 2)
A dataset of SingleCellExperiment
data(sce)
data(sce)
It is a SingleCellExperiment object with gene expression and meta information
Amezquita R A, Lun A T L, Becht E, et al. Orchestrating singlecell analysis with Bioconductor[J]. Nature methods, 2020, 17(2): 137145.
## find the neighborhood of spots in SingleCellExperiment data(sce) out = find_neighbors2(sce, "ST")
## find the neighborhood of spots in SingleCellExperiment data(sce) out = find_neighbors2(sce, "ST")
The function selectK is used to select the best K according to BIC or Modified BIC criterion.
selectK(SCobject, K_set = 2:10, criterion = "BIC", c = 1)
selectK(SCobject, K_set = 2:10, criterion = "BIC", c = 1)
SCobject 
is an object generated from SC.MEB function. 
K_set 
is a integer vector used in SC.MEB. The default is 2:10 
criterion 
is a character specifying the criterion for selecting K. The default value is BIC. The alternative value MBIC can also be used. 
c 
is a positive value in the modified BIC. The default is 1. Here we briefly explain how to choose the parameter c in the modified BIC. In general, For the ST or Visium dataset, it often ranges from 0.4 to 1 while for the MERFISH dataset with large number of cells, it often becomes larger, for example 10,20. Most importantly, SCMEB is fast, scaling well in terms of sample size, which allow the user to tune the c based on their prior knowledge about the tissues or cells. 
The function selectK is used to select the best K according to BIC or Modified BIC criterion.
a list contains two items. one is for the best K and the other is the clustering labels of n spots.
y = matrix(rnorm(50, 0, 1), 25,2) pos = cbind(rep(1:5, each=5), rep(1:5, 5)) Adj_sp = getneighborhood_fast(pos, 1.2) beta_grid = c(0.5,1) K_set = 2:3 out = SC.MEB(y, Adj_sp, beta_grid, K_set, TRUE, 2) selectK(out, K_set)
y = matrix(rnorm(50, 0, 1), 25,2) pos = cbind(rep(1:5, each=5), rep(1:5, 5)) Adj_sp = getneighborhood_fast(pos, 1.2) beta_grid = c(0.5,1) K_set = 2:3 out = SC.MEB(y, Adj_sp, beta_grid, K_set, TRUE, 2) selectK(out, K_set)
The function selectKPlot is used to demonstrate the scatter plot of BIC or Modified BIC vs K for selecting the best K.
selectKPlot(SCobject, K_set = 2:10, criterion = "BIC", c = 1)
selectKPlot(SCobject, K_set = 2:10, criterion = "BIC", c = 1)
SCobject 
is a object generated from SC.MEB function. 
K_set 
is the corresponding K_set used in your previous function SC.MEB. 
criterion 
is a character specifying the criterion for selecting K. The default is BIC, the alternative criterion MBIC can also be used. 
c 
is a positive value in modified BIC. The default is 1. Here we briefly explain how to choose the parameter c in the modified BIC. In general, For the ST or Visium dataset, it often ranges from 0.4 to 1 while for the MERFISH dataset with large number of cells, it often becomes larger, for example 10,20. Most importantly, SCMEB is fast, scaling well in terms of sample size, which allow the user to tune the c based on their prior knowledge about the tissues or cells. 
The function selectKPlot is used to demonstrate the scatter plot of BIC or Modified BIC vs K for selecting the best K.
a ggplot2 object.
y = matrix(rnorm(50, 0, 1), 25,2) pos = cbind(rep(1:5, each=5), rep(1:5, 5)) Adj_sp = getneighborhood_fast(pos, 1.2) beta_grid = c(0.5,1) K_set = 2:3 out = SC.MEB(y, Adj_sp, beta_grid, K_set, TRUE, 2) selectKPlot(out, K_set)
y = matrix(rnorm(50, 0, 1), 25,2) pos = cbind(rep(1:5, each=5), rep(1:5, 5)) Adj_sp = getneighborhood_fast(pos, 1.2) beta_grid = c(0.5,1) K_set = 2:3 out = SC.MEB(y, Adj_sp, beta_grid, K_set, TRUE, 2) selectKPlot(out, K_set)
Adds metadata required for downstream analyses, and (optionally) performs PCA on lognormalized expression of top HVGs.
spatialPreprocess( sce, platform = c("Visium", "ST"), n.PCs = 15, n.HVGs = 2000, skip.PCA = FALSE, log.normalize = TRUE, assay.type = "logcounts", BSPARAM = BiocSingular::ExactParam() )
spatialPreprocess( sce, platform = c("Visium", "ST"), n.PCs = 15, n.HVGs = 2000, skip.PCA = FALSE, log.normalize = TRUE, assay.type = "logcounts", BSPARAM = BiocSingular::ExactParam() )
sce 
SingleCellExperiment to preprocess 
platform 
Spatial sequencing platform. Used to determine spot layout and neighborhood structure (Visium = hex, ST = square). 
n.PCs 
Number of principal components to compute. We suggest using the top 15 PCs in most cases. 
n.HVGs 
Number of highly variable genes to run PCA upon. 
skip.PCA 
Skip PCA (if dimensionality reduction was previously computed.) 
log.normalize 
Whether to lognormalize the input data with scater. May be omitted if lognormalization previously computed. 
assay.type 
Name of assay in 
BSPARAM 
A BiocSingularParam object specifying which
algorithm should be used to perform the PCA. By default, an exact PCA is
performed, as current spatial datasets are generally small (<10,000 spots).
To perform a faster approximate PCA, please specify

SingleCellExperiment with PCA and SC.MEB metadata
## read the simulated data data(sce) platform = "ST" out = find_neighbors2(sce, platform)
## read the simulated data data(sce) platform = "ST" out = find_neighbors2(sce, platform)