Title: | Spatial Clustering with Hidden Markov Random Field using Empirical Bayes |
---|---|
Description: | Spatial clustering with hidden markov random field fitted via EM algorithm, details of which can be found in Yi Yang (2021) <doi:10.1101/2021.06.05.447181>. It is not only computationally efficient and scalable to the sample size increment, but also is capable of choosing the smoothness parameter and the number of clusters as well. |
Authors: | Yi Yang [aut, cre], Xingjie Shi [aut], Jin Liu [aut] |
Maintainer: | Yi Yang <[email protected]> |
License: | GPL-3 |
Version: | 1.1 |
Built: | 2024-11-20 05:22:30 UTC |
Source: | https://github.com/cran/SC.MEB |
The function ClusterPlot is used to Visualize spatial clusters.
ClusterPlot(out, pos, size = 5, shape = 15)
ClusterPlot(out, pos, size = 5, shape = 15)
out |
is the output of function selectK. |
pos |
is a n-by-2 matrix of position. |
size |
is a positive value for characterizing the size of point in the plot, which is the same as size in ggplot2. |
shape |
is a positive value for characterizing the shape of point in the plot, which is the same as shape in ggplot2. |
The function ClusterPlot is used to Visualize spatial clusters.
a ggplot2 object.
pos = cbind(rep(1:5, each=5), rep(1:5, 5)) out = list() out[[1]] = "" out[[2]] = rep(1:5, each = 5) ClusterPlot(out, pos)
pos = cbind(rep(1:5, each=5), rep(1:5, 5)) out = list() out[[1]] = "" out[[2]] = rep(1:5, each = 5) ClusterPlot(out, pos)
find_neighbors2 was used to find the neighborhood of spot.
find_neighbors2(sce, platform)
find_neighbors2(sce, platform)
sce |
is a SingleCellExperiment object containing PCA and position informatin. |
platform |
is the name of spatial transcriptomic platform. Specify 'Visium' for hex lattice geometry or 'ST' for square lattice geometry. Specifying this parameter is optional as this information is included in their metadata. |
find_neighbors2 was used to find the neighborhood of spot.
a sparse matrix recording the information of neighborhood.
data(sce) platform = "ST" Adj <- find_neighbors2(sce, platform)
data(sce) platform = "ST" Adj <- find_neighbors2(sce, platform)
an efficient function to find the neighborhood based on the matrix of position and a pre-defined cutoff
getneighborhood_fast(x, cutoff)
getneighborhood_fast(x, cutoff)
x |
is a n-by-2 matrix of position. |
cutoff |
is a threashold of Euclidean distance to decide whether a spot is an neighborhood of another spot. For example, if the Euclidean distance between spot A and B is less than cutoff, then A is taken as the neighbourhood of B. |
A sparse matrix containing the neighbourhood
pos = cbind(rep(1:5, each=5), rep(1:5, 5)) Adj = getneighborhood_fast(pos, 2)
pos = cbind(rep(1:5, each=5), rep(1:5, 5)) Adj = getneighborhood_fast(pos, 2)
The function ICMEM was used to conduct spatial clustering with hidden Markov random field for a sequence of beta and fixed number of clusters
ICMEM( y, x_int, Adj, mu_int, sigma_int, alpha, beta_grid, PX, maxIter_ICM, maxIter )
ICMEM( y, x_int, Adj, mu_int, sigma_int, alpha, beta_grid, PX, maxIter_ICM, maxIter )
y |
is a matrix of PCs containing gene expression. |
x_int |
is a vector of initial cluster label. |
Adj |
is a matrix containing neighborhood information generated by find_neighbors2. |
mu_int |
is a initial mean vector. we often generated it by Gaussian mixture model. |
sigma_int |
is a initial co-variance matrix. we often generated it by Gaussian mixture model. |
alpha |
is a intercept. |
beta_grid |
is a sequence of smoothing parameter that can be specified by user. |
PX |
is a logical value specifying the parameter expansion in EM algorithm. |
maxIter_ICM |
is the maximum iteration of ICM algorithm. |
maxIter |
is the maximum iteration of EM algorithm. |
The function ICMEM was used to conduct spatial clustering with hidden Markov random field for fixed beta and fixed number of clusters
a list.
The item 'x' is the clustering result.
The item 'gam' is the posterior probability matrix.
The item 'ell' is the opposite log-likelihood.
The item 'mu' is the mean of each component.
The item 'sigma' is the variance of each component.
y = matrix(rnorm(50, 0, 1), 25,2) pos = cbind(rep(1:5, each=5), rep(1:5, 5)) Adj = getneighborhood_fast(pos, 1.2) beta_grid = c(0.5,1) G = 2 fit_int = Mclust(y, G = G) x_gmm <- fit_int$classification mu_int <- unname(fit_int$parameter$mean) sigma_int <- unname(fit_int$parameter$variance$sigma) alpha <- -log(fit_int$parameter$pro)*0 reslist <- ICMEM(y = y, x_int = x_gmm, Adj = Adj, mu_int = mu_int, sigma_int = sigma_int, alpha = alpha, beta_grid = beta_grid, PX = TRUE, maxIter_ICM = 10, maxIter = 50)
y = matrix(rnorm(50, 0, 1), 25,2) pos = cbind(rep(1:5, each=5), rep(1:5, 5)) Adj = getneighborhood_fast(pos, 1.2) beta_grid = c(0.5,1) G = 2 fit_int = Mclust(y, G = G) x_gmm <- fit_int$classification mu_int <- unname(fit_int$parameter$mean) sigma_int <- unname(fit_int$parameter$variance$sigma) alpha <- -log(fit_int$parameter$pro)*0 reslist <- ICMEM(y = y, x_int = x_gmm, Adj = Adj, mu_int = mu_int, sigma_int = sigma_int, alpha = alpha, beta_grid = beta_grid, PX = TRUE, maxIter_ICM = 10, maxIter = 50)
The function parafun implements the model SC-MEB for fixed number of clusters and a sequence of beta with initial value from Gaussian mixture model
parafun( y, Adj, G, beta_grid = seq(0, 4, 0.2), PX = TRUE, maxIter_ICM = 10, maxIter = 50 )
parafun( y, Adj, G, beta_grid = seq(0, 4, 0.2), PX = TRUE, maxIter_ICM = 10, maxIter = 50 )
y |
is n-by-d PCs. |
Adj |
is a sparse matrix of neighborhood. |
G |
is an integer specifying the numbers of clusters. |
beta_grid |
is a numeric vector specifying the smoothness parameter of Random Markov Field. The default is seq(0,4,0.2). |
PX |
is a logical value specifying the parameter expansion in EM algorithm. |
maxIter_ICM |
is the maximum iteration of ICM algorithm. The default is 10. |
maxIter |
is the maximum iteration of EM algorithm. The default is 50. |
The function parafun implements the model SC-MEB for fixed number of clusters and a sequence of beta with initial value from Gaussian mixture model
a list, We briefly explain the output of the SC.MEB.
The item 'x' storing clustering results.
The item 'gam' is the posterior probability matrix.
The item 'ell' is the opposite log-likelihood.
The item 'mu' is the mean of each component.
The item 'sigma' is the variance of each component.
y = matrix(rnorm(50, 0, 1), 25,2) pos = cbind(rep(1:5, each=5), rep(1:5, 5)) Adj_sp = getneighborhood_fast(pos, 1.2) beta_grid = c(0.5,1) G = 2 out = parafun(y, Adj_sp, G, beta_grid)
y = matrix(rnorm(50, 0, 1), 25,2) pos = cbind(rep(1:5, each=5), rep(1:5, 5)) Adj_sp = getneighborhood_fast(pos, 1.2) beta_grid = c(0.5,1) G = 2 out = parafun(y, Adj_sp, G, beta_grid)
A dataset containing PCs
data(PC)
data(PC)
It is a matrix containing 5 PCs
the variables are listed as following
The 1th PC
The 2th PC
...
The 5th PC
## run the PC with the Gaussian mixture model data(PC) out1 = mclust::Mclust(PC,G = 2)
## run the PC with the Gaussian mixture model data(PC) out1 = mclust::Mclust(PC,G = 2)
SC.MEB implements the model SC-MEB, spatial clustering with hidden Markov random field using empirical Bayes.
SC.MEB( y, Adj_sp, beta_grid = seq(0, 4, 0.2), K_set = 2:10, parallel = TRUE, num_core = 5, PX = TRUE, maxIter_ICM = 10, maxIter = 50 )
SC.MEB( y, Adj_sp, beta_grid = seq(0, 4, 0.2), K_set = 2:10, parallel = TRUE, num_core = 5, PX = TRUE, maxIter_ICM = 10, maxIter = 50 )
y |
is n-by-d PCs. |
Adj_sp |
is a sparse matrix of neighborhood. It is often generated from function find_neighbors2 or getneighborhood_fast. |
beta_grid |
is a numeric vector specifying the smoothness parameter of Random Markov Field. The default is seq(0,4,0.2). |
K_set |
is an integer vector specifying the numbers of mixture components (clusters) for which the BIC is to be calculated. The default is K = 2:10. |
parallel |
is a logical value to decide whether the function SC.MEB run in parallel. The default is TRUE. |
num_core |
is an integer value to decide how many cores are used to run SC.MEB in parallel. |
PX |
is a logical value to decide whether to use parameter expansion in EM algorithm |
maxIter_ICM |
is the maximum iteration of ICM algorithm. The default is 10. |
maxIter |
is the maximum iteration of EM algorithm. The default is 50. |
SC.MEB can implements the model SC-MEB in parallel which can improve the speed of the computation.
a list, We briefly explain the output of the SC.MEB.
The item 'x' contains clustering results.
The item 'gam' is the posterior probability matrix.
The item 'ell' is the opposite log-likelihood.
The item 'mu' is the mean of each component.
The item 'sigma' is the variance of each component.
Yang Y, Shi X, Zhou Q, et al. SC-MEB: spatial clustering with hidden Markov random field using empirical Bayes[J]. bioRxiv, 2021.
y = matrix(rnorm(50, 0, 1), 25,2) pos = cbind(rep(1:5, each=5), rep(1:5, 5)) Adj_sp = getneighborhood_fast(pos, 1.2) beta_grid = c(0.5,1) K_set = 2:3 out = SC.MEB(y, Adj_sp, beta_grid, K_set, TRUE, 2)
y = matrix(rnorm(50, 0, 1), 25,2) pos = cbind(rep(1:5, each=5), rep(1:5, 5)) Adj_sp = getneighborhood_fast(pos, 1.2) beta_grid = c(0.5,1) K_set = 2:3 out = SC.MEB(y, Adj_sp, beta_grid, K_set, TRUE, 2)
A dataset of SingleCellExperiment
data(sce)
data(sce)
It is a SingleCellExperiment object with gene expression and meta information
Amezquita R A, Lun A T L, Becht E, et al. Orchestrating single-cell analysis with Bioconductor[J]. Nature methods, 2020, 17(2): 137-145.
## find the neighborhood of spots in SingleCellExperiment data(sce) out = find_neighbors2(sce, "ST")
## find the neighborhood of spots in SingleCellExperiment data(sce) out = find_neighbors2(sce, "ST")
The function selectK is used to select the best K according to BIC or Modified BIC criterion.
selectK(SCobject, K_set = 2:10, criterion = "BIC", c = 1)
selectK(SCobject, K_set = 2:10, criterion = "BIC", c = 1)
SCobject |
is an object generated from SC.MEB function. |
K_set |
is a integer vector used in SC.MEB. The default is 2:10 |
criterion |
is a character specifying the criterion for selecting K. The default value is BIC. The alternative value MBIC can also be used. |
c |
is a positive value in the modified BIC. The default is 1. Here we briefly explain how to choose the parameter c in the modified BIC. In general, For the ST or Visium dataset, it often ranges from 0.4 to 1 while for the MERFISH dataset with large number of cells, it often becomes larger, for example 10,20. Most importantly, SC-MEB is fast, scaling well in terms of sample size, which allow the user to tune the c based on their prior knowledge about the tissues or cells. |
The function selectK is used to select the best K according to BIC or Modified BIC criterion.
a list contains two items. one is for the best K and the other is the clustering labels of n spots.
y = matrix(rnorm(50, 0, 1), 25,2) pos = cbind(rep(1:5, each=5), rep(1:5, 5)) Adj_sp = getneighborhood_fast(pos, 1.2) beta_grid = c(0.5,1) K_set = 2:3 out = SC.MEB(y, Adj_sp, beta_grid, K_set, TRUE, 2) selectK(out, K_set)
y = matrix(rnorm(50, 0, 1), 25,2) pos = cbind(rep(1:5, each=5), rep(1:5, 5)) Adj_sp = getneighborhood_fast(pos, 1.2) beta_grid = c(0.5,1) K_set = 2:3 out = SC.MEB(y, Adj_sp, beta_grid, K_set, TRUE, 2) selectK(out, K_set)
The function selectKPlot is used to demonstrate the scatter plot of BIC or Modified BIC vs K for selecting the best K.
selectKPlot(SCobject, K_set = 2:10, criterion = "BIC", c = 1)
selectKPlot(SCobject, K_set = 2:10, criterion = "BIC", c = 1)
SCobject |
is a object generated from SC.MEB function. |
K_set |
is the corresponding K_set used in your previous function SC.MEB. |
criterion |
is a character specifying the criterion for selecting K. The default is BIC, the alternative criterion MBIC can also be used. |
c |
is a positive value in modified BIC. The default is 1. Here we briefly explain how to choose the parameter c in the modified BIC. In general, For the ST or Visium dataset, it often ranges from 0.4 to 1 while for the MERFISH dataset with large number of cells, it often becomes larger, for example 10,20. Most importantly, SC-MEB is fast, scaling well in terms of sample size, which allow the user to tune the c based on their prior knowledge about the tissues or cells. |
The function selectKPlot is used to demonstrate the scatter plot of BIC or Modified BIC vs K for selecting the best K.
a ggplot2 object.
y = matrix(rnorm(50, 0, 1), 25,2) pos = cbind(rep(1:5, each=5), rep(1:5, 5)) Adj_sp = getneighborhood_fast(pos, 1.2) beta_grid = c(0.5,1) K_set = 2:3 out = SC.MEB(y, Adj_sp, beta_grid, K_set, TRUE, 2) selectKPlot(out, K_set)
y = matrix(rnorm(50, 0, 1), 25,2) pos = cbind(rep(1:5, each=5), rep(1:5, 5)) Adj_sp = getneighborhood_fast(pos, 1.2) beta_grid = c(0.5,1) K_set = 2:3 out = SC.MEB(y, Adj_sp, beta_grid, K_set, TRUE, 2) selectKPlot(out, K_set)
Adds metadata required for downstream analyses, and (optionally) performs PCA on log-normalized expression of top HVGs.
spatialPreprocess( sce, platform = c("Visium", "ST"), n.PCs = 15, n.HVGs = 2000, skip.PCA = FALSE, log.normalize = TRUE, assay.type = "logcounts", BSPARAM = BiocSingular::ExactParam() )
spatialPreprocess( sce, platform = c("Visium", "ST"), n.PCs = 15, n.HVGs = 2000, skip.PCA = FALSE, log.normalize = TRUE, assay.type = "logcounts", BSPARAM = BiocSingular::ExactParam() )
sce |
SingleCellExperiment to preprocess |
platform |
Spatial sequencing platform. Used to determine spot layout and neighborhood structure (Visium = hex, ST = square). |
n.PCs |
Number of principal components to compute. We suggest using the top 15 PCs in most cases. |
n.HVGs |
Number of highly variable genes to run PCA upon. |
skip.PCA |
Skip PCA (if dimensionality reduction was previously computed.) |
log.normalize |
Whether to log-normalize the input data with scater. May be omitted if log-normalization previously computed. |
assay.type |
Name of assay in |
BSPARAM |
A BiocSingularParam object specifying which
algorithm should be used to perform the PCA. By default, an exact PCA is
performed, as current spatial datasets are generally small (<10,000 spots).
To perform a faster approximate PCA, please specify
|
SingleCellExperiment with PCA and SC.MEB metadata
## read the simulated data data(sce) platform = "ST" out = find_neighbors2(sce, platform)
## read the simulated data data(sce) platform = "ST" out = find_neighbors2(sce, platform)