SIMVI
SimVI Model
- class simvi.model.simvi.SimVIModel(adata, n_batch=0, n_hidden=128, n_intrinsic=20, n_spatial=20, n_layers=1, dropout_rate=0, use_observed_lib_size=True, lam_mi=1000, reg_to_use='mmd', dis_to_use='zinb', noising_mode='default', permutation_rate=0.25, var_eps=0.0001, kl_weight=1, kl_gatweight=0.01, attention_heads=1)
Bases: BaseModelClass
Calculate the intrinsic variation and spatial variation of single-cell expression data via variational inference.
- Parameters:
adata (AnnData) – AnnData object that has been registered via SimVIModel.setup_anndata.
n_batch (int, default: 0) – Number of batches.
n_hidden (int, default: 128) – Number of nodes per hidden layer.
n_intrinsic (int, default: 20) – Dimensionality of the intrinsic variation.
n_spatial (int, default: 20) – Dimensionality of the spatial variation.
n_layers (int, default: 1) – Number of decoder layers. Note that in our implementation the encoder is fixed at two layers.
dropout_rate (float, default: 0) – Dropout rate for neural networks.
use_observed_lib_size (bool, default: True) – Use observed library size for RNA as scaling factor in mean of conditional distribution.
lam_mi (float, default: 1000) – Coefficient of the independence regularization term. A value of 1000 is recommended with the ‘mmd’ option and a value of 5 with the ‘mi’ option.
reg_to_use (str, default: 'mmd') – The regularization method to use. Either ‘mmd’ (Maximum Mean Discrepancy) or ‘mi’ (closed-form mutual information).
dis_to_use (str, default: 'zinb') – The distribution to use for the generative model.
noising_mode (str, default: 'default') – The noising scheme in the denoising autoencoding phase. One of ‘default’ (the original permutation procedure used in the SIMVI manuscript), ‘zero’ (zero masking), or ‘sampling’ (a faster variant of the permutation that samples with replacement).
permutation_rate (float, default: 0.25) – The rate of permutation to use in the training. The permutation step itself is optional.
var_eps (float, default: 1e-4) – Minimal variance for the variational posteriors.
kl_weight (float, default: 1) – The KL divergence coefficient for intrinsic variation.
kl_gatweight (float, default: 0.01) – The KL divergence coefficient for spatial variation.
attention_heads (int, default: 1) – The number of attention heads.
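The pairing between reg_to_use and lam_mi recommended in the parameter list can be captured in a small helper. This is only a sketch: `simvi_kwargs` and `RECOMMENDED_LAM_MI` are hypothetical names that are not part of the simvi package, and the helper does nothing more than assemble keyword arguments.

```python
# Recommended lam_mi per regularizer, taken from the parameter descriptions.
RECOMMENDED_LAM_MI = {"mmd": 1000, "mi": 5}


def simvi_kwargs(reg_to_use="mmd", **overrides):
    """Hypothetical helper: build constructor kwargs with a matching lam_mi."""
    if reg_to_use not in RECOMMENDED_LAM_MI:
        raise ValueError(f"reg_to_use must be one of {sorted(RECOMMENDED_LAM_MI)}")
    kwargs = {"reg_to_use": reg_to_use, "lam_mi": RECOMMENDED_LAM_MI[reg_to_use]}
    kwargs.update(overrides)  # explicit user settings take precedence
    return kwargs


mi_kwargs = simvi_kwargs("mi", n_intrinsic=10)
# The result could then be forwarded as SimVIModel(adata, **mi_kwargs).
```

Passing lam_mi explicitly as an override still works, so the helper only changes the default, not what the caller can set.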
Methods Table
SimVIModel.setup_anndata | Set up AnnData instance for SIMVI model.
SimVIModel.extract_edge_index | Define edge_index for SIMVI model training.
SimVIModel.train | Train the SIMVI model.
SimVIModel.get_latent_representation | Return the latent representation for each cell.
SimVIModel.get_attention | Return the attention matrix for graph attention network.
SimVIModel.get_archetypes | Calculate archetypal analysis for the input latent representation.
SimVIModel.get_se | Return the spatial effect for each cell in spatial omics data.
Preprocessing
- classmethod SimVIModel.setup_anndata(adata, layer=None, batch_key=None, labels_key=None, size_factor_key=None, categorical_covariate_keys=None, continuous_covariate_keys=None, **kwargs)
Set up AnnData instance for SIMVI model. This is a standard step in the scvi-tools pipeline.
- Parameters:
adata (AnnData) – AnnData object containing raw counts. Rows represent cells, columns represent features.
layer (str, optional) – If not None, uses this as the key in adata.layers for raw count data.
batch_key (str, optional) – Key in adata.obs for batch information. Categories will automatically be converted into integer categories and saved to adata.obs[“_scvi_batch”]. If None, assign the same batch to all the data.
labels_key (str, optional) – Key in adata.obs for label information. Categories will automatically be converted into integer categories and saved to adata.obs[“_scvi_labels”]. If None, assign the same label to all the data.
size_factor_key (str, optional) – Key in adata.obs for size factor information.
categorical_covariate_keys (List[str], optional) – Keys in adata.obs corresponding to categorical data. Not used in SIMVI.
continuous_covariate_keys (List[str], optional) – Keys in adata.obs corresponding to continuous data. Not used in SIMVI.
- simvi.model.simvi.SimVIModel.extract_edge_index(adata, batch_key=None, spatial_key='spatial', method='knn', n_neighbors=10)
Define edge_index for SIMVI model training.
- Parameters:
adata (AnnData) – AnnData object.
batch_key (str, optional) – Key in adata.obs for batch information. If None, all cells are assumed to come from a single batch. Otherwise, an edge_index is built for each batch and the results are concatenated.
spatial_key (str, optional, default: 'spatial') – Key in adata.obsm for spatial location.
method (str, default: 'knn') – Method for establishing the graph proximity relationship between cells. Two methods are available: k-nearest neighbors (knn) and Delaunay triangulation. knn is the default because it allows the number of neighbors to be chosen freely.
n_neighbors (int, default: 10) – The number of neighbors per cell in the knn graph. Not used if the graph is based on Delaunay triangulation.
- Returns:
edge_index – The edge index tensor containing the graph structure.
- Return type:
torch.Tensor
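To make the knn option concrete, the sketch below builds a two-row edge list from spatial coordinates using only the standard library. It is an illustration, not the simvi implementation: `knn_edge_index` is a hypothetical name, the result is a pair of plain lists rather than a torch.Tensor, and the edge direction (cell to neighbor) is an assumption.

```python
import math


def knn_edge_index(coords, n_neighbors=10):
    """Return [sources, targets] linking each cell to its nearest neighbors."""
    sources, targets = [], []
    for i, p in enumerate(coords):
        # Rank all other cells by Euclidean distance to cell i.
        others = sorted(
            (j for j in range(len(coords)) if j != i),
            key=lambda j: math.dist(p, coords[j]),
        )
        for j in others[:n_neighbors]:
            sources.append(i)  # assumed convention: cell i is the source
            targets.append(j)  # and its neighbor j is the target
    return [sources, targets]


coords = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (5.0, 5.0)]
edge_index = knn_edge_index(coords, n_neighbors=2)
```

With a batch_key, the same construction would be repeated per batch, with indices offset so they refer to rows of the full dataset, before concatenating.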
Training
- SimVIModel.train(edge_index, max_epochs=None, use_gpu=None, train_size=0.9, batch_size=None, anneal_epochs=50, mae_epochs=80, validation_size=None, lr=0.001, weight_decay=0.0001, device=None)
Train the SIMVI model. We adopt the “semi-supervised” framework for model training.
- Parameters:
edge_index (torch.Tensor) – Tensor returned by model.extract_edge_index.
max_epochs (int, optional) – Number of passes through the dataset. If None, defaults to np.min([round((20000 / n_cells) * 400), 400]).
use_gpu (Union[str, int, bool], optional) – Use default GPU if available (if None or True), or index of GPU to use (if int), or name of GPU (if str, e.g., “cuda:0”), or use CPU (if False).
train_size (float, default: 0.9) – Size of training set in the range [0.0, 1.0].
batch_size (int, optional) – Mini-batch size to use during training.
anneal_epochs (int, default: 50) – The number of epochs that use KL annealing.
mae_epochs (int, default: 80) – The number of epochs during which the input data is corrupted.
validation_size (float, optional) – Size of the validation set. If None, defaults to 1 - train_size. If train_size + validation_size < 1, the remaining cells belong to the test set.
lr (float, default: 1e-3) – Learning rate.
weight_decay (float, default: 1e-4) – Weight decay (serves as L2 regularization).
device (str, optional) – The device to train the model on. If None, uses torch.device(“cuda”) or the CPU.
- Returns:
The model is trained.
- Return type:
None
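The default for max_epochs and the train/validation/test split follow simple arithmetic, sketched below in plain Python. The function names are hypothetical, and the split function only reproduces the fractions described in the parameter list, not how cells are actually assigned to each set.

```python
def default_max_epochs(n_cells):
    """Mirror the documented default: min(round(20000 / n_cells * 400), 400)."""
    return min(round((20000 / n_cells) * 400), 400)


def split_fractions(train_size=0.9, validation_size=None):
    """Return (train, validation, test) fractions per the parameter descriptions."""
    if validation_size is None:
        validation_size = 1 - train_size
    test_size = 1 - train_size - validation_size
    return train_size, validation_size, test_size


small = default_max_epochs(10_000)   # 800 before capping, so the cap applies
large = default_max_epochs(100_000)  # large datasets get fewer epochs
```

Datasets with 20,000 cells or fewer reach the cap of 400 epochs; above that, the default shrinks in proportion to the number of cells.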
Post-Training / Inference
- SimVIModel.get_latent_representation(edge_index, adata=None, indices=None, give_mean=True, batch_size=None, representation_kind='all')
Return the latent representation for each cell.
- Parameters:
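edge_index (torch.Tensor) – Edge index tensor containing the graph structure, created by extract_edge_index.
adata (AnnData, optional) – AnnData object with equivalent structure to initial AnnData. If None, defaults to the AnnData object used to initialize the model.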
indices (Sequence[int], optional) – Indices of cells in adata to use. If None, all cells are used.
give_mean (bool, default: True) – If True, return the mean of the distribution; otherwise, sample from it.
batch_size (int, optional) – Mini-batch size for loading data into the model. Defaults to the full batch.
representation_kind (str, default: "all") – “intrinsic”, “interaction” or “all” for the corresponding representation kind.
- Returns:
A numpy array with shape (n_cells, n_latent).
- Return type:
np.ndarray
- SimVIModel.get_attention(edge_index)
Return the attention matrix for graph attention network.
- Parameters:
edge_index (torch.Tensor) – Edge index tensor containing the graph structure, created by extract_edge_index function.
- Returns:
A sparse matrix containing the attention weights between cells, with shape (n_cells, n_cells).
- Return type:
scipy.sparse.csr_matrix
- SimVIModel.get_archetypes(embedding, noc=5, delta=0.1, conv_crit=1e-05, maxiter=200, verbose=False)
Calculate archetypal analysis for the input latent representation. A preliminary step for the get_se function.
- Parameters:
embedding (np.ndarray) – Input data matrix where each row represents a cell.
noc (int, default: 5) – Number of archetypes to extract.
delta (float, default: 0.1) – The relaxation parameter in the PCHA algorithm.
conv_crit (float, default: 0.00001) – Convergence criterion. Algorithm stops when the relative change in objective function is less than this.
maxiter (int, default: 200) – Maximum number of iterations.
verbose (bool, default: False) – Whether to print progress during optimization.
- Returns:
Returns a tuple containing:
- Feature loading matrix (shape: (n_archetypes, latent_dim))
- Archetypal representation matrix (shape: (n_cells, n_archetypes))
- Explained variance ratio
- Return type:
Tuple
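The shapes of the first two returned matrices suggest how they fit together: multiplying the archetypal representation by the feature loading matrix gives an approximation of the input embedding. The sketch below shows this on toy data in plain Python. It assumes the archetypal representation holds per-cell mixing weights over the archetypes, which is the usual reading in archetypal analysis but is not confirmed here for simvi; `reconstruct` is a hypothetical name.

```python
def reconstruct(representation, loadings):
    """Multiply (n_cells, n_archetypes) by (n_archetypes, latent_dim)."""
    latent_dim = len(loadings[0])
    return [
        [sum(w * arch[d] for w, arch in zip(row, loadings)) for d in range(latent_dim)]
        for row in representation
    ]


# Toy data: 2 archetypes in a 2-dimensional latent space, 3 cells.
loadings = [[0.0, 0.0], [2.0, 0.0]]
representation = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]]
approx = reconstruct(representation, loadings)  # shape (n_cells, latent_dim)
```

In this toy case the first and last cells coincide with an archetype, and the middle cell lies halfway between the two.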
- SimVIModel.get_se(edge_index=None, adata=None, z_label='simvi_z', s_label='simvi_s', transformation='log1p', batch_label=None, num_arch=5, delta=0.1, maxiter=200, Kfold=5, eps=0, thres=0.95, positivity_filter=False, cell_type_label=None, obsm_label=None, mode='individual')
Return the spatial effect for each cell in spatial omics data. Requires a trained SIMVI model.
- Parameters:
edge_index (torch.Tensor) – Edge index tensor created by extract_edge_index.
adata (AnnData, optional) – AnnData object with equivalent structure to initial AnnData. If None, defaults to the AnnData object used to initialize the model.
z_label (str, optional) – The name of the intrinsic variation in adata.obsm. If adata is None, then it is calculated in this function.
s_label (str, optional) – The name of the spatial variation in adata.obsm. If adata is None, then it is calculated in this function.
transformation (str, default: 'log1p') – If log1p, perform log1p on a copy of the data. Else, operate on the given adata.X.
batch_label (str, optional) – If given, then add it as a covariate in the double machine learning model.
num_arch (int, default: 5) – Number of archetypes in archetypal transformation.
delta (float, default: 0.1) – Delta parameter in archetypal transformation.
maxiter (int, default: 200) – Maximum iterations in archetypal transformation.
Kfold (int, default: 5) – Number of folds in cross validation.
eps (float, default: 0) – Epsilon parameter in archetypal transformation.
thres (float, default: 0.95) – Threshold (thres2) used in the positivity index calculation.
positivity_filter (bool, default: False) – If True, only return the spatial effect of cells satisfying positivity condition, and return the indices of these cells.
cell_type_label (str, optional) – If given, then add it as a covariate in the double machine learning model.
obsm_label (str, optional) – If given, then add it as a covariate in the double machine learning model.
- Returns:
If positivity_filter is False, return (spatial_effect, R2s, p_values, archetypes). If positivity_filter is True, return (positive_indices, spatial_effect, R2s, p_values, archetypes).
- Return type:
Union[tuple, tuple]
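Because the length of the returned tuple depends on positivity_filter, calling code may find it convenient to normalize both cases into one structure. The following is a minimal sketch: `unpack_se` is a hypothetical helper that is not part of simvi, and the placeholder strings stand in for the real arrays.

```python
def unpack_se(result, positivity_filter=False):
    """Hypothetical helper: map either get_se return shape onto named fields."""
    if positivity_filter:
        positive_indices, spatial_effect, r2s, p_values, archetypes = result
    else:
        spatial_effect, r2s, p_values, archetypes = result
        positive_indices = None  # no filtering was applied
    return {
        "positive_indices": positive_indices,
        "spatial_effect": spatial_effect,
        "R2s": r2s,
        "p_values": p_values,
        "archetypes": archetypes,
    }


# Placeholder values standing in for the arrays returned by get_se.
unfiltered = unpack_se(("se", "r2", "p", "arch"))
filtered = unpack_se(([0, 2], "se", "r2", "p", "arch"), positivity_filter=True)
```

Downstream code can then read the same keys in both cases and check positive_indices for None to see whether filtering was applied.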