SIMVI

SimVI Model

class simvi.model.simvi.SimVIModel(adata, n_batch=0, n_hidden=128, n_intrinsic=20, n_spatial=20, n_layers=1, dropout_rate=0, use_observed_lib_size=True, lam_mi=1000, reg_to_use='mmd', dis_to_use='zinb', noising_mode='default', permutation_rate=0.25, var_eps=0.0001, kl_weight=1, kl_gatweight=0.01, attention_heads=1)[source]

Bases: BaseModelClass

Estimate the intrinsic variation and spatial variation of single-cell expression data via variational inference.

Parameters:
  • adata (AnnData) – AnnData object that has been registered via SimVIModel.setup_anndata.

  • n_batch (int, default: 0) – Number of batches.

  • n_hidden (int, default: 128) – Number of nodes per hidden layer.

  • n_intrinsic (int, default: 20) – Dimensionality of the intrinsic variation.

  • n_spatial (int, default: 20) – Dimensionality of the spatial variation.

  • n_layers (int, default: 1) – Number of decoder layers. Note that in this implementation the encoder is fixed at two layers.

  • dropout_rate (float, default: 0) – Dropout rate for neural networks.

  • use_observed_lib_size (bool, default: True) – Use observed library size for RNA as scaling factor in mean of conditional distribution.

  • lam_mi (float, default: 1000) – Coefficient of the independence regularization term. A coefficient of 1000 is recommended with the 'mmd' option, and a value of 5 with the 'mi' option.

  • reg_to_use (str, default: 'mmd') – The regularization method to use. Either 'mmd' (maximum mean discrepancy) or 'mi' (closed-form mutual information).

  • dis_to_use (str, default: 'zinb') – The distribution to use for the generative model.

  • noising_mode (str, default: 'default') – The noising scheme used in the denoising autoencoding phase. One of 'default' (the original permutation procedure used in the SIMVI manuscript), 'zero' (zero masking), or 'sampling' (a faster variant of the permutation that samples with replacement).

  • permutation_rate (float, default: 0.25) – The permutation rate used during training. The permutation step itself is optional.

  • var_eps (float, default: 1e-4) – Minimal variance for the variational posteriors.

  • kl_weight (float, default: 1) – The KL divergence coefficient for intrinsic variation.

  • kl_gatweight (float, default: 0.01) – The KL divergence coefficient for spatial variation.

  • attention_heads (int, default: 1) – The number of attention heads.
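To give a feel for what the 'mmd' regularizer measures, here is a minimal NumPy sketch of a generic maximum mean discrepancy estimate between two samples. It is an illustration only, not SIMVI's implementation: the RBF kernel, the `gamma` bandwidth, the biased estimator, and the helper names `rbf_kernel` and `mmd2` are all assumptions, and how SIMVI applies the term to its latent variables is not specified here.

```python
import numpy as np

def rbf_kernel(x, y, gamma):
    # Pairwise squared Euclidean distances between rows of x and rows of y.
    sq_dists = ((x[:, None, :] - y[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-gamma * sq_dists)

def mmd2(x, y, gamma=0.1):
    """Biased estimate of the squared MMD between two samples (rows = observations)."""
    return (rbf_kernel(x, x, gamma).mean()
            + rbf_kernel(y, y, gamma).mean()
            - 2.0 * rbf_kernel(x, y, gamma).mean())

rng = np.random.default_rng(0)
a = rng.normal(size=(200, 5))
b = rng.normal(size=(200, 5))            # drawn from the same distribution as a
c = rng.normal(loc=2.0, size=(200, 5))   # drawn from a shifted distribution

same = mmd2(a, b)       # close to zero
shifted = mmd2(a, c)    # clearly larger

# Hypothetical weighting: a coefficient such as lam_mi would scale the term in a loss.
weighted_penalty = 1000 * shifted
```

The estimate is near zero when the two samples share a distribution and grows as they diverge, which is why such a term typically needs a large coefficient to compete with a reconstruction loss.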

Methods Table

simvi.model.simvi.SimVIModel.setup_anndata

Set up AnnData instance for SIMVI model.

simvi.model.simvi.SimVIModel.extract_edge_index

Define edge_index for SIMVI model training.

simvi.model.simvi.SimVIModel.train

Train the SIMVI model.

simvi.model.simvi.SimVIModel.get_latent_representation

Return the latent representation for each cell.

simvi.model.simvi.SimVIModel.get_attention

Return the attention matrix for graph attention network.

simvi.model.simvi.SimVIModel.get_archetypes

Calculate archetypal analysis for the input latent representation.

simvi.model.simvi.SimVIModel.get_se

Return the spatial effect for each cell in spatial omics data.
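The methods in the table above are usually chained in the order listed. The sketch below shows one plausible sequence using only the signatures documented on this page. The wrapper name `run_simvi`, the choice to store results under `adata.obsm["simvi_z"]` and `adata.obsm["simvi_s"]` (matching the get_se defaults), and calling extract_edge_index on the class are assumptions made for illustration.

```python
def run_simvi(adata, n_neighbors=10):
    """Chain the documented SimVIModel calls on an AnnData object.

    Assumes adata holds raw counts and spatial coordinates in adata.obsm["spatial"].
    """
    # Imported lazily so the sketch can be defined without the package installed.
    from simvi.model.simvi import SimVIModel

    # Register the AnnData object, then build the spatial neighbor graph.
    SimVIModel.setup_anndata(adata)
    edge_index = SimVIModel.extract_edge_index(adata, n_neighbors=n_neighbors)

    # Create and train the model with default hyperparameters.
    model = SimVIModel(adata)
    model.train(edge_index)

    # Store the two representations under the keys get_se looks for by default.
    adata.obsm["simvi_z"] = model.get_latent_representation(
        edge_index, representation_kind="intrinsic"
    )
    adata.obsm["simvi_s"] = model.get_latent_representation(
        edge_index, representation_kind="interaction"
    )
    return model, edge_index
```

The same edge_index is passed to every later call, so it is worth keeping alongside the trained model.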

Preprocessing

classmethod SimVIModel.setup_anndata(adata, layer=None, batch_key=None, labels_key=None, size_factor_key=None, categorical_covariate_keys=None, continuous_covariate_keys=None, **kwargs)[source]

Set up AnnData instance for SIMVI model. This is the standard setup step in an scvi-tools pipeline.

Parameters:
  • adata (AnnData) – AnnData object containing raw counts. Rows represent cells, columns represent features.

  • layer (str, optional) – If not None, uses this as the key in adata.layers for raw count data.

  • batch_key (str, optional) – Key in adata.obs for batch information. Categories will automatically be converted into integer categories and saved to adata.obs["_scvi_batch"]. If None, assigns the same batch to all the data.

  • labels_key (str, optional) – Key in adata.obs for label information. Categories will automatically be converted into integer categories and saved to adata.obs["_scvi_labels"]. If None, assigns the same label to all the data.

  • size_factor_key (str, optional) – Key in adata.obs for size factor information.

  • categorical_covariate_keys (List[str], optional) – Keys in adata.obs corresponding to categorical data. Not used in SIMVI.

  • continuous_covariate_keys (List[str], optional) – Keys in adata.obs corresponding to continuous data. Not used in SIMVI.

simvi.model.simvi.SimVIModel.extract_edge_index(adata, batch_key=None, spatial_key='spatial', method='knn', n_neighbors=10)

Define edge_index for SIMVI model training.

Parameters:
  • adata (AnnData) – AnnData object.

  • batch_key (str, optional) – Key in adata.obs for batch information. If None, all cells are assumed to come from the same batch. Otherwise, an edge_index is built for each batch and the results are concatenated.

  • spatial_key (str, optional, default: 'spatial') – Key in adata.obsm for spatial location.

  • method (str, default: 'knn') – Method for establishing the graph proximity relationship between cells. Two methods are available: k-nearest neighbors (knn) and Delaunay triangulation. knn is the default because it allows the number of neighbors to be chosen freely.

  • n_neighbors (int, default: 10) – Number of neighbors in the knn graph. Not used if the graph is based on Delaunay triangulation.

Returns:

edge_index – The edge index tensor containing the graph structure.

Return type:

torch.Tensor
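As an illustration of the per-batch knn construction described above, the following NumPy sketch builds a directed neighbor graph within each batch and concatenates the results using global cell indices. It is not the library's code: the brute-force distance computation, the (source, target) row order, the NumPy return type (the documented function returns a torch.Tensor), and the helper names are assumptions.

```python
import numpy as np

def knn_edge_index(coords, n_neighbors=10):
    """Return a (2, n_cells * n_neighbors) integer array of directed knn edges."""
    n_cells = coords.shape[0]
    # Brute-force pairwise squared distances; adequate for a small illustration.
    sq_dists = ((coords[:, None, :] - coords[None, :, :]) ** 2).sum(axis=-1)
    np.fill_diagonal(sq_dists, np.inf)  # a cell is never its own neighbor
    neighbors = np.argsort(sq_dists, axis=1)[:, :n_neighbors]
    targets = np.repeat(np.arange(n_cells), n_neighbors)
    sources = neighbors.ravel()
    return np.stack([sources, targets])

def knn_edge_index_by_batch(coords, batches, n_neighbors=10):
    """Build one knn graph per batch and concatenate them along the edge axis."""
    pieces = []
    for b in np.unique(batches):
        idx = np.flatnonzero(batches == b)
        local = knn_edge_index(coords[idx], n_neighbors)
        pieces.append(idx[local])  # map batch-local indices back to global ones
    return np.concatenate(pieces, axis=1)

rng = np.random.default_rng(0)
coords = rng.uniform(size=(30, 2))   # stand-in for adata.obsm["spatial"]
batches = np.repeat([0, 1], 15)      # stand-in for adata.obs[batch_key]
edges = knn_edge_index_by_batch(coords, batches, n_neighbors=3)
```

Because neighbors are searched only within a batch, no edge ever connects cells from different batches.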

Training

SimVIModel.train(edge_index, max_epochs=None, use_gpu=None, train_size=0.9, batch_size=None, anneal_epochs=50, mae_epochs=80, validation_size=None, lr=0.001, weight_decay=0.0001, device=None)[source]

Train the SIMVI model. We adopt the “semi-supervised” framework for model training.

Parameters:
  • edge_index (torch.Tensor) – Tensor returned by model.extract_edge_index.

  • max_epochs (int, optional) – Number of passes through the dataset. If None, default to np.min([round((20000 / n_cells) * 400), 400]).

  • use_gpu (Union[str, int, bool], optional) – Use default GPU if available (if None or True), or index of GPU to use (if int), or name of GPU (if str, e.g., "cuda:0"), or use CPU (if False).

  • train_size (float, default: 0.9) – Size of training set in the range [0.0, 1.0].

  • batch_size (int, optional) – Mini-batch size to use during training.

  • anneal_epochs (int, default: 50) – The number of epochs that use KL annealing.

  • mae_epochs (int, default: 80) – The number of epochs during which the input data is corrupted.

  • validation_size (float, optional) – Size of the validation set. If None, default to 1 - train_size. If train_size + validation_size < 1, the remaining cells belong to the test set.

  • lr (float, default: 1e-3) – Learning rate.

  • weight_decay (float, default: 1e-4) – Weight decay (serves as L2 regularization).

  • device (str, optional) – The device to train the model on. If None, use torch.device("cuda") or the CPU.

Returns:

Nothing. The model is trained in place.

Return type:

None
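The default epoch count and the data split described in the parameters above can be worked through by hand. The sketch below evaluates the max_epochs formula as quoted and derives train, validation, and test counts. The helper names and the rounding used for the split are assumptions, so the library's exact counts may differ by a cell.

```python
import numpy as np

def default_max_epochs(n_cells):
    # Formula quoted in the max_epochs description: capped at 400 epochs.
    return int(np.min([round((20000 / n_cells) * 400), 400]))

def split_sizes(n_cells, train_size=0.9, validation_size=None):
    # Assumption: counts are rounded to the nearest integer.
    n_train = round(n_cells * train_size)
    if validation_size is None:
        # Validation defaults to everything not used for training.
        n_val = n_cells - n_train
    else:
        n_val = min(round(n_cells * validation_size), n_cells - n_train)
    # Whatever remains forms the test set.
    n_test = n_cells - n_train - n_val
    return n_train, n_val, n_test

for n in (5000, 20000, 100000):
    print(n, default_max_epochs(n))

print(split_sizes(1000))
print(split_sizes(1000, train_size=0.8, validation_size=0.1))
```

Datasets with up to 20000 cells hit the 400-epoch cap, and larger datasets get proportionally fewer epochs.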

Post-Training / Inference

SimVIModel.get_latent_representation(edge_index, adata=None, indices=None, give_mean=True, batch_size=None, representation_kind='all')[source]

Return the latent representation for each cell.

Parameters:
  • edge_index (torch.Tensor) – Edge index tensor created by extract_edge_index.

  • adata (AnnData, optional) – AnnData object with equivalent structure to initial AnnData. If None, defaults to the AnnData object used to initialize the model.

  • indices (Sequence[int], optional) – Indices of cells in adata to use. If None, all cells are used.

  • give_mean (bool, default: True) – Give mean of distribution or sample from it.

  • batch_size (int, optional) – Mini-batch size for data loading into model. If None, the full batch is used.

  • representation_kind (str, default: "all") – Which representation to return: "intrinsic", "interaction", or "all".

Returns:

A numpy array with shape (n_cells, n_latent).

Return type:

np.ndarray

SimVIModel.get_attention(edge_index)[source]

Return the attention matrix for graph attention network.

Parameters:
  • edge_index (torch.Tensor) – Edge index tensor containing the graph structure, created by extract_edge_index.

Returns:

A sparse matrix containing the attention weights between cells, with shape (n_cells, n_cells).

Return type:

scipy.sparse.csr_matrix

SimVIModel.get_archetypes(embedding, noc=5, delta=0.1, conv_crit=1e-05, maxiter=200, verbose=False)[source]

Calculate archetypal analysis for the input latent representation. This is a preliminary step for get_se.

Parameters:
  • embedding (np.ndarray) – Input data matrix where each row represents a cell.

  • noc (int, default: 5) – Number of archetypes to extract.

  • delta (float, default: 0.1) – The relaxation parameter in the PCHA algorithm.

  • conv_crit (float, default: 0.00001) – Convergence criterion. Algorithm stops when the relative change in objective function is less than this.

  • maxiter (int, default: 200) – Maximum number of iterations.

  • verbose (bool, default: False) – Whether to print progress during optimization.

Returns:

A tuple containing:
  • Feature loading matrix, with shape (n_archetypes, latent_dim).

  • Archetypal representation matrix, with shape (n_cells, n_archetypes).

  • Explained variance ratio.

Return type:

Tuple
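The shapes of the returned matrices imply that the embedding is approximated by the product of the archetypal representation and the feature loading matrix. The NumPy sketch below checks that relationship on synthetic data. It is an illustration under assumptions: the convex (non-negative, sum-to-one) weights follow the usual archetypal analysis convention, and the explained variance formula 1 - SSE/SST is the common PCHA definition rather than one confirmed by this page.

```python
import numpy as np

rng = np.random.default_rng(0)
n_cells, latent_dim, noc = 100, 20, 5

# Synthetic stand-ins for the two returned matrices.
loadings = rng.normal(size=(noc, latent_dim))         # (n_archetypes, latent_dim)
weights = rng.dirichlet(np.ones(noc), size=n_cells)   # (n_cells, n_archetypes), rows sum to 1

# A noisy embedding that is nearly a convex mixture of the archetypes.
embedding = weights @ loadings + 0.01 * rng.normal(size=(n_cells, latent_dim))

# Reconstruct the embedding and measure how much of it is explained.
reconstruction = weights @ loadings
sse = ((embedding - reconstruction) ** 2).sum()   # residual sum of squares
sst = (embedding ** 2).sum()                      # total sum of squares (assumed definition)
explained = 1.0 - sse / sst
```

An explained variance close to 1 indicates that the chosen number of archetypes captures most of the structure in the embedding.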

SimVIModel.get_se(edge_index=None, adata=None, z_label='simvi_z', s_label='simvi_s', transformation='log1p', batch_label=None, num_arch=5, delta=0.1, maxiter=200, Kfold=5, eps=0, thres=0.95, positivity_filter=False, cell_type_label=None, obsm_label=None, mode='individual')[source]

Return the spatial effect for each cell in spatial omics data. The SIMVI model must be trained beforehand.

Parameters:
  • edge_index (torch.Tensor) – Edge index tensor created by extract_edge_index.

  • adata (AnnData, optional) – AnnData object with equivalent structure to initial AnnData. If None, defaults to the AnnData object used to initialize the model.

  • z_label (str, optional) – The name of the intrinsic variation in adata.obsm. If adata is None, then it is calculated in this function.

  • s_label (str, optional) – The name of the spatial variation in adata.obsm. If adata is None, then it is calculated in this function.

  • transformation (str, default: 'log1p') – If log1p, perform log1p on a copy of the data. Else, operate on the given adata.X.

  • batch_label (str, optional) – If given, then add it as a covariate in the double machine learning model.

  • num_arch (int, default: 5) – Number of archetypes in archetypal transformation.

  • delta (float, default: 0.1) – Delta parameter in archetypal transformation.

  • maxiter (int, default: 200) – Maximum iterations in archetypal transformation.

  • Kfold (int, default: 5) – Number of folds in cross validation.

  • eps (float, default: 0) – Epsilon parameter in archetypal transformation.

  • thres (float, default: 0.95) – Threshold (Thres2) used in the positivity index calculation.

  • positivity_filter (bool, default: False) – If True, only return the spatial effect of cells satisfying positivity condition, and return the indices of these cells.

  • cell_type_label (str, optional) – If given, then add it as a covariate in the double machine learning model.

  • obsm_label (str, optional) – If given, then add it as a covariate in the double machine learning model.

Returns:

If positivity_filter is False, return (spatial_effect, R2s, p_values, archetypes). If positivity_filter is True, return (positive_indices, spatial_effect, R2s, p_values, archetypes).

Return type:

tuple
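Because the length of the returned tuple depends on positivity_filter, calling code may find it convenient to normalize both shapes into one structure. The helper below is a hypothetical convenience wrapper, not part of the SIMVI API. It relies only on the two tuple layouts documented above.

```python
def unpack_se(result, positivity_filter=False):
    """Normalize the two documented get_se return shapes into one dict."""
    if positivity_filter:
        # Five elements: the indices of cells passing the positivity condition come first.
        positive_indices, spatial_effect, r2s, p_values, archetypes = result
    else:
        # Four elements: no index array is returned.
        positive_indices = None
        spatial_effect, r2s, p_values, archetypes = result
    return {
        "positive_indices": positive_indices,
        "spatial_effect": spatial_effect,
        "R2s": r2s,
        "p_values": p_values,
        "archetypes": archetypes,
    }

# Dummy tuples standing in for real get_se output.
plain = unpack_se(("se", "r2", "pv", "arch"))
filtered = unpack_se(("idx", "se", "r2", "pv", "arch"), positivity_filter=True)
```

Passing the same positivity_filter value to both get_se and the wrapper keeps the unpacking in step with the call.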

Helper Functions

simvi.model.simvi._train(model, data, edge_index, mask, train_loader, optimizer, batch_size, weight, eval_mode)[source]

Helper function for training. Has full-batch and mini-batch modes.

simvi.model.simvi._eval(model, data, edge_index, mask, val_loader, batch_size, weight)[source]

simvi.model.simvi.return_f_pv(X, Rsq)[source]

Helper function for calculating p values of spatial effects.