teichlab / celltypist

A tool for semi-automatic cell type classification

License: MIT License

CellTypist is an automated cell type annotation tool for scRNA-seq datasets on the basis of logistic regression classifiers optimised by the stochastic gradient descent algorithm. CellTypist allows for cell prediction using either built-in (with a current focus on immune sub-populations) or custom models, in order to assist in the accurate classification of different cell types and subtypes.

CellTypist website

Information about CellTypist can also be found on the CellTypist portal.

Interactive tutorials

Using CellTypist for cell type classification
Using CellTypist for multi-label classification
Best practice in large-scale cross-dataset label transfer using CellTypist

Install CellTypist

Using pip

pip install celltypist

Using conda

conda install -c bioconda -c conda-forge celltypist

Usage (classification)

1. Use in the Python environment

1.1. Import the module

import celltypist
from celltypist import models

1.2. Download available models

The models serve as the basis for cell type predictions. Information about the available models can also be found here.

#Show all available models that can be downloaded and used.
models.models_description()
#Download a specific model, for example, `Immune_All_Low.pkl`.
models.download_models(model = 'Immune_All_Low.pkl')
#Download a list of models, for example, `Immune_All_Low.pkl` and `Immune_All_High.pkl`.
models.download_models(model = ['Immune_All_Low.pkl', 'Immune_All_High.pkl'])
#Update the models by re-downloading the latest versions if you think they may be outdated.
models.download_models(model = ['Immune_All_Low.pkl', 'Immune_All_High.pkl'], force_update = True)
#Show the local directory storing these models.
models.models_path

The simplest approach is to download all available models. Since each model is about 1 megabyte (MB) on average, we encourage users to download all of them.

#Download all the available models.
models.download_models()
#Update all models by re-downloading the latest versions if you think they may be outdated.
models.download_models(force_update = True)

By default, a folder .celltypist/ will be created in the user's home directory to store model files. A different path/folder can be specified by exporting the environment variable CELLTYPIST_FOLDER in your configuration file (e.g. in ~/.bash_profile).

#In the shell configuration file.
export CELLTYPIST_FOLDER='/path/to/model/folder/'
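
The lookup described above can be pictured with a minimal sketch. This is not CellTypist's own code: the helper name `resolve_celltypist_folder` is hypothetical, and the sketch only resolves the base folder (CellTypist may organise model files in subfolders beneath it).

```python
import os
from pathlib import Path

def resolve_celltypist_folder(environ=None):
    """Return CELLTYPIST_FOLDER if it is set, otherwise ~/.celltypist (hypothetical helper)."""
    environ = os.environ if environ is None else environ
    default = Path.home() / ".celltypist"
    return Path(environ.get("CELLTYPIST_FOLDER", default))

# With the variable exported, the custom folder wins.
print(resolve_celltypist_folder({"CELLTYPIST_FOLDER": "/path/to/model/folder/"}))
# Without it, the folder in the home directory is used.
print(resolve_celltypist_folder({}))
```

In practice, `models.models_path` remains the reliable way to check where models are actually stored.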

1.3. Overview of the models

All models are serialised in a binary format by pickle.

#Get an overview of the models that are downloaded in `1.2.`.
#By default (`on_the_fly = False`), all possible models (even those that are not downloaded) are shown.
models.models_description(on_the_fly = True)

1.4. Inspect the model of interest

To take a look at a given model, load the model as an instance of the Model class as defined in CellTypist.

#Select the model from the above list. If the `model` argument is not provided, will default to `Immune_All_Low.pkl`.
model = models.Model.load(model = 'Immune_All_Low.pkl')
#The model summary information.
model
#Examine cell types contained in the model.
model.cell_types
#Examine genes/features contained in the model.
model.features

1.5. Celltyping based on the input of count table

CellTypist accepts the input data as a count table (cell-by-gene or gene-by-cell) in the format of .txt, .csv, .tsv, .tab, .mtx or .mtx.gz. A raw count matrix (reads or UMIs) is required. We suggest that non-expressed genes (if you are sure they are absent from your data) also be included in the input table, as they contribute negative transcriptomic signatures when compared with the model used.
```
#Get a demo test data. This is a UMI count csv file with cells as rows and gene symbols as columns.
input_file = celltypist.samples.get_sample_csv()
```
Assign the cell type labels from the model to the input test cells using the celltypist.annotate function.
```
#Predict the identity of each input cell.
predictions = celltypist.annotate(input_file, model = 'Immune_All_Low.pkl')
#Alternatively, the model argument can be a previously loaded `Model` as in 1.4.
predictions = celltypist.annotate(input_file, model = model)
```
If your input file is in a gene-by-cell format (genes as rows and cells as columns), pass in the transpose_input = True argument. In addition, if the input is provided in the .mtx format, you will also need to specify the gene_file and cell_file arguments as the files containing names of genes and cells, respectively.
```
#In case your input file is a gene-by-cell table.
predictions = celltypist.annotate(input_file, model = 'Immune_All_Low.pkl', transpose_input = True)
#In case your input file is a gene-by-cell mtx file.
predictions = celltypist.annotate(input_file, model = 'Immune_All_Low.pkl', transpose_input = True, gene_file = '/path/to/gene/file.txt', cell_file = '/path/to/cell/file.txt')
```
Again, if the model argument is not specified, CellTypist will by default use the Immune_All_Low.pkl model.

The annotate function will return an instance of the AnnotationResult class as defined in CellTypist.
```
#Summary information for the prediction result.
predictions
#Examine the predicted cell type labels.
predictions.predicted_labels
#Examine the matrix representing the decision score of each cell belonging to a given cell type.
predictions.decision_matrix
#Examine the matrix representing the probability each cell belongs to a given cell type (transformed from decision matrix by the sigmoid function).
predictions.probability_matrix
```
By default, with the annotate function, each query cell is predicted into the cell type with the largest score/probability among all possible cell types (mode = 'best match'). This mode is straightforward and can be used to differentiate between highly homogeneous cell types.

However, in some scenarios where a query cell cannot be assigned to any cell type in the reference model (i.e., a novel cell type) or can be assigned to multiple cell types (i.e., multi-label classification), a mode of probability match can be turned on (mode = 'prob match') with a probability cutoff (default to 0.5, p_thres = 0.5) to decide the cell types (none, 1, or multiple) assigned for a given cell.
```
#Query cell will get the label of 'Unassigned' if it fails to pass the probability cutoff in each cell type.
#Query cell will get multiple label outputs (concatenated by '|') if more than one cell type passes the probability cutoff.
predictions = celltypist.annotate(input_file, model = 'Immune_All_Low.pkl', mode = 'prob match', p_thres = 0.5)
```
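
To make the two modes concrete, here is a minimal sketch of how a label could be derived from one cell's decision scores. It is an illustration of the logic described above, not CellTypist's implementation: the function name `assign_label`, the cell type names, the strict `>` comparison against `p_thres`, and the order in which labels are joined are all assumptions.

```python
import math

def sigmoid(score):
    """Transform a decision score into a probability."""
    return 1.0 / (1.0 + math.exp(-score))

def assign_label(decision_scores, cell_types, mode="best match", p_thres=0.5):
    """Assign a label to one cell from its per-cell-type decision scores (hypothetical helper)."""
    probabilities = [sigmoid(s) for s in decision_scores]
    if mode == "best match":
        # Pick the single cell type with the largest probability.
        best = max(range(len(cell_types)), key=lambda i: probabilities[i])
        return cell_types[best]
    if mode == "prob match":
        # Keep every cell type whose probability passes the cutoff.
        passed = [ct for ct, p in zip(cell_types, probabilities) if p > p_thres]
        return "|".join(passed) if passed else "Unassigned"
    raise ValueError(f"unknown mode: {mode!r}")

cell_types = ["B cells", "T cells", "NK cells"]
print(assign_label([2.0, -1.0, 0.5], cell_types))                       # B cells
print(assign_label([2.0, -1.0, 0.5], cell_types, mode="prob match"))    # B cells|NK cells
print(assign_label([-2.0, -1.0, -0.5], cell_types, mode="prob match"))  # Unassigned
```

Because each cell type is scored independently, none, one, or several of them can pass the cutoff for the same cell, which is what makes the multi-label output possible.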
The three tables in the AnnotationResult (.predicted_labels, .decision_matrix and .probability_matrix) can be written out to local files (tables) by the function to_table, specifying the target folder for storage and the prefix common to each table.
```
#Export the three results to csv tables.
predictions.to_table(folder = '/path/to/a/folder', prefix = '')
#Alternatively, export the three results to a single Excel table (.xlsx).
predictions.to_table(folder = '/path/to/a/folder', prefix = '', xlsx = True)
```
The resulting AnnotationResult can also be transformed into an AnnData by the function to_adata; this AnnData stores the expression matrix in the log1p normalised format (to 10,000 counts per cell). The predicted cell type labels can be inserted into this AnnData as well by specifying insert_labels = True (which is the default behavior of to_adata).

Confidence scores of query cells can be inserted by specifying insert_conf = True (which is also the default behavior of to_adata). The scores correspond to the probabilities of cell predictions based on either predictions.predicted_labels.predicted_labels or predictions.predicted_labels.majority_voting (see 1.7.), which can be specified by insert_conf_by (default to the former, predicted_labels).
```
#Get an `AnnData` with predicted labels and confidence scores embedded into the observation metadata columns.
adata = predictions.to_adata(insert_labels = True, insert_conf = True)
#Inspect these columns (`predicted_labels` and `conf_score`).
adata.obs
```
In addition, you can insert the decision matrix into the AnnData by passing in insert_decision = True, which represents the decision scores of each cell type distributed across the input cells. Alternatively, setting insert_prob = True will insert the probability matrix into the AnnData. The latter is the recommended way as probabilities are more interpretable (though sometimes not all query datasets converge to a meaningful range of probability values).

After the insertion, multiple columns will show up in the cell metadata of AnnData, with each column's name as a cell type name. Of note, all these columns (including the predicted_labels and conf_score) can be prefixed with a specific string by setting prefix in to_adata.
```
#Get an `AnnData` with predicted labels, confidence scores, and decision matrix.
adata = predictions.to_adata(insert_labels = True, insert_conf = True, insert_decision = True)
#Get an `AnnData` with predicted labels, confidence scores, and probability matrix (recommended).
adata = predictions.to_adata(insert_labels = True, insert_conf = True, insert_prob = True)
```
You can now manipulate this object with any functions or modules applicable to AnnData. CellTypist also provides a quick function, to_plots, to visualise your AnnotationResult and store the figures without explicitly transforming it into an AnnData.
```
#Visualise the predicted cell types overlaid onto the UMAP.
predictions.to_plots(folder = '/path/to/a/folder', prefix = '')
```
A different prefix for the output figures can be specified with the prefix argument, and UMAP coordinates will be generated for the input dataset using a canonical Scanpy pipeline. The labels in the figure may be crowded if too many cell types are predicted (this can be alleviated by a majority voting process, see 1.7.).

If you would also like to inspect the decision score and probability distributions for each cell type involved in the model, pass in the plot_probability = True argument. This may take a bit longer, as one figure will be generated for each of the cell types from the model.
```
#Visualise the decision scores and probabilities of each cell type overlaid onto the UMAP as well.
predictions.to_plots(folder = '/path/to/a/folder', prefix = '', plot_probability = True)
```
Multiple figures will be generated, including the predicted cell type labels overlaid onto the UMAP space, plus the decision score and probability distributions of each cell type on the UMAP.

1.6. Celltyping based on AnnData

CellTypist also accepts the input data as an AnnData generated by, for example, Scanpy.

Since the expression of each gene will be centred and scaled using the mean and standard deviation of that gene in the provided model, CellTypist requires a logarithmised and normalised expression matrix stored in the AnnData (log1p normalised expression to 10,000 counts per cell). CellTypist will try the .X attribute first, and if it does not suffice, try the .raw.X attribute. If neither fits the desired data type or the expression matrix is not properly normalised, an error will be raised.
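
As an illustration of what "log1p normalised expression to 10,000 counts per cell" means, here is a minimal sketch for a single cell. The function name `log1p_normalise` is hypothetical; in Scanpy the same transformation is usually obtained with `sc.pp.normalize_total(adata, target_sum=1e4)` followed by `sc.pp.log1p(adata)`.

```python
import math

def log1p_normalise(counts, target_sum=10_000):
    """Scale one cell's raw counts to `target_sum` in total, then apply log(1 + x)."""
    total = sum(counts)
    if total == 0:
        # An empty cell stays all-zero.
        return [0.0 for _ in counts]
    return [math.log1p(c / total * target_sum) for c in counts]

cell = [1, 3, 0]  # raw UMI counts of three genes in one cell
normalised = log1p_normalise(cell)
print(normalised)
# Undoing the log recovers values that sum to 10,000 (up to rounding).
print(sum(math.expm1(v) for v in normalised))
```

This also shows why normalising on all genes and subsetting afterwards matters: the per-cell total is computed over every gene present at normalisation time.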

Within the AnnData, please provide all genes to ensure maximal overlap with genes in the model. If you normalise and logarithmise the gene expression matrix using all genes but later keep only a subset of genes in the AnnData, the prediction result may not be optimal.
```
#Provide the input as an `AnnData`.
predictions = celltypist.annotate('/path/to/input.h5ad', model = 'Immune_All_Low.pkl')
#Alternatively, the input can be specified as an `AnnData` already loaded in memory.
predictions = celltypist.annotate(a_loaded_adata, model = 'Immune_All_Low.pkl')
```
All the parameters and downstream operations are the same as in 1.5., with two exceptions: 1) the AnnData returned by to_adata keeps the expression matrix and all other information exactly as in the original object; 2) when generating the visualisation figures, existing UMAP coordinates will be used. If no UMAP coordinates are found, CellTypist will fall back on the neighborhood graph to compute a new 2D UMAP projection. If no neighborhood graph is available either, a canonical Scanpy pipeline will be run to generate the UMAP coordinates, as in 1.5.

Of note, when the input is an AnnData, compared to the visualisations in 1.5., a more useful way for visualising the prediction result is to use the function celltypist.dotplot, which quantitatively compares the CellTypist prediction result with the cell types (or clusters) pre-defined and stashed in the AnnData by the user. Specifically, a dot plot will be generated, demonstrating the match between CellTypist predictions and manual annotations (or clusters). For each cell type or cluster (each column within the dot plot), this plot shows how it can be 'decomposed' into different cell types predicted by CellTypist.
```
#Examine the correspondence between CellTypist predictions (`use_as_prediction`) and manual annotations (`use_as_reference`).
#Here, `predicted_labels` from `predictions.predicted_labels` is used as the prediction result from CellTypist.
#`use_as_prediction` can be also set as `majority_voting` (see `1.7.`).
celltypist.dotplot(predictions, use_as_reference = 'column_key_of_manual_annotation', use_as_prediction = 'predicted_labels')
```
Check celltypist.dotplot for other parameters controlling visualisation details of this plot.

1.7. Use a majority voting classifier combined with celltyping

By default, CellTypist will only do the prediction jobs to infer the identities of input cells, which renders the prediction of each cell independent. To combine the cell type predictions with the cell-cell transcriptomic relationships, CellTypist offers a majority voting approach based on the idea that similar cell subtypes are more likely to form a (sub)cluster regardless of their individual prediction outcomes. To turn on the majority voting classifier in addition to the CellTypist predictions, pass in majority_voting = True to the annotate function.
```
#Turn on the majority voting classifier as well.
predictions = celltypist.annotate(input_file, model = 'Immune_All_Low.pkl', majority_voting = True)
```
During the majority voting, to define cell-cell relations, CellTypist will use a heuristic over-clustering approach according to the size of the input data with the aid of a Leiden clustering pipeline. Users can also provide their own over-clustering result to the over_clustering argument. This argument can be specified in several ways:
1. an input plain file with the over-clustering result of one cell per line.
2. a string key specifying an existing cell metadata column in the AnnData (pre-created by the user).
3. a list-like object (such as a numpy 1D array) indicating the over-clustering result of all cells.
4. if none of the above is provided, will use a heuristic over-clustering approach, noted above.
```
#Add your own over-clustering result.
predictions = celltypist.annotate(input_file, model = 'Immune_All_Low.pkl', majority_voting = True, over_clustering = '/path/to/over_clustering/file')
```
There is also a min_prop parameter (defaults to 0) which controls the minimum proportion of cells from the dominant cell type required to name a given subcluster by this cell type. A subcluster that fails to pass this proportion threshold will be assigned the label Heterogeneous.
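
The voting step can be sketched in a few lines. This is a simplified illustration under stated assumptions, not CellTypist's implementation: the function name `majority_vote` and the cell type names are hypothetical, ties are broken by first occurrence, and a subcluster is treated as failing when the dominant proportion is below `min_prop`.

```python
from collections import Counter, defaultdict

def majority_vote(predicted_labels, over_clustering, min_prop=0.0):
    """Relabel each cell with the dominant predicted label of its subcluster (hypothetical helper)."""
    by_cluster = defaultdict(list)
    for label, cluster in zip(predicted_labels, over_clustering):
        by_cluster[cluster].append(label)

    cluster_label = {}
    for cluster, labels in by_cluster.items():
        dominant, count = Counter(labels).most_common(1)[0]
        proportion = count / len(labels)
        # Subclusters without a sufficiently dominant cell type are flagged.
        cluster_label[cluster] = dominant if proportion >= min_prop else "Heterogeneous"

    return [cluster_label[cluster] for cluster in over_clustering]

predicted = ["T cells", "T cells", "B cells", "B cells", "B cells", "NK cells"]
clusters = [0, 0, 0, 1, 1, 1]
print(majority_vote(predicted, clusters))                # dominant label per subcluster
print(majority_vote(predicted, clusters, min_prop=0.7))  # 2/3 < 0.7, so both are Heterogeneous
```

With the default `min_prop = 0`, every subcluster is named after its most frequent predicted cell type, however small its share.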

Similarly, an instance of the AnnotationResult class will be returned.
```
#Examine the predicted cell type labels.
predictions.predicted_labels
#Examine specifically the majority-voting results.
predictions.predicted_labels.majority_voting
#Examine the matrix representing the decision score of each cell belonging to a given cell type.
predictions.decision_matrix
#Examine the matrix representing the probability each cell belongs to a given cell type (transformed from decision matrix by the sigmoid function).
predictions.probability_matrix
```
Compared to the results without majority-voting functionality as in 1.5. and 1.6., the .predicted_labels attribute now has two extra columns (over_clustering and majority_voting) in addition to the column predicted_labels.

Other parameters and downstream operations are the same as in 1.5. and 1.6. Note that because the majority-voting results are added, the exported tables (by to_table), the transformed AnnData (by to_adata), and the visualisation figures (by to_plots) will all have additional outputs or information indicating the majority-voting outcomes. For example, when using the function celltypist.dotplot, you can set use_as_prediction = 'majority_voting' to visualise the match between majority-voting results and manual annotations. Similarly, when using to_adata, you can specify insert_conf_by = 'majority_voting' to have the confidence scores correspond to the majority-voting result instead of the raw predictions (insert_conf_by = 'predicted_labels', which is the default).
```
#Examine the correspondence between CellTypist predictions (`use_as_prediction`) and manual annotations (`use_as_reference`).
celltypist.dotplot(predictions, use_as_reference = 'column_key_of_manual_annotation', use_as_prediction = 'majority_voting')
```

2. Use from the command line

2.1. Check the command line options
```
celltypist --help
```

2.2. Download all available models
```
celltypist --update-models
```
This will download the latest models from the remote server.

2.3. Overview of the models
```
celltypist --show-models
```

2.4. Celltyping based on the input of count table

See 1.5. for the format of the desired count matrix.
```
celltypist --indata /path/to/input/file --model Immune_All_Low.pkl --outdir /path/to/outdir
```
You can specify a different model with the --model option. If --model is not provided, CellTypist will by default use the Immune_All_Low.pkl model. The output directory will be set to the current working directory if --outdir is not specified.

If your input file is in a gene-by-cell format (genes as rows and cells as columns), add the --transpose-input option.
```
celltypist --indata /path/to/input/file --model Immune_All_Low.pkl --outdir /path/to/outdir --transpose-input
```
If the input is provided in the .mtx format, you will also need to specify the --gene-file and --cell-file options as the files containing names of genes and cells, respectively.

The default mode (--mode best_match) for prediction is to choose the cell type with the largest score/probability as the final prediction; setting --mode prob_match combined with a probability threshold (default to 0.5, --p-thres 0.5) will enable a multi-label classification, which assigns 0 (i.e., unassigned), 1, or >=2 cell type labels to each query cell.

Other options that control the output files of CellTypist include --prefix which adds a custom prefix and --xlsx which merges the output files into one xlsx table. Check celltypist --help for more details.

2.5. Celltyping based on AnnData

See 1.6. for the requirement of the expression matrix in the AnnData object (.h5ad).
```
celltypist --indata /path/to/input/adata --model Immune_All_Low.pkl --outdir /path/to/outdir
```
Other command line options are the same as in 2.4.

2.6. Use a majority voting classifier combined with celltyping

See 1.7. for how the majority voting classifier works.
```
celltypist --indata /path/to/input/file --model Immune_All_Low.pkl --outdir /path/to/outdir --majority-voting
```
During the majority voting, to define cell-cell relations, CellTypist will use a heuristic over-clustering approach according to the size of the input data with the aid of a Leiden clustering pipeline. Users can also provide their own over-clustering result to the --over-clustering option. This option can be specified in several ways:
1. an input plain file with the over-clustering result of one cell per line.
2. a string key specifying an existing cell metadata column in the AnnData (pre-created by the user).
3. if none of the above is provided, will use a heuristic over-clustering approach, noted above.
```
celltypist --indata /path/to/input/file --model Immune_All_Low.pkl --outdir /path/to/outdir --majority-voting --over-clustering /path/to/over_clustering/file
```
There is also a --min-prop option (defaults to 0) which controls the minimum proportion of cells from the dominant cell type required to name a given subcluster by this cell type. A subcluster that fails to pass this proportion threshold will be assigned the label Heterogeneous.

Other command line options are the same as in 2.4.

2.7. Generate visualisation figures for the results

In addition to the tables output by CellTypist, you have the option to generate multiple figures to get an overview of your prediction results. See 1.5., 1.6. and 1.7. for what these figures represent.

#Plot the results after the celltyping process.
celltypist --indata /path/to/input/file --model Immune_All_Low.pkl --outdir /path/to/outdir --plot-results
#Plot the results after the celltyping and majority-voting processes.
celltypist --indata /path/to/input/file --model Immune_All_Low.pkl --outdir /path/to/outdir --majority-voting --plot-results

3. Use in the R environment

There is currently no plan for R compatibility. To use CellTypist with data from R, convert the R objects into AnnData first.

4. Use as Docker/Singularity container

Docker

A docker image is available from the Quay.io Container Registry as quay.io/teichlab/celltypist:latest.

Simple usage:

docker run --rm -it \
  -v /path/to/data:/data \
  quay.io/teichlab/celltypist:latest \
  celltypist --indata /data/file --model Immune_All_Low.pkl --outdir /data/output

Usage with custom models:

docker run --rm -it \
  -v /path/to/data:/data \
  -v /path/to/models:/opt/celltypist/data/models \
  quay.io/teichlab/celltypist:latest \
  celltypist --indata /data/file --model My_Custom_Model.pkl --outdir /data/output

Singularity

Use the singularity pull command to download the container from the given container registry:

singularity pull celltypist-latest.sif docker://quay.io/teichlab/celltypist:latest

Then run the downloaded image as a container.

Simple usage:

singularity run \
  -B /path/to/data:/data \
  celltypist-latest.sif \
  celltypist --indata /data/file --model Immune_All_Low.pkl --outdir /data/output

Usage with custom models:

singularity run \
  -B /path/to/data:/data \
  -B /path/to/models:/opt/celltypist/data/models \
  celltypist-latest.sif \
  celltypist --indata /data/file --model My_Custom_Model.pkl --outdir /data/output

Supplemental guidance

Generate a custom model

In addition to the models provided by CellTypist (see 1.2.), you can generate your own model from which the cell type labels can be transferred to another scRNA-seq dataset. This is most useful when a large and comprehensive reference atlas is trained for future use, or when the similarity between two scRNA-seq datasets is under examination.

Inputs for data training

The inputs for CellTypist training comprise the gene expression data, the cell annotation details (i.e., cell type labels), and in some scenarios the genes used. To facilitate the training process, the train function (see below) has been designed to accommodate different kinds of input formats:
1. The gene expression data can be provided as a path to an expression table (such as .csv and .mtx) or a path to an AnnData (.h5ad). The table should contain raw counts (in order to reduce the file size), while the AnnData should contain log1p normalised expression (to 10,000 counts per cell) stored in .X or .raw.X. Instead of specifying paths, you can also provide any array-like object (e.g., csr_matrix) or AnnData already loaded in memory (both should be in the log1p format). A cell-by-gene format (cells as rows and genes as columns) is required.
2. The cell type labels can be supplied as a path to the file containing cell type label per line corresponding to the cells in gene expression data. Any list-like objects (such as a tuple or series) are also acceptable. If the gene expression data is input as an AnnData, you can also provide a column name from its cell metadata (.obs) which represents information of cell type labels.
3. The genes will be automatically extracted if the gene expression data is provided as a table file, an AnnData or a DataFrame. Otherwise, you need to specify a path to the file containing one gene per line corresponding to the genes in the gene expression data. Any list-like objects (such as a tuple or series) are also acceptable.

One-pass data training

Derive a new model by training the data using the celltypist.train function:
```
#Training a CellTypist model.
new_model = celltypist.train(expression_input, labels = label_input, genes = gene_input)
```
If the input is a table file, an AnnData or a DataFrame, genes will be automatically extracted and the genes argument can thus be omitted from the above code. If your input is in a gene-by-cell format (genes as rows and cells as columns), remember to pass in the transpose_input = True argument.

Before the training is conducted, the gene expression format will be checked to make sure the input data is supplied as required. For example, the expression matrix should be in log1p normalised expression (to 10,000 counts per cell) if the input is an AnnData. This means when you subset the input with given genes (e.g., by highly variable genes), an error may be raised as CellTypist cannot judge the input as properly normalised with only a subset of genes. In such a case, pass in check_expression = False to skip the expression format check.
```
#Training a CellTypist model with only subset of genes (e.g., highly variable genes).
#Restricting the input to a subset of genes can accelerate the training process.
#Use `AnnData` here as an example.
new_model = celltypist.train(some_adata[:, some_adata.var.highly_variable], labels = label_input, check_expression = False)
```
By default, data is trained using a traditional logistic regression classifier. This classifier is well suited to datasets of small or intermediate sizes (as an empirical estimate, <= 100k cells), and usually leads to an unbiased probability range with less parameter tuning. Among the training parameters, three important ones are solver, which (if not specified by the user) is selected by CellTypist based on the size of the input data; C, which sets the inverse of L2 regularisation strength; and max_iter, which caps the number of iterations the solver may take while minimising the cost function. Other (hyper)parameters from LogisticRegression are also applicable in the train function.

When the dimensions of the input data are large, training may take a long time even with CPU parallelisation (achieved by the n_jobs argument). To reduce the training time as well as to add some randomness to the classifier's solution, a stochastic gradient descent (SGD) logistic regression classifier can be enabled by use_SGD = True.
```
#Training a CellTypist model with SGD learning.
new_model = celltypist.train(expression_input, labels = label_input, genes = gene_input, use_SGD = True)
```
A logistic regression classifier with SGD learning reduces the training burden dramatically and has a comparable performance versus a traditional logistic regression classifier. A minor caveat is that more careful model parameter tuning may be needed if you want to utilise the probability values from the model for scoring cell types in the prediction step (the selection of the most likely cell type for each query cell is not influenced however). Among the training parameters, two important ones are alpha which sets the L2 regularisation strength and max_iter which controls the maximum number of iterations. Other (hyper)parameters from SGDClassifier are also applicable in the train function.

When the training data contains a huge number of cells (for example >500k cells) or more randomness in selecting cells for training is needed, you may consider using the mini-batch version of the SGD logistic regression classifier by specifying use_SGD = True and mini_batch = True. As a result, in each epoch (defaults to 10 epochs, epochs = 10), cells are binned into equal-sized random batches (the size defaults to 1000, batch_size = 1000), and are trained in a batch-by-batch manner (defaults to 100 batches, batch_number = 100).
```
#Get a CellTypist model with SGD mini-batch training.
new_model = celltypist.train(expression_input, labels = label_input, genes = gene_input, use_SGD = True, mini_batch = True)
```
By selecting only part of the cells for training (by default 1,000,000 cells with possible duplications, epochs x batch_size x batch_number), training time can be reduced further, while the performance of the derived model remains comparable to that of the above two methods. Since some rare cell types may be undersampled during this procedure, you can pass in the balance_cell_type = True argument to sample rare cell types with a higher probability, ensuring close-to-even cell type distributions in mini-batches (subject to the maximum number of cells that can be provided by a given cell type).
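
The arithmetic above, and one plausible way to favour rare cell types, can be sketched as follows. Both helper names are hypothetical, and the weighting scheme (each cell type receives an equal share of the total sampling probability) is an assumption used for illustration; it is not confirmed here as the scheme behind balance_cell_type.

```python
import random
from collections import Counter

def cells_seen(epochs=10, batch_size=1000, batch_number=100):
    """Number of cells (with possible duplications) visited during mini-batch training."""
    return epochs * batch_size * batch_number

def balanced_sampling_weights(labels):
    """Per-cell sampling weights giving every cell type the same total probability."""
    counts = Counter(labels)
    n_types = len(counts)
    return [1.0 / (n_types * counts[label]) for label in labels]

print(cells_seen())  # 1000000

labels = ["A", "A", "A", "B"]  # "B" is the rare cell type
weights = balanced_sampling_weights(labels)
print(weights)  # each "A" cell gets 1/6, the single "B" cell gets 1/2

# Draw one small batch of cell indices using these weights.
rng = random.Random(0)
batch = rng.choices(range(len(labels)), weights=weights, k=8)
print(batch)
```

Under this assumed scheme, a rare cell type is drawn as often as an abundant one, at the price of seeing its few cells repeatedly.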

Free-text fields (e.g., date) can also be supplied to describe the model. Check out celltypist.train for more information.

The resulting model is an instance of the Model class as in 1.4., and can be manipulated as with other CellTypist models.

Save this model locally:
```
#Write out the model.
new_model.write('/path/to/local/folder/some_model_name.pkl')
```
A suggested location for stashing the model is the models.models_path (see 1.2.). Through this, all models (including the models provided by CellTypist) will be in the same folder, and can be accessed in the same manner as in 1.4..
```
#Write out the model in the `models.models_path` folder.
new_model.write(f'{models.models_path}/some_model_name.pkl')
```
To leverage this model, first load it by models.Model.load.
```
new_model = models.Model.load('/path/to/local/folder/some_model_name.pkl')
```
This model can be used as with the built-in CellTypist models, for example, it can be specified as the model argument in annotate.
```
#Predict the identity of each input cell with the new model.
predictions = celltypist.annotate(input_file, model = new_model)
#Alternatively, just specify the model path (recommended as this ensures the model is intact every time it is loaded).
predictions = celltypist.annotate(input_file, model = '/path/to/local/folder/some_model_name.pkl')
#If the model is stored in `models.models_path`, only the model name is needed.
predictions = celltypist.annotate(input_file, model = 'some_model_name.pkl')
```
Downstream operations are the same as in 1.4., 1.5., 1.6., and 1.7.

Two-pass data training incorporating feature selection

Some scRNA-seq datasets contain noise that comes mostly from genes that are unhelpful, or even detrimental, to the characterisation of cell types. To mitigate this, celltypist.train has the option (feature_selection = True) to do a fast feature selection based on the feature importance (here, the absolute regression coefficients) using SGD learning. In short, the most important genes (default: top_genes = 300) are selected for each cell type, and are then combined across cell types into the final feature set. The classifier is then re-run using the corresponding subset of the input data.
```
#Two-pass data training with traditional logistic regression after SGD-based feature selection.
new_model = celltypist.train(expression_input, labels = label_input, genes = gene_input, feature_selection = True)
#Two-pass data training with SGD learning after feature selection.
new_model = celltypist.train(expression_input, labels = label_input, genes = gene_input, use_SGD = True, feature_selection = True)
#Two-pass data training with SGD mini-batch training after feature selection.
new_model = celltypist.train(expression_input, labels = label_input, genes = gene_input, use_SGD = True, mini_batch = True, feature_selection = True)
```
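The selection step described in this section can be pictured with a small plain-Python sketch. It assumes a coefficient table with one list of weights per cell type and one weight per gene; the function name `select_top_genes`, the toy gene names, and the weights are hypothetical and are not taken from the CellTypist source.

```python
def select_top_genes(coefficients, genes, top_genes=300):
    """Return the union of the top-ranked genes of each cell type.

    coefficients: dict mapping cell type -> list of regression coefficients,
                  one per entry in `genes` (hypothetical layout).
    """
    selected = set()
    for cell_type, weights in coefficients.items():
        # Rank gene indices by absolute coefficient, largest first.
        ranked = sorted(range(len(genes)), key=lambda i: abs(weights[i]), reverse=True)
        selected.update(ranked[:top_genes])
    # Keep the original gene order in the final feature set.
    return [genes[i] for i in sorted(selected)]


genes = ["CD3E", "CD19", "NKG7", "LYZ", "ACTB"]
coefficients = {
    "T cell": [2.1, -0.3, 0.4, -1.5, 0.01],
    "B cell": [-0.2, 1.9, -0.1, -1.2, 0.02],
}
feature_set = select_top_genes(coefficients, genes, top_genes=2)
print(feature_set)
```

Because genes are ranked by absolute value, a strongly negative coefficient counts as much as a strongly positive one. The union across cell types is usually smaller than top_genes multiplied by the number of cell types, since cell types can share informative genes.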
If you prefer other feature selection approaches and obtain a set of genes which are designated as important features, you can subset your input data and train the CellTypist model accordingly. As noted in the previous section, remember to pass in the check_expression = False argument.
```
new_model = celltypist.train(expression_input_subset, labels = label_input, genes = gene_input, check_expression = False)
```
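A minimal sketch of the subsetting step, assuming the input is a plain cell-by-gene matrix with a separate gene list. The helper `subset_genes` and the toy values are hypothetical. It also illustrates a likely reason the expression check has to be switched off: once genes are removed, the back-transformed values of a cell no longer sum to 10,000.

```python
import math


def subset_genes(matrix, genes, keep):
    """Subset the columns of a cell-by-gene matrix and its gene list together."""
    keep = set(keep)
    idx = [i for i, gene in enumerate(genes) if gene in keep]
    sub_matrix = [[row[i] for i in idx] for row in matrix]
    sub_genes = [genes[i] for i in idx]
    return sub_matrix, sub_genes


genes = ["CD3E", "CD19", "NKG7", "LYZ"]
# One cell, log1p-normalised so that the back-transformed values sum to 10,000.
cell = [math.log1p(v) for v in (4000.0, 3000.0, 2000.0, 1000.0)]

sub_matrix, sub_genes = subset_genes([cell], genes, keep=["CD3E", "LYZ"])
total = sum(math.expm1(v) for v in sub_matrix[0])
print(sub_genes, round(total))
```

When the input is a matrix rather than an AnnData object, the gene list passed alongside it should be subset in the same way, so that columns and gene names stay aligned.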
The downstream workflow is the same as that from one-pass data training.

General parameters relating to runtime and RAM usage

max_iter: when celltypist.train does not converge for a long time, setting max_iter to a lower number can reduce runtime at a possible cost of a suboptimal model.

with_mean: when the training data is a sparse matrix, setting with_mean = False will preserve sparsity by skipping the step of subtraction by the mean during scaling, and thus lower the RAM usage at the cost of a suboptimal model.

n_jobs: Number of CPUs used. This argument is not applicable to mini-batch training.

use_GPU: GPU acceleration by using logistic regression from cuml. You need to install RAPIDS and cuml first. This argument is ignored if SGD learning is enabled.
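The effect of with_mean on sparsity can be seen in a small plain-Python sketch. This is an illustration of standard scaling on a single gene, not the CellTypist implementation; the function `scale_column` and the toy values are hypothetical.

```python
def scale_column(values, with_mean=True):
    """Scale one gene (column) to unit variance, optionally centring it first."""
    n = len(values)
    mean = sum(values) / n
    variance = sum((v - mean) ** 2 for v in values) / n
    std = variance ** 0.5 or 1.0  # fall back to 1.0 for constant genes
    if with_mean:
        return [(v - mean) / std for v in values]
    return [v / std for v in values]


# A sparse gene: expressed in one cell out of five.
column = [0.0, 0.0, 0.0, 0.0, 2.0]
centred = scale_column(column, with_mean=True)
uncentred = scale_column(column, with_mean=False)

nonzero_centred = sum(1 for v in centred if v != 0.0)
nonzero_uncentred = sum(1 for v in uncentred if v != 0.0)
print(nonzero_centred, nonzero_uncentred)
```

Centring turns every zero into a non-zero value, so a sparse matrix has to be stored densely. Skipping the subtraction keeps zeros as zeros, which is where the RAM saving comes from.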

Cross-species model conversion

It is always recommended to predict a query dataset using the reference model from the same species. In cases where a cross-species label projection is needed, you can convert the model of interest to its "orthologous" form of another species. This is achieved by aligning orthologous genes between species.

Load a human immune model.
```
model = models.Model.load('Immune_All_Low.pkl')
```
This model can be converted to a mouse equivalent through the convert method. By default, a human-mouse conversion (or the opposite) will be conducted by automatically detecting the species of the model (e.g., human) and transforming it to the other species (e.g., mouse).
```
#Note `model` is modified in-place.
model.convert()
```
By default (unique_only = True), only 1:1 orthologs between the two species are kept and all other genes are discarded from the model. You can also keep those genes (including both 1:N and N:1 orthologs) by specifying unique_only = False. In that case, you need to specify how these 1:N orthologs will be handled: for each gene, either average the classifier weights across all its orthologs (collapse = 'average', which is the default when unique_only = False) or randomly choose one ortholog's weight as the representative (collapse = 'random').
```
#For illustration purposes only. Convert the model by utilising 1:N orthologs and their average weights.
#model.convert(unique_only = False, collapse = 'average')
```
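The two collapse modes can be illustrated with a toy sketch. It assumes a dictionary of per-gene weights and a mapping from each target gene to the source genes it corresponds to; the function `collapse_weights`, the gene names, and the weights are all hypothetical and only mimic the idea, not the actual code behind models.Model.convert.

```python
import random


def collapse_weights(weights, orthologs, collapse="average", seed=0):
    """Collapse classifier weights of several source genes onto one target gene.

    weights:   dict mapping source gene -> weight (toy values).
    orthologs: dict mapping target gene -> list of source genes mapped to it.
    """
    rng = random.Random(seed)
    collapsed = {}
    for target, sources in orthologs.items():
        values = [weights[s] for s in sources if s in weights]
        if not values:
            continue  # no usable ortholog: the gene is dropped
        if collapse == "average":
            collapsed[target] = sum(values) / len(values)
        elif collapse == "random":
            collapsed[target] = rng.choice(values)
        else:
            raise ValueError(f"unknown collapse mode: {collapse!r}")
    return collapsed


weights = {"GENE_A": 0.5, "GENE_B1": 1.0, "GENE_B2": 3.0}
orthologs = {"Gene_a": ["GENE_A"], "Gene_b": ["GENE_B1", "GENE_B2"]}

averaged = collapse_weights(weights, orthologs, collapse="average")
sampled = collapse_weights(weights, orthologs, collapse="random")
print(averaged, sampled)
```

Averaging gives a reproducible result, whereas random selection depends on the random state (fixed here through the assumed `seed` argument).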
As mentioned above, the default mode is a human-to-mouse (or mouse-to-human) conversion using the built-in gene mapping file (Ensembl105 version). For conversion to other species, you can provide a different file (map_file), with one column being the species of the model and the other column being the species you want to convert to. Check out models.Model.convert for more information.

Lastly, write out the converted model locally.
```
model.write('/path/to/local/folder/some_model_name.pkl')
```
This model can be used like any other CellTypist model.

Model conversion from gene symbols to Ensembl IDs

CellTypist models are usually trained on gene symbols. When the genes of a query dataset are formatted as Ensembl IDs, you can convert the gene symbols in the model to Ensembl IDs to match the query dataset. The convert method is used as in the section above.

Specifically, you need to provide a gene-symbol-to-Ensembl-ID file, such that gene symbols in the model will be converted to IDs (or vice versa). A built-in file is provided in CellTypist (GENCODE v44). Parameters and details during model conversion can be found in the previous section Cross-species model conversion.

Load a human immune model.
```
model = models.Model.load('Immune_All_Low.pkl')
```
Convert gene symbols to Ensembl IDs using the built-in file. You can also provide a path to your own ID mapping file.
```
#Note `model` is modified in-place.
model.convert('GENCODEv44_Gene_id2name.csv')
```
Lastly, write out the converted model locally.
```
model.write('/path/to/local/folder/some_model_name.pkl')
```
This model can be used like any other CellTypist model.

Citation

Dominguez Conde et al., Cross-tissue immune cell analysis reveals tissue-specific features in humans. Science 376, eabl5197 (2022). https://www.science.org/doi/10.1126/science.abl5197

celltypist's Issues

T compartment without exhausted T cells?

Congratulations on this outstanding work! I have one question: in your dataset (https://www.science.org/doi/10.1126/science.abl5197), your team detected 101 immune populations, but none of them are exhausted T cells. I wonder whether such a population is needed?

Error : sparse matrix length is ambiguous; use getnnz() or shape[0]

The AnnData object is from GSE158055. I passed it to celltypist and got the above error. What does this error mean and how do I solve it?

Guidance for using Nanostring CosMx RNA input

Hello, I appreciate your work in making CellTypist available. I have been able to use the python API to assign predicted_labels and majority_voting types to our data but am getting the warning message below while running celltypist.annotate:

⚠️ Warning: the input file seems not a raw count matrix. The prediction result may not be accurate

The CosMx data contains counts for 960 genes and 20 negative probes. It is sparse data with an average of 250 unique genes per cell. I have processed these by normalizing each cell to have a target of 10,000 counts then computing their log1p values. The input file is attached. I have also tried using an annData object as input but this throws an error instead of just a warning.

Can you comment on whether this data is a good match for CellTypist and if I have it in the best or correct format?

-Mark Dane
CT_sample_file_values.csv

Plot a celltypist.dotplot to visualise celltypist's classification using a probability threshold

Hi all,
I am trying to visualise the results of the classification using a probability threshold and majority voting with celltypist.dotplot. I get an error:

Traceback (most recent call last):
  File "celltypist-scRNA-test.py", line 66, in <module>
    celltypist.dotplot(predictions, use_as_reference = 'predicted.celltype.l2', use_as_prediction = 'majority_voting', save ='scRNA-test-celltypist-probabilistic-majority_voting.png')
  File "/Users/ysanchez/opt/anaconda3/envs/transcriptomicsconda/lib/python3.8/site-packages/celltypist/plot.py", line 140, in dotplot
    dot_size_df, dot_color_df = _get_fraction_prob_df(predictions, use_as_reference, use_as_prediction, None, None)
  File "/Users/ysanchez/opt/anaconda3/envs/transcriptomicsconda/lib/python3.8/site-packages/celltypist/plot.py", line 33, in _get_fraction_prob_df
    score = [row[pred[index]] for index, row in predictions.probability_matrix.iterrows()]
  File "/Users/ysanchez/opt/anaconda3/envs/transcriptomicsconda/lib/python3.8/site-packages/celltypist/plot.py", line 33, in <listcomp>
    score = [row[pred[index]] for index, row in predictions.probability_matrix.iterrows()]
  File "/Users/ysanchez/opt/anaconda3/envs/transcriptomicsconda/lib/python3.8/site-packages/pandas/core/series.py", line 851, in __getitem__
    return self._get_value(key)
  File "/Users/ysanchez/opt/anaconda3/envs/transcriptomicsconda/lib/python3.8/site-packages/pandas/core/series.py", line 959, in _get_value
    loc = self.index.get_loc(label)
  File "/Users/ysanchez/opt/anaconda3/envs/transcriptomicsconda/lib/python3.8/site-packages/pandas/core/indexes/base.py", line 3082, in get_loc
    raise KeyError(key) from err
KeyError: 'Unassigned'

Could you please let me know if there is a way to get around this?

Many thanks for your help!

Celltypist showing invalid expression matrix

I tried running celltypist on the AnnData object and it gave me this error. The data is from a research paper: https://github.com/scCOVID-19/COVIDPBMC/

Error:

Exception: Invalid expression matrix, expect log1p normalized expression to 10000 counts per cell

Can you please help me out with this?

Failed to install with conda

I've tried installing the package with conda on both Ubuntu and macOS, but it fails.

What I've done:

conda create -n celltypist python=3.8
conda install -c bioconda celltypist

Ubuntu

Collecting package metadata (current_repodata.json): done
Solving environment: failed with initial frozen solve. Retrying with flexible solve.
Solving environment: failed with repodata from current_repodata.json, will retry with next repodata source.
Collecting package metadata (repodata.json): done
Solving environment: failed with initial frozen solve. Retrying with flexible solve.
Solving environment: \ 
Found conflicts! Looking for incompatible packages.
This can take several minutes.  Press CTRL-C to abort.
failed                                                                                                                                                                                                      

UnsatisfiableError: The following specifications were found to be incompatible with each other:

Output in format: Requested package -> Available versions

The following specifications were found to be incompatible with your system:

  - feature:/linux-64::__glibc==2.31=0
  - feature:|@/linux-64::__glibc==2.31=0

Your installed version is: 2.31

macOS

Collecting package metadata (current_repodata.json): done
Solving environment: failed with initial frozen solve. Retrying with flexible solve.
Solving environment: failed with repodata from current_repodata.json, will retry with next repodata source.
Collecting package metadata (repodata.json): done
Solving environment: failed with initial frozen solve. Retrying with flexible solve.
Solving environment: | 
Found conflicts! Looking for incompatible packages.
This can take several minutes.  Press CTRL-C to abort.
failed                                                                                                                                                                                                      

UnsatisfiableError: The following specifications were found to be incompatible with each other:

Output in format: Requested package -> Available versions

I've tried with Python 3.9 and 3.7 as well with the same outcome.

Would appreciate any hints!

Best

conda version is out of date.

The CellTypist version on conda is out of date (as of 2023-2-9); methods like extract_top_markers are not in the conda version. Thanks for the nice work.

CellTypist with batch correction

Hi all,
great tool you have created with CellTypist!
Since I work with mixed scRNA-Seq data from multiple patients, I need to do a batch effects correction before the cell type annotation.
Unfortunately, I don't know of any method that corrects batch effects while leaving the data in a count matrix.
There will always be negative data that does not correspond to a count matrix anymore.
Is it still possible to apply CellTypist after a batch effects correction, although the data is not in 1e4 and log1p format?
Any other ideas to find a workaround for that problem?

Thanks a lot!

0 features used for prediction

We are running celltypist and ran into a fringe case. I think the error is somewhere upstream in our data preparation, but running celltypist gives this:

⏳ Loading data
🔬 Input data has 2700 cells and 13714 genes
🔗 Matching reference genes in the model
🧬 0 features used for prediction
⚖️ Scaling input data

and then it dies with the stacktrace below. Obviously, if no features match, celltypist can't run. I'm reporting this since it might be nice if the celltypist code that finds matching features would die more immediately and in a more informative way if zero features match (or perhaps fewer than some configurable threshold).
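The kind of early check suggested here could look roughly like the sketch below. It is only an illustration of the idea: the function `match_features` and its `min_features` threshold are hypothetical and are not part of the CellTypist API.

```python
def match_features(query_genes, model_genes, min_features=1):
    """Return genes shared by the query and the model, failing early if too few match."""
    model_set = set(model_genes)
    matched = [gene for gene in query_genes if gene in model_set]
    if len(matched) < min_features:
        raise ValueError(
            f"Only {len(matched)} of {len(model_genes)} model genes were found in the input; "
            f"at least {min_features} required. Check that gene identifiers use the same "
            "format (for example, gene symbols versus Ensembl IDs)."
        )
    return matched


model_genes = ["CD3E", "CD19", "LYZ"]
print(match_features(["LYZ", "ACTB", "CD3E"], model_genes))

try:
    # Ensembl IDs in the query, gene symbols in the model: nothing matches.
    match_features(["ENSG00000198851", "ENSG00000177455"], model_genes)
except ValueError as error:
    print(error)
```

Raising before scaling and prediction would replace the scikit-learn "0 feature(s)" error with a message that points at the likely cause.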

I'm confused about this particular case since the failure is from our automated tests, and the input data is the pbmc3k dataset from Seurat, downloaded using the SeuratData R package. This test runs fine on R/release, but not R/develop. In any event, I don't think the failure itself is celltypist's issue.

Traceback (most recent call last):
  File "/home/runner/.local/bin/celltypist", line 8, in <module>
    sys.exit(main())
  File "/home/runner/.local/lib/python3.8/site-packages/click/core.py", line 1128, in __call__
    return self.main(*args, **kwargs)
  File "/home/runner/.local/lib/python3.8/site-packages/click/core.py", line 1053, in main
    rv = self.invoke(ctx)
  File "/home/runner/.local/lib/python3.8/site-packages/click/core.py", line 1395, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/runner/.local/lib/python3.8/site-packages/click/core.py", line 754, in invoke
    return __callback(*args, **kwargs)
  File "/home/runner/.local/lib/python3.8/site-packages/celltypist/command_line.py", line 109, in main
    result = annotate(
  File "/home/runner/.local/lib/python3.8/site-packages/celltypist/annotate.py", line 81, in annotate
    predictions = clf.celltype(mode = mode, p_thres = p_thres)
  File "/home/runner/.local/lib/python3.8/site-packages/celltypist/classifier.py", line 351, in celltype
    decision_mat, prob_mat, lab = self.model.predict_labels_and_prob(self.indata, mode = mode, p_thres = p_thres)
  File "/home/runner/.local/lib/python3.8/site-packages/celltypist/models.py", line 120, in predict_labels_and_prob
    scores = self.classifier.decision_function(indata)
  File "/home/runner/.local/lib/python3.8/site-packages/sklearn/linear_model/_base.py", line 407, in decision_function
    X = self._validate_data(X, accept_sparse="csr", reset=False)
  File "/home/runner/.local/lib/python3.8/site-packages/sklearn/base.py", line 566, in _validate_data
    X = check_array(X, **check_params)
  File "/home/runner/.local/lib/python3.8/site-packages/sklearn/utils/validation.py", line 814, in check_array
    raise ValueError(
ValueError: Found array with 0 feature(s) (shape=(2700, 0)) while a minimum of 1 is required.

`celltypist.annotate` to specify the required scale for count normalization (currently 10'000)

It appears that counts need to be normalized with a scale of 10'000 when calling celltypist.annotate. This is not clear from the documentation of that function. However, one can figure it out by trial and error or by code inspection (for instance from the following code in classifier.py:
if np.abs(np.expm1(self.indata[0]).sum()-10000) > 1: raise ValueError("🛑 Invalid expression matrix, expect log1p normalized expression to 10000 counts per cell")).
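To make the requirement concrete, here is a minimal plain-Python sketch of the expected preprocessing and of a check equivalent in spirit to the line quoted above. The helper names `normalize_log1p` and `looks_normalized` and the toy counts are hypothetical.

```python
import math


def normalize_log1p(counts, target_sum=10000.0):
    """Scale one cell's counts to `target_sum` in total, then apply log1p."""
    total = sum(counts)
    return [math.log1p(c / total * target_sum) for c in counts]


def looks_normalized(cell, target_sum=10000.0, tolerance=1.0):
    """Back-transform with expm1 and test whether the cell sums to `target_sum`."""
    return abs(sum(math.expm1(v) for v in cell) - target_sum) <= tolerance


raw_counts = [3.0, 0.0, 12.0, 5.0]
normalized = normalize_log1p(raw_counts)
print(looks_normalized(raw_counts), looks_normalized(normalized))
```

With scanpy, the equivalent preprocessing is usually `sc.pp.normalize_total(adata, target_sum = 1e4)` followed by `sc.pp.log1p(adata)`.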

This issue is a suggestion to explicitly call that out in the documentation of the annotate method.

Thanks!

TypeError: np.matrix is not supported. Please convert to a numpy array with np.asarray?

Hello,

We've been running celltypist pretty regularly without issues, but recently saw this. I don't know yet whether this is a quirk in the input data or not, but I thought I'd report it. We are running a basic celltypist command with a built-in model. The input is an AnnData file created by writing an R SeuratObject to disk.

This stack trace makes me wonder if some dependency, like sklearn, updated and changed its validation, but I haven't debugged it yet.

Have you seen anything like this before? Thanks in advance.

09 Dec 2022 08:15:11,830 DEBUG: 	Traceback (most recent call last):
09 Dec 2022 08:15:11,834 DEBUG: 	  File "/usr/lib/python3.8/runpy.py", line 194, in _run_module_as_main
09 Dec 2022 08:15:11,838 DEBUG: 	    return _run_code(code, main_globals, None,
09 Dec 2022 08:15:11,843 DEBUG: 	  File "/usr/lib/python3.8/runpy.py", line 87, in _run_code
09 Dec 2022 08:15:11,848 DEBUG: 	    exec(code, run_globals)
09 Dec 2022 08:15:11,852 DEBUG: 	  File "/usr/local/lib/python3.8/dist-packages/celltypist/command_line.py", line 129, in <module>
09 Dec 2022 08:15:11,857 DEBUG: 	    main()
09 Dec 2022 08:15:11,862 DEBUG: 	  File "/usr/local/lib/python3.8/dist-packages/click/core.py", line 1130, in __call__
09 Dec 2022 08:15:11,866 DEBUG: 	    return self.main(*args, **kwargs)
09 Dec 2022 08:15:11,872 DEBUG: 	  File "/usr/local/lib/python3.8/dist-packages/click/core.py", line 1055, in main
09 Dec 2022 08:15:11,877 DEBUG: 	    rv = self.invoke(ctx)
09 Dec 2022 08:15:11,881 DEBUG: 	  File "/usr/local/lib/python3.8/dist-packages/click/core.py", line 1404, in invoke
09 Dec 2022 08:15:11,885 DEBUG: 	    return ctx.invoke(self.callback, **ctx.params)
09 Dec 2022 08:15:11,889 DEBUG: 	  File "/usr/local/lib/python3.8/dist-packages/click/core.py", line 760, in invoke
09 Dec 2022 08:15:11,895 DEBUG: 	    return __callback(*args, **kwargs)
09 Dec 2022 08:15:11,900 DEBUG: 	  File "/usr/local/lib/python3.8/dist-packages/celltypist/command_line.py", line 109, in main
09 Dec 2022 08:15:11,904 DEBUG: 	    result = annotate(
09 Dec 2022 08:15:11,910 DEBUG: 	  File "/usr/local/lib/python3.8/dist-packages/celltypist/annotate.py", line 81, in annotate
09 Dec 2022 08:15:11,915 DEBUG: 	    predictions = clf.celltype(mode = mode, p_thres = p_thres)
09 Dec 2022 08:15:11,919 DEBUG: 	  File "/usr/local/lib/python3.8/dist-packages/celltypist/classifier.py", line 376, in celltype
09 Dec 2022 08:15:11,924 DEBUG: 	    decision_mat, prob_mat, lab = self.model.predict_labels_and_prob(self.indata, mode = mode, p_thres = p_thres)
09 Dec 2022 08:15:11,937 DEBUG: 	  File "/usr/local/lib/python3.8/dist-packages/celltypist/models.py", line 145, in predict_labels_and_prob
09 Dec 2022 08:15:11,945 DEBUG: 	    scores = self.classifier.decision_function(indata)
09 Dec 2022 08:15:11,951 DEBUG: 	  File "/usr/local/lib/python3.8/dist-packages/sklearn/linear_model/_base.py", line 401, in decision_function
09 Dec 2022 08:15:11,957 DEBUG: 	    X = self._validate_data(X, accept_sparse="csr", reset=False)
09 Dec 2022 08:15:11,965 DEBUG: 	  File "/usr/local/lib/python3.8/dist-packages/sklearn/base.py", line 535, in _validate_data
09 Dec 2022 08:15:11,971 DEBUG: 	    X = check_array(X, input_name="X", **check_params)
09 Dec 2022 08:15:11,977 DEBUG: 	  File "/usr/local/lib/python3.8/dist-packages/sklearn/utils/validation.py", line 737, in check_array
09 Dec 2022 08:15:11,983 DEBUG: 	    raise TypeError(
09 Dec 2022 08:15:11,993 DEBUG: 	TypeError: np.matrix is not supported. Please convert to a numpy array with np.asarray. For more information see: https://numpy.org/doc/stable/reference/generated/numpy.matrix.html
09 Dec 2022 08:15:12,094 DEBUG: 	Quitting from lines 182-193 (16-2-GEX.df.appendHashing.frc.cite.norm.pca.dr.RunCelltypist.rmd)

Bioconda recipe

Would it be possible to add celltypist to bioconda? I'm happy to help create a recipe.

Using integrated data

Hi,
Thank you for providing the tool!
I integrate the data with harmony, which 'corrects' the PCA embeddings as a means of batch correction.
This corrected PCA is then used for downstream clustering.
If I were to use integrated, batch-corrected data, would you suggest saving the 'harmony-corrected pca' in the 'X_pca' slot of the adata object, since I see that in some modes the classifier uses PCA embeddings?

Kindly advise,
Thanks and Kind regards,

Details to be added about the available model

Dear,

In the description of the model, it is not clear whether the model involves cells from both healthy and CD samples or just healthy ones. I am referring especially to 'Cells_Intestinal_Tract', built from intestinal cells from fetal, pediatric and adult human gut (134 cell types). The associated original study also has CD data, hence the confusion.

Kindly clarify.

sequencing protocol requirements?

This is more of a question than an issue, but do the trained models only work on samples sequenced on a 10x platform, or can we expect it to also work well for smart-seq2?

If we are not using 10x data, can we simply build a new model, or are there some assumptions in the model that would lead to performance differences?

predictions.to_adata() error

Dear developer,

This is a really fantastic package for cell annotation. However, when I run predictions.to_adata(), it returns ValueError: cannot reindex on an axis with duplicate labels

Can you help me with it?

Thanks a lot!!!

The tutorial notebook didn't quite work ModuleNotFoundError: No module named 'celltypist'


---------------------------------------------------------------------------
ModuleNotFoundError                       Traceback (most recent call last)
/tmp/ipykernel_7855/813241117.py in <module>
----> 1 import celltypist
      2 from celltypist import models


ModuleNotFoundError: No module named 'celltypist'

Training new model gives ValueError

Hi,
I'm trying to train my own model and keep getting this error:

🍳 Preparing data before training
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In [37], line 1
----> 1 coarse_model = celltypist.train(GAnn, labels = 'annot_coars', n_jobs = 10, feature_selection = True)

File ~\anaconda3\envs\GDT2\lib\site-packages\celltypist\train.py:293, in train(X, labels, genes, transpose_input, with_mean, check_expression, C, solver, max_iter, n_jobs, use_SGD, alpha, mini_batch, batch_number, batch_size, epochs, balance_cell_type, feature_selection, top_genes, date, details, url, source, version, **kwargs)
    291 #check
    292 if check_expression and (np.abs(np.expm1(indata[0]).sum()-10000) > 1):
--> 293     raise ValueError(
    294             "🛑 Invalid expression matrix, expect log1p normalized expression to 10000 counts per cell")
    295 if len(labels) != indata.shape[0]:
    296     raise ValueError(
    297             f"🛑 Length of training labels ({len(labels)}) does not match the number of input cells ({indata.shape[0]})")

ValueError: 🛑 Invalid expression matrix, expect log1p normalized expression to 10000 counts per cell

my anndata is simple raw counts and obs metadata:

AnnData object with n_obs × n_vars = 185894 × 31053
    obs: 'annot_coars', 'annot_fine'

I tried also running sc.pp.log1p() prior to the train function (though that should be done under the hood, no?) but nothing changes.

training on the demo adata_2000 works just fine.

Thanks!

error while plotting predictions: cannot find keys

Hi,

I am trying to plot the predictions as a dotplot. It works for one dataset but does not seem to work for another. I have checked to make sure the respective columns exist in predictions.predicted_labels.

celltypist.dotplot(predictions, use_as_reference='seurat_clusters', use_as_prediction='majority_voting')

Error:

KeyError                                  Traceback (most recent call last)
----> 7 celltypist.dotplot(predictions, use_as_reference='seurat_clusters', use_as_prediction="majority_voting")
.../scanpy_py37/lib/python3.7/site-packages/celltypist/plot.py in dotplot(predictions, use_as_reference, use_as_prediction, prediction_order, reference_order, filter_prediction, cmap, vmin, vmax, colorbar_title, dot_min, dot_max, smallest_dot, size_title, swap_axes, title, figsize, show, save, ax, return_fig, **kwds)
    156     _adata.obs['_pred'] = dot_size_df.index
    157     #DotPlot
--> 158     dp = sc.pl.DotPlot(_adata, dot_size_df.columns, '_pred', title = title, figsize = figsize, dot_color_df = dot_color_df, dot_size_df = dot_size_df, ax = ax, vmin = vmin, vmax = vmax, **kwds)
    159     if swap_axes:
    160         dp.swap_axes()

.../scanpy_py37/lib/python3.7/site-packages/scanpy/plotting/_dotplot.py in __init__(self, adata, var_names, groupby, use_raw, log, num_categories, categories_order, title, figsize, gene_symbols, var_group_positions, var_group_labels, var_group_rotation, layer, expression_cutoff, mean_only_expressed, standard_scale, dot_color_df, dot_size_df, ax, vmin, vmax, vcenter, norm, **kwds)
    151             vcenter=vcenter,
    152             norm=norm,
--> 153             **kwds,
    154         )
    155 

.../scanpy_py37/lib/python3.7/site-packages/scanpy/plotting/_baseplot_class.py in __init__(self, adata, var_names, groupby, use_raw, log, num_categories, categories_order, title, figsize, gene_symbols, var_group_positions, var_group_labels, var_group_rotation, layer, ax, vmin, vmax, vcenter, norm, **kwds)
    117             num_categories,
    118             layer=layer,
--> 119             gene_symbols=gene_symbols,
    120         )
    121         if len(self.categories) > self.MAX_NUM_CATEGORIES:
.../scanpy_py37/lib/python3.7/site-packages/scanpy/plotting/_anndata.py in _prepare_dataframe(adata, var_names, groupby, use_raw, log, num_categories, layer, gene_symbols)
   1918     keys = list(groupby) + list(np.unique(var_names))
   1919     obs_tidy = get.obs_df(
-> 1920         adata, keys=keys, layer=layer, use_raw=use_raw, gene_symbols=gene_symbols
   1921     )
   1922     assert np.all(np.array(keys) == np.array(obs_tidy.columns))

.../scanpy_py37/lib/python3.7/site-packages/scanpy/get/get.py in obs_df(adata, keys, obsm_keys, layer, gene_symbols, use_raw)
    276         keys,
    277         alias_index=alias_index,
--> 278         use_raw=use_raw,
    279     )
    280 

.../scanpy_py37/lib/python3.7/site-packages/scanpy/get/get.py in _check_indices(dim_df, alt_index, dim, keys, alias_index, use_raw)
    166     if len(not_found) > 0:
    167         raise KeyError(
--> 168             f"Could not find keys '{not_found}' in columns of `adata.{dim}` or in"
    169             f" {alt_repr}.{alt_search_repr}."
    170         )

KeyError: "Could not find keys '['0', '1', '10', '11', '12', '13', '14', '15', '16', '17', '18', '19', '2', '20', '21', '22', '23', '24', '25', '26', '27', '28', '29', '3', '30', '31', '32', '33', '34', '35', '36', '37', '38', '39', '4', '40', '41', '42', '43', '44', '45', '46', '47', '48', '5', '6', '7', '8', '9']' in columns of `adata.obs` or in adata.var_names."

Any help would be great.

Thanks
Saurabh

Wording on message

This is a minor point, but the current code logs this:

logger.info("👀 Can not detect a neighborhood graph, construct one before the over-clustering")

when the input data lacks a neighborhood graph. It seems like celltypist automatically creates one in this situation. I would suggest re-phrasing that message more like:

logger.info("👀 Can not detect a neighborhood graph, will construct one before the over-clustering")

to make it clear celltypist is fixing this. The phrasing "construct one before the over-clustering" sounds more like the tool is asking the user to construct one before running celltypist.

Can we extract the performance metrics from a customized model?

Thanks for this awesome tool! We are using it with our own data to create a customized model. My question is how we can obtain the model performance info from the customized model? Is there any utility function in celltypist we can use or we have to write our own code to evaluate how well our model performs? Thanks!

raw training dataset

Is it possible to obtain the raw training dataset?

Support custom .celltypist dir?

Hello,

By default, celltypist makes a folder in the user's home dir to cache data. It would be convenient if celltypist supported either an environment variable or argument to directly specify this path. Our scenario is that we're running in docker on a cluster as a non-root user and the default home is non-writable. I'm fixing this, but having flexibility over the save location would still be a nice feature.
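A rough sketch of the requested behaviour, assuming an environment variable takes precedence over the default folder in the home directory. The variable name `EXAMPLE_CELLTYPIST_DIR` and the function `resolve_cache_dir` are placeholders chosen for illustration, not names confirmed by the CellTypist source.

```python
import os
from pathlib import Path


def resolve_cache_dir(env_var="EXAMPLE_CELLTYPIST_DIR"):
    """Return the cache directory, preferring an environment variable over the home folder."""
    override = os.environ.get(env_var)
    if override:
        return Path(override)
    # Fall back to a hidden folder in the user's home directory.
    return Path.home() / ".celltypist"


print(resolve_cache_dir())
```

In a container with a non-writable home, the variable could then be set to a writable mount before the tool starts.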

Thanks,
Ben

Filtering labels in celltypist.dotplot

Hello CellTypist team,
It would be great to have an option to specify a subset of labels to plot in celltypist.dotplot. E.g. when predicting from a model with > 100 cell types and using majority_voting=False, often certain labels are assigned to just a few cells, so I get a huge dotplot, but I am really interested in just a few of those labels.

In scanpy.pl.dotplot I would do this by simply filtering the anndata object, but I don't seem to be able to manipulate predictions.adata in this case.

predictions = celltypist.annotate "ValueError:"

Hi Teichlab,

This is a really excellent tool and I love to use it. I was able to run the tutorial, but when I replace the data with my own dataset I start to get errors.

adata_2000.X.expm1().sum(axis = 1)

matrix([[1.39864217e+141],
        [4.99632738e+074],
        [1.12685234e+037],
        ...,
        [4.65627696e+256],
        [1.14145687e+070],
        [3.34092341e+191]])



adata_2000_raw = adata_2000.copy()
sc.pp.normalize_total(adata_2000_raw)
sc.pp.log1p(adata_2000_raw)
adata_2000.raw = adata_2000_raw



# Not run; predict cell identities using this loaded model.
predictions = celltypist.annotate(adata_2000, model = model, majority_voting = True)
# Alternatively, just specify the model name (recommended as this ensures the model is intact every time it is loaded).
#predictions = celltypist.annotate(adata_2000, model = 'Immune_All_High.pkl', majority_voting = True)




👀 Invalid expression matrix in `.X`, expect log1p normalized expression to 10000 counts per cell; will try the `.raw` attribute
⚠️ Warning: invalid expression matrix, expect all genes and log1p normalized expression to 10000 counts per cell. The prediction result may not be accurate
🔬 Input data has 122530 cells and 24910 genes
🔗 Matching reference genes in the model
🧬 5900 features used for prediction
⚖️ Scaling input data
🖋️ Predicting labels
✅ Prediction done!
👀 Can not detect a neighborhood graph, will construct one before the over-clustering
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In [35], line 2
      1 # Not run; predict cell identities using this loaded model.
----> 2 predictions = celltypist.annotate(adata_2000, model = model, majority_voting = True)

File ~/opt/miniconda3/envs/SCVI/lib/python3.8/site-packages/celltypist/annotate.py:89, in annotate(filename, model, transpose_input, gene_file, cell_file, mode, p_thres, majority_voting, over_clustering, min_prop)
     87 #over clustering
     88 if over_clustering is None:
---> 89     over_clustering = clf.over_cluster()
     90     predictions.adata = clf.adata
     91 elif isinstance(over_clustering, str):

File ~/opt/miniconda3/envs/SCVI/lib/python3.8/site-packages/celltypist/classifier.py:418, in Classifier.over_cluster(self, resolution)
    416     logger.info("👀 Can not detect a neighborhood graph, will construct one before the over-clustering")
    417     adata = self.adata.copy()
--> 418     self.adata.obsm['X_pca'], self.adata.obsp['connectivities'], self.adata.obsp['distances'], self.adata.uns['neighbors'] = Classifier._construct_neighbor_graph(adata)
    419 else:
    420     logger.info("👀 Detected a neighborhood graph in the input object, will run over-clustering on the basis of it")

File ~/opt/miniconda3/envs/SCVI/lib/python3.8/site-packages/celltypist/classifier.py:393, in Classifier._construct_neighbor_graph(adata)
    391 if 'highly_variable' not in adata.var:
    392     sc.pp.filter_genes(adata, min_cells=5)
--> 393     sc.pp.highly_variable_genes(adata, n_top_genes = min([2500, adata.n_vars]))
    394 adata = adata[:, adata.var.highly_variable]
...
    265     )
    266 elif mn == mx:  # adjust end points before binning
    267     mn -= 0.001 * abs(mn) if mn != 0 else 0.001

ValueError: cannot specify integer `bins` when input data contains infinity

Thank you

Genes driving the prediction

Hello, as usual thank you so much for all your help with my questions. I am wondering if there is a way to know which genes are driving a specific similarity with a model? Can I get that information from the results?
Many thanks,
Carmen

Can't run CellTypist with my own model

Hello : )

I'm trying to run CellTypist with my own model, and at the step "Training data using SGD logistic regression" I get the following error:

OMP: Error #13: Assertion failure at kmp_runtime.cpp(3689).
OMP: Hint Please submit a bug report with this message, compile and run commands used, and machine configuration info including native compiler and operating system versions. Faster response will be obtained by including all program sources. For information on submitting this issue, please see https://bugs.llvm.org/.
/Users/amartinezl/opt/anaconda3/lib/python3.9/multiprocessing/resource_tracker.py:216: UserWarning: resource_tracker: There appear to be 2 leaked semaphore objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '
zsh: abort      python 2.5_RunCellTypist_OwnModel.py

Has anyone faced this error before?

Thanks!

Error with model download

📜 Retrieving model list from server https://celltypist.cog.sanger.ac.uk/models/models.json

timeout Traceback (most recent call last)
~/.conda/envs/single_cell_v0.1/lib/python3.7/site-packages/urllib3/connectionpool.py in _make_request(self, conn, method, url, timeout, chunked, **httplib_request_kw)
381 try:
--> 382 self._validate_conn(conn)
383 except (SocketTimeout, BaseSSLError) as e:

~/.conda/envs/single_cell_v0.1/lib/python3.7/site-packages/urllib3/connectionpool.py in _validate_conn(self, conn)
1009 if not getattr(conn, "sock", None): # AppEngine might not have .sock
-> 1010 conn.connect()
1011

~/.conda/envs/single_cell_v0.1/lib/python3.7/site-packages/urllib3/connection.py in connect(self)
420 ssl_context=context,
--> 421 tls_in_tls=tls_in_tls,
422 )

~/.conda/envs/single_cell_v0.1/lib/python3.7/site-packages/urllib3/util/ssl_.py in ssl_wrap_socket(sock, keyfile, certfile, cert_reqs, ca_certs, server_hostname, ssl_version, ciphers, ssl_context, ca_cert_dir, key_password, ca_cert_data, tls_in_tls)
449 ssl_sock = _ssl_wrap_socket_impl(
--> 450 sock, context, tls_in_tls, server_hostname=server_hostname
451 )

~/.conda/envs/single_cell_v0.1/lib/python3.7/site-packages/urllib3/util/ssl_.py in _ssl_wrap_socket_impl(sock, ssl_context, tls_in_tls, server_hostname)
492 if server_hostname:
--> 493 return ssl_context.wrap_socket(sock, server_hostname=server_hostname)
494 else:

~/.conda/envs/single_cell_v0.1/lib/python3.7/ssl.py in wrap_socket(self, sock, server_side, do_handshake_on_connect, suppress_ragged_eofs, server_hostname, session)
422 context=self,
--> 423 session=session
424 )

~/.conda/envs/single_cell_v0.1/lib/python3.7/ssl.py in _create(cls, sock, server_side, do_handshake_on_connect, suppress_ragged_eofs, server_hostname, context, session)
869 raise ValueError("do_handshake_on_connect should not be specified for non-blocking sockets")
--> 870 self.do_handshake()
871 except (OSError, ValueError):

~/.conda/envs/single_cell_v0.1/lib/python3.7/ssl.py in do_handshake(self, block)
1138 self.settimeout(None)
-> 1139 self._sslobj.do_handshake()
1140 finally:

timeout: _ssl.c:1074: The handshake operation timed out

During handling of the above exception, another exception occurred:

ReadTimeoutError Traceback (most recent call last)
~/.conda/envs/single_cell_v0.1/lib/python3.7/site-packages/requests/adapters.py in send(self, request, stream, timeout, verify, cert, proxies)
448 retries=self.max_retries,
--> 449 timeout=timeout
450 )

~/.conda/envs/single_cell_v0.1/lib/python3.7/site-packages/urllib3/connectionpool.py in urlopen(self, method, url, body, headers, retries, redirect, assert_same_host, timeout, pool_timeout, release_conn, chunked, body_pos, **response_kw)
755 retries = retries.increment(
--> 756 method, url, error=e, _pool=self, _stacktrace=sys.exc_info()[2]
757 )

~/.conda/envs/single_cell_v0.1/lib/python3.7/site-packages/urllib3/util/retry.py in increment(self, method, url, response, error, _pool, _stacktrace)
531 if read is False or not self._is_method_retryable(method):
--> 532 raise six.reraise(type(error), error, _stacktrace)
533 elif read is not None:

~/.conda/envs/single_cell_v0.1/lib/python3.7/site-packages/urllib3/packages/six.py in reraise(tp, value, tb)
769 raise value.with_traceback(tb)
--> 770 raise value
771 finally:

~/.conda/envs/single_cell_v0.1/lib/python3.7/site-packages/urllib3/connectionpool.py in _make_request(self, conn, method, url, timeout, chunked, **httplib_request_kw)
384 # Py2 raises this as a BaseSSLError, Py3 raises it as socket timeout.
--> 385 self._raise_timeout(err=e, url=url, timeout_value=conn.timeout)
386 raise

~/.conda/envs/single_cell_v0.1/lib/python3.7/site-packages/urllib3/connectionpool.py in _raise_timeout(self, err, url, timeout_value)
336 raise ReadTimeoutError(
--> 337 self, url, "Read timed out. (read timeout=%s)" % timeout_value
338 )

ReadTimeoutError: HTTPSConnectionPool(host='celltypist.cog.sanger.ac.uk', port=443): Read timed out. (read timeout=30)

During handling of the above exception, another exception occurred:

ReadTimeout Traceback (most recent call last)
~/.conda/envs/single_cell_v0.1/lib/python3.7/site-packages/celltypist/models.py in _requests_get(url, timeout)
36 try:
---> 37 r = requests.get(url, timeout = timeout)
38 r.raise_for_status()

~/.conda/envs/single_cell_v0.1/lib/python3.7/site-packages/requests/api.py in get(url, params, **kwargs)
74
---> 75 return request('get', url, params=params, **kwargs)
76

~/.conda/envs/single_cell_v0.1/lib/python3.7/site-packages/requests/api.py in request(method, url, **kwargs)
60 with sessions.Session() as session:
---> 61 return session.request(method=method, url=url, **kwargs)
62

~/.conda/envs/single_cell_v0.1/lib/python3.7/site-packages/requests/sessions.py in request(self, method, url, params, data, headers, cookies, files, auth, timeout, allow_redirects, proxies, hooks, stream, verify, cert, json)
541 send_kwargs.update(settings)
--> 542 resp = self.send(prep, **send_kwargs)
543

~/.conda/envs/single_cell_v0.1/lib/python3.7/site-packages/requests/sessions.py in send(self, request, **kwargs)
654 # Send the request
--> 655 r = adapter.send(request, **kwargs)
656

~/.conda/envs/single_cell_v0.1/lib/python3.7/site-packages/requests/adapters.py in send(self, request, stream, timeout, verify, cert, proxies)
528 elif isinstance(e, ReadTimeoutError):
--> 529 raise ReadTimeout(e, request=request)
530 else:

ReadTimeout: HTTPSConnectionPool(host='celltypist.cog.sanger.ac.uk', port=443): Read timed out. (read timeout=30)

During handling of the above exception, another exception occurred:

Exception Traceback (most recent call last)
in
1 # Enabling force_update = True will overwrite existing (old) models.
----> 2 models.download_models(force_update = True)

~/.conda/envs/single_cell_v0.1/lib/python3.7/site-packages/celltypist/models.py in download_models(force_update, model)
424 To check all available models, use :func:~celltypist.models.models_description.
425 """
--> 426 models_json = get_models_index(force_update)
427 logger.info(f"📂 Storing models in {models_path}")
428 if model is not None:

~/.conda/envs/single_cell_v0.1/lib/python3.7/site-packages/celltypist/models.py in get_models_index(force_update)
384 models_json_path = get_model_path("models.json")
385 if not os.path.exists(models_json_path) or force_update:
--> 386 download_model_index()
387 with open(models_json_path) as f:
388 return json.load(f)

~/.conda/envs/single_cell_v0.1/lib/python3.7/site-packages/celltypist/models.py in download_model_index(only_model)
402 logger.info(f"📜 Retrieving model list from server {url}")
403 with open(get_model_path("models.json"), "wb") as f:
--> 404 f.write(_requests_get(url).content)
405 model_count = len(_requests_get(url).json()["models"])
406 logger.info(f"📚 Total models in list: {model_count}")

~/.conda/envs/single_cell_v0.1/lib/python3.7/site-packages/celltypist/models.py in _requests_get(url, timeout)
39 except requests.exceptions.RequestException as e:
40 raise Exception(
---> 41 f"🛑 Cannot fetch '{url}', the error is: {e}")
42 return r
43

Exception: 🛑 Cannot fetch 'https://celltypist.cog.sanger.ac.uk/models/models.json', the error is: HTTPSConnectionPool(host='celltypist.cog.sanger.ac.uk', port=443): Read timed out. (read timeout=30)

Cell Encyclopedia Link Broken

I was looking for the encyclopedia table and the link (https://github.com/Teichlab/celltypist_wiki/tree/main/atlases/Immune/v2/encyclopedia) from the CellTypist website was broken. Is that no longer posted?

Thank you!

Gene names vs gene IDs in precomputed models

Hello celltypers,

While using a trained CellTypist model on my data, I got the error below. It took me a little while to realise it came from mismatched feature names: my adata.var_names are Ensembl IDs, while the model uses gene names.

predictions = celltypist.annotate(adata, model = 'Immune_All_Low.pkl', majority_voting = True)

🔬 Input data has 634000 cells and 5000 genes
🔗 Matching reference genes in the model
🧬 0 features used for prediction
⚖️ Scaling input data
🖋️ Predicting labels
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-248-0b4cb11719f9> in <module>
----> 1 predictions = celltypist.annotate(adata, model = 'Immune_All_Low.pkl', majority_voting = True)

~/my-conda-envs/emma_env/lib/python3.7/site-packages/celltypist/annotate.py in annotate(filename, model, transpose_input, gene_file, cell_file, mode, p_thres, majority_voting, over_clustering, min_prop)
     79     clf = classifier.Classifier(filename = filename, model = lr_classifier, transpose = transpose_input, gene_file = gene_file, cell_file = cell_file)
     80     #predict
---> 81     predictions = clf.celltype(mode = mode, p_thres = p_thres)
     82     if not majority_voting:
     83         return predictions

~/my-conda-envs/emma_env/lib/python3.7/site-packages/celltypist/classifier.py in celltype(self, mode, p_thres)
    349 
    350         logger.info("🖋️ Predicting labels")
--> 351         decision_mat, prob_mat, lab = self.model.predict_labels_and_prob(self.indata, mode = mode, p_thres = p_thres)
    352         logger.info("✅ Prediction done!")
    353 

~/my-conda-envs/emma_env/lib/python3.7/site-packages/celltypist/models.py in predict_labels_and_prob(self, indata, mode, p_thres)
    118             A tuple of decision score matrix, raw probability matrix, and predicted cell type labels.
    119         """
--> 120         scores = self.classifier.decision_function(indata)
    121         probs = expit(scores)
    122         if mode == 'best match':

~/my-conda-envs/emma_env/lib/python3.7/site-packages/sklearn/linear_model/_base.py in decision_function(self, X)
    280         check_is_fitted(self)
    281 
--> 282         X = check_array(X, accept_sparse='csr')
    283 
    284         n_features = self.coef_.shape[1]

~/my-conda-envs/emma_env/lib/python3.7/site-packages/sklearn/utils/validation.py in inner_f(*args, **kwargs)
     70                           FutureWarning)
     71         kwargs.update({k: arg for k, arg in zip(sig.parameters, args)})
---> 72         return f(**kwargs)
     73     return inner_f
     74 

~/my-conda-envs/emma_env/lib/python3.7/site-packages/sklearn/utils/validation.py in check_array(array, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, estimator)
    659                              " a minimum of %d is required%s."
    660                              % (n_features, array.shape, ensure_min_features,
--> 661                                 context))
    662 
    663     if copy and np.may_share_memory(array, array_orig):

ValueError: Found array with 0 feature(s) (shape=(634000, 0)) while a minimum of 1 is required.

This made me think of two suggestions:

Could the error message for this case be more informative? If the feature overlap is 0, it could print a message such as "Are you using gene names in adata.var_names?" With the current message, I first thought it was triggered by all-zero rows or columns.
Using gene names to match information between datasets can be problematic because of name duplication or mismatches between gene annotation databases. Would it be possible to also store unique gene IDs (e.g. Ensembl IDs) in the model objects and give an option to select the type of feature names to use in celltypist.annotate?
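Until something like the first suggestion exists, the overlap can be checked by hand before calling celltypist.annotate. The following is a minimal sketch, not CellTypist code: `check_feature_overlap` is a hypothetical helper, and the "starts with ENS" test is only a rough heuristic for spotting Ensembl IDs.

```python
def check_feature_overlap(var_names, model_features):
    """Return the input features shared with the model; fail early if none match.

    Hypothetical helper for illustration; it is not part of CellTypist.
    """
    model_set = set(model_features)
    # Keep the input order so the result lines up with the input matrix columns.
    shared = [name for name in var_names if name in model_set]
    if not shared:
        # Rough heuristic (an assumption): Ensembl gene IDs start with "ENS".
        looks_like_ensembl = len(var_names) > 0 and all(
            str(name).startswith("ENS") for name in var_names
        )
        hint = (
            " The input names look like Ensembl IDs; the model may expect gene symbols."
            if looks_like_ensembl
            else ""
        )
        raise ValueError(
            f"No features overlap between the input ({len(var_names)} names) "
            f"and the model ({len(model_set)} features)." + hint
        )
    return shared
```

In practice the two arguments would be adata.var_names and the model's feature names (for example `model.features`, assuming the loaded model exposes them that way).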

model download - requests not working

Dear Celltypist Team,

Great job. One small problem I encountered when using CellTypist in a Jupyter Notebook on our compute cluster (CentOS Linux 7 (Core)) is that the model download does not work. I have tested this on Google Colab and it works perfectly fine there. Specifically, I am referring to this function:

models.download_models()

I noticed that this function, as well as the download_model_index function it calls, uses requests. On our compute cluster the request stalls without raising an error or timing out. I see the same problem when trying to download the model manually with curl via os.system in the Jupyter notebook. I think the issue is that the worker node of the compute cluster doesn't run on the home directory where celltypist is stored.

def download_model_index(only_model: bool = True) -> None:
    """
    Download the `models.json` file from the remote server.
    Parameters
    ----------
    only_model
        If set to `False`, will also download the models in addition to the json file.
        (Default: `True`)
    """
    url = 'https://celltypist.cog.sanger.ac.uk/models/models.json'
    logger.info(f"📜 Retrieving model list from server {url}")
    with open(get_model_path("models.json"), "wb") as f:
        f.write(requests.get(url).content)
    model_count = len(requests.get(url).json()["models"])
    logger.info(f"📚 Total models in list: {model_count}")
    if not only_model:
        download_models()

A workaround is getting the Celltypist directory with:

models.models_path

then downloading the model files manually using the URLs listed at https://celltypist.cog.sanger.ac.uk/models/models.json.

Perhaps you have an idea how to solve this more elegantly, or could include an error message/timeout in this call:

f.write(requests.get(url).content)

Maybe defining the model directory manually in models.download_models() would be an option?
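To illustrate both ideas (a timeout and an explicit target directory), here is a minimal sketch of a manual download. It uses the standard-library urllib instead of requests, and `download_to_dir` is a hypothetical helper name, not a CellTypist function. Note that the timeout applies to each blocking network operation (connect, read), not to the transfer as a whole.

```python
import shutil
import urllib.parse
import urllib.request
from pathlib import Path


def download_to_dir(url, dest_dir, timeout=30):
    """Download `url` into `dest_dir` and return the path of the saved file.

    Hypothetical helper for illustration. A stalled connection raises an
    error after `timeout` seconds instead of hanging indefinitely.
    """
    dest_dir = Path(dest_dir).expanduser()
    dest_dir.mkdir(parents=True, exist_ok=True)
    # Use the last component of the URL path as the local file name.
    filename = urllib.parse.unquote(Path(urllib.parse.urlparse(url).path).name)
    target = dest_dir / filename
    with urllib.request.urlopen(url, timeout=timeout) as response, open(target, "wb") as out:
        shutil.copyfileobj(response, out)
    return target
```

For example, `download_to_dir('https://celltypist.cog.sanger.ac.uk/models/models.json', models.models_path)` would place the index file in the directory reported by models.models_path, which matches the manual workaround described above.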

This is my environment:


# Name                    Version                   Build  Channel
_libgcc_mutex             0.1                 conda_forge    conda-forge
_openmp_mutex             4.5                       1_gnu    conda-forge
_r-mutex                  1.0.1               anacondar_1    conda-forge
alabaster                 0.7.12                     py_0    conda-forge
anndata                   0.8.0            py37h89c1867_0    conda-forge
anyio                     3.5.0            py37h89c1867_0    conda-forge
argon2-cffi               21.3.0             pyhd8ed1ab_0    conda-forge
argon2-cffi-bindings      21.2.0           py37h540881e_2    conda-forge
arpack                    3.7.0                hdefa2d7_2    conda-forge
astor                     0.8.1              pyh9f0ad1d_0    conda-forge
attrs                     21.4.0             pyhd8ed1ab_0    conda-forge
autograd                  1.3                        py_0    conda-forge
autograd-gamma            0.5.0              pyh9f0ad1d_0    conda-forge
babel                     2.9.1              pyh44b312d_0    conda-forge
backcall                  0.2.0              pyh9f0ad1d_0    conda-forge
backports                 1.0                        py_2    conda-forge
backports.functools_lru_cache 1.6.4              pyhd8ed1ab_0    conda-forge
backports.zoneinfo        0.2.1            py37h5e8e339_4    conda-forge
beautifulsoup4            4.10.0                   pypi_0    pypi
binutils_impl_linux-64    2.36.1               h193b22a_2    conda-forge
binutils_linux-64         2.36                hf3e587d_10    conda-forge
biothings_client          0.2.6              pyh5e36f6f_0    bioconda
blas                      1.1                    openblas    conda-forge
bleach                    4.1.0                    pypi_0    pypi
blosc                     1.21.0               h9c3ff4c_0    conda-forge
brewer2mpl                1.4.1                    pypi_0    pypi
brotli                    1.0.9                h7f98852_6    conda-forge
brotli-bin                1.0.9                h7f98852_6    conda-forge
brotlipy                  0.7.0           py37h540881e_1004    conda-forge
bwidget                   1.9.14               ha770c72_1    conda-forge
bzip2                     1.0.8                h7f98852_4    conda-forge
c-ares                    1.18.1               h7f98852_0    conda-forge
ca-certificates           2022.5.18.1          ha878542_0    conda-forge
cached-property           1.5.2                hd8ed1ab_1    conda-forge
cached_property           1.5.2              pyha770c72_1    conda-forge
cairo                     1.16.0            ha12eb4b_1010    conda-forge
cellrank                  1.0.0rc4                   py_0    bioconda
celltypist                0.1.9              pyhdfd78af_0    bioconda
certifi                   2021.10.8                pypi_0    pypi
cffi                      1.15.0           py37h036bc23_0    conda-forge
charset-normalizer        2.0.12             pyhd8ed1ab_0    conda-forge
click                     8.1.3            py37h89c1867_0    conda-forge
cloudpickle               2.1.0              pyhd8ed1ab_0    conda-forge
cmake                     3.22.4                   pypi_0    pypi
colorama                  0.4.4              pyh9f0ad1d_0    conda-forge
cryptography              36.0.0           py37h9ce1e76_0  
cudatoolkit               10.1.243            h036e899_10    conda-forge
curl                      7.82.0               h2283fc2_0    conda-forge
cycler                    0.11.0             pyhd8ed1ab_0    conda-forge
cython                    0.29.28                  pypi_0    pypi
cytoolz                   0.11.2           py37h540881e_2    conda-forge
dask-core                 1.1.4                    py37_1  
debugpy                   1.5.1                    pypi_0    pypi
decorator                 5.1.1              pyhd8ed1ab_0    conda-forge
defusedxml                0.7.1              pyhd8ed1ab_0    conda-forge
docrep                    0.3.2              pyh44b312d_0    conda-forge
docutils                  0.17.1           py37h89c1867_2    conda-forge
entrypoints               0.4                pyhd8ed1ab_0    conda-forge
enum34                    1.1.10           py37hc8dfbb8_2    conda-forge
et_xmlfile                1.0.1                   py_1001    conda-forge
expat                     2.4.8                h27087fc_0    conda-forge
fa2                       0.3.5                    pypi_0    pypi
fcsparser                 0.2.4                    pypi_0    pypi
flit-core                 3.7.1              pyhd8ed1ab_0    conda-forge
font-ttf-dejavu-sans-mono 2.37                 hab24e00_0    conda-forge
font-ttf-inconsolata      3.000                h77eed37_0    conda-forge
font-ttf-source-code-pro  2.038                h77eed37_0    conda-forge
font-ttf-ubuntu           0.83                 hab24e00_0    conda-forge
fontconfig                2.14.0               h8e229c2_0    conda-forge
fonts-conda-ecosystem     1                             0    conda-forge
fonts-conda-forge         1                             0    conda-forge
fonttools                 4.30.0           py37h540881e_0    conda-forge
formulaic                 0.3.4              pyhd8ed1ab_0    conda-forge
freetype                  2.10.4               h0708190_1    conda-forge
fribidi                   1.0.10               h36c2ea0_0    conda-forge
future                    0.18.2           py37h89c1867_5    conda-forge
future_fstrings           1.2.0            py37h89c1867_3    conda-forge
gcc_impl_linux-64         9.4.0               h03d3576_16    conda-forge
gcc_linux-64              9.4.0               h391b98a_10    conda-forge
gettext                   0.19.8.1          h73d1719_1008    conda-forge
gfortran_impl_linux-64    9.4.0               h0003116_16    conda-forge
gfortran_linux-64         9.4.0               hf0ab688_10    conda-forge
giflib                    5.2.1                h36c2ea0_2    conda-forge
glpk                      4.65              h9202a9a_1004    conda-forge
gmp                       6.2.1                h58526e2_0    conda-forge
graphite2                 1.3.13            h58526e2_1001    conda-forge
gsl                       2.7                  he838d99_0    conda-forge
gxx_impl_linux-64         9.4.0               h03d3576_16    conda-forge
gxx_linux-64              9.4.0               h0316aca_10    conda-forge
h5py                      3.6.0           nompi_py37hd308b1e_100    conda-forge
harfbuzz                  3.4.0                hb4a5f5f_0    conda-forge
harmonyts                 0.1.4                    pypi_0    pypi
hdf5                      1.12.1          nompi_h4df4325_104    conda-forge
icu                       69.1                 h9c3ff4c_0    conda-forge
idna                      3.3                pyhd8ed1ab_0    conda-forge
igraph                    0.9.7                hf5496dd_0    conda-forge
imagecodecs-lite          2019.12.3        py37hda87dfa_5    conda-forge
imageio                   2.19.2             pyhcf75d05_0    conda-forge
imagesize                 1.3.0              pyhd8ed1ab_0    conda-forge
importlib-metadata        4.11.3           py37h89c1867_0    conda-forge
importlib-resources       5.4.0                    pypi_0    pypi
importlib_metadata        4.11.3               hd8ed1ab_0    conda-forge
importlib_resources       5.7.1              pyhd8ed1ab_0    conda-forge
iniconfig                 1.1.1              pyh9f0ad1d_0    conda-forge
intel-openmp              2022.0.1          h06a4308_3633  
interface_meta            1.3.0              pyhd8ed1ab_0    conda-forge
ipykernel                 6.9.2                    pypi_0    pypi
ipython                   7.32.0           py37h89c1867_0    conda-forge
ipython-genutils          0.2.0                    pypi_0    pypi
ipython_genutils          0.2.0                      py_1    conda-forge
jbig                      2.1               h7f98852_2003    conda-forge
jedi                      0.18.1           py37h89c1867_1    conda-forge
jinja2                    3.0.3                    pypi_0    pypi
joblib                    1.1.0              pyhd8ed1ab_0    conda-forge
joypy                     0.2.6                    pypi_0    pypi
jpeg                      9e                   h7f98852_0    conda-forge
json5                     0.9.6                    pypi_0    pypi
jsonpickle                2.1.0                    pypi_0    pypi
jsonschema                4.4.0              pyhd8ed1ab_0    conda-forge
jupyter-client            7.1.2                    pypi_0    pypi
jupyter-server            1.15.6                   pypi_0    pypi
jupyter_client            7.3.0              pyhd8ed1ab_0    conda-forge
jupyter_core              4.9.2            py37h89c1867_0    conda-forge
jupyter_server            1.16.0             pyhd8ed1ab_1    conda-forge
jupyterlab                3.3.2                    pypi_0    pypi
jupyterlab-pygments       0.1.2                    pypi_0    pypi
jupyterlab-server         2.11.2                   pypi_0    pypi
jupyterlab_pygments       0.2.2              pyhd8ed1ab_0    conda-forge
jupyterlab_server         2.13.0             pyhd8ed1ab_1    conda-forge
kaleido                   0.2.1                    pypi_0    pypi
kernel-headers_linux-64   2.6.32              he073ed8_15    conda-forge
keyutils                  1.6.1                h166bdaf_0    conda-forge
kiwisolver                1.4.0            py37h7cecad7_0    conda-forge
kneed                     0.7.0                    pypi_0    pypi
krb5                      1.19.3               h08a2579_0    conda-forge
lcms2                     2.12                 hddcbb42_0    conda-forge
ld_impl_linux-64          2.36.1               hea4e1c9_2    conda-forge
leidenalg                 0.8.10                   pypi_0    pypi
lerc                      3.0                  h9c3ff4c_0    conda-forge
libblas                   3.9.0           13_linux64_openblas    conda-forge
libbrotlicommon           1.0.9                h7f98852_6    conda-forge
libbrotlidec              1.0.9                h7f98852_6    conda-forge
libbrotlienc              1.0.9                h7f98852_6    conda-forge
libcblas                  3.9.0           13_linux64_openblas    conda-forge
libcurl                   7.82.0               h2283fc2_0    conda-forge
libdeflate                1.10                 h7f98852_0    conda-forge
libedit                   3.1.20191231         he28a2e2_2    conda-forge
libev                     4.33                 h516909a_1    conda-forge
libffi                    3.4.2                h7f98852_5    conda-forge
libgcc                    7.2.0                h69d50b8_2    conda-forge
libgcc-devel_linux-64     9.4.0               hd854feb_16    conda-forge
libgcc-ng                 11.2.0              h1d223b6_14    conda-forge
libgfortran-ng            11.2.0              h69a702a_14    conda-forge
libgfortran5              11.2.0              h5c6108e_14    conda-forge
libglib                   2.70.2               h174f98d_4    conda-forge
libgomp                   11.2.0              h1d223b6_14    conda-forge
libiconv                  1.16                 h516909a_0    conda-forge
liblapack                 3.9.0           13_linux64_openblas    conda-forge
libllvm9                  9.0.1           default_hc23dcda_7    conda-forge
libnghttp2                1.47.0               he49606f_0    conda-forge
libnsl                    2.0.0                h7f98852_0    conda-forge
libopenblas               0.3.18          pthreads_h8fe5266_0    conda-forge
libpng                    1.6.37               h21135ba_2    conda-forge
libsanitizer              9.4.0               h79bfe98_16    conda-forge
libsodium                 1.0.18               h36c2ea0_1    conda-forge
libssh2                   1.10.0               ha35d2d1_2    conda-forge
libstdcxx-devel_linux-64  9.4.0               hd854feb_16    conda-forge
libstdcxx-ng              11.2.0              he4da1e4_14    conda-forge
libtiff                   4.3.0                h542a066_3    conda-forge
libuuid                   2.32.1            h7f98852_1000    conda-forge
libwebp                   1.2.2                h3452ae3_0    conda-forge
libwebp-base              1.2.2                h7f98852_1    conda-forge
libxcb                    1.13              h7f98852_1004    conda-forge
libxml2                   2.9.12               h885dcf4_1    conda-forge
libzlib                   1.2.11            h36c2ea0_1013    conda-forge
lifelines                 0.27.1             pyhd8ed1ab_0    conda-forge
llvmlite                  0.33.0           py37h5202443_1    conda-forge
loompy                    3.0.6                      py_0    conda-forge
lz4-c                     1.9.3                h9c3ff4c_1    conda-forge
lzo                       2.10              h516909a_1000    conda-forge
make                      4.3                  hd18ef5c_1    conda-forge
markupsafe                2.1.1            py37h540881e_1    conda-forge
matplotlib-base           3.5.1            py37h1058ff1_0    conda-forge
matplotlib-inline         0.1.3              pyhd8ed1ab_0    conda-forge
matplotlib-venn           0.11.6                   pypi_0    pypi
metis                     5.1.0             h58526e2_1006    conda-forge
mistune                   0.8.4           py37h5e8e339_1005    conda-forge
mkl                       2022.0.1           h06a4308_117  
mpfr                      4.1.0                h9202a9a_1    conda-forge
multicoretsne             0.1                      pypi_0    pypi
munkres                   1.1.4              pyh9f0ad1d_0    conda-forge
mygene                    3.2.2              pyh5e36f6f_0    bioconda
natsort                   8.1.0              pyhd8ed1ab_0    conda-forge
nbclassic                 0.3.7              pyhd8ed1ab_0    conda-forge
nbclient                  0.5.13                   pypi_0    pypi
nbconvert                 6.4.4                    pypi_0    pypi
nbconvert-core            6.5.0              pyhd8ed1ab_0    conda-forge
nbconvert-pandoc          6.5.0              pyhd8ed1ab_0    conda-forge
nbformat                  5.2.0                    pypi_0    pypi
ncurses                   6.3                  h9c3ff4c_0    conda-forge
nest-asyncio              1.5.4                    pypi_0    pypi
networkx                  2.6.3                    pypi_0    pypi
ninja                     1.10.2               h4bd325d_1    conda-forge
notebook                  6.4.10                   pypi_0    pypi
notebook-shim             0.1.0              pyhd8ed1ab_0    conda-forge
numba                     0.51.0rc1       np1.11py3.7h04863e7_g833c5907c_0    numba
numexpr                   2.8.1            py37hecfb737_0  
numpy                     1.21.5                   pypi_0    pypi
numpy_groupies            0.9.16             pyhd8ed1ab_0    conda-forge
openblas                  0.3.18          pthreads_h4748800_0    conda-forge
openjpeg                  2.4.0                hb52868f_1    conda-forge
openpyxl                  3.0.9              pyhd8ed1ab_0    conda-forge
openssl                   3.0.3                h166bdaf_0    conda-forge
opt-einsum                3.3.0                    pypi_0    pypi
oslom-runner              1.5                      pypi_0    pypi
packaging                 21.3               pyhd8ed1ab_0    conda-forge
palantir                  1.0.0                    pypi_0    pypi
pandas                    1.3.5            py37he8f5f7f_0    conda-forge
pandoc                    2.18                 ha770c72_0    conda-forge
pandocfilters             1.5.0              pyhd8ed1ab_0    conda-forge
pango                     1.48.10              h4dcc4a0_3    conda-forge
parso                     0.8.3              pyhd8ed1ab_0    conda-forge
pathlib                   1.0.1            py37h89c1867_6    conda-forge
patsy                     0.5.2              pyhd8ed1ab_0    conda-forge
pcre                      8.45                 h9c3ff4c_0    conda-forge
pcre2                     10.37                h032f7d1_0    conda-forge
pexpect                   4.8.0              pyh9f0ad1d_2    conda-forge
phenograph                1.5.7                    pypi_0    pypi
pickleshare               0.7.5                   py_1003    conda-forge
pillow                    9.0.1            py37h44f0d7a_2    conda-forge
pip                       22.0.4             pyhd8ed1ab_0    conda-forge
pixman                    0.40.0               h36c2ea0_0    conda-forge
plotly                    5.6.0                      py_0    plotly
pluggy                    1.0.0            py37h89c1867_2    conda-forge
progressbar2              4.0.0              pyhd8ed1ab_0    conda-forge
prometheus-client         0.13.1                   pypi_0    pypi
prometheus_client         0.14.1             pyhd8ed1ab_0    conda-forge
prompt-toolkit            3.0.28                   pypi_0    pypi
psutil                    5.9.0            py37h540881e_1    conda-forge
pthread-stubs             0.4               h36c2ea0_1001    conda-forge
ptyprocess                0.7.0              pyhd3deb0d_0    conda-forge
py                        1.11.0             pyh6c4a22f_0    conda-forge
pycparser                 2.21               pyhd8ed1ab_0    conda-forge
pydiffmap                 0.2.0.1                  pypi_0    pypi
pygam                     0.8.0                      py_0    conda-forge
pygments                  2.11.2                   pypi_0    pypi
pyopenssl                 22.0.0             pyhd8ed1ab_0    conda-forge
pyparsing                 3.0.7              pyhd8ed1ab_0    conda-forge
pyrsistent                0.18.1           py37h540881e_1    conda-forge
pysocks                   1.7.1            py37h89c1867_5    conda-forge
pytables                  3.7.0            py37h5dea08b_0    conda-forge
pytest                    7.1.0            py37h89c1867_0    conda-forge
python                    3.7.12          hf930737_100_cpython    conda-forge
python-dateutil           2.8.2              pyhd8ed1ab_0    conda-forge
python-fastjsonschema     2.15.3             pyhd8ed1ab_0    conda-forge
python-igraph             0.9.9            py37h6c76e3a_0    conda-forge
python-tzdata             2022.1             pyhd8ed1ab_0    conda-forge
python-utils              3.2.2              pyhd8ed1ab_0    conda-forge
python_abi                3.7                     2_cp37m    conda-forge
pytorch                   1.4.0           py3.7_cuda10.1.243_cudnn7.6.3_0    pytorch
pytz                      2021.3             pyhd8ed1ab_0    conda-forge
pytz-deprecation-shim     0.1.0.post0      py37h89c1867_1    conda-forge
pyvis                     0.1.9                    pypi_0    pypi
pywavelets                1.3.0            py37hda87dfa_1    conda-forge
pyzmq                     22.3.0           py37h0c0c2a8_2    conda-forge
r-base                    4.1.2                hde4fec0_0    conda-forge
readline                  8.1                  h46c0cb4_0    conda-forge
requests                  2.27.1             pyhd8ed1ab_0    conda-forge
rpy2                      3.5.1           py37r41hda87dfa_0    conda-forge
scanpy                    1.8.2                    pypi_0    pypi
schpf                     0.5.0                    pypi_0    pypi
scikit-image              0.19.2           py37he8f5f7f_0    conda-forge
scikit-learn              1.0.2            py37hf9e9bfc_0    conda-forge
scikit-misc               0.1.4                    pypi_0    pypi
scipy                     1.7.3            py37hf2a6cf1_0    conda-forge
scvelo                    0.2.4              pyhdfd78af_0    bioconda
seaborn                   0.11.2               hd8ed1ab_0    conda-forge
seaborn-base              0.11.2             pyhd8ed1ab_0    conda-forge
sed                       4.8                  he412f7d_0    conda-forge
send2trash                1.8.0              pyhd8ed1ab_0    conda-forge
session-info              1.0.0              pyhd8ed1ab_0    conda-forge
setuptools                59.8.0           py37h89c1867_0    conda-forge
simplegeneric             0.8.1                      py_1    conda-forge
simplejson                3.17.6                   pypi_0    pypi
sinfo                     0.3.4                    pypi_0    pypi
six                       1.16.0             pyh6c4a22f_0    conda-forge
sklearn                   0.0                      pypi_0    pypi
slalom                    1.0.0.dev11              pypi_0    pypi
sniffio                   1.2.0            py37h89c1867_3    conda-forge
snowballstemmer           2.2.0              pyhd8ed1ab_0    conda-forge
soupsieve                 2.3.1              pyhd8ed1ab_0    conda-forge
sphinx                    4.5.0              pyh6c4a22f_0    conda-forge
sphinxcontrib-applehelp   1.0.2                      py_0    conda-forge
sphinxcontrib-devhelp     1.0.2                      py_0    conda-forge
sphinxcontrib-htmlhelp    2.0.0              pyhd8ed1ab_0    conda-forge
sphinxcontrib-jsmath      1.0.1                      py_0    conda-forge
sphinxcontrib-qthelp      1.0.3                      py_0    conda-forge
sphinxcontrib-serializinghtml 1.1.5              pyhd8ed1ab_2    conda-forge
sqlite                    3.37.1               h4ff8645_0    conda-forge
statsmodels               0.13.2           py37hb1e94ed_0    conda-forge
stdlib-list               0.8.0                    pypi_0    pypi
suitesparse               5.10.1               h9e50725_1    conda-forge
sysroot_linux-64          2.12                he073ed8_15    conda-forge
tbb                       2021.5.0             h4bd325d_0    conda-forge
tenacity                  8.0.1              pyhd8ed1ab_0    conda-forge
terminado                 0.13.3           py37h89c1867_1    conda-forge
testpath                  0.6.0                    pypi_0    pypi
texttable                 1.6.4              pyhd8ed1ab_0    conda-forge
threadpoolctl             3.1.0              pyh8a188c0_0    conda-forge
tifffile                  2019.7.26.2              py37_0    conda-forge
tinycss2                  1.1.1              pyhd8ed1ab_0    conda-forge
tk                        8.6.12               h27826a3_0    conda-forge
tktable                   2.10                 hb7b940f_3    conda-forge
tokenize-rt               4.2.1              pyhd8ed1ab_0    conda-forge
tomli                     2.0.1              pyhd8ed1ab_0    conda-forge
toolz                     0.11.2             pyhd8ed1ab_0    conda-forge
tornado                   6.1              py37h540881e_3    conda-forge
tqdm                      4.63.0                   pypi_0    pypi
traitlets                 5.1.1              pyhd8ed1ab_0    conda-forge
typing-extensions         4.1.1                hd8ed1ab_0    conda-forge
typing_extensions         4.1.1              pyha770c72_0    conda-forge
tzdata                    2022a                h191b570_0    conda-forge
tzlocal                   4.2              py37h89c1867_0    conda-forge
umap-learn                0.4.6                    pypi_0    pypi
unicodedata2              14.0.0           py37h5e8e339_0    conda-forge
urllib3                   1.26.9             pyhd8ed1ab_0    conda-forge
wcwidth                   0.2.5              pyh9f0ad1d_2    conda-forge
webencodings              0.5.1                    pypi_0    pypi
websocket-client          1.3.1                    pypi_0    pypi
wheel                     0.37.1             pyhd8ed1ab_0    conda-forge
wrapt                     1.14.1           py37h540881e_0    conda-forge
xlrd                      1.2.0                    pypi_0    pypi
xorg-kbproto              1.0.7             h7f98852_1002    conda-forge
xorg-libice               1.0.10               h7f98852_0    conda-forge
xorg-libsm                1.2.3             hd9c2040_1000    conda-forge
xorg-libx11               1.7.2                h7f98852_0    conda-forge
xorg-libxau               1.0.9                h7f98852_0    conda-forge
xorg-libxdmcp             1.1.3                h7f98852_0    conda-forge
xorg-libxext              1.3.4                h7f98852_1    conda-forge
xorg-libxrender           0.9.10            h7f98852_1003    conda-forge
xorg-libxt                1.2.1                h7f98852_2    conda-forge
xorg-renderproto          0.11.1            h7f98852_1002    conda-forge
xorg-xextproto            7.3.0             h7f98852_1002    conda-forge
xorg-xproto               7.0.31            h7f98852_1007    conda-forge
xz                        5.2.5                h516909a_1    conda-forge
zeromq                    4.3.4                h9c3ff4c_1    conda-forge
zipp                      3.7.0              pyhd8ed1ab_1    conda-forge
zlib                      1.2.11            h36c2ea0_1013    conda-forge
zstd                      1.5.2                ha95c52a_0    conda-forge

'Naive' cell type in 'Adult_Mouse_Gut' model

What does the 'Naive' cell type in the 'Adult_Mouse_Gut' model mean? It seems to correspond to 'Naive B-cells'.

How to update the model

Dear Celltypist Team,

If I have a new annotated dataset, how can I train on it and update the existing model?

Looking forward to your reply!

Train Custom Model on scATAC-seq data?

Hello,

Is it at all possible to train a custom celltypist model on scATAC-seq data and then use it for predicting cell types in scATAC-seq datasets? I've been trying to accomplish this, but so far it seems incompatible, despite my efforts.

Prediction error

When I ran the prediction procedure below, there was an error message.

predictions = celltypist.annotate(adata, model = 'Immune_All_Low.pkl', majority_voting = True)
ValueError: 🛑 Invalid expression matrix, expect log1p normalized expression to 10000 counts per cell

I then tried to normalize the adata, which showed:

adata.raw = sc.pp.log1p(adata, copy=True)
WARNING: adata.X seems to be already log-transformed.

Is there any way to fix it?
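
The error asks for log1p-normalized expression scaled to 10,000 counts per cell, and the warning suggests `adata.X` had already been log-transformed once. Below is a minimal library-free sketch of what that format means, assuming raw counts are still available. `normalize_log1p` and `looks_log1p_10k` are hypothetical helpers for illustration, not CellTypist functions.

```python
import math

def normalize_log1p(counts, target_sum=10_000):
    """Scale one cell's raw counts to target_sum, then apply log1p."""
    total = sum(counts)
    if total == 0:
        return [0.0 for _ in counts]
    return [math.log1p(c * target_sum / total) for c in counts]

def looks_log1p_10k(cell, target_sum=10_000, tol=1.0):
    """Undo log1p with expm1 and check that the cell sums back to target_sum."""
    return abs(sum(math.expm1(v) for v in cell) - target_sum) <= tol

raw_cell = [3, 0, 12, 5]                  # hypothetical raw counts for one cell
once = normalize_log1p(raw_cell)          # the format the error message asks for
twice = [math.log1p(v) for v in once]     # log-transformed a second time by mistake

print(looks_log1p_10k(once), looks_log1p_10k(twice))
```

With Scanpy, the same two steps are typically `sc.pp.normalize_total(adata, target_sum = 1e4)` followed by `sc.pp.log1p(adata)`, applied to raw counts rather than to data that is already log-transformed.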

RAM usage by converting from sparse to dense

Is there a reason why the input data is converted to an np.array rather than accepting sparse matrices when running .train? Skimming the remainder of the code, I cannot find anything that would not also work with sparse matrices. I am asking because this conversion to an array seems to be the reason I frequently run out of RAM when working with larger datasets.

Thanks
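
To put rough numbers on the concern above, here is a back-of-the-envelope sketch comparing a dense float64 array with a CSR sparse matrix. The dataset size and the 5% non-zero fraction are assumptions chosen for illustration, and the byte counts ignore any object overhead.

```python
def dense_bytes(n_cells, n_genes, itemsize=8):
    """Memory for a dense matrix: one value per cell-gene pair."""
    return n_cells * n_genes * itemsize

def csr_bytes(n_cells, nnz, itemsize=8, index_size=4):
    """Memory for CSR: value + column index per non-zero, plus row pointers."""
    return nnz * (itemsize + index_size) + (n_cells + 1) * index_size

n_cells, n_genes = 500_000, 20_000     # hypothetical dataset size
nnz = n_cells * n_genes // 20          # assuming 5% of entries are non-zero

dense_gb = dense_bytes(n_cells, n_genes) / 1e9
sparse_gb = csr_bytes(n_cells, nnz) / 1e9
print(f"dense: {dense_gb:.1f} GB, CSR: {sparse_gb:.1f} GB")
```

Under these assumptions the dense copy is more than ten times larger than the sparse one, which is consistent with running out of RAM at the conversion step.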

Deterministic results?

Hello,

This is admittedly a picky question. We're experimenting with running celltypist from the command line to score cells. In our test data we have a rare category with ~4 cells. These are consistently scored as Tcm/Naive cytotoxic T cells in 'predicted labels'. However, the result of majority_voting is not deterministic: some of the time these 4 cells are lumped into another category. This by itself is not a huge problem (in reality they are probably ambiguous cells, and there are only 4 in total). My question is about the run-to-run inconsistency. The input we give celltypist does not have a neighborhood graph, etc., and celltypist creates it for us. Are there any places where we can or should be setting a random seed or something similar? Thanks
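
The behaviour described above can be illustrated with a simplified majority-voting sketch. This is not CellTypist's implementation; it only assumes that each cell receives the most common predicted label of the cluster it falls in, so a small group can flip whenever the clustering differs between runs. The label "Other T cells" and both cluster assignments are hypothetical.

```python
from collections import Counter

def majority_vote(predicted, clusters):
    """Give every cell the most common predicted label of its cluster."""
    by_cluster = {}
    for label, cluster in zip(predicted, clusters):
        by_cluster.setdefault(cluster, Counter())[label] += 1
    winner = {c: counts.most_common(1)[0][0] for c, counts in by_cluster.items()}
    return [winner[c] for c in clusters]

# Four rare cells followed by six cells of a larger population.
predicted = ["Tcm/Naive cytotoxic T cells"] * 4 + ["Other T cells"] * 6

run_a = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]   # rare cells form their own cluster
run_b = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]   # rare cells merged into the larger cluster

print(majority_vote(predicted, run_a)[:4])
print(majority_vote(predicted, run_b)[:4])
```

The per-cell predictions are identical in both runs; only the cluster boundaries change the voted label, which is why any randomness in graph construction or clustering would show up in majority_voting but not in 'predicted labels'.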

Getting error when running predictions

predictions = celltypist.annotate(adata, model = 'Immune_All_Low.pkl', majority_voting = True)

ValueError: Input contains NaN, infinity or a value too large for dtype('float64').

I checked adata.X and found no NaN values; I also shortened the values. I'm not sure what else to try.
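
The error message covers infinities as well as NaN, so a NaN-only check can miss the offending entries. A minimal sketch of a check for both, written against nested lists for illustration (`count_non_finite` is a hypothetical helper):

```python
import math

def count_non_finite(rows):
    """Count NaN and +/-inf entries in a matrix given as nested lists."""
    return sum(1 for row in rows for v in row if not math.isfinite(v))

matrix = [
    [0.0, 1.2, float("inf")],    # inf can appear, for example, after dividing by a zero total
    [float("nan"), 0.5, 3.1],
]

print(count_non_finite(matrix))
```

On a NumPy array the equivalent test is `np.isfinite(X).all()`; for a sparse matrix it would be applied to the stored values.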

Getting error when running predictions following the tutorial

Thank you for providing such a useful tool. When I run the tutorial "Best practice in large-scale cross-dataset label transfer using CellTypist", why can't I read the locally saved model_from_Elmentaite_2021.pkl?

predictions = celltypist.annotate(adata_James, model = 'model_from_Elmentaite_2021.pkl', majority_voting = True, mode = 'best match')
🔎 No available models. Downloading...
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/jdfssz1/USER/hw/2.software/mn3/envs/celltypist/lib/python3.9/site-packages/celltypist/annotate.py", line 77, in annotate
    lr_classifier = model if isinstance(model, Model) else Model.load(model)
  File "/jdfssz1/USER/hw/2.software/mn3/envs/celltypist/lib/python3.9/site-packages/celltypist/models.py", line 90, in load
    if model in get_all_models():
  File "/jdfssz1/USER/hw/2.software/mn3/envs/celltypist/lib/python3.9/site-packages/celltypist/models.py", line 359, in get_all_models
    download_if_required()
  File "/jdfssz1/USER/hw/2.software/mn3/envs/celltypist/lib/python3.9/site-packages/celltypist/models.py", line 372, in download_if_required
    download_models()
  File "/jdfssz1/USER/hw/2.software/mn3/envs/celltypist/lib/python3.9/site-packages/celltypist/models.py", line 432, in download_models
    models_json = get_models_index(force_update)
  File "/jdfssz1/USER/hw/2.software/mn3/envs/celltypist/lib/python3.9/site-packages/celltypist/models.py", line 394, in get_models_index
    return json.load(f)
  File "/jdfssz1/USER/hw/2.software/mn3/envs/celltypist/lib/python3.9/json/__init__.py", line 293, in load
    return loads(fp.read(),
  File "/jdfssz1/USER/hw/2.software/mn3/envs/celltypist/lib/python3.9/json/__init__.py", line 346, in loads
    return _default_decoder.decode(s)
  File "/jdfssz1/USER/hw/2.software/mn3/envs/celltypist/lib/python3.9/json/decoder.py", line 337, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/jdfssz1/USER/hw/2.software/mn3/envs/celltypist/lib/python3.9/json/decoder.py", line 355, in raw_decode
    raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)
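
The last line of the traceback is the message Python's `json` module produces when the text does not begin with a JSON value, most commonly because the file is empty or contains something else such as an HTML error page. That would be consistent with an empty or corrupted local models index, although this is an assumption about the cause. The sketch below reproduces the message and shows a simple validity check; `is_valid_json_text` is a hypothetical helper.

```python
import json

def is_valid_json_text(text):
    """Return True if text parses as JSON, False otherwise."""
    try:
        json.loads(text)
    except json.JSONDecodeError:
        return False
    return True

try:
    json.loads("")               # an empty file reads as an empty string
except json.JSONDecodeError as err:
    message = str(err)

print(message)
```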

undetermined cells

I'm trying to run celltypist on several different datasets (PBMCs, splenocytes, etc.), and in almost all cases more than 30% of cells are called as undetermined. I tried some high-quality public datasets as well but ended up in the same situation. I'm using the low-resolution immune cell model. Can you help me understand what I can do to troubleshoot?

Binary cell type decisions lead to np.AxisError in predict_labels_and_prob

When trying to make a binary cell type decision using CellTypist, I encountered an AxisError in line 131 of models.py, caused by scores.argmax(axis=1). For binary decisions, sklearn's LogisticRegression.decision_function yields an ndarray of shape (n_samples,) rather than (n_samples, n_classes), which makes axis=1, and argmax in general, problematic.

My suggested solution is first checking the number of dimensions in the output of decision_function and if it is equal to 1, as with binary decisions, emulating the output format expected by CellTypist, conserving the difference in confidence scores between the two available classes for every cell. This is implemented in #18.
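
The idea described above can be sketched without any libraries. This is one possible emulation and may differ from what #18 actually does: it assumes that a positive binary score favours the second class, so the first class is given the negated score. `as_two_class_scores` and `argmax_rows` are hypothetical names.

```python
def as_two_class_scores(scores):
    """Expand 1-D binary decision scores into one column per class.

    Multi-class input (already one row per sample) is returned unchanged.
    """
    if scores and not isinstance(scores[0], (list, tuple)):
        return [[-s, s] for s in scores]
    return [list(row) for row in scores]

def argmax_rows(matrix):
    """Index of the highest-scoring class for each sample."""
    return [max(range(len(row)), key=row.__getitem__) for row in matrix]

binary_scores = [-1.5, 0.3]                       # shape (n_samples,)
two_columns = as_two_class_scores(binary_scores)  # shape (n_samples, 2)
print(argmax_rows(two_columns))
```

With a NumPy array, the same check would be on `scores.ndim == 1` before stacking the negated and original scores as two columns.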

Only 9 features were used for prediction

Hello. Thank you for developing this wonderful tool.

I am trying to apply CellTypist to my dataset composed of mouse heart cells.
Besides the immune cells, I have preliminarily annotated the endothelial cells and fibroblast cells in this dataset too.

I am not familiar with Scanpy so the AnnData file was converted from the Seurat Object following the instructions by SeuratDisk.
https://mojaveazure.github.io/seurat-disk/articles/convert-anndata.html

The raw count matrix was scaled to 10,000 and log-normalized with Seurat's NormalizeData function.
cre <- NormalizeData(cre, normalization.method = "LogNormalize", scale.factor = 10000)

However, when I tried predicting the annotation of this converted AnnData file, the result told me that almost all the predicted_labels are "Double-positive thymocytes" in my data, which was not possible at all.

Then I checked for the reason and I found that only 9 features were used for prediction.

Here is what CellTypist output.

# Predict the identity of each input cell.
cre_predictions = celltypist.annotate( cre, model = 'Immune_All_Low.pkl', majority_voting = True)

🔬 Input data has 9267 cells and 20011 genes
🔗 Matching reference genes in the model
🧬 9 features used for prediction
⚖️ Scaling input data
🖋️ Predicting labels
✅ Prediction done!
👀 Can not detect a neighborhood graph, will construct one before the over-clustering

No error was reported.

Was it caused by the conversion from Seurat to AnnData? Would it be better if I turned to use Scanpy from the beginning?
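
A very low feature count usually means that few gene names in the input match those in the model. One quick diagnostic is to compare the two name lists directly. The sketch below uses hypothetical gene lists and assumes the model uses human-style upper-case symbols while the input uses mouse-style capitalisation; neither assumption is confirmed by the output above.

```python
def shared_features(input_genes, model_genes):
    """Genes present in both the input data and the model, by exact name."""
    return sorted(set(input_genes) & set(model_genes))

model_genes = ["CD3E", "CD8A", "MS4A1", "LYZ"]     # hypothetical human-style symbols
mouse_genes = ["Cd3e", "Cd8a", "Ms4a1", "Lyz2"]    # hypothetical mouse-style symbols

exact = shared_features(mouse_genes, model_genes)
upper = shared_features([g.upper() for g in mouse_genes], model_genes)
print(len(exact), len(upper))
```

Upper-casing here is only a diagnostic for a naming mismatch. It is not a substitute for a proper ortholog mapping, since some genes (such as the hypothetical `Lyz2` above) have no one-to-one counterpart by name.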

Can celltypist handle doublets and low quality cells?

Should cells with low counts or high mitochondrial content, as well as doublets, be discarded before the data are passed to CellTypist, or can they be removed afterwards?

Thanks!

Add a ``celltypist.download_model(force_update=True)`` API

celltypist has an API, download_models(force_update: bool=False), to download the latest version of all models. We don't want this for all models, only a handful. Could you provide an API for a single model, something like celltypist.download_model(force_update=True)? If there is already a way to do this with the current API, please let me know.
Thanks!

Subsetting celltypist model by cell type

Hi, thanks a lot for developing this tool!

May I ask if there's a way to subset the model by cell types? The scenario is that the model contains some cell types that I know are absent from the tissue I'm working on, so I want to prevent it from assigning these cell types and focus only on a subset of cell types.

(I'm working on bone marrow samples; a model like 'Immune_All_Low.pkl' is very close but also includes some cell types which are usually not seen in bone marrow.)

Would it be feasible and reasonable to do this? Thanks a lot for your help!

Best regards,
Marcus
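
One way to picture the request is to restrict the final choice to an allowed set of labels after scoring, rather than changing the model itself. The sketch below is an illustration only, not a CellTypist feature: it assumes per-cell decision scores are available as a mapping from cell type to score, and the labels and numbers are hypothetical.

```python
def best_allowed_label(scores, allowed):
    """Pick the highest-scoring label among an allowed subset.

    scores: dict mapping cell type -> decision score for one cell.
    allowed: set of cell types expected in the tissue.
    """
    candidates = {label: s for label, s in scores.items() if label in allowed}
    if not candidates:
        raise ValueError("none of the allowed cell types are in the model")
    return max(candidates, key=candidates.get)

cell_scores = {"Type A": 1.2, "Type B": 0.9, "Type C": -0.4}   # hypothetical scores
unrestricted = max(cell_scores, key=cell_scores.get)
restricted = best_allowed_label(cell_scores, allowed={"Type B", "Type C"})
print(unrestricted, restricted)
```

Whether this is reasonable depends on how the excluded types score: a cell whose best match is an excluded type may simply be poorly represented by the remaining ones.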

Adding a version argument to download_models

It would be great to have a version argument to download_models. This will allow writing and sharing code for reproducible analysis. (Currently, a newer model version will likely change the results of my analysis. While force_update is sufficient locally, it is not sufficient if I wish to share my code).
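
Until something like a version argument exists, one possible workaround is to record a checksum of the model file used in an analysis and verify it before re-running. This is a sketch under that assumption; `file_sha256` and `check_model_file` are hypothetical helpers, and the temporary file merely stands in for a downloaded model.

```python
import hashlib
import tempfile

def file_sha256(path, chunk_size=1 << 20):
    """SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def check_model_file(path, expected_sha256):
    """Raise if the file on disk differs from the pinned checksum."""
    actual = file_sha256(path)
    if actual != expected_sha256:
        raise RuntimeError(f"{path} differs from the pinned model ({actual})")
    return True

# Stand-in for a model file; a real analysis would point at the downloaded .pkl.
with tempfile.NamedTemporaryFile(suffix=".pkl", delete=False) as tmp:
    tmp.write(b"placeholder model bytes")

pinned = file_sha256(tmp.name)       # store this value alongside the shared code
print(check_model_file(tmp.name, pinned))
```

A checksum detects that a model has changed but cannot fetch the old one, so sharing the model file itself alongside the code would still be needed for full reproducibility.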

Are the confidence scores of cell types predicted by different reference models comparable?

Dear Celltypist Team,
Are the confidence scores of cell types predicted by different reference models comparable? Could I choose the most suitable model for annotating a new dataset by comparing the average confidence scores of each cluster as annotated by different models? In addition, if a cell is annotated as "unassigned" or "heterogeneous", what does its confidence score mean?

Use "python -m" to run celltypist?

Hello,

We're interested in running celltypist from R/reticulate. While I understand that celltypist installs a standalone executable, can something like the following work on the command line? For various reasons this is preferred, since the code would not need to know the location of the celltypist executable itself:

python.exe -m celltypist.command_line --update_models

The above does not raise an error, but it doesn't seem to do anything either (no console output is produced). Thanks.

Adult_Mouse_Gut cell types

Hello! Might the team have the publications where the Adult_Mouse_Gut model derived its cell types? It says 'TBD' in the celltypist site.

Can I use log2(TPM) normalized data as input instead log1p w/ 10K scaling factor?

Hi,

Thank you for developing and maintaining CellTypist.
It is a great tool and makes my life much easier.

I'm analyzing log2(TPM) normalized data from a publicly available data set and I was wondering if I could provide this to CellTypist or is there any assumption violated by this?

I know that the software requires log1p (w/ 10K scaling factor) normalized data as input, but for this particular data set I don't have access to the count data to normalize it myself, and I would still like to run CellTypist.

I know that CellTypist gives an error and exits when such data is not provided, but if I comment that line of code, will the CellTypist assumptions still be valid?

Thanks in advance for any help or advice.
Best regards,
António
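
If the values are log2(TPM + 1), the transformation can be inverted and the result rescaled to 10,000 per cell before applying log1p, which at least satisfies the expected input format. The sketch below assumes a pseudocount of 1 was used, which the question does not state, and `log2_tpm_to_log1p_10k` is a hypothetical helper. It does not address whether TPM-based values behave like count-based ones for the model.

```python
import math

def log2_tpm_to_log1p_10k(cell, pseudocount=1.0, target_sum=10_000):
    """Convert one cell from log2(TPM + pseudocount) to log1p of values summing to target_sum."""
    linear = [max(2.0 ** v - pseudocount, 0.0) for v in cell]   # back to TPM scale
    total = sum(linear)
    if total == 0:
        return [0.0 for _ in cell]
    return [math.log1p(x * target_sum / total) for x in linear]

tpm = [250_000.0, 0.0, 750_000.0]              # hypothetical TPM values for one cell
log2_cell = [math.log2(t + 1.0) for t in tpm]  # the format described in the question
converted = log2_tpm_to_log1p_10k(log2_cell)

print(sum(math.expm1(v) for v in converted))
```

Since TPM values sum to one million per cell, the rescaling amounts to dividing by 100 when all genes are present; computing the total explicitly also covers data sets where genes were filtered beforehand.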
