
bbknn's Introduction

Batch balanced KNN

BBKNN is a fast and intuitive batch effect removal tool that can be directly used in the scanpy workflow. It serves as an alternative to scanpy.pp.neighbors(), with both functions creating a neighbour graph for subsequent use in clustering, pseudotime and UMAP visualisation. The standard approach begins by identifying the k nearest neighbours for each individual cell across the entire data structure, with the candidates being subsequently transformed to exponentially related connectivities before serving as the basis for further analyses. If technical artifacts (be they because of differing data acquisition technologies, protocol alterations or even particularly severe operator effects) are present in the data, they will make it challenging to link corresponding cell types across different batches.
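To make the "exponentially related connectivities" step more concrete, here is a minimal pure-Python sketch of a UMAP-style weighting, in which a neighbour at distance d receives the weight exp(-(d - rho) / sigma). The function name, the fixed sigma and the toy distances are assumptions made for illustration; the real computation tunes the bandwidth per cell and symmetrises the resulting graph, neither of which is shown here.

```python
import math

def distances_to_connectivities(distances, sigma=1.0):
    """Turn one cell's neighbour distances into weights in (0, 1].

    rho is the distance to the closest neighbour, so that neighbour always
    gets weight 1; farther neighbours decay exponentially.
    sigma is a fixed, illustrative bandwidth (an assumption of this sketch).
    """
    rho = min(distances)
    return [math.exp(-max(0.0, d - rho) / sigma) for d in distances]

weights = distances_to_connectivities([0.5, 1.0, 2.5])
print([round(w, 3) for w in weights])  # [1.0, 0.607, 0.135]
```

The point of the exponential form is that small differences among close neighbours are kept, while distant candidates contribute almost nothing to the graph.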

KNN

As such, BBKNN actively combats this effect by taking each cell and identifying a (smaller) k nearest neighbours in each batch separately, rather than the dataset as a whole. These nearest neighbours for each batch are then merged into a final neighbour list for the cell. This helps create connections between analogous cells in different batches without altering the counts or PCA space.
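The difference between the two neighbour searches can be sketched in a few lines of pure Python. This is a toy illustration, not BBKNN's implementation: the function names and the one-dimensional "PCA" coordinates are made up, and the query cell is excluded from its own neighbour list for simplicity.

```python
import math

def knn(points, i, candidates, k):
    """Indices of the k cells in `candidates` closest to cell i, excluding i itself."""
    others = [j for j in candidates if j != i]
    return sorted(others, key=lambda j: math.dist(points[i], points[j]))[:k]

def global_knn(points, i, k):
    """Standard approach: search the whole dataset at once."""
    return knn(points, i, range(len(points)), k)

def batch_balanced_knn(points, batches, i, k_per_batch):
    """Find k_per_batch neighbours in every batch separately, then merge the lists."""
    merged = []
    for batch in sorted(set(batches)):
        members = [j for j, b in enumerate(batches) if b == batch]
        merged.extend(knn(points, i, members, k_per_batch))
    return merged

# Toy data: batch "b" holds the same cells as batch "a", shifted by +10.
points = [(0.0,), (1.0,), (2.0,), (10.0,), (11.0,), (12.0,)]
batches = ["a", "a", "a", "b", "b", "b"]

print(global_knn(points, 0, 2))                   # [1, 2]: both from batch "a"
print(batch_balanced_knn(points, batches, 0, 1))  # [1, 3]: one from each batch
```

With the global search, the batch shift keeps cell 0 from ever linking to its counterpart in batch "b"; the per-batch search guarantees that link, and the coordinates themselves are left untouched.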

BBKNN

Citation

If you use BBKNN in your work, please cite the paper:

@article{polanski2019bbknn,
  title={BBKNN: Fast Batch Alignment of Single Cell Transcriptomes},
  author={Pola{\'n}ski, Krzysztof and Young, Matthew D and Miao, Zhichao and Meyer, Kerstin B and Teichmann, Sarah A and Park, Jong-Eun},
  doi={10.1093/bioinformatics/btz625},
  journal={Bioinformatics},
  year={2019}
}

Installation

BBKNN depends on Cython, numpy, scipy, annoy, pynndescent, umap-learn and scikit-learn. The package is available on pip and conda, and can be easily installed as follows:

pip3 install bbknn

or

conda install -c bioconda bbknn

BBKNN can also make use of faiss. Consult the official installation instructions; the easiest way to get it is via conda.

Usage and Documentation

BBKNN has the option to immediately slot into the spot occupied by scanpy.pp.neighbors() in the Seurat-inspired scanpy workflow. It computes a batch aligned variant of the neighbourhood graph, with its uses within scanpy including clustering, diffusion map pseudotime inference and UMAP visualisation. The basic syntax to run BBKNN on scanpy's AnnData object (with PCA computed via scanpy.tl.pca()) is as follows:

import bbknn

bbknn.bbknn(adata)

You can provide which adata.obs column to use for batch discrimination via the batch_key parameter. This defaults to 'batch', which is created by scanpy when you merge multiple AnnData objects (e.g. if you were to import multiple samples separately and then concatenate them).

Integration can be improved by using ridge regression on both a technical effect and a biological grouping prior to BBKNN, following a workflow from Park et al., 2020. In the event of not having a biological grouping at hand, a coarse clustering obtained from a BBKNN-corrected graph can be used in its place. This creates the following basic workflow syntax:

import bbknn
import scanpy

bbknn.bbknn(adata)
scanpy.tl.leiden(adata)
bbknn.ridge_regression(adata, batch_key=['batch'], confounder_key=['leiden'])
scanpy.tl.pca(adata)
bbknn.bbknn(adata)
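As a rough reading of what the ridge regression step does, the model can be written as below. The symbols are introduced here for illustration and are not taken from the package: X is the cells-by-genes expression matrix, D stacks one-hot columns for the technical effect (batch_key) and the biological grouping (confounder_key), and alpha is the ridge penalty. The intercept term is left out for brevity.

```latex
% Fit both effects jointly with a ridge penalty
\hat{B} = \arg\min_{B} \; \lVert X - D B \rVert_F^2 + \alpha \lVert B \rVert_F^2,
\qquad
D = \begin{bmatrix} D_{\mathrm{tech}} & D_{\mathrm{bio}} \end{bmatrix},
\quad
\hat{B} = \begin{bmatrix} \hat{B}_{\mathrm{tech}} \\ \hat{B}_{\mathrm{bio}} \end{bmatrix}

% Remove only the part explained by the technical effect
X_{\mathrm{explained}} = D_{\mathrm{tech}} \hat{B}_{\mathrm{tech}},
\qquad
X_{\mathrm{remain}} = X - X_{\mathrm{explained}}
```

Including the biological grouping in the fit is what keeps genuine cell type differences from being attributed to the batch term; only the technical component is subtracted before the PCA is recomputed.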

Alternatively, you can just provide a PCA matrix with cells as rows and a matching vector of batch assignments for each of the cells and call BBKNN as follows (with connectivities being the primary graph output of interest):

import bbknn.matrix

distances, connectivities, parameters = bbknn.matrix.bbknn(pca_matrix, batch_list)

An HTML render of the BBKNN function docstring, detailing all the parameters, can be accessed at ReadTheDocs. BBKNN use, along with using ridge regression to improve the integration, is shown in a demonstration notebook.

BBKNN in R

At this point, there is no plan to create a BBKNN R package. However, it can be run quite easily via reticulate. Using the base functions is the same as in python. If you're in possession of a PCA matrix and a batch assignment vector and want to get UMAP coordinates out of it, you can use the following code snippet to do so. The weird PCA computation part and replacing it with your original values is unfortunately necessary due to how AnnData innards operate from a reticulate level. Provide your python path in use_python().

library(reticulate)
use_python("/usr/bin/python3")

anndata = import("anndata",convert=FALSE)
bbknn = import("bbknn", convert=FALSE)
sc = import("scanpy",convert=FALSE)

adata = anndata$AnnData(X=pca, obs=batch)
sc$tl$pca(adata)
adata$obsm$X_pca = pca
bbknn$bbknn(adata,batch_key=0)
sc$tl$umap(adata)
umap = py_to_r(adata$obsm[["X_umap"]])

If you wish to change any integer arguments (such as neighbors_within_batch), you'll have to as.integer() the value so python understands it as an integer.
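The reason is visible from the Python side: an R numeric such as 10 arrives as the Python float 10.0, and integer-only operations reject it. The sketch below reproduces that failure with range() and shows a hypothetical coercion helper (as_integer_argument is a made-up name, not part of BBKNN).

```python
def as_integer_argument(value, name):
    """Coerce a whole-valued float (as sent by R for a literal like 10) to int.

    Hypothetical helper for illustration; BBKNN itself expects the caller
    to pass a proper integer.
    """
    if isinstance(value, bool):
        raise TypeError(f"{name} must be an integer, not a bool")
    if isinstance(value, int):
        return value
    if isinstance(value, float) and value.is_integer():
        return int(value)
    raise TypeError(f"{name} must be a whole number, got {value!r}")

# reticulate converts the R numeric 10 to the Python float 10.0 ...
from_r = 10.0
try:
    range(from_r)  # ... which integer-only APIs such as range() reject
except TypeError as err:
    print(err)  # 'float' object cannot be interpreted as an integer

print(len(range(as_integer_argument(from_r, "neighbors_within_batch"))))  # 10
```

On the R side, as.integer(10) or the literal 10L is converted to a Python int, which avoids the problem at its source.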

When testing locally, faiss refused to work when BBKNN was reticulated. As such, provide use_faiss=FALSE to the BBKNN call if you run into this problem.

Example Notebooks

demo.ipynb is the main demonstration, applying BBKNN to some pancreas data with a batch effect. The notebook also uses ridge regression to improve the integration.

The BBKNN paper makes use of the following analyses:

  • simulation.ipynb applies BBKNN to simulated data with a known ground truth, and demonstrates the utility of graph trimming by introducing an unrelated cell population. This simulated data is then used to benchmark BBKNN against mnnCorrect, CCA, Scanorama and Harmony in benchmark.ipynb. benchmark2.ipynb finishes off by benchmarking a BBKNN variant that is reluctant to work within R/reticulate and visualising the findings. benchmark3-new-R-methods.ipynb adds some newer R approaches to the benchmark.
  • mouse.ipynb runs a collection of murine atlases through BBKNN. mouse-harmony.ipynb applies Harmony to the same data.

The BBKNN preprint performed some additional analyses that got left out of the final manuscript. Archival notebooks are stored in a separate repository.

bbknn's People

Contributors

canergen, chuanxu1, fbnrst, iandriver, ivirshup, jenzopr, ktpolanski, ryan-williams


bbknn's Issues

`ImportError` for `scikit_learn` >= 1.0.0

We are currently trying to import DistanceMetric from sklearn.neighbors in bbknn/matrix.py (link to code), but this is only possible for scikit_learn<1.0.0. In later versions of sklearn, DistanceMetric has moved to sklearn.metrics.

I will raise a PR for fixing this via handling the ImportError -- but would like to hear opinions from the maintainers for sure!
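One way to handle the ImportError is to try the newer module location first and fall back to the older one. The sketch below wraps that idea in a small generic helper; import_first is a hypothetical name, and the demonstration uses standard-library modules so that it runs without scikit-learn installed.

```python
import importlib

def import_first(attribute, module_names):
    """Return `attribute` from the first module in `module_names` that provides it."""
    for module_name in module_names:
        try:
            module = importlib.import_module(module_name)
        except ImportError:
            continue  # module missing in this environment, try the next one
        if hasattr(module, attribute):
            return getattr(module, attribute)
    raise ImportError(f"{attribute} not found in any of {list(module_names)}")

# Intended use for this issue (needs scikit-learn, so left commented out):
# DistanceMetric = import_first("DistanceMetric", ["sklearn.metrics", "sklearn.neighbors"])

# Runnable demonstration with standard-library modules only:
sqrt = import_first("sqrt", ["module_that_does_not_exist", "math"])
print(sqrt(9.0))  # 3.0
```

A plain try/except around the two import statements achieves the same thing with less machinery; the helper only makes the order of preference explicit.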

Is it possible to identify marker genes?

Hello:

Suppose we ran a clustering method on bbknn output and identified a few clusters that hopefully represent distinct cell types. Do I get it right that it's impossible to identify marker genes for those clusters? BBKNN doesn't alter the original data or PCs obtained from the original data, so we never obtain the gene expression adjusted for batch effect. If I am right, is there a method to adjust the original data for batch effect using bbknn output?

Thanks in advance,
Nik

Please also require packaging in setup.py

Hi Krzysztof,

the bbknn bioconda build fails for current version 1.3.10 and 1.3.11 because the packaging package you introduced in d6c60f5 cannot be found. I added it to the recipe dependencies, but I recommend adding it to install_requires as well.

Thanks,
Jens

"sklearn" as a dependency

Just a heads up - you are using "sklearn" as a dependency but this isn't the correct package on PyPi. It should be "scikit-learn".

Something I noticed because I made the same mistake when writing a package a few years ago!

scanorama bbknn

Dear,
Scanorama handles the mutual nearest neighbors-based matching, batch correction, and panorama assembly. I have not found an assembly function in pancreas-4-Scanorama.ipynb. What is the function in bbknn (or scanpy) that corresponds to Scanorama's assembly function?

TypeError: info() got an unexpected keyword argument 'r'

Good afternoon,

I'm running on the pbmc notebook with python 3.7.7, Scanpy 1.5.1, bbknn 1.3.3. After running the bbknn, the following error message appeared. I tested the "sc.external.pp.bbknn," and the same message was shown. Updating the version to python 3.8.0, scanpy 1.6.0 did not work either.

TypeError Traceback (most recent call last)
in
----> 1 bdata = bbknn.bbknn(adata,batch_key='Sample',copy=True)

~\Anaconda3\lib\site-packages\bbknn\__init__.py in bbknn(adata, batch_key, approx, metric, copy, **kwargs)
259 If True, return a copy instead of writing to the supplied adata.
260 '''
--> 261 logg.info('computing batch balanced neighbors', r=True)
262 adata = adata.copy() if copy else adata
263 #basic sanity checks to begin

TypeError: info() got an unexpected keyword argument 'r'

Do you have any idea what causes this error? Thank you.

BBKNN for ATAC data

Hi! Thank you for the awesome package :)
I was wondering if it is possible to use other dimensionality reduction methods than PCA for bbknn.bbknn, such as LSI which is commonly used for ATAC data? use_rep argument would suggest it is possible, but I wanted to check what your thoughts were before running it!

Collaboration

Hi, Authors of BBKNN,

My name is Feng Zhang. I recently built a pipeline (BEER, published in Cell Discovery, https://github.com/jumphone/BEER) to remove batch-effect related PCA subspaces.

I find that it works well when combined with BBKNN (especially when integrating scRNA-seq and scATAC-seq data).

Maybe we can collaborate with each other to further improve the performance.

Best,
Feng Zhang

Kernel dies when using bbknn

I'm trying to run bbknn on my own single-cell data and on the demo data, but the kernel dies every time, whichever dataset I use.
Is this because of my device, or something else?

sc.settings.verbosity = 3
sc.logging.print_header()
sc.settings.set_figure_params(dpi=80)
scanpy==1.10.1 anndata==0.10.5.post1 umap==0.5.5 numpy==1.26.4 scipy==1.11.1 pandas==2.2.2 scikit-learn==1.3.0 statsmodels==0.14.0 igraph==0.11.5 louvain==0.8.2 pynndescent==0.5.11

adata = sc.read('pancreas.h5ad', backup_url='https://www.dropbox.com/s/qj1jlm9w10wmt0u/pancreas.h5ad?dl=1')
bbknn.bbknn(adata, batch_key='batch')

umap error after package updates

I updated scanpy (1.5.1), umap-learn (0.4.6) and BBKNN. But I found the following error when running the umap function:

Error in py_call_impl(callable, dots$args, dots$keywords) : 
ValueError: Unknown metric angular. Valid metrics are ['euclidean', 'l2', 'l1', 'manhattan', 'cityblock', 'braycurtis', 'canberra',
 'chebyshev', 'correlation', 'cosine', 'dice', 'hamming', 'jaccard', 'kulsinski', 'mahalanobis', 'matching', 'minkowski', 'rogerstanimoto',
 'russellrao', 'seuclidean', 'sokalmichener', 'sokalsneath', 'sqeuclidean', 'yule', 'wminkowski', 'nan_euclidean', 'haversine'], or
 'precomputed', or a callable

Here is my code:

pca <- sce@[email protected]
anndata = import("anndata", convert=FALSE)
sc = import("scanpy",convert=FALSE)
np = import("numpy",convert=FALSE)
bbknn = import("bbknn", convert=FALSE)
adata = anndata$AnnData(X=pca, obs=sce$patient)
sc$tl$pca(adata)

adata$obsm$X_pca = pca
bbknn$bbknn(adata, batch_key=0)
sc$tl$umap(adata)

I could run it with no problem before I update these packages. Could you help me figure this out?

`n_trees` is now `annoy_n_trees`

Thanks for the great method!

The change of n_trees to annoy_n_trees seems to have broken compatibility with scanpy's bbknn module (sc.external.pp.bbknn). Are there any plans to make changes to that module as well?

Details for pbmc dataset used #30

[Reopening Issue] Thanks for your quick reply! I am having trouble finding the 5' dataset on the 10X Genomics website. Is it no longer available? Can you share the link?

"The input data was downloaded from the 10X Genomics website. The exact 5′dataset was ‘PBMCs of a healthy donor 5′gene expression’, under Cell Ranger 2.1.0, under V(D)J + 5′Gene Expression. The exact 3′dataset was ‘8k PBMCs from a Healthy Donor’, under Cell Ranger 2.1.0, under Chromium Demonstration (v2 Chemistry)."

Bandwidth parameter no longer supported by scanpy or UMAP

This was originally reported by a scanpy user here: scverse/scanpy#632.

Scanpy has just removed a frozen version of the umap library we'd been using (PR: scverse/scanpy#576). The current version of umap doesn't support a bandwidth parameter, so now compute_connectivities_umap doesn't either. It looks like this is causing an issue with these lines, where bandwidth is explicitly passed:

bbknn/bbknn/__init__.py

Lines 272 to 274 in 93f25dc

dist, cnts = compute_connectivities_umap(knn_indices, knn_distances, knn_indices.shape[0],
knn_indices.shape[1], bandwidth=bandwidth,
local_connectivity=local_connectivity)

Sorry about the break with so little notice!

Standalone function

Really exciting method! I don't usually use scanpy for my pipelines. Do you have a BBKNN function that works without it? Maybe something that takes in only PCs and batch labels?

neighbors_within_batch argument usage in R?

Hello! I'm trying to recapitulate some results (using bbknn in R) from a paper that uses bbknn for scRNAseq batch correction where they say they set neighbors_within_batch to 10 but I'm running into an issue. I'm able to run the code fine in R without setting a neighbors_within_batch argument (see a possible edit below to the end py_to_r code). When I set the neighbors_within_batch bbknn$bbknn(adata, batch_key=0, neighbors_within_batch=10) I get an error with this traceback:

Error in py_call_impl(callable, dots$args, dots$keywords) :
TypeError: 'float' object cannot be interpreted as an integer
5.
stop(structure(list(message = "TypeError: 'float' object cannot be interpreted as an integer",
call = py_call_impl(callable, dots$args, dots$keywords),
cppstack = structure(list(file = "", line = -1L, stack = c("1 reticulate.so 0x0000000114a023ed _ZN4Rcpp9exceptionC2EPKcb + 221",
"2 reticulate.so 0x0000000114a0a485 _ZN4Rcpp4stopERKNSt3__112basic_stringIcNS0_11char_traitsIcEENS0_9allocatorIcEEEE + 53", ...
4.
get_graph at init.py#148
3.
bbknn_pca_matrix at init.py#355
2.
bbknn at init.py#294
1.
bbknn$bbknn(adata, batch_key = 0, neighbors_within_batch = 10)

My full code is as follows (pca matrix and batch assignment vector not shown; I don't think either of these is causing the error since I can run this code minus the neighbors_within_batch argument but happy to post how I generated them/what they contain if useful):

adata = anndata$AnnData(X=pca, obs=batch)
sc$tl$pca(adata)
adata$obsm$X_pca = pca
bbknn$bbknn(adata, batch_key=0, neighbors_within_batch=10)
sc$tl$umap(adata)
umap = py_to_r(adata$obsm[["X_umap"]])

I'm at a loss for what's causing this error... do you have any idea what I'm doing wrong? I'm assuming I can use the neighbors_within_batch parameter in R?

Also, I think umap = py_to_r(adata$obsm$X_umap) should be umap = py_to_r(adata$obsm[["X_umap"]])? I was only able to get the latter to work...

Thanks,
Rachel

P.S. I'm sorry if I've missed including anything or if this looks funky when it gets posted; this is the first time I've asked about an issue on github. Happy to provide more details if needed!

save_knn

Hi @ktpolanski , I find your save_knn function extremely useful (in fact, it was the primary reason I was using bbknn in the first place) ! I am currently reverting to the 1.3.0 version of bbknn so I can use save_knn , but it would be nice to have the save_knn option in future versions of bbknn, if possible.

KeyError: 'connectivities' while running UMAP function

Hello,

I'm following the tutorial which integrates with Scanpy. I was able to successfully run this line of code:

sc.external.pp.bbknn(adata, batch_key='sample_id',metric='euclidean')

In the next step I get this error:

sc.tl.umap(adata)
computing UMAP

KeyError Traceback (most recent call last)
in
----> 1 sc.tl.umap(adata)

/miniconda3/envs/scanpy/lib/python3.7/site-packages/scanpy/tools/_umap.py in umap(adata, min_dist, spread, n_components, maxiter, alpha, gamma, negative_sample_rate, init_pos, random_state, a, b, copy, method)
142 X_umap = simplicial_set_embedding(
143 X,
--> 144 adata.uns['neighbors']['connectivities'].tocoo(),
145 n_components,
146 alpha,

KeyError: 'connectivities'

Am I doing something incorrectly? How can I fix this problem? Thanks a lot for your help.

np.matrix and ridge_regression

Hello,
I'm trying to use bbknn.ridge_regression but get the following output when I run
bbknn.ridge_regression(adata, batch_key=['batch'], confounder_key=['cell_type'])
Is this an issue with compatibility with current numpy?
Many thanks


TypeError Traceback (most recent call last)
Cell In[19], line 9
7 import bbknn
8 # bbknn.bbknn(adata_v3)
----> 9 bbknn.ridge_regression(adata_v3, batch_key=['batch'], confounder_key=['cell_type'])
10 # scanpy.tl.pca(adata_v3)
11 # bbknn.bbknn(adata_v3)

File ~/miniforge3/envs/mypython3/lib/python3.9/site-packages/bbknn/__init__.py:196, in ridge_regression(adata, batch_key, confounder_key, chunksize, copy, **kwargs)
193 X_exp = X_exp.todense()
194 #fit the ridge regression model, compute the expression explained by the technical
195 #effect, and the remaining residual
--> 196 LR.fit(dummy,X_exp)
197 X_explained.append(dm.dot(LR.coef_[:,batch_index].T))
198 X_remain.append(X_exp - X_explained[-1])

File ~/miniforge3/envs/mypython3/lib/python3.9/site-packages/sklearn/base.py:1151, in _fit_context.<locals>.decorator.<locals>.wrapper(estimator, *args, **kwargs)
1144 estimator._validate_params()
1146 with config_context(
1147 skip_parameter_validation=(
1148 prefer_skip_nested_validation or global_skip_validation
1149 )
1150 ):
-> 1151 return fit_method(estimator, *args, **kwargs)

File ~/miniforge3/envs/mypython3/lib/python3.9/site-packages/sklearn/linear_model/_ridge.py:1134, in Ridge.fit(self, X, y, sample_weight)
1114 """Fit Ridge regression model.
1115
1116 Parameters
(...)
1131 Fitted estimator.
1132 """
1133 _accept_sparse = _get_valid_accept_sparse(sparse.issparse(X), self.solver)
-> 1134 X, y = self._validate_data(
1135 X,
1136 y,
1137 accept_sparse=_accept_sparse,
1138 dtype=[np.float64, np.float32],
1139 multi_output=True,
1140 y_numeric=True,
1141 )
1142 return super().fit(X, y, sample_weight=sample_weight)

File ~/miniforge3/envs/mypython3/lib/python3.9/site-packages/sklearn/base.py:621, in BaseEstimator._validate_data(self, X, y, reset, validate_separately, cast_to_ndarray, **check_params)
619 y = check_array(y, input_name="y", **check_y_params)
620 else:
--> 621 X, y = check_X_y(X, y, **check_params)
622 out = X, y
624 if not no_val_X and check_params.get("ensure_2d", True):

File ~/miniforge3/envs/mypython3/lib/python3.9/site-packages/sklearn/utils/validation.py:1163, in check_X_y(X, y, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, multi_output, ensure_min_samples, ensure_min_features, y_numeric, estimator)
1143 raise ValueError(
1144 f"{estimator_name} requires y to be passed, but the target y is None"
1145 )
1147 X = check_array(
1148 X,
1149 accept_sparse=accept_sparse,
(...)
1160 input_name="X",
1161 )
-> 1163 y = _check_y(y, multi_output=multi_output, y_numeric=y_numeric, estimator=estimator)
1165 check_consistent_length(X, y)
1167 return X, y

File ~/miniforge3/envs/mypython3/lib/python3.9/site-packages/sklearn/utils/validation.py:1173, in _check_y(y, multi_output, y_numeric, estimator)
1171 """Isolated part of check_X_y dedicated to y validation"""
1172 if multi_output:
-> 1173 y = check_array(
1174 y,
1175 accept_sparse="csr",
1176 force_all_finite=True,
1177 ensure_2d=False,
1178 dtype=None,
1179 input_name="y",
1180 estimator=estimator,
1181 )
1182 else:
1183 estimator_name = _check_estimator_name(estimator)

File ~/miniforge3/envs/mypython3/lib/python3.9/site-packages/sklearn/utils/validation.py:753, in check_array(array, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, estimator, input_name)
662 """Input validation on an array, list, sparse matrix or similar.
663
664 By default, the input is checked to be a non-empty 2D array containing
(...)
750 The converted and validated array.
751 """
752 if isinstance(array, np.matrix):
--> 753 raise TypeError(
754 "np.matrix is not supported. Please convert to a numpy array with "
755 "np.asarray. For more information see: "
756 "https://numpy.org/doc/stable/reference/generated/numpy.matrix.html"
757 )
759 xp, is_array_api_compliant = get_namespace(array)
761 # store reference to original array to check if copy is needed when
762 # function returns

TypeError: np.matrix is not supported. Please convert to a numpy array with np.asarray. For more information see: https://numpy.org/doc/stable/reference/generated/numpy.matrix.html
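The traceback points at X_exp.todense(), which for a scipy sparse matrix returns an np.matrix rather than a plain array. The sketch below only illustrates that mechanism and the conversion suggested by the error message; it uses a hand-built np.matrix as a stand-in, and it makes no claim about how, or whether, a given bbknn release addresses this.

```python
import numpy as np

# Stand-in for the result of calling .todense() on a sparse matrix.
X_exp = np.matrix([[1.0, 0.0], [0.0, 2.0]])
print(type(X_exp).__name__)  # matrix

# np.asarray gives a plain ndarray holding the same values,
# which is what scikit-learn's input validation accepts.
X_arr = np.asarray(X_exp)
print(type(X_arr).__name__)  # ndarray
```

For scipy sparse matrices, calling .toarray() instead of .todense() yields an ndarray directly.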

Details for pbmc dataset used

Hi,

I am using the dataset mentioned in the notebook linked to this repo.
The link to download the data is ftp://ngs.sanger.ac.uk/production/teichmann/BBKNN/PBMC.merged.h5ad

Can you provide details about where the dataset is obtained (sequencing technologies and such)? Is there a publication from your group which explains this dataset?

Logging error

It looks like logging has been broken by an update to scanpy.

It's pretty straightforward to fix the timing error: logg.info now returns a datetime, which you pass to the next logg.info call for which you want the elapsed time.
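The pattern described here can be sketched with a stand-in logger. The info function below is written for illustration with the standard library only and is not scanpy's implementation; it merely mimics the behaviour of returning a timestamp that a later call can accept through a time keyword.

```python
from datetime import datetime

def info(message, time=None):
    """Stand-in logger: print the message and return the current timestamp.

    If `time` (a datetime returned by an earlier call) is given, the elapsed
    time since then is appended to the message.
    """
    now = datetime.now()
    if time is not None:
        message = f"{message} ({now - time} elapsed)"
    print(message)
    return now

start = info("computing batch balanced neighbors")
# ... the neighbour computation would run here ...
end = info("    finished", time=start)
```

Keeping the start time in a local variable, rather than in logger state, is what lets several timed steps overlap without interfering with each other.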

I'm not sure how to replace the end argument. @flying-sheep, any suggestion?

edge weights

Dear,
I am unfamiliar with graph theory. Why do you convert the neighbour distance collections to exponentially related connectivities? How are weights assigned to the edges? Does BBKNN construct the connectivity graph with the Jaccard index (which is used in Seurat and Scanpy for Louvain clustering)?

X_umap_3D

Great package!

How did you go about initially computing the X_umap_3D coordinates without overwriting X_umap?

error during ridge_regression

Hi there,

First of all, thank you for this great package! It has been very helpful for my graduation project so far.
I am currently following the bbknn tutorial notebook, but I get an error at the ridge_regression part of the tutorial.
"AttributeError: module 'numpy' has no attribute 'int'."

"np.int was a deprecated alias for the builtin int. To avoid this error in existing code, use int by itself. Doing this will not modify any behavior and is safe. When replacing np.int, you may wish to use e.g. np.int64 or np.int32 to specify the precision. If you wish to review your current use, check the release note link for additional information.
The aliases was originally deprecated in NumPy 1.20;""

I get this error when following your tutorial notebook with my own dataset. I am quite new to scRNAseq analysis and I am simply wondering if it's my package management that's wrong or that it might be something else?

Anyway, grateful in advance for your reply!

Greetings,

Julia

IndexError: index 2 is out of bounds for axis 0 with size 2

I have a dataset that includes one batch containing only 2 cells.
The command was sc.external.pp.bbknn(bdata,batch_key = "batch",approx=False)
It seems that bbknn doesn't like that small batch and gave me this error:

computing batch balanced neighbors
WARNING: unrecognised metric for type of neighbor calculation, switching to euclidean

---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
<ipython-input-274-086c83c814fc> in <module>
----> 1 sc.external.pp.bbknn(bdata,batch_key = "batch",approx=False)

/usr/local/lib/python3.6/dist-packages/scanpy/external/pp/_bbknn.py in bbknn(adata, batch_key, approx, metric, copy, n_pcs, trim, n_trees, use_faiss, set_op_mix_ratio, local_connectivity, **kwargs)
    118         set_op_mix_ratio=set_op_mix_ratio,
    119         local_connectivity=local_connectivity,
--> 120         **kwargs,
    121     )

/usr/local/lib/python3.6/dist-packages/bbknn/__init__.py in bbknn(adata, batch_key, use_rep, approx, metric, copy, **kwargs)
    289         #call BBKNN proper
    290 	bbknn_out = bbknn_pca_matrix(pca=pca, batch_list=batch_list,
--> 291 								 approx=approx, metric=metric, **kwargs)
    292         #store the parameters in .uns['neighbors']['params'], add use_rep and batch_key
    293         adata.uns['neighbors'] = {}

/usr/local/lib/python3.6/dist-packages/bbknn/__init__.py in bbknn_pca_matrix(pca, batch_list, neighbors_within_batch, n_pcs, trim, approx, n_trees, use_faiss, metric, set_op_mix_ratio, local_connectivity)
    346 	knn_distances, knn_indices = get_graph(pca=pca,batch_list=batch_list,n_pcs=n_pcs,n_trees=n_trees,
    347                                                                                    approx=approx,metric=metric,use_faiss=use_faiss,
--> 348 										   neighbors_within_batch=neighbors_within_batch)
    349         #sort the neighbours so that they're actually in order from closest to furthest
    350         newidx = np.argsort(knn_distances,axis=1)

/usr/local/lib/python3.6/dist-packages/bbknn/__init__.py in get_graph(pca, batch_list, neighbors_within_batch, n_pcs, approx, metric, use_faiss, n_trees)
    171                         for i in range(ckdout[1].shape[0]):
    172                                 for j in range(ckdout[1].shape[1]):
--> 173                                         ckdout[1][i,j] = ind_to[ckdout[1][i,j]]
    174                         #save the results within the appropriate rows and columns of the structures
    175                         col_range = np.arange(to_ind*neighbors_within_batch, (to_ind+1)*neighbors_within_batch)

IndexError: index 2 is out of bounds for axis 0 with size 2

Is that really due to the fact that one batch contains only two cells?
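The error is at least consistent with that explanation: asking for more neighbours per batch than a batch has cells leaves the lookup with an index that does not exist. A quick pre-check along these lines can flag such batches before the call. This is a sketch: batches_too_small is a hypothetical helper, and the default of 3 is assumed to match the neighbors_within_batch default.

```python
from collections import Counter

def batches_too_small(batch_list, neighbors_within_batch=3):
    """Return {batch: n_cells} for batches with fewer cells than requested neighbours."""
    counts = Counter(batch_list)
    return {b: n for b, n in counts.items() if n < neighbors_within_batch}

# Toy batch assignments: batch "C" has only two cells.
batch_list = ["A"] * 50 + ["B"] * 40 + ["C"] * 2
print(batches_too_small(batch_list))  # {'C': 2}
```

With an AnnData object, the same check could be fed from list(adata.obs[batch_key]); flagged batches can then be dropped or merged, or neighbors_within_batch lowered.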

Performance varied on different operating systems

Hello:
When I installed bbknn 1.4.0 on ubuntu 16 and ubuntu 18, I got different performances on these two different operating systems. Concretely, the cell similarity scores varied between batches when running bbknn on ubuntu 16 and 18. Can you help me solve it?

Thank you very much!

bbknn spark error when running with data integrated from scRNA data

Hi @Teichlab,
A strange result occurred with data integrated through scanpy using bbknn.
The samples appear integrated in the UMAP plot, but the t-SNE plot shows them as not integrated.

[Image: UMAP showing the samples integrated]

[Image: t-SNE showing the samples not integrated]

The images show the difference between UMAP and t-SNE.
Based on this, how can I get the same integration effect in t-SNE when using bbknn? If this cannot be done, is there another approach I could use instead?

Best
hanhuihong

scanpy update incompatibility

Calling bbknn either from scanpy.external or directly yields:

Error in py_call_impl(callable, dots$args, dots$keywords) : AttributeError: 'tuple' object has no attribute 'tocsr'

This is reported on scanpy as scverse/scanpy#1249, where the suggested solution may be downgrading umap, but that's not ideal...

umap-learn 0.4.3 bbknn 1.3.4 scanpy 1.5.1 anndata 0.7.3

Any ideas?

ModuleNotFoundError

A ModuleNotFoundError was raised for me when I tried the bbknn.bbknn(adata) function, at ./bbknn/__init__.py line 262, where you wrote

start = logg.info(...)

I know that this logg came from line 10

from scanpy import logging as logg

Here you are using a "try/except" syntax, such that you are allowing the nonexistence of scanpy. But in the core function (line 262) it seems still mandatory.

Simply installing scanpy has already fixed it, I am just writing this down in case anyone else new to all these pipelines encounters a similar issue.

bbknn publication

Dear,

Do you plan to publish BBKNN in a high impact factor journal later? Some people argue that bioRxiv is not a serious journal and we are a little worried. After all, it will take us a lot of time to adopt BBKNN.

ValueError: No hyperplanes of adequate size were found! When not using annoy

Hi there,
I'm having an issue when I try to run BBKNN without annoy. I first hit this error, then freshly installed everything in a new conda environment, and I'm still getting the error passed up from pynndescent when I run the code:
bbknn.bbknn(adata,batch_key='batch_name',use_annoy=False,metric='manhattan',neighbors_within_batch=3)

Thanks so much! This package works amazingly for correcting batch-driven compositional problems!!

Full error message below:

    122         batch_list = adata.obs[batch_key].values
    123         #call BBKNN proper
--> 124 	bbknn_out = bbknn_matrix(pca=pca, batch_list=batch_list, approx=approx,
    125 							 use_annoy=use_annoy, metric=params['metric'], **kwargs)
    126         #store the parameters in .uns['neighbors']['params'], add use_rep and batch_key

~/utils/miniconda3/envs/scanpy/lib/python3.9/site-packages/bbknn/matrix.py in bbknn(pca, batch_list, neighbors_within_batch, n_pcs, trim, approx, annoy_n_trees, pynndescent_n_neighbors, pynndescent_random_state, use_annoy, use_faiss, metric, set_op_mix_ratio, local_connectivity)
    312         params = check_knn_metric(params, counts)
    313         #obtain the batch balanced KNN graph
--> 314         knn_distances, knn_indices = get_graph(pca=pca,batch_list=batch_list,params=params)
    315         #sort the neighbours so that they're actually in order from closest to furthest
    316         newidx = np.argsort(knn_distances,axis=1)

~/utils/miniconda3/envs/scanpy/lib/python3.9/site-packages/bbknn/matrix.py in get_graph(pca, batch_list, params)
    173                 ind_to = np.arange(len(batch_list))[mask_to]
    174                 #create the faiss/cKDTree/KDTree/annoy, depending on approx/metric
--> 175                 ckd = create_tree(data=pca[mask_to,:params['n_pcs']], params=params)
    176                 for from_ind in range(len(batches)):
    177                         #this is the batch that will have its neighbours identified

~/utils/miniconda3/envs/scanpy/lib/python3.9/site-packages/bbknn/matrix.py in create_tree(data, params)
     95                                                                         n_neighbors=params['pynndescent_n_neighbors'],
     96 									random_state=params['pynndescent_random_state'])
---> 97                 ckd.prepare()
     98         elif params['computation'] == 'faiss':
     99                 ckd = faiss.IndexFlatL2(data.shape[1])

~/utils/miniconda3/envs/scanpy/lib/python3.9/site-packages/pynndescent/pynndescent_.py in prepare(self)
   1524     def prepare(self):
   1525         if not hasattr(self, "_search_graph"):
-> 1526             self._init_search_graph()
   1527         if not hasattr(self, "_search_function"):
   1528             if self._is_sparse:

~/utils/miniconda3/envs/scanpy/lib/python3.9/site-packages/pynndescent/pynndescent_.py in _init_search_graph(self)
    962                 best_trees = [self._rp_forest[idx] for idx in best_tree_indices]
    963                 del self._rp_forest
--> 964                 self._search_forest = [
    965                     convert_tree_format(tree, self._raw_data.shape[0])
    966                     for tree in best_trees

~/utils/miniconda3/envs/scanpy/lib/python3.9/site-packages/pynndescent/pynndescent_.py in <listcomp>(.0)
    963                 del self._rp_forest
    964                 self._search_forest = [
--> 965                     convert_tree_format(tree, self._raw_data.shape[0])
    966                     for tree in best_trees
    967                 ]

~/utils/miniconda3/envs/scanpy/lib/python3.9/site-packages/pynndescent/rp_trees.py in convert_tree_format(tree, data_size)
   1161     if tree.hyperplanes[0].ndim == 1:
   1162         # dense hyperplanes
-> 1163         hyperplane_dim = dense_hyperplane_dim(tree.hyperplanes)
   1164         hyperplanes = np.zeros((n_nodes, hyperplane_dim), dtype=np.float32)
   1165     else:

~/utils/miniconda3/envs/scanpy/lib/python3.9/site-packages/pynndescent/rp_trees.py in dense_hyperplane_dim()
   1143             return hyperplanes[i].shape[0]
   1144 
-> 1145     raise ValueError("No hyperplanes of adequate size were found!")
   1146 
   1147 

ValueError: No hyperplanes of adequate size were found!```

Update on bioconda

Hi,

Can you update the recipe on bioconda? The bioconda version is still at 1.5.1

Thanks

Export BBKNN results to R software

Dear BBKNN team,

I am using BBKNN in R as described on this GitHub page, and I am wondering how I could export the BBKNN results to perform, for instance, UMAP, clustering or trajectory analysis with customized R scripts.

This is not an issue with the software itself, but I could not find any "batch-corrected data" to export from the anndata object. I am aware that the algorithm does not change the data matrix, so which data should I use as input to, for instance, run UMAP with the umap R package?

Thank you in advance!

problem : name 'logg' is not defined

I tried to reproduce the bbknn demo (https://nbviewer.jupyter.org/github/Teichlab/bbknn/blob/master/examples/demo.ipynb) but ran into two problems:
1- NameError: name 'logg' is not defined
2- AttributeError: module 'bbknn' has no attribute 'ridge_regression'


NameError Traceback (most recent call last)
in
3 except ImportError:
4 pass
----> 5 bbknn.bbknn(adata)
6 sc.tl.umap(adata)
7 sc.pl.umap(adata, color=['batch','celltype'])

~\miniconda3\envs\sc_trial\lib\site-packages\bbknn\__init__.py in bbknn(adata, batch_key, approx, metric, copy, **kwargs)
259 '''
260 start = logg.info('computing batch balanced neighbors')
--> 261 adata = adata.copy() if copy else adata
262 #basic sanity checks to begin
263 #is our batch key actually present in the object?

NameError: name 'logg' is not defined


2- Regression:
AttributeError: module 'bbknn' has no attribute 'ridge_regression'

I am working on Windows. I tried bbknn versions 1.3.6, 1.4 and 1.5 and hit the same problems in all of them.
How can I solve these problems?

Does bbknn really work?

I tried bbknn following the scanpy tutorial, but it runs very quickly and the result is very similar to the original data (in fact, identical).
I just use:
bbknn.bbknn(adata,batch_key='batch')

Is anything wrong here? Thanks.

incompatible with annoy==1.17.0 ?

I was trying to run bbknn (and also scrublet) and both independently kept crashing my sessions.

I think it has to do with annoy==1.17.0; simply downgrading to annoy==1.16.3 made everything run again.

Understanding `neighbors_within_batch` parameter?

Thanks for the nice tool! I'm trying to understand the neighbors_within_batch parameter conceptually. I read the docstring, but I'm still not clear on exactly what it means. Is it 'k' when approx=True? Setting this value higher leads to a more spread-out UMAP (i.e. less correction); might that be preferable for some datasets? Is there a reason for the default value of 3?

bbknn/bbknn/__init__.py

Lines 216 to 218 in 7e736d4

neighbors_within_batch : ``int``, optional (default: 3)
How many top neighbours to report for each batch; total number of neighbours
will be this number times the number of batches.
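
To make the docstring concrete, here is a minimal brute-force sketch of the idea described in the README: for each cell, take a small number of nearest neighbours from every batch separately and merge them. It is not BBKNN's actual implementation; the function name `batch_balanced_neighbours` and the toy data are hypothetical, and plain Euclidean distance stands in for whatever metric and index BBKNN is configured to use.

```python
import math
from collections import defaultdict


def batch_balanced_neighbours(points, batches, neighbors_within_batch=3):
    """For each point, collect its nearest neighbours from every batch separately.

    Illustrative brute-force sketch only; not BBKNN's actual implementation.
    """
    # group point indices by batch label
    members = defaultdict(list)
    for idx, label in enumerate(batches):
        members[label].append(idx)

    neighbours = []
    for query in points:
        merged = []
        for label in sorted(members):
            # rank this batch's points by Euclidean distance to the query
            ranked = sorted(members[label], key=lambda j: math.dist(query, points[j]))
            # keep only the top few from this batch
            merged.extend(ranked[:neighbors_within_batch])
        # order the merged list from closest to furthest
        merged.sort(key=lambda j: math.dist(query, points[j]))
        neighbours.append(merged)
    return neighbours


# toy data: two batches of three 2D points each
points = [(0.0, 0.0), (0.1, 0.0), (0.2, 0.0), (5.0, 5.0), (5.1, 5.0), (5.2, 5.0)]
batches = ["a", "a", "a", "b", "b", "b"]
graph = batch_balanced_neighbours(points, batches, neighbors_within_batch=2)
print(len(graph[0]))  # 2 neighbours per batch x 2 batches
```

In this sketch each point's list has `neighbors_within_batch` times the number of batches entries, matching the docstring, and a point counts as its own nearest neighbour within its own batch.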

Any plan on making R package for bbknn?

I really like the robustness of the integration, and especially the fast performance.
I prefer working in R because I find it more intuitive.
Are there any plans to make an R package for bbknn? That would be wonderful.

An error when running bbknn$bbknn(adata,batch_key=0)

Hi @Teichlab,

I want to use bbknn in R, but I get the following error:
`

bbknn = import(module = 'bbknn')
sc = import("scanpy",convert=FALSE)
np = import("numpy")
scipy = import("scipy")
b <- brca[[1]]
pca.input <- b@reductions$pca@cell.embeddings
batches <- b@meta.data$sample
adata <- anndata$AnnData(X = pca.input, obs = batches)
sc$tl$pca(adata)
None
adata$obsm$X_pca <- r_to_py(pca.input)
bbknn$bbknn(adata, batch_key = 0)

*** caught illegal operation ***
address 0x2b6db2dc781d, cause 'illegal operand'

Traceback:
1: py_call_impl(callable, dots$args, dots$keywords)
2: bbknn$bbknn(adata, batch_key = 0)

Possible actions:
1: abort (with core dump, if enabled)
2: normal R exit
3: exit R without saving workspace
4: exit R saving workspace
Selection:

`
Could you please help me figure out this problem?

Thanks a lot

Best

Zhaohui Ruan

Issues importing bbknn after successful install

Hello,

I was able to successfully install bbknn; however, when I went to run it in a notebook, I got the following error:

AssertionError: Failed in nopython mode pipeline (step: native lowering)
Storing i64 to ptr of i32 ('dim'). FE type int32

I realize that this error is related to numba (currently running 0.54.1). I tried downgrading it to 0.52.0; however, that caused a few compatibility issues with bbknn. Do you have any suggestions?

Thank you!

error reproducing notebook

Hi there,

I tried to reproduce the pancreas Jupyter notebook (planning to eventually add some more data to it). However, after loading the 4 datasets in the holder, when I run adata = holder[0].concatenate(holder[1:], join='outer') I get the following error:

<ipython-input-27-8bddebd9bb8e> in <module>
----> 1 adata = holder[0].concatenate(holder[1:], join='outer')
      2 #adata.X = adata.X.tocsr()
      3 #adata = adata[:,['ERCC' not in item.upper() for item in adata.var_names]]
      4 #adata.raw = sc.pp.log1p(adata, copy=True)
      5 #sc.pp.normalize_per_cell(adata, counts_per_cell_after=1e4)

/anaconda3/envs/leiden/lib/python3.6/site-packages/anndata/base.py in concatenate(self, join, batch_key, batch_categories, index_unique, *adatas)
   1807                 # constructed like that
   1808                 X[obs_i:obs_i+ad.n_obs,
-> 1809                   var_names.isin(vars_intersect)] = ad[:, vars_intersect].X
   1810             else:
   1811                 Xs.append(ad[:, vars_intersect].X)

/anaconda3/envs/leiden/lib/python3.6/site-packages/anndata/base.py in __getitem__(self, index)
   1299     def __getitem__(self, index):
   1300         """Returns a sliced view of the object."""
-> 1301         return self._getitem_view(index)
   1302 
   1303     def _getitem_view(self, index):

/anaconda3/envs/leiden/lib/python3.6/site-packages/anndata/base.py in _getitem_view(self, index)
   1302 
   1303     def _getitem_view(self, index):
-> 1304         oidx, vidx = self._normalize_indices(index)
   1305         return AnnData(self, oidx=oidx, vidx=vidx, asview=True)
   1306 

/anaconda3/envs/leiden/lib/python3.6/site-packages/anndata/base.py in _normalize_indices(self, index)
   1278                 return index
   1279         obs, var = super(AnnData, self)._unpack_index(index)
-> 1280         obs = _normalize_index(obs, self.obs_names)
   1281         var = _normalize_index(var, self.var_names)
   1282         return obs, var

/anaconda3/envs/leiden/lib/python3.6/site-packages/anndata/base.py in _normalize_index(index, names)
    238     if not isinstance(names, RangeIndex):
    239         assert names.dtype != float and names.dtype != int, \
--> 240             'Don’t call _normalize_index with non-categorical/string names'
    241 
    242     # the following is insanely slow for sequences, we replaced it using pandas below

AssertionError: Don’t call _normalize_index with non-categorical/string names

These are the current versions I'm using:
scanpy==1.3.6 anndata==0.6.13 numpy==1.15.3 scipy==1.1.0 pandas==0.23.4 scikit-learn==0.20.0 statsmodels==0.9.0 python-igraph==0.7.1 louvain==0.6.1.

Do you know what is causing this?

Thank you!

should change location of 'distances' and 'connectivities' for new versions of anndata

/usr/local/lib/python3.6/dist-packages/bbknn/__init__.py:294: FutureWarning: This location for 'distances' is deprecated. It has been moved to .obsp[distances], and will not be accesible here in a future version of anndata.
  adata.uns['neighbors']['distances'] = bbknn_out[0]
/usr/local/lib/python3.6/dist-packages/bbknn/__init__.py:295: FutureWarning: This location for 'connectivities' is deprecated. It has been moved to .obsp[connectivities], and will not be accesible here in a future version of anndata.
  adata.uns['neighbors']['connectivities'] = bbknn_out[1]

New versions of anndata/scanpy expect the distances and connectivities to be in adata.obsp instead of adata.uns['neighbors']. bbknn should probably write its outputs to adata.obsp as well.
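
Until the output location changes, downstream code could tolerate both layouts. The sketch below is one hedged way to do that: the helper `get_neighbour_graph` is hypothetical, and the example uses simple stand-in objects rather than a real AnnData, relying only on the two locations named in the warning (`.obsp` and `.uns['neighbors']`).

```python
from types import SimpleNamespace


def get_neighbour_graph(adata, key):
    """Return a neighbour graph ('distances' or 'connectivities').

    Checks the newer .obsp location first, then the deprecated
    .uns['neighbors'] location named in the warning.
    """
    obsp = getattr(adata, "obsp", None)
    if obsp is not None and key in obsp:
        return obsp[key]
    legacy = adata.uns.get("neighbors", {})
    if key in legacy:
        return legacy[key]
    raise KeyError(f"no {key!r} graph found in .obsp or .uns['neighbors']")


# stand-in objects that mimic only the two attributes used above;
# nested lists take the place of the sparse matrices
old_style = SimpleNamespace(uns={"neighbors": {"distances": [[0.0, 1.5], [1.5, 0.0]]}})
new_style = SimpleNamespace(obsp={"distances": [[0.0, 2.0], [2.0, 0.0]]}, uns={"neighbors": {}})

print(get_neighbour_graph(old_style, "distances"))
print(get_neighbour_graph(new_style, "distances"))
```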
