
parc's People

Contributors

ahill187, ezunder, shobistassen, tommycelsius, wyattmcdonnell


parc's Issues

Bugs in Example Usage 3

There are a few bugs and challenges when working with Example Usage 3 on the GitHub home page.

Finding the data

For starters, it's hard to locate the appropriate raw data. The example currently has the link below:
raw datafile
...but the relevant dataset is difficult to find from that portal.

I believe this is a direct link to the dataset used in the example:
direct link
source
...note that after decompressing the .tar.gz archive, the folder needs to be renamed from filtered_matrices_mex to zheng17_filtered_matrices_mex and moved to a subfolder called "data" to match the code in Example Usage 3.
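A minimal sketch of those unpack/rename steps (the archive filename below is a placeholder for whatever the download is actually called):

import tarfile
from pathlib import Path

archive = Path("filtered_matrices_mex.tar.gz")  # placeholder: use the actual downloaded filename
Path("data").mkdir(exist_ok=True)

with tarfile.open(archive, "r:gz") as tar:
    tar.extractall(".")  # produces ./filtered_matrices_mex

# rename and move so the paths match the code in Example Usage 3
Path("filtered_matrices_mex").rename(Path("data") / "zheng17_filtered_matrices_mex")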

Finding the annotations

Also, the annotations file needs to be downloaded from the link in Example Usage 2 and renamed to match the code in Example Usage 3:
annotations_zhang.txt --> data/zheng17_annotations.txt
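A one-line sketch of that rename (assuming the annotations file has already been downloaded to the working directory and the data folder from the previous step exists):

from pathlib import Path
Path("annotations_zhang.txt").rename(Path("data") / "zheng17_annotations.txt")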

Fixing some typos

The example code mentions "adata2" but should be "adata":

# BAD CODE:
# pre-process as per Zheng et al., and take first 50 PCs for analysis
sc.pp.recipe_zheng17(adata)
sc.tl.pca(adata, n_comps=50)
# setting small_pop to 50 cleans up some of the smaller clusters, but can also be left at the default 10
parc1 = parc.PARC(adata2.obsm['X_pca'], true_label = annotations, jac_std_global=0.15, random_seed =1, small_pop = 50)  
parc1.run_PARC() # run the clustering
parc_labels = parc1.labels
adata2.obs["PARC"] = pd.Categorical(parc_labels)

should be:

# GOOD CODE:
# pre-process as per Zheng et al., and take first 50 PCs for analysis
sc.pp.recipe_zheng17(adata)
sc.tl.pca(adata, n_comps=50)
# setting small_pop to 50 cleans up some of the smaller clusters, but can also be left at the default 10
parc1 = parc.PARC(adata.obsm['X_pca'], true_label = annotations, jac_std_global=0.15, random_seed =1, small_pop = 50)  
parc1.run_PARC() # run the clustering
parc_labels = parc1.labels
adata.obs["PARC"] = pd.Categorical(parc_labels)

Adding some missing steps for scanpy UMAP

# OLD CODE:
//visualize
sc.pl.umap(adata, color='annotations')
sc.pl.umap(adata, color='PARC')
# NEW CODE (includes some missing steps to allow scanpy to calculate a UMAP embedding)
# visualize
sc.settings.n_jobs=4
sc.pp.neighbors(adata, n_neighbors=10, n_pcs=40)
sc.tl.umap(adata)
sc.pl.umap(adata, color='annotations')
sc.pl.umap(adata, color='PARC')

# Ignore the "no transformation for parallel execution was possible" warnings

This code should produce the following output:

Embedding a total of 2 separate connected components using meta-embedding (experimental)
  n_components
# and two pretty plots

My final script is attached:
PARCdemo3.txt

pip install issue: Mac OS X and AWS Linux

Hey Shobi!

I get the following when trying to pip install parc in a fresh Anaconda environment with Python 3.7 on both Mac OS X 10.14.6 (Mojave) and on AWS' flavor of Linux:

Collecting parc
  Downloading https://files.pythonhosted.org/packages/2d/82/7dcddc30ba3cd6cec30a1f939d40bb21882bef791d87a1ddab174e825b5e/parc-0.18-py3-none-any.whl
Collecting python-igraph (from parc)
  Using cached https://files.pythonhosted.org/packages/0f/a0/4e7134f803737aa6eebb4e5250565ace0e2599659e22be7f7eba520ff017/python-igraph-0.7.1.post6.tar.gz
Collecting leidenalg (from parc)
Collecting pandas (from parc)
  Using cached https://files.pythonhosted.org/packages/39/73/99aa822ee88cef5829607217c11bf24ecc1171ae5d49d5f780085f5da518/pandas-0.25.1-cp37-cp37m-macosx_10_9_x86_64.macosx_10_10_x86_64.whl
Collecting hnswlib (from parc)
  Using cached https://files.pythonhosted.org/packages/51/ee/850ac2cdc9483a5a26fd4173be486f48db0bdb9e2b200dfc3149a572a907/hnswlib-0.3.2.0.tar.gz
Collecting pybind11 (from parc)
  Downloading https://files.pythonhosted.org/packages/4b/4d/ae1c4d8e8b139afa9682054dd42df3b0e3b5c1731287933021b9fd7e9cc4/pybind11-2.4.3-py2.py3-none-any.whl (150kB)
     |████████████████████████████████| 153kB 3.6MB/s 
Collecting scipy (from parc)
  Using cached https://files.pythonhosted.org/packages/d5/06/1a696649f4b2e706c509cb9333fdc6331fbe71251cede945f9e1fa13ea34/scipy-1.3.1-cp37-cp37m-macosx_10_6_intel.macosx_10_9_intel.macosx_10_9_x86_64.macosx_10_10_intel.macosx_10_10_x86_64.whl
Collecting numpy (from parc)
  Using cached https://files.pythonhosted.org/packages/b4/e8/5ececadd9cc220bb783b4ce6ffaa9266925d37ed41237bc23bc530ab4f3d/numpy-1.17.2-cp37-cp37m-macosx_10_6_intel.whl
Collecting pytz>=2017.2 (from pandas->parc)
  Downloading https://files.pythonhosted.org/packages/e7/f9/f0b53f88060247251bf481fa6ea62cd0d25bf1b11a87888e53ce5b7c8ad2/pytz-2019.3-py2.py3-none-any.whl (509kB)
     |████████████████████████████████| 512kB 16.1MB/s 
Collecting python-dateutil>=2.6.1 (from pandas->parc)
  Using cached https://files.pythonhosted.org/packages/41/17/c62faccbfbd163c7f57f3844689e3a78bae1f403648a6afb1d0866d87fbb/python_dateutil-2.8.0-py2.py3-none-any.whl
Collecting six>=1.5 (from python-dateutil>=2.6.1->pandas->parc)
  Using cached https://files.pythonhosted.org/packages/73/fb/00a976f728d0d1fecfe898238ce23f502a721c0ac0ecfedb80e0d88c64e9/six-1.12.0-py2.py3-none-any.whl
Building wheels for collected packages: python-igraph, hnswlib
  Building wheel for python-igraph (setup.py) ... done
  Created wheel for python-igraph: filename=python_igraph-0.7.1.post6-cp37-cp37m-macosx_10_9_x86_64.whl size=1903873 sha256=0c6639b7e49b66a679add893935b4cdabec863b711406adf644e1a59b6726a56
  Stored in directory: /Users/wyatt.mcdonnell/Library/Caches/pip/wheels/41/d6/02/34eebae97e25f5b87d60f4c0687e00523e3f244fa41bc3f4a7
  Building wheel for hnswlib (setup.py) ... error
  ERROR: Command errored out with exit status 1:
   command: //anaconda3/envs/parc/bin/python -u -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/private/var/folders/bq/7jsphcrn5jj6rt7jrll1x5y80000gp/T/pip-install-yfhtmyc0/hnswlib/setup.py'"'"'; __file__='"'"'/private/var/folders/bq/7jsphcrn5jj6rt7jrll1x5y80000gp/T/pip-install-yfhtmyc0/hnswlib/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' bdist_wheel -d /private/var/folders/bq/7jsphcrn5jj6rt7jrll1x5y80000gp/T/pip-wheel-wlmtzn3j --python-tag cp37
       cwd: /private/var/folders/bq/7jsphcrn5jj6rt7jrll1x5y80000gp/T/pip-install-yfhtmyc0/hnswlib/
  Complete output (41 lines):
  running bdist_wheel
  running build
  running build_ext
  creating var
  creating var/folders
  creating var/folders/bq
  creating var/folders/bq/7jsphcrn5jj6rt7jrll1x5y80000gp
  creating var/folders/bq/7jsphcrn5jj6rt7jrll1x5y80000gp/T
  gcc -Wno-unused-result -Wsign-compare -Wunreachable-code -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -I//anaconda3/envs/parc/include -arch x86_64 -I//anaconda3/envs/parc/include -arch x86_64 -I//anaconda3/envs/parc/include/python3.7m -c /var/folders/bq/7jsphcrn5jj6rt7jrll1x5y80000gp/T/tmpq3tagf4m.cpp -o var/folders/bq/7jsphcrn5jj6rt7jrll1x5y80000gp/T/tmpq3tagf4m.o -std=c++14
  gcc -Wno-unused-result -Wsign-compare -Wunreachable-code -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -I//anaconda3/envs/parc/include -arch x86_64 -I//anaconda3/envs/parc/include -arch x86_64 -I//anaconda3/envs/parc/include/python3.7m -c /var/folders/bq/7jsphcrn5jj6rt7jrll1x5y80000gp/T/tmpw16kygtu.cpp -o var/folders/bq/7jsphcrn5jj6rt7jrll1x5y80000gp/T/tmpw16kygtu.o -fvisibility=hidden
  Traceback (most recent call last):
    File "<string>", line 1, in <module>
    File "/private/var/folders/bq/7jsphcrn5jj6rt7jrll1x5y80000gp/T/pip-install-yfhtmyc0/hnswlib/setup.py", line 116, in <module>
      zip_safe=False,
    File "//anaconda3/envs/parc/lib/python3.7/site-packages/setuptools/__init__.py", line 145, in setup
      return distutils.core.setup(**attrs)
    File "//anaconda3/envs/parc/lib/python3.7/distutils/core.py", line 148, in setup
      dist.run_commands()
    File "//anaconda3/envs/parc/lib/python3.7/distutils/dist.py", line 966, in run_commands
      self.run_command(cmd)
    File "//anaconda3/envs/parc/lib/python3.7/distutils/dist.py", line 985, in run_command
      cmd_obj.run()
    File "//anaconda3/envs/parc/lib/python3.7/site-packages/wheel/bdist_wheel.py", line 192, in run
      self.run_command('build')
    File "//anaconda3/envs/parc/lib/python3.7/distutils/cmd.py", line 313, in run_command
      self.distribution.run_command(command)
    File "//anaconda3/envs/parc/lib/python3.7/distutils/dist.py", line 985, in run_command
      cmd_obj.run()
    File "//anaconda3/envs/parc/lib/python3.7/distutils/command/build.py", line 135, in run
      self.run_command(cmd_name)
    File "//anaconda3/envs/parc/lib/python3.7/distutils/cmd.py", line 313, in run_command
      self.distribution.run_command(command)
    File "//anaconda3/envs/parc/lib/python3.7/distutils/dist.py", line 985, in run_command
      cmd_obj.run()
    File "//anaconda3/envs/parc/lib/python3.7/site-packages/setuptools/command/build_ext.py", line 84, in run
      _build_ext.run(self)
    File "//anaconda3/envs/parc/lib/python3.7/distutils/command/build_ext.py", line 340, in run
      self.build_extensions()
    File "/private/var/folders/bq/7jsphcrn5jj6rt7jrll1x5y80000gp/T/pip-install-yfhtmyc0/hnswlib/setup.py", line 88, in build_extensions
      import pybind11
  ModuleNotFoundError: No module named 'pybind11'
  ----------------------------------------
  ERROR: Failed building wheel for hnswlib
  Running setup.py clean for hnswlib
Successfully built python-igraph
Failed to build hnswlib
Installing collected packages: python-igraph, leidenalg, numpy, pytz, six, python-dateutil, pandas, pybind11, hnswlib, scipy, parc
  Running setup.py install for hnswlib ... done
Successfully installed hnswlib-0.3.2.0 leidenalg-0.7.0 numpy-1.17.2 pandas-0.25.1 parc-0.18 pybind11-2.4.3 python-dateutil-2.8.0 python-igraph-0.7.1.post6 pytz-2019.3 scipy-1.3.1 six-1.12.0

For what it's worth, if I manually install everything first with the following bash script, then clone PARC.git and run python3 setup.py install, everything works!

#!/usr/bin/env bash
conda create --name parc_20191014 python=3 --yes |& tee -a parc.log
conda activate parc_20191014 |& tee -a parc.log
conda install -c anaconda numpy pandas scipy setuptools scikit-learn seaborn --yes |& tee -a parc.log
conda install -c conda-forge pybind11 python-igraph umap-learn --yes |& tee -a parc.log
pip install leidenalg |& tee -a parc.log
pip install ipykernel |& tee -a parc.log
pip3 install jupyter |& tee -a parc.log
python -m ipykernel install --user --name parc --display-name "Python (parc)" |& tee -a parc.log
git clone https://github.com/nmslib/hnswlib |& tee -a parc.log
cd hnswlib |& tee -a parc.log
cd python_bindings |& tee -a parc.log
python3 setup.py install |& tee -a parc.log
cd ../../ |& tee -a parc.log
git clone https://github.com/ShobiStassen/PARC.git |& tee -a parc.log
cd PARC |& tee -a parc.log
# 2019-10-14, most recently tested with: git checkout 46a36996539bf57bb0d544898d900a63fdaa5b90 |& tee -a parc.log
python3 setup.py install

Cheers,
Wyatt

Hello! I am getting an AttributeError when trying to import parc


AttributeError                            Traceback (most recent call last)
/tmp/ipykernel_2109246/3403208955.py in <module>
----> 1 import parc

~/user_venv/lib/python3.7/site-packages/parc/__init__.py in <module>
----> 1 from ._parc import PARC

~/user_venv/lib/python3.7/site-packages/parc/_parc.py in <module>
      6 from scipy.sparse import csr_matrix
      7 import igraph as ig
----> 8 import leidenalg
      9 import time
     10 import umap

~/user_venv/lib/python3.7/site-packages/leidenalg/__init__.py in <module>
     33 not immediately available in :func:`leidenalg.find_partition`.
     34 """
---> 35 from .functions import ALL_COMMS
     36 from .functions import ALL_NEIGH_COMMS
     37 from .functions import RAND_COMM

~/user_venv/lib/python3.7/site-packages/leidenalg/functions.py in <module>
     21   return graph.__graph_as_cobject()
     22
---> 23 from .VertexPartition import *
     24 from .Optimiser import *
     25

~/user_venv/lib/python3.7/site-packages/leidenalg/VertexPartition.py in <module>
      6 PY3 = (sys.version > '3')
      7
----> 8 class MutableVertexPartition(_ig.VertexClustering):
      9   """ Contains a partition of graph, derives from :class:`ig.VertexClustering`.
     10

AttributeError: module 'igraph' has no attribute 'VertexClustering'

Minor bug in run_toobig_subPARC

The cluster-size comparison on lines 232 and 256 of _parc.py should, I think, be against self.small_pop rather than the hard-coded value of 10.
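A small, self-contained illustration of the idea (hypothetical code, not the actual _parc.py logic): size checks should respect the configurable small_pop threshold instead of a literal 10.

from collections import Counter

def too_small_clusters(labels, small_pop=10):
    # return the cluster ids whose population falls below the small_pop threshold
    counts = Counter(labels)
    return [cluster for cluster, n in counts.items() if n < small_pop]

labels = [0] * 120 + [1] * 40 + [2] * 8
print(too_small_clusters(labels, small_pop=50))  # [1, 2] -- the parameter is respected, not a hard-coded 10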

Error when running UMAP

Hi, great package! I am trying to run PARC's UMAP implementation, but I receive the following error:
(screenshot of the error attached)

It appears that this issue has popped up here: scverse/scanpy#1579

I tried stepping back to umap-learn==0.4.5, but then I encounter an error when importing PARC.

Any help would be greatly appreciated.

My venv:

_libgcc_mutex 0.1 main
backcall 0.2.0
blas 1.0 mkl
ca-certificates 2021.5.25 h06a4308_1
certifi 2021.5.30 py37h06a4308_0
cycler 0.10.0 py37_0
dbus 1.13.18 hb2f20db_0
decorator 5.0.9
expat 2.4.1 h2531618_2
fontconfig 2.13.1 h6c09931_0
freetype 2.10.4 h5ab3b9f_0
glib 2.68.2 h36276a3_0
gst-plugins-base 1.14.0 h8213a91_2
gstreamer 1.14.0 h28cd5cc_2
hnswlib 0.5.1
icu 58.2 he6710b0_3
intel-openmp 2021.2.0 h06a4308_610
ipykernel 5.5.5
ipython 7.24.1
ipython-genutils 0.2.0
jedi 0.18.0
joblib 1.0.1 pyhd3eb1b0_0
jpeg 9b h024ee3a_2
jupyter-client 6.1.12
jupyter-core 4.7.1
kiwisolver 1.3.1 py37h2531618_0
lcms2 2.12 h3be6417_0
ld_impl_linux-64 2.33.1 h53a641e_7
leidenalg 0.8.4
libffi 3.3 he6710b0_2
libgcc-ng 9.1.0 hdf63c60_0
libgfortran-ng 7.3.0 hdf63c60_0
libpng 1.6.37 hbc83047_0
libstdcxx-ng 9.1.0 hdf63c60_0
libtiff 4.2.0 h85742a9_0
libuuid 1.0.3 h1bed415_2
libwebp-base 1.2.0 h27cfd23_0
libxcb 1.14 h7b6447c_0
libxml2 2.9.10 hb55368b_3
llvmlite 0.36.0
lz4-c 1.9.3 h2531618_0
matplotlib 3.3.4 py37h06a4308_0
matplotlib-base 3.3.4 py37h62a2d02_0
matplotlib-inline 0.1.2
mkl 2021.2.0 h06a4308_296
mkl-service 2.3.0 py37h27cfd23_1
mkl_fft 1.3.0 py37h42c9631_2
mkl_random 1.2.1 py37ha9443f7_2
ncurses 6.2 he6710b0_1
numba 0.53.1
numpy 1.20.2 py37h2d18471_0
numpy-base 1.20.2 py37hfae3a4d_0
olefile 0.46 py_0
openssl 1.1.1k h27cfd23_0
pandas 1.2.4 py37h2531618_0
parc 0.31
parso 0.8.2
pcre 8.44 he6710b0_0
pexpect 4.8.0
pickleshare 0.7.5
pillow 8.2.0 py37he98fc37_0
pip 21.1.1 py37h06a4308_0
prompt-toolkit 3.0.18
ptyprocess 0.7.0
pybind11 2.6.2 py37hff7bd54_1
Pygments 2.9.0
pynndescent 0.5.2
pyparsing 2.4.7 pyhd3eb1b0_0
pyqt 5.9.2 py37h05f1152_2
python 3.7.10 hdb3f193_0
python-dateutil 2.8.1 pyhd3eb1b0_0
python-igraph 0.9.4
pytz 2021.1 pyhd3eb1b0_0
pyzmq 22.1.0
qt 5.9.7 h5867ecd_1
readline 8.1 h27cfd23_0
scikit-learn 0.24.2 py37ha9443f7_0
scipy 1.6.2 py37had2a1c9_1
setuptools 52.0.0 py37h06a4308_0
sip 4.19.8 py37hf484d3e_0
six 1.15.0 pyhd3eb1b0_0
sqlite 3.35.4 hdfb4753_0
texttable 1.6.3
threadpoolctl 2.1.0 pyh5ca1d4c_0
tk 8.6.10 hbc83047_0
tornado 6.1 py37h27cfd23_0
traitlets 5.0.5
umap-learn 0.5.1
wcwidth 0.2.5
wheel 0.36.2 pyhd3eb1b0_0
xz 5.2.5 h7b6447c_0
zlib 1.2.11 h7b6447c_3
zstd 1.4.9 haebb681_0

Questions on using PARC for identifying plasma cells

Hi, I am trying to use PARC for flow cytometry and I want to identify plasma cells, which is usually done by manually gating on CD38 and CD138. The gate identifies the CD38-high, CD138-high population (upper right corner of the 2-D CD38 vs CD138 plot). To replicate the same process in PARC, is the following the proper method:

X contains CD38, CD138 and 12 other antigen columns

Parc1 = parc.PARC(X, jac_std_global=0.15, random_seed = 2, small_pop = 20)
Parc1.run_PARC()
parc_labels = Parc1.labels

#Attach a column called "Cluster" to identify which cell belongs to which cluster
df['cluster'] = pd.Series(Parc1.labels, index=df.index)

#Find which cluster has the max value for the sum
sum_df = df.groupby(['cluster'])[["CD38", "CD138"]].sum().sum(axis=1)
sum_df.sort_values(ascending=False)
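A hypothetical continuation of the snippet above (assuming the code has run and df holds the per-cell data): pick the cluster with the largest combined signal as the plasma-cell candidate.

top_cluster = sum_df.idxmax()  # cluster with the highest combined CD38 + CD138 sum
plasma_candidates = df[df['cluster'] == top_cluster]
print(top_cluster, len(plasma_candidates))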

Example Usage 2 has some code typos

The examples on the GitHub home page are very useful, but I hit a few bugs in Example 2. Specifically, (1) an errant quotation mark, (2) a missing dot, and (3) an undefined alias for the numpy package. The code below works:

import parc
import csv
import numpy as np
import pandas as pd

## load data (50 PCs of filtered gene matrix pre-processed as per Zheng et al. 2017)

X = csv.reader(open("./pca50_pbmc68k.txt", 'rt'),delimiter = ",")
X = np.array(list(X)) # (n_obs x k_dim, 68579 x 50)
X = X.astype("float")
# OR with pandas as: X = pd.read_csv("./pca50_pbmc68k.txt", header=None).values.astype("float")

y = [] # annotations
with open('./annotations_zhang.txt', 'rt') as f: 
    for line in f: y.append(line.strip().replace('\"', ''))
# OR with pandas as: y = list(pd.read_csv('./annotations_zhang.txt', header=None)[0])   

# setting small_pop to 50 cleans up some of the smaller clusters, but can also be left at the default 10
parc1 = parc.PARC(X,true_label=y,jac_std_global=0.15, random_seed =1, small_pop = 50) # instantiate PARC
parc1.run_PARC() # run the clustering
parc_labels = parc1.labels 

Import of umap increases PARC import time

Hey,

I was wondering whether the import of umap on line 8 of _parc.py is necessary, because it slows down the import of PARC, and the only function that uses umap imports it again anyway (a lazy-import sketch follows below).

Best wishes Max
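A minimal sketch of the lazy-import pattern being suggested (hypothetical function name, not the actual _parc.py code): defer the umap import until the one function that needs it is called.

def run_umap(X):
    import umap  # imported only when this function runs, so `import parc` stays fast
    return umap.UMAP().fit_transform(X)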

Relaxation of leidenalg==0.7.0 requirement?

Hi @ShobiStassen

I see that you have leidenalg==0.7.0 as a requirement. Is there any particular reason why it needs to be that exact version? For people like me who use version managers, that specific requirement makes it difficult to use parc. Is it possible to change it to something like leidenalg>=0.7.0 if the version is not critical? Thank you.

install_requires=['pybind11','numpy','scipy','pandas','hnswlib','python-igraph','leidenalg==0.7.0','umap-learn']
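For reference, the relaxed pin being suggested would look something like this in setup.py (sketch only; the exact pin should stay if 0.7.0-specific behaviour is required):

install_requires=['pybind11','numpy','scipy','pandas','hnswlib','python-igraph','leidenalg>=0.7.0','umap-learn']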

Best,
Ajit.

Python igraph

Hi,

In the installation tutorial, I tried to execute this command:

pip install python-igraph, leidenalg==0.7.0, hnswlib, umap-learn

but I noted that python-igraph is now installed via "pip install igraph".

Could you specify the Python version to use for a correct installation of PARC?

Best regards

np.reshape error if too many edges are pruned

Fantastic clustering tool! I'm very happy to be able to use this - amazing how fast and well it works. Some of my datasets produce the following error though:

p = parc.PARC(data)
p.run_PARC()

input data has shape 600 (samples) x 12 (features)
commencing local pruning based on minowski metric at 2 s.dev above mean
commencing global pruning
commencing community detection
0.01430511474609375
Traceback (most recent call last):

  File "", line 1, in <module>
    p.run_PARC()

  File "C:\Users\ezund\Anaconda3\lib\site-packages\parc\_parc.py", line 431, in run_PARC
    self.run_subPARC()

  File "C:\Users\ezund\Anaconda3\lib\site-packages\parc\_parc.py", line 241, in run_subPARC
    PARC_labels_leiden = np.reshape(PARC_labels_leiden, (n_elements, 1))

  File "C:\Users\ezund\Anaconda3\lib\site-packages\numpy\core\fromnumeric.py", line 292, in reshape
    return _wrapfunc(a, 'reshape', newshape, order=order)

  File "C:\Users\ezund\Anaconda3\lib\site-packages\numpy\core\fromnumeric.py", line 56, in _wrapfunc
    return getattr(obj, method)(*args, **kwds)

ValueError: cannot reshape array of size 599 into shape (600,1)

This was pretty easy to track down through the error messages. The problem happens because so many edges are pruned that some vertices lose all their edges, and then they're not added to the G_sim graph because it's constructed from the edgelist. I've fixed this problem by explicitly adding all the vertices during graph construction in the _parc.py file. I don't think this will cause any problems elsewhere in the code. Perhaps I should adjust parameters to decrease the level of pruning(?) but I think changing the code to make this failsafe is still a good idea. I've made a fork and pull request for this small fix. I'm happy to provide an example dataset that reproduces this error with np.reshape if that would be useful.
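A minimal, hypothetical illustration of the fix (not the actual _parc.py code): constructing the igraph graph with an explicit vertex count keeps vertices whose edges were all pruned, so the label array length matches n_elements.

import igraph as ig

n_elements = 6
edgelist = [(0, 1), (1, 2), (3, 4)]  # vertex 5 has lost all of its edges to pruning
weights = [0.9, 0.7, 0.8]

# building the graph from the edge list alone would silently drop isolated vertex 5;
# passing n=n_elements adds it back as an isolated vertex
G_sim = ig.Graph(n=n_elements, edges=edgelist, edge_attrs={'weight': weights})
print(G_sim.vcount())  # 6, matching n_elements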

Eli

Question on pruning big cluster

Hi @ShobiStassen ,

Thanks for making available your program!

I am a bit confused regarding the following, and I would like to ask for your clarifications.

When the csr graph is built with the local-pruning option, you use this line: weight_list.append(1 / (dist + 0.1))

I see that for two close neighbors, their distance is small and their weight is big. On the other hand, for two far neighbors, their distance is high and their weight is small.

Then, in the function to analyze a big cluster, you use this pruning:
mask |= (csr_array.data > (np.mean(csr_array.data) + np.std(csr_array.data) * 5))  # smaller distance means stronger edge
csr_array.data[mask] = 0
csr_array.eliminate_zeros()

It seems to me that the above pruning will cause relatively large weights to be removed; thus, very close neighbors are eliminated. Is this the intended goal?
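A small, self-contained illustration of the premise (made-up values): with weight = 1 / (dist + 0.1), the closest neighbors receive the largest weights, so a threshold of the form mean + k * std applied to the weights flags the closest pairs first.

import numpy as np

dists = np.array([0.05, 0.2, 1.0, 5.0, 20.0])
weights = 1 / (dists + 0.1)
print(np.round(weights, 2))   # [6.67 3.33 0.91 0.2  0.05]
order = np.argsort(weights)[::-1]
print(dists[order])           # [ 0.05  0.2  1.  5.  20.] -- the largest weight corresponds to the smallest distance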

Kind regards,

Ivan

Tips for replicating the performance on the lung cell clustering

Hi,

I'm working to replicate the performance results of the paper with the 1.1 million cells from the lung data, but getting very poor performance in the call to leidenalg. I stuck a few timing statements into the code to find the bottleneck and am seeing runtimes well into the two-hour range for leidenalg with this data set.

I am testing with the snippet from the main page:

X = pd.read_csv("./LungData.txt", header=None).values.astype("float")
y = list(pd.read_csv('./LungData_annotations.txt', header=None)[0])  # list of cell-type annotations

# run PARC on 1.1M and 70K cells
parc1 = parc.PARC(X, true_label=y)
parc1.run_PARC()  # run the clustering
parc_labels = parc1.labels

I also tried with a simple normalization of the data with:

normalized_X = preprocessing.normalize(X)

and see the same runtimes.

I am using GCP instances - c2-standard-16 (16 vCPUs, 64 GB memory) with what reports as 3.2 GHz CPUs on CentOS - but the almost order-of-magnitude difference makes me think I have missed some configuration or parameter tuning.

My results show:
input data has shape 1113369 (samples) x 26 (features)
time elapsed make_knn_struct 28.5 seconds
commencing global pruning
commencing community detection
partition type MVP
time elapsed leiden 6713.5 seconds
list of cluster labels and populations
time elapsed run_subPARC 6834.2 seconds

Based on the partition type, this seems to point at:

partition = leidenalg.find_partition(G_sim, leidenalg.ModularityVertexPartition, weights='weight', n_iterations=self.n_iter_leiden, seed=self.random_seed)

as unexpectedly long-running.

I've also noticed that the call to leidenalg seems to result in a single CPU being pegged at 100% for the duration of that time - I'm a little unsure whether leidenalg should (like HNSW) be more multi-CPU friendly?

Any suggestions?
Thanks.

Controlling the number of clusters

In Leiden clustering, we could pass a resolution parameter value to control the coarseness of the clustering. Is it possible to expose that parameter?
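For reference, a minimal sketch of how a resolution parameter is passed when calling leidenalg directly (this is leidenalg's own API rather than an existing PARC argument, and the graph here is a toy stand-in for the pruned kNN similarity graph):

import igraph as ig
import leidenalg

g = ig.Graph.Famous("Zachary")  # toy graph standing in for the similarity graph
partition = leidenalg.find_partition(
    g,
    leidenalg.RBConfigurationVertexPartition,  # partition type that accepts a resolution
    resolution_parameter=1.5,                  # larger values tend to yield more, finer clusters
)
print(len(set(partition.membership)))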

Hyperclustering with PARC

Thank you for this excellent tool!

We are attempting to generate many more clusters than achieved with default parameters. The application is AML-MRD detection by flow cytometry. So far very promising!

What would be your recommended first approach to generate more clusters? 1) increase resolution parameter, 2) decrease jac_std_global, 3) decrease dist_std_local.

Thanks!
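A hypothetical sketch of the three options listed above; the parameter names are taken from this thread (resolution_parameter, jac_std_global, dist_std_local), their availability depends on the installed PARC version, and random data stands in for the real cells-x-markers matrix.

import numpy as np
import parc

X = np.random.rand(5000, 14)  # placeholder for the flow-cytometry matrix

parc1 = parc.PARC(
    X,
    resolution_parameter=2.0,  # option 1: increase the Leiden resolution
    jac_std_global=0.1,        # option 2: decrease jac_std_global
    dist_std_local=2,          # option 3: decrease dist_std_local
    small_pop=10,
)
parc1.run_PARC()
print(len(set(parc1.labels)))  # number of clusters obtained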

RStudio crashing when clustering a large dataset

Hi,

Thank you so much for the excellent package! I have managed to get your code working in RStudio (using reticulate) on my new MacBook Pro M1 Max laptop with 64 GB RAM. But when I am clustering my mass cytometry dataset of 15 million cells, RStudio crashes after some time with a fatal error and no further info. When I look at the cores during the clustering, all cores are working.

I was hoping you might have some suggestions on how to cluster this dataset, and other larger ones that I have (30 million cells)? I would like to use all cells. Or is it just my Mac that cannot handle such big datasets?

I use CATALYST to read the data and then I export the single-cell data into a large matrix (661330692 elements, 5.3 GB), which is then passed to PARC. I am clustering on 20 parameters/columns. I have changed small_pop to make the clustering easier to handle. Are there any other parameters that I could change?

scdf <- data.frame(t(assay(sce, "exprs")[c(type_markers(sce), state_markers(sce)), ]),
                   sample_id = sce$sample_id,
                   check.names = FALSE)
scdf <- as.matrix(scdf[, 1:(ncol(scdf) - 1)])

tic()
Sys.setenv("RETICULATE_PYTHON" = "path")
library(reticulate)
reticulate::py_config()
markers <- c(1:length(ToUse))
parc <- import("parc")
parc1 <- parc$PARC(
  data = scdf[, markers],
  num_threads = 20L,
  resolution_parameter = 1,  # defaults to 1. expose this parameter in leidenalg
  small_pop = 2000L,         # smallest cluster population to be considered a community
  knn = 30L,
  hnsw_param_ef_construction = 150L)
parc1$run_PARC()

clusters <- unlist(parc1$labels)
levels(clusters) <- as.character(as.numeric(levels(clusters)) + 1)

sce@colData@listData[["cluster_id"]] <- as.factor(clusters)
sce@metadata[["cluster_codes"]] <- data.frame("PARC" = as.factor(levels(factor(clusters))))

rm(scdf)
toc()

About the true labels for calculating the F1 scores

Hi,
I am interested in testing PARC on more datasets. I am wondering what the source of the true labels is. Do you use the labels provided by the original papers? For example, the labels in the "clusters.csv" file from the clustering analysis of the PBMC dataset provided by 10x Genomics (pbmc_68k)?

Thank you!
Yijia

`run_umap_hnsw` doesn't work

Version Information

Python 3.10
macOS

Error

When calling run_umap_hnsw in the example in the README, I get the following error:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
     16 graph = Parc1.knngraph_full()
---> 17 X_umap = Parc1.run_umap_hnsw(X, graph)
     18 plt.scatter(X_umap[:, 0], X_umap[:, 1], c=Parc1.labels)
     19 plt.show()

File python3.10/site-packages/parc/_parc.py:595, in PARC.run_umap_hnsw(self, X_input, graph, n_components, alpha, negative_sample_rate, gamma, spread, min_dist, init_pos, random_state)
    593 print('a,b, spread, dist', a, b, spread, min_dist)
    594 t0 = time.time()
--> 595 X_umap = simplicial_set_embedding(data=X_input, graph=graph, n_components=n_components, initial_alpha=alpha,
    596                                   a=a, b=b, n_epochs=0, metric_kwds={}, gamma=gamma,
    597                                   negative_sample_rate=negative_sample_rate, init=init_pos,
    598                                   random_state=np.random.RandomState(random_state), metric='euclidean',
    599                                   verbose=1)
    600 return X_umap

TypeError: simplicial_set_embedding() missing 3 required positional arguments: 'densmap', 'densmap_kwds', and 'output_dens'
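For reference, in umap-learn 0.5+ simplicial_set_embedding gained the densmap, densmap_kwds and output_dens arguments and also returns auxiliary data alongside the embedding, so a patched call would look roughly like the sketch below. The variable names mirror the traceback above; this is a sketch of the kind of change needed, not the actual fix shipped in PARC.

# inside PARC.run_umap_hnsw, for umap-learn >= 0.5 (added arguments are marked):
X_umap, _ = simplicial_set_embedding(
    data=X_input, graph=graph, n_components=n_components, initial_alpha=alpha,
    a=a, b=b, n_epochs=0, metric_kwds={}, gamma=gamma,
    negative_sample_rate=negative_sample_rate, init=init_pos,
    random_state=np.random.RandomState(random_state), metric='euclidean',
    densmap=False, densmap_kwds={}, output_dens=False,  # new required arguments
    verbose=1)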

Amazing work!

Hi,

This is not an issue, I just wanted to give you my thanks. This work is awesome!
