aertslab / arboreto

A scalable python-based framework for gene regulatory network inference using tree-based ensemble regressors.

License: BSD 3-Clause "New" or "Revised" License

Languages: Jupyter Notebook 99.22%, Python 0.75%, Shell 0.03%
Topics: python, dask, scalable, gene-regulation, machine-learning, ensemble-learning, random-forest, gradient-boosting, inference, network

arboreto's Introduction

arboreto



The most satisfactory definition of man from the scientific point of view is probably Man the Tool-maker.

Inferring a gene regulatory network (GRN) from gene expression data is a computationally expensive task, exacerbated by increasing data sizes due to advances in high-throughput gene profiling technology.

The arboreto software library addresses this issue by providing a computational strategy that allows executing the class of GRN inference algorithms exemplified by GENIE3 [1] on hardware ranging from a single computer to a multi-node compute cluster. This class of GRN inference algorithms proceeds in a series of steps, one for each target gene in the dataset: a regression model is trained to predict the target gene's expression profile from a set of candidate regulators, and the most important regulators are selected from the model's feature importances.

Members of the above class of GRN inference algorithms are attractive from a computational point of view because they are parallelizable by nature. In arboreto, we specify the parallelizable computation as a dask graph [2], a data structure that represents the task schedule of a computation. A dask scheduler assigns the tasks in a dask graph to the available computational resources. Arboreto uses the dask distributed scheduler to spread out the computational tasks over multiple processes running on one or multiple machines.
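
As a concrete illustration of that design, here is a minimal sketch (not the official quick-start; the scheduler address and input file are hypothetical) of pointing arboreto at an already-running distributed scheduler via the client_or_address parameter, which also appears throughout the issues further below:

import pandas as pd
from distributed import Client
from arboreto.algo import grnboost2

if __name__ == '__main__':
    # connect to an already-running dask scheduler (hypothetical address)
    client = Client('tcp://scheduler-host:8786')

    # expression matrix with observations as rows and genes as columns
    ex_matrix = pd.read_csv('expression.tsv', sep='\t')  # hypothetical input file

    # the dask graph is built locally and executed on the cluster behind the client
    network = grnboost2(expression_data=ex_matrix, client_or_address=client)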

Arboreto currently supports 2 GRN inference algorithms:

  1. GRNBoost2: a novel and fast GRN inference algorithm using Stochastic Gradient Boosting Machine (SGBM) [3] regression with early-stopping regularization.
  2. GENIE3: the classic GRN inference algorithm using Random Forest (RF) or ExtraTrees (ET) regression.

Get Started

Arboreto was conceived with the working bioinformatician or data scientist in mind. We provide extensive documentation and examples to help you get up to speed with the library.
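
For example, the basic local usage pattern (mirroring the documented example that also appears in the issues below, using the DREAM5 net1 files shipped in this repository) is:

import pandas as pd
from arboreto.utils import load_tf_names
from arboreto.algo import grnboost2

if __name__ == '__main__':
    # ex_matrix is a DataFrame with gene names as column names
    ex_matrix = pd.read_csv('net1_expression_data.tsv', sep='\t')

    # tf_names is read using a utility function included in Arboreto
    tf_names = load_tf_names('net1_transcription_factors.tsv')

    # compute the GRN and write the (TF, target, importance) links to disk
    network = grnboost2(expression_data=ex_matrix, tf_names=tf_names)
    network.to_csv('net1_grn_output.tsv', sep='\t', index=False, header=False)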

License

BSD 3-Clause License

pySCENIC

Arboreto is a component in pySCENIC: a lightning-fast python implementation of the SCENIC pipeline [5] (Single-Cell rEgulatory Network Inference and Clustering) which enables biologists to infer transcription factors, gene regulatory networks and cell types from single-cell RNA-seq data.

References

  1. Huynh-Thu, V. A., Irrthum, A., Wehenkel, L., & Geurts, P. (2010). Inferring regulatory networks from expression data using tree-based methods. PLoS ONE, 5(9), e12776.
  2. Rocklin, M. (2015). Dask: parallel computation with blocked algorithms and task scheduling. In Proceedings of the 14th Python in Science Conference (pp. 130-136).
  3. Friedman, J. H. (2002). Stochastic gradient boosting. Computational Statistics & Data Analysis, 38(4), 367-378.
  4. Marbach, D., Costello, J. C., Küffner, R., Vega, N. M., Prill, R. J., Camacho, D. M., ... & the DREAM5 Consortium. (2012). Wisdom of crowds for robust gene network inference. Nature Methods, 9(8), 796-804.
  5. Aibar, S., Bravo González-Blas, C., Moerman, T., Huynh-Thu, V. A., Imrichova, H., Kalender Atak, Z., Hulselmans, G., Dewaele, M., Rambow, F., Geurts, P., Aerts, J., Marine, J.-C., van den Oord, J., & Aerts, S. (2017). SCENIC: single-cell regulatory network inference and clustering. Nature Methods, 14, 1083-1086. doi:10.1038/nmeth.4463

arboreto's People

Contributors

ann-holmes · cflerin · gennadyfauna · ghuls · redst4r · tmoerman


arboreto's Issues

Is arboreto applicable to bulk RNA-Seq

I understand that arboreto is designed for single-cell RNA-Seq, but what about bulk RNA-Seq data? If I apply arboreto to a bulk RNA-Seq data set, is the result credible? Also, I am not familiar with this algorithm: what can I tell from the arboreto output, what is the meaning of the importance between a TF and a target, and what threshold can be regarded as a valid regulation of a TF on the target gene? Thanks in advance!

Error from 'distributed' when running GRNBoost on a server without an internet connection

Hi arboreto author,
I'm trying to run GRNBoost on a supercomputer server, which cannot connect to the internet. My code:

import pandas as pd
from arboreto.utils import load_tf_names
from arboreto.algo import grnboost2

if __name__ == '__main__':
    in_file = '1.1_exprMatrix_filtered_t.txt'
    tf_file = '1.2_inputTFs.txt'
    out_file = 'net1_grn_output.tsv'
    ex_matrix = pd.read_csv(in_file, sep='\t')
    tf_names = load_tf_names(tf_file)
    network = grnboost2(expression_data=ex_matrix,
                        tf_names=tf_names)
    network.to_csv(out_file, sep='\t', index=False, header=False)

pandas and arboreto were installed successfully before I uploaded this task. I got the following error message:

/lustre/home/acct-bmelgn/bmelgn-3/.conda/envs/mypython3/lib/python3.7/site-packages/distributed/utils.py:134: RuntimeWarning: Couldn't detect a suitable IP address for reaching '8.8.8.8', defaulting to '127.0.0.1': [Errno 101] Network is unreachable

I followed the example in https://arboreto.readthedocs.io/en/latest/, which does not import 'distributed'. But the error message seemed to tell me that 'distributed' is trying to connect to the internet. I wonder whether 'distributed' can be avoided when I run GRNBoost. Is there any suggestion for running arboreto on such a server? Thanks for your help.

PS: at first, I followed the example in https://arboreto.readthedocs.io/en/latest/examples.html, which indeed imports 'distributed'. But now I use the code listed above (which also comes from your example), which seems to have nothing to do with 'distributed'.
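
A minimal sketch of one way to avoid the IP probe (an assumption on my part: it presumes a distributed version whose LocalCluster accepts a host argument; note the message above is only a warning, and distributed already falls back to 127.0.0.1 and continues):

import pandas as pd
from distributed import Client, LocalCluster
from arboreto.utils import load_tf_names
from arboreto.algo import grnboost2

if __name__ == '__main__':
    ex_matrix = pd.read_csv('1.1_exprMatrix_filtered_t.txt', sep='\t')
    tf_names = load_tf_names('1.2_inputTFs.txt')

    # bind the cluster explicitly to localhost so no external IP lookup is needed
    local_cluster = LocalCluster(host='127.0.0.1', n_workers=4, threads_per_worker=1)
    client = Client(local_cluster)

    network = grnboost2(expression_data=ex_matrix,
                        tf_names=tf_names,
                        client_or_address=client)

    client.close()
    local_cluster.close()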

network = grnboost2(.. takes too long. Something wrong?

Hi!

I'm trying to run the GRNBoost2 GRN algorithm on a matrix of shape (11744, 9031) on my cluster with 104 GB RAM / Intel(R) Xeon(R) CPU @ 2.30GHz / 16 CPUs. So far it has been running for over 20 hours.

I have a couple of questions:

  1. Should I use the raw expression matrix? Or should I use the log-transformed/normalized expression matrix?

  2. I don't have a list of TFs. Can I put in all the genes present in the dataset, or leave it blank?

Thanks in advance for any help!

Best,

Francisco Grisanti

AttributeError

Hi,
I tried the Example 01 - GRNBoost2 local, https://nbviewer.jupyter.org/github/tmoerman/arboreto/blob/master/notebooks/examples/ex_01_grnboost2_local.ipynb, and I got the following error:
AttributeError Traceback (most recent call last)
in <module>()
1 import os
----> 2 from arboreto.algo import grnboost2, genie3

D:\Anaconda\lib\site-packages\arboreto\algo.py in <module>()
4
5 import pandas as pd
----> 6 from distributed import Client, LocalCluster
7 from arboreto.core import create_graph, SGBM_KWARGS, RF_KWARGS, EARLY_STOP_WINDOW_LENGTH
8

D:\Anaconda\lib\site-packages\distributed\__init__.py in <module>()
      1 from __future__ import print_function, division, absolute_import
2
----> 3 from . import config
4 from dask.config import config
5 from .actor import Actor, ActorFuture

D:\Anaconda\lib\site-packages\distributed\config.py in <module>()
11 from .compatibility import logging_names
12
---> 13 config = dask.config.config
14
15

AttributeError: module 'dask' has no attribute 'config'

arboreto package available in bioconda

Hi, just an FYI: in the framework of setting the scene for more reproducible computational biology workflows, the arboreto package is part of a pilot and has been added to bioconda.
Furthermore, we have targeted multiprocessing_on_dill at conda-forge to finally provide pySCENIC in bioconda. Cheers!

Dask error in GRNBoost2

Hello,
I incorporated pySCENIC into my workflow a while ago, and it's been working pretty well for me. I last used it a month or two ago, and when I went to run my code again, I started getting a bunch of dask errors out of the blue.

if __name__ == '__main__':  # required to run outside jupyter notebook

    import os  # used to interface with the operating system on a basic level
    import glob  # finds pathnames for UNIX
    import pickle  # allows serialization of objects
    import pandas as pd  # required for data array manipulation

    from dask.diagnostics import ProgressBar  # creates progress bar to check completion
    from distributed import Client, LocalCluster

    from arboreto.utils import load_tf_names
    from arboreto.algo import grnboost2

    from pyscenic.rnkdb import FeatherRankingDatabase as RankingDatabase  # imports cisTarget database metadata
    from pyscenic.utils import modules_from_adjacencies, load_motifs  # creates modules from GENIE3 adjacencies
    from pyscenic.prune import prune2df, df2regulons
    from pyscenic.aucell import aucell

    # load paths for repeatedly invoked files
    motifs_filename = os.path.join("motifs.csv")
    regulons_filename = os.path.join("regulons.p")
    TF_list_filename = os.path.join("mm_tfs.txt")

    # this cell creates the TF list if not made yet
    tf_raw = pd.read_csv("TF_import.txt", delimiter="\t",
                         error_bad_lines=False, encoding="ISO-8859-1")
    tfs = tf_raw[["Gene ID", "Evidence Strength"]].drop_duplicates().dropna()  # annotations
    tfs["ID"] = list(map(int, tfs["Gene ID"]))
    conv_tfs = pd.read_csv("TF_conversion.txt", delimiter="\t")

    def extract_symbol(name):
        s_idx = name.rfind('(')
        e_idx = name.rfind(')')
        return name[s_idx+1:e_idx]

    conv_tfs["Gene Name"].apply(extract_symbol).to_csv(TF_list_filename, index=False)
    tf_names = load_tf_names(TF_list_filename)
    ex_matrix = pd.read_csv("GENIE3_import.csv", header=0, index_col=0).T
    databases_glob = os.path.join("mm10__*.feather")
    db_fnames = glob.glob(databases_glob)

    def name(fname):
        return os.path.basename(fname).split(".")[0]

    dbs = [RankingDatabase(fname=fname, name=name(fname)) for fname in db_fnames]

    client = Client(LocalCluster())

    adjacencies = grnboost2(ex_matrix, tf_names=tf_names, verbose=True, client_or_address=client)
    modules = list(modules_from_adjacencies(adjacencies, ex_matrix))

And then I get this error upon running it:

adjacencies = grnboost2(ex_matrix, tf_names=tf_names, verbose=True, client_or_address=client)
preparing dask client
parsing input
creating dask graph
~/site-packages/arboreto/algo.py:214: FutureWarning: Method .as_matrix will be removed in a future version. Use .values instead.
  expression_matrix = expression_data.as_matrix()
12 partitions
computing dask graph
distributed.protocol.core - CRITICAL - Failed to deserialize

I don't really know much about Dask, do you have an idea what could be causing these errors?

Dask error: Too many processors

Hi Thomas,

I have been trying out the pySCENIC tutorial script with my own test data, but have consistently hit the same types of Dask-related errors when running arboretum.grnboost2 in two different Linux environments (a SLURM-managed single node and a big multi-core Linux box). Your colleague @bramvds wrote the tutorial, but it fails at the arboretum function call. Following the tutorial exactly as written, which sets up a default LocalCluster internal to the function call, produced the same errors as explicitly controlling the number of workers, as shown below:

local_cluster = LocalCluster(n_workers=31, 
                             threads_per_worker=1)

custom_client = Client(local_cluster)
print(custom_client)

adjacencies = grnboost2(ex_matrix, 
                        tf_names=tf_names, 
                        verbose=True,
                        client_or_address=custom_client)

I attached the error logs in the hope that you could diagnose what is going on. I think the diagnostic message is pretty clear to someone familiar with the Dask framework, but I am not.

RuntimeError:
An attempt has been made to start a new process before the
current process has finished its bootstrapping phase.
This probably means that you are not using fork to start your
child processes and you have forgotten to use the proper idiom
in the main module:

            if __name__ == '__main__':
                freeze_support()

The "freeze_support()" line can be omitted if the program is not going to be frozen to produce an executable.

The same error shows up for this other python project on GitHub here.

I should mention that I installed pySCENIC with pip as part of a recent miniconda3 python 3.6.5 environment.

Thanks in advance,
Chris Conley
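
A minimal sketch of the main-guard idiom that the RuntimeError above points at (input paths are hypothetical): everything that spawns worker processes must run under the guard so that child processes importing the module do not re-execute the cluster setup.

import pandas as pd
from distributed import Client, LocalCluster
from arboreto.utils import load_tf_names
from arboreto.algo import grnboost2

if __name__ == '__main__':
    ex_matrix = pd.read_csv('expression.tsv', sep='\t')  # hypothetical path
    tf_names = load_tf_names('tfs.txt')                  # hypothetical path

    # cluster creation happens only in the main process, never on import
    local_cluster = LocalCluster(n_workers=31, threads_per_worker=1)
    custom_client = Client(local_cluster)

    adjacencies = grnboost2(ex_matrix,
                            tf_names=tf_names,
                            verbose=True,
                            client_or_address=custom_client)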

RuntimeError

Hi!

I created a python virtual environment using conda, and installed arboreto using the conda package manager.
I downloaded example files (from here https://github.com/tmoerman/arboreto/tree/master/resources/dream5/net1) and was trying to run run_grnboost2.py. I got the following error:

Exception ignored in: <generator object add_client at 0x1c1ca87830>
RuntimeError: generator ignored GeneratorExit
tornado.application - ERROR - Exception in Future <Future cancelled> after timeout
Traceback (most recent call last):
  File "/projects/djin_prj/Installs/conda/arboreto/lib/python3.6/site-packages/tornado/gen.py", line 970, in error_callback
    future.result()
concurrent.futures._base.CancelledError

Despite showing an error, it created an output file with 318832 lines.

I also created a python virtual environment using virtualenv, and installed arboreto using pip.
I ran the same run_grnboost2.py and it didn't report any error. It also created an output file, with 318827 lines (fewer lines compared to the previous one).

Given that the second output file contains fewer lines of results, I wasn't sure whether the second method worked successfully, or whether it also stopped in the middle but somehow didn't show any errors on screen...

Could you please help me? Thanks!

UPDATE:
I ran run_grnboost2.py in the virtual environment created by virtualenv (the second method). Again, it didn't show any error, and created an output file. But this time the output file contained 318779 lines. I don't think arboreto finished successfully...

load_tf_names command not working

Hello,

I am trying to utilize Arboreto, but am having difficulty. I am running the following code in a Jupyter Notebook:

import os
import pandas as pd

from arboreto.algo import grnboost2, genie3
from arboreto.utils import load_tf_names

if __name__ == '__main__':
    ex_matrix = pd.read_csv('myDir/1.1_exprMatrix_filtered_t.txt', sep='\t')

    tf_names = load_tf_names('myDir/1.1_inputTFs.txt')

    network = grnboost2(expression_data=ex_matrix,
                        tf_names=tf_names)

    network.to_csv('myDir/sc_01_network.tsv', sep='\t', header=False, index=False)

but get an error:

----> 7 tf_names=tf_names)
...
AttributeError: 'DataFrame' object has no attribute 'as_matrix'

Is this an incompatibility with the new pandas version? I am using the following:

Pandas version: 1.0.0
arboreto version: 0.1.5

Thank you!
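
One possible workaround, offered as a sketch rather than an official fix (my assumption: the deprecated .as_matrix call is only reached when a DataFrame is passed in, so handing over a plain matrix plus explicit gene names sidesteps it; pinning pandas below 1.0 or upgrading arboreto are alternatives):

import pandas as pd
from arboreto.algo import grnboost2
from arboreto.utils import load_tf_names

if __name__ == '__main__':
    ex_matrix = pd.read_csv('myDir/1.1_exprMatrix_filtered_t.txt', sep='\t')
    tf_names = load_tf_names('myDir/1.1_inputTFs.txt')

    # pass a NumPy array plus explicit gene names, avoiding the DataFrame
    # code path that calls the removed .as_matrix()
    network = grnboost2(expression_data=ex_matrix.to_numpy(),
                        gene_names=list(ex_matrix.columns),
                        tf_names=tf_names)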

TypeError from dask_expr/io/_delayed.py when running `grnboost2`

When running the Example 01, I got the following error message at the cell [9] where grnboost2 is executed.

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
File <timed exec>:1

File ~/miniforge3/envs/arboreto-env/lib/python3.10/site-packages/arboreto/algo.py:39, in grnboost2(expression_data, gene_names, tf_names, client_or_address, early_stop_window_length, limit, seed, verbose)
     10 def grnboost2(expression_data,
     11               gene_names=None,
     12               tf_names='all',
   (...)
     16               seed=None,
     17               verbose=False):
     18     """
     19     Launch arboreto with [GRNBoost2] profile.
     20 
   (...)
     36     :return: a pandas DataFrame['TF', 'target', 'importance'] representing the inferred gene regulatory links.
     37     """
---> 39     return diy(expression_data=expression_data, regressor_type='GBM', regressor_kwargs=SGBM_KWARGS,
     40                gene_names=gene_names, tf_names=tf_names, client_or_address=client_or_address,
     41                early_stop_window_length=early_stop_window_length, limit=limit, seed=seed, verbose=verbose)

File ~/miniforge3/envs/arboreto-env/lib/python3.10/site-packages/arboreto/algo.py:120, in diy(expression_data, regressor_type, regressor_kwargs, gene_names, tf_names, client_or_address, early_stop_window_length, limit, seed, verbose)
    117 if verbose:
    118     print('creating dask graph')
--> 120 graph = create_graph(expression_matrix,
    121                      gene_names,
    122                      tf_names,
    123                      client=client,
    124                      regressor_type=regressor_type,
    125                      regressor_kwargs=regressor_kwargs,
    126                      early_stop_window_length=early_stop_window_length,
    127                      limit=limit,
    128                      seed=seed)
    130 if verbose:
    131     print('{} partitions'.format(graph.npartitions))

File ~/miniforge3/envs/arboreto-env/lib/python3.10/site-packages/arboreto/core.py:450, in create_graph(expression_matrix, gene_names, tf_names, regressor_type, regressor_kwargs, client, target_genes, limit, include_meta, early_stop_window_length, repartition_multiplier, seed)
    448 # gather the DataFrames into one distributed DataFrame
    449 all_links_df = from_delayed(delayed_link_dfs, meta=_GRN_SCHEMA)
--> 450 all_meta_df = from_delayed(delayed_meta_dfs, meta=_META_SCHEMA)
    452 # optionally limit the number of resulting regulatory links, descending by top importance
    453 if limit:

File ~/miniforge3/envs/arboreto-env/lib/python3.10/site-packages/dask_expr/io/_delayed.py:93, in from_delayed(dfs, meta, divisions, verify_meta)
     90     dfs = [dfs]
     92 if len(dfs) == 0:
---> 93     raise TypeError("Must supply at least one delayed object")
     95 if meta is None:
     96     meta = delayed(make_meta)(dfs[0]).compute()

TypeError: Must supply at least one delayed object

I would be grateful if you could tell me how to resolve this.

from_delayed error when running grnboost2

Hi,

I installed arboreto through pip and I can't successfully run the following code using the example data provided in this repository:

import pandas as pd
from distributed import Client, LocalCluster
from arboreto.utils import load_tf_names
from arboreto.algo import grnboost2

in_file  = 'net1_expression_data.tsv'
tf_file  = 'net1_transcription_factors.tsv'
out_file = 'net1_grn_output.tsv'

# ex_matrix is a DataFrame with gene names as column names
ex_matrix = pd.read_csv(in_file, sep='\t')

# tf_names is read using a utility function included in Arboreto
tf_names = load_tf_names(tf_file)

# compute the GRN
network = grnboost2(expression_data=ex_matrix, tf_names=tf_names)

When I run the code I get the following error, which prevents the program from running:

Exception: ValueError("Metadata mismatch found in `from_delayed`.\n\nThe columns in the computed data do not match the columns in the provided metadata.\n Index([u'TF', u'importance', u'target'], dtype='object')\n  :Index([u'TF', u'target', u'importance'], dtype='object')",)

Traceback (most recent call last):
  File "<stdin>", line 2, in <module>
  File "/home/compgen/users/scastillo/.local/lib/python2.7/site-packages/arboreto/algo.py", line 41, in grnboost2
    early_stop_window_length=early_stop_window_length, limit=limit, seed=seed, verbose=verbose)
  File "/home/compgen/users/scastillo/.local/lib/python2.7/site-packages/arboreto/algo.py", line 135, in diy
    .compute(graph, sync=True) \
  File "/home/compgen/users/scastillo/.local/lib/python2.7/site-packages/distributed/client.py", line 2758, in compute
    result = self.gather(futures)
  File "/home/compgen/users/scastillo/.local/lib/python2.7/site-packages/distributed/client.py", line 1822, in gather
    asynchronous=asynchronous,
  File "/home/compgen/users/scastillo/.local/lib/python2.7/site-packages/distributed/client.py", line 753, in sync
    return sync(self.loop, func, *args, **kwargs)
  File "/home/compgen/users/scastillo/.local/lib/python2.7/site-packages/distributed/utils.py", line 331, in sync
    six.reraise(*error[0])
  File "/home/compgen/users/scastillo/.local/lib/python2.7/site-packages/distributed/utils.py", line 316, in f
    result[0] = yield future
  File "/home/compgen/users/scastillo/.local/lib/python2.7/site-packages/tornado/gen.py", line 1133, in run
    value = future.result()
  File "/home/compgen/users/scastillo/.local/lib/python2.7/site-packages/tornado/concurrent.py", line 261, in result
    raise_exc_info(self._exc_info)
  File "/home/compgen/users/scastillo/.local/lib/python2.7/site-packages/tornado/gen.py", line 1141, in run
    yielded = self.gen.throw(*exc_info)
  File "/home/compgen/users/scastillo/.local/lib/python2.7/site-packages/distributed/client.py", line 1653, in _gather
    six.reraise(type(exception), exception, traceback)
  File "/home/compgen/users/scastillo/.local/lib/python2.7/site-packages/dask/dataframe/utils.py", line 598, in check_meta
    errmsg))
ValueError: Metadata mismatch found in `from_delayed`.

The columns in the computed data do not match the columns in the provided metadata.
 Index([u'TF', u'importance', u'target'], dtype='object')
  :Index([u'TF', u'target', u'importance'], dtype='object')

Is there something I am doing wrong? I have had the same error on two different machines with my own data, which made me look into using the provided examples.

Thank you.

GRNBoost2 not available in R 4.0?

Also posted here - aertslab/GRNBoost#7, not sure which repository is more appropriate.

Anyway, trying to install (through BiocManager) in RStudio gives this error:

Warning message:
package ‘GRNBoost2’ is not available (for R version 4.0.2) 

Is this something users can circumvent (without reverting to R 3 and causing other incompatibility issues)?

grnboost error

Hello,
When I run "network = grnboost2(expression_data=ex_matrix,tf_names=tf_names)" according to the tutorial, the following error occured:
here is a snippet of the errors:

distributed.core - ERROR - Timed out trying to connect to 'tcp://127.0.0.1:33261' after 10 s: connect() didn't finish in time
tornado.util.TimeoutError: Timeout
OSError: Timed out trying to connect to 'tcp://127.0.0.1:33261' after 10 s: connect() didn't finish in time
tornado.application - ERROR - Exception in Future after timeout
distributed.comm.tcp - WARNING - Closing dangling stream in
distributed.comm.tcp - WARNING - Closing dangling stream in
distributed.comm.tcp - WARNING - Closing dangling stream in

Pandas version of as_matrix is deprecated

I was trying to use pandas version 1.0.0 with grnboost2. However, I ran into a problem where the as_matrix call inside grnboost2 does not work because it has been deprecated (I think people use .values instead). Do you know a way to use an older version of pandas so I do not run into this problem?

Thanks, I am somewhat new to programming.

grnboost2: No result, without error message

Hi,
I have successfully installed and run arboreto on a laptop PC. I tried to run an example based on the ex_01_grnboost2_local.ipynb notebook that you kindly provide, with a dataset very similar in size to the one you are using.
The box...
%%time
network = grnboost2(expression_data=ex_matrix,
                    tf_names=tf_names)
...appeared to run successfully, took about 3 minutes, printed the time report, and did not produce any error message whatsoever.
However, the next command
network.head()
...led to the message that "network" does not exist (so it was not generated).
Any idea what could be the problem, please?
Thanks, best, M.

Error when running GRNBoost2

I am running GRNBoost2 as part of pySCENIC and encountered an error that I have not seen posts about previously. Here is what the console returns when I run it:

preparing dask client

parsing input

creating dask graph

6 partitions

computing dask graph

distributed.utils_perf - WARNING - full garbage collections took 11% CPU time recently (threshold: 10%) (this message repeats a few dozen more times)

distributed.nanny - WARNING - Worker process 28580 exited with status 1

shutting down client and local cluster

distributed.nanny - WARNING - Worker process 28581 exited with status 1

distributed.nanny - WARNING - Worker process 28585 exited with status 1

distributed.nanny - WARNING - Worker process 28582 exited with status 1

distributed.nanny - WARNING - Worker process 28584 exited with status 1

distributed.nanny - WARNING - Worker process 28583 exited with status 1

finished

tornado.application - ERROR - Exception in callback <bound method SystemMonitor.update of <SystemMonitor: cpu: 8 memory: 11737 MB fds: 98>>

Traceback (most recent call last):

File "/Users/chip18/anaconda3/lib/python3.7/site-packages/tornado/ioloop.py", line 907, in _run return self.callback()

File "/Users/chip18/anaconda3/lib/python3.7/site-packages/distributed/system_monitor.py", line 72, in update

`read_bytes = (ioc.bytes_recv - last.bytes_recv) / (duration or 0.5)`

AttributeError: 'NoneType' object has no attribute 'bytes_recv'

Any help would be appreciated.

remove dependency on distributed

See the TPOT project for pointers on how to have only dask as a dependency and let dask implicitly determine whether a distributed scheduler should be used or not.

Observation:

  • the life cycle of the distributed client is completely absent from the TPOT code. This is also desirable for Arboreto; see the sketch below.
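
A minimal sketch of that direction, using plain dask only (illustrative, not arboreto's current code; infer_partial_network here is a hypothetical stand-in for the per-target-gene regression step):

import dask
from dask import delayed

def infer_partial_network(target_gene):
    # hypothetical placeholder for the per-target-gene regression task
    return target_gene

# build a graph of independent per-gene tasks
tasks = [delayed(infer_partial_network)(g) for g in ['gene1', 'gene2', 'gene3']]

# dask.compute uses the default local scheduler unless a distributed Client
# has been registered as the default, in which case it is picked up implicitly
results = dask.compute(*tasks)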

"No dispatch for <class 'dict'>" error

I tried this on Python 3.7.10 and 3.9.4. Any thoughts on how to fix this? Thanks!

>>> from arboreto.core import *
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/smarkson/fiddling/arboreto_fiddling/arboreto/arboreto/core.py", line 371, in <module>
    _GRN_SCHEMA = make_meta({'TF': str, 'target': str, 'importance': float})
  File "/home/smarkson/miniconda3/envs/arboreto-env/lib/python3.9/site-packages/dask/utils.py", line 511, in __call__
    meth = self.dispatch(type(arg))
  File "/home/smarkson/miniconda3/envs/arboreto-env/lib/python3.9/site-packages/dask/utils.py", line 505, in dispatch
    raise TypeError("No dispatch for {0}".format(cls))
TypeError: No dispatch for <class 'dict'>

Environment details:

#
# Name                    Version                   Build  Channel
_libgcc_mutex             0.1                 conda_forge    conda-forge
_openmp_mutex             4.5                       1_gnu    conda-forge
arboreto                  0.1.6                     dev_0    <develop>
bokeh                     2.3.2            py39hf3d152e_0    conda-forge
ca-certificates           2021.5.30            ha878542_0    conda-forge
certifi                   2021.5.30        py39hf3d152e_0    conda-forge
click                     8.0.1            py39hf3d152e_0    conda-forge
cloudpickle               1.6.0                      py_0    conda-forge
cytoolz                   0.11.0           py39h3811e60_3    conda-forge
dask                      2021.5.1           pyhd8ed1ab_0    conda-forge
dask-core                 2021.5.1           pyhd8ed1ab_0    conda-forge
distributed               2021.5.1         py39hf3d152e_0    conda-forge
freetype                  2.10.4               h0708190_1    conda-forge
fsspec                    2021.5.0           pyhd8ed1ab_0    conda-forge
heapdict                  1.0.1                      py_0    conda-forge
jbig                      2.1               h7f98852_2003    conda-forge
jinja2                    3.0.1              pyhd8ed1ab_0    conda-forge
joblib                    1.0.1              pyhd8ed1ab_0    conda-forge
jpeg                      9d                   h36c2ea0_0    conda-forge
lcms2                     2.12                 hddcbb42_0    conda-forge
ld_impl_linux-64          2.35.1               hea4e1c9_2    conda-forge
lerc                      2.2.1                h9c3ff4c_0    conda-forge
libblas                   3.9.0                9_openblas    conda-forge
libcblas                  3.9.0                9_openblas    conda-forge
libdeflate                1.7                  h7f98852_5    conda-forge
libffi                    3.3                  h58526e2_2    conda-forge
libgcc-ng                 9.3.0               h2828fa1_19    conda-forge
libgfortran-ng            9.3.0               hff62375_19    conda-forge
libgfortran5              9.3.0               hff62375_19    conda-forge
libgomp                   9.3.0               h2828fa1_19    conda-forge
liblapack                 3.9.0                9_openblas    conda-forge
libopenblas               0.3.15          pthreads_h8fe5266_1    conda-forge
libpng                    1.6.37               h21135ba_2    conda-forge
libstdcxx-ng              9.3.0               h6de172a_19    conda-forge
libtiff                   4.3.0                hf544144_1    conda-forge
libwebp-base              1.2.0                h7f98852_2    conda-forge
locket                    0.2.0                      py_2    conda-forge
lz4-c                     1.9.3                h9c3ff4c_0    conda-forge
markupsafe                2.0.1            py39h3811e60_0    conda-forge
msgpack-python            1.0.2            py39h1a9c180_1    conda-forge
ncurses                   6.2                  h58526e2_4    conda-forge
numpy                     1.20.3           py39hdbf815f_1    conda-forge
olefile                   0.46               pyh9f0ad1d_1    conda-forge
openjpeg                  2.4.0                hb52868f_1    conda-forge
openssl                   1.1.1k               h7f98852_0    conda-forge
packaging                 20.9               pyh44b312d_0    conda-forge
pandas                    1.2.4            py39hde0f152_0    conda-forge
partd                     1.2.0              pyhd8ed1ab_0    conda-forge
pillow                    8.2.0            py39hf95b381_1    conda-forge
pip                       21.1.2             pyhd8ed1ab_0    conda-forge
psutil                    5.8.0            py39h3811e60_1    conda-forge
pyparsing                 2.4.7              pyh9f0ad1d_0    conda-forge
python                    3.9.4           hffdb5ce_0_cpython    conda-forge
python-dateutil           2.8.1                      py_0    conda-forge
python_abi                3.9                      1_cp39    conda-forge
pytz                      2021.1             pyhd8ed1ab_0    conda-forge
pyyaml                    5.4.1            py39h3811e60_0    conda-forge
readline                  8.1                  h46c0cb4_0    conda-forge
scikit-learn              0.24.2           py39h4dfa638_0    conda-forge
scipy                     1.6.3            py39hee8e79c_0    conda-forge
setuptools                49.6.0           py39hf3d152e_3    conda-forge
six                       1.16.0             pyh6c4a22f_0    conda-forge
sortedcontainers          2.4.0              pyhd8ed1ab_0    conda-forge
sqlite                    3.35.5               h74cdb3f_0    conda-forge
tblib                     1.7.0              pyhd8ed1ab_0    conda-forge
threadpoolctl             2.1.0              pyh5ca1d4c_0    conda-forge
tk                        8.6.10               h21135ba_1    conda-forge
toolz                     0.11.1                     py_0    conda-forge
tornado                   6.1              py39h3811e60_1    conda-forge
typing_extensions         3.7.4.3                    py_0    conda-forge
tzdata                    2021a                he74cb21_0    conda-forge
wheel                     0.36.2             pyhd3deb0d_0    conda-forge
xz                        5.2.5                h516909a_1    conda-forge
yaml                      0.2.5                h516909a_0    conda-forge
zict                      2.0.0                      py_0    conda-forge
zlib                      1.2.11            h516909a_1010    conda-forge
zstd                      1.5.0                ha95c52a_0    conda-forge

error when running grnboost

Hello,

I am implementing the pySCENIC program and ran into a problem with the grnboost package. I followed the instructions and wrote my code similar to this:

import pandas as pd
from arboreto.utils import load_tf_names
from arboreto.algo import grnboost2

if __name__ == '__main__':
    # load the data
    ex_matrix = pd.read_csv(<ex_path>, sep='\t')
    tf_names = load_tf_names(<tf_path>)
    network = grnboost2(expression_data=ex_matrix, tf_names=tf_names)
pySCENIC works fine with a small data set of 250 genes; however, for a bigger data set that I am testing (~2,000 genes or more), this is the error I got:

UserWarning: Large object of size 1.17 MB detected in task graph:
(["('from-delayed-7f2fea60c7dfbbfb0ec7f83dc75b83af ... af', 19972)"],)
Consider scattering large objects ahead of time
with client.scatter to reduce scheduler burden and
keep data on workers

future = client.submit(func, big_data)    # bad

big_future = client.scatter(big_data)     # good
future = client.submit(func, big_future)  # good

% (format_bytes(len(b)), s))

The program got stuck at this point and never finished when I ran it on a MacBook Pro (2.6 GHz i7). I also tried the command-line version as pyscenic grnboost -o OUTPUT @grn_args.txt, in which grn_args.txt contains the names of the expression matrix and known TF file; the expression matrix input has cell IDs as rows and genes as columns.
What would you think is the issue here?

Thank you,
Diep

GRNboost worker memory usage

Sorry for posting in pySCENIC; I was expecting to find the arboreto repo in the Aerts lab Git. Here is the problem:

When running

from arboretum import algo
import pandas as pd

geneData = pd.read_csv("my-count-data.csv", index_col=[0], header=0)
network = algo.grnboost2(expression_data=geneData.T)

among multiple warnings I get the following message:

Worker is at 89% memory usage. Pausing worker. Process memory: 5.04 GB -- Worker memory limit: 5.62 GB


As far as I understand, this message comes from dask and can be alleviated by changing dask's memory-limit settings, but I am not sure how to do that... Shall I import dask prior to GRNboost and change the settings first? Are there any hidden options for accessing dask settings via GRNboost itself?
Thanks in advance!

P.S. I am running python 3.7, arboretum 0.1.3 on Ubuntu 16.04.
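
A minimal sketch of one way to control this (assuming raising the per-worker cap suits the machine): LocalCluster accepts a per-worker memory_limit, and the resulting client can be handed to grnboost2 via client_or_address instead of tweaking global dask settings.

import pandas as pd
from distributed import Client, LocalCluster
from arboretum import algo

if __name__ == '__main__':
    geneData = pd.read_csv("my-count-data.csv", index_col=[0], header=0)

    # raise the per-worker memory cap (the value here is an example, not a recommendation)
    local_cluster = LocalCluster(n_workers=4,
                                 threads_per_worker=1,
                                 memory_limit='8GB')
    client = Client(local_cluster)

    network = algo.grnboost2(expression_data=geneData.T,
                             client_or_address=client)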

GRNboost2 importance vs GENIE3 weights

Hi, I would like to ask what the difference is between:
i) the importance calculated by GRNBoost2 and
ii) the weights calculated by GENIE3.

Are these values comparable in any way?
Or can I scale the numbers so that they are comparable?

msgpack.exceptions.ExtraData: unpack(b) received extra data.

I am trying to run GRNBoost from arboreto to infer co-expression modules in a Jupyter notebook, on an Apple silicon MacBook Pro.

Specifically, I am trying to follow along with the steps shown in pySCENIC - Full pipeline.ipynb.

I was given this "warning error" (the execution is still running), but I am struggling to understand what is causing the issue and why it is happening.

from arboreto.utils import load_tf_names
from arboreto.algo import grnboost2
...
adjancencies = grnboost2(expression_data=expression_matrix_df, tf_names=tf_names, verbose=True)
display(adjancencies.head())

This is the output (the execution is taking some time)

preparing dask client

Numba: Attempted to fork from a non-main thread, the TBB library may be in an invalid state in the child process.
Numba: Attempted to fork from a non-main thread, the TBB library may be in an invalid state in the child process.
Numba: Attempted to fork from a non-main thread, the TBB library may be in an invalid state in the child process.
Numba: Attempted to fork from a non-main thread, the TBB library may be in an invalid state in the child process.
Numba: Attempted to fork from a non-main thread, the TBB library may be in an invalid state in the child process.

parsing input
creating dask graph
4 partitions
computing dask graph

/Users/zach/anaconda3/lib/python3.11/site-packages/distributed/client.py:3125: UserWarning: Sending large graph of size 466.82 MiB.
This may cause some slowdown.
Consider scattering data ahead of time and using futures.
  warnings.warn(
2023-07-18 19:25:05,533 - distributed.protocol.core - CRITICAL - Failed to deserialize
Traceback (most recent call last):
  File "/Users/zach/anaconda3/lib/python3.11/site-packages/distributed/protocol/core.py", line 158, in loads
    return msgpack.loads(
           ^^^^^^^^^^^^^^
  File "/Users/zach/anaconda3/lib/python3.11/site-packages/msgpack/fallback.py", line 136, in unpackb
    raise ExtraData(ret, unpacker._get_extradata())
msgpack.exceptions.ExtraData: unpack(b) received extra data.
2023-07-18 19:25:05,536 - distributed.core - ERROR - Exception while handling op register-client
Traceback (most recent call last):
  File "/Users/zach/anaconda3/lib/python3.11/site-packages/distributed/core.py", line 924, in _handle_comm
    result = await result
             ^^^^^^^^^^^^
  File "/Users/zach/anaconda3/lib/python3.11/site-packages/distributed/scheduler.py", line 5449, in add_client
    await self.handle_stream(comm=comm, extra={"client": client})
  File "/Users/zach/anaconda3/lib/python3.11/site-packages/distributed/core.py", line 977, in handle_stream
    msgs = await comm.read()
           ^^^^^^^^^^^^^^^^^
  File "/Users/zach/anaconda3/lib/python3.11/site-packages/distributed/comm/tcp.py", line 254, in read
    msg = await from_frames(
          ^^^^^^^^^^^^^^^^^^
  File "/Users/zach/anaconda3/lib/python3.11/site-packages/distributed/comm/utils.py", line 100, in from_frames
    res = _from_frames()
          ^^^^^^^^^^^^^^
  File "/Users/zach/anaconda3/lib/python3.11/site-packages/distributed/comm/utils.py", line 83, in _from_frames
    return protocol.loads(
           ^^^^^^^^^^^^^^^
  File "/Users/zach/anaconda3/lib/python3.11/site-packages/distributed/protocol/core.py", line 158, in loads
    return msgpack.loads(
           ^^^^^^^^^^^^^^
  File "/Users/zach/anaconda3/lib/python3.11/site-packages/msgpack/fallback.py", line 136, in unpackb
    raise ExtraData(ret, unpacker._get_extradata())
msgpack.exceptions.ExtraData: unpack(b) received extra data.

Why does GENIE3/A outperform GENIE3?

In the paper, GENIE3/A was briefly mentioned to be a re-implementation of GENIE3, presumably for parallelizing (Dask).

However, in the comparison diagram, GENIE3/A consistently outperformed GENIE3 slightly.


Can the authors please clarify what else changed to cause this improvement in AUROC/AUPRC performance?

requirements.txt is missing from PyPI source tarball

requirements.txt is required by setup.py, but it's missing from the PyPI source tarball, which leads to the following when trying to install arboreto 0.1.6 from source:

    Running command python setup.py egg_info
    Traceback (most recent call last):
      File "<string>", line 1, in <module>
      File "/tmp/eb-hq4sf4y2/pip-req-build-q_upe322/setup.py", line 17, in <module>
        install_requires=read_requirements('requirements.txt'),
      File "/tmp/eb-hq4sf4y2/pip-req-build-q_upe322/setup.py", line 4, in read_requirements
        with open(fname, 'r', encoding='utf-8') as file:
    FileNotFoundError: [Errno 2] No such file or directory: 'requirements.txt'

Garbage collection issue with GRNBoost2

I'm running Arboreto's implementation of GRNBoost2 via the pySCENIC command line, but I figured this issue probably belongs here. I get the following warning, which repeats a number of times throughout the run. It seems like most of the time GRNBoost2 does complete successfully, but it would be nice to avoid the performance hit which this warning seems to imply. Any ideas on how to solve this? I've noticed that it may occur more often with larger expression matrices (10,000 cells, 20,000 genes). I'm using dask v1.0.0, if that helps.

distributed.utils_perf - WARNING - full garbage collections took 10% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 11% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 11% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 11% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 11% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 11% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 11% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 11% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 11% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 11% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 11% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 11% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 11% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 11% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 12% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 12% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 12% CPU time recently (threshold: 10%)

Thanks for any help you can provide.
Chris

Exception: too many values to unpack (expected 1)

Hi,

I am running GRNBoost2 as part of pySCENIC and encountered an error that I have not seen posts about previously. Here is what the console returns when I run it:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/ranxiaojuan/miniconda3/lib/python3.7/site-packages/arboreto/algo.py", line 41, in grnboost2
    early_stop_window_length=early_stop_window_length, limit=limit, seed=seed, verbose=verbose)
  File "/home/ranxiaojuan/miniconda3/lib/python3.7/site-packages/arboreto/algo.py", line 128, in diy
    seed=seed)
  File "/home/ranxiaojuan/miniconda3/lib/python3.7/site-packages/arboreto/core.py", line 419, in create_graph
    future_tf_matrix = client.scatter(tf_matrix, broadcast=True)
  File "/home/ranxiaojuan/miniconda3/lib/python3.7/site-packages/distributed/client.py", line 2186, in scatter
    hash=hash,
  File "/home/ranxiaojuan/miniconda3/lib/python3.7/site-packages/distributed/client.py", line 845, in sync
    self.loop, func, *args, callback_timeout=callback_timeout, **kwargs
  File "/home/ranxiaojuan/miniconda3/lib/python3.7/site-packages/distributed/utils.py", line 353, in sync
    raise exc.with_traceback(tb)
  File "/home/ranxiaojuan/miniconda3/lib/python3.7/site-packages/distributed/utils.py", line 336, in f
    result[0] = yield future
  File "/home/ranxiaojuan/miniconda3/lib/python3.7/site-packages/tornado/gen.py", line 762, in run
    value = future.result()
  File "/home/ranxiaojuan/miniconda3/lib/python3.7/site-packages/distributed/client.py", line 2073, in _scatter
    timeout=timeout,
  File "/home/ranxiaojuan/miniconda3/lib/python3.7/site-packages/distributed/core.py", line 862, in send_recv_from_rpc
    result = await send_recv(comm=comm, op=key, **kwargs)
  File "/home/ranxiaojuan/miniconda3/lib/python3.7/site-packages/distributed/core.py", line 661, in send_recv
    raise exc.with_traceback(tb)
  File "/home/ranxiaojuan/miniconda3/lib/python3.7/site-packages/distributed/core.py", line 501, in handle_comm
    result = await result
  File "/home/ranxiaojuan/miniconda3/lib/python3.7/site-packages/distributed/scheduler.py", line 5038, in scatter
    nthreads, data, rpc=self.rpc, report=False
  File "/home/ranxiaojuan/miniconda3/lib/python3.7/site-packages/distributed/utils_comm.py", line 149, in scatter_to_workers
    for address, v in d.items()
  File "/home/ranxiaojuan/miniconda3/lib/python3.7/site-packages/distributed/utils.py", line 237, in All
    result = await tasks.next()
  File "/home/ranxiaojuan/miniconda3/lib/python3.7/site-packages/distributed/core.py", line 862, in send_recv_from_rpc
    result = await send_recv(comm=comm, op=key, **kwargs)
  File "/home/ranxiaojuan/miniconda3/lib/python3.7/site-packages/distributed/core.py", line 663, in send_recv
    raise Exception(response["text"])
Exception: too many values to unpack (expected 1)

And I have more than 50,000 cells with 10,000 genes; could the dataset be too large?

How to set the number of cores used by GRNboost2

I cannot find any parameter that limits the number of cores used by GRNBoost2. I am using a shared server, and my code is blocking all the available resources. What would be a way to limit it to a predefined number of cores, similar to the '--num-worker' argument of the command-line grnboost call?
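
A minimal sketch of one way to cap core usage, reusing the explicit-client pattern shown in earlier issues (the input path is hypothetical): bound n_workers and threads_per_worker on a LocalCluster and pass the client via client_or_address.

import pandas as pd
from distributed import Client, LocalCluster
from arboreto.algo import grnboost2

if __name__ == '__main__':
    ex_matrix = pd.read_csv('expression.tsv', sep='\t')  # hypothetical path

    # use at most 8 single-threaded workers, i.e. roughly 8 cores
    local_cluster = LocalCluster(n_workers=8, threads_per_worker=1)
    client = Client(local_cluster)

    network = grnboost2(expression_data=ex_matrix,
                        client_or_address=client)

    client.close()
    local_cluster.close()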

crashed

Hey -
Not sure if this is an issue, but after using grnboost for 16 hours on 200 CPUs I got this error message:

distributed.utils - ERROR - ('infer_data-c4dbee66e029628605e1f69c6886d80d', 'tcp://10.132.0.10:42063')     
Traceback (most recent call last):                     
  File "/opt/miniconda3/lib/python3.6/site-packages/distributed/utils.py", line 238, in f                      
    result[0] = yield make_coro()                      
  File "/opt/miniconda3/lib/python3.6/site-packages/tornado/gen.py", line 1099, in run                         
    value = future.result()                            
  File "/opt/miniconda3/lib/python3.6/site-packages/tornado/gen.py", line 1107, in run                         
    yielded = self.gen.throw(*exc_info)                
  File "/opt/miniconda3/lib/python3.6/site-packages/distributed/client.py", line 1364, in _gather              
    traceback)                                         
  File "/opt/miniconda3/lib/python3.6/site-packages/six.py", line 693, in reraise                              
    raise value                                        
distributed.scheduler.KilledWorker: ('infer_data-c4dbee66e029628605e1f69c6886d80d', 'tcp://10.132.0.10:42063') 
Traceback (most recent call last):                     
  File "./run_grnboost.py", line 60, in <module>       
    client_or_address=cluster_client)                  
  File "/opt/miniconda3/lib/python3.6/site-packages/arboretum/algo.py", line 71, in genie3                     
    limit=limit, seed=seed, verbose=verbose)           
  File "/opt/miniconda3/lib/python3.6/site-packages/arboretum/algo.py", line 129, in diy                       
    .compute(graph, sync=True) \                       
  File "/opt/miniconda3/lib/python3.6/site-packages/distributed/client.py", line 2232, in compute              
    result = self.gather(futures)                      
  File "/opt/miniconda3/lib/python3.6/site-packages/distributed/client.py", line 1486, in gather               
    asynchronous=asynchronous)                         
  File "/opt/miniconda3/lib/python3.6/site-packages/distributed/client.py", line 608, in sync                  
    return sync(self.loop, func, *args, **kwargs)      
  File "/opt/miniconda3/lib/python3.6/site-packages/distributed/utils.py", line 254, in sync                   
    six.reraise(*error[0])                             
  File "/opt/miniconda3/lib/python3.6/site-packages/six.py", line 693, in reraise                              
    raise value                                        
  File "/opt/miniconda3/lib/python3.6/site-packages/distributed/utils.py", line 238, in f                      
    result[0] = yield make_coro()                      
  File "/opt/miniconda3/lib/python3.6/site-packages/tornado/gen.py", line 1099, in run                         
    value = future.result()                            
  File "/opt/miniconda3/lib/python3.6/site-packages/tornado/gen.py", line 1107, in run                         
    yielded = self.gen.throw(*exc_info)                
  File "/opt/miniconda3/lib/python3.6/site-packages/distributed/client.py", line 1364, in _gather              
    traceback)                                         
  File "/opt/miniconda3/lib/python3.6/site-packages/six.py", line 693, in reraise                              
    raise value                                        
distributed.scheduler.KilledWorker: ('infer_data-c4dbee66e029628605e1f69c6886d80d', 'tcp://10.132.0.10:42063') 

Inference without TF

Thanks for the great work!

Can I infer a network with GRNBoost2 or GENIE3 from the expression_matrix alone, without tf_names?
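
For what it's worth, the grnboost2 signature quoted in one of the tracebacks above defaults tf_names to 'all', which suggests (my reading, not a confirmed answer) that omitting tf_names treats every gene as a candidate regulator:

import pandas as pd
from arboreto.algo import grnboost2

if __name__ == '__main__':
    ex_matrix = pd.read_csv('expression.tsv', sep='\t')  # hypothetical path

    # tf_names defaults to 'all': every gene is considered a candidate regulator
    network = grnboost2(expression_data=ex_matrix)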
