
jakobrunge / tigramite


Tigramite is a Python package for causal inference with a focus on time series data. The Tigramite documentation is available at the home page below.

Home Page: https://jakobrunge.github.io/tigramite/

License: GNU General Public License v3.0

Python 13.72% Jupyter Notebook 86.28%

tigramite's People

Contributors

debe-kevin, dependabot[bot], ericstrobl, ewengillies, guenwi, jakobrunge, kennylori, martin-rabel-dlr, rebeccaherman1, shaypal5, tomhochsprung



tigramite's Issues

AttributeError: 'range' object has no attribute 'size'

Hey again! :)

I'm getting an AttributeError: 'range' object has no attribute 'size' when running run_pcmci. The error is thrown from one of the functions in independence_tests.py when it calls SciPy's curve_fit function.

I'm attaching the full log below, which includes the parameters with which the PCMCI object was initialized, and the ones with which run_pcmci was called. I'll add a data example soon.

Thank you again,
Shay

##
## Running Tigramite PC algorithm
##

Parameters:
selected_variables = [5, 6, 7, 8, 9, 10, 11, 12]
independence test = cmi_knn
tau_min = 1
tau_max = 24
pc_alpha = 0.1
max_conds_dim = None
max_combinations = 1


Warning: Link specified in selected links that is outside the scope of the selected variables

## Variable i0

Iterating through pc_alpha = [0.1]:

# pc_alpha = 0.1 (1/1):

Testing condition sets of dimension 0:

    Link (c0 -1) --> i0 (1/312):
Traceback (most recent call last):
  File "/usr/local/bin/poseidon", line 11, in <module>
    load_entry_point('poseidon', 'console_scripts', 'poseidon')()
  File "/usr/local/lib/python3.7/site-packages/click/core.py", line 764, in __call__
    return self.main(*args, **kwargs)
  File "/usr/local/lib/python3.7/site-packages/click/core.py", line 717, in main
    rv = self.invoke(ctx)
  File "/usr/local/lib/python3.7/site-packages/click/core.py", line 1137, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/usr/local/lib/python3.7/site-packages/click/core.py", line 956, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/local/lib/python3.7/site-packages/click/core.py", line 555, in invoke
    return callback(*args, **kwargs)
  File "/poseidon-app/poseidon_package/scripts/poseidon_cli.py", line 51, in infer_causality
    timestep_len_in_seconds=tsteplen,
  File "/poseidon-app/poseidon_package/poseidon/core.py", line 124, in run_poseidon
    timestep_in_sec=timestep_len_in_seconds,
  File "/poseidon-app/poseidon_package/poseidon/infer_causality.py", line 261, in infer_causality
    results = pcmci.run_pcmci(tau_max=max_lag, pc_alpha=alpha)
  File "/poseidon-app/tigramite/tigramite/pcmci.py", line 1567, in run_pcmci
    max_combinations=max_combinations)
  File "/poseidon-app/tigramite/tigramite/pcmci.py", line 896, in run_pc_stable
    max_combinations=max_combinations)
  File "/poseidon-app/tigramite/tigramite/pcmci.py", line 602, in _run_pc_stable_single
    tau_max=tau_max)
  File "/poseidon-app/tigramite/tigramite/independence_tests.py", line 335, in run_test
    pval = self.get_significance(val, array, xyz, T, dim)
  File "/poseidon-app/tigramite/tigramite/independence_tests.py", line 453, in get_significance
    value=val)
  File "/poseidon-app/tigramite/tigramite/independence_tests.py", line 2175, in get_shuffle_significance
    verbosity=self.verbosity)
  File "/poseidon-app/tigramite/tigramite/independence_tests.py", line 813, in _get_shuffle_dist
    mode='significance')
  File "/poseidon-app/tigramite/tigramite/independence_tests.py", line 754, in _get_block_length
    popt, _ = optimize.curve_fit(func, range(0, max_lag+1), hilbert)
  File "/usr/local/lib/python3.7/site-packages/scipy/optimize/minpack.py", line 718, in curve_fit
    if xdata.size == 0:
AttributeError: 'range' object has no attribute 'size'
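A likely workaround (my assumption, not a confirmed fix from the maintainers) is converting the range to a NumPy array before handing it to curve_fit, since older SciPy versions access xdata.size directly. A minimal, self-contained sketch with a toy decay function:

```python
import numpy as np
from scipy import optimize

def func(x, a, b):
    # simple exponential decay, similar in spirit to the block-length fit
    return a * np.exp(-b * x)

max_lag = 10
xdata = np.arange(0, max_lag + 1)   # instead of range(0, max_lag + 1)
ydata = func(xdata, 1.0, 0.5)
# older SciPy versions require xdata to be an ndarray (they access .size),
# so np.arange avoids the AttributeError that a plain range() triggers
popt, _ = optimize.curve_fit(func, xdata, ydata)
```

On noise-free data like this, popt recovers the parameters (1.0, 0.5) used to generate ydata.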

Complex multivariate time series

Thanks for providing such a useful time series analysis tool!
Does this toolbox support complex-valued data, or only real numbers? Will the performance degrade if I split a complex multivariate time series into two real parts and analyze them separately?
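Assuming tigramite operates on real-valued arrays, one pragmatic workaround (a sketch of an idea, not an official recommendation) is to stack the real and imaginary parts as separate real variables in one dataframe, which keeps the joint information that analyzing the two parts separately would discard:

```python
import numpy as np

T, N = 1000, 3
rng = np.random.default_rng(42)
complex_data = rng.normal(size=(T, N)) + 1j * rng.normal(size=(T, N))

# 2*N real-valued series: columns 0..N-1 are the real parts,
# columns N..2N-1 the imaginary parts of the original variables
real_data = np.concatenate([complex_data.real, complex_data.imag], axis=1)
print(real_data.shape)  # (1000, 6)
```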

Pip repo contains an old version of the library

Please upload the latest version of the library to PyPI. The library has reached version 4.1, while on PyPI I could only find 3.0b. This leads to errors when running the examples in your docs, such as:

TypeError: __init__() got an unexpected keyword argument 'var_names'

med = LinearMediation(dataframe=dataframe)

Getting tigramite on PyPI and anaconda cloud

To ensure tigramite can be used by many, we need to ensure it is distributed on conventional channels, like the Python Package Index. Here is my breakdown of this goal:

  • Ensure installation is working.
  • Figure out which dependencies are needed.
  • Test the current software.
  • Tag and version the software.
  • Release version on PyPI and anaconda cloud

Questions about the usage of CMIknn test

Hey there, Jakob and tigramite contributors!

I'm planning to use CMIknn for testing conditional independence, but I'm not sure how to specify the parameters "xyz" and "value" in CMIknn.get_shuffle_significance(self, array, xyz, value, return_null_dist). Could you kindly provide an example of how to run a CMIknn test, please?

Parameter xyz is the XYZ identifier array of shape (dim,), so I take this to be the vector specifying the dimensions of X, Y and Z. But I have no clue how to determine "value" – the value of the test statistic for the unshuffled estimate.

The following is my code:

cmi_knn = CMIknn(significance='shuffle_test', knn=0.1, shuffle_neighbors=5)
links_coeffs = {0: [((0, -1), 0.8)],
                    1: [((1, -1), 0.8), ((0, -1), 0.5)],
                    2: [((2, -1), 0.8), ((1, -2), -0.6)]}
data, _ = pp.var_process(links_coeffs, T=1000)
cmi_knn.get_shuffle_significance(array=data, xyz=[1,1,1], value=1)

and I got a ZeroDivisionError:

 ---------------------------------------------------------------------------
ZeroDivisionError                         Traceback (most recent call last)
<ipython-input-16-a64bde0562f9> in <module>()
----> 1 cmi_knn.get_shuffle_significance(array=data, xyz=[1,1,1], value=1)

~\Anaconda3\lib\site-packages\tigramite-4.0.0b0-py3.6-win-amd64.egg\tigramite\independence_tests.py in get_shuffle_significance(self, array, xyz, value, return_null_dist)
   2173                                            sig_samples=self.sig_samples,
   2174                                            sig_blocklength=self.sig_blocklength,
-> 2175                                            verbosity=self.verbosity)
   2176 
   2177         # Sort

~\Anaconda3\lib\site-packages\tigramite-4.0.0b0-py3.6-win-amd64.egg\tigramite\independence_tests.py in _get_shuffle_dist(self, array, xyz, dependence_measure, sig_samples, sig_blocklength, verbosity)
    813                                                      mode='significance')
    814 
--> 815         n_blks = int(math.floor(float(T)/sig_blocklength))
    816         # print 'n_blks ', n_blks
    817         if verbosity > 2:

ZeroDivisionError: float division by zero

Would you please kindly help me on this?

Many thanks.
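One likely culprit in the snippet above (my guess, not confirmed by the maintainers): get_shuffle_significance expects an array of shape (dim, T), with variables in rows, while pp.var_process returns data of shape (T, N). Passing the (T, N) array makes T the dimension and N the sample length, which can drive the estimated shuffle block length to zero. A numpy-only sketch of the expected layout:

```python
import numpy as np

T = 1000
rng = np.random.default_rng(0)
data = rng.normal(size=(T, 3))   # what pp.var_process returns: shape (T, N)

# the independence tests work on the transposed layout, shape (dim, T),
# with xyz marking each row's role (0 = X, 1 = Y, 2 = Z)
array = data[:, [0, 1]].T        # test X = variable 0 against Y = variable 1
xyz = np.array([0, 1])
print(array.shape)  # (2, 1000)
```

With this layout, "value" would be the unshuffled test statistic, e.g. the output of get_dependence_measure(array, xyz), rather than an arbitrary number.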

How to make PCMCI picture?

[image: mci_schematic]
I want to use my own data to make a PCMCI picture like that; what should I do? Sorry, I can't understand the instructions well. Could you provide an explanation? Thank you very much.

Error: 'tigramite_cython_code' is not defined

Hi,

After installing the required packages (including Cython), when I run the setup.py script, I see:
skipping 'tigramite/tigramite_cython_code.c' Cython extension (up-to-date)
in the output. After that, when I try to run the notebook, I get the following error when executing cell 17 (GPDC) and cell 22 (CMIknn):

name 'tigramite_cython_code' is not defined

Have you experienced this problem before?

Weird results from running tigramite on our data

Hi,
we work with Shay Palachy, with whom you corresponded in a few other issues. We are using tigramite on sparse time series. After another run, we investigated the results of pairs of time series with a high score and found some strange behavior. Two time series that have no common (or even close) activity surprisingly got a high causality score. We were wondering what you think could explain it: whether it's a bug or there is another explanation. Attaching a figure of the two time series; the difference between the last non-zero value of the bottom time series and the first non-zero value of the top one is over a week, and we defined the max lag as 4 hours.
Any insight you might have would be very appreciated.

Thanks in advance,
Roman and Michal
[image: Screen Shot 2019-11-13 at 15 31 36]

Model selection not implemented for cmi_knn

Dear Runge and contributors:

I am trying to use CMIknn as the cond_ind_test in PCMCI, but I am getting the error "Model selection not implemented for cmi_knn". What does it mean, and could you please help me?

PS: I just want to change ParCorr to CMIknn or CMIsymb. Thanks a lot.

cond_ind_test = CMIknn(mask_type=None,
                       significance='shuffle_test',
                       fixed_thres=None,
                       sig_samples=T,   # T is the data length
                       sig_blocklength=3,
                       knn=10,
                       confidence='bootstrap',
                       conf_lev=0.9,
                       conf_samples=T,  # T is the data length
                       conf_blocklength=1,
                       verbosity=0)
pcmci = PCMCI(dataframe=dataframe, cond_ind_test=cond_ind_test)
results = pcmci.run_pcmci(tau_max=2, pc_alpha=None)

I am looking forward to hearing from you.

plot_graph: control space between nodes and edges

When I plot a graph with custom coordinates via the node_pos argument, I get a large gap between the nodes and the edges (see figure below). So far, I haven't been able to control this spacing. Is there any workaround?

[image: téléchargement]

Notes

My coordinates are

ycoord= np.array([25, 25, -20, -20, -20, 10, 13,  8, 14, 0, 9])
xcoord= np.array([16, 32,  24,   8,  40, 22, 32, 39, 17, 5, 7])

Package versions are:
tigramite version: 4.1
matplotlib version: 3.2.1
networkx version: 2.4

Remarks
I just spotted a typo in the documentation for the 3 graph functions (DAG, time series and mediation):

curved_radius, float, optional (default: 0.2)
        Curvature of links. Passed on to FancyArrowPatch object.

should be changed to:

curved_radius :  float, optional (default: 0.2)
        Curvature of links. Passed on to FancyArrowPatch object.

Documentation on parameters of CMIsymb is lacking

Hey there, Jakob and tigramite contributors! :)

I'm planning to use PCMCI for causal inference on discrete data. Since I don't want to assume any model for causality flow, I chose the non-parametric conditional mutual information test. Thus, the discrete implementation for it, the CMIsymb class, seems like the way to go.

However, CMIsymb has several parameters whose effect on the functionality of the test is not clear from the class documentation. This is true even after reading the paper on PCMCI, specifically sections S2.3 and S2.4, which deal with the CMI test and its discrete variant, respectively. Extending the documentation would help users make better use of the test and decide intelligently on the right parameters for their use case.

Personally, I would also love to hear some more about these just for my own use. Also, if you'll give me something to start with I can read up on it and will happily open a PR with the extended documentation.

  • n_symbs - I get the basic idea of assuming symbolic input and why the default is the maximum value seen in the data (plus 1), but what are the effects of giving a smaller or larger number? Is there any reason to do so? If so, is there a way to estimate that number, in those cases?
  • significance - Not sure what fixed_thres does.
  • sig_blocklength and conf_blocklength - I can see that the documentation of the abstract base class, CondIndTest refers to the paper, and that section S3.3 does go into details about this, but I think some more detail in the documentation itself can help.

I promise to read up on the sections I mentioned, and if I get my answers there to try and extend the documentation myself, but I would also love to get your help with this.

Different linestyle for undirected links

It may be useful to add an option modifying the linestyle attribute for undirected (lag = 0) links, e.g.:
tp.plot_graph( val_matrix=results['val_matrix'], link_matrix=link_matrix, var_names=var_names, link_colorbar_label='cross-MCI', node_colorbar_label='auto-MCI', show_colorbar = False, undirected_linestyle='dotted' )
would result in:
[image: obrazek]

'tigramite_cython_code' is not defined

Thanks for the great package. Running in Colab works great, but trying to run GPDC locally, in a fresh environment with cython 0.29 installed, I get:

_, val, _, _ = tigramite_cython_code.dcov_all(x_vals, y_vals)
NameError: name 'tigramite_cython_code' is not defined

Any ideas to deal with this? I see this was an issue for some other people, but apparently they solved it using Colab.

FutureWarning: rcond parameter will change to the default of machine precision times max(M, N) where M and N are the input matrix dimensions. To use the future default and silence this warning we advise to pass rcond=None, to keep using the old, explicitly pass rcond=-1.

Hi,
Thank you for TIGRAMITE!
I get this error when I run:
pcmci.run_pcmci(tau_min =1, tau_max=30, pc_alpha=0.1)

independence_tests.py:1144: FutureWarning: rcond parameter will change to the default of machine precision times max(M, N) where M and N are the input matrix dimensions.
To use the future default and silence this warning we advise to pass rcond=None, to keep using the old, explicitly pass rcond=-1.
beta_hat = numpy.linalg.lstsq(z, y)[0]
Traceback (most recent call last):
  File "tigramite_era_geo_corpar.py", line 53, in <module>
    results_pcmi = pcmci.run_pcmci(tau_min =1, tau_max=3, pc_alpha=0.1)
  File "/home/yuditsabet/.local/lib/python3.7/site-packages/tigramite/pcmci.py", line 1385, in run_pcmci
    max_combinations=max_combinations,
  File "/home/yuditsabet/.local/lib/python3.7/site-packages/tigramite/pcmci.py", line 718, in run_pc_stable
    print("\n## Variable %s" % self.var_names[j])
KeyError: 132

Besides, when I run:
pcmci.run_gpdc(tau_min =1, tau_max=30, pc_alpha=0.1)
it does not finish. How long does it take if the process is not parallelized?

OUT_OF_MEMORY (exit code 0): run_pcmi_paralell.py

Hello Jacob,
I am trying to run the parallelization script for a 10x1296 array, and no matter how much memory I allocate, the process gets killed after about 35 minutes due to memory problems. I'm allocating up to 50GB of memory and it seems to demand a lot more. When you ran the script, how many nodes did you need, how many tasks per node, and how much memory?
Thank you

Job ID: 18178
Cluster: xxx
User/Group: yyy/users
State: OUT_OF_MEMORY (exit code 0)
Cores: 1
CPU Utilized: 00:46:57
CPU Efficiency: 99.16% of 00:47:21 core-walltime
Job Wall-clock time: 00:47:21
Memory Utilized: 54.79 GB
Memory Efficiency: 99.63% of 54.99 GB

Save results of run_pcmci in instance attribute

I noticed that after running the run_pcmci method, the results are not stored in the class instance; hence, once the method has run, if you don't store the returned values in a separate variable, they are lost.

I think it would be useful to store the results inside the instance with something like self.results = self.run_mci()

thank you
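A lightweight way to get this behavior today, without changing tigramite, is a small wrapper that keeps the last results on the instance. PCMCIStub below is a stand-in for the real PCMCI class, used only to keep the sketch self-contained and runnable:

```python
class PCMCIStub:
    # stand-in for tigramite.pcmci.PCMCI, for illustration only
    def run_pcmci(self, **kwargs):
        return {"val_matrix": None, "p_matrix": None}

class CachingPCMCI(PCMCIStub):
    """Wrapper that also stores the last run_pcmci results on the instance."""
    def run_pcmci(self, **kwargs):
        self.results = super().run_pcmci(**kwargs)
        return self.results

pcmci = CachingPCMCI()
pcmci.run_pcmci(tau_max=2)
print(sorted(pcmci.results))  # ['p_matrix', 'val_matrix']
```

The same subclassing pattern should work against the real PCMCI, since the wrapper only forwards keyword arguments.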

Could not import packages for CMIknn and GPDC estimation

Hello! I have a question and need your help: I have already installed tigramite-master (latest) and all the packages it needs. But when I use Geany to run the example from the documentation, it still displays the error "Could not import packages for CMIknn and GPDC estimation". I wonder whether I am missing some packages. Thank you very much.

Do I need to manually select links for run_pcmci? Warning: Link specified in selected links that is outside the scope of the selected variables

Hey there,

So this is somewhat related to issue #35, I'm guessing, but I think it warrants its own issue (because my intuition regarding the right answer is the opposite of the case there):

I'm initializing a PCMCI object with selected_variables set to some subset of the variables found in the used dataframe (say we have variables 0 through 6, and I set selected_variables=[0, 1, 2]).

When calling run_pcmci(tau_max=8, pc_alpha=0.1) (just random numbers here), I get:

Warning: Link specified in selected links that is outside the scope of the selected variables

Now, indeed the documentation for selected_links states:
selected_links (dict or None) – Dictionary of form {0: [(3, -2), …], 1: [], …} specifying whether only selected links should be tested. If None is passed, all links are tested.

But on the other hand, the documentation for selected_variables says:
selected_variables (list of integers, optional (default: range(N))) – Specify to estimate parents only for selected variables. If None is passed, parents are estimated for all variables.

...which implies links like 1 -> 4 (meaning, links corresponding to causal parents of unselected variables, like variable 4) will not be checked.

So which is right? The documentation or the warning? The documentation suggests that setting selected_variables is enough, and that non-relevant links will be skipped, but the warning suggests that I need to manually match the scope of selected links to that of selected variables.

I would greatly appreciate your help (and I wish to thank you again for all your help so far),
Shay
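If the warning needs silencing in the meantime, one option is to build a selected_links dict whose scope explicitly matches selected_variables, following the {j: [(i, -tau), …]} format quoted from the docs above. links_for_selected is a hypothetical helper I made up for illustration, not part of tigramite's API:

```python
def links_for_selected(n_vars, selected, tau_min, tau_max):
    # hypothetical helper: all lagged links into selected target variables,
    # empty link lists for every unselected target
    return {
        j: [(i, -tau)
            for i in range(n_vars)
            for tau in range(tau_min, tau_max + 1)]
        if j in selected else []
        for j in range(n_vars)
    }

links = links_for_selected(n_vars=7, selected=[0, 1, 2], tau_min=1, tau_max=8)
print(len(links[0]), links[4])  # 56 []
```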

Negative CMI in blue ?

Hi,

In the notebook you provide with tigramite, I see that the MCI can be negative (shown in blue). However, equations 4 and 5 of your paper show that the MCI is positive. How is it possible to plot negative MCI values?

Thanks in advance for your help!

Could not import r-package RCIT

Hello! I get the error "Could not import r-package RCIT". Do I need to install the RCIT package in R? I couldn't find this package for R version 3.6.3 or 4.0.0.

Improve Tutorials and Documentation

While the current tutorial and documentation are comprehensive, we should ensure that new users have all they need to get started. Here is a list of possible improvements:

  • Factor current tutorial into smaller tutorials
  • Factor existing documentation page into smaller pages
  • Write a quick-start guide
  • Extend tutorials to also cover the definition/effects of the various parameters

Could not import packages for CMIknn and GPDC estimation

I am truly sorry if I am asking a dumb question, but I ran into a problem importing the package:

Could not import packages for CMIknn and GPDC estimation
Could not import r-package RCIT

ImportError                               Traceback (most recent call last)
in
----> 1 from tigramite import tigramite_cython_code

ImportError: cannot import name 'tigramite_cython_code' from 'tigramite' (C:\Users\ranbix\Documents\summer_19\summer_19\tigramite-master\tigramite-master\tigramite\__init__.py)

Could you please give me any help? Thanks!

Issue with "Symbolic time series"

Hi Jakob
Your work on tigramite and PCMCI is impressive! Great Work!
I have one question regarding Symbolic time series.
I am trying your Jupyter notebook and get the following error:

\lib\site-packages\tigramite-3.0b0-py2.7-win-amd64.egg\tigramite\independence_tests.pyc in _bincount_hist(self, symb_array, weights)
   2494         # Needed because numpy.bincount cannot process longs
   2495         if type(self.n_symbs ** dim) != int:
-> 2496             raise ValueError("Too many n_symbs and/or dimensions, "
   2497                              "numpy.bincount cannot process longs")
   2498         if self.n_symbs ** dim * 16. / 8. / 1024. ** 3 > 3.:

ValueError: Too many n_symbs and/or dimensions, numpy.bincount cannot process longs
Is it something you are aware of? Do you have any simple workaround?
I will try to investigate that.
Thanks for your help!

Is it possible to use CMIknn as TDMI ?

Hi Jakob, beautiful work you've got here.

I can't tell clearly from the paper, since the definitions of CMI and MI are different, whether it's possible to use your CMIknn algorithm as a time-delayed MI (TDMI) estimator. There are Matlab implementations for this, but I need Python. Some people rely on histograms, but I found binning noisy.

I am aiming to use the measure as evidence of low dimensionality, so it's MI and not CMI that I need. Further development of my work on time series may use CMIknn too.

Plus, a question: is this package on pip/conda?

Thank you in advance.

Matplotlib issue: ValueError: list.remove(x): x not in list

I have just installed the new TIGRAMITE v4 in an anaconda environment running Python 3.7 and matplotlib 3.1.0. While running the tutorials in the Jupyter Notebook environment with the option %matplotlib inline, I cannot show the figures obtained, for instance, with tigramite.plotting.plot_timeseries or tigramite.plotting.plot_graph. The functions work well and do return the fig and axes, but the fig instance triggers the following error as the Jupyter cell renders it.

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
~\AppData\Local\Continuum\anaconda3\envs\tigramite_v4\lib\site-packages\IPython\core\formatters.py in __call__(self, obj)
    339                 pass
    340             else:
--> 341                 return printer(obj)
    342             # Finally look for special method names
    343             method = get_real_method(obj, self.print_method)

~\AppData\Local\Continuum\anaconda3\envs\tigramite_v4\lib\site-packages\IPython\core\pylabtools.py in <lambda>(fig)
    242 
    243     if 'png' in formats:
--> 244         png_formatter.for_type(Figure, lambda fig: print_figure(fig, 'png', **kwargs))
    245     if 'retina' in formats or 'png2x' in formats:
    246         png_formatter.for_type(Figure, lambda fig: retina_figure(fig, **kwargs))

~\AppData\Local\Continuum\anaconda3\envs\tigramite_v4\lib\site-packages\IPython\core\pylabtools.py in print_figure(fig, fmt, bbox_inches, **kwargs)
    126 
    127     bytes_io = BytesIO()
--> 128     fig.canvas.print_figure(bytes_io, **kw)
    129     data = bytes_io.getvalue()
    130     if fmt == 'svg':

~\AppData\Local\Continuum\anaconda3\envs\tigramite_v4\lib\site-packages\matplotlib\backend_bases.py in print_figure(self, filename, dpi, facecolor, edgecolor, orientation, format, bbox_inches, **kwargs)
   2058                     bbox_artists = kwargs.pop("bbox_extra_artists", None)
   2059                     bbox_inches = self.figure.get_tightbbox(renderer,
-> 2060                             bbox_extra_artists=bbox_artists)
   2061                     pad = kwargs.pop("pad_inches", None)
   2062                     if pad is None:

~\AppData\Local\Continuum\anaconda3\envs\tigramite_v4\lib\site-packages\matplotlib\figure.py in get_tightbbox(self, renderer, bbox_extra_artists)
   2359         bb = []
   2360         if bbox_extra_artists is None:
-> 2361             artists = self.get_default_bbox_extra_artists()
   2362         else:
   2363             artists = bbox_extra_artists

~\AppData\Local\Continuum\anaconda3\envs\tigramite_v4\lib\site-packages\matplotlib\figure.py in get_default_bbox_extra_artists(self)
   2330                 bbox_artists.extend(ax.get_default_bbox_extra_artists())
   2331         # we don't want the figure's patch to influence the bbox calculation
-> 2332         bbox_artists.remove(self.patch)
   2333         return bbox_artists
   2334 

ValueError: list.remove(x): x not in list

So far, I can avoid the problem by setting the display mode to %matplotlib notebook. I am not sure whether the problem is related to TIGRAMITE or to my anaconda env, but I have not encountered this error before while plotting in Jupyter notebooks.

Tigramite cython issues

When I try to install on my Windows 10 system (Pycharm virtual env) I get the following error messages:

Traceback (most recent call last):
  File "D:\Abhinav_Sharma\Software\tigramite-master\setup.py", line 72, in <module>
    EXT_MODULES += define_extension("tigramite.tigramite_cython_code")
  File "D:\Abhinav_Sharma\Software\tigramite-master\setup.py", line 43, in define_extension
    return cythonize(Extension(extension_name, source_files))
  File "C:\venv\lib\site-packages\Cython\Build\Dependencies.py", line 974, in cythonize
    aliases=aliases)
  File "C:\venv\lib\site-packages\Cython\Build\Dependencies.py", line 817, in create_extension_list
    for file in nonempty(sorted(extended_iglob(filepattern)), "'%s' doesn't match any files" % filepattern):
  File "C:\venv\lib\site-packages\Cython\Build\Dependencies.py", line 116, in nonempty
    raise ValueError(error_msg)
ValueError: 'tigramite/tigramite_cython_code.pyx' doesn't match any files

ValueError: operands could not be broadcast together with shapes

Hello there. :)

Thanks again for this great package (and the research it is based on), and all the hard work.

I'm getting the following error (see stack trace below) thrown by the get_lagged_dependencies method of the PCMCI class. I'm calling it with tau_max=144. Also, the PCMCI object for which it is called was initialized with following arguments:

  • dataframe - A tigramite dataframe of shape (8640, 13); so 13 variables, each a time series of 8640 time steps. All values are non-negative integers.
  • cond_ind_test - An instance of the CMIsymb class (initialized with all None arguments).
  • selected_variables - The list [5, 6, ..., 12] (so it is not required to find causal parents for the first five variables).
  • verbosity - A value of 10.

As you can see below, the method runs through thousands of links before throwing this error:

        with conds_y = [ ]
        with conds_x = [ ]
        val = 0.006

        link (i4 -138) --> i1 (1443/1884):
        with conds_y = [ ]
        with conds_x = [ ]
        val = 0.005

        link (i4 -139) --> i1 (1444/1884):
        with conds_y = [ ]
        with conds_x = [ ]
Traceback (most recent call last):
  File "/usr/local/bin/poseidon", line 11, in <module>
    load_entry_point('poseidon', 'console_scripts', 'poseidon')()
  File "/usr/local/lib/python3.7/site-packages/click/core.py", line 764, in __call__
    return self.main(*args, **kwargs)
  File "/usr/local/lib/python3.7/site-packages/click/core.py", line 717, in main
    rv = self.invoke(ctx)
  File "/usr/local/lib/python3.7/site-packages/click/core.py", line 1137, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/usr/local/lib/python3.7/site-packages/click/core.py", line 956, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/local/lib/python3.7/site-packages/click/core.py", line 555, in invoke
    return callback(*args, **kwargs)
  File "/poseidon-app/poseidon_package/scripts/poseidon_cli.py", line 51, in infer_causality
    timestep_len_in_seconds=tsteplen,
  File "/poseidon-app/poseidon_package/poseidon/core.py", line 124, in run_poseidon
    timestep_in_sec=timestep_len_in_seconds,
  File "/poseidon-app/poseidon_package/poseidon/infer_causality.py", line 220, in infer_causality
    epsilon=None,
  File "/poseidon-app/poseidon_package/poseidon/util.py", line 23, in find_max_lag
    correlations = pcmci.get_lagged_dependencies(tau_max=timesteps_in_a_day)
  File "/poseidon-app/tigramite/tigramite/pcmci.py", line 1171, in get_lagged_dependencies
    val = self.cond_ind_test.get_measure(X, Y, Z=Z, tau_max=tau_max)
  File "/poseidon-app/tigramite/tigramite/independence_tests.py", line 493, in get_measure
    return self._get_dependence_measure_recycle(X, Y, Z, xyz, array)
  File "/poseidon-app/tigramite/tigramite/independence_tests.py", line 373, in _get_dependence_measure_recycle
    return self.get_dependence_measure(array, xyz)
  File "/poseidon-app/tigramite/tigramite/independence_tests.py", line 2345, in get_dependence_measure
    hist = self._bincount_hist(array, weights=None)
  File "/poseidon-app/tigramite/tigramite/independence_tests.py", line 2318, in _bincount_hist
    flathist[:len(result)] += result
ValueError: operands could not be broadcast together with shapes (144,) (145,) (144,)

Any help at all with finding the source of this error will be greatly appreciated.
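A guess at the mechanism (unverified): if the histogram in _bincount_hist is preallocated from one symbol-count bound but a slice of the data contains a symbol value at or above that bound, np.bincount returns a longer array than the preallocated one and the in-place addition fails with exactly this kind of shape mismatch. A numpy-only reproduction of the mismatch, with made-up sizes matching the error message:

```python
import numpy as np

n_bins = 144                          # size of the preallocated histogram
flathist = np.zeros(n_bins)
symbols = np.array([0, 3, 143, 144])  # contains a value >= n_bins
result = np.bincount(symbols)         # length is max(symbols) + 1 = 145
print(len(result))  # 145
# flathist[:len(result)] += result    # would raise the broadcast ValueError,
#                                     # since the slice is capped at 144 entries
```

If this is the cause, checking the maximum symbol value against the n_symbs setting of CMIsymb would be a first diagnostic step.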

Is there an automatic way to find tau_max for run_pcmci?

Hey there,

This is not a problem with the package, rather a question regarding a possible improvement. As the title states, I was wondering whether there is an automatic way to find tau_max for run_pcmci?

In the tutorial, you plotted the lagged unconditional dependencies (the lagged correlations) and chose the lag after which dependencies decay:
[image]

Based on that, I thought a possible way to automate it is to find, for each such series of correlation vs lag, the lag for which the correlation is close enough to 0 (it is in [-Ɛ, Ɛ]), and take the max out of those (and so Ɛ is a parameter with which the user can control the level of decay required, but which can have a nice default like 0.1 or 0.05).
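The heuristic described above could be sketched as follows (suggest_tau_max, lagged_corrs and eps are names made up for illustration, not tigramite API):

```python
import numpy as np

def suggest_tau_max(lagged_corrs, eps=0.1):
    # lagged_corrs: shape (n_pairs, n_lags), one correlation per variable
    # pair and lag; pick, per pair, the first lag whose |correlation| <= eps,
    # then return the maximum of those lags over all pairs
    decayed = np.abs(lagged_corrs) <= eps
    first_decay = np.argmax(decayed, axis=1)
    # pairs whose correlation never decays below eps get the maximal lag
    first_decay[~decayed.any(axis=1)] = lagged_corrs.shape[1] - 1
    return int(first_decay.max())

corrs = np.array([[0.9, 0.5, 0.2, 0.05, 0.01],
                  [0.8, 0.05, 0.02, 0.01, 0.0]])
print(suggest_tau_max(corrs))  # 3
```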

What do you think? If it's a silly idea, I'd love to know that as well, and also get help thinking of a correct method to do this. :)

Cheers,
Shay

Improve tests

As part of a developing package, we should have a rigorous set of unit tests and higher level validation.

Unit tests

We should test all the main modules. For the unit testing, the tests need to be fast and as modular as possible.

  • Extend testing for pcmci.py
    • Move tests to pytest
    • Modularize tests to enable ease of testing new parameter sets
    • Use framework to extend testing
    • Write new tests
  • Extend testing for independence_tests.py
  • Extend testing for models.py
  • Extend testing for data_processing.py
    • Test var_process()
    • Test data manipulation functions in data_processing.py

Validation

We should include some toy datasets to run on. These should be designed to test the functionality at a higher level / across modules.

KeyError prevents plotting TimeSeries Graph?

Dear Dr. Runge & Contributors,

I am trying to utilize the tigramite.plotting package to visualize the time-series graph (plot_timeseries_graph and plot_mediation_time_series_graph). Whether I am using my own code or copying your notebook tutorial, I am consistently stopped by this error:

tp.plot_mediation_time_series_graph(var_names=var_names,path_node_array=graph_data['path_node_array'],tsg_path_val_matrix=graph_data['tsg_path_val_matrix'])
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/zskaufma/.conda/envs/env_tigramite/lib/python3.6/site-packages/tigramite/plotting.py", line 2023, in plot_mediation_time_series_graph
    network_lower_bound=network_lower_bound
  File "/home/zskaufma/.conda/envs/env_tigramite/lib/python3.6/site-packages/tigramite/plotting.py", line 1126, in _draw_network_with_curved_edges
    seen[(u, v)] = draw_edge(ax, u, v, d, seen, arrowstyle, directed=True)
  File "/home/zskaufma/.conda/envs/env_tigramite/lib/python3.6/site-packages/tigramite/plotting.py", line 827, in draw_edge
    if d['directed_attribute'] == 'spurious':
KeyError: 'directed_attribute'

The meaning of this is not particularly clear to me, so any help would be much appreciated. Note that this problem is specific to the visualization commands referenced above; plot_timeseries and plot_graph work fine for me.

Best regards,
Zack K

Parallelization using Dask

Hi,

I have been using Tigramite over the last week and have been trying out different ways to parallelize it.

There are two levels of parallelization that were interesting for me:

  1. Parallelizing at the node level, the way you are doing it in your MPI-based script
  2. Parallelizing the knn-based independence test -- this is new.

An obvious third level (not relevant for me) is to parallelize over the selection of the significance level for the independence tests.

Since I (and probably other people as well) am not an MPI user, I worked a little bit on implementing my own parallelization using dask's distributed library (https://distributed.dask.org/en/latest/).

Some point about this:

  • I started with Python's multiprocessing library, but it doesn't easily scale beyond one machine, and managing multiple levels of parallelization (node > significance > independence test) with it is a pain, which is why I didn't end up using it.
  • Each parallelizable part of the code (PC algorithm, independence test) is given a cluster client, to make it easy for the tigramite user to distribute the work on his own infrastructure (Dask distributed provides many possible configurations)
  • Dask is really just used as a distributed work queue in which each job can create subjobs, but it also comes with many interesting features -- I would encourage you to check Dask out!
  • I haven't yet thought about the best way to distribute the data, so a lot of data is still being copied around, but this could be improved

Would you be interested in me preparing a PR for this? If not, I wouldn't spend time documenting and cleaning up the code, so let me know!

print_significant_links and return_significant_parents: clarification and consistency between names

Hello and thank you for this amazing package.
I was testing it out this afternoon after reading the article and the supplementary materials and I would like to get a clarification (maybe my understanding is wrong):

after creating the PCMCI instance (and having called run_pcmci) there are two methods, print_significant_links and return_significant_parents: the former refers to significant links and the latter to significant parents.

Is it true though that significant links and significant parents are the same thing (at least from my understanding of the theory)?

If that is true, I would suggest changing the names to avoid confusion.

Many thanks

Question regarding tigramite_tutorial_missing_masking

Dear Jakob,
I have a question regarding the description of the Masking procedure in the tigramite_tutorial_missing_masking.
In the description of the tutorial, it is written that if we are interested in winter months only, then we should mark "all winter month data in ( 𝑋1,𝑋2,𝑋3 ) in mask with a 1 (or True). "

But in the Python example of the same tutorial, I understood that in order to focus on the winter half-year, the mask is set to True for the summer months:
....
# Summer half year
data_mask[[t, t-1]] = True
....

This brings some confusion.

When plotting time series with plot_timeseries and masking, I would expect grey to stand for values that are not intended to be used in the analysis. But if I follow the description and mask the winter months (set these months to True or 1), they appear grey in the time series plot.

So my questions are:
(i) if I am interested in winter months only, should I mask the winter months, or should I mask the other seasons?
(ii) do black lines in plot_timeseries stand for values that will be used by tigramite, and grey for values that will be ignored?
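To make question (i) concrete, this is the kind of mask construction I am unsure about (a sketch under the reading that True/1 means "exclude this sample"; the monthly bookkeeping here is made up, not from the tutorial):

```python
import numpy as np

# Hypothetical monthly series: T samples, N variables, starting in January.
T, N = 120, 3
data = np.random.randn(T, N)
months = (np.arange(T) % 12) + 1          # 1 = Jan, ..., 12 = Dec

# Under the reading that True (1) means "exclude this sample",
# analyzing winter only means masking everything that is NOT winter.
winter = np.isin(months, [12, 1, 2])
mask = np.repeat(~winter[:, None], N, axis=1)  # same shape as data
```

Under the opposite reading (True = "use this sample"), the `~` would have to be dropped, which is exactly the convention I would like clarified.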
Thanks in advance.
Best regards,
Evgenia

why I get a full connected graph?

Thanks for providing such a useful and interesting time series analysis tool. Having no in-depth knowledge, however, I ran into the following potential bottlenecks:

I have a time series dataset of 8 variables (all data were z-score transformed). I followed the TIGRAMITE example to construct their causal relationships. Unfortunately, I got a fully connected graph, and the mean false positive rate is more than 0.5.
I see that there is a parameter "links_coeffs" in the example. However, I don't know the dependencies among my variables in advance. Without this parameter, can I still recover the true causal structure?

I think maybe my data is nonstationary, which leads to a fully connected graph. Is there a stationarity test I can use to check whether the data is suitable for the PCMCI algorithm?
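As a crude first check, I tried something like the drift heuristic below (my own ad-hoc helper, not part of tigramite; a proper unit-root test such as `adfuller` from `statsmodels.tsa.stattools` would be the standard tool):

```python
import numpy as np

def crude_stationarity_check(x, n_chunks=4, tol=0.5):
    # Very rough heuristic, not a substitute for a proper ADF/KPSS test:
    # split the series into chunks and flag it as suspicious if the chunk
    # means drift by more than `tol` standard deviations of the series.
    x = np.asarray(x, dtype=float)
    chunks = np.array_split(x, n_chunks)
    means = np.array([c.mean() for c in chunks])
    return np.ptp(means) < tol * x.std()
```

A stationary noise series passes this check, while a series with a strong trend fails it.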
Attached is my data and code. Hope your little insight guidance will solve my BIG PROBLEM.
learn causality from time series data.zip

Data processing problem

Dear Jacob,
I was going through your impressive work on tigramite, following your notebook code, but I am getting errors. For example, if I try:
np.random.seed(42)  # Fix random seed
links_coeffs = {0: [((0, -1), 0.7), ((1, -1), -0.8)],
                1: [((1, -1), 0.8), ((3, -1), 0.8)],
                2: [((2, -1), 0.5), ((1, -2), 0.5), ((3, -3), 0.6)],
                3: [((3, -1), 0.4)],
                }
T = 1000  # time series length
data, true_parents_neighbors = pp.var_process(links_coeffs, T=T)
T, N = data.shape

# Initialize dataframe object, specify time axis and variable names
var_names = [r'$X^0$', r'$X^1$', r'$X^2$', r'$X^3$']
dataframe = pp.DataFrame(data,
                         datatime=np.arange(len(data)),
                         var_names=var_names)

I get the following error:

TypeError Traceback (most recent call last)
in ()
13 dataframe = pp.DataFrame(data,
14 datatime = np.arange(len(data)),
---> 15 var_names=var_names)

TypeError: __init__() got an unexpected keyword argument 'datatime'

what is causing this?
Thank you for your help

Tigramite and Event Data

Hello.
This is a general question regarding the correct way to apply the algorithmic framework provided by the (great) package. I am trying to estimate causality between different time series in which all data are counts (e.g., number of events i at time t, number of events j at time t, number of people affected k at time t). I was wondering whether this requires the "Symbolic Time Series" specification, or whether it is fine to use the "classic" setup.
By the way, I have tried both ways but when running
results = pcmci_cmi_symb.run_pcmci(tau_max=10, pc_alpha=0.2)
I get the following error:
ValueError: operands could not be broadcast together with shapes (169,) (173,) (169,)
My data are stored into a np array of dimension (1096,4) in int64 format. I was trying to understand what the issue was, but I am currently struggling to find a solution.

Thanks for your help.

Testing Independence of Stocks

Hi,

A critical assumption of correlation is that the variables are independent.

I am doing a correlation analysis of stock prices.

If we want to test the independence of stock prices (two or more stocks are independent of each other) for correlation analysis which of the following tests should be employed?

Thanks in advance.

var_names dictionary

Hi Jakob. Great job with tigramite! I was having an issue with a dataset that is wider than it is tall. I was getting a KeyError when I tried to use _print_significant_links, and I saw that self.var_names in __init__ is defined using the length of the data. Should this be self.data.shape[1], since the rows are the time values and the columns are the variables?

Python 3 compatibility

The package was first written for python 2.7. Some work has already been done to ensure it also works for Python 3. Once we complete #2, #3, #4, and #5, we can use the existing testing/development framework to ensure python 3 works. Here is the brief todo:

  • Finish implementing testing framework
  • Include any edits needed to ensure python 3 compatibility
  • Ensure tests pass for python 3
  • Deploy python 3 version on PyPI and Anaconda Cloud

Error: 'index 3 is out of bounds for axis 1 with size 3'

In the CMIknn section of the notebook, cell 24, containing the following code:

link_matrix = pcmci._return_significant_parents(pq_matrix=results['p_matrix'],
                        val_matrix=results['val_matrix'], alpha_level=0.01)['link_matrix']
tp.plot_graph(
    val_matrix=results['val_matrix'],
    link_matrix=link_matrix,
    var_names=var_names,
    link_colorbar_label='cross-MCI',
    node_colorbar_label='auto-MCI',
    )

throws the following error:

Desktop/causal-graphs/tigramite/tigramite/pcmci.pyc in _return_significant_parents(self, pq_matrix, val_matrix, alpha_level)
   1236             links = dict([((p[0], -p[1] - 1), numpy.abs(val_matrix[p[0], 
   1237                             j, abs(p[1]) + 1]))
-> 1238                           for p in zip(*numpy.where(link_matrix[:, j, 1:]))])
   1239 
   1240             # Sort by value

IndexError: index 3 is out of bounds for axis 1 with size 3

More specifically, the error is coming from:

link_matrix = pcmci._return_significant_parents(pq_matrix=results['p_matrix'],
                        val_matrix=results['val_matrix'], alpha_level=0.01)['link_matrix']

Is this expected? I am curious how you managed to run this notebook without hitting this error.

Thanks!

Performance Improvements

Dear Tigramite Experts,

I am trying to figure out how to improve the running time of the CMIknn approach. I tried to cythonize modules that I thought could help, but no luck.
My data has 9 variables with about 280 data points each (similar to the synthetically generated example datasets), and it takes about 20 hours to run using CMIknn; I was wondering what I can try in order to improve this.

I profiled (cProfile) and tried to cythonize but still takes tremendous amounts of time.

Attached is a cProfile snippet showing several iterations.
tigramite.performance.profile.log

QUESTIONS:

  1. Can anyone comment on the hotspots found? They are not the typical file/line-number/function entries, but seem to be library functions.

  2. Can anyone comment on the thread locks? I am not using multithreading, so I suspect they come from libraries. I know there were issues prior to Python 3.4, but I am using Python 3.7, so I wonder whether the libraries may be using multiple threads that could be optimized further - I'm not sure how, other than refactoring the source.

  3. What would be the most impactful optimization in terms of reducing the run time of CMI knn ?

  4. Can anyone comment on whether these bottlenecks can be improved by Cython? I am not sure whether these are NumPy routines that are already optimized, or whether there are opportunities to further reduce the run time.

Summary of cProfile:

959190 2.239 0.000 163.029 0.000 <array_function internals>:2(amax)
479595 1.014 0.000 141.947 0.000 <array_function internals>:2(amin)
958958 66.351 0.000 121.180 0.000 _methods.py:167(_var)
958958 7.890 0.000 129.251 0.000 _methods.py:215(_std)
959190 3.327 0.000 158.242 0.000 fromnumeric.py:2504(amax)
479595 1.426 0.000 139.728 0.000 fromnumeric.py:2629(amin)
1919469 10.597 0.000 308.358 0.000 fromnumeric.py:73(_wrapreduction)

10911915 133.044 0.000 142.921 0.000 {built-in method numpy.array}
6624081/6135824 21.769 0.000 424.868 0.000 {built-in method numpy.core._multiarray_umath.implement_array_function}
1089000 0.783 0.000 0.783 0.000 {built-in method numpy.core._multiarray_umath.normalize_axis_index}
5288396 366.798 0.000 366.798 0.000 {method 'reduce' of 'numpy.ufunc' objects}

I had originally made my code single-threaded and still had this thread-lock issue; I'm not sure where it's coming from, but I suspect the libraries:

19183800 2400.066 0.000 2400.066 0.000 {method 'acquire' of '_thread.lock' objects}
3836760 23.542 0.000 2188.027 0.001 threading.py:264(wait)
3836760 5.878 0.000 25.006 0.000 threading.py:499(__init__)
11510280 3.144 0.000 3.144 0.000 threading.py:507(is_set)
3836760 21.753 0.000 61.568 0.000 threading.py:763(__init__)
3836760 18.968 0.000 2321.026 0.001 threading.py:834(start)
3836760 7.950 0.000 276.880 0.000 threading.py:1012(join)
3836760 6.278 0.000 264.014 0.000 threading.py:1050(_wait_for_tstate_lock)
github.tigramite.performance.profile (1).log
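One thing I am experimenting with regarding the lock time: NumPy's BLAS backend (MKL/OpenBLAS) spawns its own worker threads even in single-threaded Python code, and those pools can show up as `_thread.lock` acquire time in cProfile. Capping them before NumPy is first imported sometimes helps on small arrays (these are standard BLAS/OpenMP environment knobs, not tigramite settings):

```python
import os

# Must run before the first `import numpy` anywhere in the process;
# caps the BLAS/OpenMP thread pools that often appear as
# "method 'acquire' of '_thread.lock' objects" in cProfile output.
for var in ("OMP_NUM_THREADS", "MKL_NUM_THREADS", "OPENBLAS_NUM_THREADS"):
    os.environ[var] = "1"

import numpy as np  # imported only after the environment is configured
```

Whether this actually reduces the CMIknn wall time on my data is something I am still measuring.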

Any ideas welcomed.

Maybe bug with return_significant_parents()

Tigramite 4:

I got an inconsistency between pcmci.print_significant_links() and pcmci.return_significant_parents()

output pcmci.print_significant_links()

Significant links at alpha = 0.05:

Variable t2mmax_E-US has 10 link(s):
([1, '0_1_sst', 0] 0): pval = 0.00001 | qval = 0.00001 | val = 0.287
([18, '1_1_sst', 0] 0): pval = 0.00001 | qval = 0.00001 | val = 0.287
([7, '0_7_sst', 0] 0): pval = 0.00001 | qval = 0.00001 | val = 0.286
([19, '1_4_sst', 0] -1): pval = 0.00002 | qval = 0.01361 | val = -0.278
([6, '0_6_sst', 0] 0): pval = 0.00030 | qval = 0.00030 | val = 0.234
([4, '0_4_sst', 0] 0): pval = 0.00062 | qval = 0.00062 | val = -0.222
([5, '0_5_sst', 0] 0): pval = 0.00155 | qval = 0.00155 | val = 0.205
([21, '1_6_sst', 0] 0): pval = 0.00172 | qval = 0.00172 | val = 0.203
([19, '1_4_sst', 0] 0): pval = 0.00192 | qval = 0.00192 | val = -0.201
([25, '1_5_sst', 0] 0): pval = 0.00497 | qval = 0.00497 | val = 0.183

output pcmci.return_significant_parents()
image

return_significant_parents() seems to only give back link ([19, '1_4_sst', 0] -1). According to the printed q-values, all of them are lower than 0.05.

When the q-value is printed, the code uses print(q_matrix[p[0], j, abs(p[1])]), p being an element of sorted_links:
image

When the q-value is 'returned', the code uses: np.argwhere(pq_matrix[:, j, 1:] <= alpha_level)

What is this '1:' in np.argwhere(pq_matrix[:, j, 1:] <= alpha_level) based on? Is it correct?

Thanks in advance!
Sem

Improve cross platform capabilities

Currently, tigramite has been installed and tested on Linux x86 systems. A medium term goal is to extend this first to macOS, and eventually to Windows. As a "catch-all" solution, we should also deploy a working version on Docker/Docker hub.

Mac

macOS should be able to run this natively:

  • Install on macOS
  • Ensure all tests pass
  • Write installation guide (if needed)
  • Release macOS build on Anaconda cloud

Windows

Windows may be more challenging:

  • Install on Windows
  • Ensure all tests pass
  • Write installation guide (if needed)
  • Release Windows build on Anaconda cloud

Docker / VM

To ensure anyone can run the code, we should build and deploy VM images and an image for Docker cloud:

  • Get VM image of version 3.0.0-beta on Linux OS
  • Upload Docker image to Docker cloud of version 3.0.0-beta

How to deal with 'almost' deterministic time-series?

Thanks for the great package Jakob! I am dealing with time series (coming from a physical model) that have very small noise and are closely correlated, such as the following:

Original curves
image

First difference

image

Do you have any recommendations to deal with them? I am pretty sure that they don't satisfy PCMCI's assumptions. When I try to run run_pcmci, I get the following error:

/usr/local/lib/python3.6/dist-packages/tigramite/independence_tests.py in _get_single_residuals(self, array, target_var, standardize, return_means)
1064 array /= array.std(axis=1).reshape(dim, 1)
1065 if np.isnan(array).sum() != 0:
-> 1066 raise ValueError("nans after standardizing, "
1067 "possibly constant array!")

I would like to identify which one of the first five time series has the most impact on the 6th time series (power). I would appreciate any reference to some appropriate approach.
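Meanwhile, in case it helps others hitting the same ValueError, I wrote a small pre-check (my own helper, not tigramite API) to see which of my series are numerically constant before running PCMCI:

```python
import numpy as np

def find_near_constant(data, var_names=None, rtol=1e-12):
    # Flag variables whose standard deviation is numerically zero --
    # the situation that triggers the "nans after standardizing,
    # possibly constant array!" ValueError during standardization.
    arr = np.asarray(data, dtype=float)
    stds = arr.std(axis=0)
    threshold = rtol * max(np.abs(arr).max(), 1.0)
    flagged = np.where(stds <= threshold)[0]
    if var_names is not None:
        return [var_names[i] for i in flagged]
    return list(flagged)
```

Running it on my (T, N) data array before PCMCI tells me which columns to drop or perturb.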

Thank you!

Which links to select for get_lagged_dependencies? (Warning: Link specified in selected links that is outside the scope of the selected variables)

Hey Jakob and other contributors! :)

So I'm using PCMCI to find causal relations among a set of 13 variables, indexed from 0 to 12. However, I'm only interested in the causal parents of variables 5 to 12, so I initialize PCMCI with selected_variables=[0,...,12].

However, when running the get_lagged_dependencies() method of the initialized PCMCI object, I only supply tau_max, leaving selected_links with its default None, so all links are tested (as per the documentation).

This is intentional, as I figured that even though I'm only interested in the parents of some of the variables, I should take into consideration all variable-pair interactions when choosing the max lag to use, as they might be crucial for the correct estimation of any causal link in the system.

However, the get_lagged_dependencies method throws the following warning (coming from the _set_sel_links() input validation method):

"Warning: Link specified in selected links that is outside the scope of the selected variables"

As a result, I wanted to ask whether my logic is correct, and if not, how to construct a correct set of selected links. Is it simply the set of all links with at least one endpoint in selected_variables?
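If it helps to make the question concrete, this is the selected_links structure I have in mind (format as I understand it from the docstrings, {j: [(i, -tau), ...]}; whether untested targets should get an empty list or have their key omitted is part of what I am unsure about):

```python
N = 13          # total number of variables
tau_max = 5     # hypothetical maximum lag
targets = set(range(5, 13))   # variables whose parents I care about

# All lagged links INTO the target variables; every variable may still
# appear as a candidate parent, but links into 0..4 go untested.
selected_links = {
    j: ([(i, -tau) for i in range(N) for tau in range(1, tau_max + 1)]
        if j in targets else [])
    for j in range(N)
}
```

With this construction, target 5 gets 13 * 5 = 65 candidate links, while non-target variables get none.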

Also, is there a chance this is what's causing the error I get when running this method (see issue #33)?

Thank you again for all your help,
Shay
