py-why / causal-learn

Causal Discovery in Python. It also includes (conditional) independence tests and score functions.

Home Page: https://causal-learn.readthedocs.io/en/latest/

License: MIT License

Languages: Python 98.44%, Jupyter Notebook 1.56%
Topics: causal, causal-discovery, causal-inference, causality, python, graph, structure, tetrad, time-series, hidden-causal

causal-learn's Introduction

causal-learn: Causal Discovery in Python

Causal-learn (documentation, paper) is a Python package for causal discovery that implements both classical and state-of-the-art causal discovery algorithms. It is a Python translation and extension of Tetrad.

The package is actively being developed. Feedback (issues, suggestions, etc.) is highly encouraged.

Package Overview

causal-learn implements the following families of causal discovery methods:

  • Constraint-based causal discovery methods.
  • Score-based causal discovery methods.
  • Causal discovery methods based on constrained functional causal models.
  • Hidden causal representation learning.
  • Permutation-based causal discovery methods.
  • Granger causality.
  • Multiple utilities for building your own method, such as independence tests, score functions, graph operations, and evaluations.

Install

Causal-learn needs the following packages to be installed beforehand:

  • python 3 (>=3.7)
  • numpy
  • networkx
  • pandas
  • scipy
  • scikit-learn
  • statsmodels
  • pydot

(For visualization)

  • matplotlib
  • graphviz

To use causal-learn, install it via pip:

pip install causal-learn
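After installation, a quick sanity check can be run with the PC algorithm (a minimal sketch based on the usage shown in the tests; see the documentation for the full set of options):

import numpy as np
from causallearn.search.ConstraintBased.PC import pc

# Toy data: a chain X0 -> X1 -> X2
rng = np.random.default_rng(0)
x0 = rng.normal(size=1000)
x1 = x0 + 0.1 * rng.normal(size=1000)
x2 = x1 + 0.1 * rng.normal(size=1000)
data = np.column_stack([x0, x1, x2])

cg = pc(data, 0.05, "fisherz")   # returns a CausalGraph object
cg.draw_pydot_graph()            # visualize the learned CPDAG (needs pydot and graphviz)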

Documentation

Please refer to the causal-learn documentation for detailed tutorials and usage.

Running examples

For search methods in causal discovery, there are various running examples in the ‘tests’ directory, such as TestPC.py and TestGES.py.

For the implemented modules, such as (conditional) independence test methods, we provide unit tests for the convenience of developing your own methods.

Benchmarks

For the convenience of our community, the CMU-CLeaR group maintains a list of benchmark datasets covering real-world scenarios and various learning tasks. Please refer to the following links:

Please feel free to let us know if you have any recommendations for high-quality causal datasets. We are grateful for any effort that benefits the development of the causality community.

Contribution

Please feel free to open an issue if you find anything unexpected. If you would like to contribute to causal-learn, please create a pull request, ideally after passing the unit tests in 'tests/'. We are always aiming to make our community better!

Running Tetrad in Python

Although causal-learn provides Python implementations of some causal discovery algorithms, many more are available in the classical Java-based Tetrad program. For users who would like to incorporate arbitrary Java code from Tetrad as part of a Python workflow, we strongly recommend considering py-tetrad, which provides a list of reusable examples showing how to painlessly benefit from the most comprehensive Tetrad Java codebase.

Citation

Please cite as:

@article{zheng2024causal,
  title={Causal-learn: Causal discovery in python},
  author={Zheng, Yujia and Huang, Biwei and Chen, Wei and Ramsey, Joseph and Gong, Mingming and Cai, Ruichu and Shimizu, Shohei and Spirtes, Peter and Zhang, Kun},
  journal={Journal of Machine Learning Research},
  volume={25},
  number={60},
  pages={1--8},
  year={2024}
}

causal-learn's People

Contributors

aoqiz, biwei-huang, bja43, boyle-coffee, chenweidelight, cogito233, erdungao, fileds, ignavierng, janmarcoruizdevargas, jarodyv, jdramsey, jonathan-salisbury, kenneth-lee-ch, kunwuz, kunzhang16, marcelrobeer, markdana, pckennethma, philippfaller, ruichucai, svenpieper, tiagodsilva, tofuwen, turuibo, wean2016, x3zvawq, yarikoptic, yuanningd, zhi-yi-huang


causal-learn's Issues

Please provide developing guidelines for outside contributors

When I tried to run the unit tests in tests/, I was confused about how the maintainers collect so many test cases.
Finally, I executed python -m unittest discover -p "Test*.py" under tests/, which I hope is the correct usage.

The tests initially failed due to the lack of pygam (required by CAMUV) and torch (required by PNL), even though I had executed python setup.py install to install the requirements.
I thought that a simple patch in setup.py could solve this problem. But WAIT! What if the maintainers just want to make big packages like torch optional? It is so nice of you to think like that.
However, it would be better to document such packages somewhere, like in README.md or setup.py.

A general suggestion is to provide more information in README.md about how to participate in the development of this project. It would also help to provide executable commands in the continuous integration configuration.

What's more, could the development guidelines of this repository, if they already exist, change a little?
For example, print is widely used in this project. I find it annoying that a user like me cannot disable those messages. Could this project use the logging package at the DEBUG level instead?
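To illustrate what I mean, something along these lines (the function and logger names here are just hypothetical placeholders):

import logging

# One module-level logger per module; the library itself should not configure handlers.
logger = logging.getLogger("causallearn.search.ConstraintBased.PC")

def report_progress(depth, remaining_edges):
    # Replaces a bare print(); users opt in via their own logging configuration.
    logger.debug("depth=%d, remaining edges=%d", depth, remaining_edges)

# A user who wants to see the messages can enable them explicitly:
logging.basicConfig(level=logging.DEBUG)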

Add cache for local score function for efficiency.

The local score is calculated repeatedly during the search.

Make the function def local_score_BDeu(Data: ndarray, i: int, PAi: List[int], parameters=None) -> float: cacheable. I strongly recommend functools.lru_cache, an efficient built-in Python decorator. However, it requires all parameters to be hashable, and currently they are not (e.g., ndarray).

A common solution is to create a Score class that stores the data, so the local score function only takes i and a tuple of parents (it is currently a list).
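A rough sketch of what I have in mind (the class and method names are hypothetical, and I am assuming local_score_BDeu lives in causallearn.score.LocalScoreFunction like the other local scores):

from functools import lru_cache
from typing import Tuple

import numpy as np
from causallearn.score.LocalScoreFunction import local_score_BDeu

class CachedBDeuScore:
    """Holds the data once, so the cached method only receives hashable arguments."""

    def __init__(self, data: np.ndarray):
        self.data = data

    @lru_cache(maxsize=None)
    def local_score(self, i: int, parents: Tuple[int, ...]) -> float:
        # 'parents' is a tuple, so the (i, parents) pair is hashable and the
        # result is memoized across repeated calls during the search.
        return local_score_BDeu(self.data, i, list(parents))

# Usage: score = CachedBDeuScore(data); score.local_score(2, (0, 1))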

Usage of KCI test

I am interested in the KCI independence test function (causallearn.utils.cit.kci) and am testing it now.

In the test, the input "data" is set to a simple NumPy data table, and the column indices X, Y are set to 1 and 2. However, after running the function, there is a TypeError: "'str' object is not callable", as shown below. Does anyone know the cause of this problem, or can you show me a correct example of this function's usage? Thank you.

(screenshot of the traceback omitted)
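For reference, this is roughly what I am trying to do; my understanding from the current documentation is that the test should be constructed through the CIT wrapper rather than called directly (please correct me if the interface below is wrong):

import numpy as np
from causallearn.utils.cit import CIT

# Toy data: X influences Y, Z is independent noise; columns are variables.
rng = np.random.default_rng(0)
X = rng.normal(size=500)
Y = X + 0.1 * rng.normal(size=500)
Z = rng.normal(size=500)
data = np.column_stack([X, Y, Z])

kci_obj = CIT(data, "kci")         # build the test object once per dataset
p_xy = kci_obj(0, 1)               # unconditional test of column 0 vs column 1
p_xy_given_z = kci_obj(0, 1, [2])  # conditional test of X and Y given Z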

pip install?

Dumb question: Does pip install causal-learn grab the latest version from the repository, or do I need to grab the repository version and install that myself locally?

Thanks!

How to use BIC score

Hi, sorry to bother you @kunwuz,

I'm having some problems using the BIC score. It always returns a large negative number such as -6951.8887543026485:
rewards = local_score_BIC(data, 0, [0])

The data was created by myself. I don't know the exact meaning of i and PAi.
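For reference, this is roughly what I am doing; I am assuming i is the index of the child variable, PAi is the list of indices of its candidate parents, and rows are samples:

import numpy as np
from causallearn.score.LocalScoreFunction import local_score_BIC

# Toy data where X0 -> X1
rng = np.random.default_rng(0)
x0 = rng.normal(size=1000)
x1 = x0 + 0.1 * rng.normal(size=1000)
data = np.column_stack([x0, x1])

score_with_parent = local_score_BIC(data, 1, [0])  # score of X1 with parent X0
score_no_parent = local_score_BIC(data, 1, [])     # score of X1 with no parents
# Both values are large in magnitude; I assume only their difference is meaningful.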

Import Error during using

I am a novice in this field, seeking help...

Operation:
pip install causal-learn (successful)
touch a new file test.py (successful)
copy testPC.py into test.py (successful)
python3 test.py (encounters the error below):

(main) ➜  Test git:(main) ✗ python3 testC.py 
Traceback (most recent call last):
  File "testC.py", line 8, in <module>
    from causallearn.utils.cit import chisq, fisherz, gsq, kci, mv_fisherz, d_separation
ImportError: cannot import name 'd_separation' from 'causallearn.utils.cit' (/Users/sheshuaijie/app/miniconda3/envs/main/lib/python3.8/site-packages/causallearn/utils/cit.py)

Any suggestion will be greatly appreciated!

PC learns different graphs dependent on the ordering

Hi,

Thank you for your great work on this package!
I am testing the behaviour of the PC algorithm on simple simulated data. I found that the number of directed edges detected differs based on the ordering of the variables given to the algorithm.

I am running the following code:

import numpy as np
import matplotlib.pyplot as plt
from causallearn.search.ConstraintBased.PC import pc

def simulate_data(n_obs):
    '''
    Simulate data from the following graph

        A       B
         \     /
          v   v
            C
          /   \
         v     v
        D       E
    '''
    A = np.random.normal(size = n_obs)
    B = np.random.normal(size = n_obs)
    C = A + B + np.random.normal(size = n_obs)*0.25
    D = C + np.random.normal(size = n_obs)*0.25
    E = C + np.random.normal(size = n_obs)*0.25
    return np.stack((A,B,C,D,E), axis =1)

# generate data
n = 10000
data = simulate_data(n)

# test different permutations
permutations = [[0,1,2,3,4], [0,1,3,2,4]]

for permutation in permutations:
    graph = pc(data[:,permutation], 0.05, 'fisherz', node_names = np.array(["A","B", "C", "D", "E"])[permutation],  verbose = False)
    graph.draw_pydot_graph(labels=np.array(["A","B", "C", "D", "E"])[permutation])
    plt.show()

When the ordering of variables C and D is permuted, the PC algorithm returns a graph with an undirected edge between C and D. However, when the ordering is unpermuted, the PC algorithm correctly directs the edge from C to D. This happens even though the p-values in the CI tests are unchanged between the two permutations.
Is this expected behaviour, or can you help me fix this?

Generalized score with mixed data

Hello,

I used the generalized score to run GES on mixed data, which contains both float and string columns, but it failed.
Is there any example for this situation?

Best Regards
Yuqiang

granger2d should work for more than 2d data

Problem: granger_test_2d throws an error if applied to data with more than 2 variables (d > 2).

I think that, without the check below, the granger_test_2d method in the Granger class would probably work on data with more than 2 variables.

https://github.com/cmu-phil/causal-learn/blob/6f297e743c462b4a2f8191250ea9ac03c288abf0/causallearn/search/Granger/Granger.py#L43

If this behavior is intended, the documentation should perhaps be updated: as it stands, the docs state that granger_test_2d returns matrices of p-values and adjacency and accepts data of shape (n x d) as input.

Passing required domain knowledge using add_required_by_node

Hello,
I'm new to causal discovery, and I'm not sure whether this is an issue in the causal-learn package or whether I'm using the add_required_by_node function incorrectly.
I first built a graph based on the data (without any knowledge); let's call it G1.
Then I wanted to add required edges based on domain expertise to create a new graph, G2.
Expected result: G2 contains all directed edges defined by the domain knowledge.
Observed result: G2 contains a directed edge only if the edge was already present in G1; otherwise the directed edge is not created.


Below are the details:

  • data: (data table and graph screenshots omitted)

  • First, I create a graph without any knowledge (G1), using this code:

from causallearn.search.ConstraintBased.PC import pc

data_np = data.to_numpy()
cg = pc(data_np)

Below is the result, G1 (image omitted).

  • Second, I wanted to add the domain knowledge rules:

    rules = [
        ("a", "b", "required"),  # a -> b
        ("c", "d", "required"),  # c -> d
        ("b", "c", "required"),  # b -> c
    ]
    I used add_required_by_node and add_forbidden_by_node methods to add either required or forbidden edges and to create the graph G2, using this code:

forbidden_edges = []
required_edges =  [(0, 1), (2, 3), (1, 2)] # {"a":0, "b":1, "c":2, "d":3}
data_np = data.to_numpy()
cg = pc(data_np)
nodes = cg.G.get_nodes()

bk = BackgroundKnowledge()
for (node_1_idx, node_2_idx) in forbidden_edges:
    bk.add_forbidden_by_node(nodes[node_1_idx], nodes[node_2_idx])

for (node_1_idx, node_2_idx) in required_edges:
    bk.add_required_by_node(nodes[node_1_idx], nodes[node_2_idx])

cg_with_knowledge = pc(data_np, background_knowledge=bk)

Below is the result, G2 (image omitted).

While I was expecting something more like this (result based on another package), G3 (image omitted).

Mixed data example

May I know if there are any examples of causal-learn usage on mixed data (categorical and continuous/ordinal)? This is supported in tetrad. Many thanks!

Does DAG2PAG supports selection variable?

Hi, I hope to use the dag2pag function in causal-learn, i.e., causallearn.utils.DAG2PAG.

However, I notice in the documentation that this function only supports adding latent variables.

Is there a way to add selection variables (maybe via a custom change to the function)?

Many thanks!

score_g function

I am not sure whether the function "score_g" in causallearn.utils.GESUtils is designed to calculate the score of a causal graph given a specific dataset. When I used it with a non-empty parameter "PA", I got the error "IndexError: only integers, slices (:), ellipsis (...), numpy.newaxis (None) and integer or boolean arrays are valid indices".
I think the cause of the problem is that the "PA" passed by "score_g" does not match the parameter (PAi) expected by local_score_BIC.
I would much appreciate your response.
(screenshots omitted)

FAS does not use required edges anymore

Hi,

Since the recent commit to Fas.py, required edges are no longer used when creating the graph. May I ask why, and what the benefit of this is? I built a program for which the required edges were essential, so I am very curious 😅

Kind regards,
Vera

QUESTION: To get DAG/Graph from FCM-based methods

With FCM-based methods, for instance DirectLiNGAM, how do we get a DAG so that we can compare and contrast the structure with other methods such as PC, FCI, GES, etc.? Are there any utilities?

from causallearn.search.FCMBased import lingam
model = lingam.DirectLiNGAM()
model.fit(data)

print(model.causal_order_)
print(model.adjacency_matrix_)
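For now I am extracting edges from adjacency_matrix_ by hand, roughly as below (continuing from the snippet above; the threshold and the convention that adjacency_matrix_[i, j] encodes X_j -> X_i are my own assumptions):

import numpy as np

W = np.asarray(model.adjacency_matrix_)
threshold = 0.05                      # hypothetical cutoff for "non-negligible" weights
for i in range(W.shape[0]):
    for j in range(W.shape[1]):
        if abs(W[i, j]) > threshold:
            print(f"X{j} -> X{i} (weight {W[i, j]:.3f})")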

Thanks

ImportError: cannot import name 'GIN' from 'causallearn.search.FCMBased'

from causallearn.search.FCMBased.GIN.GIN import GIN
G, K = GIN(data)

ImportError                               Traceback (most recent call last)
/var/folders/8b/hhnbt0nd4zsg2qhxc28q23w80000gn/T/ipykernel_52081/3662974966.py in <module>
----> 1 from causallearn.search.FCMBased import GIN
      2 G, K = GIN(data)

ImportError: cannot import name 'GIN' from 'causallearn.search.FCMBased' (/Users/datalab/github/causal-learn/causallearn/search/FCMBased/__init__.py)

How to Visualize the Causal Graph?

Awesome work!
I just tried TestDirectLiNGAM to get the causal graph on a public dataset, and the result is interesting. Thank you for your work! I am interested in visualizing the causal graph I got (the DirectLiNGAM method produces an adjacency_matrix_). Is there any operation to get a PNG that visualizes the graph?
After reading the docs, I noticed there are some graph operations but no visualization operation :(
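To clarify what I am after, this is the kind of workaround I am currently considering with networkx and matplotlib (the edge convention and the threshold are my own assumptions):

import numpy as np
import networkx as nx
import matplotlib.pyplot as plt

W = np.asarray(model.adjacency_matrix_)   # from the fitted DirectLiNGAM model
G = nx.DiGraph()
G.add_nodes_from(range(W.shape[0]))
for i in range(W.shape[0]):
    for j in range(W.shape[1]):
        if abs(W[i, j]) > 0.05:
            G.add_edge(j, i)              # assuming W[i, j] encodes X_j -> X_i

nx.draw_networkx(G, pos=nx.circular_layout(G), node_color="lightblue", arrows=True)
plt.savefig("causal_graph.png", dpi=200)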

Can't add a new node to GeneralGraph after deleting edges

If I attempt to add a new node to a GeneralGraph using add_node after calling the method remove_nodes, I get the following error:

ValueError: all the input array dimensions for the concatenation axis must match exactly, but along dimension 1, the array at index 0 has size 10 and the array at index 1 has size 5.

The more detailed error message:

Traceback (most recent call last):
  File <PATH_TO_SCRIPT>, line 20, in <module>
    graph.add_node(GraphNode("X11"))
  File "<PATH_TO_PROJECT>/ctf_venv/lib/python3.9/site-packages/causallearn/graph/GeneralGraph.py", line 205, in add_node
    dpath = np.vstack((self.dpath, row))
  File "<__array_function__ internals>", line 180, in vstack
  File "<PATH_TO_PROJECT>/ctf_venv/lib/python3.9/site-packages/numpy/core/shape_base.py", line 282, in vstack
    return _nx.concatenate(arrs, 0)
  File "<__array_function__ internals>", line 180, in concatenate

Minimal example to reproduce:

from causallearn.graph.GeneralGraph import GeneralGraph
from causallearn.graph.GraphNode import GraphNode
from causallearn.graph.Edge import Edge
from causallearn.graph.Endpoint import Endpoint

# Generate a chain graph with 10 nodes and 9 directed edges
nodes = [GraphNode(f"X{i}") for i in range(1, 11)]
graph = GeneralGraph(nodes)
for i in range(9):
    graph.add_edge(Edge(nodes[i], nodes[i+1], Endpoint.TAIL, Endpoint.ARROW))
print(graph)

# Delete the first 5 nodes
nodes_to_remove = nodes[:5]
graph.remove_nodes(nodes_to_remove)

# Try to add a new node (X11)
graph.add_node(GraphNode("X11"))
print(graph)

It appears that self.dpath is not updated correctly when remove_nodes is called. If you compare its dimensions to self.graph, you can see that the paths involving the removed nodes have not been deleted. For example, in the example above we removed 5 nodes, but calling print(graph.graph.shape, graph.dpath.shape) gives (5, 5) (10, 10).

memory consumption

On my dataset (~36,000 rows, ~40 features), I found it may use over 40 GB of memory (and I have not gotten it to finish successfully).
Maybe it needs some on-disk storage to reduce peak memory usage?

On the development of PNL

When will the implementation of the PNL method be completed? I see that development has stopped since Feb 18, 2022. Precisely which parts of the algorithm remain to be implemented? Maybe I can help.

Question: Clarification on Edge properties from FCI algo

In the docs the edge properties are explained as:

"""
edges: list. Contains graph’s edges properties. If edge.properties have the Property ‘dd’, then there is no latent confounder. Otherwise, there might be latent confounders. If edge.properties have the Property ‘nl’, then it is definitely direct. Otherwise, it is possibly direct.
"""
The abbreviations seem odd to me. Unfortunately, I can't find the original paper freely available online, and I can't find another reference to these abbreviations, so I'm asking here for clarification.

Is it possible that dd and nl have been switched, i.e. dd means definitely direct and nl means no latent confounder?

Additionally, there is no explanation for pl. Does it mean possibly latent confounders?

I'm guessing that pd means possibly direct.

Thanks for the clarification!

Orientation rules for FCI

Hello,

Are the edge orientation rules in the FCI implementation taken from Zhang (2008)? This paper introduces additional orientation rules and proves their completeness. I am wondering whether the entire set of orientation rules required for completeness is implemented in this package.

  1. Zhang, J. On the completeness of orientation rules for causal discovery in the presence of latent confounders and selection bias. Artificial Intelligence 172, 1873–1896 (2008).

If not, from which citation(s) are the orientation rules taken?

Thanks!

Add Error Handling for cit.py line 165/166

Hi, I experienced a ValueError: math domain error on this line, raised from log because the value of (1 + r) / (1 - r) was negative.
Please see the screenshot below and the value of r from the debugger.
I'm not sure why this is happening, but it would be good to catch potential errors here :)

(screenshot and debugger output omitted)
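One possible guard, just to sketch what I mean (not necessarily the statistically right fix), would be to clamp the correlation before taking the log:

import math

def safe_fisher_z(r: float, eps: float = 1e-12) -> float:
    # Numerical issues can push the estimated (partial) correlation slightly
    # outside [-1, 1], which makes (1 + r) / (1 - r) negative and log() fail.
    r = max(min(r, 1.0 - eps), -1.0 + eps)
    return 0.5 * math.log((1.0 + r) / (1.0 - r))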

Typo.

"Constrained-based" in the README should be "Constraint-based".

Facing problem in SHD estimation

Hi. I am getting the following KeyError whenever I try to estimate the SHD using the true graph and estimated graph.

Traceback (most recent call last):
  File "test_GES_metrics.py", line 25, in <module>
    shd = SHD(truth_cpdag, est)
  File "F:\REPOSITORIES\causal-learn\causallearn\graph\SHD.py", line 33, in __init__
    if (not truth.get_edge(truth.get_node(nodes_name[i]), truth.get_node(nodes_name[j]))) and est.get_edge(
  File "F:\REPOSITORIES\causal-learn\causallearn\graph\GeneralGraph.py", line 530, in get_edge
    i = self.node_map[node1]
KeyError: None

I am unable to figure out why this mapping error happens or how to solve it.

Add required edges by background knowledge

Hi, Thanks for your great work.

I hope to add an edge, based on background knowledge, that is not detected during the PC skeleton phase.
I read the code in causal-learn/causallearn/utils/PCUtils/BackgroundKnowledgeOrientUtils.py, and it looks like only edges already detected in the skeleton phase can be marked as 'required'.

Is there any way to solve my problem?
Can I directly add the following lines to BackgroundKnowledgeOrientUtils.py?

def orient_by_background_knowledge(cg: CausalGraph, background_knowledge: BackgroundKnowledge):
    """
    orient the direction of edges using background background_knowledge after running skeleton_discovery in PC algorithm

    Parameters
    ----------
    cg: a CausalGraph object. Where cg.G.graph[j,i]=1 and cg.G.graph[i,j]=-1 indicates  i -> j ,
                    cg.G.graph[i,j] = cg.G.graph[j,i] = -1 indicates i -- j,
                    cg.G.graph[i,j] = cg.G.graph[j,i] = 1 indicates i <-> j.
    background_knowledge: artificial background background_knowledge

    Returns
    -------

    """
    if type(cg) != CausalGraph or (type(background_knowledge) != BackgroundKnowledge and type(background_knowledge) != CustomBackgroundKnowledge):
        raise TypeError(
            'cg must be type of CausalGraph and background_knowledge must be type of BackgroundKnowledge. cg = ' + str(
                type(cg)) + ' background_knowledge = ' + str(type(background_knowledge)))
    for edge in cg.G.get_graph_edges():
        if cg.G.is_undirected_from_to(edge.get_node1(), edge.get_node2()):
            if background_knowledge.is_forbidden(edge.get_node2(), edge.get_node1()):
                cg.G.remove_edge(edge)
                cg.G.add_directed_edge(edge.get_node1(), edge.get_node2())
            elif background_knowledge.is_forbidden(edge.get_node1(), edge.get_node2()):
                cg.G.remove_edge(edge)
                cg.G.add_directed_edge(edge.get_node2(), edge.get_node1())
            elif background_knowledge.is_required(edge.get_node2(), edge.get_node1()):
                cg.G.remove_edge(edge)
                cg.G.add_directed_edge(edge.get_node2(), edge.get_node1())
            elif background_knowledge.is_required(edge.get_node1(), edge.get_node2()):
                cg.G.remove_edge(edge)
                cg.G.add_directed_edge(edge.get_node1(), edge.get_node2())

    # custom change to add required edges
    for node1 in cg.G.get_nodes():
        for node2 in cg.G.get_nodes():
            if not cg.G.is_undirected_from_to(node1,node2):
                if background_knowledge.is_required(node1,node2):
                    cg.G.add_directed_edge(node1,node2)
                elif background_knowledge.is_required(node2,node1):
                    cg.G.add_directed_edge(node2,node1)

Add reference paper for implementation of granger lasso method

Hello everyone!

I want to use your library as part of an ongoing project, because I saw that you implemented a granger causality test for multi-dimensional time series called granger_lasso (docs).

I am curious which academic literature this method is based on. Many papers combine Granger causality with some form of lasso regularization, so it is unclear to me why you implemented it this way.

It would be great if you could provide the reference paper here or, even better, update the documentation so other people can also benefit from this information.

Thank you very much in advance.

fisherz test occasionally (but rarely) errors out

I am unsure why, but occasionally the fisherz test raises the following error. I have not had this problem with the KCI test on the same data. I do not know whether this is a bug or a problem with the data; if it is a problem with the data, it would be more useful to be told why.

File "/miniconda3/envs/causal/lib/python3.8/site-packages/causallearn/utils/cit.py", line 172, in fisherz
r = -inv[0, 1] / sqrt(inv[0, 0] * inv[1, 1])
ValueError: math domain error

Need reference for PCMAX (default setting for PC in CL apparently)

After some sleuthing in the Python code and some performance testing with the help of Pablo Puig and Bryan Andrews, I've come to the conclusion that the default setting for PC in CL is maxP. That is an algorithm of mine called PCMAX. The only reference for this algorithm currently in the literature that I know of is an arXiv tech report I put out in 2016, this one:

Ramsey, J. (2016). Improving accuracy and scalability of the pc algorithm by maximizing p-value. arXiv preprint arXiv:1610.00378.

Anyway, the performance of PC in CL is the same as the performance of PCMAX in Tetrad (and different from the performance of PC in Tetrad.)

If this is going to be the default setting for PC in CL, perhaps this tech report should be referenced for the PC algorithm in the documentation? Otherwise no one will know what the algorithm is. I'll see if I can't send it to a conference somewhere to get it published. I didn't realize it was being made the default here.

Please clarify how mixed data are to be represented in causal-learn.

This is a request for information. For the mixed data project, could you clarify how mixed data are to be represented in causal-learn? That is, how is one to know, programmatically, which columns contain discrete data and which contain continuous data? This cannot be gleaned from an np array itself, since binary data, for instance, can be treated either as continuous (with values 0 and 1) or as discrete, and ordinal discrete data may often be treated as either continuous or discrete as well.

CD-NOD with independent change principle.

Hi, thanks for your great work.

I hope to use CD-NOD phase 3, i.e., identifying directions with the independent change principle.

However, I notice it is still not implemented (there is a TODO comment).

When do you plan to release this code?

Many thanks.

How to handle named variables

Hi,

I have been working with the causal-learn package for a couple of weeks now, and I was wondering whether there is a plan to add support for named variables.

As I understand it, at the moment data is generally assumed to take the form of a 2D np.array where each column contains the values of a particular variable X1, X2, ..., XN. I can quite easily take a CSV file and convert it to this format, but in doing so I lose the names of my columns. I then have to keep track of which variable (X1, X2, ..., XN) corresponds to which named column of my CSV data, which makes tasks such as visualising the output quite difficult.

I have been using the to_pyplot method to draw the discovered DAG and passing it the labels. However, this relies on the order being preserved (simply replacing X1 with the first label, X2 with the second, and so on). This has tripped me up a couple of times when comparing two graphs produced from similar data where the variables do not align.

I am not sure what the solution is here, but it would be really useful if it were possible to attach actual variable names to the graph rather than doing this manually after computation. One possibility would be to support pandas DataFrames for reading in data.
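For reference, my current workaround reads the CSV with pandas and passes the column names through node_names (which the pc wrapper already accepts); the file name below is just a placeholder:

import pandas as pd
from causallearn.search.ConstraintBased.PC import pc

df = pd.read_csv("my_data.csv")
cg = pc(df.to_numpy(), 0.05, "fisherz", node_names=list(df.columns))

# The names still have to be passed around by position for plotting:
cg.draw_pydot_graph(labels=list(df.columns))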

PC algorithm and Meek rules

I generated some sample data and used the PC algorithm to get an essential graph, but when I plot the graph it looks like this:

I have several questions.

  1. Why is there no arrow from X8 -> X9 in this case? Shouldn't the Meek rules be able to orient X8 - X9 as X8 -> X9?
  2. What is the difference between Meek.meek and Meek.definite_meek()?
  3. If I want to orient a DAG with the Meek rules after I manually change some orientations of the graph, how should I go about it?

(screenshot of the plotted graph omitted)

Thank you for having a look at this thread.

Is there an implementation of Degenerate Gaussian?

The README.md says that causal-learn is a Python translation and extension of Tetrad. So, is there an implementation of the Degenerate Gaussian (DG) algorithm, published in the paper "Learning High-dimensional Directed Acyclic Graphs with Mixed Data-types"? In the Tetrad program, the CG or DG algorithm is always used first to handle mixed data. Thank you for your reply.

Using scoring function 'local_score_marginal_multi' on ges() function gives error

When I call the ges() function with the score function local_score_marginal_multi and no parameters, it gives the following error:

File "lib\site-packages\causallearn\score\LocalScoreFunction.py", line 824, in local_score_marginal_multi
    X = Data[:, parameters["dlabel"][Xi]]
KeyError: 0

It looks like the labels of the parameters are not parsed correctly. In LocalScoreFunction.py, Xi is an int, but the keys of parameters["dlabel"] are strings. However, fixing this problem locally leads to other errors, so I am not sure what goes wrong further down.

The same applies when using the score function local_score_CV_multi.

Using background knowledge makes FCI algorithm slower

Hi!
I would like to use the FCI algorithm with background knowledge, but I have noticed that the computation is much slower with background knowledge than without passing it to FCI (or PC). I work with about 300 variables, and there are not many dependencies (about 1, 2, or 3) between them. My background knowledge contains a lot of forbidden edges and some required edges. As far as I understand the FAS function, the forbidden edges should greatly reduce the adjacency list, and the separating sets should also be smaller than in the version without background knowledge.

I cannot find an explanation for the increase in computation time; what am I missing?

Thanks in advance!

FCI algorithm with KCI testing method

Hello,

I am testing the FCI algorithm, whose default independence test method is 'fisherz'. I tried to use the 'kci' test method, but it shows:

TypeError: '(slice(None, None, None), [0])' is an invalid key.

So, is there any example of the FCI algorithm with the KCI method? Thank you.
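For reference, this is roughly how I am calling it; my guess is that the data needs to be a plain NumPy array rather than a pandas DataFrame (the signature below follows my reading of the docs and may be off):

import pandas as pd
from causallearn.search.ConstraintBased.FCI import fci

df = pd.read_csv("my_data.csv")   # placeholder for my actual data
data = df.to_numpy()              # the error above looks like DataFrame-style indexing,
                                  # so I convert to a plain ndarray first
g, edges = fci(data, independence_test_method="kci", alpha=0.05)
print(g.graph)                    # matrix representation of the learned PAG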

CD-NOD outputs directed cycles.

Hi, Thanks for your great work.

When using CD-NOD on my private data, I noticed that the output PDAG contains a directed cycle.

I use the KCI test; other settings (e.g., orientation rules) are exactly the default values from the illustrative example.

Any idea why this phenomenon occurs? How should I resolve it?

The performance gap between `KCI_UInd` and `KCI_CInd` under a similar setting

The issue is based on the code in Pull request #55

Here is a weird problem concerning the performance gap between KCI_UInd and KCI_CInd. Intuitively, the tests of $X\bot Y$ and $X\bot Y|Z=1$ (Z a constant) should have similar performance, or the latter test (using KCI_CInd) should perform worse because it handles a more general case. However, when I ran the code, the result was not as I expected.

(screenshot of the p-values omitted)

I test the code on a random collider dataset, which means $X\bot Z$, $X\equiv Y$; I also visualize the test statistics, mean, and variance for convenient debugging. The result shows similar p-values for $X\bot Z$ and $X\bot Y$, but different p-values for $X\bot Z | 1$ and $X\bot Y | 1$.

Following is my test code:

from icecream import ic
from causallearn.utils.cit import CIT
from tqdm import trange
import numpy as np


def generate_single_sample(type, dim):
    if (type == 'chain'):
        X = np.random.random(dim)
        Y = np.random.random(dim)+X
        Z = np.random.random(dim)+Y
        #X->Y->Z
    elif (type == 'collider'):
        # X->Y<-Z
        X = np.random.random(dim)
        Z = np.random.random(dim)
        Y = np.random.random(dim)+X+Z
    #Y = np.zeros(dim)+np.average(Y)
    return list(X)+list(Y)+list(Z)+[1]# 31 dim X:0..9; Y:10..19; Z:20..29; 1: 30

def generate_dataset(dim, size):
    dataset = []
    for i in range(size):
        datapoint = generate_single_sample('collider', dim)
        dataset.append(datapoint)
    dataset = np.array(dataset)
    return dataset


if __name__ == '__main__':
    dataset = generate_dataset(10, 1000)
    cit_tester = CIT(dataset, method = 'kci')
    #ic(cit_tester.kci(0, 20, []))
    # The original version cannot pass this because feature 30 has the same value for every sample
    #ic(cit_tester.kci(0, 20, [30]))
    # The following comes from a recent requirement of mine: using CIT to test high-dimensional variables.
    # Testing high-dimensional variables is not supported by the current cit class, which differs from the documentation,
    # so I also implemented this function in the last commit.
    # An issue related to "CIT of high-dimensional variables" will be put forward later.
    ic(cit_tester.kci(range(10), range(20,30), range(10,20)))
    ic(cit_tester.kci(range(10), range(20,30), []))
    ic(cit_tester.kci(range(10), range(10,20), []))
    ic(cit_tester.kci(range(10), range(20,30), [30]))
    ic(cit_tester.kci(range(10), range(10,20), [30]))

Inconsistent return types

One general problem we're having in comparing algorithms to one another is that algorithms that should all return a GeneralGraph return it in different ways. For instance, the way to get a graph from PC is different from the way one gets it from FCI or from GES. Is it possible to make this uniform, with the same syntax?

Basically one should be able to write something like this (I think):

G = search_algorithm(data, ...)

uniformly, or perhaps

G = search_algorithm(data,...).G

maybe, but uniformly for all algorithms, it seems.
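For concreteness, this is the kind of per-algorithm unwrapping one currently has to do (based on my reading of the docs; details may be slightly off), given a data array of shape (n_samples, n_vars):

from causallearn.search.ConstraintBased.PC import pc
from causallearn.search.ConstraintBased.FCI import fci
from causallearn.search.ScoreBased.GES import ges

cg = pc(data)              # CausalGraph; the GeneralGraph lives in cg.G
g_pc = cg.G

g_fci, edges = fci(data)   # (GeneralGraph, list of edges) tuple

record = ges(data)         # dict-like record; the GeneralGraph is under 'G'
g_ges = record['G']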

Passing domain knowledge

I am planning to get rid of Java dependencies in cause2e by replacing py-causal with causal-learn for the discovery step.

However, my applications require passing domain knowledge in the form of required or forbidden edges in the causal graph. Py-causal and Tetrad have a great interface for domain knowledge. Will this be included in causal-learn, too? In the docs, I have only found possibilities for LiNGAM-type models, but not for GES or PC.

Thanks for finally translating Tetrad to Python!

implementation of fGES (fast greedy equivalence search)

Does this repo have plans to implement the fGES algorithm [1]? fGES seems to work well for large-scale problems, and I want to do some work on a large-scale problem. If there is a related plan, it would make it more convenient to use fGES on the Python platform instead of calling the Java implementation in Tetrad.

[1] Ramsey J, Glymour M, Sanchez-Romero R, et al. A million variables and more: the fast greedy equivalence search algorithm for learning high-dimensional graphical causal models, with an application to functional magnetic resonance images[J]. International journal of data science and analytics, 2017, 3(2): 121-129.

Deleting edges takes a very long time, I came up with a solution.

I was running this program (newest version from a few days ago) on a dataset of roughly 6000 x 800 and found that the program was taking a long time to get from one depth level to the next. I extrapolated and found this was going to take 331 days for just one depth level. I studied the code and eventually traced the bottleneck to GeneralGraph.remove_edge(edge1), which rebuilds the entire edge list in its final line, self.reconstitute_dpath(self.get_graph_edges()). get_graph_edges is the problem, as it traverses all of self.graph to build an edge list.

I fixed the problem as follows: I created a new function remove_edge_only(self, edge: Edge), which is the same as remove_edge except that it omits the last line self.reconstitute_dpath(self.get_graph_edges()). I also added a function:

def clear_dpath(self):
    self.dpath = np.zeros((self.num_vars, self.num_vars), np.dtype(int))

Then in SkeletonDiscovery.py:136, I call remove_edge_only instead of remove_edge. At the end of that loop, I call:

cg.G.clear_dpath()
cg.G.reconstitute_dpath(cg.G.get_graph_edges())

so the edge list is rebuilt only once at the end of all edge deletions instead of after every individual deletion. This brought the runtime down to seconds and produced identical causal graphs on smaller datasets. Since this isn't my project, is there some kind of approval process I need to go through to upload suggested changes like this, or is this something you would want to implement yourselves?

[RFC, META-ISSUE] Complete continuous integration (CI) for unit-testing, documentation, and test coverage

Hi, this is a meta-issue to track the items required to make causal-learn a more PR-able repository:

This list tracks the high-level issues that need to be resolved in a series of PRs. Each may have some description regarding what needs to be done in detail, what the end result should look like, and some motivation. My recommendation is to use a combination of GH Actions and CircleCI, as this has worked well for me in the past, but we can change based on discussion.

  • Implement easy-to-use build, test, docs, and formatting commands, such as https://github.com/py-why/dodiscover/blob/b0cb5c48317cf1c6859b3dc9646925fdac3ecf1f/pyproject.toml#L112-L122. These will be used to ease developer workflows and also make CIs easier to run.
  • Implement different requirement dependency groups e.g. for building, testing, doc-building and actually running of causal-learn
  • Enable CI that runs all unit-tests in each PR and commit to main. Should also upload test coverage via GH actions
  • Enable CI that builds (installs) causal-learn in each PR and commit to main. Testing installation for Windows, Mac OSX and Linux via GH actions.
  • Enable CI that builds the documentation in each PR and commit to main via circleCI
  • Add basic templates to GH issues and PRs: Can copy and modify files from https://github.com/py-why/dodiscover/tree/main/.github

Some possible issues that may need to be resolved along the way:

  • Ensure unit tests run in a reasonable amount of time. I'm not sure how long the unit tests currently take to run, but if they require intensive computation, it might be beneficial to refactor them to be faster
  • Ensure docs are built in a reasonable amount of time. Examples should be self-contained and short to illustrate a point. If a large dataset is needed, one can always trim the dataset and add a note explaining why we did it.

Some implementation details related to CI:

Still a WIP to get this list fully fleshed out, but here's a first go at it.

GeneralGraph.subgraph bug

Hi,
Thanks for the great work on the package.
I think I found a bug in GeneralGraph.subgraph() (causallearn.graph.GeneralGraph) when building on top of the method.
My code:

from causallearn.graph.GeneralGraph import GeneralGraph
import numpy as np
_ , relevant_nodes = cdag.get_parents_plus(cluster3) # A list of nodes (node objects)
#cdag.cg.G.subgraph(relevant_nodes)
subgraph = GeneralGraph(relevant_nodes)
graph = cdag.cg.G.graph # ndarray
for i in range(cdag.cg.G.num_vars):
    print(i)
    if (not cdag.cg.G.nodes[i] in relevant_nodes):
        print(cdag.cg.G.nodes[i].get_name())
        graph = np.delete(graph, i, axis = 0)

Throws error: index 8 is out of bounds for axis 0 with size 8

My code is specific to my environment, but logically it works the same as:

import numpy as np
array = np.zeros((5,5))
for i in range(5):
    for j in range(5):
        array[i,j] = i+j
delete = [1,2,4]
for i in range(5):
    if i in delete:
        array = np.delete(array, i, axis=0)

In causallearn, the graph is an ndarray, and the method iteratively deletes rows/columns. This causes an index-out-of-bounds error, as the array gets smaller and an index later in the loop can be out of bounds.

Interestingly, when I directly restrict from the node list of the graph, I don't get an error:

from causallearn.graph.GraphClass import CausalGraph
test = CausalGraph(no_of_var=5, node_names=['X1','X2','X3','X4','X5'])
node_list = test.G.get_nodes()
restricted_nodes = node_list[0:2] + node_list[3:5]
subgraph = test.G.subgraph(restricted_nodes)

Am I missing something, or is this a bug?

A fix (which I have also submitted as a pull request, #118) would be to change the code to:

def subgraph(self, nodes: List[Node]):
    subgraph = GeneralGraph(nodes)

    graph = self.graph

    nodes_to_delete = []

    for i in range(self.num_vars):
        if not (self.nodes[i] in nodes):
            nodes_to_delete.append(i)

    graph = np.delete(graph, nodes_to_delete, axis = 0)
    graph = np.delete(graph, nodes_to_delete, axis = 1)

    subgraph.graph = graph
    subgraph.reconstitute_dpath(subgraph.get_graph_edges())

    return subgraph

Let me know what you think.
Best,
Jan Marco
