Hi @amit-sharma
I was thinking more about our discussion around the do operation in the CausalModel, and now I'm convinced the do operation as I've implemented it is incorrect. The do operation says "mutilate the causal model by cutting all arrows into X, and then make an intervention which is effective for each unit".
Instead of a well-defined do() operation, what we actually have is a quantity, Q = E[Y|do(x)], estimated in the new causal model that results from the do operation. Of course we don't have to explicitly cut arrows if we block backdoor paths, but it's the "make the intervention effective" piece that I think needs a better abstraction.
As I was figuring out the best way to implement the causal plot method, I found I was doing an aggregation before the plot, and then plotting the aggregate. That makes sense: we can't identify an outcome for a single unit -- only strata of units, and statistics of those outcomes (e.g. E[Y|do(x)] instead of y_i|do(x)).
The right way to implement this on top of pandas' abstractions would be a causal version of groupby(x).mean().plot(), where the mean() method has no special significance -- it's just the particular quantity you'd like to estimate within a stratum of the mutilated causal model. It's easy to estimate from Robins' g-formula, but so is the second moment, etc. Really, we should support general aggregation functions for Q, like groupby(x).agg(Q).plot().
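To make that concrete, here's a toy sketch of the estimation side for discrete X and Z (my own illustration, not the library's API; the data-generating process is made up). The point is that the same g-formula machinery serves any Q that's a conditional expectation of some function of Y, since E[f(Y)|do(x)] = sum_z E[f(Y)|x,z] P(z):

```python
import numpy as np
import pandas as pd

# Toy data: z confounds x and y.
rng = np.random.default_rng(0)
n = 10_000
z = rng.integers(0, 2, n)                       # binary confounder
x = rng.binomial(1, 0.3 + 0.4 * z)              # treatment depends on z
y = 2 * x + 3 * z + rng.normal(size=n)          # outcome depends on x and z
df = pd.DataFrame({'x': x, 'z': z, 'y': y})

def causal_agg(df, q):
    """g-formula aggregation: E[q(Y)|do(x)] = sum_z E[q(Y)|x,z] P(z)."""
    cell = df.groupby(['x', 'z'])['y'].agg(q)   # Q within each (x, z) cell
    p_z = df['z'].value_counts(normalize=True)  # marginal P(z), *not* P(z|x)
    return cell.unstack('z').mul(p_z, axis=1).sum(axis=1)

print(causal_agg(df, 'mean'))                       # ~1.5 and ~3.5: 2x + E[3z]
print(causal_agg(df, lambda s: (s ** 2).mean()))    # second moment, same machinery
```

A genuinely arbitrary Q (a quantile of P(Y|do(x)), say) doesn't decompose this way, which is where the sampling idea below comes in.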
With our current abstractions, the procedure might go something like this: the groupby is done over units for whom the intervention has been made effective, so it's not a standard groupby -- it contains all units in the dataframe, but with X set to x, grouped at each level of X. Then, the mean is the tricky part: you replace each y_i with an estimate of E[Y|X=x, Z=z_i] in the appropriate x stratum, and average over the result to get the group means. That second average implements the averaging over P(z), so you get the g-formula result for the first moment. I was using our do() implementation for that y_i replacement.
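Here's the unit-level version of that same computation, reusing the toy df from the sketch above -- the y_i replacement step stands in for what our do() was doing:

```python
# Replace each y_i with an estimate of E[Y | X=x0, Z=z_i] for *all* units
# (the intervention made effective), then take the group mean. Averaging
# over units is what implements the average over P(z).
e_y_given_xz = df.groupby(['x', 'z'])['y'].mean()   # estimate of E[Y|x,z]

group_means = {}
for x0 in sorted(df['x'].unique()):
    y_do = df['z'].map(e_y_given_xz.loc[x0])        # y_i -> E[Y | x0, z_i]
    group_means[x0] = y_do.mean()                   # empirical average over P(z)
print(group_means)                                  # matches causal_agg(df, 'mean')
```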
Clearly we can get much more general than this. The causal groupby just says "make the intervention effective for each x", and the aggregation function says "and now compute the stratum-level statistics in the mutilated causal model". The problem with our current abstractions is that the quantity, Q, is coupled with the estimation procedure. There are good reasons for that -- mainly statistical efficiency. We can do better though, at least in some cases.
I could definitely see a generalized procedure for sampling from P(Y|do(x)), and then computing a user-defined statistic over those samples. That would make bootstrapping errors easy as well. I think that's a much more general approach to the do operation: do(x) returns a same-length series for Y given do(x), so we can compute df.do(y=['y'], x=['x']).mean(), df.do(y=['y'], x=['x']).std(), etc. Then the groupby procedure could be a light layer on top of do(): df.causal.groupby(x=['x'], y=['y']).mean() would sample the y variables after intervening on the set x (within each stratum), and then return the mean with the multi-index x. Then we have the right abstractions for pandas to operate with, and it's pretty intuitive too.
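A minimal sketch of what that sampling-based do() could look like in the discrete case, resampling y within (x, z) cells -- names and signature are placeholders, not what's on the PR, and it assumes positivity (every z stratum contains units at the requested x). It reuses the toy df from above:

```python
def do_sample(df, x0, rng=np.random.default_rng(1)):
    """Draw, for each unit, a y from the empirical P(Y | X=x0, Z=z_i).

    Collectively the draws sample P(Y|do(x0)), since the units' z values
    are themselves draws from P(z).
    """
    pools = {z0: grp.to_numpy()
             for z0, grp in df[df['x'] == x0].groupby('z')['y']}
    return pd.Series([rng.choice(pools[z0]) for z0 in df['z']],
                     index=df.index)

samples = do_sample(df, x0=1)
print(samples.mean(), samples.std())   # any statistic of P(Y|do(x=1))
```

Rerunning it with different seeds (or over resampled dataframes) gives bootstrap errors almost for free.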
tl;dr I need to port over nonparametric conditional density estimation from causality,
https://github.com/akelleh/causality/blob/master/causality/estimation/nonparametric.py
but simplify it down to be a sampling procedure. That'll get rid of the integration over the Z cube, which should make it fast! Then we'll have a general nonparametric do, and can make proper aggregation functions.
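For continuous variables, I think the sampling version falls out of the KDE directly: with product Gaussian kernels, P(y|x,z) is a mixture of Gaussians centered at the observed y_j, with mixture weights given by the x and z kernels at the query point. So instead of integrating over the z cube, we pick a mixture component and perturb it. A rough sketch of that idea (single shared bandwidth h for simplicity; this is not the ported code):

```python
import numpy as np

def sample_y_do_x(X, Z, Y, x0, h=0.5, rng=np.random.default_rng(2)):
    """Sample one y from the KDE estimate of P(Y | X=x0, Z=z_i) per unit."""
    samples = np.empty(len(Z))
    for i, z_i in enumerate(Z):                 # z_i ~ empirical P(z)
        # Gaussian-kernel weight of each data point at the query (x0, z_i)
        w = np.exp(-0.5 * (((X - x0) / h) ** 2 + ((Z - z_i) / h) ** 2))
        j = rng.choice(len(Y), p=w / w.sum())   # pick a mixture component
        samples[i] = Y[j] + h * rng.normal()    # draw from that component
    return samples                              # collectively ~ P(Y|do(x0))
```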
The API as it exists on the PR (#34) is probably good enough for the time being, if you want to merge that. It's going to take a while to add the sampling process. In fact, I think it's a new object type, since we could do it with kernel density estimation, MCMC sampling on a parametric model, stratified sampling with discrete X and Z, etc.
What do you think? Should I add an "InterventionalDistributionSampler" base class?
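For concreteness, something like this (all names hypothetical):

```python
class InterventionalDistributionSampler:
    """Base class: fit to data, then sample from P(Y | do(X=x)).

    Subclasses supply the estimation strategy: conditional KDE,
    MCMC on a parametric model, stratified resampling for discrete
    X and Z, and so on.
    """

    def fit(self, df, x_cols, y_cols, z_cols):
        raise NotImplementedError

    def sample(self, x_values, n_samples=1000):
        raise NotImplementedError
```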
Best,
A