Hi @amit-sharma
I was thinking more about our discussion around the do operation in the CausalModel, and now I'm convinced the do operation as I've implemented it is incorrect. The do operation says "mutilate the causal model by cutting all arrows into X, and then make an intervention which is effective for each unit".
Instead of a well-defined do() operation, what we actually have is a quantity, Q = E[Y|do(x)], estimated in the new causal model that results from the do operation. Of course we don't have to explicitly cut arrows if we block backdoor paths, but it's the "make the intervention effective" piece that I think needs a better abstraction.
As I was figuring out the best way to implement the causal plot method, I found I was doing an aggregation before the plot, and then plotting the aggregate. That makes sense: we can't identify an outcome for a single unit -- only strata of units, and statistics of those outcomes (e.g. E[Y|do(x)] instead of y_i|do(x)).
The right way to implement this on top of pandas' abstractions would be a causal version of groupby(x).mean().plot(), where the mean() method has no special significance -- it's just the particular quantity you'd like to estimate within a stratum of the mutilated causal model. It's easy to estimate from Robins' g-formula, but so is the second moment, etc. Really, we should support general aggregation functions for Q, like groupby(x).agg(Q).plot().
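To make that concrete, here's a toy sketch of the estimation side for discrete X and Z (my own illustration, not the library's API; the data-generating process is made up). The point is that the same g-formula machinery serves any Q that's a conditional expectation of some function of Y, since E[f(Y)|do(x)] = sum_z E[f(Y)|x,z] P(z):

```python
import numpy as np
import pandas as pd

# Toy data: z confounds x and y.
rng = np.random.default_rng(0)
n = 10_000
z = rng.integers(0, 2, n)                       # binary confounder
x = rng.binomial(1, 0.3 + 0.4 * z)              # treatment depends on z
y = 2 * x + 3 * z + rng.normal(size=n)          # outcome depends on x and z
df = pd.DataFrame({'x': x, 'z': z, 'y': y})

def causal_agg(df, q):
    """g-formula aggregation: E[q(Y)|do(x)] = sum_z E[q(Y)|x,z] P(z)."""
    cell = df.groupby(['x', 'z'])['y'].agg(q)   # Q within each (x, z) cell
    p_z = df['z'].value_counts(normalize=True)  # marginal P(z), *not* P(z|x)
    return cell.unstack('z').mul(p_z, axis=1).sum(axis=1)

print(causal_agg(df, 'mean'))                       # ~1.5 and ~3.5: 2x + E[3z]
print(causal_agg(df, lambda s: (s ** 2).mean()))    # second moment, same machinery
```

A genuinely arbitrary Q (a quantile of P(Y|do(x)), say) doesn't decompose this way, which is where the sampling idea below comes in.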
With our current abstractions, the procedure might go something like this: the groupby is done over units for whom the intervention has been made effective, so it's not a standard groupby -- it contains all units in the dataframe, but with X set to x, grouped at each level of X. Then, the mean is the tricky part: you replace each y_i with an estimate of E[Y|X=x, Z=z_i] in the appropriate x stratum, and average over the result to get the group means. That second average implements the averaging over P(z), so you get the g-formula result for the first moment. I was using our do() implementation for that y_i replacement.
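Here's the unit-level version of that same computation, reusing the toy df from the sketch above -- the y_i replacement step stands in for what our do() was doing:

```python
# Replace each y_i with an estimate of E[Y | X=x0, Z=z_i] for *all* units
# (the intervention made effective), then take the group mean. Averaging
# over units is what implements the average over P(z).
e_y_given_xz = df.groupby(['x', 'z'])['y'].mean()   # estimate of E[Y|x,z]

group_means = {}
for x0 in sorted(df['x'].unique()):
    y_do = df['z'].map(e_y_given_xz.loc[x0])        # y_i -> E[Y | x0, z_i]
    group_means[x0] = y_do.mean()                   # empirical average over P(z)
print(group_means)                                  # matches causal_agg(df, 'mean')
```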
Clearly we can get much more general than this. The causal groupby just says "make the intervention effective for each x", and the aggregation function says "and now compute the stratum-level statistics in the mutilated causal model". The problem with our current abstractions is that the quantity, Q, is coupled with the estimation procedure. There are good reasons for that -- mainly statistical efficiency. We can do better though, at least in some cases.
I could definitely see a generalized procedure for sampling from P(Y|do(x)), and then computing a user-defined statistic over those samples. That would make bootstrapping errors easy as well. I think that's a much more general approach to the do operation: do(x) returns a same-length series for Y given do(x), so we can compute df.do(y=['y'], x=['x']).mean(), df.do(y=['y'], x=['x']).std(), etc. Then the groupby procedure could be a light layer on top of do(): df.causal.groupby(x=['x'], y=['y']).mean() would sample the y variables after intervening on the set x (within each stratum), and then return the mean with the multi-index x. Then we have the right abstractions for pandas to operate with, and it's pretty intuitive too.
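A minimal sketch of what that sampling-based do() could look like in the discrete case, resampling y within (x, z) cells -- names and signature are placeholders, not what's on the PR, and it assumes positivity (every z stratum contains units at the requested x). It reuses the toy df from above:

```python
def do_sample(df, x0, rng=np.random.default_rng(1)):
    """Draw, for each unit, a y from the empirical P(Y | X=x0, Z=z_i).

    Collectively the draws sample P(Y|do(x0)), since the units' z values
    are themselves draws from P(z).
    """
    pools = {z0: grp.to_numpy()
             for z0, grp in df[df['x'] == x0].groupby('z')['y']}
    return pd.Series([rng.choice(pools[z0]) for z0 in df['z']],
                     index=df.index)

samples = do_sample(df, x0=1)
print(samples.mean(), samples.std())   # any statistic of P(Y|do(x=1))
```

Rerunning it with different seeds (or over resampled dataframes) gives bootstrap errors almost for free.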
tl;dr I need to port over nonparametric conditional density estimation from causality,
https://github.com/akelleh/causality/blob/master/causality/estimation/nonparametric.py
but simplify it down to be a sampling procedure. That'll get rid of the integration over the Z cube, which should make it fast! Then we'll have a general nonparametric do, and can make proper aggregation functions.
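For continuous variables, I think the sampling version falls out of the KDE directly: with product Gaussian kernels, P(y|x,z) is a mixture of Gaussians centered at the observed y_j, with mixture weights given by the x and z kernels at the query point. So instead of integrating over the z cube, we pick a mixture component and perturb it. A rough sketch of that idea (single shared bandwidth h for simplicity; this is not the ported code):

```python
import numpy as np

def sample_y_do_x(X, Z, Y, x0, h=0.5, rng=np.random.default_rng(2)):
    """Sample one y from the KDE estimate of P(Y | X=x0, Z=z_i) per unit."""
    samples = np.empty(len(Z))
    for i, z_i in enumerate(Z):                 # z_i ~ empirical P(z)
        # Gaussian-kernel weight of each data point at the query (x0, z_i)
        w = np.exp(-0.5 * (((X - x0) / h) ** 2 + ((Z - z_i) / h) ** 2))
        j = rng.choice(len(Y), p=w / w.sum())   # pick a mixture component
        samples[i] = Y[j] + h * rng.normal()    # draw from that component
    return samples                              # collectively ~ P(Y|do(x0))
```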
The API as it exists on the PR (#34) is probably good enough for the time being, if you want to merge that. It's going to take a while to add the sampling process. In fact, I think it's a new object type, since we could do it with kernel density estimation, MCMC sampling on a parametric model, stratified sampling with discrete X and Z, etc.
What do you think? Should I add an "InterventionalDistributionSampler" base class?
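For concreteness, something like this (all names hypothetical):

```python
class InterventionalDistributionSampler:
    """Base class: fit to data, then sample from P(Y | do(X=x)).

    Subclasses supply the estimation strategy: conditional KDE,
    MCMC on a parametric model, stratified resampling for discrete
    X and Z, and so on.
    """

    def fit(self, df, x_cols, y_cols, z_cols):
        raise NotImplementedError

    def sample(self, x_values, n_samples=1000):
        raise NotImplementedError
```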
Best,
A