almost-matching-exactly / dame-flame-python-package

A Python Package providing two algorithms, DAME and FLAME, for fast and interpretable treatment-control matches of categorical data

Home Page: https://almost-matching-exactly.github.io/DAME-FLAME-Python-Package/

License: MIT License

Topics: matching, causal-inference, machine-learning, econometrics, python, data-science

dame-flame-python-package's Introduction


DAME-FLAME

A Python package for performing matching for observational causal inference on datasets containing discrete covariates

Documentation here

DAME-FLAME is a Python package for performing matching for observational causal inference on datasets containing discrete covariates. It implements the Dynamic Almost Matching Exactly (DAME) and Fast, Large-Scale Almost Matching Exactly (FLAME) algorithms, which match treatment and control units on subsets of the covariates. The resulting matched groups are interpretable, because the matches are made on covariates, and high-quality, because machine learning is used to determine which covariates are important to match on.

Installation

Dependencies

dame-flame requires Python >= 3.6.5 and the following packages:

  • pandas>=0.11.0
  • numpy>=1.16.5
  • scikit-learn>=0.23.2

If your Python environment does not have these packages, install them from PyPI.

To run the examples in the examples folder (these are not part of the package), Jupyter Notebook or Jupyter Lab and Matplotlib (>=2.0.0) are also required.

User Installation

Install from PyPI via $ pip install dame-flame

A Tutorial for the FLAME database version

Make a toy dataset

import pandas as pd
from dame_flame.flame_db.utils import *
from dame_flame.flame_db.FLAME_db_algorithm import *

train_df = pd.DataFrame([[0,1,1,1,0,5], [0,1,1,0,0,6], [1,0,1,1,1,7], [1,1,1,1,1,7]], 
                  columns=["x1", "x2", "x3", "x4", "treated", "outcome"])
test_df = pd.DataFrame([[0,1,1,1,0,5], [0,1,1,0,0,6], [1,0,1,1,1,7], [1,1,1,1,1,7]], 
                  columns=["x1", "x2", "x3", "x4", "treated", "outcome"])                 

Connect to the database

select_db = "postgreSQL"  # Select the database you are using: "MySQL", "postgreSQL", "Microsoft SQL server"
database_name = 'tmp'     # name of the database you use
host = 'localhost'
port = "5432"
user = "postgres"
password = ""

conn = connect_db(database_name, user, password, host, port)

Insert the data to be matched into the database

If you already have the dataset in the database, skip this step. Otherwise, insert test_df (the data to be matched) into the database you are using.

from dame_flame.flame_db.gen_insert_data import *
insert_data_to_db("datasetToBeMatched",  # name of the table containing the dataset to be matched
                  test_df,
                  treatment_column_name="treated",
                  outcome_column_name='outcome',
                  conn=conn)

Run FLAME_db

res = FLAME_db(input_data = "datasetToBeMatched", # name of the table containing the dataset to be matched
              holdout_data = train_df, # holdout set, used to train the model
              conn = conn # connection object returned by connect_db
              )

Analysis results

res[0]:
    data frame of matched groups; each row represents one matched group.
    res[0]['avg_outcome_control']:
        average outcome of the control units in each matched group
    res[0]['avg_outcome_treated']:
        average outcome of the treated units in each matched group
    res[0]['num_control']:
        number of control units in each matched group
    res[0]['num_treated']:
        number of treated units in each matched group
    res[0]['is_matched']:
        the level each matched group belongs to
res[1]:
    a list of level numbers where we have matched groups
res[2]:
    a list of names of the covariates that were dropped

Postprocessing

ATE_db(res) # Get ATE for the whole dataset
ATT_db(res) # Get ATT for the whole dataset
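For intuition, an ATE in the spirit of ATE_db can be computed by hand from the matched-groups frame described above. The sketch below builds a toy frame using only the documented res[0] column names; the size-weighted average is an illustrative estimator, not necessarily the package's exact implementation.

```python
import pandas as pd

# Toy frame in the shape of res[0], using the column names documented above
groups = pd.DataFrame({
    "avg_outcome_treated": [7.0, 6.0],
    "avg_outcome_control": [5.0, 5.5],
    "num_treated":         [2, 1],
    "num_control":         [1, 2],
})

# Per-group treated-minus-control difference, weighted by group size
diffs = groups["avg_outcome_treated"] - groups["avg_outcome_control"]
sizes = groups["num_treated"] + groups["num_control"]
ate = (diffs * sizes).sum() / sizes.sum()
print(ate)  # 1.25 for this toy frame
```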

dame-flame-python-package's People

Contributors

alexlanglang, dependabot[bot], nehargupta, nickeubank, thowell332


dame-flame-python-package's Issues

ENH: Return full dataset for subsequent analysis

As per discussion with @nehargupta via email, it'd be helpful if dame returned an analysis dataset. results is a start, but it drops treatment assignment and outcome, and the insertion of "*" values means it can't actually be used for analysis.

Basically, most people I know who do matching in the social sciences expect the matching package to return a dataset with all the matched pairs. In one-to-one matching, that dataset is just the subset of the original dataset for which matches were found. In many-to-one matching, observations that are matched more than once are repeated in the dataset, with weights that reflect that fact. You can see this in the MatchIt docs here:
https://kosukeimai.github.io/MatchIt/articles/MatchIt.html#estimating-the-treatment-effect (you may have to scroll up a little for context).
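The 1/k weighting convention described above can be sketched in a few lines (the names here are illustrative, not the package's API):

```python
# Each unit matched into k groups gets weight 1/k, so units that repeat
# across groups don't over-count when estimating effects on the stacked data.
groups_per_unit = {"a": 1, "b": 2, "c": 4}  # hypothetical match counts
weights = {unit: 1 / k for unit, k in groups_per_unit.items()}
print(weights)  # {'a': 1.0, 'b': 0.5, 'c': 0.25}
```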

Here's a super-crude implementation of what I'm looking for, where result_of_fit is the output of the match (the "result"), and model is my fitted model.

def get_dataframe(model, result_of_fit):

    # Get original data
    better = model.input_data.loc[result_of_fit.index]

    # Get match groups for clustering
    better["match_group"] = np.nan
    for idx, group in enumerate(model.units_per_group):
        better.loc[group, "match_group"] = idx

    # Weights
    better["weights"] = 1 / model.groups_per_unit

    if not model.repeats:
        assert (better["weights"] == 1).all()

    # Make sure right N!
    assert len(result_of_fit) == len(better)

    return better

EDIT: Oops. @nehargupta points out that the index in result_of_fit IS preserved, so edited with a much, much simpler solution.

handle factor inputs

Explore creating some handling for factor inputs: if a user enters a column called Race with values 0/1/2 or black/white/asian, it should be split into three variables, each a boolean. Keeping them as ordered values doesn't make sense given that we run linear regression on these columns to match them.
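One possible handling, using pandas one-hot encoding (a sketch of the idea, not the package's implementation):

```python
import pandas as pd

df = pd.DataFrame({"Race": ["black", "white", "asian", "black"]})

# Split the factor column into one boolean indicator per level
dummies = pd.get_dummies(df["Race"], prefix="Race")
df = pd.concat([df.drop(columns="Race"), dummies], axis=1)
print(sorted(df.columns))  # ['Race_asian', 'Race_black', 'Race_white']
```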

sorting of columns if adaptive_weights=False

I'm pretty sure that line 293 in data_cleaning.py:

if (adaptive_weights == False):
    # Ensure that the columns are sorted in order: binary, ternary, etc.

isn't actually needed, so maybe this can be removed. Am I incorrect?

Testing: Ensure that DAME consistent with R-DAME

Working on this now, aim is to complete by end of this month, will definitely go out in the release coming out before July. It looks like there was a case in which this dame-flame Python package was ending one iteration too early, and that has been resolved now with #34. Doing more testing to ensure no other issues now.

confirm no floating point issues

Look to see if the post_processing file for CATE, ATE, and ATT is vulnerable to floating-point rounding issues, and if so, how to resolve them. Use a small 4-unit dataset or something to confirm.
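A tolerance-based comparison along these lines could catch such issues; the 4-unit dataset below is illustrative only:

```python
import math

# Four units: two treated with outcomes 7, 7 and two control with 5, 6
treated_outcomes = [7.0, 7.0]
control_outcomes = [5.0, 6.0]

ate = sum(treated_outcomes) / len(treated_outcomes) \
    - sum(control_outcomes) / len(control_outcomes)

# Compare against the hand-computed value with a tolerance,
# not exact floating-point equality
print(math.isclose(ate, 1.5))  # True
```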

Off by 1 (or two?) error in `model.pe_each_iter`?

Also running into some indexing confusion with model.pe_each_iter among students -- since iteration 1 has a PE of 0 (all exact matches, right?) it doesn't get included in pe_each_iter, which means that if you index into it, it's off by one (or two, given that DAME-FLAME's counting seems to start at 1, not 0). Probably need to adopt a consistent approach to these.

No releases since recent patches?

Hey DAME-FLAME team!

I'm using this in class here at Duke, and just realized that a patch I put in last year for a reporting error (#34 ), while merged, is not in the pypi version because looks like the package hasn't had any releases since 2020. Any chance you could push a new release? Very confused students... :(

Thanks!

Nick

Documentation updates needed

if (pe - prev_pe)/prev_pe >= (1 - early_stops.pe):

Looking here, it looks like "early_stop_pe_frac" should be a percent change variable from the previous predictive error, but in the documentation I didn't really describe it this way and instead described it as a raw value not a percent change: https://almost-matching-exactly.github.io/DAME-FLAME-Python-Package/api-documentation/DAME (search "early_stop_pe_frac").

This needs to be double-checked (run the code on a sample dataset and confirm), and then the documentation needs to be updated if that's the case.
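Reading the quoted line literally, the condition would behave as below; this is a sketch of that single check, not the package's full loop:

```python
def should_stop(pe, prev_pe, early_stop_pe_frac):
    # Stop when PE rises by at least (1 - early_stop_pe_frac)
    # relative to the previous iteration's PE
    return (pe - prev_pe) / prev_pe >= (1 - early_stop_pe_frac)

# With early_stop_pe_frac=0.75, a 30% rise in PE triggers the stop,
# while a 10% rise does not -- i.e. the parameter acts as a percent-change
# threshold, not a raw PE value
print(should_stop(pe=1.3, prev_pe=1.0, early_stop_pe_frac=0.75))  # True
print(should_stop(pe=1.1, prev_pe=1.0, early_stop_pe_frac=0.75))  # False
```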

Misleading verbose output

For verbose = 3, the output starts with iteration 2 (skipping over the first iteration which checks for exact matches) which could be misleading for some users.

ENH: Move to 0-indexing?

Don't get me wrong, I'm a social scientist and prefer 1-indexing (e.g. love me some Julia!). But Python is a zero-indexed language, so maybe move everything to zero indexing to address #32 and #47 ? Note will require change in output message header, rollback of #33, and changing early_stop_iterations keyword.

Valid input checking for unique indexes

We don't check this right now, and it would cause problems if someone inputs a dataframe with nonunique indexes. Add something to data_cleaning.py that checks via:
df.index.is_unique
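The check could look something like this (a sketch; the function name and error message are hypothetical):

```python
import pandas as pd

def check_unique_index(df):
    # Reject dataframes whose index contains duplicate labels
    if not df.index.is_unique:
        raise ValueError("Input dataframe must have a unique index")

good = pd.DataFrame({"x": [1, 2]}, index=[0, 1])
check_unique_index(good)  # passes silently

bad = pd.DataFrame({"x": [1, 2]}, index=[0, 0])
# check_unique_index(bad) would raise ValueError
```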

Integrate database version

Will make the algorithms scalable to way more covariates and units.

At the current state @ALEXLANGLANG has finished the code and it mostly needs to just be documented for users on the webpage, and @nehargupta (or any second person) needs to give it a quick glance. It looks well written/code commented and has been checked for correctness.

"No object to concatenate error"

When running with the adaptive_weights='decisionTreeCV' parameter, in late iterations the error 'no objects to concatenate' sometimes appears. Users who see this error can apparently avoid it by using the early_stop_iterations parameter, by using a learning method other than decision trees, or possibly by binarizing columns before running the algorithm.

create additional stopping criteria

For flame, it's epsilon (float, default 0.25): Early stopping criteria, the acceptable percent change in PE before stopping

todo: calculate the baseline PE and use it to create a stopping based on
the epsilon criteria.

  baseline_pe = flame_dame_helpers.find_pe_for_covar_set(
           df_holdout_array, treatment_column_name, outcome_column_name, 
           all_covs, adaptive_weights, alpha)

todo: check for stopping criteria based on PE

        if (pe > (1 + epsilon) * baseline_pe):
            print("We stopped matching because predictive error would have "\
                  "risen ", 100 * epsilon, "% above the baseline.")
            break

Error in ATE estimation

Hi, I'm trying FLAME for the first time and encountered an error during post-processing of the ATE:

Code snippet:

model = dame_flame.matching.FLAME(
    repeats=True, 
    verbose=3, 
    adaptive_weights="decisiontree", 
    stop_unmatched_t=True, 
    early_stop_un_t_frac=0.005, 
    missing_holdout_replace=0, 
    want_pe=True,
    want_bf=True,
)
model.fit(holdout_data=df, treatment_column_name="treated", outcome_column_name="outcome")
result = model.predict(df)
dame_flame.utils.post_processing.ATE(model)

Error message:

---------------------------------------------------------------------------
ZeroDivisionError                         Traceback (most recent call last)
<ipython-input-10-b4ec8bd0f432> in <module>
----> 1 dame_flame.utils.post_processing.ATE(model)

~/Library/Caches/pypoetry/virtualenvs/pandata-ml-deal-optimisation--RFgPKiW-py3.7/lib/python3.7/site-packages/dame_flame/utils/post_processing.py in ATE(matching_object, mice_iter)
    161         treated = group_data.loc[group_data[matching_object.treatment_column_name] == 1]
    162         control = group_data.loc[group_data[matching_object.treatment_column_name] == 0]
--> 163         avg_treated = sum(treated[matching_object.outcome_column_name]) / len(treated.index)
    164         avg_control = sum(control[matching_object.outcome_column_name]) / len(control.index)
    165         cates[group_id] = avg_treated - avg_control

ZeroDivisionError: division by zero

Is it possible that some matched_groups do not contain treatment units?
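The traceback suggests exactly that: a matched group with no treated units makes len(treated.index) zero. A defensive sketch of the computation quoted in the traceback (not the package's actual code) would skip one-arm groups:

```python
import pandas as pd

def safe_cate(group_data, treatment_col="treated", outcome_col="outcome"):
    # Skip groups lacking treated or control units instead of dividing by zero
    treated = group_data[group_data[treatment_col] == 1]
    control = group_data[group_data[treatment_col] == 0]
    if treated.empty or control.empty:
        return None  # no treated-control contrast available in this group
    return treated[outcome_col].mean() - control[outcome_col].mean()

all_control = pd.DataFrame({"treated": [0, 0], "outcome": [5.0, 6.0]})
mixed = pd.DataFrame({"treated": [1, 0], "outcome": [7.0, 5.0]})
print(safe_cate(all_control), safe_cate(mixed))  # None 2.0
```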

Documentation updates

A number of small things:

  • Check the pe_frac things
  • Add paper citation, update names/contact email
  • Perhaps add the algorithm diagram from the paper somewhere too
  • (if it gets typed up) then an enhancement with a code-class-flow diagram
