
Comments (8)

pzivich commented on July 23, 2024

I'll have to look into what's happening here. As a simple way to sidestep the issue, have you tried scikit-learn's LinearRegression? That should be the same thing as GLMSL with the Gaussian family. If the rest of sklearn works fine, this should be the quickest fix until I figure out what's happening here.

The original motivation for GLMSL was to get around the automatic penalization that scikit-learn's LogisticRegression does (since I find tuning the C parameter to be inconsistent).
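
If it helps, here is a minimal sketch of that swap (illustrative only; the variable names are mine):

import statsmodels.api as sm
from sklearn.linear_model import LinearRegression
from zepid.superlearner import GLMSL

# These two should fit the same unpenalized linear model for a continuous outcome
outcome_model_glmsl = GLMSL(family=sm.families.family.Gaussian())  # currently returning nan
outcome_model_ols = LinearRegression()  # drop-in replacement with nothing to tune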


pzivich commented on July 23, 2024

From the warning, it looks like a 'bad' value is somehow being returned by GLMSL, which then passes a np.nan to the targeting step. That might be some bad fitting on the GLM's part? I will see if I can replicate it in the next week.
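
For reference, here is a minimal illustration of how a 'bad' predicted value would turn into a nan (my sketch of the mechanism, assuming the log-odds transform shown in the warnings later in this thread):

import numpy as np
from zepid.calc import probability_to_odds

py = np.array([0.2, 1.3, -0.1])  # hypothetical predictions; the last two fall outside (0, 1)
odds = probability_to_odds(py)   # p / (1 - p) -> [0.25, -4.33, -0.09]
print(np.log(odds))              # [-1.386..., nan, nan] with "invalid value encountered in log"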


miaow27 commented on July 23, 2024

Awesome! I dug further into different scenarios where I only fit one estimator at a time for both the exposure model and the outcome model using SingleCrossfit. The error seems to be coming from the outcome model for non-penalized regression (LinearRegression from sklearn or sm.families.Gaussian), as well as for the gradient boosting machine.

My outcome is highly skewed to the right, as shown below:
[image: histogram of the outcome distribution]

However, I have tried capping at the 90th percentile and bounding to 0-1; it still throws the same error.
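
Roughly what I tried (an illustrative sketch, with y standing in for my outcome array):

import numpy as np

y_capped = np.minimum(y, np.quantile(y, 0.90))   # cap at the 90th percentile
y_bounded = (y - y.min()) / (y.max() - y.min())  # rescale to the [0, 1] range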

See detailed information below.

Detailed Estimator Definitions

import statsmodels.api as sm
from sklearn.linear_model import LogisticRegression, LinearRegression, Lasso
from sklearn.ensemble import (RandomForestClassifier, GradientBoostingClassifier,
                              RandomForestRegressor, GradientBoostingRegressor)
from zepid.superlearner import EmpiricalMeanSL, GLMSL

if outcome_binary:
    SL_emp = EmpiricalMeanSL()
    SL_glm = LogisticRegression(penalty='none', tol=1e-3, solver='saga', max_iter=1000)
    SL_lasso = LogisticRegression(penalty='l1', C=1/lasso_alpha,  # lasso_alpha set elsewhere
                                  tol=1e-3, solver='saga', max_iter=1000)
    SL_gmsl = GLMSL(family=sm.families.family.Binomial())
    SL_rf = RandomForestClassifier(n_estimators=100, min_samples_split=5, min_samples_leaf=0.2)
    SL_gbm = GradientBoostingClassifier(n_estimators=100, min_samples_split=5, min_samples_leaf=0.2)
else:
    SL_emp = EmpiricalMeanSL()
    SL_glm = LinearRegression()
    SL_lasso = Lasso(alpha=0.1)
    link_i = sm.genmod.families.links.identity()
    SL_gmsl = GLMSL(family=sm.families.family.Gaussian(link=link_i))
    SL_rf = RandomForestRegressor(n_estimators=100, min_samples_split=5, min_samples_leaf=0.2)
    SL_gbm = GradientBoostingRegressor(n_estimators=100, min_samples_split=5, min_samples_leaf=0.2)
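
These estimators were then passed to SingleCrossfitTMLE along the lines of the sketch below ('A', 'Y', the covariate formulas, and the exact argument names are placeholders and may differ from the actual API):

from zepid.causal.doublyrobust import SingleCrossfitTMLE

sctmle = SingleCrossfitTMLE(df, exposure='A', outcome='Y')   # column names are placeholders
sctmle.exposure_model('X1 + X2 + X3', estimator=SL_glm)      # propensity-score learner
sctmle.outcome_model('A + X1 + X2 + X3', estimator=SL_glm)   # outcome learner
sctmle.fit(n_partitions=3)                                   # matches "partition = 3" below
sctmle.summary()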

Results and Errors

Results

It seems that both the sklearn and the statsmodels versions of unpenalized regression fail, as does GBM; however, Lasso is okay.

ps_model: SL_emp, outcome_model: SL_emp, ATE: 299.7524716151026
ps_model: SL_emp, outcome_model: SL_glm, ATE: nan -> showing the below error
ps_model: SL_emp, outcome_model: SL_lasso, ATE: 299.1976197069547
ps_model: SL_emp, outcome_model: SL_gmsl, ATE: nan -> showing the below error
ps_model: SL_emp, outcome_model: SL_rf, ATE: 94.01894562905046
ps_model: SL_emp, outcome_model: SL_gbm, ATE: nan -> showing the below error

ps_model: SL_glm, outcome_model: SL_emp, ATE: 147.88143010969998
ps_model: SL_glm, outcome_model: SL_glm, ATE: nan -> showing the below error
ps_model: SL_glm, outcome_model: SL_lasso, ATE: 147.93800377478118
ps_model: SL_glm, outcome_model: SL_gmsl, ATE: nan -> showing the below error
ps_model: SL_glm, outcome_model: SL_rf, ATE: 76.54026004386105
ps_model: SL_glm, outcome_model: SL_gbm, ATE: nan -> showing the below error

ps_model: SL_lasso, outcome_model: SL_emp, ATE: 148.62888266058008
ps_model: SL_lasso, outcome_model: SL_glm, ATE: nan -> showing the below error
ps_model: SL_lasso, outcome_model: SL_lasso, ATE: 148.65439623543404
ps_model: SL_lasso, outcome_model: SL_gmsl, ATE: nan -> showing the below error
ps_model: SL_lasso, outcome_model: SL_rf, ATE: 76.49970421812958
ps_model: SL_lasso, outcome_model: SL_gbm, ATE: nan -> showing the below error

ps_model: SL_gmsl, outcome_model: SL_emp, ATE: 104.65699066369935
ps_model: SL_gmsl, outcome_model: SL_glm, ATE: nan -> showing the below error
ps_model: SL_gmsl, outcome_model: SL_lasso, ATE: 104.65699066369935
ps_model: SL_gmsl, outcome_model: SL_gmsl, ATE: nan -> showing the below error
ps_model: SL_gmsl, outcome_model: SL_rf, ATE: 80.77866133286848
ps_model: SL_gmsl, outcome_model: SL_gbm, ATE: nan -> showing the below error

ps_model: SL_rf, outcome_model: SL_emp, ATE: 91.54591473217262
ps_model: SL_rf, outcome_model: SL_glm, ATE: nan -> showing the below error
ps_model: SL_rf, outcome_model: SL_lasso, ATE: 91.54591473217262
ps_model: SL_rf, outcome_model: SL_gmsl, ATE: nan -> showing the below error
ps_model: SL_rf, outcome_model: SL_rf, ATE: 76.42024750089442
ps_model: SL_rf, outcome_model: SL_gbm, ATE: nan -> showing the below error

ps_model: SL_gbm, outcome_model: SL_emp, ATE: 50.585009347615674
ps_model: SL_gbm, outcome_model: SL_glm, ATE: nan -> showing the below error
ps_model: SL_gbm, outcome_model: SL_lasso, ATE: 50.585009347615674
ps_model: SL_gbm, outcome_model: SL_gmsl, ATE: nan -> showing the below error
ps_model: SL_gbm, outcome_model: SL_rf, ATE: 58.8199678774994
ps_model: SL_gbm, outcome_model: SL_gbm, ATE: nan -> showing the below error

Error Message

Each line above represents a test case for getting the ATE for a continuous outcome (everything else is the same).
When the ATE is nan, it comes with the error message below.

The warnings repeat 3 times because I defined partition = 3.

I also notice the errors occur at lines 1663, 1669, 1671, and 1668 in crossfit.py, within the DoubleCrossfitTMLE class.
However, I only called SingleCrossfitTMLE.

/users/.conda/envs/cobra_dev/lib/python3.7/site-packages/zepid/causal/doublyrobust/crossfit.py:1663: RuntimeWarning: invalid value encountered in log
  log = sm.GLM(ys, np.column_stack((h1ws, h0ws)), offset=np.log(probability_to_odds(py_os)),
/users/.conda/envs/cobra_dev/lib/python3.7/site-packages/zepid/causal/doublyrobust/crossfit.py:1669: RuntimeWarning: invalid value encountered in log
  ystar0 = np.append(ystar0, logistic.cdf(np.log(probability_to_odds(py_ns)) - epsilon[1] / pa0s))
/users/.conda/envs/cobra_dev/lib/python3.7/site-packages/zepid/causal/doublyrobust/crossfit.py:1671: RuntimeWarning: invalid value encountered in log
  offset=np.log(probability_to_odds(py_os))))
/users/.conda/envs/cobra_dev/lib/python3.7/site-packages/zepid/causal/doublyrobust/crossfit.py:1668: RuntimeWarning: invalid value encountered in log
  ystar1 = np.append(ystar1, logistic.cdf(np.log(probability_to_odds(py_as)) + epsilon[0] / pa1s))
/users/.conda/envs/cobra_dev/lib/python3.7/site-packages/zepid/causal/doublyrobust/crossfit.py:1663: RuntimeWarning: invalid value encountered in log
  log = sm.GLM(ys, np.column_stack((h1ws, h0ws)), offset=np.log(probability_to_odds(py_os)),
/users/.conda/envs/cobra_dev/lib/python3.7/site-packages/zepid/causal/doublyrobust/crossfit.py:1669: RuntimeWarning: invalid value encountered in log
  ystar0 = np.append(ystar0, logistic.cdf(np.log(probability_to_odds(py_ns)) - epsilon[1] / pa0s))
/users/.conda/envs/cobra_dev/lib/python3.7/site-packages/zepid/causal/doublyrobust/crossfit.py:1671: RuntimeWarning: invalid value encountered in log
  offset=np.log(probability_to_odds(py_os))))
/users/.conda/envs/cobra_dev/lib/python3.7/site-packages/zepid/causal/doublyrobust/crossfit.py:1668: RuntimeWarning: invalid value encountered in log
  ystar1 = np.append(ystar1, logistic.cdf(np.log(probability_to_odds(py_as)) + epsilon[0] / pa1s))
/users/.conda/envs/cobra_dev/lib/python3.7/site-packages/zepid/causal/doublyrobust/crossfit.py:1663: RuntimeWarning: invalid value encountered in log
  log = sm.GLM(ys, np.column_stack((h1ws, h0ws)), offset=np.log(probability_to_odds(py_os)),
/users/.conda/envs/cobra_dev/lib/python3.7/site-packages/zepid/causal/doublyrobust/crossfit.py:1668: RuntimeWarning: invalid value encountered in log
  ystar1 = np.append(ystar1, logistic.cdf(np.log(probability_to_odds(py_as)) + epsilon[0] / pa1s))
/users/.conda/envs/cobra_dev/lib/python3.7/site-packages/zepid/causal/doublyrobust/crossfit.py:1669: RuntimeWarning: invalid value encountered in log
  ystar0 = np.append(ystar0, logistic.cdf(np.log(probability_to_odds(py_ns)) - epsilon[1] / pa0s))
/users/.conda/envs/cobra_dev/lib/python3.7/site-packages/zepid/causal/doublyrobust/crossfit.py:1671: RuntimeWarning: invalid value encountered in log
  offset=np.log(probability_to_odds(py_os))))

I hope this information provides some insight into why SingleCrossfit failed!


miaow27 commented on July 23, 2024

> I'll have to look into what's happening here. As a simple way to sidestep the issue, have you tried scikit-learn's LinearRegression? That should be the same thing as GLMSL with the Gaussian family. If the rest of sklearn works fine, this should be the quickest fix until I figure out what's happening here.
>
> The original motivation for GLMSL was to get around the automatic penalization that scikit-learn's LogisticRegression does (since I find tuning the C parameter to be inconsistent).

I think this particular error tends to occur for GLMSL, and even for LASSO (from sklearn) when alpha is not large enough.


pzivich commented on July 23, 2024

Yeah, it sounds like you are running into sparse data, so GLMSL returns nan because that's what statsmodels' GLM returns. Those nans get fed to the targeting model (another GLM, the one called out in the warning).

So, it sounds like a specific sparsity issue coming up in the data (not something in GLMSL itself). A penalized model (like you used) is probably the way to fix it. Random forests implicitly do something like that, which also explains why they work here.


miaow27 commented on July 23, 2024

This is a very helpful insight! Log-transforming the outcome would be ideal, but then I would not get the ATE, only a ratio instead.
A random forest with parameters that prevent overfitting, or heavier penalization, might be the way to go...

Thank you!


pzivich commented on July 23, 2024

Personally, I would recommend using SuperLearner with a LASSO, Ridge, and Random Forest (and maybe some other algorithms that regularize). That should keep you on the ATE, and all of those are regularized, so they should avoid this issue.

My recommendation of using LASSO + Ridge is based on some observations that random forests can be a little unpredictable when used alone in finite samples (see also "Demystifying Statistical Inference When Using Machine Learning in Causal Research"). Essentially, you want to prevent overfitting by the random forest if you can.
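
Something along these lines (a sketch; the SuperLearner constructor arguments below are my assumptions about zepid.superlearner and may differ by version):

from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Lasso, Ridge
from zepid.superlearner import SuperLearner

# Candidate library of regularized learners for a continuous outcome
candidates = [Lasso(alpha=0.1),
              Ridge(alpha=1.0),
              RandomForestRegressor(n_estimators=100, min_samples_leaf=0.2)]
labels = ["Lasso", "Ridge", "RandomForest"]

# Cross-validated stacking; pass this as the outcome-model estimator so every
# candidate's contribution stays regularized
sl_outcome = SuperLearner(candidates, labels, folds=10)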


miaow27 commented on July 23, 2024

This is an extremely helpful suggestion! Thank you so much!

