
Comments (8)

pzivich commented on July 23, 2024

I'll have to look into what's happening here. As a simple way to sidestep the issue, have you tried scikit-learn's LinearRegression? That should be the same thing as GLMSL with the Gaussian family. If the rest of sklearn works fine, this should be the quickest fix until I figure out what's happening here.

The original motivation for GLMSL was to get around the automatic penalization that scikit-learn's LogisticRegression does (since I find tuning the C parameter to be inconsistent).
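
If it helps, here is a minimal sketch of that swap (illustrative only; the variable names are mine):

import statsmodels.api as sm
from sklearn.linear_model import LinearRegression
from zepid.superlearner import GLMSL

# These two should fit the same unpenalized linear model for a continuous outcome
outcome_model_glmsl = GLMSL(family=sm.families.family.Gaussian())  # currently returning nan
outcome_model_ols = LinearRegression()  # drop-in replacement with nothing to tune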


pzivich commented on July 23, 2024

From the warning, it looks like a 'bad' value is somehow being returned by GLMSL, which then passes a np.nan to the targeting step. That might be some bad fitting on the GLM's part? I will see if I can replicate it in the next week.
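
For reference, here is a minimal illustration of how a 'bad' predicted value would turn into a nan (my sketch of the mechanism, assuming the log-odds transform shown in the warnings later in this thread):

import numpy as np
from zepid.calc import probability_to_odds

py = np.array([0.2, 1.3, -0.1])  # hypothetical predictions; the last two fall outside (0, 1)
odds = probability_to_odds(py)   # p / (1 - p) -> [0.25, -4.33, -0.09]
print(np.log(odds))              # [-1.386..., nan, nan] with "invalid value encountered in log"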


miaow27 commented on July 23, 2024

Awesome! I dug further into different scenarios where I only fit one estimator at a time for both the exposure model and the outcome model using SingleCrossfit. The error seems to be coming from the outcome model for non-penalized regression (LinearRegression from sklearn or sm.families.Gaussian), as well as for the gradient boosting machine.

My outcome is highly skewed to the right, as shown below:
[image: histogram of the outcome distribution]

However, I have tried capping at the 90th percentile and bounding to 0-1; it still throws the same error.
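
Roughly what I tried (an illustrative sketch, with y standing in for my outcome array):

import numpy as np

y_capped = np.minimum(y, np.quantile(y, 0.90))   # cap at the 90th percentile
y_bounded = (y - y.min()) / (y.max() - y.min())  # rescale to the [0, 1] range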

See detailed information below.

Detailed Estimator Definitions

import statsmodels.api as sm
from sklearn.linear_model import LogisticRegression, LinearRegression, Lasso
from sklearn.ensemble import (RandomForestClassifier, GradientBoostingClassifier,
                              RandomForestRegressor, GradientBoostingRegressor)
from zepid.superlearner import EmpiricalMeanSL, GLMSL

if outcome_binary:
    SL_emp = EmpiricalMeanSL()
    SL_glm = LogisticRegression(penalty='none', tol=1e-3, solver='saga', max_iter=1000)
    SL_lasso = LogisticRegression(penalty='l1', C=1/lasso_alpha,  # lasso_alpha set elsewhere
                                  tol=1e-3, solver='saga', max_iter=1000)
    SL_gmsl = GLMSL(family=sm.families.family.Binomial())
    SL_rf = RandomForestClassifier(n_estimators=100, min_samples_split=5, min_samples_leaf=0.2)
    SL_gbm = GradientBoostingClassifier(n_estimators=100, min_samples_split=5, min_samples_leaf=0.2)
else:
    SL_emp = EmpiricalMeanSL()
    SL_glm = LinearRegression()
    SL_lasso = Lasso(alpha=0.1)
    link_i = sm.genmod.families.links.identity()
    SL_gmsl = GLMSL(family=sm.families.family.Gaussian(link=link_i))
    SL_rf = RandomForestRegressor(n_estimators=100, min_samples_split=5, min_samples_leaf=0.2)
    SL_gbm = GradientBoostingRegressor(n_estimators=100, min_samples_split=5, min_samples_leaf=0.2)
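
These estimators were then passed to SingleCrossfitTMLE along the lines of the sketch below ('A', 'Y', the covariate formulas, and the exact argument names are placeholders and may differ from the actual API):

from zepid.causal.doublyrobust import SingleCrossfitTMLE

sctmle = SingleCrossfitTMLE(df, exposure='A', outcome='Y')   # column names are placeholders
sctmle.exposure_model('X1 + X2 + X3', estimator=SL_glm)      # propensity-score learner
sctmle.outcome_model('A + X1 + X2 + X3', estimator=SL_glm)   # outcome learner
sctmle.fit(n_partitions=3)                                   # matches "partition = 3" below
sctmle.summary()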

Results and Errors

Results

It seems that both the sklearn and the statsmodels versions of unpenalized regression fail, as does GBM; however, Lasso is okay.

ps_model: SL_emp, outcome_model: SL_emp, ATE: 299.7524716151026
ps_model: SL_emp, outcome_model: SL_glm, ATE: nan -> showing the below error
ps_model: SL_emp, outcome_model: SL_lasso, ATE: 299.1976197069547
ps_model: SL_emp, outcome_model: SL_gmsl, ATE: nan -> showing the below error
ps_model: SL_emp, outcome_model: SL_rf, ATE: 94.01894562905046
ps_model: SL_emp, outcome_model: SL_gbm, ATE: nan -> showing the below error

ps_model: SL_glm, outcome_model: SL_emp, ATE: 147.88143010969998
ps_model: SL_glm, outcome_model: SL_glm, ATE: nan -> showing the below error
ps_model: SL_glm, outcome_model: SL_lasso, ATE: 147.93800377478118
ps_model: SL_glm, outcome_model: SL_gmsl, ATE: nan -> showing the below error
ps_model: SL_glm, outcome_model: SL_rf, ATE: 76.54026004386105
ps_model: SL_glm, outcome_model: SL_gbm, ATE: nan -> showing the below error

ps_model: SL_lasso, outcome_model: SL_emp, ATE: 148.62888266058008
ps_model: SL_lasso, outcome_model: SL_glm, ATE: nan -> showing the below error
ps_model: SL_lasso, outcome_model: SL_lasso, ATE: 148.65439623543404
ps_model: SL_lasso, outcome_model: SL_gmsl, ATE: nan -> showing the below error
ps_model: SL_lasso, outcome_model: SL_rf, ATE: 76.49970421812958
ps_model: SL_lasso, outcome_model: SL_gbm, ATE: nan -> showing the below error

ps_model: SL_gmsl, outcome_model: SL_emp, ATE: 104.65699066369935
ps_model: SL_gmsl, outcome_model: SL_glm, ATE: nan -> showing the below error
ps_model: SL_gmsl, outcome_model: SL_lasso, ATE: 104.65699066369935
ps_model: SL_gmsl, outcome_model: SL_gmsl, ATE: nan -> showing the below error
ps_model: SL_gmsl, outcome_model: SL_rf, ATE: 80.77866133286848
ps_model: SL_gmsl, outcome_model: SL_gbm, ATE: nan -> showing the below error

ps_model: SL_rf, outcome_model: SL_emp, ATE: 91.54591473217262
ps_model: SL_rf, outcome_model: SL_glm, ATE: nan -> showing the below error
ps_model: SL_rf, outcome_model: SL_lasso, ATE: 91.54591473217262
ps_model: SL_rf, outcome_model: SL_gmsl, ATE: nan -> showing the below error
ps_model: SL_rf, outcome_model: SL_rf, ATE: 76.42024750089442
ps_model: SL_rf, outcome_model: SL_gbm, ATE: nan -> showing the below error

ps_model: SL_gbm, outcome_model: SL_emp, ATE: 50.585009347615674
ps_model: SL_gbm, outcome_model: SL_glm, ATE: nan -> showing the below error
ps_model: SL_gbm, outcome_model: SL_lasso, ATE: 50.585009347615674
ps_model: SL_gbm, outcome_model: SL_gmsl, ATE: nan -> showing the below error
ps_model: SL_gbm, outcome_model: SL_rf, ATE: 58.8199678774994
ps_model: SL_gbm, outcome_model: SL_gbm, ATE: nan -> showing the below error

Error Message

Each line above represents a test case for getting the ATE for a continuous outcome (everything else is the same).
When the ATE is nan, it comes with the error message below.

The warnings repeat 3 times because I defined partition = 3.

I also notice the errors occur at lines 1663, 1669, 1671, and 1668 in crossfit.py, within the DoubleCrossfitTMLE class.
However, I only called SingleCrossfitTMLE.

/users/.conda/envs/cobra_dev/lib/python3.7/site-packages/zepid/causal/doublyrobust/crossfit.py:1663: RuntimeWarning: invalid value encountered in log
  log = sm.GLM(ys, np.column_stack((h1ws, h0ws)), offset=np.log(probability_to_odds(py_os)),
/users/.conda/envs/cobra_dev/lib/python3.7/site-packages/zepid/causal/doublyrobust/crossfit.py:1669: RuntimeWarning: invalid value encountered in log
  ystar0 = np.append(ystar0, logistic.cdf(np.log(probability_to_odds(py_ns)) - epsilon[1] / pa0s))
/users/.conda/envs/cobra_dev/lib/python3.7/site-packages/zepid/causal/doublyrobust/crossfit.py:1671: RuntimeWarning: invalid value encountered in log
  offset=np.log(probability_to_odds(py_os))))
/users/.conda/envs/cobra_dev/lib/python3.7/site-packages/zepid/causal/doublyrobust/crossfit.py:1668: RuntimeWarning: invalid value encountered in log
  ystar1 = np.append(ystar1, logistic.cdf(np.log(probability_to_odds(py_as)) + epsilon[0] / pa1s))
/users/.conda/envs/cobra_dev/lib/python3.7/site-packages/zepid/causal/doublyrobust/crossfit.py:1663: RuntimeWarning: invalid value encountered in log
  log = sm.GLM(ys, np.column_stack((h1ws, h0ws)), offset=np.log(probability_to_odds(py_os)),
/users/.conda/envs/cobra_dev/lib/python3.7/site-packages/zepid/causal/doublyrobust/crossfit.py:1669: RuntimeWarning: invalid value encountered in log
  ystar0 = np.append(ystar0, logistic.cdf(np.log(probability_to_odds(py_ns)) - epsilon[1] / pa0s))
/users/.conda/envs/cobra_dev/lib/python3.7/site-packages/zepid/causal/doublyrobust/crossfit.py:1671: RuntimeWarning: invalid value encountered in log
  offset=np.log(probability_to_odds(py_os))))
/users/.conda/envs/cobra_dev/lib/python3.7/site-packages/zepid/causal/doublyrobust/crossfit.py:1668: RuntimeWarning: invalid value encountered in log
  ystar1 = np.append(ystar1, logistic.cdf(np.log(probability_to_odds(py_as)) + epsilon[0] / pa1s))
/users/.conda/envs/cobra_dev/lib/python3.7/site-packages/zepid/causal/doublyrobust/crossfit.py:1663: RuntimeWarning: invalid value encountered in log
  log = sm.GLM(ys, np.column_stack((h1ws, h0ws)), offset=np.log(probability_to_odds(py_os)),
/users/.conda/envs/cobra_dev/lib/python3.7/site-packages/zepid/causal/doublyrobust/crossfit.py:1668: RuntimeWarning: invalid value encountered in log
  ystar1 = np.append(ystar1, logistic.cdf(np.log(probability_to_odds(py_as)) + epsilon[0] / pa1s))
/users/.conda/envs/cobra_dev/lib/python3.7/site-packages/zepid/causal/doublyrobust/crossfit.py:1669: RuntimeWarning: invalid value encountered in log
  ystar0 = np.append(ystar0, logistic.cdf(np.log(probability_to_odds(py_ns)) - epsilon[1] / pa0s))
/users/.conda/envs/cobra_dev/lib/python3.7/site-packages/zepid/causal/doublyrobust/crossfit.py:1671: RuntimeWarning: invalid value encountered in log
  offset=np.log(probability_to_odds(py_os))))

I hope this information provides some insight into why SingleCrossfit failed!


miaow27 commented on July 23, 2024

> I'll have to look into what's happening here. As a simple way to sidestep the issue, have you tried scikit-learn's LinearRegression? That should be the same thing as GLMSL with the Gaussian family. If the rest of sklearn works fine, this should be the quickest fix until I figure out what's happening here.
>
> The original motivation for GLMSL was to get around the automatic penalization that scikit-learn's LogisticRegression does (since I find tuning the C parameter to be inconsistent).

I think this particular error tends to occur for GLMSL, and even for LASSO (from sklearn) when alpha is not large enough.


pzivich commented on July 23, 2024

Yeah, it sounds like you are running into sparse data, so GLMSL returns nan because that's what statsmodels' GLM returns. Those nans get fed to the targeting model (another GLM, the one called out in the warning).

So, it sounds like a specific sparsity issue coming up in the data (not something in GLMSL itself). A penalized model (like you used) is probably the way to fix it. Random forests implicitly do something like that, which also explains why they work here.


miaow27 commented on July 23, 2024

This is a very helpful insight! Log-transforming the outcome would be ideal, but then I would not get the ATE, only a ratio instead.
A random forest with parameters that prevent overfitting, or heavier penalization, might be the way to go...

Thank you!


pzivich commented on July 23, 2024

Personally, I would recommend using SuperLearner with a LASSO, Ridge, and Random Forest (and maybe some other algorithms that regularize). That should keep you on the ATE, and all of those are regularized, so they should avoid this issue.

My recommendation of using LASSO + Ridge is based on some observations that random forests can be a little unpredictable when used alone in finite samples (see also "Demystifying Statistical Inference When Using Machine Learning in Causal Research"). Essentially, you want to prevent overfitting by the random forest if you can.
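
Something along these lines (a sketch; the SuperLearner constructor arguments below are my assumptions about zepid.superlearner and may differ by version):

from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Lasso, Ridge
from zepid.superlearner import SuperLearner

# Candidate library of regularized learners for a continuous outcome
candidates = [Lasso(alpha=0.1),
              Ridge(alpha=1.0),
              RandomForestRegressor(n_estimators=100, min_samples_leaf=0.2)]
labels = ["Lasso", "Ridge", "RandomForest"]

# Cross-validated stacking; pass this as the outcome-model estimator so every
# candidate's contribution stays regularized
sl_outcome = SuperLearner(candidates, labels, folds=10)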


miaow27 commented on July 23, 2024

This is an extremely helpful suggestion! Thank you so much!

