Comments (4)
This one just bit me :)
For reference, please see my question and response in this thread: http://stats.stackexchange.com/questions/36064/calculating-r-squared-coefficient-of-determination-with-centered-vs-un-center
The common wisdom seems to be that the correct definition of R^2 is the one you mentioned, using the centered total sum of squares. This is also how R does it. That isn't a reason (in and of itself) to change things, but the community does seem to have a certain expectation in this regard, at least.
I would say this is an issue that deserves some bandwidth. Linear regression is a pretty basic thing for people to want to do with a statistical package...
from statsmodels.
Thanks for the links. Will have a look and sort this out.
A quick comment. Linear regression is indeed a pretty basic thing, and we provide an OLS class that is fully correct. But as soon as you fit the model without an intercept, you're not doing ordinary OLS AFAIK. Without an intercept, you're fitting a regression-through-origin (RTO) model, which is a strong substantive claim. The reason I've kicked this down the road so far is that it's still not really clear to me what a "correct" R^2 is in this case. The existence of the R^2 measure depends on the model being fit with an intercept. This (and other) R_0^2 measures [1] may be (somewhat) analogous to R^2 in that they're forced to lie in (0, 1), like pseudo-R^2 measures for non-linear models, but R_0^2 is not the same as R^2. I.e., you can't compare R^2 and R_0^2 values. For the most part, I've seen it recommended not to rely on R^2 for the RTO model, since it can be wildly overinflated (the uncentered total sum of squares is larger), and in the models I deal with it's almost never the case that you want to force the predicted value to be zero when the regressors are zero. I always assumed this is why it's almost never discussed in textbooks.
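To make the overinflation concrete, here is a small illustrative sketch (plain NumPy, not statsmodels code): the data truly have an intercept, we fit through the origin anyway, and the uncentered R^2 looks excellent while the centered one reveals a poor fit.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(10, 20, size=50)            # regressor far from zero
y = 5 + 0.1 * x + rng.normal(0, 0.5, 50)    # true model HAS an intercept

# Least-squares slope for regression through the origin: b = sum(xy) / sum(x^2)
b = (x @ y) / (x @ x)
resid = y - b * x
ssr = resid @ resid

tss_centered = ((y - y.mean()) ** 2).sum()  # usual (with-intercept) definition
tss_uncentered = (y ** 2).sum()             # definition used without a constant

r2_centered = 1 - ssr / tss_centered        # can even go negative here
r2_uncentered = 1 - ssr / tss_uncentered    # close to 1, deceptively good
```

Because sum(y^2) exceeds sum((y - mean(y))^2) whenever mean(y) != 0, the uncentered R^2 is always the larger of the two for the same fit.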
In any event, I'll sit down with this before 0.5 and see if I can sort out the theory and the implications for the other inferential statistics. I definitely do not want to silently use a different definition for the no-constant model, as R does. For the record, SAS includes a big warning that R^2 is redefined, and it also uses the uncentered TSS.
See also #60.
I've been thinking about your comments a bit. I'd like to offer some feedback, but I want to be clear that I'm sensitive to the fact that you (or y'all) are the one(s) actually doing the work :)
In my mind, R^2 is a property of two data sets, not of the ordinary least squares algorithm for dealing with residuals. One must choose a model first, then use OLS (or GLS, or NNLS, or...) to estimate the regression coefficients. Whether one includes an intercept amounts to a modeling choice: either I believe the data is best represented by a model with an intercept, or I believe it is best represented by a model without one. The algorithm doesn't care whether you include a constant or not: it is perfectly happy to deal with the extra 1's.
The real reason for the change in definition (as I've come to understand it) is a consequence of a null hypothesis (our old friend). In either instance, the null hypothesis is "no relationship exists", which means "set the slopes to 0". With an intercept, that null model predicts the mean of y, which yields the centered total sum of squares; without one, it predicts zero, which yields the uncentered version. I hate to reference myself here, but I've seen no other concise explanation of the subject online (see link above). Couched in this way, it makes sense to me that there is a "right" definition of R^2 in the case without an intercept (however bad a statistic).
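A quick sketch of that framing (illustrative NumPy, not library code): R^2 compares the fitted model's residual sum of squares against the null model's. For a no-intercept model the null prediction is zero, so the null SSR is exactly the uncentered TSS.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=40)
y = 1.5 * x + rng.normal(scale=0.3, size=40)  # true model through the origin

# Fitted RTO model vs. its null model (slope set to 0, so yhat = 0)
b = (x @ y) / (x @ x)
ssr_fit = ((y - b * x) ** 2).sum()
ssr_null = (y ** 2).sum()       # residuals of the null model are y itself

r2 = 1 - ssr_fit / ssr_null     # identical to 1 - SSR / uncentered TSS
```

With an intercept the null model would instead predict mean(y), and the same formula recovers the usual centered definition.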
Given this, it's not clear (to me, at least) why silently doing the right thing is a bad idea. The definition of R^2 implemented in 0.4 is only correct for a model with an intercept; we both agree on that. The same definition cannot be applied to a model without an intercept; we both agree on that too. Given that you provide a function called "add_constant", I would (naively) expect the OLS class to figure out the relevant calculations and return the correct version of R^2 when queried.
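The behavior being suggested could be sketched like this (hypothetical helper, not the statsmodels API; detecting a constant column is simplified to "any column with no variation"):

```python
import numpy as np

def r_squared(y, X, params):
    """Pick the R^2 definition based on whether the design matrix X
    contains a constant column (a simplification: a zero column would
    also match, and a constant hidden in a linear combination would not).
    """
    resid = y - X @ params
    ssr = resid @ resid
    has_const = np.any(np.ptp(X, axis=0) == 0)   # any column with no variation?
    if has_const:
        tss = ((y - y.mean()) ** 2).sum()        # centered (usual) definition
    else:
        tss = (y ** 2).sum()                     # uncentered definition
    return 1 - ssr / tss
```

The caller never has to say which definition applies; the design matrix itself decides.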
Anyway, I definitely want to say thanks for a great piece of software!
Closing this as a duplicate of #423 so there's less to keep track of. I fixed the RTO linear case as per your suggestions in this branch:
https://github.com/jseabold/statsmodels/tree/handle-constant
but I still need to look at how this will affect the rest of the code base.