lindeloev / tests-as-linear
Common statistical tests are linear models (or: how to teach stats)
Home Page: https://lindeloev.github.io/tests-as-linear/
Usually, we use a symmetric matrix for the Sigma parameter of the MASS::mvrnorm function.
# Fixed correlation
D_correlation = data.frame(MASS::mvrnorm(30, mu = c(0.9, 0.9),
Sigma = matrix(c(1, 0.8, 1, 0.8), ncol = 2), empirical = TRUE)) # Correlated data
I think it should be Sigma = matrix(c(1, 0.8, 0.8, 1), ncol = 2), i.e. a symmetric matrix with unit variances on the diagonal.
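A quick numpy check of the same point (my own sketch, not code from the notebook): R's matrix() fills column-major, so c(1, 0.8, 1, 0.8) produces an asymmetric matrix, which is not a valid covariance matrix.

```python
import numpy as np

# The reported Sigma, matrix(c(1, 0.8, 1, 0.8), ncol = 2), fills column-major
# in R, giving [[1.0, 1.0], [0.8, 0.8]] -- not symmetric.
bad_sigma = np.array([[1.0, 1.0],
                      [0.8, 0.8]])
assert not np.allclose(bad_sigma, bad_sigma.T)

# The intended correlation-0.8 matrix with unit variances is symmetric:
good_sigma = np.array([[1.0, 0.8],
                       [0.8, 1.0]])
assert np.allclose(good_sigma, good_sigma.T)

# numpy's sampler accepts it without complaint:
rng = np.random.default_rng(0)
data = rng.multivariate_normal(mean=[0.9, 0.9], cov=good_sigma, size=30)
print(data.shape)  # (30, 2)
```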
I find the "Exact?" column of the "Common statistical tests are linear models" pdf to be somewhat misleading, since the "Exact?" column links to simulations that show correspondence for sufficiently large n. My concerns would be alleviated if the column were renamed "Correspondence" or "Equivalence".
I really appreciate this project: nice work!
First, just want to say that I love this! So thanks for the work.
Maybe you can add a little note for the section on correlations:
There is a difference between corr(x, y) and lm(y ~ 1 + x). Correlation is commutative, corr(x, y) = corr(y, x), but lm(y ~ 1 + x) ≠ lm(x ~ 1 + y). It's especially relevant when both x and y contain measurement error.
A good reference for this is here:
https://elifesciences.org/articles/00638
The tls package in R provides one option for computing this.
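To make the asymmetry concrete, here is a small numpy demonstration (my sketch, with arbitrary simulated data): the correlation is symmetric in its arguments, but the two regression slopes differ because each divides the covariance by a different variance.

```python
import numpy as np

rng = np.random.default_rng(42)
x = rng.normal(size=200)
y = 0.5 * x + rng.normal(scale=1.5, size=200)  # y carries extra noise

# Correlation is commutative:
r_xy = np.corrcoef(x, y)[0, 1]
r_yx = np.corrcoef(y, x)[0, 1]
assert np.isclose(r_xy, r_yx)

# The regression slopes are not: slope(y ~ x) = cov/var(x),
# slope(x ~ y) = cov/var(y), and these differ unless |r| = 1.
slope_yx = np.polyfit(x, y, 1)[0]  # lm(y ~ 1 + x)
slope_xy = np.polyfit(y, x, 1)[0]  # lm(x ~ 1 + y)
print(slope_yx, slope_xy)          # the two slopes generally differ
```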
The HTML formatting seems to have disappeared for all but Kruskal-Wallis.
BAD:
https://lindeloev.github.io/tests-as-linear/simulate_spearman.html
https://lindeloev.github.io/tests-as-linear/simulate_mannwhitney.html
https://lindeloev.github.io/tests-as-linear/simulate_wilcoxon.html
GOOD:
https://lindeloev.github.io/tests-as-linear/simulate_kruskall.html
Following the tweet, I have been made aware of many excellent resources. This issue just serves to collect them before I add them somewhere.
https://www.middleprofessor.com/files/applied-biostatistics_bookdown/_book/ looks like a solid intro to linear modeling, equivalent to the stats-101 models. Downsides: there is little visualization, no mention of non-parametric tests (I think?), and a lot more sampling theory. Check if there are worked examples.
https://siminab.github.io/2018/01/10/everything-in-statistical-modeling-can-be-seen-as-a-regression/ contains the basics, but is likely too superficial.
https://www.ncbi.nlm.nih.gov/pubmed/20063905 looks like an excellent academic discussion of rote learning vs. modeling.
Under section 5.1.4 in https://lindeloev.github.io/tests-as-linear/#51_independent_t-test_and_mann-whitney_u, it says to notice the identical t, df and estimates, but the df are not identical. Is the t even in the tables? Is it the mean?
Or is that meant to refer to the results of the Mann Whitney U in section 5.1.5?
Or am I confused about the same thing as #17?
Make these figures, and include them in the cheat sheet.
Hi
This link https://www.uni-tuebingen.de/fileadmin/Uni_Tuebingen/SFB/SFB_833/A_Bereich/A1/Christoph_Scheepers_-_Statistikworkshop.pdf doesn't work anymore. Thanks!
Thanks very much for the R code and explanation of the GLM!
I think it's pretty cool to let people understand all those statistics in the GLM way.
Though the R code is easy and clear, is it possible to add Python code for reference?
I list below those I can find (mostly scipy), but the syntax is NOT as beautiful as R...
[P] Pearson correlation : scipy.stats.pearsonr(x, y)
https://docs.scipy.org/doc/scipy-0.14.0/reference/generated/scipy.stats.pearsonr.html
[N] Spearman correlation : scipy.stats.spearmanr
https://en.wikipedia.org/wiki/Spearman's_rank_correlation_coefficient
https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.spearmanr.html
[P] Two-sample t test : scipy.stats.ttest_ind
https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.ttest_ind.html
[P] Welch's t-test : scipy.stats.ttest_ind with equal_var=False
Or DIY: https://pythonfordatascience.org/welch-t-test-python-pandas/
[N] Mann-Whitney rank test : scipy.stats.mannwhitneyu
https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.mannwhitneyu.html
[P] One-way ANOVA : scipy.stats.f_oneway
https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.f_oneway.html
[N] Kruskal-Wallis : scipy.stats.kruskal
https://docs.scipy.org/doc/scipy-0.19.1/reference/generated/scipy.stats.kruskal.html
[P] One-way ANCOVA : smf.ols(formula='y ~ a + b + c' , data=df).fit()
[P] Two-way ANOVA : smf.ols(formula='y ~ C(a)*C(b)', data=df).fit()
[N] Chi-squared test : scipy.stats.chi2_contingency
https://docs.scipy.org/doc/scipy-0.15.1/reference/generated/scipy.stats.chi2_contingency.html
example https://pythonfordatascience.org/chi-square-test-of-independence-python/
[N] Goodness of fit : scipy.stats.chisquare
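The linear-model correspondence itself is also easy to verify in Python. Here is my own sketch (not code from the notebook) showing that the classic two-sample t-test matches the linear model y ~ 1 + group fit with plain least squares:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
a = rng.normal(0.0, 1.0, 50)
b = rng.normal(0.5, 1.0, 50)

# Classic two-sample t-test (equal variances assumed):
t_classic, p_classic = stats.ttest_ind(a, b)

# Same model as lm(y ~ 1 + group): an intercept plus a group dummy.
y = np.concatenate([a, b])
group = np.concatenate([np.zeros(50), np.ones(50)])
X = np.column_stack([np.ones(100), group])
beta, rss, *_ = np.linalg.lstsq(X, y, rcond=None)

# t statistic for the group coefficient:
df = 100 - 2
sigma2 = rss[0] / df
se = np.sqrt(sigma2 * np.linalg.inv(X.T @ X)[1, 1])
t_lm = beta[1] / se
p_lm = 2 * stats.t.sf(abs(t_lm), df)

# Same |t| and p-value, up to the sign convention of the dummy coding:
assert np.isclose(abs(t_classic), abs(t_lm))
assert np.isclose(p_classic, p_lm)
```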
As seen in the opening source code block in section 2 (Settings and toy data), rnorm_fixed is a function defined as rnorm_fixed = function(N, mu = 0, sd = 1) scale(rnorm(N)) * sd + mu. Scaling something only to unscale it right after is confusing; rnorm(N, mean = mu, sd = sd) should do just fine.
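For what it's worth, the scale-then-rescale construction is not a no-op: it pins the empirical mean and SD to exactly mu and sd, whereas rnorm(N, mean = mu, sd = sd) only matches them in expectation. A Python analogue (my sketch, not the author's code):

```python
import numpy as np

def rnorm_fixed(n, mu=0.0, sd=1.0, rng=None):
    """Draw n normals, then force the SAMPLE mean and SD to be exactly mu and sd."""
    if rng is None:
        rng = np.random.default_rng()
    x = rng.normal(size=n)
    z = (x - x.mean()) / x.std(ddof=1)  # R's scale() uses the n-1 denominator
    return z * sd + mu

x = rnorm_fixed(30, mu=5.0, sd=2.0, rng=np.random.default_rng(0))
# The sample statistics are exact, not just approximately right:
assert np.isclose(x.mean(), 5.0)
assert np.isclose(x.std(ddof=1), 2.0)
```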
Maybe I'm misunderstanding it, but shouldn't the one-way ANOVA null hypothesis be
Add this to the worked examples. Also, perhaps demonstrate/discuss how to model unequal expected frequencies cf https://twitter.com/matthewlewis896/status/1115545191300120576
Make a good figure and icon, and include it in the cheat sheet.
Hi!
In the first paragraph of Section 7, there is a statement:
See this nice introduction to Chi-Square tests as linear models.
for which the link is broken.
I have not been able to find the document elsewhere.
Thanks for this wonderful resource.
The simulated data are currently balanced and normal, with approximately equal variances and no correlation. The results should generalize under deviations from these. If this can be implemented in a way that does not obfuscate the real message/argument, it would be an improvement.
Hi, thanks for this great resource! I'm working through the book now.
Can I confirm that the p-values published in the table of section '4.1.3 R code: Wilcoxon signed-rank test' are correct? I get different p-values for both the Wilcoxon test and the linear model using signed ranks (0.2628 and 0.2650 respectively). I have been able to replicate all other tests in the book so far using the toy data set. Thanks.
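For anyone comparing numbers here, the signed-rank construction is easy to reproduce (my own sketch, with arbitrary simulated data, not the book's toy data set): rank the absolute values, reattach the signs, then run a one-sample t-test on the signed ranks. This approximates the Wilcoxon signed-rank test for moderate n; it is not exact, so small p-value discrepancies are expected.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
d = rng.normal(0.3, 1.0, 40)

# Signed ranks: rank |d|, then restore the sign of each observation.
signed_rank = np.sign(d) * stats.rankdata(np.abs(d))

# t-test on signed ranks ~ the linear-model version of the Wilcoxon test:
t_stat, p_lm = stats.ttest_1samp(signed_rank, 0.0)
w_stat, p_wilcoxon = stats.wilcoxon(d)

print(round(p_lm, 4), round(p_wilcoxon, 4))  # close, but not identical
```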
This is a great cheat sheet and comparison of the methods that you've made. Thanks for taking the time to think about it and write it up!
One small comment....
I'm sure you're aware, but aov is just a wrapper for lm with some specific settings (e.g. Helmert contrasts) and print/summary methods to approximate a classical ANOVA table, so it'd be difficult for the models to return something different... the way section 6.1.3 is written at the moment feels a bit like you're surprised that they yield the same thing.
Cheers!
This site is fantastic - please can I share a couple of minor errors:
In the table, the degrees of freedom for the t test are showing as 48, instead of 98.
The confidence intervals are also mismatched, but I can make them match exactly if the linear-model CIs on beta_1 are used (i.e. if the directionality is reversed on either the t-test or the lm).
Thank you!
It currently says "one intercept per group". Make it clear that only one group (plus the intercept/reference group) is "turned on" for any given y.
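One way to state it: in the dummy-coded design matrix, each row carries the intercept column plus at most one group indicator set to 1. A sketch with hypothetical data (treatment coding, group 'a' as reference):

```python
import numpy as np

groups = np.array(['a', 'b', 'b', 'c', 'a', 'c'])
intercept = np.ones(len(groups))
dummy_b = (groups == 'b').astype(float)
dummy_c = (groups == 'c').astype(float)
X = np.column_stack([intercept, dummy_b, dummy_c])

# For every observation, at most one dummy is "turned on";
# reference-group rows ('a') have all dummies off.
assert np.all(dummy_b + dummy_c <= 1)
print(X)
```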
The second line of your table (i.e. the lm row) shows wrong values, but the results from R are right.
Congratulations for this great tutorial!
Thanks for this great resource! In the section on Pearson correlation, what is rank(x)? Did I miss it? If not, I suggest elaborating on this in the text, as I am probably not the only one with this question. Awesome work!
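For reference, rank(x) replaces each value by its position in the sorted order (1 = smallest), and the Spearman correlation is just the Pearson correlation computed on those ranks. A quick scipy check (my own sketch, arbitrary simulated data):

```python
import numpy as np
from scipy import stats

# rank(x): each value's position in the sorted order.
x = np.array([3.1, 0.2, 5.7, 1.4])
print(stats.rankdata(x))  # [3. 1. 4. 2.]

# Spearman correlation == Pearson correlation of the ranks (no ties here):
rng = np.random.default_rng(3)
a = rng.normal(size=50)
b = a + rng.normal(size=50)
rho, _ = stats.spearmanr(a, b)
r_of_ranks = np.corrcoef(stats.rankdata(a), stats.rankdata(b))[0, 1]
assert np.isclose(rho, r_of_ranks)
```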
Common name should be Kruskal-Wallis (sheet has an extra L)
Linear Model in Words should read "Same, but it predicts the rank of y" (currently reads "signed rank")