statdivlab / rigr Goto Github PK

Regression, Inference, and General Data Analysis Tools for R

License: Other

R 99.29% TeX 0.71%

rigr's Introduction

`rigr`: Regression, Inference, and General Data Analysis Tools for R

Introduction

rigr is an R package to streamline data analysis in R. Learning both R and introductory statistics at the same time can be challenging, and so we created rigr to facilitate common data analysis tasks and enable learners to focus on statistical concepts.

rigr, formerly known as uwIntroStats, provides easy-to-use interfaces for descriptive statistics, one- and two-sample inference, and regression analyses. rigr output includes key information while omitting unnecessary details that can be confusing to beginners. Heteroskedasticity-robust (“sandwich”) standard errors are returned by default, and multiple partial F-tests and tests for contrasts are easy to specify. A single regression function (regress()) can fit both linear models, generalized linear models, and proportional hazards models, allowing students to more easily make connections between different classes of models.

Installation

You can install the stable release of rigr from CRAN as follows:

install.packages("rigr")

You can install the development version of rigr from GitHub using the code below. The installment is through the R package remotes.

# run the following code to make sure that the remote package it installed
if (!requireNamespace("remotes", quietly = TRUE)) {
  install.packages("remotes")
}
# install the package
remotes::install_github("statdivlab/rigr")

rigr is maintained by the StatDivLab, but relies on community support to log issues and implement new features. Is there a method you would like to have implemented? Please submit a pull request or start a discussion!

Documentation

Examples of how to use the main functions in rigr are provided in three vignettes, available here. One details the regress function and its utilities, one details the descrip function for descriptive statistics, and the third details functions used for one- and two-sample inference, including ttest, wilcoxon, and proptest.

Humans

Maintainer: Amy Willis

Authors: Scott S Emerson, Brian D Williamson, Charles Wolock, Taylor Okonek, Yiqun T Chen, Jim Hughes, Amy Willis, Andrew J Spieker and Travis Y Hee Wai.

Issues

If you encounter any bugs, please file an issue. Better yet, submit a pull request!

Do you have a question? Please first check out the vignettes, then please post on the Discussions.

Code of Conduct

Please note that the rigr project is released with a [Contributor Code of Conduct](https://statdivlab.github.io/rigr/CODE_OF_CONDUCT.html). By contributing to this project, you agree to abide by its terms.

rigr's People

Contributors

Stargazers

Watchers

Forkers

ameerd svteichman gthopkins

rigr's Issues

factors in bplot

bplot should be able to handle factors

bplot(fev,agecat,xlab="Age Category",ylab="FEV (liters)") # fails if agecat is a factor

add `by` argument to `wilcoxon`

coefficient names unclear for categorical variables in `regress`

Note that when the following code is run:

library(rigr); data(mri)
regress("mean", atrophy ~ sex, data=mri)

the coefficient names printed are "Intercept" and "sex", whereas from lm() we'd have "(Intercept)" and "sexMale", this telling us which value of the categorical variable the coefficient corresponds to. We should have something similar to this in our print method for regress. Maybe something like "sex: Male"?

Alternative hypothesis in ttest

The alternative hypothesis in test is specified with the argument "test.type". It would be better to be consistent with other R functions (particularly t.test) and use the argument "alternative". The attached version of ttest changes the
argument name everywhere.
my_ttest.txt

Does `ifelse1` need to exist?

I think this function has identical behavior and interface as ifelse - I'll look into it but it doesn't seem like we need this.

GEE for salary data in regress_intro vignette

In the regress_intro vignette, the line regress("mean", salary ~ female*year, id = id, data = salary) no longer works. Not sure why.

Make styling consistent

Currently, function names and objects follow inconsistent styling. Ideally, we would follow a style guide (e.g., Hadley Wickham's or Google's) so that it is easier to read and debug the code.

I propose following Hadley's for the most part, and using underscores (_) in both function and object names.

Also, ideally the main user-facing functions will have their own files (e.g., descrip.R) and internal helper functions will live in a separate file (e.g., descrip_utils.R or possibly a separate file for each internal function).

descrip no longer works is intended on `mri` data

A previous line in test_cases vignette was descrip(mri). This no longer works. Possible reasons include that mri is now a tibble rather than a dataframe, but perhaps also because it contains variables of different types (not only numeric).

Streamline `ttest.R` code

The code for this function is much more verbose than it needs to be, including many statements like if (condition){}; else{do stuff}. For ease of future maintenance this should be cleaned up.

Document helper functions

Many helper functions (i.e., non-exported functions) have no documentation. I think a good goal is to have all functions be documented in some way.

bug in wilcoxon

one-sided p-value in wilcoxon wrong . Fixed in the attached file. Search for "JPH" to find changes
my_wilcoxon.txt

Confirm `ttest` handles NA properly in all use cases

`print.uRegress` throws an error for models with categorical variables with >2 levels

Example:

library(rigr); data(mri); regress("mean", atrophy ~ race, data=mri)

Error message:

Error in dimnames(x$model) <- *vtmp* :
length of 'dimnames' [1] not equal to array extent
In addition: Warning message:
In cbind(nms, levels) :
number of rows of result is not a multiple of vector length (arg 2)

data argument in bplot doesn't work the way it should

When running:

mri <- read.table("http://www.emersonstatistics.com/datasets/mri.txt", header=TRUE);
bplot(data = mri, y=atrophy, x=male, xlab="Gender", ylab="Atrophy")

you get an error that "male" is not found because the dataset has not been attached. the data argument should work in a way that doesn't require users to attach the dataset to their environment beforehand

`descrip()` returns `firstEvent` column even for non time-to-event variables

The following code produces a value for firstEvent but has lastEvent as NA:

a <- rnorm(10,0,1) descrip(a)

I would imagine we'd want firstEvent to be NA as well in this case

Wilcoxon bug

Wilcoxon bug.docx

`ttest()` and `ttesti()` should return objects, in addition to printing output

I think that these functions should return an object (a list, say) with all relevant output, in addition to printing. This matches e.g. wilcoxon(), and I think is far more user-friendly than only printing.

update documentation for `regress` helper functions in regress_utils.R

at the very least, ensure that all parameters and return values are described well

why does `regress` no longer support proportional hazards regression, and what do we want moving forward?

The documentation for regress seems to suggest you can do proportional hazards regression using fnctl = "hazard", but the code will actually throw an error if you specify fnctl = "hazard", stating "proportional hazards regression no longer supported". That being said, the remainder of the code seems to still have if/else cases and other sections in case fnctl = "hazard", so is this error something that was thrown in recently? @bdwilliamson do you have any insight?

Another question is whether we want regress to be able to handle proportional hazards regression moving forward.

online documentation

It would be wonderful to have online documentation. I haven't looked into this for many years but I recall this being pretty easy with github.io and pkgdown.

Fix `ttesti` documentation

ttesti.R documentation needs some changes - a few of the parameter descriptions aren't working properly.

remove gee and hazard functionality from `regress`

Make TRUE and FALSE consistent

Currently, T and F are sometimes used as shorthand for TRUE and FALSE. While this is possible, it isn't good practice -- TRUE and FALSE are protected, while T and F are valid object names and thus can be overwritten by the user.

We should use the full words!

`proptest`

Implement proptest function as a branch from ttest. Add continuity correction, which is not currently implemented.

devtools::check() fails

As a result of some of the change is associated with modernizing the package, devtools::check() fails.

updated functions from scott emerson

In an earlier email from Scott Emerson to me: "I have some updated functions that have not yet been put in the package."

Could the summer TA team please that reach out to him about this and incorporate those edits?

remove `version` parameter from `descrip()`?

Currently if you specify version = TRUE in descrip(), the function will return "20160730" and nothing else. I'd imagine we either want to to return a date that's a bit closer to today, or just remove this parameter entirely. Personally, I'm in favor of removing it entirely. Any objections or other thoughts?

`regress` returns p-value based on t-statistic for logistic regression despite column name Pr(>|z|) in output

I believe this is just a problem with an if statement on line 887 of regress.R

bplot axis label issue

the followiong produces a plot with double x axis labels

bplot(fev,as.integer(agecat),xlab="Age Category",ylab="FEV (liters)")

CRAN

It would be great to be able to install this package from CRAN -- this would enormously help many of our students with using it for the first time. Target date: Sept 13.

interaction terms in regress

notice in the output of the following, the interaction term is not clearly labelled

also the overall F-statistic from regress does not agree with the F-statistic from lm

regress("mean",fev~ht*as.integer(sex))

`ttest` output lists one-sample and two-sample tests

The printed output for ttest reads "Two-sample t-test:" followed by "One-sample t-test:" whenever two samples are given. It should only read "Two-sample t-test:"

`equal` function could be cleaner

In termTraverse.R (soon to be in regress_utils.R), there is a function defined called equal, and the value returned calls the function itself (not sure if there are any implications to this?). The function seems to be doing the same thing as checking if length(unique(x)) == 1, other than that they would handle vectors that include NAs differently. This is mostly a reminder to myself to edit this helper function to be a bit cleaner in the future.

equal <- function(x){
if(length(x)==1){
return(TRUE)
} else {
if(x[1]!=x[2]){
return(FALSE)
} else {
return(equal(x[-1]))
}
}
}

lincom

It would be very useful if lincom could handle lm objects as well as uRegress objects

`ttest` does not throw error for matched test with vectors of unequal length

Waiting to fix this until unit tests are written

when replaceZeroes = TRUE in `regress`, it does not actually replace zeroes with one-half the lowest nonzero value

The function replaces missing values with TRUE (aka 1), at least for cases when fnctl = "geometric mean" is specified

Remove `version` parameter from functions, add version and/or package creation date when package is loaded

It will be far more maintainable to have the version listed in one spot (for example, when the package is loaded) as opposed to in individual functions. version parameter has already been removed from regress, descrip, and both of their utility functions, but the parameter likely still exists elsewhere.

At the suggestion of @bdwilliamson, we could consider is adding a print message when the package gets loaded that says something about the version and/or the package creation date.

This issue builds off of #33

`wilxcoxon()` returns incorrect `data.name`

This seems to me to be a scoping issue, since wilcoxon() calls wilcoxon.do(). I tried using sys.frame(1) in the substitute(), without success. Commented out failing unit tests for this - restore these when fixed.

Streamline `wilcoxon.R` code

wilcoxon.R has some redundancies, including calculating the p-value multiple times. The whole function should be simplified.

`ttest` returns NULL if two data vectors are provided with stratification using `by`

This should throw an error instead.

`wilcoxon()` provides two (different) p-values when `exact = TRUE`

As currently written, wilcoxon() is (as far as I can tell) largely a copy-pasted version of wilcox.test() from the stats package. That copied code provides a p-value (variable PVAL in the code), either using an exact test, or Normal approximation.

One of the additions in wilcoxon() is the inf return object, which is described as "a formatted table of inference values, for printing." This object provides the PVAL mentioned above, but also includes the Z-score and corresponding Normal approximation p-value. This p-value will correspond to PVAL when exact = FALSE. However, when exact = TRUE is used, PVAL is an exact p-value, but the Normal approximation p-value is still included in inf.

Long story short: when exact = TRUE, wilcoxon() provides both the exact and approximate p-value, which I think would be confusing to a user. To see this run:

set.seed(2)
y <- rnorm(100)
wilcoxon(y, exact = TRUE)

The inf return object has p-value 0.70012 (exact) and 0.69762 (Normal approximation).

My feeling is that we shouldn't report two different p-values. Thoughts?

regress doesn't accept factors

Other regression commands in R accept factor covariates but
regress("mean", fev ~ ht+sex,data=fevdata) # fails if sex is a factor

Incorrect label in ttest

I downloaded the rigr package on 7/23 and was using ttest() and got this result. Notice the incorrect label above the output that says "One-sample t-test"

with(hiv,ttest(cd4[art==0],cd4[art==1]))

Call:
ttest(var1 = cd4[art == 0], var2 = cd4[art == 1])

Two-sample t-test allowing for unequal variances :

One-sample t-test :

Summary:
Group Obs Missing Mean Std. Err. Std. Dev. 95% CI
cd4[art == 0] 78 14 388.5 32.5 260 [323.5, 454]
cd4[art == 1] 50 19 298.5 32.9 183 [231.3, 366]
Difference 128 33 90.1 46.3 [-2.02, 182]

Ho: difference in means = 0 ;
Ha: difference in means != 0
t = 1.946 , df = 80.7
Pr(|T| > t) = 0.0551195

force users to write out the full fnctl argument in `regress`

We no longer want to support shorted fnctl strings in regress

Discussed in #53

^{Originally posted by taylorokonek August 9, 2021}
Currently, users can enter a shortened character string for the fnctl argument in regress, and so long as it corresponds to a unique substring, the function will run. As an example, a user could simply set fnctl = "g", and the function would take this to mean fnctl = "geometric mean". Personally, I don't think there's a huge harm in forcing users to spell out the fnctl argument. Does anyone have any insight as to why this particular functionality was introduced (@bdwilliamson)?

bug in ttesti

one-sided p-value in ttesti wrong for Ha: <. Fixed in the attached file (to find changes, search for "JPH")
my_ttesti.txt

intercepts in linear models

If I recall correctly, you can use the argument intercept = FALSE in regress(). Is it possible to add the -1 functionality?

salary dataset as dataset

Currently both some vignettes and some examples run the following to get the salary data

salary <- read.table("http://www.emersonstatistics.com/datasets/salary.txt", header = TRUE, stringsAsFactors = FALSE)

This should instead be a documented dataset released with the package.

This impact the vignette regress_intro amongst others.

what is 'stratify' supposed to do in bplot?

I can't seem to get an example working (that doesn't throw an error) where the 'strata' argument in bplot is used. Is this argument supposed to do something analogous to facet_wrap in ggplot, or something else?

Write unit tests for `ttest` geometric mean

Reminder to myself to come back to this.

changed behavior for descrp

The newest version of descrip prints, by default, the columns Max Restriction, FirstEvent, LastEvent, isDate. Those seem confusing for the average user, especially if none of the measures is a time-to-event. Can we have descrip not print those columns by default?

Also, at the bottom of the descrip output I see
attr(,"class")
[1] "uDescriptives"
which will be confusing to the naive user.

Finally, all statistics are printed in scientific notation which is different from before and harder to read.

Having said all this, I just figured out that I think the issue is simply that the print.uDescriptives function has not been included in the rigr package?