dmsul / econtools

Econometrics and data manipulation functions.

Home Page: http://www.danielmsullivan.com/econtools

License: BSD 3-Clause "New" or "Revised" License

Python 97.01% Stata 2.99%

econometrics stata regression python statistics scipy

econtools's Introduction

econtools

econtools is a Python package of econometric functions and convenient shortcuts for data work with pandas and numpy. Full documentation is available at http://www.danielmsullivan.com/econtools.

Installation

You can install directly from PyPI:

$ pip install econtools

Or you can clone the repository from GitHub and install from source:

$ git clone http://github.com/dmsul/econtools
$ cd econtools
$ python setup.py install

Econometrics

OLS, 2SLS, LIML
Option to absorb any variable via within-transformation (a la areg in Stata)
Robust standard errors
- Heteroskedasticity-consistent (robust/hc1, hc2, hc3)
- Clustered standard errors
- Spatial HAC (SHAC, aka Conley standard errors) with uniform and triangle kernels
F-tests by variable name or R matrix.
Local linear regression.
WARNING [31 Oct 2019]: Predicted values (yhat and residuals) may not be as expected in transformed regressions (when using fixed effects or weights). That is, the current behavior differs from Stata's. I am looking into this and will post either a fix or a justification of the current behavior in the near future.

import econtools
import econtools.metrics as mt

# Read Stata DTA file
df = econtools.read('my_data.dta')

# Estimate OLS regression with fixed-effects and clustered s.e.'s
result = mt.reg(df,                     # DataFrame to use
                'y',                    # Outcome
                ['x1', 'x2'],           # Indep. Variables
                fe_name='person_id',    # Fixed-effects using variable 'person_id'
                cluster='state'         # Cluster by state
)

# Results
print(result.summary)                                # Print regression results
beta_x1 = result.beta['x1']                          # Get coefficient by variable name
r_squared = result.r2a                               # Get adjusted R-squared
joint_F = result.Ftest(['x1', 'x2'])                 # Test for joint significance
equality_F = result.Ftest(['x1', 'x2'], equal=True)  # Test for coeff. equality
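
The within-transformation behind the fixed-effects option can be illustrated without econtools. The sketch below uses only numpy and pandas on made-up data (the column names `y`, `x`, and `g` are hypothetical) and checks that demeaning by group gives the same slope as including one dummy per group. It illustrates the general technique and is not the package's internal code.

```python
import numpy as np
import pandas as pd

# Simulated panel: 5 groups, 40 observations each
rng = np.random.default_rng(0)
n_groups, n_per = 5, 40
g = np.repeat(np.arange(n_groups), n_per)
alpha = rng.normal(size=n_groups)[g]            # Group fixed effects
x = rng.normal(size=g.size) + alpha             # Regressor correlated with the effects
y = 2.0 * x + alpha + rng.normal(scale=0.1, size=g.size)
df = pd.DataFrame({'y': y, 'x': x, 'g': g})

# Within-transformation: subtract group means from y and x
group_means = df.groupby('g')[['y', 'x']].transform('mean')
demeaned = df[['y', 'x']] - group_means
beta_within = np.linalg.lstsq(demeaned[['x']].to_numpy(),
                              demeaned['y'].to_numpy(),
                              rcond=None)[0][0]

# Equivalent regression with one dummy per group (no constant)
dummies = pd.get_dummies(df['g'], dtype=float)
X = np.column_stack([df['x'].to_numpy(), dummies.to_numpy()])
beta_dummies = np.linalg.lstsq(X, df['y'].to_numpy(), rcond=None)[0][0]
```

Both approaches return the same coefficient on `x`; the demeaned version avoids building a potentially very wide dummy matrix.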

Regression and Summary Stat Tables

outreg takes regression results and creates a LaTeX-formatted tabular fragment.
table_statrow can be used to add arbitrary statistics, notes, etc. to a table. It can also be used to create a table of summary statistics.
write_notes makes it easy to save table notes that depend on your data.

Misc. Data Manipulation Tools

stata_merge wraps pandas.merge and adds a lot of Stata's merge niceties, like a '_m' flag for successfully merged observations.
group_id generates an ID based on the variables passed (compare egen group).
Crosswalks of commonly used U.S. state labels.
- State abbreviation to state name (and reverse).
- State fips to state name (and reverse).
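
For readers unfamiliar with the Stata behavior being mimicked, the sketch below shows the closest plain-pandas equivalents: `pandas.merge` with `indicator=True`, which adds a `_merge` column, and `groupby(...).ngroup()`, which numbers groups much like `egen group` (though starting from 0 rather than 1). It does not call `stata_merge` or `group_id`, whose exact signatures are not shown here, and the data are made up.

```python
import pandas as pd

left = pd.DataFrame({'state': ['CA', 'NY', 'TX'], 'pop': [39, 19, 29]})
right = pd.DataFrame({'state': ['CA', 'TX', 'WA'], 'area': [164, 269, 71]})

# indicator=True adds a '_merge' column: 'both', 'left_only', or 'right_only'
merged = pd.merge(left, right, on='state', how='outer', indicator=True)
matched = merged[merged['_merge'] == 'both']

# Group ID built from one or more variables
panel = pd.DataFrame({'state': ['CA', 'CA', 'NY', 'NY'],
                      'year': [2000, 2001, 2000, 2000]})
panel['gid'] = panel.groupby(['state', 'year']).ngroup()
```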

Data I/O

read and write: Use the passed file path's extension to determine which pandas I/O method to use. Useful for writing functions that programmatically read DataFrames from disk which are saved in different formats. See examples above and below.
load_or_build: A function decorator that caches datasets to disk. This function builds the requested dataset and saves it to disk if it doesn't already exist on disk. If the dataset is already saved, it simply loads it, saving computational time and allowing the use of a single function to both load and build data.
```
from econtools import load_or_build, read

@load_or_build('my_data_file.dta')
def build_my_data_file():
  """
  Cleans raw data from CSV format and saves as Stata DTA.
  """
  df = read('raw_data.csv')
  # Clean the DataFrame
  return df
```
File type is automatically detected from the passed filename. In this case, Stata DTA from my_data_file.dta.
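
To make the caching pattern concrete, here is a minimal sketch of a load-or-build decorator. The name `cache_to_disk` is hypothetical, it handles CSV files only, and it is not econtools's actual implementation; it just shows the "load if the file exists, otherwise build and save" logic described above.

```python
import functools
import os
import tempfile

import pandas as pd


def cache_to_disk(path):
    """Hypothetical stand-in for a load-or-build decorator (CSV only)."""
    def decorator(build_func):
        @functools.wraps(build_func)
        def wrapper(*args, **kwargs):
            if os.path.isfile(path):
                return pd.read_csv(path)        # Already built: just load it
            df = build_func(*args, **kwargs)    # Otherwise build the data...
            df.to_csv(path, index=False)        # ...and save it for next time
            return df
        return wrapper
    return decorator


calls = []
cache_path = os.path.join(tempfile.mkdtemp(), 'squares.csv')


@cache_to_disk(cache_path)
def build_squares():
    calls.append(1)                             # Track how often we rebuild
    return pd.DataFrame({'n': [1, 2, 3], 'n_sq': [1, 4, 9]})


first = build_squares()    # Builds the DataFrame and writes the CSV
second = build_squares()   # Loads the CSV; the build code is not re-run
```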

save_cli: Simple wrapper for argparse that lets you use a --save flag on the command line. This lets you run a regression without overwriting the previous results and without modifying the code in any way (i.e., commenting out the "save" lines).

In your regression script:

from econtools import save_cli

def regression_table(save=False):
    """ Run a regression and save output if `save == True`. """
    # Regression guts


if __name__ == '__main__':
    save = save_cli()
    regression_table(save=save)

In the command line/bash script:

python run_regression.py          # Runs regression without saving output
python run_regression.py --save   # Runs regression and saves output
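
A wrapper of this kind can be very small. The sketch below is a hypothetical stand-in named `save_flag`, not the source of `save_cli`; the `argv` parameter is an addition so the function can be exercised without a real command line.

```python
import argparse


def save_flag(argv=None):
    """Hypothetical stand-in: return True if --save was passed."""
    parser = argparse.ArgumentParser()
    parser.add_argument('--save', action='store_true',
                        help='Save output to disk')
    # argv=None makes argparse read sys.argv[1:], as in a normal script
    args = parser.parse_args(argv)
    return args.save


no_save = save_flag([])            # Same as running with no flags
do_save = save_flag(['--save'])    # Same as passing --save
```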

Requirements

Python 3.6+
Pandas and its dependencies (Numpy, etc.)
Scipy and its dependencies
Pytables (optional, if you use HDF5 files)
PyTest (optional, if you want to run the tests)

econtools's Issues

Get NaN value for standard error when calculating fixed effects

Hello,

Thanks for econtools; it helps me a lot in my work. Today I was estimating fixed effects using an OLS regression with robust standard errors, but I ran into this warning:

c:\program files\python37\lib\site-packages\econtools\metrics\core.py:203: RuntimeWarning: invalid value encountered in sqrt
  se = pd.Series(np.sqrt(np.diagonal(vce)), index=vce.columns)

In addition to the warning, all the standard error estimates for these fixed-effect (dummy) variables are NaN:

         coeff   se    t  p>t  CI_low  CI_high
dummy_0   96.0  NaN  NaN  NaN     NaN      NaN
dummy_1  128.0  NaN  NaN  NaN     NaN      NaN
dummy_2  144.0  NaN  NaN  NaN     NaN      NaN
dummy_3  128.0  NaN  NaN  NaN     NaN      NaN

I wonder whether this is due to collinearity. I would appreciate it if you could help diagnose the problem. Thanks a lot.

Feature Request: Automatic Co-linear Variable Dropping?

Hi Dan,

Wanted to see if you have any feedback on this as a potential feature for the model functions. When including discrete variables as factors, both R and Stata have a nice way to handle it within the formula. For some discrete variable X, in R it's factor(X) and in Stata it's i.X. The nice thing about this is that 1) it saves time, since with the current "reg" function in econtools I need to consciously leave one level out if the model has a constant term, and 2) for certain subsets of the data different variables will be collinear, and R and Stata will automatically drop one. This has caused me some grief when using the current function, as it's then hard to identify which of the levels is causing the problem.

What do you think, is this worth the time to try and implement it? If so, what's in your opinion a good way of going about programming this safeguard in? I will start futzing around with my clone, but am interested where in the architecture you think it goes best.

Thanks for all you do!
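
As a rough illustration of the behavior being requested, the sketch below expands a discrete variable into dummies and then drops any column that does not add to the rank of the design matrix, reporting which ones were removed. The helper `drop_collinear`, the column names, and the data are all hypothetical; this is one possible approach under those assumptions, not a feature of econtools.

```python
import numpy as np
import pandas as pd


def drop_collinear(X):
    """Hypothetical helper: keep columns that add to the rank, in order."""
    keep = []
    for col in X.columns:
        trial = X[keep + [col]].to_numpy(dtype=float)
        # A column is kept only if it is not a linear combination
        # of the columns already kept
        if np.linalg.matrix_rank(trial) == len(keep) + 1:
            keep.append(col)
    dropped = [c for c in X.columns if c not in keep]
    return X[keep], dropped


df = pd.DataFrame({'x': [0.5, 1.5, 2.0, 3.5, 4.0, 5.5],
                   'level': ['a', 'b', 'c', 'a', 'b', 'c']})

# Full set of dummies plus a constant is collinear by construction
dummies = pd.get_dummies(df['level'], prefix='level', dtype=float)
cons = pd.Series(1.0, index=df.index, name='_cons')
X = pd.concat([cons, df[['x']], dummies], axis=1)

X_clean, dropped = drop_collinear(X)
```

Because columns are checked in order, the last redundant level is the one dropped, and `dropped` names it explicitly, which addresses the "hard to identify which level" problem. Repeated rank computations are slow on wide matrices, so a pivoted QR decomposition would be a more efficient choice at scale.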

Differences in F-scores with Stata when using clustering

It seems that the largest discrepancies between the Stata outputs and econtools are when the clustering option is used. On my dataset, I get perfect replicability of Stata for the command:

areg y X, absorb(alpha)

However, differences emerge in t and F values for the line

areg y X, absorb(alpha) cluster(alpha)

on the same dataset.

KeyError while using mt.reg and setting fe_name

Hi, I am facing an issue and cannot find a solution online.

result = mt.reg(mydata,                # DataFrame to use
                'car_-1_1',            # Outcome
                fin_var + lang_cols,   # Indep. Variables
                fe_name='permno',      # Fixed-effects
                cluster='fyearqrt'     # Cluster
)

I confirmed that 'permno' is in the columns, and I also tried setting its value type to string.
But I still get the error:

`KeyError Traceback (most recent call last)
Cell In[17], line 1
----> 1 result = mt.reg(mydata, # DataFrame to use
2 'car_-1_1', # Outcome
3 fin_var + lang_cols, # Indep. Variables
4 fe_name= 'permno', # Fixed-effects
5 #cluster= 'fyearqrt' # Cluster
6 )

File c:\Users\KK\Anaconda3\envs\py38\lib\site-packages\econtools\metrics\api.py:82, in reg(df, y_name, x_name, fe_name, a_name, nosingles, vce_type, cluster, shac, addcons, nocons, awt_name, save_mem, check_colinear)
71 fe_name = _a_name_deprecation_handling(a_name, fe_name)
73 RegWorker = Regression(
74 df, y_name, x_name,
75 fe_name=fe_name, nosingles=nosingles, addcons=addcons, nocons=nocons,
(...)
79 check_colinear=check_colinear,
80 )
---> 82 results = RegWorker.main()
83 return results

File c:\Users\KK\Anaconda3\envs\py38\lib\site-packages\econtools\metrics\core.py:49, in RegBase.main(self)
48 def main(self):
---> 49 self.set_sample()
50 self.estimate()
51 self.get_vce()

File c:\Users\KK\Anaconda3\envs\py38\lib\site-packages\econtools\metrics\core.py:77, in RegBase.set_sample(self)
75 # Demean or add constant
76 if self.fe_name is not None:
---> 77 self._demean_sample()
78 elif self.addcons:
79 _cons = np.ones(self.y.shape[0])

File c:\Users\KK\Anaconda3\envs\py38\lib\site-packages\econtools\metrics\core.py:94, in RegBase._demean_sample(self)
92 self.y_raw = self.y.copy()
93 for var in self.vars_in_reg:
---> 94 self.dict[var] = _demean(self.A, self.dict[var])

File c:\Users\KK\Anaconda3\envs\py38\lib\site-packages\econtools\metrics\core.py:252, in _demean(A, df)
250 group_name = A.name
251 mean = df.groupby(A).mean()
--> 252 large_mean = force_df(A).join(mean, on=group_name).drop(group_name,
253 axis=1)
254 if df.ndim == 1:
255 large_mean = large_mean.squeeze()

File c:\Users\KK\Anaconda3\envs\py38\lib\site-packages\pandas\core\frame.py:9729, in DataFrame.join(self, other, on, how, lsuffix, rsuffix, sort, validate)
9566 def join(
9567 self,
9568 other: DataFrame | Series | Iterable[DataFrame | Series],
(...)
9574 validate: str | None = None,
9575 ) -> DataFrame:
9576 """
9577 Join columns of another DataFrame.
9578
(...)
9727 5 K1 A5 B1
9728 """
-> 9729 return self._join_compat(
9730 other,
9731 on=on,
9732 how=how,
9733 lsuffix=lsuffix,
9734 rsuffix=rsuffix,
9735 sort=sort,
9736 validate=validate,
9737 )

File c:\Users\KK\Anaconda3\envs\py38\lib\site-packages\pandas\core\frame.py:9768, in DataFrame._join_compat(self, other, on, how, lsuffix, rsuffix, sort, validate)
9758 if how == "cross":
9759 return merge(
9760 self,
9761 other,
(...)
9766 validate=validate,
9767 )
-> 9768 return merge(
9769 self,
9770 other,
9771 left_on=on,
9772 how=how,
9773 left_index=on is None,
9774 right_index=True,
9775 suffixes=(lsuffix, rsuffix),
9776 sort=sort,
9777 validate=validate,
9778 )
9779 else:
9780 if on is not None:

File c:\Users\KK\Anaconda3\envs\py38\lib\site-packages\pandas\core\reshape\merge.py:148, in merge(left, right, how, on, left_on, right_on, left_index, right_index, sort, suffixes, copy, indicator, validate)
131 @substitution("\nleft : DataFrame or named Series")
132 @appender(_merge_doc, indents=0)
...
-> 1778 raise KeyError(key)
1780 # Check for duplicates
1781 if values.ndim > 1:

KeyError: 'permno'`

I also found another person with the same issue: https://stackoverflow.com/questions/76452835/keyerror-while-using-mt-reg-econtools

Suggestion: use patsy

It makes model specification nicer.

https://patsy.readthedocs.io/en/latest/quickstart.html