sergiocorreia / reghdfe

Linear, IV and GMM Regressions With Any Number of Fixed Effects

Home Page: http://scorreia.com/software/reghdfe/

License: MIT License

Stata 99.60% TeX 0.04% R 0.06% Python 0.30%
stata fixed-effects ols regression linear-models

reghdfe's Introduction

REGHDFE: Linear Regressions With Multiple Fixed Effects



Recent Updates

  • version 6.12.4 12sep2023:
    • Fix ivreghdfe bug when clustering with string variables (#276)
  • version 6.12.3 20aug2023:
    • Bugfix for parallel option (macOS)
    • Fix typos in help file
  • version 6.12.0 26jun2021:
    • Add support for individual fixed effects, through the new options indiv(), group(), and aggregation(). See Constantine and Correia (2021) as well as the help file.
    • Add experimental support for parallelization via the parallel package
    • Misc. code refactoring
    • To use older versions of reghdfe, you can use version(3) and version(5). Those two are the latest versions before a major rewrite. This supersedes the old option.
  • version 5.7.3 13nov2019:
    • Fix rare error with compact option (#194). Version also submitted to SSC.
  • version 5.7.0 20mar2019:
    • Users no longer have to run reghdfe, compile after installing. If you are getting the error "class FixedEffects undefined", either upgrade to this version, or run reghdfe, compile
  • version 5.6.8 03mar2019:
    • ppmlhdfe package released, for Poisson models with fixed effects. Use this if you are running regressions with log(y) on the left-hand side.
    • Stable version of reghdfe, also on SSC.
  • version 5.6.2 10feb2019:
  • version 5.6 26jan2019:
    • Improved numerical accuracy. Previously, reghdfe standardized the data, partialled it out, unstandardized it, and solved the least squares problem. It now runs the solver on the standardized data, which preserves numerical accuracy on datasets with extreme combinations of values. Thanks to Zhaojun Huang for the bug report.
    • Speed up calls to reghdfe. The first call to reghdfe after "clear all" should be around 2s faster, and each subsequent call around 0.1s faster.
    • Running reghdfe with noabsorb option should now be considerably faster.
  • version 5.3 30nov2018:
    • Fixed silent error with Stata 15 and version 5.2.x of reghdfe. Data was being loaded into Mata in the incorrect order when running regressions with many factor interactions, which resulted in scrambled coefficients. Stata 15 users are strongly encouraged to upgrade. For more information see #150 by @simonheb
  • version 5.2 17jul2018:
    • Added partial workaround for a bug/quirk when loading factor variables through st_data(). This does not affect Stata 15 users (see help fvtrack). (Note: this speed-up has been completely disabled as of 5.3.2)
    • Misc. optimizations and refactoring.
    • Improved support for ppmlhdfe package (which adds fixed effects to Poisson and other GLM models).
  • version 5.1 08jul2018:
    • Added the compact and poolsize(#) options, to reduce memory usage. This can reduce reghdfe's memory usage by up to 5x-10x, at a slight speed cost.
    • Automatically check that the installed version of ftools is not too old.
  • version 5.0 29jun2018:
    • Added support for basevar. This is not very useful by itself, but makes some postestimation packages (e.g. coefplot) easier to use.
    • Added support for margins postestimation command.
    • Added _cons row to the output table, so the intercept is reported (as in regress/xtreg/areg). The noconstant option disables this, but doing so might make the output of margins incorrect.
    • predict, xb now includes the value of _cons, which before was included in predict, d.
  • version 4.4 11sep2017:
    • Performance: speedup when using weights, reduced memory usage, improved convergence detection
    • Bugfixes: summarize option was using full sample instead of regression sample, fixed a recent bug that failed to detect when FEs were nested within clusters
    • Mata: refactor Mata internals and add their description to help reghdfe_mata; clean up warning messages
    • Poisson/PPML HDFE: extend Mata internals so we can e.g. change weights without creating an entirely new object. This is mostly to speed up the ppmlhdfe package.
  • version 4.3 07jun2017: speed up fixed slopes (precompute inv(xx))
  • version 4.2 06apr2017: fix numerical accuracy issues (bugfixes)
  • version 4.1 28feb2017: entirely rewritten in Mata
    • 3-10x faster thanks to ftools package (use it if you have large datasets!)
    • Several minor bugs have been fixed, in particular some that did not allow complex factor variable expressions.
    • reghdfe is now written entirely as a Mata object. For an example of how to use it to write other programs, see here
    • Additional estimation options are now supported, including LSMR and pruning of degree-1 vertices.

Things to be aware of:

  • reghdfe depends on the ftools package (and boottest for Stata 12 and older)
  • IV/GMM estimation is not done directly by reghdfe but through ivreg2; see the ivreghdfe port, which adds an absorb() option to ivreg2. This is also useful for accessing the more advanced standard error estimates that ivreg2 supports.
  • If you use commands that depend on reghdfe (regife, poi2hdfe, ppml_panel_sg, etc.), check that they have been updated before using the new version of reghdfe.
  • Some options are not yet fully supported. They include cache and groupvar.
  • The previous stable release (3.2.9 21feb2016) can be accessed with the old option

Future/possible updates

  • Add back group3hdfe option

Citation

reghdfe implements the estimator described in Correia (2017). If you use it, please cite the paper and/or the command's RePEc entry:

@TechReport {Correia2017:HDFE,
  Author = {Correia, Sergio},
  Title = {Linear Models with High-Dimensional Fixed Effects: An Efficient and Feasible Estimator},
  Note = {Working Paper},
  Year = {2017},
}

Correia, Sergio. 2017. "Linear Models with High-Dimensional Fixed Effects: An Efficient and Feasible Estimator" Working Paper. http://scorreia.com/research/hdfe.pdf

Constantine, Noah, and Sergio Correia. 2021. "reghdfe: Stata module for linear and instrumental-variable/GMM regression absorbing multiple levels of fixed effects." https://ideas.repec.org/c/boc/bocode/s457874.html

Install:

To find out which version you have installed, type reghdfe, version.
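For example:

reghdfe, version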

reghdfe 6.x is not yet in SSC. To quickly install it and all its dependencies, copy/paste these lines and run them:

* Install ftools (remove program if it existed previously)
cap ado uninstall ftools
net install ftools, from("https://raw.githubusercontent.com/sergiocorreia/ftools/master/src/")

* Install reghdfe 6.x
cap ado uninstall reghdfe
net install reghdfe, from("https://raw.githubusercontent.com/sergiocorreia/reghdfe/master/src/")

To run IV/GMM regressions with ivreghdfe, also run these lines:

cap ado uninstall ivreg2hdfe
cap ado uninstall ivreghdfe
cap ssc install ivreg2 // Install ivreg2, the core package
net install ivreghdfe, from("https://raw.githubusercontent.com/sergiocorreia/ivreghdfe/master/src/")

Alternatively, you can install the stable/older version from SSC (5.x):

cap ado uninstall reghdfe
ssc install reghdfe

Manual Install:

To install reghdfe on a firewalled server, download the zip files of the ftools, reghdfe, and (if needed) ivreghdfe repositories by hand and extract them.

Then, run the following, adjusting the folder names:

cap ado uninstall ftools
cap ado uninstall reghdfe
cap ado uninstall ivreghdfe
net install ftools, from(c:\git\ftools)
net install reghdfe, from(c:\git\reghdfe)
net install ivreghdfe, from(c:\git\ivreghdfe)

Note that you can also use GitHub releases to install specific versions.


Description

reghdfe is a Stata package that estimates linear regressions with multiple levels of fixed effects. It works as a generalization of the built-in areg, xtreg,fe and xtivreg,fe regression commands. Its objectives are similar to those of the R package lfe by Simen Gaure and the Julia package FixedEffectModels by Matthieu Gomez (beta). Its features include the following (a short usage sketch follows the list):

  • A novel and robust algorithm that efficiently absorbs multiple fixed effects. It improves on the work of Abowd et al. (2002), Guimaraes and Portugal (2010), and Gaure (2013). This algorithm works particularly well on "hard cases" that converge very slowly (or fail to converge) with existing algorithms.
  • Extremely fast compared to similar Stata programs.
    • With one fixed effect and clustered standard errors, it is 3-4 times faster than areg and xtreg,fe (see benchmarks). Note: speed improvements in Stata 14 have reduced this gap.
    • With multiple fixed effects, it is at least an order of magnitude faster than the alternatives (reg2hdfe, a2reg, felsdvreg, res2fe, etc.). Note: a paper by Somaini and Wolak (2015) reported that res2fe was faster than reghdfe in some scenarios (namely, with only two fixed effects, where the second fixed effect was low-dimensional). This is no longer correct for the current version of reghdfe, which outperforms res2fe even on the authors' benchmark (with a low-dimensional second fixed effect; see the benchmark results and the Stata code).
  • Allows two- and multi-way clustering of standard errors, as described in Cameron et al (2011)
  • Allows an extensive list of robust variance estimators (thanks to the avar package by Kit Baum and Mark Schaffer).
  • Works with instrumental-variable and GMM estimators (such as two-step-GMM, LIML, etc.) thanks to the ivreg2 routine by Baum, Schaffer and Stillman.
  • Allows multiple heterogeneous slopes (e.g. separate slope coefficients for each individual).
  • Supports all standard Stata features:
    • Frequency, probability, and analytic weights.
    • Time-series and factor variables.
    • Fixed effects and cluster variables can be expressed as factor interactions, for both convenience and speed (e.g. directly using state#year instead of previously using egen group to generate the state-year combination).
    • Postestimation commands such as predict and test.
  • Allows precomputing results with the cache() option, so subsequent regressions are faster.
  • If requested, saves the point estimates of the fixed effects (caveat emptor: these fixed effects may not be consistent or identifiable; see the Abowd paper for an introduction to the topic).
  • Calculates the degrees-of-freedom lost due to the fixed effects (beyond two levels of fixed effects this is still an open problem, but we provide a conservative upper bound).
  • Avoids common pitfalls by excluding singleton groups (see notes), computing correct within- and adjusted R-squareds (see initial discussion), etc.
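A minimal usage sketch, using Stata's built-in auto dataset as a stand-in (the variables and fixed-effect structure below are purely illustrative):

* Two sets of fixed effects and two-way clustered standard errors
sysuse auto, clear
reghdfe price weight length, absorb(turn foreign) vce(cluster turn trunk)

* Fixed effects can also be factor interactions, e.g. absorb(turn#foreign).
* Save the fixed-effect estimates (see the caveat above) and predict xb + d:
reghdfe price weight, absorb(turn foreign, savefe)
predict double price_hat, xbd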

Authors

Sergio Correia
Board of Governors of the Federal Reserve
Email: [email protected]

Noah Constantine
Board of Governors of the Federal Reserve
Email: [email protected]

Acknowledgments

This package wouldn't have existed without the invaluable feedback and contributions of Paulo Guimaraes, Amine Ouazad, Mark E. Schaffer, Kit Baum and Matthieu Gomez. Also invaluable are the great bug-spotting abilities of many users.

Contributing

Contributors and pull requests are more than welcome. There are a number of extension possibilities, such as estimating standard errors for the fixed effects using bootstrapping, exact computation of degrees-of-freedom for more than two HDFEs, and further improvements in the underlying algorithm.

Note that all the code is written in the current-code folder, which then gets compiled by build.py into the src folder (combining multiple files into single .ado and .mata files, so they can be installed and copied faster).

reghdfe's People

Contributors

poliquin, sergiocorreia


reghdfe's Issues

Add wild clustered bootstrap

Tentative steps:

  1. Estimate the full model, including the parameter of interest. Keep the t-statistic, using analytically clustered standard errors.
  2. Re-estimate the model, imposing the null hypothesis of no effect. The simplest way to do this is to just re-estimate the model, but omit the parameter of interest. Collect the fitted values and residuals for each observation.
  3. Construct a bootstrap replicate for each cluster. That is, with equal probability, for all observations within a cluster, multiply the residual by +1 or -1. Given the new bootstrap residual, add this to the fitted values to construct a bootstrap replicate value for the dependent variable.
  4. Re-compute the full model, again using analytically clustered standard errors, and keep the resulting t-statistic.
  5. Repeat steps 3 and 4 at least 500 times.
  6. Use the bootstrap distribution of the t-statistic to test whether the original t-statistic is significant.

Anyway, it seemed to me that it would be much faster to re-compute the full model in step 4 if I could tell reghdfe to use the estimated parameters, and values of the fixed effects, from step 1 as seed values. Otherwise it starts from scratch, taking 11 iterations or whatever. I guess the fixed effects are not actually estimated, so I'm not sure how one would do this, but this would be the general idea.
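For reference, here is a minimal sketch of the six steps above using current reghdfe syntax. The dataset, the variables (price, mpg, weight, foreign), the cluster variable grp, and the number of replications are all arbitrary stand-ins, and no warm-starting is attempted:

* Step 1: full model; keep the original t-statistic on mpg
sysuse auto, clear
gen grp = mod(_n, 5)    // hypothetical cluster variable
qui reghdfe price mpg weight, absorb(foreign) vce(cluster grp)
scalar t0 = _b[mpg] / _se[mpg]

* Step 2: restricted model (omit mpg); keep fitted values and residuals
qui reghdfe price weight, absorb(foreign) residuals(e0)
gen double fit0 = price - e0    // fitted values = y - residuals

* Steps 3-5: Rademacher (+1/-1 per cluster) bootstrap replicates
set seed 12345
local B = 499
local exceed = 0
forvalues b = 1/`B' {
    tempvar u ystar
    bysort grp: gen double `u' = cond(runiform() < 0.5, -1, 1) if _n == 1
    by grp: replace `u' = `u'[1]
    gen double `ystar' = fit0 + `u' * e0
    qui reghdfe `ystar' mpg weight, absorb(foreign) vce(cluster grp)
    if abs(_b[mpg] / _se[mpg]) >= abs(scalar(t0)) local ++exceed
    drop `u' `ystar'
}

* Step 6: bootstrap p-value for H0: coefficient on mpg equals zero
di "wild cluster bootstrap p-value = " (`exceed' + 1) / (`B' + 1)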

Add suboption so reghdfe does not fail when #Clusters < #Vars

(On behalf of Eduardo Montoya)

...the program should [not] break if a user estimates more parameters than the number of clusters. Of course it will not be possible to test for joint significance of all parameters, but it should be sufficient to simply omit the F test. areg, for example, does not break if this is done.

Example:

sysuse auto, clear
gen grp = mod(_n,4)
areg price mpg rep78 headroom trunk weight length turn, absorb(foreign) cluster(grp)
reghdfe price mpg rep78 headroom trunk weight length turn, absorb(foreign) cluster(grp)

add option to cache(save) so results can be used after restarts

Currently, due to a bug (?) in Stata, we can't save a Mata asarray, so we can't use the results of cache(save) across different Stata sessions (i.e. saving and restarting) or across different computers.

Not sure about the best implementation though (maybe save the asarrays as dta chars)

Automatically run -xtset- with cache(use) if needed

Option cache(save) will clear xtset (and tsset).
Later calls to cache(use) will then fail if they use tsvars.
Solution: if a tsvar is detected while in cache(use) (and only then), run xtset

sysuse auto, clear
bys turn: gen t = _n
xtset turn t
reghdfe price L(0 3).(weight turn) , a(foreign) cache(save)

reghdfe price L(0 3).(weight turn) , a(foreign) cache(use) // this fails
xtset // solution
reghdfe price L(0 3).(weight turn) , a(foreign) cache(use) // this works

hdfe with weight returns an error

hdfe with weight returns an error, due to these lines

    if ("`weight'"!="") {
        local weightvar `exp'
        conf var `weightvar' // just allow simple weights
        local weighttype `weight'
        local weightequal =
    }

Unexpected wildcard behavior with cache(use)

After cache(save), wildcards used in subsequent regressions will also match the temporary cache variables, and the regressions will then fail because of that.

sysuse auto, clear
bys turn: gen t = _n
xtset turn t

gen sqrt_weight = weight^2
reghdfe price L.(*weight) , absorb(turn) cache(save)
reghdfe price L.(*weight) , absorb(turn) cache(use) // Will fail

Possible solution (but low priority): remove temporary variables from those matched in the syntax blocks of ParseIV.ado. Maybe by creating a char (var[temporary]==1) or by using the existing "name" char and asserting that it's equal to the varname.

Assert_msg Error

reghdfe/package/reghdfe.ado

Line 2544 assert_msg ...

has the incorrect syntax and thus throws an error when using usecache. I would have tried to fix it, but am unsure which assert you want to use. For future reference, how do you choose between Assert and assert_msg?

Allow some factor variables as depvar

Factors in the LHS of a regression (reg i.turn price) are usually forbidden.

In line with possible changes in ivreg2, it may be worthwhile to allow some factors, e.g.:

reghdfe 43.turn price weight, a(trunk)

Instead of the equivalent

gen byte turn43 = (turn==43)
reghdfe turn43 price weight, a(trunk)

Change DoF of model F-test?

Setting vce(robust) should be equivalent to vce(cluster _n) because the latter means we are clustering on each observation. This is true for the betas, but the P-value of the model F-test changes due to the degrees of freedom.

Example with areg, to illustrate that the issue is more general than just reghdfe:

cap cls
sysuse auto, clear
keep in 1/25
areg price weight, absorb(turn) vce(robust)
gen cl = _n
areg price weight, absorb(turn) vce(cluster cl)

Inferior parameterization of i.x##c.z

Let covariate x be binary.

Compare:

areg y i.x##c.z, a(fe)
reghdfe y i.x##c.z, a(fe)

areg parameterizes the covariates as (1) 1.x, (2) c.z, and (3) 1.x#c.z.

reghdfe parameterizes the covariates as (1) 1.x, (2) c.z, (3) 0.x#c.z, and (4) 1.x#c.z. Then (4) is omitted for collinearity.

I prefer areg here.

Might be related to bda7e03.

I'm using reghdfe 3.0.10 13may2015.

F statistics vary from run to run, whereas areg reports them as missing

clear
input x1    x2  x3  x4  x5  y   fe  clustervar
0   1   0   1   0   0   1   18
0   1   0   1   1   0   1   18
0   1   1   0   0   0   2   18
0   1   1   0   1   0   2   18
1   1   0   1   0   0   3   18
1   1   0   1   1   0   3   18
0   0   0   1   0   53.08017    4   18
0   0   0   1   1   18.87388    4   18
0   1   0   1   0   0   5   19
0   1   0   1   1   0   5   19
1   1   1   0   0   0   6   18
1   1   1   0   1   0   6   18
0   1   0   0   0   0   7   18
0   1   0   0   1   0   7   18
0   0   0   0   0   30.93297    8   18
0   0   0   0   1   .8479494    8   18
1   1   1   0   0   116.4302    9   15
1   1   1   0   1   0   9   15
0   0   0   1   0   0   10  15
0   0   0   1   1   0   10  15
0   1   0   1   0   0   11  15
0   1   0   1   1   0   11  15
0   0   1   0   0   140.2202    12  14
0   0   1   0   1   49.51491    12  14
1   1   1   0   0   0   13  15
1   1   1   0   1   0   13  15
0   1   0   0   0   0   14  10
0   1   0   0   1   0   14  10
0   1   0   1   0   0   15  11
0   1   0   1   1   0   15  11
0   0   0   0   0   0   16  15
0   0   0   0   1   0   16  15
1   1   0   1   0   33.94632    17  16
1   1   0   1   1   14.17506    17  16
0   1   0   0   0   0   18  15
0   1   0   0   1   0   18  15
1   1   0   0   0   144.8989    19  15
1   1   0   0   1   7.319   19  15
0   1   0   1   0   49.17793    20  15
0   1   0   1   1   2.579268    20  15
0   0   0   1   0   31.01913    21  15
0   0   0   1   1   8.316889    21  15
1   1   0   0   0   0   22  10
1   1   0   0   1   0   22  10
0   0   0   0   0   26.65633    23  16
0   0   0   0   1   0   23  16
0   0   0   0   0   70.59344    24  15
0   0   0   0   1   23.96148    24  15
0   1   0   1   0   0   25  18
0   1   0   1   1   0   25  18
1   1   0   1   0   0   26  20
1   1   0   1   1   0   26  20
1   1   0   0   0   0   27  17
1   1   0   0   1   0   27  17
0   1   1   0   0   25.07012    28  17
0   1   1   0   1   28.73061    28  17
1   1   0   0   0   0   29  11
1   1   0   0   1   0   29  11
0   0   0   0   0   0   30  10
0   0   0   0   1   0   30  10
1   1   0   0   0   0   31  13
1   1   0   0   1   0   31  13
0   1   0   0   0   0   32  14
0   1   0   0   1   0   32  14
1   1   0   0   0   0   33  12
1   1   0   0   1   0   33  12
0   1   0   0   0   0   34  17
0   1   0   0   1   0   34  17
0   0   0   1   0   23.95128    35  20
0   0   0   1   1   5.731863    35  20
0   1   0   1   0   0   36  20
0   1   0   1   1   0   36  20
1   1   0   1   0   32.5347 37  8
1   1   0   1   1   13.97691    37  8
1   1   0   0   0   45.62077    38  8
1   1   0   0   1   13.64072    38  8
0   1   0   0   0   0   39  7
0   1   0   0   1   0   39  7
0   1   1   0   0   0   40  7
0   1   1   0   1   0   40  7
0   0   1   0   0   0   41  8
0   0   1   0   1   0   41  8
0   0   0   0   0   0   42  8
0   0   0   0   1   0   42  8
0   1   0   0   0   0   43  9
0   1   0   0   1   0   43  9
0   0   0   1   0   0   44  9
0   0   0   1   1   0   44  9
1   1   0   0   0   57.53611    45  4
1   1   0   0   1   29.1646 45  4
0   1   0   1   0   0   46  6
0   1   0   1   1   0   46  6
0   1   0   1   0   0   47  4
0   1   0   1   1   0   47  4
1   1   0   1   0   0   48  4
1   1   0   1   1   0   48  4
1   1   0   1   0   23.30293    49  4
1   1   0   1   1   0   49  4
1   0   0   1   0   20.78044    50  5
1   0   0   1   1   0   50  5
1   1   0   0   0   .8128695    51  4
1   1   0   0   1   0   51  4
0   0   0   1   0   0   52  4
0   0   0   1   1   0   52  4
1   1   0   0   0   0   53  2
1   1   0   0   1   0   53  2
0   0   0   1   0   12.78493    54  3
0   0   0   1   1   16.55898    54  3
0   1   0   0   0   .5542843    55  1
0   1   0   0   1   .0111068    55  1
0   0   0   1   0   9.894016    56  1
0   0   0   1   1   10.96016    56  1
0   1   1   0   0   2.538583    57  1
0   1   1   0   1   6.827503    57  1
1   1   0   1   0   0   58  1
1   1   0   1   1   0   58  1
0   0   1   0   0   8.549977    59  1
0   0   1   0   1   2.152292    59  1
1   1   0   1   0   52.01904    60  1
1   1   0   1   1   8.515899    60  1
1   1   0   1   0   0   61  2
1   1   0   1   1   0   61  2
0   1   0   0   0   28.41826    62  1
0   1   0   0   1   5.491279    62  1
1   1   1   0   0   0   63  2
1   1   1   0   1   0   63  2
0   1   0   0   0   3.104406    64  1
0   1   0   0   1   0   64  1
1   1   0   0   0   0   65  1
1   1   0   0   1   0   65  1
end

forvalues i = 1/20 {
    qui reghdfe y x1##x2##(x3 x4)##x5, absorb(fe) vce(cl clustervar)
    di `e(F)'
}

forvalues i = 1/20 {
    qui areg y x1##x2##(x3 x4)##x5, absorb(fe) vce(cl clustervar)
    di `e(F)'
}

weights = zero

If the weight variable contains zeros:

sysuse nlsw88.dta
reghdfe wage hours [w=south], a(race)
weight south can only contain strictly positive reals, but 1304 zero values were found (will be dropped)
1304 contradictions in 2242 observations
assertion is false

Factor variables not being properly omitted

reghdfe 3.0.50 05jun2015

Setup:

est clear
sysuse bplong, clear

reghdfe bp sex##agegrp##when, a(patient)
esttab, noomit nobase nocons

areg bp sex##agegrp##when, a(patient)
esttab, noomit nobase nocons

areg behavior:

note: 1.sex omitted because of collinearity
note: 2.agegrp omitted because of collinearity
note: 3.agegrp omitted because of collinearity
note: 1.sex#2.agegrp omitted because of collinearity
note: 1.sex#3.agegrp omitted because of collinearity

Displayed table shows all are "(omitted)".

esttab, noomit drops all these variables.

reghdfe behavior:

note: __1__sex omitted because of collinearity
note: __2__agegrp omitted because of collinearity
note: __3__agegrp omitted because of collinearity
note: __1__sex_X_2__agegrp omitted because of collinearity
note: __1__sex_X_3__agegrp omitted because of collinearity

Displayed table shows uninteracted variables are "(empty)" and interacted variables are "(omitted)".

esttab, noomit drops only uninteracted variables (i.e. the ones that do not say "(omitted)"). 1.sex#2.agegroup should actually be 1o.sex#2o.agegroup.

Add noabsorb option

Would allow reghdfe to replace regress when we just want, e.g., multi-way clustering (mwc)

See Julian Reif's email
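(The noabsorb option was eventually added; see the version 5.6 notes above.) A minimal sketch of the intended use, with illustrative variables from the auto dataset:

* No fixed effects absorbed, but multi-way clustered standard errors
sysuse auto, clear
reghdfe price weight length, noabsorb vce(cluster turn trunk)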

Re-omit base values of factor variables

Commit bda7e03 (“bugfix with i.var#c.var: Now we don't exclude base variables with cont. interactions”) had unintended side effects. AFAICT, it no longer omits any base values.

sysuse nlsw88

Old behavior (bda7e03) (preferable IMHO):

. reghdfe wage i.collgrad, a(industry)

HDFE Linear regression                            Number of obs   =       2232
Absorbing 1 HDFE indicator                        F(   1,   2219) =     183.57
                                                  Prob > F        =     0.0000
                                                  R-squared       =     0.1355
                                                  Adj R-squared   =     0.1308
                                                  Within R-sq.    =     0.0764
                                                  Root MSE        =     5.3735

-------------------------------------------------------------------------------
         wage |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
--------------+----------------------------------------------------------------
     collgrad |
college grad  |   3.861116   .2849765    13.55   0.000     3.302268    4.419965
        _cons |   6.868351   .1322824    51.92   0.000     6.608941    7.127761
--------------+----------------------------------------------------------------
    Absorbed |       F(11, 2219) =     14.988   0.000             (Joint test)
------------------------------------------------------------------------------

------------------------------------------------------------------------------
 Absorbed FE |  Num. Coefs.  =   Categories  -   Redundant     |    Corr. w/xb
-------------+-------------------------------------------------+--------------
  i.industry |           11              12              1     |       -0.0706
------------------------------------------------------------------------------

New behavior (11e3ab0):

. reghdfe wage i.collgrad, a(industry)
note: __0b__collgrad1 omitted because of collinearity

HDFE Linear regression                            Number of obs   =       2232
Absorbing 1 HDFE indicator                        F(   1,   2219) =     183.57
                                                  Prob > F        =     0.0000
                                                  R-squared       =     0.1355
                                                  Adj R-squared   =     0.1308
                                                  Within R-sq.    =     0.0764
                                                  Root MSE        =     5.3735

-----------------------------------------------------------------------------------
             wage |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
------------------+----------------------------------------------------------------
         collgrad |
not college grad  |          0  (empty)
    college grad  |   3.861116   .2849765    13.55   0.000     3.302268    4.419965
                  |
            _cons |   6.868351   .1322824    51.92   0.000     6.608941    7.127761
------------------+----------------------------------------------------------------
     Absorbed |       F(11, 2219) =     14.988   0.000             (Joint test)
-------------------------------------------------------------------------------

-------------------------------------------------------------------------------
  Absorbed FE |  Num. Coefs.  =   Categories  -   Redundant     |    Corr. w/xb
--------------+-------------------------------------------------+--------------
   i.industry |           11              12              1     |       -0.0706
-------------------------------------------------------------------------------

absorb with equal signs

It would be nice to support spaces around the equal signs in absorb() (or to change the error message). For now, the command

absorb(fe1 = state fe2 = year)

returns "variable fe1 not found"

Default weight option

reghdfe defaults to fweight, while reg/areg default to aweight. Unless there's a reason I'm missing, it'd be nice if they had the same defaults.
Out of curiosity, why does fweight tend to return a "numerical overflow" error (in areg or reghdfe)?

bug saving the fixed effects

Dear Sergio,

I have been using the two following syntaxes to estimate and save the FE coefficients

A) reghdfe r, a(i.id i.id#c.rmrf id#c.smb id#c.hml, savefe)
B) reghdfe r, a(i.id##c.(rmrf smb hml),savefe)

Where id is a categorical variable and all the others are continuous variables.

If my understanding is correct, the two syntaxes should be equivalent, i.e., the FE estimates should be the same. But I find different alphas. The slope estimates are the same (_hdfe2_slop1 using syntax A = _hdfe1_slop1 using syntax B), but not the alpha estimates (hdfe1 using syntax A <> hdfe1 using syntax B).

Note that the correct alpha seems to be obtained with syntax A only. When I do

reghdfe r if id==1, a(i.id i.id#c.rmrf id#c.smb id#c.hml, savefe)

and

reg r rmrf smb hml if id==1

I find that hdfe1 = _cons, which is what I expected. However, when I do

reghdfe r if id==1, a(i.id##c.(rmrf smb hml),savefe)

I have hdfe1 different than _cons

Is it a bug in the command or I am missing something?

Thank you so much for your help !

Best,

OD

Reghdfe can't save stage(reduced ols first) regressions

Hi,

Thanks for making reghdfe! This command is amazing! I'm having trouble using reghdfe to output multiple forms of the regression. For example, when I run

reghdfe price (mpg = rep78), absorb(foreign) stages(first reduced ols)

I see all four regressions displayed. I would like to save all of these estimates, but I can't.

I am able to use

estimates replay reghdfe_first1

to show the first stage regression, but whenever I try to use eststo or estimates store to save it, it shows me the IV coefficients.

For example,

eststo reghdfe_first1
esttab reghdfe_first1

Shows me the IV output, not the first stage. Do you know how to save the output for all of the four regressions, particularly for use in estout/esttab?

-- predict, stdp --

Hello,

I'm glad that the predict command can be used after reghdfe (if the fixed effects are saved), but I don't think it's possible to get the standard error of the prediction (I get the error: "option stdp not allowed"). It would be nice if -- predict, stdp -- could be used to get the standard error of the X'B prediction (ignoring uncertainty in the estimated fixed effect), similar to how it works after areg (I believe).

Thanks!
-Mitch
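A rough manual workaround sketch, computing the standard error of the X'B prediction directly from e(V) while ignoring uncertainty in the absorbed fixed effects (as requested above). The dataset and covariates are illustrative, and the sketch assumes a reghdfe version that reports _cons, so e(V) is ordered mpg, weight, _cons:

sysuse auto, clear
reghdfe price mpg weight, absorb(foreign)
matrix V = e(V)    // coefficient order: mpg, weight, _cons

* Var(xb_i) = x_i' V x_i with x_i = (mpg_i, weight_i, 1)
gen double var_xb = mpg^2*el(V,1,1) + weight^2*el(V,2,2) + el(V,3,3)
replace var_xb = var_xb + 2*mpg*weight*el(V,1,2) + 2*mpg*el(V,1,3) + 2*weight*el(V,2,3)
gen double stdp_xb = sqrt(var_xb)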

Development branch throws new error: "e(df_r) doesn't match"

Load:

x i j y
0 1 1 1
1 1 2 1
1 2 2 2
1 3 3 3
1 3 2 3

Run:

reghdfe y x, absorb(i j)

Stable branch result:

WARNING! Missing FStat
Note: equality df_m+1==rank failed (is there a collinear variable in the RHS?), running -test- to get correct values
[Table omitted]

Development branch result:

e(df_r) doesn't match: 3!=4

Add links to algorithms used

  • Small link to HDFE paper in method description
  • Link to identification warning pdf if saving the FEs
  • Add a NOWARNing option to silence all warnings

Parsing syntax for weights in hdfe

hdfe returns an error when using weights (both on SSC and in the master branch)

. sysuse nlsw88.dta, clear
. hdfe wage [w=tenure], a(fe) gen(new)
weights not allowed

The error is returned by the following call to ParseIV

 . ParseIV wage [fweighttenure], estimator() ivsuite()

reghdfe crashes after -fvset-

sysuse auto
fvset base 1 foreign
reghdfe price weight, a(foreign)

The problem lies in the parsing step (ParseAbsvars.ado): -syntax- doesn't expand the absvar to "i.foreign" but to "ib1.foreign".

A solution would be to replace -syntax- completely (including when parsing the indep. vars), which would fix several related bugs but would involve a lot of work.

Improve summarize() option

  1. Use fixed varnames (L.weight) instead of temp varnames (__L__weight)
  2. Raise an error if ... summarize usecache was not preceded by a ... summarize savecache
  3. With usecache, only report variables included in the regression

Support for postestimation -margins-

Currently margins "works" but gives wrong results. We can either set e(marginsok) to missing/empty, or add margins support. No clue how though.

clear all
cls
sysuse auto

reghdfe price weight i.foreign, a(turn)
margins foreign

areg price weight i.foreign, a(turn)
margins foreign

Predict outside e(sample)

Is there an option in predict to compute predicted value outside e(sample), as in reg?

sysuse nlsw88, clear

* reg: predict generates predicted values for all observations
reg wage i.race if married == 0
predict temp1
assert temp1 <.

* reghdfe:  predict generates predicted values only for observations in e(sample)
reghdfe wage if married == 0, a(race, savefe)
predict temp2, xbd
assert temp2 <.

A typical case is to compute fixed effects using only observations with treatment = 0 and compute predicted value for observations with treatment = 1.


Slope-only models

I don't really get what reghdfe does in the case of only one interaction.

sysuse nlsw88.dta
reghdfe wage, a(fe = c.tenure#i.grade) keepsingletons

The resulting variable fe_Slope1 differs from

reg wage c.tenure#i.grade, nocons

Also, predict y returns a variable equal to zero

acceleration method

Currently, map_solve() is good enough with conjugate gradient and symmetric kaczmarz, so I don't have any dataset that requires more than a few iterations to converge.

Also, most synthetic benchmarks are "easy to solve" in terms of overall connectivity of the underlying graph.

Instead, I need a hard-to-solve dataset (basically, just the FEs) in order to tune up the acceleration.

There are two potential improvements:

  1. combine kaczmarz with cimmino (because one method seems better earlier on, and the other at the end)

  2. Apply the GT preconditioner

Large Data Set - Issue?

Hi,

I am new to using reghdfe and have encountered what I believe to be an issue. I have a large dataset (40+ GB) and am trying to run a regression with 2 fixed effects. The first time I ran the regression I received the following errors:

reghdfe lnkwh treatment_event*, absorb(date_time customer_id) vce(robust)
(dropped 182 singleton observations)

    map_projection():  3900  unable to allocate real <tmp>[195026781,10]

transform_sym_kaczmarz(): - function returned error
accelerate_cg(): - function returned error
map_solve(): - function returned error
: - function returned error
r(3900);

I then changed my code to use pool(3), since I had read that this uses less memory. I am running Stata 12.0 on a server and have a maximum of 256 GB of memory available. When I tried re-running it, the regression got a bit further but still isn't quite working: I'm not getting any output and there are no estimates to store. I know this code works when I run it on a much smaller sample (2 GB of the data). For reference, date_time contains about 42,000 fixed effects and customer_id about 7,500.

reghdfe lnkwh treatment_event*, absorb(date_time customer_id) vce(cluster customer_id) cache(save) pool(3)
(dropped 182 singleton observations)
(converged in 7 iterations)

.
. estimates store Load_Shifting
last estimation results not found, nothing to store
r(301);

Any tips/ideas on what I might need to try differently?

Not enough accuracy with c.vars

With a continuous interaction in the absvars, we need more digits of accuracy because of numerical issues.
E.g., convergence at 1e-6 in the FEs is not the same as convergence at 1e-6 when those FEs are interacted with the continuous variables.

Solution: change the convergence criterion to rely on the multiplication

multiple ##c.var

sysuse nlsw88.dta
reghdfe wage i.industry##c.age i.union##c.age, a(race)
error: there are repeated variables: <age>
r(198);
areg wage i.industry##c.age i.union##c.age, a(race)

var1#var2 in the absorb

Consider the following four examples:
sysuse auto,clear

a. reghdfe price turn,absorb(rep78 foreign rep78#foreign)
variable rep78 not found
r(111);

b. reghdfe price turn,absorb(rep78#foreign rep78 foreign)
runs without error

c. qui gen price1=price>6100
reghdfe price turn,absorb(price1 rep78#foreign)
runs without error

d. reghdfe price turn,absorb(price1 rep78 rep78#foreign)
variable rep78 not found
r(111);

I was wondering whether this means we should always include var1#var2 at the beginning when we are also including either var1 or var2 or both? Is this the default behavior? I am using reghdfe 2.1.10 07apr2015.

`predict` fails after using pweights

MWE:

sysuse auto, clear

reg price mpg i.foreign [pw=turn]
predict double p1, xb

areg price mpg [pw=turn], absorb(foreign)
predict double p2, xbd

reghdfe price mpg [pw=turn], absorb(foreign, save)
predict double p3, xbd

Reason:

Behind the scenes, reghdfe calls summarize, and summarize fails with pweights.

Nested within clusters too aggressive on corner case

sysuse auto
gen id = _n 
reghdfe price weight, a(id) vce(cluster id)

Here it doesn't make sense to treat the FEs as "redundant"

vce(cluster _n) should be THE SAME as vce(robust)

It seems we can only treat as redundant those coefficients that are nested in clusters spanning more than one observation

Subsequent coefs nested within a cluster need more observations

Checking for redundancy between e.g. FE1 and FE2 is not affected even if FE1 is nested within a cluster

Twice Robust for clustering and wmatrix

I seem to be getting unclear errors when trying to use wmatrix(cluster var) for both gmm and 2sls IV. How does the option twicerobust work when you specify vce(cluster var)?

reghdfe Y X1 (X2 = Z), absorb(T1 T2) est(2sls) vce(cluster C) stages(first reduced) suboptions(wmatrix(cluster C)) ivsuite(ivregress)

>cannot specify wmatrix() with 2SLS estimator

reghdfe Y X1 (X2 = Z), absorb(T1 T2) est(gmm) vce(cluster C) stages(first reduced) suboptions(wmatrix(cluster C)) ivsuite(ivregress)

>option wmatrix() not allowed

I most likely am misunderstanding the documentation. Any help would be appreciated.

Thank you,

Ayal

hdfe with tempvars

I'm encountering issues when using hdfe inside a program

clear
set obs 3
gen a = 1
gen b = 1
gen x = 1
gen __000000 = 1
 hdfe x , a(a b) gen(new)
#__000000 already defined
 hdfe x , a(a b) gen(new)
# reghdfe error: data transformed with cache(save) requires cache(use)
