tlamadon / bipartitepandas

Python tools for bipartite labor data

License: MIT License

Makefile 0.07% Python 99.93%

bipartitepandas's Introduction

Thibaut Lamadon Github profile

Please see my personal page at lamadon.com.

I will update this page soon!

bipartitepandas's People

Forkers: santiagohermo

bipartitepandas's Issues

Don't alter categorical variables

Remove code that replaces categorical variables with contiguous integers from 0 to n.

Rename firm class column and add default worker type column

Rename firm class column from g to k, and add default column l for worker types.

Stack overflow when trying to identify leave-one-out-connected set

Dear developers,

First of all, thank you for supplying the great BipartitePandas and PyTwoWay packages.

I have an issue identifying the leave-one-out-connected set using BipartitePandas.

Everything works well on small simulated and test data. However, when I apply the data cleaning to real administrative data (more than 21 million worker-year observations), Python crashes with the error ‘Windows fatal exception: stack overflow’. A window pops up saying “Python has stopped working”, and the Python kernel in Spyder restarts.

The server I use runs a 64-bit Windows OS, with a 2.1 GHz processor and 256 GB of RAM. Python uses at most about 10 GB of RAM while my program is running. The installed Python version is 3.8.8, and BipartitePandas is version 1.0.35.

I tried increasing both the stack size and the recursion depth (see the code in the txt file). This did not resolve the problem, though.

I would be grateful if you have any idea what may cause the stack overflow and how to fix it. Please find my code and associated output from the Spyder console below. (I redacted some paths from these files for security reasons.)

Best
Martin

22_8_22_bp_tw_code.txt
output_console.txt
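The attached files are not reproduced on this page. For readers unfamiliar with the approach described in the report, the following is a minimal sketch of one common way to raise both the thread stack size and the recursion limit in CPython. The names `run_with_large_stack` and `depth` are hypothetical, and this is not the reporter's actual code. As stated in the report, raising these limits did not resolve the crash in that case.

```python
import sys
import threading


def run_with_large_stack(func, stack_mb=64, recursion_limit=100_000):
    """Run func() in a new thread with a larger stack and recursion limit.

    Hypothetical helper for illustration only. The stack size applies to
    threads created after threading.stack_size() is called, so the work
    must run inside a fresh thread rather than in the main thread.
    """
    result = {}

    def target():
        try:
            result["value"] = func()
        except BaseException as exc:  # hand the error back to the caller
            result["error"] = exc

    old_limit = sys.getrecursionlimit()
    # threading.stack_size(size) sets the new size and returns the old one
    old_stack = threading.stack_size(stack_mb * 1024 * 1024)
    sys.setrecursionlimit(recursion_limit)
    try:
        worker = threading.Thread(target=target)
        worker.start()
        worker.join()
    finally:
        # Restore the previous settings
        sys.setrecursionlimit(old_limit)
        threading.stack_size(old_stack)

    if "error" in result:
        raise result["error"]
    return result["value"]


def depth(n):
    """Toy recursive function standing in for a deeply recursive algorithm."""
    return 0 if n == 0 else 1 + depth(n - 1)


deep_result = run_with_large_stack(lambda: depth(10_000))
```

In practice, `func` would wrap the cleaning call. Whether a larger stack helps depends on where the recursion happens, for instance inside a compiled extension rather than in Python code.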

AttributeError when cleaning simulated data

Hi,

I am just starting to use the package with simulated data, following the example code. However, I cannot get past the cleaning step: when I simulate data using the example and cleaning parameters, I get the following error:

>>> bdf = bdf.clean(clean_params)
checking required columns and datatypes
sorting rows
dropping NaN observations
generating 'm' column
keeping highest paying job for i-t (worker-year) duplicates (how='max')
dropping workers who leave a firm then return to it (how='returners')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/ssb/bruker/sml/.local/lib/python3.6/site-packages/bipartitepandas/bipartitelong.py", line 100, in clean
    frame = super().clean(params)
  File "/ssb/bruker/sml/.local/lib/python3.6/site-packages/bipartitepandas/bipartitelongbase.py", line 154, in clean
    frame = frame._drop_returns(how=drop_returns, is_sorted=True, reset_index=True, copy=False)
  File "/usr/local/lib64/python3.6/site-packages/pandas/core/generic.py", line 5179, in __getattr__
    return object.__getattribute__(self, name)
AttributeError: 'DataFrame' object has no attribute '_drop_returns'
(4 lines skipped)

When I instead load the fake dataset from leed2way, the error occurs one step earlier: AttributeError: 'DataFrame' object has no attribute 'gen_m'.

Specify which columns to check when dropping NaN

Add option during cleaning to specify which columns are checked when .dropna() is called.

Better error messages when constructing data sets

If one of the four required columns is missing or has the wrong type, we should provide a message that says exactly what the issue is. Right now it just throws an error about a wrong type if you name the wage column w instead of y.

I think we should also add a construction method that simply takes the four required columns as input arguments, i.e.

Bipartite(i=I, j=J, t=T, y=Y)

where I, J, T and Y are simply numpy vectors.
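The kind of check requested above could be sketched as follows. This is only an illustration under stated assumptions: `check_required_columns` is a hypothetical helper and not part of the package, the columns are plain Python lists rather than numpy vectors, and the type rules (numeric y, integer t) are assumptions made for the example.

```python
from numbers import Integral, Real

# Meaning of the four required columns, used to build readable messages
REQUIRED = {
    "i": "worker id",
    "j": "firm id",
    "t": "time period",
    "y": "outcome, e.g. wage",
}


def check_required_columns(columns):
    """Return a list of readable problems; an empty list means no problem.

    Hypothetical helper: `columns` maps column names to lists of values.
    """
    problems = []
    for name, meaning in REQUIRED.items():
        if name not in columns:
            problems.append(f"missing required column '{name}' ({meaning})")

    # All required columns that are present should have the same length
    lengths = {name: len(columns[name]) for name in REQUIRED if name in columns}
    if len(set(lengths.values())) > 1:
        problems.append(f"required columns differ in length: {lengths}")

    # Assumed type rules: y must be numeric, t must be an integer
    if "y" in columns and not all(
        isinstance(v, Real) and not isinstance(v, bool) for v in columns["y"]
    ):
        problems.append("column 'y' must contain only numeric values")
    if "t" in columns and not all(
        isinstance(v, Integral) and not isinstance(v, bool) for v in columns["t"]
    ):
        problems.append("column 't' must contain only integer values")
    return problems


# The wage is named w instead of y, as in the case described above
problems = check_required_columns(
    {"i": [1, 2], "j": [10, 11], "t": [2000, 2001], "w": [1.5, 2.5]}
)
```

Returning every problem at once, instead of raising on the first one, lets the user fix all column issues in a single pass.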

`diagnostic()` should update class attributes

Add option to .diagnostic() so that it updates class attributes to the correct values while it runs.

4-period event study

Add class for 4-period event study.

Cluster on different columns

Add option to cluster on arbitrary columns (not just j).

Group observations based on windows of time

Add method to group observations based on windows of time.

This should work for long format, and maybe collapsed long format.

This could look like .windows(length, start=None, end=None).

This will group observations together over windows of period length.

If start and end are None, it will specify these windows based on when the worker worked.

If start and/or end are specified, drop observations that are from before start or after end. Then, share the windows between all workers. It will need to handle cases where workers don't have observations for some windows - either drop those workers, fill in their missing data, or leave their data as-is.
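The window assignment described above can be sketched in plain Python. This is a minimal illustration, not the proposed method itself: `assign_windows` is a hypothetical function, rows are (worker, period) tuples rather than a data frame, and it implements only the "leave the data as-is" option for workers with missing windows. When only end is given, the sketch assumes that windows still start at each worker's first period.

```python
def assign_windows(rows, length, start=None, end=None):
    """Attach a window index to each (worker_id, period) row.

    Hypothetical sketch. If start is None, windows are counted from each
    worker's first observed period; otherwise all workers share windows
    counted from start. Rows before start or after end are dropped.
    """
    # First observed period of each worker, used when start is None
    first_period = {}
    for worker, period in rows:
        first_period[worker] = min(period, first_period.get(worker, period))

    out = []
    for worker, period in rows:
        if start is not None and period < start:
            continue  # observation from before start is dropped
        if end is not None and period > end:
            continue  # observation from after end is dropped
        origin = start if start is not None else first_period[worker]
        out.append((worker, period, (period - origin) // length))
    return out


rows = [(1, 2000), (1, 2001), (1, 2002), (2, 2001), (2, 2003)]
per_worker = assign_windows(rows, length=2)
shared = assign_windows(rows, length=2, start=2001, end=2002)
```

With `length=2` and no bounds, worker 2's windows start in 2001, so 2001 falls in window 0 and 2003 in window 1, independently of worker 1.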

Clean data and prepare data should tell us what they are doing by default

These functions take a long time to run, so I think they should give feedback to the user by default.

Clean up quantiles code

I think this code can be vectorized (and it should just be cleaned up anyway).

Leave-one-out with categorical controls

Add method to compute the largest leave-one-out set, but accounting for categorical control variables (take a look at this paper).

Partialling out controls

Add option to partial out controls.

This is probably only feasible for long format data.

General idea: add method .partial(outcome, continuous_controls, categorical_controls, controls_to_partial). Then it runs the regression outcome = controls_not_to_partial @ beta_1 + controls_to_partial @ beta_2. Then update outcome to become outcome - controls_to_partial @ beta_2.

If the columns i and j are included, I think the easiest way to run the regression would be to use the class FEControlEstimator - but to make the method more flexible will require specifying the regression inside the method.
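The general idea above can be written out as follows. The notation is an assumption made for illustration: $X_1$ stacks controls_not_to_partial, $X_2$ stacks controls_to_partial, and $y$ is the outcome.

```latex
% Step 1: regress the outcome on both sets of controls
y = X_1 \beta_1 + X_2 \beta_2 + \varepsilon

% Step 2: take the least-squares estimates from that regression
(\hat{\beta}_1, \hat{\beta}_2)
  = \arg\min_{\beta_1, \beta_2} \, \lVert y - X_1 \beta_1 - X_2 \beta_2 \rVert^2

% Step 3: replace the outcome by its value net of the partialled-out controls
\tilde{y} = y - X_2 \hat{\beta}_2
```

Only $\hat{\beta}_2$ is used in the update, so the variation explained by $X_1$ stays in $\tilde{y}$.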

Specify periods to keep

Add an option to force all observations to be in a particular set of periods.

If a worker doesn't have observations for those periods, either fill them in or drop the worker.
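The "drop the worker" variant can be sketched in plain Python. This is a minimal illustration under stated assumptions: `keep_balanced_workers` is a hypothetical function, rows are tuples whose first two entries are the worker id and the period, and the "fill them in" variant is not covered.

```python
def keep_balanced_workers(rows, periods):
    """Keep rows in `periods`, only for workers observed in all of them.

    Hypothetical sketch: each row is a tuple starting with
    (worker_id, period); any further fields are carried along unchanged.
    """
    required = set(periods)

    # Collect which of the required periods each worker is observed in
    observed = {}
    for row in rows:
        worker, period = row[0], row[1]
        if period in required:
            observed.setdefault(worker, set()).add(period)

    # A worker is kept only if every required period is covered
    complete = {w for w, seen in observed.items() if seen == required}
    return [r for r in rows if r[0] in complete and r[1] in required]


rows = [
    (1, 2000), (1, 2001),
    (2, 2000), (2, 2002),
    (3, 2000), (3, 2001), (3, 2002),
]
balanced = keep_balanced_workers(rows, periods=[2000, 2001])
```

Here worker 2 is dropped for lacking a 2001 observation, and worker 3's 2002 row is removed because it lies outside the requested periods.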
