Please see my personal page at lamadon.com.
I will update this page soon!
Python tools for bipartite labor data
License: MIT License
Remove code that replaces categorical variables with contiguous integers from 0 to n.
Rename firm class column from g to k, and add default column l for worker types.
Dear developers,
First of all, thank you for supplying the great BipartitePandas and PyTwoWay packages.
I have an issue identifying the leave-one-out-connected set using BipartitePandas.
Everything works well on small simulated and test data. However, if I apply the data cleaning to real admin data (more than 21 million worker-year observations), Python crashes. The error I get is ‘Windows fatal exception: stack overflow’. A window pops up saying “Python has stopped working” and the Python kernel in Spyder restarts.
The server I use runs on a Windows 64-bit OS. It has a 2.1 GHz processor and 256 GB of RAM. Python uses at most about 10 GB of RAM while my program is running. The Python version installed on this server is 3.8.8 and BipartitePandas is version 1.0.35.
I tried increasing both the stack size and recursion depth (see code in txt file). This did not resolve the problem though.
I would be grateful if you have any idea what may cause the stack overflow and how to fix it. Please find my code and associated output from the Spyder console below. (I redacted some paths from these files for security reasons.)
Best
Martin
Hi,
I am just starting to use the package with simulated data from the example code. However, I cannot get beyond cleaning: when I simulate data using the example and cleaning parameters, I get the following error:
>>>bdf=bdf.clean(clean_params)
checking required columns and datatypes
sorting rows
dropping NaN observations
generating 'm' column
keeping highest paying job for i-t (worker-year) duplicates (how='max')
dropping workers who leave a firm then return to it (how='returners')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/ssb/bruker/sml/.local/lib/python3.6/site-packages/bipartitepandas/bipartitelong.py", line 100, in clean
frame = super().clean(params)
File "/ssb/bruker/sml/.local/lib/python3.6/site-packages/bipartitepandas/bipartitelongbase.py", line 154, in clean
frame = frame._drop_returns(how=drop_returns, is_sorted=True, reset_index=True, copy=False)
File "/usr/local/lib64/python3.6/site-packages/pandas/core/generic.py", line 5179, in __getattr__
return object.__getattribute__(self, name)
AttributeError: 'DataFrame' object has no attribute '_drop_returns'
(4 lines skipped)
When I instead load the fake dataset from leed2way, the error occurs one step earlier, with AttributeError: 'DataFrame' object has no attribute 'gen_m'.
Add option during cleaning to specify which columns are checked when .dropna() is called.
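As a rough illustration of the requested behavior, the sketch below uses plain pandas rather than the package's cleaning code. The helper name drop_missing and the extra column age are hypothetical; DataFrame.dropna(subset=...) is the standard pandas call that restricts the NaN check to chosen columns.

```python
import numpy as np
import pandas as pd

def drop_missing(df, subset=None):
    """Drop rows with NaN in the given columns (all columns if subset is None)."""
    return df.dropna(subset=subset).reset_index(drop=True)

df = pd.DataFrame({
    "i": [0, 0, 1, 1],
    "j": [0, 1, 1, 2],
    "y": [1.0, np.nan, 2.0, 3.0],
    "t": [1, 2, 1, 2],
    "age": [30.0, 31.0, np.nan, 41.0],  # hypothetical optional control with a gap
})

# Checking every column drops the row with a missing wage and the row with a missing age.
all_cols = drop_missing(df)

# Checking only the required columns keeps the row whose only gap is in 'age'.
required_only = drop_missing(df, subset=["i", "j", "y", "t"])
```

With an option like this, a gap in an optional control would no longer remove an observation whose required columns are complete.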
If one of the 4 required columns is missing or has the wrong type, we should provide a message that says exactly what the issue is. Right now it just throws an error about a wrong type if you define the wage as w instead of y.
I think we should also add a construction method that simply takes the 4 required columns as input arguments, i.e.
Bipartite(i=I, j=J, t=T, y=Y)
where I, J, T and Y are simply NumPy vectors.
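One possible shape for such a constructor, together with the more specific error messages asked for above, is sketched here in plain pandas and NumPy. The function name frame_from_arrays and the exact message wording are assumptions for illustration, not the package's API; the result is an ordinary pandas DataFrame.

```python
import numpy as np
import pandas as pd

REQUIRED = ("i", "j", "y", "t")  # worker id, firm id, wage, period

def frame_from_arrays(**columns):
    """Build a long-format DataFrame from keyword arrays, with explicit errors."""
    missing = [c for c in REQUIRED if c not in columns]
    if missing:
        extra = sorted(set(columns) - set(REQUIRED))
        hint = f" (unrecognized columns supplied: {extra})" if extra else ""
        raise ValueError(f"missing required column(s) {missing}{hint}")

    arrays = {c: np.asarray(columns[c]) for c in REQUIRED}
    lengths = {c: len(a) for c, a in arrays.items()}
    if len(set(lengths.values())) != 1:
        raise ValueError(f"columns have different lengths: {lengths}")
    if not np.issubdtype(arrays["y"].dtype, np.number):
        raise TypeError(f"column 'y' must be numeric, got dtype {arrays['y'].dtype}")
    return pd.DataFrame(arrays)

I = np.array([0, 0, 1, 1])
J = np.array([0, 1, 1, 2])
T = np.array([1, 2, 1, 2])
Y = np.array([1.0, 1.5, 2.0, 2.5])

frame = frame_from_arrays(i=I, j=J, t=T, y=Y)

# Passing the wage as 'w' now names the missing column instead of reporting a type error.
try:
    frame_from_arrays(i=I, j=J, t=T, w=Y)
except ValueError as err:
    message = str(err)
```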
Add option to .diagnostic() so that it updates class attributes to the correct values while it runs.
Add class for 4-period event study.
Add option to cluster on arbitrary columns (not just j).
Add method to group observations based on windows of time.
This should work for long format, and maybe collapsed long format.
This could look like .windows(length, start=None, end=None). This will group observations together over windows of period length (the length argument).
If start and end are None, it will specify these windows based on when the worker worked.
If start and/or end are specified, drop observations that are from before start or after end. Then, share the windows between all workers. It will need to handle cases where workers don't have observations for some windows: either drop those workers, fill in their missing data, or leave their data as-is.
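The window assignment described above could be sketched as follows in plain pandas. The function name assign_windows and the output column window are hypothetical, and two choices are assumptions: end is treated as inclusive, and when only end is given the shared windows start at the earliest remaining period. Workers with no observations in some windows are left as-is.

```python
import pandas as pd

def assign_windows(df, length, start=None, end=None):
    """Label each observation with the index of the time window it falls in."""
    out = df.copy()
    if start is not None:
        out = out[out["t"] >= start]   # drop observations before start
    if end is not None:
        out = out[out["t"] <= end]     # drop observations after end (inclusive)

    if start is None and end is None:
        # Worker-specific windows: count from each worker's first observed period.
        origin = out.groupby("i")["t"].transform("min")
    else:
        # Shared windows: every worker counts from the same origin.
        origin = start if start is not None else out["t"].min()

    out["window"] = (out["t"] - origin) // length
    return out.reset_index(drop=True)

df = pd.DataFrame({
    "i": [0, 0, 0, 1, 1],
    "t": [2000, 2001, 2003, 2002, 2003],
    "y": [1.0, 2.0, 3.0, 4.0, 5.0],
})

per_worker = assign_windows(df, length=2)
shared = assign_windows(df, length=2, start=2001)
```

Aggregating within windows would then be a groupby on the worker and window columns; how to treat workers with empty windows remains a separate decision.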
These functions take a long time; I think the default should provide feedback to the user.
I think this code can be vectorized (and it should just be cleaned up anyway).
Add method to compute the largest leave-one-out set, but accounting for categorical control variables (take a look at this paper).
Add option to partial out controls.
This is probably only feasible for long format data.
General idea: add method .partial(outcome, continuous_controls, categorical_controls, controls_to_partial). Then it runs the regression outcome = controls_not_to_partial @ beta_1 + controls_to_partial @ beta_2. Then update outcome to become outcome - controls_to_partial @ beta_2.
If the columns i and j are included, I think the easiest way to run the regression would be to use the class FEControlEstimator, but to make the method more flexible will require specifying the regression inside the method.
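For the simplest case, with continuous controls only and no worker or firm effects, the partialling-out step can be sketched with ordinary least squares in NumPy. The function name partial_out and its arguments are hypothetical; categorical controls and the i and j columns would need dummy or fixed-effect handling that this sketch leaves out.

```python
import numpy as np

def partial_out(outcome, controls_keep, controls_partial):
    """Return outcome - controls_partial @ beta_2 from a joint OLS fit.

    Both control arguments are 2-D arrays with one row per observation.
    """
    X = np.column_stack([controls_keep, controls_partial])
    beta, *_ = np.linalg.lstsq(X, outcome, rcond=None)
    beta_2 = beta[controls_keep.shape[1]:]  # coefficients on the partialled controls
    return outcome - controls_partial @ beta_2

rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)  # control that stays in the outcome
x2 = rng.normal(size=n)  # control to partial out
keep = np.column_stack([np.ones(n), x1])
partial = x2.reshape(-1, 1)

# Noise-free outcome, so the fit recovers the coefficients up to rounding.
y = 1.0 + 2.0 * x1 + 3.0 * x2
y_partialled = partial_out(y, keep, partial)
```

Both groups of controls enter the same regression, so beta_2 is estimated net of the controls that are kept.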
Add an option to force all observations to be in a particular set of periods.
If a worker doesn't have observations for those periods, either fill them in or drop the worker.
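Both options could look roughly like the following pandas sketch. The function name force_periods and the how argument are hypothetical, and the sketch assumes at most one observation per worker-period. With how="fill", the added rows hold NaN, and the choice of fill values is left to the caller.

```python
import pandas as pd

def force_periods(df, periods, how="drop"):
    """Restrict data to `periods`; drop or pad workers who miss some of them."""
    periods = list(periods)
    out = df[df["t"].isin(periods)]
    if how == "drop":
        # Keep only workers observed in every requested period.
        n_seen = out.groupby("i")["t"].transform("nunique")
        return out[n_seen == len(periods)].reset_index(drop=True)
    if how == "fill":
        # Add one NaN row for each missing worker-period pair.
        full = pd.MultiIndex.from_product(
            [out["i"].unique(), periods], names=["i", "t"]
        )
        return out.set_index(["i", "t"]).reindex(full).reset_index()
    raise ValueError(f"how must be 'drop' or 'fill', got {how!r}")

df = pd.DataFrame({
    "i": [0, 0, 1],
    "t": [1, 2, 1],
    "j": [5, 6, 7],
    "y": [1.0, 2.0, 3.0],
})

dropped = force_periods(df, periods=[1, 2], how="drop")
filled = force_periods(df, periods=[1, 2], how="fill")
```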