
puf-state-distribution's Introduction

PUF-State-Distribution

The goal of this project is to create a public-use file (PUF) with income-tax microdata that represents not just the U.S. as a whole, but each of the 50 states. The initial goal is to construct a file that is consistent with known or estimated values for each state, for many variables, for a recent historical year.

We anticipate starting with a national PUF for a recent year that has NO state codes on it, and using it as the basis for each state. (My/our understanding of the latest PUFs is that they do not have any state codes on them.)

Thus, if the national file has 150k records, each state might initially have all 150k records, but with different, adjusted weights. Our goal is to develop a good and acceptable method for adjusting weights to hit the targets for a given state. If some of the weights end up being zero for a given state, those records could be dropped.
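As a rough illustration of that structure, here is a minimal sketch (Python/pandas) of giving every state a full copy of the national file with its own weight column. The column names (WT, STATE, STATE_WT) and the helper function are hypothetical placeholders; the actual weight-adjustment method is the open question.

```python
import pandas as pd

def expand_to_states(national_puf: pd.DataFrame, states: list) -> pd.DataFrame:
    """Stack one full copy of the national file per state; each copy starts
    with the national weights, to be adjusted later to hit state targets."""
    pieces = []
    for st in states:
        piece = national_puf.copy()
        piece["STATE"] = st
        piece["STATE_WT"] = piece["WT"]  # placeholder: adjust to hit state targets
        pieces.append(piece)
    stacked = pd.concat(pieces, ignore_index=True)
    # Records whose adjusted weight ends up at zero could simply be dropped.
    return stacked[stacked["STATE_WT"] > 0]
```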

See PSLmodels/taxdata#138 for an initial discussion.

We are open to exploring other approaches, too.


puf-state-distribution's Issues

Fleshing out a maximum-entropy constrained NLP approach fully enough to allow implementation

Summary of previous relevant discussion at #2.

In #2 we discussed assigning PUF records to individual states, either based on the probability that a record in the PUF actually is from a particular state, or based on a measure of the record's (Euclidean) distance from summary characteristics of records from individual states, which is really a similar concept.

Probabilities might be estimated from other, similar microdata that have state codes, using a multinomial logit approach. At the moment, we probably don't have data that are sufficient for this.
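Since we don't currently have suitable training data, the following is an illustrative sketch only, showing the shape of such a multinomial-logit step. The "donor" file with state codes, the feature names, and the synthetic data are all hypothetical placeholders.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
features = ["wages", "interest", "salt_deduction"]  # hypothetical variables

# Placeholder donor microdata with state codes, standing in for a real file.
donor = pd.DataFrame(rng.normal(size=(5000, 3)), columns=features)
donor["state"] = rng.integers(0, 50, size=5000)

# Placeholder PUF records with the same variables but no state codes.
puf = pd.DataFrame(rng.normal(size=(1000, 3)), columns=features)

# With a multi-class target and the default lbfgs solver, this fits a
# multinomial logit.
model = LogisticRegression(max_iter=1000).fit(donor[features], donor["state"])
state_probs = model.predict_proba(puf[features])  # shape: (n_records, n_states)
```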

Distances might be estimated by comparison to summary state-level data such as those at https://www.irs.gov/statistics/soi-tax-stats-historic-table-2, which loses the richness of microdata but has the advantage of feasibility (we have the necessary data).
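A hedged sketch of the distance idea: measure how far each (standardized) PUF record is from each state's summary profile built from something like SOI Historic Table 2. The data below are synthetic stand-ins and the variable names are hypothetical.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
features = ["agi", "wages", "salt_deduction"]  # hypothetical variables

puf = pd.DataFrame(rng.normal(size=(1000, 3)), columns=features)        # national PUF stand-in
state_means = pd.DataFrame(rng.normal(size=(50, 3)), columns=features)  # SOI Table 2 stand-in

# Standardize both sides on the same scale so no single variable dominates.
mu, sd = puf[features].mean(), puf[features].std()
z_puf = ((puf[features] - mu) / sd).to_numpy()
z_states = ((state_means[features] - mu) / sd).to_numpy()

# Euclidean distance from every record to every state's summary profile.
dist = np.linalg.norm(z_puf[:, None, :] - z_states[None, :, :], axis=2)  # (n_records, n_states)
```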

We discussed two further, related, issues:

  1. If we only assign each record to its highest-probability or closest-distance state, then each state will look like its average taxpayer and we will lose important variation that exists in the real world, because we do not include low-probability records. This is undesirable. We discussed two ways to avoid this:
    (a) Assign records to states randomly in a manner that makes it likely that records are assigned to high-probability states, but allows them to be assigned to low-probability states (this is the Stata code that Dan provided; the mechanism was assigning states to records rather than records to states, but it is the same thing). This assignment can be repeated multiple times. Or,
    (b) Distribute portions of records to states based upon probabilities (or distances), so that each record can be assigned to multiple states, with higher portions likely to go to the high-probability (or low-distance) states. This allows portions of low-probability records to be distributed to states, so that we get variation. (A sketch of this portion-distribution step follows the list below.)

  2. Let's assume we have addressed the first point, either by multiple assignments of records to states, or by distribution of portions of records to states. We now have a file that in some sense is representative of the 50 states. It would have more than the initial number of, let's say, 150k records. If we used the assignment approach 10 times, it would have 1.5 million records. If we used the distribution approach and included all 50 states in each record's distribution, it would have 7.5 million records. The records generally would be consistent with characteristics of states, with variation. But there is no reason to believe this file would hit the targets we have for the 50 states, from the SOI summary data, although it should have moved in that direction.
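Here is a hedged sketch of the portion-distribution step referenced in 1(b) above: each record's national weight is split across states in proportion to its state probabilities (distances could first be converted to probabilities, e.g. via a softmax over negative distances). The probabilities and weights below are synthetic placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)
n_records, n_states = 1000, 50
state_probs = rng.dirichlet(np.ones(n_states), size=n_records)  # placeholder probabilities
national_wt = rng.uniform(100, 500, size=n_records)             # placeholder national weights

# Split each record's national weight across all 50 states in proportion to
# its probabilities; the split weights sum back to the national weight.
portion_weights = national_wt[:, None] * state_probs            # shape: (n_records, n_states)
assert np.allclose(portion_weights.sum(axis=1), national_wt)
```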

For people who want a file that hits known/estimated totals (me), this is a problem to be solved. We talked about adjusting record weights from this point using a constrained NLP approach to ensure that targets are hit. Dan proposed a maximum-entropy objective function.
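For concreteness, here is a minimal sketch of one way such a reweighting problem could be set up for a single state, using a relative-entropy objective (one common maximum-entropy formulation, not necessarily the exact objective Dan has in mind) and scipy's SLSQP solver. The constraint matrix, targets, and problem sizes are synthetic placeholders; a real setup would build the constraints from PUF variables and the targets from SOI Historic Table 2.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
n_records, n_targets = 200, 3
w0 = rng.uniform(50, 150, size=n_records)            # starting weights for this state
A = rng.uniform(0, 1, size=(n_targets, n_records))   # placeholder target coefficients
b = A @ (w0 * rng.uniform(0.9, 1.1, size=n_records)) # placeholder, attainable targets

def objective(w):
    # Relative entropy of w against w0 (>= 0, equal to 0 only when w == w0),
    # so the solver changes the weights as little as possible in this sense.
    return np.sum(w * np.log(w / w0) - w + w0)

def gradient(w):
    return np.log(w / w0)

result = minimize(
    objective, w0, jac=gradient, method="SLSQP",
    bounds=[(1e-9, None)] * n_records,
    constraints=[{"type": "eq", "fun": lambda w: A @ w - b}],  # hit the targets exactly
)
new_weights = result.x
```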

Is this an accurate summary? I'll propose some next steps but would love to see feedback first.

Assigning a state to a record, randomly, based on probability of the record being from particular states (versus distributing portions of records to states)

I am moving Dan's comments on this point here. Here is his initial comment (PSLmodels/taxdata#138 (comment)).

It is possible to assign a single state to each record in an unbiased manner. The way I have done this is to calculate a probability of a record being in each of the 50 states, and assign it to one of those states in proportion to those probabilities. That is, if a record has high state income tax, the procedure will show high probabilities for New York, California, etc., and low probabilities (but not zero) for Florida and Texas. Then the computer will select New York or California with high probability and Florida or Texas with low probability. In expectation the resulting totals will be the same as the "long" format but with some unbiased error. I have done this and find that state-level aggregates match nearly as well as summing over all possible states. If desired, one could take 2 draws, or any other number. It would not be necessary to multiply the workload by 51.
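Dan's code was in Stata; for reference, here is a hedged Python sketch of the same single-draw idea, with synthetic placeholder probabilities standing in for estimated ones.

```python
import numpy as np

rng = np.random.default_rng(0)
n_records, n_states = 1000, 50
state_probs = rng.dirichlet(np.ones(n_states), size=n_records)  # placeholder probabilities

# One draw per record, in proportion to its probabilities: a record with high
# state income tax would mostly land in high-tax states, but it can still
# occasionally be assigned to a low-probability state.
assigned_state = np.array([rng.choice(n_states, p=p) for p in state_probs])

# "One could take 2 draws, or any other number" -- just repeat the step.
second_draw = np.array([rng.choice(n_states, p=p) for p in state_probs])
```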

State-level Gini coefficients

As you go about this important work, keep in mind the kinds of requests we have received asking for Tax-Calculator to generate state-level information. This request was for the ability to generate Gini coefficients by state.
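To make the request concrete, here is a hedged sketch of a weight-aware Gini calculation that could be applied per state once the file carries state identifiers and state weights; the column names in the commented usage example are hypothetical.

```python
import numpy as np

def weighted_gini(income: np.ndarray, weight: np.ndarray) -> float:
    """Gini coefficient of `income` under sample weights `weight`,
    via the trapezoidal approximation to the weighted Lorenz curve."""
    order = np.argsort(income)
    x, w = income[order], weight[order]
    cum_income_share = np.cumsum(x * w) / np.sum(x * w)   # Lorenz curve ordinates
    prev_share = np.concatenate(([0.0], cum_income_share[:-1]))
    pop_share = w / np.sum(w)                             # width of each Lorenz step
    return 1.0 - np.sum(pop_share * (cum_income_share + prev_share))

# Hypothetical usage on a file with STATE, income, and STATE_WT columns:
# gini_by_state = df.groupby("STATE").apply(
#     lambda g: weighted_gini(g["income"].to_numpy(), g["STATE_WT"].to_numpy())
# )
```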
