
puf-state-distribution's Introduction

PUF-State-Distribution

The goal of this project is to create a public-use file (PUF) with income-tax microdata that represents not just the U.S. as a whole, but each of the 50 states. The initial goal is to construct a file that is consistent with known or estimated values for each state, for many variables, for a recent historical year.

We anticipate starting with a national PUF for a recent year that has NO state codes on it, and using it as the basis for each state. (My/our understanding of the latest PUFs is that they do not have any state codes on them.)

Thus, if the national file has 150k records, each state might initially have all 150k records, but with different, adjusted weights. Our goal is to develop a good and acceptable method for adjusting weights to hit the targets for a given state. If some of the weights end up being zero for a given state, those records could be dropped.
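As a rough illustration of that structure, here is a minimal sketch (Python/pandas) of giving every state a full copy of the national file with its own weight column. The column names (WT, STATE, STATE_WT) and the helper function are hypothetical placeholders; the actual weight-adjustment method is the open question.

```python
import pandas as pd

def expand_to_states(national_puf: pd.DataFrame, states: list) -> pd.DataFrame:
    """Stack one full copy of the national file per state; each copy starts
    with the national weights, to be adjusted later to hit state targets."""
    pieces = []
    for st in states:
        piece = national_puf.copy()
        piece["STATE"] = st
        piece["STATE_WT"] = piece["WT"]  # placeholder: adjust to hit state targets
        pieces.append(piece)
    stacked = pd.concat(pieces, ignore_index=True)
    # Records whose adjusted weight ends up at zero could simply be dropped.
    return stacked[stacked["STATE_WT"] > 0]
```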

See PSLmodels/taxdata#138 for an initial discussion.

We are open to exploring other approaches, too.


puf-state-distribution's Issues

Fleshing out a maximum-entropy constrained NLP approach fully enough to allow implementation

Summary of previous relevant discussion at #2.

In #2 we discussed assigning PUF records to individual states, either based on the probability that a record in the PUF actually is from a particular state, or based on a measure of the record's (Euclidean) distance from summary characteristics of records from individual states, which is really a similar concept.

Probabilities might be estimated from other, similar microdata that have state codes, using a multinomial logit approach. At the moment, we probably don't have data that are sufficient for this.
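Since we don't currently have suitable training data, the following is an illustrative sketch only, showing the shape of such a multinomial-logit step. The "donor" file with state codes, the feature names, and the synthetic data are all hypothetical placeholders.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
features = ["wages", "interest", "salt_deduction"]  # hypothetical variables

# Placeholder donor microdata with state codes, standing in for a real file.
donor = pd.DataFrame(rng.normal(size=(5000, 3)), columns=features)
donor["state"] = rng.integers(0, 50, size=5000)

# Placeholder PUF records with the same variables but no state codes.
puf = pd.DataFrame(rng.normal(size=(1000, 3)), columns=features)

# With a multi-class target and the default lbfgs solver, this fits a
# multinomial logit.
model = LogisticRegression(max_iter=1000).fit(donor[features], donor["state"])
state_probs = model.predict_proba(puf[features])  # shape: (n_records, n_states)
```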

Distances might be estimated by comparison to summary state-level data such as those at https://www.irs.gov/statistics/soi-tax-stats-historic-table-2, which loses the richness of microdata but has the advantage of feasibility (we have the necessary data).
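A hedged sketch of the distance idea: measure how far each (standardized) PUF record is from each state's summary profile built from something like SOI Historic Table 2. The data below are synthetic stand-ins and the variable names are hypothetical.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
features = ["agi", "wages", "salt_deduction"]  # hypothetical variables

puf = pd.DataFrame(rng.normal(size=(1000, 3)), columns=features)        # national PUF stand-in
state_means = pd.DataFrame(rng.normal(size=(50, 3)), columns=features)  # SOI Table 2 stand-in

# Standardize both sides on the same scale so no single variable dominates.
mu, sd = puf[features].mean(), puf[features].std()
z_puf = ((puf[features] - mu) / sd).to_numpy()
z_states = ((state_means[features] - mu) / sd).to_numpy()

# Euclidean distance from every record to every state's summary profile.
dist = np.linalg.norm(z_puf[:, None, :] - z_states[None, :, :], axis=2)  # (n_records, n_states)
```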

We discussed two further, related, issues:

  1. If we only assign each record to its highest-probability or closest-distance state, then each state will look like its average taxpayer and we will lose important variation that exists in the real world, because we do not include low-probability records. This is undesirable. We discussed two ways to avoid this:
    (a) Assign records to states randomly in a manner that makes it likely that records are assigned to high-probability states, but allows them to be assigned to low-probability states (this is the Stata code that Dan provided; the mechanism was assigning states to records rather than records to states, but it is the same thing). This assignment can be repeated multiple times. Or,
    (b) Distribute portions of records to states based upon probabilities (or distances), so that each record can be assigned to multiple states, with higher portions likely to go to the high-probability (or low-distance) states. This allows portions of low-probability records to be distributed to states, so that we get variation. (A sketch of this portion-distribution step follows the list below.)

  2. Let's assume we have addressed the first point, either by multiple assignments of records to states, or by distribution of portions of records to states. We now have a file that in some sense is representative of the 50 states. It would have more than the initial number of, let's say, 150k records. If we used the assignment approach 10 times, it would have 1.5 million records. If we used the distribution approach and included all 50 states in each record's distribution, it would have 7.5 million records. The records generally would be consistent with characteristics of states, with variation. But there is no reason to believe this file would hit the targets we have for the 50 states, from the SOI summary data, although it should have moved in that direction.
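Here is a hedged sketch of the portion-distribution step referenced in 1(b) above: each record's national weight is split across states in proportion to its state probabilities (distances could first be converted to probabilities, e.g. via a softmax over negative distances). The probabilities and weights below are synthetic placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)
n_records, n_states = 1000, 50
state_probs = rng.dirichlet(np.ones(n_states), size=n_records)  # placeholder probabilities
national_wt = rng.uniform(100, 500, size=n_records)             # placeholder national weights

# Split each record's national weight across all 50 states in proportion to
# its probabilities; the split weights sum back to the national weight.
portion_weights = national_wt[:, None] * state_probs            # shape: (n_records, n_states)
assert np.allclose(portion_weights.sum(axis=1), national_wt)
```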

For people who want a file that hits known/estimated totals (me), this is a problem to be solved. We talked about adjusting record weights from this point using a constrained NLP approach to ensure that targets are hit. Dan proposed a maximum-entropy objective function.
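For concreteness, here is a minimal sketch of one way such a reweighting problem could be set up for a single state, using a relative-entropy objective (one common maximum-entropy formulation, not necessarily the exact objective Dan has in mind) and scipy's SLSQP solver. The constraint matrix, targets, and problem sizes are synthetic placeholders; a real setup would build the constraints from PUF variables and the targets from SOI Historic Table 2.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
n_records, n_targets = 200, 3
w0 = rng.uniform(50, 150, size=n_records)            # starting weights for this state
A = rng.uniform(0, 1, size=(n_targets, n_records))   # placeholder target coefficients
b = A @ (w0 * rng.uniform(0.9, 1.1, size=n_records)) # placeholder, attainable targets

def objective(w):
    # Relative entropy of w against w0 (>= 0, equal to 0 only when w == w0),
    # so the solver changes the weights as little as possible in this sense.
    return np.sum(w * np.log(w / w0) - w + w0)

def gradient(w):
    return np.log(w / w0)

result = minimize(
    objective, w0, jac=gradient, method="SLSQP",
    bounds=[(1e-9, None)] * n_records,
    constraints=[{"type": "eq", "fun": lambda w: A @ w - b}],  # hit the targets exactly
)
new_weights = result.x
```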

Is this an accurate summary? I'll propose some next steps but would love to see feedback first.

Assigning a state to a record, randomly, based on probability of the record being from particular states (versus distributing portions of records to states)

I am moving Dan's comments on this point here. Here is his initial comment (PSLmodels/taxdata#138 (comment)).

It is possible to assign a single state to each record in an unbiased manner. The way I have done this is to calculate a probability of a record being in each of the 50 states, and assign it to one of those states in proportion to those probabilities. That is, if a record has high state income tax, the procedure will show high probabilities for New York, California, etc., and low probabilities (but not zero) for Florida and Texas. Then the computer will select New York or California with high probability and Florida or Texas with low probability. In expectation the resulting totals will be the same as the "long" format but with some unbiased error. I have done this and find that state-level aggregates match nearly as well as summing over all possible states. If desired, one could take 2 draws, or any other number. It would not be necessary to multiply the workload by 51.
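Dan's code was in Stata; for reference, here is a hedged Python sketch of the same single-draw idea, with synthetic placeholder probabilities standing in for estimated ones.

```python
import numpy as np

rng = np.random.default_rng(0)
n_records, n_states = 1000, 50
state_probs = rng.dirichlet(np.ones(n_states), size=n_records)  # placeholder probabilities

# One draw per record, in proportion to its probabilities: a record with high
# state income tax would mostly land in high-tax states, but it can still
# occasionally be assigned to a low-probability state.
assigned_state = np.array([rng.choice(n_states, p=p) for p in state_probs])

# "One could take 2 draws, or any other number" -- just repeat the step.
second_draw = np.array([rng.choice(n_states, p=p) for p in state_probs])
```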

State-level Gini coefficients

As you go about this important work, keep in mind the kinds of requests we have received asking for Tax-Calculator to generate state-level information. This request was for the ability to generate Gini coefficients by state.
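To make the request concrete, here is a hedged sketch of a weight-aware Gini calculation that could be applied per state once the file carries state identifiers and state weights; the column names in the commented usage example are hypothetical.

```python
import numpy as np

def weighted_gini(income: np.ndarray, weight: np.ndarray) -> float:
    """Gini coefficient of `income` under sample weights `weight`,
    via the trapezoidal approximation to the weighted Lorenz curve."""
    order = np.argsort(income)
    x, w = income[order], weight[order]
    cum_income_share = np.cumsum(x * w) / np.sum(x * w)   # Lorenz curve ordinates
    prev_share = np.concatenate(([0.0], cum_income_share[:-1]))
    pop_share = w / np.sum(w)                             # width of each Lorenz step
    return 1.0 - np.sum(pop_share * (cum_income_share + prev_share))

# Hypothetical usage on a file with STATE, income, and STATE_WT columns:
# gini_by_state = df.groupby("STATE").apply(
#     lambda g: weighted_gini(g["income"].to_numpy(), g["STATE_WT"].to_numpy())
# )
```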
