datadesk / census-data-aggregator
Combine U.S. census data responsibly
License: MIT License
Thanks to some clarification from our Census friends:
The jam value represents a result from a median calculation when the median can't actually be calculated because it lies in the lowest or highest bin. The jam value is not used in the median calculation itself as a lower or upper bound for the end bins.
This information doesn't impact the calculations of the examples we have now (we've treated the jam value as a bound), but we need to update the median function to handle the scenario where the lower and upper bins don't have concrete bounds (plus add examples of this scenario).
We may want to include an optional input jam_value to use in the case that the median occurs in the highest or lowest bin.
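A sketch of how a hypothetical jam_value parameter could work. This is a simplified illustration, not the library's actual implementation: the margin-of-error calculation is omitted, and the parameter name is the one proposed above.

```python
def approximate_median(range_list, jam_value=None):
    """Estimate a median from binned data (simplified sketch, MOE omitted).

    jam_value (hypothetical parameter) is returned directly when the
    median falls in the first or last bin, mirroring how the Census
    Bureau reports a jam value instead of computing a median there.
    """
    ranges = sorted(range_list, key=lambda d: d["min"])
    total = sum(d["n"] for d in ranges)
    # Walk the bins until we reach the one containing the 50th percentile
    running = 0
    for i, bucket in enumerate(ranges):
        running += bucket["n"]
        if running >= total / 2.0:
            break
    if i in (0, len(ranges) - 1):
        # Median lies in an open-ended end bin with no concrete bounds,
        # so fall back to the caller-supplied jam value (if any).
        return jam_value
    # Linear interpolation within the median bin
    below = running - bucket["n"]
    width = bucket["max"] - bucket["min"]
    return bucket["min"] + (total / 2.0 - below) / bucket["n"] * width
```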
Ensure that the output of census-data-downloader plays nice as an input to census-data-aggregator.
Accept a list of values and margins and, using the approximation methods in this library, return the combined value with its estimated margin of error.
We may want an optional moe input field for approximate_median to handle the case when the n values are estimates themselves (e.g. outputs of approximate_sum). approximate_median would then need a simulation aspect to account for the uncertainty in the n values.
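One way the simulation aspect could look, as a hedged sketch (the function and field names here are hypothetical, not the library's API): draw simulated counts for each bin from its estimate and MOE, recompute the interpolated median each time, and summarize the spread.

```python
import numpy as np

def _interpolated_median(ranges):
    """Linear interpolation of the median within binned data."""
    total = sum(b["n"] for b in ranges)
    running = 0.0
    for b in ranges:
        if running + b["n"] >= total / 2.0:
            # The median falls in this bin
            return b["min"] + (total / 2.0 - running) / b["n"] * (b["max"] - b["min"])
        running += b["n"]

def simulate_median(range_list, sims=1000, seed=42):
    """Hypothetical sketch of the proposed moe input: each bin carries
    an estimated count "n" plus its 90%-level margin of error "moe".
    Draw simulated counts, recompute the interpolated median each
    time, and report the median and spread of the simulations."""
    rng = np.random.default_rng(seed)
    medians = []
    for _ in range(sims):
        simulated = [
            {
                "min": b["min"],
                "max": b["max"],
                # MOE / 1.645 converts a 90% MOE to a standard error;
                # clamp at zero so simulated counts stay non-negative.
                "n": max(0.0, rng.normal(b["n"], b["moe"] / 1.645)),
            }
            for b in range_list
        ]
        medians.append(_interpolated_median(simulated))
    return float(np.median(medians)), float(np.std(medians))
```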
Testing the command with the examples listed yields a different result. I'm guessing the denominator was supposed to be 630,498, per the ACS document linked?
For small values or large margins of error, the numpy.random.normal call in approximate_mean may return a negative number, which doesn't make sense in context. We should probably just use max(0, simulated_value) instead.
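A quick demonstration of the problem and the proposed clamp (the variable names here are illustrative, not the library's internals):

```python
import numpy as np

rng = np.random.default_rng(0)
# A small estimate with a wide standard error: normal draws can dip
# below zero, which is meaningless for a count or dollar amount.
estimate, standard_error = 5.0, 10.0
raw = rng.normal(estimate, standard_error, size=10_000)
# The proposed fix: clamp each simulated value at zero
clamped = np.maximum(0.0, raw)
```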
From this paper (page 3):
"There is no CV level that is universally accepted as “too high,” but a comprehensive report on the ACS [5] describes a range of 0.10 to 0.12 as a “reasonable standard of precision for an estimate” (p. 64)"
CV = (MOE / 1.645) / estimate
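In code, using a 90%-level ACS margin of error (the 1.645 divisor converts the MOE back to a standard error):

```python
def coefficient_of_variation(estimate, moe):
    """CV from a 90%-level margin of error: (MOE / 1.645) / estimate."""
    return (moe / 1.645) / estimate

# Per the quoted report, a CV above roughly 0.10-0.12 suggests the
# estimate may be too imprecise to publish on its own.
cv = coefficient_of_variation(10_000, 1_645)
```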
Some folks at @argo-marketplace were working on a fork of census_area to aggregate census data to arbitrary geographies: datamade/census_area#6
If using the aggregator outside of the downloader, the aggregator needs to know what to do with annotated values.
For the calculation of SE(50 percent) there are different numerators used depending on the particular reference.
page 2 step A uses 99.
page 22 in Example 3 uses 95.
There are also different denominators used (B and 5B respectively).
We currently use 99 and B.
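Assuming the design-factor form SE(50 percent) = DF × sqrt(numerator / denominator × 50²) — our reading of the referenced documents, with parameter names of our own — a small sketch shows how much the two constant sets diverge:

```python
import math

def se_50_percent(base, design_factor=1.0, numerator=99, denominator_multiple=1):
    """SE of a 50 percent proportion, assuming the design-factor form
    DF * sqrt((numerator / (denominator_multiple * B)) * 50**2).

    numerator=99, denominator_multiple=1 -> the 99/B variant (current);
    numerator=95, denominator_multiple=5 -> the 95/5B variant.
    """
    return design_factor * math.sqrt((numerator / (denominator_multiple * base)) * 50**2)
```

For B = 1000 with a design factor of 1, the 99/B variant gives about 15.7 while the 95/5B variant gives about 6.9, so the choice of constants matters materially.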
Just to make sure I understand this correctly, to calculate median household income for an aggregate geography using the ACS, as shown in this example, would I use data from a table like ACS table B19001 to get the n (household counts), and min/max incomes for the ranges?
It looks like the wording of the top range of that table is "$200,000 or more". Should I just set an artificial upper bound for that? It looks like in the example and the linked PDF, they use $250,001.
Off the top of my head, this seems like it would be correct for many (most?) cases, but incorrect for very high income areas?
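For illustration, the binned input built from a table like B19001 might look like this, with the open-ended top bracket closed at the artificial $250,001 bound mentioned above (the counts here are invented):

```python
# Hypothetical excerpt of household-count bins from ACS table B19001.
# The table's top bracket reads "$200,000 or more", so it has no true
# upper bound; following the linked ACS example, we close it at $250,001.
income_ranges = [
    {"min": 0, "max": 9_999, "n": 100},
    # ... intermediate brackets omitted ...
    {"min": 200_000, "max": 250_001, "n": 25},  # artificially closed top bin
]
```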
To estimate the margin of error when summing values.
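The standard ACS approximation for the margin of error of a sum is the square root of the sum of the squared input MOEs; a minimal sketch (the function name is ours, not the library's):

```python
import math

def sum_with_moe(pairs):
    """Combine (estimate, moe) pairs: the summed estimate's MOE is the
    root-sum-of-squares of the input MOEs (the standard ACS
    approximation for sums of independent estimates)."""
    total = sum(est for est, _ in pairs)
    moe = math.sqrt(sum(m ** 2 for _, m in pairs))
    return total, moe
```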
From this paper:
"one can induce geographic patterns in the aggregate data that do not
exist in the input data"
Create a diagnostic to check for this (equations 2 and 3 in paper):
"The statistic S_j measures whether the region-level estimates for a given variable are within the margins of error of their constituent tracts. If a region-level estimate is within the margin of error of all its constituent tracts, then there is no information lost through aggregation; information loss increases as the 90 percent confidence intervals of more and more tract-level estimates do not overlap with the region’s estimate."
Functions for breaking geographic units into different geographic units and recalculating quantities of interest [with and without margin of error].
To calculate more exact margins of error for aggregated values where possible.
There is an example in the ACS handbook.
If another method is needed in those cases, develop it.