This project forked from mit-spatial-action/who-owns-mass-processing

This repository deduplicates property owners in Massachusetts using the MassGIS standardized assessors' parcel dataset and the Secretary of the Commonwealth's Corporate Database. The process extends that documented by Hangen and O'Brien (2022, in preprint).

License: MIT License

Deduplicate Owners

This repository deduplicates property owners in Massachusetts using the MassGIS standardized assessors' parcel dataset and the Secretary of the Commonwealth's Corporate Database. The process builds on Hangen and O'Brien's methods (2022, in preprint), which are themselves similar (though not identical) to methods used by Henry Gomory (2021) and the Anti-Eviction Mapping Project's Evictorbook (see, e.g., McElroy and Amir-Ghassemi 2021). In outline, the process runs as follows:

  1. Prepare data using a large number of string-standardizing functions, some of which are place-based. (In other words, when adapting to non-Massachusetts locations, you'll want to consider how to adapt our codebase to your locale.)
  2. Perform naive deduplication on assessors' tables using concatenated name and address.
  3. Perform cosine-similarity-based deduplication on assessors' tables using concatenated name and address.
  4. Join parcels to companies using simple string matching. When some owners in a cosine-similarity group (see step 3) match a company and others do not, the unmatched owners are assigned the company id of one of the group's successful matches.
  5. Identify agents of companies that are companies themselves (distinguishing between law firms and other companies) and agents of companies that are individuals.
  6. Deduplicate individuals (including individual agents) associated with companies that match parcel owners using both naive and cosine similarity methods.
  7. Identify communities within corporate-individual networks. (This is done using the igraph implementation of the fast greedy modularity optimization algorithm.)
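Steps 1 through 3 can be illustrated with a minimal base-R sketch. It is not the repository's code: the function names (`standardize`, `ngrams`, `cosine_matrix`, `cosine_groups`), the single abbreviation rule, the use of raw character-trigram counts, and the 0.8 threshold are all assumptions chosen for illustration.

```r
# Step 1 (sketch): standardize strings. These rules are illustrative only.
standardize <- function(x) {
  x <- toupper(x)
  x <- gsub("[[:punct:]]", " ", x, perl = TRUE)    # punctuation to spaces
  x <- gsub("\\bSTREET\\b", "ST", x, perl = TRUE)  # one example abbreviation rule
  x <- gsub("\\s+", " ", x, perl = TRUE)           # collapse runs of whitespace
  trimws(x)
}

# Split one string into overlapping character n-grams.
ngrams <- function(s, n = 3) {
  if (nchar(s) < n) return(s)
  substring(s, 1:(nchar(s) - n + 1), n:nchar(s))
}

# Pairwise cosine similarity between keys, using raw n-gram counts (assumption).
cosine_matrix <- function(keys, n = 3) {
  grams <- lapply(keys, ngrams, n = n)
  vocab <- unique(unlist(grams))
  counts <- do.call(rbind, lapply(grams, function(g) {
    as.numeric(table(factor(g, levels = vocab)))
  }))
  unit <- counts / sqrt(rowSums(counts^2))  # scale each row to unit length
  unit %*% t(unit)
}

# Step 3 (sketch): put keys in the same group when a chain of pairs with
# similarity >= threshold links them.
cosine_groups <- function(keys, threshold = 0.8) {
  sim <- cosine_matrix(keys)
  group <- seq_along(keys)
  for (i in seq_along(keys)) {
    for (j in seq_len(i - 1)) {
      if (sim[i, j] >= threshold) group[group == group[i]] <- group[j]
    }
  }
  match(group, unique(group))
}

# Made-up example records.
owners <- data.frame(
  name    = c("Acme Realty LLC", "ACME REALTY, LLC", "Acme Realti LLC", "Jane Doe"),
  address = c("1 Main Street", "1 Main St.", "1 Main St", "9 Elm Street"),
  stringsAsFactors = FALSE
)

# Step 2 (sketch): naive deduplication on the concatenated name and address.
owners$key <- paste(standardize(owners$name), standardize(owners$address))
owners$naive_id <- match(owners$key, unique(owners$key))

# Step 3 runs on the distinct keys, then maps group ids back to the rows.
owners$group <- cosine_groups(unique(owners$key))[owners$naive_id]
```

Here the first two records collapse in the naive pass because they standardize to the same key, while the misspelled "Realti" record only joins them in the cosine pass.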

Getting Started

This library's dependencies are managed using renv. To install necessary dependencies, simply install renv and run renv::restore(). If you are using Windows, you'll probably have to install the Rtools bundle appropriate for your version of R.

Setting up .Renviron

Eviction filings are pulled down from a PostGIS database. As written, we expect PostgreSQL connection parameters to appear in an .Renviron file with the following environment variables defined:

DB_HOST="<host_location>"
DB_USER="<user_name>"
DB_PASS="<password>"
DB_PORT="<port>"
DB_NAME="<name_of_eviction_db>"
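R reads .Renviron at startup, so these values are available through `Sys.getenv()` in a fresh session. The following sketch shows one way to collect them and fail early if any are absent. The helper name `get_db_params` is hypothetical, and the database driver the repository uses is not stated here, so the connection call itself is left out.

```r
# Hypothetical helper: collect the five connection parameters named above
# and stop with a clear message if any are unset or empty.
get_db_params <- function(vars = c("DB_HOST", "DB_USER", "DB_PASS",
                                   "DB_PORT", "DB_NAME")) {
  values <- Sys.getenv(vars, unset = NA, names = TRUE)
  absent <- vars[is.na(values) | values == ""]
  if (length(absent) > 0) {
    stop("Missing environment variables: ", paste(absent, collapse = ", "))
  }
  as.list(values)  # named list, e.g. params$DB_HOST
}
```

After editing .Renviron, restart R (or call `readRenviron(".Renviron")`) before calling the helper, since the file is only read at startup.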

Running the Script

We provide an omnibus run() function in run.R. It takes two parameters:

  1. subset: If the value is "test" (default), processes only Somerville. If "hns", processes only HNS (Healthy Neighborhoods Study) municipalities. If "all", processes the entire state. Any other value stops with an error.
  2. return_results: If TRUE (default), return results in a named list. If FALSE, return nothing. In either case, results are output to delimited text and *.RData files.

In other words...

# Runs on Somerville.
run(subset = "test")
# Runs on Healthy Neighborhoods municipalities.
run(subset = "hns")
# Runs on entire state.
run(subset = "all")

If run.R is executed from a non-interactive environment (i.e., a terminal), it will run on the entire state. (In other words: don't do this unless you want to wait 8 hours for results.)
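The argument handling and the non-interactive behavior described above might look roughly like the following. This is a sketch under assumptions, not the code in run.R: `check_subset` and `subset_for_session` are hypothetical names, and the real script may structure this differently.

```r
# Hypothetical validation of the `subset` argument: accept the three
# documented values and stop with an error on anything else.
check_subset <- function(subset = "test") {
  allowed <- c("test", "hns", "all")
  if (!is.character(subset) || length(subset) != 1 || !(subset %in% allowed)) {
    stop("`subset` must be one of: ", paste(allowed, collapse = ", "))
  }
  subset
}

# Hypothetical entry-point logic: base R's interactive() is FALSE under
# Rscript, so a non-interactive session gets the statewide run, while an
# interactive one keeps the "test" default.
subset_for_session <- function(is_interactive = interactive()) {
  if (is_interactive) "test" else "all"
}
```

Checking `interactive()` at the bottom of a script is a common way to make it both sourceable from a console and runnable with `Rscript`.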

This function automatically saves its results to...

  • a simplified table of owners (by default, owners.csv, set using the OWNERS_OUT_NAME global variable at the top of run.R),
  • a table of matched companies (by default, corps.csv, set using the CORPS_OUT_NAME global variable at the top of run.R),
  • a table of individuals (by default, inds.csv, set using the INDS_OUT_NAME global variable at the top of run.R),
  • a table of assessors' records, supplemented by an owner-occupancy flag (by default, assess.csv, set using the ASSESS_OUT_NAME global variable at the top of run.R),
  • a simplified igraph community object (by default, community.csv, set using the COMMUNITY_OUT_NAME global variable at the top of run.R).
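To pick the saved tables back up in a later session, something like the sketch below may be enough. It assumes the default file names listed above. The helper name `read_outputs`, the output directory, and the comma delimiter are assumptions, so adjust `dir` and `sep` to match what run.R actually writes.

```r
# Hypothetical helper: read whichever of the default output files exist
# in `dir` into a named list of data frames.
read_outputs <- function(dir = ".", sep = ",") {
  files <- c(owners = "owners.csv", corps = "corps.csv", inds = "inds.csv",
             assess = "assess.csv", community = "community.csv")
  paths <- file.path(dir, files)
  found <- file.exists(paths)
  out <- lapply(paths[found], function(p) {
    read.delim(p, sep = sep, stringsAsFactors = FALSE)
  })
  names(out) <- names(files)[found]
  out
}
```

Files that are not present are skipped, so the same call works after a partial run.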

Data

The two databases necessary for this analysis are...

Acknowledgements

This work received grant support from the Conservation Law Foundation and was developed under the auspices of the Healthy Neighborhoods Study in the Department of Urban Studies and Planning at MIT.

References

Contributors

anastasia, ericrobskyhuntley
