
isabella232 / address_normalizer


This project is forked from openvenues/address_normalizer


DEPRECATED - use libpostal/pypostal instead

License: MIT License

Makefile 0.01% Python 1.00% C 98.99%


address_normalizer

A fast, international postal address normalizer and deduper.

What it does

The use case for this system is a well-known one: given several real-world postal addresses entered by humans as natural language text, find (and destroy) all duplicates.

Like many problems in information extraction and NLP, this sounds trivial at first but turns out to be quite complicated on real natural-language text.

As a motivating example, consider the following two equivalent ways to write a particular Manhattan street address with varying conventions and degrees of verbosity.

  • 30 W 26th St Fl #7
  • 30 West Twenty-sixth Street Floor Number 7

Obviously '30 W 26th St Fl #7' != '30 West Twenty-sixth Street Floor Number 7' in a string comparison sense, but a human can grok that these two addresses refer to the same physical location.
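To make the gap between string comparison and human judgment concrete, here is a minimal sketch of token-level normalization. It is not this library's implementation: the `CANONICAL` table and the `toy_normalize` function are hypothetical names invented for illustration, and the table covers only the tokens in the example above.

```python
import re

# Hypothetical lookup table for illustration only; a real normalizer
# needs far larger, language-specific dictionaries.
CANONICAL = {
    "w": "west",
    "st": "street",
    "fl": "floor",
    "#": "number",
    "26th": "twenty-sixth",
}

def toy_normalize(address):
    """Lowercase, tokenize, and map each token to a canonical spelling."""
    # '#' becomes its own token; hyphens are allowed inside words
    # such as "twenty-sixth".
    tokens = re.findall(r"#|[a-z0-9]+(?:-[a-z0-9]+)*", address.lower())
    return " ".join(CANONICAL.get(tok, tok) for tok in tokens)

a = "30 W 26th St Fl #7"
b = "30 West Twenty-sixth Street Floor Number 7"

print(a == b)                                # False
print(toy_normalize(a) == toy_normalize(b))  # True
```

A single lookup per token is enough here only because each abbreviation in the example has one reading; abbreviations with several readings need a different treatment.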

This library helps convert the messy addresses that humans write into clean, normalized forms suitable for machine comparison. It also includes a LevelDB/RocksDB-backed near-duplicate store that checks each new candidate address against an index of previously ingested addresses to see whether it is a near duplicate of any of them, while doing minimal comparisons (suitable for ingestion pipelines).
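The idea behind the near-duplicate store can be sketched with an ordinary dictionary standing in for LevelDB/RocksDB. Everything below is an assumption made for illustration: `NearDupeIndex`, `check_and_add`, and `toy_expand` are hypothetical names, not part of this library. The point is that each new address costs a few key lookups rather than a comparison against every stored address.

```python
class NearDupeIndex:
    """In-memory stand-in for a key-value store (hypothetical sketch)."""

    def __init__(self, expand):
        self.expand = expand      # callable: address -> set of normalized strings
        self.by_expansion = {}    # normalized expansion -> id of first address seen

    def check_and_add(self, addr_id, address):
        """Return the id of an earlier near duplicate, or index the address and return None."""
        expansions = self.expand(address)
        for exp in expansions:
            if exp in self.by_expansion:
                return self.by_expansion[exp]
        for exp in expansions:
            self.by_expansion[exp] = addr_id
        return None


def toy_expand(address):
    # Placeholder expander: lowercases and spells out two abbreviations.
    table = {"st": "street", "w": "west"}
    words = address.lower().split()
    return {" ".join(table.get(w, w) for w in words)}


index = NearDupeIndex(toy_expand)
first = index.check_and_add(1, "30 W 26th St")
second = index.check_and_add(2, "30 West 26th Street")
print(first, second)  # None 1
```

In an ingestion pipeline the dictionary would be replaced by a persistent store, with the normalized expansion as the key.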

Usage

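from address_normalizer import expand_street_address

# Expand each address into its candidate normalized forms
addr1_expansions = expand_street_address('30 West Twenty-sixth Street Floor Number 7')
addr2_expansions = expand_street_address('30 W 26th St Fl #7')

# Equivalent addresses share at least one expansion in common,
# so the intersection of the two results is non-empty
print(addr1_expansions & addr2_expansions)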

Testing (Python)

python test.py

Non-goals

  • verifying that a location is a valid address

References

For further reading and some less intuitive examples of addresses, see "Falsehoods Programmers Believe About Addresses".

TODOS

  • sequence model to parse addresses into components like house number, street name, etc. (needs a small amount of training data)
  • sequence model for predicting which expansion is the correct one. "Dr" can mean either "Doctor" or "Drive" but for the purposes of deduping we just save both expansions. (needs some training data)
  • parse postal addresses from texts such as web documents
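Until a model can pick the right reading, keeping every expansion of an ambiguous abbreviation (as described for "Dr" in the list above) can be sketched as a Cartesian product over per-token readings. This is an illustrative assumption about the approach, not the library's code; `AMBIGUOUS` and `all_expansions` are hypothetical names.

```python
from itertools import product

# Hypothetical table: each abbreviation maps to every reading we keep.
AMBIGUOUS = {
    "dr": ["doctor", "drive"],
    "st": ["saint", "street"],
}

def all_expansions(address):
    """Return every string formed by choosing one reading per token."""
    options = [AMBIGUOUS.get(tok, [tok]) for tok in address.lower().split()]
    return {" ".join(choice) for choice in product(*options)}

short_form = all_expansions("4 St James Dr")
long_form = all_expansions("4 Saint James Drive")

print(len(short_form))         # 4
print(short_form & long_form)  # {'4 saint james drive'}
```

The number of expansions grows multiplicatively with each ambiguous token, which is one reason a model that predicts the correct expansion would be useful.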

address_normalizer's People

Contributors

albarrentine
