
A C library for parsing/normalizing street addresses around the world. Powered by statistical NLP and open geo data.

License: MIT License


libpostal: international street address NLP


libpostal is a C library for parsing/normalizing street addresses around the world using statistical NLP and open data. The goal of this project is to understand location-based strings in every language, everywhere. For a more comprehensive overview of the research behind libpostal, be sure to check out the (lengthy) introductory blog posts:


Addresses and the locations they represent are essential for any application dealing with maps (place search, transportation, on-demand/delivery services, check-ins, reviews). Yet even the simplest addresses are packed with local conventions, abbreviations and context, making them difficult to index/query effectively with traditional full-text search engines. This library helps convert the free-form addresses that humans use into clean normalized forms suitable for machine comparison and full-text indexing. Though libpostal is not itself a full geocoder, it can be used as a preprocessing step to make any geocoding application smarter, simpler, and more consistent internationally.


The core library is written in pure C. Language bindings for Python, Ruby, Go, Java, PHP, and NodeJS are officially supported and it's easy to write bindings in other languages.

Sponsors

If your company is using libpostal, consider asking your organization to sponsor the project. Interpreting what humans mean when they refer to locations is far from a solved problem, and sponsorships help us pursue new frontiers in geospatial NLP. As a sponsor, your company logo will appear prominently on the Github repo page along with a link to your site. Sponsorship info

Backers

Individual users can also help support open geo NLP research by making a monthly donation:

Installation (Mac/Linux)

Before you install, make sure you have the following prerequisites:

On Ubuntu/Debian

sudo apt-get install curl autoconf automake libtool pkg-config

On CentOS/RHEL

sudo yum install curl autoconf automake libtool pkgconfig

On Mac OSX

brew install curl autoconf automake libtool pkg-config

Then to install the C library:

If you're using an M1 Mac, add --disable-sse2 to the ./configure command. This will result in poorer performance but the build will succeed.

git clone https://github.com/openvenues/libpostal
cd libpostal
./bootstrap.sh
./configure --datadir=[...some dir with a few GB of space...]
make -j4
sudo make install

# On Linux it's probably a good idea to run
sudo ldconfig

libpostal has support for pkg-config, so you can use pkg-config to print the flags needed to link your program against it:

pkg-config --cflags libpostal         # print compiler flags
pkg-config --libs libpostal           # print linker flags
pkg-config --cflags --libs libpostal  # print both

For example, if you write a program called app.c, you can compile it like this:

gcc app.c `pkg-config --cflags --libs libpostal`

Installation (Windows)

MSys2/MinGW

For Windows, the build procedure currently requires MSys2 and MinGW, which can be downloaded from http://msys2.org. Please follow the instructions on the MSys2 website for installation.

Please ensure Msys2 is up-to-date by running:

pacman -Syu

Install the following prerequisites:

pacman -S autoconf automake curl git make libtool gcc mingw-w64-x86_64-gcc

Then to build the C library:

git clone https://github.com/openvenues/libpostal
cd libpostal
cp -rf windows/* ./
./bootstrap.sh
./configure --datadir=[...some dir with a few GB of space...]
make -j4
make install

Notes: When setting the datadir, the C: drive is entered as /c. The libpostal build script automatically appends libpostal to the end of the path, so /c becomes C:\libpostal\ on Windows.

The compiled .dll will be in the src/.libs/ directory and should be called libpostal-1.dll.

If you require a .lib import library to link libpostal into your application, you can generate one using the Visual Studio lib.exe tool and the libpostal.def definition file:

lib.exe /def:libpostal.def /out:libpostal.lib /machine:x64

Installation with an alternative data model

An alternative data model, created by Senzing Inc., is available for libpostal. It improves parsing of US, UK, and Singapore addresses, as well as handling of US rural route addresses. To enable it, add MODEL=senzing to the configure line during installation:

./configure --datadir=[...some dir with a few GB of space...] MODEL=senzing

The data for this model comes from OpenAddresses, OpenStreetMap, and data generated by Senzing based on customer feedback (a few hundred records): about 1.2 billion records in total, covering over 230 countries and 100+ languages. Because the OpenStreetMap and OpenAddresses data is good but not perfect, the data set was cleaned by filtering out badly formed addresses, correcting misclassified address tokens, and removing tokens that didn't belong in the addresses.

Senzing created a test set of 12,950 addresses from 89 countries to verify the quality of its models. It was generated from random OSM addresses, with a minimum of 50 per country, supplemented with hard-to-parse addresses collected from the Senzing support team, customers, and the libpostal GitHub issues. On this test set, the Senzing model produced 4.3% better parsing results than the default model.

This model is about 2.2GB, compared to 1.8GB for the default model, so keep that in mind if storage space is a concern.

Further information about this data model can be found at https://github.com/Senzing/libpostal-data. If you run into any issues with this model, whether related to parses, installation, or anything else, please report them at https://github.com/Senzing/libpostal-data.

Examples of parsing

libpostal's international address parser uses machine learning (Conditional Random Fields) and is trained on over 1 billion addresses in every inhabited country on Earth. We use OpenStreetMap and OpenAddresses as sources of structured addresses, and the OpenCage address format templates at: https://github.com/OpenCageData/address-formatting to construct the training data, supplementing with containing polygons, and generating sub-building components like apartment/floor numbers and PO boxes. We also add abbreviations, drop out components at random, etc. to make the parser as robust as possible to messy real-world input.

These example parse results are taken from the interactive address_parser program that builds with libpostal when you run make. Note that the parser can handle commas vs. no commas as well as various casings and permutations of components (if the input is e.g. just city or just city/postcode).

[screenshot: interactive address_parser session]

The parser achieves very high accuracy on held-out data, currently 99.45% correct full parses (meaning a 1 in the numerator for getting every token in the address correct).

Usage (parser)

Here's an example of the parser API using the Python bindings:

from postal.parser import parse_address
parse_address('The Book Club 100-106 Leonard St Shoreditch London EC2A 4RH, United Kingdom')

And an example with the C API:

#include <stdio.h>
#include <stdlib.h>
#include <libpostal/libpostal.h>

int main(int argc, char **argv) {
    // Setup (only called once at the beginning of your program)
    if (!libpostal_setup() || !libpostal_setup_parser()) {
        exit(EXIT_FAILURE);
    }

    libpostal_address_parser_options_t options = libpostal_get_address_parser_default_options();
    libpostal_address_parser_response_t *parsed = libpostal_parse_address("781 Franklin Ave Crown Heights Brooklyn NYC NY 11216 USA", options);

    for (size_t i = 0; i < parsed->num_components; i++) {
        printf("%s: %s\n", parsed->labels[i], parsed->components[i]);
    }

    // Free parse result
    libpostal_address_parser_response_destroy(parsed);

    // Teardown (only called once at the end of your program)
    libpostal_teardown();
    libpostal_teardown_parser();
}

Parser labels

The address parser can technically use any string labels that are defined in the training data, but these are the ones currently defined, based on the fields defined in OpenCage's address-formatting library, as well as a few added by libpostal to handle specific patterns:

  • house: venue name e.g. "Brooklyn Academy of Music", and building names e.g. "Empire State Building"
  • category: for category queries like "restaurants", etc.
  • near: phrases like "in", "near", etc. used after a category phrase to help with parsing queries like "restaurants in Brooklyn"
  • house_number: usually refers to the external (street-facing) building number. In some countries this may be a compound, hyphenated number which also includes an apartment number, or a block number (a la Japan), but libpostal will just call it the house_number for simplicity.
  • road: street name(s)
  • unit: an apartment, unit, office, lot, or other secondary unit designator
  • level: expressions indicating a floor number e.g. "3rd Floor", "Ground Floor", etc.
  • staircase: numbered/lettered staircase
  • entrance: numbered/lettered entrance
  • po_box: post office box: typically found in non-physical (mail-only) addresses
  • postcode: postal codes used for mail sorting
  • suburb: usually an unofficial neighborhood name like "Harlem", "South Bronx", or "Crown Heights"
  • city_district: these are usually boroughs or districts within a city that serve some official purpose e.g. "Brooklyn" or "Hackney" or "Bratislava IV"
  • city: any human settlement including cities, towns, villages, hamlets, localities, etc.
  • island: named islands e.g. "Maui"
  • state_district: usually a second-level administrative division or county.
  • state: a first-level administrative division. Scotland, Northern Ireland, Wales, and England in the UK are mapped to "state" as well (convention used in OSM, GeoPlanet, etc.)
  • country_region: informal subdivision of a country without any political status
  • country: sovereign nations and their dependent territories, anything with an ISO-3166 code.
  • world_region: currently only used for appending "West Indies" after the country name, a pattern frequently used in the English-speaking Caribbean e.g. "Jamaica, West Indies"
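
The tagset above can be illustrated with a hand-constructed labeling of the Brooklyn address used in the C example earlier. Note this mapping is written out by hand to show the label vocabulary, not actual libpostal parser output:

```python
# Hand-constructed illustration of libpostal's parser labels applied to
# "781 Franklin Ave Crown Heights Brooklyn NYC NY 11216 USA".
# This is NOT actual parser output; the real parse depends on the model.
example = [
    ("781", "house_number"),
    ("franklin ave", "road"),
    ("crown heights", "suburb"),
    ("brooklyn", "city_district"),
    ("nyc", "city"),
    ("ny", "state"),
    ("11216", "postcode"),
    ("usa", "country"),
]

for component, label in example:
    print(f"{label}: {component}")
```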

Examples of normalization

The expand_address API converts messy real-world addresses into normalized equivalents suitable for search indexing, hashing, etc.

Here's an interactive example using the Python binding:

[screenshot: interactive expand_address session]

libpostal contains an OSM-trained language classifier to detect which language(s) are used in a given address so it can apply the appropriate normalizations. The only input needed is the raw address string. Here's a short list of some less straightforward normalizations in various languages.

| Input | Output (may be multiple in libpostal) |
|---|---|
| One-hundred twenty E 96th St | 120 east 96th street |
| C/ Ocho, P.I. 4 | calle 8 polígono industrial 4 |
| V XX Settembre, 20 | via 20 settembre 20 |
| Quatre vingt douze R. de l'Église | 92 rue de l eglise |
| ул Каретный Ряд, д 4, строение 7 | улица каретныи ряд дом 4 строение 7 |
| ул Каретный Ряд, д 4, строение 7 | ulitsa karetnyy ryad dom 4 stroyeniye 7 |
| Marktstraße 14 | markt strasse 14 |
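
The last couple of rows above involve Unicode-level normalization (accent stripping and German ß handling). A rough approximation of that step can be sketched with Python's standard library; libpostal implements this in C with full Latin-ASCII transliteration, so this is only a simplified stand-in:

```python
import unicodedata

def latin_ascii_approx(s: str) -> str:
    """Approximate accent stripping: casefold (maps German ß -> 'ss'),
    decompose to NFD, then drop combining marks."""
    s = s.casefold()                      # 'Marktstraße' -> 'marktstrasse'
    s = unicodedata.normalize("NFD", s)   # 'é' -> 'e' + combining accent
    return "".join(ch for ch in s if not unicodedata.combining(ch))

print(latin_ascii_approx("Marktstraße 14"))          # marktstrasse 14
print(latin_ascii_approx("Ave des Champs-Élysées"))  # ave des champs-elysees
```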

libpostal currently supports these types of normalizations in 60+ languages, and you can add more (without having to write any C).

For further reading and some bizarre address edge-cases, see: Falsehoods Programmers Believe About Addresses.

Usage (normalization)

Here's an example using the Python bindings for succinctness (most of the higher-level language bindings are similar):

from postal.expand import expand_address
expansions = expand_address('Quatre-vingt-douze Ave des Champs-Élysées')

assert '92 avenue des champs-elysees' in set(expansions)

The C API equivalent is a few more lines, but still fairly simple:

#include <stdio.h>
#include <stdlib.h>
#include <libpostal/libpostal.h>

int main(int argc, char **argv) {
    // Setup (only called once at the beginning of your program)
    if (!libpostal_setup() || !libpostal_setup_language_classifier()) {
        exit(EXIT_FAILURE);
    }

    size_t num_expansions;
    libpostal_normalize_options_t options = libpostal_get_default_options();
    char **expansions = libpostal_expand_address("Quatre-vingt-douze Ave des Champs-Élysées", options, &num_expansions);

    for (size_t i = 0; i < num_expansions; i++) {
        printf("%s\n", expansions[i]);
    }

    // Free expansions
    libpostal_expansion_array_destroy(expansions, num_expansions);

    // Teardown (only called once at the end of your program)
    libpostal_teardown();
    libpostal_teardown_language_classifier();
}

Command-line usage (expand)

After building libpostal:

cd src/

./libpostal "Quatre vingt douze Ave des Champs-Élysées"

If you have a text file or stream with one address per line, the command-line interface also accepts input from stdin:

cat some_file | ./libpostal --json

Command-line usage (parser)

After building libpostal:

cd src/

./address_parser

address_parser is an interactive shell. Just type addresses and libpostal will parse them and print the result.

Bindings

Libpostal is designed to be used by higher-level languages. If you don't see your language of choice, or if you're writing a language binding, please let us know!

Officially supported language bindings

Unofficial language bindings

Database extensions

Unofficial REST API

Libpostal REST Docker

Libpostal ZeroMQ Docker

Tests

libpostal uses greatest for automated testing. To run the tests, use:

make check

Adding test cases is easy, even if your C is rusty/non-existent, and we'd love contributions. We use mostly functional tests checking string input against string output.

libpostal also gets periodically battle-tested on millions of addresses from OSM (clean) as well as anonymized queries from a production geocoder (not so clean). During this process we use valgrind to check for memory leaks and other errors.

Data files

libpostal needs to download some data files from S3. The basic files are on-disk representations of the data structures necessary to perform expansion. For address parsing, since model training takes a few days, we publish the fully trained model to S3 and will update it automatically as new addresses get added to OSM, OpenAddresses, etc. Same goes for the language classifier model.

Data files are automatically downloaded when you run make. To check for and download any new data files, you can either run make, or run:

libpostal_data download all $YOUR_DATA_DIR/libpostal

And replace $YOUR_DATA_DIR with whatever you passed to configure during install.

Language dictionaries

libpostal contains a number of per-language dictionaries that influence expansion, the language classifier, and the parser. To explore the dictionaries or contribute abbreviations/phrases in your language, see resources/dictionaries.

Training data

In machine learning, large amounts of training data are often essential for getting good results. Many open-source machine learning projects either release only the model code (results reproducible if and only if you're Google), or a pre-baked model where the training conditions are unknown.

Libpostal is a bit different because it's trained on open data that's available to everyone, so we've released the entire training pipeline (the geodata package in this repo), as well as the resulting training data itself on the Internet Archive. It's over 100GB unzipped.

Training data are stored on archive.org by the date they were created. There's also a file stored in the main directory of this repo called current_parser_training_set which stores the date of the most recently created training set. To always point to the latest data, try something like: latest=$(cat current_parser_training_set) and use that variable in place of the date.

Parser training sets

All files can be found at https://archive.org/download/libpostal-parser-training-data-YYYYMMDD/$FILE as gzip'd tab-separated values (TSV) files formatted like: language\tcountry\taddress.

  • formatted_addresses_tagged.random.tsv.gz (ODBL): OSM addresses. Apartments, PO boxes, categories, etc. are added primarily to these examples
  • formatted_places_tagged.random.tsv.gz (ODBL): every toponym in OSM (even cities represented as points, etc.), reverse-geocoded to its parent admins, possibly including postal codes if they're listed on the point/polygon. Every place gets a base level of representation and places with higher populations get proportionally more.
  • formatted_ways_tagged.random.tsv.gz (ODBL): every street in OSM (ways with highway=*, with a few conditions), reverse-geocoded to its admins
  • geoplanet_formatted_addresses_tagged.random.tsv.gz (CC-BY): every postal code in Yahoo GeoPlanet (includes almost every postcode in the UK, Canada, etc.) and their parent admins. The GeoPlanet admins have been cleaned up and mapped to libpostal's tagset
  • openaddresses_formatted_addresses_tagged.random.tsv.gz (various licenses, mostly CC-BY): most of the address data sets from OpenAddresses, which in turn come directly from government sources
  • uk_openaddresses_formatted_addresses_tagged.random.tsv.gz (CC-BY): addresses from OpenAddresses UK
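
The TSV layout above (language, country, and a fully formatted address per line, tab-separated) can be consumed with a few lines of Python. A minimal reader might look like this; the file path in the usage example is hypothetical:

```python
import csv
import gzip

def read_training_lines(path):
    """Yield (language, country, address) tuples from a gzip'd
    libpostal training TSV (language\tcountry\taddress per line)."""
    with gzip.open(path, "rt", encoding="utf-8", newline="") as f:
        for row in csv.reader(f, delimiter="\t", quoting=csv.QUOTE_NONE):
            if len(row) == 3:   # skip malformed lines defensively
                yield tuple(row)

# A line shaped like the documented format:
line = "en\tus\t781 franklin avenue brooklyn ny 11216"
lang, country, address = line.split("\t")
print(lang, country, address)
```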

If the parser doesn't perform as well as you'd hoped on a particular type of address, the best recourse is to use grep/awk to look through the training data and try to determine if there's some pattern/style of address that's not being captured.

Features

  • Abbreviation expansion: e.g. expanding "rd" => "road" but for almost any language. libpostal supports > 50 languages and it's easy to add new languages or expand the current dictionaries. Ideographic languages (not separated by whitespace e.g. Chinese) are supported, as are Germanic languages where thoroughfare types are concatenated onto the end of the string, and may optionally be separated so Rosenstraße and Rosen Straße are equivalent.

  • International address parsing: Conditional Random Field which parses "123 Main Street New York New York" into {"house_number": 123, "road": "Main Street", "city": "New York", "state": "New York"}. The parser works for a wide variety of countries and languages, not just US/English. The model is trained on over 1 billion addresses and address-like strings, using the templates in the OpenCage address formatting repo to construct formatted, tagged training examples for every inhabited country in the world. Many types of normalizations are performed to make the training data resemble real messy geocoder input as closely as possible.

  • Language classification: multinomial logistic regression trained (using the FTRL-Proximal method to induce sparsity) on all of OpenStreetMap ways, addr:* tags, toponyms and formatted addresses. Labels are derived using point-in-polygon tests for both OSM countries and official/regional languages for countries and admin 1 boundaries respectively. So, for example, Spanish is the default language in Spain but in different regions e.g. Catalunya, Galicia, the Basque region, the respective regional languages are the default. Dictionary-based disambiguation is employed in cases where the regional language is non-default e.g. Welsh, Breton, Occitan. The dictionaries are also used to abbreviate canonical phrases like "Calle" => "C/" (performed on both the language classifier and the address parser training sets)

  • Numeric expression parsing ("twenty first" => 21st, "quatre-vingt-douze" => 92, again using data provided in CLDR), supports > 30 languages. Handles languages with concatenated expressions e.g. milleottocento => 1800. Optionally normalizes Roman numerals regardless of the language (IX => 9) which occur in the names of many monarchs, popes, etc.

  • Fast, accurate tokenization/lexing: clocked at > 1M tokens / sec, implements the TR-29 spec for UTF8 word segmentation, tokenizes East Asian languages character by character instead of on whitespace.

  • UTF8 normalization: optionally decomposes UTF8 to NFD normalization form, strips accent marks e.g. à => a, and/or applies Latin-ASCII transliteration.

  • Transliteration: e.g. улица => ulica or ulitsa. Uses all CLDR transforms, the exact same source data as used by ICU, though libpostal doesn't require pulling in all of ICU (might conflict with your system's version). Note: some languages, particularly Hebrew, Arabic and Thai may not include vowels and thus will not often match a transliteration done by a human. It may be possible to implement statistical transliterators for some of these languages.

  • Script detection: Detects which script a given string uses (can be multiple e.g. a free-form Hong Kong or Macau address may use both Han and Latin scripts in the same address). In transliteration we can use all applicable transliterators for a given Unicode script (Greek can for instance be transliterated with Greek-Latin, Greek-Latin-BGN and Greek-Latin-UNGEGN).
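
The Roman-numeral normalization mentioned in the numeric expression feature (IX => 9) is one of the simpler pieces to sketch in isolation. libpostal's implementation is in C and language-aware; the core subtractive-rule conversion looks roughly like this:

```python
ROMAN = {"I": 1, "V": 5, "X": 10, "L": 50, "C": 100, "D": 500, "M": 1000}

def roman_to_int(s: str) -> int:
    """Convert a Roman numeral to an integer using the subtractive rule:
    a smaller value before a larger one is subtracted (IX = 10 - 1)."""
    values = [ROMAN[ch] for ch in s.upper()]
    total = 0
    for i, v in enumerate(values):
        if i + 1 < len(values) and v < values[i + 1]:
            total -= v   # e.g. the I in IX
        else:
            total += v
    return total

print(roman_to_int("IX"))   # 9
print(roman_to_int("XIV"))  # 14
```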

Non-goals

  • Verifying that a location is a valid address
  • Actually geocoding addresses to a lat/lon (that requires a database/search index)

Raison d'être

libpostal was originally created as part of the OpenVenues project to solve the problem of venue deduping. In OpenVenues, we have a data set of millions of places derived from terabytes of web pages from the Common Crawl. The Common Crawl is published monthly, and so even merging the results of two crawls produces significant duplicates.

Deduping is a relatively well-studied field, and for text documents like web pages, academic papers, etc. there exist pretty decent approximate similarity methods such as MinHash.

However, for physical addresses, the frequent use of conventional abbreviations such as Road == Rd, California == CA, or New York City == NYC complicates matters a bit. Even using a technique like MinHash, which is well suited for approximate matches and is equivalent to the Jaccard similarity of two sets, we have to work with very short texts and it's often the case that two equivalent addresses, one abbreviated and one fully specified, will not match very closely in terms of n-gram set overlap. In non-Latin scripts, say a Russian address and its transliterated equivalent, it's conceivable that two addresses referring to the same place may not match even a single character.

As a motivating example, consider the following two equivalent ways to write a particular Manhattan street address with varying conventions and degrees of verbosity:

  • 30 W 26th St Fl #7
  • 30 West Twenty-sixth Street Floor Number 7

Obviously '30 W 26th St Fl #7' != '30 West Twenty-sixth Street Floor Number 7' in a string comparison sense, but a human can grok that these two addresses refer to the same physical location.
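
The mismatch is easy to quantify. Computing the token-level Jaccard similarity (the quantity MinHash approximates) of those two strings naively shows how little they overlap:

```python
def jaccard(a: str, b: str) -> float:
    """Jaccard similarity of the whitespace-token sets of two strings."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb)

a = "30 W 26th St Fl #7"
b = "30 West Twenty-sixth Street Floor Number 7"
print(round(jaccard(a, b), 3))  # only '30' is shared: 1/12 ≈ 0.083
```

Two strings referring to the same place scoring near zero is exactly why raw set-similarity methods struggle on addresses without normalization first.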

libpostal aims to create normalized geographic strings, parsed into components, such that we can more effectively reason about how well two addresses actually match and make automated server-side decisions about dupes.

So it's not a geocoder?

If the above sounds a lot like geocoding, that's because it is in a way, only in the OpenVenues case, we have to geocode without a UI or a user to select the correct address in an autocomplete dropdown. Given a database of source addresses such as OpenAddresses or OpenStreetMap (or all of the above), libpostal can be used to implement things like address deduping and server-side batch geocoding in settings like MapReduce or stream processing.

Now, instead of trying to bake address-specific conventions into traditional document search engines like Elasticsearch using giant synonyms files, scripting, custom analyzers, tokenizers, and the like, geocoding can look like this:

  1. Run the addresses in your database through libpostal's expand_address
  2. Store the normalized string(s) in your favorite search engine, DB, hashtable, etc.
  3. Run your user queries or fresh imports through libpostal and search the existing database using those strings

In this way, libpostal can perform fuzzy address matching in constant time relative to the size of the data set.
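
A toy version of that three-step pipeline, with a tiny stand-in normalizer in place of expand_address (the abbreviation table here is illustrative, not libpostal's dictionaries):

```python
# Toy normalizer standing in for libpostal's expand_address.
ABBREV = {"st": "street", "ave": "avenue", "rd": "road", "w": "west"}

def normalize(address: str) -> str:
    tokens = address.lower().replace(",", "").split()
    return " ".join(ABBREV.get(t, t) for t in tokens)

# 1. Normalize every address in the database and index it.
database = ["30 W 26th St", "781 Franklin Ave"]
index = {normalize(addr): addr for addr in database}

# 2./3. Normalize incoming queries and look them up in constant time.
query = "30 west 26th street"
print(index.get(normalize(query)))  # 30 W 26th St
```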

Why C?

libpostal is written in C for three reasons (in order of importance):

  1. Portability/ubiquity: libpostal targets higher-level languages that people actually use day-to-day: Python, Go, Ruby, NodeJS, etc. The beauty of C is that just about any programming language can bind to it and C compilers are everywhere, so pick your favorite, write a binding, and you can use libpostal directly in your application without having to stand up a separate server. We support Mac/Linux (Windows is not a priority but happy to accept patches), have a standard autotools build and an endianness-agnostic file format for the data files. The Python bindings are maintained as part of this repo since they're needed to construct the training data.

  2. Memory-efficiency: libpostal is designed to run in a MapReduce setting where we may be limited to < 1GB of RAM per process depending on the machine configuration. As much as possible libpostal uses contiguous arrays, tries (built on contiguous arrays), bloom filters and compressed sparse matrices to keep memory usage low. It's possible to use libpostal on a mobile device with models trained on a single country or a handful of countries.

  3. Performance: this is last on the list for a reason. Most of the optimizations in libpostal are for memory usage rather than performance. libpostal is quite fast given the amount of work it does. It can process 10-30k addresses / second in a single thread/process on the platforms we've tested (that means processing every address in OSM planet in a little over an hour). Check out the simple benchmark program to test on your environment and various types of input. In the MapReduce setting, per-core performance isn't as important because everything's being done in parallel, but there are some streaming ingestion applications at Mapzen where this needs to run in-process.

C conventions

libpostal is written in modern, legible, C99 and uses the following conventions:

  • Roughly object-oriented, as much as allowed by C
  • Almost no pointer-based data structures, arrays all the way down
  • Uses dynamic character arrays (inspired by sds) for safer string handling
  • Confines almost all mallocs to name_new and all frees to name_destroy
  • Efficient existing implementations for simple things like hashtables
  • Generic containers (via klib) whenever possible
  • Data structures take advantage of sparsity as much as possible
  • Efficient double-array trie implementation for most string dictionaries
  • Cross-platform as much as possible, particularly for *nix

Preprocessing (Python)

The geodata Python package in the libpostal repo contains the pipeline for preprocessing the various geo data sets and building training data for the C models to use. This package shouldn't be needed for most users, but for those interested in generating new types of addresses or improving libpostal's training data, this is where to look.

Address parser accuracy

On held-out test data (meaning labeled parses that the model has not seen before), the address parser achieves 99.45% full parse accuracy.

For some tasks like named entity recognition it's preferable to use something like an F1 score or variants, mostly because there's a class bias problem (most words are non-entities, and a system that simply predicted non-entity for every token would actually do fairly well in terms of accuracy). That is not the case for address parsing. Every token has a label and there are millions of examples of each class in the training data, so accuracy is preferable as it's a clean, simple and intuitive measure of performance.

Here we use full parse accuracy, meaning we only give the parser one "point" in the numerator if it gets every single token in the address correct. That should be a better measure than simply looking at whether each token was correct.
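
The distinction between full-parse accuracy and per-token accuracy can be made concrete with a small sketch (the toy labeled data below is illustrative):

```python
def full_parse_accuracy(gold, predicted):
    """Fraction of addresses where EVERY token label matches."""
    correct = sum(g == p for g, p in zip(gold, predicted))
    return correct / len(gold)

def token_accuracy(gold, predicted):
    """Fraction of individual token labels that match."""
    pairs = [(gl, pl) for g, p in zip(gold, predicted)
             for gl, pl in zip(g, p)]
    return sum(gl == pl for gl, pl in pairs) / len(pairs)

# Two parses of 3 tokens each; the second parse has one wrong label.
gold = [["house_number", "road", "city"], ["house_number", "road", "city"]]
pred = [["house_number", "road", "city"], ["house_number", "road", "suburb"]]

print(full_parse_accuracy(gold, pred))  # 0.5   (one of two parses fully correct)
print(token_accuracy(gold, pred))       # 5/6   (five of six tokens correct)
```

Full-parse accuracy is the stricter metric: one mislabeled token zeroes out the whole address.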

Improving the address parser

Though the current parser works quite well for most standard addresses, there is still room for improvement, particularly in making sure the training data we use is as close as possible to addresses in the wild. There are two primary ways the address parser can be improved even further (in order of difficulty):

  1. Contribute addresses to OSM. Anything with an addr:housenumber tag will be incorporated automatically into the parser next time it's trained.
  2. If the address parser isn't working well for a particular country, language or style of address, chances are that some name variations or places are being missed/mislabeled during training data creation. Sometimes the fix is to update the formats at https://github.com/OpenCageData/address-formatting, and in many other cases there are relatively simple tweaks we can make when creating the training data that will ensure the model is trained to handle your use case without you having to do any manual data entry. If you see a pattern of obviously bad address parses, the best thing to do is post an issue to GitHub.

Contributing

Bug reports, issues and pull requests are welcome. Please read the contributing guide before submitting your issue, bug report, or pull request.

Submit issues at: https://github.com/openvenues/libpostal/issues.

Shoutouts

Special thanks to @BenK10 for the initial Windows build and @AeroXuk for integrating it seamlessly into the project and setting up an Appveyor build.

License

The software is available as open source under the terms of the MIT License.


libpostal's Issues

Probability of correct parse?

I was wondering if it might be possible to get some indicator that may help in telling whether a parse went well or not. This would be very helpful for thresholding out examples that might be good to submit as training data.

Build fail

Attempting build on OS X 10.11.2 (15C50).
Followed all steps without warnings or errors, until make.
make failed with the following output.

$ make
/Applications/Xcode.app/Contents/Developer/usr/bin/make  all-recursive
Making all in src
Making all in sparkey
make[3]: Nothing to be done for `all'.
/bin/sh ../libtool  --tag=CC   --mode=link gcc -Wfloat-equal -Wpointer-arith -O3 -Wfloat-equal -Wpointer-arith  -L/usr/local/lib -o address_parser_train address_parser_train-address_parser_train.o address_parser_train-address_parser.o address_parser_train-address_parser_io.o address_parser_train-averaged_perceptron.o address_parser_train-sparse_matrix.o address_parser_train-averaged_perceptron_trainer.o address_parser_train-averaged_perceptron_tagger.o address_parser_train-address_dictionary.o address_parser_train-geodb.o address_parser_train-geo_disambiguation.o address_parser_train-graph.o address_parser_train-graph_builder.o address_parser_train-normalize.o address_parser_train-features.o address_parser_train-geonames.o geohash/address_parser_train-geohash.o address_parser_train-unicode_scripts.o address_parser_train-transliterate.o address_parser_train-trie.o address_parser_train-trie_search.o address_parser_train-string_utils.o address_parser_train-tokens.o address_parser_train-msgpack_utils.o address_parser_train-file_utils.o address_parser_train-shuffle.o utf8proc/address_parser_train-utf8proc.o cmp/address_parser_train-cmp.o sparkey/libsparkey.la libscanner.la -lsnappy
libtool: link: gcc -Wfloat-equal -Wpointer-arith -O3 -Wfloat-equal -Wpointer-arith -o address_parser_train address_parser_train-address_parser_train.o address_parser_train-address_parser.o address_parser_train-address_parser_io.o address_parser_train-averaged_perceptron.o address_parser_train-sparse_matrix.o address_parser_train-averaged_perceptron_trainer.o address_parser_train-averaged_perceptron_tagger.o address_parser_train-address_dictionary.o address_parser_train-geodb.o address_parser_train-geo_disambiguation.o address_parser_train-graph.o address_parser_train-graph_builder.o address_parser_train-normalize.o address_parser_train-features.o address_parser_train-geonames.o geohash/address_parser_train-geohash.o address_parser_train-unicode_scripts.o address_parser_train-transliterate.o address_parser_train-trie.o address_parser_train-trie_search.o address_parser_train-string_utils.o address_parser_train-tokens.o address_parser_train-msgpack_utils.o address_parser_train-file_utils.o address_parser_train-shuffle.o utf8proc/address_parser_train-utf8proc.o cmp/address_parser_train-cmp.o  -L/usr/local/lib sparkey/.libs/libsparkey.a ./.libs/libscanner.a -lsnappy
Undefined symbols for architecture x86_64:
  "_trie_new_from_hash", referenced from:
      _address_parser_init in address_parser_train-address_parser_train.o
      _averaged_perceptron_trainer_finalize in address_parser_train-averaged_perceptron_trainer.o
ld: symbol(s) not found for architecture x86_64
clang: error: linker command failed with exit code 1 (use -v to see invocation)
make[3]: *** [address_parser_train] Error 1
make[2]: *** [all-recursive] Error 1
make[1]: *** [all-recursive] Error 1
make: *** [all] Error 2

Make libpostal thread safe

I'm following up on openvenues/pypostal#3
Making libpostal threadsafe will allow libraries for interpreted languages such as Python and Ruby to release the GIL in order to enable parallelism.
It will also allow the usage of libpostal in threaded C/C++ applications.

What isn't threadsafe in libpostal and what's required in order to ensure thread safety?
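Until the library itself is made thread-safe, one common workaround is to funnel every call through a single process-wide mutex. The sketch below assumes all of the shared global state is only touched from inside the library calls; unsafe_parse and parse_address_locked are hypothetical stand-ins, not actual libpostal functions:

```c
#include <assert.h>
#include <pthread.h>
#include <string.h>

/* Hypothetical stand-in for a libpostal call that reads/writes shared
 * global state; the real entry points differ. */
static char shared_buffer[256];

static const char *unsafe_parse(const char *address) {
    strncpy(shared_buffer, address, sizeof(shared_buffer) - 1);
    shared_buffer[sizeof(shared_buffer) - 1] = '\0';
    return shared_buffer;
}

/* One process-wide lock serializing every call into the library. */
static pthread_mutex_t postal_lock = PTHREAD_MUTEX_INITIALIZER;

const char *parse_address_locked(const char *address, char *out, size_t out_size) {
    pthread_mutex_lock(&postal_lock);
    const char *result = unsafe_parse(address);
    /* Copy the result out while still holding the lock, so another
     * thread cannot clobber the shared buffer underneath us. */
    strncpy(out, result, out_size - 1);
    out[out_size - 1] = '\0';
    pthread_mutex_unlock(&postal_lock);
    return out;
}
```

This serializes the library rather than parallelizing it, but it at least makes concurrent callers safe; language bindings could wrap their calls the same way before releasing the GIL.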

Could Not Find Snappy

When I run ./configure it fails with

configure: error: Could not find snappy

I have Snappy installed using homebrew. However, my homebrew is in a non-standard location (~/.homebrew). I believe this is causing the issue.

I'm not familiar enough with Autoconf to fix it myself. Is there a way to fix in the Makefile or will I have to change my environment?
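With Autoconf-generated configure scripts, non-standard library locations can usually be passed in via the CPPFLAGS and LDFLAGS variables rather than editing the Makefile. Assuming libpostal's configure follows that convention (not verified here; the ~/.homebrew paths come from this report), something like this may work:

```shell
# Hypothetical invocation: point configure at a Homebrew prefix under ~/.homebrew
./configure CPPFLAGS="-I$HOME/.homebrew/include" LDFLAGS="-L$HOME/.homebrew/lib"
make
```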

problems detecting UK postcodes

Great library!

I found some problems parsing UK postcodes. I will post just a couple of examples, but I have more if necessary.

{ input: '318 Upper Street, N1 2XQ London',
  expanded: 
   [ '318 upper street n1 2xq london',
     '318 upper street n 1 2xq london',
     '318 upper street north 1 2xq london' ],
  parsed: 
   [ { value: '318', component: 'house_number' },
     { value: 'upper street', component: 'road' },
     { value: 'n1', component: 'postcode' },
     { value: '2xq', component: 'suburb' },
     { value: 'london', component: 'city' } ] }

{ input: '21, Kingswood Road SW2 4JE, London',
  expanded: 
   [ '21 kingswood road sw2 4je london',
     '21 kingswood road sw 2 4je london',
     '21 kingswood road southwest 2 4je london' ],
  parsed: 
   [ { value: '21', component: 'house_number' },
     { value: 'kingswood road', component: 'road' },
     { value: 'sw2', component: 'postcode' },
     { value: '4je', component: 'suburb' },
     { value: 'london', component: 'city' } ] }

French fail?

Using the latest version of the models and the classic "Quatre-vignt-douze Ave des Champs-Élysées", it's not recognising the 'vignt'; I get "4-vignt-12 avenue des champs-elysees" as output.

Create official release to ease Mac OSX Brew install

Howdy,

This is an awesome project. I've gotten a proof of concept of a Mac OSX brew install config working. To make the submission more likely to pass their automated checks, it's best if I can point the config at a release from the official repo instead of a fork (https://github.com/seekayel/libpostal/releases).

Would you be willing to create official releases on github? https://help.github.com/articles/creating-releases/

If not, would you be willing to post links to where they are hosted?

Thanks!

seg fault on latest build

I pulled the latest from master and, while it builds successfully, I get the following when trying to run the address parser:

admins-MacBook-Pro-4% ./src/address_parser
Loading models...
zsh: segmentation fault  ./src/address_parser

This is on OSX 10.11.3.

Some code is not 32-bit safe

I haven't checked everything, but for example:

static ordinal_indicator_t *ordinal_indicator_read(FILE *f) {
    size_t key_len;
    if (!file_read_uint64(f, (uint64_t *)&key_len)) {
        return NULL;
    }

    char *key = malloc(key_len);
    if (key == NULL) {
        return NULL;
    }

This assumes that size_t is 64 bits. If the code is compiled on a 32-bit machine, the call to file_read_uint64() will smash the stack.
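A 32-bit-safe variant reads the stored value into a real uint64_t first and range-checks it before narrowing to size_t. The helper below is a hypothetical illustration of that pattern, not libpostal code:

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical helper showing a 32-bit-safe pattern: the on-disk 64-bit
 * length is held in a real uint64_t, then range-checked before narrowing
 * to size_t, so nothing adjacent on the stack gets overwritten when
 * size_t is only 32 bits wide. */
char *alloc_key(uint64_t stored_len, size_t *key_len_out) {
#if SIZE_MAX < UINT64_MAX
    if (stored_len > (uint64_t)SIZE_MAX) {
        return NULL;   /* length not representable on this platform */
    }
#endif
    size_t key_len = (size_t)stored_len;
    char *key = malloc(key_len > 0 ? key_len : 1);
    if (key != NULL && key_len_out != NULL) {
        *key_len_out = key_len;
    }
    return key;
}
```

In the real reader, file_read_uint64() would fill the uint64_t and the cast on &key_len would disappear entirely.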

Windows Install

Before I try to reinvent the wheel: does anyone have tips or a procedure for installing this on Windows?
I would like to use it with Windows 7 and Python, and possibly Postgres.

Thanks,
Matt

Use " - " as asset

> LES HAUTS DE SAINT ANTOINE BAT F - BD DU COMMANDANT THOLLON, 13015 MARSEILLE
  "road": "les hauts de saint antoine bat",
  "house": "f",
  "road": "bd du commandant thollon",
> LES HAUTS DE SAINT ANTOINE BT F - BD DU COMMANDANT THOLLON, 13015 MARSEILLE
  "road": "les hauts de saint antoine bt f bd du commandant thollon",
> LE PETIT BOIS BT F - BD DU COMMANDANT THOLLON, 13015 MARSEILLE
  "house": "le petit bois bt f",
  "road": "bd du commandant thollon",

Only the last one is correct.

Simplify process for contributing languages/abbreviations

Adding new abbreviations to libpostal involves 4 steps:

  1. Edit a text file in dictionaries
  2. Run python scripts/geodata/address_expansions/address_dictionaries.py to generate the C data file address_expansion_data.c (new version should be checked in)
  3. After compiling libpostal with make, run ./src/build_address_dictionary to build the fast trie data structure used at run-time
  4. Run libpostal_data (e.g. libpostal_data upload base $YOUR_DATA_DIR/libpostal) to upload files to S3 (read access to the libpostal buckets is public, write access is not)

Ideally contributors should only have to think about step 1 and the others should be run automatically as part of the build assuming tests pass, etc.

.Net bindings

I would like to help in writing the .Net bindings for the library, but first I'll need some help to build the library on windows.

install fails

Al, first of all, thanks for the expanded README, very helpful!
I am following the installation sequence, the last step fails (Ubuntu):

/home/boshkins/github/libpostal> sudo make install
Making install in src
make[1]: Entering directory '/home/boshkins/github/libpostal/src'
Making install in sparkey
make[2]: Entering directory '/home/boshkins/github/libpostal/src/sparkey'
make[3]: Entering directory '/home/boshkins/github/libpostal/src/sparkey'
make[3]: Nothing to be done for 'install-exec-am'.
make[3]: Nothing to be done for 'install-data-am'.
make[3]: Leaving directory '/home/boshkins/github/libpostal/src/sparkey'
make[2]: Leaving directory '/home/boshkins/github/libpostal/src/sparkey'
make[2]: Entering directory '/home/boshkins/github/libpostal/src'
mkdir -p /home/boshkins/libpostal_data/libpostal
if [ ! -e ./libpostal_data_last_updated ]; then                                         \
        echo "Jan  1 00:00:00 1970" > ./libpostal_data_last_updated;                    \
fi;
if [ $(curl http://libpostal.s3.amazonaws.com/libpostal_data.tar.gz -z "$(cat ./libpostal_data_last_updated)" --silent --remote-time -o /home/boshkins/libpostal_data/libpostal/libpostal_data.tar.gz -w %{http_code}) = "200" ]; then            \
        if [ "x1" != "x" ]; then                                                                                                                 \
                echo $(date -d "$(date -d "@$(date -r /home/boshkins/libpostal_data/libpostal/libpostal_data.tar.gz +%s)") + 1 second") > ./libpostal_data_last_updated;                                                                                                                           \
        elif [ "x" != "x" ]; then                                                                                                                \
                echo $(date -r $(stat -f %m /home/boshkins/libpostal_data/libpostal/libpostal_data.tar.gz) -v+1S) > ./libpostal_data_last_updated\
        fi;                                                                                                                                      \
        tar -xvzf /home/boshkins/libpostal_data/libpostal/libpostal_data.tar.gz -C /home/boshkins/libpostal_data/libpostal;                      \
        rm /home/boshkins/libpostal_data/libpostal/libpostal_data.tar.gz;                                                                        \
fi;
make[3]: Entering directory '/home/boshkins/github/libpostal/src'
 /bin/mkdir -p '/usr/local/lib'
 /bin/bash ../libtool   --mode=install /usr/bin/install -c   libpostal.la '/usr/local/lib'
libtool: install: /usr/bin/install -c .libs/libpostal.so.0.0.0 /usr/local/lib/libpostal.so.0.0.0
libtool: install: (cd /usr/local/lib && { ln -s -f libpostal.so.0.0.0 libpostal.so.0 || { rm -f libpostal.so.0 && ln -s libpostal.so.0.0.0 libpostal.so.0; }; })
libtool: install: (cd /usr/local/lib && { ln -s -f libpostal.so.0.0.0 libpostal.so || { rm -f libpostal.so && ln -s libpostal.so.0.0.0 libpostal.so; }; })
libtool: install: /usr/bin/install -c .libs/libpostal.lai /usr/local/lib/libpostal.la
libtool: install: /usr/bin/install -c .libs/libpostal.a /usr/local/lib/libpostal.a
libtool: install: chmod 644 /usr/local/lib/libpostal.a
libtool: install: ranlib /usr/local/lib/libpostal.a
libtool: install: warning: remember to run `libtool --finish /home/boshkins/lib'
mkdir -p /home/boshkins/libpostal_data/libpostal
if [ ! -e ./libpostal_data_last_updated ]; then                                         \
        echo "Jan  1 00:00:00 1970" > ./libpostal_data_last_updated;                    \
fi;
if [ $(curl http://libpostal.s3.amazonaws.com/libpostal_data.tar.gz -z "$(cat ./libpostal_data_last_updated)" --silent --remote-time -o /home/boshkins/libpostal_data/libpostal/libpostal_data.tar.gz -w %{http_code}) = "200" ]; then            \
        if [ "x1" != "x" ]; then                                                                                                                 \
                echo $(date -d "$(date -d "@$(date -r /home/boshkins/libpostal_data/libpostal/libpostal_data.tar.gz +%s)") + 1 second") > ./libpostal_data_last_updated;                                                                                                                           \
        elif [ "x" != "x" ]; then                                                                                                                \
                echo $(date -r $(stat -f %m /home/boshkins/libpostal_data/libpostal/libpostal_data.tar.gz) -v+1S) > ./libpostal_data_last_updated\
        fi;                                                                                                                                      \
        tar -xvzf /home/boshkins/libpostal_data/libpostal/libpostal_data.tar.gz -C /home/boshkins/libpostal_data/libpostal;                      \
        rm /home/boshkins/libpostal_data/libpostal/libpostal_data.tar.gz;                                                                        \
fi;
 /bin/mkdir -p '/home/boshkins/libpostal_data/libpostal'
 /usr/bin/install -c -m 644 ./libpostal_data.tar.gz '/home/boshkins/libpostal_data/libpostal'
/usr/bin/install: cannot stat โ€˜./libpostal_data.tar.gzโ€™: No such file or directory
Makefile:840: recipe for target 'install-pkgdataDATA' failed
make[3]: *** [install-pkgdataDATA] Error 1
make[3]: Leaving directory '/home/boshkins/github/libpostal/src'
Makefile:1050: recipe for target 'install-am' failed
make[2]: *** [install-am] Error 2
make[2]: Leaving directory '/home/boshkins/github/libpostal/src'
Makefile:889: recipe for target 'install-recursive' failed
make[1]: *** [install-recursive] Error 1
make[1]: Leaving directory '/home/boshkins/github/libpostal/src'
Makefile:380: recipe for target 'install-recursive' failed
make: *** [install-recursive] Error 1

Why the lowercasing?

Is the lowercasing of input strings in the output needed for the processing, or is it more a convenience/normalisation thing? I ask because it makes it hard to substitute replacements for parsed elements back into the original address (unless there's a trick I'm missing).
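One possible trick, assuming the parsed components come back as lowercased substrings of the input, is to locate each component in the original string with a case-insensitive search and splice the replacement in at that span. This is an illustrative sketch, not part of libpostal's API:

```c
#include <assert.h>
#include <string.h>
#include <strings.h>   /* strncasecmp (POSIX) */

/* Find the first case-insensitive occurrence of needle in haystack.
 * Illustrative only: lets a caller map a lowercased parsed component
 * back to its span in the original, mixed-case input. */
const char *find_case_insensitive(const char *haystack, const char *needle) {
    size_t n = strlen(needle);
    for (const char *p = haystack; *p != '\0'; p++) {
        if (strncasecmp(p, needle, n) == 0) {
            return p;
        }
    }
    return NULL;
}
```

Given the original "475 Sansome St" and the parsed component "sansome st", this returns a pointer to "Sansome St", so a replacement can be substituted while the rest of the string keeps its original casing.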

Error while running `make` in ubuntu in virtualbox (vagrant)

Using commit b3c8a72ca691f6a46d6f1a9b02d63a2e4a5b81cd

I tried compiling on ubuntu 14.04 and 12.04 using vagrant boxes running in virtual box

I followed the instructions on the README.md

I get this error when running make

...
/bin/bash ../libtool  --tag=CC   --mode=compile gcc -std=gnu99 -DHAVE_CONFIG_H -I.. -I/usr/local/include    -Wfloat-equal -Wpointer-arith -DLIBPOSTAL_DATA_DIR='"/home/vagrant/data/libpostal"' -O2 -Wfloat-equal -Wpointer-arith -DLIBPOSTAL_DATA_DIR='"/home/vagrant/data/libpostal"' -c -o libpostal_la-float_utils.lo `test -f 'float_utils.c' || echo './'`float_utils.c
libtool: compile:  gcc -std=gnu99 -DHAVE_CONFIG_H -I.. -I/usr/local/include -Wfloat-equal -Wpointer-arith -DLIBPOSTAL_DATA_DIR=\"/home/vagrant/data/libpostal\" -O2 -Wfloat-equal -Wpointer-arith -DLIBPOSTAL_DATA_DIR=\"/home/vagrant/data/libpostal\" -c float_utils.c  -fPIC -DPIC -o .libs/libpostal_la-float_utils.o
libtool: compile:  gcc -std=gnu99 -DHAVE_CONFIG_H -I.. -I/usr/local/include -Wfloat-equal -Wpointer-arith -DLIBPOSTAL_DATA_DIR=\"/home/vagrant/data/libpostal\" -O2 -Wfloat-equal -Wpointer-arith -DLIBPOSTAL_DATA_DIR=\"/home/vagrant/data/libpostal\" -c float_utils.c -o libpostal_la-float_utils.o >/dev/null 2>&1
source='scanner.c' object='libscanner_la-scanner.lo' libtool=yes \
    DEPDIR=.deps depmode=none /bin/bash ../depcomp \
    /bin/bash ../libtool  --tag=CC   --mode=compile gcc -std=gnu99 -DHAVE_CONFIG_H -I.. -I/usr/local/include    -Wfloat-equal -Wpointer-arith -DLIBPOSTAL_DATA_DIR='"/home/vagrant/data/libpostal"' -O0 -Wfloat-equal -Wpointer-arith -DLIBPOSTAL_DATA_DIR='"/home/vagrant/data/libpostal"' -c -o libscanner_la-scanner.lo `test -f 'scanner.c' || echo './'`scanner.c
libtool: compile:  gcc -std=gnu99 -DHAVE_CONFIG_H -I.. -I/usr/local/include -Wfloat-equal -Wpointer-arith -DLIBPOSTAL_DATA_DIR=\"/home/vagrant/data/libpostal\" -O0 -Wfloat-equal -Wpointer-arith -DLIBPOSTAL_DATA_DIR=\"/home/vagrant/data/libpostal\" -c scanner.c  -fPIC -DPIC -o .libs/libscanner_la-scanner.o
gcc: internal compiler error: Killed (program cc1)
Please submit a full bug report,
with preprocessed source if appropriate.
See <file:///usr/share/doc/gcc-4.6/README.Bugs> for instructions.
make[3]: *** [libscanner_la-scanner.lo] Error 1
make[3]: Leaving directory `/home/vagrant/libpostal/src'
make[2]: *** [all-recursive] Error 1
make[2]: Leaving directory `/home/vagrant/libpostal/src'
make[1]: *** [all-recursive] Error 1
make[1]: Leaving directory `/home/vagrant/libpostal'
make: *** [all] Error 2

Any idea on what is causing this problem?

Thanks!

Italian address with snc

For Italian addresses with the word 'SNC', the address expansion should return, as the first element of the list, the address with SNC normalized to 'senza numero civico' (no_number.txt), and not the expansion for the company type 'società in nome collettivo' (company_types.txt).

how it is now
VIA S. TOMMASI SNC [u'via santo tommasi societa in nome collettivo', u'via santo tommasi senza numero civico',...]

how it should be
VIA S. TOMMASI SNC [u'via santo tommasi senza numero civico', u'via santo tommasi societa in nome collettivo', ...]

Thank you

Encoding issue

The input aéroport is replaced with ae ́roport (the é comes back decomposed, with a separated combining accent).

> route de l'aéroport 64121 Artix

Result:

{
  "house_number": "64000",
  "road": "route de l'ae ́roport",
  "postcode": "64121",
  "city": "artix"
}

Problem building with French locale

I had an error with make, something about a date being in the wrong format, because my GNU/Linux system is set to French (fr-CA): the date was output in French and couldn't be parsed.

Using the following I was able to build:

LC_ALL=C make

Build issues with Snappy on OSX 10.11

El Capitan seems to be triggering build issues involving Snappy.

Running brew install snappy works without a problem (and seems to pass its internal tests), but then when we run ./configure it will trigger:

checking for library containing snappy_compress... no
configure: error: Could not find snappy

Others on the team seem to have this same problem. Is there an easy way I can test if GCC can find the library?

I'm partly flagging this here because I haven't seen any related open issues at Homebrew.

In the interim, I threw together a quick vagrant image for working with Libpostal and all the bindings.

French for 20 seems incorrect

Address expansion should rewrite words to numbers. Testing with the command-line tool src/libpostal, it fails with this correct French address:

src/libpostal 'Quatre-vingt-douze Ave des Champs-Élysées'

Result:

12 avenue des champs-elysees
12 avenue des champs elysees
12 avenue des champselysees
42012 avenue des champs-elysees
42012 avenue des champs elysees
42012 avenue des champselysees

However, if you misspell "vingt" by swapping the 'g' and the 'n' it works as expected:

src/libpostal 'Quatre-vignt-douze Ave des Champs-Élysées'

Result:

92 avenue des champs-elysees
92 avenue des champs elysees
92 avenue des champselysees

I'm not a native French speaker so I verified the correct spelling of "vingt":

http://dictionary.reverso.net/french-english/vingt

If you enter vignt into the search box on the reverso.net site, it corrects it to vingt and displays the same URL as the one quoted above.

Note: this example came from the blog posting announcing libpostal. I shared the blog posting and one of my co-workers spotted the misspelling, so I don't get the credit for spotting this.

manhattan new york

"manhattan new york" is incorrectly parsed as city:new, state:york.

would it be safe to assume that the word 'new' is always a prefix? (ie. a modifier for the following tokens)

> manhattan new york

Result:

{
  "state_district": "manhattan",
  "city": "new",
  "state": "york"
}

> wellington new zealand

Result:

{
  "house": "wellington",
  "country": "new zealand"
}

Python bindings installation fails

I keep running into errors while attempting to install the Python bindings (Ubuntu 14.04 + OS X Snow Leopard).

pip installing (locally or from github) leads to:

    building 'postal.text._normalize' extension
    cc -fno-strict-aliasing -fno-common -dynamic -arch x86_64 -arch i386 -g -Os -pipe -fno-common -fno-strict-aliasing -fwrapv -DENABLE_DTRACE -DMACOSX -DNDEBUG -Wall -Wstrict-prototypes -Wshorten-64-to-32 -DNDEBUG -g -fwrapv -Os -Wall -Wstrict-prototypes -DENABLE_DTRACE -arch x86_64 -arch i386 -pipe -I. -I/System/Library/Frameworks/Python.framework/Versions/2.7/include/python2.7 -c src/normalize.c -o build/temp.macosx-10.10-intel-2.7/src/normalize.o -std=c99 -DHAVE_CONFIG_H -Wno-unused-function
    In file included from src/normalize.c:1:
    In file included from src/normalize.h:36:
    In file included from src/transliterate.h:12:
    In file included from src/trie.h:32:
    In file included from src/file_utils.h:11:
    src/libpostal_config.h:7:14: fatal error: 'config.h' file not found
        #include <config.h>
                 ^
    1 error generated.
    error: command 'cc' failed with exit status 1

    ----------------------------------------
Command "/Users/driordan/Data/datapy/bin/python -c "import setuptools, tokenize;__file__='/var/folders/mx/2mwxpy5j4nvdfw53dy44xdk80000gn/T/pip-ZN5DJj-build/setup.py';exec(compile(getattr(tokenize, 'open', open)(__file__).read().replace('\r\n', '\n'), __file__, 'exec'))" install --record /var/folders/mx/2mwxpy5j4nvdfw53dy44xdk80000gn/T/pip-CxQMdy-record/install-record.txt --single-version-externally-managed --compile --install-headers /Users/driordan/Data/datapy/include/site/python2.7/pypostal" failed with error code 1 in /var/folders/mx/2mwxpy5j4nvdfw53dy44xdk80000gn/T/pip-ZN5DJj-build

I'm foggy enough as is on C build processes, so debugging C meets bindings is hairy territory for me.

Additionally, README.md needs to be updated so that pip install https becomes pip install git+https to facilitate installing from GitHub (I was going to open a PR, but figured the other install issues took precedence; that change makes more sense once the general install issue is worked out).

libpostal_setup() only works once

Al,

libpostal_setup() only works once; all subsequent calls fail, regardless of calls to libpostal_teardown() in between. The error message is:

ERR   Error loading transliteration module
   at libpostal_setup (libpostal.c:742) errno: No such file or directory

I guess a given app is not likely to call setup/teardown more than once per run, so this is not a big deal. My context is a test suite that tries to separate test cases cleanly, thus multiple setup/teardown call pairs.

Thank you,
Anatoly

What are the possible labels?

As someone trying to fit the output into a tabular data structure, it'd be good to know what the range of labels for parsed addresses is (I've tried to dig through the code, but... there's kind of a lot of it!).

R language binding

Great library. I'd love to see a binding for #rstats. R has nice database connectivity already, so I guess libpostal and R would work quite well together. There are C++ and C interfaces already. I'd be willing to help out on the R part here (i.e. building an R package), but I need a little kick start on the C part. I'm also not sure about the licensing and how you would package stuff. Technically, C code can be part of an R package, but I don't know how much of the code here should be included in an R package. (R is basically GPL.)

Apartment units

I was wondering if there's currently support for secondary addresses such as apartments, condos, and offices, or if that's outside the scope of libpostal. When I run an address like 1500 CHESTNUT ST APT B, PHILADELPHIA, PA through address_parser I get:

Result:

{
  "house_number": "1500",
  "road": "chestnut st apt b",
  "city": "philadelphia",
  "state": "pa"
}

For my particular purposes it would be great if there were distinct tags for the unit designator and unit num.

Thanks!

Doesn't appear to handle PO Box numbers.

postal.parser.parse_address('PO Box 1, Seattle, WA 98103');
[ { value: 'po', component: 'house_number' },
  { value: 'box', component: 'road' },
  { value: '1', component: 'house_number' },
  { value: 'seattle', component: 'city' },
  { value: 'wa', component: 'state' },
  { value: '98103', component: 'postcode' } ]

Parser failure on some 5 digit street addresses

There is an odd issue where addresses containing certain ranges of street numbers fail to parse properly. In my case, some (not all) addresses over 10000 seem to cause problems where the parser thinks the street number is the postal code. One of the real-world addresses that is causing me problems is: 25050 ALESSANDRO BLVD, STE B, MORENO VALLEY, CA, 92553-4313. It's a strip mall in CA, USA. Google Maps and USPS have no problem with it, though OpenStreetMap can't find it.

Here is a patch that introduces a test showing the problem.

diff --git a/test/test_parser.c b/test/test_parser.c
index db6a11c..9eea212 100644
--- a/test/test_parser.c
+++ b/test/test_parser.c
@@ -71,9 +71,28 @@ TEST test_us_parses(void) {
         (labeled_component_t){"postcode", "11216"},
         (labeled_component_t){"country", "usa"}
     ));
+
     PASS();
 }

+TEST test_us_commercial_parses(void) {
+    address_parser_options_t options = get_libpostal_address_parser_default_options();
+
+    CHECK_CALL(test_parse_result_equals(
+        "25050 ALESSANDRO BLVD, STE B, MORENO VALLEY, CA, 92553-4313",
+        options,
+        5,
+        (labeled_component_t){"house_number", "25050"},
+        (labeled_component_t){"road", "allessandro blvd ste b"},
+        (labeled_component_t){"city", "moreno valley"},
+        (labeled_component_t){"state", "ca"},
+        (labeled_component_t){"postcode", "92553-4313"}
+    )); 
+
+    PASS();
+}
+
+

 TEST test_uk_parses(void) {
     address_parser_options_t options = get_libpostal_address_parser_default_options();
@@ -234,6 +253,7 @@ SUITE(libpostal_parser_tests) {
     }

     RUN_TEST(test_us_parses);
+    RUN_TEST(test_us_commercial_parses);
     RUN_TEST(test_uk_parses);
     RUN_TEST(test_es_parses);
     RUN_TEST(test_za_parses);

Vector resizing operations don't signal out-of-memory conditions to caller

For example, within __VECTOR_BASE():

static inline void name##_push(name *array, type value) {                       \
    if (array->n == array->m) {                                                 \
        size_t size = array->m ? array->m << 1 : 2;                             \
        type *ptr = realloc(array->a, sizeof(type) * size);                     \
        if (ptr == NULL) return;                                                \
        array->a = ptr;                                                         \
        array->m = size;                                                        \
    }                                                                           \
    array->a[array->n++] = value;                                               \
}                                                                               \

The return value of realloc() is checked, but the caller has no idea if the operation succeeded. If someone does

size_t a_will_be_at_index = char_array->n;
char_array_push(myarray, 'a');

and later uses 'a_will_be_at_index', they could blow past the end of the array.
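One possible fix, sketched here as a standalone function rather than the library's actual __VECTOR_BASE macro, is to have the push report success to the caller:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Sketch of the same push logic as the macro in the issue, but returning
 * a bool so the caller learns about allocation failure instead of the
 * push silently doing nothing. */
typedef struct {
    char *a;
    size_t n;   /* number of elements in use */
    size_t m;   /* allocated capacity */
} char_array;

bool char_array_push_checked(char_array *array, char value) {
    if (array->n == array->m) {
        size_t size = array->m ? array->m << 1 : 2;
        char *ptr = realloc(array->a, sizeof(char) * size);
        if (ptr == NULL) {
            return false;   /* caller now sees that the push did not happen */
        }
        array->a = ptr;
        array->m = size;
    }
    array->a[array->n++] = value;
    return true;
}
```

With this shape, the example above becomes safe: the caller only uses a_will_be_at_index if the push returned true.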

How to use normalize_options . address_components ?

Al,

In the normalize_options structure, I see a bit mask address_components, which is by default set to ADDRESS_HOUSE_NUMBER | ADDRESS_STREET | ADDRESS_UNIT. This looks like a way of controlling some aspects of address parsing and (hopefully) providing some feedback to the calling code as to which parts of the address have been identified. I tried populating the option with different values but saw no difference in the output. A clarification or maybe some examples would be very much appreciated.

Thank you,
Anatoly

Unable to locate package libsnappy-dev

May I know which repo to pull the libsnappy-dev package from? (Is it a 32-bit lib?)

root@550fc05b4ed1:/app/user# apt-get install libsnappy-dev autoconf automake libtool pkg-config
Reading package lists... Done
Building dependency tree
Reading state information... Done
E: Unable to locate package libsnappy-dev

root@550fc05b4ed1:/app/user# cat /etc/issue
Ubuntu 14.04.2 LTS \n \l

Data files are downloaded twice

make downloads the data files from S3, which is fine, but when I type sudo make install, it downloads them again.

========================= Installation results ===========================
Making install in src
make[1]: Entering directory `/home/rykov/git/libpostal/src'
Making install in sparkey
make[2]: Entering directory `/home/rykov/git/libpostal/src/sparkey'
make[3]: Entering directory `/home/rykov/git/libpostal/src/sparkey'
make[3]: Nothing to be done for `install-exec-am'.
make[3]: Nothing to be done for `install-data-am'.
make[3]: Leaving directory `/home/rykov/git/libpostal/src/sparkey'
make[2]: Leaving directory `/home/rykov/git/libpostal/src/sparkey'
make[2]: Entering directory `/home/rykov/git/libpostal/src'
./libpostal_data download all /home/rykov/tmp/libpostal
Checking for new libpostal data file...
Warning: Illegal date format for -z, --timecond (and not a file name). 
Warning: Disabling time condition. See curl_getdate(3) for valid date syntax.

Ubuntu 14.04

Remove scripts/setup.py

Since the Python bindings now have their own repo, and since scripts/setup.py did not work when I tried it, I suggest removing scripts/setup.py from this repo.

Note: When I was trying to figure out how to install the Python bindings, Google found issue #10 and I tried the process described there. I should have just started with the README.md file! (I edited issue #10 to help out anyone who does the same thing I did.)

contributing address test data

I have about 100,000 US California commercial addresses as they were input by users, along with a postally verified version of each address. Is this useful to you folks, and what's the best way to pass it along if so (a big dump, excerpts, etc.)? The data is public domain and unencumbered.

example code?

Hello,
Would you kindly provide some code examples?
Thank you!
Anatoly

Error loading geodb module

I just installed on a Debian system, following the step-by-step instructions in the readme. Then I tried running the parser from the command line, and got this error:

# ./address_parser
Loading models...
ERR   Error loading geodb module
   at libpostal_setup_parser (libpostal.c:1071) errno: None

I made sure the data directory is world-readable. What am I missing?

problem with Japanese addresses

Hi,

I am trying to parse addresses from PATSTAT (EPO Worldwide Patent Statistical Database). For most countries the parser works fine but I have problems with the Japanese ones.

E.g. postal.parser.parse_address("135, Higashifunahashi 2 Chome, Hirakata-shi Osaka-fu")

[(u'135', u'house_number'), (u'higashifunahashi 2 chome', u'road'), (u'hirakata-shi', u'house'), (u'osaka-fu', u'road')]

Is the input format wrong? I can provide more input strings if needed. I use the python wrapper.

Feature request: parse address should return guessed country code

This is a feature request: the libpostal address parser should return a guessed country code, and possibly a confidence score for that guess.

Consider this address:

332 Menzies Street, Victoria, BC V8V 2G9

Testing with src/address_parser shows that libpostal correctly identifies all the parts of this address. This strongly implies that libpostal knows that it is parsing a Canadian address (BC is an abbreviation for a Canadian province, and the postal code is in the Canadian format). It would be useful if the results included "cc_guess": 'CA' and possibly "confidence": 1.0 (on a scale from 0.0 to 1.0) for this address.

For a small address fragment, no guess may be possible; in that case the best behavior would probably be to return no country code guess at all.
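The requested behavior can be sketched with a toy heuristic. This is illustrative only: libpostal does not currently expose a country guess, and the patterns and confidence values below are invented for the example (a real implementation would presumably reuse the parser's own learned features):

```python
import re

# Hypothetical country-guessing heuristic based on telltale postal code
# formats. Not part of libpostal's API; patterns are simplified.
POSTCODE_PATTERNS = {
    "CA": re.compile(r"\b[A-Z]\d[A-Z] ?\d[A-Z]\d\b"),         # e.g. V8V 2G9
    "US": re.compile(r"\b\d{5}(?:-\d{4})?\b"),                # e.g. 20333
    "GB": re.compile(r"\b[A-Z]{1,2}\d[A-Z\d]? ?\d[A-Z]{2}\b"),
}

def guess_country(address):
    """Return (country_code, confidence), or (None, 0.0) if nothing matches."""
    upper = address.upper()
    for code, pattern in POSTCODE_PATTERNS.items():
        if pattern.search(upper):
            return code, 1.0  # toy confidence; a real model would calibrate this
    return None, 0.0

print(guess_country("332 Menzies Street, Victoria, BC V8V 2G9"))  # ('CA', 1.0)
```

A regex-only guesser obviously misfires on ambiguous fragments, which is exactly why the request above also asks for a confidence score and for no guess when the evidence is thin.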

Duplicate json fields in result

> 64000 Artix route de l'aéroport 64121

Result:

{
  "postcode": "64000",
  "house_number": "artix",
  "road": "route de l'aéroport",
  "postcode": "64121"
}
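The duplicated key is more than cosmetic: most JSON parsers silently keep only the last value for a repeated key, so the first "postcode" is lost downstream. A small stdlib Python demonstration (independent of libpostal):

```python
import json

# A JSON object with a duplicated "postcode" key, as in the report above.
raw = '{"postcode": "64000", "house_number": "artix", "postcode": "64121"}'

# json.loads keeps only the last value for a duplicated key.
print(json.loads(raw))  # {'postcode': '64121', 'house_number': 'artix'}

# object_pairs_hook preserves every (key, value) pair, duplicates included,
# so a consumer can detect or merge them instead of losing data.
pairs = json.loads(raw, object_pairs_hook=list)
print(pairs)
```

Note that libpostal's C parser API returns components as parallel arrays of strings and labels, where repeated labels are representable; the loss only happens once the result is rendered as a flat JSON object.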

US state abbreviations

Al,

Does the library recognize the US state abbreviations? Apparently not:

> ./libpostal "1 Main Street Reston VA 20333" en
1 main street reston vale 20333
> ./libpostal "1 Main Street Reston Virginia 20333" en
1 main street reston virginia 20333

However, I believe I have seen handling of state abbreviations mentioned in your commit messages. Could you clarify?

Thanks!
Anatoly
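The ambiguity here is inherent: in a multilingual abbreviation dictionary, "va" can legitimately expand to several things, of which "vale" and the state name are two. A toy sketch (the dictionary below is invented for illustration and is not libpostal's actual data) of why returning all candidate expansions, as libpostal's expand API does, is safer than committing to one:

```python
# Hypothetical abbreviation table: each token may have several senses.
ABBREVIATIONS = {
    "va": ["virginia", "vale"],  # US state vs. generic street word
    "st": ["street", "saint"],
}

def expand_token(token):
    """Return every known expansion of a token, plus the token itself."""
    candidates = ABBREVIATIONS.get(token.lower(), []) + [token.lower()]
    return sorted(set(candidates))

print(expand_token("VA"))  # ['va', 'vale', 'virginia']
```

A single-output normalizer that picks "vale" here looks like a bug to the user, whereas an indexer that stores all candidates can still match "Virginia" at query time.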
