
openaddresses's Introduction

A modular, open-source search engine for our world.

Pelias is a geocoder powered completely by open data, available freely to everyone.

Local Installation · Cloud Webservice · Documentation · Community Chat

What is Pelias?
Pelias is a search engine for places worldwide, powered by open data. It turns addresses and place names into geographic coordinates, and turns geographic coordinates into places and addresses. With Pelias, you’re able to turn your users’ place searches into actionable geodata and transform your geodata into real places.

We think open data, open source, and open strategy win over proprietary solutions at any part of the stack and we want to ensure the services we offer are in line with that vision. We believe that an open geocoder improves over the long-term only if the community can incorporate truly representative local knowledge.

Pelias OpenAddresses importer


Overview

The OpenAddresses importer is used to process data from OpenAddresses for import into the Pelias geocoder.

Requirements

Node.js is required. See Pelias software requirements for supported versions.

Installation

For instructions on setting up Pelias as a whole, see our getting started guide. Further instructions here pertain to the OpenAddresses importer only.

git clone https://github.com/pelias/openaddresses
cd openaddresses
npm install

Data Download

Use the imports.openaddresses.files configuration option to limit the download to just the OpenAddresses files of interest. Refer to the OpenAddresses data listing for file names.
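For example, to download only the New York City data (the same file used in the sample configuration below), the relevant portion of pelias.json would look like:

{
  "imports": {
    "openaddresses": {
      "files": [ "us/ny/city_of_new_york.csv" ]
    }
  }
}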

npm run download

Usage

# show full command line options
node import.js --help

# run an import
npm start

Admin Lookup

OpenAddresses records do not contain information about which city, state (or other region, such as a province), or country they belong to. Pelias can compute these values from Who's on First data. For more information on how admin lookup works, see the documentation for pelias/wof-admin-lookup. By default, adminLookup is enabled. To disable it, set imports.adminLookup.enabled to false in the Pelias config.
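For example, a pelias.json fragment disabling admin lookup might look like:

{
  "imports": {
    "adminLookup": {
      "enabled": false
    }
  }
}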

Note: Admin lookup requires loading around 5GB of data into memory.

Configuration

This importer can be configured in pelias-config, in the imports.openaddresses hash. A sample configuration file might look like this:

{
  "esclient": {
    "hosts": [
      {
        "env": "development",
        "protocol": "http",
        "host": "localhost",
        "port": 9200
      }
    ]
  },
  "logger": {
    "level": "debug"
  },
  "imports": {
    "whosonfirst": {
      "datapath": "/mnt/data/whosonfirst/",
      "importPostalcodes": false,
      "importVenues": false
    },
    "openaddresses": {
      "datapath": "/mnt/data/openaddresses/",
      "files": [ "us/ny/city_of_new_york.csv" ]
    }
  }
}

The following configuration options are supported by this importer.

imports.openaddresses.datapath

  • Required: yes
  • Default: ``

The absolute path to a directory where OpenAddresses data is located. The download command will also automatically place downloaded files in this directory.

imports.openaddresses.files

  • Required: no
  • Default: []

An array of OpenAddresses files to be downloaded (full list can be found on the OpenAddresses results site). If no files are specified, the full planet data files (11GB+) will be downloaded.

imports.openaddresses.missingFilesAreFatal

  • Required: no
  • Default: false

If set to true, any missing files will immediately halt the importer with an error. Otherwise, the importer will continue processing with a warning. With this option set to false, the data downloader will likewise continue past any download errors.

imports.openaddresses.dataHost

  • Required: no
  • Default: https://data.openaddresses.io

The location from which to download OpenAddresses data. By default, the primary OpenAddresses servers are used. This can be overridden to allow downloading customized data. Paths are supported (for example, https://yourhost.com/path/to/your/data), but must not end with a trailing slash.

S3 buckets are supported. Files will be downloaded using aws-cli.

For example: s3://data.openaddresses.io.

Note: When using S3, you might need authentication (an IAM instance role, environment variables, etc.)

imports.openaddresses.s3Options

  • Required: no

If imports.openaddresses.dataHost is an S3 bucket, these options are appended to the aws-cli command. For example: --profile my-profile

This is useful, for example, when downloading from s3://data.openaddresses.io, as that bucket requires the requester to pay for data transfer. In that case, use the following option: --request-payer
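For example, a pelias.json fragment combining an S3 data host with the requester-pays option described above might look like this (a sketch, mirroring the values mentioned in this section):

{
  "imports": {
    "openaddresses": {
      "datapath": "/mnt/data/openaddresses/",
      "dataHost": "s3://data.openaddresses.io",
      "s3Options": "--request-payer"
    }
  }
}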

Parallel Importing

Because OpenAddresses consists of many small files, this importer can be configured to run several instances in parallel that coordinate to import all the data.

To use this functionality, replace calls to npm start with

npm run parallel 3 # replace 3 with your desired level of parallelism

Generally, a parallelism of 2 or 3 is suitable for most tasks.
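Conceptually, each parallel instance imports only a subset of the configured files. A minimal sketch of one possible partitioning scheme (purely illustrative, not the importer's actual internals) might look like:

// Illustrative only: give each of `total` workers every n-th file.
const allFiles = ['us/ny/city_of_new_york.csv', 'us/ak/anchorage.csv'];

const filesForWorker = (files, workerId, total) =>
  files.filter((file, index) => index % total === workerId);

// e.g. worker 1 of 3 handles the files at indices 1, 4, 7, ...
const myFiles = filesForWorker(allFiles, 1, 3);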


openaddresses's Issues

retrieve admin values through coarse reverse-geocoding

The importer currently retrieves admin values by scraping them from OpenAddresses CSV filenames and their corresponding config JSON files. Since the data present there is only meant to suggest the rough region that any set of addresses belongs to, it's occasionally inaccurate. We should migrate to a model that finds admin values by performing coarse reverse-geocoding against a polygon dataset, like Quattroshapes.
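A minimal sketch of the proposed approach (the lookup function here is purely hypothetical; see the Admin Lookup section above for the real pelias/wof-admin-lookup implementation):

// Hypothetical: resolve admin names from coordinates instead of filenames.
// `lookupByPoint` stands in for a coarse point-in-polygon service.
function assignAdminValues(record, lookupByPoint) {
  const admins = lookupByPoint(record.lon, record.lat); // e.g. { country, region, locality }
  return Object.assign({}, record, admins);
}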

ECONNREFUSED when attempting to import

I am attempting to import one CSV file of OpenAddresses data and am encountering this error. I have previously created an index.

% node import.js

2016-05-05T03:42:56.151Z - info: [openaddresses] Importing 1 files.
2016-05-05T03:42:57.091Z - info: [openaddresses] Creating read stream for: /mnt/pelias/openaddresses/alameda.csv
Elasticsearch ERROR: 2016-05-05T03:42:59Z
  Error: Request error, retrying
  POST http://localhost:9200/_bulk => connect ECONNREFUSED 127.0.0.1:9200
      at Log.error (/home/migurski/pelias-openaddresses/node_modules/elasticsearch/src/lib/log.js:225:56)
      at checkRespForFailure (/home/migurski/pelias-openaddresses/node_modules/elasticsearch/src/lib/transport.js:240:18)
      at HttpConnector.<anonymous> (/home/migurski/pelias-openaddresses/node_modules/elasticsearch/src/lib/connectors/http.js:162:7)
      at ClientRequest.wrapper (/home/migurski/pelias-openaddresses/node_modules/elasticsearch/node_modules/lodash/index.js:3095:19)
      at emitOne (events.js:77:13)
      at ClientRequest.emit (events.js:169:7)
      at Socket.socketErrorListener (_http_client.js:258:9)
      at emitOne (events.js:77:13)
      at Socket.emit (events.js:169:7)
      at emitErrorNT (net.js:1256:8)

Elasticsearch ERROR: 2016-05-05T03:42:59Z
  Error: Request error, retrying
  POST http://localhost:9200/_bulk => connect ECONNREFUSED 127.0.0.1:9200
      at Log.error (/home/migurski/pelias-openaddresses/node_modules/elasticsearch/src/lib/log.js:225:56)
      at checkRespForFailure (/home/migurski/pelias-openaddresses/node_modules/elasticsearch/src/lib/transport.js:240:18)
      at HttpConnector.<anonymous> (/home/migurski/pelias-openaddresses/node_modules/elasticsearch/src/lib/connectors/http.js:162:7)
      at ClientRequest.wrapper (/home/migurski/pelias-openaddresses/node_modules/elasticsearch/node_modules/lodash/index.js:3095:19)
      at emitOne (events.js:77:13)
      at ClientRequest.emit (events.js:169:7)
      at Socket.socketErrorListener (_http_client.js:258:9)
      at emitOne (events.js:77:13)
      at Socket.emit (events.js:169:7)
      at emitErrorNT (net.js:1256:8)

(repeats many times)

The ES server is responding as I would expect:

% curl -i http://localhost:9200/

HTTP/1.1 200 OK
Content-Type: application/json; charset=UTF-8
Content-Length: 332

{
  "status" : 200,
  "name" : "Sack",
  "cluster_name" : "elasticsearch",
  "version" : {
    "number" : "1.7.5",
    "build_hash" : "00f95f4ffca6de89d68b7ccaf80d148f1f70e4d4",
    "build_timestamp" : "2016-02-02T09:55:30Z",
    "build_snapshot" : false,
    "lucene_version" : "4.10.4"
  },
  "tagline" : "You Know, for Search"
}

I am using OpenJDK 8 on Ubuntu 16.04, with Node 4.2.6.

whitespace in street name

note: this may be fixed in dev, the data below was taken from the production server 18 Mar '16 and I couldn't confirm because the dev server was being rebuilt at the time.

"street": "South Albany          Avenue",
{
  "type": "Feature",
  "properties": {
    "id": "2d698c8c74484096a80758033b694967",
    "gid": "oa:address:2d698c8c74484096a80758033b694967",
    "layer": "address",
    "source": "oa",
    "name": "200 South Albany Avenue",
    "housenumber": "200",
    "street": "South Albany          Avenue",
    "country_a": "USA",
    "country": "United States",
    "region": "Florida",
    "region_a": "FL",
    "county": "Martin County",
    "locality": "Stuart",
    "neighbourhood": "Watermark Marina of Palm City",
    "confidence": 0.848,
    "label": "200 South Albany Avenue, Stuart, FL"
  },
  "geometry": {
    "type": "Point",
    "coordinates": [
      -80.256743,
      27.199085
    ]
  }
}
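A minimal sketch of one possible cleanup (illustrative only, not necessarily the importer's actual fix):

// Collapse runs of whitespace in the street name into a single space.
const street = 'South Albany          Avenue'.replace(/\s+/g, ' ').trim();
// -> 'South Albany Avenue'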

Elasticsearch errors when doing the import on a local machine

Hi, I'm attempting a worldwide import on a local machine, but have persistent Elasticsearch errors, with ES dropping connections resulting in batch_error and missing data. The typical example is as follows:

Elasticsearch WARNING: 2017-12-11T14:05:12Z
Unable to revive connection: http://localhost:9200/
Elasticsearch WARNING: 2017-12-11T14:05:12Z
No living connections
2017-12-11T14:05:12.121Z - error: [dbclient] esclient error Error: No Living connections
at sendReqWithConnection (/mnt/scratch/pelias/openstreetmap/node_modules/elasticsearch/src/lib/transport.js:225:15)
at next (/mnt/scratch/pelias/openstreetmap/node_modules/elasticsearch/src/lib/connection_pool.js:213:7)
at nextTickCallbackWith0Args (node.js:419:9)
at process._tickCallback (node.js:348:13)
2017-12-11T14:05:12.121Z - error: [dbclient] invalid resp from es bulk index operation
2017-12-11T14:05:12.121Z - info: [dbclient] retrying batch [500]
2017-12-11T14:05:13.643Z - info: [dbclient] paused=true, transient=15, current_length=0, indexed=3754500, batch_ok=7509, batch_retries=0, failed_records=0, venue=1578642, address=2175858, persec=0, batch_error=35

The errors happen reliably and within an hour when running multiple importers in parallel. Running sequentially, Geonames and WOF have imported without issues. OA and OSM start throwing errors after a few hours of importing (e.g. after about 600M OSM nodes in the above example, after getting through half of Brazil in OA).

The hardware is quite powerful (32 cores, 128 GB RAM, though not SSDs - HDD in RAID0), so I do not believe this is an issue. ES maxes out 2-3 cores during import, but then it suddenly requires more power and starts dropping connections.

Not an expert in ES (this is my first encounter), but maybe there is some reindexing going on, or maybe even heavy garbage collection. I give ES 30GB of memory (via ES_JAVA_OPTS="-Xms30g -Xmx30g") and run it by launching a daemon. The ES version is 2.4 and the system is Ubuntu 16.04. I tried both the standard settings and some suggestions found on the web. The relevant ES documentation is hard to find and seems to differ considerably across versions, with people suggesting undocumented parameters to tweak. The config that worked to import Geonames and WOF is:

threadpool.bulk.type: fixed
threadpool.bulk.size: 25
threadpool.bulk.queue_size: 1000

I imagine that there could be some ES setting to either give it more resources, keep retrying for longer (or ideally just wait more between retries), or maybe changing batch sizes? Alternatively, is there a way to throttle the importers to a given maximum inserts per second?

housenumber with a '#' prefix

"housenumber": "#2708",
{
      "type": "Feature",
      "geometry": {
        "type": "Point",
        "coordinates": [
          -73.716908,
          45.513926
        ]
      },
      "properties": {
        "id": "ca/qc/montreal:330808",
        "gid": "openaddresses:address:ca/qc/montreal:330808",
        "layer": "address",
        "source": "openaddresses",
        "source_id": "ca/qc/montreal:330808",
        "name": "#2708 rue Équinoxes",
        "housenumber": "#2708",
        "street": "rue Équinoxes",
        "confidence": 0.74,
        "country": "Canada",
        "country_gid": "whosonfirst:country:85633041",
        "country_a": "CAN",
        "region": "Quebec",
        "region_gid": "whosonfirst:region:136251273",
        "locality": "Montréal",
        "locality_gid": "whosonfirst:locality:101736545",
        "neighbourhood": "Saint-Laurent",
        "neighbourhood_gid": "whosonfirst:neighbourhood:85895749",
        "label": "#2708 rue Équinoxes, Montréal, Quebec, Canada"
      }
    },

another one:

        "id": "ca/qc/montreal:331017",
        "gid": "openaddresses:address:ca/qc/montreal:331017",
        "housenumber": "#2855",

Use `HASH` field from OA for id

OA recently started incorporating a hashcode of the field values to create a unique identifier, an example of which is:

-75.8897798,40.6627665,,QUAKER CITY RD,,ALBANY,,,,,40928065a7dc01e1

We can use this reliable persistent unique identifier to replace our id scheme.
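A minimal sketch of how the HASH column could be used (illustrative only):

// The last column of the OA CSV row above is the hash.
// Use it as a stable source_id instead of an auto-incrementing counter.
const hash = '40928065a7dc01e1';
const gid = 'openaddresses:address:' + hash;
// -> 'openaddresses:address:40928065a7dc01e1'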

An in-range update of joi is breaking the build 🚨

Version 13.4.0 of joi was just published.

Branch Build failing 🚨
Dependency joi
Current Version 13.3.0
Type dependency

This version is covered by your current version range and after updating it in your project the build failed.

joi is a direct dependency of this project, and it is very likely causing it to break. If other packages depend on yours, this update is probably also breaking those in turn.

Status Details
  • ci/circleci Your tests passed on CircleCI! Details
  • continuous-integration/travis-ci/push The Travis CI build could not complete due to an error Details

Commits

The new version differs by 13 commits.

  • f75f0d3 13.4.0
  • 759c558 Cleanup for #1499.
  • d97ca0d Merge pull request #1499 from rgoble4/dynamic-keys
  • 8a1eb96 Consider extended types parameters
  • eaefa17 review changes
  • 3a84adc update docs
  • b747016 Allow pattern to support schema objects
  • 944dbe9 Fix empty path reach. Fixes #1515.
  • 1f39ed4 Update issue templates
  • ee15213 Merge pull request #1514 from radicand/fix/1513
  • f097e37 remove indirect require reference to index.js
  • e520c6d Merge pull request #1500 from logoran/feature-date-greater-less
  • 97fb85c add date greater less rules

See the full diff

FAQ and help

There is a collection of frequently asked questions. If those don’t help, you can always ask the humans behind Greenkeeper.


Your Greenkeeper Bot 🌴

set Document address* values

pelias-model Documents now have address objects, which can contain data like house number, house name, street name, etc.; set them accordingly.
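A minimal sketch of what setting those fields might look like, assuming the current pelias-model Document API (setName, setAddress, setCentroid), with values borrowed from the whitespace example above:

const Document = require('pelias-model').Document;

const doc = new Document('openaddresses', 'address', '2d698c8c74484096a80758033b694967');
doc.setName('default', '200 South Albany Avenue');
doc.setAddress('number', '200');
doc.setAddress('street', 'South Albany Avenue');
doc.setCentroid({ lon: -80.256743, lat: 27.199085 });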

Document guid does not persist across different imports

Individual imports start off with a per-document guid of 0, meaning that, when importing multiple OpenAddresses files (which requires executing the import script once per file, since it currently doesn't support file batches), each successive import partly or entirely overwrites the previous ones.

Allow street-less records in white-listed countries

Some countries allow addresses without streets. So far, it just appears to be the Czech Republic. Here are two examples:

  • č.p. 360, 79862 Rozstání
  • č.ev. 9, 79857 Rakůvka

These can be parsed as:

{
  housenumber: 'č.p. 360',
  postcode: '79862',
  city: 'Rozstání'
}

and:

{
  housenumber: 'č.ev. 9',
  postcode: '79857',
  city: 'Rakůvka'
}
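A minimal sketch of a whitelist check (the structure and country codes shown are illustrative):

// Accept records with an empty street only for whitelisted countries.
const STREETLESS_COUNTRIES = ['cz'];

function acceptRecord(record, countryCode) {
  return record.street !== '' || STREETLESS_COUNTRIES.includes(countryCode);
}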

Negative house numbers in US

I ran across this behavior in Ohio, where it's most egregious, but there are just shy of 35,000 addresses in that state with a negative house number. These should not be considered valid and should be filtered out during import.
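A minimal sketch of such a filter (illustrative; the field name follows the OA CSV NUMBER column):

// Reject house numbers that parse to a negative value.
function hasValidHouseNumber(record) {
  const n = parseFloat(record.NUMBER);
  return !(isFinite(n) && n < 0);
}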

strip quotation marks from fields

Quotation marks don't appear to get stripped, which results in records like the following getting indexed:

{
    "bbox": [
        -58.4255,
        -34.6249,
        -58.4255,
        -34.6249
    ],
    "date": 1423086910066,
    "features": [
        {
            "geometry": {
                "coordinates": [
                    -58.4255,
                    -34.6249
                ],
                "type": "Point"
            },
            "properties": {
                "admin0": "Argentina",
                "admin1": "Ciudad de Buenos Aires",
                "alpha3": "ARG",
                "id": "41578ad2c4144d9ca27144efef27d2ce",
                "layer": "openaddresses",
                "name": "4265 \"Calvo",
                "neighborhood": "Cafferata",
                "text": "4265 \"Calvo, Cafferata, Ciudad de Buenos Aires",
                "type": "openaddresses"
            },
            "type": "Feature"
        }
    ],
    "type": "FeatureCollection"
}
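A minimal sketch of one possible cleanup (illustrative only):

// Strip stray double quotes left over from malformed CSV quoting.
const name = '4265 "Calvo'.replace(/"/g, '').trim();
// -> '4265 Calvo'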

Limit needed in download_filtered.js

In order to get data for the entire US, I added all 1,600 of the current US OpenAddresses files to imports.openaddresses.files and then ran npm run download. But I didn't realize that this would start 1,600 asynchronous curl jobs (which crashed the server). Is there a way to limit this to, say, 10 or so concurrent curl jobs at a time?
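A minimal sketch of how such a limit could work, assuming the async module (the downloadFile helper here is hypothetical and stands in for the existing per-file curl logic):

const async = require('async');

// Download at most 10 files at a time instead of firing off every curl job at once.
function downloadAll(files, downloadFile, done) {
  async.eachLimit(files, 10, downloadFile, done);
}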

OpenAddresses loader will stop processing if given the wrong .csv name in an array of .csv names in pelias.json

The OpenAddresses loader will stop if given the wrong .csv name in an array of .csv names in pelias.json. I would rather be warned that one of the .csv file names is invalid, but have the loader continue to download and process the other .csv files in the list. I can see arguments for having the loader just die on such an error, but in the docker environment, all the other loaders (e.g., polylines and others that use OA data for interpolation) continue to process data … but it's incomplete, since the OA loader didn't download the other valid .csv files.

NOTE: one other time, with a valid set of file names, I also saw the OA loader exit early due to some hiccup in downloading OA data (i.e., it was unclear whether the .zip file didn't download correctly, a file in the .zip was corrupt, or the .csv or .vrt metadata file was missing from the .zip; the error message was a bit cryptic) ... and just like above, it would have been better for me to just see a warning within the context of the docker-compose environment.

Diana's response on October 19th, 2017:

The OA importer bails after an error because, for our build purposes, we wanted to know as soon as possible that something was wrong, but I can see how you'd rather just finish the files that can be finished and then worry about the missing or invalid ones separately. If you create an issue, the team can discuss whether that's something we could change.

Refactor for cleanliness

There are several streams that include helper requires that I think were written before we could easily test streams. Now that we can easily test streams, these helpers are superfluous. Refactor them to be simpler.

import.js filepath is not taken.

When I run import.js in bash (exec folder is 'openaddresses-master', PELIAS_CONF set to my conf file)

The importer looks for the file in the execution folder instead of using the datapath set in the config.

my config:

"openaddresses": {
      "datapath": "\\\\prod\\..(morepath)...\\openaddresses-collected\\be\\wa\\",
      "files": ["brussels-fr.csv"]
    }

I guess something goes wrong here, as the "filePath" variable should be the absolute path, but it contains only the file name.

logger.info( 'Importing %s files.', files.length );
  files.forEach( function forEach( filePath ){
    recordStream.append( function ( next ){
      logger.info( 'Creating read stream for: ' + filePath );
      next( importPipelines.createRecordStream( filePath ) );
    });
  });
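A minimal sketch of the kind of fix this suggests (illustrative; config access shown via pelias-config):

const path = require('path');
const config = require('pelias-config').generate();

// Resolve each configured file against the datapath rather than the
// current working directory.
const datapath = config.imports.openaddresses.datapath;
const filePath = path.join(datapath, 'brussels-fr.csv');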

add usage documentation

Add documentation once all the moving parts of this pipeline are in place and mostly stable.

  • Enhance README.md
  • Add command-line usage message.

Configure admin lookup and deduplication via pelias.config

Unlike most of our other importers, this importer only allows configuration of admin lookup and deduplication via command-line flags. This makes it difficult to point to a single place where configuration changes can be made, whether in our production/dev builds, in the vagrant image, or even in a local dev setup. It also makes our Chef configuration more complicated, as the Chef recipes have to know about our openaddresses configuration, rather than just starting the importer and letting the importer worry about configuration.

Connects pelias/pelias#255

import.js doesn't check to see if the deduplicator is running

Noted in --help:

        --deduplicate: (advanced use) Deduplicate addresses using the
                OpenVenues deduplicator: https://github.com/openvenues/address_deduper.
                It must be running at localhost:5000.

This point isn't very clear in the docs, and the import script will keep looping around [address-deduplicator] without throwing an error or checking whether the service is running. :-(
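A minimal sketch of a pre-flight check (illustrative; the deduplicator may not expose a dedicated health endpoint, so this simply checks that something answers on the port):

const http = require('http');

// Fail fast if nothing is listening at localhost:5000 before starting the import.
http.get('http://localhost:5000/', (res) => {
  res.resume(); // a response of any kind means the service is up
  // start the import here
}).on('error', () => {
  console.error('address deduplicator is not running at localhost:5000');
  process.exit(1);
});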

error on download

Hi,

somehow I am getting this error on download

user@server:/opt/pelias/openaddresses# node utils/download_data.js
2018-01-24T19:27:11.348Z - info: [download] Attempting to download all data
2018-01-24T19:27:11.353Z - debug: [download] downloading https://s3.amazonaws.com/data.openaddresses.io/openaddr-collected-global.zip
2018-01-24T19:52:11.771Z - debug: [download] unzipping /tmp/tmp-10khLEclwr1Pwr.zip to /data/openaddresses
2018-01-24T19:56:48.095Z - error: [download] Failed to download data Error: Command failed: unzip -o -qq -d /data/openaddresses /tmp/tmp-10khLEclwr1Pwr.zip

unzip is installed:

user@server:/opt/pelias/openaddresses# unzip
UnZip 6.00 of 20 April 2009, by Debian. Original by Info-ZIP.

Usage: unzip [-Z] [-opts[modifiers]] file[.zip] [list] [-x xlist] [-d exdir]
...

Any ideas?

City and region information being dropped

Gday all,

Thanks so much for the hard work on Pelias and its associated pieces.

I've just imported the Australian countrywide dataset into my Pelias instance. When I open the CSV from OpenAddresses I can see City and Region columns that are populated with useful info. However, it looks like this information gets dropped when I import it into Pelias.

It looks like the hosted Mapzen search instance relies on this information coming from WhosOnFirst.

Is there anything that I can set anywhere to help retain this information and map it to the relevant pelias fields for import rather than relying on WhosOnFirst?

Thanks,
Rowan

Add more stats about import process

It would be cool to get a report of exactly how many addresses were parsed during an import.

Something like this at the end of the import (or maybe for each file):
Total records in file: 900707
Records skipped due to missing data: 50383
Records skipped by deduplicator: 20837
Total records imported: 829487

Identify source file in ID field

OpenAddresses IDs as stored in our Elasticsearch index are currently just an auto-incrementing integer, which isn't very helpful to OA data users and may not be unique across our different sources. IDs should be changed to identify the source file and the row within that source file.

Error: Delimiter not found in the file ","

Ran the import against all OpenAddresses state files and the following error was thrown. I'm not sure if the process actually completed or not at this point; it looks like it may have, judging from the final info log record.

2016-11-10T17:36:53.553Z - verbose: [dbclient]  paused=false, transient=4, current_length=18, indexed=5463000, batch_ok=10926, batch_retries=0, failed_records=0, address=5463000, persec=2750
2016-11-10T17:36:56.697Z - verbose: [openaddresses] Number of bad records: 58679
2016-11-10T17:37:03.757Z - verbose: [dbclient]  paused=false, transient=4, current_length=406, indexed=5490500, batch_ok=10981, batch_retries=0, failed_records=0, address=5490500, persec=2750
2016-11-10T17:37:06.953Z - verbose: [openaddresses] Number of bad records: 64436
2016-11-10T17:37:13.662Z - info: [openaddresses] Total time taken: 1747.274s
events.js:141
      throw er; // Unhandled 'error' event
      ^

Error: Delimiter not found in the file ","
    at Error (native)
    at Parser.__write (/home/vagrant/openaddresses/node_modules/csv-parse/lib/index.js:439:13)
    at Parser._transform (/home/vagrant/openaddresses/node_modules/csv-parse/lib/index.js:172:10)
    at Transform._read (_stream_transform.js:167:10)
    at Transform._write (_stream_transform.js:155:12)
    at doWrite (_stream_writable.js:300:12)
    at writeOrBuffer (_stream_writable.js:286:5)
    at Writable.write (_stream_writable.js:214:11)
    at ReadStream.ondata (_stream_readable.js:542:20)
    at emitOne (events.js:77:13)
    at ReadStream.emit (events.js:169:7)

Version 10 of node.js has been released

Version 10 of Node.js (code name Dubnium) has been released! 🎊

To see what happens to your code in Node.js 10, Greenkeeper has created a branch with the following changes:

  • Added the new Node.js version to your .travis.yml
  • The new Node.js version is in-range for the engines in 1 of your package.json files, so that was left alone

If you’re interested in upgrading this repo to Node.js 10, you can open a PR with these changes. Please note that this issue is just intended as a friendly reminder and the PR as a possible starting point for getting your code running on Node.js 10.

More information on this issue

Greenkeeper has checked the engines key in any package.json file, the .nvmrc file, and the .travis.yml file, if present.

  • engines was only updated if it defined a single version, not a range.
  • .nvmrc was updated to Node.js 10
  • .travis.yml was only changed if there was a root-level node_js that didn’t already include Node.js 10, such as node or lts/*. In this case, the new version was appended to the list. We didn’t touch job or matrix configurations because these tend to be quite specific and complex, and it’s difficult to infer what the intentions were.

For many simpler .travis.yml configurations, this PR should suffice as-is, but depending on what you’re doing it may require additional work or may not be applicable at all. We’re also aware that you may have good reasons to not update to Node.js 10, which is why this was sent as an issue and not a pull request. Feel free to delete it without comment, I’m a humble robot and won’t feel rejected 🤖


FAQ and help

There is a collection of frequently asked questions. If those don’t help, you can always ask the humans behind Greenkeeper.


Your Greenkeeper Bot 🌴

Add support for blacklisted sources

Occasionally OA sources just have terrible data. For example, Lamar County, TX has, more often than not, wildly incomplete data that extends beyond 0 house numbers:

  • 215 SE 33RD - no street type and directional should be post
  • 2 W PLAZA & 135 BONHAM - inexplicable house number with intersection
  • OLD BELK'S PARKING LOT - GIS admin is on drugs
  • 112 BONHAM ST - SHOE STORE - ibid
  • 2ND NE @ E KAUFFMAN - intersection
  • 800 BLK JACKSON ST - block address

It would be programmatically difficult to determine address validity on a per-row basis, so this solution would blacklist an entire source to reduce data pollution when the majority of a source is bad data.
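A minimal sketch of how a source blacklist could be applied (source names are illustrative):

// Skip entire sources known to contain mostly bad data.
const BLACKLISTED_SOURCES = ['us/tx/lamar'];

function isBlacklisted(filename) {
  return BLACKLISTED_SOURCES.some((source) => filename.indexOf(source) === 0);
}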

Remove US/CA house numbers reducible to 0

A previous issue filtered out US/CA house numbers that were a literal 0. There are, however, ~67k addresses that have house numbers reducible to 0, such as 00 and 0000. Filter these out.
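A minimal sketch of such a filter (illustrative only):

// Reject house numbers made up entirely of zeros ('0', '00', '0000', ...).
function isAllZeros(housenumber) {
  return /^0+$/.test(String(housenumber).trim());
}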

Import Hangs on OpenAddresses Phase

Working on a full-country USA import for Pelias, and all data sources imported fine except OpenAddresses, which seems to simply repeat the same log items over and over again without using up much CPU or increasing disk space usage.

Not quite sure what the cause might be, so I would love some documentation on what the output is supposed to look like, or what kinds of things to check to see if progress is being made silently.

pelias.json

{
  "esclient": {
    "apiVersion": "2.3",
    ...
  },
  "elasticsearch": {
    "settings": {
      "index": {
        "number_of_replicas": "0",
        "number_of_shards": "10",
        "refresh_interval": "1m"
      }
    }
  },
  "interpolation": {
    "client": {
      "adapter": "null"
    }
  },
  "dbclient": {
    "statFrequency": 10000
  },
  "api": {
    "accessLog": "common",
    "host": "http://pelias.mapzen.com/",
    "indexName": "pelias",
    "version": "1.0",
    "textAnalyzer": "libpostal"
  },
  "schema": {
    "indexName": "pelias"
  },
  "logger": {
    "level": "debug",
    "timestamp": true,
    "colorize": true
  },
  "acceptance-tests": {
    "endpoints": {
      "local": "http://localhost:3100/v1/",
      "dev-cached": "http://pelias.dev.mapzen.com.global.prod.fastly.net/v1/",
      "dev": "http://pelias.dev.mapzen.com/v1/",
      "prod": "http://search.mapzen.com/v1/",
      "prod-uncached": "http://pelias.mapzen.com/v1/",
      "prodbuild": "http://pelias.prodbuild.mapzen.com/v1/"
    }
  },
  "imports": {
    "adminLookup": {
      "enabled": true,
      "maxConcurrentRequests": 100
    },
    "geonames": {
      "datapath": "/opt/mount/pelias-data/geonames",
      "countryCode": "US"
    },
    "openstreetmap": {
      "datapath": "/opt/mount/pelias-data/openstreetmap",
      "leveldbpath": "/tmp",
      "import": [{
        "filename": "us-midwest-latest.osm.pbf"
      },{
        "filename": "us-northeast-latest.osm.pbf"
      },{
        "filename": "us-pacific-latest.osm.pbf"
      },{
        "filename": "us-south-latest.osm.pbf"
      },{
        "filename": "us-west-latest.osm.pbf"
      }]
    },
    "openaddresses": {
      "datapath": "/opt/mount/pelias-data/openaddresses",
      "files": [],
      "deduplicate": true
    },
    "polyline": {
      "datapath": "/opt/mount/pelias-data/polylines",
      "files": [
        "road_network"
      ]
    },
    "whosonfirst": {
      "datapath": "/opt/mount/pelias-data/whosonfirst",
      "importVenues": false,
      "importPostalcodes": true
    }
  }
}

Log output:

nohup: ignoring input

> [email protected] start /opt/pelias-src/openaddresses
> node import.js

2017-05-05T22:47:56.756Z - info: [openaddresses] Setting up deduplicator.
2017-05-05T22:47:56.892Z - info: [openaddresses] Importing 2906 files.
2017-05-05T22:47:58.015Z - info: [openaddresses] Creating read stream for: /opt/mount/pelias-data/openaddresses/summary/us/ak/anchorage-summary.csv
2017-05-05T22:47:58.173Z - verbose: [openaddresses] number of invalid records skipped: 36
2017-05-05T22:47:58.175Z - info: [openaddresses] Creating read stream for: /opt/mount/pelias-data/openaddresses/summary/us/ak/city_of_juneau-summary.csv
2017-05-05T22:47:58.319Z - verbose: [openaddresses] number of invalid records skipped: 78
2017-05-05T22:47:58.320Z - info: [openaddresses] Creating read stream for: /opt/mount/pelias-data/openaddresses/summary/us/ak/fairbanks_north_star_borough-summary.csv
2017-05-05T22:47:58.355Z - verbose: [openaddresses] number of invalid records skipped: 82
...
2017-05-05T22:48:22.754Z - info: [openaddresses] Creating read stream for: /opt/mount/pelias-data/openaddresses/summary/us/wy/statewide-summary.csv
2017-05-05T22:48:23.681Z - verbose: [openaddresses] number of invalid records skipped: 2572
2017-05-05T22:48:23.682Z - info: [openaddresses] Creating read stream for: /opt/mount/pelias-data/openaddresses/summary/us/wy/sublette-summary.csv
2017-05-05T22:48:23.706Z - verbose: [openaddresses] number of invalid records skipped: 98
2017-05-05T22:48:23.706Z - info: [openaddresses] Creating read stream for: /opt/mount/pelias-data/openaddresses/summary/us/wy/teton-summary.csv
2017-05-05T22:48:23.720Z - verbose: [openaddresses] number of invalid records skipped: 41
2017-05-05T22:48:23.720Z - info: [openaddresses] Creating read stream for: /opt/mount/pelias-data/openaddresses/us/ak/anchorage.csv
2017-05-05T22:48:26.832Z - verbose: [address-deduplicator]  total=1000, duplicates=0, uniques=0, timeSpentPaused=0
2017-05-05T22:48:33.727Z - verbose: [openaddresses] Number of bad records: 1
2017-05-05T22:48:36.834Z - verbose: [address-deduplicator]  total=1000, duplicates=0, uniques=0, timeSpentPaused=0
2017-05-05T22:48:43.730Z - verbose: [openaddresses] Number of bad records: 1
2017-05-05T22:48:45.713Z - info: [wof-pip-service:master] country worker loaded 218 features in 47.967 seconds
2017-05-05T22:48:46.836Z - verbose: [address-deduplicator]  total=1000, duplicates=0, uniques=0, timeSpentPaused=0
2017-05-05T22:48:47.288Z - info: [wof-pip-service:master] region worker loaded 4874 features in 49.431 seconds
2017-05-05T22:48:53.734Z - verbose: [openaddresses] Number of bad records: 1
2017-05-05T22:48:56.838Z - verbose: [address-deduplicator]  total=1000, duplicates=0, uniques=0, timeSpentPaused=0
2017-05-05T22:49:03.738Z - verbose: [openaddresses] Number of bad records: 1
...
2017-05-06T02:57:46.095Z - verbose: [address-deduplicator]  total=1000, duplicates=0, uniques=0, timeSpentPaused=0
2017-05-06T02:57:56.105Z - verbose: [openaddresses] Number of bad records: 1
2017-05-06T02:57:56.105Z - verbose: [address-deduplicator]  total=1000, duplicates=0, uniques=0, timeSpentPaused=0
2017-05-06T02:58:06.115Z - verbose: [openaddresses] Number of bad records: 1
2017-05-06T02:58:06.116Z - verbose: [address-deduplicator]  total=1000, duplicates=0, uniques=0, timeSpentPaused=0
2017-05-06T02:58:16.126Z - verbose: [openaddresses] Number of bad records: 1
2017-05-06T02:58:16.126Z - verbose: [address-deduplicator]  total=1000, duplicates=0, uniques=0, timeSpentPaused=0
2017-05-06T02:58:26.135Z - verbose: [openaddresses] Number of bad records: 1
2017-05-06T02:58:26.135Z - verbose: [address-deduplicator]  total=1000, duplicates=0, uniques=0, timeSpentPaused=0
2017-05-06T02:58:36.146Z - verbose: [openaddresses] Number of bad records: 1
2017-05-06T02:58:36.146Z - verbose: [address-deduplicator]  total=1000, duplicates=0, uniques=0, timeSpentPaused=0
2017-05-06T02:58:46.156Z - verbose: [openaddresses] Number of bad records: 1
2017-05-06T02:58:46.156Z - verbose: [address-deduplicator]  total=1000, duplicates=0, uniques=0, timeSpentPaused=0
2017-05-06T02:58:56.161Z - verbose: [openaddresses] Number of bad records: 1
2017-05-06T02:58:56.161Z - verbose: [address-deduplicator]  total=1000, duplicates=0, uniques=0, timeSpentPaused=0
2017-05-06T02:59:06.171Z - verbose: [openaddresses] Number of bad records: 1
2017-05-06T02:59:06.171Z - verbose: [address-deduplicator]  total=1000, duplicates=0, uniques=0, timeSpentPaused=0
2017-05-06T02:59:16.181Z - verbose: [openaddresses] Number of bad records: 1
2017-05-06T02:59:16.182Z - verbose: [address-deduplicator]  total=1000, duplicates=0, uniques=0, timeSpentPaused=0

refactor into more modules

The package's main entry point, import.js, has become a little unwieldy and contains unrelated pieces of functionality (argument handling, import pipeline construction, etc.). Partition it into multiple modules.

An in-range update of csv-parse is breaking the build 🚨

Version 1.3.0 of csv-parse was just published.

Branch Build failing 🚨
Dependency csv-parse
Current Version 1.2.4
Type dependency

This version is covered by your current version range and after updating it in your project the build failed.

csv-parse is a direct dependency of this project, and it is very likely causing it to break. If other packages depend on yours, this update is probably also breaking those in turn.

Status Details
  • ci/circleci Your tests passed on CircleCI! Details
  • continuous-integration/travis-ci/push The Travis CI build failed Details

Commits

The new version differs by 6 commits.

  • d11de9d package: bump to version 1.3.0
  • 61762a5 test: should require handled by mocha
  • b0fe635 package: coffeescript 2 and use semver tilde
  • b347b69 Allow auto_parse to be a function and override default parsing
  • 8359816 Allow user to pass in custom date parsing functionn
  • f87c273 options: ensure objectMode is cloned

See the full diff

FAQ and help

There is a collection of frequently asked questions. If those don’t help, you can always ask the humans behind Greenkeeper.


Your Greenkeeper Bot 🌴

Where is the list of valid CSVs for import?

I tried importing au-countrywide.csv from https://openaddresses.io/ without any luck. I noticed through browsing the code of the vagrant project, you host some files here: http://data.openaddresses.io.s3.amazonaws.com/

I was able to download files such as au-queensland.zip and through the naming syntax find other Australian states to download.

Can you please update the documentation and provide a list of valid downloads? The documentation basically says you can use the files from openaddresses.io, but none that I have tried have worked. Only the ones you host (for Australia).

Many streetnames are all uppercase

shouting allthestreetnames

{
  "type": "Feature",
  "geometry": {
    "type": "Point",
    "coordinates": [
      138.561943,
      -34.837527
    ]
  },
  "properties": {
    "id": "au/countrywide:1799337",
    "gid": "openaddresses:address:au/countrywide:1799337",
    "layer": "address",
    "source": "openaddresses",
    "name": "18 GLASGOW STREET",
    "housenumber": "18",
    "street": "GLASGOW STREET",
    ...
    "country_a": "AUS",
    "region": "South Australia",
    ...
  }
}

we could resolve this issue with some code like this:

if (name === name.toUpperCase()) {
  name = name
    .toLowerCase()
    .split(' ')
    .map((word) => word.charAt(0).toUpperCase() + word.slice(1))
    .join(' ');
}

.. it wouldn't be perfect but considering it only works on names which are completely uppercase to begin with, it can't really get any worse than the original source data.

Maximum call stack size exceeded when running `pelias openaddresses import` using node LTS (v4)

# pelias openaddresses import
pelias: Cloning repo.
pelias: npm installing.
WARN engine [email protected]: wanted: {"node":">=0.8 <0.11"} (current: {"node":"4.2.6","npm":"2.14.12"})
npm WARN deprecated [email protected]: This package is no longer maintained. See its readme for upgrade details.
2016-02-05T21:22:03.133Z - info: [openaddresses] Importing 665 files.
2016-02-05T21:22:03.319Z - info: [openaddresses] Total time taken: .243s
events.js:397
EventEmitter.listenerCount = function(emitter, type) {
                                     ^

RangeError: Maximum call stack size exceeded
    at Function.EventEmitter.listenerCount (events.js:397:38)
    at Log.listenerCount (/root/.pelias/openaddresses/node_modules/pelias-dbclient/node_modules/elasticsearch/src/lib/log.js:68:25)
    at Function.EventEmitter.listenerCount (events.js:399:20)
    at Log.listenerCount (/root/.pelias/openaddresses/node_modules/pelias-dbclient/node_modules/elasticsearch/src/lib/log.js:68:25)
    at Function.EventEmitter.listenerCount (events.js:399:20)
    at Log.listenerCount (/root/.pelias/openaddresses/node_modules/pelias-dbclient/node_modules/elasticsearch/src/lib/log.js:68:25)
    at Function.EventEmitter.listenerCount (events.js:399:20)
    at Log.listenerCount (/root/.pelias/openaddresses/node_modules/pelias-dbclient/node_modules/elasticsearch/src/lib/log.js:68:25)
    at Function.EventEmitter.listenerCount (events.js:399:20)
    at Log.listenerCount (/root/.pelias/openaddresses/node_modules/pelias-dbclient/node_modules/elasticsearch/src/lib/log.js:68:25)
child_process.js:507
    throw err;
    ^

Error: Command failed: node import.js 
    at checkExecSyncError (child_process.js:464:13)
    at Object.execSync (child_process.js:504:13)
    at runRepoSubcommand (/usr/local/lib/node_modules/pelias-cli/pelias.js:193:16)
    at Object.<anonymous> (/usr/local/lib/node_modules/pelias-cli/pelias.js:196:1)
    at Module._compile (module.js:410:26)
    at Object.Module._extensions..js (module.js:417:10)
    at Module.load (module.js:344:32)
    at Function.Module._load (module.js:301:12)
    at Function.Module.runMain (module.js:442:10)
    at startup (node.js:136:18)

Works fine with 0.12.x

Request Timeout during Import

I'm in the process of installing Pelias without vagrant on my Ubuntu machine as well as a Centos machine.

On the Ubuntu machine I tried to npm install openaddresses and run node import.js using node version v0.10.38, which resulted in the following error.

node import.js
2016-04-11T21:59:53.991Z - info: [openaddresses] Importing 1 files.
2016-04-11T21:59:54.202Z - info: [openaddresses] Creating read stream for: /home/user/pelias/openaddresses/data/au/countrywide.csv
2016-04-11T22:08:41.242Z - error: [dbclient] esclient error Error: Request Timeout after 120000ms
    at /home/kent/pelias/openaddresses/node_modules/pelias-dbclient/node_modules/elasticsearch/src/lib/transport.js:340:15
    at null.<anonymous> (/home/kent/pelias/openaddresses/node_modules/pelias-dbclient/node_modules/elasticsearch/src/lib/transport.js:369:7)
    at Timer.listOnTimeout [as ontimeout] (timers.js:112:15)
2016-04-11T22:08:41.243Z - error: [dbclient] invalid resp from es bulk index operation
2016-04-11T22:08:41.243Z - info: [dbclient] retrying batch [500]

This was resolved by switching to a newer version of node (v0.12.0). I could import a large dataset like Australia (countrywide) without an issue. I've also tried using both master and production branches.

I tried to do this on a Centos machine and got the above error with both versions of node. Would this be an issue with the OS I am using or a RAM problem?

add support for importing an entire directory of OpenAddresses files

The import script currently only supports importing one file per run, meaning that importing multiple files requires invoking it as many times. This involves a number of inconveniences, so add support for importing numerous files via a CombinedStream or something similar.
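A minimal sketch using the combined-stream module, which appears to be the same approach the import code quoted earlier uses, chaining one read stream per file:

const CombinedStream = require('combined-stream');
const fs = require('fs');

// Lazily open each file only when the previous one has finished streaming.
function createFullRecordStream(filePaths) {
  const recordStream = CombinedStream.create();
  filePaths.forEach((filePath) => {
    recordStream.append((next) => next(fs.createReadStream(filePath)));
  });
  return recordStream;
}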

Correct house numbers prefixed with `0`

There are currently 125,630 OA records whose house numbers are prefixed with 0, which causes searching and display issues. This is predominantly a problem in the US.
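A minimal sketch of one possible correction (illustrative only):

// Strip leading zeros while keeping the rest of the house number.
const corrected = '0042'.replace(/^0+(?=\d)/, '');
// -> '42'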

Filter out null island addresses

There are 41,789 addresses in OA that have lat/lon of 0/0:

co/statewide.csv:0,0,411,Chelsea Court,,ELIZABETH,,,80107,,ee0c56b26029664d
co/statewide.csv:0,0,2112,Rawhide Drive,,ELIZABETH,,,80107,,7cf4a32b00f95b41
co/statewide.csv:0,0,875,South Mobile Street,,ELIZABETH,,,80107,,c5b8525a9dc83ece
co/statewide.csv:0,0,27482,East Broadview Drive,,KIOWA,,,80117,,2e487d9d8f1bd017
co/statewide.csv:0,0,1980,Legacy Circle,,ELIZABETH,,,80107,,06f78ce679f99256

These should be filtered out.

See pelias/leaflet-plugin#163
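A minimal sketch of such a filter (column names follow the OA CSV LON/LAT fields; illustrative only):

// Drop records sitting exactly at 0,0 ("null island").
function isNullIsland(record) {
  return parseFloat(record.LON) === 0 && parseFloat(record.LAT) === 0;
}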
