
openaddresses's Introduction

A modular, open-source search engine for our world.

Pelias is a geocoder powered completely by open data, available freely to everyone.

Local Installation · Cloud Webservice · Documentation · Community Chat

What is Pelias?
Pelias is a search engine for places worldwide, powered by open data. It turns addresses and place names into geographic coordinates, and turns geographic coordinates into places and addresses. With Pelias, you’re able to turn your users’ place searches into actionable geodata and transform your geodata into real places.

We think open data, open source, and open strategy win over proprietary solutions at any part of the stack and we want to ensure the services we offer are in line with that vision. We believe that an open geocoder improves over the long-term only if the community can incorporate truly representative local knowledge.

Pelias OpenAddresses importer


Overview

The OpenAddresses importer is used to process data from OpenAddresses for import into the Pelias geocoder.

Requirements

Node.js is required. See Pelias software requirements for supported versions.

Installation

For instructions on setting up Pelias as a whole, see our getting started guide. Further instructions here pertain to the OpenAddresses importer only.

git clone https://github.com/pelias/openaddresses
cd openaddresses
npm install

Data Download

Use the imports.openaddresses.files configuration option to limit the download to just the OpenAddresses files of interest. Refer to the OpenAddresses data listing for file names.
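For example, to download only the New York City data (the same file used in the sample configuration below), the relevant portion of pelias.json would look like:

{
  "imports": {
    "openaddresses": {
      "files": [ "us/ny/city_of_new_york.csv" ]
    }
  }
}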

npm run download

Usage

# show full command line options
node import.js --help

# run an import
npm start

Admin Lookup

OpenAddresses records do not contain information about which city, state (or other region, such as a province), or country they belong to. Pelias can compute these values from Who's on First data. For more information on how admin lookup works, see the documentation for pelias/wof-admin-lookup. By default, adminLookup is enabled. To disable it, set imports.adminLookup.enabled to false in the Pelias config.
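For example, a pelias.json fragment disabling admin lookup might look like:

{
  "imports": {
    "adminLookup": {
      "enabled": false
    }
  }
}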

Note: Admin lookup requires loading around 5GB of data into memory.

Configuration

This importer can be configured in pelias-config, in the imports.openaddresses hash. A sample configuration file might look like this:

{
  "esclient": {
    "hosts": [
      {
        "env": "development",
        "protocol": "http",
        "host": "localhost",
        "port": 9200
      }
    ]
  },
  "logger": {
    "level": "debug"
  },
  "imports": {
    "whosonfirst": {
      "datapath": "/mnt/data/whosonfirst/",
      "importPostalcodes": false,
      "importVenues": false
    },
    "openaddresses": {
      "datapath": "/mnt/data/openaddresses/",
      "files": [ "us/ny/city_of_new_york.csv" ]
    }
  }
}

The following configuration options are supported by this importer.

imports.openaddresses.datapath

  • Required: yes
  • Default: ``

The absolute path to a directory where OpenAddresses data is located. The download command will also automatically place downloaded files in this directory.

imports.openaddresses.files

  • Required: no
  • Default: []

An array of OpenAddresses files to be downloaded (full list can be found on the OpenAddresses results site). If no files are specified, the full planet data files (11GB+) will be downloaded.

imports.openaddresses.missingFilesAreFatal

  • Required: no
  • Default: false

If set to true, any missing files will immediately halt the importer with an error. Otherwise, the importer will continue processing with a warning. With this option set to false, the data downloader will likewise continue past any download errors.

imports.openaddresses.dataHost

  • Required: no
  • Default: https://data.openaddresses.io

The location from which to download OpenAddresses data. By default, the primary OpenAddresses servers are used. This can be overridden to allow downloading customized data. Paths are supported (for example, https://yourhost.com/path/to/your/data), but must not end with a trailing slash.

S3 buckets are supported. Files will be downloaded using aws-cli.

For example: s3://data.openaddresses.io.

Note: When using S3, you might need authentication (an IAM instance role, environment variables, etc.)

imports.openaddresses.s3Options

  • Required: no

If imports.openaddresses.dataHost is an S3 bucket, these options are appended to the aws-cli command. For example: --profile my-profile

This is useful, for example, when downloading from s3://data.openaddresses.io, as that bucket requires the requester to pay for data transfer. In that case, use the following option: --request-payer
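For example, a pelias.json fragment combining an S3 data host with the requester-pays option described above might look like this (a sketch, mirroring the values mentioned in this section):

{
  "imports": {
    "openaddresses": {
      "datapath": "/mnt/data/openaddresses/",
      "dataHost": "s3://data.openaddresses.io",
      "s3Options": "--request-payer"
    }
  }
}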

Parallel Importing

Because OpenAddresses consists of many small files, this importer can be configured to run several instances in parallel that coordinate to import all the data.

To use this functionality, replace calls to npm start with

npm run parallel 3 # replace 3 with your desired level of parallelism

Generally, a parallelism of 2 or 3 is suitable for most tasks.
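Conceptually, each parallel instance imports only a subset of the configured files. A minimal sketch of one possible partitioning scheme (purely illustrative, not the importer's actual internals) might look like:

// Illustrative only: give each of `total` workers every n-th file.
const allFiles = ['us/ny/city_of_new_york.csv', 'us/ak/anchorage.csv'];

const filesForWorker = (files, workerId, total) =>
  files.filter((file, index) => index % total === workerId);

// e.g. worker 1 of 3 handles the files at indices 1, 4, 7, ...
const myFiles = filesForWorker(allFiles, 1, 3);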


openaddresses's Issues

retrieve admin values through coarse reverse-geocoding

The importer currently retrieves admin values by scraping them from OpenAddresses CSV filenames and their corresponding config JSON files. Since the data present there is only meant to suggest the rough region that any set of addresses belongs to, it's occasionally inaccurate. We should migrate to a model that finds admin values by performing coarse reverse-geocoding against a polygon dataset, like Quattroshapes.
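A minimal sketch of the proposed approach (the lookup function here is purely hypothetical; see the Admin Lookup section above for the real pelias/wof-admin-lookup implementation):

// Hypothetical: resolve admin names from coordinates instead of filenames.
// `lookupByPoint` stands in for a coarse point-in-polygon service.
function assignAdminValues(record, lookupByPoint) {
  const admins = lookupByPoint(record.lon, record.lat); // e.g. { country, region, locality }
  return Object.assign({}, record, admins);
}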

ECONNREFUSED when attempting to import

I am attempting to import one CSV file of OpenAddresses data and am encountering this error. I have previously created an index.

% node import.js

2016-05-05T03:42:56.151Z - info: [openaddresses] Importing 1 files.
2016-05-05T03:42:57.091Z - info: [openaddresses] Creating read stream for: /mnt/pelias/openaddresses/alameda.csv
Elasticsearch ERROR: 2016-05-05T03:42:59Z
  Error: Request error, retrying
  POST http://localhost:9200/_bulk => connect ECONNREFUSED 127.0.0.1:9200
      at Log.error (/home/migurski/pelias-openaddresses/node_modules/elasticsearch/src/lib/log.js:225:56)
      at checkRespForFailure (/home/migurski/pelias-openaddresses/node_modules/elasticsearch/src/lib/transport.js:240:18)
      at HttpConnector.<anonymous> (/home/migurski/pelias-openaddresses/node_modules/elasticsearch/src/lib/connectors/http.js:162:7)
      at ClientRequest.wrapper (/home/migurski/pelias-openaddresses/node_modules/elasticsearch/node_modules/lodash/index.js:3095:19)
      at emitOne (events.js:77:13)
      at ClientRequest.emit (events.js:169:7)
      at Socket.socketErrorListener (_http_client.js:258:9)
      at emitOne (events.js:77:13)
      at Socket.emit (events.js:169:7)
      at emitErrorNT (net.js:1256:8)

Elasticsearch ERROR: 2016-05-05T03:42:59Z
  Error: Request error, retrying
  POST http://localhost:9200/_bulk => connect ECONNREFUSED 127.0.0.1:9200
      at Log.error (/home/migurski/pelias-openaddresses/node_modules/elasticsearch/src/lib/log.js:225:56)
      at checkRespForFailure (/home/migurski/pelias-openaddresses/node_modules/elasticsearch/src/lib/transport.js:240:18)
      at HttpConnector.<anonymous> (/home/migurski/pelias-openaddresses/node_modules/elasticsearch/src/lib/connectors/http.js:162:7)
      at ClientRequest.wrapper (/home/migurski/pelias-openaddresses/node_modules/elasticsearch/node_modules/lodash/index.js:3095:19)
      at emitOne (events.js:77:13)
      at ClientRequest.emit (events.js:169:7)
      at Socket.socketErrorListener (_http_client.js:258:9)
      at emitOne (events.js:77:13)
      at Socket.emit (events.js:169:7)
      at emitErrorNT (net.js:1256:8)

(repeats many times)

The ES server is responding as I would expect:

% curl -i http://localhost:9200/

HTTP/1.1 200 OK
Content-Type: application/json; charset=UTF-8
Content-Length: 332

{
  "status" : 200,
  "name" : "Sack",
  "cluster_name" : "elasticsearch",
  "version" : {
    "number" : "1.7.5",
    "build_hash" : "00f95f4ffca6de89d68b7ccaf80d148f1f70e4d4",
    "build_timestamp" : "2016-02-02T09:55:30Z",
    "build_snapshot" : false,
    "lucene_version" : "4.10.4"
  },
  "tagline" : "You Know, for Search"
}

I am using OpenJDK 8 on Ubuntu 16.04, with Node 4.2.6.

whitespace in street name

note: this may be fixed in dev, the data below was taken from the production server 18 Mar '16 and I couldn't confirm because the dev server was being rebuilt at the time.

"street": "South Albany          Avenue",
{
  "type": "Feature",
  "properties": {
    "id": "2d698c8c74484096a80758033b694967",
    "gid": "oa:address:2d698c8c74484096a80758033b694967",
    "layer": "address",
    "source": "oa",
    "name": "200 South Albany Avenue",
    "housenumber": "200",
    "street": "South Albany          Avenue",
    "country_a": "USA",
    "country": "United States",
    "region": "Florida",
    "region_a": "FL",
    "county": "Martin County",
    "locality": "Stuart",
    "neighbourhood": "Watermark Marina of Palm City",
    "confidence": 0.848,
    "label": "200 South Albany Avenue, Stuart, FL"
  },
  "geometry": {
    "type": "Point",
    "coordinates": [
      -80.256743,
      27.199085
    ]
  }
}
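A minimal sketch of one possible cleanup (illustrative only, not necessarily the importer's actual fix):

// Collapse runs of whitespace in the street name into a single space.
const street = 'South Albany          Avenue'.replace(/\s+/g, ' ').trim();
// -> 'South Albany Avenue'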

Elasticsearch errors when doing the import on a local machine

Hi, I'm attempting a worldwide import on a local machine, but have persistent Elasticsearch errors, with ES dropping connections resulting in batch_error and missing data. The typical example is as follows:

Elasticsearch WARNING: 2017-12-11T14:05:12Z
Unable to revive connection: http://localhost:9200/
Elasticsearch WARNING: 2017-12-11T14:05:12Z
No living connections
2017-12-11T14:05:12.121Z - error: [dbclient] esclient error Error: No Living connections
at sendReqWithConnection (/mnt/scratch/pelias/openstreetmap/node_modules/elasticsearch/src/lib/transport.js:225:15)
at next (/mnt/scratch/pelias/openstreetmap/node_modules/elasticsearch/src/lib/connection_pool.js:213:7)
at nextTickCallbackWith0Args (node.js:419:9)
at process._tickCallback (node.js:348:13)
2017-12-11T14:05:12.121Z - error: [dbclient] invalid resp from es bulk index operation
2017-12-11T14:05:12.121Z - info: [dbclient] retrying batch [500]
2017-12-11T14:05:13.643Z - info: [dbclient] paused=true, transient=15, current_length=0, indexed=3754500, batch_ok=7509, batch_retries=0, failed_records=0, venue=1578642, address=2175858, persec=0, batch_error=35

The errors happen reliably and within an hour when running multiple importers in parallel. Running sequentially, Geonames and WOF have imported without issues. OA and OSM start throwing errors after a few hours of importing (e.g. after about 600M OSM nodes in the above example, after getting through half of Brazil in OA).

The hardware is quite powerful (32 cores, 128 GB RAM, though not SSDs - HDD in RAID0), so I do not believe this is an issue. ES maxes out 2-3 cores during import, but then it suddenly requires more power and starts dropping connections.

Not an expert in ES (this is my first encounter), but maybe there is some reindexing going on, or maybe even heavy garbage collection. I give ES 30GB of memory (via ES_JAVA_OPTS="-Xms30g -Xmx30g") and run it by launching a daemon. The ES version is 2.4 and the system is Ubuntu 16.04. I tried both the standard settings and some suggestions found on the web. The relevant ES documentation is hard to find and seems to differ considerably across versions, with people suggesting undocumented parameters to tweak. The config that worked to import Geonames and WOF is:

threadpool.bulk.type: fixed
threadpool.bulk.size: 25
threadpool.bulk.queue_size: 1000

I imagine that there could be some ES setting to either give it more resources, keep retrying for longer (or ideally just wait more between retries), or maybe changing batch sizes? Alternatively, is there a way to throttle the importers to a given maximum inserts per second?

housenumber with a '#' prefix

"housenumber": "#2708",
{
      "type": "Feature",
      "geometry": {
        "type": "Point",
        "coordinates": [
          -73.716908,
          45.513926
        ]
      },
      "properties": {
        "id": "ca/qc/montreal:330808",
        "gid": "openaddresses:address:ca/qc/montreal:330808",
        "layer": "address",
        "source": "openaddresses",
        "source_id": "ca/qc/montreal:330808",
        "name": "#2708 rue Équinoxes",
        "housenumber": "#2708",
        "street": "rue Équinoxes",
        "confidence": 0.74,
        "country": "Canada",
        "country_gid": "whosonfirst:country:85633041",
        "country_a": "CAN",
        "region": "Quebec",
        "region_gid": "whosonfirst:region:136251273",
        "locality": "Montréal",
        "locality_gid": "whosonfirst:locality:101736545",
        "neighbourhood": "Saint-Laurent",
        "neighbourhood_gid": "whosonfirst:neighbourhood:85895749",
        "label": "#2708 rue Équinoxes, Montréal, Quebec, Canada"
      }
    },

another one:

        "id": "ca/qc/montreal:331017",
        "gid": "openaddresses:address:ca/qc/montreal:331017",
        "housenumber": "#2855",

Use `HASH` field from OA for id

OA recently started incorporating a hashcode of the field values to create a unique identifier, an example of which is:

-75.8897798,40.6627665,,QUAKER CITY RD,,ALBANY,,,,,40928065a7dc01e1

We can use this reliable persistent unique identifier to replace our id scheme.
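A minimal sketch of how the HASH column could be used (illustrative only):

// The last column of the OA CSV row above is the hash.
// Use it as a stable source_id instead of an auto-incrementing counter.
const hash = '40928065a7dc01e1';
const gid = 'openaddresses:address:' + hash;
// -> 'openaddresses:address:40928065a7dc01e1'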

An in-range update of joi is breaking the build 🚨

Version 13.4.0 of joi was just published.

Branch Build failing 🚨
Dependency joi
Current Version 13.3.0
Type dependency

This version is covered by your current version range and after updating it in your project the build failed.

joi is a direct dependency of this project, and it is very likely causing it to break. If other packages depend on yours, this update is probably also breaking those in turn.

Status Details
  • ci/circleci Your tests passed on CircleCI! Details
  • continuous-integration/travis-ci/push The Travis CI build could not complete due to an error Details

Commits

The new version differs by 13 commits.

  • f75f0d3 13.4.0
  • 759c558 Cleanup for #1499.
  • d97ca0d Merge pull request #1499 from rgoble4/dynamic-keys
  • 8a1eb96 Consider extended types parameters
  • eaefa17 review changes
  • 3a84adc update docs
  • b747016 Allow pattern to support schema objects
  • 944dbe9 Fix empty path reach. Fixes #1515.
  • 1f39ed4 Update issue templates
  • ee15213 Merge pull request #1514 from radicand/fix/1513
  • f097e37 remove indirect require reference to index.js
  • e520c6d Merge pull request #1500 from logoran/feature-date-greater-less
  • 97fb85c add date greater less rules

See the full diff

FAQ and help

There is a collection of frequently asked questions. If those don’t help, you can always ask the humans behind Greenkeeper.


Your Greenkeeper Bot 🌴

set Document address* values

pelias-model Documents now have address objects, which can contain data like house number, house name, street name, etc.; set them accordingly.
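A minimal sketch of what setting those fields might look like, assuming the current pelias-model Document API (setName, setAddress, setCentroid), with values borrowed from the whitespace example above:

const Document = require('pelias-model').Document;

const doc = new Document('openaddresses', 'address', '2d698c8c74484096a80758033b694967');
doc.setName('default', '200 South Albany Avenue');
doc.setAddress('number', '200');
doc.setAddress('street', 'South Albany Avenue');
doc.setCentroid({ lon: -80.256743, lat: 27.199085 });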

Document guid does not persist across different imports

Individual imports start off with a per-document guid of 0, meaning that, when importing multiple OpenAddresses files (which requires executing the import script once per file, since it currently doesn't support file batches), each successive import partly or entirely overwrites the previous ones.

Allow street-less records in white-listed countries

Some countries allow addresses without streets. So far, it just appears to be the Czech Republic. Here are two examples:

  • č.p. 360, 79862 Rozstání
  • č.ev. 9, 79857 Rakůvka

These can be parsed as:

{
  housenumber: 'č.p. 360',
  postcode: '79862',
  city: 'Rozstání'
}

and:

{
  housenumber: 'č.ev. 9',
  postcode: '79857',
  city: 'Rakůvka'
}
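A minimal sketch of a whitelist check (the structure and country codes shown are illustrative):

// Accept records with an empty street only for whitelisted countries.
const STREETLESS_COUNTRIES = ['cz'];

function acceptRecord(record, countryCode) {
  return record.street !== '' || STREETLESS_COUNTRIES.includes(countryCode);
}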

Negative house numbers in US

I ran across this behavior in Ohio, where it's most egregious, but there are just shy of 35,000 addresses in that state with a negative house number. These should not be considered valid and should be filtered out during import.
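A minimal sketch of such a filter (illustrative; the field name follows the OA CSV NUMBER column):

// Reject house numbers that parse to a negative value.
function hasValidHouseNumber(record) {
  const n = parseFloat(record.NUMBER);
  return !(isFinite(n) && n < 0);
}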

strip quotation marks from fields

Quotation marks don't appear to get stripped, which results in records like the following getting indexed:

{
    "bbox": [
        -58.4255,
        -34.6249,
        -58.4255,
        -34.6249
    ],
    "date": 1423086910066,
    "features": [
        {
            "geometry": {
                "coordinates": [
                    -58.4255,
                    -34.6249
                ],
                "type": "Point"
            },
            "properties": {
                "admin0": "Argentina",
                "admin1": "Ciudad de Buenos Aires",
                "alpha3": "ARG",
                "id": "41578ad2c4144d9ca27144efef27d2ce",
                "layer": "openaddresses",
                "name": "4265 \"Calvo",
                "neighborhood": "Cafferata",
                "text": "4265 \"Calvo, Cafferata, Ciudad de Buenos Aires",
                "type": "openaddresses"
            },
            "type": "Feature"
        }
    ],
    "type": "FeatureCollection"
}
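A minimal sketch of one possible cleanup (illustrative only):

// Strip stray double quotes left over from malformed CSV quoting.
const name = '4265 "Calvo'.replace(/"/g, '').trim();
// -> '4265 Calvo'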

Limit needed in download_filtered.js

In order to get data for the entire US, I added all 1,600 of the current US OpenAddresses files to imports.openaddresses.files and then ran npm run download. But I didn't realize that this would start 1,600 asynchronous curl jobs (which crashed the server). Is there a way to limit this to, say, 10 or so concurrent curl jobs at a time?
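A minimal sketch of how such a limit could work, assuming the async module (the downloadFile helper here is hypothetical and stands in for the existing per-file curl logic):

const async = require('async');

// Download at most 10 files at a time instead of firing off every curl job at once.
function downloadAll(files, downloadFile, done) {
  async.eachLimit(files, 10, downloadFile, done);
}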

OpenAddresses loader will stop processing if given the wrong .csv name in an array of .csv names in pelias.json

The OpenAddresses loader will stop if given the wrong .csv name in an array of .csv names in pelias.json. I would rather be warned that one of the .csv file names is invalid, but have the loader continue to download and process the other .csv files in the list. I can see arguments for having the loader just die on such an error, but in the docker environment, all the other loaders (e.g., polylines and others that use OA data for interpolation) continue to process data … but it's incomplete, since the OA loader didn't download the other valid .csv files.

NOTE: one other time, with a valid set of file names, I also saw the OA loader exit early due to some hiccup in downloading OA data (i.e., it was unclear whether the .zip file didn't download correctly, a file in the .zip was corrupt, or the .csv or .vrt metadata file was missing from the .zip; the error message was a bit cryptic) ... and just like above, it would have been better for me to just see a warning within the context of the docker-compose environment.

Diana's response on October 19th, 2017:

The OA importer bails after an error because, for our build purposes, we wanted to know as soon as possible that something was wrong, but I can see how you'd rather just finish the files that can be finished and then worry about the missing or invalid ones separately. If you create an issue, the team can discuss whether that's something we could change.

Refactor for cleanliness

There are several streams that include helper requires that I think were written before we could easily test streams. Now that we can easily test streams, these helpers are superfluous. Refactor them to be simpler.

import.js filepath is not taken.

When I run import.js in bash (exec folder is 'openaddresses-master', PELIAS_CONF set to my conf file)

The importer looks for the file in the execution folder instead of using the datapath set in the config.

my config:

"openaddresses": {
      "datapath": "\\\\prod\\..(morepath)...\\openaddresses-collected\\be\\wa\\",
      "files": ["brussels-fr.csv"]
    }

I guess something goes wrong here, as the "filePath" variable should be the absolute path, but it contains only the file name.

logger.info( 'Importing %s files.', files.length );
  files.forEach( function forEach( filePath ){
    recordStream.append( function ( next ){
      logger.info( 'Creating read stream for: ' + filePath );
      next( importPipelines.createRecordStream( filePath ) );
    });
  });
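A minimal sketch of the kind of fix this suggests (illustrative; config access shown via pelias-config):

const path = require('path');
const config = require('pelias-config').generate();

// Resolve each configured file against the datapath rather than the
// current working directory.
const datapath = config.imports.openaddresses.datapath;
const filePath = path.join(datapath, 'brussels-fr.csv');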

add usage documentation

Add documentation once all the moving parts of this pipeline are in place and mostly stable.

  • Enhance README.md
  • Add command-line usage message.

Configure admin lookup and deduplication via pelias.config

Unlike most of our other importers, this importer only allows configuration of admin lookup and deduplication via command-line flags. This makes it difficult to point to a single place where configuration changes can be made, whether in our production/dev builds, in the vagrant image, or even in a local dev setup. It also makes our Chef configuration more complicated, as the Chef recipes have to know about our openaddresses configuration, rather than just starting the importer and letting the importer worry about configuration.

Connects pelias/pelias#255

import.js doesn't check to see if the deduplicator is running

Noted in --help:

        --deduplicate: (advanced use) Deduplicate addresses using the
                OpenVenues deduplicator: https://github.com/openvenues/address_deduper.
                It must be running at localhost:5000.

This point isn't very clear in the docs, and the import script will keep looping around [address-deduplicator] without throwing an error or checking whether the service is running. :-(
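A minimal sketch of a pre-flight check (illustrative; the deduplicator may not expose a dedicated health endpoint, so this simply checks that something answers on the port):

const http = require('http');

// Fail fast if nothing is listening at localhost:5000 before starting the import.
http.get('http://localhost:5000/', (res) => {
  res.resume(); // a response of any kind means the service is up
  // start the import here
}).on('error', () => {
  console.error('address deduplicator is not running at localhost:5000');
  process.exit(1);
});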

error on download

Hi,

somehow I am getting this error on download

user@server:/opt/pelias/openaddresses# node utils/download_data.js
2018-01-24T19:27:11.348Z - info: [download] Attempting to download all data
2018-01-24T19:27:11.353Z - debug: [download] downloading https://s3.amazonaws.com/data.openaddresses.io/openaddr-collected-global.zip
2018-01-24T19:52:11.771Z - debug: [download] unzipping /tmp/tmp-10khLEclwr1Pwr.zip to /data/openaddresses
2018-01-24T19:56:48.095Z - error: [download] Failed to download data Error: Command failed: unzip -o -qq -d /data/openaddresses /tmp/tmp-10khLEclwr1Pwr.zip

unzip is installed:

user@server:/opt/pelias/openaddresses# unzip
UnZip 6.00 of 20 April 2009, by Debian. Original by Info-ZIP.

Usage: unzip [-Z] [-opts[modifiers]] file[.zip] [list] [-x xlist] [-d exdir]
...

Any ideas?

City and region information being dropped

Gday all,

Thanks so much for the hard work on Pelias and its associated pieces.

I've just imported the Australian countrywide dataset into my Pelias instance. When I open the CSV from OpenAddresses I can see City and Region columns that are populated with useful info. However, it looks like this information gets dropped when I import it into Pelias.

It looks like the hosted Mapzen search instance relies on this information coming from WhosOnFirst.

Is there anything that I can set anywhere to help retain this information and map it to the relevant pelias fields for import rather than relying on WhosOnFirst?

Thanks,
Rowan

Add more stats about import process

It would be cool to get a report of exactly how many addresses were parsed during an import.

Something like this at the end of the import (or maybe for each file):
Total records in file: 900707
Records skipped due to missing data: 50383
Records skipped by deduplicator: 20837
Total records imported: 829487

Identify source file in ID field

OpenAddresses IDs as stored in our Elasticsearch index are currently just an auto-incrementing integer, which isn't very helpful to OA data users and may not be unique across our different sources. IDs should be changed to identify the source file and the row within that source file.

Error: Delimiter not found in the file ","

Ran the import against all OpenAddresses state files and the following error was thrown. I'm not sure if the process actually completed or not at this point; it looks like it may have, judging from the final info log record.

2016-11-10T17:36:53.553Z - verbose: [dbclient]  paused=false, transient=4, current_length=18, indexed=5463000, batch_ok=10926, batch_retries=0, failed_records=0, address=5463000, persec=2750
2016-11-10T17:36:56.697Z - verbose: [openaddresses] Number of bad records: 58679
2016-11-10T17:37:03.757Z - verbose: [dbclient]  paused=false, transient=4, current_length=406, indexed=5490500, batch_ok=10981, batch_retries=0, failed_records=0, address=5490500, persec=2750
2016-11-10T17:37:06.953Z - verbose: [openaddresses] Number of bad records: 64436
2016-11-10T17:37:13.662Z - info: [openaddresses] Total time taken: 1747.274s
events.js:141
      throw er; // Unhandled 'error' event
      ^

Error: Delimiter not found in the file ","
    at Error (native)
    at Parser.__write (/home/vagrant/openaddresses/node_modules/csv-parse/lib/index.js:439:13)
    at Parser._transform (/home/vagrant/openaddresses/node_modules/csv-parse/lib/index.js:172:10)
    at Transform._read (_stream_transform.js:167:10)
    at Transform._write (_stream_transform.js:155:12)
    at doWrite (_stream_writable.js:300:12)
    at writeOrBuffer (_stream_writable.js:286:5)
    at Writable.write (_stream_writable.js:214:11)
    at ReadStream.ondata (_stream_readable.js:542:20)
    at emitOne (events.js:77:13)
    at ReadStream.emit (events.js:169:7)

Version 10 of node.js has been released

Version 10 of Node.js (code name Dubnium) has been released! 🎊

To see what happens to your code in Node.js 10, Greenkeeper has created a branch with the following changes:

  • Added the new Node.js version to your .travis.yml
  • The new Node.js version is in-range for the engines in 1 of your package.json files, so that was left alone

If you’re interested in upgrading this repo to Node.js 10, you can open a PR with these changes. Please note that this issue is just intended as a friendly reminder and the PR as a possible starting point for getting your code running on Node.js 10.

More information on this issue

Greenkeeper has checked the engines key in any package.json file, the .nvmrc file, and the .travis.yml file, if present.

  • engines was only updated if it defined a single version, not a range.
  • .nvmrc was updated to Node.js 10
  • .travis.yml was only changed if there was a root-level node_js that didn’t already include Node.js 10, such as node or lts/*. In this case, the new version was appended to the list. We didn’t touch job or matrix configurations because these tend to be quite specific and complex, and it’s difficult to infer what the intentions were.

For many simpler .travis.yml configurations, this PR should suffice as-is, but depending on what you’re doing it may require additional work or may not be applicable at all. We’re also aware that you may have good reasons to not update to Node.js 10, which is why this was sent as an issue and not a pull request. Feel free to delete it without comment, I’m a humble robot and won’t feel rejected 🤖


FAQ and help

There is a collection of frequently asked questions. If those don’t help, you can always ask the humans behind Greenkeeper.


Your Greenkeeper Bot 🌴

Add support for blacklisted sources

Occasionally OA sources just have terrible data. For example, Lamar County, TX has, more often than not, wildly incomplete data that extends beyond 0 house numbers:

  • 215 SE 33RD - no street type and directional should be post
  • 2 W PLAZA & 135 BONHAM - inexplicable house number with intersection
  • OLD BELK'S PARKING LOT - GIS admin is on drugs
  • 112 BONHAM ST - SHOE STORE - ibid
  • 2ND NE @ E KAUFFMAN - intersection
  • 800 BLK JACKSON ST - block address

It would be programmatically difficult to determine address validity on a per-row basis, so this solution would blacklist an entire source to reduce data pollution when the majority of a source is bad data.
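A minimal sketch of how a source blacklist could be applied (source names are illustrative):

// Skip entire sources known to contain mostly bad data.
const BLACKLISTED_SOURCES = ['us/tx/lamar'];

function isBlacklisted(filename) {
  return BLACKLISTED_SOURCES.some((source) => filename.indexOf(source) === 0);
}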

Remove US/CA house numbers reducible to 0

A previous issue filtered out US/CA house numbers that were a literal 0. There are, however, ~67k addresses that have house numbers reducible to 0, such as 00 and 0000. Filter these out.
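A minimal sketch of such a filter (illustrative only):

// Reject house numbers made up entirely of zeros ('0', '00', '0000', ...).
function isAllZeros(housenumber) {
  return /^0+$/.test(String(housenumber).trim());
}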

Import Hangs on OpenAddresses Phase

Working on a full-country USA import for Pelias, and all data sources imported fine except OpenAddresses, which seems to simply repeat the same log items over and over again without using up much CPU or increasing disk space usage.

Not quite sure what the cause might be, so I would love some documentation on what the output is supposed to look like, or what kinds of things to check to see if progress is being made silently.

pelias.json

{
  "esclient": {
    "apiVersion": "2.3",
    ...
  },
  "elasticsearch": {
    "settings": {
      "index": {
        "number_of_replicas": "0",
        "number_of_shards": "10",
        "refresh_interval": "1m"
      }
    }
  },
  "interpolation": {
    "client": {
      "adapter": "null"
    }
  },
  "dbclient": {
    "statFrequency": 10000
  },
  "api": {
    "accessLog": "common",
    "host": "http://pelias.mapzen.com/",
    "indexName": "pelias",
    "version": "1.0",
    "textAnalyzer": "libpostal"
  },
  "schema": {
    "indexName": "pelias"
  },
  "logger": {
    "level": "debug",
    "timestamp": true,
    "colorize": true
  },
  "acceptance-tests": {
    "endpoints": {
      "local": "http://localhost:3100/v1/",
      "dev-cached": "http://pelias.dev.mapzen.com.global.prod.fastly.net/v1/",
      "dev": "http://pelias.dev.mapzen.com/v1/",
      "prod": "http://search.mapzen.com/v1/",
      "prod-uncached": "http://pelias.mapzen.com/v1/",
      "prodbuild": "http://pelias.prodbuild.mapzen.com/v1/"
    }
  },
  "imports": {
    "adminLookup": {
      "enabled": true,
      "maxConcurrentRequests": 100
    },
    "geonames": {
      "datapath": "/opt/mount/pelias-data/geonames",
      "countryCode": "US"
    },
    "openstreetmap": {
      "datapath": "/opt/mount/pelias-data/openstreetmap",
      "leveldbpath": "/tmp",
      "import": [{
        "filename": "us-midwest-latest.osm.pbf"
      },{
        "filename": "us-northeast-latest.osm.pbf"
      },{
        "filename": "us-pacific-latest.osm.pbf"
      },{
        "filename": "us-south-latest.osm.pbf"
      },{
        "filename": "us-west-latest.osm.pbf"
      }]
    },
    "openaddresses": {
      "datapath": "/opt/mount/pelias-data/openaddresses",
      "files": [],
      "deduplicate": true
    },
    "polyline": {
      "datapath": "/opt/mount/pelias-data/polylines",
      "files": [
        "road_network"
      ]
    },
    "whosonfirst": {
      "datapath": "/opt/mount/pelias-data/whosonfirst",
      "importVenues": false,
      "importPostalcodes": true
    }
  }
}

Log output:

nohup: ignoring input

> [email protected] start /opt/pelias-src/openaddresses
> node import.js

2017-05-05T22:47:56.756Z - info: [openaddresses] Setting up deduplicator.
2017-05-05T22:47:56.892Z - info: [openaddresses] Importing 2906 files.
2017-05-05T22:47:58.015Z - info: [openaddresses] Creating read stream for: /opt/mount/pelias-data/openaddresses/summary/us/ak/anchorage-summary.csv
2017-05-05T22:47:58.173Z - verbose: [openaddresses] number of invalid records skipped: 36
2017-05-05T22:47:58.175Z - info: [openaddresses] Creating read stream for: /opt/mount/pelias-data/openaddresses/summary/us/ak/city_of_juneau-summary.csv
2017-05-05T22:47:58.319Z - verbose: [openaddresses] number of invalid records skipped: 78
2017-05-05T22:47:58.320Z - info: [openaddresses] Creating read stream for: /opt/mount/pelias-data/openaddresses/summary/us/ak/fairbanks_north_star_borough-summary.csv
2017-05-05T22:47:58.355Z - verbose: [openaddresses] number of invalid records skipped: 82
...
2017-05-05T22:48:22.754Z - info: [openaddresses] Creating read stream for: /opt/mount/pelias-data/openaddresses/summary/us/wy/statewide-summary.csv
2017-05-05T22:48:23.681Z - verbose: [openaddresses] number of invalid records skipped: 2572
2017-05-05T22:48:23.682Z - info: [openaddresses] Creating read stream for: /opt/mount/pelias-data/openaddresses/summary/us/wy/sublette-summary.csv
2017-05-05T22:48:23.706Z - verbose: [openaddresses] number of invalid records skipped: 98
2017-05-05T22:48:23.706Z - info: [openaddresses] Creating read stream for: /opt/mount/pelias-data/openaddresses/summary/us/wy/teton-summary.csv
2017-05-05T22:48:23.720Z - verbose: [openaddresses] number of invalid records skipped: 41
2017-05-05T22:48:23.720Z - info: [openaddresses] Creating read stream for: /opt/mount/pelias-data/openaddresses/us/ak/anchorage.csv
2017-05-05T22:48:26.832Z - verbose: [address-deduplicator]  total=1000, duplicates=0, uniques=0, timeSpentPaused=0
2017-05-05T22:48:33.727Z - verbose: [openaddresses] Number of bad records: 1
2017-05-05T22:48:36.834Z - verbose: [address-deduplicator]  total=1000, duplicates=0, uniques=0, timeSpentPaused=0
2017-05-05T22:48:43.730Z - verbose: [openaddresses] Number of bad records: 1
2017-05-05T22:48:45.713Z - info: [wof-pip-service:master] country worker loaded 218 features in 47.967 seconds
2017-05-05T22:48:46.836Z - verbose: [address-deduplicator]  total=1000, duplicates=0, uniques=0, timeSpentPaused=0
2017-05-05T22:48:47.288Z - info: [wof-pip-service:master] region worker loaded 4874 features in 49.431 seconds
2017-05-05T22:48:53.734Z - verbose: [openaddresses] Number of bad records: 1
2017-05-05T22:48:56.838Z - verbose: [address-deduplicator]  total=1000, duplicates=0, uniques=0, timeSpentPaused=0
2017-05-05T22:49:03.738Z - verbose: [openaddresses] Number of bad records: 1
...
2017-05-06T02:57:46.095Z - verbose: [address-deduplicator]  total=1000, duplicates=0, uniques=0, timeSpentPaused=0
2017-05-06T02:57:56.105Z - verbose: [openaddresses] Number of bad records: 1
2017-05-06T02:57:56.105Z - verbose: [address-deduplicator]  total=1000, duplicates=0, uniques=0, timeSpentPaused=0
2017-05-06T02:58:06.115Z - verbose: [openaddresses] Number of bad records: 1
2017-05-06T02:58:06.116Z - verbose: [address-deduplicator]  total=1000, duplicates=0, uniques=0, timeSpentPaused=0
2017-05-06T02:58:16.126Z - verbose: [openaddresses] Number of bad records: 1
2017-05-06T02:58:16.126Z - verbose: [address-deduplicator]  total=1000, duplicates=0, uniques=0, timeSpentPaused=0
2017-05-06T02:58:26.135Z - verbose: [openaddresses] Number of bad records: 1
2017-05-06T02:58:26.135Z - verbose: [address-deduplicator]  total=1000, duplicates=0, uniques=0, timeSpentPaused=0
2017-05-06T02:58:36.146Z - verbose: [openaddresses] Number of bad records: 1
2017-05-06T02:58:36.146Z - verbose: [address-deduplicator]  total=1000, duplicates=0, uniques=0, timeSpentPaused=0
2017-05-06T02:58:46.156Z - verbose: [openaddresses] Number of bad records: 1
2017-05-06T02:58:46.156Z - verbose: [address-deduplicator]  total=1000, duplicates=0, uniques=0, timeSpentPaused=0
2017-05-06T02:58:56.161Z - verbose: [openaddresses] Number of bad records: 1
2017-05-06T02:58:56.161Z - verbose: [address-deduplicator]  total=1000, duplicates=0, uniques=0, timeSpentPaused=0
2017-05-06T02:59:06.171Z - verbose: [openaddresses] Number of bad records: 1
2017-05-06T02:59:06.171Z - verbose: [address-deduplicator]  total=1000, duplicates=0, uniques=0, timeSpentPaused=0
2017-05-06T02:59:16.181Z - verbose: [openaddresses] Number of bad records: 1
2017-05-06T02:59:16.182Z - verbose: [address-deduplicator]  total=1000, duplicates=0, uniques=0, timeSpentPaused=0

refactor into more modules

The package's main entry point, import.js, has become a little unwieldy and contains unrelated pieces of functionality (argument handling, import pipeline construction, etc.). Partition it into multiple modules.

An in-range update of csv-parse is breaking the build 🚨

Version 1.3.0 of csv-parse was just published.

Branch Build failing 🚨
Dependency csv-parse
Current Version 1.2.4
Type dependency

This version is covered by your current version range and after updating it in your project the build failed.

csv-parse is a direct dependency of this project, and it is very likely causing it to break. If other packages depend on yours, this update is probably also breaking those in turn.

Status Details
  • ci/circleci Your tests passed on CircleCI! Details
  • continuous-integration/travis-ci/push The Travis CI build failed Details

Commits

The new version differs by 6 commits.

  • d11de9d package: bump to version 1.3.0
  • 61762a5 test: should require handled by mocha
  • b0fe635 package: coffeescript 2 and use semver tilde
  • b347b69 Allow auto_parse to be a function and override default parsing
  • 8359816 Allow user to pass in custom date parsing functionn
  • f87c273 options: ensure objectMode is cloned

See the full diff

FAQ and help

There is a collection of frequently asked questions. If those don’t help, you can always ask the humans behind Greenkeeper.


Your Greenkeeper Bot 🌴

Where is the list of valid CSVs for import?

I tried importing au-countrywide.csv from https://openaddresses.io/ without any luck. I noticed through browsing the code of the vagrant project, you host some files here: http://data.openaddresses.io.s3.amazonaws.com/

I was able to download files such as au-queensland.zip and through the naming syntax find other Australian states to download.

Can you please update the documentation and provide a list of valid downloads? The documentation basically says you can use the files from openaddresses.io, but none that I have tried have worked. Only the ones you host (for Australia).

Many streetnames are all uppercase

shouting allthestreetnames

{
  "type": "Feature",
  "geometry": {
    "type": "Point",
    "coordinates": [
      138.561943,
      -34.837527
    ]
  },
  "properties": {
    "id": "au/countrywide:1799337",
    "gid": "openaddresses:address:au/countrywide:1799337",
    "layer": "address",
    "source": "openaddresses",
    "name": "18 GLASGOW STREET",
    "housenumber": "18",
    "street": "GLASGOW STREET",
    ...
    "country_a": "AUS",
    "region": "South Australia",
    ...
  }
}

we could resolve this issue with some code like this:

if (name === name.toUpperCase()) {
  name = name
    .toLowerCase()
    .split(' ')
    .map((word) => word.charAt(0).toUpperCase() + word.slice(1))
    .join(' ');
}

.. it wouldn't be perfect but considering it only works on names which are completely uppercase to begin with, it can't really get any worse than the original source data.

Maximum call stack size exceeded when running `pelias openaddresses import` using node LTS (v4)

# pelias openaddresses import
pelias: Cloning repo.
pelias: npm installing.
WARN engine [email protected]: wanted: {"node":">=0.8 <0.11"} (current: {"node":"4.2.6","npm":"2.14.12"})
npm WARN deprecated [email protected]: This package is no longer maintained. See its readme for upgrade details.
2016-02-05T21:22:03.133Z - info: [openaddresses] Importing 665 files.
2016-02-05T21:22:03.319Z - info: [openaddresses] Total time taken: .243s
events.js:397
EventEmitter.listenerCount = function(emitter, type) {
                                     ^

RangeError: Maximum call stack size exceeded
    at Function.EventEmitter.listenerCount (events.js:397:38)
    at Log.listenerCount (/root/.pelias/openaddresses/node_modules/pelias-dbclient/node_modules/elasticsearch/src/lib/log.js:68:25)
    at Function.EventEmitter.listenerCount (events.js:399:20)
    at Log.listenerCount (/root/.pelias/openaddresses/node_modules/pelias-dbclient/node_modules/elasticsearch/src/lib/log.js:68:25)
    at Function.EventEmitter.listenerCount (events.js:399:20)
    at Log.listenerCount (/root/.pelias/openaddresses/node_modules/pelias-dbclient/node_modules/elasticsearch/src/lib/log.js:68:25)
    at Function.EventEmitter.listenerCount (events.js:399:20)
    at Log.listenerCount (/root/.pelias/openaddresses/node_modules/pelias-dbclient/node_modules/elasticsearch/src/lib/log.js:68:25)
    at Function.EventEmitter.listenerCount (events.js:399:20)
    at Log.listenerCount (/root/.pelias/openaddresses/node_modules/pelias-dbclient/node_modules/elasticsearch/src/lib/log.js:68:25)
child_process.js:507
    throw err;
    ^

Error: Command failed: node import.js 
    at checkExecSyncError (child_process.js:464:13)
    at Object.execSync (child_process.js:504:13)
    at runRepoSubcommand (/usr/local/lib/node_modules/pelias-cli/pelias.js:193:16)
    at Object.<anonymous> (/usr/local/lib/node_modules/pelias-cli/pelias.js:196:1)
    at Module._compile (module.js:410:26)
    at Object.Module._extensions..js (module.js:417:10)
    at Module.load (module.js:344:32)
    at Function.Module._load (module.js:301:12)
    at Function.Module.runMain (module.js:442:10)
    at startup (node.js:136:18)

Works fine with 0.12.x

Request Timeout during Import

I'm in the process of installing Pelias without vagrant on my Ubuntu machine as well as a Centos machine.

On the Ubuntu machine I tried to npm install openaddresses and run node import.js using node version v0.10.38, which resulted in the following error.

node import.js
2016-04-11T21:59:53.991Z - info: [openaddresses] Importing 1 files.
2016-04-11T21:59:54.202Z - info: [openaddresses] Creating read stream for: /home/user/pelias/openaddresses/data/au/countrywide.csv
2016-04-11T22:08:41.242Z - error: [dbclient] esclient error Error: Request Timeout after 120000ms
    at /home/kent/pelias/openaddresses/node_modules/pelias-dbclient/node_modules/elasticsearch/src/lib/transport.js:340:15
    at null.<anonymous> (/home/kent/pelias/openaddresses/node_modules/pelias-dbclient/node_modules/elasticsearch/src/lib/transport.js:369:7)
    at Timer.listOnTimeout [as ontimeout] (timers.js:112:15)
2016-04-11T22:08:41.243Z - error: [dbclient] invalid resp from es bulk index operation
2016-04-11T22:08:41.243Z - info: [dbclient] retrying batch [500]

This was resolved by switching to a newer version of node (v0.12.0). I could import a large dataset like Australia (countrywide) without an issue. I've also tried using both master and production branches.

I tried to do this on a Centos machine and got the above error with both versions of node. Would this be an issue with the OS I am using or a RAM problem?

add support for importing an entire directory of OpenAddresses files

The import script currently only supports importing one file per run, meaning that importing multiple files requires invoking it as many times. This involves a number of inconveniences, so add support for importing numerous files via a CombinedStream or something similar.
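A minimal sketch using the combined-stream module, which appears to be the same approach the import code quoted earlier uses, chaining one read stream per file:

const CombinedStream = require('combined-stream');
const fs = require('fs');

// Lazily open each file only when the previous one has finished streaming.
function createFullRecordStream(filePaths) {
  const recordStream = CombinedStream.create();
  filePaths.forEach((filePath) => {
    recordStream.append((next) => next(fs.createReadStream(filePath)));
  });
  return recordStream;
}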

Correct house numbers prefixed with `0`

There are currently 125,630 OA records whose house numbers are prefixed with 0, which causes searching and display issues. This is predominantly a problem in the US.
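A minimal sketch of one possible correction (illustrative only):

// Strip leading zeros while keeping the rest of the house number.
const corrected = '0042'.replace(/^0+(?=\d)/, '');
// -> '42'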

Filter out null island addresses

There are 41,789 addresses in OA that have lat/lon of 0/0:

co/statewide.csv:0,0,411,Chelsea Court,,ELIZABETH,,,80107,,ee0c56b26029664d
co/statewide.csv:0,0,2112,Rawhide Drive,,ELIZABETH,,,80107,,7cf4a32b00f95b41
co/statewide.csv:0,0,875,South Mobile Street,,ELIZABETH,,,80107,,c5b8525a9dc83ece
co/statewide.csv:0,0,27482,East Broadview Drive,,KIOWA,,,80117,,2e487d9d8f1bd017
co/statewide.csv:0,0,1980,Legacy Circle,,ELIZABETH,,,80107,,06f78ce679f99256

These should be filtered out.

See pelias/leaflet-plugin#163
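A minimal sketch of such a filter (column names follow the OA CSV LON/LAT fields; illustrative only):

// Drop records sitting exactly at 0,0 ("null island").
function isNullIsland(record) {
  return parseFloat(record.LON) === 0 && parseFloat(record.LAT) === 0;
}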
