
OpenStreetMap Data Quality based on the Contributions History

License: MIT

Working with community-built data such as OpenStreetMap forces you to take care of data quality: you have to be confident in the data you work with. Is this road geometry accurate enough? Is this street name missing?

Our first idea was to answer this question: can we assess the quality of OpenStreetMap data, and how?

This project is dedicated to exploring and analyzing the OpenStreetMap data history in order to classify the contributors.

There is a series of articles on the Oslandia blog that deal with this topic. These articles are also available in the articles folder.

Dependencies

Works with Python 3

  • pyosmium
  • luigi
  • pandas
  • statsmodels
  • scikit-learn
  • matplotlib
  • seaborn

There is a requirements.txt file, so run pip install -r requirements.txt from a virtual environment.
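For instance, with the standard venv module:

python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt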

How does it work?

There are several Python files to extract and analyze the OSM history data. Two machine learning models are used to classify the changesets and the OSM contributors.

  • Dimension reduction with PCA
  • Clustering with KMeans

The purpose of the PCA is not to reduce the dimension (there are fewer than 100 features) but to analyze the different features and understand which are the most important ones.
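As an illustration, here is a minimal sketch of this two-step pipeline with scikit-learn; the feature file name and columns are hypothetical, not the project's actual ones:

import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

# Hypothetical per-user feature matrix (one row per contributor)
features = pd.read_csv("data/output-extracts/region/user-features.csv", index_col="uid")

# Standardize, then project onto a few components to rank feature importance
X = StandardScaler().fit_transform(features)
pca = PCA(n_components=6)
X_reduced = pca.fit_transform(X)
print(pca.explained_variance_ratio_)  # variance explained by each component

# Cluster the contributors in the PCA space
labels = KMeans(n_clusters=5, n_init=10).fit_predict(X_reduced)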

Running

Get some history data

You can get history data for a specific world region from Geofabrik. You have to download a *.osh.pbf file. For instance, on the Greater London page, you can download the file greater-london.osh.pbf.
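Once downloaded, you can quickly sanity-check the file with pyosmium (listed in the dependencies above). A minimal sketch, assuming the Greater London example file:

import osmium

class HistoryCounter(osmium.SimpleHandler):
    """Count node versions and distinct contributors in an OSM history file."""
    def __init__(self):
        super().__init__()
        self.nodes = 0
        self.contributors = set()

    def node(self, n):
        self.nodes += 1
        self.contributors.add(n.uid)

handler = HistoryCounter()
handler.apply_file("data/raw/greater-london.osh.pbf")
print(handler.nodes, "node versions from", len(handler.contributors), "contributors")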

Organize your output data directories

Create a data directory and a few subdirectories. The data processing should be launched from the folder containing your data folder (or, alternatively, a symbolic link pointing to it).

  • mkdir -p data/output-extracts
  • mkdir data/raw

Then, copy your freshly downloaded *.osh.pbf file into the data/raw/ directory.

Note: if you want another name for your data directory, you can specify it with the --datarep luigi option.

The limits of the data pipeline

The data pipeline processing is handled by Luigi, which builds a directed acyclic graph of your different processing tasks and launches them in parallel when possible.
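For reference, this is how Luigi chains tasks in general; the task names below are made up for illustration and are not the project's actual tasks:

import luigi

class ExtractHistory(luigi.Task):  # hypothetical task name
    dsname = luigi.Parameter()

    def output(self):
        return luigi.LocalTarget(f"data/output-extracts/{self.dsname}/history.csv")

    def run(self):
        with self.output().open("w") as fobj:
            fobj.write("id,uid,version\n")  # extraction work would go here

class ComputeMetadata(luigi.Task):  # hypothetical task name
    dsname = luigi.Parameter()

    def requires(self):
        # Luigi reads this declaration to build the dependency DAG
        return ExtractHistory(dsname=self.dsname)

    def output(self):
        return luigi.LocalTarget(f"data/output-extracts/{self.dsname}/metadata.csv")

    def run(self):
        with self.input().open() as fobj, self.output().open("w") as out:
            out.write(fobj.read())  # downstream analysis would go here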

These tasks yield output files (CSV, JSON, hdf5, png). Some files needed by a few tasks, such as all-changesets-by-user.csv and all-editors-by-user.csv, were built outside of this pipeline. These files actually come from the big changesets-latest.osm XML file, which is difficult to include in the pipeline because:

  • the processing can be quite long;
  • it requires a large amount of RAM.

Thus, you can get these two CSV files from osm-user-data and copy them into your data/output-extracts directory.

See also the I want to parse the changesets.osm file section.

Run your first analysis

You should have the following files:

data
data/raw
data/raw/region.osh.pbf
data/output-extracts
data/output-extracts/all-changesets-by-user.csv
data/output-extracts/all-editors-by-user.csv

Launch

luigi --local-scheduler --module analysis_tasks AutoKMeans --dsname region

or

python3 -m luigi --local-scheduler --module analysis_tasks AutoKMeans --dsname region

dsname means "dataset name". It must match the name of your *.osh.pbf file.

Note: the default value of this parameter is bordeaux-metropole. If you do not set another value and there is no such .osh.pbf file on your file system, the program will crash.

Most of the time (i.e. if you get a Python import error), you have to prefix the luigi command with the PYTHONPATH environment variable set to the osm-data-quality/src directory, such as:

PYTHONPATH=/path/to/osm-data-quality/src luigi --local-scheduler ...

The MasterTask chooses the number of PCA components and the number of KMeans clusters automatically. If you want to set the number of clusters yourself, for instance, you can pass the following options to the luigi command:

--module analysis_tasks KMeansFromPCA --dsname region --n-components 6 --nb-clusters 5

In this case, the PCA will be carried out with 6 components, and the clustering will use the PCA results to run KMeans with 5 clusters.

See also the different luigi options in the official luigi documentation.

Results

You should have a data/output-extracts/<region> directory with several CSV, JSON and h5 files (a loading sketch follows the list):

  • Several intermediate CSV files;
  • JSON KMeans report to see the "ideal" number of clusters (the key n_clusters);
  • PCA hdf5 files with /features and /individuals keys;
  • KMeans hdf5 files with /centroids and /individuals keys;
  • A few PNG images.
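A short sketch of how to load these hdf5 outputs with pandas; the exact file names are assumptions, so adapt them to what you find in the directory (the keys come from the list above):

import pandas as pd

pca_features = pd.read_hdf("data/output-extracts/region/pca.h5", "features")
centroids = pd.read_hdf("data/output-extracts/region/kmeans.h5", "centroids")
individuals = pd.read_hdf("data/output-extracts/region/kmeans.h5", "individuals")
print(centroids.shape)  # one row per cluster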

Open the results analysis notebook for an insight into how to exploit the results.

I want to parse the changesets.osm file

See http://planet.openstreetmap.org/planet/changesets-latest.osm.bz2

  • Convert the file into a huge CSV file
  • Group the changesets and editors by user with dask (see the sketch below)

TODO: write the "how to"
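In the meantime, here is a minimal sketch of the grouping step with dask, assuming the intermediate CSV has uid, changeset, and created_by columns (hypothetical names):

import dask.dataframe as dd

# Hypothetical column names for the CSV converted from changesets-latest.osm
changesets = dd.read_csv("data/raw/changesets-latest.csv")

# Number of changesets per user
by_user = changesets.groupby("uid")["changeset"].count()

# Number of changesets per (user, editor) pair
by_editor = changesets.groupby(["uid", "created_by"]).size()

by_user.compute().to_csv("data/output-extracts/all-changesets-by-user.csv")
by_editor.compute().to_csv("data/output-extracts/all-editors-by-user.csv")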
