wikidata-dump-processor

Imports a Wikidata JSON dump (.json.bz2) into MongoDB and creates indexes.

  • Index:

    Wikidata ID: { id: 1 }

    English Alias: { aliases.en.value: 1 }

    English Wikipedia Title: { sitelinks.enwiki.title: 1 }

    Freebase ID: { claims.P646.mainsnak.datavalue.value: 1 }

    subclass of: { claims.P279.mainsnak.datavalue.value.id: 1 }

    instance of: { claims.P31.mainsnak.datavalue.value.id: 1 }

    all properties: { properties: 1 }

  • Partial Index for Covered Query: { sitelinks.enwiki.title: 1, id: 1 } and { labels.en.value: 1, id: 1 }

  • Performance: about 3 hours to import and about 1 hour to index the 20180717 dump (25 GB), using --nworker 12 and --chunk_size 10000
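
The index list above can be written down as plain data. The following is a minimal sketch, not the project's actual index.py: the names SINGLE_FIELD_INDEXES, COMPOUND_INDEXES and create_indexes are hypothetical, and the collection argument is assumed to behave like a pymongo Collection (only its create_index method is used).

```python
# Index keys copied from the list above; 1 means ascending order.
SINGLE_FIELD_INDEXES = [
    "id",
    "aliases.en.value",
    "sitelinks.enwiki.title",
    "claims.P646.mainsnak.datavalue.value",
    "claims.P279.mainsnak.datavalue.value.id",
    "claims.P31.mainsnak.datavalue.value.id",
    "properties",
]

# Compound indexes listed above for covered queries.
COMPOUND_INDEXES = [
    [("sitelinks.enwiki.title", 1), ("id", 1)],
    [("labels.en.value", 1), ("id", 1)],
]


def create_indexes(collection):
    """Create every listed index and return the names reported by the driver.

    `collection` is assumed to expose create_index(keys), as a
    pymongo Collection does (hypothetical helper, not part of the project).
    """
    names = []
    for field in SINGLE_FIELD_INDEXES:
        names.append(collection.create_index([(field, 1)]))
    for keys in COMPOUND_INDEXES:
        names.append(collection.create_index(keys))
    return names
```

With pymongo installed, this could be called as create_indexes(MongoClient(host, port)[db_name][collection_name]).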

Quickstart

Step 1: import

usage: import.py [-h] [--chunk_size CHUNK_SIZE] [--nworker NWORKER]
                 inpath host port db_name collection_name

positional arguments:
  inpath                Path to inpath file (xxxxxxxx-all.json.bz2)
  host                  MongoDB host
  port                  MongoDB port
  db_name               Database name
  collection_name       Collection name

optional arguments:
  --chunk_size CHUNK_SIZE, -c CHUNK_SIZE
                        Chunk size (default=10000, RAM usage depends on chunk
                        size)
  --nworker NWORKER, -n NWORKER
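
To illustrate why RAM usage depends on the chunk size, here is a minimal sketch of reading the dump in chunks. It is not the project's import.py. It assumes the standard Wikidata dump layout (a JSON array with one entity per line, each line ending in a comma), and iter_chunks is a hypothetical name.

```python
import bz2
import json


def iter_chunks(inpath, chunk_size=10000):
    """Yield lists of at most `chunk_size` entities from a .json.bz2 dump.

    Only one chunk is held in memory at a time, so a larger chunk_size
    means more RAM per worker.
    """
    chunk = []
    with bz2.open(inpath, "rt", encoding="utf-8") as f:
        for line in f:
            # Drop the newline and the trailing comma that separates entities.
            line = line.strip().rstrip(",")
            # Skip blank lines and the enclosing array brackets.
            if line in ("", "[", "]"):
                continue
            chunk.append(json.loads(line))
            if len(chunk) >= chunk_size:
                yield chunk
                chunk = []
    if chunk:
        yield chunk
```

Each yielded chunk could then be handed to a worker that inserts it into MongoDB, for example with pymongo's insert_many.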

Step 2: index

usage: index.py [-h] host port db_name collection_name

positional arguments:
  host             MongoDB host
  port             MongoDB port
  db_name          Database name
  collection_name  Collection name
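
Once indexing has finished, a lookup such as "English Wikipedia title to Wikidata ID" can be served by the { sitelinks.enwiki.title: 1, id: 1 } index. The sketch below is an assumption about how such a query might look, not code from the project: wikidata_id_for_title is a hypothetical helper, and collection is assumed to behave like a pymongo Collection.

```python
def wikidata_id_for_title(collection, title):
    """Return the Wikidata ID for an English Wikipedia title, or None.

    The projection excludes _id and returns only `id`, so every field
    the query touches is present in the compound index.
    """
    doc = collection.find_one(
        {"sitelinks.enwiki.title": title},
        projection={"_id": 0, "id": 1},
    )
    return doc["id"] if doc else None
```

Excluding _id matters here: if _id were returned, the server would need to fetch the full document, because _id is not part of that index.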

Miscellaneous

  • If you get an errno:24 "Too many open files" error, increase the system's open-file limit. For example, on Linux, run ulimit -n 64000 in the console that runs mongod.
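
To see which open-file limit a shell passes on to the processes it starts, the limit can be read from Python on Unix-like systems. This is a small sketch using the standard resource module; open_file_limits is a hypothetical name. It reports the limit of the Python process itself, which matches mongod's only when both are started from the same console.

```python
import resource


def open_file_limits():
    """Return the (soft, hard) open-file limits of the current process."""
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    return soft, hard


if __name__ == "__main__":
    soft, hard = open_file_limits()
    print("soft limit:", soft, "hard limit:", hard)
```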

Contributors

panx27
