maxlath / import-wikidata-dump-to-couchdb

import a subset or a full Wikidata dump into a CouchDB database

import-wikidata-dump-to-couchdb

A tool to transfer an extract of a Wikidata dump into a CouchDB database.


2024 archive note

This tool was a somewhat naive implementation; if I wanted to do that today, I would do it differently, and make sure to use CouchDB bulk mode.


Dependency

  • NodeJS >= v6. If your distribution doesn't provide a recent version of NodeJS, you might want to uninstall NodeJS and reinstall it using NVM.

Installation

git clone https://github.com/maxlath/import-wikidata-dump-to-couchdb
cd import-wikidata-dump-to-couchdb
npm install

Now you can customize ./config/default.js to your needs.

How to

Download dump

Download the latest Wikidata dump.

Extract subset

Extract the subset of the dump fitting your needs, as you might not want to throw ~40 GB at your database's face.

For instance, for the needs of the authors-birthday bot, I wanted to keep only Wikidata entities of writers:

As each line of the dump is an entity, you could do something like this with grep:

cat dump.json | grep '36180\,' > isWriter.json

Here the trick is that every entity with occupation -> writer (P106 -> Q36180) will have 36180 somewhere in the line (as a claim numeric-id). And ta-da, you went from a 39 GB dump to a much nicer 384 MB subset.

But now, we can do something cleaner using wikidata-filter:

cat dump.json | wikidata-filter --claim P106:Q36180 > isWriter.json
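For comparison with the substring match used by grep above, here is a minimal sketch of a stricter check that parses an entity and looks at its P106 claims directly. It is only an illustration, not how wikidata-filter is implemented; `hasClaim` is a hypothetical helper name, and the sketch assumes the usual Wikidata JSON layout where an item value is found at `claims[property][].mainsnak.datavalue.value['numeric-id']`.

```javascript
// Hypothetical helper (not part of this repository or of wikidata-filter):
// returns true if the entity has at least one statement for `property`
// whose value is the item with the given numeric id
function hasClaim (entity, property, numericId) {
  const statements = (entity.claims && entity.claims[property]) || []
  return statements.some(statement => {
    // "novalue" and "somevalue" snaks carry no datavalue
    const datavalue = statement.mainsnak && statement.mainsnak.datavalue
    return Boolean(datavalue && datavalue.value && datavalue.value['numeric-id'] === numericId)
  })
}

// Made-up miniature entity, for illustration only
const writer = {
  id: 'Q42',
  claims: {
    P106: [
      { mainsnak: { datavalue: { value: { 'entity-type': 'item', 'numeric-id': 36180 } } } }
    ]
  }
}

console.log(hasClaim(writer, 'P106', 36180))
```

Unlike the grep pattern, this check does not match lines where 36180 merely appears as the tail of a longer number or as the value of another property.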

Import

This new file isn't valid JSON (it's line-delimited JSON), but every line is, once you remove the comma at the end of the line, so here is the plan: take every line, remove the comma, PUT it in your database:

./import.js ./isWriter.json
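The "remove the comma, then parse" step can be sketched as follows. This is a minimal illustration rather than the actual code of import.js; `parseDumpLine` is a hypothetical name, and skipping the `[` and `]` lines assumes the file may still contain the array brackets that wrap a full dump.

```javascript
// Hypothetical helper (not taken from import.js):
// turns one line of the dump into an entity object, or null if there is nothing to parse
function parseDumpLine (line) {
  const trimmed = line.trim()
  // Assumption: a full dump is one big JSON array, so its first and last lines are brackets
  if (trimmed === '' || trimmed === '[' || trimmed === ']') return null
  // Drop the trailing comma so that what remains is a standalone JSON object
  const json = trimmed.endsWith(',') ? trimmed.slice(0, -1) : trimmed
  return JSON.parse(json)
}

console.log(parseDumpLine('{"id":"Q42","type":"item"},'))
```

The last entity of a dump has no trailing comma, which is why the comma is removed only when present.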

Specify start and end line numbers:

startline=5
# the line 10 will be included
endline=10
./import.js ./isWriter.json $startline $endline
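The inclusive end line can be illustrated with a small sketch. `selectLines` is a hypothetical helper, and counting lines from 1 is an assumption: the text above only states that the end line is included.

```javascript
// Hypothetical helper (not taken from import.js):
// keeps the lines whose number is between startLine and endLine, both included
// Assumption: line numbers start at 1
function selectLines (lines, startLine = 1, endLine = Infinity) {
  return lines.filter((line, index) => {
    const lineNumber = index + 1
    return lineNumber >= startLine && lineNumber <= endLine
  })
}

const lines = [ 'line 1', 'line 2', 'line 3', 'line 4', 'line 5' ]
console.log(selectLines(lines, 2, 4))
```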

Behavior on conflict

In the config file (./config/default.js), you can set the behavior on conflict, that is, when the importer tries to add an entity that was already added to CouchDB:

  • update (default): update document if there is a change, otherwise pass.
  • pass: always pass
  • exit: exit process at first conflict
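The three behaviors above can be sketched as a small decision function. This is an illustration under assumptions, not the importer's actual code: `resolveConflict` and the returned `action` objects are hypothetical, and both documents are assumed to carry the same `_id`. The one CouchDB fact it relies on is that updating an existing document requires sending its current `_rev`.

```javascript
// Hypothetical helper (not taken from import.js):
// decides what to do when a document with the same _id already exists in CouchDB
function resolveConflict (behavior, existingDoc, newDoc) {
  if (behavior === 'pass') return { action: 'pass' }
  if (behavior === 'exit') return { action: 'exit' }
  if (behavior === 'update') {
    // Compare the documents without the revision, which only the stored copy has
    const existingWithoutRev = Object.assign({}, existingDoc)
    delete existingWithoutRev._rev
    // Simplification: this comparison is sensitive to key order
    const changed = JSON.stringify(existingWithoutRev) !== JSON.stringify(newDoc)
    if (!changed) return { action: 'pass' }
    // CouchDB only accepts the update if it carries the current _rev
    return { action: 'update', doc: Object.assign({}, newDoc, { _rev: existingDoc._rev }) }
  }
  throw new Error(`unknown conflict behavior: ${behavior}`)
}

const stored = { _id: 'Q42', _rev: '1-abc', label: 'Douglas Adams' }
const incoming = { _id: 'Q42', label: 'Douglas Noel Adams' }
console.log(resolveConflict('update', stored, incoming))
```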

See also

License

MIT

import-wikidata-dump-to-couchdb's Issues

use CouchDB bulk insert API

maxlath : hello! Any recommendation on how to import a data dump of 40GB+ of newline delimited JSON in CouchDB? I assume I should go with the bulk import API, but I can't just throw the whole dump at Couch at once, right? What would be the optimal/sustainable split size? thanks in advance :)
jan____ : maxlath: 1k-10k batches should do
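
The batching suggested in this exchange could look like the sketch below. It was never implemented in this tool; `toBatches` and `buildBulkDocsBody` are hypothetical helpers. The only CouchDB-specific part is the request body shape, `{ "docs": [...] }`, which is what the `POST /{db}/_bulk_docs` endpoint expects.

```javascript
// Hypothetical helper: split an array of documents into batches of at most batchSize
function toBatches (docs, batchSize) {
  const batches = []
  for (let i = 0; i < docs.length; i += batchSize) {
    batches.push(docs.slice(i, i + batchSize))
  }
  return batches
}

// Hypothetical helper: build the JSON body for one POST /{db}/_bulk_docs request
function buildBulkDocsBody (batch) {
  return JSON.stringify({ docs: batch })
}

// Tiny numbers for readability; the advice above would mean a batch size of 1000 to 10000
const docs = [ { _id: 'Q1' }, { _id: 'Q2' }, { _id: 'Q3' } ]
const bodies = toBatches(docs, 2).map(buildBulkDocsBody)
console.log(bodies)
```

With a 40 GB dump the documents would not fit in a single array, so a real importer would fill one batch at a time while reading the file line by line, and send each batch before reading further.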
