This project forked from melissaboiko/joyodb

JoyoDB: The Jōyō Kanji table exported to machine-readable formats

Introduction

Warning: This software is currently alpha. Don't use the data blindly.

Kanji usage in Japan is regulated by the Jōyō Kanji table. The latest, 2010 edition of the table is provided by the Ministry of Education in PDF format: http://kokugo.bunka.go.jp/kokugo_nihongo/joho/kijun/naikaku/pdf/joyokanjihyo_20101130.pdf

The original table is quite messy and hard to use in computer programs. This project includes code to extract the data and convert it into popular formats: TSV, JSON, SQL and HTML. To minimize human error, the data is parsed automatically as much as possible. The results will be tested to ensure consistency.

Most users won't have to run the scripts to extract the data; you can just download the data directly from the output directory.
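As a rough sketch of how the TSV output could be consumed with Python's standard library: the column names (`kanji`, `reading`) and the sample rows below are invented for illustration, since the real header is whatever the files in the output directory define. `read_tsv` is a hypothetical helper, not part of this project.

```python
import csv
import io


def read_tsv(stream):
    """Read a tab-separated stream with a header row into a list of dicts."""
    return list(csv.DictReader(stream, delimiter="\t"))


# Hypothetical sample data; the real files and column names live in output/.
sample = "kanji\treading\n亜\tア\n哀\tアイ\n"
rows = read_tsv(io.StringIO(sample))
for row in rows:
    print(row["kanji"], row["reading"])
```

For a real file, pass `open(path, encoding="utf-8", newline="")` instead of the in-memory sample.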

Roadmap/TODO

Completed

  • On- and kun-readings

    • Romaji converter.
    • Distinguish uncommon readings (marked in the table as indented/1字下げ).
  • Example words

    • Use examples to delimit okurigana in kun-readings
      • Including inflected examples, and "double okurigana" (like 成り立ち)
    • Handle POS markers: 〔副〕, 〔接〕, '……'.
    • Treat glossed variations as different, special readings
    • Handle examples with explicative text and 「」
  • Old kanji

    • Handle 弁:[辨, 瓣, 辯].
    • Handle 亀/龜 as Unicode.
  • Variant forms

    • Encode accepted variants (許容字体) as Unicode variation sequences.
    • Convert little-used codepoints to popular use alternatives (通用字体: 塡 剝 頰 → 填 剥 頬).
    • Convert 叱 U+53F1 into common alternate (異体字) 𠮟 U+20B9F.
    • Reference images
  • Notes (参考)

    • Distinguish kanji-scoped notes from reading-scoped ones.
    • Save full note as text
      • Handle notes spanning multiple lines
    • Extract data from notes:
      • Alternate orthographies.
      • Compounds (test against appendix).
      • Prefecture names.
      • Examples marked as literary (文語).
  • Output formats

    • TSV
  • Tests

    • doctests for functions
    • old_kanji: against Wikipedia, old dataset
    • readings: against kanjidic
    • examples: against JMdict (edict)
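
The codepoint conversion listed under "Variant forms" can be illustrated with a small sketch. This is not the project's own code: `POPULAR_FORMS`, `to_popular_forms` and `codepoints` are hypothetical names, and the table contains only the three pairs quoted in the list above.

```python
# Hypothetical mapping built from the three pairs named in the roadmap:
# 塡 剝 頰 → 填 剥 頬
POPULAR_FORMS = str.maketrans("塡剝頰", "填剥頬")


def to_popular_forms(text):
    """Replace each little-used codepoint with its popular-use alternative."""
    return text.translate(POPULAR_FORMS)


def codepoints(text):
    """Return the text as a list of U+XXXX labels, for inspection."""
    return ["U+%04X" % ord(ch) for ch in text]


original = "塡剝頰"
converted = to_popular_forms(original)
print(original, codepoints(original))
print(converted, codepoints(converted))
```

Characters that are not in the table pass through unchanged, so the function can be applied to whole example words as well as to single kanji.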

Yet to be done

  • Notes (参考)

    • Parse and structure the data from notes
      • Alternative orthographies (同訓異字).
      • Pointers to reference section in text.
        • Extract example images.
      • Exceptional readings.
        • Unbounded lists (with a "などは").
          • Complement with all available examples from edict.
        • Alternatives ("とも").
      • One-off types of notes.
  • Tests

    • notes: write at least one test for each type of parsed note
  • Parse appendix

  • Output types:

    • SQL
    • JSON
    • HTML table
  • Document:

How to recreate the files

 pip3 install romkan
 pip3 install ostruct
 pip3 install regex # alternative to the standard 're' module
 git clone https://github.com/leoboiko/joyodb.git
 cd joyodb
 make # (needs Internet)
 bin/convert_joyodb

Output will be in the output/ directory.

How to test

apt-get install rsync python3-lxml python3-bs4 mecab unidic-mecab
pip3 install mecab-python3
make test # (needs Internet)
