scriptin / jmdict-simplified

JMdict, JMnedict, Kanjidic, KRADFILE/RADKFILE in JSON format

License: Creative Commons Attribution Share Alike 4.0 International

Kotlin 76.52% JavaScript 4.36% TypeScript 19.12%
jmdict json japanese-language japanese dictionary language xml dictionary-tools jmnedict kanjidic

jmdict-simplified's Introduction

👋 Hello, my name is Dmitry

I am a software developer, coding mostly webapps. Programming is both my job and hobby, but I am also interested in (natural) languages, cyber-security, astronomy, math, and art.

const scriptin = {
  name: "Dmitry Shpika", // [DMEE-tree SHPEE-kah]
  pronouns: ["he", "him"],
  education: [
    { specialty: "Software Engineering",
      type: "self-taught",
      experience: "12+ years" },
    { specialty: "IT Security",
      type: "degree",
      experience: "Occasional consulting and auditing" },
  ],
  hardSkills: {
    programming: [TypeScript, JavaScript, Kotlin, SQL, ShellScripts, Python],
    ui: [React, HTML, CSS, Tailwind, Bootstrap, MaterialUI, Figma],
    api: [GraphQL, REST, SOAP, RPC],
    buildTools: [Webpack, Vite, Gradle],
    graphics: [Canvas2D, Processing, P5js, SVG],
    testing: [TDD, BDD, Unit, E2E],
    ci: [Docker, GitHubActions, GitLabCI],
    other: [Parsing, WebScraping, XML],
    _outdated: [PHP, Java, jQuery, AngularJS], // In the past
  },
  softSkills: {
    teaching: "I like to explain complex topics in simple terms",
    learning: "Currently focusing on foreign languages",
    design: "Creating UIs using component libraries/frameworks, or from scratch",
    writing: `Technical and fiction.
              I've written 130+ answers on Stack Exchange,
              maybe 5-7 articles, dozens of readmes,
              countless Jira tickets,
              a few short novels and poems`,
  },
  hobbies: [Programming, Languages, Astronomy, Math, Art],
} satisfies SoftwareEngineer;

jmdict-simplified's People

Contributors

actions-user, garybernhardt, nyuczka, scriptin

jmdict-simplified's Issues

Kanjidic 3.5.0 json is missing some radicals

The radicals array in Kanjidic json has only one element of each type (classical or nelson_c) in every kanji.

For example, kanji 家 has only one classical radical element with value 40 (宀). Another radical with value 152 (豕) is missing from the array.

Cloning takes forever, remove XML files?

The XML files make the project weigh over 80 MB, so cloning takes a long time for no good reason.
If someone needs to run the code, they can just download the latest XML file from the site.

Add g_type attribute on gloss elements

Added in the latest revision of JMdict (does not apply to JMnedict):

<!ATTLIST gloss g_type CDATA #IMPLIED>
        <!-- The g_type attribute specifies that the gloss is of a particular
        type, e.g. "lit" (literal), "fig" (figurative), "expl" (explanation).
        -->

Usually Kana

Hello, I noticed that the UK tag is omitted from the json. Is there a way that can be added in?

appliesToKanji / appliesToKana became empty for senses

At some previous point, appliesToKanji and appliesToKana were populated for senses. I'm not sure what the new pattern is exactly except that it looks like appliesToKanji is now always(?) empty for senses whereas it used to include "*", and appliesToKana also used to have "*" for senses.
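
To make the regression concrete, here is a minimal sketch of how a consumer might interpret these fields, assuming that "*" acts as a wildcard meaning "applies to every kanji/kana writing of the entry". The function name `senseAppliesTo` is hypothetical and not part of the project.

```typescript
// Hypothetical helper: decide whether a sense applies to a given writing.
// Assumption: "*" in the list means "applies to all writings of the entry".
function senseAppliesTo(appliesTo: readonly string[], writing: string): boolean {
  return appliesTo.indexOf("*") !== -1 || appliesTo.indexOf(writing) !== -1;
}

// With the older output (lists containing "*") every writing matches.
const matchesWildcard = senseAppliesTo(["*"], "家");

// With an empty list nothing matches, which is why an always-empty
// appliesToKanji breaks consumers written against the older output.
const matchesEmpty = senseAppliesTo([], "家");
```

Under this reading, a consumer cannot tell "applies to nothing" from "applies to everything" once the lists are empty.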

Extract specific language

Hi,

The json is converted from JMdict_e, but JMdict is also available in French, German, Russian and Dutch.

Is it possible to extract data for a specific language code?
Something like passing a language code: ./gradlew download fr
If no language code is provided, the default would be English.

More directions on set up?

I don't have much experience with Gradle and I'm having a hard time getting this set up. Could you add directions for how to get started? I'm getting the messages "FAILURE: Build failed with an exception." and "Could not determine java version from '12.0.1'" when I run "./gradlew tags" or "./gradlew tasks".

README said 'Java8+', but I'm wondering if this is not compatible with the newest version of Java, or if I skipped a step when setting up gradle.

Update with latest JMdict?

Forgive me for bugging you, I wondered if you could publish an update with the latest JMdict? Version 3.0.1 was released 1.5 years ago and today I came across a word that had been added earlier this year and wasn't in jmdict-simplified. Thank you!!

xref element in JMdict sometimes contains a reb with JIS centre-dots

Whilst playing around parsing the xref field in my own parser I noticed that there is a problem with the xref field in the original XML file.

The JIS centre-dot '・' is used to separate components of the xref but some reb contain that centre dot, so you get xrefs like:
<xref>ブロードノーズ・セブンギル・シャーク</xref>
<xref>イエローテール・スターリー・ラビットフィッシュ</xref>

From my short investigations it seems like it is only these two xrefs which have this problem.

Parsing these by splitting on the centre-dot will get you a list of 3 strings but it actually should only be a list of a single string.

I have contacted Jim Breen, the author of JMdict, but in the meantime the solution is to hard-code a check for these two xrefs and return them as is instead of splitting them on the centre-dot, as each of them refers to a single reb.
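
The workaround described above could be sketched roughly as follows. This is only an illustration: the names `UNSPLITTABLE_XREFS` and `splitXref` are hypothetical, and the exception list contains just the two xrefs quoted in this issue.

```typescript
// The two xrefs from this issue whose reading itself contains a centre-dot.
const UNSPLITTABLE_XREFS: readonly string[] = [
  "ブロードノーズ・セブンギル・シャーク",
  "イエローテール・スターリー・ラビットフィッシュ",
];

// Hypothetical helper: split an xref into its components on the centre-dot,
// unless the whole string is one of the known single-reading exceptions.
function splitXref(xref: string): string[] {
  if (UNSPLITTABLE_XREFS.indexOf(xref) !== -1) {
    return [xref]; // keep the reading intact
  }
  return xref.split("・");
}
```

For an illustrative xref such as "漢字・かんじ・1" this returns three components, while each of the two exceptions comes back as a single-element list.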

Fields often contain "?"

See ID 1577140 for example - its first sense, for "where"/"what place", has question marks for partOfSpeech and misc.

I've been trying to read the code to understand how this could happen, but no luck yet. Very nice work, otherwise! :)

Part-of-speech of an earlier 'sense' element must apply to later senses unless a new part-of-speech is indicated

In the header of original file:

<!ELEMENT pos (#PCDATA)>
<!-- Part-of-speech information about the entry/sense. Should use 
appropriate entity codes. In general where there are multiple senses
in an entry, the part-of-speech of an earlier sense will apply to
later senses unless there is a new part-of-speech indicated.
-->

Currently, there are senses with empty part-of-speech field.
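
The rule quoted from the DTD comment could be applied as a post-processing step along these lines. This is a sketch under assumptions: `SenseWithPos` is a hypothetical, reduced shape of a sense, and `inheritPartOfSpeech` is not an existing function of the project.

```typescript
// Hypothetical minimal shape of a sense; the real JSON has more fields.
interface SenseWithPos {
  partOfSpeech: string[];
  gloss: string[];
}

// Returns a copy of the senses in which every sense with an empty
// partOfSpeech list inherits the list of the nearest earlier sense
// that has one. Senses before the first non-empty list stay empty.
function inheritPartOfSpeech(senses: readonly SenseWithPos[]): SenseWithPos[] {
  let lastPos: string[] = [];
  return senses.map((sense) => {
    if (sense.partOfSpeech.length > 0) {
      lastPos = sense.partOfSpeech;
    }
    return { ...sense, partOfSpeech: [...lastPos] };
  });
}
```

Copying the inherited list into each sense keeps the output self-describing, at the cost of some repetition in the JSON.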

JMdict version number?

It would be useful if the JSON file included another key that contained the JMdict version number or release date or some way to identify which copy of JMdict it corresponded to.

In my copy of JMdict_e XML database, I see a comment:

<!-- JMdict created: 2016-12-28 -->

I realize it might be hard to extract that, via XQuery, and insert it into the JSON, but perhaps this could be done outside XQuery, as a post-processing step?

I ask about this because we have found that entries come and go as JMdict matures, and it’s conceivable that a cross-reference made today might be invalid tomorrow. If cross-references included, as metadata, the JMdict version, that’s a step at resolving such problems when they inevitably arise.
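
As a rough illustration of the post-processing idea, the creation date could be pulled out of the XML comment with a regular expression before or after the conversion. The function name `extractDictDate` is hypothetical, and the pattern assumes the comment always has the form quoted above.

```typescript
// Matches the comment quoted in the issue,
// e.g. "<!-- JMdict created: 2016-12-28 -->".
const CREATED_COMMENT = /<!--\s*JMdict created:\s*(\d{4}-\d{2}-\d{2})\s*-->/;

// Hypothetical helper: returns the creation date found in the XML text,
// or null when no such comment is present.
function extractDictDate(xml: string): string | null {
  const match = CREATED_COMMENT.exec(xml);
  return match?.[1] ?? null;
}

const exampleDate = extractDictDate("<?xml version=\"1.0\"?>\n<!-- JMdict created: 2016-12-28 -->\n<JMdict></JMdict>");
```

The extracted value could then be written into the JSON under a metadata key of the maintainer's choosing.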

A huge, huge onegai: JMnedict?

I'm so hesitant and ashamed to ask this—have you ever thought of a jmnedict-simplified, i.e., a JSON version of JMnedict, the Japanese names dictionary? A glance at the XML file's DTD shows a lot of parallels with JMdict, but I'm not familiar enough with XPath to surmise how much or how little work this would be, so I'll just say that I love working with jmdict-simplified and would love a similar resource for JMnedict. Thanks for listening 🙇‍♂️!

Publish NPM packages

There should be 2 separate libraries:

  • @scriptin/jmdict-simplified-types - TypeScript type definitions. Useful when someone works with JSON files directly.
  • @scriptin/jmdict-simplified-loader - Loader utility package. Includes a streaming JSON parser with a simplified API for extracting metadata and words one-by-one. This library depends on @scriptin/jmdict-simplified-types

(Namespace @scriptin is chosen to prevent name clashes and squatting. We don't want a repeat of the left-pad story)

This work is already started - see node directory.

Remaining work:

  • Documentation
  • Publishing to NPM
  • CI/CD workflow: build, test, publish - most likely a separate workflow is required

Good guide on how to create a TypeScript lib: How to publish packages to npm in 2023 by Matt Pocock

"misc" tags for senses

I would appreciate having the "misc" tags for senses, particularly so that I can filter / label language that is less relevant or sensitive such as "chn" (children's language) and historical terms. Thank you
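
If senses carried their "misc" tags, the filtering described here might look like the sketch below. It assumes a hypothetical `misc` string array on each sense; `TaggedSense` and `withoutMisc` are illustrative names, and "chn" is the only tag taken from this issue.

```typescript
// Hypothetical minimal shape of a sense that carries "misc" tags.
interface TaggedSense {
  misc: string[];
  gloss: string[];
}

// Keeps only the senses that carry none of the excluded "misc" tags.
function withoutMisc(
  senses: readonly TaggedSense[],
  excluded: readonly string[],
): TaggedSense[] {
  return senses.filter(
    (sense) => !sense.misc.some((tag) => excluded.indexOf(tag) !== -1),
  );
}

// Illustrative data, not taken from the dictionary.
const filteredSenses = withoutMisc(
  [
    { misc: [], gloss: ["first meaning"] },
    { misc: ["chn"], gloss: ["second meaning"] },
  ],
  ["chn"],
);
```

The same predicate could be inverted to label such senses instead of dropping them.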

Automatically update when source dictionaries are updated

Create a new workflow that periodically (how often?):

  1. Downloads the source files
  2. Checks if they were updated by comparing file hashes. Hashes from the last build need to be stored in the repository to be able to compare
  3. If any change was detected, trigger an existing update action to create a new release
  4. If possible, set a release notes message about which dictionary was updated and triggered a build

Limitations: only 2000 free workflow minutes per month
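
Steps 2 to 4 above can be sketched as two small pure functions. This is only an illustration under assumptions: the hashes are taken to be already computed and stored as a name-to-hash map, and `HashRecord`, `changedSources` and `releaseNotes` are hypothetical names.

```typescript
// Maps a source file name to the hash of its contents.
type HashRecord = Record<string, string>;

// Step 2: names of source files whose hash is new or differs from the
// hash stored after the last build, in sorted order.
function changedSources(previous: HashRecord, current: HashRecord): string[] {
  return Object.keys(current)
    .filter((name) => previous[name] !== current[name])
    .sort();
}

// Steps 3 and 4: null means "no release needed"; otherwise the string
// can be used as the release notes message.
function releaseNotes(changed: readonly string[]): string | null {
  if (changed.length === 0) {
    return null;
  }
  return `Updated source dictionaries: ${changed.join(", ")}`;
}

// Illustrative hashes, shortened for readability.
const changed = changedSources(
  { JMdict: "aaa", Kanjidic: "bbb" },
  { JMdict: "aaa", Kanjidic: "ccc", JMnedict: "ddd" },
);
const notes = releaseNotes(changed);
```

Keeping the comparison separate from downloading and hashing makes it easy to test without network access.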

Possible to get the JSON file?

Hey, what if I don't want to download and install whatever your app is to get a file? Do you have a prebuilt json file of jmdict I can grab?

TypeScript type definitions

  • Provide *.d.ts files for both JMdict and JMnedict JSON dictionaries, with sufficient comments on every type and field
  • Maybe replace the current format descriptions in the readme with these new definitions?

KanjiDic?

Are you interested in adding kanjidic, or have thoughts on what already covers that well for JSON?
