scriptin / jmdict-simplified


JMdict, JMnedict, Kanjidic, KRADFILE/RADKFILE in JSON format

License: Creative Commons Attribution Share Alike 4.0 International

Kotlin 76.52% JavaScript 4.36% TypeScript 19.12%

jmdict json japanese-language japanese dictionary language xml dictionary-tools jmnedict kanjidic

jmdict-simplified's Introduction

👋 Hello, my name is Dmitry

Website • LinkedIn • StackExchange • NPM • Tatoeba.org

I am a software developer, coding mostly webapps. Programming is both my job and hobby, but I am also interested in (natural) languages, cyber-security, astronomy, math, and art.

const scriptin = {
  name: "Dmitry Shpika", // [DMEE-tree SHPEE-kah]
  pronouns: ["he", "him"],
  education: [
    { specialty: "Software Engineering",
      type: "self-taught",
      experience: "12+ years" },
    { specialty: "IT Security",
      type: "degree",
      experience: "Occasional consulting and auditing" },
  ],
  hardSkills: {
    programming: [TypeScript, JavaScript, Kotlin, SQL, ShellScripts, Python],
    ui: [React, HTML, CSS, Tailwind, Bootstrap, MaterialUI, Figma],
    api: [GraphQL, REST, SOAP, RPC],
    buildTools: [Webpack, Vite, Gradle],
    graphics: [Canvas2D, Processing, P5js, SVG],
    testing: [TDD, BDD, Unit, E2E],
    ci: [Docker, GitHubActions, GitLabCI],
    other: [Parsing, WebScraping, XML],
    _outdated: [PHP, Java, jQuery, AngularJS], // In the past
  },
  softSkills: {
    teaching: "I like to explain complex topics in simple terms",
    learning: "Currently focusing on foreign languages",
    design: "Creating UIs using component libraries/frameworks, or from scratch",
    writing: `Technical and fiction.
              I've written 130+ answers on Stack Exchange,
              maybe 5-7 articles, dozens of readmes,
              countless Jira tickets,
              a few short novels and poems`,
  },
  hobbies: [Programming, Languages, Astronomy, Math, Art],
} satisfies SoftwareEngineer;

Check out my pinned repos 👇

jmdict-simplified's People


kwstewar spamdaemon vnenkpet lettenj61 nyuczka ashrafuljoypb fasiha konstantindjairo marcabutler

jmdict-simplified's Issues

Kanjidic 3.5.0 json is missing some radicals

The radicals array in Kanjidic json has only one element of each type (classical or nelson_c) in every kanji.

For example, kanji 家 has only one classical radical element with value 40 (宀). Another radical with value 152 (豕) is missing from the array.

Cloning takes forever, remove XML files?

The XML files make the project weigh over 80 MB, so cloning takes a long time for no good reason.
Anyone who needs to run the code can just download the latest XML file from the site.

Add g_type attribute on gloss elements

Added in the latest revision of JMdict (does not apply to JMnedict):

<!ATTLIST gloss g_type CDATA #IMPLIED>
        <!-- The g_type attribute specifies that the gloss is of a particular
        type, e.g. "lit" (literal), "fig" (figurative), "expl" (explanation).
        -->

*.tgz files are not compressed

They are just tar archives, with no GZIP compression.
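
A quick way to confirm a report like this is to look at the first two bytes of the file: a gzip stream starts with the magic bytes 0x1f 0x8b, while a plain tar archive starts with the name of its first member. The sketch below only illustrates that check; the function name and the sample byte arrays are made up for the example and are not part of the project.

```typescript
// gzip streams begin with the two magic bytes 0x1f 0x8b (RFC 1952).
const GZIP_MAGIC: readonly [number, number] = [0x1f, 0x8b];

/** Returns true when the buffer begins with the gzip magic bytes. */
function looksGzipped(bytes: Uint8Array): boolean {
  return (
    bytes.length >= 2 &&
    bytes[0] === GZIP_MAGIC[0] &&
    bytes[1] === GZIP_MAGIC[1]
  );
}

// Hypothetical sample data: a gzip header, and the start of a plain tar
// archive whose first member name begins with the ASCII text "jmdict".
const gzipStart = new Uint8Array([0x1f, 0x8b, 0x08, 0x00]);
const plainTarStart = new Uint8Array([0x6a, 0x6d, 0x64, 0x69, 0x63, 0x74]);

const gzipDetected = looksGzipped(gzipStart);
const tarDetected = looksGzipped(plainTarStart);
```

In practice the bytes would come from reading the beginning of the released *.tgz file.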

Add JSON schema validation

Add JSON schema files and a task for running the validation for each dictionary.

Usually Kana

Hello, I noticed that the UK tag is omitted from the JSON. Is there a way it can be added?

Make BaseX a build script dependency

Instead of expecting the basex command to be on $PATH, add BaseX as a dependency of the build script. This way the only requirement for running the build will be Java 8.

appliesToKanji / appliesToKana became empty for senses

At some previous point, appliesToKanji and appliesToKana were populated for senses. I'm not sure what the new pattern is exactly except that it looks like appliesToKanji is now always(?) empty for senses whereas it used to include "*", and appliesToKana also used to have "*" for senses.
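
To show why the missing "*" matters to consumers, here is a minimal sketch of how a client might test whether a sense applies to a given written form. It assumes that "*" acts as a wildcard meaning "all kanji (or kana) forms of the word"; that reading, and the function name, are assumptions made for the example, since the issue only states that "*" used to be present.

```typescript
// Assumed wildcard marker meaning "applies to every form of the word".
const WILDCARD = "*";

/**
 * Returns true when a sense applies to the given written form,
 * either through the wildcard or through an explicit listing.
 */
function senseAppliesTo(appliesTo: string[], written: string): boolean {
  return appliesTo.indexOf(WILDCARD) !== -1 || appliesTo.indexOf(written) !== -1;
}

const withWildcard = senseAppliesTo(["*"], "家"); // wildcard matches any form
const withEmptyList = senseAppliesTo([], "家");   // empty list matches nothing
```

Under this assumption, a client written against the older output would treat every sense with an empty array as applying to no form at all.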

Extract specific language

Hi,

The JSON is converted from JMdict_e, but JMdict is also available in French, German, Russian, and Dutch.

Is it possible to extract data for a specific language code?
Something like passing a language code: ./gradlew download fr
If no language code is provided, the default would be English.
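
Until something like this exists in the build, the same effect could be approximated on the consumer side by filtering glosses. The sketch below assumes each gloss carries a `lang` code and a `text` field, and that English is coded as "eng"; the shape, the codes, and the sample data are assumptions for illustration only.

```typescript
// Assumed gloss shape: a language code plus the translated text.
interface Gloss {
  lang: string;
  text: string;
}

/** Keeps only glosses in the requested language; defaults to English ("eng", assumed). */
function glossesFor(glosses: Gloss[], lang: string = "eng"): Gloss[] {
  return glosses.filter((gloss) => gloss.lang === lang);
}

// Hypothetical sample data.
const sampleGlosses: Gloss[] = [
  { lang: "eng", text: "dictionary" },
  { lang: "fre", text: "dictionnaire" },
];

const englishOnly = glossesFor(sampleGlosses);
const frenchOnly = glossesFor(sampleGlosses, "fre");
```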

More directions on set up?

I don't have much experience with Gradle and I'm having a hard time getting this set up. Could you add directions for how to get started? I'm getting the messages "FAILURE: Build failed with an exception." and "Could not determine java version from '12.0.1'" when I run "./gradlew tags" or "./gradlew tasks".

README said 'Java8+', but I'm wondering if this is not compatible with the newest version of Java, or if I skipped a step when setting up gradle.

Update with latest JMdict?

Forgive me for bugging you, I wondered if you could publish an update with the latest JMdict? Version 3.0.1 was released 1.5 years ago and today I came across a word that had been added earlier this year and wasn't in jmdict-simplified. Thank you!!

xref element in JMdict sometimes contains a reb with JIS centre-dots

Whilst playing around parsing the xref field in my own parser I noticed that there is a problem with the xref field in the original XML file.

The JIS centre-dot '・' is used to separate components of the xref but some reb contain that centre dot, so you get xrefs like:
<xref>ブロードノーズ・セブンギル・シャーク</xref>
<xref>イエローテール・スターリー・ラビットフィッシュ</xref>

From my short investigations it seems like it is only these two xrefs which have this problem.

Parsing these by splitting on the centre-dot gives a list of 3 strings, when it should be a list containing a single string.

I have contacted Jim Breen, the author of JMdict, but in the meantime the solution is to hard-code a check for these two xrefs and return each one as is instead of splitting it on the centre-dot, since each refers to a single reb.
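
The workaround described above could be sketched roughly as follows. The function and constant names are made up for the example, and the ordinary xref in the usage lines is a hypothetical value used only to show the normal splitting path.

```typescript
const CENTRE_DOT = "・";

// The two xrefs reported in this issue whose reading itself contains
// centre-dots, so they must not be split.
const UNSPLITTABLE_XREFS: readonly string[] = [
  "ブロードノーズ・セブンギル・シャーク",
  "イエローテール・スターリー・ラビットフィッシュ",
];

/** Splits an xref into its components, leaving the known exceptions whole. */
function splitXref(xref: string): string[] {
  if (UNSPLITTABLE_XREFS.indexOf(xref) !== -1) {
    return [xref];
  }
  return xref.split(CENTRE_DOT);
}

const exceptionParts = splitXref("ブロードノーズ・セブンギル・シャーク");
// Hypothetical ordinary xref: written form, reading, sense number.
const ordinaryParts = splitXref("漢字・かんじ・1");
```

A hard-coded list is brittle if more such readings appear in later revisions, so it is best treated as a stopgap.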

Change field names into camel case: jmdict-date -> jmdictDate, etc.

jmdict-date -> jmdictDate
jmdict-revisions -> jmdictRevisions
jmnedict-date -> jmnedictDate
jmnedict-revisions -> jmnedictRevisions

Because it's more convenient for many popular programming languages.
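
For consumers who need to support both naming styles during a transition, the renaming can be expressed as a small helper. This is only a sketch; the helper names are hypothetical and the sample date is taken from the XML comment quoted elsewhere on this page.

```typescript
/** Converts a kebab-case name such as "jmdict-date" to camelCase. */
function kebabToCamel(name: string): string {
  return name.replace(/-([a-z])/g, (_match: string, letter: string) =>
    letter.toUpperCase()
  );
}

/** Returns a shallow copy of the object with all top-level keys converted. */
function renameKeys(obj: Record<string, unknown>): Record<string, unknown> {
  const result: Record<string, unknown> = {};
  for (const key of Object.keys(obj)) {
    result[kebabToCamel(key)] = obj[key];
  }
  return result;
}

const renamed = renameKeys({ "jmdict-date": "2016-12-28" });
```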

Fields often contain "?"

See ID 1577140 for example - its first sense, for "where"/"what place", has question marks for partOfSpeech and misc.

I've been trying to read the code to understand how this could happen, but no luck yet. Very nice work, otherwise! :)

Part-of-speech of an earlier 'sense' element must apply to later senses unless a new part-of-speech is indicated

In the header of original file:

<!ELEMENT pos (#PCDATA)>
<!-- Part-of-speech information about the entry/sense. Should use 
appropriate entity codes. In general where there are multiple senses
in an entry, the part-of-speech of an earlier sense will apply to
later senses unless there is a new part-of-speech indicated.
-->

Currently, there are senses with an empty part-of-speech field.
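
The rule quoted from the DTD could be applied as a post-processing pass over an entry's senses, as in the sketch below. The `Sense` shape is a simplified, hypothetical one (only `partOfSpeech` matters here), and the tag values in the sample are placeholders.

```typescript
// Simplified, hypothetical sense shape for illustration.
interface Sense {
  partOfSpeech: string[];
  gloss: string[];
}

/**
 * Returns new senses in which an empty partOfSpeech inherits the value
 * of the nearest earlier sense that had one. The input is not mutated.
 */
function inheritPartOfSpeech(senses: Sense[]): Sense[] {
  let lastSeen: string[] = [];
  return senses.map((sense) => {
    if (sense.partOfSpeech.length > 0) {
      lastSeen = sense.partOfSpeech;
      return sense;
    }
    return { partOfSpeech: lastSeen.slice(), gloss: sense.gloss };
  });
}

// Hypothetical entry with two senses that omit part-of-speech.
const rawSenses: Sense[] = [
  { partOfSpeech: ["n"], gloss: ["first"] },
  { partOfSpeech: [], gloss: ["second"] },
  { partOfSpeech: ["v1"], gloss: ["third"] },
  { partOfSpeech: [], gloss: ["fourth"] },
];

const filledSenses = inheritPartOfSpeech(rawSenses);
```

If the first sense of an entry has no part-of-speech, there is nothing to inherit and the field stays empty.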

JMdict version number?

It would be useful if the JSON file included another key that contained the JMdict version number or release date or some way to identify which copy of JMdict it corresponded to.

In my copy of JMdict_e XML database, I see a comment:

<!-- JMdict created: 2016-12-28 -->

I realize it might be hard to extract that, via XQuery, and insert it into the JSON, but perhaps this could be done outside XQuery, as a post-processing step?
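
As a rough illustration of such a post-processing step, the creation date can be pulled out of the comment with a regular expression applied to the beginning of the XML text. The function name is hypothetical, and the sketch assumes the comment keeps exactly the format shown above.

```typescript
/**
 * Extracts the date from a comment of the form
 * "<!-- JMdict created: YYYY-MM-DD -->", or returns null if it is absent.
 */
function extractJmdictDate(xmlHead: string): string | null {
  const match = /<!--\s*JMdict created:\s*(\d{4}-\d{2}-\d{2})\s*-->/.exec(xmlHead);
  return match?.[1] ?? null;
}

const foundDate = extractJmdictDate(
  '<?xml version="1.0" encoding="UTF-8"?>\n<!-- JMdict created: 2016-12-28 -->'
);
```

The extracted string could then be written into the JSON as a top-level metadata key.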

I ask about this because we have found that entries come and go as JMdict matures, and it’s conceivable that a cross-reference made today might be invalid tomorrow. If cross-references included the JMdict version as metadata, that would be a step toward resolving such problems when they inevitably arise.

Generate documentation from types

Use https://typedoc.org/ or https://tsdoc.org/
Add intro on how to use the types - ideally with packages, see #23

A huge, huge onegai: JMnedict?

I'm so hesitant and ashamed to ask this—have you ever thought of a jmnedict-simplified, i.e., a JSON version of JMnedict, the Japanese names dictionary? A glance at the XML file's DTD shows a lot of parallels with JMdict, but I'm not familiar enough with XPath to surmise how much or how little work this would be, so I'll just say that I love working with jmdict-simplified and would love a similar resource for JMnedict. Thanks for listening 🙇‍♂️!

Publish NPM packages

There should be 2 separate libraries:

@scriptin/jmdict-simplified-types - TypeScript type definitions. Useful when someone works with JSON files directly.
@scriptin/jmdict-simplified-loader - Loader utility package. Includes a streaming JSON parser with a simplified API for extracting metadata and words one-by-one. This library depends on @scriptin/jmdict-simplified-types

(Namespace @scriptin is chosen to prevent name clashes and squatting. We don't want a repeat of the left-pad story)

This work is already started - see node directory.

Remaining work:

Documentation
Publishing to NPM
CI/CD workflow: build, test, publish - most likely a separate workflow is required

Good guide on how to create a TypeScript lib: How to publish packages to npm in 2023 by Matt Pocock

Downloads the source files
Checks if they were updated by comparing file hashes. Hashes from the last build need to be stored in the repository to be able to compare
If any change was detected, trigger an existing update action to create a new release
If possible, set a release notes message about which dictionary was updated and triggered a build

Limitations: free 2000 minutes per month
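
The hash-comparison step in the list above could look roughly like this. The manifest shape (file name mapped to hash string), the function names, and the sample hash values are all hypothetical; computing the hashes and storing the manifest in the repository are left out.

```typescript
// Hypothetical manifest: source file name -> hash of its contents.
type HashManifest = Record<string, string>;

/** Lists the sources whose hash differs from the one stored at the last build. */
function changedSources(previous: HashManifest, current: HashManifest): string[] {
  return Object.keys(current)
    .filter((name) => previous[name] !== current[name])
    .sort();
}

/** Builds a release-notes line, or null when nothing changed and no build is needed. */
function releaseNote(changed: string[]): string | null {
  return changed.length === 0
    ? null
    : `Updated dictionaries: ${changed.join(", ")}`;
}

// Placeholder hash values for illustration.
const storedHashes: HashManifest = { JMdict: "aaa", JMnedict: "bbb" };
const freshHashes: HashManifest = { JMdict: "aaa", JMnedict: "ccc" };

const changed = changedSources(storedHashes, freshHashes);
const note = releaseNote(changed);
```

A source that is new to the manifest counts as changed here, because its stored hash is undefined.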

Possible to get the JSON file?

Hey, what if I don't want to download and install whatever your app is to get a file? Do you have a prebuilt json file of jmdict I can grab?

TypeScript type definitions

Provide *.d.ts files for both JMdict and JMnedict JSON dictionaries, with sufficient comments on every type and field
Maybe replace the current format descriptions in the readme with these new definitions?

KanjiDic?

Are you interested in adding kanjidic, or have thoughts on what already covers that well for JSON?
