eagleflo / jisho

Jisho is a CLI tool & Rust library that provides a Japanese-English dictionary.

License: GNU General Public License v3.0

Rust 97.63% Shell 2.37%
japanese-dictionary jmdict

jisho's Issues

Support wildcard searches

It would be useful to support wildcard searches, where an asterisk replaces an unknown kanji (more than one?) and the query matches any word in the corpus that fits the pattern. A bit like regex but less complex.

> 飛*機
飛行機【ひこうき】- aeroplane, airplane, aircraft
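
A minimal sketch of how such matching could work, assuming each `*` stands for exactly one unknown character (the "more than one?" question is left open). The name `wildcard_match` is hypothetical and not part of jisho.

```rust
/// Returns true if `word` matches `pattern`, where each `*` in the
/// pattern stands for exactly one arbitrary character.
/// Hypothetical helper, not part of jisho's current API.
fn wildcard_match(pattern: &str, word: &str) -> bool {
    let mut p = pattern.chars();
    let mut w = word.chars();
    loop {
        match (p.next(), w.next()) {
            // Both exhausted at the same time: full match.
            (None, None) => return true,
            // A wildcard consumes any single character.
            (Some('*'), Some(_)) => continue,
            // Literal characters must be identical.
            (Some(a), Some(b)) if a == b => continue,
            // Length mismatch or differing characters.
            _ => return false,
        }
    }
}
```

Comparing by `char` rather than by byte keeps multi-byte kanji intact. A lookup would then filter the dictionary keys with this predicate.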

Quick reference

If I end up implementing subcommands (#3), more complex querying capabilities (#24) or flags (#28), I'll want to have a quick reference for these because nobody will be able to remember them.

Tradition dictates this would be printed to stdout with -h / --help. It should also be printed when the program is invoked with an incorrect number of arguments and in similar confusing situations.

(Probably not going to require a full-blown man page, unless I get really wild.)

Deduplicate dictionary lookup logic

There's a fair bit of duplicated logic depending on which dictionary is being consulted in

jisho/src/lib.rs

Lines 52 to 76 in 425b409

    if is_japanese(&first) {
        if j2e.contains_key(input) {
            let (reb, gloss) = j2e.get(input).unwrap();
            println!("{}【{}】- {}", input, reb, gloss);
        } else {
            for key in j2e.keys() {
                if key.starts_with(input) {
                    let (reb, gloss) = j2e.get(key).unwrap();
                    println!("{}【{}】- {}", input, reb, gloss);
                }
            }
        }
    } else {
        if e2j.contains_key(input) {
            let (keb, reb) = e2j.get(input).unwrap();
            println!("{}【{}】- {}", keb, reb, input);
        } else {
            for key in e2j.keys() {
                if key.starts_with(input) {
                    let (keb, reb) = e2j.get(key).unwrap();
                    println!("{}【{}】- {}", keb, reb, key);
                }
            }
        }
    }

By clearly setting up keb, reb and gloss it should be possible to cut that in half.
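
One possible shape for the shared part, as a sketch only: the "exact match, otherwise prefix matches" step is the same for both maps, so it could live in one helper that is generic over the value type. The name `matching_keys` is hypothetical, and the real maps in lib.rs may be typed differently.

```rust
use std::collections::HashMap;

/// Returns the keys to report for `input`: the exact key if present,
/// otherwise every key that starts with `input`.
/// Hypothetical helper; jisho's real maps and value types may differ.
fn matching_keys<'a, V>(map: &'a HashMap<String, V>, input: &str) -> Vec<&'a str> {
    if let Some((key, _)) = map.get_key_value(input) {
        return vec![key.as_str()];
    }
    let mut keys: Vec<&str> = map
        .keys()
        .filter(|key| key.starts_with(input))
        .map(|key| key.as_str())
        .collect();
    keys.sort_unstable(); // HashMap iteration order is unspecified
    keys
}
```

Each branch would then only differ in how it arranges keb, reb and gloss for printing.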

Make lookup return data

Currently jisho::lookup takes it upon itself to print the results, if any. If this is to be of any use as a library, we should be returning real data and letting the client program decide what to do with the results.
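
A rough sketch of what a data-returning lookup might look like. The `Entry` struct, its field names and this `lookup` signature are assumptions for illustration, not jisho's actual API; only exact matches are handled to keep it short.

```rust
use std::collections::HashMap;
use std::fmt;

/// One dictionary hit. Field names follow the JMdict elements; this
/// struct is an assumption, not jisho's actual type.
#[derive(Debug, Clone, PartialEq)]
struct Entry {
    keb: String,
    reb: String,
    gloss: String,
}

impl fmt::Display for Entry {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "{}【{}】- {}", self.keb, self.reb, self.gloss)
    }
}

/// Returns matches instead of printing them; the caller chooses the output.
fn lookup(j2e: &HashMap<String, (String, String)>, input: &str) -> Vec<Entry> {
    j2e.get(input)
        .map(|(reb, gloss)| Entry {
            keb: input.to_string(),
            reb: reb.clone(),
            gloss: gloss.clone(),
        })
        .into_iter()
        .collect()
}
```

The CLI could keep its current output by printing each returned `Entry` through its `Display` implementation, while library users get structured data.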

Create separate entries for each gloss

Many entries in JMdict have multiple glosses defined. For instance:

<entry>
<ent_seq>1183780</ent_seq>
<k_ele>
<keb>音響</keb>
<ke_pri>ichi1</ke_pri>
<ke_pri>news1</ke_pri>
<ke_pri>nf11</ke_pri>
</k_ele>
<r_ele>
<reb>おんきょう</reb>
<re_pri>ichi1</re_pri>
<re_pri>news1</re_pri>
<re_pri>nf11</re_pri>
</r_ele>
<sense>
<pos>&n;</pos>
<gloss>sound</gloss>
<gloss>noise</gloss>
<gloss>acoustics</gloss>
<gloss>reverberation</gloss>
<gloss>echo</gloss>
<gloss>audio</gloss>
</sense>
</entry>

Reporting each of these makes tons of sense, and again improves results by a wide margin.

Allow prefix matches for meanings?

~ % jisho prime minister
首相【しゅしょう】- prime minister, chancellor (Germany, Austria, etc.), premier
宰相【さいしょう】- prime minister, premier, chancellor
首班【しゅはん】- head, leader, prime minister
PM【ピー・エム】- private message, PM, post-meridiem, afternoon, project manager, product manager, particulate matter, prime minister
プライムミニスター - prime minister
~ % jisho 総理大臣
総理大臣【そうりだいじん】- prime minister (as the head of a cabinet government), premier

Well, that's not optimal. It looks like we currently only collect exact matches with an English search term.

I'm wondering what kind of hassle it would be to allow prefix matches for meanings -- would that lead to a flood of irrelevant results for shorter search terms? This is a case where I definitely think the additional clarification should not block inclusion.

If this pattern of "right answer (additional note)" is common enough, maybe it ought to be handled separately.
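
If the pattern were handled separately, one option (a sketch, not a decided design) would be to strip parenthesised notes from a gloss before using it as a lookup key. The name `strip_notes` is hypothetical.

```rust
/// Removes parenthesised notes from a gloss and trims the result, e.g.
/// "prime minister (as the head of a cabinet government)" -> "prime minister".
/// Hypothetical normalisation step; nesting is handled via a depth counter.
fn strip_notes(gloss: &str) -> String {
    let mut out = String::with_capacity(gloss.len());
    let mut depth = 0usize;
    for c in gloss.chars() {
        match c {
            '(' => depth += 1,
            ')' => depth = depth.saturating_sub(1),
            _ if depth == 0 => out.push(c),
            _ => {}
        }
    }
    // Collapse whitespace left behind by removed notes.
    out.split_whitespace().collect::<Vec<_>>().join(" ")
}
```

Indexing under the stripped key while still displaying the full gloss would let an exact search for "prime minister" find 総理大臣 without opening the door to general prefix matching.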

Version check

I personally use jisho on three computers and I'm not always sure which version I have installed via cargo.

Tradition dictates this should be available with -v / --version, as I can't foresee needing --verbose instead.
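
A small sketch of the flag check. Cargo exposes the package version to the compiler through the `CARGO_PKG_VERSION` environment variable; the function names here are hypothetical.

```rust
/// Version string baked in at compile time. Cargo sets CARGO_PKG_VERSION
/// when building a package; outside Cargo this falls back to "unknown".
fn version() -> &'static str {
    option_env!("CARGO_PKG_VERSION").unwrap_or("unknown")
}

/// Returns true when the version should be printed.
fn wants_version(args: &[String]) -> bool {
    args.iter().any(|a| a == "-v" || a == "--version")
}
```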

Fix how backspace works in REPL-mode

As of now backspacing characters doesn't fully work -- there are leftover characters in the CLI. Ideally backspace removes one character, be that ASCII or Unicode, and there is no garbage left over.

(It's possible that solving #19 also solves this.)

Handle multiple reading elements

Some common words are often spelled in both ひらがな and カタカナ. For example:

<entry>
<ent_seq>1345605</ent_seq>
<r_ele>
<reb>そろそろ</reb>
<re_pri>ichi1</re_pri>
</r_ele>
<r_ele>
<reb>ソロソロ</reb>
</r_ele>
<sense>
<pos>&adv;</pos>
<pos>&adv-to;</pos>
<misc>&on-mim;</misc>
<gloss>slowly</gloss>
<gloss>quietly</gloss>
<gloss>steadily</gloss>
<gloss>gradually</gloss>
<gloss>gingerly</gloss>
</sense>
<sense>
<misc>&on-mim;</misc>
<gloss>soon</gloss>
<gloss>momentarily</gloss>
<gloss>before long</gloss>
<gloss>any time now</gloss>
</sense>
</entry>

Currently only the first reading element is processed. It wouldn't hurt to also process the second one.

Store *all* results for a given key

A pretty clear bug: there are multiple entries for many keb, reb and gloss elements. Storing every match, instead of constantly overwriting the results, produces much better lookup results (duh).
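
The difference comes down to keeping a `Vec` per key and appending to it, rather than calling `insert` with a single value. A minimal sketch, where the `(reb, gloss)` tuple layout and the name `add_match` are assumptions:

```rust
use std::collections::HashMap;

/// Appends a (reb, gloss) pair under `keb` instead of replacing whatever
/// was stored before. The tuple layout is an assumption for illustration.
fn add_match(
    map: &mut HashMap<String, Vec<(String, String)>>,
    keb: &str,
    reb: &str,
    gloss: &str,
) {
    map.entry(keb.to_string())
        .or_default()
        .push((reb.to_string(), gloss.to_string()));
}
```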

Reduce binary size

As of now jisho is quite a large binary, as no effort whatsoever has been spent on optimizing for binary size.

However, it looks like Rust tooling has (recently?) grown more aware of binary sizes, and trying to update the embedded JMdict version to a more recent version triggered some built-in size limit of Crates.io. The JSON files derived from JMdict are certainly much more verbose than necessary, so this should be relatively easy to fix.

Sort result entries by frequency

When looking up words with a lot of exact matches like "green" or "blue", the canonical answers 緑 and 青 should come up first.
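
One way to approximate frequency is to use the priority tags JMdict already carries (`ke_pri` / `re_pri`, such as `ichi1` or `nf11`). The sketch below turns a tag list into a sort rank. The weighting is an assumption made up for illustration: an `nfXX` tag contributes its number, other tags ending in `1` rank after that scale, and untagged entries go last.

```rust
/// Derives a sort rank from JMdict priority tags such as "nf11" or "ichi1".
/// Lower ranks sort first. The weighting is an assumption for illustration:
/// an "nfXX" tag contributes XX, other "...1" tags rank just after that
/// scale, and untagged entries go last.
fn priority_rank(tags: &[&str]) -> u32 {
    tags.iter()
        .map(|tag| match tag.strip_prefix("nf").and_then(|n| n.parse::<u32>().ok()) {
            Some(n) => n,
            None if tag.ends_with('1') => 100,
            None => 200,
        })
        .min()
        .unwrap_or(u32::MAX)
}
```

Results could then be ordered with `sort_by_key`, which is stable, so entries with equal rank keep their existing order.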

Add readline-like features for REPL-mode

It's often helpful to refine searches over multiple queries. Adding basic readline-like support into jisho would make tons of sense.

It looks like Rust has a mature library called rustyline providing something like that.

Combine multiple args into a single input for lookup

Currently you have to use quotes around multiple words to perform a single lookup:

$ jisho "quantum mechanics"
量子力学【りょうしりきがく】- quantum mechanics

While in accordance with how shells are supposed to work, this is a bit tiresome. Might as well just combine all the args into a single input so the quotes become unnecessary.
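
A sketch of the joining step; `query_from_args` is a hypothetical name. It takes any iterator of strings so it can be fed `std::env::args()` directly.

```rust
/// Joins all CLI arguments (without the program name) into one query, so
/// `jisho quantum mechanics` behaves like `jisho "quantum mechanics"`.
fn query_from_args<I: IntoIterator<Item = String>>(args: I) -> String {
    args.into_iter().skip(1).collect::<Vec<_>>().join(" ")
}
```

In `main` this would be called as `query_from_args(std::env::args())`.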

Simple lookup tests

Might be useful to have certain test lookups running automatically to keep the largest regressions at bay.

Improve lookup heuristic

~ % jisho ある限り
~ % jisho "all (there is)"
ある限り【あるかぎり】- all (there is), as long as there is

That's not ideal. The original very naïve piece of code just looked at the first character of the input when deciding which HashMap to look in. I rewrote it so that if there is any 漢字 in the query we look at keb rather than reb. Likewise reading lookups now require the whole input to be in 仮名; according to the JMdict DTD, reb is restricted to 仮名.
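
For illustration, the heuristic described above could be sketched like this. This is not the code in lib.rs; the names are hypothetical and the kanji check only covers the main CJK Unified Ideographs block.

```rust
#[derive(Debug, PartialEq)]
enum Target {
    Keb,
    Reb,
    Gloss,
}

/// Kanji check limited to the CJK Unified Ideographs block (U+4E00..=U+9FFF);
/// rarer extension blocks are deliberately ignored in this sketch.
fn is_kanji(c: char) -> bool {
    ('\u{4E00}'..='\u{9FFF}').contains(&c)
}

/// Hiragana (U+3040..=U+309F) or katakana (U+30A0..=U+30FF).
fn is_kana(c: char) -> bool {
    ('\u{3040}'..='\u{30FF}').contains(&c)
}

/// Any kanji selects keb, an all-kana input selects reb, and everything
/// else is treated as an English gloss.
fn classify(input: &str) -> Target {
    if input.chars().any(is_kanji) {
        Target::Keb
    } else if !input.is_empty() && input.chars().all(is_kana) {
        Target::Reb
    } else {
        Target::Gloss
    }
}
```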

Update the embedded JMdict

JMdict doesn't really have version numbers, as it's basically a dump from a database. This means the dictionary embedded in Jisho falls out of date as time moves on. The current version is from 2020-09-23, which is now a couple of years out of date.

(Initial attempt at fixing this encountered problems with binary sizes, tracked separately in #21.)

Handle entries without associated kanji

Some dictionary entries don't have a k_ele / keb element at all. For instance:

<entry>
<ent_seq>1345605</ent_seq>
<r_ele>
<reb>そろそろ</reb>
<re_pri>ichi1</re_pri>
</r_ele>
<r_ele>
<reb>ソロソロ</reb>
</r_ele>
<sense>
<pos>&adv;</pos>
<pos>&adv-to;</pos>
<misc>&on-mim;</misc>
<gloss>slowly</gloss>
<gloss>quietly</gloss>
<gloss>steadily</gloss>
<gloss>gradually</gloss>
<gloss>gingerly</gloss>
</sense>
<sense>
<misc>&on-mim;</misc>
<gloss>soon</gloss>
<gloss>momentarily</gloss>
<gloss>before long</gloss>
<gloss>any time now</gloss>
</sense>
</entry>

These should still be looked up.
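
A sketch of one way to handle this: fall back to the first reading as the headword when no keb exists. The slice-based entry shape and the name `headword` are assumptions.

```rust
/// Picks the headword for an entry: the first keb when there is one,
/// otherwise the first reb. Returns None for an entry with neither.
/// The slice-based entry shape is an assumption for illustration.
fn headword<'a>(kebs: &[&'a str], rebs: &[&'a str]) -> Option<&'a str> {
    kebs.first().or_else(|| rebs.first()).copied()
}
```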

Add benchmarks

I need some lookup benchmarks to make informed choices about possible changes to jisho.

Reading lookup

Sometimes you have a good idea of a reading but no idea which kanji to use for the word. If the input is purely hiragana we could also look up readings and try to find matches based on that.

Work directly with JMdict_e.gz

I didn't use to think this was a problem since crates are compressed regardless, but maybe it would be a workflow improvement to work straight off the original primary source archive in the build script.

Low priority.

Prefix & postfix matches

I'm getting convinced that the most common use cases for #24 would actually be covered by simple prefix and postfix searches. The syntax should follow convention: for instance, any search term followed by either * or should turn on prefix match.

(When #30 gets fixed, it would also be useful to support forcing exact matches for those cases where the default heuristics including either prefix or postfix matches would provide too many results.)
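
A sketch of how the search term could be parsed, assuming a trailing `*` means prefix match and a leading `*` means postfix match. Only the ASCII asterisk is handled, and all names are hypothetical.

```rust
#[derive(Debug, PartialEq)]
enum Query<'a> {
    Exact(&'a str),
    Prefix(&'a str),
    Postfix(&'a str),
}

/// Parses a search term: a trailing `*` requests a prefix match and a
/// leading `*` a postfix match. This syntax is an assumption based on
/// common convention, not jisho's implemented behaviour.
fn parse_query(term: &str) -> Query<'_> {
    if let Some(stem) = term.strip_suffix('*') {
        Query::Prefix(stem)
    } else if let Some(stem) = term.strip_prefix('*') {
        Query::Postfix(stem)
    } else {
        Query::Exact(term)
    }
}

/// Tests a dictionary key against a parsed query.
fn query_matches(query: &Query<'_>, word: &str) -> bool {
    match query {
        Query::Exact(s) => word == *s,
        Query::Prefix(s) => word.starts_with(*s),
        Query::Postfix(s) => word.ends_with(*s),
    }
}
```

Keeping `Exact` as an explicit variant leaves room for the forced exact match mentioned above once the default heuristics change.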

Interactive mode

Without CLI arguments jisho should open a REPL-like interactive mode.
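
A bare-bones sketch of such a loop using only the standard library (no line editing). The function `repl` and its `handle` callback are hypothetical; the loop is generic over reader and writer so it can be driven by stdin/stdout or by in-memory buffers.

```rust
use std::io::{self, BufRead, Write};

/// Reads queries line by line until EOF, passing each non-empty line to
/// `handle` and writing its response. Generic over reader and writer so it
/// can be driven by stdin/stdout or by in-memory buffers in tests.
fn repl<R: BufRead, W: Write>(
    input: R,
    mut output: W,
    handle: impl Fn(&str) -> String,
) -> io::Result<()> {
    write!(output, "> ")?;
    output.flush()?;
    for line in input.lines() {
        let line = line?;
        let query = line.trim();
        if !query.is_empty() {
            writeln!(output, "{}", handle(query))?;
        }
        write!(output, "> ")?;
        output.flush()?;
    }
    Ok(())
}
```

`main` could call this with `io::stdin().lock()` and `io::stdout()` when `std::env::args().len() == 1`, and run a single lookup otherwise.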
