eagleflo / jisho

Jisho is a CLI tool & Rust library that provides a Japanese-English dictionary.

License: GNU General Public License v3.0

Rust 97.63% Shell 2.37%

japanese-dictionary jmdict

jisho's Issues

Support wildcard searches

It would be useful to support wildcard searches, where you can replace an unknown kanji (more than one?) with an asterisk and it would match any word in the corpus. A bit like regex but less complex.

> 飛*機
飛行機【ひこうき】- aeroplane, airplane, aircraft
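
A minimal sketch of how such a matcher could look. It assumes that each asterisk (ASCII `*` or full-width `＊`) stands for exactly one character, which is only one possible reading of the "more than one?" question above. The names `wildcard_match` and `wildcard_search` are hypothetical and not part of jisho.

```rust
/// Returns true if `word` matches `pattern`, where each `*` or `＊` in the
/// pattern stands for exactly one character (an assumption of this sketch).
fn wildcard_match(pattern: &str, word: &str) -> bool {
    let mut p = pattern.chars();
    let mut w = word.chars();
    loop {
        match (p.next(), w.next()) {
            // Both exhausted at the same time: every position matched.
            (None, None) => return true,
            // A wildcard consumes any single character.
            (Some('*' | '＊'), Some(_)) => continue,
            // Literal characters must be identical.
            (Some(a), Some(b)) if a == b => continue,
            // Length mismatch or differing literal.
            _ => return false,
        }
    }
}

/// Filters a word list down to the words matching `pattern`.
fn wildcard_search<'a>(pattern: &str, words: &[&'a str]) -> Vec<&'a str> {
    words
        .iter()
        .copied()
        .filter(|w| wildcard_match(pattern, w))
        .collect()
}
```

Comparing by `char` rather than by byte keeps multi-byte kanji intact. Matching a variable number of characters per asterisk would need a different (backtracking or regex-like) approach.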

Publish crate

In the wild west world of https://crates.io/ you need to reserve your crate name early on. This involves configuring a bit of metadata. See https://doc.rust-lang.org/cargo/reference/publishing.html for reference.

Quick reference

If I end up implementing subcommands (#3), more complex querying capabilities (#24) or flags (#28), I'll want to have a quick reference for these because nobody will be able to remember them.

Tradition dictates this would be printed to stdout with -h / --help. It should also be printed when the program is invoked with an incorrect amount of arguments and similar confusing situations.

(Probably not going to require a full-blown man page, unless I get really wild.)

Deduplicate dictionary lookup logic

There's a fair bit of duplicated logic depending on which dictionary is being consulted in

jisho/src/lib.rs

Lines 52 to 76 in 425b409

    if is_japanese(&first) {
        if j2e.contains_key(input) {
            let (reb, gloss) = j2e.get(input).unwrap();
            println!("{}【{}】- {}", input, reb, gloss);
        } else {
            for key in j2e.keys() {
                if key.starts_with(input) {
                    let (reb, gloss) = j2e.get(key).unwrap();
                    println!("{}【{}】- {}", input, reb, gloss);
                }
            }
        }
    } else {
        if e2j.contains_key(input) {
            let (keb, reb) = e2j.get(input).unwrap();
            println!("{}【{}】- {}", keb, reb, input);
        } else {
            for key in e2j.keys() {
                if key.starts_with(input) {
                    let (keb, reb) = e2j.get(key).unwrap();
                    println!("{}【{}】- {}", keb, reb, key);
                }
            }
        }
    }

By clearly setting up keb, reb and gloss in each branch, it should be possible to cut that in half.
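
One possible shape for the deduplicated version, as a sketch only. It assumes both maps are `HashMap<String, (String, String)>`, with `j2e` mapping keb to (reb, gloss) and `e2j` mapping gloss to (keb, reb), as the excerpt suggests. `find_matches` and `lookup_lines` are hypothetical names, and the result of `is_japanese` is passed in as a plain `bool` so the sketch does not have to guess at its implementation.

```rust
use std::collections::HashMap;

/// Returns the exact match for `input` if there is one, otherwise every
/// entry whose key starts with `input`. Works for either dictionary.
fn find_matches<'a>(
    dict: &'a HashMap<String, (String, String)>,
    input: &str,
) -> Vec<(&'a str, &'a str, &'a str)> {
    if let Some((key, (a, b))) = dict.get_key_value(input) {
        return vec![(key.as_str(), a.as_str(), b.as_str())];
    }
    dict.iter()
        .filter(|(key, _)| key.starts_with(input))
        .map(|(key, (a, b))| (key.as_str(), a.as_str(), b.as_str()))
        .collect()
}

/// Normalizes hits from either dictionary into (keb, reb, gloss) order and
/// formats them once, instead of once per branch.
fn lookup_lines(
    j2e: &HashMap<String, (String, String)>,
    e2j: &HashMap<String, (String, String)>,
    input: &str,
    japanese: bool,
) -> Vec<String> {
    let hits: Vec<(&str, &str, &str)> = if japanese {
        // j2e: keb -> (reb, gloss), already in the right order.
        find_matches(j2e, input)
    } else {
        // e2j: gloss -> (keb, reb), so the fields need reordering.
        find_matches(e2j, input)
            .into_iter()
            .map(|(gloss, keb, reb)| (keb, reb, gloss))
            .collect()
    };
    hits.into_iter()
        .map(|(keb, reb, gloss)| format!("{}【{}】- {}", keb, reb, gloss))
        .collect()
}
```

Note that this sketch reports the matched key for prefix hits in both directions, whereas the excerpt prints `input` in the Japanese prefix branch. The order of prefix hits follows `HashMap` iteration order and is therefore unspecified.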

Make lookup return data

Currently jisho::lookup takes it upon itself to print the results, if any. If this is to be of any use as a library, we should be returning real data and letting the client program decide what to do with the results.
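
A rough sketch of what a data-returning API might look like. The `Entry` struct, its fields and this `lookup` signature are assumptions made for illustration, not jisho's actual API. The current output format is preserved through a `Display` implementation, so the CLI can keep printing the same lines.

```rust
use std::fmt;

/// Hypothetical result type: one dictionary hit.
#[derive(Debug, Clone, PartialEq)]
pub struct Entry {
    pub keb: String,
    pub reb: String,
    pub gloss: String,
}

impl fmt::Display for Entry {
    // Same layout the CLI prints today: keb【reb】- gloss
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "{}【{}】- {}", self.keb, self.reb, self.gloss)
    }
}

/// Returns matching entries instead of printing them.
/// Exact matching only, to keep the sketch short.
pub fn lookup<'a>(entries: &'a [Entry], input: &str) -> Vec<&'a Entry> {
    entries
        .iter()
        .filter(|e| e.keb == input || e.gloss == input)
        .collect()
}
```

The CLI would then reduce to something like `for entry in lookup(&entries, &input) { println!("{}", entry); }`, while other clients are free to consume the entries directly.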

Create separate entries for each gloss

Many entries in JMdict have multiple glosses defined. For instance:

<entry>
<ent_seq>1183780</ent_seq>
<k_ele>
<keb>音響</keb>
<ke_pri>ichi1</ke_pri>
<ke_pri>news1</ke_pri>
<ke_pri>nf11</ke_pri>
</k_ele>
<r_ele>
<reb>おんきょう</reb>
<re_pri>ichi1</re_pri>
<re_pri>news1</re_pri>
<re_pri>nf11</re_pri>
</r_ele>
<sense>
<pos>&n;</pos>
<gloss>sound</gloss>
<gloss>noise</gloss>
<gloss>acoustics</gloss>
<gloss>reverberation</gloss>
<gloss>echo</gloss>
<gloss>audio</gloss>
</sense>
</entry>

Reporting each of these makes tons of sense, and again improves results by a wide margin.

Evaluate using a build script to create dictionaries at compile time

Using jisho as a CLI tool still feels a bit too slow. All of the time is spent parsing the original JMdict XML file and creating the lookup HashMaps for jisho's own use.

It looks like most of this work could be moved to compile time with the aid of https://doc.rust-lang.org/cargo/reference/build-scripts.html.

Allow prefix matches for meanings?

~ % jisho prime minister
首相【しゅしょう】- prime minister, chancellor (Germany, Austria, etc.), premier
宰相【さいしょう】- prime minister, premier, chancellor
首班【しゅはん】- head, leader, prime minister
ＰＭ【ピー・エム】- private message, PM, post-meridiem, afternoon, project manager, product manager, particulate matter, prime minister
プライムミニスター - prime minister
~ % jisho 総理大臣
総理大臣【そうりだいじん】- prime minister (as the head of a cabinet government), premier

Well, that's not optimal. It looks like we currently only collect exact matches with an English search term.

I'm wondering how much hassle it would be to allow prefix matches for meanings -- would that lead to a flood of irrelevant results for shorter search terms? This is a case where I definitely think the additional clarification should not block inclusion.

If this pattern of "right answer (additional note)" is common enough, maybe it ought to be handled separately.
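
If the pattern is handled separately, one option would be to index each gloss under its text with a trailing parenthesized note removed. The helper below is a hypothetical sketch that only strips a single note at the very end of the gloss; it is not something jisho currently does.

```rust
/// Strips one trailing " (note)" from a gloss, if present.
/// Glosses without a trailing note are returned unchanged.
fn strip_note(gloss: &str) -> &str {
    if gloss.ends_with(')') {
        // Look for the opening parenthesis of the last note.
        if let Some(open) = gloss.rfind(" (") {
            return gloss[..open].trim_end();
        }
    }
    gloss
}
```

With this, "prime minister (as the head of a cabinet government)" would also be reachable through an exact lookup of "prime minister", without opening the door to arbitrary prefix matches.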

Version check

I personally use jisho on three computers and I'm not always sure which version I have installed via cargo.

Tradition dictates this should be available with -v / --version, as I can't foresee needing --verbose instead.

Fix how backspace works in REPL-mode

As of now backspacing characters doesn't fully work -- there are leftover characters in the CLI. Ideally backspace removes one character, be that ASCII or Unicode, and there is no garbage left over.

(It's possible that solving #19 also solves this.)

Using lazy_static macro

Currently dictionary initialization happens on each lookup. That's not ideal. Unfortunately static variables come with a lot of restrictions in Rust; heap allocation being forbidden for one.

It looks like https://docs.rs/lazy_static/1.4.0/lazy_static/ might offer a way to sidestep that issue.
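
For comparison, here is a sketch of the same initialize-once idea using `std::sync::OnceLock` from the standard library (Rust 1.70 or later) rather than the lazy_static macro itself. The `build_dictionary` function and its single entry are placeholders standing in for the real parsing code.

```rust
use std::collections::HashMap;
use std::sync::OnceLock;

// The static itself holds no heap data until first use.
static DICTIONARY: OnceLock<HashMap<String, String>> = OnceLock::new();

/// Placeholder for the expensive parsing step (hypothetical contents).
fn build_dictionary() -> HashMap<String, String> {
    let mut map = HashMap::new();
    map.insert("辞書".to_string(), "dictionary".to_string());
    map
}

/// Builds the dictionary on the first call and returns the same
/// instance on every later call.
fn dictionary() -> &'static HashMap<String, String> {
    DICTIONARY.get_or_init(build_dictionary)
}
```

Either approach moves the heap allocation to runtime, behind a one-time initialization, which is what sidesteps the restriction on statics.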

Handle multiple reading elements

Some common words are often spelled in both ひらがな and カタカナ. For example:

<entry>
<ent_seq>1345605</ent_seq>
<r_ele>
<reb>そろそろ</reb>
<re_pri>ichi1</re_pri>
</r_ele>
<r_ele>
<reb>ソロソロ</reb>
</r_ele>
<sense>
<pos>&adv;</pos>
<pos>&adv-to;</pos>
<misc>&on-mim;</misc>
<gloss>slowly</gloss>
<gloss>quietly</gloss>
<gloss>steadily</gloss>
<gloss>gradually</gloss>
<gloss>gingerly</gloss>
</sense>
<sense>
<misc>&on-mim;</misc>
<gloss>soon</gloss>
<gloss>momentarily</gloss>
<gloss>before long</gloss>
<gloss>any time now</gloss>
</sense>
</entry>

Currently only the first reading element is processed. It wouldn't hurt to also process the second one.

Store all results for a given key

A pretty clear bug: there are multiple entries for many keb, reb and gloss elements. Storing every match, instead of constantly overwriting earlier results, produces much better lookup results (duh).
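
The fix amounts to mapping each key to a vector instead of a single value. A minimal sketch, with a hypothetical `index_by_gloss` helper and simplified types (gloss to a list of keb strings only):

```rust
use std::collections::HashMap;

/// Builds a gloss -> [keb, ...] index that keeps every match.
/// A plain `insert` would overwrite the earlier value for a repeated key.
fn index_by_gloss(pairs: &[(&str, &str)]) -> HashMap<String, Vec<String>> {
    let mut index: HashMap<String, Vec<String>> = HashMap::new();
    for (gloss, keb) in pairs {
        index
            .entry(gloss.to_string())
            .or_default()
            .push(keb.to_string());
    }
    index
}
```

Indexing the pairs ("prime minister", "首相") and ("prime minister", "宰相") this way leaves both words under the same key, in insertion order.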

Reduce binary size

As of now jisho is quite a large binary, as no effort whatsoever has been spent on optimizing for binary size.

However, it looks like Rust tooling has (recently?) grown more aware of binary sizes, and trying to update the embedded JMdict version to a more recent version triggered some built-in size limit of Crates.io. The JSON files derived from JMdict are certainly much more verbose than necessary, so this should be relatively easy to fix.

Sort result entries by frequency

When looking up words with a lot of exact matches like "green" or "blue", the canonical answers 緑 and 青 should come up first.
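
One way to approach this is to rank entries by the priority tags JMdict already carries (the `ke_pri` / `re_pri` elements seen in the XML excerpts). The sketch below assumes that `nfNN` tags are frequency bands where a lower number means a more frequent word, and treats entries without such a tag as least frequent. The helper names and the "rare" and "untagged" sample rows are hypothetical.

```rust
/// Returns the lowest `nfNN` band among the tags, or `u32::MAX`
/// when the entry has no frequency tag at all.
fn frequency_rank(priorities: &[&str]) -> u32 {
    priorities
        .iter()
        .filter_map(|tag| tag.strip_prefix("nf"))
        .filter_map(|band| band.parse::<u32>().ok())
        .min()
        .unwrap_or(u32::MAX)
}

/// Sorts (word, priority tags) pairs so the most frequent come first.
/// The sort is stable, so ties keep their original order.
fn sort_by_frequency(results: &mut [(String, Vec<&str>)]) {
    results.sort_by_key(|(_, tags)| frequency_rank(tags));
}
```

The other tags (`ichi1`, `news1` and so on) could be folded into the rank as well; this sketch ignores them for brevity.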

Add readline-like features for REPL-mode

It's often helpful to refine searches over multiple queries. Adding basic readline-like support into jisho would make tons of sense.

It looks like Rust has a mature library called rustyline providing something like that.

Combine multiple args into a single input for lookup

Currently you have to use quotes around multiple words to perform a single lookup:

$ jisho "quantum mechanics"
量子力学【りょうしりきがく】- quantum mechanics

While this is in accordance with how shells are supposed to work, it is a bit tiresome. We might as well combine all the args into a single input so the quotes become unnecessary.
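
A small sketch of the joining step. `join_args` is a hypothetical helper; it takes any iterator of strings so it can be exercised without a real command line.

```rust
/// Joins all arguments into one query, separated by single spaces.
fn join_args<I>(args: I) -> String
where
    I: IntoIterator<Item = String>,
{
    args.into_iter().collect::<Vec<_>>().join(" ")
}
```

In the binary this would be called as `join_args(std::env::args().skip(1))`, skipping the program name, so that `jisho quantum mechanics` and `jisho "quantum mechanics"` produce the same query.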

Collect multiple glosses into a vector

Rather than creating a separate entry for each different gloss as done in 7536fc7, it would make more sense to just collect them under a single entry.

Simple lookup tests

Might be useful to have certain test lookups running automatically to keep the largest regressions at bay.

Improve lookup heuristic

~ % jisho ある限り
~ % jisho "all (there is)"
ある限り【あるかぎり】- all (there is), as long as there is

That's not ideal. The original, very naïve piece of code just looked at the first character of the input when deciding which HashMap to look in. I rewrote it so that if there is any 漢字 in the query we look at keb rather than reb. Likewise, reading lookups now require the whole input to be in 仮名; according to the JMdict DTD, reb is restricted to 仮名.
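
The heuristic described above could be sketched roughly as follows. The Unicode ranges used here (CJK Unified Ideographs plus Extension A for kanji, the Hiragana and Katakana blocks for kana) are an assumption of this sketch and deliberately incomplete; `Target` and `choose_target` are hypothetical names, not the actual code.

```rust
#[derive(Debug, PartialEq)]
enum Target {
    Keb,   // kanji element
    Reb,   // reading element
    Gloss, // English meaning
}

/// CJK Unified Ideographs and Extension A only.
fn is_kanji(c: char) -> bool {
    matches!(c, '\u{4E00}'..='\u{9FFF}' | '\u{3400}'..='\u{4DBF}')
}

/// Hiragana and Katakana blocks.
fn is_kana(c: char) -> bool {
    matches!(c, '\u{3040}'..='\u{309F}' | '\u{30A0}'..='\u{30FF}')
}

/// Any kanji in the query selects keb, an all-kana query selects reb,
/// and everything else is treated as an English gloss.
fn choose_target(input: &str) -> Target {
    if input.chars().any(is_kanji) {
        Target::Keb
    } else if !input.is_empty() && input.chars().all(is_kana) {
        Target::Reb
    } else {
        Target::Gloss
    }
}
```

Under these rules ある限り is looked up by keb even though it starts with kana, which is the case the first-character check got wrong.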

Update the embedded JMdict

JMdict doesn't really have version numbers, as it's basically a dump from a database. This means the dictionary embedded in Jisho falls out of date as time moves on. The current version is from 2020-09-23, which is now a couple of years out of date.

(Initial attempt at fixing this encountered problems with binary sizes, tracked separately in #21.)

Handle entries without associated kanji

Some dictionary entries don't have a k_ele / keb element at all. For instance:

<entry>
<ent_seq>1345605</ent_seq>
<r_ele>
<reb>そろそろ</reb>
<re_pri>ichi1</re_pri>
</r_ele>
<r_ele>
<reb>ソロソロ</reb>
</r_ele>
<sense>
<pos>&adv;</pos>
<pos>&adv-to;</pos>
<misc>&on-mim;</misc>
<gloss>slowly</gloss>
<gloss>quietly</gloss>
<gloss>steadily</gloss>
<gloss>gradually</gloss>
<gloss>gingerly</gloss>
</sense>
<sense>
<misc>&on-mim;</misc>
<gloss>soon</gloss>
<gloss>momentarily</gloss>
<gloss>before long</gloss>
<gloss>any time now</gloss>
</sense>
</entry>

These should still be looked up.

Single kanji lookup

A sibling project to EDICT, KANJIDIC offers a lot of data for single kanji lookups.

Add benchmarks

I need some lookup benchmarks to make informed choices about possible changes to jisho.

Reading lookup

Sometimes you have a good idea of a reading but no idea which kanji to use for the word. If the input is purely hiragana we could also look up readings and try to find matches based on that.

Work directly with JMdict_e.gz

I didn't use to think this was a problem since crates are compressed regardless, but maybe it would be a workflow improvement to work straight off the original primary source archive in the build script.

Low priority.

Prefix & postfix matches

I'm getting convinced that the most common use cases for #24 would actually be covered by simple prefix and postfix searches. The syntax should follow convention: for instance, any search term followed by either * or ＊ should turn on prefix match.

(When #30 gets fixed, it would also be useful to support forcing exact matches for those cases where the default heuristics including either prefix or postfix matches would provide too many results.)
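
A sketch of how the query syntax might be parsed. The trailing-asterisk rule for prefix matches comes from the description above; treating a leading asterisk as a postfix match is an assumption, and `Query`, `parse_query` and `query_matches` are hypothetical names.

```rust
#[derive(Debug, PartialEq)]
enum Query<'a> {
    Exact(&'a str),
    Prefix(&'a str),  // "term*": words starting with term
    Postfix(&'a str), // "*term": words ending with term (assumed syntax)
}

// Both the ASCII and the full-width asterisk are accepted.
const STARS: &[char] = &['*', '＊'];

fn parse_query(input: &str) -> Query<'_> {
    if let Some(stem) = input.strip_suffix(STARS) {
        Query::Prefix(stem)
    } else if let Some(stem) = input.strip_prefix(STARS) {
        Query::Postfix(stem)
    } else {
        Query::Exact(input)
    }
}

fn query_matches(query: &Query, word: &str) -> bool {
    match query {
        Query::Exact(term) => word == *term,
        Query::Prefix(term) => word.starts_with(*term),
        Query::Postfix(term) => word.ends_with(*term),
    }
}
```

A query with asterisks on both ends is parsed as a prefix query here, with the leading asterisk left in the stem; a real implementation would need to decide what that case should mean.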

Interactive mode

Without CLI arguments jisho should open a REPL-like interactive mode.

Mention REPL-mode in README

It would help to mention that jisho can also be used in an interactive fashion.
