eagleflo / jisho
Jisho is a CLI tool & Rust library that provides a Japanese-English dictionary.
License: GNU General Public License v3.0
It would be useful to support wildcard searches, where you can replace an unknown kanji (more than one?) with an asterisk and the query matches any word in the corpus that fits the pattern. A bit like regex but less complex.
> 飛*機
飛行機【ひこうき】- aeroplane, airplane, aircraft
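A minimal sketch of such a matcher, assuming each asterisk stands for exactly one character (the "more than one?" question above is left open). The function name `wildcard_match` and the tiny corpus are hypothetical, not part of jisho.

```rust
/// Returns true when `word` fits `pattern`, where each `*` in the pattern
/// stands for exactly one character (an assumption made for this sketch).
fn wildcard_match(pattern: &str, word: &str) -> bool {
    let mut p = pattern.chars();
    let mut w = word.chars();
    loop {
        match (p.next(), w.next()) {
            // Both ran out at the same time: every position matched.
            (None, None) => return true,
            // A wildcard consumes any single character.
            (Some('*'), Some(_)) => continue,
            // Literal characters must be equal.
            (Some(a), Some(b)) if a == b => continue,
            // Length mismatch or differing characters.
            _ => return false,
        }
    }
}

fn main() {
    // Hypothetical stand-in for the real list of dictionary words.
    let corpus = ["飛行機", "飛行", "自転車"];
    for word in corpus.iter().filter(|w| wildcard_match("飛*機", w)) {
        println!("{word}");
    }
}
```

Comparing by `char` rather than by byte keeps the wildcard aligned with whole kanji, which take several bytes each in UTF-8.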
In the wild west world of https://crates.io/ you need to reserve your crate name early on. This involves configuring a bit of metadata. See https://doc.rust-lang.org/cargo/reference/publishing.html for reference.
If I end up implementing subcommands (#3), more complex querying capabilities (#24) or flags (#28), I'll want to have a quick reference for these because nobody will be able to remember them.
Tradition dictates this would be printed to stdout with -h / --help. It should also be printed when the program is invoked with an incorrect number of arguments and in similar confusing situations.
(Probably not going to require a full-blown man page, unless I get really wild.)
There's a fair bit of duplicated logic depending on which dictionary is being consulted (lines 52 to 76 in 425b409).
By clearly setting up keb, reb and gloss it should be possible to cut that down by half.
Currently jisho::lookup takes it upon itself to print the results, if any. If this is to be of any use as a library, it should return real data and let the client program decide what to do with the results.
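One possible shape for a data-returning lookup is sketched below. The `Entry` struct, its field names, and this `lookup` signature are assumptions for illustration, not jisho's actual API.

```rust
use std::collections::HashMap;

/// Hypothetical entry type; the field names mirror the JMdict elements.
#[derive(Debug, Clone, PartialEq)]
pub struct Entry {
    pub keb: String,
    pub reb: String,
    pub glosses: Vec<String>,
}

/// Returns the matching entries (possibly none) instead of printing them.
pub fn lookup<'a>(dict: &'a HashMap<String, Vec<Entry>>, query: &str) -> &'a [Entry] {
    dict.get(query).map(Vec::as_slice).unwrap_or(&[])
}

fn main() {
    let mut dict = HashMap::new();
    dict.insert(
        "音響".to_string(),
        vec![Entry {
            keb: "音響".into(),
            reb: "おんきょう".into(),
            glosses: vec!["sound".into(), "noise".into()],
        }],
    );

    // Formatting is now the caller's decision, not the library's.
    for e in lookup(&dict, "音響") {
        println!("{}【{}】- {}", e.keb, e.reb, e.glosses.join(", "));
    }
}
```

Returning a slice means an empty result is just an empty slice, so the CLI can print nothing while another client might show a "not found" message.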
Many entries in JMdict have multiple glosses defined. For instance:
<entry>
<ent_seq>1183780</ent_seq>
<k_ele>
<keb>音響</keb>
<ke_pri>ichi1</ke_pri>
<ke_pri>news1</ke_pri>
<ke_pri>nf11</ke_pri>
</k_ele>
<r_ele>
<reb>おんきょう</reb>
<re_pri>ichi1</re_pri>
<re_pri>news1</re_pri>
<re_pri>nf11</re_pri>
</r_ele>
<sense>
<pos>&n;</pos>
<gloss>sound</gloss>
<gloss>noise</gloss>
<gloss>acoustics</gloss>
<gloss>reverberation</gloss>
<gloss>echo</gloss>
<gloss>audio</gloss>
</sense>
</entry>
Reporting each of these makes tons of sense, and again improves results by a wide margin.
Using jisho as a CLI tool still feels a bit too slow. All of the time is spent parsing the original JMdict XML file and creating the lookup HashMaps for jisho's own use.
It looks like most of this work could be moved to compile time with the aid of https://doc.rust-lang.org/cargo/reference/build-scripts.html.
~ % jisho prime minister
首相【しゅしょう】- prime minister, chancellor (Germany, Austria, etc.), premier
宰相【さいしょう】- prime minister, premier, chancellor
首班【しゅはん】- head, leader, prime minister
PM【ピー・エム】- private message, PM, post-meridiem, afternoon, project manager, product manager, particulate matter, prime minister
プライムミニスター - prime minister
~ % jisho 総理大臣
総理大臣【そうりだいじん】- prime minister (as the head of a cabinet government), premier
Well, that's not optimal. It looks like we currently only collect exact matches with an English search term.
I'm wondering what kind of hassle it would be to allow prefix matches for meanings -- would that lead to a flood of irrelevant results for shorter search terms? This is a case where I definitely think the additional clarification should not block inclusion.
If this pattern of "right answer (additional note)" is common enough, maybe it ought to be handled separately.
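If the "right answer (additional note)" pattern were handled separately, one option would be to index each gloss under a second key with the parenthesized note removed. The sketch below assumes that approach; `strip_notes` is a hypothetical helper, not existing jisho code.

```rust
/// Removes parenthesized notes from a gloss and tidies the whitespace,
/// so "prime minister (as the head of ...)" can also be indexed under
/// the plain key "prime minister".
fn strip_notes(gloss: &str) -> String {
    let mut out = String::new();
    let mut depth = 0usize;
    for c in gloss.chars() {
        match c {
            '(' => depth += 1,
            // saturating_sub tolerates a stray closing parenthesis.
            ')' => depth = depth.saturating_sub(1),
            // Keep only characters outside any parentheses.
            _ if depth == 0 => out.push(c),
            _ => {}
        }
    }
    // Collapse the whitespace left behind where a note was removed.
    out.split_whitespace().collect::<Vec<_>>().join(" ")
}

fn main() {
    let gloss = "prime minister (as the head of a cabinet government)";
    println!("{}", strip_notes(gloss));
}
```

Keeping the original gloss for display and using the stripped form only as an extra lookup key would leave the printed results unchanged.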
I personally use jisho on three computers and I'm not always sure which version I have installed via cargo.
Tradition dictates this should be available with -v / --version, as I can't foresee needing --verbose instead.
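A small sketch of how the flag could be wired up without an argument-parsing crate. Cargo sets `CARGO_PKG_VERSION` at compile time. `option_env!` is used here so the snippet also compiles outside Cargo. The helper names are hypothetical.

```rust
/// Builds the line printed for -v / --version.
fn version_line() -> String {
    // Cargo provides CARGO_PKG_VERSION during the build; fall back to a
    // placeholder when the code is compiled without Cargo.
    let version = option_env!("CARGO_PKG_VERSION").unwrap_or("unknown");
    format!("jisho {version}")
}

/// Returns true when any argument asks for the version.
fn wants_version(args: &[String]) -> bool {
    args.iter().any(|a| a == "-v" || a == "--version")
}

fn main() {
    let args: Vec<String> = std::env::args().skip(1).collect();
    if wants_version(&args) {
        println!("{}", version_line());
    }
}
```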
As of now backspacing characters doesn't fully work -- there are leftover characters in the CLI. Ideally backspace removes one character, be that ASCII or Unicode, and there is no garbage left over.
(It's possible that solving #19 also solves this.)
Currently dictionary initialization happens on each lookup. That's not ideal. Unfortunately static variables come with a lot of restrictions in Rust, heap allocation in the initializer being forbidden for one.
It looks like https://docs.rs/lazy_static/1.4.0/lazy_static/ might offer a way to sidestep that issue.
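The same initialize-once idea can be sketched with the standard library's `std::sync::OnceLock` (swapped in here for lazy_static so the example needs no dependencies). The `Dictionary` alias and its contents are placeholders for the real parsed data.

```rust
use std::collections::HashMap;
use std::sync::OnceLock;

// Placeholder type; the real dictionary structure is an assumption here.
type Dictionary = HashMap<&'static str, Vec<&'static str>>;

/// Builds the dictionary on first use and hands out the same instance
/// on every later call.
fn dictionary() -> &'static Dictionary {
    static DICT: OnceLock<Dictionary> = OnceLock::new();
    DICT.get_or_init(|| {
        // Stand-in for the expensive parsing step.
        let mut d = Dictionary::new();
        d.insert("飛行機", vec!["aeroplane", "airplane", "aircraft"]);
        d
    })
}

fn main() {
    let first = dictionary();
    let second = dictionary();
    // Both calls return a reference to the very same allocation.
    assert!(std::ptr::eq(first, second));
    println!("{:?}", first.get("飛行機"));
}
```

The closure runs at most once, at runtime, so heap allocation inside it is allowed even though the `static` itself has a constant initializer.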
Some common words are often spelled in both ひらがな and カタカナ. For example:
<entry>
<ent_seq>1345605</ent_seq>
<r_ele>
<reb>そろそろ</reb>
<re_pri>ichi1</re_pri>
</r_ele>
<r_ele>
<reb>ソロソロ</reb>
</r_ele>
<sense>
<pos>&adv;</pos>
<pos>&adv-to;</pos>
<misc>&on-mim;</misc>
<gloss>slowly</gloss>
<gloss>quietly</gloss>
<gloss>steadily</gloss>
<gloss>gradually</gloss>
<gloss>gingerly</gloss>
</sense>
<sense>
<misc>&on-mim;</misc>
<gloss>soon</gloss>
<gloss>momentarily</gloss>
<gloss>before long</gloss>
<gloss>any time now</gloss>
</sense>
</entry>
Currently only the first reading element is processed. It wouldn't hurt to also process the second one.
A pretty clear bug: there are multiple entries for many keb, reb and gloss elements. Storing every match instead of overwriting the results constantly produces much better lookup results (duh).
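The difference between overwriting and storing every match comes down to mapping each key to a `Vec`. A minimal sketch, with `index_pairs` as a hypothetical helper and the key/value pairs as sample data:

```rust
use std::collections::HashMap;

/// Builds a lookup table in which a key keeps every value seen for it,
/// rather than only the most recent one.
fn index_pairs(pairs: &[(&str, &str)]) -> HashMap<String, Vec<String>> {
    let mut index: HashMap<String, Vec<String>> = HashMap::new();
    for (key, value) in pairs {
        // `insert` would overwrite; `entry().or_default().push()` appends.
        index
            .entry(key.to_string())
            .or_default()
            .push(value.to_string());
    }
    index
}

fn main() {
    let index = index_pairs(&[
        ("prime minister", "首相"),
        ("prime minister", "宰相"),
        ("sound", "音響"),
    ]);
    println!("{:?}", index["prime minister"]);
}
```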
As of now jisho is quite a large binary, as no effort whatsoever has been spent on optimizing for binary size.
However, it looks like Rust tooling has (recently?) grown more aware of binary sizes, and trying to update the embedded JMdict version to a more recent version triggered some built-in size limit of Crates.io. The JSON files derived from JMdict are certainly much more verbose than necessary, so this should be relatively easy to fix.
When looking up words with a lot of exact matches like "green" or "blue", the canonical answers 緑 and 青 should come up first.
It's often helpful to refine searches over multiple queries. Adding basic readline-like support into jisho would make tons of sense.
It looks like Rust has a mature library called rustyline providing something like that.
Currently you have to use quotes around multiple words to perform a single lookup:
$ jisho "quantum mechanics"
量子力学【りょうしりきがく】- quantum mechanics
While in accordance with how shells are supposed to work, this is a bit tiresome. Might as well just combine all the args into a single input so the quotes become unnecessary.
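Combining the arguments could be as small as the sketch below. `query_from_args` is a hypothetical helper name.

```rust
/// Joins all arguments into one query, separated by single spaces.
fn query_from_args<I: IntoIterator<Item = String>>(args: I) -> String {
    args.into_iter().collect::<Vec<_>>().join(" ")
}

fn main() {
    // skip(1) drops the program name, leaving only the search words.
    let query = query_from_args(std::env::args().skip(1));
    println!("{query}");
}
```

A quoted argument still works, since a single argument joins to itself unchanged.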
Rather than creating a separate entry for each different gloss as done in 7536fc7, it would make more sense to just collect them under a single entry.
Might be useful to have certain test lookups running automatically to keep the largest regressions at bay.
~ % jisho ある限り
~ % jisho "all (there is)"
ある限り【あるかぎり】- all (there is), as long as there is
That's not ideal. The original very naïve piece of code just looked at the first character of the input when deciding which HashMap to look in. I rewrote it so that if there is any 漢字 in the query we look at keb rather than reb. Likewise, reading lookups now require the whole input to be in 仮名; according to the JMdict DTD, reb is restricted to 仮名.
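The rule described above can be sketched roughly as follows. The `Table` enum and function names are hypothetical, and the character ranges are a simplification: only the basic CJK Unified Ideographs block counts as kanji, and the Hiragana and Katakana blocks count as kana.

```rust
#[derive(Debug, PartialEq)]
enum Table {
    Keb,   // kanji elements
    Reb,   // reading elements
    Gloss, // English meanings
}

/// Basic CJK Unified Ideographs block only (a simplification).
fn is_kanji(c: char) -> bool {
    ('\u{4E00}'..='\u{9FFF}').contains(&c)
}

/// Hiragana (U+3040..U+309F) and Katakana (U+30A0..U+30FF) blocks.
fn is_kana(c: char) -> bool {
    ('\u{3040}'..='\u{30FF}').contains(&c)
}

/// Any kanji selects keb; an all-kana query selects reb; everything
/// else is treated as an English gloss.
fn choose_table(query: &str) -> Table {
    if query.chars().any(is_kanji) {
        Table::Keb
    } else if !query.is_empty() && query.chars().all(is_kana) {
        Table::Reb
    } else {
        Table::Gloss
    }
}

fn main() {
    for q in ["ある限り", "そろそろ", "all (there is)"] {
        println!("{q} -> {:?}", choose_table(q));
    }
}
```

With this rule ある限り goes to the keb table even though it starts with hiragana, which is the case the first-character check got wrong.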
JMdict doesn't really have version numbers, as it's basically a dump from a database. This means the dictionary embedded in Jisho falls out of date as time moves on. The current version is from 2020-09-23, which is now a couple of years out of date.
(Initial attempt at fixing this encountered problems with binary sizes, tracked separately in #21.)
Some dictionary entries don't have a k_ele / keb element at all. For instance:
<entry>
<ent_seq>1345605</ent_seq>
<r_ele>
<reb>そろそろ</reb>
<re_pri>ichi1</re_pri>
</r_ele>
<r_ele>
<reb>ソロソロ</reb>
</r_ele>
<sense>
<pos>&adv;</pos>
<pos>&adv-to;</pos>
<misc>&on-mim;</misc>
<gloss>slowly</gloss>
<gloss>quietly</gloss>
<gloss>steadily</gloss>
<gloss>gradually</gloss>
<gloss>gingerly</gloss>
</sense>
<sense>
<misc>&on-mim;</misc>
<gloss>soon</gloss>
<gloss>momentarily</gloss>
<gloss>before long</gloss>
<gloss>any time now</gloss>
</sense>
</entry>
These should still be looked up.
I need some lookup benchmarks to make informed choices about possible changes to jisho.
Sometimes you have a good idea of a reading but no idea which kanji to use for the word. If the input is purely hiragana we could also look up readings and try to find matches based on that.
I didn't use to think this was a problem since crates are compressed regardless, but maybe it would be a workflow improvement to work straight off the original primary source archive in the build script.
Low priority.
I'm getting convinced that the most common use cases for #24 would actually be covered by simple prefix and postfix searches. The syntax should follow convention: for instance, any search term followed by either * or * should turn on prefix match.
(When #30 gets fixed, it would also be useful to support forcing exact matches for those cases where the default heuristics including either prefix or postfix matches would provide too many results.)
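A sketch of how such queries might be classified. A trailing asterisk turning on prefix match comes from the description above. Treating a leading asterisk as a postfix match is an assumption, as are the `Query` enum and function names.

```rust
#[derive(Debug, PartialEq)]
enum Query<'a> {
    Exact(&'a str),
    Prefix(&'a str),
    Postfix(&'a str),
}

/// "term*" becomes a prefix query, "*term" a postfix query (assumed
/// convention), and anything else an exact query.
fn parse_query(input: &str) -> Query<'_> {
    match (input.strip_prefix('*'), input.strip_suffix('*')) {
        (_, Some(stem)) if !stem.is_empty() => Query::Prefix(stem),
        (Some(stem), _) if !stem.is_empty() => Query::Postfix(stem),
        _ => Query::Exact(input),
    }
}

/// Tests one candidate key against a parsed query.
fn matches(query: &Query, candidate: &str) -> bool {
    match query {
        Query::Exact(t) => candidate == *t,
        Query::Prefix(t) => candidate.starts_with(*t),
        Query::Postfix(t) => candidate.ends_with(*t),
    }
}

fn main() {
    let query = parse_query("prime minister*");
    let gloss = "prime minister (as the head of a cabinet government)";
    println!("{:?} matches: {}", query, matches(&query, gloss));
}
```

Keeping `Exact` as its own variant leaves room for a way to force exact matches later, as the note above anticipates.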
Without CLI arguments jisho should open a REPL-like interactive mode.
It would help to mention that jisho can also be used in an interactive fashion.