neocl / jamdict

Python 3 library for manipulating Jim Breen's JMdict, KanjiDic2, JMnedict and kanji-radical mappings

License: MIT License

Jamdict

Jamdict is a Python 3 library for manipulating Jim Breen's JMdict, KanjiDic2, JMnedict and kanji-radical mappings.

Documentation: https://jamdict.readthedocs.io/

Main features

  • Support querying different Japanese language resources
    • Japanese-English dictionary JMDict
    • Kanji dictionary KanjiDic2
    • Kanji-radical and radical-kanji maps KRADFILE/RADKFILE
    • Japanese Proper Names Dictionary (JMnedict)
  • Fast lookup (dictionaries are stored in SQLite databases)
  • Command-line lookup tool (Example)

Contributors are welcome! 🙇 If you want to help, please see the Contributing page.

Try Jamdict out

Jamdict is used in Jamdict-web, a free and open-source web-based Japanese reading assistant. Try the demo instance online at:

https://jamdict.herokuapp.com/

There is also a demo Jamdict virtual machine on Repl.it for trying out Jamdict Python code:

https://replit.com/@tuananhle/jamdict-demo

Installation

Jamdict and the Jamdict database are both available on PyPI and can be installed using pip:

pip install --upgrade jamdict jamdict-data

Sample jamdict Python code

from jamdict import Jamdict
jam = Jamdict()

# use wildcard matching to find anything that starts with 食べ and ends with る
result = jam.lookup('食べ%る')

# print all word entries
for entry in result.entries:
    print(entry)

# [id#1358280] たべる (食べる) : 1. to eat ((Ichidan verb|transitive verb)) 2. to live on (e.g. a salary)/to live off/to subsist on
# [id#1358300] たべすぎる (食べ過ぎる) : to overeat ((Ichidan verb|transitive verb))
# [id#1852290] たべつける (食べ付ける) : to be used to eating ((Ichidan verb|transitive verb))
# [id#2145280] たべはじめる (食べ始める) : to start eating ((Ichidan verb))
# [id#2449430] たべかける (食べ掛ける) : to start eating ((Ichidan verb))
# [id#2671010] たべなれる (食べ慣れる) : to be used to eating/to become used to eating/to be accustomed to eating/to acquire a taste for ((Ichidan verb))
# [id#2765050] たべられる (食べられる) : 1. to be able to eat ((Ichidan verb|intransitive verb)) 2. to be edible/to be good to eat ((pre-noun adjectival (rentaishi)))
# [id#2795790] たべくらべる (食べ比べる) : to taste and compare several dishes (or foods) of the same type ((Ichidan verb|transitive verb))
# [id#2807470] たべあわせる (食べ合わせる) : to eat together (various foods) ((Ichidan verb))

# print all related characters
for c in result.chars:
    print(repr(c))

# 食:9:eat,food
# 喰:12:eat,drink,receive (a blow),(kokuji)
# 過:12:overdo,exceed,go beyond,error
# 付:5:adhere,attach,refer to,append
# 始:8:commence,begin
# 掛:11:hang,suspend,depend,arrive at,tax,pour
# 慣:14:accustomed,get used to,become experienced
# 比:4:compare,race,ratio,Philippines
# 合:6:fit,suit,join,0.1
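
The % in the query above behaves like an SQL wildcard. As a rough illustration, and assuming that wildcard queries end up in SQLite's LIKE operator (an assumption about the internals, not something stated here), the matching can be reproduced with the standard sqlite3 module alone. The table name kanji_form, the helper wildcard_lookup and the non-matching row 食べ物 are made up for this sketch.

```python
import sqlite3

# Toy table standing in for the real dictionary schema (hypothetical).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE kanji_form (idseq INTEGER, text TEXT)")
conn.executemany(
    "INSERT INTO kanji_form VALUES (?, ?)",
    [
        (1358280, "食べる"),
        (1358300, "食べ過ぎる"),
        (1, "食べ物"),  # made-up id; this form does not end with る
    ],
)


def wildcard_lookup(pattern):
    """Return all forms matching a LIKE pattern, ordered by idseq."""
    rows = conn.execute(
        "SELECT text FROM kanji_form WHERE text LIKE ? ORDER BY idseq",
        (pattern,),
    )
    return [text for (text,) in rows]


# % matches any run of characters, including an empty one
matches = wildcard_lookup("食べ%る")
print(matches)
```

Because % also matches the empty string, 食べる itself is returned alongside the longer forms.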

Command line tools

To make sure that jamdict is configured properly, try to look up a word using the command line:

python3 -m jamdict lookup 言語学
========================================
Found entries
========================================
Entry: 1264430 | Kj:  言語学 | Kn: げんごがく
--------------------
1. linguistics ((noun (common) (futsuumeishi)))

========================================
Found characters
========================================
Char: 言 | Strokes: 7
--------------------
Readings: yan2, eon, 언, Ngôn, Ngân, ゲン, ゴン, い.う, こと
Meanings: say, word
Char: 語 | Strokes: 14
--------------------
Readings: yu3, yu4, eo, 어, Ngữ, Ngứ, ゴ, かた.る, かた.らう
Meanings: word, speech, language
Char: 学 | Strokes: 8
--------------------
Readings: xue2, hag, 학, Học, ガク, まな.ぶ
Meanings: study, learning, science

No name was found.

Using KRAD/RADK mapping

Jamdict has built-in support for KRAD/RADK (i.e. kanji-radical and radical-kanji mapping). Jamdict's terminology for radicals and components may differ from that used elsewhere.

  • A radical in Jamdict is a principal component, each character has only one radical.
  • A character may be decomposed into several writing components.

By default jamdict provides two maps:

  • jam.krad is a Python dict that maps each character to a list of its components.
  • jam.radk is a Python dict that maps each available component to the characters that contain it.

# Find all writing components (often called "radicals") of the character 雲
print(jam.krad['雲'])
# ['一', '雨', '二', '厶']

# Find all characters with the component 鼎
chars = jam.radk['鼎']
print(chars)
# {'鼏', '鼒', '鼐', '鼎', '鼑'}

# look up the characters info
result = jam.lookup(''.join(chars))
for c in result.chars:
    print(c, c.meanings())
# 鼏 ['cover of tripod cauldron']
# 鼒 ['large tripod cauldron with small']
# 鼐 ['incense tripod']
# 鼎 ['three legged kettle']
# 鼑 []
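
The two maps are inverses of each other. A minimal sketch of that relationship using plain dicts follows; it does not touch jamdict, and the helper name invert is made up for illustration. The only data used is the 雲 decomposition printed above.

```python
from collections import defaultdict


def invert(mapping):
    """Turn a key -> values mapping into a value -> set-of-keys mapping."""
    inverted = defaultdict(set)
    for key, values in mapping.items():
        for value in values:
            inverted[value].add(key)
    return dict(inverted)


# character -> components, as in the krad example above
krad = {"雲": ["一", "雨", "二", "厶"]}

# component -> characters, the radk direction
radk = invert(krad)
print(radk["雨"])

# inverting again recovers the components of each character (as a set)
round_trip = invert(radk)
print(round_trip["雲"] == set(krad["雲"]))
```

Using sets on the inverted side avoids duplicates when a component is listed more than once for a character.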

Finding name entities

# Find all names with 鈴木 inside
result = jam.lookup('%鈴木%')
for name in result.names:
    print(name)

# [id#5025685] キューティーすずき (キューティー鈴木) : Kyu-ti- Suzuki (1969.10-) (full name of a particular person)
# [id#5064867] パパイヤすずき (パパイヤ鈴木) : Papaiya Suzuki (full name of a particular person)
# [id#5089076] ラジカルすずき (ラジカル鈴木) : Rajikaru Suzuki (full name of a particular person)
# [id#5259356] きつねざきすずきひなた (狐崎鈴木日向) : Kitsunezakisuzukihinata (place name)
# [id#5379158] こすずき (小鈴木) : Kosuzuki (family or surname)
# [id#5398812] かみすずき (上鈴木) : Kamisuzuki (family or surname)
# [id#5465787] かわすずき (川鈴木) : Kawasuzuki (family or surname)
# [id#5499409] おおすずき (大鈴木) : Oosuzuki (family or surname)
# [id#5711308] すすき (鈴木) : Susuki (family or surname)
# ...

Exact matching

Use exact matching for faster search.

Find the word 花火 by idseq (1194580)

>>> result = jam.lookup('id#1194580')
>>> print(result.entries[0])
[id#1194580] はなび (花火) : fireworks ((noun (common) (futsuumeishi)))

Find an exact name 花火 by idseq (5170462)

>>> result = jam.lookup('id#5170462')
>>> print(result.names[0])
[id#5170462] はなび (花火) : Hanabi (female given name or forename)
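
One plausible reason exact matching is faster: an equality test can use an index, while a pattern with a leading wildcard generally forces a full scan. The sketch below shows this with SQLite's EXPLAIN QUERY PLAN on a toy table. The table kanji_form, the index idx_kanji_text and the helper query_plan are hypothetical, and it is only an assumption that the real database indexes its lookup columns in a similar way.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE kanji_form (idseq INTEGER, text TEXT)")
conn.execute("CREATE INDEX idx_kanji_text ON kanji_form (text)")
conn.execute("INSERT INTO kanji_form VALUES (1194580, '花火')")


def query_plan(sql, params):
    """Return SQLite's query plan for a statement as one string."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    # the last column of each row is the human-readable plan step
    return " / ".join(row[-1] for row in rows)


exact_plan = query_plan(
    "SELECT idseq FROM kanji_form WHERE text = ?", ("花火",)
)
wildcard_plan = query_plan(
    "SELECT idseq FROM kanji_form WHERE text LIKE ?", ("%花火%",)
)
print(exact_plan)     # an index search
print(wildcard_plan)  # a table scan
```

The exact wording of the plan differs between SQLite versions, so the comments describe the kind of step rather than the literal text.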

See jamdict_demo.py and jamdict/tools.py for more information.

Contributors

  • alt-romes
  • letuananh
  • matteofumagalli1275

jamdict's Issues

Better lookup results

  • Allow strict_lookup (no additional characters, only the ones in the query)
  • add str() and repr() to result objects

Accessing reading breakdown of a vocabulary term from JMdict

For example, given the entry for 日本語 I'd like to not only get the reading, にほんご, but also which parts of the reading are associated with which kanji, e.g. 日→に, 本→ほん, and 語→ご. This would make rendering furigana from the database much easier. Is this possible? Thanks!

[Feature Request] In-memory database

Hi there, thanks for making this library! I was wondering if it's possible to add an option for the database to be forcibly created in memory:

class ExecutionContext(object):
    # ...
    def __init__(self, path, schema, auto_commit=True):
        source = sqlite3.connect(str(path))
        self.conn = sqlite3.connect(':memory:')
        source.backup(self.conn)
        # ...

I added this snippet to puchikarui.py and it sped up lookups by about 30-40% (7.7 seconds down to ~4 seconds). Of course, ideally the database would be kept outside of the context construction. I currently have context reuse enabled.
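
The idea can be tried outside jamdict with the standard library alone. The sketch below copies an on-disk SQLite file into a :memory: connection using sqlite3.Connection.backup (available since Python 3.7). The helper name load_into_memory and the toy entry table are made up for illustration; this is not puchikarui's API.

```python
import os
import sqlite3
import tempfile


def load_into_memory(path):
    """Copy an on-disk SQLite database into a new in-memory connection."""
    source = sqlite3.connect(path)
    try:
        memory = sqlite3.connect(":memory:")
        source.backup(memory)  # page-by-page copy of the whole database
    finally:
        source.close()  # the file handle is released before returning
    return memory


with tempfile.TemporaryDirectory() as tmp:
    db_path = os.path.join(tmp, "toy.db")

    # build a tiny on-disk database to copy from
    disk = sqlite3.connect(db_path)
    disk.execute("CREATE TABLE entry (idseq INTEGER PRIMARY KEY, gloss TEXT)")
    disk.execute("INSERT INTO entry VALUES (1194580, 'fireworks')")
    disk.commit()
    disk.close()

    mem = load_into_memory(db_path)

# the in-memory copy still answers queries after the file is gone
row = mem.execute(
    "SELECT gloss FROM entry WHERE idseq = ?", (1194580,)
).fetchone()
print(row)
```

Keeping the in-memory connection alive and reusing it is what makes the copy worthwhile; copying on every lookup would cost more than it saves.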

[Feature Request] Output optionally slightly formatted with color

So that the output is easier to read in the terminal. Just an idea, I don't have a particular scheme/logic in mind.

Also, I think a blank line between several different found characters would be helpful, because with the current format it's kind of confusing: the readings and meanings of a character look like they belong to the next one.

Some words not searchable in dictionary

A word that should be in the dictionary, for example 大事にする（だいじにする）, does not show up in lookups. I am not sure why, since the entry can be seen in JMdictDB.

use AppConfig to config jamdict

Simplify API:

  • People rarely use XML files for lookup; the default way to create a Jamdict instance will most likely be from a database.
  • jmdict, kanjidic and multikrad will most likely be in a single database.
  • Add util functions:
    • read()
    • parse() from xml file(s) to db file(s)

how to reset the sqlite 3 database

There was an error while importing the file from gdrive, and as a result no characters are found.

Where is the database stored, and how can I reset or delete it?

Customizing JAMDICT_HOME / JAMDICT_DATA

I'm using jamdict for an educational game and I would like to install jamdict's data in a custom folder.
After looking at jamdict.config, I've tried setting the environment variables JAMDICT_HOME and JAMDICT_DATA, but this seems to have no effect.
Is there a proper way to do this?

Hide field(s) when not available

For example kanji form and characters are not available in ムーン

========================================
Found entries
========================================
Entry: 1132270 | Kj:   | Kn: ムーン
--------------------
1. moon ((noun (common) (futsuumeishi)))

========================================
Found characters
========================================
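
A minimal sketch of the requested behaviour, assuming the header line is assembled from optional parts. The function name format_entry_header and its signature are hypothetical and are not taken from jamdict/tools.py.

```python
def format_entry_header(idseq, kanji_forms, kana_forms):
    """Build the 'Entry: ...' line, skipping any field that is empty."""
    parts = [f"Entry: {idseq}"]
    if kanji_forms:
        parts.append("Kj: " + ", ".join(kanji_forms))
    if kana_forms:
        parts.append("Kn: " + ", ".join(kana_forms))
    return " | ".join(parts)


# ムーン has no kanji form, so the Kj field is left out entirely
header = format_entry_header(1132270, [], ["ムーン"])
print(header)
```

The same check could be used to skip the whole "Found characters" banner when the list of characters is empty.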

Searching POS in parameters doesn't search for all possible POS

I noticed when using this solution for finding POS that we discussed in #22 :

# find all idseq of lexical entry (i.e. words) that have at least 1 sense with pos = expressions (phrases, clauses, etc.)
with jam.jmdict.ctx() as ctx:
    # query all word's idseqs
    rows = ctx.select(
        query="SELECT DISTINCT idseq FROM Sense WHERE ID IN (SELECT sid FROM pos WHERE text = ?)",
        params=("expressions (phrases, clauses, etc.)",))
    for row in rows:
        # reuse database connection with ctx=ctx for better performance
        word = jam.jmdict.get_entry(idseq=row['idseq'], ctx=ctx)
        ruler.add_patterns([{"label": "EXPRESSION", "pattern": x.text} for x in word.kanji_forms])
        ruler.add_patterns([{"label": "EXPRESSION", "pattern": x.text} for x in word.kana_forms])
        print("Working on expressions...")

that some expressions that have the 'expressions (phrases, clauses, etc.)' as a secondary parameter instead of primary seem to not be caught by this search. Is this a bug, or an intended feature?

Additionally, it seems that the original JMDict does not use this scheme to refer to expressions. Instead the term used is 'exp'. Am I mistaken here?

Thank you

Can't install jamdict-data on Windows

Hello.

I tried to install jamdict-data on Windows but I couldn't. My details: Windows 11, Python 3.11.4, PowerShell 7.3.6.

Seems to be a Windows-specific thing: [WinError 32] The process cannot access the file because it is being used by another process

Full terminal output:

PS C:\Users\User> python.exe -m pip install jamdict-data
Collecting jamdict-data
  Downloading jamdict_data-1.5.tar.gz (53.9 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 53.9/53.9 MB 9.8 MB/s eta 0:00:00
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... error
  error: subprocess-exited-with-error

  × Preparing metadata (pyproject.toml) did not run successfully.
  │ exit code: 1
  ╰─> [13 lines of output]
      running dist_info
      creating C:\Users\User\AppData\Local\Temp\pip-modern-metadata-_2sae9a0\jamdict_data.egg-info
      writing C:\Users\User\AppData\Local\Temp\pip-modern-metadata-_2sae9a0\jamdict_data.egg-info\PKG-INFO
      writing dependency_links to C:\Users\User\AppData\Local\Temp\pip-modern-metadata-_2sae9a0\jamdict_data.egg-info\dependency_links.txt
      writing top-level names to C:\Users\User\AppData\Local\Temp\pip-modern-metadata-_2sae9a0\jamdict_data.egg-info\top_level.txt
      writing manifest file 'C:\Users\User\AppData\Local\Temp\pip-modern-metadata-_2sae9a0\jamdict_data.egg-info\SOURCES.txt'
      reading manifest file 'C:\Users\User\AppData\Local\Temp\pip-modern-metadata-_2sae9a0\jamdict_data.egg-info\SOURCES.txt'
      reading manifest template 'MANIFEST.in'
      adding license file 'LICENSE'
      writing manifest file 'C:\Users\User\AppData\Local\Temp\pip-modern-metadata-_2sae9a0\jamdict_data.egg-info\SOURCES.txt'
      creating 'C:\Users\User\AppData\Local\Temp\pip-modern-metadata-_2sae9a0\jamdict_data-1.5.dist-info'
      Unpacking database from C:\Users\User\AppData\Local\Temp\pip-install-6nyf8b0w\jamdict-data_51d99a1c3c554a3a9b8858235b75d3ac\jamdict_data\jamdict.db.xz to C:\Users\User\AppData\Local\Temp\pip-install-6nyf8b0w\jamdict-data_51d99a1c3c554a3a9b8858235b75d3ac\jamdict_data\jamdict.db
      error: [WinError 32] The process cannot access the file because it is being used by another process: 'jamdict_data/jamdict.db.xz'
      [end of output]

  note: This error originates from a subprocess, and is likely not a problem with pip.
error: metadata-generation-failed

× Encountered error while generating package metadata.
╰─> See above for output.

note: This is an issue with the package mentioned above, not pip.
hint: See above for details.

Thank you very much for creating jamdict, by the way.
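
On Windows, WinError 32 is commonly raised when a file is removed or renamed while a handle to it is still open. The log above does not show which handle was open, so the following is only a sketch of the general pattern and not jamdict-data's actual setup code: decompress inside a with-block so both files are closed before the archive is touched again. The helper name unpack_xz and the toy file names are made up.

```python
import lzma
import os
import shutil
import tempfile


def unpack_xz(src, dst, remove_source=False):
    """Decompress src (.xz) to dst, optionally deleting src afterwards."""
    # Both handles are closed when the with-block exits, before any removal.
    with lzma.open(src, "rb") as fin, open(dst, "wb") as fout:
        shutil.copyfileobj(fin, fout)
    if remove_source:
        os.remove(src)


with tempfile.TemporaryDirectory() as tmp:
    archive = os.path.join(tmp, "toy.db.xz")
    target = os.path.join(tmp, "toy.db")

    # create a small archive to unpack
    with lzma.open(archive, "wb") as fout:
        fout.write(b"toy database bytes")

    unpack_xz(archive, target, remove_source=True)

    with open(target, "rb") as fin:
        unpacked = fin.read()
    archive_still_exists = os.path.exists(archive)

print(unpacked, archive_still_exists)
```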

Can't build database file since latest release

I can't build the database since the latest (I think) release. Before I just did python3 -m jamdict.tools import and it worked.

Now python3 -m jamdict.tools import or python3 -m jamdict import give me this:

Traceback (most recent call last):
  File "/data/data/com.termux/files/usr/lib/python3.9/runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/data/data/com.termux/files/usr/lib/python3.9/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/data/data/com.termux/files/usr/lib/python3.9/site-packages/jamdict/__main__.py", line 2, in <module>
    tools.main()
  File "/data/data/com.termux/files/usr/lib/python3.9/site-packages/jamdict/tools.py", line 295, in main
    app.run()
  File "/data/data/com.termux/files/usr/lib/python3.9/site-packages/chirptext/cli.py", line 135, in run
    args.func(self, args)
  File "/data/data/com.termux/files/usr/lib/python3.9/site-packages/jamdict/tools.py", line 70, in import_data
    db_loc = os.path.abspath(os.path.expanduser(args.jdb))
  File "/data/data/com.termux/files/usr/lib/python3.9/posixpath.py", line 231, in expanduser
    path = os.fspath(path)
TypeError: expected str, bytes or os.PathLike object, not NoneType
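
The last frames show os.path.expanduser being called with args.jdb equal to None. The sketch below reproduces that TypeError and shows one possible guard: fall back to a configured path when no command-line value is given. The function resolve_db_path and the fallback behaviour are hypothetical, not the fix adopted by jamdict.

```python
import os


def resolve_db_path(cli_value, configured_value):
    """Pick the command-line path if given, else the configured one."""
    chosen = cli_value if cli_value is not None else configured_value
    if chosen is None:
        raise ValueError("no database location given or configured")
    return os.path.abspath(os.path.expanduser(chosen))


# reproduce the failure from the traceback
try:
    os.path.expanduser(None)
except TypeError as err:
    reproduced = str(err)

# with the guard, a missing command-line value falls back to the config
db_loc = resolve_db_path(None, "~/jamdict.db")
print(reproduced)
print(db_loc)
```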

Everything seems fine in python3 -m jamdict info:

Jamdict 0.1a11.post1
Python library for using Japanese dictionaries and resources (Jim Breen's JMdict, KanjiDic2, KRADFILE, JMnedict)

Basic configuration
------------------------------------------------------------
JAMDICT_HOME: /data/data/com.termux/files/home/.jamdict [OK]
jamdict-data: Not installed
Config file : /data/data/com.termux/files/home/.jamdict/config.json

Data files
------------------------------------------------------------
Jamdict DB location: /storage/emulated/0/Documents/Dictionaries/jamdict.db - [OK]
JMDict XML file : /storage/emulated/0/Documents/Dictionaries/JMdict_e.gz - [OK]
KanjiDic2 XML file : /storage/emulated/0/Documents/Dictionaries/kanjidic2.xml.gz - [OK]
JMnedict XML file : /storage/emulated/0/Documents/Dictionaries/JMnedict.xml.gz - [OK]

Jamdict database metadata
------------------------------------------------------------
jmdict.version: 1.08
jmdict.url: http://www.csse.monash.edu.au/~jwb/edict.html
generator: jamdict
generator_version: 0.1a9
generator_url: https://github.com/neocl/jamdict
jmnedict.version: 1.08
jmnedict.url: https://www.edrdg.org/enamdict/enamdict_doc.html
jmnedict.date: 2020-05-29
kanjidic2.version: 1.6
kanjidic2.url: https://www.edrdg.org/wiki/index.php/KANJIDIC_Project
kanjidic2.date: April 2008

Others
------------------------------------------------------------
puchikarui: version 0.2a2
chirptext : version 0.1.2
lxml : True

My config.json looks like this:
{
    "JAMDICT_HOME": "/data/data/com.termux/files/home/.jamdict",
    "JAMDICT_DATA": "{JAMDICT_HOME}/data",
    "JAMDICT_DB": "/storage/emulated/0/Documents/Dictionaries/jamdict.db",
    "JMDICT_XML": "/storage/emulated/0/Documents/Dictionaries/JMdict_e.gz",
    "JMNEDICT_XML": "/storage/emulated/0/Documents/Dictionaries/JMnedict.xml.gz",
    "KD2_XML": "/storage/emulated/0/Documents/Dictionaries/kanjidic2.xml.gz",
    "KRADFILE": "/storage/emulated/0/Documents/Dictionaries/kradfile-u.gz"
}

This is Termux on Android, if that matters. Also I can't install jamdict-data from pip, it fails and asks me to install wheel, which doesn't solve the problem -- but that's another matter.

Trim down dependencies

  1. Make lxml optional (most people don't parse jamdict XML files but use prebuilt SQLite DB file)
  2. May use embedded puchikarui and keep it up to date instead of loose linking
  3. Review chirptext dependency
