pedroallenrevez / jisho-api

A jisho.org API made in Python

License: Apache License 2.0

Python 100.00%

jisho-api's Introduction

jisho-api

A Python API built around scraping jisho.org, an online Japanese dictionary.

pip install jisho_api

Requests

You can request four types of information:

Words
Kanji
Sentences
Tokenize sentences

The search terms are directly injected into jisho's search engine, which means all of the filters used to curate a search should work as well. For instance, "水" would look precisely for a word with just that character.

Check https://jisho.org/docs on how to use the search filters.

jisho search word water
jisho search word 水
jisho search word "#jlpt-n4"

The request replies are Pydantic objects. You can check the structure of a word request in jisho_api/word/cfg.py, and likewise for kanji, sentences and tokens.

You can also make requests programmatically:

from jisho_api.word import Word
r = Word.request('water')
from jisho_api.kanji import Kanji
r = Kanji.request('水')
from jisho_api.sentence import Sentence
r = Sentence.request('水')
from jisho_api.tokenize import Tokens
r = Tokens.request('昨日すき焼きを食べました')

Note: Almost everything that is available on a page is scraped.

Note: Kanji requests can come back with incomplete information when it is not available on the page.

Scrapers

You can scrape the website for a list of given search terms. Supply them with a .txt file with the words separated by newlines.

jisho scrape word words.txt
jisho scrape kanji kanji.txt
jisho scrape sentence search_words.txt
jisho scrape tokens sentences.txt

All of the resulting searches will be stored in ~/.jisho/data.
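A terms file of the kind described above can be prepared with a few lines of Python. This is only a minimal sketch: `write_terms_file` is a hypothetical helper, not part of jisho_api, and the only assumption is that the scraper expects one search term per line in a UTF-8 text file.

```python
import tempfile
from pathlib import Path


def write_terms_file(terms, path):
    """Write search terms to `path`, one per line, as UTF-8.

    Hypothetical helper (not part of jisho_api). Blank entries are
    dropped and duplicates removed while keeping first-seen order.
    """
    cleaned = [t.strip() for t in terms if t.strip()]
    unique = list(dict.fromkeys(cleaned))
    Path(path).write_text("\n".join(unique) + "\n", encoding="utf-8")
    return unique


# Use a throwaway directory so the sketch does not touch real files.
terms_path = Path(tempfile.mkdtemp()) / "words.txt"
written = write_terms_file(["water", "水", "", "water", " fire "], terms_path)
print(written)
```

The resulting file could then be passed to the CLI, for example `jisho scrape word words.txt`.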

If you want to scrape programmatically, you can:

from jisho_api import scrape
from jisho_api.word import Word

word_requests = scrape(Word, ['water', 'fire'], 'to/path/')

This will return a dictionary whose keys are the search terms and whose values are the corresponding request results. Failing requests are not included.
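The shape of that return value can be illustrated without touching the network. The sketch below is not jisho_api's implementation: `collect_results` and `fake_request` are hypothetical names, and treating "failure" as "raised an exception or returned None" is an assumption made for the example.

```python
def collect_results(request_fn, terms):
    """Map each term to its result, skipping terms whose request fails.

    `request_fn` is any callable taking a search term. A failure here
    means it raised an exception or returned None (an assumption for
    this sketch, not documented jisho_api behaviour).
    """
    results = {}
    for term in terms:
        try:
            result = request_fn(term)
        except Exception:
            continue  # failed request: leave the term out of the dict
        if result is not None:
            results[term] = result
    return results


def fake_request(term):
    """Stand-in for a real request so the sketch runs offline."""
    known = {"water": "水", "fire": "火"}
    return known.get(term)


found = collect_results(fake_request, ["water", "fire", "qwertyuiop"])
print(found)
```

Because failed terms are simply absent, a caller can compare `found.keys()` with the input list to see which searches need a retry.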

Cache and config

To enable caching, run:

jisho config

This will create a ~/.jisho/ folder with a config.json with your settings. All your searches will be cached, and accessed if you search for the exact same term again.

Notes and considerations

According to this thread, there is no official API, although jisho.org itself makes a kind of API request, which is used here to scrape words. This does not work for kanji, though, because it would search for the kanji as a word and would not return any relevant metadata for the character itself.

Permission to scrape was also granted in the aforementioned thread.

As stated in their about page as well, jisho.org uses a collection of well-known electronic dictionaries:

This site uses the JMdict, Kanjidic2, JMnedict and Radkfile dictionary files. -jisho.org

Credits and Acknowledgements for data

All credit is given where it's due; the extracted resources are listed on jisho.org's about page.

jisho-api's Issues

PIP version not up to date

The version of this library that is hosted on PIP is still at the old commit f636e0e, meaning the tokenizer issues fixed by #5 are still there.

It would be great if it were updated, as people pulling the library right now still don't have a working tokenizer (including me 😅)

Missing Proper Nouns in sentence

Hi,

I was doing a simple test with the sentence API:

from jisho_api.sentence import Sentence
r = Sentence.request('象る')

and got the following response:

meta=RequestMeta(status=200)
data=[
SentenceConfig(japanese='神(かみ)は自ら(みずか)にかたどって人(ひと)を創造(そうぞう)された', en_translation='God created man in his own image.'), 
SentenceConfig(japanese='はレークをかたどって池(いけ)を造った(つく)', en_translation="Ken'nichi made a pond in the shape of Lake Geneva.")
]

The second sentence is actually on Jisho: 見日はレークジェニーバをかたどって池を造った。

So it looks like the proper nouns are missing for some reason. Is this expected behaviour?

Issue with scraping programmatically using provided code sample

Hi there,

When I run the sample code provided for scraping programmatically:

from jisho_api.word import Word
from jisho_api.cli import scrape

word_requests = scrape(Word, ['water', 'fire'], '~/japanese/test')

I get this error:

Traceback (most recent call last):
  File "/home/frank/scripts/scrape_from_jisho.py", line 7, in <module>
    word_requests = scrape(Word, ['water', 'fire'], '~/japanese/test')
  File "/home/frank/dev/venv/lib/python3.8/site-packages/click/core.py", line 1128, in __call__
    return self.main(*args, **kwargs)
  File "/home/frank/dev/venv/lib/python3.8/site-packages/click/core.py", line 1042, in main
    args = list(args)
TypeError: 'type' object is not iterable

Help is appreciated. Thanks.

ValidationError in Tokenizer

Hi,
executing the following 2 lines of code:

from jisho_api.tokenize import Tokens

r = Tokens.request("だって僕は星だから")

gives me the following error:

Traceback (most recent call last):
  File "/tmp/tmp.TjV1XDIbOj/test.py", line 3, in <module>
    r = Tokens.request("だって僕は星だから")
  File "/home/finia2na/.local/share/virtualenvs/tmp.TjV1XDIbOj-ckSbLsEx/lib/python3.9/site-packages/jisho_api/tokenize/request.py", line 85, in request
    "data": Tokens.tokens(soup),
  File "/home/finia2na/.local/share/virtualenvs/tmp.TjV1XDIbOj-ckSbLsEx/lib/python3.9/site-packages/jisho_api/tokenize/request.py", line 59, in tokens
    tks.append(TokenConfig(
  File "pydantic/main.py", line 406, in pydantic.main.BaseModel.__init__
pydantic.error_wrappers.ValidationError: 1 validation error for TokenConfig
pos_tag
  value is not a valid enumeration member; permitted: 'Noun', 'Particle', 'Verb', 'Determiner', 'Unknown' (type=type_error.enum; enum_values=[<PosTag.noun: 'Noun'>, <PosTag.particle: 'Particle'>, <PosTag.verb: 'Verb'>, <PosTag.det: 'Determiner'>, <PosTag.unk: 'Unknown'>])

I am running Python 3.9 with pipenv on Linux 5.15.

I also tried executing the tokenizer with the example sentence used elsewhere here (昨日すき焼きを食べました), which worked fine.
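The error message lists the only values the enum accepts, so any other part-of-speech label fails validation. One possible way to make such an enum tolerant is the standard library's `_missing_` hook, sketched below. The member names and values are copied from the error message; everything else is an assumption: the library's real `PosTag` may be defined differently, this is not necessarily how the issue was fixed, and "Conjunction" is a made-up label, since the traceback does not show which tag was rejected.

```python
from enum import Enum


class PosTag(str, Enum):
    # Values copied from the "permitted" list in the error message above.
    noun = "Noun"
    particle = "Particle"
    verb = "Verb"
    det = "Determiner"
    unk = "Unknown"

    @classmethod
    def _missing_(cls, value):
        # Enum calls this when `value` matches no member;
        # fall back to the catch-all member instead of raising.
        return cls.unk


known_tag = PosTag("Noun")
unseen_tag = PosTag("Conjunction")  # hypothetical unseen label
print(known_tag is PosTag.noun, unseen_tag is PosTag.unk)
```

The trade-off is that unexpected labels are silently folded into `Unknown`, so extending the enum with the missing tags would preserve more information.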

Weird import conflict

TLDR:

A file called linecache.py, used by pydantic, tries to import jisho_api's tokenize, causing a cyclic dep. Refactoring the "tokenize" folder's name to "tokens" indeed resolves the issue. The below explains the bug discovery.

FIRST:

When launching the interpreter ($> python3.10) from the terminal in the jisho_api folder, I get this weird error (at launch, before attempting to write anything in actual python):

Exception ignored in: <module 'inspect' from '/usr/lib/python3.10/inspect.py'>
Traceback (most recent call last):
  File "<string>", line 1, in <module>
AttributeError: partially initialized module 'inspect' has no attribute 'isgenerator' (most likely due to a circular import)
<frozen importlib._bootstrap>:241: RuntimeWarning: Cython module failed to patch module with custom type
Exception ignored in: <module 'inspect' from '/usr/lib/python3.10/inspect.py'>
Traceback (most recent call last):
  File "<string>", line 1, in <module>
AttributeError: partially initialized module 'inspect' has no attribute 'isgenerator' (most likely due to a circular import)

Annoyingly, this somewhat breaks the interpreter and makes it so that I can't even test out the code.

However, I've isolated the cause as the following line from jisho_api/tokenize/__init__.py, which, when commented, removes the error.

from .request import Tokens

No idea why it happens. It is probably not a name conflict, since refactoring Tokens to Token does not seem to solve the issue. Maybe it's some error within the file tokenize/request.py that propagates somehow?

EDIT: extra info

When running from kanji import Kanji, I get the following.

>>> from kanji import Kanji
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/fulguritude/Workspace/Clones/jisho-api/jisho_api/kanji/__init__.py", line 1, in <module>
    from .request import Kanji
  File "/home/fulguritude/Workspace/Clones/jisho-api/jisho_api/kanji/request.py", line 8, in <module>
    from pydantic import BaseModel
  File "pydantic/__init__.py", line 2, in init pydantic.__init__
    from pathlib import Path
  File "pydantic/dataclasses.py", line 7, in init pydantic.dataclasses
    import builtins
  File "pydantic/main.py", line 310, in init pydantic.main
  File "pydantic/main.py", line 254, in pydantic.main.ModelMetaclass.__new__
  File "pydantic/class_validators.py", line 197, in pydantic.class_validators.extract_root_validators
  File "/usr/lib/python3.10/inspect.py", line 43, in <module>
    import linecache
  File "/usr/lib/python3.10/linecache.py", line 11, in <module>
    import tokenize
  File "/home/fulguritude/Workspace/Clones/jisho-api/jisho_api/tokenize/__init__.py", line 1, in <module>
    from .request import Tokens
  File "/home/fulguritude/Workspace/Clones/jisho-api/jisho_api/tokenize/request.py", line 8, in <module>
    from pydantic import BaseModel
ImportError: cannot import name 'BaseModel' from partially initialized module 'pydantic' (most likely due to a circular import) (/home/fulguritude/.local/lib/python3.10/site-packages/pydantic/__init__.cpython-310-x86_64-linux-gnu.so)

Note the "import tokenize" in linecache.py, might be a name conflict after all.

So it's probably the BaseModel import in tokenize/request.py and/or somewhere else that causes the issue, by causing a conflict with the folder name tokenize itself. Moving it around in said file (up a line or two in the import order), I get slightly different behavior.

At launch:

Exception ignored in: <module 'inspect' from '/usr/lib/python3.10/inspect.py'>
Traceback (most recent call last):
  File "<string>", line 1, in <module>
AttributeError: partially initialized module 'inspect' has no attribute 'isgenerator' (most likely due to a circular import)
<frozen importlib._bootstrap>:241: RuntimeWarning: Cython module failed to patch module with custom type

FINAL:

Refactoring the "tokenize" folder to "tokens" indeed resolves the issue. What happened was probably that linecache tried to import jisho_api's tokenize, causing the cyclic dep.
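The shadowing itself can be reproduced in isolation. The sketch below builds a throwaway folder containing a package named `tokenize` and asks Python's path-based finder which file `import tokenize` would resolve to under two search orders. The directory layout is invented for the demonstration, and `PathFinder.find_spec` only locates the module without importing it.

```python
import importlib
import os
import tempfile
from importlib.machinery import PathFinder
from pathlib import Path

# Throwaway directory with a package named "tokenize", mimicking the
# jisho_api/tokenize folder described above.
project_dir = Path(tempfile.mkdtemp())
(project_dir / "tokenize").mkdir()
(project_dir / "tokenize" / "__init__.py").write_text("# local package\n")
importlib.invalidate_caches()

# Directory holding the standard library modules (os.py, tokenize.py, ...).
stdlib_dir = os.path.dirname(os.__file__)

# Search order when Python is started inside the project folder:
# the project directory comes before the standard library.
local_first = PathFinder.find_spec("tokenize", [str(project_dir), stdlib_dir])

# Search order when the standard library comes first.
stdlib_first = PathFinder.find_spec("tokenize", [stdlib_dir, str(project_dir)])

print("project dir first ->", local_first.origin)
print("stdlib first      ->", stdlib_first.origin)
```

With the project directory first, the local package wins, which is consistent with the standard library's `linecache` picking up `jisho_api/tokenize` when the interpreter is launched from inside the `jisho_api` folder.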

Validation error for certain word searches

When certain word searches are run, such as Word.request('一月'), a Pydantic validation error is returned.
This is the full error message from PyCharm.

Print list in reverse

Hello,
Absolutely love your implementation, but I was wondering if it would not be better to print the results in reverse, or at least have an argument for it?
Currently, if I use jisho search word water it prints:

水 (みず), 水 (み) [JLPT: jlpt-n5]
1. water (esp. cool, fresh water, e.g. drinking water)
2. fluid (esp. in an animal tissue), liquid
[...]
水滴 (すいてき) [JLPT: jlpt-n2]
1. drop of water
2. vessel for replenishing inkstone water
──────────────────────────────
浄水器 (じょうすいき)
1. water filter, water purification system
2. Water purification

The problem is that if the result has multiple entries, the most common one might not be on screen, and then I either have to scroll up or pipe the result into tac or bat, both of which ruin the well-done highlighting.
I feel like having the most important result printed last is more intuitive.

It wasn't recognizing a utf-8 encoded txt, now it is, but I get nothing in the data folder

Hi, I would like to download definitions from Jisho.org for about 1000 kanji. If I understood correctly, this tool can do the job.

I've put each kanji on its own line and saved them in a UTF-8 encoded txt file, but when I run "jisho scrape kanji name.txt" I get these errors.

C:\Jisho>jisho scrape kanji name.txt
Traceback (most recent call last):
  File "c:\users\username\appdata\local\programs\python\python38\lib\runpy.py", line 192, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "c:\users\username\appdata\local\programs\python\python38\lib\runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "C:\Users\username\AppData\Local\Programs\Python\Python38\Scripts\jisho.exe\__main__.py", line 7, in <module>
  File "c:\users\username\appdata\local\programs\python\python38\lib\site-packages\jisho_api\cli.py", line 209, in make_cli
    main()
  File "c:\users\username\appdata\local\programs\python\python38\lib\site-packages\click\core.py", line 1130, in __call__
    return self.main(*args, **kwargs)
  File "c:\users\username\appdata\local\programs\python\python38\lib\site-packages\click\core.py", line 1055, in main
    rv = self.invoke(ctx)
  File "c:\users\username\appdata\local\programs\python\python38\lib\site-packages\click\core.py", line 1657, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "c:\users\username\appdata\local\programs\python\python38\lib\site-packages\click\core.py", line 1657, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "c:\users\username\appdata\local\programs\python\python38\lib\site-packages\click\core.py", line 1404, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "c:\users\username\appdata\local\programs\python\python38\lib\site-packages\click\core.py", line 760, in invoke
    return __callback(*args, **kwargs)
  File "c:\users\username\appdata\local\programs\python\python38\lib\site-packages\jisho_api\cli.py", line 102, in scrape_words
    scraper(Word, _load_words(file_path), root_dump)
  File "c:\users\username\appdata\local\programs\python\python38\lib\site-packages\jisho_api\cli.py", line 89, in _load_words
    txt = fp.read()
  File "c:\users\username\appdata\local\programs\python\python38\lib\encodings\cp1252.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 39: character maps to <undefined>

What can I do to solve this issue?
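The last frames of the traceback are in the cp1252 codec, which suggests the file was read with the Windows default code page rather than as UTF-8. The sketch below reproduces that failure on a small sample and shows that reading with an explicit encoding avoids it. The file name and contents are made up, and the snippet does not modify jisho_api; it only illustrates the mechanism. Running Python in UTF-8 mode (for example by setting the environment variable `PYTHONUTF8=1`, available since Python 3.7) may be a workaround that needs no code change.

```python
import tempfile
from pathlib import Path

# Write a small kanji list as UTF-8, one character per line.
kanji_file = Path(tempfile.mkdtemp()) / "name.txt"
kanji_file.write_text("来\n水\n", encoding="utf-8")

# Decoding the same bytes as cp1252 fails: 来 is E6 9D A5 in UTF-8,
# and byte 0x9d has no mapping in Python's cp1252 codec.
try:
    kanji_file.read_bytes().decode("cp1252")
    cp1252_failed = False
except UnicodeDecodeError:
    cp1252_failed = True

# Reading with an explicit encoding works regardless of the OS default.
# "utf-8-sig" also tolerates a byte-order mark that some editors add.
with open(kanji_file, encoding="utf-8-sig") as fp:
    kanji = [line.strip() for line in fp if line.strip()]

print(cp1252_failed, kanji)
```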

Adding A wiki

I'm sorry if this is not the proper method for suggesting this, but I think it would be beneficial to add a wiki to help people find exactly what they're looking for with greater ease. Here's a start on a mock-up for the Kanji portion.

Key Classes:

General request information:
Each request notably returns a [Language part]Request Class with three important values

meta = 200 # this value always returns 200 if the request returns a class
data  : List["""Language Part"""Config] | KanjiConfig # stores most of the data 
Config : type:[BaseConfig] # if you're using this, you're beyond God's help

Since meta will always be 200 whenever a request returns an object, the sections below focus on what data contains.

All requests are akin to searching on Jisho, so to help visualize what data we're receiving, there will be screenshots.

Kanji

from jisho_api.kanji import Kanji
ten_thousand = '万'
data = Kanji.request(ten_thousand).data

This code is equivalent to this search:

Here's a picture of the basic values that you'll want to use. Remember that data, as written in the code block above, is the data attribute of the KanjiRequest object.

Basic Values:

Search tokenized words?

Hi - this is a really great tool!

Currently, the regular Jisho search (jisho.org/search/~) tokenizes a long phrase into its component words. For example, it splits 昨日すき焼きを食べました into 昨日/すき焼き/を/食べました. (For some reason, #sentence returns no results here.)

Would you consider adding an implementation to iterate through these individual words (returning a Word search for each one)? Each one has a data-word tag on it, so they're easy to pull from the soup.

I'm happy to contribute something like this if you think it'd be useful and if you let me know where it would fit best.
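As a rough illustration of the idea, the sketch below collects every `data-word` attribute from a fragment of HTML using only the standard library. The markup is invented for the example and the real jisho.org page will differ; the library itself works on a parsed "soup", so `DataWordCollector` is a hypothetical stand-in rather than proposed library code.

```python
from html.parser import HTMLParser


class DataWordCollector(HTMLParser):
    """Collect the value of every `data-word` attribute, in document order."""

    def __init__(self):
        super().__init__()
        self.words = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name == "data-word" and value:
                self.words.append(value)


# Made-up markup for illustration; the real jisho.org page will differ.
sample_html = """
<ul>
  <li data-word="昨日">昨日</li>
  <li data-word="すき焼き">すき焼き</li>
  <li data-word="を">を</li>
  <li data-word="食べました">食べました</li>
</ul>
"""

collector = DataWordCollector()
collector.feed(sample_html)
print(collector.words)
```

Each collected word could then be passed to `Word.request` in turn to get one word search per token.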

Errors with basic requests

Jisho must have updated their website.

Word requests crash

$>  jisho search word nichi
Traceback (most recent call last):
  File "/home/fulguritude/.local/lib/python3.10/site-packages/requests/models.py", line 971, in json
    return complexjson.loads(self.text, **kwargs)
  File "/home/fulguritude/.local/lib/python3.10/site-packages/simplejson/__init__.py", line 525, in loads
    return _default_decoder.decode(s)
  File "/home/fulguritude/.local/lib/python3.10/site-packages/simplejson/decoder.py", line 370, in decode
    obj, end = self.raw_decode(s)
  File "/home/fulguritude/.local/lib/python3.10/site-packages/simplejson/decoder.py", line 400, in raw_decode
    return self.scan_once(s, idx=_w(s, idx).end())
simplejson.errors.JSONDecodeError: Expecting value: line 1 column 1 (char 0)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/fulguritude/.local/bin/jisho", line 8, in <module>
    sys.exit(make_cli())
  File "/home/fulguritude/.local/lib/python3.10/site-packages/jisho_api/cli.py", line 209, in make_cli
    main()
  File "/home/fulguritude/.local/lib/python3.10/site-packages/click/core.py", line 1130, in __call__
    return self.main(*args, **kwargs)
  File "/home/fulguritude/.local/lib/python3.10/site-packages/click/core.py", line 1055, in main
    rv = self.invoke(ctx)
  File "/home/fulguritude/.local/lib/python3.10/site-packages/click/core.py", line 1657, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/fulguritude/.local/lib/python3.10/site-packages/click/core.py", line 1657, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/fulguritude/.local/lib/python3.10/site-packages/click/core.py", line 1404, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/fulguritude/.local/lib/python3.10/site-packages/click/core.py", line 760, in invoke
    return __callback(*args, **kwargs)
  File "/home/fulguritude/.local/lib/python3.10/site-packages/jisho_api/cli.py", line 147, in request_word
    w = Word.request(word, cache=flag)
  File "/home/fulguritude/.local/lib/python3.10/site-packages/jisho_api/word/request.py", line 74, in request
    r = requests.get(url).json()
  File "/home/fulguritude/.local/lib/python3.10/site-packages/requests/models.py", line 975, in json
    raise RequestsJSONDecodeError(e.msg, e.doc, e.pos)
requests.exceptions.JSONDecodeError: Expecting value: line 1 column 1 (char 0)
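The final frames show `requests.get(url).json()` failing at line 1, column 1, which is what happens when the response body does not start with a JSON value, for instance when an HTML page comes back instead. The traceback does not show what the site actually returned, so the bodies below are invented. The sketch shows one defensive pattern using the standard `json` module; `parse_json_or_none` is a hypothetical helper, not part of jisho_api.

```python
import json


def parse_json_or_none(text):
    """Return the decoded JSON value, or None if `text` is not valid JSON."""
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        return None


# Invented response bodies: an HTML page and a small JSON document.
html_body = "<!DOCTYPE html><html><body>Not JSON</body></html>"
json_body = '{"data": []}'

html_result = parse_json_or_none(html_body)
json_result = parse_json_or_none(json_body)
print(html_result, json_result)
```

A caller can then report "no usable response" for the term instead of crashing with a nested traceback.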

Kanji requests fail.

$> jisho search kanji 数
[Error] No kanji found with name 数.