foosoft / zero-epwing

Sane data exporter for an insane dictionary format.

Home Page: https://foosoft.net/projects/zero-epwing/

License: MIT License

Languages: C 97.85%, CMake 2.15%

Topics: dictionary, epwing, japanese

zero-epwing's Introduction

Leaving GitHub

I'm migrating my open source work away from GitHub. Individual project pages will soon be updated to reflect this change. Once the migration is complete, all repositories will be archived and will receive no further updates.

My projects now live at git.foosoft.net.

zero-epwing's People

Contributors

ejls, foosoft, makigumo

zero-epwing's Issues

Encoding Issue with some Entries

I've been using zero-epwing to convert a number of old EPWING dictionaries for use with Yomichan, and I have run into an issue that I haven't seen before. Some of these dictionaries have characters (like �) in their entries that cannot be encoded in EUC-JP. As a result, I think zero-epwing is unable to convert the text of these entries to UTF-8, and it ends up skipping the definition text for the affected headwords. Most entries are valid EUC-JP, and their definitions are collected as expected.

I verified this by looking at the JSON output from zero-epwing for certain headwords whose definitions contain � when viewed in an EPWING reader: the JSON data has no text key for those headwords. I am trying to figure out whether a regex could be added to the zero-epwing code to remove characters like � before the conversion to UTF-8. If those characters could be removed, more entry data could be collected when moving the EPWING data over to Yomichan.
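
As an illustration of the idea above, here is a minimal sketch in Python (not zero-epwing's C, and not how zero-epwing actually performs its conversion). Instead of a regex, it lets the decoder replace or drop bytes that are not valid EUC-JP so that the rest of the entry survives. The function name `decode_lenient` and the sample bytes are hypothetical.

```python
def decode_lenient(raw: bytes, replace: bool = True) -> str:
    """Decode EUC-JP bytes, keeping the valid text around undecodable bytes."""
    # "replace" substitutes U+FFFD for each undecodable sequence;
    # "ignore" drops it silently.
    mode = "replace" if replace else "ignore"
    return raw.decode("euc_jp", errors=mode)


# Hypothetical entry: valid text with one stray byte (0xFF) in the middle.
sample = "辞書".encode("euc_jp") + b"\xff" + "です".encode("euc_jp")
text = decode_lenient(sample)
```

A strict decode of the same bytes raises `UnicodeDecodeError`, which mirrors the reported symptom of losing the whole definition because of a single bad character.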

An entry from Kenkyusha WAEI 5 is cut, maybe related to the entry size

The dictionary in question: 研究社 新和英大辞典 第5版
The entry: する6 (do; perform; ...)
When it's parsed by zero-epwing it gets cut at
することなすこと皆うまく行かなかった. Everything I did went wrong. | I fail
while the entry in EBWin continues

することなすこと皆うまく行かなかった. Everything I did went wrong. | I failed in every attempt.
…にすれば, …にしたら, …にしてみれば, …にしたって 〔…にとっては〕 from one's point of view; as far as one is concerned; to…; for….
君にすればただのペンかもしれないが, 僕には貴重なものなんだ. 返してくれよ. It may be just a pen to you, but to me it's a really precious thing. I want you to give it back.
・タヌキにしたっていい迷惑だよな. そこは自分たちのすみ家だったんだから. I'm afraid we've caused the raccoon dogs a lot of trouble. After all, it was their home.

The truncation shows up when the dictionary is imported with yomichan-import. Someone I know who is building a Discord bot on top of zero-epwing has the same issue, and that is where we actually discovered it.

I assume zero-epwing just can't handle entries this big, but maybe there's some other issue.

I don't know if there are more entries like this; this is just one example where zero-epwing breaks.
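
To look for other affected entries, a rough heuristic can be run over the JSON output. The sketch below is in Python and assumes the `subbooks` / `entries` / `heading` / `text` layout seen in zero-epwing output elsewhere on this page. It also assumes that a complete entry's text ends with a newline, so an entry cut mid-word (like "| I fail") would not. That second assumption is unverified; the function name `find_suspect_entries` and the sample data are hypothetical.

```python
def find_suspect_entries(book: dict) -> list[str]:
    """Return headings whose text is missing or lacks a trailing newline."""
    suspects = []
    for subbook in book.get("subbooks", []):
        for entry in subbook.get("entries", []):
            text = entry.get("text")
            # Heuristic only: a missing text key or an abrupt ending is
            # treated as a sign that the entry may have been cut short.
            if text is None or not text.endswith("\n"):
                suspects.append(entry.get("heading", ""))
    return suspects


# Hypothetical miniature output for illustration.
sample = {
    "subbooks": [
        {
            "entries": [
                {"heading": "complete", "text": "full definition\n"},
                {"heading": "cut", "text": "Everything I did went wrong. | I fail"},
                {"heading": "no-text"},
            ]
        }
    ]
}
suspects = find_suspect_entries(sample)
```

Flagged headings still need to be checked by hand against a reader such as EBWin, since a missing newline does not prove truncation.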

Output not displaying properly

I'm not sure what encoding the output is written in, but when I view it in Notepad++ it is reported as UCS-2 LE BOM. When I convert it to UTF-8 nothing changes, and when I use Shift-JIS I get Japanese characters, but they are all nonsense.
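
"UCS-2 LE BOM" in Notepad++ indicates a file that starts with the bytes FF FE and is stored as UTF-16 little-endian. One possible cause, which is an assumption and not confirmed in this report, is that the output was redirected to a file by a shell that re-encodes text. A minimal Python sketch for reading such a file by its byte-order mark and re-encoding it as UTF-8 follows; `decode_with_bom` and the sample bytes are hypothetical.

```python
import codecs


def decode_with_bom(raw: bytes) -> str:
    """Decode bytes using the byte-order mark, defaulting to UTF-8."""
    # UTF-32 is deliberately not handled in this sketch.
    if raw.startswith(codecs.BOM_UTF8):
        return raw[len(codecs.BOM_UTF8):].decode("utf-8")
    if raw.startswith(codecs.BOM_UTF16_LE):
        return raw[len(codecs.BOM_UTF16_LE):].decode("utf-16-le")
    if raw.startswith(codecs.BOM_UTF16_BE):
        return raw[len(codecs.BOM_UTF16_BE):].decode("utf-16-be")
    return raw.decode("utf-8")


# Hypothetical file contents: a UTF-16 LE BOM followed by a JSON fragment.
sample = codecs.BOM_UTF16_LE + '{"heading": "神風"}'.encode("utf-16-le")
text = decode_with_bom(sample)
utf8_bytes = text.encode("utf-8")  # what would be written back to disk
```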

I've attached an excerpt of my output, as well as another file showing the original entries from that excerpt for two headwords I was able to track down.

Excerpt:

{
"charCode": "jisx0208",
"discCode": "epwing",
"subbooks": [
{
"title": "三省堂 スーパー大辞林",
"copyright": " \n       ●三省堂 スーパー大辞林 CD−ROM●\n \n       <収録書名>\n \n       『大辞林 第2版』\n        編者  松村 明・三省堂編修所\n             {{w_44663}}1995\n \n        『デイリーコンサイス英和辞典 第5版』\n        編者 三省堂編修所\n            {{w_44663}}1990\n \n       『デイリーコンサイス和英辞典 第4版』\n        編者 三省堂編修所\n            {{w_44663}}1990\n \n       『三省堂 ワープロ漢字辞典』\n        編者 三省堂編修所\n            {{w_44663}}1986\n \n        発行者 株式会社 三省堂\n \n        *このCD−ROMに収録されているデータは、著作権法により\n         保護されており、無断で転載・複写することはできません。\n",
"entries": [
{
"heading": "しょ-り【書吏】",
"text": "しょ-り [1] 【書吏】\n(1)律令制で,四品以上の親王・内親王および三位以上の公卿に仕えた職員。文案の起稿・筆録をつかさどった。\n(2)「胥吏(シヨリ){(2)}」に同じ。\n"
},
{
"heading": "しょり【処理】(和英)",
"text": "しょり【処理】\ndisposition;→英和\nmanagement;→英和\ntransaction;treatment.→英和\n〜する manage;→英和\ndispose;→英和\ntreat.→英和\n"
},
{
"heading": "ジョリオ-キュリー{{w_44666}}Fr{{n_49447}}d{{n_49447}}ric Joliot-Curie{{w_44667}}",
"text": "ジョリオ-キュリー {{w_44666}}Fr{{n_49447}}d{{n_49447}}ric Joliot-Curie{{w_44667}}\n(1900-1958) フランスの物理学者・平和運動家。キュリー夫妻の長女イレーヌの夫。夫妻で人工放射能を発見。第二次大戦中はナチスに対する抵抗運動に参加,戦後は平和運動に積極的に参加。世界平和評議会議長。\n"
},
{
"heading": "りふ【利府】",
"text": "りふ 【利府】\n宮城県中部,宮城郡の町。仙台市の北に接し,松島湾に臨む。石巻街道の宿駅として発達。梨の産地。\n"
},
{
"heading": "πüÿπéçπéè-πüÿπéçπéè",
"text": "じょり-じょり [1] (副)\n髪やひげなどを剃(ソ)る音を表す語。\n"
},
{
"heading": "しょ-りゅう【庶流】",
"text": "しょ-りゅう ―リウ [0] 【庶流】\n(1)庶子の系統。庶族。庶系。\n⇔嫡流\n(2)本家から分家した家筋。分家。別家。\n"
},
{
"heading": "しょ-りゅう【諸流】",
"text": "しょ-りゅう ―リウ [1] 【諸流】\nさまざまの流派。\n"
},
{
"heading": "じょ-りゅう【女流】",
"text": "じょ-りゅう ヂヨリウ [0] 【女流】\n女性。婦人。「―棋士」「―文学」\n"
},
{
"heading": "じょ-りゅう【叙留】",
"text": "じょ-りゅう ―リウ 【叙留】\n律令制下,位階だけ昇進し,官職はもとのままにとどまること。\n"
},
{
"heading": "じょりゅう【女流(の)】(和英)",
"text": "じょりゅう【女流(の)】\na woman;→英和\na female;→英和\na lady.→英和\n"
},
{
"heading": "かみかぜ【神風】(和英)",
"text": "かみかぜ【神風】\n(1) a divine wind;the timely rescue of Providence.(2) a Kamikaze;a suicide pilot (特攻隊員).\n‖神風運転手 a reckless driver.\n"
},
{
"heading": "うんてん-しゅ【運転手】",
"text": "うんてん-しゅ [3] 【運転手】\n電車・自動車などの運転をする人。\n"
}
]
}
]
}

Original Entries:

{
"heading": "かみかぜ【神風】(和英)",
"text": "かみかぜ【神風】\n(1) a divine wind;the timely rescue of Providence.(2) a Kamikaze;a suicide pilot (特攻隊員).\n‖神風運転手 a reckless driver.\n"
}

かみかぜ【神風】(和英)
かみかぜ【神風】
(1) a divine wind;the timely rescue of Providence.(2) a Kamikaze;a suicide pilot (特攻隊員).
‖神風運転手 a reckless driver.

{
"heading": "うんてん-しゅ【運転手】",
"text": "うんてん-しゅ [3] 【運転手】\n電車・自動車などの運転をする人。\n"
}

うんてん-しゅ【運転手】
うんてん-しゅ [3] 【運転手】
電車・自動車などの運転をする人。
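
The one heading in the excerpt that looks like nonsense, `πüÿπéçπéè-πüÿπéçπéè`, is consistent with UTF-8 bytes that were displayed through code page 437. Whether that is what happened in the reporter's setup is an assumption. The Python sketch below reverses that specific mix-up; `repair_cp437_mojibake` is a hypothetical helper name.

```python
def repair_cp437_mojibake(garbled: str) -> str:
    """Undo UTF-8 text that was mistakenly decoded as code page 437."""
    # Re-encode to recover the original bytes, then decode them as UTF-8.
    return garbled.encode("cp437").decode("utf-8")


heading = repair_cp437_mojibake("πüÿπéçπéè-πüÿπéçπéè")
```

The recovered heading matches the reading that opens the same entry's text field, which supports the idea that the data itself is valid UTF-8 and only the display step went wrong.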

Building for macOS

I got this tool to build and run successfully on macOS. Following the steps in the README as-is almost worked, but the resulting executable fails at launch with dyld: Library not loaded: /usr/local/lib/libeb.16.dylib.

The issue seems to stem from Apple's linker as described here.

Setting DYLD_LIBRARY_PATH worked, as did removing the problematic dylibs to force static linking.

But since zero-epwing's target_link_libraries is specifically referring to .a anyway, disabling building of shared libraries for eb is also an option, right? Setting AC_DISABLE_SHARED for eb (or just running with ./configure --disable-shared) also fixed the problem for me on macOS.

I'm not sure which approach you prefer, or if this is any different for Linux or Windows, but I wanted to at least document this for macOS users.
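
To check which of these fixes took effect, the output of `otool -L` (the macOS tool that lists a binary's dynamic library dependencies) can be inspected for libeb. The Python sketch below parses that output. The helper names are hypothetical, and the sample text is made up for illustration, including its version numbers.

```python
import subprocess


def read_otool(binary_path: str) -> str:
    """Run `otool -L` on a binary (macOS only) and return its output."""
    result = subprocess.run(
        ["otool", "-L", binary_path], capture_output=True, text=True, check=True
    )
    return result.stdout


def dynamic_libs(otool_output: str) -> list[str]:
    """Extract library paths from `otool -L` output."""
    libs = []
    # The first line names the binary; each later line is "<path> (<versions>)".
    for line in otool_output.splitlines()[1:]:
        line = line.strip()
        if line:
            libs.append(line.split(" (", 1)[0])
    return libs


def links_libeb_dynamically(otool_output: str) -> bool:
    """True if any listed dependency is a libeb dylib."""
    return any(
        path.rsplit("/", 1)[-1].startswith("libeb") and path.endswith(".dylib")
        for path in dynamic_libs(otool_output)
    )


# Made-up sample output; the version numbers are illustrative only.
sample = (
    "zero-epwing:\n"
    "\t/usr/local/lib/libeb.16.dylib (compatibility version 17.0.0, current version 17.0.0)\n"
    "\t/usr/lib/libSystem.B.dylib (compatibility version 1.0.0, current version 1.0.0)\n"
)
needs_fix = links_libeb_dynamically(sample)
```

If eb was built with `./configure --disable-shared`, libeb should no longer appear in the list, since it is then linked statically.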
