Comments (6)
I did some experiments and here's what I found:
Validating JSON files with a Node.js validator:
Validating jmdict-all-3.3.0.json...
appliesTo*: 844 (kanji) and 1367 (kana) in 1321 unique words
PASS
Counting <stagk>/<stagr> elements (I used closing tags to avoid unexpected stuff like attributes) in XML:
% grep '</stagk>' build/dict-xml/JMdict.xml | wc -l
844
% grep '</stagr>' build/dict-xml/JMdict.xml | wc -l
1367
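For anyone without grep at hand, the same line count can be sketched in Python. This is only an illustration: `count_lines_containing` is a hypothetical helper, and the XML fragment is made-up sample data, not real JMdict content. Like `grep ... | wc -l`, it counts matching lines rather than individual tags.

```python
def count_lines_containing(lines, needle):
    """Count lines that contain `needle`, like `grep needle | wc -l`."""
    return sum(1 for line in lines if needle in line)


# Made-up fragment for illustration only (not real JMdict data).
sample_xml_lines = """<sense>
<stagk>A</stagk>
<stagr>a</stagr>
<stagr>b</stagr>
</sense>""".splitlines()

stagk_count = count_lines_containing(sample_xml_lines, "</stagk>")
stagr_count = count_lines_containing(sample_xml_lines, "</stagr>")
print(stagk_count, stagr_count)
```

To run it on the real file, pass an open file object instead of the sample list, for example `open("build/dict-xml/JMdict.xml", encoding="utf-8")`.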
And finally, counting unique sense elements with non-empty appliesTo* arrays from the Kotlin converter (sanity check): 1818, which seems about right.
As you can see, the numbers match. It seems that the change comes from JMdict itself, and not from something I did in the conversion. The same goes for empty strings in appliesTo*: I couldn't find any such cases in the original XML files.
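The JSON-side counts could be reproduced with something like the sketch below. It assumes each word is an object with an `id` and a `sense` list, and that each sense carries `appliesToKanji` and `appliesToKana` arrays; the `id` and `sense` key names and the helper `count_restrictions` are assumptions made for illustration. The `"*"` wildcard is skipped so the count means the same thing whether or not the wildcard is present.

```python
def count_restrictions(words):
    """Tally explicit kanji/kana restrictions across all senses."""
    kanji_total = 0
    kana_total = 0
    sense_total = 0
    word_ids = set()
    for word in words:
        for sense in word.get("sense", []):
            # Ignore the "*" wildcard: it means "no restriction".
            kanji = [v for v in sense.get("appliesToKanji", []) if v != "*"]
            kana = [v for v in sense.get("appliesToKana", []) if v != "*"]
            kanji_total += len(kanji)
            kana_total += len(kana)
            if kanji or kana:
                sense_total += 1
                word_ids.add(word["id"])
    return {
        "kanji": kanji_total,
        "kana": kana_total,
        "senses": sense_total,
        "words": len(word_ids),
    }


# Made-up sample data, not real dictionary entries.
sample_words = [
    {"id": "1", "sense": [
        {"appliesToKanji": ["A"], "appliesToKana": []},
        {"appliesToKanji": [], "appliesToKana": ["a", "b"]},
    ]},
    {"id": "2", "sense": [
        {"appliesToKanji": ["*"], "appliesToKana": ["*"]},
    ]},
]

restriction_counts = count_restrictions(sample_words)
print(restriction_counts)
```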
The code for converting these parts is very straightforward: there are no filters or post-processing, just a simple copy. And there is no pattern matching (if that's what you mean by "the new pattern"); the converter is a streaming XML parser. So it's nothing like grep or sed.
To clarify, empty arrays in JSON files are by design (see the readme). I never omit any collections just because they are empty, because that could lead to some nasty null/undefined errors in users' code.
Let me know if you find something contradicting my experiments above.
from jmdict-simplified.
The old XQuery code used to do the following:
<j:array key="appliesToKanji">
    { if (not($elem/stagk))
      then <j:string> { "*" } </j:string>
      else for $restr in $elem/stagk
           return <j:string> { $restr/text() } </j:string> }
</j:array>
<j:array key="appliesToKana">
    { if (not($elem/stagr))
      then <j:string> { "*" } </j:string>
      else for $restr in $elem/stagr
           return <j:string> { $restr/text() } </j:string> }
</j:array>
This means that if there were no stagk/stagr elements, the * wildcard was added by default. That part has indeed changed. Now I'm deciding whether to correct the documentation to reflect the new state or to back-port the old logic.
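The defaulting rule in the XQuery above can be sketched in Python with the standard library's `xml.etree.ElementTree`. This is not the converter's actual code (that is written in Kotlin); the helper name `applies_to` and the sample elements are hypothetical and only illustrate the "no restriction elements means `*`" rule.

```python
import xml.etree.ElementTree as ET


def applies_to(sense_elem, tag):
    """Return the texts of child `tag` elements, or ["*"] if there are none."""
    values = [child.text or "" for child in sense_elem.findall(tag)]
    return values if values else ["*"]


# Made-up sense elements for illustration only.
restricted_sense = ET.fromstring(
    "<sense><stagk>A</stagk><stagr>a</stagr><stagr>b</stagr></sense>"
)
unrestricted_sense = ET.fromstring("<sense><gloss>example</gloss></sense>")

print(applies_to(restricted_sense, "stagk"))
print(applies_to(restricted_sense, "stagr"))
print(applies_to(unrestricted_sense, "stagk"))
```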
I'm leaning toward the former (changing the docs). The reason is that the original XML files have no way of saying "this sense applies to none of the kanji/kana" anyway, so whenever these arrays are empty, */"all" should be assumed by default.
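On the consumer side, a defensive check could treat an empty array and the `*` wildcard the same way, which works regardless of which behavior a given release has. The function below is a hypothetical sketch, not part of jmdict-simplified.

```python
def sense_applies_to(restrictions, written_form):
    """Return True if a sense applies to the given kanji or kana writing.

    An empty list and the "*" wildcard both mean "applies to all".
    """
    if not restrictions or "*" in restrictions:
        return True
    return written_form in restrictions


print(sense_applies_to([], "A"))
print(sense_applies_to(["*"], "A"))
print(sense_applies_to(["A"], "B"))
```

Handling both forms in one place keeps user code independent of whether the wildcard is present in a particular data file.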
Compare that to appliesToKanji on kana elements: there is a special <re_nokanji/> tag that acts as a "none" value. There is no analog for this on sense elements.
I would even consider renaming sense.appliesToKanji/sense.appliesToKana to something like sense.restrictedToKanji/sense.restrictedToKana to make them more distinct from kana.appliesToKanji. @aehlke let me know if you like that idea.
from jmdict-simplified.
Updated the README in c5c4d9c
from jmdict-simplified.
Actually, @aehlke, I think back-porting might be a better option. Just to keep backwards-compatibility and uniform logic.
Marking this as a bug. Will bring the old "*" back.
from jmdict-simplified.
@aehlke Grab the latest release
Thank you for finding this. Really subtle issue.
from jmdict-simplified.
Thank you very much for the detailed analysis and follow-up! I quickly patched an update in my usage but wasn't confident I understood what changed, so this helps, and it seems more consistent now.
from jmdict-simplified.