Hi, I maintain CedPane, a public-domain Chinese-English dictionary supplement (currently about 64,000 entries). It contains data that is not usually in "normal" dictionaries but is still useful to have in a reader (mostly names of people and places). If you include these extra entries, I would suggest labelling them in some way to differentiate them from the "main" CEDICT, as it seems the CEDICT editors are not sure they want to merge in CedPane entries en masse (and anyway I'm still writing it).
If you do want to merge it in, I think the best starting point would be the CedPane ChinaScribe file, because its format is quite similar to CEDICT's. The main difference is that some of the pinyin syllables are separated with `_` instead of a space: this indicates a word boundary in a multi-word phrase; if you can't cope with multi-word phrases then these entries are possibly best dropped. Also, some of the definitions are in `<...>` to indicate an environment (e.g. PRC, TW, netspeak). Other than that it's basically the same apart from the sort order.
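To make the two format differences concrete, here is a rough Python sketch of reading one entry. It assumes the ChinaScribe file uses the usual CEDICT line shape (`Traditional Simplified [pinyin] /definition/definition/`) and that an environment marker appears inside a definition as something like `<PRC>`; the function names and the sample line in the usage note are made up for illustration and are not taken from the actual file.

```python
import re

# Usual CEDICT line shape: Traditional Simplified [pin1 yin1] /def 1/def 2/
ENTRY_RE = re.compile(r"^(\S+) (\S+) \[([^\]]*)\] /(.*)/\s*$")
# ASSUMPTION: an environment marker looks like "<PRC>" inside a definition.
ENV_RE = re.compile(r"<([^<>]+)>")


def parse_entry(line):
    """Parse one entry line; return None for comments and non-matching lines."""
    if line.startswith("#"):
        return None
    m = ENTRY_RE.match(line)
    if m is None:
        return None
    trad, simp, pinyin, defs = m.groups()
    definitions = defs.split("/")
    return {
        "traditional": trad,
        "simplified": simp,
        "pinyin": pinyin,
        # "_" between syllables marks a word boundary in a multi-word phrase
        "multi_word": "_" in pinyin,
        "definitions": definitions,
        "environments": [e for d in definitions for e in ENV_RE.findall(d)],
    }


def flatten_pinyin(pinyin):
    """Turn word-boundary underscores back into plain CEDICT-style spaces."""
    return pinyin.replace("_", " ")
```

A reader that can't handle multi-word phrases could skip every entry where `multi_word` is true, and one that can but doesn't care about the boundaries could pass the pinyin through `flatten_pinyin`. For instance, with the invented line `北京大學 北京大学 [Bei3 jing1_Da4 xue2] /Peking University <PRC>/`, `parse_entry` would report `multi_word` as true and `environments` as `["PRC"]`.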
I don't know what your current method is for generating `cedict.idx` and your modified `cedict_ts.u8` from upstream (do you have scripts to do this, and if so can you put them in the repo for reference?), but I'd imagine merging in another source (and perhaps labelling every definition with `[CedPane]` or similar so as to differentiate them from mainline CEDICT) shouldn't be difficult.
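As a rough illustration of the labelling idea, the sketch below appends `[CedPane]` to every definition of a CEDICT-format line and then adds the labelled lines after the main dictionary's lines. It is only a guess at what such a step could look like: the helper names are hypothetical, it assumes both inputs are already in the CEDICT line shape, and it does not try to re-sort the result or regenerate `cedict.idx`, since I don't know how that index is built.

```python
LABEL = "[CedPane]"


def label_line(line, label=LABEL):
    """Append `label` to each definition of one CEDICT-format line."""
    stripped = line.rstrip("\r\n")
    if (stripped.startswith("#") or "] /" not in stripped
            or not stripped.endswith("/")):
        return stripped  # leave comments and unexpected lines unchanged
    # Split once at the end of the pinyin field; the rest is "/def/def/".
    head, defs = stripped.split("] /", 1)
    labelled = [f"{d} {label}" for d in defs[:-1].split("/")]
    return f"{head}] /" + "/".join(labelled) + "/"


def merge(main_lines, supplement_lines):
    """Return main lines followed by labelled supplement entries.

    Comment lines from the supplement are dropped; no sorting is done.
    """
    merged = [line.rstrip("\r\n") for line in main_lines]
    merged += [label_line(line) for line in supplement_lines
               if not line.startswith("#")]
    return merged
```

Putting the label at the end of each definition, rather than at the start, keeps the gloss itself as the first thing a reader sees in a popup; that choice is arbitrary and easy to flip.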
It's nice that you are able to push out several updates a year. I currently tend to publish my CedPane edits on the last day of each month, although that's not a guarantee. If you do a `git pull` from my repo as part of your normal update script, that should work. Alternatively, on the CedPane home page there's always a "Last update" and entry count.
(I used to make a text file of CedPane available for download from the home page, but then a developer in China thought it was a good idea to write a lookup extension that re-downloaded CedPane from my server every time it was used, which caused hundreds of gigabytes of traffic. When I say “keep up to date” I don't mean that much ☺ You can still get it from the home page, but it's now a ZIP file, which I hope will discourage our extension-writing friend from hammering my server. Meanwhile it also lives on the major Git providers, which have more bandwidth. Including the data in the extension with periodic updates, as you do, does seem to be a better way.)