
Comments (9)

ssb22 commented on August 16, 2024

Thanks, I think the easiest way to omit the "phrase" entries is simply to omit any entry that has a _ in it in the ChinaScribe format (or a space in the pinyin column in the main format). That works better than using a length limit, because a length limit may omit entries like 图斯潘德伯拉尼奥斯镇 (the town of Tuxpan de Bolaños, yes I saw that in a news article in 2018).

Phrases like "allow installation for unknown sources" are included because we occasionally need them for translating English into Chinese. I find people tend not to understand my technical instructions unless I can quote the exact wording that's displayed on their screen, not just a paraphrase of it, so yes we do want to be able to look up how things like that are worded in Chinese. But they are not meant to be displayed without spaces, which is why I include spaces in the pinyin field of cedpane.txt, and _ characters in the ChinaScribe format.

If anyone has code that can process phrases including spaces, I'd rather they include the multi-word phrases because some of these entries are used to "clear up" what would otherwise be a difficult case for a computer to get right. For example, the entry 万国都 is 2 words, and it is meant to clarify that, in the texts I've seen, 万国都 should be written as "wànguó dōu" (all nations + all), rather than "wàn guódū" (myriad + capital cities). Otherwise, software like Wenlin might incorrectly put "wàn guódū" because 国都 has a higher usage frequency than 万国 (usage frequency is the wrong signal to use in this case, so I added an 'override' phrase entry).
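The override idea can be sketched roughly as follows. This is only an illustration, not how Wenlin or any other tool works internally: the frequency numbers are invented, and `segmentations`, `annotate`, `WORD_FREQ`, `PINYIN` and `OVERRIDES` are hypothetical names. The scoring rule (prefer the split whose multi-character words have the highest total frequency) is an assumption chosen to reproduce the failure described above.

```python
from typing import Dict, Iterator, List, Optional

# Made-up frequencies, chosen only so that 国都 outranks 万国.
WORD_FREQ: Dict[str, int] = {"万": 50, "国都": 40, "万国": 10, "都": 60}
PINYIN: Dict[str, str] = {"万": "wàn", "国都": "guódū", "万国": "wànguó", "都": "dōu"}
# Phrase entry whose pinyin field uses a space to mark the word boundary.
OVERRIDES: Dict[str, str] = {"万国都": "wànguó dōu"}


def segmentations(text: str, vocab: Dict[str, int]) -> Iterator[List[str]]:
    """Yield every way of splitting text into words found in vocab."""
    if not text:
        yield []
        return
    for end in range(1, len(text) + 1):
        head = text[:end]
        if head in vocab:
            for rest in segmentations(text[end:], vocab):
                yield [head] + rest


def annotate(text: str, freq: Dict[str, int], pinyin: Dict[str, str],
             overrides: Optional[Dict[str, str]] = None) -> str:
    """Return pinyin for text; a phrase entry, if present, wins outright."""
    if overrides and text in overrides:
        return overrides[text]
    # Frequency-only choice: raises ValueError if no split exists at all.
    best = max(segmentations(text, freq),
               key=lambda seg: sum(freq[w] for w in seg if len(w) > 1))
    return " ".join(pinyin[w] for w in best)


print(annotate("万国都", WORD_FREQ, PINYIN))             # wàn guódū
print(annotate("万国都", WORD_FREQ, PINYIN, OVERRIDES))  # wànguó dōu
```

For brevity the sketch only applies an override when the whole input matches the phrase entry; real code would need to find phrase entries inside longer text.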

CEDICT also has a few 'long phrase' entries (like 金窩銀窩不如自己的狗窩) which I don't think should be written without spaces. Unfortunately, CEDICT doesn't have the _ characters I use to indicate spaces, so about the only thing you can do with that data is to have a length limit. But for CedPane you can look out for the _ characters (or spaces in the pinyin field of cedpane.txt) to identify this type of entry.
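A minimal sketch of the two filters, under stated assumptions: the column layout of cedpane.txt and of the ChinaScribe file is not described in this thread, so the function takes a pinyin field that the caller has already extracted. The names `is_phrase_entry` and `within_length_limit`, and the limit of 8 characters, are made up for illustration.

```python
def is_phrase_entry(pinyin_field: str) -> bool:
    """True if the field carries a word-boundary marker.

    Assumes a space (main format) or an underscore (ChinaScribe format)
    inside the field marks a multi-word phrase.
    """
    field = pinyin_field.strip()
    return " " in field or "_" in field


def within_length_limit(hanzi: str, max_chars: int = 8) -> bool:
    """Cruder fallback for data without boundary markers, such as CEDICT.

    max_chars=8 is an arbitrary illustrative value.
    """
    return len(hanzi) <= max_chars


print(is_phrase_entry("wànguó dōu"))               # True: two words
print(is_phrase_entry("guódū"))                    # False: one word
print(within_length_limit("图斯潘德伯拉尼奥斯镇"))  # False: 10 characters
```

The last line shows the drawback of a length limit: the place name is a single entry, yet it is dropped simply for being long.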

from zhongwen.

ssb22 commented on August 16, 2024

Sorry I forgot to mention that the reason why I wrote this ticket was because a user of your extension emailed me asking me to fork your extension into a version with CedPane added. I'd rather avoid creating a fork if it's something that can be done upstream.


ienablemuch commented on August 16, 2024

@ssb22

I appreciate that users will no longer have to spend a great deal of time analyzing an enigmatic series of Chinese words when those words are merely people's transliterated names, compound words, company names, colloquial phrases, or idioms. In fact, I have included your dictionary in my Chinese Words Separator extension for Chrome: https://chrome.google.com/webstore/detail/chinese-words-separator/gacfacdpfimbkgcnlegknnmcccjgcbnp

It will save a lot of Chinese language learners from over-analyzing a series of hanzi. Here's an example of my extension's output, before and after I included your CedPane dictionary:

image

However, there are phrases that I feel should not be in the CedPane dictionary, for instance:

允许安装来自未知来源的应用

Yǔnxǔ ānzhuāng láizì wèizhī láiyuán de yìngyòng

I feel that it's neither a compound word nor a colloquial phrase that Chinese language learners should learn by heart. So, to exclude it, I limited my code's compound-word look-ahead to a certain length, so that lengthy phrases of that kind are left out of the extension's compound-word mechanism. There are more phrases that I think should not be in the CedPane dictionary.
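For illustration, a look-ahead cap of this kind could look roughly like the sketch below. It is written in Python rather than taken from the extension's source, so `segment`, `max_lookahead` and the tiny vocabulary are all hypothetical. It uses plain forward longest-match, which may differ from what the extension actually does.

```python
from typing import List, Set

PHRASE = "允许安装来自未知来源的应用"
# Toy vocabulary: the long phrase entry plus its component words.
VOCAB: Set[str] = {"允许", "安装", "来自", "未知", "来源", "应用", PHRASE}


def segment(text: str, vocab: Set[str], max_lookahead: int = 6) -> List[str]:
    """Forward longest-match, never looking further than max_lookahead characters."""
    words: List[str] = []
    i = 0
    while i < len(text):
        longest = min(max_lookahead, len(text) - i)
        for size in range(longest, 0, -1):
            candidate = text[i:i + size]
            # A single character is always accepted so the loop makes progress.
            if size == 1 or candidate in vocab:
                words.append(candidate)
                i += size
                break
    return words


print(segment(PHRASE, VOCAB, max_lookahead=6))
# ['允许', '安装', '来自', '未知', '来源', '的', '应用']
print(segment(PHRASE, VOCAB, max_lookahead=len(PHRASE)))
# ['允许安装来自未知来源的应用']
```

With the cap at 6 the 13-character phrase entry can never match, so the text falls back to its component words; raising the cap lets the whole phrase match as one unit.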

I believe the hesitancy of some Chinese dictionary tool makers to add CedPane to their dictionaries stems from examples like these.

image


ienablemuch commented on August 16, 2024

@ssb22 It would be a good idea to add a field to the CedPane dictionary indicating whether an entry is a name, a compound word, a colloquial phrase, or an idiom. Examples such as 允许安装来自未知来源的应用 could then be marked as accurate translations of commonly occurring phrases.


ienablemuch commented on August 16, 2024

@ssb22 Here's another example of the Chinese Words Separator extension's output with your CedPane dictionary included:

image

Without CedPane:

image

Thanks :)


cschiller commented on August 16, 2024

Hi @ssb22, thanks for getting in touch. You obviously put a lot of work into compiling your dictionary and the result is very impressive.

I would actually prefer it if you made your work available by publishing it through CC-CEDICT. That dictionary already includes a number of names of famous people and well-known places. I believe this approach would have several benefits:

  • Other users of CC-CEDICT would benefit from your work as well.
  • The entries would go through an extra QA step, ensuring that any mistakes you might have made could be caught and fixed.
  • I'm actually not sure whether a reading tool, such as Zhongwen, needs to include all of the entries you've compiled. I'm fairly sure that a subset of the most important ones would be a helpful addition, but, as you have already pointed out, a "normal" dictionary doesn't contain all those entries either. So by working together with the CC-CEDICT team you could narrow down the list to the most relevant entries, however large that subset might turn out to be. In the end it's an editorial decision.

Anyway, I respect the amount of work you've put into this. By working together with the CC-CEDICT team you could make it available to an even wider audience and it would be a win for everybody.


ienablemuch commented on August 16, 2024

"But they are not meant to be displayed without spaces, which is why I include spaces in the pinyin field of cedpane.txt, and _ characters in the ChinaScribe format"

I overlooked the file (CedPane-ChinaScribe.txt) that has word boundaries delimited by the underscore (_) character; I initially used cedpane.txt. I use the ChinaScribe file now, and I have added CedPane's phrases back into the Chinese Words Separator extension. Besides CedPane's names and compound words, the commonly occurring phrases now also jump out of the screen, at least when using an extension.

image

Is there a version or fork of CC-CEDICT in ChinaScribe format? It's neat when pinyin has word boundaries marked by underscores, not just 'syllable' boundaries. As it stands, the idiom "there's no place like home" is rendered with no spaces, because it is treated as one word: the CC-CEDICT source dictionary has no word boundaries :)

image


ssb22 commented on August 16, 2024

Thanks @cschiller. I believe I was ostracized by the CEDICT team after an email misunderstanding 4 years ago, and I wouldn't want to annoy them by trying again now.

The CEDICT license doesn't let developers mix CEDICT data with other data, unless that other data also has a CC license. I was in the awkward position of having been given special permission to use certain proprietary data in a zero-cost zero-profit Android app, but I didn't have permission to CC-license that data, therefore I could not mix in CEDICT (unless CEDICT gave me an exception to the "must CC it" rule, which they didn't). I did try Adsotrans data for a while, since Adso's license did allow mixing without a CC requirement. But I found issues with the quality of Adso's data, so ended up going my own way instead.

Pleco has an innovative way of keeping dictionaries separate while still letting you use several, so Pleco is able to use both CC-CEDICT and proprietary dictionaries if you want. But not all apps can be written like Pleco (and I could not figure out how to make my Annotator Generator work anything like Pleco) so I couldn't just do it that way. I felt public-domain data would be least likely to cause problems for developers.

Sure I'm happy for CEDICT to use the data as long as they don't try to stop me from keeping the public-domain version available as well. They would probably want to review everything before inclusion, which could end up being a lot of work. In the short term it may make more sense to have CedPane as a separate source, and perhaps label your entries so everyone knows which of them have been edited by CEDICT versus which of them have only been edited by me. I suppose it's not impossible CEDICT could decide my editing is good enough to import without further review, but that is not my call to make! At the very least, I'd want to draw their attention to:

  • any entry that has a ? in it (usually means I wasn't sure)
  • the fact that I've written unproven or disproven against some alternative-medicine words
  • the definitions mentioning Mormon might not be phrased properly (I did try to get it right but I'm not LDS)
  • all the commercial trademarks that I list in the readme
  • the entries marked "vulgar" (in most cases I have made the English definition milder; in one case I only put "expletive, do not use")
  • the fact that Bible book names are written using Protestant wording by default and "Catholic version" for the Catholic wording

etc. Marking all entries as "from CedPane" until reviewed could be one way to shift any blame.

@ienablemuch the only other dictionary I know of that uses ChinaScribe format is the one bundled with ChinaScribe itself which is commercial Windows software with free trial (it sometimes works on WINE depending on the version). The License Agreement that pops up when you install it says: "Many ChinaScribe entries are derived from the following sources: CC-CEDICT Chinese-English dictionary. Available free of charged and licensed under a Creative Commons Attribution-Share Alike 3.0 License." So I suppose that means, although ChinaScribe is commercial software, its dictionary is a CC-CEDICT fork and can be used with other programs if you can get the data out of ChinaScribe. (The paid-up version has File / Export dictionary entries, but this refuses to run on the unpaid version. The internal file is typically drive_c/users/Public/Application Data/ChinaScribe/MainDict.cs1 but it's in some binary format I haven't figured out.)

(Edit: formatting)


ssb22 commented on August 16, 2024

Incidentally ChinaScribe merged in CedPane in 2017, but I haven't checked to what extent they're keeping up since then. (I do keep an entry in CedPane for CedPane itself, with the date on it—see for example Ce.html—I figured that this entry could be used to check when a project that imported CedPane last did so, assuming they kept that entry. ChinaScribe doesn't seem to have it at the moment, which might perhaps mean their last import predates when I first added it.)

