Comments (9)
Thanks, I think the easiest way to omit the "phrase" entries is simply to omit any entry that has a _ in it in the ChinaScribe format (or a space in the pinyin column in the main format). That works better than using a length limit, because a length limit may omit entries like 图斯潘德伯拉尼奥斯镇 (the town of Tuxpan de Bolaños, yes I saw that in a news article in 2018).
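A rough sketch of the two filters in Python, for comparison. The (hanzi, pinyin) pairs below are a simplification I am assuming for illustration: the real files have more columns, and the names `is_phrase`, `keep_by_boundary`, `keep_by_length` and `MAX_LEN` are hypothetical.

```python
# Hypothetical (hanzi, pinyin) pairs; the real cedpane.txt has more columns.
ENTRIES = [
    ("国都", "guódū"),
    ("万国都", "wànguó dōu"),
    ("允许安装来自未知来源的应用",
     "Yǔnxǔ ānzhuāng láizì wèizhī láiyuán de yìngyòng"),
    # Pinyin typed by hand for illustration, not copied from CedPane.
    ("图斯潘德伯拉尼奥斯镇", "Túsīpāndébólāní'àosīzhèn"),
]

MAX_LEN = 8  # arbitrary cut-off for the length-limit approach


def is_phrase(pinyin: str) -> bool:
    """True when the pinyin marks a word boundary with a space or '_'."""
    return " " in pinyin.strip() or "_" in pinyin


def keep_by_boundary(entries):
    """Drop multi-word phrases, keep single words however long they are."""
    return [hanzi for hanzi, pinyin in entries if not is_phrase(pinyin)]


def keep_by_length(entries, max_len=MAX_LEN):
    """Drop anything longer than max_len characters, word or not."""
    return [hanzi for hanzi, _ in entries if len(hanzi) <= max_len]


if __name__ == "__main__":
    print(keep_by_boundary(ENTRIES))
    print(keep_by_length(ENTRIES))
```

With this sample data the boundary test keeps the long place name and drops both phrases, while the length limit drops the place name and lets the short two-word entry 万国都 through.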
Phrases like "allow installation for unknown sources" are included because we occasionally need them for translating English into Chinese. I find people tend not to understand my technical instructions unless I can quote the exact wording that's displayed on their screen, not just a paraphrase of it, so yes we do want to be able to look up how things like that are worded in Chinese. But they are not meant to be displayed without spaces, which is why I include spaces in the pinyin field of cedpane.txt, and _ characters in the ChinaScribe format.
If anyone has code that can process phrases including spaces, I'd rather they include the multi-word phrases because some of these entries are used to "clear up" what would otherwise be a difficult case for a computer to get right. For example, the entry 万国都 is 2 words, and it is meant to clarify that, in the texts I've seen, 万国都 should be written as "wànguó dōu" (all nations + all), rather than "wàn guódū" (myriad + capital cities). Otherwise, software like Wenlin might incorrectly put "wàn guódū" because 国都 has a higher usage frequency than 万国 (usage frequency is the wrong signal to use in this case, so I added an 'override' phrase entry).
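To illustrate how an override phrase entry changes the result, here is a minimal Python sketch. It does not model usage frequency at all: as a stand-in for a segmenter that gets this case wrong, it uses plain backward maximum matching, which happens to pick 国都 first. The lexicon contents and the names `BASE`, `OVERRIDES` and `annotate` are assumptions for illustration only.

```python
# Tiny hypothetical lexicon: hanzi -> pinyin.
BASE = {
    "万": "wàn",
    "国": "guó",
    "都": "dōu",
    "万国": "wànguó",
    "国都": "guódū",
}

# The multi-word phrase entry; the space in its pinyin is the word boundary.
OVERRIDES = {"万国都": "wànguó dōu"}


def annotate(text, lexicon):
    """Backward maximum matching: take the longest entry ending at the cursor."""
    max_len = max(len(word) for word in lexicon)
    pieces = []
    end = len(text)
    while end > 0:
        for size in range(min(max_len, end), 0, -1):
            word = text[end - size:end]
            if word in lexicon:
                pieces.append(lexicon[word])
                break
        else:
            # Unknown character: pass it through unchanged.
            size = 1
            pieces.append(text[end - 1:end])
        end -= size
    return " ".join(reversed(pieces))


if __name__ == "__main__":
    print(annotate("万国都", BASE))
    print(annotate("万国都", {**BASE, **OVERRIDES}))
```

Without the phrase entry the sketch yields "wàn guódū"; once the entry is merged in, the three-character match wins and the output is "wànguó dōu". Code that skips every entry containing a space would lose this correction.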
CEDICT also has a few 'long phrase' entries (like 金窩銀窩不如自己的狗窩) which I don't think should be written without spaces. Unfortunately, CEDICT doesn't have the _ characters I use to indicate spaces, so about the only thing you can do with that data is to have a length limit. But for CedPane you can look out for the _ characters (or spaces in the pinyin field of cedpane.txt) to identify this type of entry.
from zhongwen.
Sorry I forgot to mention that the reason why I wrote this ticket was because a user of your extension emailed me asking me to fork your extension into a version with CedPane added. I'd rather avoid creating a fork if it's something that can be done upstream.
from zhongwen.
I appreciate that users will no longer spend a great deal of time analyzing enigmatic series of Chinese words when those words are merely people's transliterated names, compound words, company names, colloquial phrases, or idioms. In fact, I included your dictionary in my Chinese Words Separator extension for Chrome: https://chrome.google.com/webstore/detail/chinese-words-separator/gacfacdpfimbkgcnlegknnmcccjgcbnp
It will save a lot of Chinese language learners from over-analyzing a series of hanzi. Here's an example of my extension's output, before and after I included your CedPane dictionary:
However, there are phrases that I feel should not be in the CedPane dictionary, for instance:
允许安装来自未知来源的应用
Yǔnxǔ ānzhuāng láizì wèizhī láiyuán de yìngyòng
I feel that it's neither a compound word nor a colloquial phrase that Chinese language learners should learn by heart. To exclude it, I limited my code's compound-word look-ahead to a certain length, so that lengthy phrases of that kind are left out of the extension's compound-word mechanism. There are more phrases that I think should not be in the CedPane dictionary.
I believe the hesitancy of some Chinese dictionary tool makers to include the CedPane dictionary in their own stems from examples like those.
from zhongwen.
@ssb22 It would be a good idea to add a field to the CedPane dictionary indicating whether an entry is a name, a compound word, a colloquial phrase, or an idiom. Examples such as 允许安装来自未知来源的应用 could be marked as accurate translations of commonly occurring phrases.
from zhongwen.
@ssb22 Here's another output of Chinese Words Separator extension with your CedPane dictionary included:
Without CedPane:
Thanks :)
from zhongwen.
Hi @ssb22, thanks for getting in touch. You obviously put a lot of work into compiling your dictionary and the result is very impressive.
I would actually prefer if you made your work available via publishing it through CC-CEDICT. That dictionary already includes a number of names of famous people and well-known places. I believe this approach would have several benefits:
- Other users of CC-CEDICT would benefit from your work as well.
- The entries would go through an extra QA step, ensuring that any mistakes you might have made could be caught and fixed.
- I'm actually not sure whether a reading tool, such as Zhongwen, needs to include all of the entries you've compiled. I'm fairly sure that a subset of the most important ones would be a helpful addition, but, as you have already pointed out, a "normal" dictionary doesn't contain all those entries either. So by working together with the CC-CEDICT team you could narrow down the list to the most relevant entries, however large that subset might turn out to be. In the end it's an editorial decision.
Anyway, I respect the amount of work you've put into this. By working together with the CC-CEDICT team you could make it available to an even wider audience and it would be a win for everybody.
from zhongwen.
But they are not meant to be displayed without spaces, which is why I include spaces in the pinyin field of cedpane.txt, and _ characters in the ChinaScribe format
I overlooked the file (CedPane-ChinaScribe.txt) that has word boundaries delimited by the underscore character _; I used cedpane.txt initially. I use the ChinaScribe file now, and I have added CedPane's phrases back to the Chinese Words Separator extension. Besides CedPane's names and compound words, the commonly occurring phrases now also jump out of the screen, at least with the use of an extension.
Is there a version or fork of CC-CEDICT that is in ChinaScribe format? It's neat when pinyin has word boundaries, such as underscores, and not just 'syllable' boundaries. Indeed, the idiom "there's no place like home" is rendered with no spaces, as it is treated as one word because the CC-CEDICT source dictionary has no word boundaries :)
from zhongwen.
Thanks @cschiller. I believe I was ostracized by the CEDICT team after an email misunderstanding 4 years ago, and I wouldn't want to annoy them by trying again now.
The CEDICT license doesn't let developers mix CEDICT data with other data, unless that other data also has a CC license. I was in the awkward position of having been given special permission to use certain proprietary data in a zero-cost zero-profit Android app, but I didn't have permission to CC-license that data, therefore I could not mix in CEDICT (unless CEDICT gave me an exception to the "must CC it" rule, which they didn't). I did try Adsotrans data for a while, since Adso's license did allow mixing without a CC requirement. But I found issues with the quality of Adso's data, so ended up going my own way instead.
Pleco has an innovative way of keeping dictionaries separate while still letting you use several, so Pleco is able to use both CC-CEDICT and proprietary dictionaries if you want. But not all apps can be written like Pleco (and I could not figure out how to make my Annotator Generator work anything like Pleco) so I couldn't just do it that way. I felt public-domain data would be least likely to cause problems for developers.
Sure I'm happy for CEDICT to use the data as long as they don't try to stop me from keeping the public-domain version available as well. They would probably want to review everything before inclusion, which could end up being a lot of work. In the short term it may make more sense to have CedPane as a separate source, and perhaps label your entries so everyone knows which of them have been edited by CEDICT versus which of them have only been edited by me. I suppose it's not impossible CEDICT could decide my editing is good enough to import without further review, but that is not my call to make! At the very least, I'd want to draw their attention to:
- any entry that has a ? in it (usually means I wasn't sure)
- the fact that I've written unproven or disproven against some alternative-medicine words
- the definitions mentioning Mormon might not be phrased properly (I did try to get it right but I'm not LDS)
- all the commercial trademarks that I list in the readme
- the entries marked vulgar (in most cases I have made the English definition milder; in one case I only put "expletive, do not use")
- the fact that Bible book names are written using Protestant wording by default and Catholic version for the Catholic wording
etc. Marking all entries as "from CedPane" until reviewed could be one way to shift any blame.
@ienablemuch the only other dictionary I know of that uses ChinaScribe format is the one bundled with ChinaScribe itself which is commercial Windows software with free trial (it sometimes works on WINE depending on the version). The License Agreement that pops up when you install it says: "Many ChinaScribe entries are derived from the following sources: CC-CEDICT Chinese-English dictionary. Available free of charged and licensed under a Creative Commons Attribution-Share Alike 3.0 License." So I suppose that means, although ChinaScribe is commercial software, its dictionary is a CC-CEDICT fork and can be used with other programs if you can get the data out of ChinaScribe. (The paid-up version has File / Export dictionary entries, but this refuses to run on the unpaid version. The internal file is typically drive_c/users/Public/Application Data/ChinaScribe/MainDict.cs1 but it's in some binary format I haven't figured out.)
(Edit: formatting)
from zhongwen.
Incidentally ChinaScribe merged in CedPane in 2017, but I haven't checked to what extent they're keeping up since then. (I do keep an entry in CedPane for CedPane itself, with the date on it—see for example Ce.html—I figured that this entry could be used to check when a project that imported CedPane last did so, assuming they kept that entry. ChinaScribe doesn't seem to have it at the moment, which might perhaps mean their last import predates when I first added it.)
from zhongwen.