Comments (4)
@ceilican I was able to reproduce this error in the following way.
Add a debug line to print filtered words in filterSourceWords
...
!userBlacklistedWords.test(word.toLowerCase()); // no blacklisted words
}));
console.debug('Filtered List:', countedWordsList);
var targetLength = Math.floor((length(countedWords) * translationProbability) / 100);
...
Add a debug line to print counted words in main function
...
console.log('starting translation');
var countedWords = getAllWords(ngramMin, ngramMax);
console.debug('countedWords', countedWords);
...
Run the extension on http://worldcrm.org/About/index/id/3
As you can see, the counted words object has elements, but the filtered list has zero elements.
Error
The filterSourceWords function does not account for non-ASCII characters. As a result it is skipping all Chinese words. It will likely also skip other languages written in non-ASCII scripts that have no letter case. This is why we are having issue #24.
var countedWordsList = shuffle(toList(countedWords, function(word, count) {
return !!word && word.length >= minimumSourceWordLength && // no words that are too short
word !== '' && !/\d/.test(word) && // no empty words
word.charAt(0) != word.charAt(0).toUpperCase() && // no proper nouns
!userBlacklistedWords.test(word.toLowerCase()); // no blacklisted words
}));
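This callback function returns false for all the Chinese words because they are non-ASCII: at least the proper-noun check fails, since a character without letter case is equal to its own upper-case form.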
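The proper-noun check can be exercised in isolation. This is a minimal sketch, not the extension's code: the helper name looksLikeCommonNoun is hypothetical and only wraps the comparison from the callback above, and 董事會 is used as a sample Chinese word.

```javascript
// Hypothetical helper wrapping the proper-noun check from the callback above.
function looksLikeCommonNoun(word) {
  return word.charAt(0) != word.charAt(0).toUpperCase();
}

console.log(looksLikeCommonNoun('house'));  // true: 'h' differs from 'H'
console.log(looksLikeCommonNoun('London')); // false: starts with an upper-case letter
console.log(looksLikeCommonNoun('董事會'));  // false: the character has no upper-case form
```

Because the checks are joined with &&, a single false here is enough to drop the word from the filtered list.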
Solution
Add a check for non-ASCII characters.
...
!userBlacklistedWords.test(word.toLowerCase()) || // no blacklisted words
/[^\x00-\x7F]+/.test(word));
...
This did get me all the translations. But for some reason not all words are getting replaced. Only Pan is getting replaced. It will require a little more work. I cannot work on this issue today, but I will submit a PR by tomorrow.
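For reference, here is a sketch of how the proposed condition groups, since && binds more tightly than ||. The values of minimumSourceWordLength and userBlacklistedWords below are assumed stand-ins, and keepWord is a hypothetical name; the extension defines its own values elsewhere.

```javascript
// Assumed stand-ins for values that the extension defines elsewhere.
const minimumSourceWordLength = 3;
const userBlacklistedWords = /^(this|that)$/;

// The proposed condition with its implicit grouping made explicit.
function keepWord(word) {
  const passesOriginalChecks =
    !!word && word.length >= minimumSourceWordLength && // no words that are too short
    word !== '' && !/\d/.test(word) &&                  // no empty words, no digits
    word.charAt(0) != word.charAt(0).toUpperCase() &&   // no proper nouns
    !userBlacklistedWords.test(word.toLowerCase());     // no blacklisted words
  const hasNonAscii = /[^\x00-\x7F]+/.test(word);
  return passesOriginalChecks || hasNonAscii;
}

console.log(keepWord('house'));  // true
console.log(keepWord('London')); // false
console.log(keepWord('董事會'));  // true
console.log(keepWord('會'));      // true: the length check is bypassed as well
```

With this grouping, any word containing a non-ASCII character also skips the length, digit, and blacklist checks. Whether that is desirable is a separate decision.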
from mindtheword.
That's great, @ankit-m !
@ceilican the problem is more complicated than I thought. The reason why not all words are getting replaced is the spaces. Look at the following lines of code in replaceAll()
sortedSourceWords.forEach(function concatRExp(sourceWord) {
rExp += '(\\s' + escapeRegExp(sourceWord) + '\\s)|';
});
The variable rExp is set to find all patterns like space + word + space, i.e. it translates word by word.
Herein lies the problem. Languages like Chinese do not always have space-separated words. For example, 董事會 means The Board of Trustees. It has no spaces anywhere, neither at the beginning nor at the end.
So as a solution I tried to change the rExp to
rExp += '(\\s*' + escapeRegExp(sourceWord) + '\\s*)|';
i.e. it should have zero or more spaces on both sides.
With a few other changes, this seemed to solve the problem. I got Chinese translations working. But soon I realized that this will break English translations, as it will no longer translate word by word but will instead match substrings.
For example, if the word how was in the regular expression, it would also translate the word somehow as some <space> <translation of how>. I am now thinking about how to solve this problem. We do need a different rExp for these types of languages.
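The difference between the two patterns can be seen in a small standalone sketch. The escapeRegExp below is a common escaping helper and may differ from the extension's own; buildPattern is a hypothetical name used only for this illustration.

```javascript
// A common escaping helper; the extension's own escapeRegExp may differ.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

// Hypothetical helper: builds a pattern for one source word with the
// given padding on both sides ('\\s' or '\\s*').
function buildPattern(sourceWord, padding) {
  return new RegExp('(' + padding + escapeRegExp(sourceWord) + padding + ')', 'g');
}

const strict = buildPattern('how', '\\s');
const loose = buildPattern('how', '\\s*');

console.log('see how it works'.match(strict)); // [' how ']
console.log('somehow it works'.match(strict)); // null
console.log('somehow it works'.match(loose));  // ['how '], a substring match

console.log('董事會成員'.match(buildPattern('董事會', '\\s')));  // null, there are no spaces
console.log('董事會成員'.match(buildPattern('董事會', '\\s*'))); // ['董事會']
```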
That is a very interesting observation, @ankit-m .
The spaces that I added in rExp a long time ago were a hack to deal with issues like the one you described. But I have never been fully satisfied with this "hacky" solution. Maybe the ideal solution would not be to have different rExp for different languages, but to make rExp without spaces work for English too.
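One possible direction, offered only as a sketch and not as what the extension does: drop the spaces and add a \b word boundary only on the sides where the source word starts or ends with an ASCII word character. The helper name boundaryPattern is hypothetical, and escapeRegExp is a common escaping helper that may differ from the extension's own.

```javascript
// A common escaping helper; the extension's own escapeRegExp may differ.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

// Hypothetical helper: add \b only on the sides where the word starts or
// ends with an ASCII word character, since \b never matches between two
// characters outside [A-Za-z0-9_], such as Chinese ones.
function boundaryPattern(sourceWord) {
  const left = /^\w/.test(sourceWord) ? '\\b' : '';
  const right = /\w$/.test(sourceWord) ? '\\b' : '';
  return left + escapeRegExp(sourceWord) + right;
}

const rExp = new RegExp(['how', '董事會'].map(boundaryPattern).join('|'), 'g');

console.log('how, somehow, how'.match(rExp)); // ['how', 'how']
console.log('關於董事會成員'.match(rExp));      // ['董事會']
```

A known limitation of this sketch is that \b is ASCII-based, so a word that begins or ends with an accented Latin letter gets no boundary on that side and could still match inside a longer word.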