
Comments (4)

ankit-m commented on July 17, 2024

@ceilican I was able to reproduce this error as follows.
Add a debug line to print the filtered words in filterSourceWords:

...
      !userBlacklistedWords.test(word.toLowerCase()); // no blacklisted words
  }));
  console.debug('Filtered List:', countedWordsList);
  var targetLength = Math.floor((length(countedWords) * translationProbability) / 100);
...

Add a debug line to print the counted words in the main function:

...
  console.log('starting translation');
  var countedWords = getAllWords(ngramMin, ngramMax);
  console.debug('countedWords', countedWords);
...

Run the extension on http://worldcrm.org/About/index/id/3

Console debug output:
[screenshot: chinese_error]

As you can see, the counted words object has elements, but the filtered list has zero elements.

Error

The filterSourceWords function does not account for non-ASCII characters. As a result, it skips all Chinese words, and it will likewise skip words in any other language that uses non-ASCII characters. This is why we are seeing issue #24.

var countedWordsList = shuffle(toList(countedWords, function(word, count) {
    return !!word && word.length >= minimumSourceWordLength && // no words that are too short
      word !== '' && !/\d/.test(word) && // no empty words
      word.charAt(0) != word.charAt(0).toUpperCase() && // no proper nouns
      !userBlacklistedWords.test(word.toLowerCase()); // no blacklisted words
  }));

This callback function returns false for all the Chinese words because they are non-ASCII.
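One check in the callback that certainly rejects Chinese words is the "no proper nouns" comparison: characters in scripts without letter case are unchanged by toUpperCase(), so the first character always equals its uppercase form. The sketch below isolates that comparison. The function names are hypothetical and are not part of the extension; the last helper is only one possible way to relax the check.

```javascript
// Hypothetical stand-in for the "no proper nouns" check in filterSourceWords.
function startsLowercase(word) {
  return word.charAt(0) != word.charAt(0).toUpperCase();
}

// A caseless character equals both its uppercase and its lowercase form.
function startsWithCaselessChar(word) {
  var first = word.charAt(0);
  return first === first.toUpperCase() && first === first.toLowerCase();
}

// Hypothetical relaxed check: reject only words that start with a real capital letter.
function isNotCapitalized(word) {
  return startsLowercase(word) || startsWithCaselessChar(word);
}

console.log(startsLowercase('how'));      // true
console.log(startsLowercase('Paris'));    // false
console.log(startsLowercase('董事會'));    // false, toUpperCase() leaves 董 unchanged
console.log(isNotCapitalized('董事會'));   // true
console.log(isNotCapitalized('Paris'));   // false
```

Digits are caseless too, so a relaxed check like this still relies on the separate /\d/ test in the callback.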

Solution

Add a check for non-ASCII characters:

...
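      !userBlacklistedWords.test(word.toLowerCase()) || // no blacklisted words
      /[^\x00-\x7F]+/.test(word)); // accept any word containing non-ASCII characters
      // note: && binds tighter than ||, so such words bypass all the checks above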
...

[screenshot: working_chinese]

This did get me all the translations. But for some reason not all of the words are getting replaced; only Pan is. It will require a little more work. I cannot work on this issue today, but I will submit a PR by tomorrow.

from mindtheword.

ceilican commented on July 17, 2024

That's great, @ankit-m!

ankit-m commented on July 17, 2024

@ceilican the problem is more complicated than I thought. The reason not all words are getting replaced is spaces. Look at the following lines of code in replaceAll():

  sortedSourceWords.forEach(function concatRExp(sourceWord) {
    rExp += '(\\s' + escapeRegExp(sourceWord) + '\\s)|';
  });

The variable rExp is built to match patterns of the form space + word + space, i.e. it translates word by word.

Herein lies the problem. Languages like Chinese do not always separate words with spaces. For example, 董事會 means The Board of Trustees. It has no spaces anywhere, neither at the beginning nor at the end.

So, as a solution, I tried changing rExp to

rExp += '(\\s*' + escapeRegExp(sourceWord) + '\\s*)|';

i.e. it allows zero or more spaces on both sides.

With a few other changes, this seemed to solve the problem. I got Chinese translations working. But I soon realized that this will break English translations, as it will no longer translate word by word but will instead match every substring.

For example, if the word how is in the regular expression, it will also match inside the word somehow and turn it into some <space> <translation of how>. I am now thinking about how to solve this problem. We do need a different rExp for these types of languages.
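The trade-off described above can be reproduced in isolation. This is a minimal sketch, not the extension's code: buildPattern is a hypothetical helper that joins the alternatives the way the forEach in replaceAll() does, and escapeRegExp here is the common escaping idiom, which may differ from the project's own helper.

```javascript
// Common escaping idiom; assumed to behave like the project's escapeRegExp.
function escapeRegExp(s) {
  return s.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

// Hypothetical helper: wrap every source word in the given spacing pattern.
function buildPattern(words, spacing) {
  var parts = words.map(function (w) {
    return '(' + spacing + escapeRegExp(w) + spacing + ')';
  });
  return new RegExp(parts.join('|'), 'g');
}

var strict = buildPattern(['how', '董事會'], '\\s');  // space + word + space
var loose = buildPattern(['how', '董事會'], '\\s*');  // zero or more spaces

var english = 'I somehow know how it works';
var chinese = '關於董事會成員';

console.log(english.match(strict)); // [ ' how ' ]
console.log(chinese.match(strict)); // null, there are no spaces around 董事會
console.log(english.match(loose));  // [ 'how ', ' how ' ], the first hit is inside "somehow"
console.log(chinese.match(loose));  // [ '董事會' ]
```

The strict pattern misses the Chinese word entirely, while the loose pattern finds it but also matches inside somehow.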

ceilican commented on July 17, 2024

That is a very interesting observation, @ankit-m.

The spaces that I added in rExp a long time ago were a hack to deal with issues like the one you described. But I have never been fully satisfied with this "hacky" solution. Maybe the ideal solution would not be to have different rExp for different languages, but to make rExp without spaces work for English too.
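One possible way to get a single rExp without spaces is to add a \b word boundary only on the sides where a source word starts or ends with an ASCII word character. This is a sketch under that assumption, not a tested patch for the extension: boundaryAware and buildPattern are hypothetical names, and escapeRegExp is the common escaping idiom rather than the project's helper.

```javascript
// Common escaping idiom; assumed to behave like the project's escapeRegExp.
function escapeRegExp(s) {
  return s.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

// Hypothetical helper: add \b only on the sides where the word itself
// begins or ends with an ASCII word character ([A-Za-z0-9_]).
function boundaryAware(word) {
  var lead = /^\w/.test(word) ? '\\b' : '';
  var trail = /\w$/.test(word) ? '\\b' : '';
  return lead + escapeRegExp(word) + trail;
}

// Hypothetical helper: one alternation for all source words.
function buildPattern(words) {
  return new RegExp(words.map(boundaryAware).join('|'), 'g');
}

var re = buildPattern(['how', '董事會']);

console.log('I somehow know how it works'.replace(re, '[$&]'));
// I somehow know [how] it works
console.log('關於董事會成員'.replace(re, '[$&]'));
// 關於[董事會]成員
```

In JavaScript, \b and \w only recognise ASCII word characters, so a word ending in an accented letter gets no boundary on that side and can still match inside a longer word. Treat this as a starting point rather than a complete fix.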
