
Comments (7)

ceilican avatar ceilican commented on August 16, 2024

This is a comment for GSoC students trying to solve this bug.

To solve any debugging task, I suggest that you:

  1. find some webpages where this issue can be reproduced, and run MindTheWord on these pages to test it.

  2. debug the extension by using Google Chrome's JavaScript Console. (You can use "console.log(...)", "console.debug(...)" and "console.error(...)" to log messages while MindTheWord is executed, and you can try to find the bug by analyzing the messages that you leave for yourself.)

For this particular issue, the bug could be in the "invertMap" function, which is responsible for wrapping a span with the "mtwTranslatedWord" CSS style around translated words. It seems that, for unknown reasons, not all words are being wrapped. Or maybe the bug is in the "processTranslations" function, where the "invertMap" function is called.

from mindtheword.

ceilican avatar ceilican commented on August 16, 2024

Here is another example screenshot provided by another user:

bug_mtw1


ceilican avatar ceilican commented on August 16, 2024

Strangely, both examples above happen when a definite article (i.e. "a" or "los") is translated just before the phrase that is translated but not highlighted. In the second example, this happens several times, almost always after "los" (e.g. "los cerebros", "los investigadores", "los animales", "los fallecidos", "los atletas").


ceilican avatar ceilican commented on August 16, 2024

It would be useful to know if the users were using the advanced feature that allows the translation of sequences of words. If they were, a possible explanation could be the following:

  1. MTW translates {"the" --> "los", "the animals" --> "los animales"}.
  2. Then it searches for and back-translates/highlights the translated words using the inverted translation map {"los" --> "the", "los animales" --> "the animals"}. Consequently, the "los" in "los animales" will be highlighted (and wrapped in a span), and the whole sequence "los animales" will not be found anymore.
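
A minimal sketch of this hypothesis, not MindTheWord's actual code. The function names `invertMapSketch` and `highlightInOrder`, the `class` attribute on the span, and the plain substring matching are all assumptions made for illustration; the real `invertMap` may work differently.

```javascript
// Hypothetical stand-in for invertMap: swap keys and values.
function invertMapSketch(map) {
  const inverted = {};
  for (const [source, target] of Object.entries(map)) {
    inverted[target] = source;
  }
  return inverted;
}

// Hypothetical highlighter: wraps each key one at a time, in map order.
// Uses plain substring matching, which is an assumption of this sketch.
function highlightInOrder(text, invertedMap) {
  let result = text;
  for (const translated of Object.keys(invertedMap)) {
    result = result
      .split(translated)
      .join('<span class="mtwTranslatedWord">' + translated + '</span>');
  }
  return result;
}

const translationMap = { "the": "los", "the animals": "los animales" };
const inverted = invertMapSketch(translationMap);

// "los" is wrapped first, so "los animales" no longer appears as a
// contiguous string and the second key never matches.
const highlighted = highlightInOrder("los animales", inverted);
console.log(highlighted);
console.log(highlighted.includes("los animales")); // false
```

If the hypothesis holds, only the article ends up inside a span, which matches what the screenshots show.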


alexklibisz avatar alexklibisz commented on August 16, 2024

Hi, I'm a GSOC applicant. I believe I've found the issue and will be submitting a PR to resolve it.

1. Environment

I forked and cloned the repo and installed it in developer mode in Chrome. I added a Yandex API key and made no further modifications. I set English to Spanish translation at 75%.

2. Reproducing the bug:

I went to the same page as the screenshot above: https://www.technologyreview.com/s/600691/new-collar-promises-to-keep-athletes-brains-from-sloshing-during-impact/. I noticed a recurring issue at the excerpt:

both of which tolerate repetitive, high-impact blows to the head

It was getting translated to:

tanto of que tolerate repetitive, de alto impacto blows to the la cabeza

Here, Yandex translates high-impact to de alto impacto. Not a perfect translation, but that's not the issue at hand here.

But in de alto impacto, only de is highlighted:
image

3. Investigating

I'm new to the codebase so I started tracing through the functions replaceAll, invertMap, processTranslations, and deepHTMLReplacement. I found the following was happening for the paragraph in question.

At line 38 in MindTheWord.js, replaceAll is called with the paragraph text (with English->Spanish replacements already made) and the inverted translation map iTMap. iTMap contains keys for both de and de alto impacto.

Because of the way rExp is constructed, the regular expression for de will come before the regular expression for de alto impacto.

This means the replacement at line 73 in MindTheWord.js occurs first for de and then for de alto impacto. When the replacement for de is made, de alto impacto turns into <span ... >de</span> alto impacto. So when the replacement for de alto impacto runs afterwards, it no longer finds a match.

In more general terms: say you have a string aaabbbccc and you need to match/replace both bbb and abbbc. If you do the replacement for bbb first, then you can no longer match abbbc.

4. Solution

It's arguable that the replaceAll function can be improved to handle this scenario in a different way. However, I found that a straightforward solution to this problem is to sort the source words in descending order by length in the replaceAll function before the rExp string is constructed. This will ensure that de alto impacto is replaced before de is replaced.
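
A rough sketch of the idea, not the code from the PR. The name `replaceAllSketch`, the single alternation regex, the `class` attribute on the span, and the map values are assumptions for illustration; how the real `rExp` is built may differ, and word-boundary handling is left out here.

```javascript
// Escape characters that have special meaning inside a regular expression.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Hypothetical replaceAll: sort keys by length (descending) before
// building the regular expression, so longer phrases are tried first.
function replaceAllSketch(text, invertedMap) {
  const keys = Object.keys(invertedMap)
    .sort((a, b) => b.length - a.length)
    .map(escapeRegExp);
  if (keys.length === 0) {
    return text;
  }
  const rExp = new RegExp(keys.join("|"), "g");
  return text.replace(
    rExp,
    (match) => '<span class="mtwTranslatedWord">' + match + '</span>'
  );
}

// Illustrative map values; the real source words are not shown in this thread.
const iTMap = { "de": "of", "de alto impacto": "high-impact" };

const fixed = replaceAllSketch("repetitive, de alto impacto blows", iTMap);
console.log(fixed);
```

With the longer key first in the alternation, the whole phrase de alto impacto is wrapped as one unit, while a standalone de elsewhere in the text would still be wrapped on its own.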

The PR I am submitting does exactly that, and at least for this case, the resulting output is correct:
image
and it toggles the phrase correctly:
image

5. Other thoughts

  • There are presumably other problems with replaceAll that will need to be resolved (like the whitespace hack), and this fix does not take those into account.
  • The extra sort is of course computationally more expensive, but I would expect it to be noticeable only when the number of words becomes extremely large.
  • The unit tests pass for my PR, but I assume unit tests will have to be written for this specific case.

TLDR: sort the keys from translationMap in descending order by their length in the replaceAll function to avoid issues with substrings of longer strings preventing the longer strings from being formatted.


ceilican avatar ceilican commented on August 16, 2024

Excellent explanation and elegant solution!

Do you have any idea how to improve the whitespace hack?


alexklibisz avatar alexklibisz commented on August 16, 2024

@ceilican Thanks! I have some thoughts but for now I am writing the GSOC application so I will look after I send you my first draft.

