Escaped strings in the translated text are not decoded back to their original un escap

<a class="user-mention notranslate" data-hovercard-type="user" data-hovercard-url="/us

Escaped strings are not correctly handled about gettext.net HOT 7 CLOSED

andy-noisyduck commented on July 17, 2024

Escaped strings are not correctly handled

from gettext.net.

Comments (7)

perpetualKid commented on July 17, 2024

@andy-noisyduck
I'm bit confused when you say

"It also looks like strings are not escaped when generating POT files from the extractors"

and

"I am not using the extractors"

i.e. referenced change was only targeting the extractor, not anything in catalog/GetString*.

do you have a sample string which you are trying to extract and/or translate?
Thanks!

from gettext.net.

andy-noisyduck commented on July 17, 2024

I think there might be 2 closely related issues here. Maybe I should have made 2 tickets to be clearer.

First... translations with escaped strings in .mo files are not being unescaped when parsed. I.e. if you have an escaped single quote \' it will be returned including the backslash, not just the single quote. If you have an escaped newline \n in your translation string it won't be converted to an actual new-line etc.

The GetText spec indicates that any C style escape string is valid in a translation string within a .po file. My understanding was that strings in .mo files follow the same format and should thus be un-escaped either when parsing or when returned through GetString - although I concede there is scope for ambiguity here, as the .mo file format spec just refers to them as "translation strings" and doesn't explicitly document their format.

The existing tooling we use escapes new-line in .mo files, so I'm leaning towards unescaping them being the correct behaviour.

Secondly, a similar issue but on the input rather than output side. It looks like the extractors only escape double quotes now, and don't correctly handle new-lines, tabs etc. I wanted to caveat that we don't really use the extractors, and was something I only noticed when going through the source. It's possible I missed some handling of escape stings elsewhere.

from gettext.net.

perpetualKid commented on July 17, 2024

as for the second issue, the extractor escapes non-printable characters here:

GetText.NET/src/GetText.Extractor/Engine/ParserBase.cs

Lines 148 to 151 in 146090a

 protected static string ToLiteral(string valueTextForCompiler) 

 { 

 return SymbolDisplay.FormatLiteral(valueTextForCompiler, false); 

 }

however that doesn't work for double quotes, hence the fix in 85eb836. The removed code was stale copy&paste leftover from an xgettext implementation.

This should work correctly, but if you have some sample where this fails, please provide!

For the first issues, looking into it. What tooling are you using? The scenario is primarily tested with the extractor and PoEdit, from other issues I've seen that some of the older GNU commandline utilities are more peculiar and not as relaxed as Windows-based tools. As far reasonable I'm open to adapt/update.

from gettext.net.

andy-noisyduck commented on July 17, 2024

For the first issues, looking into it. What tooling are you using? The scenario is primarily tested with the extractor and PoEdit, from other issues I've seen that some of the older GNU commandline utilities are more peculiar and not as relaxed as Windows-based tools. As far reasonable I'm open to adapt/update.

We manage translations through Phrase.com, and we use it to generate our .mo files.

It's unclear where the problem is, whether its the generating of the .mo file or the parsing of them. Should the sequence \n in an .mo file be a newline or the literal \n? The somewhat ambiguous GetText spec doesn't really help here.

We can work around it in most cases by not having escape strings in the source content (.pot files), but this doesn't work for newlines or double quotes which must always be escaped. Right now we are unescaping them manually after we call GetString but that feels a bit clunky.

from gettext.net.

perpetualKid commented on July 17, 2024

this really gets tricky, and the GetText spec isn't of particular help either, but it seems that Phrase is handling escaped strings differently at least.

assuming a string like this

string test = catalog.GetString("This is a long string with \nline break");

extracting this, ie using xgettext, PoEdit, or GetText.Extractor, will become like this in the pot template file

#: Form1.cs:34
#, csharp-format
msgid ""
"This is a long string with \n"
"line break"
msgstr ""

it's mentioned as convention but not required in the GetText documentation that line breaks in the strings, will also induce a line break in the msgid format.

#: Form1.cs:34
#, csharp-format
msgid ""
"This is a long string with \nline break"
msgstr ""

in PoEdit, looks like this

So I've created an test account at Phrase.com, and uploaded the generated pot file there - and oops - the line break is gone?!

Now the even stranger part. While a generated po-translation file from phrase looks ok (except it does not break the strings itself, but as mentioned above that's optional in the documentation, it does escape the strings when compiling to mo:

vs. (using PoEdit)

with that escaping, the translation will not even be found, because the MessageId (original string to look for) was "This is a long string with \nline break" but in the compiled catalog only find "This is a long string with \nline break".

While I originally would have no opinion which behavior is correct, I think the phrase.com output is just wrong by escaping the strings both for the message id (as it breaks the whole lookup) as well the translation as it breaks formatting behavior.

If you have a business relationship with phrase, I'd ask you to open an support issue with them, happy to join the conversation if needed.
Otherwise I see limited option here to update the code, if the same input produces different output from different translation engines, how should GetText.Net else handle this?

from gettext.net.

andy-noisyduck commented on July 17, 2024

Thanks for investigating this.

The missing linebreak in Phrase is actually correct. The list on the left is a preview list of all your translated strings, similar to how Poedit displays the text without breaks in the list at the top. It's probably just not obvious with only one string. Phrase then shows the full string (including linebreaks) in the middle section when translating. I'm not sure why your example doesn't have the text here - it might be that Phrase's default mode of operation is to use keys for strings and not the source text (as Phrase supports a bunch of other non-GetText file formats where this makes sense), and the option for this wasn't set when parsing the po/pot file.

That said, Phrase is clearly mishandling .mo file output.

I probably should have done what you did, and compared the output from Phrase with the GetText command line tools. Whilst the GetText spec is a bit vague on .mo file format, using the msgfmt tool seems a reasonable choice here as an authoritative answer. When writing the binary .mo file format msgfmt takes any escaped strings in the .po source and converts them to their binary equivalent (i..e the 2 character literal backslash + n \n becomes just the newline character). I'm going to assume that's the correct behaviour, and raise this as a bug with Phrase.

That said, what is also strange is that msgfmt doesn't correctly handle the escape sequence for a single quote. A single quote ' is a valid escape sequence when used as \', and although the GetText docs explicitly mentioning this as valid, xgettext will unescape them when parsing. If the 2 char sequence \' is in the text, msgfmt will actually return invalid control sequence when reading the po file. When using the 4-char hex escape sequence for the the quote \x27 it parses it correctly. That seems like a bug in the GetText tooling to me.

This issue can probably be closed?

from gettext.net.

perpetualKid commented on July 17, 2024

thanks for confirming my research!
I'll do some more testing around the \' sequence, but if this is purely affecting msgfmt I assume (or hope :) ) GetText.NET is handling it correctly.

In the meanwhile closing this issue.

from gettext.net.

Escaped strings are not correctly handled about gettext.net HOT 7 CLOSED

Comments (7)

Related Issues (20)

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent

Jobs

	protected static string ToLiteral(string valueTextForCompiler)
	{
	return SymbolDisplay.FormatLiteral(valueTextForCompiler, false);
	}