Encoding of arbtt-stats (arbtt, closed)

nomeata commented on May 5, 2024
Encoding of arbtt-stats

from arbtt.

Comments (15)

nomeata commented on May 5, 2024

Original comment by nomeata (Bitbucket: nomeata, GitHub: nomeata).


arbtt should handle Unicode properly and output it in whatever locale your system is running. On Linux, I’d say “make sure that LANG is set to a UTF-8 locale”...

Do you get the weirdness only when piping the output to a file or program, or also when you run it as it is?

nomeata commented on May 5, 2024

Original comment by amenthes (Bitbucket: amenthes, GitHub: amenthes).


It would appear that the output is ISO-8859-1 on my Windows machine when piped to another command or file. Currently I have to detect the encoding at runtime and convert to UTF-8.

I guess I now have two conversions: one by arbtt-stats (internal to ISO-8859-1) and one by my script (ISO-8859-1 to UTF-8). The conversion to ISO-8859-1 will probably be lossy, since there are a bunch of characters it can't represent. I'd love to request a mode where I can force arbtt-stats to output UTF-8 regardless of locale and other environment settings.
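The "detect at runtime and convert" step described above could look roughly like the following Python sketch. The receiving script's actual language and detection logic are not shown in this thread, so the function name `to_utf8` and the choice of Windows-1252 as the fallback are assumptions made for illustration only.

```python
def to_utf8(raw: bytes, fallback: str = "cp1252") -> bytes:
    """Re-encode `raw` as UTF-8, guessing the source encoding.

    Hypothetical helper: try strict UTF-8 first, and only if that fails
    treat the bytes as the single-byte fallback code page.
    """
    try:
        # Strict decoding rejects a lone 0xFC, so legacy output is detected.
        text = raw.decode("utf-8")
    except UnicodeDecodeError:
        # Bytes undefined in the fallback code page become U+FFFD.
        text = raw.decode(fallback, errors="replace")
    return text.encode("utf-8")


if __name__ == "__main__":
    legacy = b"QNAP unterst\xfctzt Kodi"   # ü as 0xFC, as reported above
    print(to_utf8(legacy))
```

Trying UTF-8 first is a common heuristic because text in a single-byte code page that contains non-ASCII characters rarely happens to be valid UTF-8. It cannot recover characters that were already lost when the producer encoded to the legacy code page.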

nomeata commented on May 5, 2024

Original comment by nomeata (Bitbucket: nomeata, GitHub: nomeata).


If it is non-trivial to set it via environment variables, I might add a command line flag... but I’m surprised this is so hard.

Have you tried issuing chcp 65001 before running arbtt? According to http://stackoverflow.com/a/388500/946226 this should set the code page to UTF-8.

nomeata commented on May 5, 2024

Original comment by amenthes (Bitbucket: amenthes, GitHub: amenthes).


chcp does not seem to have an effect. My terminal happily tells me that I'm on that code page now, but it still outputs ü as 0xFC (ISO-8859-1 or Windows-1252; both encode ü identically).

chcp.png

produces this byte sequence:

hex.png

nomeata commented on May 5, 2024

Original comment by amenthes (Bitbucket: amenthes, GitHub: amenthes).


I am now able to convert this in the receiving script. I auto-detect the encoding and always convert to UTF-8. This way I was able to import ~10,000 window titles, ~400 of which contained German umlauts. Still, I think a UTF-8 output option would make a nice addition, especially when using arbtt-stats as a stepping stone in a custom chain of tools.

The current handling works very well in the command line. I have never had a problem with that. I do not want that to change.

nomeata commented on May 5, 2024

Original comment by nomeata (Bitbucket: nomeata, GitHub: nomeata).


Of course, the first question is: does arbtt actually save it correctly internally? It could well be that the screen capture is wrong...

On the other hand, that’s unlikely, as it would then cause mojibake when printing.

Maybe the problem disappears when I manage to make a new Windows release that is built with a new version of GHC and the base libraries.

nomeata commented on May 5, 2024

Original comment by amenthes (Bitbucket: amenthes, GitHub: amenthes).


I was using a build from the current head (7e3b5a7) and used

dist\build\arbtt-capture\arbtt-capture.exe -f unicode.stuff

to capture the window title of this website in Firefox: https://www.qnap.com/i/de/news/con_show.php?op=showone&cid=416 which reads "QNAP unterstützt Kodi – ehemals XBMC - zur Multimedia-Wiedergabe"

Both arbtt-dump and arbtt-stats (same build) have problems with this:

> dist\build\arbtt-dump\arbtt-dump.exe -f unicode.stuff
2015-10-14 19:57:44 (0ms inactive):
    ( ) [redacted for privacy reasons]
    ( ) \Device\HarddiskVolume2\Program Files (x86)\Mozilla Firefox\firefox.exe: QNAP unterstützt Kodi arbtt-dump.exe: <stdout>: commitBuffer: invalid argument (invalid character)

The output stops there. No further lines are dumped.

Please note that the title reads just fine in the terminal. When I write the same output to a file, this happens:

> dist\build\arbtt-dump\arbtt-dump.exe -f unicode.stuff > unicode.stuff.dump.txt
arbtt-dump.exe: <stdout>: commitBuffer: invalid argument (invalid character)

(same error and termination of program)

arbtt-stats-encoding.png

The "ü" is converted to 0x81, which is valid in code page 850. This is also what my terminal is set to.

If I switch my terminal with chcp 65001, the "ü" becomes 0xC3 0xBC, which is valid UTF-8. The dump then runs through as expected, so in that case everything is well.

arbtt-stats also works after issuing chcp 65001. Interestingly, it does not have the code page 850 problem: it works correctly in both cases!

So there's a small caveat: running arbtt-dump from a plain terminal does not work; one has to issue chcp 65001 first. I am not sure if this can be fixed, but I guess many non-programmer users would find this unnerving.
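The byte values mentioned in this thread (0x81, 0xFC, 0xC3 0xBC) can be cross-checked with a few lines of Python. This is only an illustration using Python's built-in codecs, not anything taken from arbtt; the helper name `encoded_hex` is made up for the example.

```python
def encoded_hex(char: str, codecs: list[str]) -> dict[str, str]:
    """Map each codec name to the hex bytes it produces for `char`."""
    return {codec: char.encode(codec).hex() for codec in codecs}


# "\u00fc" is the letter ü.
table = encoded_hex("\u00fc", ["cp850", "cp1252", "latin-1", "utf-8"])
for codec, hexbytes in table.items():
    print(f"{codec:8} {hexbytes}")
```

Code page 850 gives 81, Windows-1252 and ISO-8859-1 both give fc, and UTF-8 gives c3bc, which matches the observations reported for the different terminal settings.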

nomeata commented on May 5, 2024

Original comment by amenthes (Bitbucket: amenthes, GitHub: amenthes).


(tiny correction to the post above)

nomeata commented on May 5, 2024

Original comment by amenthes (Bitbucket: amenthes, GitHub: amenthes).


There also appears to be an issue with old files created with 0.6: the encoding in the existing legacy logfile might confuse the newer arbtt-stats. I'm investigating. But this also only happens on code page 850, so a user with that problem can work around it easily. I had no problems with a mixed legacy logfile (mixed in the sense that it was written to by both arbtt-capture 0.6 and 0.9).

nomeata commented on May 5, 2024

Original comment by nomeata (Bitbucket: nomeata, GitHub: nomeata).


Hmm. I am pretty confident that the log files are fixed to UTF-8, and have been like that since then, so I would hope that reading old and new files is not a problem.

Otherwise the behaviour is somewhat expected: The program tries to print according to the current locale (i.e. codepage), and prefers to abort rather than print invalid characters.

Is it correct that everything works fine as long as your codepage is 65001?
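The "abort rather than print invalid characters" behaviour can be illustrated with a Python analogy. This is not how GHC implements it; it only shows the same idea with Python's strict encoder. The title is the one quoted earlier in the thread (shortened), and the helper name `first_unencodable` is hypothetical.

```python
from typing import Optional

# Shortened window title from the earlier report; \u2013 is the en dash.
TITLE = "QNAP unterst\u00fctzt Kodi \u2013 ehemals XBMC"


def first_unencodable(text: str, codec: str) -> Optional[int]:
    """Return the index of the first character `codec` cannot encode, or None."""
    try:
        text.encode(codec)          # strict mode: raises instead of substituting
    except UnicodeEncodeError as err:
        return err.start
    return None


if __name__ == "__main__":
    for codec in ("cp850", "utf-8"):
        pos = first_unencodable(TITLE, codec)
        if pos is None:
            print(f"{codec}: whole title encodable")
        else:
            print(f"{codec}: stops after {TITLE[:pos]!r}")
```

Under code page 850 the ü encodes fine, but the en dash after "Kodi" has no representation, which is consistent with the dump stopping right after "QNAP unterstützt Kodi". Under UTF-8 the whole title is encodable.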

nomeata commented on May 5, 2024

Original comment by amenthes (Bitbucket: amenthes, GitHub: amenthes).


Yes, in CP65001, everything is fine.

nomeata commented on May 5, 2024

Original comment by nomeata (Bitbucket: nomeata, GitHub: nomeata).


Ok. I’m inclined to close this, with the argument that if you want to use Unicode, you need to use a Unicode-aware code page. Do you agree?

nomeata commented on May 5, 2024

Original comment by amenthes (Bitbucket: amenthes, GitHub: amenthes).


I'm fine with that, but I'd top it off with a note in the Windows section of the readme. Once I understand how packaging an installer works, I might be able to contribute one. But I can't promise when I'll get around to doing that.

nomeata commented on May 5, 2024

Original comment by nomeata (Bitbucket: nomeata, GitHub: nomeata).


Mention codepage in the Windows readme.

Suggestions to improve this notice and make it easier to follow for “normal”
users are welcome. This fixes #32.

nomeata commented on May 5, 2024

Original comment by nomeata (Bitbucket: nomeata, GitHub: nomeata).


Heh, when trying to run the test suite under Wine I am now stuck with the same problem, and here I don’t even have chcp available. I hope someone can help me at http://stackoverflow.com/questions/33156758/get-haskell-programs-to-assume-a-utf8-locale-under-wine.
