Comments (24)

nad commented on May 21, 2024

> ghc-7.6.1 on MacOSX 10.7.5, with bash.
> ghc-7.8.3 on Windows 7 Professional SP1, with bash.

I just discussed this issue with a Mac user, and it seems as if the System.IO functions by default always use UTF-8 under MacOS, while the locale is ignored.

Under Windows I guess that one can use chcp to trigger the problem. Perhaps chcp 1252 would work.

GHC has used UTF-8 as the character encoding for source files since version 6.6 (which was released in 2006), so perhaps cpphs could also use this as the default. Note, however, that the GHC documentation states that "invalid UTF-8 sequences [are] ignored in comments, so it is possible to use other encodings such as Latin-1, as long as the non-comment source code is ASCII only".

I've attached a patch that switches to UTF-8 everywhere (?) in cpphs, with two caveats:

  • The command-line arguments are treated as before.
  • The encoding of stderr is only changed in the top-level module. If cpphs is intended to be used as a library, and error messages can contain non-ASCII characters, then the encoding of stderr should perhaps be changed in the applicable library modules.

I've used the base library's support for roundtripping to handle illegal characters. Feel free to base any changes on this patch.
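
As a rough illustration of the roundtripping idea (a sketch only, not the attached patch): base's mkTextEncoding accepts a //ROUNDTRIP suffix, under which bytes that cannot be decoded are mapped to lone surrogate code points on input and turned back into the original bytes on output. The names utf8Roundtrip, setUtf8Roundtrip and copyText are made up for this example.

```haskell
import System.IO

-- | UTF-8 with the //ROUNDTRIP suffix: undecodable bytes survive a
-- decode/encode cycle instead of raising an error.
utf8Roundtrip :: IO TextEncoding
utf8Roundtrip = mkTextEncoding "UTF-8//ROUNDTRIP"

-- | Switch a handle (a file, stdout, stderr, ...) to that encoding.
-- Hypothetical helper name.
setUtf8Roundtrip :: Handle -> IO ()
setUtf8Roundtrip h = utf8Roundtrip >>= hSetEncoding h

-- | Copy a file through ordinary String I/O. With the roundtripping
-- encoding on both handles, invalid UTF-8 in the input is preserved.
copyText :: FilePath -> FilePath -> IO ()
copyText src dst =
  withFile src ReadMode $ \hin ->
    withFile dst WriteMode $ \hout -> do
      setUtf8Roundtrip hin
      setUtf8Roundtrip hout
      -- hPutStr consumes the whole lazy string before the handles close.
      hGetContents hin >>= hPutStr hout
```

The same setUtf8Roundtrip call could be applied to stderr in a top-level module, for example `setUtf8Roundtrip stderr`, which is the kind of change the second caveat above refers to.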

from cpphs.

malcolmwallace commented on May 21, 2024

Thanks for the patch, Nils. I rolled something slightly different, to ensure that e.g. #included files also get the UTF-8 encoding. I was not previously aware of the roundtripping style of TextEncoding, so that was a useful addition for me.

malcolmwallace commented on May 21, 2024

cpphs-1.20.2 released.

malcolmwallace commented on May 21, 2024

I can't seem to reproduce the issue with the given steps. cpphs uses the standard Haskell/ghc System.IO.openFile, which I think trusts the underlying filesystem's metadata about the file's encoding? Certainly, setting LC_CTYPE does not seem to change its behaviour.

$ LC_CTYPE=C ./cpphs Test.hs 
#line 1 "Test.hs"
module Main where

main = putStrLn ""

asr commented on May 21, 2024

Using the file command for determining the file type I got

$ file Test.hs
Test.hs: UTF-8 Unicode text

What do you get?

malcolmwallace commented on May 21, 2024

The same.

asr commented on May 21, 2024

It seems you do not have the C locale installed. What is the output of running

$ locale -a

?

malcolmwallace commented on May 21, 2024

$ locale -a
af_ZA
af_ZA.ISO8859-1
af_ZA.ISO8859-15
af_ZA.UTF-8
am_ET
am_ET.UTF-8
be_BY
be_BY.CP1131
be_BY.CP1251
be_BY.ISO8859-5
be_BY.UTF-8
bg_BG
bg_BG.CP1251
bg_BG.UTF-8
ca_ES
ca_ES.ISO8859-1
ca_ES.ISO8859-15
ca_ES.UTF-8
cs_CZ
cs_CZ.ISO8859-2
cs_CZ.UTF-8
da_DK
da_DK.ISO8859-1
da_DK.ISO8859-15
da_DK.UTF-8
de_AT
de_AT.ISO8859-1
de_AT.ISO8859-15
de_AT.UTF-8
de_CH
de_CH.ISO8859-1
de_CH.ISO8859-15
de_CH.UTF-8
de_DE
de_DE.ISO8859-1
de_DE.ISO8859-15
de_DE.UTF-8
el_GR
el_GR.ISO8859-7
el_GR.UTF-8
en_AU
en_AU.ISO8859-1
en_AU.ISO8859-15
en_AU.US-ASCII
en_AU.UTF-8
en_CA
en_CA.ISO8859-1
en_CA.ISO8859-15
en_CA.US-ASCII
en_CA.UTF-8
en_GB
en_GB.ISO8859-1
en_GB.ISO8859-15
en_GB.US-ASCII
en_GB.UTF-8
en_IE
en_IE.UTF-8
en_NZ
en_NZ.ISO8859-1
en_NZ.ISO8859-15
en_NZ.US-ASCII
en_NZ.UTF-8
en_US
en_US.ISO8859-1
en_US.ISO8859-15
en_US.US-ASCII
en_US.UTF-8
es_ES
es_ES.ISO8859-1
es_ES.ISO8859-15
es_ES.UTF-8
et_EE
et_EE.ISO8859-15
et_EE.UTF-8
eu_ES
eu_ES.ISO8859-1
eu_ES.ISO8859-15
eu_ES.UTF-8
fi_FI
fi_FI.ISO8859-1
fi_FI.ISO8859-15
fi_FI.UTF-8
fr_BE
fr_BE.ISO8859-1
fr_BE.ISO8859-15
fr_BE.UTF-8
fr_CA
fr_CA.ISO8859-1
fr_CA.ISO8859-15
fr_CA.UTF-8
fr_CH
fr_CH.ISO8859-1
fr_CH.ISO8859-15
fr_CH.UTF-8
fr_FR
fr_FR.ISO8859-1
fr_FR.ISO8859-15
fr_FR.UTF-8
he_IL
he_IL.UTF-8
hi_IN.ISCII-DEV
hr_HR
hr_HR.ISO8859-2
hr_HR.UTF-8
hu_HU
hu_HU.ISO8859-2
hu_HU.UTF-8
hy_AM
hy_AM.ARMSCII-8
hy_AM.UTF-8
is_IS
is_IS.ISO8859-1
is_IS.ISO8859-15
is_IS.UTF-8
it_CH
it_CH.ISO8859-1
it_CH.ISO8859-15
it_CH.UTF-8
it_IT
it_IT.ISO8859-1
it_IT.ISO8859-15
it_IT.UTF-8
ja_JP
ja_JP.SJIS
ja_JP.UTF-8
ja_JP.eucJP
kk_KZ
kk_KZ.PT154
kk_KZ.UTF-8
ko_KR
ko_KR.CP949
ko_KR.UTF-8
ko_KR.eucKR
lt_LT
lt_LT.ISO8859-13
lt_LT.ISO8859-4
lt_LT.UTF-8
nl_BE
nl_BE.ISO8859-1
nl_BE.ISO8859-15
nl_BE.UTF-8
nl_NL
nl_NL.ISO8859-1
nl_NL.ISO8859-15
nl_NL.UTF-8
no_NO
no_NO.ISO8859-1
no_NO.ISO8859-15
no_NO.UTF-8
pl_PL
pl_PL.ISO8859-2
pl_PL.UTF-8
pt_BR
pt_BR.ISO8859-1
pt_BR.UTF-8
pt_PT
pt_PT.ISO8859-1
pt_PT.ISO8859-15
pt_PT.UTF-8
ro_RO
ro_RO.ISO8859-2
ro_RO.UTF-8
ru_RU
ru_RU.CP1251
ru_RU.CP866
ru_RU.ISO8859-5
ru_RU.KOI8-R
ru_RU.UTF-8
sk_SK
sk_SK.ISO8859-2
sk_SK.UTF-8
sl_SI
sl_SI.ISO8859-2
sl_SI.UTF-8
sr_YU
sr_YU.ISO8859-2
sr_YU.ISO8859-5
sr_YU.UTF-8
sv_SE
sv_SE.ISO8859-1
sv_SE.ISO8859-15
sv_SE.UTF-8
tr_TR
tr_TR.ISO8859-9
tr_TR.UTF-8
uk_UA
uk_UA.ISO8859-5
uk_UA.KOI8-U
uk_UA.UTF-8
zh_CN
zh_CN.GB18030
zh_CN.GB2312
zh_CN.GBK
zh_CN.UTF-8
zh_CN.eucCN
zh_HK
zh_HK.Big5HKSCS
zh_HK.UTF-8
zh_TW
zh_TW.Big5
zh_TW.UTF-8
C
POSIX

malcolmwallace commented on May 21, 2024

I don't know whether the version of ghc is relevant, but in case it is, I'm compiling cpphs with ghc-7.6.1.

asr commented on May 21, 2024

You have the C locale installed. I could reproduce the issue compiling cpphs with GHC 7.6.3. What shell are you using? I'm using

$ echo $SHELL
/bin/bash

nad commented on May 21, 2024

> cpphs uses the standard Haskell/ghc System.IO.openFile, which I think trusts the underlying filesystem's metadata about the file's encoding?

I think recent versions of GHC by default use the locale (or code page) to decide what encoding to use.
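
One way to check which encoding a given environment ends up with is to ask base directly. This is a small sketch; the names defaultEncodingName and handleEncodingName are invented for the example, while localeEncoding and hGetEncoding are the System.IO functions.

```haskell
import System.IO

-- | Name of the encoding base derived from the locale (or code page)
-- when the program started.
defaultEncodingName :: String
defaultEncodingName = show localeEncoding

-- | Name of the encoding attached to a freshly opened text-mode handle.
-- Nothing would mean the handle is in binary mode.
handleEncodingName :: FilePath -> IO (Maybe String)
handleEncodingName path =
  withFile path ReadMode $ \h -> fmap (fmap show) (hGetEncoding h)
```

Printing defaultEncodingName once in a normal shell and once with LC_ALL=C set should show whether the locale has any effect on a particular platform.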

nad commented on May 21, 2024

A simple (system-dependent) test:

$ echo -e '\u2200' > test
$ cat test
∀
$ file test
test: UTF-8 Unicode text
$ ghc -e 'putStr =<< readFile "test"'
∀
$ LC_CTYPE=C ghc -e 'putStr =<< readFile "test"'
<interactive>: test: hGetContents: invalid argument (invalid byte sequence)
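
The same kind of failure can be provoked without touching the environment by choosing the decoder explicitly. The sketch below is an illustration rather than part of the test above, and readWith is a made-up helper name. It reads a file strictly with a given TextEncoding and returns a decoding error as a value.

```haskell
import Control.Exception (IOException, evaluate, try)
import System.IO

-- | Read a whole file with an explicit encoding. A decoding failure
-- (such as "invalid byte sequence") is returned as Left instead of
-- escaping from a lazy read later on.
readWith :: TextEncoding -> FilePath -> IO (Either IOException String)
readWith enc path = try $
  withFile path ReadMode $ \h -> do
    hSetEncoding h enc
    s <- hGetContents h
    _ <- evaluate (length s)  -- force the lazy read while the handle is open
    return s
```

For a file containing the single byte 0xFF, `readWith utf8` is expected to give a Left, because the strict UTF-8 decoder rejects that byte, whereas `readWith latin1` accepts any byte. The transcript above shows the analogous situation with a decoder chosen from the C locale and a UTF-8 file.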

nad commented on May 21, 2024

> Certainly, setting LC_CTYPE does not seem to change its behaviour.

Perhaps you've set LC_ALL, which overrides LC_CTYPE.
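
The precedence in question is the POSIX one: LC_ALL first, then the category variable, then LANG, with empty values treated as unset. A minimal sketch of that lookup order follows. The function name effectiveCtype is hypothetical, and this only models the environment variables, not what the C library or GHC's runtime actually does with the result.

```haskell
import Data.Maybe (listToMaybe, mapMaybe)

-- | Resolve the value that governs LC_CTYPE, given a lookup function for
-- environment variables. Order: LC_ALL, then LC_CTYPE, then LANG.
-- Variables that are unset or empty are skipped.
effectiveCtype :: (String -> Maybe String) -> Maybe String
effectiveCtype look =
  listToMaybe (mapMaybe nonEmpty ["LC_ALL", "LC_CTYPE", "LANG"])
  where
    nonEmpty name = case look name of
      Just v | not (null v) -> Just v
      _                     -> Nothing
```

With System.Environment one could call it as `fmap (\env -> effectiveCtype (\k -> lookup k env)) getEnvironment`.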

malcolmwallace commented on May 21, 2024

$ ghc --version
The Glorious Glasgow Haskell Compilation System, version 7.8.4
$ cat test

$ file test
test: UTF-8 Unicode text
$ ghc -e 'putStr =<< readFile "test"'

$ LC_CTYPE=C ghc -e 'putStr =<< readFile "test"'

$ LC_ALL=C ghc -e 'putStr =<< readFile "test"'

malcolmwallace commented on May 21, 2024

I think I can close this issue, since it appears that neither cpphs nor ghc is at fault.

asr commented on May 21, 2024

Which operating system and shell are you using?

asr commented on May 21, 2024

Can you reproduce the issue by running

$ export LC_ALL=C
$ cpphs test

?

malcolmwallace commented on May 21, 2024

ghc-7.6.1 on MacOSX 10.7.5, with bash.
ghc-7.8.3 on Windows 7 Professional SP1, with bash.

malcolmwallace commented on May 21, 2024

Cannot reproduce the issue, even with LC_ALL=C.

asr commented on May 21, 2024

Did you mean export LC_ALL=C?

asr commented on May 21, 2024

What is the output of

$ locale
$ LC_ALL=C locale

?

malcolmwallace commented on May 21, 2024

$ locale # MacOSX
LANG="en_GB.UTF-8"
LC_COLLATE="en_GB.UTF-8"
LC_CTYPE="en_GB.UTF-8"
LC_MESSAGES="en_GB.UTF-8"
LC_MONETARY="en_GB.UTF-8"
LC_NUMERIC="en_GB.UTF-8"
LC_TIME="en_GB.UTF-8"
LC_ALL=
$ LC_ALL=C locale
LANG="en_GB.UTF-8"
LC_COLLATE="C"
LC_CTYPE="C"
LC_MESSAGES="C"
LC_MONETARY="C"
LC_NUMERIC="C"
LC_TIME="C"
LC_ALL="C"

The result is similar on Windows 7, except that the default is en_US.UTF-8 rather than en_GB.UTF-8.

asr commented on May 21, 2024

FYI, I reported here the different behaviour in Linux and Mac OS.

asr commented on May 21, 2024

Thanks for fixing the issue (tested on Agda). Could you release a new version, please?
