Comments (24)
I just discussed this issue with a Mac user, and it seems as if the System.IO functions by default always use UTF-8 under MacOS, while the locale is ignored.
Under Windows I guess that one can use chcp to trigger the problem. Perhaps chcp 1252 would work.
GHC has used UTF-8 as the character encoding for source files since version 6.6 (which was released in 2006), so perhaps cpphs could also use this as the default. Note, however, that the GHC documentation states that "invalid UTF-8 sequences [are] ignored in comments, so it is possible to use other encodings such as Latin-1, as long as the non-comment source code is ASCII only".
I've attached a patch that switches to UTF-8 everywhere (?) in cpphs, with two caveats:
- The command-line arguments are treated as before.
- The encoding of stderr is only changed in the top-level module. If cpphs is intended to be used as a library, and error messages can contain non-ASCII characters, then the encoding of stderr should perhaps be changed in the applicable library modules.
I've used the base library's support for roundtripping to handle illegal characters. Feel free to base any changes on this patch.
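The roundtripping idea mentioned above can be sketched roughly as follows. This is not the attached patch and not cpphs's actual code: the helper names (withUtf8Roundtrip, readFileUtf8Roundtrip, writeFileUtf8Roundtrip, readBytes) and the file names are made up for illustration, and the sketch assumes a base library in which mkTextEncoding accepts the "//ROUNDTRIP" suffix.

```haskell
import Control.Exception (evaluate)
import System.IO

-- Hypothetical helper: open a file as UTF-8 with roundtripping, so that
-- bytes which are not valid UTF-8 are carried through as escape characters
-- instead of raising a decoding error.
withUtf8Roundtrip :: FilePath -> IOMode -> (Handle -> IO a) -> IO a
withUtf8Roundtrip path mode act = do
  enc <- mkTextEncoding "UTF-8//ROUNDTRIP"
  withFile path mode $ \h -> do
    hSetEncoding h enc
    hSetNewlineMode h noNewlineTranslation  -- keep line endings byte-exact
    act h

readFileUtf8Roundtrip :: FilePath -> IO String
readFileUtf8Roundtrip path = withUtf8Roundtrip path ReadMode $ \h -> do
  s <- hGetContents h
  _ <- evaluate (length s)  -- force the lazy read before the handle closes
  return s

writeFileUtf8Roundtrip :: FilePath -> String -> IO ()
writeFileUtf8Roundtrip path s = withUtf8Roundtrip path WriteMode $ \h -> hPutStr h s

-- Read the raw bytes of a file, one Char per byte.
readBytes :: FilePath -> IO String
readBytes path = withBinaryFile path ReadMode $ \h -> do
  s <- hGetContents h
  _ <- evaluate (length s)
  return s

main :: IO ()
main = do
  -- Raw bytes: a valid UTF-8 "∀" (E2 88 80), a space, then a stray Latin-1 byte (E9).
  withBinaryFile "roundtrip-in.txt" WriteMode $ \h -> hPutStr h "\xE2\x88\x80 \xE9\n"
  s <- readFileUtf8Roundtrip "roundtrip-in.txt"
  writeFileUtf8Roundtrip "roundtrip-out.txt" s
  before <- readBytes "roundtrip-in.txt"
  after <- readBytes "roundtrip-out.txt"
  putStrLn (if before == after then "bytes preserved" else "bytes changed")
```

The point of the design is that a source file with Latin-1 bytes in a comment can pass through a preprocessor unchanged, while valid UTF-8 text is still decoded to proper characters.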
from cpphs.
Thanks for the patch, Nils. I rolled something slightly different, to ensure that e.g. #included files also get the UTF-8 encoding. I was not previously aware of the roundtripping style of TextEncoding, so that was a useful addition for me.
from cpphs.
cpphs-1.20.2 released.
from cpphs.
I can't seem to reproduce the issue with the given steps. cpphs uses the standard Haskell/ghc System.IO.openFile, which I think trusts the underlying filesystem's metadata about the file's encoding? Certainly, setting LC_CTYPE does not seem to change its behaviour.
$ LC_CTYPE=C ./cpphs Test.hs
#line 1 "Test.hs"
module Main where
main = putStrLn "∀"
from cpphs.
Using the file command to determine the file type, I got
$ file Test.hs
Test.hs: UTF-8 Unicode text
What do you get?
from cpphs.
The same.
from cpphs.
It seems you don't have the C locale installed. What is the output of running
$ locale -a
?
from cpphs.
$ locale -a
af_ZA
af_ZA.ISO8859-1
af_ZA.ISO8859-15
af_ZA.UTF-8
am_ET
am_ET.UTF-8
be_BY
be_BY.CP1131
be_BY.CP1251
be_BY.ISO8859-5
be_BY.UTF-8
bg_BG
bg_BG.CP1251
bg_BG.UTF-8
ca_ES
ca_ES.ISO8859-1
ca_ES.ISO8859-15
ca_ES.UTF-8
cs_CZ
cs_CZ.ISO8859-2
cs_CZ.UTF-8
da_DK
da_DK.ISO8859-1
da_DK.ISO8859-15
da_DK.UTF-8
de_AT
de_AT.ISO8859-1
de_AT.ISO8859-15
de_AT.UTF-8
de_CH
de_CH.ISO8859-1
de_CH.ISO8859-15
de_CH.UTF-8
de_DE
de_DE.ISO8859-1
de_DE.ISO8859-15
de_DE.UTF-8
el_GR
el_GR.ISO8859-7
el_GR.UTF-8
en_AU
en_AU.ISO8859-1
en_AU.ISO8859-15
en_AU.US-ASCII
en_AU.UTF-8
en_CA
en_CA.ISO8859-1
en_CA.ISO8859-15
en_CA.US-ASCII
en_CA.UTF-8
en_GB
en_GB.ISO8859-1
en_GB.ISO8859-15
en_GB.US-ASCII
en_GB.UTF-8
en_IE
en_IE.UTF-8
en_NZ
en_NZ.ISO8859-1
en_NZ.ISO8859-15
en_NZ.US-ASCII
en_NZ.UTF-8
en_US
en_US.ISO8859-1
en_US.ISO8859-15
en_US.US-ASCII
en_US.UTF-8
es_ES
es_ES.ISO8859-1
es_ES.ISO8859-15
es_ES.UTF-8
et_EE
et_EE.ISO8859-15
et_EE.UTF-8
eu_ES
eu_ES.ISO8859-1
eu_ES.ISO8859-15
eu_ES.UTF-8
fi_FI
fi_FI.ISO8859-1
fi_FI.ISO8859-15
fi_FI.UTF-8
fr_BE
fr_BE.ISO8859-1
fr_BE.ISO8859-15
fr_BE.UTF-8
fr_CA
fr_CA.ISO8859-1
fr_CA.ISO8859-15
fr_CA.UTF-8
fr_CH
fr_CH.ISO8859-1
fr_CH.ISO8859-15
fr_CH.UTF-8
fr_FR
fr_FR.ISO8859-1
fr_FR.ISO8859-15
fr_FR.UTF-8
he_IL
he_IL.UTF-8
hi_IN.ISCII-DEV
hr_HR
hr_HR.ISO8859-2
hr_HR.UTF-8
hu_HU
hu_HU.ISO8859-2
hu_HU.UTF-8
hy_AM
hy_AM.ARMSCII-8
hy_AM.UTF-8
is_IS
is_IS.ISO8859-1
is_IS.ISO8859-15
is_IS.UTF-8
it_CH
it_CH.ISO8859-1
it_CH.ISO8859-15
it_CH.UTF-8
it_IT
it_IT.ISO8859-1
it_IT.ISO8859-15
it_IT.UTF-8
ja_JP
ja_JP.SJIS
ja_JP.UTF-8
ja_JP.eucJP
kk_KZ
kk_KZ.PT154
kk_KZ.UTF-8
ko_KR
ko_KR.CP949
ko_KR.UTF-8
ko_KR.eucKR
lt_LT
lt_LT.ISO8859-13
lt_LT.ISO8859-4
lt_LT.UTF-8
nl_BE
nl_BE.ISO8859-1
nl_BE.ISO8859-15
nl_BE.UTF-8
nl_NL
nl_NL.ISO8859-1
nl_NL.ISO8859-15
nl_NL.UTF-8
no_NO
no_NO.ISO8859-1
no_NO.ISO8859-15
no_NO.UTF-8
pl_PL
pl_PL.ISO8859-2
pl_PL.UTF-8
pt_BR
pt_BR.ISO8859-1
pt_BR.UTF-8
pt_PT
pt_PT.ISO8859-1
pt_PT.ISO8859-15
pt_PT.UTF-8
ro_RO
ro_RO.ISO8859-2
ro_RO.UTF-8
ru_RU
ru_RU.CP1251
ru_RU.CP866
ru_RU.ISO8859-5
ru_RU.KOI8-R
ru_RU.UTF-8
sk_SK
sk_SK.ISO8859-2
sk_SK.UTF-8
sl_SI
sl_SI.ISO8859-2
sl_SI.UTF-8
sr_YU
sr_YU.ISO8859-2
sr_YU.ISO8859-5
sr_YU.UTF-8
sv_SE
sv_SE.ISO8859-1
sv_SE.ISO8859-15
sv_SE.UTF-8
tr_TR
tr_TR.ISO8859-9
tr_TR.UTF-8
uk_UA
uk_UA.ISO8859-5
uk_UA.KOI8-U
uk_UA.UTF-8
zh_CN
zh_CN.GB18030
zh_CN.GB2312
zh_CN.GBK
zh_CN.UTF-8
zh_CN.eucCN
zh_HK
zh_HK.Big5HKSCS
zh_HK.UTF-8
zh_TW
zh_TW.Big5
zh_TW.UTF-8
C
POSIX
from cpphs.
I don't know whether the version of ghc might be relevant, but in case it is, I'm compiling cpphs with ghc-7.6.1.
from cpphs.
You have the C locale installed. I could reproduce the issue when compiling cpphs with GHC 7.6.3. What shell are you using? I'm using
$ echo $SHELL
/bin/bash
from cpphs.
> cpphs uses the standard Haskell/ghc System.IO.openFile, which I think trusts the underlying filesystem's metadata about the file's encoding?
I think recent versions of GHC by default use the locale (or code page) to decide what encoding to use.
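To see which encoding the runtime has picked, and to make a program independent of it, something along these lines may help. This is only a sketch: the file name forall.txt is made up, and whether overriding the default is appropriate for cpphs is a separate question. getLocaleEncoding and setLocaleEncoding come from GHC.IO.Encoding in the base library.

```haskell
import Control.Exception (evaluate)
import GHC.IO.Encoding (getLocaleEncoding, setLocaleEncoding)
import System.IO

main :: IO ()
main = do
  -- The encoding that newly opened text-mode handles get by default.
  enc <- getLocaleEncoding
  putStrLn ("Default encoding for new handles: " ++ show enc)

  -- Override the default for handles opened from here on, regardless of
  -- LC_ALL / LC_CTYPE or the Windows code page.
  setLocaleEncoding utf8

  -- Write the UTF-8 bytes of "∀" (E2 88 80) in binary mode, then read the
  -- file back in text mode; readFile now decodes it as UTF-8.
  withBinaryFile "forall.txt" WriteMode $ \h -> hPutStr h "\xE2\x88\x80\n"
  s <- readFile "forall.txt"
  _ <- evaluate (length s)  -- force the lazy read
  putStrLn (if s == "\8704\n" then "decoded as UTF-8" else "decoded as something else")
```

Running the program with and without LC_ALL=C should show whether the first line changes on a given system.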
from cpphs.
A simple (system-dependent) test:
$ echo -e '\u2200' > test
$ cat test
∀
$ file test
test: UTF-8 Unicode text
$ ghc -e 'putStr =<< readFile "test"'
∀
$ LC_CTYPE=C ghc -e 'putStr =<< readFile "test"'
<interactive>: test: hGetContents: invalid argument (invalid byte sequence)
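The same kind of decoding failure can be provoked without touching the locale, by fixing the handle's encoding explicitly. Note that this sketch shows the mirror image of the transcript above: instead of reading UTF-8 bytes under a non-UTF-8 locale, it reads a Latin-1 byte under strict UTF-8. The helper name decodeStrictUtf8 and the file name are made up for illustration, and the exact wording of the error message may differ between GHC versions.

```haskell
import Control.Exception (IOException, evaluate, try)
import System.IO

-- Hypothetical helper: read a file as strict UTF-8 and return any
-- decoding error as a value instead of letting it propagate.
decodeStrictUtf8 :: FilePath -> IO (Either IOException String)
decodeStrictUtf8 path = try $ withFile path ReadMode $ \h -> do
  hSetEncoding h utf8       -- plain utf8 raises an error on invalid input
  s <- hGetContents h
  _ <- evaluate (length s)  -- force the read so the error surfaces inside 'try'
  return s

main :: IO ()
main = do
  -- "café" encoded as Latin-1: the byte E9 is not valid UTF-8 here.
  withBinaryFile "latin1.txt" WriteMode $ \h -> hPutStr h "caf\xE9\n"
  r <- decodeStrictUtf8 "latin1.txt"
  case r of
    Left e  -> putStrLn ("decoding failed: " ++ show e)
    Right s -> putStrLn ("decoded " ++ show (length s) ++ " characters")
```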
from cpphs.
> Certainly, setting LC_CTYPE does not seem to change its behaviour.
Perhaps you've set LC_ALL, which overrides LC_CTYPE.
from cpphs.
$ ghc --version
The Glorious Glasgow Haskell Compilation System, version 7.8.4
$ cat test
∀
$ file test
test: UTF-8 Unicode text
$ ghc -e 'putStr =<< readFile "test"'
∀
$ LC_CTYPE=C ghc -e 'putStr =<< readFile "test"'
∀
$ LC_ALL=C ghc -e 'putStr =<< readFile "test"'
∀
from cpphs.
I think I can close this issue, since it appears that neither cpphs nor ghc is at fault.
from cpphs.
Which operating system and shell are you using?
from cpphs.
Can you reproduce the issue by running
$ export LC_ALL=C
$ cpphs test
?
from cpphs.
ghc-7.6.1 on MacOSX 10.7.5, with bash.
ghc-7.8.3 on Windows 7 Professional SP1, with bash.
from cpphs.
Cannot reproduce the issue, even with LC_ALL=C.
from cpphs.
Did you mean export LC_ALL=C?
from cpphs.
What is the output of
$ locale
$ LC_ALL=C locale
?
from cpphs.
$ locale # MacOSX
LANG="en_GB.UTF-8"
LC_COLLATE="en_GB.UTF-8"
LC_CTYPE="en_GB.UTF-8"
LC_MESSAGES="en_GB.UTF-8"
LC_MONETARY="en_GB.UTF-8"
LC_NUMERIC="en_GB.UTF-8"
LC_TIME="en_GB.UTF-8"
LC_ALL=
$ LC_ALL=C locale
LANG="en_GB.UTF-8"
LC_COLLATE="C"
LC_CTYPE="C"
LC_MESSAGES="C"
LC_MONETARY="C"
LC_NUMERIC="C"
LC_TIME="C"
LC_ALL="C"
The result is similar on Windows 7, except that the default is en_US.UTF-8 rather than en_GB.UTF-8.
from cpphs.
FYI, I reported here the different behaviour in Linux and Mac OS.
from cpphs.
Thanks for fixing the issue (tested on Agda). Could you release a new version, please?
from cpphs.