Comments (18)

ghc-mirror avatar ghc-mirror commented on May 7, 2024

Original reporter: ross@

To handle non-ASCII characters in the source, you need to decide which encoding it is in. There is the encoding-independent workaround of using &#nnn; in the source.

The generated HTML doesn't need encoding, as non-ASCII characters are rendered as numeric entities by stringToHtmlString.
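
As a rough illustration of that kind of escaping, here is a minimal sketch. `escapeNonAscii` is a hypothetical name, not the real stringToHtmlString, and the sketch only assumes that HTML metacharacters are escaped and that code points above 127 become decimal numeric entities.

```haskell
import Data.Char (ord)

-- Hypothetical helper (not Haddock's actual code): escape HTML
-- metacharacters and render every non-ASCII character as a decimal
-- numeric entity, so the output stays pure ASCII.
escapeNonAscii :: String -> String
escapeNonAscii = concatMap escape
  where
    escape '<' = "&lt;"
    escape '>' = "&gt;"
    escape '&' = "&amp;"
    escape '"' = "&quot;"
    escape c
      | ord c > 127 = "&#" ++ show (ord c) ++ ";"
      | otherwise   = [c]
```

With this sketch, `escapeNonAscii "caf\233"` (that is, "café") yields `caf&#233;`, which any browser can render regardless of the page encoding.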

from haddock.

ghc-mirror avatar ghc-mirror commented on May 7, 2024

Original reporter: dav.vire+haskell@

The workaround of using &#nnn; in the source is not usable. The comment becomes totally unreadable; for comments in a foreign language it's a real problem.

Haddock should at least be able to handle UTF-8 encoding of the source file without mangling the HTML output.

ghc-mirror avatar ghc-mirror commented on May 7, 2024

Original reporter: yuriks.br@

This is a major pain in the ass for anyone who isn't coding in English, so I'm bumping this up.

ghc-mirror avatar ghc-mirror commented on May 7, 2024

Original reporter: leonelfl@

Non-English-speaking programmers need this.

While this is not related to the capabilities of the Haskell language, having tools that work universally gives credibility to the whole platform.

UTF-8 support is necessary. It must be stressed that other people are programming, explaining their programs and building interfaces in languages other than English. They do this naturally and expect to do so without any inconvenience.

Let's stop thinking that Haskell is just for Haskell programmers who program for fun and are willing to show each other's results in English (the lingua franca). Haskell platform components need to be usable in environments whose purpose is not Haskell itself.

ghc-mirror avatar ghc-mirror commented on May 7, 2024

Original reporter: ppavel

I vote for this. I'm willing to hack but will need some directions to get started

ghc-mirror avatar ghc-mirror commented on May 7, 2024

Original reporter: david.waern@

Replying to [comment:5 ppavel]:

I vote for this. I'm willing to hack but will need some directions to get started

Hi Pavel,

I've looked at this briefly and I think it could be related to the fact that we use alexGetChar in the GHC lexer where we should use alexGetChar' instead. You could try changing that and see if it helps.

The lexer is in compiler/parser/Lexer.x in the GHC source tree. Look for functions that read Haddock comments such as multiline_doc_comment, nested_doc_comment, etc.

ghc-mirror avatar ghc-mirror commented on May 7, 2024

Original reporter: ddssff@

I don't think alexGetChar' exists any more.

ghc-mirror avatar ghc-mirror commented on May 7, 2024

Original reporter: pho@

I vote for this too.
Personally I stick to using English in docs even though my native language is Japanese, but I'm really fond of UnicodeSyntax. I want to use UnicodeSyntax in code examples, not only in the code itself.

ghc-mirror avatar ghc-mirror commented on May 7, 2024

Original reporter: marlowsd@

Alex 3 can lex UTF-8 directly, which might make this easier. I made the changes to Haddock to make it work with Alex 3, but I didn't add Unicode support at the time, because I wanted to keep it working with Alex 2.

ghc-mirror avatar ghc-mirror commented on May 7, 2024

Original reporter: david.waern@

Simon,

I made modifications to the GHC lexer so that Unicode characters are preserved in the comments fed to the Haddock lexer. I then tested with a simple Unicode comment and I can see that it appears in the documentation without getting mangled by the Haddock lexer.

However I'm assuming by your last comment that something still needs to be done in the Haddock lexer for this to work 100%. Do you think we could drop compatibility with Alex 2 by now, and if so could you explain what needs to be done in the lexer?

ghc-mirror avatar ghc-mirror commented on May 7, 2024

Original reporter: marlowsd@

The comments from GHC are lexed again by Haddock using an Alex lexer, and I would expect that step to mangle the Unicode. From src\Lex.x:

alexGetByte :: AlexInput -> Maybe (Word8,AlexInput)
alexGetByte (p,c,[]) = Nothing
alexGetByte (p,_,(c:s))  = let p' = alexMove p c
                              in p' `seq`  Just (fromIntegral (ord c), (p', c, s))

-- for compat with Alex 2.x:
alexGetChar :: AlexInput -> Maybe (Char,AlexInput)
alexGetChar i = case alexGetByte i of
                  Nothing     -> Nothing
                  Just (b,i') -> Just (chr (fromIntegral b), i')

You can see we apply ord in alexGetByte and chr again in alexGetChar, so Unicode should be squashed to the low 8 bits.
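
The effect of that round trip can be sketched in isolation. `squash` is a hypothetical name introduced here for illustration; it only mimics the ord-to-Word8-to-chr path and is not part of Lex.x.

```haskell
import Data.Char (chr, ord)
import Data.Word (Word8)

-- Hypothetical helper mimicking the ord/chr round trip described above:
-- the code point is narrowed to a Word8 (keeping only the low 8 bits)
-- and then widened back to a Char.
squash :: Char -> Char
squash c = chr (fromIntegral (fromIntegral (ord c) :: Word8))
```

ASCII characters survive unchanged, but `squash '\x042D'` (Cyrillic 'Э', U+042D) comes back as '-' (0x2D), which is the kind of mangling one would expect to see in the output.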

ghc-mirror avatar ghc-mirror commented on May 7, 2024

Original reporter: selinger@

I agree that this should be fixed. It would be better to assume that all files are UTF8 than to assume all files are ASCII.

Either way, users that use another encoding first have to do an offline conversion before invoking Haddock. But conversion from, say, Latin1 to UTF8 is trivial to do, whereas conversion from Latin1 to ASCII with HTML entities requires offline parsing: non-ASCII characters in Haddock comments must be converted to HTML entities, and non-ASCII characters in the code itself must be converted to something else (UTF8?), because Haddock will croak if it encounters an HTML entity in the code itself.
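
To show how little the Latin1-to-UTF8 direction involves, here is a minimal sketch working on raw bytes. `latin1ToUtf8` is a hypothetical name and this is only an illustration, not something Haddock provides; in practice an existing conversion tool would do the same job.

```haskell
import Data.Bits (shiftR, (.&.), (.|.))
import Data.Word (Word8)

-- Each Latin-1 byte is also its Unicode code point (U+0000..U+00FF),
-- so bytes below 0x80 pass through unchanged and every other byte
-- becomes a two-byte UTF-8 sequence.
latin1ToUtf8 :: [Word8] -> [Word8]
latin1ToUtf8 = concatMap encode
  where
    encode b
      | b < 0x80  = [b]
      | otherwise = [0xC0 .|. (b `shiftR` 6), 0x80 .|. (b .&. 0x3F)]
```

No parsing of the Haskell source is needed: the same byte-level mapping applies to comments and code alike, whereas the HTML-entity route has to tell the two apart.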

Moreover, the current HTML entities encoding does not even work correctly; see bug #191.

ghc-mirror avatar ghc-mirror commented on May 7, 2024

Original reporter: sol@

I can reproduce this with Haddock 2.9.2; the version of Haddock that ships with GHC 7.4.0.20111219 produces proper HTML entities for code points outside the ASCII range.

Are there still any issues left? And if so, what would a minimal test case look like?

ghc-mirror avatar ghc-mirror commented on May 7, 2024

Original reporter: alex-voikov@

Replying to [comment:14 SimonHengel]:

I can reproduce this with Haddock 2.9.2; the version of Haddock that ships with GHC 7.4.0.20111219 produces proper HTML entities for code points outside the ASCII range.

Are there still any issues left? And if so, what would a minimal test case look like?

Haddock version 2.12.0

-- | Это модуль mytime
module MyTime (Time(..),testFunc) where


-- ^  Тип данных время
data Time = Time{ hour :: Int -- ^ Часы
                , mins  :: Int -- ^ Минуты
                }
          deriving(Show)

-- |Тестовая функция, которая всегда возвращает 42
testFunc :: String -- ^ строка
            -> Int -- ^ возвращает число
testFunc x = 42
$ haddock 3.hs -html
Haddock coverage:
doc comment parse failed:   Тип данных время
doc comment parse failed: Тестовая функция, которая всегда возвращает 42
doc comment parse failed:  строка
doc comment parse failed:  возвращает число
  33% (  1 /  3) in 'MyTime'

ghc-mirror avatar ghc-mirror commented on May 7, 2024

Original reporter: batterseapower

These patches implement support for this in Haddock by using Alex 3's native Unicode support.

ivan-m avatar ivan-m commented on May 7, 2024

Sorry if this is the wrong bug to report in, but this is what Github came up with for search results.

Using Haddock that ships with GHC 7.8.2, if I build with LANG=C (done so by my distribution's package manager), then I still get issues like this: UnkindPartition/tasty-golden#10

hvr avatar hvr commented on May 7, 2024

Fwiw, I believe this is fixed in GHC HEAD

Fuuzetsu avatar Fuuzetsu commented on May 7, 2024

As noted on haskell/cabal#1721, from Haddock 2.15.0 cabal and Haddock will enforce UTF-8.

If absolutely necessary, this could be backported into 2.14.3, but as it requires co-ordination with cabal and backporting is a pain, I'd rather not.
