mochiweb_html unicode breakage (mochiweb issue, closed)

mattsta commented on August 25, 2024
mochiweb_html unicode breakage

from mochiweb.

Comments (5)

etrepum commented on August 25, 2024

The intention is that the input is already UTF-8, whether it is a binary or an iolist. This is consistent with what you would get by doing something like reading the request body from a mochiweb request:

1> io:fwrite([mochiweb_html:to_html(mochiweb_html:parse(<<"<done>Matt\xe2\x80\x99s Breaking Line</done>">>)), "\n"]).
<done>Matt’s Breaking Line</done>
ok

Note the use of io:fwrite/1 to send the output "directly" to stdout instead of io:format/2 with ~s, which does the wrong thing. io:format/2 does some really broken shit by default if you are using non-ASCII characters:

2> io:format("~s~n", [[8217]]).
** exception exit: {badarg,[{io,format,[<0.25.0>,"~s~n",[[8217]]]},
                            {erl_eval,do_apply,5},
                            {shell,exprs,7},
                            {shell,eval_exprs,7},
                            {shell,eval_loop,3}]}
     in function  io:o_request/3
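
For reference, a minimal sketch of the formatting side of this. The module and function names below are hypothetical and not part of mochiweb. As far as the standard library goes, the plain ~s directive rejects character codes above 255, while ~ts (the Unicode translation modifier) accepts them. io_lib:format/2 is used here instead of io:format/2 so the result can be inspected rather than printed, which avoids depending on the terminal's encoding.

```erlang
-module(fmt_unicode).
-export([format_ts/1, format_s/1]).

%% Hypothetical helper: format a list of Unicode code points with ~ts,
%% the Unicode-aware variant of ~s. Returns a flat list of code points.
format_ts(Chars) ->
    lists:flatten(io_lib:format("~ts", [Chars])).

%% Hypothetical helper: format the same data with plain ~s.
%% For code points above 255, io_lib:format/2 raises badarg,
%% which is caught here and reported as a tagged tuple.
format_s(Chars) ->
    try
        {ok, lists:flatten(io_lib:format("~s", [Chars]))}
    catch
        error:badarg -> {error, badarg}
    end.
```

With these definitions, fmt_unicode:format_ts([8217]) returns the code point list unchanged, while fmt_unicode:format_s([8217]) returns {error, badarg}, the same kind of failure as in the transcript above.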

dreid commented on August 25, 2024

@mattsta It's also worth noting that you can use mochiutf8:codepoints_to_bytes/1 to convert the list you have into a UTF-8 binary:

1> S = "<done>Matt’s Breaking Line</done>".
[60,100,111,110,101,62,77,97,116,116,8217,115,32,66,114,101,
 97,107,105,110,103,32,76,105,110,101,60,47,100|...]
2> mochiutf8:codepoints_to_bytes(S).
<<60,100,111,110,101,62,77,97,116,116,226,128,153,115,32,
  66,114,101,97,107,105,110,103,32,76,105,110,101,60,...>>
3> mochiweb_html:parse(mochiutf8:codepoints_to_bytes(S)).
{<<"done">>,[],
 [<<77,97,116,116,226,128,153,115,32,66,114,101,97,107,
    105,110,103,32,76,105,110,101>>]}

etrepum commented on August 25, 2024

I believe that unicode:characters_to_binary(S, utf8) is faster than mochiutf8:codepoints_to_bytes/1 (some functions in the unicode module are BIFs). Its error behavior is different, though.
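
To illustrate the error behavior on the unicode module side, here is a minimal sketch of a wrapper. The module name utf8_input and the function to_utf8/1 are hypothetical, not part of mochiweb or OTP. It assumes the input is a list of Unicode code points (or an already UTF-8 binary); unicode:characters_to_binary/2 signals problems by returning tagged tuples rather than a binary, and the wrapper normalizes those into ok/error results.

```erlang
-module(utf8_input).
-export([to_utf8/1]).

%% Hypothetical helper: convert code points (or UTF-8 chardata) to a
%% UTF-8 binary, turning the tagged error returns of
%% unicode:characters_to_binary/2 into simple ok/error tuples.
to_utf8(Input) ->
    case unicode:characters_to_binary(Input, utf8) of
        Bin when is_binary(Bin) ->
            {ok, Bin};
        {error, _Converted, _Rest} ->
            %% Input contained something that is not valid Unicode.
            {error, invalid_unicode};
        {incomplete, _Converted, _Rest} ->
            %% Input ended in the middle of a multi-byte sequence.
            {error, incomplete_unicode}
    end.
```

A caller could then match on {ok, Bin} before handing Bin to mochiweb_html:parse/1. Note that a list of raw UTF-8 bytes would be treated as code points and encoded a second time, so this sketch only fits the case discussed here, where the list really holds code points.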

mattsta commented on August 25, 2024

Thanks for the clarification. I was passing in an HTML body fetched with httpc:request without first converting it to a UTF-8 binary.

Everything works fine as long as I pre-convert:

 27> io:fwrite(mochiweb_html:to_html(mochiweb_html:parse(unicode:characters_to_binary("<done>Matt’s Breaking Line</done>")))).
 <done>Matt’s Breaking Line</done>ok

 28> io:format(mochiweb_html:to_html(mochiweb_html:parse(unicode:characters_to_binary("<done>Matt’s Breaking Line</done>")))).
 <done>Matt’s Breaking Line</done>ok

What can we do to help people not bug you about this in the future? The type system isn't elaborate enough to reject non-UTF-8 input outright.

There's always the cheap option:

 parse(Input) ->
     try
         parse_tokens(tokens(Input))
     catch
         error:badarg -> not_utf8
     end.

Or maybe even try to fix it ourselves, though it could hide performance issues by double-evaluating on every run:

 parse(Input) ->
     try
         parse_tokens(tokens(Input))
     catch
         error:badarg -> parse_tokens(tokens(unicode:characters_to_binary(Input)))
     end.

resolved: dumbmatt

etrepum commented on August 25, 2024

I'd accept a documentation patch that says the input must be UTF-8. I guess string() is a bit misleading, but I wrote all this stuff before Erlang made any attempt to support Unicode (except for some functions hiding in xmerl).
