mochiweb_html: parse exception (mochiweb, OPEN)

mochi commented on August 25, 2024
mochiweb_html: parse exception

from mochiweb.

Comments (8)

doubleyou commented on August 25, 2024
<ul><li><a href='http://wp-skins.info/2009/06/08/oshibka-cannot-modify-header-information-headers-already-sent.html'>Warning: Cannot modify header information</a></li><li><a href='http://wp-skins.info/2009/09/04/kak-rasshifrovat-base64_decode-v-futere-wordpress-temyi.html'><? echo(base64_decode("</a></li><li><a href='http://wp-skins.info/2009/03/31/delaem-krasivyie-ssyilki-ili-horoshiy-chpu-v-wordpress-dlya-seo.html'>ЧПУ в wordpress</a></li><li><a href='http://wp-skins.info/2009/03/16/besplatnyiy-internet-magazin-s-pomoschyu-wordpress-i-quick-shop.html'>интернет магазин на wordpress</a></li><li><a href='http://wp-skins.info/2009/02/18/shablonyi-zhurnalov-ili-gazet-dlya-sayta-na-wordpress.html'>шаблоны газет</a></li><li><a href='http://wp-skins.info/2007/12/19/wordpress-dlya-novichkov-chast-2.html'>Hello Dolly wordpress</a></li><li><a href='http://wp-skins.info/category/design'>дизайн wordpress</a></li><li><a href='http://wp-skins.info/2008/07/11/kak-dobavit-vatermark-watermark-na-svoi-kartinki.html'>ватермарк</a></li></ul></li>

In particular, this part (most likely a crappy chunk of WP code):

<? echo(base64_decode("

Piece of advice: next time, just find the problem line using binary search and analyze it (generally, syntax highlighting is enough for that).
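
For reference, here is a minimal sketch of that kind of bisection in Erlang. It is an illustration only: the module name `bisect_lines`, the function `first_failing_prefix/2` and the `Fails` predicate are all hypothetical, and it assumes that once a prefix of the input fails, every longer prefix fails too, which is a heuristic and does not hold for every parser bug.

```erlang
-module(bisect_lines).
-export([first_failing_prefix/2]).

%% Lines: list of binaries, one per input line.
%% Fails: fun((binary()) -> boolean()), true when the input triggers the bug.
%% A hypothetical predicate for this issue might be:
%%   fun(Bin) -> try mochiweb_html:parse(Bin), false catch _:_ -> true end end
%% Returns the smallest N such that the first N lines already fail,
%% or 'none' if the whole input does not fail.
first_failing_prefix(Lines, Fails) ->
    Len = length(Lines),
    case Fails(join(Lines, Len)) of
        false -> none;
        true -> search(Lines, Fails, 1, Len)
    end.

%% Invariant: the prefix of length Hi is known to fail.
search(_Lines, _Fails, Lo, Hi) when Lo >= Hi ->
    Hi;
search(Lines, Fails, Lo, Hi) ->
    Mid = (Lo + Hi) div 2,
    case Fails(join(Lines, Mid)) of
        true -> search(Lines, Fails, Lo, Mid);
        false -> search(Lines, Fails, Mid + 1, Hi)
    end.

%% Join the first N lines back into a single binary.
join(Lines, N) ->
    iolist_to_binary(lists:join(<<"\n">>, lists:sublist(Lines, N))).
```

Line N in the result is the first line whose presence makes the input fail, so that is the one to look at.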

etrepum commented on August 25, 2024

Even if it's "invalid HTML", it should still parse to something.

helllamer commented on August 25, 2024

Sorry about the crappy HTML, but it is unclear to me how the exception

no case clause matching <<"<!DOCTYPE html PUBLIC \"-// ....

is connected with the inlined PHP tags. My first guess was a problem with the DOCTYPE.

etrepum commented on August 25, 2024

Yeah, Erlang's error reporting for failed pattern matches on binaries could use some work. That's not the failure mode mochiweb_html is supposed to have, so this is a bug either way.
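
To illustrate why the message points at the DOCTYPE: when a `case` on a binary has no matching clause, the `case_clause` error carries the whole value being matched, not the offset where matching gave up, so the report starts with the beginning of the document. A small sketch, with a hypothetical module `case_clause_demo` and a simplified stand-in for the real function (not mochiweb's actual code):

```erlang
-module(case_clause_demo).
-export([find_gt/2, demo/0]).

%% Simplified stand-in: only ">" at offset O or end of input are handled.
find_gt(Bin, O) ->
    case Bin of
        <<_:O/binary, ">", _/binary>> -> O + 1;
        <<_:O/binary>> -> O
    end.

%% Offset 17 is the space after "<?", so no clause matches.
%% The error term contains the entire binary, DOCTYPE included.
demo() ->
    Doc = <<"<!DOCTYPE html><? echo(1)">>,
    try find_gt(Doc, 17)
    catch error:{case_clause, Value} -> Value =:= Doc
    end.
```

Here `demo/0` returns `true`: the reported value is the full input, which is why the failure looks like a DOCTYPE problem even though matching stopped much further in.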

doubleyou commented on August 25, 2024
find_qgt(Bin, S=#decoder{offset=O}) ->
    case Bin of
        <<_:O/binary, "?>", _/binary>> ->
            ?ADV_COL(S, 2);
        <<_:O/binary, ">", _/binary>> ->
            ?ADV_COL(S, 1);
        <<_:O/binary, "/>", _/binary>> ->
            ?ADV_COL(S, 2);
        %% tokenize_attributes takes care of this state:
        %% <<_:O/binary, C, _/binary>> ->
        %%     find_qgt(Bin, ?INC_CHAR(S, C));
        <<_:O/binary>> ->
            S;
        _ ->
            S
    end.

(I just added the last clause, actually.) This handles the current case, but for, say, an incomplete tag, it may purge all the content after the corrupted part.

@etrepum From my experience of sanitizing HTML, there's no ultimately good way of doing it. At Echo, we just tested several approaches and ended up with the one that was statistically the most accurate for us.

Personally, I've always thought of mochiweb_html as something like mochijson/mochijson2, and those fail when they encounter malformed input.

etrepum commented on August 25, 2024

Ideally it would follow the HTML5 recommendations for how HTML parsing should work, but I think this code predates that spec. It wasn't supposed to be a brittle parser, it was supposed to be relatively forgiving... but it was never used to parse crazy stuff in the wild so the parser was never adapted to be as robust as it was intended to be.

helllamer commented on August 25, 2024

So... I understand the aims of mochiweb_html now, and I will sanitize the crazy HTML and patch find_qgt/2.
Should this bug be closed, or ...?

etrepum commented on August 25, 2024

The bug shouldn't be closed until it's fixed in the repo.
