
Comments (5)

wader commented on May 26, 2024

Hey! Thanks for the report.

The HTML parser is based on https://pkg.go.dev/golang.org/x/net/html, which I think is a standards-compliant HTML5 parser. My guess is that some well-known self-closing elements, like <br/>, are treated differently. For example, this is how Chrome parses <html><br/><bar>baz</bar></html> and <html><foo/><bar>baz</bar></html>:

[Screenshot of the two Chrome parse results, taken 2023-06-30]

from fq.
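
The behaviour described above can be imitated with a rough sketch. This is not the golang.org/x/net/html code and not a full HTML5 tree builder (it never inserts head or body, for instance). It assumes the void-element list from the HTML standard and simply ignores the trailing slash on every other tag. The names `VOID_ELEMENTS`, `Html5LikeTree`, `render` and `outline` are made up for this illustration.

```python
from html.parser import HTMLParser

# Void elements as listed in the HTML standard (assumption: this is the
# set a compliant parser special-cases). Only these never get children.
VOID_ELEMENTS = {
    "area", "base", "br", "col", "embed", "hr", "img", "input",
    "link", "meta", "source", "track", "wbr",
}


class Html5LikeTree(HTMLParser):
    """Hypothetical helper: builds a tiny tree using the void-element rule."""

    def __init__(self):
        super().__init__()
        self.root = {"tag": "#root", "children": []}
        self.stack = [self.root]  # currently open elements

    def handle_starttag(self, tag, attrs):
        node = {"tag": tag, "children": []}
        self.stack[-1]["children"].append(node)
        if tag not in VOID_ELEMENTS:
            # Non-void elements stay open until a matching end tag.
            self.stack.append(node)

    def handle_startendtag(self, tag, attrs):
        # The "/>" is ignored: <foo/> is handled exactly like <foo>.
        self.handle_starttag(tag, attrs)

    def handle_endtag(self, tag):
        # Close the nearest matching open element; ignore stray end tags.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i]["tag"] == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        self.stack[-1]["children"].append(data)


def render(node):
    """Render a node as tag(child, child, ...); text is shown quoted."""
    if isinstance(node, str):
        return repr(node)
    if not node["children"]:
        return node["tag"]
    inner = ", ".join(render(child) for child in node["children"])
    return f"{node['tag']}({inner})"


def outline(markup):
    """Parse markup and return a one-line outline of the resulting tree."""
    parser = Html5LikeTree()
    parser.feed(markup)
    parser.close()
    return ", ".join(render(child) for child in parser.root["children"])


print(outline("<html><br/><bar>baz</bar></html>"))
print(outline("<html><foo/><bar>baz</bar></html>"))
```

In this sketch the `<br/>` case keeps `bar` as a sibling of `br`, while in the `<foo/>` case `bar` ends up nested inside `foo`, because `foo` is never closed until `</html>` arrives.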

Earnestly commented on May 26, 2024

I had wondered if browsers special-case such tags. I'm not really sure what you can do here except use whatever the Go implementation decides. It seems to be fairly "unspecified" territory.


wader commented on May 26, 2024

Yep, I think which tags to treat specially is part of the HTML5 spec, and looking at the Go parser implementation at https://github.com/golang/net/blob/master/html/parse.go, it looks to be quite hardcoded.

What kind of text do you want to parse? Does using the XML decoder cause other issues?
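
To illustrate the trade-off behind that question, here is a small sketch using Python's standard `xml.etree.ElementTree` rather than fq's own XML decoder, so treat it as an analogy only. An XML parser honours `/>` on any element, but it rejects markup that is not well-formed, which real-world HTML often is not. The helper names `child_tags` and `is_well_formed` are hypothetical.

```python
import xml.etree.ElementTree as ET


def child_tags(markup):
    """Return the tag names of the root element's direct children."""
    root = ET.fromstring(markup)
    return [child.tag for child in root]


def is_well_formed(markup):
    """Report whether the markup parses as XML at all."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


# In XML, <foo/> is an empty element, so bar stays a sibling of foo.
print(child_tags("<html><foo/><bar>baz</bar></html>"))

# An unclosed <br>, which is fine in HTML, is an error in XML.
print(is_well_formed("<html><br><bar>baz</bar></html>"))
```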


Earnestly commented on May 26, 2024

I was parsing Amazon order pages which somehow ended up with a bunch of <i .../> tags, but I think my source material might have been damaged at some point. Starting with fresh HTML sources, the problems were gone. The little test cases were just a result of that, but -d html now works fine for my purposes.

This issue can probably be closed; it can serve as Google fodder for people who encounter the same issue.


wader commented on May 26, 2024

Great 👍 I've sometimes used some jq to "massage" things before doing queries or exporting to something else, which might be a workaround. Also, fq -o array=true -d html could be interesting; it makes some kinds of queries or manipulations easier.

