Comments (5)
Hey! Thanks for the report.
The HTML parser is based on https://pkg.go.dev/golang.org/x/net/html, which I think is a standards-compliant HTML5 parser. My guess is that some well-known self-closing elements, like `<br/>`, are treated differently. For example, this is how Chrome parses `<html><br/><bar>baz</bar></html>` and `<html><foo/><bar>baz</bar></html>` (screenshot not preserved).
from fq.
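To make the difference between the two inputs concrete, here is a minimal Python sketch of the HTML5 rule in question: a trailing `/` only matters on void elements such as `br`, and is ignored on unknown tags like `foo`, which then stay open. This is an illustration built on the standard library's `html.parser`, not the Go parser fq uses; the names `VoidAwareBuilder`, `Node`, and `parse_outline` are hypothetical, and the sketch skips the implied `head`/`body` elements a real HTML5 parser would insert.

```python
from html.parser import HTMLParser

# Void elements per the HTML standard; these never have children.
VOID_ELEMENTS = {
    "area", "base", "br", "col", "embed", "hr", "img", "input",
    "link", "meta", "source", "track", "wbr",
}


class Node:
    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []  # child Nodes or text strings

    def outline(self):
        """Return a compact nested form, e.g. html(br, bar('baz'))."""
        if not self.children:
            return self.tag
        parts = [
            c.outline() if isinstance(c, Node) else repr(c)
            for c in self.children
        ]
        return f"{self.tag}({', '.join(parts)})"


class VoidAwareBuilder(HTMLParser):
    """Hypothetical, simplified tree builder (not the Go implementation)."""

    def __init__(self):
        super().__init__()
        self.root = Node("#document")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.children.append(node)
        if tag not in VOID_ELEMENTS:
            self.current = node  # stays open until a matching end tag

    def handle_startendtag(self, tag, attrs):
        # HTML5 rule: the trailing "/" is ignored, so <foo/> acts like <foo>.
        self.handle_starttag(tag, attrs)

    def handle_endtag(self, tag):
        # Close the nearest open element with this name, if there is one.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent

    def handle_data(self, data):
        if data.strip():
            self.current.children.append(data)


def parse_outline(markup):
    builder = VoidAwareBuilder()
    builder.feed(markup)
    builder.close()
    return builder.root.children[0].outline()


print(parse_outline("<html><br/><bar>baz</bar></html>"))   # html(br, bar('baz'))
print(parse_outline("<html><foo/><bar>baz</bar></html>"))  # html(foo(bar('baz')))
```

With `<br/>`, `bar` ends up as a sibling of `br`; with `<foo/>`, `bar` is nested inside `foo` because `foo` is never closed by the slash.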
I had wondered if browsers special-case such tags. I'm not really sure what you can do here except use whatever the Go implementation decides. It seems to be fairly "unspecified" territory.
Yep, I think which tags to treat specially is part of the HTML5 spec, and looking at the Go parser implementation, https://github.com/golang/net/blob/master/html/parse.go, it looks to be quite hardcoded.
What kind of text do you want to parse? Does using the xml decoder cause other issues?
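As a rough illustration of the trade-off behind the XML question, the sketch below uses Python's `xml.etree.ElementTree` as a stand-in for a strict XML decoder (an assumption; fq's own xml decoder is a different implementation and may differ in details). A strict XML parser honors `<foo/>` as an empty element, but rejects markup that HTML tolerates, such as an unclosed `<br>`. The helper names `child_tags` and `is_well_formed` are hypothetical.

```python
import xml.etree.ElementTree as ET


def child_tags(markup):
    """Parse markup as XML and return the tag names of the root's children."""
    root = ET.fromstring(markup)
    return [child.tag for child in root]


def is_well_formed(markup):
    """Return True if markup parses as XML, False otherwise."""
    try:
        ET.fromstring(markup)
    except ET.ParseError:
        return False
    return True


# In XML, <foo/> is an empty element, so bar is its sibling.
print(child_tags("<html><foo/><bar>baz</bar></html>"))  # ['foo', 'bar']

# An unclosed <br> is fine in HTML but not well-formed XML.
print(is_well_formed("<html><br><bar>baz</bar></html>"))  # False
```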
I was parsing Amazon order pages which somehow ended up with a bunch of `<i .../>` tags, but I think my source material might have been damaged at some point. Starting with fresh HTML sources, the problems were gone. The little test cases were just a result of that, but `-d html` now works fine for my purposes.
This issue can probably be closed and provide some Google fodder for people who encounter the same issue.
Great 👍 I've sometimes used some jq to "massage" things before doing queries or exporting to something else; that might be a workaround. Also, `fq -o array=true -d html` could be interesting; it makes some kinds of queries or manipulations easier.