I'm trying to parse structured metadata from https://dataverse.harvard.edu/da…

Unicode/string parsing error about extruct HOT 5 OPEN

scrapinghub commented on May 18, 2024

Unicode/string parsing error

from extruct.

Comments (5)

Kebniss commented on May 18, 2024 2

You cannot parse a Unicode string that contains an encoding declaration, see here. In Harvard's HTML the encoding is declared on the first line. You can simply encode the text before passing it to extruct:

data = extruct.extract(r.text.encode('utf8'), base_url=base_url)

String encodings are very confusing; this article helped me understand the basics :)
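
The workaround above can be wrapped in a small helper so callers do not have to remember it. This is a minimal sketch and not part of extruct: the names `has_encoding_declaration` and `to_parser_input` are hypothetical, and it assumes the declared encoding is UTF-8, as in the Harvard page.

```python
import re

# Matches a leading XML declaration that carries an encoding attribute,
# e.g. <?xml version="1.0" encoding="utf-8"?>
_ENCODING_DECL = re.compile(
    r'^\s*<\?xml[^>]*?\bencoding\s*=\s*["\'][^"\']*["\'][^>]*\?>',
    re.IGNORECASE,
)


def has_encoding_declaration(markup):
    """Return True if a str starts with an XML declaration naming an encoding."""
    return isinstance(markup, str) and _ENCODING_DECL.match(markup) is not None


def to_parser_input(markup, encoding="utf-8"):
    """Hypothetical helper: encode str input that carries an encoding
    declaration; return anything else (bytes, plain str) unchanged.

    Assumption: `encoding` agrees with the encoding named in the declaration.
    """
    if has_encoding_declaration(markup):
        return markup.encode(encoding)
    return markup


doc = '<?xml version="1.0" encoding="utf-8"?><html><body>foo</body></html>'
payload = to_parser_input(doc)  # bytes, suitable for extruct.extract(...)
```

With requests, the call would then read extruct.extract(to_parser_input(r.text), base_url=base_url). If the declaration names an encoding other than UTF-8, passing the raw bytes (r.content) is probably the safer choice.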

andrewsu commented on May 18, 2024 1

Thank you @Kebniss, worked perfectly! Your help (and your addition to my must-read list) is much appreciated!

lopuhin commented on May 18, 2024

I think it's possible to make extruct work in such cases, and handling them should be the library's responsibility.

lopuhin commented on May 18, 2024

Example document:

extruct.extract('<?xml version="1.0" encoding="utf-8"?><html><body>foo</body></html>')
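
One way a library could handle this input itself is to drop the XML declaration from a Unicode string before parsing, since the text is already decoded and the declared encoding no longer matters. A rough sketch of that idea using only the standard library follows; `strip_xml_declaration` is a hypothetical name, and this is not a description of how extruct actually handles the case.

```python
import re

# Leading XML declaration, e.g. <?xml version="1.0" encoding="utf-8"?>
_XML_DECLARATION = re.compile(r"^\s*<\?xml\b[^>]*\?>\s*", re.IGNORECASE)


def strip_xml_declaration(markup):
    """Hypothetical helper: remove a leading XML declaration from str input.

    Bytes are returned unchanged, because a parser can use the declaration
    to decode them.
    """
    if isinstance(markup, str):
        return _XML_DECLARATION.sub("", markup, count=1)
    return markup


doc = '<?xml version="1.0" encoding="utf-8"?><html><body>foo</body></html>'
cleaned = strip_xml_declaration(doc)
```

Compared with encoding the string, this avoids a mismatch between the bytes produced and the encoding the declaration names.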

jimmytuc commented on May 18, 2024

Not about Unicode, but I ran into an issue when parsing a JSON-LD structure that contains a hex string, at this URL.
The root cause is the description, which contains a hex string; the error goes away after removing \x, according to this article.
I think this case should be handled as well. Does anyone have any idea?
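
For context, \xNN is not a valid escape in JSON, so a strict parser rejects it. The sketch below shows one possible cleanup with the standard library. Note that it rewrites each \xNN as \u00NN instead of removing \x as the comment describes, on the assumption that the page meant the JavaScript-style escape for code point U+00NN. `fix_hex_escapes` is a hypothetical name and the sample document is made up for illustration.

```python
import json
import re

# A \xNN escape that is not itself preceded by an escaping backslash.
# Group 1 keeps any run of escaped backslashes (\\) that comes before it.
_HEX_ESCAPE = re.compile(r"(?<!\\)((?:\\\\)*)\\x([0-9a-fA-F]{2})")


def fix_hex_escapes(raw_json):
    """Hypothetical helper: rewrite \\xNN escapes as the JSON form \\u00NN."""
    return _HEX_ESCAPE.sub(
        lambda m: m.group(1) + "\\u00" + m.group(2), raw_json
    )


# Illustrative input only; the backslash below is a literal character.
raw = '{"description": "caf\\xe9"}'

try:
    json.loads(raw)
    parsed_without_fix = True
except json.JSONDecodeError:
    parsed_without_fix = False  # strict JSON rejects \x escapes

data = json.loads(fix_hex_escapes(raw))
```

Whether a library should apply such a rewrite silently is a design question, since it changes the input. An opt-in fallback after a failed strict parse may be the more conservative choice.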
