Comments (13)
For reference, this is json-ld from the site:
[{
"@context" : "http://schema.org",
"@type" : "Event",
"name" : "Salsa Night",
"startDate" : "2018-06-27T18:00:00",
"location" : {
"@type" : "EventVenue",
"name" : "Montalvo Arts Center",
"address" : "15400 Montalvo Rd, Saratoga, CA"
}]
from extruct.
Here's an example:
www.browneyedbaker.com/nutter-butter-snowmen/
Hey @maugch , thanks for the report.
Correcting this kind of unescaped double quote looks non-trivial. demjson and ujson both choke on this input.
There might be a way with demjson's return_errors=True:
>>> demjson.decode(r'''"test"quotes""''', return_errors=True)
json_results(object='test', errors=[JSONDecodeError('Unexpected text after end of JSON value', position=position_marker(offset=6,line=1,column=6), severity='error')], stats=None)
and then checking which characters are around the reported offset.
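The idea of looking at the characters around the error offset can be sketched with the standard library alone. This is only an illustration: it swaps demjson for Python's built-in json module, whose JSONDecodeError exposes the offset as `pos`, and the helper name `context_around_error` and the `width` parameter are made up for this example.

```python
import json


def context_around_error(text, width=10):
    """Return (offset, surrounding text) for the first JSON error, or None.

    Hypothetical helper: it uses the stdlib json module rather than
    demjson, so the reported offsets may differ from demjson's.
    """
    try:
        json.loads(text)
    except json.JSONDecodeError as exc:
        start = max(exc.pos - width, 0)
        # Slice a small window around the offset so the offending
        # character can be inspected in context.
        return exc.pos, text[start:exc.pos + width]
    return None


offset_and_context = context_around_error('"test"quotes""')
```

Printing the returned window rather than a single character makes it easier to tell whether the parser stopped at a stray quote or at something invisible such as a tab.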
I've had similar issues with other characters, but I'm not sure exactly which ones: every time I look at result[XX], where XX is the offset reported in the exception, I get either a blank space or a letter.
There must be a WordPress plugin that mishandles some characters. So far I have had issues mostly with Recipe schemas.
I suppose the only possible solution is to look for the next square bracket, treat the double quote just before it as the closing one, and escape all the others. A further check is whether there are "," sequences, since the value might be a list of strings. Actually, it might be enough to escape every " that is not followed by a comma (apart from the last one, which is followed by ]).
Beware of stray \n and \t characters scattered everywhere.
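A rough sketch of that kind of heuristic, under a loudly stated assumption: a double quote inside a string is treated as the real closing quote only when the next non-whitespace character is one of `,`, `:`, `}` or `]` (or the input ends). Every other quote inside a string gets escaped. The function name `escape_inner_quotes` is hypothetical, and the rule will misfire on text such as `said "hi", then left`, where an inner quote happens to be followed by a comma.

```python
import json


def escape_inner_quotes(text):
    """Escape double quotes that appear to sit inside a JSON string value.

    Heuristic only: a quote closes the string if the next non-whitespace
    character is a JSON delimiter (, : } ]) or the end of the input.
    """
    out = []
    in_string = False
    i, n = 0, len(text)
    while i < n:
        ch = text[i]
        if in_string and ch == "\\":
            # Copy an existing escape sequence through untouched.
            out.append(text[i:i + 2])
            i += 2
            continue
        if ch == '"':
            if not in_string:
                in_string = True
            else:
                j = i + 1
                while j < n and text[j].isspace():
                    j += 1
                if j == n or text[j] in ",:}]":
                    in_string = False
                else:
                    # Looks like a quote in the middle of a value.
                    out.append("\\")
        out.append(ch)
        i += 1
    return "".join(out)


broken = '{"name": "the "best" cookies", "tags": ["a", "b"]}'
repaired = json.loads(escape_inner_quotes(broken))
```

Skipping whitespace before checking the next character is what lets this cope with newlines and tabs between a closing quote and the following delimiter.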
Might be unrelated, but extruct's JSON parser also chokes on \t characters in JSON.
Example case:
There are some tabs in the JSON that break extruct. This can be solved by stripping them out:
from extruct.jsonld import JsonLdExtractor

class ExtendJsonLdExtractor(JsonLdExtractor):
    def _extract_items(self, node):
        script = node.xpath('string()')
        # strip tab characters before the script text is parsed
        script = script.replace('\t', '')
        <...>
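For what it's worth, here is a small sketch of why a tab can break parsing and of one alternative to deleting it. It assumes the tab sits inside a string value and that the standard-library json module does the parsing; whether extruct's own parsing path behaves the same way is an assumption. The sample document is invented for the example.

```python
import json

# A literal tab character inside a JSON string value (invented sample).
raw = '{"name": "Nutter\tButter"}'

try:
    json.loads(raw)
    strict_ok = True
except json.JSONDecodeError:
    # The default strict mode rejects raw control characters in strings.
    strict_ok = False

# strict=False tells the stdlib decoder to accept control characters
# inside strings, so the tab is preserved instead of being removed.
lenient = json.loads(raw, strict=False)
```

Compared with `script.replace('\t', '')`, this keeps the original text of the value intact, at the cost of accepting input that is not strictly valid JSON.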
I think extruct should either:
- Expose a hook for script parsing, to allow monkey patching or injection:
  class JsonLdExtractor():
      def process_script(self, script):
          return script
  # then you can monkey patch your cleanup logic
  ext = JsonLdExtractor()
  ext.process_script = lambda script: script.replace('\t','')
- Or implement some basic JSON cleanup in core code.
- Preferably both :P
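To make the hook proposal above concrete, here is a self-contained sketch. Everything in it is hypothetical: `HookedJsonLdParser`, its `cleaners` argument and `parse` method are stand-ins written for this example and are not part of extruct's API.

```python
import json


class HookedJsonLdParser:
    """Hypothetical stand-in for an extractor with a cleanup hook."""

    def __init__(self, cleaners=()):
        # Each cleaner is a callable taking and returning the script text.
        self.cleaners = list(cleaners)

    def process_script(self, script):
        # Run the user-supplied cleanup steps in order before parsing.
        for clean in self.cleaners:
            script = clean(script)
        return script

    def parse(self, script):
        return json.loads(self.process_script(script))


def strip_tabs(script):
    return script.replace("\t", "")


parser = HookedJsonLdParser(cleaners=[strip_tabs, str.strip])
data = parser.parse('  {"name": "Salsa\tNight"}  ')
```

Passing cleaners in at construction time avoids monkey patching, while overriding `process_script` in a subclass would still work for one-off fixes.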
Btw @maugch I can't replicate your issue on www.browneyedbaker.com/nutter-butter-snowmen/
$ scrapy shell http://www.browneyedbaker.com/nutter-butter-snowmen/
In [1]: from extruct.jsonld import JsonLdExtractor
In [2]: JsonLdExtractor().extract(response.body_as_unicode())
In [3]: len(_)
Out[3]: 3
It works correctly here
Hi @Granitosaurus - I downloaded the URL you supplied above and I was able to decode the JSON using json, and I was able to extract it using JsonLdExtractor. Can you provide example code of this failing in your case?
My code, approximately:
>>> import user_agent, requests, json, extruct
>>> from scrapy.http import HtmlResponse
>>> r = requests.get('https://www.alltricks.fr/F-41493-pieces-roues/P-81593-fond_de_jante_notubes_yellow_tape_25_mm_pour_5_jantes', headers={'User-Agent': user_agent.generate_user_agent()})
>>> response = HtmlResponse('https://www.alltricks.fr/F-41493-pieces-roues/P-81593-fond_de_jante_notubes_yellow_tape_25_mm_pour_5_jantes', body=r.content)
>>> data = response.css('script[type="application/ld+json"]::text').extract_first()
>>> json.loads(data)
{'@context': 'http://schema.org/',
'@type': 'Product',
'aggregateRating': {'@type': 'AggregateRating',
'ratingValue': '4.1053',
'reviewCount': '19'},
'brand': {'@type': 'Thing', 'name': 'NoTubes'},
'description': 'Scotch jaune spécial pour rendre étanche les jantes tubeless NoTubes. Détails : Largeur : 25 mm. Longueur : 9.144 m (10 Yards). Un rouleau convient pour 5 jantes 26'' ou 4 jantes 29''. Compatibilités : ZTR 355 (26", 650b, 29"). ZTR Crest. ZTR Arch EX. ZTR Flow EX. #shortcode_video .row { display:block; } #shortcode_video .col { padding:15px; }',
'image': 'https://media.alltricks.com/medium/56bdff3278142.jpg',
'name': 'Fond de Jante NOTUBES YELLOW TAPE 25 mm Pour 5 Jantes',
'offers': {'@type': 'Offer',
'availability': 'http://schema.org/InStock',
'price': '14.99',
'priceCurrency': 'EUR',
'seller': {'@type': 'Organization', 'name': 'Alltricks'}}}
>>> extruct.jsonld.JsonLdExtractor().extract(r.content)
[{'@context': 'http://schema.org/',
'@type': 'Product',
'aggregateRating': {'@type': 'AggregateRating',
'ratingValue': '4.1053',
'reviewCount': '19'},
'brand': {'@type': 'Thing', 'name': 'NoTubes'},
'description': 'Scotch jaune spécial pour rendre étanche les jantes tubeless NoTubes. Détails : Largeur : 25 mm. Longueur : 9.144 m (10 Yards). Un rouleau convient pour 5 jantes 26'' ou 4 jantes 29''. Compatibilités : ZTR 355 (26", 650b, 29"). ZTR Crest. ZTR Arch EX. ZTR Flow EX. #shortcode_video .row { display:block; } #shortcode_video .col { padding:15px; }',
'image': 'https://media.alltricks.com/medium/56bdff3278142.jpg',
'name': 'Fond de Jante NOTUBES YELLOW TAPE 25 mm Pour 5 Jantes',
'offers': {'@type': 'Offer',
'availability': 'http://schema.org/InStock',
'price': '14.99',
'priceCurrency': 'EUR',
'seller': {'@type': 'Organization', 'name': 'Alltricks'}}}]
I tried again just now and I don't get an exception, so I suppose they corrected it. According to my previous comment, there was a text "buttons" that I don't see anymore. This is what I see in Firefox:
the two “buttons”
My code is simple (now even simplified for this comment):
results = response.css("script[type='application/ld+json']").extract()
jslde = JsonLdExtractor()
data = jslde.extract(results[1])
Hey @maugch - Glad to hear your problem has resolved. Pity we couldn't capture test cases before it disappeared, though. :)
@Granitosaurus - Any chance you can replicate, and if so can you capture failing HTML so we can use it to build a test case?
Hey folks, I'll close this for now, but if anyone can find us a failure case we can work with, we'll reopen. :)
Here's a tragic example:
http://montalvoarts.org/events/summernights18_salsa/
They omit a closing brace in their "location" field in their ld+json in every event on their site. When parsing manually, I'm able to correct this and extract the events. I'm looking at moving to extruct and it would be great if this site kept working.
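A minimal sketch of how an omitted closing brace like this could be repaired before parsing. It is an assumption-laden heuristic, not something extruct is known to do: it tracks open `{` and `[` outside of strings and, when a closer arrives that does not match the innermost open container, inserts the missing closers first. As a side effect, a closer with no matching opener is dropped. The name `balance_brackets` and the shortened sample are made up for the example.

```python
import json

PAIRS = {"{": "}", "[": "]"}


def balance_brackets(text):
    """Insert missing closing braces/brackets; drop unmatched closers."""
    out = []
    stack = []  # closers we still expect, innermost last
    in_string = False
    escaped = False
    for ch in text:
        if in_string:
            out.append(ch)
            if escaped:
                escaped = False
            elif ch == "\\":
                escaped = True
            elif ch == '"':
                in_string = False
            continue
        if ch == '"':
            in_string = True
        elif ch in PAIRS:
            stack.append(PAIRS[ch])
        elif ch in "}]":
            if ch not in stack:
                continue  # stray closer with no opener: skip it
            while stack[-1] != ch:
                out.append(stack.pop())  # close skipped containers
            stack.pop()
        out.append(ch)
    out.extend(reversed(stack))  # close anything still open at the end
    return "".join(out)


sample = ('[{"@type": "Event", "name": "Salsa Night", '
          '"location": {"@type": "EventVenue", '
          '"name": "Montalvo Arts Center"}]')
events = json.loads(balance_brackets(sample))
```

The repair guesses where the missing brace belongs (immediately before the mismatched closer), which matches this site's markup but could produce a different nesting than the publisher intended on other inputs.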
Some (but not all) issues raised in this thread were fixed in #85
Yet another JSON-LD block with wrong data, again on a Recipe site. I suppose there is a WordPress plugin that isn't working correctly. There is a ] at the end that shouldn't be there.