scrapinghub / extruct
Extract embedded metadata from HTML markup
License: BSD 3-Clause "New" or "Revised" License
Hello,
ref your example
{ 'json-ld': [ { '@context': 'http://schema.org',
'@id': 'FP',
'@type': 'Product',
'brand': { '@type': 'Brand',
'url': 'https://www.sarenza.com/i-love-shoes'},
'color': ['Lava', 'Black', 'Lt grey'],
'image': [ 'https://cdn.sarenza.net/_img/productsv4/0000119412/MD_0000119412_223992_08.jpg?201509221045&v=20180313113923'],
'name': 'Susket',
'offers': { '@type': 'AggregateOffer',
'availability': 'InStock',
'highPrice': '49.00',
'lowPrice': '0.00',
'price': '0.00',
'priceCurrency': 'EUR'}}],
Is it possible to extract exactly the name and image values from the command line? I mean something like
extruct "https://www.sarenza.com/i-love-shoes-susket-s767163-br964-t76-p0000119412" --syntaxes json-ld | extruct name,image
which would output the bare values:
Susket
https://cdn.sarenza.net/_img/productsv4/0000119412/MD_0000119412_223992_08.jpg?201509221045&v=20180313113923
Thanks in advance for any hint!
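As far as I can tell, the extruct CLI has no field-selection flag, so a short Python script is the usual route. A minimal sketch, reusing the URL from the example above:

import requests
import extruct

url = 'https://www.sarenza.com/i-love-shoes-susket-s767163-br964-t76-p0000119412'
r = requests.get(url)
data = extruct.extract(r.text, base_url=url, syntaxes=['json-ld'])
for item in data['json-ld']:
    print(item.get('name'))
    images = item.get('image', [])
    # 'image' may be a single string or a list, depending on the page
    for image in images if isinstance(images, list) else [images]:
        print(image)

Alternatively, piping the CLI output through jq (something like jq -r '."json-ld"[] | .name, .image[]') should also work.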
I'm running extruct using Docker, but I have a problem.
FROM python:3.5
# See https://github.com/scrapinghub/extruct
RUN pip install bottle
RUN pip install gevent
RUN pip install requests
RUN pip install extruct==0.7.3
WORKDIR /usr/src/app
# This will run the server on port 10005
CMD [ "python", "-m", "extruct.service" ]

# To build, run:
#   docker build -t python-extruct .
# To run the HTTP server:
#   docker run -p 10005:10005 python-extruct
# To check usage over HTTP:
#   curl http://your_IP:10005/extruct/http://www.sarenza.com/i-love-shoes-susket-s767163-p0000119412
/usr/local/lib/python3.7/site-packages/extruct/service.py:13: MonkeyPatchWarning: Monkey-patching ssl after ssl has already been imported may lead to errors, including RecursionError on Python 3.6. It may also silently lead to incorrect behaviour on Python 3.7. Please monkey-patch earlier. See https://github.com/gevent/gevent/issues/1016. Modules that had direct imports (NOT patched): ['urllib3.util (/usr/local/lib/python3.7/site-packages/urllib3/util/__init__.py)', 'urllib3.util.ssl_ (/usr/local/lib/python3.7/site-packages/urllib3/util/ssl_.py)'].
monkey.patch_all()
Bottle v0.12.16 server starting up (using GeventServer())...
Listening on http://0.0.0.0:10005/
pip list
---> Running in c0d1c0855f84
Package Version
-------------- --------
beautifulsoup4 4.7.1
bottle 0.12.16
certifi 2019.3.9
chardet 3.0.4
extruct 0.7.3
gevent 1.4.0
greenlet 0.4.15
html5lib 1.0.1
idna 2.8
isodate 0.6.0
lxml 4.3.4
mf2py 1.1.2
pip 19.1.1
pyparsing 2.4.0
rdflib 4.2.2
rdflib-jsonld 0.4.0
requests 2.22.0
setuptools 41.0.1
six 1.12.0
soupsieve 1.9.1
urllib3 1.25.3
w3lib 1.20.0
webencodings 0.5.1
wheel 0.33.4
When I send an HTTP request to the server:
http://192.168.5.134:10005/extruct/https://www.sarenza.com/i-love-shoes-susket-s767163-br964-t76-p0000119412
I get an error that is probably related to the gevent monkey-patching warning above:
{"url": "https://www.sarenza.com/i-love-shoes-susket-s767163-br964-t76-p0000119412", "status": "error", "message": "RecursionError('maximum recursion depth exceeded')"}
Right now extruct.extract has a url parameter which is documented as "url of the html documents", but in reality it's used as a base URL (at least in LxmlMicrodataExtractor, maybe in others as well). I think we should check whether it's indeed always used as a base URL, update the documentation, and introduce a base_url argument, deprecating url. Another option would be to extract base_url in extruct, but this feels like a worse solution to me (what if the caller already has a base_url, or has a more accurate one?), although we could also support both base_url and url.
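If we go the deprecation route, a shim along these lines would keep old callers working. A hedged sketch (signature details assumed):

import warnings

def extract(htmlstring, base_url=None, url=None, **kwargs):
    # Hypothetical shim: accept the old `url` name, warn, and forward it.
    if url is not None:
        warnings.warn("the 'url' argument is deprecated, use 'base_url'",
                      DeprecationWarning, stacklevel=2)
        if base_url is None:
            base_url = url
    ...  # continue using base_url as before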
Hello,
extruct is working very well for my use case and I get plenty of structured text out of websites.
I'm mostly using Microdata.
Unfortunately, some websites seem to have different structures from others, so for example, sometimes I'd get a nested item:
'brand': {'properties': {'name': 'NIKE'}, 'type': 'http://schema.org/Brand'},
and sometimes a string:
'brand': 'NIKE',
So to access the data, I'd need to do something like:
if isinstance(productData['brand'], dict):
    if productData['brand']['type'] == 'http://schema.org/Brand':
        self.brand = productData['brand']['properties']['name']
elif isinstance(productData['brand'], str):
    self.brand = productData['brand']
Is this the best way to go or am I doing this in a clumsy way?
Thanks,
Chris
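One way to keep that branching out of the main code is a small accessor that normalizes both shapes. A minimal sketch (helper name made up):

def get_brand_name(product_data):
    # Works whether 'brand' is a nested microdata item or a plain string.
    brand = product_data.get('brand')
    if isinstance(brand, dict):
        return brand.get('properties', {}).get('name')
    return brand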
Is there a way to pass headers and also set cookies in the request that is made?
I don't know whether JSON followed by a semicolon constitutes valid JSON-LD, but I have encountered it in the wild.
Running extruct on the following works fine:
<script type="application/ld+json">
{}
</script>
However, this breaks:
<script type="application/ld+json">
{};
</script>
The error message looks like this:
Failed to extract json-ld, raises Extra data: line 2 column 3 (char 3)
Traceback (most recent call last):
File "/usr/local/lib/python3.6/dist-packages/extruct/jsonld.py", line 34, in _extract_items
data = json.loads(script, strict=False)
File "/usr/lib/python3.6/json/__init__.py", line 367, in loads
return cls(**kw).decode(s)
File "/usr/lib/python3.6/json/decoder.py", line 342, in decode
raise JSONDecodeError("Extra data", s, end)
json.decoder.JSONDecodeError: Extra data: line 2 column 3 (char 3)
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/local/lib/python3.6/dist-packages/extruct/_extruct.py", line 101, in extract
output[syntax] = list(extract(document, base_url=base_url))
File "/usr/local/lib/python3.6/dist-packages/extruct/jsonld.py", line 26, in extract_items
for items in map(self._extract_items, self._xp_jsonld(document))
File "/usr/local/lib/python3.6/dist-packages/extruct/jsonld.py", line 25, in <listcomp>
item
File "/usr/local/lib/python3.6/dist-packages/extruct/jsonld.py", line 38, in _extract_items
HTML_OR_JS_COMMENTLINE.sub('', script), strict=False)
File "/usr/lib/python3.6/json/__init__.py", line 367, in loads
return cls(**kw).decode(s)
File "/usr/lib/python3.6/json/decoder.py", line 342, in decode
raise JSONDecodeError("Extra data", s, end)
json.decoder.JSONDecodeError: Extra data: line 2 column 3 (char 3)
Extruct version 0.7.2; Python version 3.6.7; Ubuntu 18.04.2
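A tolerant decoder could drop such trailing junk before parsing. A minimal sketch (regex name made up):

import json
import re

TRAILING_SEMICOLON = re.compile(r';\s*$')

def loads_jsonld(script):
    # Strip a trailing semicolon, as seen in the wild, then decode as usual.
    return json.loads(TRAILING_SEMICOLON.sub('', script), strict=False)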
Instead of calling each extractor for individual metadata formats, there could be a do-all extractor combining the results of several extractors.
Something like (pseudo-Python code):
import lxml.html

class GenericExtractor:
    def __init__(self, extractors):
        # extractors: mapping of syntax name -> extractor instance
        self.extractors = extractors

    def extract(self, htmlstring, url):
        tree = lxml.html.fromstring(htmlstring)
        return self.extract_items(tree, url)

    def extract_items(self, tree, url):
        output = {}
        for name, extractor in self.extractors.items():
            output[name] = extractor.extract_items(tree, url)
        return output
At present, extruct supports an HTTP API for "testing", but that carries a maintenance burden, and it invites feature requests that may nudge it more and more into becoming a monolithic proxy service. That's not really where we want extruct to be, I think.
Similarly with the HTTP-client mode and the CLI tool that offers it - it's a mode of operation which probably shouldn't be our priority with extruct. I feel that if we provide a CLI client for extruct, it should probably just accept HTML through a Unix pipe or from a file, and operate on that. That way, people can use curl or wget or whatever else they like, and they won't worry about extruct's support for various HTTP client features.
Thoughts? :)
Hi, when I use extruct to extract microdata from an HTML element that contains script or style tags, I run into a problem: the tags are skipped, but their content remains.
This behaviour happens because LxmlMicrodataExtractor.extract_textContent uses lxml.html.tostring(node, method="text", encoding='unicode', with_tail=False) with method "text".
Probably we have to add a parameter to allow using the "html" method, and maybe a way to use the lxml Cleaner (http://lxml.de/lxmlhtml.html#cleaning-up-html).
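Until something like that exists, pre-cleaning the markup before extraction works around it. A minimal sketch using lxml's Cleaner (note this also strips JSON-LD script blocks, so apply it only when extracting microdata):

from lxml.html.clean import Cleaner  # packaged separately as lxml_html_clean in recent lxml releases

cleaner = Cleaner(scripts=True, javascript=True, style=True)

def strip_scripts_and_styles(htmlstring):
    # Remove <script>/<style> elements together with their text content.
    return cleaner.clean_html(htmlstring)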
Please add functions that accept a pre-parsed lxml etree instead of an HTML string.
Also, using a library such as "ujson" may significantly speed up JSON-LD processing.
r = requests.get(url)
data = extruct.extract(r.text, r.url)
Why am I getting an error this way?
This package is great. Thanks for it and the other packages from Scrapinghub.
Image captions and credits are included in the article body, which messes up the article content.
Extruct ought to support microformats.
An exception like this can be raised by functions from extruct.utils:
document = parse_xmldom_html(html_string, encoding=encoding)
File "/usr/local/lib/python3.6/dist-packages/extruct/utils.py", line 16, in parse_xmldom_html
return lxml.html.fromstring(html, parser=parser)
File "/usr/local/lib/python3.6/dist-packages/lxml/html/__init__.py", line 876, in fromstring
doc = document_fromstring(html, parser=parser, base_url=base_url, **kw)
File "/usr/local/lib/python3.6/dist-packages/lxml/html/__init__.py", line 765, in document_fromstring
"Document is empty")
lxml.etree.ParserError: Document is empty
In parsel this is worked around: empty documents are handled explicitly, and there is a fix for null bytes as well. I think we should bring similar fixes to extruct. See https://github.com/scrapy/parsel/blob/e01093cf6342c90445028de28034b3cc3d2ead8b/parsel/selector.py#L38.
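A hedged sketch of what mirroring parsel's workaround could look like (wrapper name made up):

from extruct.utils import parse_xmldom_html

def parse_html_safe(htmlstring, encoding='utf-8'):
    # Strip null bytes and substitute a minimal document when the input is
    # empty, so lxml never raises "Document is empty".
    htmlstring = htmlstring.replace('\x00', '')
    if not htmlstring.strip():
        htmlstring = '<html/>'
    return parse_xmldom_html(htmlstring, encoding=encoding)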
Steps to reproduce:
Log in to an Ubuntu box.
Run extruct as a service:
nohup python -m extruct.service &
Hit http://localhost:10005/extruct/ in a loop for 20k URLs. Memory consumption increases over time and never comes down.
mf2py already supports and returns them, but we are only keeping h-entry. h-item and h-product are of particular interest.
List: http://microformats.org/wiki/Microformats2#v2_vocabularies
It would be nice to have a command-line script that, given a URL, would try to run all the extractors (much like the web service) and output JSON with the results.
extruct URL
I have this JSON, but I want only the items annotated with @type Product, not @type BreadcrumbList. Is there a way to get only Product?
[
{
"@context": "http://schema.org",
"@type": "BreadcrumbList",
"itemListElement": [
{
"@type": "ListItem",
"position": 1,
"item": {
"@id": "https://concordpetfoods.com/collections",
"name": "Collections"
}
},
{
"@type": "ListItem",
"position": 2,
"item": {
"@id": "https://concordpetfoods.com/collections/dog",
"name": "Dog"
}
},
{
"@type": "ListItem",
"position": 3,
"item": {
"@id": "https://concordpetfoods.com/collections/dog/products/blue-buffalo-blue-wilderness-rocky-mountain-recipe-adult-healthy-weight-red-meat-dry-dog-food",
"name": "Blue Buffalo BLUE Wilderness Rocky Mountain Recipe Adult Healthy Weight Red Meat Dry Dog Food"
}
}
]
},
{
"@context": "http://schema.org/",
"@type": "Product",
"name": "Blue Buffalo BLUE Wilderness Rocky Mountain Recipe Adult Healthy Weight Red Meat Dry Dog Food",
"image": "https://cdn.shopify.com/s/files/1/2382/0223/products/35913-1501600645_fc502f43-827d-4a76-a639-90c668e5e4bc_1024x1024.png?v=1533919507",
"description": "
Looking for a great food to help your four legged best friend reach and maintain their ideal weight? Blue Buffalo has got just the food for you with their BLUE Wilderness Rocky Mountain Recipe Adult Healthy Weight Red Meat Dry Dog Food! This grain-free, protein-rich food contains the finest natural ingredients and provides multiple sources of protein using deboned beef, lamb and venison without the added calories! Blue Buffalo BLUE Wilderness Rocky Mountain Recipe Adult Healthy Weight Red Meat Dry Dog Food also includes blueberries, cranberries and carrots to help support antioxidant-enrichment. Put on your spandex, Rover! Let’s get physical!
BLUE Buffalo's True Blue promise is the pillar of their business, straight to every customer; the finest natural ingredients, and no chicken/poultry by-product meals, corn, wheat, soy, artificial preservatives, colors or flavors. BLUE Buffalo is the only food made with unique Lifesource Bits; a precise blend of vitamins, minerals and antioxidants created by veterinarians and animal nutritionists. With recipes for all tastes and diets, including limited ingredient diets, high protein, grain-free, wholesome grains, and exotic proteins, BLUE Buffalo always starts with real meat, and ends with good health.
Nutrient | Guaranteed Units |
---|---|
Crude Protein | 30.0% min |
Crude Fat | 10% min |
Crude Fiber | 10.0% max |
Moisture | 10.0% max |
Calcium | 1.2% min |
Phosphorus | 0.9% min |
Omega-3 Fatty Acids | 0.5% min |
Omega-6 Fatty Acids | 1.5% min |
L-Carnitine | 150 mg/kg min |
Glucosamine | 400 mg/kg min |
Chondroitin Sulfate | 300 mg/kg min |
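extruct returns everything it finds, so the filtering has to happen on your side. A minimal sketch, assuming the page URL from the JSON above:

import requests
import extruct

url = 'https://concordpetfoods.com/collections/dog/products/blue-buffalo-blue-wilderness-rocky-mountain-recipe-adult-healthy-weight-red-meat-dry-dog-food'
r = requests.get(url)
data = extruct.extract(r.text, base_url=url, syntaxes=['json-ld'])
products = [item for item in data['json-ld'] if item.get('@type') == 'Product']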
It would be nice to be able to call the extruct command line tool using python -m extruct.
With this, people will be able to use a specific Python version, and it may also help people with issues on their system's PATH.
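Supporting this is usually just a __main__.py that delegates to the existing console-script entry point. A minimal sketch (the exact location of main is an assumption here):

# extruct/__main__.py
import sys

from extruct.tool import main  # assumed: wherever the CLI entry point lives

if __name__ == '__main__':
    sys.exit(main())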
Hi, I've been using extruct pretty successfully, but I came across a URL that seems to validate OK yet produces an error when I run it through:
File "/Users/frankapap/KCApp/extruct/RecipeInfoService.py", line 24, in recipeExtract
data = extruct.extract(r.text, base_url=base_url,syntaxes=['json-ld', 'opengraph'],uniform=True)
File "/usr/local/lib/python3.7/site-packages/extruct/_extruct.py", line 67, in extract
output[label] = list(extract(document, base_url=base_url))
File "/usr/local/lib/python3.7/site-packages/extruct/jsonld.py", line 25, in extract_items
self._xp_jsonld(document))
File "/usr/local/lib/python3.7/site-packages/extruct/jsonld.py", line 26, in <listcomp>
for item in items
TypeError: 'NoneType' object is not iterable
The code is:
data = extruct.extract(r.text, base_url=base_url,syntaxes=['json-ld', 'opengraph'],uniform=True)
The URL being passed in is https://www.tasteofhome.com/collection/keto-diet-recipes/view-all/
I came across some JSON-LD on a site that contained a &amp; and I assumed that I had accidentally escaped something somewhere. However, I found that that was what was actually in the content, and also that the standard says it should be there.
For my application, I would like that &amp; to be a &, but I was wondering if extruct should be doing this already?
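If extruct leaves entities alone, unescaping after extraction is a one-liner with the standard library. A minimal sketch:

import html

print(html.unescape('Tom &amp; Jerry'))  # -> Tom & Jerry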
Some web pages contain badly formatted JSON-LD data; here is an example.
The JSON-LD in this page is:
{
"@context": "http://schema.org",
"@type": "Product",
"name": "Black 'Clint' FT0511 cat eye sunglasses",
"image": "https://debenhams.scene7.com/is/image/Debenhams/60742_1515029001",
"brand": {
"@type": "Thing",
"name": "Tom Ford"
},
"offers": {
"@type": "Offer",
"priceCurrency": "GBP",
"price": "285.00",
"itemCondition": "http://schema.org/NewCondition",
"availability": "http://schema.org/InStock"
}
}
}
In the JSON-LD above, the last } is extra, and neither extruct nor json.loads will handle it properly.
json.loads in Python 3.5+ gives detailed error information, such as JSONDecodeError: Extra data: line 19 column 1 (char 624):
In [7]: try:
   ...:     data = json.loads(json_ld_string)
   ...: except json.JSONDecodeError as err:
   ...:     print(err)
   ...:     print(err.msg)
   ...:     print(err.pos)
   ...:
Extra data: line 19 column 1 (char 624)
Extra data
624
The err.msg and err.pos attributes give some hints for fixing the JSON-LD data; e.g., for this one we can remove the character at position 624 and parse the data string again to correctly get:
{'@context': 'http://schema.org',
'@type': 'Product',
'brand': {'@type': 'Thing', 'name': 'Tom Ford'},
'image': 'https://debenhams.scene7.com/is/image/Debenhams/60742_1515029001',
'name': "Black 'Clint' FT0511 cat eye sunglasses",
'offers': {'@type': 'Offer',
'availability': 'http://schema.org/InStock',
'itemCondition': 'http://schema.org/NewCondition',
'price': '285.00',
'priceCurrency': 'GBP'}}
There are many possible format errors; some can be fixed easily, some might be harder or even impossible.
I propose 3 ways to improve the situation:
1. extruct tries various ways to fix the JSON-LD data case by case, but this needs Python >= 3.5 to get detailed error info.
2. extruct allows the user to pass in a function to parse JSON data, letting the user handle their own possible error types.
3. extruct outputs the extracted JSON-LD string rather than parsed data, letting the user parse it and handle their own possible error types.
I personally recommend the latter 2 ways.
Thanks.
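For what it's worth, the first option can get surprisingly far with just err.pos. A hedged sketch that handles only the "Extra data" case:

import json

def loads_trimming_extra(json_ld_string, max_retries=3):
    # Retry decoding, trimming trailing junk flagged as "Extra data".
    for _ in range(max_retries):
        try:
            return json.loads(json_ld_string)
        except json.JSONDecodeError as err:
            if err.msg != 'Extra data':
                raise
            json_ld_string = json_ld_string[:err.pos]
    return json.loads(json_ld_string)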
Some pages have JSON-LD with control characters.
One example is: https://www.johnlewis.com/sony-xperia-x-smartphone-android-5-4g-lte-sim-free-32gb/p3210080
When you try to extract JSON-LD data from this page, you'll get:
Invalid control character at: line 8 column 353 (char 625)
Maybe we need to change JsonLdExtractor._extract_items() in extruct/extruct/jsonld.py as below:
from json import JSONDecodeError

def _extract_items(self, node):
    script = node.xpath('string()')
    try:
        data = json.loads(script)
    except ValueError:
        # Sometimes JSON-decoding errors are due to leading HTML or JavaScript
        # comments; if stripping them is not enough, retry the stripped script
        # without strict parsing so control characters in strings are accepted.
        try:
            data = json.loads(HTML_OR_JS_COMMENTLINE.sub('', script))
        except JSONDecodeError:
            data = json.loads(HTML_OR_JS_COMMENTLINE.sub('', script), strict=False)
    if isinstance(data, list):
        return data
    elif isinstance(data, dict):
        return [data]
I have been using extruct inside a Scrapy spider and the code gets stuck in the middle: it neither goes forward nor skips the URL. There is also no error, no exception, nothing.
#115 was a step in the right direction (prefer first results), but it seems it is not the whole solution, as empty results should not be prioritized.
E.g. on https://www.triganostore.com/tente-de-camping-raclet-bora-4.html there are two og:description values, and the first one is empty. https://developers.facebook.com/tools/debug/sharing/?q=https%3A%2F%2Fwww.triganostore.com%2Ftente-de-camping-raclet-bora-4.html shows that the non-empty one is extracted.
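The selection rule being suggested is roughly "first non-empty value wins". A minimal sketch:

def first_non_empty(values):
    # Prefer the first value that is not empty or whitespace-only.
    for value in values:
        if value and value.strip():
            return value
    return values[0] if values else None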
See below, the output is a nested list rather than a list of dicts.
In [9]: url = 'http://www.superpages.com/yellowpages/c-nurseries/s-wa/t-redmond/'
In [10]: r = requests.get(url)
In [11]: ex = extruct.jsonld.JsonLdExtractor()
In [12]: ex.extract(r.text)['items']
[[{'@context': 'http://schema.org',
'@type': 'LocalBusiness',
'address': {'@type': 'PostalAddress',
'addressLocality': 'Redmond',
'addressRegion': 'WA',
'postalCode': '98053',
'streetAddress': '20871 NE Redmond Fall City Rd'},
'description': "There's more to a beautiful garden than what meets the eye.",
'name': 'Gray Barn Nursery',
'telephone': '888-820-9506'},
...
]]
I used the REST API service for extracting embedded metadata from HTML markup.
When I do a GET request for a site that shows a popup at the beginning, the metadata returned is that of the popup.
Example site: Faballey.com
Request:
http://localhost:10005/extruct/http://www.faballey.com/hot-mesh-maxi-skirt-87
Is there a way we could block/skip the popup and get the metadata of the required page?
Hello! Is there a way to deal with asynchronously loaded JSON-LD, such as on this URL -> https://www.omicsdi.org/dataset/arrayexpress-repository/E-GEOD-33515
extruct "https://www.omicsdi.org/dataset/arrayexpress-repository/E-GEOD-33515" results in no JSON-LD metadata being returned.
In cases when one is not interested in all of the microdata but only some parts, the approach to filter the required content is not very straightforward. Can we support look-up by itemtype or itemprop values as follows:
>>> data = mde.extruct(html)
>>> data.get_first(itemprop='name')
'foo'
>>> data.get(itemtype='http://schema.org/Person')
[{'name': 'foo', 'jobTitle': 'bar', 'additionalName': 'foobar'}]
>>> data.get(itemtype='http://schema.org/Person', itemprop='name')
['foo', 'abc', 'cde', 'def']
>>> data.get(itemtype='http://schema.org/Organization', itemprop='name')
['foocompany']
or a cleaner version with some sort of built-in support for popular vocabularies.
>>> data.get(itemtype=schema_org.Person)
[{'name': 'foo', 'jobTitle': 'bar', 'additionalName': 'foobar'}, {'name': 'abc', ...}]
>>> data.get(itemtype=schema_org.Person, itemprop='name')
['foo', 'abc', 'cde', 'def']
>>> data.get_first(itemtype=schema_org.Organization, itemprop='name')
'foocompany'
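Pending built-in support, this kind of lookup is easy to layer over extruct's current microdata output. A hedged sketch (top-level items only, no recursion into nested items):

def get(data, itemtype=None, itemprop=None):
    # Filter extruct's microdata items by itemtype and/or itemprop.
    results = []
    for item in data.get('microdata', []):
        if itemtype and item.get('type') != itemtype:
            continue
        properties = item.get('properties', {})
        if itemprop is None:
            results.append(properties)
        elif itemprop in properties:
            results.append(properties[itemprop])
    return results

def get_first(data, **kwargs):
    matches = get(data, **kwargs)
    return matches[0] if matches else None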
I've found at least a couple of bad JSON-LD documents that extruct can't read.
File "/cygdrive/d/recipeWorkspace/python/parsers.py", line 25, in readJsonLd
data = jslde.extract(html)
File "/usr/lib/python2.7/site-packages/extruct/jsonld.py", line 21, in extract
return self.extract_items(lxmldoc)
File "/usr/lib/python2.7/site-packages/extruct/jsonld.py", line 25, in extract_items
self._xp_jsonld(document))
File "/usr/lib/python2.7/site-packages/extruct/jsonld.py", line 35, in _extract_items
data = json.loads(HTML_OR_JS_COMMENTLINE.sub('', script))
File "/usr/lib/python2.7/json/__init__.py", line 339, in loads
return _default_decoder.decode(s)
File "/usr/lib/python2.7/json/decoder.py", line 364, in decode
obj, end = self.raw_decode(s, idx=_w(s, 0).end())
File "/usr/lib/python2.7/json/decoder.py", line 380, in raw_decode
obj, end = self.scan_once(s, idx)
ValueError: Expecting , delimiter: line 20 column 778 (char 1342)
The reason is the unescaped quotes inside the text. For example:
"recipeInstructions": [
"1. blablabla two "buttons".5. Dab Snowmen!"
]
HTML allows this, but it's not possible to parse it as JSON. Is there an easy way to correct similar issues automatically?
Hi,
I wanted to ask if anyone out there has used extruct on AWS Lambda? I tested running an extruct function which fails for RDFa; the other default metadata types are fine.
A simple test case:
import pprint as pp
import requests
from extruct.rdfa import RDFaExtractor
import config_files.logging_config as log

logger = log.logger

def main():
    try:
        import extruct
        logger.info("Testing importing extruct which loaded successfully")
        import rdflib
        logger.info("Testing importing rdflib which loaded successfully")
        import extruct.rdfa
        logger.info("Testing importing rdfa which loaded successfully")
        from extruct.rdfa import RDFaExtractor
        logger.info("Testing importing RDFaExtractor which loaded successfully")
    except ImportError as e:
        logger.error("failed to import : {}".format(e))
    try:
        url = 'https://www.littlewoods.com/ri-plus-floral-trumpet-sleeve-top/1600159211.prd'
        r = requests.get(url)
        rdfae = RDFaExtractor()
        rdfa_json = rdfae.extract(r.text, base_url=None)
        pp.pprint(rdfa_json)
    except Exception as e:
        logger.exception("Failed to extract rdfa. Error: {}".format(e))

main()
The relevant part of the pipenv graph for extruct when I build the artifact.zip file:
extruct==0.7.1
- lxml [required: Any, installed: 3.6.0]
- mf2py [required: Any, installed: 1.1.2]
- BeautifulSoup4 [required: >=4.6.0, installed: 4.7.1]
- soupsieve [required: >=1.2, installed: 1.6.2]
- html5lib [required: >=1.0.1, installed: 1.0.1]
- six [required: >=1.9, installed: 1.11.0]
- webencodings [required: Any, installed: 0.5.1]
- requests [required: >=2.18.4, installed: 2.18.4]
- certifi [required: >=2017.4.17, installed: 2018.11.29]
- chardet [required: >=3.0.2,<3.1.0, installed: 3.0.4]
- idna [required: >=2.5,<2.7, installed: 2.6]
- urllib3 [required: >=1.21.1,<1.23, installed: 1.22]
- rdflib [required: Any, installed: 4.2.2]
- isodate [required: Any, installed: 0.6.0]
- six [required: Any, installed: 1.11.0]
- pyparsing [required: Any, installed: 2.3.0]
- rdflib-jsonld [required: Any, installed: 0.4.0]
- rdflib [required: >=4.2, installed: 4.2.2]
- isodate [required: Any, installed: 0.6.0]
- six [required: Any, installed: 1.11.0]
- pyparsing [required: Any, installed: 2.3.0]
- six [required: Any, installed: 1.11.0]
- w3lib [required: Any, installed: 1.19.0]
- six [required: >=1.4.1, installed: 1.11.0]
When I run this locally in the same pipenv env (Ubuntu 17.10, Docker, 17.12.0-ce, pipenv==v2018.11.26), I don't experience any issues. On lambda invocation I log the following stack trace:
2019-01-10 14:32:49,092:INFO:pid 1:Testing importing extruct which loaded successfully
2019-01-10 14:32:49,092:INFO:pid 1:Testing importing rdflib which loaded successfully
2019-01-10 14:32:49,092:INFO:pid 1:Testing importing rdfa which loaded successfully
2019-01-10 14:32:49,092:INFO:pid 1:Testing importing RDFaExtractor which loaded successfully
2019-01-10 14:32:51,753:ERROR:pid 1:Failed to extract rdfa. Error: No plugin registered for (json-ld, <class 'rdflib.serializer.Serializer'>)
Traceback (most recent call last):
File "/var/task/rdflib/plugin.py", line 100, in get
p = _plugins[(name, kind)]
KeyError: ('json-ld', <class 'rdflib.serializer.Serializer'>)
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/var/task/metadata_extractor/rdfa_extract_poc.py", line 15, in main
rdfa_json = rdfae.extract(r.text, base_url=None)
File "/var/task/extruct/rdfa.py", line 35, in extract
return self.extract_items(tree, base_url=base_url, expanded=expanded)
File "/var/task/extruct/rdfa.py", line 48, in extract_items
jsonld_string = g.serialize(format='json-ld', auto_compact=not expanded).decode('utf-8')
File "/var/task/rdflib/graph.py", line 940, in serialize
serializer = plugin.get(format, Serializer)(self)
File "/var/task/rdflib/plugin.py", line 103, in get
"No plugin registered for (%s, %s)" % (name, kind))
rdflib.plugin.PluginException: No plugin registered for (json-ld, <class 'rdflib.serializer.Serializer'>)
I have been scratching my head over this but can't figure it out. What should I try? Thanks in advance!
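rdflib discovers serializer plugins through package entry points, which zip-based Lambda bundles sometimes lose. Registering the JSON-LD serializer explicitly before the first extract call may help; a hedged sketch:

from rdflib import plugin
from rdflib.serializer import Serializer

# Register the rdflib-jsonld serializer manually, bypassing entry-point discovery.
plugin.register('json-ld', Serializer, 'rdflib_jsonld.serializer', 'JsonLDSerializer')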
I'm trying to parse structured metadata from this url. I first executed this code on the example URL https://www.optimizesmart.com/how-to-use-open-graph-protocol/:
import extruct
import requests
from w3lib.html import get_base_url

def extract_metadata(url):
    r = requests.get(url)
    base_url = get_base_url(r.text, r.url)
    data = extruct.extract(r.text, base_url=base_url)
    return data

url = 'https://www.optimizesmart.com/how-to-use-open-graph-protocol/'
data = extract_metadata(url)
print(data)
And it works just fine. However, this block of code:
url = 'https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/G4TBLF'
data = extract_metadata(url)
print(data)
returns this error
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-13-f0db0dd65eaf> in <module>()
1 url = 'https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/G4TBLF'
----> 2 data = extract_metadata(url)
3 print(data)
<ipython-input-3-25c85aeebf1a> in extract_metadata(url)
2 r = requests.get(url)
3 base_url = get_base_url(r.text, r.url)
----> 4 data = extruct.extract(r.text, base_url=base_url)
5 return(data)
/usr/local/lib/python3.5/dist-packages/extruct/_extruct.py in extract(htmlstring, base_url, encoding, syntaxes, errors, uniform, return_html_node, schema_context, **kwargs)
50 raise ValueError('Invalid error command, valid values are either "log"'
51 ', "ignore" or "strict"')
---> 52 tree = parse_xmldom_html(htmlstring, encoding=encoding)
53 processors = []
54 if 'microdata' in syntaxes:
/usr/local/lib/python3.5/dist-packages/extruct/utils.py in parse_xmldom_html(html, encoding)
14 """ Parse HTML using XmlDomHTMLParser, return a tree """
15 parser = XmlDomHTMLParser(encoding=encoding)
---> 16 return lxml.html.fromstring(html, parser=parser)
/usr/local/lib/python3.5/dist-packages/lxml/html/__init__.py in fromstring(html, base_url, parser, **kw)
874 else:
875 is_full_html = _looks_like_full_html_unicode(html)
--> 876 doc = document_fromstring(html, parser=parser, base_url=base_url, **kw)
877 if is_full_html:
878 return doc
/usr/local/lib/python3.5/dist-packages/lxml/html/__init__.py in document_fromstring(html, parser, ensure_head_body, **kw)
760 if parser is None:
761 parser = html_parser
--> 762 value = etree.fromstring(html, parser, **kw)
763 if value is None:
764 raise etree.ParserError(
src/lxml/etree.pyx in lxml.etree.fromstring()
src/lxml/parser.pxi in lxml.etree._parseMemoryDocument()
ValueError: Unicode strings with encoding declaration are not supported. Please use bytes input or XML fragments without declaration.
Any idea what is going on here? It seems like an lxml.etree parsing error. Can I somehow modify r.text to fix this error? Any help is appreciated...
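The page apparently starts with an XML encoding declaration, which lxml rejects when handed an already-decoded unicode string. Passing the raw response bytes instead usually sidesteps this, assuming extruct hands the input through to lxml unchanged. A hedged sketch:

def extract_metadata(url):
    r = requests.get(url)
    base_url = get_base_url(r.text, r.url)
    # Pass the raw bytes so lxml can honour the declared encoding itself.
    return extruct.extract(r.content, base_url=base_url)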
Often used to describe publications etc. More information:
http://dublincore.org/
https://en.wikipedia.org/wiki/Dublin_Core
http://dublincore.org/documents/dces/
Hi,
I am using extruct to extract metadata from emails, in either microformats or JSON+LD format. A very good point for this library is that in a single call one can extract all possible information from the message; that's super convenient!
However, I realized that the structure of the data returned for JSON+LD and microformats is quite different. For instance, microformats will return something like
{
"type": "SOME_SCHEMA_URL",
"properties": { /* A dict of properties */ }
}
whereas JSON+LD parsing would return something like
{
"@type": "SOME_SCHEMA_URL",
/* All the properties in keys here */
}
This is not so convenient, as it implies that microformats and JSON+LD data must be handled differently, although they match the same schema.org schema.
Not sure if this is in scope for extruct or if it should live in another lib, but what about offering a way to get a standard representation of the extracted data? This could either be building a class object (basically a struct) from the fetched data for each type, or converting one of the formats to the other. Not sure if this could already be offloaded to some external lib, but I could not find any doing the job so far.
Thanks!
EDIT: I guess something as simple as
def microformats_to_jsonld(mf):
    if isinstance(mf, dict) and 'type' in mf and 'properties' in mf:
        if isinstance(mf['type'], list):
            # Fix a bug in the JSON-LD format of some emails
            mf['type'] = ''.join(mf['type'])
        context, type = mf['type'].rsplit('/', 1)
        converted = {
            '@type': type,
            '@context': context,
        }
        for key, property in mf['properties'].items():
            converted[key] = microformats_to_jsonld(property)
        return converted
    else:
        return mf
could do the trick.
It seems that extruct incorrectly interprets descriptions with embedded HTML tags in microdata.
See the description below, extracted from the URL https://www.monsterpetsupplies.co.uk/cat/cat-flea-tick/johnsons-4-fleas-cats-kittens:
>>> import extruct
>>> import requests
>>> from w3lib.html import get_base_url
>>> r = requests.get('https://www.monsterpetsupplies.co.uk/cat/cat-flea-tick/johnsons-4-fleas-cats-kittens')
>>> base_url = get_base_url(r.text, r.url)
>>> data = extruct.extract(r.text, base_url=base_url)
>>> data['microdata'][0]['properties']['description']
"Johnsons 4 Fleas Cats & Kittens - 3 Treatment Pack, 6 Treatment PackFor use with Cats and Kittens over 4 weeks of age between 1 and 11kg.Johnson's 4fleas tablets are an easy to use oral treatment to kill adult fleas found on your pet.Effects on the fleas may be seen as soon as 15 minutes after administration.Between 95 - 100% of fleas will be killed off in the first six hours, but ALL adult fleas will be gone after a day.These tablets can be given directly to the mouth or may be mixed in a small portion f our pet's favourite food and given immediately. Administer a single tablet on an day when fleas are seen on your pet. Repeat on any subsequent day as necessary. Do not give more than one treatment per day.You may notice your pet scratching more than usual for the first half hour after administration; this is completely normal and caused by the fleas reacting to Johnson's 4Fleas tablets.While highly effective by themselves, 4Fleas is great when used as part of a programme to eliminate fleas and their larvae from both pets and their surroundings."
As can be seen, there is a problem with formatting, like the lack of a space between "Pack" and "For" or between "11kg." and "Johnson's".
It turns out that the problem is not in the description property content per se, because it looks correct in the page source:
<p><strong>Johnsons 4 Fleas Cats & Kittens - 3 Treatment Pack, 6 Treatment Pack</strong></p>For use with Cats and Kittens over 4 weeks of age between 1 and 11kg.<br /><br />Johnson's 4fleas tablets are an easy to use oral treatment to kill adult fleas found on your pet.<br /><br />Effects on the fleas may be seen as soon as 15 minutes after administration.<br /><br />Between 95 - 100% of fleas will be killed off in the first six hours, but ALL adult fleas will be gone after a day.<br /><br />These tablets can be given directly to the mouth or may be mixed in a small portion f our pet's favourite food and given immediately. Administer a single tablet on an day when fleas are seen on your pet. Repeat on any subsequent day as necessary. Do not give more than one treatment per day.<br /><br />You may notice your pet scratching more than usual for the first half hour after administration; this is completely normal and caused by the fleas reacting to Johnson's 4Fleas tablets.<br /><br />While highly effective by themselves, 4Fleas is great when used as part of a programme to eliminate fleas and their larvae from both pets and their surroundings.
Likely it is a matter of this line: extruct/extruct/w3cmicrodata.py, line 185 in de219cb.
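A minimal demonstration of why method="text" loses the separation that block-level tags imply:

import lxml.html

node = lxml.html.fromstring('<div><p><strong>Pack</strong></p>For use</div>')
print(lxml.html.tostring(node, method='text', encoding='unicode'))
# -> 'PackFor use': the </p> boundary contributes no whitespace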
It is possible in the Open Graph protocol to specify more than one value for a single property. It's called an OG array: http://ogp.me/#array.
It seems that currently extruct doesn't support arrays when the uniform option is set to True, because the uniform._uopengraph function doesn't handle duplicated keys from the list of properties.
It'd be cool to add that support and return a list when there is more than one value, or to add a separate property with a "list" suffix to be backward compatible.
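Handling duplicates could be as simple as folding repeated keys into ordered lists while flattening. A hedged sketch of the idea (not the actual uniform._uopengraph code):

def fold_duplicates(properties):
    # properties: an iterable of (key, value) pairs in document order.
    out = {}
    for key, value in properties:
        if key in out:
            if not isinstance(out[key], list):
                out[key] = [out[key]]
            out[key].append(value)
        else:
            out[key] = value
    return out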
v0.4.0 shipped June 20, 2017. There have been 40 commits since then (plus 6 for README updates).
What is the versioning strategy?
Thanks!
Facebook Open Graph defines an expanded version of embedded metadata depending on the value of og:type.
For example:
article - Namespace URI: http://ogp.me/ns/article#
article:published_time - datetime - When the article was first published.
article:modified_time - datetime - When the article was last changed.
article:expiration_time - datetime - When the article is out of date after.
article:author - profile array - Writers of the article.
article:section - string - A high-level section name. E.g. Technology
article:tag - string array - Tag words associated with this article.
This is used for example on nytimes.com. Snippet:
<meta property="og:url" content="http://www.nytimes.com/2016/12/15/arts/music/from-steet-theater-to-wagner-on-the-opera-stage.html" />
<meta property="og:type" content="article" />
<meta property="og:title" content="From Street Theater to Wagner on the Opera Stage" />
<meta property="og:description" content="Àlex Ollé brings an avant-garde sensibility to “The Flying Dutchman,” which he set in Bangladesh instead of Norway. The production opens in Madrid on Saturday." />
<meta property="article:published" itemprop="datePublished" content="2016-12-15T05:55:55-05:00" />
<meta property="article:modified" itemprop="dateModified" content="2016-12-15T06:19:30-05:00" />
<meta property="article:section" itemprop="articleSection" content="Music" />
<meta property="article:section-taxonomy-id" itemprop="articleSection" content="C5BFA7D5-359C-427B-90E6-6B7245A6CDD8" />
<meta property="article:section_url" content="http://www.nytimes.com/section/arts" />
<meta property="article:top-level-section" content="arts" />
<meta property="fb:app_id" content="9869919170" />
Currently (as I write these lines, version 0.3.0a1) extruct extracts raw article:... properties:
...
'article:author': [{'@value': 'http://www.nytimes.com/by/raphael-minder'}],
'article:collection': [{'@value': 'https://static01.nyt.com/services/json/sectionfronts/arts/music/index.jsonp'}],
'article:modified': [{'@value': '2016-12-15T06:19:30-05:00'}],
'article:published': [{'@value': '2016-12-15T05:55:55-05:00'}],
'article:section': [{'@value': 'Music'}],
'article:section-taxonomy-id': [{'@value': 'C5BFA7D5-359C-427B-90E6-6B7245A6CDD8'}],
'article:section_url': [{'@value': 'http://www.nytimes.com/section/arts'}],
'article:tag': [{'@value': 'Opera'},
{'@value': 'Bangladesh'},
{'@value': 'Madrid (Spain)'},
{'@value': 'Teatro Real'},
{'@value': 'Wagner, Richard'}],
'article:top-level-section': [{'@value': 'arts'}],
'fb:app_id': [{'@value': '9869919170'}],
'http://opengraphprotocol.org/schema/description': [{'@value': 'Ã\x80lex '
'Ollé brings '
'an '
'avant-garde '
'sensibility '
'to '
'â\x80\x9cThe '
'Flying '
'Dutchman,â\x80\x9d '
'which he set '
'in '
'Bangladesh '
'instead of '
'Norway. The '
'production '
'opens in '
'Madrid on '
'Saturday.'}],
'http://opengraphprotocol.org/schema/image': [{'@value': 'https://static01.nyt.com/images/2016/12/16/arts/16ALEXOLLE1-INYT/16ALEXOLLE1-INYT-facebookJumbo.jpg'}],
'http://opengraphprotocol.org/schema/title': [{'@value': 'From Street '
'Theater to Wagner '
'on the Opera '
'Stage'}],
'http://opengraphprotocol.org/schema/type': [{'@value': 'article'}],
'http://opengraphprotocol.org/schema/url': [{'@value': 'http://www.nytimes.com/2016/12/15/arts/music/from-steet-theater-to-wagner-on-the-opera-stage.html'}],
...
while they could use the type-dependent OGP namespace.
I have tried to extract information from the very same example as in this issue, and I noticed that the image tag is not extracted at all, even though itemprop=image is present in the web page.
Why does this happen? Is it intentional or a bug?
Motivation: #37 (comment)
When the JsonLdExtractor tries to parse JSON-LD on some web pages, it raises ValueError: no json object could be decoded.
My solution was to catch the error in JsonLdExtractor._extract_items(self, node) (because maybe the extractor detected some microdata or RDFa in the webpage and the error only occurs with JSON-LD; if we catch the error in extruct.extract we'd lose that data) and return an empty list by default:
def _extract_items(self, node):
    try:
        data = json.loads(node.xpath('string()'))
        if isinstance(data, list):
            return data
        elif isinstance(data, dict):
            return [data]
    except Exception as e:
        print(e)
    return []
Testing out the new RDF push; just some quick things that came up, with possible fixes:
1. pip install extruct[rdfa] didn't get the latest RDF files; I had to clone from git (assuming this is by design, but mentioning it just in case).
2. rdflib.plugin.PluginException: No plugin registered for (json-ld, <class 'rdflib.serializer.Serializer'>). I'm not sure if this is because of how I downloaded RDFlib previously, but in any case the fix for me was the following: git clone https://github.com/RDFLib/rdflib-jsonld.git && cd rdflib-jsonld && python setup.py install
3. url='http://www.exaple.com/index.html' -> example.com
I'd like to put this in a pull request, but I'm not sure how best to handle the middle case in particular.
The regex used to remove comments (https://github.com/scrapinghub/extruct/blob/c465e629c9e35cff08a703f6d2912c1c71c642ff/extruct/jsonld.py#L13) from JSON that fails to decode is pinned to the beginning on one side, but not the other. So the regex may remove HTML comments from strings within the JSON document, as well as outside the JSON.
A quick fix would be to add the ^ token to the other pattern, or to bracket the two patterns so they share the same ^. But I'm wondering whether trailing comments are also an occasional problem, and if so, whether a custom little FSM scanner might be a better solution - e.g., something that scans for the earliest possible "valid" opening character for a JSON document and the last such character, and returns the indices of those two characters for decoding.
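For reference, the quick fix would look something like this (pattern shape assumed, not extruct's actual regex):

import re

# Both alternatives share one ^ anchor, so comments embedded inside JSON
# string values are left untouched.
HTML_OR_JS_COMMENTLINE = re.compile(r'^\s*(?://.*?\n|<!--.*?-->)', re.DOTALL)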
To be able to reuse an lxml tree with RDFaExtractor
See #37 for motivation.
E.g., if an error happens during HTML parsing or unification, it won't be ignored when calling extruct.extract(html, errors='ignore').
When a property is repeated (e.g. on a page with multiple images annotated as og:image), RDFa returns it as a list but does not preserve order. Preserving order is important, as usually the first image is the most important. An example of a page where this happens:
It seems difficult to solve in extruct, as the problem appears to be in the PyRdfa library; it even happens in the online service: https://www.w3.org/2012/pyRdfa/Overview.html#distill_by_uri+with_options
Related to #115 (I created an xfail test for that in this PR).
The REST API is not working for extruct 0.7.2.
The response for requests is:
{
url: "https://nerdist.com/article/star-wars-cast-reylo-episode-ix/",
status: "error",
message: "RecursionError('maximum recursion depth exceeded while calling a Python object',)"
}
A warning is shown at startup:
python -m extruct.service
/home/ivan/Documentos/scrapinghub/dev/extruct/extruct/service.py:3: MonkeyPatchWarning: Monkey-patching ssl after ssl has already been imported may lead to errors, including RecursionError on Python 3.6. It may also silently lead to incorrect behaviour on Python 3.7. Please monkey-patch earlier. See https://github.com/gevent/gevent/issues/1016. Modules that had direct imports (NOT patched): ['urllib3.util (/home/ivan/Documentos/scrapinghub/dev/extruct/venv/lib/python3.6/site-packages/urllib3/util/__init__.py)', 'urllib3.util.ssl_ (/home/ivan/Documentos/scrapinghub/dev/extruct/venv/lib/python3.6/site-packages/urllib3/util/ssl_.py)'].
monkey.patch_all()
Bottle v0.12.16 server starting up (using GeventServer())...
Listening on http://0.0.0.0:10005/
Hit Ctrl-C to quit.
A possible solution could be in this message: gevent/gevent#1235 (comment)
pip list:
Package Version Location
-------------- ---------- ---------------------------------------------
atomicwrites 1.3.0
attrs 18.2.0
beautifulsoup4 4.7.1
bottle 0.12.16
bumpversion 0.5.3
certifi 2018.11.29
chardet 3.0.4
entrypoints 0.3
extruct 0.7.2
filelock 3.0.10
flake8 3.7.5
gevent 1.4.0
greenlet 0.4.15
html5lib 1.0.1
idna 2.8
isodate 0.6.0
lxml 4.3.0
mccabe 0.6.1
mf2py 1.1.2
more-itertools 5.0.0
pip 10.0.1
pluggy 0.8.1
py 1.7.0
pycodestyle 2.5.0
pyflakes 2.1.0
pyparsing 2.3.1
pytest 4.2.0
rdflib 4.2.2
rdflib-jsonld 0.4.0
requests 2.21.0
setuptools 39.1.0
six 1.12.0
soupsieve 1.7.3
toml 0.10.0
tox 3.7.0
urllib3 1.24.1
virtualenv 16.3.0
w3lib 1.20.0
webencodings 0.5.1