scrapinghub / extruct
Extract embedded metadata from HTML markup
License: BSD 3-Clause "New" or "Revised" License
Hello,
ref your example
{ 'json-ld': [ { '@context': 'http://schema.org',
'@id': 'FP',
'@type': 'Product',
'brand': { '@type': 'Brand',
'url': 'https://www.sarenza.com/i-love-shoes'},
'color': ['Lava', 'Black', 'Lt grey'],
'image': [ 'https://cdn.sarenza.net/_img/productsv4/0000119412/MD_0000119412_223992_08.jpg?201509221045&v=20180313113923'],
'name': 'Susket',
'offers': { '@type': 'AggregateOffer',
'availability': 'InStock',
'highPrice': '49.00',
'lowPrice': '0.00',
'price': '0.00',
'priceCurrency': 'EUR'}}],
Is it possible to extract exactly the name and image values from the command line? I mean something like
extruct "https://www.sarenza.com/i-love-shoes-susket-s767163-br964-t76-p0000119412" --syntaxes json-ld | extruct name,image
which would output the bare values:
Susket
https://cdn.sarenza.net/_img/productsv4/0000119412/MD_0000119412_223992_08.jpg?201509221045&v=20180313113923
Thanks in advance for any hint!
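As far as I can tell, the extruct CLI has no field-selection flag, so a short Python script is the usual route. A minimal sketch, reusing the URL from the example above:

import requests
import extruct

url = 'https://www.sarenza.com/i-love-shoes-susket-s767163-br964-t76-p0000119412'
r = requests.get(url)
data = extruct.extract(r.text, base_url=url, syntaxes=['json-ld'])
for item in data['json-ld']:
    print(item.get('name'))
    images = item.get('image', [])
    # 'image' may be a single string or a list, depending on the page
    for image in images if isinstance(images, list) else [images]:
        print(image)

Alternatively, piping the CLI output through jq (something like jq -r '."json-ld"[] | .name, .image[]') should also work.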
I'm running extruct using Docker, but I have a problem.
FROM python:3.5
# See https://github.com/scrapinghub/extruct
RUN pip install bottle
RUN pip install gevent
RUN pip install requests
RUN pip install extruct==0.7.3
WORKDIR /usr/src/app
# This will run the server on port 10005
CMD [ "python", "-m", "extruct.service" ]

# To build, run:
#   docker build -t python-extruct .
# To run the HTTP server:
#   docker run -p 10005:10005 python-extruct
# To check usage over HTTP:
#   curl http://your_IP:10005/extruct/http://www.sarenza.com/i-love-shoes-susket-s767163-p0000119412
/usr/local/lib/python3.7/site-packages/extruct/service.py:13: MonkeyPatchWarning: Monkey-patching ssl after ssl has already been imported may lead to errors, including RecursionError on Python 3.6. It may also silently lead to incorrect behaviour on Python 3.7. Please monkey-patch earlier. See https://github.com/gevent/gevent/issues/1016. Modules that had direct imports (NOT patched): ['urllib3.util (/usr/local/lib/python3.7/site-packages/urllib3/util/__init__.py)', 'urllib3.util.ssl_ (/usr/local/lib/python3.7/site-packages/urllib3/util/ssl_.py)'].
monkey.patch_all()
Bottle v0.12.16 server starting up (using GeventServer())...
Listening on http://0.0.0.0:10005/
pip list
---> Running in c0d1c0855f84
Package Version
-------------- --------
beautifulsoup4 4.7.1
bottle 0.12.16
certifi 2019.3.9
chardet 3.0.4
extruct 0.7.3
gevent 1.4.0
greenlet 0.4.15
html5lib 1.0.1
idna 2.8
isodate 0.6.0
lxml 4.3.4
mf2py 1.1.2
pip 19.1.1
pyparsing 2.4.0
rdflib 4.2.2
rdflib-jsonld 0.4.0
requests 2.22.0
setuptools 41.0.1
six 1.12.0
soupsieve 1.9.1
urllib3 1.25.3
w3lib 1.20.0
webencodings 0.5.1
wheel 0.33.4
When I send an HTTP request to the server:
http://192.168.5.134:10005/extruct/https://www.sarenza.com/i-love-shoes-susket-s767163-br964-t76-p0000119412
I get an error that is probably related to the gevent monkey-patching warning above:
{"url": "https://www.sarenza.com/i-love-shoes-susket-s767163-br964-t76-p0000119412", "status": "error", "message": "RecursionError('maximum recursion depth exceeded')"}
Right now extruct.extract has a url parameter which is documented as "url of the html documents", but in reality it's used as a base URL (at least in LxmlMicrodataExtractor, maybe in others as well). I think we should check whether it's indeed always used as a base URL, update the documentation, and introduce a base_url argument, deprecating url. Another option would be to extract base_url in extruct, but this feels like a worse solution to me (what if the caller already has a base_url, or has a more accurate one?), although we could also support both base_url and url.
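If we go the deprecation route, a shim along these lines would keep old callers working. A hedged sketch (signature details assumed):

import warnings

def extract(htmlstring, base_url=None, url=None, **kwargs):
    # Hypothetical shim: accept the old `url` name, warn, and forward it.
    if url is not None:
        warnings.warn("the 'url' argument is deprecated, use 'base_url'",
                      DeprecationWarning, stacklevel=2)
        if base_url is None:
            base_url = url
    ...  # continue using base_url as before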
Hello,
extruct is working very well for my use case and I get plenty of structured text out of websites.
I'm mostly using Microdata.
Unfortunately, some websites seem to have different structures from others, so for example, sometimes I'd get a nested item:
'brand': {'properties': {'name': 'NIKE'}, 'type': 'http://schema.org/Brand'},
and sometimes a string:
'brand': 'NIKE',
So to access the data, I'd need to do something like:
if isinstance(productData['brand'], dict):
    if productData['brand']['type'] == 'http://schema.org/Brand':
        self.brand = productData['brand']['properties']['name']
elif isinstance(productData['brand'], str):
    self.brand = productData['brand']
Is this the best way to go or am I doing this in a clumsy way?
Thanks,
Chris
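One way to keep that branching out of the main code is a small accessor that normalizes both shapes. A minimal sketch (helper name made up):

def get_brand_name(product_data):
    # Works whether 'brand' is a nested microdata item or a plain string.
    brand = product_data.get('brand')
    if isinstance(brand, dict):
        return brand.get('properties', {}).get('name')
    return brand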
Is there a way to pass headers and also set cookies in the request that is made?
I don't know whether JSON followed by a semicolon constitutes valid JSON-LD, but I have encountered it in the wild.
Running extruct on the following works fine:
<script type="application/ld+json">
{}
</script>
However, this breaks:
<script type="application/ld+json">
{};
</script>
The error message looks like this:
Failed to extract json-ld, raises Extra data: line 2 column 3 (char 3)
Traceback (most recent call last):
File "/usr/local/lib/python3.6/dist-packages/extruct/jsonld.py", line 34, in _extract_items
data = json.loads(script, strict=False)
File "/usr/lib/python3.6/json/__init__.py", line 367, in loads
return cls(**kw).decode(s)
File "/usr/lib/python3.6/json/decoder.py", line 342, in decode
raise JSONDecodeError("Extra data", s, end)
json.decoder.JSONDecodeError: Extra data: line 2 column 3 (char 3)
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/local/lib/python3.6/dist-packages/extruct/_extruct.py", line 101, in extract
output[syntax] = list(extract(document, base_url=base_url))
File "/usr/local/lib/python3.6/dist-packages/extruct/jsonld.py", line 26, in extract_items
for items in map(self._extract_items, self._xp_jsonld(document))
File "/usr/local/lib/python3.6/dist-packages/extruct/jsonld.py", line 25, in <listcomp>
item
File "/usr/local/lib/python3.6/dist-packages/extruct/jsonld.py", line 38, in _extract_items
HTML_OR_JS_COMMENTLINE.sub('', script), strict=False)
File "/usr/lib/python3.6/json/__init__.py", line 367, in loads
return cls(**kw).decode(s)
File "/usr/lib/python3.6/json/decoder.py", line 342, in decode
raise JSONDecodeError("Extra data", s, end)
json.decoder.JSONDecodeError: Extra data: line 2 column 3 (char 3)
Extruct version 0.7.2; Python version 3.6.7; Ubuntu 18.04.2
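A tolerant decoder could drop such trailing junk before parsing. A minimal sketch (regex name made up):

import json
import re

TRAILING_SEMICOLON = re.compile(r';\s*$')

def loads_jsonld(script):
    # Strip a trailing semicolon, as seen in the wild, then decode as usual.
    return json.loads(TRAILING_SEMICOLON.sub('', script), strict=False)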
Instead of calling each extractor for individual metadata formats, there could be a do-all extractor combining the results of several extractors.
Something like (pseudo-Python code):
import lxml.html

class GenericExtractor:
    def __init__(self, extractors):
        # extractors: mapping of syntax name -> extractor instance
        self.extractors = extractors

    def extract(self, htmlstring, url):
        tree = lxml.html.fromstring(htmlstring)
        return self.extract_items(tree, url)

    def extract_items(self, tree, url):
        output = {}
        for name, extractor in self.extractors.items():
            output[name] = extractor.extract_items(tree, url)
        return output
At present, extruct supports an HTTP API for "testing", but that carries a maintenance burden, and it invites feature requests that may nudge it more and more into becoming a monolithic proxy service. That's not really where we want extruct to be, I think.
Similarly with the HTTP-client mode and the CLI tool that offers it - it's a mode of operation which probably shouldn't be our priority with extruct. I feel that if we provide a CLI client for extruct, it should probably just accept HTML through a Unix pipe or from a file, and operate on that. That way, people can use curl or wget or whatever else they like, and they won't worry about extruct's support for various HTTP client features.
Thoughts? :)
Hi, when I use extruct to extract microdata from an HTML element that contains script or style tags, I run into a problem: the tags are skipped, but their content remains.
This behaviour happens because LxmlMicrodataExtractor.extract_textContent uses lxml.html.tostring(node, method="text", encoding='unicode', with_tail=False) with method "text".
Probably we have to add a parameter to allow using the "html" method, and maybe a way to use the lxml Cleaner (http://lxml.de/lxmlhtml.html#cleaning-up-html).
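Until something like that exists, pre-cleaning the markup before extraction works around it. A minimal sketch using lxml's Cleaner (note this also strips JSON-LD script blocks, so apply it only when extracting microdata):

from lxml.html.clean import Cleaner  # packaged separately as lxml_html_clean in recent lxml releases

cleaner = Cleaner(scripts=True, javascript=True, style=True)

def strip_scripts_and_styles(htmlstring):
    # Remove <script>/<style> elements together with their text content.
    return cleaner.clean_html(htmlstring)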
Please add functions that accept a pre-parsed lxml etree instead of an HTML string.
Also, using a library such as "ujson" may significantly speed up JSON-LD processing.
r = requests.get(url)
data = extruct.extract(r.text, r.url)
Why am I getting an error this way?
This package is great. Thanks for it and the other packages from Scrapinghub.
Image captions and credits are included in the article body, which messes up the article content.
Extruct ought to support microformats.
An exception like this can be raised by functions from extruct.utils:
document = parse_xmldom_html(html_string, encoding=encoding)
File "/usr/local/lib/python3.6/dist-packages/extruct/utils.py", line 16, in parse_xmldom_html
return lxml.html.fromstring(html, parser=parser)
File "/usr/local/lib/python3.6/dist-packages/lxml/html/__init__.py", line 876, in fromstring
doc = document_fromstring(html, parser=parser, base_url=base_url, **kw)
File "/usr/local/lib/python3.6/dist-packages/lxml/html/__init__.py", line 765, in document_fromstring
"Document is empty")
lxml.etree.ParserError: Document is empty
In parsel this is worked around: empty documents are handled explicitly, and there is a fix for null bytes as well. I think we should bring similar fixes to extruct. See https://github.com/scrapy/parsel/blob/e01093cf6342c90445028de28034b3cc3d2ead8b/parsel/selector.py#L38.
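A hedged sketch of what mirroring parsel's workaround could look like (wrapper name made up):

from extruct.utils import parse_xmldom_html

def parse_html_safe(htmlstring, encoding='utf-8'):
    # Strip null bytes and substitute a minimal document when the input is
    # empty, so lxml never raises "Document is empty".
    htmlstring = htmlstring.replace('\x00', '')
    if not htmlstring.strip():
        htmlstring = '<html/>'
    return parse_xmldom_html(htmlstring, encoding=encoding)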
Steps to reproduce:
Log in to an Ubuntu box.
Run extruct as a service:
nohup python -m extruct.service &
Hit http://localhost:10005/extruct/ in a loop for 20k URLs. Memory consumption increases over time and never comes down.
mf2py already supports and returns them, but we are only keeping h-entry. h-item and h-product are of particular interest.
List: http://microformats.org/wiki/Microformats2#v2_vocabularies
It would be nice to have a command-line script that, given a URL, would try to run all the extractors (much like the web service) and output JSON with the results.
extruct URL
I have this JSON, but I want only the items annotated with @type Product, not @type BreadcrumbList. Is there a way to get only Product?
[
{
"@context": "http://schema.org",
"@type": "BreadcrumbList",
"itemListElement": [
{
"@type": "ListItem",
"position": 1,
"item": {
"@id": "https://concordpetfoods.com/collections",
"name": "Collections"
}
},
{
"@type": "ListItem",
"position": 2,
"item": {
"@id": "https://concordpetfoods.com/collections/dog",
"name": "Dog"
}
},
{
"@type": "ListItem",
"position": 3,
"item": {
"@id": "https://concordpetfoods.com/collections/dog/products/blue-buffalo-blue-wilderness-rocky-mountain-recipe-adult-healthy-weight-red-meat-dry-dog-food",
"name": "Blue Buffalo BLUE Wilderness Rocky Mountain Recipe Adult Healthy Weight Red Meat Dry Dog Food"
}
}
]
},
{
"@context": "http://schema.org/",
"@type": "Product",
"name": "Blue Buffalo BLUE Wilderness Rocky Mountain Recipe Adult Healthy Weight Red Meat Dry Dog Food",
"image": "https://cdn.shopify.com/s/files/1/2382/0223/products/35913-1501600645_fc502f43-827d-4a76-a639-90c668e5e4bc_1024x1024.png?v=1533919507",
"description": "
Looking for a great food to help your four legged best friend reach and maintain their ideal weight? Blue Buffalo has got just the food for you with their BLUE Wilderness Rocky Mountain Recipe Adult Healthy Weight Red Meat Dry Dog Food! This grain-free, protein-rich food contains the finest natural ingredients and provides multiple sources of protein using deboned beef, lamb and venison without the added calories! Blue Buffalo BLUE Wilderness Rocky Mountain Recipe Adult Healthy Weight Red Meat Dry Dog Food also includes blueberries, cranberries and carrots to help support antioxidant-enrichment. Put on your spandex, Rover! Let’s get physical!
BLUE Buffalo's True Blue promise is the pillar of their business, straight to every customer; the finest natural ingredients, and no chicken/poultry by-product meals, corn, wheat, soy, artificial preservatives, colors or flavors. BLUE Buffalo is the only food made with unique Lifesource Bits; a precise blend of vitamins, minerals and antioxidants created by veterinarians and animal nutritionists. With recipes for all tastes and diets, including limited ingredient diets, high protein, grain-free, wholesome grains, and exotic proteins, BLUE Buffalo always starts with real meat, and ends with good health.
Nutrient | Guaranteed Units |
---|---|
Crude Protein | 30.0% min |
Crude Fat | 10% min |
Crude Fiber | 10.0% max |
Moisture | 10.0% max |
Calcium | 1.2% min |
Phosphorus | 0.9% min |
Omega-3 Fatty Acids | 0.5% min |
Omega-6 Fatty Acids | 1.5% min |
L-Carnitine | 150 mg/kg min |
Glucosamine | 400 mg/kg min |
Chondroitin Sulfate | 300 mg/kg min |
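extruct returns everything it finds, so the filtering has to happen on your side. A minimal sketch, assuming the page URL from the JSON above:

import requests
import extruct

url = 'https://concordpetfoods.com/collections/dog/products/blue-buffalo-blue-wilderness-rocky-mountain-recipe-adult-healthy-weight-red-meat-dry-dog-food'
r = requests.get(url)
data = extruct.extract(r.text, base_url=url, syntaxes=['json-ld'])
products = [item for item in data['json-ld'] if item.get('@type') == 'Product']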
It would be nice to be able to call the extruct command line tool using python -m extruct.
With this, people will be able to use a specific Python version, and it may also help people with issues on their system's PATH.
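Supporting this is usually just a __main__.py that delegates to the existing console-script entry point. A minimal sketch (the exact location of main is an assumption here):

# extruct/__main__.py
import sys

from extruct.tool import main  # assumed: wherever the CLI entry point lives

if __name__ == '__main__':
    sys.exit(main())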
Hi, I've been using extruct pretty successfully, but I came across a URL that seems to validate OK yet produces an error when I run it through:
File "/Users/frankapap/KCApp/extruct/RecipeInfoService.py", line 24, in recipeExtract
data = extruct.extract(r.text, base_url=base_url,syntaxes=['json-ld', 'opengraph'],uniform=True)
File "/usr/local/lib/python3.7/site-packages/extruct/_extruct.py", line 67, in extract
output[label] = list(extract(document, base_url=base_url))
File "/usr/local/lib/python3.7/site-packages/extruct/jsonld.py", line 25, in extract_items
self._xp_jsonld(document))
File "/usr/local/lib/python3.7/site-packages/extruct/jsonld.py", line 26, in <listcomp>
for item in items
TypeError: 'NoneType' object is not iterable
The code is:
data = extruct.extract(r.text, base_url=base_url,syntaxes=['json-ld', 'opengraph'],uniform=True)
The URL being passed in is https://www.tasteofhome.com/collection/keto-diet-recipes/view-all/
I came across some JSON-LD on a site that contained a &amp; and I assumed that I had accidentally escaped something somewhere. However, I found that that was what was actually in the content, and also that the standard says it should be there.
For my application, I would like that &amp; to be a &, but I was wondering if extruct should be doing this already?
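If extruct leaves entities alone, unescaping after extraction is a one-liner with the standard library. A minimal sketch:

import html

print(html.unescape('Tom &amp; Jerry'))  # -> Tom & Jerry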
Some web pages contain badly formatted JSON-LD data; here is an example.
The JSON-LD in this page is:
{
"@context": "http://schema.org",
"@type": "Product",
"name": "Black 'Clint' FT0511 cat eye sunglasses",
"image": "https://debenhams.scene7.com/is/image/Debenhams/60742_1515029001",
"brand": {
"@type": "Thing",
"name": "Tom Ford"
},
"offers": {
"@type": "Offer",
"priceCurrency": "GBP",
"price": "285.00",
"itemCondition": "http://schema.org/NewCondition",
"availability": "http://schema.org/InStock"
}
}
}
In the JSON-LD above, the last } is extra, and neither extruct nor json.loads will handle it properly.
json.loads in Python 3.5+ gives detailed error information, such as JSONDecodeError: Extra data: line 19 column 1 (char 624):
In [7]: try:
   ...:     data = json.loads(json_ld_string)
   ...: except json.JSONDecodeError as err:
   ...:     print(err)
   ...:     print(err.msg)
   ...:     print(err.pos)
   ...:
Extra data: line 19 column 1 (char 624)
Extra data
624
The err.msg and err.pos attributes give some hints for fixing the JSON-LD data; e.g., for this one we can remove the character at position 624 and parse the data string again to correctly get:
{'@context': 'http://schema.org',
'@type': 'Product',
'brand': {'@type': 'Thing', 'name': 'Tom Ford'},
'image': 'https://debenhams.scene7.com/is/image/Debenhams/60742_1515029001',
'name': "Black 'Clint' FT0511 cat eye sunglasses",
'offers': {'@type': 'Offer',
'availability': 'http://schema.org/InStock',
'itemCondition': 'http://schema.org/NewCondition',
'price': '285.00',
'priceCurrency': 'GBP'}}
There are many possible format errors; some can be fixed easily, some might be harder or even impossible.
I propose 3 ways to improve the situation:
1. extruct tries various ways to fix the JSON-LD data case by case, but this needs Python >= 3.5 to get detailed error info.
2. extruct allows the user to pass in a function to parse JSON data, letting the user handle their own possible error types.
3. extruct outputs the extracted JSON-LD string rather than parsed data, letting the user parse it and handle their own possible error types.
I personally recommend the latter 2 ways.
Thanks.
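For what it's worth, the first option can get surprisingly far with just err.pos. A hedged sketch that handles only the "Extra data" case:

import json

def loads_trimming_extra(json_ld_string, max_retries=3):
    # Retry decoding, trimming trailing junk flagged as "Extra data".
    for _ in range(max_retries):
        try:
            return json.loads(json_ld_string)
        except json.JSONDecodeError as err:
            if err.msg != 'Extra data':
                raise
            json_ld_string = json_ld_string[:err.pos]
    return json.loads(json_ld_string)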
Some pages have JSON-LD with control characters.
One example is: https://www.johnlewis.com/sony-xperia-x-smartphone-android-5-4g-lte-sim-free-32gb/p3210080
When you try to extract JSON-LD data from this page, you'll get:
Invalid control character at: line 8 column 353 (char 625)
Maybe we need to change JsonLdExtractor._extract_items() in extruct/extruct/jsonld.py as below:
from json import JSONDecodeError

def _extract_items(self, node):
    script = node.xpath('string()')
    try:
        data = json.loads(script)
    except ValueError:
        # Sometimes JSON-decoding errors are due to leading HTML or JavaScript
        # comments; if stripping them is not enough, retry the stripped script
        # without strict parsing so control characters in strings are accepted.
        try:
            data = json.loads(HTML_OR_JS_COMMENTLINE.sub('', script))
        except JSONDecodeError:
            data = json.loads(HTML_OR_JS_COMMENTLINE.sub('', script), strict=False)
    if isinstance(data, list):
        return data
    elif isinstance(data, dict):
        return [data]
I have been using extruct inside a Scrapy spider and the code gets stuck in the middle: it neither goes forward nor skips the URL. There is also no error, no exception, nothing.
#115 was a step in the right direction (prefer first results), but it seems it is not the whole solution, as empty results should not be prioritized.
E.g. on https://www.triganostore.com/tente-de-camping-raclet-bora-4.html there are two og:description values, and the first one is empty. https://developers.facebook.com/tools/debug/sharing/?q=https%3A%2F%2Fwww.triganostore.com%2Ftente-de-camping-raclet-bora-4.html shows that the non-empty one is extracted.
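The selection rule being suggested is roughly "first non-empty value wins". A minimal sketch:

def first_non_empty(values):
    # Prefer the first value that is not empty or whitespace-only.
    for value in values:
        if value and value.strip():
            return value
    return values[0] if values else None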
See below, the output is a nested list rather than a list of dicts.
In [9]: url = 'http://www.superpages.com/yellowpages/c-nurseries/s-wa/t-redmond/'
In [10]: r = requests.get(url)
In [11]: ex = extruct.jsonld.JsonLdExtractor()
In [12]: ex.extract(r.text)['items']
[[{'@context': 'http://schema.org',
'@type': 'LocalBusiness',
'address': {'@type': 'PostalAddress',
'addressLocality': 'Redmond',
'addressRegion': 'WA',
'postalCode': '98053',
'streetAddress': '20871 NE Redmond Fall City Rd'},
'description': "There's more to a beautiful garden than what meets the eye.",
'name': 'Gray Barn Nursery',
'telephone': '888-820-9506'},
...
]]
I used the REST API service for extracting embedded metadata from HTML markup.
When I do a GET request for a site that shows a popup at the beginning, the metadata returned is that of the popup.
Example site: Faballey.com
Request:
http://localhost:10005/extruct/http://www.faballey.com/hot-mesh-maxi-skirt-87
Is there a way we could block/skip the popup and get the metadata of the required page?
Hello! Is there a way to deal with asynchronously loaded JSON-LD, such as on this URL -> https://www.omicsdi.org/dataset/arrayexpress-repository/E-GEOD-33515
extruct "https://www.omicsdi.org/dataset/arrayexpress-repository/E-GEOD-33515" results in no JSON-LD metadata being returned.
In cases when one is not interested in all of the microdata but only some parts, the approach to filter the required content is not very straightforward. Can we support look-up by itemtype or itemprop values as follows:
>>> data = mde.extruct(html)
>>> data.get_first(itemprop='name')
'foo'
>>> data.get(itemtype='http://schema.org/Person')
[{'name': 'foo', 'jobTitle': 'bar', 'additionalName': 'foobar'}]
>>> data.get(itemtype='http://schema.org/Person', itemprop='name')
['foo', 'abc', 'cde', 'def']
>>> data.get(itemtype='http://schema.org/Organization', itemprop='name')
['foocompany']
or a cleaner version with some sort of built-in support for popular vocabularies.
>>> data.get(itemtype=schema_org.Person)
[{'name': 'foo', 'jobTitle': 'bar', 'additionalName': 'foobar'}, {'name': 'abc', ...}]
>>> data.get(itemtype=schema_org.Person, itemprop='name')
['foo', 'abc', 'cde', 'def']
>>> data.get_first(itemtype=schema_org.Organization, itemprop='name')
'foocompany'
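Pending built-in support, this kind of lookup is easy to layer over extruct's current microdata output. A hedged sketch (top-level items only, no recursion into nested items):

def get(data, itemtype=None, itemprop=None):
    # Filter extruct's microdata items by itemtype and/or itemprop.
    results = []
    for item in data.get('microdata', []):
        if itemtype and item.get('type') != itemtype:
            continue
        properties = item.get('properties', {})
        if itemprop is None:
            results.append(properties)
        elif itemprop in properties:
            results.append(properties[itemprop])
    return results

def get_first(data, **kwargs):
    matches = get(data, **kwargs)
    return matches[0] if matches else None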
I've found at least a couple of bad JSON-LD documents that extruct can't read.
File "/cygdrive/d/recipeWorkspace/python/parsers.py", line 25, in readJsonLd
data = jslde.extract(html)
File "/usr/lib/python2.7/site-packages/extruct/jsonld.py", line 21, in extract
return self.extract_items(lxmldoc)
File "/usr/lib/python2.7/site-packages/extruct/jsonld.py", line 25, in extract_items
self._xp_jsonld(document))
File "/usr/lib/python2.7/site-packages/extruct/jsonld.py", line 35, in _extract_items
data = json.loads(HTML_OR_JS_COMMENTLINE.sub('', script))
File "/usr/lib/python2.7/json/__init__.py", line 339, in loads
return _default_decoder.decode(s)
File "/usr/lib/python2.7/json/decoder.py", line 364, in decode
obj, end = self.raw_decode(s, idx=_w(s, 0).end())
File "/usr/lib/python2.7/json/decoder.py", line 380, in raw_decode
obj, end = self.scan_once(s, idx)
ValueError: Expecting , delimiter: line 20 column 778 (char 1342)
The reason is the unescaped quotes inside the text. For example:
"recipeInstructions": [
"1. blablabla two "buttons".5. Dab Snowmen!"
]
HTML allows this, but it's not possible to parse it as JSON. Is there an easy way to correct similar issues automatically?
Hi,
I wanted to ask if anyone out there has used extruct on AWS Lambda? I tested running an extruct function which fails for RDFa; the other default metadata types are fine.
A simple test case:
import pprint as pp
import requests
from extruct.rdfa import RDFaExtractor
import config_files.logging_config as log

logger = log.logger

def main():
    try:
        import extruct
        logger.info("Testing importing extruct which loaded successfully")
        import rdflib
        logger.info("Testing importing rdflib which loaded successfully")
        import extruct.rdfa
        logger.info("Testing importing rdfa which loaded successfully")
        from extruct.rdfa import RDFaExtractor
        logger.info("Testing importing RDFaExtractor which loaded successfully")
    except ImportError as e:
        logger.error("failed to import : {}".format(e))
    try:
        url = 'https://www.littlewoods.com/ri-plus-floral-trumpet-sleeve-top/1600159211.prd'
        r = requests.get(url)
        rdfae = RDFaExtractor()
        rdfa_json = rdfae.extract(r.text, base_url=None)
        pp.pprint(rdfa_json)
    except Exception as e:
        logger.exception("Failed to extract rdfa. Error: {}".format(e))

main()
The relevant part of the pipenv graph for extruct when I build the artifact.zip file:
extruct==0.7.1
- lxml [required: Any, installed: 3.6.0]
- mf2py [required: Any, installed: 1.1.2]
- BeautifulSoup4 [required: >=4.6.0, installed: 4.7.1]
- soupsieve [required: >=1.2, installed: 1.6.2]
- html5lib [required: >=1.0.1, installed: 1.0.1]
- six [required: >=1.9, installed: 1.11.0]
- webencodings [required: Any, installed: 0.5.1]
- requests [required: >=2.18.4, installed: 2.18.4]
- certifi [required: >=2017.4.17, installed: 2018.11.29]
- chardet [required: >=3.0.2,<3.1.0, installed: 3.0.4]
- idna [required: >=2.5,<2.7, installed: 2.6]
- urllib3 [required: >=1.21.1,<1.23, installed: 1.22]
- rdflib [required: Any, installed: 4.2.2]
- isodate [required: Any, installed: 0.6.0]
- six [required: Any, installed: 1.11.0]
- pyparsing [required: Any, installed: 2.3.0]
- rdflib-jsonld [required: Any, installed: 0.4.0]
- rdflib [required: >=4.2, installed: 4.2.2]
- isodate [required: Any, installed: 0.6.0]
- six [required: Any, installed: 1.11.0]
- pyparsing [required: Any, installed: 2.3.0]
- six [required: Any, installed: 1.11.0]
- w3lib [required: Any, installed: 1.19.0]
- six [required: >=1.4.1, installed: 1.11.0]
When I run this locally in the same pipenv env (Ubuntu 17.10, Docker, 17.12.0-ce, pipenv==v2018.11.26), I don't experience any issues. On lambda invocation I log the following stack trace:
2019-01-10 14:32:49,092:INFO:pid 1:Testing importing extruct which loaded successfully
2019-01-10 14:32:49,092:INFO:pid 1:Testing importing rdflib which loaded successfully
2019-01-10 14:32:49,092:INFO:pid 1:Testing importing rdfa which loaded successfully
2019-01-10 14:32:49,092:INFO:pid 1:Testing importing RDFaExtractor which loaded successfully
2019-01-10 14:32:51,753:ERROR:pid 1:Failed to extract rdfa. Error: No plugin registered for (json-ld, <class 'rdflib.serializer.Serializer'>)
Traceback (most recent call last):
File "/var/task/rdflib/plugin.py", line 100, in get
p = _plugins[(name, kind)]
KeyError: ('json-ld', <class 'rdflib.serializer.Serializer'>)
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/var/task/metadata_extractor/rdfa_extract_poc.py", line 15, in main
rdfa_json = rdfae.extract(r.text, base_url=None)
File "/var/task/extruct/rdfa.py", line 35, in extract
return self.extract_items(tree, base_url=base_url, expanded=expanded)
File "/var/task/extruct/rdfa.py", line 48, in extract_items
jsonld_string = g.serialize(format='json-ld', auto_compact=not expanded).decode('utf-8')
File "/var/task/rdflib/graph.py", line 940, in serialize
serializer = plugin.get(format, Serializer)(self)
File "/var/task/rdflib/plugin.py", line 103, in get
"No plugin registered for (%s, %s)" % (name, kind))
rdflib.plugin.PluginException: No plugin registered for (json-ld, <class 'rdflib.serializer.Serializer'>)
I have been scratching my head over this but can't figure it out. What should I try? Thanks in advance!
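rdflib discovers serializer plugins through package entry points, which zip-based Lambda bundles sometimes lose. Registering the JSON-LD serializer explicitly before the first extract call may help; a hedged sketch:

from rdflib import plugin
from rdflib.serializer import Serializer

# Register the rdflib-jsonld serializer manually, bypassing entry-point discovery.
plugin.register('json-ld', Serializer, 'rdflib_jsonld.serializer', 'JsonLDSerializer')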
I'm trying to parse structured metadata from this url. I first executed this code on the example URL https://www.optimizesmart.com/how-to-use-open-graph-protocol/:
import extruct
import requests
from w3lib.html import get_base_url

def extract_metadata(url):
    r = requests.get(url)
    base_url = get_base_url(r.text, r.url)
    data = extruct.extract(r.text, base_url=base_url)
    return data

url = 'https://www.optimizesmart.com/how-to-use-open-graph-protocol/'
data = extract_metadata(url)
print(data)
And it works just fine. However, this block of code:
url = 'https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/G4TBLF'
data = extract_metadata(url)
print(data)
returns this error
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-13-f0db0dd65eaf> in <module>()
1 url = 'https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/G4TBLF'
----> 2 data = extract_metadata(url)
3 print(data)
<ipython-input-3-25c85aeebf1a> in extract_metadata(url)
2 r = requests.get(url)
3 base_url = get_base_url(r.text, r.url)
----> 4 data = extruct.extract(r.text, base_url=base_url)
5 return(data)
/usr/local/lib/python3.5/dist-packages/extruct/_extruct.py in extract(htmlstring, base_url, encoding, syntaxes, errors, uniform, return_html_node, schema_context, **kwargs)
50 raise ValueError('Invalid error command, valid values are either "log"'
51 ', "ignore" or "strict"')
---> 52 tree = parse_xmldom_html(htmlstring, encoding=encoding)
53 processors = []
54 if 'microdata' in syntaxes:
/usr/local/lib/python3.5/dist-packages/extruct/utils.py in parse_xmldom_html(html, encoding)
14 """ Parse HTML using XmlDomHTMLParser, return a tree """
15 parser = XmlDomHTMLParser(encoding=encoding)
---> 16 return lxml.html.fromstring(html, parser=parser)
/usr/local/lib/python3.5/dist-packages/lxml/html/__init__.py in fromstring(html, base_url, parser, **kw)
874 else:
875 is_full_html = _looks_like_full_html_unicode(html)
--> 876 doc = document_fromstring(html, parser=parser, base_url=base_url, **kw)
877 if is_full_html:
878 return doc
/usr/local/lib/python3.5/dist-packages/lxml/html/__init__.py in document_fromstring(html, parser, ensure_head_body, **kw)
760 if parser is None:
761 parser = html_parser
--> 762 value = etree.fromstring(html, parser, **kw)
763 if value is None:
764 raise etree.ParserError(
src/lxml/etree.pyx in lxml.etree.fromstring()
src/lxml/parser.pxi in lxml.etree._parseMemoryDocument()
ValueError: Unicode strings with encoding declaration are not supported. Please use bytes input or XML fragments without declaration.
Any idea what is going on here? It seems like an lxml.etree parsing error. Can I somehow modify r.text to fix this error? Any help is appreciated...
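The page apparently starts with an XML encoding declaration, which lxml rejects when handed an already-decoded unicode string. Passing the raw response bytes instead usually sidesteps this, assuming extruct hands the input through to lxml unchanged. A hedged sketch:

def extract_metadata(url):
    r = requests.get(url)
    base_url = get_base_url(r.text, r.url)
    # Pass the raw bytes so lxml can honour the declared encoding itself.
    return extruct.extract(r.content, base_url=base_url)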
Often used to describe publications etc. More information:
http://dublincore.org/
https://en.wikipedia.org/wiki/Dublin_Core
http://dublincore.org/documents/dces/
Hi,
I am using extruct to extract metadata from emails, in either microformats or JSON+LD format. A very good point for this library is that in a single call one can extract all possible information from the message; that's super convenient!
However, I realized that the structure of the data returned for JSON+LD and microformats is quite different. For instance, microformats will return something like
{
"type": "SOME_SCHEMA_URL",
"properties": { /* A dict of properties */ }
}
whereas JSON+LD parsing would return something like
{
"@type": "SOME_SCHEMA_URL",
/* All the properties in keys here */
}
This is not so convenient, as it implies that microformats and JSON+LD data must be handled differently, although they match the same schema.org schema.
Not sure if this is in scope for extruct or if it should live in another lib, but what about offering a way to get a standard representation of the extracted data? This could either be building a class object (basically a struct) from the fetched data for each type, or converting one of the formats to the other. Not sure if this could already be offloaded to some external lib, but I could not find any doing the job so far.
Thanks!
EDIT: I guess something as simple as
def microformats_to_jsonld(mf):
    if isinstance(mf, dict) and 'type' in mf and 'properties' in mf:
        if isinstance(mf['type'], list):
            # Fix a bug in the JSON-LD format of some emails
            mf['type'] = ''.join(mf['type'])
        context, type = mf['type'].rsplit('/', 1)
        converted = {
            '@type': type,
            '@context': context,
        }
        for key, property in mf['properties'].items():
            converted[key] = microformats_to_jsonld(property)
        return converted
    else:
        return mf
could do the trick.
It seems that extruct incorrectly interprets descriptions with embedded HTML tags in microdata.
See the description below, extracted from the URL https://www.monsterpetsupplies.co.uk/cat/cat-flea-tick/johnsons-4-fleas-cats-kittens:
>>> import extruct
>>> import requests
>>> from w3lib.html import get_base_url
>>> r = requests.get('https://www.monsterpetsupplies.co.uk/cat/cat-flea-tick/johnsons-4-fleas-cats-kittens')
>>> base_url = get_base_url(r.text, r.url)
>>> data = extruct.extract(r.text, base_url=base_url)
>>> data['microdata'][0]['properties']['description']
"Johnsons 4 Fleas Cats & Kittens - 3 Treatment Pack, 6 Treatment PackFor use with Cats and Kittens over 4 weeks of age between 1 and 11kg.Johnson's 4fleas tablets are an easy to use oral treatment to kill adult fleas found on your pet.Effects on the fleas may be seen as soon as 15 minutes after administration.Between 95 - 100% of fleas will be killed off in the first six hours, but ALL adult fleas will be gone after a day.These tablets can be given directly to the mouth or may be mixed in a small portion f our pet's favourite food and given immediately. Administer a single tablet on an day when fleas are seen on your pet. Repeat on any subsequent day as necessary. Do not give more than one treatment per day.You may notice your pet scratching more than usual for the first half hour after administration; this is completely normal and caused by the fleas reacting to Johnson's 4Fleas tablets.While highly effective by themselves, 4Fleas is great when used as part of a programme to eliminate fleas and their larvae from both pets and their surroundings."
As can be seen, there is a problem with formatting, like the lack of a space between "Pack" and "For" or between "11kg." and "Johnson's".
It turns out that the problem is not in the description property content per se, because it looks correct in the page source:
<p><strong>Johnsons 4 Fleas Cats & Kittens - 3 Treatment Pack, 6 Treatment Pack</strong></p>For use with Cats and Kittens over 4 weeks of age between 1 and 11kg.<br /><br />Johnson's 4fleas tablets are an easy to use oral treatment to kill adult fleas found on your pet.<br /><br />Effects on the fleas may be seen as soon as 15 minutes after administration.<br /><br />Between 95 - 100% of fleas will be killed off in the first six hours, but ALL adult fleas will be gone after a day.<br /><br />These tablets can be given directly to the mouth or may be mixed in a small portion f our pet's favourite food and given immediately. Administer a single tablet on an day when fleas are seen on your pet. Repeat on any subsequent day as necessary. Do not give more than one treatment per day.<br /><br />You may notice your pet scratching more than usual for the first half hour after administration; this is completely normal and caused by the fleas reacting to Johnson's 4Fleas tablets.<br /><br />While highly effective by themselves, 4Fleas is great when used as part of a programme to eliminate fleas and their larvae from both pets and their surroundings.
Likely it is a matter of this line: extruct/extruct/w3cmicrodata.py, line 185 in de219cb.
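A minimal demonstration of why method="text" loses the separation that block-level tags imply:

import lxml.html

node = lxml.html.fromstring('<div><p><strong>Pack</strong></p>For use</div>')
print(lxml.html.tostring(node, method='text', encoding='unicode'))
# -> 'PackFor use': the </p> boundary contributes no whitespace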
It is possible in the Open Graph protocol to specify more than one value for a single property. It's called an OG array: http://ogp.me/#array.
It seems that currently extruct doesn't support arrays when the uniform option is set to True, because the uniform._uopengraph function doesn't handle duplicated keys from the list of properties.
It'd be cool to add that support and return a list when there is more than one value, or to add a separate property with a "list" suffix to be backward compatible.
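Handling duplicates could be as simple as folding repeated keys into ordered lists while flattening. A hedged sketch of the idea (not the actual uniform._uopengraph code):

def fold_duplicates(properties):
    # properties: an iterable of (key, value) pairs in document order.
    out = {}
    for key, value in properties:
        if key in out:
            if not isinstance(out[key], list):
                out[key] = [out[key]]
            out[key].append(value)
        else:
            out[key] = value
    return out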
v0.4.0 shipped June 20, 2017. There have been 40 commits since then (plus 6 for README updates).
What is the versioning strategy?
Thanks!
Facebook Open Graph defines an expanded version of embedded metadata depending on the value of og:type.
For example:
article - Namespace URI: http://ogp.me/ns/article#
article:published_time - datetime - When the article was first published.
article:modified_time - datetime - When the article was last changed.
article:expiration_time - datetime - When the article is out of date after.
article:author - profile array - Writers of the article.
article:section - string - A high-level section name. E.g. Technology
article:tag - string array - Tag words associated with this article.
This is used for example on nytimes.com. Snippet:
<meta property="og:url" content="http://www.nytimes.com/2016/12/15/arts/music/from-steet-theater-to-wagner-on-the-opera-stage.html" />
<meta property="og:type" content="article" />
<meta property="og:title" content="From Street Theater to Wagner on the Opera Stage" />
<meta property="og:description" content="Àlex Ollé brings an avant-garde sensibility to “The Flying Dutchman,” which he set in Bangladesh instead of Norway. The production opens in Madrid on Saturday." />
<meta property="article:published" itemprop="datePublished" content="2016-12-15T05:55:55-05:00" />
<meta property="article:modified" itemprop="dateModified" content="2016-12-15T06:19:30-05:00" />
<meta property="article:section" itemprop="articleSection" content="Music" />
<meta property="article:section-taxonomy-id" itemprop="articleSection" content="C5BFA7D5-359C-427B-90E6-6B7245A6CDD8" />
<meta property="article:section_url" content="http://www.nytimes.com/section/arts" />
<meta property="article:top-level-section" content="arts" />
<meta property="fb:app_id" content="9869919170" />
Currently (as I write these lines, version 0.3.0a1) extruct extracts raw article:... properties:
...
'article:author': [{'@value': 'http://www.nytimes.com/by/raphael-minder'}],
'article:collection': [{'@value': 'https://static01.nyt.com/services/json/sectionfronts/arts/music/index.jsonp'}],
'article:modified': [{'@value': '2016-12-15T06:19:30-05:00'}],
'article:published': [{'@value': '2016-12-15T05:55:55-05:00'}],
'article:section': [{'@value': 'Music'}],
'article:section-taxonomy-id': [{'@value': 'C5BFA7D5-359C-427B-90E6-6B7245A6CDD8'}],
'article:section_url': [{'@value': 'http://www.nytimes.com/section/arts'}],
'article:tag': [{'@value': 'Opera'},
{'@value': 'Bangladesh'},
{'@value': 'Madrid (Spain)'},
{'@value': 'Teatro Real'},
{'@value': 'Wagner, Richard'}],
'article:top-level-section': [{'@value': 'arts'}],
'fb:app_id': [{'@value': '9869919170'}],
'http://opengraphprotocol.org/schema/description': [{'@value': 'Ã\x80lex '
'Ollé brings '
'an '
'avant-garde '
'sensibility '
'to '
'â\x80\x9cThe '
'Flying '
'Dutchman,â\x80\x9d '
'which he set '
'in '
'Bangladesh '
'instead of '
'Norway. The '
'production '
'opens in '
'Madrid on '
'Saturday.'}],
'http://opengraphprotocol.org/schema/image': [{'@value': 'https://static01.nyt.com/images/2016/12/16/arts/16ALEXOLLE1-INYT/16ALEXOLLE1-INYT-facebookJumbo.jpg'}],
'http://opengraphprotocol.org/schema/title': [{'@value': 'From Street '
'Theater to Wagner '
'on the Opera '
'Stage'}],
'http://opengraphprotocol.org/schema/type': [{'@value': 'article'}],
'http://opengraphprotocol.org/schema/url': [{'@value': 'http://www.nytimes.com/2016/12/15/arts/music/from-steet-theater-to-wagner-on-the-opera-stage.html'}],
...
while they could use the type-dependent OGP namespace.
I have tried to extract information from the very same example as in this issue, and I noticed that the image tag is not extracted at all, even though itemprop=image is present in the web page.
Why does this happen? Is it intentional or a bug?
Motivation: #37 (comment)
When the JsonLdExtractor tries to parse JSON-LD on some web pages, it raises ValueError: no json object could be decoded.
My solution was to catch the error in JsonLdExtractor._extract_items(self, node) (because maybe the extractor detected some microdata or RDFa in the webpage and the error only occurs with JSON-LD; if we catch the error in extruct.extract we'd lose that data) and return an empty list by default:
def _extract_items(self, node):
    try:
        data = json.loads(node.xpath('string()'))
        if isinstance(data, list):
            return data
        elif isinstance(data, dict):
            return [data]
    except Exception as e:
        print(e)
    return []
Testing out the new RDF push; just some quick things that came up, with possible fixes:
1. pip install extruct[rdfa] didn't get the latest RDF files; I had to clone from git (assuming this is by design, but mentioning it just in case).
2. rdflib.plugin.PluginException: No plugin registered for (json-ld, <class 'rdflib.serializer.Serializer'>). I'm not sure if this is because of how I downloaded RDFlib previously, but in any case the fix for me was the following: git clone https://github.com/RDFLib/rdflib-jsonld.git && cd rdflib-jsonld && python setup.py install
3. url='http://www.exaple.com/index.html' -> example.com
I'd like to put this in a pull request, but I'm not sure how best to handle the middle case in particular.
The regex used to remove comments (https://github.com/scrapinghub/extruct/blob/c465e629c9e35cff08a703f6d2912c1c71c642ff/extruct/jsonld.py#L13) from JSON that fails to decode is pinned to the beginning on one side, but not the other. So the regex may remove HTML comments from strings within the JSON document, as well as outside the JSON.
A quick fix would be to add the ^ token to the other pattern, or to bracket the two patterns so they share the same ^. But I'm wondering whether trailing comments are also an occasional problem, and if so, whether a custom little FSM scanner might be a better solution - e.g., something that scans for the earliest possible "valid" opening character for a JSON document and the last such character, and returns the indices of those two characters for decoding.
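For reference, the quick fix would look something like this (pattern shape assumed, not extruct's actual regex):

import re

# Both alternatives share one ^ anchor, so comments embedded inside JSON
# string values are left untouched.
HTML_OR_JS_COMMENTLINE = re.compile(r'^\s*(?://.*?\n|<!--.*?-->)', re.DOTALL)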
To be able to reuse an lxml tree with RDFaExtractor
See #37 for motivation.
E.g., if an error happens during HTML parsing or unification, it won't be ignored when calling extruct.extract(html, errors='ignore').
When a property is repeated (e.g. on a page with multiple images annotated as og:image), RDFa returns it as a list but does not preserve order. Preserving order is important, as usually the first image is the most important. An example of a page where this happens:
It seems difficult to solve in extruct, as the problem appears to be in the PyRdfa library; it even happens in the online service: https://www.w3.org/2012/pyRdfa/Overview.html#distill_by_uri+with_options
Related to #115 (I created an xfail test for that in this PR).
The REST API is not working for extruct 0.7.2.
The response for requests is:
{
url: "https://nerdist.com/article/star-wars-cast-reylo-episode-ix/",
status: "error",
message: "RecursionError('maximum recursion depth exceeded while calling a Python object',)"
}
A warning is shown at startup:
python -m extruct.service
/home/ivan/Documentos/scrapinghub/dev/extruct/extruct/service.py:3: MonkeyPatchWarning: Monkey-patching ssl after ssl has already been imported may lead to errors, including RecursionError on Python 3.6. It may also silently lead to incorrect behaviour on Python 3.7. Please monkey-patch earlier. See https://github.com/gevent/gevent/issues/1016. Modules that had direct imports (NOT patched): ['urllib3.util (/home/ivan/Documentos/scrapinghub/dev/extruct/venv/lib/python3.6/site-packages/urllib3/util/__init__.py)', 'urllib3.util.ssl_ (/home/ivan/Documentos/scrapinghub/dev/extruct/venv/lib/python3.6/site-packages/urllib3/util/ssl_.py)'].
monkey.patch_all()
Bottle v0.12.16 server starting up (using GeventServer())...
Listening on http://0.0.0.0:10005/
Hit Ctrl-C to quit.
A possible solution could be in this message: gevent/gevent#1235 (comment)
pip list:
Package Version Location
-------------- ---------- ---------------------------------------------
atomicwrites 1.3.0
attrs 18.2.0
beautifulsoup4 4.7.1
bottle 0.12.16
bumpversion 0.5.3
certifi 2018.11.29
chardet 3.0.4
entrypoints 0.3
extruct 0.7.2
filelock 3.0.10
flake8 3.7.5
gevent 1.4.0
greenlet 0.4.15
html5lib 1.0.1
idna 2.8
isodate 0.6.0
lxml 4.3.0
mccabe 0.6.1
mf2py 1.1.2
more-itertools 5.0.0
pip 10.0.1
pluggy 0.8.1
py 1.7.0
pycodestyle 2.5.0
pyflakes 2.1.0
pyparsing 2.3.1
pytest 4.2.0
rdflib 4.2.2
rdflib-jsonld 0.4.0
requests 2.21.0
setuptools 39.1.0
six 1.12.0
soupsieve 1.7.3
toml 0.10.0
tox 3.7.0
urllib3 1.24.1
virtualenv 16.3.0
w3lib 1.20.0
webencodings 0.5.1