
Extract embedded metadata from HTML markup

License: BSD 3-Clause "New" or "Revised" License

Topics: microdata, json-ld, rdfa, opengraph, microformats, semantic-web, hacktoberfest

extruct's Introduction

extruct


extruct is a library for extracting embedded metadata from HTML markup.

Currently, extruct supports:

  • W3C's HTML Microdata
  • embedded JSON-LD
  • (experimental) RDFa
  • Open Graph
  • HTML Microformats
  • Dublin Core

The microdata algorithm is a revisit of this Scrapinghub blog post showing how to use EXSLT extensions.

Installation

pip install extruct

Usage

All-in-one extraction

The simplest example of how to use extruct is to call extruct.extract(htmlstring, base_url=base_url) with some HTML string and an optional base URL.

Let's try this on a webpage that uses all of the supported syntaxes (RDFa with OGP).

First, fetch the HTML using python-requests and then feed the response body to extruct:

>>> import extruct
>>> import requests
>>> import pprint
>>> from w3lib.html import get_base_url
>>>
>>> pp = pprint.PrettyPrinter(indent=2)
>>> r = requests.get('https://www.optimizesmart.com/how-to-use-open-graph-protocol/')
>>> base_url = get_base_url(r.text, r.url)
>>> data = extruct.extract(r.text, base_url=base_url)
>>>
>>> pp.pprint(data)
{ 'dublincore': [ { 'elements': [ { 'URI': 'http://purl.org/dc/elements/1.1/description',
                                      'content': 'What is Open Graph Protocol '
                                                 'and why you need it? Learn to '
                                                 'implement Open Graph Protocol '
                                                 'for Facebook on your website. '
                                                 'Open Graph Protocol Meta Tags.',
                                      'name': 'description'}],
                      'namespaces': {},
                      'terms': []}],
  'json-ld': [ { '@context': 'https://schema.org',
                 '@id': '#organization',
                 '@type': 'Organization',
                 'logo': 'https://www.optimizesmart.com/wp-content/uploads/2016/03/optimize-smart-Twitter-logo.jpg',
                 'name': 'Optimize Smart',
                 'sameAs': [ 'https://www.facebook.com/optimizesmart/',
                             'https://uk.linkedin.com/in/analyticsnerd',
                             'https://www.youtube.com/user/optimizesmart',
                             'https://twitter.com/analyticsnerd'],
                 'url': 'https://www.optimizesmart.com/'}],
  'microdata': [ { 'properties': {'headline': ''},
                   'type': 'http://schema.org/WPHeader'}],
  'microformat': [ { 'children': [ { 'properties': { 'category': [ 'specialized-tracking'],
                                                     'name': [ 'Open Graph '
                                                               'Protocol for '
                                                               'Facebook '
                                                               'explained with '
                                                               'examples\n'
                                                               '\n'
                                                               'Specialized '
                                                               'Tracking\n'
                                                               '\n'
                                                               '\n'
                                                               (...)
                                                               'Follow '
                                                               '@analyticsnerd\n'
                                                               '!function(d,s,id){var '
                                                               "js,fjs=d.getElementsByTagName(s)[0],p=/^http:/.test(d.location)?'http':'https';if(!d.getElementById(id)){js=d.createElement(s);js.id=id;js.src=p+'://platform.twitter.com/widgets.js';fjs.parentNode.insertBefore(js,fjs);}}(document, "
                                                               "'script', "
                                                               "'twitter-wjs');"]},
                                     'type': ['h-entry']}],
                     'properties': { 'name': [ 'Open Graph Protocol for '
                                               'Facebook explained with '
                                               'examples\n'
                                               (...)
                                               'Follow @analyticsnerd\n'
                                               '!function(d,s,id){var '
                                               "js,fjs=d.getElementsByTagName(s)[0],p=/^http:/.test(d.location)?'http':'https';if(!d.getElementById(id)){js=d.createElement(s);js.id=id;js.src=p+'://platform.twitter.com/widgets.js';fjs.parentNode.insertBefore(js,fjs);}}(document, "
                                               "'script', 'twitter-wjs');"]},
                     'type': ['h-feed']}],
  'opengraph': [ { 'namespace': {'og': 'http://ogp.me/ns#'},
                   'properties': [ ('og:locale', 'en_US'),
                                   ('og:type', 'article'),
                                   ( 'og:title',
                                     'Open Graph Protocol for Facebook '
                                     'explained with examples'),
                                   ( 'og:description',
                                     'What is Open Graph Protocol and why you '
                                     'need it? Learn to implement Open Graph '
                                     'Protocol for Facebook on your website. '
                                     'Open Graph Protocol Meta Tags.'),
                                   ( 'og:url',
                                     'https://www.optimizesmart.com/how-to-use-open-graph-protocol/'),
                                   ('og:site_name', 'Optimize Smart'),
                                   ( 'og:updated_time',
                                     '2018-03-09T16:26:35+00:00'),
                                   ( 'og:image',
                                     'https://www.optimizesmart.com/wp-content/uploads/2010/07/open-graph-protocol.jpg'),
                                   ( 'og:image:secure_url',
                                     'https://www.optimizesmart.com/wp-content/uploads/2010/07/open-graph-protocol.jpg')]}],
  'rdfa': [ { '@id': 'https://www.optimizesmart.com/how-to-use-open-graph-protocol/#header',
              'http://www.w3.org/1999/xhtml/vocab#role': [ { '@id': 'http://www.w3.org/1999/xhtml/vocab#banner'}]},
            { '@id': 'https://www.optimizesmart.com/how-to-use-open-graph-protocol/',
              'article:modified_time': [ { '@value': '2018-03-09T16:26:35+00:00'}],
              'article:published_time': [ { '@value': '2010-07-02T18:57:23+00:00'}],
              'article:publisher': [ { '@value': 'https://www.facebook.com/optimizesmart/'}],
              'article:section': [{'@value': 'Specialized Tracking'}],
              'http://ogp.me/ns#description': [ { '@value': 'What is Open '
                                                            'Graph Protocol '
                                                            'and why you need '
                                                            'it? Learn to '
                                                            'implement Open '
                                                            'Graph Protocol '
                                                            'for Facebook on '
                                                            'your website. '
                                                            'Open Graph '
                                                            'Protocol Meta '
                                                            'Tags.'}],
              'http://ogp.me/ns#image': [ { '@value': 'https://www.optimizesmart.com/wp-content/uploads/2010/07/open-graph-protocol.jpg'}],
              'http://ogp.me/ns#image:secure_url': [ { '@value': 'https://www.optimizesmart.com/wp-content/uploads/2010/07/open-graph-protocol.jpg'}],
              'http://ogp.me/ns#locale': [{'@value': 'en_US'}],
              'http://ogp.me/ns#site_name': [{'@value': 'Optimize Smart'}],
              'http://ogp.me/ns#title': [ { '@value': 'Open Graph Protocol for '
                                                      'Facebook explained with '
                                                      'examples'}],
              'http://ogp.me/ns#type': [{'@value': 'article'}],
              'http://ogp.me/ns#updated_time': [ { '@value': '2018-03-09T16:26:35+00:00'}],
              'http://ogp.me/ns#url': [ { '@value': 'https://www.optimizesmart.com/how-to-use-open-graph-protocol/'}],
              'https://api.w.org/': [ { '@id': 'https://www.optimizesmart.com/wp-json/'}]}]}

Select syntaxes

It is possible to select which syntaxes to extract by passing a list with the desired ones to extract. Valid values: 'microdata', 'json-ld', 'opengraph', 'microformat', 'rdfa' and 'dublincore'. If no list is passed all syntaxes will be extracted and returned:

>>> r = requests.get('http://www.songkick.com/artists/236156-elysian-fields')
>>> base_url = get_base_url(r.text, r.url)
>>> data = extruct.extract(r.text, base_url, syntaxes=['microdata', 'opengraph', 'rdfa'])
>>>
>>> pp.pprint(data)
{ 'microdata': [],
  'opengraph': [ { 'namespace': { 'concerts': 'http://ogp.me/ns/fb/songkick-concerts#',
                                  'fb': 'http://www.facebook.com/2008/fbml',
                                  'og': 'http://ogp.me/ns#'},
                   'properties': [ ('fb:app_id', '308540029359'),
                                   ('og:site_name', 'Songkick'),
                                   ('og:type', 'songkick-concerts:artist'),
                                   ('og:title', 'Elysian Fields'),
                                   ( 'og:description',
                                     'Find out when Elysian Fields is next '
                                     'playing live near you. List of all '
                                     'Elysian Fields tour dates and concerts.'),
                                   ( 'og:url',
                                     'https://www.songkick.com/artists/236156-elysian-fields'),
                                   ( 'og:image',
                                     'http://images.sk-static.com/images/media/img/col4/20100330-103600-169450.jpg')]}],
  'rdfa': [ { '@id': 'https://www.songkick.com/artists/236156-elysian-fields',
              'al:ios:app_name': [{'@value': 'Songkick Concerts'}],
              'al:ios:app_store_id': [{'@value': '438690886'}],
              'al:ios:url': [ { '@value': 'songkick://artists/236156-elysian-fields'}],
              'http://ogp.me/ns#description': [ { '@value': 'Find out when '
                                                            'Elysian Fields is '
                                                            'next playing live '
                                                            'near you. List of '
                                                            'all Elysian '
                                                            'Fields tour dates '
                                                            'and concerts.'}],
              'http://ogp.me/ns#image': [ { '@value': 'http://images.sk-static.com/images/media/img/col4/20100330-103600-169450.jpg'}],
              'http://ogp.me/ns#site_name': [{'@value': 'Songkick'}],
              'http://ogp.me/ns#title': [{'@value': 'Elysian Fields'}],
              'http://ogp.me/ns#type': [{'@value': 'songkick-concerts:artist'}],
              'http://ogp.me/ns#url': [ { '@value': 'https://www.songkick.com/artists/236156-elysian-fields'}],
              'http://www.facebook.com/2008/fbmlapp_id': [ { '@value': '308540029359'}]}]}

Alternatively, if you have already parsed the HTML before calling extruct, you can pass the tree instead of the HTML string:

>>> # using the request from the previous example
>>> base_url = get_base_url(r.text, r.url)
>>> from extruct.utils import parse_html
>>> tree = parse_html(r.text)
>>> data = extruct.extract(tree, base_url, syntaxes=['microdata', 'opengraph', 'rdfa'])

The microformat syntax doesn't support an HTML tree, so you need to pass an HTML string instead.

Uniform

Another option is to uniform the output of the microformat, opengraph, microdata, dublincore and json-ld syntaxes to the following structure:

{'@context': 'http://example.com',
 '@type': 'example_type',
 /* all the other properties as keys here */
 }

To do so, set uniform=True when calling extract; it is False by default for backward compatibility. Here is the same example as before, but with uniform set to True:

>>> r = requests.get('http://www.songkick.com/artists/236156-elysian-fields')
>>> base_url = get_base_url(r.text, r.url)
>>> data = extruct.extract(r.text, base_url, syntaxes=['microdata', 'opengraph', 'rdfa'], uniform=True)
>>>
>>> pp.pprint(data)
{ 'microdata': [],
  'opengraph': [ { '@context': { 'concerts': 'http://ogp.me/ns/fb/songkick-concerts#',
                               'fb': 'http://www.facebook.com/2008/fbml',
                               'og': 'http://ogp.me/ns#'},
                 '@type': 'songkick-concerts:artist',
                 'fb:app_id': '308540029359',
                 'og:description': 'Find out when Elysian Fields is next '
                                   'playing live near you. List of all '
                                   'Elysian Fields tour dates and concerts.',
                 'og:image': 'http://images.sk-static.com/images/media/img/col4/20100330-103600-169450.jpg',
                 'og:site_name': 'Songkick',
                 'og:title': 'Elysian Fields',
                 'og:url': 'https://www.songkick.com/artists/236156-elysian-fields'}],
  'rdfa': [ { '@id': 'https://www.songkick.com/artists/236156-elysian-fields',
              'al:ios:app_name': [{'@value': 'Songkick Concerts'}],
              'al:ios:app_store_id': [{'@value': '438690886'}],
              'al:ios:url': [ { '@value': 'songkick://artists/236156-elysian-fields'}],
              'http://ogp.me/ns#description': [ { '@value': 'Find out when '
                                                            'Elysian Fields is '
                                                            'next playing live '
                                                            'near you. List of '
                                                            'all Elysian '
                                                            'Fields tour dates '
                                                            'and concerts.'}],
              'http://ogp.me/ns#image': [ { '@value': 'http://images.sk-static.com/images/media/img/col4/20100330-103600-169450.jpg'}],
              'http://ogp.me/ns#site_name': [{'@value': 'Songkick'}],
              'http://ogp.me/ns#title': [{'@value': 'Elysian Fields'}],
              'http://ogp.me/ns#type': [{'@value': 'songkick-concerts:artist'}],
              'http://ogp.me/ns#url': [ { '@value': 'https://www.songkick.com/artists/236156-elysian-fields'}],
              'http://www.facebook.com/2008/fbmlapp_id': [ { '@value': '308540029359'}]}]}

NB: the rdfa output is not uniformed yet.
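
With uniform output, each Open Graph item is a flat dict, so individual properties become plain key lookups. A minimal sketch based on the example above:

>>> [og.get('og:title') for og in data['opengraph']]
['Elysian Fields']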

Returning HTML node

It is also possible to get a reference to the HTML node for every extracted metadata item. This feature is currently supported only by the microdata syntax.

To use it, set the return_html_node option of the extract method to True. As a result, an additional key "htmlNode" will be included in the result for every item. Each node is of lxml.etree.Element type:

>>> r = requests.get('http://www.rugpadcorner.com/shop/no-muv/')
>>> base_url = get_base_url(r.text, r.url)
>>> data = extruct.extract(r.text, base_url, syntaxes=['microdata'], return_html_node=True)
>>>
>>> pp.pprint(data)
{ 'microdata': [ { 'htmlNode': <Element div at 0x7f10f8e6d3b8>,
                   'properties': { 'description': 'KEEP RUGS FLAT ON CARPET!\n'
                                                  'Not your thin sticky pad, '
                                                  'No-Muv is truly the best!',
                                   'image': ['', ''],
                                   'name': ['No-Muv', 'No-Muv'],
                                   'offers': [ { 'htmlNode': <Element div at 0x7f10f8e6d138>,
                                                 'properties': { 'availability': 'http://schema.org/InStock',
                                                                 'price': 'Price:  '
                                                                          '$45'},
                                                 'type': 'http://schema.org/Offer'},
                                               { 'htmlNode': <Element div at 0x7f10f8e60f48>,
                                                 'properties': { 'availability': 'http://schema.org/InStock',
                                                                 'price': '(Select '
                                                                          'Size/Shape '
                                                                          'for '
                                                                          'Pricing)'},
                                                 'type': 'http://schema.org/Offer'}],
                                   'ratingValue': ['5.00', '5.00']},
                   'type': 'http://schema.org/Product'}]}
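
The returned nodes are regular lxml elements, so you can, for instance, serialize one back to HTML markup. A minimal sketch using the result above (the exact markup depends on the page):

>>> from lxml.html import tostring
>>> node = data['microdata'][0]['htmlNode']
>>> snippet = tostring(node, encoding='unicode')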

Single extractors

You can also use each extractor individually. See below.

Microdata extraction

>>> import pprint
>>> pp = pprint.PrettyPrinter(indent=2)
>>>
>>> from extruct.w3cmicrodata import MicrodataExtractor
>>>
>>> # example from http://www.w3.org/TR/microdata/#associating-names-with-items
>>> html = """<!DOCTYPE HTML>
... <html>
...  <head>
...   <title>Photo gallery</title>
...  </head>
...  <body>
...   <h1>My photos</h1>
...   <figure itemscope itemtype="http://n.whatwg.org/work" itemref="licenses">
...    <img itemprop="work" src="images/house.jpeg" alt="A white house, boarded up, sits in a forest.">
...    <figcaption itemprop="title">The house I found.</figcaption>
...   </figure>
...   <figure itemscope itemtype="http://n.whatwg.org/work" itemref="licenses">
...    <img itemprop="work" src="images/mailbox.jpeg" alt="Outside the house is a mailbox. It has a leaflet inside.">
...    <figcaption itemprop="title">The mailbox.</figcaption>
...   </figure>
...   <footer>
...    <p id="licenses">All images licensed under the <a itemprop="license"
...    href="http://www.opensource.org/licenses/mit-license.php">MIT
...    license</a>.</p>
...   </footer>
...  </body>
... </html>"""
>>>
>>> mde = MicrodataExtractor()
>>> data = mde.extract(html)
>>> pp.pprint(data)
[{'properties': {'license': 'http://www.opensource.org/licenses/mit-license.php',
                 'title': 'The house I found.',
                 'work': 'http://www.example.com/images/house.jpeg'},
  'type': 'http://n.whatwg.org/work'},
 {'properties': {'license': 'http://www.opensource.org/licenses/mit-license.php',
                 'title': 'The mailbox.',
                 'work': 'http://www.example.com/images/mailbox.jpeg'},
  'type': 'http://n.whatwg.org/work'}]
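
Since items are plain dicts, standard Python is enough to pull fields out of the result. For example, collecting all titles from the output above:

>>> [item['properties']['title'] for item in data]
['The house I found.', 'The mailbox.']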

JSON-LD extraction

>>> import pprint
>>> pp = pprint.PrettyPrinter(indent=2)
>>>
>>> from extruct.jsonld import JsonLdExtractor
>>>
>>> html = """<!DOCTYPE HTML>
... <html>
...  <head>
...   <title>Some Person Page</title>
...  </head>
...  <body>
...   <h1>This guys</h1>
...     <script type="application/ld+json">
...     {
...       "@context": "http://schema.org",
...       "@type": "Person",
...       "name": "John Doe",
...       "jobTitle": "Graduate research assistant",
...       "affiliation": "University of Dreams",
...       "additionalName": "Johnny",
...       "url": "http://www.example.com",
...       "address": {
...         "@type": "PostalAddress",
...         "streetAddress": "1234 Peach Drive",
...         "addressLocality": "Wonderland",
...         "addressRegion": "Georgia"
...       }
...     }
...     </script>
...  </body>
... </html>"""
>>>
>>> jslde = JsonLdExtractor()
>>>
>>> data = jslde.extract(html)
>>> pp.pprint(data)
[{'@context': 'http://schema.org',
  '@type': 'Person',
  'additionalName': 'Johnny',
  'address': {'@type': 'PostalAddress',
              'addressLocality': 'Wonderland',
              'addressRegion': 'Georgia',
              'streetAddress': '1234 Peach Drive'},
  'affiliation': 'University of Dreams',
  'jobTitle': 'Graduate research assistant',
  'name': 'John Doe',
  'url': 'http://www.example.com'}]
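
The extractor returns the decoded JSON-LD objects as-is, so nested fields are ordinary dict lookups. For example, with the result above:

>>> data[0]['address']['addressLocality']
'Wonderland'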

RDFa extraction (experimental)

>>> import pprint
>>> pp = pprint.PrettyPrinter(indent=2)
>>> from extruct.rdfa import RDFaExtractor  # you can ignore the warning about html5lib not being available
INFO:rdflib:RDFLib Version: 4.2.1
/home/paul/.virtualenvs/extruct.wheel.test/lib/python3.5/site-packages/rdflib/plugins/parsers/structureddata.py:30: UserWarning: html5lib not found! RDFa and Microdata parsers will not be available.
  'parsers will not be available.')
>>>
>>> html = """<html>
...  <head>
...    ...
...  </head>
...  <body prefix="dc: http://purl.org/dc/terms/ schema: http://schema.org/">
...    <div resource="/alice/posts/trouble_with_bob" typeof="schema:BlogPosting">
...       <h2 property="dc:title">The trouble with Bob</h2>
...       ...
...       <h3 property="dc:creator schema:creator" resource="#me">Alice</h3>
...       <div property="schema:articleBody">
...         <p>The trouble with Bob is that he takes much better photos than I do:</p>
...       </div>
...      ...
...    </div>
...  </body>
... </html>
... """
>>>
>>> rdfae = RDFaExtractor()
>>> pp.pprint(rdfae.extract(html, base_url='http://www.example.com/index.html'))
[{'@id': 'http://www.example.com/alice/posts/trouble_with_bob',
  '@type': ['http://schema.org/BlogPosting'],
  'http://purl.org/dc/terms/creator': [{'@id': 'http://www.example.com/index.html#me'}],
  'http://purl.org/dc/terms/title': [{'@value': 'The trouble with Bob'}],
  'http://schema.org/articleBody': [{'@value': '\n'
                                               '        The trouble with Bob '
                                               'is that he takes much better '
                                               'photos than I do:\n'
                                               '      '}],
  'http://schema.org/creator': [{'@id': 'http://www.example.com/index.html#me'}]}]

You'll get a list of expanded JSON-LD nodes.
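
For example, literal values sit under '@value' inside each property list:

>>> nodes = rdfae.extract(html, base_url='http://www.example.com/index.html')
>>> nodes[0]['http://purl.org/dc/terms/title'][0]['@value']
'The trouble with Bob'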

Open Graph extraction

>>> import pprint
>>> pp = pprint.PrettyPrinter(indent=2)
>>>
>>> from extruct.opengraph import OpenGraphExtractor
>>>
>>> html = """<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "https://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
... <html xmlns="https://www.w3.org/1999/xhtml" xmlns:og="https://ogp.me/ns#" xmlns:fb="https://www.facebook.com/2008/fbml">
...  <head>
...   <title>Himanshu's Open Graph Protocol</title>
...   <meta http-equiv="Content-Type" content="text/html;charset=WINDOWS-1252" />
...   <meta http-equiv="Content-Language" content="en-us" />
...   <link rel="stylesheet" type="text/css" href="event-education.css" />
...   <meta name="verify-v1" content="so4y/3aLT7/7bUUB9f6iVXN0tv8upRwaccek7JKB1gs=" >
...   <meta property="og:title" content="Himanshu's Open Graph Protocol"/>
...   <meta property="og:type" content="article"/>
...   <meta property="og:url" content="https://www.eventeducation.com/test.php"/>
...   <meta property="og:image" content="https://www.eventeducation.com/images/982336_wedding_dayandouan_th.jpg"/>
...   <meta property="fb:admins" content="himanshu160"/>
...   <meta property="og:site_name" content="Event Education"/>
...   <meta property="og:description" content="Event Education provides free courses on event planning and management to event professionals worldwide."/>
...  </head>
...  <body>
...   <div id="fb-root"></div>
...   <script>(function(d, s, id) {
...               var js, fjs = d.getElementsByTagName(s)[0];
...               if (d.getElementById(id)) return;
...                  js = d.createElement(s); js.id = id;
...                  js.src = "//connect.facebook.net/en_US/all.js#xfbml=1&appId=501839739845103";
...                  fjs.parentNode.insertBefore(js, fjs);
...                  }(document, 'script', 'facebook-jssdk'));</script>
...  </body>
... </html>"""
>>>
>>> opengraphe = OpenGraphExtractor()
>>> pp.pprint(opengraphe.extract(html))
[{"namespace": {
      "og": "http://ogp.me/ns#"
  },
  "properties": [
      [
          "og:title",
          "Himanshu's Open Graph Protocol"
      ],
      [
          "og:type",
          "article"
      ],
      [
          "og:url",
          "https://www.eventeducation.com/test.php"
      ],
      [
          "og:image",
          "https://www.eventeducation.com/images/982336_wedding_dayandouan_th.jpg"
      ],
      [
          "og:site_name",
          "Event Education"
      ],
      [
          "og:description",
          "Event Education provides free courses on event planning and management to event professionals worldwide."
      ]
    ]
 }]
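
Properties come back as a list of (name, value) pairs; turning them into a dict is a one-liner, though note that this keeps only the last value for any duplicated key (see the Open Graph arrays issue further below):

>>> data = opengraphe.extract(html)
>>> dict(data[0]['properties'])['og:title']
"Himanshu's Open Graph Protocol"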

Microformat extraction

>>> import pprint
>>> pp = pprint.PrettyPrinter(indent=2)
>>>
>>> from extruct.microformat import MicroformatExtractor
>>>
>>> html = """<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "https://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
... <html xmlns="https://www.w3.org/1999/xhtml" xmlns:og="https://ogp.me/ns#" xmlns:fb="https://www.facebook.com/2008/fbml">
...  <head>
...   <title>Himanshu's Open Graph Protocol</title>
...   <meta http-equiv="Content-Type" content="text/html;charset=WINDOWS-1252" />
...   <meta http-equiv="Content-Language" content="en-us" />
...   <link rel="stylesheet" type="text/css" href="event-education.css" />
...   <meta name="verify-v1" content="so4y/3aLT7/7bUUB9f6iVXN0tv8upRwaccek7JKB1gs=" >
...   <meta property="og:title" content="Himanshu's Open Graph Protocol"/>
...   <article class="h-entry">
...    <h1 class="p-name">Microformats are amazing</h1>
...    <p>Published by <a class="p-author h-card" href="http://example.com">W. Developer</a>
...       on <time class="dt-published" datetime="2013-06-13 12:00:00">13<sup>th</sup> June 2013</time></p>
...    <p class="p-summary">In which I extoll the virtues of using microformats.</p>
...    <div class="e-content">
...     <p>Blah blah blah</p>
...    </div>
...   </article>
...  </head>
...  <body></body>
... </html>"""
>>>
>>> microformate = MicroformatExtractor()
>>> data = microformate.extract(html)
>>> pp.pprint(data)
[{"type": [
      "h-entry"
  ],
  "properties": {
      "name": [
          "Microformats are amazing"
      ],
      "author": [
          {
              "type": [
                  "h-card"
              ],
              "properties": {
                  "name": [
                      "W. Developer"
                  ],
                  "url": [
                      "http://example.com"
                  ]
              },
              "value": "W. Developer"
          }
      ],
      "published": [
          "2013-06-13 12:00:00"
      ],
      "summary": [
          "In which I extoll the virtues of using microformats."
      ],
      "content": [
          {
              "html": "\n<p>Blah blah blah</p>\n",
              "value": "\nBlah blah blah\n"
          }
      ]
    }
 }]
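
Nested microformat objects carry both structured 'properties' and a flattened 'value'; for example, reading the author name from the result above:

>>> data[0]['properties']['author'][0]['value']
'W. Developer'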

DublinCore extraction

>>> import pprint
>>> pp = pprint.PrettyPrinter(indent=2)
>>> from extruct.dublincore import DublinCoreExtractor
>>> html = '''<head profile="http://dublincore.org/documents/dcq-html/">
... <title>Expressing Dublin Core in HTML/XHTML meta and link elements</title>
... <link rel="schema.DC" href="http://purl.org/dc/elements/1.1/" />
... <link rel="schema.DCTERMS" href="http://purl.org/dc/terms/" />
...
...
... <meta name="DC.title" lang="en" content="Expressing Dublin Core
... in HTML/XHTML meta and link elements" />
... <meta name="DC.creator" content="Andy Powell, UKOLN, University of Bath" />
... <meta name="DCTERMS.issued" scheme="DCTERMS.W3CDTF" content="2003-11-01" />
... <meta name="DC.identifier" scheme="DCTERMS.URI"
... content="http://dublincore.org/documents/dcq-html/" />
... <link rel="DCTERMS.replaces" hreflang="en"
... href="http://dublincore.org/documents/2000/08/15/dcq-html/" />
... <meta name="DCTERMS.abstract" content="This document describes how
... qualified Dublin Core metadata can be encoded
... in HTML/XHTML &lt;meta&gt; elements" />
... <meta name="DC.format" scheme="DCTERMS.IMT" content="text/html" />
... <meta name="DC.type" scheme="DCTERMS.DCMIType" content="Text" />
... <meta name="DC.Date.modified" content="2001-07-18" />
... <meta name="DCTERMS.modified" content="2001-07-18" />'''
>>> dublinlde = DublinCoreExtractor()
>>> data = dublinlde.extract(html)
>>> pp.pprint(data)
[ { 'elements': [ { 'URI': 'http://purl.org/dc/elements/1.1/title',
                    'content': 'Expressing Dublin Core\n'
                               'in HTML/XHTML meta and link elements',
                    'lang': 'en',
                    'name': 'DC.title'},
                  { 'URI': 'http://purl.org/dc/elements/1.1/creator',
                    'content': 'Andy Powell, UKOLN, University of Bath',
                    'name': 'DC.creator'},
                  { 'URI': 'http://purl.org/dc/elements/1.1/identifier',
                    'content': 'http://dublincore.org/documents/dcq-html/',
                    'name': 'DC.identifier',
                    'scheme': 'DCTERMS.URI'},
                  { 'URI': 'http://purl.org/dc/elements/1.1/format',
                    'content': 'text/html',
                    'name': 'DC.format',
                    'scheme': 'DCTERMS.IMT'},
                  { 'URI': 'http://purl.org/dc/elements/1.1/type',
                    'content': 'Text',
                    'name': 'DC.type',
                    'scheme': 'DCTERMS.DCMIType'}],
    'namespaces': { 'DC': 'http://purl.org/dc/elements/1.1/',
                    'DCTERMS': 'http://purl.org/dc/terms/'},
    'terms': [ { 'URI': 'http://purl.org/dc/terms/issued',
                 'content': '2003-11-01',
                 'name': 'DCTERMS.issued',
                 'scheme': 'DCTERMS.W3CDTF'},
               { 'URI': 'http://purl.org/dc/terms/abstract',
                 'content': 'This document describes how\n'
                            'qualified Dublin Core metadata can be encoded\n'
                            'in HTML/XHTML <meta> elements',
                 'name': 'DCTERMS.abstract'},
               { 'URI': 'http://purl.org/dc/terms/modified',
                 'content': '2001-07-18',
                 'name': 'DC.Date.modified'},
               { 'URI': 'http://purl.org/dc/terms/modified',
                 'content': '2001-07-18',
                 'name': 'DCTERMS.modified'},
               { 'URI': 'http://purl.org/dc/terms/replaces',
                 'href': 'http://dublincore.org/documents/2000/08/15/dcq-html/',
                 'hreflang': 'en',
                 'rel': 'DCTERMS.replaces'}]}]
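
A small sketch to map element names to their content values from the result above:

>>> elements = {e['name']: e.get('content') for e in data[0]['elements']}
>>> elements['DC.creator']
'Andy Powell, UKOLN, University of Bath'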

Command Line Tool

extruct provides a command line tool that lets you fetch a page and extract its metadata directly from the command line.

Dependencies

The command line tool depends on requests, which is not installed by default when you install extruct. In order to use the command line tool, you can install extruct with the cli extra requirements:

pip install 'extruct[cli]'

Usage

extruct "http://example.com"

Downloads "http://example.com" and outputs the Microdata, JSON-LD, RDFa, Open Graph and Microformat metadata to stdout.

Supported Parameters

By default, the command line tool will try to extract all the supported metadata formats from the page (currently Microdata, JSON-LD, RDFa, Open Graph and Microformat). If you want to restrict the output to just one or a subset of those, you can pass the desired syntax names to the --syntaxes argument.

For example, this command extracts only Microdata and JSON-LD metadata from "http://example.com":

extruct "http://example.com" --syntaxes microdata json-ld

NB: the syntax names passed must be among these: microdata, json-ld, rdfa, opengraph, microformat.
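
The tool writes its results as JSON to stdout, so (assuming the output is plain JSON) it can be piped through standard tooling for pretty-printing:

extruct "http://example.com" --syntaxes json-ld | python -m json.tool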

Development version

mkvirtualenv extruct
pip install -r requirements-dev.txt

Tests

Run tests in current environment:

py.test tests

Use tox to run tests with different Python versions:

tox

extruct's People

Contributors

adityas114, andrix, brycestevenwilley, burnzz, cathalgarvey, croqaz, eliasdorneles, gallaecio, grafst, ivanprado, jakubwasikowski, jayaddison, joaquingx, kebniss, kmike, lopuhin, marillat, michael-genson, mikhuang, osaid-r, platelminto, redapple, rmax, rotzbua, sbdchd, serhii73, shiquanwang, shivindass, stummjr, susca


extruct's Issues

Memory leak issue

Steps to reproduce:
Log in to an Ubuntu box and run extruct as a service:

nohup python -m extruct.service &

Hit http://localhost:10005/extruct/ in a loop for 20k URLs. Memory consumption increases over time and never comes down.

Deprecate Server and HTTP Client mode?

At present, extruct supports an HTTP API for "testing", but that carries a maintenance burden, and it invites feature requests that may nudge it more and more into becoming a monolithic proxy service. That's not really where we want extruct to be, I think.

Similarly with the HTTP client mode and the CLI tool that offers it: it's a mode of operation that probably shouldn't be our priority for extruct. I feel that if we provide a CLI client for extruct, it should probably just accept HTML through a Unix pipe or from a file, and operate on that. That way, people can use curl or wget or whatever else they like, and they won't have to worry about extruct's support for various HTTP client features.

Thoughts? :)

RecursionError('maximum recursion depth exceeded while calling a Python object',)

The REST API is not working for extruct 0.7.2.

The response for requests is:

{
url: "https://nerdist.com/article/star-wars-cast-reylo-episode-ix/",
status: "error",
message: "RecursionError('maximum recursion depth exceeded while calling a Python object',)"
}

A warning is shown at startup:

python -m extruct.service
/home/ivan/Documentos/scrapinghub/dev/extruct/extruct/service.py:3: MonkeyPatchWarning: Monkey-patching ssl after ssl has already been imported may lead to errors, including RecursionError on Python 3.6. It may also silently lead to incorrect behaviour on Python 3.7. Please monkey-patch earlier. See https://github.com/gevent/gevent/issues/1016. Modules that had direct imports (NOT patched): ['urllib3.util (/home/ivan/Documentos/scrapinghub/dev/extruct/venv/lib/python3.6/site-packages/urllib3/util/__init__.py)', 'urllib3.util.ssl_ (/home/ivan/Documentos/scrapinghub/dev/extruct/venv/lib/python3.6/site-packages/urllib3/util/ssl_.py)']. 
  monkey.patch_all()
Bottle v0.12.16 server starting up (using GeventServer())...
Listening on http://0.0.0.0:10005/
Hit Ctrl-C to quit.

A possible solution could be in this message: gevent/gevent#1235 (comment)

pip list:

Package        Version    Location                                     
-------------- ---------- ---------------------------------------------
atomicwrites   1.3.0      
attrs          18.2.0     
beautifulsoup4 4.7.1      
bottle         0.12.16    
bumpversion    0.5.3      
certifi        2018.11.29 
chardet        3.0.4      
entrypoints    0.3        
extruct        0.7.2      
filelock       3.0.10     
flake8         3.7.5      
gevent         1.4.0      
greenlet       0.4.15     
html5lib       1.0.1      
idna           2.8        
isodate        0.6.0      
lxml           4.3.0      
mccabe         0.6.1      
mf2py          1.1.2      
more-itertools 5.0.0      
pip            10.0.1     
pluggy         0.8.1      
py             1.7.0      
pycodestyle    2.5.0      
pyflakes       2.1.0      
pyparsing      2.3.1      
pytest         4.2.0      
rdflib         4.2.2      
rdflib-jsonld  0.4.0      
requests       2.21.0     
setuptools     39.1.0     
six            1.12.0     
soupsieve      1.7.3      
toml           0.10.0     
tox            3.7.0      
urllib3        1.24.1     
virtualenv     16.3.0     
w3lib          1.20.0     
webencodings   0.5.1

Add support for Open Graph Arrays

It is possible in Open Graph Protocol to specify more than one value for a single property. It's called OG Array http://ogp.me/#array.

It seems that extruct currently doesn't support arrays when the uniform option is set to True, because the uniform._uopengraph function doesn't handle duplicated keys in the list of properties.

It'd be cool to add that support and return a list when there is more than one value, or to add a separate property with a list suffix to stay backward compatible.
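
For reference, a minimal sketch of the requested grouping over the (name, value) pairs that the Open Graph extractor returns (a proposal, not current extruct behaviour):

from collections import defaultdict

def group_og_properties(properties):
    # Collect repeated Open Graph keys ("OG arrays") into lists.
    grouped = defaultdict(list)
    for key, value in properties:
        grouped[key].append(value)
    return dict(grouped)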

The utility gets stuck in the middle

I have been using extruct inside a Scrapy spider, and the code gets stuck in the middle: it neither goes forward nor skips the URL. There is also no code error, no exception, nothing.

Ability to look up items by itemtype or itemprop

When one is interested in only some parts of the microdata rather than all of it, filtering out the required content is not very straightforward. Can we support look-up by itemtype or itemprop values, as follows:

>>> data = mde.extract(html)
>>> data.get_first(itemprop='name')
'foo'
>>> data.get(itemtype='http://schema.org/Person')
[{'name': 'foo', 'jobTitle': 'bar', 'additionalName': 'foobar'}]
>>> data.get(itemtype='http://schema.org/Person', itemprop='name')
['foo', 'abc', 'cde', 'def']
>>> data.get(itemtype='http://schema.org/Organization', itemprop='name')
['foocompany']

or a cleaner version with some sort of built-in support for popular vocabularies:

>>> data.get(itemtype=schema_org.Person)
[{'name': 'foo', 'jobTitle': 'bar', 'additionalName': 'foobar'}, {'name': 'abc', ...}]
>>> data.get(itemtype=schema_org.Person, itemprop='name')
['foo', 'abc', 'cde', 'def']
>>> data.get_first(itemtype=schema_org.Organization, itemprop='name')
'foocompany'
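
A rough standalone sketch of such a lookup over MicrodataExtractor output (a hypothetical helper, not part of extruct's API):

def get_items(data, itemtype=None, itemprop=None):
    # Yield matching properties (or whole property dicts) from
    # MicrodataExtractor output.
    for item in data:
        if itemtype is not None and item.get('type') != itemtype:
            continue
        props = item.get('properties', {})
        if itemprop is None:
            yield props
        elif itemprop in props:
            yield props[itemprop]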

Parsing of JSON-LD breaks when the JSON is followed by a semicolon

I don't know if having the JSON followed by a semicolon constitutes valid JSON-LD, but I have encountered it in the wild.

Running extruct on the following works fine:

<script type="application/ld+json">
{}
</script>

However, this breaks:

<script type="application/ld+json">
{};
</script>

The error message looks like this:

Failed to extract json-ld, raises Extra data: line 2 column 3 (char 3)
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/extruct/jsonld.py", line 34, in _extract_items
    data = json.loads(script, strict=False)
  File "/usr/lib/python3.6/json/__init__.py", line 367, in loads
    return cls(**kw).decode(s)
  File "/usr/lib/python3.6/json/decoder.py", line 342, in decode
    raise JSONDecodeError("Extra data", s, end)
json.decoder.JSONDecodeError: Extra data: line 2 column 3 (char 3)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/extruct/_extruct.py", line 101, in extract
    output[syntax] = list(extract(document, base_url=base_url))
  File "/usr/local/lib/python3.6/dist-packages/extruct/jsonld.py", line 26, in extract_items
    for items in map(self._extract_items, self._xp_jsonld(document))
  File "/usr/local/lib/python3.6/dist-packages/extruct/jsonld.py", line 25, in <listcomp>
    item
  File "/usr/local/lib/python3.6/dist-packages/extruct/jsonld.py", line 38, in _extract_items
    HTML_OR_JS_COMMENTLINE.sub('', script), strict=False)
  File "/usr/lib/python3.6/json/__init__.py", line 367, in loads
    return cls(**kw).decode(s)
  File "/usr/lib/python3.6/json/decoder.py", line 342, in decode
    raise JSONDecodeError("Extra data", s, end)
json.decoder.JSONDecodeError: Extra data: line 2 column 3 (char 3)

Extruct version 0.7.2; Python version 3.6.7; Ubuntu 18.04.2
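
A possible pre-processing workaround (an assumption, not a fix inside extruct) is to strip a single trailing semicolon before decoding:

import json
import re

def loads_tolerant(script):
    # Drop a trailing semicolon after the JSON payload, then decode.
    return json.loads(re.sub(r';\s*$', '', script.strip()), strict=False)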

Little things on latest RDF push

Testing out the new RDF push, just some quick things that came up with possible fixes:

  • pip install extruct[rdfa] didn't get the latest RDF files, had to clone from git (assuming this is by design but mentioning just in case)
  • The example in the README led to this error rdflib.plugin.PluginException: No plugin registered for (json-ld, <class 'rdflib.serializer.Serializer'>). I'm not sure if this is because of how I downloaded RDFlib previously, but in any case the fix for me was the following: git clone https://github.com/RDFLib/rdflib-jsonld.git && cd rdflib-jsonld && python setup.py install
  • (small typo) README file has url='http://www.exaple.com/index.html' -> example.com

I'd like to put this in a pull request but not sure how best to handle the middle case in particular.

TypeError: 'NoneType' object is not iterable on some pages

Hi, I've been using extruct pretty successfully but came across a URL that seems to validate OK but when I run it through I get an error:

  File "/Users/frankapap/KCApp/extruct/RecipeInfoService.py", line 24, in recipeExtract
    data = extruct.extract(r.text, base_url=base_url,syntaxes=['json-ld', 'opengraph'],uniform=True)
  File "/usr/local/lib/python3.7/site-packages/extruct/_extruct.py", line 67, in extract
    output[label] = list(extract(document, base_url=base_url))
  File "/usr/local/lib/python3.7/site-packages/extruct/jsonld.py", line 25, in extract_items
    self._xp_jsonld(document))
  File "/usr/local/lib/python3.7/site-packages/extruct/jsonld.py", line 26, in <listcomp>
    for item in items
TypeError: 'NoneType' object is not iterable

The code is:
data = extruct.extract(r.text, base_url=base_url,syntaxes=['json-ld', 'opengraph'],uniform=True)
The URL being passed in is https://www.tasteofhome.com/collection/keto-diet-recipes/view-all/

RDFa ordering not preserved on duplicated properties

When a property is repeated (e.g. on a page with multiple images annotated as og:image), RDFa returns it as a list but does not preserve order. Preserving order is important, as usually the first image is the most important one. An example of a page where this happens:

https://cleantechnica.com/2019/04/16/fukushimas-final-costs-will-approach-one-trillion-dollars-just-for-nuclear-disaster/

It seems difficult to solve this in extruct, as the problem appears to be in the PyRdfa library; it even happens in the online service: https://www.w3.org/2012/pyRdfa/Overview.html#distill_by_uri+with_options

Related with #115 (I created an xfail test for that in this PR)

How to extract only value of a one or two JSON-LD parameters ?

Hello,
ref your example
{ 'json-ld': [ { '@context': 'http://schema.org',
                 '@id': 'FP',
                 '@type': 'Product',
                 'brand': { '@type': 'Brand',
                            'url': 'https://www.sarenza.com/i-love-shoes'},
                 'color': ['Lava', 'Black', 'Lt grey'],
                 'image': [ 'https://cdn.sarenza.net/_img/productsv4/0000119412/MD_0000119412_223992_08.jpg?201509221045&v=20180313113923'],
                 'name': 'Susket',
                 'offers': { '@type': 'AggregateOffer',
                             'availability': 'InStock',
                             'highPrice': '49.00',
                             'lowPrice': '0.00',
                             'price': '0.00',
                             'priceCurrency': 'EUR'}}],

Is it possible to extract EXACTLY the name and image values from the command line?
I mean something like
extruct "https://www.sarenza.com/i-love-shoes-susket-s767163-br964-t76-p0000119412" --syntaxes json-ld | extruct name,image
which would output the bare values:
Susket
https://cdn.sarenza.net/_img/productsv4/0000119412/MD_0000119412_223992_08.jpg?201509221045&v=20180313113923

Thanks in advance for any hint !
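
There is no such CLI flag, but the same result is a few lines of Python with the library (a sketch; the field names are taken from the example above):

import extruct
import requests

r = requests.get('https://www.sarenza.com/i-love-shoes-susket-s767163-br964-t76-p0000119412')
data = extruct.extract(r.text, syntaxes=['json-ld'])
for item in data['json-ld']:
    if item.get('@type') == 'Product':
        print(item.get('name'))
        print(item.get('image'))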

JSON Comment Removal RE may remove comments from within JSON strings

[The regex used to remove comments](https://github.com/scrapinghub/extruct/blob/c465e629c9e35cff08a703f6d2912c1c71c642ff/extruct/jsonld.py#L13) from JSON that fails to decode is pinned to the beginning on one side, but not the other. So the regex may remove HTML comments from strings within the JSON document, as well as outside the JSON.

A quick fix would be to just add the ^ token to the other pattern, or to brace the two patterns so they share the same ^. But I'm wondering whether trailing comments are also an occasional problem, and if so, whether a custom little FSM scanner might be a better solution: e.g., something that scans for the earliest possible "valid" opening character for a JSON document and the last such character, and returns the indices of those two characters for decoding.
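
A rough sketch of that scanning idea (with assumptions about what counts as a plausible opening/closing character):

def json_bounds(s):
    # Earliest plausible JSON opening character...
    starts = [i for i in (s.find('{'), s.find('[')) if i != -1]
    start = min(starts) if starts else 0
    # ...and the latest plausible closing character.
    end = max(s.rfind('}'), s.rfind(']'))
    end = end + 1 if end != -1 else len(s)
    return start, end

# Usage: json.loads(script[slice(*json_bounds(script))])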

Unicode/string parsing error

I'm trying to parse structured metadata from this url. I first executed this code on the example URL https://www.optimizesmart.com/how-to-use-open-graph-protocol/:

import extruct
import requests
from w3lib.html import get_base_url

def extract_metadata(url):
    r = requests.get(url)
    base_url = get_base_url(r.text, r.url)
    data = extruct.extract(r.text, base_url=base_url)
    return(data)

url = 'https://www.optimizesmart.com/how-to-use-open-graph-protocol/'
data = extract_metadata(url)
print(data)

And works just fine. However, this block of code:

url = 'https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/G4TBLF'
data = extract_metadata(url)
print(data)

returns this error

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-13-f0db0dd65eaf> in <module>()
      1 url = 'https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/G4TBLF'
----> 2 data = extract_metadata(url)
      3 print(data)

<ipython-input-3-25c85aeebf1a> in extract_metadata(url)
      2     r = requests.get(url)
      3     base_url = get_base_url(r.text, r.url)
----> 4     data = extruct.extract(r.text, base_url=base_url)
      5     return(data)

/usr/local/lib/python3.5/dist-packages/extruct/_extruct.py in extract(htmlstring, base_url, encoding, syntaxes, errors, uniform, return_html_node, schema_context, **kwargs)
     50         raise ValueError('Invalid error command, valid values are either "log"'
     51                          ', "ignore" or "strict"')
---> 52     tree = parse_xmldom_html(htmlstring, encoding=encoding)
     53     processors = []
     54     if 'microdata' in syntaxes:

/usr/local/lib/python3.5/dist-packages/extruct/utils.py in parse_xmldom_html(html, encoding)
     14     """ Parse HTML using XmlDomHTMLParser, return a tree """
     15     parser = XmlDomHTMLParser(encoding=encoding)
---> 16     return lxml.html.fromstring(html, parser=parser)

/usr/local/lib/python3.5/dist-packages/lxml/html/__init__.py in fromstring(html, base_url, parser, **kw)
    874     else:
    875         is_full_html = _looks_like_full_html_unicode(html)
--> 876     doc = document_fromstring(html, parser=parser, base_url=base_url, **kw)
    877     if is_full_html:
    878         return doc

/usr/local/lib/python3.5/dist-packages/lxml/html/__init__.py in document_fromstring(html, parser, ensure_head_body, **kw)
    760     if parser is None:
    761         parser = html_parser
--> 762     value = etree.fromstring(html, parser, **kw)
    763     if value is None:
    764         raise etree.ParserError(

src/lxml/etree.pyx in lxml.etree.fromstring()

src/lxml/parser.pxi in lxml.etree._parseMemoryDocument()

ValueError: Unicode strings with encoding declaration are not supported. Please use bytes input or XML fragments without declaration.

Any idea what is going on here? Seems like an lxml.etree parsing error? Can I somehow modify r.text to fix this error? Any help is appreciated...
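
A common workaround (an assumption about this page, not an official fix) is to pass the raw response bytes so lxml can honour the document's own encoding declaration:

def extract_metadata(url):
    r = requests.get(url)
    base_url = get_base_url(r.text, r.url)
    # Pass bytes (r.content) instead of str (r.text) to sidestep lxml's
    # "Unicode strings with encoding declaration" ValueError.
    return extruct.extract(r.content, base_url=base_url)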

Add a generic extractor that would call each built-in extractors

Instead of calling each extractor for individual microdata formats, there could be a do-all extractor combining the results of several extractors.
Something like (pseudo-Python-code):

class GenericExtractor:

    def extract(self, string, url):
        tree = lxml.html.fromstring(string)
        return self.extract_items(tree, url)

    def extract_items(self, tree, url):
        output = {}
        for name, extractor in self.extractors:
            output[name] = extractor.extract_items(tree, url)
        return output

How to correct "nasty" jsonl+ld

I've found at least a couple of bad json+ld that extruct can't read.

  File "/cygdrive/d/recipeWorkspace/python/parsers.py", line 25, in readJsonLd
    data = jslde.extract(html)
  File "/usr/lib/python2.7/site-packages/extruct/jsonld.py", line 21, in extract
    return self.extract_items(lxmldoc)
  File "/usr/lib/python2.7/site-packages/extruct/jsonld.py", line 25, in extract_items
    self._xp_jsonld(document))
  File "/usr/lib/python2.7/site-packages/extruct/jsonld.py", line 35, in _extract_items
    data = json.loads(HTML_OR_JS_COMMENTLINE.sub('', script))
  File "/usr/lib/python2.7/json/__init__.py", line 339, in loads
    return _default_decoder.decode(s)
  File "/usr/lib/python2.7/json/decoder.py", line 364, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/usr/lib/python2.7/json/decoder.py", line 380, in raw_decode
    obj, end = self.scan_once(s, idx)
ValueError: Expecting , delimiter: line 20 column 778 (char 1342)

The reason is the unescaped quotes inside the text. For example:

"recipeInstructions": [
		"1. blablabla two "buttons".5. Dab  Snowmen!"		
	]

HTML allows this, but it's not possible to parse it as JSON. Is there an easy way to correct similar issues automatically?

Add command line tool

It would be nice to have a cmdline script that, given a URL, would try to run all the extractors (much like the webservice) and output a JSON with the results:

extruct URL

First non-empty result should be extracted in case of OpenGraph

#115 was a step in the right direction (prefer the first result), but it seems it is not the whole solution, as empty results should not be prioritized.

E.g. on https://www.triganostore.com/tente-de-camping-raclet-bora-4.html there are two og:description values, the first one is empty. https://developers.facebook.com/tools/debug/sharing/?q=https%3A%2F%2Fwww.triganostore.com%2Ftente-de-camping-raclet-bora-4.html shows that a non-empty one is extracted.

Dockerfile

I'm running extruct using Docker; however, I have a problem.

FROM python:3.5
#see https://github.com/scrapinghub/extruct

RUN pip install bottle
RUN pip install gevent
RUN pip install requests
RUN pip install extruct==0.7.3

WORKDIR /usr/src/app

#this will run server on port 10005
CMD [ "python", "-m", "extruct.service" ]

#to build run.
#docker build -t python-extruct .

#to run http server use
#docker run -p 10005:10005 python-extruct

#to check usage using http use
#curl http://your_IP:10005/extruct/http://www.sarenza.com/i-love-shoes-susket-s767163-p0000119412
/usr/local/lib/python3.7/site-packages/extruct/service.py:13: MonkeyPatchWarning: Monkey-patching ssl after ssl has already been imported may lead to errors, including RecursionError on Python 3.6. It may also silently lead to incorrect behaviour on Python 3.7. Please monkey-patch earlier. See https://github.com/gevent/gevent/issues/1016. Modules that had direct imports (NOT patched): ['urllib3.util (/usr/local/lib/python3.7/site-packages/urllib3/util/__init__.py)', 'urllib3.util.ssl_ (/usr/local/lib/python3.7/site-packages/urllib3/util/ssl_.py)']. 
  monkey.patch_all()
Bottle v0.12.16 server starting up (using GeventServer())...
Listening on http://0.0.0.0:10005/
pip list
 ---> Running in c0d1c0855f84
Package        Version 
-------------- --------
beautifulsoup4 4.7.1   
bottle         0.12.16 
certifi        2019.3.9
chardet        3.0.4   
extruct        0.7.3   
gevent         1.4.0   
greenlet       0.4.15  
html5lib       1.0.1   
idna           2.8     
isodate        0.6.0   
lxml           4.3.4   
mf2py          1.1.2   
pip            19.1.1  
pyparsing      2.4.0   
rdflib         4.2.2   
rdflib-jsonld  0.4.0   
requests       2.22.0  
setuptools     41.0.1  
six            1.12.0  
soupsieve      1.9.1   
urllib3        1.25.3  
w3lib          1.20.0  
webencodings   0.5.1   
wheel          0.33.4  

When I make an HTTP request to the server:

http://192.168.5.134:10005/extruct/https://www.sarenza.com/i-love-shoes-susket-s767163-br964-t76-p0000119412

I get an error that is probably related to the gevent monkey-patching above:

{"url": "https://www.sarenza.com/i-love-shoes-susket-s767163-br964-t76-p0000119412", "status": "error", "message": "RecursionError('maximum recursion depth exceeded')"}

Accept JSON parsing errors in JSON-LD extractor

When the JsonLdExtractor tries to parse JSON-LD on some web pages, it raises ValueError: no JSON object could be decoded.
My solution was to catch the error in JsonLdExtractor._extract_items(self, node) (because the extractor may have detected some microdata or RDFa on the page while the error only occurs with JSON-LD, and if we catch the error in extruct.extract we lose that data) and return an empty list by default:

def _extract_items(self, node):
    try:
        data = json.loads(node.xpath('string()'))
        if isinstance(data, list):
            return data
        elif isinstance(data, dict):
            return [data]
    except Exception as e:
        print(e)
    return []

I want to get only one type of schema.org annotation how can i do it

I have this JSON, but I want only the items annotated with @type Product, not @type BreadcrumbList. Is there a way to get only Product?
[
{
"@context": "http://schema.org",
"@type": "BreadcrumbList",
"itemListElement": [
{
"@type": "ListItem",
"position": 1,
"item": {
"@id": "https://concordpetfoods.com/collections",
"name": "Collections"
}
},
{
"@type": "ListItem",
"position": 2,
"item": {
"@id": "https://concordpetfoods.com/collections/dog",
"name": "Dog"
}
},
{
"@type": "ListItem",
"position": 3,
"item": {
"@id": "https://concordpetfoods.com/collections/dog/products/blue-buffalo-blue-wilderness-rocky-mountain-recipe-adult-healthy-weight-red-meat-dry-dog-food",
"name": "Blue Buffalo BLUE Wilderness Rocky Mountain Recipe Adult Healthy Weight Red Meat Dry Dog Food"
}
}
]
},
{
"@context": "http://schema.org/",
"@type": "Product",
"name": "Blue Buffalo BLUE Wilderness Rocky Mountain Recipe Adult Healthy Weight Red Meat Dry Dog Food",
"image": "https://cdn.shopify.com/s/files/1/2382/0223/products/35913-1501600645_fc502f43-827d-4a76-a639-90c668e5e4bc_1024x1024.png?v=1533919507",
"description": "

Looking for a great food to help your four legged best friend reach and maintain their ideal weight? Blue Buffalo has got just the food for you with their BLUE Wilderness Rocky Mountain Recipe Adult Healthy Weight Red Meat Dry Dog Food! This grain-free, protein-rich food contains the finest natural ingredients and provides multiple sources of protein using deboned beef, lamb and venison without the added calories! Blue Buffalo BLUE Wilderness Rocky Mountain Recipe Adult Healthy Weight Red Meat Dry Dog Food also includes blueberries, cranberries and carrots to help support antioxidant-enrichment. Put on your spandex, Rover! Let’s get physical!


Why We Love It

    \n
  • 100% grain-free
  • \n
  • No by-products, fillers, soy, corn, artificial preservatives, colors or flavors.
  • \n
  • Made in the USA
  • \n


About Blue Buffalo

BLUE Buffalo's True Blue promise is the pillar of their business, straight to every customer; the finest natural ingredients, and no chicken/poultry by-product meals, corn, wheat, soy, artificial preservatives, colors or flavors. BLUE Buffalo is the only food made with unique Lifesource Bits; a precise blend of vitamins, minerals and antioxidants created by veterinarians and animal nutritionists. With recipes for all tastes and diets, including limited ingredient diets, high protein, grain-free, wholesome grains, and exotic proteins, BLUE Buffalo always starts with real meat, and ends with good health.

Ingredients

Deboned Beef, Chicken Meal, Pea Protein, Peas, Tapioca Starch, Pea Starch, Menhaden Fish Meal (source of Omega 3 Fatty Acids), Pea Fiber, Dried Tomato Pomace, Natural Flavor, Flaxseed (source of Omega 6 Fatty Acids), Chicken Fat (preserved with Mixed Tocopherols), Powdered Cellulose, Dehydrated Alfalfa Meal, DL-Methionine, Deboned Lamb, Deboned Venison, Dried Chicory Root, Choline Chloride, Potatoes, Calcium Carbonate, Caramel Color, Dicalcium Phosphate, preserved with Mixed Tocopherols, Sweet Potatoes, Carrots, L-Carnitine, Zinc Amino Acid Chelate, Zinc Sulfate, Salt, Potassium Chloride, Ferrous Sulfate, Vitamin E Supplement, Iron Amino Acid Chelate, Glucosamine Hydrochloride, Blueberries, Cranberries, Barley Grass, Parsley, Yucca Schidigera Extract, Dried Kelp, Turmeric, Nicotinic Acid (Vitamin B3), Calcium Pantothenate (Vitamin B5), L-Ascorbyl-2-Polyphosphate (source of Vitamin C), L-Lysine, Oil of Rosemary, Copper Sulfate, Biotin (Vitamin B7), Vitamin A Supplement, Copper Amino Acid Chelate, Manganese Sulfate, Taurine, Chondroitin Sulfate, Manganese Amino Acid Chelate, Thiamine Mononitrate (Vitamin B1), Riboflavin (Vitamin B2), Vitamin D3 Supplement, Vitamin B12 Supplement, Pyridoxine Hydrochloride (Vitamin B6), Calcium Iodate, Dried Yeast, Dried Enterococcus faecium fermentation product, Dried Lactobacillus acidophilus fermentation product, Dried Aspergillus niger fermentation extract, Dried Trichoderma longibrachiatum fermentation extract, Dried Bacillus subtilis fermentation extract, Folic Acid (Vitamin B9), Sodium Selenite.
\r\n

Guaranteed Analysis

\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n
NutrientGuaranteed Units
Crude Protein30.0% min
Crude Fat10% min
Crude Fiber10.0% max
Moisture10.0% max
Calcium1.2% min
Phosphorus0.9% min
Omega-3 Fatty Acids0.5% min
Omega-6 Fatty Acids1.5% min
L-Carnitine150 mg/kg min
Glucosamine400 mg/kg min
Chondroitin Sulfate300 mg/kg min
",
"brand": {
"@type": "Thing",
"name": "Blue Buffalo"
},
"sku": "36492",
"offers": {
"@type": "Offer",
"priceCurrency": "USD",
"price": 65.99,
"availability": "http://schema.org/InStock",
"seller": {
"@type": "Organization",
"name": "Concord Pet Foods & Supplies"
}
}
},
{
"@context": "http://schema.org",
"@type": "WebSite",
"name": "Concord Pet Foods & Supplies",
"url": "https://concordpetfoods.com"
}
]
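One way is to filter the extracted list by @type after extraction. A minimal sketch, where filter_by_type is a hypothetical helper and items is the JSON-LD list shown above:

def filter_by_type(items, wanted='Product'):
    # Keep only JSON-LD items whose @type matches the wanted type.
    return [item for item in items if item.get('@type') == wanted]

products = filter_by_type(items)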

Nested lists returned by JsonLD extractor.

See below, the output is a nested list rather than a list of dicts.

In [9]: url = 'http://www.superpages.com/yellowpages/c-nurseries/s-wa/t-redmond/'

In [10]: r = requests.get(url)

In [11]: ex = extruct.jsonld.JsonLdExtractor()

In [12]: ex.extract(r.text)['items']

[[{'@context': 'http://schema.org',
   '@type': 'LocalBusiness',
   'address': {'@type': 'PostalAddress',
    'addressLocality': 'Redmond',
    'addressRegion': 'WA',
    'postalCode': '98053',
    'streetAddress': '20871 NE Redmond Fall City Rd'},
   'description': "There's more to a beautiful garden than what meets the eye.",
   'name': 'Gray Barn Nursery',
   'telephone': '888-820-9506'},
    ...
]]
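Until this is fixed upstream, flattening the result is a simple workaround; a sketch:

def flatten_jsonld(items):
    # Recursively flatten nested lists into a flat list of dicts.
    flat = []
    for item in items:
        if isinstance(item, list):
            flat.extend(flatten_jsonld(item))
        else:
            flat.append(item)
    return flat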

Unable to read UTF-8

The code is not able to read UTF-8. Where can I modify the code? Does extruct support only ASCII?


Extruct returns incorrectly formatted description property

It seems that extruct incorrectly handles microdata description values that contain HTML tags.

See the below description extracted from URL https://www.monsterpetsupplies.co.uk/cat/cat-flea-tick/johnsons-4-fleas-cats-kittens:

>>> import extruct
>>> import requests
>>> from w3lib.html import get_base_url
>>> r = requests.get('https://www.monsterpetsupplies.co.uk/cat/cat-flea-tick/johnsons-4-fleas-cats-kittens')
>>> base_url = get_base_url(r.text, r.url)
>>> data = extruct.extract(r.text, base_url=base_url)
>>> data['microdata'][0]['properties']['description']
"Johnsons 4 Fleas Cats & Kittens - 3 Treatment Pack, 6 Treatment PackFor use with Cats and Kittens over 4 weeks of age between 1 and 11kg.Johnson's 4fleas tablets are an easy to use oral treatment to kill adult fleas found on your pet.Effects on the fleas may be seen as soon as 15 minutes after administration.Between 95 - 100% of fleas will be killed off in the first six hours, but ALL adult fleas will be gone after a day.These tablets can be given directly to the mouth or may be mixed in a small portion f our pet's favourite food and given immediately. Administer a single tablet on an day when fleas are seen on your pet. Repeat on any subsequent day as necessary. Do not give more than one treatment per day.You may notice your pet scratching more than usual for the first half hour after administration; this is completely normal and caused by the fleas reacting to Johnson's 4Fleas tablets.While highly effective by themselves, 4Fleas is great when used as part of a programme to eliminate fleas and their larvae from both pets and their surroundings."

As can be seen, there is a problem with the formatting, like the lack of a space between "Pack" and "For" or between "11kg." and "Johnson's".

It turns out that the problem is not caused by the description property content per se, because it looks correct in the page source:

<p><strong>Johnsons 4 Fleas Cats &amp; Kittens - 3 Treatment Pack, 6 Treatment Pack</strong></p>For use with Cats and Kittens over 4 weeks of age between 1 and 11kg.<br /><br />Johnson's 4fleas tablets are an easy to use oral treatment to kill adult fleas found on your pet.<br /><br />Effects on the fleas may be seen as soon as 15 minutes after administration.<br /><br />Between 95 - 100% of fleas will be killed off in the first six hours, but ALL adult fleas will be gone after a day.<br /><br />These tablets can be given directly to the mouth or may be mixed in a small portion f our pet's favourite food and given immediately. Administer a single tablet on an day when fleas are seen on your pet. Repeat on any subsequent day as necessary. Do not give more than one treatment per day.<br /><br />You may notice your pet scratching more than usual for the first half hour after administration; this is completely normal and caused by the fleas reacting to Johnson's 4Fleas tablets.<br /><br />While highly effective by themselves, 4Fleas is great when used as part of a programme to eliminate fleas and their larvae from both pets and their surroundings.

Likely it is a matter of this line:

return u"".join(self._xp_clean_text(node)).strip()
where html-text should be used instead of the ad-hoc text extraction.
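For comparison, html-text keeps whitespace at block boundaries; a sketch, assuming the html_text package is installed:

import html_text

html = ('<p><strong>Johnsons 4 Fleas Cats &amp; Kittens - 3 Treatment '
        'Pack, 6 Treatment Pack</strong></p>For use with Cats and Kittens '
        'over 4 weeks of age between 1 and 11kg.')
print(html_text.extract_text(html))
# The </p> boundary becomes whitespace instead of disappearing:
# Johnsons 4 Fleas Cats & Kittens - 3 Treatment Pack, 6 Treatment Pack
# For use with Cats and Kittens over 4 weeks of age between 1 and 11kg.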

Handle badly formatted JSON-LD data.

Some web pages contain badly formatted JSON-LD data. Here is an example.

The JSON-LD on this page is:


{
  "@context": "http://schema.org",
        "@type": "Product",
                "name": "Black 'Clint' FT0511 cat eye sunglasses",
                "image": "https://debenhams.scene7.com/is/image/Debenhams/60742_1515029001",
		"brand": {
                  "@type": "Thing",
                  "name": "Tom Ford"
                },
                "offers": {
                	"@type": "Offer",
                	"priceCurrency": "GBP",
                	"price": "285.00",
                	"itemCondition": "http://schema.org/NewCondition",
                	"availability": "http://schema.org/InStock"
                }
    }
}

In the JSON-LD above, the last } is extra, and neither extruct nor json.loads will handle it properly.

In Python 3.5+, json.loads raises a JSONDecodeError with detailed error information, such as JSONDecodeError: Extra data: line 19 column 1 (char 624)

In [7]: try:
   ...:     data = json.loads(json_ld_string)
   ...: except json.JSONDecodeError as err:
   ...:     print(err)
   ...:     print(err.msg)
   ...:     print(err.pos)
   ...:
Extra data: line 19 column 1 (char 624)
Extra data
624

The err.msg and err.pos attributes can give a hint for fixing the JSON-LD data; e.g., for this one we can drop everything from position 624 onwards and parse the string again to correctly get:

{'@context': 'http://schema.org',
 '@type': 'Product',
 'brand': {'@type': 'Thing', 'name': 'Tom Ford'},
 'image': 'https://debenhams.scene7.com/is/image/Debenhams/60742_1515029001',
 'name': "Black 'Clint' FT0511 cat eye sunglasses",
 'offers': {'@type': 'Offer',
            'availability': 'http://schema.org/InStock',
            'itemCondition': 'http://schema.org/NewCondition',
            'price': '285.00',
            'priceCurrency': 'GBP'}}
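A minimal retry sketch along these lines, handling only the "Extra data" case shown above (loads_trimming_extra_data is a hypothetical helper):

import json

def loads_trimming_extra_data(s, max_attempts=3):
    # Parse JSON, trimming trailing extra data reported by the decoder.
    for _ in range(max_attempts):
        try:
            return json.loads(s)
        except json.JSONDecodeError as err:
            if err.msg != 'Extra data':
                raise
            s = s[:err.pos]  # drop the offending trailing characters
    raise ValueError('could not repair JSON-LD string')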

There are many possible format errors; some can be fixed easily, while others are harder or even impossible to fix.

I propose 3 ways to improve the situation:

  • extruct tries various ways to fix the JSON-LD data case by case, which requires Python >= 3.5 to get detailed error info
  • extruct allows the user to pass in a function to parse JSON data, letting users handle their own error types
  • extruct outputs the extracted JSON-LD string rather than parsed data, letting users parse it and handle their own error types

I personally recommend the latter two options.

Thanks.

Extract JSON-LD with control characters.

Some pages have JSON-LD with control characters.
One example is: https://www.johnlewis.com/sony-xperia-x-smartphone-android-5-4g-lte-sim-free-32gb/p3210080

When you try to extract JSON-LD data from this page, you'll get:
Invalid control character at: line 8 column 353 (char 625)

Maybe we need to change JsonLdExtractor._extract_items() in extruct/extruct/jsonld.py as below:

from json import JSONDecodeError

    def _extract_items(self, node):
        script = node.xpath('string()')
        try:
            data = json.loads(script)
        except ValueError:
            # Sometimes JSON-decoding errors are due to leading
            # HTML or JavaScript comments.
            try:
                data = json.loads(HTML_OR_JS_COMMENTLINE.sub('', script))
            except JSONDecodeError:
                # strict=False lets the parser accept control
                # characters inside strings.
                data = json.loads(
                    HTML_OR_JS_COMMENTLINE.sub('', script), strict=False)
        if isinstance(data, list):
            return data
        elif isinstance(data, dict):
            return [data]

HTML escaping in JSON+LD

I came across some JSON+LD on a site that contained a &amp; and I assumed that I had accidentally escaped something somewhere. However, I found that this was what was actually in the content, and also that the standard says it should be there.

For my application, I would like that &amp; to be a &, but I was wondering if extruct should be doing this already?
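In the meantime, the standard library's html.unescape can do the conversion on the caller's side; a minimal sketch:

from html import unescape

raw = 'Ben &amp; Jerry&#39;s'
print(unescape(raw))  # Ben & Jerry's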

Accessing extracted data which is not unique

Hello,

extruct is working very well for my use case and I get plenty of structured data out of websites.
I'm mostly using Microdata.
Unfortunately, some websites seem to have different structures from others, so for example, sometimes I'd get a nested object:

'brand': {'properties': {'name': 'NIKE'}, 'type': 'http://schema.org/Brand'},

and sometimes a string:

'brand': 'NIKE',

So to access the data, I'd need to do something like:

if isinstance(productData['brand'], dict):
    if 'http://schema.org/Brand' == productData['brand']['type']:
        self.brand = productData['brand']['properties']['name']
if isinstance(productData['brand'], str):
    self.brand = productData['brand']

Is this the best way to go or am I doing this in a clumsy way?
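A small helper that normalizes both shapes might be a bit tidier; a sketch (extract_brand is a hypothetical name):

def extract_brand(product_data):
    # Normalize the two shapes of 'brand' shown above.
    brand = product_data.get('brand')
    if isinstance(brand, dict) and brand.get('type') == 'http://schema.org/Brand':
        return brand['properties']['name']
    if isinstance(brand, str):
        return brand
    return None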

Thanks,
Chris

Unified formatting for microformats and JSON-LD

Hi,

I am using extruct to extract metadata from emails, either in microformats or JSON+LD format. A very good point of this library is that in a single call one can extract all possible information from the message; that's super convenient!

However, I realized that the structure of the data returned from JSON+LD and microformats is quite different. For instance, microformats will return something like

{
  "type": "SOME_SCHEMA_URL",
  "properties": { /* A dict of properties */ }
}

whereas JSON+LD parsing would return something like

{
  "@type": "SOME_SCHEMA_URL",
  /* All the properties in keys here */
}

This is not so convenient as it implies that microformats and JSON+LD data should be handled in a different way, although they match the same schema.org schema.

Not sure if this is in scope for extruct or if it should live in another lib, but what about offering a way to get a standard representation of the extracted data? This could either be building a class object (basically a struct) from the fetched data for each type, or converting one format to the other. Not sure if this could already be offloaded to some external lib, but I could not find any that does the job so far.

Thanks!

EDIT: I guess something as simple as

def microformats_to_jsonld(mf):
    # Property values come through as lists, so recurse into them.
    if isinstance(mf, list):
        return [microformats_to_jsonld(item) for item in mf]
    if isinstance(mf, dict) and 'type' in mf and 'properties' in mf:
        if isinstance(mf['type'], list):
            # Fix a bug in the JSON-LD format of some emails
            mf['type'] = ''.join(mf['type'])
        context, type_ = mf['type'].rsplit('/', 1)
        converted = {
            '@type': type_,
            '@context': context,
        }
        for key, value in mf['properties'].items():
            converted[key] = microformats_to_jsonld(value)
        return converted
    return mf

could do the trick.
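For instance, with a made-up input:

mf = {'type': 'http://schema.org/Person',
      'properties': {'name': ['Jane Doe']}}
microformats_to_jsonld(mf)
# -> {'@type': 'Person', '@context': 'http://schema.org', 'name': ['Jane Doe']}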

HTML parsing fails on empty documents

An exception like this can be raised by functions from extruct.utils:

document = parse_xmldom_html(html_string, encoding=encoding)
  File "/usr/local/lib/python3.6/dist-packages/extruct/utils.py", line 16, in parse_xmldom_html
    return lxml.html.fromstring(html, parser=parser)
  File "/usr/local/lib/python3.6/dist-packages/lxml/html/__init__.py", line 876, in fromstring
    doc = document_fromstring(html, parser=parser, base_url=base_url, **kw)
  File "/usr/local/lib/python3.6/dist-packages/lxml/html/__init__.py", line 765, in document_fromstring
    "Document is empty")
lxml.etree.ParserError: Document is empty

In parsel this is worked around: empty documents are handled explicitly, and null bytes are handled there as well. I think we should bring similar fixes to extruct. See https://github.com/scrapy/parsel/blob/e01093cf6342c90445028de28034b3cc3d2ead8b/parsel/selector.py#L38.
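A sketch of the parsel-style workaround (parse_html_safe is a hypothetical name):

import lxml.html

def parse_html_safe(html_string):
    # Substitute a minimal document for empty input, as parsel does,
    # and strip null bytes, which lxml also rejects.
    html_string = html_string.replace('\x00', '').strip() or '<html/>'
    return lxml.html.fromstring(html_string)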

Unable to pass the url

r = requests.get(url)
data = extruct.extract(r.text, r.url)

Why am I getting an error this way?
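For comparison, passing the URL by keyword works with recent versions, where the parameter is named base_url; a sketch:

import extruct
import requests
from w3lib.html import get_base_url

r = requests.get('http://example.com/')  # any URL
base_url = get_base_url(r.text, r.url)
data = extruct.extract(r.text, base_url=base_url)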

url vs. base_url in extruct.extract

Right now extruct.extract has a url parameter which is documented as "url of the html documents", but in reality it's used as a base URL (at least in LxmlMicrodataExtractor, maybe in others as well). I think we should check whether it's indeed always used as a base URL, update the documentation, and introduce a base_url argument, deprecating url. Another option would be to extract base_url in extruct, but this feels like a worse solution to me (what if the caller already has base_url, or has a more accurate base_url?), although we could also support both base_url and url.
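A sketch of the deprecation path (the signature details here are illustrative, not the library's actual code):

import warnings

def extract(htmlstring, base_url=None, url=None, **kwargs):
    # Accept the old `url` argument for a while, steering callers
    # towards `base_url`, as proposed above.
    if url is not None:
        warnings.warn("The 'url' argument is deprecated; use 'base_url'",
                      DeprecationWarning, stacklevel=2)
        if base_url is None:
            base_url = url
    ...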

Problem parsing RDFa in AWS Lambda

Hi,
I wanted to ask if anyone out there has used extruct on AWS Lambda. I tested running an extruct function, which seems to fail for RDFa; the other default metadata types are fine.

A simple test case:

import pprint as pp
import requests
from extruct.rdfa import RDFaExtractor
import config_files.logging_config as log

logger = log.logger

def main():
    try:
        import extruct
        logger.info("Testing importing extruct which loaded successfully")
        import rdflib
        logger.info("Testing importing rdflib which loaded successfully")
        import extruct.rdfa
        logger.info("Testing importing rdfa which loaded successfully")
        from extruct.rdfa import RDFaExtractor
        logger.info("Testing importing RDFaExtractor which loaded successfully")
    except ImportError as e:
        logger.error("failed to import: {}".format(e))

    try:
        url = 'https://www.littlewoods.com/ri-plus-floral-trumpet-sleeve-top/1600159211.prd'
        r = requests.get(url)
        rdfae = RDFaExtractor()
        rdfa_json = rdfae.extract(r.text, base_url=None)
        pp.pprint(rdfa_json)
    except Exception as e:
        logger.exception("Failed to extract rdfa. Error: {}".format(e))

main()

The extruct part of the pipenv graph when I build the artifact.zip file:

extruct==0.7.1
  - lxml [required: Any, installed: 3.6.0]
  - mf2py [required: Any, installed: 1.1.2]
    - BeautifulSoup4 [required: >=4.6.0, installed: 4.7.1]
      - soupsieve [required: >=1.2, installed: 1.6.2]
    - html5lib [required: >=1.0.1, installed: 1.0.1]
      - six [required: >=1.9, installed: 1.11.0]
      - webencodings [required: Any, installed: 0.5.1]
    - requests [required: >=2.18.4, installed: 2.18.4]
      - certifi [required: >=2017.4.17, installed: 2018.11.29]
      - chardet [required: >=3.0.2,<3.1.0, installed: 3.0.4]
      - idna [required: >=2.5,<2.7, installed: 2.6]
      - urllib3 [required: >=1.21.1,<1.23, installed: 1.22]
  - rdflib [required: Any, installed: 4.2.2]
    - isodate [required: Any, installed: 0.6.0]
      - six [required: Any, installed: 1.11.0]
    - pyparsing [required: Any, installed: 2.3.0]
  - rdflib-jsonld [required: Any, installed: 0.4.0]
    - rdflib [required: >=4.2, installed: 4.2.2]
      - isodate [required: Any, installed: 0.6.0]
        - six [required: Any, installed: 1.11.0]
      - pyparsing [required: Any, installed: 2.3.0]
  - six [required: Any, installed: 1.11.0]
  - w3lib [required: Any, installed: 1.19.0]
    - six [required: >=1.4.1, installed: 1.11.0]

When I run this locally in the same pipenv env (Ubuntu 17.10, Docker 17.12.0-ce, pipenv==v2018.11.26), I don't experience any issues. On Lambda invocation I log the following stack trace:

2019-01-10 14:32:49,092:INFO:pid 1:Testing importing extruct which loaded successfully
2019-01-10 14:32:49,092:INFO:pid 1:Testing importing rdflib which loaded successfully
2019-01-10 14:32:49,092:INFO:pid 1:Testing importing rdfa which loaded successfully
2019-01-10 14:32:49,092:INFO:pid 1:Testing importing RDFaExtractor which loaded successfully
2019-01-10 14:32:51,753:ERROR:pid 1:Failed to extract rdfa. Error: No plugin registered for (json-ld, <class 'rdflib.serializer.Serializer'>)
Traceback (most recent call last):
  File "/var/task/rdflib/plugin.py", line 100, in get
    p = _plugins[(name, kind)]
KeyError: ('json-ld', <class 'rdflib.serializer.Serializer'>)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/var/task/metadata_extractor/rdfa_extract_poc.py", line 15, in main
    rdfa_json = rdfae.extract(r.text, base_url=None)
  File "/var/task/extruct/rdfa.py", line 35, in extract
    return self.extract_items(tree, base_url=base_url, expanded=expanded)
  File "/var/task/extruct/rdfa.py", line 48, in extract_items
    jsonld_string = g.serialize(format='json-ld', auto_compact=not expanded).decode('utf-8')
  File "/var/task/rdflib/graph.py", line 940, in serialize
    serializer = plugin.get(format, Serializer)(self)
  File "/var/task/rdflib/plugin.py", line 103, in get
    "No plugin registered for (%s, %s)" % (name, kind))
rdflib.plugin.PluginException: No plugin registered for (json-ld, <class 'rdflib.serializer.Serializer'>)

I have been scratching my head over this but can't figure it out. What should I try? Thanks in advance.
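One thing worth trying: rdflib discovers its serializers through setuptools entry points, and that metadata can be lost when building a zip artifact for Lambda. Registering the JSON-LD plugin explicitly before extracting is a common workaround; a sketch, assuming rdflib-jsonld 0.4.x is on the path:

from rdflib.plugin import register, Parser, Serializer

# Register the json-ld plugin by hand, since the entry-point
# metadata may be missing from the zipped deployment package.
register('json-ld', Serializer, 'rdflib_jsonld.serializer', 'JsonLDSerializer')
register('json-ld', Parser, 'rdflib_jsonld.parser', 'JsonLDParser')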

Support "expanded" Open Graph metadata based on og:type

Facebook Open Graph defines an expanded version of embedded metadata depending on the value of og:type.

For example:

article - Namespace URI: http://ogp.me/ns/article#

    article:published_time - datetime - When the article was first published.
    article:modified_time - datetime - When the article was last changed.
    article:expiration_time - datetime - When the article is out of date after.
    article:author - profile array - Writers of the article.
    article:section - string - A high-level section name. E.g. Technology
    article:tag - string array - Tag words associated with this article.

This is used for example on nytimes.com. Snippet:

<meta property="og:url" content="http://www.nytimes.com/2016/12/15/arts/music/from-steet-theater-to-wagner-on-the-opera-stage.html" />
<meta property="og:type" content="article" />
<meta property="og:title" content="From Street Theater to Wagner on the Opera Stage" />
<meta property="og:description" content="Àlex Ollé brings an avant-garde sensibility to “The Flying Dutchman,” which he set in Bangladesh instead of Norway. The production opens in Madrid on Saturday." />
<meta property="article:published" itemprop="datePublished" content="2016-12-15T05:55:55-05:00" />
<meta property="article:modified" itemprop="dateModified" content="2016-12-15T06:19:30-05:00" />
<meta property="article:section" itemprop="articleSection" content="Music" />
<meta property="article:section-taxonomy-id" itemprop="articleSection" content="C5BFA7D5-359C-427B-90E6-6B7245A6CDD8" />
<meta property="article:section_url" content="http://www.nytimes.com/section/arts" />
<meta property="article:top-level-section" content="arts" />
<meta property="fb:app_id" content="9869919170" />

Currently (version 0.3.0a1, as I write these lines) extruct extracts the raw article:... properties:

...
  'article:author': [{'@value': 'http://www.nytimes.com/by/raphael-minder'}],
  'article:collection': [{'@value': 'https://static01.nyt.com/services/json/sectionfronts/arts/music/index.jsonp'}],
  'article:modified': [{'@value': '2016-12-15T06:19:30-05:00'}],
  'article:published': [{'@value': '2016-12-15T05:55:55-05:00'}],
  'article:section': [{'@value': 'Music'}],
  'article:section-taxonomy-id': [{'@value': 'C5BFA7D5-359C-427B-90E6-6B7245A6CDD8'}],
  'article:section_url': [{'@value': 'http://www.nytimes.com/section/arts'}],
  'article:tag': [{'@value': 'Opera'},
                  {'@value': 'Bangladesh'},
                  {'@value': 'Madrid (Spain)'},
                  {'@value': 'Teatro Real'},
                  {'@value': 'Wagner, Richard'}],
  'article:top-level-section': [{'@value': 'arts'}],
  'fb:app_id': [{'@value': '9869919170'}],
  'http://opengraphprotocol.org/schema/description': [{'@value': 'Àlex Ollé brings an avant-garde sensibility to “The Flying Dutchman,” which he set in Bangladesh instead of Norway. The production opens in Madrid on Saturday.'}],
  'http://opengraphprotocol.org/schema/image': [{'@value': 'https://static01.nyt.com/images/2016/12/16/arts/16ALEXOLLE1-INYT/16ALEXOLLE1-INYT-facebookJumbo.jpg'}],
  'http://opengraphprotocol.org/schema/title': [{'@value': 'From Street '
                                                           'Theater to Wagner '
                                                           'on the Opera '
                                                           'Stage'}],
  'http://opengraphprotocol.org/schema/type': [{'@value': 'article'}],
  'http://opengraphprotocol.org/schema/url': [{'@value': 'http://www.nytimes.com/2016/12/15/arts/music/from-steet-theater-to-wagner-on-the-opera-stage.html'}],
...

while they could use the type-dependent OGP namespace (e.g., http://ogp.me/ns/article# for articles).

Extract contents with tags

Hi, when I use extruct to extract microdata from an HTML element that contains script or style tags, I encounter a problem: the tags themselves are skipped but their contents remain in the extracted text.

This behaviour happens because LxmlMicrodataExtractor.extract_textContent uses lxml.html.tostring(node, method="text", encoding='unicode', with_tail=False), and the "text" method keeps the text content of every element, including script and style.
Probably we have to add a parameter to allow using the "html" method, and maybe a way to use the lxml Cleaner (http://lxml.de/lxmlhtml.html#cleaning-up-html); see the sketch below.
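As a stopgap on the caller's side, the markup can be cleaned before extraction; a sketch using the lxml Cleaner mentioned above (the sample HTML is made up):

from lxml.html.clean import Cleaner

# Remove <script> and <style> elements together with their text
# content before feeding the markup to the microdata extractor.
cleaner = Cleaner(scripts=True, javascript=True, style=True)
html = ('<div itemscope itemtype="http://schema.org/Thing">'
        '<script>var x = 1;</script>'
        '<span itemprop="name">Foo</span></div>')
cleaned = cleaner.clean_html(html)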
