coleifer / micawber

a small library for extracting rich content from urls

Home Page: http://micawber.readthedocs.org/

License: MIT License

Python 99.61% HTML 0.39%
oembed python

micawber's Introduction

A small library for extracting rich content from urls.

what does it do?

micawber supplies a few methods for retrieving rich metadata about a variety of links, such as links to youtube videos. micawber also provides functions for parsing blocks of text and html and replacing links to videos with rich embedded content.

examples

here is a quick example:

import micawber

# load up rules for some default providers, such as youtube and flickr
providers = micawber.bootstrap_basic()

providers.request('http://www.youtube.com/watch?v=54XHDUOHuzU')

# returns the following dictionary:
{
    'author_name': 'pascalbrax',
    'author_url': u'http://www.youtube.com/user/pascalbrax',
    'height': 344,
    'html': u'<iframe width="459" height="344" src="http://www.youtube.com/embed/54XHDUOHuzU?fs=1&feature=oembed" frameborder="0" allowfullscreen></iframe>',
    'provider_name': 'YouTube',
    'provider_url': 'http://www.youtube.com/',
    'title': 'Future Crew - Second Reality demo - HD',
    'type': u'video',
    'thumbnail_height': 360,
    'thumbnail_url': u'http://i2.ytimg.com/vi/54XHDUOHuzU/hqdefault.jpg',
    'thumbnail_width': 480,
    'url': 'http://www.youtube.com/watch?v=54XHDUOHuzU',
    'width': 459,
    'version': '1.0',
}

providers.parse_text('this is a test:\nhttp://www.youtube.com/watch?v=54XHDUOHuzU')

# returns the following string:
this is a test:
<iframe width="459" height="344" src="http://www.youtube.com/embed/54XHDUOHuzU?fs=1&feature=oembed" frameborder="0" allowfullscreen></iframe>

providers.parse_html('<p>http://www.youtube.com/watch?v=54XHDUOHuzU</p>')

# returns the following html:
<p><iframe width="459" height="344" src="http://www.youtube.com/embed/54XHDUOHuzU?fs=1&amp;feature=oembed" frameborder="0" allowfullscreen="allowfullscreen"></iframe></p>

micawber's People

Contributors

bashu, benkonrath, bigjust, busla, carljm, coleifer, dokterbob, eshagh, flimm, ivirabyan, jvanasco, kopos, mgaitan, mgorny, mikedizon, msghens, richardcornish, stefanfoulis, tclancy, tptlab, vdboor

micawber's Issues

Allow compositing Providers

We need to grab oembed data and would prefer to do it ourselves using the providers in micawber.providers.bootstrap_basic.

There are some services that don't provide endpoints (thinking of Facebook and Vine, in particular) or aren't defined in bootstrap_basic. We want to compose a ProviderRegistry instance which tries providers from bootstrap_basic first, falling back to oembedio or Embedly if nothing is found.

Our current (proposed) solution:

from micawber import bootstrap_basic, bootstrap_oembedio

# oembedio first so that basic providers overwrite oembedio providers
# a bit icky since it relies on internal registry implementation
providers = bootstrap_oembedio()
# iterating a registry yields (regex, provider) pairs
for regex, provider in bootstrap_basic():
    providers.register(regex, provider)

That seems a bit ... circuitous. So, here are a couple of ways to provide composited ProviderRegistry instances that I can think of:

  1. use our proposed solution above, and note it in the docs,
  2. allow the various bootstrap_* funcs to take an optional registry argument that defaults to None, but is used if passed,
def bootstrap_basic(pr=None, cache=None):
    pr = pr or ProviderRegistry(cache)
    ...
    return pr
  3. extract the hard-coded endpoints in bootstrap_basic so that library users can use them directly.
PROVIDERS = {
    r'http://blip.tv/\S+': 'http://blip.tv/oembed',
    ...
}

def bootstrap_basic(cache=None):
    pr = ProviderRegistry(cache)

    for regex, endpoint in PROVIDERS.items():
        pr.register(regex, Provider(endpoint))

    return pr

Thoughts?
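
A fourth option that avoids touching registry internals is a thin wrapper that tries several registries in order. This is only a minimal sketch: it assumes each registry exposes request(url) and raises an exception when no provider matches. FallbackRegistry, StubRegistry and ProviderNotFound are hypothetical stand-ins for illustration, not micawber classes.

```python
class ProviderNotFound(Exception):
    """Stand-in for micawber's ProviderNotFoundException (assumption)."""


class StubRegistry:
    """Hypothetical registry: maps URL prefixes to canned oEmbed data."""

    def __init__(self, data):
        self.data = data

    def request(self, url, **params):
        for prefix, response in self.data.items():
            if url.startswith(prefix):
                return response
        raise ProviderNotFound('Provider not found for "%s"' % url)


class FallbackRegistry:
    """Try each registry in order and return the first successful response."""

    def __init__(self, *registries):
        self.registries = registries

    def request(self, url, **params):
        for registry in self.registries:
            try:
                return registry.request(url, **params)
            except ProviderNotFound:
                continue  # fall through to the next registry
        raise ProviderNotFound('Provider not found for "%s"' % url)


# The first registry plays the role of bootstrap_basic, the second a catch-all service.
basic = StubRegistry({'http://www.youtube.com/': {'provider_name': 'YouTube'}})
catch_all = StubRegistry({'http': {'provider_name': 'fallback service'}})
providers = FallbackRegistry(basic, catch_all)
```

With real registries the order of the arguments would decide which source wins, so no registration order tricks are needed.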

Provider bootstrap_oembed broken

The bootstrap_oembed provider appears to be broken. The following works fine with other providers (Python 3.7, micawber 0.4.0).

from micawber.providers import bootstrap_oembed
r = bootstrap_oembed()
result = r.provider_for_url("https://i.imgur.com/CZX7D64.jpg")
/usr/local/lib/python3.7/site-packages/micawber/providers.py in provider_for_url(self, url)
    136     def provider_for_url(self, url):
    137         for regex, provider in self:
--> 138             if re.match(regex, url):
    139                 return provider
    140 

/usr/local/lib/python3.7/re.py in match(pattern, string, flags)
    171     """Try to apply the pattern at the start of the string, returning
    172     a Match object, or None if no match was found."""
--> 173     return _compile(pattern, flags).match(string)
    174 
    175 def fullmatch(pattern, string, flags=0):

/usr/local/lib/python3.7/re.py in _compile(pattern, flags)
    284     if not sre_compile.isstring(pattern):
    285         raise TypeError("first argument must be string or compiled pattern")
--> 286     p = sre_compile.compile(pattern, flags)
    287     if not (flags & DEBUG):
    288         if len(_cache) >= _MAXCACHE:

/usr/local/lib/python3.7/sre_compile.py in compile(p, flags)
    762     if isstring(p):
    763         pattern = p
--> 764         p = sre_parse.parse(p, flags)
    765     else:
    766         pattern = None

/usr/local/lib/python3.7/sre_parse.py in parse(str, flags, pattern)
    928 
    929     try:
--> 930         p = _parse_sub(source, pattern, flags & SRE_FLAG_VERBOSE, 0)
    931     except Verbose:
    932         # the VERBOSE flag was switched on inside the pattern.  to be

/usr/local/lib/python3.7/sre_parse.py in _parse_sub(source, state, verbose, nested)
    424     while True:
    425         itemsappend(_parse(source, state, verbose, nested + 1,
--> 426                            not nested and not items))
    427         if not sourcematch("|"):
    428             break

/usr/local/lib/python3.7/sre_parse.py in _parse(source, state, verbose, nested, first)
    652             if item[0][0] in _REPEATCODES:
    653                 raise source.error("multiple repeat",
--> 654                                    source.tell() - here + len(this))
    655             if item[0][0] is SUBPATTERN:
    656                 group, add_flags, del_flags, p = item[0][1]

error: multiple repeat at position 44
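
For context, "multiple repeat" is raised by the re module when a pattern contains two adjacent repetition operators. The sketch below reproduces that class of error and shows one way to build a safe pattern. It assumes, without confirmation from the report, that the bad pattern came from a glob-style oEmbed URL scheme whose * wildcards were not translated. scheme_to_regex is a hypothetical helper, not a micawber function.

```python
import re

# Two adjacent repetition operators reproduce the error class in the traceback.
try:
    re.compile(r'https?://\S**\.example\.com/\S+')
except re.error as exc:
    message = str(exc)  # begins with "multiple repeat"


def scheme_to_regex(scheme):
    """Translate a glob-style scheme such as 'https://*.example.com/*'
    into a regex, escaping everything except the '*' wildcards."""
    return r'\S+'.join(re.escape(part) for part in scheme.split('*'))


pattern = scheme_to_regex('https://*.example.com/*')
```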

Option to customize fallback behavior if provider not available

At present if no provider is found for a URL, and urlize_all is True, the urlize function appears to always be called which renders a simple link. There doesn't appear to be a way to change this.

I'd like to be able to customize this fallback behavior - perhaps by passing in a function as is done with the handlers - for example if I want to render the link with target="_blank", or use the domain instead of the full URL in the title, etc.

Is there a way to do this at present?
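
To illustrate the kind of hook being asked for, here is a minimal sketch of a link renderer that accepts a handler function. render_links, default_link and new_tab_domain_link are hypothetical names and this is not how micawber's urlize is implemented.

```python
import html
import re
from urllib.parse import urlparse

URL_RE = re.compile(r'https?://\S+')


def default_link(url):
    """Plain anchor, similar in spirit to a simple urlize fallback."""
    safe = html.escape(url, quote=True)
    return '<a href="%s">%s</a>' % (safe, safe)


def new_tab_domain_link(url):
    """Custom fallback: open in a new tab and show only the domain."""
    safe = html.escape(url, quote=True)
    domain = html.escape(urlparse(url).netloc)
    return '<a href="%s" target="_blank">%s</a>' % (safe, domain)


def render_links(text, link_handler=default_link):
    """Replace every URL in text with whatever link_handler returns."""
    return URL_RE.sub(lambda match: link_handler(match.group(0)), text)


rendered = render_links('see http://example.com/page', new_tab_domain_link)
```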

Google Maps with HTTPS not working.

I'm not 100% certain what the issue is, but I'll give you as many details as possible.

My site is served over https only. When I try to embed a map using an https link, the map does not embed.

When switching the URL to http the map will embed.

Poking around in the source I found: https://github.com/coleifer/micawber/blob/master/micawber/contrib/providers.py#L34

Which seems to only accept http as a valid url for google maps?

I think the problem might be that google redirects to https if you just go to maps.google.com.

So then when trying to embed a google map with https it fails the regex match?
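
If the pattern really is http-only, making the scheme's trailing "s" optional would cover both cases. A small sketch follows. The patterns are assumed shapes for illustration and the real one in micawber/contrib/providers.py may differ.

```python
import re

# Assumed shape of an http-only pattern (not copied from micawber).
http_only = r'http://maps\.google\.com/maps\?\S+'
# 'https?' makes the trailing 's' optional, so both schemes match.
either_scheme = r'https?://maps\.google\.com/maps\?\S+'

secure_url = 'https://maps.google.com/maps?q=Portland'
plain_url = 'http://maps.google.com/maps?q=Portland'
```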

performance suggestion

I'm considering migrating to micawber from a custom oembed consumer, and wanted to suggest a performance improvement that I am willing to generate a PR for.

I'd like to extend the ProviderRegistry with a secondary internal register that nests providers under domain names.

This would allow users to optionally avoid a regex match against every provider and only test the providers registered for the URL's domain.

Some light tests on a quick mockup showed the lookups running in about 30% of the time, including the overhead of parsing the domain name from a URL, and in about 5% of the time if you already have the domain.

We would be using this on a high-volume indexer, so this performance is a need.
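
Roughly what I have in mind, as a sketch rather than a finished design. DomainIndexedRegistry and its method signatures are hypothetical and the provider values are placeholder strings.

```python
import re
from collections import defaultdict
from urllib.parse import urlparse


class DomainIndexedRegistry:
    """Hypothetical registry that buckets (regex, provider) pairs by domain."""

    def __init__(self):
        self._by_domain = defaultdict(list)

    def register(self, domain, regex, provider):
        self._by_domain[domain].append((re.compile(regex), provider))

    def provider_for_url(self, url, domain=None):
        # Callers that already know the domain can skip the URL parse.
        if domain is None:
            domain = urlparse(url).netloc.lower()
        # Only the patterns registered for this domain are tested.
        for pattern, provider in self._by_domain.get(domain, ()):
            if pattern.match(url):
                return provider
        return None


registry = DomainIndexedRegistry()
registry.register('vimeo.com', r'https?://vimeo\.com/\S+', 'vimeo-provider')
registry.register('www.youtube.com', r'https?://www\.youtube\.com/watch\S+',
                  'youtube-provider')
```

Patterns that span many subdomains would still need a fallback scan over the full registry, which this sketch leaves out.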

How to bootstrap Iframely?

Hi Charles, this is a separate ticket to continue this discussion.

We added the description of Iframely's approach to providers here: https://iframely.com/docs/providers.

Though our preference would be to bootstrap for all URLs as Iframely can generate summary cards, handle link shorteners, detect direct image links, etc.

Another issue is the API endpoint address:

Any suggestions?

youtube unpredictable parsing

Hi Coleifer,

Thank you for your nice project. :)

Sorry for disturbing you

I found the answer to my issue, so I deleted the issue text, as I can't delete the whole issue record.

Best regards
and thanks again!

Igor

New release?

The most recent commit is a valuable addition and it would be great to see it rolled into an "official" release. In the meantime, deploying via git is working.

bootstrap_basic raw strings / escapes

I noticed that a lot of the regular expression patterns in bootstrap_basic don't escape dots (match all). This means that a fair number of these patterns will match more than intended.

In addition most patterns aren't marked as raw strings and therefore contain invalid escape sequences. This isn't noticeable directly, but could cause issues in a future python version.

For an example of the latter:

python -W always -c '"https://\S*?soundcloud.com/\S+"'
<string>:1: DeprecationWarning: invalid escape sequence \S
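
A short demonstration of the first point, using the soundcloud pattern from the warning above. The raw-string prefix keeps the backslash sequences intact, and the only difference between the two patterns is the escaped dot. The test URLs are made up for illustration.

```python
import re

loose = r'https://\S*?soundcloud.com/\S+'    # '.' matches any character
strict = r'https://\S*?soundcloud\.com/\S+'  # '\.' matches a literal dot

lookalike = 'https://soundcloud-com/some-track'
genuine = 'https://soundcloud.com/some-track'

loose_hits_lookalike = re.match(loose, lookalike) is not None
strict_hits_lookalike = re.match(strict, lookalike) is not None
```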

html5lib incompatibility

Getting this error when building a Wagtail project: AttributeError: module 'html5lib.treebuilders' has no attribute '_base'

Looks like it's being picked up elsewhere, so hopefully a fix will be released soon.

live demo is down

got an error from google:

Support for Python 2.5 has turned off. Please refer to https://goo.gl/aESk5L for more information

'IOError: [Errno 11] Resource temporarily unavailable' with Peewee sample blog app

I get the error shown below when I run the Peewee sample blog app from here: https://github.com/coleifer/peewee/tree/master/examples/blog

Specifically this happens when Micawber tries to display a post with links that need converting to embeds (e.g. a YouTube video link).

I've been able to reproduce this reliably with different links (e.g. Vimeo links instead of YouTube) and different browsers. It doesn't always happen immediately, but if you click around to view the posts with embeds, then return to the index page, then view posts again, the error appears and the page is either unavailable or shows the page with no CSS. Errors in the console show that files failed to load: Failed to load resource: net::ERR_SOCKET_NOT_CONNECTED

This is in a Python 2.7.10 virtualenv on Ubuntu 15.10 running the Flask dev server.

Interestingly, running it in a Python 3.4 virtualenv works without issues. But it would be great to have a fix for Python 2.

Exception happened during processing of request from ('127.0.0.1', 33044)
Traceback (most recent call last):
  File "/usr/lib/python2.7/SocketServer.py", line 295, in _handle_request_noblock
    self.process_request(request, client_address)
  File "/usr/lib/python2.7/SocketServer.py", line 321, in process_request
    self.finish_request(request, client_address)
  File "/usr/lib/python2.7/SocketServer.py", line 334, in finish_request
    self.RequestHandlerClass(request, client_address, self)
  File "/usr/lib/python2.7/SocketServer.py", line 655, in __init__
    self.handle()
  File "/home/tom/.virtualenvs/peewee-blog/local/lib/python2.7/site-packages/werkzeug/serving.py", line 216, in handle
    rv = BaseHTTPRequestHandler.handle(self)
  File "/usr/lib/python2.7/BaseHTTPServer.py", line 340, in handle
    self.handle_one_request()
  File "/home/tom/.virtualenvs/peewee-blog/local/lib/python2.7/site-packages/werkzeug/serving.py", line 247, in handle_one_request
    self.raw_requestline = self.rfile.readline()
IOError: [Errno 11] Resource temporarily unavailable

Vimeo and bootstrap_oembedio

In [5]: bootstrap_oembedio().request('http://vimeo.com/111410510')
---------------------------------------------------------------------------
ProviderNotFoundException                 Traceback (most recent call last)
<ipython-input-5-cdc3e12d26dd> in <module>()
----> 1 bootstrap_oembedio().request('http://vimeo.com/111410510')

/home/adas/.virtualenvs/rownosc-info/local/lib/python2.7/site-packages/micawber/providers.pyc in inner(self, url, **params)
     91                 self.cache.set(key, data)
     92             return data
---> 93         return fn(self, url, **params)
     94     return inner
     95 

/home/adas/.virtualenvs/rownosc-info/local/lib/python2.7/site-packages/micawber/providers.pyc in request(self, url, **params)
    132         if provider:
    133             return provider.request(url, **params)
--> 134         raise ProviderNotFoundException('Provider not found for "%s"' % url)
    135 
    136 

ProviderNotFoundException: Provider not found for "http://vimeo.com/111410510"
In [12]: [(k,v) for k,v in bootstrap_oembedio()._registry.items() if 'vimeo' in k]
Out[12]: [(u'vimeo\\.com', <micawber.providers.Provider at 0xb5ebfd0c>)]

Am I doing something wrong?

The other bootstrap functions work:

In [17]: bootstrap_basic().request('http://vimeo.com/111410510')
Out[17]: 
{u'author_name': u'Fundacja Picture Doc',
 u'author_url': u'http://vimeo.com/user8938954',
 u'description': u'Copyright by Fundacja Picture Doc\nCopyright by Fundacja Dialog-Pheniben',
 u'duration': 310,
 u'height': 720,
 u'html': u'<iframe src="//player.vimeo.com/video/111410510" width="1280" height="720" frameborder="0" title="Romowie w Europie. Zag\u0142ada" webkitallowfullscreen mozallowfullscreen allowfullscreen></iframe>',
 u'is_plus': u'1',
 u'provider_name': u'Vimeo',
 u'provider_url': u'https://vimeo.com/',
 u'thumbnail_height': 720,
 u'thumbnail_url': u'http://i.vimeocdn.com/video/496100635_1280.jpg',
 u'thumbnail_width': 1280,
 u'title': u'Romowie w Europie. Zag\u0142ada',
 u'type': u'video',
 u'uri': u'/videos/111410510',
 'url': 'http://vimeo.com/111410510',
 u'version': u'1.0',
 u'video_id': 111410510,
 u'width': 1280}
In [18]: bootstrap_embedly().request('http://vimeo.com/111410510')
Out[18]: 
{u'author_name': u'Fundacja Picture Doc',
 u'author_url': u'http://vimeo.com/user8938954',
 u'description': u'Copyright by Fundacja Picture Doc Copyright by Fundacja Dialog-Pheniben',
 u'height': 720,
 u'html': u'<iframe class="embedly-embed" src="//cdn.embedly.com/widgets/media.html?src=http%3A%2F%2Fplayer.vimeo.com%2Fvideo%2F111410510&src_secure=1&url=http%3A%2F%2Fvimeo.com%2F111410510&image=http%3A%2F%2Fi.vimeocdn.com%2Fvideo%2F496100635_1280.jpg&type=text%2Fhtml&schema=vimeo" width="1280" height="720" scrolling="no" frameborder="0" allowfullscreen></iframe>',
 u'provider_name': u'Vimeo',
 u'provider_url': u'https://vimeo.com/',
 u'thumbnail_height': 720,
 u'thumbnail_url': u'http://i.vimeocdn.com/video/496100635_1280.jpg',
 u'thumbnail_width': 1280,
 u'title': u'Romowie w Europie. Zag\u0142ada',
 u'type': u'video',
 'url': 'http://vimeo.com/111410510',
 u'version': u'1.0',
 u'width': 1280}
In [19]: bootstrap_noembed().request('http://vimeo.com/111410510')
Out[19]: 
{u'author_name': u'Fundacja Picture Doc',
 u'author_url': u'http://vimeo.com/user8938954',
 u'description': u'Copyright by Fundacja Picture Doc\nCopyright by Fundacja Dialog-Pheniben',
 u'duration': 310,
 u'height': 720,
 u'html': u'\n<div class="noembed-embed ">\n  <div class="noembed-wrapper">\n    \n<div class="noembed-embed-inner noembed-vimeo">\n  <iframe src="//player.vimeo.com/video/111410510" width="1280" height="720" frameborder="0" title="Romowie w Europie. Zag\u0142ada" webkitallowfullscreen mozallowfullscreen allowfullscreen></iframe>\n</div>\n\n    <table class="noembed-meta-info">\n      <tr>\n        <td class="favicon"><img src="https://noembed.com/favicon/Vimeo.png"></td>\n        <td>Vimeo</td>\n        <td align="right">\n          <a title="http://vimeo.com/111410510" href="http://vimeo.com/111410510">http://vimeo.com/111410510</a>\n        </td>\n      </tr>\n    </table>\n  </div>\n</div>\n',
 u'is_plus': u'1',
 u'provider_name': u'Vimeo',
 u'provider_url': u'https://vimeo.com/',
 u'thumbnail_height': 720,
 u'thumbnail_url': u'http://i.vimeocdn.com/video/496100635_1280.jpg',
 u'thumbnail_width': 1280,
 u'title': u'Romowie w Europie. Zag\u0142ada',
 u'type': u'video',
 u'uri': u'/videos/111410510',
 u'url': u'http://vimeo.com/111410510',
 u'version': u'1.0',
 u'video_id': 111410510,
 u'width': 1280}

Oembed.io supports it too:

In [23]: requests.get('http://oembed.io/api?url=http://vimeo.com/111410510').json()
Out[23]: 
{u'author': u'Fundacja Picture Doc',
 u'author_url': u'http://vimeo.com/user8938954',
 u'canonical': u'http://vimeo.com/111410510',
 u'description': u'Copyright by Fundacja Picture Doc\nCopyright by Fundacja Dialog-Pheniben',
 u'duration': 310,
 u'html': u'<div class="oembed-widget-container" style="left: 0px; width: 100%; height: 0px; position: relative; padding-bottom: 56%;"><iframe class="oembed-widget oembed-iframe" src="//player.vimeo.com/video/111410510" frameborder="0" style="top: 0px; left: 0px; width: 100%; height: 100%; position: absolute;"></iframe></div>',
 u'provider_name': u'Vimeo',
 u'thumbnail_height': 720,
 u'thumbnail_url': u'http://i.vimeocdn.com/video/496100635_1280.jpg',
 u'thumbnail_width': 1280,
 u'title': u'Romowie w Europie. Zag\u0142ada',
 u'type': u'rich',
 u'version': u'1.0'}
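
One possible explanation, offered as a guess rather than a diagnosis: the registry key printed in Out[12] is vimeo\.com with no scheme, and re.match only matches at the start of a string. If the lookup uses re.match, a URL that begins with http:// can never match that key. A quick check:

```python
import re

key = r'vimeo\.com'  # registry key as printed in Out[12]
url = 'http://vimeo.com/111410510'

anchored = re.match(key, url)    # only tries position 0 of the URL
anywhere = re.search(key, url)   # scans the whole URL
```

Here anchored is None while anywhere finds the domain, so either the keys need a scheme prefix or the lookup needs to search rather than match.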

ModuleNotFoundError: No module named 'micawber.contrib.mcdjango.mcdjango_tests'

Since 0.3.7 I have had trouble running the tests during packaging.

running test
running egg_info
writing micawber.egg-info/PKG-INFO
writing dependency_links to micawber.egg-info/dependency_links.txt
writing top-level names to micawber.egg-info/top_level.txt
reading manifest file 'micawber.egg-info/SOURCES.txt'
reading manifest template 'MANIFEST.in'
writing manifest file 'micawber.egg-info/SOURCES.txt'
running build_ext
test_extract (micawber.tests.ParserTestCase) ... ok
test_html_entities (micawber.tests.ParserTestCase) ... ok
test_multiline (micawber.tests.ParserTestCase) ... ok
test_multiline_full (micawber.tests.ParserTestCase) ... ok
test_outside_of_markup (micawber.tests.ParserTestCase) ... ok
test_parse_text (micawber.tests.ParserTestCase) ... ok
test_parse_text_full (micawber.tests.ParserTestCase) ... ok
test_urlize (micawber.tests.ParserTestCase) ... ok
test_caching (micawber.tests.ProviderTestCase) ... ok
test_caching_params (micawber.tests.ProviderTestCase) ... ok
test_invalid_json (micawber.tests.ProviderTestCase) ... ok
test_multiple_matches (micawber.tests.ProviderTestCase) ... ok
test_provider (micawber.tests.ProviderTestCase) ... ok
test_provider_matching (micawber.tests.ProviderTestCase) ... ok
test_register_unregister (micawber.tests.ProviderTestCase) ... ok

----------------------------------------------------------------------
Ran 15 tests in 0.082s

OK
Running micawber tests
All micawber tests passed
Running django integration tests
Traceback (most recent call last):
  File "/usr/lib/python3.7/site-packages/django/apps/config.py", line 118, in create
    cls = getattr(mod, cls_name)
AttributeError: module 'micawber.contrib.mcdjango' has no attribute 'mcdjango_tests'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "setup.py", line 35, in <module>
    test_suite='runtests.runtests',
  File "/usr/lib/python3.7/site-packages/setuptools/__init__.py", line 140, in setup
    return distutils.core.setup(**attrs)
  File "/usr/lib/python3.7/distutils/core.py", line 148, in setup
    dist.run_commands()
  File "/usr/lib/python3.7/distutils/dist.py", line 966, in run_commands
    self.run_command(cmd)
  File "/usr/lib/python3.7/distutils/dist.py", line 985, in run_command
    cmd_obj.run()
  File "/usr/lib/python3.7/site-packages/setuptools/command/test.py", line 228, in run
    self.run_tests()
  File "/usr/lib/python3.7/site-packages/setuptools/command/test.py", line 250, in run_tests
    exit=False,
  File "/usr/lib/python3.7/unittest/main.py", line 100, in __init__
    self.parseArgs(argv)
  File "/usr/lib/python3.7/unittest/main.py", line 147, in parseArgs
    self.createTests()
  File "/usr/lib/python3.7/unittest/main.py", line 159, in createTests
    self.module)
  File "/usr/lib/python3.7/unittest/loader.py", line 220, in loadTestsFromNames
    suites = [self.loadTestsFromName(name, module) for name in names]
  File "/usr/lib/python3.7/unittest/loader.py", line 220, in <listcomp>
    suites = [self.loadTestsFromName(name, module) for name in names]
  File "/usr/lib/python3.7/unittest/loader.py", line 205, in loadTestsFromName
    test = obj()
  File "/build/python-micawber/src/python-micawber-0.3.7/runtests.py", line 80, in runtests
    dj_failures = run_django_tests()
  File "/build/python-micawber/src/python-micawber-0.3.7/runtests.py", line 60, in run_django_tests
    setup()
  File "/usr/lib/python3.7/site-packages/django/__init__.py", line 24, in setup
    apps.populate(settings.INSTALLED_APPS)
  File "/usr/lib/python3.7/site-packages/django/apps/registry.py", line 89, in populate
    app_config = AppConfig.create(entry)
  File "/usr/lib/python3.7/site-packages/django/apps/config.py", line 123, in create
    import_module(entry)
  File "/usr/lib/python3.7/importlib/__init__.py", line 127, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 1006, in _gcd_import
  File "<frozen importlib._bootstrap>", line 983, in _find_and_load
  File "<frozen importlib._bootstrap>", line 965, in _find_and_load_unlocked
ModuleNotFoundError: No module named 'micawber.contrib.mcdjango.mcdjango_tests'

It seems currently only the templates of mcdjango are packaged.

feature request: add media.ccc.de integration

Hi!
I mistakenly reported this to nikola first (to add more features), so I'm now reporting it here as a feature request:
It would be great to integrate videos/streams from https://media.ccc.de into this library.

The service is run by the German hacker association Chaos Computer Club (CCC), which hosts annual events itself and lends streaming expertise to many external events via its Video Operation Center (VOC).

The streaming service is a valuable source of information on many different topics and I think it would be an awesome addition!

If you have pointers on where I can add it (I assume somewhere in providers.py), I might be able to do a pull request myself. I wouldn't call myself a Python expert though :-)

Custom fetcher

Do you plan to add a way to use a custom data retrieval method, by any chance?

It would be nice because then different methods could be used, like the requests library or asyncio in Python 3.4.

oauthlib (https://oauthlib.readthedocs.org/en/latest/oauth1/client.html) has this implementation and it works pretty nicely. Example:

client = oauthlib.oauth1.Client('client_key', client_secret='your_secret')
uri, headers, body = client.sign('http://example.com/request_token')
# Here you do a request
# and next you can grab data from response

So in this case it could be

provider = micawber.bootstrap_basic()
url, headers, body = provider.prepare_request(URL)
# Do a request
provider.parse_response(resp)

Or maybe it would be easier to allow overriding the fetch method with a callback?

provider.request(URL, fetch_callback=my_callback)  # called as my_callback(url, headers, body)

I know I can monkeypatch the fetch method, but that doesn't seem like a good approach in the long run.
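
To make the callback idea concrete, here is a minimal sketch of a provider whose transport is injected. The Provider class below is a hypothetical stand-in written for this example, not micawber's class, and fake_fetch replaces a real HTTP call.

```python
import json
from urllib.parse import urlencode


class Provider:
    """Hypothetical provider whose HTTP layer is an injected callable."""

    def __init__(self, endpoint, fetch):
        self.endpoint = endpoint
        self.fetch = fetch  # fetch(request_url) -> response body as str

    def request(self, url, **params):
        # Build the oEmbed request URL, then delegate the I/O to the callable.
        query = urlencode(dict(params, url=url, format='json'))
        body = self.fetch(self.endpoint + '?' + query)
        return json.loads(body)


def fake_fetch(request_url):
    """Stand-in transport; a requests- or asyncio-based one could go here."""
    return json.dumps({'type': 'video', 'requested': request_url})


provider = Provider('http://example.com/oembed', fake_fetch)
data = provider.request('http://example.com/watch?v=1')
```

Keeping URL construction and response parsing separate from the I/O is also what would make the prepare_request / parse_response split possible.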

suggestion: requests and responses

I know that requests is a bit of a resource hit (and it's been brought up before), but I wanted to suggest using it as the Provider (or an ancillary option) because it could improve testing.

The responses library (https://github.com/getsentry/responses) lets you easily intercept calls to the requests library to quickly write integrated tests. For example:

expected_payload = {'author_name': 
                    }
as_json = json.dumps(expected_payload)
with responses.RequestsMock() as rsps:
    rsps.add(responses.GET,
             "http://www.youtube.com/oembed",
             body=as_json,
             status=200,
             content_type='text/html',
             )
    result = providers.request('http://www.youtube.com/watch?v=54XHDUOHuzU')
    for (k, v) in expected_payload.items():
        assert result[k] == v

This was a big benefit to us for testing and simulations (and incredibly easy to implement via subclassing), so I wanted to suggest it upstream.

bootstrap_embedly() is not python 3 compatible

Using Python 3.4.2 on Ubuntu:

In [1]: import micawber

In [2]: micawber.bootstrap_embedly()
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-2-2419c7664bfa> in <module>()
----> 1 micawber.bootstrap_embedly()

/home/tin/.virtualenvs/waliki/lib/python3.4/site-packages/micawber/providers.py in bootstrap_embedly(cache, **params)
    203     resp.close()
    204 
--> 205     json_data = json.loads(contents)
    206 
    207     for provider_meta in json_data:

/usr/lib/python3.4/json/__init__.py in loads(s, encoding, cls, object_hook, parse_float, parse_int, parse_constant, object_pairs_hook, **kw)
    310     if not isinstance(s, str):
    311         raise TypeError('the JSON object must be str, not {!r}'.format(
--> 312                             s.__class__.__name__))
    313     if s.startswith(u'\ufeff'):
    314         raise ValueError("Unexpected UTF-8 BOM (decode using utf-8-sig)")

TypeError: the JSON object must be str, not 'bytes'
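
The usual fix for this error is to decode the response body before parsing it. A minimal sketch, where load_json is a hypothetical helper and UTF-8 is an assumed encoding for the response:

```python
import json


def load_json(contents, encoding='utf-8'):
    """Decode bytes before parsing, since json.loads on Python 3.4 needs str."""
    if isinstance(contents, bytes):
        contents = contents.decode(encoding)
    return json.loads(contents)


from_bytes = load_json(b'[{"name": "example"}]')
from_text = load_json('{"a": 1}')
```

From Python 3.6 onwards json.loads accepts bytes directly, so the explicit decode mainly matters for 3.4 and 3.5.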

Possible unintended behavior with parse_html?

If you have encoded html characters like &lt; and &gt; inside the same html tag as an untagged link, parse_html will decode the encoded characters instead of skipping them. This is inconsistent with the behavior when the encoded character is not inside the same tag as the untagged link, or if the link is already tagged.

Encoded characters next to an untagged link:

from micawber import ProviderRegistry
from micawber import parse_html
text = u'<p>http://www.google.com &lt;script&gt; alert("foo"); &lt;/script&gt;</p>'
parse_html(text, ProviderRegistry())

Output:

u'<p><a href="http://www.google.com">http://www.google.com</a> <script> alert("foo"); </script></p>'

Here the encoded characters are decoded.

Encoded characters next to a tagged link:

text = u'<p><a href="http://www.google.com">http://www.google.com</a> &lt;script&gt; alert("foo"); &lt;/script&gt;</p>'
parse_html(text, ProviderRegistry())

Output:

u'<p><a href="http://www.google.com">http://www.google.com</a> &lt;script&gt; alert("foo"); &lt;/script&gt;</p>'

Here the encoded characters are not decoded.

Encoded characters alone:

text = u'<p>&lt;script&gt; alert("foo"); &lt;/script&gt;</p>'
parse_html(text, ProviderRegistry())

Output:

u'<p>&lt;script&gt; alert("foo"); &lt;/script&gt;</p>'

Here the encoded characters are not decoded.

Environment:

python2.7
Package                       Version
----------------------------- -------
backports.functools-lru-cache 1.5
beautifulsoup4                4.8.1
micawber                      0.5.0
pip                           19.2.3
pkg-resources                 0.0.0
setuptools                    41.4.0
soupsieve                     1.9.4
wheel                         0.33.6

What is the intended behavior for parse_html?
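
For comparison, here is a sketch of the behavior I would expect: when a text node is rewritten, only the URL becomes markup and everything else is escaped again. It assumes the parser hands over text nodes with entities already decoded, which I have not verified against the micawber source. urlize_text_node is a hypothetical helper.

```python
import html
import re

URL_RE = re.compile(r'https?://\S+')


def urlize_text_node(decoded_text):
    """Turn URLs into anchors and re-escape everything else.

    decoded_text is the text of a node as an HTML parser returns it,
    with entities such as &lt; already decoded to '<'.
    """
    pieces = []
    last = 0
    for match in URL_RE.finditer(decoded_text):
        # Escape the plain text that precedes this URL.
        pieces.append(html.escape(decoded_text[last:match.start()], quote=False))
        url = html.escape(match.group(0), quote=True)
        pieces.append('<a href="%s">%s</a>' % (url, url))
        last = match.end()
    pieces.append(html.escape(decoded_text[last:], quote=False))
    return ''.join(pieces)


result = urlize_text_node('http://www.google.com <script> alert("foo"); </script>')
```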

Option to prevent converting inline links

I have a Markdown RichText field in my Django app, and I'm using micawber to convert video links into embedded videos. However, I only want micawber to convert links that are on their own line into embedded media. I don't want it to convert my markdown links; the markdown converter will take care of those.

So far the text is first run through an oembed_no_urlize function as described in your documentation:

from micawber.contrib.mcdjango import extension

oembed_no_urlize = extension('oembed', urlize_all=False)

Inline YouTube links are still oEmbed converted though, so a Markdown link like

[5 minutter og ti sekunder](http://www.youtube.com/watch?v=chbOViRudAg&t=5m10s)

is converted into

<a href='a href="http://www.youtube.com/watch?v=chbOViRudAg&amp;t=5m10s" title="Joo Sae Hyuk Vs Chuang Chih Yuan: WTTC 2014: 1/4 Final AMAZING MATCH">Joo Sae Hyuk Vs Chuang Chih Yuan: WTTC 2014: 1/4 Final AMAZING MATCH</a'>5 minutter og ti sekunder</a>

when first converted by micawber and then markdown.

Is it possible to disable all inline conversion?
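
What I am after is roughly the following: treat a URL as embeddable only when it is the sole content of its line. This is a standalone sketch with a hypothetical embed_standalone function and a dummy render callback, not something taken from micawber.

```python
import re

# A URL counts as standalone only if it is the sole content of its line.
STANDALONE_URL = re.compile(r'^[ \t]*(https?://\S+)[ \t]*$', re.MULTILINE)


def embed_standalone(text, render):
    """Replace standalone URLs with render(url) and leave inline URLs alone."""
    return STANDALONE_URL.sub(lambda match: render(match.group(1)), text)


source = (
    '[5 minutter og ti sekunder]'
    '(http://www.youtube.com/watch?v=chbOViRudAg&t=5m10s)\n'
    'http://www.youtube.com/watch?v=54XHDUOHuzU'
)
result = embed_standalone(source, lambda url: '<embed for %s>' % url)
```

The markdown link on the first line is left untouched for the markdown converter, and only the bare URL on the second line is replaced.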

Django 1.10 support

Hello! Which versions of Django does micawber support?
With Django 1.9.x I now get RemovedInDjango110Warning warnings in the log.

.../site-packages/django/template/loader.py:97: RemovedInDjango110Warning: render() must be called with a dict, not a Context.
  return template.render(context, request)

It's because of the render_to_string function. I looked through the Django 1.8-1.10 docs, and it looks like this function really expects a dict.

HTML parser doesn't deal with &amp

Suppose you've got the following content:

Testing

http://picasaweb.google.com/lh/sredir?uname=test&target=ALBUM&id=123&authkey=abc

(Note: the link itself is not valid due to mangled IDs (it was a private album))

Rendering this content as follows will not work:

{{post.body|linebreaksbr|oembed_html}}

The reason is that the "&" has been escaped and turned into "&amp". The HTML parser over at https://github.com/coleifer/micawber/blob/master/micawber/parsers.py#L144 does recognize & extract the URL, but it does not unescape &amp. Hence, &amp is fed to embed.ly... resulting in a 404 over there.
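
A possible fix is to unescape the extracted URL before it is sent to the provider. The sketch below uses the standard library's html.unescape on the example link. Where exactly this would go in the parser is not something I have worked out.

```python
import html

escaped = ('http://picasaweb.google.com/lh/sredir'
           '?uname=test&amp;target=ALBUM&amp;id=123&amp;authkey=abc')

# Undo the HTML escaping before using the URL in an outgoing request.
clean = html.unescape(escaped)
```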

500px and bootstrap_embedly

In [3]: requests.get('http://api.embed.ly/1/oembed?url=https%3A%2F%2Fiso.500px.com%2Fguest-curator-joel-julius-tjintjelaar-reveals-three-photographers-that-should-have-a-larger-following%2F&maxwidth=500').json()
Out[3]: 
{u'author_name': u'DL Cade',
 u'author_url': u'https://iso.500px.com/author/dl/',
 u'description': u"One of December's talented 500px Guest Curators was photographer Joel (Julius) Tjintjelaar , and he fully embraced the real purpose of the Editors' Choice section: to unveil photos and photographers that might not have made the Popular page for one reason or another... but probably should have.",
 u'provider_name': u'500px',
 u'provider_url': u'https://iso.500px.com',
 u'thumbnail_height': 1000,
 u'thumbnail_url': u'https://isocdn.500px.org/wp-content/uploads/2014/12/julius-1500x1000.jpg',
 u'thumbnail_width': 1500,
 u'title': u'Guest Curator Joel (Julius) Tjintjelaar Reveals Three Photographers that Should Have a Larger Following',
 u'type': u'link',
 u'url': u'https://iso.500px.com/guest-curator-joel-julius-tjintjelaar-reveals-three-photographers-that-should-have-a-larger-following/',
 u'version': u'1.0'}

In [4]: bootstrap_embedly().request('http://iso.500px.com/guest-curator-joel-julius-tjintjelaar-reveals-three-photographers-that-should-have-a-larger-following/')
---------------------------------------------------------------------------
ProviderNotFoundException                 Traceback (most recent call last)
<ipython-input-4-aca3a4c8cf6f> in <module>()
----> 1 bootstrap_embedly().request('http://iso.500px.com/guest-curator-joel-julius-tjintjelaar-reveals-three-photographers-that-should-have-a-larger-following/')

/tmp/micawber/local/lib/python2.7/site-packages/micawber/providers.pyc in inner(self, url, **params)
     91                 self.cache.set(key, data)
     92             return data
---> 93         return fn(self, url, **params)
     94     return inner
     95 

/tmp/micawber/local/lib/python2.7/site-packages/micawber/providers.pyc in request(self, url, **params)
    132         if provider:
    133             return provider.request(url, **params)
--> 134         raise ProviderNotFoundException('Provider not found for "%s"' % url)
    135 
    136 

ProviderNotFoundException: Provider not found for "http://iso.500px.com/guest-curator-joel-julius-tjintjelaar-reveals-three-photographers-that-should-have-a-larger-following/"

LICENSE file?

Hi, what's the license for the micawber project?
Would you mind adding a license file for it?
We'd prefer an MIT license if you're open to suggestions.
Thanks

Limit number of rendered links

Hello,
There is a security concern that is generally not taken care of in oEmbed solutions: if one uses these solutions to provide media display of user input, one has to take care of malicious users filling their input with dozens or hundreds of links. (posted in order to clutter the other viewers' pages)
So I wonder if there is a simple way with micawber to limit the number of links parsed.
Thanks
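
One way this could be approached outside the library is to cap the number of URLs that get rendered and leave the rest as plain text. The sketch below is self-contained; limit_links, fake_render and the URL regex are hypothetical names, and in real use the render callback would be whatever performs the oEmbed lookup.

```python
import re

# Simplified URL matcher for illustration only.
URL_RE = re.compile(r'https?://\S+')


def limit_links(text, max_links, render):
    """Render at most max_links URLs via render(); leave later URLs untouched."""
    seen = 0

    def replace(match):
        nonlocal seen
        seen += 1
        if seen > max_links:
            return match.group(0)  # over the cap: keep the bare URL
        return render(match.group(0))

    return URL_RE.sub(replace, text)


def fake_render(url):
    # Stand-in for a real oEmbed lookup.
    return '<embed src="%s">' % url


text = 'http://a.example/1 http://a.example/2 http://a.example/3'
print(limit_links(text, 2, fake_render))
```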

Packaging: examples conflict with flasgger

There is a file conflict between flasgger and micawber, because both install files into the overly generic path name examples.
For reference, please see this Arch Linux bug.

As a solution, micawber and flasgger should either not install these examples at all or, if they are required, install them into a unique directory (e.g. micawber-examples) or another system directory (e.g. on Linux: /usr/share/doc/python-micawber/examples, which is usually done by the packagers).

I will remove them for now to resolve the file conflict.

bootstrap_embedly performance question

Hi, I'm new to micawber, and I'm reading the docs. I have a question about bootstrap_embedly.

If I want to use embed.ly, does bootstrap_embedly have to be run every time? For example, if I call it in a Django web app's initialization code, is it going to cause a delay at startup? Or does it cache results for future use?

From the docs:

>>> import micawber
>>> providers = micawber.bootstrap_embedly() # may take a second
>>> print micawber.parse_text('this is a test:\nhttp://www.youtube.com/watch?v=54XHDUOHuzU', providers)
this is a test:
<iframe width="640" height="360" src="http://www.youtube.com/embed/54XHDUOHuzU?fs=1&feature=oembed" frameborder="0" allowfullscreen></iframe>

A bit more detail in the docs regarding this issue would be appreciated.
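
Assuming the bootstrap call is expensive (the docs comment says it "may take a second"), one option is to run it once per process and reuse the result. This is a sketch only: make_cached_loader and slow_bootstrap are hypothetical, and slow_bootstrap stands in for micawber.bootstrap_embedly so the example runs without network access.

```python
import functools


def make_cached_loader(bootstrap):
    """Wrap a zero-argument bootstrap function so it runs at most once."""
    @functools.lru_cache(maxsize=None)
    def get_providers():
        return bootstrap()
    return get_providers


calls = []


def slow_bootstrap():
    # Stand-in for an expensive call such as micawber.bootstrap_embedly.
    calls.append(1)
    return {'registry': 'stand-in'}


get_providers = make_cached_loader(slow_bootstrap)
first = get_providers()
second = get_providers()
print(len(calls), first is second)
```

The first request pays the startup cost; later calls in the same process get the cached object back.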

Define (test) requirements

Hi!
I'm currently trying to package this module for Arch Linux [community].
However, while doing so, I realized that the required dependencies are not defined anywhere.
When grepping for imports, I see that the tests definitely require beautifulsoup, because they import it (they also failed hard when run without it installed). There seems to be a conditional dependency on redis, django and flask. Can you please add them to a requirements.txt or add a Pipfile (and explain why they are needed), so I can add proper (optional, runtime and test) dependencies for the package and people will have an easier time using micawber?
Thanks for your work!

Django 1.8 warning

Would be nice to see this fix added to next release:

/micawber/contrib/mcdjango/__init__.py:4: RemovedInDjango19Warning: django.utils.importlib will be removed in Django 1.9.
  from django.utils.importlib import import_module
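
A common way to silence this kind of warning is to prefer the standard-library importlib and fall back to the Django copy only where the stdlib module is missing. This is a sketch of that pattern, not necessarily the change made in micawber itself.

```python
try:
    # Standard library location (Python 2.7 and 3.x).
    from importlib import import_module
except ImportError:
    # Fallback for very old interpreters that relied on Django's bundled copy.
    from django.utils.importlib import import_module

# import_module behaves the same regardless of which import succeeded.
json_module = import_module('json')
print(json_module.dumps({'ok': True}))
```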

To parse HTML, install BeautifulSoup

We get this error for some yet to be clarified reason. Is there a hidden dependency on the BeautifulSoup package?

File "/app/.heroku/python/lib/python3.9/site-packages/micawber/contrib/mcflask.py", line 21, in _oembed
2020-10-25T09:55:22.161763+00:00 app[web.1]:     return oembed(s, providers, urlize_all, html, **params)
2020-10-25T09:55:22.161763+00:00 app[web.1]:   File "/app/.heroku/python/lib/python3.9/site-packages/micawber/contrib/mcflask.py", line 10, in oembed
2020-10-25T09:55:22.161763+00:00 app[web.1]:     return Markup(fn(s, providers, urlize_all, **params))
2020-10-25T09:55:22.161764+00:00 app[web.1]:   File "/app/.heroku/python/lib/python3.9/site-packages/micawber/parsers.py", line 137, in parse_html
2020-10-25T09:55:22.161764+00:00 app[web.1]:     raise Exception('Unable to parse HTML, please install BeautifulSoup '
2020-10-25T09:55:22.161764+00:00 app[web.1]: Exception: Unable to parse HTML, please install BeautifulSoup or beautifulsoup4, or use the text parser
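
The message suggests that HTML parsing depends on an optional BeautifulSoup install. Calling code could check for it up front and fall back to the text parser, as in this sketch. pick_parser is a hypothetical helper, and the module names it probes (bs4 and the older BeautifulSoup) are assumptions based on the error text.

```python
import importlib.util


def pick_parser():
    """Return 'html' if a BeautifulSoup package is importable, else 'text'."""
    for name in ('bs4', 'BeautifulSoup'):
        if importlib.util.find_spec(name) is not None:
            return 'html'
    return 'text'


print(pick_parser())
```

A caller could then use micawber.parse_html when the result is 'html' and micawber.parse_text otherwise.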

Can micawber parse into https contents of youtube?

I tried to parse a youtube https url by the steps:

import micawber
providers = micawber.bootstrap_basic()
url = "https://www.youtube.com/watch?v=5BbSe_pI_eo"
micawber.parse_text(url, providers)

output:

<iframe width="480" height="270" src="http://www.youtube.com/embed/5BbSe_pI_eo?feature=oembed" frameborder="0" allowfullscreen></iframe>

The result still uses an http URL instead of https. Is this due to the design of micawber or a limitation of YouTube?
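
Whatever the cause, a post-processing step can rewrite the embed source to https, assuming the provider also serves the embed over https. The force_https helper below is hypothetical and works on the returned markup only.

```python
import re

# Match src="http:// and capture the attribute prefix.
SRC_HTTP_RE = re.compile(r'(\bsrc=")http://')


def force_https(html_fragment):
    """Rewrite src="http://..." attributes to use https."""
    return SRC_HTTP_RE.sub(r'\1https://', html_fragment)


iframe = ('<iframe width="480" height="270" '
          'src="http://www.youtube.com/embed/5BbSe_pI_eo?feature=oembed" '
          'frameborder="0" allowfullscreen></iframe>')
print(force_https(iframe))
```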

Media with https

I am trying to embed that video in my nikola blog post.

The video is embedded over http no matter what configuration I use.
I tried both http and https with the following syntax:

# With http
.. media:: http://www.dailymotion.com/video/x1apjif_une-arbalete-de-poche-fabriquee-manuellement_tv

# With https
.. media:: https://www.dailymotion.com/video/x1apjif_une-arbalete-de-poche-fabriquee-manuellement_tv

The problem is that the video is hidden by Firefox when I use the https version of my blog.

Is this a bug in micawber or, as @RAISINA mentioned here, is it a dailymotion issue?

If it can be of any help, I am currently using:

  • nikola 7.7.7
  • micawber 0.3.3

Here is my original issue

bootstrap_basic

Hi,
I was trying to include Facebook into the basic list of providers. E.g.,
pr.register('https://www.facebook.com/\S*?/posts/\S*', Provider('https://www.facebook.com/plugins/post/oembed.json'))

or

pr.register('https://www.facebook.com/\S*/photos/\S*', Provider('https://www.facebook.com/plugins/post/oembed.json'))

work perfectly fine. However, when I try

pr.register('https://www.facebook.com/photo.php?fbid=\S*', Provider('https://www.facebook.com/plugins/post/oembed.json'))

for a URL like
https://www.facebook.com/photo.php?fbid=10204669368414661&set=a.10201344709340262.1073741826.1849311083&type=3&theater
it always comes back with the message "Provider not found for ..."
What am I doing wrong? Is it the regular expression? Or is it an issue with the endpoints?

Many thanks for any feedback.
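
One likely cause, assuming the registry treats the first argument as a regular expression (the working patterns above rely on \S): an unescaped "?" makes the preceding "p" optional instead of matching a literal question mark. The check below uses only the standard re module.

```python
import re

url = ('https://www.facebook.com/photo.php?fbid=10204669368414661'
       '&set=a.10201344709340262.1073741826.1849311083&type=3&theater')

# "?" is a quantifier here, so the pattern never matches a literal "?".
unescaped = r'https://www.facebook.com/photo.php?fbid=\S*'
# Escaping "?" (and the dots) matches the URL as written.
escaped = r'https://www\.facebook\.com/photo\.php\?fbid=\S*'

print(bool(re.match(unescaped, url)))  # False
print(bool(re.match(escaped, url)))    # True
```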

in case you want to support more services:

# Raw strings keep the regex escapes (\S, \w) intact.
pr.register(r'http://qik.com/\S*',
            Provider('http://qik.com/api/oembed.json'))
pr.register(r'http://www.polleverywhere.com/\w+/\S+',
            Provider('http://www.polleverywhere.com/services/oembed/'))
pr.register(r'http://www.slideshare.net/\w+/\S+',
            Provider('http://www.slideshare.net/api/oembed/2'))
pr.register(r'http://\w+.wordpress.com/\S+',
            Provider('http://public-api.wordpress.com/oembed/'))
# "*" is not a wildcard in a regex; match the subdomain with \w+ instead.
pr.register(r'http://\w+.revision3.com/\S+',
            Provider('http://revision3.com/api/oembed/'))
# SmugMug: the original repeated the slideshare pattern; this URL pattern is
# assumed from the provider's domain.
pr.register(r'http://\w+.smugmug.com/\S+',
            Provider('http://api.smugmug.com/services/oembed/'))
pr.register(r'http://\w+.viddler.com/\S+',
            Provider('http://lab.viddler.com/services/oembed/'))

CSP headers

Hi!
I'm using Flask, but this would be useful for Django and others too.
It would be great to have a feature that records, in a per-request cache or similar, which services were used, and then adjusts the Content-Security-Policy header to include those services as accepted origins.

Otherwise the browser blocks the embedded object from loading, and allowing any origin rather than only those needed is not acceptable.

Thanks a lot!
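
As a rough sketch of the idea, the origins could be collected from the rendered markup itself and turned into a frame-src directive. Both helpers below are hypothetical and only look at iframe src attributes; a real integration would set the resulting value on the response header in Flask or Django.

```python
import re
from urllib.parse import urlsplit

# Capture the src attribute of each iframe tag.
SRC_RE = re.compile(r'<iframe[^>]*\bsrc="([^"]+)"')


def frame_origins(rendered_html):
    """Return the sorted, de-duplicated origins of all iframe sources."""
    origins = set()
    for src in SRC_RE.findall(rendered_html):
        parts = urlsplit(src)
        if parts.scheme and parts.netloc:
            origins.add('%s://%s' % (parts.scheme, parts.netloc))
    return sorted(origins)


def csp_frame_src(rendered_html):
    """Build a frame-src directive allowing 'self' plus the embedded origins."""
    return 'frame-src ' + ' '.join(["'self'"] + frame_origins(rendered_html))


page = ('<p><iframe width="459" height="344" '
        'src="http://www.youtube.com/embed/54XHDUOHuzU?feature=oembed">'
        '</iframe></p>')
print(csp_frame_src(page))
```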

parse_html overhead

>>> import micawber
>>> providers = micawber.bootstrap_basic()
>>> micawber.parse_html('<p>http://www.youtube.com/watch?v=54XHDUOHuzU</p>', providers)
u'<html><body><p><html><body><iframe allowfullscreen="" frameborder="0" height="344" src="http://www.youtube.com/embed/54XHDUOHuzU?feature=oembed" width="459"></iframe></body></html></p></body></html>'

Why are the html and body tags there? I do not need them.

>>> micawber.parse_text('http://www.youtube.com/watch?v=54XHDUOHuzU', providers)
u'<iframe width="459" height="344" src="http://www.youtube.com/embed/54XHDUOHuzU?feature=oembed" frameborder="0" allowfullscreen></iframe>'
>>> micawber.parse_text('<p>http://www.youtube.com/watch?v=54XHDUOHuzU</p>', providers)
u'<p><a href="http://www.youtube.com/watch?v=54XHDUOHuzU" title="Future Crew - Second Reality demo - HD">Future Crew - Second Reality demo - HD</a></p>'

I don't want a link; I want the iframe etc., as in the docs, even when the text contains other tags.

I use bs4, but why is it not listed in the docs as a dependency?

ps. Python 2.7.3 (default, Mar 13 2014, 11:03:55)
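
Until the wrapper tags are avoided at the source, they can be stripped from the returned string. This is a workaround sketch, not a fix in the parser; strip_wrappers is a hypothetical helper and assumes the wrappers carry no attributes, as in the output above.

```python
import re

# Matches bare <html>, </html>, <body> and </body> tags.
WRAPPER_RE = re.compile(r'</?(?:html|body)>')


def strip_wrappers(markup):
    """Remove html/body wrapper tags, keeping everything inside them."""
    return WRAPPER_RE.sub('', markup)


wrapped = ('<html><body><p><html><body><iframe src="x"></iframe>'
           '</body></html></p></body></html>')
print(strip_wrappers(wrapped))
```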

I'd like more granular exceptions

I'd like more granular exceptions so I can distinguish between exception cases in my calling code. Specifically, I'd like to differentiate when a call to ProviderRegistry.request fails due to a provider not being found for a URL versus an error fetching a particular endpoint URL.

Let me know what you think about this. I'm happy to fork and make a pull request if you're willing to go this direction.
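
To make the proposal concrete, here is a sketch of what such a hierarchy could look like. ProviderNotFoundException is the name seen in tracebacks; ProviderException, EndpointFetchError and the toy request function are hypothetical and do not reflect micawber's actual code.

```python
class ProviderException(Exception):
    """Hypothetical common base class for provider errors."""


class ProviderNotFoundException(ProviderException):
    """No registered provider matches the URL."""


class EndpointFetchError(ProviderException):
    """Hypothetical: a provider matched, but fetching its endpoint failed."""


def request(url, endpoints, fetch):
    """Toy lookup: endpoints maps URL prefixes to endpoint URLs."""
    for prefix, endpoint in endpoints.items():
        if url.startswith(prefix):
            try:
                return fetch(endpoint, url)
            except IOError as exc:
                raise EndpointFetchError(
                    'Could not fetch "%s": %s' % (endpoint, exc))
    raise ProviderNotFoundException('Provider not found for "%s"' % url)


def failing_fetch(endpoint, url):
    # Simulates a network failure.
    raise IOError('connection refused')


endpoints = {'http://known.example/': 'http://known.example/oembed'}
for candidate in ('http://known.example/video', 'http://other.example/video'):
    try:
        request(candidate, endpoints, failing_fetch)
    except ProviderNotFoundException:
        print('no provider:', candidate)
    except EndpointFetchError:
        print('fetch failed:', candidate)
```

Callers that do not care about the distinction can still catch the shared base class.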

Youtube Playlists

I'm not quite sure where the fault for this lies, but here seems a good start.

Embedding a youtube playlist using embed.ly directly works okay:
http://embed.ly/code?url=https%3A%2F%2Fwww.youtube.com%2Fplaylist%3Flist%3DPLE2714DC8F2BA092D (literally an example playlist heh)

Running it through micawber doesn't embed anything when using the URL https://www.youtube.com/playlist?list=PLE2714DC8F2BA092D. Using the embed URL https://www.youtube.com/embed/videoseries?list=PLE2714DC8F2BA092D results in the first video in the series being embedded, but with no playlist controls.
