embedly-proxy's Introduction

Embedly Proxy

A simple Python/Flask app that proxies requests to the embed.ly service and masks the application API key.

API Interface

Extract V1

This V1 API is no longer supported.

Fetch Metadata V2

Fetch metadata for a provided list of URLs from remote metadata services.

None

  • Data Params

  • Request Headers

    The POST body must be a JSON encoded dictionary.

    content-type: application/json

  • Success Response:

    • Code: 200

    JSON encoding

    {
      urls: {
        "<url1>": <embedly metadata>,
        "<urln>": <embedly metadata>,
      },
      error: ""
    }
    
    ex success:
    
    {
      urls: {
        "https://www.mozilla.org": {
          <embedly metadata>
        }
      },
      error: ""
    }
    
    ex failure:
    
    {
      urls: {},
      error: "The Content-Type header must be set to application/json"
    }
    
  • Error Responses:

    • Code: 400

    The server received a malformed request.

    • Code: 500

    The server was unable to satisfy the request.

  • Sample Call:

      curl -X POST -d '{"urls":["https://www.mozilla.org"]}' -H 'content-type:application/json' https://embedly-proxy.services.mozilla.com/v2/metadata
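
The same call can be sketched in Python using only the standard library. This is a minimal example rather than an official client: the helper names `build_metadata_request` and `fetch_metadata` and the 10 second timeout are assumptions; the endpoint and request shape come from the curl sample.

```python
import json
import urllib.request

# Endpoint taken from the curl sample call.
METADATA_URL = "https://embedly-proxy.services.mozilla.com/v2/metadata"


def build_metadata_request(urls):
    """Build (but do not send) a POST request for a list of URLs."""
    body = json.dumps({"urls": urls}).encode("utf-8")
    return urllib.request.Request(
        METADATA_URL,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def fetch_metadata(urls, timeout=10):
    """Send the request and return the decoded JSON response body."""
    request = build_metadata_request(urls)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)
```

Calling `fetch_metadata(["https://www.mozilla.org"])` should return a dictionary with the `urls` and `error` keys described under Success Response.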
    

embedly-proxy's Issues

Add CORS header support

We must support CORS headers for the client to be able to effectively call into this service.
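
As a rough sketch of what this could look like, the helper below adds CORS headers to a plain dictionary of response headers. The function name, the wildcard origin, and the allowed methods are assumptions for illustration, not the project's actual configuration.

```python
def add_cors_headers(headers, allowed_origin="*"):
    """Return a copy of the response headers with CORS headers added."""
    cors = dict(headers)
    # Assumed policy: any origin may POST JSON to the proxy.
    cors["Access-Control-Allow-Origin"] = allowed_origin
    cors["Access-Control-Allow-Methods"] = "POST, OPTIONS"
    cors["Access-Control-Allow-Headers"] = "Content-Type"
    return cors
```

Applying the helper to every response, whatever its status code, would keep error responses readable by browser clients too.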

Modify server timeout

I filed mozilla/activity-stream#251 earlier but it may be easier to handle request timeouts from the proxy directly.

I was stuck waiting on an embedly proxy response for ~30 seconds (which ultimately came back as a 502). Not sure if we want to tighten that to fail after 5-10s if we can't get a response, or ...

Integrate with Onyx IP blacklist

The Onyx data collection service implements an IP blacklist to block malicious traffic. We should periodically pull down the IP blacklist from the Onyx service and reject calls that match the list.
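
The matching step might look something like the sketch below. The format of the Onyx blacklist is not described here, so this assumes a list of strings that are either single addresses or CIDR ranges; `parse_blacklist` and `is_blocked` are hypothetical names.

```python
import ipaddress


def parse_blacklist(entries):
    """Convert address or CIDR strings into network objects."""
    # strict=False accepts entries such as "10.1.2.3/8" with host bits set.
    return [ipaddress.ip_network(entry, strict=False) for entry in entries]


def is_blocked(client_ip, networks):
    """Return True if the client address falls inside any blacklisted network."""
    address = ipaddress.ip_address(client_ip)
    return any(address in network for network in networks)
```

A periodic job would rebuild the parsed list, and the request handler would call `is_blocked` before doing any other work.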

Emit statsd messages

On heartbeat calls, increment a statsd counter. We should also capture the timing of calls to find out whether calls that reach out to embedly take too much time.
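
For reference, statsd metrics are plain-text UDP datagrams, so a counter and a timer can be sketched without a client library. The metric names used in the usage note and the local host and port are assumptions.

```python
import socket


def counter_packet(name, value=1):
    """Format a statsd counter increment, e.g. b"heartbeat:1|c"."""
    return f"{name}:{value}|c".encode("ascii")


def timing_packet(name, milliseconds):
    """Format a statsd timer in whole milliseconds."""
    return f"{name}:{int(milliseconds)}|ms".encode("ascii")


def send(packet, host="127.0.0.1", port=8125):
    """Fire-and-forget a metric over UDP (8125 is the usual statsd port)."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(packet, (host, port))
```

The heartbeat view could call `send(counter_packet("heartbeat"))`, and the code around each embedly request could measure elapsed time and send it with `timing_packet`.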

Add dev make command

Add a dev make command that starts the Flask dev server. Also move those commands out of views.py into dev_server.py.

Promote V2 api to V1

There's no need in alpha to preserve the V1 GET API so we should promote the V2 API to V1 rather than having a separate endpoint.

Split api.py into api.py and extract.py

Split the embedly extraction logic and the Flask views into two separate modules, and alter the tests to unit test the extraction logic directly rather than through the views. The views should then only have a simple integration test.

Add rate limiting

Add rate limiting to the embedly proxy to ensure that rapid callers do not take the server down.
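
One common approach is a token bucket per caller. The sketch below is an in-memory illustration only; the class name, the rate, and the capacity are assumptions, and a multi-process deployment would need shared state rather than a Python object.

```python
import time


class TokenBucket:
    """Allow short bursts up to `capacity`, refilling at `rate` tokens per second."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.clock = clock
        self.tokens = capacity
        self.updated = clock()

    def allow(self):
        """Consume one token if available and report whether the call may proceed."""
        now = self.clock()
        elapsed = now - self.updated
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False
```

A handler would keep one bucket per client and reject the request when `allow()` returns `False`.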

Set up HSTS and HPKP for servers when HTTPS is enabled

When the application is configured to run in HTTPS mode, we should enable HSTS and HPKP.

Because we're terminating the TLS connection at the load balancer, HTTPS mode will have to be a configuration item of the embedly proxy.
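
A minimal sketch of the HSTS half, driven by a configuration flag, could look like this. The flag name and the one-year max-age are assumptions; HPKP is left out because it needs real key pins.

```python
def security_headers(https_enabled, max_age=31536000):
    """Return the extra response headers to send when HTTPS mode is configured."""
    if not https_enabled:
        # Never send HSTS when the app is configured for plain HTTP.
        return {}
    return {
        "Strict-Transport-Security": f"max-age={max_age}; includeSubDomains",
    }
```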

Include missing URLs from embedly in response

If we request a URL from embedly and the URL they return is transformed or missing, rather than omitting it from the result set, we should include it with a payload that indicates that the URL data returned from embedly was malformed in some way.
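
To illustrate the idea, the helper below keeps every requested URL in the result and marks the ones embedly did not return under the same key. The function name and the shape of the placeholder payload are hypothetical.

```python
def merge_results(requested_urls, embedly_results):
    """Return one entry per requested URL, flagging any that embedly dropped."""
    merged = {}
    for url in requested_urls:
        if url in embedly_results:
            merged[url] = embedly_results[url]
        else:
            # Hypothetical payload; the real format is still to be decided.
            merged[url] = {"error": "missing or transformed in embedly response"}
    return merged
```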

Add a Make file

Add a Make file for the following:

  • build (just the embedly container)
  • test (build the container and run tests inside)
  • dev server (compose build, compose up)
  • gunicorn (run just the embedly container)
  • deploy (to the dev instances controlled by docker-machine)

Add request metadata to return signature

We should add an outer layer to the response data which includes information about the request, for instance:

{
    url_count: <int>, // number of URLs in the response
    url_data: {
        <url>: <url dict> // data from embedly about a URL
    },
    error: <str> // a string which optionally contains information about what went wrong during the request
}

@k88hudson @oyiptong @pdehaan thoughts?
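
For discussion, the proposed envelope could be built by a small helper like this one. It is only a sketch of the shape proposed in this issue; `build_response` is a hypothetical name.

```python
def build_response(url_data, error=""):
    """Wrap per-URL embedly data in the proposed outer envelope."""
    return {
        "url_count": len(url_data),
        "url_data": url_data,
        "error": error,
    }
```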

Return errors in v1 api

If there's an exception raised by the URL extractor we should return it to the client with a 500 status.

A few HTTP 500 Internal Server Error messages after enabling Activity Stream experiment via Test Pilot

Found in https://embedly-proxy.stage.mozaws.net/__version__ (8682d12):

{ "commit": "8682d1204c9ae5210b517d949619689589ccc268",
  "version": "0.5",
  "source": "https://github.com/mozilla/embedly-proxy.git" }

Steps to reproduce:

  1. Go to https://testpilot.firefox.com/experiments/activity-stream (log in w/ mozilla email).
  2. Open the Browser Console (Tools > Web Developer > Browser Console).
  3. Enable the Activity Stream experiment.

Actual results:

I saw a few HTTP 500 POSTs to https://embedly-proxy.stage.mozaws.net/v2/extract

Expected results:

No HTTP 500 errors in console.

500 error for a large request

I'm guessing there's probably just one bad URL in here that's causing an unhandled error:

curl 'http://embedly-proxy.dev.mozaws.net/extract?urls=https%3A%2F%2Fwww.mozilla.org%2Fen-US%2Ffirefox%2Fnightly%2Ffirstrun%2F%3Foldversion%3D45.0&urls=https%3A%2F%2Faccounts.google.com%2FServiceLogin%3Fservice%3Dwise%26passive%3D1209600%26continue%3Dhttps%3A%2F%2Fdrive.google.com%2Fdrive%2Fmy-drive%26followup%3Dhttps%3A%2F%2Fdrive.google.com%2Fdrive%2Fmy-drive%23identifier&urls=https%3A%2F%2Faccounts.google.com%2FServiceLogin%3Fservice%3Dcl%26passive%3D1209600%26osid%3D1%26continue%3Dhttps%3A%2F%2Fcalendar.google.com%2Fcalendar%2Frender%26followup%3Dhttps%3A%2F%2Fcalendar.google.com%2Fcalendar%26scc%3D1%23identifier&urls=https%3A%2F%2Fwww.tumblr.com%2Flogin&urls=https%3A%2F%2Fwww.tumblr.com%2F&urls=https%3A%2F%2Ftumblr.com%2F&urls=http%3A%2F%2Ftumblr.com%2F&urls=http%3A%2F%2Fsmile.amazon.com%2F&urls=https%3A%2F%2Fgithub.com%2Fmozilla%2Factivity-streams%2Fpull%2F172&urls=https%3A%2F%2Fgithub.com%2Fnchapman%2Fsummarizer-server%2Fblob%2Fmaster%2Fapp%2Fcontrollers%2Fembedly_controller.rb&urls=https%3A%2F%2Fgithub.com%2Fnchapman%2Fsummarizer-server%2Ftree%2Fmaster%2Fapp%2Fcontrollers&urls=https%3A%2F%2Fgithub.com%2Fnchapman%2Fsummarizer-server%2Ftree%2Fmaster%2Fapp&urls=https%3A%2F%2Fgithub.com%2Fnchapman%2Fsummarizer-server&urls=https%3A%2F%2Fgithub.com%2Fnchapman%3Ftab%3Drepositories&urls=https%3A%2F%2Fgithub.com%2Fnchapman%2F&urls=http%3A%2F%2Fgithub.com%2Fnchapman%2F&urls=http%3A%2F%2Fembed.ly%2Fpricing&urls=http%3A%2F%2Fembed.ly%2F&urls=https%3A%2F%2Fapp.embed.ly%2Forganization%2Fchronicle%2Fapi&urls=https%3A%2F%2Fapp.embed.ly%2Forganization%2Fchronicle%2Fapi%2Fextract%2Fusage%2F20160204%2F20160303&urls=https%3A%2F%2Fapp.embed.ly%2Forganization%2Fchronicle%2Fapi%2Fkey&urls=https%3A%2F%2Fapp.embed.ly%2Forganization%2Fchronicle%2Fapi%2Fembed%2Fusage%2F20160204%2F20160303&urls=https%3A%2F%2Fapp.embed.ly%2Forganization%2Fchronicle&urls=https%3A%2F%2Fapp.embed.ly%2Flogin' -H 'Host: embedly-proxy.dev.mozaws.net' -H 'User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 
10.11; rv:47.0) Gecko/20100101 Firefox/47.0' -H 'Accept: */*' -H 'Accept-Language: en-US,en;q=0.5' --compressed -H 'Origin: resource://activity-streams' -H 'Connection: keep-alive'

Returns:

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2 Final//EN">
<title>500 Internal Server Error</title>
<h1>Internal Server Error</h1>
<p>The server encountered an internal error and was unable to complete your request.  Either the server is overloaded or there is an error in the application.</p>

Figure out bandwidth impact of embedly proxy

Embedly returns more data than content needs. We need to measure how much data that is.

If it is big, we can investigate various mitigation strategies like:

  • gzip (which we probably should do anyway)
  • pruning the data sent from embedly

Basically, this issue is about evaluating how much bandwidth ingress we cause the clients.
If the cost will be prohibitive, we can discuss how to make things better.

Add logging

We should be logging requests somewhere, but there are privacy considerations. If we log an entire request that includes multiple query URLs, the log can be used to reconstruct a partial window of a user's history, which we do not want. We should figure out what we need to log for application health and performance without compromising user privacy.
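
One possible direction is to log only aggregate facts about each request. The field names below are assumptions; the point of the sketch is that no URL ever reaches the log record.

```python
def summarize_request(urls, status, elapsed_ms):
    """Build a log record that describes a request without including its URLs."""
    return {
        "url_count": len(urls),
        "status": status,
        "elapsed_ms": round(elapsed_ms),
    }
```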

Security Audit

Tracking issues raised by @jvehent for security audit.

  • ensure embedly doesn't return any images that are hosted by embedly controlled domains (i.e. embedly.com)
  • send @jvehent example logs from production NGINX
  • audit any/all application logs to ensure no user history is logged
  • ensure user IP addresses are not relayed to embedly

Do not return embedly messages to client

If a communication with Embedly results in an error, do not return the Embedly return message to the calling client. It may leak private information including our private key.
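
A sketch of that behaviour: log what happened on the server side and hand the client a fixed message. The logger name, the generic message text, and the choice of a 500 status are assumptions.

```python
import logging

logger = logging.getLogger("embedly_proxy")

GENERIC_ERROR = "Unable to fetch metadata from the remote service"


def error_response(exc):
    """Return a (body, status) pair that never echoes the upstream message."""
    # Log only the exception type; the message itself may contain the API key.
    logger.error("Embedly request failed: %s", type(exc).__name__)
    return {"urls": {}, "error": GENERIC_ERROR}, 500
```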

Remove protocol/query string filtering

A feature was added to resolve multiple URLs into the same URL by stripping out protocol and query strings, however many websites put page dependent information in the query string, so we should not be removing this information from the cache key. Remove this logic for now.
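
In other words, the cache key should be derived from the full URL. A minimal sketch, assuming a hashed key with a hypothetical `embedly:` prefix:

```python
import hashlib


def cache_key(url):
    """Derive a cache key from the complete URL, protocol and query string included."""
    digest = hashlib.sha256(url.encode("utf-8")).hexdigest()
    return "embedly:" + digest
```

With this, `https://example.com/?page=1` and `https://example.com/?page=2` map to different cache entries, as do the `http` and `https` variants of the same page.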

Detect if ?urls= is an empty value

Currently this throws an Internal Server Error:
http://embedly-proxy.dev.mozaws.net/extract?urls=

Internal Server Error

The server encountered an internal error and was unable to complete your request. Either the server is overloaded or there is an error in the application.

We may just want to return an empty object in this weird edge case (which was me accidentally hitting the enter key on my keyboard while testing).
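
A small sketch of the check, using the standard library rather than Flask's request object; `extract_urls` is a hypothetical helper name.

```python
from urllib.parse import parse_qs


def extract_urls(query_string):
    """Return the non-empty `urls` values from a raw query string."""
    # parse_qs drops blank values by default, so "urls=" yields no entries.
    values = parse_qs(query_string).get("urls", [])
    # Also discard values that are only whitespace.
    return [value for value in values if value.strip()]
```

When the returned list is empty, the view can respond with an empty result instead of calling embedly.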

Remove V1 API

We should remove the V1 API entirely as it is no longer needed.

Deploy production server?

Currently the add-on prefs point to a stage server. We may want to stand up a production server before release, because after that point the stage server would effectively be a production server with a confusing name.

Plus, we may want to make sure #28 is addressed before release.

Move embedly calls to asynchronous workers

Rather than querying embedly synchronously during a request flow, we should defer uncached URLs to a queue and have an asynchronous worker query embedly and store the results in the cache. This will reduce overall load on the request handlers, reduce latency, and lower the likelihood of the servers being DOSed.
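
The flow could be sketched with an in-process queue and a worker thread. This is only an illustration of the idea: a real deployment would more likely use an external queue and cache, and `fetch` stands in for the actual embedly call.

```python
import queue
import threading


def start_worker(pending, cache, fetch):
    """Start a thread that fetches queued URLs and stores results in the cache."""
    def run():
        while True:
            url = pending.get()
            try:
                if url is None:  # sentinel: stop the worker
                    return
                cache[url] = fetch(url)
            except Exception:
                pass  # a real worker would log the failure and maybe retry
            finally:
                pending.task_done()

    thread = threading.Thread(target=run, daemon=True)
    thread.start()
    return thread


def lookup(urls, pending, cache):
    """Return cached entries immediately and queue the rest for the worker."""
    found = {}
    for url in urls:
        if url in cache:
            found[url] = cache[url]
        else:
            pending.put(url)
    return found
```

The request handler only ever calls `lookup`, so its latency no longer depends on embedly; uncached URLs show up on a later request once the worker has filled the cache.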

Do not send POST body to Sentry

We presently use Sentry to log exceptions. However, this sends the entire POST body to our ops-controlled Sentry instance, and that body contains unobfuscated URLs from users' histories. We should omit or obfuscate it in some way to prevent leaking users' histories.
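
As an illustration, a scrubbing step could run on each event before it is sent. The sketch below works on a plain dictionary and assumes the event carries a `request` entry with `data` and `query_string` fields; how it hooks into the Sentry client is left open.

```python
import copy


def scrub_event(event):
    """Return a copy of the event with request body and query string removed."""
    scrubbed = copy.deepcopy(event)
    request = scrubbed.get("request")
    if isinstance(request, dict):
        for field in ("data", "query_string"):
            if field in request:
                # Replace user-supplied URLs with a fixed marker.
                request[field] = "[scrubbed]"
    return scrubbed
```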

This was accidentally filed in mozilla-services/location-leaderboard#261

CORS for error responses

When testing this locally I ran into the issue with the query string being too long, which returns a 400 Bad Request / Request Line is too large error if you curl it, but the client gets a Cross-Origin Request Blocked error instead. Maybe we want to add CORS for error responses as well?
