fedora-infra / datagrepper

HTTP API for datanommer and the fedmsg bus

Home Page: https://apps.fedoraproject.org/datagrepper/

License: GNU General Public License v2.0

Python 67.03% CSS 9.48% JavaScript 4.16% HTML 13.49% Makefile 0.59% Shell 1.85% Jinja 3.40%
python fedora fedora-project data-science data-analysis data postgresql postgres

datagrepper's Issues

Better documentation

For an API, datagrepper's documentation is fairly unreadable. There is no table of contents, no good navigation between parts of the page, and far too much wasted whitespace.

I would like to create something similar to this page: https://stripe.com/docs/api

Error rendering docs with code-block

The docs in staging have the following error messages sprinkled throughout:

System Message: ERROR/3 (<string>, line 35)

Unknown directive type "code-block".

Query for messages with time range

datagrepper should be almost entirely just a JSON API (with not much HTML other than an explanation of how to query the API).

There should be a URL that, given a start and/or stop time, returns every message in that period.

/topics endpoint

A /topics endpoint should be created, which returns all the topics seen in the datanommer database.

This should probably be cached in memcached as well.
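One way to picture the caching is the sketch below. It uses a small in-process dictionary with a time-to-live as a stand-in for memcached, and every name in it (`get_topics`, `load_topics`, `TOPICS_TTL`) is hypothetical rather than part of datagrepper.

```python
import time

TOPICS_TTL = 300  # seconds; an assumed value, not a project setting

_cache = {}  # stand-in for memcached: key -> (expires_at, value)


def get_topics(load_topics, now=None, ttl=TOPICS_TTL):
    """Return the topic list, running the expensive query only when stale.

    `load_topics` is a hypothetical callable that queries the datanommer
    database and returns an iterable of topic strings.
    """
    now = time.time() if now is None else now
    entry = _cache.get("topics")
    if entry is not None and entry[0] > now:
        return entry[1]  # cache hit, still fresh
    topics = sorted(set(load_topics()))  # de-duplicate and order the topics
    _cache["topics"] = (now + ttl, topics)
    return topics
```

With a real memcached client, the dictionary lookups would become get/set calls carrying an expiry, and the rest of the flow would stay the same.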

/id endpoint

There needs to be an /id endpoint to get a message given a fedmsg msg_id.

/id?id=2013-UUID

Deploy to staging

At some point, we want to deploy this to our staging infrastructure.

Review FUDCon discussion and create tickets for queued requests

https://fedoraproject.org/wiki/User:Ianweller/statistics_plus_plus/datagrepper_API are the notes from @ianweller and @ralphbean's FUDCon discussion.

I created tickets here for everything in the '/raw' section just now.

Someone still needs to review that and create tickets here for everything in the "queued requests" section.

The idea is that users could submit really complicated queries in some specification language (that we invent?) and that we don't return their results immediately, but queue them to be "crunched" by a background worker over time. Their results will be emailed to them.

This portion of datagrepper (phase 2) is a much more sophisticated analysis suite/engine/thing.

Make embeddable JavaScript widget

We should have a widget you can put on your website that pulls the latest messages (given basic arguments to /raw) from datagrepper.

Write a README

The README should include instructions for how to set up and run datagrepper in a dev environment on Fedora 18.

Move away from Flask config to using fedmsg config

It would be good to get rid of the DATANOMMER_CONFIG environment variable and use a better-designed configuration system instead.

Since datagrepper has a hard dependency on fedmsg, there's no reason not to use its built-in configuration system, alongside datanommer.

Query for messages by user

As with the other tickets: given a username, all messages relating to that user (from the last 10 minutes) should be returned.

If start and stop times are specified, then all messages in that time range relating to that user should be returned.
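The selection rule can be sketched as a plain filter over already-loaded messages. This is only an illustration: the `usernames` and `timestamp` fields and the function name are hypothetical, and treating a single missing bound as open-ended is an assumption, since the ticket only describes the cases where both or neither are given.

```python
import time

DEFAULT_WINDOW = 600  # ten minutes, in seconds


def messages_for_user(messages, username, start=None, end=None, now=None):
    """Return the messages that mention `username` within a time window.

    Each message is assumed to be a dict with a `usernames` list and an
    epoch-seconds `timestamp` (hypothetical field names).
    """
    now = time.time() if now is None else now
    if start is None and end is None:
        # No range given: fall back to the last ten minutes.
        start, end = now - DEFAULT_WINDOW, now
    else:
        # Assumption: a missing bound leaves that side of the range open.
        start = float("-inf") if start is None else start
        end = now if end is None else end
    return [
        m for m in messages
        if username in m["usernames"] and start <= m["timestamp"] <= end
    ]
```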

Paginate results

Results (from every kind of query) should be paginated JSON if possible.

This is low-priority for now, but will be necessary later for people to really start using this.

We can't have someone make a request for an entire dump of every message ever over and over and over again. It would crash our httpd server. So... return results ~100 rows at a time (with some standard way of how to get the next 100 rows).
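A minimal sketch of what "100 rows at a time" could mean is shown below. The `page` and `rows_per_page` names mirror parameters that appear in queries elsewhere in this tracker; the function name and the keys of the returned dict are assumptions, not datagrepper's actual response format.

```python
import math


def paginate(rows, page=1, rows_per_page=100):
    """Return one page of `rows` plus the numbers needed to fetch the next."""
    if page < 1 or rows_per_page < 1:
        raise ValueError("page and rows_per_page must be positive")
    total = len(rows)
    offset = (page - 1) * rows_per_page
    return {
        "total": total,                               # rows matching the query
        "pages": math.ceil(total / rows_per_page),    # number of pages available
        "page": page,
        "rows_per_page": rows_per_page,
        "messages": rows[offset:offset + rows_per_page],
    }
```

A client would keep requesting `page + 1` until `page` equals `pages`. In a real service the slice would be pushed down to the database as a LIMIT/OFFSET rather than applied to a list in memory.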

API keys?

We could enhance #8 by adding API keys. Users could log in with their FAS account and request a key. If their requests to the JSON API bear their API key, then maybe the restrictions on their rate limits could be loosened (but not abolished).

If some user is being rude, we could revoke their API key.

This is super low priority, but would really be pro.

API rate limiting

It would be useful if we could rate limit clients who are querying datagrepper.

Most public APIs have this. If you make more than X requests in Y amount of time, then your IP is blocked for Z amount of time.
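The X/Y/Z rule can be sketched as a small in-memory limiter keyed by client IP. This is an illustration under assumptions, not a proposed implementation: the class and parameter names are hypothetical, and a multi-process deployment would need shared storage instead of local dictionaries.

```python
import time
from collections import defaultdict, deque


class RateLimiter:
    """Block an IP for `block_for` seconds once it exceeds
    `max_requests` requests within `window` seconds."""

    def __init__(self, max_requests, window, block_for, clock=time.monotonic):
        self.max_requests = max_requests  # X
        self.window = window              # Y
        self.block_for = block_for        # Z
        self.clock = clock                # injectable so tests can fake time
        self._hits = defaultdict(deque)   # ip -> timestamps of recent requests
        self._blocked_until = {}          # ip -> time at which the block ends

    def allow(self, ip):
        now = self.clock()
        if self._blocked_until.get(ip, float("-inf")) > now:
            return False
        hits = self._hits[ip]
        # Forget requests that have fallen out of the window.
        while hits and hits[0] <= now - self.window:
            hits.popleft()
        if len(hits) >= self.max_requests:
            self._blocked_until[ip] = now + self.block_for
            hits.clear()
            return False
        hits.append(now)
        return True
```

A request handler would call `allow(client_ip)` first and answer with an error status when it returns `False`.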

Create queued job runner

This will be the daemon/cronjob/not-a-webapp that runs the actual jobs submitted with /submit.

try/except block in datagrepper.runner is overzealous

Running into an issue now on datagrepper01.stg where this traceback is happening:

Traceback (most recent call last):
  File "/usr/lib64/python2.6/site-packages/twisted/python/context.py", line 59, in callWithContext
    return self.currentContext().callWithContext(ctx, func, *args, **kw)
  File "/usr/lib64/python2.6/site-packages/twisted/python/context.py", line 37, in callWithContext
    return func(*args,**kw)
  File "/usr/lib64/python2.6/site-packages/twisted/internet/selectreactor.py", line 146, in _doReadOrWrite
    why = getattr(selectable, method)()
  File "/usr/lib/python2.6/site-packages/txzmq/connection.py", line 241, in doRead
    log.callWithLogger(self, self.messageReceived, message)
--- <exception caught here> ---
  File "/usr/lib64/python2.6/site-packages/twisted/python/log.py", line 84, in callWithLogger
    return callWithContext({"system": lp}, func, *args, **kw)
  File "/usr/lib64/python2.6/site-packages/twisted/python/log.py", line 69, in callWithContext
    return context.call({ILogContext: newCtx}, func, *args, **kw)
  File "/usr/lib64/python2.6/site-packages/twisted/python/context.py", line 59, in callWithContext
    return self.currentContext().callWithContext(ctx, func, *args, **kw)
  File "/usr/lib64/python2.6/site-packages/twisted/python/context.py", line 37, in callWithContext
    return func(*args,**kw)
  File "/usr/lib/python2.6/site-packages/txzmq/pubsub.py", line 60, in messageReceived
    self.gotMessage(message[1], message[0])
  File "/usr/lib/python2.6/site-packages/moksha/hub/zeromq/zeromq.py", line 168, in chain_over_moksha_callbacks
    f(_body, _topic)
  File "/usr/lib/python2.6/site-packages/moksha/hub/zeromq/zeromq.py", line 195, in intercept
    return callback(ZMQMessage(_topic, _body))
  File "/usr/lib/python2.6/site-packages/moksha/hub/api/consumer.py", line 107, in _consume_json
    self._consume(message_as_dict)
  File "/usr/lib/python2.6/site-packages/fedmsg/consumers/__init__.py", line 140, in _consume
    self.consume(message)
  File "/usr/lib/python2.6/site-packages/datagrepper/runner.py", line 42, in consume
    'datagrepper_{0}'.format(job.id))
  File "/usr/lib/python2.6/site-packages/moksha/hub/api/consumer.py", line 90, in _consume_json
    topic = message.headers[0].routing_key
exceptions.AttributeError: 'ZMQMessage' object has no attribute 'headers'

As far as I can tell, the AttributeError is occurring in another thread, but it's being caught by this runner. :(
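The general remedy for an overzealous handler is to guard only the call that can fail in an expected way and to catch only the error type the runner knows how to handle. The sketch below illustrates that pattern; it is not the code in datagrepper.runner, and `JobFailed`, `run_job`, and `execute` are hypothetical names.

```python
import logging

log = logging.getLogger("runner_sketch")


class JobFailed(Exception):
    """Hypothetical error raised by job code for expected failures."""


def run_job(job_id, execute):
    """Run one job, handling only the failure the runner knows about."""
    try:
        result = execute(job_id)  # keep the guarded region as small as possible
    except JobFailed as exc:
        log.warning("job %s failed: %s", job_id, exc)
        return ("failed", None)
    # Anything else (an AttributeError from unrelated code, for instance)
    # propagates with its original traceback instead of being swallowed.
    return ("done", result)
```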

/cancel endpoint

Cancel a submitted job. /cancel?id=

Checks if you are logged in as the same user who submitted the job. Can only cancel a job if it is free (open jobs are already being processed). Sets job state to "canceled".

Job runner should check before setting each job to open whether or not it has been canceled.
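The rules above can be sketched as a tiny state machine. The state names come from this ticket; the `Job` class, the function names, and the choice of exceptions are assumptions made for illustration.

```python
from dataclasses import dataclass

FREE, OPEN, CANCELED = "free", "open", "canceled"


@dataclass
class Job:
    id: int
    owner: str
    state: str = FREE


def cancel(job, user):
    """Cancel `job` on behalf of `user`, enforcing the two checks."""
    if user != job.owner:
        raise PermissionError("only the submitter may cancel this job")
    if job.state != FREE:
        raise ValueError("job is %s and can no longer be canceled" % job.state)
    job.state = CANCELED


def try_open(job):
    """Runner side: open the job only if it is still free."""
    if job.state != FREE:
        return False  # canceled (or already open): skip it
    job.state = OPEN
    return True
```

In a database-backed runner the check and the state change in `try_open` would need to happen in one transaction, otherwise a cancel could slip in between them.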

http get timeout is shorter than the time datagrepper needs to respond

http get https://apps.fedoraproject.org/datagrepper/raw/ start==1378305614 end==1378824505 category==bodhi rows_per_page==1 page==1 user==mbooth

http: error: SSLError: The read operation timed out

The default timeout in http is shorter than the time datagrepper takes to return data, so the request times out. The same query returns results in a browser, which allows a longer timeout for HTTP requests:

https://apps.fedoraproject.org/datagrepper/raw/?start=1378305614&end=1378824505&category=bodhi&rows_per_page=1&user=mbooth
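As a client-side workaround, the same query can be issued with an explicit, longer timeout. The sketch below uses only the Python standard library; the helper names and the 120-second default are assumptions, and the query parameters are the ones from the report above.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

RAW_URL = "https://apps.fedoraproject.org/datagrepper/raw/"


def raw_query_url(**params):
    """Build a /raw query URL from keyword arguments."""
    return RAW_URL + "?" + urlencode(params)


def fetch_raw(timeout=120, **params):
    """Fetch a /raw query, waiting up to `timeout` seconds for the server."""
    with urlopen(raw_query_url(**params), timeout=timeout) as response:
        return json.load(response)


url = raw_query_url(start=1378305614, end=1378824505, category="bodhi",
                    rows_per_page=1, user="mbooth")
```

Calling `fetch_raw(start=1378305614, end=1378824505, category="bodhi", rows_per_page=1, user="mbooth")` would perform the request itself. Raising the client timeout only hides the slow query; it does not make the server respond faster.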

Possibility to ask for a specific topic

Thinking: we may want to show, on the front page of fedora-packages, the last 10 packages that were added to pkgdb. At the moment we can query datagrepper for the latest entries in the pkgdb category and filter by the relevant topic.

Would it make sense to be able to query a specific topic directly (in this case pkgdb.package.new)?
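For comparison, the current workaround of filtering on the client can be sketched as follows. The `topic` and `timestamp` field names and the dotted prefix in front of the topic are assumptions about the message format, and `latest_by_topic` is a hypothetical helper.

```python
def latest_by_topic(messages, topic, limit=10):
    """Return the newest `limit` messages whose topic is, or ends with, `topic`."""
    matching = [
        m for m in messages
        if m["topic"] == topic or m["topic"].endswith("." + topic)
    ]
    matching.sort(key=lambda m: m["timestamp"], reverse=True)  # newest first
    return matching[:limit]
```

The drawback is that every message in the category has to be downloaded before most of them are thrown away, which is what a server-side topic filter would avoid.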

Determine way to make advanced queries

The initial plan for the /submit endpoint was to be able to submit advanced queries (such as "get all the git commits that changed a specfile" or "get wiki edits where users edited their own user pages").

Ideally, the user would be able to write a Python function that filtered message content for them. This is a security hell hole.

We could invent a language, but there's no way it could be complete enough. (I tried to use a couple different methods before giving up at Flock 2013.)

Base query for messages

datagrepper should be almost entirely just a JSON API (with not much HTML other than an explanation of how to query the API).

There should be a URL that, given no arguments, returns every message from the last 10 minutes.

Split datagrepper into subpackages

After a discussion with @ralphbean we decided we should split up datagrepper into subpackages, similar to datanommer, so that the job runner can run on a different server than the web server.

Packages will be split as follows:

  • datagrepper.models
  • datagrepper.web
  • datagrepper.runner
