fedora-infra / datagrepper

HTTP API for datanommer and the fedmsg bus

Home Page: https://apps.fedoraproject.org/datagrepper/

License: GNU General Public License v2.0

Python 67.03% CSS 9.48% JavaScript 4.16% HTML 13.49% Makefile 0.59% Shell 1.85% Jinja 3.40%

python fedora fedora-project data-science data-analysis data postgresql postgres

datagrepper's Issues

Bad indentation in references doc.

https://apps.fedoraproject.org/datagrepper/reference

It's at the bottom of the page, under the 'meta' section.

Better documentation

For an API, datagrepper's documentation is fairly unreadable. No table of contents, no good navigation between parts of the page, way too much whitespace on the page (lots and lots of wasted space).

I would like to create something similar to this page: https://stripe.com/docs/api

Job runner needs to delete output of job after a specified period of time

http/https discrepancy in the docs in staging

The docs in staging direct the user to query http://apps.fp.o/datagrepper/ but the user should really hit https://... (unless they specify the --follow option for the HTTPie tool).

Error rendering docs with code-block

The docs in staging have the following error messages sprinkled throughout:

System Message: ERROR/3 (<string>, line 35)

Unknown directive type "code-block".

Query for messages with time range

datagrepper should be almost entirely just a JSON api (with not much HTML other than explaining how to query the api).

There should be a url that, given a start and/or stop time, returns every message in that period.
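
As a rough illustration of what such a request could look like from the client side, here is a minimal sketch. It assumes the /raw endpoint and the start/end parameter names that appear in other issues on this page; build_raw_url is a hypothetical helper, not part of datagrepper.

```python
from urllib.parse import urlencode

# Base URL taken from queries quoted elsewhere on this page.
BASE_URL = "https://apps.fedoraproject.org/datagrepper/raw"


def build_raw_url(start=None, end=None, **filters):
    """Build a query URL for an optional time range (hypothetical helper).

    start and end are Unix timestamps; either one may be omitted.
    """
    params = {}
    # Compare against None so that a timestamp of 0 is still sent.
    if start is not None:
        params["start"] = start
    if end is not None:
        params["end"] = end
    params.update(filters)
    query = urlencode(params)
    return f"{BASE_URL}?{query}" if query else BASE_URL


print(build_raw_url(start=1378305614, end=1378824505))
```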

/topics endpoint

A /topics endpoint should be created, which returns all the topics seen in the datanommer database.

This should probably be cached in memcached as well.
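
The caching idea could be sketched roughly as follows. This uses an in-process dictionary with a time-to-live as a stand-in for memcached, and TTLCache and load_topics are hypothetical names, so treat it as an illustration of the pattern rather than the intended implementation.

```python
import time


class TTLCache:
    """Tiny in-memory cache with expiry (stand-in for memcached)."""

    def __init__(self, ttl, clock=time.monotonic):
        self.ttl = ttl
        self.clock = clock
        self._store = {}  # key -> (expires_at, value)

    def get_or_compute(self, key, compute):
        now = self.clock()
        entry = self._store.get(key)
        if entry is not None and now < entry[0]:
            return entry[1]  # still fresh, skip the expensive query
        value = compute()
        self._store[key] = (now + self.ttl, value)
        return value


def load_topics():
    # Hypothetical placeholder for the database query that would
    # collect every distinct topic stored by datanommer.
    return ["org.fedoraproject.prod.bodhi.update.comment"]


cache = TTLCache(ttl=300)
topics = cache.get_or_compute("topics", load_topics)
```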

/id endpoint

There needs to be an /id endpoint to get a message given a fedmsg msg_id.

/id?id=2013-UUID

Wait for datanommer's next release

We rely on this commit: fedora-infra/datanommer@126f6da

Job runner should catch exceptions while running jobs and set state to "failed" instead of just stopping

Deploy to staging

At some point, we want to deploy this to our staging infrastructure.

Review FUDCon discussion and create tickets for queued requests

https://fedoraproject.org/wiki/User:Ianweller/statistics_plus_plus/datagrepper_API are the notes from @ianweller and @ralphbean's FUDCon discussion.

I created tickets here for everything in the '/raw' section just now.

Someone still needs to review that and create tickets here for everything in the "queued requests" section.

The idea is that users could submit really complicated queries in some specification language (that we invent?) and that we don't return their results immediately, but queue them to be "crunched" by a background worker over time. Their results will be emailed to them.

This portion of datagrepper (phase 2) is a much more sophisticated analysis suite/engine/thing.

Make embeddable JavaScript widget

We should have a widget you can put on your website that pulls the latest messages (given basic arguments to /raw) from datagrepper.

OpenID auth

start==0 is an invalid start time

Probably because 0 is being cast to False in an if statement somewhere.
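
If the guess about a truthiness check is right, the pattern would look something like the sketch below. Both functions are hypothetical and are not taken from the datagrepper source; they only contrast a truthiness test with an explicit None test.

```python
def parse_start_buggy(start):
    # A truthiness test rejects 0 because 0 is falsy in Python.
    if not start:
        raise ValueError("invalid start time")
    return float(start)


def parse_start_fixed(start):
    # Testing for None lets 0 (the Unix epoch) through.
    if start is None:
        raise ValueError("invalid start time")
    return float(start)
```

With these definitions, parse_start_fixed(0) returns 0.0, while parse_start_buggy(0) raises ValueError.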

Query causes internal server error

https://apps.fedoraproject.org/datagrepper/raw?start=1370534556.0&end=1367856156.0
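
One thing visible in this URL is that start is later than end. The issue does not say whether that is the cause of the error, but a server could reject such a range up front instead of failing. A minimal sketch, where validate_range is a hypothetical helper:

```python
def validate_range(start, end):
    """Reject a time range whose start is later than its end."""
    if start is not None and end is not None and start > end:
        raise ValueError("start must not be later than end")
    return start, end


try:
    validate_range(1370534556.0, 1367856156.0)
except ValueError as exc:
    # A web handler could turn this into a 400 response.
    print("bad request:", exc)
```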

Option to reverse order of the results

At the moment, messages from fedmsg are returned from oldest to newest. It would be nice to have an option to invert this.

Write a README

With instructions for how to set up and run datagrepper in a dev environment on Fedora 18.

Queries with the delta variable take way too long

$ http get https://apps.fedoraproject.org/datagrepper/raw/ \
    delta==172800  category==bodhi user==besser82 

http: error: SSLError: The read operation timed out

Move away from Flask config to using fedmsg config

Being able to get rid of the DATANOMMER_CONFIG environment variable and instead using a configuration system that is better designed would be a good thing.

Since datagrepper has a hard dependency on fedmsg, there's no reason not to use its built-in configuration system, alongside datanommer.

Require login for /submit

Query for messages by category.

Just like #3, #4, and #5, but querying by category.

meta==title causes a traceback

See fedora-infra/fedmsg#140 for the root cause.

In the meantime, it would be nice to stop users from running into this by accident.

Query for messages by user

Like the other tickets, given a username, all messages relating to that user should be returned (for the last 10 minutes).

If start and stop times are specified, then all messages in that time range relating to that user should be returned.
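
The default described here (the last 10 minutes unless a range is given) could be resolved along these lines. resolve_window and DEFAULT_DELTA are hypothetical names, and treating a missing start as "end minus the default" is an assumption.

```python
import time

DEFAULT_DELTA = 600  # 10 minutes, in seconds


def resolve_window(start=None, end=None, now=None):
    """Return the (start, end) timestamps a query should cover."""
    if now is None:
        now = time.time()
    if end is None:
        end = now
    if start is None:
        # No explicit start: fall back to the last 10 minutes.
        start = end - DEFAULT_DELTA
    return start, end
```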

Create queued job database model

Query for messages by package

Just like #3, but for packages.

RPM package

Make a nice looking landing page

An HTML page that explains to the user how to use the JSON api.

Paginate results

Results (from every kind of query) should be paginated JSON if possible.

This is low-priority for now, but will be necessary later for people to really start using this.

We can't have someone make a request for an entire dump of every message ever over and over and over again. It would crash our httpd server. So... return results ~100 rows at a time (with some standard way of how to get the next 100 rows).
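
A minimal sketch of what paginated output could look like. The page and rows_per_page names match parameters used in queries quoted elsewhere on this page; the shape of the returned dictionary is an assumption, and a real implementation would apply the offset and limit in the database query rather than slicing a list.

```python
import math


def paginate(rows, page=1, rows_per_page=100):
    """Return one page of rows plus enough metadata to fetch the next."""
    if page < 1 or rows_per_page < 1:
        raise ValueError("page and rows_per_page must be positive")
    total = len(rows)
    offset = (page - 1) * rows_per_page
    return {
        "page": page,
        "pages": math.ceil(total / rows_per_page),
        "rows_per_page": rows_per_page,
        "total": total,
        "rows": rows[offset:offset + rows_per_page],
    }
```

A client would keep requesting page + 1 until page equals pages.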

Deploy to production.

Once #12 is done, we'll probably want to deploy this to production.

Gut Flask-SQLAlchemy

Replace Flask-SQLAlchemy with plain ol' SQLAlchemy.

Blocks #56.

replace Flask-OpenID with the FAS OpenID implementation in python-fedora

Set up /submit to accept and validate queued jobs

API keys?

We could enhance #8 by adding API keys. Users could login with their FAS account and request a key. If their requests to the JSON api bear their api key, then maybe the restrictions on their rate limits could be loosened (but not abolished).

If some user is being rude, we could revoke their api key.

This is super low priority, but would really be pro.

API rate limiting

It would be useful if we could rate limit clients who are querying datagrepper.

Most public APIs have this. If you make more than X requests in Y amount of time, then your IP is blocked for Z amount of time.
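
The X/Y/Z rule above could be sketched as a sliding-window limiter. This is an in-memory illustration only; RateLimiter and its parameter names are hypothetical, and a real deployment would more likely keep the counters in a shared store or at the proxy layer.

```python
import time
from collections import defaultdict, deque


class RateLimiter:
    """Allow max_requests per window seconds, then block for block_for."""

    def __init__(self, max_requests, window, block_for, clock=time.monotonic):
        self.max_requests = max_requests
        self.window = window
        self.block_for = block_for
        self.clock = clock
        self._hits = defaultdict(deque)  # client -> recent request times
        self._blocked_until = {}         # client -> time the block ends

    def allow(self, client):
        now = self.clock()
        until = self._blocked_until.get(client)
        if until is not None:
            if now < until:
                return False
            del self._blocked_until[client]
        hits = self._hits[client]
        # Forget requests that have fallen out of the window.
        while hits and now - hits[0] >= self.window:
            hits.popleft()
        if len(hits) >= self.max_requests:
            self._blocked_until[client] = now + self.block_for
            hits.clear()
            return False
        hits.append(now)
        return True
```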

Query for messages by topic.

Just like #3 and #4, but for the topic specified.

0.2 requirements.txt updates

Upon release, we will need to make these updates to requirements.txt:

fedmsg>=0.7.0, or the latest version (#46, fedora-infra/fedmsg#140)
datanommer.models>=latest version (#52, fedora-infra/datanommer#60)

rst renders incorrectly on the bottom of the reference page

See https://apps.fedoraproject.org/datagrepper/reference

The very last section looks like:



System Message: ERROR/3 (<string>, line 168)
Unexpected indentation.

    packages, objects

Default: None

Create queued job runner

This will be the daemon/cronjob/not-a-webapp that runs the actual jobs submitted with /submit.

try/except block in datagrepper.runner is overzealous

Running into an issue now on datagrepper01.stg where this traceback is happening:

Traceback (most recent call last):
  File "/usr/lib64/python2.6/site-packages/twisted/python/context.py", line 59, in callWithContext
    return self.currentContext().callWithContext(ctx, func, *args, **kw)
  File "/usr/lib64/python2.6/site-packages/twisted/python/context.py", line 37, in callWithContext
    return func(*args,**kw)
  File "/usr/lib64/python2.6/site-packages/twisted/internet/selectreactor.py", line 146, in _doReadOrWrite
    why = getattr(selectable, method)()
  File "/usr/lib/python2.6/site-packages/txzmq/connection.py", line 241, in doRead
    log.callWithLogger(self, self.messageReceived, message)
--- <exception caught here> ---
  File "/usr/lib64/python2.6/site-packages/twisted/python/log.py", line 84, in callWithLogger
    return callWithContext({"system": lp}, func, *args, **kw)
  File "/usr/lib64/python2.6/site-packages/twisted/python/log.py", line 69, in callWithContext
    return context.call({ILogContext: newCtx}, func, *args, **kw)
  File "/usr/lib64/python2.6/site-packages/twisted/python/context.py", line 59, in callWithContext
    return self.currentContext().callWithContext(ctx, func, *args, **kw)
  File "/usr/lib64/python2.6/site-packages/twisted/python/context.py", line 37, in callWithContext
    return func(*args,**kw)
  File "/usr/lib/python2.6/site-packages/txzmq/pubsub.py", line 60, in messageReceived
    self.gotMessage(message[1], message[0])
  File "/usr/lib/python2.6/site-packages/moksha/hub/zeromq/zeromq.py", line 168, in chain_over_moksha_callbacks
    f(_body, _topic)
  File "/usr/lib/python2.6/site-packages/moksha/hub/zeromq/zeromq.py", line 195, in intercept
    return callback(ZMQMessage(_topic, _body))
  File "/usr/lib/python2.6/site-packages/moksha/hub/api/consumer.py", line 107, in _consume_json
    self._consume(message_as_dict)
  File "/usr/lib/python2.6/site-packages/fedmsg/consumers/__init__.py", line 140, in _consume
    self.consume(message)
  File "/usr/lib/python2.6/site-packages/datagrepper/runner.py", line 42, in consume
    'datagrepper_{0}'.format(job.id))
  File "/usr/lib/python2.6/site-packages/moksha/hub/api/consumer.py", line 90, in _consume_json
    topic = message.headers[0].routing_key
exceptions.AttributeError: 'ZMQMessage' object has no attribute 'headers'

As far as I can tell, the AttributeError is occurring in another thread, but it's being caught by this runner. :(

datagrepper.runner does not log exceptions

If a job fails because of an exception, the exception is lost.
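
A sketch of how a runner could both record the failure and keep the traceback, using the standard logging module. The job is modelled as a plain dictionary and run_job is a hypothetical function; the "failed" and "open" states come from other issues on this page, while "done" is an assumed name for the success state.

```python
import logging

log = logging.getLogger("datagrepper.runner")


def run_job(job, work):
    """Run work(job), logging any exception instead of losing it."""
    job["state"] = "open"
    try:
        job["result"] = work(job)
    except Exception:
        # log.exception records the message together with the traceback.
        log.exception("job %s failed", job["id"])
        job["state"] = "failed"
    else:
        job["state"] = "done"  # assumed name for the success state
    return job
```

Keeping the try block around only the job itself also limits how much unrelated code the handler can swallow.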

/cancel endpoint

Cancel a submitted job. /cancel?id=

Checks if you are logged in as the same user who submitted the job. Can only cancel a job if it is free (open jobs are already being processed). Sets job state to "canceled".

Job runner should check before setting each job to open whether or not it has been canceled.
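
The rules above amount to a small state machine, sketched here with jobs as plain dictionaries. cancel_job and claim_job are hypothetical names, and the "user" key is an assumption about how the submitter would be stored.

```python
def cancel_job(job, current_user):
    """Web side: cancel a job if the caller owns it and it is still free."""
    if job["user"] != current_user:
        raise PermissionError("only the submitter may cancel this job")
    if job["state"] != "free":
        raise ValueError("only free jobs can be canceled")
    job["state"] = "canceled"
    return job


def claim_job(job):
    """Runner side: re-check the state right before opening the job."""
    if job["state"] != "free":
        return False  # canceled (or already taken) in the meantime
    job["state"] = "open"
    return True
```

In a real database the check and the update in claim_job would need to happen atomically, for example in a single transaction.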

http get timeout is shorter than the time datagrepper needs to respond

http get https://apps.fedoraproject.org/datagrepper/raw/ start==1378305614 end==1378824505 category==bodhi rows_per_page==1 page==1 user==mbooth

http: error: SSLError: The read operation timed out

The default timeout in http is shorter than the time datagrepper takes to return data, so the request times out. The same query returns results in a browser, which uses a longer timeout for HTTP requests:

https://apps.fedoraproject.org/datagrepper/raw/?start=1378305614&end=1378824505&category=bodhi&rows_per_page=1&user=mbooth

Possibility to ask for a specific topic

Thinking: we may want to show, on the front page of fedora-packages, the list of the last 10 packages that were added to pkgdb. At the moment we can query datagrepper for the last entries in the pkgdb category and filter by the relevant topic.

Would it make sense to be able to query a specific topic directly? (In this case, pkgdb.package.new.)

Determine way to make advanced queries

The initial plan for the /submit endpoint was to be able to submit advanced queries (such as "get all the git commits that changed a specfile" or "get wiki edits where users edited their own user pages").

Ideally, the user would be able to write a Python function that filtered message content for them. This is a security hell hole.

We could invent a language, but there's no way it could be complete enough. (I tried to use a couple different methods before giving up at Flock 2013.)

datagrepper.models
datagrepper.web
datagrepper.runner

Add a way to retrieve the X last entries regardless of the time period

It might be nice to add an entry to retrieve the last X events in a certain category, regardless of when they happened.
