fedora-infra / datagrepper
HTTP API for datanommer and the fedmsg bus
Home Page: https://apps.fedoraproject.org/datagrepper/
License: GNU General Public License v2.0
https://apps.fedoraproject.org/datagrepper/reference
It's at the bottom of the page under the 'meta' section.
For an API, datagrepper's documentation is fairly unreadable. No table of contents, no good navigation between parts of the page, way too much whitespace on the page (lots and lots of wasted space).
I would like to create something similar to this page: https://stripe.com/docs/api
The docs in staging direct the user to query http://apps.fp.o/datagrepper/ but the user should really hit https://... (unless they specify the --follow option for the HTTPie tool).
The docs in staging have the following error messages sprinkled throughout:
System Message: ERROR/3 (<string>, line 35)
Unknown directive type "code-block".
datagrepper should be almost entirely just a JSON api (with not much HTML other than explaining how to query the api).
There should be a url that, given a start and/or stop time, returns every message in that period.
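A minimal client-side sketch of what such a query could look like, assuming the `start` and `end` parameter names (epoch seconds) that appear in the /raw queries elsewhere on this page. The helper name `raw_url` is hypothetical and not part of datagrepper.

```python
from urllib.parse import urlencode

BASE_URL = "https://apps.fedoraproject.org/datagrepper/raw/"


def raw_url(start=None, end=None, **filters):
    """Build a /raw query URL; `start` and `end` are epoch seconds."""
    params = dict(filters)
    # Compare against None so that a legitimate value of 0 is not dropped.
    if start is not None:
        params["start"] = int(start)
    if end is not None:
        params["end"] = int(end)
    # Sort the parameters so the resulting URL is deterministic.
    query = urlencode(sorted(params.items()))
    return BASE_URL + ("?" + query if query else "")


print(raw_url(start=1378305614, end=1378824505, category="bodhi"))
```

With no arguments the helper returns the bare /raw/ URL, leaving the default time window up to the server.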
A /topics endpoint should be created, which returns all the topics seen in the datanommer database.
This should probably be cached in memcached as well.
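As a rough illustration of the caching idea, here is a sketch that uses an in-process dictionary as a stand-in for memcached. The `TTLCache` class and the `fetch_topics` callable are hypothetical; a real deployment would talk to memcached through a client library instead.

```python
import time


class TTLCache:
    """Tiny time-based cache; a stand-in for memcached in this sketch."""

    def __init__(self, ttl, clock=time.monotonic):
        self.ttl = ttl
        self.clock = clock
        self._store = {}  # key -> (expires_at, value)

    def get_or_compute(self, key, compute):
        now = self.clock()
        entry = self._store.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]  # still fresh, skip the expensive call
        value = compute()
        self._store[key] = (now + self.ttl, value)
        return value


def fetch_topics():
    # Hypothetical placeholder for the expensive database query.
    return ["org.fedoraproject.prod.bodhi.update.comment"]


cache = TTLCache(ttl=300)
topics = cache.get_or_compute("topics", fetch_topics)
print(topics)
```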
There needs to be an /id endpoint to get a message given a fedmsg msg_id.
/id?id=2013-UUID
We rely on this commit: fedora-infra/datanommer@126f6da
At some point, we want to deploy this to our staging infrastructure.
https://fedoraproject.org/wiki/User:Ianweller/statistics_plus_plus/datagrepper_API are the notes from @ianweller and @ralphbean's FUDCon discussion.
I created tickets here for everything in the '/raw' section just now.
Someone still needs to review that and create tickets here for everything in the "queued requests" section.
The idea is that users could submit really complicated queries in some specification language (that we invent?) and that we don't return their results immediately, but queue them to be "crunched" by a background worker over time. Their results will be emailed to them.
This portion of datagrepper (phase 2) is a much more sophisticated analysis suite/engine/thing.
We should have a widget you can put on your website that pulls the latest messages (given basic arguments to /raw) from datagrepper.
Probably because 0 is being cast to False in an if statement somewhere.
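The suspected failure mode can be illustrated with a hypothetical parameter handler. The function names and the default value below are made up for the example and are not taken from datagrepper's code.

```python
DEFAULT_VALUE = 20  # hypothetical default, for illustration only


def pick_buggy(value):
    # Truthiness test: 0 is falsy, so an explicit 0 is silently
    # replaced by the default.
    if value:
        return value
    return DEFAULT_VALUE


def pick_fixed(value):
    # Only fall back to the default when no value was supplied at all.
    if value is not None:
        return value
    return DEFAULT_VALUE


print(pick_buggy(0), pick_fixed(0))
```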
At the moment, the messages from fedmsg are returned from oldest to newest. It would be nice to have the option to invert this.
With instructions for how to set up and run datagrepper in a dev environment on Fedora 18.
$ http get https://apps.fedoraproject.org/datagrepper/raw/ \
delta==172800 category==bodhi user==besser82
http: error: SSLError: The read operation timed out
Being able to get rid of the DATANOMMER_CONFIG environment variable and instead using a configuration system that is better designed would be a good thing.
Since datagrepper has a hard dependency on fedmsg, there's no reason not to use its built-in configuration system, alongside datanommer.
See fedora-infra/fedmsg#140 for the root cause.
In the meantime, it would be nice to stop users from running into this by accident.
Like the other tickets, given a username, all messages relating to that user should be returned (for the last 10 minutes).
If start and stop times are specified, then all messages in that time range relating to that user should be returned.
Just like #3, but for packages.
An HTML page that explains to the user how to use the JSON api.
Results (from every kind of query) should be paginated JSON if possible.
This is low-priority for now, but will be necessary later for people to really start using this.
We can't have someone make a request for an entire dump of every message ever over and over and over again. It would crash our httpd server. So... return results ~100 rows at a time (with some standard way of how to get the next 100 rows).
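One possible shape for this, sketched over an in-memory list. The `page` and `rows_per_page` names match the parameters used in a /raw query elsewhere on this page; the other field names in the returned dictionary are assumptions for illustration.

```python
import math


def paginate(rows, page=1, rows_per_page=100):
    """Return one page of `rows` plus enough metadata to fetch the next."""
    if page < 1 or rows_per_page < 1:
        raise ValueError("page and rows_per_page must be >= 1")
    total = len(rows)
    start = (page - 1) * rows_per_page
    return {
        "total": total,                             # hypothetical field
        "pages": math.ceil(total / rows_per_page),  # hypothetical field
        "page": page,
        "rows_per_page": rows_per_page,
        "messages": rows[start:start + rows_per_page],
    }


result = paginate(list(range(250)), page=3)
print(result["pages"], len(result["messages"]))
```

A client would keep requesting page + 1 until page equals pages. A real implementation would push the offset and limit into the database query rather than slicing a full list.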
Once #12 is done, we'll probably want to deploy this to production.
Replace Flask-SQLAlchemy with plain ol' SQLAlchemy.
Blocks #56.
We could enhance #8 by adding API keys. Users could log in with their FAS account and request a key. If their requests to the JSON API bear their API key, then maybe the restrictions on their rate limits could be loosened (but not abolished).
If some user is being rude, we could revoke their api key.
This is super low priority, but would really be pro.
It would be useful if we could rate limit clients who are querying datagrepper.
Most public APIs have this. If you make more than X requests in Y amount of time, then your IP is blocked for Z amount of time.
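A minimal in-memory sketch of the X/Y/Z scheme described above. The `RateLimiter` class and its parameter names are hypothetical; a real deployment would more likely keep the counters in shared storage or enforce the limit at the web server level.

```python
import time
from collections import defaultdict, deque


class RateLimiter:
    """Block an IP for `block_for` seconds once it exceeds
    `max_requests` requests within `window` seconds."""

    def __init__(self, max_requests, window, block_for, clock=time.monotonic):
        self.max_requests = max_requests  # X
        self.window = window              # Y, in seconds
        self.block_for = block_for        # Z, in seconds
        self.clock = clock
        self._hits = defaultdict(deque)   # ip -> timestamps of recent requests
        self._blocked_until = {}          # ip -> time the block expires

    def allow(self, ip):
        now = self.clock()
        if self._blocked_until.get(ip, float("-inf")) > now:
            return False
        hits = self._hits[ip]
        # Forget requests that have fallen out of the window.
        while hits and hits[0] <= now - self.window:
            hits.popleft()
        hits.append(now)
        if len(hits) > self.max_requests:
            self._blocked_until[ip] = now + self.block_for
            hits.clear()
            return False
        return True


limiter = RateLimiter(max_requests=2, window=10, block_for=60)
print([limiter.allow("192.0.2.1") for _ in range(3)])
```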
Upon release, we will need to make these updates to requirements.txt:
fedmsg>=0.7.0 latest version (#46, fedora-infra/fedmsg#140)
datanommer.models>= latest version (#52, fedora-infra/datanommer#60)
See https://apps.fedoraproject.org/datagrepper/reference
The very last section looks like:
System Message: ERROR/3 (<string>, line 168)
Unexpected indentation.
packages, objects
Default: None
This will be the daemon/cronjob/not-a-webapp that runs the actual jobs submitted with /submit.
Running into an issue now on datagrepper01.stg where this traceback is happening:
Traceback (most recent call last):
  File "/usr/lib64/python2.6/site-packages/twisted/python/context.py", line 59, in callWithContext
    return self.currentContext().callWithContext(ctx, func, *args, **kw)
  File "/usr/lib64/python2.6/site-packages/twisted/python/context.py", line 37, in callWithContext
    return func(*args,**kw)
  File "/usr/lib64/python2.6/site-packages/twisted/internet/selectreactor.py", line 146, in _doReadOrWrite
    why = getattr(selectable, method)()
  File "/usr/lib/python2.6/site-packages/txzmq/connection.py", line 241, in doRead
    log.callWithLogger(self, self.messageReceived, message)
--- <exception caught here> ---
  File "/usr/lib64/python2.6/site-packages/twisted/python/log.py", line 84, in callWithLogger
    return callWithContext({"system": lp}, func, *args, **kw)
  File "/usr/lib64/python2.6/site-packages/twisted/python/log.py", line 69, in callWithContext
    return context.call({ILogContext: newCtx}, func, *args, **kw)
  File "/usr/lib64/python2.6/site-packages/twisted/python/context.py", line 59, in callWithContext
    return self.currentContext().callWithContext(ctx, func, *args, **kw)
  File "/usr/lib64/python2.6/site-packages/twisted/python/context.py", line 37, in callWithContext
    return func(*args,**kw)
  File "/usr/lib/python2.6/site-packages/txzmq/pubsub.py", line 60, in messageReceived
    self.gotMessage(message[1], message[0])
  File "/usr/lib/python2.6/site-packages/moksha/hub/zeromq/zeromq.py", line 168, in chain_over_moksha_callbacks
    f(_body, _topic)
  File "/usr/lib/python2.6/site-packages/moksha/hub/zeromq/zeromq.py", line 195, in intercept
    return callback(ZMQMessage(_topic, _body))
  File "/usr/lib/python2.6/site-packages/moksha/hub/api/consumer.py", line 107, in _consume_json
    self._consume(message_as_dict)
  File "/usr/lib/python2.6/site-packages/fedmsg/consumers/__init__.py", line 140, in _consume
    self.consume(message)
  File "/usr/lib/python2.6/site-packages/datagrepper/runner.py", line 42, in consume
    'datagrepper_{0}'.format(job.id))
  File "/usr/lib/python2.6/site-packages/moksha/hub/api/consumer.py", line 90, in _consume_json
    topic = message.headers[0].routing_key
exceptions.AttributeError: 'ZMQMessage' object has no attribute 'headers'
As far as I can tell, the AttributeError is occurring in another thread, but it's being caught by this runner. :(
If a job fails because of an exception, the exception is lost.
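One way to keep the exception around is to store the formatted traceback on the job when it fails. The sketch below uses a hypothetical in-memory `Job` class and a hypothetical "done" state; the real job model and its columns may look quite different.

```python
import traceback


class Job:
    """Hypothetical stand-in for a submitted job record."""

    def __init__(self, func):
        self.func = func
        self.state = "open"
        self.result = None
        self.error = None  # will hold the traceback text on failure


def run_job(job):
    try:
        job.result = job.func()
        job.state = "done"
    except Exception:
        # Keep the full traceback instead of discarding it.
        job.state = "failed"
        job.error = traceback.format_exc()
    return job


failed = run_job(Job(lambda: 1 / 0))
print(failed.state)
print(failed.error)
```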
Cancel a submitted job. /cancel?id=
Checks if you are logged in as the same user who submitted the job. Can only cancel a job if it is free (open jobs are already being processed). Sets job state to "canceled".
Job runner should check before setting each job to open whether or not it has been canceled.
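The rules above could be sketched roughly as follows. The state names "free", "open" and "canceled" come from the description; the `QueuedJob` class and the `cancel` and `claim` helpers are hypothetical, and a real implementation would need to make the check-and-update step atomic in the database.

```python
class QueuedJob:
    """Hypothetical stand-in for a submitted job record."""

    def __init__(self, owner):
        self.owner = owner
        self.state = "free"


def cancel(job, user):
    """Cancel `job` on behalf of `user`; return True if it was canceled."""
    if user != job.owner:
        raise PermissionError("only the submitter may cancel a job")
    if job.state != "free":
        return False  # open jobs are already being processed
    job.state = "canceled"
    return True


def claim(job):
    """Runner side: re-check the state right before opening the job."""
    if job.state != "free":
        return False  # canceled (or already taken) in the meantime
    job.state = "open"
    return True


job = QueuedJob(owner="alice")
print(cancel(job, "alice"), claim(job))
```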
http get https://apps.fedoraproject.org/datagrepper/raw/ start==1378305614 end==1378824505 category==bodhi rows_per_page==1 page==1 user==mbooth
http: error: SSLError: The read operation timed out
HTTPie's default timeout is shorter than the time datagrepper takes to return the data, so the request times out. The same query returns results in a browser (which uses a longer timeout for HTTP requests).
Thinking: we may want to have in the front page of fedora-packages the list of the last 10 packages that were added into pkgdb. At the moment we can query datagrepper for the last entries in the pkgdb category and filter by relevant topic.
Would it make sense to be able to query a specific topic directly? (in this case pkgdb.package.new)
The initial plan for the /submit endpoint was to be able to submit advanced queries (such as "get all the git commits that changed a specfile" or "get wiki edits where users edited their own user pages").
Ideally, the user would be able to write a Python function that filtered message content for them. This is a security hell hole.
We could invent a language, but there's no way it could be complete enough. (I tried to use a couple different methods before giving up at Flock 2013.)
datagrepper should be almost entirely just a JSON api (with not much HTML other than explaining how to query the api).
There should be a url that, given no arguments, returns every message from the last 10 minutes
Final ticket of the 0.1 milestone!
After a discussion with @ralphbean we decided we should split up datagrepper into subpackages, similar to datanommer, so that the job runner can run on a different server than the web server.
Packages will be split as follows:
It might be nice to add an entry to retrieve the last X events in a certain category, regardless of when they happened.
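In rough terms, the request amounts to "filter by category, sort newest first, take X". The sketch below works on an in-memory list of dictionaries with hypothetical `category` and `timestamp` keys; a real endpoint would express the same thing as an ordered, limited database query.

```python
def latest_in_category(messages, category, count):
    """Return the `count` newest messages in `category`, newest first."""
    matching = [m for m in messages if m["category"] == category]
    matching.sort(key=lambda m: m["timestamp"], reverse=True)
    return matching[:count]


messages = [
    {"category": "pkgdb", "timestamp": 10},
    {"category": "bodhi", "timestamp": 20},
    {"category": "pkgdb", "timestamp": 30},
    {"category": "pkgdb", "timestamp": 5},
]
print(latest_in_category(messages, "pkgdb", 2))
```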