calmrat / metrique
Python ETL and Data Warehouse
License: GNU General Public License v3.0
Wrap pymongo's cursor.explain()
a.k.a. hash(str(...)), like what is being used in the gitrepo cube.
This is a weak hash because, in some instances, the hashes can differ for two objects even when the content is exactly the same, just not ordered the same. But it is, I guess, a fast hash; faster, anyway, than recursively taking frozensets of lists and dict.items().
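A rough sketch of the problem, plus one order-insensitive alternative (the stable_hash name is hypothetical, not the project's code); json.dumps with sort_keys=True gives a canonical serialization, so equal content always hashes equally::

    import json

    a = {'x': 1, 'y': 2}
    b = {'y': 2, 'x': 1}  # same content, different ordering

    # hash(str(a)) and hash(str(b)) are not guaranteed to match, since
    # str() renders dicts in whatever order the keys happen to sit in.

    def stable_hash(obj):
        # canonical serialization: key order no longer matters
        return hash(json.dumps(obj, sort_keys=True))

    assert stable_hash(a) == stable_hash(b)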
run with the -3 argument to warn about Python 3 incompatibilities
use __future__ imports
One method for wrapping pymongo's index method. Let the db (valid values: ['warehouse', 'timeline']) be passed as an argument.
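A minimal sketch of what that wrapper might look like (function and argument names are assumptions, not the project's API; ensure_index is the pymongo 2.x call)::

    VALID_DBS = ('warehouse', 'timeline')

    def index(mongo_client, db, cube, key_or_list, **kwargs):
        # validate the target db before touching pymongo
        if db not in VALID_DBS:
            raise ValueError('db must be one of %s, got %r' % (VALID_DBS, db))
        # delegate to pymongo's index creation on the cube's collection
        return mongo_client[db][cube].ensure_index(key_or_list, **kwargs)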
Permit the client to push an mtime along with the save_objects call; that mtime would be applied server side, when saving into etl_activity for the cube.field, rather than the server calling utcnow() upon starting the operation. This reduces the risk of hitting the race condition where something is updated between when the client starts saving and when the server generates the utcnow mtime; the next client delta, using the server-generated mtime, might miss changes that happened in that gap.
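A hedged sketch of the client side of that idea (the signature and payload layout are illustrative only)::

    from datetime import datetime

    def save_objects(objects, cube, mtime=None):
        # stamp mtime before any objects leave the client, so a delta
        # run against this mtime cannot miss changes made mid-save
        mtime = mtime or datetime.utcnow()
        payload = {'cube': cube, 'objects': objects,
                   'mtime': mtime.isoformat()}
        # ... POST payload; the server writes payload['mtime'] into
        # etl_activity instead of calling utcnow() itself ...
        return payload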
In the past, when we were still on py2.6, you found a way to generate documentation from python source code doc strings. Could you make this work again?
wrap pymongo .remove()
@jniznan describe what you're working on :)
It always extracts at least one bug (even if it shouldn't). I guess we are comparing the delta_ts with '>=' instead of '>'.
Add 'dependency_links' to setup.py, to auto-install dependencies.
See this link for more info.
On save_objects(), index unique field token values per cube, with a map to _ids,
such that as metrique receives an object like:
{
_id: 23523599
tweet: '#wow #metrique',
user: 'cward',
...
}
push fields and values to workers and get back a set of unique (token, _id) tuples. Then all those sets would be merged into (token, [_id, ...]); see the sketch below.
@jniznan @wejnik thoughts?
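Something like this, perhaps (the tokenization here is deliberately naive)::

    from collections import defaultdict

    def tokenize(obj, fields=('tweet', 'user')):
        # yield (token, _id) for every whitespace token in indexed fields
        for field in fields:
            for token in str(obj.get(field, '')).split():
                yield (token, obj['_id'])

    def merge(tuple_sets):
        # fold many per-worker result sets into token -> [_id, ...]
        index = defaultdict(set)
        for pairs in tuple_sets:
            for token, _id in pairs:
                index[token].add(_id)
        return dict((token, sorted(ids)) for token, ids in index.items())

    obj = {'_id': 23523599, 'tweet': '#wow #metrique', 'user': 'cward'}
    print(merge([set(tokenize(obj))]))
    # {'#wow': [23523599], '#metrique': [23523599], 'cward': [23523599]}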
Lower than 'r'
For actions like ping
it would be nice to simply be able to run
get_cube('bz_bug') and we'd get back the bz.bug.Bug cube class
If the first attempt to import fails, get_cube could pop off the part of the string before the first '_' underscore; this would be the 'package'.
The remaining characters would be the 'module'.
The class name would then be formed by taking the module, removing underscores, and capitalizing each underscore-separated word.
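A sketch of that fallback (importlib details assumed)::

    import importlib

    def get_cube(name):
        try:
            # first attempt: treat the whole name as an importable module
            return importlib.import_module(name)
        except ImportError:
            # 'bz_bug' -> package 'bz', module 'bug'
            pkg, _, mod = name.partition('_')
            module = importlib.import_module('%s.%s' % (pkg, mod))
            # 'bug' -> 'Bug'; 'bug_comment' -> 'BugComment'
            cls = ''.join(w.capitalize() for w in mod.split('_'))
            return getattr(module, cls)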
current readme talks of pyclient... what's that? ;)
In the server.tornado.http webapp setup, we set debug=True (in tornado) if metrique_config.debug is not false... but we've moved to a new debug-level scheme, and tornado debug should be turned on only if metrique_config.debug == 2.
the config file shouldn't be changing state unless .save() is explicitly called.
We need stability. We need tests.
I'm pretty sure we have the core functionality in 0.1.3, server side, that we need to accomplish our current goals. Performance is excellent too.
I wanted to tag the code yesterday with the 0.1.3 version release, but time ran out. I'll do it by Monday, hopefully.
We'll focus on bug fixing and client usability issues now. And docs.
API affecting changes should be avoided.
Branch 0.1.4 will be for MEPs and refactors
@jniznan @wejnik
What if we hash every object and save its unique field:value hash to _hash? The username which saved it would be saved in every object in admin: [user, ...]. Then the way someone gets access to an object is by first running an extract(force=True); all objects the user is able to push to save_objects will be saved if no such hash already exists, or tagged with that username if an object with the same hash already exists. A sketch follows below.
The users who can create the same data are assumed to have access to the same data used to create the objects... no?
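Roughly (the admin/_hash field names and the pymongo 2.x calls here are illustrative)::

    import json

    def obj_hash(obj):
        # hash only user-visible field:value content, not _-prefixed meta
        fields = dict((k, v) for k, v in obj.items()
                      if not k.startswith('_'))
        return hash(json.dumps(fields, sort_keys=True))

    def save_object(collection, obj, username):
        h = obj_hash(obj)
        if collection.find_one({'_hash': h}):
            # identical content already saved: just tag this user onto it
            collection.update({'_hash': h},
                              {'$addToSet': {'admin': username}})
        else:
            obj.update({'_hash': h, 'admin': [username]})
            collection.insert(obj)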
/ping
/USER/[cubes]
/USER/CUBE/[fields]
/USER/CUBE/query/[find, aggregate, fetch, ...]
/USER/CUBE/[saveobjects, removeobjects, drop ...]
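A rough tornado routing sketch of those endpoints (handler classes here are placeholders, not the project's code)::

    import tornado.web

    class PingHandler(tornado.web.RequestHandler):
        def get(self):
            self.write('pong')

    class StubHandler(tornado.web.RequestHandler):
        def get(self, *parts):
            self.write({'path': parts})  # placeholder response

    app = tornado.web.Application([
        (r'/ping', PingHandler),
        (r'/([^/]+)', StubHandler),                         # [cubes]
        (r'/([^/]+)/([^/]+)', StubHandler),                 # [fields]
        (r'/([^/]+)/([^/]+)/query/([^/]+)', StubHandler),   # find, aggregate, fetch
        (r'/([^/]+)/([^/]+)/([^/]+)', StubHandler),         # saveobjects, removeobjects, drop
    ])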
In case we don't want to get all X thousand results back at once, like fetch() does:
.find() from metrique.client with iterator functionality (one at a time, on-demand data loading from the server)? Async?
cube.find('...', fields='x,y', date='~')
returns all versions of objects matching the given query.
There might be three versions of a particular object because field z changed. But since we're only interested in fields x and y, and they're the same between all three versions where the z field changed, we can merge/reconcile those into a single object version and update _start/_end accordingly.
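A sketch of that merge (the _start/_end object layout is assumed from the surrounding notes)::

    def merge_versions(versions, fields):
        # versions: the object's history, each dict carrying _start/_end
        merged = []
        for v in sorted(versions, key=lambda x: x['_start']):
            if merged and all(merged[-1][f] == v.get(f) for f in fields):
                # identical on every requested field: extend the span
                merged[-1]['_end'] = v['_end']
            else:
                nxt = dict((f, v.get(f)) for f in fields)
                nxt['_start'], nxt['_end'] = v['_start'], v['_end']
                merged.append(nxt)
        return merged

    # three versions where only z changed collapse into one x,y span:
    vs = [{'x': 1, 'y': 2, 'z': i, '_start': i, '_end': i + 1}
          for i in range(3)]
    print(merge_versions(vs, ['x', 'y']))
    # [{'x': 1, 'y': 2, '_start': 0, '_end': 3}]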
A pandas DataFrame can be saved to HDF5. Extend the pandas DataFrame class to support saving a df as dicts and lists, with cube.save_objects(df), which would dump the df (objects with a unique _id?) of results into a metrique cube for future, historical, or other analysis.
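The df-to-objects step could be as small as this (modern pandas to_dict semantics assumed; the cube call is the proposal above, not an existing API)::

    import pandas as pd

    df = pd.DataFrame({'_id': [1, 2], 'severity': ['high', 'low']})
    objects = df.to_dict(orient='records')
    # [{'_id': 1, 'severity': 'high'}, {'_id': 2, 'severity': 'low'}]
    # cube.save_objects(objects)  # the proposed call, per above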
Over time, there WILL be duplicate and overlapping data being dumped. Production dbs can't be 'refreshed' :) so we need a safe way to minimize duplicate/wasted data objects. One option, perhaps, is to create a cleanup function::
    def cleanup(cube, dt_resolution, collapse=False):
        '''
        Walk through all commits with a date range separated by a time
        difference equal to or less than the dt_resolution argument and
        merge the objects.
        All fields which are the same will be merged; differing fields
        will end up with the value of the most recent version. If
        collapse is False, the objects will be reduced only in so far as
        no more than 2 object states merge into one; if more than two
        objects would be merged, given the dt_resolution used, then
        reduce the maximum possible, but expect a list of objects in
        return. If collapse is True, always return only ONE new object,
        which is, essentially, the most recent object of the set.
        '''
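A rough implementation sketch under those rules, operating on the object list directly rather than the cube (dt_resolution as a timedelta; the _merged flag is a device of this sketch to cap merges at two states)::

    def cleanup(objects, dt_resolution, collapse=False):
        objects = sorted(objects, key=lambda o: o['_start'])
        if collapse:
            return [objects[-1]]  # only the most recent state survives
        merged = [dict(objects[0])]
        for obj in objects[1:]:
            prev = merged[-1]
            close = (obj['_start'] - prev['_end']) <= dt_resolution
            if close and not prev.get('_merged'):
                # merge at most two adjacent states into one;
                # the most recent values win on differing fields
                nxt = dict(prev)
                nxt.update(obj)
                nxt['_start'] = prev['_start']
                nxt['_merged'] = True
                merged[-1] = nxt
            else:
                merged.append(dict(obj))
        return merged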
use concurrent.futures.ProcessPoolExecutor
The process to check whether a new object should be created or not is pretty CPU intensive.
Move the running of .snapshot (on the server) into a separate process (i.e., it will not be blocked by the Python GIL).
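For example (the comparison function is a stand-in for the real check; worker functions must live at module top level, and the entry point needs an if __name__ == '__main__' guard when run as a script)::

    from concurrent.futures import ProcessPoolExecutor

    def needs_new_version(pair):
        old, new = pair
        return old != new  # stand-in for the real, expensive comparison

    def snapshot(pairs, workers=4):
        # fan the CPU-bound checks out to worker processes (no GIL contention)
        with ProcessPoolExecutor(max_workers=workers) as pool:
            return list(pool.map(needs_new_version, pairs))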
@jniznan and I battled with json today. We brought down the wall-clock time for handling large data sets, but it's still slow.
If we avoid encode and decode altogether, the process which took 30s... takes 1s using ultrajson (ujson) in place of simplejson. But ujson has no encoder method.
It seems we might need to make a new API standard: all data in and out of the server should require zero json encode or decode. Clients are responsible for correctly serializing all data out and in.
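A quick way to sanity-check those numbers on any machine (assumes both libraries are installed; timings will vary)::

    import time
    import simplejson
    import ujson

    data = [{'i': i, 's': 'x' * 50} for i in range(100000)]

    for mod in (simplejson, ujson):
        t0 = time.time()
        mod.loads(mod.dumps(data))
        print(mod.__name__, round(time.time() - t0, 2), 's')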
aggregate statistics of incoming data objects.
eg, how many times each cube.field is manipulated
In ipython, for example, b.find('product == "Metrique"') raises an exception.
My guess is it's related to the new result subclass implementation.
Int64Index([], dtype=int64) Empty BugResult
TypeError                                 Traceback (most recent call last)
in <module>()
----> 1 b.find('product == "Metrique"')
/usr/lib/python2.7/site-packages/IPython/core/displayhook.pyc in __call__(self, result)
    236 self.start_displayhook()
    237 self.write_output_prompt()
--> 238 format_dict = self.compute_format_data(result)
    239 self.write_format_data(format_dict)
    240 self.update_user_ns(result)
/usr/lib/python2.7/site-packages/IPython/core/displayhook.pyc in compute_format_data(self, result)
    148 MIME type representation of the object.
    149 """
--> 150 return self.shell.display_formatter.format(result)
    151
    152 def write_format_data(self, format_dict):
/usr/lib/python2.7/site-packages/IPython/core/formatters.pyc in format(self, obj, include, exclude)
    124 continue
    125 try:
--> 126 data = formatter(obj)
    127 except:
    128 # FIXME: log the exception
/usr/lib/python2.7/site-packages/IPython/core/formatters.pyc in __call__(self, obj)
    445 type_pprinters=self.type_printers,
    446 deferred_pprinters=self.deferred_printers)
--> 447 printer.pretty(obj)
    448 printer.flush()
    449 return stream.getvalue()
/usr/lib/python2.7/site-packages/IPython/lib/pretty.pyc in pretty(self, obj)
    358 if callable(meth):
    359 return meth(obj, self, cycle)
--> 360 return _default_pprint(obj, self, cycle)
    361 finally:
    362 self.end_group()
/usr/lib/python2.7/site-packages/IPython/lib/pretty.pyc in _default_pprint(obj, p, cycle)
    478 if getattr(klass, '__repr__', None) not in _baseclass_reprs:
    479 # A user-provided repr.
--> 480 p.text(repr(obj))
    481 return
    482 p.begin_group(1, '<')
/usr/lib/python2.7/site-packages/pandas-0.11.0-py2.7-linux-x86_64.egg/pandas/core/frame.pyc in __repr__(self)
    720 Yields Bytestring in Py2, Unicode String in py3.
    721 """
--> 722 return str(self)
    723
    724 def _repr_html_(self):
/usr/lib/python2.7/site-packages/pandas-0.11.0-py2.7-linux-x86_64.egg/pandas/core/frame.pyc in __str__(self)
    664 if py3compat.PY3:
    665 return self.__unicode__()
--> 666 return self.__bytes__()
    667
    668 def __bytes__(self):
/usr/lib/python2.7/site-packages/pandas-0.11.0-py2.7-linux-x86_64.egg/pandas/core/frame.pyc in __bytes__(self)
    674 """
    675 encoding = com.get_option("display.encoding")
--> 676 return self.__unicode__().encode(encoding, 'replace')
    677
    678 def __unicode__(self):
/usr/lib/python2.7/site-packages/pandas-0.11.0-py2.7-linux-x86_64.egg/pandas/core/frame.pyc in __unicode__(self)
    707 verbose = (max_info_rows is None or
    708 self.shape[0] <= max_info_rows)
--> 709 self.info(buf=buf, verbose=verbose)
    710
    711 value = buf.getvalue()
/usr/lib/python2.7/site-packages/pandas-0.11.0-py2.7-linux-x86_64.egg/pandas/core/frame.pyc in info(self, verbose, buf, max_cols)
   1609 space = max([len(com.pprint_thing(k)) for k in self.columns]) + 4
   1610 counts = self.count()
-> 1611 if len(cols) != len(counts):
   1612 raise AssertionError('Columns must equal counts')
   1613 for col, count in counts.iteritems():
TypeError: object of type 'float' has no len()
save_objects is a CPU hog too...
Generate a text journal (activity log) of all objects saved/updated/removed from a collection (_id, when, what (the new object state), who), before actually saving.
It can be used as a 'backup'; or, if a save_objects goes bad (e.g., after a future update), we can 're-play' the transactions.
use logger?
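Or an append-only JSON-lines journal, written before the real save (file layout and field names are illustrative only)::

    import json
    from datetime import datetime

    def journal(path, action, obj, who):
        entry = {'_id': obj.get('_id'),
                 'when': datetime.utcnow().isoformat(),
                 'what': obj,       # the new object state
                 'who': who,
                 'action': action}  # saved / updated / removed
        with open(path, 'a') as f:
            f.write(json.dumps(entry, default=str) + '\n')

    # journal('bz_bug.journal', 'saved', {'_id': 1, 'status': 'NEW'}, 'cward')
    # re-playing is then just reading the file back line by line
    # and re-applying each entry in order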
Both client and server depend on tools
Split out client and server requirements
Register at pypi
.count() method should accept a date argument, just like find
see: client_api.count()
38 #### COMING SOON - 0.1.4 ####
39 :param String date: Date (date range) that should be queried:
...
Ignore the 0.1.4 remark; let's try to get this into 0.1.3 (I'll try to release tomorrow).
Wrap basic auth with kerberos auth handler for tornado 3.0
http://liftoff.github.io/GateOne/Developer/sso.html
I don't have a keytab yet to play with, so we'll only test this with 'basic' kerberos auth and hope the key auth works like magic, as it should if the manual basic works.
Output all pypi requirements/dependencies into a txt file called requirements.txt; then installation can be done with pip install -r requirements.txt.
https://github.com/drpoovilleorg/metrique/blob/master/metrique/client/cubes/gitrepo/commit.py#L136
'stats' is one property of a git commit that GitPython offers us. It shows +/- line counts per commit. But it's slow to extract. Without it enabled as a field in gitrepo.commit, we can process 900+ commits a second (which I think is also slow...); with it enabled, 30.
We need to make it faster; one idea is sketched below.
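One possibility (an assumption, not the project's current code): gather all stats in a single git log --numstat pass instead of asking GitPython for .stats commit by commit::

    import subprocess
    from collections import defaultdict

    def all_stats(repo_path):
        out = subprocess.check_output(
            ['git', 'log', '--numstat', '--format=@%H'], cwd=repo_path)
        stats, sha = defaultdict(lambda: [0, 0]), None
        for line in out.decode('utf-8', 'replace').splitlines():
            if line.startswith('@'):
                sha = line[1:]          # new commit
            elif line.strip():
                added, removed = line.split('\t')[:2]
                if added != '-':        # binary files report '-'
                    stats[sha][0] += int(added)
                    stats[sha][1] += int(removed)
        return stats  # sha -> [lines_added, lines_removed]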
show the basics of extracting, querying, plotting
for example, git.extract(get_dependency_git_uris('metrique'))
Then show how many LOC; distinct contributors; plot contributions per person over time; etc
pyclient().user_passwd('new', 'user', 'old' (optional, if admin))
pyclient should autoset self.user to current user and use it if not passed in as arg
plt.axvline(dt, color='k')
Currently, the above draws a vertical line on a timeseries (x) plot, where dt == a date in the timeseries (x).
What about wrapping this in a function that also accepts a 'label', which would be printed at the top of the plot on a second x axis...?
@wejnik How does this overlap with the work you've done on mapping dates to a plot with timeseries (x)?
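A possible wrapper (the function name is hypothetical; the label rides on a twin x axis along the top)::

    import matplotlib.pyplot as plt
    import matplotlib.dates as mdates

    def axvline_labeled(ax, dt, label, color='k'):
        ax.axvline(dt, color=color)
        top = ax.twiny()                  # second x axis, on top
        top.set_xlim(ax.get_xlim())       # keep both axes aligned
        top.set_xticks([mdates.date2num(dt)])
        top.set_xticklabels([label], rotation=45)
        return top

    # fig, ax = plt.subplots()
    # ax.plot(dates, values)
    # axvline_labeled(ax, release_dt, 'release 0.1.3')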
Clients can set one underscore in any field name, but not a dunder (double underscore).
Dunders are reserved for properties set by the metrique server.
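That rule reads like a simple validation on incoming field names (a sketch; the leading-dunder interpretation is an assumption)::

    def validate_field(name):
        if name.startswith('__'):
            raise ValueError(
                '%r: dunder fields are reserved for the server' % name)
        return name

    validate_field('_my_field')    # fine: single leading underscore
    # validate_field('__hash')     # raises ValueError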