calmrat / metrique
Python ETL and Data Warehouse
License: GNU General Public License v3.0
Wrap pymongo's cursor.explain()
a.k.a. hash(str(...)), like what is being used in the gitrepo cube.
This is a weak hash because, in some instances, the hashes can differ for two objects even when the content is exactly the same, just not ordered the same. But it is, I guess, a fast hash; faster, anyway, than recursively taking frozensets of lists and dict.items().
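A rough sketch of the problem, plus one order-insensitive alternative (the stable_hash name is hypothetical, not the project's code); json.dumps with sort_keys=True gives a canonical serialization, so equal content always hashes equally::

    import json

    a = {'x': 1, 'y': 2}
    b = {'y': 2, 'x': 1}  # same content, different ordering

    # hash(str(a)) and hash(str(b)) are not guaranteed to match, since
    # str() renders dicts in whatever order the keys happen to sit in.

    def stable_hash(obj):
        # canonical serialization: key order no longer matters
        return hash(json.dumps(obj, sort_keys=True))

    assert stable_hash(a) == stable_hash(b)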
run with the -3 argument to warn about Python 3 incompatibilities
use __future__ imports
One method for wrapping pymongo's index method. Let the db (valid values: ['warehouse', 'timeline']) be passed as an argument.
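A minimal sketch of what that wrapper might look like (function and argument names are assumptions, not the project's API; ensure_index is the pymongo 2.x call)::

    VALID_DBS = ('warehouse', 'timeline')

    def index(mongo_client, db, cube, key_or_list, **kwargs):
        # validate the target db before touching pymongo
        if db not in VALID_DBS:
            raise ValueError('db must be one of %s, got %r' % (VALID_DBS, db))
        # delegate to pymongo's index creation on the cube's collection
        return mongo_client[db][cube].ensure_index(key_or_list, **kwargs)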
Permit the client to push an mtime along with the save_objects call; that mtime would be applied server side, when saving into etl_activity for the cube.field, rather than the server calling utcnow() upon starting the operation. This reduces the risk of hitting the race condition where something is updated between when the client starts saving and when the server generates the utcnow mtime; the next client delta, using the server-generated mtime, might miss changes that happened in that gap.
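A hedged sketch of the client side of that idea (the signature and payload layout are illustrative only)::

    from datetime import datetime

    def save_objects(objects, cube, mtime=None):
        # stamp mtime before any objects leave the client, so a delta
        # run against this mtime cannot miss changes made mid-save
        mtime = mtime or datetime.utcnow()
        payload = {'cube': cube, 'objects': objects,
                   'mtime': mtime.isoformat()}
        # ... POST payload; the server writes payload['mtime'] into
        # etl_activity instead of calling utcnow() itself ...
        return payload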
In the past, when we were still on py2.6, you found a way to generate documentation from python source code doc strings. Could you make this work again?
wrap pymongo .remove()
@jniznan describe what you're working on :)
It always extracts at least one bug (even if it shouldn't). I guess we are comparing the delta_ts with '>=' instead of '>'.
Add 'dependency_links' to setup.py, to auto-install dependencies.
See this link for more info.
On save_objects(), index unique field token values per cube, with a map to _ids,
such that as metrique receives an object like:
{
_id: 23523599
tweet: '#wow #metrique',
user: 'cward',
...
}
push fields and values to workers and get back a set of unique (token, _id) tuples. Then all those sets would be merged into (token, [_id, ...]); see the sketch below.
@jniznan @wejnik thoughts?
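Something like this, perhaps (the tokenization here is deliberately naive)::

    from collections import defaultdict

    def tokenize(obj, fields=('tweet', 'user')):
        # yield (token, _id) for every whitespace token in indexed fields
        for field in fields:
            for token in str(obj.get(field, '')).split():
                yield (token, obj['_id'])

    def merge(tuple_sets):
        # fold many per-worker result sets into token -> [_id, ...]
        index = defaultdict(set)
        for pairs in tuple_sets:
            for token, _id in pairs:
                index[token].add(_id)
        return dict((token, sorted(ids)) for token, ids in index.items())

    obj = {'_id': 23523599, 'tweet': '#wow #metrique', 'user': 'cward'}
    print(merge([set(tokenize(obj))]))
    # {'#wow': [23523599], '#metrique': [23523599], 'cward': [23523599]}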
Lower than 'r'
For actions like ping
it would be nice to simply be able to run
get_cube('bz_bug') and we'd get back the bz.bug.Bug cube class
If the first attempt to import fails, get_cube could pop off the part of the string before the first '_' underscore; this would be the 'package'.
The remaining characters would be the 'module'.
The class name would then be formed by taking the module, removing underscores, and capitalizing each underscore-separated word.
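A sketch of that fallback (importlib details assumed)::

    import importlib

    def get_cube(name):
        try:
            # first attempt: treat the whole name as an importable module
            return importlib.import_module(name)
        except ImportError:
            # 'bz_bug' -> package 'bz', module 'bug'
            pkg, _, mod = name.partition('_')
            module = importlib.import_module('%s.%s' % (pkg, mod))
            # 'bug' -> 'Bug'; 'bug_comment' -> 'BugComment'
            cls = ''.join(w.capitalize() for w in mod.split('_'))
            return getattr(module, cls)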
current readme talks of pyclient... what's that? ;)
In the server.tornado.http webapp setup, we set debug=True (in tornado) if metrique_config.debug is not false... but we've moved to a new debug-level scheme, and tornado debug should be turned on only if metrique_config.debug == 2.
the config file shouldn't be changing state unless .save() is explicitly called.
We need stability. We need tests.
I'm pretty sure we have the core functionality in 0.1.3, server side, that we need to accomplish our current goals. Performance is excellent too.
I wanted to tag the code yesterday with the 0.1.3 version release, but time ran out. I'll do it by Monday, hopefully.
We'll focus on bug fixing and client usability issues now. And docs.
API affecting changes should be avoided.
Branch 0.1.4 will be for MEPs and refactors
@jniznan @wejnik
What if we hash every object and save its unique field:value hash to _hash? The username which saved it would be saved in every object in admin: [user, ...]. Then the way someone gets access to an object is by first running an extract(force=True); all objects the user is able to push to save_objects will be saved if no such hash already exists, or tagged with that username if an object with the same hash already exists. A sketch follows below.
The users who can create the same data are assumed to have access to the same data used to create the objects... no?
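Roughly (the admin/_hash field names and the pymongo 2.x calls here are illustrative)::

    import json

    def obj_hash(obj):
        # hash only user-visible field:value content, not _-prefixed meta
        fields = dict((k, v) for k, v in obj.items()
                      if not k.startswith('_'))
        return hash(json.dumps(fields, sort_keys=True))

    def save_object(collection, obj, username):
        h = obj_hash(obj)
        if collection.find_one({'_hash': h}):
            # identical content already saved: just tag this user onto it
            collection.update({'_hash': h},
                              {'$addToSet': {'admin': username}})
        else:
            obj.update({'_hash': h, 'admin': [username]})
            collection.insert(obj)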
/ping
/USER/[cubes]
/USER/CUBE/[fields]
/USER/CUBE/query/[find, aggregate, fetch, ...]
/USER/CUBE/[saveobjects, removeobjects, drop ...]
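A rough tornado routing sketch of those endpoints (handler classes here are placeholders, not the project's code)::

    import tornado.web

    class PingHandler(tornado.web.RequestHandler):
        def get(self):
            self.write('pong')

    class StubHandler(tornado.web.RequestHandler):
        def get(self, *parts):
            self.write({'path': parts})  # placeholder response

    app = tornado.web.Application([
        (r'/ping', PingHandler),
        (r'/([^/]+)', StubHandler),                         # [cubes]
        (r'/([^/]+)/([^/]+)', StubHandler),                 # [fields]
        (r'/([^/]+)/([^/]+)/query/([^/]+)', StubHandler),   # find, aggregate, fetch
        (r'/([^/]+)/([^/]+)/([^/]+)', StubHandler),         # saveobjects, removeobjects, drop
    ])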
In case we don't want to get all X thousand results back at once, like fetch() does:
.find() from metrique.client with iterator functionality (one at a time, on-demand data loading from the server)? Async?
cube.find('...', fields='x,y', date='~')
returns all versions of objects matching the given query.
There might be three versions of a particular object because field z changed. But since we're only interested in fields x and y, and they're the same between all three versions where the z field changed, we can merge/reconcile those into a single object version and update _start/_end accordingly.
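A sketch of that merge (the _start/_end object layout is assumed from the surrounding notes)::

    def merge_versions(versions, fields):
        # versions: the object's history, each dict carrying _start/_end
        merged = []
        for v in sorted(versions, key=lambda x: x['_start']):
            if merged and all(merged[-1][f] == v.get(f) for f in fields):
                # identical on every requested field: extend the span
                merged[-1]['_end'] = v['_end']
            else:
                nxt = dict((f, v.get(f)) for f in fields)
                nxt['_start'], nxt['_end'] = v['_start'], v['_end']
                merged.append(nxt)
        return merged

    # three versions where only z changed collapse into one x,y span:
    vs = [{'x': 1, 'y': 2, 'z': i, '_start': i, '_end': i + 1}
          for i in range(3)]
    print(merge_versions(vs, ['x', 'y']))
    # [{'x': 1, 'y': 2, '_start': 0, '_end': 3}]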
A pandas DataFrame can be saved to HDF5. Extend the pandas DataFrame class to support saving a df as dicts and lists, with cube.save_objects(df), which would dump the df (objects with a unique _id?) of results into a metrique cube for future, historical, or other analysis.
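The df-to-objects step could be as small as this (modern pandas to_dict semantics assumed; the cube call is the proposal above, not an existing API)::

    import pandas as pd

    df = pd.DataFrame({'_id': [1, 2], 'severity': ['high', 'low']})
    objects = df.to_dict(orient='records')
    # [{'_id': 1, 'severity': 'high'}, {'_id': 2, 'severity': 'low'}]
    # cube.save_objects(objects)  # the proposed call, per above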
Over time, there WILL be duplicate and overlapping data being dumped. Production dbs can't be 'refreshed' :) so we need a safe way to minimize duplicate/wasted data objects. One option, perhaps, is to create a cleanup function::
    def cleanup(cube, dt_resolution, collapse=False):
        '''
        Walk through all commits with a date range separated by a time
        difference equal to or less than the dt_resolution argument and
        merge the objects.
        All fields which are the same will be merged; differing fields
        will end up with the value of the most recent version. If
        collapse is False, the objects will be reduced only in so far as
        no more than 2 object states merge into one; if more than two
        objects would be merged, given the dt_resolution used, then
        reduce the maximum possible, but expect a list of objects in
        return. If collapse is True, always return only ONE new object,
        which is, essentially, the most recent object of the set.
        '''
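A rough implementation sketch under those rules, operating on the object list directly rather than the cube (dt_resolution as a timedelta; the _merged flag is a device of this sketch to cap merges at two states)::

    def cleanup(objects, dt_resolution, collapse=False):
        objects = sorted(objects, key=lambda o: o['_start'])
        if collapse:
            return [objects[-1]]  # only the most recent state survives
        merged = [dict(objects[0])]
        for obj in objects[1:]:
            prev = merged[-1]
            close = (obj['_start'] - prev['_end']) <= dt_resolution
            if close and not prev.get('_merged'):
                # merge at most two adjacent states into one;
                # the most recent values win on differing fields
                nxt = dict(prev)
                nxt.update(obj)
                nxt['_start'] = prev['_start']
                nxt['_merged'] = True
                merged[-1] = nxt
            else:
                merged.append(dict(obj))
        return merged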
use concurrent.futures.ProcessPoolExecutor
The process to check whether a new object should be created or not is pretty CPU intensive.
Move the running of .snapshot (on the server) into a separate process (i.e., it will not be blocked by the Python GIL).
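For example (the comparison function is a stand-in for the real check; worker functions must live at module top level, and the entry point needs an if __name__ == '__main__' guard when run as a script)::

    from concurrent.futures import ProcessPoolExecutor

    def needs_new_version(pair):
        old, new = pair
        return old != new  # stand-in for the real, expensive comparison

    def snapshot(pairs, workers=4):
        # fan the CPU-bound checks out to worker processes (no GIL contention)
        with ProcessPoolExecutor(max_workers=workers) as pool:
            return list(pool.map(needs_new_version, pairs))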
@jniznan and I battled with json today. We brought down the wall-clock time for handling large data sets, but it's still slow.
If we avoid encode and decode altogether, the process which took 30s... takes 1s using ultrajson (ujson) in place of simplejson. But ujson has no encoder method.
It seems we might need to make a new API standard: all data in and out of the server should require zero json encode or decode. Clients are responsible for correctly serializing all data out and in.
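A quick way to sanity-check those numbers on any machine (assumes both libraries are installed; timings will vary)::

    import time
    import simplejson
    import ujson

    data = [{'i': i, 's': 'x' * 50} for i in range(100000)]

    for mod in (simplejson, ujson):
        t0 = time.time()
        mod.loads(mod.dumps(data))
        print(mod.__name__, round(time.time() - t0, 2), 's')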
aggregate statistics of incoming data objects.
eg, how many times each cube.field is manipulated
In ipython, for example, b.find('product == "Metrique"') raises an exception.
My guess is it's related to the new result subclass implementation.
Int64Index([], dtype=int64) Empty BugResult
TypeError                                 Traceback (most recent call last)
in <module>()
----> 1 b.find('product == "Metrique"')
/usr/lib/python2.7/site-packages/IPython/core/displayhook.pyc in __call__(self, result)
    236 self.start_displayhook()
    237 self.write_output_prompt()
--> 238 format_dict = self.compute_format_data(result)
    239 self.write_format_data(format_dict)
    240 self.update_user_ns(result)
/usr/lib/python2.7/site-packages/IPython/core/displayhook.pyc in compute_format_data(self, result)
    148 MIME type representation of the object.
    149 """
--> 150 return self.shell.display_formatter.format(result)
    151
    152 def write_format_data(self, format_dict):
/usr/lib/python2.7/site-packages/IPython/core/formatters.pyc in format(self, obj, include, exclude)
    124 continue
    125 try:
--> 126 data = formatter(obj)
    127 except:
    128 # FIXME: log the exception
/usr/lib/python2.7/site-packages/IPython/core/formatters.pyc in __call__(self, obj)
    445 type_pprinters=self.type_printers,
    446 deferred_pprinters=self.deferred_printers)
--> 447 printer.pretty(obj)
    448 printer.flush()
    449 return stream.getvalue()
/usr/lib/python2.7/site-packages/IPython/lib/pretty.pyc in pretty(self, obj)
    358 if callable(meth):
    359 return meth(obj, self, cycle)
--> 360 return _default_pprint(obj, self, cycle)
    361 finally:
    362 self.end_group()
/usr/lib/python2.7/site-packages/IPython/lib/pretty.pyc in _default_pprint(obj, p, cycle)
    478 if getattr(klass, '__repr__', None) not in _baseclass_reprs:
    479 # A user-provided repr.
--> 480 p.text(repr(obj))
    481 return
    482 p.begin_group(1, '<')
/usr/lib/python2.7/site-packages/pandas-0.11.0-py2.7-linux-x86_64.egg/pandas/core/frame.pyc in __repr__(self)
    720 Yields Bytestring in Py2, Unicode String in py3.
    721 """
--> 722 return str(self)
    723
    724 def _repr_html_(self):
/usr/lib/python2.7/site-packages/pandas-0.11.0-py2.7-linux-x86_64.egg/pandas/core/frame.pyc in __str__(self)
    664 if py3compat.PY3:
    665 return self.__unicode__()
--> 666 return self.__bytes__()
    667
    668 def __bytes__(self):
/usr/lib/python2.7/site-packages/pandas-0.11.0-py2.7-linux-x86_64.egg/pandas/core/frame.pyc in __bytes__(self)
    674 """
    675 encoding = com.get_option("display.encoding")
--> 676 return self.__unicode__().encode(encoding, 'replace')
    677
    678 def __unicode__(self):
/usr/lib/python2.7/site-packages/pandas-0.11.0-py2.7-linux-x86_64.egg/pandas/core/frame.pyc in __unicode__(self)
    707 verbose = (max_info_rows is None or
    708 self.shape[0] <= max_info_rows)
--> 709 self.info(buf=buf, verbose=verbose)
    710
    711 value = buf.getvalue()
/usr/lib/python2.7/site-packages/pandas-0.11.0-py2.7-linux-x86_64.egg/pandas/core/frame.pyc in info(self, verbose, buf, max_cols)
   1609 space = max([len(com.pprint_thing(k)) for k in self.columns]) + 4
   1610 counts = self.count()
-> 1611 if len(cols) != len(counts):
   1612 raise AssertionError('Columns must equal counts')
   1613 for col, count in counts.iteritems():
TypeError: object of type 'float' has no len()
save_objects is a CPU hog too...
Generate a text journal (activity log) of all objects saved/updated/removed from a collection (_id, when, what (the new object state), who), before actually saving.
It can be used as a 'backup'; or, if a save_objects goes bad (e.g., after a future update), we can 're-play' the transactions.
use logger?
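Or an append-only JSON-lines journal, written before the real save (file layout and field names are illustrative only)::

    import json
    from datetime import datetime

    def journal(path, action, obj, who):
        entry = {'_id': obj.get('_id'),
                 'when': datetime.utcnow().isoformat(),
                 'what': obj,       # the new object state
                 'who': who,
                 'action': action}  # saved / updated / removed
        with open(path, 'a') as f:
            f.write(json.dumps(entry, default=str) + '\n')

    # journal('bz_bug.journal', 'saved', {'_id': 1, 'status': 'NEW'}, 'cward')
    # re-playing is then just reading the file back line by line
    # and re-applying each entry in order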
Both client and server depend on tools
Split out client and server requirements
Register at pypi
.count() method should accept a date argument, just like find
see: client_api.count()
38 #### COMING SOON - 0.1.4 ####
39 :param String date: Date (date range) that should be queried:
...
Ignore the 0.1.4 remark; let's try to get this into 0.1.3 (I'll try to release tomorrow).
Wrap basic auth with kerberos auth handler for tornado 3.0
http://liftoff.github.io/GateOne/Developer/sso.html
I don't have a keytab yet to play with, so we'll only test this with 'basic' kerberos auth and hope the key auth works like magic, as it should if the manual basic works.
Output all pypi requirements/dependencies into a txt file called requirements.txt; then installation can be done with pip install -r requirements.txt.
https://github.com/drpoovilleorg/metrique/blob/master/metrique/client/cubes/gitrepo/commit.py#L136
'stats' is one property of a git commit that GitPython offers us. It shows +/- line counts per commit. But it's slow to extract. Without it enabled as a field in gitrepo.commit, we can process 900+ commits a second (which I think is also slow...); with it enabled, 30.
We need to make it faster; one idea is sketched below.
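One possibility (an assumption, not the project's current code): gather all stats in a single git log --numstat pass instead of asking GitPython for .stats commit by commit::

    import subprocess
    from collections import defaultdict

    def all_stats(repo_path):
        out = subprocess.check_output(
            ['git', 'log', '--numstat', '--format=@%H'], cwd=repo_path)
        stats, sha = defaultdict(lambda: [0, 0]), None
        for line in out.decode('utf-8', 'replace').splitlines():
            if line.startswith('@'):
                sha = line[1:]          # new commit
            elif line.strip():
                added, removed = line.split('\t')[:2]
                if added != '-':        # binary files report '-'
                    stats[sha][0] += int(added)
                    stats[sha][1] += int(removed)
        return stats  # sha -> [lines_added, lines_removed]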
show the basics of extracting, querying, plotting
for example, git.extract(get_dependency_git_uris('metrique'))
Then show how many LOC; distinct contributors; plot contributions per person over time; etc
pyclient().user_passwd('new', 'user', 'old' (optional, if admin))
pyclient should autoset self.user to current user and use it if not passed in as arg
plt.axvline(dt, color='k')
Currently, the above draws a vertical line on a timeseries (x) plot, where dt == a date in the timeseries (x).
What about wrapping this in a function that also accepts a 'label', which would be printed at the top of the plot on a second x axis...?
@wejnik How does this overlap with the work you've done on mapping dates to a plot with timeseries (x)?
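A possible wrapper (the function name is hypothetical; the label rides on a twin x axis along the top)::

    import matplotlib.pyplot as plt
    import matplotlib.dates as mdates

    def axvline_labeled(ax, dt, label, color='k'):
        ax.axvline(dt, color=color)
        top = ax.twiny()                  # second x axis, on top
        top.set_xlim(ax.get_xlim())       # keep both axes aligned
        top.set_xticks([mdates.date2num(dt)])
        top.set_xticklabels([label], rotation=45)
        return top

    # fig, ax = plt.subplots()
    # ax.plot(dates, values)
    # axvline_labeled(ax, release_dt, 'release 0.1.3')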
Clients can set one underscore in any field name, but not a dunder (double underscore).
Dunders are reserved for properties set by the metrique server.
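That rule reads like a simple validation on incoming field names (a sketch; the leading-dunder interpretation is an assumption)::

    def validate_field(name):
        if name.startswith('__'):
            raise ValueError(
                '%r: dunder fields are reserved for the server' % name)
        return name

    validate_field('_my_field')    # fine: single leading underscore
    # validate_field('__hash')     # raises ValueError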