
metrique's Issues

object simplehash()

a.k.a. hash(str()), as used in the gitrepo cube

This is a weak hash because in some instances the hashes can differ for two objects even when the content is exactly the same, just not in the same order. But it is, I guess, a fast hash; faster, anyway, than recursively taking frozensets of lists and dict.items().
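For illustration only (not the actual gitrepo cube code), the difference between the two approaches might look like::

def simplehash(obj):
    # fast but order-sensitive: str() of a dict depends on key order,
    # so two equal objects can stringify (and hash) differently
    return hash(str(obj))

def deephash(obj):
    # slower but order-insensitive: recursively freeze dicts and lists
    if isinstance(obj, dict):
        return hash(frozenset((k, deephash(v)) for k, v in obj.items()))
    if isinstance(obj, (list, tuple)):
        return hash(tuple(deephash(v) for v in obj))
    return hash(obj)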

pass optional argument to save_objects(..., mtime=None)

Permit the client to push an mtime along with the save_objects call; the server would apply that mtime to etl_activity for the cube.field, rather than calling utcnow() itself when the operation starts. This reduces the risk of hitting the race condition where something is updated between the moment the client starts saving and the moment the server generates the utcnow mtime; the next client delta using the server-generated mtime might miss changes that happened in that gap.
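A minimal sketch of how this could look (function and argument names are illustrative, not the actual metrique API)::

from datetime import datetime

def save_objects(cube, objects, mtime=None):
    # hypothetical server entry point: if the client supplies mtime, store it
    # in etl_activity for the cube.field instead of calling utcnow() here
    mtime = mtime or datetime.utcnow()
    print('saving %d objects to %s with mtime=%s' % (len(objects), cube, mtime))

mtime = datetime.utcnow()                      # captured before extraction starts
changed = [{'_id': 1, 'summary': 'example'}]   # stand-in for extracted objects
save_objects('gitrepo_commit', changed, mtime=mtime)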

BUG: git commit extract delta

It always extracts at least one bug (even if it shouldn't). I guess we are comparing the delta_ts with '>=' instead of '>'.

'full text' search?

On save_objects(), index unique field token values per cube, with a map to _ids.

such that when metrique receives an object like:

{
    _id: 23523599,
    tweet: '#wow #metrique',
    user: 'cward',
    ...
}

push the fields and values out to workers and get back a set of unique (token, _id) tuples. Then all those sets would be merged into (token, [_id, ...]).
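A rough sketch of the idea, with a naive whitespace tokenizer standing in for whatever the workers would actually do::

from collections import defaultdict

def tokenize(obj):
    # emit (token, _id) pairs for every string field of one object
    pairs = set()
    for key, value in obj.items():
        if key == '_id' or not isinstance(value, str):
            continue
        for token in value.lower().split():
            pairs.add((token, obj['_id']))
    return pairs

def merge(pair_sets):
    # merge the per-worker (token, _id) sets into token -> [_id, ...]
    index = defaultdict(set)
    for pairs in pair_sets:
        for token, _id in pairs:
            index[token].add(_id)
    return {token: sorted(ids) for token, ids in index.items()}

objs = [{'_id': 23523599, 'tweet': '#wow #metrique', 'user': 'cward'}]
print(merge([tokenize(o) for o in objs]))
# {'#metrique': [23523599], '#wow': [23523599], 'cward': [23523599]}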

@jniznan @wejnik thoughts?

get_cube should guess what to import based on cube naming convention

it would be nice to simply be able to run

get_cube('bz_bug') and we'd get back the bz.bug.Bug cube class

if the first attempt to import fails, get_cube could attempt to pop off the first part of the string before the first '_' underscore. This would be the 'package'.

Then, the remaining set of characters would be the 'module'

And the class name would be formed by taking the module, removing underscores, and capitalizing each word (the first character, plus the first character of each word separated by an underscore).
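A sketch of that convention (a hypothetical helper, not existing code)::

def guess_cube_path(name):
    # 'bz_bug' -> package 'bz', module 'bug', class 'Bug'
    pkg, _, mod = name.partition('_')        # pop off the 'package'
    cls = ''.join(part.capitalize() for part in mod.split('_'))
    return pkg, mod, cls

pkg, mod, cls = guess_cube_path('bz_bug')
print(pkg, mod, cls)                         # bz bug Bug
# get_cube could then try: getattr(importlib.import_module('%s.%s' % (pkg, mod)), cls)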

webapp config should set debug only if debug == 2

In the server.tornado.http webapp setup, we set debug=True (in tornado) if metrique_config.debug is not false, but we've moved to a new debug-level scheme and tornado debug should be turned on only if metrique_config.debug == 2.
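Roughly (illustrative only, not the actual server.tornado.http code)::

import tornado.web

class Config(object):
    debug = 1                       # stand-in for metrique_config.debug (levels 0/1/2)

metrique_config = Config()
handlers = []                       # route handlers registered elsewhere

# turn on tornado's own debug/autoreload mode only at the highest debug level
app = tornado.web.Application(handlers, debug=(metrique_config.debug == 2))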

0.1.3 GA & 0.1.4 & 0.2.0 planning

We need stability. We need tests.

I'm pretty sure we have the core functionality in 0.1.3, server side, that we need to accomplish our current goals. Performance is excellent too.

I wanted to tag the code yesterday with the 0.1.3 version release, but time ran out. I'll do it by Monday, hopefully.

We'll focus on bug fixing and client usability issues now. And docs.

API affecting changes should be avoided.

Branch 0.1.4 will be for MEPs and refactors

object hashing

@jniznan @wejnik

What if we hash every object and save its unique field:value hash to _hash; the username that saved it would be stored in every object under admin: [user, ...]. Then the way someone gets access to an object is by first running an extract(force=True); all objects the user is able to push to save_documents would be saved if no such hash exists yet, or tagged with that username if an object with the same hash already exists.

Users who can create the same data are assumed to have access to the same source data used to create the objects... no?
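A toy sketch of the proposed flow (names and the in-memory store are illustrative)::

import hashlib
import json

def obj_hash(obj):
    # deterministic hash over the object's field:value pairs
    content = {k: v for k, v in obj.items() if not k.startswith('_')}
    return hashlib.sha1(json.dumps(content, sort_keys=True).encode()).hexdigest()

def save(store, obj, username):
    # same hash already present: just tag the existing object with the user;
    # otherwise save it with the user as the first admin entry
    _hash = obj_hash(obj)
    if _hash in store:
        store[_hash]['admin'].append(username)
    else:
        store[_hash] = dict(obj, _hash=_hash, admin=[username])
    return store[_hash]

store = {}
save(store, {'_id': 1, 'summary': 'x'}, 'cward')
print(save(store, {'_id': 1, 'summary': 'x'}, 'jniznan')['admin'])   # ['cward', 'jniznan']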

API refactor

/ping
/USER/[cubes]
/USER/CUBE/[fields]
/USER/CUBE/query/[find, aggregate, fetch, ...]
/USER/CUBE/[saveobjects, removeobjects, drop ...]
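A possible tornado route table for that layout (handlers here are just placeholders)::

import tornado.web

class Placeholder(tornado.web.RequestHandler):
    def get(self, *args):
        self.write({'ok': True})

routes = [
    (r'/ping', Placeholder),
    (r'/([^/]+)/?', Placeholder),                                        # /USER/[cubes]
    (r'/([^/]+)/([^/]+)/?', Placeholder),                                # /USER/CUBE/[fields]
    (r'/([^/]+)/([^/]+)/query/(find|aggregate|fetch)', Placeholder),     # query actions
    (r'/([^/]+)/([^/]+)/(saveobjects|removeobjects|drop)', Placeholder), # object actions
]
app = tornado.web.Application(routes)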

.find() should have a "pagination" or 'batch' arg

in case we don't want to get all X thousand results back at once, like fetch() already supports.

.find() from metrique.client with iterator functionality (one at a time, on-demand data loading from the server)? Async?
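One way a client-side batch arg could behave, assuming find() grows limit/skip parameters (they don't exist yet)::

class FakeCube(object):
    # stand-in for a metrique client cube whose find() supports limit/skip
    data = list(range(25))
    def find(self, query, limit=10, skip=0):
        return self.data[skip:skip + limit]

def find_batched(cube, query, batch_size=10):
    # pull batch_size results per request and yield rows one at a time
    skip = 0
    while True:
        batch = cube.find(query, limit=batch_size, skip=skip)
        if not batch:
            break
        for row in batch:
            yield row
        skip += batch_size

print(list(find_batched(FakeCube(), 'product == "Metrique"')))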

date query: merge version with unchanging fields of interest

cube.find('...', fields='x,y', date='~')

returns all versions of objects matching the given query.

There might be three versions of a particular object because field Z changed. But since we're only interested in fields x and y, and they're the same across all three versions where field Z changed, we can merge/reconcile them into a single object version and update _start/_end accordingly.
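A sketch of that merge, assuming each version carries _start/_end (field names x, y, z as in the example above)::

def merge_versions(versions, fields=('x', 'y')):
    # versions: consecutive snapshots of one object, sorted by _start;
    # collapse neighbours whose fields-of-interest are identical and
    # stretch _start/_end over the merged span
    merged = []
    for v in versions:
        key = tuple(v.get(f) for f in fields)
        if merged and key == tuple(merged[-1].get(f) for f in fields):
            merged[-1]['_end'] = v['_end']
        else:
            merged.append(dict(v))
    return merged

versions = [
    {'x': 1, 'y': 2, 'z': 'a', '_start': 0, '_end': 5},
    {'x': 1, 'y': 2, 'z': 'b', '_start': 5, '_end': 9},    # only z changed
    {'x': 3, 'y': 2, 'z': 'b', '_start': 9, '_end': None},
]
print(len(merge_versions(versions)))                        # 2, not 3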

df.save_to_cube()

A pandas df can already save to HDF5. Extend the pandas df class to support saving the df as dicts and lists, with cube.save_objects(df), which would dump the df of results (objects with a unique _id?) into a metrique cube for future, historical or other analysis.
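A minimal sketch of the conversion step (the helper name is hypothetical)::

import pandas as pd

def df_to_objects(df, id_field='_id'):
    # turn a DataFrame into the list-of-dicts form save_objects() expects,
    # using the index as _id
    return df.reset_index().rename(columns={'index': id_field}).to_dict(orient='records')

df = pd.DataFrame({'loc': [120, 80]}, index=[23523599, 23523600])
print(df_to_objects(df))
# [{'_id': 23523599, 'loc': 120}, {'_id': 23523600, 'loc': 80}]
# cube.save_objects(df_to_objects(df))   # then push into a metrique cube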

DB CLEAN: reconcile similar objects

Over time, there WILL be duplicate and overlapping data being dumped. Production dbs can't be 'refreshed' :) so we need a safe way to minimize duplicate/wasted data objects. One option, perhaps, is to create a cleanup function::

def cleanup(cube, dt_resolution, collapse=False):
    '''
    Walk through all commits whose dates are separated by a time
    difference equal to or less than the dt_resolution argument and
    merge the objects.

    All fields which are the same will be merged; differing fields will
    default to the value of the most recent. If collapse is False, the
    objects will reduce ONLY insofar as no more than 2 object states
    merge into one; if more than two objects would be merged, given the
    dt_resolution used, then reduce as much as possible, but expect a
    list of objects in return. If collapse is True, always return only
    ONE new object, which is, essentially, the most recent object of
    the set.
    '''
    raise NotImplementedError   # proposal only; merging logic not yet written

run .snapshot() as a separate process

use concurrent.futures.ProcessPoolExecutor

The process to check whether a new obj should be created or not is pretty CPU intensive.

Move the running of .snapshot() (on the server) into a separate process (i.e., so it won't be serialized behind the Python GIL).
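A minimal sketch of what that could look like (snapshot() here is just a stand-in for the real comparison work)::

from concurrent.futures import ProcessPoolExecutor

def snapshot(cube_name):
    # stand-in for the CPU-heavy "should a new object version be created?" check
    return '%s snapshot done' % cube_name

if __name__ == '__main__':
    # each snapshot runs in its own process, so the work is not serialized
    # behind the main server process's GIL
    with ProcessPoolExecutor(max_workers=4) as pool:
        futures = [pool.submit(snapshot, c) for c in ('gitrepo_commit', 'bz_bug')]
        print([f.result() for f in futures])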

json Encoding and decoding

@jniznan and I battled with json today. We brought down the wall-clock time for handling large data sets, but it's still slow.

If we avoid encoding and decoding altogether, the process which took 30s ... takes 1s using ultrajson (ujson) in place of simplejson. But ujson has no encoder method.

Seems we might need to make a new API standard. All data in and out of the server should require zero json encode or decode. Clients are responsible for correctly serializing all data out and in.
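For reference, the drop-in part is straightforward; the missing piece is a custom encoder hook (a sketch, assuming both packages are installed)::

import simplejson
import ujson

doc = {'_id': 23523599, 'tweet': '#wow #metrique', 'user': 'cward'}

# both expose dumps/loads, but ujson has no JSONEncoder subclass to hook into,
# so non-JSON types (datetime, ObjectId, ...) must be converted by the caller
assert simplejson.loads(simplejson.dumps(doc)) == ujson.loads(ujson.dumps(doc))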

__repr__ is failing with new result subclasses

In IPython, for example, b.find('product == "Metrique"') raises an exception with:

My guess is it's related to the new result subclass implementation

Int64Index([], dtype=int64) Empty BugResult

DEBUG:metrique.client.http_api:URL: https://192.168.1.5:8080/api/v1/query/find

TypeError Traceback (most recent call last)
<ipython-input-...> in <module>()
----> 1 b.find('product == "Metrique"')

/usr/lib/python2.7/site-packages/IPython/core/displayhook.pyc in __call__(self, result)
236 self.start_displayhook()
237 self.write_output_prompt()
--> 238 format_dict = self.compute_format_data(result)
239 self.write_format_data(format_dict)
240 self.update_user_ns(result)

/usr/lib/python2.7/site-packages/IPython/core/displayhook.pyc in compute_format_data(self, result)
148 MIME type representation of the object.
149 """
--> 150 return self.shell.display_formatter.format(result)
151
152 def write_format_data(self, format_dict):

/usr/lib/python2.7/site-packages/IPython/core/formatters.pyc in format(self, obj, include, exclude)
124 continue
125 try:
--> 126 data = formatter(obj)
127 except:
128 # FIXME: log the exception

/usr/lib/python2.7/site-packages/IPython/core/formatters.pyc in __call__(self, obj)
445 type_pprinters=self.type_printers,
446 deferred_pprinters=self.deferred_printers)
--> 447 printer.pretty(obj)
448 printer.flush()
449 return stream.getvalue()

/usr/lib/python2.7/site-packages/IPython/lib/pretty.pyc in pretty(self, obj)
358 if callable(meth):
359 return meth(obj, self, cycle)
--> 360 return _default_pprint(obj, self, cycle)
361 finally:
362 self.end_group()

/usr/lib/python2.7/site-packages/IPython/lib/pretty.pyc in _default_pprint(obj, p, cycle)
478 if getattr(klass, '__repr__', None) not in _baseclass_reprs:
479 # A user-provided repr.
--> 480 p.text(repr(obj))
481 return
482 p.begin_group(1, '<')

/usr/lib/python2.7/site-packages/pandas-0.11.0-py2.7-linux-x86_64.egg/pandas/core/frame.pyc in __repr__(self)
720 Yields Bytestring in Py2, Unicode String in py3.
721 """
--> 722 return str(self)
723
724 def _repr_html_(self):

/usr/lib/python2.7/site-packages/pandas-0.11.0-py2.7-linux-x86_64.egg/pandas/core/frame.pyc in __str__(self)
664 if py3compat.PY3:
665 return self.__unicode__()
--> 666 return self.__bytes__()
667
668 def __bytes__(self):

/usr/lib/python2.7/site-packages/pandas-0.11.0-py2.7-linux-x86_64.egg/pandas/core/frame.pyc in __bytes__(self)
674 """
675 encoding = com.get_option("display.encoding")
--> 676 return self.__unicode__().encode(encoding, 'replace')
677
678 def __unicode__(self):

/usr/lib/python2.7/site-packages/pandas-0.11.0-py2.7-linux-x86_64.egg/pandas/core/frame.pyc in __unicode__(self)
707 verbose = (max_info_rows is None or
708 self.shape[0] <= max_info_rows)
--> 709 self.info(buf=buf, verbose=verbose)
710
711 value = buf.getvalue()

/usr/lib/python2.7/site-packages/pandas-0.11.0-py2.7-linux-x86_64.egg/pandas/core/frame.pyc in info(self, verbose, buf, max_cols)
1609 space = max([len(com.pprint_thing(k)) for k in self.columns]) + 4
1610 counts = self.count()
-> 1611 if len(cols) != len(counts):
1612 raise AssertionError('Columns must equal counts')
1613 for col, count in counts.iteritems():

TypeError: object of type 'float' has no len()

object manipulation (save/remove) journal

generate a text journal (activity log) of all objects saved/updated/removed from a collection (_id, when, what (new object state), who), before actually saving

It can be used as a 'backup', or, if a save_objects goes bad (e.g., after a future update), we can 're-play' the transactions.

use logger?
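A rough sketch of such a journal writer (the path and line format are illustrative)::

import json
import time

def journal(path, action, objects, who):
    # append one line per manipulated object *before* touching the collection,
    # so the log can be replayed if a save_objects run goes bad
    with open(path, 'a') as f:
        for obj in objects:
            entry = {'_id': obj.get('_id'), 'when': time.time(),
                     'what': obj, 'who': who, 'action': action}
            f.write(json.dumps(entry, default=str) + '\n')

journal('/tmp/metrique_journal.log', 'save', [{'_id': 1, 'summary': 'example'}], 'cward')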

.count should accept date arg

The .count() method should accept a date arg just like find().

see: client_api.count()
#### COMING SOON - 0.1.4 ####
:param String date: Date (date range) that should be queried:
...

Ignore the 0.1.4 remark; let's try to get this into 0.1.3 (I'll try to release tomorrow).
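A sketch of the proposed signature (the transport call here is a stand-in, not the real client code)::

class QueryApi(object):
    def _get(self, cmd, **kwargs):
        # stand-in for the HTTP call to the metrique server
        return {'cmd': cmd, 'kwargs': kwargs}

    def count(self, query, date=None):
        # accept the same date (range) arg as find(),
        # e.g. date='2013-01-01~2013-06-01' or date='~' for all versions
        return self._get('query/count', query=query, date=date)

print(QueryApi().count('product == "Metrique"', date='~'))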

add requirements.txt

Output all pypi requirements/dependencies into a txt file called requirements.txt; then dependency installation can be done with pip install -r requirements.txt.
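For example (the real dependency list should be generated from the package metadata; these entries are just packages mentioned in these issues)::

# requirements.txt (illustrative)
pandas
tornado
simplejson
ujson
futures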

DOCS: add introductory tutorial ipynb

show the basics of extracting, querying, plotting

for example, git.extract(get_dependency_git_uris('metrique'))

Then show how many LOC there are, how many distinct contributors, plot contributions per person over time, etc.
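The analysis half could be sketched with plain pandas, assuming the extract gives back a commits DataFrame with (hypothetical) author and author_time columns::

import pandas as pd

# stand-in for a DataFrame pulled from the gitrepo cube
commits = pd.DataFrame({
    'author': ['cward', 'jniznan', 'cward'],
    'author_time': pd.to_datetime(['2013-06-01', '2013-06-15', '2013-07-01']),
})

print('distinct contributors:', commits['author'].nunique())
per_month = commits.groupby([commits['author_time'].dt.to_period('M'), 'author']).size()
print(per_month)                         # contributions per person over time
# per_month.unstack().plot(kind='bar')   # the notebook would plot this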
