tgsmith61591 / skutil

NOTE: skutil is now deprecated. See its sister project: https://github.com/tgsmith61591/skoot. Original description: A set of scikit-learn and h2o extension classes (as well as caret classes for python). See more here: https://tgsmith61591.github.io/skutil

License: BSD 3-Clause "New" or "Revised" License

Python 95.74% Fortran 3.75% Shell 0.51%

h2o machine-learning pandas python sklearn

skutil's Introduction

Hi there 👋

About me

I'm Taylor Smith, Principal ML Engineer at Toyota Connected. In my day-to-day, I work on training and deploying Tensorflow language models into our Kubernetes cluster to support dozens of microservices. In my spare time, I actively maintain pmdarima, Python's leading equivalent to R's auto.arima.

Come work with me! Contribute to one of my active issues or check out careers at Toyota Connected.

Publications

Check out some of my publications and courses:

skutil's Issues

Fix pandas qcut issue

The skutil.metrics.GainsStatisticalReport contains a pandas qcut call which can break on non-unique bin edges. Need to write a workaround for this function for both pandas and H2O frames.
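One possible workaround on the pandas side, as a minimal sketch: rank the values first so that ties cannot produce repeated bin edges. The helper name `safe_qcut` is hypothetical and not part of skutil, and H2O frames are not covered here. Where dropping the repeated edges is acceptable, `pd.qcut(..., duplicates="drop")` (pandas 0.20 and later) is an alternative.

```python
import pandas as pd


def safe_qcut(values, n_bins):
    """Bin values into n_bins quantile groups even when raw edges repeat.

    Hypothetical helper for illustration only.
    """
    series = pd.Series(values)
    # method="first" breaks ties by order of appearance, so every rank is
    # unique and qcut can always find distinct bin edges.
    ranks = series.rank(method="first")
    # labels=False returns integer bin codes instead of Interval labels.
    return pd.qcut(ranks, n_bins, labels=False)


# Heavily tied scores: calling pd.qcut(scores, 3) directly raises a
# ValueError about non-unique bin edges.
scores = [0.0, 0.0, 0.0, 0.0, 0.1, 0.9]
bins = safe_qcut(scores, 3)
```

Note that ranking assigns tied values to different bins based on their position, which may or may not be acceptable for a gains report.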

Return sanitized log in Poisson branch

Need to return the sanitized log in the Poisson regression model

H2OSelectiveScaler

Similar to the skutil.preprocessing.SelectiveScaler, create an H2OSelectiveScaler for H2OFrames.

Running Nosetests with sklearn 0.17 creates error in skutil.utils.tests.test_util.test_grid_search_fix

acbc327cf45b:skutil qwq594$ pip freeze | grep -i scikit-learn
scikit-learn==0.17.1

Running nosetests with sklearn 0.17.1 causes the following error:

======================================================================
ERROR: skutil.utils.tests.test_util.test_grid_search_fix
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/usr/local/Cellar/python/2.7.11/Frameworks/Python.framework/Versions/2.7/lib/python2.7/unittest/case.py", line 329, in run
    testMethod()
  File "/Users/qwq594/Library/Python/2.7/lib/python/site-packages/nose/case.py", line 197, in runTest
    self.test(*self.arg)
  File "/Users/qwq594/PycharmProjects/skutil/skutil/utils/tests/test_util.py", line 79, in test_grid_search_fix
    grid1.fit_predict(df, y)
  File "/Users/qwq594/PycharmProjects/skutil/skutil/utils/metaestimators.py", line 57, in <lambda>
    out = lambda *args, **kwargs: self.fn(obj, *args, **kwargs)
  File "/Users/qwq594/PycharmProjects/skutil/skutil/utils/fixes.py", line 339, in fit_predict
    return self.fit(X, y).predict(X)
  File "/Users/qwq594/PycharmProjects/skutil/skutil/utils/fixes.py", line 721, in fit
    return self._fit(X, y, ParameterGrid(self.param_grid))
  File "/Users/qwq594/PycharmProjects/skutil/skutil/utils/fixes.py", line 462, in _fit
    X, y = indexable(X, y)
NameError: global name 'indexable' is not defined
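The NameError suggests that fixes.py calls indexable on a code path where it was never imported. A minimal sketch of a fix, assuming the import in fixes.py is currently guarded by a version check (the file's actual import layout is not shown in this issue): import the helper unconditionally.

```python
# indexable lives in sklearn.utils.validation in scikit-learn 0.17 and later;
# the fallback is only a precaution and is an assumption, not a known need.
try:
    from sklearn.utils.validation import indexable
except ImportError:
    from sklearn.utils import indexable

# indexable checks that all inputs have the same length and makes them
# indexable for cross-validation.
X, y = indexable([[1.0], [2.0], [3.0]], [0, 1, 0])
```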

Add Function Formatting via LaTeX in Sphinx

There are various areas in the docstrings where formulae are present. Having this formatted properly to play nicely with LaTeX would be optimal.

Tutorial for using LaTeX to generate formulae inside of sphinx.
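As a rough illustration of what a LaTeX-formatted docstring could look like, the sketch below uses Sphinx's `math` directive inside a raw docstring. The function `rmse` is only an example and is not taken from skutil; rendering in HTML assumes a math extension such as `sphinx.ext.mathjax` is listed in `extensions` in conf.py.

```python
import math


def rmse(y_true, y_pred):
    r"""Compute the root mean squared error.

    The raw-string prefix keeps the backslashes intact for Sphinx.

    .. math::

        \mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2}

    Parameters
    ----------
    y_true : sequence of float
        The observed values :math:`y_i`.
    y_pred : sequence of float
        The predicted values :math:`\hat{y}_i`.
    """
    n = len(y_true)
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / float(n))
```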

Improving Adherence to PEP8 Standards in codebase

This will be a recurring issue that reemerges closer to release milestone dates. For now, some major areas that should/could be considered are:

Unused code
- Parameters
- Docstring parameters
- Imports
Formatting
- Indentation
- Arguments/variables should be in lowercase
- Spacing between class methods and classes/functions themselves
Implementing all functions from an abstract class

- Instance attributes defined outside the __init__ of a class (this is kind of self-explanatory :)
- Tests for membership
- Unresolved attribute references
- Tests for object identity
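A few of the items above can be sketched in a small example. The class and function names are made up for illustration and do not come from the skutil codebase.

```python
class Report(object):
    def __init__(self, name):
        # Declare every instance attribute in __init__, even if it is only
        # filled in later, so no attribute is defined outside of it.
        self.name = name
        self.rows = None

    def load(self, rows):
        self.rows = list(rows)
        return self


def status(report, allowed_names):
    # Object identity: compare against None with "is", not "==".
    if report.rows is None:
        return "not loaded"
    # Membership: prefer "x not in y" over "not x in y".
    if report.name not in allowed_names:
        return "unknown report"
    return "ok"
```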

Mark the docs page "No longer supported" to point folk in the right direction?

Hi Tom. I had skutil tagged for a while to try your SafeLabelEncoder. I finally got to try to install it today and, after a few minutes tracking dependencies backwards in conda (getting openjdk installed for h2o), I jumped to your GitHub page and saw that this module was deprecated.

Can I suggest you add a headline note at https://www.alkaline-ml.com/skutil/index.html saying the same, and pointing folk at your new skoot?

I maintain my own packages and I know that deprecated stuff is a pain, but a pointer might help other folk from getting confused, as and when you have a few minutes. Many thanks of course for your support of our ecosystem.

Also - I lifted and used your SafeLabelEncoder, it did the right job, many thanks!

H2O model persistence - autogen main class

Downloading a POJO is actually kind of complicated—it still requires the user to build a main class. Can we write a class that will auto-gen and completely jar up a main class/method?

Initial Commit of Sphinx documentation

We want to get Sphinx set up to identify gaps in documentation and provide a better solution than the wiki for documenting behavior and examples.

Running Tests as a matrix of scikit-learn 0.17/0.18 and python 2.7/3.5

Currently .travis.yml defaults to scikit-learn 0.18. As a result, the test cases do not run against scikit-learn 0.17.

The tricky part is to have a matrix of the four possible configurations of scikit-learn / python versions similar to scikit-learn's .travis.yml. Another reference is from the travis docs themselves.

Between these and some other docs yet to be found this should be something we could manage.

Add Functions Geared Towards Apache Spark

Not sure how valid this is (is this truly within the scope of this library?) or in what form this will rear its ugly head, but it would be neat to add some complementary functions for Spark. This is as open-ended as it can be in order to Spark discussion. Ideally we would want to limit this functionality to Spark versions released after the emergence of Spark DataFrames so we can just leverage their existing DataFrame API.

Add `fit_predict` method to `H2OPipeline`

H2OPipeline does not currently inherit this method, and it should.
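One way this could be done is with a small mixin, sketched below. The mixin name and the `fit(X, y)` signature are assumptions for illustration; H2OPipeline's real `fit` signature may differ, and the toy estimator exists only to exercise the method.

```python
class FitPredictMixin(object):
    """Hypothetical mixin: adds fit_predict to any class with fit and predict."""

    def fit_predict(self, X, y=None, **fit_params):
        # Relies on fit returning self, as scikit-learn estimators do.
        return self.fit(X, y, **fit_params).predict(X)


class MeanRegressor(FitPredictMixin):
    """Toy estimator used only to demonstrate the mixin."""

    def fit(self, X, y=None):
        self.mean_ = sum(y) / float(len(y))
        return self

    def predict(self, X):
        return [self.mean_ for _ in X]


preds = MeanRegressor().fit_predict([[0], [1], [2]], [1.0, 2.0, 3.0])
```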

H2O grid search scoring

Make the scoring methods in H2O grid searches H2OFrame specific for speed, and add a make_h2o_scorer method that will allow callables to adapt to the gridsearch framework.

gh-pages release num

Since make clean html depends on a built copy of skutil (using python setup.py develop), the egg info is in the skutil directory. Thus, it's pointing to a stale version. We need to merge master into gh-pages in our automation process or rebase it or something so that the version is updated at build time...

Common Date Transformers

First off, thanks to the devs for creating such an awesome and useful library. Just a suggestion: it would be great to add a few date transformers to this library. For example, pass in a list of date columns and, for each column, get back separate columns for year, month, weekday, hour, etc. Here is a rudimentary date differ transformer I use often.

import pandas as pd
import numpy as np
from sklearn.base import TransformerMixin

class DateDiffer(TransformerMixin):
    '''
    Takes the difference between consecutive date columns and returns the
    value in the requested unit (days by default).
    Please use DateFormatter() before using DateDiffer().

    How it works:
    If you specify 3 dates: [date1, date2, date3]
    Output will be 2 columns:
        date2 - date1
        date3 - date2

    The transformer takes the following parameter 'unit':
        Y:  year
        M:  month
        W:  week
        D:  day
        h:  hour
        m:  minute
        s:  second
        ms: millisecond
        us: microsecond
        ns: nanosecond
        ps: picosecond
        fs: femtosecond
        as: attosecond
    '''
    def __init__(self, unit='D'):
        self.unit = unit
    
    def fit(self, X, y=None):
        # stateless transformer
        return self

    def transform(self, X):
        # assumes X is a DataFrame
        beg_cols = X.columns[:-1]
        end_cols = X.columns[1:]
        # DataFrame.as_matrix() was removed in pandas 1.0; .values works across versions
        Xbeg = X[beg_cols].values
        Xend = X[end_cols].values
        Xd = (Xend - Xbeg) / np.timedelta64(1, self.unit)
        diff_cols = ['->'.join(pair) for pair in zip(beg_cols, end_cols)]
        Xdiff = pd.DataFrame(Xd, index=X.index, columns=diff_cols)
        return Xdiff

My Python-fu is limited. For example, I am unable to generalize the DateDiffer() transformer to an entire dataframe, or, say, pass it a list of columns and do a fit_transform().
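A minimal sketch of the kind of transformer described above, one that accepts a list of date columns and emits year, month, weekday and hour columns for each. The class name `DatePartExtractor` and its parameters are hypothetical; nothing here is an existing skutil API.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin


class DatePartExtractor(TransformerMixin, BaseEstimator):
    """Hypothetical transformer: expand date columns into their parts."""

    def __init__(self, cols=None, parts=("year", "month", "weekday", "hour")):
        self.cols = cols
        self.parts = parts

    def fit(self, X, y=None):
        # Stateless transformer: nothing to learn.
        return self

    def transform(self, X):
        # Default to every column when no list is given.
        cols = list(X.columns) if self.cols is None else list(self.cols)
        out = X.drop(columns=cols)
        for col in cols:
            dates = pd.to_datetime(X[col])
            for part in self.parts:
                # e.g. dates.dt.year, dates.dt.weekday (Monday is 0)
                out[col + "_" + part] = getattr(dates.dt, part)
        return out


df = pd.DataFrame({"signup": ["2017-01-02 13:00"], "x": [1]})
result = DatePartExtractor(cols=["signup"]).fit_transform(df)
```

Because the class inherits from TransformerMixin, fit_transform() comes for free, and the `cols` argument is what lets it work on a subset of a larger dataframe.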

Finally, is there a way to pass two numeric columns to a transformer and obtain the column differences? I know I can create interaction variables with the sklearn polynomial transformer, but not df['x1']+df['x2'], for instance.
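For the column difference question, one option is scikit-learn's FunctionTransformer wrapped around a plain function. This is a sketch under the assumption that the input is a DataFrame; the function name `column_difference` and its keyword arguments are made up for illustration.

```python
import pandas as pd
from sklearn.preprocessing import FunctionTransformer


def column_difference(X, left="x1", right="x2"):
    """Return a copy of X with an extra column holding X[left] - X[right]."""
    out = X.copy()
    out[left + "-" + right] = X[left] - X[right]
    return out


# kw_args is forwarded to the function each time transform is called.
differ = FunctionTransformer(column_difference,
                             kw_args={"left": "x1", "right": "x2"})

df = pd.DataFrame({"x1": [5, 7], "x2": [2, 10]})
diffed = differ.fit_transform(df)
```

Swapping the subtraction for an addition gives the sum of the two columns in the same way.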

Creating pip installable package with PyPI

Create a pip-installable package on PyPI. Ideally we would be able to push out releases on PyPI at the same time that we make them available on our repo.

Retain estimator state on save

H2OPipeline and H2OGridSearch alter the state on save, and restore it on load, but they should retain state between the two disk operations...

Output dataframe with SafeLabelEncoder?

Hey guys, any tips on how to output a dataframe instead of an array when using SafeLabelEncoder()?

This works for me, but I was really hoping to have an argument similar to as_df=True so I can stay in Pandas-land.

import numpy as np
import pandas as pd

train = pd.DataFrame.from_records(data=np.array([
                           ['USA','RED','a'],
                           ['MEX','GRN','b'],
                           ['FRA','RED','b']]), 
                           columns=['Country','Color','Category'])


test = pd.DataFrame.from_records(data=np.array([
                           ['BBR','RED','a'],
                           ['CAN','BLK','b'],
                           ['FRA','RED','b']]), 
                           columns=['Country','Color','Category'])
    
COLS = ['Country']

# learn the levels on 'Country'
SLC = SafeLabelEncoder().fit(train[COLS].values.ravel())

# encode the 'Country' labels in the train and test datasets
train_labels = SLC.transform(train[COLS].values.ravel())
    
test_labels = SLC.transform(test[COLS].values.ravel())

print(train_labels)
print(test_labels)

[2 1 0]
[99999 99999     0]
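Until something like as_df=True exists, a thin wrapper can put the encoded values back into a DataFrame. The sketch below is only an illustration: `encode_columns` is a hypothetical helper, and `ToyEncoder` is a stand-in that mimics the output shown above (unseen labels become 99999) so the example runs without skutil installed. With skutil, a fitted SafeLabelEncoder would presumably be passed in its place.

```python
import numpy as np
import pandas as pd


class ToyEncoder(object):
    """Stand-in for a fitted label encoder; unseen labels map to 99999."""

    def fit(self, values):
        self.mapping_ = {v: i for i, v in enumerate(sorted(set(values)))}
        return self

    def transform(self, values):
        return np.array([self.mapping_.get(v, 99999) for v in values])


def encode_columns(df, encoders):
    """Return a copy of df with each listed column replaced by its codes.

    encoders maps a column name to a fitted encoder with a transform method.
    """
    out = df.copy()
    for col, encoder in encoders.items():
        out[col] = encoder.transform(df[col].values)
    return out


train = pd.DataFrame({"Country": ["USA", "MEX", "FRA"],
                      "Color": ["RED", "GRN", "RED"]})
test = pd.DataFrame({"Country": ["BBR", "CAN", "FRA"],
                     "Color": ["RED", "BLK", "RED"]})

encoders = {"Country": ToyEncoder().fit(train["Country"].values)}
train_df = encode_columns(train, encoders)
test_df = encode_columns(test, encoders)
```

The other columns and the index are carried over untouched, so the result stays in pandas-land.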

Python 3.5 travis tests

Add CI unit tests for Python 3.5 support

Fixing Travis to Allow plotting

Currently Travis is not configured to allow plotting via matplotlib and seaborn. Let's fix that.

I have found some resources that may be helpful:
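One common approach on headless CI machines is to select matplotlib's non-interactive Agg backend before any figure is created. The sketch below only shows that idea; whether it is enough for this repository's Travis setup (and for seaborn) is an assumption.

```python
import io

import matplotlib

# Agg renders to in-memory buffers or files and needs no display server.
matplotlib.use("Agg")

import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.plot([0, 1, 2], [0, 1, 4])

# Save to a buffer instead of calling plt.show(), which needs a display.
buf = io.BytesIO()
fig.savefig(buf, format="png")
plt.close(fig)
```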

skutil.h2o.H2OOversamplingClassBalancer Runtime Error

Getting a runtime error while trying to return the H2O frame after H2OOversamplingClassBalancer.

Python 2.7.11 :: Anaconda 2.3.0 (64-bit)
H2O version: 3.10.0.7

Note - error is too long to include in the message box, so trimming it down by cutting most of the middle of the error trace.

Error:

RuntimeError Traceback (most recent call last)
/opt/anaconda/2.3.0/lib/python2.7/site-packages/IPython/core/formatters.pyc in __call__(self, obj)
688 type_pprinters=self.type_printers,
689 deferred_pprinters=self.deferred_printers)
--> 690 printer.pretty(obj)
691 printer.flush()
692 return stream.getvalue()

/opt/anaconda/2.3.0/lib/python2.7/site-packages/IPython/lib/pretty.pyc in pretty(self, obj)
407 if callable(meth):
408 return meth(obj, self, cycle)
--> 409 return _default_pprint(obj, self, cycle)
410 finally:
411 self.end_group()

/opt/anaconda/2.3.0/lib/python2.7/site-packages/IPython/lib/pretty.pyc in _default_pprint(obj, p, cycle)
527 if _safe_getattr(klass, '__repr__', None) not in baseclass_reprs:
528 # A user-provided repr. Find newlines and replace them with p.break()
--> 529 _repr_pprint(obj, p, cycle)
530 return
531 p.begin_group(1, '<')

/opt/anaconda/2.3.0/lib/python2.7/site-packages/IPython/lib/pretty.pyc in _repr_pprint(obj, p, cycle)
709 """A pprint that just redirects to the normal repr function."""
710 # Find newlines and replace them with p.break()
--> 711 output = repr(obj)
712 for idx,output_line in enumerate(output.splitlines()):
713 if idx:

/home/eobg/.local/lib/python2.7/site-packages/h2o/frame.pyc in __repr__(self)
318 stk = traceback.extract_stack()
319 if not ("IPython" in stk[-2][0] and "info" == stk[-2][2]):
--> 320 self.show()
321 return ""
322

/home/eobg/.local/lib/python2.7/site-packages/h2o/frame.pyc in show(self, use_pandas)
330 print("This H2OFrame has been removed.")
331 return
--> 332 if not self._ex._cache.is_valid(): self._frame()._ex._cache.fill()
333 if H2ODisplay._in_ipy():
334 import IPython.display

/home/eobg/.local/lib/python2.7/site-packages/h2o/frame.pyc in _frame(self, fill_cache)
375
376 def _frame(self, fill_cache=False):
--> 377 self._ex._eager_frame()
378 if fill_cache:
379 self._ex._cache.fill()

/home/eobg/.local/lib/python2.7/site-packages/h2o/expr.pyc in _eager_frame(self)
83 if not self._cache.is_empty(): return self
84 if self._cache._id is not None: return self # Data already computed under ID, but not cached locally
---> 85 return self._eval_driver(True)
86
87 def _eager_scalar(self): # returns a scalar (or a list of scalars)

/home/eobg/.local/lib/python2.7/site-packages/h2o/expr.pyc in _eval_driver(self, top)
96
97 def _eval_driver(self, top):
---> 98 exec_str = self._do_it(top)
99 res = ExprNode.rapids(exec_str)
100 if 'scalar' in res:

/home/eobg/.local/lib/python2.7/site-packages/h2o/expr.pyc in _do_it(self, top)
123 if self._cache._id is not None: return self._cache._id # Data already computed under ID, but not cached
124 # assert isinstance(self._children,tuple)
--> 125 exec_str = "({} {})".format(self._op, " ".join([ExprNode._arg_to_expr(ast) for ast in self._children]))
126 gc_ref_cnt = len(gc.get_referrers(self))
127 if top or gc_ref_cnt >= ExprNode.MAGIC_REF_COUNT:

/home/eobg/.local/lib/python2.7/site-packages/h2o/expr.pyc in _arg_to_expr(arg)
136 return "[]" # empty list
137 elif isinstance(arg, ExprNode):
--> 138 return arg._do_it(False)
139 elif isinstance(arg, ASTId):
140 return str(arg)

...............................................................................................................................

/home/eobg/.local/lib/python2.7/site-packages/h2o/expr.pyc in _arg_to_expr(arg)
132 @staticmethod
133 def _arg_to_expr(arg):
--> 134 if arg is not None and isinstance(arg, range): arg = list(arg)
135 if arg is None:
136 return "[]" # empty list

/opt/anaconda/2.3.0/lib/python2.7/abc.pyc in __instancecheck__(cls, instance)
130 # Inline the cache checking when it's simple.
131 subclass = getattr(instance, '__class__', None)
--> 132 if subclass is not None and subclass in cls._abc_cache:
133 return True
134 subtype = type(instance)

/opt/anaconda/2.3.0/lib/python2.7/_weakrefset.pyc in __contains__(self, item)
70 def __contains__(self, item):
71 try:
---> 72 wr = ref(item)
73 except TypeError:
74 return False

RuntimeError: maximum recursion depth exceeded

Errors in importing classes

I tried to import the BoxCoxTransformer class, following the examples on transformers in skutil, but I got an error like this:

>>> from skutil.preprocessing import BoxCoxTransformer
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/Leslie/anaconda3/lib/python3.5/site-packages/skutil/preprocessing/__init__.py", line 7, in <module>
    from .balance import *
  File "/Users/Leslie/anaconda3/lib/python3.5/site-packages/skutil/preprocessing/balance.py", line 268, in <module>
    class _BaseBalancer(six.with_metaclass(abc.ABCMeta, object, BalancerMixin)):
  File "/Users/Leslie/anaconda3/lib/python3.5/site-packages/sklearn/externals/six.py", line 566, in with_metaclass
    return meta("NewBase", bases, {})
  File "/Users/Leslie/anaconda3/lib/python3.5/abc.py", line 133, in __new__
    cls = super().__new__(mcls, name, bases, namespace)
TypeError: Cannot create a consistent method resolution
order (MRO) for bases object, BalancerMixin

This has never happened to me before. Does anyone have ideas on how to solve it?
Thanks!
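The traceback points at the order of the bases: in Python 3 every class already derives from object, so listing object before BalancerMixin cannot be linearized. The sketch below reproduces the error with a stand-in BalancerMixin (its real contents are not shown in this issue) and shows that putting the mixin first avoids it. It calls abc.ABCMeta directly, mirroring the meta("NewBase", bases, {}) call in the traceback.

```python
import abc


class BalancerMixin(object):
    """Stand-in for skutil's BalancerMixin; the real contents are assumed."""


def build(bases):
    # Mirrors what the with_metaclass call in the traceback does internally.
    return abc.ABCMeta("NewBase", bases, {})


# object listed first: Python 3 cannot build a consistent MRO.
try:
    build((object, BalancerMixin))
    mro_failed = False
except TypeError:
    mro_failed = True

# Mixin first (or the mixin alone, since it already derives from object).
Fixed = build((BalancerMixin, object))
```

Under Python 2 the original order could work if BalancerMixin was an old-style class, which may be why the problem only showed up on Python 3.5.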

Add coverage for plotting code

Now that Travis is wired up to plot on the VMs, we should start implementing unit tests for the various plotting functions.

Improve Test Coverage To be on Par with sklearn

Our code on develop is at :

Our code on master is at :

Currently sklearn master is at (they do not have a develop branch):

We should strive to be >= their coverage as a baseline. For now, getting develop to that level would be more easily attained. Getting within a 0.5% margin would round the coverage to be equivalent in terms of badges.

Identify and Fix Gaps in Sphinx Documentation

The intention is for this to be a recurring issue that is recreated for each milestone. I think it is worth creating a new one each time to ensure that the new functionality for each milestone is captured, and so that we do not get to a point where we gloss over this. In addition to what is below, other types of issues will be identified through investigation and comparison of the results of make clean html.

Currently some areas of focus are: