bmmalone / pyllars

This repository contains supporting utilities for Python 3, with an emphasis on data science tasks.
License: MIT License
This function returns the non-nan values from an array. It should probably be part of math_utils, or maybe matrix_utils. A basic implementation is as follows.

import pandas as pd

def _remove_nans(vals):
    """Return only the non-nan values from vals."""
    m_nan = pd.isnull(vals)
    return vals[~m_nan]
All of the package could use improved documentation; however, the dataset manager is used in several external projects, so the fields it exposes, and how it determines them, should be explained in much better detail.
misc.pandas_utils imports fastparquet, which in turn imports numba. It seems that numba requires a shared build of python:

File "/beegfs/homes/bmalone/.virtualenvs/rpbp-test/lib/python3.5/site-packages/misc/pandas_utils.py", line 19, in <module>
    import fastparquet
File "/beegfs/homes/bmalone/.virtualenvs/rpbp-test/lib/python3.5/site-packages/fastparquet/__init__.py", line 8, in <module>
    from .core import read_thrift
File "/beegfs/homes/bmalone/.virtualenvs/rpbp-test/lib/python3.5/site-packages/fastparquet/core.py", line 13, in <module>
    from . import encoding
File "/beegfs/homes/bmalone/.virtualenvs/rpbp-test/lib/python3.5/site-packages/fastparquet/encoding.py", line 8, in <module>
    import numba
File "/beegfs/homes/bmalone/.virtualenvs/rpbp-test/lib/python3.5/site-packages/numba/__init__.py", line 12, in <module>
    from .special import typeof, prange
File "/beegfs/homes/bmalone/.virtualenvs/rpbp-test/lib/python3.5/site-packages/numba/special.py", line 3, in <module>
    from .typing.typeof import typeof
File "/beegfs/homes/bmalone/.virtualenvs/rpbp-test/lib/python3.5/site-packages/numba/typing/__init__.py", line 2, in <module>
    from .context import BaseContext, Context
File "/beegfs/homes/bmalone/.virtualenvs/rpbp-test/lib/python3.5/site-packages/numba/typing/context.py", line 10, in <module>
    from numba.typeconv import Conversion, rules
File "/beegfs/homes/bmalone/.virtualenvs/rpbp-test/lib/python3.5/site-packages/numba/typeconv/rules.py", line 3, in <module>
    from .typeconv import TypeManager, TypeCastingRules
File "/beegfs/homes/bmalone/.virtualenvs/rpbp-test/lib/python3.5/site-packages/numba/typeconv/typeconv.py", line 3, in <module>
    from . import _typeconv, castgraph, Conversion
ImportError: libpython3.5m.so.1.0: cannot open shared object file: No such file or directory

Removing the import stops the problem. The easiest option is to move the parquet import inside the respective functions. Then, using fastparquet simply requires python to be built with --enable-shared.
All of mpl_utils is documented except the plot_stacked_bar_graph function and its get_diff_counts helper.
It would be convenient if the various apply functions in dask_utils could accept None as the dask_client. In these cases, the implementation can fall back to the respective function in pd_utils.
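A minimal sketch of the fallback logic, assuming a serial row-wise helper (the names `apply` and `apply_df` here are illustrative, not necessarily the existing API):

```python
import pandas as pd

def apply_df(df, func, *args, **kwargs):
    """Serial fallback: apply func to each row of df."""
    return [func(row, *args, **kwargs) for _, row in df.iterrows()]

def apply(df, func, *args, dask_client=None, **kwargs):
    """Apply func to each row, distributing with dask when a client is given.

    If dask_client is None, fall back to the serial pandas version.
    """
    if dask_client is None:
        return apply_df(df, func, *args, **kwargs)

    # submit one task per row and gather the results from the cluster
    futures = [
        dask_client.submit(func, row, *args, **kwargs)
        for _, row in df.iterrows()
    ]
    return dask_client.gather(futures)
```

Since the serial path takes the same arguments as the distributed one, callers do not need to branch on whether a client is available.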
Hi,

Per https://github.com/scikit-learn/sklearn-pypi-package#brownout-schedule, the sklearn package is on a brownout schedule, causing our builds to fail at specific times of the day due to pyllars' dependency on it. I believe that changing this line to say scikit-learn is all it takes to solve the problem: https://github.com/bmmalone/pyllars/blob/master/setup.py#L35. PR coming with the change shortly.
The idea is that this function applies a function to overlapping rows in a data frame (that is, a "rolling window"). A basic implementation is as follows:

import tqdm

def apply_rolling_window(df, func, window_size, progress_bar=False):
    """Apply func to each window_size-row slice of df.

    The final windows are truncated when fewer than window_size rows remain.
    """
    it = range(len(df))
    if progress_bar:
        it = tqdm.trange(len(df))

    ret = [
        func(df.iloc[i:i+window_size])
        for i in it
    ]
    return ret
We need to document the following modules:
collection_utils
dask_utils
deprecated_decorator
external_sparse_matrix_list
gene_ontology_utils
hyperparameter_utils
incremental_gaussian_estimator
latex_utils
logging_utils
math_utils
matrix_utils
missing_data_utils
ml_utils
mpl_utils
mygene_utils
nlp_utils
pandas_utils
physionet_utils
scip_utils
shell_utils
sparse_vector
ssh_utils
stats_utils
string_utils
suppress_stdout_stderr
utils
validation_utils
The label encoder le_ is not persisted when the model is dumped to disk with joblib, so the model cannot be used for classification prediction when reloaded.
Here are example implementations:

from typing import Sequence
import numpy as np

def intersect_masks(masks: Sequence[np.ndarray]) -> np.ndarray:
    """Return the elementwise AND of the given boolean masks."""
    return np.all(list(masks), axis=0)

def union_masks(masks: Sequence[np.ndarray]) -> np.ndarray:
    """Return the elementwise OR of the given boolean masks."""
    return np.any(list(masks), axis=0)
Currently, this function is very complicated, and its parameters are highly non-obvious. Simplify it to plot a single ROC curve. Users can call it multiple times if desired.
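A simplified single-curve version might look like the following sketch; the function name and parameters are illustrative, not the existing API:

```python
import matplotlib.pyplot as plt
from sklearn.metrics import auc, roc_curve

def plot_roc_curve(y_true, y_score, ax=None, label=None, **plot_kwargs):
    """Plot a single ROC curve on the given axis.

    Callers who want several curves on one plot can call this once per
    curve, passing the same axis each time.
    """
    if ax is None:
        _, ax = plt.subplots()

    fpr, tpr, _ = roc_curve(y_true, y_score)
    roc_auc = auc(fpr, tpr)

    if label is not None:
        label = "{} (AUC = {:.2f})".format(label, roc_auc)

    ax.plot(fpr, tpr, label=label, **plot_kwargs)
    ax.plot([0, 1], [0, 1], linestyle="--", color="gray")  # chance line
    ax.set_xlabel("False positive rate")
    ax.set_ylabel("True positive rate")

    if label is not None:
        ax.legend()

    return ax
```

Pushing the multi-curve logic to the caller keeps the parameter list short and makes the single-curve behavior obvious.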
Here is a basic example:

import numpy as np

def sample_dirichlet_multinomial(dirichlet_alphas: np.ndarray, num_samples: int) -> np.ndarray:
    """Draw multinomial counts whose probabilities are sampled from a Dirichlet."""
    pvals = np.random.dirichlet(dirichlet_alphas)
    sampled_counts = np.random.multinomial(n=num_samples, pvals=pvals)
    return sampled_counts
This function should return two maps which map from items to indices and back from indices to items. It should also check that the items are unique (or the reverse map will not work). That is, this should be a bijective mapping. Here is a basic example:

index_map = {
    c: i for i, c in enumerate(items)
}

# if the items are not unique, the mapping cannot be bijective
assert len(index_map) == len(items)

reverse_index_map = {
    i: c for c, i in index_map.items()
}
Right now, old versions of pyyaml cause the following error: AttributeError: module 'yaml' has no attribute 'full_load'. Additionally, the existing call to "load" there is unsafe without an explicit Loader.
For large jobs, dask sometimes times out when retrieving individual future results. It is not clear why this happens. future.result has a timeout parameter, so that can be used to avoid indefinite hangs waiting for specific jobs. The function should probably also have an option to return timed-out results.
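A sketch of the retrieval loop with a timeout; the `return_timeouts` option and the `collect_results` name are illustrative. It is written against the generic `result(timeout=...)` interface, so it works with both concurrent.futures and dask.distributed futures:

```python
import asyncio
from concurrent.futures import TimeoutError as CFTimeoutError

# dask.distributed historically raises asyncio.TimeoutError on timeout,
# while concurrent.futures has its own TimeoutError class
_TIMEOUT_ERRORS = (TimeoutError, asyncio.TimeoutError, CFTimeoutError)

def collect_results(futures, timeout=None, return_timeouts=False):
    """Collect results from futures, skipping any that time out.

    Works with any future exposing result(timeout=...), such as
    dask.distributed.Future or concurrent.futures.Future.
    """
    results = []
    timed_out = []

    for f in futures:
        try:
            results.append(f.result(timeout=timeout))
        except _TIMEOUT_ERRORS:
            timed_out.append(f)

    if return_timeouts:
        return results, timed_out

    return results
```

Returning the timed-out futures (rather than discarding them) lets the caller retry or cancel them later.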
In particular, reading an excel file currently raises the following warning or similar:
DEBUG : The guessed filetype was: excel
/home/bmalone/.virtualenvs/nes-ehr/lib/python3.6/site-packages/pandas/util/_decorators.py:188: FutureWarning: The `sheetname` keyword is deprecated, use `sheet_name` instead
return func(*args, **kwargs)
Update sheetname to sheet_name.
The pandas_utils module refers to a logger which is not defined, causing a NameError.
This dependency is slow to install and causes problems if not handled carefully (#4). Just remove this functionality.
The internal documentation is largely compatible with sphinx (sklearn-style). Fix any improperly formatted documentation and build it with sphinx.
By default, this removes all tick labels. The name should be changed.
The stats_utils.calculate_univariate_gaussian_kl function aims to be numerically stable by performing most calculations in logspace. However, it is not clear that the equations are correct.
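For reference, the closed-form KL divergence between univariate Gaussians p = N(mu_p, var_p) and q = N(mu_q, var_q) is log(sigma_q/sigma_p) + (var_p + (mu_p - mu_q)^2) / (2 * var_q) - 1/2. A direct (non-logspace) implementation like the following could be used to cross-check the logspace version:

```python
import numpy as np

def univariate_gaussian_kl(mean_p, var_p, mean_q, var_q):
    """KL(p || q) for univariate Gaussians, computed directly.

    KL = log(sigma_q / sigma_p)
         + (var_p + (mean_p - mean_q)**2) / (2 * var_q)
         - 1/2
    """
    return (
        0.5 * np.log(var_q / var_p)
        + (var_p + (mean_p - mean_q) ** 2) / (2 * var_q)
        - 0.5
    )
```

Spot checks such as KL(p || p) = 0 and KL > 0 for distinct distributions would catch sign or term errors in the logspace equations.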
The add_logging_option function sets all defaults within the function body.

https://pyllars.readthedocs.io/en/stable/_modules/pyllars/logging_utils.html#add_logging_options
In some cases, such as when constructing a pipeline where the call to fit is not direct, it is convenient to specify the optimization metric when the wrapper is constructed.
The following is the problematic code:

if is_max:
    ex_vals = groups[ex_field].idxmax()
elif is_min:
    ex_vals = groups[ex_field].idxmin()

ex_rows = df.loc[ex_vals]

If the index of df is not unique, then the loc line can match multiple rows per group.
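One way to sidestep this, assuming nothing downstream depends on the original index, is to reset the index before grouping so the labels returned by idxmax/idxmin are unique positions. The function name and signature below are a simplified sketch, not necessarily the existing one:

```python
import pandas as pd

def get_group_extreme(df, ex_field, group_fields, is_max=True):
    """Return the row with the max (or min) ex_field within each group.

    Resetting the index guarantees that idxmax/idxmin return unique
    labels, so .loc matches exactly one row per group.
    """
    df = df.reset_index(drop=True)
    groups = df.groupby(group_fields)

    if is_max:
        ex_vals = groups[ex_field].idxmax()
    else:
        ex_vals = groups[ex_field].idxmin()

    return df.loc[ex_vals]
```

With a duplicated index such as [0, 0, 1, 1], the original code would return multiple rows per group; after the reset, each group yields exactly one row.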
This function should take in a class name and return the class. It is just a light wrapper around importlib.import_module and should look similar to the following.

import importlib

def get_class(fully_qualified_class_name):
    """ Convert the string version of a class to the class object

    For example, for the input "keras.optimizers.Adam", this function
    will return the Adam class. It could then be called to create
    an instance of an Adam optimizer.

    Parameters
    ----------
    fully_qualified_class_name : str
        The name of the class

    Returns
    -------
    class : type
        The type of the class
    """
    sp = fully_qualified_class_name.split(".")
    module_name = ".".join(sp[:-1])
    class_name = sp[-1]

    m = importlib.import_module(module_name)
    clazz = getattr(m, class_name)
    return clazz
Right now, some functions include an underscore and others do not. Matplotlib does not use the underscore, so use fontsize everywhere.