bmmalone / pyllars

This repository contains supporting utilities for Python 3, with an emphasis on data science tasks.
License: MIT License
This function returns the non-nan values from an array. It should probably be part of math_utils, or maybe matrix_utils. A basic implementation is as follows.

import pandas as pd

def _remove_nans(vals):
    """Return only the non-nan values from vals."""
    m_nan = pd.isnull(vals)
    return vals[~m_nan]
All of the package could use improved documentation; however, the dataset manager is used in several external projects, so the fields it exposes, and how it determines them, should be explained in much better detail.
misc.pandas_utils imports fastparquet, which in turn imports numba. It seems that numba requires a shared build of python:

File "/beegfs/homes/bmalone/.virtualenvs/rpbp-test/lib/python3.5/site-packages/misc/pandas_utils.py", line 19, in <module>
    import fastparquet
File "/beegfs/homes/bmalone/.virtualenvs/rpbp-test/lib/python3.5/site-packages/fastparquet/__init__.py", line 8, in <module>
    from .core import read_thrift
File "/beegfs/homes/bmalone/.virtualenvs/rpbp-test/lib/python3.5/site-packages/fastparquet/core.py", line 13, in <module>
    from . import encoding
File "/beegfs/homes/bmalone/.virtualenvs/rpbp-test/lib/python3.5/site-packages/fastparquet/encoding.py", line 8, in <module>
    import numba
File "/beegfs/homes/bmalone/.virtualenvs/rpbp-test/lib/python3.5/site-packages/numba/__init__.py", line 12, in <module>
    from .special import typeof, prange
File "/beegfs/homes/bmalone/.virtualenvs/rpbp-test/lib/python3.5/site-packages/numba/special.py", line 3, in <module>
    from .typing.typeof import typeof
File "/beegfs/homes/bmalone/.virtualenvs/rpbp-test/lib/python3.5/site-packages/numba/typing/__init__.py", line 2, in <module>
    from .context import BaseContext, Context
File "/beegfs/homes/bmalone/.virtualenvs/rpbp-test/lib/python3.5/site-packages/numba/typing/context.py", line 10, in <module>
    from numba.typeconv import Conversion, rules
File "/beegfs/homes/bmalone/.virtualenvs/rpbp-test/lib/python3.5/site-packages/numba/typeconv/rules.py", line 3, in <module>
    from .typeconv import TypeManager, TypeCastingRules
File "/beegfs/homes/bmalone/.virtualenvs/rpbp-test/lib/python3.5/site-packages/numba/typeconv/typeconv.py", line 3, in <module>
    from . import _typeconv, castgraph, Conversion
ImportError: libpython3.5m.so.1.0: cannot open shared object file: No such file or directory

Removing the import stops the problem. The easiest option is to move the parquet import inside the respective functions. Then, using fastparquet simply requires python to be built with --enable-shared.
All of mpl_utils is documented except the plot_stacked_bar_graph function and its get_diff_counts helper.
It would be convenient if the various apply functions in dask_utils could accept None as the dask_client. In these cases, the implementation can fall back to the respective function in pd_utils.
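A minimal sketch of the fallback logic, assuming a serial row-wise helper (the names `apply` and `apply_df` here are illustrative, not necessarily the existing API):

```python
import pandas as pd

def apply_df(df, func, *args, **kwargs):
    """Serial fallback: apply func to each row of df."""
    return [func(row, *args, **kwargs) for _, row in df.iterrows()]

def apply(df, func, *args, dask_client=None, **kwargs):
    """Apply func to each row, distributing with dask when a client is given.

    If dask_client is None, fall back to the serial pandas version.
    """
    if dask_client is None:
        return apply_df(df, func, *args, **kwargs)

    # submit one task per row and gather the results from the cluster
    futures = [
        dask_client.submit(func, row, *args, **kwargs)
        for _, row in df.iterrows()
    ]
    return dask_client.gather(futures)
```

Since the serial path takes the same arguments as the distributed one, callers do not need to branch on whether a client is available.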
Hi,

Per https://github.com/scikit-learn/sklearn-pypi-package#brownout-schedule, the sklearn package is on a brownout schedule, causing our builds to fail at specific times of the day due to pyllars' dependency on it. I believe that changing this line to say scikit-learn is all it takes to solve the problem: https://github.com/bmmalone/pyllars/blob/master/setup.py#L35. PR coming with the change shortly.
The idea is that this function applies a function to overlapping rows in a data frame (that is, a "rolling window"). A basic implementation is as follows:

import tqdm

def apply_rolling_window(df, func, window_size, progress_bar=False):
    """Apply func to each window_size-row slice of df.

    The final windows are truncated when fewer than window_size rows remain.
    """
    it = range(len(df))
    if progress_bar:
        it = tqdm.trange(len(df))

    ret = [
        func(df.iloc[i:i+window_size])
        for i in it
    ]
    return ret
We need to document the following modules:
collection_utils
dask_utils
deprecated_decorator
external_sparse_matrix_list
gene_ontology_utils
hyperparameter_utils
incremental_gaussian_estimator
latex_utils
logging_utils
math_utils
matrix_utils
missing_data_utils
ml_utils
mpl_utils
mygene_utils
nlp_utils
pandas_utils
physionet_utils
scip_utils
shell_utils
sparse_vector
ssh_utils
stats_utils
string_utils
suppress_stdout_stderr
utils
validation_utils
The label encoder le_ is not persisted when the model is dumped to disk with joblib, so the model cannot be used for classification prediction when reloaded.
Here are example implementations:

from typing import Sequence
import numpy as np

def intersect_masks(masks: Sequence[np.ndarray]) -> np.ndarray:
    """Return the elementwise AND of the given boolean masks."""
    return np.all(list(masks), axis=0)

def union_masks(masks: Sequence[np.ndarray]) -> np.ndarray:
    """Return the elementwise OR of the given boolean masks."""
    return np.any(list(masks), axis=0)
Currently, this function is very complicated, and its parameters are highly non-obvious. Simplify it to plot a single ROC curve. Users can call it multiple times if desired.
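A simplified single-curve version might look like the following sketch; the function name and parameters are illustrative, not the existing API:

```python
import matplotlib.pyplot as plt
from sklearn.metrics import auc, roc_curve

def plot_roc_curve(y_true, y_score, ax=None, label=None, **plot_kwargs):
    """Plot a single ROC curve on the given axis.

    Callers who want several curves on one plot can call this once per
    curve, passing the same axis each time.
    """
    if ax is None:
        _, ax = plt.subplots()

    fpr, tpr, _ = roc_curve(y_true, y_score)
    roc_auc = auc(fpr, tpr)

    if label is not None:
        label = "{} (AUC = {:.2f})".format(label, roc_auc)

    ax.plot(fpr, tpr, label=label, **plot_kwargs)
    ax.plot([0, 1], [0, 1], linestyle="--", color="gray")  # chance line
    ax.set_xlabel("False positive rate")
    ax.set_ylabel("True positive rate")

    if label is not None:
        ax.legend()

    return ax
```

Pushing the multi-curve logic to the caller keeps the parameter list short and makes the single-curve behavior obvious.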
Here is a basic example:

import numpy as np

def sample_dirichlet_multinomial(dirichlet_alphas: np.ndarray, num_samples: int) -> np.ndarray:
    """Draw multinomial counts whose probabilities are sampled from a Dirichlet."""
    pvals = np.random.dirichlet(dirichlet_alphas)
    sampled_counts = np.random.multinomial(n=num_samples, pvals=pvals)
    return sampled_counts
This function should return two maps which map from items to indices and back from indices to items. It should also check that the items are unique (or the reverse map will not work). That is, this should be a bijective mapping. Here is a basic example:

index_map = {
    c: i for i, c in enumerate(items)
}

# if the items are not unique, the mapping cannot be bijective
assert len(index_map) == len(items)

reverse_index_map = {
    i: c for c, i in index_map.items()
}
Right now, old versions of pyyaml cause the following error: AttributeError: module 'yaml' has no attribute 'full_load'. Additionally, the existing call to "load" there is unsafe without an explicit Loader.
For large jobs, dask sometimes times out when retrieving individual future results. It is not clear why this happens. future.result has a timeout parameter, so that can be used to avoid indefinite hangs waiting for specific jobs. The function should probably also have an option to return timed-out results.
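A sketch of the retrieval loop with a timeout; the `return_timeouts` option and the `collect_results` name are illustrative. It is written against the generic `result(timeout=...)` interface, so it works with both concurrent.futures and dask.distributed futures:

```python
import asyncio
from concurrent.futures import TimeoutError as CFTimeoutError

# dask.distributed historically raises asyncio.TimeoutError on timeout,
# while concurrent.futures has its own TimeoutError class
_TIMEOUT_ERRORS = (TimeoutError, asyncio.TimeoutError, CFTimeoutError)

def collect_results(futures, timeout=None, return_timeouts=False):
    """Collect results from futures, skipping any that time out.

    Works with any future exposing result(timeout=...), such as
    dask.distributed.Future or concurrent.futures.Future.
    """
    results = []
    timed_out = []

    for f in futures:
        try:
            results.append(f.result(timeout=timeout))
        except _TIMEOUT_ERRORS:
            timed_out.append(f)

    if return_timeouts:
        return results, timed_out

    return results
```

Returning the timed-out futures (rather than discarding them) lets the caller retry or cancel them later.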
In particular, reading an excel file currently raises the following warning or similar:
DEBUG : The guessed filetype was: excel
/home/bmalone/.virtualenvs/nes-ehr/lib/python3.6/site-packages/pandas/util/_decorators.py:188: FutureWarning: The `sheetname` keyword is deprecated, use `sheet_name` instead
return func(*args, **kwargs)
Update sheetname to sheet_name.
The pandas_utils module refers to a logger which is not defined, causing a NameError.
This dependency is slow to install and causes problems if not handled carefully (#4). Just remove this functionality.
The internal documentation is largely compatible with sphinx (sklearn-style). Fix any improperly formatted documentation and build it with sphinx.
By default, this removes all tick labels. The name should be changed.
The stats_utils.calculate_univariate_gaussian_kl function aims to be numerically stable by performing most calculations in logspace. However, it is not clear that the equations are correct.
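For reference, the closed-form KL divergence between univariate Gaussians p = N(mu_p, var_p) and q = N(mu_q, var_q) is log(sigma_q/sigma_p) + (var_p + (mu_p - mu_q)^2) / (2 * var_q) - 1/2. A direct (non-logspace) implementation like the following could be used to cross-check the logspace version:

```python
import numpy as np

def univariate_gaussian_kl(mean_p, var_p, mean_q, var_q):
    """KL(p || q) for univariate Gaussians, computed directly.

    KL = log(sigma_q / sigma_p)
         + (var_p + (mean_p - mean_q)**2) / (2 * var_q)
         - 1/2
    """
    return (
        0.5 * np.log(var_q / var_p)
        + (var_p + (mean_p - mean_q) ** 2) / (2 * var_q)
        - 0.5
    )
```

Spot checks such as KL(p || p) = 0 and KL > 0 for distinct distributions would catch sign or term errors in the logspace equations.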
The add_logging_option function sets all defaults within the function body.

https://pyllars.readthedocs.io/en/stable/_modules/pyllars/logging_utils.html#add_logging_options
In some cases, such as when constructing a pipeline where the call to fit is not direct, it is convenient to specify the optimization metric when the wrapper is constructed.
The following is the problematic code:

if is_max:
    ex_vals = groups[ex_field].idxmax()
elif is_min:
    ex_vals = groups[ex_field].idxmin()

ex_rows = df.loc[ex_vals]

If the index of df is not unique, then the loc line can match multiple rows per group.
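One way to sidestep this, assuming nothing downstream depends on the original index, is to reset the index before grouping so the labels returned by idxmax/idxmin are unique positions. The function name and signature below are a simplified sketch, not necessarily the existing one:

```python
import pandas as pd

def get_group_extreme(df, ex_field, group_fields, is_max=True):
    """Return the row with the max (or min) ex_field within each group.

    Resetting the index guarantees that idxmax/idxmin return unique
    labels, so .loc matches exactly one row per group.
    """
    df = df.reset_index(drop=True)
    groups = df.groupby(group_fields)

    if is_max:
        ex_vals = groups[ex_field].idxmax()
    else:
        ex_vals = groups[ex_field].idxmin()

    return df.loc[ex_vals]
```

With a duplicated index such as [0, 0, 1, 1], the original code would return multiple rows per group; after the reset, each group yields exactly one row.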
This function should take in a class name and return the class. It is just a light wrapper around importlib.import_module and should look similar to the following.

import importlib

def get_class(fully_qualified_class_name):
    """ Convert the string version of a class to the class object

    For example, for the input "keras.optimizers.Adam", this function
    will return the Adam class. It could then be called to create
    an instance of an Adam optimizer.

    Parameters
    ----------
    fully_qualified_class_name : str
        The name of the class

    Returns
    -------
    class : type
        The type of the class
    """
    sp = fully_qualified_class_name.split(".")
    module_name = ".".join(sp[:-1])
    class_name = sp[-1]

    m = importlib.import_module(module_name)
    clazz = getattr(m, class_name)
    return clazz
Right now, some functions include an underscore and others do not. Matplotlib does not use the underscore, so use fontsize everywhere.