capitalone / rubicon-ml

Capture all information throughout your model's development in a reproducible way and tie results directly to the model code!

Home Page: https://capitalone.github.io/rubicon-ml/

License: Apache License 2.0

Topics: python, data-science, model-development, reproducibility, exploration

rubicon-ml's Introduction

rubicon-ml


Purpose

rubicon-ml is a data science tool that captures and stores model training and execution information, like parameters and outcomes, in a repeatable and searchable way. Its git integration associates these inputs and outputs directly with the model code that produced them to ensure full auditability and reproducibility for both developers and stakeholders alike. While experimenting, the dashboard makes it easy to explore, filter, visualize, and share recorded work.

p.s. If you're looking for Rubicon, the Java/ObjC Python bridge, visit that project instead.


Components

rubicon-ml is composed of three parts:

  • A Python library for storing and retrieving model inputs, outputs, and analyses to filesystems that’s powered by fsspec
  • A dashboard for exploring, comparing, and visualizing logged data built with dash
  • And a process for sharing a selected subset of logged data with collaborators or reviewers that leverages intake

Workflow

Use rubicon_ml to capture model inputs and outputs over time. It can be easily integrated into existing Python models or pipelines and supports both concurrent logging (so multiple experiments can be logged in parallel) and asynchronous communication with S3 (so network reads and writes won’t block).
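
For example, here's a minimal sketch of concurrent logging with the standard library's multiprocessing (the project and parameter names are illustrative):

import multiprocessing

from rubicon_ml import Rubicon

def log_experiment(project, n_estimators):
    experiment = project.log_experiment(model_name="My Model Name")
    experiment.log_parameter("n_estimators", n_estimators)

if __name__ == "__main__":
    rubicon = Rubicon(persistence="filesystem", root_dir="/rubicon-root")
    project = rubicon.get_or_create_project("Concurrent Example")

    # each process logs its own experiment; rubicon_ml persists them independently
    processes = [
        multiprocessing.Process(target=log_experiment, args=(project, n))
        for n in (25, 50, 100)
    ]
    for process in processes:
        process.start()
    for process in processes:
        process.join()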

Meanwhile, periodically review the logged data within the Rubicon dashboard to steer the model tweaking process in the right direction. The dashboard lets you quickly spot trends by exploring and filtering your logged results and visualizes how the model inputs impacted the model outputs.

When the model is ready for review, Rubicon makes it easy to share specific subsets of the data with model reviewers and stakeholders, giving them the context necessary for a complete model review and approval.

Use

Check out the interactive notebooks in this Binder to try rubicon_ml for yourself.

Here's a simple example that trains a classifier and logs its inputs and outputs:

from collections import namedtuple

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

from rubicon_ml import Rubicon

rubicon = Rubicon(
    persistence="filesystem", root_dir="/rubicon-root", auto_git_enabled=True
)

project = rubicon.create_project(
    "Hello World", description="Using rubicon to track model results over time."
)

# a lightweight record of where the training data came from
SklearnTrainingMetadata = namedtuple("SklearnTrainingMetadata", "module_name method")

# train a simple classifier to have something to log
n_estimators, n_features, random_state = 100, 20, 42
X, y = make_classification(n_features=n_features, random_state=random_state)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=random_state)
rfc = RandomForestClassifier(n_estimators=n_estimators, random_state=random_state)
rfc.fit(X_train, y_train)

experiment = project.log_experiment(
    training_metadata=[SklearnTrainingMetadata("sklearn.datasets", "my-data-set")],
    model_name="My Model Name",
    tags=["my_model_name"],
)

experiment.log_parameter("n_estimators", n_estimators)
experiment.log_parameter("n_features", n_features)
experiment.log_parameter("random_state", random_state)

accuracy = rfc.score(X_test, y_test)
experiment.log_metric("accuracy", accuracy)

Then explore the project by running the dashboard:

rubicon_ml ui --root-dir /rubicon-root

Documentation

For a full overview, visit the docs. If you have suggestions or find a bug, please open an issue.

Install

The Python library is available on conda-forge via conda and on PyPI via pip.

conda config --add channels conda-forge
conda install rubicon-ml

or

pip install rubicon-ml

Develop

The project uses conda to manage environments. First, install conda. Then use conda to set up a development environment:

conda env create -f environment.yml
conda activate rubicon-ml-dev

Finally, install rubicon_ml locally into the newly created environment.

pip install -e ".[all]"

Testing

The tests are separated into unit and integration tests. They can be run directly in the activated dev environment via pytest tests/unit or pytest tests/integration, or all at once by simply running pytest.

Note: some integration tests are intentionally marked to control when they are run (i.e., not during CI/CD). These tests include:

  • Integration tests that write to physical filesystems - local and S3. Local files will be written to ./test-rubicon relative to where the tests are run. An S3 path must also be provided to run these tests. By default, these tests are disabled. To enable them, run:

    pytest -m "write_files" --s3-path "s3://my-bucket/my-key"
    
  • Integration tests that run Jupyter notebooks. These tests are a bit slower than the rest of the tests in the suite as they need to launch Jupyter servers. By default, they are enabled. To disable them, run:

    pytest -m "not run_notebooks and not write_files"
    

    Note: When simply running pytest, -m "not write_files" is the default. So, we need to also apply it when disabling notebook tests.

Code Formatting

Install and configure pre-commit to automatically run black, flake8, and isort during commits:
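
pip install pre-commit
pre-commit install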

Now pre-commit will run automatically on git commit and will ensure consistent code format throughout the project. You can format without committing via pre-commit run or skip these checks with git commit --no-verify.

rubicon-ml's People

Contributors

andreafehrman, armaan-dhillon, dianelee217, fdosani, gforsyth, github-actions[bot], joe-wolfe21, mbseid, mend-for-github-com[bot], ryansoley, shania-m, sonaalthaker, stephenpardy, tejass9922, tmbjmu


rubicon-ml's Issues

preserve logging order on fetches

Is your enhancement request related to a problem? Please describe

rubicon objects are returned back to the user in an indeterminate order during fetches.

Describe the solution you'd like

rubicon objects should be returned back to the user in the same order they were logged during fetches

Additional context

#114 updated rubicon to preserve the logging order of artifacts when retrieved back via the library. it only required a one-line change using the built-in sort method

https://github.com/capitalone/rubicon-ml/blob/main/rubicon_ml/repository/base.py#L329

this same change should be applied to each of the get_{entity}s_metadata functions in the repository

  • note that this does not include the singular get_{entity}_metadata functions - these always return only one thing, so there's no need to sort
  • this change should be able to fully take place in the repository layer

tests should also be updated to cover the new cases. adding the same simple test from #114 to each of the newly sorted fetches should be sufficient
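
a minimal sketch of what that could look like in one of the plural getters (the internals here are illustrative, not the actual repository code):

def get_experiments_metadata(self, project_name):
    metadata_root = f"{self.root_dir}/{project_name}/experiments"
    metadata = [
        self._read_metadata(path)
        for path in self.filesystem.ls(metadata_root)
    ]
    # ls() makes no ordering guarantee, so sort on the logged creation
    # time to return experiments in the order they were logged
    return sorted(metadata, key=lambda m: m.created_at)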

remove memory fs workarounds

Is your enhancement request related to a problem? Please describe
there was a bug with the memory filesystem in fsspec, introduced in 0.8.5 and resolved as of 0.8.7. we no longer need the workarounds for that bug in rubicon/repository/memory.py and the associated tests

Describe the solution you'd like
remove the workaround to the fsspec bug from rubicon/repository/memory.py

succinct sklearn rubicon pipeline example

Is your feature request related to a problem? Please describe

There should be an sklearn integration example notebook that succinctly demonstrates the overall functionality

Describe the solution you'd like

Create a notebook example that demonstrates how useful the sklearn pipeline integration can be

Lift integrations to top-level within docs

Is your documentation request related to a problem? Please describe

The integrations, like git integration for example, are kind of buried in the examples section of the documentation. We want those to be more easily found.

Describe the solution you'd like

Lift the integrations up as a top level section. This should be done after the sklearn integration is in and should include docs around that.

sklearn integration documented

Is your documentation request related to a problem? Please describe

The docs and readme should document the sklearn integration and provide context/example for use

Describe the solution you'd like

Both docs and readme are updated

standardize getters - experiment

Is your enhancement request related to a problem? Please describe

rubicon currently offers four different types of "getters"

  • get all
  • get one by name
  • get one by ID
  • get many by tags

however, these offerings vary by entity type. experiments, for example, currently have "get many by tags," "get one by ID" and "get all"

Describe the solution you'd like

experiments should also have the ability to "get one by name"

specifically, def experiment(self, id) should be updated to def experiment(self, name=None, id=None) where either name or id is required (but not both)

https://github.com/capitalone/rubicon-ml/blob/main/rubicon_ml/client/project.py#L231

there is a caveat that makes this a bit different than #131, #133 and #135 - experiment names are not unique. experiment, when given a name, could return more than one experiment. we don't want experiment to return multiple experiments, so we should return the most recent experiment with the requested name. perhaps also raise a warning that multiple experiments were found and only one will be returned

then, we should additionally update def experiments(self, tags=[], qtype="or") to def experiments(self, tags=[], qtype="or", name=None) to account for the case where users intentionally have duplicately named experiments and want all of them

https://github.com/capitalone/rubicon-ml/blob/main/rubicon_ml/client/project.py#L263

Additional context

we originally only offered "get one by ID" because experiments are logged by ID. so when we have an ID, we can retrieve that experiment directly from the filesystem. for experiment names, we'll need to read all the experiments then filter them in-memory. for this reason, we originally left "get one by name" off of experiments. however, I find myself doing experiment = [e for e in rubicon.experiments() if e.name == "desired name"][0] to get the one experiment with a specific name often enough, so we may as well just put it in the library

this change should be fully in the client layer
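
a rough sketch of the shape the client method could take (helper names like _get_experiment_by_id are assumptions):

import warnings

def experiment(self, name=None, id=None):
    if (name is None) == (id is None):
        raise ValueError("exactly one of `name` or `id` is required")

    if id is not None:
        return self._get_experiment_by_id(id)  # assumed existing ID-based lookup

    matches = [e for e in self.experiments() if e.name == name]
    if not matches:
        raise ValueError(f"no experiment found with name '{name}'")
    if len(matches) > 1:
        warnings.warn(f"multiple experiments named '{name}' found - returning the most recent")

    # experiment names are not unique, so return the most recent match
    return sorted(matches, key=lambda e: e.created_at)[-1]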

WhiteSource Security Check

Is your feature request related to a problem? Please describe

The WhiteSource Security scan was not running properly. We need to investigate and get the scan triggered to run on PRs

Describe the solution you'd like

The WhiteSource Security scan running on PRs

standardize getters - parameter

Is your enhancement request related to a problem? Please describe

rubicon currently offers four different types of "getters"

  • get all
  • get one by name
  • get one by ID
  • get many by tags

however, these offerings vary by entity type. parameters, for example, currently only have "get all"

Describe the solution you'd like

parameters should also have the ability to "get one by name" and "get one by ID." parameters are not tagged, so they do not need the tags getter

specifically, the experiment client needs a def parameter(self, name=None, id=None) function added as there is currently only the plural, get all function parameters

https://github.com/capitalone/rubicon-ml/blob/main/rubicon_ml/client/experiment.py#L13

Additional context

parameters are logged by name and a singular get_parameter method already exists on the repository layer that can be leveraged

https://github.com/capitalone/rubicon-ml/blob/main/rubicon_ml/repository/base.py#L845

the "get one by name" functionality should be easy to replicate in a similar manner to other singular getters like project

https://github.com/capitalone/rubicon-ml/blob/main/rubicon_ml/client/rubicon.py#L103

for parameter IDs, we'll need to read all the parameters then filter them in-memory. for this reason, we originally left "get one by ID" off of parameters. however, I find myself doing parameter = [p for p in rubicon.parameters() if p.id == "desired ID"][0] to get the one parameter with a specific ID often enough, so we may as well just put it in the library

this change should be fully in the client layer. all entity UUIDs are unique, so there's no need to worry about returning multiple parameters with the same ID

UI issue when used with ipython

Describe the bug
When using Rubicon in a jupyter notebook (ipython), removing files in the filesystem that store the log project data generates .ipynb_checkpoints directories within the root directory. This causes the UI to fail to read and display project and/or experiment data.

Steps/Code to reproduce bug
In an environment with jupyter and rubicon installed:

Create a new jupyter notebook, copy the following script into a cell, and run the cell

from rubicon import Rubicon
root_dir = "./rubicon-root" 
rubicon = Rubicon(persistence="filesystem", root_dir=root_dir)

project = rubicon.create_project("Sample Project")

project.log_experiment("Exp1")
project.log_experiment("Exp2")

From a terminal, navigate to the directory containing the notebook

Run rubicon ui --root-dir rubicon-root - you can see the Sample Project in the UI with two experiments

Quit the UI

Remove one of the experiment folders, run rm -r rubicon-root/experiments/{experiment_id}

Restart the UI with rubicon ui --root-dir rubicon-root - you should see the 'Sample Project' but when selected, there will be no experiments displayed.

Expected behavior
When .ipynb_checkpoints directories exist within the rubicon root directory, I should still be able to see my projects and experiments in the UI.

docs builds are too long

Is your documentation request related to a problem? Please describe

the documentation builds take over an hour when run in github actions. they did not always take this long

Describe the solution you'd like

I'd like the documentation builds to only take a few minutes again

Additional context

it slowed down when we added all the rubicon dependencies to the conda env - we don't need all of them

Preserve Artifact Ordering

Is your enhancement request related to a problem? Please describe

When I log artifacts to an experiment and then later fetch them back, the original order in which they were logged is not preserved.

Describe the solution you'd like

I would like experiment.artifacts() to fetch these artifacts in the same order as they were logged.

Generate code snippet to retrieve selected experiment

Is your feature request related to a problem? Please describe

Within the dashboard, it's possible to select experiment(s) while exploring the logged data. It could be the case that while exploring, I see an experiment that I'd like to retrieve and fetch more information on using the library. This process could be made easier by adding a "copy code snippet" button directly on the dashboard that gives me exactly what I need to fetch the corresponding experiment

Describe the solution you'd like

When an experiment is selected, I'd like to be able to select a button that generates and copies the corresponding code to fetch that experiment using the library

standardize getters - feature

Is your enhancement request related to a problem? Please describe

rubicon currently offers four different types of "getters"

  • get all
  • get one by name
  • get one by ID
  • get many by tags

however, these offerings vary by entity type. features, for example, currently only have "get all"

Describe the solution you'd like

features should also have the ability to "get one by name" and "get one by ID." features are not tagged, so they do not need the tags getter

specifically, the experiment client needs a def feature(self, name=None, id=None) function added as there is currently only the plural, get all function features

https://github.com/capitalone/rubicon-ml/blob/main/rubicon_ml/client/experiment.py#L13

Additional context

features are logged by name and a singular get_feature method already exists on the repository layer that can be leveraged

https://github.com/capitalone/rubicon-ml/blob/main/rubicon_ml/repository/base.py#L635

the "get one by name" functionality should be easy to replicate in a similar manner to other singular getters like project

https://github.com/capitalone/rubicon-ml/blob/main/rubicon_ml/client/rubicon.py#L103

for feature IDs, we'll need to read all the features then filter them in-memory. for this reason, we originally left "get one by ID" off of features. however, I find myself doing feature = [f for f in rubicon.features() if f.id == "desired ID"][0] to get the one feature with a specific ID often enough, so we may as well just put it in the library

this change should be fully in the client layer. all entity UUIDs are unique, so there's no need to worry about returning multiple features with the same ID

Automatic sklearn pipeline logging

Is your feature request related to a problem? Please describe

One way to log training data to Rubicon would be to extend sklearn.pipeline.Pipeline so information could be logged before and/or after each step. We could extend the class and override the fit and predict methods to add optional hooks before and after.

Describe the solution you'd like

Something like...

from sklearn.pipeline import Pipeline

class RubiconPipeline(Pipeline):

    def before_fit(self, X, y=None, **fit_params):
        # logs info from self.steps
        ...

    def after_fit(self, X, y=None, **fit_params):
        # logs info from self.steps after fitting
        ...

    def fit(self, X, y=None, **fit_params):
        self.before_fit(X, y)
        retval = super().fit(X, y=y, **fit_params)
        self.after_fit(X, y)
        return retval

Additional context

Three cases to consider:

  1. Inferred logging from inspecting X's, y's and estimator object
  2. Logging through an extended common Rubicon/SKLearn API (optionally call .rubicon_log methods on estimators)
  3. Logging through user defined functions (UDFs) optionally provided to RubiconPipeline.__init__

automate version string updates

Is your enhancement request related to a problem? Please describe

for each release, we need to manually update the version strings in setup.py and rubicon/__init__.py. this is prone to us forgetting to do so.

Describe the solution you'd like

we use git tags as the source of truth for our versions. the version strings should be populated from this.

Describe alternatives you've considered

versioneer
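
for illustration, here's roughly what this looks like with setuptools_scm (an alternative to versioneer - not necessarily the tool we'll land on):

# setup.py
from setuptools import setup

setup(
    name="rubicon-ml",
    use_scm_version=True,  # derive the version string from the latest git tag
    setup_requires=["setuptools_scm"],
)

# rubicon/__init__.py (requires Python 3.8+)
from importlib.metadata import version

__version__ = version("rubicon-ml")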

more column filtering control on dashboard

Is your enhancement request related to a problem? Please describe

I'd like to be able to toggle hiding all columns on the dashboard. currently, when filtering, all columns are checked and unchecking them hides them. this is a lot of clicks if I have hundreds of columns and only want to show a few of them

Describe the solution you'd like

I'd like either more top-level buttons or some buttons in the show-hide menu to toggle selecting all columns. similar to the select all button for the rows of the data table. it'd be even better if there was a toggle for all metrics or parameters individually

Misc files async fix

Describe the bug

Related to #82, there's an issue when misc files (like dotfiles) make their way into the root dir. This issue was fixed for the sync logging library in #84. This issue is for tackling the fix in the async logging library as well. Also, there have been updates to fsspec's async support, so let's pull them in if appropriate here.

Steps/Code to reproduce bug

In an environment with jupyter and rubicon installed:

from rubicon import Rubicon
root_dir = "./rubicon-root" 
rubicon = Rubicon(persistence="filesystem", root_dir=root_dir)

project = rubicon.create_project("Sample Project")

project.log_experiment("Exp1")
project.log_experiment("Exp2")

Remove one of the experiment folders, run rm -r rubicon-root/experiments/{experiment_id}

Restart the UI with rubicon ui --root-dir rubicon-root - you should see the 'Sample Project' but when selected, there will be no experiments displayed.

Expected behavior

When .ipynb_checkpoints directories exist within the rubicon root directory, I should still be able to see my projects and experiments in the UI.

standardize getters - metric

Is your enhancement request related to a problem? Please describe

rubicon currently offers four different types of "getters"

  • get all
  • get one by name
  • get one by ID
  • get many by tags

however, these offerings vary by entity type. metrics, for example, currently only have "get all"

Describe the solution you'd like

metrics should also have the ability to "get one by name" and "get one by ID." metrics are not tagged, so they do not need the tags getter

specifically, the experiment client needs a def metric(self, name=None, id=None) function added as there is currently only the plural, get all function metrics

https://github.com/capitalone/rubicon-ml/blob/main/rubicon_ml/client/experiment.py#L13

Additional context

metrics are logged by name and a singular get_metric method already exists on the repository layer that can be leveraged

https://github.com/capitalone/rubicon-ml/blob/main/rubicon_ml/repository/base.py#L740

the "get one by name" functionality should be easy to replicate in a similar manner to other singular getters like project

https://github.com/capitalone/rubicon-ml/blob/main/rubicon_ml/client/rubicon.py#L103

for metric IDs, we'll need to read all the metrics then filter them in-memory. for this reason, we originally left "get one by ID" off of metrics. however, I find myself doing metric = [m for m in rubicon.metrics() if m.id == "desired ID"][0] to get the one metric with a specific ID often enough, so we may as well just put it in the library

this change should be fully in the client layer. all entity UUIDs are unique, so there's no need to worry about returning multiple metrics with the same ID

Address documentation feedback around content

Is your enhancement request related to a problem? Please describe

Issue #21 raises a few content-related requests for the docs in their current state:

for a new concept like this, the initial leadup showing the scope/purpose is critically important. As far as I understand, really it's about storing machine running parameters and outcomes in a repeatable and searchable way: the set of runs becomes itself data ("AI on AI")

I would love to see some motivation around prefect integration. Will other runners be included?

would you recommend Rubicon logging for any non-machine-learning workflow ?

on behalf of my colleagues, why Dash over holoviz? Is this pluggable?

why "rubicon" ? :) A name like this should be explained.

Describe the solution you'd like

Would like to see additional content provided in the scope/purpose and the rest of these questions answered within the FAQs

standardize getters - project

Is your enhancement request related to a problem? Please describe

rubicon currently offers four different types of "getters"

  • get all
  • get one by name
  • get one by ID
  • get many by tags

however, these offerings vary by entity type. projects, for example, currently have "get one by name" and "get all"

Describe the solution you'd like

projects should also have the ability to "get one by ID." projects are not tagged, so they do not need the tags getter

specifically, def get_project(self, name) should be updated to def get_project(self, name=None, id=None) where either name or id is required (but not both)

https://github.com/capitalone/rubicon-ml/blob/main/rubicon_ml/client/rubicon.py#L103

Additional context

we originally only offered "get one by name" because projects are logged by name. so when we have a name, we can retrieve that project directly from the filesystem. for project IDs, we'll need to read all the projects then filter them in-memory. for this reason, we originally left "get one by ID" off of projects. however, I find myself doing project = [p for p in rubicon.projects() if p.id == "desired ID"][0] to get the one project with a specific ID often enough, so we may as well just put it in the library

this change should be fully in the client layer, and can probably be implemented with the line of code I just posted above. all entity UUIDs are unique, so there's no need to worry about returning multiple projects with the same ID

"Rubicon" name collides with existing project in the Python ecosystem

Describe the bug

I was just made aware of this project via a PyCon US announcement email.

The problem: the name you've chosen for this project collides with an existing project in the Python ecosystem.

I've been using the name Rubicon in the Python ecosystem since 2014. I'm the owner of the Rubicon record in PyPI, as well as some related projects.

These projects are in active use in the Python community, and the Java subproject received funding (indirectly) from the PSF through their support of the BeeWare Android port.

I can only assume this is something you were at least partially aware of, because you've chosen the name rubicon-ml for your PyPI package, and changed the name of the package in setup.py.

Although the projects are in a different domain (language bridging vs numerical processing), I'd argue there is potential for confusion since they're both active projects in the same language ecosystem, and there is some usage of BeeWare tooling in the numerical processing community.

I humbly request you choose a different name for your project that doesn't collide with my pre-existing usage.

Support dataframe.plot

Is your feature request related to a problem? Please describe

The plotting functionality for dataframes raises a NotImplementedError. It was dependent on an internal visualization library and will need to be switched to an open source library under the HoloViz ecosystem.

Describe the solution you'd like

Reimplement the convenience method dataframe.plot using hvplot
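
A minimal sketch, assuming the dataframe client can hand back its underlying pandas data via a get_data() method (an illustrative name):

import hvplot.pandas  # noqa: F401 - registers the .hvplot accessor on pandas objects

def plot(self, **plot_kwargs):
    # delegate to hvplot, which renders through the HoloViz ecosystem
    return self.get_data().hvplot(**plot_kwargs)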

Docs link in the readme takes you to a 404 page

Is your documentation request related to a problem? Please describe
When I click the docs link in the documentation section of the readme, it takes me here (https://capitalone.github.io/rubicon/) and I get a 404 page saying "There isn't a GitHub pages site here". It looks like the link in the readme still points to the old project name instead of rubicon-ml. If I go to (https://capitalone.github.io/rubicon-ml/) I can see the docs as expected.

Describe the solution you'd like
The readme docs url updated with the new project name.

Optuna Integration

Is your feature request related to a problem? Please describe
Optuna is an automatic hyperparameter optimization software framework, particularly designed for machine learning. It features an imperative, define-by-run style user API. Thanks to our define-by-run API, the code written with Optuna enjoys high modularity, and the user of Optuna can dynamically construct the search spaces for the hyperparameters.

We should investigate ways to integrate with Optuna so that it logs studies and trials to Rubicon experiments automatically.

Describe the solution you'd like
Optuna has a storage module for providing a way to persist hp tuning information. We should review the storage API to consider how it overlaps with the Rubicon client or repositories. Optuna storages look to be fairly pluggable.

Here's an example usage:

from rubicon.optuna import RubiconStorage
import optuna

def objective(trial):
    x = trial.suggest_uniform("x", -10, 10)
    return (x - 2) ** 2

storage = RubiconStorage(persistence="filesystem", root_dir="/rubicon-root", auto_git_enabled=True)
study = optuna.create_study(storage=storage)
study.optimize(objective, n_trials=100)

project = storage.get_project()
print(f"{len(project.experiments)} logged.") 

Additional context
Dask Optuna implements a DaskStorage backend.

Configurable pagination for experiment table

Is your enhancement request related to a problem? Please describe

By default the experiment table has a page size of 10 experiments. It'd be nice for this to be configurable on the dashboard or by specifying a CLI argument when spinning up the dashboard so the pagination could be more flexible.

Describe the solution you'd like

Configurable page size for the experiment table by either CLI arg or as an option within the dashboard

Relax pinned dependencies

Is your enhancement request related to a problem? Please describe

It might be wise to take a look at our existing dependencies in setup.py and see if we can loosen them. We've already had one issue (#65) that spawned from too strict dependencies.

Describe the solution you'd like

Take a look and adjust dependency pins (or unpin) as needed

test example notebooks in scheduled action

Is your feature request related to a problem? Please describe

as maintainers, we don't always run the examples enough to know if they're up to date & working

Describe the solution you'd like

using a combo of nbformat to read and nbconvert to execute, I'd like to create a test suite that validates each of the examples in the notebooks directory runs against the latest version of rubicon without error. these tests will likely be longer running, especially as we add examples, so we should run them as scheduled actions rather than as part of the PR review pipeline.

https://nbformat.readthedocs.io/en/latest/api.html

https://nbconvert.readthedocs.io/en/latest/

maybe with the expectation that we manually run them any time there's a PR with example updates? it looks like it may be possible to only run these tests when certain files change, so that's an option too

https://github.community/t/is-it-possible-to-run-the-job-only-when-a-specific-file-changes/115484
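
a rough sketch of the nbformat + nbconvert combo:

import nbformat
from nbconvert.preprocessors import ExecutePreprocessor

def run_notebook(notebook_path):
    # read the notebook and execute it top-to-bottom -
    # a failing cell raises a CellExecutionError
    notebook = nbformat.read(notebook_path, as_version=4)
    ExecutePreprocessor(timeout=600).preprocess(notebook)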

PyCon workshop materials

Is your feature request related to a problem? Please describe

PyCon workshop materials are due April 28th

Describe the solution you'd like

Create and send over all required materials for the PyCon conference

Support filtering out params within experiment table / comparison plot

Is your enhancement request related to a problem? Please describe

When lots of params are logged, the experiment table & corresponding comparison plot become a little overwhelming.

Describe the solution you'd like

I'd like the ability to filter out specific params so that the experiment table & comparison plot becomes less cluttered as I explore the data.

Additional context

This could be done by filtering out the cols within the experiment table itself, since the comparison plot pulls its data from there, or by adding filtering solely on the comparison plot itself

[BUG] Dashboard unable to load experiments which lack metrics/parameters

Describe the bug
The Rubicon Dash-powered Dashboard is unable to handle experiments that do not contain metrics and parameters.

Steps/Code to reproduce bug

To encounter the bug, run the following python script to generate data.

from rubicon import Rubicon

root_dir = "./rubicon-root" # Replace me with an S3 path if desired
rubicon = Rubicon(persistence="filesystem", root_dir=root_dir)

# Create a project
project = rubicon.create_project("Sample Project")

# Create three empty experiments (note that these do not have metrics/parameters!)
project.log_experiment("Exp1")
project.log_experiment("Exp2")
project.log_experiment("Exp3")

Now attempt to load the rubicon-root project folder in the Dashboard. I have gotten two varieties of failure:

Error 1: Running in a container environment

  1. Click on the "Sample Project" button
  2. Check the developer console for the following error:
 ---dash_renderer.min.js:20 
Object
html: "<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2 Final//EN">↵<title>500 Internal Server Error</title>↵<h1>Internal Server Error</h1>↵<p>The server encountered an internal error and was unable to complete your request. Either the server is overloaded or there is an error in the application.</p>↵"
message: "Callback error updating grouped-project-explorer.children"--- 

Error 2: Running the Dashboard locally

I encountered a purely silent failure while running the Dash app locally.

I could open the Dashboard, but I didn't even see the "Sample Project" button.

Expected behavior

Given the following generated data:

from rubicon import Rubicon

root_dir = "./rubicon-root" # Replace me with an S3 path if desired
rubicon = Rubicon(persistence="filesystem", root_dir=root_dir)

# Create a project
project = rubicon.create_project("Sample Project")

# Create three empty experiments (note that these do not have metrics/parameters!)
project.log_experiment("Exp1")
project.log_experiment("Exp2")
project.log_experiment("Exp3")

I would expect to be able to load this project in the Dashboard and view my 3 experiments.

Additional Context

The following code snippet will generate data that does work with the current Dashboard.

The key difference between this and the code above is that we generate metrics and parameters for each experiment.

from rubicon import Rubicon

root_dir = "./rubicon-root"
rubicon = Rubicon(persistence="filesystem", root_dir=root_dir)

# Create a project
project = rubicon.create_project("Working Project")

# Create two experiments
exp1 = project.log_experiment("Exp1")
exp1.log_parameter("param1", "a")
exp1.log_metric("accuracy", 90)

exp2 = project.log_experiment("Exp2")
exp2.log_parameter("param1", "b")
exp2.log_metric("accuracy", 45)

Documentation feedback

Thank you for making this code public!
It looks like a nice and comprehensive package with a good API. You could hang a lot of functionality onto this. I have not tried to run anything, but here are some thoughts on the current documentation.

  • for a new concept like this, the initial leadup showing the scope/purpose is critically important. As far as I understand, really it's about storing machine running parameters and outcomes in a repeatable and searchable way: the set of runs becomes itself data ("AI on AI")
  • if possible, I would strongly recommend providing the example notebook as a runnable binder and a short video of it in action. To simulate s3, you could use moto-server or minio.
  • is there any getting around potential users having to pepper their code with logger statements?
  • you may be interested in intake-sklearn, with which you can expose json serialised sklearn models/artefacts as datasets
  • I bet people do some interesting things with the git hooks; this part is a little buried. Note that fsspec has a git backend, you can walk the tree at any branch/tag/hash
  • I would love to see some motivation around prefect integration. Will other runners be included?
  • would you recommend Rubicon logging for any non-machine-learning workflow ?
  • on behalf of my colleagues, why Dash over holoviz? Is this pluggable?
  • why "rubicon" ? :) A name like this should be explained.

run all formatting checks in test workflow

Is your enhancement request related to a problem? Please describe

the formatting checks we enforce locally via pre-commit are not enforced in our CI

Describe the solution you'd like

black, isort and flake8 should each be run against any opened PR and fail the CI if the formatting is not correct.
they should also each be updated to the most recent version

Relax S3FS version requirement to support wider python ecosystem

Describe the bug
Currently Rubicon requires s3fs>=0.5.1. There is a wide range of libraries that require s3fs version 0.4, which causes incompatibility errors. Since fsspec doesn't require a specific version of s3fs, the 0.5.1 pin can be relaxed for wider compatibility.

Steps/Code to reproduce bug
Install a package that requires s3fs version 0.4, then install rubicon-ml and rubicon[ui]. When you start the rubicon ui, it throws the error:

pkg_resources.ContextualVersionConflict: (s3fs 0.4.2 (...), Requirement.parse('s3fs>=0.5.1'), {'rubicon', 'rubicon-ml'})

Expected behavior
Would love to see rubicon work with libraries that require older versions of s3fs.

Additional context
Link to fsspec requirements: https://github.com/intake/filesystem_spec/blob/master/setup.py#L50

[spike] investigate existing `intake` integration

Is your enhancement request related to a problem? Please describe

the existing intake related code in rubicon-ml (Rubicon.publish and rubicon_ml/intake_rubicon) hasn't been looked at since it was written a while ago. we need to see if its current state is acceptable.

Describe the solution you'd like

catalog any issues with the existing intake code in rubicon-ml and open issues for each of them

Additional context

some things to consider

  • we call Rubicon.publish(project, other_args) instead of just project.publish(other_args)
    • we've gotten identical feedback on the dashboard that having to use the Rubicon object is clunky
  • personally, I think it's a bit confusing how we catalog the project and all the experiments, then use intake to load individual experiments instead of letting the project load them
    • I don't currently have any thoughts on what should be done about this (maybe its fine as is), just wanna take a look

Revisit all examples

Is your documentation request related to a problem? Please describe

Based on some feedback, there are areas to improve within our examples.

Describe the solution you'd like

Make sure all examples within notebooks, docs and readme are still up to date and relevant. Provide more context where it makes sense to do so.

standardize getters - artifact

Is your enhancement request related to a problem? Please describe

rubicon currently offers four different types of "getters"

  • get all
  • get one by name
  • get one by ID
  • get many by tags

however, these offerings vary by entity type. artifacts, for example, currently have "get all"

Describe the solution you'd like

artifacts should also have the ability to "get one by id" and "get one by name" (they should probably be able to be tagged too, but that's way out of the scope of this ticket. we can circle back to that)

specifically, def artifact(self, name=None, id=None) should be created where either name or id is required (but not both) and added to the ArtifactMixin

https://github.com/capitalone/rubicon-ml/blob/main/rubicon_ml/client/mixin.py#L32

there is already a get_artifact_metadata method on the base repository, so that can be used to construct the artifact method on the client in a similar way to the last few tickets like this one

there is a caveat that makes this a bit different than #131, #133 and #135 - artifact names are not unique. artifact, when given a name, could return more than one artifact. we don't want artifact to return multiple artifacts, so we should return the most recent artifact with the requested name. perhaps also raise a warning that multiple artifacts were found and only one will be returned

then, we should additionally update def artifacts(self) to def artifacts(self, name=None) to account for the case where users intentionally have duplicately named artifacts and want all of them

https://github.com/capitalone/rubicon-ml/blob/main/rubicon_ml/client/mixin.py#L191

Additional context

we originally only offered "get one by ID" because artifacts are logged by ID. so when we have an ID, we can retrieve that artifact directly from the filesystem. for artifact names, we'll need to read all the artifacts then filter them in-memory. for this reason, we originally left "get one by name" off of artifacts. however, I find myself doing artifact = [e for e in rubicon.artifacts() if e.name == "desired name"][0] to get the one artifact with a specific name often enough, so we may as well just put it in the library

this change should be fully in the client layer

launch dashboard from `intake` catalogs

Is your enhancement request related to a problem? Please describe

some users aren't going to want to look at rubicon-ml objects, they're gonna want things like CSVs or a dashboard

Describe the solution you'd like

the plugin's read function should still return rubicon-ml objects, but we should expose other options from the catalog like read_as_csv or launch_dashboard

edit - 02/01/22: for now, this ticket will be for implementing the dashboard option mentioned above as well as exploring the feasibility of other formats, like csv

Describe alternatives you've considered

both of those examples are things that could be done if we load the project via intake and call one or two functions on it, but if our model reviewing users are going to commit to familiarizing themselves with intake, it'd be best if they never had to touch a rubicon-ml object if they don't want to

Logging parameters used in pipeline's fit method

Is your feature request related to a problem? Please describe
The Rubicon pipeline stores the parameters of each estimator inside the pipeline when its fit method is called. But sometimes we might want to inject a parameter into the pipeline's fit method for a specific step. Currently, the user needs to manually log those parameters. It would be useful to log them automatically.

Describe the solution you'd like
The fit method here could log the fit_params passed in, in addition to logging each estimator's parameters.

Additional context
The following code shows an example. When the fit method is called from the pipeline, the degree parameter of SVC is logged automatically. But when we pass sample_weight specifically for the svc step of the pipeline, that parameter is not logged.

import os

import numpy as np
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from rubicon_ml import Rubicon
from rubicon_ml.sklearn import FilterEstimatorLogger, RubiconPipeline


def main():
    root_dir = os.environ.get("EXPERIMENT", "experiment")
    root_path = f"{os.path.dirname(os.getcwd())}/{root_dir}"
    rubicon = Rubicon(persistence="filesystem", root_dir=root_path)
    project = rubicon.get_or_create_project("Sklearn-Pipeline")

    pipeline = RubiconPipeline(
        project,
        [
            ("scaler", StandardScaler()),
            ("svc", SVC()),
        ],
        experiment_kwargs={
            "name": "logged from a RubiconPipeline",
            "model_name": "Rubicon_pipeline",
            "tags": "Rubicon_pipeline",
        },
    )

    X, y = make_classification(random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
    pipeline.fit(X_train, y_train, svc__sample_weight=0.1 * np.ones(len(X_train)))
    print(pipeline.score(X_test, y_test))

    experiment = project.experiments()[0]

    for p in experiment.parameters():
        # printing out a parameter
        if p.name == "svc__degree":
            print("svc degree: ", p.value)
        # Currently this will not be printed and needs to be logged manually
        if p.name == "svc__sample_weight":
            print("svc fit sample weight", p.value)


if __name__ == "__main__":
    main()
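
For illustration, the fix could look something like this inside RubiconPipeline.fit (a sketch - the experiment handle is an assumption about the pipeline's internals):

def fit(self, X, y=None, **fit_params):
    retval = super().fit(X, y, **fit_params)
    # log step-specific fit params like "svc__sample_weight" alongside
    # the estimator parameters the pipeline already logs
    for name, value in fit_params.items():
        self.experiment.log_parameter(name, str(value))
    return retval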


Logging MultiIndex Dataframes Fails

Describe the bug
It appears that internally, rubicon's .log_dataframe() converts pandas dataframes to dask dataframes regardless of the situation. This can cause issues in scenarios where dask might not support certain dataframe features such as multiindex dataframes.

Steps/Code to reproduce bug

import pandas as pd
from rubicon.client import Rubicon
# Create sample data
df = pd.DataFrame([[0,1,'a'],[1,1,'b'],[2,2,'c'],[3,2,'d']], columns=['a', 'b', 'c'])
df = df.set_index(['b', 'a']) # Set multiindex
df
     c
b a   
1 0  a
  1  b
2 2  c
  3  d

# Log dataframe to rubicon
rubicon = Rubicon(persistence="memory")
project = rubicon.get_or_create_project("test")
exp = project.log_experiment('test_exp')
exp.log_dataframe(df)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/ouz343/miniconda3/envs/lustr/lib/python3.8/site-packages/rubicon/client/mixin.py", line 251, in log_dataframe
    self.repository.create_dataframe(dataframe, df, project_name, experiment_id=experiment_id)
  File "/Users/ouz343/miniconda3/envs/lustr/lib/python3.8/site-packages/rubicon/repository/base.py", line 426, in create_dataframe
    data = self._convert_to_dask_dataframe(data)
  File "/Users/ouz343/miniconda3/envs/lustr/lib/python3.8/site-packages/rubicon/repository/base.py", line 396, in _convert_to_dask_dataframe
    return dd.from_pandas(df, npartitions=1)
  File "/Users/ouz343/miniconda3/envs/lustr/lib/python3.8/site-packages/dask/dataframe/io/io.py", line 202, in from_pandas
    raise NotImplementedError("Dask does not support MultiIndex Dataframes.")
NotImplementedError: Dask does not support MultiIndex Dataframes.

Additional context
I'm not familiar with why pandas dataframes need to be converted to dask dataframes every time during logging, but the solution would revolve around avoiding the conversion to dask, since dask in this case does not support MultiIndex dataframes.
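
A sketch of that direction (the method and call signature are assumptions, not rubicon's actual internals):

import pandas as pd

def _persist_dataframe(self, df, path):
    if isinstance(df, pd.DataFrame):
        # write pandas dataframes directly - dd.from_pandas() is what
        # raises the NotImplementedError for MultiIndex dataframes
        df.to_parquet(path)
    else:
        # already a dask dataframe; let dask write its partitions
        df.to_parquet(path)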

CVE-2021-23358 (High) detected in underscore-min-1.12.0.js

CVE-2021-23358 - High Severity Vulnerability

Vulnerable Library - underscore-min-1.12.0.js

JavaScript's functional programming helper library.

Library home page: https://cdnjs.cloudflare.com/ajax/libs/underscore.js/1.12.0/underscore-min.js

Path to dependency file: rubicon-ml/logging-examples/logging-training-metadata.html

Path to vulnerable library: rubicon-ml/_static/underscore.js,rubicon-ml/logging-examples/../_static/underscore.js,rubicon-ml/_static/underscore.js,rubicon-ml/integrations/../_static/underscore.js

Dependency Hierarchy:

  • underscore-min-1.12.0.js (Vulnerable Library)

Found in HEAD commit: db600edfb9b5eaf680c2fc85a6963c6a4a5bfb14

Vulnerability Details

The package underscore from 1.13.0-0 and before 1.13.0-2, from 1.3.2 and before 1.12.1 are vulnerable to Arbitrary Code Injection via the template function, particularly when a variable property is passed as an argument as it is not sanitized.

Publish Date: 2021-03-29

URL: CVE-2021-23358

CVSS 3 Score Details (7.2)

Base Score Metrics:

  • Exploitability Metrics:
    • Attack Vector: Network
    • Attack Complexity: Low
    • Privileges Required: High
    • User Interaction: None
    • Scope: Unchanged
  • Impact Metrics:
    • Confidentiality Impact: High
    • Integrity Impact: High
    • Availability Impact: High

For more information on CVSS3 Scores, click here.

Suggested Fix

Type: Upgrade version

Origin: https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2021-23358

Release Date: 2021-03-29

Fix Resolution: underscore - 1.12.1,1.13.0-2


Step up your Open Source Security Game with WhiteSource here

create a new example for `intake` enhancements

Is your documentation request related to a problem? Please describe

the intake integration is currently only mentioned on the landing page and in the quick look

Describe the solution you'd like

once we've developed more features, it should have its own example in the integrations section

investigate pinned dependencies

Is your enhancement request related to a problem? Please describe
the conda install for rubicon-ml is super slow and takes a few solves to install into a fresh environment. this is almost certainly due to issues with our pinned dependencies

Describe the solution you'd like

  • investigate why our pinned dependencies are pinned (pyarrow, pyyaml, fsspec)
  • investigate the dependencies called out by the conda forge bot in the dependency analysis here (UI libraries, pyarrow, s3fs)
  • update and unpin any upper limits we can

Include auto git integration in prefect workflow

Is your feature request related to a problem? Please describe
I'd like to use the provided prefect tasks with git integration. Right now, the get_or_create_project task doesn't allow for git integration.

Describe the solution you'd like
Have git integration available in the existing prefect task or new prefect task.

Describe alternatives you've considered
Manually specifying git commit hashes and branches myself, however I would optimally like to harness the existing functionality built into rubicon

Additional context

Look into interactive examples (spike)

Is your enhancement request related to a problem? Please describe

From #21

if possible, I would strongly recommend providing the example notebook as a runnable binder and a short video of it in action.

Describe the solution you'd like

At a minimum, would like a short video of rubicon logging and dashboard

latest `fsspec` version causes `FileExistsError` if multiple memory filesystems are created

Describe the bug

when a second memory filesystem is created, it attempts to re-make the root directory. in previous versions of fsspec this was a no-op if the directory existed; now it throws a FileExistsError

Steps/Code to reproduce bug

python -m pytest from a new dev environment shows the error in the tests

Expected behavior

when a second memory filesystem is created, there should be no error. hopefully fsspec.mkdir exposes some kind of exists_ok flag
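
fsspec's makedirs does accept an exist_ok flag, so the fix may be as simple as:

import fsspec

filesystem = fsspec.filesystem("memory")
# unlike mkdir, makedirs(..., exist_ok=True) won't raise if the directory exists
filesystem.makedirs("/root", exist_ok=True)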

running the dashboard from JupyterHub is troublesome

Describe the bug

sometimes when running from JupyterHub, Dash needs some additional configuration to properly load assets and assign paths. some of these config values are read-only and need to be set at creation time, so they can't be updated on an instantiated Dashboard object

additionally, JupyterDash appears to be unmaintained and a source of some of our dashboard struggles. specifically, the "inline" and "jupyterlab" modes don't work from JupyterHub anymore

Steps/Code to reproduce bug

from a notebook running in JupyterHub:

from rubicon_ml.ui import Dashboard

Dashboard(persistence="memory", mode="inline").run_server()

and

Dashboard(persistence="memory", mode="inline").run_server(mode="inline")

navigating to the dashboard URL will show a number of errors

Expected behavior

  • navigating to the dashboard URL should show the dashboard
  • running the dashboard with mode="inline" should render the dashboard in the notebook

Additional context
steps to fix:

  • replace JupyterDash with Dash
  • enable a passthru for kwargs to the Dash object
  • use an IFrame for run_server(mode="inline") instead of letting JupyterDash handle it

bad JSON writes empty files

Describe the bug

when logging something that cannot be JSON serialized, rubicon throws an error yet still writes an empty file to the logged item's path. this is problematic because when rubicon tries to read the empty file back in, it throws another JSON decode error.

Steps/Code to reproduce bug

>>> from rubicon import Rubicon
>>>
>>> r = Rubicon(persistence="filesystem", root_dir="./rubicon_root_dir")
>>> p = r.create_project("bug")
>>> e = p.log_experiment()
>>>
>>> param_value = str
>>> type(param_value) 
<class 'type'>
>>>
>>> e.log_parameter(name="fails", value=param_value)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/nvd215/oss/rubicon/rubicon/client/experiment.py", line 141, in log_parameter
    self.repository.create_parameter(parameter, self.project.name, self.id)
  File "/Users/nvd215/oss/rubicon/rubicon/repository/base.py", line 801, in create_parameter
    self._persist_domain(parameter, parameter_metadata_path)
  File "/Users/nvd215/oss/rubicon/rubicon/repository/local.py", line 36, in _persist_domain
    f.write(json.dumps(domain))
  File "/Users/nvd215/oss/rubicon/rubicon/repository/utils/json.py", line 42, in dumps
    return json.dumps(data, cls=DomainJSONEncoder)
  File "/Users/nvd215/miniconda3/envs/rubicon-dev/lib/python3.8/json/__init__.py", line 234, in dumps
    return cls(
  File "/Users/nvd215/miniconda3/envs/rubicon-dev/lib/python3.8/json/encoder.py", line 199, in encode
    chunks = self.iterencode(o, _one_shot=True)
  File "/Users/nvd215/miniconda3/envs/rubicon-dev/lib/python3.8/json/encoder.py", line 257, in iterencode
    return _iterencode(o, 0)
  File "/Users/nvd215/oss/rubicon/rubicon/repository/utils/json.py", line 23, in default
    return super().default(obj)  # pragma: no cover
  File "/Users/nvd215/miniconda3/envs/rubicon-dev/lib/python3.8/json/encoder.py", line 179, in default
    raise TypeError(f'Object of type {o.__class__.__name__} '
TypeError: Object of type type is not JSON serializable
>>>
>>> e.parameters()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/nvd215/oss/rubicon/rubicon/client/experiment.py", line 155, in parameters
    for p in self.repository.get_parameters(self.project.name, self.id)
  File "/Users/nvd215/oss/rubicon/rubicon/repository/base.py", line 858, in get_parameters
    parameters = [
  File "/Users/nvd215/oss/rubicon/rubicon/repository/base.py", line 859, in <listcomp>
    domain.Parameter(**json.loads(data))
  File "/Users/nvd215/oss/rubicon/rubicon/repository/utils/json.py", line 50, in loads
    return json.loads(data, cls=DomainJSONDecoder)
  File "/Users/nvd215/miniconda3/envs/rubicon-dev/lib/python3.8/json/__init__.py", line 370, in loads
    return cls(**kw).decode(s)
  File "/Users/nvd215/miniconda3/envs/rubicon-dev/lib/python3.8/json/decoder.py", line 337, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/Users/nvd215/miniconda3/envs/rubicon-dev/lib/python3.8/json/decoder.py", line 355, in raw_decode
    raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)
>>>
>>> e.repository.filesystem.ls("./rubicon_root_dir/bug/experiments/0f72e81c-5e50-4874-8dc0-e94ba366c36a/parameters/fails")
['/Users/nvd215/oss/rubicon/rubicon_root_dir/bug/experiments/0f72e81c-5e50-4874-8dc0-e94ba366c36a/parameters/fails/metadata.json']
>>>
>>> e.repository.filesystem.cat("/Users/nvd215/oss/rubicon/rubicon_root_dir/bug/experiments/0f72e81c-5e50-4874-8dc0-e94ba366c36a/parameters/fails/metadata.json")
b''

Expected behavior

the first error is fine, it's the second one that's the problem. the fact that the first part errors means no file should be written for this parameter, yet we can see an empty one has been. there should be no file written when the first error is thrown during logging
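
a minimal sketch of that fix in _persist_domain (json here is rubicon's JSON utility module from the traceback):

def _persist_domain(self, domain, path):
    # serialize first - if this raises, no file has been opened or created
    data = json.dumps(domain)

    with self.filesystem.open(path, "w") as f:
        f.write(data)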

standardize getters - dataframe

Is your enhancement request related to a problem? Please describe

rubicon currently offers four different types of "getters"

  • get all
  • get one by name
  • get one by ID
  • get many by tags

however, these offerings vary by entity type. dataframes, for example, currently have "get many by tags" and "get all"

Describe the solution you'd like

dataframes should also have the ability to "get one by id" and "get one by name"

specifically, def dataframe(self, name=None, id=None) should be created where either name or id is required (but not both) and added to the DataframeMixin

https://github.com/capitalone/rubicon-ml/blob/main/rubicon_ml/client/mixin.py#L225

there is already a get_dataframe_metadata method on the base repository, so that can be used to construct the dataframe method on the client in a similar way to the last few tickets like this one

there is a caveat that makes this a bit different than #131, #133 and #135 - dataframe names are not unique. dataframe, when given a name, could return more than one dataframe. we don't want dataframe to return multiple dataframes, so we should return the most recent dataframe with the requested name. perhaps also raise a warning that multiple dataframes were found and only one will be returned

then, we should additionally update def dataframes(self, tags=[], qtype="or") to def dataframes(self, tags=[], qtype="or", name=None) to account for the case where users intentionally have duplicately named dataframes and want all of them

https://github.com/capitalone/rubicon-ml/blob/main/rubicon_ml/client/mixin.py#L268

Additional context

we originally only offered "get one by ID" because dataframes are logged by ID. so when we have an ID, we can retrieve that dataframe directly from the filesystem. for dataframe names, we'll need to read all the dataframes then filter them in-memory. for this reason, we originally left "get one by name" off of dataframes. however, I find myself doing dataframe = [e for e in rubicon.dataframes() if e.name == "desired name"][0] to get the one dataframe with a specific name often enough, so we may as well just put it in the library

this change should be fully in the client layer
