
h2oai / wave-h2o-automl

Wave App for H2O AutoML

Home Page: https://docs.h2o.ai/h2o/latest-stable/h2o-docs/automl.html

License: Apache License 2.0

Languages: Python 98.58%, Makefile 1.42%
Topics: automl, h2o-automl, h2oai, h2o, h2o-wave

wave-h2o-automl's Introduction

Wave app for H2O AutoML

System Requirements

  1. Python 3.6+
  2. pip3

Installation

1. Run the Wave Server

Follow the instructions in the Wave documentation to download and run the latest Wave Server, which is required to run Wave apps.

2. Setup Your Python Environment

$ git clone git@github.com:h2oai/wave-h2o-automl.git
$ cd wave-h2o-automl
$ make setup
$ source venv/bin/activate

3. Run the App

wave run src.app

Note! If you did not activate your virtual environment, this will be:

./venv/bin/wave run src.app

4. View the App

Point your favorite web browser to localhost:10101

[Screenshot: app home page]


wave-h2o-automl's Issues

Move PD column picker

Let's move the PD column picker from the left side to the right side above the PD plot.

[Screenshot: current PD column picker on the left side]

Update list of features on homepage

Update to the same list that we have here: https://github.com/h2oai/wave-h2o-automl/blob/main/about.md

Features:

  • AutoML Training: Train many models using H2O AutoML on your own train/test datasets.
  • Leaderboard: View the AutoML leaderboard to rank models.
  • AutoML Viz: Visualize feature importance and Shapley contributions.
  • Model Explain: Explain any model using feature importance, Shapley values, and learning curves.
  • Deployment: Download any model in the MOJO format.

CSV with no header not importing correctly

If you import a CSV with no header, the app uses the first row as the header instead of generating dummy column names. It's not clear why: the same file loads correctly through h2o.import_file(), so it appears the generic Wave data loader is at fault.

Example:

import h2o

h2o.init()

train = h2o.import_file("https://github.com/h2oai/h2o-tutorials/raw/0bd643cddc850eb8692f1e3ff7d8211e4168c7d2/tutorials/data/higgs_10k.csv")
train.columns # proper column names (C1, C2...)

In the app, the columns instead show up with "numeric" names taken from the first data row:

[Screenshot: columns displayed with numeric first-row values as names]
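A minimal sketch of a possible fix on the app side, assuming the loader reads with pandas and mimicking h2o's C1, C2, ... naming (the helper name and the header heuristic are illustrative, not the app's actual code):

import csv

import pandas as pd

def read_csv_like_h2o(path: str) -> pd.DataFrame:
    # Peek at the start of the file to guess whether a header row exists.
    with open(path, newline="") as f:
        sample = f.read(4096)
    if csv.Sniffer().has_header(sample):  # heuristic, similar to h2o's parser
        return pd.read_csv(path)
    # No header: generate h2o-style dummy names (C1, C2, ...).
    df = pd.read_csv(path, header=None)
    df.columns = [f"C{i + 1}" for i in range(df.shape[1])]
    return df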

Add additional demo dataset

Let's add a very small dataset that is a regression problem. Maybe the wine quality dataset? (6,497 rows × 13 columns). We can do an 80/20 split to create train and test CSVs and upload them to the repo.
https://docs.h2o.ai/h2o/latest-stable/h2o-docs/explain.html#explain-models

Currently we have the credit card dataset, which is ~24k rows and binary classification. A smaller dataset will speed up demos, and wine quality is nice since we use it in all our other explainability demos.

To split and export file: https://docs.h2o.ai/h2o/latest-stable/h2o-py/docs/h2o.html?highlight=export#h2o.export_file

import h2o

h2o.init()

# Import wine quality dataset
f = "https://h2o-public-test-data.s3.amazonaws.com/smalldata/wine/winequality-redwhite-no-BOM.csv"
df = h2o.import_file(f)

# Split into train & test
splits = df.split_frame(ratios=[0.8], seed=1)
train = splits[0]
test = splits[1]

h2o.export_file(train, path="wine_quality_train.csv")
h2o.export_file(test, path="wine_quality_test.csv")

The files that will be uploaded are: wine_quality_train.csv and wine_quality_test.csv

Import file not working

I think this used to work...? Here's an error when I tried to load a CSV (replicated with multiple CSVs).

[Screenshot: error shown in the app after loading a CSV]

I can upload the file, and it appears in the training set dropdown. I can select it, but when I click Train we get this error:

Unhandled exception
Traceback (most recent call last):
  File "/Users/me/h2oai/github/wave-h2o-automl/venv/lib/python3.8/site-packages/h2o_wave/server.py", line 320, in _process
    await self._handle(q)
  File "./src/app.py", line 915, in serve
    await train_menu(q)
  File "./src/app.py", line 283, in train_menu
    q.app.train_df = pd.read_csv(local_path)
  File "/Users/me/h2oai/github/wave-h2o-automl/venv/lib/python3.8/site-packages/pandas/io/parsers.py", line 610, in read_csv
    return _read(filepath_or_buffer, kwds)
  File "/Users/me/h2oai/github/wave-h2o-automl/venv/lib/python3.8/site-packages/pandas/io/parsers.py", line 462, in _read
    parser = TextFileReader(filepath_or_buffer, **kwds)
  File "/Users/me/h2oai/github/wave-h2o-automl/venv/lib/python3.8/site-packages/pandas/io/parsers.py", line 819, in __init__
    self._engine = self._make_engine(self.engine)
  File "/Users/me/h2oai/github/wave-h2o-automl/venv/lib/python3.8/site-packages/pandas/io/parsers.py", line 1050, in _make_engine
    return mapping[engine](self.f, **self.options)  # type: ignore[call-arg]
  File "/Users/me/h2oai/github/wave-h2o-automl/venv/lib/python3.8/site-packages/pandas/io/parsers.py", line 1867, in __init__
    self._open_handles(src, kwds)
  File "/Users/me/h2oai/github/wave-h2o-automl/venv/lib/python3.8/site-packages/pandas/io/parsers.py", line 1362, in _open_handles
    self.handles = get_handle(
  File "/Users/me/h2oai/github/wave-h2o-automl/venv/lib/python3.8/site-packages/pandas/io/common.py", line 642, in get_handle
    handle = open(
FileNotFoundError: [Errno 2] No such file or directory: '/_f/43b1c3b5-fefe-4b8c-835a-ecacd3e6e632/higgs_10k.csv'
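The FileNotFoundError suggests the app passes the Wave server path (/_f/<uuid>/...) straight to pandas instead of downloading the uploaded file first. A minimal sketch of the likely fix, assuming the upload arrives in q.args.train_file (the handler and argument names are illustrative):

import os

import pandas as pd

async def train_menu(q):
    # q.args.train_file holds the server-side path of the uploaded file,
    # e.g. /_f/<uuid>/higgs_10k.csv; that path does not exist on local disk.
    server_path = q.args.train_file[0]
    # Download the file from the Wave server to the working directory first.
    local_path = await q.site.download(server_path, '.')
    q.app.train_df = pd.read_csv(local_path)
    os.remove(local_path)  # clean up the local copy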

Utility Function for DataFrames

Maybe not for real applications, but a lot of demos will use data frames, so a scoring function that takes a data frame and returns one would be nice. This is a rough draft that needs to be cleaned up, but I'm putting it here as a placeholder for the ModelOps Utilities.

import pandas as pd

def get_predictions(rows: pd.DataFrame) -> pd.DataFrame:
    # Handle nulls.
    rows = rows.where(pd.notnull(rows), "")

    # Every value needs to be a string.
    vals = rows.values.tolist()
    for i in range(len(vals)):
        vals[i] = [str(x) for x in vals[i]]

    # Create a string in the expected dictionary format.
    dictionary = '{"fields": ' + str(rows.columns.tolist()) + ', "rows": ' + str(vals) + '}'
    dictionary = dictionary.replace("'", '"')  # MLOps needs double quotes!

    # Use the utility function.
    dict_preds = mlops_get_score('https://model.wave.h2o.ai/f2659e88-cbad-4ae0-baf0-e25daef42461/model/score',
                                 dictionary)

    # Turn the returned dict into a dataframe.
    preds = pd.DataFrame(data=dict_preds['score'], columns=dict_preds['fields'])

    # Join with the original data; the assumption is the row order never changes.
    return pd.concat([rows, preds], axis=1)
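For reference, a hedged sketch of what the mlops_get_score helper presumably does (assuming a standard JSON-over-HTTP scoring endpoint; this is not the actual ModelOps utility code):

import json

import requests

def mlops_get_score(endpoint_url: str, payload: str) -> dict:
    # POST the {"fields": [...], "rows": [...]} payload and return the parsed response.
    response = requests.post(endpoint_url, data=payload,
                             headers={"Content-Type": "application/json"})
    response.raise_for_status()
    return json.loads(response.text)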

Expand training interface to include advanced and expert options

The current training interface is pretty minimal. I think we should expose all AutoML parameters, but hide most of them in an "Expert" settings section and expose another set as "Advanced" (see the sketch after the lists below). Here's the full list of AutoML params: https://docs.h2o.ai/h2o/latest-stable/h2o-docs/automl.html#automl-interface

Basic Data Parameters:

  • training frame
  • target column
  • classification (vs regression)

Advanced Data Parameters:

  • columns to remove (or conversely, x)
  • validation frame (? not very useful)
  • leaderboard frame
  • blending frame
  • fold column
  • weights column

Basic Training Parameters:

  • max_models
  • max_runtime_secs

Advanced:

  • balance_classes
  • class_sampling_factors
  • early stopping metric
  • early stopping rounds
  • nfolds
  • etc... all the others until include_algos

Expert:

  • preprocessing
  • modeling_plan
  • monotone constraints
  • exploitation ratio
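A minimal sketch of how the tiers could map onto the Wave form and the H2OAutoML constructor (assuming h2o_wave.ui expanders for the hidden tiers; the widget names and the handler context q are illustrative):

from h2o.automl import H2OAutoML
from h2o_wave import ui

# Basic params stay visible; Advanced/Expert collapse into expanders.
items = [
    ui.textbox(name='max_models', label='max_models', value='10'),
    ui.textbox(name='max_runtime_secs', label='max_runtime_secs', value='3600'),
    ui.expander(name='advanced', label='Advanced', items=[
        ui.textbox(name='nfolds', label='nfolds', value='5'),
        ui.toggle(name='balance_classes', label='balance_classes', value=False),
    ]),
    ui.expander(name='expert', label='Expert', items=[
        ui.textbox(name='exploitation_ratio', label='exploitation_ratio', value='-1'),
    ]),
]

# Later, inside the query handler, map the submitted q.args onto the constructor:
aml = H2OAutoML(
    max_models=int(q.args.max_models),
    max_runtime_secs=int(q.args.max_runtime_secs),
    nfolds=int(q.args.nfolds),
    balance_classes=q.args.balance_classes,
    exploitation_ratio=float(q.args.exploitation_ratio),
)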

Add Explain tabs

Let's add two new menu items (tabs) for the H2O Explain output.

AutoML level plots:

AutoML Explain tab (only group/AutoML level plots)

We have stored the AutoML object at q.app.aml, and we can re-use it from there to generate the AutoML plots.

Model level plots:

Model Explain tab:

The first section is model-only plots; the second section is row-related plots for the same model.

Notes

Currently we are going to use the plots directly from matplotlib, so that users can download the images easily and the output looks the same as in R/Python. It's also easier to re-use the existing plotting code than to re-create some of the more complex plots. However, we may reconsider this approach (if we do, we will re-use the plotting code to generate plots that the user can download with the click of a button).

Some of the explain functions require a test set. Right now they re-use the train set, which is not good (#14). We should not force the user to provide a test set in case they don't care about the explain functionality, but we might want to add a checkbox (selected by default) that says "Automatically create a test set (used in some explainability features)". If the user selects a test set from the dropdown, that will be used instead. If the user wants to use the train set only, they can unselect the checkbox, and we can then remove the plots that require a test set.
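A minimal sketch of the proposed checkbox logic (load_frame is a hypothetical helper; split_frame is the real h2o API):

from h2o_wave import ui

auto_split = ui.checkbox(
    name='auto_test_set',
    label='Automatically create a test set (used in some explainability features)',
    value=True,
)

# When handling Train:
if q.args.test_file:             # user supplied a test set: use it
    test = load_frame(q.args.test_file)   # hypothetical loader
elif q.args.auto_test_set:       # otherwise carve one out of the training data
    train, test = train.split_frame(ratios=[0.8], seed=1)
else:
    test = None                  # train-only: hide the plots that need a test set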

TO DO

  • Replace plots with images (varimp plot, for now just comment out the code and replace it with image code)
  • AutoML Explain tab: group level plots
  • Model Explain tab: drop-down (picker) for the models, and then we put all the model plots there

Render actual AutoML progress bar with percentage complete

In the R and Python APIs, we have a nice AutoML progress bar that shows the estimated percentage completed:

AutoML progress: |███████                                |  18%

In the app, we have an "in progress" bar, but it does not show any information about how much work remains.

[Screenshot: current indeterminate progress bar in the app]
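One possible approach, sketched below: run training in the background and periodically update a determinate ui.progress bar. h2o does not expose a stable public progress API for AutoML, so get_automl_progress is a hypothetical helper (e.g. backed by the backend's job-status REST endpoint):

import asyncio

from h2o_wave import ui

async def show_progress(q, get_automl_progress):
    # get_automl_progress: hypothetical callable returning a float in [0, 1].
    while True:
        p = get_automl_progress()
        q.page['train'] = ui.form_card(box='body', items=[
            ui.progress(label='AutoML progress', caption=f'{int(p * 100)}%', value=p),
        ])
        await q.page.save()
        if p >= 1:
            break
        await asyncio.sleep(2)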

Add a tab where you can visualize the training data

It seems useful to have some basic visualizations of the training data (e.g. shape, distribution of the response, etc.). These visualizations should probably be generated automatically when the user uploads a CSV file.
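A rough sketch of what an auto-generated summary card could look like (assuming pandas and the Wave table API; the card and helper names are illustrative):

import pandas as pd
from h2o_wave import ui

def summary_card(df: pd.DataFrame):
    # Shape plus a per-column summary, regenerated whenever a new CSV is uploaded.
    desc = df.describe(include='all').round(3).reset_index()
    return ui.form_card(box='body', items=[
        ui.text_xl(f'{df.shape[0]} rows x {df.shape[1]} columns'),
        ui.table(
            name='summary',
            columns=[ui.table_column(name=str(c), label=str(c)) for c in desc.columns],
            rows=[ui.table_row(name=str(i), cells=[str(v) for v in row])
                  for i, row in enumerate(desc.values)],
        ),
    ])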

Allow user to auto-split test set from their training file

Currently we allow the user to optionally upload a test file, but we really always want to have a test set for the Explain functions.

If the user only uploads a training CSV, then we should automatically create a test set to be used in the Explain functions, and we should allow the user to adjust the fraction of their training data that is held out for it (a sketch follows below).

Related to this: #14
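A minimal sketch (the slider name is illustrative; split_frame is the real h2o API):

from h2o_wave import ui

fraction = ui.slider(name='test_fraction', label='Test set fraction',
                     min=0.1, max=0.5, step=0.05, value=0.2)

# When only a training file is supplied:
splits = train.split_frame(ratios=[1 - q.args.test_fraction], seed=1)
train, test = splits[0], splits[1]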

App UI unusable on certain screens

From support ticket

App isn't fitting the screen on an Apple M1 MacBook. I can't see the ends. It's stretched and there is no option to fix it.

App Details
ID: bec14f76-cc10-4886-823c-d631c6492867
NAME: H2O AutoML
VERSION: 0.3.0

Clicking on Variable Explain under the AutoML Viz tab causes an error

It's fine the first time, but if you go away and come back, it breaks the app.

,{"k":"__unhandled_error__","d":{"view":"markdown","box":"1 1 12 10","title":"Error","content":"```\nTraceback (most recent call last):\n  File \"/Users/me/h2oai/github/wave-h2o-automl/venv/lib/python3.10/site-packages/h2o_wave/server.py\", line 341, in _process\n    await self._handle(q)\n  File \"/Users/me/h2oai/github/wave-h2o-automl/./src/app.py\", line 1112, in serve\n    elif not await handle_on(q):\n  File \"/Users/me/h2oai/github/wave-h2o-automl/venv/lib/python3.10/site-packages/h2o_wave/routing.py\", line 179, in handle_on\n    if await _match_predicate(predicate, func, arity, q, arg_value):\n  File \"/Users/me/h2oai/github/wave-h2o-automl/venv/lib/python3.10/site-packages/h2o_wave/routing.py\", line 129, in _match_predicate\n    await _invoke_handler(func, arity, q, arg)\n  File \"/Users/me/h2oai/github/wave-h2o-automl/venv/lib/python3.10/site-packages/h2o_wave/routing.py\", line 119, in _invoke_handler\n    await func(q, arg)\n  File \"/Users/me/h2oai/github/wave-h2o-automl/./src/app.py\", line 910, in aml_varimp\n    ui.picker(name='column_pd', label='Select Column', choices=choices, max_choices = 1, values = [q.app.pd_col]),\n  File \"/Users/me/h2oai/github/wave-h2o-automl/venv/lib/python3.10/site-packages/h2o_wave/ui.py\", line 1741, in picker\n    return Component(picker=Picker(\n  File \"/Users/me/h2oai/github/wave-h2o-automl/venv/lib/python3.10/site-packages/h2o_wave/types.py\", line 4471, in __init__\n    _guard_vector('Picker.values', values, (str,), False, True, False)\n  File \"/Users/me/h2oai/github/wave-h2o-automl/venv/lib/python3.10/site-packages/h2o_wave/types.py\", line 50, in _guard_vector\n    _guard_scalar(f'{name} element', value, types, False, non_empty, False)\n  File \"/Users/me/h2oai/github/wave-h2o-automl/venv/lib/python3.10/site-packages/h2o_wave/types.py\", line 37, in _guard_scalar\n    raise ValueError(f'{name}: want one of {types}, got {type(value)}')\nValueError: Picker.values element: want one of (<class 'str'>,), got <class 'NoneType'>\n\n```"}}]}

Improve leaderboard output

  • Let's use .round(5) on all the columns in the leaderboard. For now we are using h2o 3.32, so only a subset of the full leaderboard columns is available (we can update later).
  • Let's use the extended leaderboard instead, with all the columns (see the sketch below).
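A minimal sketch covering both items (get_leaderboard and its extra_columns argument are the real h2o API):

from h2o.automl import get_leaderboard

# Extended leaderboard with all extra columns (training_time_ms,
# predict_time_per_row_ms, algo, ...).
lb = get_leaderboard(q.app.aml, extra_columns='ALL')
lb_df = lb.as_data_frame().round(5)  # round every numeric column for display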

Automatically infer the response type and produce helpful error

There's a classification button, which is set to classification by default. However, that makes it easy to train a classification model when you really want a regression model. If the response is real-valued and you try to do classification, it currently just breaks Wave (e.g. the wine quality data using "alcohol" as the response). So we need a better error.

Another issue is when you have an integer-valued column as the response that should be treated as numeric/regression (e.g. wine quality using "quality" as the response): the app will still train a classification model when asked to, and you don't realize it until you read the column headers in the leaderboard. A possible inference heuristic is sketched below.
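A possible heuristic for inferring the task from the response column (a sketch; the thresholds are arbitrary):

import pandas as pd

def infer_task(df: pd.DataFrame, target: str) -> str:
    col = df[target]
    if col.dtype.kind == 'f':
        return 'regression'      # real-valued response
    if col.dtype.kind in 'iu' and col.nunique() > 25:
        return 'regression'      # many distinct integers: probably numeric
    return 'classification'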

Test data not currently used

Currently the test_file is stored as test_df but then not used anywhere. We should consider what we want to do with a test file (such as passing it on to the explain functions).

Throw error if Classification toggle is turned on for a regression dataset

If the classification toggle is kept on (the default) and you use a regression dataset, it will throw an error on the backend because it looks for max_per_class_error on the leader model, which is not there. So it trains the whole AutoML run and then throws an error at the end, which is bad.

Ideally we'd have the app automatically set the toggle once the user selects a real-valued target column, but right now it doesn't work that way. A pre-training guard is sketched below.
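A minimal sketch of such a guard, reusing the infer_task heuristic sketched earlier (the notification wiring is illustrative):

from h2o_wave import ui

if q.args.is_classification and infer_task(q.app.train_df, q.app.target) == 'regression':
    q.page['meta'] = ui.meta_card(box='', notification_bar=ui.notification_bar(
        text=f'"{q.app.target}" looks real-valued; turn off Classification or pick another target.',
        type='error',
    ))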

Save images in the app and serve cached copy

Currently, every time you view the AutoML Viz or Model Explain tab, all the data/images are regenerated from scratch. Let's cache this in the app and serve it from the cached copy rather than regenerating each time.

For each plot, we can store the plot in the app:
e.g. q.app.varimp_heat_plot and q.app.mc_plot

For example, something like this:

    # Model Correlation Heatmap (1)
    try:
        train = h2o.H2OFrame(q.app.train_df)
        y = q.app.target
        if q.app.is_classification:
            train[y] = train[y].asfactor()
        if q.app.mc_plot is None:
            q.app.mc_plot = q.app.aml.model_correlation_heatmap(frame=train, figsize=(FIGSIZE[0], FIGSIZE[0]))
        q.page['plot21'] = ui.image_card(
            box='charts_left',
            title="Model Correlation Heatmap Plot",
            type="png",
            image=get_image_from_matplotlib(q.app.mc_plot),
        )

PD plot not showing on multiclass

If you use the wine dataset with "quality" and let it train a classification (multiclass) model, the PD plot is broken. That's because we are missing the part where we set a reference class, so we need to update the code to pass in a reference class.

For multiclass, we should add a second picker box with the list of classes, so that you can choose which class the PD plot is shown for (see the sketch below).
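A minimal sketch (recent h2o versions accept a target argument on pd_plot for multinomial models; the widget names are illustrative):

from h2o_wave import ui

classes = train[y].levels()[0]  # class labels of the (factor) response
class_picker = ui.picker(name='pd_class', label='Reference class', max_choices=1,
                         choices=[ui.choice(c, c) for c in classes], values=[classes[0]])

# When rendering the plot:
plot = model.pd_plot(frame=train, column=q.app.pd_col, target=q.args.pd_class[0])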

Loosen h2o version requirement

We might consider changing the version requirement in requirements.txt to <3.36.0.2 (or putting a lower bound on it instead), so that users have more flexibility to use the app with an existing/installed version of h2o on their machine.

If we want people to be able to use earlier versions of h2o (e.g. below 3.32.1.1), we need to abstract the leaderboard code a bit more. The "algo" column was only added in 3.32.1.1, for example, and the other extended leaderboard columns were added in earlier versions (3.28.0.1).

The learning curve plot, if we add it, was likewise only introduced in a fairly recent h2o version.
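A minimal sketch of the kind of guard this would need (deriving the algo from model_id is an illustrative fallback, not existing app code):

lb_df = q.app.aml.leaderboard.as_data_frame()
# "algo" only exists in h2o >= 3.32.1.1; degrade gracefully on older versions.
if 'algo' not in lb_df.columns:
    lb_df['algo'] = lb_df['model_id'].str.split('_').str[0]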

Add Pareto Front to the AutoML Viz Models Summary tab

We should add the Pareto front. To start, let's just display the default (x, y) axes: prediction time on the x axis and model performance on the y axis. In the future we can consider making this more interactive.

Redesign the home tab display

Let's divide this area into two columns: one for the description (left) and one for the logo (right). Or figure out a way to move the image to the right via HTML image tags.

[Screenshot: current home tab layout]

Run Explain tasks in background after training finishes

To avoid the computational delay, let's start the explain tasks immediately after training finishes, using background tasks. We can then store the data/images in the app so they are ready to serve by the time the user clicks on the Explain tabs. We will store only the default model's images for Model Explain and generate the other ones on demand.

Useful blog: https://medium.com/@unusualcode/background-jobs-in-wave-or-how-not-to-kill-your-ui-ae1fed95693a
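A minimal sketch using Wave's q.run, which executes a blocking function on a worker thread (the attribute names follow the caching issue above; train is the training H2OFrame):

async def on_training_done(q, train):
    # Run the blocking h2o/matplotlib calls off the event loop, then cache
    # the figures in the app so the Explain tabs can serve them instantly.
    q.app.varimp_heat_plot = await q.run(q.app.aml.varimp_heatmap)
    q.app.mc_plot = await q.run(q.app.aml.model_correlation_heatmap, train)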
