
h2oai / wave-h2o-automl

Wave App for H2O AutoML

Home Page: https://docs.h2o.ai/h2o/latest-stable/h2o-docs/automl.html

License: Apache License 2.0

Languages: Python 98.58%, Makefile 1.42%
Topics: automl, h2o-automl, h2oai, h2o, h2o-wave

wave-h2o-automl's Introduction

Wave app for H2O AutoML

System Requirements

  1. Python 3.6+
  2. pip3

Installation

1. Run the Wave Server

Follow the instructions in the Wave documentation to download and run the latest Wave Server, which is required to run Wave apps.

2. Setup Your Python Environment

$ git clone git@github.com:h2oai/wave-h2o-automl.git
$ cd wave-h2o-automl
$ make setup
$ source venv/bin/activate

3. Run the App

wave run src.app

Note! If you did not activate your virtual environment, this will be:

./venv/bin/wave run src.app

4. View the App

Point your favorite web browser to localhost:10101

[Screenshot: app home page]


wave-h2o-automl's Issues

Move PD column picker

Let's move the PD column picker from the left side to the right side above the PD plot.

[Screenshot: current PD column picker on the left side]

Update list of features on homepage

Update to the same list that we have here: https://github.com/h2oai/wave-h2o-automl/blob/main/about.md

Features:

  • AutoML Training: Train many models using H2O AutoML on your own train/test datasets.
  • Leaderboard: View the AutoML leaderboard to rank models.
  • AutoML Viz: Visualize feature importance and Shapley contributions.
  • Model Explain: Explain any model using feature importance, Shapley values, and learning curves.
  • Deployment: Download any model in the MOJO format.

CSV with no header not importing correctly

If you import a CSV with no header, the app uses the first row as the header instead of generating dummy column names. It's not clear why: the same file loads correctly through h2o.import_file(), so it appears the generic Wave data loader is at fault.

Example:

import h2o

h2o.init()

train = h2o.import_file("https://github.com/h2oai/h2o-tutorials/raw/0bd643cddc850eb8692f1e3ff7d8211e4168c7d2/tutorials/data/higgs_10k.csv")
train.columns # proper column names (C1, C2...)

In the app, the columns instead show up with "numeric" names taken from the first data row:

[Screenshot: columns displayed with numeric first-row values as names]
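A minimal sketch of a possible fix on the app side, assuming the loader reads with pandas and mimicking h2o's C1, C2, ... naming (the helper name and the header heuristic are illustrative, not the app's actual code):

import csv

import pandas as pd

def read_csv_like_h2o(path: str) -> pd.DataFrame:
    # Peek at the start of the file to guess whether a header row exists.
    with open(path, newline="") as f:
        sample = f.read(4096)
    if csv.Sniffer().has_header(sample):  # heuristic, similar to h2o's parser
        return pd.read_csv(path)
    # No header: generate h2o-style dummy names (C1, C2, ...).
    df = pd.read_csv(path, header=None)
    df.columns = [f"C{i + 1}" for i in range(df.shape[1])]
    return df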

Add additional demo dataset

Let's add a very small dataset that is a regression problem. Maybe the wine quality dataset? (6,497 rows × 13 columns). We can do an 80/20 split to create train and test CSVs and upload them to the repo.
https://docs.h2o.ai/h2o/latest-stable/h2o-docs/explain.html#explain-models

Currently we have the credit card dataset, which is ~24k rows and binary classification. A smaller dataset will speed up demos, and wine quality is nice since we use it in all our other explainability demos.

To split and export file: https://docs.h2o.ai/h2o/latest-stable/h2o-py/docs/h2o.html?highlight=export#h2o.export_file

import h2o

h2o.init()

# Import wine quality dataset
f = "https://h2o-public-test-data.s3.amazonaws.com/smalldata/wine/winequality-redwhite-no-BOM.csv"
df = h2o.import_file(f)

# Split into train & test
splits = df.split_frame(ratios=[0.8], seed=1)
train = splits[0]
test = splits[1]

h2o.export_file(train, path="wine_quality_train.csv")
h2o.export_file(test, path="wine_quality_test.csv")

The files that will be uploaded are: wine_quality_train.csv and wine_quality_test.csv

Import file not working

I think this used to work...? Here's an error when I tried to load a CSV (replicated with multiple CSVs).

[Screenshot: error shown in the app after loading a CSV]

I can upload the file, and it appears in the training set dropdown. I can select it, but when I click Train we get this error:

Unhandled exception
Traceback (most recent call last):
  File "/Users/me/h2oai/github/wave-h2o-automl/venv/lib/python3.8/site-packages/h2o_wave/server.py", line 320, in _process
    await self._handle(q)
  File "./src/app.py", line 915, in serve
    await train_menu(q)
  File "./src/app.py", line 283, in train_menu
    q.app.train_df = pd.read_csv(local_path)
  File "/Users/me/h2oai/github/wave-h2o-automl/venv/lib/python3.8/site-packages/pandas/io/parsers.py", line 610, in read_csv
    return _read(filepath_or_buffer, kwds)
  File "/Users/me/h2oai/github/wave-h2o-automl/venv/lib/python3.8/site-packages/pandas/io/parsers.py", line 462, in _read
    parser = TextFileReader(filepath_or_buffer, **kwds)
  File "/Users/me/h2oai/github/wave-h2o-automl/venv/lib/python3.8/site-packages/pandas/io/parsers.py", line 819, in __init__
    self._engine = self._make_engine(self.engine)
  File "/Users/me/h2oai/github/wave-h2o-automl/venv/lib/python3.8/site-packages/pandas/io/parsers.py", line 1050, in _make_engine
    return mapping[engine](self.f, **self.options)  # type: ignore[call-arg]
  File "/Users/me/h2oai/github/wave-h2o-automl/venv/lib/python3.8/site-packages/pandas/io/parsers.py", line 1867, in __init__
    self._open_handles(src, kwds)
  File "/Users/me/h2oai/github/wave-h2o-automl/venv/lib/python3.8/site-packages/pandas/io/parsers.py", line 1362, in _open_handles
    self.handles = get_handle(
  File "/Users/me/h2oai/github/wave-h2o-automl/venv/lib/python3.8/site-packages/pandas/io/common.py", line 642, in get_handle
    handle = open(
FileNotFoundError: [Errno 2] No such file or directory: '/_f/43b1c3b5-fefe-4b8c-835a-ecacd3e6e632/higgs_10k.csv'
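The FileNotFoundError suggests the app passes the Wave server path (/_f/<uuid>/...) straight to pandas instead of downloading the uploaded file first. A minimal sketch of the likely fix, assuming the upload arrives in q.args.train_file (the handler and argument names are illustrative):

import os

import pandas as pd

async def train_menu(q):
    # q.args.train_file holds the server-side path of the uploaded file,
    # e.g. /_f/<uuid>/higgs_10k.csv; that path does not exist on local disk.
    server_path = q.args.train_file[0]
    # Download the file from the Wave server to the working directory first.
    local_path = await q.site.download(server_path, '.')
    q.app.train_df = pd.read_csv(local_path)
    os.remove(local_path)  # clean up the local copy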

Utility Function for DataFrames

Maybe not for real applications, but a lot of demos will use data frames, so a scoring function that takes a data frame and returns one would be nice. This is a rough draft that needs to be cleaned up, but I'm putting it here as a placeholder for the ModelOps Utilities.

import pandas as pd

def get_predictions(rows: pd.DataFrame) -> pd.DataFrame:
    # Handle nulls.
    rows = rows.where(pd.notnull(rows), "")

    # Every value needs to be a string.
    vals = rows.values.tolist()
    for i in range(len(vals)):
        vals[i] = [str(x) for x in vals[i]]

    # Create a string in the expected dictionary format.
    dictionary = '{"fields": ' + str(rows.columns.tolist()) + ', "rows": ' + str(vals) + '}'
    dictionary = dictionary.replace("'", '"')  # MLOps needs double quotes!

    # Use the utility function.
    dict_preds = mlops_get_score('https://model.wave.h2o.ai/f2659e88-cbad-4ae0-baf0-e25daef42461/model/score',
                                 dictionary)

    # Turn the returned dict into a dataframe.
    preds = pd.DataFrame(data=dict_preds['score'], columns=dict_preds['fields'])

    # Join with the original data; the assumption is the row order never changes.
    return pd.concat([rows, preds], axis=1)
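For reference, a hedged sketch of what the mlops_get_score helper presumably does (assuming a standard JSON-over-HTTP scoring endpoint; this is not the actual ModelOps utility code):

import json

import requests

def mlops_get_score(endpoint_url: str, payload: str) -> dict:
    # POST the {"fields": [...], "rows": [...]} payload and return the parsed response.
    response = requests.post(endpoint_url, data=payload,
                             headers={"Content-Type": "application/json"})
    response.raise_for_status()
    return json.loads(response.text)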

Expand training interface to include advanced and expert options

The current training interface is pretty minimal. I think we should expose all AutoML parameters, but hide most of them in an "Expert" settings section and expose another set as "Advanced" (see the sketch after the lists below). Here's the full list of AutoML params: https://docs.h2o.ai/h2o/latest-stable/h2o-docs/automl.html#automl-interface

Basic Data Parameters:

  • training frame
  • target column
  • classification (vs regression)

Advanced Data Parameters:

  • columns to remove (or conversely, x)
  • validation frame (? not very useful)
  • leaderboard frame
  • blending frame
  • fold column
  • weights column

Basic Training Parameters:

  • max_models
  • max_runtime_secs

Advanced:

  • balance_classes
  • class_sampling_factors
  • early stopping metric
  • early stopping rounds
  • nfolds
  • etc... all the others until include_algos

Expert:

  • preprocessing
  • modeling_plan
  • monotone constraints
  • exploitation ratio
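A minimal sketch of how the tiers could map onto the Wave form and the H2OAutoML constructor (assuming h2o_wave.ui expanders for the hidden tiers; the widget names and the handler context q are illustrative):

from h2o.automl import H2OAutoML
from h2o_wave import ui

# Basic params stay visible; Advanced/Expert collapse into expanders.
items = [
    ui.textbox(name='max_models', label='max_models', value='10'),
    ui.textbox(name='max_runtime_secs', label='max_runtime_secs', value='3600'),
    ui.expander(name='advanced', label='Advanced', items=[
        ui.textbox(name='nfolds', label='nfolds', value='5'),
        ui.toggle(name='balance_classes', label='balance_classes', value=False),
    ]),
    ui.expander(name='expert', label='Expert', items=[
        ui.textbox(name='exploitation_ratio', label='exploitation_ratio', value='-1'),
    ]),
]

# Later, inside the query handler, map the submitted q.args onto the constructor:
aml = H2OAutoML(
    max_models=int(q.args.max_models),
    max_runtime_secs=int(q.args.max_runtime_secs),
    nfolds=int(q.args.nfolds),
    balance_classes=q.args.balance_classes,
    exploitation_ratio=float(q.args.exploitation_ratio),
)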

Add Explain tabs

Let's add two new menu items (tabs) for the H2O Explain output.

AutoML level plots:

AutoML Explain tab (only group/AutoML level plots)

We have stored the AutoML object at q.app.aml, and we can re-use it from there to generate the AutoML plots.

Model level plots:

Model Explain tab:

The first section is model-only plots; the second section is row-related plots for the same model.

Notes

Currently we are going to use the plots directly from matplotlib, so that users can download the images easily and the output looks the same as in R/Python. It's also easier to re-use the existing plotting code than to re-create some of the more complex plots. However, we may reconsider this approach (if we do, we will re-use the plotting code to generate plots that the user can download with the click of a button).

Some of the explain functions require a test set. Right now they re-use the train set, which is not good (#14). We should not force the user to provide a test set in case they don't care about the explain functionality, but we might want to add a checkbox (selected by default) that says "Automatically create a test set (used in some explainability features)". If the user selects a test set from the dropdown, that will be used instead. If the user wants to use the train set only, they can unselect the checkbox, and we can then remove the plots that require a test set.
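A minimal sketch of the proposed checkbox logic (load_frame is a hypothetical helper; split_frame is the real h2o API):

from h2o_wave import ui

auto_split = ui.checkbox(
    name='auto_test_set',
    label='Automatically create a test set (used in some explainability features)',
    value=True,
)

# When handling Train:
if q.args.test_file:             # user supplied a test set: use it
    test = load_frame(q.args.test_file)   # hypothetical loader
elif q.args.auto_test_set:       # otherwise carve one out of the training data
    train, test = train.split_frame(ratios=[0.8], seed=1)
else:
    test = None                  # train-only: hide the plots that need a test set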

TO DO

  • Replace plots with images (varimp plot, for now just comment out the code and replace it with image code)
  • AutoML Explain tab: group level plots
  • Model Explain tab: drop-down (picker) for the models, and then we put all the model plots there

Render actual AutoML progress bar with percentage complete

In the R and Python APIs, we have a nice AutoML progress bar that shows the estimated percentage completed:

AutoML progress: |███████                                |  18%

In the app, we have an "in progress" bar, but it does not show any information about how much work remains.

[Screenshot: current indeterminate progress bar in the app]
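One possible approach, sketched below: run training in the background and periodically update a determinate ui.progress bar. h2o does not expose a stable public progress API for AutoML, so get_automl_progress is a hypothetical helper (e.g. backed by the backend's job-status REST endpoint):

import asyncio

from h2o_wave import ui

async def show_progress(q, get_automl_progress):
    # get_automl_progress: hypothetical callable returning a float in [0, 1].
    while True:
        p = get_automl_progress()
        q.page['train'] = ui.form_card(box='body', items=[
            ui.progress(label='AutoML progress', caption=f'{int(p * 100)}%', value=p),
        ])
        await q.page.save()
        if p >= 1:
            break
        await asyncio.sleep(2)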

Add a tab where you can visualize the training data

It seems useful to have some basic visualizations of the training data (e.g. shape, distribution of the response, etc.). These visualizations should probably be generated automatically when the user uploads a CSV file.
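A rough sketch of what an auto-generated summary card could look like (assuming pandas and the Wave table API; the card and helper names are illustrative):

import pandas as pd
from h2o_wave import ui

def summary_card(df: pd.DataFrame):
    # Shape plus a per-column summary, regenerated whenever a new CSV is uploaded.
    desc = df.describe(include='all').round(3).reset_index()
    return ui.form_card(box='body', items=[
        ui.text_xl(f'{df.shape[0]} rows x {df.shape[1]} columns'),
        ui.table(
            name='summary',
            columns=[ui.table_column(name=str(c), label=str(c)) for c in desc.columns],
            rows=[ui.table_row(name=str(i), cells=[str(v) for v in row])
                  for i, row in enumerate(desc.values)],
        ),
    ])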

Allow user to auto-split test set from their training file

Currently we allow the user to optionally upload a test file, but we really always want to have a test set for the Explain functions.

If the user only uploads a training CSV, then we should automatically create a test set to be used in the Explain functions, and we should allow the user to adjust the fraction of their training data that is held out for it (a sketch follows below).

Related to this: #14
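A minimal sketch (the slider name is illustrative; split_frame is the real h2o API):

from h2o_wave import ui

fraction = ui.slider(name='test_fraction', label='Test set fraction',
                     min=0.1, max=0.5, step=0.05, value=0.2)

# When only a training file is supplied:
splits = train.split_frame(ratios=[1 - q.args.test_fraction], seed=1)
train, test = splits[0], splits[1]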

App UI unusable on certain screens

From support ticket

App isn't fitting the screen on an Apple M1 MacBook. I can't see the ends. It's stretched and there is no option to fix it.

App Details
ID: bec14f76-cc10-4886-823c-d631c6492867
NAME: H2O AutoML
VERSION: 0.3.0

Clicking on Variable Explain under the AutoML Viz tab causes an error

It's fine the first time, but if you go away and come back, it breaks the app.

,{"k":"__unhandled_error__","d":{"view":"markdown","box":"1 1 12 10","title":"Error","content":"```\nTraceback (most recent call last):\n  File \"/Users/me/h2oai/github/wave-h2o-automl/venv/lib/python3.10/site-packages/h2o_wave/server.py\", line 341, in _process\n    await self._handle(q)\n  File \"/Users/me/h2oai/github/wave-h2o-automl/./src/app.py\", line 1112, in serve\n    elif not await handle_on(q):\n  File \"/Users/me/h2oai/github/wave-h2o-automl/venv/lib/python3.10/site-packages/h2o_wave/routing.py\", line 179, in handle_on\n    if await _match_predicate(predicate, func, arity, q, arg_value):\n  File \"/Users/me/h2oai/github/wave-h2o-automl/venv/lib/python3.10/site-packages/h2o_wave/routing.py\", line 129, in _match_predicate\n    await _invoke_handler(func, arity, q, arg)\n  File \"/Users/me/h2oai/github/wave-h2o-automl/venv/lib/python3.10/site-packages/h2o_wave/routing.py\", line 119, in _invoke_handler\n    await func(q, arg)\n  File \"/Users/me/h2oai/github/wave-h2o-automl/./src/app.py\", line 910, in aml_varimp\n    ui.picker(name='column_pd', label='Select Column', choices=choices, max_choices = 1, values = [q.app.pd_col]),\n  File \"/Users/me/h2oai/github/wave-h2o-automl/venv/lib/python3.10/site-packages/h2o_wave/ui.py\", line 1741, in picker\n    return Component(picker=Picker(\n  File \"/Users/me/h2oai/github/wave-h2o-automl/venv/lib/python3.10/site-packages/h2o_wave/types.py\", line 4471, in __init__\n    _guard_vector('Picker.values', values, (str,), False, True, False)\n  File \"/Users/me/h2oai/github/wave-h2o-automl/venv/lib/python3.10/site-packages/h2o_wave/types.py\", line 50, in _guard_vector\n    _guard_scalar(f'{name} element', value, types, False, non_empty, False)\n  File \"/Users/me/h2oai/github/wave-h2o-automl/venv/lib/python3.10/site-packages/h2o_wave/types.py\", line 37, in _guard_scalar\n    raise ValueError(f'{name}: want one of {types}, got {type(value)}')\nValueError: Picker.values element: want one of (<class 'str'>,), got <class 'NoneType'>\n\n```"}}]}

Improve leaderboard output

  • Let's use .round(5) on all the columns in the leaderboard. For now we are using h2o 3.32, so only a subset of the full leaderboard columns is available (we can update later).
  • Let's use the extended leaderboard instead, with all the columns (see the sketch below).
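A minimal sketch covering both items (get_leaderboard and its extra_columns argument are the real h2o API):

from h2o.automl import get_leaderboard

# Extended leaderboard with all extra columns (training_time_ms,
# predict_time_per_row_ms, algo, ...).
lb = get_leaderboard(q.app.aml, extra_columns='ALL')
lb_df = lb.as_data_frame().round(5)  # round every numeric column for display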

Automatically infer the response type and produce helpful error

There's a classification button, which is set to classification by default. However, that makes it easy to train a classification model when you really want a regression model. If the response is real-valued and you try to do classification, it currently just breaks Wave (e.g. the wine quality data using "alcohol" as the response). So we need a better error.

Another issue is when you have an integer-valued column as the response that should be treated as numeric/regression (e.g. wine quality using "quality" as the response): the app will still train a classification model when asked to, and you don't realize it until you read the column headers in the leaderboard. A possible inference heuristic is sketched below.
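A possible heuristic for inferring the task from the response column (a sketch; the thresholds are arbitrary):

import pandas as pd

def infer_task(df: pd.DataFrame, target: str) -> str:
    col = df[target]
    if col.dtype.kind == 'f':
        return 'regression'      # real-valued response
    if col.dtype.kind in 'iu' and col.nunique() > 25:
        return 'regression'      # many distinct integers: probably numeric
    return 'classification'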

Test data not currently used

Currently the test_file is stored as test_df but then not used anywhere. We should consider what we want to do with a test file (such as passing it on to the explain functions).

Throw error if Classification toggle is turned on for a regression dataset

If the classification toggle is kept on (the default) and you use a regression dataset, it will throw an error on the backend because it looks for max_per_class_error on the leader model, which is not there. So it trains the whole AutoML run and then throws an error at the end, which is bad.

Ideally we'd have the app automatically set the toggle once the user selects a real-valued target column, but right now it doesn't work that way. A pre-training guard is sketched below.
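A minimal sketch of such a guard, reusing the infer_task heuristic sketched earlier (the notification wiring is illustrative):

from h2o_wave import ui

if q.args.is_classification and infer_task(q.app.train_df, q.app.target) == 'regression':
    q.page['meta'] = ui.meta_card(box='', notification_bar=ui.notification_bar(
        text=f'"{q.app.target}" looks real-valued; turn off Classification or pick another target.',
        type='error',
    ))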

Save images in the app and serve cached copy

Currently, every time you view the AutoML Viz or Model Explain tab, all the data/images are regenerated from scratch. Let's cache this in the app and serve it from the cached copy rather than regenerating each time.

For each plot, we can store the plot in the app:
e.g. q.app.varimp_heat_plot and q.app.mc_plot

For example, something like this:

    # Model Correlation Heatmap (1)
    try:
        train = h2o.H2OFrame(q.app.train_df)
        y = q.app.target
        if q.app.is_classification:
            train[y] = train[y].asfactor()
        if q.app.mc_plot is None:
            q.app.mc_plot = q.app.aml.model_correlation_heatmap(frame=train, figsize=(FIGSIZE[0], FIGSIZE[0]))
        q.page['plot21'] = ui.image_card(
            box='charts_left',
            title="Model Correlation Heatmap Plot",
            type="png",
            image=get_image_from_matplotlib(q.app.mc_plot),
        )

PD plot not showing on multiclass

If you use the wine dataset with "quality" and let it train a classification (multiclass) model, the PD plot is broken. That's because we are missing the part where we set a reference class, so we need to update the code to pass in a reference class.

For multiclass, we should add a second picker box with the list of classes, so that you can choose which class the PD plot is shown for (see the sketch below).
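A minimal sketch (recent h2o versions accept a target argument on pd_plot for multinomial models; the widget names are illustrative):

from h2o_wave import ui

classes = train[y].levels()[0]  # class labels of the (factor) response
class_picker = ui.picker(name='pd_class', label='Reference class', max_choices=1,
                         choices=[ui.choice(c, c) for c in classes], values=[classes[0]])

# When rendering the plot:
plot = model.pd_plot(frame=train, column=q.app.pd_col, target=q.args.pd_class[0])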

Loosen h2o version requirement

We might consider changing the version requirement in requirements.txt to <3.36.0.2 (or putting a lower bound on it instead), so that users have more flexibility to use the app with an existing/installed version of h2o on their machine.

If we want people to be able to use earlier versions of h2o (e.g. below 3.32.1.1), we need to abstract the leaderboard code a bit more. The "algo" column was only added in 3.32.1.1, for example, and the other extended leaderboard columns were added in earlier versions (3.28.0.1).

The learning curve plot, if we add it, was likewise only introduced in a fairly recent h2o version.
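A minimal sketch of the kind of guard this would need (deriving the algo from model_id is an illustrative fallback, not existing app code):

lb_df = q.app.aml.leaderboard.as_data_frame()
# "algo" only exists in h2o >= 3.32.1.1; degrade gracefully on older versions.
if 'algo' not in lb_df.columns:
    lb_df['algo'] = lb_df['model_id'].str.split('_').str[0]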

Add Pareto Front to the AutoML Viz Models Summary tab

We should add the Pareto front. To start, let's just display the default (x, y) axes: prediction time on the x axis and model performance on the y axis. In the future we can consider making this more interactive.

Redesign the home tab display

Let's divide this area into two columns: one for the description (left) and one for the logo (right). Or figure out a way to move the image to the right via HTML image tags.

[Screenshot: current home tab layout]

Run Explain tasks in background after training finishes

To avoid the computational delay, let's start the explain tasks immediately after training finishes, using background tasks. We can then store the data/images in the app so they are ready to serve by the time the user clicks on the Explain tabs. We will store only the default model's images for Model Explain and generate the other ones on demand.

Useful blog: https://medium.com/@unusualcode/background-jobs-in-wave-or-how-not-to-kill-your-ui-ae1fed95693a
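A minimal sketch using Wave's q.run, which executes a blocking function on a worker thread (the attribute names follow the caching issue above; train is the training H2OFrame):

async def on_training_done(q, train):
    # Run the blocking h2o/matplotlib calls off the event loop, then cache
    # the figures in the app so the Explain tabs can serve them instantly.
    q.app.varimp_heat_plot = await q.run(q.app.aml.varimp_heatmap)
    q.app.mc_plot = await q.run(q.app.aml.model_correlation_heatmap, train)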
