tempeh

tempeh is a framework to TEst Machine learning PErformance exHaustively, which includes tracking memory usage and run time. It is particularly useful as a pluggable tool for your repository's performance tests. Typically, people want to run these tests periodically over various datasets and/or with a number of models to catch regressions in run time or memory consumption. This should be as easy as:

import pytest
from time import time
from tempeh.configurations import datasets, models

# Run every registered model against every registered dataset.
@pytest.mark.parametrize('Dataset', datasets.values())
@pytest.mark.parametrize('Model', models.values())
def test_fit_predict_regression(Dataset, Model):
    dataset = Dataset()
    X_train, X_test = dataset.get_X()
    y_train, y_test = dataset.get_y()
    model = Model()
    # get_max_execution_time is not defined in this snippet; supply your own
    # helper that returns the time budget in seconds for this combination.
    max_execution_time = get_max_execution_time(dataset, model)
    # Only time combinations where the model supports the dataset.
    if model.compatible_with_dataset(dataset):
        start_time = time()
        model.fit(X_train, y_train)
        model.predict(X_test)
        duration = time() - start_time

        assert duration < max_execution_time
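
The test above only checks run time. Since memory consumption is also of interest, a minimal sketch of measuring both for a single call could look like the following. It uses only the Python standard library (time.perf_counter and tracemalloc) and is not tempeh's own API; the names measure and build_list are hypothetical. Note that tracemalloc only sees allocations made through Python's memory allocators, so memory allocated natively by some libraries may not be counted.

```python
import time
import tracemalloc


def measure(func, *args, **kwargs):
    """Run func once and return (result, seconds, peak_bytes).

    Hypothetical helper for illustration; not part of tempeh.
    """
    tracemalloc.start()
    start = time.perf_counter()
    try:
        result = func(*args, **kwargs)
        seconds = time.perf_counter() - start
        # get_traced_memory() returns (current, peak) in bytes.
        _, peak_bytes = tracemalloc.get_traced_memory()
    finally:
        # Always stop tracing, even if func raised.
        tracemalloc.stop()
    return result, seconds, peak_bytes


def build_list(n):
    """Toy workload standing in for model.fit / model.predict."""
    return list(range(n))


result, seconds, peak_bytes = measure(build_list, 100_000)
print(f"took {seconds:.4f}s, peak traced memory {peak_bytes} bytes")
```

In a performance test, the returned values could then be compared against thresholds in the same way as duration is compared against max_execution_time above.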

Installation

tempeh depends on various packages to provide models, including tensorflow, torch, xgboost, and lightgbm. To install a release version of tempeh, run:

pip install tempeh

Common issues

If you're using a 32-bit Python version, you might need to switch to a 64-bit Python version first to install tensorflow successfully.
If the installation of torch fails, try the installation recommended on the PyTorch website for stable versions without CUDA for your Python version and operating system.
If the installation of lightgbm or xgboost fails, try a pip version lower than 20.0 until their bug is resolved.

Structure

Datasets

Datasets (located in the datasets/ directory) encapsulate different datasets used for testing.

To add a new one:

Create a Python file in the datasets/ directory with the naming convention [name]_datasets.py.
Subclass BasePerformanceDatasetWrapper. The naming convention is [dataset_name]PerformanceDatasetWrapper.
In __init__, load the dataset and call super().__init__(data, targets, size).
Add the class to __init__.py.
Make sure the class contains the class variables task, data_type, and size.
Add an entry to the datasets dictionary in configurations.py.
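
The steps above can be sketched roughly as follows. To keep the example self-contained, it defines a simplified stand-in for BasePerformanceDatasetWrapper instead of importing tempeh's real class, so the split logic, the 80/20 ratio, and the string values used for task, data_type, and size are assumptions made for illustration. SyntheticLinearPerformanceDatasetWrapper and the synthetic_linear key are hypothetical names.

```python
class BasePerformanceDatasetWrapper:
    """Simplified stand-in for tempeh's base class (assumption, not the real code)."""

    task = None
    data_type = None
    size = None

    def __init__(self, data, targets, size):
        # Assumed behavior: keep the first 80% of rows for training.
        split = len(data) * 4 // 5
        self._X_train, self._X_test = data[:split], data[split:]
        self._y_train, self._y_test = targets[:split], targets[split:]
        self._size = size

    def get_X(self):
        return self._X_train, self._X_test

    def get_y(self):
        return self._y_train, self._y_test


class SyntheticLinearPerformanceDatasetWrapper(BasePerformanceDatasetWrapper):
    """Hypothetical dataset following the [dataset_name]PerformanceDatasetWrapper convention."""

    # Required class variables; the values are placeholders.
    task = "regression"
    data_type = "tabular"
    size = "small"

    def __init__(self):
        # Load (here: generate) the data, then hand it to the base class.
        data = [[float(i), float(2 * i)] for i in range(100)]
        targets = [row[0] + row[1] for row in data]
        super().__init__(data, targets, self.size)


# Corresponds to adding an entry to the datasets dictionary in configurations.py.
datasets = {"synthetic_linear": SyntheticLinearPerformanceDatasetWrapper}

dataset = datasets["synthetic_linear"]()
X_train, X_test = dataset.get_X()
y_train, y_test = dataset.get_y()
```

In the real package, the class would live in its own [name]_datasets.py file and be exported from __init__.py; check the base class in the datasets/ directory for the actual constructor behavior and the expected values of the class variables.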

Models

Models (models/ directory) wrap different machine learning models.

To add a new one:

Create a Python file in the models/ directory with the naming convention [name]_model.py.
Subclass BaseModelWrapper and name the class [name]ModelWrapper.
In __init__ train the model; we expect format __init__(self, ...)
Models must contain tasks and algorithm.
Add an entry to the models dictionary in configurations.py.
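
As a rough illustration of these steps, the sketch below wraps a trivial mean-predicting baseline. It defines a simplified stand-in for BaseModelWrapper rather than importing tempeh's real class, and it follows the usage shown in the test example earlier (construct the model, then call fit and predict). The compatibility check, the string values for tasks and algorithm, and the names MeanModelWrapper and ToyDataset are all assumptions made for illustration.

```python
from statistics import fmean


class BaseModelWrapper:
    """Simplified stand-in for tempeh's base class (assumption, not the real code)."""

    tasks = ()
    algorithm = None

    def compatible_with_dataset(self, dataset):
        # Assumed rule: a model supports a dataset if it supports the dataset's task.
        return dataset.task in self.tasks


class MeanModelWrapper(BaseModelWrapper):
    """Hypothetical model following the [name]ModelWrapper convention."""

    # Required class variables; the values are placeholders.
    tasks = ("regression",)
    algorithm = "mean_baseline"

    def __init__(self):
        self._mean = None

    def fit(self, X, y):
        # "Training" here just memorizes the mean of the targets.
        self._mean = fmean(y)
        return self

    def predict(self, X):
        if self._mean is None:
            raise RuntimeError("call fit() before predict()")
        return [self._mean for _ in X]


class ToyDataset:
    """Minimal object exposing only the attribute the compatibility check reads."""

    task = "regression"


# Corresponds to adding an entry to the models dictionary in configurations.py.
models = {"mean_baseline": MeanModelWrapper}

model = models["mean_baseline"]()
model.fit([[0.0], [1.0], [2.0]], [1.0, 2.0, 3.0])
predictions = model.predict([[5.0], [6.0]])  # [2.0, 2.0]
```

The real base class in the models/ directory defines the actual constructor signature and compatibility logic, so treat this only as an outline of the moving parts.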

Maintainers

In alphabetical order:

Contributing

To contribute please check our Contributing Guide.

Issues

Regular (non-Security) Issues

Please submit a report through GitHub issues. A maintainer will respond within a reasonable period of time to handle the issue as follows:

bug: triage as bug and provide estimated timeline based on severity
feature request: triage as feature request and provide estimated timeline
question or discussion: triage as question and respond or notify/identify a suitable expert to respond

Maintainers are supposed to link duplicate issues when possible.

Reporting Security Issues

Please take a look at our guidelines for reporting security issues.

Boston Dataset no longer supported by scikit-learn

As the title states, loading the Boston dataset logs the following warning, which says that the dataset is no longer supported:

`WARNING:root:
load_boston has been removed from scikit-learn since version 1.2.

The Boston housing prices dataset has an ethical problem: as
investigated in [1], the authors of this dataset engineered a
non-invertible variable "B" assuming that racial self-segregation had a
positive impact on house prices [2]. Furthermore the goal of the
research that led to the creation of this dataset was to study the
impact of air quality but it did not give adequate demonstration of the
validity of this assumption.

The scikit-learn maintainers therefore strongly discourage the use of
this dataset unless the purpose of the code is to study and educate
about ethical issues in data science and machine learning.

In this special case, you can fetch the dataset from the original
source::

import pandas as pd
import numpy as np

data_url = "http://lib.stat.cmu.edu/datasets/boston"
raw_df = pd.read_csv(data_url, sep="\s+", skiprows=22, header=None)
data = np.hstack([raw_df.values[::2, :], raw_df.values[1::2, :2]])
target = raw_df.values[1::2, 2]

Alternative datasets include the California housing dataset and the
Ames housing dataset. You can load the datasets as follows::

from sklearn.datasets import fetch_california_housing
housing = fetch_california_housing()

for the California housing dataset and::

from sklearn.datasets import fetch_openml
housing = fetch_openml(name="house_prices", as_frame=True)

for the Ames housing dataset.

[1] M Carlisle.
"Racist data destruction?"
https://medium.com/@docintangible/racist-data-destruction-113e3eff54a8

[2] Harrison Jr, David, and Daniel L. Rubinfeld.
"Hedonic housing prices and the demand for clean air."
Journal of environmental economics and management 5.1 (1978): 81-102.
https://www.researchgate.net/publication/4974606_Hedonic_housing_prices_and_the_demand_for_clean_air`

