![](https://private-user-images.githubusercontent.com/2350154/275078544-7f4a8f92-2136-4442-aeb1-4737a5807f3d.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3MjMyNjYxODksIm5iZiI6MTcyMzI2NTg4OSwicGF0aCI6Ii8yMzUwMTU0LzI3NTA3ODU0NC03ZjRhOGY5Mi0yMTM2LTQ0NDItYWViMS00NzM3YTU4MDdmM2QucG5nP1gtQW16LUFsZ29yaXRobT1BV1M0LUhNQUMtU0hBMjU2JlgtQW16LUNyZWRlbnRpYWw9QUtJQVZDT0RZTFNBNTNQUUs0WkElMkYyMDI0MDgxMCUyRnVzLWVhc3QtMSUyRnMzJTJGYXdzNF9yZXF1ZXN0JlgtQW16LURhdGU9MjAyNDA4MTBUMDQ1ODA5WiZYLUFtei1FeHBpcmVzPTMwMCZYLUFtei1TaWduYXR1cmU9MGY2YWM1YWEzNGFiNDVmZjJhMzhhYjg5MzQ0YzMzN2QxZjBjMGQ5ZGEzZDlhNWZkYmY4OGU5N2QxMDhiZGY4NyZYLUFtei1TaWduZWRIZWFkZXJzPWhvc3QmYWN0b3JfaWQ9MCZrZXlfaWQ9MCZyZXBvX2lkPTAifQ.bI7xdLTUxZimjkC9LZ0cSXgfHYpfJ27ia6exSiCoVrY)
A Python package designed for the evaluation of machine learning models with heterogeneous test data.
Imagine a scenario where the observed data consists of multiple groups, and the composition of these groups changes in a non-stationary manner. If the expected value of a model's evaluation metric depends on the group, and on nothing else, then the overall metric fluctuates as the group composition shifts, even though the metric within each group stays stable. This fluctuation complicates the comparison of evaluation metrics across different models. This library automatically determines an appropriate grouping for such scenarios, so that as long as the model itself does not change, its evaluation metric within each group stays consistent.
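The effect described above can be checked with a little arithmetic. The sketch below uses made-up per-group errors (`group_error` and the group shares are illustrative assumptions, not values from this package) and shows the pooled error moving while the per-group errors stay fixed.

```python
# Hypothetical per-group mean squared errors; the model itself never changes.
group_error = {"A": 1.0, "B": 4.0}


def overall_error(share_of_a):
    """Pooled error when a fraction `share_of_a` of the samples comes from group A."""
    return share_of_a * group_error["A"] + (1.0 - share_of_a) * group_error["B"]


early = overall_error(0.8)  # group A dominates: 0.8 * 1 + 0.2 * 4, about 1.6
late = overall_error(0.2)   # group B dominates: 0.2 * 1 + 0.8 * 4, about 3.4
print(early, late)
```

The pooled metric roughly doubles purely because the mix of groups changed, which is why a comparison between models is easier to read group by group.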
Consider a health application, where metrics such as physical activity, dietary habits, and sleep patterns are monitored to forecast health risks. The user base is diverse, ranging from teenagers to retirees in their 60s and from active athletes to office workers, so prediction difficulty can differ significantly between groups. Moreover, if a dataset doesn't have a balanced representation of each user group, certain groups might overly influence the results. This highlights the need to segment predictions and evaluations based on distinct user demographics.
However, overly detailed segmentation brings its own set of challenges. Segmenting users into numerous specific groups can lead to scarce evaluation data for each segment. Evaluating based on smaller datasets can result in greater metric variance, making it harder to accurately assess machine learning models.
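To put a number on the variance concern, here is a minimal sketch that assumes the per-sample losses are independent with a common standard deviation, so the standard error of a mean over `n` samples is `sigma / sqrt(n)`. The value of `sigma` and the sample counts are illustrative assumptions.

```python
import math


def standard_error_of_mean(sigma, n):
    """Standard error of a mean over n i.i.d. samples with standard deviation sigma."""
    return sigma / math.sqrt(n)


sigma = 2.0  # hypothetical per-sample standard deviation of the loss
pooled = standard_error_of_mean(sigma, 10_000)  # a single segment holding all samples
split = standard_error_of_mean(sigma, 100)      # one of 100 equally sized segments
print(pooled, split)
```

Under these assumptions, splitting the evaluation data into 100 equal segments makes each segment's metric ten times noisier than the pooled metric.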
To address this, it's essential to group users with the right level of granularity. `heteroeval` suggests the best granularity for user grouping, considering evaluation metric trends and the amount of evaluation data, without depending on the actual feature values. For instance, if metrics for users in their 20s are similar to those in their 30s, `heteroeval` might advise clustering these age groups together.
With `heteroeval`, practitioners can account for differences in evaluation metrics between user groups, leading to a more precise model evaluation.
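For a given model $m$, regime $r$, and group $G$, the evaluation metric is defined as:

$$E_{m,r,G} = \frac{1}{N_{r,G}} \sum_{i \in (r, G)} F(y_{m,r,i}, \hat{y}_{m,r,i})$$

Where:

- $E_{m,r,G}$ represents the evaluation metric for model $m$, regime $r$, and group $G$.
- $y_{m,r,i}$ and $\hat{y}_{m,r,i}$ denote the true value and predicted value, respectively.
- $F$ is a general function to compute the evaluation metric. As an example, the squared error can be used and is represented as:

$$F(y, \hat{y}) = (y - \hat{y})^2$$

Here, $N_{r,G}$ is the number of samples in regime $r$ that belong to group $G$.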
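As a minimal sketch of the per-regime, per-group metric, the snippet below assumes the metric is the mean of a per-sample loss over the samples in one regime and one group, with squared error as the loss. The function names and the toy arrays are illustrative and are not part of the `heteroeval` API.

```python
import numpy as np


def squared_error(y_true, y_pred):
    """Per-sample squared error, the example loss F."""
    return (np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)) ** 2


def evaluation_metric(y_true, y_pred, regimes, groups, regime, group, loss=squared_error):
    """Mean loss over the samples that fall in the given regime and group."""
    mask = (np.asarray(regimes) == regime) & (np.asarray(groups) == group)
    if not mask.any():
        return float("nan")  # no samples for this (regime, group) pair
    return float(np.mean(loss(np.asarray(y_true)[mask], np.asarray(y_pred)[mask])))


# Toy data: four samples, two regimes, a single group.
y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.0, 3.0, 3.0, 6.0]
regimes = [0, 0, 1, 1]
groups = [0, 0, 0, 0]

e_r0 = evaluation_metric(y_true, y_pred, regimes, groups, regime=0, group=0)
e_r1 = evaluation_metric(y_true, y_pred, regimes, groups, regime=1, group=0)
print(e_r0, e_r1)
```

Returning NaN for an empty regime and group pair is one possible convention; a real implementation may prefer to skip such pairs instead.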
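Given a grouping rule $\theta$, the variation of the evaluation metric across regimes for model $m$ and group $G$ is:

$$V_{m,G}(\theta) = \text{Aggregate}_{\text{inter-regime}}\left(\{E_{m,r,G}\}_{r}\right)$$

Where:

- $\text{Aggregate}_{\text{inter-regime}}$ is a general function to aggregate evaluation metrics across regimes. An example implementation can be the standard deviation.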
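With the standard deviation as the inter-regime aggregate, the variation for one model can be sketched as below. The array layout (regimes along the first axis, groups along the second) and the function name are assumptions made for illustration.

```python
import numpy as np


def inter_regime_variation(metric_by_regime_and_group):
    """Standard deviation across regimes (axis 0), computed separately per group."""
    e = np.asarray(metric_by_regime_and_group, dtype=float)
    return np.std(e, axis=0)


# Rows are regimes, columns are groups (hypothetical metric values for one model).
E = np.array([
    [0.5, 1.0],
    [2.0, 1.0],
])
V = inter_regime_variation(E)
print(V)  # the first group varies across regimes, the second does not
```

A group whose metric is identical in every regime gets a variation of zero, which is the behavior a good grouping is meant to approach.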
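The cost function calculates the average of the evaluation metric variations $V_{m,G}(\theta)$ over all groups and models:

$$C(\theta) = \text{Aggregate}_{\text{model}}\left(\text{Aggregate}_{\text{group}}\left(V_{m,G}(\theta)\right)\right)$$

Where:

- $\text{Aggregate}_{\text{group}}$ and $\text{Aggregate}_{\text{model}}$ are general functions to aggregate evaluation metrics per group and across the model, respectively. An example implementation can be the average.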
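Using the average for both aggregates, the cost reduces to a mean of means. The sketch below assumes the per-model, per-group variations are already available as a 2-D array; the function name and the numbers are illustrative.

```python
import numpy as np


def cost(variation_by_model_and_group):
    """Average the variations over groups, then over models."""
    v = np.asarray(variation_by_model_and_group, dtype=float)
    per_model = v.mean(axis=1)      # group-wise aggregate for each model
    return float(per_model.mean())  # model-wise aggregate


# Rows are models, columns are groups (hypothetical inter-regime variations).
V = np.array([
    [0.75, 0.0],
    [0.25, 0.5],
])
c = cost(V)
print(c)
```

Swapping either `mean` for another aggregate, such as a maximum, changes how strongly a single unstable group or model is penalized.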
The process of grouping involves transforming data samples, characterized by their features and possibly meta-information, into a specific group index. This mapping can be represented by a function parameterized by $\theta$:

$$G_i = g_{\theta}(x_i, m_i)$$

Where:

- $G_i$ represents the group index for the $i$-th sample.
- $x_i$ is the vector of features for the $i$-th sample.
- $m_i$ denotes the meta-information associated with the $i$-th sample.
- $g_{\theta}$ is the grouping function, parameterized by $\theta$, determining the group index based on features and meta-information.
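One simple family of grouping functions cuts a single feature at a set of thresholds. The sketch below is a hypothetical example of such a function, with $\theta$ as the list of age boundaries; it ignores the meta-information and is not how `heteroeval` necessarily parameterizes its groupings.

```python
import numpy as np


def g_theta(x, meta, theta):
    """Map each sample to a group index by binning its first feature at thresholds theta.

    `meta` is accepted to mirror the signature in the text but is unused here.
    """
    ages = np.asarray(x, dtype=float)[:, 0]
    bins = np.sort(np.asarray(theta, dtype=float))
    return np.digitize(ages, bins=bins)


# One feature column (age) for four hypothetical users.
X = np.array([[15.0], [25.0], [35.0], [65.0]])
group_index = g_theta(X, meta=None, theta=[20.0, 40.0])
print(group_index)  # users aged 25 and 35 share a group
```

With thresholds at 20 and 40, users in their 20s and 30s land in the same group, matching the age-merging example given earlier.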
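We aim to find the parameter $\theta$ that minimizes the cost function $C(\theta)$:

$$\theta^{*} = \arg\min_{\theta} C(\theta)$$

By finding $\theta^{*}$, we obtain the grouping rule under which each model's evaluation metric varies least across regimes within every group.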
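For intuition, the whole pipeline can be sketched as an exhaustive search over a few candidate parameters. Everything below is an illustrative assumption: a single age threshold as $\theta$, mean squared error as the metric, standard deviation across regimes, averages for the other aggregates, and toy data in which the error depends only on whether the user is under 30.

```python
import numpy as np


def cost(theta, x, y_true, y_preds, regimes):
    """Mean over models and groups of the std across regimes of the per-group MSE."""
    groups = (x >= theta).astype(int)  # hypothetical grouping: one threshold on x
    per_model = []
    for y_pred in y_preds:
        sq_err = (y_true - y_pred) ** 2
        per_group = []
        for g in np.unique(groups):
            metric_per_regime = []
            for r in np.unique(regimes):
                mask = (regimes == r) & (groups == g)
                if mask.any():  # skip regimes with no samples in this group
                    metric_per_regime.append(sq_err[mask].mean())
            per_group.append(np.std(metric_per_regime))
        per_model.append(np.mean(per_group))
    return float(np.mean(per_model))


def grid_search(candidates, x, y_true, y_preds, regimes):
    """Return the candidate with the lowest cost, plus all costs."""
    costs = [cost(t, x, y_true, y_preds, regimes) for t in candidates]
    return candidates[int(np.argmin(costs))], costs


# Toy data: regime 0 is mostly young users, regime 1 mostly older users.
x = np.array([20.0, 25.0, 28.0, 40.0, 20.0, 40.0, 45.0, 50.0])
regimes = np.array([0, 0, 0, 0, 1, 1, 1, 1])
y_true = np.zeros_like(x)
y_pred = np.where(x < 30, 1.0, 2.0)  # squared error 1 below age 30, 4 otherwise

best_theta, costs = grid_search([22.0, 30.0, 42.0], x, y_true, [y_pred], regimes)
print(best_theta, costs)
```

Only the threshold at 30 separates the two error levels, so it is the only candidate whose groups have the same metric in both regimes. Note that this naive cost gives zero variation to a group that appears in just one regime, so a practical search would also need to account for the amount of data per group.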

    pip install git+https://github.com/inoueakimitsu/heteroeval

Simply call `find_best_grouping()`, as shown below:

    from heteroeval import find_best_grouping

    find_best_grouping(
        n_models,
        regimes,
        X, y_true,
        y_pred_for_each_model,
        evaluation_measure,
        inter_regime_variation_measure,
        groupwise_variation_measure_aggregate_function,
        modelwise_variation_measure_aggregate_function,
        cost_function,
        optimizer)

Refer to `heteroeval/discrete.py` for a comprehensive working example.
heteroeval is licensed under the MIT License.