salesforce / merlion

Merlion: A Machine Learning Framework for Time Series Intelligence

License: BSD 3-Clause "New" or "Revised" License

Python 98.61% Dockerfile 0.11% CSS 1.26% JavaScript 0.02%
time-series anomaly-detection forecasting machine-learning benchmarking automl ensemble-learning

merlion's Introduction

Merlion: A Machine Learning Library for Time Series

Table of Contents

  1. Introduction
  2. Comparison with Related Libraries
  3. Installation
  4. Documentation
  5. Getting Started
    1. Anomaly Detection
    2. Forecasting
  6. Evaluation and Benchmarking
  7. Technical Report and Citing Merlion

Introduction

Merlion is a Python library for time series intelligence. It provides an end-to-end machine learning framework that includes loading and transforming data, building and training models, post-processing model outputs, and evaluating model performance. It supports various time series learning tasks, including forecasting, anomaly detection, and change point detection for both univariate and multivariate time series. This library aims to provide engineers and researchers a one-stop solution to rapidly develop models for their specific time series needs, and benchmark them across multiple time series datasets.

Merlion's key features are

  • Standardized and easily extensible data loading & benchmarking for a wide range of forecasting and anomaly detection datasets. This includes transparent support for custom datasets.
  • A library of diverse models for anomaly detection, forecasting, and change point detection, all unified under a shared interface. Models include classic statistical methods, tree ensembles, and deep learning approaches. Advanced users may fully configure each model as desired.
  • Abstract DefaultDetector and DefaultForecaster models that are efficient, robustly achieve good performance, and provide a starting point for new users.
  • AutoML for automated hyperparameter tuning and model selection.
  • Unified API for using a wide range of models to forecast with exogenous regressors.
  • Practical, industry-inspired post-processing rules for anomaly detectors that make anomaly scores more interpretable, while also reducing the number of false positives.
  • Easy-to-use ensembles that combine the outputs of multiple models to achieve more robust performance.
  • Flexible evaluation pipelines that simulate the live deployment & re-training of a model in production, and evaluate performance on both forecasting and anomaly detection.
  • Native support for visualizing model predictions, including with a clickable visual UI.
  • Distributed computation backend using PySpark, which can be used to serve time series applications at industrial scale.

Comparison with Related Libraries

The table below provides a visual overview of how Merlion's key features compare to other libraries for time series anomaly detection and/or forecasting.

Libraries compared: Merlion, Prophet, Alibi Detect, Kats, darts, statsmodels, nixtla, GluonTS, RRCF, STUMPY, Greykite, pmdarima

Features compared:

  • Univariate Forecasting
  • Multivariate Forecasting
  • Univariate Anomaly Detection
  • Multivariate Anomaly Detection
  • Pre Processing
  • Post Processing
  • AutoML
  • Ensembles
  • Benchmarking
  • Visualization

The following features are new in Merlion 2.0:

Libraries compared: Merlion, Prophet, Alibi Detect, Kats, darts, statsmodels, nixtla, GluonTS, RRCF, STUMPY, Greykite, pmdarima

Features compared:

  • Exogenous Regressors
  • Change Point Detection
  • Clickable Visual UI
  • Distributed Backend

Installation

Merlion consists of two sub-repos: merlion implements the library's core time series intelligence features, and ts_datasets provides standardized data loaders for multiple time series datasets. These loaders load time series as pandas.DataFrames with accompanying metadata.

You can install merlion from PyPI by calling pip install salesforce-merlion. You may install from source by cloning this repo and calling pip install Merlion/, or pip install -e Merlion/ to install in editable mode. You may install additional dependencies via pip install salesforce-merlion[all], or by calling pip install "Merlion/[all]" if installing from source. Individually, the optional dependencies include dashboard for a GUI dashboard, spark for a distributed computation backend with PySpark, and deep-learning for all deep learning models.

To install the data loading package ts_datasets, clone this repo and call pip install -e Merlion/ts_datasets/. This package must be installed in editable mode (i.e. with the -e flag) if you don't want to manually specify the root directory of every dataset when initializing its data loader.

Note the following external dependencies:

  1. Some of our forecasting models depend on OpenMP. If using conda, please run conda install -c conda-forge lightgbm before installing our package. This will ensure that OpenMP is configured to work with the lightgbm package (one of our dependencies) in your conda environment. If using Mac, please install Homebrew and call brew install libomp so that the OpenMP library is available for the model.

  2. Some of our anomaly detection models depend on the Java Development Kit (JDK). For Ubuntu, call sudo apt-get install openjdk-11-jdk. For Mac OS, install Homebrew and call brew tap adoptopenjdk/openjdk && brew install --cask adoptopenjdk11. Also ensure that java can be found on your PATH, and that the JAVA_HOME environment variable is set.
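
As a quick sanity check of the Java requirement above, the following minimal sketch reports whether a java executable can be found on your PATH and whether JAVA_HOME is set. It uses only the Python standard library; the helper name check_java_setup is an illustrative assumption and is not part of Merlion.

```python
import os
import shutil


def check_java_setup():
    """Report whether `java` is on PATH and whether JAVA_HOME is set.

    Hypothetical helper for illustration; it is not part of Merlion.
    """
    java_path = shutil.which("java")  # None if no `java` executable is found on PATH
    java_home = os.environ.get("JAVA_HOME")  # None if the variable is unset
    return {"java_on_path": java_path, "java_home": java_home}


status = check_java_setup()
print(f"java executable: {status['java_on_path'] or 'NOT FOUND on PATH'}")
print(f"JAVA_HOME:       {status['java_home'] or 'NOT SET'}")
```

If either line reports a missing value, models that launch a JVM are likely to fail at training time.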

Documentation

For example code and an introduction to Merlion, see the Jupyter notebooks in examples, and the guided walkthrough here. You may find detailed API documentation (including the example code) here. The technical report outlines Merlion's overall architecture and presents experimental results on time series anomaly detection & forecasting for both univariate and multivariate time series.

Getting Started

The easiest way to get started is to use the GUI web-based dashboard. This dashboard provides a great way to quickly experiment with many models on your own custom datasets. To use it, install Merlion with the optional dashboard dependency (i.e. pip install salesforce-merlion[dashboard]), and call python -m merlion.dashboard from the command line. You can view the dashboard at http://localhost:8050. Below, we show some screenshots of the dashboard for both anomaly detection and forecasting.

anomaly dashboard

forecast dashboard

To help you get started with using Merlion in your own code, we provide below some minimal examples using Merlion default models for both anomaly detection and forecasting.

Anomaly Detection

Here, we show the code to replicate the results from the anomaly detection dashboard above. We begin by importing Merlion’s TimeSeries class and the data loader for the Numenta Anomaly Benchmark (NAB). We can then divide a specific time series from this dataset into training and testing splits.

from merlion.utils import TimeSeries
from ts_datasets.anomaly import NAB

# Data loader returns pandas DataFrames, which we convert to Merlion TimeSeries
time_series, metadata = NAB(subset="realKnownCause")[3]
train_data = TimeSeries.from_pd(time_series[metadata.trainval])
test_data = TimeSeries.from_pd(time_series[~metadata.trainval])
test_labels = TimeSeries.from_pd(metadata.anomaly[~metadata.trainval])

We can then initialize and train Merlion’s DefaultDetector, which is an anomaly detection model that balances performance with efficiency. We also obtain its predictions on the test split.

from merlion.models.defaults import DefaultDetectorConfig, DefaultDetector
model = DefaultDetector(DefaultDetectorConfig())
model.train(train_data=train_data)
test_pred = model.get_anomaly_label(time_series=test_data)

Next, we visualize the model's predictions.

from merlion.plot import plot_anoms
import matplotlib.pyplot as plt
fig, ax = model.plot_anomaly(time_series=test_data)
plot_anoms(ax=ax, anomaly_labels=test_labels)
plt.show()

anomaly figure

Finally, we can quantitatively evaluate the model. The precision and recall come from the fact that the model fired 3 alarms, with 2 true positives, 1 false negative, and 1 false positive. We also evaluate the mean time the model took to detect each anomaly that it correctly detected.

from merlion.evaluate.anomaly import TSADMetric
p = TSADMetric.Precision.value(ground_truth=test_labels, predict=test_pred)
r = TSADMetric.Recall.value(ground_truth=test_labels, predict=test_pred)
f1 = TSADMetric.F1.value(ground_truth=test_labels, predict=test_pred)
mttd = TSADMetric.MeanTimeToDetect.value(ground_truth=test_labels, predict=test_pred)
print(f"Precision: {p:.4f}, Recall: {r:.4f}, F1: {f1:.4f}\n"
      f"Mean Time To Detect: {mttd}")
Precision: 0.6667, Recall: 0.6667, F1: 0.6667
Mean Time To Detect: 1 days 10:22:30
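
To make the arithmetic behind these numbers explicit, here is a small sketch that applies the standard precision, recall, and F1 definitions to the counts stated above (2 true positives, 1 false positive, 1 false negative). The function precision_recall_f1 is an illustrative helper, not a Merlion API, and it does not show how Merlion matches alarms to labeled anomalies internally.

```python
def precision_recall_f1(true_pos, false_pos, false_neg):
    """Standard definitions computed from raw counts (illustrative helper)."""
    precision = true_pos / (true_pos + false_pos)  # fraction of alarms that were correct
    recall = true_pos / (true_pos + false_neg)     # fraction of anomalies that were caught
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two
    return precision, recall, f1


p, r, f1 = precision_recall_f1(true_pos=2, false_pos=1, false_neg=1)
print(f"Precision: {p:.4f}, Recall: {r:.4f}, F1: {f1:.4f}")
# Precision: 0.6667, Recall: 0.6667, F1: 0.6667
```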

Forecasting

Here, we show the code to replicate the results from the forecasting dashboard above. We begin by importing Merlion’s TimeSeries class and the data loader for the M4 dataset. We can then divide a specific time series from this dataset into training and testing splits.

from merlion.utils import TimeSeries
from ts_datasets.forecast import M4

# Data loader returns pandas DataFrames, which we convert to Merlion TimeSeries
time_series, metadata = M4(subset="Hourly")[0]
train_data = TimeSeries.from_pd(time_series[metadata.trainval])
test_data = TimeSeries.from_pd(time_series[~metadata.trainval])

We can then initialize and train Merlion’s DefaultForecaster, which is a forecasting model that balances performance with efficiency. We also obtain its predictions on the test split.

from merlion.models.defaults import DefaultForecasterConfig, DefaultForecaster
model = DefaultForecaster(DefaultForecasterConfig())
model.train(train_data=train_data)
test_pred, test_err = model.forecast(time_stamps=test_data.time_stamps)

Next, we visualize the model’s predictions.

import matplotlib.pyplot as plt
fig, ax = model.plot_forecast(time_series=test_data, plot_forecast_uncertainty=True)
plt.show()

forecast figure

Finally, we quantitatively evaluate the model. sMAPE measures the error of the prediction on a scale of 0 to 100 (lower is better), while MSIS evaluates the quality of the 95% confidence band on a scale of 0 to 100 (lower is better).

# Evaluate the model's predictions quantitatively
from scipy.stats import norm
from merlion.evaluate.forecast import ForecastMetric

# Compute the sMAPE of the predictions (0 to 100, smaller is better)
smape = ForecastMetric.sMAPE.value(ground_truth=test_data, predict=test_pred)

# Compute the MSIS of the model's 95% confidence interval (0 to 100, smaller is better)
lb = TimeSeries.from_pd(test_pred.to_pd() + norm.ppf(0.025) * test_err.to_pd().values)
ub = TimeSeries.from_pd(test_pred.to_pd() + norm.ppf(0.975) * test_err.to_pd().values)
msis = ForecastMetric.MSIS.value(ground_truth=test_data, predict=test_pred,
                                 insample=train_data, lb=lb, ub=ub)
print(f"sMAPE: {smape:.4f}, MSIS: {msis:.4f}")
sMAPE: 4.1944, MSIS: 18.9331
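
The lb and ub lines above turn a point forecast and its standard error into a 95% band by assuming normally distributed errors. The sketch below shows the same computation for a single value using only the standard library (statistics.NormalDist in place of scipy.stats.norm); the function name normal_interval and the example numbers are illustrative assumptions.

```python
from statistics import NormalDist


def normal_interval(prediction, std_err, coverage=0.95):
    """Lower and upper bounds of a central interval under a Gaussian error assumption."""
    alpha = 1.0 - coverage
    z_lo = NormalDist().inv_cdf(alpha / 2)      # about -1.96 for 95% coverage
    z_hi = NormalDist().inv_cdf(1 - alpha / 2)  # about +1.96 for 95% coverage
    return prediction + z_lo * std_err, prediction + z_hi * std_err


lb, ub = normal_interval(prediction=100.0, std_err=5.0)
print(f"95% band: [{lb:.2f}, {ub:.2f}]")
# 95% band: [90.20, 109.80]
```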

Evaluation and Benchmarking

One of Merlion's key features is an evaluation pipeline that simulates the live deployment of a model on historical data. This enables you to compare models on the datasets relevant to them, under the conditions that they may encounter in a production environment. Our evaluation pipeline proceeds as follows:

  1. Train an initial model on recent historical training data (designated as the training split of the time series)
  2. At a regular interval (e.g. once per day), retrain the entire model on the most recent data. This can be either the entire history of the time series, or a more limited window (e.g. 4 weeks).
  3. Obtain the model's predictions (anomaly scores or forecasts) for the time series values that occur between re-trainings. You may customize whether this should be done in batch (predicting all values at once), streaming (updating the model's internal state after each data point without fully re-training it), or some intermediate cadence.
  4. Compare the model's predictions against the ground truth (labeled anomalies for anomaly detection, or the actual time series values for forecasting), and report quantitative evaluation metrics.
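
The four steps above can be sketched in a few lines of plain Python. This is a toy illustration under stated assumptions, not Merlion's evaluator: the "model" is simply the mean of the training history, predictions are made in batch between retrainings, and the names rolling_evaluation, n_train, and retrain_every are hypothetical.

```python
def rolling_evaluation(series, n_train, retrain_every):
    """Simulate deployment of a toy mean forecaster with periodic retraining.

    Returns the mean absolute error and the number of training rounds.
    """
    abs_errors = []
    n_rounds = 0
    t = n_train
    while t < len(series):
        # Steps 1 and 2: (re)train on all data seen so far.
        history = series[:t]
        model_mean = sum(history) / len(history)
        n_rounds += 1
        # Step 3: predict every value that arrives before the next retraining.
        t_next = min(t + retrain_every, len(series))
        for actual in series[t:t_next]:
            # Step 4: compare the prediction against the ground truth.
            abs_errors.append(abs(actual - model_mean))
        t = t_next
    return sum(abs_errors) / len(abs_errors), n_rounds


series = [10, 12, 11, 13, 12, 14, 13, 15, 14, 16]
mae, n_rounds = rolling_evaluation(series, n_train=4, retrain_every=2)
print(f"MAE: {mae:.2f} after {n_rounds} training rounds")
# MAE: 2.00 after 3 training rounds
```

Restricting history to a fixed number of trailing points instead of series[:t] would correspond to the limited training window mentioned in step 2.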

We provide scripts that allow you to use this pipeline to evaluate arbitrary models on arbitrary datasets. For example, invoking

python benchmark_anomaly.py --dataset NAB_realAWSCloudwatch --model IsolationForest --retrain_freq 1d

will evaluate the anomaly detection performance of the IsolationForest (retrained once a day) on the "realAWSCloudwatch" subset of the NAB dataset. Similarly, invoking

python benchmark_forecast.py --dataset M4_Hourly --model ETS

will evaluate the batch forecasting performance (i.e. no retraining) of ETS on the "Hourly" subset of the M4 dataset. You can find the results produced by running these scripts in the Experiments section of the technical report.

Technical Report and Citing Merlion

You can find more details in our technical report: https://arxiv.org/abs/2109.09265

If you're using Merlion in your research or applications, please cite using this BibTeX:

@article{bhatnagar2021merlion,
      title={Merlion: A Machine Learning Library for Time Series},
      author={Aadyot Bhatnagar and Paul Kassianik and Chenghao Liu and Tian Lan and Wenzhuo Yang
              and Rowan Cassius and Doyen Sahoo and Devansh Arpit and Sri Subramanian and Gerald Woo
              and Amrita Saha and Arun Kumar Jagota and Gokulakrishnan Gopalakrishnan and Manpreet Singh
              and K C Krithika and Sukumar Maddineni and Daeki Cho and Bo Zong and Yingbo Zhou
              and Caiming Xiong and Silvio Savarese and Steven Hoi and Huan Wang},
      year={2021},
      eprint={2109.09265},
      archivePrefix={arXiv},
      primaryClass={cs.LG}
}

To Dos

We are working to leverage GPUs for time series modeling to further improve the speed and throughput of Merlion. Stay tuned ...

merlion's People

Contributors

aadyotb, chenghaoliu89, cnll0075, emerald01, isenilov, jonwiggins, mattfernandez-salesforce, paulkass, rafaelleinio, shreyanand, svc-scm, uchiiii, yangwenzhuo08, yihaocs

merlion's Issues

[BUG] `TimeSeries.from_pd()` does not use pandas frequency codes

I'm not sure if I'm missing something but with monthly data, it appears that Merlion does not recognize freq='1MS' as per the pandas offset aliases listed here.

When I run:

from merlion.utils import TimeSeries
ts = TimeSeries.from_pd(df.set_index('cal_month_begin_date'), freq='1MS')

The output increments from 1970-01-01 00:00:00.000 in steps of 1 millisecond, even though "MS" is the pandas code for "month start".

[BUG] Monthly data (freq='MS') might be accidentally interpolated to daily data

Describe the bug
When using monthly data (freq='MS') the default TemporalResample might interpolate to daily data.

To Reproduce

import pandas as pd
from merlion.utils import TimeSeries
from merlion.models.forecast.arima import Arima, ArimaConfig

data = pd.read_csv('https://raw.githubusercontent.com/facebookresearch/Kats/main/kats/data/air_passengers.csv', names=['time', 'value'], index_col='time', skiprows=1, parse_dates=True)
data = data.asfreq('MS')
data = TimeSeries.from_pd(data['value'])

model = Arima(ArimaConfig(order=(0, 1, 1)))

print(f'Training data 5/{len(data)}')
print(data[0:5])
print()

print('initial attributes')
print(f"model.timedelta: {model.timedelta}")
print(f"model.transform: {model.transform}\n")
print('after training transform')
model.transform.train(data)
print(f"model.timedelta: {model.timedelta}")
print(f"model.transform: {model.transform}\n")
print('after training')
model.train(data)
print(f"model.timedelta: {model.timedelta}\n")

data_transformed = model.transform(data)
print(f'Transformed training data 5/{len(data_transformed)}')
print(data_transformed[0:5])

Expected behavior
The monthly data should not be interpolated to daily data by default or an error message should be given.

Desktop (please complete the following information):

  • Merlion: 1.0.0

Additional context
I hope I got something wrong for the monthly data. There are quite a few places in the code base where time deltas are expressed in seconds, which might be problematic for monthly data.
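
To illustrate the concern about second-based time deltas, the following sketch (standard library only, independent of Merlion) shows that consecutive month starts are not separated by a constant number of days, so no single fixed timedelta can describe a month-start frequency.

```python
from datetime import datetime

# First day of January through May 2021
month_starts = [datetime(2021, month, 1) for month in range(1, 6)]

# Gap between each pair of consecutive month starts, in days
gaps_in_days = [(b - a).days for a, b in zip(month_starts, month_starts[1:])]
print(gaps_in_days)
# [31, 28, 31, 30]
```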

[BUG]Memory Error in Example Code

Describe the bug
A memory error occurred when the Anomaly Detection example code was executed:
MemoryError: Unable to allocate 1.16 TiB for an array with shape (159082031251,) and data type int64

To Reproduce

from merlion.utils import TimeSeries
from ts_datasets.anomaly import NAB

# Data loader returns pandas DataFrames, which we convert to Merlion TimeSeries
time_series, metadata = NAB(subset="realKnownCause", rootdir=my_root_dir)[3]
train_data = TimeSeries.from_pd(time_series[metadata.trainval])
test_data = TimeSeries.from_pd(time_series[~metadata.trainval])
test_labels = TimeSeries.from_pd(metadata.anomaly[~metadata.trainval])

Expected behavior
Data should be loaded successfully.

Screenshots
微信图片_20220105144643

Desktop (please complete the following information):

  • OS: Windows 10 64-bit
  • Merlion Version: 1.1.0

AutoSarima seasonality bugs [BUG]

Describe the bug
The default value of the season_order parameter ('auto','auto','auto','auto') cannot be passed as-is, because in autosarima.py line 139 m = season_order[-1], m is set to 'auto', but in line 144, m is used in a division, which throws a TypeError.
In addition, in line 163, the variable xx is used, but it is only defined in an elif D is None branch in line 157. This also raises an error saying that xx is referenced before it is defined. This happens when I set season_order[-1] = 1, which makes D = 0, so the program never enters the branch that defines xx.

Desktop:

  • OS: Ubuntu 16.04 LTS
  • Merlion Version 1.0.0

[FEATURE REQUEST] docker and colab example

Is your feature request related to a problem? Please describe.
It's difficult to try Merlion because of conflicting requirements. A Docker image or a maintained Colab notebook would solve this nicely. Currently, installing Merlion in Colab fails because of a pandas/statsmodels conflict.

Describe the solution you'd like
A Docker image or Colab notebook that just works.

[BUG] Sarima will not fit to monthly data (infinite loop?)

Describe the bug
A simple Sarima model won't fit the air passenger data set (seems to loop infinitely in L-BFGS when estimating the parameters). I guess it could be that the internal pre-processing is the culprit, because the same data will be fit when using ARIMA from statsmodels directly.

To Reproduce

import pandas as pd
from merlion.utils import TimeSeries
from merlion.models.forecast.sarima import Sarima, SarimaConfig
from statsmodels.tsa.arima.model import ARIMA

data = pd.read_csv('https://raw.githubusercontent.com/facebookresearch/Kats/main/kats/data/air_passengers.csv', names=['time', 'value'], index_col='time', skiprows=1, parse_dates=True)
data = data.asfreq('MS')
data = TimeSeries.from_pd(data['value'])

model_sm = ARIMA(data.to_pd(), order=(0, 1, 1), seasonal_order=(2, 1, 0, 12))
model_sm.fit().summary() # this will work

model_merlion = Sarima(SarimaConfig(order=(0, 1, 1), seasonal_order=(2, 1, 0, 12)))
model_merlion.train(data) # this won't work

Expected behavior
Fitting the Sarima model on monthly data should work.

Desktop (please complete the following information):

  • OS: Ubuntu 18.04.5 LTS (Bionic Beaver)
  • Merlion: 1.0.0
  • Statsmodels: 0.13.0

[BUG] AttributeError: 'DefaultForecasterConfig' object has no attribute 'granularity'

Running 0_ForecastIntro.ipynb

Training the default model throws an exception:

AttributeError: 'DefaultForecasterConfig' object has no attribute 'granularity'

To Reproduce

from merlion.models.defaults import DefaultForecasterConfig, DefaultForecaster
model = DefaultForecaster(DefaultForecasterConfig())
model.train(train_data=train_data)
  • OS: Colab
  • Merlion Version 1.0.2 and 1.1.0

Unable to detect anomalies with my dataset having weekly seasonality

Note: I am not a data scientist. I am trying to find out if Merlion can be a solution to my problem.

I tried the example notebook "0_AnomalyIntro.ipynb" with my dataset. It did not detect any of the anomalies that I expected it to detect.

To Reproduce

  1. Replace contents of machine_temperature_system_failure.csv with my own https://github.com/MacNale/TestDataset/blob/main/test_dataset.csv dataset
  2. Set sys.path.append("/myproject//Merlion/ts_datasets") at the beginning as module could not be loaded otherwise.
  3. Execute the notebook.

Expected behavior
I was hoping to see an anomaly on 7th July. I could not interpret the result. Are there any anomalies detected here?

Screenshots
image

Desktop (please complete the following information):

  • OS: Mac
  • Merlion - Main branch

Additional context
Using Visual Studio Code

[FEATURE REQUEST] Update prophet package

Is your feature request related to a problem? Please describe.
As of v1.0, the package name on PyPI is "prophet"; prior to v1.0 it was "fbprophet". source

Describe the solution you'd like
Update prophet package

[BUG]import error

Describe the bug
In benchmark_forecast.py, the line "from ts_datasets.forecast import *" should be "from ts_datasets.ts_datasets.forecast import *".

[FEATURE REQUEST]More light-weight version

Is your feature request related to a problem? Please describe.
Is there a way to install a simplified version of Merlion? We are trying to put Merlion into a Docker image. When building the image, we found that Merlion depends on torch and prophet, which are very large (around 5GB). If we are not using the models that rely on torch and prophet, can we have a more light-weight version?

GPU utilization

Hi,
Is it possible to train the model using a GPU?
I don't see where we can specify a GPU parameter, for example when using the AutoML module.

Thanks

[BUG] ValueError: invalid literal for int() with base 10: b'' - Anomaly Detection Jupyter Example

Describe the bug
When running the example Jupyter Notebook Merlion/examples/anomaly/0_AnomalyIntro.ipynb (second cell), the model cannot be trained.
Java was installed as described:

➜  ~ java -version
java version "1.8.0_311"
Java(TM) SE Runtime Environment (build 1.8.0_311-b11)
Java HotSpot(TM) 64-Bit Server VM (build 25.311-b11, mixed mode)

To Reproduce
Run the example Jupyter Notebook Merlion/examples/anomaly/0_AnomalyIntro.ipynb (second cell).

Expected behavior
Model can be trained as expected.

Screenshots
image
image

Desktop (please complete the following information):

  • OS: [MacOS Monterey 12.3.1]
  • Merlion Version [ 1.1.3]

[BUG] resample_time_stamps fails when time_stamps is int and max_forecast_steps is None

Describe the bug
Calling resample_time_stamps fails when requesting an integer number of time_stamps and max_forecast_steps has not been set (the code then tries to subset time_stamps, which won't work for integers).

This can be solved by setting the variable tf later, since it is only needed when a list of time_stamps has been passed.

To Reproduce
This will fail...

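import pandas as pd
from merlion.utils import TimeSeries
from merlion.models.forecast.arima import Arima, ArimaConfig

data = pd.Series([100, 100, 100, 100])
model = Arima(ArimaConfig(order=(1, 1, 0)))  # max_forecast_steps not set
model.train(TimeSeries.from_pd(data))
model.forecast(time_stamps=1)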

This works...

data = pd.Series([100, 100, 100, 100])
model = Arima(ArimaConfig(order=(1, 1, 0), max_forecast_steps=42))
model.train(TimeSeries.from_pd(data))
model.forecast(time_stamps=1)

Expected behavior

Passing time_stamps as int should work when max_forecast_steps has not been set.
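
A minimal sketch of the expected behavior, written as a standalone function. Everything here is an assumption for illustration: resolve_time_stamps is a hypothetical helper (not Merlion's resample_time_stamps), and plain integers stand in for real time stamps. The point is only that the integer case is handled before anything tries to index or slice time_stamps.

```python
def resolve_time_stamps(time_stamps, last_train_time, step, max_forecast_steps=None):
    """Hypothetical helper: return the list of time stamps to forecast."""
    if isinstance(time_stamps, int):
        # An integer means "forecast this many steps past the training data".
        n_steps = time_stamps
        resolved = [last_train_time + step * (i + 1) for i in range(n_steps)]
    else:
        # Only an explicit sequence can be indexed or sliced.
        resolved = list(time_stamps)
    if max_forecast_steps is not None and len(resolved) > max_forecast_steps:
        raise ValueError(
            f"Requested {len(resolved)} steps, but max_forecast_steps={max_forecast_steps}"
        )
    return resolved


# Works whether or not max_forecast_steps is set
print(resolve_time_stamps(3, last_train_time=100, step=10))
# [110, 120, 130]
print(resolve_time_stamps([110, 120], last_train_time=100, step=10, max_forecast_steps=42))
# [110, 120]
```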

Desktop (please complete the following information):

  • Merlion: 1.0.0

[BUG] from_pd messes up the time indexes

Describe the bug
When creating a TimeSeries from a pandas DataFrame with a date index (containing only dates), TimeSeries automatically converts the dates to datetimes and creates fake time indexes by mapping 24 dates onto the 24 hours of the first day.

To Reproduce

from merlion.utils import TimeSeries

train_indexes = df_returns.index < "2017-01-01"
train_df = df_returns.loc[train_indexes,:]
test_df = df_returns.loc[~train_indexes,:]
train_data = TimeSeries.from_pd(train_df)
test_data = TimeSeries.from_pd(test_df)

This happens when df_returns has a date-type index.

Expected behavior
It should keep the same date index format.

Screenshots
bug_merlion

Desktop (please complete the following information):

  • OS: Ubuntu 20.04

Additional context
I am working on a Jupyter notebook for quick analysis.

Can not execute model.train() - error: "The system cannot find the file specified"

Describe the bug
Hi,
I just installed Merlion in a new Anaconda environment on Windows 10.
The installation went fine except for a build error for the prophet package when using wheel, which I solved by installing "libpython m2w64-toolchain -c msys2".
But when I try to execute the anomaly detection sample code, it throws an error: "FileNotFoundError: [WinError 2] The system cannot find the file specified".

Expected behavior
The console output is:

model.train(train_data=train_data)
Traceback (most recent call last):
File "", line 1, in
File "C:\Anaconda3\envs\merlion\lib\site-packages\merlion\models\defaults.py", line 88, in train
ModelFactory.create(
File "C:\Anaconda3\envs\merlion\lib\site-packages\merlion\models\factory.py", line 89, in create
model._load_state(kwargs)
File "C:\Anaconda3\envs\merlion\lib\site-packages\merlion\models\base.py", line 367, in _load_state
self.setstate(state_dict)
File "C:\Anaconda3\envs\merlion\lib\site-packages\merlion\models\anomaly\random_cut_forest.py", line 138, in setstate
RCFSerDe = JVMSingleton.gateway().jvm.com.amazon.randomcutforest.serialize.RandomCutForestSerDe
File "C:\Anaconda3\envs\merlion\lib\site-packages\merlion\models\anomaly\random_cut_forest.py", line 40, in gateway
cls._gateway = JavaGateway.launch_gateway(classpath=classpath, javaopts=javaopts)
File "C:\Anaconda3\envs\merlion\lib\site-packages\py4j\java_gateway.py", line 2159, in launch_gateway
_ret = launch_gateway(
File "C:\Anaconda3\envs\merlion\lib\site-packages\py4j\java_gateway.py", line 331, in launch_gateway
proc = Popen(
File "C:\Anaconda3\envs\merlion\lib\subprocess.py", line 858, in __init__
self._execute_child(args, executable, preexec_fn, close_fds,
File "C:\Anaconda3\envs\merlion\lib\subprocess.py", line 1311, in _execute_child
hp, ht, pid, tid = _winapi.CreateProcess(executable, args,
FileNotFoundError: [WinError 2] The system cannot find the file specified

What can I do to solve this problem?
Thanks

defining metadata in Merlion

Hi,
I have my dataset in a pandas DataFrame. How should I define the metadata? I have univariate data, and I have tried several ways to define the metadata before the train/test split, but none of them work. Also, I am not able to import ts_datasets to use the os datasets.
I would appreciate your help

Unable to reproduce results for DAGMM

Hello,

Thank you for the nice library!

I was just wondering if you managed to reproduce the results in Zong, Bo, et al. "Deep autoencoding gaussian mixture model for unsupervised anomaly detection." International conference on learning representations. 2018.

image

I used the following configuration:

DAGMMConfig(gmm_k=2, hidden_size=4, num_epochs=20000, lr=0.0001, batch_size=1024)

and only managed to get the following results on the Thyroid dataset (.mat obtained from http://odds.cs.stonybrook.edu):

Precision: 0.0238
Recall: 0.3571
F1: 0.0447

[BUG] 'MoE_ForecasterEnsemble' object has no attribute 'optimiser'

Describe the bug
Code @ examples/advanced/2_MoE_Forecasting_tutorial.ipynb

moe_model = TransformerModel(input_dim=train_data.dim, lookback_len=lookback_len, nexperts=nexperts,\
                    output_dim=max_forecast_steps, nfree_experts=nfree_experts,\
                    hid_dim=hidden_dim, dim_head = dim_head, mlp_dim=mlp_dim,\
                     pool='cls', dim_dropout=dim_dropout,\
                    time_step_dropout=time_step_dropout)
#moe_model = None # use me if you want to see the default model in use

# create MoE forecaster model
ensemble = MoE_ForecasterEnsemble(config=config_ensemble, models= models, moe_model=moe_model)
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
/tmp/ipykernel_1014/3322114368.py in <module>
     20 
     21 # create MoE forecaster model
---> 22 ensemble = MoE_ForecasterEnsemble(config=config_ensemble, models= models, moe_model=moe_model)
     23 
     24 # train MoE

/usr/local/lib/python3.8/dist-packages/merlion/models/ensemble/MoE_forecast.py in __init__(self, config, models, moe_model)
    272         )
    273 
--> 274         self.moe_model = moe_model
    275 
    276     @property

/usr/local/lib/python3.8/dist-packages/merlion/models/ensemble/MoE_forecast.py in moe_model(self, moe_model)
    282         self._moe_model = moe_model
    283         if self.moe_model is not None:
--> 284             if self.optimiser is None:
    285                 self.optimiser = torch.optim.Adam(self.moe_model.parameters(), lr=self.lr, weight_decay=0.00000)
    286             if self.lr_sch is None:

AttributeError: 'MoE_ForecasterEnsemble' object has no attribute 'optimiser'

Request to provide a tutorial of some sort to implement AutoML variants of ETS and prophet for univariate forecasting

Hey, I had been going through the paper "Merlion: A Machine Learning Library for Time Series" and I came across AutoML variants of ETS and prophet models for univariate forecasting. It would be of great help if you could show some tutorial for implementing them, on a simple univariate dataset like the "air-passengers dataset".
I have also tried the AutoSarima for the same dataset from the merlion.models.automl module. But it gives very large errors compared to the auto_arima model from the pmdarima library and even basic statsmodel.tsa SARIMAX methods.
What could be the reason, given that the air-passengers dataset isn't very complicated to forecast?

My code for AutoSarima:

max_iter = [10,20,50,100,200,400,1000]
list_autosarima_merlion_models = []  #stores all models with diff parameters
parameters_autosarima_merlion_models = [] #stores different params used for diff models

for mi in max_iter:
    config1 = AutoSarimaConfig(max_forecast_steps=len(test_df), order=("auto", "auto", "auto"),
                           seasonal_order=("auto", "auto", "auto", 12), approximation=True, maxiter=mi)
    model1  = SeasonalityLayer(model = AutoSarima(model = Sarima(config1)))
    train_pred, train_err = model1.train(train_df_merlion, train_config={"enforce_stationarity": True,"enforce_invertibility": True})
    list_autosarima_merlion_models.append(model1)
    parameters_autosarima_merlion_models.append(f'{mi} maximum iterations')

Link to the paper I went through:
https://arxiv.org/abs/2109.09265

[BUG]

Describe the bug
When I run python benchmark_forecast.py --model ets --retrain_type sliding_window_retrain --n_retrain 4,
retraining does not seem to work: the model is trained only once.

I think there is something wrong with ForecastEvaluatorConfig.cadence. It is set to the horizon, which is the length of the test set, rather than to retrain_freq. After one loop iteration, t plus the cadence is already larger than tf, so the training procedure ends.
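
The reasoning above can be illustrated with a small sketch. It assumes the evaluator advances a cursor t by cadence until it reaches tf, as the report describes; the function name and the loop shape are hypothetical and are not taken from Merlion's source.

```python
def count_training_rounds(t0, tf, cadence):
    """Count iterations of a loop of the form: while t < tf: train(); t += cadence."""
    rounds = 0
    t = t0
    while t < tf:
        rounds += 1  # stands in for one (re)training step
        t += cadence
    return rounds


test_len = 100  # illustrative test-set length

# Cadence equal to the full horizon: the loop body runs only once.
rounds_with_horizon = count_training_rounds(0, test_len, cadence=test_len)

# Cadence equal to a quarter of the test set: four training rounds.
rounds_with_retrain_freq = count_training_rounds(0, test_len, cadence=test_len // 4)
```

Under these assumptions, a cadence equal to the test-set length leaves room for a single round, which matches the behaviour reported.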

[BUG] delay value when creating a TSADEvaluator

Describe the bug
In benchmark_anomaly.py line 307:
delay = post_rule_train_config["max_early_sec"]
Should it be delay = post_rule_train_config["max_delay_sec"] instead? post_rule_train_config is updated in get_model() at line 235, where d.update({"max_early_sec": dataset.max_lead_sec, "max_delay_sec": dataset.max_lag_sec}) is called.

[BUG] Problem in using multiprocessing with model

In version 1.0.1, I try to pass models (e.g. DefaultForecaster(DefaultForecasterConfig('10min')), but it happens with other models as well) to a different process using the native Python multiprocessing package (v3.6.9), and I get the following error:

'DefaultForecaster' object has no attribute 'config'. 'config' is an invalid kwarg for the load()

[FEATURE REQUEST] Options to impute missing values in univariate time series

Imputing missing values with the .align() method works, afaik, only for multivariate time series (i.e. a collection of univariate ones). But what if I only have a single time series?

from merlion.utils import TimeSeries
import pandas as pd
import numpy as np

ts_series = pd.Series(data = [20, 21, np.nan, 18],
                      index = ['01/04/2022', '02/04/2022', '03/04/2022', '04/04/2022'],
                      name = 'v'
                     )
ts_df = pd.DataFrame(ts_series)
ts_df.index = pd.to_datetime(ts_df.index)

ts = TimeSeries.from_pd(ts_df)
ts

In this case the entry for 03/04/2022 is dropped from the series when calling .from_pd().
It would be great to have options here to impute the NaN value at index "03/04/2022" instead.

What do you think?
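
Until such an option exists, one workaround is to fill the gap before the data reaches TimeSeries.from_pd. The sketch below is a plain-Python illustration of linear interpolation for an evenly spaced series; fill_gaps_linear is a hypothetical helper, not part of Merlion.

```python
import math


def fill_gaps_linear(values):
    """Linearly interpolate interior NaN gaps in an evenly spaced series.

    Leading and trailing NaNs are left untouched because they have no
    neighbour on one side to interpolate from.
    """
    filled = list(values)
    known = [i for i, v in enumerate(filled) if not math.isnan(v)]
    for left, right in zip(known, known[1:]):
        step = (filled[right] - filled[left]) / (right - left)
        for i in range(left + 1, right):
            filled[i] = filled[left] + step * (i - left)
    return filled


# Same values as in the report above, with the third entry missing.
filled = fill_gaps_linear([20, 21, float("nan"), 18])
```

With pandas, calling interpolate() or ffill() on the DataFrame before TimeSeries.from_pd should have a comparable effect, though whether that suits a given dataset is a modelling decision.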

[FEATURE REQUEST] Update/Forecast/Update example for MoE

Is your feature request related to a problem? Please describe.
I'm trying to figure out if it's possible to update the MoE example with new data as it comes in or if the model has to be rebuilt every time.

Describe the solution you'd like
An example similar to the MSES example with the update/prediction loop.

Describe alternatives you've considered
The MSES model didn't even come close to creating something predictable, but MoE seemed to do a decent job.

[BUG] Chunking issue for LSTM forecasting

Describe the bug

Hi, I am trying to apply the LSTM method to the Itrust SWAT datasets using Merlion. However, I am running into a chunking issue, and LSTM is the only model where I encounter it. I resampled the training_data to one-second intervals to get consistent sampling, but the issue remains. I wonder whether this is a bug or a problem in my code.

Lastly, do you have an example notebook with an LSTM example? So far, I wasn't able to find any example LSTM configs, which makes it hard to try something out quickly.

To Reproduce

Below I have the code for my LSTM model.

from merlion.models.forecast.prophet import Prophet, ProphetConfig
from merlion.models.forecast.smoother import MSES, MSESConfig
from merlion.models.forecast.lstm import LSTM, LSTMConfig, LSTMTrainConfig
from merlion.models.forecast.base import ForecasterBase

lstm_config = LSTMConfig(len(test_data),
                         nhid=100,
                         model_strides=(1,),
                         target_seq_index=None,
                         transform=None,
                         max_score=1,
                         threshold=None,
                         enable_calibrator=True,
                         enable_threshold=True)

training_config_lstm = LSTMTrainConfig(lr=1e-05,
                                       batch_size=500,
                                       epochs=100,
                                       seq_len=30,
                                       data_stride=1,
                                       valid_split=0.2,
                                       checkpoint_file='checkpoint.pt')


lstm = LSTM(lstm_config)
lstm.train(training_data,train_config=training_config_lstm)

Expected behavior
No chunking problem.

Screenshots

RuntimeError                              Traceback (most recent call last)
/tmp/ipykernel_62686/1285329270.py in <module>
   62 
   63 lstm = LSTM(lstm_config)
---> 64 lstm.train(training_data,train_config=training_config_lstm)
   65 #len(test_data)

~/.local/lib/python3.9/site-packages/merlion/models/forecast/lstm.py in train(self, train_data, train_config)
  303                         batch = batch.cuda()
  304                     self.optimizer.zero_grad()
--> 305                     out = self.model(batch[:, : -(self.max_forecast_steps + 1)], future=self.max_forecast_steps)
  306                     loss = F.l1_loss(out, batch[:, 1:])
  307                     loss.backward()

~/.local/lib/python3.9/site-packages/torch/nn/modules/module.py in _call_impl(self, *input, **kwargs)
 1108         if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks or _global_backward_hooks
 1109                 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1110             return forward_call(*input, **kwargs)
 1111         # Do not call functions when jit is used
 1112         full_backward_hooks, non_full_backward_hooks = [], []

~/.local/lib/python3.9/site-packages/merlion/models/forecast/lstm.py in forward(self, input, future)
  198         :return: the predicted values including both 1-step predictions and the future step predictions
  199         """
--> 200         outputs = [rnn(input[:, ::stride]) for stride, rnn in zip(self.strides, self.rnns)]
  201         batch_sz, dim = outputs[0].shape
  202         preds = [

~/.local/lib/python3.9/site-packages/merlion/models/forecast/lstm.py in <listcomp>(.0)
  198         :return: the predicted values including both 1-step predictions and the future step predictions
  199         """
--> 200         outputs = [rnn(input[:, ::stride]) for stride, rnn in zip(self.strides, self.rnns)]
  201         batch_sz, dim = outputs[0].shape
  202         preds = [

~/.local/lib/python3.9/site-packages/torch/nn/modules/module.py in _call_impl(self, *input, **kwargs)
 1108         if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks or _global_backward_hooks
 1109                 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1110             return forward_call(*input, **kwargs)
 1111         # Do not call functions when jit is used
 1112         full_backward_hooks, non_full_backward_hooks = [], []

~/.local/lib/python3.9/site-packages/merlion/models/forecast/lstm.py in forward(self, input)
  146         self.reset(bsz=input.size(0))
  147 
--> 148         for i, input_t in enumerate(input.chunk(input.size(1), dim=1)):
  149             self.h_t, self.c_t = self.lstm1(input_t, (self.h_t, self.c_t))
  150             self.h_t2, self.c_t2 = self.lstm2(self.h_t, (self.h_t2, self.c_t2))

RuntimeError: chunk expects `chunks` to be greater than 0, got: 0
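
One plausible reading of this traceback is that the slice batch[:, : -(self.max_forecast_steps + 1)] is empty, so input.size(1) is 0 by the time chunk is called. The sketch below reproduces only the slicing arithmetic with a plain list. It assumes each training window is about seq_len = 30 steps long, which is not confirmed by the traceback, and context_steps is a hypothetical helper.

```python
def context_steps(window_len, max_forecast_steps):
    """Number of time steps left after slicing off the forecast horizon.

    Mirrors the slice `[: -(max_forecast_steps + 1)]` from the traceback,
    applied to a plain list instead of a tensor.
    """
    window = list(range(window_len))
    return len(window[: -(max_forecast_steps + 1)])


# A short horizon leaves most of the window as model input.
short_horizon = context_steps(window_len=30, max_forecast_steps=5)

# A horizon at least as long as the window leaves nothing to chunk.
long_horizon = context_steps(window_len=30, max_forecast_steps=1000)
```

In the report, max_forecast_steps is set to len(test_data), so if the windows really are that short, a smaller horizon or a longer seq_len might avoid the empty slice.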

Additional context

Thank you for considering my issue.
If any more information is required please let me know.

[BUG] Calculating ForecastMetric for multivariate forecasts

Hi, I am facing an assertion error when calculating a forecast metric for a multivariate forecast:

from merlion.evaluate.forecast import ForecastMetric

ForecastMetric.RMSE.value(ground_truth=test_multi, 
                          predict=forecast, 
                          insample=False)

assert self.predict.dim == self.ground_truth.dim == 1

This happens because the forecast has only one dimension (my predicted sales), while the test_multi set has several (sales, store, etc.).

So I thought I could fix the problem by providing the function only the single sales column/series:

from merlion.evaluate.forecast import ForecastMetric

ForecastMetric.RMSE.value(ground_truth=test_multi.univariates["sales"], 
                          predict=forecast, 
                          insample=False)

However then I run into the following error:


TypeError Traceback (most recent call last)
in
2 from merlion.evaluate.forecast import ForecastMetric
3
----> 4 ForecastMetric.RMSE.value(ground_truth=test_multi.univariates["sales"],
5 predict=forecast,
6 insample=False)

~/opt/miniconda3/lib/python3.9/site-packages/merlion/evaluate/forecast.py in accumulate_forecast_score(ground_truth, predict, insample, periodicity, ub, lb, metric)
197 metric=None,
198 ) -> Union[ForecastScoreAccumulator, float]:
--> 199 acc = ForecastScoreAccumulator(
200 ground_truth=ground_truth, predict=predict, insample=insample, periodicity=periodicity, ub=ub, lb=lb
201 )

~/opt/miniconda3/lib/python3.9/site-packages/merlion/evaluate/forecast.py in __init__(self, ground_truth, predict, insample, periodicity, ub, lb)
51 """
52 t0, tf = predict.t0, predict.tf
---> 53 ground_truth = ground_truth.window(t0, tf, include_tf=True).align()
54 self.ground_truth = ground_truth
55 self.predict = predict.align(reference=ground_truth.time_stamps)

TypeError: align() missing 1 required positional argument: 'other'

A possible workaround would be to create a new TimeSeries object with only one dimension (sales). However, it would be cool to have a more handy solution.

What do you think?

[BUG] Cannot install salesforce-merlion conda package in a python 3.9 environment on Mac

Describe the bug

The salesforce-merlion conda package does not install in a Python 3.9 environment. Instead, the following error is thrown:

Encountered problems while solving:
  - package salesforce-merlion-1.0.0-pyhd8ed1ab_0 requires fbprophet, but none of the providers can be installed

To Reproduce

Create a python 3.9 environment and activate it

mamba create -n py39-env -c conda-forge 'python=3.9'
mamba activate py39-env

Install salesforce-merlion into environment

mamba install -c conda-forge salesforce-merlion

Expected behavior

salesforce-merlion conda package is installed and available in python environment.

Desktop (please complete the following information):

  • OS: MacOS 12.0.1
  • Merlion Version: 1.1.2

Additional context

These steps use mamba because it provides a clearer error message, but using conda directly doesn't work either. Trying mamba install -c conda-forge salesforce-merlion>=1.1.2 didn't help.

[FEATURE REQUEST] Time Series Cross Validation

Is your feature request related to a problem? Please describe.
Time series cross validation is very useful during model selection, and some packages offer it as part of backtesting. I am curious whether Merlion will add it in the future, or what the reason is for not including it in the package.

Describe the solution you'd like
Incorporate time series cross validation as part of package

Describe alternatives you've considered
Currently trying to use time series cross validation from scikit-learn
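
For context, the kind of split being requested can be sketched in a few lines of plain Python. The generator below is an illustrative expanding-window scheme in the spirit of scikit-learn's TimeSeriesSplit. It is neither Merlion's nor scikit-learn's implementation, and the name expanding_window_splits is hypothetical.

```python
def expanding_window_splits(n_samples, n_splits, test_size):
    """Yield (train_indices, test_indices) pairs in chronological order.

    Every training set ends right before its test block, so no future
    observation leaks into training.
    """
    first_test_start = n_samples - n_splits * test_size
    if first_test_start <= 0:
        raise ValueError("not enough samples for the requested splits")
    for k in range(n_splits):
        test_start = first_test_start + k * test_size
        train_idx = list(range(test_start))
        test_idx = list(range(test_start, test_start + test_size))
        yield train_idx, test_idx


splits = list(expanding_window_splits(n_samples=10, n_splits=3, test_size=2))
```

Each pair of index lists could then be used to slice a time series into a training and an evaluation window before fitting a model on the former.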

Only valid with DatetimeIndex, TimedeltaIndex or PeriodIndex, but got an instance of 'Index'

When a TimeSeries is created, no matter the original DataFrame or Series index, it is recognized as 'Index', which is not valid when calling model.train().

Code to reproduce (adapted from examples):

from merlion.models.defaults import DefaultForecaster, DefaultForecasterConfig
from merlion.utils import TimeSeries, UnivariateTimeSeries
import pandas as pd

df = pd.DataFrame([[1632428144486, 2], [1632428160612, 6]], columns=["Time", "Value"])
df["Time"] = pd.to_datetime(df["Time"], unit="ms", utc=True)
df.set_index("Time", inplace=True)
series=df["Value"]

print(series.index)
>>> DatetimeIndex(['2021-09-23 20:15:44.486000+00:00', '2021-09-23 20:16:00.612000+00:00'], dtype='datetime64[ns, UTC]', name='Time', freq=None)

time_series = TimeSeries(univariates=[UnivariateTimeSeries.from_pd(series)])
model = DefaultForecaster(DefaultForecasterConfig())
model.train(train_data=time_series)

As you can see, the index is indeed DatetimeIndex, but it is lost when creating the TimeSeries.

The complete error traceback is:

Traceback (most recent call last):
  File "<input>", line 1, in <module>
  File "C:\Users\javma\miniconda3\lib\site-packages\merlion\models\defaults.py", line 185, in train
    return self.model.train(train_data=train_data, train_config=train_config)
  File "C:\Users\javma\miniconda3\lib\site-packages\merlion\models\forecast\ets.py", line 106, in train
    train_data = self.train_pre_process(train_data, require_even_sampling=True, require_univariate=False)
  File "C:\Users\javma\miniconda3\lib\site-packages\merlion\models\forecast\base.py", line 140, in train_pre_process
    train_data = super().train_pre_process(train_data, require_even_sampling, require_univariate)
  File "C:\Users\javma\miniconda3\lib\site-packages\merlion\models\base.py", line 214, in train_pre_process
    train_data = self.transform(train_data)
  File "C:\Users\javma\miniconda3\lib\site-packages\merlion\transform\resample.py", line 131, in __call__
    return time_series.align(
  File "C:\Users\javma\miniconda3\lib\site-packages\merlion\utils\time_series.py", line 895, in align
    df = df.resample(granularity, origin=origin, label="right", closed="right")
  File "C:\Users\javma\miniconda3\lib\site-packages\pandas\core\frame.py", line 10351, in resample
    return super().resample(
  File "C:\Users\javma\miniconda3\lib\site-packages\pandas\core\generic.py", line 8126, in resample
    return get_resampler(
  File "C:\Users\javma\miniconda3\lib\site-packages\pandas\core\resample.py", line 1382, in get_resampler
    return tg._get_resampler(obj, kind=kind)
  File "C:\Users\javma\miniconda3\lib\site-packages\pandas\core\resample.py", line 1558, in _get_resampler
    raise TypeError(
TypeError: Only valid with DatetimeIndex, TimedeltaIndex or PeriodIndex, but got an instance of 'Index'

Windows 10 21H1, Python 3.8 (miniconda), Merlion v1.0.0

[FEATURE REQUEST] Can we loosen the requirement on the numpy library

Is your feature request related to a problem? Please describe.
In our ML pipeline, there is a library version conflict for numpy:
#8 93.26 The conflict is caused by:
#8 93.26 The user requested numpy>=1.15.4
#8 93.26 cmdstanpy 0.9.68 depends on numpy>=1.15
#8 93.26 pystan 2.19.1.1 depends on numpy>=1.7
#8 93.26 matplotlib 2.0.0 depends on numpy>=1.7.1
#8 93.26 kserve 0.8.0rc0 depends on numpy~=1.19.2
#8 93.26 scikit-learn 1.0.1 depends on numpy>=1.14.6
#8 93.26 salesforce-merlion 1.1.1 depends on numpy>=1.21

Can we make the numpy version requirement a bit looser, e.g. >=1.8, so that there is less chance of running into version conflicts?
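
The conflict in the log comes down to two constraints that no single version can satisfy: kserve's numpy~=1.19.2 and salesforce-merlion's numpy>=1.21. The sketch below checks this with simplified version comparison. The helper functions are hypothetical and handle only plain numeric versions, unlike a real resolver.

```python
def parse(version):
    """Turn a plain numeric version such as '1.19.2' into a tuple of ints."""
    return tuple(int(part) for part in version.split("."))


def satisfies_compatible(version, base):
    """Simplified '~=base': at least base, with all but the last component equal."""
    v, b = parse(version), parse(base)
    return v >= b and v[: len(b) - 1] == b[: len(b) - 1]


def satisfies_minimum(version, minimum):
    """Simplified '>=minimum'."""
    return parse(version) >= parse(minimum)


candidates = ["1.19.2", "1.19.5", "1.20.3", "1.21.0", "1.22.4"]

# Versions accepted by both numpy~=1.19.2 and numpy>=1.21.
acceptable = [
    v
    for v in candidates
    if satisfies_compatible(v, "1.19.2") and satisfies_minimum(v, "1.21")
]
```

Since ~=1.19.2 pins the 1.19 series, any lower bound above 1.19 on the Merlion side would conflict with it.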

[BUG] ValueError: invalid literal for int() with base 10: b''

Describe the bug
While trying to run the example of anomaly detection as presented in the README file, I got this ValueError. The full traceback of the error is presented below.

To Reproduce
Just try to run the example of Anomaly detection as presented in the README file.

Expected behavior
Run with the expected outputs as the anomaly detection example in the README file.

Screenshots

Traceback (most recent call last):
  File "/home/anomaly.py", line 54, in <module>
    model.train(train_data=train_data)
  File "/home/anaconda3/envs/timeseries/lib/python3.7/site-packages/merlion/models/defaults.py", line 94, in train
    max_n_samples=512,
  File "/home/anaconda3/envs/timeseries/lib/python3.7/site-packages/merlion/models/factory.py", line 89, in create
    model._load_state(kwargs)
  File "/home/anaconda3/envs/timeseries/lib/python3.7/site-packages/merlion/models/base.py", line 365, in _load_state
    self.__setstate__(state_dict)
  File "/home/anaconda3/envs/timeseries/lib/python3.7/site-packages/merlion/models/anomaly/random_cut_forest.py", line 138, in __setstate__
    RCFSerDe = JVMSingleton.gateway().jvm.com.amazon.randomcutforest.serialize.RandomCutForestSerDe
  File "/home/anaconda3/envs/timeseries/lib/python3.7/site-packages/merlion/models/anomaly/random_cut_forest.py", line 40, in gateway
    cls._gateway = JavaGateway.launch_gateway(classpath=classpath, javaopts=javaopts)
  File "/home/anaconda3/envs/timeseries/lib/python3.7/site-packages/py4j/java_gateway.py", line 2165, in launch_gateway
    use_shell=use_shell)
  File "/home/anaconda3/envs/timeseries/lib/python3.7/site-packages/py4j/java_gateway.py", line 337, in launch_gateway
    _port = int(proc.stdout.readline())
ValueError: invalid literal for int() with base 10: b''
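
The failure occurs while py4j is launching a Java gateway for the random cut forest model, and the launcher reads an empty line where it expects a port number. A common cause of that symptom is that no Java runtime can be started, though this is an assumption about the reporter's environment. A minimal check, using a hypothetical helper name:

```python
import shutil


def java_on_path():
    """Return True if a `java` executable can be found on the PATH."""
    return shutil.which("java") is not None


if not java_on_path():
    print("No `java` executable found; Java-backed models may fail to start.")
```

If the check passes and the error persists, the Java version or the gateway's classpath would be the next things to look at.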

Desktop (please complete the following information):

  • OS: Ubuntu 18.04 LTS
  • Merlion Version 1.1.3

Additional context
N/A

[FEATURE REQUEST] business working days

Is your feature request related to a problem? Please describe.
When I predict one month ahead with a daily time series that contains only business working days, I need a hack to compute the max_forecast_steps parameter.

Describe the solution you'd like
max_forecast_steps should work taking into account only business working days too or there should be an option to "equalize" indexes regardless of the time interval, namely dropping the time value.

Describe alternatives you've considered
Drop the timeindex from the pandas series
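
A further interim option is to compute the number of steps from the calendar. The sketch below counts Monday to Friday days in a date range with the standard library. It ignores public holidays, and business_days_between is a hypothetical helper rather than a Merlion function.

```python
from datetime import date, timedelta


def business_days_between(start, end):
    """Count Monday-Friday days in the inclusive range [start, end]."""
    days = 0
    current = start
    while current <= end:
        if current.weekday() < 5:  # 0 = Monday, ..., 4 = Friday
            days += 1
        current += timedelta(days=1)
    return days


# Number of forecast steps needed to cover January 2022 on a business-day index.
steps = business_days_between(date(2022, 1, 1), date(2022, 1, 31))
```

The resulting count could be passed as max_forecast_steps, assuming the training data uses the same Monday to Friday convention.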


[BUG] Can't pass holiday config to Prophet model

Describe the bug
When setting up the holidays config in ProphetConfig and passing it to the Prophet model, I get:
ValueError: holidays must be a DataFrame with "ds" and "holiday" columns.
holidays is a pd.DataFrame, but after wrapping it becomes a dict.
To Reproduce
config2 = ProphetConfig(max_forecast_steps=None, transform=Identity(), holidays=holidays)
model2 = Prophet(config2)

Desktop (please complete the following information):

  • OS: Macbook
  • Merlion 1.1.0, 1.0.2

[FEATURE REQUEST] Easier Merlion Install on Azure Databricks

Is your feature request related to a problem? Please describe.

I've been trying to install Merlion and all the datasets on Azure Databricks for the past couple of days and it's proving to be very difficult. Here's what I've done so far:

  1. Install Merlion on my cluster using pip. This was failing at first for some reason but started randomly working two days after my first try.
  2. Install the dataset using the egg. I had to manually find and run setup.py to get the .egg file from the newly created dist folder. I then installed the datasets to my cluster using the egg. This was my first time using Python eggs so I'm not sure if this is standard procedure. It installed properly, but as soon as I try using any dataset I get an error saying AssertionError: <some-long-path> does not contain dataset file.

And now I don't know where to go from here.

Describe the solution you'd like
It'd be nice for the dataset module to be pip-installable, just like the Merlion module is, without having to download the entire repo.

Describe alternatives you've considered
It seems like you really want to separate the Merlion module from the dataset module. I don't understand why exactly, but if you're not married to this idea, then I'd make everything installable with one pip call. If you want to give the developer more control over whether or not to download the datasets, I think there are better ways to do it than having to download the entire repo and then use pip.

EDIT (January 6th 2022)
It seems to be a bug. I'm able to use the anomaly detection datasets but not the forecasting datasets

[BUG] issues with using ETS model for forecasting

Describe the bug
I'm trying to use the ETS model for forecasting and I'm running into some issues. I trained the model (with the default config aside from max_forecast_steps=10) on the M4 training set (using Weekly data), and then I'm trying to forecast on the test set (the timestamps of the test set come right after the timestamps of the training set). I tried using the first 10 values of the test set as prev_time_series, and wanted to forecast the next 10 timestamps. This gave me an error saying that the model needed at least 2 full seasonal cycles of data. I then created a dummy test set with 1000 timestamps starting at the same point as the M4 test set, and used the first 104 values of that test set as prev_time_series (2 full years in weeks), and I got an obscure IndexError from statsmodels.

To Reproduce
Code for reproducing the 2 full seasonal cycles error:

model = ETS(ETSConfig(max_forecast_steps=10))

time_series, metadata = M4(subset="Weekly")[0]
trainval = time_series[metadata.trainval]
testval = time_series[~metadata.trainval]

model.train(TimeSeries.from_pd(trainval))

test_ts = TimeSeries.from_pd(testval)
forecast, stderr = model.forecast(test_ts.time_stamps[10:20], test_ts[:10])
plt.plot(forecast)
plt.show()

Code for reproducing the IndexError from statsmodels:

model = ETS(ETSConfig(max_forecast_steps=10))

time_series, metadata = M4(subset="Weekly")[0]
trainval = time_series[metadata.trainval]
testval = pd.DataFrame(np.random.randn(1000, 1), columns=["W1"], index=pd.date_range(start="2011-10-09", periods=1000, freq="W"))

model.train(TimeSeries.from_pd(trainval))

test_ts = TimeSeries.from_pd(testval)
forecast, stderr = model.forecast(test_ts.time_stamps[104:114], test_ts[:104])
plt.plot(forecast)
plt.show()

Expected behavior
I would expect that this shouldn't result in any errors, especially given that I'm using the default configuration for ETS besides providing a forecast horizon.

Screenshots
Stack trace for first error:

  File "/Users/vsridhar/Documents/ds_tools/sc_forecasting_tools/sc_forecasting_tools/forecasting/test_ets.py", line 18, in <module>
    forecast, stderr = model.forecast(test_ts.time_stamps[10:20], test_ts[:10])
  File "/Users/vsridhar/yes/envs/jupyter_env/envs/spuds-dev/lib/python3.8/site-packages/merlion/models/forecast/ets.py", line 206, in forecast
    new_model = ETSModel(
  File "/Users/vsridhar/yes/envs/jupyter_env/envs/spuds-dev/lib/python3.8/site-packages/statsmodels/tsa/exponential_smoothing/ets.py", line 483, in __init__
    self.set_initialization_method(
  File "/Users/vsridhar/yes/envs/jupyter_env/envs/spuds-dev/lib/python3.8/site-packages/statsmodels/tsa/exponential_smoothing/ets.py", line 583, in set_initialization_method
    ) = _initialization_simple(
  File "/Users/vsridhar/yes/envs/jupyter_env/envs/spuds-dev/lib/python3.8/site-packages/statsmodels/tsa/exponential_smoothing/initialization.py", line 26, in _initialization_simple
    raise ValueError('Cannot compute initial seasonals using'
ValueError: Cannot compute initial seasonals using heuristic method with less than two full seasonal cycles in the data.
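
The message above indicates that the heuristic initialization needs at least two complete seasonal cycles of history. A small pre-check along those lines is sketched below. The seasonal period of 52 for weekly data is an assumption based on the "2 full years in weeks" remark in the report, and has_two_full_cycles is a hypothetical helper.

```python
def has_two_full_cycles(n_observations, seasonal_periods):
    """Check whether the history covers at least two full seasonal cycles."""
    return n_observations >= 2 * seasonal_periods


weekly_period = 52  # assumed seasonal period for weekly data

# Ten points of history, as in the first reproduction.
enough_with_10 = has_two_full_cycles(10, weekly_period)

# 104 points of history, as in the second reproduction.
enough_with_104 = has_two_full_cycles(104, weekly_period)
```

Such a check explains the first error only; the IndexError in the second reproduction is a separate problem.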

Stack trace for second error:

 File "/Users/vsridhar/Documents/ds_tools/sc_forecasting_tools/sc_forecasting_tools/forecasting/test_ets.py", line 18, in <module>
    forecast, stderr = model.forecast(test_ts.time_stamps[104:114], test_ts[:104])
  File "/Users/vsridhar/yes/envs/jupyter_env/envs/spuds-dev/lib/python3.8/site-packages/merlion/models/forecast/ets.py", line 219, in forecast
    self.model = new_model.fit(start_params=self.model.params, disp=False)
  File "/Users/vsridhar/yes/envs/jupyter_env/envs/spuds-dev/lib/python3.8/site-packages/statsmodels/tsa/exponential_smoothing/ets.py", line 1024, in fit
    internal_start_params = self._convert_and_bound_start_params(
  File "/Users/vsridhar/yes/envs/jupyter_env/envs/spuds-dev/lib/python3.8/site-packages/statsmodels/tsa/exponential_smoothing/ets.py", line 890, in _convert_and_bound_start_params
    internal_params = self._internal_params(params)
  File "/Users/vsridhar/yes/envs/jupyter_env/envs/spuds-dev/lib/python3.8/site-packages/statsmodels/tsa/exponential_smoothing/ets.py", line 789, in _internal_params
    internal[internal_idx] = params[i]
IndexError: index 5 is out of bounds for axis 0 with size 5

Desktop (please complete the following information):

  • OS: macOS Big Sur, Version 11.6.4
  • Merlion Version: 1.1.2

[BUG] I can't import any of the merlion subpackages

Describe the bug
I was able to import merlion, but importing any of the following subpackages was unsuccessful, and I get an error saying that no such module is available:
from merlion.utils.time_series import TimeSeries
from merlion.evaluate.forecast import ForecastMetric
from merlion.models.automl.autosarima import AutoSarima, AutoSarimaConfig
from merlion.models.automl.seasonality_mixin import SeasonalityLayer
from merlion.models.forecast.sarima import Sarima
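
One way to narrow this down, assuming the problem is environment-related (for example, the IDE using a different interpreter or a different package named merlion), is to ask Python where it resolves each module from. The helper name below is hypothetical; importlib.util.find_spec is part of the standard library.

```python
import importlib.util


def module_origin(name):
    """Return the file a module would be loaded from, or None if not found."""
    try:
        spec = importlib.util.find_spec(name)
    except ModuleNotFoundError:
        # Raised when a parent package of a dotted name is missing.
        return None
    return None if spec is None else spec.origin


for name in ("merlion", "merlion.utils.time_series", "merlion.models.automl.autosarima"):
    print(name, "->", module_origin(name))
```

If merlion resolves to an unexpected location, or the submodules resolve to None, the installed package is probably not the one the interpreter is picking up.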

Expected behavior
My expectation is to be able to install these sub packages.

Desktop (please complete the following information):

  • Spyder=5

[BUG] DefaultForecaster returns naive forecast

Describe the bug
DefaultForecaster returns naive estimate when one might expect otherwise.

To Reproduce
See notebook

Expected behavior
It may be fine. Perhaps it awaits more data, or there is a bug in the usage. I have yet to trace into it, and the example was created by a fellow contributor.

Desktop (please complete the following information):
Colab.

[FEATURE REQUEST] Can Merlion handle multi-series datasets?

This is more of a question, but I didn't see a tag for it. Does Merlion support modeling multi-series datasets? I understand from the README that it supports multivariate models. I was curious to know if it supports multi-series. For example, consider this OJ Sales Dataset. In this case, the data contains weekly sales of orange juice over 121 weeks. There are 3,991 stores included and three brands of orange juice per store, so that 11,973 models can be trained.
I understand one can train independent models for each of the stores. However, I was interested in knowing if Merlion can take in data from multiple stores to learn correlations between them.

Another example of a multi-series dataset can be found in this article.

Why can't I find the m4-info.csv file in the code?

