microsoft / forecasting

Time Series Forecasting Best Practices & Examples

Home Page: https://microsoft.github.io/forecasting/

License: MIT License

Python 75.16% Shell 0.42% R 4.00% Batchfile 0.52% Jupyter Notebook 19.90%

forecasting time-series best-practices machine-learning deep-learning azure-ml automl demand-forecasting retail python

Forecasting Best Practices

Time series forecasting is one of the most important topics in data science. Almost every business needs to predict the future in order to make better decisions and allocate resources more effectively.

This repository provides examples and best practice guidelines for building forecasting solutions. The goal of this repository is to build a comprehensive set of tools and examples that leverage recent advances in forecasting algorithms to build solutions and operationalize them. Rather than creating implementations from scratch, we draw from existing state-of-the-art libraries and build additional utilities around processing and featurizing the data, optimizing and evaluating models, and scaling up to the cloud.

The examples and best practices are provided as Python Jupyter notebooks and R Markdown files, together with a library of utility functions. We hope that these examples and utilities can significantly reduce the “time to market” by simplifying the path from defining the business problem to developing a solution. In addition, the example notebooks serve as guidelines and showcase best practices and usage of the tools in both Python and R.

Cleanup notice (2020-06-23)

We've carried out a cleanup of large obsolete files to reduce the size of this repo. If you had cloned or forked it previously, please delete and clone/fork it again to avoid any potential merge conflicts.

Content

The following is a summary of the models and methods for developing forecasting solutions covered in this repository. The examples are organized according to use cases. Currently, we focus on a retail sales forecasting use case, as it is widely used in assortment planning, inventory optimization, and price optimization. To enable high-throughput forecasting scenarios, we have included examples for forecasting multiple time series with distributed training techniques such as Ray in Python, the parallel package in R, and multi-threading in LightGBM. Note that HTML links are provided next to the R examples for the best viewing experience when reading this document on our github.io page.

Model	Language	Description
Auto ARIMA	Python	Auto Regressive Integrated Moving Average (ARIMA) model that is automatically selected
Linear Regression	Python	Linear regression model trained on lagged features of the target variable and external features
LightGBM	Python	Gradient boosting decision tree implemented with LightGBM package for high accuracy and fast speed
DilatedCNN	Python	Dilated Convolutional Neural Network that captures long-range temporal flow with dilated causal connections
Mean Forecast (.html)	R	Simple forecasting method based on historical mean
ARIMA (.html)	R	ARIMA model without or with external features
ETS (.html)	R	Exponential Smoothing algorithm with additive errors
Prophet (.html)	R	Automated forecasting procedure based on an additive model with non-linear trends

The repository also comes with AzureML-themed notebooks and best practices recipes to accelerate the development of scalable, production-grade forecasting solutions on Azure. In particular, we have the following examples for forecasting with Azure AutoML as well as tuning and deploying a forecasting model on Azure.

Method	Language	Description
Azure AutoML	Python	AzureML service that automates model development process and identifies the best machine learning pipeline
HyperDrive	Python	AzureML service for tuning hyperparameters of machine learning models in parallel on cloud
AzureML Web Service	Python	AzureML service for deploying a model as a web service on Azure Container Instances

Getting Started in Python

To quickly get started with the repository on your local machine, use the following commands.

Install Anaconda with Python >= 3.6. Miniconda is a quick way to get started.

Clone the repository

git clone https://github.com/microsoft/forecasting
cd forecasting/

Run setup scripts to create conda environment. Please execute one of the following commands from the root of Forecasting repo based on your operating system.
- Linux
```
./tools/environment_setup.sh
```
- Windows
```
tools\environment_setup.bat
```
Note that for Windows you need to run the batch script from Anaconda Prompt. The script creates a conda environment forecasting_env and installs the forecasting utility library fclib.
Start the Jupyter notebook server
```
jupyter notebook
```
Run the LightGBM single-round notebook under the 00_quick_start folder. Make sure that the selected Jupyter kernel is forecasting_env.

If you have any issues with the above setup, or want to find more detailed instructions on how to set up your environment and run examples provided in the repository, on local or a remote machine, please navigate to the Setup Guide.

Getting Started in R

We assume you already have R installed on your machine. If not, simply follow the instructions on CRAN to download and install R.

The recommended editor is RStudio, which supports interactive editing and previewing of R notebooks. However, you can use any editor or IDE that supports RMarkdown. In particular, Visual Studio Code with the R extension can be used to edit and render the notebook files. The rendered .nb.html files can be viewed in any modern web browser.

The examples use the Tidyverts family of packages, which is a modern framework for time series analysis that builds on the widely-used Tidyverse family. The Tidyverts framework is still under active development, so it's recommended that you update your packages regularly to get the latest bug fixes and features.

Target Audience

Our target audience for this repository includes data scientists and machine learning engineers with varying levels of knowledge in forecasting as our content is source-only and targets custom machine learning modelling. The utilities and examples provided are intended to be solution accelerators for real-world forecasting problems.

Contributing

We hope that the open source community will contribute to the content and bring in the latest SOTA algorithms. This project welcomes contributions and suggestions. Before contributing, please see our Contributing Guide.

Reference

The following is a list of related repositories that you may find helpful.


Deep Learning for Time Series Forecasting	A collection of examples for using deep neural networks for time series forecasting with Keras.
Microsoft AI Github	Find other Best Practice projects, and Azure AI designed patterns in our central repository.

Build Status

Build	Branch	Status
Linux CPU	master
Linux CPU	staging

forecasting's Issues

[BUG] setup.md jupyter env name does not match pytest

Description

How do we replicate the bug?

Expected behavior (i.e. solution)

Other Comments

[BUG] Broken examples link

Description

The link to the examples/ directory in SETUP.md is broken.

Why is there an init.py in the root directory?

There seems to be an init.py in the root directory which may be superfluous

Add CI pipeline for R

[ASK] update .gitignore file

Description

Update .gitignore file to include more unnecessary files e.g., AML config file, output from AML experiments, model files, etc

Other Comments

[ASK] Use fclib directly in aml examples

Description

Use fclib directly in the AML examples rather than copying utility functions into a separate utils.py file, which duplicates the code. Good resources:

https://render.githubusercontent.com/view/ipynb?commit=1782b93ce748a9bdcac443ecc38c38ee07baf9f1&enc_url=68747470733a2f2f7261772e67697468756275736572636f6e74656e742e636f6d2f6d6963726f736f66742f6e6c702d726563697065732f313738326239336365373438613962646361633434336563633338633338656530376261663966312f6578616d706c65732f746578745f636c617373696669636174696f6e2f74635f7472616e73666f726d6572735f617a7572656d6c5f706970656c696e65732f74635f7472616e73666f726d6572735f617a7572656d6c5f706970656c696e65732e6970796e62&nwo=microsoft%2Fnlp-recipes&path=examples%2Ftext_classification%2Ftc_transformers_azureml_pipelines%2Ftc_transformers_azureml_pipelines.ipynb&repository_id=179728393&repository_type=Repository#Setup-Execution-Environment

https://github.com/microsoft/nlp-recipes/blob/ignite/examples/text_classification/tc_transformers_azureml_pipelines/tc_transformers_azureml_pipelines.ipynb

https://github.com/microsoft/seismic-deeplearning/blob/contrib/interpretation/deepseismic_interpretation/azureml_pipelines/pipeline_config.json

https://github.com/microsoft/seismic-deeplearning/blob/contrib/experiments/interpretation/dutchf3_patch/local/azureml_requirements.txt

Other Comments

[BUG] System does not allow creation of a file named "aux"

Description

I have no idea why...
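The issue does not say which operating system was used. If it was Windows, one likely explanation is that `AUX` is a reserved device name there, along with `CON`, `PRN`, `NUL`, `COM1` to `COM9` and `LPT1` to `LPT9`, and these names are traditionally rejected even when an extension is appended. The helper below is a minimal sketch under that assumption; `is_reserved_on_windows` is a hypothetical name and is not part of fclib.

```python
from pathlib import PureWindowsPath

# Device names that Windows reserves (assumption: the issue was hit on Windows).
RESERVED_NAMES = {
    "CON", "PRN", "AUX", "NUL",
    *(f"COM{i}" for i in range(1, 10)),
    *(f"LPT{i}" for i in range(1, 10)),
}


def is_reserved_on_windows(path):
    """Return True if the final path component is a reserved device name."""
    name = PureWindowsPath(path).name
    # The check ignores case and anything after the first dot.
    stem = name.split(".", 1)[0].rstrip(" ").upper()
    return stem in RESERVED_NAMES


aux_plain = is_reserved_on_windows("aux")
aux_csv = is_reserved_on_windows("data/aux.csv")
aux_prefixed = is_reserved_on_windows("data/aux_data.csv")
```

Under this assumption, renaming the file (for example to `aux_data.csv`) would sidestep the restriction.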

How do we replicate the bug?

Expected behavior (i.e. solution)

Other Comments

remove Batch AI from documentation and from benchmarking scripts

[BUG] azure_automl_forecast uses wrong workspace creation

Description

Use this to get an existing workspace or create a new one.

from azureml.core import Workspace

ws = Workspace.create(subscription_id=subscription_id, resource_group=resource_group, name=workspace_name,
                      create_resource_group=True, exist_ok=True, location=workspace_region)

How do we replicate the bug?

Expected behavior (i.e. solution)

Other Comments

Broken links in README files

The links in retail_sales/README.md and energy_load/README.md are broken

train_validate_vm.sh fails in fnn submission

I followed instructions in README file of fnn submission and got two different failures in master and staging branches.

In master branch I get an error:
[1] "cv_round_1"
[1] 1
Error in { :
task 1 failed - "Variable 'subset_columns_train' is not found in calling scope. Looking in calling scope because you used the .. prefix."
Calls: %dopar% ->
Execution halted

In staging branch I get an error:
[1] "cv_round_1"
[1] 1
Error in [.data.table(validation_data, , c("recent_load_ratio_10", "recent_load_ratio_11", :
column(s) not found: recent_load_ratio_10, recent_load_ratio_11, recent_load_ratio_12, recent_load_ratio_13, recent_load_ratio_14, recent_load_ratio_15, recent_load_ratio_16
Calls: rowMeans -> is.data.frame -> [ -> [.data.table
Execution halted

Add AUTHORS.md, CONTRIBUTING.md, LICENSE, chlog.txt and codeofconduct.md

Take templates from https://github.com/Microsoft/RecipeTemplate

[BUG] May add git clone in SETUP.md

Description

In SETUP.md, I think it would be clearer to add the following before all other commands:

git clone https://github.com/microsoft/forecasting.git
cd forecasting/

How do we replicate the bug?

Expected behavior (i.e. solution)

Other Comments

rewrite download_data.r of OrangeJuice dataset in Python

Since Python is the main language of this repo, this dataset should be accessible to people who are not familiar with R. Because the actual dataset is part of an R package, download_data.py for this dataset can call download_data.r.
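A minimal sketch of such a wrapper is shown below. It assumes that `Rscript` is on the PATH and that `download_data.r` can be run without arguments. Both are assumptions, and the function names are hypothetical.

```python
import subprocess
from pathlib import Path


def build_rscript_command(r_script):
    """Build the command line that runs an R script through Rscript."""
    return ["Rscript", str(Path(r_script))]


def download_data(r_script="download_data.r"):
    """Run the R download script; raises CalledProcessError if it fails."""
    # Assumption: Rscript is installed and available on the PATH.
    subprocess.run(build_rscript_command(r_script), check=True)


command = build_rscript_command("download_data.r")
```

Keeping the command construction separate from the call makes the wrapper easy to test without R installed.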

[BUG] yield in split_train_test() function

Description

The split_train_test() function doesn't write out CSV files when we call it with write_csv=True. I found this is because we use a yield statement at the end of the function. Calling the function only returns a generator; the code inside the function is not executed until the generator is consumed. Right now, we need to iterate through the generator to force the function body to run, by doing something like

for train_df, test_df, aux_df in split_train_test(DATA_DIR, forecasting_setting, write_csv=True):

@vapaunic Do you think it is better to replace yield with returning lists of data frames train_df_list, test_df_list, aux_df_list when NUM_ROUNDS>1 and returning three data frames train_df, test_df, aux_df when NUM_ROUNDS=1?
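The behaviour can be reproduced with a toy generator. The sketch below is not the real split_train_test(); `split_rounds` and `split_rounds_eager` are hypothetical stand-ins, and appending to a list stands in for writing a CSV file.

```python
def split_rounds(num_rounds, written):
    """Toy stand-in for a splitter whose body ends with `yield`."""
    for round_number in range(1, num_rounds + 1):
        # Stands in for the CSV write that happens inside the real function.
        written.append(f"train_round_{round_number}.csv")
        yield round_number


def split_rounds_eager(num_rounds, written):
    """Eager variant: runs every round immediately and returns a list."""
    return list(split_rounds(num_rounds, written))


written = []
generator = split_rounds(2, written)      # no code in the body has run yet
files_after_call = list(written)          # still empty
rounds = list(generator)                  # iterating runs the body
files_after_iteration = list(written)     # both "files" now written
```

Returning lists, as proposed above, corresponds to the eager variant: the side effects happen on the call itself, at the cost of holding all rounds in memory at once.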

How do we replicate the bug?

Expected behavior (i.e. solution)

Other Comments

[BUG] azure_automl_forecast throws exception instead of creating compute

Description

Whether the ComputeTarget exists should be checked explicitly (not inferred from a caught exception), and the target should be created if it does not exist.

[BUG] Prediction HORIZON specified in forecast settings, but not used

Description

We specify a value PRED_HORIZON in forecast_settings.py to be used as the forecasting horizon. However, we don't use this value when forecasting or when creating the train/test data splits. Rather, the variables TRAIN(TEST)_START(END)_WEEK are used as a proxy for the prediction horizon.

How do we replicate the bug?

Expected behavior (i.e. solution)

Other Comments

[BUG] examples: oj_retail

Description

The examples directory has a subdirectory called oj_retail. The name probably could be optimized to better reflect the use case we are trying to highlight. Are there keywords we want to cover in here? retail? grocery? perishable goods? I would not think OJ really signifies anything here.

How do we replicate the bug?

Expected behavior (i.e. solution)

Other Comments

[ASK]forecasting_lib or something shorter

Description

Personally, I prefer something shorter like fclib. Would be great to conduct user interviews and make a decision based on input.

Other Comments

[FEATURE] Put jupyter startup instructions in setup

Description

I would recommend that the jupyter startup instructions, currently found in the first cell of examples/README.md, either be moved or copied into docs/SETUP.md. This seems to make sense, so that you can have jupyter running before you are instructed to start running the example notebooks. I guess this will require a new section under the "automated" and "manual" steps, one that instructs the user to start jupyter.

Expected behavior with the suggested feature

Setup instructions tells users to start jupyter before jumping into the example notebooks.

Other Comments

FYI, I'm running from a DSVM which has a running Jupyterlab on port 8888, so port 8889 was used for jupyter.

[FEATURE] Put "Get Started" instructions in README

Description

Having the get started instructions in the README makes it easier for users to quickly set up the environment and experiment with the repo.

Expected behavior with the suggested feature

Other Comments

[BUG] lightgbm could not be found in Jupyter

Description

I could import lightgbm in forecast env.
I could also import it while using python 3 kernel in jupyter.
However, I could not import it using forecast kernel in jupyter.

How do we replicate the bug?

Expected behavior (i.e. solution)

Other Comments

Links in the performance boards should be fixed

Links in the Performance Board tables should be changed. They point to the old Azure devops repo.

[BUG] two out of three notebooks empty on master.

Description

Only lightgbm notebook is a legit notebook. When opening the other two notebooks from browser, Jupyter gives error that they are not JSON but in fact they are of 0 bytes.

How do we replicate the bug?

Expected behavior (i.e. solution)

Other Comments

[FEATURE] Evaluation metrics to support any iterables

Description

The current metrics only support pd.Series.
It would be nice if they worked with any kind of iterable, such as np.array, list, etc.

Expected behavior with the suggested feature

E.g.,

def MAPE(predictions, actuals):
    predictions = np.array(predictions)
    actuals = np.array(actuals)
    
    # mape calculation here
    np.absolute(predictions-actuals)  ...
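A runnable version of this idea might look as follows. This is a sketch rather than the fclib implementation: the name `mape`, the choice to return a fraction instead of a percentage, and the assumption that `actuals` contains no zeros are all illustrative.

```python
import numpy as np


def mape(predictions, actuals):
    """Mean absolute percentage error for any array-like inputs.

    Returns a fraction (0.1 means 10%). Assumes no actual value is zero.
    """
    # np.asarray accepts lists, tuples, NumPy arrays and pandas Series.
    predictions = np.asarray(predictions, dtype=float)
    actuals = np.asarray(actuals, dtype=float)
    return float(np.mean(np.abs((predictions - actuals) / actuals)))


from_lists = mape([110, 90], [100, 100])
from_mixed = mape((110, 90), np.array([100, 100]))
```

Converting with `np.asarray` at the top of each metric keeps the rest of the calculation unchanged regardless of the input type.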

[FEATURE] Remove the files that are not needed from the root directory

Description

Remove the files (e.g., .flake8) that are not needed from the root directory.
Add the unnecessary files to .gitignore to avoid checking them in.

Expected behavior with the suggested feature

Other Comments

OJ data for store=38 does not have entire time series

Stops in week=157

Implement single-model approach in R

Currently, the tidyverts framework only supports one model per subject, for datasets that consist of multiple subjects with one time series per subject. There is an open issue to support fitting a single model across all subjects, similar to what is being done here on the Python side.

[ASK] Add info to top-level README

Description

Add a Content table to list the examples in the repo (see https://github.com/microsoft/nlp-recipes/blob/master/README.md)
Add an Azure Machine Learning Service section to summarize the dependency on Azure ML service (see https://github.com/microsoft/nlp-recipes/blob/master/README.md)
Add a References section to link to other repo (e.g. https://github.com/Azure/DeepLearningForTimeSeriesForecasting,
https://github.com/Azure/cortana-intelligence-price-optimization)

In which platform does it happen?

How do we replicate the issue?

Expected behavior (i.e. solution)

Other Comments

[ASK] Clean up feature engineering module

Description

Remove unused files.

Other Comments

[ASK] Add README file under /examples directory

Description

Add a README file to explain the examples we have in this directory

Other Comments

NA handling in orange juice dataset

Should add some discussion on this especially in R context; some modelling functions can handle them natively, others require imputation and can be fragile when NAs are present

[BUG] Pylint Score 6.54/10

Description

After running pylint the repo has a score of 6.54.

I keep my own repos at a score of 10, but this repo should score above 8 before release.

Replace docker images with dockerfiles

Docker images should be replaced with Dockerfiles in the implementation process to improve the scalability of the process and avoid legal issues.

[BUG] Conda activate needed in setup instructions

Description

When following the instructions at
https://github.com/microsoft/forecasting/blob/staging/docs/SETUP.md
the environment_setup.sh script creates the forecasting_env conda environment.
There should be a conda activate instruction after that.

How do we replicate the bug?

./tools/environment_setup.sh
conda env list
# conda environments:
#
base                  *  /home/andreas/anaconda3
forecasting_env          /home/andreas/anaconda3/envs/forecasting_env

Refactor R side to use tidyverts utility functions

[BUG] Starting jupyter throws warning missing "jupyter_nbextensions_configurator"

Description

When I start jupyter in the conda env the following extension is missing and throws a warning.

How do we replicate the bug?

Expected behavior (i.e. solution)

Other Comments

[BUG] Clean-up evaluation module in fclib

Description

Clean up evaluation module in fclib. There are left-over files there from the tsperf days (evaluate, train_util). Move these to contrib directory.

How do we replicate the bug?

Expected behavior (i.e. solution)

Other Comments

Two bugs in tsperf_rules.md

item 4.3 in contents is misaligned
the link to diagram in "Definitions" section is broken

[FEATURE] Use tidyverts datasets

https://tsibbledata.tidyverts.org/

[BUG] `download_ojdata` does not work inside a Jupyter Notebook

Description

When running the 00_quick_start/auto_arima_forecasting.ipynb notebook in the cell where the data is downloaded and split, it failed to download the data.

For example, if we run the function to download the data, it says it starts to download the data but the actual download operation is not triggered (see screen shot below).

How do we replicate the bug?

Follow the environment set up instructions and run the notebook.

Expected behavior (i.e. solution)

The data should be successfully downloaded.

Other Comments

The problem may be something to do with the script path construction where os.path.abspath(__file__) is used - it might be somewhat incompatible with Jupyter notebook. One discussion that may be useful to resolve the issue is here.
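If the cause is indeed that `__file__` is unavailable in the executing namespace (which is the case for code run directly in a notebook cell), one possible workaround is to fall back to the current working directory. The sketch below illustrates that idea only; `resolve_base_dir` is a hypothetical helper, and the root cause suggested above is not confirmed.

```python
from pathlib import Path


def resolve_base_dir(namespace):
    """Return the directory of the running file, or the cwd if unknown.

    `namespace` is typically `globals()`; notebook cells have no `__file__`.
    """
    file_attr = namespace.get("__file__")
    if file_attr:
        return Path(file_attr).resolve().parent
    return Path.cwd()


notebook_like = resolve_base_dir({})                       # no __file__
script_like = resolve_base_dir({"__file__": "pkg/mod.py"})  # relative path
```

A caller would use `resolve_base_dir(globals())` in place of `os.path.dirname(os.path.abspath(__file__))`.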

[BUG] Forecast settings train/test weeks run past the end of the dataset

Description

The forecast settings will see the test data run past the end of the dataset.
TEST_START_WEEK goes up to 161, TEST_END_WEEK up to 162, but the data only goes to 160
TRAIN_END_WEEK goes up to 159 so there will be nothing left
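A small sanity check could catch this kind of mismatch before any training starts. In the sketch below, only the last test end week (162) and the last week of data (160) come from the description above; the other values and the function name `rounds_past_end` are illustrative.

```python
def rounds_past_end(test_end_weeks, last_week):
    """Return the 1-based round numbers whose test window exceeds the data."""
    return [
        round_number
        for round_number, end_week in enumerate(test_end_weeks, start=1)
        if end_week > last_week
    ]


# Illustrative settings: only 162 and 160 are taken from the issue text.
TEST_END_WEEKS = [158, 160, 162]
LAST_WEEK_IN_DATA = 160

offending_rounds = rounds_past_end(TEST_END_WEEKS, LAST_WEEK_IN_DATA)
```

Running such a check when the settings are loaded would turn a silent empty test set into an explicit error.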

How do we replicate the bug?

Expected behavior (i.e. solution)

Other Comments

[BUG] forecasting or forecast

Description

The repo's name is forecasting.
There is a directory called forecasting_lib.
conda env is called forecast
In addition, the Jupyter kernel is called forecast.

I don't know if most people would prefer to have a single name (e.g., forecast). Would be good to interview users and make a call.

How do we replicate the bug?

Expected behavior (i.e. solution)

Other Comments

[ASK] Group the essential command line prompts together to facilitate onboarding

Based on rounds of user testing, one typically needs only the following three commands to set up the environment:

git clone https://github.com/microsoft/forecasting.git
cd forecasting/
./tools/environment_setup.sh

Please consider grouping these commands in a single section for ease of reference and interpretation. I had two interviewees who didn't locate "./tools/environment_setup.sh" during the first pass of the setup guide.

[FEATURE] OrangeJuice data can be directly downloaded from github

Description

Instead of installing bayesm, consider simply downloading the source from here. The datasets are in the /data directory and can be loaded into R using load.

[Suggested by @Hong-Revo ]

Expected behavior with the suggested feature

Other Comments

[FEATURE] Link to a sample notebook from SETUP.md

Instead of vaguely mentioning the existence of an examples/ folder in SETUP.md, we may consider explicitly linking to a sample notebook such as examples/00_quick_start/auto_arima_forecasting.ipynb

This small amount of hand-holding can go a long way toward helping a new user develop their first forecasting use case and boost user satisfaction.

[BUG] Put setup instructions outside CONTRIBUTING.md

Description

maybe in a separate file SETUP.md?

How do we replicate the bug?

Expected behavior (i.e. solution)

Other Comments

[ASK] Add integration tests for all (relevant notebooks)

Description

Integration tests for all (relevant) notebooks. This may require parameterization of notebooks, so that the notebooks can be executed relatively quickly.

Other Comments

[BUG] Unable to execute "./tools/environment_setup.sh" in Windows Command Prompt

This is regarding the required command ./tools/environment_setup.sh as part of the environment setup process.

.sh shell scripts are the Linux/Unix counterpart of batch files, so Windows users would have to use an Ubuntu terminal or WSL (Windows Subsystem for Linux). The easiest fix would be to provide either a PowerShell .ps1 script or a Windows batch .bat script.

https://www.thewindowsclub.com/how-to-run-sh-or-shell-script-file-in-windows-10
https://simply-python.com/2014/03/20/easy-invoke-pip-install-using-batch-commands/

[ASK] Include model training time in examples

Description

Add information about estimated running time in "Model training" cell and mention that user could reduce the number of iterations to speed up the model training.