
lucidrains / iTransformer


Unofficial implementation of iTransformer - SOTA Time Series Forecasting using Attention networks, out of Tsinghua / Ant group

License: MIT License

Python 100.00%
artificial-intelligence attention-mechanisms deep-learning time-series-forecasting transformers

itransformer's Introduction

iTransformer

Implementation of iTransformer - SOTA Time Series Forecasting using Attention networks, out of Tsinghua / Ant group

All that remains is tabular data (xgboost still champion here) before one can truly declare "Attention is all you need"

In before Apple gets the authors to change the name.

The official implementation has been released here!

Appreciation

  • StabilityAI and 🤗 Huggingface for the generous sponsorship, as well as my other sponsors, for affording me the independence to open source current artificial intelligence techniques.

  • Greg DeVos for sharing experiments he ran on iTransformer and some of the improvised variants

Install

$ pip install iTransformer

Usage

import torch
from iTransformer import iTransformer

# using solar energy settings

model = iTransformer(
    num_variates = 137,
    lookback_len = 96,                  # or the lookback length in the paper
    dim = 256,                          # model dimensions
    depth = 6,                          # depth
    heads = 8,                          # attention heads
    dim_head = 64,                      # head dimension
    pred_length = (12, 24, 36, 48),     # can be one prediction, or many
    num_tokens_per_variate = 1,         # experimental setting that projects each variate to more than one token. the idea is that the network can learn to divide up into time tokens for more granular attention across time. thanks to flash attention, you should be able to accommodate long sequence lengths just fine
    use_reversible_instance_norm = True # use reversible instance normalization, proposed here https://openreview.net/forum?id=cGDAkQo1C0p . may be redundant given the layernorms within iTransformer (and whatever else attention learns emergently on the first layer, prior to the first layernorm). if i come across some time, i'll gather up all the statistics across variates, project them, and condition the transformer a bit further. that makes more sense
)

time_series = torch.randn(2, 96, 137)  # (batch, lookback len, variates)

preds = model(time_series)

# preds -> Dict[int, Tensor[batch, pred_length, variate]]
#       -> (12: (2, 12, 137), 24: (2, 24, 137), 36: (2, 36, 137), 48: (2, 48, 137))
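
Since the model returns a dictionary keyed by prediction length (as the comment above indicates), each forecast horizon can be pulled out individually. A minimal sketch, assuming the settings above:

preds_12 = preds[12]   # (2, 12, 137) -> (batch, pred_length, variates)
preds_48 = preds[48]   # (2, 48, 137)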

For an improvised version that does granular attention across time tokens (as well as the original per-variate tokens), just import iTransformer2D and set the additional num_time_tokens hyperparameter.

Update: It works! Thanks go out to Greg DeVos for running the experiment here!

Update 2: Got an email. Yes, you are free to write a paper on this if the architecture holds up for your problem. I have no skin in the game.

import torch
from iTransformer import iTransformer2D

# using solar energy settings

model = iTransformer2D(
    num_variates = 137,
    num_time_tokens = 16,               # number of time tokens (patch size will be (look back length // num_time_tokens))
    lookback_len = 96,                  # the lookback length in the paper
    dim = 256,                          # model dimensions
    depth = 6,                          # depth
    heads = 8,                          # attention heads
    dim_head = 64,                      # head dimension
    pred_length = (12, 24, 36, 48),     # can be one prediction, or many
    use_reversible_instance_norm = True # use reversible instance normalization
)

time_series = torch.randn(2, 96, 137)  # (batch, lookback len, variates)

preds = model(time_series)

# preds -> Dict[int, Tensor[batch, pred_length, variate]]
#       -> (12: (2, 12, 137), 24: (2, 24, 137), 36: (2, 36, 137), 48: (2, 48, 137))
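
As a quick sanity check on the settings above (per the comment on num_time_tokens), each time token covers a patch of the lookback window, and the returned dictionary is indexed the same way as before:

patch_size = 96 // 16   # lookback_len // num_time_tokens = 6 timesteps per time token
preds_24 = preds[24]    # (2, 24, 137) -> (batch, pred_length, variates)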

Experimental

iTransformer with Fourier tokens

An iTransformer, but also with Fourier tokens (the FFT of the time series is projected into tokens of its own and attended alongside the variate tokens, then spliced out at the end).

import torch
from iTransformer import iTransformerFFT

# using solar energy settings

model = iTransformerFFT(
    num_variates = 137,
    lookback_len = 96,                  # or the lookback length in the paper
    dim = 256,                          # model dimensions
    depth = 6,                          # depth
    heads = 8,                          # attention heads
    dim_head = 64,                      # head dimension
    pred_length = (12, 24, 36, 48),     # can be one prediction, or many
    num_tokens_per_variate = 1,         # experimental setting that projects each variate to more than one token. the idea is that the network can learn to divide up into time tokens for more granular attention across time. thanks to flash attention, you should be able to accommodate long sequence lengths just fine
    use_reversible_instance_norm = True # use reversible instance normalization, proposed here https://openreview.net/forum?id=cGDAkQo1C0p . may be redundant given the layernorms within iTransformer (and whatever else attention learns emergently on the first layer, prior to the first layernorm). if i come across some time, i'll gather up all the statistics across variates, project them, and condition the transformer a bit further. that makes more sense
)

time_series = torch.randn(2, 96, 137)  # (batch, lookback len, variates)

preds = model(time_series)

# preds -> Dict[int, Tensor[batch, pred_length, variate]]
#       -> (12: (2, 12, 137), 24: (2, 24, 137), 36: (2, 36, 137), 48: (2, 48, 137))

Todo

  • beef up the transformer with latest findings
  • improvise a 2d version across both variates and time
  • improvise a version that includes fft tokens
  • improvise a variant that uses adaptive normalization conditioned on statistics across all variates

Citation

@misc{liu2023itransformer,
    title   = {iTransformer: Inverted Transformers Are Effective for Time Series Forecasting},
    author  = {Yong Liu and Tengge Hu and Haoran Zhang and Haixu Wu and Shiyu Wang and Lintao Ma and Mingsheng Long},
    year    = {2023},
    eprint  = {2310.06625},
    archivePrefix = {arXiv},
    primaryClass = {cs.LG}
}
@misc{shazeer2020glu,
    title   = {GLU Variants Improve Transformer},
    author  = {Noam Shazeer},
    year    = {2020},
    url     = {https://arxiv.org/abs/2002.05202}
}
@misc{burtsev2020memory,
    title   = {Memory Transformer},
    author  = {Mikhail S. Burtsev and Grigory V. Sapunov},
    year    = {2020},
    eprint  = {2006.11527},
    archivePrefix = {arXiv},
    primaryClass = {cs.CL}
}
@inproceedings{Darcet2023VisionTN,
    title   = {Vision Transformers Need Registers},
    author  = {Timoth{\'e}e Darcet and Maxime Oquab and Julien Mairal and Piotr Bojanowski},
    year    = {2023},
    url     = {https://api.semanticscholar.org/CorpusID:263134283}
}
@inproceedings{dao2022flashattention,
    title   = {Flash{A}ttention: Fast and Memory-Efficient Exact Attention with {IO}-Awareness},
    author  = {Dao, Tri and Fu, Daniel Y. and Ermon, Stefano and Rudra, Atri and R{\'e}, Christopher},
    booktitle = {Advances in Neural Information Processing Systems},
    year    = {2022}
}
@Article{AlphaFold2021,
    author  = {Jumper, John and Evans, Richard and Pritzel, Alexander and Green, Tim and Figurnov, Michael and Ronneberger, Olaf and Tunyasuvunakool, Kathryn and Bates, Russ and {\v{Z}}{\'\i}dek, Augustin and Potapenko, Anna and Bridgland, Alex and Meyer, Clemens and Kohl, Simon A A and Ballard, Andrew J and Cowie, Andrew and Romera-Paredes, Bernardino and Nikolov, Stanislav and Jain, Rishub and Adler, Jonas and Back, Trevor and Petersen, Stig and Reiman, David and Clancy, Ellen and Zielinski, Michal and Steinegger, Martin and Pacholska, Michalina and Berghammer, Tamas and Bodenstein, Sebastian and Silver, David and Vinyals, Oriol and Senior, Andrew W and Kavukcuoglu, Koray and Kohli, Pushmeet and Hassabis, Demis},
    journal = {Nature},
    title   = {Highly accurate protein structure prediction with {AlphaFold}},
    year    = {2021},
    doi     = {10.1038/s41586-021-03819-2},
    note    = {(Accelerated article preview)},
}
@inproceedings{kim2022reversible,
    title   = {Reversible Instance Normalization for Accurate Time-Series Forecasting against Distribution Shift},
    author  = {Taesung Kim and Jinhee Kim and Yunwon Tae and Cheonbok Park and Jang-Ho Choi and Jaegul Choo},
    booktitle = {International Conference on Learning Representations},
    year    = {2022},
    url     = {https://openreview.net/forum?id=cGDAkQo1C0p}
}
@inproceedings{Katsch2023GateLoopFD,
    title   = {GateLoop: Fully Data-Controlled Linear Recurrence for Sequence Modeling},
    author  = {Tobias Katsch},
    year    = {2023},
    url     = {https://api.semanticscholar.org/CorpusID:265018962}
}

itransformer's People

Contributors

lucidrains

itransformer's Issues

ModuleNotFoundError: No module named 'iTransformer.iTransformerNormConditioned'

(shy_py38) zhenningli@iotsc-zhenningli:~/shy$ /home/zhenningli/anaconda3/envs/shy_py38/bin/python /home/zhenningli/shy/DSTA2/test.py
Traceback (most recent call last):
File "/home/zhenningli/shy/DSTA2/test.py", line 2, in
import iTransformer
File "/home/zhenningli/anaconda3/envs/shy_py38/lib/python3.8/site-packages/iTransformer/init.py", line 10, in
from iTransformer.iTransformerNormConditioned import iTransformerNormConditioned
ModuleNotFoundError: No module named 'iTransformer.iTransformerNormConditioned'

I installed iTransformer using 'pip install Transformers' according to the README, but when I run the example I get this error and cannot import iTransformer correctly. My torch version is 2.1.0 and my CUDA version is 11.8. Looking forward to your answer, thank you very much.

Question about model performance of 2D version.

Hello, I'm wondering about iTransformer2D.
PatchTST and Crossformer use patching to improve model performance, so when you tried iTransformer2D instead of the original 1D version, did you see any improvement in model performance?

Does it work with images?

Hello, I am interested in your proposed iTransformer, but I found that your example only works for time-series signals. What I want to know is whether it can be used for images, e.g. where the inputs are [b, c, w, h]: b stands for batch size, c for channel, w for width, and h for height.

How to use exogenous variables in iTransformer?

Dear @lucidrains,
I am asking myself whether the authors used exogenous (descriptive) variables/features for the modelling or not.

As far as I understand the approach, they did not take care of that.

Do you know how to adapt the code so that it takes exogenous features into account?

I think most problems in ML use exogenous features to increase accuracy.

Best regards
Daniel

wrong variable

Hi, there is an issue at line 210 of iTransformer.py, and in the other iTransformer implementations as well.

assert targets.shape == pred_list.shape

should be changed to

assert target.shape == pred.shape

Series Stationarization implementation

The original paper uses "Series Stationarization" as proposed in "Non-stationary Transformers: Exploring the Stationarity in Time Series Forecasting".

It could easily be implemented in the forward function, similar to the official iTransformer code and the original paper, as shown below.

def forecast(self, x_enc, x_mark_enc, x_dec, x_mark_dec):
    means = x_enc.mean(1, keepdim=True).detach()
    x_enc = x_enc - means
    stdev = torch.sqrt(torch.var(x_enc, dim=1, keepdim=True, unbiased=False) + 1e-5)
    x_enc /= stdev

and for rescaling after the attention compute

    dec_out = dec_out * (stdev[:, 0, :].unsqueeze(1).repeat(1, self.pred_len, 1))
    dec_out = dec_out + (means[:, 0, :].unsqueeze(1).repeat(1, self.pred_len, 1))

Am I missing something that would need to be changed as well for this to potentially work as intended?

If you are open to having this as an alternative to revin, I can adapt the code to this implementation and create a pull request.

Thanks for the code; I'm already running some experiments with the model, changing some things, and I feel able to adapt it to my needs.
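
For reference, here is a minimal, self-contained sketch of the series stationarization idea described in this issue, written against the (batch, lookback_len, variates) input shape used in this repo. The helper names are hypothetical and not part of the package:

import torch

def stationarize(x, eps = 1e-5):
    # x: (batch, lookback_len, variates)
    # normalize each variate by its own mean and std over the lookback window
    means = x.mean(dim = 1, keepdim = True).detach()
    stdev = torch.sqrt(torch.var(x, dim = 1, keepdim = True, unbiased = False) + eps)
    return (x - means) / stdev, means, stdev

def destationarize(preds, means, stdev):
    # preds: (batch, pred_length, variates) - broadcast the saved statistics back in
    return preds * stdev + means

time_series = torch.randn(2, 96, 137)
normed, means, stdev = stationarize(time_series)

# feed `normed` to the model, then undo the normalization on each horizon's prediction
# preds = model(normed)
# preds = {horizon: destationarize(p, means, stdev) for horizon, p in preds.items()}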

hyperparameters cause issues

I'm running my hyperparameter tuning with iTransformer.py and found some issues.

attend.py, line 134:

attn = attn.type(dtype)  # causes a reference-before-assignment error

Possible solution:

if self.causal:
    attn = attn.type(dtype)
else:
    attn = attn.type(sim.dtype)

revin.py, line 44 (when using num_tokens_per_variate != 1):

unscaled_output = (scaled_output - self.beta) / clamped_gamma

The shape of scaled_output is always scaled up by the num_tokens_per_variate setting, so it no longer matches beta and clamped_gamma:

scaled_output: torch.Size([32, 60, 641])
beta: torch.Size([30, 1])
clamped_gamma: torch.Size([30, 1])

I have not figured out a solution yet.

Why does the time scale affect prediction accuracy?

Hello, thanks for this excellent work!
I used iTransformer to train on my own dataset (univariate, no 'date' column), and the results were much better when the data points were spaced 1 second apart than when they were 1 hour apart. What could be the reason? Here is a comparison of the results:
1s: mse: 0.05796017870306969, mae: 0.1638924479484558
1h: mse: 0.4430641829967499, mae: 0.46550384163856506
Thanks again.

Typo in Description under usage in Readme.md

Dear creators of iTransformer.

I am really looking forward to testing it in the next few days.

In your usage example, there might be a typo in the comment describing the dimensions of the output: do the dimensions pred_length and variate have to be swapped?

Best regards
