
wenjiedu / saits


The official PyTorch implementation of the paper "SAITS: Self-Attention-based Imputation for Time Series". A fast and state-of-the-art (SOTA) deep-learning neural network model for efficient time-series imputation (impute multivariate incomplete time series containing NaN missing data/values with machine learning). https://arxiv.org/abs/2202.08516

Home Page: https://doi.org/10.1016/j.eswa.2023.119619

License: MIT License

Python 96.96% Shell 3.04%
time-series imputation-model missing-values self-attention partially-observed-data partially-observed-time-series partially-observed interpolation time-series-imputation incomplete-data

saits's Issues

Can the sequence length (n_steps) be adjusted dynamically?

Hello. In the provided example.py, saits = SAITS(n_steps=48, n_features=37, n_layers=2, d_model=256, d_inner=128, n_heads=4, d_k=64, d_v=64, dropout=0.1, epochs=10) sets n_steps to 48, because every RecordID in the example dataset has 48 samples.
In my dataset, however, the number of samples per RecordID is not fixed; it can be 1, 7, or even 216. Would it be feasible to set n_steps to the maximum number of samples per RecordID, e.g. 216? Or is there another approach? Many thanks!
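
Not the author's answer, but one common workaround for variable-length records is to pad shorter records with NaN up to a fixed n_steps, since the model treats NaN entries as missing. A minimal sketch, with shapes and values as toy placeholders:

import numpy as np

n_steps, n_features = 216, 37                      # assumed: pad every record up to the longest length
record = np.random.randn(7, n_features)            # toy record with only 7 time steps

padded = np.full((n_steps, n_features), np.nan, dtype="float32")
padded[: record.shape[0]] = record                 # the padded tail stays NaN and is treated as missing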

Final error calculation

Hello Wenjie,

I have a doubt regarding the calculation of the final error metrics on the test data.

Suppose my sample data looks like this:

date          A       B
timestamp1    3       5
timestamp2    4       7
timestamp3    6       8
timestamp4    8       10

After introducing 50% missingness:

date          A       B
timestamp1    NaN     5
timestamp2    NaN     7
timestamp3    6       8
timestamp4    NaN     NaN

After imputation:

date          A       B
timestamp1    2       5
timestamp2    4       7
timestamp3    6       8
timestamp4    6       5
  1. Are the MAE, RMSE, and MRE calculated only on the imputed values or on the whole dataset?
  2. Can you explain the MAE, RMSE, and MRE formulas/equations used?
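
For reference (not the author's answer): judging from the masked_mae_cal helper quoted further down this page, the metrics are computed only over the positions selected by a mask (the artificially-missing values), not over the whole dataset. A minimal sketch of masked MAE/RMSE/MRE in that style (the exact MRE normalization is my assumption):

import torch

def masked_mae(inputs, target, mask):
    # mean absolute error over masked (artificially-missing) positions only
    return torch.sum(torch.abs(inputs - target) * mask) / (torch.sum(mask) + 1e-9)

def masked_rmse(inputs, target, mask):
    # root mean squared error over the same masked positions
    return torch.sqrt(torch.sum(torch.square(inputs - target) * mask) / (torch.sum(mask) + 1e-9))

def masked_mre(inputs, target, mask):
    # mean relative error: absolute error normalized by the magnitude of the masked targets (assumption)
    return torch.sum(torch.abs(inputs - target) * mask) / (torch.sum(torch.abs(target * mask)) + 1e-9)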

Thank you, please let me know.

Regards
Niharika Joshi

pd.concat

FutureWarning: The frame.append method is deprecated and will be removed from pandas in a future version. Use pandas.concat instead.
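
A minimal sketch of the replacement the warning suggests (df and new_rows are toy placeholders):

import pandas as pd

df = pd.DataFrame({"a": [1, 2]})
new_rows = pd.DataFrame({"a": [3, 4]})

# deprecated: df = df.append(new_rows, ignore_index=True)
df = pd.concat([df, new_rows], ignore_index=True)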

Custom dataset

Thank you for your wonderful work! I would like to know whether I can use this model, or train a model from scratch, to impute my own time series?

Question about output of the first DMSA

Hello, I want to ask about the saits.py modeling part of your code. I used only the first DMSA module and fed in X and the missing mask in your way, but after going through the encoder layer the data becomes all NaN. What could be the reason for this?
Looking forward to your reply.
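
Not the author's answer, but a common cause worth checking: NaN values left in X propagate through every linear layer, so the whole encoder output becomes NaN. A minimal sketch of the usual pre-masking step (toy shapes, assuming a mask with 1 = observed, 0 = missing):

import torch

X = torch.randn(1, 48, 37)                   # toy input, [batch, n_steps, n_features]
X[0, 0, 0] = float("nan")                    # simulate a missing value
missing_mask = (~torch.isnan(X)).float()     # 1 = observed, 0 = missing

X = torch.nan_to_num(X) * missing_mask       # zero out missing entries before the encoder;
                                             # any NaN left in X turns the attention output into NaN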

Question about loss

Thank you for your work, and please understand that this is not a direct question about the code. Is there a reason the loss function does not include a classification error term? Some models that perform reconstruction and imputation include a classification error in their loss function. Have you ever trained models in this way? If so, please let me know what the results were like.

Calculation of the loss function

Thank you for the excellent work!

In the code, the MIT imputation loss does not cover the complemented feature vector.

Secondly, the paper talks about taking the raw data X, without artificial masking, as input to the MIT formula, but in the corresponding code you use the artificially masked X̂.

Is there something I have misunderstood? I look forward to your clarification of my doubts!

Loss_MIT wrong?

I saw that the MIT loss computation in core.py is

MIT_loss = self.customized_loss_func(
    X_tilde_3, inputs["X_ori"], inputs["indicating_mask"]
)

which computes the MAE between X_tilde_3 and X_ori and differs from the paper.

Test data

Hi,

After certain modifications and the inclusion of some code snippets, I was able to train, validate, and get the MAE on the test data.
I want to obtain the de-normalized values after imputation on the test data, both predicted and actual. Can you help?
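
Not an official answer, but a minimal sketch of de-normalizing the imputation output with the fitted scaler, assuming the data was standardized with sklearn's StandardScaler (all names and shapes below are toy placeholders):

import numpy as np
from sklearn.preprocessing import StandardScaler

# toy stand-in; in practice, reuse the exact scaler fitted on the training data
scaler = StandardScaler().fit(np.random.randn(100, 37))        # [total_length, n_features]

imputed = np.random.randn(16, 48, 37)                          # model output, [n_samples, n_steps, n_features]
flat = imputed.reshape(-1, imputed.shape[-1])                  # line features up with the scaler
denormalized = scaler.inverse_transform(flat).reshape(imputed.shape)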

Question about hyperparameter optimization

Hello, I would like to ask you about the hyperparameter optimization for the model. In your file NNI_tuning/SAITS/SAITS_searching_config.yml, you described the settings for the hyperparameters and the training command, which also includes a JSON file. However, when I tried to run the command for hyperparameter optimization on SAITS, I encountered an error: "No option 'mit' in section: 'training'". I supplemented the missing parameters and ran it again, but I only obtained the parameters set in the SAITS_basic_config.ini file. Could you please advise me on how to iterate through the parameters in the JSON file to obtain the optimal parameters?

Some questions about multivariate time series Imputation.

Thank you for your work. I recently read your paper "SAITS: Self-Attention-based Imputation for Time Series". I am also working on multivariate time series imputation, and I have some questions that I hope to discuss with you.
1. I recently ran your method on my own dataset. My data processing approach is to first split the data into a training set and a test set, then build the time sequences, train on the training set, and evaluate on the test set. (I know that imputation is an unsupervised algorithm that does not use the ground-truth values of the missing data. Some people split the data into training and test sets while others do not, and the two partitioning schemes give somewhat different results with your algorithm. How do you view the partitioning of datasets?)
2. Could your algorithm overfit? The back-propagated loss is the MAE on the non-missing entries rather than on the whole dataset, so I feel that as the number of training epochs increases, the model will gradually tend to overfit.
3. Currently the stopping condition is reaching the specified number of epochs, which has to be found per dataset. If we split the data into training and test sets, can we stop training by detecting when the MAE on the missing entries of the training set reaches its minimum?
Thank you very much

How to understand the NNI fine-tuning?

First, the file SAITS_basic_config.ini under the NNI_tuning folder is missing two args, "MIT" and "ORT", which affects running the script python ../../run_models.py --config_path SAITS_basic_config.ini --param_searching_mode. You may want to add these two args to that .ini file, and also check the other .ini files if you have time.
Second, I am wondering how to check how NNI helps with tuning the parameters. To be more specific, which parameters does NNI change? When and by how much do the parameters change? Are only the parameters listed in the SAITS_searching_space.json file changed?

Thanks for your attention.

Configs of ETTm1

Hello,
Could you share the configuration settings on the ETTm1 dataset?
Thanks!

Question about MAE

Hi, Wenjie

import torch

def masked_mae_cal(inputs, target, mask):
    """ calculate Mean Absolute Error"""
    return torch.sum(torch.abs(inputs - target) * mask) / (torch.sum(mask) + 1e-9)

I have a small doubt about the calculation of MAE.
I found that you normalize the dataset with standard scaling, which means the target and input are standardized. So why not calculate the MAE after inverting the scaling?
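
Not the author's answer, but for reference, a masked MAE in the original units can be obtained by un-scaling both tensors before calling masked_mae_cal above. A minimal sketch with toy stand-ins (in practice the scaler is the one fitted on the training data):

import numpy as np
import torch
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler().fit(np.random.randn(100, 37))        # toy stand-in for the training-data scaler
mean = torch.tensor(scaler.mean_, dtype=torch.float32)
std = torch.tensor(scaler.scale_, dtype=torch.float32)

imputed = torch.randn(16, 48, 37)                              # normalized model output (toy)
target = torch.randn(16, 48, 37)                               # normalized ground truth (toy)
mask = torch.randint(0, 2, (16, 48, 37)).float()

# masked MAE in the original units, using masked_mae_cal from above
mae_original_units = masked_mae_cal(imputed * std + mean, target * std + mean, mask)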

window truncate function

import numpy as np

def window_truncate(feature_vectors, seq_len):
    """ Generate time series samples, truncating windows from time-series data with a given sequence length.
    Parameters
    ----------
    feature_vectors: time series data, len(shape)=2, [total_length, feature_num]
    seq_len: sequence length
    """
    start_indices = np.asarray(range(feature_vectors.shape[0] // seq_len)) * seq_len
    sample_collector = []
    for idx in start_indices:
        sample_collector.append(feature_vectors[idx: idx + seq_len])

    return np.asarray(sample_collector).astype('float32')

Wenjie,

I have some questions, if you do not mind clarifying:

  1. In the implementation, is the training data generated by dividing the time series based on the sequence length?
  2. What is the advantage of such a training data configuration over using a sliding-window approach, e.g., generating the training set with a one-time-step lag: [t-n, t-n+1, ..., t], [t-n+1, t-n+2, ..., t+1], [t-n+2, t-n+3, ..., t+2]? Wouldn't the sliding-window approach generate more samples for training? (See the sliding-window sketch after this list.)
  3. I am not quite familiar with the transformer architecture. In a typical RNN-based imputation method, there are the concepts of sequence length (i.e., the length of historical or future data used as input) and prediction horizon (i.e., how far in the future or the past the model tries to impute). For SAITS, what are the equivalent concepts, or does the concept of a prediction horizon exist at all?
  4. I understand from your paper that the sequence length is fixed across models for comparison purposes. How does the sequence length affect the accuracy of the imputation? How would you recommend determining the appropriate sequence length for the problem at hand?
  5. An unrelated question: does your PyPOTS currently work with the Air Quality dataset?
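
For reference, a sliding-window variant of window_truncate (an assumption for illustration, not part of the repo) that generates overlapping samples with a configurable stride:

import numpy as np

def window_slide(feature_vectors, seq_len, stride=1):
    """ Generate overlapping samples with a sliding window; stride=1 gives a one-time-step lag. """
    start_indices = range(0, feature_vectors.shape[0] - seq_len + 1, stride)
    sample_collector = [feature_vectors[idx: idx + seq_len] for idx in start_indices]
    return np.asarray(sample_collector).astype('float32')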

Thanks in advance,
Haochen

Question about temporal dependencies and feature correlations captured by DMSA

Hello, I have a question about the self-attention in your paper. The N×N self-attention matrix Q·Kᵀ expresses attention between positions along a single dimension of length N, yet the paper states that "Such a mechanism makes DMSA able to capture the temporal dependencies and feature correlations between time steps in the high dimensional space with only one attention operation", i.e., one attention matrix of DMSA captures attention across both dimensions at once. How can a single attention operation capture both types of dependencies?
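
Not the author's answer, but the usual reading: the attention weights are computed between time steps, while each time step's d_model-dimensional embedding already mixes all features through the input projection, so a single attention operation acts on feature-mixed representations. A toy shape check (dimensions taken from example.py; the single-head simplification is mine):

import torch

n_steps, n_features, d_model = 48, 37, 256
x = torch.randn(1, n_steps, n_features)

embed = torch.nn.Linear(n_features, d_model)     # mixes all features into each time step's embedding
h = embed(x)                                     # [1, n_steps, d_model]

q = k = h                                        # toy: the per-head Q/K projections are omitted for brevity
attn = torch.softmax(q @ k.transpose(-2, -1) / d_model ** 0.5, dim=-1)
print(attn.shape)                                # torch.Size([1, 48, 48]): attention between time steps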

Using CSV files versus h5 data

Hello Wenjie,

Thank you for releasing the code. I have a couple of questions. I am trying to run the code on the Air Quality dataset in Google Colab. These are some of my doubts:

  1. !CUDA_VISIBLE_DEVICES=2 python run_models.py --config_path configs/AirQuality_SAITS_best.ini
    Running this gives me the following error message.
    OSError: Unable to open file (unable to open file: name = 'dataset_generating_scripts/RawData/AirQuality/PRSA_Data_20130301-20170228/datasets.h5', errno = 2, error message = 'No such file or directory', flags = 0, o_flags = 0)

All of the dataset is in .csv format.
1a) Is there an option to use the default .csv data?
1b) How do I convert .csv to h5 format?

  2. Where should we change the file path of the dataset for training purposes,
    as in the file configs/AirQuality_SAITS_best.ini?
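
Not an official answer: the datasets.h5 the config expects is normally produced by the scripts under dataset_generating_scripts, so their exact schema should be followed. For a generic .csv-to-.h5 conversion with pandas, a minimal sketch with toy data (not the repo's exact layout):

import pandas as pd

# toy frame standing in for one Air Quality CSV file
df = pd.DataFrame({"PM2.5": [12.0, None, 30.5], "TEMP": [1.2, 2.4, 3.1]})
df.to_hdf("datasets.h5", key="df", mode="w")     # requires the PyTables ('tables') package
restored = pd.read_hdf("datasets.h5", key="df")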

Please do let me know, thanks.

Niharika
