
sjtu-quant / master


master's Introduction


This is the official code and supplementary materials for our AAAI-2024 paper: MASTER: Market-Guided Stock Transformer for Stock Price Forecasting. [Paper] [ArXiv preprint]

MASTER is a stock transformer for stock price forecasting, which models the momentary and cross-time stock correlation and guides feature selection with market information.

MASTER framework

Our original experiments were conducted in a complex business codebase developed on top of Qlib. The original code is confidential and extensive. To enable anyone to quickly use MASTER and reproduce the paper's results, we publish our well-processed data and core code here.

Usage

  1. Install dependencies.
  • pandas == 1.5.3
  • torch == 1.11.0
  2. Install Qlib. We have minimized the reliance on Qlib, and you can simply install it by
  • pip install pyqlib (pip installation only supports Python 3.7 and 3.8; please refer to the Qlib Readme.md.)
  • pyqlib == 0.9.1.99
  3. Download data from one of the following links (the data files are the same) and unpack it into data/.
  4. Run main.py.

We also provide two trained models: model/csi300master_0.pkl and model/csi800master_0.pkl.
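
Before running main.py, the following minimal sketch (a convenience check, not part of the repository) verifies that the dependencies import and that the downloaded files sit where the rest of this README expects them:

# Convenience sketch: verify dependencies and file layout before running main.py.
# The paths below follow the layout used elsewhere in this README.
import os

import torch
import qlib  # installed as pyqlib

for path in [
    'data/csi300/csi300_dl_train.pkl',
    'model/csi300master_0.pkl',
    'model/csi800master_0.pkl',
]:
    print(path, 'found' if os.path.exists(path) else 'missing')
print('torch', torch.__version__, '| qlib', qlib.__version__)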

Dataset

Form

The downloaded data is split into training, validation, and test sets, with two stock universes. Note the csi300 data is a subset of the csi800 data. You can use the following code to investigate the datetime, instrument, and feature formulation.

import pickle

with open('data/csi300/csi300_dl_train.pkl', 'rb') as f:
    dl_train = pickle.load(f)
    dl_train.data  # a pandas DataFrame

In our code, the data is gathered chronologically and then grouped by prediction date. The data iterated by the data loader is of shape (N, T, F), where:

  • N - the number of stocks. For CSI300, N is around 300 on each prediction date; for CSI800, N is around 800.
  • T - the length of lookback_window, T = 8.
  • F - 222 features in total: 158 factors, 63 market information features, and 1 label.
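
As a quick sanity check of the published files, the sketch below (an inspection helper, not part of the repository) loads the training pickle and prints the column count and per-date sample counts described above; the (datetime, instrument) index ordering is an assumption and is printed so you can verify it.

import pickle

# Inspection sketch (assumption: the first index level of dl_train.data is the
# prediction date; the level names are printed so this can be verified).
with open('data/csi300/csi300_dl_train.pkl', 'rb') as f:
    dl_train = pickle.load(f)

df = dl_train.data
print(df.shape)                           # (num_samples, 222): 158 factors + 63 market features + 1 label
print(df.index.names)                     # expected: ['datetime', 'instrument']
print(df.groupby(level=0).size().head())  # roughly 300 rows (stocks) per prediction date for CSI300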

Market information

For convenient reference, we extract and organize market information from the published data into data/csi_market_information.csv. You can check the datetime and feature formulation in the file. Note that m is shared by all stocks. The market data is generated by the following pseudo-code.

m = []  # 63 market features per day: (1 + 5 * 4) = 21 per index, times 3 indices
for S in [csi300, csi500, csi800]:
    m += [market_index(S, -1)]
    for d in [5, 10, 20, 30, 60]:
        m += [historical_market_index_mean(S, d), historical_market_index_std(S, d)]
        m += [historical_amount_mean(S, d), historical_amount_std(S, d)]
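
For a quick look at the file, the snippet below (a sketch; the exact column names are whatever the published CSV contains) loads it with pandas:

import pandas as pd

# Load the shared market features; one row per trading day is expected,
# with 63 feature columns as described above.
market = pd.read_csv('data/csi_market_information.csv')
print(market.shape)
print(market.columns[:5].tolist())
print(market.head())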

Preprocessing

The published data went through the following necessary preprocessing.

  1. Drop NA features, and perform robust daily Z-score normalization on each feature dimension.
  2. Drop NA labels and 5% of the most extreme labels, and perform daily Z-score normalization on labels.
  • Daily Z-score normalization is a common practice in Qlib to standardize labels for stock price forecasting. To mitigate the difference between a normal distribution and the ground-truth distribution, we filtered out the 5% most extreme labels during training. Note that the reported RankIC compares the output ranking with the ground truth, so its value is not affected by the label normalization.
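
The snippet below sketches these two daily (cross-sectional) normalizations under stated assumptions: it is not the repository's exact pipeline, the median/MAD form of the robust Z-score is an assumption, and so is splitting the 5% label filter evenly across both tails.

import pandas as pd

# Sketch of the daily normalizations described above, assuming a DataFrame/Series
# indexed by (datetime, instrument). Not the repository's exact preprocessing.
def robust_daily_zscore(features: pd.DataFrame) -> pd.DataFrame:
    # Robust Z-score per day: center by the median, scale by 1.4826 * MAD (assumed variant).
    def _robust(day: pd.DataFrame) -> pd.DataFrame:
        med = day.median()
        mad = (day - med).abs().median()
        return (day - med) / (1.4826 * mad + 1e-12)
    return features.groupby(level='datetime', group_keys=False).apply(_robust)

def normalize_labels(label: pd.Series, drop_ratio: float = 0.05) -> pd.Series:
    # Drop the 5% most extreme labels per day (assumed: split across both tails),
    # then apply a plain daily Z-score.
    def _per_day(day: pd.Series) -> pd.Series:
        lo, hi = day.quantile(drop_ratio / 2), day.quantile(1 - drop_ratio / 2)
        day = day[(day >= lo) & (day <= hi)]
        return (day - day.mean()) / day.std()
    return label.groupby(level='datetime', group_keys=False).apply(_per_day)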

An Alternative Qlib implementation

We are happy to hear that MASTER has been integrated into the open-sourced Qlib framework at this repo. We thank LIU, Qiaoan and ZHAO, Lifan for their contributions; please also give credit to the new repo if you use it.

As a brief introduction to the new version, with the Qlib framework, you can

  • report AR, IR, and more portfolio-based metrics,
  • modify experiment configuration with .yaml files,
  • compare with various models from the Qlib examples collection,
  • benefit from other merits of Qlib.

In the meantime, please note that

  • The new version uses a different data source published by Qlib, which covers a different timespan. The new data source is considered logically equivalent to our published data but may differ in values.
  • 🔥 [New notice] The new version uses the CSI300 & CSI500 stock universes, because Qlib does not include a CSI800 dataset. Correspondingly, the representative indices used to construct market information are different: it uses CSI100, CSI300, and CSI500, rather than CSI300, CSI500, and CSI800 as in this repo.
  • The new version does not include the 'DropExtremeLabel' operation in data preprocessing but also reports decent performance.

Cite

If you use the data or the code, please cite our work! 😄

@inproceedings{li2024master,
  title={Master: Market-guided stock transformer for stock price forecasting},
  author={Li, Tong and Liu, Zhaoyang and Shen, Yanyan and Wang, Xue and Chen, Haokun and Huang, Sen},
  booktitle={Proceedings of the AAAI Conference on Artificial Intelligence},
  volume={38},
  number={1},
  pages={162--170},
  year={2024}
}


master's Issues

About the result of RIC and RICIR

When I use the dataset and the model you offer, the RIC and RICIR I get on the test dataset are much worse than those reported in your paper. Can you explain? Thanks. My values are 'RIC': 0.034360794294788645, 'RICIR': 0.26788646356158735.

AR and IR evaluation framework

Thanks for the remarkable job!

I have a small question: the calculation of AR (Annualized Return) and IR (Information Ratio) is not covered in the current framework. I think it might run in Qlib?

Could you please add this part so the comparison can be made at the same level?

Looking forward to your reply!

Best Regards

Lookback window

Hello,

Thanks for the remarkable job!

I have downloaded the data from the provided link (https://1drv.ms/f/c/652674690cc447e6/Eu8Kxv4xxTFMtDQqTW0IU0UB8rnpjACA5twMi8BA_PfbSA?e=ooc0za), but I noticed that there is an overlap in the timeline between the train, valid, and test datasets. The train set covers 2008/1 to 2020/3/31, the valid set 2020/3/20 to 2020/6/20, and the test set 2020/6/17 to 2022/12. Is this overlap for the purpose of the lookback window?

Looking forward to your reply!

Best Regards

How to view the DataFrame format of datasets generated using qlib?

Thank you for your excellent work! I would like to ask how to view the DataFrame format of dl_train generated by Qlib, since it is a TSDataSampler. Alternatively, how can I generate and save dl_train directly without using the YAML format, similar to the dataset you provide in your code?

My main purpose is to check whether the generated dataset is correct, because the YAML workflow does not seem to allow viewing the generated DataFrame. Thank you!

Correlation factors?

Congrats on this work; it's certainly innovative.
I have a question regarding the mining of the momentary and cross-time stock correlation: which data (variables) are you mainly considering? Are you mainly looking at stock/portfolio price correlation, or are you considering stock price correlation with other market factors?

AttributeError: module 'qlib.contrib.data.dataset' has no attribute 'MASTERTSDatasetH'

Hello,

Thanks for the remarkable job!

I have downloaded SJTU's Qlib (your fork) and installed all the packages, but it still shows the following error:
File "main.py", line 51, in
dataset = init_instance_by_config(config['task']["dataset"])
File "/home/miniconda3/envs/qlib38/lib/python3.8/site-packages/qlib/utils/mod.py", line 171, in init_instance_by_config
klass, cls_kwargs = get_callable_kwargs(config, default_module=default_module)
File "/home/miniconda3/envs/qlib38/lib/python3.8/site-packages/qlib/utils/mod.py", line 103, in get_callable_kwargs
_callable = getattr(module, cls) # may raise AttributeError
AttributeError: module 'qlib.contrib.data.dataset' has no attribute 'MASTERTSDatasetH'

Can you help me with the error? Looking forward to your reply!

Best Regards

Is fillna_type="ffill+bfill" necessary in MASTERTSDatasetH?

In the Qlib implementation, the TSDataSampler returned by MASTERTSDatasetH is configured with fillna_type="ffill+bfill" (forward then backward filling). However, in pytorch_master_ts.py,

def fit(self, dataset: DatasetH, save_path=None):
    dl_train = dataset.prepare("train", col_set=["feature", "label"], data_key=DataHandlerLP.DK_L)
    dl_valid = dataset.prepare("valid", col_set=["feature", "label"], data_key=DataHandlerLP.DK_L)

does not configure

dl_train.config(fillna_type="ffill+bfill")  # process nan brought by dataloader
dl_valid.config(fillna_type="ffill+bfill")  # process nan brought by dataloader

Is fillna_type="ffill+bfill" in TSDataSampler necessary?

Why does each iteration yield data for lookback_window days?

Hello. Thank you for your outstanding work. While running your code, I have a small question: I'm curious why the data obtained from each iteration of the DataLoader is four-dimensional (1 × N × T × F). My main concern is how the time dimension T is introduced here. I noticed that the data type you provide is qlib.data.dataset.TSDataSampler, and by referring to Qlib's source code and the official PyTorch tutorial, I understand that this is a map-style dataset similar to PyTorch's. When using the DataLoader, the sampler's iterator seems to return data for only one day at a time, rather than for the lookback_window (8) days. Therefore, I am a bit confused about how each iteration can yield data for lookback_window days. I hope to receive your response, and once again, thank you!

Can you provide the detailed code about how Market information is calculated?

I just found that the value of feature.1 cannot be inferred from the column feature.

            feature                                    feature.1
            Mask($close/Ref($close,1)-1,'SH000300')    Mask(Mean($close/Ref($close,1)-1,5),'SH000300')
datetime
2008/1/2    0.74735785    1.0109634
2008/1/3    0.57677513    1.1709653
2008/1/4    0.9783356     1.2388867
2008/1/7    1.1510324     1.0415094
2008/1/8    -0.4920946    1.0515707
2008/1/9    1.3472286     1.2670099

For example, on 2008/1/8 the value of feature.1 is 1.0515707.
According to the expression Mask(Mean($close/Ref($close,1)-1,5),'SH000300'), however, I think it should simply be (0.74735785 + 0.57677513 + 0.9783356 + 1.1510324 - 0.4920946) / 5 = 0.592281276.

Do I misunderstand the expression for feature.1? Can anyone explain this?
Also, can you provide the detailed code for how the market information is calculated?

Many thanks!

Valid loss does not decrease

Hello! Using the data and source code you provided, the train loss keeps decreasing from the start of training, but the valid loss keeps oscillating and never converges. Did you encounter this situation in your work? What do you think causes it?

Inconsistent master dataset handling

Thank you for your great work!

Recently, when I wanted to use Qlib to build a dataset in MASTER format for the latest period, I found that the label handling integrated into Qlib is inconsistent with the method the author provided in an issue: Qlib uses CSRankNorm for the label, while the author uses CSZScoreNorm.

May I ask which method is correct?

Best wishes!

Handler provided by author

Qlib Handler

Where does the data come from?

Hi,

Great work. I want to know what the data source is. Can you share the data preparation scripts and the data source?
Have you tested 30-minute data?

Bug in code?

In baseline.py, line 167, shouldn't it be 'RICIR': np.mean(ric)/np.std(ric)?
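
For reference, the sketch below gives the standard definitions of these metrics (daily Pearson IC, daily Spearman RIC, and their information ratios); it assumes predictions and labels are pandas Series indexed by (datetime, instrument), and it is not the repository's baseline.py.

import pandas as pd

# Standard-definition sketch of IC / ICIR / RIC / RICIR, assuming pred and label
# are pandas Series sharing a (datetime, instrument) MultiIndex.
def ic_metrics(pred: pd.Series, label: pd.Series) -> dict:
    df = pd.DataFrame({'pred': pred, 'label': label})
    by_day = df.groupby(level='datetime')
    ic = by_day.apply(lambda d: d['pred'].corr(d['label']))                      # daily Pearson correlation
    ric = by_day.apply(lambda d: d['pred'].corr(d['label'], method='spearman'))  # daily rank correlation
    return {
        'IC': ic.mean(), 'ICIR': ic.mean() / ic.std(),
        'RIC': ric.mean(), 'RICIR': ric.mean() / ric.std(),
    }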

Dataset code

Thank you for your great work!

I would like to ask whether there is any source code for generating the datasets dl_train, dl_valid, and dl_test.

Have a nice day!
