Machine learning for transportation data imputation and prediction.

Home Page: https://transdim.github.io/

License: MIT License


transdim

MIT License | Python 3.7

Made by Xinyu Chen • 🌐 https://twitter.com/chenxy346


Machine learning models have driven important developments in spatiotemporal data modeling, such as forecasting the near-future traffic states of road networks. But what happens when these models are built on the incomplete data commonly collected from real-world systems (e.g., transportation systems)?

About this Project

In the transdim (transportation data imputation) project, we develop machine learning models to help address some of the toughest challenges of spatiotemporal data modeling, from missing data imputation to time series prediction. The strategic aim of this project is to create accurate and efficient solutions for spatiotemporal traffic data imputation and prediction tasks.

In a hurry? Please check out our contents as follows.

Tasks and Challenges

Missing data are there, whether we like them or not. The really interesting question is how to deal with incomplete data.

Figure 1: Two classical missing patterns in a spatiotemporal setting.

We create three missing data mechanisms on real-world data.

  • Missing data imputation 🔥

    • Random missing (RM): Each sensor loses observations completely at random. (★★★)
    • Non-random missing (NM): Each sensor loses observations for several consecutive days. (★★★★)
    • Blockout missing (BM): All sensors lose their observations at several consecutive time points. (★★★★)


Figure 2: Tensor completion framework for spatiotemporal missing traffic data imputation.

  • Spatiotemporal prediction 🔥
    • Forecasting without missing values. (★★★)
    • Forecasting with incomplete observations. (★★★★★)

Figure 3: Illustration of our proposed Low-Rank Autoregressive Tensor Completion (LATC) imputer/predictor with a prediction window τ (green nodes: observed values; white nodes: missing values; red nodes/panel: prediction; blue panel: training data to construct the tensor).

Implementation

Open data

In this repository, we have adapted several publicly available data sets for our experiments. The original links for these data sets are summarized as follows.

For example, if you want to view or use these data sets, please download them to the ../datasets/ folder in advance, and then run the following code in your Python console:

import scipy.io

tensor = scipy.io.loadmat('../datasets/Guangzhou-data-set/tensor.mat')
tensor = tensor['tensor']

In particular, if you are interested in large-scale traffic data, we recommend PeMS-4W/8W/12W (see Large-scale traffic speed data sets in California, USA) and UTD19. For the PeMS data, you can download the files from Zenodo and place them in the datasets folder (data path example: ../datasets/California-data-set/pems-4w.csv). Then you can use Pandas to open the data:

import pandas as pd

data = pd.read_csv('../datasets/California-data-set/pems-4w.csv', header = None)
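
If you need the data in tensor form (e.g., for the tensor models below), you can reshape the loaded matrix. A minimal sketch, assuming the CSV is organized as rows = road sensors and columns = 5-minute time intervals in chronological order (288 intervals per day):

import numpy as np

mat = data.values                                    # speed matrix: sensors x (days x 288), assumed layout
num_sensors, num_intervals = mat.shape
num_days = num_intervals // 288                      # assumes 288 five-minute intervals per day
tensor = mat.reshape([num_sensors, num_days, 288])   # (sensor, day, time of day) tensor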

For model evaluation, we mask certain entries of the "observed" data as missing values and then perform imputation for these "missing" values.
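
As a concrete illustration (not the exact masking code used in our notebooks), random missing (RM) and non-random missing (NM) masks can be generated on a (sensor, day, time of day) tensor roughly as follows; the blockout missing (BM) scenario is produced in the same spirit by zeroing out all sensors over a few consecutive time points:

import numpy as np

def rm_mask(dense_tensor, missing_rate):
    # Random missing: drop each entry independently with probability missing_rate.
    binary_tensor = np.random.rand(*dense_tensor.shape) > missing_rate
    return dense_tensor * binary_tensor

def nm_mask(dense_tensor, missing_rate):
    # Non-random missing: drop whole (sensor, day) fibers with probability missing_rate.
    binary_matrix = np.random.rand(dense_tensor.shape[0], dense_tensor.shape[1]) > missing_rate
    return dense_tensor * binary_matrix[:, :, None]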

Model implementation

Old version, updated in 2019

In our experiments, we have implemented some machine learning models mainly with NumPy and written the Python code in Jupyter notebooks. So, if you want to evaluate these models, please download and run the notebooks directly (prerequisite: download the data sets in advance).

  • Our models

| Task | Jupyter Notebook | Gdata | Bdata | Hdata | Sdata | Ndata |
|---|---|---|---|---|---|---|
| Missing Data Imputation | BTMF | ✅ | ✅ | ✅ | ✅ | 🔶 |
| Missing Data Imputation | BGCP | ✅ | ✅ | ✅ | ✅ | ✅ |
| Missing Data Imputation | LRTC-TNN | ✅ | ✅ | ✅ | ✅ | 🔶 |
| Missing Data Imputation | BTTF | 🔶 | 🔶 | 🔶 | 🔶 | ✅ |
| Single-Step Prediction | BTMF | ✅ | ✅ | ✅ | ✅ | 🔶 |
| Single-Step Prediction | BTTF | 🔶 | 🔶 | 🔶 | 🔶 | ✅ |
| Multi-Step Prediction | BTMF | ✅ | ✅ | ✅ | ✅ | 🔶 |
| Multi-Step Prediction | BTTF | 🔶 | 🔶 | 🔶 | 🔶 | ✅ |
  • Baselines

| Task | Jupyter Notebook | Gdata | Bdata | Hdata | Sdata | Ndata |
|---|---|---|---|---|---|---|
| Missing Data Imputation | BayesTRMF | ✅ | ✅ | ✅ | ✅ | 🔶 |
| Missing Data Imputation | TRMF | ✅ | ✅ | ✅ | ✅ | 🔶 |
| Missing Data Imputation | BPMF | ✅ | ✅ | ✅ | ✅ | 🔶 |
| Missing Data Imputation | HaLRTC | ✅ | ✅ | ✅ | ✅ | 🔶 |
| Missing Data Imputation | TF-ALS | ✅ | ✅ | ✅ | ✅ | ✅ |
| Missing Data Imputation | BayesTRTF | 🔶 | 🔶 | 🔶 | 🔶 | ✅ |
| Missing Data Imputation | BPTF | 🔶 | 🔶 | 🔶 | 🔶 | ✅ |
| Single-Step Prediction | BayesTRMF | ✅ | ✅ | ✅ | ✅ | 🔶 |
| Single-Step Prediction | TRMF | ✅ | ✅ | ✅ | ✅ | 🔶 |
| Single-Step Prediction | BayesTRTF | 🔶 | 🔶 | 🔶 | 🔶 | ✅ |
| Single-Step Prediction | TRTF | 🔶 | 🔶 | 🔶 | 🔶 | ✅ |
| Multi-Step Prediction | BayesTRMF | ✅ | ✅ | ✅ | ✅ | 🔶 |
| Multi-Step Prediction | TRMF | ✅ | ✅ | ✅ | ✅ | 🔶 |
| Multi-Step Prediction | BayesTRTF | 🔶 | 🔶 | 🔶 | 🔶 | ✅ |
| Multi-Step Prediction | TRTF | 🔶 | 🔶 | 🔶 | 🔶 | ✅ |
  • ✅ — Covered
  • 🔶 — Not covered
  • 🚧 — Under development

Model implementation

New version, updated in 2020

In the following implementation, we have improved the Python code (in Jupyter notebooks) in terms of both readability and efficiency.

Our proposed models are highlighted in bold fonts.

  • imputer (imputation models)

| Notebook | Guangzhou | Birmingham | Hangzhou | Seattle | London | NYC | Pacific |
|---|---|---|---|---|---|---|---|
| BPMF | ✅ | ✅ | ✅ | ✅ | ✅ | 🔶 | 🔶 |
| TRMF | ✅ | 🔶 | ✅ | ✅ | ✅ | 🔶 | 🔶 |
| BTRMF | ✅ | 🔶 | ✅ | ✅ | ✅ | 🔶 | 🔶 |
| BTMF | ✅ | ✅ | ✅ | ✅ | ✅ | 🔶 | 🔶 |
| BGCP | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
| BATF | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
| BTTF | 🔶 | 🔶 | 🔶 | 🔶 | 🔶 | ✅ | ✅ |
| HaLRTC | ✅ | 🔶 | ✅ | ✅ | ✅ | ✅ | ✅ |
  • predictor (prediction models)

| Notebook | Guangzhou | Birmingham | Hangzhou | Seattle | London | NYC | Pacific |
|---|---|---|---|---|---|---|---|
| TRMF | ✅ | 🔶 | ✅ | ✅ | ✅ | 🔶 | 🔶 |
| BTRMF | ✅ | 🔶 | ✅ | ✅ | ✅ | 🔶 | 🔶 |
| BTRTF | 🔶 | 🔶 | 🔶 | 🔶 | 🔶 | ✅ | ✅ |
| BTMF | ✅ | 🔶 | ✅ | ✅ | ✅ | ✅ | ✅ |
| BTTF | 🔶 | 🔶 | 🔶 | 🔶 | 🔶 | ✅ | ✅ |

For the implementation of these models, we use both dense_mat and sparse_mat (or dense_tensor and sparse_tensor) as inputs. This is not strictly necessary: if you do not need to see the imputation/prediction performance during the iterative process, you can remove dense_mat (or dense_tensor) from the inputs of these algorithms.

Imputation/Prediction performance

  • Imputation example (on Guangzhou data)

(a) Time series of actual and estimated speed within two weeks from August 1 to 14.

(b) Time series of actual and estimated speed within two weeks from September 12 to 25.

These panels show the imputation performance of BGCP (CP rank r = 15 and missing rate α = 30%) under the fiber-missing scenario with a third-order tensor representation, where the estimated result of road segment #1 is selected as an example. In both panels, red rectangles indicate fiber missing (i.e., speed observations lost for a whole day).

  • Prediction example


Quick Start

This is an imputation example of Low-Rank Tensor Completion with Truncated Nuclear Norm minimization (LRTC-TNN). Notably, unlike the complex equations in our paper, our Python implementation is straightforward to work with.

  • First, import some necessary packages:
import numpy as np
from numpy.linalg import inv as inv
  • Define the operators of tensor unfolding (ten2mat) and matrix folding (mat2ten) using Numpy:
def ten2mat(tensor, mode):
    return np.reshape(np.moveaxis(tensor, mode, 0), (tensor.shape[mode], -1), order = 'F')
def mat2ten(mat, tensor_size, mode):
    index = list()
    index.append(mode)
    for i in range(tensor_size.shape[0]):
        if i != mode:
            index.append(i)
    return np.moveaxis(np.reshape(mat, list(tensor_size[index]), order = 'F'), 0, mode)
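As a quick sanity check (illustrative only), folding an unfolded tensor should recover the original array:

X = np.random.rand(3, 4, 5)
mat = ten2mat(X, 1)                            # mode-1 unfolding, shape (4, 15)
X_rec = mat2ten(mat, np.array([3, 4, 5]), 1)   # fold back to shape (3, 4, 5)
print(np.allclose(X, X_rec))                   # expected output: True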
  • Define Singular Value Thresholding (SVT) for Truncated Nuclear Norm (TNN) minimization:
def svt_tnn(mat, tau, theta):
    [m, n] = mat.shape
    if 2 * m < n:
        # Economic SVD via the (smaller) Gram matrix mat @ mat.T when the matrix is very wide.
        u, s, v = np.linalg.svd(mat @ mat.T, full_matrices = 0)
        s = np.sqrt(s)
        idx = np.sum(s > tau)
        mid = np.zeros(idx)
        mid[:theta] = 1
        mid[theta:idx] = (s[theta:idx] - tau) / s[theta:idx]
        return (u[:,:idx] @ np.diag(mid)) @ (u[:,:idx].T @ mat)
    elif m > 2 * n:
        return svt_tnn(mat.T, tau, theta).T
    u, s, v = np.linalg.svd(mat, full_matrices = 0)
    idx = np.sum(s > tau)
    vec = s[:idx].copy()
    # Keep the largest theta singular values intact and soft-threshold the remaining ones.
    vec[theta:idx] = s[theta:idx] - tau
    return u[:,:idx] @ np.diag(vec) @ v[:idx,:]
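As a small sanity check (illustrative only, not from the original notebooks), svt_tnn should agree with a direct computation from a full SVD in which the largest theta singular values are kept and the rest are soft-thresholded by tau:

np.random.seed(0)
mat = np.random.rand(50, 40)
tau, theta = 0.5, 5

u, s, v = np.linalg.svd(mat, full_matrices = 0)
s_new = s.copy()
s_new[theta:] = np.maximum(s[theta:] - tau, 0)   # soft-threshold all but the top-theta singular values
print(np.allclose(svt_tnn(mat, tau, theta), u @ np.diag(s_new) @ v))   # expected output: True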
  • Define performance metrics (i.e., RMSE, MAPE):
def compute_rmse(var, var_hat):
    return np.sqrt(np.sum((var - var_hat) ** 2) / var.shape[0])

def compute_mape(var, var_hat):
    return np.sum(np.abs(var - var_hat) / var) / var.shape[0]
  • Define LRTC-TNN:
def LRTC(dense_tensor, sparse_tensor, alpha, rho, theta, epsilon, maxiter):
    """Low-Rank Tenor Completion with Truncated Nuclear Norm, LRTC-TNN."""
    
    dim = np.array(sparse_tensor.shape)
    pos_missing = np.where(sparse_tensor == 0)
    pos_test = np.where((dense_tensor != 0) & (sparse_tensor == 0))
    dense_test = dense_tensor[pos_test]
    del dense_tensor
    
    X = np.zeros(np.insert(dim, 0, len(dim))) # \boldsymbol{\mathcal{X}}
    T = np.zeros(np.insert(dim, 0, len(dim))) # \boldsymbol{\mathcal{T}}
    Z = sparse_tensor.copy()
    last_tensor = sparse_tensor.copy()
    snorm = np.sqrt(np.sum(sparse_tensor ** 2))
    it = 0
    while True:
        rho = min(rho * 1.05, 1e5)
        for k in range(len(dim)):
            X[k] = mat2ten(svt_tnn(ten2mat(Z - T[k] / rho, k), alpha[k] / rho, int(np.ceil(theta * dim[k]))), dim, k)
        Z[pos_missing] = np.mean(X + T / rho, axis = 0)[pos_missing]
        T = T + rho * (X - np.broadcast_to(Z, np.insert(dim, 0, len(dim))))
        tensor_hat = np.einsum('k, kmnt -> mnt', alpha, X)
        tol = np.sqrt(np.sum((tensor_hat - last_tensor) ** 2)) / snorm
        last_tensor = tensor_hat.copy()
        it += 1
        if (it + 1) % 50 == 0:
            print('Iter: {}'.format(it + 1))
            print('MAPE: {:.6}'.format(compute_mape(dense_test, tensor_hat[pos_test])))
            print('RMSE: {:.6}'.format(compute_rmse(dense_test, tensor_hat[pos_test])))
            print()
        if (tol < epsilon) or (it >= maxiter):
            break

    print('Imputation MAPE: {:.6}'.format(compute_mape(dense_test, tensor_hat[pos_test])))
    print('Imputation RMSE: {:.6}'.format(compute_rmse(dense_test, tensor_hat[pos_test])))
    print()
    
    return tensor_hat
  • Let us try it on the Guangzhou urban traffic speed data set:
import scipy.io

tensor = scipy.io.loadmat('../datasets/Guangzhou-data-set/tensor.mat')
dense_tensor = tensor['tensor']
random_tensor = scipy.io.loadmat('../datasets/Guangzhou-data-set/random_tensor.mat')
random_tensor = random_tensor['random_tensor']

missing_rate = 0.2

### Random missing (RM)
sparse_tensor = dense_tensor * np.round(random_tensor + 0.5 - missing_rate)
  • Run the imputation experiment:
import time
start = time.time()
alpha = np.ones(3) / 3
rho = 1e-5
theta = 0.30
epsilon = 1e-4
maxiter = 200
tensor_hat = LRTC(dense_tensor, sparse_tensor, alpha, rho, theta, epsilon, maxiter)
end = time.time()
print('Running time: %d seconds'%(end - start))

This example is from ../experiments/Imputation-LRTC-TNN.ipynb; you can check out this Jupyter notebook for advanced usage.

Toy Examples

Our Publications

  • Xinyu Chen, Mengying Lei, Nicolas Saunier, Lijun Sun (2021). Low-rank autoregressive tensor completion for spatiotemporal traffic data imputation. arXiv: 2104.14936. [preprint] [data & Python code]

  • Xinyu Chen, Lijun Sun (2021). Bayesian temporal factorization for multidimensional time series prediction. IEEE Transactions on Pattern Analysis and Machine Intelligence. (Early access) [preprint] [doi] [slides] [data & Python code]

  • Xinyu Chen, Yixian Chen, Nicolas Saunier, Lijun Sun (2020). Scalable low-rank tensor learning for spatiotemporal traffic data imputation. arXiv: 2008.03194. [preprint] [data] [Python code]

  • Xinyu Chen, Lijun Sun (2020). Low-rank autoregressive tensor completion for multivariate time series forecasting. arXiv: 2006.10436. [preprint] [data & Python code]

  • Xinyu Chen, Jinming Yang, Lijun Sun (2020). A nonconvex low-rank tensor completion model for spatiotemporal traffic data imputation. Transportation Research Part C: Emerging Technologies, 117: 102673. [preprint] [doi] [data & Python code]

  • Xinyu Chen, Zhaocheng He, Yixian Chen, Yuhuan Lu, Jiawei Wang (2019). Missing traffic data imputation and pattern discovery with a Bayesian augmented tensor factorization model. Transportation Research Part C: Emerging Technologies, 104: 66-77. [preprint] [doi] [slides] [data] [Matlab code] [Python code]

  • Xinyu Chen, Zhaocheng He, Lijun Sun (2019). A Bayesian tensor decomposition approach for spatiotemporal traffic data imputation. Transportation Research Part C: Emerging Technologies, 98: 73-84. [preprint] [doi] [data] [Matlab code] [Python code]

  • Xinyu Chen, Zhaocheng He, Jiawei Wang (2018). Spatial-temporal traffic speed patterns discovery and incomplete data recovery via SVD-combined tensor decomposition. Transportation Research Part C: Emerging Technologies, 86: 59-77. [doi] [data]

    This project grew out of the above papers; please cite them if they help your research.

Collaborators

  • Xinyu Chen 💻
  • Jinming Yang 💻
  • Yixian Chen 💻
  • Mengying Lei 💻
  • Lijun Sun 💻
  • Tianyang Han 💻

Principal Investigators (PI)

  • Lijun Sun 💻
  • Nicolas Saunier 💻

See the list of contributors who participated in this project.

transdim is still under development. More machine learning models and technical features will be added, and we always welcome contributions to help make transdim better. If you have any suggestions about this project or want to collaborate with us, please feel free to contact Xinyu Chen (email: [email protected]) and send your suggestion/statement. We would like to thank everyone who has helped this project in any way.

Recommended email subjects:

  • Suggestion on transdim from [+ your name]
  • Collaboration statement on transdim from [+ your name]

Acknowledgements

This research is supported by the Institute for Data Valorization (IVADO).

License

This work is released under the MIT license.
