GithubHelp home page GithubHelp logo

ycq091044 / mtc Goto Github PK

View Code? Open in Web Editor NEW
5.0 2.0 2.0 34.58 MB

KDD2021: Code for MTC algorithm. We use coarse granular data and partially observed data to (low-rank) recover the fine granular data.

Jupyter Notebook 71.06% Python 28.94%
multiresolution-analysis tensor-completion tensor-decomposition tensor-factorization

mtc's Introduction

Code Scripts and Data for KDD'21 paper: MTC

Cite

contact Chaoqi Yang ([email protected], [email protected]) for any question.

@inproceedings{yang2021mtc,
    title = {MTC: Multiresolution Tensor Completion from Partial and Coarse Observations},
    author = {Yang, Chaoqi and Singh, Navjot and Xiao, Cao and Qian, Cheng and Solomonik, Edgar and Sun, Jimeng},
    booktitle = {Proceedings of the 27th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD) 2021},
    year = {2021}
}

We provide the code for (1) tensor data processing; (2) baselines; (3) our models.

0. Quick start

Run our model MTC on the synthetic data

cd src_synthetic_numpy; python test_X_C1_C2_unknown.py --partial 0.05 --jacobi 1 --multi 1

Run our model MTC on a subset of GCSS data

cd src_GCSS_torch; python run.py --model MTC --partial 0.05 --jacobi 1 --multi 1

1. Folder Tree

  • ./data_synthetic
    • the generated synthetic data (.pkl format): X, C1, C2, P1, P2
    • the generating script: synthetic.ipynb
  • ./src_synthetic_numpy (numpy implementation for synthetic data)
    • codes with only partial observation: test_only_T.py
    • codes with X, C1, C2 and unknown P1, P2: test_X_C1_C2_unknown.py
    • codes with X, C1, C2 and known P1, P2: test_X_C1_C2_known.py
  • ./data_GCSS
  • ./src_GCSS_numpy (numpy implementation for GCSS data)
    • MTC codes with X, O1, O2 and unknown P1, P2: X_O1_O2_unknwon_P1_unknown_P2.py
    • Baseline codes with O1, O2 and unknown P1, P2: OracleCPD.py, BGD.py, PREMA.py, CMTF.py
    • MTC codes with X, O1, O2 and unknown P1, known P2: X_O1_O2_unknwon_P1_known_P2.py
    • Codes for tensor initialization comparison: initialization.py
  • ./src_GCSS_torch (torch implementation for GCSS data, GPU support)
    • the interfact to call completion models: run.py
    • different model functions: model.py

2. Python package dependency

pip install scipy==1.5.0
pip install numpy==1.19.1
pip install pandas==1.1.2
pip install torch==1.7.0

The dependency packages are pretty common. If any missing, please install that.

3. Instruction for synthetic data

3.1 Generate the synthetic data

  • The following is the key scripts to generate the synthetic data by ./src_synthetic/synthetic.ipynb.

    import numpy as np
    np.random.seed(400)
    # generate factors
    U = np.random.random((125, 10))
    V = np.random.random((125, 10))
    W = np.random.random((125, 10))
    U = np.sort(U, axis=0) # make it smooth, so as to be continuous mode
    W = np.sort(W, axis=0) # make it smooth, so as to be continuous mode
    
    # tensor X
    T = np.einsum('ir,jr,kr->ijk',U,V,W,optimize=True)
    
    # mapping on the first mode
    P1 = np.random.random((12, 125))
    P1 = np.argsort(P1, axis=0) <= 0.0
    
    # mapping on the first mode
    P2 = np.random.random((12, 125))
    P2 = np.argsort(P2, axis=0) <= 0.0
    
    # C1, C2
    C1 = np.einsum('ijk,ri->rjk',T,P1,optimize=True)
    C2 = np.einsum('ijk,rj->irk',T,P2,optimize=True)
  • We also provide the data (used in the paper): ./data_synthetic.

3.2 Run and visualize the experiments

The following is the argument options:

optional arguments:
  -h, --help         show this help message and exit
  --partial PARTIAL  the amount of partial observation
  --jacobi JACOBI    Jacobi (1) or not (0)
  --multi MULTI      multiresolution (1) or not (0)

To get the results with only the partial observation, use

python test_only_X.py --partial [0.05 or other] --jacobi [0/1] --multi [0/1]

To get the results with also coarse information (C1, C2), but not with the aggregation (P1, P2), use

python test_X_C1_C2_unknown.py --partial [0.05 or other] --jacobi [0/1] --multi [0/1]

To get the results with also coarse information (C1, C2) and the aggregation (P1, P2), use

python test_X_C1_C2_known.py --partial [0.05 or other] --jacobi [0/1] --multi [0/1]

In fact, we also provide all the running result files in ./src_synthetic_numpy/synthetic_result/.., and the results could be fed into the provided jupyter notebook for visualization.

4. Instruction for GCSS data

4.1 data preprocessing

  • We already provide a subset of GCSS data in ./data_GCSS. To get the whole tensor, following these steps:

    • First, download the GCSS data from https://pair-code.github.io/covid19_symptom_dataset/?country=GB
    • Second, run the data process script (Google Tensor Data.ipynb) to generate X, C1, C3, P1, P3 (Since in this dataset, the aggregation is on the first and the third mode, the C1, C3 are coarse tensors, and P1, P3 are the according aggregation matrices), or using the following code equivalently,
    import pandas as pd
    import numpy as np
    
    data = pd.read_csv('./2020_country_daily_2020_US_daily_symptoms_dataset.csv') # change the data path accordingly
    
    # all zipcode
    zipcode = list(filter(lambda x: ('US' in x) and (x.count('_') == 2), data.key.unique()))
    # the time
    time = data.date.unique().tolist()
    # disease list
    disease = data.columns.tolist()[2:]
    
    # tensor placeholder
    T = np.zeros((len(zipcode), len(disease), len(time)))
    
    # process per zipcode
    for i, single_zip in enumerate(zipcode):
        print (i, single_zip)
        tmp = data[data.key == single_zip]
        re_time = [time.index(i) for i in tmp.date.tolist()]
        T[i, :, :][:, re_time] = tmp.values[:, 2:].T
    
    # change nan to zero
    T = np.nan_to_num(T)
    
    # zipcode -> state mapping
    state = np.unique([item.split('_')[1] for item in zipcode]).tolist()
    P1 = np.zeros((len(state), len(zipcode)))
    
    for i, z in enumerate(zipcode):
        for j, s in enumerate(state):
            if s in z:
                P1[j, i] = 1
                
    # date -> week mapping
    P3 = np.zeros(((len(time) - 1) // 7 + 1, len(time)))
    for i, t in enumerate(time):
        P3[i // 7, i] = 1
    
    O1 = np.einsum('ijk,ri->rjk',T,P1,optimize=True)
    O3 = np.einsum('ijk,rk->ijr',T,P3,optimize=True)
    
    import pickle 
    # dump
    pickle.dump(O1, open('O1_google.pkl', 'wb'))
    pickle.dump(O3, open('O2_google.pkl', 'wb'))
    pickle.dump(T, open('X_google.pkl', 'wb'))
    pickle.dump(P1, open('P1_google.pkl', 'wb'))
    pickle.dump(P3, open('P2_google.pkl', 'wb'))
  • ATTENTION! In the code, the variable names might be a little bit different, for example, we use variable name O1 to denote the tensor aggregated on the first mode. Take care, not get confused.

4.2 Use torch implementation

One command for all:

python run.py --model [model name] --partial [0.05 or other] --jacobi [0/1] --multi [0/1]

The [model name] could be (1) OracleCPD; (2) CPC-ALS; (3) BGD; (4) CMTF; (5) PREMA; (6) MTC; (7) MTC-P2. Here, MTC is our proposed model without knowning P1,P2 while MTC-P2 knows P2 but not P1. Also, the last two arguments are only for MTC and MTC-P2.

4.3 Use numpy implementation

Oracle CPD model: run standard CP decomposition model on the origianl tensor

python OracleCPD.py

CPC-ALS model: only use the partial observation, treat the problem as a tensor completion problem

python CPC_ALS.py --partial [0.05 or other]

For BGD, CMTF, PREMA

python BGD.py --partial [0.05 or other]
python CMTF.py --partial [0.05 or other]
python PREMA.py --partial [0.05 or other]

For our model variant: MTC_{both-}, MTC_{multi-}, MTC_{jacobi-}, MTC:

python X_C1_C3_unknwon_P1_unknown_P3.py --partial [0.05 or other] --jacobi [0/1] --multi [0/1]

mtc's People

Contributors

ycq091044 avatar

Stargazers

 avatar  avatar  avatar

Watchers

 avatar  avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.