
This project forked from octoberchang/x-transformer


X-Transformer: Taming Pretrained Transformers for eXtreme Multi-label Text Classification

License: BSD 3-Clause "New" or "Revised" License


Taming Pretrained Transformers for XMC problems

This is the README for the experimental code of the following paper:

Taming Pretrained Transformers for eXtreme Multi-label Text Classification

Wei-Cheng Chang, Hsiang-Fu Yu, Kai Zhong, Yiming Yang, Inderjit Dhillon

KDD 2020

Installation

Dependencies via Conda Environment

> conda env create -f environment.yml
> source activate pt1.2_xmlc_transformer
> (pt1.2_xmlc_transformer) pip install -e .
> (pt1.2_xmlc_transformer) python setup.py install --force

Notice: the following examples are executed under the (pt1.2_xmlc_transformer) conda virtual environment.

Reproduce Evaluation Results in the Paper

We demonstrate how to reproduce the evaluation results in our paper by downloading the raw dataset and pretrained models.

Download Dataset (Eurlex-4K, Wiki10-31K, AmazonCat-13K, Wiki-500K)

Change directory into the ./datasets folder, then download and unzip each dataset:

cd ./datasets
bash download-data.sh Eurlex-4K
bash download-data.sh Wiki10-31K
bash download-data.sh AmazonCat-13K
bash download-data.sh Wiki-500K
cd ../

Each dataset contains the following files

  • label_map.txt: each line is the raw text of the label
  • train_raw_texts.txt, test_raw_texts.txt: each line is the raw text of the instance
  • X.trn.npz, X.tst.npz: instance's embedding matrix (either sparse TF-IDF or fine-tuned dense embedding)
  • Y.trn.npz, Y.tst.npz: instance-to-label assignment matrix

Download Pretrained Models (processed data, Indexing codes, fine-tuned Transformer models)

Change directory into the ./pretrained_models folder, then download and unzip the models for each dataset:

cd ./pretrained_models
bash download-models.sh Eurlex-4K
bash download-models.sh Wiki10-31K
bash download-models.sh AmazonCat-13K
bash download-models.sh Wiki-500K
cd ../

Each folder has the following structure

  • proc_data: a sub-folder containing: X.{trn|tst}.{model}.128.pkl, C.{label-emb}.npz, L.{label-emb}.npz
  • pifa-tfidf-s0: a sub-folder containing indexer and matcher
  • pifa-neural-s0: a sub-folder containing indexer and matcher
  • text-emb-s0: a sub-folder containing indexer and matcher

Evaluate Linear Models

Given the provided indexing codes (label-to-cluster assignments), train/predict linear models, and evaluate with Precision/Recall@k:

bash eval_linear.sh ${DATASET} ${VERSION}
  • DATASET: the dataset name such as Eurlex-4K, Wiki10-31K, AmazonCat-13K, or Wiki-500K.
  • VERSION: v0 = sparse TF-IDF features; v1 = sparse TF-IDF features concatenated with dense fine-tuned XLNet embeddings.

The evaluation results should be located at ./results_linear/${DATASET}.${VERSION}.txt
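For reference, Precision@k and Recall@k can be sketched as below. This is a minimal illustration using the standard definitions on small dense arrays, not the repository's evaluation code; the function name precision_recall_at_k and the toy data are assumptions made for the example.

```python
import numpy as np

def precision_recall_at_k(scores, Y_true, k):
    """Mean Precision@k and Recall@k over all instances.

    scores: dense (N, L) array of predicted label scores (higher is better).
    Y_true: dense (N, L) 0/1 array of ground-truth label assignments.
    """
    # Indices of the k highest-scoring labels for each instance.
    topk = np.argsort(-scores, axis=1)[:, :k]
    # Number of relevant labels among the top-k of each instance.
    hits = np.take_along_axis(Y_true, topk, axis=1).sum(axis=1)
    precision = hits / k
    # Guard against instances without any positive label.
    n_pos = np.maximum(Y_true.sum(axis=1), 1)
    recall = hits / n_pos
    return precision.mean(), recall.mean()

# Toy data: 2 instances, 4 labels (hypothetical values).
scores = np.array([[0.9, 0.1, 0.8, 0.2],
                   [0.3, 0.7, 0.2, 0.6]])
Y_true = np.array([[1, 0, 0, 1],
                   [0, 1, 0, 1]])
p2, r2 = precision_recall_at_k(scores, Y_true, k=2)
print(f"P@2={p2:.2f} R@2={r2:.2f}")
```

In the toy data the first instance gets one of its two labels into the top 2 and the second gets both, so both metrics average to 0.75.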

Evaluate Fine-tuned X-Transformer Models

Given the provided indexing codes (label-to-cluster assignments) and the fine-tuned Transformer models, train/predict ranker of the X-Transformer framework, and evaluate with Precision/Recall@k:

bash eval_transformer.sh ${DATASET}
  • DATASET: the dataset name such as Eurlex-4K, Wiki10-31K, AmazonCat-13K, or Wiki-500K.

The evaluation results should be located at ./results_transformer/${DATASET}.final.txt

Running X-Transformer on customized datasets

The X-Transformer framework consists of 9 configurations (3 label embeddings times 3 model types). For simplicity, we show 1 of the 9 here, using LABEL_EMB=pifa-tfidf and MODEL_TYPE=bert.
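The 9 configurations can be enumerated as in the short sketch below. The option names are the LABEL_EMB and MODEL_TYPE values mentioned elsewhere in this README; the loop itself is only an illustration and is not part of the repository.

```python
from itertools import product

# Values taken from the LABEL_EMB and MODEL_TYPE options in this README.
LABEL_EMBS = ["pifa-tfidf", "pifa-neural", "text-emb"]
MODEL_TYPES = ["bert", "roberta", "xlnet"]

# Cartesian product: every label embedding paired with every model type.
configs = list(product(LABEL_EMBS, MODEL_TYPES))
for label_emb, model_type in configs:
    print(f"LABEL_EMB={label_emb} MODEL_TYPE={model_type}")
```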

We will use Eurlex-4K as an example. In the ./datasets/Eurlex-4K folder, we assume the following files are provided:

  • X.trn.npz: the instance TF-IDF feature matrix for the train set. The data type is scipy.sparse.csr_matrix of size (N_trn, D_tfidf), where N_trn is the number of train instances and D_tfidf is the number of features.
  • X.tst.npz: the instance TF-IDF feature matrix for the test set. The data type is scipy.sparse.csr_matrix of size (N_tst, D_tfidf), where N_tst is the number of test instances and D_tfidf is the number of features.
  • Y.trn.npz: the instance-to-label matrix for the train set. The data type is scipy.sparse.csr_matrix of size (N_trn, L), where N_trn is the number of train instances and L is the number of labels.
  • Y.tst.npz: the instance-to-label matrix for the test set. The data type is scipy.sparse.csr_matrix of size (N_tst, L), where N_tst is the number of test instances and L is the number of labels.
  • train_raw_texts.txt: The raw text of the train set.
  • test_raw_texts.txt: The raw text of the test set.
  • label_map.txt: the label's text description.
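Before running the pipeline on a custom dataset, it may help to confirm that the four matrices agree on N_trn, N_tst, D_tfidf, and L. The sketch below is a hypothetical helper (check_xmc_shapes is not part of the repository) and uses small dense NumPy arrays as stand-ins; the real .npz files would presumably be read with scipy.sparse.load_npz, and the same check applies since it only looks at .shape.

```python
import numpy as np

def check_xmc_shapes(X_trn, X_tst, Y_trn, Y_tst):
    """Return (N_trn, N_tst, D_tfidf, L), or raise ValueError on a mismatch.

    Works on any objects exposing a 2-D .shape (dense or sparse matrices).
    """
    n_trn, d_trn = X_trn.shape
    n_tst, d_tst = X_tst.shape
    if d_trn != d_tst:
        raise ValueError(f"feature dims differ: {d_trn} vs {d_tst}")
    if Y_trn.shape[0] != n_trn:
        raise ValueError("X.trn and Y.trn disagree on N_trn")
    if Y_tst.shape[0] != n_tst:
        raise ValueError("X.tst and Y.tst disagree on N_tst")
    if Y_trn.shape[1] != Y_tst.shape[1]:
        raise ValueError("Y.trn and Y.tst disagree on L")
    return n_trn, n_tst, d_trn, Y_trn.shape[1]

# Toy stand-ins: 5 train and 2 test instances, 7 features, 3 labels.
X_trn, X_tst = np.zeros((5, 7)), np.zeros((2, 7))
Y_trn, Y_tst = np.zeros((5, 3)), np.zeros((2, 3))
dims = check_xmc_shapes(X_trn, X_tst, Y_trn, Y_tst)
print(dims)
```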

Given those input files, the pipeline can be divided into three stages: Indexer, Matcher, and Ranker.

Indexer

In stage 1, we will do the following

  • (1) construct label embedding
  • (2) perform hierarchical 2-means on the label embeddings and output the label-to-cluster assignment matrix (the indexing codes)
  • (3) preprocess the input and output for training Transformer models.

TLDR: we combine and summarize (1),(2),(3) into two scripts: run_preprocess_label.sh and run_preprocess_feat.sh. A more detailed explanation follows.

(1) To construct label embedding,

OUTPUT_DIR=save_models/${DATASET}
PROC_DATA_DIR=${OUTPUT_DIR}/proc_data
mkdir -p ${PROC_DATA_DIR}
python -m xbert.preprocess \
    --do_label_embedding \
    -i ${DATA_DIR} \
    -o ${PROC_DATA_DIR} \
    -l ${LABEL_EMB} \
    -x ${LABEL_EMB_INST_PATH}
  • DATA_DIR: ./datasets/Eurlex-4K
  • PROC_DATA_DIR: ./save_models/Eurlex-4K/proc_data
  • LABEL_EMB: pifa-tfidf (you can also try text-emb or pifa-neural if you have fine-tuned instance embeddings)
  • LABEL_EMB_INST_PATH: ./datasets/Eurlex-4K/X.trn.npz

This should yield L.${LABEL_EMB}.npz in the PROC_DATA_DIR.
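As a rough picture of what a pifa-tfidf label embedding contains, the sketch below assumes the definition of PIFA (Positive Instance Feature Aggregation) from the paper: each label is represented by the sum of the feature vectors of its positive instances, scaled to unit length. It uses small dense arrays and a hypothetical function name; it is not the code in xbert.preprocess, which works on sparse matrices.

```python
import numpy as np

def pifa_label_embedding(X, Y):
    """X: (N, D) instance features; Y: (N, L) 0/1 assignments. Returns (L, D)."""
    # Row l of agg sums the features of all instances tagged with label l.
    agg = Y.T @ X
    norms = np.linalg.norm(agg, axis=1, keepdims=True)
    # Labels without any positive instance keep an all-zero embedding.
    norms[norms == 0] = 1.0
    return agg / norms

# Toy data: 3 instances, 2 features, 2 labels (hypothetical values).
X = np.array([[1.0, 0.0],
              [0.0, 2.0],
              [3.0, 4.0]])
Y = np.array([[1, 0],
              [1, 1],
              [0, 1]])
L_emb = pifa_label_embedding(X, Y)
print(L_emb.shape)
```

The result has one row per label and one column per feature, which matches the role of L.${LABEL_EMB}.npz as input to the clustering step.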

(2) To perform hierarchical 2-means,

SEED_LIST=( 0 1 2 )
for SEED in "${SEED_LIST[@]}"; do
    LABEL_EMB_NAME=${LABEL_EMB}-s${SEED}
    INDEXER_DIR=${OUTPUT_DIR}/${LABEL_EMB_NAME}/indexer
    python -u -m xbert.indexer \
        -i ${PROC_DATA_DIR}/L.${LABEL_EMB}.npz \
        -o ${INDEXER_DIR} --seed ${SEED}
done

This should yield code.npz in the INDEXER_DIR.
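To give an intuition for hierarchical 2-means, here is a simplified sketch that recursively splits the labels into two equal halves until each cluster is small enough. It is an assumption-laden illustration, not the xbert.indexer implementation: the balancing rule, the max_leaf_size parameter, and the function names are all hypothetical, and the output is a plain vector of cluster ids rather than the sparse matrix stored in code.npz.

```python
import numpy as np

def balanced_two_means(emb, idx, rng, n_iter=10):
    """Split the labels in idx into two halves of (almost) equal size."""
    # Start from two distinct labels as initial centroids.
    c = emb[rng.choice(idx, size=2, replace=False)]
    for _ in range(n_iter):
        # Positive score: more similar to centroid 0 than to centroid 1.
        score = emb[idx] @ (c[0] - c[1])
        order = np.argsort(-score)
        half = len(idx) // 2
        left, right = idx[order[:half]], idx[order[half:]]
        # Recompute the centroids from the current halves.
        c = np.stack([emb[left].mean(axis=0), emb[right].mean(axis=0)])
    return left, right

def hierarchical_two_means(emb, max_leaf_size=2, seed=0):
    """Return codes, where codes[l] is the cluster id of label l."""
    rng = np.random.default_rng(seed)
    stack, leaves = [np.arange(emb.shape[0])], []
    while stack:
        idx = stack.pop()
        if len(idx) <= max_leaf_size:
            leaves.append(idx)
        else:
            stack.extend(balanced_two_means(emb, idx, rng))
    codes = np.zeros(emb.shape[0], dtype=int)
    for k, idx in enumerate(leaves):
        codes[idx] = k
    return codes

# Toy label embeddings: two groups of 4 labels each (hypothetical values).
emb = np.array([[1.0, 0.0], [0.9, 0.1], [0.8, 0.2], [0.7, 0.3],
                [0.0, 1.0], [0.1, 0.9], [0.2, 0.8], [0.3, 0.7]])
codes = hierarchical_two_means(emb, max_leaf_size=2, seed=0)
print(codes)
```

With 8 labels and a leaf size of 2, the recursion produces 4 clusters of 2 labels each, and labels from the two groups never share a cluster.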

(3) To preprocess input and output for Transformer models,

SEED=0
LABEL_EMB_NAME=${LABEL_EMB}-s${SEED}
INDEXER_DIR=${OUTPUT_DIR}/${LABEL_EMB_NAME}/indexer
python -u -m xbert.preprocess \
    --do_proc_label \
    -i ${DATA_DIR} \
    -o ${PROC_DATA_DIR} \
    -l ${LABEL_EMB_NAME} \
    -c ${INDEXER_DIR}/code.npz

This should yield the instance-to-cluster matrix C.trn.npz and C.tst.npz in the PROC_DATA_DIR.
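Conceptually, the instance-to-cluster matrix marks, for every instance, the clusters that contain at least one of its labels. The sketch below shows that relationship on dense toy arrays; the function name and the vector form of the codes are assumptions for illustration, and the actual preprocessing operates on the sparse Y matrices and code.npz.

```python
import numpy as np

def instance_to_cluster(Y, codes, n_clusters):
    """Y: (N, L) 0/1 labels; codes[l] = cluster id of label l. Returns (N, K) 0/1."""
    n_labels = Y.shape[1]
    # One-hot label-to-cluster matrix of shape (L, K).
    code_mat = np.zeros((n_labels, n_clusters), dtype=int)
    code_mat[np.arange(n_labels), codes] = 1
    # An instance belongs to a cluster if any of its labels does.
    return (Y @ code_mat > 0).astype(int)

# Toy data: 3 instances, 4 labels, 2 clusters (hypothetical values).
Y = np.array([[1, 1, 0, 0],
              [0, 0, 1, 0],
              [1, 0, 0, 1]])
codes = np.array([0, 0, 1, 1])
C = instance_to_cluster(Y, codes, n_clusters=2)
print(C)
```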

OUTPUT_DIR=save_models/${DATASET}
PROC_DATA_DIR=${OUTPUT_DIR}/proc_data
python -u -m xbert.preprocess \
    --do_proc_feat \
    -i ${DATA_DIR} \
    -o ${PROC_DATA_DIR} \
    -m ${MODEL_TYPE} \
    -n ${MODEL_NAME} \
    --max_xseq_len ${MAX_XSEQ_LEN} \
    |& tee ${PROC_DATA_DIR}/log.${MODEL_TYPE}.${MAX_XSEQ_LEN}.txt
  • MODEL_TYPE: bert (or roberta, xlnet)
  • MODEL_NAME: bert-large-cased-whole-word-masking (or roberta-large, xlnet-large-cased)
  • MAX_XSEQ_LEN: maximum number of tokens per instance (we set it to 128)

This should yield X.trn.${MODEL_TYPE}.${MAX_XSEQ_LEN}.pkl and X.tst.${MODEL_TYPE}.${MAX_XSEQ_LEN}.pkl in the PROC_DATA_DIR.
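The effect of MAX_XSEQ_LEN can be pictured with the sketch below, which truncates or pads a list of token ids to a fixed length and builds the matching attention mask. It is a simplified assumption about what fixed-length preprocessing involves: the function name, the pad id of 0, and the example token ids are hypothetical, and a real tokenizer also takes care of special tokens when truncating.

```python
def pad_or_truncate(token_ids, max_len=128, pad_id=0):
    """Return (input_ids, attention_mask), both of length max_len."""
    # Keep at most max_len tokens.
    ids = list(token_ids[:max_len])
    # 1 marks a real token, 0 marks padding.
    mask = [1] * len(ids)
    n_pad = max_len - len(ids)
    return ids + [pad_id] * n_pad, mask + [0] * n_pad

# A short sequence is padded, a long one is cut (hypothetical token ids).
ids, mask = pad_or_truncate([101, 7592, 2088, 102], max_len=6)
long_ids, long_mask = pad_or_truncate(list(range(1, 201)), max_len=128)
print(ids, mask)
print(len(long_ids), sum(long_mask))
```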

Matcher

In stage 2, we will do the following

  • (1) train deep Transformer models to map instances to the induced clusters
  • (2) output the predicted cluster scores and fine-tuned instance embeddings

TLDR: run_transformer_train.sh. A more detailed explanation follows.

(1) Assume we have 8 Nvidia V100 GPUs. To train the models,

MODEL_DIR=${OUTPUT_DIR}/${INDEXER_NAME}/matcher/${MODEL_NAME}
mkdir -p ${MODEL_DIR}
python -m torch.distributed.launch \
    --nproc_per_node 8 xbert/transformer.py \
    -m ${MODEL_TYPE} -n ${MODEL_NAME} --do_train \
    -x_trn ${PROC_DATA_DIR}/X.trn.${MODEL_TYPE}.${MAX_XSEQ_LEN}.pkl \
    -c_trn ${PROC_DATA_DIR}/C.trn.${INDEXER_NAME}.npz \
    -o ${MODEL_DIR} --overwrite_output_dir \
    --per_device_train_batch_size ${PER_DEVICE_TRN_BSZ} \
    --gradient_accumulation_steps ${GRAD_ACCU_STEPS} \
    --max_steps ${MAX_STEPS} \
    --warmup_steps ${WARMUP_STEPS} \
    --learning_rate ${LEARNING_RATE} \
    --logging_steps ${LOGGING_STEPS} \
    |& tee ${MODEL_DIR}/log.txt
  • INDEXER_NAME: the label embedding name with its seed from stage 1, such as pifa-tfidf-s0 (called LABEL_EMB_NAME above)
  • MODEL_TYPE: bert (or roberta, xlnet)
  • MODEL_NAME: bert-large-cased-whole-word-masking (or roberta-large, xlnet-large-cased)
  • PER_DEVICE_TRN_BSZ: 16 if using Nvidia V100 (or set to 8 if using Nvidia 2080Ti)
  • GRAD_ACCU_STEPS: 2 if using Nvidia V100 (or set to 4 if using Nvidia 2080Ti)
  • MAX_STEPS: set to 1,000 for Eurlex-4K; adjust depending on your dataset
  • WARMUP_STEPS: set to 100 for Eurlex-4K; adjust depending on your dataset
  • LEARNING_RATE: set to 5e-5 for Eurlex-4K; adjust depending on your dataset
  • LOGGING_STEPS: set to 100
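The two hardware settings above trade per-device batch size against gradient accumulation. As a quick arithmetic check, the sketch below multiplies out the effective batch size per optimizer step. It assumes 8 GPUs in both cases; the README states 8 GPUs only for the V100 setup, so the 2080Ti figure is an assumption, and the function name is hypothetical.

```python
def effective_batch_size(n_gpus, per_device_bsz, grad_accu_steps):
    """Number of training instances contributing to one optimizer step."""
    return n_gpus * per_device_bsz * grad_accu_steps

# V100 setting from the list above: 8 GPUs x 16 per device x 2 accumulation steps.
v100 = effective_batch_size(n_gpus=8, per_device_bsz=16, grad_accu_steps=2)
# 2080Ti setting, assuming the same number of GPUs: 8 x 8 x 4.
rtx2080ti = effective_batch_size(n_gpus=8, per_device_bsz=8, grad_accu_steps=4)
print(v100, rtx2080ti)
```

Under that assumption both settings give 256 instances per step, so the smaller-memory GPUs reach the same effective batch size by accumulating gradients over more steps.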

(2) To generate predictions and instance embedding,

GPID=0,1,2,3,4,5,6,7
PER_DEVICE_VAL_BSZ=32
CUDA_VISIBLE_DEVICES=${GPID} python -u xbert/transformer.py \
    -m ${MODEL_TYPE} -n ${MODEL_NAME} \
    --do_eval -o ${MODEL_DIR} \
    -x_trn ${PROC_DATA_DIR}/X.trn.${MODEL_TYPE}.${MAX_XSEQ_LEN}.pkl \
    -c_trn ${PROC_DATA_DIR}/C.trn.${INDEXER_NAME}.npz \
    -x_tst ${PROC_DATA_DIR}/X.tst.${MODEL_TYPE}.${MAX_XSEQ_LEN}.pkl \
    -c_tst ${PROC_DATA_DIR}/C.tst.${INDEXER_NAME}.npz \
    --per_device_eval_batch_size ${PER_DEVICE_VAL_BSZ}

This should yield the following output in the MODEL_DIR

  • C_trn_pred.npz and C_tst_pred.npz: model-predicted cluster scores
  • trn_embeddings.npy and tst_embeddings.npy: fine-tuned instance embeddings

Ranker

In stage 3, we will do the following

  • (1) train linear rankers to map instances and predicted cluster scores to label scores
  • (2) output top-k predicted labels
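One way to picture how cluster scores feed into label ranking is sketched below: keep the highest-scoring clusters for an instance and treat the labels inside them as the candidates to rank. This is an assumption about the general idea, not a description of xbert.ranker; the beam parameter, the function name, and the vector form of the codes are hypothetical.

```python
import numpy as np

def candidate_labels(cluster_scores, codes, beam=2):
    """Return the label indices that fall in the top-`beam` clusters.

    cluster_scores: (K,) matcher scores for one instance.
    codes: (L,) array where codes[l] is the cluster id of label l.
    """
    # Clusters sorted by descending score, truncated to the beam size.
    top_clusters = np.argsort(-cluster_scores)[:beam]
    # Labels whose cluster is among the selected ones.
    return np.flatnonzero(np.isin(codes, top_clusters))

# Toy data: 6 labels in 3 clusters (hypothetical values).
codes = np.array([0, 0, 1, 1, 2, 2])
cluster_scores = np.array([0.1, 0.7, 0.2])
cands = candidate_labels(cluster_scores, codes, beam=2)
print(cands)
```

Here clusters 1 and 2 score highest, so only labels 2 through 5 remain as candidates.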

TLDR: run_transformer_predict.sh. A more detailed explanation follows.

(1) To train linear rankers,

LABEL_NAME=pifa-tfidf-s0
MODEL_NAME=bert-large-cased-whole-word-masking
OUTPUT_DIR=save_models/${DATASET}/${LABEL_NAME}
INDEXER_DIR=${OUTPUT_DIR}/indexer
MATCHER_DIR=${OUTPUT_DIR}/matcher/${MODEL_NAME}
RANKER_DIR=${OUTPUT_DIR}/ranker/${MODEL_NAME}
mkdir -p ${RANKER_DIR}
python -m xbert.ranker train \
    -x1 ${DATA_DIR}/X.trn.npz \
    -x2 ${MATCHER_DIR}/trn_embeddings.npy \
    -y ${DATA_DIR}/Y.trn.npz \
    -z ${MATCHER_DIR}/C_trn_pred.npz \
    -c ${INDEXER_DIR}/code.npz \
    -o ${RANKER_DIR} -t 0.01 \
    -f 0 --mode ranker
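The command above passes two feature sources, the TF-IDF matrix (-x1) and the fine-tuned embeddings (-x2). Assuming they are combined by column-wise concatenation, as the v1 description in the linear evaluation section suggests, the idea looks like the sketch below. It uses dense toy arrays and a hypothetical function name; whether the ranker applies any scaling before concatenating is not stated here, so none is shown.

```python
import numpy as np

def concat_features(X_tfidf, X_emb):
    """Stack TF-IDF features and dense embeddings side by side."""
    if X_tfidf.shape[0] != X_emb.shape[0]:
        raise ValueError("both feature matrices need one row per instance")
    return np.hstack([X_tfidf, X_emb])

# Toy data: 3 instances, 5 TF-IDF features, 4 embedding dimensions.
X_tfidf = np.ones((3, 5))
X_emb = np.zeros((3, 4))
X_cat = concat_features(X_tfidf, X_emb)
print(X_cat.shape)
```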

(2) To predict the final top-k labels,

PRED_NPZ_PATH=${RANKER_DIR}/tst.pred.npz
python -m xbert.ranker predict \
    -m ${RANKER_DIR} -o ${PRED_NPZ_PATH} \
    -x1 ${DATA_DIR}/X.tst.npz \
    -x2 ${MATCHER_DIR}/tst_embeddings.npy \
    -y ${DATA_DIR}/Y.tst.npz \
    -z ${MATCHER_DIR}/C_tst_pred.npz \
    -f 0 -t noop

This should yield the predicted top-k labels tst.pred.npz specified in PRED_NPZ_PATH.
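To inspect predictions by eye, the label indices can be mapped back to label text using the lines of label_map.txt. The sketch below does this for a single row of scores; the function name, the in-memory label list, and the scores are hypothetical, and loading tst.pred.npz itself is left out.

```python
import numpy as np

def topk_label_texts(score_row, label_texts, k=3):
    """Return up to k (label text, score) pairs, highest score first."""
    order = np.argsort(-score_row)[:k]
    # Skip labels with a zero score, which were not predicted at all.
    return [(label_texts[j], float(score_row[j])) for j in order if score_row[j] > 0]

# Stand-in for the lines of label_map.txt (hypothetical label names).
label_texts = ["agriculture", "trade", "fisheries", "energy"]
score_row = np.array([0.0, 0.6, 0.9, 0.1])
top2 = topk_label_texts(score_row, label_texts, k=2)
print(top2)
```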

Acknowledgements

Some portions of this repo are borrowed from the following repos:
