GithubHelp home page GithubHelp logo

hgtm's Introduction

HGTM

This is the pytorch implementation of KDD-2023 paper "Hyperbolic Graph Topic Modeling Network with Continuously Updated Topic Tree", authored by Delvin Ce Zhang, Rex Ying, and Hady W. Lauw.

HGTM is a topic model designed for interconnected texts in a graph structure, such as academic citation graphs and Webpage hyperlink graphs. HGTM learns text embeddings with the aim of preserving both hierarchical graph connectivity and hierarchical topic structure within textual content. The learned text embeddings can fulfill downstream tasks, inclduing text classification, graph link prediction, and topic analysis.

Implementation Environment

  • python == 3.10
  • pytorch == 2.0.0
  • numpy == 1.20.3

Run

python main.py -s False # unsupervised training

python main.py -s True # supervised training

Parameter Setting

  • -ne: number of training epochs, default = 200 (set 200 for small datasets (ml, pl, covid), and 10 for large datasets (aminer and web))
  • -lr: learning rate, default = 0.01
  • -ms: minibatch size, default = 64
  • -dn: dataset name, default = ml
  • -dim: dimension of text embeddings H, default = 16
  • -s: supervised training, default = False (set True for supervised training, which involves a MLP classifier and document labels for training. See Eq. 26 in the paper.)
  • -reg_s: regularizer for MLP classifier in supervised training, default = 1 ($\lambda_{label}$ in the paper. If the above -s is set to False, this -reg_s can be ignored.)
  • -reg_text: regularizer for text generation, default = 0.1 ($\lambda_{text}$ in the paper)
  • -tr: training ratio, the ratio of training documents to the total documents, default = 0.8
  • -m: manifold, PoincareBall or Hyperboloid, default = PoincareBall
  • -le: log epochs, how many epochs should the model output results on testing documents, default = 25 (set 25 for small datasets (ml, pl, covid), and 1 for large datasets (aminer and web))
  • -nl: number of graph convolutional layers, default = 2
  • -nn: number of sampled neighbors for aggregation, default = 5
  • -neg: number of negative samples for graph structure decoding, default = 5
  • -c: curvature, default = 1.0
  • -b: if the model adds bias for encoding, default = True
  • -ut: if the model updates the topic tree during training, default = True
  • -max_l: maximum number of levels for topic tree, default = 4
  • -max_c: maximum number of children a parent topic has, default = 20
  • -at: threshold of adding a new topic on the tree, default = 0.05
  • -rt: thrshold of removing a topic on the tree, default = 0.05
  • -we: if the model uses pretrained word embeddings for decoding, default = False
  • -rs: random seed
  • -gpu: gpu

Please note that some hyperparameters are different from the ones reported in the paper. This is because we used tensorflow to implement the model when doing research of this paper, while we reproduce the results using pytorch upon publication. Different libraries may result in slight deviations of implementation, but the results reported in the paper are reproducible.

Data

We release ML, PL, COVID, and Aminer datasets in ./data folder. For Aminer dataset, please unzip aminer.zip and put the unzipped file into ./data folder. For the largest Web dataset, please email Delvin Ce Zhang ([email protected]) for access.

Each dataset contains contents, links, vocabulary, pretrained glove word embeddings, labels, and label names.

  • contents: each row corresponds to a document, containing a sequence of words represented by word IDs in vocabulary. Word ID starts from 0. For example, a row with [0, 6, 4] means a document with a sequence of three words, i.e., the 0th, 6th, and 4th words in the vocabulary. There are N rows (N documents) in total.
  • links: each row corresponds to a link represented by a pair of document IDs. For example, a row with [5, 8] means a link from document 5 to document 8.
  • voc (|V|x1): vocabulary.
  • pretrained glove word embeddings (|V|x300): 300-dimensional pretrained glove word embeddings, with each row corresponding to a word in the vocabulary. For example, the 0th word embedding corresponds to the 0th word in the vocabulary.
  • labels (Nx1): labels or categories of N documents. For example, the 0th label corresponds to the 0th document.
  • label names: the names of labels or categories.

Output

We output three files for each training. Output results are saved to the ./dataset_name/results folder.

  • doc_emb.txt (NxH): each row is an H-dimensional document embedding. There are N documents in total. The first 80% embeddings are for training documents, and the remaining 20% embeddings are for testing documents. These document embeddings can fulfill downstream tasks, including text classification and graph link prediction.
  • topic_emb.txt (KxH): each row is an H-dimensional topic embedding.
  • topic_word_dist.txt (Kx|V|): topic-word distribution, where each row is a probability distribution over the whole vocabulary. This topic-word distribution can fulfill topic analysis tasks, including topic coherence.

Reference

If you find our paper useful, including code and data, please cite

@inproceedings{hgtm,
    author = {Zhang, Delvin Ce and Ying, Rex and Lauw, Hady W.},
    title = {Hyperbolic Graph Topic Modeling Network with Continuously Updated Topic Tree},
    year = {2023},
    booktitle = {Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining},
    pages = {3206โ€“3216},
    location = {Long Beach, CA, USA},
    series = {KDD '23}
}

hgtm's People

Contributors

cezhang01 avatar

Stargazers

Crystal avatar Xiaobao Wu avatar  avatar Xinjie Shen avatar  avatar

Watchers

Kostas Georgiou avatar  avatar

Forkers

missileee

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.