License: MIT License


Constructing Tree-based Index for Efficient and Effective Dense Retrieval

The official repo for our SIGIR'23 Full paper: Constructing Tree-based Index for Efficient and Effective Dense Retrieval

Introduction

To balance the effectiveness and efficiency of tree-based indexes, we propose JTR, which stands for Joint optimization of TRee-based index and query encoding. To jointly optimize the index structure and the query encoder in an end-to-end manner, JTR drops the original "encoding-indexing" training paradigm and designs a unified contrastive learning loss. However, training tree-based indexes with a contrastive loss is non-trivial due to the problem of differentiability. To overcome this obstacle, the tree-based index is divided into two parts: cluster node embeddings and cluster assignment. The cluster node embeddings are small but critical, and since they are differentiable, we design tree-based negative sampling to optimize them. The cluster assignment is optimized iteratively with an overlapped cluster method.
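The unified contrastive loss with tree-based negative sampling can be sketched as follows. This is a minimal illustration, not the code in this repo: the function name, the level-wise InfoNCE formulation, and the choice of sibling nodes as negatives are assumptions for exposition.

```python
import numpy as np

def tree_contrastive_loss(query, pos_path, negs_per_level, temperature=1.0):
    """Level-wise InfoNCE over a tree index (sketch).

    pos_path: node embeddings of the target document's ancestors,
        one per tree level (the positives).
    negs_per_level: for each level, an array of sibling-node embeddings
        sampled at that level (tree-based negative sampling).
    """
    total = 0.0
    for pos, negs in zip(pos_path, negs_per_level):
        logits = np.concatenate([[query @ pos], negs @ query]) / temperature
        # -log softmax(positive) via a numerically stable log-sum-exp
        lse = logits.max() + np.log(np.exp(logits - logits.max()).sum())
        total += lse - logits[0]
    return total / len(pos_path)
```

The loss is small when the query scores its positive ancestor above the sampled siblings at every level, which is what lets gradients flow into both the query encoder and the cluster node embeddings.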


Preprocess

JTR initializes the document embeddings with STAR; refer to the DRhard repository for details.

Run the following commands in DRhard to preprocess the documents:

python preprocess.py --data_type 0; python preprocess.py --data_type 1

Tree Initialization

After getting the text embeddings, we can initialize the tree using recursive k-means.

Run the following command:

python construct_tree.py

We will get the following files:

tree.pkl: the tree structure

node_dict.pkl: map from node id to node

node_list: the nodes at each level

pid_labelid.memmap: mapping from document ids to cluster nodes

leaf_dict.pkl: the leaf nodes
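In outline, recursive k-means splits the corpus embeddings into k clusters, keeps each cluster centroid as the node embedding, and recurses until clusters are small enough. A minimal sketch under that reading (the function names, k, and the stopping rule are illustrative, not the values used in construct_tree.py):

```python
import numpy as np

def kmeans(X, k, iters=10, seed=0):
    """Tiny k-means for illustration only."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        # assign each point to its nearest center, then recompute centers
        labels = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1).argmin(1)
        for j in range(k):
            if (labels == j).any():
                centers[j] = X[labels == j].mean(0)
    return labels

def build_tree(X, ids, k=2, max_leaf=2):
    """Recursively cluster embeddings; every node keeps its centroid
    (the initial cluster node embedding) and its document ids."""
    node = {"center": X.mean(0), "ids": list(ids), "children": []}
    if len(ids) <= max_leaf:
        return node
    labels = kmeans(X, k)
    if len(np.unique(labels)) < 2:  # k-means failed to split; stop here
        return node
    for j in np.unique(labels):
        mask = labels == j
        node["children"].append(build_tree(X[mask], np.asarray(ids)[mask], k, max_leaf))
    return node
```

The leaves partition the document ids, which is exactly the one-leaf-per-document initialization that the later Overlapped Cluster step relaxes.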

Train

Run the following command:

python train.py --task train

The training process optimizes both the query encoder and the cluster node embeddings, so we need to save both the node embeddings and the query encoder.

Inference

Run the following command:

python train.py --task dev

Inference constructs the matrix M needed for the Reorganize Cluster step.
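During inference, each query embedding is routed down the tree with beam search, and the scores of the leaves it reaches can fill one row of M. The exact definition of M is in train.py; this sketch assumes M[q][leaf] is a query-leaf inner-product score, and all names here are illustrative:

```python
import numpy as np

def beam_search(root, query, beam=2):
    """Descend the tree keeping the `beam` best-scoring nodes per level;
    return the reached leaves with their inner-product scores."""
    frontier, leaves = [root], []
    while frontier:
        # score every child of the current frontier and keep the top `beam`
        scored = sorted(
            ((float(query @ c["center"]), c) for n in frontier for c in n["children"]),
            key=lambda t: t[0], reverse=True)[:beam]
        frontier = []
        for score, node in scored:
            if node["children"]:
                frontier.append(node)
            else:
                leaves.append((node["name"], score))
    return leaves
```

Because only `beam` nodes survive per level, each query touches a logarithmic slice of the index, which is where the efficiency of the tree comes from.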

Reorganize Cluster

Run the following command:

python reorganize_clusters_tree.py

Re-clustering requires the M and Y matrices. The Y matrix is constructed by running other retrieval models; the M matrix is constructed by inference on the tree index.
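One way to read this step: with M as an (n_queries × n_leaves) score matrix from tree inference and Y as an (n_queries × n_docs) relevance matrix from another retriever, each document can be re-assigned to its top-k leaves by aggregated affinity, producing overlapping clusters. A sketch under those assumptions; the matrix shapes and the Yᵀ M aggregation are illustrative, not necessarily the exact formulation in reorganize_clusters_tree.py:

```python
import numpy as np

def overlapped_assignment(M, Y, k=2):
    """Assign each document to its top-k leaves (overlapped clusters).

    M: (n_queries, n_leaves) query-to-leaf scores from tree inference.
    Y: (n_queries, n_docs) query-document relevance from another retriever.
    """
    affinity = Y.T @ M                      # (n_docs, n_leaves)
    return np.argsort(-affinity, axis=1)[:, :k]  # top-k leaf ids per doc
```

With k > 1 a document relevant to queries routed to different leaves lands in all of them, which is the "overlapped cluster" relaxation of the initial one-leaf-per-document assignment.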

Other

This work was done when I was a beginner, so the code is rough. If somebody could further organize and optimize the code, or integrate it into Faiss in C++, I would appreciate it.

Citations

If you find our work useful, please star the repo and cite our work:

@misc{JTR,
      title={Constructing Tree-based Index for Efficient and Effective Dense Retrieval}, 
      author={Haitao Li and Qingyao Ai and Jingtao Zhan and Jiaxin Mao and Yiqun Liu and Zheng Liu and Zhao Cao},
      year={2023},
      eprint={2304.11943},
      archivePrefix={arXiv},
      primaryClass={cs.IR}
}

jtr's Issues

Hello, after the Overlapped Cluster step, how are positive and negative nodes selected?

In the tree initialized by k-means, each document is assigned to exactly one leaf node; the positives are that leaf and its ancestors, all other nodes are negatives, and each tree level has exactly one positive node. After an Overlapped Cluster step, however, a document is assigned to multiple leaf nodes, so each level can contain multiple positive nodes. In that case, how are the positive and negative nodes chosen to match the new cluster assignment?

In your JTR/reorganize_clusters_tree.py source, the following code appears. If I understand correctly, after the clusters are updated, dict_label is set to the label of the last node each pid was assigned to, i.e., only the last leaf node a document was assigned to is considered and the previously assigned leaf nodes are ignored. If so, this does not seem to match the new cluster assignment. This is just my shallow understanding; if I am wrong, please point it out, and I would be most grateful!

    dict_label = {}
    for leaf in tree.leaf_dict:
        node = tree.leaf_dict[leaf]
        pids = node.pids

        for pid in pids:
            dict_label[pid] = str(node.val)  # a later leaf overwrites any earlier label for the same pid

Source code is not available, and a question about HNSW parameter settings

Dear authors,

Thank you for sharing your work with the research community. I enjoyed reading your paper and found it both insightful and informative. I am writing to request some additional information that would help me better understand your work.

First, I would appreciate it if you could kindly share the source code and the dataset used in your paper. I understand that you processed the dataset with STAR, but I would like to reproduce your results!

Second, I am interested in the HNSW parameter settings used in your experiments. I noticed that you set the link number to 8, which I assume is the degree of a node (i.e., the parameter M). However, according to the official HNSW repo, the recommended range for M is 12-48. Have you tried higher values of M, and how did they affect performance?

I hope you can find some time to reply to my queries. I am looking forward to hearing from you soon!

JTR for ColBERT

Hello,

First of all, I'd like to say that I really like the work you've done.

I saw the potential of using JTR to speed up token-level embedding models such as ColBERT and created neural-tree. I don't yet know how to perform hierarchical clustering with ColBERT, so I did it using TfIdf or SentenceTransformer, then exported the created tree and averaged the ColBERT embeddings into the corresponding nodes. The speed gains for ColBERT are quite impressive, well done to you.

Have a great day,

Raphaël
