License: MIT License


Constructing Tree-based Index for Efficient and Effective Dense Retrieval

The official repo for our SIGIR'23 Full paper: Constructing Tree-based Index for Efficient and Effective Dense Retrieval

Introduction

To balance the effectiveness and efficiency of tree-based indexes, we propose JTR, which stands for Joint optimization of TRee-based index and query encoding. To jointly optimize the index structure and the query encoder in an end-to-end manner, JTR drops the original "encoding-indexing" training paradigm and designs a unified contrastive learning loss. However, training tree-based indexes with a contrastive loss is non-trivial due to the problem of differentiability. To overcome this obstacle, the tree-based index is divided into two parts: cluster node embeddings and cluster assignment. The cluster node embeddings are small but critical, and since they are differentiable, we design tree-based negative sampling to optimize them. The cluster assignment is optimized iteratively with an overlapped cluster method.
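The unified contrastive loss with tree-based negative sampling can be sketched as follows. This is a minimal illustration, not the code in this repo: the function name, the level-wise InfoNCE formulation, and the choice of sibling nodes as negatives are assumptions for exposition.

```python
import numpy as np

def tree_contrastive_loss(query, pos_path, negs_per_level, temperature=1.0):
    """Level-wise InfoNCE over a tree index (sketch).

    pos_path: node embeddings of the target document's ancestors,
        one per tree level (the positives).
    negs_per_level: for each level, an array of sibling-node embeddings
        sampled at that level (tree-based negative sampling).
    """
    total = 0.0
    for pos, negs in zip(pos_path, negs_per_level):
        logits = np.concatenate([[query @ pos], negs @ query]) / temperature
        # -log softmax(positive) via a numerically stable log-sum-exp
        lse = logits.max() + np.log(np.exp(logits - logits.max()).sum())
        total += lse - logits[0]
    return total / len(pos_path)
```

The loss is small when the query scores its positive ancestor above the sampled siblings at every level, which is what lets gradients flow into both the query encoder and the cluster node embeddings.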


Preprocess

JTR initializes the document embeddings with STAR; refer to the DRhard repository for details.

Run the following commands in DRhard to preprocess the documents:

python preprocess.py --data_type 0; python preprocess.py --data_type 1

Tree Initialization

After getting the text embeddings, we can initialize the tree using recursive k-means.

Run the following command:

python construct_tree.py

We will get the following files:

tree.pkl: the tree structure

node_dict.pkl: map from node id to node

node_list: the nodes at each level

pid_labelid.memmap: mapping from document ids to cluster nodes

leaf_dict.pkl: the leaf nodes
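In outline, recursive k-means splits the corpus embeddings into k clusters, keeps each cluster centroid as the node embedding, and recurses until clusters are small enough. A minimal sketch under that reading (the function names, k, and the stopping rule are illustrative, not the values used in construct_tree.py):

```python
import numpy as np

def kmeans(X, k, iters=10, seed=0):
    """Tiny k-means for illustration only."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        # assign each point to its nearest center, then recompute centers
        labels = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1).argmin(1)
        for j in range(k):
            if (labels == j).any():
                centers[j] = X[labels == j].mean(0)
    return labels

def build_tree(X, ids, k=2, max_leaf=2):
    """Recursively cluster embeddings; every node keeps its centroid
    (the initial cluster node embedding) and its document ids."""
    node = {"center": X.mean(0), "ids": list(ids), "children": []}
    if len(ids) <= max_leaf:
        return node
    labels = kmeans(X, k)
    if len(np.unique(labels)) < 2:  # k-means failed to split; stop here
        return node
    for j in np.unique(labels):
        mask = labels == j
        node["children"].append(build_tree(X[mask], np.asarray(ids)[mask], k, max_leaf))
    return node
```

The leaves partition the document ids, which is exactly the one-leaf-per-document initialization that the later Overlapped Cluster step relaxes.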

Train

Run the following command:

python train.py --task train

The training process optimizes both the query encoder and the cluster node embeddings, so we need to save both the node embeddings and the query encoder.

Inference

Run the following command:

python train.py --task dev

Inference constructs the matrix M needed for the Reorganize Cluster step.
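During inference, each query embedding is routed down the tree with beam search, and the scores of the leaves it reaches can fill one row of M. The exact definition of M is in train.py; this sketch assumes M[q][leaf] is a query-leaf inner-product score, and all names here are illustrative:

```python
import numpy as np

def beam_search(root, query, beam=2):
    """Descend the tree keeping the `beam` best-scoring nodes per level;
    return the reached leaves with their inner-product scores."""
    frontier, leaves = [root], []
    while frontier:
        # score every child of the current frontier and keep the top `beam`
        scored = sorted(
            ((float(query @ c["center"]), c) for n in frontier for c in n["children"]),
            key=lambda t: t[0], reverse=True)[:beam]
        frontier = []
        for score, node in scored:
            if node["children"]:
                frontier.append(node)
            else:
                leaves.append((node["name"], score))
    return leaves
```

Because only `beam` nodes survive per level, each query touches a logarithmic slice of the index, which is where the efficiency of the tree comes from.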

Reorganize Cluster

Run the following command:

python reorganize_clusters_tree.py

Re-clustering requires the M and Y matrices. The Y matrix is constructed by running other retrieval models; the M matrix is constructed by inference on the tree index.
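One way to read this step: with M as an (n_queries × n_leaves) score matrix from tree inference and Y as an (n_queries × n_docs) relevance matrix from another retriever, each document can be re-assigned to its top-k leaves by aggregated affinity, producing overlapping clusters. A sketch under those assumptions; the matrix shapes and the Yᵀ M aggregation are illustrative, not necessarily the exact formulation in reorganize_clusters_tree.py:

```python
import numpy as np

def overlapped_assignment(M, Y, k=2):
    """Assign each document to its top-k leaves (overlapped clusters).

    M: (n_queries, n_leaves) query-to-leaf scores from tree inference.
    Y: (n_queries, n_docs) query-document relevance from another retriever.
    """
    affinity = Y.T @ M                      # (n_docs, n_leaves)
    return np.argsort(-affinity, axis=1)[:, :k]  # top-k leaf ids per doc
```

With k > 1 a document relevant to queries routed to different leaves lands in all of them, which is the "overlapped cluster" relaxation of the initial one-leaf-per-document assignment.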

Other

This work was done when I was a beginner, so the code is rough. If somebody could further organize and optimize the code, or integrate it into Faiss in C++, I would appreciate it.

Citations

If you find our work useful, please star the repo and cite our work:

@misc{JTR,
      title={Constructing Tree-based Index for Efficient and Effective Dense Retrieval}, 
      author={Haitao Li and Qingyao Ai and Jingtao Zhan and Jiaxin Mao and Yiqun Liu and Zheng Liu and Zhao Cao},
      year={2023},
      eprint={2304.11943},
      archivePrefix={arXiv},
      primaryClass={cs.IR}
}

jtr's Issues

Hello, after the Overlapped Cluster step, how are positive and negative nodes selected?

In the tree initialized by k-means, each document is assigned to exactly one leaf node; the positives are that leaf and its ancestors, all other nodes are negatives, and each tree level has exactly one positive node. After an Overlapped Cluster step, however, a document is assigned to multiple leaf nodes, so each level can contain multiple positive nodes. In that case, how are the positive and negative nodes chosen to match the new cluster assignment?

In your JTR/reorganize_clusters_tree.py source, the following code appears. If I understand correctly, after the clusters are updated, dict_label is set to the label of the last node each pid was assigned to, i.e., only the last leaf node a document was assigned to is considered and the previously assigned leaf nodes are ignored. If so, this does not seem to match the new cluster assignment. This is just my shallow understanding; if I am wrong, please point it out, and I would be most grateful!

    dict_label = {}
    for leaf in tree.leaf_dict:
        node = tree.leaf_dict[leaf]
        pids = node.pids

        for pid in pids:
            dict_label[pid] = str(node.val)  # a later leaf overwrites any earlier label for the same pid

Source code is not available, and a question about HNSW parameter settings

Dear authors,

Thank you for sharing your work with the research community. I enjoyed reading your paper and found it both insightful and informative. I am writing to request some additional information that would help me better understand your work.

First, I would appreciate it if you could kindly share the source code and the dataset used in your paper. I understand that you processed the dataset with STAR, but I would like to reproduce your results!

Second, I am interested in the HNSW parameter settings used in your experiments. I noticed that you set the link number to 8, which I assume is the degree of a node (i.e., the parameter M). However, according to the official HNSW repo, the recommended range for M is 12-48. Have you tried higher values of M, and how did they affect performance?

I hope you can find some time to reply to my queries. I am looking forward to hearing from you soon!

JTR for ColBERT

Hello,

First of all, I'd like to say that I really like the work you've done.

I saw the potential of using JTR to speed up token-level embedding models such as ColBERT and created neural-tree. I don't yet know how to perform hierarchical clustering with ColBERT, so I did it using TfIdf or SentenceTransformer, then exported the created tree and averaged the ColBERT embeddings into the corresponding nodes. The speed gains for ColBERT are quite impressive, well done to you.

Have a great day,

Raphaël
