ColStar

Virtual Appendix for Xiao Wang, Craig Macdonald, Nicola Tonellotto, Iadh Ounis. Reproducibility, Replicability, and Insights into Dense Multi-Representation Retrieval Models: from ColBERT to Col*. In Proceedings of SIGIR 2023.

Training

To train a Col* model, please follow these instructions for training. The trained checkpoints for the selected Col* models, namely ColBERT-Cosine, ColBERT-L2, ColRoBERTa and ColALBERT, which can be used to build the dense index, are shared here: https://drive.google.com/drive/folders/1S9d1M_5LGcGyC6W15ZOCL6KGGRNc6wXa?usp=sharing

Indexing

Using your own trained model or one of our shared checkpoints, you can build the dense index by following these instructions for indexing.

Evaluation

Reproducibility Experiments

We first conduct a reproducibility study of ColBERT by training our own ColBERT model. The reproducibility experiments are demonstrated in this demo notebook. In addition, all the results files are provided in this Reproducibility results folder. Based on the results of our reproducibility study, we conclude that we can reproduce the training of ColBERT well.

Replicability Experiments

Then, building on the success of our reproducibility study, we conduct a replicability study in which we extend ColBERT to Col*. In particular, Col* is a collection of models in which we implement the Contextualised Late Interaction Mechanism on top of various underlying Pretrained Language Models (PLMs) with different tokenisers. The replicability experiments of our paper are demonstrated in this demo notebook. All the results files needed to reproduce our results are also provided in this Replicability results folder.
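As a rough illustration of the late interaction mechanism mentioned above, the sketch below scores a document by summing, over the query token embeddings, the maximum similarity to any document token embedding. It is a minimal, dependency-free sketch and not the code used in our experiments: the function names `cosine` and `late_interaction_score` and the toy 2-dimensional embeddings are assumptions made for illustration, cosine similarity is just one possible similarity function, and all embeddings are assumed to be non-zero vectors.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def late_interaction_score(query_embs, doc_embs):
    """Sum over query tokens of the max similarity to any document token."""
    return sum(max(cosine(q, d) for d in doc_embs) for q in query_embs)


# Toy token embeddings (hypothetical values, one vector per token).
query_embs = [[1.0, 0.0], [0.0, 1.0]]
doc_a = [[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]]
doc_b = [[1.0, 1.0]]

score_a = late_interaction_score(query_embs, doc_a)
score_b = late_interaction_score(query_embs, doc_b)
print(score_a, score_b)
```

Here `doc_a` contains a close match for each query token, so it receives a higher score than `doc_b`, whose single token only partially matches both query tokens.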

Based on the results of our replicability study, we conclude that we can replicate the contextualised late interaction mechanism upon various pretrained models. Moreover, in the following, we report the statistics of four selected Col* models that are built upon base-size PLMs but with different tokenisers, namely the ColBERT, ColminiLM, ColRoBERTa and ColALBERT models.

image

Insights Experiments

Furthermore, we investigate the matching behaviour of our selected Col* models to obtain more insights. In particular, we investigate the following research questions:

  • RQ3.1: How does the late interaction matching behaviour vary across different Col* models?
  • RQ3.2: How do the Col* models impact the matching behaviour across different types of tokens?
  • RQ3.3: Can we quantify the contribution of different types of matching behaviour?

According to the following experimental results, we find that (i) ColRoBERTa is more likely to perform semantic matching than the other models (RQ3.1); (ii) low-IDF tokens are the most likely to exhibit semantic matching (RQ3.2); and (iii) Col* models benefit more from lexical matching than from semantic matching, although less so for ColRoBERTa (RQ3.3). Please check our paper for more results!

image

To reproduce the above results, we also provide a demo notebook for the insights experiments. Alternatively, instead of running the experiments from scratch, you can directly use the results files provided in this Insights results folder to reproduce our presented results.
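The distinction between lexical and semantic matching discussed in this section can be sketched as follows. For each query token, we find the document token with the highest similarity; if the two tokens are identical, the match is counted as lexical, otherwise as semantic. This is only one plausible way to operationalise the distinction and is not necessarily the exact measure used in the paper: the function name `matching_breakdown`, the dot-product similarity, and the toy tokens and embeddings are all assumptions made for illustration.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def matching_breakdown(query_tokens, query_embs, doc_tokens, doc_embs):
    """Split the late interaction score into lexical and semantic parts.

    A query token contributes to the lexical part when its most similar
    document token is the same token, and to the semantic part otherwise.
    """
    lexical, semantic = 0.0, 0.0
    for q_tok, q_emb in zip(query_tokens, query_embs):
        sims = [dot(q_emb, d_emb) for d_emb in doc_embs]
        best = max(range(len(sims)), key=sims.__getitem__)
        if doc_tokens[best] == q_tok:
            lexical += sims[best]
        else:
            semantic += sims[best]
    return lexical, semantic


# Toy example (hypothetical tokens and unit-length embeddings).
query_tokens = ["cheap", "flights"]
query_embs = [[1.0, 0.0], [0.0, 1.0]]
doc_tokens = ["low", "cost", "flights"]
doc_embs = [[0.8, 0.6], [0.6, 0.8], [0.0, 1.0]]

lexical, semantic = matching_breakdown(
    query_tokens, query_embs, doc_tokens, doc_embs
)
semantic_proportion = semantic / (lexical + semantic)
print(lexical, semantic, semantic_proportion)
```

In this toy example, "flights" matches itself (a lexical match), while "cheap" is best matched by the different token "low" (a semantic match), so both kinds of matching contribute to the final score.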

Reference

[Wang23]: Xiao Wang, Craig Macdonald, Nicola Tonellotto, Iadh Ounis. Reproducibility, Replicability, and Insights into Dense Multi-Representation Retrieval Models: from ColBERT to Col*. In Proceedings of SIGIR 2023.