tacchan7412 / ragged

This project forked from neulab/ragged

Retrieval Augmented Generation Generalized Evaluation Dataset

License: MIT License

Shell 0.85% Python 14.60% Jupyter Notebook 84.55%

ragged's Introduction

Description

Retrieval-augmented generation (RAG) greatly benefits language models (LMs) by providing additional context for tasks such as document-based question answering (DBQA). Despite its potential, the power of RAG is highly dependent on its configuration, raising the question: What is the optimal RAG configuration? To answer this, we introduce the RAGGED framework to analyze and optimize RAG systems. On representative DBQA tasks, we study two classic retrievers, one sparse and one dense, and four top-performing LMs in encoder-decoder and decoder-only architectures. Through RAGGED, we uncover that different models suit substantially varied RAG setups. While encoder-decoder models monotonically improve with more documents, we find decoder-only models can only effectively use fewer than 5 documents, despite often having a longer context window. RAGGED offers further insights into LMs' context utilization habits, where we find encoder-decoder models rely more on contexts and are thus more sensitive to retrieval quality, while decoder-only models tend to rely on knowledge memorized during training.

Installation

To recreate the conda environment, run

    conda create -n ragged -y python=3.10
    pip install -r requirements.txt

To run and evaluate the retriever, see retriever/README.md.

To run and evaluate the reader, see reader/README.md.

To conduct downstream RAGGED analysis, see analysis_framework/README.md.

Datasets

Our datasets are available on Hugging Face.

1. Download and process corpus datasets

Specify corpus_dir and corpus_name and see download_data.py for how to download and save the files in appropriate folders.

For Pubmed corpus for BioASQ, the corpus name is pubmed.

For KILT wikipedia corpus, the corpus name is kilt_wikipedia.

After downloading the datasets, process the corpus for ColBERT format by running python retriever/data_processing/create_corpus_tsv.py --corpus $corpus --corpus_dir $corpus_dir, which outputs $corpus_dir/${corpus}/${corpus}.json.

2. Download query datasets

Specify data_dir and dataset_name and see download_data.py for how to download the file to ${data_dir}/${dataset_name}.jsonl.

We support Natural Questions (KILT ver), HotpotQA (KILT ver), and BioASQ11B.

The above files are ready for BM25, but not for ColBERT. To reformat them for ColBERT, run python retriever/data_processing/create_query_tsv.py --data_dir $data_dir --dataset $dataset, which outputs $data_dir/${dataset}-queries.tsv.
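As a rough illustration of what this reformatting involves, the sketch below converts a query JSONL file into a two-column, tab-separated file of query ID and query text. It is not the repository's create_query_tsv.py: the function name jsonl_to_query_tsv is hypothetical, and the field names id and input (the names KILT-style records use) are assumptions, so check them against your own files.

```python
import json
from pathlib import Path


def jsonl_to_query_tsv(jsonl_path, tsv_path, id_field="id", text_field="input"):
    """Write one `query_id<TAB>query_text` line per JSONL record.

    Hypothetical helper: the default field names are assumptions and are
    not taken from the repository's script.
    """
    n_written = 0
    with Path(jsonl_path).open(encoding="utf-8") as src, \
            Path(tsv_path).open("w", encoding="utf-8") as dst:
        for line in src:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            record = json.loads(line)
            # Collapse tabs and newlines so they cannot break the TSV layout.
            text = " ".join(str(record[text_field]).split())
            dst.write(f"{record[id_field]}\t{text}\n")
            n_written += 1
    return n_written
```

For example, jsonl_to_query_tsv("data/nq.jsonl", "data/nq-queries.tsv") would mirror the ${data_dir}/${dataset}-queries.tsv naming used above; the data/nq paths are placeholders.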

3. Adapt your own datasets

To adapt for BM25, format your corpus and query as jsonl files as instructed here. To adapt for ColBERT, format your corpus and query datasets as instructed here.
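Since the format instructions are linked rather than spelled out in this section, the following is only a sketch of one common layout, under stated assumptions: a JSONL corpus with id and contents fields for BM25 (the convention used by Pyserini-style indexers) and a TSV of integer passage ID and passage text for ColBERT, where the ID matches the zero-based line number. The helper name write_corpus_files and both field names are hypothetical; follow the linked instructions wherever they differ.

```python
import json


def write_corpus_files(passages, jsonl_path, tsv_path):
    """Write the same passages as JSONL (for BM25) and TSV (for ColBERT).

    Assumed layouts, not confirmed by this README:
      JSONL: {"id": "<pid>", "contents": "<text>"} per line
      TSV:   <pid><TAB><text> per line, pid equal to the line index
    """
    count = 0
    with open(jsonl_path, "w", encoding="utf-8") as jsonl_file, \
            open(tsv_path, "w", encoding="utf-8") as tsv_file:
        for pid, text in enumerate(passages):
            # Collapse tabs and newlines so each passage stays on one line.
            clean = " ".join(text.split())
            jsonl_file.write(json.dumps({"id": str(pid), "contents": clean}) + "\n")
            tsv_file.write(f"{pid}\t{clean}\n")
            count += 1
    return count
```

Writing both files from one pass keeps passage IDs aligned, so BM25 and ColBERT results can be compared on the same identifiers.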

Citation

If you use our code, datasets, or concepts from our paper in your research, we would appreciate it if you cited our paper. Here is an example BibTeX entry:

@article{hsia2024ragged,
  title={RAGGED: Towards Informed Design of Retrieval Augmented Generation Systems},
  author={Jennifer Hsia and Afreen Shaikh and Zhiruo Wang and Graham Neubig},
  journal={arXiv preprint arXiv:2403.09040},
  year={2024}
}

Contact

For any questions, feedback, or discussions regarding this project, please feel free to open an issue on the repository or contact us: