german_rag_enriched's Introduction

GermanRAG Dataset Enhancement

This repository contains the code and documentation for enhancing the germanrag dataset, which is used to fine-tune large language models (LLMs) for German-language Retrieval-Augmented Generation (RAG) applications. The project adds easy negatives, ensures that each combination of contexts is unique, and formats the dataset with a prompt template suitable for LLM fine-tuning.

Project Overview

The germanrag dataset, available on Hugging Face, is a resource for developing NLP models tailored to the German language. My enhancements improve the dataset by:

  • Adding easy negatives to provide a broader range of training challenges.
  • Implementing checks to ensure no combination of contexts appears more than once, enhancing dataset uniqueness.
  • Preparing the dataset for LLM fine-tuning with a custom prompt template suitable for RAG applications.
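
The first two points can be illustrated with a minimal sketch. It assumes each record is a dict with a `contexts` list, that an "easy negative" is a context sampled at random from a different record, and that records repeating an already seen combination of contexts are dropped. The function name `add_easy_negatives` and its parameters are hypothetical; the notebook may sample or deduplicate differently.

```python
import random


def add_easy_negatives(records, n_easy=1, seed=0):
    """Append randomly sampled contexts from other records as easy negatives.

    Hypothetical helper: records whose final set of contexts duplicates
    an earlier record are dropped.
    """
    rng = random.Random(seed)
    seen = set()  # context combinations already emitted
    enriched = []
    for i, rec in enumerate(records):
        own = set(rec["contexts"])
        # Candidate pool: contexts of all other records, minus this record's own.
        pool = []
        for j, other in enumerate(records):
            if j != i:
                pool.extend(c for c in other["contexts"] if c not in own)
        pool = list(dict.fromkeys(pool))  # deduplicate, keep order
        extra = rng.sample(pool, min(n_easy, len(pool)))
        # Appending keeps the positions of the original contexts unchanged.
        contexts = list(rec["contexts"]) + extra
        key = frozenset(contexts)
        if key in seen:
            continue  # this combination of contexts already exists
        seen.add(key)
        enriched.append({**rec, "contexts": contexts})
    return enriched
```

Because the negatives are appended at the end, any index that points at the relevant context in the original list stays valid. Building the pool for every record is quadratic in the number of records, so a real run would likely precompute it once.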

Suggested Future Features

  • Dynamic Difficulty Scaling: Algorithm that adjusts the difficulty level of questions and negatives based on the model's performance, addressing the "lost in the middle" problem.
  • Advanced Negative Selection: Enhanced selection of hard and easy negatives using deep learning techniques to better simulate real-world misinformation challenges.
  • Metadata Inclusion: Augmentation of the dataset with metadata to provide context about the source and reliability of information.

Getting Started

To get started with this project, clone this repository and install the required dependencies:

git clone <repository-url>
cd <repository-name>
pip install -r requirements.txt

Then follow the steps in the notebook:

  • Download the original germanrag dataset from Hugging Face.
  • Run the Enhancement Script: Execute the script to add easy negatives and apply the uniqueness checks.
  • Prepare for Fine-Tuning: Use the provided script to format the dataset according to the specified prompt template.
  • Evaluate and Adjust: Review the enhanced dataset and adjust parameters in the script as necessary to optimize the training data.
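
As an illustration of the fine-tuning preparation step, the sketch below renders one record into a single training string. The template text, the field names (`contexts`, `question`, `answer`), and the function `format_example` are assumptions made for this example; the repository's actual prompt template is not reproduced here.

```python
# Hypothetical RAG prompt template; the project's real template may differ.
PROMPT_TEMPLATE = (
    "Beantworte die Frage nur mit Hilfe des Kontexts.\n\n"
    "Kontext:\n{context}\n\n"
    "Frage: {question}\n"
    "Antwort: {answer}"
)


def format_example(rec, template=PROMPT_TEMPLATE):
    """Render one record (assumed keys: contexts, question, answer) as text."""
    # Number the contexts so the model can tell the passages apart.
    context = "\n\n".join(f"[{i}] {c}" for i, c in enumerate(rec["contexts"]))
    return template.format(
        context=context,
        question=rec["question"],
        answer=rec["answer"],
    )


example = {
    "contexts": ["Berlin ist die Hauptstadt von Deutschland.", "Die Donau ist ein Fluss."],
    "question": "Was ist die Hauptstadt von Deutschland?",
    "answer": "Berlin",
}
print(format_example(example))
```

Keeping the template as a single module-level constant makes it easy to adjust during the "Evaluate and Adjust" step without touching the formatting logic.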

Contributors

  • henihaddad
