KoCommonGEN v2: A Benchmark for Navigating Korean Commonsense Reasoning Challenges in Large Language Models


🌠 KoCommonGEN v2

KoCommonGEN v2: A Benchmark for Navigating Korean Commonsense Reasoning Challenges in Large Language Models [ACL 2024-Findings]

Jaehyung Seo, Jaewook Lee, Chanjun Park, SeongTae Hong, Seungjun Lee and Heuiseok Lim

🏫 NLP & AI Lab, Korea University


🔥 News

  • September 27, 2023: Provided data support for the Open Ko-LLM Leaderboard
  • August 7, 2024: Dataset Release
  • August 10, 2024: Experimental Results for the New Models Added
  • August 14, 2024: Presented a research paper at ACL 2024

📊 Dataset

The KoCommonGEN v2 datasets are available on Hugging Face, where you can easily access and use them for your research and experiments.

🛠️ Installation

This repository partially adopts the evaluation methods of EleutherAI/lm-evaluation-harness (version 0.3.0) to evaluate KoCommonGEN v2.

$ git clone https://github.com/J-Seo/KoCommonGEN-V2.git
# python_requires >=3.9
$ cd KoCommonGEN-V2
$ pip install -r requirements.txt

🚀 Usage

The repository currently ships 5 few-shot examples, so --num_fewshot can be at most 5. To use a larger value, add more few-shot examples first.

$ sh test.sh
## test.sh
python3 main.py \
--model hf-causal-experimental \
--model_args pretrained="nlpai-lab/KULLM3" \
--task ko_commongen_v2 \
--device cuda:1 \
--num_fewshot 2 \
--batch_size 1 \
--output nlpai-lab/KULLM3 &

You can also use sequence-to-sequence models.

## test.sh
python3 main.py \
--model hf-seq2seq \
--model_args pretrained="google/flan-t5-xxl" \
--task ko_commongen_v2 \
--device cuda:1 \
--num_fewshot 2 \
--batch_size 1 \
--output google/flan-t5-xxl &
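
To repeat the evaluation over several few-shot settings, the command from test.sh can be assembled programmatically. The following is a minimal sketch, not part of the repository: the helper name build_eval_command, the max_fewshot parameter, and the per-run output naming are hypothetical, and only the flags shown above are used.

```python
import shlex
from typing import List, Optional


def build_eval_command(
    model_type: str,
    pretrained: str,
    num_fewshot: int,
    device: str = "cuda:1",
    batch_size: int = 1,
    output: Optional[str] = None,
    max_fewshot: int = 5,  # assumption: mirrors the 5 few-shot examples currently uploaded
) -> List[str]:
    """Assemble the main.py invocation from test.sh as an argument list."""
    if not 0 <= num_fewshot <= max_fewshot:
        raise ValueError(f"num_fewshot must be between 0 and {max_fewshot}")
    return [
        "python3", "main.py",
        "--model", model_type,
        "--model_args", f"pretrained={pretrained}",
        "--task", "ko_commongen_v2",
        "--device", device,
        "--num_fewshot", str(num_fewshot),
        "--batch_size", str(batch_size),
        "--output", output or pretrained,
    ]


if __name__ == "__main__":
    for shots in (0, 2, 5):
        cmd = build_eval_command(
            "hf-causal-experimental",
            "nlpai-lab/KULLM3",
            shots,
            # Hypothetical naming scheme to keep the runs apart.
            output=f"nlpai-lab/KULLM3_{shots}shot",
        )
        # Print the command; pass cmd to subprocess.run(cmd, check=True) to execute it.
        print(shlex.join(cmd))
```

Passing "hf-seq2seq" and "google/flan-t5-xxl" as the first two arguments produces the sequence-to-sequence variant.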

👥 Human Evaluation

We recruited 22 native Korean-speaking volunteers as human evaluators and paid them $0.8 per question.

| Model | # Evaluators | Average Score | Cohen's kappa | Krippendorff's alpha |
|---|---|---|---|---|
| Human | 22 | 0.8395 | 0.7693 | 0.7706 |
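
For readers unfamiliar with the agreement statistic above, Cohen's kappa for two raters is (p_o - p_e) / (1 - p_e), where p_o is the observed agreement and p_e is the agreement expected by chance. The sketch below implements this standard two-rater definition on made-up labels. How the score was aggregated across the 22 evaluators is not stated here, so this is an illustration and not the authors' procedure.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohens_kappa(rater_a: Sequence[Hashable], rater_b: Sequence[Hashable]) -> float:
    """Standard two-rater Cohen's kappa on paired categorical labels."""
    if len(rater_a) != len(rater_b) or len(rater_a) == 0:
        raise ValueError("both raters must label the same non-empty set of items")
    n = len(rater_a)
    # Observed agreement: fraction of items with identical labels.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of each rater's label frequencies, summed over labels.
    count_a, count_b = Counter(rater_a), Counter(rater_b)
    expected = sum(count_a[label] * count_b[label] for label in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1.0 - expected)


if __name__ == "__main__":
    # Made-up labels (1 = correct answer chosen, 0 = not), for illustration only.
    a = [1, 1, 0, 1, 0, 0, 1, 1, 0, 1]
    b = [1, 1, 0, 0, 0, 0, 1, 1, 1, 1]
    print(round(cohens_kappa(a, b), 4))
```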

🤖 Models (August 10, 2024)

Results of the 2-shot evaluation for the newly released models:

| Model | Size | Acc_norm | Stderr | Link |
|---|---|---|---|---|
| GPT-4 (June 13, 2023) | - | 0.7450 | - | - |
| Mistral-Nemo-Instruct | 12B | 0.6612 | 0.0163 | 🔗 |
| Mistral-Nemo-Base | 12B | 0.6340 | 0.0166 | 🔗 |
| Meta-Llama-3.1-8B | 8B | 0.6246 | 0.0166 | 🔗 |
| QWEN2-7B base | 7B | 0.6187 | 0.0167 | 🔗 |
| EXAONE-3.0-7.8B-Instruct | 7.8B | 0.6088 | 0.0168 | 🔗 |
| MLP-KTLim-Bllossom-8B | 8B | 0.6057 | 0.0168 | 🔗 |
| Meta-Llama-3.1-8B-Instruct | 8B | 0.6057 | 0.0168 | 🔗 |
| KULLM3 | 10.8B | 0.6033 | 0.0168 | 🔗 |
| QWEN2-7B inst | 7B | 0.5832 | 0.0170 | 🔗 |
| Gemma-2-9b-it | 9B | 0.5714 | 0.0170 | 🔗 |
| Aya-23-8B | 8B | 0.5159 | 0.0172 | 🔗 |
| Allganize-Alpha-Instruct | 8B | 0.4970 | 0.0172 | 🔗 |

As mentioned in the paper, it is possible to evaluate various models.
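
When reading the Stderr column, a rough way to judge whether two Acc_norm scores differ meaningfully is to divide their gap by the combined standard error. The sketch below does this for two rows of the table. It treats the two scores as independent, which is a simplifying assumption because all models answer the same questions; a paired test on per-item predictions would be more appropriate where those are available. The function name is hypothetical.

```python
import math


def difference_z_score(acc_a: float, stderr_a: float, acc_b: float, stderr_b: float) -> float:
    """Approximate z-score for the gap between two accuracies with known standard errors."""
    combined = math.sqrt(stderr_a ** 2 + stderr_b ** 2)
    if combined == 0.0:
        raise ValueError("at least one standard error must be non-zero")
    return (acc_a - acc_b) / combined


if __name__ == "__main__":
    # Mistral-Nemo-Instruct (0.6612, 0.0163) vs. Mistral-Nemo-Base (0.6340, 0.0166)
    z = difference_z_score(0.6612, 0.0163, 0.6340, 0.0166)
    print(f"z = {z:.2f}")  # roughly 1.17, below the usual 1.96 threshold
```

Under this approximation, a gap such as the one between the two Mistral-Nemo variants falls within sampling noise, while the gap between the top and bottom open models in the table does not.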

🇰🇷🇺🇸🇯🇵🇨🇳🇪🇸 Code-switching

The multilingual dataset consists of 99 samples for numerical commonsense reasoning, created using machine translation.

The dataset can be found at the following path: lm_eval/datasets/ko_commongen_v2/shuffled_$LANG$_1.0.jsonl.
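
As a small illustration of the path pattern above, the following sketch fills the $LANG$ slot and reads the JSONL file one record per line. The helper names and the language code "en" are assumptions made for this example; check the files in lm_eval/datasets/ko_commongen_v2 for the codes actually used, and inspect a record to see its fields.

```python
import json
from pathlib import Path
from typing import Dict, List

# Directory taken from the path given above, relative to the repository root.
DATA_DIR = Path("lm_eval/datasets/ko_commongen_v2")


def code_switching_path(lang: str, data_dir: Path = DATA_DIR) -> Path:
    """Fill the $LANG$ slot of shuffled_$LANG$_1.0.jsonl."""
    return data_dir / f"shuffled_{lang}_1.0.jsonl"


def read_jsonl(path: Path) -> List[Dict]:
    """Read a JSON Lines file, skipping blank lines."""
    with open(path, encoding="utf-8") as handle:
        return [json.loads(line) for line in handle if line.strip()]


if __name__ == "__main__":
    path = code_switching_path("en")  # "en" is a hypothetical language code
    if path.exists():
        samples = read_jsonl(path)
        print(len(samples), "samples loaded from", path)
    else:
        print("File not found:", path)
```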

You can also access the code-switching dataset on Hugging Face: nlpai-lab/ko_commongen_v2_code_switching

(The code-switching data relies on machine translation, which may result in some inaccuracies.)

If you intend to use it for evaluation, you should modify the prompt and file path in lm_eval/tasks/ko_commongen_v2.py.

📖 Citation

@inproceedings{seo2024Kocommongenv2,
    title = "KoCommonGEN v2: A Benchmark for Navigating Korean Commonsense Reasoning Challenges in Large Language Models",
    author = "Jaehyung Seo and Jaewook Lee and Chanjun Park and SeongTae Hong and Seungjun Lee and Heuiseok Lim",
    booktitle = "Findings of the Association for Computational Linguistics: ACL 2024",
    month = aug,
    year = "2024",
    address = "Bangkok, Thailand",
    publisher = "Association for Computational Linguistics",
    url = "TBD",
    doi = "TBD",
    pages = "TBD"}

🚨 Warning!

This dataset contains some instances of toxic speech.

🙏 Acknowledgement

We sincerely appreciate the dedication of Chanjun Park, Sanghoon Kim and Sunghun Kim (Sung Kim) from Upstage AI in managing one of the benchmark datasets for the Open Ko-LLM LeaderBoard.
