Comments (7)
Since the PROMPT_DICT in utils.py does not include the key prompt_no_input_retrieval, I implemented it following run_short_form.py, which is:
PROMPT_DICT = {
    "prompt_input": (
        "### Instruction:\n{instruction}\n\n### Input:\n{input}\n\n### Response:\n"
    ),
    "prompt_no_input": (
        "### Instruction:\n{instruction}\n\n### Response:\n"
    ),
    "prompt_no_input_retrieval": (
        "### Instruction:\n{instruction}\n\n### Response:\n[Retrieval]<paragraph>{paragraph}</paragraph>"
    ),
}
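For reference, a minimal sketch of how this template gets filled at inference time, using the dict above (the question and paragraph values are made up for illustration):

prompt = PROMPT_DICT["prompt_no_input_retrieval"].format(
    instruction="Who wrote The Old Man and the Sea?",
    paragraph="Ernest Hemingway wrote The Old Man and the Sea in 1951.",
)
# The retrieved paragraph is appended after the response header, wrapped in
# the [Retrieval]<paragraph>...</paragraph> special tokens Self-RAG expects.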
Hi, I just wanted to double-check: did you use meta-llama/Llama-2-7b-hf or meta-llama/Llama-2-7b?
Sorry for the missing prompt_no_input_retrieval; I mistakenly dropped it when I was refactoring the code base. For the baselines, we used different RAG prompts, as we found that some models perform much better when we place the paragraphs before the instructions. I'll update the script.
On the other hand, the number seems to be too low and I am not sure why this happens... I can rerun the evaluations this week.
Hi, thanks for your response! I used meta-llama/Llama-2-7b (converted to the Hugging Face Transformers format using the conversion script).
By the way, I also experimented with the meta-llama/Llama-2-7b-chat model, setting the prompts according to the original Llama 2 paper, like:
"my_prompt_no_input": (
"Answer these questions:\nQ: {instruction}\n\nA:\n"
),
"my_prompt_no_input_retrieval": (
"Answer these questions:\nQ: {instruction}\nReferences: {paragraph}\nA:"
)
This approach yielded more reasonable results.
I updated the prompt dict here. I'm currently rerunning the evaluations on TriviaQA and will let you know once it's done and I get the final number.
PROMPT_DICT = {
    "prompt_input": (
        "### Instruction:\n{instruction}\n\n### Input:\n{input}\n\n### Response:\n"
    ),
    "prompt_no_input": (
        "### Instruction:\n{instruction}\n\n### Response:\n"
    ),
    "prompt_no_input_retrieval": (
        "Below is an instruction that describes a task. "
        "Write a response that appropriately completes the request.\n\n"
        "### Paragraph:\n{paragraph}\n\n### Instruction:\n{instruction}\n\n### Response:"
    ),
    ...
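As a minimal illustration of the paragraph-before-instruction layout this updated template produces (the filled-in values below are hypothetical):

prompt = PROMPT_DICT["prompt_no_input_retrieval"].format(
    paragraph="Lake Titicaca lies on the border of Peru and Bolivia.",
    instruction="Which countries border Lake Titicaca?",
)
# Unlike the previous version, the retrieved paragraph now appears before
# the instruction, which was found to work better for some baseline models.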
Got it! I will use it to rerun my experiment on TQA today. Let's see if I can obtain the same final number as you did = )
Oh, one thing: you should use match as the evaluation metric for the OpenQA tasks! accuracy is used for tasks where we only have a single answer label, while match checks whether any of the gold answer strings is included in the model generation, as specified in the paper.
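For clarity, a minimal sketch of the difference between the two metrics, assuming simple lowercased comparisons (the repo's actual implementation may normalize strings differently):

def accuracy(prediction: str, gold: str) -> int:
    # Exact match against a single gold answer label.
    return int(prediction.strip().lower() == gold.strip().lower())

def match(prediction: str, golds: list) -> int:
    # 1 if any gold answer string appears anywhere in the generation.
    pred = prediction.lower()
    return int(any(g.lower() in pred for g in golds))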
I reran the evaluation with the most recent commit and got 43.4 on TriviaQA using the command below:
python run_baseline_lm.py \
--model_name meta-llama/Llama-2-7b-hf \
--input_file eval_data/triviaqa_test.jsonl \
--max_new_tokens 100 \
--metric match \
--result_fp triviaqa_llama2_7b-with_retrieval_full_test.jsonl \
--task qa \
--mode retrieval --prompt_name "prompt_no_input_retrieval" \
--download_dir /gscratch/h2lab/akari/model_cache
Let me know if you still see any issues.
Hi! I successfully reproduced similar numbers for the TQA task. Thank you so much for your assistance! I'm now ready to close this issue.