Comments (7)
Since the PROMPT_DICT in utils.py does not include the key prompt_no_input_retrieval, I implemented it following run_short_form.py, which is:
PROMPT_DICT = {
    "prompt_input": (
        "### Instruction:\n{instruction}\n\n### Input:\n{input}\n\n### Response:\n"
    ),
    "prompt_no_input": (
        "### Instruction:\n{instruction}\n\n### Response:\n"
    ),
    "prompt_no_input_retrieval": (
        "### Instruction:\n{instruction}\n\n### Response:\n[Retrieval]<paragraph>{paragraph}</paragraph>"
    ),
}
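For reference, a minimal sketch of how this template gets filled at inference time, using the dict above (the question and paragraph values are made up for illustration):

prompt = PROMPT_DICT["prompt_no_input_retrieval"].format(
    instruction="Who wrote The Old Man and the Sea?",
    paragraph="Ernest Hemingway wrote The Old Man and the Sea in 1951.",
)
# The retrieved paragraph is appended after the response header, wrapped in
# the [Retrieval]<paragraph>...</paragraph> special tokens Self-RAG expects.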
Hi, I just wanted to double-check: did you use meta-llama/Llama-2-7b-hf or meta-llama/Llama-2-7b?
Sorry for the missing prompt_no_input_retrieval; I mistakenly dropped it when I was refactoring the code base. For the baselines, we used different RAG prompts, as we found that some models perform much better when we place the paragraphs before the instructions. I'll update the script.
On the other hand, the number seems to be too low and I am not sure why this happens... I can rerun the evaluations this week.
Hi, thanks for your response! I used meta-llama/Llama-2-7b (converted to the Hugging Face Transformers format using the conversion script).
By the way, I also experimented with the meta-llama/Llama-2-7b-chat model, setting the prompts according to the original Llama 2 paper, like:
"my_prompt_no_input": (
"Answer these questions:\nQ: {instruction}\n\nA:\n"
),
"my_prompt_no_input_retrieval": (
"Answer these questions:\nQ: {instruction}\nReferences: {paragraph}\nA:"
)
This approach yielded more reasonable results.
I updated the prompt dict here. I'm currently rerunning the evaluations on TriviaQA and will let you know once it's done and I get the final number.
PROMPT_DICT = {
    "prompt_input": (
        "### Instruction:\n{instruction}\n\n### Input:\n{input}\n\n### Response:\n"
    ),
    "prompt_no_input": (
        "### Instruction:\n{instruction}\n\n### Response:\n"
    ),
    "prompt_no_input_retrieval": (
        "Below is an instruction that describes a task. "
        "Write a response that appropriately completes the request.\n\n"
        "### Paragraph:\n{paragraph}\n\n### Instruction:\n{instruction}\n\n### Response:"
    ),
    ...
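As a minimal illustration of the paragraph-before-instruction layout this updated template produces (the filled-in values below are hypothetical):

prompt = PROMPT_DICT["prompt_no_input_retrieval"].format(
    paragraph="Lake Titicaca lies on the border of Peru and Bolivia.",
    instruction="Which countries border Lake Titicaca?",
)
# Unlike the previous version, the retrieved paragraph now appears before
# the instruction, which was found to work better for some baseline models.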
Got it! I will use it to rerun my experiment on TQA today. Let's see if I can obtain the same final number as you did = )
Oh, one thing: you should use match as the evaluation metric for the OpenQA tasks! accuracy is used for tasks where we only have a single answer label, while match checks whether any of the gold answer strings is included in the model generation, as specified in the paper.
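For clarity, a minimal sketch of the difference between the two metrics, assuming simple lowercased comparisons (the repo's actual implementation may normalize strings differently):

def accuracy(prediction: str, gold: str) -> int:
    # Exact match against a single gold answer label.
    return int(prediction.strip().lower() == gold.strip().lower())

def match(prediction: str, golds: list) -> int:
    # 1 if any gold answer string appears anywhere in the generation.
    pred = prediction.lower()
    return int(any(g.lower() in pred for g in golds))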
I reran the evaluation with the most recent commit and got 43.4 on TriviaQA using the command below:
python run_baseline_lm.py \
--model_name meta-llama/Llama-2-7b-hf \
--input_file eval_data/triviaqa_test.jsonl \
--max_new_tokens 100 \
--metric match \
--result_fp triviaqa_llama2_7b-with_retrieval_full_test.jsonl \
--task qa \
--mode retrieval --prompt_name "prompt_no_input_retrieval" \
--download_dir /gscratch/h2lab/akari/model_cache
Let me know if you still see any issues.
Hi! I successfully reproduced similar numbers for the TQA task. Thank you so much for your assistance! I'm now ready to close this issue.