Comments (13)
Hi, thank you so much for reporting! Hmm, the citation recall and precision in particular look low... Let me look into this tomorrow.
from self-rag.
Hi, apologies for pinging again, but just checking in on this to see if you have any update?
Thanks so much!
from self-rag.
Sorry for my late response! I was busy with other commitments over the past two weeks. I think the issue might have been introduced by some code changes I made during refactoring, but I haven't gone through the diffs line by line yet. Do you mind if I get back to you early next week? I can also upload the model prediction file we have first if it helps!
from self-rag.
No worries! Early next week sounds good.
Yes, having access to the model outputs will be helpful for now :)
from self-rag.
Sorry for my late response! This is the link to our 7B prediction results: Google Drive
Here's the output of the asqa eval.py script.
{
"length": 29.829113924050635,
"str_em": 29.957805907172997,
"str_hit": 8.544303797468354,
"rougeLsum": 35.7030296755528,
"QA-EM": 18.568917018284107,
"QA-F1": 24.01608779257571,
"QA-Hit": 3.2700421940928273,
"mauve": 74.3314936476492,
"citation_rec": 66.96554149085794,
"citation_prec": 67.81821378340366
}
I'm still investigating the gap in the citation recall and precision, but someone just found a bug that I mistakenly introduced into our long-form QA script during refactoring, and I am currently re-running the evaluations. I'll keep you posted!
from self-rag.
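For anyone wondering why a refactoring bug in the generation script can move citation_rec and citation_prec while str_em and ROUGE stay roughly stable: in ALCE-style evaluation the answer is split into sentences, the [k] citation markers are parsed out, and each sentence is checked for entailment against the passages it cites, so any change to sentence formatting or citation markers shifts these two numbers. Below is a rough sketch of the recall side only; the entails() function is a naive word-overlap stand-in for the NLI model the real eval.py uses, and the docs format (title/text fields) is an assumption.

```python
import re

from nltk.tokenize import sent_tokenize  # pip install nltk; needs the "punkt" tokenizer data

CITE_RE = re.compile(r"\[(\d+)\]")

def entails(premise: str, hypothesis: str) -> bool:
    """Naive word-overlap stand-in for the NLI model the real eval script uses."""
    prem, hyp = set(premise.lower().split()), set(hypothesis.lower().split())
    return len(hyp & prem) / max(len(hyp), 1) > 0.5

def citation_recall(answer: str, docs: list) -> float:
    """Fraction of generated sentences that are supported by the passages they cite."""
    sents = sent_tokenize(answer)
    supported = 0
    for sent in sents:
        cited = [int(i) - 1 for i in CITE_RE.findall(sent)]  # "[1]" refers to docs[0]
        cited = [i for i in cited if 0 <= i < len(docs)]
        if not cited:
            continue  # a sentence with no (valid) citation counts as unsupported
        premise = " ".join(f"{docs[i]['title']} {docs[i]['text']}" for i in cited)
        claim = CITE_RE.sub("", sent).strip()  # drop the [k] markers before checking
        if entails(premise, claim):
            supported += 1
    return 100 * supported / max(len(sents), 1)
```

This is only meant to show where the metric is sensitive (sentence splitting, citation-marker formatting); for real numbers the ALCE eval.py discussed in this thread should be used.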
Thanks for sharing this! Please let me know whenever you have updated the long-form QA script and I will try it out again.
from self-rag.
Hello, I also encountered a similar situation when reproducing the ASQA numbers for the 13B model, where:
- "length": 27.064345991561183
- "str_em": 31.70534458509142
- "rougeLsum": 34.15415408241215
- "mauve": 56.94032421646569
- "citation_prec": 68.67088607594937
- "citation_rec": 58.35016073940123
I wonder if you could also share the 13B prediction results. Thanks a lot.
from self-rag.
Sorry for being late on this issue; I have been busy helping to wrap up some other projects and traveling over the past few weeks. I can upload the 13B results tomorrow and will take a closer look at the code base.
from self-rag.
Here are the 13B predictions (Google Drive) and results:
{
"length": 27.029535864978904,
"str_em": 31.66139240506329,
"str_hit": 8.438818565400844,
"rougeLsum": 36.0146483715914,
"QA-EM": 20.386779184247537,
"QA-F1": 26.404630941269915,
"QA-Hit": 2.9535864978902953,
"mauve": 71.59056482735427,
"citation_rec": 70.35387783805504,
"citation_prec": 71.26280892103678
}
from self-rag.
Hi, apologies for pinging, but it seems I have run into the same issue...
I would appreciate any pointers toward a possible solution!
from self-rag.
@AkariAsai Facing the same issue with ASQA citation precision and recall. Here is the diff between the author's output and my reproduced output: https://www.diffchecker.com/HLAGTddk/
from self-rag.
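For others hitting this, a quick way to localize the discrepancy without an external diff tool is to compare the generated answers in the uploaded prediction file against a local run. A minimal sketch, assuming both files are JSON lists of examples with an "output" field (the field name and the file names below are assumptions):

```python
import json

def load_examples(path: str) -> list:
    with open(path) as f:
        data = json.load(f)
    # some dumps wrap the examples in a {"data": [...]} object
    return data["data"] if isinstance(data, dict) and "data" in data else data

def diff_outputs(author_path: str, repro_path: str) -> None:
    """Report how many generations differ between the two prediction files."""
    author, repro = load_examples(author_path), load_examples(repro_path)
    assert len(author) == len(repro), "the two files contain different numbers of examples"
    changed = [i for i, (a, b) in enumerate(zip(author, repro)) if a["output"] != b["output"]]
    print(f"{len(changed)} / {len(author)} generations differ; first few indices: {changed[:10]}")

# hypothetical file names for illustration
diff_outputs("asqa_7b_author.json", "asqa_7b_reproduced.json")
```

If the generations themselves are identical and only the scores differ, the gap is probably on the evaluation side (e.g. NLI model version or sentence splitting) rather than in run_long_form_static.py.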
Hi, I was unable to reproduce the ASQA numbers for long-form generation. After evaluating the output with ALCE, I see the numbers below, which are very different from those reported in the paper:
- 'str_em': 30.05098452883263
- 'rougeLsum': 34.10838297032821
- 'mauve': 68.43516667345226
- 'citation_rec': 50.0210970464135
- 'citation_prec': 63.60759493670886
The command I used:
python run_long_form_static.py \
  --model_name selfrag/selfrag_llama2_7b \
  --ndocs 5 \
  --max_new_tokens 300 \
  --threshold 0.2 \
  --use_grounding \
  --use_utility \
  --use_seqscore \
  --task asqa \
  --input_file eval_data/asqa_eval_gtr_top100.json \
  --output_file asqa/selfrag_llama2_7b.json \
  --max_depth 7 \
  --mode always_retrieve
I have also uploaded the model output file here for your reference. Just wanted to know whether I am doing anything wrong for ASQA.
Btw, I did a sanity check by evaluating on short-form generation with PopQA and I see 55.0 for accuracy, which matches the number reported in the paper.
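As a cheap sanity check that the output file itself is in the expected shape, str_em can be recomputed without the rest of the ALCE pipeline. A rough sketch is below; it assumes each example carries the ASQA qa_pairs with short_answers plus the generated output (these field names are assumptions about the prediction-file format):

```python
import json
import re
import string

def normalize(text: str) -> str:
    """Lowercase, strip punctuation and articles, collapse whitespace (SQuAD-style)."""
    text = "".join(ch for ch in text.lower() if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def str_em_and_hit(examples: list) -> tuple:
    """str_em: mean fraction of gold short answers found verbatim in the generation.
    str_hit: fraction of questions where *all* gold short answers were found."""
    per_item, full_hits = [], []
    for ex in examples:
        out = normalize(ex["output"])
        hits = [
            any(normalize(ans) in out for ans in pair["short_answers"])
            for pair in ex["qa_pairs"]
        ]
        per_item.append(sum(hits) / len(hits))
        full_hits.append(all(hits))
    n = len(examples)
    return 100 * sum(per_item) / n, 100 * sum(full_hits) / n

with open("asqa/selfrag_llama2_7b.json") as f:  # output_file from the command above
    data = json.load(f)
examples = data["data"] if isinstance(data, dict) and "data" in data else data
print(str_em_and_hit(examples))
```

If this lands near the str_em printed by eval.py, the prediction file is being read correctly and the investigation can focus on the citation metrics.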
Hi, I would like to ask about the retrieval and evaluation settings for PopQA. I ran retrieval with the retriever and settings described in the paper, but the subsequent evaluation accuracy was only 0.42, far lower than the 0.55 reported in the paper. Could there be a problem with my setup?
Here are the retrieval and test scripts I used:
python passage_retrieval.py \
  --model_name_or_path facebook/contriever-msmarco \
  --passages psgs_w100.tsv \
  --passages_embeddings "wikipedia_embeddings/*" \
  --data INPUT_FILE \
  --output_dir OUTPUT_FILE \
  --n_docs 20
python run_short_form.py \
  --model_name ./model/models--selfrag--selfrag_llama2_7b \
  --input_file ./ret_out/my_retrieval_output2.jsonl \
  --mode adaptive_retrieval \
  --max_new_tokens 100 \
  --threshold 0.2 \
  --output_file output/out2 \
  --metric match \
  --ndocs 20 \
  --use_groundness \
  --use_utility \
  --use_seqscore \
  --dtype half > ./log/nohup.my_eval0_20 2>&1 &
from self-rag.
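Since the short-form run above uses --metric match, one quick check before digging into retrieval settings is to recompute that metric directly from the output file. A minimal sketch, assuming the output is JSONL with an "output" field for the generation and an "answers" list of gold strings (the exact field names, the lowercasing, and the output path are assumptions):

```python
import json

def match(prediction: str, golds: list) -> bool:
    """Simplified 'match' metric: does any gold answer string appear in the generation?"""
    return any(g.lower() in prediction.lower() for g in golds)

def popqa_match_accuracy(path: str, gold_key: str = "answers") -> float:
    with open(path) as f:
        rows = [json.loads(line) for line in f if line.strip()]
    hits = [match(r["output"], r[gold_key]) for r in rows]
    return 100 * sum(hits) / max(len(rows), 1)

print(popqa_match_accuracy("output/out2"))  # output_file from the run_short_form.py call above
```

If the recomputed number also comes out around 0.42, the gap most likely comes from the retrieval step rather than from the metric itself.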
> Sorry for my late response! This is the link to our 7B prediction results: Google Drive
> Here's the output of the asqa eval.py script.
> { "length": 29.829113924050635, "str_em": 29.957805907172997, "str_hit": 8.544303797468354, "rougeLsum": 35.7030296755528, "QA-EM": 18.568917018284107, "QA-F1": 24.01608779257571, "QA-Hit": 3.2700421940928273, "mauve": 74.3314936476492, "citation_rec": 66.96554149085794, "citation_prec": 67.81821378340366 }
> I'm still investigating the gap in the citation recall and precision, but someone just found a bug that I mistakenly introduced into our long-form QA script during refactoring, and I am currently re-running the evaluations. I'll keep you posted!
{
'length': 29.89873417721519,
'str_em': 30.226793248945143,
'str_hit': 8.755274261603375,
'rougeLsum': 35.75958018700113,
'QA-EM': 18.52496483825598,
'QA-F1': 24.03806388258978,
'QA-Hit': 3.2700421940928273,
'mauve': 76.23131396071514,
'citation_rec': 50.18811533052039,
'citation_prec': 63.92405063291139
}
from self-rag.