
Reproducing the ASQA numbers about self-rag (open, 13 comments)

gangiswag commented on September 18, 2024
Reproducing the ASQA numbers


Comments (13)

AkariAsai commented on September 18, 2024

Hi, thank you so much for reporting! Hmm, the citation recall and precision in particular look low... Let me check in on this tomorrow.


gangiswag commented on September 18, 2024

Hi, apologies for pinging again, but just checking in on this to see if you have any update?

Thanks so much!


AkariAsai commented on September 18, 2024

Sorry for my late response! I was busy with other commitments over the past two weeks. I think the issue might have been introduced by some code changes I made during refactoring, but I haven't gone through the diffs line by line. Do you mind if I get back to you early next week? I can also upload the model prediction file we have first if it helps!


gangiswag commented on September 18, 2024

No worries! Early next week sounds good.
Yes, having access to the model outputs will be helpful for now :)


AkariAsai commented on September 18, 2024

Sorry for my late response! This is the link to our 7B prediction results: Google Drive

Here's the output of the ASQA eval.py script.

 {
    "length": 29.829113924050635,
    "str_em": 29.957805907172997,
    "str_hit": 8.544303797468354,
    "rougeLsum": 35.7030296755528,
    "QA-EM": 18.568917018284107,
    "QA-F1": 24.01608779257571,
    "QA-Hit": 3.2700421940928273,
    "mauve": 74.3314936476492,
    "citation_rec": 66.96554149085794,
    "citation_prec": 67.81821378340366
}

I'm still investigating the gap in the citation recall and precision, but someone just found a bug in our long-form QA script that I mistakenly introduced during refactoring, and I am currently re-running the evaluations. I'll keep you posted!
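For context, the citation_rec and citation_prec numbers come from ALCE-style NLI entailment checks. Below is a minimal sketch of the idea (simplified, not the exact eval.py logic; entails stands in for the NLI model call, e.g. a TRUE-style T5 model):

# Simplified illustration of ALCE-style citation recall/precision
# (not the exact eval.py logic). entails(premise, claim) -> bool stands in
# for an NLI model call.

def citation_metrics(statements, entails):
    """statements: list of (sentence, [cited_passage_texts]) pairs."""
    recall_hits, prec_hits, n_citations = 0, 0, 0
    for sent, cited in statements:
        # Recall: the concatenation of all cited passages should entail the
        # sentence; a sentence with no citations counts as unsupported.
        full_support = bool(cited) and entails(" ".join(cited), sent)
        recall_hits += int(full_support)
        # Precision (simplified): a citation counts if it entails the sentence
        # on its own, or if dropping it would break full support.
        for i, passage in enumerate(cited):
            n_citations += 1
            rest = cited[:i] + cited[i + 1:]
            alone = entails(passage, sent)
            needed = full_support and not (rest and entails(" ".join(rest), sent))
            prec_hits += int(alone or needed)
    return {
        "citation_rec": 100 * recall_hits / max(len(statements), 1),
        "citation_prec": 100 * prec_hits / max(n_citations, 1),
    }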


gangiswag commented on September 18, 2024

Thanks for sharing this! Please let me know whenever you have updated the long-form QA script and I will try it out again.


XuLingnan commented on September 18, 2024

Hello, I also encountered a similar situation when reproducing the ASQA numbers for the 13B model, where:

  • "length": 27.064345991561183
  • "str_em": 31.70534458509142
  • "rougeLsum": 34.15415408241215
  • "mauve": 56.94032421646569
  • "citation_prec": 68.67088607594937
  • "citation_rec": 58.35016073940123

I wonder if you could also share the 13B prediction results. Thanks a lot.


AkariAsai commented on September 18, 2024

Sorry for the delay on this issue; I have been busy helping to wrap up some other projects and traveling over the past few weeks. I can upload the 13B results tomorrow and will take a closer look at the code base.


AkariAsai commented on September 18, 2024

Here are the 13B predictions (Google Drive) and results:

 {
    "length": 27.029535864978904,
    "str_em": 31.66139240506329,
    "str_hit": 8.438818565400844,
    "rougeLsum": 36.0146483715914,
    "QA-EM": 20.386779184247537,
    "QA-F1": 26.404630941269915,
    "QA-Hit": 2.9535864978902953,
    "mauve": 71.59056482735427,
    "citation_rec": 70.35387783805504,
    "citation_prec": 71.26280892103678
}
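For anyone comparing a reproduction against these numbers, a quick way to see where the gap concentrates is to print per-metric deltas. The values below are the 13B results above versus the 13B reproduction reported earlier in this thread, rounded to two decimals:

# Per-metric delta between the reported 13B numbers and the reproduction
# posted earlier in this thread (values rounded).
reported = {"str_em": 31.66, "rougeLsum": 36.01, "mauve": 71.59,
            "citation_rec": 70.35, "citation_prec": 71.26}
reproduced = {"str_em": 31.71, "rougeLsum": 34.15, "mauve": 56.94,
              "citation_rec": 58.35, "citation_prec": 68.67}

for key in reported:
    delta = reproduced[key] - reported[key]
    print(f"{key:14s} reported={reported[key]:6.2f} "
          f"reproduced={reproduced[key]:6.2f} delta={delta:+6.2f}")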


Jack-ZC8 commented on September 18, 2024

Hi, apologies for pinging, but it seems I have run into the same issue...
I would appreciate any pointers toward a possible solution!


ShayekhBinIslam commented on September 18, 2024

@AkariAsai Facing the same issue with ASQA citation precision and recall. Here is the diff between the author output and reproduced output: https://www.diffchecker.com/HLAGTddk/ .
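In case it helps to check this locally, here is a rough sketch for counting how many generations differ between two prediction files. It assumes ALCE-style JSON with a top-level "data" list whose items carry an "output" field, and the filenames are placeholders, so adjust keys and paths to your files:

# Rough local alternative to diffchecker: count differing outputs between two
# prediction files. Assumes ALCE-style JSON ({"data": [{"output": ...}, ...]})
# or a bare list of such items; adjust keys/paths to match your files.
import json

def load_outputs(path):
    with open(path) as f:
        data = json.load(f)
    items = data.get("data", data) if isinstance(data, dict) else data
    return [item.get("output", "") for item in items]

author = load_outputs("asqa_author_preds.json")          # placeholder filename
reproduced = load_outputs("asqa_reproduced_preds.json")  # placeholder filename

diffs = [i for i, (a, b) in enumerate(zip(author, reproduced)) if a.strip() != b.strip()]
print(f"{len(diffs)}/{len(author)} outputs differ; first few indices: {diffs[:5]}")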


Zg-Serein commented on September 18, 2024

> Hi, I was unable to reproduce the ASQA numbers for long-form generation. After evaluating the output with ALCE, I see the below numbers which are very different from those reported in the paper:
>
>   • 'str_em': 30.05098452883263
>   • 'rougeLsum': 34.10838297032821
>   • 'mauve': 68.43516667345226
>   • 'citation_rec': 50.0210970464135
>   • 'citation_prec': 63.60759493670886
>
> The command I used:
>
> python run_long_form_static.py \
>     --model_name selfrag/selfrag_llama2_7b --ndocs 5 --max_new_tokens 300 \
>     --threshold 0.2 --use_grounding --use_utility --use_seqscore --task asqa \
>     --input_file eval_data/asqa_eval_gtr_top100.json \
>     --output_file asqa/selfrag_llama2_7b.json --max_depth 7 --mode always_retrieve
>
> I have also uploaded the model output file here for your reference. Just wanted to know whether I am doing anything wrong for ASQA.
>
> Btw, I did a sanity check by evaluating on short-form generation with PopQA and I see 55.0 for accuracy, which matches the number reported in the paper.

Hi, I would like to ask about the retrieval and evaluation settings for PopQA. I ran retrieval with the setup described in the paper, but my subsequent evaluation accuracy was only 0.42, far below the 0.55 reported in the paper. Could there be a problem with my setup?
Here are the retrieval and evaluation scripts I used:

python passage_retrieval.py \
    --model_name_or_path facebook/contriever-msmarco --passages psgs_w100.tsv \
    --passages_embeddings "wikipedia_embeddings/*" \
    --data INPUT_FILE \
    --output_dir OUTPUT_FILE \
    --n_docs 20

python run_short_form.py \
    --model_name ./model/models--selfrag--selfrag_llama2_7b \
    --input_file ./ret_out/my_retrieval_output2.jsonl \
    --mode adaptive_retrieval --max_new_tokens 100 \
    --threshold 0.2 \
    --output_file output/out2 \
    --metric match --ndocs 20 --use_groundness --use_utility --use_seqscore \
    --dtype half > ./log/nohup.my_eval0_20 2>&1 &
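For reference, my understanding is that the short-form "match" accuracy is essentially a substring check of the gold answers against the generation, roughly as sketched below (simplified; the repo's own metrics code may normalize differently):

# Simplified sketch of a substring-"match" accuracy for short-form eval
# (illustrative only; the repo's metric may differ in normalization).

def match(prediction, gold_answers):
    """Return 1 if any gold answer string appears in the prediction."""
    pred = prediction.lower()
    return int(any(g.lower() in pred for g in gold_answers))

# toy example
examples = [("George Orwell wrote it.", ["George Orwell", "Eric Arthur Blair"]),
            ("I am not sure.", ["George Orwell"])]
accuracy = sum(match(p, g) for p, g in examples) / len(examples)
print(f"accuracy = {accuracy:.2f}")  # 0.50 for this toy example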


aiden-leong commented on September 18, 2024

(Quoting AkariAsai's earlier comment with the 7B prediction results and eval.py output.)

Here is the eval.py output I get for comparison:

{
    'length': 29.89873417721519, 
    'str_em': 30.226793248945143, 
    'str_hit': 8.755274261603375, 
    'rougeLsum': 35.75958018700113, 
    'QA-EM': 18.52496483825598, 
    'QA-F1': 24.03806388258978, 
    'QA-Hit': 3.2700421940928273, 
    'mauve': 76.23131396071514, 
    'citation_rec': 50.18811533052039, 
    'citation_prec': 63.92405063291139
}


