
madaan / self-refine

LLMs can generate feedback on their work, use it to improve the output, and repeat this process iteratively.

Home Page: https://selfrefine.info

License: Apache License 2.0

Python 74.96% Jupyter Notebook 25.04%
few-shot-learning language-generation large-language-models llms prompting reasoning chatgpt gpt-35 gpt-4 prompts

self-refine's People

Contributors

madaan, majumderb, prakharguptaz, shallinan1

self-refine's Issues

Instructions on PIE Evaluation

Hi,

Thank you for your fantastic work!

It seems that the instructions for running the PIE evaluation are missing. Could you provide instructions on how to use the pie_eval.py script? I'm particularly unsure how to obtain the .report file. Thanks!

CommonGen-Hard dataset

Will you release the CommonGen-Hard dataset soon? It looks like it has 20-30 concepts per sentence. I'm very curious about this dataset.

Codex discontinued

Hello. For code-related tasks, do you plan to update your code to replace Codex with another model? Do you have a suggestion for an alternative model, and do you plan to push the updated code?
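
For anyone looking at the same problem, this is a rough sketch of swapping a Codex completion call for a chat model, assuming the pre-1.0 openai Python package; the repo may wrap its API calls differently:

import openai

def complete(prompt: str, use_chat_model: bool = True) -> str:
    # Codex (code-davinci-002) was a completion model; gpt-3.5-turbo / gpt-4
    # are chat models, so the prompt moves into a messages list.
    if use_chat_model:
        response = openai.ChatCompletion.create(
            model="gpt-3.5-turbo",
            messages=[{"role": "user", "content": prompt}],
            temperature=0.0,
        )
        return response["choices"][0]["message"]["content"]
    response = openai.Completion.create(
        engine="code-davinci-002",  # discontinued; shown only for comparison
        prompt=prompt,
        max_tokens=512,
        temperature=0.0,
    )
    return response["choices"][0]["text"]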

Questions about initial generation

I want to use self-refine for a reasoning task, such as open-book QA.
Regarding the few-shot examples for the initial generation: do the examples have to be bad examples?
If I have good examples, could I use them for the initial stage and hope that, through iterations, the output gets even better?
However, if I were to use already good examples, wouldn't it be tough to come up with even better ones for the few-shot examples in the refine stage?
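
To make the question concrete, this is roughly how I picture the two prompt stages; the strings below are purely illustrative and are not the actual prompts from this repo:

# Initial-generation exemplars can already be good answers.
INIT_PROMPT = """Question: <example question>
Answer: <a good answer>

Question: {question}
Answer:"""

# Refine exemplars pair an answer with feedback and an improved answer,
# which is where "even better" examples would be needed.
REFINE_PROMPT = """Question: <example question>
Answer: <an initial answer>
Feedback: <what could be improved>
Improved answer: <a better answer>

Question: {question}
Answer: {answer}
Feedback: {feedback}
Improved answer:"""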

PIE run.py

Hi All,

Thank you for your lovely work.

There is no run.py in the PIE folder.

Thank you

Codebase Completeness

Great work! We are currently trying to reproduce your results so that we can build on top of your insights. I see that you are still working on this codebase. Are there benchmarks that are already fully implemented in this repository and that you do not intend to change in the near future? When I run the different benchmarks I sometimes run into errors, and I am not sure whether the error is on my side or whether some functionality is simply missing.

For example: when I run the CommonGen benchmark on a reduced test set, I observe a lot of errors in the output file. The main source of errors seems to be that the feedback from GPT does not have the intended structure, which causes exceptions in the code. I did not change any of the training prompts/instruction prompts you provide in this file and use "gpt-3.5-turbo". Did you also observe this behaviour?

I also noticed two different things, which I thought I would notify you about:

  • Compared to the other benchmarks, you do not include the refinement history in the GSM-8k benchmark; you only use the training prompt, the instruction prompt, and the current iteration. Is this intended?
  • The run.py file is missing for the sentiment_reversal benchmark, and I can't find the training data for the Code Readability Improvement task.

Thanks a lot!

Licence Request

Could this repo get an MIT or Apache license so that anyone is fully free to take and adapt the work found here and innovate further?

GSM8K performance difference issue

In the appendix, the original PAL with ChatGPT is around 74%.

But how come the initial accuracy is only 71% in Self-Refine? I was expecting the initial accuracy to be the same.

Code optimization

Hello, I am really interested in your fine work and am trying to reproduce the results!

Can you share the prompts and examples for code optimization?
I am having a hard time reproducing the code optimization results.
I have an additional question regarding code optimization in your paper. Specifically, I'm interested in how you calculated the percentage of programs that were optimized. When reproducing the results, I noticed that some programs performed worse, while others actually showed improvement.
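
For context, this is roughly how one might compute such a percentage from per-program runtimes; the speedup threshold and the example numbers are my own assumptions, not the paper's definition:

def percent_optimized(runtimes_before, runtimes_after, min_speedup=1.1):
    # Fraction of programs whose refined version runs at least
    # min_speedup times faster than the original (threshold is a guess).
    assert len(runtimes_before) == len(runtimes_after)
    optimized = sum(
        1 for before, after in zip(runtimes_before, runtimes_after)
        if after > 0 and before / after >= min_speedup
    )
    return 100.0 * optimized / len(runtimes_before)

# Example: 2 of 3 programs got at least 1.1x faster -> 66.7%
print(percent_optimized([1.0, 2.0, 0.5], [0.6, 1.9, 0.3]))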

Missing File in Sentiment Reversal

In the Yelp benchmark, the file for the task_measure is missing, i.e., the class SentimentTransferMeasurement can't be found. Can you upload this file? Thanks!

Trying to understand GSM code

Hello! First of all this is a super nice paper.

I am trying to wrap my head around the concept of the paper. What I don't understand is this:
No matter what the output from the LM is, the LM is prompted again with the same question and the generated code/text (code + comments) until the LM itself says "it is correct", with a maximum of max_attempts per question?
The paper reports improvements over 5 iterations, so if the model outputs "it is correct", is the same output used for the next iteration? I just want to make sure I understood this correctly.
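
In pseudocode, my understanding of the loop is roughly the following; this is a sketch based on the paper's description, not the repo's actual code, and max_attempts and the stop phrase are taken from the question above:

def self_refine(question, init, feedback, refine, max_attempts=4):
    solution = init(question)                      # initial generation
    for _ in range(max_attempts):
        fb = feedback(question, solution)          # model critiques its own output
        if "it is correct" in fb.lower():          # stop once the feedback says so
            break                                  # later iterations would reuse this output
        solution = refine(question, solution, fb)  # otherwise refine using the feedback
    return solution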

Releasing the Yelp Dataset

Amazing work! We are interested in building on this work. Are there any plans to release the long-form sentiment reversal Yelp dataset used in this work? Thanks!

IndexError in src/gsm/feedback.py

Hello! I'm running python -u src/gsm/run.py with "gpt-3.5-turbo" and get the following error. This is happening because "def solution():" is not in entire_output in feedback.py.

  • Question 1: Did this error also happen with Codex? I'm wondering if this is because ChatGPT does not always follow the input exemplars' format perfectly.
  • Question 2: Even with these errors, src/gsm/run.py keeps running. Should I just ignore them? I'm hoping to obtain results that are close to or better than the GSM results in your paper.
1%|▊                                  | 8/1319 [03:04<8:42:28, 23.91s/it]

An error occurred: list index out of range. Traceback (most recent call last):
  File "/home/ubuntu/code/hideodeo/self-refine/src/utils.py", line 39, in wrapper
    return func(*args, **kwargs)
  File "/home/ubuntu/code/hideodeo/self-refine/src/gsm/run.py", line 40, in iterative_gsm
    fb_and_maybe_soln = task_feedback(solution=solution)
  File "/home/ubuntu/code/hideodeo/self-refine/src/gsm/feedback.py", line 42, in __call__
    solution = entire_output.split("def solution():")[1]
IndexError: list index out of range
. Left retries: 2.
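
For reference, a minimal guard around the failing line (a sketch of a possible workaround, not the repo's official fix) could look like this:

# In src/gsm/feedback.py, around line 42: split() only yields a second element
# when the marker is present, so check for it before indexing.
marker = "def solution():"
if marker in entire_output:
    solution = entire_output.split(marker, 1)[1]
else:
    # ChatGPT did not follow the exemplar format; fall back to the raw output
    # (or skip / retry this example) instead of raising an IndexError.
    solution = entire_output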
