
madaan / self-refine

LLMs can generate feedback on their work, use it to improve the output, and repeat this process iteratively.

Home Page: https://selfrefine.info

License: Apache License 2.0

Python 74.96% Jupyter Notebook 25.04%
few-shot-learning language-generation large-language-models llms prompting reasoning chatgpt gpt-35 gpt-4 prompts

self-refine's People

Contributors

madaan, majumderb, prakharguptaz, shallinan1

self-refine's Issues

Instructions on PIE Evaluation

Hi,

Thank you for your fantastic work!

It seems that the instructions for running the PIE evaluation are missing. Could you provide instructions on how to use the pie_eval.py script? I'm particularly unsure how to obtain the .report file. Thanks!

CommonGen-Hard dataset

Will you release the CommonGen-Hard dataset soon? It looks like it has 20-30 concepts per sentence. I'm very curious about this dataset.

Codex discontinued

Hello. For code-related tasks, do you plan to update your code to replace Codex with another model? Do you have a suggestion for an alternative model, and do you plan to push the updated code?
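
For anyone looking at the same problem, this is a rough sketch of swapping a Codex completion call for a chat model, assuming the pre-1.0 openai Python package; the repo may wrap its API calls differently:

import openai

def complete(prompt: str, use_chat_model: bool = True) -> str:
    # Codex (code-davinci-002) was a completion model; gpt-3.5-turbo / gpt-4
    # are chat models, so the prompt moves into a messages list.
    if use_chat_model:
        response = openai.ChatCompletion.create(
            model="gpt-3.5-turbo",
            messages=[{"role": "user", "content": prompt}],
            temperature=0.0,
        )
        return response["choices"][0]["message"]["content"]
    response = openai.Completion.create(
        engine="code-davinci-002",  # discontinued; shown only for comparison
        prompt=prompt,
        max_tokens=512,
        temperature=0.0,
    )
    return response["choices"][0]["text"]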

Questions about initial generation

I want to use self-refine for a reasoning task, such as open-book QA.
Regarding the few-shot examples for the initial generation: do the examples have to be bad examples?
If I have good examples, could I use them for the initial stage and hope that, through iterations, the output gets even better?
However, if I were to use already good examples, wouldn't it be tough to come up with even better ones for the few-shot examples in the refine stage?
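
To make the question concrete, this is roughly how I picture the two prompt stages; the strings below are purely illustrative and are not the actual prompts from this repo:

# Initial-generation exemplars can already be good answers.
INIT_PROMPT = """Question: <example question>
Answer: <a good answer>

Question: {question}
Answer:"""

# Refine exemplars pair an answer with feedback and an improved answer,
# which is where "even better" examples would be needed.
REFINE_PROMPT = """Question: <example question>
Answer: <an initial answer>
Feedback: <what could be improved>
Improved answer: <a better answer>

Question: {question}
Answer: {answer}
Feedback: {feedback}
Improved answer:"""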

PIE run.py

Hi All,

Thank you for your lovely work.

There is no run.py in the PIE folder.

Thank you

Codebase Completeness

Great work! We are currently trying to reproduce your results so that we can build on top of your insights. I see that you are still working on this codebase. Are there benchmarks that are already fully implemented in this repository and that you do not intend to change in the near future? When I run the different benchmarks I sometimes run into errors, and I am not sure whether the error is on my side or whether some functionality is simply missing.

For example: when I run the CommonGen benchmark on a reduced test set, I observe a lot of errors in the output file. The main source of errors seems to be that the feedback from GPT does not have the intended structure, which causes exceptions in the code. I did not change any of the training prompts/instruction prompts you provide in this file and use "gpt-3.5-turbo". Did you also observe this behaviour?

I also noticed two different things, which I thought I would notify you about:

  • Compared to the other benchmarks, you do not include the refinement history in the GSM-8k benchmark; you only use the training prompt, the instruction prompt, and the current iteration. Is this intended?
  • The run.py file is missing for the sentiment_reversal benchmark, and I can't find the training data for the Code Readability Improvement task.

Thanks a lot!

Licence Request

Could this repo get an MIT or Apache license so that anyone is fully free to take and adapt the work found here and innovate further?

GSM8K performance difference issue

In the appendix, the original PAL with ChatGPT is around 74%.

But how come the initial accuracy is only 71% in Self-Refine? I was expecting the initial accuracy to be the same.

Code optimization

Hello, I am really interested in your fine work and am trying to reproduce the results!

Can you share the prompts and examples for code optimization?
I am having a hard time reproducing the code optimization results.
I have an additional question regarding code optimization in your paper. Specifically, I'm interested in how you calculated the percentage of programs that were optimized. When reproducing the results, I noticed that some programs performed worse, while others actually showed improvement.
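
For context, this is roughly how one might compute such a percentage from per-program runtimes; the speedup threshold and the example numbers are my own assumptions, not the paper's definition:

def percent_optimized(runtimes_before, runtimes_after, min_speedup=1.1):
    # Fraction of programs whose refined version runs at least
    # min_speedup times faster than the original (threshold is a guess).
    assert len(runtimes_before) == len(runtimes_after)
    optimized = sum(
        1 for before, after in zip(runtimes_before, runtimes_after)
        if after > 0 and before / after >= min_speedup
    )
    return 100.0 * optimized / len(runtimes_before)

# Example: 2 of 3 programs got at least 1.1x faster -> 66.7%
print(percent_optimized([1.0, 2.0, 0.5], [0.6, 1.9, 0.3]))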

Missing File in Sentiment Reversal

In the Yelp benchmark, the file for the task_measure is missing, i.e., the class SentimentTransferMeasurement can't be found. Can you upload this file? Thanks!

Trying to understand GSM code

Hello! First of all this is a super nice paper.

I am trying to wrap my head around the concept of the paper. What I don't understand is this:
No matter what the output from the LM is, the LM is prompted again with the same question and the generated code/text (code + comments) until the LM itself says "it is correct", with a maximum of max_attempts per question?
The paper reports improvements over 5 iterations, so if the model outputs "it is correct", is the same output used for the next iteration? I just want to make sure I understood this correctly.
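
In pseudocode, my understanding of the loop is roughly the following; this is a sketch based on the paper's description, not the repo's actual code, and max_attempts and the stop phrase are taken from the question above:

def self_refine(question, init, feedback, refine, max_attempts=4):
    solution = init(question)                      # initial generation
    for _ in range(max_attempts):
        fb = feedback(question, solution)          # model critiques its own output
        if "it is correct" in fb.lower():          # stop once the feedback says so
            break                                  # later iterations would reuse this output
        solution = refine(question, solution, fb)  # otherwise refine using the feedback
    return solution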

Releasing the Yelp Dataset

Amazing work! We are interested in building on this work. Are there any plans to release the long-form sentiment reversal Yelp dataset used in this work? Thanks!

IndexError in src/gsm/feedback.py

Hello! I'm running python -u src/gsm/run.py with "gpt-3.5-turbo" and get the following error. This is happening because "def solution():" is not in entire_output in feedback.py.

  • Question 1: Did this error also happen with Codex? I'm wondering if this is because ChatGPT does not always follow the input exemplars' format perfectly.
  • Question 2: Even with these errors, src/gsm/run.py keeps running. Should I just ignore them? I'm hoping to obtain results that are close to or better than the GSM results in your paper.
1%|▊                                  | 8/1319 [03:04<8:42:28, 23.91s/it]

An error occurred: list index out of range. Traceback (most recent call last):
  File "/home/ubuntu/code/hideodeo/self-refine/src/utils.py", line 39, in wrapper
    return func(*args, **kwargs)
  File "/home/ubuntu/code/hideodeo/self-refine/src/gsm/run.py", line 40, in iterative_gsm
    fb_and_maybe_soln = task_feedback(solution=solution)
  File "/home/ubuntu/code/hideodeo/self-refine/src/gsm/feedback.py", line 42, in __call__
    solution = entire_output.split("def solution():")[1]
IndexError: list index out of range
. Left retries: 2.
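
For reference, a minimal guard around the failing line (a sketch of a possible workaround, not the repo's official fix) could look like this:

# In src/gsm/feedback.py, around line 42: split() only yields a second element
# when the marker is present, so check for it before indexing.
marker = "def solution():"
if marker in entire_output:
    solution = entire_output.split(marker, 1)[1]
else:
    # ChatGPT did not follow the exemplar format; fall back to the raw output
    # (or skip / retry this example) instead of raising an IndexError.
    solution = entire_output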
