yizhongw / self-instruct
Aligning pretrained language models with instruction data generated by themselves.
License: Apache License 2.0
Hello Community,
I keep getting the same error. What does this error mean?
usage: generate_instances.py [-h] --batch_dir BATCH_DIR
[--input_file INPUT_FILE]
[--output_file OUTPUT_FILE]
[--num_instructions NUM_INSTRUCTIONS]
[--max_instances_to_generate MAX_INSTANCES_TO_GENERATE]
[--generation_tasks_only]
[--classification_tasks_only] [--engine ENGINE]
[--request_batch_size REQUEST_BATCH_SIZE]
[--api_key API_KEY] [--organization ORGANIZATION]
generate_instances.py: error: the following arguments are required: --batch_dir
THX!
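For anyone hitting the same thing: the usage message shows that --batch_dir is a required argument, so the script exits until you pass it. A minimal invocation might look like this (the directory path is just an example):

    python generate_instances.py --batch_dir data/gpt3_generations/ --engine davinci --api_key YOUR_API_KEY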
Sorry, I may not have looked into your code carefully, but could you please show me where you put the ROUGE-L implementation? Thanks.
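In case it helps: the paper filters out any newly generated instruction whose ROUGE-L similarity to an existing one exceeds 0.7, and one way to compute that in Python is the rouge_score package. A minimal sketch of such a filter (the package choice and function names here are my illustration, not necessarily the repo's exact code):

    from rouge_score import rouge_scorer

    # ROUGE-L is based on the longest common subsequence of the two strings.
    scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=False)

    def is_too_similar(candidate, existing_instructions, threshold=0.7):
        # Reject the candidate if it is a near-duplicate of anything in the pool.
        for instruction in existing_instructions:
            if scorer.score(instruction, candidate)["rougeL"].fmeasure > threshold:
                return True
        return False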
You're welcome to use our project!
Hi~
Why limit the number of instructions to 52K? Will the model be better if we have more instructions?
I see the API code: https://github.com/yizhongw/self-instruct/blob/main/self_instruct/gpt3_api.py
So did you train GPT-3 on your own GPUs, or how does it work?
Thank you very much!
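For context: Self-Instruct does not train GPT-3 on local GPUs; gpt3_api.py just sends prompts to OpenAI's hosted completion endpoint, and, as far as I know, the fine-tuning also goes through OpenAI's API rather than local hardware. A rough sketch of such a call with the legacy openai Python client (parameter values are illustrative):

    import openai

    openai.api_key = "YOUR_API_KEY"  # placeholder

    # Legacy completion endpoint, as used around the time of this repo.
    response = openai.Completion.create(
        engine="davinci",
        prompt="Come up with a series of tasks:\n1.",
        max_tokens=1024,
        temperature=0.7,
    )
    print(response["choices"][0]["text"])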
Greetings!
Excellent code! I saw a few grammatical errors in some of your code that I figured I'd share with you.
on Prep:
Line 32 - Word misspelled. ign instead of ing.
on GPT:
Line 43 - There is a word capitalized after a comma.
Lines 74 and 79 - 'gpt' is lowercase while the other instances are uppercase.
on Bootstrap:
Line 116 - Uses 'GPT-3'; however, other instances in your code refer to it as 'GPT3'.
Line 121 - Missing quotes around the referenced variable.
on CLF:
Line 55 - Missing quotes around the referenced variable.
These are trivial in nature, as they don't interfere with your code, but I figured you might want some uniformity.
Regards,
Atlas
Why set the number of seed tasks to 175? How did the number of seed tasks affect the final results, including the quality of the generated instructions and the performance of the instruction-tuned model?
I have recently been considering generating more domain-specific instructions. The number of seed tasks would presumably be smaller, and their content (or format) more uniform. Is there anything I should pay attention to if I craft the seed task set myself, for example its size and content? And do you think models tuned on domain-specific instructions will do better in that domain?
Thanks a lot, wish you a good day :).
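If you craft your own seed set, it may help to mirror the format of the repo's data/seed_tasks.jsonl. From my reading of the repo, each line is a JSON object shaped roughly like the following (double-check the exact field names against the actual file):

    {
        "id": "seed_task_0",
        "name": "clinical_note_summarization",
        "instruction": "Summarize the following clinical note in one sentence.",
        "instances": [{"input": "Patient reports mild headache since Monday...", "output": "The patient has had a mild headache since Monday."}],
        "is_classification": false
    }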
I found that in the fine-tuning data, all outputs have <|endoftext|> at the end. Is it used as an end-of-sequence marker?
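For reference: <|endoftext|> is the end-of-text token of GPT-style tokenizers, so appending it marks where a completion should stop; at generation time the same string can be passed as a stop sequence. A tiny illustration of both sides (my example, not the repo's exact code):

    # Fine-tuning data: append the end-of-text marker to every target output.
    record = {
        "prompt": "Instruction: Translate 'bonjour' to English.\nOutput:",
        "completion": " hello<|endoftext|>",
    }

    # At inference time, the same string serves as the stop sequence,
    # e.g. stop=["<|endoftext|>"] in the completion request.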
The paper is not clear to me. If I have a human-written instruction seed, what is the process to create a single new instruction from this single seed?
In addition, the repository says "generated by themselves", but it is not by themselves; it's done by calling a third-party API.
Thanks for sharing!
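To summarize the process as I understand it from the paper: a new instruction is not derived from a single seed in isolation. Instead, eight instructions are sampled as in-context examples (six from the human-written seed pool, two from previously generated ones), formatted as a numbered list, and the model is prompted to continue the list. A rough sketch of that prompt construction (function and variable names are mine):

    import random

    def build_prompt(seed_pool, generated_pool, num_seed=6, num_generated=2):
        # In-context examples: mostly human-written seeds, a few model outputs.
        examples = random.sample(seed_pool, num_seed)
        if len(generated_pool) >= num_generated:
            examples += random.sample(generated_pool, num_generated)
        random.shuffle(examples)

        lines = ["Come up with a series of tasks:"]
        for i, instruction in enumerate(examples, start=1):
            lines.append(f"{i}. {instruction}")
        # The model continues the numbered list with a new instruction.
        lines.append(f"{len(examples) + 1}.")
        return "\n".join(lines)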
I want to run "self-instruct" on my Chinese data, but I can't call the OpenAI API. I'd like to use an existing model (such as LLaMA 2) to run "self-instruct" locally, offline. How do I modify the code? Or do you have any suggestions?
Thanks!
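One approach is to replace the request function in gpt3_api.py with a local Hugging Face model; the rest of the bootstrapping loop mostly needs strings in and strings out. A rough sketch (the model name and generation settings are illustrative, and you would still need to adapt the output parsing and any Chinese-specific filtering):

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_name = "meta-llama/Llama-2-7b-hf"  # example choice; gated, needs access
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(
        model_name, torch_dtype=torch.float16, device_map="auto"
    )

    def complete(prompt, max_new_tokens=512, temperature=0.7):
        # Drop-in stand-in for the OpenAI completion call:
        # takes a prompt string and returns the generated continuation.
        inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
        outputs = model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            do_sample=True,
            temperature=temperature,
        )
        new_tokens = outputs[0][inputs["input_ids"].shape[1]:]
        return tokenizer.decode(new_tokens, skip_special_tokens=True)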
I have my data as a bundle of PDFs, documents, etc. Is there any way to extract text from them and generate an instruction dataset using Self-Instruct?
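Self-Instruct itself starts from seed tasks rather than raw documents, but you could extract the text first and then write seed tasks grounded in it. A minimal extraction sketch using the pypdf package (library choice is mine):

    from pypdf import PdfReader

    def extract_text(pdf_path):
        # Concatenate the extracted text of every page in the PDF.
        reader = PdfReader(pdf_path)
        return "\n".join(page.extract_text() or "" for page in reader.pages)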
As I understand Figure 5 in your paper, you further fine-tuned GPT-SelfInstruct on the SuperNaturalInstructions data, and surprisingly the results got worse compared to the "vanilla" GPT-SelfInstruct.
Is my understanding correct? If so, do you have any hypotheses as to why a high-quality, human-annotated dataset as additional fine-tuning data worsened the overall performance?
Is there any reason for using 6 human-written and 2 model-generated instructions as context? Do these two hyperparameters, and also the diversity of sampled instruction types (e.g., NER, classification, generation), make any difference?