google-research / flan Goto Github PK

License: Apache License 2.0

Python 99.97% Shell 0.03%

flan's Introduction

The FLAN Instruction Tuning Repository

Original Flan (2021) | The Flan Collection (2022) | Flan 2021 Citation | License

This repository contains code to generate instruction tuning dataset collections. The first is the original Flan 2021, documented in Finetuned Language Models are Zero-Shot Learners, and the second is the expanded version, called the Flan Collection, described in The Flan Collection: Designing Data and Methods for Effective Instruction Tuning and used to produce Flan-T5 and Flan-PaLM.

Flan 2021

To generate the Flan 2021 data as Seqio mixtures, first install the relevant requirements.txt then use mixtures.py.

Flan 2021 Citation

Please cite the following if you found Flan 2021 useful in your research.

@inproceedings{weifinetuned,
  title={Finetuned Language Models are Zero-Shot Learners},
  author={Wei, Jason and Bosma, Maarten and Zhao, Vincent and Guu, Kelvin and Yu, Adams Wei and Lester, Brian and Du, Nan and Dai, Andrew M and Le, Quoc V},
  booktitle={International Conference on Learning Representations}
}

License

The code in this repository is licensed according to the LICENSE file.

Contact Us

To contact us feel free to create an Issue in this repository, or email the respective authors that contributed to this code base: Jason Wei for the Flan 2021 paper, Le Hou for the Scaling Flan paper, and Shayne Longpre for the Flan Collection.

flan's People

Contributors

Stargazers

Watchers

Forkers

liujie40 sombochea techthiyanes olbychos jaedukseo stjordanis awesome-archive purn3ndu tawawhite mitcheccles makeesyai patelrajnath ai-natural-language-processing-lab saulocatharino apsarageek isabella232 marcus-arcadius cnjelita hanjanghoon usaiprashanth kevinpl07 xuqiao-algo zphang tanglespace kmswin1 yetiansh stat-eklee eleutherai xiaoqingnlp ppasupat evelynmitchell shayne-longpre fnan jayelm feyzaakyurek seungonekim cdxeve teargosling taidnguyen chenyu-jiang mitzen vinhjaxt bigbird01 henrywoo jadentan theexgenesis cyzlovedream wangpeiyi9979 byshev333 sirneural samuli dan255 anotherchudov hertera1 sentian dimosthenis1029 meg-tong nutanc kioco deepthought9527 lewtun torlarse paperwave joocjun haojiepan1 whaleloops giyaseddin mistobaan aadsah yizhongw madehong assassindesign yxwang8775 chrsitinass viraat oglee815 tech1media qingpeijun xiaoqiangcs xjtu-e-learning llysuda yaakovsu luisespinosaanke mysqlsc vangogh0318 crazyboystop ironiksk dptam dst1213 deep-cognition ishandutta2007 petercao serena-liuwx bsvmelo bc2009 aahmadai runngezhang pranavpurwar ebptwllc jiudingsun01

flan's Issues

QReCC processing does not match TFDS features

The QReCC preprocessing code expects a history_with_truth_answer key:

FLAN/flan/v2/preprocessors.py

Line 1305 in 36a9304

dialog = example["history_with_truth_answer"]

However, this key does not appear to be in the TFDS features:

https://github.com/tensorflow/datasets/blob/e7649b4373c32d81946f40647a7a62ad475fb647/tensorflow_datasets/text/qrecc/qrecc.py#L53-L71

From a quick check, it does not appear in the original dataset files either.

BBH causal judgment reasoning

(Pls direct me to the best place to ask if needed).

Would you mind sharing UL2 results broken down by tasks for Big-Bench Hard? Causal Judgment, Disambiguation QA, Formal Fallacies, Hyperbaton, and Logical Deduction (Five Objects) would be of interest, as those are published for Flan-T5 series.

Thanks.

Mixture used to train FLAN-T5 / UL2

It would be great if you could add the exact mixture used to train FLAN-T5 and FLAN-UL2. This is not trivial because it doesn't look like limiting the number of examples per task is implemented in this repo.

TypeError: Invalid `datasets`. `datasets` must have compatible element specs.

@shayne-longpre Thank you for sharing the reproducing script!

I got the following error when I was trying to reproduce flan2021_submix. The venv is the one you specified in the repo (flan/v2/requirements.txt ). It seems that some datasets have different fields. Any suggestions would be appreciated!

Traceback (most recent call last):
File "flan/v2/run_example.py", line 93, in
dataset = selected_mixture.get_dataset(
File "/export/home/FLAN/.venv2/lib/python3.8/site-packages/seqio/dataset_providers.py", line 1758, in get_dataset
dataset = self._sample_fn(datasets, rates, sample_seed)
File "/export/home/FLAN/.venv2/lib/python3.8/site-packages/tensorflow/python/util/deprecation.py", line 371, in new_func
return func(*args, **kwargs)
File "/export/home/FLAN/.venv2/lib/python3.8/site-packages/tensorflow/python/data/experimental/ops/interleave_ops.py", line 148, in sample_from_datasets_v2
return dataset_ops.Dataset.sample_from_datasets(
File "/export/home/FLAN/.venv2/lib/python3.8/site-packages/tensorflow/python/data/ops/dataset_ops.py", line 3571, in sample_from_datasets
return sample_from_datasets_op._sample_from_datasets( # pylint: disable=protected-access
File "/export/home/FLAN/.venv2/lib/python3.8/site-packages/tensorflow/python/data/ops/sample_from_datasets_op.py", line 119, in _sample_from_datasets
return directed_interleave_op._directed_interleave( # pylint: disable=protected-access
File "/export/home/FLAN/.venv2/lib/python3.8/site-packages/tensorflow/python/data/ops/directed_interleave_op.py", line 25, in _directed_interleave
return _DirectedInterleaveDataset(
File "/export/home/FLAN/.venv2/lib/python3.8/site-packages/tensorflow/python/data/ops/directed_interleave_op.py", line 50, in init
raise TypeError(f"Invalid datasets. datasets must have compatible "
TypeError: Invalid datasets. datasets must have compatible element specs.
Dataset 0 element_spec={'_task_name': TensorSpec(shape=(), dtype=tf.string, name=None), '_task_source': TensorSpec(shape=(), dtype=tf.string, name=None), '_template_type': TensorSpec(shape=(), dtype=tf.string, name=None), '_template_idx': TensorSpec(shape=(), dtype=tf.int32, name=None), 'inputs_pretokenized': TensorSpec(shape=(), dtype=tf.string, name=None), 'inputs': TensorSpec(shape=(None,), dtype=tf.int32, name=None), 'targets_pretokenized': TensorSpec(shape=(), dtype=tf.string, name=None), 'targets': TensorSpec(shape=(None,), dtype=tf.int32, name=None)}.
Dataset 19 element_spec={'_template_type': TensorSpec(shape=(), dtype=tf.string, name=None), '_template_idx': TensorSpec(shape=(), dtype=tf.int32, name=None), 'inputs_pretokenized': TensorSpec(shape=(), dtype=tf.string, name=None), 'inputs': TensorSpec(shape=(None,), dtype=tf.int32, name=None), 'targets_pretokenized': TensorSpec(shape=(), dtype=tf.string, name=None), 'targets': TensorSpec(shape=(None,), dtype=tf.int32, name=None)}.

How were the Held-In and Held-Out splits created?

Hi @shayne-longpre, can you please point me to the script which was used for creating the held-in and held-out splits?

Do Instructions work with flan-T5?

We are trying to make this instruction work with flan-t5 but haven't got it working. Any ideas would be MOST helpful on how to make it work with this model or any other model. Of course, this works with GPT3.5 so was curious if this model indeed can support such instructions.

Answer the question as truthfully as possible using the provided context, and if the answer is not contained within the text below, say "I don't know."\n\n

Context:
Install Python with the Official Installer. Download the installer package from Python's official website.
Wait for the download to complete. Once the installation is complete, the installer will automatically open Python's installation directory in a new Finder window.

Installing python on Windows. Open a browser to the Python website and download the Windows installer. 2. Double click on the downloaded file and install Python for all users, and ensure that Python is added to your path. Click on Install now to begin

Installing Python 3 on Linux¶ · $ python3 --version · $ sudo apt-get update $ sudo apt-get install python3. · $ sudo apt-get install software-properties-common $ ...

Question: How do you install Python on Mac?
Answer:

Question on (A)NLI templates

Thanks for sharing the templates used, really useful!

I have a few questions around the NLI templates. Based on @shayne-longpre 's answer in #27 I understand that ANLI was used in training.

From the ANLI templates no-CoT and CoT, it is not clear how "contradiction" is treated. IIUC, the options are always ['Yes', "It's impossible to say", 'No']. For most templates in their (human) logical semantics, a "No" answer would not mean contradiction (maybe the ones that use the words "true/false" are an exception), it would mean that the hypothesis cannot be entailed.

It seems that "No" was used in these templates to mean "contradiction" (and I'm guessing the model was trained to say "No" for the samples with contradictions in ANLI), which is rather strange. Could you please clarify? Thanks!

Error getting dataset for the mixture from the Flan collection

Trying to get niv2_zsopt from mixtures fails:

import seqio

from flan.v2 import mixtures

mix = seqio.get_mixture_or_task('niv2_zsopt')
ds = mix.get_dataset(sequence_length={'inputs': 1024, 'targets': 1024}, shuffle=False, num_epochs=1, copy_pretokenized=True, compute_stats_empirically=True)

The error message:

DatasetNotFoundError: Dataset natural_instructions not found.                                                                                                                                                                           
Available datasets:                                                    
        - abstract_reasoning    
        - accentdb            
        - aeslc                       
        - aflw2k3d                                                                                                                                                                                                                      
        - ag_news_subset                                
        - ai2_arc                   
...

Check that:
    - if dataset was added recently, it may only be available
      in `tfds-nightly`
    - the dataset name is spelled correctly
    - dataset class defines all base class abstract methods
    - the module defining the dataset class is imported

Did you mean: natural_instructions -> natural_questions

T0 tasks broken

This has already been mentioned in #17, but I think it's worth surfacing as a separate issue.

TFDS (even the most recent tfds-nightly) does not recognize the names provided for the T0 datasets in flan/v2/task_configs.py. e.g. tfds build huggingface:bigscience__p3/adversarial_qa_dbert_answer_the_following_q crashes.

niv2 few-shots are returning single shot examples

@lehougoogle I applied the latest changes and running

task = seqio.get_mixture_or_task(f"tfds_natural_instructions_template_0to10_x_shot")
dataset = task.source.get_dataset("train")
next(dataset.as_numpy_iterator())

is outputting

{'definition': b'In this task, given a sentence in the English language, your task is to convert it into the Filipino language.',
 'id': b'task559-1221f105f0cf40bb97291692f5eda329',
 'input': b'"Out of concern for the interpreters and their families\' security as well as the security of the Danish base in Iraq, the Defence Ministry has chosen to inform the public after the interpreters and others had left Iraq," the Danish Defence Ministry said in a statement.',
 'output': b'"Para sa kapakanan ng mga ng mga tagapagsalin at seguridad ng kanilang mga pamilya pati narin ang seguridad ng mga taga denmark na nakabase sa Iraq, Pinili ng Ministro ng Depensa na ipaalam sa publiko matapos makaalis sa Iraq ang mga tagapagsalin at iba pa," nasabi ng Danish Defence Ministry sa isang pahayag.',
 'source': b'asian_language_treebank',
 'task_name': b'task559_alt_translation_en_fi'}

[Question] How much memory is required to generate wiki_dialog?

I've been running out of memory generating wiki_dialog on machines with 88GB RAM, whereas the dataset is supposed to be only ~37GB.

v2: Few-shot tasks don't have splits

Running seqio.get_mixture_or_task("cot_fsopt").splits

Errors ValueError: Task cot_creak_template_0to10_x_shot has no splits

Recommended caching method

I'm trying to finetune T5x on the FLAN collection, exactly as in Longpre et al. 2023 and Chung et al. 2022. I'm starting with the small checkpoint.

Could you recommend a data caching scheme? run_example.py recommends storing examples on disk and then mixing them manually. seqio recommends using seqio.CacheDatasetPlaceholder instead, and I see that this function appears in the FLAN Collection source code. It's not clear to me how many examples from each mixture to store in either case, especially taking packing into account. Any tips?

All attempts to get a Google authentication bearer token failed

I create a new environment with python3.8 and have installed all of the packages showed in requirements.txt and run PYTHONPATH=. python flan/v2/run_example.py, however, I failed for the following reasons:

2023-03-17 15:00:13.007874: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudnn.so.8'; dlerror: libcudnn.so.8: cannot open shared object file: No such file or directory
2023-03-17 15:00:13.007925: W tensorflow/core/common_runtime/gpu/gpu_device.cc:1835] Cannot dlopen some GPU libraries. Please make sure the missing libraries mentioned above are installed properly if you would like to use GPU. Follow the guide at https://www.tensorflow.org/install/gpu for how to download and setup the required libraries for your platform.
Skipping registering GPU devices...
2023-03-17 15:00:13.008442: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 AVX512F FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-03-17 15:00:16.014237: W tensorflow/core/platform/cloud/google_auth_provider.cc:184] All attempts to get a Google authentication bearer token failed, returning an empty token. Retrieving token from files failed with "Not found: Could not locate the credentials file.". Retrieving token from GCE failed with "Not found: Error executing an HTTP request: HTTP response code 410 with body '<html>
  <head>
    <meta http-equiv='refresh' content='0; url=http://metadata/' />
  </head>
</html>
'".
2023-03-17 15:01:17.240986: E tensorflow/core/platform/cloud/curl_http_request.cc:614] The transmission  of request 0x558cf720ec00 (URI: https://www.googleapis.com/storage/v1/b/t5-data/o/vocabs%2Fcc_all.32000%2Fsentencepiece.model?fields=size%2Cgeneration%2Cupdated) has been stuck at 0 of 0 bytes for 61 seconds and will be aborted. CURL timing information: lookup time: 0.046775 (No error), connect time: 0 (No error), pre-transfer time: 0 (No error), start-transfer time: 0 (No error)
2023-03-17 15:02:18.712983: E tensorflow/core/platform/cloud/curl_http_request.cc:614] The transmission  of request 0x558cf720ec00 (URI: https://www.googleapis.com/storage/v1/b/t5-data/o/vocabs%2Fcc_all.32000%2Fsentencepiece.model?fields=size%2Cgeneration%2Cupdated) has been stuck at 0 of 0 bytes for 61 seconds and will be aborted. CURL timing information: lookup time: 0.044402 (No error), connect time: 0 (No error), pre-transfer time: 0 (No error), start-transfer time: 0 (No error)
2023-03-17 15:03:20.425956: E tensorflow/core/platform/cloud/curl_http_request.cc:614] The transmission  of request 0x558cf720ec00 (URI: https://www.googleapis.com/storage/v1/b/t5-data/o/vocabs%2Fcc_all.32000%2Fsentencepiece.model?fields=size%2Cgeneration%2Cupdated) has been stuck at 0 of 0 bytes for 61 seconds and will be aborted. CURL timing information: lookup time: 0.032221 (No error), connect time: 0 (No error), pre-transfer time: 0 (No error), start-transfer time: 0 (No error)
Traceback (most recent call last):
  File "flan/v2/run_example.py", line 93, in <module>
    dataset = selected_mixture.get_dataset(
  File "xx/lib/python3.8/site-packages/seqio/dataset_providers.py", line 1278, in get_dataset
    self._check_compatible_features()
  File "xx/lib/python3.8/site-packages/seqio/dataset_providers.py", line 1235, in _check_compatible_features
    if task.output_features[name].vocabulary != feature.vocabulary:
  File "xx/lib/python3.8/site-packages/seqio/vocabularies.py", line 330, in __eq__
    their_md5 = hashlib.md5(other.sp_model).hexdigest()
  File "xx/lib/python3.8/site-packages/seqio/vocabularies.py", line 248, in sp_model
    self._load_model()
  File "xx/lib/python3.8/site-packages/seqio/vocabularies.py", line 216, in _load_model
    self._sp_model = f.read()
  File "xx/lib/python3.8/site-packages/tensorflow/python/lib/io/file_io.py", line 119, in read
    length = self.size() - self.tell()
  File "xx/lib/python3.8/site-packages/tensorflow/python/lib/io/file_io.py", line 98, in size
    return stat(self.__name).length
  File "xx/lib/python3.8/site-packages/tensorflow/python/lib/io/file_io.py", line 871, in stat
    return stat_v2(filename)
  File "xx/lib/python3.8/site-packages/tensorflow/python/lib/io/file_io.py", line 887, in stat_v2
    return _pywrap_file_io.Stat(compat.path_to_str(path))
KeyboardInterrupt

And the intalled packages are shown below:(cuda 11.6)

Package                  Version
------------------------ ---------------------
absl-py                  0.12.0
astunparse               1.6.3
attrs                    21.2.0
Babel                    2.9.1
cachetools               4.2.2
certifi                  2021.5.30
charset-normalizer       2.0.4
clang                    5.0
click                    8.0.1
colorama                 0.4.4
dill                     0.3.4
editdistance             0.5.3
filelock                 3.0.12
flatbuffers              1.12
frozendict               2.3.5
future                   0.18.2
gast                     0.4.0
gin-config               0.4.0
google-auth              1.35.0
google-auth-oauthlib     0.4.5
google-pasta             0.2.0
googleapis-common-protos 1.53.0
grpcio                   1.39.0
h5py                     3.1.0
huggingface-hub          0.0.12
idna                     3.2
importlib-resources      5.12.0
iniconfig                1.1.1
joblib                   1.0.1
keras                    2.6.0
Keras-Preprocessing      1.1.2
Levenshtein              0.13.0
Markdown                 3.3.4
mesh-tensorflow          0.1.19
nltk                     3.6.2
numpy                    1.19.5
oauthlib                 3.1.1
opt-einsum               3.3.0
packaging                21.0
pandas                   1.3.2
pip                      23.0.1
pluggy                   0.13.1
portalocker              2.3.0
promise                  2.3
protobuf                 3.17.3
py                       1.10.0
pyasn1                   0.4.8
pyasn1-modules           0.2.8
pyparsing                2.4.7
pytest                   6.2.4
python-dateutil          2.8.2
pytz                     2021.1
PyYAML                   5.4.1
regex                    2021.8.3
requests                 2.26.0
requests-oauthlib        1.3.0
rouge-score              0.0.4
rsa                      4.7.2
sacrebleu                2.0.0
sacremoses               0.0.45
scikit-learn             0.24.2
scipy                    1.7.1
sentencepiece            0.1.96
seqio                    0.0.6
setuptools               67.6.0
six                      1.15.0
t5                       0.9.2
tabulate                 0.8.9
tensorboard              2.6.0
tensorboard-data-server  0.6.1
tensorboard-plugin-wit   1.8.0
tensorflow               2.6.0
tensorflow-datasets      4.4.0
tensorflow-estimator     2.6.0
tensorflow-hub           0.12.0
tensorflow-metadata      1.2.0
tensorflow-text          2.6.0
termcolor                1.1.0
tfds-nightly             4.4.0.dev202108200109
threadpoolctl            2.2.0
tokenizers               0.10.3
toml                     0.10.2
torch                    1.9.0
tqdm                     4.62.1
transformers             4.9.2
typing-extensions        3.7.4.3
urllib3                  1.26.6
Werkzeug                 2.0.1
wheel                    0.40.0
wrapt                    1.12.1
yapf                     0.32.0
zipp                     3.15.0

How could I solve this problem? Is this a network error or python environmental support problem?

Data distribution in language

Is there any specific data statistics in the distribution of datasets in language? I did not find the corresponding results in the paper.

Program synthesis data, reproduce in t5x and data proportion

Thanks for a nice open source and paper.

I would like to reproduce the FLAN-T5. I have a few questions.

I can't see "Program Synthesis" in the flan v2 data, how can I add it?
I built the mixture following this repo within t5x. It takes about 2 days to train flan-t5 in t5x.
I wonder it takes a long time to build the mixture. should I make a cache for FLAN-T5 training?
what is the proportion of data used in the published FLAN-T5?

Mixed zero, few-shot prompt proportion - zero-shot: few-shot = 10:90 or 50:50
Submixture proportion: flan(46%), t0(28%), cot(1.8%), niv2(24.2%), dialog(...), program synthesis(...)

Thanks!

Does each sub-mixture instructed both with zero-shot and few-shot?

We have four sub-mixture: muffin, t0-sf, cot, niv2. And a set of instruction settings: zsopt, zsnoopt, fsopt, fsnoopt. Are scaling flan use the all the combinations?
If so, does few-shot setting leak data information for zero-shot setting on a same dataset?
If not, can you give us the exact instruction settings on each sub-mixture?

[Question] What license is used for this FLAN dataset(not the code).

Hi,

Thanks a lot for open source the code to fetch the FLAN data set.

I noticed in the paper: The Flan Collection: Designing Data and Methods for Effective Instruction Tuning. (https://arxiv.org/abs/2301.13688) you mentioned

"to accelerate research on instruction tuning, we make the Flan 2022 collection of datasets, templates, and methods publicly available at this https URL."

I noticed that this repo used Apache 2.0 license. Is the FLAN data set that fetched from the code also under Apache 2.0 license?

Thanks a lot!

Data Quality Issue

I found that there are a lot of bad cases in the data downloaded from：
https://huggingface.co/datasets/SirNeural/flan_v2/tree/main

For example:
niv2_zs_noopt_train.jsonl.gz
Detailed Instructions: You are given a statement written in Malayalam. Choose the most logical word from the given 4 options which can be used to replace the ⁇ MASK> token in the statement. Output the word from the correct option . Q: Statement: ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ . ⁇ , ⁇ , ⁇ ⁇ ⁇ ⁇ ⁇ . ⁇ ⁇ ⁇ . ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ . ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ , ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ . 1341 ⁇ 19 - ⁇ ⁇ ⁇ ⁇ . 1678 ⁇ 26 - ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ MASK> ⁇ ⁇ ⁇ . 1737 ⁇ 16 - ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ . ⁇ ⁇ 19 ⁇ ⁇ ⁇ . Option A: ⁇ Option B: ⁇ Option C: ⁇ Option D: ⁇ A:

May I ask what caused this?

Held-in validation set details

Were examples in the held-in validation set pre- and postprocessed exactly as the corresponding examples in the training set? If so, what were the mixture coefficients for the zero shot, template_mix, etc. tasks?

ERROR:absl:Failed to load task 'tfds_natural_instructions_template_0to10_zero_shot' as part of mixture 'palmflan_niv2_zs_opt'

Hi @shayne-longpre, thanks for answering the queries actively. I am using @SirNeural's script from https://huggingface.co/datasets/SirNeural/flan_v2 to create the dataset, but I am repeatedly facing this error:

ERROR:absl:Failed to load task 'tfds_natural_instructions_template_0to10_zero_shot' as part of mixture 'palmflan_niv2_zs_opt'

I have also checked out the issue which you have raised in the tensorflow repo here: tensorflow/datasets#4804

I am wondering if you were able to find a fix to this?

Thanks!

Prompt for multiple choice question with quote extraction

Hello,

this is not an issue per se but a question how to best use Flan models.

Is there a way to design the prompt template in such a way that the model is expected to choose an option and at the same time a quote from the context that supports the choice?

For example, currently the following prompt

I have a big family. We all live in London. However, I was born in Rome.

Answer the following question by taking a quote from the text above.

Where do I live?
A. Rome
B. Berlin
C. London

is completed by "C.", but ideally it would be completed by "C. We all live in London.".

Best regards,

gdeleva

[Question] What environment did you use for fetching large data set like dialog_mixture

Hi, I am trying to fetch FLAN v2 by running the

PYTHONPATH=. python flan/v2/run_example.py

I could successfully run cot_submix, but I faced out of memory issue when I was trying to fetch dialog_submix in a single AWS p4d instance.
Some of the logs showed it download wiki_dialog data and also did some processing (maybe) using Apache Beam.

Warning: The dataset you're trying to generate is using Apache Beam,
yet no `beam_runner` nor `beam_options` was explicitly provided.

Some Beam datasets take weeks to generate, so are usually not suited
for single machine generation. Please have a look at the instructions
to setup distributed generation:

https://www.tensorflow.org/datasets/beam_datasets#generating_a_beam_dataset

What I was doing is follow the readme file in flan/v2 directory, bash setup.sh and ran PYTHONPATH=. python flan/v2/run_example.py. The only entrance I could get was the seqio.get_mixture_or_task('dialog_submix').get_dataset() function in that run_example script. I was not clear about how seqio.get_dataset() interact with or call apache beam. Apart from pip install apache beam, is there any other step about settings? like the environment? how could we pass runner type to beam_runner?

And I assume dialog_submix is not the largest in this five categories. So could you give me some help on explaining what environment you are using when running the script to generate data? for example, do you use multiple machines to do it, like google cloud or aws EC2, EMR? Are there further steps like settings, configs before running that run_example.py code? Thanks a lot!

Request for Training Code for Reproducibility and Research Purposes

Hi @shayne-longpre ,

I am a graduate student working on LLMs and I am writing to request access to the training code associated with your paper, FLAN 2021, to reproduce your results and conduct further research.

I am greatly inspired by your work in FLAN 2021, especially the "number of instruction tuning clusters" part in ablation studies. Your findings have significant implications for my research, and I am eager to replicate your results and build upon your work, particularly to explore the connections between the type of training data and the performance of the model. Thank you so much for sharing the code to generate instruction tuning dataset, it would be really helpful if I could get the code to instruction tune the base model.

I understand that sharing the training code and data may require effort on your part, and I am committed to citing your work appropriately and complying with any terms or conditions associated with the use of the code and data. Your generosity in providing access to the training code and data would greatly contribute to the advancement of the field and foster a collaborative and open research environment.

Thank you for considering my request. I am looking forward to your positive response.

Scale law In-consistency between flan-v1 and flan-v2

LaMDA is instruct finetuned with flan-v1, the scale law shows only 68B and 137B model performance improves with instruct tuning.The reason instructing tuning fills up small model capacity(for example like a 8B model trained on 63 tasks).
However the scale law in "scaling instruction-fintuned language models(Training with flan-v2)", all model with capacity arrange from 8B-540B have performance improvement with instruct tuning, even using 1836 tasks.
Can anybody explain this? is there a concrete conclusion between the model size and num of instruct tuning tasks?

[BUG] get_dataset for flan2021_submix taking hours, while flan_zsnoopt takes 5m

Not much to add except to say that flan_zsnoopt is gotten quite quickly, whereas flan2021_submix is going on 3h and and 22GB RAM with no sign of stopping

Arabic tokenization is broken

The T5 tokenizer doesn't seem to know what to do with all of the Arabic text in the dataset (along with some other non-Latin scripts). Here's an example from Natural Instructions followed by its tokenization:

Example:

Teacher:You are given a sentence in Arabic. Your job is to translate the Arabic sentence into English.
Teacher: Now, understand the problem? Solve this instance: الممرضة في عيادة ذات ضغط نرى فيها من 50 إلى 100 مريض يومياً ، يترك لها فقط بضع دقائق لتقضيها مع كل مريض — دقائق لكل مريض.
Student:

Tokenization:

[17476    10  3774    33   787     3     9  7142    16 19248     5   696
   613    19    12 13959     8 19248  7142   139  1566     5 17476    10
   852     6   734     8   682    58  5175   162    48  3421    10     3
     2     3     2     3     2     3     2     3     2     3     2     3
     2     3     2   943     3     2   910     3     2     3     2     3
     2     3     2     3     2     3     2     3     2     3     2     3
     2     3     2     3     2     3     2     3   318     3     2     3
     2     3     2     5  6341    10     1]

Is this known? Were scripts not in the T5 tokenizer excluded from FLAN runs?

There may be some bugs in data preprocessing

Thank you for sharing the templates and data. It is really helpful!

I am trying to reproduce the data collection but found that some task splits may be wrong preprocessed in T0. The targets of some tasks with no_opt templates are not correct.

There are some examples, comparing the _no_opt_zero_shot with the _zero shot:

t0_task_adaptation:sciq_Direct_Question_Closed_Book__template_0to10_zero_shot
{'input': 'Q: Where are most of the organs contained in insects?   A:\nAvailable options: (a). head. (b). appendages. (c). thorax. (d). the abdomen.', 'target': '(d).'}                                                                                                                 
t0_task_adaptation:sciq_Direct_Question_Closed_Book__template_0to10_no_opt_zero_shot
{'input': 'Q: Where are most of the organs contained in insects?   A:', 'target': '[4].'} 

t0_task_adaptation:social_i_qa_I_was_wondering_template_0to10_zero_shot
{'input': "I heard that Alex gave Lee's son a toy to keep him busy during the movie.  And I was wondering What will happen to Lee?\nPossible answers: [-] make a movie; [-] be liked by Lee's son; [-] thank Alex;", 'target': 'thank Alex'}
t0_task_adaptation:social_i_qa_I_was_wondering_template_0to10_no_opt_zero_shot
{'input': "I heard that Alex gave Lee's son a toy to keep him busy during the movie.  And I was wondering What will happen to Lee?", 'target': '(c).'}

t0_task_adaptation:qasc_is_correct_2_template_0to10_zero_shot
{'input': 'Do you think the right answer to the question "what is plasma created with?" is "urea", given that  plasma can be created with heat?\nChoose your answer from:\n(i). Yes.\n(ii). No.', 'target': '(i).'}
t0_task_adaptation:qasc_is_correct_2_template_0to10_no_opt_zero_shot
{'input': 'Do you think the right answer to the question "what is plasma created with?" is "urea", given that  plasma can be created with heat?', 'target': '1).'}

Does the bug come from the original T0 collection or from the code of preprocessing? I have no clue about this. As I found most tasks in T0 are correct, and only a few have such wrong targets.

Pin versions for flan/v2/requirements.txt

Could you please pin versions for flan/v2/requirements.txt ?

How do I get only the CoT related data?

In the FLAN papers, it was mentioned that using CoT data during instruction tuning improves the model performance on unseen tasks (more performant).

I am planning to curate some self-instruct dataset together with some CoT data. How can I specify in the repo to get only the CoT data? Can you also refer me to the CoT based datasets used that helped you get that extra improvement in performance that was mention in the FLAN papers?

I am also thinking of using self-instruct to generate some CoT data, seems like that is something that is missing from the FLAN methods. Any ideas or plans in this direction?

I was playing around with run_example.py. These are the kinds of input and output data I am looking for. I like that you do not need to have a base prompt template using FLAN style of instruction tuning compared to needing a rigid base prompt such as in .

Example 1:
input

Antonio is preparing a meal of spaghetti and meatballs for his family. His recipe for meatballs calls for 1/8 of a pound of hamburger per 

meatball. Antonio has 8 family members, including himself. If he uses 4 pounds of hamburger to make meatballs, and each member of 

the family eats an equal number of meatballs, how many meatballs will Antonio eat? Let's be accurate as possible.

output

If one meatball is made from 1 / 8 pound of hamburger meat, then 4 pounds of hamburger meat will make 4 / (1 / 8) = 4 * 8 = 32 

meatballs. 32 meatballs divided amongst 8 family members is 32 / 8 = 4 meatballs per family member.

The answer: 4.

Example 2:
input

Given the sentence "A man playing the guitar on an elevated stage in front an audience." is it true that "A man is about to play the final 

song in his set."? A step-by-step solution is:

Just because he is playing guitar doesn't mean he is playing the final song in his set.

The final answer: it is not possible to tell.

Example 3:
input

 Given the sentence "A toddler is riding a plastic scooter." is it true that "A toddler is taking a bath."? Step by step answer:

output

A toddler cannot be taking a bath and riding a scooter at the same time. Therefore, the final answer is no.

Lastly, do you think that the improvement in performance by CoT data is that it provides the model with extra reasoning capabilities that can be generalized to tasks that requires a few steps of reasoning.

How to get a specific dataset, such as SST-2

How can I select a small dataset, such as sentiment dataset SST-2 in flan2021_submix ?

Accuracy mismatches between Table 1 and Figure 3/4 in FLAN XL 2022

Thanks for open-sourced the FLAN collections. We noticed that in the paper The Flan Collection: Designing Data and Methods for Effective Instruction Tuning. Table 1 shows that the 0-shot and 5-shot accuracy of TX-XL Flan 2022 on MMLU are 50.3/52.4.
However, in both Figure 3 and Figure 4, the highest 0-shot and 5-shot accuracy achieved by TX-XL Flan 2022 is below 46.0. It is unclear which accuracy numbers are correct.

Additionally, Table 5 of Scaling Instruction-Finetuned Language Models. The 5-shot accuracy of T5-XL is 52.4, which is the same as TX-XL Flan 2022. Are they use same data settings or by chance the numbers are same?

We may missed information here. Can we know which numbers should be correct? Thanks:)

Decoding parameters used in evaluation (Flan Collection)

First off, thank you for open-sourcing the collection of templates and data, etc. it can enable a lot more research!

Could you please share the parameters that were used in decoding and any other details around decoding that could be useful in playing with Flan-T5? @shayne-longpre Thank you!

handling longer sequences

Hi everyone,

I'm just curious to understand how this family of models are handling sequences of length larger than the maximum allowed number of tokens. Elaborating, say the maximum token size is 512, if I have a sequence of length 700, how this model is supposed to process this specific input?

Thanks in advance.

tfds version issue

I was trying to load qrecc and wiki_dialog from the dialogue section to find that the current tfds version installed by requirements doesn't include them. Updating the version seemed to fix it. I haven't checked if this particular update causes any problems in other areas, but if it seems okay could you update the version of tfds in the requirements?

Is Inversion used for the Few-Shot case?

Hi @shayne-longpre do you use inversion for the few-shot data? If yes, can you provide the different permutations you use for it?

Thanks!

FLAN few-shot template structure

Hi,

I have a question about the representation of few-shot tasks as instructions to query flan models.

Does the following format follows the structure of the training instructions?

Thanks a lot for your feedback.


Is the following text a valid evidence regarding the given topic?

Options:
-yes
-no

Evidence: At the outset of his government, President Banzer launched a policy of using special police-units to physically eradicate the illegal coca of the Chapare region.
Topic: We should legalize the growing of coca leaf
Label: yes

Evidence: The U.S. Supreme Court has held that the terms "partial-birth abortion" and "intact dilation and extraction" are basically synonymous [REF].
Topic: We should ban partial birth abortions
Label: no

Evidence: A say on pay - a non-binding vote of the general meeting to approve director pay packages, is practised in a growing number of countries.
Topic: We should limit executive compensation
Label:

A minor bug in instruction

FLAN/flan/v2/templates.py

Line 1166 in 14d1e9e

("Data: {meaning_representation}. Can you generate a sentence about "

However, the output of e2e dataset may not consist of only a single sentence. For example, "The Wrestlerss is rated 5 out of 5, serving Japanese food in a pub. It is higher than average priced, and located near the city centre near Raja Indian Cuisine."

Questions about the training data sourced from Natural-instructions V2

Question: Can you share the exact list of 1836 tasks used for training?

For example, I found a list of Super-natural-instructions tasks in your code-based but this includes less than 1000 tasks (including the held-out MMLU tasks): https://github.com/google-research/FLAN/blob/main/flan/v2/constants_niv2.py

data leakage with exemplars

I found that in the experiment "4.4 Instructions with few-shot exemplars" of the FLAN paper，it said At both training and inference time, exemplars are randomly drawn from the training set.
During training, some examples would be sampled as exemplars. I think this may lead to data leakage?

[Question] Code Synthesis eval functions?

My understanding is you didn't include the program synthesis datasets for IP reasons, but could you include the eval functions you used? Were the eval environments from the original repos hooked up to python functions to be called by seqio?

How to include metadata from flan_collection_info.csv in each data entry?

Hi @shayne-longpre

I am trying to prepare the dataset with all the metadata mentioned in https://github.com/google-research/FLAN/blob/main/flan/v2/flan_collection_info.csv included. Do you know any quick workarounds which can help me do that?

Thanks!

ERROR:absl:Failed to load task 'arc_challenge_template_0to10_no_opt_x_shot' as part of mixture 'flan2022_submix'

Using the latest commit, when I run python flan/v2/run_example.py, get the following error:

ERROR:absl:Failed to load task 'arc_challenge_template_0to10_no_opt_x_shot' as part of mixture 'flan2022_submix'
--------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
Cell In[3], line 3
      1 INPUT_SEQ_LEN = 2056
      2 TARGET_SEQ_LEN = 512
----> 3 dataset = selected_mixture.get_dataset(
      4     sequence_length={"inputs": INPUT_SEQ_LEN, "targets": TARGET_SEQ_LEN},
      5     num_epochs=1,
      6     shuffle=True,
      7     copy_pretokenized=True,
      8     # The passthrough features let you track the source[/task/template](https://vscode-remote+ssh-002dremote-002b77.vscode-resource.vscode-cdn.net/task/template) metadata for the example
      9     passthrough_features=["_template_idx", "_task_source", "_task_name", "_template", "_template_type"]
     10 )

File [~/miniconda3/envs/flanv2/lib/python3.8/site-packages/seqio/dataset_providers.py:1730](https://vscode-remote+ssh-002dremote-002b77.vscode-resource.vscode-cdn.net/data/lm/project/instruction_dataset/FLAN/flan/v2/~/miniconda3/envs/flanv2/lib/python3.8/site-packages/seqio/dataset_providers.py:1730), in Mixture.get_dataset(self, sequence_length, split, use_cached, shuffle, seed, shard_info, num_epochs, copy_pretokenized, compute_stats_empirically, log_mixing_proportions, passthrough_features, trim_output_features)
   1728 for task in tasks:
   1729   try:
-> 1730     ds = task.get_dataset(
   1731         sequence_length,
   1732         split=split,
   1733         use_cached=use_cached,
   1734         shuffle=shuffle,
   1735         seed=seed,
   1736         shard_info=shard_info,
   1737         num_epochs=num_epochs,
...
    394     )
    395   # 1st case: The key exists: info.splits['train']
    396   elif str(key) in self.keys():

KeyError: "Trying to access splits['train[:-200]'] but splits is empty. This likely indicate the dataset has not been generated yet."```

Errors running V2 code

Hi! Thanks for the open source effort for this project.
When I am trying to run V2 code, I had two issues:

AESLC checksum error -> I have removed this dataset from configs, which makes the code move forward but breaks the “academic benchmark”
ValueError: Mixture, palmflan_flan_zs_opt, does not contain any Tasks.
I can't figure out this error.

Code I run:

import mixtures, task_splits, tasks
dataset = seqio.get_mixture_or_task("flan_inter_cluster_split_0_10templates_train").get_dataset(
    sequence_length={"inputs": 4096, "targets": 4096},
)

and

from flan.v2 import constants
from flan.v2 import constants_t0
from flan.v2 import mixtures_utils
from flan.v2 import mixtures
from flan.v2 import tasks  # pylint: disable=unused-import
import seqio
seqio.add_global_cache_dirs(constants.CACHE_DIRS)
seqio.set_global_cache_dirs(constants.CACHE_DIRS)

seqio.MixtureRegistry.add(
    'flan_zs_fs_opt',
    tasks=[
        ('flan_zsopt', 50),  # mixing weight = 50
        ('flan_fsopt', 50),  # mixing weight = 50
])

dataset = seqio.get_mixture_or_task("flan_zs_fs_opt").get_dataset(
    sequence_length={"inputs": 256, "targets": 128},
    split="train",
    shuffle=True,
    num_epochs=1,
    shard_info=seqio.ShardInfo(index=0, num_shards=10),
    use_cached=False,
    seed=42
)

# Print the first 5 examples.
for _, ex in zip(range(5), dataset.as_numpy_iterator()):
  print(ex)

Package version:

Package                      Version
---------------------------- ---------------------
absl-py                      1.4.0
aiohttp                      3.8.3
aiosignal                    1.3.1
astunparse                   1.6.3
async-timeout                4.0.2
asynctest                    0.13.0
attrs                        21.2.0
Babel                        2.9.1
backports.cached-property    1.0.2
blessed                      1.20.0
bpython                      0.24
cached-property              1.5.2
cachetools                   4.2.2
certifi                      2021.5.30
charset-normalizer           2.0.4
chex                         0.1.5
clang                        5.0
click                        8.0.1
clu                          0.0.8
colorama                     0.4.4
contextlib2                  21.6.0
curtsies                     0.4.1
cwcwidth                     0.1.8
cycler                       0.11.0
datasets                     2.9.0
dill                         0.3.6
dm-tree                      0.1.8
editdistance                 0.5.3
etils                        0.9.0
filelock                     3.0.12
flatbuffers                  23.1.21
flax                         0.6.4
fonttools                    4.38.0
frozendict                   2.3.4
frozenlist                   1.3.3
fsspec                       2023.1.0
future                       0.18.2
gast                         0.4.0
gin-config                   0.4.0
google-auth                  1.35.0
google-auth-oauthlib         0.4.5
google-pasta                 0.2.0
googleapis-common-protos     1.53.0
greenlet                     2.0.2
grpcio                       1.39.0
h5py                         3.1.0
huggingface-hub              0.12.0
idna                         3.2
importlib-metadata           6.0.0
importlib-resources          5.10.2
iniconfig                    1.1.1
jax                          0.3.25
jaxlib                       0.3.25
joblib                       1.0.1
keras                        2.11.0
Keras-Preprocessing          1.1.2
kiwisolver                   1.4.4
Levenshtein                  0.13.0
libclang                     15.0.6.1
Markdown                     3.3.4
markdown-it-py               2.1.0
matplotlib                   3.5.3
mdurl                        0.1.2
mesh-tensorflow              0.1.19
ml-collections               0.1.1
msgpack                      1.0.4
multidict                    6.0.4
multiprocess                 0.70.14
nltk                         3.6.2
numpy                        1.21.6
oauthlib                     3.1.1
opt-einsum                   3.3.0
optax                        0.1.4
orbax                        0.1.0
packaging                    21.0
pandas                       1.3.2
Pillow                       9.4.0
pip                          23.0
pluggy                       0.13.1
portalocker                  2.3.0
promise                      2.3
protobuf                     3.17.3
psutil                       5.9.4
py                           1.10.0
pyarrow                      11.0.0
pyasn1                       0.4.8
pyasn1-modules               0.2.8
pyglove                      0.2.1
Pygments                     2.14.0
pyparsing                    2.4.7
pytest                       6.2.4
python-dateutil              2.8.2
pytz                         2021.1
pyxdg                        0.28
PyYAML                       5.4.1
regex                        2021.8.3
requests                     2.26.0
requests-oauthlib            1.3.0
responses                    0.18.0
rich                         13.3.1
rouge-score                  0.0.4
rsa                          4.7.2
sacrebleu                    2.0.0
sacremoses                   0.0.45
scikit-learn                 0.24.2
scipy                        1.7.1
sentencepiece                0.1.96
seqio                        0.0.14
setuptools                   47.1.0
six                          1.15.0
t5                           0.9.2
tabulate                     0.8.9
tensorboard                  2.11.2
tensorboard-data-server      0.6.1
tensorboard-plugin-wit       1.8.0
tensorflow                   2.11.0
tensorflow-datasets          4.8.2
tensorflow-estimator         2.11.0
tensorflow-hub               0.12.0
tensorflow-io-gcs-filesystem 0.30.0
tensorflow-metadata          1.12.0
tensorflow-text              2.11.0
tensorstore                  0.1.28
termcolor                    1.1.0
tfds-nightly                 4.8.2.dev202301270045
threadpoolctl                2.2.0
tokenizers                   0.10.3
toml                         0.10.2
toolz                        0.12.0
torch                        1.9.0
tqdm                         4.62.1
transformers                 4.9.2
typing_extensions            4.4.0
urllib3                      1.26.6
wcwidth                      0.2.6
Werkzeug                     2.0.1
wheel                        0.38.4
wrapt                        1.12.1
xxhash                       3.2.0
yarl                         1.8.2
zipp                         3.12.1

T0 FS OPT - unable to process.

Hi @shayne-longpre, I have been using @SirNeural's script here to generate the data. I have been able to generate all the files, except t0_fs_opt data. Every time I run the script, the process gets killed (with a prompt "Killed", and no error message) after some time. Also, the same data is unavailable in the data processed by @SirNeural in the huggingface repo. Did you face such issues earlier? Any help? :)

Thanks!

v2: SplitInfo has no field named filepathTemplate and DatasetInfo has no field named releaseNotes

Many datasets are throwing these 2 errors in particular. (over 200 tasks in Flan alone) I'll give 2 examples:

seqio.get_mixture_or_task("ag_news_subset_template_0to10_x_shot").get_dataset(sequence_length={"inputs": 512, "targets": 512})
Error:

Failed to parse splits field: Message type "tensorflow_datasets.SplitInfo" has no field named "filepathTemplate".
Available Fields(except extensions): ['name', 'numShards', 'shardLengths', 'numBytes', 'statistics'].

seqio.get_mixture_or_task("cnn_dailymail_template_0to10_no_opt_x_shot").get_dataset(sequence_length={"inputs": 512, "targets": 512})
Error:

Message type "tensorflow_datasets.DatasetInfo" has no field named "releaseNotes".
 Available Fields(except extensions): ['name', 'description', 'version', 'configName', 'configDescription', 'citation', 'sizeInBytes', 'downloadSize', 'location', 'downloadChecksums', 'schema', 'splits', 'supervisedKeys', 'redistributionInfo', 'moduleName', 'disableShuffling', 'fileFormat']

Issues when run enumerate(dataset) in run_example.py

When we try to run the enumerate(dataset) code in https://github.com/google-research/FLAN/blob/main/flan/v2/run_example.py#L108. The code gets stuck and cannot go to the following steps while our CPU memories are still enough.

The dataset which mixed Flan, T0, NIV2, etc., in https://github.com/google-research/FLAN/blob/main/flan/v2/run_example.py#L65 may be too large with your provided default mix ratios. We then change

DEFAULT_MIXTURE_MAX_EXAMPLES = {
    'FLAN': 30000,
    'T0': 20000,
    'CoT': 100000,
    'NIv2': 5000,
    'Dialog': 200000,
}

to 10 times and 100 times smaller, they still cannot run.

Do you have any suggestions here? Thanks

base model for self-query?

Thanks to the FLAN team for open-sourcing your methods and prompts!

If I want to create a system that can self-query - generate its own options and decide on a best option depending on context, which pre-trained model would you recommend as a baseline model?
Would any of the Flan-T5 models (or U2) currently published on Huggingface be sufficient? Would you recommend further finetuning?

Thank you.

Most performant prompts per task?

If I'm understanding this correctly, in the original FLAN paper, the evaluation protocol for a single task involved averaging the accuracies achieved with the task's different prompts, with the most performant prompt being used for the test set if a dev set existed (I'm assuming this was the same with the Flan-T5 paper since I couldn't find the specifics in the paper, but maybe I just missed it).

Do the authors have records indicating what the most performant prompts were per task? Or the accuracies per prompt per task?

[BUG] Cache is empty with FLAN, but not with seqio

Caching tasks registered by FLAN is resulting in empty files. I'm running seqio_cache_tasks --output_cache_dir=/root/seqio_cache --module_import=src.register_tasks where src.register_tasks is a file that registers tasks in a way equivalent to importing flan.v2.mixtures.

I tried registering a task straight via seqio, as shown below, and it cached correctly.

seqio.TaskRegistry.add(
    "wmt19_ende",
    seqio.TfdsDataSource(tfds_name="wmt19_translate/de-en:1.0.0"),
    preprocessors=[
        functools.partial(
            translate, source_language='en', target_language='de'),
        seqio.preprocessors.tokenize, seqio.preprocessors.append_eos
    ],
    output_features=task_configs.DEFAULT_OUTPUT_FEATURES,
    metric_fns=[bleu])

google-research / flan Goto Github PK

flan's Introduction

The FLAN Instruction Tuning Repository

Flan 2021

Flan 2021 Citation

License

Contact Us

flan's People

Contributors

Stargazers

Watchers

Forkers

flan's Issues

Recommend Projects

Recommend Topics

Recommend Org

Jobs