krasserm / perceiver-io

A PyTorch implementation of Perceiver, Perceiver IO and Perceiver AR with PyTorch Lightning scripts for distributed training

License: Apache License 2.0

perceiver deep-learning machine-learning pytorch pytorch-lightning perceiver-io perceiver-ar

perceiver-io's Introduction

Perceiver, Perceiver IO and Perceiver AR

This repository is a PyTorch implementation of Perceiver, Perceiver IO and Perceiver AR, with PyTorch Lightning interfaces for model training and Hugging Face 🤗 interfaces for inference.

Perceiver: General Perception with Iterative Attention (paper, video)
Perceiver IO: A General Architecture for Structured Inputs & Outputs (paper, blog post)
General-purpose, long-context autoregressive modeling with Perceiver AR (paper, blog post)

Overview

The core of the perceiver-io library consists of backend models: lightweight PyTorch implementations of Perceiver, Perceiver IO and Perceiver AR. They can be wrapped into PyTorch Lightning modules for training (Lightning interface) and 🤗 modules for inference (Hugging Face interface). See library design for details.

(Figure: perceiver-io library design)

The command line interface for training is implemented with Lightning CLI. Training datasets are 🤗 datasets wrapped into PyTorch Lightning data modules. For NLP tasks, perceiver-io supports all 🤗 fast tokenizers and the 🤗 Perceiver UTF-8 bytes tokenizer.
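
The following is a minimal sketch of that data-module pattern, not the library's actual data module API; the class name and the choice of the imdb dataset are illustrative only:

import pytorch_lightning as pl
from datasets import load_dataset
from torch.utils.data import DataLoader


class ImdbDataModule(pl.LightningDataModule):
    """Illustrative wrapper of a 🤗 dataset in a PyTorch Lightning data module."""

    def __init__(self, batch_size: int = 32):
        super().__init__()
        self.batch_size = batch_size

    def setup(self, stage=None):
        # Download and cache the dataset via the 🤗 datasets library
        self.ds_train = load_dataset("imdb", split="train")

    def train_dataloader(self):
        return DataLoader(self.ds_train, batch_size=self.batch_size, shuffle=True)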

Documentation

Installation

Via pip

pip install perceiver-io[text,vision,audio]
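
Note: in shells such as zsh, the square brackets are interpreted as glob patterns, so the extras specifier must be quoted:

pip install 'perceiver-io[text,vision,audio]'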

From sources

Installation from sources requires Miniconda and Poetry (1.2.0 or higher).

Create and activate the perceiver-io conda environment:

conda env create -f environment.yml
conda activate perceiver-io

Install main and test dependencies, including all extras:

# Without dependencies required for examples
poetry install --all-extras

If you want to run the examples locally, additionally use --with examples:

poetry install --all-extras --with examples
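
To verify the installation, importing one of the core classes should succeed (PerceiverIOConfig is the class path used elsewhere in this documentation; it is assumed to be unchanged in the installed version):

python -c "from perceiver.model.core import PerceiverIOConfig; print('ok')"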

Docker image

docker pull ghcr.io/krasserm/perceiver-io:latest

See Docker image for details.
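
For example, to start an interactive shell inside the container (a sketch; GPU access via --gpus all requires the NVIDIA Container Toolkit, and the image's default entrypoint may differ):

docker run --rm -it --gpus all ghcr.io/krasserm/perceiver-io:latest bash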

Getting started

Inference

Optical flow

Compute the optical flow between consecutive frames of an input video and write the rendered results to an output video:

from urllib.request import urlretrieve
from transformers import pipeline

from perceiver.data.vision import video_utils
from perceiver.model.vision import optical_flow  # register auto-classes and pipeline

urlretrieve(
    url="https://martin-krasser.com/perceiver/flow/sintel_clip_cave_dragon_fight.mp4",
    filename="sintel_clip_cave_dragon_fight.mp4",
)

# Create optical flow pipeline
optical_flow_pipeline = pipeline("optical-flow", model="krasserm/perceiver-io-optical-flow", device="cuda:0")

# load consecutive video frame pairs
frame_pairs = video_utils.read_video_frame_pairs("sintel_clip_cave_dragon_fight.mp4")

# create and render optical flow for all frame pairs
optical_flows = optical_flow_pipeline(frame_pairs, render=True, device="cuda:0")

# create video with rendered optical flows
video_utils.write_video("sintel_clip_cave_dragon_fight_output.mp4", optical_flows, fps=24)
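
To inspect an individual result before writing the video, each rendered flow is expected to be an RGB image array that can be displayed directly (a sketch, assuming matplotlib is installed):

import matplotlib.pyplot as plt

# Display the rendered optical flow of the first frame pair
plt.imshow(optical_flows[0])
plt.axis("off")
plt.show()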

Here is a side-by-side comparison of the input and output video:

(Video: side-by-side comparison of the input video and the rendered optical flow output)

Symbolic audio generation

Create audio sequences by generating symbolic (MIDI) audio data and converting the generated audio symbols into WAV output using fluidsynth (note: fluidsynth must be installed for the following example to work):

from transformers import pipeline
from pretty_midi import PrettyMIDI
from perceiver.model.audio import symbolic  # auto-class registration

repo_id = "krasserm/perceiver-ar-sam-giant-midi"

prompt = PrettyMIDI("prompt.mid")
audio_generator = pipeline("symbolic-audio-generation", model=repo_id)

output = audio_generator(prompt, max_new_tokens=64, num_latents=1, do_sample=True, top_p=0.95, temperature=1.0, render=True)

with open("generated_audio.wav", "wb") as f:
    f.write(output["generated_audio_wav"])
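
To listen to the generated audio in a Jupyter notebook, the standard IPython audio widget can be used (assuming IPython is available):

from IPython.display import Audio

Audio("generated_audio.wav")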

Examples of generated audio sequences are available on the 🤗 hub.

See inference examples for more examples.

Training

Train a small Perceiver IO image classifier (907K parameters) on MNIST from the command line. The classifier cross-attends to individual pixels of input images with repeated cross-attention. See image classification training example for more details.

python -m perceiver.scripts.vision.image_classifier fit \
  --model.num_latents=32 \
  --model.num_latent_channels=128 \
  --model.encoder.num_frequency_bands=32 \
  --model.encoder.num_cross_attention_layers=2 \
  --model.encoder.num_self_attention_blocks=3 \
  --model.encoder.num_self_attention_layers_per_block=3 \
  --model.encoder.first_self_attention_block_shared=false \
  --model.encoder.dropout=0.1 \
  --model.encoder.init_scale=0.1 \
  --model.decoder.num_output_query_channels=128 \
  --model.decoder.dropout=0.1 \
  --model.decoder.init_scale=0.1 \
  --data=MNISTDataModule \
  --data.batch_size=64 \
  --optimizer=AdamW \
  --optimizer.lr=1e-3 \
  --lr_scheduler.warmup_steps=500 \
  --trainer.accelerator=gpu \
  --trainer.devices=1 \
  --trainer.max_epochs=30 \
  --trainer.logger=TensorBoardLogger \
  --trainer.logger.save_dir=logs \
  --trainer.logger.name=logs
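
Training progress can be monitored with TensorBoard, which reads the logs written by the TensorBoardLogger configured above (assuming the tensorboard package is installed):

tensorboard --logdir logs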

Model construction describes how to implement model-specific command line interfaces with the Lightning CLI. Training checkpoints are written to the logs/img_clf/version_0/checkpoints directory. Assuming a checkpoint with filename epoch=025-val_loss=0.065.ckpt exists, it can be converted to a perceiver-io 🤗 model with:

from perceiver.model.vision.image_classifier import convert_mnist_classifier_checkpoint

convert_mnist_classifier_checkpoint(
    save_dir="example/mnist-classifier",
    ckpt_url="logs/img_clf/version_0/checkpoints/epoch=025-val_loss=0.065.ckpt",
)

so that it can be used in a 🤗 image classification pipeline:

from datasets import load_dataset
from transformers import pipeline

mnist_dataset = load_dataset("mnist", split="test")[:9]

images = mnist_dataset["image"]
labels = mnist_dataset["label"]

classifier = pipeline("image-classification", model="example/mnist-classifier")
predictions = [pred[0]["label"] for pred in classifier(images)]

print(f"Labels:      {labels}")
print(f"Predictions: {predictions}")
Labels:      [7, 2, 1, 0, 4, 1, 4, 9, 5]
Predictions: [7, 2, 1, 0, 4, 1, 4, 9, 5]

or loaded directly:

import torch
from transformers import AutoModelForImageClassification, AutoImageProcessor

model = AutoModelForImageClassification.from_pretrained("example/mnist-classifier")
processor = AutoImageProcessor.from_pretrained("example/mnist-classifier")

inputs = processor(images, return_tensors="pt")

with torch.no_grad():
    # use perceiver-io Hugging Face model
    output_1 = model(**inputs).logits

with torch.no_grad():
    # or use perceiver-io backend model directly  
    output_2 = model.backend_model(inputs.pixel_values)

print(f"Predictions: {output_1.argmax(dim=-1).numpy().tolist()}")
print(f"Predictions: {output_2.argmax(dim=-1).numpy().tolist()}")
Predictions: [7, 2, 1, 0, 4, 1, 4, 9, 5]
Predictions: [7, 2, 1, 0, 4, 1, 4, 9, 5]
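
If you want to share the converted model, the standard 🤗 push_to_hub API should work on the model and processor loaded above (a sketch; "my-username/mnist-classifier" is a placeholder repository name, and a prior huggingface-cli login is required):

# Push model weights, config and image processor to a placeholder hub repository
model.push_to_hub("my-username/mnist-classifier")
processor.push_to_hub("my-username/mnist-classifier")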

See training examples for more examples.

Articles

Articles referencing this repository:

Other implementations

perceiver-io's People

Contributors

borda, cstub, krasserm


perceiver-io's Issues

AttributeError: 'tuple' object has no attribute 'last_hidden_state'

Hi, and thanks for sharing this repo. I'm trying to run the symbolic music audio model training examples. They run fine with activation checkpointing disabled, but if I enable it, I get:

AttributeError: 'tuple' object has no attribute 'last_hidden_state'

Dumping ca_output in perceiver/model/core/modules.py shows a proper ModuleOutput object during the sanity checks, but once training starts it is just a ('last_hidden_state', 'kv_cache') tuple.

Sequence Modeling Examples

Greetings, it would be useful to pose the problem more abstractly and add examples of autoregressive sequence models on raw sequences (no text/audio/tokenizers), as the authors do in the Perceiver AR paper.

Such examples include a copy task and autoregressive ImageNet image generation.
I am planning to contribute these myself in the near future.

text encoding error

Hi,
I am getting this error

Traceback (most recent call last):
  File "train/train_mlm.py", line 113, in <module>
    main(parser.parse_args())
  File "train/train_mlm.py", line 69, in main
    data_module.setup()
  File "/usr/local/lib/python3.6/dist-packages/pytorch_lightning/core/datamodule.py", line 428, in wrapped_fn
    fn(*args, **kwargs)
  File "/opt/perceiver-io/data/imdb.py", line 131, in setup
    self.ds_train = IMDBDataset(root=self.root, split='train')
  File "/opt/perceiver-io/data/imdb.py", line 42, in __init__
    self.raw_x, self.raw_y = load_split(root, split)
  File "/opt/perceiver-io/data/imdb.py", line 34, in load_split
    raw_x.append(f.read())
  File "/usr/lib/python3.6/encodings/ascii.py", line 26, in decode
    return codecs.ascii_decode(input, self.errors)[0]
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 449: ordinal not in range(128)

It is probably related to Unicode encoding.

citing the repo

Is there a preferred way to cite the repo in an academic paper?

What is Q in the latent encoder layers?

It seems that in the multi-layer encoder you use x_latent as Q and x as KV; shouldn't Q, K and V all be x_latent in the latent layers?
Please correct me if I missed something in the paper, thank you!

AttributeError: module 'numpy' has no attribute '_no_nep50_warning'

After installing perceiver-io with the given command (!pip install perceiver-io[text,vision]), I tried to import:

from perceiver.model.core import PerceiverIOConfig

but got the following error:

AttributeError                            Traceback (most recent call last)
<ipython-input> in <module>
----> 1 from perceiver.model.core import PerceiverIOConfig
      2 from perceiver.model.text.mlm import MaskedLanguageModel, TextEncoderConfig, TextDecoderConfig
      3
      4 vocab_size = 262  # E
      5 max_seq_len = 2048  # M, O

46 frames
/usr/local/lib/python3.8/dist-packages/numpy/__init__.py in __getattr__(attr)
    311     # The previous way Tester was imported also had a side effect of adding
    312     # the full numpy.testing namespace
--> 313     if attr == 'testing':
    314         import numpy.testing as testing
    315         return testing

AttributeError: module 'numpy' has no attribute '_no_nep50_warning'

About evaluation metrics

Hi, thank you for your great work.
I am wondering which evaluation metrics are used for the audio generation task (for example, GiantMIDI)?

Error while adding CIFAR10 and CIFAR100

Dear @krasserm,

I'm currently attempting to integrate the CIFAR10 and CIFAR100 datasets into your code to train the perceiver-io model. Following the approach you've taken for MNIST, I've created a cifar10.py file within perceiver/data/vision. However, when I attempt to run the training example, I consistently encounter an error stating that 'image' has not been defined.

I have also ensured that I imported CIFAR10DataModule into the train.py script located in examples/training/img_clf. Despite these efforts, I'm still unable to successfully execute your code with CIFAR10 and CIFAR100.

Thank you in advance for your assistance!

Multimodal autoencoder

Hi @krasserm, awesome project BTW.
I'd be interested in implementing the multimodal autoencoder from the Perceiver IO paper. Are there any existing efforts along these lines on your end, or do you have any suggestions before I start?

Genomic sequences

Hello,

Thank you for your implementation of the Perceiver IO project. I am trying to use your work on genomic sequences of shape (10k, 1). I noticed that your model produces the SAME output for DIFFERENT inputs when the num_channels dimension is 1 (I am not using the Fourier feature encodings). When the outputs are not identical, they differ only marginally. Can you please guide me in solving this issue? Thanks in advance!

Please let me know what additional information you would need to reproduce this bug.

ValueError: Could not load model krasserm/perceiver-io-optical-flow with any of the following classes: (<class 'perceiver.model.vision.optical_flow.huggingface.OpticalFlowPerceiver'>,).

Code:

from urllib.request import urlretrieve
from transformers import pipeline

from perceiver.data.vision import video_utils
from perceiver.model.vision import optical_flow # register auto-classes and pipeline

optical_flow_pipeline = pipeline("optical-flow", model="krasserm/perceiver-io-optical-flow", device="cuda:0")

frame_pairs = video_utils.read_video_frame_pairs("Scene_004.mp4")

optical_flows = optical_flow_pipeline(frame_pairs, render=True, device="cuda:0")

video_utils.write_video("test_data/flow/perceiverperceiver_flow_output.mp4", optical_flows, fps=24)

ValueError: Could not load model krasserm/perceiver-io-optical-flow with any of the following classes: (<class 'perceiver.model.vision.optical_flow.huggingface.OpticalFlowPerceiver'>,).

Thanks!
