krasserm / perceiver-io

A PyTorch implementation of Perceiver, Perceiver IO and Perceiver AR with PyTorch Lightning scripts for distributed training

License: Apache License 2.0

perceiver deep-learning machine-learning pytorch pytorch-lightning perceiver-io perceiver-ar

perceiver-io's Introduction

Perceiver, Perceiver IO and Perceiver AR

This repository is a PyTorch implementation of Perceiver, Perceiver IO and Perceiver AR, with PyTorch Lightning interfaces for model training and Hugging Face 🤗 interfaces for inference.

Perceiver: General Perception with Iterative Attention (paper, video)
Perceiver IO: A General Architecture for Structured Inputs & Outputs (paper, blog post)
General-purpose, long-context autoregressive modeling with Perceiver AR (paper, blog post)

Overview

The core of the perceiver-io library consists of backend models: lightweight PyTorch implementations of Perceiver, Perceiver IO and Perceiver AR. They can be wrapped into PyTorch Lightning modules for training (Lightning interface) and 🤗 modules for inference (Hugging Face interface). See library design for details.

(Figure: perceiver-io library design)

The command line interface for training is implemented with Lightning CLI. Training datasets are 🤗 datasets wrapped into PyTorch Lightning data modules. For NLP tasks, perceiver-io supports all 🤗 fast tokenizers and the 🤗 Perceiver UTF-8 bytes tokenizer.
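
The following is a minimal sketch of that data-module pattern, not the library's actual data module API; the class name and the choice of the imdb dataset are illustrative only:

import pytorch_lightning as pl
from datasets import load_dataset
from torch.utils.data import DataLoader


class ImdbDataModule(pl.LightningDataModule):
    """Illustrative wrapper of a 🤗 dataset in a PyTorch Lightning data module."""

    def __init__(self, batch_size: int = 32):
        super().__init__()
        self.batch_size = batch_size

    def setup(self, stage=None):
        # Download and cache the dataset via the 🤗 datasets library
        self.ds_train = load_dataset("imdb", split="train")

    def train_dataloader(self):
        return DataLoader(self.ds_train, batch_size=self.batch_size, shuffle=True)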

Documentation

Installation

Via pip

pip install perceiver-io[text,vision,audio]
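
Note: in shells such as zsh, the square brackets are interpreted as glob patterns, so the extras specifier must be quoted:

pip install 'perceiver-io[text,vision,audio]'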

From sources

Installation from sources requires Miniconda and Poetry (1.2.0 or higher).

Create and activate the perceiver-io conda environment:

conda env create -f environment.yml
conda activate perceiver-io

Install main and test dependencies, including all extras:

# Without dependencies required for examples
poetry install --all-extras

If you want to run the examples locally, additionally use --with examples:

poetry install --all-extras --with examples
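
To verify the installation, importing one of the core classes should succeed (PerceiverIOConfig is the class path used elsewhere in this documentation; it is assumed to be unchanged in the installed version):

python -c "from perceiver.model.core import PerceiverIOConfig; print('ok')"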

Docker image

docker pull ghcr.io/krasserm/perceiver-io:latest

See Docker image for details.
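
For example, to start an interactive shell inside the container (a sketch; GPU access via --gpus all requires the NVIDIA Container Toolkit, and the image's default entrypoint may differ):

docker run --rm -it --gpus all ghcr.io/krasserm/perceiver-io:latest bash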

Getting started

Inference

Optical flow

Compute the optical flow between consecutive frames of an input video and write the rendered results to an output video:

from urllib.request import urlretrieve
from transformers import pipeline

from perceiver.data.vision import video_utils
from perceiver.model.vision import optical_flow  # register auto-classes and pipeline

urlretrieve(
    url="https://martin-krasser.com/perceiver/flow/sintel_clip_cave_dragon_fight.mp4",
    filename="sintel_clip_cave_dragon_fight.mp4",
)

# Create optical flow pipeline
optical_flow_pipeline = pipeline("optical-flow", model="krasserm/perceiver-io-optical-flow", device="cuda:0")

# load consecutive video frame pairs
frame_pairs = video_utils.read_video_frame_pairs("sintel_clip_cave_dragon_fight.mp4")

# create and render optical flow for all frame pairs
optical_flows = optical_flow_pipeline(frame_pairs, render=True, device="cuda:0")

# create video with rendered optical flows
video_utils.write_video("sintel_clip_cave_dragon_fight_output.mp4", optical_flows, fps=24)
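
To inspect an individual result before writing the video, each rendered flow is expected to be an RGB image array that can be displayed directly (a sketch, assuming matplotlib is installed):

import matplotlib.pyplot as plt

# Display the rendered optical flow of the first frame pair
plt.imshow(optical_flows[0])
plt.axis("off")
plt.show()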

Here is a side-by-side comparison of the input and output video:

(Video: side-by-side comparison of the input video and the rendered optical flow output)

Symbolic audio generation

Create audio sequences by generating symbolic (MIDI) audio data and converting the generated audio symbols into WAV output using fluidsynth (note: fluidsynth must be installed for the following example to work):

from transformers import pipeline
from pretty_midi import PrettyMIDI
from perceiver.model.audio import symbolic  # auto-class registration

repo_id = "krasserm/perceiver-ar-sam-giant-midi"

prompt = PrettyMIDI("prompt.mid")
audio_generator = pipeline("symbolic-audio-generation", model=repo_id)

output = audio_generator(prompt, max_new_tokens=64, num_latents=1, do_sample=True, top_p=0.95, temperature=1.0, render=True)

with open("generated_audio.wav", "wb") as f:
    f.write(output["generated_audio_wav"])
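
To listen to the generated audio in a Jupyter notebook, the standard IPython audio widget can be used (assuming IPython is available):

from IPython.display import Audio

Audio("generated_audio.wav")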

Examples of generated audio sequences are available on the 🤗 hub.

See inference examples for more examples.

Training

Train a small Perceiver IO image classifier (907K parameters) on MNIST from the command line. The classifier cross-attends to individual pixels of input images with repeated cross-attention. See image classification training example for more details.

python -m perceiver.scripts.vision.image_classifier fit \
  --model.num_latents=32 \
  --model.num_latent_channels=128 \
  --model.encoder.num_frequency_bands=32 \
  --model.encoder.num_cross_attention_layers=2 \
  --model.encoder.num_self_attention_blocks=3 \
  --model.encoder.num_self_attention_layers_per_block=3 \
  --model.encoder.first_self_attention_block_shared=false \
  --model.encoder.dropout=0.1 \
  --model.encoder.init_scale=0.1 \
  --model.decoder.num_output_query_channels=128 \
  --model.decoder.dropout=0.1 \
  --model.decoder.init_scale=0.1 \
  --data=MNISTDataModule \
  --data.batch_size=64 \
  --optimizer=AdamW \
  --optimizer.lr=1e-3 \
  --lr_scheduler.warmup_steps=500 \
  --trainer.accelerator=gpu \
  --trainer.devices=1 \
  --trainer.max_epochs=30 \
  --trainer.logger=TensorBoardLogger \
  --trainer.logger.save_dir=logs \
  --trainer.logger.name=logs
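
Training progress can be monitored with TensorBoard, which reads the logs written by the TensorBoardLogger configured above (assuming the tensorboard package is installed):

tensorboard --logdir logs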

Model construction describes how to implement model-specific command line interfaces with the Lightning CLI. Training checkpoints are written to the logs/img_clf/version_0/checkpoints directory. Assuming a checkpoint with filename epoch=025-val_loss=0.065.ckpt exists, it can be converted to a perceiver-io 🤗 model with:

from perceiver.model.vision.image_classifier import convert_mnist_classifier_checkpoint

convert_mnist_classifier_checkpoint(
    save_dir="example/mnist-classifier",
    ckpt_url="logs/img_clf/version_0/checkpoints/epoch=025-val_loss=0.065.ckpt",
)

so that it can be used in a 🤗 image classification pipeline:

from datasets import load_dataset
from transformers import pipeline

mnist_dataset = load_dataset("mnist", split="test")[:9]

images = mnist_dataset["image"]
labels = mnist_dataset["label"]

classifier = pipeline("image-classification", model="example/mnist-classifier")
predictions = [pred[0]["label"] for pred in classifier(images)]

print(f"Labels:      {labels}")
print(f"Predictions: {predictions}")
Labels:      [7, 2, 1, 0, 4, 1, 4, 9, 5]
Predictions: [7, 2, 1, 0, 4, 1, 4, 9, 5]

or loaded directly:

import torch
from transformers import AutoModelForImageClassification, AutoImageProcessor

model = AutoModelForImageClassification.from_pretrained("example/mnist-classifier")
processor = AutoImageProcessor.from_pretrained("example/mnist-classifier")

inputs = processor(images, return_tensors="pt")

with torch.no_grad():
    # use perceiver-io Hugging Face model
    output_1 = model(**inputs).logits

with torch.no_grad():
    # or use perceiver-io backend model directly  
    output_2 = model.backend_model(inputs.pixel_values)

print(f"Predictions: {output_1.argmax(dim=-1).numpy().tolist()}")
print(f"Predictions: {output_2.argmax(dim=-1).numpy().tolist()}")
Predictions: [7, 2, 1, 0, 4, 1, 4, 9, 5]
Predictions: [7, 2, 1, 0, 4, 1, 4, 9, 5]
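
If you want to share the converted model, the standard 🤗 push_to_hub API should work on the model and processor loaded above (a sketch; "my-username/mnist-classifier" is a placeholder repository name, and a prior huggingface-cli login is required):

# Push model weights, config and image processor to a placeholder hub repository
model.push_to_hub("my-username/mnist-classifier")
processor.push_to_hub("my-username/mnist-classifier")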

See training examples for more examples.

Articles

Articles referencing this repository:

Other implementations

perceiver-io's People

Contributors

borda, cstub, krasserm


perceiver-io's Issues

AttributeError: 'tuple' object has no attribute 'last_hidden_state'

Hi, and thanks for sharing this repo. I'm trying to run the symbolic music audio model training examples. They run fine with activation checkpointing disabled, but if I enable it, I get:

AttributeError: 'tuple' object has no attribute 'last_hidden_state'

Dumping ca_output in perceiver/model/core/modules.py shows a proper ModuleOutput object during the sanity checks, but once training starts it is just a ('last_hidden_state', 'kv_cache') tuple.

Sequence Modeling Examples

Greetings, it would be useful to pose the problem more abstractly and add examples of autoregressive sequence models on raw sequences (no text/audio/tokenizers), as the authors do in the Perceiver AR paper.

Such examples include a copy task and autoregressive ImageNet image generation.
I am planning to contribute these myself in the near future.

text encoding error

Hi,
I am getting this error

Traceback (most recent call last):
  File "train/train_mlm.py", line 113, in <module>
    main(parser.parse_args())
  File "train/train_mlm.py", line 69, in main
    data_module.setup()
  File "/usr/local/lib/python3.6/dist-packages/pytorch_lightning/core/datamodule.py", line 428, in wrapped_fn
    fn(*args, **kwargs)
  File "/opt/perceiver-io/data/imdb.py", line 131, in setup
    self.ds_train = IMDBDataset(root=self.root, split='train')
  File "/opt/perceiver-io/data/imdb.py", line 42, in __init__
    self.raw_x, self.raw_y = load_split(root, split)
  File "/opt/perceiver-io/data/imdb.py", line 34, in load_split
    raw_x.append(f.read())
  File "/usr/lib/python3.6/encodings/ascii.py", line 26, in decode
    return codecs.ascii_decode(input, self.errors)[0]
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 449: ordinal not in range(128)

It is probably related to Unicode encoding.

citing the repo

Is there a preferred way to cite the repo in an academic paper?

What is Q in the latent encoder layers?

It seems that in the multi-layer encoder you use x_latent as Q and x as KV; shouldn't Q, K and V all be x_latent in the latent layers?
Please correct me if I missed something in the paper, thank you!

AttributeError: module 'numpy' has no attribute '_no_nep50_warning'

After installing perceiver-io with the given command (!pip install perceiver-io[text,vision]), I tried to import:

from perceiver.model.core import PerceiverIOConfig

but got the following error:

AttributeError                            Traceback (most recent call last)
<ipython-input> in <module>
----> 1 from perceiver.model.core import PerceiverIOConfig
      2 from perceiver.model.text.mlm import MaskedLanguageModel, TextEncoderConfig, TextDecoderConfig
      3
      4 vocab_size = 262  # E
      5 max_seq_len = 2048  # M, O

46 frames
/usr/local/lib/python3.8/dist-packages/numpy/__init__.py in __getattr__(attr)
    311     # The previous way Tester was imported also had a side effect of adding
    312     # the full numpy.testing namespace
--> 313     if attr == 'testing':
    314         import numpy.testing as testing
    315         return testing

AttributeError: module 'numpy' has no attribute '_no_nep50_warning'

About evaluation metrics

Hi, thank you for your great work.
I am wondering which evaluation metrics are used for the audio generation task (for example, GiantMIDI)?

Error while adding CIFAR10 and CIFAR100

Dear @krasserm,

I'm currently attempting to integrate the CIFAR10 and CIFAR100 datasets into your code to train the perceiver-io model. Following the approach you've taken for MNIST, I've created a cifar10.py file within perceiver/data/vision. However, when I attempt to run the training example, I consistently encounter an error stating that 'image' has not been defined.

I have also ensured that I imported CIFAR10DataModule into the train.py script located in examples/training/img_clf. Despite these efforts, I'm still unable to successfully execute your code with CIFAR10 and CIFAR100.

Thank you in advance for your assistance!

Multimodal autoencoder

Hi @krasserm, awesome project BTW.
I'd be interested in implementing the multimodal autoencoder from the Perceiver IO paper. Are there any existing efforts along these lines on your end, or do you have any suggestions before I start?

Genomic sequences

Hello,

Thank you for your implementation of the Perceiver IO project. I am trying to use your work on genomic sequences of shape (10k, 1). I noticed that your model produces the SAME output for DIFFERENT inputs when the num_channels dimension is 1 (I am not using the Fourier feature encodings). When the outputs are not identical, they differ only marginally. Can you please guide me in solving this issue? Thanks in advance!

Please let me know what additional information you would need to reproduce this bug.

ValueError: Could not load model krasserm/perceiver-io-optical-flow with any of the following classes: (<class 'perceiver.model.vision.optical_flow.huggingface.OpticalFlowPerceiver'>,).

Code:

from urllib.request import urlretrieve
from transformers import pipeline

from perceiver.data.vision import video_utils
from perceiver.model.vision import optical_flow # register auto-classes and pipeline

optical_flow_pipeline = pipeline("optical-flow", model="krasserm/perceiver-io-optical-flow", device="cuda:0")

frame_pairs = video_utils.read_video_frame_pairs("Scene_004.mp4")

optical_flows = optical_flow_pipeline(frame_pairs, render=True, device="cuda:0")

video_utils.write_video("test_data/flow/perceiverperceiver_flow_output.mp4", optical_flows, fps=24)

ValueError: Could not load model krasserm/perceiver-io-optical-flow with any of the following classes: (<class 'perceiver.model.vision.optical_flow.huggingface.OpticalFlowPerceiver'>,).

Thanks!
