aleph-alpha / magma

MAGMA - a GPT-style multimodal model that can understand any combination of images and language.

License: MIT

magma's Introduction

MAGMA -- Multimodal Augmentation of Generative Models through Adapter-based Finetuning

Authors

repo (alphabetical)

Constantin (CoEich), Mayukh (Mayukhdeb), Sid (sdtblck)

paper

Constantin Eichenberg, Sidney Black, Samuel Weinbach, Aleph Alpha

Letitia Parcalabescu, Anette Frank, Heidelberg University

Abstract

Large-scale pretraining is fast becoming the norm in Vision-Language (VL) modeling. However, prevailing VL approaches are limited by the requirement for labeled data and the use of complex multi-step pretraining objectives. We present MAGMA - a simple method for augmenting generative language models with additional modalities using adapter-based finetuning. Building on Frozen, we train a series of VL models that autoregressively generate text from arbitrary combinations of visual and textual input. The pretraining is entirely end-to-end using a single language modeling objective, simplifying optimization compared to previous approaches. Importantly, the language model weights remain unchanged during training, allowing for transfer of encyclopedic knowledge and in-context learning abilities from language pretraining. MAGMA outperforms Frozen on open-ended generative tasks, achieving state of the art results on the OKVQA benchmark and competitive results on a range of other popular VL benchmarks, while pretraining on 0.2% of the number of samples used to train SimVLM.

Paper on arXiv: https://arxiv.org/abs/2112.05253

Examples (via Aleph Alpha playground)

[Example images with captions]

Photos: "A man covering a woman's eyes to hide a present"; "A fallen tree is blocking a road"
Text & Technical: "A hand drawn treasure map"; "A software architecture"

Model design

[Figure: MAGMA model design]

About the repository

In this repository we share the main parts of the codebase for training and inference of our MAGMA VL model. The repo's main use is downloading our pretrained weights and interacting with the model. We also include a script for data-parallel training with DeepSpeed, for finetuning our models or training a MAGMA model from scratch.

NOTE: The freely available model from this repo is only a demo. For the latest multimodal and multilingual models from Aleph Alpha check out our website.

Installation

Make sure PyTorch (version >= 1.9.0) and torchvision are installed. See https://pytorch.org/get-started/locally/.
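
For example, via pip (a sketch; pick the build matching your CUDA or CPU setup from the link above):

pip install "torch>=1.9.0" torchvision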

You can pip install from the git repository with:

pip install git+https://github.com/Aleph-Alpha/magma.git

Make sure that you also download the config:

mkdir configs; wget -O configs/MAGMA_v1.yml https://raw.githubusercontent.com/Aleph-Alpha/magma/master/configs/MAGMA_v1.yml

Or, if you've cloned the repo, you can install all further requirements with:

pip install -r requirements.txt

Checkpoint

We also publish the model checkpoint that was used for the publication. It is hosted on our infrastructure and is downloaded automatically. It can also be downloaded manually here: https://bit.ly/aleph_alpha_magma_download

This checkpoint can also be tried out on a space managed by Heath Mitchell, AK, and Stella Biderman. (This is a third-party space, not managed by Aleph Alpha.)

Loading a model for inference

The following downloads the checkpoint file into checkpoint_path if it's not already present:

from magma import Magma
from magma.image_input import ImageInput

model = Magma.from_checkpoint(
    config_path = "configs/MAGMA_v1.yml",
    checkpoint_path = "./mp_rank_00_model_states.pt",
    device = 'cuda:0'
)

inputs = [
    ## supports urls and path/to/image
    ImageInput('https://www.art-prints-on-demand.com/kunst/thomas_cole/woods_hi.jpg'),
    'Describe the painting:'
]

## returns a tensor of shape: (1, 149, 4096)
embeddings = model.preprocess_inputs(inputs)  

## returns a list of length embeddings.shape[0] (batch size)
output = model.generate(
    embeddings = embeddings,
    max_steps = 6,
    temperature = 0.7,
    top_k = 0,
)  

print(output[0]) ##  A cabin on a lake

Converting datasets to our format

To convert an image-caption dataset to our dataset class magma.datasets.ImgCptDataset, we suggest:

from magma.datasets.convert_datasets import convert_dataset

def my_dataset_iterator():
    """
    Implement an iterator over your dataset that yields, for every datapoint, a tuple
    (image_path, {"captions": [...], "metadata": {...}}), where image_path is the path
    to the image as a Path object, captions is a list of caption strings, and metadata
    is an optional field.
    """

if __name__ == "__main__":
    convert_dataset(data_dir="/target/directory", ds_iterator=my_dataset_iterator())
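
As a concrete example, a minimal sketch of such an iterator, assuming a hypothetical layout of /my/data/images/*.jpg with one caption file per image under /my/data/captions (all paths are placeholders):

from pathlib import Path

def my_dataset_iterator():
    image_dir = Path("/my/data/images")      # placeholder path
    caption_dir = Path("/my/data/captions")  # one <stem>.txt per image, one caption per line
    for image_path in sorted(image_dir.glob("*.jpg")):
        captions = (caption_dir / f"{image_path.stem}.txt").read_text().splitlines()
        yield image_path, {"captions": captions, "metadata": {}}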

How to train MAGMA

Run the training with:

deepspeed train.py --config path_to_my_config

To continue training from a DeepSpeed checkpoint, provide the checkpoint directory in the "load" config parameter.
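
For example, in the config YAML (the "load" key as described above; the path is a placeholder):

load: "/path/to/deepspeed_checkpoint_dir"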

WARNING: By default, instantiating MAGMA via the init method instead of from_checkpoint loads the pretrained CLIP weights but not the pretrained GPT-J weights. To train MAGMA from scratch, download the GPT-J weights from this repo: https://github.com/finetuneanon/transformers and include them in the state dict after initializing the MAGMA model.
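
A rough sketch of that merge, assuming the language model's parameters live under an "lm." prefix in MAGMA's state dict; the constructor arguments and key prefix here are assumptions, not the repo's confirmed API:

from magma import Magma
import torch

model = Magma("configs/MAGMA_v1.yml")  # assumed constructor signature
gptj_sd = torch.load("pytorch_model.bin", map_location="cpu")  # downloaded GPT-J weights

sd = model.state_dict()
# copy every GPT-J tensor whose prefixed name exists in MAGMA's state dict
sd.update({f"lm.{k}": v for k, v in gptj_sd.items() if f"lm.{k}" in sd})
model.load_state_dict(sd)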

magma's People

Contributors

bashfish, coeich, countably1nfinite, golovneva, mayukhdeb

magma's Issues

Automatic model download doesn't work

mp_rank_00_model_states.pt ends up containing:

the Google Drive "Virus scan warning" page (full HTML elided): "mp_rank_00_model_states.pt (12G) is too large for Google to scan for viruses. Would you still like to download this file?" (Drive file id 1EiAY3IcKWmGADaLDzdG25ykQghUwza6L)

causing:

Traceback (most recent call last):
  File "/home/ubuntu/magma/example_inference.py", line 4, in <module>
    model = Magma.from_checkpoint(
  File "/home/ubuntu/magma/magma/magma.py", line 292, in from_checkpoint
    sd = torch.load(checkpoint_path, map_location=torch.device("cpu"))
  File "/usr/local/share/miniconda/lib/python3.9/site-packages/torch/serialization.py", line 593, in load
    return _legacy_load(opened_file, map_location, pickle_module, **pickle_load_args)
  File "/usr/local/share/miniconda/lib/python3.9/site-packages/torch/serialization.py", line 762, in _legacy_load
    magic_number = pickle_module.load(f, **pickle_load_args)
_pickle.UnpicklingError: invalid load key, '<'.

Possibly related to wkentaro/gdown#26
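
A possible workaround, assuming a recent gdown release that handles the Drive virus-scan confirmation (see the linked issue): fetch the checkpoint by the file id from the page above and point checkpoint_path at the result.

pip install -U gdown
gdown 1EiAY3IcKWmGADaLDzdG25ykQghUwza6L -O mp_rank_00_model_states.pt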

Subsequent inference calls produce worse results

Following the code in README.md or example_inference.py to perform inference by calling model.preprocess_inputs(…) followed by model.generate(…) produces good results the first time the pair is called, but poor results for subsequent pairs of calls.

The reason is that model = Magma.from_checkpoint(…) loads the model with inconsistent training/eval settings: model.training is True, but model.image_prefix.enc.training is False. The first call to model.preprocess_inputs(…) works correctly because the image encoder has training set to False, so its batch normalisation steps behave correctly. The call to model.generate(…) records the training state on entry and restores it on exit, which, because model.training is True, puts the whole model into training state. Subsequent calls to model.preprocess_inputs(…) then don't perform batch normalisation correctly.

The play space at https://huggingface.co/spaces/EleutherAI/magma has this problem too.

The fix is to add model.eval() after model = Magma.from_checkpoint(…), setting the whole model to a consistent eval state.
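
In code, the fix described above:

from magma import Magma

model = Magma.from_checkpoint(
    config_path = "configs/MAGMA_v1.yml",
    checkpoint_path = "./mp_rank_00_model_states.pt",
    device = 'cuda:0'
)
model.eval()  # put the whole model, including the image encoder, into a consistent eval state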

Torch Size Mismatch

Hey guys!

I had a quick issue while loading MAGMA from the checkpoint, and I was wondering if anyone has encountered it or knows how to solve it.

RuntimeError: Error(s) in loading state_dict for Magma: size mismatch for lm.lm_head.weight: copying a param with shape torch.Size([50400, 4096]) from checkpoint, the shape in current model is torch.Size([50258, 4096]).

It seems like the size of the checkpoint differs from the size the rest of the code expects the model to have.

Thank you so much--this model looks super cool and I'm excited to use it!

clean up train.py

  • remove old dataset paths
  • remove classification codepaths
  • remove vqa/gqa eval codepaths

Issue with the dataloader

I have downloaded CC3M in file format, with folders named 00000 to 00331, where each folder contains 0000.jpg and 000.json, i.e. one image and one JSON. Can you please help me? I am unsure how to convert my data to your format. @Mayukhdeb @benbrandt

Typeguard version in requirements

When using newer typeguard versions such as 4.0.0 I encountered an error, while the code runs fine with an older typeguard version such as 2.11.1. Maybe the requirements file should be changed from typeguard to typeguard==2.11.1.
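
That is, pinning the dependency in requirements.txt:

typeguard==2.11.1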

Reproducing results from your paper

Hi! Thank you for sharing the code for your model.
I'm having trouble reproducing the results you have published in your paper.
Here are the scores I get on the COCO dataset using the checkpoint provided in your paper:

{'Bleu_1': 0.22440850959728406, 'Bleu_2': 0.11753228266783161, 'Bleu_3': 0.06043320902662557, 'Bleu_4': 0.0321128847993337, 'METEOR': 0.09099773362803487, 'ROUGE_L': 0.16770810280576667, 'CIDEr': 0.11203192991375235}

As you can see, they are significantly lower. I'm using the nlg-eval package, as you mentioned here.

Which model in the paper does your checkpoint correspond to, base or long? How do you initialize it for evaluations? Here is my setup:

model = Magma.from_checkpoint(
    config_path=os.path.join(model_path, "configs/MAGMA_v1.yml"),
    checkpoint_path="mp_rank_00_model_states.pt",
    device="cuda:0",
)

As the prompt I'm using "A picture of " - is that correct?
I'm using temperature=0.7 and also setting torch's manual seed to 42.

Is there anything I'm missing or doing wrong here? If everything looks fine, could you please share the evaluation scripts that reproduce your results?

top_p argument is used like 1-top_p

For example, top_p=0.999 gives you nearly deterministic sampling, not nearly on-distribution sampling.


I was confused why I was getting much less diverse samples with top_p=0.95 than I got with top_p turned off.

I found the cause in these lines:

magma/magma/sampling.py

Lines 11 to 14 in bfd5c8d

sorted_logits, sorted_indices = torch.sort(logits, descending=True)
cum_probs = torch.cumsum(F.softmax(sorted_logits, dim=-1), dim=-1)
sorted_indices_to_remove = cum_probs > (1 - threshold)

threshold is set to top_p here:

magma/magma/sampling.py

Lines 101 to 102 in bfd5c8d

if top_p > 0:
    logits = top_p_filter(logits, threshold=top_p)

Suppose, e.g., threshold is 0.95. Then 1 - threshold is 0.05.

So we remove all tokens where the cumulative probs are > 0.05, which is most of the tokens -- we are really doing top-p sampling with top_p=0.05 (in the usual convention), not the intended top_p=0.95.
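
A minimal sketch of the fix in the usual convention; the shift-and-scatter pattern below follows common nucleus-sampling implementations, not necessarily the repo's surrounding code:

import torch
import torch.nn.functional as F

def top_p_filter(logits: torch.Tensor, threshold: float = 0.9) -> torch.Tensor:
    sorted_logits, sorted_indices = torch.sort(logits, descending=True)
    cum_probs = torch.cumsum(F.softmax(sorted_logits, dim=-1), dim=-1)
    # compare against threshold directly, not (1 - threshold)
    sorted_indices_to_remove = cum_probs > threshold
    # shift right so the first token that crosses the threshold is still kept
    sorted_indices_to_remove[..., 1:] = sorted_indices_to_remove[..., :-1].clone()
    sorted_indices_to_remove[..., 0] = False
    indices_to_remove = sorted_indices_to_remove.scatter(
        -1, sorted_indices, sorted_indices_to_remove
    )
    return logits.masked_fill(indices_to_remove, float("-inf"))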

How did you calculate the BLEU score

Hi, thanks for the awesome project.
I noticed that the reported BLEU@4 and CIDEr scores in Table 1 are ~10 and ~50 on the MS COCO dataset (zero-shot; after fine-tuning the scores increase to 31 and 90+), respectively, which fall far behind traditional baselines like AoA and CLIP-ViL (they usually achieve ~40 BLEU-4 and 120+ CIDEr).
I am wondering whether the difference is due to the evaluation setup: did you use the evaluation in coco-caption or calculate the scores yourself?

Mismatching LM shape between 50400 (pretrained pt) and 50258 (gpt-2)

Thanks for this wonderful work 😍


I load mp_rank_00_model_states.pt, but the shape of the LM differs:

size mismatch for lm.lm_head.weight: copying a param with shape torch.Size([50400, 4096]) from checkpoint, the shape in current model is torch.Size([50258, 4096])

I guess it is because of the resize_token_embeddings here.


I also tried truncating the extra vocabulary entries,

sd["lm.lm_head.weight"] = sd["lm.lm_head.weight"][:50258, :]
sd["lm.lm_head.bias"] = sd["lm.lm_head.bias"][:50258]

but the result of example_inference.py seems weird 😂

bondankeNM Drama fixtures Sergey
Fantasticheddar AUTHOR hob sealedunction

Super thanks for the help!

No module named 'magma.transformers'

Hi,

I just downloaded the "magma-master" archive and followed the instructions (I think), but trying to run test.py I get errors. It seems some parts are missing?

First I get:
(magma) c:\Python\magma-master>python test.py
Traceback (most recent call last):
  File "c:\Python\magma-master\test.py", line 4, in <module>
    from magma.language_model import get_language_model
ImportError: cannot import name 'get_language_model' from 'magma.language_model' (c:\Python\magma-master\magma\language_model.py)

Looking at the code, it seems get_language_model is not used anywhere, so I commented line 4 out. But after that there is a similar missing name:

(magma) c:\Python\magma-master>python test.py
Traceback (most recent call last):
  File "c:\Python\magma-master\test.py", line 25, in <module>
    from magma.transformers import GPTJForCausalLM
ModuleNotFoundError: No module named 'magma.transformers'

And here GPTJForCausalLM is used right in the next line. Looking at transformers.py, there is nothing like GPTJForCausalLM in there at all. Seems like something is missing here completely?

Best
Tuxius

More model structure details, especially parameter counts of the adapters

Hello, first, thanks for your work on the multimodal model using adapter-based fine-tuning, which inspires me greatly.
I would like to know more details about the model structure, especially the parameter counts of the adapters, for which you only give a ratio relative to the default setting (sequential FF adapters with downsample factor 4) in the ablation study.

OpenSource strategy

AlephAlpha has repeatedly emphasised in various publications and statements that it is committed to OpenSource. Looking at the activity in this repository and the available trained weights, the reality is different.

Apart from positive marketing, what role does OpenSource play at AlephAlpha? The French company Mistral has shown how important the OpenSource idea is, and has set international standards with its released models.

All I see are academic advances that are used as arguments for further funding rounds. The model code in this repository is two years out of date, as are the available checkpoints. There are no publications on Huggingface, nor any other engagement in the community around LLM tools. Once again it seems to be a special German way of compartmentalisation, combined with the belief that enough money will take care of it.

So what is AlephAlpha's actual strategy regarding OpenSource?

`build_labels` includes masked image tokens?

Hi Authors,

In these lines, the function build_labels masks all label positions up to the sequence length of the image embeddings. What difference would it make if one just used the caption?

To be more specific, the code currently builds labels whose first part (with the same sequence length as the image prefix) is all set to -100, followed by the actual text labels. Why do we need all the -100s? Why couldn't we just use the text label ids?

Thanks a lot!
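
For context, a minimal illustration of the convention in question (shapes and token ids are made up): -100 is the default ignore_index of torch.nn.CrossEntropyLoss, so the image-prefix positions, which have no ground-truth token to predict, contribute nothing to the loss, while the label tensor still lines up position-for-position with the full image-plus-text sequence the logits are computed over.

import torch

n_img = 144                                   # assumed length of the image prefix
caption_ids = torch.tensor([318, 257, 4489])  # made-up caption token ids

# image positions are masked with -100; caption positions carry the real targets
labels = torch.cat([torch.full((n_img,), -100), caption_ids])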

How are the n-shot VQA examples selected?

Hi, thanks for the nice work! For the n-shot VQA experiments, how did you select the support-set examples? More specifically, did you use the same fixed set of n-shot demonstration examples for all queries, or were different n-shot demonstration examples selected for each query question? Many thanks!

Improved inference interface

Implement an interface like the one Mayukh suggested:

from magma import Magma 
from magma.image import Image, ImageFromURL  ## to easily load/use images

model, tokenizer = Magma(checkpoint = 'model.pt', config = 'config.yml', device = 'cuda:0')

inputs = [
    Image('path/to/image.jpg'),
    'Where is this ? A: Egypt',
    ImageFromURL('url/to/image.jpg'),
    'Where is this ? A:'
]

embeddings = tokenizer.tokenize(inputs).to(model.device)

output = model.forward(embeddings, output_attentions = True)

logits = output.logits ## tensor of shape [1, len_seq, len_vocab]
attentions = output.attentions ## list of tensors

## this already exists https://gitlab.aleph-alpha.de/research/multimodal_fewshot/-/blob/master/multimodal_fewshot/model.py#L442
generated_text = model.generate(embeddings, n_steps = 10, *args)

AssertionError: Parameter with name: lm.transformer.wte.weight occurs multiple times in optimizer.param_groups. Make sure it only appears once to prevent undefined behaviour

Hi,

I'd like to rerun the code using gpt-neo 125M or gpt2-med instead of gpt-neo 2.7B, and I'm getting this error:

AssertionError: Parameter with name: lm.transformer.wte.weight occurs multiple times in optimizer.param_groups. Make sure it only appears once to prevent undefined behaviour.

Any idea why this issue exists for other language models?
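
Not a confirmed diagnosis, but a plausible cause: smaller GPT-2/GPT-Neo checkpoints tie the input embedding (wte) to the output head, so the same tensor can land in more than one optimizer param group when groups are collected per module. A self-contained sketch of deduplicating by identity before building the optimizer:

import torch
import torch.nn as nn

# toy model with tied weights, mimicking a shared wte / lm head
emb = nn.Embedding(10, 4)
head = nn.Linear(4, 10, bias=False)
head.weight = emb.weight  # tie: both modules now share one tensor

decay = list(emb.parameters())      # parameters collected per module,
no_decay = list(head.parameters())  # so the tied tensor appears twice

seen, groups = set(), []
for params, wd in [(decay, 0.01), (no_decay, 0.0)]:
    unique = []
    for p in params:
        if id(p) not in seen:
            seen.add(id(p))
            unique.append(p)
    if unique:
        groups.append({"params": unique, "weight_decay": wd})

optimizer = torch.optim.AdamW(groups)  # no duplicate-parameter assertion now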
