sevlm-code's Introduction

Training A Small Emotional Vision Language Model for Visual Art Comprehension

Jing Zhang   Liang Zheng   Meng Wang   Dan Guo

This repo contains the official implementation of SEVLM.

💡Abstract

This paper develops small vision language models to understand visual art, which, given an artwork, aim to identify its emotion category and explain this prediction with natural language. While small models are computationally efficient, their capacity is much more limited than that of large models. To break this trade-off, this paper builds a small emotional vision language model (SEVLM) by emotion modeling and input-output feature alignment. On the one hand, based on valence-arousal-dominance (VAD) knowledge annotated by psychology experts, we introduce and fuse emotional features derived through a VAD dictionary and a VAD head to align the VAD vectors of the predicted emotion explanation and the ground truth. This allows the vision language model to better understand and generate emotional texts, compared with using traditional text embeddings alone. On the other hand, we design a contrastive head to pull close the embeddings of the image, its emotion class, and its explanation, which aligns model outputs and inputs. On two public affective explanation datasets, we show that the proposed techniques consistently improve the visual art understanding performance of baseline SEVLMs. Importantly, the proposed model can be trained and evaluated on a single RTX 2080 Ti while exhibiting very strong performance: it not only outperforms the state-of-the-art small models but is also competitive with LLaVA 7B after fine-tuning and with GPT4(V).

Fig. 1: Examples comparing different methods of predicting the emotion class and explaining why this emotion is evoked, given an art image, on both the ArtEmis v1.0 test set and the ArtEmis v2.0 Combined test set. Three models are compared: SAT[2], NLX-GPT2[29], and our method. In both examples, the explanations from existing methods are misaligned with the emotion label or the art image, but our method gives superior results. Green text indicates incorrect emotion classification results; red text indicates large discrepancies between the semantics of the explanations and the visual content; blue text denotes that the emotion of the explanation does not correspond to the predicted category. Our design in Sec. 4.2 aims to alleviate these problems.

🚀Model Architecture

Fig. 2: Detailed network structure of the proposed small emotional vision language model. It has: 1) a vision language backbone including an image encoder, a small language model (SLM) GPT2 decoder, and a language head; 2) VAD emotion modeling, which introduces VAD emotion knowledge into the text embeddings to enhance the model's capacity to understand emotion; 3) a VAD head to learn VAD-aware emotion; and 4) a contrastive head to enforce feature alignment among the image, emotion label, and explanation. During training, we use the emotion label and explanation as ground truth. At inference, we use the prompt "The emotion is _" and an art image as input and generate the emotion label and explanation.
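To make the VAD idea above concrete, here is a minimal sketch in plain Python. It assumes a tiny made-up lexicon (`TOY_VAD_LEXICON`), mean pooling over matched words, and a mean-squared-error comparison between the predicted and ground-truth explanation. All of these names and choices are illustrative assumptions; the repository's actual VAD dictionary, pooling, and VAD head loss may differ.

```python
from typing import Dict, Tuple

Vad = Tuple[float, float, float]  # (valence, arousal, dominance)

# Hypothetical toy lexicon with made-up scores in [0, 1].
# This is NOT the expert-annotated VAD dictionary used by the paper.
TOY_VAD_LEXICON: Dict[str, Vad] = {
    "calm": (0.9, 0.1, 0.6),
    "storm": (0.2, 0.9, 0.3),
    "dark": (0.2, 0.5, 0.4),
}

# Assumed fallback when no word of the text is in the lexicon.
NEUTRAL: Vad = (0.5, 0.5, 0.5)


def sentence_vad(text: str, lexicon: Dict[str, Vad] = TOY_VAD_LEXICON) -> Vad:
    """Average the VAD vectors of the words found in the lexicon.

    Tokenization is a plain whitespace split; punctuation is not stripped.
    """
    hits = [lexicon[word] for word in text.lower().split() if word in lexicon]
    if not hits:
        return NEUTRAL
    n = len(hits)
    return tuple(sum(vec[i] for vec in hits) / n for i in range(3))


def vad_alignment_error(predicted: str, reference: str) -> float:
    """Mean squared error between the VAD vectors of two explanations."""
    p = sentence_vad(predicted)
    r = sentence_vad(reference)
    return sum((a - b) ** 2 for a, b in zip(p, r)) / 3


if __name__ == "__main__":
    print(sentence_vad("a dark storm"))
    print(vad_alignment_error("a calm sea", "a dark storm"))
```

In this sketch, a generated explanation whose words carry a different emotional tone than the ground-truth explanation yields a larger error, which is the kind of signal a VAD head could be trained to reduce.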

🔥News

  • [2024.07.02] SEVLM has been accepted to ECCV 2024!
  • [2024.03.17] We have released the code of SEVLM!

⚙️Preparations

Requirements

  • PyTorch 1.8 or higher
  • pip install git+https://github.com/openai/CLIP.git
  • pip install transformers
  • pip install git+https://github.com/huggingface/accelerate

Images Download

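We conduct experiments on the ArtEmis dataset, whose images come from the WikiArt dataset. Download the WikiArt images, then resize them to a maximum size of 600px while keeping the aspect ratio, and store them in a folder named wikiart_rescaled_max_size_to_600px_same_aspect_ratio.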
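The resizing step could be sketched as follows. This is an illustration, not the repository's own preprocessing script: it assumes Pillow is installed, that the images are `.jpg` files, that "max size 600px" means the longer side is capped at 600 pixels, and that smaller images are left untouched. The function names `rescaled_size` and `resize_folder` are hypothetical.

```python
from pathlib import Path
from typing import Tuple


def rescaled_size(width: int, height: int, max_size: int = 600) -> Tuple[int, int]:
    """Return a size whose longer side is at most max_size, same aspect ratio."""
    longest = max(width, height)
    if longest <= max_size:
        return width, height  # assumption: never upscale small images
    scale = max_size / longest
    return max(1, round(width * scale)), max(1, round(height * scale))


def resize_folder(src_dir: str, dst_dir: str, max_size: int = 600) -> int:
    """Resize every .jpg under src_dir into dst_dir, mirroring subfolders.

    Returns the number of images written.
    """
    from PIL import Image  # Pillow; imported here so rescaled_size works without it

    src, dst = Path(src_dir), Path(dst_dir)
    count = 0
    for path in sorted(src.rglob("*.jpg")):
        out_path = dst / path.relative_to(src)
        out_path.parent.mkdir(parents=True, exist_ok=True)
        with Image.open(path) as img:
            new_size = rescaled_size(img.width, img.height, max_size)
            img.resize(new_size, Image.LANCZOS).save(out_path)
        count += 1
    return count


if __name__ == "__main__":
    print(rescaled_size(1200, 800))
```

A call such as `resize_folder("wikiart", "wikiart_rescaled_max_size_to_600px_same_aspect_ratio")` would then produce the folder the training code expects, where `"wikiart"` stands in for wherever the original images were downloaded.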

Data preprocessing

We conduct experiments on two benchmark datasets: ArtEmis v1.0 and ArtEmis v2.0. You need to perform data preprocessing for both datasets.

🎇Run

Please run from the command line with:

accelerate launch artEmisX_train.py --ckpt_path /xxxx/ckpts/  --caption_save_path /xxxx/results/  --nle_data_train_path /xxxx/data/artEmis/artEmisX_cl_train.json  --nle_data_val_path /xxxx/data/artEmis/artEmisX_val.json

☕Citation

If you find this work useful in your research, please consider citing:

@misc{zhang2024training,
      title={Training A Small Emotional Vision Language Model for Visual Art Comprehension}, 
      author={Jing Zhang and Liang Zheng and Meng Wang and Dan Guo},
      year={2024},
      eprint={2403.11150},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}

🍃Acknowledgement

  • We thank the authors of NLX-GPT for open-sourcing the implementation of their language model. The SEVLM repo is built on NLX-GPT.

sevlm-code's People

Contributors

betterzh

sevlm-code's Issues

UnboundLocalError: local variable 'emotion_distribution' referenced before assignment

Hi, I am hitting the error in the title when I run the training command. Training gets started with:
accelerate launch artEmisX_train.py --ckpt_path /xxxx/ckpts/ --caption_save_path /xxxx/results/ --nle_data_train_path /xxxx/data/artEmis/artEmisX_cl_train.json --nle_data_val_path /xxxx/data/artEmis/artEmisX_val.json

But then I get the following error:

File "SEVLM-for-Visual-Art-Comprehension/utils/datasets.py", line 171, in __getitem__
emotion_distribution = torch.stack([emotion_distribution,emotion_distribution])
UnboundLocalError: local variable 'emotion_distribution' referenced before assignment

Can you help fix that?
Thank you!
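For context, Python raises `UnboundLocalError` when a function reads a local variable on a code path where it was never assigned, typically because the assignment sits inside an `if` branch or loop that did not run. The sketch below is a generic illustration of that pattern and one common fix. It is not the code in utils/datasets.py, and the function and variable names are hypothetical.

```python
def lookup_buggy(key, table):
    """Assigns `value` only when the key is present."""
    if key in table:
        value = table[key]
    # If the branch above did not run, `value` was never bound,
    # so this line raises UnboundLocalError.
    return value


def lookup_fixed(key, table, default=None):
    """Binds `value` on every path before it is read."""
    value = default
    if key in table:
        value = table[key]
    return value


if __name__ == "__main__":
    table = {"painting_a": [0.1, 0.9]}
    try:
        lookup_buggy("painting_b", table)
    except UnboundLocalError as err:
        print("buggy version failed:", err)
    print("fixed version returned:", lookup_fixed("painting_b", table))
```

If the traceback above follows the same pattern, a likely cause is that the sample being loaded did not match the condition under which `emotion_distribution` is set, for example a missing key or a mismatch in the preprocessed JSON files. That is a guess from the error type alone and would need to be checked against the code around line 171.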
