MMToM-QA: Multimodal Theory of Mind Question Answering
Official Model Implementation and Benchmark Guidelines

This repo features the code for the paper MMToM-QA: Multimodal Theory of Mind Question Answering.

It contains:

  • 🪐 The implementation of our method, BIP-ALM.
  • 💥 Guidelines for utilizing the MMToM-QA benchmark.

Bayesian Inverse Planning Accelerated by Language Models (BIP-ALM)

We propose Bayesian Inverse Planning Accelerated by Language Models (BIP-ALM), a novel method to engineer multimodal Theory of Mind. The pipeline runs in the steps listed here.

To extract symbolic representations from videos:

python visual_perception.py --benchmark_video_path videos

To extract symbolic representations from text and fuse them with video-derived representations, creating a unified depiction of the event and the physical scene:

python text_parsing.py --type multimodal --benchmark_path questions.json --output_file multimodal_representations.json

To conduct inverse inference about the agent's goal and belief using (finetuned) language models (taking LLaMA 2 as an example):

python inverse_symbolic_planner.py --data_path multimodal_representations.json --model_name_or_path meta-llama/Llama-2-7b-hf --load_lora 1 --lora_name_or_path output/checkpoint-llama --output_path output.txt
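
For intuition about what inverse inference over goals and beliefs involves, the following is a generic sketch of Bayesian hypothesis comparison, not the code in inverse_symbolic_planner.py. It assumes that some policy model (for example a language model) has already produced a log-likelihood for each observed action under each hypothesis. The function name and the input layout are hypothetical.

```python
import math


def posterior_over_hypotheses(log_likelihoods, log_priors=None):
    """Normalize hypothesis scores into a posterior distribution.

    log_likelihoods maps each hypothesis name to a list of per-step values
    log P(action_t | state_t, hypothesis). This layout is an assumption
    made for illustration only.
    """
    names = list(log_likelihoods)
    if log_priors is None:
        # Uniform prior: every hypothesis starts with the same score.
        log_priors = {name: 0.0 for name in names}

    # Unnormalized log posterior: log prior plus the summed log-likelihoods.
    scores = {name: log_priors[name] + sum(log_likelihoods[name]) for name in names}

    # Subtract the maximum before exponentiating for numerical stability.
    top = max(scores.values())
    normalizer = sum(math.exp(score - top) for score in scores.values())
    return {name: math.exp(scores[name] - top) / normalizer for name in names}
```

With two hypotheses, one per answer option, the option with the higher posterior would be selected as the answer.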

To finetune the language model (optional):

python generate_finetuning_dataset.py --data_path finetuning_data --output_file formatted_finetuning_data.json
python finetuning.py --train_file formatted_finetuning_data.json --model_name_or_path meta-llama/Llama-2-7b-hf --fisher_matrix_path fisher-matrix/fisher-matrix-6B --per_device_train_batch_size 4 --gradient_accumulation_steps 4 --lr 5e-5 --num_epochs 3 --ewc_lambda 0.5 --output_dir output_finetuning.txt

MMToM-QA Benchmark

MMToM-QA is the first multimodal benchmark to evaluate machine Theory of Mind (ToM), the ability to understand people's minds. For a detailed description, please visit our project page.

Benchmark composition

Benchmark (text only)

Videos (~250G)

Usage

Text-only version

To use the text-only version of MMToM-QA, you only need the file questions.json. It contains 600 entries, one per question. Each entry is a Python dictionary with a "question" key holding the question text and an "answer" key holding the correct response, either "a" or "b". The "question_type" key can be used to compute accuracy for each question subtype.
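
A minimal sketch of scoring predictions per question subtype with the keys described above. The on-disk layout of questions.json is an assumption: the loader accepts either a single JSON array or one JSON object per line. The helper names and the format of the predictions list are hypothetical.

```python
import json
from collections import defaultdict


def load_questions(path):
    """Load entries from a JSON array or from one JSON object per line."""
    with open(path, encoding="utf-8") as f:
        text = f.read().strip()
    try:
        data = json.loads(text)
        return data if isinstance(data, list) else [data]
    except json.JSONDecodeError:
        # Fall back to JSON Lines: one dictionary per non-empty line.
        return [json.loads(line) for line in text.splitlines() if line.strip()]


def accuracy_by_type(entries, predictions):
    """Return {question_type: accuracy} for predictions given as "a" or "b"."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for entry, prediction in zip(entries, predictions):
        question_type = entry["question_type"]
        total[question_type] += 1
        correct[question_type] += int(prediction.strip().lower() == entry["answer"])
    return {question_type: correct[question_type] / total[question_type]
            for question_type in total}
```

For example, `accuracy_by_type(load_questions("questions.json"), predictions)` gives one accuracy value per subtype, where `predictions` lists the model's answers in the same order as the entries.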

Example of testing GPT-4:

python testing_gpt.py --gpt_choice gpt-4

Multimodal version

To use the multimodal version of MMToM-QA, you also need the video for each question. Each entry has an "episode" key that identifies the corresponding video and an "end_time" key; only the steps before this end time should be considered.

The video data contains 134 videos, each labeled with an episode number. The file "task_{episode}/frame_intervals.pik" records the start and end frames of each step. From "end_time", you can determine the end frame required for a particular video. The frames of a video are stored as "task_{episode}/script/0/Action_{frame_number:04d}_0_normal.png". You can use all frames or sample a subset of them. Other data types (e.g., "depth.exr", "seg_inst.png") are provided in the same path for reference.
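
A minimal sketch of building frame paths and sampling frames up to a given end frame, using the path patterns quoted above. The internal structure of frame_intervals.pik is not described here, so the loader only unpickles the file and leaves the mapping from "end_time" to an end frame to the reader. The helper names, the `videos_root` argument, and the default start frame of 0 are assumptions.

```python
import pickle
from pathlib import Path


def load_frame_intervals(videos_root, episode):
    """Unpickle task_{episode}/frame_intervals.pik; inspect the result to learn its layout."""
    path = Path(videos_root) / f"task_{episode}" / "frame_intervals.pik"
    with open(path, "rb") as f:
        return pickle.load(f)


def frame_path(videos_root, episode, frame_number):
    """Path of one RGB frame, following the naming pattern given in the text."""
    return (Path(videos_root) / f"task_{episode}" / "script" / "0"
            / f"Action_{frame_number:04d}_0_normal.png")


def sample_frame_numbers(end_frame, num_samples, start_frame=0):
    """Evenly spaced frame numbers in [start_frame, end_frame], endpoints included."""
    if num_samples <= 0 or end_frame < start_frame:
        return []
    total = end_frame - start_frame + 1
    if num_samples >= total:
        # Fewer frames than requested samples: return every frame.
        return list(range(start_frame, end_frame + 1))
    if num_samples == 1:
        return [end_frame]
    step = (end_frame - start_frame) / (num_samples - 1)
    return [start_frame + round(i * step) for i in range(num_samples)]
```

Once the end frame for a question is known, `[frame_path(root, episode, n) for n in sample_frame_numbers(end_frame, 8)]` lists eight frames to pass to a vision model.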

Example of testing GPT-4V:

python testing_gpt.py --gpt_choice gpt-4v

Video-only version

To use the video-only version, follow the same process as for the multimodal version, but remove the text context from each "question": exclude the sections "What's inside the apartment: ..." and "Actions taken by the person: ...", keeping only the question itself.
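
A minimal sketch of dropping the two context sections from a question string. It assumes that the question proper begins at a "Question:" marker that follows both sections; this marker is a guess about the text layout, so check it against a real entry before relying on it. The function name is hypothetical.

```python
def strip_text_context(question, question_marker="Question:"):
    """Keep only the text from the question marker onward.

    ASSUMPTION: the sections "What's inside the apartment: ..." and
    "Actions taken by the person: ..." appear before the marker.
    If the marker is missing, the input is returned unchanged.
    """
    index = question.find(question_marker)
    if index == -1:
        return question
    return question[index:].strip()
```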

Example of testing GPT-4V:

python testing_gpt.py --gpt_choice gpt-4v --video_only

Note: The information in the text and in the videos is partly overlapping and partly complementary, as demonstrated in the example. From a video, a model gets the dynamic state change at each step, as well as which objects the agent is walking towards or passing by at that step. A model needs this information to determine the agent’s expected action plans given a hypothesis about its belief and goal. From the text, a model gets ground-truth information about the initial state. Because the limited camera view and occlusion make observations partial, the text provides state information that is sometimes unavailable in the video. A model needs the true world state to determine whether the agent holds a true or a false belief, and which objects the agent has observed so far.

Citations

Please cite the paper and star this repo if you find it interesting/useful, thanks!

@inproceedings{jin2023mmtom,
      title={MMToM-QA: Multimodal Theory of Mind Question Answering},
      author={Jin, Chuanyang and Wu, Yutong and Cao, Jing and Xiang, Jiannan and Kuo, Yen-Ling and Hu, Zhiting and Ullman, Tomer and Torralba, Antonio and Tenenbaum, Joshua B and Shu, Tianmin},
      booktitle={NeurIPS 2023 Foundation Models for Decision Making Workshop},
      year={2023}
}
