This repo is a work in progress.
This repo contains the inference and evaluation code for the paper Binding Text, Images, Graphs, and Audio for Music Representation Learning.
The current state of this repo is not ideal. To help you navigate the checkpoints and inference code, please refer to the following sheet temporarily while we prepare this repo. The code for embedding Text and Images is available in the scripts folder. For Audio Embeddings, code is available here; for Graph Embeddings, code is available here.
N.B. Fairouz is the codename given to the model we envisioned. This release is one iteration, hopefully of many; it covers part of our vision but nowhere near the full scope of what we aim to do with Fairouz.
In the fields of Information Retrieval and Natural Language Processing, text embeddings play a significant role in tasks such as classification, clustering, and topic modeling. However, extending these embeddings to abstract concepts such as music, which involves multiple modalities, presents a unique challenge. Our work addresses this challenge by integrating rich multi-modal data into a unified joint embedding space that includes textual, visual, acoustic, and graph-based modality features. By doing so, we mirror cognitive processes associated with music interaction and overcome the disjoint nature of individual modalities. The resulting joint low-dimensional vector space facilitates retrieval, clustering, embedding space arithmetic, and cross-modal retrieval tasks, with implications for music information retrieval and recommendation systems. Furthermore, we propose a novel multi-modal model that integrates various data types (text, images, graphs, and audio) for music representation learning. Our model aims to capture the complex relationships between different modalities, enhancing the overall understanding of music. By combining textual descriptions, visual imagery, graph-based structures, and audio signals, we create a comprehensive representation that can be leveraged for a wide range of music-related tasks. Notably, our model demonstrates promising results in music classification and recommendation.
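
To make the idea of cross-modal retrieval and embedding space arithmetic in a joint space more concrete, here is a minimal sketch in Python with NumPy. It does not use this repo's checkpoints or scripts: the vectors are random placeholders standing in for embeddings that have already been projected into a shared space, and the names `l2_normalize`, `retrieve`, `audio_embeddings`, and `text_query` are hypothetical, chosen only for illustration.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale vectors to unit length along the last axis."""
    norms = np.linalg.norm(x, axis=-1, keepdims=True)
    return x / np.maximum(norms, eps)


def retrieve(query, candidates, k=3):
    """Return indices and cosine scores of the k candidates closest to query."""
    q = l2_normalize(np.asarray(query, dtype=np.float64))
    c = l2_normalize(np.asarray(candidates, dtype=np.float64))
    scores = c @ q                      # cosine similarity per candidate
    top = np.argsort(-scores)[:k]       # highest similarity first
    return top, scores[top]


rng = np.random.default_rng(0)
dim = 8  # assumed dimensionality, for illustration only

# Placeholder "audio" embeddings; real ones would come from a trained model.
audio_embeddings = rng.normal(size=(5, dim))

# Placeholder "text" embedding, constructed to lie close to audio item 2
# so that the cross-modal lookup has a known correct answer.
text_query = audio_embeddings[2] + 0.01 * rng.normal(size=dim)

# Cross-modal retrieval: text query against audio candidates.
top_idx, top_scores = retrieve(text_query, audio_embeddings, k=3)

# Embedding arithmetic: combine two items and search with the result.
combined = l2_normalize(audio_embeddings[0]) + l2_normalize(audio_embeddings[1])
arith_idx, arith_scores = retrieve(combined, audio_embeddings, k=2)

print("nearest audio items to text query:", top_idx)
print("nearest audio items to combined vector:", arith_idx)
```

Because every modality shares one vector space in this setup, the same `retrieve` call would work for any pairing of query and candidate modality; only the source of the vectors changes.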