An experiment in using cross-modal retrieval to create MADmovies.
The idea is to use a multimodal pre-trained model, e.g. Chinese-CLIP, to create vector representations of images. Given some text, the most related images can then be retrieved via nearest-neighbor search and used as the basis for MADmovie creation.