Meta-Transformer: A Unified Framework for Multimodal Learning

Unified Multimodal Learning. Meta-Transformer uses a single shared backbone to encode twelve kinds of input: natural language, image, point cloud, audio, video, infrared, hyperspectral, X-ray, time-series, tabular, Inertial Measurement Unit (IMU), and graph data. The framework demonstrates the potential of transformer architectures for unified multimodal intelligence.
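
The shared-backbone idea can be illustrated with a minimal sketch: each modality gets its own small tokenizer that maps raw data to a sequence of vectors of one common width, and a single encoder then processes every sequence. Everything below is an assumption made for illustration, not the paper's implementation: the names ChunkTokenizer and SharedEncoder, the width D, the chunk sizes, the use of one untrained self-attention layer in place of a full transformer, and the random input data.

```python
import numpy as np

D = 16  # shared token width (assumed value, chosen only for the sketch)
rng = np.random.default_rng(0)


def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


class ChunkTokenizer:
    """Hypothetical modality-specific tokenizer.

    Flattens the input, cuts it into fixed-size chunks, and linearly
    projects each chunk to the shared width D.
    """

    def __init__(self, chunk, dim):
        self.chunk = chunk
        self.proj = rng.normal(0.0, chunk ** -0.5, (chunk, dim))

    def __call__(self, x):
        flat = np.asarray(x, dtype=float).ravel()
        n = len(flat) // self.chunk  # drop any trailing remainder
        return flat[: n * self.chunk].reshape(n, self.chunk) @ self.proj


class SharedEncoder:
    """One untrained self-attention layer standing in for the shared backbone."""

    def __init__(self, dim):
        self.dim = dim
        scale = dim ** -0.5
        self.wq = rng.normal(0.0, scale, (dim, dim))
        self.wk = rng.normal(0.0, scale, (dim, dim))
        self.wv = rng.normal(0.0, scale, (dim, dim))

    def __call__(self, tokens):
        # tokens has shape (sequence_length, dim)
        q, k, v = tokens @ self.wq, tokens @ self.wk, tokens @ self.wv
        attn = softmax(q @ k.T / np.sqrt(self.dim))
        out = tokens + attn @ v   # residual connection
        return out.mean(axis=0)   # mean-pool to a single (dim,) embedding


# One tokenizer per modality, but only ONE encoder shared by all of them.
tokenizers = {
    "image": ChunkTokenizer(48, D),
    "audio": ChunkTokenizer(20, D),
    "time_series": ChunkTokenizer(8, D),
}
encoder = SharedEncoder(D)


def encode(modality, x):
    """Tokenize with the modality's tokenizer, then run the shared encoder."""
    return encoder(tokenizers[modality](x))


# Random placeholder inputs; shapes are illustrative only.
samples = {
    "image": rng.random((3, 32, 32)),
    "audio": rng.random(16000),
    "time_series": rng.random(64),
}
embeddings = {name: encode(name, x) for name, x in samples.items()}
```

Every entry of embeddings has the same shape, (D,), regardless of the input modality, because the tokenizers absorb the differences in raw data layout and the encoder weights are shared. Supporting another modality in this sketch means adding a tokenizer, not a new encoder.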

https://arxiv.org/abs/2307.10802