NExT-GPT is an end-to-end, general-purpose, any-to-any MM-LLM system that connects an LLM with multimodal adaptors and different diffusion decoders. It can perceive inputs and generate outputs in arbitrary combinations of text, images, videos, and audio, using existing well-trained, high-performing encoders and decoders. Only a small number of parameters (about 1%), belonging to certain projection layers, are tuned, which not only enables low-cost training but also facilitates convenient expansion to more potential modalities. A rough sketch of this training budget follows below.
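
As a rough illustration of this tuning strategy, the sketch below (PyTorch) freezes placeholder modules standing in for the pretrained multimodal encoder, LLM core, and diffusion decoder, and leaves only the small input/output projection layers trainable. All module names and dimensions are illustrative assumptions, not the actual NExT-GPT code, and the toy sizes do not reproduce the real ~1% ratio, which arises because the frozen components are vastly larger than the projections.

```python
import torch
import torch.nn as nn


class AnyToAnyMMLLM(nn.Module):
    """Toy stand-in for an any-to-any MM-LLM: frozen backbone, trainable projections."""

    def __init__(self, enc_dim=1024, llm_dim=4096, dec_dim=768):
        super().__init__()
        # Placeholders for the frozen, well-trained components (hypothetical sizes).
        self.modality_encoder = nn.Linear(enc_dim, enc_dim)   # stands in for a multimodal encoder
        self.llm = nn.Linear(llm_dim, llm_dim)                # stands in for the LLM core
        self.diffusion_decoder = nn.Linear(dec_dim, dec_dim)  # stands in for a diffusion decoder

        # Trainable projection layers bridging encoder -> LLM and LLM -> decoder space.
        self.input_projection = nn.Linear(enc_dim, llm_dim)
        self.output_projection = nn.Linear(llm_dim, dec_dim)

        # Freeze everything except the projection layers.
        for module in (self.modality_encoder, self.llm, self.diffusion_decoder):
            for p in module.parameters():
                p.requires_grad = False

    def forward(self, x):
        h = self.modality_encoder(x)                # frozen multimodal encoding
        h = self.llm(self.input_projection(h))      # project into LLM space, run frozen LLM
        return self.diffusion_decoder(self.output_projection(h))  # project to decoder space


model = AnyToAnyMMLLM()
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"trainable fraction: {trainable / total:.2%}")  # only the projections receive gradients
```

Because gradients flow only through the two projection layers, an optimizer built from `filter(lambda p: p.requires_grad, model.parameters())` updates just that small bridging subset, which is what keeps training cheap and lets a new modality be added by attaching another encoder/decoder pair with its own projections.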