NExT-GPT is an end-to-end, general-purpose, any-to-any MM-LLM system that connects an LLM with multimodal adaptors and different diffusion decoders. It can perceive inputs and generate outputs in arbitrary combinations of text, images, videos, and audio, using existing well-trained, high-performing encoders and decoders. Only a small number of parameters (about 1%), belonging to certain projection layers, are tuned, which not only enables low-cost training but also facilitates convenient expansion to more potential modalities. A rough sketch of this training budget follows below.
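
As a rough illustration of this tuning strategy, the sketch below (PyTorch) freezes placeholder modules standing in for the pretrained multimodal encoder, LLM core, and diffusion decoder, and leaves only the small input/output projection layers trainable. All module names and dimensions are illustrative assumptions, not the actual NExT-GPT code, and the toy sizes do not reproduce the real ~1% ratio, which arises because the frozen components are vastly larger than the projections.

```python
import torch
import torch.nn as nn


class AnyToAnyMMLLM(nn.Module):
    """Toy stand-in for an any-to-any MM-LLM: frozen backbone, trainable projections."""

    def __init__(self, enc_dim=1024, llm_dim=4096, dec_dim=768):
        super().__init__()
        # Placeholders for the frozen, well-trained components (hypothetical sizes).
        self.modality_encoder = nn.Linear(enc_dim, enc_dim)   # stands in for a multimodal encoder
        self.llm = nn.Linear(llm_dim, llm_dim)                # stands in for the LLM core
        self.diffusion_decoder = nn.Linear(dec_dim, dec_dim)  # stands in for a diffusion decoder

        # Trainable projection layers bridging encoder -> LLM and LLM -> decoder space.
        self.input_projection = nn.Linear(enc_dim, llm_dim)
        self.output_projection = nn.Linear(llm_dim, dec_dim)

        # Freeze everything except the projection layers.
        for module in (self.modality_encoder, self.llm, self.diffusion_decoder):
            for p in module.parameters():
                p.requires_grad = False

    def forward(self, x):
        h = self.modality_encoder(x)                # frozen multimodal encoding
        h = self.llm(self.input_projection(h))      # project into LLM space, run frozen LLM
        return self.diffusion_decoder(self.output_projection(h))  # project to decoder space


model = AnyToAnyMMLLM()
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"trainable fraction: {trainable / total:.2%}")  # only the projections receive gradients
```

Because gradients flow only through the two projection layers, an optimizer built from `filter(lambda p: p.requires_grad, model.parameters())` updates just that small bridging subset, which is what keeps training cheap and lets a new modality be added by attaching another encoder/decoder pair with its own projections.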