As Chinese AI start-ups contend with restricted access to advanced chips and tighter access to capital than their US counterparts, the domestic industry has been racing to keep pace with the rapid model development of industry leaders such as OpenAI and Google. BAAI is a non-profit organisation that helps China’s AI community build up its capabilities.
The latest generation of Emu3, BAAI’s multimodal model, uses a simplified architecture to train a single model that can understand images and generate video clips, the organisation said at an event in Beijing on Monday. Multimodal models are designed to handle multiple types of input data, such as text, video and audio, unlike traditional models that handle only one type.
Wang Zhongyuan, head of BAAI, which is also known as the Zhiyuan Institute, said the new model is the “largest technological contribution in recent years” from the 6-year-old organisation.
Emu3 adopts a unified AI architecture that converts text, images and video clips into a single stream of tokens, which is then used to pre-train one model. A token is the smallest unit of data – such as a word, a fragment of an image or a video frame – that an AI model can process.
That approach removes the need to stitch together task-specific models for different data types, making it simpler and more efficient to train a versatile AI model.
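In rough terms, the idea is that each modality is converted into discrete tokens, the tokens are concatenated into one sequence, and a single model is pre-trained to predict the next token regardless of what that token represents. The snippet below is a minimal illustrative sketch of that token-unification step under those assumptions; the function names, vocabulary sizes and special markers are hypothetical and are not BAAI’s actual code or API.

```python
# Illustrative sketch only: how text, image and video data might be mapped into one
# shared token sequence for a single model. All names, vocabulary sizes and special
# tokens here are hypothetical placeholders, not Emu3's real implementation.
from typing import List

TEXT_VOCAB = 32_000      # hypothetical text vocabulary size
VISUAL_VOCAB = 8_192     # hypothetical visual codebook size
BOI = TEXT_VOCAB + VISUAL_VOCAB       # hypothetical "begin image/video" marker
EOI = TEXT_VOCAB + VISUAL_VOCAB + 1   # hypothetical "end image/video" marker

def encode_text(text: str) -> List[int]:
    # Placeholder: a real system would use a learned subword tokenizer.
    return [hash(word) % TEXT_VOCAB for word in text.split()]

def encode_visual(num_patches: int) -> List[int]:
    # Placeholder: a real system would use a learned visual tokenizer that maps image
    # patches or video frames to discrete codebook indices, offset past the text vocab.
    return [TEXT_VOCAB + (i % VISUAL_VOCAB) for i in range(num_patches)]

def build_sequence(caption: str, num_patches: int) -> List[int]:
    # Interleave modalities into one flat token sequence; one model can then be
    # pre-trained with plain next-token prediction over sequences like this,
    # instead of combining separate task-specific models.
    return encode_text(caption) + [BOI] + encode_visual(num_patches) + [EOI]

print(build_sequence("a cat on a mat", num_patches=64)[:10])
```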
BAAI said Emu3 outperforms some well-established task-specific models, including the image generator Stable Diffusion XL at producing images and the multimodal model LLaVA at understanding them.