From 16 Sentences to 11-Minute Videos: Microsoft’s NUWA-XL

Microsoft Research Asia recently introduced a multimodal generative AI model called NUWA-XL, which can generate video content up to 11 minutes long from just 16 descriptive sentences.

Microsoft Research Asia first proposed a multimodal generative AI model named NUWA in 2021, capable of generating images and video content from natural language descriptions. Its upgraded successor, NUWA-Infinity, further raised the resolution of generated images and videos.

NUWA-XL builds on a “diffusion over diffusion” framework. A global diffusion model first generates key frames spanning the entire time range of the video, and a local diffusion model then fills in the content between neighboring key frames. This arrangement speeds up overall generation while preserving the continuity and completeness of the content.

The overall process starts by generating key frames from the input description sentences. Video segments are then generated in sequence for those key frames, and the diffusion model is used to extend the length of the video, turning the initially rough chapters into a complete story. In the demonstration, Microsoft used the animated series “The Flintstones” as a basis to automatically generate entirely new animated content.
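The coarse-to-fine ordering described above can be illustrated with a small scheduling sketch. It does not run any diffusion model and is not Microsoft's implementation; it only computes which frame indices would be produced at each level if key frames are laid out first and the gaps between neighbors are then filled. The function name `coarse_to_fine_schedule` and the `branch` parameter (how many pieces each gap is split into) are assumptions made for illustration.

```python
def coarse_to_fine_schedule(num_frames: int, branch: int) -> list[list[int]]:
    """Return the frame indices produced at each level, coarsest first.

    Hypothetical helper: it models only the ordering of frames,
    not the generation of their content.
    """
    if num_frames < 2 or branch < 2:
        raise ValueError("need at least 2 frames and a branch factor of 2 or more")

    last = num_frames - 1
    parts = min(branch, last)
    # Level 0 stands in for the global pass: key frames spread over the whole clip.
    known = [(last * k) // parts for k in range(parts + 1)]
    levels = [known]

    # Each later level stands in for a local pass: fill every gap between neighbors.
    while len(known) < num_frames:
        new = []
        for a, b in zip(known, known[1:]):
            gap = b - a
            if gap > 1:
                parts = min(branch, gap)
                new.extend(a + (gap * k) // parts for k in range(1, parts))
        levels.append(new)
        known = sorted(known + new)
    return levels


if __name__ == "__main__":
    for depth, frames in enumerate(coarse_to_fine_schedule(num_frames=17, branch=4)):
        print(f"level {depth}: {frames}")
```

In this sketch, each gap is filled using only the two frames that bound it, so the gaps at a given level do not depend on one another. That independence is the kind of property that lets a coarse-to-fine scheme avoid generating a long video strictly frame after frame.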

With this technology, the average inference time to generate 1,024 frames has been reduced from 7.55 minutes to just 26 seconds, a 94.26% reduction in inference time (roughly a 17-fold speedup).
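The arithmetic behind the 94.26% figure can be checked directly. The short sketch below uses only the two timings quoted above; the function names are illustrative. It shows that 94.26% is the share of inference time saved, which corresponds to a speedup factor of about 17.

```python
def time_reduction(baseline_s: float, new_s: float) -> float:
    """Fraction of the baseline time that is saved."""
    return (baseline_s - new_s) / baseline_s


def speedup(baseline_s: float, new_s: float) -> float:
    """How many times faster the new timing is."""
    return baseline_s / new_s


baseline_s = 7.55 * 60  # 7.55 minutes expressed in seconds
nuwa_xl_s = 26.0        # reported time for 1,024 frames

print(f"time reduction: {time_reduction(baseline_s, nuwa_xl_s):.2%}")
print(f"speedup factor: {speedup(baseline_s, nuwa_xl_s):.1f}x")
```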

However, Microsoft states that video generation still depends on training with sufficiently high-quality video content. NUWA-XL's design mirrors the professional animation production process: key frames are generated first, and further content is then derived from those key frames until a complete animated video is composed. This preserves continuity and generation quality while speeding up generation.