Stability AI is expanding its growing roster of generative AI models, quite literally adding a new dimension with the debut of Stable Video 4D.
While there is a growing set of gen AI tools for video generation, including OpenAI’s Sora, Runway, Haiper and Luma AI, among others, Stable Video 4D is something a bit different. Stable Video 4D builds on the foundation of Stability AI’s existing Stable Video Diffusion model, which converts images into videos. The new model takes that concept further by accepting video input and generating multiple novel-view videos from eight different perspectives.
“We see Stable Video 4D being used in movie production, gaming, AR/VR, and other use cases where there is a need to view dynamically moving 3D objects from arbitrary camera angles,” Varun Jampani, team lead of 3D Research at Stability AI, told VentureBeat.
Stable Video 4D is different from just 3D for gen AI
This isn’t Stability AI’s first foray beyond the flat world of 2D space.
In March, Stability AI announced Stable Video 3D, which lets users generate short 3D videos from an image or text prompt. Stable Video 4D goes a significant step further. While 3D, that is, three dimensions, is commonly understood as imagery or video with depth, 4D isn’t as universally understood.
Jampani explained that the four dimensions include width (x), height (y), depth (z) and time (t).
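For intuition, the short sketch below (plain NumPy, purely illustrative and not Stability AI’s code) shows how adding a time axis turns a static 3D representation into a “4D” one; the grid size and frame count are arbitrary assumptions.

```python
import numpy as np

# Purely illustrative: a static 3D object can be described by a voxel grid
# over width (x), height (y) and depth (z).
static_object = np.zeros((64, 64, 64))               # shape: (x, y, z) -> 3D

# Adding time (t) as a fourth axis gives a dynamic, "4D" representation:
# the same object captured at every frame of a short clip.
num_frames = 25                                       # assumed clip length
dynamic_object = np.zeros((num_frames, 64, 64, 64))   # shape: (t, x, y, z) -> 4D

print(static_object.ndim, dynamic_object.ndim)        # prints: 3 4
```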
“The key aspects that enabled Stable Video 4D are that we combined the strengths of our previously-released Stable Video Diffusion and Stable Video 3D models, and fine-tuned it with a carefully curated dynamic 3D object dataset,” Jampani explained.
Jampani noted that Stable Video 4D is a first-of-its-kind network where a single network does both novel view synthesis and video generation. Existing works leverage separate video generation and novel view synthesis networks for this task.
He also explained that Stable Video 4D differs from Stable Video Diffusion and Stable Video 3D in terms of how its attention mechanisms work.
“We carefully design attention mechanisms in the diffusion network which allow generation of each video frame to attend to its neighbors at different camera views or timestamps, thus resulting in better 3D coherence and temporal smoothness in the output videos,” Jampani said.
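To make the idea of attending across camera views and timestamps concrete, here is a toy PyTorch sketch that runs attention once along the view axis and once along the time axis of a stack of frame features. The tensor shapes and the use of scaled_dot_product_attention are assumptions for illustration only; this is not SV4D’s actual architecture.

```python
import torch
import torch.nn.functional as F

# Toy feature tensor: 8 camera views, 5 timestamps, 16 tokens per frame, 64-dim features.
# All sizes are illustrative, not SV4D's real configuration.
views, times, tokens, dim = 8, 5, 16, 64
x = torch.randn(views, times, tokens, dim)

# View attention: for a fixed timestamp, each frame attends to the same
# moment rendered from the other camera views (encourages 3D coherence).
xv = x.permute(1, 2, 0, 3).reshape(times * tokens, views, dim)
xv = F.scaled_dot_product_attention(xv, xv, xv)
x = xv.reshape(times, tokens, views, dim).permute(2, 0, 1, 3)

# Time attention: for a fixed view, each frame attends to its neighbors
# at other timestamps (encourages temporal smoothness).
xt = x.permute(0, 2, 1, 3).reshape(views * tokens, times, dim)
xt = F.scaled_dot_product_attention(xt, xt, xt)
x = xt.reshape(views, tokens, times, dim).permute(0, 2, 1, 3)

print(x.shape)  # torch.Size([8, 5, 16, 64])
```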
How Stable Video 4D works differently from gen AI infill
With gen AI tools for 2D image generation, the concept of infill and outfill, which fills in missing areas of an image, is well established. That approach, however, is not how Stable Video 4D works.
Jampani explained that the approach is different from generative infill/outfill, where the networks typically complete partially given information; that is, the output is already partially filled by explicitly transferring information from the input image.
“Stable Video 4D completely synthesizes the 8 novel view videos from scratch by using the original input video as guidance,” he said. “There is no explicit transfer of pixel information from input to output, all of this information transfer is done implicitly by the network.”
Stable Video 4D is currently available for research evaluation on Hugging Face. Stability AI has not yet announced what commercial options will be available for it in the future.
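For researchers who want to evaluate the release, weights hosted on Hugging Face can typically be fetched with the huggingface_hub client. The repo id and filename below are assumptions based on Stability AI’s naming conventions and should be checked against the model card, which may also require accepting a license before download.

```python
from huggingface_hub import hf_hub_download

# Assumed repo id and filename -- verify against the model card on Hugging Face.
checkpoint_path = hf_hub_download(
    repo_id="stabilityai/sv4d",
    filename="sv4d.safetensors",
)
print(f"Downloaded SV4D weights to: {checkpoint_path}")
```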
“Stable Video 4D can already process single-object videos of several seconds with a plain background,” Jampani said. “We plan to generalize it to longer videos and also to more complex scenes.”