Google Lumiere: Everything about the multimodal AI model for video creation
Harsh Shivam, New Delhi
Google has unveiled a new multimodal AI model, ‘Lumiere’, for video generation. Google said, “Lumiere is a text-to-video diffusion model designed for synthesising videos that portray realistic, diverse and coherent motion.” The company touted that the model facilitates content creation tasks and video editing applications such as image-to-video, video inpainting, and stylized video generation.
According to Google, the Lumiere model uses a Space-Time U-Net (STUNet) architecture to generate videos. With this design, the model processes all the frames in a video at once, instead of generating keyframes and then filling in the missing frames using temporal super-resolution (TSR) models, which is the common approach in existing video generators.
Google said Lumiere generates the entire temporal duration of the video at once by deploying both spatial and temporal down- and up-sampling. In essence, the model first generates a full-frame-rate video at low resolution and later upscales it using a spatial super-resolution (SSR) model to produce the final result. In the research paper previewing Lumiere, Google said the sample videos generated by the AI model are 80 frames long at 16 frames per second, that is, five seconds long. The initially generated video is at 128×128 resolution, which is then upscaled to 1024×1024 using SSR.
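A rough way to picture this two-stage pipeline is the Python sketch below, which only tracks the published array shapes; the function names (stunet_generate, ssr_upscale) and the placeholder frames are illustrative assumptions, not Google's actual code or API.

```python
import numpy as np

def stunet_generate(prompt: str, frames: int = 80, size: int = 128) -> np.ndarray:
    """Stand-in for the Space-Time U-Net: the real model denoises the whole
    clip (every frame at once) from a text prompt; here we just return
    placeholder black frames with the published low-resolution dimensions."""
    return np.zeros((frames, size, size, 3), dtype=np.uint8)  # (T, H, W, RGB)

def ssr_upscale(video: np.ndarray, size: int = 1024) -> np.ndarray:
    """Stand-in for the spatial super-resolution (SSR) stage: a simple
    nearest-neighbour resize of each frame, to illustrate the shape change
    from 128x128 to 1024x1024."""
    factor = size // video.shape[1]
    return video.repeat(factor, axis=1).repeat(factor, axis=2)

low_res = stunet_generate("a bear playing a guitar")  # (80, 128, 128, 3), 16 fps -> 5 s
final = ssr_upscale(low_res)                          # (80, 1024, 1024, 3)
print(low_res.shape, final.shape)
```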
According to Google, the Lumiere video generation model also lets users apply text-based image editing methods for consistent video editing. For example, its Cinemagraphs feature lets users animate a specific region within an image to generate a video, as sketched below. For stylized video generation, Lumiere can produce videos in a target style using a single reference image provided by the user.
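As a loose illustration of the cinemagraph idea, the hypothetical snippet below keeps every pixel outside a user-chosen region frozen at the source image while only the masked region changes across frames; the helper name, the mask logic and the random "motion" are assumptions for illustration only, not Lumiere's interface.

```python
import numpy as np

def animate_region(image: np.ndarray, mask: np.ndarray, frames: int = 80) -> np.ndarray:
    """Toy cinemagraph: pixels where mask is False stay identical to the input
    image in every frame; pixels where mask is True receive new (here: random)
    content per frame, standing in for the model's generated motion."""
    video = np.broadcast_to(image, (frames,) + image.shape).copy()
    motion = np.random.randint(0, 256, size=(frames,) + image.shape, dtype=image.dtype)
    video[:, mask] = motion[:, mask]
    return video

image = np.zeros((128, 128, 3), dtype=np.uint8)  # source still image
mask = np.zeros((128, 128), dtype=bool)
mask[40:90, 40:90] = True                        # user-selected region to animate
clip = animate_region(image, mask)
print(clip.shape)                                # (80, 128, 128, 3)
```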