VideoElevator : Elevating Video Generation Quality with Versatile Text-to-Image Diffusion Models

Yabo Zhang¹, Yuxiang Wei¹, Xianhui Lin, Zheng Hui, Peiran Ren, Xuansong Xie, Xiangyang Ji² Wangmeng Zuo¹

¹Harbin Institute of Technology ²Tsinghua University

Abstract

Text-to-image diffusion models (T2I) have demonstrated unprecedented capabilities in creating realistic and aesthetic images. On the contrary, text-to-video diffusion models (T2V) still lag far behind in frame quality and text alignment, owing to insufficient quality and quantity of training videos. In this paper, we introduce VideoElevator, a training-free and plug-and-play method, which elevates the performance of T2V using superior capabilities of T2I. Different from conventional T2V sampling (i.e., temporal and spatial modeling), VideoElevator explicitly decomposes each sampling step into temporal motion refining and spatial quality elevating. Specifically, temporal motion refining uses encapsulated T2V to enhance temporal consistency, followed by inverting to the noise distribution required by T2I. Then, spatial quality elevating harnesses inflated T2I to directly predict less noisy latent, adding more photo-realistic details. We have conducted experiments in extensive prompts under the combination of various T2V and T2I. The results show that VideoElevator not only improves the performance of T2V baselines with foundational T2I, but also facilitates stylistic video synthesis with personalized T2I. Our code will be made publicly available.

VideoElevator for improved T2V generation

Top: Taking text τ as input, conventional T2V performs both temporal and spatial modeling and accumulates low-quality contents throughout sampling chain.
Bottom: VideoElevator explicitly decompose each step into temporal motion refining and spatial quality elevating, where the former encapsulates T2V to enhance temporal consistency and the latter harnesses T2I to provide more faithful details, e.g., dressed in suit. Empirically, applying T2V in several timesteps is sufficient to ensure temporal consistency.

Results enhanced with foundational T2I

We use Stable Diffusion V1.5 and V2.1-base to enhance T2V baselines.

1. Comparisons with Zeroscope [1]

A cat wearing sunglasses and working as a lifeguard at a pool.

Hulk wearing virtual reality goggles, 4k, high resolution.

Monkey learning to play the piano.

Time lapse at the snow land with aurora in the sky.

Photo of a cat singing in a barbershop quartet.

Video of yacht sailing in the ocean.

2. Comparisons with AnimateDiff [2]

A young couple walking in a heavy rain.

Beer pouring into glass, low angle video shot.

Firewood burning in dark room.

Waves crashing against a lone lighthouse, ominous lighting.

A pancake tower with chocolate syrup and strawberries on top.

An african penguin walking on a beach.

3. Comparisons with LaVie [3]

A stack of dried leaves burning in a forest.

Fireworks display in the sky at night.

Sailboat sailing on a sunny day in a mountain lake.

A blue unicorn flying over a mystical land.

A fire dragon breathing, trending on artstation, slow motion.

A seagull walking on shore.

Results enhanced with personalized T2I

We use multiple personalized T2I to enhance T2V baselines, including Lyriel, RCNZCartoon3d, RealisticVision, and ToonYou.

1. Comparisons with Zeroscope [1]

A panda standing on a surfboard in the ocean in sunset.

A polar bear is playing bass guitar in snow.

A raccoon dressed in suit playing the trumpet.

Low angle of pouring beer into a glass cup.

Milk dripping into a cup of coffee.

Waves crashing against a lone lighthouse, ominous lighting.

A teddy bear is playing the electric guitar.

Hulk wearing virtual reality goggles.

Wood on fire.

A dog swimming.

Monkey learning to play the piano.

2. Comparisons with AnimateDiff [2]

A fire dragon breathing, trending on artstation, slow motion.

A teddy bear painting a portrait.

A ufo hovering over aliens in a field.

Clown fish swimming through the coral reef.

Sunset time lapse at the beach with moving clouds and colors in the sky.

Low angle of pouring beer into a glass cup.

An animated painting of fluffy white clouds moving in sky.

Sunset time lapse at the beach with moving clouds and colors in the sky.

Wood on fire.

A big moon rises on top of Toronto city.

Unicorn running along a beach.

3. Comparisons with LaVie [3]

A cute happy Corgi playing in park, sunset.

Time lapse at a fantasy landscape.

Yellow flowers swing in wind.

Milk dripping into a cup of coffee.

Teddy bear walking down 5th Avenue, front view, beautiful sunset.

Waves crashing against a lone lighthouse, ominous lighting.

An astronaut cooking with a pan and fire in the kitchen.

Fireworks.

Ironman flying over a burning city, very detailed surroundings, cities are blazing.

A burning volcano.

Wood on fire.

Results enhanced with personalized SDXL

We take AnimateDiff XL as baseline and use DynaVision and Diffusion-DPO to enhance it.

A dog swimming.

A ufo hovering over aliens in a field.

Fireworks.

Microscopic image of kute mini white mouse with yellow Christmas hat and scarf with icicles snowball.

An astronaut riding a horse in sunset.

Robot dancing in times square.

Ablation studies

1. Effect of low-pass frequency filter

Prompt: A polar bear is playing bass guitar in snow.

(a) w/o LPFF	(b) Temporal LPFF (Ours)	(c) Spatial-Temporal LPFF

2. Effect of inversion strategies

Prompt: A raccoon dressed in suit playing the trumpet.

(a) Add same noise	(b) Add random noise	(c) DDIM inversion

3. Different choice of timestep in temporal motion refining

Prompt: A polar bear is playing bass guitar in snow.

[]	[45]	[45, 35]	[45, 35, 25]	[45, 35, 25, 15] (Ours)

4. Different number of timesteps in T2V denoising

Prompt: A polar bear is playing bass guitar in snow.

N=1	N=2	N=4	N=8	N=10

[1] Zeroscope text-to-video model. https://huggingface.co/cerspense/zeroscope_v2_576w. Accessed: 2023-10-31.

[2] Yuwei Guo, Ceyuan Yang, Anyi Rao, Yaohui Wang, Yu Qiao, Dahua Lin, and Bo Dai. Animatediff: Animate your personalized text-to-image diffusion models without specific tuning. arXiv preprint arXiv:2307.04725, 2023.

[3] Yaohui Wang, Xinyuan Chen, Xin Ma, Shangchen Zhou, Ziqi Huang, Yi Wang, Ceyuan Yang, Yinan He, Jiashuo Yu, Peiqing Yang, et al. Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:2309.15103, 2023.