MAGVIT: Masked Generative Video Transformer

Lijun Yu^‡†, Yong Cheng^†, Kihyuk Sohn^†, José Lezama^†, Han Zhang^†, Huiwen Chang^†,
Alexander G. Hauptmann^‡, Ming-Hsuan Yang^†, Yuan Hao^†, Irfan Essa^†+, Lu Jiang^†
^‡Carnegie Mellon University, ^†Google Research, ⁺Georgia Institute of Technology
CVPR 2023 (Highlight)

We introduce MAGVIT to tackle various video synthesis tasks with a single model, where we demonstrate its quality, efficiency, and flexibility.

Inspirational Applications

Panoramic Video

Smart Remover

Auto Flip

Image to Animation

Stop Motion

Novel View Synthesis

Future Prediction

Perceptual Compression

Acknowledgements

Web design: Lijun Yu, Freelancer Jekyll theme

Thanks to Tom Duerig, Victor Gomes, Paul Natsev, David Salesin, Jay Yagnik, Tomas Izo, Rahul Sukthankar, Wolfgang Macherey, David Alexander Ross, Yu-Chuan Su, Sarah Laszlo, Hugh Williams, Bryan Seybold, Albert Shaw, Jonathan Ho, Tim Salimans, Wenhe Liu, Xinyu Yao, Mingzhi Cai, Yizhi Zhang, Zhao Jin, Zhiruo Zora Wang, the Multipod committee, and the Scenic team.

Multi-task Video Generation

A single MAGVIT model is trained to perform 10 different video generation tasks. All examples in the left column are produced by the same MAGVIT model trained only on the public Something-Something-v2 dataset.


Frame Prediction
Frame Interpolation
Central Outpainting
Vertical Outpainting
Horizontal Outpainting
Dynamic Outpainting
Central Inpainting
Dynamic Inpainting
Class-conditional Generation / Frame Prediction (class: "Squeezing something")
Class-conditional Frame Prediction / Frame Interpolation (class: "Moving something down")
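
One way to read the task list above is that each task fixes a different part of the clip as given and leaves the rest to be generated. The sketch below illustrates that idea for four of the tasks using boolean masks over a small clip. It is only an illustration: the function names (make_mask, known_fraction), the clip size, the single conditioning frame, and the size of the central region are all assumptions made for this example, not details of the MAGVIT implementation.

```python
# Illustrative sketch only: which pixels of a T x H x W clip are given as
# conditioning (True) and which are left for the model to generate (False).
# Names, clip size, and region sizes are assumptions, not taken from MAGVIT.

def make_mask(task, T=4, H=8, W=8):
    """Return a nested list mask[t][y][x] of booleans for a named task."""

    def known(t, y, x):
        # Hypothetical central region: the middle half of each spatial axis.
        centre = H // 4 <= y < 3 * H // 4 and W // 4 <= x < 3 * W // 4
        if task == "frame_prediction":      # only the first frame is given
            return t == 0
        if task == "frame_interpolation":   # first and last frames are given
            return t in (0, T - 1)
        if task == "central_outpainting":   # centre given, border generated
            return centre
        if task == "central_inpainting":    # border given, centre generated
            return not centre
        raise ValueError(f"unknown task: {task}")

    return [[[known(t, y, x) for x in range(W)] for y in range(H)]
            for t in range(T)]


def known_fraction(mask):
    """Fraction of pixels in the clip that are given as conditioning."""
    flat = [v for frame in mask for row in frame for v in row]
    return sum(flat) / len(flat)


if __name__ == "__main__":
    for name in ("frame_prediction", "frame_interpolation",
                 "central_outpainting", "central_inpainting"):
        print(name, known_fraction(make_mask(name)))
```

Under these assumptions, central inpainting and central outpainting are exact complements of each other, and the two frame-based tasks differ only in which frames are marked as known.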