We compare different decoding methods to demonstrate the quality and efficiency of MAGVIT COMMIT decoding. For each method, we train a base transformer model with the same 3D-VQ tokenizer; a schematic contrast of the decoding step counts is sketched after the list below.
MAGVIT COMMIT decoding, 12 steps (ours)
Autoregressive decoding, 1024 steps
MaskGIT (Chang et al., 2022) MTM decoding, 12 steps
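For intuition on the step counts above, the sketch below contrasts the two decoding schedules on a 1024-token video. It is a minimal illustration, not MAGVIT's actual sampling code: `predict_tokens` is a hypothetical stand-in for the transformer that returns random tokens and confidences, and the unmasking schedule is a simple linear one (the papers use a cosine schedule). Its only purpose is to show that autoregressive decoding needs one forward pass per token, while masked decoding refines all tokens in parallel over a fixed small number of passes.

```python
import numpy as np

NUM_TOKENS = 1024   # e.g. a 4x16x16 3D-VQ token grid, flattened
MASK_ID = -1        # hypothetical mask token id
VOCAB = 1024        # hypothetical codebook size

def predict_tokens(tokens):
    """Hypothetical stand-in for the transformer: returns a token and a
    confidence score for every position (random here, for illustration)."""
    rng = np.random.default_rng(0)
    return rng.integers(0, VOCAB, size=tokens.shape), rng.random(tokens.shape)

def autoregressive_decode(num_tokens=NUM_TOKENS):
    """One model call per token: 1024 sequential steps for a 1024-token video."""
    tokens = np.full(num_tokens, MASK_ID)
    for i in range(num_tokens):          # 1024 forward passes
        pred, _ = predict_tokens(tokens)
        tokens[i] = pred[i]
    return tokens

def masked_decode(num_tokens=NUM_TOKENS, steps=12):
    """Non-autoregressive refinement: all tokens are predicted in parallel and
    only the most confident ones are kept each step (12 forward passes)."""
    tokens = np.full(num_tokens, MASK_ID)
    for step in range(steps):
        pred, conf = predict_tokens(tokens)
        still_masked = tokens == MASK_ID
        # Linear unmasking schedule for simplicity; MaskGIT/MAGVIT use cosine.
        target = int(num_tokens * (step + 1) / steps)
        keep = target - int(np.sum(~still_masked))
        conf = np.where(still_masked, conf, -np.inf)  # never re-pick fixed tokens
        top = np.argsort(-conf)[:max(keep, 0)]
        tokens[top] = pred[top]
    return tokens
```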
Comparing VQ Tokenizers on UCF-101
We compare different VQ tokenizers to demonstrate the superior reconstruction quality of MAGVIT 3D-VQ. These models are trained only on the 9.5K training videos of the small UCF-101 dataset. See Perceptual Compression for large-scale real-world examples of MAGVIT 3D-VQ.
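For a rough sense of scale, the snippet below computes the token-grid size for the 3D-VQ setting assumed in the comparison above. The downsampling factors (4x temporal, 8x spatial on 16-frame 128x128 clips) are stated here as an illustrative assumption; they show why the autoregressive baseline above requires 1024 decoding steps.

```python
def token_grid(frames=16, height=128, width=128,
               temporal_stride=4, spatial_stride=8):
    """Shape and size of the 3D-VQ token grid for one video clip.

    The strides are assumed for illustration: they turn a 16x128x128 clip
    into a 4x16x16 grid, i.e. 4 * 16 * 16 = 1024 tokens.
    """
    t = frames // temporal_stride
    h = height // spatial_stride
    w = width // spatial_stride
    return (t, h, w), t * h * w

print(token_grid())  # ((4, 16, 16), 1024)
```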