Current video generation techniques excel at single-shot clips but struggle to produce narrative multi-shot videos, which require flexible shot arrangement, a coherent narrative, and controllability beyond text prompts. To tackle these challenges, we propose MultiShotMaster, a framework for highly controllable multi-shot video generation. We extend a pretrained single-shot model with two novel RoPE variants. First, we introduce Multi-Shot Narrative RoPE, which applies an explicit phase shift at shot transitions, enabling flexible shot arrangement while preserving the temporal narrative order. Second, we design Spatiotemporal Position-Aware RoPE to incorporate reference tokens and grounding signals, enabling spatiotemporal-grounded reference injection. In addition, to overcome data scarcity, we establish an automated annotation pipeline that extracts multi-shot videos, captions, cross-shot grounding signals, and reference images. Our framework leverages the model's intrinsic architectural properties to support multi-shot video generation, featuring text-driven inter-shot consistency, subject customization with motion control, and background-driven scene customization. Both the shot count and per-shot duration are flexibly configurable. Extensive experiments demonstrate the superior performance and outstanding controllability of our framework.
Video Gallery
Multi-Shot Text-to-Video Generation — Built on Wan Models
We convert the user-provided shot arrangement (e.g., [[0, 37], [37, 62], ..., [start_frame_idx, end_frame_idx]]) and hierarchical captions into caption videos (the lower part of each video) to visualize the controllable shot transitions. Story denotes the global caption. [Shot N_i] indicates that the currently playing shot is the i-th of N shots.
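As a concrete reading of this format, the sketch below maps such an arrangement to a per-frame shot index. It assumes half-open [start_frame_idx, end_frame_idx) ranges, as the shared endpoint 37 in the example suggests; the helper name `frames_to_shots` is illustrative, not part of the framework's interface:

```python
def frames_to_shots(shot_ranges, num_frames):
    """Map each frame index to its shot index, given half-open
    [start_frame_idx, end_frame_idx) ranges as in the demos above.
    (Illustrative helper; not the framework's actual API.)"""
    shot_of_frame = [None] * num_frames
    for shot_idx, (start, end) in enumerate(shot_ranges):
        for f in range(start, min(end, num_frames)):
            shot_of_frame[f] = shot_idx
    return shot_of_frame

arrangement = [[0, 37], [37, 62]]  # example arrangement from the gallery
labels = frames_to_shots(arrangement, num_frames=62)
# frame 36 still belongs to shot 0; frame 37 opens shot 1
```

With half-open ranges, shot boundaries share an endpoint without any frame belonging to two shots at once.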
Multi-Shot Text-to-Video Generation (384×672) — Built on an Internal DiT
Multi-Shot & Multi-Reference Video Generation (384×672) — Built on an Internal DiT
We overlay the input sparse bounding box sequences on the generated videos to show the effect of Spatiotemporal-Grounded Reference Injection. The clean version omits the bbox sequences for a better viewing experience. By default, the background image is injected into the first frame of the corresponding shot.
How Does it Work?
Architecture of MultiShotMaster
MultiShotMaster is a framework for highly controllable multi-shot video generation. We extend a pretrained single-shot T2V model with two key RoPE variants: Multi-Shot Narrative RoPE for flexible shot arrangement while preserving the temporal narrative order, and Spatiotemporal Position-Aware RoPE for grounded reference injection. To manage in-context information flow, we design a Multi-Shot & Multi-Reference Attention Mask. We finetune the temporal attention, cross-attention, and FFN layers, leveraging the model's intrinsic architectural properties to achieve flexible and controllable multi-shot video generation.
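To illustrate the phase-shift idea behind Multi-Shot Narrative RoPE, here is a minimal NumPy sketch: temporal positions grow monotonically across the whole video (preserving narrative order), but each shot transition adds an extra offset that pushes shots apart in position space. The constant per-transition offset and the standard 1D RoPE frequency layout are assumptions for illustration; the actual scheme used by MultiShotMaster is not specified here:

```python
import numpy as np

def rope_angles(positions, dim, base=10000.0):
    """Per-position rotation angles for standard 1D RoPE (dim even)."""
    inv_freq = base ** (-np.arange(0, dim, 2) / dim)
    return np.outer(positions, inv_freq)  # shape: (num_positions, dim // 2)

def narrative_positions(shot_lengths, phase_shift):
    """Temporal positions with an explicit phase shift at each shot
    transition: frame order stays monotonic, but consecutive shots are
    separated by `phase_shift` extra positions. (Assumed constant-offset
    scheme; the paper's exact values are not public.)"""
    pos, offset = [], 0.0
    for length in shot_lengths:
        start = len(pos)
        pos.extend(start + offset + t for t in range(length))
        offset += phase_shift
    return np.array(pos)

pos = narrative_positions([4, 3], phase_shift=8.0)
# positions: [0, 1, 2, 3, 12, 13, 14] -- monotonic, with a jump
# at the shot boundary that marks the transition
angles = rope_angles(pos, dim=8)
```

Because positions remain strictly increasing, attention still sees the shots in narrative order; the jump at each boundary gives the model an explicit transition signal and lets shot counts and durations vary freely.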
Multi-Shot & Multi-Reference Data Curation
We employ a shot transition detection model to cut the collected long videos into short clips, use a scene segmentation model to cluster clips belonging to the same scene, and then sample multi-shot videos from these clusters. Next, we introduce a hierarchical caption structure and use Gemini-2.5 in a two-stage process to produce a global caption and per-shot captions. Moreover, we integrate YOLOv11, ByteTrack, and SAM to detect, track, and segment subject images, and use Gemini-2.5 to merge the per-shot tracking results by subject appearance. Finally, we obtain clean backgrounds using OmniEraser.
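The multi-shot sampling step above can be sketched as follows. This assumes clips have already been cut and clustered by scene, and represents each clip as a (start_time, end_time, path) tuple; the function name and data layout are illustrative, not the pipeline's actual interface:

```python
def sample_multishot(clips_by_scene, num_shots):
    """From clips clustered by scene, collect temporally ordered runs of
    `num_shots` consecutive clips within one scene as multi-shot training
    samples. (Simplified sketch of the sampling step; real pipelines
    would also filter by duration, quality, etc.)"""
    samples = []
    for scene_id, clips in clips_by_scene.items():
        clips = sorted(clips)  # order clips by start time within the scene
        for i in range(len(clips) - num_shots + 1):
            samples.append(clips[i:i + num_shots])
    return samples

# Toy example: scene A has three clips, scene B only one.
clips_by_scene = {
    "sceneA": [(0, 5, "a1.mp4"), (5, 9, "a2.mp4"), (9, 14, "a3.mp4")],
    "sceneB": [(0, 6, "b1.mp4")],
}
samples = sample_multishot(clips_by_scene, num_shots=2)
# sceneA yields two 2-shot samples; sceneB is too short to contribute
```

Sampling within a single scene cluster is what gives the training data its cross-shot consistency: all shots in a sample share subjects and background.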