MultiShotMaster: A Controllable Multi-Shot Video Generation Framework

Qinghe Wang1 Xiaoyu Shi2✉ Baolu Li1 Weikang Bian3 Quande Liu2 Huchuan Lu1
Xintao Wang2 Pengfei Wan2 Kun Gai2 Xu Jia1✉
1Dalian University of Technology 2Kling Team, Kuaishou Technology 3The Chinese University of Hong Kong
✉ Corresponding Author  
Paper | Code (working)
Abstract

Current video generation techniques excel at single-shot clips but struggle to produce narrative multi-shot videos, which require flexible shot arrangement, a coherent narrative, and controllability beyond text prompts. To tackle these challenges, we propose MultiShotMaster, a framework for highly controllable multi-shot video generation. We extend a pretrained single-shot model by integrating two novel variants of RoPE. First, we introduce Multi-Shot Narrative RoPE, which applies an explicit phase shift at shot transitions, enabling flexible shot arrangement while preserving the temporal narrative order. Second, we design Spatiotemporal Position-Aware RoPE to incorporate reference tokens and grounding signals, enabling spatiotemporally grounded reference injection. In addition, to overcome data scarcity, we establish an automated annotation pipeline that extracts multi-shot videos, captions, cross-shot grounding signals, and reference images. Our framework leverages the intrinsic properties of the architecture to support multi-shot video generation, featuring text-driven inter-shot consistency, subject customization with motion control, and background-driven scene customization. Both the number of shots and their durations are flexibly configurable. Extensive experiments demonstrate the superior performance and outstanding controllability of our framework.
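Multi-Shot Narrative RoPE is described only at a high level here, so the snippet below is a minimal sketch of the phase-shift idea rather than the released implementation: temporal positions advance continuously inside each shot, and every shot transition adds an extra offset so shots remain separable while their narrative order is preserved. The function name and the `delta`, `head_dim`, and `base` values are illustrative assumptions.

```python
# Minimal sketch (not the official code): temporal RoPE positions where each
# shot transition adds an explicit phase shift, keeping narrative order intact.
import torch

def narrative_rope_angles(shot_lengths, head_dim=64, delta=32.0, base=10000.0):
    """Return (cos, sin) rotary tables for a multi-shot latent sequence.

    shot_lengths: frames (or latent frames) per shot, e.g. [16, 24, 16].
    delta:        assumed extra phase offset inserted at every shot boundary.
    """
    positions = []
    offset = 0.0
    for k, length in enumerate(shot_lengths):
        # Continuous positions inside the shot, shifted by k * delta overall.
        positions.append(torch.arange(length, dtype=torch.float32) + offset + k * delta)
        offset += length
    pos = torch.cat(positions)                                  # (total_frames,)

    inv_freq = 1.0 / (base ** (torch.arange(0, head_dim, 2, dtype=torch.float32) / head_dim))
    angles = torch.outer(pos, inv_freq)                         # (total_frames, head_dim // 2)
    return angles.cos(), angles.sin()

cos, sin = narrative_rope_angles([16, 24, 16])
print(cos.shape)  # torch.Size([56, 32])
```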




Video Gallery
Multi-Shot Text-to-Video Generation

Multi-Shot & Multi-Reference Video Generation



How Does it Work?
Architecture of MultiShotMaster
[Architecture diagram]
MultiShotMaster is a framework for highly controllable multi-shot video generation. We extend a pretrained single-shot T2V model with two key RoPE variants: Multi-Shot Narrative RoPE for flexible shot arrangement with a preserved temporal narrative order, and Spatiotemporal Position-Aware RoPE for grounded reference injection. To manage in-context information flow, we design a Multi-Shot & Multi-Reference Attention Mask. We finetune the temporal-attention, cross-attention, and FFN layers, leveraging the intrinsic properties of the architecture to achieve flexible and controllable multi-shot video generation.
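The Multi-Shot & Multi-Reference Attention Mask is only named above, so the sketch below assumes one plausible layout rather than the actual design: video tokens from every shot attend to all video tokens and to all reference tokens, while each reference block attends only to itself so the injected references are not contaminated. The token counts and visibility rules are assumptions.

```python
# Minimal sketch (assumed layout, not the official mask): boolean attention mask
# over [video tokens from all shots | reference-image tokens per reference].
import torch

def multishot_multiref_mask(shot_token_counts, ref_token_counts):
    """True = attention allowed.

    Assumed rules: (1) video tokens of every shot attend to all video tokens,
    which is what lets appearance information flow across shots; (2) video
    tokens also attend to every reference token; (3) each reference block
    attends only to itself, so references condition the video but are not
    overwritten by it.
    """
    n_vid = sum(shot_token_counts)
    n_ref = sum(ref_token_counts)
    mask = torch.zeros(n_vid + n_ref, n_vid + n_ref, dtype=torch.bool)

    mask[:n_vid, :] = True                  # video -> video and video -> references

    start = n_vid
    for count in ref_token_counts:          # reference block k -> itself only
        mask[start:start + count, start:start + count] = True
        start += count
    return mask

mask = multishot_multiref_mask(shot_token_counts=[64, 64], ref_token_counts=[16, 16])
print(mask.shape, mask[:128, 128:].all().item(), mask[128:, :128].any().item())
# torch.Size([160, 160]) True False
```

Under this assumed layout, full video-to-video visibility is what would carry text-driven inter-shot consistency, while the block-diagonal reference rows keep each reference image acting purely as a conditioning signal.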

Multi-Shot & Multi-Reference Data Curation
[Data curation pipeline diagram]
We employ a shot transition detection model to cut the collected long videos into short clips, use a scene segmentation model to cluster clips that belong to the same scene, and then sample multi-shot videos from each cluster. Next, we introduce a hierarchical caption structure and use Gemini-2.5 in a two-stage process to produce a global caption and per-shot captions. Moreover, we integrate YOLOv11, ByteTrack, and SAM to detect, track, and segment subject images, and then use Gemini-2.5 to merge the per-shot tracking results by subject appearance. Finally, we obtain clean backgrounds with OmniEraser.
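The exact tools and prompts are not spelled out beyond the model names, so the pipeline below is a rough, hedged sketch: PySceneDetect stands in for the shot-transition detector, Ultralytics YOLO11 + ByteTrack + SAM cover detection, tracking, and segmentation (the checkpoint names "yolo11n.pt" and "sam2_b.pt" are example weights, not necessarily those used by the authors), and the scene clustering, Gemini-2.5 captioning/merging, and OmniEraser steps are reduced to clearly marked placeholder functions.

```python
# Hedged sketch of the curation flow described above; the stub functions mark
# steps whose exact models and prompts are not specified here.
from scenedetect import detect, ContentDetector    # pip install scenedetect
from ultralytics import YOLO, SAM                  # pip install ultralytics


def cluster_by_scene(clips):            # placeholder for the scene-segmentation model
    return [clips]

def sample_multishot(clusters):         # placeholder multi-shot sampling rule
    return clusters

def caption_with_vlm(shots):            # placeholder for the two-stage Gemini-2.5 captioning
    return {"global": "", "per_shot": ["" for _ in shots]}

def erase_subjects(video_path, masks):  # placeholder for OmniEraser background cleanup
    return []


def curate(video_path):
    # 1) Cut the long video into single-shot clips at detected shot transitions.
    shots = detect(video_path, ContentDetector())   # [(start, end), ...]
    clips = [(str(start), str(end)) for start, end in shots]

    # 2) Cluster clips of the same scene, then sample multi-shot videos.
    multishot_videos = sample_multishot(cluster_by_scene(clips))

    # 3) Hierarchical captions: one global caption plus per-shot captions.
    captions = [caption_with_vlm(video) for video in multishot_videos]

    # 4) Detect and track subjects (YOLO11 + ByteTrack), then segment each
    #    tracked box with SAM to obtain subject reference images.
    detector, segmenter = YOLO("yolo11n.pt"), SAM("sam2_b.pt")
    subject_masks = []
    for result in detector.track(video_path, tracker="bytetrack.yaml", stream=True):
        for box in result.boxes.xyxy.tolist():
            subject_masks.append(segmenter(result.orig_img, bboxes=[box]))

    # 5) Remove the segmented subjects to obtain clean background images.
    backgrounds = erase_subjects(video_path, subject_masks)
    return multishot_videos, captions, subject_masks, backgrounds
```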