Training-free Detection of Generated Videos
via Spatial-Temporal Likelihoods

Technion - Israel Institute of Technology
CVPR 2026

Spatial-temporal likelihoods per video (scatter plot). Blue: real; red: fake (ComGenVid). Joint spatial-temporal likelihoods clearly separate real and fake videos; examples illustrate high/low spatial likelihood (frame realism) and temporal likelihood (motion naturalness).

Abstract

Following major advances in text and image generation, video generation has surged, with models now producing highly realistic and controllable sequences. Alongside this progress, these models raise serious concerns about misinformation, making reliable detection of synthetic videos increasingly crucial.

Image-based detectors are fundamentally limited because they operate per frame and ignore temporal dynamics, while supervised video detectors generalize poorly to unseen generators, a critical drawback given the rapid emergence of new models. These challenges motivate zero-shot approaches, which avoid synthetic data and instead score content against real-data statistics, enabling training-free, model-agnostic detection.

We introduce STALL (Spatial-Temporal Aggregated Log-Likelihoods), a simple, training-free, theoretically justified detector that provides likelihood-based scoring for videos, jointly modeling spatial and temporal evidence within a probabilistic framework. We evaluate STALL on two public benchmarks and introduce ComGenVid, a new benchmark with state-of-the-art generative models. STALL consistently outperforms prior image- and video-based baselines.

How STALL Works


Method overview. A video is split into frames and encoded into embeddings. The spatial branch scores the likelihood of each frame embedding; the temporal branch normalizes inter-frame differences and scores their likelihood analogously. The two scores are fused into a unified measure that separates AI-generated from real videos.

Spatial Branch

Frame embeddings are whitened using statistics from a calibration set of real videos and assigned a log-likelihood under an isotropic Gaussian model. The per-video spatial score is the maximum frame log-likelihood.
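A minimal NumPy sketch of this branch, assuming frame embeddings of shape (T, D) from a ViT-style encoder such as DINOv3; the helper names (`fit_whitener`, `spatial_score`) are illustrative, not the official implementation:

Python
import numpy as np

def fit_whitener(calib_embs, eps=1e-8):
    # Whitening statistics from real-video calibration embeddings, shape (N, D).
    mu = calib_embs.mean(axis=0)
    cov = np.cov(calib_embs, rowvar=False)
    vals, vecs = np.linalg.eigh(cov)                         # symmetric eigendecomposition
    W = vecs @ np.diag(1.0 / np.sqrt(vals + eps)) @ vecs.T   # inverse square root of covariance
    return mu, W

def spatial_score(frame_embs, mu, W):
    # Whiten each frame embedding and score it under an isotropic standard
    # Gaussian (log-density up to an additive constant); per-video max.
    z = (frame_embs - mu) @ W                # (T, D)
    log_lik = -0.5 * np.sum(z**2, axis=1)    # per-frame log-likelihood
    return log_lik.max()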

Temporal Branch

Inter-frame difference vectors are L2-normalized to lie on the unit sphere, making their distribution well-approximated by a Gaussian (Maxwell-Poincaré lemma). Whitened normalized transitions are scored analogously; the per-video temporal score is the minimum transition log-likelihood.
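The temporal branch admits an analogous sketch; here `mu_t` and `W_t` are assumed to be whitening statistics fit (as above) on normalized transitions from the same real calibration set:

Python
def temporal_score(frame_embs, mu_t, W_t, eps=1e-8):
    # Consecutive-frame differences, L2-normalized onto the unit sphere so their
    # coordinates are near-Gaussian in high dimension.
    diffs = np.diff(frame_embs, axis=0)                               # (T-1, D)
    diffs /= (np.linalg.norm(diffs, axis=1, keepdims=True) + eps)
    z = (diffs - mu_t) @ W_t                                          # whiten with transition stats
    log_lik = -0.5 * np.sum(z**2, axis=1)                             # per-transition log-likelihood
    return log_lik.min()                                              # per-video temporal score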

Both branch scores are converted to percentiles against the calibration distribution and averaged into a final video score. Higher score = more likely real.
No training. No generated samples. Only real videos needed for calibration.
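A sketch of the fusion step, assuming `calib_spatial` and `calib_temporal` hold the branch scores of the real calibration videos (an empirical-percentile implementation; exact details may differ from the paper):

Python
def stall_score(s_spatial, s_temporal, calib_spatial, calib_temporal):
    # Empirical percentile of each branch score against the calibration
    # distribution of real videos, averaged into the final score in [0, 1].
    p_s = np.mean(calib_spatial <= s_spatial)
    p_t = np.mean(calib_temporal <= s_temporal)
    return 0.5 * (p_s + p_t)   # higher = more likely real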

Results

Zero-shot detection average AUC across three benchmarks. STALL is the only method that consistently maintains AUC > 0.5 for every evaluated generator (see Table 1 in the paper). AP scores and per-generator breakdowns are also reported there. All results use a calibration set of real videos that is completely disjoint from every evaluation benchmark.
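For reference, the AUC protocol can be reproduced with scikit-learn, treating real as the positive class (a usage sketch; `labels` and `scores` are hypothetical per-video arrays):

Python
from sklearn.metrics import roc_auc_score
# labels: 1 for real videos, 0 for generated; scores: STALL outputs per video.
auc = roc_auc_score(labels, scores)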

Benchmark           AEROBLADE   RIGID   ZED    D3 (L2)   D3 (cos)   STALL (Ours)
VideoFeedback       0.58        0.63    0.54   0.54      0.55       0.83
GenVideo            0.59        0.65    0.55   0.72      0.70       0.80
ComGenVid (Ours)    0.69        0.57    0.55   0.73      0.73       0.85
Overall Avg         0.62        0.61    0.57   0.64      0.64       0.82

Performance comparison. Average AUC across all three benchmarks. STALL is both high-performing and efficient.


Robustness to perturbations. JPEG compression, Gaussian blur, crop, and noise at five severity levels; STALL maintains high AUC throughout.


Latency comparison. Inference time per video (16 frames). STALL is fast; the bulk of latency comes from the video embedder, not the detection itself.


ComGenVid - New Benchmark

We introduce ComGenVid, a new benchmark curated to evaluate detection against state-of-the-art commercial video generators. It contains:

  • ~3,400 generated videos from Sora (OpenAI) and Veo 3 (Google DeepMind), among the most capable and realistic generators available today.
  • ~1,700 real videos from MSVD, serving as the authentic counterpart.

Pre-computed DINOv3 embeddings for the entire dataset are available on HuggingFace, enabling zero-setup evaluation without video files or a GPU.
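As a hypothetical loading sketch (the actual repository id and file layout on HuggingFace may differ from the placeholders used here):

Python
from huggingface_hub import hf_hub_download
import numpy as np

# Placeholder repo id and filename; substitute the dataset's actual values.
path = hf_hub_download(repo_id="<org>/ComGenVid", filename="dinov3_embeddings.npz",
                       repo_type="dataset")
embs = np.load(path)   # pre-computed DINOv3 embeddings; no video decoding or GPU needed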

Citation

If you find this work useful in your research, please consider citing us:

BibTeX
@inproceedings{hayun2026trainingfreedetectiongeneratedvideos,
  title     = {Training-free Detection of Generated Videos via Spatial-Temporal Likelihoods},
  author    = {{Ben Hayun}, Omer and Betser, Roy and Levi, Meir Yossef and Kassel, Levi and Gilboa, Guy},
  booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  year      = {2026},
  eprint    = {2603.15026},
  archivePrefix = {arXiv},
  primaryClass  = {cs.CV},
  url       = {https://arxiv.org/abs/2603.15026},
}