Training-free Detection of Generated Videos
via Spatial-Temporal Likelihoods

Technion - Israel Institute of Technology
CVPR 2026

Spatial-temporal likelihoods per video (scatter plot). Blue: real; red: fake (ComGenVid). Joint spatial-temporal likelihoods clearly separate real and fake videos; examples illustrate high/low spatial likelihood (frame realism) and temporal likelihood (motion naturalness).

Abstract

Following major advances in text and image generation, video generation has surged, with models now producing highly realistic and controllable sequences. Alongside this progress, these models raise serious concerns about misinformation, making reliable detection of synthetic videos increasingly crucial.

Image-based detectors are fundamentally limited because they operate per frame and ignore temporal dynamics, while supervised video detectors generalize poorly to unseen generators, a critical drawback given the rapid emergence of new models. These challenges motivate zero-shot approaches, which avoid synthetic data and instead score content against real-data statistics, enabling training-free, model-agnostic detection.

We introduce STALL (Spatial-Temporal Aggregated Log-Likelihoods), a simple, training-free, theoretically justified detector that provides likelihood-based scoring for videos, jointly modeling spatial and temporal evidence within a probabilistic framework. We evaluate STALL on two public benchmarks and introduce ComGenVid, a new benchmark with state-of-the-art generative models. STALL consistently outperforms prior image- and video-based baselines.

How STALL Works


Method overview. A video is split into frames and encoded into embeddings. The spatial branch scores the likelihood of each frame embedding; the temporal branch normalizes inter-frame differences and scores their likelihood analogously. The two scores are fused into a unified measure that separates AI-generated from real videos.

Spatial Branch

Frame embeddings are whitened using statistics from a calibration set of real videos and assigned a log-likelihood under an isotropic Gaussian model. The per-video spatial score is the maximum frame log-likelihood.
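A minimal NumPy sketch of this branch, assuming frame embeddings of shape (T, D) from a ViT-style encoder such as DINOv3; the helper names (`fit_whitener`, `spatial_score`) are illustrative, not the official implementation:

Python
import numpy as np

def fit_whitener(calib_embs, eps=1e-8):
    # Whitening statistics from real-video calibration embeddings, shape (N, D).
    mu = calib_embs.mean(axis=0)
    cov = np.cov(calib_embs, rowvar=False)
    vals, vecs = np.linalg.eigh(cov)                         # symmetric eigendecomposition
    W = vecs @ np.diag(1.0 / np.sqrt(vals + eps)) @ vecs.T   # inverse square root of covariance
    return mu, W

def spatial_score(frame_embs, mu, W):
    # Whiten each frame embedding and score it under an isotropic standard
    # Gaussian (log-density up to an additive constant); per-video max.
    z = (frame_embs - mu) @ W                # (T, D)
    log_lik = -0.5 * np.sum(z**2, axis=1)    # per-frame log-likelihood
    return log_lik.max()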

Temporal Branch

Inter-frame difference vectors are L2-normalized to lie on the unit sphere, making their distribution well-approximated by a Gaussian (Maxwell-Poincaré lemma). Whitened normalized transitions are scored analogously; the per-video temporal score is the minimum transition log-likelihood.
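The temporal branch admits an analogous sketch; here `mu_t` and `W_t` are assumed to be whitening statistics fit (as above) on normalized transitions from the same real calibration set:

Python
def temporal_score(frame_embs, mu_t, W_t, eps=1e-8):
    # Consecutive-frame differences, L2-normalized onto the unit sphere so their
    # coordinates are near-Gaussian in high dimension.
    diffs = np.diff(frame_embs, axis=0)                               # (T-1, D)
    diffs /= (np.linalg.norm(diffs, axis=1, keepdims=True) + eps)
    z = (diffs - mu_t) @ W_t                                          # whiten with transition stats
    log_lik = -0.5 * np.sum(z**2, axis=1)                             # per-transition log-likelihood
    return log_lik.min()                                              # per-video temporal score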

Both branch scores are converted to percentiles against the calibration distribution and averaged into a final video score. Higher score = more likely real.
No training. No generated samples. Only real videos needed for calibration.
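A sketch of the fusion step, assuming `calib_spatial` and `calib_temporal` hold the branch scores of the real calibration videos (an empirical-percentile implementation; exact details may differ from the paper):

Python
def stall_score(s_spatial, s_temporal, calib_spatial, calib_temporal):
    # Empirical percentile of each branch score against the calibration
    # distribution of real videos, averaged into the final score in [0, 1].
    p_s = np.mean(calib_spatial <= s_spatial)
    p_t = np.mean(calib_temporal <= s_temporal)
    return 0.5 * (p_s + p_t)   # higher = more likely real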

Results

Zero-shot detection average AUC across three benchmarks. STALL is the only method that consistently maintains AUC > 0.5 for every evaluated generator (see Table 1 in the paper). AP scores and per-generator breakdowns are also reported there. All results use a calibration set of real videos that is completely disjoint from every evaluation benchmark.
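For reference, the AUC protocol can be reproduced with scikit-learn, treating real as the positive class (a usage sketch; `labels` and `scores` are hypothetical per-video arrays):

Python
from sklearn.metrics import roc_auc_score
# labels: 1 for real videos, 0 for generated; scores: STALL outputs per video.
auc = roc_auc_score(labels, scores)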

Benchmark           AEROBLADE   RIGID   ZED    D3 (L2)   D3 (cos)   STALL (Ours)
VideoFeedback       0.58        0.63    0.54   0.54      0.55       0.83
GenVideo            0.59        0.65    0.55   0.72      0.70       0.80
ComGenVid (Ours)    0.69        0.57    0.55   0.73      0.73       0.85
Overall Avg         0.62        0.61    0.57   0.64      0.64       0.82

Performance comparison. Average AUC across all three benchmarks. STALL is both high-performing and efficient.


Robustness to perturbations. JPEG compression, Gaussian blur, crop, and noise at five severity levels; STALL maintains high AUC throughout.


Latency comparison. Inference time per video (16 frames). STALL is fast; the bulk of latency comes from the video embedder, not the detection itself.


ComGenVid - New Benchmark

We introduce ComGenVid, a new benchmark curated to evaluate detection against state-of-the-art commercial video generators. It contains:

  • ~3,400 generated videos from Sora (OpenAI) and Veo 3 (Google DeepMind), among the most capable and realistic generators available today.
  • ~1,700 real videos from MSVD, serving as the authentic counterpart.

Pre-computed DINOv3 embeddings for the entire dataset are available on HuggingFace, enabling zero-setup evaluation without video files or a GPU.
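As a hypothetical loading sketch (the actual repository id and file layout on HuggingFace may differ from the placeholders used here):

Python
from huggingface_hub import hf_hub_download
import numpy as np

# Placeholder repo id and filename; substitute the dataset's actual values.
path = hf_hub_download(repo_id="<org>/ComGenVid", filename="dinov3_embeddings.npz",
                       repo_type="dataset")
embs = np.load(path)   # pre-computed DINOv3 embeddings; no video decoding or GPU needed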

Citation

If you find this work useful in your research, please consider citing us:

BibTeX
@inproceedings{hayun2026trainingfreedetectiongeneratedvideos,
  title     = {Training-free Detection of Generated Videos via Spatial-Temporal Likelihoods},
  author    = {{Ben Hayun}, Omer and Betser, Roy and Levi, Meir Yossef and Kassel, Levi and Gilboa, Guy},
  booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  year      = {2026},
  eprint    = {2603.15026},
  archivePrefix = {arXiv},
  primaryClass  = {cs.CV},
  url       = {https://arxiv.org/abs/2603.15026},
}