2026 multimedia security;deepfake video;deepfake detection;deepfake video detection;spatio-temporal learning

A Multi-Grained Parallel Spatio-Temporal Learning Architecture for Deepfake Video Detection

Miao, Hui and Guo, Yuanfang and Zhang, Leo Yu and Zhou, Jiantao and Wang, Yunhong

With advances in generation techniques, malicious users can easily generate deepfake videos, which can cause severe social problems and trust issues. Deepfake video detection has therefore received increasing attention in recent years. Because forgery clues are often subtle and imperceptible, effective detection relies heavily on multi-grained learning. However, existing approaches fail to systematically incorporate multi-grained learning across the key components of network training (the training data, the network structure, and the supervision strategy), which limits their performance. In this article, we propose a multi-grained parallel spatio-temporal deepfake video detection architecture, which introduces a novel framework to mine more discriminative deepfake cues throughout the training pipeline. First, we design a parallel spatio-temporal network combined with a cross-guided mechanism to concurrently extract frame-level spatial features and patch-level temporal features, while leveraging the relationship between spatial artifacts and temporal inconsistencies to enable multi-grained spatio-temporal synchronous learning. Second, we propose segment-level data augmentation strategies, including frame-random consistent self-blending and spatio-temporal data augmentation, which increase training data diversity at both the frame and patch levels, thereby improving the model’s ability to learn comprehensive deepfake representations. Finally, we construct a multi-grained supervision scheme, comprising a patch-level temporal loss, a distance-based frame-level spatial loss, and a standard segment-level loss, for learning subtle deepfake features. Extensive experiments demonstrate that our method is highly robust and that its generalization ability outperforms, on average, current state-of-the-art methods across a series of deepfake datasets, including FaceForensics++, CelebDF, DFDC, DeeperForensics, and Faceshifter.

Added 2026-04-21