Dual Sparse Long-Short Term Transformer for Video Shadow Detection
Video Shadow Detection (VSD) is critical yet challenging, primarily due to ambiguous shadow boundaries and the presence of confusing shadow-like non-shadow regions, which existing methods struggle to resolve effectively due to limited temporal modeling. We propose the Dual Sparse Long-Short Term Transformer Network (DSLSTT-Net), a novel framework designed to enhance feature learning by integrating robust temporal consistency with detailed local context. DSLSTT-Net employs a dual-stream architecture to concurrently process global temporal information and refine local shadow features, enabling effective discrimination between true shadows and confusing shadow-like areas. At its core, the Sparse Long-Short Term Attention Module (Sparse LSTAM) propagates only high-confidence shadow features from memory, significantly improving both feature discriminability and computational efficiency. Furthermore, an Adaptive Fusion Module (AFM) dynamically merges the purified long-term features with short-term details to optimize the final segmentation. Experimental results confirm that DSLSTT-Net significantly outperforms state-of-the-art methods on VSD benchmarks, validating our dual-stream architecture and sparse temporal modeling. The source code is available at .
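The two core operations the abstract describes can be sketched at a high level: sparse attention keeps only the highest-confidence memory features before attending, and adaptive fusion blends the resulting long-term features with short-term ones. The sketch below is a minimal NumPy illustration of those ideas, not the authors' implementation; all function names, the top-k confidence selection, and the scalar fusion gate `alpha` are assumptions (in the full model the gate would be predicted by a learned network, and attention would be multi-head over spatial feature maps).

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def sparse_long_short_attention(query, memory, memory_conf, top_k=4):
    """Attend only to the top_k highest-confidence memory features.

    query:       (Nq, d) current-frame features (short-term stream)
    memory:      (Nm, d) long-term memory features from past frames
    memory_conf: (Nm,)   per-feature shadow-confidence scores (assumed given)
    """
    # "Sparse" selection: drop low-confidence memory entries entirely,
    # so attention cost scales with top_k instead of the full memory size.
    idx = np.argsort(memory_conf)[-top_k:]
    mem = memory[idx]                                     # (top_k, d)
    # Standard scaled dot-product attention over the pruned memory.
    d = query.shape[-1]
    attn = softmax(query @ mem.T / np.sqrt(d), axis=-1)   # (Nq, top_k)
    return attn @ mem                                     # (Nq, d)

def adaptive_fusion(long_feat, short_feat, alpha):
    """Convex blend of long-term and short-term features.

    alpha in [0, 1] is a hypothetical stand-in for the gating weights an
    Adaptive Fusion Module would predict per location.
    """
    return alpha * long_feat + (1.0 - alpha) * short_feat
```

A usage pass would first purify long-term context with `sparse_long_short_attention`, then merge it with the current frame's features via `adaptive_fusion` before the segmentation head.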
Added 2026-04-21