Abstract
We propose the Heterogeneous Dual-path Contrastive Architecture (HDCA) for action recognition. The model comprises a spatial pathway and a temporal pathway that employ distinct backbone networks and input formats, each tailored to the properties of spatial and temporal features. The spatial pathway processes super images to capture spatial semantics, while the temporal pathway operates on frame sequences to model motion patterns. This targeted design captures the scenes and motions depicted in videos while improving parameter efficiency. To establish a cross-modality complementary enhancement mechanism, we develop a cross-modality contrastive loss and an intra-group contrastive loss to train HDCA. The two losses work synergistically: the cross-modality contrastive loss aligns spatial and temporal representations, and the intra-group contrastive loss enhances intra-class compactness, so that feature similarity is maximized among videos of the same class and minimized across classes. HDCA thus fully exploits the complementary strengths of spatial and temporal features for action recognition. Systematic experiments on three benchmark datasets validate the effectiveness and superiority of our approach and support the motivation and hypotheses behind its design. The results demonstrate that our model achieves competitive performance compared with existing state-of-the-art action recognition methods. Notably, the performance gains grow with dataset complexity, indicating that the discriminative cross-modality correlations learned by HDCA yield greater benefits on more complex videos.
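The abstract describes the two training objectives only at a high level. The PyTorch-style sketch below illustrates one plausible instantiation consistent with the stated goals (maximize feature similarity within a class, minimize it across classes, with a cross-modality alignment term and an intra-group compactness term); the function names, the temperature parameter, and the supervised InfoNCE-style form are assumptions for illustration, not the paper's exact definitions.

```python
import torch
import torch.nn.functional as F


def supervised_contrastive(anchors, candidates, labels, temperature=0.1,
                           exclude_self=False):
    """Generic supervised contrastive term: for each anchor, candidates that
    share its class label are positives; all remaining candidates are negatives.
    (Assumed form, not taken from the paper.)"""
    a = F.normalize(anchors, dim=1)
    c = F.normalize(candidates, dim=1)
    logits = a @ c.t() / temperature                      # pairwise cosine similarities
    pos = (labels[:, None] == labels[None, :]).float()    # same-class mask
    if exclude_self:                                      # drop anchor-to-itself pairs
        eye = torch.eye(a.size(0), device=a.device)
        pos = pos * (1.0 - eye)
        logits = logits - 1e9 * eye                       # effectively remove the diagonal
    log_prob = logits - torch.logsumexp(logits, dim=1, keepdim=True)
    loss = -(pos * log_prob).sum(dim=1) / pos.sum(dim=1).clamp(min=1.0)
    return loss.mean()


def hdca_contrastive_losses(spatial_feat, temporal_feat, labels, temperature=0.1):
    """Cross-modality term aligns spatial vs. temporal features of same-class
    videos; intra-group terms compact same-class features within each pathway."""
    l_cross = 0.5 * (
        supervised_contrastive(spatial_feat, temporal_feat, labels, temperature)
        + supervised_contrastive(temporal_feat, spatial_feat, labels, temperature)
    )
    l_intra = 0.5 * (
        supervised_contrastive(spatial_feat, spatial_feat, labels, temperature,
                               exclude_self=True)
        + supervised_contrastive(temporal_feat, temporal_feat, labels, temperature,
                                 exclude_self=True)
    )
    return l_cross, l_intra
```

In practice these contrastive terms would be combined with a standard classification objective; they are shown in isolation here only to make the intended alignment and compactness roles concrete.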