Abstract
In the energy industry, as in other industrial monitoring scenarios, generative-AI-based video captioning plays a crucial role in event understanding and safety analysis. Current approaches typically rely on a single language model to decode visual semantics from video frames. Lightweight pre-trained generative models often produce overly generic captions that omit domain-specific details such as energy equipment states or procedural steps. Conversely, multimodal large generative AI models can capture fine-grained visual cues but are prone to distraction by complex backgrounds, resulting in hallucinated descriptions that reduce reliability in high-risk energy workflows. To bridge this gap, we propose a collaborative video captioning framework, EnerSafe-Cap (Energy-Safe Video Captioning), which introduces domain-aware prompt engineering to integrate the efficient summarization of lightweight models with the fine-grained analytical capability of large models, enabling multi-level semantic understanding and improving the accuracy and completeness of video content descriptions. Furthermore, to fully exploit the strengths of both small and large models, we design a dual-path heterogeneous sampling module: the large model receives key frames selected according to inter-frame motion dynamics, while the lightweight model processes densely sampled frames at fixed intervals, capturing complementary spatiotemporal cues, namely global event semantics from salient moments and fine-grained procedural continuity from uniform sampling. Experimental results on commonly used benchmark datasets show that our model outperforms baseline models. Specifically, on the VATEX dataset, our model surpasses the lightweight pre-trained language model SwinBERT by 19.49 points on the SentenceBERT metric and outperforms the multimodal large language model Qwen2-VL-2B by 8.27 points, validating the effectiveness of the method.
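
The following is a minimal sketch of the dual-path heterogeneous sampling idea summarized above, not the paper's exact implementation: the function names, the motion proxy (mean absolute difference between consecutive frames), and the top-k selection rule are illustrative assumptions.

```python
import numpy as np

def keyframes_by_motion(frames: np.ndarray, k: int) -> np.ndarray:
    """Select k frames with the largest inter-frame motion (fed to the large model)."""
    # Motion proxy (assumed here): mean absolute pixel difference
    # between consecutive frames.
    diffs = np.abs(frames[1:].astype(np.float32) - frames[:-1].astype(np.float32))
    motion = diffs.mean(axis=(1, 2, 3))        # one motion score per transition
    idx = np.argsort(motion)[-k:]              # k most dynamic transitions
    return frames[np.sort(idx) + 1]            # keep the frame after each transition

def dense_uniform_frames(frames: np.ndarray, interval: int) -> np.ndarray:
    """Sample frames at a fixed interval (fed to the lightweight model)."""
    return frames[::interval]

# Toy usage on a synthetic clip of 64 RGB frames.
clip = np.random.randint(0, 256, size=(64, 224, 224, 3), dtype=np.uint8)
large_model_input = keyframes_by_motion(clip, k=8)           # salient moments
small_model_input = dense_uniform_frames(clip, interval=4)   # procedural continuity
print(large_model_input.shape, small_model_input.shape)
```

Under these assumptions, the two paths yield complementary views of the same clip: the motion-ranked key frames emphasize salient events for the large model, while the fixed-interval stream preserves procedural continuity for the lightweight model.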