Abstract
In the energy industry, as in other industrial monitoring scenarios, generative-AI-based video captioning plays a crucial role in event understanding and safety analysis. Current approaches typically rely on a single language model to decode visual semantics from video frames. Lightweight pre-trained generative models often produce overly generic captions that omit domain-specific details such as energy equipment states or procedural steps. Conversely, multimodal large generative AI models can capture fine-grained visual cues but are prone to distraction by complex backgrounds, resulting in hallucinated descriptions that reduce reliability in high-risk energy workflows. To bridge this gap, we propose a collaborative video captioning framework, EnerSafe-Cap (Energy-Safe Video Captioning), which introduces domain-aware prompt engineering to integrate the efficient summarization of lightweight models with the fine-grained analytical capability of large models, enabling multi-level semantic understanding and improving the accuracy and completeness of video content descriptions. Furthermore, to fully exploit the strengths of both small and large models, we design a dual-path heterogeneous sampling module: the large model receives key frames selected according to inter-frame motion dynamics, while the lightweight model processes densely sampled frames at fixed intervals, capturing complementary spatiotemporal cues, namely global event semantics from salient moments and fine-grained procedural continuity from uniform sampling. Experimental results on commonly used benchmark datasets show that our model outperforms baseline models. Specifically, on the VATEX dataset, our model surpasses the lightweight pre-trained language model SwinBERT by 19.49 points on the SentenceBERT metric and outperforms the multimodal large language model Qwen2-VL-2B by 8.27 points, validating the effectiveness of the method.
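
The following is a minimal sketch of the dual-path heterogeneous sampling idea summarized above, not the paper's exact implementation: the function names, the motion proxy (mean absolute difference between consecutive frames), and the top-k selection rule are illustrative assumptions.

```python
import numpy as np

def keyframes_by_motion(frames: np.ndarray, k: int) -> np.ndarray:
    """Select k frames with the largest inter-frame motion (fed to the large model)."""
    # Motion proxy (assumed here): mean absolute pixel difference
    # between consecutive frames.
    diffs = np.abs(frames[1:].astype(np.float32) - frames[:-1].astype(np.float32))
    motion = diffs.mean(axis=(1, 2, 3))        # one motion score per transition
    idx = np.argsort(motion)[-k:]              # k most dynamic transitions
    return frames[np.sort(idx) + 1]            # keep the frame after each transition

def dense_uniform_frames(frames: np.ndarray, interval: int) -> np.ndarray:
    """Sample frames at a fixed interval (fed to the lightweight model)."""
    return frames[::interval]

# Toy usage on a synthetic clip of 64 RGB frames.
clip = np.random.randint(0, 256, size=(64, 224, 224, 3), dtype=np.uint8)
large_model_input = keyframes_by_motion(clip, k=8)           # salient moments
small_model_input = dense_uniform_frames(clip, interval=4)   # procedural continuity
print(large_model_input.shape, small_model_input.shape)
```

Under these assumptions, the two paths yield complementary views of the same clip: the motion-ranked key frames emphasize salient events for the large model, while the fixed-interval stream preserves procedural continuity for the lightweight model.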