A Multimodal Power Sample Feature Migration Method Based on Dual Cross-Modal Information Decoupling
Abstract
1. Introduction
- We design a fine-grained image–text alignment strategy that aligns hierarchical visual features from various stages of the encoder with corresponding linguistic features. This effectively captures discriminative multimodal representations.
- We propose a multimodal data feature alignment framework tailored for power systems. By implementing a dual-stream text–image attention mechanism, it achieves richer cross-modal interactions and realizes multimodal power sample feature migration (a minimal illustrative sketch of this dual-stream attention follows this list).
- We conduct evaluation experiments on multimodal tasks within the power industry. The results demonstrate that our proposed framework can effectively extract shared semantic information from paired multimodal data and project it into a fine-grained, common quantized latent space. This provides a practical solution for multimodal feature migration tasks in the power sector.
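As a rough illustration of the dual-stream text–image attention mentioned in the second contribution (not the authors' exact implementation; the dimensions, head counts, and module names below are assumptions), one way to realize the two streams in PyTorch is a pair of cross-attention blocks, one with text queries over image keys/values and one with image queries over text keys/values:

```python
import torch
import torch.nn as nn

class DualStreamCrossAttention(nn.Module):
    """Illustrative dual-stream text-image attention block (hypothetical design).

    One branch lets text tokens attend to image patches (text-guided stream),
    the other lets image patches attend to text tokens (image-adaptive stream).
    """

    def __init__(self, dim: int = 256, num_heads: int = 4):
        super().__init__()
        self.text_guided = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.image_adaptive = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm_t = nn.LayerNorm(dim)
        self.norm_v = nn.LayerNorm(dim)

    def forward(self, text_feats: torch.Tensor, img_feats: torch.Tensor):
        # text_feats: (B, N_text, dim), img_feats: (B, N_patches, dim)
        # Text-guided stream: text queries attend over image patches.
        t_out, _ = self.text_guided(query=text_feats, key=img_feats, value=img_feats)
        # Image-adaptive stream: image queries attend over text tokens.
        v_out, _ = self.image_adaptive(query=img_feats, key=text_feats, value=text_feats)
        return self.norm_t(text_feats + t_out), self.norm_v(img_feats + v_out)

# Toy usage: batch of 2, 16 text tokens, 196 image patches, feature dim 256.
if __name__ == "__main__":
    block = DualStreamCrossAttention()
    t = torch.randn(2, 16, 256)
    v = torch.randn(2, 196, 256)
    t2, v2 = block(t, v)
    print(t2.shape, v2.shape)  # torch.Size([2, 16, 256]) torch.Size([2, 196, 256])
```

The two output streams can then be fused or passed to task heads; the ablation in Section 4 compares keeping only one of the two streams against the full dual-stream configuration.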
2. Related Work
2.1. Multimodal Sample Feature Migration Method
2.2. Multimodal Unified Representation
2.3. Cross-Modal Mutual Information Assessment
3. Power System-Oriented Multimodal Data Feature Alignment Framework
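The full framework is described in the body of this section; as a hedged, illustrative sketch only, the PyTorch snippet below shows one plausible arrangement of the two ideas named in the contributions and ablations: visual features are decoupled into modality-shared and modality-specific parts, and the shared parts from several encoder stages are aligned with the pooled text representation through an InfoNCE-style contrastive loss. The module names, projection heads, and the choice of InfoNCE are assumptions for illustration, not the authors' exact formulation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SharedSpecificDecoupler(nn.Module):
    """Splits a feature into modality-shared and modality-specific parts (illustrative)."""

    def __init__(self, dim: int = 256):
        super().__init__()
        self.shared_proj = nn.Linear(dim, dim)
        self.specific_proj = nn.Linear(dim, dim)

    def forward(self, feats: torch.Tensor):
        return self.shared_proj(feats), self.specific_proj(feats)

def info_nce(a: torch.Tensor, b: torch.Tensor, temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE between paired pooled features a, b of shape (B, dim)."""
    a = F.normalize(a, dim=-1)
    b = F.normalize(b, dim=-1)
    logits = a @ b.t() / temperature                 # (B, B) similarity matrix
    targets = torch.arange(a.size(0), device=a.device)
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

def hierarchical_alignment_loss(stage_feats, text_feat, decouplers):
    """Aligns the shared part of each visual encoder stage with the pooled text feature."""
    loss = 0.0
    for feats, dec in zip(stage_feats, decouplers):
        shared, _specific = dec(feats)               # keep only the shared component
        pooled = shared.mean(dim=1)                  # (B, dim) average over patches
        loss = loss + info_nce(pooled, text_feat)
    return loss / len(stage_feats)

# Toy usage: three encoder stages, batch of 4, 196 patches, feature dim 256.
stages = [torch.randn(4, 196, 256) for _ in range(3)]
text = torch.randn(4, 256)
decs = nn.ModuleList(SharedSpecificDecoupler() for _ in range(3))
print(hierarchical_alignment_loss(stages, text, decs))
```

In practice the modality-specific branches would feed the dual-stream attention and fusion heads, and an additional penalty (for example, a mutual-information estimate as surveyed in Section 2.3) could discourage overlap between the shared and specific parts; both choices are framework-dependent.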
4. Experiments and Analysis
4.1. Datasets
- Scene Descriptions: Detailed textual characterizations of operational environments.
- Professional Knowledge Q&A: Expert-curated question–answer pairs on grid operations.
- Instance Detection: Annotated visual targets with bounding boxes and defect labels.
- Safety Monitoring—Operational Violations: Includes violations related to fall protection for aerial work, safety warnings at job sites, and personal protective equipment compliance. Examples encompass failure to wear safety harnesses during high-altitude operations, unattended ladder work, and uncovered or unguarded openings at work sites.
- Substation Equipment: Covers inspection data for various substation devices such as GIS/SF6 pressure gauges, lightning arresters, transformers, and reactors.
- Transmission Equipment: Covers detection data for transmission components such as protective fittings, conductors/ground wires, and bird-deterrence facilities. A hypothetical example of how one annotated sample might be organized is sketched after this list.
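The dataset's actual annotation schema is not reproduced here; the snippet below is a purely hypothetical example of how one sample combining the annotation types above (scene description, Q&A pairs, and instance boxes with defect labels) might be organized. All field names and values are illustrative assumptions.

```python
# Hypothetical structure of one multimodal power-inspection sample
# (field names are illustrative assumptions, not the dataset's actual schema).
sample = {
    "image_path": "substation/sf6_gauge_0001.jpg",
    "scene_description": "An outdoor substation bay with an SF6 pressure gauge "
                         "mounted on a GIS compartment under overcast conditions.",
    "qa_pairs": [
        {"question": "What does a reading in the red zone of the SF6 gauge indicate?",
         "answer": "Gas pressure below the safe operating threshold."},
    ],
    "instances": [
        {"label": "sf6_pressure_gauge", "bbox": [412, 230, 505, 318], "defect": "none"},
        {"label": "worker_no_harness", "bbox": [120, 88, 260, 410], "defect": "violation"},
    ],
}
```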
4.2. Experimental Setup
- Cross-modal Scenario Description (CSD): Each image was paired with a corresponding scene description. We assessed model capability by evaluating scene-description accuracy.
- Cross-modal Event Localization/Classification (CELC): We evaluated the method’s capability in power defect detection and domain knowledge by localizing and classifying fine-grained events.
- Cross-modal Knowledge Quiz (CKQ): We assessed whether models effectively migrate features between image and text modalities for power knowledge question answering.
4.3. Experimental Evaluation
- Image Captioning (IC): We used GPT-4 as an objective benchmark to evaluate caption adequacy, linguistic fluency, and accuracy against reference answers. Scores were in the range 0–10 with comprehensive evaluation explanations.
- Grounded Captioning (GC): GPT-4 evaluated category recall, coordinate deviation, fluency, and the accuracy of responses against references. Scores were in the range 0–10 with detailed explanations.
- Referential Expression Comprehension (REC): GPT-4 assessed false negatives/positives, coordinate deviation, linguistic fluency, and informativeness. Scores (0–10, higher being better) included comprehensive explanations.
- Referential Expression Generation (REG): GPT-4 evaluated correct category identification in specified regions, correspondence between generated and target categories, and linguistic fluency and informativeness. Scores were in the range 0–10 (higher being better) with full evaluation explanations. A minimal sketch of this scoring protocol follows this list.
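The exact judging prompts are not reproduced here; as a hedged sketch of how such GPT-4-based scoring can be automated, the snippet below builds a rubric prompt per metric and parses the 0–10 score from the judge's reply. `query_gpt4` is a hypothetical wrapper around whatever chat-completion client is used.

```python
import re

# Rubric summaries paraphrased from the metric descriptions above (illustrative).
RUBRIC = {
    "IC":  "Rate caption adequacy, linguistic fluency, and accuracy against the reference.",
    "GC":  "Rate category recall, coordinate deviation, fluency, and accuracy.",
    "REC": "Rate false negatives/positives, coordinate deviation, fluency, and informativeness.",
    "REG": "Rate category identification, category correspondence, fluency, and informativeness.",
}

def build_prompt(metric: str, prediction: str, reference: str) -> str:
    return (
        f"You are grading a model output for the {metric} task.\n"
        f"{RUBRIC[metric]}\n"
        f"Reference answer: {reference}\n"
        f"Model output: {prediction}\n"
        "Give a score from 0 to 10 (higher is better) as 'Score: <n>', then a short explanation."
    )

def parse_score(judge_reply: str) -> float:
    """Extracts the numeric score from the judge's reply; NaN if none is found."""
    match = re.search(r"Score:\s*(\d+(?:\.\d+)?)", judge_reply)
    return float(match.group(1)) if match else float("nan")

def evaluate(metric, prediction, reference, query_gpt4):
    # query_gpt4(prompt) is a hypothetical helper that returns the judge's reply string.
    reply = query_gpt4(build_prompt(metric, prediction, reference))
    return parse_score(reply), reply
```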
Our model is built upon ViT-Tiny as the visual encoder (∼5.7 M parameters, ∼1.3 GFLOPs for 224 × 224 inputs). With the addition of our dual-stream decoupling, attention-based alignment modules, and lightweight fusion heads, the total parameter count remains under ∼25 M, and the cost stays below 5 GFLOPs per sample. In contrast, LLaVA (∼7 B parameters) and Qwen-VL (>14 B) are orders of magnitude larger, as they rely on large-scale vision–language backbones. Our architecture is designed for modularity and efficiency: although it contains multiple attention components, they operate on compact intermediate features rather than raw high-resolution inputs. The observed performance gain therefore arises from structural enhancements (semantic decoupling and fine-grained alignment) rather than brute-force scaling. The design is thus suitable for downstream adaptation, and we envision real-time potential through model distillation or partial encoder freezing.
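These complexity figures can be sanity-checked with standard tooling. The hedged sketch below (assuming the timm and fvcore packages, and counting only a plain ViT-Tiny backbone rather than the full framework) shows one way to measure parameters and FLOPs for a 224 × 224 input.

```python
import torch
import timm
from fvcore.nn import FlopCountAnalysis

# Plain ViT-Tiny backbone only; the full framework adds decoupling, alignment,
# and fusion modules on top of this, so its totals will be somewhat higher.
model = timm.create_model("vit_tiny_patch16_224", pretrained=False)
model.eval()

params_m = sum(p.numel() for p in model.parameters()) / 1e6
flops_g = FlopCountAnalysis(model, torch.randn(1, 3, 224, 224)).total() / 1e9

print(f"ViT-Tiny parameters: {params_m:.1f} M")       # roughly 5.7 M
print(f"ViT-Tiny FLOPs @ 224x224: {flops_g:.1f} G")   # on the order of 1 GFLOP
```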
- IC Metric: Our method achieved significantly higher scores than alternatives, indicating superior precision in generating image descriptions highly aligned with power scenarios and more effective extraction/expression of critical information.
- GC Metric: Substantial improvements over LLaVA/Qwen/Shikra confirmed that our decoupled cross-modal alignment mechanism enabled more effective capture of semantic consistency across modalities, enhancing model comprehension.
- REC Metric: Dominant performance signified enhanced precision in semantic matching for cross-modal retrieval tasks, improving information reliability.
- REG Metric: The largest performance gap demonstrated higher accuracy in region alignment tasks (e.g., fault region localization), directly boosting target detection and failure analysis capabilities in power monitoring scenarios.
5. Discussion
6. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest
References
Comparison with baseline multimodal models (GPT-4 scores, 0–10; higher is better):

Method | IC | GC | REC | REG |
---|---|---|---|---|
LLaVA | 4.3 | 5.8 | 5.6 | 4.0 |
Qwen | 4.7 | 5.4 | 5.6 | 3.7 |
Shikra | 3.5 | 4.2 | 5.0 | 3.3 |
Ours | 5.3 | 6.5 | 6.4 | 4.1 |
Ablation study on the alignment module:

Method | IC | GC | REC | REG |
---|---|---|---|---|
Baseline: No Alignment Module | 4.1 | 5.0 | 5.1 | 3.5 |
Baseline + Coarse-Grained Alignment | 4.5 | 5.6 | 5.7 | 3.9 |
Baseline + Fine-Grained Alignment (Without Decoupling) | 4.9 | 6.0 | 6.1 | 4.0 |
Full Proposed Method (Fine-Grained Alignment + Decoupling) | 5.3 | 6.5 | 6.4 | 4.1 |
Ablation study on the dual-stream attention mechanism:

Method | IC | GC | REC | REG |
---|---|---|---|---|
Baseline: No Attention Mechanism | 4.3 | 5.2 | 5.4 | 3.6 |
Only Image-Adaptive Attention | 4.7 | 5.7 | 5.9 | 3.9 |
Only Text-Guided Attention | 4.8 | 5.8 | 6.0 | 4.0 |
Full Proposed Method (Dual-Stream Attention) | 5.3 | 6.5 | 6.4 | 4.1 |