A Cross-Modal Temporal Alignment Framework for Artificial Intelligence-Driven Sensing in Multilingual Risk Monitoring
Abstract
1. Introduction
- A multilingual semantic–numerical collaborative sensing paradigm is proposed, in which multilingual large language models are embedded into the financial anomaly detection pipeline, facilitating the transition from purely numerical behavior detection to semantically driven risk perception and providing a novel modeling perspective for intelligent financial monitoring.
- A cross-modal temporal alignment attention mechanism is designed, wherein learnable temporal offset parameters are introduced to characterize the dynamic lag of semantic event transmission toward market fluctuations, alleviating the misalignment between textual and price sequences and enhancing early warning capability.
- A multilingual semantic noise-robust encoding module is constructed by incorporating semantic confidence weighting and contrastive learning mechanisms, thereby improving model stability and generalization under complex linguistic contexts and low-quality textual environments.
- A semantic–numerical collaborative risk fusion module is developed to model the coupling relationship between semantic shock intensity and market volatility amplitude within a unified latent space; a gated fusion strategy is employed to achieve adaptive allocation of risk contributions, improving anomaly recognition accuracy and robustness.
- Extensive experiments on multi-market real-world datasets demonstrate that the proposed framework significantly outperforms traditional statistical models and deep temporal models in accuracy, F1 score, AUC, and early warning time, indicating strong practical applicability.
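The contribution bullets above describe the mechanisms only at a high level. As an illustrative aid, the cross-modal temporal alignment attention (second bullet) might be sketched as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: the function name, the scalar `offset` (standing in for the learnable temporal offset parameters), and the linear staleness penalty `decay` are all hypothetical.

```python
import numpy as np

def temporal_alignment_attention(text_feats, price_feats, text_times, price_times,
                                 offset=1.0, decay=0.5):
    """Sketch of cross-modal temporal alignment attention.

    `offset` plays the role of a learnable temporal offset: text-event
    timestamps are shifted forward before attending, modeling the lag between
    a semantic event and its market impact. `decay` penalizes stale events.
    Hypothetical parameterization, not taken from the paper.
    """
    d = text_feats.shape[1]
    # scaled dot-product scores: price steps (queries) vs. text events (keys)
    scores = price_feats @ text_feats.T / np.sqrt(d)
    # elapsed time from each shifted text event to each price step
    gap = price_times[:, None] - (text_times[None, :] + offset)
    # forbid attending to events that, after shifting, lie in the future,
    # and down-weight older events proportionally to the gap
    scores = np.where(gap >= 0, scores - decay * gap, -np.inf)
    scores -= scores.max(axis=1, keepdims=True)  # numerically stable softmax
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ text_feats  # aligned semantic context per price step
```

The sketch assumes every price step has at least one past text event to attend to; a production variant would need a fallback (e.g., a null event) when that fails.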
2. Related Work
2.1. Financial Time-Series Anomaly Detection
2.2. Multilingual Financial Text Analysis and Large Language Models
2.3. Cross-Modal Financial Intelligent Sensing Methods
3. Materials and Method
3.1. Data Collection
3.2. Data Preprocessing and Augmentation Strategy
3.3. Problem Formulation
3.4. Proposed Method
3.4.1. Overall
3.4.2. Cross-Modal Temporal Alignment Attention Mechanism
3.4.3. Multilingual Semantic Noise-Robust Encoding Module
3.4.4. Semantic–Numerical Collaborative Risk Fusion Module
4. Results and Discussion
4.1. Experimental Configuration
4.1.1. Hardware and Software Platform
4.1.2. Baseline Models and Evaluation Metrics
4.2. Anomaly Detection Performance Comparison
4.3. Cross-Market Generalization Experiment
4.4. Module Ablation Study
4.5. Ablation Study on Multilingual Semantic Impact
4.6. Parameter Sensitivity Analysis
4.7. Case Study
4.8. Discussion
4.9. Limitation and Future Work
5. Conclusions
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
References
- Joshi, P.; Santy, S.; Budhiraja, A.; Bali, K.; Choudhury, M. The state and fate of linguistic diversity and inclusion in the NLP world. arXiv 2020, arXiv:2004.09095.
- Chen, A.; Wei, Y.; Le, H.; Zhang, Y. Learning by teaching with ChatGPT: The effect of teachable ChatGPT agent on programming education. Br. J. Educ. Technol. 2024, 57, 163–184.
- Ousidhoum, N.; Beloucif, M.; Mohammad, S. Building Better: Avoiding Pitfalls in Developing Language Resources when Data is Scarce. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers); Association for Computational Linguistics: Stroudsburg, PA, USA, 2025; pp. 8881–8894.
- Ruder, S.; Vulić, I.; Søgaard, A. A survey of cross-lingual word embedding models. J. Artif. Intell. Res. 2019, 65, 569–631.
- Aharoni, R.; Johnson, M.; Firat, O. Massively multilingual neural machine translation. arXiv 2019, arXiv:1903.00089.
- Conneau, A.; Khandelwal, K.; Goyal, N.; Chaudhary, V.; Wenzek, G.; Guzmán, F.; Grave, E.; Ott, M.; Zettlemoyer, L.; Stoyanov, V. Unsupervised cross-lingual representation learning at scale. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics; Association for Computational Linguistics: Stroudsburg, PA, USA, 2020; pp. 8440–8451.
- Dal Pozzolo, A.; Caelen, O.; Johnson, R.A.; Bontempi, G. Calibrating probability with undersampling for unbalanced classification. In Proceedings of the 2015 IEEE Symposium Series on Computational Intelligence; IEEE: New York, NY, USA, 2015; pp. 159–166.
- Akoglu, L.; Tong, H.; Koutra, D. Graph based anomaly detection and description: A survey. Data Min. Knowl. Discov. 2015, 29, 626–688.
- Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers); Association for Computational Linguistics: Stroudsburg, PA, USA, 2019; pp. 4171–4186.
- Zhang, L.; Zhang, Y.; Ma, X. A new strategy for tuning ReLUs: Self-adaptive linear units (SALUs). In Proceedings of the ICMLCA 2021, 2nd International Conference on Machine Learning and Computer Application; VDE: Berlin, Germany, 2021; pp. 1–8.
- Chalapathy, R.; Chawla, S. Deep learning for anomaly detection: A survey. arXiv 2019, arXiv:1901.03407.
- Zhan, X.; Kou, L.; Xue, M.; Zhang, J.; Zhou, L. Reliable long-term energy load trend prediction model for smart grid using hierarchical decomposition self-attention network. IEEE Trans. Reliab. 2022, 72, 609–621.
- Conneau, A.; Wu, S.; Li, H.; Zettlemoyer, L.; Stoyanov, V. Emerging cross-lingual structure in pretrained language models. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics; Association for Computational Linguistics: Stroudsburg, PA, USA, 2020; pp. 6022–6034.
- Ruff, L.; Kauffmann, J.R.; Vandermeulen, R.A.; Montavon, G.; Samek, W.; Kloft, M.; Dietterich, T.G.; Müller, K.R. A unifying review of deep and shallow anomaly detection. Proc. IEEE 2021, 109, 756–795.
- Chen, T.; Kornblith, S.; Norouzi, M.; Hinton, G. A simple framework for contrastive learning of visual representations. In Proceedings of the International Conference on Machine Learning, Virtual, 13–18 July 2020; pp. 1597–1607.
- Khosla, P.; Teterwak, P.; Wang, C.; Sarna, A.; Tian, Y.; Isola, P.; Maschinot, A.; Liu, C.; Krishnan, D. Supervised contrastive learning. Adv. Neural Inf. Process. Syst. 2020, 33, 18661–18673.
- Song, A.; Seo, E.; Kim, H. Anomaly VAE-transformer: A deep learning approach for anomaly detection in decentralized finance. IEEE Access 2023, 11, 98115–98131.
- Feng, H.; Wang, Y.; Fang, R.; Xie, A.; Wang, Y. Federated risk discrimination with siamese networks for financial transaction anomaly detection. In Proceedings of the 2025 2nd International Conference on Digital Economy and Computer Science, Wuhan, China, 17–19 October 2025; pp. 231–236.
- Sakurada, M.; Yairi, T. Anomaly detection using autoencoders with nonlinear dimensionality reduction. In Proceedings of the MLSDA 2014 2nd Workshop on Machine Learning for Sensory Data Analysis, Gold Coast, QLD, Australia, 2 December 2014; pp. 4–11.
- Pires, T.; Schlinger, E.; Garrette, D. How multilingual is multilingual BERT? arXiv 2019, arXiv:1906.01502.
- Wu, S.; Dredze, M. Are all languages created equal in multilingual BERT? arXiv 2020, arXiv:2005.09093.
- Gururangan, S.; Marasović, A.; Swayamdipta, S.; Lo, K.; Beltagy, I.; Downey, D.; Smith, N.A. Don’t stop pretraining: Adapt language models to domains and tasks. arXiv 2020, arXiv:2004.10964.
- He, D. A multimodal deep neural network-based financial fraud detection model via collaborative awareness of semantic analysis and behavioral modeling. J. Circuits Syst. Comput. 2025, 34, 2550054.
- Xu, J.; Lo, S.Y.; Safaei, B.; Patel, V.M.; Dwivedi, I. Towards zero-shot anomaly detection and reasoning with multimodal large language models. In Proceedings of the Computer Vision and Pattern Recognition Conference, Nashville, TN, USA, 10–17 June 2025; pp. 20370–20382.
- Pan, X.; Wang, D.; Tsung, F. Empowering Intelligent Quality Control with Large Models: A Comprehensive Survey of Methods, Challenges, and Perspectives. TechRxiv 2025.
- Wang, Y.; Yao, Q.; Kwok, J.T.; Ni, L.M. Generalizing from a few examples: A survey on few-shot learning. ACM Comput. Surv. (CSUR) 2020, 53, 63.
- Wu, Y.; Xiang, C. Multimodal Financial Anomaly Detection in Enterprises Using VAE–Transformer–GNN Hybrid Ensemble Models. In Proceedings of the 2nd International Symposium on Integrated Circuit Design and Integrated Systems, Singapore, 26–28 September 2025; pp. 197–203.
- ForouzeshNejad, A.A.; Arabikhan, F.; Gegov, A.; Jafari, R.; Ichtev, A. Data-driven predictive modelling of agile projects using explainable artificial intelligence. Electronics 2025, 14, 2609.
- Hu, J.; Ruder, S.; Siddhant, A.; Neubig, G.; Firat, O.; Johnson, M. Xtreme: A massively multilingual multi-task benchmark for evaluating cross-lingual generalisation. In Proceedings of the International Conference on Machine Learning, Virtual, 13–18 July 2020; pp. 4411–4421.
- Lauscher, A.; Ravishankar, V.; Vulić, I.; Glavaš, G. From zero to hero: On the limitations of zero-shot cross-lingual transfer with multilingual transformers. arXiv 2020, arXiv:2005.00633.
- Xiang, Y. Using Arima-Garch model to analyze fluctuation law of international oil price. Math. Probl. Eng. 2022, 2022, 3936414.
- Liu, F.T.; Ting, K.M.; Zhou, Z.H. Isolation forest. In Proceedings of the 2008 Eighth IEEE International Conference on Data Mining; IEEE: New York, NY, USA, 2008; pp. 413–422.
- Chen, H.; Liu, H.; Chu, X.; Liu, Q.; Xue, D. Anomaly detection and critical SCADA parameters identification for wind turbines based on LSTM-AE neural network. Renew. Energy 2021, 172, 829–840.
- Hewage, P.; Behera, A.; Trovati, M.; Pereira, E.; Ghahremani, M.; Palmieri, F.; Liu, Y. Temporal convolutional neural (TCN) network for an effective weather forecasting using time-series data from the local weather station. Soft Comput. 2020, 24, 16453–16482.
- Han, K.; Wang, Y.; Chen, H.; Chen, X.; Guo, J.; Liu, Z.; Tang, Y.; Xiao, A.; Xu, C.; Xu, Y.; et al. A survey on vision transformer. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 45, 87–110.
- Chen, Y.; Wang, Q.; Wu, S.; Gao, Y.; Xu, T.; Hu, Y. Tomgpt: Reliable text-only training approach for cost-effective multi-modal large language model. ACM Trans. Knowl. Discov. Data 2024, 18, 171.
- Gadzicki, K.; Khamsehashari, R.; Zetzsche, C. Early vs late fusion in multimodal convolutional neural networks. In Proceedings of the 2020 IEEE 23rd International Conference on Information Fusion (FUSION); IEEE: New York, NY, USA, 2020; pp. 1–6.
- Wu, H.; Sun, Y.; Yang, Y.; Wong, D.F. Beyond Simple Fusion: Adaptive Gated Fusion for Robust Multimodal Sentiment Analysis. arXiv 2025, arXiv:2510.01677.

| Data Type | Sensing Modality | Data Source | Data Volume |
|---|---|---|---|
| High-frequency price streams | Market behavior sensor | NYSE (New York, NY, USA), NASDAQ (New York, NY, USA), SSE (Shanghai, China), SZSE (Shenzhen, China) | |
| Order book depth streams | Microstructure sensor | Exchange real-time APIs (NYSE: New York, NY, USA; NASDAQ: New York, NY, USA; SSE: Shanghai, China; SZSE: Shenzhen, China) | |
| English financial news | Semantic text sensor | Refinitiv (London, UK), Bloomberg (New York, NY, USA) | 850,000 |
| Chinese news and disclosures | Semantic text sensor | CNINFO (Shenzhen, China), Wind (Shanghai, China) | 620,000 |
| Social media financial posts | Public sentiment sensor | Twitter (San Francisco, CA, USA), financial forums (multiple platforms) | 2,100,000 |
| Total textual sensing data | Multilingual semantic sensor | Aggregated multi-source platforms | 3,570,000 |
| Method | Precision | Recall | F1-Score | AUC | MCC | EWT (min) | Latency (ms) |
|---|---|---|---|---|---|---|---|
| ARIMA-GARCH [31] | 0.672 ± 0.012 | 0.541 ± 0.015 | 0.599 ± 0.011 | 0.711 ± 0.009 | 0.452 ± 0.014 | 2.6 ± 0.4 | 0.54 ± 0.02 |
| Isolation Forest [32] | 0.701 ± 0.010 | 0.603 ± 0.012 | 0.648 ± 0.009 | 0.744 ± 0.008 | 0.518 ± 0.011 | 3.1 ± 0.3 | 0.82 ± 0.03 |
| LSTM-AE [33] | 0.742 ± 0.008 | 0.661 ± 0.010 | 0.699 ± 0.007 | 0.789 ± 0.006 | 0.584 ± 0.009 | 3.8 ± 0.3 | 2.15 ± 0.08 |
| TCN [34] | 0.758 ± 0.009 | 0.684 ± 0.009 | 0.719 ± 0.008 | 0.802 ± 0.007 | 0.613 ± 0.010 | 4.2 ± 0.2 | 1.42 ± 0.05 |
| Transformer (numerical only) [35] | 0.781 ± 0.007 | 0.706 ± 0.008 | 0.742 ± 0.006 | 0.823 ± 0.005 | 0.655 ± 0.008 | 4.7 ± 0.2 | 3.28 ± 0.11 |
| Text-only (mLLM encoder + classifier) [36] | 0.736 ± 0.011 | 0.691 ± 0.013 | 0.713 ± 0.010 | 0.807 ± 0.008 | 0.628 ± 0.012 | 5.4 ± 0.5 | 15.42 ± 0.45 |
| Late fusion (concat + MLP) [37] | 0.803 ± 0.006 | 0.728 ± 0.007 | 0.764 ± 0.005 | 0.842 ± 0.004 | 0.681 ± 0.007 | 6.1 ± 0.3 | 16.18 ± 0.52 |
| Simple cross-attn fusion [38] | 0.818 ± 0.006 | 0.742 ± 0.007 | 0.778 ± 0.005 | 0.856 ± 0.005 | 0.709 ± 0.007 | 6.8 ± 0.3 | 18.45 ± 0.58 |
| MLLM-Anomaly [24] | 0.829 ± 0.005 | 0.755 ± 0.006 | 0.790 ± 0.005 | 0.865 ± 0.004 | 0.732 ± 0.006 | 7.2 ± 0.4 | 45.61 ± 1.25 |
| CMT [15] | 0.838 ± 0.005 | 0.766 ± 0.006 | 0.800 ± 0.004 | 0.877 ± 0.004 | 0.748 ± 0.006 | 7.8 ± 0.4 | 20.14 ± 0.62 |
| Proposed method | 0.852 ± 0.005 * | 0.781 ± 0.006 * | 0.815 ± 0.005 * | 0.892 ± 0.004 * | 0.796 ± 0.007 * | 8.9 ± 0.3 * | 19.36 ± 0.55 |
| Source→Target | Method | Precision | Recall | F1-Score | AUC | EWT (min) |
|---|---|---|---|---|---|---|
| NYSE/NASDAQ→SSE/SZSE | Transformer (numerical only) | 0.732 | 0.641 | 0.683 | 0.781 | 3.6 |
| NYSE/NASDAQ→SSE/SZSE | Simple cross-attn fusion | 0.768 | 0.682 | 0.722 | 0.824 | 5.2 |
| NYSE/NASDAQ→SSE/SZSE | Proposed method | 0.816 | 0.731 | 0.771 | 0.868 | 7.6 |
| SSE/SZSE→NYSE/NASDAQ | Transformer (numerical only) | 0.721 | 0.628 | 0.671 | 0.769 | 3.3 |
| SSE/SZSE→NYSE/NASDAQ | Simple cross-attn fusion | 0.759 | 0.674 | 0.714 | 0.817 | 5.0 |
| SSE/SZSE→NYSE/NASDAQ | Proposed method | 0.807 | 0.722 | 0.762 | 0.861 | 7.2 |
| Model Variant | Precision | Recall | F1-Score | AUC | EWT (min) |
|---|---|---|---|---|---|
| w/o Cross-modal temporal alignment | 0.821 | 0.742 | 0.779 | 0.861 | 6.4 |
| w/o Multilingual semantic denoising | 0.833 | 0.751 | 0.790 | 0.871 | 7.1 |
| w/o Semantic–numerical collaborative fusion | 0.826 | 0.748 | 0.785 | 0.866 | 6.7 |
| Numerical branch only | 0.781 | 0.706 | 0.742 | 0.823 | 4.7 |
| Full model (proposed method) | 0.852 | 0.781 | 0.815 | 0.892 | 8.9 |
| Linguistic Configuration | Precision | Recall | F1-Score | AUC | MCC | EWT (min) |
|---|---|---|---|---|---|---|
| English-only Baseline | 0.806 | 0.746 | 0.775 | 0.848 | 0.742 | 6.4 |
| Chinese-only Baseline | 0.792 | 0.736 | 0.763 | 0.835 | 0.725 | 5.9 |
| Multilingual (Proposed) | 0.852 | 0.781 | 0.815 | 0.892 | 0.796 | 8.9 |
| Decay Coefficient | Precision | Recall | F1-Score | AUC | MCC | EWT (min) |
|---|---|---|---|---|---|---|
| (Slow Decay) | 0.786 | 0.741 | 0.763 | 0.832 | 0.718 | 9.4 |
| | 0.824 | 0.768 | 0.795 | 0.864 | 0.765 | 9.2 |
| (Optimal) | 0.852 | 0.781 | 0.815 | 0.892 | 0.796 | 8.9 |
| | 0.835 | 0.739 | 0.784 | 0.857 | 0.751 | 5.6 |
| (Fast Decay) | 0.812 | 0.684 | 0.742 | 0.816 | 0.698 | 3.4 |
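The decay-coefficient table above shows a monotone trade-off: slower decay lengthens early warning time (EWT) but lowers precision, while faster decay does the reverse. Under the plausible reading that the coefficient exponentially down-weights semantic events by age, the effect can be illustrated with a toy computation; the functional form `exp(-decay * age)` and the name `event_weights` are assumptions for illustration, not taken from the paper.

```python
import numpy as np

# Hypothetical illustration of the decay coefficient's role: older semantic
# events receive exponentially smaller influence, weight = exp(-decay * age).
ages = np.array([1.0, 5.0, 15.0, 60.0])  # minutes since each event

def event_weights(ages, decay):
    """Normalized influence of each event under exponential age decay."""
    w = np.exp(-decay * ages)
    return w / w.sum()

slow = event_weights(ages, 0.01)  # slow decay: old events still contribute
fast = event_weights(ages, 0.5)   # fast decay: freshest event dominates
```

With slow decay the 60-minute-old event retains a noticeable share of the total weight, consistent with longer warning horizons but more exposure to stale or noisy signals; with fast decay nearly all weight concentrates on the 1-minute-old event, consistent with higher precision but shorter EWT.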
© 2026 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license.
Share and Cite
Sun, H.; Zhang, J.; Hong, W.; Fang, Y.; Ma, M.; Shi, K.; Li, M. A Cross-Modal Temporal Alignment Framework for Artificial Intelligence-Driven Sensing in Multilingual Risk Monitoring. Sensors 2026, 26, 2319. https://doi.org/10.3390/s26082319
