Advancing Cross-Language Information Retrieval Through Shared Semantic Models: Applications in Public Cultural Resources
Abstract
1. Introduction
2. Related Work
2.1. Domain-Specific Information Retrieval
2.2. Research on Semantic Spaces
3. Construction of a Chinese Semantic Space for the Public Culture Domain
3.1. Construction of a Chinese Corpus for the Public Culture Domain
3.1.1. Data Sources
3.1.2. Corpus Construction Process
3.2. Pre-Trained Chinese Semantic Space
3.2.1. Semantic Space Pathway Design
- (1) Learning from Scratch: Word embeddings are learned jointly with a primary task (e.g., document classification or sentiment prediction). The embeddings are initialized with random values and updated alongside the neural network weights while the primary task is trained.
- (2) Using Pre-trained Models: Word embeddings pre-trained on large-scale corpora, typically for a machine learning task different from the target problem, are loaded into the model. These are known as pre-trained word embeddings, and the space they constitute is referred to as a pre-trained semantic space.
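The two pathways can be contrasted in a minimal sketch. The toy vocabulary, the 4-dimensional vectors, and the helper names `init_random` and `init_from_pretrained` are hypothetical and chosen only for illustration; in practice the dimension would match the pre-trained model and the vectors would be read from its distribution file.

```python
import random

def init_random(vocab, dim, seed=0):
    """Pathway (1): start from random values; a downstream task would update them."""
    rng = random.Random(seed)
    return {w: [rng.uniform(-0.1, 0.1) for _ in range(dim)] for w in vocab}

def init_from_pretrained(vocab, pretrained, dim, seed=0):
    """Pathway (2): copy pre-trained vectors; unseen words fall back to random values."""
    rng = random.Random(seed)
    table = {}
    for w in vocab:
        if w in pretrained:
            table[w] = list(pretrained[w])  # reuse the existing semantic space
        else:
            table[w] = [rng.uniform(-0.1, 0.1) for _ in range(dim)]
    return table

# Made-up vectors standing in for a pre-trained semantic space (assumption).
pretrained = {"library": [0.2, 0.1, -0.3, 0.5], "museum": [0.1, 0.0, -0.2, 0.4]}
vocab = ["library", "museum", "reading_room"]

scratch = init_random(vocab, dim=4)
loaded = init_from_pretrained(vocab, pretrained, dim=4)
print(loaded["library"])
```

The only difference between the two tables is where the starting values come from; either one can subsequently be updated during training on the target task.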
3.2.2. FastText Pre-Trained Model
3.3. Transfer Learning of the Chinese Semantic Space for the Public Culture Domain
3.3.1. Transfer Learning Model Based on Public Culture Corpora
3.3.2. Chinese Semantic Space After Domain Transfer Learning
4. Construction of a Cross-Lingual Shared Semantic Model Based on Chinese Semantic Space
4.1. Framework of the Chinese-English-French Cross-Lingual Shared Semantic Model
4.2. Construction Steps of the Chinese-English-French Cross-Lingual Shared Semantic Model
4.2.1. Shared Semantic Space Generation
4.2.2. Shared Semantic Space Adjustment
4.2.3. Shared Semantic Space Optimization
- (1) Pseudo Embedding:
- (2) Neighbor Equidistant Generation Method:
4.3. Validation of the Chinese-English-French Cross-Lingual Shared Semantic Model
4.4. Experimental Design for Cross-Lingual Information Retrieval in the Public Culture Domain
5. Discussion
5.1. Enhancing Language Scalability and Domain Flexibility
5.2. Comparative Interpretation of Semantic Alignment Performance
5.3. Improving Cross-Lingual Information Retrieval and Services in the Public Culture Domain
5.4. Limitations and Future Directions
5.5. Strengthening the Evaluation of Readability in Cross-Lingual Information Retrieval
6. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
- Litschko, R.; Vulić, I.; Glavaš, G. On cross-lingual retrieval with multilingual text encoders. Inf. Retr. J. 2022, 25, 149–183.
- Cheon, J.; Ko, Y. Parallel sentence extraction to improve cross-language information retrieval from Wikipedia. J. Inf. Sci. 2021, 47, 281–293.
- Agarwal, S.; Barry, J.; Boschee, E.; Miller, S. What Drives Cross-lingual Ranking? Retrieval Approaches with Multilingual Language Models. arXiv 2025, arXiv:2511.19324.
- Wang, Y.; Wu, A.; Neubig, G. English contrastive learning can learn universal cross-lingual sentence embeddings. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, Abu Dhabi, United Arab Emirates, 7–11 December 2022; pp. 9122–9133.
- Goworek, R.; Macmillan-Scott, O.; Özyiğit, E.B. Bridging language gaps: Advances in cross-lingual information retrieval with multilingual LLMs. arXiv 2025, arXiv:2510.00908.
- Lawrie, D.; Mayfield, J.; Yang, E.; Yates, A.; MacAvaney, S.; Pradeep, R.; Miller, S.; McNamee, P.; Soldani, L. NeuCLIRBench: A modern evaluation collection for cross-language and multilingual information retrieval. arXiv 2025, arXiv:2511.14758.
- Litschko, R.; Kraus, O.; Blaschke, V.; Plank, B. Cross-dialect information retrieval: Information access in low-resource and high-variance languages. In Proceedings of the 31st International Conference on Computational Linguistics, Abu Dhabi, United Arab Emirates, 19–24 January 2025; pp. 10158–10171.
- Oro, E.; Granata, F.M.; Ruffolo, M. A comprehensive evaluation of embedding models and LLMs for IR and QA across English and Italian. Big Data Cogn. Comput. 2025, 9, 141.
- Sun, G.; Deng, Y.; Liang, S.; Wu, D. Chinese-Tibetan Bilingual Knowledge Organization in the Cultural Heritage Domain: A Practice for Traditional Tibetan Festivals. Proc. Assoc. Inf. Sci. Technol. 2023, 60, 1137–1139.
- Elayeb, B. Arabic word sense disambiguation: A review. Artif. Intell. Rev. 2019, 52, 2475–2532.
- Wu, D.; Fan, S.; Yao, S.; Xu, S. An exploration of ethnic minorities’ needs for multilingual information access of public digital cultural services. J. Doc. 2023, 79, 1–20.
- Tognola, G.; Murri, A.; Cuda, D. Cognitive computing for the automated extraction and meaningful use of health data in narrative medical notes: An application to the clinical management of hearing impaired aged patients. In Proceedings of the 2018 IEEE EMBS International Conference on Biomedical & Health Informatics (BHI), Las Vegas, NV, USA, 4–7 March 2018; pp. 299–302.
- Litschko, R.; Vulić, I.; Glavaš, G. Parameter-efficient neural reranking for cross-lingual and multilingual retrieval. In Proceedings of the 29th International Conference on Computational Linguistics, Gyeongju, Republic of Korea, 12–17 October 2022; pp. 1071–1082.
- Raza, M.A.; Mokhtar, R.; Ahmad, N.; Pasha, M.; Pasha, U. A taxonomy and survey of semantic approaches for query expansion. IEEE Access 2019, 7, 17823–17833.
- Calegari, S.; Sanchez, E. A fuzzy ontology-approach to improve semantic information retrieval. URSW 2007, 327, 1–6.
- Lee, C.S.; Jian, Z.W.; Huang, L.K. A fuzzy ontology and its application to news summarization. IEEE Trans. Syst. Man Cybern. B 2005, 35, 859–880.
- Abulaish, M. An ontology enhancement framework to accommodate imprecise concepts and relations. J. Emerg. Technol. Web Intell. 2009, 1, 22–36.
- Jain, S.; Seeja, K.R.; Jindal, R. A fuzzy ontology framework in information retrieval using semantic query expansion. Int. J. Inf. Manag. Data Insights 2021, 1, 100009.
- Polley, S.; Koparde, R.R.; Gowri, A.B.; Perera, M.; Nuernberger, A. Towards trustworthiness in the context of explainable search. In Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval, Virtual, 11–15 July 2021; pp. 2580–2584.
- Abu-Rasheed, H.; Weber, C.; Zenkert, J.; Krumm, R.; Fathi, M. Explainable graph-based search for lessons-learned documents in the semiconductor industry. In Intelligent Computing; Springer: Cham, Switzerland, 2022; pp. 1097–1106.
- Yang, Z. Biomedical information retrieval incorporating knowledge graph for explainable precision medicine. In Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval, Virtual, 25–30 July 2020; p. 2486.
- Abu-Rasheed, H.; Weber, C.; Zenkert, J.; Dornhöfer, M.; Fathi, M. Transferrable framework based on knowledge graphs for generating explainable results in domain-specific intelligent information retrieval. Informatics 2022, 9, 6.
- Majeed, A.; Mujtaba, H.; Beg, M.O. Emotion detection in Roman Urdu text using machine learning. In Proceedings of the 35th IEEE/ACM International Conference on Automated Software Engineering, Virtual, 21–25 September 2020; pp. 125–130.
- Chen, J.; Xiao, S.; Zhang, P.; Luo, K.; Lian, D.; Liu, Z. BGE M3-Embedding: Multi-Lingual, Multi-Functionality, Multi-Granularity Text Embeddings Through Self-Knowledge Distillation. arXiv 2024, arXiv:2402.03216.
- Khan, S.U.R.; Farooq, M.U.; Beg, M.O. Big data analysis of Stack Overflow for energy consumption of Android framework. In Proceedings of the 2019 International Conference on Innovative Computing (ICIC), Lahore, Pakistan, 1–2 November 2019; pp. 1–9.
- Guo, P.; Wei, X.; Hu, Y.; Yang, B.; Liu, D.; Huang, F.; Xie, J. EMMA-X: An EM-like multilingual pre-training algorithm for cross-lingual representation learning. Adv. Neural Inf. Process. Syst. NeurIPS 2023, 36, 1–15.
- Hämmerl, K.; Libovický, J.; Fraser, A. Understanding cross-lingual alignment—A survey. In Findings of the Association for Computational Linguistics: ACL 2024, Bangkok, Thailand, 11–16 August 2024; pp. 10922–10943.
- Gudivada, A.; Tabrizi, N. A literature review on machine learning based medical information retrieval systems. In Proceedings of the 2018 IEEE Symposium Series on Computational Intelligence (SSCI), Bangalore, India, 18–21 November 2018; pp. 250–257.
- Kotei, E.; Thirunavukarasu, R. A systematic review of transformer-based pre-trained language models through self-supervised learning. Information 2023, 14, 187.
- Zhang, X.; LeCun, Y. Which encoding is best for text classification in Chinese, English, Japanese and Korean? arXiv 2017, arXiv:1708.02657.
- Mozaffari, M.H.; Lee, W.S. Domain adaptation for ultrasound tongue contour extraction using transfer learning: A deep learning approach. J. Acoust. Soc. Am. 2019, 146, 431–437.
- Kowsher, M.; Sobuj, M.S.; Shahriar, M.F.; Prottasha, N.J.; Arefin, M.S.; Dhar, P.K.; Koshiba, T. An enhanced neural word embedding model for transfer learning. Appl. Sci. 2022, 12, 2848.
- Forney, G.D. The Viterbi algorithm. Proc. IEEE 1973, 61, 268–278.
- Huang, Z.; Yu, P.; Allan, J. Cross-lingual knowledge transfer via distillation for multilingual information retrieval. arXiv 2023, arXiv:2302.13400.
- Ganin, Y.; Ustinova, E.; Ajakan, H.; Germain, P.; Larochelle, H.; Laviolette, F.; March, M.; Lempitsky, V. Domain-adversarial training of neural networks. J. Mach. Learn. Res. 2016, 17, 1–35.
- Ebrahimi, A.; von der Wense, K. Zero-shot vs. translation-based cross-lingual transfer: The case of lexical gaps. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 2: Short Papers); Association for Computational Linguistics: Mexico City, Mexico, 2024; pp. 443–458.
- Karan, M.; Vulić, I.; Korhonen, A.; Glavaš, G. Classification-based self-learning for weakly supervised bilingual lexicon induction. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Virtual, 5–10 July 2020; pp. 6915–6922.
- Xu, R.; Yang, Y.; Otani, N.; Wu, Y. Unsupervised cross-lingual transfer of word embedding spaces. arXiv 2018, arXiv:1809.03633.
- Lin, J. The neural hype and comparisons against weak baselines. ACM SIGIR Forum 2019, 52, 40–51.
- Qiu, X.; Zhang, Q.; Huang, X.J. FudanNLP: A toolkit for Chinese natural language processing. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics: System Demonstrations, Sofia, Bulgaria, 4–9 August 2013; pp. 49–54.
- TextData.cn. New York Times News Dataset from 2000 to 2025. 2025. Available online: https://textdata.cn/blog/2025-03-05-nytimes-news-dataset-from-2000-to-2025/ (accessed on 2 June 2025).
- Abdelrazek, A.; Eid, Y.; Gawish, E.; Medhat, W.; Hassan, A. Topic modeling algorithms and applications: A survey. Inf. Syst. 2023, 112, 102131.
- Pohorec, S.; Verlič, M.; Zorman, M. Domain specific information retrieval system. In Proceedings of the 8th WSEAS International Conference on Computer, Hangzhou, China, 20–22 May 2009; pp. 465–469.
- Yan, X.; Song, D.; Li, X. Concept-based document readability in domain specific information retrieval. In Proceedings of the 15th ACM International Conference on Information and Knowledge Management, Arlington, VA, USA, 6–11 November 2006; pp. 540–549.








| Reading Section: Book Type | Count | Venue Booking Section: Venue Category | Count | Activity Section: Venue Category | Count |
|---|---|---|---|---|---|
| Chinese and Foreign Classics | 67 | Cultural Center | 3160 | Public Culture and Arts | |
| Children’s Literature | 62 | Library | 1348 | COVID-19 Prevention Works | 20 |
| Literature | 54 | Tourist Attraction | 391 | Works on Building a Moderately Prosperous Society in All Respects | 15 |
| Classic Chinese Studies | 46 | Museum | 290 | Works on Poverty Alleviation | 8 |
| Poetry and Prose | 42 | Art Museum | 70 | Folk Culture and Arts | |
| Fiction and Stories | 40 | City Book Bar | 27 | Works Exhibition | 76 |
| Humanities and History | 36 | Visitor Center | 12 | Folk Custom Showcase | 55 |
| Biography | 23 | | | Art Heritage | 10 |
| Art | 17 | | | Unclassified | 17,926 |
| Local Chronicles | 16 | | | | |
| Others | 14 | | | | |
| Chinese History | 13 | | | | |
| Essays and Prose | 10 | | | | |
| History of Literature | 4 | | | | |
| Literary Criticism and Appreciation | 1 | | | | |
| Total | 430 | Total | 5298 | Total | 18,100 |
| Similarity Rank | Library (Before Transfer) | Similarity Score | Library (After Transfer) | Similarity Score | Museum (Before Transfer) | Similarity Score | Museum (After Transfer) | Similarity Score |
|---|---|---|---|---|---|---|---|---|
| 1 | Branch Library | 0.936 | Reading Room | 0.812 | Art Museum | 0.923 | Art Museum | 0.804 |
| 2 | Collection | 0.926 | Library Science | 0.804 | Collection | 0.922 | Natural History Museum | 0.806 |
| 3 | Library (archaic) | 0.925 | Branch Library | 0.800 | Inside the Museum | 0.916 | Museum (abbr.) | 0.778 |
| 4 | Reading Room | 0.923 | Library Catalog | 0.788 | Museum (abbr.) | 0.916 | Exhibition Hall | 0.764 |
| 5 | Book | 0.921 | Library (archaic) | 0.772 | Open to the Public | 0.915 | Artwork | 0.764 |
| 6 | Book Collection | 0.918 | Reading Room | 0.771 | Branch Museum | 0.914 | Branch Museum | 0.747 |
| 7 | New Library | 0.913 | Collection Size | 0.765 | Exhibition | 0.913 | Collection | 0.746 |
| 8 | Library Building | 0.911 | Ancient Library | 0.756 | Gallery | 0.913 | Botanical Garden | 0.745 |
| 9 | Various Libraries | 0.910 | Collection | 0.753 | Museum Director | 0.911 | Memorial Hall | 0.743 |
| 10 | Borrowing | 0.909 | Book | 0.752 | History Museum | 0.910 | Artwork | 0.742 |
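
A nearest-neighbor ranking of this kind can be sketched as follows. The source does not state which similarity measure produced the scores, so cosine similarity is an assumption here, and the two-dimensional toy vectors and the helper name `most_similar` are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity of two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def most_similar(query, space, k=10):
    """Return the k words closest to `query`, highest similarity first."""
    q = space[query]
    scored = [(w, cosine(q, v)) for w, v in space.items() if w != query]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

# Made-up vectors; a real semantic space would hold 300-dimensional embeddings.
space = {
    "library": [1.0, 0.0],
    "reading_room": [0.9, 0.1],
    "museum": [0.0, 1.0],
}
top = most_similar("library", space, k=2)
print(top)
```

Running the same query against the space before and after domain transfer would yield two ranked lists that can be compared side by side, as in the table.
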
| Category | Parameter | Value/Description |
|---|---|---|
| Embeddings | Vector Dimension | 300 |
| Discriminator | Number of Hidden Layers | 2 |
| | Hidden Layer Size | 2048 |
| | Activation Function | Leaky-ReLU |
| | Input Dropout | 0.1 |
| | Output Smoothing Coefficient | 0.2 |
| Training | Optimizer | SGD |
| | Batch Size | 32 |
| | Initial Learning Rate | 0.1 |
| | Learning Rate Decay Rate | 0.95 |
| | Learning Rate Shrinkage Factor | 0.5 |
| | Mapping Matrix Initialization | Identity Matrix |
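
How the three learning-rate settings might interact can be illustrated with a short sketch. The table does not say when each factor is applied, so this assumes the decay rate (0.95) is applied once per epoch and the shrinkage factor (0.5) is applied additionally whenever the validation metric falls below its best value so far. The function name `run_schedule` and the metric values are hypothetical.

```python
def run_schedule(metrics, lr=0.1, decay=0.95, shrink=0.5):
    """Return the learning rate in effect after each epoch.

    Assumption: `decay` multiplies the rate every epoch, and `shrink`
    multiplies it again when the validation metric drops below the best seen.
    """
    best = float("-inf")
    history = []
    for metric in metrics:
        lr *= decay
        if metric < best:
            lr *= shrink
        best = max(best, metric)
        history.append(lr)
    return history

# Made-up validation metrics for three epochs; the third one gets worse.
history = run_schedule([0.50, 0.60, 0.55])
print(history)
```

Under these assumptions the rate falls from 0.1 to roughly 0.095 and 0.090 over the first two epochs, then is cut to about 0.043 after the third, where the metric declines.
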
| Model | EN-FR | FR-EN | ZH-EN | EN-ZH | ZH-FR | FR-ZH |
|---|---|---|---|---|---|---|
| Best Supervised Model [38] | 81.1 | 82.4 | 49.9 | 45.4 | - | - |
| MUSE (Strategy 1) | 82.3 | 82.1 | 31.4 | 32.5 | - | - |
| M-APE (Strategy 2) | 82.3 | 82.1 | 84.6 | 61.8 | 84.5 | 55.4 |
| | EN-ZH | FR-ZH | AVG |
|---|---|---|---|
| K5 | 74.7 | 73.7 | 74.2 |
| K10 | 87.9 | 85.9 | 86.9 |
| K15 | 95.3 | 91.9 | 93.6 |
| Category | Retrieval Technique | Query List | P@10 | P@20 | R-prec | AP/mAP | AVG |
|---|---|---|---|---|---|---|---|
| Monolingual Information Retrieval | SLIR | Q1 | 1.000 | 0.983 | 0.525 | 0.689 | 0.799 |
| | | Q2 | 0.833 | 0.783 | 0.264 | 0.221 | 0.525 |
| | | Q3 | 0.733 | 0.783 | 0.461 | 0.386 | 0.591 |
| | | Q4 | 0.567 | 0.567 | 0.189 | 0.172 | 0.374 |
| | | Q5 | 0.733 | 0.533 | 0.233 | 0.196 | 0.424 |
| | | QAVG | 0.773 | 0.730 | 0.334 | 0.333 | 0.543 |
| | SESLIR | Q1 | 1.000 | 0.983 | 0.535 | 0.665 | 0.796 |
| | | Q2 | 0.733 | 0.750 | 0.297 | 0.241 | 0.505 |
| | | Q3 | 1.000 | 0.950 | 0.523 | 0.491 | 0.741 |
| | | Q4 | 0.533 | 0.533 | 0.182 | 0.160 | 0.352 |
| | | Q5 | 0.733 | 0.533 | 0.233 | 0.196 | 0.424 |
| | | QAVG | 0.800 | 0.750 | 0.354 | 0.351 | 0.564 |
| Cross-Lingual Information Retrieval | MCLIR | Q1 | 1.000 | 0.950 | 0.902 | 0.916 | 0.942 |
| | | Q2 | 0.900 | 0.900 | 0.490 | 0.495 | 0.696 |
| | | Q3 | 0.700 | 0.750 | 0.686 | 0.645 | 0.695 |
| | | Q4 | 1.000 | 0.950 | 0.545 | 0.488 | 0.746 |
| | | Q5 | 1.000 | 0.800 | 0.517 | 0.521 | 0.709 |
| | | QAVG | 0.920 | 0.870 | 0.628 | 0.613 | 0.758 |
| | SESLIR + MCLIR | Q1 | 1.000 | 1.000 | 0.865 | 0.883 | 0.937 |
| | | Q2 | 0.600 | 0.700 | 0.552 | 0.542 | 0.598 |
| | | Q3 | 1.000 | 1.000 | 0.784 | 0.871 | 0.914 |
| | | Q4 | 0.600 | 0.800 | 0.525 | 0.437 | 0.591 |
| | | Q5 | 1.000 | 0.800 | 0.517 | 0.521 | 0.709 |
| | | QAVG | 0.840 | 0.860 | 0.649 | 0.651 | 0.750 |
| | SECLIR | Q1 | 1.000 | 0.950 | 0.902 | 0.916 | 0.942 |
| | | Q2 | 0.700 | 0.700 | 0.490 | 0.495 | 0.596 |
| | | Q3 | 1.000 | 1.000 | 0.686 | 0.645 | 0.833 |
| | | Q4 | 0.800 | 0.900 | 0.545 | 0.488 | 0.683 |
| | | Q5 | 0.900 | 0.650 | 0.517 | 0.521 | 0.647 |
| | | QAVG | 0.880 | 0.840 | 0.628 | 0.613 | 0.740 |
| | UECLIR (M-APE) | Q1 | 1.000 | 1.000 | 0.857 | 0.883 | 0.935 |
| | | Q2 | 0.800 | 0.700 | 0.563 | 0.550 | 0.653 |
| | | Q3 | 1.000 | 0.900 | 0.725 | 0.801 | 0.857 |
| | | Q4 | 0.800 | 0.900 | 0.535 | 0.435 | 0.668 |
| | | Q5 | 1.000 | 0.800 | 0.517 | 0.521 | 0.709 |
| | | QAVG | 0.920 | 0.860 | 0.640 | 0.638 | 0.764 |
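
The ranking metrics in the table headers can be computed as in the sketch below. It uses the standard textbook definitions of P@k, R-precision, and average precision, which the source is assumed (not confirmed) to follow; the document identifiers and relevance set are made up. The last line checks that the AVG column is consistent with a plain mean of the four metrics, using the Q1 row for SLIR.

```python
def precision_at_k(ranked, relevant, k):
    """Fraction of the top-k results that are relevant."""
    return sum(1 for d in ranked[:k] if d in relevant) / k

def r_precision(ranked, relevant):
    """Precision at rank R, where R is the number of relevant documents."""
    r = len(relevant)
    return sum(1 for d in ranked[:r] if d in relevant) / r

def average_precision(ranked, relevant):
    """Mean of the precision values at each rank where a relevant item appears."""
    hits, total = 0, 0.0
    for rank, d in enumerate(ranked, start=1):
        if d in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant)

# Hypothetical ranked list and relevance judgments for one query.
ranked = ["d1", "d4", "d2", "d5", "d3"]
relevant = {"d1", "d2", "d3"}

print(precision_at_k(ranked, relevant, 2))
print(r_precision(ranked, relevant))
print(average_precision(ranked, relevant))

# Mean of the four reported metrics for the Q1 row of SLIR.
row_avg = round(sum([1.000, 0.983, 0.525, 0.689]) / 4, 3)
print(row_avg)
```

mAP would then be the mean of `average_precision` over all queries in a set.
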
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2026 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license.
Xia, Z.; Liang, S.; Wu, D.; Lv, S. Advancing Cross-Language Information Retrieval Through Shared Semantic Models: Applications in Public Cultural Resources. Appl. Sci. 2026, 16, 4158. https://doi.org/10.3390/app16094158

