Empowering Weak Languages Through Cross-Language Hyperlink Recommendation
Abstract
1. Introduction
- RQ1: Can reliable hyperlink recommendations be made?
- RQ2: How does the volume of language-specific Wikipedias affect recommendation performance? Specifically, can Wikipedias of low-resource languages benefit more from recommendations?
- RQ3: What are the characteristics of an effective recommendation model for hyperlinks?
- RQ4: To what extent do the recommendation results align with the editors’ intent?
2. Related Works
3. Hyperlink Recommendation Method
3.1. Hyperlink Recommendation Task
3.2. Bayesian Personalized Ranking Model
3.3. Graph Neural Network Model
- Embedding, which converts users and items into vector representations;
- Interaction modeling, which reconstructs historical interactions from the embeddings.
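The two components listed above can be illustrated with a minimal sketch. It assumes an inner-product interaction function and a BPR-style pairwise loss, both common choices rather than details confirmed here, and it omits any graph propagation step. The names `EMBED_DIM`, `init_embeddings`, `score`, and `bpr_pair_loss` are hypothetical, as are the sizes (3 articles, 5 hyperlinks).

```python
import math
import random

random.seed(0)  # fixed seed so the sketch is reproducible
EMBED_DIM = 4   # hypothetical embedding size, chosen only for illustration


def init_embeddings(count, dim=EMBED_DIM):
    """Embedding step: one small random vector per user (article) or item (hyperlink)."""
    return [[random.gauss(0.0, 0.1) for _ in range(dim)] for _ in range(count)]


user_emb = init_embeddings(3)  # 3 articles
item_emb = init_embeddings(5)  # 5 hyperlinks


def score(user, item):
    """Interaction modeling step (assumed form): inner product of the two embeddings."""
    return sum(a * b for a, b in zip(user_emb[user], item_emb[item]))


def bpr_pair_loss(user, pos_item, neg_item):
    """Pairwise loss -ln(sigmoid(score_pos - score_neg)) for one training triple."""
    margin = score(user, pos_item) - score(user, neg_item)
    return -math.log(1.0 / (1.0 + math.exp(-margin)))


# Loss for article 0, with hyperlink 1 observed and hyperlink 3 unobserved.
print(bpr_pair_loss(0, 1, 3))
```

Training would adjust the embeddings to lower this loss, so that observed hyperlinks score higher than unobserved ones for the same article.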
4. Dataset
4.1. Data Creation
4.2. Data in the Different Cases
- Case 1: The data consists of seven datasets, each configured according to a language combination or a single language. The articles were selected using strict criteria: each article must cover the same topic in all three languages and contain more than 100 hyperlinks. Articles from this dataset were also used in our previous study [41], where they proved effective for recommending hyperlink types that align with local interests in different language communities.
- Case 2: This is a multilingual dataset in three languages (en, ja, vi) with the same number of users (articles) as Case 1, but the articles are randomly selected. Each article may pertain to a distinct topic and is not necessarily related to the others.
- Case 3: This dataset is extended with Sinhala (si) Wikipedia articles. It retains the condition that articles share the same topic, here across all four languages.
4.3. Hyperparameters
5. Result and Evaluation
5.1. Evaluation Metrics
5.2. Experiment Result
5.2.1. Experiment 1: Comparison of the Performance of Different Models in Various Cases
5.2.2. Experiment 2: Comparison of Performance Across Different Language Configurations in Case 1
5.3. Generated Recommendations
5.4. Evaluation by Local Wikipedia Editors
6. Proof-of-Concept Application
7. Discussion
7.1. Limitations of Dataset
7.2. Limitations of Method
8. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
- Moas, P.M.; Lopes, C.T. Automatic Quality Assessment of Wikipedia Articles—A Systematic Literature Review. ACM Comput. Surv. 2023, 56, 1–37.
- Kim, J.; Kim, S.; Lee, C. Anticipating technological convergence: Link prediction using Wikipedia hyperlinks. Technovation 2019, 79, 25–34.
- West, R.; Paranjape, A.; Leskovec, J. Mining Missing Hyperlinks from Human Navigation Traces: A Case Study of Wikipedia. In Proceedings of the WWW ’15: 24th International Conference on World Wide Web, Florence, Italy, 18–22 May 2015; Republic and Canton of Geneva: Geneva, Switzerland, 2015; pp. 1242–1252.
- Schwarzer, M.; Schubotz, M.; Meuschke, N.; Breitinger, C.; Markl, V.; Gipp, B. Evaluating Link-based Recommendations for Wikipedia. In Proceedings of the 16th ACM/IEEE-CS on Joint Conference on Digital Libraries, Newark, NJ, USA, 19–23 June 2016; pp. 191–200.
- Labhishetty, S.; Siddiqa, A.; Nagipogu, R.; Chakraborti, S. WikiSeeAlso: Suggesting tangentially related concepts (see also links) for Wikipedia articles. In Proceedings of the Mining Intelligence and Knowledge Exploration: 5th International Conference, MIKE 2017, Hyderabad, India, 13–15 December 2017; pp. 274–286.
- Tsunakawa, T.; Araya, M.; Kaji, H. Enriching Wikipedia’s Intra-language Links by their Cross-language Transfer. In Proceedings of the COLING 2014, the 25th International Conference on Computational Linguistics: Technical Papers, Dublin, Ireland, 23–29 August 2014; pp. 1260–1268.
- Wang, Y.C.; Chuang, C.M.; Wu, C.K.; Pan, C.L.; Tsai, R.T.H. Cross-language article linking with deep neural network based paragraph encoding. Comput. Speech Lang. 2022, 72, 101279.
- Lih, A. The Wikipedia Revolution: How a Bunch of Nobodies Created the World’s Greatest Encyclopedia; Hyperion: New York, NY, USA, 2009.
- Ashrafimoghari, V. Detecting Cross-Lingual Information Gaps in Wikipedia. In Proceedings of the Companion Proceedings of the ACM Web Conference 2023 (WWW ’23 Companion), Austin, TX, USA, 30 April–4 May 2023; pp. 581–585.
- Roy, D.; Bhatia, S.; Jain, P. Information asymmetry in Wikipedia across different languages: A statistical analysis. J. Assoc. Inf. Sci. Technol. 2022, 73, 347–361.
- Gottschalk, S.; Demidova, E. MultiWiki: Interlingual Text Passage Alignment in Wikipedia. ACM Trans. Web 2017, 11, 1–30.
- Boschin, A.; Bonald, T. Enriching Wikidata with Semantified Wikipedia Hyperlinks. In Proceedings of the Wikidata Workshop ISWC, Virtual Conference, 24 October 2021.
- Miz, V.; Hanna, J.; Aspert, N.; Ricaud, B.; Vandergheynst, P. What is Trending on Wikipedia? Capturing Trends and Language Biases Across Wikipedia Editions. In Proceedings of the WWW ’20: Companion Proceedings of the Web Conference 2020, Taipei, Taiwan, 20–24 April 2020; ACM: New York, NY, USA, 2020; pp. 794–801.
- Piccardi, T.; West, R. Crosslingual Topic Modeling with WikiPDA. In Proceedings of the Web Conference 2021, Ljubljana, Slovenia, 19–23 April 2021; pp. 3032–3041.
- Briakou, E.; Carpuat, M. Detecting Fine-Grained Cross-Lingual Semantic Divergences without Supervision by Learning to Rank. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), Online, 16–20 November 2020; pp. 1563–1580.
- Roy, D.; Bhatia, S.; Jain, P. A Topic-Aligned Multilingual Corpus of Wikipedia Articles for Studying Information Asymmetry in Low Resource Languages. In Proceedings of the Twelfth Language Resources and Evaluation Conference, Marseille, France, 11–16 May 2020; pp. 2373–2380.
- Lewoniewski, W.; Węcel, K.; Abramowicz, W. Companies in Multilingual Wikipedia: Articles Quality and Important Sources of Information. In Proceedings of the FedCSIS-AIST 2022, ISM 2022, Sofia, Bulgaria, 4–7 September 2022; pp. 48–67.
- Wu, J.; Zhang, X.; Zhu, Y.; Liu, Z.; Guo, Z.; Fei, Z.; Lai, R.; Wu, Y.; Cao, Z.; Dou, Z. Pre-training for Information Retrieval: Are Hyperlinks Fully Explored? arXiv 2022, arXiv:2209.06583.
- Oh, J.H.; Kawahara, D.; Uchimoto, K.; Kazama, J.I.; Torisawa, K. Enriching Multilingual Language Resources by Discovering Missing Cross-Language Links in Wikipedia. In Proceedings of the IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology, Sydney, NSW, Australia, 9–12 December 2008; IEEE: New York, NY, USA, 2008.
- Yang, D.; Halfaker, A.; Kraut, R.; Hovy, E. Identifying Semantic Edit Intentions from Revisions in Wikipedia. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, Copenhagen, Denmark, 7–11 September 2017; pp. 2000–2010.
- Piccardi, T.; Catasta, M.; Zia, L.; West, R. Structuring Wikipedia Articles with Section Recommendations. In Proceedings of the 41st International ACM SIGIR Conference on Research & Development in Information Retrieval (SIGIR ’18), Ann Arbor, MI, USA, 8–12 July 2018; pp. 665–674.
- Faltings, F.; Galley, M.; Hintz, G.; Brockett, C.; Quirk, C.; Gao, J.; Dolan, B. Text Editing by Command. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Online, 6–11 June 2021; pp. 5259–5274.
- Difallah, D.; Saez-Trumper, D.; Augustine, E.; West, R.; Zia, L. Crosslingual Section Title Alignment in Wikipedia. In Proceedings of the 2022 IEEE International Conference on Big Data (Big Data), Osaka, Japan, 17–20 December 2022; pp. 5892–5901.
- Schick, T.; Yu, J.A.; Jiang, Z.; Petroni, F.; Lewis, P.; Izacard, G.; You, Q.; Nalmpantis, C.; Grave, E.; Riedel, S. PEER: A Collaborative Language Model. In Proceedings of the Eleventh International Conference on Learning Representations, Kigali, Rwanda, 1–5 May 2023.
- Wulczyn, E.; West, P.; Zia, L.; Leskovec, J. Growing Wikipedia across languages via recommendation. In Proceedings of the WWW ’16: International Conference on World Wide Web 2016, Montreal, QC, Canada, 11–15 April 2016; ACM: New York, NY, USA, 2016.
- Gundala, L.A.; Spezzano, F. Readers’ Demanded Hyperlink Prediction in Wikipedia. In Proceedings of the Companion Proceedings of the WWW ’18: The Web Conference 2018, Lyon, France, 23–27 April 2018; Republic and Canton of Geneva: Geneva, Switzerland, 2018; pp. 1805–1807.
- Adafre, S.F.; de Rijke, M. Discovering missing links in Wikipedia. In Proceedings of the LinkKDD ’05: 3rd International Workshop on Link Discovery, Chicago, IL, USA, 21 August 2005; pp. 90–97.
- Calixto, I.; Raganato, A.; Pasini, T. Wikipedia Entities as Rendezvous across Languages: Grounding Multilingual Language Models by Predicting Wikipedia Hyperlinks. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Online, 6–11 June 2021; pp. 3651–3661.
- Arora, A.; West, R.; Gerlach, M. Orphan Articles: The Dark Matter of Wikipedia. In Proceedings of the International AAAI Conference on Web and Social Media 2024, Buffalo, NY, USA, 3–6 June 2024; Volume 18, pp. 100–112.
- Bompotas, A.; Triantafyllopoulos, P.; Raptis, G.E.; Katsini, C.; Makris, C. Towards Exploring Personalized Hyperlink Recommendations Through Machine Learning. In Proceedings of the UMAP Adjunct ’24: Adjunct Proceedings of the 32nd ACM Conference on User Modeling, Adaptation and Personalization, Cagliari, Italy, 1–4 July 2024; pp. 528–533.
- Sedhain, S.; Menon, A.K.; Sanner, S.; Xie, L. AutoRec: Autoencoders Meet Collaborative Filtering. In Proceedings of the WWW ’15 Companion: 24th International Conference on World Wide Web, Florence, Italy, 18–22 May 2015; pp. 111–112.
- Zheng, Y.; Tang, B.; Ding, W.; Zhou, H. A neural autoregressive approach to collaborative filtering. In Proceedings of the ICML’16: 33rd International Conference on Machine Learning, New York, NY, USA, 19–24 June 2016; Volume 48, pp. 764–773.
- Wang, X.; He, X.; Wang, M.; Feng, F.; Chua, T.S. Neural Graph Collaborative Filtering. In Proceedings of the SIGIR’19: 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval, Paris, France, 21–25 July 2019; pp. 165–174.
- Kumar, A.; Singh, S.S.; Singh, K.; Biswas, B. Link prediction techniques, applications, and performance: A survey. Phys. A Stat. Mech. Its Appl. 2020, 553, 124289.
- Rendle, S.; Freudenthaler, C.; Gantner, Z.; Schmidt-Thieme, L. BPR: Bayesian personalized ranking from implicit feedback. In Proceedings of the UAI ’09: Twenty-Fifth Conference on Uncertainty in Artificial Intelligence, Arlington, VA, USA, 18–21 June 2009; pp. 452–461.
- Elkahky, A.M.; Song, Y.; He, X. A Multi-View Deep Learning Approach for Cross Domain User Modeling in Recommendation Systems. In Proceedings of the WWW ’15: 24th International Conference on World Wide Web, Florence, Italy, 18–22 May 2015; Republic and Canton of Geneva: Geneva, Switzerland, 2015; pp. 278–288.
- Wang, Q.; Wu, S.; Bai, Y.; Liu, Q.; Shi, X. Neighbor importance-aware graph collaborative filtering for item recommendation. Neurocomputing 2023, 549, 126429.
- Maas, A.L.; Hannun, A.Y.; Ng, A.Y. Rectifier Nonlinearities Improve Neural Network Acoustic Models. In Proceedings of the 30th International Conference on Machine Learning, Atlanta, GA, USA, 16–21 June 2013; p. 3.
- Consonni, C.; Laniado, D.; Montresor, A. WikiLinkGraphs: A complete, longitudinal and multi-language dataset of the Wikipedia link networks. In Proceedings of the International Conference on Web and Social Media, Munich, Germany, 11–14 June 2019.
- Nguyen, N.; Takeda, H. Exploring Cross-Language Differences in Wikidata-based Hyperlink Types for Enhanced Editorial Support on Wikipedia. In Proceedings of the 12th International Joint Conference on Knowledge Graphs (IJCKG 2023), Tokyo, Japan, 8–9 December 2023.
- Nguyen, N.; Takeda, H. Augmenting Low-Resource Language Wikipedia through Hyperlink Type Recommendation. IEICE Trans. Inf. Syst. 2025, 12, 2024EDP7258.
- Bayer, I.; He, X.; Kanagal, B.; Rendle, S. A Generic Coordinate Descent Framework for Learning from Implicit Feedback. In Proceedings of the 26th International Conference on World Wide Web, Perth, Australia, 3–7 April 2017.
- He, X.; Zhang, H.; Kan, M.Y.; Chua, T.S. Fast Matrix Factorization for Online Recommendation with Implicit Feedback. In Proceedings of the SIGIR ’16: 39th International ACM SIGIR Conference on Research and Development in Information Retrieval, Pisa, Italy, 17–21 July 2016; pp. 549–558.
- Koren, Y. Factorization meets the neighborhood: A multifaceted collaborative filtering model. In Proceedings of the KDD’08: 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Las Vegas, NV, USA, 24–27 August 2008; pp. 426–434.
- He, X.; Chen, T.; Kan, M.Y.; Chen, X. TriRank: Review-aware Explainable Recommendation by Modeling Aspects. In Proceedings of the CIKM ’15: 24th ACM International on Conference on Information and Knowledge Management, Melbourne, Australia, 19–23 October 2015; pp. 1661–1670.
- Wang, Y.; Wang, L.; Li, Y.; He, D.; Liu, T.Y. A Theoretical Analysis of NDCG Type Ranking Measures. In Proceedings of the 26th Annual Conference on Learning Theory, Princeton, NJ, USA, 12–14 June 2013; Shalev-Shwartz, S., Steinwart, I., Eds.; Proceedings of Machine Learning Research: Cambridge, MA, USA, 2013; Volume 30, pp. 25–54.
Language | Code | en | fr | de | nl | pl | es | it | ru | ja | pt | vi |
---|---|---|---|---|---|---|---|---|---|---|---|---|
English | en | - | 29.2 | 24.1 | 16.4 | 17 | 22.6 | 22.2 | 19.5 | 12.5 | 16.9 | 12 |
French | fr | 78.7 | - | 44.4 | 30.9 | 33.8 | 40.6 | 43.7 | 35.6 | 22.3 | 32.5 | 17.5 |
German | de | 58.5 | 40 | - | 24.2 | 27 | 28.7 | 32.6 | 28.6 | 17 | 22.7 | 11.6 |
Dutch | nl | 50.7 | 35.4 | 30.8 | - | 26.2 | 30.5 | 29.2 | 23.8 | 14.6 | 24.9 | 29.6 |
Polish | pl | 72.9 | 53.7 | 47.7 | 36.3 | - | 41.7 | 45.9 | 42.9 | 23.2 | 35.8 | 19.8 |
Spanish | es | 83.4 | 55.7 | 43.7 | 36.5 | 36 | - | 47.5 | 39.5 | 25.5 | 41.6 | 25.5 |
Italian | it | 82.2 | 60.1 | 49.8 | 35 | 39.7 | 47.7 | - | 41.4 | 25.2 | 38.3 | 19.1 |
Russian | ru | 70.3 | 47.6 | 42.4 | 27.8 | 36.1 | 38.6 | 40.3 | - | 24.8 | 32.2 | 17 |
Japanese | ja | 75.7 | 40.7 | 41.6 | 23.2 | 25.9 | 34 | 33.4 | 33.8 | - | 28.6 | 17.4 |
Portuguese | pt | 81.4 | 60.2 | 55.6 | 47.9 | 49.8 | 67.0 | 61.4 | 53.1 | 34.6 | - | 34.2 |
Vietnamese | vi | 67.4 | 32.8 | 24.2 | 48.4 | 23.4 | 36.3 | 26 | 23.8 | 17.9 | 29 | - |
Article (User ID) | Hyperlink (Item ID) |
---|---|
Mickey Mouse (1) | New York (1), Walt Disney (2) |
Tokyo (2) | New York (1), World War I (3), Kanto (4) |
Marie Curie (3) | World War I (3), Nobel Prize (5) |
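The mapping in the table above, where each article acts as a user and each hyperlink as an item, can be sketched as follows. This is a minimal illustration that assumes IDs are assigned in order of first appearance, which reproduces the IDs shown in the table; the function name `build_interactions` is hypothetical.

```python
# Rows copied from the example table: article (user) -> hyperlinks (items).
article_links = {
    "Mickey Mouse": ["New York", "Walt Disney"],
    "Tokyo": ["New York", "World War I", "Kanto"],
    "Marie Curie": ["World War I", "Nobel Prize"],
}


def build_interactions(links_by_article):
    """Assign 1-based IDs in order of first appearance and list (user, item) pairs."""
    user_ids, item_ids, pairs = {}, {}, []
    for article, links in links_by_article.items():
        user = user_ids.setdefault(article, len(user_ids) + 1)
        for link in links:
            item = item_ids.setdefault(link, len(item_ids) + 1)
            pairs.append((user, item))
    return user_ids, item_ids, pairs


user_ids, item_ids, pairs = build_interactions(article_links)
print(pairs)
```

Each pair is one implicit-feedback interaction; a hyperlink shared by several articles, such as New York, receives a single item ID.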
Language | #Articles |
---|---|
English | 6,928,834 |
Japanese | 1,441,422 |
Vietnamese | 1,294,331 |
Sinhala | 21,950 |
Dataset | Language Configuration | #Users | #Items | #Interactions | Density |
---|---|---|---|---|---|
Case 1 | en | 1000 | 201,047 | 406,877 | 0.00202 |
Case 1 | ja | 1000 | 111,075 | 284,165 | 0.00256 |
Case 1 | vi | 1000 | 43,592 | 144,301 | 0.00331 |
Case 1 | en + ja | 2000 | 257,214 | 691,041 | 0.00134 |
Case 1 | en + vi | 2000 | 209,810 | 551,177 | 0.00131 |
Case 1 | ja + vi | 2000 | 126,816 | 428,466 | 0.00169 |
Case 1 | en + ja + vi | 3000 | 263,396 | 835,342 | 0.00106 |
Case 2 | en + ja + vi | 3000 | 98,792 | 167,890 | 0.00057 |
Case 3 | en + ja + vi + si | 4000 | 98,791 | 319,080 | 0.00073 |
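The Density column of the table above is not defined in the table itself. A minimal sketch, assuming the usual definition for interaction matrices (interactions divided by users times items), reproduces the reported values for the Case 1 "en" and "ja" rows; the function name `density` is hypothetical.

```python
def density(n_users, n_items, n_interactions):
    """Fraction of the user-item matrix that is filled (assumed definition)."""
    return n_interactions / (n_users * n_items)


# Case 1, "en" row: 1000 users, 201,047 items, 406,877 interactions.
print(round(density(1000, 201_047, 406_877), 5))  # 0.00202
# Case 1, "ja" row: 1000 users, 111,075 items, 284,165 interactions.
print(round(density(1000, 111_075, 284_165), 5))  # 0.00256
```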
Top-k | Model | Recall (%), Case 1 | Recall (%), Case 2 | Recall (%), Case 3 | NDCG (%), Case 1 | NDCG (%), Case 2 | NDCG (%), Case 3 | Hit Rate (%), Case 1 | Hit Rate (%), Case 2 | Hit Rate (%), Case 3 |
---|---|---|---|---|---|---|---|---|---|---|
k = 20 | BPRMF | 4.44 | 11.33 | 11.14 | 22.38 | 20.31 | 21.4 | 67.00 | 41.22 | 60.49 |
k = 20 | NGCF | 8.63 | 11.48 | 24.74 | 13.7 | 30.01 | 18.68 | 89.67 | 45.19 | 71.84 |
k = 40 | BPRMF | 6.56 | 14.17 | 15.66 | 27.91 | 22.72 | 25.89 | 78.20 | 48.91 | 71.46 |
k = 40 | NGCF | 13.05 | 14.83 | 20.02 | 37.90 | 21.66 | 30.82 | 95.17 | 54.26 | 81.53 |
k = 60 | BPRMF | 8.11 | 15.65 | 19.16 | 31.52 | 24.15 | 29.37 | 83.03 | 53.64 | 77.93 |
k = 60 | NGCF | 16.44 | 17.39 | 24.61 | 43.54 | 23.74 | 34.73 | 97.17 | 59.69 | 86.43 |
k = 80 | BPRMF | 9.40 | 16.87 | 21.66 | 34.40 | 25.20 | 31.84 | 86.13 | 56.48 | 81.62 |
k = 80 | NGCF | 19.16 | 19.22 | 28.29 | 47.82 | 25.18 | 37.81 | 98.07 | 63.18 | 89.31 |
k = 100 | BPRMF | 10.59 | 17.88 | 23.65 | 36.90 | 26.10 | 33.84 | 88.27 | 59.32 | 83.87 |
k = 100 | NGCF | 21.56 | 20.82 | 31.08 | 51.30 | 26.36 | 40.12 | 98.37 | 66.10 | 91.12 |
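The three metrics reported in the table above can be sketched for a single user as follows. The sketch assumes the common binary-relevance definitions of Recall@k, NDCG@k, and Hit Rate@k; the exact variants used for the reported numbers are not stated here, and the function names and toy data are hypothetical.

```python
import math


def recall_at_k(ranked, relevant, k):
    """Share of the relevant items that appear in the top-k recommendations."""
    hits = sum(1 for item in ranked[:k] if item in relevant)
    return hits / len(relevant) if relevant else 0.0


def hit_at_k(ranked, relevant, k):
    """1.0 if at least one relevant item appears in the top-k, else 0.0."""
    return 1.0 if any(item in relevant for item in ranked[:k]) else 0.0


def ndcg_at_k(ranked, relevant, k):
    """Discounted gain of the top-k list, normalized by the best possible ordering."""
    dcg = sum(
        1.0 / math.log2(pos + 2)
        for pos, item in enumerate(ranked[:k])
        if item in relevant
    )
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(pos + 2) for pos in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0


# Toy example: held-out hyperlinks {1, 5, 9}; the model ranks 3, 1, 7, 5.
ranked_items = [3, 1, 7, 5]
held_out = {1, 5, 9}
print(recall_at_k(ranked_items, held_out, 3))
print(hit_at_k(ranked_items, held_out, 3))
print(ndcg_at_k(ranked_items, held_out, 3))
```

Dataset-level scores would then be the average of these per-user values over all test articles, expressed as percentages.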
Recommendation for ja | Recommendation for vi |
---|---|
Walt Disney Studios | 2012 |
Marvel Entertainment | 2006 |
IGN | TV Tokyo |
Disney Channel | Nintendo |
Walt Disney Studios Motion Pictures | Game Boy Advance |
Harrison Ford | 2020 |
superhero | comics |
Metacritic | The Walt Disney Company |
Universal Pictures | 1991 |
Academy Award for Best Visual Effects | Nihon Keizai Shimbun |
Paramount Pictures | 1999 |
Steven Spielberg | … |
Nintendo | Steamboat Willie |
Walt Disney Pictures | 1996 |
… | United States of America |
animation | video game |
Tokyo Disneyland | PlayStation 2 |
… | … |
Topic | Usefulness Rate (%) |
---|---|
Prominent Figure | 40 |
Geography | 65 |
Society | 45 |
History | 25 |
Astronomy | 70 |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Share and Cite
Nguyen, N.; Takeda, H.; Karunathilake, L. Empowering Weak Languages Through Cross-Language Hyperlink Recommendation. Information 2025, 16, 749. https://doi.org/10.3390/info16090749