A Privacy-Preserving Multilingual Comparable Corpus Construction Method in Internet of Things
Abstract
1. Introduction
- We propose a similarity-based method for constructing a privacy-preserving multilingual comparable corpus in IoT involving three or more languages. Natural language processing tasks currently rely heavily on large-scale parallel corpora, yet multilingual parallel corpora are scarce. A comparable corpus is a rich resource that supplements parallel corpora. Previous research has focused predominantly on constructing bilingual comparable corpora, with little attention given to comparable corpora involving three or more languages, particularly low-resource languages. Our method addresses this challenge of constructing a multilingual comparable corpus.
- We propose a decision-making mechanism for determining the comparability relationships in multilingual comparable corpora. Existing work on constructing comparable corpora offers little research on decision mechanisms for comparability relationships and mostly focuses on calculating the comparability between bilingual texts. With multiple languages, however, the comparability relationships between texts in different languages are hard to determine through simple calculations. Our comparability decision-making mechanism addresses the selection of comparable corpus pairs that satisfy comparability relationships across multiple languages and texts.
- The constructed corpus provides a resource for language activities such as multilingual language teaching, the compilation of multilingual dictionaries, and cross-lingual translation studies, and it offers a solution to the privacy and security challenges in IoT applications. For privacy protection, pre-processing retains only elements such as titles and content, which partially avoids retaining and leaking related private information. Furthermore, when the corpus is shared and used, users obtain processed, usable corpora in the form of comparable pairs instead of the original texts, which to some extent protects the sensitive and personal information in the source corpus data.
2. Related Work
2.1. Methods of Privacy-Preserving Multilingual Comparable Corpus Construction and Comparability Calculation
2.2. Applications of Privacy-Preserving Multilingual Comparable Corpus in IoT
3. The Construction Method of Privacy-Preserving Multilingual Comparable Corpus
3.1. Privacy-Preserving Text Embedding under the Unified Language
3.2. Calculation of Privacy-Preserving Multilingual Text Comparability
3.3. The Decision Making Mechanism of Privacy-Preserving Multilingual Comparability
- Randomly choose one piece of Chinese news i from C, and search the corpus U for the news j that has the maximum similarity to it;
- For the jth piece of news in U, search for the most similar news l in T;
- Calculate the similarity value of the ith piece of Chinese news and the lth piece of Tibetan news;
- The rule of similarity decision making mechanism: , if it fits the conditions, then skip to procedure (6);
- The ith piece of news in news corpus C, the jth piece of news in news corpus U, and the lth piece of news in news corpus T form a comparable corpus pair and are entered into the final C-U-T comparable corpus. Delete the ith piece of news from C, the jth piece of news from U, and the lth piece of news from T.
- Repeat procedures (1)–(5) until all news in the corpora has been checked for comparability.
Algorithm 1. The forward steps of constructing the multilingual comparable corpus.
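A minimal Python sketch of procedures (1)–(6) may help make the control flow concrete. It is not the authors' Algorithm 1. Several parts are assumptions made only for illustration: texts are represented as embedding vectors, `cosine` stands in for the comparability score, the decision rule in step (4) is replaced by a hypothetical threshold on all three pairwise similarities, and a Chinese item that fails the rule is dropped from C so that the loop terminates. The names `cosine`, `most_similar`, `build_triples`, and `threshold` are hypothetical.

```python
import math
import random
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; an ASSUMED stand-in for the comparability score."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def most_similar(query: Vector, pool: List[Vector]) -> int:
    """Index of the vector in `pool` that is most similar to `query`."""
    return max(range(len(pool)), key=lambda k: cosine(query, pool[k]))


def build_triples(
    C: Sequence[Vector],
    U: Sequence[Vector],
    T: Sequence[Vector],
    threshold: float = 0.8,
    seed: int = 0,
) -> List[Tuple[Vector, Vector, Vector]]:
    """Greedily group Chinese (C), Uighur (U) and Tibetan (T) items."""
    rng = random.Random(seed)
    C, U, T = list(C), list(U), list(T)  # copies; the inputs stay untouched
    triples = []
    while C and U and T:
        i = rng.randrange(len(C))    # (1) pick a random Chinese item ...
        j = most_similar(C[i], U)    # ... and its best match in U
        l = most_similar(U[j], T)    # (2) best match in T for U[j]
        s_cu = cosine(C[i], U[j])
        s_ut = cosine(U[j], T[l])
        s_ct = cosine(C[i], T[l])    # (3) Chinese-Tibetan similarity
        # (4) ASSUMED rule: every pairwise similarity reaches the threshold.
        if min(s_cu, s_ut, s_ct) >= threshold:
            triples.append((C[i], U[j], T[l]))  # (5) store the triple
            del U[j]
            del T[l]
        # ASSUMPTION: C[i] is removed in both cases so the loop ends (6).
        del C[i]
    return triples
```

Because each iteration removes one item from C, the sketch performs at most |C| rounds. A different rule for step (4) can be swapped in by changing the single `if` condition.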
4. Experiments and Results
4.1. Experimental Environment
4.2. Data Preparation for Privacy-Preserving Multilingual Comparable Corpus
4.3. Metrics
4.4. The Realization of Privacy-Preserving Multilingual News Comparable Corpus
4.4.1. Chinese–Uighur–Tibetan Data Collecting and Processing
4.4.2. The Construction Results of Privacy-Preserving Chinese–Uighur–Tibetan Comparable Corpus
4.5. Evaluation Results
5. Conclusions
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
References
1. Rock, L.Y.; Tajudeen, F.P.; Chung, Y.W. Usage and impact of the internet-of-things-based smart home technology: A quality-of-life perspective. Univers. Access Inf. Soc. 2022, 1–20.
2. Bin, G.; Sicong, L.; Yan, L.; Zhigang, L.; Zhiwen, Y.; Xingshe, Z. AIoT: The Concept, Architecture, and Key Techniques. Chin. J. Comput. 2023, 46. Available online: https://kns.cnki.net/kcms2/article/abstract?v=rCMvAF-4El1WLvIjsXZvAiChQ0k3XL_bsnLH7YPUPymadeQl07Yn4l2QCxVCT00_44fCKwOqV3BqfGYLToQHOBA5_7c8GU109AwCbRghrzgOcLqM8RjBiYu-a3zDXmea9Atwq5h28dVtTYsbmZu0sQ==&uniplatform=NZKPT&language=CHS (accessed on 1 October 2023).
3. O’Shaughnessy, P.; Lin, Y.X. Privacy Protection Practice for Data Mining with Multiple Data Sources: An Example with Data Clustering. Mathematics 2022, 10, 4744.
4. Aljumah, A.; Ahanger, T. Blockchain-Based Information Sharing Security for the Internet of Things. Mathematics 2023, 11, 2157.
5. Liang, K.; Zhou, B.; Zhang, Y.; He, Y.; Guo, X.; Zhang, B. A Multi-Entity Knowledge Joint Extraction Method of Communication Equipment Faults for Industrial IoT. Electronics 2022, 11, 979.
6. Pilán, I.; Lison, P.; Øvrelid, L.; Papadopoulou, A.; Sánchez, D.; Batet, M. The Text Anonymization Benchmark (TAB): A Dedicated Corpus and Evaluation Framework for Text Anonymization. Comput. Linguist. 2022, 48, 1053–1101.
7. He, M.; Li, Y. Application of Big Data Technology in News Media Scene Visualization Based on Internet of Things (IoTs). Math. Probl. Eng. 2022, 2022, 5508125.
8. Gaimei, G.; Xu, S.; Chunxia, L.; Weichao, D.; Na, W. A Blockchain-based Method for Privacy Protection of Medical Data. J. Comput. Appl. Res. 2023, 1–7.
9. Zhong, Z.; Zhang, G.; Yin, L.; Chen, Y. Description and Analysis of Data Security Based on Differential Privacy in Enterprise Power Systems. Mathematics 2023, 11, 4829.
10. Baker, M. Corpora in Translation Studies: An Overview and Some Suggestions for Future Research. Target 1995, 7, 223–243.
11. Xu, H.; Jiang, M.; Lin, J.; Huang, C.R. Light verb variations and varieties of Mandarin Chinese: Comparable corpus driven approaches to grammatical variations. Corpus Linguist. Linguist. Theory 2020, 18, 145–173.
12. Wang, B. Feature Extraction Method of Machine Translation Equivalent Pairs in Chinese-English Comparable Corpus based OCR Recognition. In Proceedings of the 2021 5th International Conference on Trends in Electronics and Informatics (ICOEI), Tirunelveli, India, 3–5 June 2021; pp. 899–902.
13. Dominic, P.; Purushothaman, N.; Kumar, A.S.A.; Prabagaran, A.; Blessy, J.A.; John, A. Multilingual Sentiment Analysis using Deep-Learning Architectures. In Proceedings of the 2023 5th International Conference on Smart Systems and Inventive Technology (ICSSIT), Tirunelveli, India, 23–25 January 2023; pp. 1077–1083.
14. Katsumata, S.; Komachi, M. Towards Unsupervised Grammatical Error Correction using Statistical Machine Translation with Synthetic Comparable Corpus. arXiv 2019, arXiv:1907.09724.
15. Goyal, V.; Kumar, A.; Lehal, M. Document Alignment for Generation of English-Punjabi Comparable Corpora from Wikipedia. Int. J. E-Adopt. 2020, 12, 42–51.
16. Huajun, L.; Kaiyue, W. The media industry’s format innovation, relationship reconstruction, and development path in the era of intelligent IoT. J. Lover 2022, 4, 10–14.
17. Li, J.; Xie, L.; Chen, Z.; Shi, L.; Chen, R.; Ren, Y.; Wang, L.; Lu, X. An AIoT-Based Assistance System for Visually Impaired People. Electronics 2023, 12, 3760.
18. Tang, X.; Zhu, L.; Shen, M.; Peng, J.; Kang, J.; Niyato, D.; Abd El-Latif, A. Secure and Trusted Collaborative Learning Based on Blockchain for Artificial Intelligence of Things. IEEE Wirel. Commun. 2022, 29, 14–22.
19. Yujie, L. Reflection on the Communication Mechanism and Media of Wearable Smart Devices in the News Field. Publ. Angle 2020, 15, 63–65.
20. Tang, X.; Liao, D.; Shen, M.; Zhu, L.; Huang, S.; Li, G.; Man, H.; Xu, J. Confidence-aware Sentiment Quantification via Sentiment Perturbation Modeling. IEEE Trans. Affect. Comput. 2023, 1–15.
21. Tang, X.; Shen, M.; Li, Q.; Zhu, L.; Xue, T.; Qu, Q. PILE: Robust Privacy-Preserving Federated Learning Via Verifiable Perturbations. IEEE Trans. Dependable Secur. Comput. 2023, 20, 5005–5023.
22. Shuman, W.; Aiping, L.; Liguo, D.; Jia, F.; Yongle, C. BTM-based IoT service discovery method. J. Comput. Appl. 2020, 40, 459–464.
23. Yimei, W. Content Production Strategy and Practice of Satellite News. Youth J. 2022, 2, 70–72.
24. Ruslan, A.; Jusoh, A.; Asnawi, A.L.; Othman, M.; Abdul Razak, N.I. Development of multilanguage voice control for smart home with IoT. J. Phys. Conf. Ser. 2021, 1921, 012069.
25. Sayakkara, A.; Le-Khac, N.A. Electromagnetic Side-Channel Analysis for IoT Forensics: Challenges, Framework, and Datasets. IEEE Access 2021, 9, 113585–113598.
26. Iliev, Y.; Ilieva, G. A Framework for Smart Home System with Voice Control Using NLP Methods. Electronics 2022, 12, 116.
27. Zhang, Q.; Xiang, Z. Improvement of culture media efficiency in Internet of Things based on global numerical ant colony algorithm. Pers. Ubiquitous Comput. 2020, 24, 347–361.
28. Wei, P. Research on the Construction Technology of Tibetan-Chinese Bilingual Comparable Corpus Based on Web. Master’s Thesis, Minzu University of China, Beijing, China, 2015.
29. Langlois, D.; Saad, M.; Smaïli, K. Alignment of comparable documents: Comparison of similarity measures on French–English–Arabic data. Nat. Lang. Eng. 2018, 24, 677–694.
30. Wei, Y. Construction, evaluation and application prospects of Russian-Chinese news comparable corpus. J. PLA Univ. Foreign Lang. 2017, 40, 8.
31. Lianfu, Z.; Zuowens, T. A Privacy Preservation Method for Multi-Modal Medical Data in Federated Learning. Comput. Sci. 2023, 50, 933–940.
32. Qihui, T.; Lanjiang, Z.; Chang, L. Textual feature based bilingual sentence similarity measure between Chinese and Lao. J. Chin. Inf. Process. 2022, 35, 64–72.
33. Hongjun, W.; Shuicai, S.; Shiwen, Y.; Shibin, X. Cross-language similar document retrieval. J. Chin. Inf. Process. 2007, 21, 8.
34. Xing, T.; Jin, Z.; Zuping, Z. Jaccard text similarity algorithm based on word embedding. Comput. Sci. 2018, 45, 186–189.
35. Xiaoli, D.; Shifeng, L.; Daqing, G. NLP-based text similarity detection method. J. Commun. 2021, 42, 173–181.
36. Xunyu, L.; Cunli, M.; Zhengtao, Y.; Shengxiang, G.; Zhenhan, W.; Yafei, Z. Chinese-Burmese comparable document acquisition based on topic model and bilingual word embedding. J. Chin. Inf. Process. 2021, 35, 88–95.
37. Weizhen, Z.; Shuang, R. A Study on the Technology System of Railway Data Security and Privacy Protection. Railw. Comput. Appl. 2023, 32, 45–50.
38. Lufang, L.; Bo, L.; Peng, C.; Linghan, Z.; Bing, W. Bilingual lexicon extraction based on word vector and comparable corpus. Comput. Sci. Eng. 2018, 40, 368–373.
39. Panlu, C. 6G, Semantic Communication, and Future Models of Journalism and Communication: A Digital Journalism Perspective. J. Guangzhou Univ. (Soc. Sci. Ed.) 2022, 21, 5–16.
40. Wang, X. The Impact of IoT on News Media in the Smart Age. Mob. Inf. Syst. 2022, 2022, 2238233.
41. Zhang, J.; Tao, D. Empowering Things With Intelligence: A Survey of the Progress, Challenges, and Opportunities in Artificial Intelligence of Things. IEEE Internet Things J. 2020, 8, 7789–7817.
42. Nwanakwaugwu, A.; Matthew, U.; Okey, O.; Kazaure, J.; Ubochi, C. News Reporting in Drone Internet of Things Digital Journalism: Drones Technology for Intelligence Gathering in Journalism. Int. J. Interact. Commun. Syst. Technol. 2023, 12, 22–42.
43. Ning, S.; Yan, X.; Nuo, Y.; Zhou, F.; Xie, Q.; Zhang, J. Chinese-Khmer Parallel fragments Extraction from Comparable Corpus Based on Dirichlet Process. Procedia Comput. Sci. 2020, 166, 213–221.
44. Dalian Minzu University. A Method of Constructing a Parallel Corpus of Chinese-English-Mongolian-Tibetan Victorian; No. 18 Liaohe West Road; Dalian Economic and Technological Development Zone: Dalian, China, 2022; Available online: https://kns.cnki.net/kcms2/article/abstract?v=rCMvAF-4El0GzZ5X9eGvD8ATcYVVIhH19Df_FMaey6NT0D6YpiI9mvcbcPRDuaZLEq2D8RuHPzmRu4ofEIF5zqrrtiEJPcM92H-_03dOHzoS-F5_zPhG38gBLu3TwUMlg5y3ac7bkEU=&uniplatform=NZKPT&language=CHS (accessed on 1 October 2023).
45. Lei, C.; Weibin, Y.; Qinyao, S.; Zhi, W.; Chongzhong, Y.; Daowei, L. Research and construction of endangered language spoken corpus-case study on Lizu. Comput. Eng. Appl. 2018, 54, 234–238.
46. Cohen, L.; Christopher, M.; Quoc, N. NBER Working Paper Series; National Bureau of Economic Research: Cambridge, MA, USA, 2018; Available online: http://www.nber.org/papers/w25084 (accessed on 1 October 2023).
47. Mikolov, T.; Chen, K.; Corrado, G.; Dean, J. Efficient Estimation of Word Representations in Vector Space. arXiv 2013, arXiv:1301.3781.
48. Sa, L. Building and Evaluating Special Domain Comparable Corpus. Master’s Thesis, Nanjing University of Science and Technology, Nanjing, China, 2012. Available online: https://kns.cnki.net/kcms2/article/abstract?v=rCMvAF-4El2rI7_R9d-DLHpt8ZdySbER3tlBKhiyUSqwln4Gn3z1b03sy_uDfXRvWb9w07GNk99u14O89yOLdxBTjPClkraUYNU9ae9Lp2TAnRB29l8iY3IPcacXVJZ3JpFEq10E0IgvqTfQ0d-9sQ==&uniplatform=NZKPT&language=CHS (accessed on 1 October 2023).
49. Chengcheng, H.; Lei, L.; Tingting, L.; Ming, G. Approaches of semantic textual similarity. J. East China Norm. Univ. (Nat. Sci. Ed.) 2020, 5, 95–112.
50. Fei, P.; Ibrahim, T.; Er, A.S.W.; Litipu, M. Construction of Chinese-Uighur comparable corpus for alignment of bilingual technical terms. J. Xinjiang Univ. (Nat. Sci. Ed.) 2017, 34, 316–321.

Paper | Summary |
---|---|
[28] | Constructs a comparable corpus based on word frequency. |
[29] | Constructs a trilingual comparable corpus based on cross-lingual retrieval, but ignores low-resource languages. |
[30] | Employs a web-based construction method to build a Chinese–Russian news comparable corpus. |
[32,33] | Calculate comparability using textual features and dictionary-based methods. |
[34,35] | Compute Jaccard similarity, minimum edit distance, and the Dice coefficient as comparability indices. |
[25,36] | Calculate the similarity between word vectors. |
Our study | We employ a news-website-based construction method to gather source material and calculate comparability among multilingual news texts in a unified vector space. |

Paper | Summary |
---|---|
[3,6,22] | Privacy protection practice for data mining and corpus construction for multiple data clustering; multilingual comparable corpus used in language teaching; evaluation of a privacy-oriented corpus by means of text anonymization. |
[4,8,30,36] | Privacy protection for medical, industrial, and railway data, with little coverage of multilingual comparable data in IoT. |
[12,13,22,38,39,40,42] | Multilingual comparable corpus used in machine translation and sentiment analysis; IoT-based corpus used in smart education; IoT-based data used in news media. |
[36,43,44,45] | These studies focus on bilingual comparable corpora of common languages; they do not address multilingual comparable corpora, especially for low-resource languages. |
Our study | We focus on constructing comparable corpora of three or more languages that also meet privacy protection needs and can be used in the related multilingual scenarios, including all of the above. |

Storage Format | Title | Source | Content | Time | Location |
---|---|---|---|---|---|
.txt | - | http://politics.people.com.cn/n1/2021/0119/c1001-32005216.html | - | - Year - Month - Day: - | - |

ID | Language | Title | Content | Location | Time |
---|---|---|---|---|---|
CUTCC3 | Chinese | - | | | - Year - Month - Day: - |
CUTCC3 | Uighur | - | | | - Year - Month - Day: - |
CUTCC3 | Tibetan | - | | | - Year - Month - Day: - |
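
The storage and corpus layouts above differ mainly in that the stored news file carries a Source URL while the corpus entry does not. The following Python sketch shows one way such a pre-processing step could look. It is an illustration under assumptions, not the authors' pipeline: the `CorpusEntry` class, the `KEPT_FIELDS` tuple, the `to_corpus_entry` function, and the dictionary keys are all hypothetical names, and treating Source as the only dropped field is an assumption read off the two tables.

```python
from dataclasses import dataclass
from typing import Dict

# ASSUMPTION: only these fields of a stored news record reach the corpus;
# anything else (for example the source URL) is discarded.
KEPT_FIELDS = ("title", "content", "location", "time")


@dataclass(frozen=True)
class CorpusEntry:
    """One language side of a comparable group (hypothetical layout)."""
    corpus_id: str
    language: str
    title: str
    content: str
    location: str
    time: str


def to_corpus_entry(raw: Dict[str, str], corpus_id: str, language: str) -> CorpusEntry:
    """Copy the retained fields of `raw`; missing fields become empty strings."""
    kept = {name: raw.get(name, "") for name in KEPT_FIELDS}
    return CorpusEntry(corpus_id=corpus_id, language=language, **kept)


if __name__ == "__main__":
    raw_record = {
        "title": "(title text)",
        "source": "(source URL)",
        "content": "(body text)",
        "time": "(publication time)",
        "location": "(place)",
    }
    print(to_corpus_entry(raw_record, corpus_id="CUTCC3", language="Chinese"))
```

Using an allow-list of retained fields, rather than a list of fields to delete, means that any unexpected metadata in a stored record is dropped by default.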

ID | Language | Title | Content | Location | Time |
---|---|---|---|---|---|
CUTCC3 | Chinese | A 5.1 magnitude earthquake occurred in Taitung County, Taiwan, with a focal depth of 10 km. | According to the China Earthquake Networks Center, it has been officially determined that a 5.1-magnitude earthquake occurred in Taitung County, Taiwan at 09:56 on 4 April, with a focal depth of 10 km. | Taitung County, Taiwan, China | 4 April, 09:56. |
CUTCC3 | Uighur | A 5.1 magnitude earthquake occurred in Taiwan County, with a focal depth of 10 km | According to the China Seismological Network, a 5.1-magnitude earthquake was officially determined to have occurred in Taitung County, Taiwan at 09:56 on 4 April. The earthquake had a focal depth of 10 km. | Taitung County, Taiwan Province, China | 9:56 on 4 April, |
CUTCC3 | Tibetan | The focal depth of the M 5.1 earthquake in Taitung County, Taiwan Province is 10 km | According to China Seismological Network, China Seismological Network officially determined that at 09:56 on 4 April, a 5.1-magnitude earthquake occurred in Taitung County, Taiwan Province, with a focal depth of 10 km. | Taitung County, Taiwan, China | At 9:56, 4 April. |

Metrics | P | R | F |
---|---|---|---|
Baseline | 43.18% | 38% | 40.42% |
CUTCC | 77% | 34% | 47.17% |
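
The F values in the table above are consistent with the balanced F-measure, the harmonic mean of precision P and recall R, F = 2PR / (P + R). The short Python check below assumes that this is the definition used; the function name `f_measure` is hypothetical.

```python
def f_measure(precision: float, recall: float) -> float:
    """Balanced F-measure: the harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # P and R as reported in the table, written as fractions.
    for name, p, r in [("Baseline", 0.4318, 0.38), ("CUTCC", 0.77, 0.34)]:
        print(f"{name}: F = {f_measure(p, r):.2%}")
    # Baseline: F = 40.42%
    # CUTCC: F = 47.17%
```

The check also makes the trade-off visible: CUTCC has a slightly lower recall than the baseline (34% against 38%), but its much higher precision (77% against 43.18%) raises F.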

Category | Metric | Jaccard |
---|---|---|
Dispersion | Variance | 0.0123 |
Dispersion | Standard Deviation | 0.1109 |
Dispersion | Range | 0.376 |
Central Tendency | Mean | 0.704 |
Central Tendency | Median | 0.734 |
Central Tendency | Mode | 0.684 |
Location | Max | 0.922 |
Location | Min | 0.546 |
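
For readers who want to reproduce this kind of summary, the sketch below computes the Jaccard similarity of two token collections and the descriptive statistics listed in the table. It is illustrative only. Tokenization is left to the caller, the use of population variance and standard deviation is an assumption, scores are rounded to three decimals before taking the mode, and the names `jaccard` and `describe` are hypothetical.

```python
import statistics
from typing import Dict, Iterable, Sequence


def jaccard(a: Iterable[str], b: Iterable[str]) -> float:
    """|A ∩ B| / |A ∪ B| over the distinct tokens of two texts."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    if not union:
        return 0.0
    return len(set_a & set_b) / len(union)


def describe(scores: Sequence[float]) -> Dict[str, float]:
    """Dispersion, central tendency and location of a list of scores."""
    return {
        # ASSUMPTION: population (not sample) variance and deviation.
        "variance": statistics.pvariance(scores),
        "std": statistics.pstdev(scores),
        "range": max(scores) - min(scores),
        "mean": statistics.mean(scores),
        "median": statistics.median(scores),
        # Rounded so that nearly equal scores can share a mode.
        "mode": statistics.mode([round(s, 3) for s in scores]),
        "max": max(scores),
        "min": min(scores),
    }


if __name__ == "__main__":
    print(jaccard("a b c".split(), "b c d".split()))  # 0.5
```

The reported figures are internally consistent: the range 0.376 equals 0.922 − 0.546, and the standard deviation 0.1109 is the square root of the variance 0.0123 up to rounding.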
© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Weng, Y.; Dong, S.; Chaomurilige. A Privacy-Preserving Multilingual Comparable Corpus Construction Method in Internet of Things. Mathematics 2024, 12, 598. https://doi.org/10.3390/math12040598