Investigating the Challenges and Opportunities in Persian Language Information Retrieval through Standardized Data Collections and Deep Learning
Abstract
1. Introduction and Motivation
2. Current Challenges
- Linguistic specificity. To address the language-specific challenges inherent to Persian in order to enhance the performance of information retrieval systems [27].
- Corpus investigation. To construct a robust and extensive list of Persian datasets in the form of text and images that can serve as a reliable collection for further developments [28].
- Model review. To evaluate how Persian data corpora can be assessed via learning-based approaches, investigating the origins, advantages, and disadvantages of these approaches [29].
3. Properties of the Persian Language
4. Availability of Persian Datasets
- Tehran English–Persian parallel corpus (TEP) [38]. Launched by a natural language processing lab in Tehran in 2011, the Tehran English–Persian parallel corpus is publicly accessible and comprises approximately 550,000 sentence pairs with 8 million terms derived from movie subtitles.
- MIZAN [39]. The manually aligned Persian–English parallel corpus MIZAN features one million sentence pairs along with 25 million terms, providing a rich resource for translation studies and linguistic algorithms.
- hmBlogs [40]. This corpus addresses the unique challenges of Persian orthography and font recognition, highlighting the complexity and diversity of research within Persian linguistic studies.
- MirasText [41]. Generated automatically, MirasText aggregates content from over 250 Persian websites, encompassing a wide array of metadata including keywords, descriptions, and titles, to facilitate extensive text analysis.
- Shiraz Corpus [43]. Designed for machine translation, the Shiraz Corpus is a bilingual corpus developed from Persian texts sourced online, incorporating a bilingual dictionary with 50,000 terms, alongside sophisticated morphological and syntactic analysis tools.
- Peykare Corpus [44]. Built from daily news and common texts, this corpus comprises 2.6 million manually tagged words and serves as a fundamental resource for linguistic research.
- FLDB [45]. The Persian Linguistic Database, also referred to as the Farsi Linguistic Database (FLDB), comprises over 3 million words, featuring a blend of contemporary Persian literature, spoken varieties, dictionary entries, and word lists.
- Mahak Samim [46]. Focused on academic integrity, Mahak Samim consists of Persian academic texts from peer-reviewed journals designed to test and refine plagiarism detection systems.
- PerKey [47]. This extensive corpus of 553,000 news articles, sourced from six Persian news sites, includes high-quality key phrases that are filtered and cleaned to ensure the integrity of the data.
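The sizes quoted for the two parallel corpora in the list above allow a quick sanity check of how long a typical aligned pair is. The following is a minimal sketch that only divides the stated term counts by the stated pair counts. It assumes the term counts cover both sides of each pair, which the descriptions do not specify, and the names `PARALLEL_CORPORA` and `terms_per_pair` are purely illustrative.

```python
# Figures are copied from the corpus descriptions above; all names are illustrative.
PARALLEL_CORPORA = {
    "TEP": {"sentence_pairs": 550_000, "terms": 8_000_000},
    "MIZAN": {"sentence_pairs": 1_000_000, "terms": 25_000_000},
}


def terms_per_pair(stats: dict) -> float:
    """Average number of terms per aligned sentence pair."""
    return stats["terms"] / stats["sentence_pairs"]


# Compute the average for every corpus in the dictionary.
averages = {name: terms_per_pair(stats) for name, stats in PARALLEL_CORPORA.items()}

for name, avg in averages.items():
    print(f"{name}: {avg:.1f} terms per sentence pair")
```

Under these assumptions, the subtitle-based TEP pairs come out noticeably shorter on average than the literature-based MIZAN pairs, which is worth keeping in mind when mixing the two for training or evaluation.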
Corpus | Source | Size | Description | Use Cases |
---|---|---|---|---|
Text-based datasets | | | | |
Tehran English–Persian parallel corpus (TEP) [38] | University of Tehran, 2011 | 550,000 sentences | Movie subtitles | For use in NLP areas (statistical machine translation, cross-lingual information retrieval, and bilingual lexicography) |
MIZAN [39] | 250 Persian websites from 2018 | 1 million sentence pairs collected from masterpieces of literature | Persian corpus with focus on Islamic studies | Islamic studies, cultural analysis, and machine translation |
Persica Corpus [49] | Not specified | More than 1.5 million categorized news | Standard text document formats | Persian news articles |
hmBlogs [40] | Persian blogs from 2021 | 20 million blog posts | Semantic analogy dataset | Used by the Blogfa hosting service |
Digikala Magazine (DigiMag) [37] | Digikala Online Magazine | 8515 articles | Seven different classes | Persian text classification |
Persian News Dataset [50] | Various news articles | 16,438 articles | Eight different classes: economic, international, political, science, technology, cultural art, sport, and medical | Persian text classification |
PEYMA [51] | 10 news websites | 7145 sentences with 302,530 tokens | Seven different classes: organization, money, location, date, time, person, and percent | Persian named-entity recognition |
EmoPars Dataset [52] | Persian social media texts | 30,000 social media texts | Emotion-annotated texts | Used for sentiment analysis |
MirasText [41] | An automatically generated text corpus for Persian | 1000 documents | Automatically generated corpus by crawling more than 250 Persian news websites | Language modeling, title extraction, and named-entity recognition for unsupervised learning approaches |
Image-based datasets | | | | |
Arshasb [21] | Persian text | 33,000 pages | Not specified | Public texts and news texts |
Persian-OCR [53,54] | Persian alphabets | 570,000 images | Labeled images of Persian alphabets, created by the zarnevis Python library | Image recognition |
Unavailable datasets | | | | |
Hamshahri Corpus [1,42] | University of Tehran | 700 MB | 166,774 newspaper articles with approximately 380 terms per document | General NLP applications and information retrieval |
Shiraz Corpus [43] | New Mexico State University, 2000 | Approximately 50,000 terms | Bilingual Persian-to-English dictionary | Machine translation for Persian to English, NLP applications, morphological analysis, and syntax parsing |
Peykare Corpus [44] | Institute for Humanities and Cultural Studies, Tehran | 2.6 million manually tagged words, containing 550 Persian part-of-speech tags | Persian corpus with daily news and common texts | Linguistic studies |
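When selecting a dataset from the table above, it can help to hold the rows as typed records and filter them by task and availability. The sketch below is one possible way to do this and is not part of the original study: the `CorpusEntry` class, the `available_for` helper, the shortened task labels, and the choice of rows are all assumptions made for illustration. The two derived figures at the end use only the sizes stated in the table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CorpusEntry:
    name: str
    group: str  # table group: "text", "image", or "unavailable"
    task: str   # shortened use case taken from the table


# A few rows copied from the table; the selection is arbitrary.
CATALOG = [
    CorpusEntry("DigiMag", "text", "text classification"),
    CorpusEntry("Persian News Dataset", "text", "text classification"),
    CorpusEntry("PEYMA", "text", "named-entity recognition"),
    CorpusEntry("Persian-OCR", "image", "image recognition"),
    CorpusEntry("Hamshahri Corpus", "unavailable", "information retrieval"),
]


def available_for(task: str) -> list[str]:
    """Names of catalog entries for a task, skipping unavailable corpora."""
    return [e.name for e in CATALOG if e.task == task and e.group != "unavailable"]


# Derived statistics from the sizes stated in the table.
peyma_tokens_per_sentence = 302_530 / 7_145
hamshahri_total_terms = 166_774 * 380  # "approximately 380 terms per document"

print(available_for("text classification"))
print(f"PEYMA: {peyma_tokens_per_sentence:.1f} tokens per sentence")
print(f"Hamshahri: roughly {hamshahri_total_terms / 1e6:.0f} million terms")
```

Keeping the availability group as an explicit field means that corpora listed as unavailable stay visible in the catalog but are never returned as candidates for an experiment.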
5. Persian Language Information Retrieval
5.1. Optical Character Recognition
5.2. Algebraic and Statistical Approaches
5.3. Learning-Based Approaches
5.3.1. Classical Machine Learning Approaches
5.3.2. Deep Neural Networks
5.3.3. Recurrent Neural Networks
5.3.4. Transformers
5.4. Discussion
6. Conclusions and Outlook
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
Abbreviations
ANN | Artificial neural network |
AI | Artificial intelligence |
BERT | Bidirectional Encoder Representations from Transformers |
CNN | Convolutional neural network |
CV | Computer vision |
DL | Deep learning |
DLA | Document layout analysis |
DNN | Deep neural network |
DTrOCR | Decoder-only Transformer for Optical Character Recognition |
GRU | Gated recurrent unit |
IR | Information retrieval |
k-NN | k-nearest neighbors |
LM | Language model |
LSTM | Long short-term memory |
ML | Machine learning |
MLP | Multilayer perceptron |
NLP | Natural language processing |
OCR | Optical character recognition |
RNN | Recurrent neural network |
SMT | Statistical machine translation |
SVC | Support vector classifier |
TLD | Text line detection |
VS | Vector space (model) |
References
- Sadeghi, M.; Vegas, J. How well does Google work with Persian documents? J. Inf. Sci. 2017, 43, 316–327. [Google Scholar] [CrossRef]
- Kobayashi, M.; Takeda, K. Information retrieval on the web. ACM Comput. Surv. (CSUR) 2000, 32, 144–173. [Google Scholar] [CrossRef]
- Garg, D.; Sharma, D. Information Retrieval on the Web and its Evaluation. Int. J. Comput. Appl. 2012, 975, 8887. [Google Scholar] [CrossRef]
- Mooers, C. Information retrieval viewed as temporal signaling. In Proceedings of the International Congress of Mathematicians, Cambridge, MA, USA, 30 August–6 September 1950; Volume 1, pp. 572–573. [Google Scholar]
- Bush, V. As we may think. Atl. Mon. 1945, 176, 101–108. [Google Scholar]
- Ghayoomi, M.; Momtazi, S.; Bijankhan, M. A Study of Corpus Development for Persian. Int. J. Asian Lang. Process. 2010, 20, 17–33. [Google Scholar]
- Hirschberg, J.; Manning, C.D. Advances in natural language processing. Science 2015, 349, 261–266. [Google Scholar] [CrossRef]
- Savoy, J. Comparative study of monolingual and multilingual search models for use with Asian languages. ACM Trans. Asian Lang. Inf. Process. (TALIP) 2005, 4, 163–189. [Google Scholar] [CrossRef]
- Braschler, M.; Ripplinger, B. How effective is stemming and decompounding for German text retrieval? Inf. Retr. 2004, 7, 291–316. [Google Scholar] [CrossRef]
- Ranaldi, L.; Pucci, G. Knowing knowledge: Epistemological study of knowledge in transformers. Appl. Sci. 2023, 13, 677. [Google Scholar] [CrossRef]
- Valian, V. Arguing about innateness. J. Child Lang. 2014, 41, 78–92. [Google Scholar] [CrossRef]
- Allen, J.W.; Bickhard, M.H. Emergent constructivism: Theoretical and methodological considerations. Hum. Dev. 2022, 66, 276–294. [Google Scholar] [CrossRef]
- Chomsky, N. Syntactic Structures; Mouton de Gruyter: Berlin, Germany, 2002. [Google Scholar]
- Chomsky, N. On certain formal properties of grammars. Inf. Control 1959, 2, 137–167. [Google Scholar] [CrossRef]
- Soles, D.E. Locke’s Empiricism and the Postulation of Unobservables. J. Hist. Philos. 1985, 23, 339–369. [Google Scholar] [CrossRef]
- Spelke, E.S.; Kinzler, K.D. Innateness, learning, and rationality. Child Dev. Perspect. 2009, 3, 96–98. [Google Scholar] [CrossRef]
- Vijayarani, S.; Janani, R. Text mining: Open source tokenization tools-an analysis. Adv. Comput. Intell. Int. J. (ACII) 2016, 3, 37–47. [Google Scholar] [CrossRef]
- Grefenstette, G. Tokenization. In Syntactic Wordclass Tagging; Springer: Berlin/Heidelberg, Germany, 1999; pp. 117–133. [Google Scholar]
- Harman, D.K. The First Text Retrieval Conference (TREC-1); US Department of Commerce, National Institute of Standards and Technology: Gaithersburg, MD, USA, 1993; Volume 500.
- Braschler, M. CLEF 2000—Overview of results. In Proceedings of the Workshop of the Cross-Language Evaluation Forum for European Languages; Springer: Berlin/Heidelberg, Germany, 2000; pp. 89–101. [Google Scholar]
- GitHub User “Persiandataset”. GitHub Repository “Arshasb”. Available online: https://github.com/persiandataset/Arshasb (accessed on 31 May 2024).
- Hosseini, F.; Kashef, S.; Shabaninia, E.; Nezamabadi-pour, H. IDPL-PFOD: An image dataset of printed Farsi text for OCR research. In Proceedings of the Second International Workshop on NLP Solutions for Under Resourced Languages (NSURL 2021) Co-Located with ICNLSP 2021, Trento, Italy, 12–13 November 2021; pp. 22–31. [Google Scholar]
- Mohammadian, M.; Maleki, N.; Olsson, T.; Ahlgren, F. Persis: A Persian Font Recognition Pipeline Using Convolutional Neural Networks. In Proceedings of the 2022 12th International Conference on Computer and Knowledge Engineering (ICCKE), Mashhad, Iran, 17–18 November 2022; pp. 196–204. [Google Scholar] [CrossRef]
- Tourani, A.; Soroori, S.; Shahbahrami, A.; Akoushideh, A. Iranis: A large-scale dataset of Iranian vehicles license plate characters. In Proceedings of the 2021 5th International Conference on Pattern Recognition and Image Analysis (IPRIA), Kashan, Iran, 28–29 April 2021; pp. 1–5. [Google Scholar]
- Pallotti, G. A simple view of linguistic complexity. Second. Lang. Res. 2015, 31, 117–134. [Google Scholar] [CrossRef]
- Sedighi, A.; Shabani-Jadidi, P. The Oxford handbook of Persian linguistics; Oxford University Press: Oxford, UK, 2018. [Google Scholar]
- Khashabi, D.; Cohan, A.; Shakeri, S.; Hosseini, P.; Pezeshkpour, P.; Alikhani, M.; Aminnaseri, M.; Bitaab, M.; Brahman, F.; Ghazarian, S.; et al. ParsiNLU: A suite of language understanding challenges for Persian. Trans. Assoc. Comput. Linguist. 2021, 9, 1147–1162. [Google Scholar] [CrossRef]
- Barbaresi, A. Challenges in web corpus construction for low-resource languages in a post-BootCaT world. In Proceedings of the 6th Language & Technology Conference, Less Resourced Languages Special Track, Poznań, Poland, 7–9 December 2013; pp. 69–73. [Google Scholar]
- Mohtaj, S.; Roshanfekr, B.; Zafarian, A.; Asghari, H. Parsivar: A Language Processing Toolkit for Persian. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), Miyazaki, Japan, 7–12 May 2018. [Google Scholar]
- LeCun, Y.; Bengio, Y.; Hinton, G. Deep learning. Nature 2015, 521, 436–444. [Google Scholar] [CrossRef] [PubMed]
- Young, T.; Hazarika, D.; Poria, S.; Cambria, E. Recent trends in deep learning based natural language processing. IEEE Comput. Intell. Mag. 2018, 13, 55–75. [Google Scholar] [CrossRef]
- Otter, D.W.; Medina, J.R.; Kalita, J.K. A survey of the usages of deep learning for natural language processing. IEEE Trans. Neural Netw. Learn. Syst. 2020, 32, 604–624. [Google Scholar] [CrossRef]
- Anand Kumar, M.; Chakravarthi, B.R.; Bharathi, B.; O’Riordan, C.; Murthy, H.; Durairaj, T.; Mandl, T. Speech and Language Technologies for Low-Resource Languages. In Proceedings of the First International Conference, SPELLL 2022, Kalavakkam, India, 23–25 November 2022; Springer: Berlin/Heidelberg, Germany, 2023. [Google Scholar]
- Strassel, S.; Tracey, J. LORELEI Language Packs: Data, Tools, and Resources for Technology Development in Low Resource Languages. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC’16), Portorož, Slovenia, 23–28 May 2016. [Google Scholar]
- Khosrobeigi, Z.; Veisi, H.; Hoseinzadeh, E. Persian Optical Character Recognition Using Deep Bidirectional Long Short-Term Memory. Appl. Sci. 2022, 22, 11760. [Google Scholar] [CrossRef]
- Ebrahimi, A. Large Dataset of Persian License Plate Characters. Available online: https://www.kaggle.com/datasets/amirebrahimi66/large-dataset-of-persian-license-plate-characters (accessed on 31 May 2024).
- Farahani, M.; Gharachorloo, M.; Farahani, M.; Manthouri, M. ParsBERT: Transformer-based Model for Persian Language Understanding. Neural Process. Lett. 2021, 53, 3831–3847. [Google Scholar] [CrossRef]
- Pilevar, M.T.; Faili, H.; Pilevar, A.H. TEP: Tehran English–Persian parallel corpus. In International Conference on Intelligent Text Processing and Computational Linguistics; Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), 6609 LNCS; Springer: Berlin/Heidelberg, Germany, 2011; pp. 68–79. [Google Scholar] [CrossRef]
- Kashefi, O. MIZAN: A large Persian–English parallel corpus. arXiv 2018, arXiv:1801.02107. [Google Scholar]
- Khansari, H.M.; Shamsfard, M. HmBlogs: A big general Persian corpus. arXiv 2021, arXiv:2111.02362. [Google Scholar] [CrossRef]
- Sabeti, B.; Firouzjaee, H.A.; Choobbasti, A.J.; Najafabadi, S.M.; Vaheb, A. MirasText: An automatically generated text corpus for Persian. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), Miyazaki, Japan, 7–12 May 2018. [Google Scholar]
- AleAhmad, A.; Amiri, H.; Darrudi, E.; Rahgozar, M.; Oroumchian, F. Hamshahri: A standard Persian text collection. Knowl.-Based Syst. 2009, 22, 382–387. [Google Scholar] [CrossRef]
- Amtrup, J.W.; Rad, H.M.; Megerdoomian, K.; Zajac, R. Persian–English machine translation: An overview of the Shiraz project. In Memoranda in Computer and Cognitive Science MCCS-00-319; New Mexico State University: Las Cruces, New Mexico, 2000; pp. 1–42. [Google Scholar]
- Bijankhan, M.; Sheykhzadegan, J.; Bahrani, M.; Ghayoomi, M. Lessons from building a Persian written corpus: Peykare. Lang. Resour. Eval. 2011, 45, 143–164. [Google Scholar] [CrossRef]
- Assi, S. Farsi linguistic database (FLDB). Int. J. Lexicogr. 1997, 10, 5. [Google Scholar]
- Sharifabadi, M.R.; Eftekhari, S.A. Mahak Samim: A Corpus of Persian Academic Texts for Evaluating Plagiarism Detection Systems. In Proceedings of the Working Notes of FIRE 2016—Forum for Information Retrieval Evaluation, Tehran, Iran, 17–19 December 2016. [Google Scholar]
- Doostmohammadi, E.; Bokaei, M.H.; Sameti, H. PerKey: A Persian News Corpus for Keyphrase Extraction and Generation. In Proceedings of the 2018 9th International Symposium on Telecommunications (IST), Tehran, Iran, 17–19 December 2018; pp. 460–465. [Google Scholar] [CrossRef]
- Alibrahim, H.; Ludwig, S.A. Hyperparameter optimization: Comparing genetic algorithm against grid search and Bayesian optimization. In Proceedings of the 2021 IEEE Congress on Evolutionary Computation (CEC), Kraków, Poland, 28 June–1 July 2021; pp. 1551–1559. [Google Scholar]
- Eghbalzadeh, H.; Hosseini, B.; Khadivi, S.; Khodabakhsh, A. Persica: A Persian corpus for multi-purpose text mining and natural language processing. In Proceedings of the 2012 6th International Symposium on Telecommunications, IST 2012, Tehran, Iran, 6–8 November 2012; pp. 1207–1214. [Google Scholar] [CrossRef]
- GitHub User “Milad-4274”. GitHub Repository “Persian_News”: Persian News Dataset. Available online: https://github.com/milad-4274/persian_news (accessed on 31 May 2024).
- Shahshahani, M.S.; Mohseni, M.; Shakery, A.; Faili, H. PEYMA: A Tagged Corpus for Persian Named Entities. arXiv 2018, arXiv:1801.09936. [Google Scholar] [CrossRef]
- Sabri, N.; Akhavan, R.; Bahrak, B. EmoPars: A collection of 30k emotion-annotated Persian social media texts. In Proceedings of the Student Research Workshop Associated with RANLP, Online, 1–3 September 2021; pp. 167–173. [Google Scholar]
- GitHub Repository “Persian OCR Using LeNet5”. Available online: https://github.com/mostafamhmdi/Persian-OCR (accessed on 31 May 2024).
- Team, Z.D. Zarnevis: A Python Package for Persian Text Processing. 2024. Available online: https://pypi.org/project/zarnevis/ (accessed on 21 June 2024).
- Vijayarani, S.; Ilamathi, M.J.; Nithya, M. Preprocessing techniques for text mining-an overview. Int. J. Comput. Sci. Commun. Netw. 2015, 5, 7–16. [Google Scholar]
- Mikolov, T.; Chen, K.; Corrado, G.; Dean, J. Efficient estimation of word representations in vector space. arXiv 2013, arXiv:1301.3781. [Google Scholar]
- Chaudhuri, A.; Mandaviya, K.; Badelia, P.; Ghosh, S.K. Optical Character Recognition Systems; Springer: Berlin/Heidelberg, Germany, 2017. [Google Scholar]
- Kasem, M.S.; Mahmoud, M.; Kang, H.S. Advancements and Challenges in Arabic Optical Character Recognition: A Comprehensive Survey. arXiv 2023, arXiv:2312.11812. [Google Scholar]
- Chaudhuri, A.; Mandaviya, K.; Badelia, P.; Ghosh, S.K. Optical Character Recognition Systems. In Optical Character Recognition Systems for Different Languages with Soft Computing; Chaudhuri, A., Mandaviya, K., Badelia, P., Ghosh, S.K., Eds.; Studies in Fuzziness and Soft Computing; Springer International Publishing: Cham, Switzerland, 2017; pp. 9–41. [Google Scholar] [CrossRef]
- Kashef, S.; Nezamabadi-pour, H.; Shabaninia, E. A review on deep learning approaches for optical character recognition with emphasis on Persian, Arabic and Urdu scripts. J. Mach. Vis. Image Process. 2021, 8, 51–85. [Google Scholar]
- Ehikioya, S.A.; Zeng, J. Mining web content usage patterns of electronic commerce transactions for enhanced customer services. Eng. Rep. 2021, 3, e12411. [Google Scholar] [CrossRef]
- Fateh, A.; Rezvani, M.; Tajary, A.; Fateh, M. Providing a voting-based method for combining deep neural network outputs to layout analysis of printed documents. J. Mach. Vis. Image Process. 2022, 9, 47–64. [Google Scholar]
- Guo, Y.; Sun, Y.; Bauer, P.; Allebach, J.P.; Bouman, C.A. Text line detection based on cost optimized local text line direction estimation. In Proceedings of the Color Imaging XX: Displaying, Processing, Hardcopy, and Applications, San Francisco, CA, USA, 9–12 February 2015; Volume 9395, pp. 59–65. [Google Scholar]
- Fateh, A.; Fateh, M.; Abolghasemi, V. Enhancing optical character recognition: Efficient techniques for document layout analysis and text line detection. Eng. Rep. 2023, e12832. [Google Scholar] [CrossRef]
- Bukhari, S.S.; Shafait, F.; Breuel, T.M. Coupled snakelets for curled text-line segmentation from warped document images. Int. J. Doc. Anal. Recognit. (IJDAR) 2013, 16, 33–53. [Google Scholar] [CrossRef]
- Amer, I.M.; Hamdy, S.; Mostafa, M.G.M. Deep Arabic document layout analysis. In Proceedings of the 2017 Eighth International Conference on Intelligent Computing and Information Systems (ICICIS), Cairo, Egypt, 5–7 December 2017; pp. 224–231. [Google Scholar] [CrossRef]
- Rahmati, M.; Fateh, M.; Rezvani, M.; Tajary, A.; Abolghasemi, V. Printed Persian OCR system using deep learning. IET Image Process. 2020, 14, 3920–3931. [Google Scholar] [CrossRef]
- Alkhateeb, F.; Abu Doush, I.; Albsoul, A. Arabic optical character recognition software: A review. Pattern Recognit. Image Anal. 2017, 27, 763–776. [Google Scholar] [CrossRef]
- Plötz, T.; Fink, G.A. Markov models for offline handwriting recognition: A survey. Int. J. Doc. Anal. Recognit. (IJDAR) 2009, 12, 269–298. [Google Scholar] [CrossRef]
- Smith, R. An overview of the Tesseract OCR engine. In Proceedings of the Ninth International Conference on Document Analysis and Recognition (ICDAR 2007), Curitiba, Brazil, 23–26 September 2007; Volume 2, pp. 629–633. [Google Scholar]
- Patel, C.; Patel, A.; Patel, D. Optical character recognition by open source OCR tool Tesseract: A case study. Int. J. Comput. Appl. 2012, 55, 50–56. [Google Scholar] [CrossRef]
- Zacharias, E.; Teuchler, M.; Bernier, B. Image Processing Based Scene-Text Detection and Recognition with Tesseract. arXiv 2020, arXiv:2004.08079. [Google Scholar] [CrossRef]
- Hiemstra, D. Using Language Models for Information Retrieval. 2001. Available online: https://ris.utwente.nl/ws/portalfiles/portal/6042641/t000001d.pdf (accessed on 31 May 2024).
- Duh, K.; McNamee, P.; Post, M.; Thompson, B. Benchmarking Neural and Statistical Machine Translation on Low-Resource African Languages. In Proceedings of the Twelfth Language Resources and Evaluation Conference, Marseille, France, 11–16 May 2020; pp. 2667–2675. [Google Scholar]
- Papineni, K.; Roukos, S.; Ward, T.; Zhu, W.J. BLEU: A method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, Philadelphia, PA, USA, 6–12 July 2002; pp. 311–318. [Google Scholar]
- Ekman, P. Basic emotions. In Handbook of Cognition and Emotion; John Wiley & Sons Ltd.: Hoboken, NJ, USA, 1999; Volume 98, p. 16. [Google Scholar]
- Ghayoomi, M.; Momtazi, S. Challenges in developing Persian corpora from online resources. In Proceedings of the 2009 International Conference on Asian Language Processing, Singapore, 7–9 December 2009; pp. 108–113. [Google Scholar]
- Gibbon, D.; Moore, R.; Winski, R. Handbook of Standards and Resources for Spoken Language Systems; Walter de Gruyter: Berlin, Germany, 1997. [Google Scholar]
- Yousef, S. Persian: A Comprehensive Grammar; Routledge: London, UK, 2018. [Google Scholar]
- Ståhle, L.; Wold, S. Analysis of variance (ANOVA). Chemom. Intell. Lab. Syst. 1989, 6, 259–272. [Google Scholar]
- Cover, T.; Hart, P. Nearest neighbor pattern classification. IEEE Trans. Inf. Theory 1967, 13, 21–27. [Google Scholar] [CrossRef]
- Hand, D.J.; Yu, K. Idiot’s Bayes—Not so stupid after all? Int. Stat. Rev. 2001, 69, 385–398. [Google Scholar]
- Cortes, C.; Vapnik, V. Support-vector networks. Mach. Learn. 1995, 20, 273–297. [Google Scholar]
- Lewis, D.D.; Yang, Y.; Russell-Rose, T.; Li, F. RCV1: A new benchmark collection for text categorization research. J. Mach. Learn. Res. 2004, 5, 361–397. [Google Scholar]
- Li, Y.; Yang, T. Word embedding for understanding natural language: A survey. In Guide to Big Data Applications; Springer: Cham, Switzerland, 2018; pp. 83–104. [Google Scholar]
- Pennington, J.; Socher, R.; Manning, C.D. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), Doha, Qatar, 25–29 October 2014; pp. 1532–1543. [Google Scholar]
- Bojanowski, P.; Grave, E.; Joulin, A.; Mikolov, T. Enriching word vectors with subword information. Trans. Assoc. Comput. Linguist. 2017, 5, 135–146. [Google Scholar] [CrossRef]
- Cohen, I.; Huang, Y.; Chen, J.; Benesty, J. Pearson correlation coefficient. In Noise Reduction in Speech Processing; Springer: Berlin/Heidelberg, Germany, 2009; pp. 1–4. [Google Scholar]
- Camacho-Collados, J.; Pilehvar, M.T.; Collier, N.; Navigli, R. SemEval-2017 task 2: Multilingual and cross-lingual semantic word similarity. In Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017), Vancouver, BC, Canada, 3–4 August 2017; pp. 15–26. [Google Scholar]
- AleAhmad, A.; Zahedi, M.; Rahgozar, M.; Moshiri, B. irBlogs: A standard collection for studying Persian bloggers. Comput. Hum. Behav. 2016, 57, 195–207. [Google Scholar] [CrossRef]
- Schober, P.; Boer, C.; Schwarte, L.A. Correlation coefficients: Appropriate use and interpretation. Anesth. Analg. 2018, 126, 1763–1768. [Google Scholar] [CrossRef]
- Lin, Y.; Michel, J.B.; Lieberman, E.A.; Orwant, J.; Brockman, W.; Petrov, S. Syntactic annotations for the Google Books Ngram corpus. In Proceedings of the ACL 2012 System Demonstrations, Jeju Island, Republic of Korea, 8–14 July 2012; pp. 169–174. [Google Scholar]
- Hastie, T.; Tibshirani, R.; Friedman, J.H. The Elements of Statistical Learning: Data Mining, Inference, and Prediction; Springer: Berlin/Heidelberg, Germany, 2009; Volume 2. [Google Scholar]
- Prayogo, R.D.; Karimah, S.A. Comparison Study of Machine Learning Techniques for Letter Recognition. In Proceedings of the 2022 1st International Conference on Technology Innovation and Its Applications (ICTIIA), Tangerang, Indonesia, 23 September 2022; pp. 1–6. [Google Scholar] [CrossRef]
- Robbins, H.; Monro, S. A Stochastic Approximation Method. Ann. Math. Stat. 1951, 22, 400–407. [Google Scholar]
- Hinton, G.E. Connectionist learning procedures. In Machine Learning; Elsevier: Amsterdam, The Netherlands, 1990; pp. 555–610. [Google Scholar]
- LeCun, Y.; Boser, B.; Denker, J.S.; Henderson, D.; Howard, R.E.; Hubbard, W.; Jackel, L.D. Backpropagation Applied to Handwritten Zip Code Recognition. Neural Comput. 1989, 1, 541–551. [Google Scholar] [CrossRef]
- LeCun, Y.; Boser, B.; Denker, J.; Henderson, D.; Howard, R.; Hubbard, W.; Jackel, L. Handwritten digit recognition with a back-propagation network. In Advances in Neural Information Processing Systems; NeurIPS Proceedings: New Orleans, LA, USA, 1989; Volume 2. [Google Scholar]
- LeCun, Y.; Bottou, L.; Bengio, Y.; Haffner, P. Gradient-Based Learning Applied to Document Recognition. Proc. IEEE 1998, 86, 2278–2324. [Google Scholar] [CrossRef]
- Krizhevsky, A.; Sutskever, I.; Hinton, G.E. ImageNet Classification with Deep Convolutional Neural Networks. In Proceedings of the Advances in Neural Information Processing Systems 25, Lake Tahoe, NV, USA, 3–6 December 2012; pp. 1097–1105. [Google Scholar]
- Luqman, H.; Mahmoud, S.A.; Awaida, S. KAFD Arabic font database. Pattern Recognit. 2014, 47, 2231–2240. [Google Scholar] [CrossRef]
- Ullah, Z.; Jamjoom, M. An intelligent approach for Arabic handwritten letter recognition using convolutional neural network. PeerJ Comput. Sci. 2022, 8, e995. [Google Scholar] [CrossRef] [PubMed]
- Hochreiter, S.; Schmidhuber, J. Long short-term memory. Neural Comput. 1997, 9, 1735–1780. [Google Scholar] [PubMed]
- Chung, J.; Gulcehre, C.; Cho, K.; Bengio, Y. Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv 2014, arXiv:1412.3555. [Google Scholar]
- Yan, Q.; Weeks, D.E.; Xin, H.; Swaroop, A.; Chew, E.Y.; Huang, H.; Ding, Y.; Chen, W. Deep-learning-based prediction of late age-related macular degeneration progression. Nat. Mach. Intell. 2020, 2, 141–150. [Google Scholar] [CrossRef]
- Cohen, J. A coefficient of agreement for nominal scales. Educ. Psychol. Meas. 1960, 20, 37–46. [Google Scholar] [CrossRef]
- Sutton, C.; McCallum, A. An introduction to conditional random fields. Found. Trends® Mach. Learn. 2012, 4, 267–373. [Google Scholar] [CrossRef]
- Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I. Attention is All you Need. In Proceedings of the Advances in Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017; Volume 30. [Google Scholar]
- Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv 2020, arXiv:2010.11929. [Google Scholar]
- Lin, T.; Wang, Y.; Liu, X.; Qiu, X. A survey of transformers. AI Open 2022, 3, 111–132. [Google Scholar] [CrossRef]
- Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers); Minneapolis, Minnesota, 2–7 June 2019; pp. 4171–4186. [Google Scholar] [CrossRef]
- Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv 2018, arXiv:1810.04805. [Google Scholar]
- Nadeau, D.; Sekine, S. A survey of named entity recognition and classification. Lingvisticae Investig. 2007, 30, 3–26. [Google Scholar] [CrossRef]
- Fujitake, M. DTrOCR: Decoder-only Transformer for Optical Character Recognition. arXiv 2023, arXiv:2308.15996. [Google Scholar] [CrossRef]
- Guo, Z.; Jin, R.; Liu, C.; Huang, Y.; Shi, D.; Yu, L.; Liu, Y.; Li, J.; Xiong, B.; Xiong, D.; et al. Evaluating large language models: A comprehensive survey. arXiv 2023, arXiv:2310.19736. [Google Scholar]
- Ghahroodi, O.; Nouri, M.; Sanian, M.V.; Sahebi, A.; Dastgheib, D.; Asgari, E.; Baghshah, M.S.; Rohban, M.H. Khayyam Challenge (PersianMMLU): Is Your LLM Truly Wise to The Persian Language? arXiv 2024, arXiv:2404.06644. [Google Scholar]
- Rostami, P.; Salemi, A.; Dousti, M.J. PersianMind: A Cross-Lingual Persian–English Large Language Model. arXiv 2024, arXiv:2401.06466. [Google Scholar]
- Liang, D.; Gonen, H.; Mao, Y.; Hou, R.; Goyal, N.; Ghazvininejad, M.; Zettlemoyer, L.; Khabsa, M. XLM-V: Overcoming the vocabulary bottleneck in multilingual masked language models. arXiv 2023, arXiv:2301.10472. [Google Scholar]
- Mollanorozy, S.; Tanti, M.; Nissim, M. Cross-lingual transfer learning with Persian. In Proceedings of the 5th Workshop on Research in Computational Linguistic Typology and Multilingual NLP, Dubrovnik, Croatia, 6 May 2023; pp. 89–95. [Google Scholar]
- Aliramezani, M.; Doostmohammadi, E.; Bokaei, M.H.; Sameti, H. Persian sentiment analysis without training data using cross-lingual word embeddings. In Proceedings of the 2020 10th International Symposium on Telecommunications (IST), Tehran, Iran, 15–17 December 2020; pp. 78–82. [Google Scholar]
- Torrance, E.P. Torrance Tests of Creative Thinking. Educational and Psychological Measurement. 1966. Available online: https://psycnet.apa.org/doiLanding?doi=10.1037%2Ft05532-000 (accessed on 31 May 2024).
- Zhao, Y.; Zhang, R.; Li, W.; Huang, D.; Guo, J.; Peng, S.; Hao, Y.; Wen, Y.; Hu, X.; Du, Z.; et al. Assessing and understanding creativity in large language models. arXiv 2024, arXiv:2401.12491. [Google Scholar]
- Das, B.C.; Amini, M.H.; Wu, Y. Security and privacy challenges of large language models: A survey. arXiv 2024, arXiv:2402.00888. [Google Scholar]
- Petruzzellis, F.; Testolin, A.; Sperduti, A. Assessing the Emergent Symbolic Reasoning Abilities of Llama Large Language Models. arXiv 2024, arXiv:2406.06588. [Google Scholar]
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Moniri, S.; Schlosser, T.; Kowerko, D. Investigating the Challenges and Opportunities in Persian Language Information Retrieval through Standardized Data Collections and Deep Learning. Computers 2024, 13, 212. https://doi.org/10.3390/computers13080212