Probabilistic Modeling and Pattern Discovery-Based Sindhi Information Retrieval System
Abstract
1. Introduction
Contribution of the Study
2. Related Material
3. Sindhi Information Retrieval System (SIR)
3.1. Document Database
3.2. Document Collection
3.3. Identification of Potential Documents
3.4. Ranking of Documents
3.5. Statistics of the Documents Collected
3.6. Corpus Statistics (Description)
3.7. Relevance Assessment Protocol
4. Pattern Discovery and Probabilistic Modeling
Algorithm 1: Probabilistic Pattern-Based Ranking Procedure

Input: Query Q, Document Corpus D
Output: Ranked list of documents

1. Extract normalized patterns from Q → P_Q
2. For each document d in D:
   a. Extract normalized patterns P_d
   b. For each pattern t in P_Q:
      i. Compute PF(t) = pattern frequency in d
      ii. Compute DF(t) = number of documents containing t
      iii. Compute P(t|R) = PF(t)/DF(t)
   c. Compute document score S(d) = Σ P(t|R) for all t in P_Q
3. Rank all documents by descending S(d)
4. Return top-k documents
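The steps of Algorithm 1 can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: `extract_patterns` is a hypothetical stand-in that splits on whitespace, whereas the article's system uses normalized, morphology-aware Sindhi patterns. The sketch also assumes that repeated query patterns are counted once and that ties are broken by document id.

```python
from collections import Counter


def extract_patterns(text):
    """Hypothetical placeholder: whitespace tokens stand in for normalized patterns."""
    return text.split()


def rank_documents(query, corpus, k=10):
    """Score each document as S(d) = sum over t in P_Q of PF(t)/DF(t); return the top k.

    corpus maps a document id to its text.
    """
    # Step 1: patterns of the query (P_Q), de-duplicated in order (assumption).
    query_patterns = list(dict.fromkeys(extract_patterns(query)))

    # Step 2a: pattern frequencies PF for every document.
    pattern_counts = {
        doc_id: Counter(extract_patterns(text)) for doc_id, text in corpus.items()
    }

    # Step 2b-ii: DF(t) = number of documents containing pattern t.
    document_frequency = {
        t: sum(1 for counts in pattern_counts.values() if counts[t] > 0)
        for t in query_patterns
    }

    # Steps 2b-iii and 2c: S(d) = sum of PF(t)/DF(t) over the query patterns.
    scores = {}
    for doc_id, counts in pattern_counts.items():
        scores[doc_id] = sum(
            counts[t] / document_frequency[t]
            for t in query_patterns
            if document_frequency[t] > 0  # skip patterns absent from the corpus
        )

    # Steps 3 and 4: sort by descending score (ties by id, an assumption), keep top k.
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:k]


toy_corpus = {"d1": "a b a", "d2": "b c", "d3": "c c"}
top_two = rank_documents("a b", toy_corpus, k=2)
```

In the toy corpus, pattern `a` occurs twice in `d1` and in no other document, so it contributes 2/1 to `d1`; pattern `b` occurs once in each of two documents, so it contributes 1/2 to both `d1` and `d2`. As written in Algorithm 1, the ratio PF(t)/DF(t) is not bounded by 1, so the scores behave as ranking weights rather than calibrated probabilities.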
4.1. Sindhi-Specific Algorithmic Extensions for Pattern Discovery
4.2. Theoretical Basis and Distinction of the Probabilistic Model
5. Results and Discussion
5.1. Explanation of the ‘Without Diacritics’ Indexing Strategy
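As an illustration of what indexing without diacritics can involve, the sketch below removes Unicode nonspacing marks (category `Mn`), which include the Arabic-script vowel signs such as fatha, kasra, and damma, before a string is indexed. This is an assumption about the general technique, not a description of the article's normalization rules; the function name `strip_diacritics` is hypothetical. The text is deliberately not decomposed first, so precomposed letters such as alef with madda are left unchanged.

```python
import unicodedata


def strip_diacritics(text):
    """Drop nonspacing combining marks (Unicode category Mn) from text.

    Assumption: diacritics are encoded as separate combining characters,
    as is usual for Arabic-script vowel signs.
    """
    return "".join(ch for ch in text if unicodedata.category(ch) != "Mn")


# "Sindhi" written with two kasra marks (U+0650), given as escapes for clarity.
with_marks = "\u0633\u0650\u0646\u068c\u0650\u064a"
without_marks = strip_diacritics(with_marks)
```

Applying the same function to both documents and queries lets a query typed without vowel marks match text that was written with them.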
5.2. Evaluation Metrics (Precision, Recall, MRR)
5.3. Baseline Comparison with Standard IR Models
6. Conclusions and Future Work
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest

| Feature | Count |
|---|---|
| Total documents | 7601 |
| Total words/tokens | 1,690,519 |
| Average words per document | 222 |
| Unique ligature patterns | 219,037 |
| Unique normalized patterns | 87,532 |
| Single-word queries | 10 |
| Two-word queries | 10 |
| Sentence queries | 5 |

| Query Type | Number of Queries | Retrieved Docs Evaluated (Top 10 per Query) | Relevant Docs | Relevancy % |
|---|---|---|---|---|
| Single-word queries | 20 | 200 | 181 | 90.5% |
| Two-word queries | 20 | 200 | 178 | 89.0% |
| Sentence-level queries | 10 | 100 | 92 | 92.0% |
| Total | 50 | 500 | 451 | 90.2% |

| Aspect | Established Models (e.g., BM25, TF-IDF) | Our Probabilistic Model |
|---|---|---|
| Unit of scoring | Terms/unigrams | Patterns (normalized, morphology-aware units) |
| Handling of morphology | Minimal | Morphological segmentation + ligature normalization |
| Compound words | Not explicitly modeled | Bidirectional pattern capture |
| Probability basis | Term frequency or tf–idf | Pattern frequency and likelihood |
| Relevance computation | Weight-based ranking | Pattern occurrence probability conditioned on query |
| Language adaptation | English/Latin-optimized | Explicitly adapted for Sindhi script |

| Query Type | Retrieved Pages | Time Elapsed |
|---|---|---|
| Single-Word | 9623 | 0.2–4.8 |
| Two-Word | 3490 | 0.4–4.7 |
| Sentence | 1382 | 0.2–4.6 |

| Model | P@10 | R@10 | MRR |
|---|---|---|---|
| TF-IDF | 0.41 | 0.38 | 0.46 |
| BM25 | 0.53 | 0.50 | 0.57 |
| Proposed Probabilistic Pattern Model | 0.68 | 0.65 | 0.72 |
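For reference, the three metrics reported in the table above (P@10, R@10, MRR) are conventionally computed as in the sketch below. It uses the standard textbook definitions and is not the authors' evaluation script; the function names and the toy document ids are hypothetical.

```python
def precision_at_k(ranked, relevant, k=10):
    """Fraction of the top-k slots that hold a relevant document (denominator is k)."""
    return sum(1 for doc in ranked[:k] if doc in relevant) / k


def recall_at_k(ranked, relevant, k=10):
    """Fraction of all relevant documents that appear in the top k."""
    if not relevant:
        return 0.0
    return sum(1 for doc in ranked[:k] if doc in relevant) / len(relevant)


def reciprocal_rank(ranked, relevant):
    """1/rank of the first relevant document, or 0.0 if none is retrieved."""
    for position, doc in enumerate(ranked, start=1):
        if doc in relevant:
            return 1.0 / position
    return 0.0


def mean_reciprocal_rank(runs):
    """Average reciprocal rank over (ranked_list, relevant_set) pairs, one per query."""
    return sum(reciprocal_rank(ranked, relevant) for ranked, relevant in runs) / len(runs)


# Toy data: two queries with hypothetical document ids.
run_a = (["d3", "d1", "d7", "d2"], {"d1", "d2", "d9"})
run_b = (["d5"], {"d5"})

p_at_4 = precision_at_k(*run_a, k=4)
r_at_4 = recall_at_k(*run_a, k=4)
mrr = mean_reciprocal_rank([run_a, run_b])
```

In the first toy query, two of the four retrieved documents are relevant (precision 0.5), two of the three relevant documents are found (recall 2/3), and the first relevant hit is at rank 2 (reciprocal rank 0.5). Averaging that with the second query's reciprocal rank of 1 gives an MRR of 0.75.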
© 2026 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license.
Hakro, D.N.; Abbasi, A.; Bhat, A.Z.; Raza, S.; Babar, M.; Rahbi, O.A. Probabilistic Modeling and Pattern Discovery-Based Sindhi Information Retrieval System. Information 2026, 17, 82. https://doi.org/10.3390/info17010082

