Probabilistic Modeling and Pattern Discovery-Based Sindhi Information Retrieval System

Hakro, Dil Nawaz; Abbasi, Abdullah; Bhat, Anjum Zameer; Raza, Saleem; Babar, Muhammad; Rahbi, Osama Al

doi:10.3390/info17010082

by Dil Nawaz Hakro ¹,²,³,*, Abdullah Abbasi ¹, Anjum Zameer Bhat ¹, Saleem Raza ¹, Muhammad Babar ¹ and Osama Al Rahbi ¹

¹ Department of Computing and Electronics Engineering, Middle East College, Muscat 113, Oman
² Department of Software Engineering, Faculty of Engineering and Technology, University of Sindh, Jamshoro 76080, Pakistan
³ College of Business, Law and Governance, James Cook University, Cairns, QLD 4878, Australia
* Author to whom correspondence should be addressed.

Information 2026, 17(1), 82; https://doi.org/10.3390/info17010082

Submission received: 27 October 2025 / Revised: 28 November 2025 / Accepted: 22 December 2025 / Published: 13 January 2026

Abstract

Natural language processing is the technology used to interact with computers using human languages. An overlapping technology is Information Retrieval (IR), in which a user searches a collection of stored documents for those that meet an information need. Documents are retrieved according to their relevance to the user’s query, and the results are presented in descending order of relevance. Many languages have their own IR systems, whereas a dedicated IR system for Sindhi still needs attention. Various approaches to effective information retrieval have been proposed. As Sindhi is an old language with a rich history and literature, it needs IR. For the development of Sindhi IR, a document database is required so that the documents can be retrieved accordingly. Many Sindhi documents were identified and collected from various sources, such as books, journals, magazines, and newspapers. These documents were identified as having potential for use in indexing and other forms of processing. Probabilistic modeling and pattern discovery were used to find patterns and for effective retrieval and relevancy. The results for the Sindhi Information Retrieval system are promising, with more than 90% relevancy. The elapsed time ranged from 0.2 to 4.8 s for single-word queries and from 0.2 to 4.6 s for Sindhi sentence queries. The IR system for Sindhi can be fine-tuned and utilized for other languages with the same characteristics that adopt Arabic script.

Keywords: NLP; information retrieval; Sindhi; probabilistic modeling; pattern discovery

1. Introduction

Natural language processing (NLP) overlaps with other branches of computer science, such as databases and Information Retrieval (IR). In Information Retrieval, a user expresses an information need as a query, the system converts the query into a form it can process, and the required information is then searched for in already-stored documents or a database of documents. The information extracted from the query is matched against the documents by applying various approaches, and the related documents are identified. The identified documents are then ranked according to their relevance to the query. These ranked documents are typically presented in descending order so that the most relevant document is presented at the top and the least relevant at the bottom [1]. Information Retrieval is the process used to store, present, organize, and identify documents whenever the user requires them. Information Retrieval can be called the science of searching, and is performed to retrieve documents, search inside documents, or search the metadata of documents. Searching documents can also be called the branch of document retrieval. The searched documents are stored in a hierarchical style, as a standalone search, or in a hypertextually networked database, such as the World Wide Web or the internet [2].

Information Retrieval systems respond by retrieving documents when required by the user [3]. According to Lancaster [4], information retrieval is the combination of activities, such as storing and indexing documents, that are employed in sequence when the user requires information. Another definition, by Robertson [5], suggests that an Information Retrieval system is a system that satisfies a user request or fulfils the user’s need for a specific field or body of knowledge. The general purpose of an IR system is the selection and identification of documents related to a user’s problem or to its solution [6,7]. A simple and typical IR system is illustrated in Figure 1. Various information retrieval systems are available, including healthcare information systems [8], generative information retrieval systems [9], cultural algorithms for information retrieval systems [10], and BERT applications in IR [11]. The Sindhi language shares certain script-level similarities with Urdu; however, this study focuses exclusively on Sindhi. Any cross-lingual adaptation is therefore considered for future work.

Contribution of the Study

To the best of our knowledge, no Sindhi Information Retrieval system is available and this study, pertaining to Sindhi Information Retrieval, will shape the future of Sindhi in terms of digital accessibility. The current system will empower the Sindhi language in the digital age and empower the Sindhi Community to retrieve documents, access digital resources, and share their applications with ease.

2. Related Material

Information Retrieval comprises various steps and phases in which various approaches are used. These approaches include understanding user queries, extracting information, finding documents, determining the relevancy of documents, and listing the obtained documents in order. One of the main challenges in today’s world is plagiarism, so Roostaee et al. [12] presented a plagiarism detection system with two defined stages and a third, combined stage. The first stage extracts information from the user query and reduces the assessment of the archives to an economical level. The second stage is dedicated to the precise examination of the candidates that are to be used in the retrieval stage. The third stage is the combination of the previous two stages. The two approaches of conceptual model and bag of words are combined by adding an interpolation element. Another text-based scheme for IR is a novel, data-driven, ensemble-based approach proposed to combine various highlights of the default sets [13]. A further text-based scheme for IR systems is proposed in [14], in which a question–answer machine is presented that works on data recovery. The machine provides output in the form of solutions to natural language questions by removing sentences that are not relevant. A combination of phrase embedding and lexical sources, called question expansion, has been proposed. The system is built on a multi-phase pipeline and is designed to respond to natural language questions [14].

Many of the Information Retrieval systems or models concentrate on the ranking of the documents or the indexing of the documents. The main purpose of IR systems is to obtain an actual understanding of the user query. For this purpose, a word-embedding model based on averages has been employed to extract relevant information from the job description. To sort the results, indexing and ranking are performed by a BM25 classic model and the language models [15].

One study focused on stemming that works like a human brain. The system works on cognitive-inspired language, which uses morphological words and behaves like a human brain without prior knowledge of language rules. In natural language, in terms of human language, many words have the same meaning or describe a similar concept; in this case, the objective is to reduce the morphology of the words or the base of the words [16]. One of the first agent-based IR systems is presented in [17]. A transformer-based information retrieval is proposed for biomedical IR [18]. A new IR platform is proposed in [19].

3. Sindhi Information Retrieval System (SIR)

The proposed Sindhi Information Retrieval System (SIR) is illustrated in Figure 2. The proposed system is based on pattern discovery and probabilistic modeling. The various steps range from problem identification to document retrieval. After problem identification, document collection is the step in which potential documents for the specific topic are identified and collected. The documents are collected from various sources (details are provided in the next subsection) and indexed. The document information is reduced to understanding-based terms; for this reason, the documents in Sindhi are indexed both with and without diacritics. The next step is query formulation, where queries are formulated so that meaningful information can be searched and easily understood. The documents are stored and some hidden patterns are observed in the documents; to find documents, the pattern discovery approach is used. The documents are searched for a particular sequence or pattern that is suitable for the query information obtained in the previous steps. Various algorithms are employed while retrieving the documents so that the documents and the required query information can be matched. The step of matching the retrieved documents against the initiated query is called query matching, which uses multiple algorithms. The retrieved documents are then ranked according to their relevance. The documents are ranked in decreasing order, with the most relevant documents presented at the top.

3.1. Document Database

A document database is necessary and it is also necessary to reduce the total size of the documents or to reduce the document information so that the important key terms can be presented [20]. The documents are required so that other processes can be experimented with. For this purpose, various documents were obtained in the Sindhi language and preprocessed for the experiments. The document sources included friends, magazines, websites, periodical journals, and others.

3.2. Document Collection

Various sources were used for the collection of Sindhi documents. Many of the documents were also downloaded from the internet. The collected documents numbered in the thousands, with billions of words. The efforts of [21], which made many documents available to other researchers working on the Sindhi language, are also appreciated. One of the sources for document collection was the publishers of magazines and books; many physical copies of the books were also collected. Another source of Sindhi documents was the various research journals published by Sindh University, especially the multiple volumes of the Saranga research journal published by the Department of Sindhi, University of Sindh, Jamshoro. Theses written in Sindhi and submitted to various universities were also collected for the document database for the Sindhi Information Retrieval process.

Publishers from the districts of Shikarpur, Khairpur, and Hyderabad were visited to collect published books of poetry and other titles in textual form, and these books were processed and indexed according to need. Magazine administrations were contacted for the collection of old and new magazines so that the document database could be enhanced. Many Sindhi websites were visited so that various books and booklets could be downloaded. These websites included the Sindhi Language Authority [22] and many other official and non-official websites. For the collection of Sindhi documents, various Sindhi newspaper websites were also visited and many documents on different topics were collected. These websites included Daily Kawish, Awami Awaz, Ibrat, and others. The Sindhi translation of the Holy Quran was also part of the document collection, and was retrieved from www.tanzil.net (accessed on 14 June 2025) [23]. The translation of Shah Jo Risalo and other scripts were also collected for inclusion in the document database. Some theses written in Sindhi were also part of this database. Some of the sources are presented in Figure 3.

3.3. Identification of Potential Documents

After collection, the next step is to select the documents that might contain key terms or relevant pieces of information. The documents are checked for their potential to be selected for indexing.

3.4. Ranking of Documents

Ranking is the process in which heuristics are extracted from the documents collected for the corpus. These heuristics include proximity, term frequency, inverse document frequency, and others.

3.5. Statistics of the Documents Collected

The document sources were identified, collected, and processed for Sindhi Information Retrieval. Figure 4 shows the statistics for the document and data collection in terms of books, and Figure 5 shows the statistics for 11 issues of the Saranga research journal.

3.6. Corpus Statistics (Description)

To evaluate the SIR system, quantitative details of the Sindhi corpus are presented. In-depth details can be found in another related study [24]. Sindhi is a ligature-rich language with complex orthographic variants, multiple diacritic forms, and frequent use of compound constructions [25]. These characteristics significantly influence tokenization, pattern extraction, and probabilistic modeling. Therefore, it is essential to report on the size and linguistic distribution of the corpus. Table 1 provides a complete overview of the dataset used in this study, including the number of documents, total tokens, average words per document, unique ligature patterns, normalized patterns, and the distribution of query types (single-word, two-word, and sentence-level). The statistics provide transparency for the sake of empirical validation and support reproducibility for future Sindhi IR research.

3.7. Relevance Assessment Protocol

To quantify the effectiveness of the Sindhi retrieval system, the results were evaluated using a structured human judgment protocol. Two independent Sindhi-speaking evaluators manually examined the top 10 retrieved documents for each query type: single-word, two-word, and full-sentence queries. A document was marked relevant if it (i) contained the query term(s), (ii) expressed the equivalent concept, or (iii) discussed the topic referred to in the query. A binary marking scheme (Relevant = 1; Not Relevant = 0) was applied. Disagreements between the two evaluators were resolved through discussion, and a final label was assigned. The overall relevancy percentage was calculated using the following equation:

\text{Relevancy} = \frac{\text{Number of Relevant Retrieved Documents}}{\text{Total Retrieved Documents Evaluated}} \times 100

This process yielded an overall relevancy score of approximately 90%. A summary of the raw relevance judgments is provided in Table 2.
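The relevancy equation above can be illustrated with a minimal Python sketch. The function name `relevancy_percentage` and the example labels are hypothetical and are not data from the study; the sketch only assumes the binary marking scheme described in the protocol.

```python
def relevancy_percentage(judgments):
    """Return the share of relevant documents as a percentage.

    judgments: iterable of binary labels (1 = relevant, 0 = not relevant),
    one label per evaluated retrieved document.
    """
    labels = list(judgments)
    if not labels:
        raise ValueError("at least one judgment is required")
    return 100.0 * sum(labels) / len(labels)


# Hypothetical final labels for the top 10 documents of one query
# (illustrative only, not taken from Table 2).
example_labels = [1, 1, 1, 0, 1, 1, 1, 1, 1, 1]
print(relevancy_percentage(example_labels))
```

In this reading, the labels passed in are the final labels agreed on after the evaluators resolved their disagreements.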

4. Pattern Discovery and Probabilistic Modeling

The pattern-discovery approach identifies the structural patterns present in large databases. These patterns take the form of sequences, substructures, subsequences, and other structures. The rules associated with such patterns are called association rules, and the number of times a pattern is repeated is called the pattern frequency. Various kinds of patterns are common, such as closed, open, and max patterns. A modified version of [26] was used for the identification of the patterns and [27] for their location in the Sindhi database for Sindhi Information Retrieval. A model based on probabilities is called a probabilistic model; it retrieves multiple documents in response to a user query, each with a different level of relevance. To understand and respond to the query, probabilistic modeling was employed for Sindhi Information Retrieval. The term probabilities were found using the following equation:

P(t | R) = \log \frac{N - n + 0.5}{n + 0.5}

where P(t|R) is the probability of observing pattern t in relevant documents, t is the normalized linguistic pattern extracted after ligature and morphological processing, R is the set of documents marked as relevant during the training pattern frequency computation, and N is the total number of documents in the corpus. DF(t) is the number of documents in which pattern t appears, and PF(t) is the normalized pattern frequency (pattern-level term frequency).
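As a small numerical illustration, the log-ratio in the equation can be computed as follows. This is a sketch under two assumptions that the text does not state explicitly: that n denotes the document frequency DF(t), and that the logarithm is the natural logarithm. The function name `pattern_weight` is hypothetical.

```python
import math


def pattern_weight(num_docs, doc_freq):
    """Compute log((N - n + 0.5) / (n + 0.5)).

    num_docs: N, the total number of documents in the corpus.
    doc_freq: n, assumed here to be DF(t), the number of documents
              containing pattern t.
    """
    if not 0 <= doc_freq <= num_docs:
        raise ValueError("doc_freq must lie between 0 and num_docs")
    return math.log((num_docs - doc_freq + 0.5) / (doc_freq + 0.5))


# Rare patterns receive a large weight, common patterns a small one.
print(pattern_weight(1000, 10))
print(pattern_weight(1000, 400))
```

The value is zero when a pattern occurs in exactly half of the documents and negative beyond that, so in this form it behaves as a ranking weight rather than as a probability bounded between 0 and 1.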

The term probabilities were followed by the ranking of the documents, which was refined through relevance feedback. Relevance feedback was optionally taken from the user; otherwise, implicit feedback was used to refine the results. The document ranking was improved through these types of relevance feedback. Algorithm 1 gives the Probabilistic Pattern-Based Ranking Procedure.

Algorithm 1: Probabilistic Pattern-Based Ranking Procedure

Input: Query Q, Document Corpus D
Output: Ranked list of documents
1. Extract normalized patterns from Q → P_Q
2. For each document d in D:
      a. Extract normalized patterns P_d
      b. For each pattern t in P_Q:
            i. Compute PF(t) = pattern frequency in d
            ii. Compute DF(t) = number of documents containing t
            iii. Compute P(t|R) = PF(t)/DF(t)
      c. Compute document score S(d) = Σ P(t|R) for all t in P_Q
3. Rank all documents by descending S(d)
4. Return top-k documents
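The steps of Algorithm 1 can be sketched in Python as follows. This is an illustrative sketch, not the authors' implementation: `extract_patterns` is a hypothetical placeholder that uses whitespace tokens in place of the normalized patterns, PF(t) is assumed to be the relative frequency of t within a document, and repeated query patterns are counted once.

```python
from collections import Counter


def extract_patterns(text):
    # Placeholder (assumption): whitespace tokens stand in for the normalized
    # patterns produced by ligature and morphological processing.
    return text.split()


def rank_documents(query, corpus, k=10):
    """Rank documents by S(d) = sum over query patterns of PF(t) / DF(t).

    corpus: dict mapping a document id to its text.
    Returns the top-k (doc_id, score) pairs in descending score order.
    """
    # Step 1: patterns of the query, de-duplicated but in original order.
    query_patterns = list(dict.fromkeys(extract_patterns(query)))
    # Step 2a: pattern counts for every document.
    doc_counts = {doc_id: Counter(extract_patterns(text))
                  for doc_id, text in corpus.items()}
    # Step 2b-ii: DF(t), the number of documents containing pattern t.
    doc_freq = {t: sum(1 for counts in doc_counts.values() if t in counts)
                for t in query_patterns}
    scores = {}
    for doc_id, counts in doc_counts.items():
        total = sum(counts.values())
        score = 0.0
        for t in query_patterns:
            if total == 0 or doc_freq[t] == 0:
                continue  # pattern unseen in the corpus, or empty document
            pf = counts[t] / total        # Step 2b-i: normalized PF(t)
            score += pf / doc_freq[t]     # Step 2b-iii and 2c
        scores[doc_id] = score
    # Steps 3 and 4: sort by descending score and keep the top k.
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return ranked[:k]


# Placeholder tokens w1..w3 are used instead of real Sindhi text.
corpus = {"d1": "w1 w2 w1", "d2": "w2 w3", "d3": "w3 w3 w3"}
print(rank_documents("w1 w2", corpus, k=2))
```

Computing DF(t) once per query pattern, rather than inside the per-document loop as the pseudocode is written, gives the same scores while avoiding repeated passes over the corpus.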

4.1. Sindhi-Specific Algorithmic Extensions for Pattern Discovery

To address linguistic challenges unique to Sindhi, three essential modifications were incorporated into the baseline pattern-discovery algorithms referenced in [26,27]. First, a ligature normalization layer was used to map multiple Unicode join forms and diacritic variants into canonical root forms. Second, a rule-based morphological segmentation module was introduced to normalize suffixes, plural markers, and postpositions, allowing inflected forms to be grouped under unified concepts. Third, bidirectional n-gram scanning was implemented to accurately capture compound noun and verb structures that emerge in right-to-left Sindhi text. These linguistic adaptations significantly extend the baseline algorithms and are necessary for achieving reliable pattern extraction in Sindhi’s orthography and morphology.
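Two of these adaptations can be sketched in Python. The paper does not give the exact procedures, so the following rests on stated assumptions: Unicode NFKC normalization is used as a stand-in for the ligature normalization layer (it folds Arabic presentation forms into their base letters), and "bidirectional n-gram scanning" is read as collecting n-grams from the token sequence in both forward and reverse order. The function names are hypothetical, and the rule-based morphological segmentation is omitted because its rules are not listed.

```python
import unicodedata


def normalize_join_forms(text):
    # Assumption: NFKC compatibility normalization stands in for the
    # ligature normalization layer; it maps positional presentation forms
    # (initial, medial, final, isolated) to the base letters.
    return unicodedata.normalize("NFKC", text)


def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def bidirectional_ngrams(tokens, n=2):
    """Collect n-grams scanning the tokens forward and then in reverse."""
    forward = ngrams(tokens, n)
    backward = ngrams(tokens[::-1], n)
    # dict.fromkeys removes duplicates while preserving first-seen order.
    return list(dict.fromkeys(forward + backward))


# Placeholder tokens are used instead of real Sindhi compounds.
print(bidirectional_ngrams(["w1", "w2", "w3"], n=2))
```

Scanning in both directions lets a compound be matched regardless of which constituent the query places first, which is one plausible motivation for the bidirectional design.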

4.2. Theoretical Basis and Distinction of the Probabilistic Model

The probabilistic scoring function used in this study extends the classical MLE-based probability estimation by replacing word-level frequency with pattern-level frequency. This modification is grounded in frequency-based likelihood modeling but adapted to Sindhi through ligature normalization, morphological segmentation, and bidirectional compound detection. These linguistic adaptations differentiate the model from established IR frameworks such as TF-IDF and BM25, which rely on unigram term counts and do not capture Sindhi-specific orthographic and morphological structures. Table 3 illustrates a comparison of the models.

5. Results and Discussion

Various experiments were performed to evaluate the Sindhi Information Retrieval System. The experiments were based on queries using a single word, two words, and sentences to retrieve documents. The experiments were performed on Intel i5 machines with 6 GB of RAM, using a wired network connection. The 6 GB memory experiment demonstrates the feasibility of processing the current Sindhi corpus using the proposed pattern-extraction model. It is not presented as evidence of production-level scalability. Testing scalability across very large corpora will require additional indexing optimization and will be addressed in future development stages. An experiment snapshot is presented in Table 4 for the single-word, two-word and sentence queries based on the retrieved pages and the time elapsed.

When searching the documents, a system normally relies on the words included in the query; for example, a single word may produce more results, and including an additional word will produce fewer results. Table 4 shows that, for the single-word query, the total number of pages or documents retrieved was 9623, with the elapsed time ranging from 0.2 to 4.8 s. For the two-word query, the number of documents retrieved was 3490, and the elapsed time ranged from 0.4 to 4.7 s. For the sentence case, the number of pages decreased to 1382, and the elapsed time ranged from 0.2 to 4.6 s. The results of a single-word search are shown in Figure 6. Comparison with BM25 and TF-IDF demonstrates that the proposed model captures the morphological and syntactic structure of Sindhi more effectively than traditional IR baselines. This improvement stems from the model’s ability to recognize patterns and variants in Sindhi ligatures and word forms, which commonly cause retrieval failures in bag-of-words approaches.

One of the complexities of Sindhi words or Sindhi documents is the construction of many words with one simple ligature, where a simple change in diacritics can change the meaning of the word, as shown in Figure 7, in which a single ligature (سر) produces many variants.

The two-word structure is also a complex task in Sindhi Information Retrieval, as a single word can be attached or added to multiple words in the Sindhi language, as shown in Figure 7. The word (ڪني) may be connected to many other words, and its variants and their diacritic connections are shown in Figure 8.

Two words or variants of a single word, along with diacritics and possible connections with other words, are shown in Figure 8. The number of retrieved pages differed for multiple queries, such as single-word queries, multiword queries, and queries with sentences. The average number of retrieved pages, along with the types of input, are shown in Figure 9. The numbers of pages retrieved with a single-word input, along with the execution time, are shown in Figure 10.

Figure 9 illustrates the average number of retrieved pages across single-word, two-word, and sentence-level queries. The figure shows a clear decreasing trend in retrieved pages as query complexity increases. Single-word queries generate the highest number of retrieved pages because they match a broad range of documents containing the target term or its normalized patterns. Two-word queries narrow the search space by introducing contextual constraints, resulting in fewer retrieved documents. Sentence-level queries impose the strongest semantic restrictions, producing the lowest average retrieval counts. This behavior validates the system’s ability to adjust retrieval width based on query specificity, reflecting the expected IR behavior for morphologically rich languages such as Sindhi.

Figure 10 presents the relationship between the number of fetched pages (for single-word queries) and the execution time. The plot shows that while the number of retrieved pages varies significantly across different queries, the execution time remains within a narrow and stable range. This indicates that the probabilistic pattern-based model maintains computational efficiency even when handling large volumes of retrieved data. The relatively flat time curve suggests that pattern extraction and ranking operations scale consistently, demonstrating their suitability for real-time or interactive query environments. This figure reinforces the efficiency of the proposed system under varying retrieval loads.

5.1. Explanation of the ‘Without Diacritics’ Indexing Strategy

Sindhi text often contains optional diacritic marks (such as zabar, zer, pesh, sukoon, and madd), but most modern digital content omits these marks entirely. Users also typically do not type diacritics when entering search queries. To address this practical usage pattern, the ‘without diacritics’ indexing strategy normalizes all diacritic-bearing characters to their base Unicode forms before tokenization and pattern extraction. This improves recall by ensuring that visually identical words with and without diacritics are matched consistently. In contrast, the ‘with diacritics’ index preserves these distinctions for use cases requiring precise linguistic fidelity. Both strategies are maintained to balance linguistic accuracy with real-world retrieval performance.
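The "without diacritics" normalization can be sketched as follows. This is a minimal illustration, not the system's actual code: the set of marks removed (U+064B to U+0653 plus U+0670) is an assumption covering the common Arabic-script vowel marks, including zabar, zer, pesh, sukoon, and madd, and the names `strip_diacritics` and `index_keys` are hypothetical.

```python
# Arabic-script combining marks removed for the "without diacritics" index
# (assumed set): U+064B..U+0652 cover the tanween forms, zabar (fatha),
# pesh (damma), zer (kasra), shadda, and sukoon; U+0653 is madd (maddah
# above); U+0670 is the superscript alef.
DIACRITICS = {chr(cp) for cp in range(0x064B, 0x0654)} | {"\u0670"}


def strip_diacritics(text):
    """Return text with the optional diacritic marks removed."""
    return "".join(ch for ch in text if ch not in DIACRITICS)


def index_keys(token):
    """Return the keys under which a token is stored in the two indexes."""
    return {
        "with_diacritics": token,
        "without_diacritics": strip_diacritics(token),
    }


# Seen + zabar + reh: a diacritic-bearing variant of the ligature (سر).
marked = "\u0633\u064E\u0631"
print(index_keys(marked))
```

Applying the same `strip_diacritics` step to the query before lookup is what allows a query typed without marks to match documents that contain the marked form.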

5.2. Evaluation Metrics (Precision, Recall, MRR)

To strengthen the effectiveness of the evaluation of the Sindhi Information Retrieval System, standard Information Retrieval metrics were incorporated. These include Precision@10 (P@10), Recall@10 (R@10), and Mean Reciprocal Rank (MRR). Two independent Sindhi-speaking evaluators assessed the top 10 retrieved documents for each query type: single-word, two-word, and sentence-based queries. A document was marked relevant if it contained the query term(s), conveyed the equivalent concept, or discussed the same topic. Precision@10 measures the proportion of retrieved documents that are relevant, Recall@10 measures the proportion of all relevant documents that were retrieved, and MRR measures how highly the first relevant document appears in the ranking. These metrics provide deeper insight into retrieval accuracy and ranking quality, complementing the time-based and retrieved-pages evaluations.
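The three metrics can be computed as in the sketch below. The function names and the example document identifiers are hypothetical; the definitions follow the standard forms, with Precision@k dividing by k and Recall@k dividing by the number of known relevant documents for the query.

```python
def precision_at_k(ranked_ids, relevant_ids, k=10):
    """Fraction of the top-k retrieved documents that are relevant."""
    hits = sum(1 for doc_id in ranked_ids[:k] if doc_id in relevant_ids)
    return hits / k


def recall_at_k(ranked_ids, relevant_ids, k=10):
    """Fraction of all relevant documents that appear in the top k."""
    if not relevant_ids:
        return 0.0
    hits = sum(1 for doc_id in ranked_ids[:k] if doc_id in relevant_ids)
    return hits / len(relevant_ids)


def mean_reciprocal_rank(runs):
    """Average of 1/rank of the first relevant document over all queries.

    runs: list of (ranked_ids, relevant_ids) pairs, one per query.
    A query with no relevant document in its ranking contributes 0.
    """
    if not runs:
        return 0.0
    total = 0.0
    for ranked_ids, relevant_ids in runs:
        for rank, doc_id in enumerate(ranked_ids, start=1):
            if doc_id in relevant_ids:
                total += 1.0 / rank
                break
    return total / len(runs)


# Hypothetical ranking and relevance set for a single query.
ranked = ["d3", "d1", "d7", "d2"]
relevant = {"d1", "d2", "d9"}
print(precision_at_k(ranked, relevant, k=4))
print(recall_at_k(ranked, relevant, k=4))
print(mean_reciprocal_rank([(ranked, relevant), (["d9"], {"d9"})]))
```

Note that Recall@10 requires knowing every relevant document in the corpus for a query, which in practice is usually approximated from the pooled judgments.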

5.3. Baseline Comparison with Standard IR Models

For the evaluation of the proposed probabilistic pattern Sindhi retrieval model, we conducted a baseline comparison with two widely used IR ranking methods: BM25 and TF-IDF. Both baselines were implemented using the same preprocessed Sindhi corpus and the same query set (single-word, two-word, and full-sentence queries). Table 5 presents a comparative summary of precision@10 (P@10), recall@10 (R@10), and Mean Reciprocal Rank (MRR). The proposed method consistently outperformed BM25 and TF-IDF, particularly on multi-word queries, where pattern-based matching captured contextual relationships that bag-of-words methods could not.
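For reference, a BM25 baseline of the kind compared here can be sketched as below. The paper does not state which BM25 variant or parameters were used, so this sketch assumes the commonly used form with a non-negative IDF, log((N - n + 0.5)/(n + 0.5) + 1), and the conventional defaults k1 = 1.2 and b = 0.75; the function name `bm25_scores` is hypothetical.

```python
import math
from collections import Counter


def bm25_scores(query_terms, docs, k1=1.2, b=0.75):
    """Score every document against the query with BM25.

    docs: dict mapping a document id to its list of tokens.
    k1, b: conventional default parameters (assumed, not from the paper).
    Returns a dict mapping each document id to its score.
    """
    n_docs = len(docs)
    if n_docs == 0:
        return {}
    avg_len = sum(len(tokens) for tokens in docs.values()) / n_docs
    term_counts = {doc_id: Counter(tokens) for doc_id, tokens in docs.items()}
    scores = {}
    for doc_id, counts in term_counts.items():
        doc_len = len(docs[doc_id])
        score = 0.0
        for term in query_terms:
            tf = counts[term]
            if tf == 0:
                continue  # term absent from this document
            # Document frequency of the term across the corpus.
            df = sum(1 for c in term_counts.values() if term in c)
            idf = math.log((n_docs - df + 0.5) / (df + 0.5) + 1.0)
            # Term-frequency saturation with document-length normalization.
            norm = tf + k1 * (1.0 - b + b * doc_len / avg_len)
            score += idf * tf * (k1 + 1.0) / norm
        scores[doc_id] = score
    return scores


# Placeholder tokens are used instead of real Sindhi text.
docs = {"d1": ["w1", "w2", "w1"], "d2": ["w2", "w3"]}
print(bm25_scores(["w1"], docs))
```

Because this baseline counts exact unigram tokens, two spellings of the same word that differ only in diacritics or join forms are treated as unrelated terms unless they are normalized beforehand.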

The reported 90% relevancy score was derived from a structured human relevance-judgment procedure. Two independent Sindhi-speaking evaluators manually assessed the top 10 retrieved documents for each query type. Relevancy was calculated by dividing the number of relevant documents by the total number of assessments. A summary of raw relevance counts is provided in Table 2.

While this work includes BM25 and TF-IDF baseline comparisons, further benchmarking against additional retrieval models, such as BM25+, Query Likelihood (QL), RM3, and recent neural ranking methods, is planned for future research. Expanding the range of baselines will allow for a more comprehensive performance validation on the Sindhi corpus. The present work does not evaluate large-corpus scalability; future research will benchmark performance on substantially larger datasets. Although Urdu shares orthographic similarities with Sindhi, cross-lingual validation was not conducted in this study and remains an area for future investigation.

6. Conclusions and Future Work

Computing systems for the Sindhi language are available in various forms, but Sindhi Information Retrieval is still lacking. To establish a Sindhi Information Retrieval system, a database was necessary, and a number of sources were identified so that Sindhi documents could be obtained to create a database of Sindhi documents. Various books, journals, newspaper websites, and other websites were included in the collection of documents. A Sindhi Information Retrieval system is presented in which a user initiates a query for the solution of a problem, and thousands of documents are processed and stored in a Sindhi document database. The user query is understood, and patterns are searched by utilizing the pattern discovery approach. The patterns are identified, and the results are presented in decreasing order of relevance so that the documents most relevant to the query appear first. The Sindhi Information Retrieval System achieved 90% relevancy in its results. Sindhi Information Retrieval can be adapted, with slight modification, to other languages that are similar to Sindhi and use Arabic script. The system could also be implemented in more versatile approaches to enhance the validity, accuracy, and scope of the study.

Scaling the system to very large Sindhi corpora will require additional indexing strategies and memory optimization, which will be undertaken in future experiments. The 6 GB memory experiment demonstrates feasibility for the current Sindhi corpus but is not intended to imply production-level scalability. Future work will also explore adapting the proposed framework to Urdu and other Indo-Aryan scripts using their own ligature and morphological rules.

Author Contributions

Conceptualization, D.N.H., A.A., A.Z.B. and M.B.; Methodology, D.N.H. and A.Z.B.; Software, D.N.H., A.A., A.Z.B. and O.A.R.; Validation, D.N.H. and A.A.; Formal analysis, D.N.H., A.A. and M.B.; Investigation, D.N.H., A.A., A.Z.B., S.R. and M.B.; Resources, D.N.H., S.R., A.Z.B. and S.R.; Data curation, D.N.H., A.A. and O.A.R.; Writing—original draft, D.N.H., S.R. and M.B.; Writing—review and editing, D.N.H., A.Z.B., S.R., M.B. and O.A.R.; Visualization, D.N.H., A.A., A.Z.B. and S.R.; Supervision, D.N.H.; Project administration, D.N.H., A.Z.B., S.R. and M.B.; Funding acquisition, A.Z.B., S.R. and M.B. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data presented in this study are available on request from the corresponding author. The data are not publicly available due to privacy/ethical restrictions.

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. Vishwakarma, D.; Kumar, S. Automatic query expansion for enhancing document retrieval system in healthcare application using GAN based embedding and hyper-tuned DAEBERT algorithm. Data Knowl. Eng. 2025, 160, 102468.
2. Qureshi, M.M.; Shoaib, M. An Efficient Indexing and Searching Technique for Information Retrieval for Urdu Language. Pak. J. Sci. 2010, 62, 172–176.
3. Alzaidi, M.S.A.; Alshammari, A.; Hassan, A.Q.A.; Ebad, S.A.; Al Sultan, H.; Alliheedi, M.A.; Aljubailan, A.A.; Alzahrani, K.A. Enhanced automated text categorization via Aquila optimizer with deep learning for Arabic news articles. Ain Shams Eng. J. 2025, 16, 103189.
4. Lancaster, W. Information Retrieval Systems: Characteristics, Testing and Evaluation, 2nd ed.; Wiley: Chichester, NY, USA, 1979.
5. Robertson, S.E. Between Aboutness and Meaning. In The Analysis of Meaning: Informatics 5; Aslib, Centre for Information Science, Queens College Oxford, The City University: London, UK, 1979; pp. 202–205.
6. Gerrie, B. Online Information System Use and Operating Characteristics, Limitations, and Design Alternatives; Information Resources Press: Arlington, VA, USA, 1983.
7. Basir, N.; Hakro, D.N.; Khoumbati, K.-U.-R.; Bhatti, Z. Leveraging Machine-Labeled Data and Cross-Lingual Transfer for NER in Urdu and Sindhi. J. Inf. Commun. Technol. (JICT) 2025, 19, 1–8.
8. Panja, S. Information Retrieval Systems in Healthcare: Understanding Medical Data Through Text Analysis. In Transformative Approaches to Patient Literacy and Healthcare Innovation; IGI Global: Hershey, PA, USA, 2024; pp. 180–200.
9. Li, X.; Jin, J.; Zhou, Y.; Zhang, Y.; Zhang, P.; Zhu, Y.; Dou, Z. From matching to generation: A survey on generative information retrieval. arXiv 2024, arXiv:2404.14851.
10. Mhawi, D.N.; Oleiwi, H.W.; Aldallal, A. Enhanced Cultural Algorithm for Information Retrieval System. Appl. Math. 2024, 18, 1081–1094.
11. Wang, J.; Huang, J.X.; Tu, X.; Wang, J.; Huang, A.J.; Laskar, M.T.R.; Bhuiyan, A. Utilizing BERT for Information Retrieval: Survey, Applications, Resources, and Challenges. ACM Comput. Surv. 2024, 56, 1–33.
12. Roostaee, M.; Sadreddini, M.H.; Fakhrahmad, S.M. An effective approach to candidate retrieval for cross-language plagiarism detection: A fusion of conceptual and keyword-based schemes. Inf. Process. Manag. 2020, 57, 102150.
13. Safiulin, I.; Butakov, N.; Alexandrov, D.; Nasonov, D. Ensemble-based method of answers retrieval for domain specific questions from text-based documentation. Procedia Comput. Sci. 2019, 156, 158–165.
14. Esposito, M.; Damiano, E.; Minutolo, A.; De Pietro, G.; Fujita, H. Hybrid query expansion using lexical resources and word embeddings for sentence retrieval in question answering. Inf. Sci. 2020, 514, 88–105.
15. Fernández-Reyes, F.C.; Shinde, S. CV Retrieval System based on job description matching using hybrid word embeddings. Comput. Speech Lang. 2019, 56, 73–79.
16. Alotaibi, F.S.; Gupta, V. A cognitive inspired unsupervised language-independent text stemmer for information retrieval. Cogn. Syst. Res. 2018, 52, 291–300.
17. Cai, Q.; Zhao, X.; Pan, L.; Xin, X.; Huang, J.; Zhang, W.; Zhao, L.; Yin, D.; Yang, G.H. AgentIR: 1st Workshop on Agent-based Information Retrieval. In Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval, Washington DC, USA, 14–18 July 2024; pp. 3025–3028.
18. Hall, K.; Jayne, C.; Chang, V. A Transformer-Based Framework for Biomedical Information Retrieval Systems. In International Conference on Artificial Neural Networks; Springer Nature: Cham, Switzerland, 2023; pp. 317–331.
19. Fröbe, M.; Reimer, J.H.; MacAvaney, S.; Deckers, N.; Reich, S.; Bevendorff, J.; Stein, B.; Hagen, M.; Potthast, M. The information retrieval experiment platform. In Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval, New York, NY, USA, 23–27 July 2023; pp. 2826–2836.
20. Kumar, C.S.; Santhosh, R. Effective information retrieval and feature minimization technique for semantic web data. Comput. Electr. Eng. 2020, 81, 106518.
21. Hakro, D.N. Enhanced Segmentation and Feature Extraction Approaches for Sindhi Optical Character Recognition. Ph.D. Thesis, Universiti Sains Malaysia, Malaysia, 2015.
22. Available online: http://sl.sindhila.org/ (accessed on 28 May 2020).
23. Available online: www.tanzil.net (accessed on 12 May 2020).
24. Hakro, D.N.; Talib, A.Z. Printed text image database for Sindhi OCR. ACM Trans. Asian Low-Resour. Lang. Inf. Process. 2016, 15, 21.
25. Hakro, D.N.; Ismaili, I.A.; Talib, A.Z.; Bhatti, Z.; Mojai, G.N. Issues and Challenges in Sindhi OCR. Sindh Univ. Res. J. (Sci. Ser.) 2014, 46, 143–152.
26. Aggarwal, M.; Bhatia, A. Pattern Discovery Techniques in Online Data Mining. Int. J. Eng. Tech. Res. 2015, 3, 28–31.
27. Maiti, S.; Subramanyam, R.B.V. Mining Co-Location Patterns from Distributed Spatial. J. King Saud Univ. Comput. Inf. Sci. 2021, 33, 1064–1073.

Figure 1. A conventional Information Retrieval system.

Figure 2. Framework for the Sindhi Information Retrieval System.

Figure 3. Various data collection samples.

Figure 4. Books collected.

Figure 5. Saranga journal data statistics.

Figure 6. Single-word search in SIR (the words inside the images are in Sindhi).

Figure 7. Variants of word sir (سر) in Sindhi.

Figure 8. Variants of the word kuni (ڪني) and its connections in Sindhi.

Figure 9. Average number of retrieved pages.

Figure 10. Fetched pages (single word) and the execution time.

Table 1. Overview of the corpus; further details are presented in [24].

Feature	Count
Total documents	7601
Total words/tokens	1,690,519
Average words per document	222
Unique ligature patterns	219,037
Unique normalized patterns	87,532
Single-word queries	10
Two-word queries	10
Sentence queries	5

Table 2. Raw relevance assessment summary.

Query Type	Number of Queries	Retrieved Docs Evaluated (Top 10 per Query)	Relevant Docs	Relevancy %
Single-word queries	20	200	181	90.5%
Two-word queries	20	200	178	89.0%
Sentence-level queries	10	100	92	92.0%
Total	50	500	451	90.2%
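
As a quick arithmetic check on Table 2, the relevancy percentages can be recomputed from the raw counts. The sketch below assumes that relevancy is simply the number of relevant documents divided by the number of documents evaluated; the variable and function names are illustrative and are not taken from the SIR implementation.

```python
# Counts copied from Table 2: (documents evaluated, documents judged relevant).
table2 = {
    "Single-word queries": (200, 181),
    "Two-word queries": (200, 178),
    "Sentence-level queries": (100, 92),
}


def relevancy_percent(evaluated, relevant):
    """Share of evaluated documents judged relevant, as a percentage (1 decimal)."""
    return round(100.0 * relevant / evaluated, 1)


# Per-query-type relevancy, then the pooled figure over all 500 judgments.
per_type = {name: relevancy_percent(e, r) for name, (e, r) in table2.items()}
total_evaluated = sum(e for e, _ in table2.values())
total_relevant = sum(r for _, r in table2.values())
overall = relevancy_percent(total_evaluated, total_relevant)

for name, pct in per_type.items():
    print(f"{name}: {pct}%")
print(f"Total: {total_relevant}/{total_evaluated} = {overall}%")
```

Under this assumption the pooled value is 451/500 = 90.2%, which matches the Total row; it is a pooled ratio over all judgments, not an average of the three per-type percentages.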

Table 3. Comparison of the models.

Aspect	Established Models (e.g., BM25, TF-IDF)	Our Probabilistic Model
Unit of scoring	Terms/unigrams	Patterns (normalized, morphology-aware units)
Handling of morphology	Minimal	Morphological segmentation + ligature normalization
Compound words	Not explicitly modeled	Bidirectional pattern capture
Probability basis	Term frequency or tf–idf	Pattern frequency and likelihood
Relevance computation	Weight-based ranking	Pattern occurrence probability conditioned on query
Language adaptation	English/Latin-optimized	Explicitly adapted for Sindhi script

Table 4. One of the experiments performed for SIR.

Query Type	Retrieved Pages	Time Elapsed (s)
Single-Word	9623	0.2–4.8
Two-Word	3490	0.4–4.7
Sentence	1382	0.2–4.6

Table 5. Comparison summary.

Model	P@10	R@10	MRR
TF-IDF	0.41	0.38	0.46
BM25	0.53	0.50	0.57
Proposed Probabilistic Pattern Model	0.68	0.65	0.72
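
For readers who want to reproduce figures of the kind reported in Table 5, the following is a minimal sketch of P@10, R@10, and MRR using their standard textbook definitions. It is not the authors' evaluation code: the function names, the document identifiers, and the toy rankings are all hypothetical, and the paper's exact protocol (for example, how ties or unjudged documents are handled) may differ.

```python
def precision_at_k(ranked, relevant, k=10):
    """Fraction of the top-k cutoff occupied by relevant documents."""
    hits = sum(1 for doc in ranked[:k] if doc in relevant)
    return hits / k


def recall_at_k(ranked, relevant, k=10):
    """Fraction of all relevant documents that appear in the top k."""
    if not relevant:
        return 0.0
    hits = sum(1 for doc in ranked[:k] if doc in relevant)
    return hits / len(relevant)


def reciprocal_rank(ranked, relevant):
    """1 / rank of the first relevant document, or 0.0 if none is retrieved."""
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(runs):
    """Average reciprocal rank over (ranked list, relevant set) pairs."""
    return sum(reciprocal_rank(r, rel) for r, rel in runs) / len(runs)


# Hypothetical toy data: two queries with made-up document identifiers.
runs = [
    (["d3", "d1", "d7", "d2"], {"d1", "d2", "d9"}),
    (["d5", "d6"], {"d5"}),
]
p_at_4 = precision_at_k(*runs[0], k=4)   # 2 of the top 4 are relevant
r_at_4 = recall_at_k(*runs[0], k=4)      # 2 of 3 relevant documents found
mrr = mean_reciprocal_rank(runs)         # (1/2 + 1/1) / 2
print(p_at_4, r_at_4, mrr)
```

On this toy data the first query gives P@4 = 0.5 and R@4 = 2/3, and the two queries together give MRR = 0.75. Note that precision divides by the cutoff k while recall divides by the number of relevant documents, so the two can differ even at the same cutoff, as they do in Table 5.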

Share and Cite

MDPI and ACS Style

Hakro, D.N.; Abbasi, A.; Bhat, A.Z.; Raza, S.; Babar, M.; Rahbi, O.A. Probabilistic Modeling and Pattern Discovery-Based Sindhi Information Retrieval System. Information 2026, 17, 82. https://doi.org/10.3390/info17010082

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers.
