1. Introduction
Natural Language Processing (NLP) plays a key role in developing speech and text processing systems, wherein the analysis of words and their formation is an essential component. Morphology is the systematic study of word formation, knowledge of which is required when constructing a meaningful sentence in a language. Morphological analysis is an essential phase in NLP in which a morphological analyzer decodes a given word into its constituent components: stem and affixes [
1]. Given a word, its morphological analysis can be generated by humans who have learned the language formally or informally, or by machines equipped with a trained morphological analyzer. A morphological analyzer can serve as a necessary component in developing stemmers, Named Entity Recognition (NER), and Machine Translation (MT) systems, to name a few NLP applications [
2].
A morphological analyzer generates the morphemes of a given word, where a morpheme is a minimal meaningful unit that can be either a stem or an affix; affixes cause inflections based on number, gender, tense, or case [
1]. The inflections are based on morphological rules that vary from one language to another and are often associated with parts of speech that contribute differently during inflections. For instance, verbs in English have inflections with the
-ing or -ed suffix being added to the stem [
3]. Nouns undergo inflection based on the eight cases: nominative, accusative, instrumental, dative, ablative, genitive, locative, and vocative, whereas verbs inflect based on tense: past, present, and future. Indeclinables do not generate inflections when gender, tense, number, or case is altered [
4].
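To make this feature inventory concrete, the following minimal sketch (in Python) enumerates the categories listed above; the tag names and the assumption that every combination is realized are illustrative only and are not part of any dictionary or standard used in this work.

```python
# A minimal sketch of the morphological feature inventory described above.
# Tag names and the assumption that every combination is realized are
# illustrative only; not all combinations occur in actual Kannada usage.

CASES = ["nominative", "accusative", "instrumental", "dative",
         "ablative", "genitive", "locative", "vocative"]
TENSES = ["past", "present", "future"]
GENDERS = ["masculine", "feminine", "neuter"]
NUMBERS = ["singular", "plural"]
PERSONS = ["first", "second", "third"]

# Nouns inflect for case and number; verbs for gender, person, number, and tense.
noun_slots = [(c, n) for c in CASES for n in NUMBERS]
verb_slots = [(g, p, n, t) for g in GENDERS for p in PERSONS
              for n in NUMBERS for t in TENSES]

print(len(noun_slots))   # 16 nominal feature combinations
print(len(verb_slots))   # 54 verbal combinations (an upper bound; not all occur)
```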
Automated morphological analysis can assist in downstream NLP applications, especially for low-resource languages and assist language documentation efforts for endangered languages [
5]. Many low-resource languages lack large, annotated corpora or linguistic resources, making it difficult to analyze or process them computationally. By developing morphological analyzers, it becomes possible to create linguistic resources for such languages. This supports machine translation and contributes to the documentation and preservation of endangered languages. The purpose of this study is to encourage the development of morphological analyzers for low-resource languages, which in turn can support the construction of further NLP tools. In machine translation, especially for morphologically rich languages, morphological analyzers play a crucial role in enhancing translation accuracy by breaking down complex words into constituent morphemes. This approach addresses challenges such as data sparsity and vocabulary explosion, which are prevalent in languages with intricate morphological structures [
6]. Morphological analysis is particularly beneficial for low-resource languages, where large parallel corpora are scarce. By segmenting words into meaningful units, translation systems can better handle rare and out-of-vocabulary words, improving translation quality [
7].
In recent decades, morphological analyzers have been built for English and other European languages. The Stanford CoreNLP [
8] toolkit processes mainly the English language but also supports languages like French, German, Arabic and Italian. Morphological analyzers have been built for Indian languages [
9] by creating the necessary corpus or dictionary. Indian languages are broadly classified as belonging to the following families: Indo-European (Hindi, Marathi, Urdu, Gujarati, and Sanskrit), Dravidian (Kannada, Tulu, Tamil, Telugu, Malayalam), Austroasiatic (Munda in particular), and Sino-Tibetan (Tibeto-Burman in particular) [
10]. Kannada is one of the Dravidian languages spoken widely in the state of Karnataka. It is included in the eighth schedule of the Indian Constitution [
11] and has approximately 40 million speakers in India. In this paper, the WX notation is used throughout to represent Kannada words [
12].
Kannada is highly agglutinative in nature: two or more words can be combined into a single word, resulting in a complex word structure. It is also morphologically rich, with a large number of inflected forms for a given word. The agglutinative nature of the language makes it difficult to split words conjoined through euphonic change (Sandhi), which makes morphological analysis even more difficult [
13]. In English, a verb
learn undergoes inflection based on person/number and tense as
learnt, learning, learns, learned, whereas in Kannada, the equivalent word for
learn is
kali, and inflections based on gender, number and tense are shown in
Table 1.
From
Table 1, it is clear that multiple suffixes are added to a Kannada verb based on gender, person, number, and tense, whereas English is simpler: its verbal inflections are unaffected by gender and change only with person/number and tense (English still adds the -s inflection in the third-person singular). These inflections show the complex structure and morphological richness of the Kannada language, which makes it challenging to develop a morphological analyzer for Kannada. The complex structure of the morphemes makes it difficult to build an automated morphological analyzer, as linguistic features such as case markers, gender, tense, number, and honorifics affect the morphemes, leading to more inflected forms.
In Kannada, several researchers have proposed methods for word categorization into declinables and indeclinables [
14]. Declinables are nouns, verbs, adjectives, and pronouns, and indeclinables are conjunctions, interjections, and adverbs. Verbs undergo inflection based on gender, number, and tense. Pronouns and adjectives generate inflections similar to those of nouns. Compared to nouns, verbs have more inflections. Indeclinables such as conjunctions, interjections, and adverbs do not show any inflection based on gender, number, tense, or case. Using these categories and a dictionary of stems and affixes, it is possible to generate morphological analyses in Kannada.
Considering the above-mentioned challenges, a Kannada morphological analyzer can be approached in two ways: the first is a rule-based approach, which utilizes a dictionary and applies predefined morphological rules to the words in the dictionary. This approach involves manually defining the rules, which are subsequently integrated into the system. The second approach is corpus-based, where a corpus is created, and machine learning algorithms are applied to generate the morphological analysis [
15]. A paradigm refers to a set of rules that can be recognized in the process of word formation [
16]. A morphological analyzer can be built by creating paradigms for various word categories in a dictionary.
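As an illustration of the paradigm idea, the following minimal sketch (in Python) stores a paradigm as a list of suffix-feature pairs shared by a class of stems and analyzes a word by matching stem + suffix against a small stem dictionary. The paradigm names, suffixes, and feature labels are illustrative placeholders, not the actual paradigms or dictionary entries developed in this work.

```python
# A minimal sketch of paradigm-based morphological analysis.
# Paradigm names, suffixes, and feature labels are illustrative placeholders,
# not the actual paradigms or dictionary entries used in this paper.

# A paradigm is a set of (suffix, feature-tag) pairs shared by a class of stems.
PARADIGMS = {
    "noun_a_ending": [
        ("", "nominative singular"),
        ("galu", "nominative plural"),
        ("dalli", "locative singular"),
    ],
}

# Dictionary mapping each stem to the paradigm it follows (assumed assignment).
STEM_TO_PARADIGM = {
    "mara": "noun_a_ending",
}

def analyze(word):
    """Return (stem, features) candidates such that word = stem + suffix."""
    results = []
    for stem, pname in STEM_TO_PARADIGM.items():
        for suffix, feats in PARADIGMS[pname]:
            if word == stem + suffix:
                results.append((stem, feats))
    return results

print(analyze("maragalu"))   # [('mara', 'nominative plural')] under these assumptions
```

Under this view, extending coverage amounts to adding stems to the dictionary and assigning each one to a paradigm.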
The rule-based approach is particularly suitable for morphologically rich languages, where many inflections are generated by adding multiple suffixes to a stem [
17]. Deep learning methods have also been popular among NLP researchers for morphological analysis due to the availability of open-source architectures [
18]. Researchers have implemented morphological analyzers using sequence-to-sequence, Long Short-Term Memory (LSTM), Gated Recurrent Unit (GRU), and Transformer architectures.
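To indicate how such architectures are commonly framed for morphological analysis, the following sketch (in Python) converts a word and its analysis into character-level source/target sequences of the kind a sequence-to-sequence model can be trained on; the example word, analysis string, and separator conventions are assumptions made for illustration, not the exact data format used in this paper.

```python
# A minimal sketch of how character-level sequence-to-sequence training pairs
# are commonly framed for morphological analysis. The example word, analysis,
# and separator conventions are assumptions, not this paper's exact format.

def make_pair(word, stem, features):
    """Source: the inflected word as space-separated characters.
    Target: the stem characters followed by feature tags."""
    src = " ".join(list(word))
    tgt = " ".join(list(stem) + [f"<{f}>" for f in features])
    return src, tgt

src, tgt = make_pair("maradalli", "mara", ["locative", "singular"])
print(src)   # m a r a d a l l i
print(tgt)   # m a r a <locative> <singular>
```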
Based on the significance of deep learning techniques in NLP, this paper focuses on developing rule-based paradigms for morphological analysis and building a unique corpus on which a Transformer architecture is trained [
19]. The significant outcome is the ability to generate morphological analyses of new words that are not present in the dictionary. The contributions of this work toward developing a morphological analyzer for the Kannada language are as follows:
- 1.
A set of paradigms is designed for the words in a standard dictionary using morphological rules based on gender, number, tense, person, and case applicable to individual word categories.
- 2.
A unique morphological analysis dataset is developed using the generated paradigms on the standard dictionary.
- 3.
A hybrid model capable of analyzing new inflections is trained on the unique dataset.
The remainder of the paper is organized as follows: a literature survey on morphological analyzers specific to Kannada is discussed in
Section 2, followed by a detailed methodology on the creation of paradigms to generate morphological analyses in Kannada presented in
Section 3. Based on these discussions, the implementation details are explained in
Section 4, followed by an evaluation of the results in
Section 5. Finally, the conclusion and future scope of the proposed work are dealt with in
Section 6.
2. Related Work
Indian languages can be classified into the following language families in particular: Indo-European (Hindi, Marathi, Urdu, Gujarati, Sanskrit), Dravidian (Kannada, Tulu, Tamil, Telugu, Malayalam), Austroasiatic (Munda in particular), and Sino-Tibetan (Tibeto-Burman) [
10]. Indo-Aryan languages are a subgroup of the Indo-Iranian branch of the Indo-European language family; they are morphologically rich and share several commonalities. As a result, this literature survey examines languages from both the Indo-Aryan and Dravidian language families as source languages for morphological analysis.
2.1. Morphological Analyzer for Indo-Aryan Languages
Goyal and Lehal proposed a morphological analyzer and generator for Hindi based on a dictionary-reliant approach [
20]. All possible word forms for every root word are stored in a database. The main focus was on nouns and other common word classes, while proper nouns were largely excluded. The approach trades memory space for speed and accuracy, which is a drawback.
Malladi et al. developed a statistical morphological analyzer trained on the Hindi tree bank (HTB) [
21]. The analyzer identifies the lemma, gender, number, person (GNP), and case marker for every word in a given sentence by training separate models on the Hindi tree bank for each of them. Other grammatical features such as TAM (tense, aspect, and modality) and case are analyzed using heuristics on fine-grained POS tags of the input sentence. This analyzer achieved an accuracy of 82.03% compared with the Paradigm-Based Analyzer (PBA) for lemma, gender, number, person, case, and TAM. As a statistical model, it could analyze out-of-vocabulary words. The prediction of gender, number, and person of words in their sentential context could have been better if dependency relations were given as inputs. However, the standard natural language analysis pipeline forbids using parsing information during morphological analysis.
Bapat et al. presented a morphological analyzer for Marathi using a paradigm-based approach considering only inflections [
22]. The classification of postpositions and the development of morphotactic FSA was one of the important contributions, as Marathi has complex morphotactics. Though improvement was shown in shallow parsing, the morphological analyzer does not handle the derivation morphology and compound words.
Jena et al. developed a morphological analyzer using Apertium for Oriya [
17]. The analyzer data conformed to Apertium’s
dix format, and the dictionary showed correspondences between surface and lexical forms for any given word. The paper lacks a detailed analysis of the inflections for nouns and verbs. In the case of verbs, a few verbs did not fall into any of the existing verb paradigms, which can be ascribed either to the lower robustness of the paradigms or to the need for separate paradigms for these verbs, which is a shortcoming. Causative verbs and verbal complexes were not handled and remain unanalyzed.
From the literature survey, it is observed that the morphological analyzers developed for Indo-Aryan languages are limited to inflectional morphology and do not handle derivational morphology or compound words. Most of the methods use a dictionary-based approach and consume considerable memory, as all the affixes and stems are stored.
2.2. Morphological Analyzer for Dravidian Languages
Kumar et al. developed a morphological analyzer for the Tamil language using a sequence labeling approach of machine learning [
23]. A Support Vector Machine was used for classification in the morphological analysis. Their work employs machine learning techniques in data preparation without a morpheme dictionary, with the system trained on morpheme boundaries. Jayan et al. surveyed three major methods for developing a morphological analyzer in Malayalam, viz. paradigm-based, suffix-stripping, and hybrid [
24] methods. The availability of morphological paradigms and classification was a major issue in developing a morphological analyzer.
Prathibha and Padma proposed an analyzer only for Kannada verbs [
25]. They created verb, suffix, and root databases and used a hybrid approach with suffix-stripping for the paradigms. Although they used transliterated text, there was no proper standardized notation mentioned in their model. Padma et al. proposed a rule-based stemmer, an analyzer, and a generator for nouns in the Kannada language [
26]. Their model made use of noun–suffix- and noun–noun-based dictionaries, which were implemented using suffix stripping. Their work was restricted to only nouns, whereas other parts of speech (POSs) present in the Kannada language were ignored. Veerappan et al. proposed a rule-based morphological analyzer and generator using Finite State Transducer (FST) [
15]. The system was implemented using the AT&T Finite State Machine. The lexicon data and the analyzer are not openly available, and the testing was performed on random samples.
Prathibha et al. proposed a morphological analyzer for Kannada using a hybrid approach [
27]. The affixes are stripped, followed by the use of a paradigm-based approach. A rule-based method is employed in the creation of the lexicon. A questionnaire-based approach was used to make new entries into the lexicon. Kannada’s derivational words and foreign words were not considered for nouns, and for verbs, multiple suffixes were not considered. Shambhavi et al. proposed a morphological analyzer and generator for Kannada using the
Trie data structure in a paradigm-based approach [
28]. They constructed individual
tries for handling suffixes corresponding to each paradigm class and linked them to the roots that could be mapped to the same paradigm. A limitation of their work was the high memory consumption.
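The following minimal sketch (in Python) illustrates the general idea of storing suffixes in a trie indexed from the end of the word so that analysis walks the word right to left; the suffixes and feature labels are placeholders, and the exact structure used in [28] may differ.

```python
# A minimal sketch of suffix lookup with a trie built over reversed suffixes,
# in the spirit of the paradigm-wise tries described in [28]. Suffixes and
# feature labels are placeholders; the original data structure may differ.

def build_suffix_trie(suffix_features):
    root = {}
    for suffix, feats in suffix_features:
        node = root
        for ch in reversed(suffix):          # index suffixes from the word's end
            node = node.setdefault(ch, {})
        node["$"] = feats                    # mark the end of a stored suffix
    return root

def longest_suffix_match(trie, word):
    """Walk the word right to left and return the longest matching suffix."""
    node, best, matched = trie, None, ""
    for ch in reversed(word):
        if ch not in node:
            break
        node = node[ch]
        matched = ch + matched
        if "$" in node:
            best = (matched, node["$"])
    return best

trie = build_suffix_trie([("alli", "locative singular"), ("galu", "nominative plural")])
print(longest_suffix_match(trie, "maradalli"))   # ('alli', 'locative singular')
```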
Murthy proposed a network and process model for a morphological analyzer and generator [
29]. In this approach, the Sandhi (euphonic change) between the root and affixes was treated as a process, while the formation of morphemes was treated as a network modeled using a finite automaton, following a rule-based approach. A limitation of rule-based approaches is their inability to apply the rules to new words that were not part of the lexicon. Anitha et al. proposed a morphological analyzer for the Kannada language [
30] by employing machine learning techniques such as Support Vector Machines on printed Kannada text. They extracted the words using OCR and performed morphological analysis. A custom dataset was developed using Romanized Kannada text without any standard notation, and it is not publicly available. No explicit algorithm was provided for handling grammatical features or for performing the morphological analysis.
Most of the methods employed for Dravidian languages are rule-based and make use of a dictionary. The creation of paradigms poses a challenging task for this kind of morphological analysis. As the language evolves and new words are added to the dictionary, rule-based methods may struggle to generate accurate analyses and consequently fail to adapt to these changes.
2.3. Morphological Analyzer Developed Using Neural Networks
Karahan et al. proposed TransMorpher [
31], a two-level analyzer for the Turkish language that consists of a rule-based phonological normalization module and a sequence-to-sequence character translation module developed using the Transformer architecture. The model was extended to other languages, but the results did not match those of state-of-the-art models. Premjith et al. proposed a deep learning approach for Malayalam morphological analysis [
18] at the character level. Recurrent Neural Networks (RNNs), Long Short-Term Memory (LSTM), and Gated Recurrent Units (GRUs) were used to learn the rules for identifying morphemes automatically and segmenting them from the original word. Further, each morpheme was analyzed to identify the grammatical structure of the word. They made use of Romanized scripts and generated inflections based on paradigms prior to analysis, where the development of the paradigms was a challenging task. Dasari et al. proposed
Transformer-based Context Aware Morphological Analyzer for Telugu [
32]. They experimented with available Telugu Transformer models and existing multilingual Transformer models, of which Bert-Te outperformed the other multilingual models.
Most of the neural methods employed for the above languages were based on a dataset created beforehand using rules/paradigms, on which the neural architectures were trained to generate the morphological analysis. Similar approaches can be extended to low- and extremely low-resource languages.
Among the state-of-the-art Kannada morphological analyzers, a lack of standardization in transliteration has been observed; rule-based methods have not been evaluated on benchmark datasets, and open-source platforms were not used. Despite several efforts in the morphological analysis of Kannada, the evaluations tend to be ad hoc, without adherence to standard metrics or validation against benchmark datasets. The following points serve as motivation to carry out the task of creating a morphological analyzer in the Kannada language:
- 1.
It is observed from the literature that there is a lack of a morphological analysis dataset based on a standard dictionary.
- 2.
The existing rule-based methods do not generate analyses for words outside the dictionary, so the morphological analysis of new words is not produced at all.
- 3.
The current dataset creation strategies struggle to produce valid input datasets for deep learning models and hence demand additional effort.
5. Discussion and Analysis of Results
The Kannada dictionary dataset with 11,005 words is considered in this study. After pre-processing, a total of 9879 stems are extracted [
37]. The stems are classified into nouns, verbs, indeclinables, adjectives, numeric words, and pronouns. Nouns are segregated by gender: masculine, feminine, and neuter. Gender is assigned to nouns manually, as described in
Table 10. A total of 29 paradigms are formed for nouns and 42 for verbs.
In addition to the nouns and verbs, 12 pronoun stems and their paradigms for cases are coded manually. The details describing the stems and their genders are shown in
Table 11.
Sample outputs generated from the rule-based part of the developed morphological analyzer, with details of their analysis, are shown in
Table 12. The noun
xeviyalli is correctly categorized as singular, locative, and feminine. Similarly, words in the other POS categories are also correctly analyzed.
Some nouns do not follow the paradigm rules, and the inflections generated for them are not the actual words in use; hence, separate exception paradigms have been written for them. Whether a generated form is semantically correct depends on actual usage in the language, as shown in
Table 13.
The word
aNNa can follow the paradigm for
huduga, since both end with the same letter, but the proper treatment is to create a new exception paradigm for
aNNa. The incorrect usage is shown in
Table 13 for the plural forms corresponding to the paradigm
huduga, and the correct usages are those listed under the exception paradigm column. There are a total of 35 words that are exceptions to the defined paradigms; these were identified manually and added to the exceptions list with a separate paradigm. The proposed work covers all the vowel (
svara) ending nouns, verbs, numbers, adjectives (
guNavAcaka), and indeclinables (
avyaya) in the Kannada dictionary by Subbanna and Madhava [
37], along with manually identified pronouns. The generated analysis can be used in machine translation to enhance the quality of translated text, which is a possibility that needs to be explored.
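A minimal sketch of how such exception handling can be ordered relative to regular ending-based paradigm selection is given below (in Python); the stems are taken from the example above, but the paradigm names and the lookup logic are assumed for illustration and do not reproduce the exact implementation.

```python
# A minimal sketch of exception-aware paradigm selection: check the explicit
# exceptions list before falling back to the ending-letter rule. Paradigm
# names and contents are illustrative; the actual implementation may differ.

EXCEPTION_PARADIGMS = {"aNNa": "aNNa_exception"}   # ~35 such stems in this work

def select_paradigm(stem):
    if stem in EXCEPTION_PARADIGMS:                # exceptions take priority
        return EXCEPTION_PARADIGMS[stem]
    if stem.endswith("a"):                         # regular vowel-ending class
        return "huduga_like"
    return "default"

print(select_paradigm("aNNa"))     # aNNa_exception
print(select_paradigm("huduga"))   # huduga_like
```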
Using the Transformer architecture, the generated morphological analysis was evaluated for 1000 inflected nouns, and the results are shown in
Table 14. The test set was independent of the inflectional words present in the training or validation set. Hence, the test was performed on new data that the model had not seen. The Precision, Recall, and F1 scores for steps 500, 5000 and 10,000 were calculated by varying the number of sentences, as shown in
Table 15. The graphical representation of the overall scores is shown in
Figure 11.
It is observed from
Table 14 that the best values for Precision, Recall, and F1 Scores are 0.924, 0.925, and 0.925, respectively. These values are obtained by testing the generated morphological analysis of 1000 inflections.
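The following sketch (in Python) shows one plausible way such scores can be computed, treating each analysis as a bag of tokens (stem plus feature tags) and micro-averaging over the test set; the tokenization and averaging scheme are assumptions for illustration, not the exact evaluation protocol used here.

```python
# A minimal sketch of token-level Precision/Recall/F1 over generated analyses.
# Splitting an analysis into stem + feature tags and micro-averaging are
# assumptions for illustration, not the paper's exact evaluation protocol.

from collections import Counter

def prf1(gold_analyses, pred_analyses):
    tp = fp = fn = 0
    for gold, pred in zip(gold_analyses, pred_analyses):
        g, p = Counter(gold.split()), Counter(pred.split())
        overlap = sum((g & p).values())
        tp += overlap
        fp += sum(p.values()) - overlap
        fn += sum(g.values()) - overlap
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

gold = ["mara <locative> <singular>"]
pred = ["mara <locative> <plural>"]
print(prf1(gold, pred))   # (0.666..., 0.666..., 0.666...)
```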
The model is trained for 10,000 steps, and evaluation measures such as Precision, Recall, and F1 score are analyzed at 500, 5000, and 10,000 steps while incrementing the dataset by 100 sentences. It is observed from
Table 15 that the scores on the test data of 1000 inflections are 0.011 higher at the initial step, but at the 10,000th step, when training is complete, Precision, Recall, and F1 score values of 0.924, 0.925, and 0.925 are obtained.
Perplexity scores of language models serve as indicators of their language processing efficacy. In the case of morphological analysis, a low perplexity score indicates that the model has been trained well and has captured the structure of the words in the language, whereas a model with a higher perplexity score is less reliable. The perplexity score can be considered a direct measure of the model's linguistic competence, with lower scores indicating superior language processing capabilities [
41]. A perplexity score of one means that the model perfectly predicts the output given the input, while higher scores indicate worse performance. The perplexity score for the generated morphological analysis of nominal inflections is shown in
Figure 12.
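For reference, perplexity is the exponential of the average negative log-likelihood the model assigns to the target tokens, as in the following sketch (in Python); the token probabilities are invented for illustration.

```python
# A minimal sketch of how perplexity relates to per-token probabilities:
# perplexity = exp(mean negative log-likelihood). Probabilities are invented.

import math

def perplexity(token_probs):
    nll = [-math.log(p) for p in token_probs]
    return math.exp(sum(nll) / len(nll))

print(perplexity([0.9, 0.9, 0.9, 0.9]))   # ~1.11, close to the stable value reported
print(perplexity([1.0, 1.0, 1.0, 1.0]))   # 1.0, a perfect prediction
```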
The perplexity score indicates how well the model predicts the next token. In
Figure 12, the perplexity score at every 500th step is plotted; a stable value of 1.11 is maintained between the 2500th and 8000th steps, after which it rises to 1.15 before dropping back to 1.11 at the 10,000th step. The proposed model is compared with existing Kannada morphological analyzers and summarized in
Table 16.
The approaches discussed in the literature show that the majority of methods for developing a Kannada morphological analyzer are either rule-based or learning-based. Rule-based analyzers that consider only a smaller subset of nouns or verbs are less useful in practical scenarios, as unknown words cannot be categorized. The lack of a standard dictionary was a major drawback in the literature, which the proposed morphological analyzer overcomes. As the language evolves, new words continue to be added to the dictionary, yet the handling of such out-of-vocabulary words is not addressed in the literature; the proposed model can handle new nouns and is based on reliable standard dictionary data that can be expanded as new words are added. The proposed hybrid method is therefore suitable for generating analyses of inflected Kannada words, and it can easily be applied to any low-resource language to create a dataset and generate morphological analyses.
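A minimal sketch of the dispatch implied by this hybrid design is given below (in Python); the function names, interfaces, and stand-in components are assumptions for illustration, not the actual implementation.

```python
# A minimal sketch of hybrid dispatch: use the rule-based lookup for words
# covered by the dictionary/paradigms, and fall back to the neural model for
# out-of-vocabulary inflections. Function names and interfaces are assumed.

def hybrid_analyze(word, rule_based_lookup, neural_model):
    analysis = rule_based_lookup(word)       # fast path: known stems/paradigms
    if analysis is not None:
        return analysis
    return neural_model(word)                # fallback: unseen (OOV) inflections

# Example with stand-in components:
rule_table = {"maradalli": "mara <locative> <singular>"}
analysis = hybrid_analyze(
    "maradalli",
    rule_based_lookup=rule_table.get,
    neural_model=lambda w: f"<analysis predicted by Transformer for {w}>",
)
print(analysis)
```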
Limitations and Challenges with Building the Morphological Analyzer
The morphological analyzer developed for Kannada required a strong open-source Kannada dictionary containing the majority of words used in the everyday vocabulary. The creation of paradigms was an uphill task that needed expert supervision prior to the implementation stage. Kannada, being morphologically rich, poses challenges when generating inflections based on linguistic features like gender, number, tense, and case. Verbs were harder to analyze than nouns, as verbs yielded a greater number of inflections. The exceptional cases required careful handling, as they could otherwise be mis-categorized. In the neural part, training depends on the curated data, so data cleaning and transformation were crucial.
6. Conclusions
A hybrid morphological analyzer leveraging the power of rule-based and Transformer-based architectures has been built for a low-resource language, Kannada. The dataset provided by Kannada Vishwa Vidyalaya, Hampi covers the major words of the modern Kannada language, for which the rule-based morphological analyzer has been built across all POS categories. The morphological analyzer generates inflectional analyses for the nouns, pronouns, verbs, indeclinables, and adjectives present in the dictionary based on predefined rules. As most Kannada nouns could be categorized into the paradigms defined in this paper, the Transformer training focused on inflectional morphology for nouns. This work can further be extended to verbal inflections by curating an extensive dataset capturing the inflectional rules of verbs, as was carried out for nouns.
The morphological analyzer covers the majority of common words in Kannada, and the analysis is helpful in developing language-related tools such as chunkers, POS taggers, stemmers, Named Entity Recognition (NER) systems, and machine translation systems for Kannada. The scope of this work is limited to inflectional morphology, so compound words and derivational morphology are not included. At present, nouns ending with consonants and words borrowed from English, Hindi, Arabic, Oriya, Gujarati, Tamil, Tulu, Telugu, Punjabi, Parsi, Marathi, Malayalam, Sindhi, and Sanskrit are not handled. The proposed research is challenging due to the complex structure of the Kannada language. A total of 9879 words have been mapped onto 85 paradigms, and 205,659 word inflections have been analyzed. Any new nominal inflection can be analyzed using the proposed model, which additionally contributes to extracting linguistic information. We observe that the rule-based model retrieves morphological analyses more quickly for words present in the corpus, while the Transformer-based model excels at generating analyses for new words not found in the dictionary. The Precision, Recall, and F1 score obtained on the test set of Kannada inflections are 0.924, 0.925, and 0.925, respectively. The inflections tested had no overlap with data seen during training, and the results are promising. This study can be extended to a wider dictionary for further enhancements. The proposed technique is suitable for low-resource languages to create their own unique datasets, which is beneficial in developing language-specific NLP tools. Better results can be expected with a larger dictionary of root words, and the data and paradigms can be widely extended. It will be interesting to pursue word analysis in derivational morphology and consonant-ending nouns of the ancient Kannada language in future work.