Computational Infrastructure for Modern Greek: From Grammar to Ontology

Samaridi, Nikoletta E.; Karanikolas, Nikitas N.; Papakitsos, Evangelos C.; Skourlas, Christos

doi:10.3390/computation13110272

by Nikoletta E. Samaridi ¹, Nikitas N. Karanikolas ¹, Evangelos C. Papakitsos ²,* and Christos Skourlas

¹ Department of Informatics and Computer Engineering, University of West Attica, A. Spyridonos Str., 12243 Egaleo, Greece
² Department of Industrial Design and Production Engineering, University of West Attica, Thivon 250, 12241 Egaleo, Greece
* Author to whom correspondence should be addressed.

Computation 2025, 13(11), 272; https://doi.org/10.3390/computation13110272

Submission received: 6 October 2025 / Revised: 29 October 2025 / Accepted: 7 November 2025 / Published: 19 November 2025

(This article belongs to the Special Issue Recent Advances on Computational Linguistics and Natural Language Processing)


Abstract

This study presents a comprehensive NLP infrastructure for Modern Greek that bridges grammatical analysis and ontological representation, integrating linguistic theory, algorithmic modeling, and semantic structuring within a unified computational framework. Modern Greek remains under-resourced in terms of interoperable linguistic tools, with existing systems addressing morphology, syntax, or semantics separately. The proposed system fills this gap by connecting data extraction, morphological generation, and syntactic analysis within a single executable environment, validated through diagnostic evaluation and ontology consistency checking. Linguistic data were automatically retrieved from the Dictionary of Standard Modern Greek (55,101 entries), processed through Python-based algorithms, and integrated into an ontology in Protégé. The resulting ontology captures grammatical and semantic relations with high precision and supports reasoning, interoperability, and linguistic knowledge expansion. The system transforms grammatical knowledge into ontological knowledge, providing a scalable linguistic foundation for future advancements in Natural Language Processing and Understanding for Modern Greek.

Keywords:

part of speech; lemma; morphology; automatic generation; inflected forms; syntax; algorithms; semantics; ontology; NLP/NLU

1. Introduction

The computational processing of Modern Greek poses particular challenges due to its complex and multilayered morphology, which makes the development of reliable tools in computational linguistics, Natural Language Processing (NLP), and Natural Language Understanding (NLU) especially demanding. The diversity of grammatical forms and the structural flexibility of the language affect both syntax and semantics, requiring precise modeling of inflectional and syntactic rules. The central role of morphological structure in the analysis and interpretation of Greek highlights the need for computational systems capable of handling complex morphological and syntactic phenomena in applications such as text analysis, information retrieval, and natural language understanding.

In response to these needs, the computational system presented in this study performs four (4) core functions:

(a): It extracts linguistic data (data mining) (Appendix A, item 1) from online structured or semi-structured sources, with the primary source being the electronic Dictionary of Standard Modern Greek (DSMG) [1] developed by the Manolis Triantafyllidis Foundation;
(b): It automatically generates all inflected forms of the inflectional parts of speech in Greek, based on inflectional paradigms, i.e., representative words serving as declension models for other words with similar morphological characteristics, as well as on inflectional categories, as documented both in the Grammar of Modern Greek (GMG) by Manolis Triantafyllidis [2] and in the same dictionary [1] (Appendix A, item 2);
(c): It identifies and analyzes grammatical features (part of speech, gender, number, case, etc.), which are then used by specially designed algorithms to detect syntactic structures within a sentence (subject, verb, direct and indirect object), based on Relational Grammar (Appendix A, item 3);
(d): It organizes the lemmata and their morphological and syntactic features into an ontology within the Protégé platform.

All these core processes are interconnected within a unified processing pipeline that links the linguistic, syntactic, and ontological modules of the system, as further described in the Methodology section. The combined use of the above functions enables the transition from morphological analysis to higher levels of language processing. The system’s ability to generate all inflected forms, perform morphosyntactic analysis at the sentence level, and integrate data into semantic models (e.g., OWL) through Protégé enhances interoperability and ensures compatibility with modern research infrastructures based on ontological modeling. The proposed framework combines detailed grammatical processing, user-friendly design, and expandability toward ontological representations, constituting a significant contribution to the field of language technology. It is not a conceptual design but an operational, fully implemented environment that has undergone extensive testing and internal validation, combining algorithmic precision with linguistic reliability.

2. Conceptual Framework and Related Research

Research in the computational processing of the Greek language has made significant progress over the past decades, with a particular focus on morphological analysis, part-of-speech (PoS) recognition, and syntactic annotation. Initiatives such as the electronic DSMG, developed by the Institute of Modern Greek Studies at the Aristotle University of Thessaloniki [1], have served as important sources of lexical and morphological data. At the same time, the ILSP NLP Suite of the “Athena” Research Center [3,4,5], and the AUEB NLP Group (2025) [6,7,8] have provided advanced tools for natural language processing, including PoS tagging, semantic labeling, and dependency parsing.

With regard to morphological resources, the Modern Greek NLP Resources project [9,10] played a key role in the exploitation of Greek morphology. Internationally, the UniMorph project [11] provides standardized morphological data for many languages, including Greek, aiming at interoperability and the comparative study of inflectional systems. At the level of computational models, neural transformers such as Greek BERT [12] have enabled high performance in tasks, such as named entity recognition and word classification. However, they do not support internal integration with syntactic and semantic layers of analysis.

Compared to well-resourced languages such as English, French, or German, Greek remains less supported in terms of large-scale linguistic infrastructures, annotated corpora, and interoperable lexical resources. As noted by [13], the available language resources for Greek are considerably fewer and less interconnected, which limits the range and scalability of natural language processing applications. While languages like English benefit from mature ecosystems of integrated NLP tools (such as Stanford CoreNLP, spaCy, or the Universal Dependencies (UD) corpora with extensive semantic annotations), the corresponding resources for Greek are smaller in scale and less systematically maintained. This asymmetry has historically restricted the direct application of multilingual or cross-lingual tools to Greek, making it essential to develop specialized infrastructures that will strengthen the language’s position within the international field of Computational Linguistics and ensure its long-term digital sustainability.

Furthermore, most approaches separate morphological and syntactic analysis, without implementing an integrated connection of the data with semantic structures and ontological representations. In English, the linking of grammatical data with RDF or OWL models has led to resources such as WordNet [14] and FrameNet [15]. However, such holistic implementations remain limited in the case of the Greek language. Although recent initiatives, such as [16], have implemented ontologies on platforms like Protégé, covering morphology, syntax, semantics, and phonetics, the capability for automatic generation of all inflected forms and their dynamic integration with syntactic analysis is still lacking.

This gap is precisely what the proposed computational system seeks to address. The system introduces an innovative, ontology-based framework that integrates:

(a): Automatic linguistic data extraction from static and dynamic sources (data mining);
(b): Generation of inflected forms based on typological paradigms;
(c): Syntactic processing grounded in Relational Grammar;
(d): Ontological modeling in Protégé using OWL/RDF representation.

By incorporating morphological, syntactic, and semantic processing within a unified and fully automated environment, the proposed system fills a major gap in Greek language technology and contributes to the advancement of computational linguistics and its applications in cognitive technology, offering a flexible and scalable solution.

To the best of our knowledge, the only prior large-scale lexical resource providing structured semantic representation for Modern Greek was the Greek WordNet. However, as this resource is no longer accessible or actively maintained, direct benchmarking or quantitative comparison is infeasible.

A comparative overview of existing Greek NLP systems and the proposed framework is presented in Table 1, which summarizes their main functionalities and integration levels across morphology, syntax, and semantics.

The table summarizes the main functionalities and integration levels of existing Greek NLP tools compared with the proposed system across morphology, syntax, and semantics.

Among the various theoretical frameworks for syntactic representation, Relational Grammar was selected as the underlying approach because it provides a relational rather than dependency-based view of syntax, emphasizing grammatical relations such as subject, object, and indirect object as primary structural entities. This approach allows for a more transparent mapping between syntactic roles and ontological relations, which is particularly advantageous for the morphologically rich and flexible structure of Modern Greek. While Universal Dependencies (UD) offers a widely adopted and standardized scheme for cross-linguistic annotation, its surface-oriented dependency arcs are less suitable for representing the deep grammatical relations required for ontology-based modeling. Nevertheless, the proposed framework remains compatible with UD principles, ensuring interoperability with current annotation and parsing conventions.

3. Methodology

The methodology followed for the development of the proposed system was structured in successive phases of design, implementation, and integration, with a central focus on achieving interoperability between linguistic data and semantic representations through the balanced utilization of morphological knowledge, syntactic analysis, and ontological modeling.

A decisive factor in the success of the project was the selection of tools that ensured scalability and adaptability to the specific features of Modern Greek. The Python (3.12) programming language was chosen due to its high-level abstraction, which enables the handling of complex computational problems through modular routines, and for its extensive library ecosystem for NLP applications. In parallel, spreadsheet software (Microsoft Excel) was employed for the structured storage and management of linguistic data during the system’s development phase. The spreadsheet followed a fixed schema defining column names, data types, and validation constraints (e.g., mandatory lemma and PoS fields, uniqueness checks, and consistency validation scripts). Python routines were used to automatically detect and prevent data type coercion or formatting errors. This lightweight environment facilitated rapid prototyping and manual verification of linguistic entries before their synchronization with the OWL data model in Protégé. Future iterations of the system will migrate the validated schema to a transactional and version-controlled database (e.g., SQLite or RDF triple store) to ensure long-term maintainability and scalability.
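The schema checks described above can be illustrated with a minimal sketch. It assumes that the spreadsheet rows have already been read into Python dictionaries. The function name `validate_rows`, the choice of the (lemma, part of speech) pair as uniqueness key, and the sample rows are illustrative assumptions, not the authors' actual validation scripts.

```python
from collections import Counter

# Mandatory lemma and PoS columns (names taken from the column list in Section 4.1).
REQUIRED_FIELDS = ("word", "part_of_speech")


def validate_rows(rows):
    """Return a list of human-readable problems found in the rows."""
    problems = []
    for i, row in enumerate(rows, start=1):
        for field in REQUIRED_FIELDS:
            value = row.get(field)
            # Reject missing, blank, or non-text values (e.g. a cell coerced to a number).
            if not isinstance(value, str) or not value.strip():
                problems.append(f"row {i}: missing or non-text '{field}'")
    # Uniqueness check; using (lemma, PoS) as the key is an assumption of this sketch.
    keys = Counter((row.get("word"), row.get("part_of_speech")) for row in rows)
    for key, count in keys.items():
        if count > 1:
            problems.append(f"duplicate entry {key}: {count} rows")
    return problems


sample_rows = [
    {"word": "φιλία", "part_of_speech": "noun"},
    {"word": "φιλία", "part_of_speech": "noun"},   # duplicate
    {"word": "", "part_of_speech": "verb"},        # blank lemma
    {"word": 3.0, "part_of_speech": "noun"},       # coerced data type
]
for problem in validate_rows(sample_rows):
    print(problem)
```

Returning a list of problems instead of raising on the first one suits the manual verification step mentioned above, since a reviewer can correct all flagged entries in a single pass.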

The theoretical foundation of the system combines Relational Grammar for the analysis of syntactic relations with OWL/RDF standards for semantic representation, forming a coherent linguistic modeling framework. The research project was conceived as a multi-stage processing pipeline that integrates all modules into a unified workflow, designed not only to implement a morphological and syntactic parser but also to embed it in an ontological environment capable of supporting cognitive technology applications.

The system’s architecture follows a four-stage pipeline. The first stage involves data extraction and normalization from structured and semi-structured linguistic resources, primarily the DSMG. The second stage performs morphological generation based on structured rule patterns derived from the GMG and the same dictionary, incorporating exception rules for irregular verbs and a mechanism for resolving ambiguity among homographs through unique lemma identifiers. The third stage focuses on morphosyntactic analysis at the sentence level, applying algorithms grounded in Relational Grammar to identify the core grammatical relations—Subject, Verb, and Object—that constitute the backbone of Modern Greek syntax. The final stage integrates the processed linguistic data into the ontology and formally validates their logical consistency through OWL/RDF reasoning tools such as the Pellet reasoner.

Although the individual components were developed and executed in separate environments, the pipeline functions as a coordinated integration workflow. The output of each phase systematically feeds into the next through standardized data formats and controlled import–export procedures, ensuring structural and conceptual consistency across modules and supporting the interoperability required for a unified infrastructure for Modern Greek language processing.

In this way, the present work does not merely aim to implement yet another language processing tool but rather proposes a methodological model for the unified treatment of morphological, syntactic, and semantic requirements in morphologically rich languages such as Greek.

4. Linguistic Resource Extraction and Ontology Population in Protégé

4.1. Linguistic Data Extraction

As an online source for entries in Modern Greek, the “Dictionary of Standard Modern Greek” by Manolis Triantafyllidis was selected. This resource is freely available in electronic form on the “Portal for the Greek Language” website (https://www.greek-language.gr/greekLang/modern_greek/tools/lexica/triantafyllides/index.html, accessed on 14 August 2025) and is maintained by the Manolis Triantafyllidis Foundation (Aristotle University of Thessaloniki) for academic use under institutional permission granted to the authors. It offers extensive and reliable documentation of Modern Greek, making it an ideal reference for the evaluation and analysis of linguistic features. To extract and process the data from this dictionary, a Python-based application was developed. This application sends HTTP requests to the website, extracts the entries as structured data, and stores them in a .csv file. During the extraction of an entry/lemma, the application organizes the information into specific columns, structured so that each entry can later be ontologically classified according to its part of speech. For each lemma, the following properties are clearly defined:

“word”: the base form of the lemma, e.g., φιλία [filía] (friendship);
“total_word”: the word along with its article, if applicable, e.g., η φιλία [i filia] (the friendship);
“vocal_transcription”: the phonetic transcription of the word, e.g., [filía];
“etymology”: the origin of the word, e.g., λόγ. < αρχ. φιλία;
“category”: the inflectional category of the word, e.g., O1, O2, O3…, Ε1, Ε2, Ε3…, Ρ1, Ρ2, etc. (Appendix A—item 4);
“part_of_speech”: the grammatical category of the word, e.g., article, noun, adjective, verb, etc.;
“definition”: the definition of the word, e.g., κάλλιο [kallio]→ καλύτερα [kalytera] (better);
“annotation”: notes regarding the word’s usage in context, e.g., colloquial usage;
“male”: gender specification if the word is masculine;
“female”: gender specification if the word is feminine;
“neutral”: gender specification if the word is neuter;
“antonym”: the antonym of the word, e.g., φιλία [filia] (friendship) ≠ έχθρα [echthra] (rivalry);
“synonym”: the synonym of the word, e.g., ανομοιογενής [anomoiogenis] (non-homogeneous) → ετερογενής [eterogenis] (heterogeneous);
“diminutive”: the diminutive form of the word, e.g., γάτα [gata] (cat) → γατούλα [gatoula] (kitty);
“magnifying”: the augmentative form of the word, e.g., γάτα [gata] (cat)→ γάταρος [gataros] (big cat);
“synonymous_phrase”: a phrase synonymous with the word, e.g., τελευταίος [teleftaios] (last) → ο τελευταίος τροχός της αμάξης [o teleftaios trochos tis amaxis] (“the last wheel of the chariot” = the least important person in the room);
“phrase”: idioms or fixed expressions in which the word appears, e.g., άρες μάρες [ares mares] → άρες μάρες κουκουνάρες [ares mares koukounares] (utter nonsense);
“saying”: proverbs in which the word is used, e.g., κάλλιο [kallio]→ κάλλιο πέντε και στο χέρι παρά δέκα και καρτέρει [kallio pente kai sto cheri para deka kai karterei] (a bird in the hand is worth two in the bush);
“expression”: expressions in which the word appears, e.g., τελευταίος [teleftaios] (last) → τελευταίος και τυχερός [teleftaios kai tycheros] (lucky last);
“suffix”: when the word functions as a suffix, e.g., -άδα [-ada];
“prefix”: when the word functions as a prefix, e.g., αμφί- / αμφι- [amfi];
“has_as_female”: when the word has a feminine form in addition to the masculine, e.g., ο μακιγιέρ [o makigier] → η μακιγιέζ [i makigiez] (the makeup artist);
“has_as_male”: when the word has a masculine form in addition to the feminine, e.g., η γαζώτρια [i gazotria] → ο γαζωτής [o gazotis] (the stitcher);
“different_form_of_lemma”: when the lemma has an alternative form with the same meaning, e.g., αβάντζο [avantzo] & αβάντσο [avantso] (advantage);
“different_form_of_female”: when the feminine word has another grammatical variant, e.g., η γιατρός [i giatros] & η γιατρίνα [i giatrina] (the doctor);
“different_form_of_male”: when the masculine word has another grammatical variant, e.g., ο γελαδάρης [o geladaris] & ο αγελαδάρης [o ageladaris] (the cowherd);
“different_form_of_neutral”: when the neuter word has another variant, e.g., το αλάφι [to alafi] & το λάφι [to lafi] (deer).

By extracting entries from the electronic dictionary, the application organizes the data in a structured tabular format, assigning each linguistic feature to a separate column in the file. The outcome of this process is a .csv file consisting of twenty-seven (27) columns, each representing one of the features described above. The file contains a total of 55,101 rows, corresponding to the dictionary entries, thus covering the majority of the content included in the “Dictionary of Standard Modern Greek.” This tabular organization provides a flexible and detailed framework for processing and utilizing linguistic information, facilitating its integration into NLP applications and language infrastructure projects.
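As a rough illustration of the tabular layout described above, the following sketch writes one entry into a CSV with the twenty-seven column names listed in this section. The HTTP retrieval and HTML parsing steps are deliberately omitted, and the helper name `write_entries` and the in-memory buffer are assumptions made for the example, not part of the authors' application.

```python
import csv
import io

# The 27 column names listed in Section 4.1.
COLUMNS = [
    "word", "total_word", "vocal_transcription", "etymology", "category",
    "part_of_speech", "definition", "annotation", "male", "female", "neutral",
    "antonym", "synonym", "diminutive", "magnifying", "synonymous_phrase",
    "phrase", "saying", "expression", "suffix", "prefix", "has_as_female",
    "has_as_male", "different_form_of_lemma", "different_form_of_female",
    "different_form_of_male", "different_form_of_neutral",
]


def write_entries(entries, stream):
    """Write entries as CSV rows; features missing from an entry stay empty."""
    writer = csv.DictWriter(stream, fieldnames=COLUMNS, restval="")
    writer.writeheader()
    for entry in entries:
        writer.writerow(entry)


# A partial entry built from the examples given in the column list.
entry = {
    "word": "φιλία",
    "total_word": "η φιλία",
    "vocal_transcription": "[filía]",
    "part_of_speech": "noun",
    "antonym": "έχθρα",
}

buffer = io.StringIO()
write_entries([entry], buffer)
rows = list(csv.DictReader(io.StringIO(buffer.getvalue())))
print(len(COLUMNS), rows[0]["word"], repr(rows[0]["etymology"]))
```

Keeping every feature in its own column, even when empty, means that each row has the same shape, which simplifies the later mapping of columns to ontology properties.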

4.2. Integration of Linguistic Data into the Protégé Environment

The linguistic data extracted from M. Triantafyllidis’s electronic dictionary and organized from .csv tables into column-based .xls spreadsheets were uploaded automatically into the ‘Ontology of Modern Greek’. This ontology was designed as a component of a broader Ontological Knowledge System for Modern Greek, known as the Multi Solution Ontology, as described in [16]. The adopted methodology includes techniques of preliminary text processing, entity and relation identification, and their systematic ontological integration. The entries listed in the “word” column of the .csv file (see Section 4.1) were imported as individuals of the corresponding lexical classes (e.g., Noun, Verb, Adjective), all organized under the superclass Parts_of_Speech (Figure 1).

The data in the remaining twenty-six (26) columns were uploaded as attributes or properties of these individuals, namely as Annotation Properties or Data Properties, respectively. To automate this process, a script was developed in Python [17] that, for each lemma (word) in the .csv file, reads the column indicating the part of speech and assigns the entry to the appropriate class in the Ontology. It also retrieves the values from the other columns and adds them to the corresponding Data or Annotation Properties. In this way, the ‘Ontology of Modern Greek’ is generated automatically (Figure 2).
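The population logic can be sketched without any ontology library by turning each CSV row into a list of abstract assertions. This is only a schematic view: the tuple format, the `POS_TO_CLASS` mapping, and the decision to treat only the "definition" column as an Annotation Property are assumptions of this example, and the actual script [17] may serialize to OWL in a different way.

```python
# Hypothetical mapping from the part_of_speech column to ontology class names.
POS_TO_CLASS = {"noun": "Noun", "verb": "Verb", "adjective": "Adjective"}

# Assumption: only the definition is stored as an Annotation Property here;
# every other non-empty column becomes a Data Property.
ANNOTATION_COLUMNS = {"definition"}


def row_to_assertions(row):
    """Translate one CSV row into class and property assertions."""
    individual = row["word"]
    assertions = [("ClassAssertion", POS_TO_CLASS[row["part_of_speech"]], individual)]
    for column, value in row.items():
        if column in ("word", "part_of_speech") or not value:
            continue  # skip the key columns and empty cells
        kind = (
            "AnnotationAssertion"
            if column in ANNOTATION_COLUMNS
            else "DataPropertyAssertion"
        )
        assertions.append((kind, column, individual, value))
    return assertions


row = {
    "word": "μάντης",
    "part_of_speech": "noun",
    "vocal_transcription": "[mándis]",
    "category": "O10",
    "has_as_female": "μάντισσα",
    "definition": "diviner",  # placeholder gloss, not the dictionary text
    "antonym": "",            # empty cells produce no assertion
}
assertions = row_to_assertions(row)
for assertion in assertions:
    print(assertion)
```

An intermediate list of assertions like this can be inspected or counted before anything is written to the ontology file, which is convenient when tens of thousands of entries are imported in one run.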

For example, the following figure (Figure 3) presents the entry μάντης [mandis] (‘diviner’), which was automatically uploaded from the LinguisticData.csv file into the Protégé ontological platform:

(a): The entry μάντης belongs to the Noun class of the ontology and is represented as an individual of that class.
(b): The Annotation Properties of the ontology display the definition of the entry, as extracted from the electronic dictionary of M. Triantafyllidis.
(c): The Data Properties present the lexical attributes of the entry (also retrieved from the Triantafyllidis dictionary), including:

(i): Its etymology: μσν. μάντης < αρχ. μάντ(ις) μεταπλ. -ης· μσν. μάντισσα < μάντ(ης) -ισσα,
(ii): Its feminine form: μάντισσα [mantissa],
(iii): Its phonetic transcription: [mándis],
(iv): Its inflectional category: O10.

The outcome of the research at this stage was the construction of an ontology, as described in [16], capable of representing the morphosyntactic and semantic structure of Modern Greek. The ontology was developed in OWL 2 DL, which ensures logical consistency and decidable reasoning through the Pellet reasoner. This profile supports the formal definition of classes, properties, and logical axioms, providing accuracy in modeling and stability in semantic interpretation. Specifically, it comprises:

1364 object classes;
1115 object relations;
51,278 individuals, corresponding to the current entries of the ontological lexicon;
317,776 axioms.

This exceptionally large volume of structured data, along with the extensive number of semantic and structural relations among them, makes the resulting ontology one of the most comprehensive for the Greek language. It provides a valuable foundation for NLP applications that require a high level of semantic precision and linguistic coverage.

Furthermore, the ontology has been conceptually designed to ensure compatibility with the OntoLex-Lemon model proposed by the W3C for linking lexical and ontological resources. The internal schema follows a structural alignment with the core OntoLex-Lemon entities (LexicalEntry, Form, LexicalSense), providing the conceptual basis for future integration and RDF export using standardized linguistic vocabularies.

During the import process, a set of consistency and uniqueness checks was performed to ensure the reliability of the ontology population. Cases of homographs (identical lemmas with distinct meanings) were automatically detected through Excel-based functions and assigned unique lemma identifiers (e.g., α-1, α-2) before integration, resulting in separate lexical entries and distinct ontology individuals/URIs. Furthermore, naming inconsistencies between the Greek and English class labels (e.g., “Oυσιαστικό_Noun”, “Επίθετο_Adjective”) were resolved programmatically to ensure accurate mapping between lexical entries and ontological classes. The ontology was subsequently validated using the Pellet reasoner, confirming logical consistency across class hierarchies and object properties. Future work includes the generation of a detailed coverage table per part of speech and the implementation of SHACL-based integrity constraints for formal validation under the OWL open world assumption.
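The homograph handling described above was carried out with Excel-based functions; the equivalent logic can be sketched in Python as follows. The function name and the choice to leave non-repeated lemmas unnumbered are assumptions of this sketch, since the text only gives the identifier pattern (e.g., α-1, α-2).

```python
from collections import Counter, defaultdict


def assign_lemma_ids(lemmas):
    """Give repeated lemmas numbered identifiers, in order of appearance."""
    totals = Counter(lemmas)   # how often each lemma occurs overall
    seen = defaultdict(int)    # running count per repeated lemma
    ids = []
    for lemma in lemmas:
        if totals[lemma] == 1:
            ids.append(lemma)  # assumption: unique lemmas keep their plain form
        else:
            seen[lemma] += 1
            ids.append(f"{lemma}-{seen[lemma]}")
    return ids


print(assign_lemma_ids(["α", "β", "α"]))
```

Because the numbering follows the order of the rows, the identifiers remain stable as long as the source file is not reordered between imports.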

Beyond structural validation, all lexical data extracted from the Triantafyllidis electronic dictionary underwent a preprocessing and cleaning phase to remove duplicates, incomplete entries, and encoding inconsistencies. The selection criterion involved the complete inclusion of all lexical entries from the Triantafyllidis source, regardless of their morphological or syntactic completeness, in order to achieve the broadest possible lexical coverage of Modern Greek. Each transformation step—from extraction to ontology population—was implemented through reproducible Python scripts, allowing the entire process to be transparently replicated. Collectively, these procedures guarantee the reliability, consistency, and long-term reusability of the linguistic dataset.

5. Automatic Generation of Inflected Forms

As part of the present research, a computational system was developed for the automatic generation of morphological forms in the Greek language, with emphasis on nouns, adjectives, and verbs, both regular and irregular [18]. The system utilizes inflectional paradigms, inflectional patterns (i.e., the endings used to generate the inflected words), and inflectional categories as defined in the Electronic Dictionary and the Grammar of Standard Modern Greek by Manolis Triantafyllidis, ensuring compatibility with established linguistic resources. The implementation is based on a hybrid methodology that combines the automatic extraction of lemmata from the Triantafyllidis Electronic Dictionary (see Section 4.1) with the application of production rules for the reliable generation of inflected forms. The system has been developed in Python and is designed to provide full coverage of inflectional categories, including dedicated modules for irregular forms. It features a generalized, parameterizable architecture suitable for integration into NLP applications.

The system’s morphological component has been designed as a scalable and flexible morphological processing infrastructure, capable of being integrated into broader computational environments. Architecturally, it consists of three main components:

(a): Input data;
(b): Specialized processing functions;
(c): An application programming interface (API).

5.1. Input Data

The input data are organized into two main categories:

(a): files containing complete inflectional paradigms and inflectional patterns;
(b): lexical datasets automatically extracted from the electronic Dictionary of Standard Modern Greek.

The data are stored in spreadsheet format (.xlsx) and have been categorized according to the inflectional categories described in the Dictionary and Grammar of Triantafyllidis. Specifically, the system includes fifty-three (53) noun inflectional categories, seventeen (17) for adjectives, and twelve (12) for verbs, further subdivided into subcategories.

Each input row corresponds to a lemma and includes the following fields:

The root or thematic stem of the word, i.e., the basic morphological element carrying the core semantic content of the word, upon which morphological processes operate to generate its inflected forms (e.g., in the noun άνθρωπος [anthropos] (man), the root is ανθρωπ-);
Prefixes, i.e., morphological elements added to the beginning of the root or stem, which often modify the meaning or express grammatical features such as tense (e.g., in the verb έγραψα [egrapsa] (wrote), the prefix is ε- [e-]);
Suffixes, i.e., elements inserted between the root and the ending that may modify meaning or determine the morphological category of the word; for instance, tense-related suffixes that signal tense or voice (e.g., in the verb κλειδώθηκα [kleidothika] (I was locked), the suffix is -θη- [-thi-]);
Endings, i.e., the final morphological segments of a word indicating grammatical categories, such as person, number, gender, case, tense, mood, or voice. They follow the root or stem (and possibly a derivational affix) and play a decisive role in the word’s inflection (e.g., in the verb βλέπω [vlepo] (see), the ending is -ω [-o]);
Tables of grammatical forms for all cases, numbers, tenses, moods, and voices;
Morphological behavior notes (e.g., stress patterns, root alternations, or multiple variant forms).

This structured representation ensures comprehensive coverage of the morphological variability of the Greek language and supports the reliable processing of both regular and irregular forms. It also defines the inflectional paradigms employed by the system for automatic morphological generation, providing a consistent and machine-interpretable framework for rule application. An example of this representation format is presented in Figure 4, which illustrates the declensional patterns of masculine nouns belonging to categories O8 and O24.
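A paradigm of this kind can be pictured as a stem combined with a table of endings. The sketch below is a deliberately simplified illustration: the `Paradigm` class, the label "DEMO", and the example noun αγώνας (struggle) are chosen for this example only, and the label is not one of the dictionary's inflectional category codes. Stress shifts and root alternations, which the actual system handles, are left out.

```python
from dataclasses import dataclass

CASES = ("nominative", "genitive", "accusative", "vocative")


@dataclass(frozen=True)
class Paradigm:
    """Endings for one inflectional pattern, ordered as in CASES."""
    name: str
    singular: tuple
    plural: tuple


# Hypothetical paradigm for masculine nouns in -ας with fixed stress.
DEMO = Paradigm(
    name="DEMO",
    singular=("ας", "α", "α", "α"),
    plural=("ες", "ων", "ες", "ες"),
)


def decline(stem, paradigm):
    """Attach each ending of the paradigm to the stem."""
    return {
        "singular": dict(zip(CASES, (stem + e for e in paradigm.singular))),
        "plural": dict(zip(CASES, (stem + e for e in paradigm.plural))),
    }


forms = decline("αγών", DEMO)
for number, table in forms.items():
    print(number, table)
```

Any lemma assigned to the same category reuses the same ending table, so only the stem has to be stored per word; this is the idea behind using representative words as declension models.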

5.2. Data Handling and Processing Modules

The processing of linguistic data is carried out through a set of functions and classes written in the Python programming language. These components implement morphological generation rules and handle linguistic information at the level of root, prefixes, suffixes, and inflectional endings. The core mechanisms include:

Morphological structure analysis of words, involving the identification of vowels, diphthongs, syllables, endings, and stress position;
Recognition and segmentation of the root and the ending;
Management of morphological alternations, such as stress shift, root variation, and thematic vowel adjustment;
Transformation of endings with dynamic adaptation to grammatical category contexts;
Data integrity control through internal validation checks for consistency, structural accuracy, and completeness.

Through initialization functions, the system decomposes the data of each lemma, isolates its specific morphological components, and applies the appropriate production rules to generate the complete set of inflected forms, as illustrated in Figure A1 in Appendix B. Specifically, in the case of irregular verbs, processing includes the integration of specialized notes that reflect the variety of possible forms, such as changes in stem, prefix, augment, or ending. The recognition functions distinguish between person and number to handle a range of variation patterns.
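One of the mechanisms listed above, locating and shifting the stress mark, can be sketched with the standard `unicodedata` module. The function names are hypothetical and the sketch works at the level of letters rather than syllables, so it is a simplification of what a full syllable-aware implementation would do.

```python
import unicodedata

# NFD splits a Greek vowel with tonos into the base vowel plus this combining mark.
TONOS = "\u0301"


def strip_stress(word):
    """Remove the stress mark while keeping other diacritics (e.g. dialytika)."""
    decomposed = unicodedata.normalize("NFD", word)
    return unicodedata.normalize("NFC", decomposed.replace(TONOS, ""))


def stress_index(word):
    """Return the index of the stressed letter, or None if the word is unstressed."""
    position = 0
    for ch in unicodedata.normalize("NFD", word):
        if ch == TONOS:
            return position - 1       # the mark follows its base letter
        if not unicodedata.combining(ch):
            position += 1             # count base letters only
    return None


def move_stress(word, index):
    """Place the stress mark on the letter at the given index."""
    bare = strip_stress(word)
    return unicodedata.normalize("NFC", bare[: index + 1] + TONOS + bare[index + 1 :])


print(stress_index("άνθρωπος"), strip_stress("άνθρωπος"), move_stress("άνθρωπος", 4))
```

A helper of this kind is what a stress-shift rule would call before attaching a new ending, as in the genitive ανθρώπου of άνθρωπος, where the stress moves one syllable toward the end of the word.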

5.3. Application Programming Interface

Access to the functionality of the morphological component is provided through a REST API, which enables remote and user-friendly interaction with the system, regardless of platform or programming language. The API accepts as input a lemma, along with its corresponding inflectional class, and returns, in JSON format, the complete set of generated inflected forms (Figure 5a,b). Specifically:

For nouns, a table of grammatical cases and numbers is returned;
For adjectives, a hierarchical structure of forms is returned, organized by gender, number, and case;
For verbs, a dictionary of grammatical features is returned, including tense, mood, voice, person, and number.

The output is structured, consistent, and easily readable by external applications, enabling the integration of the morphological module into platforms, such as dialogue systems, linguistic analysis tools, search engines, and natural language processors.
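To make the shape of such a response concrete, the sketch below builds a JSON payload for a noun as a table of cases and numbers. The field names ("lemma", "inflectional_class", "forms"), the placeholder class label "DEMO", and the function `build_noun_response` are assumptions for illustration; the actual API schema is the one shown in Figure 5.

```python
import json


def build_noun_response(lemma, inflectional_class, forms):
    """Serialize the generated forms of a noun as a JSON document."""
    payload = {
        "lemma": lemma,
        "inflectional_class": inflectional_class,  # placeholder label below
        "part_of_speech": "noun",
        "forms": forms,
    }
    # ensure_ascii=False keeps Greek letters readable instead of \u escapes.
    return json.dumps(payload, ensure_ascii=False, indent=2)


noun_forms = {
    "singular": {"nominative": "αγώνας", "genitive": "αγώνα",
                 "accusative": "αγώνα", "vocative": "αγώνα"},
    "plural": {"nominative": "αγώνες", "genitive": "αγώνων",
               "accusative": "αγώνες", "vocative": "αγώνες"},
}

response_text = build_noun_response("αγώνας", "DEMO", noun_forms)
print(response_text)

# A client in any language can parse the same document back into a structure.
parsed = json.loads(response_text)
```

Nesting the forms by number and then by case mirrors the noun table described above; an adjective response would add a gender level, and a verb response would be keyed by tense, mood, voice, person, and number.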

5.4. System Performance in Automatic Inflectional Generation

The developed computational system for the generation and analysis of morphological forms in the Greek language proved to be highly effective, particularly in the categories of nouns and adjectives. Its rule-based architecture leverages the typological classification of inflected words as recorded in the Grammar and Dictionary of M. Triantafyllidis, ensuring high accuracy and predictability in the reproduction of word forms. The generated output covers nearly the entire range of nominal types included in the dictionary (26,272 nouns and 11,616 adjectives), while for verbs, the system performs satisfactorily in the majority of regular formations (6302 verbs). In contrast, in the case of irregular verbs, the complexity of their morphology resulted in limited output, yielding one hundred and ninety-one (191) representative examples out of a total of 1150 (about 16.6%). Nevertheless, two recurring patterns of variation were identified, depending on the verb’s grammatical person, which opens up possibilities for further systematic treatment.

The evaluation of the system was based on the comparison of the generated morphological forms with the data of the electronic lexicon of M. Triantafyllidis. Since no baseline morphological analyser for Modern Greek is currently available—because most existing systems, such as the ILSP NLP Suite and the Greek WordNet, are either not publicly accessible or do not provide open and standardized inflectional data—the system’s output was qualitatively validated against the Triantafyllidis Electronic Dictionary, which constitutes the most authoritative linguistic resource for the language. The system addresses a critical gap in existing tools for NLP and NLU, which are unable to generate the full range of inflected forms of a lemma or to accurately identify the lemma and the grammatical features of a non-canonical word form. In contrast, the proposed system leverages fundamental knowledge derived from reliable linguistic data, transforms this information into examples of complete inflectional paradigms, and implements the corresponding morphological mechanism algorithmically. The output is organized in a systematic and transparent manner, offering flexibility in handling morphological variation and enabling its application to tasks, such as part-of-speech tagging, syntactic analysis, automatic linguistic normalization, and grammatical error correction. Nevertheless, certain limitations are noted, including the lack of coverage for dialectal or neological usages, as well as the absence of evaluation based on free-form natural language data.

Furthermore, the data produced by the system are transformed into structured linguistic information and integrated into the Protégé platform, contributing to the enrichment of the ontology of Modern Greek. Each inflected form is linked to the corresponding PoS category and its relevant grammatical attributes, allowing the representation not only of the surface form of the word but also of the underlying linguistic knowledge, regarding its position within the morphological system of the language. The incorporation of these data into the ontological structure enhances the semantic interconnectivity of linguistic units and renders the Greek Ontological Dictionary in [19] a reusable and interoperable infrastructure, suitable for research and technological applications in computational linguistics, such as automatic annotation, syntactic parsing, information extraction, and semantic search. As part of future research, comparative benchmarking tests are planned to be developed once compatible datasets become available.

Diagnostic Analysis of Verb Morphology and Error Types

To further evaluate the morphological coverage of verbs, a diagnostic error analysis was conducted on the automatically generated forms. The analysis revealed recurring patterns of rule failure that reflect both the complexity of Greek verb morphology and the structural limitations of dictionary encoding. The most frequent issues concerned the mapping between lemmata and inflectional categories, the treatment of compound verbs, augment placement in past tenses, contraction, and irregular stem alternation.

A common source of error was the incorrect correspondence between a lemma and its inflectional category, particularly in entries that included multiple sub-lemmas joined by the symbol “&”. In such cases, the system initially failed to distinguish separate paradigms. For example, from the dictionary entry μπανιαρίζω –ομαι [baniarizo–omai] (‘to bathe’) Ρ2.1 & μπανιάρω –ομαι [baniaro–omai] (‘to bathe’) Ρ6, the application produced only μπανιαρίζω [baniarizo] (‘to bathe’) as Ρ6. The problem was addressed by adding a pre-processing stage that splits compound entries and maps each one to its appropriate category. Furthermore, several discrepancies between lemma entries and their paradigm classifications, such as λευκαίνω [lefkaino] (‘to make white’) appearing as Ρ7.3α in the paradigm list but Ρ7.2 in the lemma, were corrected through a dictionary override list.
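The pre-processing stage described above can be sketched as follows. This is a minimal illustration that assumes a simplified entry string (lemma, optional passive ending, category code) without transliterations or glosses; the raw dictionary format, the function name, and the regular expression are assumptions, not the authors' implementation.

```python
import re

# Category codes in the examples look like Ρ2.1, Ρ6 or Ρ7.3α: a Greek capital
# rho, digits separated by dots, and an optional lowercase Greek letter.
CATEGORY_RE = re.compile(r"Ρ\d+(?:\.\d+)*[α-ω]?")

def split_compound_entry(entry):
    """Split an entry on '&' and pair each sub-lemma with its own category."""
    pairs = []
    for part in entry.split("&"):
        part = part.strip()
        match = CATEGORY_RE.search(part)
        if not part or match is None:
            continue  # skip fragments that carry no category code
        lemma = part.split()[0]  # first whitespace-delimited token
        pairs.append((lemma, match.group(0)))
    return pairs

# Simplified version of the entry discussed in the text.
entry = "μπανιαρίζω –ομαι Ρ2.1 & μπανιάρω –ομαι Ρ6"
pairs = split_compound_entry(entry)
```

Each sub-lemma now carries its own category, so the two paradigms can be generated separately instead of collapsing into one.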

Another difficulty arose in the representation of the middle voice. The dictionary frequently records only the suffix (e.g., “-ιέμαι” [-iemai]) rather than the full lemma, preventing the system from reconstructing the corresponding middle-voice form, as in πετώ [peto] & -άω, [-ao] -ιέμαι [-iemai] (‘to throw’), which failed to generate πετιέμαι [petiemai] (‘to be thrown’). To address this issue, a reconstruction rule for the complete lemma in the middle voice will be defined in the future, prior to inflection.
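Since the reconstruction rule is described only as planned work, the sketch below is a speculative illustration of one way it could operate: strip a recognized active ending and attach the recorded middle-voice suffix. The list of active endings and the function name are assumptions, and real entries would need far more cases than the single example from the text.

```python
import unicodedata

# Active endings tried longest-first so that 'άω' is removed before 'ω'.
# This short list is an assumption chosen to cover the example in the text.
ACTIVE_ENDINGS = ("άω", "ώ", "ω")

def reconstruct_middle_lemma(active_lemma, middle_suffix):
    """Attach a bare middle-voice suffix such as '-ιέμαι' to the active stem."""
    lemma = unicodedata.normalize("NFC", active_lemma)
    # Drop the leading hyphen or dash used in the dictionary notation.
    suffix = unicodedata.normalize("NFC", middle_suffix).lstrip("-–")
    for ending in ACTIVE_ENDINGS:
        if lemma.endswith(ending):
            return lemma[: -len(ending)] + suffix
    raise ValueError(f"unrecognized active ending in {active_lemma!r}")

middle = reconstruct_middle_lemma("πετώ", "-ιέμαι")
```

Running the reconstruction before inflection would give the generator a complete middle-voice lemma to work from, which is the gap identified above.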

More significant challenges were observed in the treatment of the augment in past tenses. Compound verbs, in particular, tended to generate both augmented and non-augmented forms, as in απονέμω [aponemo] (‘to award’) → απένειμα [apeneima] and απόνειμα [aponeima] in the past simple. For this reason, a special rule of internal augment insertion between the prefix and the stem was applied, restricted to stems beginning with a consonant. Moreover, when the first component of the verb is a full word rather than a preposition, as in καλοπιάνω [kalopiano] (‘to coax’) or σιγοβρέχω [sigovrecho] (‘to rain softly’), the system will be further enhanced with a heuristic list of lexical prefixes (e.g., καλο-, σιγο-) in order to generate the correct forms καλοέπιανα [kaloepiana] and σιγοέβρεχε [sigoevreche]. Also, in verbs such as βλέπω [vlepo] (‘to see’), the system had to accurately render the augment (έ-) and the stem alternation between stressed and unstressed forms, depending on person and number. For example, in the Indicative, Imperfect, Active Voice, the generated forms are έ-βλεπα, έ-βλεπες, έ-βλεπε (é-vlepa, é-vlepes, é-vlepe) in the singular (1st, 2nd, and 3rd person) and βλέπαμε, βλέπατε, έ-βλεπαν (vlépame, vlépate, é-vlepan) in the plural (1st, 2nd, and 3rd person). The rule governing augment placement and stem stress was explicitly parameterized per person and number, enabling the system to automatically reproduce the correct morphological distribution.
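The per-person, per-number parameterization can be illustrated with a small table-driven sketch for the active imperfect of βλέπω. The table layout and function name are assumptions for illustration; only the six resulting forms are taken from the text.

```python
# Per-cell settings for the active imperfect indicative: the ending, and
# whether the stressed augment 'έ-' is used. In this sketch the augment
# appears in the cells where the text shows it (all but 1st and 2nd plural).
IMPERFECT_ACTIVE = {
    ("singular", 1): ("α", True),
    ("singular", 2): ("ες", True),
    ("singular", 3): ("ε", True),
    ("plural", 1): ("αμε", False),
    ("plural", 2): ("ατε", False),
    ("plural", 3): ("αν", True),
}

def imperfect_active(stressed_stem, unstressed_stem):
    """Build the six imperfect forms from the two stem variants."""
    forms = {}
    for cell, (ending, augmented) in IMPERFECT_ACTIVE.items():
        if augmented:
            # Stress sits on the augment, so the stem is unstressed.
            forms[cell] = "έ" + unstressed_stem + ending
        else:
            # No augment: the stress stays on the stem.
            forms[cell] = stressed_stem + ending
    return forms

forms = imperfect_active("βλέπ", "βλεπ")
```

Keeping the augment flag in the table rather than in the code means that a different distribution for another verb class only requires a different table.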

Errors in contracted and uncontracted verb forms were also frequent. For verbs such as ακουμπάω/–ώ [akoumpao, -o] (‘to touch’), the system produced the invalid form ακουμπαάω [akoumpaao]. This was resolved by introducing a filter preventing consecutive identical vowels and by defining restrictions for particular stem endings (e.g., stems ending in –ζ [-z], –φ [-f], –λ [-l] combine only with –ω [-o]). Likewise, cases of overgeneration in the present tense were observed, where forms such as λέεις [leeis] or λες [les] (2nd person singular) were produced for λέγω/λέω [lego/leo] (‘to say’) (1st person singular). These forms are expected to be filtered out through euphony filters and stress pattern checks.
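A filter against consecutive identical vowels might look like the sketch below, which compares letters after removing accents. The function names and the vowel set are assumptions, and the actual filter may use different criteria.

```python
import unicodedata

GREEK_VOWELS = set("αεηιουω")

def strip_accents(text):
    """Remove combining marks so that 'ά' and 'α' compare as equal."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def has_repeated_vowel(form):
    """Return True if the same vowel letter occurs twice in a row."""
    plain = strip_accents(form.lower())
    return any(a == b and a in GREEK_VOWELS for a, b in zip(plain, plain[1:]))

candidates = ["ακουμπάω", "ακουμπώ", "ακουμπαάω"]
accepted = [form for form in candidates if not has_repeated_vowel(form)]
```

A blanket filter of this kind would also reject attested forms that contain a doubled vowel, such as λέει, so in practice it would have to be combined with an exception list or restricted to specific stem and ending combinations.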

Further irregularities appeared in past simple formation, such as stem alternation in ξεχνώ [xechno] (‘to forget’), where the system produced ξέχνασα [xechnasa] instead of ξέχασα [xechasa]. This was solved by checking the irregular verb tables before applying the regular rules. Particular care was also given to impersonal or third-person-only verbs like σοβεί [sovei] (‘it prevails’) and συμβαίνει [symvainei] (‘it happens’), which were marked with the symbol #ΓΕ (“3rd person singular only”) to restrict generation exclusively to the third-person singular. Similarly, verbs limited to present-based tenses, such as ασφυκτιώ [asfyktio] (‘to be under pressure’), were annotated as #ΜΕ (“present stem only”) to prevent unattested tense formations.
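The two mechanisms in this paragraph, consulting the irregular table before the regular rules and restricting generation through tags, can be sketched as follows. The table contents, function names, and the stub standing in for the regular rule are assumptions for illustration only.

```python
# Illustrative data only: one irregular past simple form taken from the text.
IRREGULAR_PAST_SIMPLE = {"ξεχνώ": "ξέχασα"}

ALL_CELLS = [(number, person)
             for number in ("singular", "plural")
             for person in (1, 2, 3)]

def past_simple_1sg(lemma, regular_rule):
    """Consult the irregular table first; fall back to the regular rule."""
    if lemma in IRREGULAR_PAST_SIMPLE:
        return IRREGULAR_PAST_SIMPLE[lemma]
    return regular_rule(lemma)

def allowed_cells(tags):
    """Verbs tagged '#ΓΕ' are generated in the third-person singular only."""
    if "#ΓΕ" in tags:
        return [("singular", 3)]
    return ALL_CELLS

def allowed_stems(tags):
    """Verbs tagged '#ΜΕ' are generated from the present stem only."""
    if "#ΜΕ" in tags:
        return ["present"]
    return ["present", "past"]

# Stub reproducing the erroneous regular output reported in the text.
def overgenerating_rule(lemma):
    return "ξέχνασα"

result = past_simple_1sg("ξεχνώ", overgenerating_rule)
```

Because the lookup precedes the rule, adding a new exception never requires changing the regular generation logic.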

A separate class of errors concerned irregular participles and passive aorists, where the generated forms deviated from the dictionary patterns. For example, αποπειρώμαι [apopeiromai] (‘to attempt’) correctly forms αποπειράθηκα [apopeirathika] (not αποπειρήθηκα [apopeirithika]), while κοιμάμαι [koimamai] (‘to sleep’) and φοβάμαι [fovamai] (‘to be afraid’) produce κοιμισμένος [koimismenos] and φοβισμένος [fovismenos] (not κοιμημένος [koimimenos] and φοβημένος [fovimenos]). These exceptions were compiled in an irregulars list to guide subsequent generation runs.

Finally, certain ‘mixed’ categories in the dictionary grouped verbs with different phonological endings (–ίζω [-izo], –άζω [-azo], –ύζω [-yzo], –ύσσω [-ysso]), producing hybrid errors like αγιαζίζω [agiazizo] from αγιάζω [agiazo] (‘to become serene’). This led to the creation of new subcategories for more precise classification. Overall, six new inflectional subcategories (Ρ10.1.1, Ρ2.1.1, Ρ2.2.1, Ρ2.3.1, Ρ2.3.2, and Ρ5.1.1) were introduced, along with functional tags (#ΓΕ, #ΜΕ) that define restricted inflectional behavior.

The aforementioned adjustments substantially improved the precision of verb morphology generation. Most residual errors occurred in past tenses (augment and stem alternation), in present forms of contracted verbs (overgeneration), and in passive voice constructions (irregular participles). The system currently generates 191 irregular forms, a number that reflects the deliberate choice of accuracy over completeness: only the forms explicitly confirmed by the lexicon or by the customized paradigms are included, while the remaining ones will be added in a subsequent phase following the integration of additional exceptions and rules. After optimization with the proposed interventions, the system is expected to demonstrate stable and reliable performance across both regular and irregular verbs, ensuring complete inflectional coverage and establishing a solid foundation for further comparative evaluation.

6. Grammatical Structure Recognition Within a Sentence

The proposed approach falls within the scope of NLP and aims at the automatic recognition of grammatical and syntactic structures in Modern Greek sentences. The developed system [20] combines artificial intelligence algorithms with an interactive graphical environment, supporting the extraction of high-accuracy linguistic information.

The system’s database consists of lexical data that include the inflectional paradigms from the electronic dictionary of M. Triantafyllidis, as well as a large number of inflectional patterns developed, based on the corresponding grammar. Words within a sentence are isolated, based on whitespace and punctuation marks, and for each token a search is performed in the database to retrieve the relevant linguistic information.
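The tokenization and lookup step can be sketched in a few lines of Python. The in-memory dictionary stands in for the lexical database, and the feature names and function names are assumptions for illustration.

```python
import re
import unicodedata

def tokenize(sentence):
    """Isolate word tokens, discarding whitespace and punctuation."""
    # NFC makes accented letters single code points before matching.
    normalized = unicodedata.normalize("NFC", sentence)
    return re.findall(r"\w+", normalized)

def look_up(tokens, lexicon):
    """Pair each token with its lexicon entry, or None for unknown words."""
    return [(token, lexicon.get(token.lower())) for token in tokens]

# Toy lexicon keyed by lowercase surface form (illustrative features only).
toy_lexicon = {
    "ο": {"pos": "article", "case": "nominative", "number": "singular"},
    "μαθητής": {"pos": "noun", "case": "nominative", "number": "singular"},
    "τρέχει": {"pos": "verb", "person": 3, "number": "singular"},
}

tokens = tokenize("Ο μαθητής τρέχει σήμερα.")
analysis = look_up(tokens, toy_lexicon)
```

Tokens that return `None`, such as σήμερα here, are the ones that an unknown-word mechanism would need to handle.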

The functionality of the system is not limited to the mere recognition of individual morphological forms. Through specially designed algorithms, the program identifies the core syntactic constituents of the sentence: subject, verb, direct object, and indirect object. This recognition is based on the rules of Greek grammar and is supported by the theoretical framework of Relational Grammar, which was selected due to its emphasis on grammatical relations as primary syntactic elements. The hierarchical representation of grammatical roles (1 for the subject, 2 for the direct object, and 3 for the indirect object) allows for a numerical encoding of sentence structure and enables the management of syntactic transformations, such as passivization or role alternation across different strata. This theoretical model has proven particularly suitable for the computational exploitation of Modern Greek syntax, as it focuses on the functional relationship between sentence constituents, regardless of their surface position, thereby offering a coherent and efficient foundation for automatic syntactic analysis.
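To make the numerical encoding concrete, the sketch below represents one stratum as a mapping from constituents to relations and derives the next stratum under passivization. It follows the standard Relational Grammar analysis of the passive (the 2 advances to 1 and the former 1 becomes a chômeur); the data structure, the English placeholder labels, and the `"P"` and `"Cho"` markers are assumptions, not the system's internal format.

```python
# Relations: 1 = subject, 2 = direct object, 3 = indirect object,
# "P" = predicate. "Cho" marks a chômeur, the Relational Grammar label for
# a constituent whose relation has been taken over by another constituent.
def passivize(stratum):
    """Derive the next stratum: the 2 advances to 1, the old 1 becomes Cho."""
    if 2 not in stratum.values():
        raise ValueError("passive needs a direct object (2) to advance")
    remap = {2: 1, 1: "Cho"}
    return {constituent: remap.get(relation, relation)
            for constituent, relation in stratum.items()}

# Placeholder constituents for a clause meaning 'the teacher sees the student'.
initial = {"teacher": 1, "see": "P", "student": 2}
final = passivize(initial)
```

Because the mapping is independent of word order, the same encoding applies whatever the surface position of the constituents.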

Although the current implementation focuses on identifying the core syntactic relations—Subject, Verb, and Object—the underlying Relational Grammar framework provides a scalable basis for incorporating more complex syntactic phenomena, such as coordination, passive constructions, clitic pronouns, accentual patterns, and the free word order characteristic of Greek. This extensibility ensures that the system can gradually evolve toward a more comprehensive syntactic parser while maintaining theoretical consistency and computational efficiency.

6.1. System Architecture

The system is designed according to a modular architecture, allowing for the independent development and modification of its individual subcomponents. Each module can communicate with the others through well-defined interfaces, facilitating system maintenance, scalability, and potential integration into other computational or linguistic environments. The main subsystems include:

Input Module: Manages the user input of sentences.
Analysis Module: Contains the algorithms for morphological and syntactic analysis.
Lexical Database: Stores grammatical information for Greek language words.
Visualization Module: Provides a GUI for result presentation.
Ontology Integration Unit: Supports the export of data to Protégé for the creation or updating of ontological models.

6.2. Development of the Lexical Database

For the development of the lexical database, Microsoft Excel was selected as the software environment due to its combination of usability, flexibility, and capacity to manage large volumes of information, without requiring specialized programming knowledge. The structure of Excel spreadsheets allows for immediate sorting, filtering, and searching of data, while its simple user interface facilitates both the editing and expansion of the lexicon. Compared to more complex solutions, such as Microsoft Access or Google Sheets, Excel offers greater user autonomy, as it does not depend on internet connectivity and enables faster implementation of language resources with high information density.

The data entries were organized into categories, based on the PoS to which each word belongs (articles, nouns, adjectives, participles, pronouns, verbs), with dedicated columns for each critical grammatical feature, such as case, number, gender, mood, tense, voice, and person. Special provisions were made for indicating the absence of forms using the symbol “#”, which triggers corresponding management or update functions within the system. In addition, particular attention was paid to the organization of inflectional endings, which were enriched with metadata, enabling their alignment with actual lexical paradigms. This design supports the system’s verification and processing functions. A key role is played by the reference form column—nominative singular for nouns and adjectives, and first-person singular for verbs—which allows the linkage of inflected forms to their canonical lemma representations. This structure of the database effectively supports both the structural processing of linguistic data and its functional integration with the system’s core algorithm.
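The role of the “#” marker and of the reference form column can be illustrated with a small sketch. It uses CSV text as a stand-in for the Excel sheet, and the column names and the choice to simply skip absent forms are assumptions; the system's own handling of “#” may differ.

```python
import csv
import io

# CSV stand-in for one sheet of the lexical database. ξενιτιά ('exile') is
# a singular-only noun, so its plural cell is marked as absent with '#'.
SHEET = """reference_form,pos,case,number,form
ξενιτιά,noun,nominative,singular,ξενιτιά
ξενιτιά,noun,genitive,singular,ξενιτιάς
ξενιτιά,noun,nominative,plural,#
"""

def build_form_index(sheet_text):
    """Map each attested form to its reference form and features."""
    index = {}
    for row in csv.DictReader(io.StringIO(sheet_text)):
        if row["form"] == "#":
            continue  # absent form: nothing to index
        features = {key: row[key] for key in ("pos", "case", "number")}
        index.setdefault(row["form"], []).append(
            (row["reference_form"], features))
    return index

index = build_form_index(SHEET)
```

Indexing by surface form is what allows an inflected word found in a sentence to be linked back to its canonical lemma.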

The database was developed with the primary objective of storing the morphological properties of words. Each record includes the following fields (Table 2):

Lemma form (e.g., τρέχω [trecho] (‘run’), μαθητής [mathitis] (‘student’), καλός [kalos] (‘good’));
Part of speech (verb, noun, adjective, etc.);
Grammatical features (gender, number, case, tense, voice, mood, etc.);
Link to related concepts for ontological representation.

Each lemma is linked to a specific ontological class according to its grammatical role and semantic type. Verbs are represented as action entities, nouns as individual or conceptual entities (e.g., Person), and adjectives as quality descriptors. This mapping enables the system to associate grammatical information with semantic structures during ontology population.

6.3. Morphological and Syntactic Analysis Algorithms

The system implements morphological and syntactic analysis algorithms developed in the Python programming language, which was chosen for its clean syntax, ease of maintenance, and its rich ecosystem of libraries suited to text and data processing (such as pandas, re, and unicodedata), all of which contribute to faster and more flexible development. The morphological algorithms identify and analyze grammatical features of words (such as case, gender, number, tense, voice, and mood), recognize inflectional endings, convert words to their reference form (nominative singular for nouns and adjectives, first-person singular for verbs), and verify their existence in the database. In parallel, the syntactic algorithms utilize information on articles, pronouns, adjectives, nouns, and verbs to automatically identify the core syntactic constituents of a sentence (subject, verb, direct object, and indirect object). These algorithms rely on counters, control structures, and mechanisms for handling unknown words. The results are stored in a text file, based on the principles of Relational Grammar, in which grammatical roles are numerically represented (e.g., 1 for subject, 2 for direct object), ensuring systematic and clearly delineated mapping of syntactic relations in natural language.

The sentence analysis is performed in three stages:

Tokenization: Splitting the sentence into words/morphemes;
Morphological Analysis: Retrieving each word’s grammatical features from the database;
Syntactic Analysis: Assigning syntactic roles (subject, verb, object, etc.) using rule-based logic.

Syntactic analysis is based on predefined rules that follow the typical word order of the Greek language (SVO—Subject, Verb, Object), while also supporting alternative syntactic structures. The procedural logic of the syntactic analysis module is presented in Figure A2 in Appendix C, which depicts the pseudocode describing the rule-based process of identifying grammatical relations within a sentence. The algorithm follows the sequence of tokenization, morphological tagging, syntactic role detection, and validation of grammatical relations, ensuring full alignment with the principles of Relational Grammar.
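The role-detection stage can be sketched as a simplified rule set that relies on case alone: the verb is labeled as the predicate, a nominative noun as 1, and an accusative noun as 2. This is an illustrative reduction, not the pseudocode of Figure A2; the feature names, the `"P"` label, and the one-case-per-form toy analyses are assumptions, and agreement checks, indirect objects, and ambiguous forms are ignored.

```python
def assign_roles(analyzed_tokens):
    """Label the verb 'P', a nominative noun 1 and an accusative noun 2."""
    roles = {}
    for token, features in analyzed_tokens:
        if features is None:
            continue  # unknown word: left for a separate mechanism
        if features["pos"] == "verb":
            roles[token] = "P"
        elif features["pos"] == "noun":
            if features.get("case") == "nominative":
                roles[token] = 1
            elif features.get("case") == "accusative":
                roles[token] = 2
    return roles

# 'ο δάσκαλος βλέπει τον μαθητή' ('the teacher sees the student').
teacher = ("δάσκαλος", {"pos": "noun", "case": "nominative"})
sees = ("βλέπει", {"pos": "verb"})
student = ("μαθητή", {"pos": "noun", "case": "accusative"})
art_nom = ("ο", {"pos": "article", "case": "nominative"})
art_acc = ("τον", {"pos": "article", "case": "accusative"})

svo = [art_nom, teacher, sees, art_acc, student]
ovs = [art_acc, student, sees, art_nom, teacher]  # object-first order
roles_svo = assign_roles(svo)
roles_ovs = assign_roles(ovs)
```

Both orders receive the same role assignment, which shows why case information is useful for handling alternative word orders alongside the default SVO pattern.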

6.4. Development of the Graphical User Interface

As part of the system’s functionality, a user-friendly and efficient GUI was developed to facilitate interaction between the end user and the core of the language processing system.

This interface allows for:

(a): Input of natural language sentences through a dedicated form, which the system processes automatically;
(b): Detailed grammatical analysis for each word in the sentence, including information related to its morphological features such as PoS, case, number, gender, etc. (Appendix D—Figure A3a,b);
(c): Visual representation of the sentence’s syntactic structure, based on the results of the syntactic analysis and the application of Relational Grammar (Appendix D—Figure A4a,b). This representation facilitates the understanding of the relationships between words (subject, verb, objects, etc.);
(d): Export of analysis results in formats suitable for integration into ontological tools such as Protégé, supporting standards like RDF/XML or OWL, with the aim of enabling future use in semantic web environments or other computational linguistics and knowledge-based applications.

7. Integration of Lexical Data into an Ontology (Protégé)

As part of the system’s ongoing semantic expansion, a dedicated module is being implemented for the integration of morphological and syntactic data into ontological form, enabling the export of analyses as ontological statements. Each grammatical element (e.g., word, PoS, syntactic role) is mapped to corresponding ontological classes and properties, forming a semantic model of the sentence that can be further processed by systems based on the Semantic Web. This integration relies on OWL and RDF standards, which support the structured linking of lexical information and syntactic relationships. Through the use of subject–predicate–object triples, the grammatically analyzed content of each sentence is transformed into a semantically manageable format.

For example, in the sentence “ο μαθητής τρέχει” [o mathitis trechei] (“the student runs”), both the basic syntactic roles and the complete identification of the morphological and syntactic features of the words can be represented in RDF. Listing 1 below presents a minimal RDF example that renders only the subject and the verb of the sentence, while it is also possible to provide the enriched version, which includes a detailed account of the grammatical and syntactic features of the words.

Listing 1. Simple RDF representation of syntactic roles.

<μαθητής> rdf:type :Noun .

<τρέχω> rdf:type :Verb .

In a similar manner, the same RDF schema can be used to capture the inflection of both nominal and verbal forms, enabling their complete ontological representation. The resulting final ontological structure is loaded directly into Protégé, allowing the ‘Ontology of Modern Greek’ to be dynamically extended with new concepts, relations, and categories. In this way, the system can be integrated into broader semantic networks and support future interoperability with other knowledge processing platforms. This approach makes the system suitable not only for advanced language technology applications but also for environments where semantic understanding and natural language processing are required.
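A minimal sketch of how such statements could be produced programmatically is given below. It builds Turtle text with plain string formatting; the namespace IRI `http://example.org/modern-greek#` and the function name are hypothetical, since the text does not give the ontology's actual IRI or export code.

```python
RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
ONTOLOGY_NS = "http://example.org/modern-greek#"  # hypothetical namespace

def to_turtle(typed_lemmas):
    """Serialize (lemma, class_name) pairs as rdf:type statements in Turtle."""
    lines = [
        f"@prefix rdf: <{RDF_NS}> .",
        f"@prefix : <{ONTOLOGY_NS}> .",
        "",
    ]
    for lemma, class_name in typed_lemmas:
        # Each statement ends with ' .', as Turtle requires.
        lines.append(f":{lemma} rdf:type :{class_name} .")
    return "\n".join(lines) + "\n"

turtle_text = to_turtle([("μαθητής", "Noun"), ("τρέχω", "Verb")])
```

The resulting text can be saved to a `.ttl` file and opened in Protégé; a dedicated RDF library would be a more robust choice for anything beyond this illustration.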

Illustrative Queries and Retrieved Instances

To validate the ontology’s functionality and illustrate its intended querying capabilities, several representative SPARQL queries are being tested within the Protégé environment. These examples demonstrate how the integrated morphological and syntactic information can be retrieved through semantic relations defined in OWL/RDF.

(a) Query 1—Retrieving Prepositions and Governed Cases

The ontology allows users to retrieve all prepositions together with the grammatical cases they govern. The following SPARQL query demonstrates this functionality:

SELECT ?preposition ?case

WHERE {

?preposition rdf:type :Preposition ;

:governsCase ?case .

}

The execution of this query is expected to return, among others, the following results: the preposition σε [se] (“in/to”) is linked with the accusative, υπέρ [yper] (“for/in favor of”) with the genitive/accusative, and για [gia] (“for”) with the accusative case. Such outputs would confirm that syntactic dependencies between prepositions and the corresponding cases are correctly represented in the ontology through the property :governsCase.
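The logic of Query 1 can be mimicked over a handful of in-memory triples, which makes the join between the type statement and the :governsCase property explicit. The triples encode only the expected results named above; the case individuals (`:Accusative`, `:Genitive`) and the function name are hypothetical stand-ins for the ontology's own identifiers, and no SPARQL engine is involved.

```python
# Toy triples as (subject, predicate, object).
TRIPLES = [
    ("σε", "rdf:type", ":Preposition"),
    ("σε", ":governsCase", ":Accusative"),
    ("υπέρ", "rdf:type", ":Preposition"),
    ("υπέρ", ":governsCase", ":Genitive"),
    ("υπέρ", ":governsCase", ":Accusative"),
    ("για", "rdf:type", ":Preposition"),
    ("για", ":governsCase", ":Accusative"),
    ("τρέχω", "rdf:type", ":Verb"),
]

def select_prepositions_and_cases(triples):
    """Mimic Query 1: subjects typed :Preposition joined with :governsCase."""
    prepositions = {s for s, p, o in triples
                    if p == "rdf:type" and o == ":Preposition"}
    return sorted((s, o) for s, p, o in triples
                  if p == ":governsCase" and s in prepositions)

rows = select_prepositions_and_cases(TRIPLES)
```

A preposition that governs two cases, such as υπέρ, yields one row per case, which is also how a SPARQL SELECT would report it.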

(b) Query 2—Retrieving Verbs by Inflectional Category

The following query retrieves all verbs belonging to a specific inflectional category, showing that the system can combine morphological and ontological data:

SELECT ?verb

WHERE {

?verb rdf:type :Verb ;

:belongsToCategory :Ρ5 .

}

When executed, this query is expected to produce instances such as βραβεύω [vravevo] (“to award”), δεσμεύω [desmevo] (“to bind”), and συγχωνεύω [synchonevo] (“to merge”), all of which belong to the inflectional category Ρ5. This outcome demonstrates that the ontology can support structured retrieval of morphological information and enable the verification of the correct classification of lexical items according to their grammatical properties.

Beyond the above examples, the ontology supports a wide range of linguistic queries at the morphological, syntactic, and semantic levels. At the morphological level, the system can grammatically recognize a specific word form by identifying its part of speech, gender, number, case, tense, voice, and mood. It can also retrieve all nouns ending in a particular suffix (e.g., -ος [-os]) or all adjectives belonging to a specific inflectional pattern. At the syntactic level, queries can identify verbs that govern particular grammatical cases or retrieve all complements associated with a given predicate. At the semantic level, the ontology enables the extraction of verbs linked to thematic roles such as Agent or Instrument, as well as the retrieval of lexical items connected through synonymy or antonymy relations. These capabilities demonstrate that the system not only integrates linguistic information into a structured ontological form but also provides advanced mechanisms for exploring and validating morphological, syntactic, and semantic relationships within Modern Greek.

In addition to the SPARQL queries presented above, the ontology’s structural organization and semantic consistency can be visually demonstrated through its internal graph representation. Figure 6 illustrates a representative segment of the ontology, where the preposition συν [syn] (“plus”) appears as a subclass of Λόγια Πρόθεση/Learned Preposition. Within this structure, the preposition is connected to the Αρχαία Δοτική Πτώση/Ancient Dative Case, which it governs, and to the Έννοια της Πρόσθεσης/The Meaning of Addition, which it semantically expresses. The graph highlights how syntactic dependencies and semantic associations are jointly modeled, demonstrating the ontology’s ability to represent grammatical, syntactic, and conceptual relations in an integrated manner.

8. Research Results

The implemented modules have been systematically tested to ensure functional validity, indicating that the framework operates as a fully functional environment rather than a purely theoretical prototype. All NLP components have been integrated into a unified and operational system, which is currently being evaluated for its capacity to process the morphological and syntactic phenomena of Modern Greek and to integrate linguistic data into an ontological framework.

Regarding morphological analysis, the system consistently met the requirements for generating word forms for nouns, adjectives, and verbs, particularly when these followed predictable inflectional patterns. Its architecture was based on morphological templates and rule-based generation, ensuring stability and extensibility. For irregular forms, recurring patterns were identified, offering opportunities for future generalization.

The syntactic analysis, grounded in the theoretical framework of Relational Grammar, identified the core syntactic components of a sentence (subject, verb, objects) by applying grammatical rules and hierarchical representations. The system produced reliable results for sentences of simple and moderate complexity, while in cases of linguistic ambiguity, the application of predefined prioritization strategies contributed to error reduction.

The linguistic data were integrated into an ontological schema within the Protégé platform, following the OWL and RDF standards. The representation of morphological and syntactic relationships, using subject–predicate–object triples, enhanced the semantic exploitation of the data and facilitated interoperability with other knowledge processing systems.

In certain cases, however, manual intervention was deemed necessary, mainly for the processing of forms that deviate from regular patterns, idiomatic expressions, or unusual structures. Nonetheless, the system’s clear organization, modular design, and capacity for semantic integration confirm its usefulness in applications involving language processing and understanding.

A key outcome emerging from the present research is the transformation of grammatical knowledge into ontological knowledge. Through the structured mapping of morphological and syntactic features onto OWL classes and properties, grammatical relations are formally represented as semantic triples that connect subjects, predicates, and objects within the ontology. In this way, grammatical information is no longer treated as isolated annotation but as an integral component of a structured knowledge network. This integration allows linguistic data to participate in reasoning processes, supports interoperability with external semantic systems, and demonstrates that linguistic analysis and knowledge representation can function as a unified computational framework for Modern Greek.

Building on these results, the evaluation phase is currently underway, employing datasets derived from the DSMG and the GMG. Preliminary findings indicate consistently high accuracy in tagging, lemmatization, morphological generation, and syntactic dependency mapping, confirming the reliability of the underlying linguistic rules and the robustness of the ontology. Standard metrics such as precision, recall, and F1-score are being applied to quantify performance, while comparative validation with external morphological analysers (e.g., ILSP) is planned for a subsequent evaluation phase, based on available test data and documented benchmarks. These results underline the linguistic soundness of the system and provide a solid foundation for future extensions involving larger corpora and domain-specific ontological expansions.
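For reference, the three metrics can be computed over sets of generated and reference forms as in the sketch below. The function name is an assumption and the sample sets are toy data reusing the ακουμπάω example from Section 5.4; they are not evaluation results of the system.

```python
def precision_recall_f1(generated, gold):
    """Compare generated forms with reference forms, treating both as sets."""
    generated, gold = set(generated), set(gold)
    true_positives = len(generated & gold)
    precision = true_positives / len(generated) if generated else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Toy data: two valid forms plus one overgenerated form.
generated_forms = {"ακουμπάω", "ακουμπώ", "ακουμπαάω"}
reference_forms = {"ακουμπάω", "ακουμπώ"}
precision, recall, f1 = precision_recall_f1(generated_forms, reference_forms)
```

In this toy case overgeneration lowers precision while recall stays at 1.0, which is the pattern one would expect from the contraction errors discussed earlier.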

Regarding runtime and scalability, the system performs lexical and morphological processing efficiently, with execution times remaining stable even as the dataset size increases. Its modular architecture allows for the seamless integration of new lexical entries and ontological relations without any degradation in performance. Preliminary error analysis revealed only minor inconsistencies in stress placement and in the handling of certain irregular nominal and verbal forms, confirming the overall robustness and linguistic reliability of the system.

9. Discussion

This study demonstrated the feasibility and value of the combined use of linguistic resources, algorithmic techniques, and semantic modeling for the processing of the Greek language. The integration of morphological and syntactic data into ontological schemas enabled a shift from surface-level linguistic processing to deeper knowledge representation, which is a key requirement for modern applications in computational linguistics.

The proposed system adopts a holistic approach to grammatical analysis, overcoming the separation between individual levels (morphology, syntax, semantics) that characterizes many previous approaches. Although its architecture is grounded in predefined linguistic rules, it provides a solid foundation for unified linguistic representation and functional exploitation of the data.

Despite the system’s performance, the complexity of the Greek language—with phenomena such as stress shifts, idiomatic expressions, and irregular structures—continues to pose challenges, particularly in real-world textual environments. Addressing such cases requires detailed linguistic documentation and sensitivity to the nuances of the language.

At the theoretical level, the adoption of Relational Grammar highlighted the usefulness of grammatical models that emphasize the function of constituents rather than their linear position. The numerical representation of syntactic relations facilitates not only computational processing but also the transfer of linguistic data into semantic environments, making the connection between natural language and knowledge more direct and computationally manageable.

The use of Protégé as an infrastructure for ontological modeling proved to be of strategic importance. The system benefited significantly from its ability to organize, extend, and connect linguistic information to broader knowledge networks. The adoption of standards, such as OWL, enhances interoperability and enables its deployment across multiple knowledge-based environments.

In terms of real-life applications, the proposed system can support a wide range of AI-driven environments that require structured linguistic interpretation. By linking grammatical structures with ontological entities, it can enhance language understanding modules in Large Language Models (LLMs), provide semantic enrichment of digital corpora, and improve information retrieval and question–answering systems operating in Greek. The ontological representation of grammatical relations also facilitates the integration of language data into intelligent tutoring systems, educational chatbots, and knowledge graphs, enabling automatic reasoning and adaptive interaction. Similar approaches have been discussed in recent studies on the alignment of linguistic ontologies with neural architectures for AI-based language processing [21].

These promising perspectives, however, coexist with certain inherent limitations that reflect the linguistic complexity of Modern Greek. Although the diagnostic analysis of verb morphology (see “Diagnostic Analysis of Verb Morphology and Error Types” in Section 5.4) demonstrated the system’s ability to identify and correct rule failures, further challenges persist across all levels of linguistic processing.

Morphological inaccuracies occur in the generation of nominal and verbal forms, particularly in cases involving (hyphens separate the affected morphemes where the form is segmented):

stem alternations (e.g., βλέπ-ω [vlep-o] “see” (Present) → είδ-α [id-a] (Past Simple)),
vowel contractions (e.g., αγαπάω → αγαπώ “love”),
the presence or absence of augment in past tenses (e.g., έ-παιζα “I was playing” with augment ε-, but ∅δρόσιζα “I was cooling” without augment),
nouns that exist only in the singular (e.g., η ξενιτιά “exile”) or only in the plural (e.g., τα Βαλκάνια “the Balkans”),
and double-stem nouns (e.g., δύναμ-η “strength” in the nominative singular → δύναμ-ης and δυνάμε-ως in the genitive singular).
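
The augment case in the list above can be sketched with a stress-driven rule of thumb: the imperfect is stressed on the antepenultimate syllable, and the syllabic augment ε- is prefixed only when the form would otherwise be too short to carry that stress. The function `imperfect_1sg`, the helper `add_tonos`, and the simplified syllable pattern are assumptions made for illustration. They do not reproduce the system's rules, and they ignore prefixed verbs, learned forms, and accented or dieresis-marked vowel sequences.

```python
import re
import unicodedata

ACUTE = "\u0301"  # combining acute accent; NFC composition yields the Greek tonos
# Vowel nuclei; digraphs are listed first so that e.g. "αι" counts as one syllable.
NUCLEUS = re.compile(r"αι|ει|οι|ου|υι|αυ|ευ|[αεηιουω]")


def add_tonos(letter):
    """Return the accented (tonos) version of a single unaccented vowel."""
    return unicodedata.normalize("NFC", letter + ACUTE)


def imperfect_1sg(stem):
    """Build the 1st person singular imperfect from an unaccented stem."""
    base = stem + "α"
    nuclei = list(NUCLEUS.finditer(base))
    if len(nuclei) < 3:
        # Too short for antepenultimate stress: prefix a stressed augment ε-.
        return add_tonos("ε") + base
    # Otherwise stress the antepenultimate nucleus (on its last letter).
    pos = nuclei[-3].end() - 1
    return base[:pos] + add_tonos(base[pos]) + base[pos + 1:]


print(imperfect_1sg("παιζ"))
print(imperfect_1sg("δροσιζ"))
```

With the stems παιζ- and δροσιζ-, the sketch yields the two forms cited in the list, with and without the augment respectively. Irregular stems such as βλέπ-/είδ- fall outside a rule of this kind and would need lexical listing.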

At the syntactic level, inaccuracies are observed in sentences with free word order, ellipsis, or multiple possible grammatical interpretations.

Occasional inconsistencies also arise during the integration of the morphological and syntactic modules, when the same form may assume different syntactic functions depending on context. For instance, the form γράφεται [gráfetai] may mean “the article is being written” (passive meaning) or “the student registers for the course” (middle meaning), reflecting different mappings between the agent and the recipient of the action.

These limitations reflect the inherent difficulty of processing a morphologically rich and syntactically flexible language exclusively through rule-based logic.

Overall, this work confirms that the processing of the Greek language, despite its structural complexities, can be effectively supported by systems that combine linguistic documentation with computational formalization. The transformation of language into machine-readable forms not only preserves its linguistic richness but also highlights its potential for integration into advanced artificial intelligence environments.

10. Conclusions and Future Work

This study demonstrates the feasibility of automatic extraction of linguistic data and their ontological organization for the Modern Greek language. The adopted methodology confirmed that grammatical processing can be fully supported even without the use of complex machine learning technologies. The proposed system enables the identification and representation of morphological and syntactic features of Greek, linking them organically to semantic structures. This approach has proven applicable to knowledge representation, making it suitable for use by artificial intelligence systems and NLP applications, without the need for specialized deep learning models.

Based on the results, the continuation of this research effort is oriented toward targeted extensions and technical optimizations. These include:

Enrichment of the lexical database with additional entries;
Support for more complex syntactic structures and dependency parsing;
Integration of speech recognition and phonological features;
Bilingual analysis and export of results in standardized formats;
Optimization of the linguistic analysis algorithms.

Although the current study primarily focuses on the linguistic and ontological integration rather than runtime benchmarking, future work will include comprehensive performance profiling. This will involve the measurement of execution time per sentence, memory footprint during ontology population, and throughput of the REST API under different workloads. Profiling tools such as cProfile, memory_profiler, and ApacheBench will be employed to identify and optimize potential bottlenecks. The system’s hardware and software specifications will also be documented in detail to ensure full reproducibility of the results.
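
A minimal sketch of how such profiling could be set up with the Python standard library is given below. The function `analyze_sentence` is a hypothetical stand-in for the system's analysis entry point, and `tracemalloc` is used here in place of memory_profiler so that the example has no third-party dependencies. REST API throughput, for which ApacheBench is mentioned above, is not covered.

```python
import cProfile
import io
import pstats
import time
import tracemalloc


def analyze_sentence(sentence):
    """Hypothetical stand-in for the system's analysis entry point."""
    return [(token, len(token)) for token in sentence.split()]


def profile_sentences(sentences):
    """Measure mean wall-clock time per sentence and peak traced memory for a batch."""
    tracemalloc.start()
    timings = []
    for sentence in sentences:
        start = time.perf_counter()
        analyze_sentence(sentence)
        timings.append(time.perf_counter() - start)
    _, peak_bytes = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return {
        "mean_seconds_per_sentence": sum(timings) / len(timings),
        "peak_memory_bytes": peak_bytes,
    }


def hotspot_report(sentences, top=5):
    """Return a cProfile report listing the most expensive calls."""
    profiler = cProfile.Profile()
    profiler.enable()
    for sentence in sentences:
        analyze_sentence(sentence)
    profiler.disable()
    buffer = io.StringIO()
    pstats.Stats(profiler, stream=buffer).sort_stats("cumulative").print_stats(top)
    return buffer.getvalue()


sample = ["Η Μαρία διαβάζει το βιβλίο", "Ο μαθητής γράφει"]
print(profile_sentences(sample))
print(hotspot_report(sample))
```

Recording the hardware, Python version, and input corpus alongside these numbers would be needed for the measurements to be reproducible.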

In addition, it is proposed that the methodology be applied to larger corpora and that the ontology be dynamically updated in real time by incorporating new data from contemporary sources, such as online platforms and social media. The systematic documentation of linguistic variation and its integration into dynamic ontological structures may enhance the monitoring of language evolution.

Furthermore, subsequent development stages will aim to broaden the lexical and corpus coverage beyond the Triantafyllidis Electronic Dictionary by cross-validating data against independent Greek corpora (e.g., ILSP, Greek Web 2020, or OpenSubtitles) and incorporating dialectal variants, neologisms, and out-of-dictionary forms. This expansion will allow the ontology and the morphological parser to handle noisy social-media text, informal registers, and historical variants, thereby increasing the system’s robustness and generalizability.

Finally, another direction for future research involves the inclusion of demographic and sociolinguistic parameters in the evaluation and enrichment of the Ontology of Modern Greek. Although the present system focuses on grammatical, morphological, and semantic representation, future extensions will incorporate demographic diversity—such as gender, age, and regional variation—to assess linguistic behavior across user groups and enhance the educational applicability of the ontology. Prior studies [22,23] have demonstrated the importance of such demographic awareness in ensuring representational fairness and inclusivity in language technologies.

In summary, the developed system is not merely a linguistic analysis tool, but a comprehensive infrastructure that bridges morphological and syntactic information with semantic representation, laying the foundation for the development of advanced language technology applications for Modern Greek (as well as other languages of similar concatenative morphology).

Author Contributions

Conceptualization, methodology, investigation, resources, writing—original draft preparation, N.E.S.; validation, supervision, project administration, N.N.K.; formal analysis, software, writing—review and editing, E.C.P.; visualization, data curation, C.S. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding author.

Acknowledgments

The authors thank I. Koronaios and K. Angelou for their valuable contribution to the technical development supporting this research.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:

API	Application Programming Interface
AUEB	Athens University of Economics and Business
BERT	Bidirectional Encoder Representations from Transformers
DSMG	Dictionary of Standard Modern Greek
GMG	Grammar of Modern Greek
GUI	Graphical User Interface
HTTP	HyperText Transfer Protocol
ILSP	Institute for Language and Speech Processing
JSON	JavaScript Object Notation
NLP	Natural Language Processing
NLU	Natural Language Understanding
OWL	Web Ontology Language
PoS	Part of Speech
RDF	Resource Description Framework
REST	Representational State Transfer
UD	Universal Dependencies
URI	Uniform Resource Identifier
XML	Extensible Markup Language

Appendix A

1. Data mining is defined as “the process of extracting information through the analysis of large volumes of data, aiming at identifying trends and patterns” [24].

2. See the inflectional examples in the Dictionary of Standard Modern Greek online:

http://www.komvos.edu.gr/dictionaries/triantafyllidis/TriLegent.htm (accessed on 1 September 2025).

3. Relational Grammar is a theoretical approach to syntax that focuses on grammatical relations (such as subject and object) rather than hierarchical phrase structures, proposing that syntactic changes are described as transformations across levels of these relations [25].

4. The symbol O denotes noun categories, Ε denotes adjective categories, and Ρ denotes verb categories. In the English version, these correspond to N1, N2 (nouns), A1, A2 (adjectives), and V1, V2 (verbs), respectively.

Appendix B

Figure A1. Flow diagram of the morphological generation module. The workflow illustrates the sequential rule-based processing applied to each lemma: class data retrieval, normalization, stem extraction, selection of endings per grammatical category, tone application, and validation of generated forms. Irregular verbs are handled through a specialized branch that combines stem, augment, and ending variants.

Appendix C

Figure A2. Pseudocode of the rule-based syntactic analysis process. The algorithm tokenizes each sentence, performs part-of-speech tagging, detects the main grammatical constituents (subject, verb, object), assigns relational-grammar roles, and validates agreement and order.

Appendix D

Figure A3. (a) Grammatical analysis results for the Greek word ‘ταμίας’ [tamias] (cashier) (left); (b) The English translation of grammatical analysis results for the Greek word ‘ταμίας’ [tamias] (cashier) (right).

Figure A4. (a) Syntactic analysis of four (4) Greek sentences based on Relational Grammar (left); (b) The English translation of the syntactic analysis of four (4) Greek sentences based on Relational Grammar (right).

References

1. Triantafyllidis Foundation. Dictionary of Standard Modern Greek; Aristotle University of Thessaloniki: Thessaloniki, Greece, 1998; Available online: https://www.greek-language.gr/greekLang/modern_greek/tools/lexica/triantafyllides/index.html (accessed on 14 August 2025). (In Greek)
2. Triantafyllidis, M. Modern Greek Grammar; Organisation for the Publication of Educational Books: Athens, Greece, 1941. (In Greek)
3. ILSP NLP Suite. Available online: https://www.ilsp.gr/nlp (accessed on 5 June 2025).
4. Prokopidis, P.; Georgantopoulos, B.; Papageorgiou, H. A Suite of Natural Language Processing Tools for Greek. In Proceedings of the 10th International Conference on Greek Linguistics (ICGL 2011), Komotini, Greece, 1–4 September 2011.
5. Prokopidis, P.; Piperidis, S. A Neural NLP Toolkit for Greek. In Proceedings of the 11th Hellenic Conference on Artificial Intelligence (SETN 2020), Athens, Greece, 2–4 September 2020.
6. AUEB NLP Group. AUEB’s Natural Language Processing Group. Available online: http://nlp.cs.aueb.gr (accessed on 9 June 2025).
7. Dikonimaki, C. A Transformer-based Natural Language Processing Toolkit for Greek—Part of Speech Tagging and Dependency Parsing. Bachelor’s Thesis, Athens University of Economics and Business, Athens, Greece, 2021. Available online: http://nlp.cs.aueb.gr/theses/dikonimaki_bsc_thesis.pdf (accessed on 27 June 2025).
8. Loukas, L.; Smyrnioudis, N.; Dikonomaki, C.; Barbakos, S.; Toumazatos, A.; Koutsikakis, J.; Kyriakakis, M.; Georgiou, M.; Vassos, S.; Pavlopoulos, J.; et al. GR NLP TOOLKIT: An Open-Source NLP Toolkit for Modern Greek. In Proceedings of the 31st International Conference on Computational Linguistics: System Demonstrations (COLING 2025), Abu Dhabi, United Arab Emirates, 19–24 January 2025; pp. 174–182.
9. Tsalidis, C.; Vagelatos, A.; Orphanos, G. An Electronic Dictionary as a Basis for NLP Tools: The Greek Case. arXiv 2004. Available online: https://arxiv.org/abs/cs/0408061 (accessed on 14 August 2025).
10. Tsalidis, C.; Piperidis, S.; Papageorgiou, H. Greek WordNet: Construction and Applications. In Proceedings of the Language Resources and Evaluation Conference (LREC) 2012; Istanbul, Turkey, 21–27 May 2012; European Language Resources Association (ELRA): Paris, France, 2013; pp. 2895–2900. ISBN 978-2-9517408-7-7.
11. Kirov, C.; Cotterell, R.; Sylak-Glassman, J.; Walther, G.; Vylomova, E.; Xia, P.; Faruqui, M.; Mielke, S.; McCarthy, A.; Kübler, S.; et al. UniMorph 2.0: Universal Morphology. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), Miyazaki, Japan, 7–12 May 2018; European Language Resources Association (ELRA): Luxembourg, 2018. Available online: https://aclanthology.org/L18-1293/ (accessed on 14 August 2025).
12. Koutsikakis, J.; Chalkidis, I.; Malakasiotis, P.; Androutsopoulos, I. GREEK-BERT: The Greeks Visiting Sesame Street. arXiv 2020.
13. Gavriilidou, M.; Giagkou, M.; Loizidou, D.; Piperidis, S. Report on the Greek Language; European Language Equality (ELE) Project Deliverable D1.17; European Language Resources Association (ELRA): Luxembourg, 2022; Available online: https://european-language-equality.eu/wp-content/uploads/2022/03/ELE___Deliverable_D1_17__Language_Report_Greek_.pdf (accessed on 14 August 2025).
14. Fellbaum, C. WordNet: An Electronic Lexical Database; MIT Press: Cambridge, MA, USA, 1998.
15. Fillmore, C.J.; Johnson, C.R.; Petruck, M.R. Background to FrameNet. Int. J. Lexicogr. 2003, 16, 235–250.
16. Samaridi, N.; Papakitsos, E.; Karanikolas, N. Ontological Representation of the Structure and Vocabulary of Modern Greek on the Protégé Platform. Computation 2024, 12, 249.
17. Koronaios, I. Data Retrieval for Modern Greek Language and Their Ontological Structuring in PROTÉGÉ Platform. Bachelor’s Thesis, University of West Attica, Egaleo, Greece, 2024. Available online: https://www.openarchives.gr/aggregator-openarchives/edm/polynoe/000125-11400_7945 (accessed on 10 August 2025). (In Greek)
18. Papadimitriou, R. Foundational Tools for Natural Language Processing. Bachelor’s Thesis, University of West Attica, Athens, Greece, 30 July 2025. (In Greek)
19. Samaridi, N.E.; Karanikolas, N.N.; Papakitsos, E.C.; Papoutsidakis, M. Designing a Greek Electronic Dictionary Based on Ontology. In Proceedings of the 24th Pan-Hellenic Conference on Informatics (PCI 2020), Athens, Greece, 20–22 November 2020; ACM: New York, NY, USA, 2020; pp. 223–225.
20. Angelou, K. Development of Algorithms and Interactive Environment for the Analysis of Grammatical Structures in Greek Sentences. Bachelor’s Thesis, University of West Attica, Egaleo, Greece, 2024. Available online: https://www.openarchives.gr/aggregator-openarchives/edm/polynoe/000125-11400_7938 (accessed on 10 August 2025). (In Greek)
21. Karanikolas, N.; Manga, E.; Samaridi, N.; Stergiopoulos, V.; Tousidou, E.; Vassilakopoulos, M. Strengths and Weaknesses of LLM-Based and Rule-Based NLP Technologies and Their Potential Synergies. Electronics 2025, 14, 3064.
22. Thakur, N.; Cui, S.; Khanna, K.; Knieling, V.; Duggal, Y.N.; Shao, M. Investigation of the Gender-Specific Discourse about Online Learning during COVID-19 on Twitter Using Sentiment Analysis, Subjectivity Analysis, and Toxicity Analysis. Computers 2023, 12, 221.
23. Hudders, L.; De Jans, S. Gender Effects in Influencer Marketing: An Experimental Study on the Efficacy of Endorsements by Same- vs. Other-Gender Social Media Influencers on Instagram. Int. J. Advert. 2022, 41, 128–149.
24. Big Blue Data Academy. Data Mining: Concepts, Benefits, and Techniques. 2023. Available online: https://bigblue.academy/gr/data-mining (accessed on 21 April 2025).
25. Blake, B.J. Relational Grammar; Routledge: London, UK, 1990.

Figure 1. Screenshot from the Protégé platform: The classes of the ontology of the Modern Greek language are shown on the left, and on the right, a subset of the nouns extracted from the Triantafyllidis Electronic Dictionary and automatically imported as individuals of the class ‘Oυσιαστικό_Noun’ is displayed (both the Greek and the equivalent English terms are used in the declaration of classes).

Figure 2. Overview of the system’s functionality for extracting linguistic data and integrating them into the ‘Ontology of Modern Greek’.

Figure 3. Ontological representation of the entry μάντης (‘diviner’) within the ‘Ontology of Modern Greek’.

Figure 4. Structured representation of inflectional paradigms for masculine and feminine nouns in Modern Greek. The figure illustrates declensional models for masculine nouns of category O8—such as γανωματής (ganomatis–tinsmith), καφετζής (kafetzis–coffee shop owner), and παπουτσής (papoutsis–shoemaker)—and feminine nouns of category O24, including καρδιά (kardia–heart), αχλαδιά (achladia–pear tree), and φωλιά (folia–nest). Each row corresponds to a lemma entry, showing the stem and its endings for each case and number. The tabular format defines the rule-based organization of paradigms within the system’s lexical database, enabling transparent, machine-interpretable processing for automatic morphological generation (the red color in the designation of Inflectional Category denotes here the first instance of each category).
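
The stem-plus-endings organization described in the caption of Figure 4 can be sketched as a lookup table keyed by inflectional category. The endings listed for O24 below are written out from the standard paradigm of the καρδιά type and are an assumption, not a copy of the system's database. The names `ENDINGS`, `extract_stem`, and `generate_paradigm` are likewise hypothetical.

```python
import unicodedata

# Endings for the feminine category labelled O24 in Figure 4 (καρδιά type).
# Assumption: taken from the standard paradigm, not from the system's database.
ENDINGS = {
    "O24": {
        ("singular", "nominative"): "ά",
        ("singular", "genitive"): "άς",
        ("singular", "accusative"): "ά",
        ("singular", "vocative"): "ά",
        ("plural", "nominative"): "ές",
        ("plural", "genitive"): "ών",
        ("plural", "accusative"): "ές",
        ("plural", "vocative"): "ές",
    }
}


def nfc(text):
    """Normalize to NFC so accented letters compare consistently."""
    return unicodedata.normalize("NFC", text)


def extract_stem(lemma, category):
    """Strip the citation-form (nominative singular) ending from the lemma."""
    lemma = nfc(lemma)
    ending = nfc(ENDINGS[category][("singular", "nominative")])
    if not lemma.endswith(ending):
        raise ValueError(f"{lemma!r} does not match category {category}")
    return lemma[: -len(ending)]


def generate_paradigm(lemma, category):
    """Return a mapping from (number, case) to the generated inflected form."""
    stem = extract_stem(lemma, category)
    return {slot: nfc(stem + ending) for slot, ending in ENDINGS[category].items()}


for (number, case), form in generate_paradigm("καρδιά", "O24").items():
    print(number, case, form)
```

Adding a category then amounts to adding one more table of endings, which is what makes this kind of organization easy to inspect. Stress shifts and stem alternations would need extra handling beyond plain concatenation.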

Figure 5. (a) Output for the noun βραδιά [vradia] (night); (b) Output for the adjective δίμετρος [dimetros] (two-meter-long). The figure displays the inflectional paradigm automatically generated by the system, including grammatical cases (nominative [Oνομαστική_], genitive [Γενική_], accusative [Aιτιατική_], vocative [Κλητική_]) and numbers (singular [Ενικός:], plural [Πληθυντικός:]). Each form is organized under the corresponding morphological attribute extracted from the Triantafyllidis dictionary (the attributes are denoted in blue color, while the corresponding values in red).

Figure 6. Ontological representation of the preposition “συν” (“plus”) as a subclass of “Learned Preposition”. Note: Rectangular nodes represent classes, while arrows indicate object properties (relations) connecting them. The graph visualizes the syntactic link of the preposition “συν” (“plus”) with the “Ancient Dative Case”, the grammatical case it governs, and its semantic association with the “Meaning of Addition” (noted in the red box), illustrating how grammatical and conceptual structures are jointly encoded within the ontology (other colors represent different types of relationships).

Table 1. Comparative overview of Greek NLP systems and the proposed ontology-based framework.

System/Resource	Morphological Coverage	Syntactic Analysis	Semantic/Ontology Integration	Automation Level	Accessibility/Maintenance
ILSP NLP Suite	High (rule-based, limited inflectional depth, no generation)	Yes (dependency parser)	None	Semi-automated	Active (Institute for Language and Speech Processing)
AUEB NLP Suite	Moderate (tokenization, POS tagging, no generation)	Yes (transformer-based parser)	None	Fully automated	Active (research use)
UniMorph Greek Dataset	Extensive inflectional morphology (static dataset, no generation)	No	None	Static/non-interactive	Publicly available
Greek-BERT (AUEB)	Lexical embeddings, no morphological generation	Yes (contextualized syntax)	None	Fully automated (machine learning)	Active (open-source)
Greek WordNet	Lexical-semantic relations	Yes	Semantic network (ontology schema)	Static	Inactive/not accessible
Proposed Ontology-based System	Comprehensive (automatic inflectional generation and classification)	Yes (Relational Grammar model)	Full OWL/RDF ontology integration	Semi-automated (Python–Protégé pipeline)	Active/expandable framework

Table 2. Structure of lexical entries in the morphological database.

Lemma	Part of Speech	Grammatical Features	Ontological Class
τρέχω (run)	verb	present, indicative, active	Verb
μαθητής (student)	noun	masculine, nominative, singular	Noun → Person
καλός (good)	adjective	masculine, nominative, singular	Adjective
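
To illustrate how rows like those in Table 2 might be expressed as ontology individuals, the sketch below serializes them as Turtle using plain string formatting, without an RDF library. The namespace `http://example.org/mg#`, the property names `hasPartOfSpeech` and `hasFeature`, and the reading of "Noun → Person" as a list of classes are all assumptions. The actual IRIs and properties of the Ontology of Modern Greek are not given in this section.

```python
from dataclasses import dataclass

# Hypothetical namespace; the real ontology IRI is not stated in the text.
PREFIXES = (
    "@prefix mg: <http://example.org/mg#> .\n"
    "@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .\n"
)


@dataclass
class Entry:
    lemma: str
    gloss: str
    pos: str
    features: list
    classes: list  # class path from Table 2, e.g. ["Noun", "Person"]


def to_turtle(entry):
    """Serialize one lexical entry as a Turtle description of an individual."""
    types = ", ".join(f"mg:{name}" for name in entry.classes)
    features = ", ".join(f'"{feature}"' for feature in entry.features)
    return (
        f"mg:{entry.lemma} a {types} ;\n"
        f'    rdfs:label "{entry.lemma}"@el, "{entry.gloss}"@en ;\n'
        f'    mg:hasPartOfSpeech "{entry.pos}" ;\n'
        f"    mg:hasFeature {features} .\n"
    )


entries = [
    Entry("τρέχω", "run", "verb", ["present", "indicative", "active"], ["Verb"]),
    Entry("μαθητής", "student", "noun", ["masculine", "nominative", "singular"], ["Noun", "Person"]),
    Entry("καλός", "good", "adjective", ["masculine", "nominative", "singular"], ["Adjective"]),
]

print(PREFIXES)
for entry in entries:
    print(to_turtle(entry))
```

In practice, lemmas containing spaces or punctuation would need escaping or a separate identifier scheme before being used as local names.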

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

