Proceeding Paper

Artificial Intelligence for Historical Manuscripts Digitization: Leveraging the Lexicon of Cyril †

by Stavros N. Moutsis 1,*, Despoina Ioakeimidou 1, Konstantinos A. Tsintotas 2, Konstantinos Evangelidis 3, Panagiotis E. Nastou 3 and Antonis Tsolomitis 3

1 Department of Production and Management Engineering, Democritus University of Thrace, GR-671 32 Xanthi, Greece
2 Department of Information and Electronic Engineering, International Hellenic University, GR-574 00 Thessaloniki, Greece
3 Department of Mathematics, University of the Aegean, GR-832 00 Mytilene, Greece
* Author to whom correspondence should be addressed.
Presented at the 7th International Global Conference Series on ICT Integration in Technical Education & Smart Society, Aizuwakamatsu City, Japan, 20–26 January 2025.
Eng. Proc. 2025, 107(1), 8; https://doi.org/10.3390/engproc2025107008
Published: 21 August 2025

Abstract

Artificial intelligence (AI) is a transformative technology in computer science with the potential to reshape a wide range of disciplines, including the social sciences, the arts, and the humanities. Following its recognized impact in engineering and medicine, history, literature, paleography, and archaeology have recently embraced AI, as new opportunities have arisen for preserving ancient manuscripts. Acknowledging the importance of digitizing archival documents, this paper explores the use of advanced technologies in this process, showing how they are employed at each stage and how the unique challenges inherent in historical scripts are addressed. Our study is based on Cyril’s Lexicon, a Byzantine-era dictionary of great historical and linguistic significance in Greek territory.

1. Introduction

Books and manuscripts have been written since antiquity, allowing humanity to preserve the intellectual and cultural achievements of past eras [1]. Digitization, the primary technique for preserving them through time, is therefore a responsibility of modern society [2]. The main reason is their protection from environmental hazards and natural deterioration, e.g., prolonged humidity, temperature fluctuations, and exposure to direct sunlight [3]. At the same time, detailed digitization, a complex process comprising multiple stages, including handwriting recognition, image segmentation, margin handling, and storage, enhances their accessibility to scholars and the general public [4]. Hence, sophisticated techniques are required for accurate outcomes [5]. In addition, unlike printed books, historical manuscripts exhibit considerable variation in structure, writing style, and linguistic conventions while incorporating ornamental details of artistic value, such as decorations and calligraphic elements [6]. These inherent differences preclude generic approaches, necessitating methodologies tailored to each document’s unique characteristics [7]. Consequently, adopting contemporary technological advancements, such as machine and deep learning [8,9,10], is a prerequisite for tackling these challenges [11]. As artificial intelligence (AI) leverages computer vision [12,13], image analysis [14], and text analytics to facilitate optical character recognition (OCR), document analysis, and semantic processing, it bridges the gap between technological innovation and cultural preservation [15,16]. Aiming to outline this comprehensive process, this paper presents the technologies, techniques, and tools employed at every stage of the workflow and their effectiveness in reaching an optimal result. Cyril’s Lexicon, a Byzantine-era dictionary of great historical significance in Greek territory, serves as our case study, since it exhibits most of the inherent challenges referred to above. The main contributions of this paper are summarized as follows:
  • A detailed study concerning the digitization pipeline of historical manuscripts and the importance of AI in improving performance.
  • A qualitative and quantitative experimentation based on Cyril’s Lexicon.
  • A direct search tool for the content and data, e.g., metadata, comments, drafts, and transcriptions of Cyril’s Lexicon to support philological research and the preservation of cultural heritage.
The remainder of the paper is organized as follows. Section 2 provides an overview of the steps needed to preserve recorded documents and the role of advanced technologies in digitization. Section 3 describes our approach to digitizing Cyril’s Lexicon, Section 4 discusses its impact and the ethical issues involved, and Section 5 concludes the paper.

2. Background on Manuscripts Digitization

Figure 1 illustrates the digitization workflow of historical writings. At first, images of the pages and covers are captured according to the necessary standards and rules. At the same time, researchers who interact with the original manuscripts record features that will later be used as metadata. Next, a control check confirms their quality; the process is re-initialized if issues arise. Subsequently, during step 3, advanced image processing techniques are employed to improve the legibility of the scanned documents. These are stored in the cloud, with ensured accessibility and security, in step 4 for further analysis. Valuable information is documented throughout metadata creation, including the manuscript’s origin, authorship, and context. Moreover, at this phase, researchers provide important details for ensuring the efficiency of the subsequent stages. In the sixth step, all generated data are examined through machine and deep learning tools, allowing OCR to convert the handwritten text into a machine-readable format. It is worth noting that this end-to-end pipeline is dedicated to archiving cultural heritage, digitizing books together with all their essential components. The process culminates in making all of these data available to users.

2.1. Image Capture

The first step constitutes the foundation of the digitization workflow, laying the groundwork for all subsequent phases. It involves utilizing several tools and technologies to create high-quality digital representations [17], as its objective is to maintain the integrity of the original artifact, producing a digital facsimile that captures even the finest details, e.g., faded ink, marginalia, or material textures. The choice of technology depends on the manuscript’s physical condition and size, as well as its preservation needs. Standard tools for this process are flatbed and planetary scanners. The former perform well for unbound documents [18], offering resolutions often exceeding 600 DPI (dots per inch). However, their inability to handle bound or fragile materials limits their use [19]. On the contrary, planetary scanners address these challenges by allowing manuscripts to remain open at natural angles while supported by noninvasive cradles and specialized lighting systems [20]. This way, physical strain on delicate materials is reduced [21].
Another strategy is based on multispectral imaging, which permits capturing details invisible to the naked eye using different wavelengths of light, such as ultraviolet and infrared [22]. As it can recover faded text, erased annotations, or hidden features, e.g., watermarks, this technology allows the study of damaged or overwritten manuscripts [23]. Similarly, reflectance transformation imaging enhances surface textures by capturing multiple photos under varying light angles [24]. Thus, it reveals three-dimensional features often missed in conventional scans, such as embossments or inscriptions. Based on the results, this pipeline improves the examination of manuscripts’ craftsmanship and material composition [25]. Finally, color calibration and focus stacking are frequently employed to increase image fidelity and achieve consistent sharpness across uneven surfaces [26,27]. An important note is that high-quality image capturing is essential for the downstream stages, as errors at this point can compromise the entire digitization workflow [28].
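To illustrate one of these techniques, the following minimal Python sketch outlines focus stacking under simplifying assumptions (OpenCV and NumPy available, frames already aligned, file names hypothetical): for each pixel, it keeps the frame with the strongest local Laplacian response, a simple sharpness proxy.

```python
import cv2
import numpy as np

def focus_stack(paths):
    frames = [cv2.imread(p) for p in paths]
    sharpness = []
    for f in frames:
        gray = cv2.cvtColor(f, cv2.COLOR_BGR2GRAY)
        lap = np.abs(cv2.Laplacian(gray, cv2.CV_64F, ksize=3))
        sharpness.append(cv2.GaussianBlur(lap, (9, 9), 0))   # smooth the sharpness map
    best = np.argmax(np.stack(sharpness), axis=0)             # sharpest frame per pixel
    out = np.zeros_like(frames[0])
    for i, f in enumerate(frames):
        out[best == i] = f[best == i]
    return out

# Hypothetical usage:
# cv2.imwrite("page_stacked.tif", focus_stack(["page_f1.tif", "page_f2.tif", "page_f3.tif"]))
```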

2.2. Quality Control

This phase ensures the accuracy, consistency, and fidelity of registered images and metadata [29]. It involves systematic evaluation and verification to determine and correct errors while protecting the integrity of digitized manuscripts for research, analysis, and long-term conservation. Concerning images, technicians review the resolution quality, which typically ranges from 300 to 600 DPI or higher, to confirm the preservation of fine details. These are also examined for distortions, shadows, or glare that might obscure text and features, as well as the consistency of the color [30]. For advanced methods, like multispectral or reflectance transformation imaging, additional checks confirm the successful capture of hidden or intricate details, i.e., faded ink or textural features [31].
Regarding metadata accuracy, errors in transcription, formatting, or contextual data undermine the usability of digitized manuscripts. More specifically, metadata are reviewed to ensure consistency and adherence to standards that enable interoperability across digital repositories [32]. Any descriptive or structural inaccuracies are corrected for better retrieval and contextual understanding [33]. In addition, file integrity is validated by verifying compliance with archival file formats. Proper naming conventions and organizational structures are examined to facilitate efficient file management. Furthermore, checks detect file corruption or incomplete data, guaranteeing that digital assets are stored securely and without errors.
Finally, a dual approach combining automated tools and manual inspection is typically employed for quality control. The former efficiently flags common issues, such as blurry images or metadata inconsistencies, while the latter addresses more nuanced problems, like image processing. Hence, effective quality control guarantees reliable digital representations that meet the standards required for research, machine learning applications, public access, and proper preservation [30].
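As a sketch of what such automated flagging might look like (assuming Pillow and OpenCV are available; thresholds and file names are purely illustrative), the variance of the Laplacian serves as a simple blur indicator while the embedded DPI metadata is compared against a minimum target:

```python
import cv2
from PIL import Image

def check_image(path, min_dpi=300, blur_threshold=100.0):
    issues = []
    with Image.open(path) as im:
        dpi = im.info.get("dpi", (0, 0))[0]        # embedded resolution metadata, if any
        if dpi and dpi < min_dpi:
            issues.append(f"resolution {dpi} DPI is below the {min_dpi} DPI target")
    gray = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    focus = cv2.Laplacian(gray, cv2.CV_64F).var()  # low variance suggests a blurred scan
    if focus < blur_threshold:
        issues.append(f"possible blur (focus measure {focus:.1f})")
    return issues

# Hypothetical usage:
# print(check_image("codex_042r.tif"))
```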

2.3. Image Processing

This stage involves improving quality, correcting distortions, and preparing the data for long-term storage and analysis. Effective image processing is vital for creating reliable digital reproductions that preserve the historical and cultural significance of manuscripts [34]. By addressing challenges like fading, warping, and surface irregularities, image processing ensures that digitized materials meet high criteria of usability and fidelity [32]. It constitutes a mature technology, with established methods available that quickly and reliably improve brightness, contrast, color accuracy, and noise levels [35]. For damaged manuscripts, advanced algorithms recover faint text and restore degraded details [29].
Geometric adjustments, either automated or manual, are applied to flatten text lines and realign distorted images. Distortion correction is crucial for manuscripts with curved or warped pages, like books, as these physical irregularities affect the accuracy of the digital output [36]. Moreover, as images are captured at different focal depths, focus stacking addresses uneven surfaces by combining multiple photos to create a single, uniformly sharp representation [26,37]. Similarly, image stitching, which merges multiple frames of large documents, is applied. Automated tools remove artifacts introduced during digitization [38], ensuring that the final image closely represents the original document.
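A minimal sketch of such geometric and contrast adjustments is given below, assuming OpenCV is available; the skew estimate, rotation, and adaptive-thresholding parameters are indicative only and not the exact pipeline used for the Lexicon.

```python
import cv2
import numpy as np

def deskew_and_binarize(path):
    gray = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    # Estimate the skew from the minimum-area rectangle around the ink pixels.
    ink = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)[1]
    coords = np.column_stack(np.where(ink > 0)).astype(np.float32)
    angle = cv2.minAreaRect(coords)[-1]
    # OpenCV's rectangle-angle convention varies by version; map near-upright pages
    # to a small correction angle.
    if angle > 45:
        angle -= 90
    elif angle < -45:
        angle += 90
    h, w = gray.shape
    M = cv2.getRotationMatrix2D((w / 2, h / 2), angle, 1.0)
    rotated = cv2.warpAffine(gray, M, (w, h), flags=cv2.INTER_CUBIC,
                             borderMode=cv2.BORDER_REPLICATE)
    # Adaptive thresholding copes with uneven parchment illumination better than a
    # single global cut-off.
    return cv2.adaptiveThreshold(rotated, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C,
                                 cv2.THRESH_BINARY, 35, 15)
```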
Finally, as mentioned before, processed images are optimized for storage and analysis. This phase includes file compression without sacrificing quality, standardizing dimensions, and converting photos into archival formats that meet long-term preservation requirements [21]. The processed images are the cornerstone of digitization and the input for the machine learning techniques used in the following steps.
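The storage-preparation step can be sketched as follows (assuming Pillow; the choice of LZW-compressed TIFF as a master format and the file names are illustrative), with a checksum recorded so that later integrity checks can detect corruption:

```python
import hashlib
from PIL import Image

def to_archival_tiff(src, dst, dpi=600):
    with Image.open(src) as im:
        im.save(dst, format="TIFF", compression="tiff_lzw", dpi=(dpi, dpi))
    with open(dst, "rb") as f:              # fixity value for later integrity checks
        return hashlib.sha256(f.read()).hexdigest()

# Hypothetical usage:
# checksum = to_archival_tiff("codex_042r.png", "codex_042r.tif")
```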

2.4. Cloud Storage

The large volumes of data increase the storage, access, management, and security requirements [39,40]. Cloud technology is adopted to meet these needs, proving effective at safeguarding data integrity [41,42]. Nowadays, many cloud platforms, such as Amazon S3 [43], Google Cloud Storage [44], and Microsoft Azure [45], dynamically adapt to growing data volumes [46,47]. Their wealth of tools and their adaptability eliminate the need for expensive physical infrastructure. Regarding accessibility, both the public and researchers can interact with digital collections, enhancing collaboration and expanding access to cultural documents [48]. As for security and redundancy, cloud platforms store multiple copies of data on geographically distributed servers, ensuring protection against hardware failures or natural disasters [49,50]. Security measures, such as encryption, access controls, and monitoring tools, protect sensitive information and guarantee the security of intellectual and cultural heritage [50,51]. Finally, cloud providers offer archival solutions for infrequently accessed historical data [52,53]. These services provide stable and cost-effective storage, as well as secure data retention, ensuring the integrity of digitized collections for future generations [42].
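As an example of such archival storage, a minimal upload sketch using the boto3 client for Amazon S3 is shown below; the bucket, key, and storage-class choices are hypothetical, and credentials are assumed to be configured separately.

```python
import boto3

s3 = boto3.client("s3")
s3.upload_file(
    Filename="codex_042r.tif",
    Bucket="manuscript-archive",                  # hypothetical bucket name
    Key="cyril-lexicon/codex_042r.tif",           # hypothetical object key
    ExtraArgs={
        "ServerSideEncryption": "AES256",         # encrypt the surrogate at rest
        "StorageClass": "DEEP_ARCHIVE",           # low-cost tier for infrequent access
    },
)
```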

2.5. Metadata Creation

Metadata refers to details, such as the manuscript’s title, the author, the date of creation, the document’s origin, its language, its physical characteristics, the decoration type and color, as well as any annotations or marginalia [54]. This information allows users to search, retrieve, and understand the manuscripts’ historical, cultural, and material significance [55]. These characteristics enhance the link between digitized elements and their broader contexts, helping scholars comprehend their chronological meaning, authenticity, and relationships with other texts. This contextual information is essential for achieving detailed research and for institutions to present their collections to diverse audiences [56,57]. Therefore, effective metadata creation requires careful planning, adherence to established standards, and often a combination of automated tools and manual input. Its design begins with defining the schema to be used, which determines the structure and fields included in the metadata. Commonly adopted standards, such as Dublin Core, METS/ALTO, and TEI (Text Encoding Initiative), offer predefined formats that ensure consistency and interoperability across collections [58]. These facilitate their seamless integration into global digital repositories.
Metadata are categorized into three types: descriptive, structural, and administrative. The first details the manuscript’s content, including information regarding the title, the author, the creation date, and the document’s language, and is one of the most critical components [32]. Structural metadata describes the document’s physical or logical organization, i.e., the relationships between pages, chapters, and sections. The administrative type tracks information about the digitization process, such as the equipment and settings used, file formats, and copyright or access rights [59]. Recently, automated metadata generation tools have streamlined the process, particularly for tasks like extracting text from printed or typed manuscripts using OCR [60]. However, their use is limited because historical writings often contain non-standard scripts, marginal annotations, or complex layouts [61]. In such cases, manual transcription and annotation are often adopted for accuracy and completeness [34].
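A minimal Dublin Core record, built with Python’s standard XML tooling, illustrates what such a descriptive entry might look like; the field values below are hypothetical.

```python
import xml.etree.ElementTree as ET

DC = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("dc", DC)

record = ET.Element("record")
for field, value in [
    ("title", "Lexicon of Cyril, fol. 42r"),              # hypothetical values
    ("creator", "Attributed to Cyril of Alexandria"),
    ("date", "ca. 5th-6th century AD"),
    ("language", "grc"),
    ("type", "manuscript page"),
]:
    ET.SubElement(record, f"{{{DC}}}{field}").text = value

ET.ElementTree(record).write("codex_042r_dc.xml", encoding="utf-8", xml_declaration=True)
```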

2.6. Machine and Deep Learning Analysis

Machine and deep learning techniques [62,63] are becoming integral to the digitization of historical manuscripts, as they enable advanced analysis, transcription, restoration, and other critical tasks with speed and efficiency [55]. In detail, these techniques improve the accessibility and interpretability of digitized materials by automating processes [55], e.g., recognizing non-standard handwriting, restoring damaged text, and organizing extensive manuscript collections (see Table 1).
During image capture, AI-powered systems permit the automatic optimization of camera settings, i.e., set up the exposure, the focus, and the illumination to accommodate the delicate and varied physical conditions of manuscripts [64]. Similarly, deep learning assists in real-time enhancements, including noise reduction and sharpening of captured images, preserving minute details critical for scholarly analysis [65]. Likewise, AI-driven multispectral imaging allows dynamic parameterization for data capturing beyond the visible spectrum, uncovering hidden text or underdrawings [66,67,68].
Another significant application in manuscript digitization is OCR, a technology that automates the transcription of printed or handwritten text into machine-readable formats. While traditional systems work effectively for modern fonts and layouts, historical documents often present challenges due to their non-standardized scripts, complex page structures, and varying levels of degradation [69]. Therefore, advances in deep learning improve its performance, enabling the recognition of diverse scripts and handwritten content [70]. Transkribus is a valuable asset that extends OCR capabilities by focusing on handwritten text recognition for such documents [71]. Its handwritten text recognition (HTR) engine trains models on specific manuscript samples, tailoring its performance to unique handwriting styles. This adaptability is particularly beneficial for digitizing rare writings where pre-trained OCR systems may fail. Transkribus also supports collaborative workflows, allowing users to refine models and share transcription projects, improving its utility in large-scale digitization efforts [72].
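For printed or regular scripts, a baseline OCR pass can be sketched as follows (assuming Tesseract and pytesseract are installed together with a Greek language pack such as "ell" or the ancient-Greek "grc" data); handwritten Byzantine minuscule would instead require a dedicated HTR model, e.g., one trained in Transkribus.

```python
import pytesseract
from PIL import Image

page = Image.open("codex_042r_binarized.tif")         # hypothetical preprocessed page
text = pytesseract.image_to_string(page, lang="grc")  # assumes the Greek traineddata is installed
print(text[:500])
```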
Beyond transcription, machine and deep learning tools are critical in restoring damaged or worn-out texts [73,74]. Convolutional neural networks (CNNs) are employed for image enhancement and text reconstruction, as they analyze degraded sections of manuscripts to predict and restore missing or faded content [75]. These techniques are particularly effective for documents that have suffered environmental damage or aging [76]. DeepArt [77] and custom-trained models [78] have demonstrated their utility in reconstructing illegible sections and highlighting features invisible to the naked eye [77]. Additionally, such techniques facilitate layout analysis, which is vital for segmenting complex manuscript structures, often featuring multi-column layouts, marginalia, or decorative elements that must be detected and preserved. Tools like Kraken [79] and OCRopus [80] use machine learning to detect and separate text blocks, annotations, and visual elements, ensuring that these components are correctly represented in digital formats [81]. Another vital task is script and language identification since manuscripts often contain multiple scripts and languages. Platforms like eScriptorium [82] integrate machine learning models that enable efficient classification and segmentation of multilingual or multi-script documents [83].
As mentioned, AI tools also automate metadata generation, analyzing images to extract relevant metadata, e.g., dates, keywords, and document structure. These reduce the manual effort required in metadata creation while maintaining consistency across extensive collections. Another growing application is forgery detection and authenticity verification. Machine learning models examine handwriting consistency, ink composition, and material properties to detect potential forgeries or later modifications [84,85]. These tools provide critical insights into manuscripts’ authenticity and historical value by recognizing anomalies. Furthermore, language translation, powered by natural language processing (NLP), is increasingly applied to historical documents. While general tools like Google Translate can provide preliminary translations, specialized models trained on historical corpora handle complex linguistic nuances and offer more accurate results [86].
Last, machine learning-based text clustering and classification help organize manuscripts by themes, genres, or periods for large-scale collections, e.g., TensorFlow or PyTorch enable efficient classification, helping researchers manage and analyze extensive archives [87]. Additionally, interactive exploration and visualization tools map relationships between manuscripts, authors, and historical events. This allows users to visually explore connections within collections, uncovering new insights into historical contexts [88].
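A minimal clustering sketch is given below (assuming scikit-learn; the transcriptions are placeholders): TF-IDF vectors grouped with k-means stand in for the theme- or period-based organization described above.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

transcriptions = [
    "placeholder transcription of entry one",
    "placeholder transcription of entry two",
    "placeholder transcription of entry three",
]
X = TfidfVectorizer(max_features=5000).fit_transform(transcriptions)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(list(zip(labels, transcriptions)))   # cluster label assigned to each transcription
```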

2.7. Digital Preservation and User Access

The final step ensures the long-term safety and usability of digitized manuscripts, protecting them from data loss, deterioration, and technological obsolescence. This stage is compatible with cloud storage solutions, which provide scalable and resilient options for images, accompanying information, and data generated during their digitization. At the same time, it is essential to note that achieving digitization does not preclude further improvement in representation. Access to digitized archives is crucial for researching, preserving, and disseminating cultural heritage [89]. User access makes digitized manuscripts available to researchers, educators, and the general public through digital systems and repositories [90]. These platforms offer advanced exploration tools, such as metadata browsing, search capabilities, and IIIF (International Image Interoperability Framework) interfaces for magnifying and annotating manuscripts [32]. Users also interact with documents and can provide feedback that helps ensure consistency and interoperability between repositories, resulting in easier access and understanding for future users. Finally, in recent years, interactive tools and collaborative platforms have been developed, permitting user participation through collaborative transcription, annotation, and metadata enrichment. To this end, AI-based interfaces assist users by suggesting relevant materials and providing complementary information, enhancing interaction with digital collections [91].
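The IIIF Image API addresses each view of a page through a fixed URL pattern, which a client can assemble as in the sketch below (the server base URL and identifier are hypothetical).

```python
def iiif_url(base, identifier, region="full", size="max", rotation=0,
             quality="default", fmt="jpg"):
    # IIIF Image API pattern: {base}/{identifier}/{region}/{size}/{rotation}/{quality}.{format}
    return f"{base}/{identifier}/{region}/{size}/{rotation}/{quality}.{fmt}"

print(iiif_url("https://iiif.example.org/iiif/3", "codex_042r"))
# -> https://iiif.example.org/iiif/3/codex_042r/full/max/0/default.jpg
```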

3. Digitization of Cyril’s Lexicon

3.1. Cyril’s Lexicon

This article examines a monumental yet unpublished dictionary of Ancient Greek, likely compiled in Alexandria during the late fifth or early sixth century AD. It is often attributed to Saint Cyril, Patriarch of Alexandria (c. 370–444 AD). Cyril’s Lexicon is notable for uniting entries from both Christian and pagan authors, marking it as the first Greek dictionary to synthesize these traditions [92]. Despite its historical significance, it remains unpublished due to its complex textual tradition: over 200 surviving manuscripts feature numerous interpolations and divergences. Hence, this work consolidates scholarly insights into its tradition, examines its enduring influence, and addresses debates over its authorship and reception. Cyril’s Lexicon is valuable as a dictionary because it features points of interest beyond its linguistic content. These include decorative elements, such as small illustrations or ornamental designs within letters, unique abbreviations, and corrections, all of which highlight the complexity and artistry of the Greek language as it evolved over centuries. These features offer valuable insights into the period’s culture, reflecting both the aesthetic and functional aspects of text production. All these elements are considered during the digitization process. More specifically, given the absence of previous efforts, the present initiative encompasses as many features as possible and records them comprehensively.

3.2. Image Capture

Before converting the writings into images, every available document in Greece was identified and recorded, and the necessary actions were taken to obtain the required permissions for their publication. Given this, they were categorized into two groups: those that already had digitized images and those that required photography. This strategic process reduces the need for repetitive handling of fragile documents, thereby minimizing the risk of physical damage during image capture. Particular emphasis is given to assessing and optimizing the existing images to ensure they meet the current resolution standards and clarity needed for advanced analysis [2]. Special care is taken to store them in high-resolution formats, which facilitates further computational processing. A systematic naming convention is also implemented to guarantee efficient cataloging and accessibility and enable seamless integration into subsequent stages. This way, we leverage the available images, allowing for immediate advancements in manuscript analysis while simultaneously safeguarding the original documents from unnecessary exposure to potential risks during recapture.

3.3. Metadata of Interest

Alongside digitization efforts, each manuscript and page undergoes a meticulous manual metadata recording process, including basic bibliographic information and more detailed descriptive elements. Recording metadata further enhances the development of advanced AI-based technologies, laying the groundwork for creating standardized text and contextual analysis frameworks. Combining manual precision with technological foresight, we guarantee that metadata meets the required needs and contributes to the long-term goal of developing AI methods for digitizing manuscripts.

3.4. Annotation Procedure

During data annotation, philologists with expertise in historical texts systematically identify and categorize key segments within the pages of Cyril’s Lexicon [2]. Through LabelImg [93], users create precise bounding rectangles around distinct manuscript components and classify them into predefined categories. These are selected based on the text’s structural and semantic elements, which are essential for digital analysis and preservation of historical integrity. More specifically, the categories include lemmas, viz., the Lexicon’s main entries or headwords; definitions, i.e., the explanatory content corresponding to each lemma; abbreviations, i.e., shortened forms of words or phrases used in the text; and titles, such as headers or introductory sections distinguishing different parts of the Lexicon.
Additionally, delimiter symbols are annotated, including visual markers that separate definitions from entries, mark transitions, and indicate changes in letter sequences. Other annotated elements include corrections, viz., adjustments or edits made in the manuscript, often by a later hand; marks for corrections, i.e., indicators signifying that a correction is necessary or has been made; scholions, i.e., marginal or interlinear notes providing commentary or explanation; and drawings or decorations, such as artistic embellishments or illustrations present in the manuscript.
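Since LabelImg exports annotations in PASCAL VOC XML by default, the labelled boxes can later be read back as in the following sketch (file names are hypothetical):

```python
import xml.etree.ElementTree as ET

def read_voc(path):
    boxes = []
    root = ET.parse(path).getroot()
    for obj in root.iter("object"):
        label = obj.findtext("name")                  # e.g. "lemma", "definition"
        bb = obj.find("bndbox")
        coords = tuple(int(float(bb.findtext(k)))
                       for k in ("xmin", "ymin", "xmax", "ymax"))
        boxes.append((label, coords))
    return boxes

# Hypothetical usage:
# print(read_voc("codex_042r.xml"))
```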

3.5. Deep Learning as Annotation Assistant

A subset of the annotated images is adopted to train YOLOv9m [94]. Our training focused on the categories with the highest frequency of occurrences per page, i.e., lemmas, definitions, and abbreviations. These elements were prioritized due to their larger volume relative to the other, less frequent categories. However, this technique does not replace manual annotation: the tool remains experimental and requires further enhancement, viz., additional lexicons in the training data. Despite its promising potential, manual checking therefore remains critical, as errors cannot be permitted in the digitized outcome. Finally, tasks such as clustering lemmas with corresponding definitions are performed manually to ensure accuracy and fidelity to the original manuscript structure.
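A training sketch under stated assumptions is shown below: it relies on the Ultralytics package, and the YOLOv9 checkpoint name, dataset YAML, and hyperparameters are illustrative rather than the exact configuration used in our experiments.

```python
from ultralytics import YOLO

model = YOLO("yolov9m.pt")                          # assumed checkpoint name
model.train(data="cyril_lexicon.yaml",              # classes: lemma, definition, abbreviation
            epochs=100, imgsz=1280, batch=8)        # illustrative hyperparameters
results = model.predict("codex_042r.jpg", conf=0.4) # proposals assist, not replace, annotators
```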

3.6. A Graphical User Interface for Metadata and Transcription Management

To further enhance the digitization process, a graphical user interface (GUI) was developed in PyQt5 that integrates the annotation data to streamline metadata and transcription management. The processed annotations generate CSV files that include detailed information about each identified item, i.e., entries, decorations, etc., its corresponding components, and its location within each image. Based on these files, our GUI provides the following functionalities:
  • Book selection and metadata management: Users can select a specific lexicon (book) from the interface. Once selected, the GUI allows the input and management of metadata information corresponding to the chosen lexicon.
  • Page selection and metadata assignment: Users can navigate through the pages of the selected book, opening any page within the GUI. For each page, metadata can be entered or updated as needed.
  • Transcription and annotation editing: Based on the information in the CSV files, the GUI enables users to add metadata or transcribe the content of entries, titles, scholions, and decorations. While transcribing, the corresponding page is displayed in the GUI, and bounding boxes indicating the annotated items are depicted, providing a visual reference for accuracy.
In summary, the GUI is a dedicated tool for philologists to manage lexicon metadata, input page-specific information, and transcribe textual content directly from the manuscript images. Integrating annotation data into the interface ensures they can seamlessly navigate transcription tasks and metadata management, significantly improving efficiency and accuracy.
Lastly, all data entered through the GUI are subsequently uploaded to a MySQL database. This database consolidates the metadata, annotations, transcriptions, and bounding box coordinates, providing a comprehensive digital representation of the lexicons. To this end, we digitize the lexicons entirely, combining textual, structural, and visual information in a centralized repository.
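A minimal sketch of this consolidation step is given below, assuming pandas and SQLAlchemy with a MySQL driver are available; the connection string, table name, and CSV columns are hypothetical.

```python
import pandas as pd
from sqlalchemy import create_engine

# Hypothetical connection string, table name, and CSV layout.
engine = create_engine("mysql+pymysql://user:password@localhost/cyril_lexicon")
annotations = pd.read_csv("codex_042r_annotations.csv")   # assumed columns: page, label,
                                                          # text, xmin, ymin, xmax, ymax
annotations.to_sql("annotations", engine, if_exists="append", index=False)
```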

3.7. Search Engine

The final goal is to build an advanced website that allows users to search, compare, and analyze the contents of Cyril’s Lexicon. This platform is designed to be a comprehensive resource for scholars, linguists, and historians [95]. The integration of AI plays a key role in optimizing the functionality of the site’s search engine, as it aims to provide more accurate and context-aware results by leveraging techniques such as NLP and semantic search. Moreover, AI-driven comparative tools promote side-by-side analysis of lexicon entries across manuscripts, using machine learning models to identify text patterns, similarities, and discrepancies. Therefore, users can uncover insights not immediately evident through manual comparison. The presented search engine also learns from user interactions, gradually refines its performance, and adapts to the needs of its audience. This dynamic approach ensures that the site remains an evolving and increasingly valuable tool for studying historical lexicons.
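As a simple baseline for such retrieval (not the production engine), the sketch below indexes entry texts with TF-IDF and ranks them by cosine similarity, assuming scikit-learn is available; semantic search would replace the vectorizer with NLP embeddings while keeping the same retrieval loop.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

entries = [
    "placeholder lemma and definition text one",
    "placeholder lemma and definition text two",
]
vectorizer = TfidfVectorizer()
index = vectorizer.fit_transform(entries)          # document-term matrix of the entries

def search(query, top_k=5):
    scores = cosine_similarity(vectorizer.transform([query]), index)[0]
    return sorted(zip(scores, entries), reverse=True)[:top_k]

print(search("placeholder lemma"))
```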

3.8. Main Challenges

Digitizing historical manuscripts with AI presents multifaceted challenges. The fragile condition of the documents and the need for high-resolution scans necessitate careful handling and precise strategies to prevent further degradation during image capture. At the same time, this procedure remains time-intensive and resource-demanding. During image processing, handwriting, ink quality, and paper texture variability require sophisticated algorithms to enhance readability without compromising authenticity. Manual validation adds to the time burden. Furthermore, creating metadata poses difficulties in ensuring consistency and accuracy. Automated AI tools struggle with incomplete or ambiguous data, particularly in rare or ancient languages, necessitating comprehensive human intervention to ensure accuracy. Similarly, the lack of annotated datasets for diverse scripts hinders annotation efforts, making AI model training and validation resource-intensive and requiring significant expertise. Balancing simplicity with advanced features for multilingual, semantic, and contextual searches is challenging for the GUI, as is addressing cultural sensitivities in design. Similarly, developing a robust search engine that provides accurate, multilingual, and context-aware results demands advanced AI algorithms and extensive datasets, which require ongoing refinement and substantial computational power. Additionally, time constraints affect all stages, as achieving high-quality results while managing validation and iterative development extends the timeline for delivering a functional and user-friendly digital repository. These challenges emphasize the complexity of the digitization process, extending beyond preservation to ensure accessibility and deeper insights into these invaluable manuscripts.

4. Discussion

4.1. Impact of Digitizing Cyril’s Lexicon

The digitization of Cyril’s Lexicon significantly impacts research, education, and cultural preservation, as it enables researchers to access and analyze its content digitally. This enables advanced computational studies on its linguistic, cultural, and historical aspects, fostering new insights into synthesizing Christian and pagan traditions, as reflected in the Lexicon. From an educational perspective, the digitized material is a dynamic resource for exploring historical linguistics, manuscript culture, and the evolution of the Greek language. Furthermore, it provides opportunities to teach interdisciplinary methodologies integrating AI, digital humanities, and historical analysis. From a cultural preservation perspective, it safeguards this chronologically meaningful but fragile manuscript tradition from physical degradation.

4.2. AI in Preserving Cultural Heritage

The integration of AI in the digitization process highlights its transformative potential for preserving cultural heritage. Advanced technologies augment human efforts in labor-intensive tasks like annotation, accelerating digitization and highlighting AI’s scalability for extensive manuscript collection projects. At the same time, our work also emphasizes human oversight in tasks requiring nuanced contextual understanding, such as clustering entries with definitions or interpreting marginal annotations. Ultimately, a balanced approach combining machine and deep learning with human expertise ensures efficiency and preserves cultural and scholarly authenticity.

4.3. Ethical Issues

The digitization and AI-driven annotation of Cyril’s Lexicon raise critical ethical considerations [96]. The first one corresponds to data ownership. Determining the rights to digitized manuscripts, annotations, and any derived AI models is essential to ensuring equitable access and proper acknowledgment of cultural and intellectual property. Clear policies should be established to address these concerns in a transparent and fair manner. Another key issue is related to the authenticity of digital representations. While digital surrogates protect original writings from physical degradation, they must faithfully capture the nuances and context of the original materials. Inaccurate digitization or AI-induced errors could misrepresent the content and structure of the manuscripts, potentially leading to misleading interpretations by future scholars. Additionally, reliance on AI raises concerns about biases in training data. Since the model is trained on a selected dataset of manuscripts, biases or inconsistencies in the dataset could propagate through the AI outputs. Mitigating this issue requires rigorous validation, quality control, and ongoing collaboration between technical experts and humanities scholars to ensure the reliability and fidelity of the digitization process. Addressing these ethical matters is vital for maintaining the task’s integrity, transparency, and cultural sensitivity.

5. Conclusions

Concluding the work, it is evident that modern technology profoundly influences numerous aspects of contemporary life, and its application in cultural preservation tasks is no exception. This work presents the pipeline for digitizing historical manuscripts, from image capture to metadata creation and search engine development. Employing advanced techniques, such as high-resolution imaging with adaptive lighting, deep learning for image processing, and AI-assisted metadata validation, significantly enhances accuracy and efficiency. Yet, as such complexity may initially pose challenges, the collaboration of diverse experts is also required.
Future plans include integrating natural language processing models into the search engine to provide intuitive, multilingual, and semantically enriched queries, enhancing its impact and usability. This would allow end-users to access and explore the manuscripts with unprecedented precision and ease. At the same time, the documents would be made more accessible globally, fostering cross-cultural research and collaboration. Similarly, future applications could leverage augmented reality to improve the user experience through manuscripts’ dynamic and interactive explorations [97].

Author Contributions

Conceptualization, S.N.M., D.I. and K.A.T.; methodology, S.N.M. and D.I.; software, S.N.M., D.I. and K.E.; validation, S.N.M., D.I. and K.A.T.; formal analysis, P.E.N. and A.T.; investigation, P.E.N.; resources, P.E.N. and A.T.; data curation, D.I.; writing—original draft preparation, S.N.M., D.I. and K.E.; writing—review and editing, K.A.T.; visualization, S.N.M.; supervision, K.A.T. and P.E.N.; project administration, A.T.; funding acquisition, P.E.N. and A.T. All authors have read and agreed to the published version of the manuscript.

Funding

This work is financially supported by the Hellenic Foundation for Research and Innovation (HFRI) under the Basic Research Funding (Horizontal support for all Sciences) and the National Recovery and Resilience Plan (Greece 2.0) funded by the European Union—NextGenerationEU (HFRI Number: KE 014890).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data are available from the corresponding author upon reasonable request.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Reynolds, L.D.; Wilson, N.G. Scribes and Scholars: A Guide to the Transmission of Greek and Latin Literature; Oxford University Press: Oxford, UK, 2013. [Google Scholar]
  2. Ioakeimidou, D.; Moutsis, S.N.; Evangelidis, K.; Tsintotas, K.A.; Nastou, P.E.; Perdiki, E.; Gkinidis, E.; Tsoukatos, N.; Tsolomitis, A.; Konstantinidou, M.; et al. Cyril’s Lexicon Layout Analysis Through Deep Learning. In Proceedings of the IEEE International Conference on Imaging Systems and Techniques, Tokyo, Japan, 14–16 October 2024; pp. 1–6. [Google Scholar]
  3. Lopatin, L. Library digitization projects, issues and guidelines: A survey of the literature. Libr. Tech. 2006, 24, 273–289. [Google Scholar] [CrossRef]
  4. Pandey, R.; Kumar, V. Exploring the Impediments to Digitization and Digital Preservation of Cultural Heritage Resources: A Selective Review. Preserv. Digit. Technol. Cult. 2020, 49, 26–37. [Google Scholar] [CrossRef]
  5. Liu, J.; Ma, X.; Wang, L.; Pei, L. How Can Generative Artificial Intelligence Techniques Facilitate Intelligent Research into Ancient Books? Acm J. Comput. Cult. Herit. 2024, 17, 1–20. [Google Scholar] [CrossRef]
  6. Quirós, L.; Vidal, E. Reading order detection on handwritten documents. Neural Comput. Appl. 2022, 34, 9593–9611. [Google Scholar] [CrossRef]
  7. Perino, M.; Pronti, L.; Moffa, C.; Rosellini, M.; Felici, A. New Frontiers in the Digital Restoration of Hidden Texts in Manuscripts: A Review of the Technical Approaches. Heritage 2024, 7, 683–696. [Google Scholar] [CrossRef]
  8. Wei, D.; An, S.; Zhang, X.; Tian, J.; Tsintotas, K.A.; Gasteratos, A.; Zhu, H. Dual Regression for Efficient Hand Pose Estimation. In Proceedings of the IEEE International Conference on Robotics and Automation, Philadelphia, PA, USA, 23–27 May 2022; pp. 6423–6429. [Google Scholar]
  9. An, S.; Zhang, X.; Wei, D.; Zhu, H.; Yang, J.; Tsintotas, K.A. FastHand: Fast monocular hand pose estimation on embedded systems. J. Syst. Archit. 2022, 122, 102361. [Google Scholar] [CrossRef]
  10. Kansizoglou, I.; Misirlis, E.; Tsintotas, K.; Gasteratos, A. Continuous emotion recognition for long-term behavior modeling through recurrent neural networks. Technologies 2022, 10, 59. [Google Scholar] [CrossRef]
  11. Perdiki, E. Preparing Big Manuscript Data for Hierarchical Clustering with Minimal HTR Training. J. Data Min. Digit. Humanit. 2023. [Google Scholar] [CrossRef]
  12. Kansizoglou, I.; Bampis, L.; Gasteratos, A. Deep feature space: A geometrical perspective. IEEE Trans. Pattern Anal. Mach. Intell. 2021, 44, 6823–6838. [Google Scholar] [CrossRef]
  13. Konstantinidis, F.K.; Kansizoglou, I.; Tsintotas, K.A.; Mouroutsos, S.G.; Gasteratos, A. The role of machine vision in industry 4.0: A textile manufacturing perspective. In Proceedings of the IEEE International Conference on Imaging Systems and Techniques, Kaohsiung, Taiwan, 24–26 August 2021; pp. 1–6. [Google Scholar]
  14. Oikonomou, K.M.; Kansizoglou, I.; Gasteratos, A. A hybrid reinforcement learning approach with a spiking actor network for efficient robotic arm target reaching. IEEE Robot. Autom. Lett. 2023, 8, 3007–3014. [Google Scholar] [CrossRef]
  15. Ioakeimidou, D.; Chatzoudes, D.; Chatzoglou, P. Assessing data analytics maturity: Proposing a new measurement scale. J. Bus. Anal. 2025, 8, 55–69. [Google Scholar] [CrossRef]
  16. Zhong, Z.; Wang, J.; Sun, H.; Hu, K.; Zhang, E.; Sun, L.; Huo, Q. A hybrid approach to document layout analysis for heterogeneous document images. In Proceedings of the International Conference on Document Analysis and Recognition, San José, CA, USA, 21–26 August 2023; pp. 189–206. [Google Scholar]
  17. Jarlbrink, J.; Snickars, P. Cultural heritage as digital noise: Nineteenth century newspapers in the digital archive. J. Doc. 2017, 73, 1228–1243. [Google Scholar] [CrossRef]
  18. Schofield, R.; King, L.; Tayal, U.; Castellano, I.; Stirrup, J.; Pontana, F.; Earls, J.; Nicol, E. Image reconstruction: Part 1–understanding filtered back projection, noise and image acquisition. J. Cardiovasc. Comput. Tomogr. 2020, 14, 219–225. [Google Scholar] [CrossRef] [PubMed]
  19. Perrin, J.M. Digitizing Flat Media: Principles and Practices; Rowman & Littlefield: Lanham, MD, USA, 2015. [Google Scholar]
  20. Lynn Maroso, A. Educating future digitizers: The Illinois Digitization Institute’s Basics and Beyond digitization training program. Libr. Tech. 2005, 23, 187–204. [Google Scholar] [CrossRef]
  21. Shashidhara, B.; Amith, G. A review on text extraction techniques for degraded historical document images. In Proceedings of the Second International Conference on Advances in Information Technology (ICAIT), Chikkamagaluru, India, 24–27 July 2024; Volume 1, pp. 1–8. [Google Scholar]
  22. Verhoeven, G. Multispectral and hyperspectral imaging. In The Encyclopedia of Archaeological Sciences; John Wiley & Sons, Inc: Hoboken, NJ, USA, 2018; pp. 1–4. [Google Scholar]
  23. Mazzocato, S.; Cimino, D.; Daffara, C. Integrated microprofilometry and multispectral imaging for full-field analysis of ancient manuscripts. J. Cult. Herit. 2024, 66, 110–116. [Google Scholar] [CrossRef]
  24. Earl, G.; Basford, P.; Bischoff, A.; Bowman, A.; Crowther, C.; Dahl, J.; Hodgson, M.; Isaksen, L.; Kotoula, E.; Martinez, K.; et al. Reflectance transformation imaging systems for ancient documentary artefacts. In Proceedings of the Electronic Visualisation and the Arts, London, UK, 6–8 July 2011. [Google Scholar]
  25. Lech, P.; Matera, M.; Zakrzewski, P. Using reflectance transformation imaging (RTI) to document ancient amphora stamps from Tanais, Russia. Reflections on first approach to their digitalisation. J. Archaeol. Sci. Rep. 2021, 36, 102839. [Google Scholar] [CrossRef]
  26. Qian, Q.; Gunturk, B.K. Extending depth of field and dynamic range from differently focused and exposed images. Multidimens. Syst. Signal Process. 2016, 27, 493–509. [Google Scholar] [CrossRef]
  27. Zhang, Y.; Wu, Y.; Zhang, Y.; Ozcan, A. Color calibration and fusion of lens-free and mobile-phone microscopy images for high-resolution and accurate color reproduction. Sci. Rep. 2016, 6, 27811. [Google Scholar] [CrossRef]
  28. Suissa, O.; Elmalech, A.; Zhitomirsky-Geffet, M. Text analysis using deep neural networks in digital humanities and information science. J. Assoc. Inf. Sci. Technol. 2022, 73, 268–287. [Google Scholar] [CrossRef]
  29. Jones, C.; Duffy, C.; Gibson, A.; Terras, M. Understanding multispectral imaging of cultural heritage: Determining best practice in MSI analysis of historical artefacts. J. Cult. Herit. 2020, 45, 339–350. [Google Scholar] [CrossRef]
  30. Pande, S.D.; Jadhav, P.P.; Joshi, R.; Sawant, A.D.; Muddebihalkar, V.; Rathod, S.; Gurav, M.N.; Das, S. Digitization of handwritten Devanagari text using CNN transfer learning–A better customer service support. Neurosci. Inform. 2022, 2, 100016. [Google Scholar] [CrossRef]
  31. Dulecha, T.G.; Fanni, F.A.; Ponchio, F.; Pellacini, F.; Giachetti, A. Neural reflectance transformation imaging. Vis. Comput. 2020, 36, 2161–2174. [Google Scholar] [CrossRef]
  32. Koho, M.; Burrows, T.; Hyvönen, E.; Ikkala, E.; Page, K.; Ransom, L.; Tuominen, J.; Emery, D.; Fraas, M.; Heller, B.; et al. Harmonizing and publishing heterogeneous premodern manuscript metadata as Linked Open Data. J. Assoc. Inf. Sci. Technol. 2022, 73, 240–257. [Google Scholar] [CrossRef]
  33. Alma’aitah, W.; Talib, A.Z.; Osman, M.A. Opportunities and challenges in enhancing access to metadata of cultural heritage collections: A survey. Artif. Intell. Rev. 2020, 53, 3621–3646. [Google Scholar] [CrossRef]
  34. Sulaiman, A.; Omar, K.; Nasrudin, M.F. Degraded historical document binarization: A review on issues, challenges, techniques, and future directions. J. Imaging 2019, 5, 48. [Google Scholar] [CrossRef] [PubMed]
  35. Chaitra, B.; Reddy, P.B. Digital image forgery: Taxonomy, techniques, and tools–a comprehensive study. Int. J. Syst. Assur. Eng. Manag. 2023, 14, 18–33. [Google Scholar] [CrossRef]
  36. Garai, A.; Biswas, S.; Mandal, S.; Chaudhuri, B.B. Dewarping of document images: A semi-CNN based approach. Multimed. Tools Appl. 2021, 80, 36009–36032. [Google Scholar] [CrossRef]
  37. Hernández, R.M.; Shaus, A. New technologies for tracing magical texts and drawings: Experience with automatic binarization algorithms. In Proceedings of the The Materiality of Greek and Roman Curse Tablets: Technological Advances, Institute for the Study of Ancient Cultures; University of Chicago: Chicago, IL, USA, 2022; pp. 33–43. [Google Scholar]
  38. Gupta, M.R.; Jacobson, N.P.; Garcia, E.K. OCR binarization and image pre-processing for searching historical documents. Pattern Recognit. 2007, 40, 389–397. [Google Scholar] [CrossRef]
  39. Wang, J.; Li, W. The construction of a digital resource library of English for higher education based on a cloud platform. Sci. Program. 2021, 2021, 4591780. [Google Scholar] [CrossRef]
  40. Tsintotas, K.A.; Moutsis, S.N.; Konstantinidis, F.K.; Gasteratos, A. Toward smart supply chain: Adopting internet of things and digital twins. In Proceedings of the AIP International Conference on ICT, Entertainment Technologies, and Intelligent Information Management in Education and Industry, Aizuwakamatsu, Japan, 23–26 January 2024; Volume 3220. [Google Scholar]
  41. Mansouri, Y.; Toosi, A.N.; Buyya, R. Data storage management in cloud environments: Taxonomy, survey, and future directions. Acm Comput. Surv. (Csur) 2017, 50, 1–51. [Google Scholar] [CrossRef]
  42. Yang, P.; Xiong, N.; Ren, J. Data security and privacy protection for cloud storage: A survey. IEEE Access 2020, 8, 131723–131740. [Google Scholar] [CrossRef]
  43. Persico, V.; Montieri, A.; Pescapè, A. On the Network Performance of Amazon S3 Cloud-Storage Service. In Proceedings of the 5th IEEE International Conference on Cloud Networking, Pisa, Italy, 3–5 October 2016; pp. 113–118. [Google Scholar]
  44. Bisong, E. An overview of google cloud platform services. In Building Machine Learning and Deep Learning Models on Google Cloud Platform: A Comprehensive Guide for Beginners; Apress: Berkeley, CA, USA, 2019; pp. 7–10. [Google Scholar]
  45. Collier, M.; Shahan, R. Microsoft Azure Essentials-Fundamentals of Azure; Microsoft Press: Redmond, WA, USA, 2015. [Google Scholar]
  46. Gupta, B.; Mittal, P.; Mufti, T. A review on amazon web service (aws), microsoft azure & google cloud platform (gcp) services. In Proceedings of the 2nd International Conference on ICT for Digital, Smart, and Sustainable Development, Jamia Hamdard, New Delhi, India, 27–28 February 2020; p. 9. [Google Scholar]
  47. Miryala, N.K.; Gupta, D. Big Data Analytics in Cloud–Comparative Study. Int. J. Comput. Trends Technol. 2023, 71, 30–34. [Google Scholar] [CrossRef]
  48. Jia, M.; Zhao, Y.C.; Zhang, X.; Wu, D. “That looks like something I would do”: Understanding humanities researchers’ digital hoarding behaviors in digital scholarship. J. Doc. 2024, 81, 24–55. [Google Scholar] [CrossRef]
  49. Dikaiakos, M.D.; Katsaros, D.; Mehra, P.; Pallis, G.; Vakali, A. Cloud computing: Distributed internet computing for IT and scientific research. IEEE Internet Comput. 2009, 13, 10–13. [Google Scholar] [CrossRef]
  50. Ganesan, P. Cloud-Based Disaster Recovery: Reducing Risk and Improving Continuity. J. Artif. Intell. Cloud Comput. 2024, 3, 2–4. [Google Scholar] [CrossRef]
  51. Gholami, A.; Laure, E. Security and privacy of sensitive data in cloud computing: A survey of recent developments. arXiv 2016, arXiv:1601.01498. [Google Scholar] [CrossRef]
  52. Ray, P.P. A survey of IoT cloud platforms. Future Comput. Inform. J. 2016, 1, 35–46. [Google Scholar] [CrossRef]
  53. Tsintotas, K.A.; Moutsis, S.N.; Kansizoglou, I.; Konstantinidis, F.K.; Gasteratos, A. The advent of AI in modern supply chain. In Proceedings of the Olympus International Conference on Supply Chains, Katerini, Greece, 24–26 May 2024; pp. 333–343. [Google Scholar]
  54. Humphrey, J. Manuscripts and metadata: Descriptive metadata in three manuscript catalogs: DigCIM, MALVINE, and Digital Scriptorium. Cat. Classif. Q. 2007, 45, 19–39. [Google Scholar] [CrossRef]
  55. Guéville, E.; Wrisley, D.J. Transcribing medieval manuscripts for machine learning. J. Data Min. Digit. Humanit. 2024, 10, 01. [Google Scholar] [CrossRef]
  56. Colla, D.; Goy, A.; Leontino, M.; Magro, D.; Picardi, C. Bringing semantics into historical archives with computer-aided rich metadata generation. J. Comput. Cult. Herit. (Jocch) 2022, 15, 1–24. [Google Scholar] [CrossRef]
  57. Ioakeimidou, D.; Symeonidis, S.; Chatzoudes, D.; Chatzoglou, P. From data to knowledge in four decades: A systematic literature review of human resource analytics. J. Manag. Anal. 2025, 12, 87–116. [Google Scholar] [CrossRef]
  58. Beals, M.; Bell, E. The atlas of digitised newspapers and metadata: Reports from Oceanic Exchanges. Loughborough 2020, 10, m9. [Google Scholar]
  59. Bellotto, A. Medieval manuscript descriptions and the Semantic Web: Analysing the impact of CIDOC CRM on Italian codicological-paleographical data. Dhq Digit. Humanit. Q. 2020, 14. [Google Scholar]
  60. Griffin, S.M. Epigraphy and paleography: Bringing records from the distant past to the present. Int. J. Digit. Libr. 2023, 24, 77–85. [Google Scholar] [CrossRef]
  61. Philips, J.P.; Tabrizi, N. Historical document processing: A survey of techniques, tools, and trends. arXiv 2020, arXiv:2002.06300. [Google Scholar] [CrossRef]
  62. Moutsis, S.N.; Tsintotas, K.A.; Ioannis, K.; An, S.; Yiannis, A.; Gasteratos, A. Fall detection paradigm for embedded devices based on YOLOv8. In Proceedings of the IEEE International Conference on Imaging Systems and Techniques, Copenhagen, Denmark, 17–19 October 2023; pp. 1–6. [Google Scholar]
  63. Moutsis, S.N.; Tsintotas, K.A.; Kansizoglou, I.; Gasteratos, A. Evaluating the Performance of Mobile-Convolutional Neural Networks for Spatial and Temporal Human Action Recognition Analysis. Robotics 2023, 12, 167. [Google Scholar] [CrossRef]
  64. Lombardi, F.; Marinai, S. Deep learning for historical document analysis and recognition—A survey. J. Imaging 2020, 6, 110. [Google Scholar] [CrossRef]
  65. Khayyat, M.M.; Elrefaei, L.A. Towards author recognition of ancient Arabic manuscripts using deep learning: A transfer learning approach. Int. J. Comput. Digit. Syst. 2020, 90. [Google Scholar] [CrossRef]
  66. Hollaus, F.; Gau, M.; Sablatnig, R. Multispectral image acquisition of ancient manuscripts. In Proceedings of the 4th International Conference on Computational Intelligence for Modelling, Control and Automation (CIMCA 2012), Lemessos, Cyprus, 29 October–3 November 2012; pp. 30–39. [Google Scholar]
  67. Prathap, G.; Afanasyev, I. Deep learning approach for building detection in satellite multispectral imagery. In Proceedings of the 2018 international conference on intelligent systems (IS), Funchal, Portugal, 25–27 September 2018; pp. 461–465. [Google Scholar]
  68. Sullivan, M.J.; Easton Roger, J.; Beeby, A. Reading Behind the Lines: Ghost Texts and Spectral Imaging in the Manuscripts of Alfred Tennyson. Rev. Engl. Stud. 2025, 76, hgaf007. [Google Scholar] [CrossRef]
  69. Jayanthi, N.; Indu, S.; Hasija, S.; Tripathi, P. Digitization of ancient manuscripts and inscriptions—A review. In Proceedings of the Advances in Computing and Data Sciences: 1st International Conference, Ghaziabad, India, 11–12 November 2016; pp. 605–612. [Google Scholar]
  70. Owen, D.; Groom, Q.; Hardisty, A.; Leegwater, T.; Livermore, L.; van Walsum, M.; Wijkamp, N.; Spasic, I. Towards a scientific workflow featuring Natural Language Processing for the digitisation of natural history collections. Res. Ideas Outcomes 2020, 6, e50150. [Google Scholar] [CrossRef]
  71. Kahle, P.; Colutto, S.; Hackl, G.; Mühlberger, G. Transkribus—A Service Platform for Transcription, Recognition and Retrieval of Historical Documents. In Proceedings of the 14th IAPR International Conference on Document Analysis and Recognition (ICDAR), Kyoto, Japan, 9–15 November 2017; Volume 04, pp. 19–24. [Google Scholar]
  72. Nockels, J.; Gooding, P.; Ames, S.; Terras, M. Understanding the application of handwritten text recognition technology in heritage contexts: A systematic review of Transkribus in published research. Arch. Sci. 2022, 22, 367–392. [Google Scholar] [CrossRef]
  73. Miloud, K.; Abdelmounaim, M.L.; Mohammed, B.; Ilyas, B.R. Advancing ancient arabic manuscript restoration with optimized deep learning and image enhancement techniques. Trait. Signal 2024, 41, 2203. [Google Scholar] [CrossRef]
  74. Assael, Y.; Sommerschield, T.; Prag, J. Restoring ancient text using deep learning: A case study on Greek epigraphy. arXiv 2019, arXiv:1910.06262. [Google Scholar] [CrossRef]
  75. Li, C.; Guo, J.; Porikli, F.; Pang, Y. LightenNet: A convolutional neural network for weakly illuminated image enhancement. Pattern Recognit. Lett. 2018, 104, 15–22. [Google Scholar] [CrossRef]
  76. Wang, S.; Cen, Y.; Qu, L.; Li, G.; Chen, Y.; Zhang, L. Virtual Restoration of Ancient Mold-Damaged Painting Based on 3D Convolutional Neural Network for Hyperspectral Image. Remote Sens. 2024, 16, 2882. [Google Scholar] [CrossRef]
  77. Mao, H.; Cheung, M.; She, J. Deepart: Learning joint representations of visual arts. In Proceedings of the 25th ACM international conference on Multimedia, Mountain View, CA, USA, 23–27 October 2017; pp. 1183–1191. [Google Scholar]
  78. Ravishankar, S.; Ye, J.C.; Fessler, J.A. Image reconstruction: From sparsity to data-adaptive methods and machine learning. Proc. IEEE 2019, 108, 86–109. [Google Scholar] [CrossRef]
  79. Kraken Security Labs. Homepage. 2025. Available online: https://kraken.re/main/index.html (accessed on 17 March 2025).
  80. Ocropus Developers. DUP-Ocropy Repository. Available online: https://github.com/ocropus-archive/DUP-ocropy (accessed on 17 March 2025).
  81. Wecker, A.J.; Raziel-Kretzmer, V.; Kiessling, B.; Ezra, D.S.B.; Lavee, M.; Kuflik, T.; Elovits, D.; Schorr, M.; Schor, U.; Jablonski, P. Tikkoun Sofrim: Making ancient manuscripts digitally accessible: The case of Midrash Tanhuma. Acm J. Comput. Cult. Herit. (Jocch) 2022, 15, 1–20. [Google Scholar] [CrossRef]
  82. Kiessling, B.; Tissot, R.; Stokes, P.; Ezra, D.S.B. eScriptorium: An open source platform for historical document analysis. In Proceedings of the International Conference on Document Analysis and Recognition Workshops, Sydney, Australia, 22–25 September 2019; Volume 2, p. 19. [Google Scholar]
  83. Jacsont, P.; Leblanc, E. Impact of Image Enhancement Methods on Automatic Transcription Trainings with eScriptorium. J. Data Min. Digit. Humanit. 2023. [Google Scholar] [CrossRef]
  84. Petrík, M.; Mataš, E.; Sabo, M.; Ries, M.; Matejčík, Š. Fast Detection and Classification of Ink by Ion Mobility Spectrometry and Artificial Intelligence. IEEE Access 2025, 13, 33379–33386. [Google Scholar] [CrossRef]
  85. López-Baldomero, A.B.; Buzzelli, M.; Moronta-Montero, F.; Martínez-Domingo, M.Á.; Valero, E.M. Ink classification in historical documents using hyperspectral imaging and machine learning methods. Spectrochim. Acta Part Mol. Biomol. Spectrosc. 2025, 335, 125916. [Google Scholar] [CrossRef]
  86. Ciambella, F. AI-Driven Intralingual Translation across Historical Varieties: Theoretical Frameworks and Examples from Early Modern English. Iperstoria 2024, 23, 15–30. [Google Scholar]
  87. Novac, O.C.; Chirodea, M.C.; Novac, C.M.; Bizon, N.; Oproescu, M.; Stan, O.P.; Gordan, C.E. Analysis of the application efficiency of TensorFlow and PyTorch in convolutional neural network. Sensors 2022, 22, 8872. [Google Scholar] [CrossRef] [PubMed]
  88. Guha, S. Doris: A tool for interactive exploration of historic corpora (Extended Version). arXiv 2017, arXiv:1711.00714. [Google Scholar] [CrossRef]
  89. Lian, Y.; Xie, J. The Evolution of Digital Cultural Heritage Research: Identifying Key Trends, Hotspots, and Challenges through Bibliometric Analysis. Sustainability 2024, 16, 7125. [Google Scholar] [CrossRef]
  90. Terras, M. Opening Access to collections: The making and using of open digitised cultural content. Online Inf. Rev. 2015, 39, 733–752. [Google Scholar] [CrossRef]
  91. Ioakeimidou, D.; Chatzoudes, D.; Symeonidis, S.; Chatzoglou, P. HRA adoption via organizational analytics maturity: Examining the role of institutional theory, resource-based view and diffusion of innovation. Int. J. Manpow. 2023, 44, 363–380. [Google Scholar] [CrossRef]
  92. Papanikolaou, D. Sacred, Profane, Troublesome, Adventurous: The Lexicon Cyrilli across Ages and Manuscripts. Bull. John Rylands Libr. 2020, 96, 1–18. [Google Scholar] [CrossRef]
  93. Tzutalin. LabelImg. 2015. Available online: https://github.com/tzutalin/labelImg (accessed on 17 March 2025).
  94. Wang, C.Y.; Yeh, I.H.; Liao, H.Y.M. YOLOv9: Learning What You Want to Learn Using Programmable Gradient Information. arXiv 2024, arXiv:2402.13616. [Google Scholar] [CrossRef]
  95. Cyril Lexicon. Available online: https://cyril-lexicon.aegean.gr/ (accessed on 17 March 2025).
  96. Gasteratos, A.; Moutsis, S.N.; Tsintotas, K.A.; Aloimonos, Y. Future Aspects in Human Action Recognition: Exploring Emerging Techniques and Ethical Influences. In Proceedings of the 40th Anniversary of the IEEE Conference on Robotics and Automation, Rotterdam, The Netherlands, 23–26 September 2024. [Google Scholar]
  97. Campanella, S.; Alnasef, A.; Falaschetti, L.; Belli, A.; Pierleoni, P.; Palma, L. A Novel Embedded Deep Learning Wearable Sensor for Fall Detection. IEEE Sens. J. 2024, 24, 15219–15229. [Google Scholar] [CrossRef]
Figure 1. The manuscript digitization process includes various steps, initially focusing on image optimization for better clarity and readability, as well as cloud-based storage. Metadata tools and advanced technologies, e.g., optical character recognition, follow, transforming the initial data into searchable and editable formats. Finally, digital preservation and user access make historical manuscripts available to the audience.
Table 1. An overview of machine and deep learning across the manuscript digitization stages.
  • Image Capture: automatic optimization of camera settings; real-time noise reduction and sharpening; uncovering hidden text or underdrawings.
  • Image Preprocessing: automated transcription; optical character recognition (OCR) (ABBYY FineReader, Google Tesseract); handwritten text recognition (HTR) (Transkribus); image enhancement and text reconstruction.
  • Metadata Creation: language identification; dates, keywords, and document structure; forgery detection and authenticity verification; layout analysis; identifying anomalies.
  • Text Analysis: language translation; organizing manuscripts by themes; interactive exploration and visualization tools.
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
