1. Introduction
In recent years, the processes of integrating archival documents into the digital environment have been developing rapidly. State and non-state organizations, as well as scientific institutions, are implementing large-scale digitization initiatives to preserve documentary heritage, facilitate its use, and ensure the long-term preservation of archival materials. Although these initiatives significantly expand the openness and accessibility of archival resources, they also lead to a continuous increase in the volume of digital documents. As a result, modern archival information systems receive a large volume of new digitized materials every day, which further complicates the process of their manual processing and systematization.
Traditional methods of classifying documents by content, type, or archival collection are labor-intensive and resource-intensive. The presence of documents in various formats, the combination of text and image data, and the fact that historical archival documents are often outdated or of poor quality further complicate this process. Therefore, there is a growing need for automated technological solutions aimed at improving document retrieval, reducing human error, and increasing overall efficiency.
Although existing archival information systems offer a variety of practical functions for storing and searching documents, automatic classification and cataloging mechanisms based on international metadata standards remain insufficiently implemented. In many systems, the main focus is on document storage and simple search capabilities, while document description and metadata creation are often performed manually by specialists. For large archival fonds, such approaches can lead to lost time, subjective errors in metadata description, and reduced overall efficiency.
In this context, the development of intelligent methods for automatic document processing is of great importance. Machine learning technologies create new opportunities for analyzing document content, identifying semantic features, and classifying documents into predefined categories. At the same time, standardized metadata models allow for structured description, interoperability, and efficient search of data in digital archives. Among such standards, the Dublin Core metadata model is widely used in digital libraries and archival systems due to its flexibility and simplicity.
However, much research in this area has focused on individual components of document processing, such as optical character recognition (OCR), text classification, or metadata extraction. These approaches have often been developed as stand-alone tools and have not been sufficiently integrated into a unified system for use in archival environments. In addition, while most existing research has focused on improving classification accuracy, the integration of document classification with the automatic generation of standard metadata models used in digital archives has not been sufficiently explored. As a result, practical solutions that combine document recognition, classification, and metadata generation into a single, scalable workflow compatible with real archival information systems are still lacking.
The rapid growth of digitized archival materials poses new challenges for archival institutions in organizing, classifying, and managing documents. Manual cataloging, especially when dealing with large collections of documents of various types and formats, requires a significant amount of time and human resources. In addition, inconsistencies in the manual creation of metadata can lead to additional difficulties in document retrieval and information management. Therefore, it is necessary to develop automated approaches that allow analyzing document content, classifying them into appropriate categories, and automatically generating standardized metadata descriptions.
To address these issues, this study proposes an integrated system for automatic classification and cataloging of archival documents based on machine learning methods and the Dublin Core metadata model. The proposed system combines the processes of text extraction based on OCR, document processing, classification using machine learning models, and automatic metadata generation within a single pipeline. This integration serves to more effectively organize the processes of document organization and cataloging in digital archival information systems.
The main scientific contributions of this study are as follows: first, an integrated architecture for automatic processing of archival documents is proposed. The developed pipeline combines the processes of OCR-based text extraction, text processing, machine learning-based document classification, and automatic generation of Dublin Core metadata into a single system. This integration ensures that the different stages of document analysis work in a coordinated manner in an automated environment.
Second, a systematic comparative analysis of machine learning models such as Logistic Regression, Naive Bayes, Support Vector Machines (SVM), LightGBM, and BERT is performed. All models are evaluated under the same experimental conditions, allowing for a comparison of the effectiveness of classical machine learning methods and modern deep learning models in document classification.
Third, the study analyzes not only the classification accuracy but also the computational efficiency. The training time and computational resource requirements of the models are studied, highlighting the trade-off between efficiency and computational cost in large-scale archival document processing systems.
Fourth, the proposed approach demonstrates the potential for automatic organization, indexing, and cataloging of archival documents by integrating machine learning techniques with the Dublin Core metadata model.
Finally, the developed system is designed taking into account real archival workflows, providing a scalable and practical solution for managing large-scale digitized document collections in modern digital archives.
2. Literature Review
In the infrastructure of digital archives and electronic libraries, the correct description and long-term preservation of documents are an important component of effective information management. The constant increase in the volume of digitized documents creates the need to ensure the accuracy, consistency, and compliance of metadata with international standards. Therefore, the Open Archival Information System (OAIS) reference model, which standardizes long-term preservation processes by combining information flows and functional blocks within a single logical structure, is widely used as a conceptual and methodological basis for the design of digital archival systems [1].
Studies on metadata quality have shown that experts offer differing semantic interpretations when applying the Dublin Core standard. Such differences have been shown to produce incomplete or inconsistent metadata, reducing the accuracy of search and reuse processes [2]. To eliminate these shortcomings, digital library systems use the Metadata Encoding and Transmission Standard (METS), which combines descriptive, administrative, and structural metadata in a single XML-based container, significantly simplifying integration between different information systems [3].
The need for a deeper representation of contextual information about archival materials led to the development of the Encoded Archival Description (EAD) standard. By presenting archival holdings, hierarchical structures, and inventories in a machine-readable format, this standard gives users a broader range of search capabilities than traditional catalogs [4]. The Preservation Metadata Implementation Strategies (PREMIS) standard is also important in overcoming the technical, legal, and organizational issues associated with the long-term preservation of digital objects; it provides semantic units that allow the complexities of the preservation process to be managed systematically [5].

Early research on automatic classification of documents and texts was mainly based on the Naive Bayes model, which is notable for its simplicity and computational efficiency. In a number of experimental studies, this model has shown competitive results even when compared to more complex algorithms [6].
In recent years, there has been a significant increase in interest in ensemble approaches, in particular Random Forest, XGBoost, and LightGBM, to improve the accuracy and robustness of classification. Empirical studies have shown that these models outperform single models in various domains, especially in text and document title classification tasks [7]. Experiments on large image document datasets such as RVL-CDIP, created by the Ryerson Vision Lab, also confirm that ensemble models combined with deep learning methods can achieve very high accuracy in document type identification [8].
In the field of text classification, BERT and its extensions have been shown to outperform traditional machine learning algorithms, especially when processing semantically complex texts [9]. Extending this approach, the LayoutLM model has been shown to perform well on scanned documents, forms, and receipts by jointly learning text, layout, and visual features [10]. In addition, OCR-free transformer architectures such as Donut process visual documents end-to-end, reducing the error chain in text recognition and providing consistent results in classification and information extraction tasks [11]. Layout-based multimodal approaches have also been shown to provide significant advantages in document class detection by integrating text, visual, and structural features, confirming the importance of multimodal integration in document understanding [12].
In modern library and archive systems, manual cataloging and metadata creation are becoming a limiting factor amid the rapid growth of information resources. High time costs and human error increase the need for automated solutions, raising the relevance of approaches based on artificial intelligence and machine learning. Research shows that these technologies enable automatic metadata generation, document classification, and subject indexing, and are of great importance in optimizing library services. At the same time, the introduction of AI technologies into cataloging raises concerns about ethics, quality control, and algorithmic transparency [13]. Previous studies have described information systems in which document processing is organized into separate functional components specialized for data processing, storage, and text analysis. This structured approach simplifies document management and improves the efficiency of searching and retrieving information [14].
Applying artificial intelligence to archival and records management processes can increase the openness of records, improve search efficiency, and automate the processing of large amounts of data. AI and machine learning methods can automate processes such as transcription, indexing, metadata extraction, and classification, reducing manual work. At the same time, they can optimize the stages of document evaluation, classification, and storage, creating conditions for more efficient use of archival materials [15].
In the scientific literature, the automatic classification of archival and library documents has been widely studied using data mining and machine learning approaches. In particular, the method proposed by Qiao extracts features from archival data, normalizes them, and, after noise reduction, classifies documents into categories using an entropy-based weighting and classification model. Experiments have shown that this approach achieves accuracy of up to 97%, confirming that it enables efficient management and fast search of large volumes of archival data [16].
Recent research has widely applied artificial intelligence and natural language processing technologies to the effective management of electronic archives and digital library documents. Methods based on machine learning and transformer-architecture language models have been developed for automatic classification of documents at the chapter or section level, text segmentation, and metadata generation. These approaches help improve content indexing, increase search accuracy, and speed up information retrieval. Experiments show that classifiers based on language models achieve higher accuracy and F1-scores than traditional algorithms [17].
The literature also notes that the volume of archival data is growing rapidly, while manual classification of documents is labor-intensive and inefficient. It is therefore advisable to use artificial intelligence and natural language processing methods to automatically analyze and classify archival texts according to their semantic content. A multi-label classification model that combines extended convolutional encoding, a label graph-based attention mechanism, and multi-granular semantic attention modules classifies documents automatically at the word, phrase, and sentence levels of the text. Experiments have shown that this approach achieves higher accuracy and F1-scores than traditional methods and supports effective management and fast search of archival data [18].
The use of artificial intelligence technologies in libraries and information institutions makes it possible to automate cataloging and classification processes. Studies have evaluated the effectiveness of generative AI systems, including Claude AI, in automatically generating classification symbols, subject headings, and cutter numbers for library resources. Such systems provide correct recommendations at the general class level and reduce the intellectual load of cataloging. However, their stability and consistency are insufficient, as different classification results are observed for the same document in different sessions; it is therefore emphasized that AI tools are more effective as an aid to specialists than as a fully automatic solution [19].
Research on automatic classification of educational and management documents shows that machine learning algorithms can be applied effectively. In particular, TF–IDF-based feature extraction combined with a Naive Bayes classifier enables fast and accurate classification of documents. In this approach, the stages of text pre-cleaning, segmentation, stop-word removal, and vectorization are performed, and an automated classification system is built. Experimental results have shown that the Naive Bayes algorithm provides higher accuracy and stability than SVM, Random Forest, and MLP methods [20].
In recent years, multimodal approaches have been widely used in automatic document classification, where the joint processing of image and text features allows more accurate identification of the semantic content of documents. One study proposed a multimodal deep learning model combining text embeddings obtained through OCR with visual features extracted by convolutional neural networks. This approach was shown to increase accuracy by approximately 3% compared to image-only models on the Tobacco3482 and RVL-CDIP datasets. The results confirm that integrating text and image data improves the efficiency of automatic processing of archival and administrative documents [21].
The use of natural language processing and deep learning models has been found to be effective for the automatic classification of text documents and the extraction of knowledge from them. Neural networks such as BERT, LSTM, and CNN provide high accuracy in classifying texts into specific categories, identifying semantic features, and rapidly indexing data [22]. Moreover, when processing large volumes of text, pre-trained transformer models are effective in extracting important features and automatically categorizing documents, which helps optimize intelligent cataloging processes in electronic archive systems [23].
3. Materials and Methods
This section describes in detail the working principle of the proposed system for automatic document analysis, classification, and generation of metadata based on the Dublin Core standard. The system includes all stages, including receiving incoming documents (PDF, JPG, DOCX), converting them to text using OCR, text cleaning and processing, analysis based on machine learning and natural language processing methods, and finally generating a standardized metadata record.
These processes consist of modules that work in series and in parallel, each of which performs a specific task. In the proposed approach, documents are first cleaned and normalized, and then contextual embeddings are generated and used in the classification and summarization processes within a single pipeline [24]. In particular, the NER module identifies the names of individuals and organizations, the classification module determines the subject and type of the document, and the summarization module forms a short content description of the document. These results are combined into a single system and exported as Dublin Core metadata. The general architecture and functional stages of the proposed approach are presented in Figure 1.
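To make the data flow concrete, the stages above can be sketched as a minimal Python pipeline. The stage functions below are illustrative stubs only (the file name, entity name, and labels are invented for this example), not the system's actual implementation:

```python
# Minimal sketch of the pipeline: OCR text -> NER, classification,
# summarization -> Dublin Core record. Each stage function is a stub
# standing in for a trained model in the real system.

def extract_text(document_path):
    # Stand-in for the Tesseract OCR step.
    return "Annual budget report prepared by the Finance Department."

def recognize_entities(text):
    # Stand-in for the NER module (persons, organizations).
    return ["Finance Department"]

def classify(text):
    # Stand-in for the ML classifier (subject and type of the document).
    return {"subject": "budget", "type": "report"}

def summarize(text):
    # Stand-in for the summarization module: first sentence as abstract.
    return text.split(".")[0] + "."

def build_dublin_core(document_path):
    """Combine the module outputs into a Dublin Core metadata record."""
    text = extract_text(document_path)
    labels = classify(text)
    return {
        "dc:title": document_path.rsplit("/", 1)[-1],
        "dc:creator": "; ".join(recognize_entities(text)),
        "dc:subject": labels["subject"],
        "dc:type": labels["type"],
        "dc:description": summarize(text),
        "dc:format": "application/pdf",
    }

record = build_dublin_core("archive/budget_2023.pdf")
print(record["dc:subject"])  # prints: budget
```

The point of the sketch is the single-pipeline design: each module produces one fragment of the final record, and the export step assembles them into one standardized description.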
In order to evaluate the effectiveness of the proposed pipeline model and verify its performance in real archival conditions, the RVL-CDIP dataset, which is widely used in the field of document image classification, was used in the training and testing processes. This dataset is formed on the basis of document images obtained from real corporate and archival environments, and it embodies the content diversity, graphic complexity and structural differences of documents. The images in the collection are characterized by different scanning quality, noise, uneven text placement, font differences and the presence of handwritten elements, which fully reflects the problems inherent in real archival documents. Therefore, the RVL-CDIP dataset allows testing the developed model in an environment close to practical conditions.
Although RVL-CDIP is widely used as a benchmark dataset, classification difficulty varies across document categories. Document types such as forms, invoices, advertisements, and scientific publications are recognized more reliably due to their distinctive structural patterns and domain-specific vocabulary. In contrast, document types such as letters, memos, and emails may share similar administrative language and layout structures, which increases the probability of misclassification. This observation reflects the semantic overlap between several document categories and explains the challenges in distinguishing structurally similar document types.
Based on this dataset, the system performs the following processes: text extraction via OCR, feature vectorization, classification based on machine learning, and creation of metadata according to the Dublin Core standard. Each of these steps is described in detail in the following sections.
3.1. Data Set and Its Mathematical Description
The RVL-CDIP dataset was used in the training and evaluation stages of the proposed automatic document classification system. This dataset is widely used in the field of document image classification and covers a variety of document images from real corporate and archival environments. It reflects the differences in the content variety, graphic complexity, and structure of documents.
For supervised learning, the RVL-CDIP dataset is represented as follows:

$$D = \{(x_i, y_i)\}_{i=1}^{N}, \qquad x_i \in \mathbb{R}^d, \qquad y_i \in C,$$

where $x_i$ is the feature vector extracted from the document image through the OCR and vectorization steps, and $y_i$ is the class label corresponding to the document. This expression allows document classification to be described formally as a multi-class classification problem.
The dataset contains a total of 16 different semantic and visual categories, defined by the following set of classes:

$$C = \{c_1, c_2, \ldots, c_{16}\}.$$
These classes include document types such as specification, scientific report, scientific publication, resume, questionnaire, presentation, news article, memo, letter, invoice, handwritten, form, file folder, email, budget, and advertisement. The differences in content and structure between the classes give the model a chance to work with complex real-world documents. Additional descriptive attributes are also available for each document image, including image dimensions, raw text extracted through OCR, and elements related to the document layout. These attributes serve to model the document not only based on the text content, but also together with its general structure.
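For illustration, the category set can be encoded as a label list with an index map. The ordering below follows the commonly used RVL-CDIP label order; the study's internal encoding may differ:

```python
# The 16 RVL-CDIP document categories, mapped to integer class labels.
RVL_CDIP_CLASSES = [
    "letter", "form", "email", "handwritten", "advertisement",
    "scientific report", "scientific publication", "specification",
    "file folder", "news article", "budget", "invoice",
    "presentation", "questionnaire", "resume", "memo",
]

LABEL_TO_ID = {name: i for i, name in enumerate(RVL_CDIP_CLASSES)}
print(len(LABEL_TO_ID))  # prints: 16
```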
When training the model, the data was divided into training and validation sets based on stratification:

$$D = D_{\text{train}} \cup D_{\text{val}}, \qquad D_{\text{train}} \cap D_{\text{val}} = \varnothing, \qquad \frac{|D_{\text{train}}|}{|D|} = 0.7.$$

Here, stratification was used to keep the proportions of all classes balanced. Mathematically, the stratification condition is expressed as follows:

$$\frac{|\{(x_i, y_i) \in D_{\text{train}} : y_i = c\}|}{|D_{\text{train}}|} \approx \frac{|\{(x_i, y_i) \in D : y_i = c\}|}{|D|}, \qquad \forall c \in C.$$
This approach helped prevent rare classes from being overlooked and certain classes from being overly favored. Additional experiments were conducted on local archival documents, including OCR results in Uzbek, and manuscript materials, increasing the adaptability of the developed model to real archival conditions.
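The stratified split described above can be illustrated with a small pure-Python helper (a simplified stand-in for library routines such as scikit-learn's `train_test_split` with the `stratify` option):

```python
import random
from collections import defaultdict

def stratified_split(samples, train_ratio=0.7, seed=42):
    """Split (features, label) pairs so each class keeps its proportion.

    Simplified stand-in for sklearn.model_selection.train_test_split
    with stratify=y: each class is shuffled and split independently.
    """
    by_class = defaultdict(list)
    for x, y in samples:
        by_class[y].append((x, y))
    rng = random.Random(seed)
    train, val = [], []
    for label, items in by_class.items():
        rng.shuffle(items)
        cut = int(round(len(items) * train_ratio))
        train.extend(items[:cut])
        val.extend(items[cut:])
    return train, val

# Toy example: an imbalanced two-class dataset (70 letters, 30 invoices).
data = [(i, "letter") for i in range(70)] + [(i, "invoice") for i in range(30)]
train, val = stratified_split(data)
print(len(train), len(val))  # prints: 70 30
```

Because each class is split separately, the 70/30 ratio holds within every class, which is exactly the stratification condition stated above.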
These experiments were mainly performed as a practical validation step to verify the robustness of the proposed pipeline when processing documents in languages other than English. Since the primary goal of the study was the evaluation of the classification pipeline using the RVL-CDIP dataset, the detailed quantitative results of these additional experiments are not included in the current manuscript.
3.2. OCR-Based Text Extraction and Feature Vectorization Model
The process of extracting text from images of archival documents was carried out using the Tesseract OCR tool. OCR algorithms struggle to perform flawlessly under the quality degradation, color distortion, worn paper textures, and printing defects found in real archival environments. Therefore, a series of preprocessing steps was performed to achieve more accurate text recognition. In particular:
Binarization removes unnecessary background colors and enhances the contrast of the text.
Deskewing corrects the rotational skew of the image, restoring the alignment of text lines and character shapes.
Noise removal filters out noise and artifacts in the image.
As a result of these operations, the quality of the image passed to the OCR model improves, and the clarity of the text obtained in the next stage increases.
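A minimal pure-Python sketch of two of these preprocessing steps is shown below; a production pipeline would typically rely on an image-processing library such as OpenCV, and deskewing is omitted here for brevity:

```python
def binarize(image, threshold=128):
    """Binarization: map each grayscale pixel to black (0) or white (255)."""
    return [[255 if px >= threshold else 0 for px in row] for row in image]

def remove_noise(image):
    """Noise removal via a 3x3 median filter (borders left unchanged)."""
    h, w = len(image), len(image[0])
    out = [row[:] for row in image]
    for i in range(1, h - 1):
        for j in range(1, w - 1):
            window = sorted(
                image[i + di][j + dj] for di in (-1, 0, 1) for dj in (-1, 0, 1)
            )
            out[i][j] = window[4]  # median of the 9 neighbouring pixels
    return out

# A tiny grayscale patch with one bright salt-noise pixel inside dark text.
patch = [
    [30, 30, 30, 30],
    [30, 255, 30, 30],
    [30, 30, 30, 30],
]
cleaned = remove_noise(patch)   # the 255 outlier is replaced by the median 30
binary = binarize(cleaned)      # dark pixels become 0, light pixels 255
print(cleaned[1][1], binary[1][1])  # prints: 30 0
```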
To express this process mathematically, we first define the space in which the images are stored:

$$I_i \in \mathbb{R}^{H \times W},$$

where $H$ is the image height, $W$ is the image width, and $I_i$ is a matrix in which the brightness value of each pixel is represented by a real number.
The sequence of characters generated by OCR is represented by a text space $T$. The conversion from image to text through the Tesseract model is formalized by the following map:

$$f_{\text{OCR}} : \mathbb{R}^{H \times W} \rightarrow T, \qquad t_i = f_{\text{OCR}}(I_i),$$

where $I_i$ is the image of the $i$-th document and $t_i$ is the text generated by OCR. Since real archival images contain imperfections, it is natural for errors to occur as a result of OCR. This situation is expressed through an uncertainty component:

$$t_i = f_{\text{OCR}}(I_i) + \varepsilon_i,$$

where $\varepsilon_i$ is a random error term that reflects noise in the OCR process, incorrectly recognized characters, or missing letters.
The text generated by OCR is the main input for the next stage—feature extraction, classification, and cataloging. Therefore, the accuracy of this stage directly affects the final quality of the entire system.
During the OCR stage, the text extracted from the document is converted into a digital vector format for further analysis. Since it is not possible to directly feed raw text to machine learning models, the text needs to be mapped into a space of numerical features based on words, their frequency, and semantic weight. This process is carried out through the feature extraction module.
First, the text space is defined:

$$T = \{t_1, t_2, \ldots, t_N\},$$

which is taken as input and mapped into a vector space using the following TF–IDF-based functional map:

$$\varphi : T \rightarrow \mathbb{R}^{|V|},$$

where $|V|$ is the size of the constructed dictionary, i.e., the number of distinct words. The dictionary is defined as follows:

$$V = \{w_1, w_2, \ldots, w_{|V|}\}.$$

The components of the TF–IDF vector for each document are defined as follows:

$$\varphi_j(t_i) = \text{tf}(w_j, t_i) \cdot \log \frac{N}{\text{df}(w_j)},$$

where $t_i$ is the OCR text of the document, $w_j$ is a word from the dictionary, $\text{tf}(w_j, t_i)$ is the frequency of $w_j$ in $t_i$, $\text{df}(w_j)$ is the number of documents containing $w_j$, and $\varphi_j(t_i)$ is the weight of this word in the document.
The advantage of the TF–IDF model is that it reduces the weight of words that occur frequently in the text but have little impact on the meaning and gives greater importance to terms that occur infrequently but are important.
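This weighting behaviour can be checked with a compact pure-Python implementation of the TF–IDF formula (a simplified stand-in for library vectorizers such as scikit-learn's TfidfVectorizer, without smoothing or normalization):

```python
import math
from collections import Counter

def tfidf(docs):
    """Compute weights tf(w_j, t_i) * log(N / df(w_j)) for each document."""
    n = len(docs)
    tokenized = [doc.lower().split() for doc in docs]
    # df(w): number of documents containing word w.
    df = Counter(word for tokens in tokenized for word in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append({w: tf[w] * math.log(n / df[w]) for w in tf})
    return vectors

docs = [
    "invoice payment total amount",
    "invoice payment due date",
    "memo regarding annual archive inventory",
]
vecs = tfidf(docs)
# "invoice" appears in 2 of 3 documents, "inventory" in only 1, so the
# rarer term receives the higher weight, as described above.
print(vecs[2]["inventory"] > vecs[0]["invoice"])  # prints: True
```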
3.3. Classification Based on Machine Learning Models
In the classification stage, the text vectors mapped into the feature space are processed by classifiers based on several machine learning models, selected with the semantic and structural features of archival documents in mind.
In the classification stage of archival documents, the vector expressions generated as a result of the OCR and feature extraction processes described in Section 3.2 are fed to machine learning models. The main goal of this stage is to classify documents into appropriate classes according to their content and structural features. This task is considered as a multi-class classification problem. The study used classical machine learning models (Multinomial Logistic Regression, Naive Bayes, Support Vector Machines (SVM), and LightGBM) and the BERT model with a deep learning-based transformer architecture to solve this task. Classical models are computationally efficient and provide stable results in high-dimensional and sparse text spaces. The BERT model, on the other hand, allows for a deeper semantic representation of document content through contextual embeddings.
Since the Logistic Regression and Naive Bayes models work effectively in the feature space generated based on TF–IDF, they were used in cases where the content of the document is mainly determined by text. These models are computationally efficient and provide stable and fast results when working with large volumes of documents. The SVM model was chosen in cases where the boundaries between classes are complex and it is necessary to separate semantically close document types. This model is distinguished by the ability to form complex decision boundaries even in high-dimensional spaces. The LightGBM model is based on the gradient boosting ensemble and is able to effectively model nonlinear relationships. This model was used in cases where, in addition to text features, there are also structural and layout attributes of the document. This model showed advantages in cases where there are large datasets, complex document structures, and near-real-time performance requirements.
Along with classical models, the BERT model was also used in this study. The BERT model was initially pre-trained in a self-supervised mode on a large amount of unlabeled texts, and in this work it was re-trained using supervised learning (fine-tuning) to adapt it to document classification and named entity recognition (NER) tasks. Based on BERT, each document is represented using contextual embeddings, which allows us to take into account not only the frequency of words, but also their contextual meaning. As a result, the accuracy of identifying semantically close documents increases significantly.
Layout-aware multimodal models such as LayoutLM and LayoutLMv3 have demonstrated strong performance in document analysis tasks. However, the present study focuses primarily on the semantic classification of archival documents and the automatic generation of Dublin Core metadata from OCR-extracted text. Since key metadata elements such as Subject, Type, Description, and Creator depend mainly on textual content rather than document layout, a text-based BERT model was selected in this work. In addition, multimodal approaches require additional layout annotations and visual feature extraction, which may complicate the processing of heterogeneous archival documents with varying scan quality. Therefore, BERT was considered a practical and efficient solution for this study, while multimodal approaches remain a promising direction for future research.
4. Results
This section presents a comprehensive evaluation of the proposed document classification and Dublin Core-based automatic cataloging system on real data. Experiments were performed with the RVL-CDIP dataset to evaluate the accuracy of the OCR step, the quality of the feature space, the classification performance of the machine learning models, and the semantic accuracy of metadata generation. All experiments were performed under controlled and repeatable conditions, ensuring the reliability of the results obtained.
4.4. General Results of Machine Learning Models
Multinomial Logistic Regression, SVM, Naive Bayes, and LightGBM models were used to evaluate the effectiveness of automatic document classification. All models were evaluated under the same experimental conditions—a 70/30 train/validation split and a single TF–IDF feature space. This methodology enabled an objective and consistent comparison of the obtained results.
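For reference, the metrics reported in this section can be computed directly from predicted and true labels. The sketch below uses macro-averaging over classes and toy labels, not the study's actual predictions:

```python
def macro_metrics(y_true, y_pred):
    """Accuracy plus macro-averaged precision, recall, and F1-score."""
    classes = sorted(set(y_true) | set(y_pred))
    precisions, recalls, f1s = [], [], []
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        precisions.append(prec)
        recalls.append(rec)
        f1s.append(f1)
    accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
    k = len(classes)
    return accuracy, sum(precisions) / k, sum(recalls) / k, sum(f1s) / k

# Toy example with three document classes; one memo is misread as a letter.
y_true = ["letter", "memo", "invoice", "memo", "letter", "invoice"]
y_pred = ["letter", "letter", "invoice", "memo", "letter", "invoice"]
acc, prec, rec, f1 = macro_metrics(y_true, y_pred)
print(round(acc, 3))  # prints: 0.833
```

Macro-averaging gives every class equal weight regardless of its frequency, which matters for imbalanced archival collections where rare document types would otherwise be masked by frequent ones.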
Among the classical models, LightGBM showed the highest performance across all key metrics, which is explained by its ability to model nonlinear relationships effectively. The SVM model emerged as a stable alternative with close results. Logistic Regression and Naive Bayes, in turn, are computationally lightweight and proved useful as baseline models where linear or simpler probabilistic assumptions suffice.
The results presented in Table 4 show significant differences in the efficiency of the models used in document classification. In particular, the BERT model achieved the highest results in all metrics (accuracy, 95.1%; precision, 94.7%; recall, 95.4%; F1-score, 95.0%). These results are explained by BERT's context-sensitive transformer architecture and its ability to capture long-range semantic relationships between words. In particular, the high recall indicates the model's ability to identify important documents without falsely discarding them.
A closer examination of classification behavior indicates that some document categories are more difficult to distinguish than others. Documents such as letters, memos, and emails often contain overlapping vocabulary and similar structural organization, which increases the probability of confusion between these classes. Similarly, invoices and forms share structured fields and tabular layouts, which can introduce additional classification challenges. In contrast, categories such as advertisements or scientific publications contain distinctive textual patterns and visual structures, making them easier for the models to identify.
Among the classical machine learning models, LightGBM performed best by a large margin, achieving high results in all metrics (accuracy, 93.2%; F1-score, 93.2%). Its ensemble architecture helps account for nonlinear relationships in the data and provides high stability when analyzing documents with complex semantic and structural features. The SVM model also emerged as a reliable alternative, with high accuracy (89.7%) and balanced precision–recall indicators.
Logistic Regression and Naive Bayes proved to be baseline, computationally efficient solutions. Logistic Regression has limited ability to capture complex linguistic structures because of its reliance on linear relationships, while Naive Bayes cannot fully capture contextual semantics because of its independence assumption. Nevertheless, their fast performance and low computational complexity remain important advantages for near-real-time systems.
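The independence assumption discussed above can be illustrated with a minimal multinomial Naive Bayes classifier. This is a textbook sketch in pure Python, not the study's implementation, and the toy training documents and category names below are hypothetical:

```python
import math
from collections import Counter, defaultdict

def train_nb(docs):
    """Train a multinomial Naive Bayes model on (tokens, label) pairs."""
    word_counts = defaultdict(Counter)   # per-class word frequencies
    class_counts = Counter()             # class priors
    vocab = set()
    for tokens, label in docs:
        class_counts[label] += 1
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return word_counts, class_counts, vocab

def predict_nb(model, tokens):
    word_counts, class_counts, vocab = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, float("-inf")
    for label in class_counts:
        # log prior plus a sum of per-word log likelihoods:
        # every word is treated as independent of the others
        score = math.log(class_counts[label] / total_docs)
        n_words = sum(word_counts[label].values())
        for tok in tokens:
            # Laplace smoothing handles words unseen in this class
            score += math.log((word_counts[label][tok] + 1) / (n_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical toy training data
docs = [
    (["invoice", "total", "amount", "due"], "invoice"),
    (["invoice", "payment", "amount"], "invoice"),
    (["dear", "sincerely", "regards"], "letter"),
    (["dear", "thank", "regards"], "letter"),
]
model = train_nb(docs)
print(predict_nb(model, ["amount", "due"]))  # → invoice
```

Because each word contributes an independent likelihood term, word-order and contextual cues are invisible to the model, which is exactly the limitation noted above.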
The performance of the BERT model can be explained by its ability to generate contextual embeddings and capture long-range semantic dependencies within the document text. This capability allows BERT to better differentiate between document categories that share similar vocabulary but differ in contextual meaning. In contrast, classical machine learning models rely mainly on surface lexical patterns extracted through TF-IDF representations.
The performance of the LightGBM model can be attributed to its gradient boosting architecture, which effectively models nonlinear relationships between discriminative lexical features. Even when OCR introduces moderate noise into the text, TF-IDF representations still preserve important category-specific keywords. The ensemble nature of LightGBM allows the model to exploit these signals efficiently, which explains why its performance approaches that of BERT despite not using contextual language modeling.
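The role of TF-IDF in preserving category-specific keywords can be shown with a from-scratch sketch of the standard weighting scheme (a textbook formulation, not the study's exact vectorizer; the toy corpus is hypothetical):

```python
import math
from collections import Counter

def tfidf(corpus):
    """Return per-document TF-IDF weights for a list of tokenized documents."""
    n = len(corpus)
    # document frequency: in how many documents each term occurs
    df = Counter(term for doc in corpus for term in set(doc))
    result = []
    for doc in corpus:
        tf = Counter(doc)
        result.append({
            term: (count / len(doc)) * math.log(n / df[term])
            for term, count in tf.items()
        })
    return result

# Hypothetical toy corpus: "invoice" is category-specific, "document" is generic
corpus = [
    ["invoice", "amount", "document"],
    ["letter", "greeting", "document"],
    ["report", "summary", "document"],
]
w = tfidf(corpus)
# "document" occurs in every text, so its weight collapses to zero,
# while "invoice" keeps a high, discriminative weight
print(w[0]["document"], w[0]["invoice"])
```

Terms that survive OCR noise and remain frequent within one category but rare elsewhere therefore dominate the feature vector, which is the signal the gradient-boosted trees exploit.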
To evaluate these trends systematically, the model results were also presented visually in terms of accuracy, recall, and F1-score. The comparative histogram clearly shows the superiority of the BERT and LightGBM models, as well as the effectiveness of the classical algorithms on simpler tasks. The diagram shows that transformer-based approaches (BERT) and ensemble models (LightGBM) provide the highest efficiency when working with complex and noisy text data.
Figure 3 was created to show the dynamics of the statistical data presented in the table and the strengths and weaknesses of the models:
The relative performance differences between the models in document classification are clearly visible in the diagrams. In particular, the BERT model achieved the highest level of performance on all measurements (accuracy, precision, recall, and F1-score), confirming the model's highly advanced capability to interpret complex semantic relationships. This demonstrates that transformer-based models capture hidden semantic layers of text far more effectively than earlier methods.
The LightGBM model led among the classical algorithms, providing high accuracy and stability. Its similarly high recall and F1-score values allow it to identify important information even within noisy and irregular OCR text without missing relevant content. The SVM model, in turn, proved a dependable option for complex text classification, delivering consistent and stable performance.
While the classification results demonstrate the effectiveness of modern machine learning and transformer-based models, it is also important to consider the computational efficiency of these approaches. In real-world digital archival systems, large volumes of documents must be processed continuously, making training time and computational cost important evaluation criteria.
To provide a clearer comparison between the evaluated models, an additional analysis of training time was conducted. The results show that classical machine learning algorithms such as Logistic Regression and Naive Bayes are significantly more efficient in terms of computational resources and training time. In contrast, the BERT model achieves the highest classification accuracy but requires substantially higher computational cost due to its transformer-based architecture.
Figure 4 visually illustrates the differences in computational efficiency among the evaluated models. While deep learning approaches such as BERT provide superior classification performance, their training time is significantly longer compared to classical and ensemble-based models. LightGBM offers a practical compromise by providing relatively high accuracy with moderate computational cost. These results highlight the trade-off between classification performance and computational efficiency when selecting models for large-scale digital archival systems.
As shown in the comparative diagrams and the training time analysis, classical machine learning models such as Logistic Regression and Naive Bayes demonstrate high computational efficiency, requiring relatively short training times (18 s and 12 s, respectively). However, their simple modeling assumptions limit their ability to capture complex semantic relationships in document texts.
More advanced models, particularly LightGBM, provide a good balance between computational efficiency and classification performance, achieving high accuracy (93.2%) with moderate training time (34 s). The BERT model achieved the highest results across all evaluation metrics (accuracy 95.1%; precision 94.7%; recall 95.4%; F1-score 95.0%), confirming its ability to capture deep contextual representations of text.
However, this improvement in accuracy comes at the cost of significantly higher computational complexity, as reflected by the longer training time (410 s). These results highlight the trade-off between computational efficiency and classification performance. While classical models remain useful as baseline approaches, LightGBM offers a practical compromise, and BERT provides the highest accuracy for applications requiring deeper semantic analysis.
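The trade-off above can be made concrete by comparing the reported figures directly. The snippet below computes the marginal training cost of BERT's accuracy gain over LightGBM; "seconds per percentage point" is an illustrative metric introduced here, not one used in the study:

```python
# Reported accuracy (%) and training time (s) from the evaluation above
models = {
    "LightGBM": {"accuracy": 93.2, "time_s": 34},
    "BERT":     {"accuracy": 95.1, "time_s": 410},
}

# Marginal cost of BERT's extra accuracy over LightGBM
extra_acc = models["BERT"]["accuracy"] - models["LightGBM"]["accuracy"]
extra_time = models["BERT"]["time_s"] - models["LightGBM"]["time_s"]

print(f"+{extra_acc:.1f} pp accuracy costs +{extra_time} s of training "
      f"({extra_time / extra_acc:.0f} s per percentage point)")
```

Whether roughly 1.9 additional percentage points justify an order-of-magnitude longer training run depends on the archive's throughput requirements, which is exactly the selection question raised above.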
4.5. Quality Analysis of Dublin Core Metadata Generation
The automatic generation of metadata elements based on the Dublin Core standard serves as a central component that determines the system’s semantic completeness, document retrieval speed, and the accuracy of the cataloging process. In this study, the quality of generation was evaluated using eight of the most commonly applied Dublin Core elements: Title, Creator, Date, Description, Subject, Type, Format, and Language. These elements were produced automatically through the use of OCR, NER, machine learning-based classification, summarization, and file metadata extraction.
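A generated record covering the eight elements above is conventionally serialized with the `dc:` prefix bound to the official DCMI namespace. The sketch below uses Python's standard library; the field values are hypothetical placeholders, not output of the described system:

```python
import xml.etree.ElementTree as ET

DC_NS = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("dc", DC_NS)

# Hypothetical output of the extraction pipeline for one document
record = {
    "title": "Annual report of the regional archive",
    "creator": "Regional State Archive",
    "date": "1954-06-12",
    "description": "Summary of archival holdings for 1953.",
    "subject": "archival administration",
    "type": "Text",
    "format": "application/pdf",
    "language": "en",
}

# Wrap each Dublin Core element in the DCMI namespace
root = ET.Element("metadata")
for element, value in record.items():
    ET.SubElement(root, f"{{{DC_NS}}}{element}").text = value

print(ET.tostring(root, encoding="unicode"))
```

Keeping the record as a plain mapping until serialization makes it straightforward to extend the same structure to the full 15-element Dublin Core set later.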
For structural metadata elements such as Title, Date, Format, and Language, the evaluation was performed using exact string matching between the automatically generated values and the ground truth annotations. For the Creator element, the entities identified by the BERT-based named entity recognition (NER) module were compared with manually annotated names of persons and organizations.
For more complex textual fields such as Description, exact string matching was not semantically sufficient. Therefore, the evaluation was conducted based on the semantic similarity between the generated description and the reference description. If the similarity score exceeded a predefined threshold, the generated result was considered correct.
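The threshold-based check for the Description element can be sketched with a simple bag-of-words cosine similarity. Both the similarity measure and the threshold value below are assumptions for illustration; the study's actual measure and threshold are not specified here:

```python
import math
from collections import Counter

def cosine_similarity(a_tokens, b_tokens):
    """Cosine similarity between two bag-of-words vectors."""
    a, b = Counter(a_tokens), Counter(b_tokens)
    dot = sum(a[t] * b[t] for t in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

THRESHOLD = 0.7  # hypothetical acceptance threshold

# Hypothetical generated vs. reference descriptions
generated = "annual report of the regional archive for 1953".split()
reference = "annual report of the regional state archive 1953".split()

score = cosine_similarity(generated, reference)
print(score >= THRESHOLD)  # accepted as a correct description
```

Production systems would typically compare sentence embeddings rather than raw token counts, but the accept/reject logic against a fixed threshold is the same.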
The overall accuracy for each metadata element was calculated as the proportion of documents whose automatically generated value matched the reference annotation:

Accuracy(e) = (N_correct(e) / N_total) × 100%,

where N_correct(e) is the number of documents for which element e was generated correctly and N_total is the total number of evaluated documents.
Based on this evaluation methodology, the results of the automatic generation of the selected Dublin Core metadata elements are presented in Table 5.
The system, whose results are given in Table 6, was evaluated on eight main Dublin Core metadata elements (Title, Creator, Date, Description, Subject, Type, Format, Language), and the results showed the high efficiency of the automatic metadata generation process. Title extraction showed the best result with 95% accuracy, which is explained by the stable performance of OCR and layout analysis. The NER and OCR normalization processes performed accurately on the Creator (88%) and Date (90%) elements.
The Subject (93%) and Type (91%) elements achieved high accuracy thanks to the reliable separation of semantic boundaries by the ML models. The Format element (99%) was detected almost without error from file metadata, while Language detection (97%) benefited from the stability of n-gram-based linguistic models. The lowest result was recorded for Description summarization (86%), which is explained by OCR quality and the sensitivity of summarization models to text integrity.
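The n-gram-based language detection mentioned above can be approximated with character-trigram profiles. The reference profiles below are tiny hypothetical examples; real detectors build their profiles from large corpora and use ranked frequency comparisons rather than simple overlap:

```python
from collections import Counter

def trigrams(text):
    """Character trigram counts, with padding so word boundaries count too."""
    text = f"  {text.lower()}  "
    return Counter(text[i:i + 3] for i in range(len(text) - 2))

def detect_language(text, profiles):
    """Pick the profile sharing the most character trigrams with the text."""
    grams = set(trigrams(text))
    return max(profiles, key=lambda lang: len(grams & set(profiles[lang])))

# Hypothetical miniature reference profiles
profiles = {
    "en": trigrams("the archive of the state department report"),
    "de": trigrams("das archiv des staatlichen berichts und der akten"),
}

print(detect_language("report of the state archive", profiles))  # → en
```

Because trigram statistics are robust to isolated character errors, this family of methods tolerates moderate OCR noise well, which is consistent with the high Language accuracy reported above.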
The experimental results confirm that the document processing stages interact effectively and form a consistent processing workflow. The integration of text segmentation, identification of significant units, automatic classification and content summarization processes served to maintain a high level of semantic quality of metadata records generated based on Dublin Core. This approach allows for further expansion of the system and adaptation to the full 15-element model of Dublin Core.
Figure 5 illustrates that different Dublin Core elements require different levels of complexity during automatic metadata generation. While some elements showed high accuracy because they are identified from structural and technical features, relatively lower results were observed for components that depend more on semantic analysis. This clearly reflects the relationship between the content complexity of metadata elements and the approaches used to identify them.
High accuracy indicators confirm that the system can reliably identify the structural, format, and language features of a document. At the same time, because processes such as content summarization, author identification, and creation of a context-appropriate description require deeper semantic analysis, the probability of errors at these stages remains relatively high. This indicates the need to introduce not only technical but also deep linguistic and logical models into automatic metadata generation.
Overall, the results presented in the graph confirm the stable operation of the proposed pipeline architecture and the high efficiency achieved in the automatic cataloging process. At the same time, increasing the accuracy of some elements remains an important area for future research. This serves as a necessary scientific and practical basis for further improvement of systems aimed at automating the full Dublin Core model.
5. Conclusions
In this study, a model was proposed and evaluated on real data to implement digital processing of archival documents, automatic classification, and Dublin Core-based cataloging within a single integrated system. The proposed approach combines document ingestion, text extraction using OCR, text cleaning and vectorization, classification using machine learning models, and automatic generation of standardized metadata records into a single pipeline. This reduces time expenditure compared to traditional manual processes, minimizes human error, and increases the efficiency of archival resource management.
Experimental results confirmed that preprocessing operations (binarization, deskew, noise reduction) in the OCR stage have a significant impact on the overall stability of the system and the accuracy of subsequent modules. In particular, the quality of text recognition in printed documents was significantly improved, while problems partially remained with handwritten and degraded documents owing to their visual complexity. This shows that OCR quality is an important supporting component for the entire automatic processing chain.
Comparative evaluation of classical, ensemble, and deep learning models at the classification stage showed the superiority of modern ensemble and deep learning approaches for documents with complex semantic structure. In particular, the LightGBM and BERT models were distinguished by high accuracy, stability, and generalization ability. This confirms the methodological justification for using nonlinear and context-sensitive models when working with large volumes of documents with diverse structures.
This can be explained by the different representational capacities of the evaluated models. Transformer-based architectures such as BERT are capable of modeling contextual semantic relationships across the entire document, which improves classification accuracy for semantically similar document types. Ensemble models such as LightGBM, although based on TF-IDF features, remain effective due to their ability to model nonlinear interactions between discriminative lexical features and maintain robustness under moderate OCR noise.
The results obtained on the basis of Dublin Core metadata generation showed the practical effectiveness of automatic cataloging. While elements based on structural features (Format, Language) were identified with high accuracy, lower indicators were observed for elements that are more dependent on semantic analysis (Creator, Description, Date). This indicates the need to introduce more context-rich language models and post-processing mechanisms in the future.
Overall, the developed system demonstrates robustness, scalability, and practical efficiency for the automated processing, classification, and generation of standardized metadata for archival documents.
The modular design allows for independent development of components and integration into real archival infrastructures. Further work will concentrate on tuning OCR, increasing NER accuracy, incorporating context-sensitive models into the summarization modules, and automating the full 15-element Dublin Core model. This is also expected to improve search precision, semantic consistency, and the overall efficiency of digital archival systems.