1. Introduction
In recent years, the processes of integrating archival documents into the digital environment have been developing rapidly. State and non-state organizations, as well as scientific institutions, are implementing large-scale digitization initiatives to preserve documentary heritage, facilitate its use, and ensure the long-term preservation of archival materials. Although these initiatives significantly expand the openness and accessibility of archival resources, they also lead to a continuous increase in the volume of digital documents. As a result, modern archival information systems receive a large volume of new digitized materials every day, which further complicates the process of their manual processing and systematization.
Traditional methods of classifying documents by content, type, or archival collection are labor-intensive and resource-intensive. The presence of documents in various formats, the combination of text and image data, and the fact that historical archival documents are often outdated or of poor quality further complicate this process. Therefore, there is a growing need for automated technological solutions aimed at improving document retrieval, reducing human error, and increasing overall efficiency.
Although existing archival information systems offer a variety of practical functions for storing and searching documents, automatic classification and cataloging mechanisms based on international metadata standards remain insufficiently implemented. In many systems, the main focus is on document storage and simple search capabilities, while document description and metadata creation are often performed manually by specialists. For large archival fonds, such approaches can lead to lost time, subjective errors in metadata description, and reduced overall efficiency.
In this context, the development of intelligent methods for automatic document processing is of great importance. Machine learning technologies create new opportunities for analyzing document content, identifying semantic features, and classifying documents into predefined categories. At the same time, standardized metadata models allow for structured description, interoperability, and efficient search of data in digital archives. Among such standards, the Dublin Core metadata model is widely used in digital libraries and archival systems due to its flexibility and simplicity.
However, much research in this area has focused on individual components of document processing, such as optical character recognition (OCR), text classification, or metadata extraction. These approaches have often been developed as stand-alone tools and have not been sufficiently integrated into a unified system for use in archival environments. In addition, while most existing research has focused on improving classification accuracy, the integration of document classification with the automatic generation of standard metadata models used in digital archives has not been sufficiently explored. As a result, practical solutions that combine document recognition, classification, and metadata generation into a single, scalable workflow compatible with real archival information systems are still lacking.
The rapid growth of digitized archival materials poses new challenges for archival institutions in organizing, classifying, and managing documents. Manual cataloging, especially when dealing with large collections of documents of various types and formats, requires a significant amount of time and human resources. In addition, inconsistencies in the manual creation of metadata can lead to additional difficulties in document retrieval and information management. Therefore, it is necessary to develop automated approaches that allow analyzing document content, classifying them into appropriate categories, and automatically generating standardized metadata descriptions.
To address these issues, this study proposes an integrated system for automatic classification and cataloging of archival documents based on machine learning methods and the Dublin Core metadata model. The proposed system combines the processes of text extraction based on OCR, document processing, classification using machine learning models, and automatic metadata generation within a single pipeline. This integration serves to more effectively organize the processes of document organization and cataloging in digital archival information systems.
The main scientific contributions of this study are as follows: first, an integrated architecture for automatic processing of archival documents is proposed. The developed pipeline combines the processes of OCR-based text extraction, text processing, machine learning-based document classification, and automatic generation of Dublin Core metadata into a single system. This integration ensures that the different stages of document analysis work in a coordinated manner in an automated environment.
Second, a systematic comparative analysis of machine learning models such as Logistic Regression, Naive Bayes, Support Vector Machines (SVM), LightGBM, and BERT is performed. All models are evaluated under the same experimental conditions, allowing for a comparison of the effectiveness of classical machine learning methods and modern deep learning models in document classification.
Third, the study analyzes not only the classification accuracy but also the computational efficiency. The training time and computational resource requirements of the models are studied, highlighting the trade-off between efficiency and computational cost in large-scale archival document processing systems.
Fourth, the proposed approach demonstrates the potential for automatic organization, indexing, and cataloging of archival documents by integrating machine learning techniques with the Dublin Core metadata model.
Finally, the developed system is designed taking into account real archival workflows, providing a scalable and practical solution for managing large-scale digitized document collections in modern digital archives.
2. Literature Review
In the infrastructure of digital archives and electronic libraries, the correct description and long-term preservation of documents are an important component of effective information management. The constant increase in the volume of digitized documents creates the need to ensure the accuracy, consistency, and compliance of metadata with international standards. Therefore, the Open Archival Information System (OAIS) reference model, which standardizes long-term preservation processes by combining information flows and functional blocks within a single logical structure, is widely used as a conceptual and methodological basis for the design of digital archival systems [1].
Studies on metadata quality have shown that experts offer differing semantic interpretations when applying the Dublin Core standard. Such differences have been shown to produce incomplete or inconsistent metadata, reducing the accuracy of search and reuse processes [2]. To eliminate these shortcomings, digital library systems use the Metadata Encoding and Transmission Standard (METS), which combines descriptive, administrative, and structural metadata in a single XML-based container, significantly simplifying integration between different information systems [3].
The need for a deeper representation of contextual information about archival materials led to the development of the Encoded Archival Description (EAD) standard. By presenting archival holdings, hierarchical structures, and inventories in a machine-readable format, this standard gives users a broader range of search capabilities than traditional catalogs [4]. The Preservation Metadata Implementation Strategies (PREMIS) standard is also important in overcoming the technical, legal, and organizational issues associated with the long-term preservation of digital objects; it provides semantic units that allow the complexities of the preservation process to be managed systematically [5].

Early research on automatic classification of documents and texts was mainly based on the Naive Bayes model, which is notable for its simplicity and computational efficiency. In a number of experimental studies, this model has shown competitive results even when compared to more complex algorithms [6].
In recent years, there has been a significant increase in interest in ensemble approaches, in particular Random Forest, XGBoost, and LightGBM, to improve the accuracy and robustness of classification. Empirical studies have shown that these models outperform single models in various domains, especially in text and document title classification tasks [7]. Experiments on large image document datasets such as RVL-CDIP, created by the Ryerson Vision Lab, also confirm that ensemble models combined with deep learning methods can achieve very high accuracy in document type identification [8].
In the field of text classification, BERT and its extensions have been shown to outperform traditional machine learning algorithms, especially when processing semantically complex texts [9]. Extending this approach, the LayoutLM model has been shown to perform well on scanned documents, forms, and receipts by jointly learning text, layout, and visual features [10]. In addition, OCR-free transformer architectures such as Donut process visual documents end-to-end, reducing the error chain in text recognition and providing consistent results in classification and information extraction tasks [11]. Layout-based multimodal approaches have also been shown to provide significant advantages in document class detection by integrating text, visual, and structural features, confirming the importance of multimodal integration in document understanding [12].
In modern library and archive systems, manual cataloging and metadata creation are becoming a limiting factor amid the rapid growth of information resources. High time costs and human error increase the need for automated solutions, raising the relevance of approaches based on artificial intelligence and machine learning. Research shows that these technologies enable automatic metadata generation, document classification, and subject indexing, and are of great importance in optimizing library services. At the same time, the introduction of AI technologies into cataloging raises concerns about ethics, quality control, and algorithmic transparency [13]. Previous studies have described information systems in which document processing is organized into separate functional components specialized for data processing, storage, and text analysis. This structured approach simplifies document management and improves the efficiency of searching and retrieving information [14].
Applying artificial intelligence to archival and records management processes can increase the openness of records, improve search efficiency, and automate the processing of large amounts of data. AI and machine learning methods can automate processes such as transcription, indexing, metadata extraction, and classification, reducing manual work. At the same time, they can optimize the stages of document evaluation, classification, and storage, creating conditions for more efficient use of archival materials [15].
In the scientific literature, the automatic classification of archival and library documents has been widely studied using data mining and machine learning approaches. In particular, the method proposed by Qiao extracts features from archival data, normalizes them, and, after noise reduction, classifies documents into categories using an entropy-based weighting and classification model. Experiments have shown that this approach achieves accuracy of up to 97%, confirming that it enables efficient management and fast search of large volumes of archival data [16].
Recent research has widely applied artificial intelligence and natural language processing technologies to the effective management of electronic archives and digital library documents. Methods based on machine learning and transformer-architecture language models have been developed for automatic classification of documents at the chapter or section level, text segmentation, and metadata generation. These approaches help improve content indexing, increase search accuracy, and speed up information retrieval. Experiments show that classifiers based on language models achieve higher accuracy and F1-scores than traditional algorithms [17].
The literature also notes that the volume of archival data is growing rapidly, while manual classification of documents is labor-intensive and inefficient. It is therefore advisable to use artificial intelligence and natural language processing methods to automatically analyze and classify archival texts according to their semantic content. A multi-label classification model that combines extended convolutional encoding, a label graph-based attention mechanism, and multi-granular semantic attention modules classifies documents automatically at the word, phrase, and sentence levels of the text. Experiments have shown that this approach achieves higher accuracy and F1-scores than traditional methods and supports effective management and fast search of archival data [18].
The use of artificial intelligence technologies in libraries and information institutions makes it possible to automate cataloging and classification processes. Studies have evaluated the effectiveness of generative AI systems, including Claude AI, in automatically generating classification symbols, subject headings, and cutter numbers for library resources. Such systems provide correct recommendations at the general class level and reduce the intellectual load of cataloging. However, their stability and consistency are insufficient, as different classification results are observed for the same document in different sessions; it is therefore emphasized that AI tools are more effective as an aid to specialists than as a fully automatic solution [19].
Research on automatic classification of educational and management documents shows that machine learning algorithms can be applied effectively. In particular, TF–IDF-based feature extraction combined with a Naive Bayes classifier enables fast and accurate classification of documents. In this approach, the stages of text pre-cleaning, segmentation, stop-word removal, and vectorization are performed, and an automated classification system is built. Experimental results have shown that the Naive Bayes algorithm provides higher accuracy and stability than SVM, Random Forest, and MLP methods [20].
In recent years, multimodal approaches have been widely used in automatic document classification, where the joint processing of image and text features allows more accurate identification of the semantic content of documents. One study proposed a multimodal deep learning model combining text embeddings obtained through OCR with visual features extracted by convolutional neural networks. This approach was shown to increase accuracy by approximately 3% compared to image-only models on the Tobacco3482 and RVL-CDIP datasets. The results confirm that integrating text and image data improves the efficiency of automatic processing of archival and administrative documents [21].
The use of natural language processing and deep learning models has been found to be effective for the automatic classification of text documents and the extraction of knowledge from them. Neural networks such as BERT, LSTM, and CNN provide high accuracy in classifying texts into specific categories, identifying semantic features, and rapidly indexing data [22]. Moreover, when processing large volumes of text, pre-trained transformer models are effective in extracting important features and automatically categorizing documents, which helps optimize intelligent cataloging processes in electronic archive systems [23].
3. Materials and Methods
This section describes in detail the working principle of the proposed system for automatic document analysis, classification, and generation of metadata based on the Dublin Core standard. The system includes all stages, including receiving incoming documents (PDF, JPG, DOCX), converting them to text using OCR, text cleaning and processing, analysis based on machine learning and natural language processing methods, and finally generating a standardized metadata record.
These processes consist of modules that work in series and in parallel, each of which performs a specific task. In the proposed approach, documents are first cleaned and normalized, and then contextual embeddings are generated and used in the classification and summarization processes within a single pipeline [24]. In particular, the NER module identifies the names of individuals and organizations, the classification module determines the subject and type of the document, and the summarization module forms a short content description of the document. These results are combined into a single system and exported as Dublin Core metadata. The general architecture and functional stages of the proposed approach are presented in Figure 1.
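To make the data flow concrete, the stages above can be sketched as a minimal Python pipeline. The stage functions below are illustrative stubs only (the file name, entity name, and labels are invented for this example), not the system's actual implementation:

```python
# Minimal sketch of the pipeline: OCR text -> NER, classification,
# summarization -> Dublin Core record. Each stage function is a stub
# standing in for a trained model in the real system.

def extract_text(document_path):
    # Stand-in for the Tesseract OCR step.
    return "Annual budget report prepared by the Finance Department."

def recognize_entities(text):
    # Stand-in for the NER module (persons, organizations).
    return ["Finance Department"]

def classify(text):
    # Stand-in for the ML classifier (subject and type of the document).
    return {"subject": "budget", "type": "report"}

def summarize(text):
    # Stand-in for the summarization module: first sentence as abstract.
    return text.split(".")[0] + "."

def build_dublin_core(document_path):
    """Combine the module outputs into a Dublin Core metadata record."""
    text = extract_text(document_path)
    labels = classify(text)
    return {
        "dc:title": document_path.rsplit("/", 1)[-1],
        "dc:creator": "; ".join(recognize_entities(text)),
        "dc:subject": labels["subject"],
        "dc:type": labels["type"],
        "dc:description": summarize(text),
        "dc:format": "application/pdf",
    }

record = build_dublin_core("archive/budget_2023.pdf")
print(record["dc:subject"])  # prints: budget
```

The point of the sketch is the single-pipeline design: each module produces one fragment of the final record, and the export step assembles them into one standardized description.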
In order to evaluate the effectiveness of the proposed pipeline model and verify its performance in real archival conditions, the RVL-CDIP dataset, which is widely used in the field of document image classification, was used in the training and testing processes. This dataset is formed on the basis of document images obtained from real corporate and archival environments, and it embodies the content diversity, graphic complexity and structural differences of documents. The images in the collection are characterized by different scanning quality, noise, uneven text placement, font differences and the presence of handwritten elements, which fully reflects the problems inherent in real archival documents. Therefore, the RVL-CDIP dataset allows testing the developed model in an environment close to practical conditions.
Although RVL-CDIP is widely used as a benchmark dataset, classification difficulty varies across document categories. Document types such as forms, invoices, advertisements, and scientific publications are recognized more reliably due to their distinctive structural patterns and domain-specific vocabulary. In contrast, document types such as letters, memos, and emails may share similar administrative language and layout structures, which increases the probability of misclassification. This observation reflects the semantic overlap between several document categories and explains the challenges in distinguishing structurally similar document types.
Based on this dataset, the system performs the following processes: text extraction via OCR, feature vectorization, classification based on machine learning, and creation of metadata according to the Dublin Core standard. Each of these steps is described in detail in the following sections.
3.1. Data Set and Its Mathematical Description
The RVL-CDIP dataset was used in the training and evaluation stages of the proposed automatic document classification system. This dataset is widely used in the field of document image classification and covers a variety of document images from real corporate and archival environments. It reflects the differences in the content variety, graphic complexity, and structure of documents.
For supervised learning, the RVL-CDIP dataset is represented as follows:

$$D = \{(x_i, y_i)\}_{i=1}^{N}, \qquad x_i \in \mathbb{R}^d, \qquad y_i \in C,$$

where $x_i$ is the feature vector extracted from the document image through the OCR and vectorization steps, and $y_i$ is the class label corresponding to the document. This expression allows document classification to be described formally as a multi-class classification problem.
The dataset contains a total of 16 different semantic and visual categories, defined by the following set of classes:

$$C = \{c_1, c_2, \ldots, c_{16}\}.$$
These classes include document types such as specification, scientific report, scientific publication, resume, questionnaire, presentation, news article, memo, letter, invoice, handwritten, form, file folder, email, budget, and advertisement. The differences in content and structure between the classes give the model a chance to work with complex real-world documents. Additional descriptive attributes are also available for each document image, including image dimensions, raw text extracted through OCR, and elements related to the document layout. These attributes serve to model the document not only based on the text content, but also together with its general structure.
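For illustration, the category set can be encoded as a label list with an index map. The ordering below follows the commonly used RVL-CDIP label order; the study's internal encoding may differ:

```python
# The 16 RVL-CDIP document categories, mapped to integer class labels.
RVL_CDIP_CLASSES = [
    "letter", "form", "email", "handwritten", "advertisement",
    "scientific report", "scientific publication", "specification",
    "file folder", "news article", "budget", "invoice",
    "presentation", "questionnaire", "resume", "memo",
]

LABEL_TO_ID = {name: i for i, name in enumerate(RVL_CDIP_CLASSES)}
print(len(LABEL_TO_ID))  # prints: 16
```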
When training the model, the data was divided into training and validation sets based on stratification:

$$D = D_{\text{train}} \cup D_{\text{val}}, \qquad D_{\text{train}} \cap D_{\text{val}} = \varnothing, \qquad \frac{|D_{\text{train}}|}{|D|} = 0.7.$$

Here, stratification was used to keep the proportions of all classes balanced. Mathematically, the stratification condition is expressed as follows:

$$\frac{|\{(x_i, y_i) \in D_{\text{train}} : y_i = c\}|}{|D_{\text{train}}|} \approx \frac{|\{(x_i, y_i) \in D : y_i = c\}|}{|D|}, \qquad \forall c \in C.$$
This approach helped prevent rare classes from being overlooked and certain classes from being overly favored. Additional experiments were conducted on local archival documents, including OCR results in Uzbek, and manuscript materials, increasing the adaptability of the developed model to real archival conditions.
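The stratified split described above can be illustrated with a small pure-Python helper (a simplified stand-in for library routines such as scikit-learn's `train_test_split` with the `stratify` option):

```python
import random
from collections import defaultdict

def stratified_split(samples, train_ratio=0.7, seed=42):
    """Split (features, label) pairs so each class keeps its proportion.

    Simplified stand-in for sklearn.model_selection.train_test_split
    with stratify=y: each class is shuffled and split independently.
    """
    by_class = defaultdict(list)
    for x, y in samples:
        by_class[y].append((x, y))
    rng = random.Random(seed)
    train, val = [], []
    for label, items in by_class.items():
        rng.shuffle(items)
        cut = int(round(len(items) * train_ratio))
        train.extend(items[:cut])
        val.extend(items[cut:])
    return train, val

# Toy example: an imbalanced two-class dataset (70 letters, 30 invoices).
data = [(i, "letter") for i in range(70)] + [(i, "invoice") for i in range(30)]
train, val = stratified_split(data)
print(len(train), len(val))  # prints: 70 30
```

Because each class is split separately, the 70/30 ratio holds within every class, which is exactly the stratification condition stated above.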
These experiments were mainly performed as a practical validation step to verify the robustness of the proposed pipeline when processing documents in languages other than English. Since the primary goal of the study was the evaluation of the classification pipeline using the RVL-CDIP dataset, the detailed quantitative results of these additional experiments are not included in the current manuscript.
3.2. OCR-Based Text Extraction and Feature Vectorization Model
The process of extracting text from images of archival documents was carried out using the Tesseract OCR tool. OCR algorithms struggle to perform flawlessly under the quality degradation, color distortion, worn paper textures, and printing defects found in real archival environments. Therefore, a series of preprocessing steps was performed to achieve more accurate text recognition. In particular:
Binarization removes unnecessary background colors and enhances the contrast of the text.
Deskewing corrects the rotational skew of the image, restoring the alignment of text lines and character shapes.
Noise removal filters out noise and artifacts in the image.
As a result of these operations, the quality of the image passed to the OCR model improves, and the clarity of the text obtained in the next stage increases.
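A minimal pure-Python sketch of two of these preprocessing steps is shown below; a production pipeline would typically rely on an image-processing library such as OpenCV, and deskewing is omitted here for brevity:

```python
def binarize(image, threshold=128):
    """Binarization: map each grayscale pixel to black (0) or white (255)."""
    return [[255 if px >= threshold else 0 for px in row] for row in image]

def remove_noise(image):
    """Noise removal via a 3x3 median filter (borders left unchanged)."""
    h, w = len(image), len(image[0])
    out = [row[:] for row in image]
    for i in range(1, h - 1):
        for j in range(1, w - 1):
            window = sorted(
                image[i + di][j + dj] for di in (-1, 0, 1) for dj in (-1, 0, 1)
            )
            out[i][j] = window[4]  # median of the 9 neighbouring pixels
    return out

# A tiny grayscale patch with one bright salt-noise pixel inside dark text.
patch = [
    [30, 30, 30, 30],
    [30, 255, 30, 30],
    [30, 30, 30, 30],
]
cleaned = remove_noise(patch)   # the 255 outlier is replaced by the median 30
binary = binarize(cleaned)      # dark pixels become 0, light pixels 255
print(cleaned[1][1], binary[1][1])  # prints: 30 0
```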
To express this process mathematically, we first define the space in which the images are stored:

$$I_i \in \mathbb{R}^{H \times W},$$

where $H$ is the image height, $W$ is the image width, and $I_i$ is a matrix in which the brightness value of each pixel is represented by a real number.
The sequence of characters generated by OCR is represented by a text space $T$. The conversion from image to text through the Tesseract model is formalized by the following map:

$$f_{\text{OCR}} : \mathbb{R}^{H \times W} \rightarrow T, \qquad t_i = f_{\text{OCR}}(I_i),$$

where $I_i$ is the image of the $i$-th document and $t_i$ is the text generated by OCR. Since real archival images contain imperfections, it is natural for errors to occur as a result of OCR. This situation is expressed through an uncertainty component:

$$t_i = f_{\text{OCR}}(I_i) + \varepsilon_i,$$

where $\varepsilon_i$ is a random error term that reflects noise in the OCR process, incorrectly recognized characters, or missing letters.
The text generated by OCR is the main input for the next stage—feature extraction, classification, and cataloging. Therefore, the accuracy of this stage directly affects the final quality of the entire system.
During the OCR stage, the text extracted from the document is converted into a digital vector format for further analysis. Since it is not possible to directly feed raw text to machine learning models, the text needs to be mapped into a space of numerical features based on words, their frequency, and semantic weight. This process is carried out through the feature extraction module.
First, the text space is defined:

$$T = \{t_1, t_2, \ldots, t_N\},$$

which is taken as input and mapped into a vector space using the following TF–IDF-based functional map:

$$\varphi : T \rightarrow \mathbb{R}^{|V|},$$

where $|V|$ is the size of the constructed dictionary, i.e., the number of distinct words. The dictionary is defined as follows:

$$V = \{w_1, w_2, \ldots, w_{|V|}\}.$$

The components of the TF–IDF vector for each document are defined as follows:

$$\varphi_j(t_i) = \text{tf}(w_j, t_i) \cdot \log \frac{N}{\text{df}(w_j)},$$

where $t_i$ is the OCR text of the document, $w_j$ is a word from the dictionary, $\text{tf}(w_j, t_i)$ is the frequency of $w_j$ in $t_i$, $\text{df}(w_j)$ is the number of documents containing $w_j$, and $\varphi_j(t_i)$ is the weight of this word in the document.
The advantage of the TF–IDF model is that it reduces the weight of words that occur frequently in the text but have little impact on the meaning and gives greater importance to terms that occur infrequently but are important.
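This weighting behaviour can be checked with a compact pure-Python implementation of the TF–IDF formula (a simplified stand-in for library vectorizers such as scikit-learn's TfidfVectorizer, without smoothing or normalization):

```python
import math
from collections import Counter

def tfidf(docs):
    """Compute weights tf(w_j, t_i) * log(N / df(w_j)) for each document."""
    n = len(docs)
    tokenized = [doc.lower().split() for doc in docs]
    # df(w): number of documents containing word w.
    df = Counter(word for tokens in tokenized for word in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append({w: tf[w] * math.log(n / df[w]) for w in tf})
    return vectors

docs = [
    "invoice payment total amount",
    "invoice payment due date",
    "memo regarding annual archive inventory",
]
vecs = tfidf(docs)
# "invoice" appears in 2 of 3 documents, "inventory" in only 1, so the
# rarer term receives the higher weight, as described above.
print(vecs[2]["inventory"] > vecs[0]["invoice"])  # prints: True
```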
3.3. Classification Based on Machine Learning Models
In the classification stage, the text vectors mapped into the feature space are processed by classifiers based on several machine learning models, selected with the semantic and structural features of archival documents in mind.
In the classification stage of archival documents, the vector expressions generated as a result of the OCR and feature extraction processes described in Section 3.2 are fed to machine learning models. The main goal of this stage is to classify documents into appropriate classes according to their content and structural features. This task is considered as a multi-class classification problem. The study used classical machine learning models (Multinomial Logistic Regression, Naive Bayes, Support Vector Machines (SVM), and LightGBM) and the BERT model with a deep learning-based transformer architecture to solve this task. Classical models are computationally efficient and provide stable results in high-dimensional and sparse text spaces. The BERT model, on the other hand, allows for a deeper semantic representation of document content through contextual embeddings.
Since the Logistic Regression and Naive Bayes models work effectively in the feature space generated based on TF–IDF, they were used in cases where the content of the document is mainly determined by text. These models are computationally efficient and provide stable and fast results when working with large volumes of documents. The SVM model was chosen in cases where the boundaries between classes are complex and it is necessary to separate semantically close document types. This model is distinguished by the ability to form complex decision boundaries even in high-dimensional spaces. The LightGBM model is based on the gradient boosting ensemble and is able to effectively model nonlinear relationships. This model was used in cases where, in addition to text features, there are also structural and layout attributes of the document. This model showed advantages in cases where there are large datasets, complex document structures, and near-real-time performance requirements.
Along with classical models, the BERT model was also used in this study. The BERT model was initially pre-trained in a self-supervised mode on a large amount of unlabeled texts, and in this work it was re-trained using supervised learning (fine-tuning) to adapt it to document classification and named entity recognition (NER) tasks. Based on BERT, each document is represented using contextual embeddings, which allows us to take into account not only the frequency of words, but also their contextual meaning. As a result, the accuracy of identifying semantically close documents increases significantly.
Layout-aware multimodal models such as LayoutLM and LayoutLMv3 have demonstrated strong performance in document analysis tasks. However, the present study focuses primarily on the semantic classification of archival documents and the automatic generation of Dublin Core metadata from OCR-extracted text. Since key metadata elements such as Subject, Type, Description, and Creator depend mainly on textual content rather than document layout, a text-based BERT model was selected in this work. In addition, multimodal approaches require additional layout annotations and visual feature extraction, which may complicate the processing of heterogeneous archival documents with varying scan quality. Therefore, BERT was considered a practical and efficient solution for this study, while multimodal approaches remain a promising direction for future research.
4. Results
This section presents a comprehensive evaluation of the proposed document classification and Dublin Core-based automatic cataloging system on real data. Experiments were performed with the RVL-CDIP dataset to evaluate the accuracy of the OCR step, the quality of the feature space, the classification performance of the machine learning models, and the semantic accuracy of metadata generation. All experiments were performed under controlled and repeatable conditions, ensuring the reliability of the results obtained.
4.4. General Results of Machine Learning Models
Multinomial Logistic Regression, SVM, Naive Bayes, and LightGBM models were used to evaluate the effectiveness of automatic document classification. All models were evaluated under the same experimental conditions—a 70/30 train/validation split and a single TF–IDF feature space. This methodology enabled an objective and consistent comparison of the obtained results.
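For reference, the metrics reported in this section can be computed directly from predicted and true labels. The sketch below uses macro-averaging over classes and toy labels, not the study's actual predictions:

```python
def macro_metrics(y_true, y_pred):
    """Accuracy plus macro-averaged precision, recall, and F1-score."""
    classes = sorted(set(y_true) | set(y_pred))
    precisions, recalls, f1s = [], [], []
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        precisions.append(prec)
        recalls.append(rec)
        f1s.append(f1)
    accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
    k = len(classes)
    return accuracy, sum(precisions) / k, sum(recalls) / k, sum(f1s) / k

# Toy example with three document classes; one memo is misread as a letter.
y_true = ["letter", "memo", "invoice", "memo", "letter", "invoice"]
y_pred = ["letter", "letter", "invoice", "memo", "letter", "invoice"]
acc, prec, rec, f1 = macro_metrics(y_true, y_pred)
print(round(acc, 3))  # prints: 0.833
```

Macro-averaging gives every class equal weight regardless of its frequency, which matters for imbalanced archival collections where rare document types would otherwise be masked by frequent ones.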
Among the classical models, LightGBM showed the highest performance across all key metrics, which is explained by its ability to model nonlinear relationships effectively. The SVM model emerged as a stable alternative with close results. Logistic Regression and Naive Bayes, in turn, are computationally lightweight and proved useful as baseline models where linear or simpler probabilistic assumptions suffice.
The results presented in Table 4 show significant differences in the efficiency of the models used in document classification. In particular, the BERT model achieved the highest results in all metrics (accuracy, 95.1%; precision, 94.7%; recall, 95.4%; F1-score, 95.0%). These results are explained by BERT's context-sensitive transformer architecture and its ability to capture long-range semantic relationships between words. In particular, the high recall indicates the model's ability to identify important documents without falsely discarding them.
A closer examination of classification behavior indicates that some document categories are more difficult to distinguish than others. Documents such as letters, memos, and emails often contain overlapping vocabulary and similar structural organization, which increases the probability of confusion between these classes. Similarly, invoices and forms share structured fields and tabular layouts, which can introduce additional classification challenges. In contrast, categories such as advertisements or scientific publications contain distinctive textual patterns and visual structures, making them easier for the models to identify.
Among the classical machine learning models, LightGBM performed best by a large margin, achieving high results in all metrics (accuracy, 93.2%; F1-score, 93.2%). Its ensemble architecture helps account for nonlinear relationships in the data and provides high stability when analyzing documents with complex semantic and structural features. The SVM model also emerged as a reliable alternative, with high accuracy (89.7%) and balanced precision–recall indicators.
Logistic Regression and Naive Bayes proved to be baseline, computationally efficient solutions. Logistic Regression has limited ability to capture complex linguistic structures because of its reliance on linear relationships, while Naive Bayes cannot fully capture contextual semantics because of its independence assumption. Nevertheless, their fast performance and low computational complexity remain important advantages for near-real-time systems.
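The independence assumption discussed above can be illustrated with a minimal multinomial Naive Bayes classifier. This is a textbook sketch in pure Python, not the study's implementation, and the toy training documents and category names below are hypothetical:

```python
import math
from collections import Counter, defaultdict

def train_nb(docs):
    """Train a multinomial Naive Bayes model on (tokens, label) pairs."""
    word_counts = defaultdict(Counter)   # per-class word frequencies
    class_counts = Counter()             # class priors
    vocab = set()
    for tokens, label in docs:
        class_counts[label] += 1
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return word_counts, class_counts, vocab

def predict_nb(model, tokens):
    word_counts, class_counts, vocab = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, float("-inf")
    for label in class_counts:
        # log prior plus a sum of per-word log likelihoods:
        # every word is treated as independent of the others
        score = math.log(class_counts[label] / total_docs)
        n_words = sum(word_counts[label].values())
        for tok in tokens:
            # Laplace smoothing handles words unseen in this class
            score += math.log((word_counts[label][tok] + 1) / (n_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical toy training data
docs = [
    (["invoice", "total", "amount", "due"], "invoice"),
    (["invoice", "payment", "amount"], "invoice"),
    (["dear", "sincerely", "regards"], "letter"),
    (["dear", "thank", "regards"], "letter"),
]
model = train_nb(docs)
print(predict_nb(model, ["amount", "due"]))  # → invoice
```

Because each word contributes an independent likelihood term, word-order and contextual cues are invisible to the model, which is exactly the limitation noted above.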
The performance of the BERT model can be explained by its ability to generate contextual embeddings and capture long-range semantic dependencies within the document text. This capability allows BERT to better differentiate between document categories that share similar vocabulary but differ in contextual meaning. In contrast, classical machine learning models rely mainly on surface lexical patterns extracted through TF-IDF representations.
The performance of the LightGBM model can be attributed to its gradient boosting architecture, which effectively models nonlinear relationships between discriminative lexical features. Even when OCR introduces moderate noise into the text, TF-IDF representations still preserve important category-specific keywords. The ensemble nature of LightGBM allows the model to exploit these signals efficiently, which explains why its performance approaches that of BERT despite not using contextual language modeling.
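The role of TF-IDF in preserving category-specific keywords can be shown with a from-scratch sketch of the standard weighting scheme (a textbook formulation, not the study's exact vectorizer; the toy corpus is hypothetical):

```python
import math
from collections import Counter

def tfidf(corpus):
    """Return per-document TF-IDF weights for a list of tokenized documents."""
    n = len(corpus)
    # document frequency: in how many documents each term occurs
    df = Counter(term for doc in corpus for term in set(doc))
    result = []
    for doc in corpus:
        tf = Counter(doc)
        result.append({
            term: (count / len(doc)) * math.log(n / df[term])
            for term, count in tf.items()
        })
    return result

# Hypothetical toy corpus: "invoice" is category-specific, "document" is generic
corpus = [
    ["invoice", "amount", "document"],
    ["letter", "greeting", "document"],
    ["report", "summary", "document"],
]
w = tfidf(corpus)
# "document" occurs in every text, so its weight collapses to zero,
# while "invoice" keeps a high, discriminative weight
print(w[0]["document"], w[0]["invoice"])
```

Terms that survive OCR noise and remain frequent within one category but rare elsewhere therefore dominate the feature vector, which is the signal the gradient-boosted trees exploit.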
To evaluate these trends systematically, the model results were also presented visually in terms of accuracy, recall, and F1-score. The comparative histogram clearly shows the superiority of the BERT and LightGBM models, as well as the effectiveness of the classical algorithms on simpler tasks. The diagram shows that transformer-based approaches (BERT) and ensemble models (LightGBM) provide the highest efficiency when working with complex and noisy text data.
Figure 3 was created to show the dynamics of the statistical data presented in the table and the strengths and weaknesses of the models:
The relative performance differences between the models in document classification are clearly visible in the diagrams. In particular, the BERT model achieved the highest level of performance on all measurements (accuracy, precision, recall, and F1-score), confirming the model's highly advanced capability to interpret complex semantic relationships. This demonstrates that transformer-based models capture hidden semantic layers of text far more effectively than earlier methods.
The LightGBM model led among the classical algorithms, providing high accuracy and stability. Its similarly high recall and F1-score values allow it to identify important information even within noisy and irregular OCR text without missing relevant content. The SVM model, in turn, proved a dependable option for complex text classification, delivering consistent and stable performance.
While the classification results demonstrate the effectiveness of modern machine learning and transformer-based models, it is also important to consider the computational efficiency of these approaches. In real-world digital archival systems, large volumes of documents must be processed continuously, making training time and computational cost important evaluation criteria.
To provide a clearer comparison between the evaluated models, an additional analysis of training time was conducted. The results show that classical machine learning algorithms such as Logistic Regression and Naive Bayes are significantly more efficient in terms of computational resources and training time. In contrast, the BERT model achieves the highest classification accuracy but requires substantially higher computational cost due to its transformer-based architecture.
Figure 4 visually illustrates the differences in computational efficiency among the evaluated models. While deep learning approaches such as BERT provide superior classification performance, their training time is significantly longer compared to classical and ensemble-based models. LightGBM offers a practical compromise by providing relatively high accuracy with moderate computational cost. These results highlight the trade-off between classification performance and computational efficiency when selecting models for large-scale digital archival systems.
As shown in the comparative diagrams and the training time analysis, classical machine learning models such as Logistic Regression and Naive Bayes demonstrate high computational efficiency, requiring relatively short training times (18 s and 12 s, respectively). However, their simple modeling assumptions limit their ability to capture complex semantic relationships in document texts.
More advanced models, particularly LightGBM, provide a good balance between computational efficiency and classification performance, achieving high accuracy (93.2%) with moderate training time (34 s). The BERT model achieved the highest results across all evaluation metrics (accuracy 95.1%; precision 94.7%; recall 95.4%; F1-score 95.0%), confirming its ability to capture deep contextual representations of text.
However, this improvement in accuracy comes at the cost of significantly higher computational complexity, as reflected by the longer training time (410 s). These results highlight the trade-off between computational efficiency and classification performance. While classical models remain useful as baseline approaches, LightGBM offers a practical compromise, and BERT provides the highest accuracy for applications requiring deeper semantic analysis.
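The trade-off above can be made concrete by comparing the reported figures directly. The snippet below computes the marginal training cost of BERT's accuracy gain over LightGBM; "seconds per percentage point" is an illustrative metric introduced here, not one used in the study:

```python
# Reported accuracy (%) and training time (s) from the evaluation above
models = {
    "LightGBM": {"accuracy": 93.2, "time_s": 34},
    "BERT":     {"accuracy": 95.1, "time_s": 410},
}

# Marginal cost of BERT's extra accuracy over LightGBM
extra_acc = models["BERT"]["accuracy"] - models["LightGBM"]["accuracy"]
extra_time = models["BERT"]["time_s"] - models["LightGBM"]["time_s"]

print(f"+{extra_acc:.1f} pp accuracy costs +{extra_time} s of training "
      f"({extra_time / extra_acc:.0f} s per percentage point)")
```

Whether roughly 1.9 additional percentage points justify an order-of-magnitude longer training run depends on the archive's throughput requirements, which is exactly the selection question raised above.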
4.5. Quality Analysis of Dublin Core Metadata Generation
The automatic generation of metadata elements based on the Dublin Core standard serves as a central component that determines the system’s semantic completeness, document retrieval speed, and the accuracy of the cataloging process. In this study, the quality of generation was evaluated using eight of the most commonly applied Dublin Core elements: Title, Creator, Date, Description, Subject, Type, Format, and Language. These elements were produced automatically through the use of OCR, NER, machine learning-based classification, summarization, and file metadata extraction.
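A generated record covering the eight elements above is conventionally serialized with the `dc:` prefix bound to the official DCMI namespace. The sketch below uses Python's standard library; the field values are hypothetical placeholders, not output of the described system:

```python
import xml.etree.ElementTree as ET

DC_NS = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("dc", DC_NS)

# Hypothetical output of the extraction pipeline for one document
record = {
    "title": "Annual report of the regional archive",
    "creator": "Regional State Archive",
    "date": "1954-06-12",
    "description": "Summary of archival holdings for 1953.",
    "subject": "archival administration",
    "type": "Text",
    "format": "application/pdf",
    "language": "en",
}

# Wrap each Dublin Core element in the DCMI namespace
root = ET.Element("metadata")
for element, value in record.items():
    ET.SubElement(root, f"{{{DC_NS}}}{element}").text = value

print(ET.tostring(root, encoding="unicode"))
```

Keeping the record as a plain mapping until serialization makes it straightforward to extend the same structure to the full 15-element Dublin Core set later.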
For structural metadata elements such as Title, Date, Format, and Language, the evaluation was performed using exact string matching between the automatically generated values and the ground truth annotations. For the Creator element, the entities identified by the BERT-based named entity recognition (NER) module were compared with manually annotated names of persons and organizations.
For more complex textual fields such as Description, exact string matching was not semantically sufficient. Therefore, the evaluation was conducted based on the semantic similarity between the generated description and the reference description. If the similarity score exceeded a predefined threshold, the generated result was considered correct.
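The threshold-based check for the Description element can be sketched with a simple bag-of-words cosine similarity. Both the similarity measure and the threshold value below are assumptions for illustration; the study's actual measure and threshold are not specified here:

```python
import math
from collections import Counter

def cosine_similarity(a_tokens, b_tokens):
    """Cosine similarity between two bag-of-words vectors."""
    a, b = Counter(a_tokens), Counter(b_tokens)
    dot = sum(a[t] * b[t] for t in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

THRESHOLD = 0.7  # hypothetical acceptance threshold

# Hypothetical generated vs. reference descriptions
generated = "annual report of the regional archive for 1953".split()
reference = "annual report of the regional state archive 1953".split()

score = cosine_similarity(generated, reference)
print(score >= THRESHOLD)  # accepted as a correct description
```

Production systems would typically compare sentence embeddings rather than raw token counts, but the accept/reject logic against a fixed threshold is the same.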
The overall accuracy for each metadata element was calculated as the proportion of documents whose automatically generated value matched the reference annotation:

Accuracy(e) = (N_correct(e) / N_total) × 100%,

where N_correct(e) is the number of documents for which element e was generated correctly and N_total is the total number of evaluated documents.
Based on this evaluation methodology, the results of the automatic generation of the selected Dublin Core metadata elements are presented in Table 5.
The system, whose results are given in Table 6, was evaluated on eight main Dublin Core metadata elements (Title, Creator, Date, Description, Subject, Type, Format, Language), and the results showed the high efficiency of the automatic metadata generation process. Title extraction showed the best result with 95% accuracy, which is explained by the stable performance of OCR and layout analysis. The NER and OCR normalization processes performed accurately on the Creator (88%) and Date (90%) elements.
The Subject (93%) and Type (91%) elements achieved high accuracy thanks to the reliable separation of semantic boundaries by the ML models. The Format element (99%) was detected almost without error from file metadata, while Language detection (97%) benefited from the stability of n-gram-based linguistic models. The lowest result was recorded for Description summarization (86%), which is explained by OCR quality and the sensitivity of summarization models to text integrity.
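The n-gram-based language detection mentioned above can be approximated with character-trigram profiles. The reference profiles below are tiny hypothetical examples; real detectors build their profiles from large corpora and use ranked frequency comparisons rather than simple overlap:

```python
from collections import Counter

def trigrams(text):
    """Character trigram counts, with padding so word boundaries count too."""
    text = f"  {text.lower()}  "
    return Counter(text[i:i + 3] for i in range(len(text) - 2))

def detect_language(text, profiles):
    """Pick the profile sharing the most character trigrams with the text."""
    grams = set(trigrams(text))
    return max(profiles, key=lambda lang: len(grams & set(profiles[lang])))

# Hypothetical miniature reference profiles
profiles = {
    "en": trigrams("the archive of the state department report"),
    "de": trigrams("das archiv des staatlichen berichts und der akten"),
}

print(detect_language("report of the state archive", profiles))  # → en
```

Because trigram statistics are robust to isolated character errors, this family of methods tolerates moderate OCR noise well, which is consistent with the high Language accuracy reported above.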
The experimental results confirm that the document processing stages interact effectively and form a consistent processing workflow. The integration of text segmentation, identification of significant units, automatic classification and content summarization processes served to maintain a high level of semantic quality of metadata records generated based on Dublin Core. This approach allows for further expansion of the system and adaptation to the full 15-element model of Dublin Core.
Figure 5 illustrates that different Dublin Core elements require different levels of complexity during automatic metadata generation. While some elements showed high accuracy because they are identified from structural and technical features, relatively lower results were observed for components that depend more on semantic analysis. This clearly reflects the relationship between the content complexity of metadata elements and the approaches used to identify them.
High accuracy indicators confirm that the system can reliably identify the structural, format, and language features of a document. At the same time, because processes such as content summarization, author identification, and creation of a context-appropriate description require deeper semantic analysis, the probability of errors at these stages remains relatively high. This indicates the need to introduce not only technical but also deep linguistic and logical models into automatic metadata generation.
Overall, the results presented in the graph confirm the stable operation of the proposed pipeline architecture and the high efficiency achieved in the automatic cataloging process. At the same time, increasing the accuracy of some elements remains an important area for future research. This serves as a necessary scientific and practical basis for further improvement of systems aimed at automating the full Dublin Core model.
5. Conclusions
In this study, a model was proposed and evaluated on real data to implement digital processing of archival documents, automatic classification, and Dublin Core-based cataloging within a single integrated system. The proposed approach combines document ingestion, text extraction using OCR, text cleaning and vectorization, classification using machine learning models, and automatic generation of standardized metadata records into a single pipeline. This reduces time expenditure compared to traditional manual processes, minimizes human error, and increases the efficiency of archival resource management.
Experimental results confirmed that preprocessing operations (binarization, deskew, noise reduction) in the OCR stage have a significant impact on the overall stability of the system and the accuracy of subsequent modules. In particular, the quality of text recognition in printed documents was significantly improved, while problems partially remained with handwritten and degraded documents owing to their visual complexity. This shows that OCR quality is an important supporting component for the entire automatic processing chain.
Comparative evaluation of classical, ensemble, and deep learning models at the classification stage showed the superiority of modern ensemble and deep learning approaches for documents with complex semantic structure. In particular, the LightGBM and BERT models were distinguished by high accuracy, stability, and generalization ability. This confirms the methodological justification for using nonlinear and context-sensitive models when working with large volumes of documents with diverse structures.
This can be explained by the different representational capacities of the evaluated models. Transformer-based architectures such as BERT are capable of modeling contextual semantic relationships across the entire document, which improves classification accuracy for semantically similar document types. Ensemble models such as LightGBM, although based on TF-IDF features, remain effective due to their ability to model nonlinear interactions between discriminative lexical features and maintain robustness under moderate OCR noise.
The results obtained on the basis of Dublin Core metadata generation showed the practical effectiveness of automatic cataloging. While elements based on structural features (Format, Language) were identified with high accuracy, lower indicators were observed for elements that are more dependent on semantic analysis (Creator, Description, Date). This indicates the need to introduce more context-rich language models and post-processing mechanisms in the future.
Overall, the developed system demonstrates robustness, scalability, and practical efficiency for the automated processing, classification, and generation of standardized metadata for archival documents.
The modular design allows for independent development of components and integration into real archival infrastructures. Further work will concentrate on tuning OCR, increasing NER accuracy, incorporating context-sensitive models into the summarization modules, and automating the full 15-element Dublin Core model. This is also expected to improve search precision, semantic consistency, and the overall efficiency of digital archival systems.