2.1. Classification of Information, Documents, and Texts by Confidentiality Levels
Assigning documents to appropriate confidentiality levels based on their content is a critical process for ensuring information security. In particular, government agencies, the defense industry, diplomatic missions, and private-sector organizations protect their information assets by applying classifications such as confidential, top secret, or public [1]. Within this context, accurately determining confidentiality levels is essential both for preventing information leakage and for enabling the enforcement of appropriate access control policies. However, the limitations of manual classification procedures and the risks associated with human error have motivated researchers to develop artificial intelligence- and machine learning-based methods that automate confidentiality classification [6]. The resulting body of work in the literature has highlighted the advantages and limitations of various approaches in terms of both model performance and practical deployment challenges. In this section, prior studies on the classification of texts, information, and documents according to confidentiality levels are reviewed and summarized.
Alzhrani et al. [4] proposed a robust preprocessing framework that utilizes Latent Dirichlet Allocation (LDA) to refine training datasets by pruning irrelevant sub-topics from diplomatic cables. By mitigating the impact of linguistic noise within the WikiLeaks dataset, their logistic regression-based approach achieved statistically significant performance gains (p = 0.0007), demonstrating that strategic data purification is essential for maintaining high F1-scores across multi-level security classifications.
Alparslan et al. [5] introduced a hybrid SVM-ANFIS architecture for the security classification of 222 Turkish documents from the TUBITAK UEKAE dataset, reaching an accuracy of 96.67%. By utilizing custom stemming and Chi-square feature selection to overcome the morphological challenges of Turkish, the study demonstrates that merging discriminative scores with fuzzy inference models ensures high precision in sensitive, agglutinative text categorization.
Richter et al. [6] explored the efficacy of deep learning for classifying historical NATO documents by confidentiality levels, demonstrating that a ConvNet-based architecture significantly outperformed traditional Random Forest baselines. Utilizing a balanced subset of unstructured military records dating back to the 1950s, the study achieved a 1.2 to 1.3-fold accuracy improvement through a model optimized with 50-dimensional embeddings and strategic dropout rates. Their results validate that while convolutional networks offer superior performance for complex defense datasets, full exploitation of more advanced architectures like RNNs remains contingent upon high-performance computational resources.
Yazidi et al. [11] performed a comparative analysis of ten machine learning algorithms to automate binary security classification using a DNSA dataset containing 2884 documents on geopolitical conflicts. To ensure model generalization and prevent label leakage, the authors utilized a TF–IDF-weighted bag-of-words representation after meticulously removing explicit security keywords from the text. Their experimental results identified AdaBoost as the superior model, achieving over 90% accuracy across diverse topics and demonstrating significant robustness in handling the high-dimensional feature spaces inherent in sensitive document classification.
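The leakage-prevention step described above can be sketched as follows. This is a minimal illustration only: the marking list `SECURITY_MARKINGS`, the tokenizer, and the particular TF-IDF variant (relative term frequency times log(N/df)) are assumptions made for the example, not details taken from the cited study.

```python
import math
import re
from collections import Counter

# Hypothetical list of explicit markings; the list used in the cited study is not given here.
SECURITY_MARKINGS = {"confidential", "secret", "unclassified", "classified"}

def tokenize(text):
    """Lowercase, keep alphabetic tokens, and drop explicit security markings."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in SECURITY_MARKINGS]

def tfidf(docs):
    """Return one {term: weight} dict per document using tf * log(N / df)."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for toks in tokenized for term in set(toks))
    vectors = []
    for toks in tokenized:
        counts = Counter(toks)
        total = len(toks) or 1
        vectors.append(
            {t: (c / total) * math.log(n_docs / df[t]) for t, c in counts.items()}
        )
    return vectors

docs = [
    "SECRET: troop movements near the border",
    "UNCLASSIFIED: press briefing on trade talks",
]
vectors = tfidf(docs)
```

Because the markings are removed before weighting, a downstream classifier cannot simply learn to read the label off the page.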
Richter and Wrona [12] evaluated open-source machine learning algorithms for the confidentiality classification of over 30,000 historical NATO documents, categorized into three distinct security tiers. Following an extensive preprocessing pipeline involving the KeyGraph algorithm and Porter stemming, the Random Forest model emerged as the top performer with 80% accuracy, assessed through metrics including Cohen’s kappa and misclassification costs. A significant finding of the study was that OCR-induced noise did not diminish classification effectiveness, leading the authors to propose a framework that integrates open-source solutions with confidence scores to enhance information management workflows.
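For readers unfamiliar with Cohen's kappa, the sketch below shows the standard definition: observed agreement corrected for the agreement expected by chance. The function name and the toy labels are illustrative assumptions and do not come from the cited study.

```python
from collections import Counter

def cohens_kappa(y_true, y_pred):
    """Cohen's kappa: (observed - expected agreement) / (1 - expected agreement)."""
    n = len(y_true)
    observed = sum(t == p for t, p in zip(y_true, y_pred)) / n
    true_counts = Counter(y_true)
    pred_counts = Counter(y_pred)
    # Chance agreement from the marginal label frequencies.
    expected = sum(true_counts[c] * pred_counts[c] for c in true_counts) / (n * n)
    if expected == 1.0:
        return 1.0  # both label sequences use a single identical class
    return (observed - expected) / (1.0 - expected)

perfect = cohens_kappa(["A", "A", "B", "B"], ["A", "A", "B", "B"])
chance = cohens_kappa(["A", "A", "B", "B"], ["A", "B", "A", "B"])
```

A value near 1 indicates agreement well beyond chance, while a value near 0 indicates that the classifier agrees with the reference labels no more often than random guessing with the same label frequencies would.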
Heintz et al. [13] developed an automated information guard system by analyzing 101,142 documents from the Foreign Relations of the United States (FRUS) collection to optimize secure data sharing across different clearances. By benchmarking DistilBERT and DistilRoBERTa architectures at the paragraph and metadata levels, the study demonstrated that the integration of document content with structural metadata significantly enhances classification robustness and accuracy. Their findings, validated through rigorous F1-score analysis and confusion matrices across multiple security tiers, suggest that jointly exploiting textual and contextual features provides a scalable foundation for advanced multi-modal information protection systems.
Alzhrani et al. [14] introduced the ACESS (Automated Classification Enabled by Security Similarity) model, a hybrid architecture designed to overcome the limitations of conventional data loss prevention systems in large-scale textual environments. By integrating k-means clustering with localized linear SVM classifiers, the framework performs paragraph-level analysis on imbalanced WikiLeaks diplomatic cables to identify sensitive information with high granularity. The experimental results demonstrated that ACESS consistently exceeds the 90% F1-Measure threshold, with findings indicating that optimizing cluster density relative to dataset size significantly enhances the precision and adaptability of automated security classification.
Trieu et al. [15] introduced a content-based classification framework utilizing a specialized document embedding model (TD2V) and modified k-nearest neighbor retrieval to overcome the semantic limitations of traditional DLP systems. By integrating Automated Query Expansion (AQE) and majority voting, the methodology achieved exceptional classification accuracies exceeding 99% across diverse datasets, including Enron, Snowden, and Dyncorp. Their findings demonstrate that the synergy of TD2V and AQE ensures a robust, real-time solution for sensitive data identification, maintaining high precision even in fragmented text segments with a processing latency of 10–15 ms.
Liang et al. [16] proposed the Incremental Learning and Similarity Comparison (ILSC) framework to handle the dynamic nature of sensitive information classification in expanding document repositories. By integrating online learning algorithms such as Incremental SVM (ISVM) with a manual sentence-based similarity component, the authors achieved an 87.4% accuracy across five security echelons, significantly outperforming static offline models. Their findings emphasize that the synergy between incremental updates and granular similarity checks provides a scalable solution, as evidenced by a 39.5% accuracy increase as the dataset volume expanded during the learning process.
Alparslan et al. [17] proposed a hybrid SVM-ANFIS framework to automate the security grading of 222 Turkish institutional documents from the TUBITAK UEKAE corpus. Leveraging the Zemberek library for morphological analysis and Chi-square for feature optimization, the study utilized SVM-derived scores as antecedent inputs for a fuzzy inference model, which was subsequently discretized into security tiers using the CACC algorithm with specific thresholds. The empirical results demonstrated that the hybrid architecture significantly outperformed standalone models, achieving a 96% accuracy rate and providing a robust solution for mitigating ambiguity in Data Loss Prevention (DLP) environments.
Jiang et al. [18] introduced CES2Vec, a task-specific word embedding framework that integrates confidentiality-driven polarity into vector representations to enhance sensitive information detection. By extending the Word2Vec negative sampling objective with a privacy-oriented loss function, the authors enforced more distinct classification boundaries on a dataset of 89,681 WikiLeaks paragraphs. Their empirical findings revealed that CES2Vec-based CNN classifiers achieved an 80.04% F1-score, significantly outperforming standard FastText and Word2Vec embeddings and demonstrating that task-specific distribution optimization is critical for advancing the state-of-the-art in confidentiality identification.
Alzhrani et al. [19] introduced the MS-CNN (Multi-Sequence Convolutional Neural Network) framework to optimize the identification of sensitive paragraphs within unstructured datasets like WikiLeaks and the Panama Papers. By integrating a Multi-Sequence Learning technique with a Wide CNN architecture, the model effectively mitigates information loss and noise issues inherent in variable-length text segments. Empirical benchmarks on diplomatic and general datasets demonstrated that MS-CNN significantly outperforms deep architectures such as Sent-CNN and VDCNN, particularly in minority class detection with F1-scores reaching 0.773 for highly sensitive categories.
Subhashini and Rani [20] proposed a confidentiality detection methodology that integrates K-means clustering with cluster-specific language models to identify sensitive terminology within the Enron email corpus. By employing TF-IDF vectorization and Dirichlet smoothing, the authors established probability ratios to distinguish between sensitive and non-sensitive lexicons based on their distribution across categorized document groups. Although the study primarily details a conceptual framework rather than providing comprehensive performance metrics, it highlights the potential of using probability-based term identification to refine Data Loss Prevention systems through non-confidential contextual validators.
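To make the probability-ratio idea concrete, the sketch below applies the standard Dirichlet-smoothed unigram estimate, p(w | cluster) = (c(w, cluster) + mu * p(w | collection)) / (|cluster| + mu), to two toy clusters. The function names, the value of `mu`, and the word counts are assumptions chosen for illustration; the cited study's exact formulation may differ.

```python
from collections import Counter

def dirichlet_prob(term, cluster_counts, collection_counts, mu=100.0):
    """Dirichlet-smoothed probability of a term under a cluster language model."""
    cluster_len = sum(cluster_counts.values())
    collection_len = sum(collection_counts.values())
    p_collection = collection_counts[term] / collection_len
    return (cluster_counts[term] + mu * p_collection) / (cluster_len + mu)

def sensitivity_ratio(term, sensitive, non_sensitive, mu=100.0):
    """Ratio > 1 suggests the term is more typical of the sensitive cluster.

    Assumes the term occurs at least once in the combined collection,
    otherwise both probabilities are zero and the ratio is undefined.
    """
    collection = sensitive + non_sensitive
    p_sensitive = dirichlet_prob(term, sensitive, collection, mu)
    p_non_sensitive = dirichlet_prob(term, non_sensitive, collection, mu)
    return p_sensitive / p_non_sensitive

# Toy term counts (hypothetical, not drawn from the Enron corpus).
sensitive = Counter({"merger": 8, "meeting": 2})
non_sensitive = Counter({"meeting": 9, "lunch": 1})

merger_ratio = sensitivity_ratio("merger", sensitive, non_sensitive)
lunch_ratio = sensitivity_ratio("lunch", sensitive, non_sensitive)
```

Smoothing keeps the ratio finite for terms that never occur in one of the clusters, which would otherwise produce a zero probability.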
Kesenek et al. [21] developed a robust classification algorithm to counter content-based adversarial attacks, utilizing a comprehensive dataset of 5539 documents consolidated from TM, Mormon, Dyncorp, and DBPedia. The proposed framework, which integrates Latent Semantic Analysis (LSA) and multi-layered feature extraction with strategic spelling correction, achieved an exceptional baseline accuracy of 99.5%, effectively matching CNN-based architectures in attack-free environments. However, the study highlights superior durability in adversarial scenarios, as the model maintained F1-scores above 90% during intensive character-level manipulations (conditions under which CNN accuracy plummeted to 55%), thereby providing a highly reliable defense mechanism for confidential document identification.
Hart et al. [22] introduced a hierarchical two-stage classification framework employing a “supplement and adjust” strategy to mitigate high false-positive rates, utilizing a hybrid corpus of corporate and public datasets including the Enron email collection. By integrating a primary classifier with a metadata-enhanced meta-classifier using xtra.info features, the authors successfully reduced the False Positive Rate (FPR) on non-corporate documents from a baseline of 87.2% to less than 0.1%. The experimental results further highlight the model’s precision, as the False Discovery Rate (FDR) for the Enron dataset was curtailed from 47.05% to 0.92% while maintaining a False Negative Rate (FNR) below 3.0%, demonstrating that auxiliary metadata is critical for minimizing disruptive false alarms in real-world DLP applications.
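The three error rates quoted above are easy to confuse, so a short sketch of their standard definitions in terms of confusion-matrix counts may help. The counts used here are invented for illustration and are unrelated to the reported results.

```python
def error_rates(tp, fp, tn, fn):
    """Standard error rates from binary confusion-matrix counts."""
    return {
        # Share of truly non-sensitive documents that were flagged.
        "FPR": fp / (fp + tn),
        # Share of flagged documents that were actually non-sensitive.
        "FDR": fp / (fp + tp),
        # Share of truly sensitive documents that were missed.
        "FNR": fn / (fn + tp),
    }

# Hypothetical counts: 100 sensitive and 900 non-sensitive documents.
rates = error_rates(tp=90, fp=10, tn=890, fn=10)
```

FPR is normalized by the number of actual negatives and FDR by the number of raised alerts, which is why the two can differ widely on imbalanced corpora.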
Lin et al. [23] introduced a feature fusion architecture integrating Convolutional Neural Networks (CNN) and Bidirectional Long Short-Term Memory (BiLSTM) networks to optimize the identification of sensitive information within unstructured Chinese electronic documents. Utilizing a specialized dataset of 7500 sensitive documents from encrypted WikiLeaks files and laboratory data, complemented by the Sohu News corpus, the methodology employed the cppjieba tool for segmentation and Word2Vec for generating 200-dimensional embeddings. The experimental results demonstrated that the hybrid model achieved an accuracy of 93.44% and an F1-score of 93.82%, outperforming standalone CNN and BiLSTM architectures by 2.1% and 1.38%, respectively. This research concludes that the synergy of CNN’s localized feature extraction and BiLSTM’s global contextual capture provides superior robustness and stability compared to traditional SVM and keyword-matching techniques in complex linguistic environments.
Sulavko et al. [24] proposed a multi-stage classification framework to detect and grade confidential information within a massive corpus of 1.1 million Russian-language text instances derived from ten different organizations. The methodology employs a hierarchical pipeline where 1024-dimensional E5-based embeddings are first processed by an autoencoder that functions as a high-precision binary filter, achieving a 97.84% accuracy and a negligible 0.015% False Negative Rate. Subsequently, an ensemble of ten CNN models, structured according to the Condorcet Jury Theorem, utilizes a voting mechanism to categorize sensitive messages into four confidentiality levels, yielding an overall multi-class accuracy of 90.3%. This modular architecture represents a robust solution for organizational data protection by virtually eliminating the risk of undetected information leakage through its strategic combination of deep feature extraction and ensemble reliability.
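The intuition behind the Condorcet Jury Theorem can be checked numerically. The sketch below covers the classical setting only: a binary decision, independent voters, and a shared per-voter accuracy `p_correct`. These are simplifying assumptions that the four-class, ten-model ensemble in the cited study does not strictly satisfy, and odd ensemble sizes are used to avoid ties.

```python
from math import comb

def majority_correct_probability(n_voters, p_correct):
    """Probability that a strict majority of independent voters is correct."""
    needed = n_voters // 2 + 1  # smallest number of votes forming a strict majority
    return sum(
        comb(n_voters, k) * p_correct**k * (1.0 - p_correct) ** (n_voters - k)
        for k in range(needed, n_voters + 1)
    )

single = majority_correct_probability(1, 0.7)
trio = majority_correct_probability(3, 0.7)
eleven = majority_correct_probability(11, 0.7)
```

When each voter is better than chance, the majority vote becomes more reliable as the ensemble grows; when voters are worse than chance, the effect reverses.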
Han et al. [25] proposed a prompt-based data augmentation strategy to address class imbalance in security-sensitive text classification, specifically targeting the minority “Secret” class within the WikiLeaks dataset. Using GPT-4o with chain-of-thought reasoning and a sliding-window approach centred on the frequency-weighted median, the authors generated 1596 synthetic Secret-class samples to rebalance the training distribution. A fine-tuned LLaMA 3.1 8B model with LoRA adaptations was subsequently trained on the augmented dataset, achieving 98% accuracy with one-shot generation and 99% with few-shot generation. Their findings demonstrate that prompt-engineering strategies incorporating mathematical similarity constraints and explicit label definitions can effectively mitigate the precision–recall trade-off inherent in imbalanced diplomatic document classification, though the approach has yet to be extended to multi-tier four-class corpora or validated beyond the WikiLeaks benchmark.
Han et al. [7] conducted the first comprehensive survey on text-based information security rating, proposing a standardised taxonomy encompassing domain scope, methodology, and evaluation metrics for automated document security classification. Covering methodological evolution from rule-based systems and clustering approaches through classical machine learning to deep learning architectures, the survey identifies representative datasets (including WikiLeaks, TUBITAK UEKAE, Enron, and DNSA) and highlights five persistent open challenges: scarcity of publicly available labelled corpora, class imbalance in confidential data, absence of domain-specific evaluation metrics, limited cross-domain generalisability, and the need for convergence between administrative and technical approaches. The present study directly addresses three of these challenges through the construction of the FRUS dataset, the application of Focal Loss for class imbalance, and a multi-dataset evaluation framework spanning two distinct diplomatic corpora.
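Since Focal Loss is mentioned as the remedy for class imbalance, a minimal sketch of its usual per-example form, FL(p_t) = -alpha * (1 - p_t)^gamma * log(p_t), is given here. The default values `gamma=2.0` and `alpha=1.0` are common choices assumed for illustration, not settings reported in this paper.

```python
import math

def focal_loss(p_true, gamma=2.0, alpha=1.0):
    """Focal loss for one example, where p_true is the predicted probability of the true class."""
    return -alpha * (1.0 - p_true) ** gamma * math.log(p_true)

easy = focal_loss(0.9)                   # confident, correct prediction
hard = focal_loss(0.1)                   # confident, wrong prediction
cross_entropy_easy = focal_loss(0.9, gamma=0.0)  # gamma = 0 reduces to cross-entropy
```

The modulating factor (1 - p_t)^gamma shrinks the contribution of well-classified examples, so the abundant majority class dominates the gradient less than it would under plain cross-entropy.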
Existing studies on confidentiality-aware classification exhibit notable differences in methodologies, datasets, and evaluation metrics. To enable a systematic comparison and highlight research gaps, a summary of the literature is provided in Table 1.
2.2. Information, Document, and Text Classification in Diverse Contexts
This section reviews text and document classification studies conducted across diverse application domains, ranging from news categorization to financial data analysis. By examining various architectures tailored for specific linguistic structures, document lengths, and contextual depths, the review highlights how classification methodologies adapt to the unique challenges of different information environments.
Tan et al. [26] proposed an optimized k-NN based system for real-time classification of large-scale texts to improve document confidentiality management. To overcome the computational inefficiencies of high-dimensional feature vectors, the authors developed a novel feature selection algorithm (tf-DE) and a parallel computing infrastructure leveraging the AVX-256 instruction set. Experimental results on the THUCNews dataset demonstrated that this hardware-accelerated approach reduced classification time by 53.8% and achieved an average F1-score of 91.4%, effectively competing with SVM models while maintaining superior performance in sports and entertainment categories with scores exceeding 96%.
Liu et al. [
27] introduced the GCSA (Graph Convolutional Network and Self-Attention-based) algorithm to address the limitations of traditional Sentiment Word Tree (SMT) and DFA-based methods in sensitive information detection. To overcome high computational costs and poor generalization on out-of-vocabulary terms, the GCSA framework employs an end-to-end learning strategy that integrates BERT-like encodings with a graph-based integration layer. By synergizing GCNs with a self-attention mechanism, the model captures complex contextual relationships between word pairs, enabling the detection of sensitive content beyond predefined dictionaries while maintaining high precision through softmax-based probability estimation.
Huang [28] developed a two-stage security classification framework integrating Hidden Markov Models (HMM) and Support Vector Machines (SVM) to detect sensitive corporate information within configuration files. Addressing the inadequacy of traditional pattern matching for identifying IP addresses and credentials, the study extracted seven structural features, including file size, extension, and sensitive keyword density, which were subsequently processed via TF-IDF filtering. The methodology utilized a Gaussian HMM for initial screening, followed by a polynomial kernel SVM for secondary validation. On a synthetic dataset, the integrated model reached an accuracy of 60%, whereas the HMM and SVM components independently yielded accuracies of 78% and 76%, respectively; the study nevertheless presents multi-stage hybrid models as a promising direction for mitigating data leakage from internet-based configuration projects.
Gang et al. [29] introduced the SDC (Semantic Dependency-based Classification) algorithm, which prioritizes contextual semantic dependencies over simple word frequencies for sensitive information detection. The methodology employs a two-stage process: a sentence-level Conditional Random Field (CRF) model for word labeling and a semantic tree structure to define sensitivity transfer principles. By integrating individual sentence structures into a unified document-level sensitivity framework, the authors benchmarked various classifiers (including SVM, Naive Bayes, and KNN) on a dataset of 3344 forum posts. Experimental results revealed that the SVM-based SDC model achieved F1-scores exceeding 80% across four security levels, representing a 15–25% improvement over traditional TF-IDF and syntactic analysis approaches.
McDonald et al. [30] proposed a novel approach for the automated classification of sensitive government documents under Freedom of Information (FOI) exemptions by leveraging Part-of-Speech (POS) sequence representations and specialized SVM kernel functions. The methodology evaluates five distinct kernels (Linear, Gaussian, Spectrum, Mismatch, and Smith-Waterman) to determine if linguistic structures can serve as indicators of sensitive content. Experimental benchmarks on a dataset of 3801 government documents demonstrated that while the Linear kernel achieved the highest individual auROC (68.97%), an ensemble approach using Weighted Majority Vote (WMV) significantly improved classification performance, reaching an auROC of 74.43%. Their findings validate that integrating grammatical structures with text classification provides a robust and statistically significant solution (p < 0.001) for identifying sensitive information in public disclosure contexts.
Alneyadi et al. [31] proposed a statistical DLP model integrating TF-IDF and Singular Value Decomposition (SVD) to overcome the limitations of content-based systems in detecting semantically varied sensitive data. By analyzing documents across six distinct confidentiality categories, the framework utilizes Cosine similarity and Taxicab distance to execute classification decisions within a reduced-dimensionality vector space. Experimental results on a corpus of 360 security-related articles demonstrated that the model maintains up to 99% accuracy for known data and, notably, achieves a 63.33% classification rate even when documents are heavily altered with synonyms. Their findings confirm that SVD-enhanced statistical profiles provide a resilient mechanism for semantic sensitivity detection in evolving data leakage scenarios.
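The two measures named above can be sketched in a few lines. The example assumes documents have already been projected into a low-dimensional space; the category names, the two-dimensional vectors, and the nearest-centroid decision rule are illustrative assumptions rather than the cited model's actual procedure.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def taxicab_distance(a, b):
    """Taxicab (Manhattan, L1) distance between two equal-length vectors."""
    return sum(abs(x - y) for x, y in zip(a, b))

def nearest_category(doc, centroids):
    """Pick the category whose centroid is most similar to the document vector."""
    return max(centroids, key=lambda name: cosine_similarity(doc, centroids[name]))

# Hypothetical reduced-space category profiles.
centroids = {"finance": [1.0, 0.0], "hr": [0.0, 1.0]}
label = nearest_category([0.9, 0.1], centroids)
```

Cosine similarity ignores vector length and compares direction only, whereas taxicab distance is sensitive to magnitude, so the two can disagree on documents of very different lengths.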
Perron et al. [32] evaluated the feasibility of Local Large Language Models (LLMs) for the secure analysis of unstructured text data in social work research, addressing the ethical constraints of proprietary cloud-based models. Utilizing a dataset of 2956 child welfare investigation summaries, the study benchmarked Mistral-7b, Mixtral-8×7b, Llama3-8b, and Llama3-70b through zero-shot prompting for classification and extraction tasks. The results highlighted that Llama3-8b achieved 95% accuracy and a 0.93 precision rate, while Llama3-70b demonstrated over 95% faithfulness in text extraction with a Cohen’s kappa of 0.90, matching human-expert performance. Their findings conclude that local, open-source LLMs provide a secure, low-hallucination (<1%) alternative for processing sensitive qualitative data without compromising data privacy.
Sinoara et al. [33] proposed knowledge-enriched document embeddings to enhance text classification, particularly in scenarios where semantic disambiguation is critical. By developing two models, Babel2Vec and NASARI + Babel2Vec, the authors integrated BabelNet synset vectors and utilized the Babelfy system to resolve lexical ambiguities before document vectorization. Experimental evaluations across nine English and Portuguese datasets demonstrated that these semantically enriched representations significantly outperformed traditional Bag-of-Words (BOW) and LDA baselines in Macro-F1 scores, while the Babel2Vec model achieved the highest correlation with human judgments (Pearson: 0.66). Their findings highlight that low-dimensional, knowledge-based embeddings offer superior interpretability and computational efficiency for cross-lingual and short-text classification tasks.
Timmer et al. [34] investigated the efficacy of fine-tuned Transformer models for sensitive sentence detection, focusing on semantic classification challenges where keyword-based methods fall short. Utilizing a real-world dataset of 1073 manually labeled sentences from the Monsanto legal case, the authors fine-tuned a bert-base-uncased model across four distinct sensitive categories (GHOST, TOXIC, CHEMI, REGUL). The experimental results demonstrated that the BERT-based approach significantly outperformed Inference Rule, LSTM, and RecNN baselines, particularly in F2-scores (reaching 65.79% for GHOST), highlighting the robustness of pre-trained architectures in low-resource and high-stakes sensitive information identification scenarios.
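The F2-score reported above is the beta = 2 case of the general F-beta measure, which weights recall more heavily than precision. A minimal sketch of the standard formula follows; the precision and recall values are arbitrary and are not taken from the cited study.

```python
def f_beta(precision, recall, beta=2.0):
    """F-beta score: (1 + beta^2) * P * R / (beta^2 * P + R)."""
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1.0 + b2) * precision * recall / (b2 * precision + recall)

high_recall = f_beta(0.5, 1.0)     # low precision, perfect recall
high_precision = f_beta(1.0, 0.5)  # perfect precision, low recall
```

With beta = 2, missing a sensitive sentence costs more than raising a false alarm, which matches the priorities of most leakage-prevention settings.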
Wu et al. [35] introduced FedAPILLM, a federated learning framework integrated with Large Language Models (LLMs) such as Llama3 and Qwen2.5, to identify sensitive content within API calls without compromising data privacy. To enhance computational efficiency in a distributed environment, the authors employed Low-Rank Adaptation (LoRA) for fine-tuning specific model layers. Experimental evaluations on a manually labeled dataset of 1123 Web API samples demonstrated that the framework achieved near 100% accuracy with a 0% false alarm rate at 2500 iterations, with the Qwen2.5 14B model showing faster convergence compared to Llama3. Their research highlights that the synergy between federated learning and parameter-efficient fine-tuning offers a robust and scalable solution for real-time sensitive data recognition in secure API ecosystems.
Chong [36] proposed a hybrid deep learning framework for real-time sensitive data recognition across both structured and unstructured data sources. The methodology integrates Regular Expressions (regex) for structured data patterns with a BERT-CRF architecture to handle the contextual complexities of unstructured text through Named Entity Recognition (NER). Experimental results on a specialized dataset derived from Conll-2003 and CLUE benchmarks demonstrated that while the standalone BERT model achieved an F1-score of 0.896, the integrated BERT+regex framework yielded superior performance with a precision of 0.925. The findings highlight that combining the deterministic accuracy of regex with BERT’s contextual learning capabilities provides a robust and generalizable solution for identifying diverse sensitive information types.
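The division of labour between pattern matching and a learned tagger can be illustrated with a small sketch. Everything here is an assumption made for the example: the two regular expressions, the span format `(start, end, label)`, the overlap rule that gives regex matches priority, and the `stub_ner` function, which merely stands in for a trained BERT-CRF model.

```python
import re

# Illustrative patterns for structured identifiers (not the cited study's rule set).
PATTERNS = {
    "EMAIL": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "IPV4": re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"),
}

def regex_entities(text):
    """Find structured entities as (start, end, label) spans."""
    found = []
    for label, pattern in PATTERNS.items():
        found.extend((m.start(), m.end(), label) for m in pattern.finditer(text))
    return found

def detect(text, ner_model):
    """Merge regex spans with NER spans, skipping NER spans that overlap an accepted span."""
    spans = regex_entities(text)
    for start, end, label in ner_model(text):
        if not any(start < e and s < end for s, e, _ in spans):
            spans.append((start, end, label))
    return sorted(spans)

def stub_ner(text):
    """Hypothetical stand-in for a contextual NER model."""
    return [(0, 5, "PERSON")] if text.startswith("Alice") else []

text = "Alice wrote to bob@example.com from 10.0.0.1"
entities = detect(text, stub_ner)
```

Letting the deterministic patterns take precedence keeps well-formed identifiers from being relabelled by the statistical model, which is one plausible way to combine the two sources.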
Aydın et al. [37] conducted a comparative analysis between traditional machine learning algorithms (such as Naive Bayes and Decision Trees) and Transformer-based deep learning models, including BERT and GPT-3, for text classification across various educational levels. Utilizing a dataset of 476 documents from the Ministry of National Education and the Council of Higher Education, the study demonstrated that the GPT-3 model achieved the highest performance with 77% accuracy and F1-score, significantly outperforming the best-performing traditional method, Naive Bayes (65% accuracy). Their findings emphasize that the contextual learning capabilities of Transformer architectures provide superior generalization and accuracy compared to classical approaches in complex, multi-level document classification tasks.
Tran et al. [38] systematically evaluated the robustness of text classification algorithms, ranging from classical models (LR, SVM) to deep learning architectures (VDCNN, Transformer), against real-world data corruptions such as keyboard errors, OCR inaccuracies, and microphone distortions. Utilizing a methodology based on the Area Under the Robustness Curve (robustness AUC), the study analyzed performance degradation across YelpPolarity, AG’s News, and DBpedia datasets. The findings revealed that while VDCNN models excel in character-level resilience, Transformers are more robust to word-level distortions; crucially, the research demonstrates that the most accurate models are not necessarily the most robust, emphasizing that data augmentation with corrupted samples is essential to bridge the performance gap between controlled and noisy environments.
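One plausible reading of a robustness AUC is the normalized area under the curve of accuracy plotted against corruption level. The sketch below uses the trapezoidal rule under that assumption; the exact definition used in the cited study is not given here, and the corruption levels and accuracies are invented.

```python
def robustness_auc(levels, accuracies):
    """Trapezoidal area under accuracy vs. corruption level, normalized by the level range."""
    area = 0.0
    for i in range(1, len(levels)):
        width = levels[i] - levels[i - 1]
        area += width * (accuracies[i] + accuracies[i - 1]) / 2.0
    return area / (levels[-1] - levels[0])

levels = [0.0, 0.5, 1.0]                           # hypothetical corruption intensities
stable = robustness_auc(levels, [0.8, 0.8, 0.8])   # accuracy unaffected by noise
brittle = robustness_auc(levels, [1.0, 0.5, 0.0])  # accuracy collapses under noise
```

Under this reading, a model with lower clean accuracy can still obtain the higher score if its accuracy degrades more slowly, which is consistent with the observation that the most accurate models are not necessarily the most robust.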
Chen et al. [39] conducted a comparative study between domain-concept-based machine learning methods and pre-trained deep learning models for the classification of large-scale US legal documents. Addressing the complexities of legal terminology and sparse label distributions within the SigmaLaw dataset, the authors developed a Random Forest model trained on 400 PCA-selected domain features from 30,000 case documents. The experimental results demonstrated that the proposed model achieved 85.98% accuracy, significantly outperforming SOTA deep learning architectures like BiLSTM + Attention, while offering superior interpretability and computational efficiency in low-resource and domain-specific scenarios.
Mohammed and Kora [40] proposed a novel two-layer meta-learning-based ensemble deep learning framework to extend the boundaries of accuracy and robustness in text classification. The architecture integrates diverse base classifiers, including CNN, LSTM, and GRU, whose probability-based outputs (soft predictions) are processed by shallow meta-learners across a three-level hierarchical structure. Experimental results on six benchmark datasets, such as IMDB and ArSarcasm, demonstrated that the ensemble approach significantly outperformed individual models, achieving up to a 16.03% accuracy gain on the AJGT dataset. Their research underscores that synergizing classifier diversity with meta-learning effectively reduces variance and enhances generalization across varying languages and linguistic structures.
Jiang et al. [
41] introduced TechDoc, a multimodal deep learning framework designed to enhance technical document management by integrating textual, visual, and relational data. Utilizing a massive dataset of 800,000 USPTO patent documents, the architecture employs RNNs for text processing, CNNs (VGG-19) for visual features, and GNNs (GraphSAGE) for citation networks. Experimental results demonstrated that TechDoc significantly outperformed unimodal and dual-modal baselines, achieving a 3.2% improvement in top-1 accuracy for IPC subclassifications while being 13 times faster to train than fine-tuned BERT models. Their research concludes that multimodal information fusion not only improves classification accuracy in specialized technical domains but also offers superior computational efficiency and transfer learning potential for enterprise-level document systems.
Schoppmann [42] evaluated the efficacy of privacy-preserving machine learning protocols for text classification and regression tasks, focusing on secure linear regression and differential privacy mechanisms. By comparing fixed-point arithmetic protocols with iterative FP-CGD methods, the study demonstrated that the proposed framework significantly reduces computational latency and communication costs, outperforming existing systems like SecureML by up to 3x. Experimental results on the UCI and Movie Reviews datasets showed that privacy-protected TF-IDF extraction and k-NN/Naive Bayes classifiers maintained high accuracy with less than an 8% deviation from plaintext models, proving that secure, high-dimensional data processing is feasible for real-time document classification.
Xu et al. [43] proposed a Convolutional Neural Network (CNN)-based approach for detecting sensitive information within unstructured military and political texts. To overcome the high computational costs of RNNs and the limited generalization of traditional keyword-matching methods, the study utilized a Text-CNN architecture integrated with Word2vec embeddings and Jieba segmentation. Experimental results on a dataset of 22,000 documents demonstrated that the CNN model achieved a superior accuracy of 96.82%, while completing training nearly three times faster than RNN baselines. Their findings validate that CNNs offer a highly efficient and accurate alternative for real-time sensitivity classification by automatically extracting salient patterns through varied convolutional kernels without the need for manual feature engineering.
Park et al. [
44] introduced GenNER, a novel framework designed to detect sensitive information within structured Korean table data by integrating Text Generation (TG) and Named Entity Recognition (NER) modules. To address the lack of context in tabular formats and the morphological complexities of the Korean language, the GenNER architecture employs a BiLSTM-CRF structure that generates synthetic natural language sentences from raw table entries before performing NER tasks. Experimental results on public documents from the Seoul Metropolitan Government demonstrated that the GenNER system achieved a 0.91 F1-score, significantly outperforming the baseline BiLSTM-CRF model (0.74 F1-score) which operates directly on raw data. Their research underscores that generating contextual sentences from structured inputs is a highly effective strategy for improving the accuracy of automated sensitivity classification in enterprise data security.
Chen et al. [
45] proposed a novel framework for the security classification of language technology resources by integrating BERT-based word embeddings with the graph-based TextRank algorithm. To address the subjectivity and inefficiency of manual classification, the methodology constructs a weighted graph where nodes represent sentence vectors derived from dynamic BERT embeddings and edges represent semantic similarities. By calculating sensitivity scores through the TextRank ranking mechanism, the model automatically identifies high-risk text components and assigns security levels. Experimental results in Chinese text processing demonstrated that this hybrid approach significantly outperforms traditional SVM, Random Forest, and standard TextRank models in accuracy and recall, providing a sophisticated automation level for managing sensitive linguistic assets despite its higher computational complexity.
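The graph-ranking step described above can be illustrated with a minimal sketch. It is not the authors' implementation: the toy vectors stand in for BERT sentence embeddings, cosine similarity is assumed as the edge weight, the damping factor of 0.85 is the conventional PageRank default, and the function names are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def textrank_scores(vectors, damping=0.85, iterations=50):
    """Weighted PageRank-style scores over a sentence-similarity graph."""
    n = len(vectors)
    # Edge weights: non-negative cosine similarity, no self-loops.
    w = [[max(cosine(vectors[i], vectors[j]), 0.0) if i != j else 0.0
          for j in range(n)] for i in range(n)]
    out_sum = [sum(row) for row in w]
    scores = [1.0 / n] * n
    for _ in range(iterations):
        new = []
        for i in range(n):
            # Each neighbour j passes on a share of its score proportional
            # to the weight of the edge j -> i.
            rank = sum(w[j][i] / out_sum[j] * scores[j]
                       for j in range(n) if out_sum[j] > 0)
            new.append((1 - damping) / n + damping * rank)
        scores = new
    return scores

# Toy vectors standing in for sentence embeddings (illustrative only).
sentence_vectors = [
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.8, 0.2, 0.1],
    [0.0, 0.0, 1.0],
]
scores = textrank_scores(sentence_vectors)
# Sentence indices ordered from highest to lowest score.
ranked = sorted(range(len(scores)), key=scores.__getitem__, reverse=True)
```

In this toy graph the fourth sentence is only weakly connected to the rest and therefore ranks last. How such scores are mapped to security levels is specific to the cited study and is not reproduced here.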
Liang et al. [
46] proposed the A-ELMo algorithm, which enhances dynamic word embeddings with an integrated attention mechanism for the context-aware detection of sensitive information in social media. Addressing the inability of static representations like Word2Vec and GloVe to resolve lexical ambiguities, the framework utilizes the bidirectional LSTM structure of ELMo to generate context-sensitive embeddings, subsequently refined by an attention layer to prioritize semantically significant terms. Experimental evaluations on Twitter data revealed that A-ELMo achieved an accuracy of 84.15% and a recall of 88.41%, substantially outperforming fastText (75.14%) and traditional keyword-matching approaches (41.09%). Their research demonstrates that the synergy between weighted semantic representations and dictionary-tree structures provides a superior standard for identifying sensitive content within high-variance linguistic environments.
Fu et al. [
47] developed an intelligent recognition framework for identifying and interpreting sensitive information within long-form unstructured texts by integrating natural language processing with machine learning. The proposed architecture employs a multi-modal feature fusion approach that combines textual content (via BERT and Word2Vec), contextual metadata, and user behavior analysis to execute a comprehensive sensitivity scoring system. Experimental results demonstrated that the Gradient Boosting Machine (GBM) model achieved superior performance with an AUC of 0.95 and an F1-score of 0.85, significantly outperforming Random Forest baselines. While highlighting the contextual superiority of BERT for long-sequence analysis, the study also addresses model robustness and scalability, providing a strategic foundation for proactive data privacy and risk management in high-volume information environments.
Kaliappan et al. [
48] introduced SentinelGuard, a sophisticated Data Loss Prevention (DLP) solution specifically optimized for the healthcare sector to protect Personally Identifiable Information (PII) and Protected Health Information (PHI). By utilizing a BERT-based deep learning model trained on clinical datasets, the system executes real-time analysis of user activities and data entries to preemptively block leakage attempts. Experimental benchmarks against prominent industry solutions revealed that SentinelGuard achieved a remarkably low false positive rate of 8%, significantly outperforming Forcepoint (64%), McAfee (36%), and Symantec (23%). Their findings demonstrate that integrating contextual language models into DLP frameworks not only enhances detection accuracy for medical records but also provides a scalable and customizable security architecture for high-stakes enterprise environments.
Li et al. [
49] introduced CoGraphNet, a novel graph-based framework designed to address the computational complexity and interpretability challenges in text classification. The model constructs heterogeneous graph structures at both word and sentence levels, utilizing GRU-based neural networks and SwiGLU activation functions to process contextual dependencies and positional biases. By incorporating attention mechanisms, CoGraphNet enhances model transparency by identifying the specific words and sentences that drive classification decisions. Experimental results across four benchmark datasets, including 20NG (91.50% accuracy), R52 (96.10%), and Ohsumed, demonstrated that the proposed architecture offers competitive performance while providing significant advantages in terms of explainability and robust generalization in information-dense textual environments.
Ren et al. [
50] introduced DyLas (Dynamic Label Alignment Strategy), a novel three-stage framework designed to enhance Large Language Model (LLM) performance in Large-scale Multi-label Text Classification (LMTC) without requiring model retraining. The strategy addresses the challenges of dynamic label sets and long-tail distributions through a pipeline consisting of vanilla input-output generation, label alignment (utilizing both hard and soft alignment via embeddings like BGE-M3), and counterfactual error checking. Experimental evaluations across four datasets, including Reuters-21578 and FB15k237, demonstrated significant improvements; notably, integrating DyLas with GLM4-9b increased the Micro-F1 score from 15.4% to 55.4%. Their findings confirm that DyLas enables LLMs like GPT-4o and LLaMA 3.1 to outperform traditional PLMs (BERT, RoBERTa) in complex multi-label tasks, providing a scalable and robust solution for real-time classification in evolving information environments.
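The distinction between hard and soft label alignment can be made concrete with a small sketch. It is an illustration under stated assumptions rather than the DyLas implementation: `toy_embed` is a hypothetical character-trigram stand-in for a real embedding model such as BGE-M3, the 0.5 similarity threshold is arbitrary, and the counterfactual error-checking stage is omitted.

```python
import math
from collections import Counter

def toy_embed(text):
    """Stand-in embedding: character-trigram counts (NOT BGE-M3)."""
    t = f"  {text.lower().strip()}  "
    return Counter(t[i:i + 3] for i in range(len(t) - 2))

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def align_labels(generated, label_set, threshold=0.5):
    """Map free-form generated labels onto a fixed label set."""
    normalized = {lab.lower(): lab for lab in label_set}
    aligned = []
    for g in generated:
        key = g.lower().strip()
        if key in normalized:
            # Hard alignment: exact match after case normalization.
            aligned.append(normalized[key])
            continue
        # Soft alignment: nearest label by embedding similarity.
        g_vec = toy_embed(g)
        best = max(label_set, key=lambda lab: cosine(g_vec, toy_embed(lab)))
        if cosine(g_vec, toy_embed(best)) >= threshold:
            aligned.append(best)
        # Below the threshold, the generated label is discarded.
    return aligned

labels = ["grain", "crude oil", "trade", "interest rates"]
result = align_labels(["Trade", "crude-oil", "astronomy"], labels)
```

Here "Trade" is resolved by hard alignment, "crude-oil" by soft alignment to "crude oil", and "astronomy" is dropped because no label is sufficiently similar.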
Wang et al. [
51] proposed a novel Meta-Active Learning framework designed to minimize labeled data requirements while enhancing flexibility in text classification tasks. The methodology integrates the TextGCN architecture with a Bi-LSTM-based meta-information generator and a domain discriminator to identify and prioritize the most informative samples during the learning process. Evaluated on Amazon, 20 Newsgroups, and Reuters-21578 datasets under 1-shot and 5-shot scenarios, the proposed model achieved average accuracies of 74.2% and 87.1%, respectively. Their findings demonstrate that synergizing graph-based representations with meta-learning significantly outperforms traditional active learning strategies and baseline meta-learning models like MLADA, offering a scalable and high-performance solution for low-resource text classification environments.
Zhang et al. [
52] introduced VIWHard, a novel framework for generating adversarial texts against text classification models in a hard-label black-box setting, where only the predicted labels are accessible. To overcome the discrete nature of text, the proposed method utilizes a two-stage architecture: an importance-word discriminator to identify high-impact tokens independently of the target model and a masked language model (MLM) to generate contextually relevant synonyms that preserve semantic and syntactic integrity. Experimental evaluations across eight datasets and three architectures (WordCNN, WordLSTM, and BERT) demonstrated that VIWHard achieved attack success rates exceeding 97% in security-oriented datasets (e.g., Jigsaw2018), significantly outperforming baselines like TextHacker and HydraText in terms of naturalness, grammatical correctness, and query efficiency. Their findings underscore the vulnerability of current NLP models to transferable adversarial examples and highlight the necessity of robust defense strategies in practical deployment scenarios.
Alparslan [
53] conducted a comparative performance analysis between traditional machine learning algorithms and a Convolutional Neural Network (CNN) model for Turkish text classification. Utilizing the multi-class TTC-4900 news dataset and the binary MY-15130 customer review dataset, the study employed rigorous preprocessing steps, including stemming via the Zemberek library and feature weighting through TF-IDF with log-normalization. Experimental results indicated that while traditional classifiers like SVM and Naive Bayes performed exceptionally well on binary tasks (96.2% accuracy), the proposed CNN architecture (Conv1D + Flatten + Dense) achieved a state-of-the-art F1-score of 92.2% on the multi-class TTC-4900 dataset, surpassing existing literature. The findings suggest that deep learning models offer superior performance in handling the linguistic diversity of multi-class Turkish corpora, whereas traditional methods remain highly competitive in simpler binary classification scenarios.
Riyadi et al. [
54] introduced IndoGovBERT, a domain-specific BERT-based pre-trained language model developed from Indonesian government corpora for automated processing of SDG-related government documents. Evaluated against four general-purpose Indonesian language models and a multilingual BERT baseline across text classification and document similarity tasks, IndoGovBERT demonstrated consistently superior performance, validating that domain-adaptive pre-training substantially improves classification accuracy in specialised governmental document environments. Their findings support the broader principle that encoder-based transformer architectures benefit significantly from domain-specific fine-tuning when applied to formal institutional text, a rationale that directly motivates the present study’s adoption of DeBERTa-v3-Large for diplomatic security classification.
Sujatha and Nimala [
55] proposed EPLM-HT, an ensemble framework combining BERT, RoBERTa, GPT, DistilBERT, and XLNet with systematic hyperparameter tuning for conversational sentence classification into four categories. By aggregating predictions from five independently fine-tuned transformer models, the ensemble achieved an F1-score of 0.88, consistently outperforming all individual base models. Their findings demonstrate that multi-model ensemble aggregation with fine-tuned hyperparameters provides more robust classification boundaries than any single transformer architecture, a principle that directly informs the three-seed ensemble strategy employed in the present study.
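The idea of aggregating predictions from several fine-tuned models can be sketched as follows. The aggregation rule used in the cited work is not specified here, so this example assumes soft voting (averaging class probabilities); the probability values are made up for illustration and `soft_vote` is a hypothetical helper.

```python
def soft_vote(prob_rows):
    """Average per-class probabilities across models.

    Returns the index of the winning class and the averaged probabilities.
    """
    n_models = len(prob_rows)
    n_classes = len(prob_rows[0])
    mean = [sum(row[c] for row in prob_rows) / n_models
            for c in range(n_classes)]
    return max(range(n_classes), key=mean.__getitem__), mean

# Hypothetical class probabilities for one sentence from five models
# over four categories (each row sums to 1).
model_probs = [
    [0.60, 0.20, 0.10, 0.10],
    [0.30, 0.40, 0.20, 0.10],
    [0.50, 0.30, 0.10, 0.10],
    [0.20, 0.50, 0.20, 0.10],
    [0.55, 0.25, 0.10, 0.10],
]
label, mean_probs = soft_vote(model_probs)
```

Two of the five models individually prefer the second class, yet the averaged distribution favours the first. This is the sense in which an ensemble can smooth out the disagreements of its members.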
Alzamel and Alajmi [
56] fine-tuned three transformer-based encoder models—AraBERT, GigaBERT, and XLM-RoBERTa—for the automated classification of official Arabic governmental correspondence into six ministry-defined categories, using a balanced dataset of 22,741 documents. GigaBERT achieved the highest accuracy of 98%, demonstrating that bilingual pre-training provides superior cross-lingual transfer for domain-specific institutional text. Their findings confirm that transformer-based architectures can effectively automate document routing in high-volume governmental environments, reinforcing the applicability of encoder-based models for formal institutional document classification across diverse linguistic contexts.
Mao et al. [
57] proposed BERT-DXLMA, a hybrid architecture integrating BERT with xLSTM and semantic fusion technology to enhance deep semantic feature extraction and representation learning for English text classification. To address class imbalance, the authors re-designed the focal loss function to further amplify attention to minority-class samples, achieving superior precision and overall accuracy across six public benchmark datasets compared to multiple baselines. Their work validates that combining transformer-based contextual encoding with sequential deep learning modules and adaptive loss re-weighting constitutes an effective strategy for robust text classification under imbalanced conditions—a convergent design principle with the Focal Loss and hybrid architectural approach adopted in the present study.
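For reference, the standard focal loss that such re-designs start from can be written in a few lines. This sketch shows only the widely used baseline form, -alpha * (1 - p_t)^gamma * log(p_t), and not the modified loss proposed by the authors; the default values gamma = 2 and alpha = 1 are assumptions chosen for illustration.

```python
import math

def focal_loss(probs, target, gamma=2.0, alpha=1.0):
    """Standard focal loss for one sample: -alpha * (1 - p_t)^gamma * log(p_t)."""
    # p_t is the predicted probability of the true class, clamped to avoid log(0).
    p_t = min(max(probs[target], 1e-12), 1.0)
    return -alpha * (1.0 - p_t) ** gamma * math.log(p_t)

# A confidently correct sample contributes very little to the loss ...
easy = focal_loss([0.9, 0.05, 0.05], target=0)
# ... while a misclassified sample keeps most of its weight.
hard = focal_loss([0.2, 0.7, 0.1], target=0)
# With gamma = 0 the expression reduces to ordinary cross-entropy.
ce_easy = focal_loss([0.9, 0.05, 0.05], target=0, gamma=0.0)
```

The modulating factor (1 - p_t)^gamma is what shifts the training signal toward hard samples, which in imbalanced settings are often minority-class samples.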
Prabhakar and Pati [
58] investigated the automated classification of Indian court judgments into legal domains, a task characterized by non-standardized document structures, verbose domain-specific language, and significant class imbalance. The study systematically evaluated a broad range of transformer-based embeddings—including InLegalBERT, InCaseLawBERT, DeBERTa, RoBERTa, and T5—in combination with feature engineering techniques such as TF-IDF, Word2Vec, PCA, and forward feature selection. To handle class imbalance across 14 legal domains, SMOTE oversampling was employed. Ensemble classifiers—voting classifiers, gradient boosting, and random forest—were integrated with the embedding representations. The optimal configuration of T5 embeddings combined with SMOTE, feature selection, and a voting classifier achieved 98% accuracy on a manually curated Indian legal corpus, establishing a benchmark for hybrid embedding-ensemble approaches in domain-specific legal document classification.
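The oversampling step can be illustrated with a simplified version of the SMOTE interpolation idea: each synthetic point lies on the segment between a minority-class sample and one of its nearest minority-class neighbours. This is a teaching sketch with hypothetical names and toy two-dimensional data, not the configuration used in the cited study, where a library implementation applied to embedding vectors would be the more likely choice.

```python
import random

def smote(minority, n_new, k=2, seed=0):
    """Generate n_new synthetic points by interpolating between minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of the base point (excluding itself),
        # ordered by squared Euclidean distance.
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)),
        )[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> other
        synthetic.append([a + gap * (b - a) for a, b in zip(base, other)])
    return synthetic

# Toy minority class: the four corners of the unit square.
minority_class = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
new_points = smote(minority_class, n_new=6)
```

Because every synthetic point is a convex combination of two existing samples, the new points stay inside the region already occupied by the minority class.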
Liu et al. [
59] addressed the limitations of existing contrastive learning methods in text classification, particularly the problem of insufficient data utilization. The study introduced instance-level weighted contrastive learning, combining four text augmentation strategies (symbol insertion, affirmative auxiliary verb substitution, double negation, and punctuation repetition) organized into two complementary schemes: affirmative enhancement and negation transformation. To mitigate the adverse effects of false negative samples introduced during augmentation, an instance weighting mechanism was employed, where complementary models generate per-sample weights to correct sampling bias during training. Experiments across multiple benchmark datasets demonstrated improvements in model generalization and robustness over standard contrastive learning baselines. The proposed instance weighting approach bears conceptual similarity to focal loss-based training strategies, both aiming to dynamically adjust the contribution of individual samples during optimization.
Existing studies on information, document, and text classification across diverse contexts demonstrate variations in application domains, data characteristics, and methodological approaches. To facilitate a systematic comparison of these studies, a summary of the relevant literature is provided in
Table 2.