Article

SACHN-DeBERTa-v3-Large: Automated Document Security Classification with XAI and LLM Comparison

by Mehmet Tuğrul Sariçiçek and Murat Dener *
Department of Information Security Engineering, Graduate School of Natural and Applied Sciences, Gazi University, Ankara 06560, Türkiye
* Author to whom correspondence should be addressed.
Appl. Sci. 2026, 16(11), 5661; https://doi.org/10.3390/app16115661
Submission received: 28 April 2026 / Revised: 27 May 2026 / Accepted: 28 May 2026 / Published: 4 June 2026

Featured Application

The SACHN-DeBERTa-v3-Large architecture can be integrated into automated document management systems for government agencies, defense organizations, and diplomatic institutions to support real-time security classification of incoming records. It is additionally suited for cross-domain security solutions, where content classification is a prerequisite for enforcing access control policies during secure data transfer between networks of different classification levels. The framework is equally applicable to large-scale enterprises in regulated industries—including financial services, legal, and healthcare—seeking sensitivity-based document categorization in compliance with data governance policies.

Abstract

The manual classification of diplomatic documents by security sensitivity level is labor-intensive and inconsistent at scale. This study proposes SACHN-DeBERTa-v3-Large, a hybrid classification architecture integrating a DeBERTa-v3-Large backbone with a Security-Aware Gate, a Prototype Classification Layer, and a Supervised Contrastive Projection Head. A rule-based preprocessing pipeline removes inline classification markers prior to training, so that models learn semantic content rather than surface-form annotation artifacts. The architecture is evaluated on two diplomatic corpora: the WikiLeaks Cable Classifier (9005 documents, three classes) and a novel Foreign Relations of the United States (FRUS) dataset (24,706 documents, four classes) constructed for this study. A controlled multi-LLM comparative study evaluates LLaMA-3-8B-Instruct, Qwen2.5-14B-Instruct, and Mistral-7B-Instruct-v0.3 under identical QLoRA fine-tuning conditions. SACHN-DeBERTa-v3-Large achieves 96.12% accuracy and 92.57% macro F1 on WikiLeaks, and 92.11% accuracy and 91.02% macro F1 on FRUS, surpassing all baselines by at least 10 percentage points in accuracy and 26 points in macro F1 (McNemar’s test, p < 0.001). Under the primary QLoRA protocol (r = 16), no evaluated LLM exceeded 57.16% accuracy; extended experiments with r = 64 and 8-bit (INT8) quantization yielded a best result of 64.48% (Mistral-7B-Instruct, WikiLeaks), suggesting that the performance gap relative to SACHN-DeBERTa-v3-Large is structural rather than an artifact of adapter rank or quantization precision. Post-hoc SHAP and LIME analyses indicate that predictions are grounded in domain-specific semantic content, supporting the classification marker removal methodology.

1. Introduction

The classification of documents according to their informational sensitivity occupies a central role in institutional governance, national security administration, and records management. At its core, classification denotes the systematic assignment of documents to predefined security tiers—UNCLASSIFIED, CONFIDENTIAL, SECRET, and TOP SECRET—based on the potential consequences their unauthorized disclosure could impose on national interests, diplomatic stability, or individual privacy. Diplomatic correspondence, inter-agency policy records, and military communications represent particularly high-stakes information assets that demand rigorous and consistent access control [1]. Ensuring that such documents are accurately labeled is not merely an administrative formality; it is a foundational requirement for effective security governance, legal compliance, and the principled management of institutional knowledge.
For decades, the determination of confidentiality levels has depended on the judgment of trained human reviewers. This reliance carries well-documented drawbacks: manual classification is vulnerable to subjective interpretation, inconsistent application of evolving security standards, cognitive fatigue during high-volume review cycles, and substantial temporal and financial cost [2]. These challenges have grown more acute in parallel with the exponential expansion of digitally generated records. As document volumes exceed the practical capacity of human annotation, the risk of misclassification—whether inadvertent over-classification that obstructs legitimate information sharing, or under-classification that exposes sensitive material—increases proportionally [3,4,5].
Machine learning and natural language processing offer a principled path toward resolving these systemic limitations. Automated classification systems can process large corpora in a fraction of the time required by manual review, apply labeling decisions with consistency across heterogeneous collections, and be retrained as classification policies evolve [3,4,5,6]. Early computational approaches to this problem relied predominantly on handcrafted feature representations—term frequency-inverse document frequency (TF-IDF) weighting, bag-of-words models, and lexical pattern matching—paired with classical supervised classifiers such as support vector machines and random forests. These methods, while interpretable and computationally tractable, face a fundamental ceiling: they struggle to capture the nuanced semantic relationships that distinguish, for instance, a CONFIDENTIAL diplomatic cable from a SECRET one, where surface vocabulary may be nearly identical and the operative distinction resides in contextual judgment rather than explicit lexical cues [6].
The development of large pre-trained transformer architectures has substantially raised the performance ceiling for text classification. BERT-family models and their successors have demonstrated strong generalization across diverse natural language benchmarks, and their application to domain-specific classification—including legal, clinical, and governmental text—has expanded rapidly. Despite this progress, the automated classification of sensitive diplomatic documents remains comparatively underexplored [7]. Existing studies are frequently limited in scope, treating the problem as binary rather than hierarchical, relying on proprietary datasets that preclude reproducibility, or failing to control for the presence of explicit classification markers embedded in raw document text. A model trained on text that retains such markers risks learning to recognize surface-form annotations rather than substantive semantic content—a critical failure mode in any real operational deployment where markers may be absent, redacted, or deliberately obfuscated.
This study addresses these limitations through a systematic investigation of automated multi-class document security classification across two large-scale diplomatic corpora. The first is the WikiLeaks Cable Classifier [8], a publicly available benchmark of 9005 diplomatic cables spanning three security tiers. The second is a novel dataset constructed programmatically for this study from the Foreign Relations of the United States (FRUS) collection [9,10]—an open-access archive of declassified U.S. diplomatic records covering five presidential administrations—comprising 24,706 long-form formal documents across four classification levels. Both datasets are evaluated under a unified preprocessing pipeline that systematically removes inline security markers prior to training, ensuring that model performance reflects genuine semantic learning rather than marker exploitation.
The principal contributions of this work are as follows:
  • Novel benchmark dataset: A structured multi-class FRUS diplomatic document dataset is constructed from raw XML archives and will be released publicly on Kaggle upon publication, providing a reproducible resource for future research in formal diplomatic text classification.
  • Security marking removal as experimental control: A rule-based preprocessing pipeline comprising 14 targeted regex patterns replaces inline classification markers and header directives with a neutral sentinel token. The effectiveness of this control is checked through post-hoc explainability analysis using SHAP and LIME, which indicates that learned representations reflect substantive diplomatic content.
  • SACHN-DeBERTa-v3-Large: A novel hybrid classification architecture is proposed, integrating a DeBERTa-v3-Large backbone with a Security-Aware Gate for adaptive hidden-dimension feature selection, a Prototype Classification Layer implementing metric learning-based inference, and a Supervised Contrastive Projection Head to strengthen inter-class separability. The training pipeline combines Focal Loss for imbalanced class distributions, layer-wise learning rate decay across 24 transformer layers, mixed-precision optimization, and a three-seed dynamic weighted ensemble with per-class decision threshold tuning.
  • Multi-LLM comparative study: A systematic empirical evaluation of three open-source instruction-tuned large language models—LLaMA-3-8B-Instruct, Qwen2.5-14B-Instruct, and Mistral-7B-Instruct-v0.3—under identical QLoRA fine-tuning conditions is conducted, providing, to the best of our knowledge, the first controlled benchmark of multiple generative architectures on multi-class diplomatic security classification. These models were selected on the basis of three complementary criteria: architectural diversity across distinct decoder-only transformer families, parameter-scale coverage spanning 7B to 14B parameters, and open-weight licensing ensuring full experimental reproducibility.
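The marker-removal control described in the contributions above can be illustrated with a minimal sketch. The three patterns and the `[MARKER]` sentinel are hypothetical stand-ins chosen for illustration; the 14 patterns used in the study and the actual sentinel token are not listed in this section.

```python
import re

# Hypothetical sentinel; the text only says "a neutral sentinel token".
SENTINEL = "[MARKER]"

# Illustrative patterns only, not the authors' 14 rules.
MARKER_PATTERNS = [
    # Header directives such as "Classified By: ..." (the whole line).
    re.compile(r"^[ \t]*classified[ \t]+by[ \t]*:.*$", re.IGNORECASE | re.MULTILINE),
    # Portion markings such as "(U)", "(C)", "(S)", "(TS)", "(S/NF)".
    re.compile(r"\((?:U|C|S|TS)(?:/[A-Z]+)*\)"),
    # Upper-case level names; lower-case prose ("a secret meeting") is kept.
    re.compile(r"\b(?:TOP[ \t]+SECRET|SECRET|CONFIDENTIAL|UNCLASSIFIED)\b"),
]


def mask_markers(text: str) -> str:
    """Replace every match of every pattern with the sentinel token."""
    for pattern in MARKER_PATTERNS:
        text = pattern.sub(SENTINEL, text)
    return text


if __name__ == "__main__":
    sample = "Classified By: Ambassador X\n(C) SUMMARY: Talks on the SECRET annex continue."
    print(mask_markers(sample))
```

Masking with a sentinel rather than deleting the span keeps the position of the removed marker visible to the model without exposing the label itself. The level-name pattern is deliberately case-sensitive so that ordinary prose is left untouched.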
The proposed architecture achieves 96.12% accuracy and 92.57% macro F1 on the WikiLeaks corpus, and 92.11% accuracy and 91.02% macro F1 on FRUS—surpassing all evaluated baselines by margins of at least 10 percentage points in accuracy and 26 points in macro F1, with statistical significance confirmed by McNemar’s test (p < 0.001 in all comparisons).
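McNemar's test compares two classifiers evaluated on the same test set, using only the documents on which they disagree. The sketch below shows both the exact binomial form and the continuity-corrected chi-square form. The disagreement counts are invented for illustration and are not taken from the paper, which does not state in this section which variant was applied.

```python
import math


def mcnemar_exact(b: int, c: int) -> float:
    """Two-sided exact p-value.

    b: documents model A classified correctly and model B incorrectly.
    c: documents model A classified incorrectly and model B correctly.
    """
    n = b + c
    if n == 0:
        return 1.0
    k = min(b, c)
    tail = sum(math.comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2.0 * tail)


def mcnemar_chi2(b: int, c: int) -> tuple[float, float]:
    """Continuity-corrected statistic and p-value (chi-square, 1 d.o.f.)."""
    stat = (abs(b - c) - 1) ** 2 / (b + c)
    # Survival function of chi-square with one degree of freedom.
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value


if __name__ == "__main__":
    # Hypothetical disagreement counts between two models.
    b, c = 40, 10
    print(mcnemar_exact(b, c))
    print(mcnemar_chi2(b, c))
```

Documents that both models classify correctly, or both incorrectly, do not enter the statistic, which is why the test suits paired comparisons on a shared test split.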
The remainder of this paper is organized as follows. Section 2 reviews the relevant literature on automated text and document classification, spanning classical machine learning approaches, deep learning architectures, and domain-specific applications in security-sensitive contexts. Section 3 describes the datasets, preprocessing pipeline, proposed architecture, training strategy, and baseline methods. Section 4 presents experimental results and comparative analysis. Section 5 discusses the findings in the context of prior literature, addresses the limitations of the proposed approach, and outlines directions for future research. Section 6 concludes with a summary of the principal contributions and results.

2. Related Works

2.1. Classification of Information, Documents, and Texts by Confidentiality Levels

Assigning documents to appropriate confidentiality levels based on their content is critical for information security. Government agencies, the defense industry, diplomatic missions, and private-sector organizations protect their information assets by applying labels such as public, confidential, or top secret [1]. Accurate confidentiality labels are essential both for preventing information leakage and for enforcing appropriate access control policies. However, the limitations of manual classification and the risk of human error have motivated researchers to develop artificial intelligence and machine learning methods that automate confidentiality classification [6]. The resulting literature documents the advantages and limitations of these approaches in terms of both model performance and practical deployment. This section reviews prior studies on the classification of texts, information, and documents by confidentiality level.
Alzhrani et al. [4] proposed a robust preprocessing framework that utilizes Latent Dirichlet Allocation (LDA) to refine training datasets by pruning irrelevant sub-topics from diplomatic cables. By mitigating the impact of linguistic noise within the WikiLeaks dataset, their logistic regression-based approach achieved statistically significant performance gains (p = 0.0007), demonstrating that strategic data purification is essential for maintaining high F1-scores across multi-level security classifications.
Alparslan et al. [5] introduced a hybrid SVM-ANFIS architecture for the security classification of 222 Turkish documents from the TUBITAK UEKAE dataset, reaching an accuracy of 96.67%. By utilizing custom stemming and Chi-square feature selection to overcome the morphological challenges of Turkish, the study demonstrates that merging discriminative scores with fuzzy inference models ensures high precision in sensitive, agglutinative text categorization.
Richter et al. [6] explored the efficacy of deep learning for classifying historical NATO documents by confidentiality levels, demonstrating that a ConvNet-based architecture significantly outperformed traditional Random Forest baselines. Utilizing a balanced subset of unstructured military records dating back to the 1950s, the study achieved a 1.2 to 1.3-fold accuracy improvement through a model optimized with 50-dimensional embeddings and strategic dropout rates. Their results validate that while convolutional networks offer superior performance for complex defense datasets, full exploitation of more advanced architectures like RNNs remains contingent upon high-performance computational resources.
Yazidi et al. [11] compared ten machine learning algorithms for automated binary security classification using a DNSA dataset of 2884 documents on geopolitical conflicts. To support generalization and prevent label leakage, the authors removed explicit security keywords from the text and then applied a TF-IDF-weighted bag-of-words representation. AdaBoost performed best, achieving over 90% accuracy across diverse topics and remaining robust in the high-dimensional feature spaces typical of sensitive document classification.
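The idea of removing explicit security keywords before building a TF-IDF-weighted bag-of-words can be sketched as follows. The keyword list, the whitespace tokenizer, and the unsmoothed idf formula are assumptions made for illustration and are not taken from the cited study.

```python
import math
from collections import Counter

# Hypothetical keyword list; the cited study's list is not given here.
SECURITY_KEYWORDS = {"secret", "confidential", "classified", "unclassified"}


def tokenize(text: str) -> list[str]:
    """Lower-case whitespace tokenizer that drops security keywords."""
    return [t for t in text.lower().split() if t not in SECURITY_KEYWORDS]


def tfidf(docs: list[str]) -> list[dict[str, float]]:
    """Return one {term: weight} mapping per document.

    Uses tf = count / document length and idf = ln(N / df), without smoothing.
    """
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    doc_freq = Counter(term for toks in tokenized for term in set(toks))
    vectors = []
    for toks in tokenized:
        counts = Counter(toks)
        total = len(toks) or 1
        vectors.append(
            {t: (c / total) * math.log(n_docs / doc_freq[t]) for t, c in counts.items()}
        )
    return vectors


if __name__ == "__main__":
    for vec in tfidf(["secret troop movement", "troop supply report"]):
        print(vec)
```

Because the keywords are dropped at tokenization time, they can never become features, so a downstream classifier cannot recover the label from them.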
Richter and Wrona [12] evaluated open-source machine learning algorithms for the confidentiality classification of over 30,000 historical NATO documents, categorized into three distinct security tiers. Following an extensive preprocessing pipeline involving the KeyGraph algorithm and Porter stemming, the Random Forest model emerged as the top performer with 80% accuracy, assessed through metrics including Cohen’s kappa and misclassification costs. A significant finding of the study was that OCR-induced noise did not diminish classification effectiveness, leading the authors to propose a framework that integrates open-source solutions with confidence scores to enhance information management workflows.
Heintz et al. [13] developed an automated information guard system by analyzing 101,142 documents from the Foreign Relations of the United States (FRUS) collection to optimize secure data sharing across different clearances. By benchmarking DistilBERT and DistilRoBERTa architectures at the paragraph and metadata levels, the study demonstrated that the integration of document content with structural metadata significantly enhances classification robustness and accuracy. Their findings, validated through rigorous F1-score analysis and confusion matrices across multiple security tiers, suggest that jointly exploiting textual and contextual features provides a scalable foundation for advanced multi-modal information protection systems.
Alzhrani et al. [14] introduced the ACESS (Automated Classification Enabled by Security Similarity) model, a hybrid architecture designed to overcome the limitations of conventional data loss prevention systems in large-scale textual environments. By integrating k-means clustering with localized linear SVM classifiers, the framework performs paragraph-level analysis on imbalanced WikiLeaks diplomatic cables to identify sensitive information with high granularity. The experimental results demonstrated that ACESS consistently exceeds the 90% F1-Measure threshold, with findings indicating that optimizing cluster density relative to dataset size significantly enhances the precision and adaptability of automated security classification.
Trieu et al. [15] introduced a content-based classification framework utilizing a specialized document embedding model (TD2V) and modified k-nearest neighbor retrieval to overcome the semantic limitations of traditional DLP systems. By integrating Automated Query Expansion (AQE) and majority voting, the methodology achieved exceptional classification accuracies exceeding 99% across diverse datasets, including Enron, Snowden, and Dyncorp. Their findings demonstrate that the synergy of TD2V and AQE ensures a robust, real-time solution for sensitive data identification, maintaining high precision even in fragmented text segments with a processing latency of 10–15 ms.
Liang et al. [16] proposed the Incremental Learning and Similarity Comparison (ILSC) framework to handle the dynamic nature of sensitive information classification in expanding document repositories. By integrating online learning algorithms such as Incremental SVM (ISVM) with a manual sentence-based similarity component, the authors achieved an 87.4% accuracy across five security echelons, significantly outperforming static offline models. Their findings emphasize that the synergy between incremental updates and granular similarity checks provides a scalable solution, as evidenced by a 39.5% accuracy increase as the dataset volume expanded during the learning process.
Alparslan et al. [17] proposed a hybrid SVM-ANFIS framework to automate the security grading of 222 Turkish institutional documents from the TUBITAK UEKAE corpus. Leveraging the Zemberek library for morphological analysis and Chi-square for feature optimization, the study utilized SVM-derived scores as antecedent inputs for a fuzzy inference model, which was subsequently discretized into security tiers using the CACC algorithm with specific thresholds. The empirical results demonstrated that the hybrid architecture significantly outperformed standalone models, achieving a 96% accuracy rate and providing a robust solution for mitigating ambiguity in Data Loss Prevention (DLP) environments.
Jiang et al. [18] introduced CES2Vec, a task-specific word embedding framework that integrates confidentiality-driven polarity into vector representations to enhance sensitive information detection. By extending the Word2Vec negative sampling objective with a privacy-oriented loss function, the authors enforced more distinct classification boundaries on a dataset of 89,681 WikiLeaks paragraphs. Their empirical findings revealed that CES2Vec-based CNN classifiers achieved an 80.04% F1-score, significantly outperforming standard FastText and Word2Vec embeddings and demonstrating that task-specific distribution optimization is critical for advancing the state-of-the-art in confidentiality identification.
Alzhrani et al. [19] introduced the MS-CNN (Multi-Sequence Convolutional Neural Network) framework to optimize the identification of sensitive paragraphs within unstructured datasets like WikiLeaks and the Panama Papers. By integrating a Multi-Sequence Learning technique with a Wide CNN architecture, the model effectively mitigates information loss and noise issues inherent in variable-length text segments. Empirical benchmarks on diplomatic and general datasets demonstrated that MS-CNN significantly outperforms deep architectures such as Sent-CNN and VDCNN, particularly in minority class detection with F1-scores reaching 0.773 for highly sensitive categories.
Subhashini and Rani [20] proposed a confidentiality detection methodology that integrates K-means clustering with cluster-specific language models to identify sensitive terminology within the Enron email corpus. By employing TF-IDF vectorization and Dirichlet smoothing, the authors established probability ratios to distinguish between sensitive and non-sensitive lexicons based on their distribution across categorized document groups. Although the study primarily details a conceptual framework rather than providing comprehensive performance metrics, it highlights the potential of using probability-based term identification to refine Data Loss Prevention systems through non-confidential contextual validators.
Kesenek et al. [21] developed a robust classification algorithm to counter content-based adversarial attacks, utilizing a comprehensive dataset of 5539 documents consolidated from TM, Mormon, Dyncorp, and DBPedia. The proposed framework, which integrates Latent Semantic Analysis (LSA) and multi-layered feature extraction with strategic spelling correction, achieved an exceptional baseline accuracy of 99.5%, effectively matching CNN-based architectures in attack-free environments. However, the study highlights superior durability in adversarial scenarios, as the model maintained F1-scores above 90% during intensive character-level manipulations—conditions under which CNN accuracy plummeted to 55%—thereby providing a highly reliable defense mechanism for confidential document identification.
Hart et al. [22] introduced a hierarchical two-stage classification framework employing a “supplement and adjust” strategy to mitigate high false-positive rates, utilizing a hybrid corpus of corporate and public datasets including the Enron email collection. By integrating a primary classifier with a metadata-enhanced meta-classifier using xtra.info features, the authors successfully reduced the False Positive Rate (FPR) on non-corporate documents from a baseline of 87.2% to less than 0.1%. The experimental results further highlight the model’s precision, as the False Discovery Rate (FDR) for the Enron dataset was curtailed from 47.05% to 0.92% while maintaining a False Negative Rate (FNR) below 3.0%, demonstrating that auxiliary metadata is critical for minimizing disruptive false alarms in real-world DLP applications.
Lin et al. [23] introduced a feature fusion architecture integrating Convolutional Neural Networks (CNN) and Bidirectional Long Short-Term Memory (BiLSTM) networks to optimize the identification of sensitive information within unstructured Chinese electronic documents. Utilizing a specialized dataset of 7500 sensitive documents from encrypted WikiLeaks files and laboratory data, complemented by the Sohu News corpus, the methodology employed the cppjieba tool for segmentation and Word2Vec for generating 200-dimensional embeddings. The experimental results demonstrated that the hybrid model achieved an accuracy of 93.44% and an F1-score of 93.82%, outperforming standalone CNN and BiLSTM architectures by 2.1% and 1.38%, respectively. This research concludes that the synergy of CNN’s localized feature extraction and BiLSTM’s global contextual capture provides superior robustness and stability compared to traditional SVM and keyword-matching techniques in complex linguistic environments.
Sulavko et al. [24] proposed a multi-stage classification framework to detect and grade confidential information within a corpus of 1.1 million Russian-language text instances derived from ten different organizations. The methodology employs a hierarchical pipeline in which 1024-dimensional E5-based embeddings are first processed by an autoencoder that functions as a binary filter, achieving 97.84% accuracy and a 0.015% false negative rate. Subsequently, an ensemble of ten CNN models, structured according to the Condorcet Jury Theorem, uses a voting mechanism to categorize sensitive messages into four confidentiality levels, yielding an overall multi-class accuracy of 90.3%. The modular architecture combines deep feature extraction with ensemble voting to keep the rate of undetected information leakage very low.
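The intuition behind the Condorcet Jury Theorem can be shown with a short calculation. The sketch assumes the idealized setting of the theorem: independent voters, a binary decision, and an identical per-voter accuracy `p`. Real CNN ensembles are correlated and the cited task has four classes, so these numbers are an upper-bound illustration rather than a model of that system.

```python
import math


def majority_correct_prob(n_voters: int, p: float) -> float:
    """Probability that strictly more than half of n independent voters,
    each correct with probability p, give the correct binary answer.
    With an even n, ties are counted as incorrect."""
    threshold = n_voters // 2 + 1
    return sum(
        math.comb(n_voters, k) * p ** k * (1.0 - p) ** (n_voters - k)
        for k in range(threshold, n_voters + 1)
    )


if __name__ == "__main__":
    # Hypothetical per-voter accuracy of 0.7.
    for n in (1, 3, 11, 101):
        print(n, majority_correct_prob(n, 0.7))
```

When `p` exceeds 0.5 the majority becomes more reliable as odd-sized ensembles grow, and when `p` is below 0.5 the opposite holds, which is why the theorem only motivates ensembles of individually better-than-chance members.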
Han et al. [25] proposed a prompt-based data augmentation strategy to address class imbalance in security-sensitive text classification, specifically targeting the minority “Secret” class within the WikiLeaks dataset. Using GPT-4o with chain-of-thought reasoning and a sliding-window approach centred on the frequency-weighted median, the authors generated 1596 synthetic Secret-class samples to rebalance the training distribution. A fine-tuned LLaMA 3.1 8B model with LoRA adaptations was subsequently trained on the augmented dataset, achieving 98% accuracy with one-shot generation and 99% with few-shot generation. Their findings demonstrate that prompt-engineering strategies incorporating mathematical similarity constraints and explicit label definitions can effectively mitigate the precision–recall trade-off inherent in imbalanced diplomatic document classification, though the approach has yet to be extended to multi-tier four-class corpora or validated beyond the WikiLeaks benchmark.
Han et al. [7] conducted the first comprehensive survey on text-based information security rating, proposing a standardised taxonomy encompassing domain scope, methodology, and evaluation metrics for automated document security classification. Covering methodological evolution from rule-based systems and clustering approaches through classical machine learning to deep learning architectures, the survey identifies representative datasets—including WikiLeaks, TUBITAK UEKAE, Enron, and DNSA—and highlights five persistent open challenges: scarcity of publicly available labelled corpora, class imbalance in confidential data, absence of domain-specific evaluation metrics, limited cross-domain generalisability, and the need for convergence between administrative and technical approaches. The present study directly addresses three of these challenges through the construction of the FRUS dataset, the application of Focal Loss for class imbalance, and a multi-dataset evaluation framework spanning two distinct diplomatic corpora.
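Focal Loss, mentioned above as the remedy for class imbalance, scales the cross-entropy of each example by a factor that shrinks as the model becomes confident in the true class. A minimal single-example sketch of the standard form, FL = -alpha_t (1 - p_t)^gamma log(p_t), follows. The values of `gamma` and `alpha` are illustrative assumptions; the settings used in the study are not given in this section.

```python
import math


def softmax(logits: list[float]) -> list[float]:
    """Numerically stable softmax."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def focal_loss(logits, target, gamma=2.0, alpha=None):
    """Focal loss for one example.

    logits: raw class scores; target: index of the true class.
    gamma:  focusing parameter (gamma = 0 gives plain cross-entropy).
    alpha:  optional per-class weights (hypothetical values in the demo).
    """
    p_t = softmax(logits)[target]
    weight = 1.0 if alpha is None else alpha[target]
    return -weight * (1.0 - p_t) ** gamma * math.log(p_t)


if __name__ == "__main__":
    easy = focal_loss([5.0, 0.0, 0.0], target=0)   # confident and correct
    hard = focal_loss([0.0, 0.0, 0.0], target=0)   # uncertain
    print(easy, hard)
```

Down-weighting well-classified examples leaves more of the gradient for hard and minority-class documents, which is the usual motivation for preferring it over plain cross-entropy on imbalanced label distributions.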
Existing studies on confidentiality-aware classification exhibit notable differences in methodologies, datasets, and evaluation metrics. To enable a systematic comparison and highlight research gaps, a summary of the literature is provided in Table 1.

2.2. Information, Document, and Text Classification in Diverse Contexts

This section reviews text and document classification studies conducted across diverse application domains, ranging from news categorization to financial data analysis. By examining various architectures tailored for specific linguistic structures, document lengths, and contextual depths, the review highlights how classification methodologies adapt to the unique challenges of different information environments.
Tan et al. [26] proposed an optimized k-NN based system for real-time classification of large-scale texts to improve document confidentiality management. To overcome the computational inefficiencies of high-dimensional feature vectors, the authors developed a novel feature selection algorithm (tf-DE) and a parallel computing infrastructure leveraging the AVX-256 instruction set. Experimental results on the THUCNews dataset demonstrated that this hardware-accelerated approach reduced classification time by 53.8% and achieved an average F1-score of 91.4%, effectively competing with SVM models while maintaining superior performance in sports and entertainment categories with scores exceeding 96%.
Liu et al. [27] introduced the GCSA (Graph Convolutional Network and Self-Attention-based) algorithm to address the limitations of traditional Sentiment Word Tree (SMT) and DFA-based methods in sensitive information detection. To overcome high computational costs and poor generalization on out-of-vocabulary terms, the GCSA framework employs an end-to-end learning strategy that integrates BERT-like encodings with a graph-based integration layer. By synergizing GCNs with a self-attention mechanism, the model captures complex contextual relationships between word pairs, enabling the detection of sensitive content beyond predefined dictionaries while maintaining high precision through softmax-based probability estimation.
Huang [28] developed a two-stage security classification framework integrating Hidden Markov Models (HMM) and Support Vector Machines (SVM) to detect sensitive corporate information within configuration files. Addressing the inadequacy of traditional pattern matching for identifying IP addresses and credentials, the study extracted seven structural features, including file size, extension, and sensitive keyword density, which were subsequently processed via TF-IDF filtering. The methodology used a Gaussian HMM for initial screening, followed by a polynomial kernel SVM for secondary validation. On a synthetic dataset, the integrated model achieved a baseline accuracy of 60%, while the HMM and SVM components independently yielded accuracies of 78% and 76%, respectively. The study highlights the potential of multi-stage hybrid models for mitigating data leakage from internet-based configuration projects.
Gang et al. [29] introduced the SDC (Semantic Dependency-based Classification) algorithm, which prioritizes contextual semantic dependencies over simple word frequencies for sensitive information detection. The methodology employs a two-stage process: a sentence-level Conditional Random Field (CRF) model for word labeling and a semantic tree structure to define sensitivity transfer principles. By integrating individual sentence structures into a unified document-level sensitivity framework, the authors benchmarked various classifiers—including SVM, Naive Bayes, and KNN—on a dataset of 3344 forum posts. Experimental results revealed that the SVM-based SDC model achieved F1-scores exceeding 80% across four security levels, representing a 15–25% improvement over traditional TF-IDF and syntactic analysis approaches.
McDonald et al. [30] proposed a novel approach for the automated classification of sensitive government documents under Freedom of Information (FOI) exemptions by leveraging Part-of-Speech (POS) sequence representations and specialized SVM kernel functions. The methodology evaluates five distinct kernels—Linear, Gaussian, Spectrum, Mismatch, and Smith-Waterman—to determine if linguistic structures can serve as indicators of sensitive content. Experimental benchmarks on a dataset of 3801 government documents demonstrated that while the Linear kernel achieved the highest individual auROC (68.97%), an ensemble approach using Weighted Majority Vote (WMV) significantly improved classification performance, reaching an auROC of 74.43%. Their findings validate that integrating grammatical structures with text classification provides a robust and statistically significant solution (p < 0.001) for identifying sensitive information in public disclosure contexts.
Alneyadi et al. [31] proposed a statistical DLP model integrating TF-IDF and Singular Value Decomposition (SVD) to overcome the limitations of content-based systems in detecting semantically varied sensitive data. By analyzing documents across six distinct confidentiality categories, the framework utilizes Cosine similarity and Taxicab distance to execute classification decisions within a reduced-dimensionality vector space. Experimental results on a corpus of 360 security-related articles demonstrated that the model maintains up to 99% accuracy for known data and, notably, achieves a 63.33% classification rate even when documents are heavily altered with synonyms. Their findings confirm that SVD-enhanced statistical profiles provide a resilient mechanism for semantic sensitivity detection in evolving data leakage scenarios.
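The two measures named above, cosine similarity and taxicab (Manhattan) distance, can be sketched on small vectors standing in for SVD-reduced document representations. The category names and centroid values are hypothetical, and the nearest-centroid rule is an assumption used for illustration rather than the decision procedure of the cited study.

```python
import math


def cosine_similarity(u: list[float], v: list[float]) -> float:
    """Cosine of the angle between u and v (0.0 if either is a zero vector)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def taxicab_distance(u: list[float], v: list[float]) -> float:
    """Sum of absolute coordinate differences (L1 distance)."""
    return sum(abs(a - b) for a, b in zip(u, v))


def nearest_category(doc_vec, centroids):
    """Pick the category whose centroid has the highest cosine similarity."""
    return max(centroids, key=lambda name: cosine_similarity(doc_vec, centroids[name]))


if __name__ == "__main__":
    # Hypothetical 2-D centroids after dimensionality reduction.
    centroids = {"public": [0.0, 1.0], "restricted": [1.0, 0.0]}
    print(nearest_category([1.0, 0.1], centroids))
```

Cosine similarity ignores vector length and so is insensitive to document size, whereas taxicab distance is sensitive to magnitude, so the two measures can rank the same candidates differently.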
Perron et al. [32] evaluated the feasibility of Local Large Language Models (LLMs) for the secure analysis of unstructured text data in social work research, addressing the ethical constraints of proprietary cloud-based models. Utilizing a dataset of 2956 child welfare investigation summaries, the study benchmarked Mistral-7b, Mixtral-8×7b, Llama3-8b, and Llama3-70b through zero-shot prompting for classification and extraction tasks. The results highlighted that Llama3-8b achieved 95% accuracy and a 0.93 precision rate, while Llama3-70b demonstrated over 95% faithfulness in text extraction with a Cohen’s kappa of 0.90, matching human-expert performance. Their findings conclude that local, open-source LLMs provide a secure, low-hallucination (<1%) alternative for processing sensitive qualitative data without compromising data privacy.
Sinoara et al. [33] proposed knowledge-enriched document embeddings to enhance text classification, particularly in scenarios where semantic disambiguation is critical. By developing two models, Babel2Vec and NASARI + Babel2Vec, the authors integrated BabelNet synset vectors and utilized the Babelfy system to resolve lexical ambiguities before document vectorization. Experimental evaluations across nine English and Portuguese datasets demonstrated that these semantically enriched representations significantly outperformed traditional Bag-of-Words (BOW) and LDA baselines in Macro-F1 scores, while the Babel2Vec model achieved the highest correlation with human judgments (Pearson: 0.66). Their findings highlight that low-dimensional, knowledge-based embeddings offer superior interpretability and computational efficiency for cross-lingual and short-text classification tasks.
Timmer et al. [34] investigated the efficacy of fine-tuned Transformer models for sensitive sentence detection, focusing on semantic classification challenges where keyword-based methods fall short. Utilizing a real-world dataset of 1073 manually labeled sentences from the Monsanto legal case, the authors fine-tuned a bert-base-uncased model across four distinct sensitive categories (GHOST, TOXIC, CHEMI, REGUL). The experimental results demonstrated that the BERT-based approach significantly outperformed Inference Rule, LSTM, and RecNN baselines, particularly in F2-scores (reaching 65.79% for GHOST), highlighting the robustness of pre-trained architectures in low-resource and high-stakes sensitive information identification scenarios.
Wu et al. [35] introduced FedAPILLM, a federated learning framework integrated with Large Language Models (LLMs) such as Llama3 and Qwen2.5, to identify sensitive content within API calls without compromising data privacy. To enhance computational efficiency in a distributed environment, the authors employed Low-Rank Adaptation (LoRA) for fine-tuning specific model layers. Experimental evaluations on a manually labeled dataset of 1123 Web API samples demonstrated that the framework achieved near 100% accuracy with a 0% false alarm rate at 2500 iterations, with the Qwen2.5 14B model showing faster convergence compared to Llama3. Their research highlights that the synergy between federated learning and parameter-efficient fine-tuning offers a robust and scalable solution for real-time sensitive data recognition in secure API ecosystems.
Chong [36] proposed a hybrid deep learning framework for real-time sensitive data recognition across both structured and unstructured data sources. The methodology integrates Regular Expressions (regex) for structured data patterns with a BERT-CRF architecture to handle the contextual complexities of unstructured text through Named Entity Recognition (NER). Experimental results on a specialized dataset derived from the CoNLL-2003 and CLUE benchmarks demonstrated that while the standalone BERT model achieved an F1-score of 0.896, the integrated BERT+regex framework yielded superior performance with a precision of 0.925. The findings highlight that combining the deterministic accuracy of regex with BERT’s contextual learning capabilities provides a robust and generalizable solution for identifying diverse sensitive information types.
Aydın et al. [37] conducted a comparative analysis between traditional machine learning algorithms—such as Naive Bayes and Decision Trees—and Transformer-based deep learning models, including BERT and GPT-3, for text classification across various educational levels. Utilizing a dataset of 476 documents from the Ministry of National Education and the Council of Higher Education, the study demonstrated that the GPT-3 model achieved the highest performance with 77% accuracy and F1-score, significantly outperforming the best-performing traditional method, Naive Bayes (65% accuracy). Their findings emphasize that the contextual learning capabilities of Transformer architectures provide superior generalization and accuracy compared to classical approaches in complex, multi-level document classification tasks.
Tran et al. [38] systematically evaluated the robustness of text classification algorithms—ranging from classical models (LR, SVM) to deep learning architectures (VDCNN, Transformer)—against real-world data corruptions such as keyboard errors, OCR inaccuracies, and microphone distortions. Utilizing a methodology based on the Area Under the Robustness Curve (robustness AUC), the study analyzed performance degradation across YelpPolarity, AG’s News, and DBpedia datasets. The findings revealed that while VDCNN models excel in character-level resilience, Transformers are more robust to word-level distortions; crucially, the research demonstrates that the most accurate models are not necessarily the most robust, emphasizing that data augmentation with corrupted samples is essential to bridge the performance gap between controlled and noisy environments.
Chen et al. [39] conducted a comparative study between domain-concept-based machine learning methods and pre-trained deep learning models for the classification of large-scale US legal documents. Addressing the complexities of legal terminology and sparse label distributions within the SigmaLaw dataset, the authors developed a Random Forest model trained on 400 PCA-selected domain features from 30,000 case documents. The experimental results demonstrated that the proposed model achieved 85.98% accuracy, significantly outperforming state-of-the-art deep learning architectures like BiLSTM + Attention, while offering superior interpretability and computational efficiency in low-resource and domain-specific scenarios.
Mohammed and Kora [40] proposed a novel two-layer meta-learning-based ensemble deep learning framework to extend the boundaries of accuracy and robustness in text classification. The architecture integrates diverse base classifiers, including CNN, LSTM, and GRU, whose probability-based outputs (soft predictions) are processed by shallow meta-learners across a three-level hierarchical structure. Experimental results on six benchmark datasets, such as IMDB and ArSarcasm, demonstrated that the ensemble approach significantly outperformed individual models, achieving up to a 16.03% accuracy gain on the AJGT dataset. Their research underscores that synergizing classifier diversity with meta-learning effectively reduces variance and enhances generalization across varying languages and linguistic structures.
Jiang et al. [41] introduced TechDoc, a multimodal deep learning framework designed to enhance technical document management by integrating textual, visual, and relational data. Utilizing a massive dataset of 800,000 USPTO patent documents, the architecture employs RNNs for text processing, CNNs (VGG-19) for visual features, and GNNs (GraphSAGE) for citation networks. Experimental results demonstrated that TechDoc significantly outperformed unimodal and dual-modal baselines, achieving a 3.2% improvement in top-1 accuracy for IPC subclassifications while being 13 times faster to train than fine-tuned BERT models. Their research concludes that multimodal information fusion not only improves classification accuracy in specialized technical domains but also offers superior computational efficiency and transfer learning potential for enterprise-level document systems.
Schoppmann [42] evaluated the efficacy of privacy-preserving machine learning protocols for text classification and regression tasks, focusing on secure linear regression and differential privacy mechanisms. By comparing fixed-point arithmetic protocols with iterative FP-CGD methods, the study demonstrated that the proposed framework significantly reduces computational latency and communication costs—outperforming existing systems like SecureML by up to 3x. Experimental results on the UCI and Movie Reviews datasets showed that privacy-protected TF-IDF extraction and k-NN/Naive Bayes classifiers maintained high accuracy with less than an 8% deviation from plaintext models, proving that secure, high-dimensional data processing is feasible for real-time document classification.
Xu et al. [43] proposed a Convolutional Neural Network (CNN)-based approach for detecting sensitive information within unstructured military and political texts. To overcome the high computational costs of RNNs and the limited generalization of traditional keyword-matching methods, the study utilized a Text-CNN architecture integrated with Word2vec embeddings and Jieba segmentation. Experimental results on a dataset of 22,000 documents demonstrated that the CNN model achieved a superior accuracy of 96.82%, while completing training nearly three times faster than RNN baselines. Their findings validate that CNNs offer a highly efficient and accurate alternative for real-time sensitivity classification by automatically extracting salient patterns through varied convolutional kernels without the need for manual feature engineering.
Park et al. [44] introduced GenNER, a novel framework designed to detect sensitive information within structured Korean table data by integrating Text Generation (TG) and Named Entity Recognition (NER) modules. To address the lack of context in tabular formats and the morphological complexities of the Korean language, the GenNER architecture employs a BiLSTM-CRF structure that generates synthetic natural language sentences from raw table entries before performing NER tasks. Experimental results on public documents from the Seoul Metropolitan Government demonstrated that the GenNER system achieved a 0.91 F1-score, significantly outperforming the baseline BiLSTM-CRF model (0.74 F1-score) which operates directly on raw data. Their research underscores that generating contextual sentences from structured inputs is a highly effective strategy for improving the accuracy of automated sensitivity classification in enterprise data security.
Chen et al. [45] proposed a novel framework for the security classification of language technology resources by integrating BERT-based word embeddings with the graph-based TextRank algorithm. To address the subjectivity and inefficiency of manual classification, the methodology constructs a weighted graph where nodes represent sentence vectors derived from dynamic BERT embeddings and edges represent semantic similarities. By calculating sensitivity scores through the TextRank ranking mechanism, the model automatically identifies high-risk text components and assigns security levels. Experimental results in Chinese text processing demonstrated that this hybrid approach significantly outperforms traditional SVM, Random Forest, and standard TextRank models in accuracy and recall, providing a sophisticated automation level for managing sensitive linguistic assets despite its higher computational complexity.
Liang et al. [46] proposed the A-ELMo algorithm, which enhances dynamic word embeddings with an integrated attention mechanism for the context-aware detection of sensitive information in social media. Addressing the inability of static representations like Word2Vec and GloVe to resolve lexical ambiguities, the framework utilizes the bidirectional LSTM structure of ELMo to generate context-sensitive embeddings, subsequently refined by an attention layer to prioritize semantically significant terms. Experimental evaluations on Twitter data revealed that A-ELMo achieved an accuracy of 84.15% and a recall of 88.41%, substantially outperforming fastText (75.14%) and traditional keyword-matching approaches (41.09%). Their research demonstrates that the synergy between weighted semantic representations and dictionary-tree structures provides a superior standard for identifying sensitive content within high-variance linguistic environments.
Fu et al. [47] developed an intelligent recognition framework for identifying and interpreting sensitive information within long-form unstructured texts by integrating natural language processing with machine learning. The proposed architecture employs a multi-modal feature fusion approach that combines textual content (via BERT and Word2Vec), contextual metadata, and user behavior analysis to execute a comprehensive sensitivity scoring system. Experimental results demonstrated that the Gradient Boosting Machine (GBM) model achieved superior performance with an AUC of 0.95 and an F1-score of 0.85, significantly outperforming Random Forest baselines. While highlighting the contextual superiority of BERT for long-sequence analysis, the study also addresses model robustness and scalability, providing a strategic foundation for proactive data privacy and risk management in high-volume information environments.
Kaliappan et al. [48] introduced SentinelGuard, a sophisticated Data Loss Prevention (DLP) solution specifically optimized for the healthcare sector to protect Personal Identifiable Information (PII) and Protected Health Information (PHI). By utilizing a BERT-based deep learning model trained on clinical datasets, the system executes real-time analysis of user activities and data entries to preemptively block leakage attempts. Experimental benchmarks against prominent industry solutions revealed that SentinelGuard achieved a remarkably low False Positive rate of 8%, significantly outperforming Forcepoint (64%), McAfee (36%), and Symantec (23%). Their findings demonstrate that integrating contextual language models into DLP frameworks not only enhances detection accuracy for medical records but also provides a scalable and customizable security architecture for high-stakes enterprise environments.
Li et al. [49] introduced CoGraphNet, a novel graph-based framework designed to address the computational complexity and interpretability challenges in text classification. The model constructs heterogeneous graph structures at both word and sentence levels, utilizing GRU-based neural networks and SwiGLU activation functions to process contextual dependencies and positional biases. By incorporating attention mechanisms, CoGraphNet enhances model transparency by identifying the specific words and sentences that drive classification decisions. Experimental results across four benchmark datasets, including 20NG (91.50% accuracy), R52 (96.10%), and Ohsumed, demonstrated that the proposed architecture offers competitive performance while providing significant advantages in terms of explainability and robust generalization in information-dense textual environments.
Ren et al. [50] introduced DyLas (Dynamic Label Alignment Strategy), a novel three-stage framework designed to enhance Large Language Model (LLM) performance in Large-scale Multi-label Text Classification (LMTC) without requiring model retraining. The strategy addresses the challenges of dynamic label sets and long-tail distributions through a pipeline consisting of vanilla input-output generation, label alignment (utilizing both hard and soft alignment via embeddings like BGE-M3), and counterfactual error checking. Experimental evaluations across four datasets, including Reuters21578 and FB15k237, demonstrated significant improvements; notably, integrating DyLas with GLM4-9b increased the Micro-F1 score from 15.4% to 55.4%. Their findings confirm that DyLas enables LLMs like GPT-4o and LLaMA 3.1 to outperform traditional PLMs (BERT, RoBERTa) in complex multi-label tasks, providing a scalable and robust solution for real-time classification in evolving information environments.
Wang et al. [51] proposed a novel Meta-Active Learning framework designed to minimize labeled data requirements while enhancing flexibility in text classification tasks. The methodology integrates the TextGCN architecture with a Bi-LSTM-based meta-information generator and a domain discriminator to identify and prioritize the most informative samples during the learning process. Evaluated on Amazon, 20 Newsgroups, and Reuters-21578 datasets under 1-shot and 5-shot scenarios, the proposed model achieved average accuracies of 74.2% and 87.1%, respectively. Their findings demonstrate that synergizing graph-based representations with meta-learning significantly outperforms traditional active learning strategies and baseline meta-learning models like MLADA, offering a scalable and high-performance solution for low-resource text classification environments.
Zhang et al. [52] introduced VIWHard, a novel framework for generating adversarial texts against text classification models in a hard-label black-box setting, where only the predicted labels are accessible. To overcome the discrete nature of text, the proposed method utilizes a two-stage architecture: an importance-word discriminator to identify high-impact tokens independently of the target model and a masked language model (MLM) to generate contextually relevant synonyms that preserve semantic and syntactic integrity. Experimental evaluations across eight datasets and three architectures (WordCNN, WordLSTM, and BERT) demonstrated that VIWHard achieved attack success rates exceeding 97% in security-oriented datasets (e.g., Jigsaw2018), significantly outperforming baselines like TextHacker and HydraText in terms of naturalness, grammatical correctness, and query efficiency. Their findings underscore the vulnerability of current NLP models to transferable adversarial examples and highlight the necessity of robust defense strategies in practical deployment scenarios.
Alparslan [53] conducted a comparative performance analysis between traditional machine learning algorithms and a Convolutional Neural Network (CNN) model for Turkish text classification. Utilizing the multi-class TTC-4900 news dataset and the binary MY-15130 customer review dataset, the study employed rigorous preprocessing steps, including stemming via the Zemberek library and feature weighting through TF-IDF with log-normalization. Experimental results indicated that while traditional classifiers like SVM and Naive Bayes performed exceptionally well on binary tasks (96.2% accuracy), the proposed CNN architecture (Conv1D + Flatten + Dense) achieved a state-of-the-art F1-score of 92.2% on the multi-class TTC-4900 dataset, surpassing existing literature. The findings suggest that deep learning models offer superior performance in handling the linguistic diversity of multi-class Turkish corpora, whereas traditional methods remain highly competitive in simpler binary classification scenarios.
Riyadi et al. [54] introduced IndoGovBERT, a domain-specific BERT-based pre-trained language model developed from Indonesian government corpora for automated processing of SDG-related government documents. Evaluated against four general-purpose Indonesian language models and a multilingual BERT baseline across text classification and document similarity tasks, IndoGovBERT demonstrated consistently superior performance, validating that domain-adaptive pre-training substantially improves classification accuracy in specialised governmental document environments. Their findings support the broader principle that encoder-based transformer architectures benefit significantly from domain-specific fine-tuning when applied to formal institutional text, a rationale that directly motivates the present study’s adoption of DeBERTa-v3-Large for diplomatic security classification.
Sujatha and Nimala [55] proposed EPLM-HT, an ensemble framework combining BERT, RoBERTa, GPT, DistilBERT, and XLNet with systematic hyperparameter tuning for conversational sentence classification into four categories. By aggregating predictions from five independently fine-tuned transformer models, the ensemble achieved an F1-score of 0.88, consistently outperforming all individual base models. Their findings demonstrate that multi-model ensemble aggregation with fine-tuned hyperparameters provides more robust classification boundaries than any single transformer architecture, a principle that directly informs the three-seed ensemble strategy employed in the present study.
Alzamel and Alajmi [56] fine-tuned three transformer-based encoder models—AraBERT, GigaBERT, and XLM-RoBERTa—for the automated classification of official Arabic governmental correspondence into six ministry-defined categories, using a balanced dataset of 22,741 documents. GigaBERT achieved the highest accuracy of 98%, demonstrating that bilingual pre-training provides superior cross-lingual transfer for domain-specific institutional text. Their findings confirm that transformer-based architectures can effectively automate document routing in high-volume governmental environments, reinforcing the applicability of encoder-based models for formal institutional document classification across diverse linguistic contexts.
Mao et al. [57] proposed BERT-DXLMA, a hybrid architecture integrating BERT with xLSTM and semantic fusion technology to enhance deep semantic feature extraction and representation learning for English text classification. To address class imbalance, the authors re-designed the focal loss function to further amplify attention to minority-class samples, achieving superior precision and overall accuracy across six public benchmark datasets compared to multiple baselines. Their work validates that combining transformer-based contextual encoding with sequential deep learning modules and adaptive loss re-weighting constitutes an effective strategy for robust text classification under imbalanced conditions—a convergent design principle with the Focal Loss and hybrid architectural approach adopted in the present study.
Prabhakar and Pati [58] investigated the automated classification of Indian court judgments into legal domains, a task characterized by non-standardized document structures, verbose domain-specific language, and significant class imbalance. The study systematically evaluated a broad range of transformer-based embeddings—including InLegalBERT, InCaseLawBERT, DeBERTa, RoBERTa, and T5—in combination with feature engineering techniques such as TF-IDF, Word2Vec, PCA, and forward feature selection. To handle class imbalance across 14 legal domains, SMOTE oversampling was employed. Ensemble classifiers—voting classifiers, gradient boosting, and random forest—were integrated with the embedding representations. The optimal configuration of T5 embeddings combined with SMOTE, feature selection, and a voting classifier achieved 98% accuracy on a manually curated Indian legal corpus, establishing a benchmark for hybrid embedding-ensemble approaches in domain-specific legal document classification.
Liu et al. [59] addressed the limitations of existing contrastive learning methods in text classification, particularly the problem of insufficient data utilization. The study introduced instance-level weighted contrast learning, combining four text augmentation strategies—symbol insertion, affirmative auxiliary verb substitution, double negation, and punctuation repetition—organized into two complementary schemes: affirmative enhancement and negation transformation. To mitigate the adverse effects of false negative samples introduced during augmentation, an instance weighting mechanism was employed, where complementary models generate per-sample weights to correct sampling bias during training. Experiments across multiple benchmark datasets demonstrated improvements in model generalization and robustness over standard contrastive learning baselines. The proposed instance weighting approach bears conceptual similarity to focal loss-based training strategies, both aiming to dynamically adjust the contribution of individual samples during optimization.
Existing studies on information, document, and text classification across diverse contexts demonstrate variations in application domains, data characteristics, and methodological approaches. To facilitate a systematic comparison of these studies, a summary of the relevant literature is provided in Table 2.

3. Materials and Methods

This section provides a detailed account of the datasets, the comprehensive preprocessing pipeline, the hybrid feature extraction techniques, and the advanced classification architectures developed for the automated detection of document confidentiality.

3.1. Datasets

To evaluate the models’ robustness across different linguistic structures and historical contexts, two primary datasets were utilized.

3.1.1. WikiLeaks Cable Classifier Dataset

The first dataset employed in this research is the WikiLeaks Cable Classifier [8], a publicly available and pre-existing corpus frequently used as a benchmark for automated document security classification. The dataset consists of 9005 diplomatic records, provided in a structured .jl (JSON Lines) format. This format facilitates the processing of large-scale text as a stream of discrete objects, each containing comprehensive metadata alongside the raw document content. The dataset contains 20 distinct attribute fields that represent various dimensions of diplomatic correspondence. The primary attributes and their descriptions are summarized in Table 3.
Analysis of the Original_Classification field reveals a diverse set of security labels, reflecting the complexity of governmental classification systems. As shown in the statistical analysis, labels range from various “UNCLASSIFIED” formats to “SECRET, NOFORN” (No Foreign Nationals) variants. To ensure model stability and prevent class sparsity, labels representing similar levels of sensitivity were consolidated. For instance, “CONFIDENTIAL, NOFORN” was mapped to the primary “CONFIDENTIAL” class, and “UNCLASSIFIED, FOR OFFICIAL USE ONLY” was merged into the “UNCLASSIFIED” category. The distribution of these original labels is visualized in Figure 1.
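The streaming and label consolidation described above can be illustrated with a minimal sketch. The field name Original_Classification is taken from the dataset. The helper names, the in-memory sample records, and the rule of keeping only the base level before the first comma (which also maps "SECRET, NOFORN" to SECRET) are illustrative assumptions, not the authors' implementation.

```python
import io
import json
from collections import Counter

# The three consolidated classes used for the WikiLeaks corpus.
CANONICAL = ("UNCLASSIFIED", "CONFIDENTIAL", "SECRET")


def consolidate_label(raw_label):
    """Map a raw label such as 'CONFIDENTIAL, NOFORN' to its base class.

    Assumption: the base level is the text before the first comma.
    Returns None when the base level is not a canonical class.
    """
    base = raw_label.split(",", 1)[0].strip().upper()
    return base if base in CANONICAL else None


def iter_jsonl(stream):
    """Yield one dict per non-empty line of a JSON Lines stream."""
    for line in stream:
        line = line.strip()
        if line:
            yield json.loads(line)


def label_distribution(stream, field="Original_Classification"):
    """Count consolidated labels, skipping records with unknown labels."""
    counts = Counter()
    for record in iter_jsonl(stream):
        label = consolidate_label(record.get(field) or "")
        if label is not None:
            counts[label] += 1
    return counts


# Hypothetical records standing in for the real .jl file.
SAMPLE = io.StringIO(
    '{"Original_Classification": "CONFIDENTIAL, NOFORN"}\n'
    '{"Original_Classification": "UNCLASSIFIED, FOR OFFICIAL USE ONLY"}\n'
    '{"Original_Classification": "SECRET, NOFORN"}\n'
    '{"Original_Classification": "CONFIDENTIAL"}\n'
)
print(label_distribution(SAMPLE))
```

Reading the file line by line keeps memory use independent of corpus size, which is the practical benefit of the JSON Lines layout noted above.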

3.1.2. Foreign Relations of the United States (FRUS) Dataset

The second dataset was programmatically constructed by the authors from the Foreign Relations of the United States (FRUS) collection [9], an open-source repository of official U.S. diplomatic records. While prior research by Heintz et al. [13] has utilized this collection to develop security classification models, their specific datasets are not publicly accessible. Consequently, a structured dataset was built independently for this study by processing the raw XML data provided via the FRUS GitHub repository [10].
Only documents identified with the <div subtype="historical-document"> tag were extracted to ensure archival relevance. The textual content was stored in .txt format, while metadata (including confidentiality levels, administration details, and archival sources) was consolidated into a relational .csv file.
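A minimal sketch of this extraction step is given below. It uses the standard library's xml.etree.ElementTree for portability rather than the lxml parser used in the study, so it does not reproduce lxml's validation behaviour. The function names, the sample volume, and the namespace-agnostic tag matching are illustrative assumptions.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip a namespace prefix such as '{uri}div' down to 'div'."""
    return tag.rsplit("}", 1)[-1]


def extract_historical_documents(xml_text):
    """Return the plain text of each <div subtype="historical-document">.

    An unparsable volume is reported and skipped (empty list returned),
    mirroring a log-and-exclude policy for corrupted files.
    """
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError as exc:
        print(f"skipping unparsable volume: {exc}")
        return []
    documents = []
    for element in root.iter():
        if not isinstance(element.tag, str):
            continue  # defensive: ignore non-element nodes
        if local_name(element.tag) != "div":
            continue
        if element.get("subtype") != "historical-document":
            continue
        # Join all nested text and collapse runs of whitespace.
        text = " ".join("".join(element.itertext()).split())
        if text:
            documents.append(text)
    return documents


# Hypothetical miniature volume; real FRUS volumes are far larger.
SAMPLE_VOLUME = """<TEI xmlns="http://www.tei-c.org/ns/1.0"><text><body>
<div type="document" subtype="historical-document"><p>Telegram from the Embassy.</p></div>
<div type="document" subtype="editorial-note"><p>Editorial note.</p></div>
</body></text></TEI>"""

print(extract_historical_documents(SAMPLE_VOLUME))
```

Matching on the local tag name avoids hard-coding an XML namespace, which is a convenience of this sketch and not a documented property of the original pipeline.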
The XML parsing pipeline was implemented using the lxml library, which enforces strict structural validation. Each volume file was processed within a try/except block; files raising parse exceptions due to structural corruption or encoding inconsistencies were logged and excluded from further processing. Within successfully parsed files, documents lacking a <text> or <body> element were silently discarded. Only elements carrying the attribute subtype="historical-document" were retained; all other <div> elements were ignored. Following the initial extraction, a dedicated post-processing validation pass was applied to each administration-level CSV file. Records were excluded if:
(i) the content field was empty,
(ii) the corresponding document file did not exist on disk, or
(iii) the document_confidentiality field did not correspond to any of the six recognized classification labels.
These procedures collectively yielded the final corpus of 24,706 documents.
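The three exclusion criteria can be sketched as a simple row filter. The column names content and document_confidentiality follow the text. The file_name column, the function names, the exact label spellings, and the sample rows are hypothetical and serve only to make the example runnable.

```python
import csv
import io
import tempfile
from pathlib import Path

# The six original labels reported for the FRUS corpus (spelling assumed).
RECOGNIZED_LABELS = {
    "Secret", "Confidential", "Top Secret",
    "No classification marking", "Limited Official Use", "Unclassified",
}


def is_valid_record(row, documents_dir):
    """Return True only if a metadata row passes all three checks."""
    if not (row.get("content") or "").strip():
        return False  # (i) empty content field
    file_name = row.get("file_name") or ""  # hypothetical column name
    if not (Path(documents_dir) / file_name).is_file():
        return False  # (ii) document file missing on disk
    # (iii) label must be one of the six recognized values
    return row.get("document_confidentiality") in RECOGNIZED_LABELS


def validate_rows(csv_stream, documents_dir):
    """Keep only the rows of a metadata CSV that pass validation."""
    reader = csv.DictReader(csv_stream)
    return [row for row in reader if is_valid_record(row, documents_dir)]


# Hypothetical demonstration with one valid and three invalid rows.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "d1.txt").write_text("Telegram text", encoding="utf-8")
    sample = io.StringIO(
        "file_name,content,document_confidentiality\n"
        "d1.txt,Telegram text,Secret\n"
        "d2.txt,Missing file,Secret\n"
        "d1.txt,,Secret\n"
        "d1.txt,Bad label,Restricted\n"
    )
    print(len(validate_rows(sample, tmp)))
```

Applying the checks in this order means the file system is only consulted for rows that already have content, though the order does not change which rows survive.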
The dataset covers five presidential administrations, providing a diverse range of long-form, formal diplomatic language. The document distribution across these administrations is presented in Table 4.
The processed data is organized into a dual-storage structure to optimize both metadata management and large-scale text analysis. While the metadata for all records is maintained in a centralized relational .csv file, the corresponding document bodies are stored as individual .txt files, indexed by their unique identifiers. This hybrid structure ensures computational efficiency during the feature extraction and model training phases. The dataset consists of 10 primary attribute fields that encompass both the archival metadata and the document body. The complete set of these attribute fields, along with their respective descriptions, is presented in Table 5.
Analysis of the document_confidentiality field reveals a hierarchy of six original labels: Secret (13,151), Confidential (5471), Top Secret (3351), No classification marking (1932), Limited Official Use (609), and Unclassified (192). To stabilize the classification target and address class imbalances, a normalization process was applied. High-security labels (“Secret” and “Top Secret”) were maintained as distinct categories due to their differing administrative sensitivities. Conversely, labels representing non-restricted content—including “No classification marking” and “Limited Official Use”—were consolidated into the Unclassified category, consistent with industry-standard practices. The distribution of these original labels is visualized in Figure 2.
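The effect of this consolidation on the class counts can be checked with a short calculation. The six counts are those reported above. The mapping follows the description in the text, with Confidential assumed to remain a separate class so that four classes result. The variable and function names are illustrative.

```python
# Original label counts as reported for the FRUS corpus.
ORIGINAL_COUNTS = {
    "Secret": 13151,
    "Confidential": 5471,
    "Top Secret": 3351,
    "No classification marking": 1932,
    "Limited Official Use": 609,
    "Unclassified": 192,
}

# Labels merged into the Unclassified class, per the description above.
MERGED_INTO_UNCLASSIFIED = {"No classification marking", "Limited Official Use"}


def consolidate(counts):
    """Sum the counts of merged labels under their target class."""
    merged = {}
    for label, n in counts.items():
        target = "Unclassified" if label in MERGED_INTO_UNCLASSIFIED else label
        merged[target] = merged.get(target, 0) + n
    return merged


FOUR_CLASS = consolidate(ORIGINAL_COUNTS)
TOTAL = sum(FOUR_CLASS.values())

# Print each class with its count and share of the corpus.
for label, n in sorted(FOUR_CLASS.items(), key=lambda kv: -kv[1]):
    print(f"{label:<13}{n:>7}{n / TOTAL:>8.1%}")
```

Under these assumptions the merged Unclassified class contains 1932 + 609 + 192 = 2733 documents, and the four classes still sum to the 24,706 documents of the final corpus.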
To contribute to future research in diplomatic text analysis and confidentiality classification, the custom-curated FRUS dataset developed in this study will be made publicly available on Kaggle under the name “FRUS Dataset” upon publication. This will allow other researchers to replicate and extend the findings presented herein.

3.2. Data Preprocessing

Prior to model training, a systematic preprocessing pipeline was applied to both datasets to eliminate classification artifacts, harmonise label spaces, and ensure semantic integrity of the input representations.

3.2.1. Security Marking Removal

A central preprocessing contribution, and a foundational component of the SACHN framework’s ablation study, is the systematic removal of inline security classification markers. Raw diplomatic documents frequently contain explicit tokens such as (S), (C//NF), or (TS//SI//NF), along with header directives like ‘CLASSIFIED BY:’ and ‘DECLASSIFY ON:’. A model trained on text retaining these markers would likely learn spurious correlations between these surface tokens and their respective classes, exploiting annotation artifacts rather than the substantive semantic content. Such a model would generalise poorly to real-world operational scenarios where markers are absent, redacted, or deliberately obfuscated.
To ensure reliance on genuine semantic features, a rule-based regex pipeline was developed, comprising 14 patterns targeting five distinct categories of classification artifacts:
  • Parenthesized Inline Markers: Tokens such as (S), (C//NF), and (TS//SI//NF), including all slash and double-slash variants, matched via the pattern: \([A-Z]{1,3}(?://[A-Z0-9]+)*\).
  • Bare Double-Slash Markers: Standalone tokens like S//NF or TS//SI//NF occurring at word boundaries, identified by: \b(TS|S|C|U|SBU)//[A-Z0-9/]+.
  • Classification Header Lines: Complete lines initiated by directives such as CLASSIFIED BY:, REASON:, or DECLASSIFY ON:.
  • Declassification Reason Codes: Specific archival patterns, including 1.4(b) and Reason 1.4(d).
  • Legacy Archival Forms: Residual tokens from older formatting conventions, such as (S/NF), (C/NF), and (S/REL).
Matched tokens were replaced with the neutral sentinel [MASK] to maintain sequence length and structural integrity, while header lines and reason codes were removed entirely to eliminate the artifact signal. This data ablation serves as the primary experimental control of the SACHN framework and is rigorously evaluated in the ablation study presented in Section 4.6.3.
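The marker-removal step can be sketched as follows. This is a minimal illustration rather than the pipeline used in the study: only the first two inline patterns are quoted from the list above, while the legacy, header, and reason-code patterns (the third entry of `INLINE_PATTERNS`, `HEADER_LINE`, `REASON_CODE`) and the function name `strip_markers` are assumptions introduced for the example.

```python
import re

# Inline markers: replaced by a sentinel so the surrounding text keeps its shape.
INLINE_PATTERNS = [
    re.compile(r"\([A-Z]{1,3}(?://[A-Z0-9]+)*\)"),   # quoted from the text: (S), (C//NF)
    re.compile(r"\b(?:TS|S|C|U|SBU)//[A-Z0-9/]+"),   # quoted from the text: S//NF
    re.compile(r"\([A-Z]{1,3}/[A-Z]{2,4}\)"),        # assumed form for legacy (S/NF), (S/REL)
]

# Header lines and reason codes: removed entirely (both patterns are assumptions).
HEADER_LINE = re.compile(
    r"^\s*(?:CLASSIFIED BY|REASON|DECLASSIFY ON):.*$", re.MULTILINE
)
REASON_CODE = re.compile(r"(?:Reason\s+)?1\.4\([a-h]\)")


def strip_markers(text, sentinel="[MASK]"):
    """Remove header lines and reason codes, then mask inline markers."""
    text = HEADER_LINE.sub("", text)
    text = REASON_CODE.sub("", text)
    for pattern in INLINE_PATTERNS:
        text = pattern.sub(sentinel, text)
    return text


sample = (
    "CLASSIFIED BY: Ambassador X\n"
    "(C//NF) The minister raised S//NF concerns about (S) the talks."
)
cleaned = strip_markers(sample)
```

Headers are stripped before inline masking so that a directive line never leaves a stray sentinel behind; ordinary parenthesised prose such as "(led by Smith)" is left untouched by these patterns.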
The motivation for this pipeline was empirically established through a preliminary LIME analysis conducted on the model trained on unprocessed WikiLeaks cables. Attribution scores revealed that residual tokens from inline classification markers exerted disproportionate predictive influence; in one representative instance, a single-character token derived from a parenthesized marker received a LIME attribution of 0.709, exceeding all domain-semantic features. Following pipeline application and model re-training, subsequent LIME analysis confirmed that attributions shifted entirely to domain-semantic vocabulary (see Section 4.6.3). Regarding false positive risk: the 14 regex patterns exclusively target syntactic constructs—parenthesized uppercase codes, double-slash compound markers, and bureaucratic header prefixes—that do not appear in ordinary diplomatic prose, ensuring that substantive textual content is not affected by the removal procedure.

3.2.2. Label Normalization

Raw label spaces in both datasets exhibit substantial heterogeneity arising from varying caveat suffixes, abbreviation variants, and administrative sub-classifications. To address this, a hierarchical case-insensitive lookup table was implemented to map all raw labels to one of four canonical security tiers:
  • TOP SECRET: Strings containing “TOP SECRET” or the abbreviation “TS”.
  • SECRET: Strings beginning with “SEC” or the abbreviation “S”.
  • CONFIDENTIAL: Strings beginning with “CON” or the abbreviation “C”.
  • UNCLASSIFIED: Strings containing “UNC”, “UNCLASS”, or the abbreviation “U”.
Documents labeled “No classification marking” were mapped to UNCLASSIFIED in accordance with Executive Order 13526 [60], which defines exactly three formal security tiers (CONFIDENTIAL, SECRET, and TOP SECRET) and provides that information not satisfying the substantive criteria of the Order shall not be classified. The absence of an explicit marking is therefore operationally equivalent to an unclassified designation.
The label “Limited Official Use” (LOU) is not defined as a formal classification tier under EO 13526 [60]. Mapping LOU to any canonical level therefore requires a policy judgment beyond the formal classification scheme; for methodological conservatism, LOU-labeled documents were excluded from the primary classification task. The sensitivity of this decision is assessed in Section 4.5.2.
For the WikiLeaks dataset, TOP SECRET instances were additionally excluded. Since this label does not appear as a standard class in the WikiLeaks cable corpus, its inclusion would introduce a degenerate singleton class that could compromise model stability. Consequently, the WikiLeaks corpus was reduced to a three-class schema (UNCLASSIFIED, CONFIDENTIAL, SECRET), whereas the FRUS corpus maintains the full four-class hierarchy.
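The lookup described in this subsection can be illustrated with a short function. It is a sketch under stated assumptions, not the authors' lookup table: the function name `normalize_label`, the `corpus` argument, and the handling of slash-delimited caveat suffixes (for example `S//NF`) are hypothetical, and LOU is treated as excluded, following the paragraph above.

```python
def normalize_label(raw, corpus="FRUS"):
    """Map a raw label to a canonical tier, or None if the document is excluded."""
    s = " ".join(raw.upper().split())          # case-insensitive, whitespace-normalised

    if s in ("LIMITED OFFICIAL USE", "LOU"):
        return None                            # excluded from the primary task
    if s == "NO CLASSIFICATION MARKING":
        return "UNCLASSIFIED"

    # Hierarchical order: the most restrictive tier is tested first.
    if "TOP SECRET" in s or s == "TS" or s.startswith("TS/"):
        tier = "TOP SECRET"
    elif s.startswith("SEC") or s == "S" or s.startswith("S/"):
        tier = "SECRET"
    elif s.startswith("CON") or s == "C" or s.startswith("C/"):
        tier = "CONFIDENTIAL"
    elif "UNC" in s or s == "U" or s.startswith("U/"):
        tier = "UNCLASSIFIED"
    else:
        return None                            # unknown label: left unmapped

    # WikiLeaks uses a three-class schema, so TOP SECRET instances are dropped.
    if corpus == "WikiLeaks" and tier == "TOP SECRET":
        return None
    return tier
```

Testing TOP SECRET before SECRET keeps the mapping unambiguous even if a later rule is loosened from a prefix match to a substring match.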

3.2.3. Text Truncation

Documents were truncated to a maximum of 500 words prior to tokenisation. This threshold was selected to retain the header and early body sections of diplomatic documents, which carry the highest density of classification-relevant semantic content, while maintaining computational tractability given the quadratic complexity of self-attention with respect to sequence length. The sensitivity of this threshold to classification performance is evaluated empirically in the ablation study presented in Section 4.5.4.

3.2.4. Dataset Partitioning

A stratified document-level split of 80% training, 10% validation, and 10% test was applied using a fixed random seed (42). Stratification ensured that class distributions were preserved across all partitions. Critically, the test set was held fixed across all experimental seeds, ensuring that ensemble combinations, per-class threshold optimizations, and McNemar’s significance tests were evaluated on an identical, unseen partition.
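A stratified 80/10/10 partition of this kind can be sketched without external dependencies beyond NumPy. The tooling actually used for the split is not stated in the text, so the function `stratified_split` below is a hypothetical illustration of the idea: indices are shuffled within each class and then cut at the 80% and 90% marks.

```python
import numpy as np


def stratified_split(labels, seed=42, frac=(0.8, 0.1, 0.1)):
    """Return (train, val, test) index arrays with per-class proportions preserved."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    train, val, test = [], [], []
    for c in np.unique(labels):
        idx = rng.permutation(np.flatnonzero(labels == c))  # shuffle within the class
        n_train = int(round(frac[0] * len(idx)))
        n_val = int(round(frac[1] * len(idx)))
        train.extend(idx[:n_train])
        val.extend(idx[n_train:n_train + n_val])
        test.extend(idx[n_train + n_val:])                  # remainder goes to test
    return np.array(train), np.array(val), np.array(test)


labels = [0] * 100 + [1] * 50 + [2] * 10
train_idx, val_idx, test_idx = stratified_split(labels)
```

Because the split depends only on the fixed seed, the same test indices are obtained regardless of the seed later used for model training.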

3.3. Proposed Architecture: SACHN-DeBERTa-v3

The proposed architecture, SACHN-DeBERTa-v3 (Security-Aware Classification with Hybrid Networks), integrates a pre-trained DeBERTa-v3 transformer backbone with three specialized components: a Security-Aware Gate for dynamic feature selection, a Prototype Classification Layer for robust class representation, and a Supervised Contrastive Projection Head to enhance inter-class separation. The complete architectural workflow and the interaction between these modules are illustrated in Figure 3.

3.3.1. Backbone Encoder: DeBERTa-v3-Large

The backbone encoder is microsoft/deberta-v3-large, a 24-layer transformer with hidden dimensionality d = 1024 and 304 million parameters. DeBERTa-v3 extends standard BERT-family pretraining with two key innovations that are particularly advantageous for domain-specific document classification:
  • Disentangled Attention: Unlike standard transformers, DeBERTa encodes content and relative positions through separate embedding matrices [61]. For each token pair (i, j), the attention score $A_{ij}$ is computed as the sum of four cross-terms (content-to-content, content-to-position, position-to-content, and position-to-position) as shown in Equation (1):
$$A_{ij} = H_i W_q (H_j W_k)^{\top} + H_i W_q (P_{i|j} W_{k,r})^{\top} + P_{i|j} W_{q,r} (H_j W_k)^{\top} + P_{i|j} W_{q,r} (P_{i|j} W_{k,r})^{\top} \tag{1}$$
where $H_i \in \mathbb{R}^{d}$ denotes the content embedding of token i, $P_{i|j} \in \mathbb{R}^{d}$ represents the relative positional embedding between indices i and j, and $W_q$, $W_k$, $W_{q,r}$, $W_{k,r}$ are learnable projection matrices. This disentanglement enables the model to focus on security-centric lexicon (such as “classified,” “sensitive,” or “nuclear”) independently of positional context. Such a mechanism is particularly vital for analyzing diplomatic records, where these terms may manifest across various structural locations.
  • ELECTRA-style replaced token detection pretraining: Replaced token detection, as introduced by Clark et al. [62], offers a more intensive supervisory signal than traditional masked language modeling. Rather than predicting masked tokens at only 15% of positions, the model is tasked with identifying replaced tokens across all positions. This methodology yields more robust contextual representations, which is especially advantageous for capturing infrequent, domain-specific terminology.
The encoder backbone generates a sequence of contextualized token representations $H = \{h_0, h_1, \ldots, h_{L-1}\} \in \mathbb{R}^{B \times L \times d}$, where B represents the batch size and L denotes the sequence length. Within this structure, the [CLS] token representation $h_0 \in \mathbb{R}^{d}$ functions as the aggregate document embedding.

3.3.2. Security-Aware Gate (SAG)

Before extracting the CLS token, a Security-Aware Gate module is applied to the full sequence output H. The gate learns to amplify the dimensions of the hidden space most informative for security classification, providing a soft feature-level attention mechanism operating across the hidden dimension rather than the sequence dimension as defined in Equation (2):
$$\hat{H} = H \odot \sigma(w_g) \cdot \bigl(1 + \sigma(\beta)\bigr) \tag{2}$$
where $w_g \in \mathbb{R}^{d}$ is a learnable gate weight vector, $\beta \in \mathbb{R}$ is a learnable scalar boost parameter, $\sigma(\cdot)$ denotes the sigmoid function, and $\odot$ denotes element-wise multiplication with broadcasting across the batch and sequence dimensions. The term $\sigma(w_g)$ selectively suppresses hidden dimensions with low discriminative value for security classification, while the boost term $(1 + \sigma(\beta))$ provides a global scaling factor that amplifies the gated representation. The CLS token $\hat{h}_0 \in \mathbb{R}^{d}$ is subsequently extracted from the gated sequence.
The sigmoid activation was selected for three reasons. First, the DeBERTa-v3-Large backbone already performs 24 layers of disentangled sequence-level self-attention; applying an additional sequence-level attention mechanism atop the backbone output would introduce redundancy without addressing a distinct representational gap. The SAG instead operates along the hidden dimension axis, a complementary axis of selection not performed by transformer attention layers, learning to suppress feature dimensions of low discriminative value while amplifying those most relevant to security classification. Second, sigmoid constrains gate values to the open interval (0, 1), providing bounded keep/suppress semantics per dimension and promoting training stability under the LLRD schedule applied to backbone parameters. Third, the module introduces only 1025 parameters (d + 1 = 1024 + 1), maintaining parameter efficiency. We acknowledge that more expressive alternatives (such as multi-head attention operating over the hidden dimension) may capture higher-order feature interactions and represent a direction for future architectural refinement.
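The forward pass of Equation (2) is compact enough to sketch directly. The snippet below uses NumPy rather than the PyTorch module presumably used in the study, and the function names `sigmoid` and `security_aware_gate` are hypothetical; it is intended only to make the broadcasting and the parameter count concrete.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def security_aware_gate(H, w_g, beta):
    """Apply the gate of Equation (2).

    H: (B, L, d) backbone output; w_g: (d,) gate weights; beta: scalar boost.
    Returns the gated sequence and its first-position ([CLS]) slice.
    """
    gate = sigmoid(w_g)              # (d,), one keep/suppress value per hidden dimension
    boost = 1.0 + sigmoid(beta)      # scalar between 1 and 2
    H_hat = H * gate * boost         # broadcasts over batch and sequence axes
    return H_hat, H_hat[:, 0, :]


d = 1024
n_params = d + 1                     # w_g plus beta
```

With zero-initialised `w_g` and `beta`, every dimension is scaled by 0.5 × 1.5 = 0.75, so the gate starts as a uniform rescaling and only becomes selective as `w_g` is trained.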

3.3.3. Feature Projection

The gated CLS token is passed through a two-layer feed-forward projection network as shown in Equation (3):
$$f = \mathrm{ReLU}\bigl(W_2\,\mathrm{Dropout}\bigl(\mathrm{ReLU}(W_1 \hat{h}_0 + b_1)\bigr)\bigr) \tag{3}$$
where $W_1 \in \mathbb{R}^{512 \times 1024}$, $W_2 \in \mathbb{R}^{256 \times 512}$, with dropout rate 0.3 applied between the two layers. The resulting feature vector $f \in \mathbb{R}^{256}$ serves as the shared input to both the classification head and the contrastive projection head.

3.3.4. Prototype Classification Layer (PCL)

Classification is performed via a Prototype Classification Layer (PCL) implementing metric learning-based inference as defined in Equation (4):
$$\ell_c = -\frac{1}{256}\,\lVert W_q f - p_c \rVert^{2} \tag{4}$$
A set of C learnable class prototypes $p_c \in \mathbb{R}^{256}$, initialised by Xavier uniform initialisation, is maintained alongside a learnable query projection $W_q \in \mathbb{R}^{256 \times 256}$. The predicted class is determined by $\hat{y} = \arg\max_c \ell_c$. This formulation, derived from Prototypical Networks [63], encourages intra-class compactness and inter-class separation in the feature space without requiring a fixed linear decision boundary. It is particularly advantageous for security classification tasks where within-class semantic variance is high; for example, SECRET documents may range from brief operational directives to lengthy analytical assessments. Xavier uniform initialization [64] was preferred over data-driven alternatives such as class-conditional centroid initialization for two reasons. First, centroid embeddings derived prior to fine-tuning reflect pre-training feature distributions rather than the security-domain representations that emerge during backbone adaptation, introducing a distributional mismatch in early training epochs. Second, the supervised contrastive loss warmup schedule ($\lambda_{con} = 0.05$ for epochs 0–2) serves a functionally analogous role to data-driven initialization by allowing the feature space to stabilize before the full contrastive objective drives inter-prototype separation. Data-driven prototype initialization from fine-tuned class-conditional centroids remains a direction for future investigation.
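Distance-based inference of this kind can be illustrated in a few lines of NumPy. The sketch assumes that Equation (4) scores each class by the negative squared Euclidean distance between the projected feature and the class prototype, and it reads the scale factor as 1/256; both the function name `prototype_logits` and that reading of the scale are assumptions. Any positive scale leaves the argmax unchanged.

```python
import numpy as np


def prototype_logits(f, W_q, prototypes, scale=1.0 / 256):
    """Negative scaled squared distance to each prototype.

    f: (B, d) features; W_q: (d, d) query projection; prototypes: (C, d).
    Returns logits of shape (B, C); the nearest prototype has the largest logit.
    """
    q = f @ W_q.T                                   # row-wise W_q f
    diff = q[:, None, :] - prototypes[None, :, :]   # (B, C, d)
    return -scale * np.sum(diff ** 2, axis=-1)


def predict(f, W_q, prototypes):
    return prototype_logits(f, W_q, prototypes).argmax(axis=1)
```

A feature that coincides with a prototype after projection receives a logit of exactly zero, the maximum attainable value, and all other classes receive negative logits proportional to their squared distance.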

3.3.5. Supervised Contrastive Projection Head

A linear projection $W_z \in \mathbb{R}^{128 \times 256}$ (without bias) maps the feature vector to a contrastive embedding space as defined in Equation (5):
$$z = W_z f, \qquad z \in \mathbb{R}^{128} \tag{5}$$
These embeddings are L2-normalised prior to loss computation. The Supervised Contrastive Loss [65] is applied over mini-batches as shown in Equation (6):
$$\mathcal{L}_{con} = \sum_{i} \frac{-1}{|P(i)|} \sum_{p \in P(i)} \log \frac{\exp(\check{z}_i \cdot \check{z}_p / \tau)}{\sum_{a \in A(i)} \exp(\check{z}_i \cdot \check{z}_a / \tau)} \tag{6}$$
where $\check{z} = z / \lVert z \rVert_2$ denotes the L2-normalised embedding, $P(i)$ is the set of positive samples sharing the same label as anchor i, $A(i)$ is the set of all other samples in the batch, and $\tau = 0.07$ is the temperature hyperparameter. This objective explicitly pulls embeddings of same-class documents together and pushes different-class embeddings apart. For numerical stability, the similarity matrix is shifted by subtracting the row-wise maximum as described in Equation (7):
$$\tilde{A}_{ij} = \frac{\check{z}_i \cdot \check{z}_j}{\tau} - \max_{k} \left( \frac{\check{z}_i \cdot \check{z}_k}{\tau} \right) \tag{7}$$
The full architecture is summarised in Table 6.
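To make the contrastive objective and its stability shift concrete, the following NumPy sketch computes a supervised contrastive loss over one mini-batch. It is an illustration under assumptions rather than the training code: the function name `supcon_loss` is hypothetical, and the per-anchor terms are averaged over anchors that have at least one positive, a reduction the text does not specify.

```python
import numpy as np


def supcon_loss(z, labels, tau=0.07):
    """Supervised contrastive loss for embeddings z of shape (n, dim)."""
    labels = np.asarray(labels)
    z = z / np.linalg.norm(z, axis=1, keepdims=True)      # L2-normalise
    sim = z @ z.T / tau
    sim = sim - sim.max(axis=1, keepdims=True)            # row-wise shift, as in Eq. (7)

    n = len(labels)
    not_self = ~np.eye(n, dtype=bool)                     # A(i): every sample except i
    pos = (labels[:, None] == labels[None, :]) & not_self # P(i): same label, not i

    denom = (np.exp(sim) * not_self).sum(axis=1, keepdims=True)
    log_prob = sim - np.log(denom)                        # shift cancels in this ratio

    n_pos = pos.sum(axis=1)
    valid = n_pos > 0                                     # anchors with a positive
    per_anchor = -(log_prob * pos).sum(axis=1)[valid] / n_pos[valid]
    return float(per_anchor.mean())


z = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
loss_clustered = supcon_loss(z, [0, 0, 1, 1])   # labels agree with geometry
loss_mixed = supcon_loss(z, [0, 1, 0, 1])       # labels cut across clusters
```

The loss is close to zero when same-class embeddings are already neighbours and grows sharply when positives are far apart, which is the behaviour the low temperature is meant to accentuate.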

3.4. Training Strategy

3.4.1. Loss Function

The total training objective combines classification and contrastive losses with a warmup schedule for the contrastive weight as shown in Equation (8):
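$$\mathcal{L}_{total} = \lambda_{cls}\,\mathcal{L}_{cls} + \lambda_{con}\,\mathcal{L}_{con} \tag{8}$$
where $\lambda_{cls} = 1.0$ throughout training, and the contrastive weight follows the schedule defined in Equation (9):
$$\lambda_{con}(t) = \begin{cases} 0.05, & \text{if } t < 3 \\ 0.20, & \text{if } t \geq 3 \end{cases} \tag{9}$$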
with t denoting the current training epoch. This warmup schedule prevents the contrastive loss from overpowering the classification signal during the early training epochs, when the feature space is insufficiently structured for meaningful positive-negative pair discrimination.
  • Classification Loss for WikiLeaks (Focal Loss): Given the severe under-representation of the SECRET class in the WikiLeaks dataset (658/9005 samples, 7.3%), Focal Loss [66] is employed as $\mathcal{L}_{cls}$, as shown in Equation (10):
$$\mathcal{L}_{focal} = -\frac{1}{N} \sum_{i=1}^{N} (1 - \hat{p}_{y_i})^{\gamma} \log \hat{p}_{y_i} \tag{10}$$
where $\hat{p}_{y_i}$ is the predicted probability for the true class, and $\gamma = 2.0$ is the focusing parameter. The modulating factor $(1 - \hat{p}_{y_i})^{\gamma}$ down-weights the loss contribution of well-classified examples, directing the optimisation signal towards the hard-to-classify minority class. For a minority SECRET sample with $\hat{p}_{y_i} \approx 0.1$, the focal weight is $(1 - 0.1)^2 = 0.81$, compared to $(1 - 0.9)^2 = 0.01$ for a confidently classified majority sample, yielding an 81-fold relative upweighting. The sensitivity of this choice to classification performance is evaluated empirically in Section 4.5.4.
  • Classification Loss for FRUS (Weighted Cross-Entropy): For the FRUS dataset, which is comparatively balanced across four classes, frequency-weighted cross-entropy is employed as shown in Equation (11):
$$\mathcal{L}_{CE} = -\frac{1}{N} \sum_{i=1}^{N} w_{y_i} \log \hat{p}_{y_i}, \qquad w_c = \frac{N}{C \cdot N_c} \tag{11}$$
where $N_c$ is the number of training samples in class c and C is the number of classes. The class weights $w_c$ are computed using the balanced weighting scheme from scikit-learn.
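The two classification losses can be checked numerically with a short NumPy sketch. The function names are hypothetical, the balanced weights follow the scikit-learn convention N / (C · N_c), and the weighted cross-entropy is normalised by N as an assumption of this illustration; deep-learning frameworks may normalise by the sum of weights instead.

```python
import numpy as np


def focal_loss(probs, y, gamma=2.0):
    """Mean focal loss; probs is (N, C) softmax output, y holds integer labels."""
    p_true = probs[np.arange(len(y)), y]
    return float(np.mean(-((1.0 - p_true) ** gamma) * np.log(p_true)))


def balanced_class_weights(y, n_classes):
    """Balanced weights in the scikit-learn sense: N / (C * N_c)."""
    counts = np.bincount(y, minlength=n_classes)
    return len(y) / (n_classes * counts)


def weighted_cross_entropy(probs, y, weights):
    """Mean of class-weighted negative log-likelihoods."""
    p_true = probs[np.arange(len(y)), y]
    return float(np.mean(-weights[y] * np.log(p_true)))


# Worked example from the text: hard sample (p = 0.1) versus easy sample (p = 0.9).
hard_weight = (1 - 0.1) ** 2
easy_weight = (1 - 0.9) ** 2
ratio = hard_weight / easy_weight          # approximately 81
```

Setting `gamma=0` removes the modulating factor, so the focal loss reduces to ordinary cross-entropy; this provides a convenient sanity check when the focusing parameter is tuned.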

3.4.2. Layer-Wise Learning Rate Decay (LLRD)

Fine-tuning all layers of a large pre-trained transformer with a uniform learning rate risks catastrophic forgetting of lower-level linguistic representations. To mitigate this, Layer-wise Learning Rate Decay (LLRD) [67] is applied. The learning rate assigned to encoder layer i is defined in Equation (12):
$$lr_i = lr_{base} \cdot \alpha^{(N-1-i)}, \qquad \alpha = 0.9, \quad N = 24, \quad lr_{base} = 1 \times 10^{-5} \tag{12}$$
The embedding layer receives the minimum learning rate $lr_{base} \cdot \alpha^{24} \approx 8.0 \times 10^{-7}$, while the topmost encoder layer (layer 23) receives $lr_{base} = 1 \times 10^{-5}$. Custom head parameters (SAG, PCL, projection heads) receive a higher learning rate of $lr_{base} \times 30 = 3 \times 10^{-4}$. Weight decay of 0.01 is applied to all non-bias, non-LayerNorm parameters. The resulting learning rate schedule across architectural components is illustrated in Figure 4. The AdamW optimiser [68] is used with $\beta_1 = 0.9$, $\beta_2 = 0.999$, and $\varepsilon = 10^{-8}$.
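The per-layer rates implied by Equation (12) can be tabulated with a few lines of Python. The function name `llrd_learning_rates` and the dictionary keys are hypothetical; the numerical settings are those stated above.

```python
def llrd_learning_rates(n_layers=24, lr_base=1e-5, alpha=0.9, head_mult=30):
    """Learning rate per parameter group under layer-wise decay."""
    rates = {"embeddings": lr_base * alpha ** n_layers}       # one step below layer 0
    for i in range(n_layers):
        rates[f"layer_{i}"] = lr_base * alpha ** (n_layers - 1 - i)
    rates["heads"] = lr_base * head_mult                      # SAG, PCL, projections
    return rates


rates = llrd_learning_rates()
```

Evaluating the formula as written gives 0.9^24 ≈ 0.0798, so the embedding group trains at roughly 8.0 × 10⁻⁷, about 375 times more slowly than the custom heads at 3 × 10⁻⁴.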

3.4.3. Mixed Precision Training

Automatic Mixed Precision (AMP) training is employed via PyTorch’s GradScaler, performing forward and backward passes in float16 while maintaining a float32 master copy of parameters. This reduces GPU memory consumption by approximately 40% and accelerates matrix multiplications on H200 Tensor Cores without accuracy degradation, as the loss scaling in GradScaler compensates for float16 underflow.

3.4.4. Gradient Clipping and Early Stopping

Gradient norm clipping at $\lVert \nabla \rVert_2 \leq 1.0$ is applied before each optimiser step to prevent gradient explosion in the early fine-tuning epochs. Early stopping is triggered after 5 consecutive epochs without improvement in validation accuracy, and the best model checkpoint (measured by validation accuracy) is saved to disk for subsequent test evaluation and ensemble construction.

3.5. Ensemble Construction and Decision Threshold Optimisation

3.5.1. Multi-Seed Training and Dynamic Weighted Ensemble

Three independent training runs with random seeds S = {42, 123, 456} were conducted per dataset, each using an identical fixed test partition. The test set softmax probability outputs from all seeds are combined via a dynamic weighted ensemble as shown in Equation (13):
$$\hat{P}_{ens} = \sum_{s \in S} w_s P_s, \qquad w_s = \frac{F1_{s,val}}{\sum_{s' \in S} F1_{s',val}} \tag{13}$$
where $P_s \in \mathbb{R}^{N_{test} \times C}$ is the matrix of softmax probabilities from seed s, and $F1_{s,val}$ is the best macro-F1 achieved on the validation set during training of seed s. This weighting scheme assigns greater ensemble influence to seeds that produced more discriminative classifiers, in contrast to the uniform averaging used in standard ensemble methods.
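The weighting in Equation (13) amounts to a convex combination of the per-seed probability matrices. A minimal NumPy sketch follows; the function name `weighted_ensemble` and the toy inputs are hypothetical.

```python
import numpy as np


def weighted_ensemble(prob_list, val_f1):
    """Combine per-seed softmax outputs with weights proportional to validation F1.

    prob_list: list of (N, C) arrays, one per seed; val_f1: matching macro-F1 scores.
    Returns the (N, C) ensemble probabilities and the normalised weights.
    """
    w = np.asarray(val_f1, dtype=float)
    w = w / w.sum()                              # weights sum to one
    stacked = np.stack(prob_list)                # (S, N, C)
    return np.tensordot(w, stacked, axes=1), w   # contract over the seed axis


P1 = np.array([[0.8, 0.2], [0.4, 0.6]])
P2 = np.array([[0.4, 0.6], [0.0, 1.0]])
P_ens, w = weighted_ensemble([P1, P2], val_f1=[0.9, 0.3])
```

Because the weights are non-negative and sum to one, each row of the ensemble output remains a valid probability distribution.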

3.5.2. Per-Class Decision Threshold Optimisation

Standard argmax classification applies no class-specific decision thresholds, which is suboptimal under class imbalance. Per-class threshold optimisation is therefore applied to the ensemble probability outputs on the validation set. For each class c, the optimal threshold $\tau_c^{*}$ is selected to maximise the binary F1 score as defined in Equation (14):
$$\tau_c^{*} = \arg\max_{\tau} \; F1_{binary}(c, \tau; D_{val}), \qquad \tau \in \{0.10, 0.15, \ldots, 0.90\} \tag{14}$$
The final predicted label for a test sample is determined by the decision rule in Equation (15):
$$\hat{y} = \begin{cases} \arg\max_{c \in C^{*}} \hat{P}_{ens,c}, & \text{if } C^{*} \neq \emptyset \\ \arg\max_{c} \hat{P}_{ens,c}, & \text{otherwise} \end{cases} \tag{15}$$
where $C^{*} = \{c : \hat{P}_{ens,c} \geq \tau_c^{*}\}$ is the set of classes whose ensemble probability meets or exceeds their optimised threshold. When multiple classes exceed their threshold, the one with the highest probability is selected; when none do, argmax is used as a fallback.
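The threshold search and the decision rule can be sketched together in NumPy. This is an illustrative reading of Equations (14) and (15), not the authors' code: the names `GRID`, `binary_f1`, `fit_thresholds`, and `predict` are hypothetical, and ties in the grid search are resolved in favour of the smallest threshold, which the text does not specify.

```python
import numpy as np

GRID = np.round(np.arange(0.10, 0.90 + 1e-9, 0.05), 2)   # 0.10, 0.15, ..., 0.90


def binary_f1(is_class, is_pred):
    """F1 for one class given boolean truth and prediction vectors."""
    tp = np.sum(is_class & is_pred)
    fp = np.sum(~is_class & is_pred)
    fn = np.sum(is_class & ~is_pred)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def fit_thresholds(P_val, y_val):
    """Pick, per class, the grid threshold that maximises one-vs-rest F1."""
    taus = np.empty(P_val.shape[1])
    for c in range(P_val.shape[1]):
        scores = [binary_f1(y_val == c, P_val[:, c] >= t) for t in GRID]
        taus[c] = GRID[int(np.argmax(scores))]
    return taus


def predict(P, taus):
    """Highest-probability class among those passing; plain argmax otherwise."""
    passing = P >= taus
    masked = np.where(passing, P, -np.inf)
    return np.where(passing.any(axis=1), masked.argmax(axis=1), P.argmax(axis=1))
```

The rule only changes a prediction when the argmax class fails its own threshold while another class passes; in all other cases it coincides with plain argmax.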

3.6. Explainability Analysis Methodology

To validate that the model’s predictions are grounded in substantive semantic content rather than residual classification markers surviving the preprocessing pipeline, post-hoc explainability analysis was conducted using two complementary methods.

3.6.1. SHAP Analysis

SHapley Additive exPlanations with a Text masker was applied to 150 documents for WikiLeaks (50 per class) and 148 documents for FRUS (37 per class). Token-level SHAP values were aggregated to word level: subword tokens sharing a SentencePiece word prefix were summed into a single word-level attribution. The following visualisations were produced for each dataset:
  • global feature importance bar chart (top 20 words by mean |SHAP|);
  • per-class feature importance;
  • SHAP beeswarm plot;
  • per-class heatmaps;
  • token-level colour-coded text highlights for five representative samples; and
  • a correct-vs-incorrect prediction comparison.

3.6.2. LIME Analysis

Local Interpretable Model-agnostic Explanations [69] was applied to 45/60 samples per dataset (stratified by class and prediction correctness), with 20 explanation features and 300 perturbation samples per instance. Word-level attribution scores from correct and incorrect predictions were aggregated and compared, providing a local consistency check complementary to the global SHAP analysis.

3.7. Experimental Setup and Evaluation Metrics

3.7.1. Computational Infrastructure

All experiments were conducted on the TRUBA high-performance computing infrastructure. The numerical calculations reported in this study were performed at TUBITAK ULAKBIM, High Performance and Grid Computing Center (TRUBA resources). Each training job was allocated exclusively to a single compute node equipped with one NVIDIA H200 GPU (139.8 GB VRAM), 80 GB of system memory, and 16 CPU cores. Jobs were submitted and managed via the SLURM workload manager on the kolyoz-cuda partition. This configuration ensured reproducible, resource-isolated execution across all experimental seeds and model variants.

3.7.2. Hyperparameter Configuration

All hyperparameters were determined through preliminary validation experiments and held constant across datasets and seeds unless otherwise stated. The complete configuration is presented in Table 7.
Gradient accumulation with four steps raises the effective batch size from 8 to 32, mitigating the instability that small micro-batches introduce in supervised contrastive learning. With an effective batch size of 32, each anchor sample is contrasted against 31 negatives per update step, providing adequate negative diversity for stable InfoNCE gradient estimation. Additional stability mechanisms include:
(i) a contrastive loss warmup schedule suppressing the InfoNCE objective for the first three epochs (λ_con = 0.05), allowing the classification loss to establish a stable feature space before full contrastive optimization begins;
(ii) L2 normalization of contrastive embeddings prior to similarity computation;
(iii) a low temperature parameter (τ = 0.07), which sharpens the similarity distribution and compensates for reduced batch diversity relative to large-batch contrastive settings; and
(iv) gradient clipping at norm 1.0, preventing destabilizing gradient spikes during contrastive backpropagation.
The 3-seed ensemble further attenuates any residual batch-sampling variance across training runs.
To ensure reproducibility, all random number generators are seeded prior to each training run via a unified seeding function that initialises Python’s built-in random module, NumPy, and PyTorch (both CPU and CUDA states) with the run-specific seed value; the PYTHONHASHSEED environment variable is globally fixed to 42. The training DataLoader operates with num_workers = 0, inheriting PyTorch’s seeded global random state, ensuring deterministic batch ordering conditioned on the seed.

3.7.3. Evaluation Metrics

Model performance is reported across five complementary metrics. Macro-averaged variants are adopted as the primary evaluation criterion, as they assign equal weight to each class regardless of its frequency in the corpus—a property of particular importance given the pronounced class imbalance present in both datasets.
The metrics are formally defined as follows, where N denotes the total number of test samples, C denotes the number of classes, and $TP_c$, $FP_c$, and $FN_c$ denote the true positives, false positives, and false negatives for class c, respectively; $P_c$ and $R_c$ denote the per-class precision and recall.
  • Accuracy measures the proportion of correctly classified samples across all classes as defined in Equation (16):
$$Acc = \frac{1}{N} \sum_{i=1}^{N} \mathbb{1}[\hat{y}_i = y_i] \tag{16}$$
  • Macro Precision computes the unweighted mean of per-class precision scores as shown in Equation (17):
$$P_{macro} = \frac{1}{C} \sum_{c=1}^{C} \frac{TP_c}{TP_c + FP_c} \tag{17}$$
  • Macro Recall computes the unweighted mean of per-class recall scores as shown in Equation (18):
$$R_{macro} = \frac{1}{C} \sum_{c=1}^{C} \frac{TP_c}{TP_c + FN_c} \tag{18}$$
  • Macro F1 is the unweighted mean of the per-class F1 scores, each being the harmonic mean of per-class precision and recall, as shown in Equation (19):
$$F1_{macro} = \frac{1}{C} \sum_{c=1}^{C} \frac{2 \cdot P_c \cdot R_c}{P_c + R_c} \tag{19}$$
  • Weighted F1 computes the class-frequency-weighted mean of the per-class F1 scores as shown in Equation (20):
$$F1_{weighted} = \sum_{c=1}^{C} \frac{N_c}{N} \cdot \frac{2 \cdot P_c \cdot R_c}{P_c + R_c} \tag{20}$$
where $N_c$ is the number of test samples belonging to class c. Zero-division cases, which arise when a class produces no positive predictions, are handled by substituting a score of zero for the corresponding term, consistent with the default behaviour of the scikit-learn classification report.
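The five metrics, including the zero-division convention, can be reproduced from label vectors with a short NumPy routine. The function name `classification_metrics` and the toy labels are hypothetical and serve only to make the definitions concrete.

```python
import numpy as np


def classification_metrics(y_true, y_pred, n_classes):
    """Accuracy, macro precision/recall/F1, and weighted F1 (zero on division by zero)."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    P, R, F = np.zeros(n_classes), np.zeros(n_classes), np.zeros(n_classes)
    for c in range(n_classes):
        tp = np.sum((y_pred == c) & (y_true == c))
        fp = np.sum((y_pred == c) & (y_true != c))
        fn = np.sum((y_pred != c) & (y_true == c))
        P[c] = tp / (tp + fp) if tp + fp else 0.0
        R[c] = tp / (tp + fn) if tp + fn else 0.0
        F[c] = 2 * P[c] * R[c] / (P[c] + R[c]) if P[c] + R[c] else 0.0
    support = np.bincount(y_true, minlength=n_classes)   # N_c per class
    return {
        "accuracy": float(np.mean(y_true == y_pred)),
        "macro_precision": float(P.mean()),
        "macro_recall": float(R.mean()),
        "macro_f1": float(F.mean()),
        "weighted_f1": float(np.sum(support / len(y_true) * F)),
    }


# Class 2 is never predicted, so its precision falls back to zero.
m = classification_metrics([0, 0, 1, 1, 2, 2], [0, 0, 1, 0, 1, 1], n_classes=3)
```

In this toy case the never-predicted class pulls macro F1 down to 0.4 even though half of the samples are classified correctly, which illustrates why macro-averaged scores are the stricter criterion under class imbalance.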

3.7.4. Statistical Significance Testing

To assess whether the performance improvement of the proposed SACHN-DeBERTa-v3 ensemble over the ablation baseline is statistically significant rather than attributable to sampling variability, McNemar’s test is applied. The test evaluates the asymmetry in the off-diagonal cells of a 2 × 2 contingency table constructed from the per-sample binary correctness vectors of two models evaluated on the same fixed test partition. Because both the proposed model and the baseline are evaluated on an identical held-out test set (fixed at seed = 42), the paired nature of the observations satisfies the test’s assumptions.
The null hypothesis H0 states that the two models commit errors on the same samples with equal probability. Statistical significance is established at a Type I error rate of α = 0.05. A p-value below this threshold leads to the rejection of H0, providing evidence that the observed performance difference reflects a genuine improvement rather than random variation.
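For illustration, McNemar's test on two per-sample correctness vectors can be computed with the standard library alone. The text does not state whether the exact binomial form or the chi-square approximation was used; the sketch below assumes the exact two-sided binomial form, and the function name `mcnemar_exact` is hypothetical.

```python
from math import comb


def mcnemar_exact(correct_a, correct_b):
    """Exact two-sided McNemar test on paired correctness vectors.

    Returns (b, c, p_value), where b counts samples only model A gets right
    and c counts samples only model B gets right.
    """
    b = sum(1 for a, o in zip(correct_a, correct_b) if a and not o)
    c = sum(1 for a, o in zip(correct_a, correct_b) if o and not a)
    n = b + c
    if n == 0:
        return b, c, 1.0                      # the models never disagree
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return b, c, min(1.0, 2 * tail)           # two-sided under Binomial(n, 0.5)


# Toy example: 9 samples favour model A, 1 favours model B, 20 are concordant.
model_a = [True] * 9 + [False] * 1 + [True] * 20
model_b = [False] * 9 + [True] * 1 + [True] * 20
b, c, p_value = mcnemar_exact(model_a, model_b)
```

Only the discordant pairs enter the statistic, so samples on which both models agree, whether correct or incorrect, have no influence on the p-value.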

3.8. Baseline and Comparative Methods

To contextualise the performance of the proposed SACHN-DeBERTa-v3 architecture, four comparative models from the literature were implemented and evaluated on the same datasets under identical preprocessing and evaluation conditions.

3.8.1. Feature-Engineering Ensemble Classifier Methodology

The first baseline follows the content-based document classification framework proposed by Kesenek et al. [21], originally developed for data leakage prevention (DLP) in adversarial settings. The original study evaluates classification robustness against 18 types of content-based evasion attacks—including structural modifications (character insertion, deletion, whitespace manipulation), substitution attacks (synonym replacement, summarisation), and obfuscation attacks—across three-class corpora (G: secret, KG: corporate public, KO: non-corporate) drawn from TM, Mormon, Dyncorp, and DBPedia collections.
We adopt the baseline classification pipeline of this method, excluding the attack simulation component, as our evaluation objective focuses on standard classification accuracy on diplomatic document corpora rather than adversarial robustness. This allows for a direct comparison between a model designed for high-security environments and our proposed architecture under formal diplomatic classification constraints.
  • Text Preprocessing: Following the original methodology, all documents are normalised to lowercase, after which numerical tokens, punctuation, and English stop-words are removed. Porter stemming is then applied to reduce inflected word forms to their morphological roots, thereby reducing feature dimensionality and standardising vocabulary across documents. In our implementation, this preprocessing stage is performed after the removal of security markings (Section 3.2.1) to prevent classification artefacts introduced by residual marking patterns.
  • Feature Extraction: A heterogeneous feature set is constructed by combining four complementary representations via a FeatureUnion pipeline:
      • Word n-grams (1–3): TF-IDF weighted unigrams, bigrams, and trigrams (max_features = 10,000; min_df = 3). These multi-word sequences capture local semantic context and domain-specific terminology.
      • Character n-grams (1–5): Character-level TF-IDF features (max_features = 5000; min_df = 3). These features provide robustness to morphological variation. While originally intended to improve resilience to whitespace-manipulation attacks, in our setting, they enhance the coverage of hyphenated diplomatic terms and abbreviations.
      • Skip-grams: Word-pair features with a skip distance of k = 2, capturing non-adjacent co-occurrence relationships. These are TF-IDF weighted with max_features = 3000 and min_df = 3. In our context, this component captures long-range term dependencies in formal diplomatic prose.
      • Latent Semantic Analysis (LSA): A TF-IDF bag-of-words matrix (max_features = 5000; min_df = 3) decomposed via Truncated SVD to retain d = 100 latent dimensions. LSA maps documents to a semantic subspace robust to exact vocabulary matching, capturing synonymous expressions across classification-relevant concepts.
All features are concatenated to form the final document representation. Consistent with the original study’s frequency threshold, terms appearing in fewer than three documents are excluded across all components to maintain a robust vocabulary. All classifiers are implemented using scikit-learn 1.7.2.
  • Classification: Three distinct classifiers are trained on the combined feature representation and integrated via soft voting. In this ensemble approach, class probabilities from each base learner are averaged, and the label with the highest mean probability is selected. The base learners are configured as follows:
      • Support Vector Machine (SVM): Utilizes a linear kernel with probability calibration (Platt scaling), trained for a maximum of 10,000 iterations.
      • Random Forest (RF): Comprises 100 decision trees with bootstrapped samples, aggregated by majority vote.
      • Multi-layer Perceptron (MLP): Consists of a single hidden layer with 100 units, utilizing early stopping to prevent overfitting.
The original paper employs hard majority voting among these three classifiers. In this implementation, we substitute soft voting to better leverage the probability estimates of the SVM and MLP. This adaptation is particularly beneficial in a multi-class setting, where diplomatic classification boundaries are less clearly separated than in the original task. An empirical comparison of soft and hard voting is provided in Section 4.5.4.

3.8.2. TD2V-Based Retrieval with Automatic Query Expansion

The second baseline follows the sensitivity classification method proposed by Trieu et al. [15], which reformulates document sensitivity assessment as a retrieval problem. The original work was designed for binary enterprise document leakage prevention, evaluated on datasets including Enron email archives and WikiLeaks-derived enterprise corpora such as Dyncorp, Transcendental Meditation, and Mormon collections. The method consists of two stages: document encoding via the PV-DM Doc2Vec architecture (TD2V), and sensitivity inference via nearest-neighbour retrieval augmented with Automatic Query Expansion (AQE).
Document Representation: In the original paper, Trieu et al. [15] employ a pre-trained Twitter-based document embedding model (TD2V), trained on 422,351 multi-domain news and social media articles. Since this pre-trained model is neither publicly available nor domain-appropriate for diplomatic correspondence, we train an equivalent PV-DM Doc2Vec model from scratch using gensim 4.4.0 on the target corpora, preserving all hyperparameters reported in the original study: vector dimensionality d = 300, context window w = 10, minimum word frequency n_min = 5, negative sampling with k_neg = 5 noise words, and 10 training epochs. Training document vectors are retrieved directly from the model’s document matrix D, while test document vectors are inferred over 50 inference epochs.
Sensitivity Evaluation via AQE: For a query document q, the method first retrieves the k nearest neighbours from the training corpus S using Euclidean (L2) distance. This process forms a ranklist as defined in Equation (21):
$$RL_k(q) = (y_{1,q}, y_{2,q}, \ldots, y_{k,q}), \qquad \text{where } d(q, y_{i,q}) \leq d(q, y_{i+1,q}) \tag{21}$$
A Modified Distance is then computed for every training document x in S as a weighted combination of the direct query-to-candidate distance and the mean distance from the k neighbours of q to x. This calculation is formally expressed in Equation (22):
$$\hat{d}(q, x) = \alpha \cdot d(q, x) + (1 - \alpha) \cdot \frac{1}{k} \sum_{i=1}^{k} d(z_i, x) \tag{22}$$
where $z_i = y_{i,q}$ denotes the i-th neighbour in $RL_k(q)$. The re-ranked list $\widehat{RL}_{k,l}(q)$ of l candidates is formed by sorting all training documents by their modified distance to q. The final sensitivity label is assigned by majority voting over the l re-ranked candidates. Following the original paper, we set α = 0.5, k = 5, and l = 7 for all experiments. The IndexFlatL2 index from the FAISS library (faiss-cpu 1.13.2) is used for efficient L2-based retrieval over the training index.
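The retrieval-and-re-ranking procedure of Equations (21) and (22) can be sketched with brute-force NumPy distances. This is an illustration only: it substitutes exhaustive search for the FAISS index used in the study, and the function name `aqe_predict` and the toy data are hypothetical.

```python
from collections import Counter

import numpy as np


def aqe_predict(q, X_train, y_train, alpha=0.5, k=5, l=7):
    """Label a query by majority vote over the l documents closest in modified distance."""
    X_train = np.asarray(X_train, dtype=float)
    q = np.asarray(q, dtype=float)

    d_q = np.linalg.norm(X_train - q, axis=1)          # d(q, x) for every x in S
    neighbours = np.argsort(d_q)[:k]                   # RL_k(q), Equation (21)

    # Mean distance from the k neighbours to every training document.
    pairwise = np.linalg.norm(
        X_train[neighbours][:, None, :] - X_train[None, :, :], axis=2
    )                                                  # shape (k, n)
    d_mod = alpha * d_q + (1.0 - alpha) * pairwise.mean(axis=0)   # Equation (22)

    reranked = np.argsort(d_mod)[:l]
    votes = Counter(y_train[int(i)] for i in reranked)
    return votes.most_common(1)[0][0]


base = np.array([[0, 0], [0, 1], [1, 0], [1, 1],
                 [0.5, 0.5], [0.2, 0.8], [0.8, 0.2], [0.3, 0.3]], dtype=float)
X = np.vstack([base, base + 10.0])                     # two well-separated clusters
y = ["A"] * 8 + ["B"] * 8
```

The expansion term rewards candidates that are close to the query's whole neighbourhood rather than to the query alone, which dampens the influence of a single atypical nearest neighbour.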
Adaptations for Diplomatic Classification: Three principal adaptations are introduced for the present task. First, the original binary classification scheme is extended to a four-class scheme reflecting the U.S. government classification hierarchy: UNCLASSIFIED, CONFIDENTIAL, SECRET, and TOP SECRET. Second, the pre-training step is replaced by domain-specific training on diplomatic corpora, as the Twitter-based TD2V vocabulary is misaligned with formal diplomatic register and terminology. Third, security classification markings are removed from all documents prior to encoding (Section 3.2.1), ensuring that retrieval similarity is driven by substantive content rather than explicit label strings. Consistent with the original study, a 50%/50% stratified train–test split is preserved for this baseline.
It should be noted that replacing the Twitter-pretrained TD2V model with domain-specific retraining is unlikely to disadvantage the baseline; the Twitter corpus covers news and social media text whose vocabulary and register are substantially more distant from formal diplomatic correspondence than the target corpora used for in-domain retraining. All retrieval hyperparameters (α, k, l) and Doc2Vec architecture parameters (d, w, n_min, k_neg) are preserved exactly from the original study, ensuring that the sole source of divergence from [15] is the embedding pretraining corpus. Since the original pretrained model is not publicly released, a direct ablation isolating this adaptation decision is not feasible.

3.8.3. FedAPILLM: LLM-Based Document Classification via Instruction Tuning

The third baseline adapts the FedAPILLM framework proposed by Wu et al. [35], which introduces a federated learning pipeline for sensitive data identification using large language models (LLMs) fine-tuned via Low-Rank Adaptation (LoRA). The original study targets binary classification of Web API data across a distributed architecture, where local LLaMA-3-8B instances are fine-tuned and aggregated via a trust-score-filtered FedAvg protocol. The original paper evaluates several variants, including LLaMA-3-8B and Qwen-2.5-14B, demonstrating near-perfect performance on Web API corpora.
We adopt the instruction fine-tuning component of this framework as a standalone centralised classifier, applying it to multi-class diplomatic document security classification. The federated aggregation, anomaly detection regularisation, and trust-score filtering mechanisms are not reproduced, as our experimental setting uses a single-institution corpus without distributed data sources. This adaptation isolates the LLM fine-tuning signal from the federated communication overhead, providing a direct assessment of instruction-tuned LLM performance on the target task.
Model and Quantisation: The base model is Meta LLaMA-3-8B-Instruct, loaded in 4-bit NormalFloat (NF4) quantisation using BitsAndBytesConfig (BitsAndBytes 0.48.2) with double quantisation enabled and bfloat16 compute dtype (QLoRA). Quantisation reduces the GPU memory footprint of the weights from approximately 16 GB (unquantised 16-bit weights) to under 6 GB, making single-GPU deployment on TRUBA HPC feasible without degrading classification performance relative to the original paper’s non-quantised baseline.
LoRA Fine-tuning: Consistent with the original paper’s parameterisation principle, LoRA low-rank adapter matrices are injected using PEFT 0.17.1 into the query, key, value, and output projection layers of the attention mechanism (target_modules = {q_proj, k_proj, v_proj, o_proj}). The LoRA rank is set to r = 16 with scaling factor α = 32 and dropout rate p = 0.05, yielding approximately 20 million trainable parameters out of the 8B total—less than 0.3% of the model. All base model weights are frozen during training, and only the LoRA matrices are updated via gradient descent. The forward pass for a layer with weight matrix W0 is defined in Equation (23):
$$h = W_0 x + \frac{\alpha}{r} \cdot B A x \tag{23}$$
where A and B are the trainable decomposition matrices, initialised as Gaussian and zero respectively.
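The following sketch illustrates the LoRA forward pass of Equation (23) with plain Python lists. It is an illustration only: the helper names are hypothetical, the Gaussian standard deviation of 0.02 is an assumed value, and the rank r is read from the shape of A rather than passed separately.

```python
import random

def matvec(M, v):
    """Multiply matrix M (list of rows) by vector v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def init_lora(d_out, d_in, r, seed=0, std=0.02):
    """A is Gaussian (std is an assumed value), B is all zeros."""
    rng = random.Random(seed)
    A = [[rng.gauss(0.0, std) for _ in range(d_in)] for _ in range(r)]
    B = [[0.0] * r for _ in range(d_out)]
    return A, B

def lora_forward(W0, A, B, x, alpha=32):
    """h = W0 x + (alpha / r) * B (A x), with r taken as the number of rows of A."""
    r = len(A)
    base = matvec(W0, x)               # frozen pre-trained path
    update = matvec(B, matvec(A, x))   # trainable low-rank path
    scale = alpha / r
    return [b + scale * u for b, u in zip(base, update)]
```

Because B starts at zero, the adapted layer reproduces the frozen layer exactly at the first step, so fine-tuning begins from the pre-trained behaviour.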
Instruction Prompt Format: Each document is formatted as a structured instruction prompt:
“Below is a diplomatic document. Classify its security level into one of these categories: {class_list}\n\nDocument: {text}\n\nSecurity Level:”
During training, the ground-truth label and an EOS token are appended to the prompt, and the cross-entropy loss is computed only over the label tokens (prompt positions masked with −100). During inference, the model generates up to 8 new tokens with greedy decoding (do_sample = False), and the prediction is extracted by matching the generated text against the class vocabulary in descending length order to resolve ambiguities between class names sharing a common prefix (e.g., “SECRET” vs. “TOP SECRET”).
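The prompt construction, label masking, and longest-first label extraction described above can be sketched as follows. The function names are hypothetical, the class list is assumed to be comma-joined, the EOS token is assumed to contribute to the loss, and returning `None` for unmatched generations is an assumption, since the text does not state a fallback.

```python
IGNORE_INDEX = -100  # positions with this label are excluded from the loss

def build_prompt(text, class_list):
    """Fill the instruction template (comma-joined class list is an assumption)."""
    classes = ", ".join(class_list)
    return (
        "Below is a diplomatic document. Classify its security level into one of "
        f"these categories: {classes}\n\nDocument: {text}\n\nSecurity Level:"
    )

def mask_prompt_labels(prompt_ids, label_ids, eos_id):
    """Append label and EOS; mask every prompt position so only the answer is scored."""
    input_ids = prompt_ids + label_ids + [eos_id]
    labels = [IGNORE_INDEX] * len(prompt_ids) + label_ids + [eos_id]
    return input_ids, labels

def extract_label(generated, class_list):
    """Try class names longest-first so 'TOP SECRET' is matched before 'SECRET'."""
    text = generated.upper()
    for name in sorted(class_list, key=len, reverse=True):
        if name in text:
            return name
    return None  # fallback for unmatched output is an assumption
```

Matching in descending length order matters because a plain substring test for "SECRET" would also fire on a generated "TOP SECRET".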
Training Configuration: The model is trained for up to 5 epochs with gradient accumulation over 4 steps (effective batch size = 16), using an 8-bit AdamW optimiser (learning rate = 2 × 10⁻⁴, weight decay = 0.01) and a cosine learning rate schedule with a 10% warm-up period. Gradient clipping is applied at norm threshold 1.0. Early stopping with patience of 2 epochs is applied based on macro F1 on the validation set; the best checkpoint is retained in CPU memory and restored prior to test evaluation. The data are split 80/10/10 (train/validation/test) with stratified sampling.
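To make the learning rate schedule concrete, the sketch below computes a cosine schedule with a 10% warm-up. The linear shape of the warm-up and decay to exactly zero are assumptions, as the text names only "cosine" and "10% warm-up"; the function name `lr_at` is hypothetical.

```python
import math

def lr_at(step, total_steps, peak_lr=2e-4, warmup_frac=0.10):
    """Learning rate at a given optimiser step.

    Linear warm-up to peak_lr (assumed shape), then cosine decay to zero.
    """
    warmup_steps = max(1, int(total_steps * warmup_frac))
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))
```

With gradient accumulation over 4 steps and an effective batch size of 16, one optimiser step corresponds to four micro-batches of four documents each, so `total_steps` counts optimiser updates rather than forward passes.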
Adaptations for Diplomatic Classification: Four principal adaptations are made relative to the original paper. First, the binary classification scheme is extended to the four-class U.S. government classification hierarchy; the prompt class list is constructed dynamically from the observed label set per dataset. Second, the federated learning infrastructure is replaced by centralised single-node training, as the WikiLeaks and FRUS corpora are held by a single institution. Third, security classification markings are removed from all documents prior to tokenisation to prevent label leakage via surface-form matching. Fourth, documents are truncated to 500 words before tokenisation, with the tokeniser applying a hard limit of 512 subword tokens; this constrains the effective context per document but is necessary for memory-efficient batch processing on a single H100 GPU.
Regarding the removal of federated components: centralised single-node training with full data access constitutes an upper bound on the performance achievable by the federated variant, as it eliminates communication overhead, aggregation noise, and the trust-score filtering step that may discard informative local updates. Any performance gap between this centralised FedAPILLM adaptation and the proposed SACHN model therefore represents a conservative estimate of SACHN’s advantage. Direct reproduction of the federated pipeline is not feasible in the present single-institution setting, which holds the entire corpus on a single node without distributed data sources.

3.8.4. Multi-LLM Comparative Study: Instruction-Tuned Open-Source LLMs

To complement the FedAPILLM baseline (Section 3.8.3) and assess the sensitivity of classification performance to LLM architecture and scale, we conduct a systematic comparative evaluation of three state-of-the-art open-source instruction-tuned large language models under identical QLoRA fine-tuning conditions. This constitutes an original empirical contribution of the present study, as no prior work has benchmarked multiple open-source LLMs on multi-class diplomatic document security classification under controlled fine-tuning conditions. The three models evaluated are:
  • LLaMA-3-8B-Instruct (Meta AI, 8 billion parameters): A decoder-only transformer with grouped query attention and RoPE positional encoding, pre-trained on over 15 trillion tokens with instruction-following post-training via RLHF.
  • Qwen2.5-14B-Instruct (Alibaba Cloud, 14 billion parameters): A larger-scale model featuring an extended 128K context window, trained on an 18 trillion token corpus with emphasis on multi-lingual and domain-specific instruction following.
  • Mistral-7B-Instruct-v0.3 (Mistral AI, 7 billion parameters): A sliding-window attention model with grouped query attention, optimised for efficiency on long-context inputs and fine-tuned with direct preference optimisation (DPO).
Unified Fine-tuning Protocol: All three models are fine-tuned under an identical protocol to ensure the comparative validity of results. Each model is loaded in 4-bit NF4 quantisation (QLoRA) with bfloat16 compute dtype and double quantisation enabled. LoRA adapter matrices are injected into the query, key, value, and output projection layers of the attention mechanism, with rank r = 16, scaling factor α = 32, and dropout rate p = 0.05. Gradient checkpointing is enabled for all models. Training proceeds for up to 5 epochs with gradient accumulation over 4 steps (effective batch size = 16), using an 8-bit AdamW optimiser at learning rate 2 × 10⁻⁴ with cosine scheduling and a 10% warm-up phase. Gradient clipping is applied at norm threshold 1.0, and early stopping with patience 2 is applied based on macro F1 on the validation set. The best checkpoint is retained in CPU memory and restored prior to test evaluation.
Prompt Format and Inference: The same instruction prompt and core inference procedure described in Section 3.8.3 are applied to all three models.
During training, the ground-truth label followed by an EOS token is appended to the prompt, and cross-entropy loss is computed only over the label tokens. During inference, up to 8 new tokens are generated with greedy decoding (do_sample = False) and the prediction is extracted by longest-match against the class vocabulary, resolving potential prefix ambiguities (e.g., “SECRET” ⊂ “TOP SECRET”) through descending-length class ordering.
For inference batches, left-padding is applied to preserve the relative position of prompt tokens for each input in the batch, as right-padding degrades generation quality in causal language models by displacing prompt content away from the generation boundary. Training batches use right-padding with the padded positions masked by label = −100 to prevent loss contributions from padding tokens.
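The difference between the two padding modes can be illustrated with a small helper. This is a sketch with a hypothetical function name, not the tokeniser's own padding routine.

```python
def pad_batch(sequences, pad_id, side):
    """Pad token-id lists to a common width and build the attention mask."""
    width = max(len(s) for s in sequences)
    padded, attention = [], []
    for seq in sequences:
        fill = [pad_id] * (width - len(seq))
        zeros = [0] * len(fill)
        ones = [1] * len(seq)
        if side == "left":
            # Inference: every prompt ends at the last position, next to generation.
            padded.append(fill + seq)
            attention.append(zeros + ones)
        elif side == "right":
            # Training: padding trails the sequence and is masked out of the loss.
            padded.append(seq + fill)
            attention.append(ones + zeros)
        else:
            raise ValueError("side must be 'left' or 'right'")
    return padded, attention
```

With left-padding the final column of the batch always holds a real prompt token, so the first generated token continues the prompt directly for every row.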
Data and Evaluation: All models are evaluated on the same 80/10/10 stratified train/validation/test splits (seed = 42), on the WikiLeaks (3-class) and FRUS (4-class) corpora, with security marking removal applied uniformly prior to tokenisation. This design ensures that observed performance differences reflect model architecture and scale rather than data partitioning or preprocessing variation. Each model is trained independently on a dedicated TRUBA HPC node (single NVIDIA H100 GPU) with identical resource allocation.
The comparative analysis addresses three research questions:
  • whether larger model scale (Qwen2.5-14B vs. 7–8B models) translates to higher classification accuracy under a fixed fine-tuning budget;
  • whether sliding-window attention (Mistral-7B) offers advantages on the truncated 500-word diplomatic documents relative to full-attention models;
  • how all three LLMs compare against the proposed SACHN-DeBERTa-v3 encoder-only architecture, which is specifically designed for classification rather than generation.
In addition to the r = 16 baseline protocol described above, extended experiments with LoRA rank r = 64—under both 4-bit NF4 and 8-bit INT8 quantisation conditions for all three architectures—were conducted to assess whether increased trainable capacity or relaxed quantisation precision narrows the performance gap. Results of these extended experiments are reported in Section 4.4.

4. Results

4.1. Overview

Table 8 presents the complete classification performance of all evaluated methods across the WikiLeaks and FRUS diplomatic corpora. Results are reported in terms of overall accuracy (Acc.), macro-averaged F1-score (F1_M), and weighted F1-score (F1_W). For the proposed SACHN-DeBERTa-v3-Large model, the reported figures reflect the dynamic weighted ensemble of three independently trained runs (seeds 42, 123, and 456); individual seed results are examined in detail in Section 4.5. All baseline methods were re-implemented and evaluated under the identical conditions described in Section 3.

4.2. Feature-Engineering Ensemble Classifier

The re-implemented Kesenek et al. [21] pipeline achieved 86.12% accuracy and a macro F1-score of 66.44% on WikiLeaks, compared to 69.03% accuracy and a macro F1 of 60.95% on FRUS. While overall accuracy appears moderate on WikiLeaks, the macro F1-score diverges substantially from accuracy—a consequence of severe class-level performance imbalance.
As shown in Table 9, the SECRET class on WikiLeaks yielded an F1-score of only 24.79% (precision 68.18%, recall 15.15%), while UNCLASSIFIED achieved 93.34%. The classifier effectively learned to exploit the distributional dominance of UNCLASSIFIED documents (59% of the WikiLeaks test partition) at the expense of minority-class discrimination. On FRUS, performance across all four classes suffered; specifically, the CONFIDENTIAL class attained only 53.69% F1 and UNCLASSIFIED reached 57.60%. This reflects the method’s inability to generalise sparse n-gram features across the denser and more varied FRUS vocabulary.
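As a check on the SECRET figures above, the class F1-score follows from the reported precision and recall through the harmonic mean. The derivation below uses only the values quoted in this paragraph.

```latex
F_1 = \frac{2PR}{P + R}
    = \frac{2 \times 0.6818 \times 0.1515}{0.6818 + 0.1515}
    \approx \frac{0.2066}{0.8333}
    \approx 0.2479
```

The harmonic mean is pulled toward the weaker of the two quantities, which is why a recall of 15.15% caps the class F1 near 25% despite a precision of 68.18%.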
Computational efficiency also varied significantly across datasets. Training time on FRUS was 12,895 s (~3.6 h), compared to 737 s on WikiLeaks. This steep increase is attributed to the larger corpus size and the combinatorial feature space complexity inherent in the four-class diplomatic classification problem.

4.3. TD2V+AQE—Document-Level Retrieval

The TD2V+AQE retrieval system, implemented following Trieu et al. [15], achieved 78.73% accuracy on WikiLeaks and 65.44% on FRUS, with macro F1-scores of 54.00% and 57.83%, respectively. The Doc2Vec representations [70], although trained directly on the target corpora (Section 3.8.2), were unable to form sufficiently separable manifolds for minority security classes. On WikiLeaks, the SECRET class attained a per-class F1 of only 3.86% (recall 2.13%), with 592 of 1849 queries returning no candidate above the confidence threshold and receiving no classification (false rejects). The FAISS-based ranklist [71], while efficient, amplified this failure: because SECRET documents occupy a sparse, overlapping region of the embedding space relative to CONFIDENTIAL documents, the top-k retrieved neighbours were predominantly CONFIDENTIAL, collapsing SECRET recall. On FRUS, the false-reject rate was lower (239 of 10,973 queries), yet CONFIDENTIAL and UNCLASSIFIED F1-scores remained below 53% (Table 10), indicating systematic confusion between adjacent sensitivity levels. These results indicate that unsupervised distributional representations are insufficient for fine-grained security classification when inter-class semantic overlap is high.

4.4. FedAPILLM and Multi-LLM Study—QLoRA Fine-Tuning

The FedAPILLM adaptation, implemented following the centralised QLoRA protocol of Wu et al. [35], achieved 51.61% accuracy and 40.76% macro F1 on WikiLeaks, and 37.93% accuracy with 33.21% macro F1 on FRUS. These results sit roughly 18 and 13 percentage points above random chance on the 3-class (33.3%) and 4-class (25.0%) problems, respectively, and well below both non-LLM baselines. This underperformance is attributable to two compounding factors:
  • The 80/10/10 train–validation–test split yields a substantially smaller training set than those typically required for generative LLM convergence on niche domains.
  • The 4-bit NF4 quantisation [72] imposes representational constraints that disproportionately affect fine-grained semantic distinctions, such as the nuances between CONFIDENTIAL and SECRET language, which are lexically similar but bureaucratically distinct.
To further investigate whether these limitations are model-specific or paradigm-level, we conducted a multi-architecture comparative study evaluating LLaMA-3-8B-Instruct [73], Qwen2.5-14B-Instruct [74], and Mistral-7B-Instruct-v0.3 [75] under identical QLoRA fine-tuning conditions. Table 11 reports these comparative results.
Mistral-7B achieved the highest WikiLeaks accuracy (57.16%) and macro F1 (43.96%) among the three evaluated LLMs, suggesting that its sliding-window attention mechanism may provide a modest advantage for encoding long diplomatic documents. Qwen2.5-14B, despite having nearly twice the parameter count of the other two models, performed worst on FRUS (accuracy 31.37%, macro F1 25.35%), yielding results approximately at random chance for a 4-class problem.
This unexpected result is consistent with reports in the literature suggesting that larger instruction-tuned models can underfit when fine-tuned in 4-bit NF4 precision with a low LoRA rank (r = 16). Specifically, the effective rank of the adapted weight matrices may be insufficient to absorb the significant domain shift from general instruction following to formal bureaucratic text classification. Across all three architectures and both datasets, QLoRA-based instruction fine-tuning failed to deliver competitive classification performance. These findings suggest that the instruction-following paradigm is poorly matched to the closed-set, categorical nature of security classification tasks compared to dedicated encoding-based architectures.
To directly test whether the LoRA rank constraint or 4-bit quantisation is the primary driver of underperformance, we conducted additional experiments with r = 64 under both 4-bit NF4 and 8-bit INT8 quantisation conditions for all three LLM architectures. Table 12 reports these extended results alongside the SACHN-DeBERTa-v3-Large reference.
Despite quadrupling the LoRA rank, improvements remain marginal: LLaMA-3-8B-Instruct gains 4.93 macro F1 points on WikiLeaks under 4-bit quantisation (36.63% → 41.56%), while Mistral-7B-Instruct declines by 2.57 macro F1 points (43.96% → 41.39%) under the same condition. Increasing quantisation precision from 4-bit to 8-bit does not yield consistent gains: LLaMA-3-8B-Instruct degrades in WikiLeaks accuracy from 51.83% to 46.50% under 8-bit, and Mistral-7B-Instruct, while achieving its peak WikiLeaks result of 64.48% accuracy under 8-bit, simultaneously degrades on FRUS from 50.31% to 37.43%—a reversal of 12.88 percentage points. The best result across all extended configurations—Mistral-7B-Instruct with 8-bit INT8 and r = 64 on WikiLeaks (accuracy 64.48%, macro F1 50.28%)—remains 31.64 percentage points below SACHN-DeBERTa-v3-Large in accuracy and 42.29 points below in macro F1. These results confirm that the observed LLM underperformance is structural and paradigm-level rather than a consequence of rank or quantisation choices.

4.5. Proposed Model: SACHN-DeBERTa-v3-Large

4.5.1. Per-Seed Reproducibility Analysis

Table 13 reports the test-set performance for each of the three independently trained runs (seeds 42, 123, and 456), alongside the validation macro F1-score used to derive the dynamic ensemble weights. The results reveal substantial intra-model variance across seeds, motivating a detailed root-cause analysis of seed-level instability presented below.
On WikiLeaks, Seed 42 attained only 87.46% accuracy and 77.27% macro F1, a 10.4-point accuracy gap relative to the best single run (Seed 123 at 97.89%). On FRUS, this gap is more pronounced: Seed 42 achieved 70.84% accuracy versus 94.06% for Seed 456, a 23.2-point difference. Critically, this variance is not predictable from validation performance; Seed 42’s validation macro F1 (80.33% WikiLeaks, 69.78% FRUS) was comparable to, or higher than, that of the other seeds. This indicates that the Seed 42 run converged to a local minimum that generalised poorly to the test partition while achieving artificially inflated performance on the validation set. Such seed-level instability is consistent with the known sensitivity of disentangled-attention transformers to early-epoch gradient trajectories when contrastive and classification losses are jointly optimised, particularly in the presence of severely skewed class frequencies.
McNemar’s test (Yates-corrected, Bonferroni-adjusted α′ = 0.017 for three pairwise comparisons per dataset) statistically characterises Seed 42 as a point outlier rather than evidence of general architectural instability. On WikiLeaks, Seed 42 is significantly different from both Seed 123 (χ² = 99.21, p = 2.28 × 10⁻²³) and Seed 456 (χ² = 86.52, p = 1.38 × 10⁻²⁰), whereas Seeds 123 and 456 are not statistically distinguishable after Bonferroni correction (χ² = 5.63, p = 0.018 > α′). On FRUS, the same pattern holds with greater magnitude: Seed 42 diverges from Seed 123 (χ² = 466.16, p = 2.19 × 10⁻¹⁰³) and Seed 456 (χ² = 493.45, p = 2.54 × 10⁻¹⁰⁹), while Seeds 123 and 456 are statistically indistinguishable (χ² = 2.08, p = 0.150). These results are consistent with Seeds 123 and 456 having converged to the same region of the loss landscape, and with Seed 42’s failure being a localised convergence event rather than a systematic architectural flaw.
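The test statistic and the Bonferroni threshold used above can be reproduced with the standard library alone. The sketch below is illustrative: the function names are hypothetical, and the discordant counts b and c (documents that only one of the two runs classified correctly) are not reported in the text, so only the published χ² values are checked.

```python
import math

def chi2_sf_df1(stat):
    """Upper-tail p-value of a chi-square variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(stat / 2.0))

def mcnemar_yates(b, c):
    """Yates-corrected McNemar test from the two discordant counts.

    b: documents only the first model classified correctly
    c: documents only the second model classified correctly
    """
    if b + c == 0:
        return 0.0, 1.0
    stat = (abs(b - c) - 1) ** 2 / (b + c)
    return stat, chi2_sf_df1(stat)

def bonferroni_alpha(alpha, n_comparisons):
    """Per-comparison significance level after Bonferroni correction."""
    return alpha / n_comparisons
```

For example, `chi2_sf_df1(5.63)` gives a p-value of about 0.018, which lies above `bonferroni_alpha(0.05, 3)` (about 0.017), matching the non-significant Seed 123 versus Seed 456 comparison on WikiLeaks.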
Three concurrent factors explain this sensitivity. First, the joint optimisation of focal loss and supervised contrastive loss creates a highly non-convex loss landscape with multiple local minima; unlike single-objective fine-tuning, the gradient signal is a weighted combination of two competing objectives whose balance shifts across training, increasing the probability of initialisation-dependent divergence. Second, the 7.3% SECRET class imbalance in the WikiLeaks corpus makes early gradient trajectories highly sensitive to which minority-class examples appear in the first training batches—a stochastic effect controlled entirely by the random seed. Third, the validation partition (10%, approximately 900 WikiLeaks and 2470 FRUS samples) is insufficient to discriminate between the local minimum reached by Seed 42 and the global minimum region reached by Seeds 123 and 456, explaining why validation macro F1 scores remained comparable across all three seeds. A principled optimisation-based mitigation is a contrastive loss warm-up schedule: withholding the supervised contrastive objective for the first N_warmup training steps allows the classification head to establish an initial decision boundary before the contrastive loss begins shaping the representation space, decoupling the two optimisation phases and reducing initialisation sensitivity. Evaluating this strategy empirically constitutes a concrete direction for future work.
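One possible form of the warm-up schedule is a hard gate on the contrastive term. The sketch below is hypothetical in every detail beyond the idea of withholding the contrastive objective for the first N_warmup steps: the weight `lam`, the additive combination, and the function names are assumptions, and the strategy has not been evaluated in this study.

```python
def contrastive_weight(step, n_warmup, lam):
    """Weight on the supervised contrastive loss: zero during warm-up, lam afterwards."""
    return 0.0 if step < n_warmup else lam

def joint_loss(cls_loss, supcon_loss, step, n_warmup, lam):
    """Classification loss plus the gated contrastive term (additive form is assumed)."""
    return cls_loss + contrastive_weight(step, n_warmup, lam) * supcon_loss
```

A gradual ramp from 0 to `lam` would be an equally plausible variant; the hard gate is shown only because it is the simplest reading of "withholding" the objective.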
As a consequence of the near-uniform validation F1-scores across seeds, the dynamic weighted ensemble assigned nearly equal weights to all three runs: [0.3382, 0.3225, 0.3393] on WikiLeaks and [0.3379, 0.3385, 0.3235] on FRUS. Despite Seed 42’s degraded performance, the ensemble successfully mitigated its influence through soft logit averaging. Seeds 123 and 456 collectively carry approximately 0.662 of the total ensemble weight on both datasets, which is sufficient to override Seed 42’s erroneous confident predictions on the minority SECRET class. The resulting ensemble accuracy of 96.12% on WikiLeaks and 92.11% on FRUS exceeds the simple average of the three seeds (93.52% and 85.90%, respectively), confirming the variance-reduction benefit of ensemble aggregation even when individual model trajectories diverge. The dynamic weighted ensemble therefore serves as a deployment-stage variance-reduction mechanism rather than an architectural remedy; the contrastive loss warm-up schedule described above constitutes the proposed optimisation-level solution to the underlying convergence sensitivity.
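The weighting and voting logic can be sketched as follows. Two assumptions are made and should be read as such: weights are taken to be proportional to each seed's validation macro F1, and the combination is applied to class probabilities rather than raw logits. The function names and the example probability vectors are hypothetical.

```python
def f1_proportional_weights(val_f1):
    """Normalise validation macro F1 scores into ensemble weights (assumed rule)."""
    total = sum(val_f1)
    return [f / total for f in val_f1]

def weighted_soft_vote(prob_rows, weights):
    """Combine per-model class probabilities and return (predicted class, scores)."""
    n_classes = len(prob_rows[0])
    combined = [
        sum(w * probs[c] for w, probs in zip(weights, prob_rows))
        for c in range(n_classes)
    ]
    predicted = max(range(n_classes), key=lambda c: combined[c])
    return predicted, combined

# Illustrative only: one confident outlier against two agreeing models,
# using the WikiLeaks weights reported above.
weights = [0.3382, 0.3225, 0.3393]
probs = [[0.90, 0.05, 0.05], [0.10, 0.10, 0.80], [0.10, 0.10, 0.80]]
prediction, scores = weighted_soft_vote(probs, weights)
```

In this toy case the two agreeing models outweigh the confident outlier, which is the mechanism the paragraph describes for Seed 42.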

4.5.2. Ensemble Results and Per-Class Analysis

The SACHN-DeBERTa-v3-Large ensemble achieves 96.12% accuracy (F1_Macro = 92.57%, ROC-AUC = 99.64%) on WikiLeaks and 92.11% accuracy (F1_Macro = 91.02%, ROC-AUC = 99.02%) on FRUS, representing the best results across all evaluated methods by substantial margins. The detailed per-class performance metrics, including precision, recall, and F1-score, are reported in Table 14 for the WikiLeaks corpus and Table 15 for the FRUS corpus.
On WikiLeaks, the SECRET class remains the most challenging, attaining an 85.00% F1-score (77.27% recall). This class constitutes only 7.3% of the WikiLeaks dataset (658 of 9005 documents; 66 of 901 test instances), making it the primary driver of the 3.6-point gap between macro and weighted F1-scores. Despite this scarcity, the Focal Loss objective (γ = 2.0) applied during training markedly improved SECRET recall relative to the 15.15% obtained by the Kesenek et al. [21] baseline. The confusion matrix, provided in Table 16, shows that 15 of the 66 SECRET test documents were misclassified as CONFIDENTIAL (a plausible error, given that both classes share bureaucratic hedging language and diplomatic formality), while zero SECRET documents were misclassified as UNCLASSIFIED. This indicates that the SecurityAwareGate mechanism successfully separated the sensitivity spectrum at its extremes, effectively preventing catastrophic under-classification.
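The reported SECRET recall can be recovered directly from the confusion-matrix counts quoted above, as the following short calculation shows.

```latex
\text{Recall}_{\text{SECRET}}
  = \frac{\text{correct SECRET}}{\text{all SECRET}}
  = \frac{66 - 15 - 0}{66}
  = \frac{51}{66}
  \approx 0.7727
```

All 15 errors fall on the adjacent CONFIDENTIAL class, so every missed SECRET document is still assigned a classified label.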
On FRUS, all four classes achieved F1-scores between 88.46% (TOP SECRET) and 93.80% (SECRET), indicating that the model discriminates reliably across the full sensitivity hierarchy. The CONFIDENTIAL class exhibits the widest precision–recall gap (92.87% vs. 88.12%); this specific performance characteristic is detailed in Table 17, which shows that 55 of 547 CONFIDENTIAL documents were misclassified as SECRET. This confusion pattern is expected, as CONFIDENTIAL and SECRET documents in the FRUS collection share similar diplomatic content, and the principal distinction often lies in the judgment of the original classifying authority rather than in surface-level textual markers. The inclusion of the TOP SECRET class—which was filtered out entirely in the ablation model—did not degrade performance on the remaining three classes. This confirms that the FRUS label hierarchy is learnable when sufficient representative training data exists, as evidenced by the 3347 TOP SECRET training documents used in this study.
To assess the sensitivity of the LOU exclusion decision (Section 3.2.2), a complementary experiment was conducted in which WikiLeaks LOU documents were remapped to UNCLASSIFIED rather than excluded. The ensemble achieved 97.34% accuracy and 94.51% macro F1 under this alternative mapping, compared to 95.23% and 89.85% under the exclusion policy. The marginal improvement confirms that LOU documents carry linguistic characteristics consistent with unclassified diplomatic cables, validating the original mapping decision. The FRUS dataset contains no LOU-labeled documents and is unaffected by this choice.

4.5.3. Comparative Summary

SACHN-DeBERTa-v3-Large significantly outperforms the re-implementation of the method proposed by Kesenek et al. [21], the strongest baseline, by 10.0 percentage points in accuracy and 26.1 points in macro F1 on WikiLeaks. On the FRUS dataset, the margin is even more substantial, with the proposed model exceeding the baseline by 23.1 percentage points in accuracy and 30.1 points in macro F1.
The performance gap relative to QLoRA-based LLM fine-tuning is even more pronounced: the proposed model exceeds the best result under the primary QLoRA protocol (r = 16), achieved by Mistral-7B with a WikiLeaks accuracy of 57.16%, by 38.96 percentage points in accuracy and 48.61 points in macro F1. These results underscore that autoregressive instruction-tuned models in 4-bit quantised form are not competitive with discriminative encoder architectures on closed-set, fine-grained classification tasks of this nature. Table 18 presents pairwise McNemar’s test results (Yates-corrected, df = 1) comparing SACHN-DeBERTa-v3-Large ensemble predictions against each baseline on the held-out test partitions. On WikiLeaks, the ensemble significantly outperforms Kesenek et al. [21] (χ²(1) = 64.04, p = 1.22 × 10⁻¹⁵) and TD2V (χ²(1) = 276.56, p = 4.22 × 10⁻⁶²). On FRUS, the margins are substantially larger: χ²(1) = 500.92, p = 5.98 × 10⁻¹¹¹ versus Kesenek et al. [21], and χ²(1) = 829.67, p = 1.91 × 10⁻¹⁸² versus TD2V. All four comparisons are statistically significant at p < 0.001.

4.5.4. Ablation Studies

To validate key design choices in the preprocessing and training pipeline, a series of controlled ablation experiments were conducted using the SACHN-DeBERTa-v3-Large ensemble, with all other hyperparameters held constant.
First, the tokeniser maximum sequence length was varied across 128, 256, and 512 tokens to assess the impact of input truncation on classification performance. Table 19 reports ensemble performance under each configuration. The 256-token setting achieves the highest macro F1 on both datasets (89.85% on WikiLeaks, 91.96% on FRUS). Extending to 512 tokens yields lower performance on both corpora, attributable to padding overhead introduced by the many shorter diplomatic cables in the dataset. Reducing to 128 tokens discards content beyond the document header, degrading minority-class recall. The 256-token configuration was therefore retained for all main experiments.
Second, the Focal Loss focusing parameter γ was varied across 1.0, 2.0, and 3.0 to assess the optimality of the selected value. Table 20 reports ensemble performance under each configuration. Note that γ is applied exclusively to WikiLeaks training; FRUS employs weighted cross-entropy independently of this parameter, and minor FRUS variations across settings reflect stochastic training effects.
Although γ = 3.0 yields the highest WikiLeaks macro F1, γ = 2.0 achieves the best cross-dataset balance and was therefore retained as the primary configuration.
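The role of γ is easiest to see in the per-example focal loss, FL = −(1 − p_t)^γ · log(p_t), where p_t is the predicted probability of the true class. The sketch below implements this standard form for a single example; it omits any per-class weighting, and whether the study's objective includes such a term is not stated here, so that omission is an assumption.

```python
import math

def focal_loss(probs, target, gamma=2.0):
    """Focal loss for one example, given class probabilities and the true class index."""
    p_t = probs[target]
    # (1 - p_t) ** gamma shrinks the loss of well-classified examples;
    # gamma = 0 recovers ordinary cross-entropy.
    return -((1.0 - p_t) ** gamma) * math.log(p_t)

easy = focal_loss([0.05, 0.05, 0.90], target=2)  # confident and correct
hard = focal_loss([0.60, 0.30, 0.10], target=2)  # true class receives little mass
```

Raising γ suppresses the easy example far more than the hard one, which concentrates the gradient on poorly classified minority-class documents such as SECRET.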
Third, to empirically validate the substitution of hard voting with soft voting in the Kesenek et al. [21] baseline, both strategies were evaluated under identical conditions. Table 21 reports the results.
The performance difference between the two strategies is negligible—less than 0.15 percentage points across all metrics and both datasets—confirming that the substitution does not introduce a material evaluation bias.

4.6. Explainability Analysis

To evaluate whether SACHN-DeBERTa-v3-Large relies on semantically meaningful classification signals rather than superficial patterns, two post-hoc interpretability frameworks were used: SHapley Additive exPlanations (SHAP) [66] and Local Interpretable Model-agnostic Explanations (LIME) [69]. The explanations were generated from the best single-seed models (seed 123 for WikiLeaks; seed 456 for FRUS) after security markings had been removed. Because [MASK] token substitution was applied in the same way as at inference time, the resulting attribution maps reflect the model’s decision-making on unmarked text.
For the SHAP implementation, a shap.Explainer with a Text masker configured for the DebertaV2Tokenizer was employed. A stratified sample of 150 documents (WikiLeaks) and 148 documents (FRUS) was drawn from the fixed test set, comprising 50 instances per class for WikiLeaks and 37 instances per class for FRUS.
Within this game-theoretic attribution framework, SHAP values were computed to quantify the marginal contribution of individual tokens toward specific class probabilities.
Complementing this approach, the LIME configuration evaluated 105 documents in total (45 for WikiLeaks and 60 for FRUS), balanced to include 10 correctly classified and 5 misclassified instances per class. Each local explanation was generated using 300 perturbations and the top 20 features per class. All classes were explained simultaneously (labels = range(n_classes)) to keep the multi-class interpretability assessment consistent.

4.6.1. SHAP Feature Attribution

Granular SHAP analysis for the WikiLeaks corpus, presented in Figure 5, demonstrates clear semantic differentiation across the tripartite classification hierarchy. Within the CONFIDENTIAL class, the most influential positive features are structural and document-level markers: ‘section’ (+0.211), ‘introduction’ (+0.109), and the [MASK] placeholder (+0.086), indicating that the model associates document organization cues and masked header positions with diplomatic correspondence. Tokens such as ‘indonesian’ and ‘supplies’ exhibit negative attribution against CONFIDENTIAL, reflecting terms more characteristic of routine administrative traffic. The SECRET class is primarily characterized by operational and thematic specificity: ‘sfo’ (+0.111)—a cable routing code for the San Francisco consulate—together with ‘chem’ (+0.062) and ‘disarmament’ activate this classification head, indicating that the model successfully identifies references to chemical disarmament negotiations and specific diplomatic routing as high-sensitivity signals. Additional SECRET discriminators include Near East Affairs abbreviation ‘nea’, ‘geneva’, and ‘blake’, reinforcing an operational intelligence profile. For the UNCLASSIFIED stratum, the model associates distribution vocabulary such as ‘headlines’ (+0.022) and ‘collection’ (+0.022) with public-facing summaries, while ‘allocation’ (−0.061), ‘email’, ‘york’, and ‘keeping’ serve as suppressors, effectively demarcating the boundary between routine correspondence and classified content.
The four-class FRUS analysis, illustrated in Figure 6, exhibits inter-class separation consistent with the model’s high classification accuracy on that corpus. Within the CONFIDENTIAL class, the dominant positive attributors are ‘conv’ (+0.072)—reflecting conversation records of bilateral consultations—alongside ‘talks’, ‘guidance’, ‘nigerian’, and ‘inspections’, indicating that the model identifies diplomatic consultation records and bilateral engagement frameworks as characteristic of this tier. The SECRET class is primarily driven by regional and intelligence-source specificity: ‘zaire’ (+0.104) emerges as the top feature, accompanied by ‘algeria’, ‘background’, and ‘dci’ (Director of Central Intelligence), while ‘hak’—an abbreviation associated with Henry Kissinger’s communications—serves as a suppressor, demonstrating that the model disambiguates intelligence-centric regional reporting from senior-principal correspondence. Within the TOP SECRET stratum, the attributions reflect high-sensitivity crisis documentation: ‘cap’ (+0.043), ‘cambodian’, ‘confrontation’, and ‘principal’ activate this class head, while ‘dci’ acts as a negative discriminator, separating executive-level crisis records from intelligence agency outputs. The UNCLASSIFIED class is characterized by administrative and logistical language—‘ros’ (+0.030), ‘spokane’, ‘panel’, ‘received’, and ‘sending’—consistent with routine procedural documentation. Collectively, these findings underscore the model’s ability to discern between high-stakes strategic intelligence and routine bureaucratic correspondence.

4.6.2. LIME Complementary Analysis

Instance-level LIME explanations further corroborate the SHAP findings by offering simultaneous multi-class attribution maps.
Figure 7 presents the per-class LIME attributions for the WikiLeaks corpus, within which the SECRET class consistently triggers on high-sensitivity themes, including terrorism (‘Qaeda’, +0.211) and targeted intelligence reporting, such as the assassination of Lebanese Prime Minister Rafik Hariri (‘Hariri’, +0.079). Conversely, UNCLASSIFIED activations are driven by contemporary public markers like ‘Facebook’ and ‘employment’. A critical observation involves the [MASK] token; its modest positive attribution (+0.099 to +0.132) in misclassified instances suggests that the model partially leverages structural artifacts—specifically the positioning of placeholders in document headers—as a residual cue for classification.
As shown in Figure 8, in the FRUS analysis, TOP SECRET predictions are dominated by high-specificity historical entities, notably ‘Brezhnev’ (+0.388) and ‘Fritz’ (+0.415)—the latter effectively capturing Vice President Walter Mondale’s nickname within NSC communications. Interestingly, the token ‘declassified’ (+0.246) serves as a discriminator for SECRET cables; in FRUS, this term frequently appears in State Department metadata, which the model identifies as a systematic (albeit indirect) class signal.
Misclassification analysis, as visualized in Figure 9, reveals that errors are not stochastic but rather grounded in semantic ambiguity. In WikiLeaks, CONFIDENTIAL-to-UNCLASSIFIED errors typically occur in documents lacking substantive intelligence depth. In FRUS, the misclassification of SECRET records as TOP SECRET during high-stakes events (e.g., the 1982 Falklands War) underscores a genuine boundary collapse where the operational urgency of the content mirrors the linguistic profile of the highest classification tier.
A closer examination of the asymmetric confusion patterns between adjacent sensitivity tiers provides additional interpretive depth. In WikiLeaks, 15 of 66 true SECRET documents (22.7%) were misclassified as CONFIDENTIAL, whereas only 3 of 304 true CONFIDENTIAL documents (1.0%) received a SECRET prediction (Table 16). This directional asymmetry reflects the compositional heterogeneity of the SECRET tier: while high-specificity operational tokens (‘sfo’, ‘disarmament’, ‘chem’) reliably activate the SECRET classification head when present, a non-trivial subset of SECRET cables employs the diplomatic register that SHAP identifies as characteristic of CONFIDENTIAL correspondence, particularly bureaucratic hedging language and formal diplomatic syntax. In the absence of operational-domain vocabulary, the model defaults to CONFIDENTIAL as the highest-confidence high-sensitivity assignment. In FRUS, the error pattern is more symmetric but equally interpretable: 55 of 547 CONFIDENTIAL documents (10.1%) were assigned to SECRET, while 27 of 1313 SECRET cables (2.1%) were assigned to CONFIDENTIAL (Table 17). The dominant CONFIDENTIAL→SECRET direction reflects geopolitical and intelligence-source vocabulary (‘zaire’, ‘algeria’) appearing in CONFIDENTIAL cables; SHAP identifies these tokens as primary SECRET-class discriminators in the FRUS corpus (Section 4.6.1). Collectively, these confusion patterns confirm that boundary errors in both corpora are semantically grounded in genuine lexical overlap between adjacent classification tiers, rather than stochastic model noise.
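The percentages in this paragraph follow directly from the quoted counts, and the claim that FRUS is "more symmetric" can be checked by comparing the two error directions. The sketch below is illustrative only: the dictionary layout and function names are hypothetical, while the counts are those quoted from Tables 16 and 17.

```python
# (corpus, true class, predicted class) -> (misclassified, true-class total)
CONFUSIONS = {
    ("WikiLeaks", "SECRET", "CONFIDENTIAL"): (15, 66),
    ("WikiLeaks", "CONFIDENTIAL", "SECRET"): (3, 304),
    ("FRUS", "CONFIDENTIAL", "SECRET"): (55, 547),
    ("FRUS", "SECRET", "CONFIDENTIAL"): (27, 1313),
}


def confusion_rate(key):
    """Share of the true class (in percent) that received the wrong label."""
    wrong, total = CONFUSIONS[key]
    return 100.0 * wrong / total


def asymmetry(dominant, reverse):
    """Ratio of the dominant error direction to the reverse direction."""
    return confusion_rate(dominant) / confusion_rate(reverse)


for key in CONFUSIONS:
    corpus, true_cls, pred_cls = key
    print(f"{corpus}: {true_cls} -> {pred_cls}: {confusion_rate(key):.1f}%")

wl_ratio = asymmetry(("WikiLeaks", "SECRET", "CONFIDENTIAL"),
                     ("WikiLeaks", "CONFIDENTIAL", "SECRET"))
frus_ratio = asymmetry(("FRUS", "CONFIDENTIAL", "SECRET"),
                       ("FRUS", "SECRET", "CONFIDENTIAL"))
print(f"asymmetry ratio: WikiLeaks {wl_ratio:.1f}, FRUS {frus_ratio:.1f}")
```

On these counts the dominant error direction is about 23 times more frequent than the reverse in WikiLeaks, against roughly 5 times in FRUS.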

4.6.3. Security Marking Independence Verification

A central contribution of this study is the empirical demonstration that the model classifies documents based on latent semantic content rather than explicit security markers. The SHAP and LIME analyses, conducted exclusively on text from which all markings were removed, confirm this observation: classification-relevant tokens consistently consist of domain-specific vocabulary (such as intelligence sources, geopolitical actors, and operational terminology) rather than formatting artifacts. The [MASK] placeholder token, when present in attributions, contributes magnitudes that are lower than the leading features of the same class (e.g., [MASK] attribution ~0.086 vs. ‘section’ attribution ~0.211 for WikiLeaks CONFIDENTIAL). This disparity, roughly 2.5-fold in the example above, indicates that while the placeholder retains a structurally informative role, it is not semantically decisive. These results validate the security marking removal methodology and support the generalizability of the model to operational scenarios where explicit markers may be absent, redacted, or inconsistent with the document substance.

5. Discussion

The substantial performance margins achieved by the SACHN-DeBERTa-v3-Large architecture, which exceeds all established baselines by at least 10 percentage points in accuracy and 26 points in macro F1 across both corpora, stem from the combined effect of three primary architectural interventions. Firstly, the DeBERTa-v3-Large backbone’s disentangled attention mechanism decouples semantic content from positional embeddings. This allows the model to associate security-centric lexicon (such as ‘classified’, ‘proliferation’, or ‘contingency’) with the corresponding sensitivity tiers independently of where the terms appear within a cable. In contrast, prior methodologies relying on surface-form n-gram features [21] or unsupervised distributional representations [15] fail to establish such positionally agnostic and semantically discriminative boundaries; this is evidenced by their consistent inability to distinguish CONFIDENTIAL from SECRET documents beyond majority-class heuristics. Secondly, the integration of the Security-Aware Gate and the Supervised Contrastive Projection Head jointly enforces robust feature-level discrimination and embedding-space separation. This design directly addresses the high within-class semantic variance inherent in security corpora, where documents of an identical class may range from concise operational directives to extensive analytical assessments. Consequently, the proposed architecture surpasses all previously reported metrics on the WikiLeaks corpus, including the 80.04% F1 of CES2Vec+CNN [18] and the benchmark results of the ACESS system [14], as well as the DistilBERT and DistilRoBERTa performance reported by Heintz et al. [13] on the FRUS collection, despite operating exclusively on truncated textual data without structural metadata.
These results align with expectations from the transformer fine-tuning literature, which consistently reports that disentangled positional encoding and ELECTRA-style pretraining yield superior domain adaptation compared to masked language modeling alone, particularly in low-resource label regimes where inter-class semantic overlap is high.
The most consequential methodological refinement, however, was the strategic application of Focal Loss to the WikiLeaks corpus. Given that the SECRET class constitutes a mere 7.3% of the training data, standard baselines consistently fail under this severe imbalance; for instance, the pipeline established by Kesenek et al. [21] yielded a negligible 15.15% recall, while the TD2V+AQE system achieved only 3.86% F1 for this category. Implementing Focal Loss with $\gamma = 2.0$ elevated SECRET recall to 77.27%—representing a substantial 62-point improvement. This enhancement was achieved by down-weighting the gradient contributions from confidently classified majority-class instances, thereby redirecting the optimization signal toward ‘hard’ minority examples, a process consistent with the seminal formulation by Lin et al. [66]. In contrast, on the more balanced FRUS corpus, weighted cross-entropy proved sufficient to achieve uniformly high F1-scores across all four classes (88.46–93.80%). These results confirm that loss function selection must be meticulously tailored to the specific class distribution of each corpus rather than being applied as a uniform architectural default.
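The down-weighting mechanism described above can be made concrete with the per-example focal loss of Lin et al., $\mathrm{FL}(p_t) = -(1 - p_t)^{\gamma}\log(p_t)$, where $p_t$ is the probability assigned to the true class. The following is a minimal sketch, not the training code used in the study: it omits the optional class-balancing weight of the original formulation, and the probabilities 0.95 and 0.30 are made-up values chosen only to contrast an easy and a hard example.

```python
import math


def cross_entropy(p_true):
    """Standard cross-entropy for one example."""
    return -math.log(p_true)


def focal_loss(p_true, gamma=2.0):
    """Focal loss for one example: cross-entropy scaled by (1 - p_true)^gamma."""
    return ((1.0 - p_true) ** gamma) * cross_entropy(p_true)


# Hypothetical probabilities: a confidently classified majority-class
# document versus a hard minority-class (e.g. SECRET) document.
easy_p, hard_p = 0.95, 0.30

for name, p in (("easy", easy_p), ("hard", hard_p)):
    ce, fl = cross_entropy(p), focal_loss(p)
    print(f"{name}: p_t={p:.2f}  CE={ce:.4f}  FL={fl:.4f}  factor={fl / ce:.4f}")
```

With $\gamma = 2.0$ the easy example keeps only $(1 - 0.95)^2 = 0.0025$ of its cross-entropy loss, whereas the hard example keeps $(1 - 0.30)^2 = 0.49$, so the gradient signal is dominated by hard examples. Setting `gamma=0` recovers ordinary cross-entropy.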
The systematic underperformance of QLoRA fine-tuned Large Language Models (LLMs), whose accuracy does not exceed 64.48% on WikiLeaks even under the extended rank r = 64 and 8-bit INT8 configurations (Table 12), reflects a fundamental structural incompatibility between autoregressive generation and closed-set classification. Diplomatic security categorization demands high-fidelity discrimination where boundaries are defined by subtle bureaucratic judgment rather than explicit surface-level vocabulary. Consequently, an autoregressive decoder, fundamentally optimised for next-token prediction over an open-ended vocabulary, is not naturally predisposed to such a rigid classification regime. Extended experiments with r = 64 and 8-bit INT8 quantisation confirm that these underperformance patterns are not attributable to hyperparameter choices: quadrupling the LoRA rank yields improvements of at most 4.93 macro F1 points for any individual LLM–dataset combination, and relaxing quantisation from 4-bit to 8-bit does not consistently improve performance. Mistral-7B-Instruct, for instance, improves by 6.32 macro F1 points on WikiLeaks under 8-bit but simultaneously degrades by 12.88 percentage points in FRUS accuracy. Even the best extended result (Mistral-7B-Instruct with 8-bit INT8 and r = 64, achieving 64.48% accuracy and 50.28% macro F1 on WikiLeaks) remains 31.64 percentage points below SACHN-DeBERTa-v3-Large in accuracy. These findings indicate that discriminative encoder architectures remain the more robust paradigm for fine-grained, closed-set categorical classification tasks across the ranks, quantisation levels, and LLM parameter scales evaluated here.
The per-seed reproducibility analysis further raises a critical consideration for operational deployment. Seed 42 converged to a local minimum yielding only 87.46% accuracy on WikiLeaks and 70.84% on FRUS, despite validation F1 scores comparable to better-performing seeds—demonstrating that validation performance alone is an insufficient deployment criterion. In a high-stakes security classification context, a single misclassification can carry significant operational consequences; therefore, multi-seed ensemble inference is not merely a performance optimization but a deployment safety requirement. Practitioners seeking to adapt SACHN-DeBERTa-v3-Large to new archival collections should treat ensemble aggregation over at minimum three independently initialized runs as a prerequisite for production-grade reliability.
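One common way to realise multi-seed ensemble inference is soft voting, in which the class probabilities of the independently trained models are averaged before the arg-max is taken. The sketch below assumes that aggregation rule; the text does not specify it. The function name, the `min_seeds` guard, and the probability vectors are all hypothetical and serve only to show how a single divergent seed is outvoted.

```python
def ensemble_predict(per_seed_probs, min_seeds=3):
    """Average class probabilities across seeds and return (label, mean_probs).

    per_seed_probs: list of probability vectors, one per trained seed.
    """
    if len(per_seed_probs) < min_seeds:
        raise ValueError(f"need at least {min_seeds} seeds for ensembling")
    n_classes = len(per_seed_probs[0])
    mean_probs = [
        sum(probs[c] for probs in per_seed_probs) / len(per_seed_probs)
        for c in range(n_classes)
    ]
    label = max(range(n_classes), key=lambda c: mean_probs[c])
    return label, mean_probs


LABELS = ["UNCLASSIFIED", "CONFIDENTIAL", "SECRET"]

# Hypothetical softmax outputs for one document from three seeds.
# The third seed alone would predict SECRET.
seed_outputs = [
    [0.10, 0.70, 0.20],
    [0.15, 0.60, 0.25],
    [0.20, 0.30, 0.50],
]

label, mean_probs = ensemble_predict(seed_outputs)
print(LABELS[label], [round(p, 3) for p in mean_probs])
```

In this toy case the averaged distribution favours CONFIDENTIAL even though one seed disagrees, which is the stabilising behaviour the paragraph above argues for. Majority voting over hard labels would be an alternative design.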
Interpretability assessments via SHAP and LIME demonstrate that the decision-making process of SACHN-DeBERTa-v3-Large is anchored in semantic substance rather than residual metadata artifacts. High-sensitivity predictions are consistently driven by domain-specific vocabulary, including operational routing codes (‘sfo’), thematic indicators (‘chem’, ‘disarmament’), regional references (‘zaire’), and consultation-record terms (‘conv’). Crucially, the [MASK] sentinel token, despite its structural prevalence, yields attribution magnitudes lower than these leading content features. Furthermore, the LIME-based error analysis indicates that misclassifications are structurally coherent; CONFIDENTIAL documents mislabeled as UNCLASSIFIED typically lack intelligence depth, while SECRET FRUS cables regarding the 1982 Falklands War are occasionally elevated to TOP SECRET due to an operational urgency that mirrors the linguistic profile of the highest classification tier. These boundary collapses represent irreducible ambiguity within the original labeling criteria rather than fundamental architectural failures.
The primary limitations of this research include the restriction to English-language corpora, the 256-token truncation that discards content from longer FRUS records, and residual seed-level training instability. Three additional limitations merit explicit discussion in the context of practical deployment.
  • First, both datasets consist of historical archival documents: the FRUS corpus spans diplomatic cables from 1776 to approximately 1980, and the WikiLeaks corpus reflects U.S. Department of State cable traffic from the 2000s–2010s. Both corpora thus represent communication norms that may diverge meaningfully from contemporary diplomatic language, particularly as emerging domains—cybersecurity, climate diplomacy, digital infrastructure—have introduced novel specialised vocabulary absent from either dataset. A model trained on these corpora may exhibit degraded sensitivity discrimination when applied to post-2010 institutional communications involving these domains.
  • Second, the deployed ensemble consists of three DeBERTa-v3-Large models (approximately 400 million parameters each), requiring approximately 1.2 billion parameters to be loaded simultaneously for inference. This computational profile renders real-time inference impractical in resource-constrained or high-throughput document processing environments. Potential mitigation strategies include knowledge distillation to a single compact student model, single-seed deployment with calibrated confidence thresholds for uncertainty quantification, or asynchronous ensemble evaluation where latency constraints permit.
  • Third, the model has not been evaluated against adversarial semantic manipulation—such as deliberate paraphrasing of sensitive content to reduce its apparent classification signal, insertion of misleading context to shift classification toward a lower sensitivity tier, or document structure obfuscation. The Feature-Engineering Ensemble Classifier baseline [17] was specifically designed and evaluated against 18 categories of content evasion attacks, including synonym replacement, summarisation, and structural modification. The SACHN architecture, while achieving substantially higher accuracy on standard test data, inherits the same vulnerability surface as any supervised classifier when faced with adversarial inputs crafted to exploit the decision boundary between CONFIDENTIAL and SECRET documents. Evaluating SACHN’s robustness against such threat models constitutes an important direction for future work.
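The parameter count in the second limitation above translates into a rough memory budget. The following back-of-envelope sketch covers weights only and ignores activations, framework overhead, and the additional SACHN heads; the function name is hypothetical, and the two numeric precisions are assumptions chosen for illustration rather than the deployment configuration reported in the study.

```python
def ensemble_weight_gb(n_models, params_per_model, bytes_per_param):
    """Memory needed to hold the weights of all ensemble members, in GB (1e9 bytes)."""
    return n_models * params_per_model * bytes_per_param / 1e9


N_MODELS = 3                 # ensemble size quoted in the text
PARAMS_PER_MODEL = 400e6     # approximate DeBERTa-v3-Large size quoted in the text

total_params = N_MODELS * PARAMS_PER_MODEL
fp32_gb = ensemble_weight_gb(N_MODELS, PARAMS_PER_MODEL, bytes_per_param=4)
fp16_gb = ensemble_weight_gb(N_MODELS, PARAMS_PER_MODEL, bytes_per_param=2)

print(f"total parameters: {total_params / 1e9:.1f} billion")
print(f"weights in 32-bit floats: {fp32_gb:.1f} GB")
print(f"weights in 16-bit floats: {fp16_gb:.1f} GB")
```

Under these assumptions the three models occupy about 4.8 GB in 32-bit precision or 2.4 GB in 16-bit precision before any activations are counted, which gives a sense of why distillation to a single student model is listed as a mitigation.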
Future research will focus on cross-lingual extensions through multilingual DeBERTa variants, the implementation of hierarchical encoding for full-length documents, and the adaptation of the SACHN framework to international schemas, such as those utilized by NATO and the European Union.

6. Conclusions

This research introduced SACHN-DeBERTa-v3-Large, a novel hybrid architecture that integrates a DeBERTa-v3-Large backbone with a Security-Aware Gate, a Prototype Classification Layer, and a Supervised Contrastive Projection Head for the automated multi-class categorization of diplomatic document security levels. Upon evaluation using the WikiLeaks Cable Classifier (9005 documents, three classes) and a newly constructed FRUS dataset (24,706 documents, four classes) derived from declassified U.S. diplomatic archives, the model attained 96.12% accuracy and 92.57% macro F1 on WikiLeaks, alongside 92.11% accuracy and 91.02% macro F1 on the FRUS corpus. These results outperform all established baselines by at least 10 percentage points in accuracy and 26 points in macro F1, with statistical significance rigorously confirmed by McNemar’s test (p < 0.001). Furthermore, the application of Focal Loss effectively elevated the SECRET class recall from a baseline high of 15.15% to 77.27%, demonstrating the pivotal role of specialized loss function design when navigating severe class imbalances in diplomatic textual data.
A controlled comparative study involving multiple Large Language Models (LLMs) under identical QLoRA fine-tuning conditions demonstrated that LLaMA-3-8B-Instruct, Qwen2.5-14B-Instruct, and Mistral-7B-Instruct-v0.3 exhibit a structural mismatch for closed-set security classification, with none exceeding 57.16% accuracy under the primary QLoRA protocol (r = 16). Extended experiments with r = 64 and 8-bit INT8 quantisation yielded a best result of 64.48% (Mistral-7B-Instruct, WikiLeaks), confirming that the performance gap relative to SACHN-DeBERTa-v3-Large remains structural. In addition, post-hoc SHAP and LIME analyses showed that predictions are grounded in domain-specific semantic content rather than formatting artifacts. This validates the security marking removal methodology as a robust experimental control and supports the model’s viability for operational deployment in environments where explicit markers may be absent or redacted.
To support reproducibility and facilitate future research, the custom-curated FRUS dataset will be released on the Kaggle platform upon publication. Future research directions will focus on extending the SACHN framework to non-English diplomatic archives, implementing hierarchical encoding to process full-length documents, and adapting the system to international classification schemas, such as those utilized by NATO and the European Union.

7. Patents

A patent application based on the SACHN-DeBERTa-v3-Large architecture and methodology described in this manuscript is being prepared for submission.

Author Contributions

Conceptualization, M.T.S. and M.D.; methodology, M.T.S.; software, M.T.S.; validation, M.T.S. and M.D.; formal analysis, M.T.S.; investigation, M.T.S.; resources, M.T.S. and M.D.; data curation, M.T.S.; writing—original draft preparation, M.T.S.; writing—review and editing, M.T.S. and M.D.; visualization, M.T.S.; supervision, M.D.; project administration, M.T.S. and M.D.; funding acquisition, M.D. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by Gazi University Scientific Research Projects Coordination Unit (BAP), grant number FYL-2026-10676.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The WikiLeaks Cable Classifier dataset analyzed in this study is publicly available at [8]. The FRUS dataset constructed for this study will be made publicly available on Kaggle under the name “FRUS Dataset” upon publication of this article. The raw FRUS XML source files are openly accessible through the U.S. Department of State FRUS GitHub repository [10].

Acknowledgments

This work has been supported by Gazi University Scientific Research Projects Coordination Unit under grant number FYL-2026-10676. The numerical calculations reported in this paper were fully performed at TUBITAK ULAKBIM, High Performance and Grid Computing Center (TRUBA resources).

Conflicts of Interest

The authors declare no conflicts of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.

Abbreviations

The following abbreviations are used in this manuscript:
SACHN: Security-Aware Classification with Hybrid Networks
DeBERTa: Decoding-enhanced BERT with Disentangled Attention
BERT: Bidirectional Encoder Representations from Transformers
FRUS: Foreign Relations of the United States
SHAP: SHapley Additive exPlanations
LIME: Local Interpretable Model-agnostic Explanations
XAI: Explainable Artificial Intelligence
NLP: Natural Language Processing
LLM: Large Language Model
QLoRA: Quantized Low-Rank Adaptation
LoRA: Low-Rank Adaptation
NF4: NormalFloat 4-bit quantization
SAG: Security-Aware Gate
PCL: Prototype Classification Layer
LLRD: Layer-wise Learning Rate Decay
AMP: Automatic Mixed Precision
SupCon: Supervised Contrastive Loss
TF-IDF: Term Frequency–Inverse Document Frequency
SVM: Support Vector Machine
CNN: Convolutional Neural Network
BiLSTM: Bidirectional Long Short-Term Memory
RNN: Recurrent Neural Network
MLP: Multi-layer Perceptron
LDA: Latent Dirichlet Allocation
LSA: Latent Semantic Analysis
DLP: Data Loss Prevention
AQE: Automatic Query Expansion
TD2V: Twitter-based Document2Vec
FAISS: Facebook AI Similarity Search
ROC-AUC: Receiver Operating Characteristic–Area Under the Curve
HPC: High Performance Computing
SLURM: Simple Linux Utility for Resource Management
FOI: Freedom of Information
POS: Part-of-Speech
GCN: Graph Convolutional Network
HMM: Hidden Markov Model
CRF: Conditional Random Field
OCR: Optical Character Recognition
RLHF: Reinforcement Learning from Human Feedback
GQA: Grouped Query Attention
RoPE: Rotary Position Embedding
DPO: Direct Preference Optimization
FPR: False Positive Rate
FDR: False Discovery Rate
FNR: False Negative Rate
WL: WikiLeaks
F1_M: F1 Macro
F1_W: F1 Weighted

References

  1. Zhuo, L.; Sagduyu, Y. Risk Assessment Based Access Control with Text and Behavior Analysis for Document Management; IEEE: New York, NY, USA, 2016; pp. 37–42.
  2. Ahmed, H.; Traore, I.; Saad, S.; Mamun, M. Automated detection of unstructured context-dependent sensitive information using deep learning. Internet Things 2021, 16, 100444.
  3. Tantuğ, A.C. Text Classification. Comput. Sci. Eng. J. 2016, 5, 2. Available online: https://izlik.org/JA59NK46AW (accessed on 27 May 2026).
  4. Alzhrani, K.; Rudd, E.M.; Chow, C.E.; Boult, T.E. Automated Big Security Text Pruning and Classification. In Proceedings of the 2016 IEEE International Conference on Big Data (Big Data), Washington, DC, USA, 5–8 December 2016; pp. 3629–3637.
  5. Alparslan, E.; Karahoca, A.; Bahsi, H. Classification of confidential documents by using adaptive neurofuzzy inference systems. Procedia Comput. Sci. 2011, 3, 1412–1417.
  6. Richter, M.; Street, M.; Lenk, P. Deep Learning NATO document labels: A preliminary investigation. In Proceedings of the 2018 International Conference on Military Communications and Information Systems (ICMCIS), Warsaw, Poland, 22–23 May 2018.
  7. Han, Y.; Lee, J.; Lee, J.; Chang, H. A Review of Text-Based Information Security Rating: Fundamental Concepts, Methods, Datasets, Challenges, and Future Works. Hum.-Centric Comput. Inf. Sci. 2025, 15, 32.
  8. Lu, Y. (Ed.) Wikileaks Cable Classifier; Kaggle: San Francisco, CA, USA, 2021.
  9. Department of State Office of The Historian. The Foreign Relations of the United States (FRUS). Available online: https://history.state.gov/historicaldocuments (accessed on 27 May 2026).
  10. United States Department of State Office of The Historian. FRUS. 2022. Available online: https://github.com/HistoryAtState/frus (accessed on 27 May 2026).
  11. Yazidi, A.; Hammer, H.; Bai, A.; Engelstad, P.E. On the Feasibility of Machine Learning as a Tool for Automatic Security Classification: A Position Paper. In Proceedings of the 2016 International Conference on Computing, Networking and Communications (ICNC), Kauai, HI, USA, 15–18 February 2016.
  12. Richter, M.; Wrona, K. Devil in the details: Assessing automated confidentiality classifiers in context of NATO documents. In Proceedings of the First Italian Conference on Cybersecurity (ITASEC17), Venice, Italy, 17–20 January 2017; pp. 136–145.
  13. Heintz, I.; Grothendieck, J.; Bernardin, F.; Kuperman, G. Improving Text Security Classification Towards an Automated Information Guard. In Proceedings of the MILCOM 2022–2022 IEEE Military Communications Conference (MILCOM), Rockville, MD, USA, 28 November–2 December 2022.
  14. Alzhrani, K.; Rudd, E.M.; Boult, T.E.; Chow, C.E. Automated Big Text Security Classification. In Proceedings of the IEEE International Conference on Intelligence and Security Informatics: Cybersecurity and Big Data, Tucson, AZ, USA, 28–30 September 2016; pp. 103–108.
  15. Trieu, L.Q.; Tran, T.; Tran, M.; Tran, M. Document Sensitivity Classification for Data Leakage Prevention with Twitter-based Document Embedding and Query Expansion. In Proceedings of the 2017 13th International Conference on Computational Intelligence and Security (CIS), Hong Kong, China, 15–18 December 2017; pp. 537–542.
  16. Liang, Y.; Wen, Z.P.; Tao, Y.Z.; Li, G.L.; Guo, B. Automatic Security Classification Based on Incremental Learning and Similarity Comparison. In Proceedings of the 2019 IEEE 8th Joint International Information Technology and Artificial Intelligence Conference (ITAIC), Chongqing, China, 24–26 May 2019; Volume 2019, pp. 812–817.
  17. Alparslan, E.; Karahoca, A.; Bahsi, H. Security-level classification for confidential documents by using adaptive neuro-fuzzy inference systems. Expert Syst. 2013, 30, 233–242.
  18. Jiang, J.G.; Lu, Y.; Yu, M.; Li, G.; Liu, C.; An, S.H.; Huang, W.Q. CES2Vec: A Confidentiality-Oriented Word Embedding for Confidential Information Detection. In Proceedings of the 2020 IEEE Symposium on Computers and Communications (ISCC), Rennes, France, 7–10 July 2020; pp. 289–295.
  19. Alzhrani, K.; Alrasheedi, F.S.; Kateb, F.A.; Boult, T.E. CNN with Paragraph to Multi-Sequence Learning for Sensitive Text Detection. In Proceedings of the 2019 2nd International Conference on Computer Applications & Information Security (ICCAIS), Riyadh, Saudi Arabia, 1–3 May 2019.
  20. Subhashini, P.; Rani, B.P. Confidential Terms Detection Using Language Modeling Technique in Data Leakage Prevention. Adv. Intell. Syst. 2016, 381, 271–279.
  21. Kesenek, Y.; Özçelik, I.; Kaya, E. Zararlı yazılım kaynaklı veri kaçırma ataklarına karşı yeni bir doküman sınıflandırma algoritması. J. Fac. Eng. Archit. Gaz. 2022, 37, 1639–1654.
  22. Hart, M.; Manadhata, P.; Johnson, R. Text Classification for Data Loss Prevention. In Proceedings of the 11th International Conference on Privacy Enhancing Technologies, Waterloo, ON, Canada, 27–29 July 2011; pp. 18–37.
  23. Lin, Y.; Xu, G.S.; Xu, G.A.; Chen, Y.D.; Sun, D.W. Sensitive Information Detection based on Convolution Neural Network and Bi-directional LSTM. In Proceedings of the 2020 IEEE 19th International Conference on Trust, Security and Privacy in Computing and Communications (TrustCom), Guangzhou, China, 29 December 2020–1 January 2021; pp. 1614–1621.
  24. Sulavko, A.; Varkentin, Y.; Panfilova, I.; Samotuga, A. Automatic classification of text messages by confidentiality level based on ensemble of artificial neural networks. In Proceedings of the 2024 2nd International Conference on Foundation and Large Language Models (FLLM), Dubai, United Arab Emirates, 26–29 November 2024; pp. 484–492.
  25. Han, Y.; Shim, S.; Gajulamandyam, D.K.; Chang, H. Prompt-based Generation Strategy for Imbalanced Information Security Rating Dataset Augmentation. In Proceedings of the 2024 International Conference on AI x Data and Knowledge Engineering (AIxDKE), Tokyo, Japan, 11–13 December 2024; pp. 117–122.
  26. Tan, L.L.; Yi, J.K.; Yang, F. Improving Performance of Massive Text Real-Time Classification for Document Confidentiality Management. Appl. Sci. 2024, 14, 1565.
  27. Liu, Y.; Yang, C.Y.; Yang, J. A Graph Convolutional Network-Based Sensitive Information Detection Algorithm. Complexity 2021, 2021, 6631768.
  28. Huang, Z.Y. Sensitive Information Detection Using HMM&SVM. In Proceedings of the 2021 3rd International Conference on Intelligent Medicine and Image Processing (IMIP 2021), Tianjin, China, 23–26 April 2021; pp. 146–150.
  29. Gang, L.; Xu, W.; Xie, X.Q. A Method of Document Sensitivity Calculation Based on Semantic Dependency Analysis. In Proceedings of the 2019 IEEE 8th Joint International Information Technology and Artificial Intelligence Conference (ITAIC), Chongqing, China, 24–26 May 2019; pp. 34–40.
  30. McDonald, G.; García-Pedrajas, N.; Macdonald, C.; Ounis, I. A Study of SVM Kernel Functions for Sensitivity Classification Ensembles with POS Sequences. In Proceedings of the SIGIR’17: Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval, Tokyo, Japan, 7–11 August 2017; pp. 1097–1100.
  31. Alneyadi, S.; Sithirasenan, E.; Muthukkumarasamy, V. Detecting data semantic: A data leakage prevention approach. In Proceedings of the 2015 IEEE Trustcom/BigDataSE/ISPA, Helsinki, Finland, 20–22 August 2015; Volume 2015, pp. 910–917.
  32. Perron, B.E.; Luan, H.; Victor, B.G.; Hiltz-Perron, O.; Ryan, J. Moving Beyond ChatGPT: Local Large Language Models (LLMs) and the Secure Analysis of Confidential Unstructured Text Data in Social Work Research. Res. Soc. Work Pract. 2024, 35, 695–710.
  33. Sinoara, R.A.; Camacho-Collados, J.; Rossi, R.G.; Navigli, R.; Rezende, S.O. Knowledge-enhanced document embeddings for text classification. Knowl.-Based Syst. 2019, 163, 955–971.
  34. Timmer, R.C.; Liebowitz, D.; Nepal, S.; Kanhere, S.S. Can pre-trained Transformers be used in detecting complex sensitive sentences?—A Monsanto case study. In Proceedings of the 2021 Third IEEE International Conference on Trust, Privacy and Security in Intelligent Systems and Applications (TPS-ISA), Atlanta, GA, USA, 13–15 December 2021; Volume 2021, pp. 90–97.
  35. Wu, J.P.; Chen, L.F.; Fang, S.Y.; Wu, C.M. An Application Programming Interface (API) Sensitive Data Identification Method Based on the Federated Large Language Model. Appl. Sci. 2024, 14, 10162.
  36. Chong, P. Deep Learning Based Sensitive Data Detection. In Proceedings of the 2022 19th International Computer Conference on Wavelet Active Media Technology and Information Processing (ICCWAMTIP), Chengdu, China, 16–18 December 2022.
  37. Aydin, N.; Erdem, O.A.; Tekerek, A. Comparative Analysis of Traditional Machine Learning and Transformer-based Deep Learning Models for Text Classification. J. Polytech. 2024, 28, 445–452.
  38. Tran, Q.; Shpileuskaya, K.; Zaunseder, E.; Putzar, L.; Blankenburg, S. Comparing the Robustness of Classical and Deep Learning Techniques for Text Classification. In Proceedings of the 2022 International Joint Conference on Neural Networks (IJCNN), Padua, Italy, 18–23 July 2022.
  39. Chen, H.H.; Wu, L.; Chen, J.P.; Lu, W.; Ding, J.H. A comparative study of automated legal text classification using random forests and deep learning. Inf. Process. Manag. 2022, 59, 102798.
  40. Mohammed, A.; Kora, R. An effective ensemble deep learning framework for text classification. J. King Saud Univ.-Comput. Inf. Sci. 2022, 34, 8825–8837.
  41. Jiang, S.; Hu, J.; Magee, C.L.; Luo, J.X. Deep Learning for Technical Document Classification. IEEE Trans. Eng. Manag. 2024, 71, 1163–1179.
  42. Schoppmann, P. Secure Computation Protocols for Privacy-Preserving Machine Learning. 2021. Available online: https://edoc.hu-berlin.de/server/api/core/bitstreams/05b68f3a-a320-45d6-bd04-881ac21a6dda/content (accessed on 27 May 2026).
  43. Xu, G.; Qi, C.; Yu, H.; Xu, S.; Zhao, C.; Yuan, J. Detecting Sensitive Information of Unstructured Text Using Convolutional Neural Network. In Proceedings of the 2019 International Conference on Cyber-Enabled Distributed Computing and Knowledge Discovery (CyberC), Guilin, China, 17–19 October 2019; pp. 474–479.
  44. Park, J.-S.; Kim, G.-W.; Lee, D.-H. Sensitive Data Identification in Structured Data through GenNER Model based on Text Generation and NER. In Proceedings of the CNIOT’20: Proceedings of the 2020 International Conference on Computing, Networks and Internet of Things, Sanya, China, 24–26 April 2020; pp. 36–40.
  45. Chen, M.; Zhang, Y.; Zhang, P. Research on Confidentiality Management of Language Technology Resources Based on the Combination of BERT and TextRank Algorithms with a Language Technology Vocabulary. In Proceedings of the SPCNC’24: Proceedings of the 3rd International Conference on Signal Processing, Computer Networks and Communications, Sanya, China, 22–24 December 2025; pp. 348–352. [Google Scholar]
  46. Liang, Y.; Gao, E.; Ma, Y.; Zhan, Q.; Sun, D.; Gu, X. Contextual Analysis Using Deep Learning for Sensitive Information Detection. In Proceedings of the 2024 International Conference on Computers, Information Processing and Advanced Education (CIPAE), Ottawa, ON, Canada, 26–28 August 2024; pp. 633–637. [Google Scholar]
  47. Fu, H.; Meng, L.; Wei, S.; Nong, C.; Tan, Y. Research on Intelligent Sensitive Data Recognition and Annotation Technology Based on Long Text. In Proceedings of the 4th International Conference on Electronics, Circuits and Information Engineering, ECIE 2024, Hangzhou, China, 24–26 May 2024; pp. 433–438. [Google Scholar]
  48. Kaliappan, V.K.; Dharunkumar, U.P.; Uppili, S.; Bharani, S. SentinelGuard: An Integration of Intelligent Text Data Loss Prevention Mechanism for Organizational Security (I-ITDLP). In Proceedings of the 2024 International Conference on Science, Technology, Engineering and Management, ICSTEM 2024, Coimbatore, India, 26–27 April 2024. [Google Scholar]
  49. Li, P.; Fu, X.; Chen, J.; Hu, J. CoGraphNet for enhanced text classification using word-sentence heterogeneous graph representations and improved interpretability. Sci. Rep. 2025, 15, 356. [Google Scholar] [CrossRef]
  50. Ren, L.; Liu, Y.; Ouyang, C.; Yu, Y.; Zhou, S.; He, Y.; Wan, Y. DyLas: A dynamic label alignment strategy for large-scale multi-label text classification. Inf. Fusion 2025, 120, 103081. [Google Scholar] [CrossRef]
  51. Wang, J.; Zhang, R.; Han, H. Text Classification based on Active learning and Meta-learning. In Proceedings of the 4th Asia-Pacific Artificial Intelligence and Big Data Forum; Association for Computing Machinery: New York, NY, USA, 2025; pp. 816–820. [Google Scholar]
  52. Zhang, H.; Wang, J.H.; Gao, H.R.; Zhang, X.; Wang, H.W.; Li, W.M. VIWHard: Text adversarial attacks based on important-word discriminator in the hard-label black-box setting. Neurocomputing 2025, 616, 128917. [Google Scholar] [CrossRef]
  53. Alparslan, G. Document Classification Using Machine Learning. 2023. Available online: https://tez.yok.gov.tr/ (accessed on 27 May 2026).
  54. Riyadi, A.; Kovacs, M.; Serdült, U.; Kryssanov, V. IndoGovBERT: A Domain-Specific Language Model for Processing Indonesian Government SDG Documents. Big Data Cogn. Comput. 2024, 8, 153. [Google Scholar] [CrossRef]
  55. Sujatha, R.; Nimala, K. Classification of Conversational Sentences Using an Ensemble Pre-Trained Language Model with the Fine-Tuned Parameter. Comput. Mater. Contin. 2024, 78, 1669–1686. [Google Scholar] [CrossRef]
  56. Alzamel, K.; Alajmi, M. An application of textual document classification for Arabic governmental correspondence. Kuwait J. Sci. 2025, 52, 100299. [Google Scholar] [CrossRef]
  57. Mao, X.L.; Li, Z.H.; Li, Q.X.; Zhang, S.C. BERT-DXLMA: Enhanced representation learning and generalization model for english text classification. Neurocomputing 2025, 622, 129325. [Google Scholar] [CrossRef]
  58. Prabhakar, P.; Pati, P.B. Enhancing Indian legal judgment classification with embeddings, feature selection, and ensemble strategies. Artif. Intell. Law 2025, 34, 523–564. [Google Scholar] [CrossRef]
  59. Liu, X.H.; Chen, J.F.; Huang, Q.B. Instance-Level Weighted Contrast Learning for Text Classification. Appl. Sci. 2025, 15, 4236. [Google Scholar] [CrossRef]
  60. United States. Executive Order 13526: Classified National Security Information. Office of the Federal Register, National Archives and Records Administration, 29 December 2009. Available online: https://www.govinfo.gov/app/details/DCPD-200901022 (accessed on 27 May 2026).
  61. He, P.; Gao, J.; Chen, W. DeBERTaV3: Improving DeBERTa using ELECTRA-Style Pre-Training with Gradient-Disentangled Embedding Sharing. In Proceedings of the Eleventh International Conference on Learning Representations, Kigali, Rwanda, 1–5 May 2023. [Google Scholar]
  62. Clark, K.; Luong, M.-T.; Le, Q.V.; Manning, C.D. ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators. In Proceedings of the International Conference on Learning Representations, Addis Ababa, Ethiopia, 26 April–1 May 2020. [Google Scholar]
  63. Snell, J.; Swersky, K.; Zemel, R.S. Prototypical Networks for Few-shot Learning. In Proceedings of the Advances in Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017; pp. 4080–4090. [Google Scholar]
  64. Glorot, X.; Bengio, Y. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, Proceedings of Machine Learning Research, Sardinia, Italy, 13–15 May 2010; pp. 249–256. [Google Scholar]
  65. Khosla, P.; Teterwak, P.; Wang, C.; Sarna, A.; Tian, Y.; Isola, P.; Maschinot, A.; Liu, C.; Krishnan, D. Supervised Contrastive Learning. Adv. Neural Inf. Process. Syst. 2020, 33, 18661–18673. [Google Scholar]
  66. Lin, T.-Y.; Goyal, P.; Girshick, R.; He, K.; Dollár, P. Focal loss for dense object detection. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017. [Google Scholar]
  67. Sun, C.; Qiu, X.; Xu, Y.; Huang, X. How to fine-tune BERT for text classification? In Proceedings of the China National Conference on Chinese Computational Linguistics, Kunming, China, 18–20 October 2019; pp. 194–206. [Google Scholar]
  68. Ribeiro, M.T.; Singh, S.; Guestrin, C. Why should I trust you? Explaining the predictions of any classifier. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, USA, 13–17 August 2016; pp. 1135–1144. [Google Scholar]
  69. Le, Q.; Mikolov, T. Distributed representations of sentences and documents. In Proceedings of the International Conference on Machine Learning, Beijing, China, 21–26 June 2014; pp. 1188–1196. [Google Scholar]
  70. Johnson, J.; Douze, M.; Jégou, H. Billion-scale similarity search with GPUs. IEEE Trans. Big Data 2019, 7, 535–547. [Google Scholar] [CrossRef]
  71. Dettmers, T.; Pagnoni, A.; Holtzman, A.; Zettlemoyer, L. QLoRA: Efficient finetuning of quantized LLMs. Adv. Neural Inf. Process. Syst. 2023, 36, 10088–10115. [Google Scholar]
  72. Grattafiori, A.; Dubey, A.; Jauhri, A.; Pandey, A.; Kadian, A.; Al-Dahle, A.; Letman, A.; Mathur, A.; Schelten, A.; Vaughan, A. The Llama 3 herd of models. arXiv 2024, arXiv:2407.21783. [Google Scholar] [CrossRef]
  73. Yang, Q.A.; Yang, B.; Zhang, B.; Hui, B.; Zheng, B.; Yu, B.; Li, C.; Liu, D.; Huang, F.; Dong, G.; et al. Qwen2.5 Technical Report. arXiv 2024, arXiv:2412.15115. [Google Scholar]
  74. Jiang, A.Q.; Sablayrolles, A.; Mensch, A.; Bamford, C.; Chaplot, D.S.; Casas, D.; Bressand, F.; Lengyel, G.; Lample, G.; Saulnier, L. Mistral 7B. arXiv 2023, arXiv:2310.06825. [Google Scholar] [CrossRef]
  75. Lundberg, S.M.; Lee, S.-I. A Unified Approach to Interpreting Model Predictions. Adv. Neural Inf. Process. Syst. 2017, 30. Available online: https://proceedings.neurips.cc/paper_files/paper/2017/file/8a20a8621978632d76c43dfd28b67767-Paper.pdf (accessed on 27 May 2026).
Figure 1. Distribution of original confidentiality labels in the WikiLeaks Cable Classifier dataset.
Figure 2. Distribution of original confidentiality labels in the FRUS dataset.
Figure 3. Proposed SACHN-DeBERTa-v3 architecture and modular workflow.
Figure 4. Layer-wise Learning Rate Decay (LLRD) Schedule.
Figure 5. Per-class SHAP token attribution for the WikiLeaks test set (seed 123, security markings removed).
Figure 6. Per-class SHAP token attribution for the FRUS test set (seed 456, security markings removed).
Figure 7. Per-class aggregate LIME feature attribution for the WikiLeaks dataset (45 instances, 300 perturbations per explanation).
Figure 8. Per-class aggregate LIME feature attribution for the FRUS dataset (60 instances, 300 perturbations per explanation).
Figure 9. SHAP attribution comparison between correctly classified and misclassified WikiLeaks instances.
Table 1. Summary of literature on the classification of information, documents, and texts by confidentiality levels.
| Ref. | Year | Method | Datasets | Metric | Performance (%) |
|---|---|---|---|---|---|
| [4] | 2016 | LDA-based Topic Pruning + Logistic Regression | WikiLeaks (Diplomatic Cables) | p-value/F1-score | 0.0007 (p value)/Significant Gains |
| [5] | 2011 | Hybrid SVM-ANFIS (Support Vector Machines + Adaptive Neuro-Fuzzy Inference Systems) | 222 Turkish Docs (TUBITAK UEKAE) | Accuracy | 96.67 |
| [6] | 2018 | ConvNet (Convolutional Neural Networks) + 50-dim Embeddings | Historical NATO Documents (1950s Military Records) | Accuracy Improvement | 120–130 (1.2 to 1.3-fold gain over baseline) |
| [11] | 2016 | AdaBoost (Comparison of 10 ML Algorithms) + TF-IDF | DNSA (2884 docs on Geopolitical Conflicts) | Accuracy | >90 |
| [12] | 2017 | Random Forest + KeyGraph Algorithm + Porter Stemming | 30,000+ Historical NATO Documents (3 Security Tiers) | Accuracy/Cohen’s Kappa | 80 |
| [13] | 2022 | DistilBERT & DistilRoBERTa (Transformer-based) | 101,142 documents (FRUS Collection) | F1-score | High/Scalable (Specific % not listed in snippet, but “High Robustness” noted) |
| [14] | 2016 | ACESS (Hybrid K-means Clustering + Localized Linear SVM) | WikiLeaks (Diplomatic Cables) | F1-Measure | >90 |
| [15] | 2017 | TD2V (Document Embedding) + Modified k-NN + AQE | Enron, Snowden, Dyncorp | Accuracy | >99 |
| [16] | 2019 | ILSC (Incremental Learning and Similarity Comparison) + ISVM | Expanding Document Repository (5 Security Echelons) | Accuracy | 87.4 |
| [17] | 2013 | Hybrid SVM-ANFIS + Zemberek (Morphological) + CACC Discretization | 222 Turkish Docs (TUBITAK UEKAE) | Accuracy | 96 |
| [18] | 2020 | CES2Vec (Confidentiality-driven Word2Vec Extension) + CNN | 89,681 WikiLeaks Paragraphs | F1-score | 80.04 |
| [19] | 2019 | MS-CNN (Multi-Sequence Wide Convolutional Neural Network) | WikiLeaks, Panama Papers | F1-score | 77.3 (Highly Sensitive Category) |
| [20] | 2016 | K-means Clustering + Cluster-specific Language Models (Dirichlet Smoothing) | Enron Corpus | Conceptual Framework | N/A * (Qualitative/Probability-based) |
| [21] | 2022 | LSA + Multi-layered Feature Extraction + Spelling Correction | 5539 docs (TM, Mormon, Dyncorp, DBPedia) | Baseline Acc./Adversarial F1 | 99.5/>90 (CNN dropped to 55) |
| [22] | 2011 | Hierarchical Two-stage (Primary + Meta-classifier) with extra information | Enron Corpus | FPR/FDR/FNR | <0.1 (from 87.2)/0.92/<3.0 |
| [23] | 2020 | Feature Fusion (CNN + BiLSTM) + Word2Vec | 7500 Chinese Docs (WikiLeaks & Sohu News) | Accuracy/F1-score | 93.44/93.82 |
| [24] | 2024 | Multi-stage (Autoencoder + CNN Ensemble) with E5 Embeddings | 1.1M Russian Texts (10 Organizations) | Binary Acc./FNR/Multi-class Acc. | 97.84/0.015/90.3 |
| [25] | 2024 | Prompt-based augmentation (GPT-4o, CoT + sliding window) + Fine-tuned LLaMA 3.1 8B with LoRA | WikiLeaks (3-class: Secret, Confidential, Unclassified) | Accuracy | 99 (few-shot)/98 (one-shot) |
| [7] | 2025 | Systematic review: Rule-based → Clustering → ML → Deep Learning (Survey) | WikiLeaks, Reuters, TUBITAK UEKAE, Enron, DNSA | Survey | N/A |

* N/A = Not Available.
Table 2. Summary of literature on the classification of information, documents, and texts by diverse context.
| Ref. | Year | Model | Application Domain/Task |
|---|---|---|---|
| [26] | 2024 | k-NN + TF-DE + AVX-256 | Large-scale text confidentiality classification |
| [27] | 2021 | GCSA (GCN + Self-Attention) | Sensitive information detection |
| [28] | 2021 | HMM + Poly-Kernel SVM | Corporate sensitive info detection in config files |
| [29] | 2019 | CRF + Semantic Tree (SDC) | Multi-level sensitive information detection |
| [30] | 2017 | POS + SVM Ensemble (WMV) | FOI-exempt government document classification |
| [31] | 2015 | TF-IDF + SVD (DLP) | 6-category sensitivity detection/DLP |
| [32] | 2024 | Zero-shot LLMs (Mistral, Llama3) | Child welfare case text classification |
| [33] | 2019 | Babel2Vec + NASARI | Cross-lingual short-text classification |
| [34] | 2021 | Fine-tuned BERT | Sensitive sentence detection in legal documents |
| [35] | 2024 | FedAPILLM (LoRA + LLMs) | Privacy-preserving API sensitive content detection |
| [36] | 2022 | Regex + BERT-CRF | Real-time sensitive data recognition |
| [37] | 2024 | BERT/GPT-3 vs. NB/DT | Multi-level educational document classification |
| [38] | 2022 | VDCNN/Transformer (Robustness AUC) | Robustness evaluation under data corruption |
| [39] | 2022 | Random Forest + PCA | Large-scale legal document classification |
| [40] | 2022 | CNN + LSTM + GRU Ensemble | Multi-domain text classification |
| [41] | 2024 | TechDoc (RNN + CNN + GNN) | Technical patent document classification |
| [42] | 2021 | Privacy-preserving ML (DP) | Secure text classification |
| [43] | 2019 | Text-CNN + Word2Vec | Military/political sensitive text detection |
| [44] | 2020 | GenNER (BiLSTM-CRF) | Korean table sensitive information detection |
| [45] | 2025 | BERT + TextRank (Graph) | Language resource security classification |
| [46] | 2024 | A-ELMo + BiLSTM | Social media sensitive information detection |
| [47] | 2024 | BERT + GBM (Multi-modal) | Long-text sensitive information recognition |
| [48] | 2024 | SentinelGuard (BERT-DLP) | Healthcare DLP/PII & PHI protection |
| [49] | 2025 | CoGraphNet (GRU + GCN) | Interpretable text classification |
| [50] | 2025 | DyLas (LLM Label Alignment) | Large-scale multi-label text classification |
| [51] | 2025 | Meta-Active Learning (TextGCN) | Low-resource text classification |
| [52] | 2025 | VIWHard (Adversarial NLP) | Adversarial attacks on text classifiers |
| [53] | 2023 | CNN vs. SVM/Naive Bayes | Turkish multi-class text classification |
| [54] | 2024 | IndoGovBERT | Indonesian government SDG document classification |
| [55] | 2024 | EPLM-HT (BERT+RoBERTa+GPT+XLNet) | Conversational sentence classification |
| [56] | 2025 | GigaBERT/AraBERT/XLM-RoBERTa | Arabic governmental correspondence classification |
| [57] | 2025 | BERT-DXLMA + Focal Loss | Imbalanced English text classification |
| [58] | 2025 | T5 + SMOTE + Voting Ensemble | Indian legal judgment classification |
| [59] | 2025 | Instance-Level Contrastive Learning | Text classification via weighted augmentation |
Table 3. Attribute Fields of the WikiLeaks Cable Classifier Dataset.
| Attribute | Description |
|---|---|
| Date | The timestamp indicating when the document was generated. |
| Canonical_ID | The unique identifier assigned to the diplomatic cable. |
| Original_Classification | The initial security label assigned at the document’s origin. |
| Current_Classification | The security status of the document at the time of the leak. |
| Handling_Restrictions | Specific codes (e.g., NOFORN) limiting the distribution of the document. |
| Character_Count | The total length of the document in characters. |
| Executive_Order | The legal mandate or executive authority governing the classification. |
| Locator | The digital or physical storage address within the archive system. |
| TAGS | Metadata keywords associated with the document’s subject matter. |
| Concepts | High-level thematic categories extracted from the document content. |
| Enclosure | References to any physical or digital attachments accompanying the cable. |
| Type | The specific document format (e.g., TE—Telegram). |
| Office_Origin | The specific embassy or government office that authored the cable. |
| Office_Action | The department responsible for responding to or acting upon the cable. |
| Archive_Status | The status of the record within the official archival lifecycle. |
| From | The specific author or senior official sending the document. |
| Markings | Standardized archival stamps or metadata flags. |
| To | The primary recipient or office intended to receive the document. |
| Content | The full-text body of the cable used for NLP-based feature extraction. |
| Linked_documents_or_other_documents_with_the_same_ID | References to related diplomatic records or prior correspondences. |
Table 4. Distribution of FRUS Documents by Presidential Administration.
| Administration | Period | Document Count |
|---|---|---|
| John F. Kennedy Administration | 1961–1963 | 2726 |
| Lyndon B. Johnson Administration | 1964–1968 | 6182 |
| Richard M. Nixon Administration | 1969–1974 | 8993 |
| Jimmy Carter Administration | 1977–1980 | 4478 |
| Ronald Reagan Administration | 1981–1988 | 2327 |
Table 5. Attribute Fields of the FRUS Dataset.
| Attribute | Description |
|---|---|
| id | The unique identifier assigned to each archival document. |
| administration | The specific U.S. presidential administration context. |
| document_main_issue | The primary thematic volume or diplomatic category. |
| document_sub_issue | The secondary sub-topic or geographical focus. |
| document_title | The official subject line or title of the record. |
| document_date | The formal date of authorship or archival entry. |
| document_archive_source | The specific physical or digital source within the archives. |
| document_confidentiality | The original security label (Target Label for classification). |
| document_publication | References to the specific FRUS publication volume. |
| content | The full-text archival content used for NLP processing. |
Table 6. SACHN-DeBERTa-v3 Architecture Summary.
| Component | Input Dim | Output Dim | Parameters |
|---|---|---|---|
| DeBERTa-v3-Large backbone | Token IDs | B × L × 1024 | ~304M |
| Security-Aware Gate | B × L × 1024 | B × L × 1024 | 1025 |
| CLS extraction | B × L × 1024 | B × 1024 |  |
| Feature projection (FC1) | 1024 | 512 | 524,800 |
| Feature projection (FC2) | 512 | 256 | 131,328 |
| Prototype Classification Layer | 256 | C | 256C + 256² |
| Contrastive Projection Head | 256 | 128 | 32,768 |
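Several parameter counts in Table 6 can be reproduced with ordinary dense-layer arithmetic. The sketch below is a consistency check rather than the authors' code. It assumes the Security-Aware Gate is a single 1024-to-1 linear unit with a bias, that FC1 and FC2 carry bias terms, and that the contrastive head is a bias-free 256-to-128 projection. The helper name `linear_params` is illustrative.

```python
def linear_params(in_dim: int, out_dim: int, bias: bool = True) -> int:
    """Trainable parameters of a dense layer: weight matrix plus optional bias."""
    return in_dim * out_dim + (out_dim if bias else 0)

# Assumed layer shapes; each was chosen because it matches the Table 6 figure.
gate = linear_params(1024, 1)                            # Security-Aware Gate
fc1 = linear_params(1024, 512)                           # Feature projection FC1
fc2 = linear_params(512, 256)                            # Feature projection FC2
contrastive_head = linear_params(256, 128, bias=False)   # Contrastive Projection Head

for name, count in [("gate", gate), ("fc1", fc1), ("fc2", fc2),
                    ("contrastive_head", contrastive_head)]:
    print(f"{name}: {count:,}")
```

Under these assumptions the custom heads add well under one million parameters on top of the roughly 304M-parameter backbone.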
Table 7. SACHN-DeBERTa-v3 Hyperparameter Configuration.
| Hyperparameter | Value |
|---|---|
| Backbone model | microsoft/deberta-v3-large |
| Maximum input sequence length | 256 tokens |
| Training batch size | 8 |
| Maximum training epochs | 20 |
| Early stopping patience | 5 epochs |
| Backbone base learning rate | 1 × 10⁻⁵ |
| Custom head learning rate | 3 × 10⁻⁴ |
| LLRD decay factor (alpha) | 0.9 |
| AdamW weight decay | 0.01 |
| Gradient clipping (L2 norm) | 1.0 |
| Gradient accumulation steps | 4 (effective batch size: 32) |
| Contrastive temperature (tau) | 0.07 |
| Contrastive embedding dimension | 128 |
| Focal Loss focusing parameter gamma (WikiLeaks) | 2.0 |
| Contrastive loss warmup duration | 3 epochs |
| Contrastive loss weight—warmup phase | 0.05 |
| Contrastive loss weight—full training phase | 0.20 |
| Classification loss weight | 1.0 |
| Ensemble random seeds | 42, 123, 456 |
| Dataset split ratio (train/val/test) | 80%/10%/10% |
| GPU hardware | NVIDIA H200 (139.8 GB VRAM) |
| Deep learning framework | PyTorch 2.8.0 |
| Transformer library | HuggingFace Transformers 4.57.6 |
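To make the LLRD entries in Table 7 concrete, the following sketch computes per-layer learning rates from the base rate and decay factor. It relies on assumptions that the table does not state: a 24-layer encoder for DeBERTa-v3-Large, the top layer training at the base rate, and the embeddings sitting one decay step below the first layer. The schedule actually used (Figure 4) may assign rates differently, and the function name `llrd_learning_rates` is illustrative.

```python
BASE_LR = 1e-5      # backbone base learning rate (Table 7)
DECAY = 0.9         # LLRD decay factor alpha (Table 7)
NUM_LAYERS = 24     # assumption: encoder depth of DeBERTa-v3-Large

def llrd_learning_rates(base_lr: float, decay: float, num_layers: int) -> dict:
    """Map each encoder layer (and the embeddings) to a decayed learning rate.

    Assumed convention: the top layer trains at base_lr and every layer
    below it is scaled by one further factor of `decay`.
    """
    rates = {}
    for layer in range(num_layers):
        depth_from_top = num_layers - 1 - layer
        rates[f"layer_{layer}"] = base_lr * decay ** depth_from_top
    # Assumption: embeddings sit one step below the first encoder layer.
    rates["embeddings"] = base_lr * decay ** num_layers
    return rates

rates = llrd_learning_rates(BASE_LR, DECAY, NUM_LAYERS)
print(f"top layer:  {rates['layer_23']:.2e}")
print(f"layer 0:    {rates['layer_0']:.2e}")
print(f"embeddings: {rates['embeddings']:.2e}")

# Effective batch size from Table 7: batches of 8 accumulated over 4 steps.
effective_batch = 8 * 4
```

The geometric decay keeps the lower, more general layers close to their pre-trained weights while the upper layers and the custom heads adapt faster.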
Table 8. Classification performance of all evaluated methods.
| Method | WL Acc. ¹ | WL F1_M ¹ | WL F1_W ¹ | FRUS Acc. ¹ | FRUS F1_M ¹ | FRUS F1_W ¹ |
|---|---|---|---|---|---|---|
| Feature-Engineering Ensemble Classifier [21] | 86.12 | 66.44 | 84.22 | 69.03 | 60.95 | 67.22 |
| TD2V+AQE [15] | 78.73 | 54.00 | 75.68 | 65.44 | 57.83 | 64.33 |
| FedAPILLM—LLaMA-3-8B [35] | 51.61 | 40.76 | 48.16 | 37.93 | 33.21 | 36.39 |
| LLaMA-3-8B-Instruct (QLoRA) | 44.40 | 36.63 | 44.46 | 40.42 | 30.72 | 38.40 |
| Qwen2.5-14B-Instruct (QLoRA) | 51.05 | 39.61 | 47.47 | 31.37 | 25.35 | 26.36 |
| Mistral-7B-Instruct-v0.3 (QLoRA) ² | 57.16 | 43.96 | 56.62 |  |  |  |
| **SACHN-DeBERTa-v3-Large (Ours) ³** | **96.12** | **92.57** | **96.04** | **92.11** | **91.02** | **92.08** |

¹ WL = WikiLeaks; F1_M = macro-averaged F1; F1_W = weighted F1. All values in percent.
² Mistral-7B did not complete FRUS training due to CUDA out-of-memory errors within the allocated wall-clock budget.
³ The proposed model (SACHN-DeBERTa-v3-Large) is highlighted in bold.
Table 9. Feature-Engineering Ensemble Classifier per class F1 scores.
| Class | WikiLeaks F1 | WikiLeaks Support | FRUS F1 | FRUS Support |
|---|---|---|---|---|
| CONFIDENTIAL | 81.19 | 912 | 53.69 | 1641 |
| SECRET | 24.79 | 198 | 77.53 | 3939 |
| TOP SECRET |  |  | 54.96 | 1004 |
| UNCLASSIFIED | 93.34 | 1592 | 57.60 | 637 |
| **Macro Avg. ¹** | **66.44** | **2702** | **60.95** | **7221** |

¹ Macro-average metrics are highlighted in bold to distinguish aggregate performance from class-specific scores.
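The macro-average row of Table 9 is the unweighted mean of the per-class F1 scores, so it can be recomputed directly from the rows above it. The sketch below does this with the published per-class values. The variable and function names are illustrative.

```python
from statistics import mean

# Per-class F1 scores (percent) copied from Table 9.
wikileaks_f1 = {"CONFIDENTIAL": 81.19, "SECRET": 24.79, "UNCLASSIFIED": 93.34}
frus_f1 = {"CONFIDENTIAL": 53.69, "SECRET": 77.53,
           "TOP SECRET": 54.96, "UNCLASSIFIED": 57.60}

def macro_f1(per_class: dict) -> float:
    """Unweighted mean of per-class F1, so every class counts equally."""
    return mean(list(per_class.values()))

wl_macro = macro_f1(wikileaks_f1)
frus_macro = macro_f1(frus_f1)
print(f"WikiLeaks macro F1: {wl_macro:.2f}")
print(f"FRUS macro F1:      {frus_macro:.2f}")
```

Because every class carries equal weight, the low SECRET score on WikiLeaks (24.79) pulls the macro F1 far below the 86.12% accuracy reported for the same classifier in Table 8.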
Table 10. TD2V+AQE—Document-Level Retrieval per class F1 scores.
| Class | WikiLeaks F1 | WikiLeaks Support | FRUS F1 | FRUS Support |
|---|---|---|---|---|
| CONFIDENTIAL | 70.72 | 1520 | 52.21 | 2736 |
| SECRET | 3.86 | 329 | 74.39 | 6564 |
| TOP SECRET |  |  | 52.22 | 1673 |
| UNCLASSIFIED | 87.43 | 2654 | 52.52 | 1061 |
| Macro Avg. | 54.00 | 4503 | 57.83 | 12,034 |
Table 11. Multi-LLM study—QLoRA fine-tuning results.
| Model | WL Acc. ¹ | WL F1_M ¹ | WL F1_W ¹ | FRUS Acc. ¹ | FRUS F1_M ¹ | FRUS F1_W ¹ |
|---|---|---|---|---|---|---|
| FedAPILLM—LLaMA-3-8B [35] | 51.61 | 40.76 | 48.16 | 37.93 | 33.21 | 36.39 |
| LLaMA-3-8B-Instruct | 44.40 | 36.63 | 44.46 | 40.42 | 30.72 | 38.40 |
| Qwen2.5-14B-Instruct | 51.05 | 39.61 | 47.47 | 31.37 | 25.35 | 26.36 |
| Mistral-7B-Instruct-v0.3 ² | 57.16 | 43.96 | 56.62 |  |  |  |

¹ WL = WikiLeaks; F1_M = macro-averaged F1; F1_W = weighted F1. All values in percent.
² Mistral-7B did not complete FRUS training within the allocated wall-clock budget.
Table 12. Extended LLM experiments—effect of LoRA rank (r = 64) and quantisation precision on classification performance.
| Model | Quantisation | WL Acc. ² | WL F1_M ² | FRUS Acc. | FRUS F1_M ² |
|---|---|---|---|---|---|
| LLaMA-3-8B-Instruct | 4-bit NF4, r = 64 | 51.83 | 41.56 | 39.84 | 31.67 |
| LLaMA-3-8B-Instruct | 8-bit INT8, r = 64 | 46.50 | 36.72 | 38.35 | 31.93 |
| Qwen2.5-14B-Instruct ¹ | 4-bit NF4, r = 64 |  |  |  |  |
| Qwen2.5-14B-Instruct | 8-bit INT8, r = 64 | 50.94 | 40.28 | 31.28 | 24.36 |
| Mistral-7B-Instruct-v0.3 | 4-bit NF4, r = 64 | 58.05 | 41.39 | 50.31 | 20.45 |
| Mistral-7B-Instruct-v0.3 | 8-bit INT8, r = 64 | 64.48 | 50.28 | 37.43 | 20.72 |
| SACHN-DeBERTa-v3-Large (Ours) | Full precision | 96.12 | 92.57 | 92.11 | 91.02 |

¹ Qwen2.5-14B 4-bit NF4 with r = 64 did not converge within the allocated wall-clock budget.
² WL = WikiLeaks; F1_M = macro-averaged F1.
Table 13. SACHN-DeBERTa-v3-Large—per-seed results.
| Seed | Val F1_M ¹˒² | WL Test Acc. ¹ | WL F1_M ¹ | FRUS Test Acc. | FRUS F1_M ¹ |
|---|---|---|---|---|---|
| 42 | 80.33/69.78 | 87.46 | 77.27 | 70.84 | 67.08 |
| 123 | 76.61/69.90 | 97.89 | 96.22 | 92.81 | 92.07 |
| 456 | 80.61/66.80 | 95.23 | 91.43 | 94.06 | 93.27 |
| Ensemble |  | 96.12 | 92.57 | 92.11 | 91.02 |

¹ WL = WikiLeaks; F1_M = macro-averaged F1.
² The Val F1_M column shows the WikiLeaks/FRUS validation F1 for each seed.
Table 14. SACHN-DeBERTa-v3-Large Ensemble—per-class results (WikiLeaks).
| Class | Precision | Recall | F1-Score | Support |
|---|---|---|---|---|
| CONFIDENTIAL | 93.25 | 95.39 | 94.31 | 304 |
| SECRET | 94.44 | 77.27 | 85.00 | 66 |
| UNCLASSIFIED | 97.95 | 98.87 | 98.41 | 531 |
| **Macro Avg. ¹** | **95.21** | **90.51** | **92.57** | **901** |
| **Weighted Avg. ¹** | **96.11** | **96.12** | **96.04** | **901** |

¹ Macro Avg. and Weighted Avg. rows are highlighted in bold to distinguish aggregate metrics from class-specific results.
Table 15. SACHN-DeBERTa-v3-Large Ensemble—per-class results (FRUS).
| Class | Precision | Recall | F1-Score | Support |
|---|---|---|---|---|
| CONFIDENTIAL | 92.87 | 88.12 | 90.43 | 547 |
| SECRET | 92.52 | 95.13 | 93.80 | 1313 |
| TOP SECRET | 88.86 | 88.06 | 88.46 | 335 |
| UNCLASSIFIED | 92.72 | 90.09 | 91.39 | 212 |
| **Macro Avg. ¹** | **91.74** | **90.35** | **91.02** | **2407** |
| **Weighted Avg. ¹** | **92.11** | **92.11** | **92.08** | **2407** |

¹ Macro Avg. and Weighted Avg. rows are highlighted in bold to distinguish aggregate metrics from class-specific results.
Table 16. Confusion matrix—WikiLeaks (ensemble predictions).
|  | Pred. CONF. | Pred. SECRET | Pred. UNCL. |
|---|---|---|---|
| True CONF. | 290 | 3 | 11 |
| True SECRET | 15 | 51 | 0 |
| True UNCL. | 6 | 0 | 525 |
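The per-class precision, recall, and F1 values in Table 14 follow from the confusion matrix in Table 16. The sketch below recomputes them in plain Python, with rows as true classes and columns as predicted classes. The function name `per_class_metrics` is illustrative, and the counts are as read from Table 16.

```python
# Rows are true classes, columns are predicted classes (Table 16).
labels = ["CONFIDENTIAL", "SECRET", "UNCLASSIFIED"]
confusion = [
    [290, 3, 11],
    [15, 51, 0],
    [6, 0, 525],
]

def per_class_metrics(matrix, names):
    """Return {class: (precision, recall, f1)} in percent from a square confusion matrix."""
    results = {}
    for i, name in enumerate(names):
        true_pos = matrix[i][i]
        predicted = sum(row[i] for row in matrix)   # column total
        actual = sum(matrix[i])                     # row total
        precision = true_pos / predicted if predicted else 0.0
        recall = true_pos / actual if actual else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        results[name] = (100 * precision, 100 * recall, 100 * f1)
    return results

metrics = per_class_metrics(confusion, labels)
total = sum(sum(row) for row in confusion)
accuracy = 100 * sum(confusion[i][i] for i in range(len(labels))) / total

for name, (p, r, f) in metrics.items():
    print(f"{name:13s} P={p:.2f} R={r:.2f} F1={f:.2f}")
print(f"accuracy={accuracy:.2f}")
```

The matrix also shows where the SECRET recall is lost: all 15 missed SECRET cables are predicted as CONFIDENTIAL, and none as UNCLASSIFIED.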
Table 17. Confusion matrix—FRUS (ensemble predictions).
|  | Pred. CONF. | Pred. SECRET | Pred. TOP S. | Pred. UNCL. |
|---|---|---|---|---|
| True CONF. | 482 | 55 | 0 | 10 |
| True SECRET | 27 | 1249 | 33 | 4 |
| True TOP S. | 0 | 39 | 295 | 1 |
| True UNCL. | 10 | 7 | 4 | 191 |
Table 18. McNemar’s test results—SACHN-DeBERTa-v3-Large ensemble vs. baselines (Yates-corrected, df = 1).
| Comparison | Dataset | N | n11 ¹ | n10 ¹ | n01 ¹ | n00 ¹ | χ²(1) | p-Value |
|---|---|---|---|---|---|---|---|---|
| SACHN ² vs. Kesenek [17] | WikiLeaks | 901 | 765 | 94 | 11 | 31 | 64.04 | 1.22 × 10⁻¹⁵ |
| SACHN ² vs. TD2V [11] | WikiLeaks | 901 | 563 | 296 | 6 | 36 | 276.56 | 4.22 × 10⁻⁶² |
| SACHN ² vs. Kesenek [17] | FRUS | 2407 | 1583 | 648 | 54 | 122 | 500.92 | 5.98 × 10⁻¹¹¹ |
| SACHN ² vs. TD2V [11] | FRUS | 2407 | 1228 | 1003 | 62 | 114 | 829.67 | 1.91 × 10⁻¹⁸² |

¹ n11: both correct; n10: SACHN correct, baseline wrong; n01: baseline correct, SACHN wrong; n00: both wrong. Yates-corrected, df = 1.
² SACHN accuracy in this table reflects raw ensemble predictions (argmax) prior to per-class threshold optimisation (Section 3.5.2); the threshold-optimised accuracy is reported in Table 8.
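The χ² statistics in Table 18 depend only on the discordant counts n10 and n01. As a sketch of the Yates-corrected McNemar computation named in the caption, the code below recomputes the statistic and its p-value using only the standard library. It uses the identity that, for one degree of freedom, the chi-square survival function equals erfc(√(χ²/2)). The function name `mcnemar_yates` is illustrative, and the counts are as read from Table 18.

```python
import math

def mcnemar_yates(n10: int, n01: int):
    """Yates-corrected McNemar statistic and its p-value (chi-square, df = 1)."""
    discordant = n10 + n01
    chi2 = (abs(n10 - n01) - 1) ** 2 / discordant
    # For df = 1 the chi-square survival function equals erfc(sqrt(chi2 / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value

# Discordant counts (n10, n01) as read from Table 18.
comparisons = {
    "Kesenek [17], WikiLeaks": (94, 11),
    "TD2V [11], WikiLeaks": (296, 6),
    "Kesenek [17], FRUS": (648, 54),
    "TD2V [11], FRUS": (1003, 62),
}

results = {}
for name, (n10, n01) in comparisons.items():
    results[name] = mcnemar_yates(n10, n01)
    chi2, p = results[name]
    print(f"{name}: chi2 = {chi2:.2f}, p = {p:.2e}")
```

The concordant cells n11 and n00 do not enter the statistic; the test asks only whether the two classifiers disagree more often in one direction than in the other.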
Table 19. Tokeniser maximum sequence length ablation—SACHN-DeBERTa-v3-Large ensemble.
| max_len (Tokens) | WL Accuracy | WL F1_Macro | FRUS Accuracy | FRUS F1_Macro |
|---|---|---|---|---|
| 128 | 93.78% | 85.99% | 91.07% | 90.25% |
| 256 | 95.23% | 89.85% | 92.65% | 91.96% |
| 512 | 93.12% | 86.49% | 90.36% | 89.79% |
Table 20. Focal Loss γ ablation—SACHN-DeBERTa-v3-Large ensemble.
| γ | WL Accuracy | WL F1_Macro | FRUS Accuracy | FRUS F1_Macro |
|---|---|---|---|---|
| 1.0 | 95.56% | 89.44% | 90.32% | 89.82% |
| 2.0 | 95.23% | 89.85% | 92.65% | 91.96% |
| 3.0 | 96.23% | 92.00% | 91.69% | 91.04% |
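To illustrate what the focusing parameter γ in Table 20 controls, the sketch below evaluates the standard focal loss of Lin et al. [66], FL(p_t) = −(1 − p_t)^γ · log(p_t), for one easy and one hard example. It is a minimal single-example form without class weighting. Whether the authors' implementation adds a weighting term is not stated in this section, and the probabilities used here are hypothetical.

```python
import math

def focal_loss(p_true: float, gamma: float) -> float:
    """Focal loss for one example, given the probability assigned to its true class."""
    return -((1.0 - p_true) ** gamma) * math.log(p_true)

# Hypothetical probabilities: one easy example and one hard example.
easy, hard = 0.95, 0.30

ratios = {}
for gamma in (0.0, 1.0, 2.0, 3.0):
    ratios[gamma] = focal_loss(hard, gamma) / focal_loss(easy, gamma)
    print(f"gamma={gamma}: easy={focal_loss(easy, gamma):.5f} "
          f"hard={focal_loss(hard, gamma):.5f} hard/easy={ratios[gamma]:.1f}")
```

With γ = 0 the expression reduces to ordinary cross-entropy. Larger γ shrinks the loss on confidently correct examples much faster than on hard ones, shifting the training signal toward hard examples such as minority-class documents.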
Table 21. Soft vs. hard voting comparison—Feature-Engineering Ensemble Classifier.
| Voting Strategy | WL Accuracy | WL F1_Macro | FRUS Accuracy | FRUS F1_Macro |
|---|---|---|---|---|
| Soft | 86.12% | 66.44% | 69.03% | 60.95% |
| Hard | 86.27% | 66.52% | 68.84% | 61.46% |
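The two strategies compared in Table 21 differ in what they aggregate: hard voting counts each member's top prediction, while soft voting averages the members' class probabilities before taking the argmax. The sketch below shows a case where they disagree. The member probabilities are hypothetical and do not come from the Feature-Engineering Ensemble Classifier, and the function names are illustrative.

```python
from collections import Counter

def hard_vote(prob_rows):
    """Majority vote over each member's argmax prediction (ties go to the first seen)."""
    votes = [max(range(len(p)), key=p.__getitem__) for p in prob_rows]
    return Counter(votes).most_common(1)[0][0]

def soft_vote(prob_rows):
    """Argmax of the class probabilities averaged across members."""
    n_classes = len(prob_rows[0])
    mean_probs = [sum(p[c] for p in prob_rows) / len(prob_rows)
                  for c in range(n_classes)]
    return max(range(n_classes), key=mean_probs.__getitem__)

# Hypothetical outputs of three ensemble members for one document
# (class order: CONFIDENTIAL, SECRET, UNCLASSIFIED).
member_probs = [
    [0.55, 0.45, 0.00],   # weakly prefers CONFIDENTIAL
    [0.52, 0.48, 0.00],   # weakly prefers CONFIDENTIAL
    [0.05, 0.95, 0.00],   # strongly prefers SECRET
]

hard_choice = hard_vote(member_probs)
soft_choice = soft_vote(member_probs)
print("hard vote:", hard_choice)
print("soft vote:", soft_choice)
```

Here hard voting selects class 0 (two weak votes), whereas soft voting selects class 1 because one confident member outweighs two uncertain ones. Disagreements of this kind are presumably rare, which would be consistent with the near-identical rows in Table 21.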
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Sariçiçek, M.T.; Dener, M. SACHN-DeBERTa-v3-Large: Automated Document Security Classification with XAI and LLM Comparison. Appl. Sci. 2026, 16, 5661. https://doi.org/10.3390/app16115661