LLM-SPSS: An Efficient LLM-Based Secure Partitioned Storage Scheme in Distributed Hybrid Clouds
Abstract
1. Introduction
2. Related Technologies
2.1. Large Language Models
2.2. MapReduce Distributed Computing Framework
2.3. Cloud Storage Partitioning Technologies
3. LLM-SPSS: An Efficient LLM-Based Secure Partitioned Storage Scheme in Distributed Hybrid Clouds
3.1. System Architecture
3.2. Sensitive Data Labeling Model and Fine-Tuning Design
3.2.1. Problem Definition
3.2.2. Model Selection
3.2.3. Large Language Model Fine-Tuning
3.2.4. Sensitive Label Generation Process
Algorithm 1. Sensitive-Data Labeling Workflow
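The full listing is not reproduced here. As an illustration only, the following is a minimal sketch of a labeling workflow of this shape, assuming a fine-tuned XLM-R sequence classifier served through Hugging Face transformers; `MODEL_PATH`, `LABELS`, and `label_record` are hypothetical placeholders, not the paper's released artifacts.

```python
# Minimal sketch of a sensitive-data labeling workflow (illustrative only;
# MODEL_PATH and LABELS are hypothetical placeholders, not the paper's
# artifacts). Assumes a fine-tuned XLM-R sequence classifier.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

MODEL_PATH = "path/to/finetuned-xlm-r"   # hypothetical local checkpoint
LABELS = ["non-sensitive", "sensitive"]  # binary labeling assumed here

tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_PATH)
model.eval()

def label_record(text: str, prompt: str = "") -> str:
    """Prepend an optional instruction prompt, classify, and return a label."""
    inputs = tokenizer(prompt + text, truncation=True, max_length=512,
                       return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits
    return LABELS[int(logits.argmax(dim=-1))]
```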
3.3. Data Partitioning and MapReduce Parallel Processing
Algorithm 2. Map Phase: Local Sensitive Data Labeling
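The listing body is likewise not reproduced. A minimal sketch of what a map phase of this shape could look like, reusing the hypothetical `label_record` wrapper from the Algorithm 1 sketch:

```python
# Hypothetical map-phase sketch: each worker labels its local partition and
# emits (record_id, label) key-value pairs, in the spirit of MapReduce.
# Reuses the label_record wrapper sketched under Algorithm 1.
from typing import Iterable, Iterator, Tuple

def map_phase(partition: Iterable[Tuple[str, str]]) -> Iterator[Tuple[str, str]]:
    """partition: (record_id, text) pairs held by one worker node."""
    for record_id, text in partition:
        yield record_id, label_record(text)
```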
3.4. Reduce Phase: Result Aggregation and Global Label Generation
Algorithm 3. Reduce Phase: Result Aggregation
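Again as an illustration only (the paper's exact aggregation rule is not shown here), a reduce phase of this shape might merge the mappers' outputs into a single global label table:

```python
# Hypothetical reduce-phase sketch: merge (record_id, label) pairs from all
# mappers into one global label table. The conflict rule below (treating
# "sensitive" as sticky) is an assumption, not the paper's stated policy.
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

def reduce_phase(mapper_outputs: Iterable[Tuple[str, str]]) -> Dict[str, str]:
    votes: Dict[str, List[str]] = defaultdict(list)
    for record_id, label in mapper_outputs:
        votes[record_id].append(label)
    return {rid: ("sensitive" if "sensitive" in labels else labels[0])
            for rid, labels in votes.items()}
```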
3.5. Hybrid Cloud Data Partitioned Storage
4. Experimental Design and Implementation
4.1. Dataset and Preprocessing
4.2. Experimental Environment and Configuration
4.3. Experimental Design and Results
4.3.1. Comparison with Traditional Schemes
4.3.2. Comparison of Training Strategies
4.3.3. Robustness Evaluation Under Noisy Conditions
4.3.4. Impact of Training Data Scale on Model Performance
4.3.5. Comparison of Large Language Models of Similar Scale
5. Conclusions
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
Appendix A. Full Complex Prompt Template
- Please judge whether the text contains sensitive information.
- Sensitive information includes the following categories with detailed definitions:
References
- Gantz, J.; Reinsel, D. Extracting Value from Chaos; IDC iView White Paper; IDC: Framingham, MA, USA, 2011. Available online: https://www.emc.com/digital_universe (accessed on 28 November 2025).
- Gartner, Inc. Gartner Forecasts Worldwide Public Cloud End-User Spending to Total $723 Billion in 2025; Press Release; Gartner: Stamford, CT, USA, 19 November 2024. Available online: https://www.gartner.com/en/newsroom/press-releases/2024-11-19-gartner-forecasts-worldwide-public-cloud-end-user-spending-to-total-723-billion-dollars-in-2025 (accessed on 28 November 2025).
- Sawle, P.; Baraskar, T. Survey on Data Classification and Data Encryption Techniques Used in Cloud Computing. Int. J. Comput. Appl. 2016, 135, 35–40.
- Salehi, M.A.; Mohammadi, M.; Hashemi, S. Regular Expression Search over Encrypted Data in the Cloud. In Proceedings of the 2014 IEEE International Conference on Cloud Computing Technology and Science (CloudCom), Singapore, 15–18 December 2014; pp. 453–458.
- OpenRaven. Introduction to Regex Based Data Classification for the Cloud. OpenRaven, 2022. Available online: https://www.openraven.com/articles/introduction-to-regex-based-data-classification-for-the-cloud-writing-and-developing-dataclasses-for-scale (accessed on 28 November 2025).
- Mainetti, L.; Elia, A. Detecting Personally Identifiable Information Through Natural Language Processing: A Step Forward. Appl. Syst. Innov. 2025, 8, 55.
- Li, Q.; Peng, H.; Li, J.; Xia, C.; Yang, R.; Sun, L.; Yu, P.S.; He, L. A Survey on Text Classification: From Shallow to Deep Learning. arXiv 2020, arXiv:2008.00364. Available online: https://arxiv.org/abs/2008.00364 (accessed on 28 November 2025).
- Tang, W. Using Machine Learning to Help Detect Sensitive Information. ISACA Now Blog, 13 November 2023. Available online: https://www.isaca.org/resources/news-and-trends/isaca-now-blog/2023/using-machine-learning-to-help-detect-sensitive-information (accessed on 25 June 2024).
- Sun, X.; Liu, G.; He, Z.; Li, H.; Li, X. DePrompt: Desensitization and Evaluation of Personal Identifiable Information in Large Language Model Prompts. arXiv 2024, arXiv:2408.08930. Available online: https://arxiv.org/abs/2408.08930 (accessed on 25 June 2024).
- Rama, B.K.; Thaiyalnayaki, S. A Novel Integration of Machine Learning-Based Data Classification with Optimized Cryptographic Techniques for Secure Cloud Storage. J. Theor. Appl. Inf. Technol. 2025, 103, 1808–1816. Available online: https://www.jatit.org/volumes/Vol103No5/14Vol103No5.pdf (accessed on 28 November 2025).
- Addula, S.R.; Meesala, M.K.; Ravipati, P.; Sajja, G.S. A Hybrid Autoencoder and Gated Recurrent Unit Model Optimized by Honey Badger Algorithm for Enhanced Cyber Threat Detection in IoT Networks. Secur. Priv. 2025, 8, e70086.
- Rao, B.; Wang, L. A Survey of Semantics-Aware Performance Optimization for Data-Intensive Computing. arXiv 2021, arXiv:2107.11540.
- Devlin, J.; Chang, M.-W.; Lee, K.; Toutanova, K. BERT: Pre-Training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT 2019), Minneapolis, MN, USA, 2–7 June 2019; pp. 4171–4186.
- Brown, T.; Mann, B.; Ryder, N.; Subbiah, M.; Kaplan, J.; Dhariwal, P.; Neelakantan, A.; Shyam, P.; Sastry, G.; Askell, A.; et al. Language Models Are Few-Shot Learners. In Proceedings of the Advances in Neural Information Processing Systems 33 (NeurIPS 2020), Vancouver, BC, Canada, 6–12 December 2020; pp. 1877–1901.
- Boehnke, J.; Pontikes, E.; Bhargava, H.K. Decoding Unstructured Text: Enhancing LLM Classification Accuracy with Redundancy and Confidence; Technical Report; Graduate School of Management, University of California Davis: Davis, CA, USA, 7 June 2024. Available online: https://d30i16bbj53pdg.cloudfront.net/wp-content/uploads/2024/06/Decoding-Unstructured-Text-Enhancing-LLM-Classification.pdf (accessed on 28 November 2025).
- Howard, J.; Ruder, S. Universal Language Model Fine-Tuning for Text Classification (ULMFiT). arXiv 2018, arXiv:1801.06146. Available online: https://arxiv.org/abs/1801.06146 (accessed on 28 November 2025).
- Zhou, Z.; Li, C.; Chen, X.; Wang, S.; Chao, Y.; Li, Z.; Wang, H.; Shi, Q.; Tan, Z.; Han, X.; et al. LLM × MapReduce: Simplified Long-Sequence Processing Using Large Language Models. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Vienna, Austria, 27 July–1 August 2025; pp. 27664–27678.
- Dean, J.; Ghemawat, S. MapReduce: Simplified Data Processing on Large Clusters. Commun. ACM 2008, 51, 107–113.
- Levitin, G.; Xing, L.; Dai, Y. Optimal Data Partitioning in Cloud Computing System with Random Server Assignment. Future Gener. Comput. Syst. 2017, 70, 17–25.
- Tang, T.; Li, M. Enhanced Secure Storage and Data Privacy Management System for Big Data Based on Multilayer Model. Sci. Rep. 2025, 15, 32285.
- Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention Is All You Need. In Proceedings of the Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, Long Beach, CA, USA, 4–9 December 2017; pp. 5998–6008.
- Wang, W.; Chen, W.; Luo, Y.; Long, Y.; Lin, Z.; Zhang, L.; Lin, B.; Cai, D.; He, X. Model Compression and Efficient Inference for Large Language Models: A Survey. arXiv 2024, arXiv:2402.09748. Available online: https://arxiv.org/abs/2402.09748 (accessed on 28 November 2025).
- Hu, Z.; Wang, L.; Lan, Y.; Xu, W.; Lim, E.-P.; Bing, L.; Xu, X.; Poria, S.; Lee, R.K.-W. LLM-Adapters: An Adapter Family for Parameter-Efficient Fine-Tuning of Large Language Models. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing (EMNLP 2023), Singapore, 6–10 December 2023; pp. 5254–5276.
- Qin, G.; Eisner, J. Learning How to Ask: Querying LMs with Mixtures of Soft Prompts. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL 2021), Online, 6–11 June 2021; pp. 5203–5212.
- Raffel, C.; Shazeer, N.; Roberts, A.; Lee, K.; Narang, S.; Matena, M.; Zhou, Y.; Li, W.; Liu, P.-J. Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer. J. Mach. Learn. Res. 2020, 21, 5485–5551.
- Conneau, A.; Khandelwal, K.; Goyal, N.; Chaudhary, V.; Wenzek, G.; Guzmán, F.; Grave, E.; Ott, M.; Zettlemoyer, L.; Stoyanov, V. Unsupervised Cross-lingual Representation Learning at Scale. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics (ACL 2020), Online, 5–10 July 2020; pp. 8440–8451.
- Goyal, N.; Chaudhary, V.; Wenzek, G.; Guzmán, F.; Grave, E.; Ott, M.; Stoyanov, V.; Zettlemoyer, L. Larger-Scale Transformers for Multilingual Masked Language Modeling. arXiv 2021, arXiv:2102.01373. Available online: https://arxiv.org/abs/2102.01373 (accessed on 28 November 2025).
- Chai, Y.; Liang, Y.; Duan, N. Cross-Lingual Ability of Multilingual Masked Language Models: A Study of Language Structure. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (ACL 2022), Dublin, Ireland, 22–27 May 2022; pp. 3041–3054.
- White, T. Hadoop: The Definitive Guide, 3rd ed.; O’Reilly Media: Sebastopol, CA, USA, 2012.
- Nghiem, P.P.; Figueira, S.M. Towards Efficient Resource Provisioning in MapReduce. J. Parallel Distrib. Comput. 2016, 95, 29–41.
- Zhang, Q.; Cheng, L.; Boutaba, R. Cloud Computing: State-of-the-Art and Research Challenges. J. Internet Serv. Appl. 2010, 1, 7–18.
- Buyya, R.; Broberg, J.; Goscinski, A. Cloud Computing: Principles and Paradigms; Wiley: Hoboken, NJ, USA, 2011.
- Agyekum, J.; Mazumdar, S.; Scheich, C. ADAPT: An Effective Data-Aware Multicloud Data Placement Framework. Cluster Comput. 2025, 28, 825.
- Yolchuyev, A.; Levendovszky, J. Data Chunks Placement Optimization for Hybrid Cloud Storage Systems. Future Internet 2021, 13, 181.
- Simo, D.R.; Hernández-Cartagena, A.; Vázquez-Chávez, J.; Hernández, G.; Pulido, A. Sensitive-Data-Detection: Dataset and Code for Sensitive Data Detection. GitHub Repository. Available online: https://github.com/SimoDR/sensitive-data-detection (accessed on 28 November 2025).
- Taylor, J.; Smith, A.; Johnson, R.; Brown, L.; Davis, M. Synthetic Data for Privacy-Preserving Clinical Risk Prediction. Sci. Rep. 2024, 14, 12345.
- Goyal, A.; Mahmoud, M. A Systematic Review of Synthetic Data Generation Techniques Using Generative AI. Electronics 2024, 13, 3509.

Table 1. Prompt types and templates used for sensitive-data labeling.

| Type | Description | Example/Template |
|---|---|---|
| None | No instruction is prepended. The raw text is fed directly into the model. | [Text] |
| Simple | A concise, direct instruction listing the sensitive categories. | “Determine whether the following text contains sensitive information (health, political, judicial, religious, sexual orientation, racial): [Text]” |
| Complex | A structured, multi-line instruction providing definitions for each sensitive category. | (Full template provided in Appendix A for reproducibility) |
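To make the three configurations concrete, the sketch below shows how they might be prepended to an input record; the Simple wording is quoted from the table, while the Complex template is elided here (see Appendix A) and the helper name `build_input` is hypothetical.

```python
# Sketch of the three prompt configurations in Table 1. SIMPLE_TEMPLATE quotes
# the table; the full Complex template is not reproduced (see Appendix A).
SIMPLE_TEMPLATE = ("Determine whether the following text contains sensitive "
                   "information (health, political, judicial, religious, "
                   "sexual orientation, racial): ")
COMPLEX_TEMPLATE = "..."  # full multi-line template: see Appendix A

def build_input(text: str, prompt_type: str = "none") -> str:
    if prompt_type == "none":
        return text                       # raw text fed directly to the model
    if prompt_type == "simple":
        return SIMPLE_TEMPLATE + text     # concise category list
    if prompt_type == "complex":
        return COMPLEX_TEMPLATE + text    # per-category definitions
    raise ValueError(f"unknown prompt type: {prompt_type}")
```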
Table 2. Experimental environment configuration.

| Component | Specification |
|---|---|
| CPU | Intel(R) Xeon(R) Platinum 8358 CPU @ 2.60 GHz |
| GPU | NVIDIA GeForce RTX 3090 |
| RAM | 128 GiB |
| OS | Ubuntu 20.04 LTS (64-bit) |

Table 3. Experimental design and configurations.

| Experimental Aspect | Configurations/Values |
|---|---|
| Comparison Baselines | Regex, SVM, Random Forest, DePrompt [9] |
| Our Model (XLM-R) Variants | No Prompt, Simple Prompt, Complex Prompt (see Table 1) |
| Training Strategy | Direct (single node), MapReduce (four-node simulation) |
| Noise Robustness | Noise levels: 0%, 10%, 20%, 30% |
| Data Scale Sensitivity | Training data proportions: 5%, 20%, 50%, 100% |
| Model Scale Comparison | XLM-R-base, XLM-R-large, InfoXLM-base |
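The table lists the noise levels but not the noise model itself; purely as an illustration, the sketch below realizes them as random label flips on a fraction of training samples (an assumption, not the paper's procedure).

```python
# Illustrative only: one plausible realization of the 0-30% noise levels as
# random label flips; the paper's actual noise model is not shown here.
import random
from typing import List, Tuple

def inject_label_noise(dataset: List[Tuple[str, str]], noise_level: float,
                       labels=("sensitive", "non-sensitive"),
                       seed: int = 0) -> List[Tuple[str, str]]:
    rng = random.Random(seed)
    noisy = []
    for text, label in dataset:
        if rng.random() < noise_level:
            label = rng.choice([c for c in labels if c != label])
        noisy.append((text, label))
    return noisy

# The four configurations from the table:
# for level in (0.0, 0.1, 0.2, 0.3): train_on(inject_label_noise(data, level))
```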
