Article

A Secure and Efficient Framework for Multimodal Prediction Tasks in Cloud Computing with Sliding-Window Attention Mechanisms

by Weiyuan Cui 1,2,†, Qianye Lin 1,3,†, Jiaqi Shi 1,4,†, Xingyu Zhou 1,5, Zeyue Li 1, Haoyuan Zhan 1,6, Yihan Qin 1,7 and Chunli Lv 1,*

1 China Agricultural University, Beijing 100083, China
2 National School of Development, Peking University, Beijing 100871, China
3 School of Statistics, University of International Business and Economics, Beijing 100029, China
4 Beijing Foreign Studies University, Beijing 100089, China
5 College of Mathematics and Physics, North China Electric Power University, Beijing 102206, China
6 Capital University of Economics and Business, Beijing 100070, China
7 Beijing Forestry University, Beijing 100083, China
* Author to whom correspondence should be addressed.
† These authors contributed equally to this work.
Appl. Sci. 2025, 15(7), 3827; https://doi.org/10.3390/app15073827
Submission received: 30 January 2025 / Revised: 21 March 2025 / Accepted: 28 March 2025 / Published: 31 March 2025
(This article belongs to the Special Issue Cloud Computing: Privacy Protection and Data Security)

Abstract:
An efficient and secure computation framework based on the sliding-window attention mechanism and sliding loss function was proposed to address challenges in temporal and spatial feature modeling for multimodal data processing. The framework aims to overcome the limitations of traditional methods in privacy protection, feature-capturing capabilities, and computational efficiency. The experimental results demonstrated that, in time-series data processing tasks, the proposed method achieved precision, recall, accuracy, and F1-score values of 0.95, 0.91, 0.93, and 0.93, respectively, significantly outperforming the federated learning, secure multi-party computation, homomorphic encryption, and TEE-based approaches. In spatial data processing tasks, these metrics reached 0.93, 0.90, 0.92, and 0.91, also surpassing all the comparative methods. Compared with the existing secure computation frameworks, the proposed approach substantially enhanced computational efficiency while minimizing accuracy loss, all while ensuring data privacy. These findings provide an efficient and reliable solution for privacy protection and data security in cloud computing environments. Furthermore, the research demonstrates significant theoretical value and practical potential in real-world scenarios such as financial forecasting and image analysis.

1. Introduction

Secure computation has emerged as a critical research direction in modern computer science and financial technology, with its applications expanding across financial forecasting, natural language processing (e.g., financial text analysis), autonomous driving prediction, low-altitude economy, and medical imaging [1]. The importance of data in these domains is self-evident. For instance, in the financial sector, precise stock price forecasting directly influences investment strategies and the stable operation of financial markets [2]. In autonomous driving, vehicle path prediction and behavior analysis rely heavily on complex multimodal data processing to ensure driving safety and system reliability [3,4,5]. Similarly, tasks such as drone logistics and airspace monitoring in low-altitude economic scenarios demand robust data collaboration and privacy protection mechanisms [6]. In medical imaging analysis, lesion detection and diagnostic assistance depend on high-quality medical data, which must be efficiently analyzed while protecting patient privacy [7,8,9]. These scenarios collectively highlight a central challenge: how to balance computational efficiency with data privacy in large-scale data analysis, which serves as a driving force for advancements in secure computation technologies. Traditional data processing methods predominantly employ centralized architectures, where data are stored on a central server for analysis and computation [10]. However, such centralized approaches have significant limitations. First, centralized storage increases the risk of privacy breaches, particularly for sensitive financial data or medical information, where any oversight can result in severe security issues [11]. Second, as data scales continue to grow, the traditional methods exhibit performance bottlenecks in handling high-dimensional data and complex tasks, failing to meet the demands for efficiency and real-time processing [12]. While distributed computing partially alleviates these issues, challenges such as high communication overhead and complex model synchronization in large-scale collaborative computing remain barriers to replacing traditional methods entirely.
In recent years, secure computation technologies have emerged as key solutions to these challenges, with federated learning, secure multi-party computation, and homomorphic encryption being the three core approaches. Federated learning enables collaborative model training across different data owners without sharing raw data, significantly reducing privacy risks through distributed training [13]. Secure multi-party computation employs advanced cryptographic protocols, allowing multiple parties to perform collaborative computations without exposing their data, which is critical for cross-institutional data analysis [14,15]. Homomorphic encryption, on the other hand, allows computations to be performed directly on encrypted data, offering end-to-end privacy protection from input to output [16]. However, applying deep learning methods to secure computation scenarios presents notable challenges [17]. First, deep learning typically relies on large-scale data and complex computations, which conflict with the high computational overhead of privacy-preserving techniques. Second, the existing deep learning methods often model different modalities separately, lacking effective mechanisms for collaborative modeling, especially when integrating time-series data (e.g., stock prices and text) with spatial data (e.g., images and medical scans), resulting in limited performance [18]. Privacy protection in cloud computing has garnered significant attention in recent years. With the increasing demand for large-scale data sharing and computation, traditional privacy-preserving methods face challenges in security, computational efficiency, and scalability. Beyond federated learning (FL), secure multi-party computation (MPC), homomorphic encryption (HE), and trusted execution environment (TEE)-based approaches, several recent innovations have emerged in the field of privacy-preserving computation. For instance, differential privacy (DP) techniques have been widely applied in cloud-based data analytics. DP operates by introducing controlled noise into statistical computations to limit the influence of individual data points on the final results, thereby reducing the risk of data leakage [19]. In recent years, deep-learning-based differential privacy methods, such as DP-SGD, have been proposed to protect data privacy during the training process and have shown promising results across multiple domains [20]. Additionally, privacy-preserving techniques based on secure hardware have seen significant advancements. Hybrid approaches that combine trusted execution environments (TEEs) with homomorphic encryption have been developed to enhance computational efficiency while maintaining data security. These approaches have been effectively applied in privacy-preserving computation and blockchain applications [21]. Moreover, blockchain-based privacy protection schemes leverage decentralized architectures and zero-knowledge proofs (ZKPs) to enable secure computations without relying on trusted third parties, making them particularly suitable for multi-party collaborative cloud computing scenarios [22]. Notably, recent studies have proposed novel privacy-preserving machine learning frameworks, such as privacy-aware methods based on graph neural networks (GNNs) [23], which have demonstrated strong privacy protection capabilities in social network analysis and financial risk assessment. 
Furthermore, privacy-enhanced methods that integrate federated learning with knowledge distillation (KD) have been introduced to reduce communication overhead and improve model performance [24].
To address these issues, a high-efficiency secure computation framework based on sliding-window attention is proposed in this study. By optimizing attention mechanisms and multimodal data fusion techniques, this framework enhances computational efficiency and predictive accuracy while ensuring privacy protection. The core innovation lies in the introduction of the sliding-window attention mechanism, which segments input data and employs sliding windows to capture both local and global dependencies, significantly reducing computational complexity and improving predictive performance. Additionally, a temporal–spatial fusion module is designed to enable collaborative modeling of time-series and spatial data, enhancing the framework’s ability to process multimodal data effectively. This enables the framework to excel in applications such as financial forecasting, autonomous driving, and medical imaging. A novel sliding loss function is also developed to improve the model’s adaptability to dynamic data variations, further optimizing the convergence speed and model performance during training. The framework’s efficacy is validated through extensive experiments conducted in representative scenarios including finance, autonomous driving, and medical imaging. The experimental results demonstrate that, compared to existing methods such as federated learning, secure multi-party computation, and homomorphic encryption, the proposed framework achieves superior privacy protection, computational efficiency, and predictive accuracy. These findings suggest that the proposed sliding-window attention-based secure computation framework offers a novel solution for privacy-preserving and efficient multimodal data processing while providing a theoretical and practical foundation for advancing secure computation research in other domains. Future work will focus on extending and applying this framework to more complex and diverse scenarios.

2. Related Work

2.1. Federated Learning

Federated learning is an emerging distributed machine learning paradigm aimed at enabling collaborative training of a global model across multiple participants while preserving data privacy [25]. The fundamental principle involves performing model training locally at each participant's site and sharing only model parameters or gradients instead of raw data, thereby mitigating the risk of data breaches [26,27]. The core mechanism of federated learning can be summarized in three stages: local model training, global model aggregation, and parameter updates. In federated learning, suppose there are K participants, each with a local dataset $D_k$. The objective is to collaboratively train a global model w that minimizes the loss function over all datasets. The global optimization objective is formulated as

$$\min_{w} F(w) = \sum_{k=1}^{K} \frac{n_k}{n} F_k(w),$$

where $F_k(w)$ represents the local loss function of the k-th participant, $n_k$ is the number of samples in $D_k$, and $n = \sum_{k=1}^{K} n_k$ denotes the total number of samples across all participants. A critical step in federated learning is local model training. During each iteration, participants compute updated model parameters $w_k$ based on their local datasets $D_k$ and the current global model parameters $w^{(t)}$. For each participant, the local optimization problem is expressed as

$$w_k^{(t+1)} = \arg\min_{w} F_k(w).$$

Typically, $F_k(w)$ is optimized using gradient descent, with the update rule given by

$$w_k^{(t+1)} = w_k^{(t)} - \eta \nabla F_k\big(w_k^{(t)}\big),$$

where $\eta$ is the learning rate, and $\nabla F_k\big(w_k^{(t)}\big)$ represents the gradient of the local loss function with respect to $w_k$. After completing local training, federated learning integrates information across participants through global model aggregation. A commonly used method is weighted averaging, where the global model parameters w are updated as

$$w^{(t+1)} = \sum_{k=1}^{K} \frac{n_k}{n} w_k^{(t+1)}.$$
This formula indicates that the global model parameters are a weighted average of the local model parameters, with weights determined by the sample size of each participant’s dataset. This mechanism ensures model sharing without transmitting raw data, effectively preserving data privacy. The success of federated learning lies in balancing data privacy and model performance. However, certain challenges persist in practical applications. One major issue is communication overhead. Frequent transmission of model parameters between participants and the central server, especially in large-scale distributed environments, can become a bottleneck for system performance. To address this, model compression techniques such as quantization and pruning have been proposed to reduce the volume of transmitted parameters. Another critical challenge is the non-IID (non-independent and identically distributed) nature of data among participants, which can adversely affect the performance of the global model. When data distributions vary significantly, the aggregated global model may fail to generalize effectively in specific scenarios. Solutions such as personalized model training, adaptive aggregation strategies, and adversarial training have been developed to mitigate these issues. In financial forecasting, federated learning provides a novel solution for collaborative modeling across institutions. For instance, multiple banks or financial organizations can collaboratively train market prediction models without exposing customer data, thereby enhancing both prediction accuracy and generalization [28]. Similarly, in medical imaging analysis, federated learning facilitates cross-institutional collaboration, enabling multiple hospitals to jointly train tumor detection models without sharing sensitive patient data. By combining local training and global aggregation, federated learning not only improves model performance but also meets stringent privacy protection requirements. In scenarios involving time-series data (e.g., stock prices) and spatial data (e.g., medical images or autonomous driving image data), federated learning demonstrates significant advantages. Traditional centralized data processing methods often expose privacy risks when handling large-scale heterogeneous data, while federated learning mitigates these issues through distributed training. To further enhance computational efficiency and model performance, it is essential to design more efficient training and aggregation mechanisms tailored to heterogeneous data characteristics, such as leveraging attention mechanisms to optimize performance in multimodal data scenarios.
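To make the aggregation rule concrete, the following Python sketch implements FedAvg-style weighted averaging on a toy least-squares objective. The objective, client sizes, and learning rate are illustrative assumptions rather than the configuration used in the experiments.

```python
import numpy as np

def local_update(w_global, X, y, lr=0.1, epochs=5):
    """One client's local training: plain gradient descent on a
    least-squares objective (a stand-in for F_k in the text)."""
    w = w_global.copy()
    for _ in range(epochs):
        grad = 2 * X.T @ (X @ w - y) / len(y)   # gradient of F_k(w)
        w -= lr * grad                          # w_k^(t+1) = w_k^(t) - eta * grad
    return w

def fedavg_round(w_global, clients):
    """Sample-size-weighted aggregation: w = sum_k (n_k / n) * w_k."""
    n_total = sum(len(y) for _, y in clients)
    updates = [local_update(w_global, X, y) for X, y in clients]
    return sum(len(y) / n_total * w_k
               for (_, y), w_k in zip(clients, updates))

# Toy run: three clients with different sample counts.
rng = np.random.default_rng(0)
w_true = np.array([2.0, -1.0])
clients = []
for n_k in (30, 50, 120):
    X = rng.normal(size=(n_k, 2))
    clients.append((X, X @ w_true + 0.1 * rng.normal(size=n_k)))

w = np.zeros(2)
for t in range(20):           # global aggregation rounds
    w = fedavg_round(w, clients)
print(w)                      # approaches [2, -1]
```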

2.2. Secure Multi-Party Computation

Secure multi-party computation (MPC) is a cryptographic technique designed to allow multiple participants to collaboratively perform a global computation task without revealing their private data while ensuring that the input data remain confidential throughout the process. This technique has broad applications in scenarios with high data sensitivity, such as collaborative modeling, joint analysis, and distributed data processing [29]. The primary objective of MPC is to enable secure collaboration through cryptographic protocols, ensuring that participants can only access the computation results without inferring the private inputs of other parties [30]. The fundamental concept of MPC can be formalized as a function computation problem. Consider n participants, each represented by $P_i$ and holding private input $x_i$. The goal is to jointly compute a public function $f(x_1, x_2, \ldots, x_n)$ while ensuring that no participant learns the inputs of others. This problem is expressed as

$$y = f(x_1, x_2, \ldots, x_n),$$

where y represents the output of the computation. In an ideal scenario, each participant knows only their own input and the final result y, with no knowledge of the inputs from other participants. To achieve this, MPC typically relies on cryptographic protocols, primarily secret sharing and homomorphic encryption. In the secret sharing scheme, each participant splits their private data into multiple "shares" and distributes these shares to other participants. Let $x_i$ be the private input of the i-th participant. Secret sharing divides $x_i$ into shares $s_{i1}, s_{i2}, \ldots, s_{in}$ such that

$$x_i = \sum_{j=1}^{n} s_{ij} \pmod{p},$$

where p is a large prime number used for modular arithmetic to ensure security. Each participant only holds a portion of the shares from other participants. As a result, even if one participant is compromised, the full input of others cannot be reconstructed. The core steps of MPC based on secret sharing include distributed computation and result reconstruction. During distributed computation, participants perform computations on their shares. For instance, for the addition operation $f(x_1, x_2) = x_1 + x_2$, the computation can be performed directly on the shares as

$$s'_{ij} = s_{ij} + s_{kj} \pmod{p},$$

where $s'_{ij}$ represents the intermediate share of the addition result. For multiplication, distributed computation becomes more complex and often requires a trusted third party or additional protocols to ensure correctness. Finally, participants exchange and aggregate the shares to reconstruct the global computation result:

$$y = \sum_{j=1}^{n} s'_{ij} \pmod{p}.$$

Apart from secret sharing, homomorphic encryption is another widely used approach for implementing MPC. Homomorphic encryption allows arithmetic operations to be performed directly on encrypted data, with the decrypted result matching the outcome of the operations performed on the plaintext. Let E and D denote the encryption and decryption functions, respectively. Homomorphic encryption satisfies the following properties:

$$D(E(x_1) + E(x_2)) = x_1 + x_2,$$
$$D(E(x_1) \cdot E(x_2)) = x_1 \cdot x_2.$$
Using this method, participants can encrypt their private data and share them with others, allowing computations to be performed on the encrypted data. The data owner then decrypts the final result. This approach ensures higher security as the entire computation process is conducted in the encrypted domain. MPC has significant applications in domains such as finance, healthcare, and autonomous driving. In financial scenarios, MPC enables institutions to collaboratively train market prediction models, such as stock price forecasting or portfolio optimization, without sharing customer data or transaction records. In healthcare, MPC facilitates collaboration among hospitals to jointly train tumor detection models without exposing sensitive patient data, thus improving diagnostic accuracy while adhering to privacy regulations. In autonomous driving, MPC supports vehicle-to-cloud collaborative computation, such as real-time path optimization and environmental sensing, while preventing the leakage of sensitive data, including driving behaviors and location information. These applications highlight the critical role of MPC in ensuring data privacy and secure computation across various sensitive and distributed data scenarios.
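The additive secret sharing described above can be illustrated in a few lines of Python: each input is split into uniformly random shares modulo a prime, the compute nodes add shares locally, and only the sum is reconstructed. The modulus and party count here are arbitrary demonstration choices.

```python
import random

P = 2**61 - 1  # a large (Mersenne) prime modulus p

def share(x, n_parties):
    """Split secret x into n additive shares: x = sum_j s_j (mod p)."""
    shares = [random.randrange(P) for _ in range(n_parties - 1)]
    shares.append((x - sum(shares)) % P)
    return shares

def reconstruct(shares):
    return sum(shares) % P

# Two parties secretly sum their inputs across 3 compute nodes.
# Each individual share is uniformly random, so no single node
# learns anything about x1 or x2.
x1, x2 = 123456, 654321
s1, s2 = share(x1, 3), share(x2, 3)
# Each node adds the two shares it holds, locally.
sum_shares = [(a + b) % P for a, b in zip(s1, s2)]
print(reconstruct(sum_shares))  # 777777 = x1 + x2
```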

2.3. Homomorphic Encryption

Homomorphic encryption is a cryptographic technique that allows computations to be performed on encrypted data without requiring decryption. The core concept ensures that the results of operations on ciphertexts, when decrypted, are identical to those performed directly on plaintexts, enabling computations without exposing the original data [31]. This property is a crucial safeguard for data privacy, particularly in multi-party collaborative computations and cloud computing scenarios. The fundamental principle of homomorphic encryption is based on a mathematical property, where specially designed encryption functions guarantee that operations in the ciphertext space mirror those in the plaintext space. Let the encryption and decryption functions be denoted as $E(\cdot)$ and $D(\cdot)$, respectively. Homomorphic encryption satisfies the following property:

$$D(E(x_1) \otimes E(x_2)) = x_1 \odot x_2,$$
where $\otimes$ represents an operation in the ciphertext space and $\odot$ denotes the corresponding operation in the plaintext space. Homomorphic encryption can be categorized based on the types of operations it supports: partially homomorphic encryption (PHE) [32], fully homomorphic encryption (FHE) [33], and somewhat homomorphic encryption (SHE) [34]. PHE typically supports either addition or multiplication. For example, the Paillier encryption scheme supports additive homomorphism. Given plaintexts $x_1$ and $x_2$ with ciphertexts $E(x_1)$ and $E(x_2)$, the additive homomorphism is expressed as

$$D\big(E(x_1) \cdot E(x_2) \bmod n^2\big) = x_1 + x_2,$$
where n is a part of the public key, and operations are performed modulo $n^2$. FHE represents a significant breakthrough in homomorphic encryption as it supports arbitrary additions and multiplications on ciphertexts. This flexibility, however, comes at the cost of considerable computational complexity. For instance, in Gentry's lattice-based FHE scheme, a set of special encryption functions are designed to support repeated addition and multiplication operations. To manage noise growth during these operations, FHE often requires "bootstrapping" to refresh ciphertexts, which significantly increases computational overhead. Specifically, in Gentry's scheme, each plaintext x is encrypted as a ciphertext $E(x)$, introducing random noise r during encryption:

$$E(x) = x + r \pmod{q},$$

where q is a large modulus used during encryption. The encrypted data can be directly manipulated in the ciphertext space. For example, for addition:

$$E(x_1) + E(x_2) = (x_1 + r_1) + (x_2 + r_2) \pmod{q},$$

decryption yields the correct addition result $x_1 + x_2$. Similarly, multiplication is supported, although noise accumulation during these operations necessitates bootstrapping to maintain decryption accuracy. Recent advancements have significantly improved the computational efficiency of homomorphic encryption. The CKKS (Cheon–Kim–Kim–Song) [35] scheme is a prominent example designed for approximate addition and multiplication on encrypted floating-point data. By introducing small noise and leveraging approximate decryption, CKKS enhances efficiency. Its core principle involves representing encryption operations with a polynomial $p(x)$, such that operations on ciphertexts $p(E(x))$ yield approximate plaintext results upon decryption:

$$D\big(p(E(x_1)) + p(E(x_2))\big) \approx x_1 + x_2.$$
This optimization has enabled homomorphic encryption to become viable for large-scale data analysis, machine learning model training, and other computationally intensive tasks. Homomorphic encryption holds immense potential in sensitive data applications, such as finance, healthcare, and autonomous driving. In the financial sector, it enables institutions to perform encrypted customer data analysis, such as market analysis or risk prediction, without decrypting the data, thus ensuring client privacy. In healthcare, it facilitates the analysis of encrypted medical data, such as tumor detection and drug recommendation, safeguarding patient confidentiality while improving diagnostic accuracy. In autonomous driving, homomorphic encryption supports vehicle-to-cloud collaborative computations, including real-time path optimization and environmental awareness, ensuring data privacy for sensitive information like driving behaviors and location data. Despite its theoretical and practical advantages, the high computational complexity of homomorphic encryption and its limited application scenarios remain significant challenges. In particular, the bootstrapping process and the accumulation of noise in FHE impose substantial computational overhead, negatively impacting system performance. Research efforts are increasingly focused on optimizing computational efficiency while maintaining encryption strength. Techniques such as integrating attention mechanisms, multimodal data processing, and parallel computing are promising directions for enhancing the practicality and computational efficiency of homomorphic encryption in complex scenarios. Overall, homomorphic encryption stands as a pivotal technology for privacy-preserving computation, with its potential and applicability expanding through advancements in algorithms and hardware support, providing robust security solutions for sensitive data processing.
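As a concrete instance of the additive homomorphism shown above, the sketch below implements textbook Paillier encryption in Python. The tiny primes are purely didactic and offer no security; practical deployments rely on 2048-bit moduli and vetted libraries (e.g., Microsoft SEAL, used later for the BFV baseline).

```python
import math, random

# Textbook Paillier with toy primes -- NOT secure parameters.
p, q = 61, 53
n = p * q
n2 = n * n
lam = math.lcm(p - 1, q - 1)          # lambda = lcm(p-1, q-1)
mu = pow(lam, -1, n)                  # with g = n+1, mu = lambda^{-1} mod n

def encrypt(m):
    """c = g^m * r^n mod n^2, with g = n+1 and random r coprime to n."""
    r = random.randrange(1, n)
    while math.gcd(r, n) != 1:
        r = random.randrange(1, n)
    return (pow(n + 1, m, n2) * pow(r, n, n2)) % n2

def decrypt(c):
    """m = L(c^lambda mod n^2) * mu mod n, where L(x) = (x-1)/n."""
    L = (pow(c, lam, n2) - 1) // n
    return (L * mu) % n

c1, c2 = encrypt(17), encrypt(25)
# Multiplying ciphertexts adds plaintexts: D(E(x1)*E(x2) mod n^2) = x1+x2.
print(decrypt((c1 * c2) % n2))        # 42
```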

2.4. Multimodal Fusion Techniques: Transformers and Cross-Modal GANs

Multimodal fusion has become a critical area of research, particularly in applications requiring the integration of heterogeneous data sources such as images, text, and time-series data. Transformer-based models and cross-modal generative adversarial networks (GANs) represent two major categories of multimodal fusion techniques, each offering distinct advantages and challenges. Transformer-based multimodal fusion models, such as Vision-Language Transformers (ViLTs) and Multimodal BERT variants, employ self-attention mechanisms to model cross-modal dependencies. These models have demonstrated state-of-the-art performance in tasks like image–text understanding, video captioning, and multimodal reasoning. However, the quadratic complexity $O(n^2)$ of standard self-attention mechanisms poses significant computational challenges, making them inefficient for large-scale multimodal datasets in cloud computing environments. Cross-modal GANs, including CycleGAN and CLIP-guided GANs, leverage adversarial training to enhance feature alignment across different modalities. These models have been widely applied in domain adaptation, data augmentation, and cross-modal synthesis. Despite their effectiveness, GAN-based methods are often plagued by mode collapse and unstable training dynamics, making them less reliable for structured multimodal tasks that require robust feature representations. In contrast, the proposed sliding-window attention mechanism optimizes local–global feature fusion while reducing computational costs, making it more suitable for privacy-preserving multimodal learning in cloud-based environments. The sliding loss function further enhances the adaptability of the model by prioritizing local feature variations, addressing limitations observed in both Transformer-based and GAN-based multimodal approaches.

3. Materials and Methods

3.1. Dataset Collection

In this study, two distinct datasets—time-series and spatial datasets—were constructed to validate the effectiveness of the proposed model. These datasets reflect the characteristics and requirements of time-series and spatial data in various real-world applications. The process of dataset collection and preparation was a critical component of the experimental workflow, involving meticulous efforts in data source selection, collection methods, cleaning, and preprocessing to ensure high data quality and broad coverage. The construction of the time-series dataset primarily focused on financial scenarios. Representative stock markets, including the Shanghai Stock Exchange Composite Index (SSE), NASDAQ, and S&P 500, were selected to gather stock prices and related indicators. These data were sourced from multiple authoritative financial platforms, such as Yahoo Finance, Sina Finance, and Google Finance. Automated web scraping scripts were developed to periodically retrieve data, utilizing multithreaded crawling techniques to enhance efficiency and ensure real-time data availability. During the collection process, particular attention was given to the following aspects:
First, data completeness was ensured by covering as long a historical period as possible for each stock (2013–2023), avoiding data gaps caused by network issues. Second, data diversity was emphasized, as shown in Table 1. Beyond core metrics such as stock prices, auxiliary indicators were collected, including stock volatility, market sentiment indices, trading volumes, and turnover rates. These supplementary indicators provided robust support for multidimensional analyses.
For the spatial dataset, the CelebA [36] dataset was chosen as the primary data source, as shown in Table 2. CelebA is a high-quality public facial image dataset containing over 200,000 images, each annotated with detailed metadata, such as facial landmarks, expression categories, and whether glasses are worn. These annotations provided a rich foundation for in-depth analysis in the research tasks. The collection method for spatial data combined the direct download of publicly available datasets with careful compliance checks. During the download process, the licensing agreements for the datasets were thoroughly reviewed to ensure adherence to ethical standards and legal requirements for research.
The datasets collected represent diverse and comprehensive data sources, ensuring their applicability across various research scenarios. The time-series dataset captures critical temporal dynamics in financial markets, while the spatial dataset provides rich facial image annotations for detailed analysis. These datasets lay a solid foundation for validating the proposed model’s performance across multimodal tasks.

3.2. Dataset Preprocessing

Time-series data preprocessing is a critical step in time-series analysis and modeling, primarily aimed at addressing issues such as missing values and outliers to ensure data continuity and validity, thereby enhancing model training and predictive performance. Time-series data, consisting of observations arranged in chronological order, often suffer from missing values, noise, or abrupt fluctuations, which can adversely affect model performance. Comprehensive preprocessing is therefore essential prior to modeling. Handling missing values is one of the primary tasks in time-series data preprocessing. Missing values may result from sensor malfunctions, incomplete data recording, or sampling errors. In time-series data, missing values typically appear as gaps in observations at specific time points. Common methods for filling missing values include mean imputation, interpolation, and predictive modeling. Mean imputation replaces missing values with either global or local averages, smoothing the data. For a given time series $x_1, x_2, \ldots, x_n$ with a missing value at position i, mean imputation can be expressed as

$$x_i = \frac{1}{n-1} \sum_{\substack{j=1 \\ j \neq i}}^{n} x_j,$$

where n denotes the series length, and $x_j$ represents the non-missing values. While mean imputation is computationally simple, it may reduce the variability in the time series. In contrast, interpolation leverages the temporal order of data to estimate missing values through linear or higher-order methods. Linear interpolation assumes a linear relationship between the missing point and its neighbors, computed as

$$x_i = x_{i-1} + \frac{x_{i+1} - x_{i-1}}{t_{i+1} - t_{i-1}} \cdot (t_i - t_{i-1}),$$

where $x_{i-1}$ and $x_{i+1}$ are known values before and after the missing point, and $t_i$ represents the timestamp. Interpolation preserves the dynamic characteristics of the time series but may struggle with highly volatile data. In more complex scenarios, predictive models, such as ARIMA [37] or LSTM [38], can estimate missing values by leveraging historical patterns in the data. The predicted value using such models is given by

$$x_i = f(x_{i-1}, x_{i-2}, \ldots, x_{i-p}),$$
where p denotes the lag. Although computationally intensive, predictive modeling offers higher accuracy for complex data distributions. Outlier detection and treatment are equally essential in time-series data preprocessing. Outliers, which deviate significantly from the general data distribution, often arise from sensor errors, external interference, or system malfunctions. Statistical methods, rule-based approaches, or machine learning techniques are typically employed for outlier detection. Common statistical methods include Z-score and standard-deviation-based rules. In the Z-score method, the normalized value for each data point is calculated as

$$z_i = \frac{x_i - \mu}{\sigma},$$

where $\mu$ and $\sigma$ denote the mean and standard deviation of the series, respectively. A data point $x_i$ is flagged as an outlier if its $z_i$ exceeds a threshold. The standard deviation rule identifies outliers as points deviating by more than a specified multiple of the standard deviation:

$$|x_i - \mu| > k \cdot \sigma,$$

where k is a user-defined threshold. Detected outliers can be addressed through interpolation, regression, or resampling methods. For instance, linear interpolation for outliers follows a similar formula as for missing values, while regression-based methods use models such as linear or polynomial regression to estimate outliers:

$$x_i = g(x_{i-1}, x_{i-2}, \ldots, x_{i-p}),$$

where $g(\cdot)$ represents a regression model.
Additionally, the continuity and periodicity of time-series data must be preserved during preprocessing. For applications with significant periodic patterns (e.g., financial or meteorological data), Fourier transforms can extract periodic components, and abnormal frequency components can be filtered. To better suit subsequent model inputs, normalization or standardization is often applied. Normalization scales the data to a range of [0, 1]:

$$x'_i = \frac{x_i - \min(x)}{\max(x) - \min(x)},$$

where $\min(x)$ and $\max(x)$ denote the minimum and maximum values of the series, respectively. Standardization, on the other hand, adjusts the data to have zero mean and unit variance:

$$x'_i = \frac{x_i - \mu}{\sigma}.$$
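A minimal pandas/NumPy sketch that chains these steps (linear interpolation for gaps, the Z-score rule with k = 2 for outliers, and min-max normalization) is shown below; the series values and threshold are invented for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical daily closing-price series with two gaps and one spike.
s = pd.Series([10.1, 10.3, np.nan, 10.6, 10.8, 55.0, 11.0, np.nan, 11.3])

# Linear interpolation for missing values (uses the temporal order).
s = s.interpolate(method="linear")

# Z-score outlier rule: flag |x - mu| > k*sigma, then re-interpolate.
k = 2.0
z = (s - s.mean()) / s.std()
s[z.abs() > k] = np.nan          # the 55.0 spike is flagged and removed
s = s.interpolate(method="linear")

# Min-max normalization to [0, 1] for model input.
s_norm = (s - s.min()) / (s.max() - s.min())
print(s_norm.round(3).tolist())
```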
Image data augmentation generates new training samples by transforming original images, aiming to enhance model generalization and robustness. This technique introduces controlled perturbations without altering data labels, exposing the model to diverse sample variations, reducing overfitting, and improving resilience to noise and unseen data. In deep learning, augmentation techniques have proven indispensable, especially when training data are limited. Cutout is a simple yet effective augmentation method that simulates local information loss by randomly masking one or more rectangular regions in an image. This forces the model to learn robust global features. Given an input image $I \in \mathbb{R}^{H \times W \times C}$, where H, W, and C denote height, width, and channels, respectively, a center point $(x_c, y_c)$ is selected randomly, and a rectangular mask of size $h \times w$ is applied. The augmented image $I'$ is defined as

$$I'(x, y) = \begin{cases} 0, & \text{if } x_c - \frac{h}{2} \le x \le x_c + \frac{h}{2} \text{ and } y_c - \frac{w}{2} \le y \le y_c + \frac{w}{2}, \\ I(x, y), & \text{otherwise}. \end{cases}$$
CutMix, a more advanced augmentation method, combines two images and their labels by pasting a cropped region from one image onto another. For two images A and B with labels $y_A$ and $y_B$, the augmented image $I'$ is defined as

$$I'(x, y) = \begin{cases} I_B(x, y), & \text{if } x_c - \frac{h}{2} \le x \le x_c + \frac{h}{2} \text{ and } y_c - \frac{w}{2} \le y \le y_c + \frac{w}{2}, \\ I_A(x, y), & \text{otherwise}. \end{cases}$$

The new label is computed as a weighted combination:

$$y' = \lambda y_A + (1 - \lambda) y_B,$$

where $\lambda = 1 - \frac{h \cdot w}{H \cdot W}$ denotes the fraction of the image area retained from A (so that $1 - \lambda$ is the area ratio of the pasted region). Compared to traditional methods, Cutout and CutMix offer significant advantages. Cutout enforces global feature learning by simulating information loss, while CutMix enhances inter-class variability through pixel and label blending. Both methods alleviate data imbalance issues, especially in scenarios involving small samples or uneven class distributions. The selection of augmentation techniques should align with task characteristics and data distribution. For instance, in object detection, methods preserving target regions are preferable, whereas, in classification, occlusion and blending may enhance diversity. Integrating augmentation with model architectures and optimization strategies, such as attention mechanisms, can further improve utilization of augmented data.
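Below is a minimal NumPy sketch of both augmentations; the patch sizes, image shapes, and random seed are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def box(H, W, h, w):
    """Random patch around a center, clipped at the image borders."""
    yc, xc = rng.integers(H), rng.integers(W)
    return (max(0, yc - h // 2), min(H, yc + h // 2),
            max(0, xc - w // 2), min(W, xc + w // 2))

def cutout(img, h=16, w=16):
    """Zero out a random h x w patch (local information loss)."""
    out = img.copy()
    y0, y1, x0, x1 = box(*img.shape[:2], h, w)
    out[y0:y1, x0:x1] = 0
    return out

def cutmix(img_a, y_a, img_b, y_b, h=16, w=16):
    """Paste a random patch of B into A; mix labels by retained area."""
    out = img_a.copy()
    H, W = img_a.shape[:2]
    y0, y1, x0, x1 = box(H, W, h, w)
    out[y0:y1, x0:x1] = img_b[y0:y1, x0:x1]
    lam = 1 - (y1 - y0) * (x1 - x0) / (H * W)   # fraction kept from A
    return out, lam * y_a + (1 - lam) * y_b

a, b = rng.random((64, 64, 3)), rng.random((64, 64, 3))
mixed, label = cutmix(a, np.array([1.0, 0.0]), b, np.array([0.0, 1.0]))
print(label)   # e.g. [0.94, 0.06] when the patch covers ~6% of the image
```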

3.3. Proposed Method

3.3.1. Sliding-Window Computation Network

The overall framework of the proposed method is illustrated in the figure and comprises an encoder and a decoder. As shown in Figure 1, the effective processing and integration of multimodal data are achieved through the design of a sliding-window attention mechanism and cross-layer blocks. The input consists of preprocessed multimodal data (e.g., images and time-series data), which, after passing through the feature extraction module, is fed into the encoder to generate multi-scale feature representations. These features are progressively reconstructed in the decoder to produce high-resolution predictions. The preprocessed input data are divided into multi-scale feature maps denoted as $F_1, F_2, F_3, F_4$, where $F_1$ represents the highest-resolution feature map and $F_4$ represents the lowest resolution. In the encoder, each feature map is processed through the sliding-window attention mechanism, resulting in multi-head attention representations $M_i$. The sliding-window attention mechanism divides the feature maps into local windows, where self-attention is computed for features within each window. The resulting multi-scale feature maps $M_1, M_2, M_3, M_4$ are then passed to the cross-layer block, which facilitates feature complementation across layers. Cross-layer blocks employ residual connections to fuse information from the encoder and decoder, enhancing feature expressiveness. In the decoder, the outputs of the cross-layer blocks, denoted as $D_1, D_2, D_3, D_4$, are processed through upsampling modules to restore feature resolution, resulting in high-resolution feature maps $D_1^{up}, D_2^{up}, D_3^{up}, D_4^{up}$. These feature maps are concatenated along the channel dimension and further transformed using a multi-layer perceptron (MLP), producing the final high-resolution output mask. Throughout this process, the encoder leverages the synergy between the sliding-window attention mechanism and cross-layer blocks to extract multi-scale features and facilitate cross-layer interaction. Meanwhile, the decoder integrates these multi-scale features through progressive upsampling and fusion operations, culminating in high-resolution predictions. This design efficiently utilizes both global and local information in multimodal data, significantly improving model performance and computational efficiency.

3.3.2. Sliding-Window Attention Mechanism

The sliding-window attention mechanism was first introduced in Longformer [39] and has since been integrated into multiple Transformer variants to enhance computational efficiency. For example, Swin Transformer [40] employs a hierarchical window attention mechanism, enabling efficient feature modeling while reducing computational costs. BigBird [41] further combines sliding-window attention with global sparse attention and random attention, maintaining global information retention while significantly reducing computational overhead. Building on these advancements, the proposed method introduces an adaptive sliding-window attention mechanism to better accommodate the complex feature distributions of multimodal data, achieving significant performance improvements in temporal and spatial data fusion tasks. The sliding-window attention mechanism is a critical component of the proposed model, as shown in Figure 2.
Compared to traditional global self-attention mechanisms, it reduces computational complexity by incorporating localized window designs while retaining essential local and global information. Traditional self-attention computes pairwise attention weights across all elements of a sequence or feature map, resulting in a computational complexity of $O(n^2)$, where n represents the length of the input sequence or feature map. This approach becomes computationally prohibitive for long sequences or high-resolution feature maps. In contrast, the sliding-window attention mechanism divides input features into fixed-size local windows and computes attention weights only within each window, reducing complexity to $O(n \cdot w)$, where w is the window size. This localized attention preserves local feature correlations while capturing global context through the sliding operation. For a feature map with dimensions $H \times W \times C$, where H and W denote the height and width and C is the number of channels, the sliding-window size is defined as $w \times w$, with a stride of s. The input feature map $F \in \mathbb{R}^{H \times W \times C}$ is partitioned into multiple local windows. Each window's query (Q), key (K), and value (V) vectors have dimensions $Q, K, V \in \mathbb{R}^{w \times w \times C}$. The attention weights A for each window are computed as

$$A = \mathrm{softmax}\!\left( \frac{Q K^{\top}}{\sqrt{d_k}} \right),$$

where $d_k$ is the dimensionality of the key vectors. The output features Y for each window are obtained via a weighted sum:

$$Y = A V.$$

The outputs of all windows are concatenated based on their original positions to reconstruct the final sliding-window attention output $F_{\mathrm{output}} \in \mathbb{R}^{H \times W \times C}$. The sliding-window attention mechanism is embedded into multiple layers of the model to handle features at varying resolutions. For high-resolution feature maps (e.g., $H/4 \times W/4 \times C_1$), a window size of $w = 7$ and stride $s = 4$ are used to balance local feature extraction and computational cost. For low-resolution feature maps (e.g., $H/32 \times W/32 \times C_4$), a window size of $w = 3$ and stride $s = 2$ are employed to accommodate sparser feature distributions. For each layer, the channel dimension C is reduced to $C/4$ through linear projection, further lowering computational overhead. The sliding-window attention mechanism is combined with the sliding-window computation network by alternately stacking attention computations and convolutional operations. Convolutions enhance translational invariance and extract local features, while sliding-window attention captures long-range dependencies within windows. This synergistic design maintains computational efficiency while capturing both local and global information. Such a structure is particularly advantageous for multimodal data processing, such as integrating temporal and spatial features. The sliding-window attention mechanism adapts to varying data distributions across modalities, achieving efficient feature integration through hierarchical processing. This design improves model performance and enhances robustness and generalization in complex scenarios.
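The window partitioning and per-window attention can be sketched compactly in PyTorch. The single-head form below omits the learned Q/K/V projections and the multi-resolution stride schedule described above; it only demonstrates the partition, the scaled softmax attention, and the reassembly.

```python
import torch
import torch.nn.functional as F

def window_attention(feat, win=7):
    """Self-attention restricted to non-overlapping win x win windows.
    feat: (B, H, W, C) with H, W divisible by win. Cost scales with the
    window area rather than the full map, avoiding global O(n^2)."""
    B, H, W, C = feat.shape
    # Partition into (num_windows, win*win, C) token groups.
    x = feat.view(B, H // win, win, W // win, win, C)
    x = x.permute(0, 1, 3, 2, 4, 5).reshape(-1, win * win, C)
    q, k, v = x, x, x                      # single head, no projections
    attn = F.softmax(q @ k.transpose(-2, -1) / C**0.5, dim=-1)
    y = attn @ v                           # Y = A * V inside each window
    # Reverse the partition back to (B, H, W, C).
    y = y.view(B, H // win, W // win, win, win, C)
    return y.permute(0, 1, 3, 2, 4, 5).reshape(B, H, W, C)

out = window_attention(torch.randn(2, 28, 28, 64), win=7)
print(out.shape)  # torch.Size([2, 28, 28, 64])
```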

3.3.3. Time-Series and Space Fusion Module

The time-series and space fusion module is designed to effectively integrate temporal features (e.g., stock prices and text sequences) with spatial features (e.g., images), leveraging a combination of local perception, sliding-window attention, and a feed-forward network (FFN) to achieve deep multimodal feature fusion, as shown in Figure 3.
The module, as illustrated in the figure, consists of three main components: sliding-window attention, a local perception unit, and a feed-forward network. In the input stage, assume the temporal features are represented as $F_i \in \mathbb{R}^{N_t \times C_t}$, and the spatial features are represented as $M_i \in \mathbb{R}^{H_s \times W_s \times C_s}$, where $N_t$ and $C_t$ denote the length and number of channels of the temporal features, respectively, and $H_s$, $W_s$, and $C_s$ represent the height, width, and number of channels of the spatial features. To enable interaction between these features, a channel alignment operation is first applied to unify the channel dimensions to d:

$$F'_i = W_t F_i, \qquad M'_i = W_s M_i,$$

where $W_t \in \mathbb{R}^{C_t \times d}$ and $W_s \in \mathbb{R}^{C_s \times d}$ are linear transformation matrices. The sliding-window attention mechanism serves as the core of this module and is used to capture the correlation between temporal and spatial features. Specifically, both $F'_i$ and $M'_i$ are first normalized using LayerNorm, followed by the projection of query (Q), key (K), and value (V) vectors:

$$Q_t = W_q^t F'_i, \qquad K_s = W_k^s M'_i, \qquad V_s = W_v^s M'_i,$$

where $W_q^t$, $W_k^s$, and $W_v^s \in \mathbb{R}^{d \times d}$ are projection matrices. This mechanism explicitly models the relationships between temporal and spatial features, enabling effective capture of multimodal data interactions. In the local perception unit, the extracted features are further enhanced through depthwise separable convolution operations, which focus on local feature enhancement. Given the input features $O_{ts}$, a $3 \times 3$ depthwise separable convolution is applied to extract local features:

$$L = \mathrm{Conv}_{3 \times 3}(O_{ts}),$$

where this convolution operation efficiently captures local spatial information while maintaining computational efficiency. Finally, the fused features are transformed non-linearly using a feed-forward network. The FFN comprises two fully connected layers, where the first layer increases the feature dimension and the second layer reduces it back:

$$F_{\mathrm{output}} = \mathrm{ReLU}(W_1 L) W_2,$$

where $W_1 \in \mathbb{R}^{d \times 4d}$ and $W_2 \in \mathbb{R}^{4d \times d}$ are the weight matrices. This design has several advantages. The sliding-window attention mechanism captures long-range dependencies between temporal and spatial features, while the local perception unit enhances the representation of local spatial details. The FFN further improves feature expressiveness, ensuring robust performance in complex tasks. In the context of this study, the fusion module effectively integrates the characteristics of temporal and spatial data, allowing the model to capture both local and global information simultaneously. As a result, the module demonstrates superior accuracy and robustness in multimodal data processing tasks. With its hierarchical structure and lightweight design, the module also significantly reduces computational overhead, making it suitable for large-scale data processing scenarios.
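A simplified PyTorch sketch of the module follows. It flattens the spatial map into a token sequence and, as one simplification, applies the depthwise convolution in 1D over the fused sequence rather than as the 3 × 3 spatial convolution described above; all layer sizes are illustrative.

```python
import torch
import torch.nn as nn

class TimeSpaceFusion(nn.Module):
    """Cross-attention from temporal queries to spatial keys/values,
    then a depthwise local-perception conv and a 4x-expansion FFN."""
    def __init__(self, c_t, c_s, d):
        super().__init__()
        self.align_t = nn.Linear(c_t, d)       # W_t: channel alignment
        self.align_s = nn.Linear(c_s, d)       # W_s
        self.norm_t, self.norm_s = nn.LayerNorm(d), nn.LayerNorm(d)
        self.q = nn.Linear(d, d)               # W_q^t
        self.k, self.v = nn.Linear(d, d), nn.Linear(d, d)  # W_k^s, W_v^s
        self.local = nn.Conv1d(d, d, 3, padding=1, groups=d)  # depthwise
        self.ffn = nn.Sequential(nn.Linear(d, 4 * d), nn.ReLU(),
                                 nn.Linear(4 * d, d))

    def forward(self, f_t, m_s):
        # f_t: (B, N_t, C_t) temporal; m_s: (B, H*W, C_s) flattened spatial
        f = self.norm_t(self.align_t(f_t))
        m = self.norm_s(self.align_s(m_s))
        q, k, v = self.q(f), self.k(m), self.v(m)
        attn = torch.softmax(q @ k.transpose(-2, -1) / q.shape[-1]**0.5, -1)
        o = attn @ v                            # temporal-spatial interaction
        o = self.local(o.transpose(1, 2)).transpose(1, 2) + o  # local unit
        return self.ffn(o) + o                  # residual FFN

fusion = TimeSpaceFusion(c_t=8, c_s=64, d=32)
out = fusion(torch.randn(2, 100, 8), torch.randn(2, 14 * 14, 64))
print(out.shape)  # torch.Size([2, 100, 32])
```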

3.3.4. Sliding Loss Function

The sliding loss function is an optimization objective proposed to address the limitations of traditional loss functions in handling dynamic characteristics of time-series or multimodal data. Traditional loss functions, such as the mean squared error (MSE), compute the error between model predictions and ground truth values globally. The formulation of the MSE loss function is given as
$$\mathcal{L}_{\mathrm{MSE}} = \frac{1}{n} \sum_{t=1}^{n} (\hat{y}_t - y_t)^2,$$

where $\hat{y}_t$ denotes the predicted value, $y_t$ represents the ground truth, and n is the total number of samples. While this global computation approach is straightforward, it fails to capture the local variations in data, especially in scenarios with significant dynamic changes or prominent local features. To address this, the sliding loss function incorporates the concept of sliding windows, computing losses within localized windows to better capture the dynamic variations in data. Specifically, let the predicted values be $\hat{y} = [\hat{y}_1, \hat{y}_2, \ldots, \hat{y}_n]$ and the ground truth values be $y = [y_1, y_2, \ldots, y_n]$. With a sliding window of size w, the sliding loss function is defined as

$$\mathcal{L}_{\mathrm{sliding}} = \frac{1}{n - w + 1} \sum_{i=1}^{n - w + 1} \frac{1}{w} \sum_{t=i}^{i + w - 1} (\hat{y}_t - y_t)^2.$$

Unlike traditional loss functions, the sliding loss function computes errors independently within each window and averages the losses across all windows. This design balances global optimization with sensitivity to local dynamics, improving the model's performance in complex scenarios. The integration of the sliding loss function with the sliding-window computation network highlights their complementarity. In the sliding-window computation network, input data are divided into multiple local windows, and features within each window are processed independently, enabling efficient extraction of local information. Correspondingly, the sliding loss function calculates losses for the predictions and ground truth values within the same windows, enhancing the model's ability to learn dynamic patterns within these localized regions. A critical aspect of this integration is the consistency of the window size w across the feature extraction and loss computation processes. Ensuring identical strides and sizes for the model's feature extraction windows and the loss function's sliding windows guarantees alignment in optimization objectives. The sliding loss function offers a significant advantage in its dynamic adjustment capability. Traditional global loss functions may overly prioritize global errors, neglecting critical local patterns in data with uneven distributions or strong dynamic characteristics. By introducing the perspective of localized windows, the sliding loss function localizes error computation, enabling the model to effectively learn prominent dynamic patterns. Furthermore, this loss function significantly enhances the model's ability to capture local features in both time-series data (e.g., stock price prediction) and spatial data (e.g., image feature extraction). Mathematically, the global error distribution of the sliding loss function can approximate the global optimization objective through the linear combination of local windows. This relationship is demonstrated as follows: let the global error be $\mathcal{L}_{\mathrm{global}}$ and the local error within a window be $\mathcal{L}_{\mathrm{local}}(i)$. Then,

$$\mathcal{L}_{\mathrm{global}} = \frac{1}{n} \sum_{t=1}^{n} (\hat{y}_t - y_t)^2 \approx \frac{1}{n - w + 1} \sum_{i=1}^{n - w + 1} \mathcal{L}_{\mathrm{local}}(i),$$

where

$$\mathcal{L}_{\mathrm{local}}(i) = \frac{1}{w} \sum_{t=i}^{i + w - 1} (\hat{y}_t - y_t)^2.$$
This approximation indicates that the sliding loss function not only optimizes the global error but also explicitly constrains the local error, enhancing the model’s stability and flexibility. In the task addressed by this study, the application of the sliding loss function effectively resolves the challenge of learning local dynamic patterns in multimodal data. This is particularly evident in scenarios involving the fusion of time-series and spatial data, where the sliding loss function significantly enhances the model’s capability to capture local variations. Moreover, the combined design of the sliding loss function and the sliding-window computation network substantially reduces noise interference during optimization, ensuring stability and efficiency in model training. Overall, the sliding loss function provides an innovative solution for efficient learning and dynamic modeling of multimodal data.
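Since the per-window MSEs over all length-w windows can be gathered with a single unfold, the sliding loss admits a compact PyTorch sketch; the batch shape and window size below are illustrative.

```python
import torch

def sliding_loss(y_hat, y, w=10):
    """Mean of per-window MSEs over all length-w sliding windows,
    following the definition of L_sliding above."""
    err2 = (y_hat - y) ** 2                             # (B, n) squared errors
    # unfold extracts all n-w+1 stride-1 windows of size w.
    windows = err2.unfold(dimension=-1, size=w, step=1)  # (B, n-w+1, w)
    return windows.mean(dim=-1).mean()                  # avg within, then across

y_hat, y = torch.randn(4, 128), torch.randn(4, 128)
print(sliding_loss(y_hat, y, w=10))
# With stride 1, interior points appear in more windows than edge points,
# so the result is a reweighted MSE that emphasizes well-covered regions.
```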

3.4. Security Analysis

Ensuring data privacy and security is critical in multimodal data processing, particularly in fields such as financial forecasting, medical imaging, and autonomous driving, where data breaches can lead to severe financial losses or legal risks. The proposed sliding-window computation network and sliding loss function are designed not only to optimize multimodal data modeling but also to integrate multiple security-enhancing mechanisms to ensure privacy protection and computational security. The security guarantees of this framework can be categorized as follows. First, local data processing and decentralized computation are adopted. The sliding-window computation network ensures that data processing occurs locally, eliminating the need to transmit raw data, thereby reducing the risk of data exposure. Compared to traditional centralized data processing methods, this approach significantly mitigates the possibility of cloud-based data breaches. In decentralized computing scenarios, only necessary encrypted computation results are shared among collaborators, ensuring privacy while still achieving efficient training and inference. Second, trusted execution environments (TEE-based computing) enhance cloud-based security. In cloud computing environments, data are often processed remotely, making it challenging to guarantee data confidentiality during computation. The proposed framework integrates TEE-based computing (e.g., Intel SGX) to establish secure execution environments, ensuring that, even when data are processed in the cloud, they remain encrypted and inaccessible to unauthorized parties. This approach effectively mitigates malicious server attacks while ensuring that model parameters and private data are not compromised. Additionally, differential privacy (DP) prevents information leakage. During collaborative training, attackers may attempt to reconstruct original data using gradient updates or intermediate training results. To address this, the proposed framework incorporates differential privacy, introducing controlled noise into gradient computations to prevent individual data points from being reverse-engineered. Mathematically, differential privacy ensures that, even if an attacker gains partial access to model parameters, they cannot accurately reconstruct the original dataset. Moreover, the framework allows for adjustable privacy budgets, enabling different levels of privacy protection based on application requirements.
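The gradient perturbation step can be sketched as follows: per-sample gradients are norm-clipped and Gaussian noise scaled to the clipping bound is added before release. This is a minimal DP-SGD-style sketch; the clipping bound and noise multiplier are placeholders, and a real deployment would additionally track the privacy budget with a privacy accountant.

```python
import torch

def dp_noisy_grad(grads, clip=1.0, sigma=1.0):
    """Clip each per-sample gradient to norm <= clip, average, and add
    Gaussian noise calibrated to the clipping bound (DP-SGD core step)."""
    clipped = []
    for g in grads:                               # g: one sample's gradient
        norm = g.norm()
        clipped.append(g * min(1.0, clip / (norm + 1e-12)))
    avg = torch.stack(clipped).mean(dim=0)
    noise = torch.normal(0.0, sigma * clip / len(grads), size=avg.shape)
    return avg + noise                            # released, privatized update

per_sample = [torch.randn(10) for _ in range(32)]
print(dp_noisy_grad(per_sample, clip=1.0, sigma=1.0))
```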
For cloud-based deployment, a zero-trust architecture (ZTA) is employed to enforce restricted data access, ensuring that only authorized users and devices can interact with sensitive data. Unlike traditional security models that assume implicit trust, ZTA employs multi-factor authentication (MFA), dynamic access controls, and continuous verification to prevent unauthorized access. Additionally, secure multi-party computation (SMPC) enables collaborative learning across institutions while ensuring that raw data remain confidential. This technique is particularly beneficial for privacy-sensitive federated learning tasks. To align with global privacy regulations, the proposed framework ensures strict compliance with legal standards governing data storage, access, and processing. Under the General Data Protection Regulation (GDPR), the framework adheres to the principle of data minimization as sliding-window computation only utilizes essential localized data fragments. Moreover, explainability mechanisms ensure that users can understand how their data are utilized. For healthcare applications, compliance with the Health Insurance Portability and Accountability Act (HIPAA) is maintained through encrypted storage and restricted access control. In financial data security, the framework integrates distributed ledger technology and data provenance tracking, ensuring the integrity and traceability of shared data across institutions while meeting financial sector compliance requirements.

3.5. Experimental Design

3.5.1. Evaluation Metrics

Model performance was evaluated using metrics such as precision, recall, accuracy, and F1-score, which assess classification performance from various perspectives. Precision measures the proportion of correctly predicted positive samples among all samples predicted as positive, while recall reflects the model’s ability to identify all actual positive samples. Accuracy quantifies the proportion of correctly predicted samples among the total, providing a general measure of performance. However, due to the trade-off between precision and recall, the F1-score, defined as their harmonic mean, is used to balance these metrics, especially in scenarios with imbalanced class distributions. The corresponding formulas are expressed as follows:
$$\mathrm{Precision} = \frac{TP}{TP + FP},$$
$$\mathrm{Recall} = \frac{TP}{TP + FN},$$
$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN},$$
$$\mathrm{F1\text{-}Score} = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}.$$
In these formulas, TP (true positive) represents the number of correctly predicted positive samples, FP (false positive) represents the number of incorrectly predicted positive samples, TN (true negative) represents the number of correctly predicted negative samples, and FN (false negative) represents the number of actual positive samples that were not correctly predicted. Precision and recall capture the model’s performance in terms of prediction accuracy and detection capability, respectively, while accuracy serves as an intuitive overall performance indicator. However, when class distribution is imbalanced, relying solely on accuracy may misrepresent model performance, as it does not effectively reflect the model’s ability to predict minority classes. F1-score is particularly useful in these cases as it balances precision and recall.
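For completeness, a short NumPy sketch computing the four metrics from binary label arrays is given below; the example labels are invented.

```python
import numpy as np

def classification_metrics(y_true, y_pred):
    """Compute precision, recall, accuracy, and F1 from binary labels."""
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    tn = np.sum((y_pred == 0) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, accuracy, f1

y_true = np.array([1, 0, 1, 1, 0, 1, 0, 0])
y_pred = np.array([1, 0, 1, 0, 0, 1, 1, 0])
print(classification_metrics(y_true, y_pred))  # (0.75, 0.75, 0.75, 0.75)
```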

3.5.2. Hardware and Software Platforms

The hardware and software configurations employed for model training and evaluation were critical to ensuring efficient execution. The hardware platform consisted of a high-performance computing server equipped with four NVIDIA A100 GPUs, each with 40 GB of memory, enabling efficient training and inference of large-scale deep learning models. The server was powered by a 128-core AMD EPYC 7742 processor with a base clock speed of 2.25 GHz, supporting parallel computation tasks effectively. Additionally, the system featured 1 TB of RAM, accommodating the loading of large-scale datasets and the training of complex models. NVMe SSDs were employed as storage media, providing read and write speeds of up to 6.4 GB/s, facilitating rapid data loading and checkpoint saving during model training.
The software platform was based on the Ubuntu 20.04 operating system and utilized the PyTorch deep learning framework (version 1.12.0) for model implementation. GPU parallelism was optimized using CUDA 11.6 and cuDNN 8.4. Data preprocessing and augmentation were performed using Python libraries such as Pandas and NumPy, with Albumentations employed for advanced image augmentation. Distributed training was implemented using PyTorch’s Distributed Data Parallel (DDP) module, which significantly improved training speed while minimizing GPU communication overhead. All experimental code was written in Python 3.9, with results visualized in Jupyter Notebook (version 7.3.3).
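For reference, a minimal DDP entry point of the kind described above might look as follows. This is an illustrative sketch, not the authors' released code: the model is a placeholder, and the script name in the launch command is hypothetical. It assumes a single node whose four GPUs each host one process, launched with `torchrun --nproc_per_node=4 train.py`:

```python
import os

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    # One process per GPU, e.g. launched with:
    #   torchrun --nproc_per_node=4 train.py
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model = torch.nn.Linear(128, 2).cuda(local_rank)  # placeholder model
    model = DDP(model, device_ids=[local_rank])
    # ... training loop: DDP all-reduces gradients across the four GPUs ...
    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```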

3.5.3. Dataset Partitioning and Hyperparameter Configuration

To ensure fairness and robustness in model evaluation, 80% of the data were used for training, 10% for validation, and 10% for testing. To further assess model performance under varying data distributions, a fivefold cross-validation strategy was adopted: the dataset was divided into five mutually exclusive subsets, with four subsets used as the training set and the remaining one as the validation set in each iteration. This process was repeated five times, and the average performance was reported to reduce bias caused by differences in data distribution. The initial learning rate was set to 0.001 and gradually decreased using a cosine annealing scheduler to stabilize training in later stages. The AdamW optimizer was used, with a weight decay factor of 0.01 to prevent overfitting. The batch size was set to 32, and the model was trained for 100 epochs, with the first 10 epochs designated as warm-up to mitigate instability in the early training phase. Dropout regularization was applied with a dropout rate of 0.5, and gradient clipping with a threshold of 1.0 was employed to prevent gradient explosion.
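The hyperparameter configuration described above can be expressed in PyTorch roughly as follows. This is an illustrative sketch with a placeholder model (dropout would live inside the actual network), not the authors' training script:

```python
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import CosineAnnealingLR, LinearLR, SequentialLR

model = torch.nn.Linear(128, 2)  # placeholder for the actual network
optimizer = AdamW(model.parameters(), lr=1e-3, weight_decay=0.01)

# 10 warm-up epochs (linear ramp-up) followed by cosine annealing
# over the remaining 90 of the 100 training epochs.
scheduler = SequentialLR(
    optimizer,
    schedulers=[
        LinearLR(optimizer, start_factor=0.1, total_iters=10),
        CosineAnnealingLR(optimizer, T_max=90),
    ],
    milestones=[10],
)

for epoch in range(100):
    # ... forward pass, loss computation, and loss.backward() ...
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    optimizer.step()
    optimizer.zero_grad()
    scheduler.step()
```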

3.5.4. Baselines

To comprehensively evaluate the performance of the proposed method, federated learning (FL) [42], secure multi-party computation (MPC) [43], homomorphic encryption (HE) [44], and trusted execution environment (TEE)-based methods [45] were selected as comparative baselines. These approaches offer different privacy-preserving computation solutions, each with distinct technical characteristics and applications. For federated learning (FL), the FedAvg algorithm was implemented, with local training rounds set to 5 and global aggregation occurring every 10 rounds. Each client trains a model locally using stochastic gradient descent (SGD) with a learning rate of 0.01, and the aggregated model is updated at the central server. To ensure privacy, differential privacy (DP) noise was added during model updates, and communication overhead was optimized through model compression techniques. For secure multi-party computation (MPC), a secret-sharing-based protocol was employed, where data from each participant were split into multiple shares distributed among computing nodes. Secure summation and multiplication operations were implemented using additive secret sharing, and computations were performed over a modular arithmetic field. The MP-SPDZ framework was used to simulate real-world privacy-preserving computations. For homomorphic encryption (HE), the BFV encryption scheme was adopted, enabling arithmetic computations on encrypted data. Additive and multiplicative homomorphism operations were supported, with polynomial modulus and ciphertext modulus set to balance security and computational efficiency. To reduce decryption overhead, batching techniques and ciphertext packing were applied. The Microsoft SEAL library was used for implementation. For TEE-based methods, Intel SGX was utilized to create a secure enclave, allowing encrypted data to be loaded, processed, and decrypted within a hardware-isolated environment. The model inference was conducted inside the enclave, ensuring confidentiality even against privileged system attacks. The Gramine framework was used for secure execution, and performance was optimized by minimizing enclave transitions and utilizing efficient memory access patterns. All baseline methods were evaluated on the same dataset and in the same environment to ensure fair comparisons.
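To make the additive-secret-sharing idea behind the MPC baseline concrete, the following self-contained Python sketch splits each party's input into shares over a modular field and recovers only the aggregate. It is a didactic toy, not the MP-SPDZ protocol: the prime modulus and the three-node setup are illustrative choices, and a real deployment would use a cryptographically secure random number generator.

```python
import random

P = 2**61 - 1  # prime modulus for the arithmetic field (illustrative choice)

def share(secret: int, n: int) -> list[int]:
    # Split an integer into n additive shares that sum to secret mod P.
    shares = [random.randrange(P) for _ in range(n - 1)]
    shares.append((secret - sum(shares)) % P)
    return shares

def secure_sum(all_shares: list[list[int]]) -> int:
    # Each node sums the shares it holds locally; combining the partial
    # sums reveals only the total, never any individual input.
    partials = [sum(node_shares) % P for node_shares in zip(*all_shares)]
    return sum(partials) % P

inputs = [42, 17, 99]                          # one private value per party
all_shares = [share(x, 3) for x in inputs]     # distributed to 3 compute nodes
assert secure_sum(all_shares) == sum(inputs) % P
```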

4. Results and Discussion

4.1. Time-Series Data Testing Results

The purpose of the time-series data testing experiment was to evaluate the performance of the proposed method in handling dynamic time-series data, particularly its ability to effectively capture temporal dependencies in multimodal scenarios and enhance predictive capabilities. The experimental comparison included federated learning, MPC, homomorphic encryption, TEE-based methods, and the proposed approach. Performance was assessed using five evaluation metrics: precision, recall, accuracy, F1-score, and frames per second (FPS) [46,47,48]. The results demonstrated that the proposed method outperformed all the baseline models across these metrics, achieving precision, recall, accuracy, and F1-score values of 0.95, 0.91, 0.93, and 0.93, respectively. These findings underscore the advantages of the proposed sliding-window computation network and sliding loss function in processing dynamic time-series data while effectively integrating multimodal characteristics to enhance the ability to capture temporal patterns.
As shown in Figure 4 and Table 3, the observed performance differences among models can be attributed to their respective algorithmic characteristics and design logic. Federated learning, which employs distributed training to optimize a global model while preserving data privacy, exhibited sensitivity to imbalanced data distributions, resulting in a recall of only 0.81 and limiting its coverage of time-series data. MPC, leveraging cryptographic protocols for privacy preservation, demonstrated improved performance over federated learning but faced limitations in capturing complex data patterns due to significant communication overhead. Homomorphic encryption, enabling direct computation on encrypted data, enhanced both security and model performance; however, its high computational complexity constrained its support for real-time applications, which affected recall and accuracy. TEE-based methods, which rely on hardware isolation technologies, achieved superior performance across all four metrics compared to the aforementioned methods, but their dependence on hardware led to limitations in feature modeling capability. In contrast, the proposed method leveraged a sliding-window attention mechanism to precisely model temporal dependencies and utilized a sliding loss function to optimize local dynamic characteristics. This design resulted in significant improvements in precision and F1-score compared to the baseline methods, validating the model’s effectiveness. These mathematical advantages stem from the efficiency of the sliding-window mechanism in localized computations and the sensitivity of the sliding loss function to local dynamic changes. Consequently, the model effectively captures both global trends and local fluctuations within time-series data, demonstrating superior robustness and generalization in multimodal time-series tasks.

4.2. Spatial Data Testing Results

The objective of the spatial data testing experiment was to evaluate the performance of various methods in handling spatial data, particularly in modeling high-dimensional features (e.g., images) and capturing local features in multimodal tasks. The same baselines and evaluation metrics used for the time-series dataset were applied. The experimental results revealed that the proposed method outperformed all the baseline models across all the metrics, achieving precision, recall, accuracy, and F1-score values of 0.93, 0.90, 0.92, and 0.91, respectively. These findings demonstrate the proposed method’s significant advantages in extracting high-dimensional spatial features and modeling dynamic characteristics, effectively integrating local features with global dependencies and substantially improving the adaptability to complex spatial data.
As shown in Figure 5 and Table 4, from a theoretical perspective, the performance differences among the models can be attributed to their feature modeling capabilities and algorithmic structures. Federated learning provides robust privacy protection but exhibits limited ability to model high-dimensional spatial features due to its reliance on distributed training, resulting in precision and recall values of 0.84 and 0.79, respectively, which constrained its overall performance. MPC enhances the ability to capture local features through cryptographic protocols; however, its significant communication and computation overheads limit its performance in global feature modeling. Homomorphic encryption offers end-to-end privacy protection and improves security and model performance through encrypted computation, but its computational complexity restricts its scalability for large-scale spatial data, with recall reaching only 0.84. TEE-based methods significantly enhance computational efficiency and global modeling capability through hardware isolation technologies, achieving superior results across all the metrics compared to the aforementioned methods. However, due to weaker capture of complex local features, the F1-score remained at 0.89. In contrast, the proposed method employs a sliding-window attention mechanism to effectively capture local spatial information, complemented by a stripe–cross attention mechanism to extract global features, mathematically optimizing the feature representation capabilities. Additionally, the sliding loss function further reinforces the joint modeling of local features and global patterns during training, enabling the proposed method to achieve higher precision and robustness in complex spatial data tasks. These results underscore the innovation in mathematical structure and computational design of the proposed method, highlighting its applicability and superiority in multimodal spatial data scenarios.

4.3. Ablation Study on Different Attention Mechanisms

The ablation study on different attention mechanisms aimed to evaluate the contribution of the proposed sliding-window attention mechanism in time-series and spatial data tasks. Comparisons were made with the standard self-attention mechanism and the convolutional block attention module (CBAM) to validate the performance superiority of the proposed approach. The evaluation was conducted using four metrics: precision, recall, accuracy, and F1-score, across both time-series and spatial data tasks. The results demonstrated that the proposed method outperformed the other attention mechanisms in all the metrics for both tasks. Specifically, in the time-series task, the proposed method achieved precision, recall, accuracy, and F1-score values of 0.95, 0.91, 0.93, and 0.93, respectively. Similarly, in the spatial data task, the corresponding metrics were 0.93, 0.90, 0.92, and 0.91. While CBAM demonstrated improvements in local feature modeling over the standard self-attention mechanism, its overall performance remained inferior to the proposed method, particularly in recall and F1-score. In contrast, the standard self-attention mechanism exhibited the lowest performance in both tasks, highlighting its limitations in multimodal feature modeling.
As shown in Table 5, from a theoretical perspective, the observed performance differences can be attributed to the mathematical properties and structural designs of the attention mechanisms. The standard self-attention mechanism has a computational complexity of $O(n^2)$, where n represents the input sequence length or the number of pixels in the feature map. This quadratic complexity renders the mechanism inefficient for large-scale time-series data or high-resolution spatial data. Although the global nature of self-attention is advantageous for capturing long-range dependencies, it exhibits insufficient sensitivity to local features, resulting in lower precision and recall. CBAM addresses this limitation by introducing channel and spatial attention, enhancing local feature modeling while reducing computational complexity. However, its static feature extraction strategy is less adaptable to dynamic data distributions, leading to suboptimal performance in recall and F1-score. In contrast, the proposed sliding-window attention mechanism effectively combines localized attention with dynamic window adjustments, reducing computational complexity to $O(n \cdot w)$, where w represents the window size. This approach significantly enhances computational efficiency while capturing the local information within each sliding window. Additionally, global feature integration is achieved through window sliding, demonstrating superior generalization and robustness in both time-series and spatial data tasks. The mathematical advantage of the proposed method lies in its alignment with task-specific requirements: by constraining the attention calculation range, irrelevant features are excluded, enhancing the purity of feature representations. This design improves sensitivity to dynamic patterns in time-series data and ensures precise capture of local details in spatial data. Furthermore, the proposed method incorporates a stripe–cross attention mechanism that enhances interactions between features, enabling simultaneous optimization of local and global information representations in multimodal tasks. These mathematical innovations not only improve overall model performance but also significantly enhance adaptability to complex scenarios, conclusively demonstrating the theoretical and practical superiority of the proposed sliding-window attention mechanism.
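A minimal sketch of the windowed attention computation behind the $O(n \cdot w)$ cost is given below, assuming non-overlapping windows over a padded sequence. This is a simplified toy: the actual model additionally slides the windows for cross-window information flow and adds the stripe–cross attention for global interaction, both of which this sketch omits.

```python
import torch
import torch.nn.functional as F

def windowed_self_attention(x: torch.Tensor, w: int) -> torch.Tensor:
    """Self-attention restricted to local windows of size w.

    x: (batch, seq_len, dim). Each token attends only to the w tokens
    in its own window, so cost grows as O(n * w) instead of O(n^2).
    """
    b, n, d = x.shape
    pad = (w - n % w) % w                      # pad so seq_len divides by w
    x = F.pad(x, (0, 0, 0, pad))
    xw = x.view(b, -1, w, d)                   # (batch, n_windows, w, dim)
    scores = torch.einsum("bnqd,bnkd->bnqk", xw, xw) / d ** 0.5
    out = torch.einsum("bnqk,bnkd->bnqd", scores.softmax(dim=-1), xw)
    return out.reshape(b, -1, d)[:, :n]        # drop the padding again

x = torch.randn(2, 100, 64)
print(windowed_self_attention(x, w=20).shape)  # torch.Size([2, 100, 64])
```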
To further evaluate the effectiveness of different attention mechanisms in real-world applications, two qualitative case studies were conducted. In a financial time-series anomaly detection task, the sliding-window attention mechanism successfully localized short-term fluctuations, improving sensitivity to rapid changes. In contrast, standard self-attention showed limited responsiveness, and CBAM lacked adequate sequential modeling. In a spatial small-object recognition task, the sliding-window mechanism provided focused local attention and improved foreground–background separation, significantly enhancing recall and precision. These findings validate the practical advantage of the proposed attention design.

4.4. Ablation Study on Different Loss Functions

The ablation study on different loss functions aimed to verify the advantages of the proposed sliding loss function in time-series and spatial data tasks. A comprehensive evaluation was conducted by comparing the sliding loss function with cross-entropy loss and focal loss using four metrics: precision, recall, accuracy, and F1-score. The experimental results demonstrated that the sliding loss function outperformed the other two loss functions across all the metrics. In the time-series task, the proposed method achieved precision, recall, accuracy, and F1-score values of 0.95, 0.91, 0.93, and 0.93, respectively. For the spatial data task, these metrics were 0.93, 0.90, 0.92, and 0.91, respectively. While focal loss showed improvements over cross-entropy loss in certain scenarios, it fell short in recall and F1-score compared to the sliding loss function. Cross-entropy loss exhibited the weakest performance in both tasks, indicating its limited suitability for multimodal tasks.
As shown in Table 6, from a theoretical perspective, the differences in performance among the loss functions can be attributed to their design logic and capabilities in modeling error distributions. Cross-entropy loss optimizes a single global average and is therefore insensitive to long-tail distributions, resulting in subpar performance in recall and F1-score, particularly in tasks involving significant dynamic characteristics, such as time-series data. Focal loss enhances the model’s learning capability for minority classes, thereby outperforming cross-entropy loss in multimodal tasks; however, it remains insufficiently sensitive to local dynamic characteristics, making it less adaptable to fine-grained feature changes in multimodal data. The sliding loss function, by contrast, computes the loss over localized sliding windows and aggregates the results, so that deviations in short segments contribute explicitly to the training objective. This localized computation enhances the model’s sensitivity to the data’s dynamic variations while avoiding the neglect of local patterns by global optimization objectives. Furthermore, the integration of the sliding loss function with the sliding-window computation network reinforces joint modeling of local features and global patterns, resulting in superior performance in multimodal tasks. The mathematical advantage of the sliding loss function lies in its ability to balance global optimization with localized sensitivity, thereby improving robustness when handling complex dynamic data and enabling higher prediction accuracy and generalization in multimodal scenarios. The experimental results confirm these theoretical strengths and the practical effectiveness of the sliding loss function, offering a more efficient and precise solution for multimodal data processing.
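Since the paper characterizes the sliding loss as a localized loss computation without giving a closed form here, the sketch below shows one plausible reading: per-window cross-entropy terms over overlapping windows are averaged, so errors concentrated in short segments are weighted explicitly rather than diluted in a single global mean. The window size, stride, and cross-entropy base term are assumptions for illustration only.

```python
import torch
import torch.nn.functional as F

def sliding_loss(logits: torch.Tensor, targets: torch.Tensor,
                 window: int = 20, stride: int = 10) -> torch.Tensor:
    """Average per-window cross-entropy over overlapping sliding windows.

    logits: (batch, seq_len, num_classes); targets: (batch, seq_len).
    """
    n = logits.shape[1]
    losses = []
    for start in range(0, max(n - window, 0) + 1, stride):
        win_logits = logits[:, start:start + window].reshape(-1, logits.shape[-1])
        win_targets = targets[:, start:start + window].reshape(-1)
        losses.append(F.cross_entropy(win_logits, win_targets))
    return torch.stack(losses).mean()

logits = torch.randn(4, 100, 2)
targets = torch.randint(0, 2, (4, 100))
print(sliding_loss(logits, targets))  # scalar training loss
```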

4.5. Evaluation of Model Robustness Under Synthetic Data and Adversarial Samples

The primary objective of this experiment is to evaluate the robustness of the proposed sliding-window computation framework when confronted with data perturbations and non-ideal inputs, with particular emphasis on its stability and generalization under synthetic data and adversarial sample conditions. To achieve this, four comparative settings—“None” (no perturbation), “Synthetic Data” (augmented data), “Adversarial Samples” (perturbed inputs), and the “Proposed Method”—were established across two types of multimodal tasks. For both time-series and spatial image data, the model was assessed using five metrics: precision, recall, accuracy, F1-score, and FPS. This experimental setup aims to simulate realistic scenarios involving data drift, adversarial attacks, or incomplete observations, thereby offering a more practical understanding of the model’s reliability in deployment contexts.
As reflected in Table 7, under the “None” condition for time-series data, the model exhibited relatively poor performance, with a precision of 0.63 and recall of only 0.60, indicating weak adaptability to data perturbations and limited predictive capability. After introducing synthetic data, all the performance metrics improved, with the F1-score increasing to 0.73, demonstrating that representative augmented samples effectively enhance the model’s ability to adapt to data distribution variations. The inclusion of adversarial samples further improved the performance, with the F1-score reaching 0.84, implying a certain degree of robustness, although still constrained by the complexity of adversarial perturbation construction. In contrast, the proposed method achieved the highest performance, with an F1-score of 0.93 and FPS of 46, significantly outperforming all the other approaches. This result validates the effectiveness of the designed sliding-window attention mechanism and sliding loss function in improving both predictive power and computational efficiency. A similar trend was observed in spatial data tasks, where the performance under the “None” condition was the lowest, gradually improved with synthetic and adversarial samples, and peaked under the proposed method, achieving a precision of 0.93 and an F1-score of 0.91. These outcomes collectively confirm the generalizable robustness of the proposed framework across multimodal tasks.

4.6. Limitations and Future Work

Despite the significant performance improvements achieved by the proposed sliding-window attention mechanism and sliding loss function in multimodal time-series and spatial data processing tasks, certain limitations require further exploration. First, the proposed method is sensitive to specific hyperparameters, such as the window size, sliding stride, and loss function weights; these must currently be adjusted manually for each dataset and task objective, potentially increasing the adaptation cost in diverse application scenarios. Developing an adaptive parameter optimization mechanism that enables the model to tune hyperparameters automatically based on the data distribution is therefore an important direction for future research. Second, the correlations within multimodal data in practical applications often exhibit higher complexity, and simple attention mechanisms may fall short of fully capturing their latent characteristics; future studies could explore the integration of more advanced multimodal feature fusion techniques, such as multimodal contrastive learning or cross-modal generative adversarial networks, to enhance the model’s ability to represent multimodal features comprehensively. Third, although the proposed method outperforms traditional global attention mechanisms in computational efficiency, training on large-scale datasets remains time-consuming, particularly when the number of sliding windows is large. Future research could incorporate more efficient model compression techniques or hardware acceleration methods, such as optimizing attention mechanisms through low-rank decomposition or quantizing network weights, to further reduce computational complexity and resource consumption and pave the way for broader applicability in large-scale scenarios.

Future research will also explore the applicability of the proposed framework in a broader range of multimodal prediction tasks. Beyond financial forecasting, autonomous driving, and medical imaging, this approach holds potential value in climate modeling, network security, and industrial IoT applications. Climate data exhibit high-dimensional spatiotemporal dependencies, where the sliding-window attention mechanism can effectively capture both local and global variations, improving long-term trend predictions. In network security, time-series log data and evolving attack patterns require dynamic adaptation, and the proposed method enhances anomaly detection capabilities and improves robustness in attack identification. For industrial IoT applications, sensor fusion is a critical challenge, and the localized computation of the sliding loss function facilitates sensor data alignment and feature extraction. Further experimental studies will be conducted in these domains to evaluate the generalization capability of the framework and to optimize its computational efficiency and predictive accuracy across different data patterns.

5. Conclusions

To address the challenges of temporal and spatial feature modeling in multimodal data processing, an efficient and secure computation framework based on the sliding-window attention mechanism and sliding loss function was proposed. This framework aims to overcome the limitations of traditional methods in terms of privacy protection, computational efficiency, and accuracy loss. Multimodal data, prevalent in domains such as financial forecasting, medical imaging analysis, and autonomous driving, present high complexity and dynamic characteristics that impose stringent demands on model performance. The traditional approaches, including federated learning, secure multi-party computation, homomorphic encryption, and TEE-based techniques, have shown success in privacy protection and data collaboration but often struggle to balance efficiency and accuracy due to high computational complexity or limited feature-capturing capabilities. Against this backdrop, the proposed framework employs innovative designs to significantly enhance feature modeling and dynamic data processing performance while ensuring data privacy and achieving higher computational efficiency with minimal accuracy loss, providing a novel solution for privacy-preserving data security in cloud computing environments.

The primary innovations of this study are twofold. First, the sliding-window attention mechanism confines the attention scope to local windows, reducing the computational complexity from the global $O(n^2)$ to $O(n \cdot w)$, where w denotes the window size, while still capturing global feature dependencies through the sliding operation; this enables efficient modeling of both local and global information. Second, the sliding loss function employs a localized loss computation approach that enhances the model’s sensitivity to dynamic data variations, which is particularly advantageous in multimodal scenarios.

The experimental results strongly validated the effectiveness of the proposed method. For time-series data tasks, it achieved a precision of 0.95, recall of 0.91, accuracy of 0.93, and F1-score of 0.93, representing significant improvements over the baseline methods. In spatial data tasks, these metrics reached 0.93, 0.90, 0.92, and 0.91, respectively, further demonstrating its superior performance. In ablation experiments comparing attention mechanisms, the sliding-window attention mechanism outperformed standard self-attention and CBAM, raising the F1-score from 0.71 and 0.83 to 0.93 for time-series tasks and from 0.69 and 0.81 to 0.91 for spatial tasks. Similarly, in ablation experiments on loss functions, the sliding loss function outperformed cross-entropy loss and focal loss, increasing the F1-score from 0.67 and 0.83 to 0.93 for time-series tasks and from 0.64 and 0.82 to 0.91 for spatial tasks, further underscoring the significance of modeling local dynamic characteristics. These findings indicate that the proposed method achieves higher efficiency and lower accuracy loss while maintaining data privacy and security, providing crucial theoretical support and practical value for multimodal data processing in cloud computing environments.

Author Contributions

Conceptualization, W.C., Q.L., J.S. and C.L.; Data curation, X.Z., Z.L. and H.Z.; Funding acquisition, C.L.; Methodology, W.C., Q.L. and J.S.; Project administration, C.L.; Resources, X.Z., Z.L. and H.Z.; Software, W.C., Q.L., J.S. and H.Z.; Supervision, Y.Q. and C.L.; Validation, Z.L. and Y.Q.; Visualization, X.Z. and Y.Q.; Writing—original draft, W.C., Q.L., J.S., X.Z., Z.L., H.Z., Y.Q. and C.L. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by The National Key Research and Development Program of China (2024YFC2607600).

Data Availability Statement

The data presented in this study are available on request from the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Zhou, S.; Zhou, Z.; Wang, C.; Liang, Y.; Wang, L.; Zhang, J.; Zhang, J.; Lv, C. A User-Centered Framework for Data Privacy Protection Using Large Language Models and Attention Mechanisms. Appl. Sci. 2024, 14, 6824. [Google Scholar] [CrossRef]
  2. An, H.; Ma, R.; Yan, Y.; Chen, T.; Zhao, Y.; Li, P.; Li, J.; Wang, X.; Fan, D.; Lv, C. Finsformer: A Novel Approach to Detecting Financial Attacks Using Transformer and Cluster-Attention. Appl. Sci. 2024, 14, 460. [Google Scholar] [CrossRef]
  3. Al-Ansi, A.; Al-Ansi, A.M.; Muthanna, A.; Elgendy, I.A.; Koucheryavy, A. Survey on intelligence edge computing in 6G: Characteristics, challenges, potential use cases, and market drivers. Future Internet 2021, 13, 118. [Google Scholar] [CrossRef]
  4. Arciniegas-Ayala, C.; Marcillo, P.; Valdivieso Caraguay, A.L.; Hernandez-Alvarez, M. Prediction of Accident Risk Levels in Traffic Accidents Using Deep Learning and Radial Basis Function Neural Networks Applied to a Dataset with Information on Driving Events. Appl. Sci. 2024, 14, 6248. [Google Scholar] [CrossRef]
  5. Shen, J.; Wang, N.; Wan, Z.; Luo, Y.; Sato, T.; Hu, Z.; Zhang, X.; Guo, S.; Zhong, Z.; Li, K.; et al. Sok: On the semantic ai security in autonomous driving. arXiv 2022, arXiv:2203.05314. [Google Scholar]
  6. Zhang, Y.; Wang, H.; Xu, R.; Yang, X.; Wang, Y.; Liu, Y. High-Precision Seedling Detection Model Based on Multi-Activation Layer and Depth-Separable Convolution Using Images Acquired by Drones. Drones 2022, 6, 152. [Google Scholar] [CrossRef]
  7. Zhang, Y.; Liu, X.; Wa, S.; Liu, Y.; Kang, J.; Lv, C. GenU-Net++: An Automatic Intracranial Brain Tumors Segmentation Algorithm on 3D Image Series with High Performance. Symmetry 2021, 13, 2395. [Google Scholar] [CrossRef]
  8. Kumar, R.; Wang, W.; Kumar, J.; Yang, T.; Khan, A.; Ali, W.; Ali, I. An integration of blockchain and AI for secure data sharing and detection of CT images for the hospitals. Comput. Med. Imaging Graph. 2021, 87, 101812. [Google Scholar] [CrossRef]
  9. Zhang, Y.; He, S.; Wa, S.; Zong, Z.; Lin, J.; Fan, D.; Fu, J.; Lv, C. Symmetry GAN detection network: An automatic one-stage high-accuracy detection network for various types of lesions on CT images. Symmetry 2022, 14, 234. [Google Scholar] [CrossRef]
  10. Li, Q.; Zhang, Y.; Ren, J.; Li, Q.; Zhang, Y. You Can Use But Cannot Recognize: Preserving Visual Privacy in Deep Neural Networks. arXiv 2024, arXiv:2404.04098. [Google Scholar]
  11. Li, Q.; Zhang, Y. Confidential Federated Learning for Heterogeneous Platforms against Client-Side Privacy Leakages. In Proceedings of the ACM-TURC ’24: ACM Turing Award Celebration Conference-China 2024, Changsha, China, 5–7 July 2024; pp. 239–241. [Google Scholar]
  12. Song, L.; Wang, J.; Wang, Z.; Tu, X.; Lin, G.; Ruan, W.; Wu, H.; Han, W. Pmpl: A robust multi-party learning framework with a privileged party. In Proceedings of the CCS ’22: 2022 ACM SIGSAC Conference on Computer and Communications Security, Los Angeles, CA, USA, 7–11 November 2022; pp. 2689–2703. [Google Scholar]
  13. Mammen, P.M. Federated learning: Opportunities and challenges. arXiv 2021, arXiv:2101.05428. [Google Scholar]
  14. Arora, S.; Beams, A.; Chatzigiannis, P.; Meiser, S.; Patel, K.; Raghuraman, S.; Rindal, P.; Shah, H.; Wang, Y.; Wu, Y.; et al. Privacy-preserving financial anomaly detection via federated learning & multi-party computation. arXiv 2023, arXiv:2310.04546. [Google Scholar]
  15. Odeh, A.; Abdelfattah, E.; Salameh, W. Privacy-Preserving Data Sharing in Telehealth Services. Appl. Sci. 2024, 14, 10808. [Google Scholar] [CrossRef]
  16. Pulido-Gaytan, B.; Tchernykh, A.; Cortés-Mendoza, J.M.; Babenko, M.; Radchenko, G.; Avetisyan, A.; Drozdov, A.Y. Privacy-preserving neural networks with homomorphic encryption: Challenges and opportunities. Peer-to-Peer Netw. Appl. 2021, 14, 1666–1691. [Google Scholar]
  17. Imteaj, A.; Amini, M.H. Leveraging asynchronous federated learning to predict customers financial distress. Intell. Syst. Appl. 2022, 14, 200064. [Google Scholar] [CrossRef]
  18. Ali, A.; Pasha, M.F.; Ali, J.; Fang, O.H.; Masud, M.; Jurcut, A.D.; Alzain, M.A. Deep learning based homomorphic secure search-able encryption for keyword search in blockchain healthcare system: A novel approach to cryptography. Sensors 2022, 22, 528. [Google Scholar] [CrossRef]
  19. El Ouadrhiri, A.; Abdelhadi, A. Differential privacy for deep and federated learning: A survey. IEEE Access 2022, 10, 22359–22380. [Google Scholar]
  20. Sander, T.; Stock, P.; Sablayrolles, A. Tan without a burn: Scaling laws of dp-sgd. In Proceedings of the 40th International Conference on Machine Learning, PMLR, Honolulu, HI, USA, 23–29 July 2023; pp. 29937–29949. [Google Scholar]
  21. Wu, P.; Ning, J.; Shen, J.; Wang, H.; Chang, E.C. Hybrid Trust Multi-party Computation with Trusted Execution Environment. In Proceedings of the Network and Distributed Systems Security (NDSS) Symposium, San Diego, CA, USA, 24–28 April 2022. [Google Scholar]
  22. Zhou, L.; Diro, A.; Saini, A.; Kaisar, S.; Hiep, P.C. Leveraging zero knowledge proofs for blockchain-based identity sharing: A survey of advancements, challenges and opportunities. J. Inf. Secur. Appl. 2024, 80, 103678. [Google Scholar]
  23. Liu, Q.; Yang, L.; Liu, Y.; Deng, J.; Wu, G. Privacy-Preserving Recommendation Based on a Shuffled Federated Graph Neural Network. IEEE Internet Comput. 2024, 28, 17–24. [Google Scholar] [CrossRef]
  24. Li, L.; Gou, J.; Yu, B.; Du, L.; Yi, Z.; Tao, D. Federated distillation: A survey. arXiv 2024, arXiv:2404.08564. [Google Scholar]
  25. Wen, J.; Zhang, Z.; Lan, Y.; Cui, Z.; Cai, J.; Zhang, W. A survey on federated learning: Challenges and applications. Int. J. Mach. Learn. Cybern. 2023, 14, 513–535. [Google Scholar]
  26. Kanchan, S.; Jang, J.W.; Yoon, J.Y.; Choi, B.J. GSFedSec: Group Signature-Based Secure Aggregation for Privacy Preservation in Federated Learning. Appl. Sci. 2024, 14, 7993. [Google Scholar] [CrossRef]
  27. Li, Q.; Ren, J.; Zhang, Y.; Song, C.; Liao, Y.; Zhang, Y. Privacy-Preserving DNN Training with Prefetched Meta-Keys on Heterogeneous Neural Network Accelerators. In Proceedings of the 2023 60th ACM/IEEE Design Automation Conference (DAC), San Francisco, CA, USA, 9–13 July 2023; pp. 1–6. [Google Scholar]
  28. Byrd, D.; Polychroniadou, A. Differentially private secure multi-party computation for federated learning in financial applications. In Proceedings of the First ACM International Conference on AI in Finance, New York, NY, USA, 15–16 October 2020; pp. 1–9. [Google Scholar]
  29. Liu, X.; Liu, X.; Zhang, R.; Luo, D.; Xu, G.; Chen, X. Securely Computing the Manhattan Distance under the Malicious Model and Its Applications. Appl. Sci. 2022, 12, 11705. [Google Scholar] [CrossRef]
  30. Zhou, I.; Tofigh, F.; Piccardi, M.; Abolhasan, M.; Franklin, D.; Lipman, J. Secure Multi-Party Computation for Machine Learning: A Survey. IEEE Access 2024, 12, 53881–53899. [Google Scholar] [CrossRef]
  31. Dhiman, S.; Nayak, S.; Mahato, G.K.; Ram, A.; Chakraborty, S.K. Homomorphic encryption based federated learning for financial data security. In Proceedings of the 2023 4th International Conference on Computing and Communication Systems (I3CS), Shillong, India, 16–18 March 2023; pp. 1–6. [Google Scholar]
  32. Acar, A.; Aksu, H.; Uluagac, A.S.; Conti, M. A survey on homomorphic encryption schemes: Theory and implementation. ACM Comput. Surv. 2018, 51, 79. [Google Scholar]
  33. Fan, J.; Vercauteren, F. Somewhat Practical Fully Homomorphic Encryption. Cryptology ePrint Archive. 2012; p. 144. Available online: https://eprint.iacr.org/2012/144 (accessed on 27 March 2025).
  34. Damgård, I.; Pastro, V.; Smart, N.; Zakarias, S. Multiparty computation from somewhat homomorphic encryption. In Advances in Cryptology—CRYPTO 2012; Springer: Berlin/Heidelberg, Germany, 2012; pp. 643–662. [Google Scholar]
  35. Lee, E.; Lee, J.W.; Kim, Y.S.; No, J.S. Optimization of homomorphic comparison algorithm on rns-ckks scheme. IEEE Access 2022, 10, 26163–26176. [Google Scholar]
  36. Zhang, Y.; Yin, Z.; Li, Y.; Yin, G.; Yan, J.; Shao, J.; Liu, Z. Celeba-spoof: Large-scale face anti-spoofing dataset with rich annotations. In Proceedings of the Computer Vision—ECCV 2020: 16th European Conference, Glasgow, UK, 23–28 August 2020; Part XII 16. pp. 70–85. [Google Scholar]
  37. Contreras, J.; Espinola, R.; Nogales, F.J.; Conejo, A.J. ARIMA models to predict next-day electricity prices. IEEE Trans. Power Syst. 2003, 18, 1014–1020. [Google Scholar] [CrossRef]
  38. Greff, K.; Srivastava, R.K.; Koutník, J.; Steunebrink, B.R.; Schmidhuber, J. LSTM: A search space odyssey. IEEE Trans. Neural Networks Learn. Syst. 2016, 28, 2222–2232. [Google Scholar] [CrossRef]
  39. Beltagy, I.; Peters, M.E.; Cohan, A. Longformer: The long-document transformer. arXiv 2020, arXiv:2004.05150. [Google Scholar]
  40. Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021; pp. 10012–10022. [Google Scholar]
  41. Zaheer, M.; Guruganesh, G.; Dubey, K.A.; Ainslie, J.; Alberti, C.; Ontanon, S.; Pham, P.; Ravula, A.; Wang, Q.; Yang, L.; et al. Big bird: Transformers for longer sequences. Adv. Neural Inf. Process. Syst. 2020, 33, 17283–17297. [Google Scholar]
  42. Kairouz, P.; McMahan, H.B.; Avent, B.; Bellet, A.; Bennis, M.; Bhagoji, A.N.; Bonawitz, K.; Charles, Z.; Cormode, G.; Cummings, R.; et al. Advances and open problems in federated learning. Found. Trends® Mach. Learn. 2021, 14, 1–210. [Google Scholar] [CrossRef]
  43. Evans, D.; Kolesnikov, V.; Rosulek, M. A pragmatic introduction to secure multi-party computation. Found. Trends® Priv. Secur. 2018, 2, 70–246. [Google Scholar] [CrossRef]
  44. Naehrig, M.; Lauter, K.; Vaikuntanathan, V. Can homomorphic encryption be practical? In Proceedings of the 3rd ACM Workshop on Cloud Computing Security Workshop, Chicago, IL, USA, 21 October 2011; pp. 113–124. [Google Scholar]
  45. Sabt, M.; Achemlal, M.; Bouabdallah, A. Trusted execution environment: What it is, and what it is not. In Proceedings of the 2015 IEEE Trustcom/BigDataSE/ISPA, Helsinki, Finland, 20–22 August 2015; Volume 1, pp. 57–64. [Google Scholar]
  46. Yang, Z.; Li, P.; Bao, Y.; Huang, X. Speeding Up Multivariate Time Series Segmentation Using Feature Extraction. In Proceedings of the 2020 IEEE 4th Information Technology, Networking, Electronic and Automation Control Conference (ITNEC), Chongqing, China, 12–14 June 2020; Volume 1, pp. 954–957. [Google Scholar] [CrossRef]
  47. Zhou, N.; Zheng, Z.; Zhou, J. Prediction of the RUL of PEMFC based on multivariate time series forecasting model. In Proceedings of the 2023 3rd International Symposium on Computer Technology and Information Science (ISCTIS), Chengdu, China, 7–9 July 2023; pp. 87–92. [Google Scholar]
  48. Di Mauro, M.; Galatro, G.; Postiglione, F.; Song, W.; Liotta, A. Hybrid learning strategies for multivariate time series forecasting of network quality metrics. Comput. Netw. 2024, 243, 110286. [Google Scholar] [CrossRef]
  49. Du, W.; Côté, D.; Liu, Y. Saits: Self-attention-based imputation for time series. Expert Syst. Appl. 2023, 219, 119619. [Google Scholar] [CrossRef]
  50. Liang, Y.; Lin, Y.; Lu, Q. Forecasting gold price using a novel hybrid model with ICEEMDAN and LSTM-CNN-CBAM. Expert Syst. Appl. 2022, 206, 117847. [Google Scholar] [CrossRef]
  51. Ramachandran, P.; Parmar, N.; Vaswani, A.; Bello, I.; Levskaya, A.; Shlens, J. Stand-alone self-attention in vision models. In Advances in Neural Information Processing Systems; Curran Associates Inc.: Red Hook, NY, USA, 2019; Volume 32. [Google Scholar]
  52. Du, R.; Chen, H.; Yu, M.; Li, W.; Niu, D.; Wang, K.; Zhang, Z. 3DTCN-CBAM-LSTM short-term power multi-step prediction model for offshore wind power based on data space and multi-field cluster spatio-temporal correlation. Appl. Energy 2024, 376, 124169. [Google Scholar] [CrossRef]
Figure 1. Illustration of the sliding-window computation network (SWCN) architecture. The network follows an encoder–decoder structure designed for efficient multimodal spatial and temporal data processing. The encoder extracts multi-scale features ($F_1$, $F_2$, $F_3$, $F_4$) through sliding-window attention layers, where queries and key–value pairs are denoted as $X_{i,q}$ and $X_{i,kv}$. The decoder progressively upsamples hierarchical features ($D_1$, $D_2$, $D_3$, $D_4$) and concatenates them before passing through a multi-layer perceptron (MLP) to generate the final mask output. The notation Cat[$p_1(F_1)$, $p_2(F_2)$, $p_3(F_3)$, $p_4(F_4)$] represents feature concatenation.
Figure 2. Sliding-window attention mechanism.
Figure 3. Time-series and space fusion module.
Figure 4. Violin plot of the accuracy distribution for the baseline models and the proposed method in the time-series data testing experiment.
Figure 5. Violin plot of the accuracy distribution for the baseline models and the proposed method in the spatial data testing experiment.
Table 1. Categories and counts of time-series data.
Data Type                 Number of Entries
Stock Prices              50,923
Stock Volatility          37,010
Market Sentiment Index    29,741
Trading Volume            40,532
Turnover Rate             25,994
Table 2. Categorization of CelebA dataset.
Category               Number of Entries
Smiling                9760
Not Smiling            8341
Wearing Glasses        9957
Not Wearing Glasses    7803
Other                  8675
Table 3. Experimental results of time-series data testing.
Model                          Precision    Recall    Accuracy    F1-Score    FPS
Federated Learning [42]        0.86         0.81      0.84        0.83        29
MPC [43]                       0.88         0.85      0.87        0.86        34
Homomorphic Encryption [44]    0.90         0.87      0.89        0.88        37
TEE-Based [45]                 0.92         0.89      0.91        0.90        40
Proposed Method                0.95         0.91      0.93        0.93        46
Table 4. Experimental results of spatial data testing.
Model                     Precision    Recall    Accuracy    F1-Score    FPS
Federated Learning        0.84         0.79      0.81        0.81        26
MPC                       0.87         0.83      0.85        0.85        32
Homomorphic Encryption    0.88         0.84      0.86        0.85        38
TEE-Based                 0.91         0.87      0.89        0.89        42
Proposed Method           0.93         0.90      0.92        0.91        49
Table 5. Ablation study on different attention mechanisms.
Model                                        Precision    Recall    Accuracy    F1-Score    FPS
Time-Series—Standard Self-Attention [49]     0.73         0.70      0.72        0.71        31
Time-Series—CBAM [50]                        0.85         0.81      0.83        0.83        35
Time-Series—Proposed Method                  0.95         0.91      0.93        0.93        46
Spatial Data—Standard Self-Attention [51]    0.71         0.68      0.70        0.69        33
Spatial Data—CBAM [52]                       0.83         0.80      0.82        0.81        40
Spatial Data—Proposed Method                 0.93         0.90      0.92        0.91        49
Table 6. Ablation study on different loss functions.
Model                              Precision    Recall    Accuracy    F1-Score    FPS
Time-Series—Cross-Entropy Loss     0.69         0.65      0.67        0.67        27
Time-Series—Focal Loss             0.87         0.82      0.84        0.83        34
Time-Series—Proposed Method        0.95         0.91      0.93        0.93        46
Spatial Data—Cross-Entropy Loss    0.66         0.63      0.65        0.64        30
Spatial Data—Focal Loss            0.84         0.80      0.82        0.82        37
Spatial Data—Proposed Method       0.93         0.90      0.92        0.91        49
Table 7. Evaluation of model robustness under synthetic data and adversarial samples.
Model                               Precision    Recall    Accuracy    F1-Score    FPS
Time-Series—None                    0.63         0.60      0.62        0.61        28
Time-Series—Synthetic Data          0.74         0.71      0.73        0.73        34
Time-Series—Adversarial Samples     0.88         0.83      0.85        0.84        37
Time-Series—Proposed Method         0.95         0.91      0.93        0.93        46
Spatial Data—None                   0.65         0.68      0.66        0.67        31
Spatial Data—Synthetic Data         0.73         0.70      0.72        0.71        38
Spatial Data—Adversarial Samples    0.85         0.81      0.83        0.82        42
Spatial Data—Proposed Method        0.93         0.90      0.92        0.91        49
