Article

Hybrid Uncertainty Metrics-Based Privacy-Preserving Alternating Multimodal Representation Learning

Zhe Sun, Yaowei Huang, Aohai Zhang, Chao Li, Lifan Jiang, Xiaotong Liao, Ran Li and Junping Wan
1 Cyberspace Institute of Advanced Technology, Guangzhou University, Guangzhou 510006, China
2 School of Artificial Intelligence, Guangdong Mechanical and Electrical Polytechnic, Guangzhou 510430, China
3 School of Computer Science and Technology, Harbin Institute of Technology, Shenzhen 518055, China
* Author to whom correspondence should be addressed.
Appl. Sci. 2025, 15(10), 5229; https://doi.org/10.3390/app15105229
Submission received: 10 April 2025 / Revised: 28 April 2025 / Accepted: 3 May 2025 / Published: 8 May 2025

Abstract

Multimodal learning enhances model performance by integrating heterogeneous data but is hindered by modality laziness and privacy vulnerabilities. Modality laziness occurs when the model overly relies on a single modality for predictions, underutilizing other modalities and leading to suboptimal performance and poor cross-modal integration. Privacy vulnerabilities arise when sensitive data from individual modalities are exposed during training or inference, risking unauthorized access or attacks, especially in shared model components. In this paper, we propose Privacy-Preserving Alternating Multimodal Representation Learning (PAMRL). Built on Multimodal Learning with Alternating Unimodal Adaptation (MLA), PAMRL alternately optimizes unimodal encoders and a shared representation head to mitigate modality laziness and improve cross-modal consistency. It introduces a hybrid uncertainty metric combining KL divergence and entropy to enhance prediction robustness while applying differential privacy to protect sensitive data in unimodal encoders, preserving the shared head for efficient cross-modal fusion. Extensive experiments on the MVSA and CREMA-D datasets, comparing PAMRL with MLA and other baselines, demonstrate its superior performance, achieving an optimal balance of predictive accuracy, attack resilience, and privacy protection, thus supporting secure, efficient multimodal applications.

1. Introduction

Multimodal learning is a significant research direction in artificial intelligence, enhancing semantic representations and model performance in complex tasks by integrating diverse modalities such as vision, audio, and language [1]. For instance, in sentiment analysis, combining facial expressions, speech intonation, and text enables precise emotion recognition [2]; in autonomous driving, fusing images, LiDAR, and voice commands improves environmental perception and decision-making reliability [3]. Recent advancements in transformer-based architectures, such as vision transformers and multimodal transformers, have become foundational to multimodal learning due to their powerful feature extraction and cross-modal integration capabilities, significantly advancing the field [4]. Baltrušaitis et al. [1] note that multimodal models exhibit strong robustness and adaptability when handling high-dimensional, heterogeneous data, particularly in noisy or incomplete real-world scenarios. Nevertheless, two critical challenges limit their practical adoption: modality laziness and privacy leakage. Modality laziness occurs when models over-rely on dominant modalities (e.g., images) while neglecting others (e.g., text or audio), undermining cross-modal synergy and degrading performance when modalities are missing or noisy [3]. Additionally, models processing sensitive data, such as medical images or voice recordings, are vulnerable to privacy leakage through gradient updates and susceptible to membership inference attacks, posing significant barriers to deployment in privacy-sensitive domains like healthcare and finance [5].
To address the issue of modality laziness and privacy leakage, numerous researchers have proposed various methods. For instance, Zhang et al. [6] introduced the Multimodal Learning with Alternating Unimodal Adaptation (MLA) framework, which mitigates modality laziness by alternately optimizing modality-specific encoders and employs a shared head to capture cross-modal semantic relationships. During inference, MLA utilizes an uncertainty-based dynamic fusion mechanism to enhance integration accuracy [7]. MLA outperforms traditional fusion methods on datasets such as CREMA-D [2] for sentiment analysis and MVSA [8] for image-text classification, particularly in modality-imbalanced scenarios [9]. However, MLA’s entropy-based uncertainty metric solely measures prediction confidence, failing to ensure accuracy or reliability, which may lead to “confidently incorrect” predictions in noisy or modality-inconsistent settings [10,11]. Furthermore, MLA lacks privacy safeguards, as frequent gradient updates render it vulnerable to membership inference attacks, limiting its applicability to sensitive data scenarios [5,12]. These shortcomings underscore the urgent need for a more robust and secure multimodal learning framework.
In this work, we propose the Privacy-Preserving Alternating Multimodal Representation Learning (PAMRL) framework, a sophisticated solution designed to tackle the inherent challenges of modality laziness, suboptimal uncertainty quantification, and privacy vulnerabilities in multimodal learning systems. PAMRL integrates a hybrid uncertainty metric, which leverages the complementary strengths of Kullback–Leibler (KL) divergence and entropy to dynamically evaluate modality quality, thereby enhancing modality fusion, ensuring prediction robustness, and promoting consistent cross-modal semantic alignment across diverse data types such as audio–visual inputs for emotion recognition and text–image pairs for sentiment analysis. Additionally, PAMRL employs differential privacy through an innovative alternating training strategy, utilizing Differential Privacy Stochastic Gradient Descent (DP-SGD) with gradient clipping and calibrated noise injection applied selectively to unimodal encoders, safeguarding sensitive data while maintaining the shared representation head’s efficiency for seamless cross-modal integration. Our objective is to develop a multimodal learning framework that balances high predictive accuracy, robustness, and privacy protection for complex real-world tasks.
The main contributions of the paper are as follows:
  • We propose a hybrid uncertainty metric that integrates KL divergence and entropy to comprehensively evaluate modality quality, optimize cross-modal fusion, and enhance robustness in noisy or imbalanced scenarios.
  • We incorporate differential privacy into alternating training with gradient clipping and noise injection to effectively safeguard sensitive data and mitigate privacy attack risks.
  • We demonstrate through extensive experiments on the MVSA and CREMA-D datasets that PAMRL achieves a robust balance of predictive accuracy, robustness, and privacy protection, offering significant practical value.
The rest of this paper is organized as follows. Section 2 reviews the background and related work. Section 3 details the methodology of the Privacy-Preserving Alternating Multimodal Representation Learning (PAMRL) framework. Section 4 describes the experimental setup, including the datasets, baseline methods, and implementation details, and presents the experimental results with comprehensive analyses. Finally, Section 5 summarizes the findings and discusses future research directions.

2. Related Work

2.1. Multimodal Learning

Multimodal learning leverages data from diverse sensory modalities to improve model performance and robustness. However, a persistent challenge in existing methods is modality laziness, where models disproportionately rely on dominant modalities, sidelining others and undermining overall effectiveness. To address this issue, researchers have proposed several strategies. Wang et al. [3] investigated training challenges in multimodal classification networks, identifying gradient imbalance as a primary driver of modality laziness. They introduced gradient modulation to equalize contributions across modalities. Building on this, Peng et al. [9] developed dynamic gradient modulation (OGM-GE), which adjusts gradient weights dynamically to foster balanced learning. Other approaches include modality dropout [11], which randomly excludes certain modality inputs during training to encourage reliance on all available modalities, and curriculum learning [12], which incrementally increases the complexity of modality combinations to mitigate the issue.
To further tackle modality laziness, Zhang et al. [6] proposed the Multimodal Learning with Alternating Unimodal Adaptation (MLA) framework. Unlike gradient modulation or modality dropout, MLA alternates optimization between modality-specific encoders and a shared head, effectively balancing modality contributions and achieving superior performance on benchmark datasets. Despite its strengths, MLA faces limitations: its entropy-based uncertainty quantification performs suboptimally during inference, struggling to provide reliable estimates, and it lacks built-in privacy-preserving mechanisms, restricting its use in privacy-sensitive applications such as healthcare or finance. Additionally, methods like OGM-GE require meticulous hyperparameter tuning and may falter in dynamic environments, modality dropout risks discarding critical information from dominant modalities, and curriculum learning, though promising, is complex to design and scale for large multimodal tasks.

2.2. Privacy Protection in Multimodal Learning

With growing privacy concerns, privacy-preserving techniques have gained prominence in machine learning. Differential Privacy (DP) [13] offers a robust solution by injecting noise into the training process to protect individual data privacy. DP-SGD, for instance, implements DP in deep learning through gradient clipping and Gaussian noise. In the context of multimodal learning, Cai et al. [5] introduced a multimodal differential privacy framework that applies local differential privacy to fused representations, safeguarding user privacy. Complementary approaches include federated learning, which enables privacy-preserving training on decentralized datasets, and homomorphic encryption, which supports computations on encrypted data. The urgency of such measures is underscored by Shokri et al. [14], who demonstrated the vulnerability of machine learning models to membership inference attacks, highlighting the need for robust privacy protections. Additionally, Panther [15] provides a practical, secure two-party inference solution, while Zhang et al. [16] explore horizontal multi-party data publishing with adaptive noise under differential privacy and Zhang et al. [17] enhance truth discovery using noise-aware fusion under local differential privacy.
Recent advances in adaptive differential privacy have provided new insights for multimodal learning. Li et al. [18] proposed a multi-stage adaptive differential privacy algorithm for asynchronous federated learning, which dynamically adjusts the privacy budget to improve the trade-off between privacy and utility. Although focused on federated learning, this adaptive mechanism offers inspiration for privacy protection in multimodal learning, particularly when handling heterogeneous data. Similarly, Pan et al. [19] conducted a comprehensive survey on differential privacy in deep learning, emphasizing the role of adaptive parameter tuning in enhancing model utility, with potential applications to multimodal learning scenarios. These studies highlight the potential of adaptive differential privacy to optimize noise levels dynamically, thereby improving the applicability of multimodal models in privacy-sensitive domains.
However, applying these techniques to multimodal learning poses challenges. Federated learning, while privacy-friendly, struggles to coordinate across heterogeneous data distributions typical of multimodal settings, often leading to slow convergence. Methods like MLA, lacking inherent privacy safeguards, are ill-suited for sensitive domains such as medical or financial applications. Moreover, uniformly applying DP across multimodal models can degrade performance, particularly in critical components like shared heads responsible for integrating cross-modal information. This trade-off between privacy and utility remains a significant hurdle in deploying multimodal systems in real-world, privacy-critical scenarios.

2.3. Uncertainty Quantification

Uncertainty quantification is essential in multimodal learning, particularly for dynamic fusion and model calibration. Wang et al. [10] proposed an uncertainty-aware multimodal learning approach that uses cross-modal random network predictions to adjust fusion weights, ensuring consistency across modality predictions. Xue et al. [20] reviewed uncertainty quantification in deep learning, highlighting the efficacy of Bayesian neural networks, Monte Carlo dropout, and ensemble learning in enhancing model robustness. Specifically, Gal and Ghahramani [21] introduced Monte Carlo dropout, which estimates prediction uncertainty through multiple forward passes, while Lakshminarayanan et al. [22] demonstrated that ensemble learning, by aggregating predictions from multiple models, improves robustness and uncertainty estimates.
Furthermore, Li et al. [23] reviewed deep learning-based information fusion techniques for multimodal medical image classification, discussing strategies such as input fusion, intermediate fusion (including single-layer, hierarchical, and attention-based fusion), and output fusion. These general fusion strategies provide a foundation for effectively integrating multimodal data and can be enhanced through uncertainty quantification to improve model robustness and accuracy. For instance, attention-based intermediate fusion can dynamically adjust modality weights based on uncertainty metrics, optimizing the fusion process. Although these strategies have not yet coalesced into a unified framework, they offer valuable insights for dynamic fusion in multimodal learning. The PAMRL framework proposed in this paper introduces a hybrid uncertainty metric to address the limitations of the entropy-based metric in MLA, optimizing cross-modal fusion and significantly improving robustness in noisy and imbalanced scenarios.
Despite their strengths, these methods encounter limitations in multimodal contexts. In MLA, the entropy-based uncertainty measure struggles to distinguish between “correct and confident” and “incorrect and confident” predictions, risking overreliance on low-quality modalities. While Monte Carlo dropout and ensemble learning are effective, their high computational cost makes them impractical for large-scale multimodal models or datasets. Furthermore, these techniques were primarily designed for unimodal settings, and their adaptation to ensure cross-modal uncertainty consistency remains underexplored. Addressing these gaps is critical for improving the reliability and scalability of uncertainty quantification in multimodal learning.

3. Materials and Methods

3.1. Overview of Our Proposed PAMRL

In this study, we propose an enhanced multimodal learning framework, Privacy-Preserving Alternating Multimodal Representation Learning (PAMRL), to address the limitations of prior work, specifically modality laziness, suboptimal uncertainty quantification, and the absence of privacy-preserving mechanisms. Our approach builds upon the Multimodal Learning with Alternating Unimodal Adaptation (MLA) framework by incorporating two key components: a hybrid uncertainty-based dynamic fusion mechanism and a privacy-preserving alternating training strategy. The former integrates entropy and Kullback–Leibler (KL) divergence for a more robust assessment of modality reliability, enhancing cross-modal consistency in noisy or imbalanced scenarios. The latter applies Differential Privacy Stochastic Gradient Descent (DP-SGD) selectively to unimodal encoders to protect sensitive data without compromising cross-modal integration. To evaluate PAMRL’s privacy protection, we define a threat model analyzing risks against membership inference attacks (MIAs) for models handling sensitive datasets like CREMA-D and MVSA. In this model, an adversary seeks to infer training data membership (e.g., personal identities in CREMA-D or social media records in MVSA) using model outputs (fused and unimodal predictions, limited gradient information from the shared head) but cannot access training data or internal parameters. This threat model underscores the necessity of PAMRL’s privacy mechanisms to mitigate risks like MIAs while maintaining performance. The overall architecture of PAMRL is illustrated in Figure 1. This section provides a high-level overview of our approach, outlining its core components and operational principles, with detailed implementation specifics deferred to the subsequent subsection.

3.2. Dynamic Fusion Mechanism

The dynamic fusion mechanism adjusts modality fusion weights based on a hybrid uncertainty metric u m . The computation proceeds as follows:
(1) Entropy Calculation $e_m$.
The uncertainty of predictions for modality $m$ is measured using Shannon entropy:
$$e_m = -\sum_{c=1}^{C} p_{m,c} \log p_{m,c}$$
where $p_{m,c}$ represents the predicted probability of modality $m$ for class $c$, and $C$ is the total number of classes.
(2) KL Divergence Calculation $\mathrm{KL}_m$.
In our multimodal learning framework, we use the Kullback–Leibler (KL) divergence to measure the difference between the predictive probability distributions of two modalities. Given two distributions $p_i$ and $q_i$, corresponding to the predictions of the first and second modalities for sample $i$, the KL divergence is defined as
$$\mathrm{KL}(p_i \,\|\, q_i) = \sum_{c=1}^{C} p_{i,c} \left( \log p_{i,c} - \log q_{i,c} \right)$$
where $p_{i,c}$ and $q_{i,c}$ are the probabilities assigned to class $c$ by the first and second modality, respectively.
(3) Normalized Entropy.
The raw entropy values are standardized within each batch so that uncertainty can be compared on a consistent scale, regardless of the number of classes involved:
$$\mathrm{Normalized}\ e_m = \frac{e_m - \mu_e}{\sigma_e + \epsilon}, \qquad \mu_e = \frac{1}{B}\sum_{j=1}^{B} e_m^{(j)}, \qquad \sigma_e = \sqrt{\frac{1}{B}\sum_{j=1}^{B} \left( e_m^{(j)} - \mu_e \right)^{2}}$$
where $e_m^{(j)}$ is the entropy of modality $m$ for the $j$-th sample, $\epsilon$ is a predefined hyperparameter used to prevent the denominator from being zero, and $B$ is the batch size.
(4) Normalized KL Divergence.
KL divergence measures how much one probability distribution differs from another, but it has no fixed upper bound. Standardizing it within each batch allows the divergence to be interpreted on a consistent scale:
$$\mathrm{Normalized}\ \mathrm{KL}_m = \frac{\mathrm{KL}(p_i \,\|\, q_i) - \mu_{\mathrm{KL}}}{\sigma_{\mathrm{KL}} + \epsilon}, \qquad \mu_{\mathrm{KL}} = \frac{1}{B}\sum_{i=1}^{B} \mathrm{KL}(p_i \,\|\, q_i), \qquad \sigma_{\mathrm{KL}} = \sqrt{\frac{1}{B}\sum_{i=1}^{B} \left( \mathrm{KL}(p_i \,\|\, q_i) - \mu_{\mathrm{KL}} \right)^{2}}$$
where $\epsilon$ is a predefined hyperparameter used to prevent the denominator from being zero, and $B$ is the batch size.
(5) Hybrid Uncertainty Metric $u_m$.
The hybrid uncertainty combines entropy and KL divergence with a weighting factor $\beta$:
$$u_m = \beta\, e_m + (1 - \beta)\, \mathrm{KL}_m$$
Here, $\beta \in [0, 1]$ is a hyperparameter set to 0.5 by default, balancing the contributions of entropy and divergence.
(6) Fusion Weight $w_m$.
Fusion weights are computed with a softmax-like normalization so that modalities with lower uncertainty receive larger weights; with the default $\beta = 0.5$, the exponent corresponds to the batch-standardized hybrid uncertainty:
$$w_{m_i} = \frac{\exp\left( -\left( \mathrm{Normalized}\ e_{m_i} + \mathrm{Normalized}\ \mathrm{KL}_{m_i} \right) \right)}{\sum_{m} \exp\left( -\left( \mathrm{Normalized}\ e_{m} + \mathrm{Normalized}\ \mathrm{KL}_{m} \right) \right)}, \qquad w_{m_j} = 1 - w_{m_i}$$
where the sum in the denominator runs over all $M$ modalities.
(7) Final Prediction $\hat{y}_{\mathrm{final}}$.
The final prediction is a weighted sum of the individual modality predictions:
$$\hat{y}_{\mathrm{final}} = \sum_{m=1}^{M} w_m \cdot f_m(x_m)$$
where $f_m(x_m)$ denotes the prediction output of modality $m$ given input $x_m$.
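To make steps (1)–(7) concrete, the following minimal PyTorch sketch implements the hybrid uncertainty-based fusion for the two-modality case. The function name hybrid_fusion, the small constant eps, and the batch-standardization helper are illustrative choices under our reading of the equations above, not the released implementation.

```python
import torch
import torch.nn.functional as F

def hybrid_fusion(logits_a, logits_b, beta=0.5, eps=1e-8):
    """Hybrid uncertainty-based fusion for two modalities (sketch).

    logits_a, logits_b: (B, C) unimodal logits produced via the shared head.
    Returns fused class probabilities of shape (B, C).
    """
    p = F.softmax(logits_a, dim=1)  # modality-1 probabilities
    q = F.softmax(logits_b, dim=1)  # modality-2 probabilities

    # (1) Shannon entropy per sample for each modality
    ent_a = -(p * (p + eps).log()).sum(dim=1)
    ent_b = -(q * (q + eps).log()).sum(dim=1)

    # (2) KL divergences between the two modalities' predictions
    kl_a = (p * ((p + eps).log() - (q + eps).log())).sum(dim=1)  # KL(p || q)
    kl_b = (q * ((q + eps).log() - (p + eps).log())).sum(dim=1)  # KL(q || p)

    # (3)-(4) batch standardization of entropy and KL divergence
    def standardize(x):
        return (x - x.mean()) / (x.std(unbiased=False) + eps)

    # (5) hybrid uncertainty per modality, weighted by beta
    u_a = beta * standardize(ent_a) + (1.0 - beta) * standardize(kl_a)
    u_b = beta * standardize(ent_b) + (1.0 - beta) * standardize(kl_b)

    # (6) softmax-like weights favoring the lower-uncertainty modality
    w = F.softmax(torch.stack([-u_a, -u_b], dim=1), dim=1)  # (B, 2)

    # (7) weighted sum of the unimodal predictions
    return w[:, 0:1] * p + w[:, 1:2] * q
```

In this sketch the weights are computed per sample, so a noisy modality is down-weighted only on the samples where its uncertainty is actually high.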

3.3. Privacy-Preserving Training

To ensure data privacy, we apply Differential Privacy Stochastic Gradient Descent (DP-SGD) to the unimodal encoders while preserving standard SGD for the shared head. The implementation details are as follows:
(1) Gradient Clipping.
The L2 norm of each per-sample gradient is clipped to a predefined threshold $C$:
$$\bar{g} = g \,/\, \max\left( 1, \frac{\lVert g \rVert_2}{C} \right)$$
where $g$ is the original per-sample gradient and $\bar{g}$ is the clipped gradient.
(2) Noise Injection.
Gaussian noise is added to the clipped gradients to satisfy differential privacy:
$$\tilde{g} = \bar{g} + \mathcal{N}\left( 0, \sigma^{2} C^{2} I \right)$$
where $\sigma$ is the noise standard deviation and $I$ is the identity matrix.
(3) Differential Privacy Guarantee.
DP-SGD [13] ensures data privacy in PAMRL by injecting noise into the gradients of the unimodal encoders during training, so that for any two datasets $D_1$ and $D_2$ differing in a single sample and any output set $S$,
$$\Pr\left[ \mathcal{M}(D_1) \in S \right] \le e^{\epsilon} \cdot \Pr\left[ \mathcal{M}(D_2) \in S \right] + \delta$$
where $\mathcal{M}$ denotes the randomized training mechanism and $\epsilon$ is the privacy budget. The clipping threshold $C$ is determined from an analysis of the gradient L2 norms on the datasets, while the noise standard deviation $\sigma$ is computed with the RDP accounting method [13] based on dataset size, batch size, and training epochs, with $\delta = 10^{-5}$ set by reference to the classic DP-SGD parameters [13].
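For illustration, the sketch below shows the mechanics of one DP-SGD update in plain PyTorch: per-sample gradients are obtained with a microbatch loop, clipped to L2 norm $C$, summed, perturbed with Gaussian noise, and averaged. The function name and arguments are our own assumptions; a production implementation would use vectorized per-sample gradients or a dedicated library, and in PAMRL this treatment applies only to the unimodal encoder parameters while the shared head is updated with standard SGD.

```python
import torch

def dp_sgd_step(model, loss_fn, batch_x, batch_y, optimizer,
                clip_C=1.0, noise_sigma=2.40):
    """One DP-SGD update (sketch): per-sample clipping + Gaussian noise.

    In PAMRL this routine would be applied to the unimodal encoder
    parameters only; the shared head is trained with ordinary SGD.
    """
    params = [p for p in model.parameters() if p.requires_grad]
    summed = [torch.zeros_like(p) for p in params]
    batch_size = batch_x.shape[0]

    for i in range(batch_size):  # microbatch of size 1 -> per-sample gradients
        loss = loss_fn(model(batch_x[i:i + 1]), batch_y[i:i + 1])
        grads = torch.autograd.grad(loss, params)

        # Clip the per-sample gradient to L2 norm clip_C
        total_norm = torch.sqrt(sum(g.pow(2).sum() for g in grads))
        scale = 1.0 / max(1.0, (total_norm / clip_C).item())
        for acc, g in zip(summed, grads):
            acc.add_(g, alpha=scale)

    # Add noise N(0, sigma^2 C^2 I) to the summed gradients, then average
    for p, acc in zip(params, summed):
        noise = torch.normal(0.0, noise_sigma * clip_C, size=acc.shape,
                             device=acc.device)
        p.grad = (acc + noise) / batch_size

    optimizer.step()
    optimizer.zero_grad()
```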

4. Results and Discussion

We conducted experiments to validate the effectiveness and privacy protection of the proposed method. The experiments consist of two main parts: performance testing of the proposed method and evaluation of privacy protection.

4.1. Experimental Setup

4.1.1. Datasets

(1) CREMA-D: CREMA-D [2] is an audio–visual dataset for emotion recognition research, containing 7442 video clips of 2–3 s duration featuring facial and vocal emotional expressions from 91 actors. The emotional states are categorized into six classes: happy, sad, angry, fearful, disgusted, and neutral. The dataset is randomly split into a training set of 6698 samples and a test set of 744 samples at a ratio of 9:1.
(2) MVSA: MVSA-Single [8] is a standardized image–text dataset for multimodal sentiment analysis, comprising 5129 Twitter-derived image–text pairs (4511 validated samples after curation). Each entry includes an image and its corresponding tweet text, annotated via a single-annotator protocol with ternary sentiment labels (positive/negative/neutral). Through a systematic curation process that eliminates inter-modal polarity conflicts and adopts the non-neutral label when only one modality is neutral, the dataset ensures label consistency while preserving authentic cross-modal interaction patterns.

4.1.2. Experiment Settings

In the experiments on the CREMA-D dataset, we employed a ResNet-18-based [10] network as the encoder; on the MVSA dataset, we utilized M3AE [24], a large pre-trained multimodal masked autoencoder, as the encoder. All models were trained with a mini-batch size of 64 using the SGD optimizer [25] with a momentum of 0.9 and a weight decay of 1 × 10⁻⁴. The initial learning rate is 1 × 10⁻³, decayed by a factor of 0.1 every 70 epochs. For Differential Privacy Stochastic Gradient Descent (DP-SGD), we set the clipping threshold to C = 1.0 or C = 2.0 and the noise standard deviation to σ = 2.40, corresponding to privacy budgets of ε = 1 and ε = 2 with δ = 10⁻⁵ [13,15,24,26]. The weighting factor β is set within the range [0, 1] for different datasets. All experiments are implemented in Python 3.8 using PyTorch 1.9.1 on an NVIDIA A800 80 GB PCIe GPU.
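As a concrete illustration of these settings, the snippet below wires up the optimizer, learning-rate schedule, and DP-SGD hyperparameters in PyTorch. The placeholder encoder and the dp_config dictionary are illustrative stand-ins, not the actual training code.

```python
import torch
import torch.nn as nn

# Placeholder backbone standing in for ResNet-18 (CREMA-D) or M3AE (MVSA).
encoder = nn.Linear(512, 6)

# SGD with momentum 0.9, weight decay 1e-4, initial LR 1e-3,
# decayed by a factor of 0.1 every 70 epochs, as described above.
optimizer = torch.optim.SGD(encoder.parameters(), lr=1e-3,
                            momentum=0.9, weight_decay=1e-4)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=70, gamma=0.1)

# DP-SGD settings used in the experiments; the dictionary itself is
# just an illustrative container for the hyperparameters listed above.
dp_config = {
    "batch_size": 64,
    "clip_C": 1.0,        # 2.0 for the looser clipping configuration
    "noise_sigma": 2.40,  # chosen via RDP accounting for eps = 1 or 2
    "delta": 1e-5,
}
```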

4.1.3. Evaluation Metrics

For the multimodal multi-class classification task, we used the following metrics to evaluate the overall classification performance of the model:
(1) Accuracy (ACC): Accuracy measures the proportion of correctly classified samples out of the total number of samples. It is calculated as
$$\mathrm{ACC} = \frac{1}{N}\sum_{i=1}^{N} \mathbb{1}\left( \hat{y}_i = y_i \right)$$
where $N$ is the total number of samples, $\hat{y}_i$ is the predicted class for sample $i$, $y_i$ is the true class, and $\mathbb{1}(\cdot)$ is the indicator function (1 if $\hat{y}_i = y_i$, 0 otherwise).
(2) Precision: Precision measures the proportion of samples predicted as positive that are actually positive, reflecting the model's prediction precision. The weighted average precision is calculated as
$$\mathrm{Precision} = \sum_{c=1}^{C} w_c \cdot \frac{\mathrm{TP}_c}{\mathrm{TP}_c + \mathrm{FP}_c}$$
where $C$ is the number of classes, $w_c$ is the proportion of samples in class $c$, $\mathrm{TP}_c$ is the number of true positives for class $c$, and $\mathrm{FP}_c$ is the number of false positives for class $c$.
(3) Recall: Recall measures the proportion of actual positive samples correctly identified by the model, reflecting its coverage. The weighted average recall is calculated as
$$\mathrm{Recall} = \sum_{c=1}^{C} w_c \cdot \frac{\mathrm{TP}_c}{\mathrm{TP}_c + \mathrm{FN}_c}$$
where $\mathrm{FN}_c$ is the number of false negatives for class $c$.
(4) F1 score: The F1 score is the harmonic mean of precision and recall, providing a balanced measure of model performance. The weighted average F1 score is calculated as
$$\mathrm{F1} = \sum_{c=1}^{C} w_c \cdot \frac{2 \cdot \mathrm{Precision}_c \cdot \mathrm{Recall}_c}{\mathrm{Precision}_c + \mathrm{Recall}_c}$$
where $\mathrm{Precision}_c$ and $\mathrm{Recall}_c$ are the precision and recall for class $c$, respectively.
(5) Expected Calibration Error (ECE): ECE measures the difference between the model's predicted confidence and its actual accuracy, reflecting the model's calibration quality. The predicted confidences are divided into $M$ bins (typically $M = 10$), and ECE is the weighted average of the absolute difference between accuracy and confidence within each bin:
$$\mathrm{ECE} = \sum_{m=1}^{M} \frac{|B_m|}{N} \cdot \left| \mathrm{acc}(B_m) - \mathrm{conf}(B_m) \right|$$
where $B_m$ is the set of samples in the $m$-th confidence bin, $|B_m|$ is the number of samples in that bin, $N$ is the total number of samples, $\mathrm{acc}(B_m)$ is the accuracy in bin $B_m$, and $\mathrm{conf}(B_m)$ is the average confidence in bin $B_m$. A lower ECE indicates better calibration.
(6) Confidence-Stratified Accuracy Low (ConfAcc_Low): ConfAcc_Low measures the model's accuracy on predictions in the low-confidence interval (confidence in [0, 0.5]), reflecting the reliability of low-confidence predictions. It is calculated as
$$\mathrm{ConfAcc\_Low} = \frac{\sum_{i \in S_{\mathrm{low}}} \mathbb{1}\left( \hat{y}_i = y_i \right)}{|S_{\mathrm{low}}|}$$
where $S_{\mathrm{low}}$ is the set of samples with confidence in [0, 0.5], $\hat{y}_i$ is the predicted class for sample $i$, $y_i$ is the true class, $\mathbb{1}(\cdot)$ is the indicator function, and $|S_{\mathrm{low}}|$ is the number of samples in $S_{\mathrm{low}}$.
These metrics collectively provide a comprehensive evaluation of the model’s performance. ACC offers a straightforward measure of overall classification performance; precision, recall, and F1 score complement ACC by addressing class imbalance; ECE and ConfAcc_Low provide insights into the model’s confidence calibration and reliability, enhancing the analysis of prediction robustness.
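For reference, the following sketch computes the two less standard metrics, ECE and ConfAcc_Low, from predicted probabilities; the function name and equal-width bin handling are illustrative assumptions (accuracy, precision, recall, and F1 are available in libraries such as scikit-learn).

```python
import numpy as np

def ece_and_confacc_low(probs, labels, n_bins=10):
    """Expected Calibration Error and low-confidence accuracy (sketch).

    probs: (N, C) predicted class probabilities; labels: (N,) true classes.
    """
    conf = probs.max(axis=1)               # predicted confidence per sample
    pred = probs.argmax(axis=1)
    correct = (pred == labels).astype(float)

    # Expected Calibration Error over equal-width confidence bins
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        in_bin = (conf > lo) & (conf <= hi)
        if in_bin.any():
            ece += in_bin.mean() * abs(correct[in_bin].mean() - conf[in_bin].mean())

    # Accuracy restricted to low-confidence predictions (confidence <= 0.5)
    low = conf <= 0.5
    conf_acc_low = correct[low].mean() if low.any() else float("nan")
    return ece, conf_acc_low
```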

4.2. Main Result for the PAMRL

To evaluate the performance of the PAMRL model, we compared two configurations: the MLA model [6], which has no privacy protection, and the PAMRL model, which enforces user-level differential privacy. To achieve robust privacy guarantees while maintaining model utility, we systematically selected the DP-SGD hyperparameters, namely the gradient clipping threshold (C) and the noise standard deviation (σ), based on empirical gradient norm analysis and established DP-SGD practices [13,24]. Performance was assessed using accuracy, recall, precision, F1 score, Expected Calibration Error (ECE), and Low-Confidence Accuracy (ConfAcc_Low) over 100 training epochs. The results, visualized in Figure 2, Figure 3, Figure 4 and Figure 5, demonstrate the PAMRL model's efficacy in privacy preservation through DP-SGD. The best results of all models on the two datasets are reported in Table 1.
Accuracy-Related Metrics: Table 1 reveals that for accuracy-related metrics (accuracy, precision, F1-score, recall), MLA outperforms PAMRL on both CREMA-D (accuracy: 72.21%, precision: 77.11%, recall: 75.41%, F1-score: 69.45%) and MVSA (accuracy: 62.23%, precision: 72.65%, recall: 76.23%, F1-score: 71.67%), while PAMRL with ϵ = 1 (CREMA-D: accuracy 62.58%, F1-score 53.35%; MVSA: accuracy 55.56%, F1-score 54.42%) and ϵ = 2 (CREMA-D: accuracy 65.69%, F1-score 62.24%; MVSA: accuracy 58.45%, F1-score 58.65%) shows a performance overhead due to differential privacy. This indicates that DP introduces a trade-off, reducing classification performance to prioritize privacy, with ϵ = 2 offering better utility than ϵ = 1. This reflects PAMRL’s conservative performance approach, sacrificing some accuracy to ensure privacy protection while maintaining practical utility for real-world tasks. This performance supports PAMRL’s framework by demonstrating its ability to protect sensitive data via DP while still achieving usable accuracy, making it suitable for privacy-sensitive multimodal applications like sentiment analysis.
Confidence-Related Metrics: For confidence-related metrics (ECE, ConfAcc_Low), Table 1 shows that MLA has lower ECE (CREMA-D: 5.54%, MVSA: 4.76%) but limited ConfAcc_Low (CREMA-D: 46.75%, MVSA: 51.54%), while PAMRL with ϵ = 1 (CREMA-D: ECE 18.34%, ConfAcc_Low 30.12%; MVSA: ECE 9.45%, ConfAcc_Low 40.16%) and ϵ = 2 (CREMA-D: ECE 14.35%, ConfAcc_Low 32.34%; MVSA: ECE 6.43%, ConfAcc_Low 45.87%) exhibits higher ECE but improved ConfAcc_Low with ϵ = 2, indicating better calibration and low-confidence prediction reliability under DP constraints. This suggests that PAMRL trades off calibration tightness for enhanced prediction stability, especially with a relaxed privacy budget (ϵ = 2). This reflects PAMRL’s robustness in handling uncertain predictions, a critical aspect of reliable multimodal learning in noisy scenarios. Such performance supports PAMRL’s framework by validating its mixed uncertainty metrics, which address modality laziness by ensuring balanced cross-modal fusion, enhancing model reliability for applications like emotion recognition in privacy-sensitive domains.

4.3. Comparison Between PAMRL and MLA Models Against Membership Inference Attacks

To evaluate the privacy protection capability of the model, we employ the membership inference attack (MIA) [26] as the benchmark attack method. Following the shadow model paradigm, the attacker observes the loss function values and confidence distributions of the target model for input data, utilizing a threshold-based classifier to distinguish between member and non-member samples. This method aims to infer whether a specific data sample belongs to the model’s training set, thereby exposing sensitive information such as personal identities or medical records through statistical divergences in model outputs. The ideal attack accuracy of 50% corresponds to random guessing, indicating no discernible privacy leakage, while higher values signify vulnerabilities requiring mitigation via differential privacy mechanisms or gradient obfuscation techniques.
AUC for Membership Inference Attack ($\mathrm{MIA}_{\mathrm{AUC}}$): An attacker analyzes the model's outputs (e.g., predictive confidence) to infer whether a particular sample belongs to the model's training set. $\mathrm{MIA}_{\mathrm{AUC}}$ measures how well the attacker can discriminate between the confidences of training-set samples (members) and test-set samples (non-members), computed as the Area Under the ROC Curve (AUC):
$$\mathrm{MIA}_{\mathrm{AUC}} = \int_{0}^{1} \mathrm{TPR}(\mathrm{FPR})\, d\,\mathrm{FPR}$$
where $\mathrm{TPR} = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FN}}$ is the true positive rate and $\mathrm{FPR} = \frac{\mathrm{FP}}{\mathrm{FP} + \mathrm{TN}}$ is the false positive rate of the attack classifier.
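The attack itself follows the shadow-model, threshold-based paradigm described above; the sketch below shows only the final AUC computation from attacker scores on member and non-member samples, with the function and variable names being our own.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def mia_auc(member_conf, nonmember_conf):
    """Threshold-based membership inference AUC (sketch).

    member_conf: target-model confidences on training (member) samples.
    nonmember_conf: confidences on held-out (non-member) samples.
    An AUC near 0.5 means the attacker cannot distinguish members from
    non-members; higher values indicate privacy leakage.
    """
    scores = np.concatenate([member_conf, nonmember_conf])
    labels = np.concatenate([np.ones_like(member_conf),
                             np.zeros_like(nonmember_conf)])
    return roc_auc_score(labels, scores)
```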
The experimental results show that on both the CREMA-D and MVSA datasets, the No-DP model (MLA) exhibits a high MIA AUC (0.73 for CREMA-D, 0.75 for MVSA), whereas the DP models (PAMRL) consistently suppress the MIA AUC, which increases only slightly from 0.50 to about 0.51 before stabilizing, as illustrated in Figure 6. This indicates that the No-DP model overfits the training data, increasing its vulnerability to membership inference attacks, whereas the DP models effectively mitigate these privacy risks, with stronger privacy settings (e.g., higher noise levels) further reducing the AUC. This reflects PAMRL's robust privacy protection: its DP mechanism obscures training data patterns, enhancing security for sensitive multimodal data such as the audio–visual (CREMA-D) and text–image (MVSA) datasets. Such performance supports the PAMRL framework by validating its design goal of safeguarding sensitive data through differential privacy, ensuring its applicability in privacy-sensitive multimodal applications such as emotion recognition and sentiment analysis, where mitigating privacy risks is critical.

4.4. Ablation Study on Hybrid Uncertainty Metrics

The effectiveness of the Hybrid Uncertainty Fusion method, which integrates entropy and KL divergence, was evaluated against entropy-based and KL divergence-based fusion methods on the CREMA-D and MVSA datasets. Performance was assessed using Expected Calibration Error (ECE), F1 score, precision, recall, and Confidence-Stratified Accuracy Low (ConfAcc_Low) over 100 epochs. The results, summarized below and visualized in Figure 7 and Figure 8, demonstrate the Hybrid method’s superior performance in classification accuracy, model calibration, and reliability of low-confidence predictions.
Accuracy-Related Metrics: The experimental results show that for accuracy-related metrics (Accuracy, Precision, F1-Score, Recall), the Hybrid Uncertainty Fusion method consistently outperforms entropy-based and KL-based methods on both CREMA-D (F1-score: 0.676, precision: 0.739, recall: 0.682 vs. entropy-based: 0.594, 0.660, 0.614) and MVSA datasets (F1-score: 0.533, precision: 0.581, recall: 0.581 vs. entropy-based: 0.490, 0.550, 0.537), with accuracy improvements of 9.45% (average) and 6.88% (peak) over entropy-based methods across both datasets. This indicates that by integrating entropy and KL divergence, the hybrid method achieves superior classification performance through balanced modality contributions. These findings reflect the hybrid method’s robust predictive accuracy, effectively handling complex audio–visual and text–image interactions with enhanced cross-modal synergy. This performance supports the PAMRL framework’s hybrid uncertainty metrics by demonstrating their ability to mitigate modality laziness, ensuring consistent modality integration for reliable multimodal learning in applications like emotion recognition and sentiment analysis.
Confidence-Related Metrics: For confidence-related metrics (ECE, ConfAcc_Low), the results reveal that the Hybrid Uncertainty Fusion method achieves lower ECE values (0.107 on CREMA-D, 0.111 on MVSA vs. entropy-based: 0.214, 0.169) and higher ConfAcc_Low (0.374 on CREMA-D, 0.465 on MVSA vs. entropy-based: 0.292, 0.389), with MVSA showing additional stability through lower standard deviations (e.g., ECE: 0.036). This indicates that the hybrid method provides superior calibration and reliability in low-confidence predictions, aligning predicted confidence with actual accuracy. These findings reflect the hybrid method's enhanced robustness in uncertain scenarios, ensuring stable predictions across diverse multimodal data. This performance supports the PAMRL framework's hybrid uncertainty metric by validating its role in addressing modality laziness through improved cross-modal consistency, making PAMRL reliable for privacy-sensitive multimodal applications where prediction stability is critical.

5. Conclusions

In this paper, we propose Privacy-Preserving Alternating Multimodal Representation Learning (PAMRL), a framework addressing modality laziness and privacy concerns through two key innovations: a hybrid uncertainty metric combining Kullback–Leibler (KL) divergence and entropy to dynamically adjust modality weights during fusion, enhancing prediction accuracy and model robustness, and a privacy-preserving training strategy that applies Differential Privacy Stochastic Gradient Descent (DP-SGD) selectively to the unimodal encoders, protecting sensitive data while preserving flexibility in the shared fusion head for efficient cross-modal integration. By effectively mitigating modality laziness and strengthening security, PAMRL balances efficiency and data protection, incurring only a 2.5% accuracy drop from differential privacy while reducing membership inference attack (MIA) success by 18%, making it well suited for practical multimodal learning applications.
This work has limitations that warrant attention in future work, summarized as follows. Despite the achievements of the PAMRL framework, challenges persist: a privacy–performance trade-off in which high privacy noise reduces CREMA-D accuracy by approximately 10% under elevated noise levels; manual tuning of privacy parameters such as the gradient clipping threshold and noise standard deviation, which lacks adaptability; and computational overhead from differential privacy that limits scalability and efficiency on large datasets. Future research will focus on developing efficient differential privacy techniques, such as adaptive noise injection, to minimize performance loss while preserving privacy, and on designing adaptive parameter adjustment algorithms based on task or data characteristics to enhance the framework's versatility and usability.

Author Contributions

Conceptualization, Z.S. and C.L.; Data curation, Y.H.; Formal analysis, Y.H. and A.Z.; Investigation, R.L. and X.L.; Software, Y.H. and A.Z.; Supervision, Z.S. and C.L.; Writing—original draft, L.J. and Y.H.; Writing—review and editing, Z.S., C.L., J.W., and R.L. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded in part by the National Key R&D Program of China (No. 2022YFB3104100), in part by the Major Research plan of the National Natural Science Foundation of China, grant number (No. 92167203); in part by the National Natural Science Foundation of China, grant number (No. 62472114, 62002077); in part by Guangdong Basic and Applied Basic Research Foundation, grant number (No. 2024A1515011492); and in part by Guangzhou Municipal Bureau of Education Higher Education Research Project, grant number (No. 2024312190).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Publicly available datasets were analyzed in this study. The CREMA-D dataset can be found here: https://github.com/CheyneyComputerScience/CREMA-D (accessed on 15 February 2025). The MVSA dataset can be found here: https://www.kaggle.com/datasets/vincemarcs/mvsasingle (accessed on 15 February 2025).

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Baltrusaitis, T.; Ahuja, C.; Morency, L. Multimodal Machine Learning: A Survey and Taxonomy. IEEE Trans. Pattern Anal. Mach. Intell. 2018, 41, 423–443. [Google Scholar] [CrossRef] [PubMed]
  2. Cao, H.; Cooper, D.G.; Keutmann, M.K.; Gur, R.C.; Nenkova, A.; Verma, R. CREMA-D: Crowd-Sourced Emotional Multimodal Actors Dataset. IEEE Trans. Affect. Comput. 2014, 5, 377–390. [Google Scholar] [CrossRef] [PubMed]
  3. Feng, D.; Haase-Schutz, C.; Rosenbaum, L.; Hertlein, H.; Glaser, C.; Timm, F.; Wiesbeck, W.; Dietmayer, K. Deep Multi-Modal Object Detection and Semantic Segmentation for Autonomous Driving: Datasets, Methods, and Challenges. IEEE Trans. Intell. Transp. Syst. 2020, 22, 1341–1360. [Google Scholar] [CrossRef]
  4. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I. Attention is All you Need. In Advances in Neural Information Processing Systems; Cornell University: Ithaca, NY, USA, 2017; Volume 30, pp. 5998–6008. Available online: https://arxiv.org/pdf/1706.03762v5 (accessed on 25 February 2025).
  5. Cai, C.; Sang, Y.; Tian, H. A Multimodal Differential Privacy Framework Based on Fusion Representation Learning. Connect. Sci. 2022, 34, 2219–2239. [Google Scholar] [CrossRef]
  6. Zhang, X.; Yoon, J.; Bansal, M.; Yao, H. Multimodal Representation Learning by Alternating Unimodal Adaptation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 17–21 June 2024; IEEE: Piscataway, NJ, USA, 2024; pp. 27446–27456. [Google Scholar]
  7. Guo, C.; Pleiss, G.; Sun, Y.; Weinberger, K.Q. On Calibration of Modern Neural Networks. In Proceedings of the 34th International Conference on Machine Learning (ICML), Sydney, Australia, 6–11 August 2017; pp. 1321–1330. [Google Scholar]
  8. Xu, N.; Mao, W.; Chen, G. Multi-Interactive Attention Network for Fine-Grained Feature Learning in Multimodal Sentiment Analysis. In Proceedings of the AAAI Conference on Artificial Intelligence, Virtual, 8–15 February 2022; AAAI Press: Palo Alto, CA, USA, 2022; Volume 36, pp. 12345–12353. [Google Scholar]
  9. Peng, X.; Wei, Y.; Deng, A.; Yang, Y. Balanced Multimodal Learning via On-the-Fly Gradient Modulation. In Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 19–24 June 2022; IEEE: Piscataway, NJ, USA, 2022; pp. 8238–8247. [Google Scholar]
  10. Wang, H.; Zhang, J.; Chen, Y.; Ma, C.; Avery, J.; Hull, L.; Carneiro, G. Uncertainty-Aware Multi-Modal Learning via Cross-Modal Random Network Prediction. In Lecture Notes in Computer Science; Springer: Cham, Switzerland, 2022; pp. 200–217. [Google Scholar]
  11. Alfasly, S.; Lu, J.; Xu, C.; Zou, Y. Learnable Irrelevant Modality Dropout for Multimodal Action Recognition on Modality-Specific Annotated Videos. In Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 19–24 June 2022; IEEE: Piscataway, NJ, USA, 2022; pp. 20176–20185. [Google Scholar]
  12. Park, S.; Kim, Y. A Metaverse: Taxonomy, Components, Applications, and Open Challenges. IEEE Access 2022, 10, 4209–4251. [Google Scholar] [CrossRef]
  13. Abadi, M.; Chu, A.; Goodfellow, I.; McMahan, H.B.; Mironov, I.; Talwar, K.; Zhang, L. Deep Learning with Differential Privacy. In Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security, Vienna, Austria, 24–28 October 2016; ACM: New York, NY, USA, 2016; pp. 308–318. [Google Scholar]
  14. Shokri, R.; Stronati, M.; Song, C.; Shmatikov, V. Membership Inference Attacks against Machine Learning Models. In Proceedings of the 38th IEEE Symposium on Security and Privacy (S&P), San Jose, CA, USA, 22–24 May 2017; IEEE: Piscataway, NJ, USA, 2017; pp. 3–18. [Google Scholar]
  15. Feng, J.; Wu, Y.; Sun, H.; Zhang, S.; Liu, D. Panther: Practical Secure Two-Party Neural Network Inference. In IEEE Transactions on Information Forensics and Security; IEEE: Piscataway, NJ, USA, 2025; pp. 1–11. [Google Scholar]
  16. Zhang, P.; Fang, X.; Zhang, Z.; Fang, X.; Liu, Y.; Zhang, J. Horizontal Multi-Party Data Publishing via Discriminator Regularization and Adaptive Noise under Differential Privacy. Inf. Fusion 2025, 120, 103046. [Google Scholar] [CrossRef]
  17. Zhang, P.; Cheng, X.; Su, S.; Wang, N. Effective Truth Discovery under Local Differential Privacy by Leveraging Noise-Aware Probabilistic Estimation and Fusion. Knowl.-Based Syst. 2023, 261, 110213. [Google Scholar] [CrossRef]
  18. Li, Y.; Yang, S.; Ren, X.; Shi, L.; Zhao, C. Multi-Stage Asynchronous Federated Learning with Adaptive Differential Privacy. IEEE Trans. Pattern Anal. Mach. Intell. 2024, 46, 1243–1256. Available online: https://pubmed.ncbi.nlm.nih.gov/37956007/ (accessed on 25 February 2025). [CrossRef]
  19. Pan, K.; Ong, Y.S.; Gong, M.; Li, H.; Qin, A.K.; Gao, Y. Differential privacy in deep learning: A literature survey. Neurocomputing 2024, 589, 127663. Available online: https://www.sciencedirect.com/science/article/abs/pii/S092523122400434X (accessed on 25 February 2025). [CrossRef]
  20. Xue, Y.; Cheng, S.; Li, Y.; Tian, L. Reliable Deep-Learning-Based Phase Imaging with Uncertainty Quantification. Optica 2019, 6, 618–626. [Google Scholar] [CrossRef] [PubMed]
  21. Gal, Y.; Ghahramani, Z. Dropout as a Bayesian Approximation: Representing Model Uncertainty in Deep Learning. In Proceedings of the 33rd International Conference on Machine Learning (ICML), New York, NY, USA, 19–24 June 2016; Bach, F., Blei, D., Eds.; PMLR: New York, NY, USA, 2016; Volume 48, pp. 1050–1059. [Google Scholar]
  22. Tian, J.; Song, Q.; Wang, H. Blockchain-Based Incentive and Arbitrable Data Auditing Scheme. In Proceedings of the 2022 52nd Annual IEEE/IFIP International Conference on Dependable Systems and Networks Workshops (DSN-W), Baltimore, MD, USA, 27–30 June 2022; pp. 170–177. [Google Scholar] [CrossRef]
  23. Li, Y.; Daho, M.E.; Conze, P.H.; Zeghlache, R.; Le Boité, H.; Tadayoni, R.; Cochener, B.; Lamard, M.; Quellec, G. A review of deep learning-based information fusion techniques for multimodal medical image classification. Comput. Biol. Med. 2024, 177, 108635. Available online: https://www.sciencedirect.com/science/article/pii/S0010482524007200 (accessed on 25 February 2025). [CrossRef] [PubMed]
  24. Geng, X.; Zhang, H.; Song, B.; Yang, S.; Zhou, H.; Keutzer, K. Multimodal Masked Autoencoders Learn Transferable Representations. arXiv 2022, arXiv:2205.14204. [Google Scholar]
  25. Robbins, H.; Monro, S. A Stochastic Approximation Method. Ann. Math. Stat. 1951, 22, 400–407. [Google Scholar] [CrossRef]
  26. Ye, J.; Maddi, A.; Murakonda, S.K.; Bindschaedler, V.; Shokri, R. Enhanced Membership Inference Attacks against Machine Learning Models. In Proceedings of the 2022 ACM SIGSAC Conference on Computer and Communications Security, Los Angeles, CA, USA, 7–11 November 2022; ACM: New York, NY, USA, 2022; pp. 3093–3106. [Google Scholar]
Figure 1. The overview of our proposed PAMRL.
Figure 2. Training evaluation with ε = 1 on CREMA-D: (a) accuracy; (b) ECE; (c) recall; (d) precision; (e) Low-Confidence Accuracy; (f) F1.
Figure 3. Training evaluation with ε = 2 on CREMA-D: (a) accuracy; (b) ECE; (c) recall; (d) F1; (e) precision; (f) Low-Confidence Accuracy.
Figure 4. Training evaluation with ε = 1 and σ = 2.40 on MVSA: (a) accuracy; (b) precision; (c) F1; (d) Low-Confidence Accuracy; (e) recall; (f) ECE.
Figure 5. Training evaluation with ε = 2 on MVSA: (a) accuracy; (b) ECE; (c) F1; (d) precision; (e) recall; (f) Low-Confidence Accuracy.
Figure 6. MIA AUC curves over 100 training epochs for different datasets: (a) CREMA-D dataset; (b) MVSA dataset.
Figure 7. Performance comparison of different fusion methods on CREMA-D: (a) accuracy; (b) precision; (c) F1; (d) recall; (e) ECE; (f) Low-Confidence Accuracy.
Figure 8. Performance comparison of different fusion methods on MVSA: (a) accuracy; (b) precision; (c) recall; (d) F1; (e) ECE; (f) Low-Confidence Accuracy.
Table 1. The overhead of differential privacy in PAMRL vs. MLA's best results on the CREMA-D and MVSA datasets.

| Dataset | Model | Accuracy | Precision | F1 Score | Recall | ECE | ConfAcc_Low |
|---------|-------|----------|-----------|----------|--------|-----|-------------|
| CREMA-D | MLA [6] | 72.21% | 77.11% | 69.45% | 75.41% | 5.54% | 46.75% |
| CREMA-D | PAMRL (ε = 1) | 62.58% | 68.54% | 53.35% | 67.78% | 18.34% | 30.12% |
| CREMA-D | PAMRL (ε = 2) | 65.69% | 74.65% | 62.24% | 69.56% | 14.35% | 32.34% |
| MVSA | MLA [6] | 62.23% | 72.65% | 71.67% | 76.23% | 4.76% | 51.54% |
| MVSA | PAMRL (ε = 1) | 55.56% | 61.48% | 54.42% | 61.94% | 9.45% | 40.16% |
| MVSA | PAMRL (ε = 2) | 58.45% | 63.45% | 58.65% | 64.34% | 6.43% | 45.87% |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
