TriDA: Privacy-Aware and Efficient Multimodal AI for Disaster Assessment

Oaphy, Md Abdullahil; Khalid, Adeel; Hu, Da; Xu, Honghui

doi:10.3390/math14122064

¹ Department of Information Technology, College of Computing and Software Engineering, Kennesaw State University, Marietta, GA 30060, USA

² Department of Industrial and Systems Engineering, Southern Polytechnic College of Engineering and Engineering Technology, Kennesaw State University, Marietta, GA 30060, USA

³ Department of Civil and Environmental Engineering, Southern Polytechnic College of Engineering and Engineering Technology, Kennesaw State University, Marietta, GA 30060, USA

* Author to whom correspondence should be addressed.

Mathematics 2026, 14(12), 2064; https://doi.org/10.3390/math14122064

Submission received: 25 March 2026 / Revised: 23 May 2026 / Accepted: 2 June 2026 / Published: 10 June 2026

(This article belongs to the Special Issue Advanced Algorithms for Multi-Modal Learning, Knowledge Graphs, and Trustworthy AI)

Abstract

As disaster imagery and social media reports become vital for crisis response, automated assessment systems must address challenges of multimodal integration, privacy-aware learning, and computational efficiency. To address these challenges, we propose TriDA, a privacy-aware and efficiency-conscious multimodal disaster classification framework that fuses image features with text representations through a late-fusion design. A classifier-head DP-SGD stage is used to report training-record-level differential privacy accounting for paired image–text samples under the stated private optimization protocol. To study efficiency-oriented simplification, structured neuron pruning reduces redundant capacity in the classification head while preserving predictive utility. Experiments on the multimodal damage identification dataset show that TriDA maintains strong classification performance, exhibits a controlled privacy–utility trade-off under increasing DP noise, and achieves quantifiable classifier-head parameter and MAC reductions through pruning. These findings position TriDA as a controlled empirical framework for privacy-aware and resource-conscious multimodal disaster assessment.

Keywords: disaster assessment; privacy-preserving multimodal AI; efficient AI

MSC: 68T07; 68T45; 68T50; 68P27; 68T05

1. Introduction

Natural disasters such as hurricanes, floods, earthquakes, and wildfires inflict severe human, infrastructural, and economic losses worldwide [1]. The frequency and intensity of such events are rising, amplifying their impact on vulnerable populations and infrastructures. Rapid and reliable disaster assessment is critical for coordinating emergency response and allocating resources effectively [2,3]. Yet traditional approaches relying on field-based inspections remain slow, resource-intensive, and challenging to scale during large-scale crises [4]. These constraints have motivated a shift toward data-driven methodologies, fueled by the growing availability of crisis imagery and crowd-sourced textual reports shared through social media and emergency information streams [5,6]. These multimodal data sources provide an important opportunity to automate situational awareness and improve decision-making in disaster-response triage.

Deep learning has already demonstrated strong results in this context. Convolutional and transformer-based vision models have been successfully applied to classify infrastructure damage, map flood extent, and detect affected areas in aerial and satellite imagery [7,8], while similar CNN-based architectures have also shown strong performance on visual recognition tasks across diverse domains [9,10]. In parallel, natural language processing has enabled the automatic extraction of crisis-related information from social media streams and emergency texts, providing valuable situational updates [11]. However, unimodal pipelines face intrinsic limitations. Images alone may be ambiguous without textual context (e.g., differentiating floodwater from rainfall), while textual posts are often noisy, incomplete, or subjective. These limitations have motivated the development of multimodal learning frameworks, which fuse complementary visual and textual signals to capture richer semantic context. Recent studies confirm that multimodal approaches consistently outperform unimodal baselines, offering a more robust path for disaster assessment [12].

Despite these advances, several critical gaps remain unaddressed. First, the privacy risks associated with multimodal disaster data are still underexplored. Disaster imagery and textual reports frequently contain sensitive personal or locational information, raising concerns of leakage through model memorization or exposure to inference attacks [13,14]. While differentially private optimization, particularly DP-SGD, provides principled guarantees against such risks [15], its integration into multimodal disaster models remains limited. Second, efficiency challenges complicate practical use in emergency analytics settings. Transformer-heavy fusion architectures, although powerful, demand substantial compute and memory, which can be restrictive for field workstations and resource-aware operational pipelines. Prior research in neural network compression and pruning has shown that redundant parameters can be reduced while preserving predictive utility [16,17], yet such efficiency-oriented analysis remains limited in privacy-aware multimodal disaster modeling. Finally, there has been little systematic investigation of the combined privacy–utility–efficiency trade-off, which is important for operationally relevant multimodal disaster-report triage.

To address these challenges, this paper introduces TriDA, an efficiency-conscious multimodal disaster assessment framework that fuses visual and textual streams for fine-grained classification, uses a classifier-head DP-SGD stage to report training-record-level differential privacy accounting for paired image–text samples, and applies structured pruning to simplify the classifier head under computational constraints. Rather than treating predictive utility, privacy, and efficiency as isolated objectives, TriDA provides a controlled empirical study of their interaction within a unified multimodal disaster classification setting.

The key contributions of this paper are as follows:

(1) A compact late-fusion multimodal architecture is developed to integrate visual and textual cues for disaster classification, reducing the limitations of single-modality modeling.
(2) DP-SGD is incorporated to report training-record-level differential privacy accounting for paired image–text samples under the specified optimization procedure.
(3) Structured neuron pruning is applied to the classifier head, and its effect is quantified through retained hidden units, head-parameter reduction, and head-level efficiency analysis.
(4) A detailed empirical analysis is provided across multimodal utility, class-wise behavior under data imbalance, privacy–utility trade-offs, pruning-based efficiency trade-offs, and repeated-run variability.

The remainder of this paper is organized as follows: Section 2 reviews related work, Section 3 details the proposed methodology, Section 4 reports experimental results, Section 5 discusses implications and limitations, and Section 6 concludes with key findings and directions for future research.

2. Related Work

This section reviews prior studies relevant to the proposed framework, focusing on three core areas: multimodal deep learning for disaster assessment, privacy-preserving learning in disaster contexts, and AI model-compression mechanisms.

2.1. Multimodal Deep Learning for Disaster Assessment

During disasters, timely and reliable information is crucial for effective response. Social media platforms and remote sensing imagery provide complementary perspectives, giving rise to multimodal disaster assessment, in which images and text are jointly analyzed. Early work established this value. Mouzannar et al. [18] and Hossain et al. [19] reported gains from combining image and textual representations for disaster assessment, while Ofli et al. [20] systematically demonstrated that multimodal fusion consistently outperformed unimodal baselines. Hao et al. [21] further illustrated how multimodal posts could rapidly guide disaster assessment, enabling time-critical decision support. Decision-level fusion approaches by Zou and Gan [22] and Kotha et al. [23] improved triaging and downstream classification, underscoring that multimodality is not only beneficial for accuracy but also essential for humanitarian deployment. Subsequent advances focused on adaptive fusion mechanisms to better capture cross-modal relationships. Abavisani et al. [24] introduced cross-attention to suppress weaker modalities, while Sosea et al. [25] explicitly modeled redundancy and complementarity for robustness. More recent work by Shetty et al. [26] used middle-fusion with cross- and self-attention to produce richer joint representations. Beyond pairwise fusion, Dar et al. [27] proposed graph-based multimodal networks incorporating social and semantic context, and causal fusion frameworks have also been explored to mitigate spurious correlations in noisy crisis data. Backbone innovations have also advanced multimodal systems. Basit et al. [28] introduced hierarchical classifiers for multimodal tweets, Zhang et al. [29] combined BERT with CNNs for damage detection, and Li et al. [30] adapted CLIP-style bi-cross-attention with gating for humanitarian classification.
These developments move beyond simple feature concatenation toward architectures that explicitly model cross-modal dynamics with state-of-the-art encoders. Yet challenges remain around privacy and efficiency, motivating approaches that pair strong multimodal performance with privacy-aware training and model compression for disaster response.

2.2. Privacy-Preserving Deep Learning in Disaster Contexts

As deep learning advances in disaster response, protecting the privacy of individuals in social media, UAV footage, and satellite imagery is vital. Such data often reveals identifiable details, leaving models vulnerable to leakage or re-identification. Prior research has explored three main strategies: de-identification, federated learning, and differential privacy. (1) De-identification and location privacy remain important safeguards. For visual privacy, Li et al. [31] proposed LPDi-GAN to anonymize license plates while preserving recognition accuracy, and Du et al. [32] developed pipelines that conceal sensitive features without compromising utility. For geospatial privacy, Andres et al. [33] introduced geo-indistinguishability, while Löchner et al. [34] showed that geo-obfuscation workflows could support emergency response without exposing raw data. (2) Federated learning (FL) offers another approach by reducing data centralization. Zhang et al. [35] introduced a federated transfer learning system for disaster classification with Paillier encryption, enabling inter-agency collaboration without raw data sharing. Ayoub et al. [36] extended this to multimodal social media fusion, while Deng et al. [37] combined FL with encryption for power system risk assessment. Other studies explored encrypted FL for active learning [38] and secure gradient sharing [39], showing that collaborative disaster modeling is feasible but often limited by communication and computation overheads. (3) Differential privacy (DP) has gained traction in vision tasks. Alkhelaiwi et al. [40] proposed a DP framework for satellite image classification, showing that calibrated noise protects sensitive features while preserving accuracy. Zhang et al. [41] introduced a local DP defense against membership inference on remote-sensing classifiers. 
These works validate DP as an effective safeguard, but most focus on unimodal pipelines and overlook multimodal scenarios, where text and imagery together can expose more sensitive information. These observations motivate formal privacy accounting in multimodal settings. In TriDA, DP-SGD is used to analyze how a private classifier-head optimization stage affects paired image–text disaster classification. This design complements earlier de-identification, federated-learning, and unimodal DP studies by quantifying a multimodal privacy–utility trade-off under an explicit paired-record definition.

2.3. AI Model Compression Mechanisms

Model simplification is relevant for disaster-response analytics, where emergency workstations and resource-aware operational pipelines may face compute, memory, and latency constraints. Researchers have explored several strategies, including quantization, knowledge distillation (KD), and pruning. (1) Quantization lowers numerical precision to reduce model size and inference cost. The GHOST framework applies guided quantization with self-teaching to sustain accuracy while compressing multimodal remote sensing models [42]. Edge-AI wildfire systems further demonstrate that quantized CNNs can achieve real-time performance on embedded hardware under stringent resource constraints [43]. While some accuracy loss is common, careful calibration generally preserves reliable deployment. (2) Knowledge distillation (KD) transfers knowledge from large teacher models to smaller student networks. Bai et al. [44] distilled ResNet-based teachers on the xBD building-damage dataset, preserving accuracy with lower computation. El-Madafri et al. [45] applied KD for wildfire detection, training compact CNNs that retained strong performance on UAV imagery. Surveys confirm KD’s promise in remote sensing tasks [46], and hybrid extensions combine KD with embedding techniques [47] or adaptive UAV-based inference [48]. (3) Structured pruning, in contrast, has proven particularly effective for disaster vision tasks. Compact CNNs and slimmed YOLO variants already enable UAV-based classification, smoke detection, and wildfire monitoring [49,50,51,52]. Building on this, Hu et al. [53] pruned channels for road-distress detection with minimal accuracy loss, while Zhao et al. [54] pruned YOLOv8n for UAV deployment. Our framework adopts structured pruning as a controlled head-level simplification strategy that complements DP-SGD. By removing entire neurons from the classifier head, it reduces decision-stage complexity without altering numerical precision or requiring a teacher model. 
This makes pruning well suited for analyzing the efficiency–utility trade-off in privacy-aware multimodal disaster assessment, while full deployment efficiency remains subject to hardware-level validation. In summary, quantization, distillation, and pruning provide complementary efficiency tools. TriDA uses structured head pruning as a bounded supporting analysis rather than as a claim of full-model compression or end-to-end deployment acceleration.

3. Methodology

This work presents a multimodal framework for disaster assessment that integrates visual and textual evidence while addressing two practical concerns, privacy-aware optimization and computational efficiency. The proposed approach contains three components. First, a late-fusion multimodal architecture combines image and caption representations for disaster classification. Second, a classifier-head DP-SGD stage provides explicit privacy accounting for paired image–text records under a fixed-feature private optimization protocol. Third, structured neuron pruning reduces redundancy in the classifier head to study efficiency-oriented simplification. Figure 1 illustrates the overall pipeline, where modality-specific features are fused for prediction, private optimization is applied at the decision head, and pruning is used for structural efficiency analysis. This design provides a unified framework for examining multimodal utility, privacy accounting, and efficiency within the same disaster assessment setting.

3.1. Multimodal AI Architecture

Accurate disaster assessment requires models that jointly leverage visual cues from imagery and semantic context from accompanying text. Relying on a single modality can lead to incomplete or ambiguous predictions. For example, an image of a flooded roadway may be difficult to distinguish from heavy rainfall without the accompanying caption. To address this limitation, we design a late-fusion multimodal architecture that integrates image and text representations within a unified and computationally controlled framework.

The visual branch uses a ResNet50 backbone pretrained on ImageNet, with its original classification head removed. Given an input disaster image $I \in \mathbb{R}^{H \times W \times 3}$, the visual encoder extracts high-level structural features associated with disaster patterns such as damaged infrastructure, debris, fire regions, or water boundaries. Global average pooling converts the final convolutional feature map into a compact 2048-dimensional visual representation,

$$v = f_{v}(I) \in \mathbb{R}^{2048}. \tag{1}$$

The textual branch processes disaster captions $(w_{1}, \dots, w_{L})$ using pretrained GloVe embeddings. Each token is mapped through an embedding lookup matrix $E \in \mathbb{R}^{|V| \times d_{e}}$, where $|V|$ denotes the vocabulary size and $d_{e}$ is the embedding dimension. A bidirectional LSTM with 128 hidden units in each direction captures contextual dependencies from both forward and backward token sequences. This produces a 256-dimensional textual representation,

$$h = \mathrm{BiLSTM}(E[w_{1}], \dots, E[w_{L}]) \in \mathbb{R}^{256}. \tag{2}$$

Following standard multimodal fusion practice [55], the modality-specific embeddings are concatenated and layer-normalized to form a joint multimodal representation,

$$z = \mathrm{LN}(\mathrm{concat}(v, h)) \in \mathbb{R}^{2304}. \tag{3}$$

We adopt late fusion instead of early fusion or attention-heavy cross-modal interaction to preserve branch-specific evidence while keeping the fusion stage computationally moderate. This choice supports the paper’s focus on privacy-aware training and structured efficiency analysis without introducing unnecessary architectural complexity.

The fused representation is passed through a compact classification head. The head contains a dense layer with 128 hidden units, ReLU activation, dropout with rate 0.2, and a final softmax layer over the six disaster categories. Formally,

$$u = \phi(W_{h}^{\top} z + b_{h}), \tag{4}$$

$$p = \mathrm{softmax}(W_{o}^{\top} u + b_{o}), \tag{5}$$

where $W_{h} \in \mathbb{R}^{2304 \times 128}$, $W_{o} \in \mathbb{R}^{128 \times 6}$, $\phi(\cdot)$ denotes the ReLU activation, and $p$ represents the predicted class-probability distribution. The model is trained by minimizing the categorical cross-entropy objective over $N$ paired training samples,

$$\mathcal{L}_{\theta} = -\frac{1}{N} \sum_{i=1}^{N} \sum_{c=1}^{C} y_{ic} \log p_{ic}, \tag{6}$$

where $C$ is the number of disaster categories, $y_{ic}$ is the ground-truth indicator, and $p_{ic}$ is the predicted probability for class $c$.

This architecture preserves modality-specific feature extraction while maintaining a compact fusion and decision pathway. The explicit 2304-to-128 classifier head also provides a well-defined target for the structured pruning analysis developed later in the paper.
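As an illustration of Equations (3)–(5), the following is a minimal pure-Python sketch of the inference-time forward pass through the fusion stage and classifier head. It is not the authors' TensorFlow implementation: the feature vectors and weights are random placeholders, the layer normalization omits any learned scale and shift, and dropout is left out because it is inactive at inference. All function names are hypothetical.

```python
import math
import random

def layer_norm(x, eps=1e-5):
    """Normalize a vector to zero mean and unit variance (no learned scale/shift)."""
    mean = sum(x) / len(x)
    var = sum((xi - mean) ** 2 for xi in x) / len(x)
    return [(xi - mean) / math.sqrt(var + eps) for xi in x]

def dense(x, weights, bias):
    """Compute W^T x + b, with `weights` stored as one row per input feature."""
    out = list(bias)
    for xi, row in zip(x, weights):
        for j, w in enumerate(row):
            out[j] += xi * w
    return out

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

def fusion_head(v, h, W_h, b_h, W_o, b_o):
    z = layer_norm(v + h)                            # Eq. (3): concatenate, then LayerNorm
    u = [max(0.0, a) for a in dense(z, W_h, b_h)]    # Eq. (4): ReLU hidden layer
    return softmax(dense(u, W_o, b_o))               # Eq. (5): class probabilities

# Placeholder inputs with the dimensions stated in the text (assumed random values).
rng = random.Random(0)
D_V, D_T, D_H, N_CLASSES = 2048, 256, 128, 6
v = [rng.random() for _ in range(D_V)]               # stand-in for ResNet50 features
h = [rng.uniform(-1, 1) for _ in range(D_T)]         # stand-in for BiLSTM features
W_h = [[rng.gauss(0, 0.02) for _ in range(D_H)] for _ in range(D_V + D_T)]
W_o = [[rng.gauss(0, 0.02) for _ in range(N_CLASSES)] for _ in range(D_H)]
p = fusion_head(v, h, W_h, [0.0] * D_H, W_o, [0.0] * N_CLASSES)
```

The result `p` is a six-element probability vector. Only the two dense layers carry trainable head parameters, which is why the later pruning analysis can be confined to `W_h`, `b_h`, and `W_o`.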

3.2. Privacy-Enhanced Multimodal AI Training

The proposed multimodal framework is trained on disaster imagery and associated textual reports that may contain sensitive information, including identifiable individuals, geographic cues, and survivor-related descriptions. To study privacy-aware adaptation without overstating the scope of the guarantee, we apply differentially private stochastic gradient descent (DP-SGD) in a dedicated classifier-head optimization stage. During this private stage, the modality-specific feature extractors remain fixed, and only the decision head is updated from paired image–text records. The reported accounting therefore characterizes the DP-SGD classifier-head stage rather than any preceding non-private representation-learning procedure.

DP-SGD modifies standard stochastic gradient descent through gradient clipping and calibrated noise injection [15]. First, for each training example, the gradient is clipped to an $\ell_{2}$ norm bound $C$ (the clipping bound, not the class count $C$ of Equation (6)) so that no individual sample can dominate the update,

$$\tilde{g}_{i} = g_{i} \cdot \min\left(1, \frac{C}{\lVert g_{i} \rVert_{2}}\right), \tag{7}$$

where $g_{i}$ denotes the gradient of the $i$-th training sample. This step controls per-record sensitivity before aggregation and is particularly relevant when paired multimodal records exhibit heterogeneous feature contributions.

Next, Gaussian noise is added to the sum of clipped minibatch gradients before normalization,

$$\hat{g} = \frac{1}{B} \left( \sum_{i=1}^{B} \tilde{g}_{i} + \mathcal{N}(0, \sigma^{2} C^{2} I) \right), \tag{8}$$

where $B$ is the minibatch size and $\sigma$ is the noise multiplier. The noise term obscures the contribution of any single paired record while retaining an aggregate learning signal for the classifier-head update.
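The two steps in Equations (7) and (8) can be sketched in a few lines of pure Python. This is an illustrative sketch under stated assumptions, not the training code used in the paper: the per-example gradients are toy two-dimensional vectors, the noise multiplier of 1.1 is an arbitrary placeholder, and the function names are hypothetical.

```python
import math
import random

def clip_gradient(grad, clip_norm):
    """Eq. (7): rescale grad so its L2 norm is at most clip_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    scale = min(1.0, clip_norm / norm) if norm > 0 else 1.0
    return [g * scale for g in grad]

def dp_sgd_gradient(per_example_grads, clip_norm, noise_multiplier, rng):
    """Eq. (8): sum clipped gradients, add Gaussian noise, divide by batch size."""
    batch_size = len(per_example_grads)
    dim = len(per_example_grads[0])
    total = [0.0] * dim
    for grad in per_example_grads:
        for j, g in enumerate(clip_gradient(grad, clip_norm)):
            total[j] += g
    std = noise_multiplier * clip_norm   # per-coordinate noise standard deviation
    return [(t + rng.gauss(0.0, std)) / batch_size for t in total]

rng = random.Random(0)
grads = [[3.0, 4.0], [0.3, 0.4]]         # toy per-example gradients (assumed values)
clipped = [clip_gradient(g, 1.0) for g in grads]
noisy = dp_sgd_gradient(grads, clip_norm=1.0, noise_multiplier=1.1, rng=rng)
noiseless = dp_sgd_gradient(grads, clip_norm=1.0, noise_multiplier=0.0, rng=rng)
```

In this toy batch the first gradient has norm 5 and is scaled down to unit norm, while the second has norm 0.5 and passes through unchanged. Setting the noise multiplier to zero recovers the plain average of the clipped gradients, which makes the contribution of the noise term easy to isolate.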

Formally, for any two neighboring private-stage datasets $D$ and $D'$ that differ in exactly one multimodal training pair, the randomized classifier-head mechanism $\mathcal{M}_{\theta_{h}}$ satisfies $(\varepsilon, \delta)$-differential privacy if

$$\Pr[\mathcal{M}_{\theta_{h}}(D) \in S] \leq e^{\varepsilon} \Pr[\mathcal{M}_{\theta_{h}}(D') \in S] + \delta \tag{9}$$

for all measurable output sets $S$. Each image–caption pair is treated as one protected training record. The guarantee, therefore, concerns the private classifier-head optimization stage under the paired-record definition.

For reproducibility, privacy accounting is performed for this DP-SGD stage using an $\ell_{2}$ clipping norm of $C = 1.0$, minibatch size $B = 32$, and $N = 5247$ private training records. The target failure probability is set to $\delta = 1/N \approx 1.91 \times 10^{-4}$. The private stage runs for a fixed six epochs without validation-driven early stopping, yielding an effective sampling ratio $q = B/N \approx 0.00610$ and $T = \lceil 6N/B \rceil = 984$ private update steps. Reported privacy budgets are computed with the dp-accounting RDP implementation for the sampled Gaussian mechanism over orders $\alpha \in \{1.25, 1.50, \dots, 9.75\} \cup \{10, 11, \dots, 511\}$. Training uses shuffled fixed-size minibatches, while accounting follows the standard subsampled-Gaussian convention parameterized by $q$; the same convention is applied consistently across all evaluated noise settings in the privacy–utility analysis.
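The accounting inputs quoted above follow directly from the dataset size, batch size, and epoch count, and can be re-derived with a short check. The sketch below only reproduces this arithmetic and the grid of Rényi orders; it does not call the dp-accounting library, so it produces no privacy budget. The variable names are chosen for illustration.

```python
import math

N = 5247      # private training records
B = 32        # minibatch size
EPOCHS = 6    # fixed number of private epochs

q = B / N                        # sampling ratio used by the accountant
T = math.ceil(EPOCHS * N / B)    # number of private update steps
delta = 1 / N                    # target failure probability

# Renyi orders: 1.25, 1.50, ..., 9.75 followed by the integers 10, ..., 511.
orders = [1.25 + 0.25 * i for i in range(35)] + list(range(10, 512))

print(f"q = {q:.5f}, T = {T}, delta = {delta:.2e}, orders = {len(orders)}")
```

These four quantities, together with the noise multiplier, are what a sampled-Gaussian RDP accountant would need in order to report a budget at the chosen failure probability.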

The reported $(\varepsilon, \delta)$-DP guarantee is limited to the classifier-head DP-SGD optimization stage under fixed image and text feature extractors. Any non-private ImageNet pretraining, GloVe pretraining, feature extraction, warmup training, or earlier fine-tuning outside this DP-SGD stage is not covered by the reported privacy accounting. Therefore, the guarantee should be interpreted as record-level protection for the private classifier-head adaptation mechanism, not as an end-to-end privacy guarantee for the complete multimodal training pipeline.

3.3. Structured Pruning for Efficiency

While differential privacy protects sensitive training records, disaster-response models must also remain computationally manageable. To address this requirement, we apply structured neuron pruning to the classifier head of TriDA. Pruning is restricted to the dense layer after multimodal fusion, while the ResNet50 visual encoder and BiLSTM textual encoder remain unchanged. This design preserves modality-specific representation learning and confines efficiency-oriented simplification to the decision stage.

Let $W_{h} \in \mathbb{R}^{(d_{v} + d_{t}) \times d_{h}}$ denote the weight matrix of the hidden classification layer, where $d_{v} = 2048$ and $d_{t} = 256$ are the visual and textual feature dimensions from Equations (1) and (2), and $d_{h}$ is the number of hidden neurons before pruning. The importance of hidden neuron $j$ is measured by the $\ell_{1}$-norm of its incoming weight vector,

$$s_{j} = \lVert W_{h,:,j} \rVert_{1}. \tag{10}$$

We use the $\ell_{1}$-norm as a simple magnitude-based neuron-importance score that favors the removal of units whose incoming weights remain globally small. Neurons with lower importance scores are removed according to the target pruning level. If $k$ hidden units are retained after pruning, then $k \leq d_{h}$, and the classifier head consists of a $(d_{v} + d_{t}) \times k$ hidden projection followed by a $k \times C$ output layer.
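The selection rule in Equation (10) can be illustrated on a toy head. The sketch below is a hypothetical pure-Python rendering: it scores each hidden unit by the $\ell_{1}$-norm of its incoming column, keeps the highest-scoring units, and slices the hidden weights, hidden biases, and output-layer rows accordingly. The tiny 3-input, 4-unit, 2-class head and the function name are assumptions made for readability.

```python
def prune_hidden_units(W_h, b_h, W_o, keep):
    """Keep the `keep` hidden units with the largest L1 norm of incoming weights."""
    d_h = len(b_h)
    # Eq. (10): score of unit j is the L1 norm of column j of W_h.
    scores = [sum(abs(row[j]) for row in W_h) for j in range(d_h)]
    ranked = sorted(range(d_h), key=lambda j: scores[j], reverse=True)
    kept = sorted(ranked[:keep])                  # preserve original unit order
    W_h_pruned = [[row[j] for j in kept] for row in W_h]
    b_h_pruned = [b_h[j] for j in kept]
    W_o_pruned = [W_o[j] for j in kept]           # drop matching output-layer rows
    return W_h_pruned, b_h_pruned, W_o_pruned, kept

# Toy head: 3 inputs, 4 hidden units, 2 classes (assumed values).
W_h = [[0.5, 0.01, -0.4, 0.0],
       [-0.3, 0.02, 0.6, 0.1],
       [0.2, -0.01, 0.1, 0.0]]
b_h = [0.1, 0.2, 0.3, 0.4]
W_o = [[1.0, -1.0], [0.5, 0.5], [-0.2, 0.7], [0.3, 0.3]]
W_h_p, b_h_p, W_o_p, kept = prune_hidden_units(W_h, b_h, W_o, keep=2)
```

Here units 0 and 2 have the largest column norms and are retained, so the pruned head becomes a 3 × 2 projection followed by a 2 × 2 output layer. Removing whole units keeps the remaining matrices dense, which is what distinguishes structured pruning from zeroing individual weights.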

The number of parameters in the pruned classifier head is

$$P_{\mathrm{head}}(k) = (d_{v} + d_{t}) k + k + k C + C, \tag{11}$$

where the first two terms correspond to hidden-layer weights and biases, and the last two terms correspond to output-layer weights and biases. The relative parameter reduction with respect to the unpruned classifier head is

$$\mathrm{Reduction}(k) = 1 - \frac{P_{\mathrm{head}}(k)}{P_{\mathrm{head}}(d_{h})}. \tag{12}$$

We further quantify computational simplification through multiply–accumulate operations in the classifier head. For $k$ retained hidden units,

$$\mathrm{MACs}_{\mathrm{head}}(k) = (d_{v} + d_{t}) k + k C. \tag{13}$$

The corresponding head-level MAC speed-up is defined as

$$\mathrm{SpeedUp}(k) = \frac{\mathrm{MACs}_{\mathrm{head}}(d_{h})}{\mathrm{MACs}_{\mathrm{head}}(k)}. \tag{14}$$

In the instantiated TriDA architecture, the fused multimodal representation has dimensionality $d_{v} + d_{t} = 2304$, the unpruned hidden layer contains $d_{h} = 128$ neurons, and the classifier predicts $C = 6$ disaster categories. These dimensions are used to compute the classifier-head parameter counts, head-level MACs, and MAC-based speed-up reported in the efficiency analysis.
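Equations (11)–(14) can be evaluated directly for the stated dimensions. The sketch below does so for the unpruned head and for two illustrative retention levels; the values $k = 96$ and $k = 64$ are assumptions chosen for the example and are not taken from the paper's reported pruning settings.

```python
def head_params(k, d_in=2304, n_classes=6):
    """Eq. (11): hidden weights + hidden biases + output weights + output biases."""
    return d_in * k + k + k * n_classes + n_classes

def head_macs(k, d_in=2304, n_classes=6):
    """Eq. (13): multiply-accumulate operations in the two dense layers."""
    return d_in * k + k * n_classes

def reduction(k, d_h=128):
    """Eq. (12): relative parameter reduction versus the unpruned head."""
    return 1 - head_params(k) / head_params(d_h)

def speed_up(k, d_h=128):
    """Eq. (14): ratio of unpruned to pruned head-level MACs."""
    return head_macs(d_h) / head_macs(k)

for k in (128, 96, 64):   # 96 and 64 are illustrative retention levels
    print(k, head_params(k), head_macs(k),
          round(reduction(k), 4), round(speed_up(k), 3))
```

With these dimensions the unpruned head has 295,814 parameters and 295,680 MACs. Because the MAC count is linear in $k$, halving the hidden width yields a head-level speed-up of exactly 2, while the parameter reduction falls marginally short of 50% because the six output biases do not scale with $k$.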

This formulation clarifies that the reported structural savings arise from simplifying the classifier head rather than compressing the full multimodal encoder. In the efficiency study, pruning is applied as deterministic post-processing of the selected DP-SGD model, without additional retraining on private records. Moderate pruning can reduce decision-stage complexity while preserving useful multimodal discrimination, whereas more aggressive pruning may remove capacity that remains important for stable classification.

4. Experiments

This section presents the experimental setup and evaluates the proposed privacy-aware multimodal disaster assessment framework from four complementary perspectives. We first examine aggregate classification performance against unimodal and multimodal baselines. We then study class-wise behavior under dataset imbalance using per-class metrics and a row-normalized confusion matrix. Next, we analyze the privacy–utility trade-off introduced by DP-SGD across different noise levels. Finally, we evaluate the efficiency–utility trade-off achieved through structured neuron pruning in the classifier head. Together, these analyses provide a comprehensive assessment of TriDA in terms of predictive utility, privacy preservation, and computational efficiency.

4.1. Experiment Settings

We describe the experimental setting used throughout the study, including the dataset composition, preprocessing pipeline, evaluation protocol, privacy analysis, and efficiency measurements. This setup is designed to assess TriDA under a consistent multimodal disaster classification framework.

4.1.1. Dataset

We conduct experiments on the UCI Multimodal Damage Identification for Humanitarian Computing (University of California, Irvine, CA, USA) dataset introduced by Mouzannar et al. [18]; we refer to it as the UCI dataset throughout the remainder of the paper. The dataset contains 5831 paired image and text samples annotated into six humanitarian categories, namely damage to nature, damage to infrastructure, human damage, fires, floods, and non-damage. We follow the established split of 5247 training samples and 584 test samples. The class-wise distribution is reported in Table 1.

As shown in Table 1, the dataset exhibits a pronounced class imbalance. The non-damage category contains the largest number of samples, while human damage is substantially underrepresented. This imbalance can favor majority-class behavior when performance is judged only through aggregate metrics. It therefore motivates the use of class-wise evaluation and confusion-matrix analysis in addition to overall summary statistics.

The textual modality also varies noticeably across categories. Table 2 shows that caption length and vocabulary volume differ across classes, with the non-damage category contributing the largest lexical mass. Several categories also contain relatively long captions, suggesting that textual cues may carry detailed contextual evidence beyond the visual signal alone. These characteristics reinforce the need for a multimodal design that can integrate visual and linguistic information while remaining attentive to class imbalance.

Taken together, the class distribution and textual statistics highlight two central evaluation challenges. The first is uneven support across disaster categories. The second is heterogeneous linguistic complexity across multimodal samples. We retain the original data distribution without resampling or synthetic balancing so that the evaluation reflects the natural structure of the benchmark. The subsequent per-class analysis is therefore important for determining whether TriDA performs consistently beyond the majority class.

4.1.2. Data Preprocessing

Disaster images are resized to $224 \times 224$ pixels and normalized to the range $[0, 1]$. During training, we apply random cropping, horizontal flips, and slight color jittering to improve visual generalization. These augmentations are particularly useful for disaster categories with limited visual support. Text captions are tokenized and mapped to 300-dimensional pretrained GloVe embeddings developed by Stanford University (Stanford, CA, USA). Out-of-vocabulary tokens are randomly initialized and updated during training. Sequences are padded or truncated to 100 tokens, which provides broad caption coverage while maintaining computational efficiency. Each multimodal instance is represented as a synchronized image and text pair, consistent with real-world disaster reporting scenarios in which visual and textual evidence are observed jointly.
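The fixed-length caption encoding and pixel scaling described above can be sketched as follows. Several details are assumptions that the text does not specify: padding and truncation are applied at the end of the sequence, index 0 is reserved for padding, and pixel values are 8-bit integers scaled by 255. The helper names are hypothetical.

```python
MAX_LEN = 100   # caption length stated in the text
PAD_ID = 0      # assumed padding index

def pad_or_truncate(token_ids, max_len=MAX_LEN, pad_id=PAD_ID):
    """Return exactly max_len token ids, truncating or right-padding as needed."""
    clipped = token_ids[:max_len]
    return clipped + [pad_id] * (max_len - len(clipped))

def scale_pixels(pixels):
    """Map 8-bit pixel intensities to the range [0, 1] (assumes values in 0..255)."""
    return [p / 255.0 for p in pixels]

short_caption = pad_or_truncate([5, 6, 7], max_len=5)
long_caption = pad_or_truncate(list(range(1, 8)), max_len=5)
```

Fixing the sequence length lets captions be batched into a single tensor, at the cost of discarding tokens beyond the limit for unusually long captions.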

4.1.3. Training Configuration and Model Selection

TriDA is trained using a staged optimization protocol designed to stabilize multimodal learning while preserving the benefits of pretrained visual representations. The model is optimized with Adam using a minibatch size of 32. During the initial warmup stage, the pretrained ResNet50 visual backbone is kept frozen while the task-specific multimodal layers are optimized with a learning rate of $3 \times 10^{-4}$. The classifier head uses a dropout rate of 0.2, consistent with the architecture described in the methodology. In the subsequent fine-tuning stage, the final ResNet50 block is selectively unfrozen to adapt higher-level visual features to disaster imagery using a reduced learning rate of $1 \times 10^{-5}$. Table 3 summarizes the optimizer, learning-rate schedule, maximum epochs, fine-tuning policy, and early-stopping settings used for TriDA, the DP-SGD head stage, and all implemented baselines. A held-out validation subset from the training partition is used for early stopping, adaptive learning-rate reduction, and checkpoint selection, while the predefined test split is reserved exclusively for final reporting. Training stability is supported through validation-loss-based early stopping with a patience of 6 epochs, and the checkpoint with the best monitored validation accuracy is retained for evaluation. This staged procedure balances stable convergence with controlled adaptation of the pretrained visual encoder. For the privacy analysis, the DP-SGD stage updates only the classifier head while keeping the modality-specific feature extractors fixed, following the private training protocol described in Section 3.2.

All experiments were implemented in Python 3.10 using TensorFlow 2.17.0 (Google LLC, Mountain View, CA, USA) with tf_keras 2.17.0. The final reported experiments were executed on the Kennesaw State University high-performance computing cluster using NVIDIA A100 GPU resources (NVIDIA Corporation, Santa Clara, CA, USA), providing a consistent computational environment for model training and evaluation.

All implemented baselines used the same predefined train/test split, identical metric definitions, and the same validation-based model-selection protocol as TriDA. The predefined test split was reserved exclusively for final reporting and was not used for hyperparameter tuning, early stopping, or checkpoint selection.

4.1.4. Evaluation Metrics

We evaluate TriDA in terms of predictive utility, privacy accounting, and efficiency. For classification, we report accuracy together with macro-averaged precision, recall, and F1-score unless otherwise stated. Since the UCI test split is imbalanced, we further provide per-class metrics, macro and weighted averages, and a row-normalized confusion matrix showing both raw counts and class-wise percentages. Qualitative examples in Figure 2 present representative predictions across all six humanitarian categories, with subfigure labels and local panel descriptions clarifying the corresponding image–text examples.
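The relationship between per-class, macro-averaged, and weighted metrics can be made concrete with a small sketch. The confusion matrix below is a hypothetical three-class example, not data from the study, and the helper names are illustrative.

```python
def per_class_metrics(cm):
    """cm[i][j] = number of samples with true class i predicted as class j."""
    n = len(cm)
    support = [sum(row) for row in cm]
    predicted = [sum(cm[i][j] for i in range(n)) for j in range(n)]
    metrics = []
    for k in range(n):
        tp = cm[k][k]
        precision = tp / predicted[k] if predicted[k] else 0.0
        recall = tp / support[k] if support[k] else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        metrics.append((precision, recall, f1))
    return metrics, support


def macro_average(values):
    """Unweighted mean over classes."""
    return sum(values) / len(values)


def weighted_average(values, support):
    """Mean over classes weighted by the number of true samples per class."""
    return sum(v * s for v, s in zip(values, support)) / sum(support)


def row_normalize(cm):
    """Divide each row by its total so that the diagonal holds per-class recall."""
    return [[c / sum(row) if sum(row) else 0.0 for c in row] for row in cm]


# Hypothetical 3-class confusion matrix with one minority class.
cm = [
    [90, 5, 5],
    [10, 80, 10],
    [2, 2, 6],
]
metrics, support = per_class_metrics(cm)
recalls = [m[1] for m in metrics]
f1s = [m[2] for m in metrics]
macro_recall = macro_average(recalls)
macro_f1 = macro_average(f1s)
weighted_f1 = weighted_average(f1s, support)
normalized = row_normalize(cm)
```

On imbalanced data the weighted F1-score is dominated by the large classes, whereas the macro F1-score gives the minority class equal weight, so the two are not interchangeable. Macro recall is the mean of the diagonal of the row-normalized matrix, which is the usual definition of balanced accuracy in the multiclass setting.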

For privacy analysis, privacy budgets are reported for the classifier-head DP-SGD stage using a Rényi Differential Privacy accountant under the accounting configuration specified in Section 3.2. Increasing the noise multiplier σ lowers the final privacy budget ϵ, typically at the cost of predictive utility. DP-SGD and pruning results are reported as mean ± standard deviation across repeated runs.

For efficiency analysis, we report target neuron-drop ratio, retained hidden units, classifier-head parameters, parameter reduction, multiply–accumulate operations, and head-level MAC speed-up. These measurements characterize the utility–privacy–efficiency trade-off while distinguishing classifier-head simplification from full end-to-end deployment acceleration. All implemented models use the same dataset split and metric definitions, while literature results are used only for contextual comparison.

4.2. Baseline Comparisons

Baselines in Table 4 and Table 5 were evaluated with the same dataset split and metric definitions as TriDA whenever they were implemented in our setting. Architecture-specific preprocessing is retained where required, while validation-based checkpoint selection follows the common experimental protocol described above. Pretrained backbones are used where applicable, and the comparisons are intended as controlled within-benchmark empirical references rather than an exhaustive hyperparameter search across every alternative architecture. These results establish the aggregate predictive utility of the proposed unimodal and multimodal designs. We then examine whether this performance remains consistent across classes through the per-class evaluation in Table 6 and the row-normalized confusion matrix in Figure 4. Table 7 subsequently positions TriDA within the broader multimodal disaster assessment literature.

Table 4 (a) reports text-only results on the UCI Multimodal Damage Identification dataset. Among the compared baselines, BERT [56] achieves 82.6% accuracy, while TextCNN [57] and RoBERTa [58] reach 84.1% and 84.9%, respectively. Our TriDA-Text variant obtains the strongest text-only result, with 86.4% accuracy, 84.2% precision, 81.1% recall, and 82.6% F1-score. Relative to RoBERTa, it improves accuracy by 1.5 percentage points and F1-score by 0.4 points. This result suggests that the proposed text branch captures category-specific disaster semantics effectively, although text alone remains insufficient for fully resolving visually grounded situations.

Table 4 (b) presents image-only performance. VGG16 [59] achieves 78.4% accuracy, while DenseNet121 [60] and Inception-v3 [61] improve performance to 81.3% and 82.9%, respectively. Our TriDA-Image branch reaches 83.7% accuracy, 80.8% precision, 79.1% recall, and 79.9% F1-score. Compared with Inception-v3, this yields an accuracy gain of 0.8 percentage points and an F1 gain of 1.0 point. These results confirm that visual structure contributes meaningfully to disaster classification, yet the image modality is less discriminative than text when used in isolation on this benchmark.

Overall, the unimodal analysis reveals a clear modality gap. Textual cues provide strong category-level signals, while images supply complementary contextual and spatial evidence. This finding supports the need for multimodal fusion rather than reliance on a single information stream.

Table 5 reports multimodal fusion results on the UCI dataset. Figure 3 visualizes the accuracy comparison among the implemented multimodal baselines and the proposed TriDA model. EarlyFusion-CNNTr achieves 90.7% accuracy, while HybridRes-BERT reaches 91.2%. DualCNN-Align attains 91.9% accuracy and 89.7% F1-score, making it the strongest implemented multimodal baseline. Our proposed TriDA achieves the best overall performance, with 93.7% accuracy, 92.5% precision, 90.6% recall, and 91.6% F1-score. Relative to DualCNN-Align, TriDA improves accuracy by 1.8 percentage points and F1-score by 1.9 points. This advantage remains stable across repeated runs, where TriDA records 93.7 ± 0.21% accuracy and 91.6 ± 0.24% F1-score, compared with 91.9 ± 0.27% accuracy and 89.7 ± 0.31% F1-score for DualCNN-Align.

These results indicate that the late-fusion design provides a strong balance between representational richness and computational restraint. Rather than relying on heavier cross-attention or transformer-style fusion, TriDA preserves modality-specific evidence and combines it effectively at the decision stage. The consistent gains over unimodal and multimodal baselines establish TriDA as a strong predictive foundation for the subsequent privacy and efficiency analyses.

To determine whether the strong aggregate performance of TriDA is distributed consistently across disaster categories, we next examine class-wise behavior on the imbalanced UCI test split. Table 6 reports per-class precision, recall, and F1-score together with macro and weighted averages. TriDA maintains strong performance across all six categories. It achieves F1-scores of 94.4% for damage infrastructure and fires, 95.2% for non-damage, and 90.1% for flood. The human-damage class records the lowest recall at 85.7%, which is expected given its limited test support of only 21 instances. Even so, TriDA preserves a strong macro-F1 of 91.6%, showing that its aggregate accuracy is not driven solely by the majority class. TriDA also achieves a macro recall of 90.6%, which is equivalent to a balanced accuracy of 90.6% in the multiclass setting.

Figure 4 complements the tabular analysis through a row-normalized confusion matrix that presents both raw counts and within-class percentages. The diagonal concentration remains strong across all categories, including 93.8% recall for damage infrastructure, 91.9% for fires, and 96.2% for non-damage. The largest off-diagonal confusion occurs when minority damage-related categories are assigned to non-damage, reflecting the influence of class imbalance and the semantic overlap between visually or textually ambiguous cases. The figure therefore provides a more transparent view of model behavior than aggregate accuracy alone.

Having established both aggregate and class-wise performance within our controlled experimental setting, we next compare TriDA with representative prior multimodal disaster assessment studies. Because earlier works often report different metrics and follow non-identical evaluation protocols, strict direct comparison is not always possible. Table 7 is therefore intended as a contextual comparison. It summarizes previously reported results, indicates the level of evaluative completeness, and situates the performance of TriDA relative to the published literature.

As shown in Table 7, prior studies often report only a limited subset of evaluation metrics, most commonly accuracy or weighted F1-score. This limits transparent cross-paper analysis. In contrast, TriDA is evaluated using accuracy, precision, recall, and F1-score within a consistent experimental protocol. It achieves 93.7% accuracy, which matches or exceeds the reported accuracy values of the cited multimodal studies. These comparisons should still be interpreted cautiously because the underlying evaluation settings are not fully identical, and the weighted F1-score reported by Hossain et al. is not directly comparable to the macro F1-score used for TriDA.

More broadly, the comparison highlights a methodological gap in prior multimodal disaster assessment work. Existing studies primarily emphasize predictive performance, whereas TriDA is evaluated through a broader lens that combines predictive utility, formal privacy accounting, and structured efficiency analysis. This broader perspective is central to the goal of developing multimodal disaster assessment systems that remain reliable when sensitive data and computational constraints must be considered jointly.

4.3. Privacy–Utility Trade-Off

We next examine how the classifier-head DP-SGD stage affects predictive utility under varying noise multipliers σ. Table 8 summarizes the resulting privacy–utility trade-off and reports retained utility relative to the no-noise classifier-head control. The privacy budget ϵ is computed using the Rényi Differential Privacy accountant described in Section 3.2, while predictive metrics are reported as mean ± standard deviation across repeated runs. As expected, stronger Gaussian perturbation lowers ϵ and improves privacy accounting, but it also reduces downstream classification utility.
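For readers unfamiliar with the mechanism, the role of the noise multiplier σ can be illustrated with a single generic DP-SGD step in the style of Abadi et al. [15]. This is a sketch under stated assumptions, not the configuration used in the study: the clipping norm, learning rate, plain SGD update rule, and toy gradients are all hypothetical, and the function names are illustrative.

```python
import math
import random


def clip_gradient(grad, clip_norm):
    """Scale a per-example gradient so its L2 norm is at most clip_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    scale = min(1.0, clip_norm / norm) if norm > 0 else 1.0
    return [g * scale for g in grad]


def dp_sgd_update(params, per_example_grads, lr, clip_norm, sigma, rng):
    """One DP-SGD step: clip each example's gradient, sum, add noise, average."""
    batch_size = len(per_example_grads)
    clipped = [clip_gradient(g, clip_norm) for g in per_example_grads]
    summed = [sum(col) for col in zip(*clipped)]
    # Gaussian noise with standard deviation sigma * clip_norm per coordinate.
    noisy = [s + rng.gauss(0.0, sigma * clip_norm) for s in summed]
    # Assumed plain SGD update on the averaged noisy gradient.
    return [p - lr * n / batch_size for p, n in zip(params, noisy)]


rng = random.Random(0)
params = [0.0, 0.0]
grads = [[3.0, 4.0], [0.3, 0.4]]  # hypothetical per-example gradients
no_noise = dp_sgd_update(params, grads, lr=1.0, clip_norm=1.0, sigma=0.0, rng=rng)
noisy = dp_sgd_update(params, grads, lr=1.0, clip_norm=1.0, sigma=1.0, rng=rng)
```

With σ = 0 the step reduces to clipped gradient descent, which corresponds to a no-noise control. Clipping bounds each record's contribution, and the noise scale grows linearly with σ, which is why larger σ yields a smaller ϵ at the cost of noisier updates.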

The σ = 0.0 row serves as the no-noise control within the same classifier-head optimization protocol used for the privacy study. It is therefore not intended to duplicate the full non-private TriDA result reported in Table 5. Under this control setting, the model achieves 93.70 ± 0.21% accuracy and 91.2 ± 0.25% macro F1-score. At σ = 0.5, TriDA retains 97.51% of the no-noise accuracy and 97.26% of the no-noise F1-score, but the resulting ϵ = 22.6 remains relatively large. At σ = 1.0, the model retains 94.14% of accuracy and 92.87% of F1-score with ϵ = 17.1. Under the strongest privacy setting considered, σ = 2.0, ϵ decreases to 8.3, but F1-score retention falls to 80.92%. These results show a controlled classifier-head privacy–utility trade-off across the evaluated noise range.
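The retention figures can be checked with simple arithmetic, assuming that retained utility is the ratio of the DP metric to the no-noise control metric. The control means are those quoted above, and the σ = 1.0 means (88.2% accuracy, 84.7% F1-score) are those quoted for the unpruned DP model in Section 4.4. The function name is illustrative.

```python
def retained_utility(dp_metric, control_metric):
    """Utility retained under DP noise, as a percentage of the no-noise control."""
    return 100.0 * dp_metric / control_metric


# Control (sigma = 0.0) and sigma = 1.0 means as reported in the text.
control_acc, control_f1 = 93.70, 91.2
dp_acc, dp_f1 = 88.2, 84.7

acc_retention = retained_utility(dp_acc, control_acc)
f1_retention = retained_utility(dp_f1, control_f1)
print(f"accuracy retained: {acc_retention:.2f}%")
print(f"F1 retained: {f1_retention:.2f}%")
```

Under this assumed definition the F1-score retention matches the reported 92.87%. The accuracy retention comes out at about 94.13% against the reported 94.14%, a gap consistent with the reported means having been rounded before publication.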

The reported ϵ values should be interpreted as moderate rather than strict high-privacy guarantees. Lower ϵ indicates stronger privacy accounting, but practical interpretation depends on the deployment context, threat model, data sensitivity, and whether the entire training pipeline is private. In this study, σ = 0.5 preserves the highest utility but yields a relatively large ϵ = 22.6, whereas σ = 2.0 reduces ϵ to 8.3 but substantially lowers the F1-score. Thus, these results are best interpreted as a controlled classifier-head privacy–utility characterization, not as evidence of strong end-to-end private deployment.

To further contextualize the privacy evidence, Table 9 compares TriDA with representative privacy-preserving disaster learning studies. The comparison is intentionally contextual rather than strictly controlled because the cited works differ in dataset, task definition, modality, privacy mechanism, and reporting protocol. Existing disaster-focused privacy methods mainly rely on federated learning, encrypted model-update exchange, or adaptive privacy mechanisms to reduce raw-data exposure during collaborative training. In contrast, TriDA reports explicit classifier-head DP-SGD privacy accounting for paired image–text records under a stated neighboring-record definition. This distinction is important because federated learning or encryption can reduce direct data sharing, but they do not by themselves provide the same record-level (ϵ, δ)-DP accounting reported by TriDA.

Table 9 shows that these approaches reduce direct data sharing, but most do not report explicit paired-record (ϵ, δ)-DP accounting for multimodal disaster records. TriDA therefore complements prior work by quantifying how classifier-head DP-SGD affects utility under formal privacy accounting for paired image–text disaster samples.

The standard deviations increase moderately with stronger privacy noise, indicating that heavier perturbation affects both average utility and run-to-run stability. Even so, the degradation remains gradual rather than erratic, suggesting that the privacy effect is controlled across the tested settings. Figure 5 further illustrates the monotonic decline in accuracy as the noise multiplier increases.

4.4. Efficiency–Utility Trade-Off

We next evaluate the effect of structured neuron pruning on predictive utility and classifier-head efficiency. Starting from the DP-SGD model trained at σ = 1.0, neurons in the dense classification head are pruned from 0% to 20%. Pruning is applied as deterministic post-processing of the selected DP-SGD model, without additional retraining on private records. Table 10 reports the resulting predictive performance under increasing neuron removal, while Table 11 quantifies the corresponding structural simplification through retained hidden units, classifier-head parameter counts, parameter reduction, MACs, and head-level MAC-based speed-up. Predictive results are summarized as mean ± standard deviation across repeated runs.
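Structured neuron pruning of a two-layer dense head can be sketched as follows. This is an illustration under assumptions, not the pruning procedure used in the study: the importance score (L2 norm of each unit's incoming weights), the floor rounding of the drop count, the weight layout, and the toy weights are all hypothetical choices made for the example.

```python
import math


def prune_hidden_units(w_in, b_in, w_out, drop_ratio):
    """Remove the lowest-scoring hidden units from a two-layer dense head.

    w_in[h]  : incoming weights of hidden unit h
    b_in[h]  : bias of hidden unit h
    w_out[h] : outgoing weights from hidden unit h to the class logits
    """
    n_hidden = len(w_in)
    n_drop = math.floor(drop_ratio * n_hidden)
    # Assumed importance score: L2 norm of each unit's incoming weights.
    scores = [math.sqrt(sum(w * w for w in row)) for row in w_in]
    # Sort by (score, index) so that ties break deterministically.
    order = sorted(range(n_hidden), key=lambda h: (scores[h], h))
    keep = sorted(order[n_drop:])
    return (
        [w_in[h] for h in keep],
        [b_in[h] for h in keep],
        [w_out[h] for h in keep],
        keep,
    )


# Hypothetical head: 5 hidden units, 3 inputs, 2 classes.
w_in = [[0.5, -0.2, 0.1], [0.01, 0.0, 0.02], [1.0, 1.0, -1.0],
        [0.3, 0.3, 0.3], [0.0, 0.05, 0.0]]
b_in = [0.1, 0.0, -0.2, 0.05, 0.0]
w_out = [[0.4, -0.4], [0.1, 0.1], [-0.7, 0.7], [0.2, 0.0], [0.0, 0.3]]
pw_in, pb_in, pw_out, kept = prune_hidden_units(w_in, b_in, w_out, drop_ratio=0.4)
```

Removing a hidden unit deletes its incoming weights, its bias, and its outgoing weights together, so the pruned head is a genuinely smaller dense network rather than a masked one. Because the selection depends only on the already-trained weights, it is a deterministic function of the model and needs no further access to training records.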

At 0% pruning, the DP baseline achieves 88.2 ± 0.34% accuracy and 84.7 ± 0.38% F1-score. With 5% neuron removal, accuracy remains at 87.0 ± 0.36% and F1-score at 83.3 ± 0.41%, while the classifier-head parameter count decreases from 295,814 to 281,948. At the 10% pruning level, TriDA retains 116 hidden units, reduces classifier-head parameters by 9.4%, lowers MACs from 295,680 to 267,960, and achieves a 1.10× head-level MAC-based speed-up. The model still preserves 85.6 ± 0.39% accuracy and 81.4 ± 0.44% F1-score. This setting provides the most balanced trade-off between predictive utility and structural simplification.

More aggressive pruning produces sharper utility degradation. At 15% pruning, accuracy falls to 82.2 ± 0.46% and F1-score to 77.8 ± 0.51%, although the head-level MAC-based speed-up increases to 1.17×. At 20%, the model reaches the largest classifier-head reduction considered in this study, with 19.5% fewer head parameters and a 1.24× head-level MAC-based speed-up. However, predictive utility decreases to 78.6 ± 0.53% accuracy and 73.8 ± 0.59% F1-score. These results indicate that light-to-moderate pruning mainly removes redundant decision-stage capacity, whereas heavier pruning begins to discard neurons that remain important for multimodal discrimination.
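The parameter and MAC figures quoted above follow from counting the weights of a two-layer dense head. The sketch below assumes a head of shape 2304 → 128 → 6 with biases and floor rounding of the number of dropped units. These values are not stated in this section. They are inferred here because they reproduce the reported counts, and they should be read as an assumption rather than a confirmed description of the architecture.

```python
import math


def head_stats(n_in, n_hidden, n_classes):
    """Parameters (with biases) and MACs of a dense head n_in -> n_hidden -> n_classes."""
    params = n_hidden * (n_in + 1) + n_classes * (n_hidden + 1)
    macs = n_hidden * n_in + n_hidden * n_classes
    return params, macs


# Assumed head shape (not stated in this section): 2304 fused inputs,
# 128 hidden units, 6 output classes, with floor rounding of the drop count.
N_IN, N_HIDDEN, N_CLASSES = 2304, 128, 6
base_params, base_macs = head_stats(N_IN, N_HIDDEN, N_CLASSES)

rows = []
for ratio in (0.0, 0.05, 0.10, 0.15, 0.20):
    kept = N_HIDDEN - math.floor(ratio * N_HIDDEN)
    params, macs = head_stats(N_IN, kept, N_CLASSES)
    reduction = 100.0 * (base_params - params) / base_params
    speedup = base_macs / macs
    rows.append((ratio, kept, params, macs, reduction, speedup))
    print(f"{ratio:.0%}: units={kept} params={params} MACs={macs} "
          f"reduction={reduction:.2f}% speed-up={speedup:.2f}x")
```

Under these assumptions the script reproduces the quoted 295,814 baseline parameters, 281,948 parameters at 5%, 116 retained units and 267,960 MACs at 10%, and the 1.10×, 1.17×, and 1.24× MAC ratios. Since nearly all head parameters sit in the first dense layer, both the parameter count and the MAC count scale almost linearly with the number of retained hidden units.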

Figure 6 and Figure 7 visualize this trade-off from complementary perspectives. Figure 6 shows the decline in accuracy as neuron removal increases, while Figure 7 shows the corresponding increase in head-level MAC-based speed-up. Because pruning is applied only to the classifier head, the reported efficiency gains should be interpreted as structural simplification of the decision stage rather than as end-to-end deployment acceleration.

Estimated total parameters exclude the trainable GloVe embedding matrix because the vocabulary size is implementation-dependent. If the embedding matrix is included, 300|V| parameters should be added to every total, making the full-model percentage reduction even smaller.

This parameter-scale analysis clarifies that structured pruning primarily simplifies the classifier head rather than the full multimodal pipeline. Although 20% neuron pruning reduces classifier-head parameters by 19.53%, the estimated full-model parameter reduction is only 0.238% when excluding the trainable embedding matrix. Therefore, the pruning result should be interpreted as decision-stage structural simplification, not as full-model compression or end-to-end acceleration.

5. Discussion and Limitations

The results show that TriDA provides a strong balance among multimodal utility, privacy-aware optimization, and classifier-head efficiency. It outperforms the implemented unimodal and multimodal baselines, while the class-wise analysis confirms that the improvement is not driven only by the dominant non-damage class. TriDA achieves a macro F1-score of 91.6% and a macro recall of 90.6%, equivalent to balanced accuracy in this multiclass setting. The confusion matrix further indicates strong class-level discrimination, although human damage remains the most challenging category because of limited sample support. The privacy and pruning analyses reveal controlled trade-offs. Increasing DP-SGD noise lowers the reported privacy budget while gradually reducing predictive utility, showing that substantial utility can be retained under low-to-moderate noise. Moderate classifier-head pruning reduces parameters and MAC operations with limited performance loss, with the 10% pruning setting providing the strongest balance in the tested range. Since pruning is applied as post-processing of the selected DP-SGD model without private-data retraining, it does not introduce additional private optimization steps.

TriDA is primarily a methodological and empirical contribution. Its formulation specifies the privacy-accounting and pruning mechanisms rather than establishing new convergence or generalization guarantees. Several limitations remain. The evaluation is restricted to one multimodal disaster benchmark, and additional datasets are needed to assess broader generalization. The reported privacy accounting applies only to the classifier-head DP-SGD stage with fixed feature extractors, not to a full end-to-end private training pipeline. Empirical privacy attack analysis, calibration evaluation, formal statistical significance testing, and hardware-level deployment profiling also remain important directions for future work.

Overall, TriDA offers a focused empirical study of privacy-aware multimodal disaster assessment and provides a solid foundation for broader validation, stronger privacy evaluation, and deeper theoretical and deployment-oriented analysis.

6. Conclusions

This paper introduced TriDA, a multimodal disaster assessment framework that integrates image and text streams while examining the joint challenges of predictive utility, privacy-aware optimization, and classifier-head efficiency. The framework uses a classifier-head DP-SGD stage to report training-record-level differential privacy accounting for paired image–text samples under the stated private optimization protocol, and applies structured pruning to reduce redundant capacity in the classification head. The experiments show that TriDA achieves strong multimodal performance, retains meaningful utility under moderate DP noise, and benefits from controlled head-level parameter and MAC reductions through pruning. These findings position TriDA as a privacy-aware and efficiency-conscious step toward resource-aware multimodal disaster assessment, while motivating future work on broader dataset validation, empirical privacy analysis, calibration-aware evaluation, and hardware-level efficiency profiling.

Author Contributions

Conceptualization, A.K., D.H. and H.X.; methodology, M.A.O. and H.X.; software, M.A.O.; validation, M.A.O., A.K., D.H. and H.X.; formal analysis, M.A.O. and H.X.; investigation, M.A.O. and H.X.; resources, D.H.; writing—original draft, M.A.O. and H.X.; writing—review and editing, A.K., D.H. and H.X.; visualization, M.A.O.; supervision, H.X. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Kennesaw State University (KSU) Grand Challenges Seed Grants.

Data Availability Statement

The original data presented in the study are openly available in [18] and at the UCI Machine Learning Repository: https://doi.org/10.24432/C52P6P (accessed on 7 May 2026).

Acknowledgments

The authors gratefully acknowledge the support from KSU.

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. Coppola, D. Introduction to International Disaster Management; Elsevier: Amsterdam, The Netherlands, 2006.
2. Cutter, S.L. The changing nature of hazard and disaster risk in the Anthropocene. Ann. Am. Assoc. Geogr. 2021, 111, 819–827.
3. Cheng, C.S.; Luo, L.; Murphy, S.; Lee, Y.C.; Leite, F. A framework to enhance disaster debris estimation with AI and aerial photogrammetry. Int. J. Disaster Risk Reduct. 2024, 107, 104468.
4. Voigt, S.; Giulio-Tonolo, F.; Lyons, J.; Kučera, J.; Jones, B.; Schneiderhan, T.; Platzeck, G.; Kaku, K.; Hazarika, M.K.; Czaran, L.; et al. Global trends in satellite-based emergency mapping. Science 2016, 353, 247–252.
5. Imran, M.; Castillo, C.; Diaz, F.; Vieweg, S. Processing social media messages in mass emergency: A survey. ACM Comput. Surv. (CSUR) 2015, 47, 1–38.
6. Alam, F.; Sajjad, H.; Imran, M.; Ofli, F. CrisisBench: Benchmarking crisis-related social media datasets for humanitarian information processing. In Proceedings of the International AAAI Conference on Web and Social Media, Virtual, 7–10 June 2021; Volume 15, pp. 923–932.
7. Gupta, R.; Shah, M. Rescuenet: Joint building segmentation and damage assessment from satellite imagery. In Proceedings of the 2020 25th International Conference on Pattern Recognition (ICPR); IEEE: New York, NY, USA, 2021; pp. 4405–4411.
8. Oaphy, M.A.; Hu, D.; Khalid, A.; Xu, H. Lightweight and Privacy-Enhanced Detection Model on Aerial Imagery for Post-Disaster Building Damage Reconnaissance. In Proceedings of the 59th Hawaii International Conference on System Sciences; University of Hawaii at Manoa: Honolulu, HI, USA, 2026; pp. 7058–7067.
9. Abdullahil-Oaphy, M.; Bhuiyan, M.R.; Islam, M.S. Classifying the Usual Leaf Diseases of Paddy Plants in Bangladesh Using Multilayered CNN Architecture. In Soft Computing Techniques and Applications: Proceedings of the International Conference on Computing and Communication; Springer: Singapore, 2020; pp. 389–397.
10. Bhuiyan, M.R.; Abdullahil-Oaphy, M.; Khanam, R.S.; Islam, M.S. MediNET: A Deep Learning Approach to Recognize Bangladeshi Ordinary Medicinal Plants Using CNN. In Soft Computing Techniques and Applications: Proceedings of the International Conference on Computing and Communication; Springer: Singapore, 2020; pp. 371–380.
11. Zahra, K.; Imran, M.; Ostermann, F.O. Automatic identification of eyewitness messages on twitter during disasters. Inf. Process. Manag. 2020, 57, 102107.
12. Xu, P.; Zhu, X.; Clifton, D.A. Multimodal learning with transformers: A survey. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 45, 12113–12132.
13. Shokri, R.; Stronati, M.; Song, C.; Shmatikov, V. Membership inference attacks against machine learning models. In Proceedings of the 2017 IEEE Symposium on Security and Privacy (SP); IEEE: New York, NY, USA, 2017; pp. 3–18.
14. Jayaraman, B.; Evans, D. Evaluating differentially private machine learning in practice. In Proceedings of the 28th USENIX Security Symposium (USENIX Security 19), Santa Clara, CA, USA, 14–16 August 2019; pp. 1895–1912.
15. Abadi, M.; Chu, A.; Goodfellow, I.; McMahan, H.B.; Mironov, I.; Talwar, K.; Zhang, L. Deep learning with differential privacy. In Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security, Vienna, Austria, 24–28 October 2016; pp. 308–318.
16. Han, S.; Pool, J.; Tran, J.; Dally, W. Learning both weights and connections for efficient neural network. Adv. Neural Inf. Process. Syst. 2015, 28, 1135–1143.
17. Frankle, J.; Carbin, M. The lottery ticket hypothesis: Finding sparse, trainable neural networks. arXiv 2018, arXiv:1803.03635.
18. Mouzannar, H.; Rizk, Y.; Awad, M. Damage Identification in Social Media Posts using Multimodal Deep Learning. In Proceedings of the ISCRAM, Rochester, NY, USA, 20–23 May 2018.
19. Hossain, E.; Hoque, M.M.; Hoque, E.; Islam, M.S. A deep attentive multimodal learning approach for disaster identification from social media posts. IEEE Access 2022, 10, 46538–46551.
20. Ofli, F.; Alam, F.; Imran, M. Analysis of social media data using multimodal deep learning for disaster response. arXiv 2020, arXiv:2004.11838.
21. Hao, H.; Wang, Y. Leveraging multimodal social media data for rapid disaster damage assessment. Int. J. Disaster Risk Reduct. 2020, 51, 101760.
22. Zou, Z.; Gan, H.; Huang, Q.; Cai, T.; Cao, K. Disaster image classification by fusing multimodal social media data. ISPRS Int. J. Geo-Inf. 2021, 10, 636.
23. Kota, S.; Haridasan, S.; Rattani, A.; Bowen, A.; Rimmington, G.; Dutta, A. Multimodal Combination of Text and Image Tweets for Disaster Response Assessment. In Proceedings of the D2R2, Leipzig, Germany, 6 July 2022.
24. Abavisani, M.; Wu, L.; Hu, S.; Tetreault, J.; Jaimes, A. Multimodal categorization of crisis events in social media. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 3–7 June 2020; pp. 14679–14689.
25. Sosea, T.; Sirbu, I.; Caragea, C.; Caragea, D.; Rebedea, T. Using the image-text relationship to improve multimodal disaster tweet classification. In Proceedings of the 18th International Conference on Information Systems for Crisis Response and Management (ISCRAM 2021), Virtual, 23–26 May 2021.
26. Shetty, N.P.; Bijalwan, Y.; Chaudhari, P.; Shetty, J.; Muniyal, B. Disaster assessment from social media using multimodal deep learning. Multimed. Tools Appl. 2025, 84, 18829–18854.
27. Dar, S.S.; Rehman, M.Z.U.; Bais, K.; Haseeb, M.A.; Kumar, N. A social context-aware graph-based multimodal attentive learning framework for disaster content classification during emergencies. Expert Syst. Appl. 2025, 259, 125337.
28. Basit, M.; Alam, B.; Fatima, Z.; Shaikh, S. Natural disaster tweets classification using multimodal data. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, Singapore, 6–10 December 2023; pp. 7584–7594.
29. Zhang, J.; Liao, M.; Wang, Y.; Huang, Y.; Chen, F.; Makiko, C. Multi-modal deep learning framework for damage detection in social media posts. PeerJ Comput. Sci. 2024, 10, e2262.
30. Li, S.; Liu, Q.; Pan, Z.; Wu, X. CLIP-BCA-Gated: A Dynamic Multimodal Framework for Real-Time Humanitarian Crisis Classification with Bi-Cross-Attention and Adaptive Gating. Appl. Sci. 2025, 15, 2076–3417.
31. Li, X.; Liu, H.; Lin, Q.; Sun, Q.; Jiang, Q.; Su, S. LPDi GAN: A license plate DE-identification method to preserve strong data utility. Sensors 2024, 24, 4922.
32. Du, L.; Ling, H. Preservative license plate de-identification for privacy protection. In Proceedings of the 2011 International Conference on Document Analysis and Recognition; IEEE: New York, NY, USA, 2011; pp. 468–472.
33. Andrés, M.E.; Bordenabe, N.E.; Chatzikokolakis, K.; Palamidessi, C. Geo-indistinguishability: Differential privacy for location-based systems. In Proceedings of the 2013 ACM SIGSAC Conference on Computer & Communications Security, Berlin, Germany, 4–8 November 2013; pp. 901–914.
34. Löchner, M.; Fathi, R.; Schmid, D.; Dunkel, A.; Burghardt, D.; Fiedrich, F.; Koch, S. Case study on privacy-aware social media data processing in disaster management. ISPRS Int. J. Geo-Inf. 2020, 9, 709.
35. Zhang, Z.; He, N.; Li, D.; Gao, H.; Gao, T.; Zhou, C. Federated transfer learning for disaster classification in social computing networks. J. Saf. Sci. Resil. 2022, 3, 15–23.
36. El-Niss, A.; Alzu’Bi, A.; Abuarqoub, A. Multimodal fusion for disaster event classification on social media: A deep federated learning approach. In Proceedings of the 7th International Conference on Future Networks and Distributed Systems, Dubai, United Arab Emirates, 21–22 December 2023; pp. 758–763.
37. Deng, S.; Zhang, L.; Yue, D. Data-driven and privacy-preserving risk assessment method based on federated learning for smart grids. Commun. Eng. 2024, 3, 154.
38. Kurniawan, H.; Mambo, M. Homomorphic encryption-based federated privacy preservation for deep active learning. Entropy 2022, 24, 1545.
39. Fang, H.; Qian, Q. Privacy preserving machine learning with homomorphic encryption and federated learning. Future Internet 2021, 13, 94.
40. Alkhelaiwi, M.; Boulila, W.; Ahmad, J.; Koubaa, A.; Driss, M. An efficient approach based on privacy-preserving deep learning for satellite image classification. Remote Sens. 2021, 13, 2221.
41. Zhang, Z.; Ma, X.; Ma, J. Local differential privacy based membership-privacy-preserving federated learning for deep-learning-driven remote sensing. Remote Sens. 2023, 15, 5050.
42. Zhang, J.; Lei, J.; Xie, W.; Li, Y.; Yang, G.; Jia, X. Guided Hybrid Quantization for Object Detection in Remote Sensing Imagery via One-to-One Self-Teaching. IEEE Trans. Geosci. Remote Sens. 2023, 61, 5614815.
43. Mahmoudi, S.A.; Gloesener, M.; Benkedadra, M.; Lerat, J.S. Edge AI System for Real-Time and Explainable Forest Fire Detection Using Compressed Deep Learning Models. Proc. Copyr. 2025, 847, 847–854.
44. Bai, Y.; Su, J.; Zou, Y.; Adriano, B. Knowledge distillation based lightweight building damage assessment using satellite imagery of natural disasters. Geoinformatica 2022, 27, 237–261.
45. El-Madafri, I.; Peña, M.; Olmedo-Torre, N. Real-time forest fire detection with lightweight CNN using hierarchical multi-task knowledge distillation. Fire 2024, 7, 392.
46. Himeur, Y.; Aburaed, N.; Elharrouss, O.; Varlamis, I.; Atalla, S.; Mansoor, W.; Ahmad, H.A. Applications of Knowledge Distillation in Remote Sensing: A Survey. arXiv 2024, arXiv:2409.12111.
47. Xing, S.; Xing, J.; Ju, J.; Hou, Q.; Ding, X. Collaborative Consistent Knowledge Distillation Framework for Remote Sensing Image Scene Classification Network. Remote Sens. 2022, 14, 5186.
48. Seidel, L.; Gehringer, S.; Raczok, T.; Ivens, S.N.; Eckardt, B.; Maerz, M. Advancing early wildfire detection: Integration of Vision Language Models with Unmanned Aerial Vehicle remote sensing for enhanced situational awareness. Drones 2025, 9, 347.
49. Deng, X.; Shi, M.; Khan, B.; Choo, Y.H.; Ghaffar, F.; Lim, C.P. A lightweight CNN model for UAV-based image classification. Soft Comput. 2025, 29, 2363–2378.
50. Akagic, A.; Buza, E. LW-FIRE: A lightweight wildfire image classification with a deep convolutional neural network. Appl. Sci. 2022, 12, 2646.
51. Liu, C.; Sui, H.; Wang, J.; Ni, Z.; Ge, L. Real-time Ground-level building damage detection based on lightweight and accurate YOLOv5 using terrestrial images. Remote Sens. 2022, 14, 2763.
52. Zhang, Y.; Lin, Q.; Qin, C.; Ge, H. Forest fire smoke detection method based on MoAm-YOLOv4 algorithm. J. Comput. Commun. 2022, 10, 1–14.
53. Hu, Y.; Chen, N.; Hou, Y.; Lin, X.; Jing, B.; Liu, P. Lightweight deep learning for real-time road distress detection on mobile devices. Nat. Commun. 2025, 16, 4212.
54. Zhao, D.; Mo, B. UAV-Based Real-Time Object Detection Network Using Structured Pruning Strategy. Electron. Lett. 2025, 61, 70206.
55. Baltrušaitis, T.; Ahuja, C.; Morency, L.P. Multimodal machine learning: A survey and taxonomy. IEEE Trans. Pattern Anal. Mach. Intell. 2018, 41, 423–443.
56. Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. arXiv 2019, arXiv:1810.04805.
57. Kim, Y. Convolutional Neural Networks for Sentence Classification. arXiv 2014, arXiv:1408.5882.
58. Liu, Y.; Ott, M.; Goyal, N.; Du, J.; Joshi, M.; Chen, D.; Levy, O.; Lewis, M.; Zettlemoyer, L.; Stoyanov, V. RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv 2019, arXiv:1907.11692.
59. Simonyan, K.; Zisserman, A. Very Deep Convolutional Networks for Large-Scale Image Recognition. arXiv 2015, arXiv:1409.1556.
60. Huang, G.; Liu, Z.; van der Maaten, L.; Weinberger, K.Q. Densely Connected Convolutional Networks. arXiv 2018, arXiv:1608.06993.
61. Szegedy, C.; Vanhoucke, V.; Ioffe, S.; Shlens, J.; Wojna, Z. Rethinking the Inception Architecture for Computer Vision. arXiv 2015, arXiv:1512.00567.
62. Fan, K.; Hua, K.; Yang, M.; Fang, C. Multimodal Disaster Information Detection Method Based on Privacy-Preserving Federated Learning. In Proceedings of the 2025 7th International Conference on Electronic Engineering and Informatics (EEI); IEEE: New York, NY, USA, 2025; pp. 720–725.

Figure 1. Overview of TriDA. The text and image pipelines first encode disaster captions and images using BiLSTM and ResNet50-based feature extractors, respectively. The extracted modality-specific features are then concatenated in the fusion and core model block for disaster classification. The blue shield denotes the use of DP-SGD for privacy accounting, while the red dots indicate pruning for speed-up and lightweight deployment. Arrows show the flow from input modalities to output predictions and deployment scenarios.

Figure 2. Representative TriDA predictions on UCI image–text samples across six humanitarian categories: (a) fire, (b) damage infrastructure, (c) human damage, (d) flood, (e) damage nature, and (f) non-damage. Each panel presents the paired textual description, disaster image, and corresponding model prediction.

Figure 3. Multimodal comparison on the UCI dataset.

Figure 4. Row-normalized confusion matrix of TriDA on the UCI test split. Each cell reports the raw prediction count, with the corresponding percentage relative to the actual class shown in parentheses.

Figure 5. Accuracy versus noise multiplier $σ$ under classifier-head DP-SGD. Error bars indicate standard deviation across five repeated runs.

Figure 6. Accuracy versus neuron drop under structured pruning. Error bars indicate standard deviation across five repeated runs.

Figure 7. Neuron drop versus head-level MAC-based speed-up under structured pruning.

Table 1. UCI class distribution across disaster categories.

Class Names	Train	Test
Damage nature	459	55
Damage infrastructure	1246	144
Human damage	219	21
Fires	309	37
Flood	348	36
Non-damage	2666	291
Total	5247	584

Note: Bold values indicate the total number of training and test samples across all classes.

Table 2. Text statistics across disaster categories.

Class	Total Words	Max Tweet Length	Avg. Words/Tweet
Damage nature	13,230	345	28.82
Damage infrastructure	35,202	363	28.25
Human damage	8800	354	40.18
Fires	11,809	382	38.21
Flood	11,774	290	33.83
Non-damage	111,686	382	41.89

Table 3. Baseline training configuration and reproducibility settings.

Model	Opt.	LR/Schedule	Max Ep.	FT	ES
BERT	AdamW	$2 \times 10^{- 5}$	5	T1	2
RoBERTa	AdamW	$1 \times 10^{- 5}$	5	T1	2
TextCNN	Adam	$1 \times 10^{- 3}$	40	T2	6
VGG16	Adam	$3 \times 10^{- 4}$ / $1 \times 10^{- 5}$	50	V1	6
DenseNet121	Adam	$3 \times 10^{- 4}$ / $1 \times 10^{- 5}$	50	V1	6
Inception-v3	Adam	$3 \times 10^{- 4}$ / $1 \times 10^{- 5}$	50	V1	6
EarlyFusion-CNNTr	AdamW	$3 \times 10^{- 4}$ / $2 \times 10^{- 5}$ / $1 \times 10^{- 5}$	40	M1	6
HybridRes-BERT	AdamW	$3 \times 10^{- 4}$ / $2 \times 10^{- 5}$ / $1 \times 10^{- 5}$	40	M2	6
DualCNN-Align	Adam	$3 \times 10^{- 4}$ / $1 \times 10^{- 5}$	50	M3	6
TriDA	Adam	$3 \times 10^{- 4}$ / $1 \times 10^{- 5}$	50	M4	6
TriDA DP-SGD head	DP-SGD	$5 \times 10^{- 4}$	6	D1	–

Note: Opt. = optimizer; Ep. = epochs; FT = fine-tuning policy; ES = early-stopping patience. All models used batch size 32 and five runs with seeds {13, 21, 42, 87, 100}. T1: all transformer layers fine-tuned. T2: 300-d GloVe trainable and CNN fully trained. V1: ImageNet backbone frozen during head training, then final convolutional block unfrozen. M1: VGG16 and BERT pretrained; fusion layers trained with final CNN/BERT adaptation. M2: ResNet50 and BERT pretrained; fusion layers trained with final ResNet/BERT adaptation. M3: InceptionCNN and TextCNN trained with alignment head. M4: ResNet50 frozen during warmup, then final ResNet block unfrozen. D1: fixed image/text encoders with DP-SGD classifier-head optimization. All non-DP models used best validation accuracy for checkpoint selection; the DP-SGD stage used the final private epoch.

Table 4. Unimodal Model Comparison on UCI with Implemented Baselines.

(a) Text Models
Model	Acc (%)	P (%)	R (%)	F1 (%)
BERT [56]	82.6	79.8	77.5	78.7
TextCNN [57]	84.1	81.9	79.2	80.5
RoBERTa [58]	84.9	83.8	80.6	82.2
TriDA-Text (Ours)	86.4	84.2	81.1	82.6
(b) Image Models
Model	Acc (%)	P (%)	R (%)	F1 (%)
VGG16 [59]	78.4	75.1	72.8	73.9
DenseNet121 [60]	81.3	78.3	76.0	77.1
Inception-v3 [61]	82.9	80.1	77.8	78.9
TriDA-Image (Ours)	83.7	80.8	79.1	79.9

Note: Bold values indicate the proposed TriDA unimodal variants and their corresponding best results within each modality group. The merged table format and group headers are retained to distinguish text-only and image-only baselines.

Table 5. Multimodal Fusion Comparison on UCI with Implemented Baselines.

Model	Acc (%)	P (%)	R (%)	F1 (%)
EarlyFusion-CNNTr (VGG16 + BERT)	90.7	89.5	87.3	88.2
HybridRes-BERT (ResNet50 + BERT)	91.2	89.2	87.8	88.5
DualCNN-Align (IncepCNN + TextCNN)	91.9	90.8	88.6	89.7
TriDA (Ours)	93.7	92.5	90.6	91.6

Table 6. Per-Class Performance of TriDA on the UCI Dataset.

Class	Test Samples	P (%)	R (%)	F1 (%)
Damage nature	55	87.3	87.3	87.3
Damage infrastructure	144	95.1	93.8	94.4
Human damage	21	90.0	85.7	87.8
Fires	37	97.1	91.9	94.4
Flood	36	91.4	88.9	90.1
Non-damage	291	94.3	96.2	95.2
Macro Average	–	92.5	90.6	91.6
Weighted Average	–	93.7	93.7	93.7

Note: Bold values indicate the aggregate macro- and weighted-average performance summaries. Weighted precision, weighted recall, and weighted F1-score are rounded to one decimal place. Their unrounded values are 93.6647%, 93.6644%, and 93.6509%, respectively.
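The aggregate rows of Table 6 can be checked from the per-class rows. The sketch below recomputes macro averages (unweighted mean over classes) and weighted averages (mean weighted by test-sample count) from the rounded per-class values in the table. Because the inputs are already rounded to one decimal place, the results may differ from the reported aggregates by about 0.1 point. The names `rows`, `macro`, and `weighted` are illustrative and do not come from the article.

```python
# Per-class rows copied from Table 6:
# (class, test samples, precision %, recall %, F1 %)
rows = [
    ("Damage nature", 55, 87.3, 87.3, 87.3),
    ("Damage infrastructure", 144, 95.1, 93.8, 94.4),
    ("Human damage", 21, 90.0, 85.7, 87.8),
    ("Fires", 37, 97.1, 91.9, 94.4),
    ("Flood", 36, 91.4, 88.9, 90.1),
    ("Non-damage", 291, 94.3, 96.2, 95.2),
]


def macro(values):
    """Unweighted mean over classes."""
    return sum(values) / len(values)


def weighted(values, weights):
    """Mean over classes, weighted by the number of test samples."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


support = [r[1] for r in rows]
summary = {}
for name, col in (("P", 2), ("R", 3), ("F1", 4)):
    scores = [r[col] for r in rows]
    summary[name] = (macro(scores), weighted(scores, support))

for name, (m, w) in summary.items():
    print(f"{name}: macro={m:.2f}  weighted={w:.2f}")
```

Weighted recall is the sample-weighted share of correctly classified test items, so it coincides with overall accuracy; this is why the weighted-average row matches the 93.7% accuracy reported for TriDA.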

Table 7. Comparison with Prior Multimodal Disaster Assessment Models on the UCI Dataset.

Model	Acc (%)	P (%)	R (%)	F1/WF (%)
CNN + Text Fusion (Mouzannar et al. [18])	92.62	NR	NR	NR
ResNet50 + BiLSTM + Attn (Hossain et al. [19])	NR	NR	NR	93.21 (WF)
BERT + CNN Multimodal (Zhang et al. [29])	84.73	NR	NR	79.47
TriDA (Ours)	93.7	92.5	90.6	91.6

Note: Bold values indicate the proposed TriDA model and its reported results. NR denotes not reported in the cited paper. WF denotes weighted F1-score. Because the cited studies use non-identical reporting protocols, the comparison is contextual rather than a strict controlled benchmark.

Table 8. Privacy–utility retention of TriDA under different DP noise levels.

Noise ( $σ$ )	$ϵ$	Acc. (%)	F1 (%)	Acc. Drop	F1 Drop	Acc. Ret. (%)	F1 Ret. (%)
0.0	–	$93.70 \pm 0.21$	$91.2 \pm 0.25$	0.00	0.0	100.00	100.00
0.5	22.6	$91.37 \pm 0.29$	$88.7 \pm 0.33$	2.33	2.5	97.51	97.26
1.0	17.1	$88.21 \pm 0.34$	$84.7 \pm 0.38$	5.49	6.5	94.14	92.87
1.5	13.4	$84.64 \pm 0.41$	$80.5 \pm 0.47$	9.06	10.7	90.33	88.27
2.0	8.3	$79.29 \pm 0.52$	$73.8 \pm 0.59$	14.41	17.4	84.62	80.92

Note: Acc. = accuracy; F1 = macro F1-score; Ret. = retained utility. Drops are calculated relative to the no-noise setting ($σ = 0.0$). As $σ$ increases, $ϵ$ decreases, indicating stronger privacy accounting but lower utility. The reported privacy budget applies only to the DP-SGD classifier-head training stage; non-private feature extraction and pretrained encoders are outside the formal DP guarantee.
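The drop and retention columns of Table 8 follow directly from the accuracy and F1 columns. As a minimal sketch, assuming drop is the absolute difference from the no-noise row and retention is the ratio to that row expressed as a percentage, the values can be recomputed as follows. The function name `drop_and_retention` and the variable names are illustrative.

```python
# Mean accuracy and macro F1 from Table 8; the no-noise row is the baseline.
BASELINE_ACC = 93.70
BASELINE_F1 = 91.2

# (noise multiplier sigma, accuracy %, macro F1 %)
dp_rows = [
    (0.5, 91.37, 88.7),
    (1.0, 88.21, 84.7),
    (1.5, 84.64, 80.5),
    (2.0, 79.29, 73.8),
]


def drop_and_retention(value, baseline):
    """Return (absolute drop in points, retained utility in percent)."""
    drop = baseline - value
    retention = 100.0 * value / baseline
    return round(drop, 2), round(retention, 2)


for sigma, acc, f1 in dp_rows:
    acc_drop, acc_ret = drop_and_retention(acc, BASELINE_ACC)
    f1_drop, f1_ret = drop_and_retention(f1, BASELINE_F1)
    print(f"sigma={sigma}: acc drop={acc_drop}, acc ret={acc_ret}%, "
          f"F1 drop={f1_drop}, F1 ret={f1_ret}%")
```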

Table 9. Contextual comparison with privacy-preserving disaster learning studies.

Study	Dataset/Modality	Privacy Mechanism	Formal DP Evidence	Reported Utility Evidence
Zhang et al. [35]	MDI disaster images; 5879 samples; image-only	FedTL with Paillier homomorphic encryption and AES-secured communication.	No $ϵ$ -DP budget; CPA-security analysis for encrypted updates.	Acc. 83.68%, P 83.44%, R 83.68%, F1 83.56% †.
El-Niss et al. [36]	MEDIC/UCI image–tweet pairs; 5831 samples; image + text	Federated ResNet + BERT late-fusion learning.	No $ϵ$ -DP budget; privacy mainly through decentralized FL.	Acc. 85.1%, P 85.6%, R 85.1%, F1 85.2%.
Fan et al. [62]	CrisisMMD disaster detection; image + text	FL with adaptive DP, vertical clipping, and dynamic privacy-budget allocation.	Evaluates $ϵ = 5, 10, 20$ ; full $(ϵ, δ)$ accounting not tabulated.	Task 1: Acc. 92.3%, R 91.8%, F1 91.4%; Task 2: Acc. 91.4%, R 90.9%, F1 90.6%.
TriDA (Ours)	UCI/MEDIC image–text pairs; 5831 samples; image + text	Classifier-head DP-SGD with RDP accounting over paired records.	Explicit $(ϵ, δ)$ -DP accounting for the classifier-head stage.	$ϵ = 17.1$ : Acc. 88.21%, F1 84.7%; $ϵ = 8.3$ : Acc. 79.29%, F1 73.8%.

Note: Bold text highlights the proposed TriDA method and its distinguishing privacy-related contributions. The comparison is contextual rather than a strict controlled benchmark because the studies use different datasets, task definitions, privacy assumptions, and reporting protocols. FL = federated learning; RDP = Rényi Differential Privacy. † F1 for Zhang et al. is computed from the reported precision and recall because F1 is not directly reported in their table.
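The † value in Table 9 can be reproduced with the harmonic mean of precision and recall. The sketch below applies that formula to the precision and recall listed for Zhang et al.; the function name `f1_from_pr` is illustrative. Note that for macro- or weighted-averaged metrics, the harmonic mean of the averaged precision and recall is generally only an approximation of the averaged F1, so the derived figure is best read as indicative.

```python
def f1_from_pr(precision, recall):
    """Harmonic mean of precision and recall (same units as the inputs)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall (in percent) listed for Zhang et al. in Table 9.
f1_zhang = f1_from_pr(83.44, 83.68)
print(f"{f1_zhang:.2f}")
```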

Table 10. Predictive utility under structured neuron pruning.

Target Neuron Drop (%)	Acc (%)	P (%)	R (%)	F1 (%)
0	$88.2 \pm 0.34$	$86.0 \pm 0.36$	$83.5 \pm 0.41$	$84.7 \pm 0.38$
5	$87.0 \pm 0.36$	$84.6 \pm 0.39$	$82.0 \pm 0.44$	$83.3 \pm 0.41$
10	$85.6 \pm 0.39$	$83.0 \pm 0.42$	$79.8 \pm 0.48$	$81.4 \pm 0.44$
15	$82.2 \pm 0.46$	$79.6 \pm 0.48$	$76.0 \pm 0.56$	$77.8 \pm 0.51$
20	$78.6 \pm 0.53$	$75.8 \pm 0.55$	$71.9 \pm 0.64$	$73.8 \pm 0.59$

Table 11. Classifier-head pruning statistics and estimated full-model impact.

Drop (%)	Hidden Units	Head Params	Head Red. (%)	Head MACs	MAC Speed-Up	Est. Total Params	Head Share (%)	Full Red. (%)
0	128	295,814	0.00	295,680	$1.00 \times$	24,327,430	1.216	0.000
5	122	281,948	4.69	281,820	$1.05 \times$	24,313,564	1.160	0.057
10	116	268,082	9.37	267,960	$1.10 \times$	24,299,698	1.103	0.114
15	109	251,905	14.84	251,790	$1.17 \times$	24,283,521	1.037	0.181
20	103	238,039	19.53	237,930	$1.24 \times$	24,269,655	0.981	0.238

Note: Red. = reduction; MACs = multiply–accumulate operations. The pruning operation is applied only to the classifier head, not to the full image or text backbone. Therefore, although a 20% drop reduces the classifier-head parameters by 19.53% and improves head-level MAC speed-up to $1.24 \times$, the estimated full-model parameter reduction is only 0.238%. This distinction clarifies that the reported compression benefit is primarily a classifier-head efficiency gain rather than full-model compression.
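The head-level counts in Table 11 can be reproduced with a short calculation. The sketch below rests on several assumptions that are not stated in the table: the head is a single hidden dense layer followed by a six-way output layer, both with biases; the fused input width is 2304, a value inferred here because it reproduces the 128-unit row exactly; and the number of removed units is rounded down. All names (`INPUT_DIM`, `head_counts`, `head_stats`) are illustrative. The full-model reduction is computed relative to the unpruned total, so it can differ from the last table column by about 0.001 percentage point.

```python
INPUT_DIM = 2304                # assumption: inferred from Table 11, not stated in it
NUM_CLASSES = 6                 # six categories listed in Table 1
BASE_HIDDEN = 128               # hidden units in the unpruned head (0% row)
BASE_TOTAL_PARAMS = 24_327_430  # estimated full-model parameters (0% row)


def head_counts(hidden):
    """Parameters and MACs of a dense head: INPUT_DIM -> hidden -> NUM_CLASSES."""
    weights = INPUT_DIM * hidden + hidden * NUM_CLASSES
    biases = hidden + NUM_CLASSES
    return weights + biases, weights  # one multiply-accumulate per weight


def head_stats(drop_pct):
    """Head statistics after removing drop_pct percent of hidden units."""
    removed = BASE_HIDDEN * drop_pct // 100  # removed units, rounded down
    hidden = BASE_HIDDEN - removed
    base_params, base_macs = head_counts(BASE_HIDDEN)
    params, macs = head_counts(hidden)
    saved = base_params - params
    est_total = BASE_TOTAL_PARAMS - saved
    return {
        "hidden": hidden,
        "params": params,
        "head_reduction_pct": 100.0 * saved / base_params,
        "macs": macs,
        "mac_speedup": base_macs / macs,
        "est_total_params": est_total,
        "head_share_pct": 100.0 * params / est_total,
        "full_reduction_pct": 100.0 * saved / BASE_TOTAL_PARAMS,
    }


for drop in (0, 5, 10, 15, 20):
    s = head_stats(drop)
    print(drop, s["hidden"], s["params"], s["macs"],
          f"{s['mac_speedup']:.2f}x", f"{s['full_reduction_pct']:.3f}%")
```

Under these assumptions, each hidden unit accounts for 2304 input weights, one bias, and six output weights, so removing 25 units removes 25 × 2311 = 57,775 parameters. Because the head is only about 1.2% of the estimated total, even a 20% head reduction changes the full-model parameter count by well under 1%.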

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
