1. Introduction
With the increasing frequency and intensity of natural disasters due to climate change [1], rapid and accurate analysis of satellite imagery for damage assessment has become critically important. In large-scale hurricane or flood events, communication infrastructure is often damaged or unavailable, and network connectivity to ground stations is frequently intermittent. These disaster-zone conditions pose significant challenges for cloud-based AI, particularly in terms of latency, bandwidth, and availability. As a result, intelligent satellite IoT [2], where satellites, high-altitude platforms, or unmanned aerial vehicles are paired with on-board or nearby edge devices that perform local inference, has emerged as a practical alternative for real-time disaster response.
In this context, Vision–Language Models (VLMs) are particularly attractive. A single VLM can both classify damage severity and generate human-readable descriptions of the scene, providing interpretable evidence (e.g., “standing water covering roads”, “intact roofs and dry ground”) that is valuable for analysts and first responders. However, most state-of-the-art VLMs are designed for server-class GPUs with abundant memory, rendering them too resource-intensive for direct deployment on small edge devices in satellite IoT scenarios. For example, popular open-source models such as LLaVA-1.5-7B [3] have substantial memory footprints, particularly in FP16, and therefore require server-class GPUs. In addition, top-tier commercial models like GPT-4V [4] have undisclosed architectures and model scales and are deployed via cloud-based inference, which precludes practical on-device deployment.
The primary goal of our research is to develop a high-performance VLM that can efficiently run on commercial off-the-shelf edge devices, such as the NVIDIA Jetson Orin Nano [5], which has been widely adopted for on-device deep learning applications [6]. The Jetson Orin Nano provides only 8 GB of unified LPDDR5 memory and modest compute throughput, even though it is specifically designed as an edge-inference module. This fundamental resource mismatch makes it infeasible to deploy such large VLMs directly on edge devices, motivating the use of more compact yet expressive architectures together with intelligent fine-tuning strategies. Prior work related to edge deployment includes lightweight VLM backbones, compression/acceleration, and parameter-efficient fine-tuning. Meanwhile, complementary directions address reliability and deployment constraints from different angles, such as inference-time hallucination mitigation via retrieval-and-compare decoding [7] and multimodal semantic communication in resource- and bandwidth-constrained environments [8]. These lines of work are complementary to our focus on embedding training-time self-reflection into an on-device VLM, improving reliability without adding inference-time overhead. In this work, we focus on how to fine-tune a mid-sized VLM so that it becomes both reliable and edge-deployable, rather than further compressing the architecture itself. Among existing models, we choose BLIP-Large [9] as a representative encoder–decoder VLM that is close to the upper limit of what can be deployed on commercial off-the-shelf edge devices, such as the NVIDIA Jetson series, in FP16.
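As a rough, back-of-envelope illustration of this resource mismatch, the estimate below counts model weights only and assumes a nominal 7 × 10^9 parameters stored at 2 bytes each in FP16. These figures are assumptions for orientation, not measurements. Under them, the weights of a 7B-parameter model such as LLaVA-1.5-7B alone exceed the 8 GB of unified memory on the Jetson Orin Nano:

```latex
7 \times 10^{9}\ \text{parameters} \times 2\ \tfrac{\text{bytes}}{\text{parameter}}
  = 1.4 \times 10^{10}\ \text{bytes} \approx 14\ \text{GB} \;>\; 8\ \text{GB}
```

Activations and the memory shared with the operating system would add to this figure.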
For complex and specialized tasks such as hurricane damage assessment from satellite imagery, characterized by scarce labeled data and subtle visual differences between damage and no-damage classes, standard supervised fine-tuning (SFT) alone is often data-inefficient and insufficient to deliver the robustness and reliability required in real-world settings. SFT optimizes a token-level or label-level cross-entropy objective, which primarily rewards alignment with the ground-truth target. As training progresses, samples that the model already fits well quickly contribute diminishing gradients [10], even though their predictions can remain fragile under minor distribution shifts or visually subtle variations. Moreover, SFT does not explicitly encourage the model to assess the reliability of its own decisions or to focus learning on borderline cases that matter most in low-label regimes. These characteristics motivate training strategies that complement scarce labels with additional, self-generated signals such as consistency regularization and agreement-based learning.
To address these challenges, we adopt a self-reflective fine-tuning strategy that augments SFT with uncertainty-aware, self-generated learning signals, without changing the model size or adding inference-time overhead. In contrast to recent self-correction approaches that rely on iterative prompting at inference time, which incurs additional latency, our method embeds the reflective process directly into the model parameters during training. This approach allows the model to benefit from self-evaluation without incurring any runtime computational overhead on resource-constrained edge devices. To this end, we propose EdgeV-SE (Edge-deployable Vision–Language Models using Self Evaluation and Self Enhancement), a self-reflective fine-tuning framework designed to enable models to identify their own uncertainty and reconcile internal inconsistencies between complementary pathways. Conceptually, EdgeV-SE operationalizes self-reflection [11, 12] within an encoder–decoder VLM by turning the discrepancy between a generative linguistic pathway and a discriminative visual pathway into explicit learning signals, leveraging the principles of deep mutual learning [13] to reconcile these views.
Empirically, our results show that the proposed framework substantially improves both classification accuracy and caption quality without introducing any inference-time overhead on commercial edge devices such as the Jetson Orin Nano [5].
The main contributions of our work are as follows:
- We propose a self-reflective fine-tuning framework, EdgeV-SE, that enables the model to recognize its own uncertainty and learn by resolving internal inconsistencies.
- We introduce an efficient mechanism that enhances prediction reliability with minimal overhead by designing asymmetric dual pathways, a generative linguistic path and a discriminative visual path, within the VLM and performing internal cross-validation through mutual learning.
- We demonstrate the practical effectiveness of our approach by empirically validating the proposed model’s superior classification accuracy and inference efficiency on an actual edge device.
3. Self-Reflective Fine-Tuning (EdgeV-SE)
EdgeV-SE augments standard supervised fine-tuning (SFT) with a training-time self-reflective mechanism derived from the model’s internal disagreement. The training update for each mini-batch proceeds in four phases: (1) discrepancy induction, (2) uncertainty diagnosis, (3) discrepancy resolution, and (4) supervised consolidation (Figure 1; Algorithm 1).
| Algorithm 1: EdgeV-SE Training Procedure (per mini-batch) |
Output: updated parameters θ, H. Steps 1–19 iterate over the B samples of the mini-batch and include 2. Uncertainty Diagnosis (Self-Diagnosis) on the original view (from step 8) and 3. Discrepancy Resolution via Augmentation Consistency & Consensus (from step 10); step 20 performs 4. Parameter Update.
Phase 1—Discrepancy induction creates complementary predictions from asymmetric linguistic vs. visual pathways to expose internal disagreement.
Phase 2—Uncertainty diagnosis estimates sample uncertainty via a margin-based self-assessment and computes an uncertainty weight.
Phase 3—Discrepancy resolution focuses learning on uncertain samples by enforcing augmentation consistency and cross-pathway agreement.
Phase 4—Supervised consolidation aggregates supervised losses and reflection losses to update parameters.
The first three phases generate and refine self-produced training signals, while the final phase performs the standard supervised update to preserve captioning ability and align the model with the downstream damage classification objective. The Phase 4 consolidation is not merely a collection of independent regularization terms; instead, it treats internal linguistic–visual disagreement as a learnable uncertainty signal that directly guides optimization.
Importantly, the visual pathway is employed only during training and is not activated during inference. Inference relies solely on the original VLM pathway, incurring no additional runtime overhead on edge devices.
3.1. Discrepancy Induction: Asymmetric Dual Pathways
Standard Vision–Language Models often conflate semantic plausibility with visual evidence, which can lead to overconfident predictions in visually ambiguous scenarios [44]. To explicitly disentangle these factors, we decompose the model into two asymmetric prediction pathways from distinct viewpoints, a generative viewpoint and a discriminative viewpoint, representing complementary but potentially conflicting perspectives.
The linguistic pathway, corresponding to a generative view, produces class evidence through the text decoder conditioned on the image, capturing high-level semantics. In parallel, a lightweight visual pathway, corresponding to a discriminative view, predicts the same classes directly from vision encoder features, emphasizing low-level visual cues. Since these pathways emphasize complementary cues, their inevitable disagreement on ambiguous samples serves as an intrinsic signal for self-reflective learning.
- (a) Linguistic Pathway (Generative Viewpoint)
The linguistic pathway functions as the Generative Branch (acting as a “Theorist”), interpreting high-level semantics such as debris, flooding, or roof collapse. Given an input image, it evaluates the likelihood of each class by conditioning on prompt-based class descriptions and subsequently applies Softmax normalization to obtain image-conditioned class probabilities.
Given an original image, we score each class verbalizer/prompt using the decoder likelihood, obtaining one score for each class to identify.
We convert these scores to class probabilities using a temperature-scaled Softmax normalization, where T is the temperature, the per-class score is the decoder log-likelihood for that class, and the output is the image-conditioned class probability estimated from the linguistic pathway.
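The scoring and normalization steps can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the function names and the per-token log-probabilities are hypothetical, the length normalization follows the description in Section 4.2, and T = 2.0 is the value reported there.

```python
import math

def length_normalized_score(token_logprobs):
    """Mean per-token log-probability of one class verbalizer."""
    return sum(token_logprobs) / len(token_logprobs)

def linguistic_probs(scores, temperature=2.0):
    """Temperature-scaled softmax over per-class log-likelihood scores."""
    scaled = [s / temperature for s in scores]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(v - peak) for v in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical decoder log-probabilities for the tokens of each verbalizer.
damage_tokens = [-0.2, -0.4, -0.3]
no_damage_tokens = [-1.1, -0.9, -1.3, -1.1]

scores = [
    length_normalized_score(damage_tokens),
    length_normalized_score(no_damage_tokens),
]
probs = linguistic_probs(scores, temperature=2.0)
```

A temperature above 1 flattens the distribution, so the linguistic pathway reports less extreme class probabilities for the same log-likelihood gap.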
- (b) Visual Pathway (Discriminative Viewpoint)
The visual pathway functions as the Discriminative Branch (acting as an “Empiricist”), predicting class probabilities directly from the visual encoder features and focusing on visual details.
Starting from the vision encoder representation of the original image, a lightweight head produces class logits and probabilities.
To keep EdgeV-SE lightweight, we implement the visual pathway as a minimal auxiliary head and use it only during fine-tuning to provide agreement/consistency signals. During inference, EdgeV-SE follows the same forward path as the base VLM and does not invoke this auxiliary head; therefore, runtime latency and memory footprint remain identical to the base model. In this work we adopt a shallow MLP design as a deliberate efficiency choice.
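A minimal sketch of such an auxiliary head is given below, assuming the design described in Section 4.2 (average pooling over vision tokens followed by a two-layer MLP with ReLU). The layer sizes, the random initialization, and all function names are illustrative assumptions; a real implementation would use a deep learning framework rather than plain lists.

```python
import math
import random

def mean_pool(tokens):
    """Average a list of equal-length token feature vectors into one vector."""
    count = len(tokens)
    return [sum(column) / count for column in zip(*tokens)]

def linear(x, weight, bias):
    """Dense layer: one output per weight row."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]

def relu(x):
    return [max(0.0, v) for v in x]

def softmax(logits):
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def init_params(d_in, d_hidden, n_classes, seed=0):
    """Hypothetical small random initialization for the two layers."""
    rng = random.Random(seed)
    def matrix(rows, cols):
        return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]
    return {
        "w1": matrix(d_hidden, d_in), "b1": [0.0] * d_hidden,
        "w2": matrix(n_classes, d_hidden), "b2": [0.0] * n_classes,
    }

def visual_head(tokens, params):
    """Pool vision tokens, apply the two-layer MLP, return (logits, probabilities)."""
    pooled = mean_pool(tokens)
    hidden = relu(linear(pooled, params["w1"], params["b1"]))
    logits = linear(hidden, params["w2"], params["b2"])
    return logits, softmax(logits)

# Two toy vision tokens with four features each.
tokens = [[0.5, -1.0, 0.25, 2.0], [1.5, 0.0, -0.25, 1.0]]
params = init_params(d_in=4, d_hidden=8, n_classes=2)
logits, probs = visual_head(tokens, params)
```

Because the head only consumes features the vision encoder has already produced, dropping it at inference leaves the base model's forward path unchanged.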
3.2. Recognizing Uncertainty: Identifying ‘Uncertain’ Samples via Self-Diagnosis
This phase quantifies the uncertainty of the class prediction for a given image by measuring the margin between the highest and second-highest class probabilities from the linguistic pathway (Equation (4)), where the top-ranked class is the one with the maximum predicted probability for that image.
In our main experiments, the downstream task is binary (damage vs. no-damage), and the linguistic pathway uses two class verbalizers, one per class. In this case, the top-1 vs. runner-up margin in Equation (4) reduces to the two-way log-likelihood gap between them. For implementation convenience (Algorithm 1; Figure 2), we compute the signed margin and apply the uncertainty gate to its magnitude in Equation (5), while the sign of the margin is later used as decision evidence in the LL-margin classifier (Section 4.2). Furthermore, uncertainty diagnosis is performed on the original view, whereas the margins of the two augmented views are computed only for the augmentation-consistency objective in Equation (6).
A small absolute margin implies indecision or conflict between the two outcomes, corresponding to an ambiguous or high-uncertainty sample. This margin explicitly identifies samples where semantic priors and visual evidence diverge, which are precisely the cases where standard supervised fine-tuning becomes unreliable under limited supervision. Uncertainty diagnosis is based on the margin from the linguistic pathway because it directly reflects inference-time decision confidence, whereas the visual pathway is employed solely as an auxiliary during training.
The model assigns a higher learning weight to such uncertain samples based on a threshold τ (Equation (5)), as schematically illustrated in Figure 2. The amplification factor is a tunable hyperparameter that amplifies the learning signal for “uncertain” samples identified by the model itself, while confident samples receive a standard weight of 1. The concrete values of τ and the amplification factor are selected on a validation set and reported in Section 4.2. This discrete weighting scheme, aligned with Algorithm 1, encourages the model to focus on its own weaknesses.
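The gate can be sketched as below. The function and argument names are hypothetical. The defaults use τ = 0.7 from Section 4.2 and treat the value 1.5 reported there as the amplification factor, which is an assumption. Whether the comparison is strict is also not stated in the text; a strict inequality is assumed here.

```python
def signed_margin(ll_damage, ll_no_damage):
    """Two-way log-likelihood gap; the sign carries the decision evidence."""
    return ll_damage - ll_no_damage

def uncertainty_weight(margin, tau=0.7, boost=1.5):
    """Discrete gate: amplify samples whose margin magnitude falls below tau."""
    return boost if abs(margin) < tau else 1.0

# A borderline sample (small gap) and a confident one (large gap).
borderline = uncertainty_weight(signed_margin(-0.9, -1.1))
confident = uncertainty_weight(signed_margin(-2.5, -0.3))
```

Only the magnitude of the margin enters the gate, so a confidently wrong sample is treated like a confidently right one at this stage; the supervised terms are what penalize the former.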
3.3. Discrepancy Resolution: Internalizing Knowledge Through Consistency and Mutual Consensus
Once samples are identified as uncertain, EdgeV-SE resolves the internal discrepancies by applying weighted Consistency Regularization and Mutual Learning. These objectives are weighted by the uncertainty weight calculated from the original view.
- (a) Consistency Regularization
Given an input image, EdgeV-SE constructs two stochastically augmented views using mild spatial/photometric transformations. Then, EdgeV-SE enforces consistency between their prediction margins, weighted by the uncertainty of the original view (Equation (6)).
This regularization encourages the decision boundary to remain stable under perturbations such as changes in viewpoint or illumination. While consistency regularization promotes invariance, it alone does not determine which prediction pathway should be trusted when their outputs disagree.
To address this limitation, EdgeV-SE couples consistency with margin-based disagreement and mutual learning, enabling consistency to be selectively enforced on uncertain samples. Consistency is defined on the logit margin rather than on raw probability outputs, since the margin serves as a more direct proxy for decision boundary stability and prediction confidence, whereas probability vectors are more susceptible to temperature scaling and calibration effects.
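As an illustration of a margin-based consistency term, the sketch below penalizes the squared difference between the margins of the two augmented views and scales it by the uncertainty weight of the original view. The squared penalty is an assumption, since the exact form of Equation (6) is not recoverable from the text, and the function names are hypothetical.

```python
def consistency_loss(margin_view1, margin_view2, weight):
    """Uncertainty-weighted squared gap between the two views' margins (assumed form)."""
    return weight * (margin_view1 - margin_view2) ** 2

def batch_consistency(margin_pairs, weights):
    """Mean of the per-sample consistency terms over a mini-batch."""
    terms = [consistency_loss(a, b, w) for (a, b), w in zip(margin_pairs, weights)]
    return sum(terms) / len(terms)

# One uncertain sample (weight 1.5) and one confident sample (weight 1.0).
pairs = [(0.4, 0.1), (2.2, 2.0)]
weights = [1.5, 1.0]
loss = batch_consistency(pairs, weights)
```

Operating on margins means that a temperature change, which rescales probabilities, does not alter what counts as a stable decision.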
- (b) Mutual Learning
Here, the Linguistic Pathway (“Theorist”) and the Visual Pathway (“Empiricist”) act as soft targets for each other. This objective regularizes the linguistic pathway to remain visually grounded (reducing semantic drift or hallucination) while distilling high-level semantics into the visual pathway.
We quantify the cross-pathway agreement using the Jensen–Shannon divergence (JSD) between the linguistic and visual class distributions [45]. We adopt JSD because it is symmetric and bounded (it never exceeds log 2), and it remains finite even when the two pathways assign divergent support, unlike the KL divergence, which can diverge; this yields stable gradients for alignment. For readability, qualitative figures report normalized values.
We define the mutual loss as an uncertainty-weighted JSD between the two distributions (Equation (7)), implemented as the mean of two KL terms to the mixture m.
Through this cooperative process, linguistic semantics are distilled into visual recognition, while visual grounding prevents linguistic hallucinations.
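The mutual term can be sketched directly from this description: the JSD is the mean of two KL divergences to the mixture m, scaled by the per-sample uncertainty weight. The function names and example distributions are illustrative.

```python
import math

def kl_divergence(p, q):
    """KL(p || q) in nats; terms with p_i = 0 contribute zero."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def jsd(p, q):
    """Jensen-Shannon divergence: mean of the two KL terms to the mixture m."""
    mixture = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, mixture) + 0.5 * kl_divergence(q, mixture)

def mutual_loss(p_linguistic, p_visual, weight):
    """Uncertainty-weighted JSD between the two pathways' class distributions."""
    return weight * jsd(p_linguistic, p_visual)

agree = jsd([0.7, 0.3], [0.7, 0.3])       # identical distributions
disjoint = jsd([1.0, 0.0], [0.0, 1.0])    # fully disjoint support
loss = mutual_loss([0.55, 0.45], [0.2, 0.8], weight=1.5)
```

The disjoint case shows the boundedness argument: where a direct KL term would be infinite, the JSD reaches its maximum of ln 2 because the mixture always has support wherever either input does.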
3.4. Supervised Consolidation: Aggregation of Supervised and Self-Reflective Losses to Update Parameters
The total objective integrates four components, standard supervised fine-tuning, auxiliary classification, consistency, and mutual learning, applied to the same mini-batch in each iteration (Algorithm 1). The uncertainty-weighted consistency and mutual terms act as auxiliary stabilizers that selectively reinforce reflection on ambiguous samples, thereby improving both convergence and interpretability.
- (a) Supervised fine-tuning loss from generated caption
The supervised fine-tuning loss compares the model’s predicted token distribution (logits), obtained under teacher forcing for the primary view, with the ground-truth caption. This is the standard token-level cross-entropy loss that preserves the model’s image–text generation ability.
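For reference, a teacher-forced token-level cross-entropy is commonly written as below. The notation is assumed for illustration and may differ from the original equation: $x$ is the primary view, $c = (c_1, \dots, c_N)$ are the reference caption tokens, and $p_\theta$ is the decoder's token distribution. Whether the loss is averaged or summed over tokens is not stated in the text; the mean is assumed here.

```latex
\mathcal{L}_{\mathrm{SFT}}
  = -\frac{1}{N} \sum_{t=1}^{N} \log p_\theta\!\left(c_t \mid c_{<t},\, x\right)
```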
- (b) Auxiliary classification loss from linguistic pathway
The target of this loss is the binary class label used in Algorithm 1. This loss directly supervises the two-way log-likelihoods from the linguistic pathway, sharpening its decision margin and tying the generative branch to the downstream damage-classification objective.
- (c) Self-Reflective objectives
The self-reflective terms are given by Equations (10) and (11).
The consistency term enforces a stable linguistic margin across augmentations, ensuring that the decision boundary is robust to benign perturbations. The mutual learning term encourages agreement between the linguistic and visual predictions, keeping the linguistic theorist visually grounded while injecting high-level semantics into the visual empiricist.
In both cases, the uncertainty weight amplifies gradients on uncertain samples with a small margin magnitude, so the model spends more capacity on introspectively correcting its own mistakes.
- (d) Overall Training Objective
We combine all components into the final training objective. In practice, each term is averaged over the mini-batch as in Algorithm 1. The uncertainty weight is computed per sample from the original view and applied consistently across both self-reflective objectives.
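One plausible way to write this combination is shown below. The notation is assumed: the coefficient symbols $\lambda_{\mathrm{cls}}$, $\lambda_{\mathrm{cons}}$, and $\lambda_{\mathrm{mut}}$ are placeholders, $w_i$ is the per-sample uncertainty weight, and $B$ is the mini-batch size. The unit weight on the SFT term is also an assumption, suggested by the text naming three weighting coefficients for four components.

```latex
\mathcal{L}_{\mathrm{total}}
  = \mathcal{L}_{\mathrm{SFT}}
  + \lambda_{\mathrm{cls}}\,\mathcal{L}_{\mathrm{cls}}
  + \lambda_{\mathrm{cons}}\,\mathcal{L}_{\mathrm{cons}}
  + \lambda_{\mathrm{mut}}\,\mathcal{L}_{\mathrm{mut}},
\qquad
\mathcal{L}_{\mathrm{cons}} = \frac{1}{B}\sum_{i=1}^{B} w_i\,\ell_{\mathrm{cons}}^{(i)},
\quad
\mathcal{L}_{\mathrm{mut}} = \frac{1}{B}\sum_{i=1}^{B} w_i\,\ell_{\mathrm{mut}}^{(i)}
```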
We set the three loss-weighting coefficients, as well as the two uncertainty parameters (the threshold τ and the amplification factor), based on validation performance and stability; their concrete values are summarized in Section 4.2. The coefficients are chosen to ensure that no individual objective dominates the training process, while appropriately amplifying the contribution of uncertain samples.
The complete training procedure is summarized in Algorithm 1. It details each mini-batch update, including uncertainty weighting, consistency enforcement, mutual agreement between the two pathways, and the final supervised caption and classification steps.
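The loss computation for one mini-batch can be sketched end to end as follows. This is a toy illustration under several stated assumptions, not the authors' code: the per-sample inputs (log-likelihoods, visual probabilities, caption loss) are hypothetical numbers standing in for model outputs, the consistency penalty is assumed to be a squared margin gap, the classification loss is assumed to be a cross-entropy over the temperature-scaled two-way softmax, and the mapping of the reported coefficient values (0.3, 0.2, 0.3) to the three auxiliary terms is a guess.

```python
import math

def softmax(values, temperature=1.0):
    scaled = [v / temperature for v in values]
    peak = max(scaled)
    exps = [math.exp(v - peak) for v in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def jsd(p, q):
    """Jensen-Shannon divergence as the mean of two KL terms to the mixture."""
    mixture = [(a + b) / 2 for a, b in zip(p, q)]
    def kl(u, v):
        return sum(a * math.log(a / b) for a, b in zip(u, v) if a > 0)
    return 0.5 * kl(p, mixture) + 0.5 * kl(q, mixture)

def sample_terms(sample, tau=0.7, boost=1.5, temperature=2.0):
    """Phases 1-3 for one sample; label index 0 = damage, 1 = no-damage (assumed)."""
    ll_damage, ll_no_damage = sample["ll_orig"]
    margin = ll_damage - ll_no_damage                  # signed LL margin, original view
    weight = boost if abs(margin) < tau else 1.0       # uncertainty gate
    p_ling = softmax([ll_damage, ll_no_damage], temperature)
    margin_a = sample["ll_aug1"][0] - sample["ll_aug1"][1]
    margin_b = sample["ll_aug2"][0] - sample["ll_aug2"][1]
    return {
        "weight": weight,
        "sft": sample["caption_nll"],                  # token-level caption loss
        "cls": -math.log(p_ling[sample["label"]]),     # auxiliary classification loss
        "cons": weight * (margin_a - margin_b) ** 2,   # assumed squared penalty
        "mut": weight * jsd(p_ling, sample["p_vis"]),
    }

def batch_objective(batch, coef_cls=0.3, coef_cons=0.2, coef_mut=0.3):
    """Phase 4: average each term over the mini-batch and combine."""
    terms = [sample_terms(s) for s in batch]
    def mean(key):
        return sum(t[key] for t in terms) / len(terms)
    return (mean("sft") + coef_cls * mean("cls")
            + coef_cons * mean("cons") + coef_mut * mean("mut"))

batch = [
    {"ll_orig": (-0.9, -1.1), "ll_aug1": (-0.8, -1.2), "ll_aug2": (-1.0, -1.0),
     "p_vis": [0.4, 0.6], "label": 0, "caption_nll": 2.3},
    {"ll_orig": (-0.3, -2.5), "ll_aug1": (-0.4, -2.4), "ll_aug2": (-0.3, -2.6),
     "p_vis": [0.9, 0.1], "label": 0, "caption_nll": 1.7},
]
total = batch_objective(batch)
```

In this toy batch the first sample has a small margin and disagreeing pathways, so it receives the amplified weight on both reflective terms, while the second passes through with weight 1.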
4. Experimental Setup and Implementation
4.1. Dataset and Preprocessing
We conducted a comprehensive set of experiments to validate the proposed EdgeV-SE framework in comparison with both standard and self-improving baselines.
We employed a balanced subset of a hurricane damage assessment dataset consisting of satellite images categorized into two classes: damage and no_damage [46]. Since the original dataset provides only image-level labels (damage vs. no_damage) and lacks human-written captions, we generated domain-specific pseudo-reference captions using LLaVA-1.5-7B [3]. These captions serve as weak supervision for caption fine-tuning and as a standardized reference for text-metric reporting. They should not be interpreted as human-verified ground truth.
We used two classes of prompts (Table 1), one for damage and one for no-damage, to encourage factual, concise descriptions in a consistent style. To estimate the pseudo-caption noise rate, we manually audited ≈200 randomly sampled images; ≈8% were judged noisy. These samples were excluded from the dataset prior to the train/validation/test split, and inter-annotator agreement exceeded 90%, with disagreements resolved by discussion.
To reduce the risk of over-interpreting style-matching as “factual improvement,” we report (i) discriminative classification metrics (Accuracy/F1) and (ii) a reference-free image–text alignment proxy (CLIPScore), in addition to overlap-based caption metrics (CIDEr-D, BERTScore).
The final curated dataset contains 5000 damage/5000 no-damage images for training, 1000/1000 images for validation, and 1000/1000 images for testing, resulting in 14,000 images in total, balanced across the two classes.
For all caption-quality metrics, we remove the leading class header phrase (e.g., “Confirmed Damage Zone:” or “No Damage Zone:”) to avoid coupling the scores to fixed prompt formatting.
LLaVA-1.5 is used solely as an offline annotation tool to provide weak caption supervision and standardized textual references for evaluation. Accordingly, classification metrics (Accuracy and F1) are treated as the primary indicators of task performance, while caption-related metrics (CIDEr-D, BERTScore, and CLIPScore) are reported as complementary measures of descriptive consistency and image-text alignment rather than as human-verified ground truth. To encourage class-consistent and concise descriptions, we explicitly conditioned LLaVA-1.5 on the ground-truth class labels (e.g., prompting with “Confirmed Damage Zone” or “No Damage Zone”), which effectively constrains the semantic scope of the generated captions and prevents class-level mismatches. Furthermore, as described above, a rigorous manual audit and filtering process removed approximately 8% of sampled captions exhibiting clear factual inconsistencies or low semantic relevance. This procedure substantially reduced pseudo-caption noise prior to dataset splitting, yielding a noise-reduced and structurally consistent supervision signal suitable for large-scale training. While residual hallucinations cannot be entirely ruled out, the curated dataset provides a reliable weak-supervision basis for evaluating both discriminative performance and caption consistency.
4.2. Model Configuration and Training Details
We use the Salesforce/blip-image-captioning-large [9] model as the base Vision–Language Model (VLM). For parameter-efficient fine-tuning, all methods except Standard SFT used the same QLoRA [20] setting (r = 16, α = 32, dropout = 0.05) applied to selected text layers and the last two vision transformer blocks. Standard SFT uses a text-only LoRA configuration with a higher-rank adapter (r = 128, α = 256), while keeping the same base model, data, optimizer, and decoding settings for a controlled comparison. The optimization was performed using AdamW8bit with a cosine learning-rate scheduler and a warm-up ratio of 0.1. Learning rates were set separately for text and vision modules (1 × 10⁻⁴ and 3 × 10⁻⁵, respectively), with weight decay = 0.01. The model was trained for 12 epochs with a batch size of 8 and gradient accumulation of 4. Models were trained on an NVIDIA RTX 4090 GPU, while all on-device performance benchmarks (e.g., inference speed) were conducted on an NVIDIA Jetson Orin Nano (8 GB) device [5] to verify edge-level deployability.
The main hyperparameters for the self-reflective fine-tuning were empirically set as follows: uncertainty threshold τ = 0.7, uncertainty amplification factor 1.5, and loss-weighting coefficients 0.3, 0.2, and 0.3. Unless otherwise stated, the temperature in Equation (2) is set to T = 2.0, selected on the validation set as described in Section 5.5. For caption metrics, we use beam search (num_beams = 4, do_sample = False, max_new_tokens = 120). For on-device benchmarks (Section 5.3), we report caption-generation latency using num_beams = 1 and max_new_tokens = 40. For classification, we use the validation-selected LL-margin threshold (maximizing Macro-F1), stored as the optimal threshold parameter.
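Validation-based threshold selection for the LL-margin classifier can be sketched as below. The candidate grid (midpoints between sorted margins plus two extremes), the label convention (1 = damage, predicted when the margin exceeds the threshold), and the toy validation data are all assumptions made for illustration.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores for labels 0 and 1."""
    scores = []
    for cls in (0, 1):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != cls and p == cls)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p != cls)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)

def predict(margins, threshold):
    """Predict damage (1) when the signed LL margin exceeds the threshold."""
    return [1 if m > threshold else 0 for m in margins]

def select_threshold(margins, labels):
    """Pick the candidate cut that maximizes Macro-F1 on validation data."""
    ordered = sorted(set(margins))
    cuts = [ordered[0] - 1.0]
    cuts += [(a + b) / 2 for a, b in zip(ordered, ordered[1:])]
    cuts += [ordered[-1] + 1.0]
    return max(cuts, key=lambda c: macro_f1(labels, predict(margins, c)))

# Toy validation margins and labels (1 = damage, 0 = no-damage).
val_margins = [2.1, 0.9, 0.4, -0.2, -0.1, -1.5, 0.6, -0.8]
val_labels = [1, 1, 1, 1, 0, 0, 0, 0]

best_threshold = select_threshold(val_margins, val_labels)
best_score = macro_f1(val_labels, predict(val_margins, best_threshold))
```

The selected threshold is then frozen and reused on the test set, so no test labels influence the decision rule.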
For the linguistic pathway, we compute length-normalized log-likelihoods for two class verbalizers, one for damage and one for no-damage (set to the header strings in Table 1, i.e., “Confirmed Damage Zone:” and “No Damage Zone:”), using the decoder likelihood. Following prior prompt-based scoring practice, we normalize by the number of tokens in each verbalizer (excluding special tokens) to avoid length bias. The log-likelihood margin between the two verbalizers is used as decision evidence, and the binary decision threshold is selected on the validation set (maximizing Macro-F1).
The discriminative pathway is implemented as a lightweight head on top of the vision encoder output. We intentionally designed this head as a shallow MLP to minimize the training-time memory footprint, ensuring that the self-reflective mechanism remains computationally efficient even during fine-tuning. Concretely, we apply average pooling over the final-layer vision tokens to obtain a single feature vector, followed by a two-layer MLP with ReLU to produce class logits and probabilities. This head is used only during fine-tuning to generate self-reflective signals. During inference, EdgeV-SE follows the same forward path as the base VLM and does not evaluate the auxiliary head; therefore, runtime latency and memory footprint remain identical to the base VLM.
We generate two independent augmented views using mild spatial/photometric transforms: random horizontal flip (p = 0.5), random rotation (±15 degrees), and color jitter (brightness/contrast/saturation = 0.2/0.2/0.2). These augmentations are applied independently to the two views.
Although EdgeV-SE uses two augmented views during fine-tuning, the additional computation is modest because the auxiliary head is lightweight and operates on the vision encoder features already computed for each view. Importantly, EdgeV-SE does not rely on iterative multi-round generation or candidate reranking loops, so the training procedure remains practical under our fine-tuning setting.
4.3. Evaluation Metrics
To comprehensively assess both visual and linguistic performance, we evaluate the model using classification and captioning metrics. For classification, we report Accuracy, class-wise F1, and Macro-F1, along with Expected Calibration Error (ECE) for calibration analysis (Section 5.5). Caption quality is assessed using three complementary metrics: (1) CIDEr-D [47], which evaluates caption fluency and consensus; (2) BERTScore [48] (F1 × 100), which measures semantic alignment with the ground-truth text; and (3) CLIPScore [49], computed as cosine similarity ×100 using CLIP ViT-B/32, which evaluates the factual consistency (groundedness) of generated captions with respect to the source image. To ensure statistical reliability, we report Accuracy/Macro-F1 (and caption metrics) with 95% bootstrap confidence intervals (1000 resamples) on the hurricane test set. For calibration, we report ECE (15 bins) and provide a 95% bootstrap confidence interval for the test-set ECE at the validation-selected temperature.
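The two statistical procedures mentioned here can be sketched as follows. This is an illustrative version under assumptions: ECE uses 15 equal-width confidence bins, and the confidence interval is a percentile bootstrap with 1000 resamples. The text does not specify the binning scheme or the bootstrap variant, and the function names and toy data are hypothetical.

```python
import random

def expected_calibration_error(confidences, correct, n_bins=15):
    """Bin-size-weighted mean of |accuracy - mean confidence| over equal-width bins."""
    total = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        index = min(int(conf * n_bins), n_bins - 1)  # confidence 1.0 goes to the last bin
        bins[index].append((conf, ok))
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(o for _, o in members) / len(members)
        ece += len(members) / total * abs(accuracy - mean_conf)
    return ece

def mean(values):
    return sum(values) / len(values)

def bootstrap_ci(values, statistic, n_resamples=1000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for a statistic of per-sample values."""
    rng = random.Random(seed)
    size = len(values)
    stats = sorted(
        statistic([values[rng.randrange(size)] for _ in range(size)])
        for _ in range(n_resamples)
    )
    lower = stats[int(round(alpha / 2 * n_resamples))]
    upper = stats[int(round((1 - alpha / 2) * n_resamples)) - 1]
    return lower, upper

# Toy data: four predictions for ECE, and 100 per-sample correctness flags.
ece = expected_calibration_error([0.9, 0.9, 0.6, 0.6], [1, 1, 1, 0])
hits = [1] * 90 + [0] * 10
ci_low, ci_high = bootstrap_ci(hits, mean)
```

For Macro-F1 or ECE, the same resampling applies, but the statistic must be recomputed on each resampled set of (prediction, label) pairs rather than averaged per sample.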
5. Experimental Results and Analysis
5.1. Component-Wise Analysis and Ablation Study
To investigate how self-reflective mechanisms progressively enhance the model’s reasoning ability, we conducted a detailed study of the internal variants of EdgeV-SE. Table 2 and Table 3 summarize the performance at each phase of this developmental trajectory.
5.1.1. Self-Reflection Variants
During the initial exploration stage, we sequentially introduced two versions of Self-Reflection to encourage autonomous improvement:
Self-Reflection type 1 Generate-then-Correct: This variant adopted an iterative mechanism where the model first produced preliminary captions and then re-generated refined outputs conditioned on its own previous text. While it aimed to simulate self-revision, this approach yielded only modest improvement (F1 = 0.918) due to the lack of explicit numerical feedback and stability control.
Self-Reflection type 2 Direct LL + Multitask: To address the limitations of type 1, this version incorporated direct log-likelihood (LL) supervision and a multitask alignment loss. This architectural modification replaces heuristic regeneration with a differentiable JSD-based coupling between the linguistic and visual class distributions. This shift yielded a significant performance jump (Macro-F1 0.964), establishing the foundational backbone for the final EdgeV-SE framework.
5.1.2. Objective-Level Ablation Study
Building on the foundation of Self-Reflection type 2, we conducted an ablation study to isolate the contributions of the specific reflective components, Consistency Regularization and Mutual Learning. We evaluate two ablated variants (Model A: SFT + Consistency; Model B: SFT + Mutual Learning) and the full EdgeV-SE model to isolate the contribution of each component:
Model A SFT + Consistency: This variant enforces prediction invariance across multiple augmented visual views (e.g., random flip, rotation and photometric jitter). By regularizing the decision boundary against distributional noise, Model A achieved an F1 score of 0.966, demonstrating that consistency regularization effectively mitigates overfitting and improves spatial robustness. This result indicates that consistency regularization alone stabilizes training, but it does not fully resolve cross-modal disagreement without the mutual-learning component.
Model B SFT + Mutual Learning: This variant introduces a cross-pathway mutual agreement loss between the vision encoder and text decoder representations. This loss encourages both branches to produce congruent feature semantics, enabling uncertainty-aware refinement during back-propagation. Model B achieved an F1 score of 0.981, outperforming consistency-only training and underscoring the role of cross-pathway agreement.
EdgeV-SE: The EdgeV-SE framework combines both Consistency and Mutual Learning under a unified reflective optimization. This configuration achieves the best overall results (0.985 F1 for Damage, 0.986 F1 for No-Damage), demonstrating a clear synergistic effect that enhances not only classification precision but also caption coherence and factual grounding.
A similar trend was observed in caption generation performance, highlighting that the proposed self-reflective mechanisms improve not only classification accuracy but also descriptive expressiveness.
As shown in Table 3, a significant performance leap occurs with Self-Reflection type 2 (CIDEr-D 37.49, BERTScore 90.77), which incorporated direct log-likelihood supervision and a multi-task alignment loss (JSD-based) that couples the linguistic and visual class distributions. This model establishes a strong foundation for both high-accuracy classification (F1 = 0.964) and high-quality caption generation.
The subsequent ablation models (Model A, B) and the final EdgeV-SE model primarily enhance classification F1 (Table 2) while sustaining this high level of caption quality. For instance, while Model B achieves a major classification jump to 0.981 F1, it maintains a high-quality caption profile (CIDEr-D 38.00, BERTScore 90.60), demonstrating that the mutual learning component enhances discriminative reasoning without causing linguistic degradation.
The EdgeV-SE model achieves the highest overlap-based caption score (CIDEr-D 38.37) and semantic alignment (BERTScore 90.82). While Self-Reflection type 2 showed the highest image-text consistency (CLIPScore 28.79), our model’s optimization toward the classification F1 score (0.985) appears to create a slight trade-off, marginally reducing this groundedness metric in favor of superior discriminative accuracy.
In summary, EdgeV-SE not only achieves the highest classification accuracy but also demonstrates superior language generation ability, producing domain-consistent and context-aware captions even in edge-constrained environments.
Building upon these internal developments, we next position EdgeV-SE against strong fine-tuning and self-improvement baselines in Section 5.2.
5.1.3. Sensitivity Analysis
We further analyzed the sensitivity of the hyperparameter τ (uncertainty threshold). As presented in Table 4, the model achieves optimal performance at τ = 0.7, and the results indicate that EdgeV-SE is not overly sensitive to precise threshold tuning.
5.2. Comparison with Advanced Fine-Tuning Baselines
To comprehensively evaluate the effectiveness of the proposed EdgeV-SE, we compared it against four key baselines representing distinct learning paradigms: (1) Standard SFT, (2) Controlled SFT (a LoRA-optimized variant), (3) Self-Rewarding VLM [50], and (4) Iterative Self-Correction [11, 12]. Our goal is to compare EdgeV-SE against methods that similarly aim to improve the model through self-generated signals within an efficient fine-tuning process. We do not include direct preference optimization [51] methods in the main comparison because they require additional preference data (e.g., preference pairs) and typically introduce a separate alignment stage beyond the single-stage PEFT fine-tuning budget considered in this work. Our focus is on training-time, self-generated learning signals under a comparable supervision and compute budget for models intended for edge inference; preference-optimization pipelines are complementary and left for future work.
We first established strong supervised PEFT baselines under the same data split, optimizer, and decoding settings. Standard SFT is a supervised baseline that trains text-only LoRA adapters with a higher-rank setting (r = 128, α = 256) on the BLIP text decoder, while keeping the base model frozen. Controlled SFT uses the same supervised objective and identical data/decoding setup, but employs a lower-rank QLoRA setting (r = 16, α = 32) and expands the trainable scope to include selected text layers plus the last two vision transformer blocks, yielding a more balanced parameter update.
Importantly, the Trainable Params (%) in
Table 5 reflects both (i) the adapter scope and (ii) the LoRA rank; therefore, despite covering both modalities, Controlled SFT can have a smaller trainable-parameter fraction due to its substantially lower rank. As shown in
Table 5, Controlled SFT improves Macro-F1 from 0.911 to 0.930 compared to Standard SFT. However, its gains plateau under purely supervised PEFT, motivating mechanisms that explicitly handle uncertainty and self-correction.
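To make the dependence on rank and scope concrete, the following sketch counts the parameters added by LoRA adapters. It assumes the standard LoRA factorization, in which each adapted weight of shape d_out × d_in receives two low-rank factors with r·(d_in + d_out) parameters in total. The hidden size, the numbers of adapted matrices, and the base parameter count are hypothetical placeholders and are not the values behind Table 5.

```python
def lora_params(d_in: int, d_out: int, rank: int) -> int:
    """Parameters added by one LoRA adapter: A is (rank x d_in), B is (d_out x rank)."""
    return rank * (d_in + d_out)


def trainable_percent(n_matrices: int, d_model: int, rank: int, base_params: int) -> float:
    """Adapter parameters as a percentage of all parameters (frozen base + adapters)."""
    added = n_matrices * lora_params(d_model, d_model, rank)
    return 100.0 * added / (base_params + added)


# Hypothetical configuration (illustrative only, not taken from the paper).
D_MODEL = 768              # assumed hidden size of every adapted projection
BASE_PARAMS = 450_000_000  # assumed size of the frozen backbone

# Narrow scope with a high rank (in the spirit of Standard SFT, r = 128).
narrow_high_rank = trainable_percent(n_matrices=48, d_model=D_MODEL, rank=128,
                                     base_params=BASE_PARAMS)
# Wider scope with a low rank (in the spirit of Controlled SFT, r = 16).
wide_low_rank = trainable_percent(n_matrices=72, d_model=D_MODEL, rank=16,
                                  base_params=BASE_PARAMS)
```

Under these assumed numbers, the wider but lower-rank configuration still trains fewer parameters, because the adapter size grows linearly with the rank. The scaling factor α changes the magnitude of the update but not the parameter count.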
Next, we compared EdgeV-SE against two advanced self-improvement paradigms. Since original works like Self-Rewarding [
50] and Reflexion [
11] target decoder-only LLMs via prompting, we implemented custom adaptations suitable for the BLIP encoder-decoder architecture:
Self-Rewarding VLM: This method generates multiple candidate captions via stochastic sampling and selects the one with the highest internal generative confidence (log-likelihood) as a pseudo-label for training.
Iterative Self-Correction: This method performs a second “correction” forward pass to refine the initial output, explicitly training the model to produce a higher-confidence version of its first prediction.
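The selection step of the Self-Rewarding adaptation can be sketched as follows. This is a minimal illustration that assumes each sampled caption already comes with a sequence log-likelihood from the model; the candidate captions and scores are made up, and the function name is hypothetical.

```python
def select_pseudo_label(candidates):
    """Pick the caption with the highest log-likelihood.

    candidates: list of (caption, log_likelihood) pairs from stochastic sampling.
    """
    best_caption, _ = max(candidates, key=lambda pair: pair[1])
    return best_caption


# Made-up candidates for one image (log-likelihoods are illustrative).
sampled = [
    ("standing water covering roads", -12.4),
    ("intact roofs and dry ground", -19.8),
    ("flooded street near houses", -14.1),
]
pseudo_label = select_pseudo_label(sampled)
```

Because the selection relies only on the model's own confidence, a fluent but poorly grounded caption can be chosen whenever it happens to receive a high likelihood.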
Table 6 and
Table 7 present the comparative results. While both self-improvement baselines (Macro-F1 0.979 and 0.972, respectively) significantly outperform the SFT baselines, EdgeV-SE achieves the best Macro-F1 (0.985). This demonstrates that our dual-pathway reflective mechanism is more effective than single-pathway confidence maximization.
A critical distinction emerges in caption quality (
Table 7). While the advanced baselines successfully improved classification, their gains in caption quality were limited. Notably, their CLIPScores (26.28 and 24.30) are the lowest among all methods, suggesting that optimizing a single internal confidence signal may not translate to better image–text alignment and can be associated with semantic drift. In sharp contrast, EdgeV-SE achieves the highest CIDEr-D (38.37) and BERTScore (90.82) while maintaining a competitive CLIPScore (28.21) comparable to the strong SFT baseline. Whereas the other methods optimize a single internal reward signal, EdgeV-SE’s dual-pathway consistency enforces agreement between the linguistic and visual pathways, which helps limit semantic drift.
Finally, regarding computational efficiency, EdgeV-SE achieves these performance gains without iterative generate-and-select loops or multi-round correction at training time. Importantly, EdgeV-SE does not introduce a duplicated backbone: both pathways share the same VLM components, and the additional visual pathway is only a lightweight head. The primary source of additional training computation arises from evaluating one extra augmented view for the consistency objective, which introduces only a modest overhead at training time while preserving strictly single-pass inference. Under the same PEFT training schedule, EdgeV-SE instead leverages lightweight self-reflective objectives, including margin consistency and cross-pathway agreement, without requiring repeated generation or evaluation cycles.
This design highlights a key advantage of our approach. While many existing methods improve classification performance by aggressively optimizing a single internal reward—often at the expense of caption quality—EdgeV-SE enforces a balanced consensus between linguistic and visual pathways. As a result, EdgeV-SE provides a practical single-pass alternative that delivers strong classification performance while maintaining competitive caption quality, without the substantial training-time overhead associated with iterative generation-based schemes.
5.3. On-Device Performance
We verified that the performance gains of EdgeV-SE are obtained without additional inference-time overhead compared to a parameter-efficient baseline (Controlled SFT) on the target edge device. We benchmarked our model on an NVIDIA Jetson Orin Nano (8 GB) configured in 15 W power mode (MAXN) using FP16 precision with batch = 1. To reflect deployment practice, inputs are resized to a short side of 384, decoding uses num_beams = 1 and max_new_tokens = 40, and we report the average per-image latency over a 50-image set (balanced 25/25 damage vs. no-damage) after a 3-step warm-up.
Table 8 summarizes latency, throughput, and peak unified memory usage. EdgeV-SE runs at 1837.2 ms per image (≈1.84 s), or 0.54 FPS, with a peak memory allocation of 0.915 GB. Because EdgeV-SE leaves the inference stack unchanged (BLIP-Large + LoRA) and activates all reflective mechanisms only during training (consistency and mutual-agreement losses; see Equations (6) and (7)), the Controlled SFT baseline, which shares the identical architecture at inference, exhibits virtually identical latency and memory footprint (differences within run-to-run noise). We report latency for caption generation under the stated decoding settings (num_beams = 1, max_new_tokens = 40) because it is computationally more demanding than classification-only inference and therefore reflects the heavier of the two workloads.
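The measurement protocol can be sketched as a small timing harness. This is a minimal sketch, not the benchmarking script used in this work: `run_once` is a hypothetical callable that captions one image, and the warm-up inputs are assumed to be drawn from the same image set.

```python
import time
from statistics import mean


def fps_from_latency_ms(latency_ms):
    """At batch size 1, throughput is the reciprocal of the per-image latency."""
    return 1000.0 / latency_ms


def benchmark(run_once, inputs, warmup_steps=3):
    """Return (mean latency in ms, throughput in FPS) over `inputs`."""
    for x in inputs[:warmup_steps]:   # warm-up calls are executed but not timed
        run_once(x)
    latencies_ms = []
    for x in inputs:
        start = time.perf_counter()
        run_once(x)                   # on a GPU, synchronize the device before reading the clock
        latencies_ms.append((time.perf_counter() - start) * 1000.0)
    avg_ms = mean(latencies_ms)
    return avg_ms, fps_from_latency_ms(avg_ms)
```

As a consistency check, `fps_from_latency_ms(1837.2)` gives about 0.54, matching the latency and throughput reported above.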
These results corroborate that our method introduces no additional inference overhead beyond the base BLIP model. The self-reflective losses are injected exclusively at training time, and inference uses the same BLIP-Large + LoRA network without the auxiliary visual head. This makes EdgeV-SE directly suitable for Satellite IoT deployments where Jetson-class devices serve as on-site processing nodes under strict power and latency constraints.
We emphasize that this is a small microbenchmark intended to provide a worst-case upper-bound load estimate for on-device caption generation; it does not characterize run-to-run variance over long streams, energy draw, or thermal throttling. More comprehensive profiling (power/thermal and end-to-end throughput under realistic satellite ingestion pipelines) is left for future work.
5.4. Qualitative Analysis
To provide empirical insight into how EdgeV-SE outperforms the standard SFT baseline, we visualize representative classification scenarios in
Figure 3 and
Figure 4. For each case, we report the log-likelihood margin Δ (Equation (4)) and the linguistic probability p(damage|v) (Equation (2)), which serve as indicators of the model’s internal confidence. We emphasize that the uncertainty threshold used during training (
Figure 2) serves only as a diagnostic gate for identifying ambiguous samples, whereas the large margins observed here emerge naturally after convergence and reflect the model’s final decision confidence.
5.4.1. Analysis of Damage Detection (Sensitivity & Calibration)
Figure 3 illustrates the models’ responses to disaster scenes. For a clear-cut flood case (
Figure 3a, e.g., panel a-2), both models correctly identify damage. However, EdgeV-SE yields a substantially larger positive margin (Δ = +8.20) with a high linguistic probability (p(damage|v) = 0.968), reflecting stronger internal agreement on salient evidence such as extensive standing water inundating road segments and surrounding building footprints. In contrast, in a challenging scenario where muddy floodwater visually resembles exposed soil (
Figure 3b, e.g., panel b-2), the SFT baseline fails (false negative) by interpreting the scene as accessible open terrain. EdgeV-SE correctly recognizes the subtle boundaries of turbid floodwater and its interaction with road edges and nearby structures. Notably, EdgeV-SE outputs a moderated margin for this hard sample (Δ = +4.10; p(damage|v) = 0.882) relative to the easy case, indicating calibrated confidence under visual ambiguity rather than blind overconfidence.
5.4.2. Analysis of False Positive Reduction (Specificity & Hallucination)
Figure 4 examines the models’ robustness against visually confusing artifacts in non-damaged areas. For an easy no-damage case (
Figure 4a, e.g., panel a-2), both models correctly predict normalcy, while EdgeV-SE exhibits strong negative evidence (Δ = −7.50; p(damage|v) = 0.022), consistent with high confidence in the no-damage decision. In the hard no-damage scenario (
Figure 4b, e.g., panel b-2), the SFT baseline produces a critical hallucination-driven false positive by mistaking a rectangular swimming pool for floodwater. EdgeV-SE suppresses this semantic drift by explicitly distinguishing the pool from flooding and maintains a confident negative (margin Δ = −5.00; p(damage|v) = 0.068). The reduced magnitude compared to the easy case (|Δ|: 7.50 → 5.00) also reflects appropriate uncertainty in the presence of water-like visual patterns while preserving correct classification.
5.4.3. Summary of Qualitative Findings
These qualitative results corroborate the quantitative improvements reported in
Section 5.1. While the standard SFT baseline exhibits failure modes under visual ambiguity—missing muddy flooding (false negatives) and hallucinating damage in the presence of water-like objects such as swimming pools (false positives)—EdgeV-SE achieves more reliable visual grounding. Importantly, EdgeV-SE’s confidence signals are better calibrated: the margins are more extreme for easy cases (
Figure 3(a-2): Δ = +8.20;
Figure 4(a-2): Δ = −7.50) and appropriately moderated for hard cases (
Figure 3(b-2): Δ = +4.10;
Figure 4(b-2): Δ = −5.00), aligning confidence with scene difficulty. Crucially, these behaviors are internalized during training, enabling precise and context-aware decisions even under severe visual ambiguity.
5.5. Temperature Scaling and Calibration Procedure
Reliable uncertainty estimation is critical for decision support in safety-critical satellite IoT disaster assessment scenarios. To ensure that the predicted probabilities of EdgeV-SE are statistically well-calibrated, we apply temperature scaling as a post-hoc calibration method.
Rather than fixing the temperature parameter heuristically, the temperature T is selected on the validation set by minimizing the Expected Calibration Error (ECE). Specifically, we perform a grid search over candidate temperature values T ∈ {0.5, 1.0, 1.5, 2.0, 2.5, 3.0}. For each candidate, the predicted probabilities are obtained by applying temperature scaling to the logit differences, and ECE is computed using 15 equal-width bins.
Table 9 reports the validation ECE for different temperature values. The temperature T = 2.0 consistently achieves the lowest ECE and is therefore selected for all subsequent experiments.
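The selection procedure can be sketched in a few lines. This is a minimal sketch under stated assumptions: the calibrated probability is taken to be a sigmoid of the margin divided by T, and ECE is computed on the confidence of the predicted class. The exact form of Equation (2) may differ, and the function names are hypothetical.

```python
import math

GRID = (0.5, 1.0, 1.5, 2.0, 2.5, 3.0)  # candidate temperatures from the text


def scaled_probability(margin, temperature):
    """Assumed form: p(damage|v) = sigmoid(margin / T)."""
    return 1.0 / (1.0 + math.exp(-margin / temperature))


def ece_binary(probs, labels, n_bins=15):
    """Expected Calibration Error with equal-width confidence bins on [0, 1]."""
    conf_sum = [0.0] * n_bins
    hit_sum = [0.0] * n_bins
    for p, y in zip(probs, labels):
        pred = 1 if p >= 0.5 else 0
        conf = p if pred == 1 else 1.0 - p
        idx = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes to the last bin
        conf_sum[idx] += conf
        hit_sum[idx] += 1.0 if pred == y else 0.0
    # Sum of |bin confidence - bin accuracy| weighted by bin size, divided by N.
    return sum(abs(c - h) for c, h in zip(conf_sum, hit_sum)) / len(probs)


def select_temperature(margins, labels, grid=GRID):
    """Return (T, ECE) for the grid value with the lowest validation ECE."""
    scored = [(ece_binary([scaled_probability(m, t) for m in margins], labels), t)
              for t in grid]
    best_ece, best_t = min(scored)
    return best_t, best_ece
```

The function would be called once on validation margins and labels, after which the selected T stays fixed for all test-time evaluation.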
Using the validation-selected temperature (T = 2.0), we further evaluate calibration performance on the test set. To assess the statistical stability of the reported ECE, we compute 95% confidence intervals using bootstrap resampling with 1000 iterations. This procedure ensures that the reported calibration results are not artifacts of sampling noise or heuristic parameter choices.
With T = 2.0 fixed, EdgeV-SE achieves a test-set ECE of 3.2%, with a 95% bootstrap confidence interval of [2.7, 3.6]%, indicating that the calibration estimate is statistically stable.
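A percentile bootstrap of this kind can be sketched as follows. The sketch assumes resampling of test examples with replacement and a simple percentile interval; the resampling variant actually used may differ, and `stat_fn` is a hypothetical callable (for example, an ECE function applied to the resampled examples).

```python
import random


def bootstrap_ci(samples, stat_fn, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for stat_fn over `samples`."""
    rng = random.Random(seed)
    n = len(samples)
    stats = []
    for _ in range(n_boot):
        # Draw n examples with replacement and recompute the statistic.
        resample = [samples[rng.randrange(n)] for _ in range(n)]
        stats.append(stat_fn(resample))
    stats.sort()
    lower = stats[int(round((alpha / 2.0) * n_boot))]
    upper = stats[min(n_boot - 1, int(round((1.0 - alpha / 2.0) * n_boot)))]
    return lower, upper
```

For calibration, each element of `samples` would be a (probability, label) pair, so that every resample yields one ECE value.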
Having established a statistically grounded calibration procedure, we next examine the robustness of the calibrated model under common image corruptions and distributional shifts.
5.6. Robustness Under Common Corruptions and Calibration
Beyond clean test images, Satellite IoT deployments must contend with various distortions introduced by sensors, atmospheric conditions, and transmission pipelines. To examine whether EdgeV-SE’s gains translate into such settings, we evaluate robustness and calibration under seven common corruptions (rotation, brightness, contrast, Gaussian blur, Gaussian noise, JPEG compression, and occlusion), each at five severities (1–5). For every condition we keep the classifier unchanged, using the LL-margin decision rule with the validation-selected threshold, and compute Accuracy, class-wise F1, Macro-F1, and Expected Calibration Error (ECE; 15 bins) from the same temperature-scaled probability in Equation (2). We present a reliability diagram on clean data in
Figure 5 and plot severity–performance curves for Macro-F1 and ECE in
Figure 6a,b.
On clean inputs, the model achieves high performance (Acc = 0.990, Macro-F1 = 0.985) and is well-calibrated (ECE = 3.2%;
Figure 5). Aggregated over all 35 corruption–severity settings by pooling all corrupted test images, performance remains high (Acc = 0.875, Macro-F1 = 0.875) with low ECE (5.7%), indicating that confidence generally tracks accuracy. The severity curves show graceful degradation for brightness, blur, and rotation; JPEG exhibits a moderate drop only at the highest severity; in contrast, severe contrast corruption (s = 5) and Gaussian noise (s ≥ 3) are the principal failure modes where accuracy approaches chance and ECE rises sharply (
Figure 6). Occlusion has minimal effect across severities in our data, likely because the synthetic central patch seldom masks key evidence.
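The pooled aggregation can be sketched as follows. This is an illustrative sketch: predictions from every corruption and severity setting are concatenated before class-wise F1 and Macro-F1 are computed, which differs from averaging the 35 per-setting scores. The names and data layout are hypothetical.

```python
CORRUPTIONS = ("rotation", "brightness", "contrast", "gaussian_blur",
               "gaussian_noise", "jpeg", "occlusion")
SEVERITIES = (1, 2, 3, 4, 5)
SETTINGS = [(c, s) for c in CORRUPTIONS for s in SEVERITIES]  # 7 x 5 = 35 settings


def f1_for_class(y_true, y_pred, cls):
    """F1 = 2TP / (2TP + FP + FN) for one class; 0.0 if the class never appears."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != cls and p == cls)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p != cls)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def macro_f1(y_true, y_pred, classes=(0, 1)):
    """Unweighted mean of the class-wise F1 scores."""
    return sum(f1_for_class(y_true, y_pred, c) for c in classes) / len(classes)


def pooled_macro_f1(results):
    """results maps (corruption, severity) -> (y_true, y_pred) lists."""
    all_true, all_pred = [], []
    for y_true, y_pred in results.values():
        all_true.extend(y_true)
        all_pred.extend(y_pred)
    return macro_f1(all_true, all_pred)
```

Pooling weights every corrupted image equally, so settings with more images contribute proportionally more to the aggregate.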
We attribute the observed robustness and calibration to two training-time signals that incur no inference-time overhead: (i) a Δ-consistency term that stabilizes the LL-margin decision under view changes, and (ii) mutual alignment/learning between the language and visual pathways, which reduces both over- and under-confidence on atypical inputs.
5.7. Robustness to Prompt Verbalizers
To examine whether the classification gains of EdgeV-SE are driven by memorizing specific label tokens, we evaluate the trained model using alternative class verbalizers at test time. While training uses the original class-conditioned prompts, we replace the default verbalizers (“Confirmed Damage Zone”/“No Damage Zone”) with multiple semantically equivalent phrasings that do not share lexical overlap with the training prefixes. For all verbalizer sets, we keep the LL-margin decision threshold fixed to the validation-selected value obtained under the default verbalizers (i.e., we do not re-tune the threshold per set) to measure genuine prompt sensitivity.
As shown in
Table 10, EdgeV-SE maintains consistent classification performance across different verbalizer choices, with only marginal variation in Accuracy and Macro-F1. This indicates that the proposed framework does not rely on specific label tokens, but instead learns robust visual-linguistic associations for damage discrimination.
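The evaluation protocol can be illustrated with a small sketch in which the decision threshold is held fixed while the verbalizer pair is swapped. Everything beyond the default verbalizers is an assumption: the alternative phrasing, the threshold value, and the toy log-likelihood table standing in for the model are hypothetical.

```python
def ll_margin(score_fn, image, verbalizers):
    """Log-likelihood of the damage phrasing minus that of the no-damage phrasing."""
    damage_text, no_damage_text = verbalizers
    return score_fn(image, damage_text) - score_fn(image, no_damage_text)


def predict(score_fn, image, verbalizers, threshold):
    """Return 1 (damage) if the margin reaches the fixed threshold, else 0."""
    return int(ll_margin(score_fn, image, verbalizers) >= threshold)


DEFAULT = ("Confirmed Damage Zone", "No Damage Zone")
ALTERNATIVE = ("Area with visible destruction", "Undisturbed area")  # hypothetical phrasing

# Toy log-likelihoods standing in for the model (made-up numbers).
TOY_SCORES = {
    ("img_flood", DEFAULT[0]): -2.0,
    ("img_flood", DEFAULT[1]): -9.0,
    ("img_flood", ALTERNATIVE[0]): -3.5,
    ("img_flood", ALTERNATIVE[1]): -8.0,
}


def toy_score(image, text):
    return TOY_SCORES[(image, text)]


THRESHOLD = 0.0  # hypothetical stand-in for the validation-selected threshold

# The same threshold is reused for every verbalizer set (no per-set re-tuning).
predictions = [predict(toy_score, "img_flood", v, THRESHOLD) for v in (DEFAULT, ALTERNATIVE)]
```

A model that had merely memorized the default label tokens would be expected to show shrinking or sign-flipping margins under the alternative phrasing.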
5.8. Human-Verified Hallucination Analysis
Although a portion of low-quality pseudo-captions were filtered during dataset construction, residual hallucinations may still persist because the pseudo-references are generated by a teacher model rather than written by humans. This raises a key concern: gains in overlap-based caption metrics (e.g., CIDEr-D and BERTScore) could partially reflect stylistic imitation of the pseudo-caption teacher rather than improved factual grounding in the image. To directly evaluate the factual reliability of the final model outputs, we conducted a human-verified hallucination study on a subset of the test set.
Sampling and protocol. We randomly sampled 100 test images with a balanced ratio of damage and no-damage cases (50/50). For each sample, annotators were presented with the image and the final model-generated caption, with the classification headers removed to prevent template cues from influencing the judgment. The annotators then assessed whether the caption contained any claim that is unsupported by visible evidence in the image. In this study, we define a hallucination as (i) introducing objects/conditions that are not visually present (e.g., describing flooding or debris when none is observable), or (ii) stating a damage state that contradicts the image (e.g., describing intact structures when clear damage is visible). If a caption contained at least one unsupported claim, the sample was counted as hallucinated; the hallucination rate is reported as the percentage of hallucinated samples among the 100 evaluated images.
Hallucination taxonomy and severity. When hallucinations were identified, they were further categorized by type (e.g., flooding, debris, roof damage, road blockage, building collapse) to clarify the dominant failure modes. We also graded the severity of hallucinations using a three-level scale: 1 (minor) for stylistic exaggerations or low-impact inaccuracies, 2 (decision-affecting) for inaccuracies that could plausibly alter a binary damage/no-damage decision, and 3 (critical) for severe misinformation that could mislead downstream prioritization or response actions. The average severity in
Table 11 is computed by mapping these levels to numerical scores (1–3) and averaging over hallucinated samples.
Two annotators independently reviewed all samples, and disagreements were resolved by consensus. We report the raw inter-annotator agreement (%) in
Table 11 to indicate labeling consistency prior to reconciliation.
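The three reported quantities can be computed as in the following sketch. It assumes that each sample carries a severity label of 0 (no hallucination) or 1 to 3, that the average severity is taken over the reconciled labels of hallucinated samples, and that raw agreement is measured on the binary hallucinated/not-hallucinated judgment; the toy labels are made up.

```python
def hallucination_summary(severities):
    """severities: reconciled labels, 0 = no hallucination, 1-3 = severity level."""
    flagged = [s for s in severities if s > 0]
    rate_percent = 100.0 * len(flagged) / len(severities)
    avg_severity = sum(flagged) / len(flagged) if flagged else 0.0
    return rate_percent, avg_severity


def raw_agreement_percent(annotator_a, annotator_b):
    """Percentage of samples on which both annotators gave the same binary judgment."""
    same = sum(1 for a, b in zip(annotator_a, annotator_b) if (a > 0) == (b > 0))
    return 100.0 * same / len(annotator_a)


# Toy labels for ten samples (illustrative only).
labels_a = [0, 0, 1, 0, 2, 0, 0, 0, 3, 0]
labels_b = [0, 0, 1, 0, 0, 0, 0, 0, 2, 0]
consensus = [0, 0, 1, 0, 2, 0, 0, 0, 3, 0]

rate, severity = hallucination_summary(consensus)
agreement = raw_agreement_percent(labels_a, labels_b)
```

Raw agreement is computed on the independent labels before reconciliation, whereas the rate and severity use the labels after disagreements are resolved.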
Results. The results are summarized in
Table 11. Compared to the strong baseline (Controlled SFT), EdgeV-SE reduces the hallucination rate from 11.0% (11/100) to 6.0% (6/100) and lowers the average severity (from 1.7 to 1.4), while maintaining high annotator agreement (92–94%). Given
n = 100, the corresponding Wilson 95% confidence intervals for the hallucination rate are [6.3, 18.6]% for Controlled SFT and [2.8, 12.5]% for EdgeV-SE. These intervals overlap, reflecting the uncertainty induced by the limited sample size, so the reduction should be read as indicative rather than conclusive. With that caveat, these findings provide direct human evidence that, despite relying on pseudo-caption supervision, EdgeV-SE tends to generate captions that are more visually grounded and less prone to unsupported damage assertions.
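For reference, the Wilson score interval can be computed as in the sketch below, which uses the standard closed-form expression and only the Python standard library. The function name is hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def wilson_interval(successes, n, confidence=0.95):
    """Wilson score interval for a binomial proportion, returned as fractions."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2.0)  # about 1.96 for 95%
    p_hat = successes / n
    denom = 1.0 + z * z / n
    center = (p_hat + z * z / (2.0 * n)) / denom
    half_width = z * sqrt(p_hat * (1.0 - p_hat) / n + z * z / (4.0 * n * n)) / denom
    return center - half_width, center + half_width


controlled_sft = wilson_interval(11, 100)  # 11 hallucinated captions out of 100
edgev_se = wilson_interval(6, 100)         # 6 hallucinated captions out of 100
```

Expressed in percent and rounded to one decimal place, these calls give [6.3, 18.6] and [2.8, 12.5], which match the intervals quoted above.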
Positioning relative to automatic metrics (fluency vs. factual correctness). Automatic captioning metrics such as CIDEr-D and BERTScore primarily reflect linguistic fluency and semantic similarity to the pseudo-reference captions, whereas CLIPScore and the hallucination rate in
Table 11 serve as complementary indicators of image-grounded factual reliability. This separation clarifies that the improvements of EdgeV-SE are not solely attributable to stylistic alignment with pseudo-references, but are accompanied by measurable reductions in human-verified hallucinations, consistent with our design goal of visual-faithful reasoning.
Limitations. This human-verified slice is intentionally small to keep annotation costs tractable; expanding the annotation scale and extending the taxonomy to additional disaster types are important directions for future work.
5.9. Generalization to Wildfire Damage Assessment
To evaluate the generalization capability of the proposed EdgeV-SE framework beyond hurricane-specific scenarios, we conducted additional experiments on the Wildfire Prediction Dataset [
52]. This dataset, sourced from the Canadian Open Government Portal, contains satellite images of areas affected by wildfires.
To ensure a rigorous comparison with the hurricane benchmark, we constructed a balanced subset of identical size and class distribution. Specifically, we sampled 5000 wildfire and 5000 no-wildfire images for training, 1000/1000 for validation, and 1000/1000 for testing, resulting in a total of 14,000 images. This controlled setup eliminates data scale as a confounding factor and allows us to focus solely on the model’s domain generalization capability.
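A balanced, non-overlapping split of this size can be drawn as in the following sketch. It assumes that image identifiers are available as two lists, one per class; the identifiers, the seed, and the function name are hypothetical.

```python
import random


def balanced_split(pos_ids, neg_ids, sizes=(5000, 1000, 1000), seed=0):
    """Sample disjoint train/val/test splits with equal counts per class.

    Returns a dict mapping split name -> list of (image_id, label) pairs.
    """
    rng = random.Random(seed)
    per_class = sum(sizes)
    pos = rng.sample(pos_ids, per_class)  # sampling without replacement
    neg = rng.sample(neg_ids, per_class)
    splits, start = {}, 0
    for name, k in zip(("train", "val", "test"), sizes):
        splits[name] = ([(i, 1) for i in pos[start:start + k]]
                        + [(i, 0) for i in neg[start:start + k]])
        start += k
    return splits
```

With the default sizes, each class contributes 7000 images, which gives the 14,000 images stated above.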
Compared to hurricane damage, wildfire scenes exhibit distinct visual characteristics such as burn scars, vegetation loss, and smoke-induced texture degradation, making this a challenging testbed for cross-domain generalization.
Crucially, the experimental pipeline, model architecture, training procedure, and hyperparameters (including the τ and λ settings) were kept identical to those used in the hurricane experiments, without task-specific re-tuning. This design ensures that the evaluation reflects the generalization capability of the proposed framework rather than dataset-specific tuning.
Table 12 reports the performance comparison on the wildfire dataset. Despite the domain shift, EdgeV-SE consistently outperforms the standard supervised fine-tuning (SFT) baseline in terms of both Accuracy and Macro-F1. Interestingly, the model achieved slightly higher performance on the wildfire dataset compared to the hurricane benchmark. We attribute this to the fact that wildfire damage (e.g., distinct burn scars) presents more visually salient features than the subtle flooding or debris cues in hurricane imagery, making the classification task inherently less ambiguous for the proposed framework.
In addition, EdgeV-SE demonstrates more stable calibration behavior on the wildfire dataset, achieving lower ECE values than the baseline. This result suggests that the uncertainty estimation and calibration mechanisms used in EdgeV-SE remain effective under domain shift.
Overall, these results support the claim that EdgeV-SE is not restricted to a single disaster type, but can generalize to structurally different disaster scenarios without architectural modification or task-specific re-tuning of hyperparameters.
6. Discussion
6.1. Synergistic Mechanisms of Self-Reflection
The performance gains of EdgeV-SE, which improve Macro-F1 from 0.911 to 0.985, stem from the mechanism’s ability to selectively weight gradient updates based on internal uncertainty. Unlike standard SFT, which treats all samples equally, our margin-based weighting mechanism (Equation (5)) effectively acts as an internal curriculum, directing more of the gradient signal to ambiguous samples (e.g., subtle roof damage or occluded debris) that usually fall into the long tail of the error distribution. Furthermore, the mutual learning objective (Equation (7)) serves as a regularization term that counteracts the “semantic drift” often observed in VLMs, where the model generates plausible captions that are visually unsupported. By forcing the linguistic theorist to agree with the visual empiricist, EdgeV-SE encourages generated captions to remain grounded in pixel-level evidence.
Unlike conventional consistency regularization that enforces invariance between multiple predictions, EdgeV-SE explicitly treats internal disagreement between generative (linguistic) and discriminative (visual) pathways as a diagnostic signal, and resolves it through uncertainty-aware weighting rather than uniform agreement enforcement.
6.2. Operational Feasibility vs. Hard Real-Time Constraints
A critical consideration for edge deployment is the definition of “real-time.” Our benchmark on the Jetson Orin Nano shows an inference speed of approximately 0.54 FPS (1.84 s/image). While this does not meet the standard for “video-rate” real-time (e.g., >30 FPS) required for autonomous driving, it satisfies the “operational real-time” constraints of Satellite IoT disaster response. In typical satellite-to-ground scenarios, data transmission bandwidth is the primary bottleneck, often taking minutes per high-resolution image block. Consequently, an inference latency of 1.8 s is negligible compared to transmission latency. Therefore, EdgeV-SE provides a viable solution for on-site filtering, where the device processes images locally and transmits only prioritized “Damage” alerts with text descriptions, drastically reducing the bandwidth requirement compared to raw image transmission.
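The latency argument can be illustrated with back-of-the-envelope arithmetic. In this sketch, only the inference latency is taken from our measurements; the image size, the alert size, and the downlink rate are hypothetical values chosen purely for illustration.

```python
def transmission_seconds(payload_bytes, link_bits_per_second):
    """Time to send a payload over a link, ignoring protocol overhead."""
    return payload_bytes * 8 / link_bits_per_second


IMAGE_BYTES = 50_000_000  # hypothetical raw image block (50 MB)
ALERT_BYTES = 300         # hypothetical alert: label plus a short caption
LINK_BPS = 2_000_000      # hypothetical downlink rate (2 Mbit/s)
INFERENCE_S = 1.84        # measured per-image latency on the Jetson Orin Nano

image_s = transmission_seconds(IMAGE_BYTES, LINK_BPS)  # raw image downlink
alert_s = transmission_seconds(ALERT_BYTES, LINK_BPS)  # text alert downlink
on_device_total_s = INFERENCE_S + alert_s              # infer locally, then send text
```

Under these assumed values, sending the raw image takes 200 s, whereas local inference followed by a text alert takes under 2 s. The conclusion is sensitive to the assumed link rate and payload sizes.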
Regarding hardware constraints, our empirical on-device benchmarking focuses on Jetson-class embedded GPUs (Jetson Orin Nano, 8 GB), which provide practical FP16 acceleration for transformer decoding in caption generation. We did not benchmark additional heterogeneous devices in this study (e.g., CPU-only Raspberry Pi platforms or NPU accelerators such as Google Coral). Nevertheless, pairing EdgeV-SE with smaller backbones and aggressive compression (e.g., INT8/4-bit quantization, distillation, and reduced decoding length) could make such platforms viable targets, potentially at the cost of longer per-image latency. We leave a systematic multi-device study—including latency variance, energy, and thermal behavior—as future work.
6.3. Justification for VLM over Lightweight Classifiers
One might question whether a lightweight image-only classifier (e.g., ResNet/MobileNet) could offer lower latency for binary damage detection. While such models can be attractive for throughput, they do not provide natural-language explanations. In disaster assessment workflows, practitioners often require interpretable evidence (e.g., flooding vs. debris vs. roof damage) rather than an opaque binary label. EdgeV-SE targets this decision-support setting by producing both a damage decision and an evidence-oriented caption. A direct speed/accuracy comparison to lightweight classifiers is left to future work.
6.4. Robustness and Reliability Under Domain Shift
The bootstrap analysis and corruption tests indicate that EdgeV-SE is well-calibrated on clean inputs and degrades gracefully under several common perturbations (e.g., mild rotation/brightness/blur). However, the controlled corruption suite also reveals clear failure modes under extreme contrast collapse and stronger Gaussian noise. Therefore, robustness to real deployment conditions may require sensor- and pipeline-specific augmentation and calibration strategies beyond modest perturbations used in our self-reflective training.
Beyond robustness to input perturbations, reliable calibration is particularly critical in satellite IoT disaster assessment, where downstream decisions are often made under severe uncertainty. The bootstrap-based calibration analysis in
Section 5.5 demonstrates that EdgeV-SE maintains statistically stable confidence estimates, rather than producing overconfident predictions. This property is essential for risk-aware decision support, enabling operators to prioritize uncertain or borderline cases for human review instead of relying solely on hard binary decisions.
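Such a triage step can be sketched as a ranking by distance from the decision boundary. This is a minimal illustration that assumes a calibrated p(damage|v) is available per image and that a fixed review budget is set by the operator. The function name and sample identifiers are hypothetical, while the probabilities are those quoted for the four qualitative examples in Section 5.4.

```python
def most_uncertain(probabilities, budget):
    """Return the `budget` sample ids whose p(damage|v) lies closest to 0.5."""
    ranked = sorted(probabilities, key=lambda k: abs(probabilities[k] - 0.5))
    return ranked[:budget]


# p(damage|v) for the four qualitative examples (keys are illustrative ids).
examples = {
    "fig3_a2": 0.968,  # easy flood case
    "fig3_b2": 0.882,  # muddy floodwater
    "fig4_a2": 0.022,  # easy no-damage case
    "fig4_b2": 0.068,  # swimming pool
}
review_queue = most_uncertain(examples, budget=2)
```

With a budget of two, the queue contains the two hard cases (muddy floodwater and the swimming pool), in line with the qualitative analysis.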
6.5. Limitations and Future Directions
Despite the promising results, this study has limitations that warrant discussion.
Pseudo-Ground Truth. Despite filtering approximately 8% of noisy pseudo-captions during dataset construction, some residual hallucinations may remain. To address this, we additionally performed a human-verified evaluation of final model outputs (
Section 5.8), which confirms that EdgeV-SE substantially reduces factual errors compared to standard fine-tuning. Nevertheless, incorporating human-in-the-loop verification during dataset construction remains an important direction for future work.
Domain Specificity. While the primary focus of our experiments is hurricane damage assessment, we additionally validated the generalization capability of EdgeV-SE on a wildfire damage dataset (
Section 5.9). The consistent performance gains across these two structurally different disaster domains suggest that the proposed framework is not narrowly specialized to a single event type. Nevertheless, broader validation across additional disaster categories (e.g., floods or earthquakes) and geographic regions remains an important direction for future work.
Comparison scope. This study evaluates EdgeV-SE on BLIP-Large, which was selected as the optimal high-performance backbone deployable on Jetson-class hardware at the time of study design. While emerging edge-specific architectures (e.g., MobileVLM, TinyLLaVA) have since gained traction, our architectural analysis suggests that the proposed EdgeV-SE framework is inherently compatible with these decoder-only models. Specifically, the core uncertainty mechanism—currently calculated via cross-attention scores in BLIP—can be adapted to decoder-only architectures by leveraging the causal next-token probabilities of class verbalizers. This indicates that our self-reflective algorithm is not backbone-specific but rather a generalizable fine-tuning strategy. Therefore, applying EdgeV-SE to these newer architectures is a promising direction for future work to further enhance on-device efficiency. Finally, cloud-based proprietary models (e.g., Gemini, Claude) remain outside our scope, as the target setting assumes communication-denied satellite IoT operations requiring offline on-device inference.
Hardware Scope. Our on-device evaluation focuses on a Jetson Orin Nano (8 GB) as a representative embedded GPU platform capable of running BLIP-Large caption generation in FP16. We did not benchmark Raspberry Pi or Google Coral in this study. However, we note that such platforms represent plausible deployment targets when paired with lighter, INT8-compatible backbones (e.g., MobileVLM-like architectures on EdgeTPU). While end-to-end caption generation on these devices might incur higher latency than on Jetson-class GPUs due to limited sequence modeling support, this trade-off may be acceptable in scenarios where communication latency dominates. Exploring these heterogeneous platforms—together with aggressive compression techniques—will be an important direction for future work.
Lightweight Classifier Baselines. While lightweight image-only classifiers (e.g., MobileNet- or ResNet-based models) are attractive for minimizing inference latency, this study focuses on caption-capable VLMs to provide explainable decision support in disaster-response settings. Nevertheless, we acknowledge that classifier baselines are important to quantify the accuracy–latency trade-off in operational deployments. As a concrete future evaluation plan, we will (i) train lightweight classifiers on the same data split, (ii) measure on-device latency under identical edge settings, and (iii) optionally pair the classifier with a lightweight captioner or structured attribute template to assess decision-support utility. Such a study would complement the current VLM-focused analysis by characterizing throughput-oriented deployment regimes.
Taken together, these limitations outline the current scope of our study and motivate several directions for future investigation. Although our primary experiments focus on hurricane damage assessment, the proposed framework is inherently domain-agnostic, as it relies solely on internal log-likelihood margins and cross-pathway agreement rather than domain-specific heuristics. The additional wildfire experiments provide initial evidence of this generality. Extending this validation to a broader range of disaster scenarios and to different encoder–decoder VLM backbones remains an important avenue for future research.
7. Conclusions
We proposed a self-reflective fine-tuning framework, EdgeV-SE, for edge-deployable VLMs. Conceptually, EdgeV-SE turns the model’s own uncertainty and internal disagreement into learning signals, without extra labels, external oracles, or runtime overhead. EdgeV-SE integrates uncertainty-aware weighting based on linguistic margins, margin-level multi-view semantic consistency, and dual-pathway mutual learning between linguistic and visual routes to deliver robust and reliable performance for edge-deployable Vision–Language Models.
Our proposed model yields a substantial improvement in classification performance, increasing Macro-F1 from 0.911 to 0.985 over standard supervised fine-tuning without introducing any additional inference-time overhead. Moreover, EdgeV-SE achieves the highest CIDEr-D and BERTScore among the compared methods while maintaining a competitive CLIPScore, resulting in concise, factual, and context-aware descriptions. Crucially, these gains are achieved while preserving edge feasibility: on the Jetson Orin Nano, caption generation runs at approximately 1.84 s per image (0.54 FPS) with a peak memory allocation of 0.915 GB.
In future work, we will focus on strengthening both generalization and real-world deployability of EdgeV-SE. First, we plan to construct and release a human-verified, multi-hazard benchmark (e.g., hurricanes, floods, wildfires, earthquakes) with structured damage attributes to replace or complement pseudo-caption supervision, enabling more rigorous evaluation of calibration and robustness under distribution shift. Second, we will extend EdgeV-SE to a continual/active learning setting, where low-margin (high-uncertainty) samples encountered on the edge can be flagged for optional human review and then incorporated for incremental adaptation to new geographies, sensors, and acquisition conditions without full retraining. Third, we will pursue deployment-aware optimization by combining EdgeV-SE with compression techniques (e.g., quantization, structured pruning, and distillation) and by validating the framework across lighter VLM backbones and multimodal inputs, aiming to further reduce latency and memory usage while improving reliability in time-critical disaster response workflows.