Enhancing Self-Driving Segmentation in Adverse Weather Conditions: A Dual Uncertainty-Aware Training Approach to SAM Optimization
Round 1
Reviewer 1 Report
Comments and Suggestions for Authors
This paper investigates enhancing SAM/SAM2 segmentation under adverse weather conditions via uncertainty-aware training methods. The topic is practically significant, the methodology is reasonably designed, and the experiments are fairly comprehensive, especially showing some improvements in extreme weather scenarios. However, there are several areas requiring improvement in methodological details, experimental comparisons, and results analysis.
1. Keywords missing: The keyword section contains placeholders. Please provide actual keywords.
2. Experimental Results: In Tables 3 and 4, the reported "Overall" scores (IoU and Dice) for the BDD100K and CamVid datasets are identical (0.303 and 0.690, respectively). This is highly unusual and severely undermines the credibility of the experimental findings. The authors must check their data, recalculate the metrics, and provide a clear explanation. This discrepancy likely stems from a data processing error, misreporting, or flawed experimental setup. The revised manuscript must present distinct, recalculated results for each dataset.
3. Insufficient methodological details: The description of the Monte Carlo Uncertainty Loss in Section 3.1 is vague. A mathematical formulation or pseudocode is needed. The integration of the UAT adapter requires a more detailed description or a structural diagram.
4. Lack of comparative experiments and analysis:
- Compare with existing uncertainty-aware methods (e.g., Bayesian NN, MC Dropout).
- Clarify the train/validation/test split for each dataset.
- Explain the performance drop for certain classes (e.g., stop signs) in Tables 1 and 2 (e.g., class imbalance, annotation quality).
- Figures 3–6 are missing from the provided text. Please include and reference them.
5. Inconsistent reference formatting: Unify the reference format; some entries are incomplete.
6. Conclusion is too general: Provide a more specific summary of the scenarios and limitations of each method, based on the corrected experimental results.
Author Response
Comment 1:
“Keywords missing: The keyword section contains placeholders. Please provide actual keywords.”
Response 1:
We appreciate this observation. The placeholder keywords have been replaced with the finalized list on page 1, lines 25–27:
"Autonomous driving, Adverse weather, Image segmentation, Segment Anything Model, Uncertainty modeling, Aleatoric uncertainty, UAT, Foundation models."
This correction ensures that the keywords accurately reflect the manuscript’s core technical focus.
Comment 2: Experimental Results: In Tables 3 and 4, the reported "Overall" scores (IoU and Dice) for the BDD100K and CamVid datasets are identical (0.303 and 0.690, respectively). This is highly unusual and severely undermines the credibility of the experimental findings. The authors must check their data, recalculate the metrics, and provide a clear explanation. The revised manuscript must present distinct, recalculated results for each dataset.
Response 2: We appreciate this careful observation. The identical overall scores in the previous version were caused by a logging and table transcription error. We re-ran our evaluation scripts and recomputed all metrics directly from the saved predictions for each dataset. The corrected “Overall” results are now:
(1) Table III (Average IoU Scores, Overall): Fine-tuned SAM2 (BDD100K: 0.306, CamVid: 0.220); Zero-shot SAM2 (BDD100K: 0.245, CamVid: 0.244).
(2) Table IV (Average Dice Scores, Overall): Fine-tuned SAM2 (BDD100K: 0.696, CamVid: 0.551); Zero-shot SAM2 (BDD100K: 0.548, CamVid: 0.630).
These updated values appear in Tables III and IV in Section V-A.1, and the surrounding text has been revised to explicitly state that these metrics are computed separately for BDD100K and CamVid using their respective test splits (Pages 4–5, lines 363–386).
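For full transparency, the recomputation uses the standard per-mask definitions. The sketch below (a generic illustration, not our exact evaluation script) shows how Overall IoU and Dice are obtained from saved binary predictions, averaged separately over each dataset's own test split:

```python
import numpy as np

def iou_dice(pred: np.ndarray, gt: np.ndarray, eps: float = 1e-6):
    """Per-mask IoU and Dice for a binary prediction and its ground truth."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    iou = (inter + eps) / (union + eps)
    dice = (2 * inter + eps) / (pred.sum() + gt.sum() + eps)
    return iou, dice

# "Overall" = mean over all saved predictions of one dataset's test split,
# computed independently for BDD100K and for CamVid.
```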
Comment 3: Insufficient methodological details: The description of the Monte Carlo Uncertainty Loss in Section 3.1 is vague. A mathematical formulation or pseudocode is needed. The integration of the UAT adapter requires a more detailed description or a structural diagram.
Response 3: Thank you for the comment. The Monte Carlo Uncertainty Loss was already fully defined in the original submission, including its mathematical formulation. In the revised manuscript, we did not alter the formulation but clarified its explanation to make the methodology more transparent. The text now explicitly states that the model performs ten stochastic forward passes for each input image, computes the pixel-wise standard deviation to form the uncertainty map U, and uses this map to weight the combined Binary Cross-Entropy and Intersection-over-Union losses with exp(−U), together with a regularization term on U. This ensures that the role of U and its integration into the optimization process are clear. These clarifications appear in Section IV-A on Page 3, lines 216–242.
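To make the procedure concrete, the following PyTorch-style sketch mirrors the description above. It is an illustration under stated assumptions, not the manuscript's exact implementation: the function name, the probability clamping, the placement of the exp(−U) weighting inside the soft-IoU term, and the regularizer weight are all hypothetical choices.

```python
import torch
import torch.nn.functional as F

def mc_uncertainty_loss(model, image, target, num_passes=10, reg_weight=0.01, eps=1e-6):
    # Keep stochastic layers (e.g., dropout) active so repeated passes differ.
    model.train()
    preds = torch.stack([torch.sigmoid(model(image)) for _ in range(num_passes)])
    prob = preds.mean(dim=0)  # mean prediction over the ten stochastic passes
    U = preds.std(dim=0)      # pixel-wise standard deviation -> uncertainty map U

    w = torch.exp(-U)         # down-weight pixels the model is uncertain about

    # Pixel-wise Binary Cross-Entropy, weighted by exp(-U)
    bce = (w * F.binary_cross_entropy(
        prob.clamp(eps, 1 - eps), target, reduction="none")).mean()

    # Soft IoU computed over the exp(-U)-weighted pixels
    inter = (w * prob * target).sum()
    union = (w * (prob + target)).sum() - inter
    iou = 1.0 - (inter + eps) / (union + eps)

    # Regularization on U discourages trivially inflating the uncertainty map
    return bce + iou + reg_weight * U.mean()
```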
For the UAT-SAM integration, we expanded the description to explain the entire architecture flow from input to output, including how the latent variable z is sampled from the Conditional Variational Autoencoder and injected into each transformer block of SAM through lightweight adapters while the SAM backbone remains frozen. This clarification appears in Section IV-B on Page 4, lines 243–289.
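The adapter integration can likewise be summarized in a short structural sketch. Here `sam_blocks` stands for the pretrained (and frozen) SAM image-encoder transformer blocks and z for a sample from the Conditional Variational Autoencoder; the bottleneck width, the zero-initialized up-projection, and the additive conditioning on z are illustrative assumptions rather than the exact manuscript architecture.

```python
import torch
import torch.nn as nn

class UATAdapter(nn.Module):
    """Lightweight bottleneck adapter conditioned on a CVAE latent code z."""
    def __init__(self, dim: int, z_dim: int, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.z_proj = nn.Linear(z_dim, bottleneck)  # injects the latent variable z
        self.up = nn.Linear(bottleneck, dim)
        nn.init.zeros_(self.up.weight)              # residual branch starts as a no-op
        nn.init.zeros_(self.up.bias)

    def forward(self, x: torch.Tensor, z: torch.Tensor) -> torch.Tensor:
        # x: (B, N, dim) tokens from a frozen SAM block; z: (B, z_dim)
        h = torch.relu(self.down(x) + self.z_proj(z).unsqueeze(1))
        return x + self.up(h)

class UATSAMEncoder(nn.Module):
    """Frozen SAM-style encoder with one trainable adapter per transformer block."""
    def __init__(self, sam_blocks: nn.ModuleList, dim: int, z_dim: int):
        super().__init__()
        self.blocks = sam_blocks
        for p in self.blocks.parameters():
            p.requires_grad = False                 # SAM backbone stays frozen
        self.adapters = nn.ModuleList([UATAdapter(dim, z_dim) for _ in sam_blocks])

    def forward(self, x: torch.Tensor, z: torch.Tensor) -> torch.Tensor:
        for block, adapter in zip(self.blocks, self.adapters):
            x = adapter(block(x), z)                # adapt each frozen block's output
        return x
```

Because the up-projection is zero-initialized, training begins from the unmodified SAM features and the adapters learn only the weather-conditioned correction.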
Comment 4: Lack of comparative experiments and analysis:
Compare with existing uncertainty-aware methods (e.g., Bayesian NN, MC Dropout).
Clarify the train/validation/test split for each dataset.
Explain the performance drop for certain classes (e.g., stop signs) in Tables 1 and 2 (e.g., class imbalance, annotation quality).
Figures 3–6 are missing from the provided text. Please include and reference them.
Response 4:
(1) Positioning w.r.t. uncertainty-aware methods: We have extended the Related Work section to more clearly situate our methods relative to Bayesian neural networks, Monte Carlo dropout, evidential and calibration-focused approaches, and recent uncertainty estimation techniques in vision. We discuss how our uncertainty-aware adaptations of SAM/SAM2 are complementary to these frameworks and note that incorporating full Bayesian backbones is an important direction for future work. These additions appear in Section II (Pages 1–2, lines 85–96 and 106–121).
(2) Dataset splits: The train/validation/test splits are now explicitly stated in Section III. We specify the splits for BDD100K, CamVid, and GTA Driving, and clarify that UAT-SAM is trained on CamVid car instances only, while SAM2 uses all three datasets for uncertainty-aware fine-tuning (Page 2, lines 165–176).
(3) Class-specific performance drops: In Section V-A.1, we now explain the weaker improvements for stop signs and fire hydrants by (i) strong class imbalance in BDD100K, (ii) the small spatial extent and long-range occurrence of these objects, and (iii) their sensitivity to noise and our loss design. This clarification appears in the discussion following Tables III and IV (Page 5, lines 371–386).
(4) Missing figures: Figures 3–6 have been included and are referenced in the corresponding result subsections:
1. Figure 3: Qualitative comparison before and after SAM2 fine-tuning (Page 6, lines 393–394).
2. Figure 4: UAT-SAM vs. SAM under heavy rain (Page 6, line 419).
3. Figure 5: Dice scores across 177 patches (Page 6, line 432).
4. Figure 6: IoU scores across 177 patches (Page 6, line 432).
These figures are now integrated consistently into the narrative in Sections V-A and V-B.
Reviewer 2 Report
Comments and Suggestions for Authors
In general, this paper is well-written and organized. A comprehensive analysis has been provided for the authors.
Comments for author File: Comments.pdf
Author Response
Comment 1:
Add non-SAM baselines (Mask2Former/SegFormer/DeepLabv3+) and weather-robust pipelines; include ACDC or Boreas weather subsets for realism.
Response 1:
Thank you for this helpful suggestion. We agree that adding these baselines would provide a stronger comparative analysis. However, due to hardware and time limitations, we were unable to include large-scale retraining for all additional models in this version. We have clarified this limitation in the revised manuscript and indicated that future work will benchmark against Mask2Former, SegFormer, and DeepLabv3+ on weather-specific datasets such as ACDC and Boreas. The corresponding clarification can be found in the Conclusion (Page 6, Lines 441–446) and the Future Work section (Page 7, Lines 490–494).
Comment 2:
Compare UAT-SAM vs SAM2-FT under identical conditions.
Response 2:
We agree that clarification was needed. We have expanded the Results and Discussion section to explain that SAM2-FT performs full-scene segmentation, while UAT-SAM focuses on single-object segmentation in heavy weather. Because the training domains and objectives differ, a direct numerical comparison would be misleading. This clarification has been added to Section V-B (Page 5, Lines 407–431).
Comment 3:
Tables 3 and 4 show identical “Overall IoU = 0.303” for both datasets BDD100K and CamVid—clarify if this is a mistake.
Response 3:
We thank the reviewer for catching this oversight. The duplicated values were a reporting error. We have recalculated and corrected the results. The updated Overall IoU and Dice scores are now distinct: Table III – BDD100K = 0.306, CamVid = 0.220; Table IV – BDD100K = 0.696, CamVid = 0.551. These corrections appear in Section V-A (Page 5, Lines 368–377).
Comment 4:
Correct typographical and reference issues (e.g., “Kendall” spelling) and add missing DOIs.
Response 4:
We appreciate this observation. All typographical errors have been corrected, including “Kenadall → Kendall,” and DOI links have been added where available. These changes appear in the References section (Page 8, Lines 514–574).
Comment 5:
Tables 1 and 2 are trimmed on the right-hand side and need alignment adjustments.
Response 5:
We agree with this comment. The tables have been reformatted to ensure proper alignment and complete visibility. These adjustments are reflected in the updated manuscript (Tables I and II, Page 5).
Comment 6:
Clarify dataset sample counts and subset sizes used for testing.
Response 6:
We have clarified this in the revised text. UAT-SAM was trained on 1,047 training, 274 validation, and 177 testing car instances from CamVid. These details are now stated in Section III (Page 3, Lines 173–176) and Section V-B (Page 5, Line 408).
Reviewer 3 Report
Comments and Suggestions for Authors
This paper proposes two improvement methods to solve the problem of degraded segmentation performance in adverse weather. However, the manuscript still needs to be improved in the following aspects:
- Consider adding specific numerical values to demonstrate the performance improvements in the abstract.
- The keywords section appears incomplete or improperly formatted; please revise it to accurately reflect the core topics of the manuscript.
- Some possible corrections should be applied. For instance, on page 2, line 58, the authors could state more clearly what this paper contributes. In addition, the validity of the two complementary uncertainty-aware approaches proposed should also be demonstrated in the preceding discussion.
- Regarding the results in Tables 1 and 2, the metrics of the multistep fine-tuned SAM2 model should clearly outperform those of the zero-shot SAM2 model. However, how can it be demonstrated that this is due to the incorporation of the uncertainty mechanism mentioned rather than conventional fine-tuning? Consider adding a control group to demonstrate the performance improvement achieved by incorporating uncertainty metrics.
- On page 8, line 255, the manuscript mentions that “fine-tuned SAM2 model demonstrated strong generalization across diverse driving scenarios.” Consider presenting it in the form of a table or image.
- Please check the formatting issues in Tables 1 and 2.
- It is recommended to further improve the clarity, precision, and academic rigor of the language throughout the manuscript to make the exposition more concise and formal.
Author Response
Comment 1:
Consider adding specific numerical values to demonstrate the performance improvements in the abstract.
Response 1:
Thank you for this helpful comment. We have revised the abstract to include quantitative results demonstrating the performance improvements. The updated text now specifies that “UAT-SAM improves IoU by 42.7 percent and Dice by 30 percent under heavy weather conditions, while the fine-tuned SAM2 with uncertainty-aware loss shows improved performance across a wide range of driving scenes.” This change appears in the Abstract (Page 1, Lines 18–20).
Comment 2:
The keywords section appears incomplete or improperly formatted; please revise it to accurately reflect the core topics of the manuscript.
Response 2:
We agree with this observation. The keywords have been reformatted and expanded to include “Autonomous driving, Adverse weather, Image segmentation, Segment Anything Model, Uncertainty modeling, Aleatoric uncertainty, UAT, Foundation models.” The corrected list appears immediately after the Abstract (Page 1, Lines 25–27).
Comment 3:
Some possible corrections should be applied. For instance, on page 2, line 58, the authors could state more clearly what this paper contributes. In addition, the validity of the two complementary uncertainty-aware approaches proposed should also be demonstrated in the preceding discussion.
Response 3:
We appreciate this suggestion. We have clarified our main contributions and their validation in the Introduction. The revised text now explicitly states that this paper introduces “(1) an uncertainty-aware fine-tuning approach for SAM2 to enhance overall scene segmentation, and (2) an uncertainty-aware adapter (UAT) for SAM targeting adverse weather conditions.” These contributions are now summarized in the final paragraph of the Introduction (Page 2, Lines 63–71).
Comment 4:
Regarding the results in Tables 1 and 2, the metrics of the multistep fine-tuned SAM2 model should clearly outperform those of the zero-shot SAM2 model. However, how can it be demonstrated that this is due to the incorporation of the uncertainty mechanism mentioned rather than conventional fine-tuning? Consider adding a control group to demonstrate the performance improvement achieved by incorporating uncertainty metrics.
Response 4:
Thank you for this important point. We have clarified that the only modification from standard fine-tuning is the inclusion of the uncertainty-weighted loss; all other settings (dataset splits, optimizer, and network layers) remain identical. Hence, the observed improvement stems directly from the uncertainty modeling mechanism rather than ordinary fine-tuning. This clarification appears in Section IV-A (Page 4, Lines 226–231).
Comment 5:
On page 8, line 255, the manuscript mentions that “fine-tuned SAM2 model demonstrated strong generalization across diverse driving scenarios.” Consider presenting it in the form of a table or image.
Response 5:
We have clarified and expanded this discussion. The revised text now references Tables III and IV (Page 5, Lines 397–405), which summarize cross-dataset results for BDD100K and CamVid, effectively illustrating the model’s generalization across diverse real-world driving conditions.
Comment 6:
Please check the formatting issues in Tables 1 and 2.
Response 6:
We have corrected all formatting and alignment issues in Tables I and II. Column widths and numeric alignment were adjusted to ensure all entries are fully visible and properly aligned.
Comment 7:
It is recommended to further improve the clarity, precision, and academic rigor of the language throughout the manuscript to make the exposition more concise and formal.
Response 7:
We agree with this recommendation. The entire manuscript has been thoroughly proofread to improve clarity and academic tone. Revisions include simplifying sentence structures, removing redundancy, and ensuring consistent technical phrasing throughout the Abstract (Page 1, Lines 1–24), Introduction (Pages 1–2, Lines 28–71), and Discussion (Page 6, Lines 447–478).
Reviewer 4 Report
Comments and Suggestions for Authors
This manuscript studies two approaches to enhance segmentation performance in adverse driving conditions. However, some key points remain unclear. The details are listed as follows:
- The “Keywords” section of the paper only labels “keyword 1; keyword 2; keyword 3”without providing specific keywords.
- Autonomous driving datasets may contain annotation errors (such as boundary annotation deviations and category mislabeling). Especially in adverse weather conditions, annotation difficulty increases, and these annotation errors can affect the model training effect. The paper does not mention how to handle annotation errors in the dataset, nor does it evaluate the impact of annotation errors on model performance. It is recommended to supplement the description of annotation error handling strategies.
- The paper only focuses on model segmentation accuracy and does not analyze indicators related to model efficiency. It is recommended to supplement the model efficiency analysis, including indicators such as inference time, model parameter quantity, and computational complexity, to evaluate the feasibility of the method in practical applications.
- The current comparative experiments only compare the proposed method with zero-shot SAM/SAM2, and do not compare it with other advanced methods for autonomous driving segmentation under adverse weather conditions. It is recommended to add comparative experiments with mainstream methods in the field and expand the comparison dimensions (such as accuracy, speed, and robustness).
- The “Future Work” section of the paper mentions expanding the scope of target segmentation and increasing training data and scenarios, but the expression is too general and lacks specific implementation directions. It is recommended to refine the content of future work, clarify the specific research plans, technical paths, and expected goals for each direction, and enhance the continuity and operability of the research.
Author Response
Comment 1:
The “Keywords” section of the paper only labels “keyword 1; keyword 2; keyword 3” without providing specific keywords.
Response 1:
We appreciate the reviewer’s comment. The keywords section has been corrected to include specific and relevant terms reflecting the paper’s content. The revised list now reads: “Autonomous driving, Adverse weather, Image segmentation, Segment Anything Model, Uncertainty modeling, Aleatoric uncertainty, UAT, Foundation models.” This correction appears immediately after the Abstract (Page 1, Lines 25–27).
Comment 2:
Autonomous driving datasets may contain annotation errors (such as boundary annotation deviations and category mislabeling). Especially in adverse weather conditions, annotation difficulty increases, and these annotation errors can affect the model training effect. The paper does not mention how to handle annotation errors in the dataset, nor does it evaluate the impact of annotation errors on model performance. It is recommended to supplement the description of annotation error handling strategies.
Response 2:
We thank the reviewer for this insightful observation. A clarification has been added in Section IV-B (Page 5, Lines 303–339), describing our strategy for addressing annotation inaccuracies. To mitigate the impact of potential boundary and labeling errors, elastic deformations were applied to the original CamVid masks to simulate multiple plausible ground truths. This generated diverse yet realistic segmentation variants, effectively reducing overfitting to single annotations and enhancing robustness against annotation noise. The revised text explicitly notes that this approach accounts for potential human labeling errors and boundary uncertainty in adverse weather conditions.
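A minimal sketch of this kind of mask deformation (classic Simard-style smooth random displacement fields applied with nearest-neighbour sampling; the `alpha` and `sigma` values here are illustrative, not our training settings):

```python
import numpy as np
from scipy.ndimage import gaussian_filter, map_coordinates

def elastic_deform_mask(mask, alpha=34.0, sigma=4.0, rng=None):
    """Generate one plausible ground-truth variant of a binary mask."""
    rng = rng or np.random.default_rng()
    h, w = mask.shape
    # Smooth random displacement fields: sigma controls smoothness, alpha magnitude.
    dx = gaussian_filter(rng.uniform(-1, 1, (h, w)), sigma) * alpha
    dy = gaussian_filter(rng.uniform(-1, 1, (h, w)), sigma) * alpha
    y, x = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    coords = np.stack([y + dy, x + dx])
    # order=0 (nearest neighbour) keeps the deformed mask binary, avoiding soft edges.
    return map_coordinates(mask.astype(np.uint8), coords, order=0, mode="nearest")
```

Sampling several such variants per image trains the model against a distribution of plausible annotations rather than a single, possibly imperfect, label.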
Comment 3:
The paper only focuses on model segmentation accuracy and does not analyze indicators related to model efficiency. It is recommended to supplement the model efficiency analysis, including indicators such as inference time, model parameter quantity, and computational complexity, to evaluate the feasibility of the method in practical applications.
Response 3:
We agree with the reviewer that model efficiency is an important factor for real-world deployment. A detailed model efficiency analysis has been added to Section IV-B (Pages 4–5, Lines 290–301). The text now specifies that UAT-SAM adds only 8.9 million trainable parameters (approximately 13 percent of SAM-ViT-B) and introduces a 5.04 percent inference-time overhead (+16.67 ms per image). These results confirm that the proposed approach maintains near real-time performance and practical feasibility for autonomous driving systems.
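For reference, figures of this kind can be obtained with standard measurements; the sketch below is a generic illustration (not our benchmarking script) of how the trainable-parameter count and per-image latency are typically measured:

```python
import time
import torch

def trainable_params_millions(model: torch.nn.Module) -> float:
    """Only the adapter/CVAE weights require gradients in UAT-SAM."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad) / 1e6

@torch.no_grad()
def mean_latency_ms(model, image, warmup=10, runs=100):
    """Average per-image inference time; CUDA sync keeps GPU timings honest."""
    model.eval()
    for _ in range(warmup):
        model(image)
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(runs):
        model(image)
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    return (time.perf_counter() - start) / runs * 1e3
```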
Comment 4:
The current comparative experiments only compare the proposed method with zero-shot SAM/SAM2, and do not compare it with other advanced methods for autonomous driving segmentation under adverse weather conditions. It is recommended to add comparative experiments with mainstream methods in the field and expand the comparison dimensions (such as accuracy, speed, and robustness).
Response 4:
We acknowledge this important suggestion. While the present study focuses on demonstrating improvements relative to SAM and SAM2 baselines, future work will include direct comparisons with state-of-the-art segmentation frameworks such as Mask2Former, SegFormer, and DeepLabv3+. This plan has been added to the Discussion (Page 6, Lines 440–445), which now explicitly mentions benchmarking uncertainty-aware methods against established driving-scene segmentation baselines in terms of accuracy, speed, and robustness.
Comment 5:
The “Future Work” section of the paper mentions expanding the scope of target segmentation and increasing training data and scenarios, but the expression is too general and lacks specific implementation directions. It is recommended to refine the content of future work, clarify the specific research plans, technical paths, and expected goals for each direction, and enhance the continuity and operability of the research.
Response 5:
We thank the reviewer for this valuable feedback. The Future Work section has been refined to outline more concrete next steps. The updated section (Page 7, Lines 485–498) now specifies three actionable directions: (1) extending UAT-SAM to new object categories such as pedestrians, cyclists, and traffic signs; (2) conducting weather-specific adapter training using the ACDC and Boreas datasets; and (3) evaluating scalability through parameter-efficient transfer learning across diverse environmental conditions. These revisions clarify the technical roadmap and ensure the continuity of this research.
Round 2
Reviewer 1 Report
Comments and Suggestions for Authors
The authors have already revised most of the content of the manuscript, which is now relatively complete. I agree to the current version being published.
Author Response
We sincerely thank the reviewer for their positive feedback and recommendation for publication.
Reviewer 3 Report
Comments and Suggestions for Authors
The revised draft has significantly improved clarity, rigor, and expression, and has resolved most of the issues raised in the initial draft. I suggest making some minor revisions before acceptance:
- Regarding comment #5, we note that while Tables 3 and 4 provide overall performance on the BDD100K and CamVid datasets, indicating some generalization ability, the revised version removed the content related to the GTA Driving dataset. However, the GTA Driving dataset is explicitly mentioned in the abstract and other materials, which is contradictory.
Author Response
Comment 1:
Regarding comment #5, we note that while Tables 3 and 4 provide overall performance on the BDD100K and CamVid datasets, indicating some generalization ability, the revised version removed the content related to the GTA Driving dataset. However, the GTA Driving dataset is explicitly mentioned in the abstract and other materials, which is contradictory.
Response 1:
Thank you for your helpful observation. We agree with this comment and have clarified the role of the GTA Driving dataset in the revised manuscript to ensure consistency across all sections. Specifically, we added clear explanations in the Abstract and Section 3 (Datasets) indicating that the GTA Driving dataset was used exclusively for fine-tuning (adaptation) and not for evaluation.
The corresponding revisions are highlighted in blue in the updated version:
Page 1, lines 21–24 (Abstract): “We evaluate these approaches on the CamVid and BDD100K datasets, while the GTA Driving dataset is used exclusively during the fine-tuning process for adaptation and not for evaluation, helping improve generalization to diverse driving conditions.”
Page 3, lines 130–132 (Section 3, Datasets): “The GTA Driving dataset was incorporated only during the fine-tuning phase to improve domain adaptation and model generalization; therefore, it is not included in the quantitative results reported in Section 5.”
