by Fuxiang Li, Hexiao Li, Dongsheng He, et al.

Reviewer 1: Temitope Olubanjo Kehinde
Reviewer 2: Anonymous

Round 1

Reviewer 1 Report

Comments and Suggestions for Authors

Dear Authors,

I am delighted to review your manuscript titled “ISANet: A Real-Time Semantic Segmentation Network Based on Information Supplementary Aggregation Network.” The paper addresses a critical challenge in autonomous driving, namely balancing segmentation accuracy, model size, and inference speed, by introducing ISANet, a lightweight yet effective architecture. The proposed framework integrates three main innovations: the Spatial-Supplementary Lightweight Bottleneck Unit (SLBU) to capture expressive features with minimal parameters, the Missing Spatial Information Recovery Branch (MSIRB) to restore lost spatial details, and the Object Boundary Feature Attention Module (OBFAM) to strengthen multi-stage feature fusion and boundary precision. Through comprehensive experiments on the Cityscapes and CamVid datasets, ISANet demonstrates competitive mean IoU (76.7% and 73.8%) while sustaining real-time performance (58 FPS and 90 FPS) with only 1.37M parameters. The work makes a meaningful contribution to lightweight semantic segmentation research and presents practical value for resource-constrained autonomous driving systems.

My comments are as follows:

  1. Comment 1: I recommend adding a brief roadmap statement at the end of Section 1 to guide readers on the structure of the paper, for example, indicating that Section 2 presents related work, Section 3 details the proposed ISANet architecture, Section 4 discusses experiments, and Section 5 concludes the study. This addition will improve clarity and reader navigation.
  2. Comment 2: In section 2.3, it would be a disservice to present a discussion of attention-based models/Transformer models without citing the foundational work: Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., & Polosukhin, I. (2017). “Attention is all you need.” Advances in Neural Information Processing Systems, 30. Also, it would be good to include how attention-based models have transformed various domains since you claimed that it has gained popularity in recent years. By so doing, try and list some attention-based models like Transformer, Longformer, Reformer, Performer, Helformer and any other deemed fit. I would recommend citing this paper as a reference for the attention mechanism model in finance: Kehinde, T., Adedokun, O. J., Joseph, A., Kabirat, K. M., Akano, H. A., & Olanrewaju, O. A. (2025). Helformer: an attention-based deep learning model for cryptocurrency price forecasting. Journal of Big Data, 12(1), 1-39.
  3. Comment 3: Figures 1–3 are not very clear in their current form. The font size within the diagrams is too small, which makes them difficult to read, and the overall visual presentation is not very appealing. I recommend enlarging the figures slightly, increasing font size for labels and annotations, and improving layout clarity to ensure that the diagrams are both visually appealing and easily interpretable.
  4. Comment 4: Section 5 is currently titled “Results and Discussion”; however, no new experimental results or in-depth critical discussion are actually presented there. Instead, the section primarily summarizes the contributions of ISANet and highlights possible future directions, which aligns more closely with a “Conclusion and Prospects” section. I recommend revisiting the title and content of Section 5 to ensure consistency between the section name and its actual content.

Author Response

Comments 1: I recommend adding a brief roadmap statement at the end of Section 1 to guide readers on the structure of the paper, for example, indicating that Section 2 presents related work, Section 3 details the proposed ISANet architecture, Section 4 discusses experiments, and Section 5 concludes the study. This addition will improve clarity and reader navigation.

 

Response 1: Thank you for your suggestion. We fully agree with the proposal to add a roadmap statement at the end of Section 1. Clear structural navigation helps readers quickly grasp the article’s framework, and for technical papers involving complex architecture design and experimental analysis in particular, it can significantly improve reading efficiency. We have therefore supplemented the relevant content as suggested; the additions are marked in green.

 

Comments 2: In section 2.3, it would be a disservice to present a discussion of attention-based models/Transformer models without citing the foundational work: Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., & Polosukhin, I. (2017). “Attention is all you need.” Advances in Neural Information Processing Systems, 30. Also, it would be good to include how attention-based models have transformed various domains since you claimed that it has gained popularity in recent years. By so doing, try and list some attention-based models like Transformer, Longformer, Reformer, Performer, Helformer and any other deemed fit. I would recommend citing this paper as a reference for the attention mechanism model in finance: Kehinde, T., Adedokun, O. J., Joseph, A., Kabirat, K. M., Akano, H. A., & Olanrewaju, O. A. (2025). Helformer: an attention-based deep learning model for cryptocurrency price forecasting. Journal of Big Data, 12(1), 1-39.

Response 2: Thank you for pointing out the omission of foundational literature. The paper "Attention Is All You Need" by Vaswani et al. is a core work in the fields of Transformers and attention mechanisms; its absence from the references indeed undermines the completeness of the literature review. Meanwhile, supplementing the applications of attention models across various domains, together with representative models, enhances the depth of Section 2.3. Additionally, citing relevant literature in the financial domain expands the application scenarios of attention mechanisms and demonstrates their cross-domain value. We have supplemented all the relevant citations accordingly.

 

Comments 3: Figures 1–3 are not very clear in their current form. The font size within the diagrams is too small, which makes them difficult to read, and the overall visual presentation is not very appealing. I recommend enlarging the figures slightly, increasing font size for labels and annotations, and improving layout clarity to ensure that the diagrams are both visually appealing and easily interpretable.

Response 3: Figures 1–3 are the core technical figures of this paper, and their readability directly affects readers’ understanding of the architecture and design details of ISANet. The issues you identify, such as small font sizes and disorganized layouts, indeed required improvement. We have reviewed and revised all figures in the manuscript, adjusting dimensions, enlarging font sizes, and optimizing layouts to ensure that the technical details are clearly distinguishable.

 

Comments 4: Section 5 is currently titled “Results and Discussion”; however, no new experimental results or in-depth critical discussion are actually presented there. Instead, the section primarily summarizes the contributions of ISANet and highlights possible future directions, which aligns more closely with a “Conclusion and Prospects” section. I recommend revisiting the title and content of Section 5 to ensure consistency between the section name and its actual content.

Response 4: Thank you for your detailed review. We fully agree with your judgment on the positioning of Section 5. The current content of this section, which mainly provides a summary and outlook, is inconsistent with the title "Results and Discussion" and may mislead readers. Adjusting the title and content structure makes the logic of the paper more rigorous. We have revised the conclusion accordingly and marked the revised parts in purple.

Author Response File: Author Response.pdf

Reviewer 2 Report

Comments and Suggestions for Authors

The paper discusses a new lightweight neural network, designed for real-time semantic segmentation in autonomous driving. The core idea is to improve accuracy by recovering spatial information lost during feature extraction, without significantly increasing computational cost or slowing down inference speed.

  • While the components are well-motivated, the overall architecture feels incremental. It follows the well-established pattern of a multi-branch network (like BiSeNet) and adds recovery mechanisms. The Object Boundary Feature Attention Module (OBFAM) is presented as a key differentiator from CBAM, but the changes (using 1x1 conv instead of compression, addition instead of concatenation) are relatively minor.

  • The literature review is largely descriptive rather than analytical. It lists what other models have done but provides a less critical analysis of why they fall short, beyond the generic "lose spatial information" or "increase parameters." A sharper, more critical gap analysis would strengthen the justification for ISANet.
  • Figures 5 and 6 are of very poor quality and nearly impossible to interpret. They are small, blurry, and lack clear differentiation between the models. To support the claims, high-resolution, side-by-side comparisons highlighting areas where ISANet outperforms others (especially on boundaries) are essential.
  • Results are "Good," but is it "The Best?": The mIoU of 76.7% on Cityscapes is respectable for a lightweight model. However, the paper does not sufficiently contextualize this. For instance, SEDNet achieves 76.4% with a higher computational cost, but the trade-off is not deeply discussed. 
  • Incomplete Comparison: Table 8 is intended to compare with other lightweight methods; however, the "Methods" column is misformatted (e.g., "4%Methods", "494SANet"), which makes it confusing. More importantly, a direct comparison of inference speed (FPS) is problematic as it is highly dependent on hardware and implementation. The lack of a standardized comparison (e.g., FLOPs vs. mIoU plot) makes it difficult to judge ISANet's efficiency truly objectively.
  • The quality of the figures is a major issue. The architecture diagram (Figure 2) is complex and cluttered, making it hard to follow the information flow. The bottleneck unit comparisons (Figure 3) are helpful, but the algorithmic figures (Algorithms 1 and 2) contain obvious typos, which undermine the scientific rigor. The tables, while data-rich, suffer from formatting errors (Table 8, first column, for example).

  • The conclusion does not adequately address the model's limitations beyond the mentioned robustness. It does not discuss the potentially small performance gains relative to complexity or the practical challenges of the multi-branch design.

Author Response

Comments 1: While the components are well-motivated, the overall architecture feels incremental. It follows the well-established pattern of a multi-branch network (like BiSeNet) and adds recovery mechanisms. The Object Boundary Feature Attention Module (OBFAM) is presented as a key differentiator from CBAM, but the changes (using 1x1 conv instead of compression, addition instead of concatenation) are relatively minor.

 

Response 1: Although ISANet is built on a multi-branch framework, its improvement is not merely incremental. Its core innovation lies in the design concept of "spatial information compensation and aggregation": through the missing-information compensation branch of the Spatial-Supplementary Lightweight Bottleneck Unit (SLBU), and the dual-branch collaborative recovery mechanism of the Spatial Feature Aggregation Block (SFAB) and the Missing Spatial Information Recovery Branch (MSIRB), a full-pipeline spatial information protection scheme is constructed. This is distinct from BiSeNet, which merely extracts spatial and semantic features separately via two branches.

Although the improvement of the Object Boundary Feature Attention Module (OBFAM) may seem subtle, it specifically addresses key drawbacks of the Convolutional Block Attention Module (CBAM):

The 1×1 convolution avoids feature loss caused by spatial dimension compression in channel attention;

The addition operation preserves the effect of feature fusion while reducing computational latency by more than 30% (experimental verification: with OBFAM the network reaches 58 FPS, whereas with CBAM it reaches only 21 FPS). These seemingly minor changes therefore yield improvements in both accuracy and speed.
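The two structural differences described above can be sketched in plain Python. This is an illustrative toy, not the authors' implementation: the weight matrices, sizes, and reduction ratio are made up, and only the shape-level contrast is shown (CBAM-style gating compresses the pooled channel vector by a reduction ratio before expanding it back, while a 1×1-conv gate applies a full C×C map with no compression; additive fusion keeps the channel count fixed where concatenation doubles it).

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def cbam_channel_attention(pooled, w_down, w_up):
    # CBAM-style gate: compress C -> C/r with ReLU, expand back to C, sigmoid.
    hidden = [max(0.0, sum(w * p for w, p in zip(row, pooled))) for row in w_down]
    return [sigmoid(sum(w * h for w, h in zip(row, hidden))) for row in w_up]

def conv1x1_channel_attention(pooled, w_full):
    # 1x1-conv variant: a full C x C map on the pooled vector, no compression.
    return [sigmoid(sum(w * p for w, p in zip(row, pooled))) for row in w_full]

def fuse_add(a, b):
    # Element-wise addition keeps C channels and adds no parameters.
    return [x + y for x, y in zip(a, b)]

def fuse_concat(a, b):
    # Concatenation doubles the channel count and needs a follow-up projection.
    return a + b

if __name__ == "__main__":
    pooled = [0.2, -0.5, 1.0, 0.3]        # C = 4 globally pooled channel values
    w_down = [[0.1] * 4, [0.2] * 4]       # compression to C/r = 2 (r = 2)
    w_up = [[0.3, 0.3]] * 4               # expansion back to C = 4
    w_full = [[0.1] * 4] * 4              # full 4 x 4 map, no compression
    gate_cbam = cbam_channel_attention(pooled, w_down, w_up)
    gate_1x1 = conv1x1_channel_attention(pooled, w_full)
    assert len(gate_cbam) == len(gate_1x1) == 4
    assert len(fuse_add(gate_cbam, gate_1x1)) == 4
    assert len(fuse_concat(gate_cbam, gate_1x1)) == 8
```

On the speed claim: moving from 21 FPS to 58 FPS corresponds to a per-frame latency drop from roughly 47.6 ms to 17.2 ms, about 64%, so the "more than 30%" figure stated above is conservative.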

 

Comments 2: The literature review is largely descriptive rather than analytical. It lists what other models have done but provides a less critical analysis of why they fall short, beyond the generic "lose spatial information" or "increase parameters." A sharper, more critical gap analysis would strengthen the justification for ISANet.

Response 2: Thank you for your comments. We have reorganized the content of the related work in Section 2, and highlighted the unique value of ISANet in the "lightweightness-accuracy-speed" triangular balance through comparative discussion. The relevant content is marked in blue in Section 2.

 

Comments 3: Figure 5 and 6 are of very poor quality and nearly impossible to interpret. They are small, blurry, and lack clear differentiation between the models. To support the claims, high-resolution, side-by-side comparisons highlighting areas where ISANet outperforms others (especially on boundaries) are essential.

 

Response 3: We would like to express our sincere gratitude for your detailed review of our manuscript. We have identified the issues in Figures 5 and 6 and revised them accordingly. Meanwhile, we have added new content to the description of the comparative conclusions to highlight the advantages of ISANet over other models in scenarios such as boundary segmentation, and the added content has been marked in blue.

 

Comments 4: Results are "Good," but is it "The Best?": The mIoU of 76.7% on Cityscapes is respectable for a lightweight model. However, the paper does not sufficiently contextualize this. For instance, SEDNet achieves 76.4% with a higher computational cost, but the trade-off is not deeply discussed.

 

Response 4: The mean Intersection over Union (mIoU) of 76.7% achieved by ISANet on the Cityscapes dataset cannot simply be called "globally optimal"; however, ISANet can be regarded as one of the best performance-efficiency trade-offs within the category of lightweight models (parameter count ≤ 2M and FLOPs ≤ 15G). In absolute accuracy, some non-lightweight models (e.g., SegFormer-B1, with 11.6M parameters and an mIoU of 77.5%) score slightly higher. Nevertheless, the core advantage of ISANet lies in the balance between light weight and high accuracy, elaborated as follows:

When compared with lightweight models of the same magnitude:

Compared with SEDNet (6.63M parameters, 65.9 GFLOPs, 76.4% mIoU), ISANet requires only 20.7% of the parameters and 19.1% of the FLOPs while achieving an mIoU 0.3 percentage points higher.

Compared with LETNet (0.95M parameters, 13.6 GFLOPs, 72.8% mIoU), ISANet achieves an mIoU 3.9 percentage points higher with only a 44% increase in parameters and a mere 7% increase in FLOPs, demonstrating significantly higher cost-effectiveness.
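The trade-off ratios quoted in these comparisons can be checked with a few lines of arithmetic. The parameter counts and mIoU values below are taken directly from this response (not re-measured), and the variable names are ours:

```python
# Reported Cityscapes figures from the response; treated as given, not re-measured.
isanet = {"params_M": 1.37, "miou": 76.7}
sednet = {"params_M": 6.63, "miou": 76.4}
letnet = {"params_M": 0.95, "miou": 72.8}

# ISANet uses about 20.7% of SEDNet's parameters at 0.3 mIoU points higher.
param_ratio_vs_sednet = isanet["params_M"] / sednet["params_M"]
miou_gain_vs_sednet = isanet["miou"] - sednet["miou"]

# Versus LETNet: roughly 44% more parameters for a 3.9-point mIoU gain.
param_increase_vs_letnet = isanet["params_M"] / letnet["params_M"] - 1.0
miou_gain_vs_letnet = isanet["miou"] - letnet["miou"]

assert abs(param_ratio_vs_sednet - 0.207) < 1e-3
assert abs(miou_gain_vs_sednet - 0.3) < 1e-6
assert abs(param_increase_vs_letnet - 0.44) < 5e-3
assert abs(miou_gain_vs_letnet - 3.9) < 1e-6
```

Each assertion mirrors a ratio claimed in the surrounding text.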

From the perspective of real-time semantic segmentation requirements: The current application scenarios impose strict constraints on model parameters and computational complexity (usually requiring parameter count ≤ 2M and FLOPs ≤ 20G). ISANet fully meets these requirements, with an inference speed of 58 FPS (exceeding the minimum standard of 30 FPS for real-time segmentation). In contrast, SEDNet fails to meet such requirements due to its excessive computational complexity.

Therefore, in resource-constrained scenarios such as autonomous driving, ISANet outperforms most existing models in the balance of "performance-cost-speed" and thus possesses distinct advantages.

 

Comments 5: Incomplete Comparison: Table 8 is intended to compare with other lightweight methods; however, the "Methods" column is misformatted (e.g., "4%Methods", "494SANet"), which makes it confusing. More importantly, a direct comparison of inference speed (FPS) is problematic as it is highly dependent on hardware and implementation. The lack of a standardized comparison (e.g., FLOPs vs. mIoU plot) makes it difficult to judge ISANet's efficiency truly objectively.

 

Response 5: Thank you for the careful reading of our manuscript. We have checked the final manuscript in detail and corrected the errors in the tables. Regarding the comprehensive evaluation of the model, as you pointed out, frames per second (FPS) is highly susceptible to hardware conditions, so more emphasis should be placed on hardware-independent metrics such as segmentation accuracy and floating-point operations (FLOPs), where ISANet holds certain advantages. We have rewritten Section 4.3.1 to enable an objective assessment of ISANet’s efficiency, and the rewritten part is marked in blue.

 

Comments 6: The quality of the figures is a major issue. The architecture diagram (Figure 2) is complex and cluttered, making it hard to follow the information flow. The bottleneck unit comparisons (Figure 3) are helpful, but the algorithmic figures (Algorithms 1 and 2) contain obvious typos, which undermine the scientific rigor. The tables, while data-rich, suffer from formatting errors (Table 8, first column, for example).

 

Response 6: Thank you for your valuable comments. We have rearranged the figures in the manuscript to enable an intuitive presentation of the research ideas. We have also conducted a detailed check on Algorithm 1 and Algorithm 2, corrected the erroneous descriptions in the tables, and revised both the format and content of the tables in the manuscript—all to ensure academic rigor and readability.

 

Comments 7: The conclusion does not adequately address the model's limitations beyond the mentioned robustness. It does not discuss the potentially small performance gains relative to complexity or the practical challenges of the multi-branch design.

 

Response 7: Thank you for your comments on our manuscript. We have revised the conclusion (Section 5), adding a discussion and outlook on future research directions, focusing specifically on the cost-effectiveness of model complexity versus performance improvement and on the practical challenges of the multi-branch design. The revised content is marked in purple.

Author Response File: Author Response.pdf

Round 2

Reviewer 2 Report

Comments and Suggestions for Authors

I thank the authors for considering my remarks. I recommend publishing the paper.