RSM-Optimizer: Branch Optimization for Dual- or Multi-Branch Semantic Segmentation Networks
Round 1
Reviewer 1 Report
Comments and Suggestions for Authors
- The paper does mention the number of parameters (Params) for each model, but it doesn’t provide a clear discussion of computational complexity (e.g., FLOPs).
- In Section 3.2, some design choices seem arbitrary and need stronger justification. For example, in RF-FEM, why were dilation rates of 1, 3, 5, and 7 chosen?
- Why not use an adaptive approach or learnable dilation? Similarly, why does PMPM use k×1 and k kernels instead of other pooling strategies like deformable pooling or adaptive pooling?
- SUFM is introduced to reduce noise in upsampling, but the explanation is mostly qualitative. Some quantitative analysis on how much noise is reduced would help validate its effectiveness.
- In Section 3.2, RF-FEM tries to expand receptive fields in the detail branch, but how much does it really improve over standard atrous convolutions or dilated residual blocks?
- The tables contain a lot of numbers, but key results aren’t highlighted. Bold important values (such as the best mIoU or FPS) in Section 4.2 so they are easier to compare.
- The paper proposes SUFM to improve feature fusion, but how does it compare to other fusion strategies?
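The receptive-field question raised above (why dilation rates of 1, 3, 5, and 7?) can be illustrated with a quick calculation: for stride-1 convolutions, each 3×3 layer with dilation d widens the effective receptive field by d·(k−1). A minimal sketch, assuming RF-FEM stacks one 3×3 convolution per dilation rate (the paper's exact layout may differ):

```python
# Receptive-field growth for stacked 3x3 dilated convolutions (stride 1).
# Assumption: one conv per dilation rate, applied sequentially; this is an
# illustration of the reviewer's question, not the paper's exact RF-FEM design.

def stacked_receptive_field(dilations, kernel=3):
    """Effective receptive field of sequential dilated convs (stride 1)."""
    rf = 1
    for d in dilations:
        rf += d * (kernel - 1)  # each layer widens the RF by d * (k - 1)
    return rf

# Dilations 1, 3, 5, 7, as questioned above:
print(stacked_receptive_field([1, 3, 5, 7]))   # 33
# Versus four standard (dilation-1) 3x3 convs:
print(stacked_receptive_field([1, 1, 1, 1]))   # 9
```

The odd, increasing rates avoid the gridding artifacts that arise when all layers share one even dilation, which may be part of the justification the reviewer is asking the authors to state explicitly.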
Author Response
Please see the attachment
Author Response File: Author Response.pdf
Reviewer 2 Report
Comments and Suggestions for Authors
This paper presents a study aimed at improving the semantic segmentation technique, which accurately distinguishes objects in images, by making it faster and more precise. Existing models such as BiSeNet, DDRNet, and PIDNet suffer from two main issues: ① the detail branch has a limited receptive field, making it difficult to distinguish small objects, and ② noise is introduced when upsampling the semantic branch. To address these problems, the authors introduce ① RF-FEM to expand the detail branch's receptive field, ② SUFM to reduce noise during the stepwise upsampling of semantic information, and ③ PMPM to enhance the model's ability to recognize objects of various sizes and shapes. Experimental results on the Cityscapes and CamVid datasets demonstrate that the proposed method achieves higher accuracy while maintaining fast inference compared to existing models.
---
<Introduction>
1. The importance of the problem addressed in this paper is well explained. However, expressions such as "significant applications" should be supported with more concrete examples and statistical evidence.
<Conclusion>
2. The sentence "Experimental results on the Cityscape and CamVid datasets confirmed the performance improvements achieved by our solution." is somewhat general. Including a numerical summary of the improvements would make the statement more effective.
<Proposed Approach>
3. The paper lacks a discussion on how much the Receptive Field Fusion Block (RFFB) increases computational complexity compared to existing methods. A more detailed analysis of the impact on model complexity and actual inference time is needed.
4. The explanation "SUFM consists of two WFBs" does not clearly describe how the Weighted Fusion Block (WFB) operates. Specifically, based on Equations (4) and (5), it is unclear how the weight computation function fam(·) is implemented.
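One way for the authors to resolve the ambiguity around fam(·) would be a short snippet or pseudocode in the paper. Purely as a hedged illustration of the kind of weighted fusion Equations (4) and (5) might describe (not the paper's actual WFB), a common pattern derives a per-channel gate from global average pooling followed by a sigmoid and blends the two branches with it:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def weighted_fusion(detail, semantic):
    """Blend two (C, H, W) feature maps with a channel-wise gate.

    Assumption: f_am(.) is global average pooling + sigmoid, producing a
    per-channel weight w so that out = w * detail + (1 - w) * semantic.
    This is an illustrative guess at the WFB, not the paper's definition.
    """
    gap = detail.mean(axis=(1, 2), keepdims=True)  # (C, 1, 1) channel statistics
    w = sigmoid(gap)                               # gate values in (0, 1)
    return w * detail + (1.0 - w) * semantic

# Tiny example: two 2-channel 4x4 feature maps
rng = np.random.default_rng(0)
a, b = rng.standard_normal((2, 4, 4)), rng.standard_normal((2, 4, 4))
fused = weighted_fusion(a, b)
print(fused.shape)  # (2, 4, 4)
```

Because the gate lies in (0, 1), the output is a convex combination of the two branches at every position, which is what makes this family of fusions noise-suppressing rather than purely additive.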
<Experiments>
5. The example in Figure 5, where PIDNet-S shows better object boundary recognition, is a positive aspect. However, a more systematic evaluation is needed to determine whether this improvement is consistent across different scenarios (e.g., analyzing IoU improvements for specific classes).
<Errors>
6. "It follows a polynomial decay strategy (poly strategy) on the both datasets."
→ The phrase "the both datasets" is incorrect. It should be revised to "both datasets."
7. "In the future, we plan to tackle the challenges of semantic segmentation in real-time scenarios."
→ The word "tackle" should be replaced with "address" as it is more appropriate for academic writing.
Author Response
Please see the attachment
Author Response File: Author Response.pdf
Reviewer 3 Report
Comments and Suggestions for Authors
My main suggestion would be to link the introduction, method and results sections with the research context. These aspects remain largely independent from each other, where the context is only really introduced in the method and the analytical approach becomes the primary focus through most of the paper. More detailed comments can be found below, which I hope the authors find constructive.
- The introduction is generally well-prepared. However, the introduction moves into analytical performance too quickly before adequately explaining the context. Is there an opportunity to help readers understand, in greater detail, the research background? For example, the analysis uses the Cityscapes dataset, yet there doesn’t appear to be any mention of urban or city landscapes in the introductory sections. The first mention appears to be on line 231, which may leave readers wondering why the datasets were important to consider.
- Sections 3 and 4 seem to have some relation, albeit this is not overly clear. I believe there are opportunities to link important aspects of the method to the dataset selection. Otherwise, other readers may also wonder what the importance of highlighting many details of the analytical model (e.g., RF-FEM and MPM structures) without connecting these explanations to how they were applied to the datasets.
- Line 143 to 144. This was an interesting point to highlight. Could this be used to explain why the authors decided to evaluate the two datasets on line 231?
- Would it be possible to include more details about the datasets? For example, which cities were used and how were the images originally generated (e.g., high-resolution camera images?)?
- Should lines 234 to 236 and lines 238 to 240 be included in the training part of 4.1? Moreover, why were the dataset split ratios expressed by the number of images for the Cityscapes dataset, but for the CamVid dataset, this was shown by the actual ratio (i.e., 6:1:3)? Wouldn’t this be clearer if they were both expressed consistently?
- Does Resolution need to include the units (Pi) in Table 2 and Table 3?
- Line 268: Should more explanation be given about the higher mIoU? Although this is accurate, the differences in mIoU across the models don’t appear to be substantial. For example, the difference between model4 and model0 is 1.5. Can the authors give more details about how this verifies the effectiveness of the models? I would be interested in understanding this more.
- Figure 5. This is likely an important illustration, but it appears just before the conclusions. Should this section be moved to an earlier section of the results? Also, Figure 5(a) shows the Input Image, and Figure 5(b) indicates the Ground Truth. Is Figure 5(b) the false color version of the same Input Image? Line 311 indicates that they are both the input images, which is not very clear. Although I generally understood these images, readers may not understand that the two images are essentially the same.
- The authors may want to consider how they envision applying the model in practice and what benefits this approach could potentially convey. The approach is interesting, yet this aspect remained somewhat undefined in the paper as the results directly led to the conclusions.
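The per-class analysis suggested above can be made concrete: mIoU is the mean of per-class intersection-over-union, so a modest 1.5-point mean gap can hide large gains on a few small or rare classes. A minimal sketch of the metric (the toy label maps below are illustrative, not from the paper):

```python
import numpy as np

def per_class_iou(pred, gt, num_classes):
    """Per-class IoU for integer label maps; NaN for classes absent from both."""
    ious = []
    for c in range(num_classes):
        inter = np.logical_and(pred == c, gt == c).sum()
        union = np.logical_or(pred == c, gt == c).sum()
        ious.append(inter / union if union > 0 else np.nan)
    return np.array(ious)

# Toy 1-D "segmentation maps" with three classes
gt   = np.array([0, 0, 1, 1, 2, 2])
pred = np.array([0, 0, 1, 2, 2, 2])
ious = per_class_iou(pred, gt, num_classes=3)
print(ious)              # per-class IoU: one misplaced pixel halves class 1's IoU
print(np.nanmean(ious))  # mIoU averages these, diluting the per-class effect
```

Reporting this per-class breakdown for Cityscapes (e.g., rider, pole, traffic sign) would directly answer whether the boundary improvements in Figure 5 are systematic.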
Author Response
Please see the attachment
Author Response File: Author Response.pdf
Round 2
Reviewer 3 Report
Comments and Suggestions for Authors
The revisions and inclusion of background information about the datasets were appreciated. There were a few minor formatting issues that appeared in the revised paper, which should be checked. Please also check if other formatting errors are present in other sections of the paper.
1. Line 26: Please check "[4? ]". This might be an unfortunate formatting error.
2. Line 354. Similarly, please check "Table ??"
Author Response
Please see the attachment.
Author Response File: Author Response.pdf