LSTM-CA-YOLOv11: A Road Sign Detection Model Integrating LSTM Temporal Modeling and Multi-Scale Attention Mechanism
Round 1
Reviewer 1 Report
Comments and Suggestions for Authors
Some comments to improve the article, mainly associated with editorial aspects:
1. The main aim of this article was to show the advantages of the approach proposed by the authors. They demonstrated these advantages by comparing the approach with other methods, so the aim was achieved.
2. Regarding the research gap: there is no ideal approach for detecting all possible road signs on any road (most approaches are useful only in selected countries). This gap cannot be closed by a single research work, so each new approach that differs from existing ones contributes to covering part of it. In summary, the authors proposed a new approach, which can be considered as covering part of this gap.
3. The method is described clearly in Chapter 3, and no critical issues were detected.
4. The presentation of outcomes is also clear to me, although it could be criticized on the following aspects:
- dataset: more details could be provided about this database, such as the roads where it was collected, the number of kilometers covered, the times of year when data were gathered, and the number of signs in each category of the classification from Figure 7;
- problems associated with the outcomes (for example, Figure 10 makes clear that surrounding cars can hinder detection).
In my opinion, however, such information belongs in a discussion chapter.
5. It is recommended to include a discussion chapter presenting the limitations and uncertainties of the research work. Minimal suggestions are given in point 4.
6. The authors should not forget to address all the editorial comments listed below.
7. Chapters 3 and 4 have the same title (“Methods”). The introduction states that Chapter 4 will contain the results, so this inaccuracy should be corrected.
8. Figure 7: the examples of the examined sign types should be presented more legibly, and/or the photographs should be better cropped and of higher quality.
9. Tables 1, 2, and 4 require editorial correction: it is recommended not to split a table across two pages; if a split is unavoidable, the header row should be repeated. In this particular case, it is recommended to move each table so that it starts on a new page.
10. Figures 8, 9, and 10 / lines 540-545: personal pronouns (“our”, “ours”) should not be used in scientific articles; impersonal forms are recommended.
11. There is a gap under the heading of Subsection 4.4; correct this and check the entire article for similar gaps.
12. Line 632: a space is missing before the word ‘this’.
Author Response
Please see the attachment.
Author Response File: Author Response.pdf
Reviewer 2 Report
Comments and Suggestions for Authors
The paper presents a well-motivated and technically sound approach to road sign detection, with thoughtful integration of attention, multi-scale fusion, and structured feature modeling. The experimental results are promising, and the ablation studies are largely thorough. However, the following comments would further improve the quality of the paper.
- The paper claims that the Bi-LSTM module performs “temporal modeling,” but the input consists of static images, not video sequences. The LSTM processes spatially serialized feature maps, which constitutes structured spatial sequence modeling, not true temporal modeling, and this terminology should be revised to avoid confusion (see the first sketch after this list).
- The choice of sequence length (S = 8, yielding 64 tokens) for feature serialization lacks justification or ablation; alternative tokenization strategies (e.g., patch-based or adaptive grid partitioning) are not considered, leaving readers uncertain whether this hyperparameter is optimal or arbitrary.
- Sections 3.2.3, 3.3.1, and 3.3.2 are all titled “Feature Re-calibration,” creating structural confusion; one of these subsections actually describes branch design in MSEF, not recalibration, and the section headings should be renamed to reflect their distinct purposes.
- The paper removes residual connections in the Coordinate Attention (CA) module to “force attention learning,” but provides no ablation study comparing performance with and without the residual path, making it unclear whether this design choice genuinely improves results or harms gradient flow.
- The proposed Focal-IoU loss (Eq. 32) combines focal weighting with IoU-based regression in an ad hoc manner, without theoretical justification or citation of prior similar approaches, raising concerns about its principled design (see the second sketch after this list).
- The dataset exhibits significant class imbalance (e.g., 7,464 prohibition signs vs. 1,621 auxiliary signs), yet per-class AP metrics are omitted; high overall mAP may mask poor performance on minority categories, which is critical for real-world deployment.
- Mosaic and MixUp augmentations, standard in YOLO training, are disabled to “reduce interference,” but this decision may unfairly handicap baseline models and limit the generalizability of results; the paper should either justify this choice or include a control experiment with standard augmentations enabled.
- The TT100K evaluation uses only 20 out of 221 classes (>50 instances each), but it is unclear whether the model was fine-tuned on these classes or evaluated in a zero-shot setting; without clarification, the generalization claims lack rigor.
- FPS values are reported without specifying inference conditions (e.g., batch size, precision mode, TensorRT usage), making speed comparisons between models potentially misleading; all latency metrics should be normalized under identical hardware and software settings (see the third sketch after this list).
- On TT100K, the paper only compares against YOLOv13n, omitting published state-of-the-art results from peer-reviewed works; this weakens the claim of superior generalization and should be addressed by including at least 2–3 established baselines.
- The Bi-LSTM’s contribution is overstated as enabling the model to “effectively read” sign layout like text, which anthropomorphizes a spatial modeling mechanism; the language should be tempered to reflect that it captures long-range spatial dependencies, not linguistic structure.
- While model size (Params) and theoretical cost (GFLOPs) are reported, there is no breakdown of computational overhead per module (e.g., how much the Bi-LSTM or MSEF slows inference); such analysis would help assess practical trade-offs (see the fourth sketch after this list).
- Visualization in Figure 8 shows only successful detections; inclusion of failure cases (e.g., missed small signs, false positives under occlusion) would provide a more balanced assessment of robustness.
- The heavy reliance on non-peer-reviewed arXiv preprints (Refs [23]–[25]) as foundational backbones and baselines risks propagating unverified claims; if retained, these must be clearly labeled as preliminary and not treated as authoritative benchmarks.
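To make the first two comments above concrete, here is a minimal PyTorch sketch of spatially serialized Bi-LSTM processing; module and variable names are hypothetical illustrations rather than the authors' code, and grid = 8 mirrors the paper's S = 8 (64-token) setting:

```python
import torch
import torch.nn as nn

class SpatialBiLSTM(nn.Module):
    """Serializes a static feature map into a token sequence over
    spatial positions; the 'sequence' axis comes from one image's
    spatial grid, not from video frames, so this is structured
    spatial sequence modeling rather than temporal modeling."""

    def __init__(self, channels: int, grid: int = 8):
        super().__init__()
        assert channels % 2 == 0, "bidirectional halves require even channels"
        self.grid = grid  # S = 8 -> S * S = 64 tokens, as in the paper
        self.bilstm = nn.LSTM(input_size=channels,
                              hidden_size=channels // 2,
                              bidirectional=True,
                              batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = x.shape
        g = nn.functional.adaptive_avg_pool2d(x, self.grid)  # (B, C, S, S)
        seq = g.flatten(2).transpose(1, 2)                   # (B, S*S, C)
        out, _ = self.bilstm(seq)                            # (B, S*S, C)
        return out.transpose(1, 2).reshape(b, c, self.grid, self.grid)

# A single image's backbone feature map, not a video clip:
feat = torch.randn(1, 256, 20, 20)
print(SpatialBiLSTM(256)(feat).shape)  # torch.Size([1, 256, 8, 8])
```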
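On the Focal-IoU comment: one published precedent that could be cited and compared against is Focal-EIoU (Zhang et al.), which reweights the IoU regression term by IoU^γ. A minimal sketch of that style of weighting, offered as an illustration of the prior art rather than a reconstruction of the paper's Eq. 32:

```python
import torch

def focal_weighted_iou_loss(iou: torch.Tensor, gamma: float = 0.5) -> torch.Tensor:
    """Focal-EIoU-style reweighting: the plain IoU loss (1 - IoU) is
    scaled by IoU**gamma, so higher-quality boxes keep more gradient
    while the many low-IoU matches are suppressed. Illustrative only;
    whether Eq. 32 matches this formulation should be clarified."""
    return iou.clamp(min=0.0) ** gamma * (1.0 - iou)

print(focal_weighted_iou_loss(torch.tensor([0.2, 0.5, 0.9])))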
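On latency reporting, a sketch of the kind of normalized FPS protocol intended (a hypothetical helper; the essential elements are a fixed batch size, a fixed precision mode, warm-up, and explicit CUDA synchronization):

```python
import time
import torch

@torch.no_grad()
def measure_fps(model, size=640, batch=1, half=False, warmup=50, iters=200):
    """Normalized latency protocol: fixed batch size and precision,
    GPU warm-up, and explicit synchronization before reading the clock,
    so all compared models are timed under identical conditions."""
    device = torch.device("cuda")
    model = model.to(device).eval()
    x = torch.randn(batch, 3, size, size, device=device)
    if half:
        model, x = model.half(), x.half()
    for _ in range(warmup):          # stabilize clocks and CUDA caches
        model(x)
    torch.cuda.synchronize()
    t0 = time.perf_counter()
    for _ in range(iters):
        model(x)
    torch.cuda.synchronize()         # flush queued GPU work before timing
    return iters * batch / (time.perf_counter() - t0)
```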
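On per-module overhead, the standard PyTorch profiler already provides the requested breakdown, for example:

```python
import torch
from torch.profiler import profile, ProfilerActivity

@torch.no_grad()
def per_module_breakdown(model, x):
    """Prints where inference time is spent, operator by operator,
    so the incremental cost of the Bi-LSTM and MSEF modules can be
    isolated instead of being inferred from total GFLOPs."""
    with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
        model(x)
    print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=15))
```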
Author Response
Please see the attachment.
Author Response File: Author Response.pdf
Reviewer 3 Report
Comments and Suggestions for Authors
The paper proposes an improved YOLOv11-based detector by inserting an LSTM module at the end of the backbone network, embedding a Coordinate Attention module, and integrating an MSEF module for more robust multi-scale detection. The model is thoroughly evaluated on both private and public benchmarks, demonstrating superior performance in terms of mAP and small-sign detection. However, some methodological choices lack sufficient transparency, and key architectural details (e.g., LSTM pipeline, loss calibration, generalization methodology) should be more rigorously justified. I advise the following revisions to improve the clarity, reproducibility, and impact of the work.
1. In Section 3.1, you describe the adaptive pooling and sequential reshaping (Eq. 1-2), but the exact position in the backbone where the LSTM is inserted is unclear. Is it after C3k2 or before the neck? Please add a diagram or more explicit text.
2. Table 1 reports model parameters and GFLOPs for some models only. Provide full profiling for all baseline methods, including YOLOv13 and DETR variants, to justify computational efficiency claims.
3. The training setup avoids Mosaic and MixUp because they “interfere with learning basic features.” However, this choice may harm model generalization. Provide ablation to justify their removal or clarify the training context.
4. In Section 3.5, you propose a combined loss addressing imbalance and regression. Adding a convergence plot or comparison to GIoU or EIoU in terms of localization precision over epochs would strengthen the section.
5. YOLOv12 and YOLOv13 are mentioned but not fully described. Claims about architectural conflicts (e.g., global hypergraph computation) should be supported by citing official sources or supplementary experiments.
6. Table 7 reports impressive gains over YOLOv13n on TT100K, but it is unclear if the same data splits and training schedules were reused. Please confirm consistency to ensure fair comparison.
7. The private dataset has significant class imbalance (e.g., warning signs vs auxiliary) and is dominated by two categories. Discuss how this affects training and how Focal-IoU helps mitigate it.
8. Figures 2–6 are complex and dense. Add clear labels (e.g., kernel sizes, dimensions) and briefly explain arrows or connections in captions to aid readability.
9. Expressions like “completely enhances positional awareness” or “perfectly addresses challenges” are too strong and not necessary. Tone down these claims for academic precision.
10. The inclusion of a structured outline of how the LSTM, attention, and multi-scale modules interact in code would greatly enhance reproducibility; an illustrative outline is sketched after this list.
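Regarding points 1 and 10, an outline of the requested kind could look as follows. All names are hypothetical and this is a sketch of one plausible interaction, not the authors' implementation; the exact LSTM insertion point still needs to be stated in the text:

```python
import torch.nn as nn

class LSTMCAYOLOOutline(nn.Module):
    """Hypothetical outline of how the components might interact:

        backbone -> Bi-LSTM block (end of backbone; exact position
        relative to C3k2 / the neck should be clarified, cf. point 1)
        -> Coordinate Attention per pyramid level -> MSEF fusion
        -> detection head
    """

    def __init__(self, backbone, bilstm, ca_modules, msef, head):
        super().__init__()
        self.backbone, self.bilstm = backbone, bilstm
        self.ca = nn.ModuleList(ca_modules)   # one CA block per scale
        self.msef, self.head = msef, head

    def forward(self, x):
        p3, p4, p5 = self.backbone(x)          # multi-scale features
        p5 = self.bilstm(p5)                   # spatial sequence modeling
        feats = [ca(p) for ca, p in zip(self.ca, (p3, p4, p5))]
        return self.head(self.msef(feats))     # fuse, then detect
```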
Author Response
Please see the attachment.
Author Response File: Author Response.pdf
Reviewer 4 Report
Comments and Suggestions for Authors
Dear Authors,
Please address the following comments:
- Present the confusion matrix used to obtain P and R in Table 1 (a sketch of how P and R derive from its counts follows this list).
- Include a reference(s) for the different attention mechanisms used in Table 3.
- Include a reference(s) for the different loss functions used in Table 4.
- Include an explanation of how the FPS values reported in Table 5 are calculated.
- Include a reference(s) for the different multi-scale fusion strategies in Table 6.
- Replace the kanji characters in line 648.
- Explain what the numbers associated with the different line colors represent in the legend of Figure 9.
- Include some ideas for future work.
- Since the dataset has been approved for use in academic publications, please provide a public GitHub repository containing all the data used to ensure the reproducibility of the proposal.
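For the first comment, a small sketch of how P and R follow from the confusion-matrix counts (TP/FP/FN at a fixed IoU and confidence threshold; all numbers below are invented for illustration):

```python
def precision_recall(tp: int, fp: int, fn: int):
    """P and R as derived from detection confusion-matrix counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# e.g., 90 correct detections, 10 false alarms, 20 missed signs:
print(precision_recall(90, 10, 20))  # (0.9, 0.8181...)
```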
Author Response
Please see the attachment.
Author Response File: Author Response.pdf
Reviewer 5 Report
Comments and Suggestions for Authors
Dear Authors,
Thank you for your comprehensive and insightful research on a road sign detection model (LSTM-CA-YOLOv11). The topic is both timely and highly relevant to the current advancements in the field.
However, I would like to offer several suggestions that could help improve the clarity and impact of your manuscript:
– Please move the analysis of existing models and the cited sources from the Introduction section to the Related Work section.
– Please explicitly clarify which aspects of the work are novel relative to prior studies that address similar problems. This should include a critical analysis of existing approaches, highlighting their advantages and disadvantages. If possible, provide a comparative table summarizing the existing models, their key characteristics, and the identified gaps. Such a structured comparison will help demonstrate how your proposed novelty achieves measurable improvements over these models.
– Please clarify the limitations of the study.
– Please introduce all abbreviations, symbols, and terms at their first occurrence and use them consistently across the entire manuscript.
Thank you once again for your valuable contribution.
Author Response
Please see the attachment.
Author Response File: Author Response.pdf
Round 2
Reviewer 2 Report
Comments and Suggestions for Authors
The concerns have been fully resolved. The revisions significantly improve scientific rigor, clarity, and reproducibility. The paper is now suitable for acceptance pending any final editorial checks.