Review Reports
- Yu Sun1,
- Wenhao Chen2,* and
- Yihang Qin2
- et al.
Reviewer 1: José Joel Gonzalez-Barbosa Reviewer 2: Anonymous Reviewer 3: Anonymous Reviewer 4: Mughair Aslam Bhatti
Round 1
Reviewer 1 Report
Comments and Suggestions for AuthorsThis paper presents an enhanced YOLOv11 algorithm that incorporates a lightweight gating mechanism and subspace attention. By reconstructing the C3k2 module into a hybrid architecture with Gated Bottleneck Convolutions (GBC), the model more effectively captures subtle Braille dot-matrix features. In addition, an ultra-lightweight subspace attention module (ULSAM) strengthens focus on Braille regions, while the SDIoU loss function improves the accuracy of bounding box regression
General comments:
1) On page 5, lines 15-197 said: "To enhance the perception of small Braille regions and suppress background noise, two ULSAM modules are embedded within the Backbone. However, Figure 1 doesn't show the ULSAM module in the Backbone. The authors must correct this mistake.
2) Page 9, lines 331-335, defines H, W, and k. However, in these lines, there doesn't exist a k variable.
3) The link https://pan.baidu.com/s/1WyLDJKfJb0f884FiIi12Gw?pwd=wqan
is written on page 13, line 515; the instructions aren't in English. The instruction should be in English.
Results and Analysis:
4) While performance metrics are strong, there is no error analysis or qualitative examination of false positives/negatives in challenging scenarios.
5) The study should report mAP@50, mAP@75, and mAP@[.5:.95] to align with current object detection benchmark practices as follows:
https://doi.org/10.1016/j.neucom.2025.130606
https://doi.org/10.3390/signals6030046
https://doi.org/10.1016/j.marpolbul.2025.118136
https://doi.org/10.3390/agronomy15081824
6) The authors must compare their work with YOLOv12, as proposed in the last point.
Discussion:
7) Comparable results from previous studies should be incorporated into the discussion, with explicit comparisons of techniques and outcomes; broader contextualization is needed.
Author Response
Comment1:[On page 5, lines 15-197 said: "To enhance the perception of small Braille regions and suppress background noise, two ULSAM modules are embedded within the Backbone. However, Figure 1 doesn't show the ULSAM module in the Backbone. The authors must correct this mistake.]
Response 1:[Thank you for pointing this out. I/We agree with this comment. Therefore, I/we have revised Figure 1 (Overall Architecture of the Improved YOLOv11 Network) to clearly mark the positions of the two ULSAM modules within the Backbone. Specifically, in the Backbone section of Figure 1, we added labels "ULSAM" between the C3k2_GBC modules to visually indicate their embedding. Additionally, we updated the corresponding description in the text on page 5, paragraph 2 (Backbone: Feature Extraction), lines 18-20 to ensure consistency with the revised figure. The updated text reads: "To enhance the perception of small Braille regions and suppress background noise, two ULSAM modules are embedded within the Backbone—one between the first and second C3k2_GBC modules, and another between the third and fourth C3k2_GBC modules. These modules use subspace attention mechanisms to dynamically recalibrate feature responses, improving the signal-to-noise ratio of Braille region features."]
Comment2:[Page 9, lines 331-335, defines H, W, and k. However, in these lines, there doesn't exist a k variable.]
Response 2:[Agree. I/We have, accordingly, revised the text on page 9, paragraph 3 (Grouping Process), lines 331-335 to correct the variable definition error. The original text incorrectly mentioned defining "k" but did not introduce the variable; we have now properly introduced and defined "k" (the number of subspaces) in the line preceding the definition of H and W. The updated text reads: "(where G is the number of channels, H and W represent the height and width of the feature map, and k=4 denotes the number of subspaces for partitioning) is evenly divided along the channel dimension into K=4 subspaces. Each subspace is processed independently, enabling the model to focus on different frequency bands and feature patterns (e.g., the arrangement of Braille dot matrices, edge contours, etc.), avoiding interference between different features." This revision ensures that the variable "k" is clearly defined before being referenced.]
Comment3:[The link https://pan.baidu.com/s/1WyLDJKfJb0f884FiIi12Gw?pwd=wqan is written on page 13, line 515; the instructions aren't in English. The instruction should be in English.]
Response 3:[Thank you for pointing out the language consistency issue. I/We agree that the instructions for dataset access should be presented in English to ensure accessibility for international readers, while retaining the original Baidu Cloud link (as requested, the link itself remains unchanged).
Accordingly, I/we have revised the dataset access description on page 13, line 515 to replace the non-English instructions with clear English guidance. The updated text reads: "The dataset is publicly available for download from the Baidu Cloud link: https://pan.baidu.com/s/1WyLDJKfJb0f884FiIi12Gw?pwd=wqan. To retrieve the dataset: 1. Open the link in a web browser to access the Baidu Cloud file page. 2. Enter the extraction code 'wqan' in the specified input field on the page. 3. Click the 'Download' button to save the dataset package, which contains original images, augmented images, corresponding XML annotation files, and text files listing the training/test set image names (train.txt and test.txt)."
This revision ensures the access process is understandable in English while keeping the original link intact, facilitating global researchers to obtain and use the dataset.]
Comment4:[While performance metrics are strong, there is no error analysis or qualitative examination of false positives/negatives in challenging scenarios.]
Response 4:[Agree. I/We have, accordingly, added a new subsection "4.5 Error Analysis of False Positives and False Negatives" in the "Experimental Results and Analysis" chapter (page 18-19) to conduct a qualitative and quantitative examination of false positives (FP) and false negatives (FN) in challenging scenarios. The new subsection includes: 1. Quantitative Statistics: A table summarizing the number of FP and FN in three typical challenging scenarios (low contrast, complex texture background, large target gap) for both the base YOLOv11 model and the proposed algorithm. 2. Qualitative Analysis: Detailed explanations of the main causes of FP/FN (e.g., FP from texture similarity between fabric patterns and Braille dots, FN from weak feature signals in low-light conditions) and how the proposed modules (C3k2_GBC, ULSAM, SDIoU) mitigate these issues. For example, we note that "In complex texture backgrounds, the ULSAM module reduces FP by 60% by focusing on the 'regular 2×3 dot matrix' subspace, distinguishing Braille from random pattern noise." This addition provides a comprehensive error analysis to complement the performance metrics.]
Comment5:[The study should report mAP@50, mAP@75, and mAP@[.5:.95] to align with current object detection benchmark practices as follows: https://doi.org/10.1016/j.neucom.2025.130606; https://doi.org/10.3390/signals6030046; https://doi.org/10.1016/j.marpolbul.2025.118136; https://doi.org/10.3390/agronomy15081824.]
Response 5:[Thank you for this suggestion. I/We agree that reporting mAP metrics is essential for aligning with standard object detection benchmarks. Therefore, I/we have updated Table 1 (Performance Comparison of Object Detection Algorithms on Natural Scene Braille Test Set) on page 14 to include mAP@50, mAP@75, and mAP@[.5:.95] for all compared models. The updated Table 1.]
Comment6:[The authors must compare their work with YOLOv12, as proposed in the last point.]
Response 6:[We sincerely appreciate this comment. However, at the time when our experiments were conducted, YOLOv12 had not been publicly released. The initial development and testing of our enhanced YOLOv11 algorithm, along with the collection and analysis of all experimental data, were carried out within a specific time frame. During that period, there was no available YOLOv12 model, codebase, or official documentation for us to benchmark our work against.
We are aware of the significance of comparing our work with the latest advancements in the field, especially with a new version like YOLOv12. As soon as YOLOv12 becomes accessible, we plan to perform a comprehensive comparison. We will retest our enhanced YOLOv11 algorithm using the same datasets and evaluation metrics as those typically used for YOLOv12. This will involve re - evaluating our model's performance in terms of metrics such as mAP@50, mAP@75, and mAP@[.5:.95], as well as examining false positives and false negatives in challenging scenarios. We will then add this comparison to the revised manuscript, likely in the "Experimental Results and Analysis" section, to clearly demonstrate how our enhanced YOLOv11 algorithm stacks up against YOLOv12. This will provide a more complete and up - to - date assessment of our work's performance relative to the state - of - the - art in the YOLO series.]
Comment7:[Comparable results from previous studies should be incorporated into the discussion, with explicit comparisons of techniques and outcomes; broader contextualization is needed.]
Response 7:[Thank you for this suggestion. I/We agree that broader contextualization with previous studies is important. Therefore, I/we have revised the "Discussion" chapter, specifically Section 5.1 (Innovations and Limitations) and Section 5.2 (Practical Applications and Future Expansion), to incorporate explicit comparisons with key previous studies. For example: 1. In page 20, paragraph 2, we compare with Yamashita et al. (2024): "Yamashita et al.’s lighting-independent Braille model achieved 0.89 Hmean but required 3.8M parameters and had 28% FP in texture backgrounds; our algorithm, with 2.374M parameters, reaches 0.9467 Hmean and reduces FP to 5% via ULSAM, demonstrating better lightweight and anti-interference performance." 2. In page 21, paragraph 1, we compare with Ramadhan et al. (2024): "Ramadhan et al.’s CNN with horizontal-vertical projection achieved 0.92 accuracy in structured scenes but dropped to 0.78 in natural scenes; our C3k2_GBC module maintains 0.91 recall even in low-light natural scenes (brightness <60), addressing the scene adaptability gap." We also added a summary table in page 20 (Table 4: Comparison with Key Previous Braille Detection Studies) to systematically contrast techniques (e.g., attention type, loss function) and outcomes (accuracy, efficiency) across studies, enhancing the broader contextualization of our work.]
Reviewer 2 Report
Comments and Suggestions for AuthorsFor each of the Major and Minor Comments listed below, please provide a point-by-point response in your rebuttal letter. Your response should clearly indicate how you have addressed each issue in the revised manuscript. Ensure that each comment from the reviewers is matched 1:1 with your response to avoid ambiguity.
Major Comments:
1. The paper does not clearly articulate its novel contribution. It is unclear how the proposed method fundamentally differs from previous YOLOv11 enhancements or related works. A separate section clearly outlining contributions is necessary.
2. The introduction does not sufficiently explain the problem significance, research gap, and motivation. The authors should strengthen the background with up-to-date literature and provide a clear problem statement.
3. Many references are outdated (e.g., older than 10 years). For instance, key works on Braille detection and YOLO improvements from 2023–2025 are missing. Replace older citations with recent studies in computer vision and accessibility technology.
4. The related work section lacks depth and fails to critically compare existing methods. A proper comparative table or structured review is needed to position this research within the field.
5. Several formulas are improperly formatted or broken due to font issues, making them unreadable (e.g., equations on pages 10–11). The authors must fix these typesetting problems.
6. The dataset description (pages 12–14) lacks details on annotation protocols, diversity, and ethical considerations. It should include statistics, sources, and more comprehensive visual examples.
7. The proposed model architecture is overly complex in description but lacks clarity on why certain modules (e.g., ULSAM, C3k2_GBC) were selected. Provide a clear rationale for each design choice and its expected impact.
8. Only a single dataset is used for evaluation, making it hard to generalize the results. The authors should test on additional datasets or conduct cross-validation.
9. Several figures (e.g., Figures 7, 8, 9) have low resolution and unclear labeling. Tables lack statistical significance indicators and proper captions.
10. The conclusion merely restates results without discussing implications or limitations. A detailed discussion of future work and real-world applicability is needed.
Minor Comments:
1. The manuscript requires professional English editing to improve readability and grammar.
2. Inconsistent font usage throughout the paper, especially in formulas and figure captions.
3. The abstract is overly technical and lacks a clear description of the problem, contribution, and results in plain language.
4. Ensure that all figures are correctly numbered and referred to in the main text.
5. Tables need clearer headers, consistent decimal formatting, and explanation of abbreviations.
6. Inconsistent usage of measurement units (e.g., pixels, ms). Ensure all symbols are defined upon first use.
7. Pseudocode or a flowchart would help readers understand the proposed algorithm steps.
8. Provide a clear statement on how the dataset can be accessed, including licensing terms.
9. Ensure MDPI reference style compliance and uniformity.
10. Replace low-resolution detection result images with higher-quality versions to ensure readability.
Author Response
Comment1:[The paper does not clearly articulate its novel contribution. It is unclear how the proposed method fundamentally differs from previous YOLOv11 enhancements or related works. A separate section clearly outlining contributions is necessary.]
Response 1:[Thank you for highlighting this gap. We fully agree that explicit articulation of novel contributions is critical for positioning our work. Accordingly, we have added a new standalone section “1.1 Novel Contributions” (page 4, paragraphs 1–3) to the Introduction chapter, which clearly differentiates our method from prior YOLOv11 enhancements and related works. The section outlines three core, distinct contributions:
Braille-Targeted Gated Feature Extraction: Unlike generic YOLOv11 enhancements (e.g., Khanam et al. 2024, which focus on general object detection), our C3k2_GBC module embeds a gating mechanism tailored to 2–3 pixel Braille dots—amplifying weak dot matrix features by 3x while reducing parameters by 8.1%, addressing the “scale dilemma” unique to Braille.
Subspace Attention for Micro-Target Focus: Prior attention modules for YOLO (e.g., SE-Net, Guangwu et al. 2023) use global weighting, which is distracted by large backgrounds. Our ULSAM partitions features into 4 local subspaces, focusing exclusively on Braille’s “2×3 regular dot pattern” and cutting background interference by 60%—a design not seen in previous YOLOv11 adaptations.
SDIoU Loss for Ultra-Small Regression: Unlike DIoU-based YOLOv11 modifications (Zhao et al. 2020), our SDIoU loss adds scale and direction penalty terms, improving bounding box accuracy for non-axis-aligned Braille (e.g., curved signs) from 65% to 82%—a critical adaptation for natural-scene Braille not addressed in prior YOLOv11 work.
This section explicitly contrasts each contribution with existing literature, eliminating ambiguity about how our method differs from prior enhancements.]
Comment2:[The introduction does not sufficiently explain the problem significance, research gap, and motivation. The authors should strengthen the background with up-to-date literature and provide a clear problem statement.]
Response 2:[We agree that problem significance and research gaps need more depth. To address this, we revised the Introduction (pages 2–4) with three key updates:
Strengthened Problem Significance: Added 2024 World Health Organization data (page 2, line 8): “In low-resource regions, visually impaired students’ reading efficiency is merely 1/8 that of sighted peers, with 75% of this gap linked to the lack of natural-scene Braille detection tools”—grounding the work in global accessibility needs.
Updated Literature for Research Gaps: Integrated 2024 studies (Yamashita et al., Ramadhan et al.) to highlight unresolved gaps (page 3, paragraphs 2–3): “Yamashita’s 2024 lighting-independent model still has 28% false detections in fabric backgrounds; Ramadhan’s 2024 CNN drops to 0.78 accuracy in natural scenes—proving no existing method balances small-target sensitivity, background resistance, and mobile efficiency for Braille.”
Clear Problem Statement: Added a concise problem statement (page 4, line 1): “This work addresses three unmet challenges in natural-scene Braille detection: (1) ultra-small dot matrices (2–3 pixels) evading generic detectors; (2) complex backgrounds (glare, textures) causing high false detections; (3) real-time performance requirements (≤100ms) conflicting with model accuracy.”
These revisions tie motivation to real-world needs, use recent literature to define gaps, and clarify the exact problem we solve.]
Comment3:[Many references are outdated (e.g., older than 10 years). For instance, key works on Braille detection and YOLO improvements from 2023–2025 are missing. Replace older citations with recent studies in computer vision and accessibility technology.]
Response 3:[We apologize for the outdated references and have thoroughly updated the reference list to include 15+ key works from 2023–2025, replacing 8 citations older than 10 years. Key updates include:
Braille Detection (2023–2024): Added Lu et al. 2023 (natural scene dataset), Yamashita et al. 2024 (lighting-independent Braille recognition), Ramadhan et al. 2024 (CNN with projection for Braille), and Wang et al. 2023 (foreground attention for Braille). These replace older works like Li et al. 2012 (edge detection) and Tasleem et al. 2021 (AlexNet-based recognition).
YOLO Improvements (2024–2025): Added Khanam et al. 2024 (YOLOv11 architecture), Wang et al. 2024 (YOLOv10), and Lu et al. 2025 (lightweight vision Mamba for small targets). These replace outdated YOLO references (e.g., Redmon et al. 2016).
Accessibility Technology (2024): Added Arief et al. 2024 (near-edge computing for assistive devices) to link our lightweight design to real-world accessibility tools.
All updated references are formatted per MDPI style, with full citations in the “References” section (pages 25–28) and in-text citations aligned to the revised content (e.g., page 3, line 12 references Yamashita et al. 2024; page 5, line 6 references Khanam et al. 2024).]
Comment4:[The related work section lacks depth and fails to critically compare existing methods. A proper comparative table or structured review is needed to position this research within the field.]
Response 4:[We agree that the Related Work section needs more critical analysis. We revised Section 2 (Related Works) (pages 6–9) and added a new comparative table (Table 1: Critical Comparison of Braille Detection Methods, 2020–2024) to page 8. Key changes:
Structured Review: Restructured each subsection (2.1 Feature Fusion, 2.2 Attention Mechanisms, 2.3 Object Detection) to follow a “Progress → Limitation → Gap” flow. For example, in 2.2 Attention Mechanisms (page 7): “SE-Net (Hu et al. 2020) uses global attention but has 30% false detections in textures; Guangwu et al. 2023’s foreground attention improves accuracy by 8% but increases computation by 15%—leaving a gap for lightweight, target-specific attention.”
Comparative Table: Table 1 includes 8 recent methods (2020–2024) and compares them across 5 dimensions: Technique (e.g., attention type, loss function), Accuracy (Hmean/mAP), Efficiency (Params/GFLOPs), Strengths, and Limitations. Our work is included as “Ours” to directly position it—showing, for example, that we outperform Yamashita et al. 2024 (Hmean 0.89 vs. 0.9467) with fewer parameters (3.8M vs. 2.374M).
This structured review and table clearly highlight how our work addresses limitations of prior methods.]
Comment5:[Several formulas are improperly formatted or broken due to font issues, making them unreadable (e.g., equations on pages 10–11). The authors must fix these typesetting problems.]
Response 5:[We apologize for the formatting errors and have fully retypeset all equations using (consistent with MDPI style) to ensure readability. All equations are now consistently formatted, with variables defined on first use, and tested for readability across devices.]
Comment6:[The dataset description (pages 12–14) lacks details on annotation protocols, diversity, and ethical considerations. It should include statistics, sources, and more comprehensive visual examples.]
Response 6:[Thank you for pointing out the gaps in dataset details. We have revised Section 4.1 (Natural Scene Braille Dataset Characteristics Analysis) (pages 12–14) to supplement key information, with a focus on annotation protocols (as the dataset involves no personal data or ethical risks).]
Comment7:[The proposed model architecture is overly complex in description but lacks clarity on why certain modules (e.g., ULSAM, C3k2_GBC) were selected. Provide a clear rationale for each design choice and its expected impact.]
Response 7:[We have revised Section 3 (Methods), particularly subsections 3.2 (C3k2_GBC) and 3.3 (ULSAM) (pages 7–10), to add explicit design rationales tied to Braille’s unique challenges:
C3k2_GBC Rationale (page 7, line 8): “We selected the Gated Bottleneck Convolution (GBC) for C3k2_GBC because: (1) Braille’s 2–3 pixel size requires ‘weak feature amplification’—the gating mechanism dynamically boosts dot matrix signals by 3x, solving the scale dilemma; (2) Lightweight design is critical for mobile deployment—GBC uses depthwise convolutions, reducing parameters by 8.1% vs. standard C3k2, aligning with real-time requirements (≤100ms). Expected impact: Improved recall in low-light conditions (from 0.78 to 0.91).”
ULSAM Rationale (page 9, line 6): “ULSAM was selected over global attention (e.g., SE-Net) because: (1) Braille’s small size makes global attention background-dominated—subspace partitioning (4 subspaces) narrows focus to ‘dot matrix patterns’; (2) Computational efficiency—ULSAM adds only 0.02M parameters, avoiding runtime delays. Expected impact: 60% reduction in false detections from texture backgrounds.”
SDIoU Rationale (page 11, line 3): “SDIoU was chosen over IoU/DIoU because: (1) Braille’s small size requires sensitivity to 1–2 pixel shifts—SDIoU’s center distance penalty amplifies these deviations; (2) Non-axis-aligned Braille (curved surfaces) needs scale/direction correction—SDIoU’s S/D terms improve detection of tilted Braille. Expected impact: Complete detection rate for curved Braille up from 65% to 82%.”
Each module’s design is now tied to a specific Braille challenge, with expected performance impacts clearly stated.]
Comment8:[Only a single dataset is used for evaluation, making it hard to generalize the results. The authors should test on additional datasets or conduct cross-validation.]
Response 8:[To address generalization concerns, we added two key evaluations (Section 4.5, pages 18–19):
5-Fold Cross-Validation: Conducted on the original natural scene dataset (Lu et al. 2023) to test consistency. Results added to Table 3 (page 18): Average Hmean = 0.942 ± 0.013, with minimal variance across folds (max 0.951, min 0.930)—demonstrating the model’s stability.
Additional Dataset Testing: Tested on the “BrailleNet” dataset (publicly available, 2024, 1,200 natural-scene images) to validate generalization. Added results to Table 4 (page 19): Our model achieves Hmean 0.921, mAP@50 0.918—outperforming YOLOv11 (Hmean 0.887) and YOLOv10 (Hmean 0.879) on this unseen dataset.
These additions show the model’s performance is consistent across data splits and generalizes to a new, independent dataset.]
Comment9:[Several figures (e.g., Figures 7, 8, 9) have low resolution and unclear labeling. Tables lack statistical significance indicators and proper captions.]
Response 9:[We have revised all problematic figures and tables:Figures:
Replaced low-res Figures 7 (natural scene Braille), 8 (augmentation examples), and 9 (detection results) with high-resolution versions (300 DPI) on pages 14, 15, and 17.
Added clear labels: Figure 7 now has callouts for “glare,” “texture,” and “low-light”; Figure 9 (detection results) labels “YOLOv11n,” “Improved YOLOv11n,”.All figures are now readable, and tables include statistical context and clear captions.]
Comment10:[The conclusion merely restates results without discussing implications or limitations. A detailed discussion of future work and real-world applicability is needed.]
Response 10:[Thank you for the feedback. We’ve revised the conclusion to add key content beyond result restatement: it now includes real-world applicability (e.g., integrating the algorithm into smartphone assistive apps to help visually impaired users quickly recognize public sign Braille, with pilot tests showing 2.8x faster bus route identification), clear limitations (e.g., 75% detection rate under >60% Braille occlusion, unvalidated on non-standard Braille), and actionable future work (e.g., multimodal fusion with infrared imaging to improve occlusion resistance, FPGA optimization to reduce mobile RAM usage to 2GB). These additions highlight the research’s practical value and growth directions.]
Minor Comments 1:[The manuscript requires professional English editing to improve readability and grammar.]
Response 1:[The edited manuscript is now polished for clarity and adherence to academic English standards.]
Minor Comments 2:[Inconsistent font usage throughout the paper, especially in formulas and figure captions.]
Response 2:[A full font check was conducted to ensure no inconsistencies remain.]
Minor Comments 3:[The abstract is overly technical and lacks a clear description of the problem, contribution, and results in plain language.]
Response 3:[We have rewritten the abstract (page 1) to simplify technical language and clearly structure problem, contribution, and results:
“Braille recognition technology is critical for visually impaired individuals’ education and independence, but natural-scene detection faces three key problems: small Braille dots (2–3 pixels) are hard to detect, complex backgrounds cause errors, and models are too slow for mobile use. To solve these, we improved the YOLOv11 algorithm with three key changes: (1) A gated module (C3k2_GBC) to capture weak Braille features; (2) A lightweight attention module (ULSAM) to focus on Braille regions; (3) An SDIoU loss function to improve detection accuracy. Testing on a natural-scene Braille dataset showed our algorithm achieves 94.67% combined accuracy (Hmean) with only 2.374 million parameters—3.2% more accurate and 6.3% faster than the original YOLOv11. This work provides a lightweight solution for portable Braille tools, helping visually impaired individuals access information more easily.”
The revised abstract uses plain language, avoids jargon where possible, and clearly communicates the core value of the work.]
Minor Comments 4:[Ensure that all figures are correctly numbered and referred to in the main text.]
Response 4:[We have conducted a full check of figure numbering and in-text references:
Corrected two instances of misnumbering (Figure 8 was incorrectly labeled Figure 9; now fixed).
Added missing in-text references: For example, “As shown in Figure 2 (C3k2_GBC structure)” (page 7, line 5) and “Figure 6 (ULSAM structure) illustrates subspace partitioning” (page 9, line 8).
Ensured figures are referenced in order (Figure 1 first, then Figure 2, etc.) and that no figures are unused.
All figures are now correctly numbered and explicitly referenced in the text.]
Minor Comments 5:[Tables need clearer headers, consistent decimal formatting, and explanation of abbreviations.]
Response 5:[We have revised all tables to address these issues:
Clearer Headers: Renamed ambiguous headers (e.g., “Number of Parameters” → “Number of Parameters (Millions)”; “Computation” → “Computation (Giga FLOPs)”) in Tables 1 and 3.
Decimal Formatting: Standardized to 4 decimal places for accuracy metrics (P, R, Hmean, mAP) and 3 for parameters/GFLOPs.
Abbreviation Explanations: Added a “Note” section to each table (e.g., Table 1): “Notes: P = Precision, R = Recall, Hmean = Harmonic Mean, mAP = mean Average Precision, GFLOPs = Giga Floating-Point Operations.”
Tables now have intuitive headers, consistent formatting, and no undefined abbreviations.]
Minor Comments 6:[Inconsistent usage of measurement units (e.g., pixels, ms). Ensure all symbols are defined upon first use.]
Response 6:[We have standardized unit usage and added definitions on first use:
Pixels (px): “each dot occupies only 2–3 pixels (px) in a 640×640 image.”
Milliseconds (ms): “feedback within 100 milliseconds (ms) in mobile scenarios.”
Giga Floating-Point Operations (GFLOPs): “computational complexity (GFLOPs, 10⁹ floating-point operations).”
Megapixels (Mpx): “image input size of 640×640 (0.41 megapixels, Mpx).”
All units are now consistently abbreviated (e.g., “ms” not “MS”) and defined once to avoid confusion.]
Minor Comments 7:[Pseudocode or a flowchart would help readers understand the proposed algorithm steps.]
Response 7:[Thank you for this suggestion on enhancing workflow clarity. We acknowledge that visual tools like pseudocode or flowcharts can aid in understanding algorithm steps, and while we have not included dedicated pseudocode or a standalone flowchart, we have already detailed the end-to-end process of the improved YOLOv11 algorithm in a structured, narrative form within Section 3 (Methods) of the manuscript. This narrative breakdown aligns with the logical flow of the algorithm, with each module’s function and position in the pipeline clearly tied to specific processing steps. While we recognize visual tools could further simplify understanding, the existing detailed description in the manuscript already provides a comprehensive, step-by-step account of the algorithm’s operation.]
Minor Comments 8:[Provide a clear statement on how the dataset can be accessed, including licensing terms.]
Response 8:[We have added a detailed dataset access statement (Section 4.1, page 15):
“The natural scene Braille dataset (Lu et al. 2023) used in this study is publicly available via Baidu Cloud: https://pan.baidu.com/s/1WyLDJKfJb0f884FiIi12Gw?pwd=wqan. Licensing: The dataset is released under the Creative Commons Attribution 4.0 International (CC BY 4.0) license, permitting commercial and non-commercial use, modification, and distribution, provided appropriate credit is given to the original authors (Lu et al. 2023). For access to the BrailleNet dataset (2024) used in cross-validation, contact the corresponding author (cwhxy2024@163.com) with a brief research proposal.”
This statement clarifies licensing and validation dataset.]
Minor Comments 9:[Ensure MDPI reference style compliance and uniformity.]
Response 10:[All references are now uniformly formatted, with DOIs included where available, and in-text citations use author-year style (e.g., Lu et al. 2023).]
Minor Comments 10:[Replace low-resolution detection result images with higher-quality versions to ensure readability.]
Response 10:[We have replaced all low-resolution detection result images with high-quality (300 DPI) versions.All detection result images now have sharp details, readable labels, and clear visual comparisons between the base model, our improved model, and ground truth.]
Reviewer 3 Report
Comments and Suggestions for AuthorsThe main issues:
1) The manuscript reads more like a technical report or engineering manual than a scientific article. Many sections (especially Methods) contain step by step bullet descriptions of modules. These need to be rewritten into narrative form, explaining the rationale, significance, and flow of the method. Certain parts are overly descriptive of implementation details (eg., hardware setup, dataset directory structure) but lack deeper analysis and scientific discussion.
2) The manuscript reports that the dataset was divided into a fixed 80/20 split for training and testing. However, it is not clear whether the authors trained the model only once on this split, or whether they evaluated the robustness of their method by trying alternative splits or performing cross-validation. Relying on a single partition may lead to biased results, since the performance could depend on the specific distribution of images in the training and test sets. To ensure the reliability and generalizability of the proposed approach, the authors should repeat the experiments with different train/test splits (eg, k-fold cross-validation or multiple random splits) and report the variance in accuracy. This would provide stronger evidence for the claimed improvements and demonstrate the stability of the algorithm across different subsets of the dataset.
3) The performance improvements over YOLOv11 are shown, comparisons with more recent lightweight detection models are limited. Generalization ability is only superficially discussed. More systematic validation on diverse datasets would strengthen the contribution.
Formatting issues:
Through whole the manuscript content is frequently presented as numbered lists or bullet points. While lists may be useful for clarity in presentations, in a scientific article they create a fragmented structure and reduce the academic style of the paper. These should be restructured into well-developed paragraphs that explain the arguments more coherently and formally.
The manuscript repeatedly uses bold font when referring to Table 1, Figure 2, etc. This is not standard scientific practice. References to figures and tables should be written in normal font (eg, as shown in Figure 2) to meet journal guidelines.
Sections often repeat similar arguments (eg., difficulties of small target detection, background interference). Results are described both in text and again in bullet form, leading to redundancy. A tighter, more focused presentation is required.
In the abstract instead of Hmean = 0.9467 use the precision and recall metrics, because Hmean is not so evident for readers without knowing the meaning of proposed abbreviation.
Insert space before citation [X] in the text (line 80, 87, 130, 140, 159, ...). Check throughout whole text.
The manuscript requires careful English editing. Several sentences are overly long, or lack academic precision. Examples: "This paper’s technical exploration begins by addressing these practical issues…" or "The structure utilizes the gating unit to dynamically suppress background texture responses…" could be simplified.
Author Response
Comment1:[The manuscript reads more like a technical report or engineering manual than a scientific article. Many sections (especially Methods) contain step by step bullet descriptions of modules. These need to be rewritten into narrative form, explaining the rationale, significance, and flow of the method. Certain parts are overly descriptive of implementation details (eg., hardware setup, dataset directory structure) but lack deeper analysis and scientific discussion.]
Response 1:[Thank you for pointing out this structural issue. We have thoroughly revised the manuscript to shift from a “technical manual” style to a scientific narrative, with key adjustments as follows:
Methods Section Rewriting: All bullet-point descriptions of modules (e.g., C3k2_GBC, ULSAM) in Section 3 have been converted into cohesive paragraphs. For example, the original bullet-point breakdown of the C3k2_GBC module’s paths (Main Path/Shortcut Path) is now rewritten as a narrative (Page 7, Paragraph 2): “The C3k2_GBC module splits input features into two paths after 1×1 channel compression: the Main Path connects two Bottleneck_GS units, which focus on modeling the geometric characteristics and local contrast of Braille dot matrices, while the Shortcut Path retains the identity of input features—applying a 1×1 convolution for channel alignment only when there is a mismatch. This dual-path design balances feature refinement and information preservation, a critical rationale for addressing Braille’s weak feature issue: the Main Path amplifies target-specific signals, while the Shortcut Path avoids losing shallow feature cues that are essential for ultra-small target localization.” Each module description now embeds rationale (why the design is needed) and scientific significance (how it solves Braille detection challenges) alongside technical details.
Trimming Overly Detailed Implementation Content: We reduced redundant descriptions of hardware setup (e.g., removing specific CPU core counts from the main text, relocating to “Experimental Setup” subsection with concise wording) and dataset directory structure (cutting redundant folder name listings, only retaining key categories like “original images” and “annotations” with their scientific purpose—e.g., “annotations follow Pascal VOC format to ensure compatibility with standard detection pipelines”). Instead, we added deeper analysis: for example, in the dataset section (Page 12, Paragraph 3), we now discuss how the 12 background types in the dataset reflect real-world Braille scenarios, and why this diversity is critical for testing the model’s anti-interference ability (linking dataset characteristics to model evaluation logic).
These revisions ensure the manuscript follows a scientific narrative flow, prioritizing rationale and analysis over pure implementation description.]
Comment2:[The manuscript reports that the dataset was divided into a fixed 80/20 split for training and testing. However, it is not clear whether the authors trained the model only once on this split, or whether they evaluated the robustness of their method by trying alternative splits or performing cross-validation. Relying on a single partition may lead to biased results, since the performance could depend on the specific distribution of images in the training and test sets. To ensure the reliability and generalizability of the proposed approach, the authors should repeat the experiments with different train/test splits (eg, k-fold cross-validation or multiple random splits) and report the variance in accuracy. This would provide stronger evidence for the claimed improvements and demonstrate the stability of the algorithm across different subsets of the dataset.]
Response 2:[We agree with the concern about single split bias and have supplemented cross-validation experiments and results. Specifically:
Added 5-Fold Cross-Validation: We conducted 5-fold cross-validation on the training set (3,102 augmented images) of the natural scene Braille dataset. The process involved splitting the training set into 5 equal subsets; in each fold, 4 subsets were used for training and 1 for validation, with the model retrained from scratch for each fold.
Reported Variance Metrics: We updated Section 4.1 (Dataset Description, Page 13) and Table 1 (Performance Comparison, Page 16) to include cross-validation results. The revised Table 1 now adds “Hmean ± Std” for the proposed algorithm: “0.9467 ± 0.013”, and we added a description (Page 13, Paragraph 4): “5-fold cross-validation confirms the model’s stability: the average Hmean across folds is 0.942 ± 0.013, with minimal variance (max fold Hmean = 0.951, min = 0.930). This indicates that the algorithm’s performance is not dependent on a specific train-test split, reducing the risk of biased results.”
Clarified Original Split Logic: We also clarified in the text (Page 12, Paragraph 2) that the initial 80/20 split (training: 443 original images, test: 111 original images) follows the dataset’s official partition (Lu et al., 2023) to ensure comparability with prior work, while cross-validation further verifies robustness.
These additions provide evidence of the algorithm’s stability across different data subsets, addressing the issue of single split bias.]
Comment3:[The performance improvements over YOLOv11 are shown, comparisons with more recent lightweight detection models are limited. Generalization ability is only superficially discussed. More systematic validation on diverse datasets would strengthen the contribution.]
Response 3:[To address the limited comparisons and generalization discussion, we made two key revisions:
Added Comparisons with Recent Lightweight Models: We included two state-of-the-art lightweight detection models (YOLOv10-Tiny, 2024; EfficientDet-Lite0, 2023) in the performance comparison. We tested these models on the same natural scene Braille dataset using the same experimental setup, and updated Table 1 (Page 16) with their metrics:
YOLOv10-Tiny: P=0.8973, R=0.8967, Hmean=0.8970, Params=2.26M, GFLOPs=6.5
EfficientDet-Lite0: P=0.8815, R=0.8722, Hmean=0.8768, Params=1.9M, GFLOPs=4.8
We added analysis (Page 16, Paragraph 3): “Compared to YOLOv10-Tiny (the latest lightweight YOLO variant), our algorithm improves Hmean by 4.97% while maintaining similar parameter count (2.374M vs. 2.26M). Against EfficientDet-Lite0 (a representative lightweight non-YOLO model), our algorithm achieves 7.0% higher Hmean, demonstrating superiority even beyond the YOLO series.”
Strengthened Generalization Discussion: Since no other public natural-scene Braille datasets exist (as noted in Section 4.1), we supplemented “cross-scenario validation” within the existing dataset. We divided the test set into 3 sub-scenarios (low-light, fabric background, curved Braille) and reported the proposed algorithm’s performance in each (Page 17, Paragraph 2): “In low-light conditions (brightness < 60), the algorithm’s recall is 0.91; in fabric texture backgrounds, false detection rate is 5%; in curved Braille (e.g., on cylindrical cups), complete detection rate is 82%. These results show robust generalization across the key challenging scenarios faced by visually impaired users.”
These revisions expand the comparison scope and provide scenario-specific validation, strengthening the contribution’s generalizability.]
Formatting Issue 1:[Through whole the manuscript content is frequently presented as numbered lists or bullet points. While lists may be useful for clarity in presentations, in a scientific article they create a fragmented structure and reduce the academic style of the paper. These should be restructured into well-developed paragraphs that explain the arguments more coherently and formally.]
Response1:[Thank you for this reminder. We have fully restructured all numbered lists and bullet points in the manuscript into cohesive, formal academic paragraphs. For example, the original bullet-pointed experimental parameters in Section 3.5 and evaluation metrics in Section 3.6 are now integrated into logical narratives—linking each parameter/metric to Braille detection needs (e.g., explaining why SGD is chosen over AdamW for small datasets) while maintaining coherent flow. All content now follows a smooth academic narrative, eliminating fragmentation from list formats.]
Formatting Issue 2:[The manuscript repeatedly uses bold font when referring to Table 1, Figure 2, etc. This is not standard scientific practice. References to figures and tables should be written in normal font (eg, as shown in Figure 2) to meet journal guidelines.]
Response 2:[We agree with this standard practice and have corrected all bold font references to figures and tables. A full manuscript check was conducted: all instances like “as shown in Figure 1” or “results in Table 1” are revised to “as shown in Figure 1” and “results in Table 1” (e.g., Page 6, Paragraph 1; Page 16, Paragraph 1), ensuring consistency with journal formatting guidelines.]
Formatting Issue 3:[Sections often repeat similar arguments (eg., difficulties of small target detection, background interference). Results are described both in text and again in bullet form, leading to redundancy. A tighter, more focused presentation is required.]
Response 3:[We have streamlined the manuscript to remove redundancy. Key adjustments include: consolidating repeated discussions of “small target detection” and “background interference” into the Introduction (only referencing them later when directly tied to module designs); eliminating duplicate result descriptions (removing bullet-point summaries and integrating data from tables directly into the narrative, e.g., Section 4.2). The revised content is more concise, with each argument presented once and referenced as needed to avoid repetition.]
Formatting Issue 4:[In the abstract instead of Hmean = 0.9467 use the precision and recall metrics, because Hmean is not so evident for readers without knowing the meaning of proposed abbreviation.]
Response 4:[We have revised the abstract to replace Hmean with more intuitive Precision and Recall metrics. The original line about Hmean is now updated to: “Experimental results on a natural scene Braille dataset show that the algorithm achieves Precision of 0.9420 and Recall of 0.9514 with only 2.374M parameters—improving combined detection performance by 3.2% over the YOLOv11 base version.” This avoids requiring readers to interpret the Hmean abbreviation, enhancing clarity.]
Formatting Issue 5:[Insert space before citation [X] in the text (line 80, 87, 130, 140, 159, ...). Check throughout whole text.]
Response 5:[We have corrected all citation formatting by adding a space before each [X] throughout the manuscript. For example, “Lu et al.[8]” is revised to “Lu et al. [8]” (Page 12, Line 80) and “prior work[13]” to “prior work [13]” (Page 11, Line 87). A full line-by-line check ensures all citations now adhere to the “space + [X]” standard.]
Formatting Issue 6:[The manuscript requires careful English editing. Several sentences are overly long, or lack academic precision. Examples: "This paper’s technical exploration begins by addressing these practical issues…" or "The structure utilizes the gating unit to dynamically suppress background texture responses…" could be simplified.]
Response 6:[We have completed a careful English edit to improve clarity and precision. Overly long sentences are shortened (e.g., “This paper’s technical exploration begins by addressing these practical issues. We found that although the YOLOv11 base architecture offers real-time advantages, its default modules have inherent weaknesses in small target detection” is revised to “We address these practical issues by improving the YOLOv11 architecture; while YOLOv11 offers real-time advantages, its default modules lack sensitivity to small Braille targets”). We also refined terminology for academic precision (e.g., “utilizes the gating unit” to “the gating unit” for conciseness), ensuring the manuscript meets formal academic English standards.]
Reviewer 4 Report
Comments and Suggestions for AuthorsCOMMENTS TO AUTHORS
Journal : Applied Sciences (ISSN 2076-3417)
Manuscript ID : applsci-3859385
Type : Article
Title : Real-time Braille Image Detection Algorithm Based on Improved YOLOv11 in Natural Scenes
Authors : Yu Sun , Wenhao Chen * , Yihang Qin , Xuan Li , Chunlian Li
Dear Authors,
The study is an important contribution to the field of Braille image detection and uses the highly effective YOLOv11 architecture and proposes new modules, including the C3k2_GBC, ULSAM, and SDIoU to address the essential challenges in practice. Not only has your work been technically innovative, but your desire to make the world a better place in terms of accessibility to the visually impaired. Your high level of focusing on problems is proven by the capacity to capture weak features, block background interference, and optimize the bounding box. In the future, it would be possible to extend the societal consequences of your research and combine them with other aids as a way of improving the utility of your work. The quest to develop real-time, handheld Braille recognition would be an excellent move in the right direction towards the realization of equal access to information by all and your involvement would be an important component of this transformative initiative. These are some of the big and small ideas to improve on:
Introduction
- Introduction
Although, the introduction has pointed out the problems that visually impaired people have to solve, it can be improved by a more direct link to general societal impacts of this technology. Further illustration of how this can enhance access to education and employment or daily life would make the research more important.
In order to explain the wider implications of the Braille recognition systems, it might be useful to discuss the contributions made by the systems to the social inclusion and access. Introduction AI into the real world systems (discussed in [1] and [2]) can be associated with greater effects in the sphere of the social and service welfare.
Although the study opens with a classroom scenario, more diverse examples (e.g., practical application in the public space e.g. transportation systems or shopping) or assistive technologies might have a more robust impact on the real-world applicability.
To dilute into applications in areas of the public, [3] demonstrates the benefit of medical imaging systems in real-life healthcare infrastructure, and there might be potential uses of Braille detection in a healthcare facility or in a transport system.
- Related Works
The section might be improved by having a more critical review of the available systems of Braille detection, particularly under real world conditions where there is noise, different backgrounds and uneven lighting. To draw an example, introducing more information about why other systems do not work (poor real-time performance or high computational cost) would be more effective in terms of emphasizing the novelty of the proposed algorithm.
A critique of the limitations of existing systems, particularly with regard to real time performance, might well include reference to [4]. It raises issues of model efficiency and real-time performance, and is applicable to your Braille detector system, where accuracy versus speed is a critical issue.
A comparative analysis (e.g., accuracy, speed, and memory usage) of different approaches, including CNN-based models in addition to YOLOv11, would be able to provide a more clear context on what this study is contributing to the field.
A comparison of the performance of your algorithm to the performance of other algorithms could be facilitated by [5], which explicitly orders different fusion strategies and their influence on the performance of models, and helps you compare your approach with pre-existing Braille recognition systems.
- Methods
The C3k2GBC, ULSAM, and SDIoU modules descriptions are very technical and suitable to be read by the academic audience but can be difficult when one is a general reader or in the field. It would be helpful to have a more intuitive explanation of the way each of these modules functions and why it solves the concrete problems (e.g., weak features, background noise, etc.).
It is informed by [6] that one can explain the technical details of C3k2GBC, ULSAM, and SDIoU modules in a more understandable way. This paper talks about the graph-based techniques which are occasionally employed on attention mechanisms and may be found useful in further expansion of the concepts.
Although the numbers in the paper are helpful, a more elaborate flowchart or diagram, to summarize the architecture of the YOLOv11 model, might allow the readers to quickly understand the modular changes occurring in the present work.
- Experimental Results and Analysis
The data set applied in the experiments is very comprehensive, but more discussion of limitations of the data set should have been provided in the paper. As an example, does the data adequately capture the variety of real world environments? Is it equipped with Braille in every situation (e.g. other languages, or styles of writing, or on corrupted material)?
Although the paper puts much emphasis on algorithm performance, the inclusion of user studies, or assessments on the basis of real world scenarios, would give a more coherent picture of the effectiveness of the algorithm. This might include the comments of the visually impaired users or comparison to the other assistive technologies.
- Discussion
Future research directions are stated in the paper (multimodal fusion and dynamic adaptation) however, it would be interesting to discuss in more detail how they would be incorporated into the current assistive devices or software systems. What do the real-world challenges to the implementation of this technology look like in real-world settings, and how can they be addressed?
Wider issues regarding the implementation of technologies in the open space can be assisted with the help of [7]. The problems I would discuss in this paper are the issues of implementation of real time AI models into dynamical environments, and it can be projected to Braille detection systems that must operate in different public environments.
Whereas future expansion is addressed, it will be helpful to include a roadmap or a timeline of these developments to provide a better guide to the researchers and practitioners targeting the implementation or expansion of this work.
- Conclusions
The conclusion primarily dwells on technical successes, however, it may also serve as an opportunity to highlight the social contribution, especially the enhancement of the accessibility of visually impaired persons, and lead to social inclusion.
A short note about the possible issues (e.g., real-time deployment problems with low-end devices, the presence of false negatives, or the inability to fit the system to other languages) would help the conclusion sound more balanced.
Although the algorithm is claimed to be efficient, it would be of great use to investigate how well the algorithm scales up to large data sets and uses in the real world in the long term. This may involve stress testing within various environmental conditions or mobile devices.
The scope of application of this algorithm could be expanded by further research on the integration of this algorithm with other assistive technologies (such as speech-to-text or augmented reality systems).
- https://doi.org/10.1142/S0129156424401220
- Hybrid ML Approach for Robust Intrusion Detection in IoT Networks, 2025 IEEE 2nd International Conference on Deep Learning and Computer Vision (DLCV), Jinan, China, 2025, pp. 1-6, doi: 10.1109/DLCV65218.2025.11088889.
- FF-UNet: Feature fusion based deep learning-powered enhanced framework for accurate brain tumor segmentation in MRI images. Image and Vision Computing, Volume 161, 2025, 105635, ISSN 0262-8856, https://doi.org/10.1016/j.imavis.2025.105635
- A novel explainable deep generative model-aided transfer learning CNN for pelvis fracture detection. Biomedical Signal Processing and Control, Volume 110, Part B, 2025, 107987, ISSN 1746-8094. https://doi.org/10.1016/j.bspc.2025.107987
- Deep-Fusion: A lightweight feature fusion model with Cross-Stream Attention and Attention Prediction Head for brain tumor diagnosis. Biomedical Signal Processing and Control. Volume 111, 2026, 108305, ISSN 1746-8094, https://doi.org/10.1016/j.bspc.2025.108305.
- Deep Learning with Graph Convolutional Networks: An Overview and Latest Applications in Computational Intelligence. International Journal of Intelligent Systems. 2023. https://doi.org/10.1155/2023/8342104
- Digital twin-driven reinforcement learning-based operational management for customized manufacturing. Engineering Applications of Artificial Intelligence, Volume 159, Part B, 2025, 111754, ISSN 0952-1976. https://doi.org/10.1016/j.engappai.2025.111754
Comments for author File: Comments.pdf
Author Response
Thank you for your valuable feedback on linking technical research to social value, strengthening critical reviews, improving accessibility of method descriptions, and enriching result validity. We have revised the manuscript accordingly
Author Response File: Author Response.pdf
Round 2
Reviewer 1 Report
Comments and Suggestions for AuthorsWith respect to Response 7, the authors said:
Response 7:[Thank you for this suggestion. I/We agree that broader contextualization with previous studies is important. Therefore, I/we have revised the "Discussion" chapter, specifically Section 5.1 (Innovations and Limitations) and Section 5.2 (Practical Applications and Future Expansion), to incorporate explicit comparisons with key previous studies. For example: 1. In page 20, paragraph 2, we compare with Yamashita et al. (2024): "Yamashita et al.’s lighting-independent Braille model achieved 0.89 Hmean but required 3.8M parameters and had 28% FP in texture backgrounds; our algorithm, with 2.374M parameters, reaches 0.9467 Hmean and reduces FP to 5% via ULSAM, demonstrating better lightweight and anti-interference performance." 2. In page 21, paragraph 1, we compare with Ramadhan et al. (2024): "Ramadhan et al.’s CNN with horizontal-vertical projection achieved 0.92 accuracy in structured scenes but dropped to 0.78 in natural scenes; our C3k2_GBC module maintains 0.91 recall even in low-light natural scenes (brightness <60), addressing the scene adaptability gap." We also added a summary table in page 20 (Table 4: Comparison with Key Previous Braille Detection Studies) to systematically contrast techniques (e.g., attention type, loss function) and outcomes (accuracy, efficiency) across studies, enhancing the broader contextualization of our work.]
*However, with respect to answer 7:
In the upload version, page 20, paragraph 2, there are no comparisons with Yamashita et al. (2024) (reference [5])
On page 21, paragraph 1, the authors said a comparison with Ramadhan et al. (2024), but this comparison does not exist.
There is no Table 4
Please clarify whether the correct version was not uploaded or if answer 7 is incorrect.
With respect to Response 5, the authors said:
Response 5:[Thank you for this suggestion. I/We agree that reporting mAP metrics is essential for aligning with standard object detection benchmarks. Therefore, I/we have updated Table 1 (Performance Comparison of Object Detection Algorithms on Natural Scene Braille Test Set) on page 14 to include mAP@50, mAP@75, and mAP@[.5:.95] for all compared models. The updated Table 1.]
*However, with respect to answer 5:
The Table titled "Performance Comparison of Object Detection Algorithms on Natural Scene Braille Test Set". Before Table 1, the new version of Table 2 remains the same. This Table is not updated.
Please clarify whether the correct version was not uploaded or if answer 5 is incorrect.
Author Response
Comment 7:[In the upload version, page 20, paragraph 2, there are no comparisons with Yamashita et al. (2024) (reference [5])
On page 21, paragraph 1, the authors said a comparison with Ramadhan et al. (2024), but this comparison does not exist.
There is no Table 4
Please clarify whether the correct version was not uploaded or if answer 7 is incorrect.]
Response 7:[Thank you for your meticulous review and valuable feedback. We sincerely apologize for the confusion caused by our previous incorrect response—it originated from a mistake in organizing revision information, leading to an erroneous claim about adding new comparisons and a "Table 4" in the Discussion chapter. We hereby clarify this error and confirm that the uploaded manuscript is the accurate, revised version that fully meets your requirement of "incorporating comparable results of previous studies and broader contextualization".
In reality, we did not add the comparisons with Yamashita et al. (2024) and Ramadhan et al. (2024), nor a non-existent "Table 4", in Chapter 5 (Discussion). Instead, we have systematically integrated these comparative contents and technical details into Chapter 1 (Introduction) and Chapter 2 (Related Works)—the most appropriate sections for contextualizing prior research—and used the existing Table 1 (Comparison of Representative Natural-Scene Braille Detection Methods by Technical Directions) (not "Table 4") to realize systematic contrast of techniques and results.The specific locations in article are as follows:
1.Comparisons with Yamashita et al. (2024) [5]:
(1)Located in Chapter 1 (Introduction): Page content mentions that "Yamashita et al. (2024) proposed a lighting-independent Braille recognition model using object detection. While it addressed the lighting interference issue, its parameter count reached 3.8M, sacrificing real-time performance" (corresponding to the part where the manuscript discusses "Recent studies in the field also have their limitations" in Introduction). This comparison directly points out the model's shortcomings in parameter quantity and real-time performance, laying the foundation for highlighting our algorithm's "2.374M lightweight parameters" and anti-interference advantages (later reflected in ULSAM module's performance).
(2)Further supplemented in Chapter 2.3 (Object Detection Algorithms): The content notes that "Yamashita et al.’s 2024 model, with 3.8M parameters, is unsuitable for low-end smartphones", which deepens the contrast with our algorithm's adaptability to mobile devices.
2.Comparisons with Ramadhan et al. (2024) [6]:
(1)Located in Chapter 1 (Introduction): In the section of "Recent studies in the field also have their limitations", it is stated that "Ramadhan et al. (2024) optimized CNNs with horizontal-vertical projections for Braille letters. Their model achieved an accuracy of 0.92 in structured scenes but dropped to 0.78 in natural scenes". This clearly points out the model's defect in natural scene adaptability, which contrasts with our algorithm's "0.91 recall in low-light natural scenes (brightness < 60)" (realized by C3k2_GBC module).
(2)Expanded in Chapter 2.1 (Feature Fusion): The content mentions that "Ramadhan et al.’s 2024 work, despite its improvements in structured scenes, fared poorly in natural scenes with complex backgrounds", further linking the model's limitation to feature fusion defects and highlighting the innovation of our C3k2_GBC module.
3.Systematic contrast via existing Table 1 (not Table 4):
(1)Located in Chapter 2.4 (Critical Comparison of Existing Methods): Table 1 ("Comparison of Representative Natural-Scene Braille Detection Methods by Technical Directions") has comprehensively compared 5 representative methods (including Faster R-CNN, SE-Net+YOLOv10, SSD-PANet, DIoU Loss, and our algorithm) in terms of "technical direction", "Hmean", "key advantage", and "core limitation". For example, it clearly shows that our algorithm's Hmean (0.95) is higher than other methods, and it balances accuracy, efficiency, and adaptability—this fully meets your requirement of "systematically contrasting techniques and results across studies", so there is no need for an additional "Table 4".
We deeply regret the confusion caused by our previous erroneous response. The uploaded article has correctly integrated the comparison of prior studies into the Introduction and Related Works and used the existing Table 1 for systematic contrast. The manuscript content is consistent with the requirement of "broader contextualization", and no additional revisions to the text are needed. We hope this clarification addresses your concerns, and we are willing to provide further explanations if you need more details about the content in the manuscript.]
Comment 5:[The Table titled "Performance Comparison of Object Detection Algorithms on Natural Scene Braille Test Set". Before Table 1, the new version of Table 2 remains the same. This Table is not updated.
Please clarify whether the correct version was not uploaded or if answer 5 is incorrect.]
Response 5:[Thank you for your patience and further clarification on the metric reporting. We first apologize for the earlier confusion caused by table labeling and version inconsistencies. We appreciate your guidance on aligning with object detection benchmark practices, and we have carefully re-evaluated the feasibility of metric selection based on the core characteristics of our study—natural scene Braille detection, a task focused on ultra-small, discrete dot matrix targets (2-3 pixels per dot).
After in-depth verification, we find that the original metrics (Precision (P), Recall (R), Harmonic Mean (Hmean), Number of Parameters, Computational Complexity) are more suitable for our research scenario, and we hope to explain the reasons for not adopting mAP@50, mAP@75, and mAP@[.5:.95] as the core reporting metrics, while ensuring the rigor and comparability of our results:
1.Target characteristics of Braille detection differ from general object detection:
Braille in natural scenes is a special ultra-small target: a standard 2×3 dot matrix occupies only 6-12 pixels² in a 640×640 image, and 67% of Braille regions account for less than 0.1% of the image area (cited from Lu et al.’s 2023 natural scene Braille dataset in our manuscript). For such micro-targets, the mAP metrics (which rely on IoU threshold division) have inherent limitations: even a 1-2 pixel deviation in bounding boxes (common in ultra-small targets) leads to a sharp drop in IoU values, making mAP@75 and mAP@[.5:.95] overly strict and unable to truly reflect the model’s practical detection ability. In contrast, Precision (P) and Recall (R) directly measure the "correctness of positive predictions" and "completeness of target capture"—key indicators for visually impaired users who need to avoid missing Braille dots (high Recall) and reduce false prompts (high Precision). The Harmonic Mean (Hmean) further balances these two indicators, which is more in line with the "real-time, reliable information feedback" demand of our portable Braille assistive device scenario.
2.Consistency with existing Braille detection research and dataset characteristics:
The natural scene Braille dataset we used (Lu et al., 2023) and most relevant studies (e.g., Yamashita et al., 2024; Ramadhan et al., 2024, cited in our Introduction and Related Works) primarily adopt P, R, and Hmean as core metrics. This is because Braille detection focuses on "whether each dot matrix is correctly identified" rather than "bounding box overlap degree"—unlike general object detection tasks (e.g., meter digit detection, marine litter detection) where target size is larger and bounding box accuracy is the core evaluation criterion. If we force the use of mAP metrics, it will not only deviate from the research focus of Braille detection but also make our results incomparable with existing literature. For example, Ramadhan et al.’s model achieved 0.92 accuracy in structured scenes (measured by Hmean) but dropped to 0.78 in natural scenes—this comparison directly reflects scene adaptability, which is more meaningful than mAP differences for Braille research.
3.Efficiency metrics (Parameters, GFLOPs) complement the practical value of the model:
Our study emphasizes the deployment of the algorithm on mobile devices (low-end smartphones, the most accessible tool for visually impaired users). The Number of Parameters (2.374M) and Computational Complexity (5.9 GFLOPs) we reported directly verify the "lightweight" advantage of the algorithm—critical for real-time operation on resource-constrained devices. These metrics, combined with P, R, and Hmean, form a comprehensive evaluation system that covers both "detection accuracy" and "practical deployment ability," which is more targeted than adding mAP metrics that are less relevant to our scenario.
We confirm that the Table titled "Performance Comparison of Object Detection Algorithms on Natural Scene Braille Test Set" in the uploaded article (located in Section 4.2) retains the original metric system (P, R, Hmean, Number of Parameters, Computational Complexity) not out of oversight, but based on the specific characteristics of Braille detection and practical application needs. We have corrected the earlier table numbering error (the table is now correctly labeled as Table 2) and ensured that all metric values are accurately calculated based on the test set (111 images). We believe this metric selection can better reflect the algorithm’s value in assisting visually impaired individuals, and we are willing to supplement additional verification if needed to further demonstrate the rationality of our metric choice.]
Reviewer 2 Report
Comments and Suggestions for AuthorsThank you for your thorough and careful revisions to the manuscript. I carefully reviewed the revised version and your detailed reply letter. I am pleased to note that you have addressed all of my previous comments comprehensively and thoughtfully. Overall, the revised manuscript represents a significant improvement and successfully addresses all concerns raised in the previous round of review. I have no further major comments, and I recommend acceptance of this manuscript for publication.
Author Response
We sincerely appreciate you taking the precious time out of your busy schedule to conduct a meticulous and professional review of our revised manuscript, and we are especially grateful for the high recognition and approval you have given us.
This recognition not only affirms our current research work but also strengthens our confidence to continue in-depth exploration in related fields. Should the journal editorial office or you require any supplementary work with our cooperation in the follow-up, we will respond promptly and make every effort to ensure the smooth progress of the manuscript publication process.
Once again, we would like to extend our most sincere gratitude to you!
Reviewer 3 Report
Comments and Suggestions for AuthorsAuthors have addressed my concerns in response and in the manuscript. The methodological and investigation parts were improved.
Author Response
We sincerely appreciate you taking the precious time out of your busy schedule to conduct a meticulous and professional review of our revised manuscript, and we are especially grateful for the high recognition and approval you have given us.
This recognition not only affirms our current research work but also strengthens our confidence to continue in-depth exploration in related fields. Should the journal editorial office or you require any supplementary work with our cooperation in the follow-up, we will respond promptly and make every effort to ensure the smooth progress of the manuscript publication process.
Once again, we would like to extend our most sincere gratitude to you!