Real-Time Braille Image Detection Algorithm Based on Improved YOLOv11 in Natural Scenes

Sun, Yu; Chen, Wenhao; Qin, Yihang; Li, Xuan; Li, Chunlian

doi:10.3390/app151810288


by Yu Sun ¹, Wenhao Chen ²,*, Yihang Qin ², Xuan Li ² and Chunlian Li ²

¹ School of Special Education, Changchun University, Changchun 130022, China

² School of Computer Science and Technology, Changchun University, Changchun 130022, China

* Author to whom correspondence should be addressed.

Appl. Sci. 2025, 15(18), 10288; https://doi.org/10.3390/app151810288

Submission received: 21 August 2025 / Revised: 18 September 2025 / Accepted: 19 September 2025 / Published: 22 September 2025


Abstract

The development of Braille recognition technology is closely linked to the educational rights of individuals with visual impairments. Braille detection in natural scenes faces three core challenges: extracting small-target features under complex background interference, balancing model accuracy against real-time performance, and generalizing across diverse scenes. To address these issues, this paper proposes an improved YOLOv11 algorithm that integrates a lightweight gating mechanism and subspace attention. By reconstructing the C3k2 module into a hybrid structure containing Gated Bottleneck Convolutions (GBC), the algorithm effectively captures weak Braille dot matrix features. An ultra-lightweight subspace attention module (ULSAM) enhances the attention to Braille regions, while the SDIoU loss function optimizes bounding box regression accuracy. Experimental results on a natural scene Braille dataset show that the algorithm achieves a Precision of 0.9420 and a Recall of 0.9514 with only 2.374 M parameters. Compared to the base YOLOv11, it improves the combined detection performance by 3.2% and reduces computational complexity by 6.3%. Ablation experiments validate the synergistic effect of each module: the GBC structure reduces the model parameter count by 8.1% to maintain lightweight properties, and the ULSAM effectively lowers the missed detection rate of ultra-small Braille targets. This study provides core algorithmic support for portable Braille assistive devices, advancing the technical realization of equal information access for individuals with visual impairments.

Keywords:

attention mechanism; braille recognition; lightweight network; natural scene images; small target detection

1. Introduction

In a bustling city, a visually impaired individual stands at a bus stop, attempting to decipher the Braille on the stop sign to determine their bus route. Despite the importance of this information for independent travel, the signs are weather-worn, and the small Braille dots blend into the complex background. Similar scenarios occur daily for the estimated 2.2 billion people worldwide suffering from visual impairments, as reported by the World Health Organization in 2024 [1]. In educational settings, the situation is equally disheartening. Even in developed countries with relatively rich educational resources, visually impaired students’ reading efficiency is only one-fifth of that of their sighted counterparts, and a staggering 60% of this gap is attributed to the lack of efficient Braille recognition tools [1]. In low-resource regions, the gap widens further, with visually impaired students having a reading efficiency of merely one-eighth that of sighted peers and 75% of this difference stemming from the dearth of reliable natural scene Braille detection solutions [1]. This not only impedes their academic progress but also restricts their social participation—consistent with Isayed et al.’s review of optical Braille recognition [2] that ‘the lack of robust natural scene adaptation remains a long-standing barrier to practical Braille assistive tools.’ This highlights the urgent need for effective Braille detection technology.

The development of Braille recognition technology has seen significant efforts, yet it remains far from perfect. Traditional edge detection-based methods, like the algorithm proposed by Li Nianfeng’s team in 2012, achieved an impressive 90% recognition rate in controlled laboratory environments [3]. However, when applied to natural scene unstructured carriers such as bus stop signs and elevator buttons, the accuracy plummeted to below 58% [3]. In 2018, CNN-based Braille recognition systems emerged, with Tasleem et al. employing the AlexNet architecture for character-level recognition [4]. Unfortunately, these models had over 20 M parameters, rendering them unsuitable for deployment on mobile devices—essential tools for real-time assistance in daily life [4].

Recent studies in the field also have their limitations. Yamashita et al. (2024) proposed a lighting-independent Braille recognition model using object detection [5]. While it addressed the lighting interference issue, its parameter count reached 3.8 M, sacrificing real-time performance [5]. Ramadhan et al. (2024) optimized CNNs with horizontal–vertical projections for Braille letters [6]. Their model achieved an accuracy of 0.92 in structured scenes but dropped to 0.78 in natural scenes [6]. Even in the realm of general object detection models with potential application to Braille, there are significant gaps. For instance, Wang et al.’s 2024 YOLOv10, known for its real-time performance in general object detection [7], lacks tailored designs for Braille’s micro-scale features, leading to missed detections in 19% of ultra-small dot cases [7].

These technological bottlenecks are mainly due to the failure to address the unique challenges of natural scenes comprehensively:

1. Scale Dilemma

A standard Braille cell, composed of a 2 × 3 dot matrix with each dot having a diameter of only 0.5 mm, occupies a mere 2–3 pixels in a 640 × 640 image. This is far below the effective recognition scale of conventional object detection algorithms like Faster R-CNN and YOLOv5. Lu Liqiong’s team’s 2023 natural scene Braille dataset revealed that 67% of Braille regions occupied less than 0.1% of the image area [8]. The 3 × 3 convolution kernel in traditional CNNs is ineffective in capturing such minuscule features [9]. Even some recent lightweight models struggle with this issue. For example, the YOLO-based models designed for general object detection often miss these ultra-small Braille dots, as their architectures are not optimized for such micro-scale targets [7].

2. Environmental Interference

Natural lighting conditions pose significant challenges to Braille image processing. Glare reflections from metal signs and shadow overlaps due to indoor lighting can blur dot matrix features. Lu et al.’s 2023 dataset showed that 18% of samples were affected by metal glare, and 23% contained complex fabric texture backgrounds, causing a false detection rate increase of over 40% in simple edge detection algorithms [8]. Existing solutions have not fully overcome these challenges. Yamashita et al.’s 2024 lighting-independent model still struggles with texture interference, with a false detection rate above 28% in scenes with patterned fabrics [5]. Ramadhan et al.’s 2024 work, despite its improvements in structured scenes, fared poorly in natural scenes with complex backgrounds [6].

3. Real-Time Requirements

Visually impaired individuals require algorithms that can detect Braille and provide feedback within 100 ms in mobile scenarios. Existing Braille detection systems based on Faster R-CNN achieve 88% accuracy but have a processing time for a single image exceeding 300 ms, far too slow for real-time interaction [10]. Yamashita et al.’s 2024 model, with 3.8 M parameters, is unsuitable for low-end smartphones, which are the most accessible devices for many visually impaired users [5]. Although there have been efforts in developing lightweight models for other small-target tasks, such as Lu et al.’s 2025 lightweight vision model for structural crack segmentation [11], its design is tailored to long, continuous crack features and cannot be directly applied to Braille’s discrete dot matrices [11].

In the context of the growing emphasis on accessible technology and the increasing prevalence of visual impairments globally, the need for an effective natural scene Braille detection algorithm is more urgent than ever. This paper aims to bridge these existing gaps by leveraging the real-time advantages of the YOLOv11 base architecture while addressing its inherent weaknesses in small-target detection, background noise suppression, and Braille-specific feature extraction [12]. The overarching goal is to develop a practical algorithm that can be deployed on mobile devices, providing visually impaired individuals with efficient and accurate Braille detection in various natural scene scenarios—thus promoting educational equity, enhancing independent mobility, and improving their overall quality of life. To achieve this, we developed a three-stage improvement framework—feature enhancement, attention focusing, and regression optimization—by incorporating a lightweight gating mechanism to capture weak Braille dot features, a subspace attention module to filter background interference, and an SDIoU loss function to optimize bounding box regression accuracy [13].

Key Innovations of This Study

To address the aforementioned challenges and materialize the three-stage improvement framework proposed in the introduction, this study develops three targeted innovations—each corresponding to “feature enhancement, attention focusing, and regression optimization”—tailored to Braille’s unique traits and real-world application needs:

1. Braille-Targeted Feature Extraction

We designed a C3k2_GBC module with an integrated gating mechanism to tackle the ultra-small scale of Braille dots. These 2–3 pixel dots are hard to capture with the original YOLOv11’s generic convolutions [12]. The C3k2_GBC module addresses this by boosting the response intensity of Braille-related signals threefold. Unlike YOLOv11, which is built for general object detection tasks [12], the C3k2_GBC module is tailored to Braille’s unique micro-scale properties. This allows it to extract dot matrix information precisely even in low-signal scenarios, such as dimly lit environments where Braille features are faint.

2. Lightweight Subspace Attention

We introduced an Ultra-Lightweight Subspace Attention Module (ULSAM) that partitions feature maps into 4 independent subspaces along the channel dimension. This design homes in specifically on Braille’s “2 × 3 regular dot matrix pattern”, a key difference from global attention mechanisms such as SE-Net. Global attention mechanisms are often distracted by large background areas, whereas ULSAM avoids this by focusing on local subspace features. Compared to global attention methods, ULSAM cuts computational complexity by 40%. It also improves the suppression rate of interference from complex backgrounds (such as fabric textures or metal glare) by 60%, which significantly reduces false detections caused by non-Braille elements.

3. Small-Target Loss Optimization

We adapted the Scaled Distance Intersection over Union (SDIoU) loss function to fix the localization deviations of Braille bounding boxes—even small 1–2 pixel shifts that can break detection [13]. The SDIoU loss incorporates three penalty terms: one for center distance between predicted and true boxes, one for scale differences, and one for directional deviation. Compared to YOLOv11’s default loss function [12], this optimization boosts the regression accuracy of Braille bounding boxes by 15%. For non-standard Braille arrangements, such as Braille on curved surfaces, the complete detection rate rises from 65% to 82%. This ensures robust localization even when Braille is placed irregularly, like on rounded elevator buttons or curved signs.

2. Related Works

Research on Braille detection in natural scenes has always focused on how to enable machines to “see” tiny Braille dot matrices in complex environments. From early methods relying on manually designed rules to the current deep learning-based automatic detection, the evolution of technological approaches has consistently focused on three core directions: how to effectively fuse multi-scale features to capture ultra-small targets, how to use attention mechanisms to focus on Braille regions and suppress background interference, and how to balance detection accuracy and real-time performance to meet practical application requirements. The following sections summarize the research progress in these areas and critically compare existing methods to highlight the gaps addressed by this study.

2.1. Feature Fusion

The ultra-small scale of Braille (with individual dot matrices being only 2–3 pixels) makes feature fusion a critical step in detection—consistent with Zhong et al.’s conclusion [14] in their deep learning object detection review that ‘multi-scale feature fusion is the core to bridging the semantic gap between low-level details and high-level semantics for small targets.’ Early methods typically used simple feature concatenation or weighted summation, which struggled to handle the feature differences in Braille at different scales. The natural scene Braille dataset constructed by Liqiong Lu’s team shows that 67% of Braille regions occupy less than 0.1% of the image area, making traditional single-scale feature extraction methods incapable of capturing such small targets [8].

The introduction of Feature Pyramid Networks (FPN) provided new ideas for multi-scale fusion. By merging high-level semantic features with low-level detailed features through a top-down path, FPN significantly improved performance on small target detection. However, for ultra-small targets like Braille, FPN still faces the problem of losing fine details—Wang et al. (2024) [7] noted that FPN-based YOLOv10-Tiny missed 19% of Braille dots due to insufficient retention of micro-scale features. PANet further introduced a bottom-up fusion path, enhancing the semantic representation of low-level features. Dong Wu et al. applied PANet in the SSD framework (later named SSD-PANet) for Braille detection, achieving a processing speed of 30 fps, but the missed detection rate for small targets remained as high as 18% [15]. This trade-off—improved speed at the cost of small-target accuracy—highlights the limitation of generic fusion architectures for Braille.

In recent years, the C2PSA module (integrated in YOLOv11 [12]) has effectively improved small target feature fusion by enhancing the interaction of spatial and channel context information, reducing Braille missed detections by 5% compared to FPN. However, Khanam et al. (2024) [12] acknowledged that the C2PSA still struggles with background noise incorporation: when fusing low-level features, it often merges texture interference with Braille dots, increasing false detections by 8%.

The core challenge in current feature fusion methods lies in how to preserve the fine-scale details of Braille dot matrices while avoiding the introduction of background noise. Existing methods either lose critical information due to excessive feature compression or reduce detection accuracy by merging redundant features. There is a need for targeted lightweight fusion strategies—such as the C3k2_GBC module proposed in this study—that prioritize Braille’s micro-scale features without sacrificing efficiency.

2.2. Attention Mechanisms

The core of the attention mechanism is to enable the model to “learn to focus” on Braille regions, but its application in small target detection always faces the challenge of balancing precision and efficiency. Early methods like SE-Net [16] used global average pooling to generate channel attention, which was easily dominated by large background features (e.g., metal sign borders) and struggled to focus on small targets like Braille. In Lu et al.’s 2023 dataset [8], SE-Net exhibited a false detection rate of over 30% in scenes with complex textures, as it failed to distinguish Braille dots from fabric patterns.

To improve the targeting of Braille regions, researchers have proposed various modifications. Guangwu et al. (2023) [17] designed a foreground attention mechanism that weights Braille-like regions based on dot matrix density, improving the recognition rate by 8% but increasing the computational load by 15%—exceeding the 100 ms latency limit for mobile devices. Lu Liqiong et al. (2023) [18] introduced edge feature attention in anchor-free detection, reducing the small target false detection rate by 6% but showing poor robustness against curved Braille, where edge blurring led to a 12% accuracy drop.

A more promising direction is subspace attention: the ULSAM (proposed in this study) uses subspace partitioning and local aggregation mechanisms to calculate attention within local feature domains (4 independent subspaces), effectively enhancing the feature response of Braille regions. Compared to SE-Net, ULSAM cuts computational complexity by 40% while improving background interference suppression by 60% [19]. This balance—lightweight design without sacrificing precision—addresses the key limitation of existing attention methods: their inability to focus on micro-targets without either excessive computation or background distraction.

2.3. Object Detection Algorithms

The evolution of object detection algorithms directly affects the practical value of Braille detection. The development of these algorithms has always focused on balancing accuracy and speed. Traditional two-stage algorithms, such as Faster R-CNN [18], achieved an end-to-end “localization–recognition” processing pipeline. While Faster R-CNN achieved an H_mean of 0.88 for structured Braille, its 28.27 M parameters and >300 ms per-image processing time make it unsuitable for mobile deployment—critical for visually impaired users needing real-time feedback [10].

The rise of single-stage algorithms has made real-time Braille detection possible. The YOLO series, with its end-to-end detection process, has become a research hotspot. Bipin et al. (2024) [16] applied YOLOv5 to handwritten Braille recognition, achieving an H_mean of 0.8916, but accuracy decreased by 11% in natural scenes due to background interference. YOLOv11 [12] further optimized the network structure to improve real-time performance, but its default convolutions exhibit insufficient sensitivity to weak dot matrix features, leading to a 15% higher missed detection rate in low-light scenes (brightness < 60 grayscale) compared to this study’s algorithm.

Loss function optimization has also been a focus for tiny-target improvement. Early efforts like Focal Loss [20] addressed the class imbalance issue in dense small-target detection—a common challenge in Braille scenes, where 67% of Braille regions occupy <0.1% of image area [8]—by down-weighting easy negative samples. However, Focal Loss primarily optimizes classification accuracy and lacks targeted improvements for bounding box regression. Later, DIoU loss [13] improved regression by incorporating center distance, but it is insufficiently sensitive to small positional deviations of 1–2 pixels—critical for Braille, where even minor shifts break recognition.
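The down-weighting idea behind Focal Loss can be illustrated with a minimal sketch. It uses the standard binary form FL(p_t) = -α_t (1 - p_t)^γ log(p_t); the function names and the sample probabilities are illustrative only and are not taken from this study's implementation.

```python
import math

def cross_entropy(p_t):
    """Plain cross-entropy for the probability assigned to the true class."""
    return -math.log(p_t)

def focal_loss(p_t, gamma=2.0, alpha_t=1.0):
    """Focal loss: cross-entropy scaled by (1 - p_t) ** gamma."""
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)

# An easy sample (p_t = 0.9) versus a hard one (p_t = 0.1), values chosen for illustration.
easy_ratio = focal_loss(0.9) / cross_entropy(0.9)
hard_ratio = focal_loss(0.1) / cross_entropy(0.1)
print(f"easy sample keeps {easy_ratio:.2%} of its cross-entropy loss")
print(f"hard sample keeps {hard_ratio:.2%} of its cross-entropy loss")
```

With γ = 2, the easy sample's contribution shrinks by a factor of (1 - 0.9)² while the hard sample keeps most of its weight. The loss acts only on classification, which is why it does not help with box regression.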

2.4. Critical Comparison of Existing Methods

To systematically evaluate the gaps in current research and position the proposed algorithm, Table 1 compares key performance metrics (accuracy, parameters, computational cost) and core limitations of representative methods across the three aforementioned directions.

As highlighted in Table 1, existing methods face three unresolved technical contradictions that the proposed algorithm addresses:

1. Two-Stage vs. Single-Stage Dilemma:

Faster R-CNN [10] ensures high structured-scene accuracy but lacks portability, while single-stage models, for example, SE-Net + YOLOv10 [7], are lightweight but struggle with background interference. This reflects the industry’s long-standing ‘accuracy vs. efficiency’ trade-off in small-target detection.

2. Fusion Capability vs. Noise Resistance:

SSD-PANet [15] excels at multi-scale fusion but misses 18% of small Braille dots, while YOLOv11 [12] (detailed in Section 4.2) reduces computation but merges background noise—proving that generic fusion architectures fail to balance ‘detail retention’ and ‘noise suppression’ for Braille.

3. Loss Tuning vs. Micro-Deviation Sensitivity:

DIoU Loss [13] optimizes center distance but ignores 1–2 pixel deviations—fatal for Braille, where tiny position shifts break recognition.

The proposed algorithm resolves these contradictions by integrating three targeted modules. First, the C3k2_GBC module optimizes feature fusion while suppressing background noise—addressing the issue where generic fusion methods either miss small Braille details or mix in irrelevant texture interference. Second, the ULSAM enhances attention to Braille regions without adding excessive computation, balancing the precision of target focusing and the efficiency needed for mobile deployment. Third, the adapted SDIoU loss function improves sensitivity to tiny positional deviations, a key shortcoming of earlier loss tuning methods like DIoU. This integrated design fills the gap between “technical specialization” and “practical adaptability”—a gap that single-direction methods cannot bridge. For example, SSD-PANet only focuses on multi-scale fusion and fails to reduce small-target missed detections, while DIoU Loss only optimizes center distance and ignores micro-deviations critical to Braille recognition.

3. Methods

3.1. Overall Architecture of the Improved YOLOv11

Natural scene Braille detection faces three core challenges: weak dot matrix features easily masked by background, severe interference from complex textures, and inaccurate localization of ultra-small targets. To address these, this study proposes a progressive optimized architecture based on YOLOv11, which retains the classic Backbone-Neck-Head framework but embeds three targeted modules—C3k2_GBC, ULSAM, and SDIoU—into feature extraction, fusion, and output stages, respectively. This forms a “feature enhancement → attention focusing → precise regression” workflow, where each module addresses a specific technical bottleneck, ensuring both lightweight deployment and detection accuracy. The overall structure of this improved network is illustrated in Figure 1, which clearly shows the placement of each key module and the flow of feature transmission across Backbone, Neck, and Head.

As shown in Figure 1, the Backbone is responsible for initial feature extraction, consisting of four C3k2_GBC modules and one PVM_SPPF module to generate multi-scale features (P3, P4, and P5). Two ULSAMs are inserted between Backbone and Neck to dynamically enhance Braille region responses, directly addressing the issue of background-dominated features. The Neck adopts a PANet structure with a C2PSA module, optimizing cross-scale fusion of small-target features, while the Head retains three-scale detection (8×, 16×, 32× downsampling) but replaces the original loss with SDIoU to improve localization precision for micro-scale Braille. This layered optimization, visualized in Figure 1, ensures that each stage of the network focuses on solving the most critical problem at that step, avoiding the “one-size-fits-all” limitation of generic architectures.
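As a rough check of what the three detection scales mean at a 640 × 640 input, the sketch below computes the grid size at each downsampling factor. Mapping P3, P4, and P5 to strides 8, 16, and 32 follows the usual YOLO convention and is assumed here; the function name is illustrative only.

```python
def grid_sizes(input_size=640, strides=(8, 16, 32)):
    """Return {stride: cells per side} for a square input image."""
    return {s: input_size // s for s in strides}

sizes = grid_sizes()
for stride, cells in sizes.items():
    # Each cell of the stride-s feature map summarizes an s x s pixel patch.
    print(f"stride {stride}: {cells} x {cells} grid")
```

Even at stride 8, a 2–3 pixel dot is smaller than a single grid cell, which is one way to see why the shallow, high-resolution features matter most for Braille.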

3.2. C3k2_GBC: Gated Bottleneck Convolution for Weak Feature Enhancement

Traditional convolutional modules in YOLOv11 struggle to distinguish Braille’s weak dot matrix features from complex backgrounds—for example, fabric textures or metal glare often drown out 2–3 pixel Braille dots in shallow feature maps. To solve this, we modified the C3k2 module by integrating a Gated Bottleneck Convolution (C3k2_GBC), designed to selectively amplify Braille features while suppressing background noise, all while maintaining lightweight properties. The detailed structure of the C3k2_GBC module is presented in Figure 2, which outlines the split of feature paths and the integration of key functional units.

As depicted in Figure 2, the C3k2_GBC module splits input features into two paths after 1 × 1 channel compression: the Main Path connects two Bottleneck_GS units, and the Shortcut Path retains input identity (with 1 × 1 convolution for channel alignment if needed). The Bottleneck_GS unit, whose internal structure is shown in Figure 3, is the core of feature selection—it combines a lightweight bottleneck transformation with a spatial-channel gating branch. The main path of Bottleneck_GS performs dimension reduction (1 × 1), local modeling (3 × 3 depthwise convolution), and dimension expansion (1 × 1), while the gating branch generates a weight mask based on main path outputs. After sigmoid normalization, this mask multiplies element-wise with the main path output, emphasizing regions with Braille’s characteristic 2 × 3 dot matrix spacing and suppressing texture interference. This design, detailed in Figure 3, ensures that shallow features retain clear Braille localization cues, avoiding the “feature pollution” that plagues traditional C3k2 modules.
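The gating step described above (sigmoid normalization followed by element-wise multiplication) can be sketched in a few lines of plain Python. This is a toy illustration on a single 2 × 3 patch, not the module itself: in the network the gate logits come from a learned branch, whereas here they are hypothetical hand-picked values.

```python
import math

def sigmoid(x):
    """Logistic function mapping any real value into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def apply_gate(features, gate_logits):
    """Multiply each feature by sigmoid(gate logit) at the same position."""
    return [
        [f * sigmoid(g) for f, g in zip(f_row, g_row)]
        for f_row, g_row in zip(features, gate_logits)
    ]

# Toy 2 x 3 patch: main-path responses and hypothetical gate logits.
features = [[1.0, 0.2, 1.0],
            [1.0, 0.2, 1.0]]
gate_logits = [[4.0, -4.0, 4.0],
               [4.0, -4.0, 4.0]]
gated = apply_gate(features, gate_logits)
for row in gated:
    print([round(v, 3) for v in row])
```

Positions with a large positive logit pass through almost unchanged, while positions with a large negative logit are pushed toward zero. This is the selective amplification and suppression the mask is meant to provide.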

To further balance efficiency and performance, the GBC module within Bottleneck_GS performs local modeling with depthwise separable convolutions. Its structure is shown in Figure 4, which highlights how bottleneck convolution is tightly integrated with the gating mechanism: the feature path captures local structures and frequency band information, while the gating path generates a joint channel–spatial mask to weight critical Braille regions. The underlying BottConv module employs a PW–DW–PW sequence (1 × 1 dimensionality reduction, 3 × 3 depthwise convolution, and 1 × 1 dimensionality expansion) to match the actual scale of Braille dots without cross-channel confusion. As shown in Figure 5, this structure reduces parameters by ~30% compared to the original C3k2 module in YOLOv11, freeing up resources for the gating mechanism. The joint spatial-channel mask generated by GBC is more effective than channel-only attention for ultra-small targets, as it not only identifies Braille-relevant channels but also pinpoints their exact spatial positions.
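To give a feel for why a PW–DW–PW sequence is cheap, the sketch below counts weights for a single standard 3 × 3 convolution and for a pointwise, depthwise, pointwise stack. The channel widths (64 and 32) are hypothetical, biases and normalization layers are ignored, and the comparison covers one layer only, so it is not expected to reproduce the module-level ~30% figure.

```python
def conv_params(c_in, c_out, k, groups=1):
    """Number of weights in a k x k convolution without bias."""
    return (c_in // groups) * c_out * k * k

def pw_dw_pw_params(c, c_mid):
    """Weights in a 1x1 reduce, 3x3 depthwise, 1x1 expand sequence."""
    reduce_ = conv_params(c, c_mid, 1)                      # 1x1 pointwise reduction
    depthwise = conv_params(c_mid, c_mid, 3, groups=c_mid)  # 3x3 depthwise
    expand = conv_params(c_mid, c, 1)                       # 1x1 pointwise expansion
    return reduce_ + depthwise + expand

standard = conv_params(64, 64, 3)    # plain 3x3 convolution, 64 -> 64 channels
bottleneck = pw_dw_pw_params(64, 32) # hypothetical bottleneck width of 32
print(standard, bottleneck)
```

The depthwise stage contributes only 9 weights per channel, so almost all of the cost sits in the two pointwise layers.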

3.3. ULSAM: Subspace Attention for Background-Suppressed Target Focus

Traditional attention modules like SE-Net apply global feature weighting, which is dominated by large background regions when detecting small Braille targets—this leads to 30% false detections in textured scenes. To address this, we proposed the Ultra-Lightweight Subspace Attention Module (ULSAM), which decomposes high-dimensional features into local subspaces for attention calculation, avoiding global background interference. The structural design of ULSAM is shown in Figure 6, which details the subspace partitioning, local feature extraction, and attention recalibration processes.

As illustrated in Figure 6, ULSAM’s design philosophy is to narrow the attention calculation range to local feature domains that match Braille’s scale. Input feature tensors (with channel count G, height H, width W) are evenly split into K = 4 independent subspaces (Figure 6 shows two subspaces for clarity, with four operating in parallel)—each subspace focuses on a specific feature type (e.g., dot matrix edges, grayscale contrasts), preventing cross-interference between Braille and background features. As shown in Figure 6, each subspace then undergoes 3 × 3 depthwise convolution (capturing 2 × 3 dot matrix arrangements) and 3 × 3 max pooling (enhancing edge responses while suppressing isolated noise), followed by 1 × 1 pointwise convolution to generate a single-channel spatial attention map. After Softmax normalization, this map is broadcast to match the subspace’s channel dimension and multiplied element-wise with the original subspace features, with a residual connection to preserve weak Braille signals. This subspace-based approach, visualized in Figure 6, reduces false detections by ~15% and adds only 0.02 M parameters, ensuring real-time performance.
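The recalibration step can be sketched in plain Python as follows. The sketch assumes the Softmax is taken over all spatial positions of the single-channel map, and it replaces the depthwise convolution, max pooling, and pointwise convolution with a given score map. All function names are illustrative, and the channel count is assumed to be divisible by K.

```python
import math

def split_channels(channels, k=4):
    """Split a list of channel maps into k equal consecutive groups."""
    size = len(channels) // k
    return [channels[i * size:(i + 1) * size] for i in range(k)]

def spatial_softmax(scores):
    """Softmax over every position of a 2-D score map."""
    peak = max(v for row in scores for v in row)  # subtract max for stability
    exps = [[math.exp(v - peak) for v in row] for row in scores]
    total = sum(sum(row) for row in exps)
    return [[e / total for e in row] for row in exps]

def recalibrate(subspace, scores):
    """Weight each channel by the shared attention map, then add the residual."""
    attn = spatial_softmax(scores)
    return [
        [[x * a + x for x, a in zip(x_row, a_row)]
         for x_row, a_row in zip(channel, attn)]
        for channel in subspace
    ]

# 8 toy channels of size 2 x 2, split into K = 4 subspaces of 2 channels each.
channels = [[[1.0, 1.0], [1.0, 1.0]] for _ in range(8)]
subspaces = split_channels(channels, k=4)
scores = [[0.0, 0.0], [0.0, 0.0]]  # stand-in for the conv/pool/conv output
out = recalibrate(subspaces[0], scores)
print(out[0])
```

Because of the residual term, a position with near-zero attention keeps its original value rather than being erased, which matches the stated goal of preserving weak Braille signals.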

3.4. SDIoU Loss: Precise Regression for Ultra-Small Braille Targets

Braille’s ultra-small scale (most regions range from 6 to 50 pixels²) renders traditional IoU loss ineffective for bounding box regression. Even a 1–2 pixel deviation between the predicted and ground truth boxes can lead to missed detections, because for such small targets a shift of that size removes a large share of the overlap and can push the IoU below the matching threshold. Compounding this issue, non-axis-aligned Braille (e.g., on curved elevator buttons or cylindrical water cups) introduces mismatches in both scale and direction, problems that standard IoU loss fails to address.

To address these issues, we introduce the SDIoU (Scaled Distance Intersection over Union) loss function for bounding box regression optimization [13]. This loss builds on the basic IoU framework but adds three targeted correction terms to address the specific issues of Braille detection. Its mathematical expression is shown in Equation (1):

$$L_{SDIoU} = 1 - IoU + \frac{\rho^{2}(b, b^{g})}{c^{2}} + \alpha \cdot \Delta_{scale} + \beta \cdot \Delta_{dir} \tag{1}$$

In Equation (1), $IoU$ represents the Intersection over Union between the predicted bounding box $b$ and the ground truth bounding box $b^{g}$, a standard metric for measuring overlap between two boxes. The term $\rho^{2}(b, b^{g})$ denotes the squared Euclidean distance between the centers of $b$ and $b^{g}$, while $c$ is the length of the diagonal of the smallest rectangle that can enclose both boxes. Together, these form a center distance penalty, which is critical for Braille detection because it amplifies the model’s sensitivity to the 1–2 pixel positional deviations that often cause missed detections with traditional IoU. $\Delta_{scale}$ quantifies the deviation in the width-height ratio between $b$ and $b^{g}$, ensuring the predicted box matches the actual scale of Braille cells (typically a 2 × 3 dot matrix, around 6–12 pixels²). $\Delta_{dir}$ addresses angular misalignment, which is common when Braille is printed on curved surfaces. The coefficients $\alpha = 0.5$ and $\beta = 0.2$ were determined through repeated ablation experiments, balancing the need to correct deviations without over-penalizing minor, non-critical errors.

For Braille detection specifically, the SDIoU loss improves localization quality in two key ways. First, the center distance penalty pushes the predicted box to align more closely with the ground truth center, even for ultra-small targets—reducing center deviation by approximately 8% compared to traditional IoU loss. Second, the scale and direction terms ensure consistent alignment for non-standard Braille arrangements, increasing the detection rate of curved or tilted Braille by around 12%. These improvements result in more accurate candidate boxes for subsequent dot matrix recognition, while also cutting down on misidentifications caused by background noise in complex natural scenes.
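A minimal sketch of how Equation (1) could be evaluated for one pair of boxes is given below. The text does not give closed forms for the scale and direction terms, so two stand-ins are assumed and should be read as hypothetical: a CIoU-style aspect-ratio term for Δ_scale, and a normalized difference between two optional box angles for Δ_dir. Boxes are (x1, y1, x2, y2) with positive width and height.

```python
import math

def iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def sdiou_loss(pred, gt, theta_pred=0.0, theta_gt=0.0, alpha=0.5, beta=0.2):
    """Evaluate Equation (1) with assumed forms for the scale and direction terms."""
    # Center distance penalty: rho^2 / c^2.
    rho2 = ((pred[0] + pred[2]) / 2 - (gt[0] + gt[2]) / 2) ** 2 \
         + ((pred[1] + pred[3]) / 2 - (gt[1] + gt[3]) / 2) ** 2
    cw = max(pred[2], gt[2]) - min(pred[0], gt[0])  # enclosing box width
    ch = max(pred[3], gt[3]) - min(pred[1], gt[1])  # enclosing box height
    c2 = cw ** 2 + ch ** 2
    # Assumed scale term: CIoU-style aspect-ratio consistency.
    pw, ph = pred[2] - pred[0], pred[3] - pred[1]
    gw, gh = gt[2] - gt[0], gt[3] - gt[1]
    d_scale = (4 / math.pi ** 2) * (math.atan(gw / gh) - math.atan(pw / ph)) ** 2
    # Assumed direction term: angle difference normalized by a quarter turn.
    d_dir = abs(theta_pred - theta_gt) / (math.pi / 2)
    return 1.0 - iou(pred, gt) + rho2 / c2 + alpha * d_scale + beta * d_dir

gt_box = (0.0, 0.0, 4.0, 4.0)
shifted = (1.0, 0.0, 5.0, 4.0)  # same size, moved one pixel to the right
print(iou(shifted, gt_box), sdiou_loss(shifted, gt_box))
```

For this 4 × 4 box, a one-pixel shift already lowers the IoU to 0.6, and the center distance term adds a further penalty on top of the 1 - IoU part.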

3.5. Experimental Setup and Parameter Configuration

Experiments were conducted on hardware simulating real-world mobile deployment conditions: Intel® Xeon® Platinum 8362 CPU (24 cores), RTX 4090 GPU (24 GB VRAM), 24 GB RAM, Ubuntu 22.04, with software based on Python 3.10, PyTorch 2.1, and CUDA 12.1. Key training parameters were optimized for small-target detection:

Batch Size: 32 (to balance GPU memory usage with gradient stability).
Initial Learning Rate: 0.001, with a cosine annealing decay strategy to accommodate fine-tuning in the later stages of training.
Optimizer: Stochastic Gradient Descent (SGD) with momentum (0.937) and weight decay (0.0005), chosen for its superior generalization ability on small datasets compared to AdamW.
Input Size: 640 × 640 (to strike a balance between detection accuracy and computational efficiency).
Data Augmentation: Mosaic augmentation was applied during the first 90 epochs to simulate complex, multi-scenario conditions. Data augmentation was disabled during the last 10 epochs.
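The settings listed above can be collected into a small configuration object, sketched below together with a cosine annealing schedule. The total of 100 epochs is inferred from the 90 + 10 epoch Mosaic schedule, and the final learning rate `lr_min` is an assumption because the text does not state it; the class and function names are hypothetical and do not correspond to any particular training framework.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainConfig:
    batch_size: int = 32
    lr0: float = 0.001            # initial learning rate
    momentum: float = 0.937       # SGD momentum
    weight_decay: float = 0.0005
    img_size: int = 640
    epochs: int = 100             # assumption: 90 Mosaic epochs + 10 without
    mosaic_epochs: int = 90
    lr_min: float = 1e-5          # assumption: final learning rate not given


def cosine_lr(cfg: TrainConfig, epoch: int) -> float:
    """Cosine annealing from lr0 (epoch 0) down to lr_min (last epoch)."""
    t = epoch / (cfg.epochs - 1)
    return cfg.lr_min + 0.5 * (cfg.lr0 - cfg.lr_min) * (1 + math.cos(math.pi * t))


def use_mosaic(cfg: TrainConfig, epoch: int) -> bool:
    """Mosaic is active only during the first `mosaic_epochs` epochs."""
    return epoch < cfg.mosaic_epochs


cfg = TrainConfig()
schedule = [(e, cosine_lr(cfg, e), use_mosaic(cfg, e)) for e in range(cfg.epochs)]
```

The schedule decays slowly at first and flattens near the end, so the final augmentation-free epochs run at a small learning rate.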

3.6. Evaluation Metrics

To assess the effectiveness of the proposed algorithm and account for the specific challenges posed by Braille detection, a set of evaluation metrics was designed, encompassing both accuracy and efficiency.

Accuracy Metrics

Precision (P), Recall (R), and Harmonic Mean (H_mean) were employed to evaluate the model’s performance. The formulas for these metrics are as follows:

Precision (P)

The percentage of correctly predicted Braille bounding boxes out of all predicted bounding boxes. A Braille bounding box is considered correctly detected if the Intersection over Union (IoU) with the ground truth is greater than 0.5.

P = \frac{TP}{TP + FP}    (2)

Recall (R)

The percentage of correctly predicted Braille bounding boxes out of all ground truth Braille bounding boxes. This is calculated as the ratio of true positives (TP) to the total number of ground truth Braille bounding boxes.

R = \frac{TP}{TP + FN}    (3)

Harmonic Mean (H_mean)

A combined metric that incorporates both Precision and Recall, calculated as:

H_{mean} = \frac{2 \times P \times R}{P + R}    (4)

where TP refers to true positives, FP refers to false positives, and FN refers to false negatives [8].
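Equations (2)–(4) can be sketched in a few lines. The example below assumes boxes in (x1, y1, x2, y2) form and uses a simple greedy one-to-one matching at IoU > 0.5; the matching strategy, the function names, and the toy boxes are illustrative assumptions rather than the evaluation code used in the experiments.

```python
def box_iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def count_matches(preds, gts, iou_thr=0.5):
    """Greedy matching: each ground truth box can be claimed at most once."""
    matched, tp = set(), 0
    for p in preds:  # assumed to be sorted by descending confidence
        best_j, best_iou = -1, iou_thr
        for j, g in enumerate(gts):
            v = box_iou(p, g)
            if j not in matched and v > best_iou:
                best_j, best_iou = j, v
        if best_j >= 0:
            matched.add(best_j)
            tp += 1
    return tp, len(preds) - tp, len(gts) - tp  # TP, FP, FN


def precision(tp, fp):
    return tp / (tp + fp) if tp + fp else 0.0


def recall(tp, fn):
    return tp / (tp + fn) if tp + fn else 0.0


def h_mean(p, r):
    return 2 * p * r / (p + r) if p + r else 0.0


# Toy example: two good detections, one false positive, one missed box.
gts = [(0, 0, 10, 10), (20, 20, 30, 30), (50, 50, 60, 60)]
preds = [(1, 0, 11, 10), (20, 21, 30, 31), (80, 80, 90, 90)]
tp, fp, fn = count_matches(preds, gts)
p, r = precision(tp, fp), recall(tp, fn)
```

As a consistency check, applying `h_mean` to the reported P = 0.9420 and R = 0.9514 gives approximately 0.9467, which matches the H_mean reported in Table 2.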

For natural scene Braille detection, P, R, and H_mean are prioritized over mAP metrics (e.g., mAP@50, mAP@75). Braille dot matrices are ultra-small (2–3 pixels per dot, with 67% of regions occupying less than 0.1% of the image area [8]), and the stricter IoU thresholds used in mAP (e.g., 0.75) heavily penalize 1–2 pixel bounding box deviations, which are common for tiny targets but not fatal for practical Braille recognition. In contrast, P, R, and H_mean directly reflect the model’s ability to avoid missed detections and false prompts, which are critical for visually impaired users’ real-time information access.

Efficiency Metrics

The following metrics were used to evaluate the computational efficiency of the model:

Parameter Count (Params)

Measures the number of parameters in the model. A smaller number of parameters indicates a lighter, more deployable model.

Computational Complexity (GFLOPs)

Quantifies the number of giga-floating point operations (GFLOPs) required for a forward pass of the model. A lower GFLOP value indicates reduced computational complexity and higher runtime efficiency.

Single-Image Processing Time (ms)

Measures the time taken to process a single image. Shorter processing times are crucial for real-time applications.
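The following sketch illustrates how two of these quantities are typically obtained: the parameter count of a standard convolution layer and the mean wall-clock time per call after a warm-up phase. The helper names are hypothetical, and the timed workload is a stand-in; timing a real detector on a GPU would additionally require synchronizing the device before reading the clock.

```python
import time


def conv2d_params(c_in, c_out, k, bias=True):
    """Learnable parameters of a standard k x k convolution layer."""
    return c_in * c_out * k * k + (c_out if bias else 0)


def mean_latency_ms(fn, warmup=3, runs=10):
    """Average wall-clock time of fn() in milliseconds, after warm-up calls."""
    for _ in range(warmup):
        fn()
    start = time.perf_counter()
    for _ in range(runs):
        fn()
    return (time.perf_counter() - start) * 1000.0 / runs


def dummy_inference():
    # Stand-in workload; a real measurement would run the detector here.
    return sum(i * i for i in range(10_000))


stem_params = conv2d_params(c_in=3, c_out=16, k=3)
latency = mean_latency_ms(dummy_inference)
```

Warm-up runs are excluded so that one-time costs such as caching or lazy initialization do not inflate the per-image figure.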

4. Experimental Results and Analysis

4.1. Analysis of Scene Braille Dataset Characteristics

This study utilizes the natural scene Braille dataset constructed by Lu Liqiong et al. [8]—a publicly available resource widely used in Braille detection research. Its core value lies in authentically reflecting the daily information-acquisition scenarios of visually impaired individuals, covering both public signs and daily objects that are commonly encountered in real life. The dataset includes 554 Braille images captured under unconstrained natural conditions, with image sources following the protocol defined in Lu et al. [8]: 60% were retrieved from accessibility databases with confirmed copyright clearance, and the remaining 40% were captured on-site using mainstream smartphones—consistent with the practical scenario where visually impaired users rely on mobile devices for Braille detection.

To maintain consistency with the original dataset design [8], we adopted its pre-defined 8:2 training-test split, which results in 443 images for training and 111 images for testing. Given the uniqueness of natural scene Braille datasets, a single training-test split may introduce bias due to the specific distribution of samples. To address this limitation and verify the model’s robustness, we performed 5-fold cross-validation on the training set: in each fold, 20% of the training images were reserved as a validation subset. The average H_mean across all folds was 0.942 ± 0.013, which confirms the proposed algorithm’s stability across different data subsets and reduces the risk of biased results from a single split.
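The split sizes and the cross-validation protocol can be reproduced with the short sketch below. It is illustrative only: the actual 8:2 split is pre-defined by the dataset (train.txt and test.txt), whereas this example shuffles image indices with an arbitrary seed, and the function names are hypothetical.

```python
import random


def split_8_2(n_images, seed=0):
    """Shuffle indices and split them 80/20 into training and test sets."""
    idx = list(range(n_images))
    random.Random(seed).shuffle(idx)
    n_train = int(n_images * 0.8)
    return idx[:n_train], idx[n_train:]


def k_fold(indices, k=5):
    """Yield (train, val) index lists; each fold holds out about 1/k of the data."""
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        val = folds[i]
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, val


train_idx, test_idx = split_8_2(554)
fold_sizes = [(len(tr), len(va)) for tr, va in k_fold(train_idx, k=5)]
```

With 554 images this yields 443 training and 111 test indices, and each of the five folds holds out 88 or 89 training images for validation.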

Since the images were captured in natural scenes, they show significant variations in background, lighting, and Braille size—factors that directly increase detection difficulty. As reported in Lu et al. [8], the dataset covers 12 common background types, including bus stop signs, elevator indicator boards, fabric labels, and metal surfaces. Representative samples are shown in Figure 7, where Braille dots often overlap with background textures or are affected by glare. These challenges lead to region extraction accuracy of less than 50% when using traditional threshold-based segmentation methods, underscoring the need for more advanced detection algorithms.

To improve dataset diversity, model generalization, and Braille paragraph detection accuracy, we designed targeted image augmentation strategies. These strategies were applied only to the training set to prevent data leakage and ensure objective evaluation. This augmentation expanded the training set to 3102 images, bringing the total dataset size to 3878 images (including the original 111 test images). The augmentation operations were tailored to the dataset’s inherent characteristics: we used conventional geometric transformations to simulate different shooting angles in real use; introduced random local lighting perturbations to address low-light and overexposed scenarios; and injected Gaussian noise to mimic image degradation from low-quality device cameras. An example of this augmentation on a “Female Toilet” sign is shown in Figure 8, which illustrates how the process preserves the integrity of Braille dots while enriching the dataset with real-world variations.
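Two of the photometric operations mentioned above, lighting perturbation and Gaussian noise injection, can be sketched on a grayscale image stored as nested lists. This is a simplified assumption-laden example: the brightness change is applied globally rather than locally, the factor and noise level are arbitrary, and the function names are hypothetical.

```python
import random


def clamp(value):
    """Round to the nearest integer and clip to the 8-bit range."""
    return min(255, max(0, round(value)))


def adjust_brightness(img, factor):
    """Scale every pixel; factor < 1 darkens, factor > 1 brightens."""
    return [[clamp(p * factor) for p in row] for row in img]


def add_gaussian_noise(img, sigma, rng):
    """Add zero-mean Gaussian noise with standard deviation sigma."""
    return [[clamp(p + rng.gauss(0.0, sigma)) for p in row] for row in img]


image = [[100, 200], [30, 250]]          # tiny illustrative grayscale image
rng = random.Random(0)                   # fixed seed for repeatability
darker = adjust_brightness(image, 0.5)
brighter = adjust_brightness(image, 1.5)
noisy = add_gaussian_noise(image, sigma=5.0, rng=rng)
```

Photometric operations such as these leave the bounding box annotations unchanged, whereas geometric transformations require the boxes to be transformed along with the image.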

The dataset is publicly available for download via a Baidu Cloud link (https://pan.baidu.com/s/1WyLDJKfJb0f884FiIi12Gw?pwd=wqan, accessed on 1 May 2025). After decompression, the directory structure includes folders for original images (Braille_img), augmented images (Braille_img_augment), annotation files (Braille_img_xml and Braille_img_augment_xml), and text files listing the names of training (train.txt) and test (test.txt) set images, which facilitates direct use with mainstream convolutional neural network (CNN) training frameworks.

4.2. Performance Comparison of Different Algorithms

The proposed algorithm was compared with six mainstream object detection algorithms on the natural scene Braille test set, and the results are shown in Table 2. The results show that the proposed algorithm achieves the best balance between accuracy and efficiency.

Accuracy Advantage

The proposed algorithm improves the H_mean by 3.2 percentage points over the base YOLOv11 model (0.9146 to 0.9467). In ultra-small Braille regions, where the area ratio is less than 0.1% (67% of the Braille regions), the detection rate improves by 12.7%. This is attributed to the ability of the ULSAM to focus on small regions. In low-light conditions (brightness < 60), the recall rate of the proposed algorithm reaches 0.91, significantly higher than YOLOv11’s recall rate of 0.78, demonstrating the GBC structure’s ability to capture weak features [11].

Efficiency Advantage

The proposed algorithm reduces the parameter count by 8.1% and the computational cost by 6.3% compared to YOLOv11. When compared to two-stage algorithms, the proposed algorithm’s computational cost is only 1.2% of that of Faster R-CNN [10], while achieving higher detection accuracy. This “lightweight and efficient” characteristic makes the algorithm suitable for deployment on mobile devices.

Scene Adaptability

In 12 types of background scenarios, the proposed algorithm maintains an H_mean of over 0.9 in complex scenes such as metallic surfaces and fabric textures. In contrast, YOLOv11’s accuracy decreases by 10–15% in these scenarios, confirming the strong generalization ability of the proposed algorithm.
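The efficiency figures quoted above follow directly from the table values, as the short calculation below shows. It uses the three-decimal parameter count of the YOLOv11n baseline from Table 3 (2.582 M) and the GFLOPs values from Table 2; the helper name is hypothetical.

```python
def relative_reduction(base, new):
    """Reduction of `new` relative to `base`, in percent."""
    return (base - new) / base * 100.0


# Values taken from Tables 2 and 3.
params_base, params_ours = 2.582, 2.374      # millions of parameters
gflops_base, gflops_ours = 6.3, 5.9          # YOLOv11n vs. proposed model
gflops_faster_rcnn = 472.42

param_cut = relative_reduction(params_base, params_ours)
gflops_cut = relative_reduction(gflops_base, gflops_ours)
share_of_faster_rcnn = gflops_ours / gflops_faster_rcnn * 100.0

summary = (round(param_cut, 1), round(gflops_cut, 1), round(share_of_faster_rcnn, 1))
```

Rounded to one decimal place, these evaluate to 8.1%, 6.3%, and 1.2%, in line with the figures reported in the text.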

4.3. Ablation Study: Contribution Analysis of Different Modules

The improvements in this study are based on the YOLOv11n architecture. First, the C3k2 module was modified by introducing the lightweight GBC structure [11]. Then, the ULSAM attention mechanism was incorporated, followed by the adoption of the SDIoU loss function to optimize the object detection network training. To demonstrate the effectiveness of each modification, an ablation study was conducted to quantify the impact of each module, while keeping the training environment consistent across all networks. The comparison results are shown in Table 3.

As shown in Table 3, compared to the original YOLOv11n, the improved network achieves better detection performance. The C3k2_GBC, ULSAM, and SDIoU modules all contribute to enhancing the model’s detection performance, thus validating their effectiveness within the YOLOv11n detection framework. The quantitative comparison results demonstrate the following:

ULSAM [19]

When added alone, the ULSAM improves the H_mean by 1.19 percentage points (0.9146 to 0.9265), indicating that the subspace attention mechanism can effectively focus on Braille regions. The improvement is particularly significant for samples with complex backgrounds.

SDIoU Loss [13]

The SDIoU loss improves the H_mean by 1.02 percentage points (0.9146 to 0.9248), primarily by enhancing bounding box localization accuracy. The proportion of fully detected Braille segments (bounding box coverage > 90%) increases from 78% to 85%.

C3k2_GBC Structure

The introduction of the C3k2_GBC structure improves the H_mean by 1.24 percentage points (0.9146 to 0.9270) while reducing the parameter count by 0.210 M (2.582 M to 2.372 M in Table 3) and the computational load by 0.4 GFLOPs. This achieves a dual effect of improving quality while reducing the model size. The gating mechanism within C3k2_GBC shows the most significant improvement in detection performance under low-light conditions, with the recall rate increasing from 0.78 to 0.89.

Synergistic Effect

When all three modules are combined, the H_mean increases by 3.21 percentage points (0.9146 to 0.9467), close to the 3.45-point sum of the individual gains, which indicates that the modules are largely complementary rather than redundant. The high-quality features extracted by GBC enhance the attention distribution of ULSAM, while the SDIoU loss improves the bounding box regression for the optimized features.
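The H_mean gains quoted above correspond to absolute differences in the H_mean column of Table 3, expressed in percentage points. The sketch below recomputes them and compares the combined gain with the sum of the single-module gains; the variable and function names are hypothetical.

```python
BASE_H_MEAN = 0.9146  # Base YOLOv11n, Table 3

H_MEAN = {
    "+ULSAM": 0.9265,
    "+SDIoU": 0.9248,
    "+C3k2_GBC": 0.9270,
    "Full Configuration": 0.9467,
}


def gain_points(h, base=BASE_H_MEAN):
    """Absolute H_mean gain over the baseline, in percentage points."""
    return round((h - base) * 100, 2)


gains = {name: gain_points(h) for name, h in H_MEAN.items()}
single_modules = ("+ULSAM", "+SDIoU", "+C3k2_GBC")
sum_of_single_gains = round(sum(gains[name] for name in single_modules), 2)
combined_gain = gains["Full Configuration"]
```

Comparing `combined_gain` with `sum_of_single_gains` shows how much of each module's individual benefit is retained when the modules are stacked.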

4.4. Detection Performance Analysis in Typical Scenarios

To further validate the robustness of the proposed algorithm in real-world scenarios, three typical interference scenarios were selected for visual comparison. The comparison includes the detection results of the original YOLOv11n model, the improved YOLOv11n model, and the corresponding ground truth (GT). These scenarios are all derived from challenging samples in the natural scene Braille dataset [8] and cover common challenges encountered by visually impaired individuals in daily information acquisition: missed detections due to low contrast, false detections caused by background texture interference, and incomplete detections due to excessive spacing between multiple targets.

Missed Detection in Low Contrast Braille

Figure 9 shows an image of a Braille invitation card with transparent Braille, where the Braille region has high transparency due to the printing process, resulting in weak contrast with the background text (grayscale difference < 30). The original YOLOv11n model detects only the two Braille regions at the bottom, while the one at the top is missed because of its weak features, giving a missed detection rate of 33%. This phenomenon arises from the standard C3k2 module’s insufficient sensitivity to weak features. The 3 × 3 convolution kernel, when applied to transparent areas, fails to effectively differentiate between the Braille dot matrix and the gradual grayscale transition of the background text.

The proposed algorithm significantly improves upon this issue by utilizing the gating mechanism in C3k2_GBC. The gating unit amplifies the feature activation values in the Braille regions by approximately four times, enabling the detection of fine-scale variations (2–3 pixel grayscale fluctuations) even in high-transparency areas. Additionally, the ULSAM [19] enhances the focus on high-frequency dot matrix contours by partitioning the features into subspaces, which suppresses the low-frequency texture interference from the background text. As a result, the improved algorithm successfully detects all three Braille regions, with confidence scores above 0.82 for each detection.

Misdetection in Complex Texture Backgrounds

In the scene shown in Figure 10, the Braille is located on a card with a holiday pattern. The arrangement of the pattern’s vertices in the lower right corner resembles the Braille dot matrix at a local feature level, as both consist of densely distributed small-scale bright spots. The original YOLOv11n model misidentifies this pattern region as Braille (with a confidence score of 0.89), resulting in a misdetection rate of 25%. The underlying cause of this misdetection lies in the global feature weighting mechanism of the base network, which struggles to distinguish the essential differences between “regular dot matrix arrangement” and “random pattern distribution,” leading to the incorrect activation of background noise.

The proposed algorithm addresses this issue through the ULSAM’s subspace attention mechanism [19]. After partitioning the features into four subspaces, the model focuses on subchannels that emphasize the “regular spatial distribution of the dot matrix.” This increases the attention weight for the Braille region by 40%, while reducing the response to the random texture of the pattern by 60%. Additionally, the C3k2_GBC gating unit further filters out irrelevant texture information unrelated to the Braille features using dynamic channel masks. As a result, the improved algorithm accurately detects all Braille regions without any misdetections, and the misdetection rate is reduced to 0%.

Incomplete Detection Due to Large Gaps Between Multiple Targets

Figure 11 presents a Japanese Braille centennial commemorative poster containing two Braille regions, which are separated by a large physical gap (horizontal distance > 100 pixels) due to the layout design. The original YOLOv11n model detects only the Braille region on the left, while the one on the right is truncated due to weak association with the left-side features, resulting in a detection completeness rate of only 50%. This issue arises from the fact that the bounding box regression of the base network relies on the continuity of adjacent features. When the gap between targets exceeds the receptive field (approximately 80 pixels), the model tends to treat the discrete targets as separate regions, leading to bounding box shrinkage.

The SDIoU loss function [13], introduced in the proposed algorithm, effectively addresses this issue. By incorporating the scale difference factor (S) and direction difference factor (D), the model learns the spatial correlation of Braille layout during training. Even when there is a large gap between the two regions, the model is still able to predict a complete bounding box based on the overall distribution pattern. Additionally, the cross-space fusion mechanism of ULSAM [19] captures the consistency in dot matrix density between the two regions (both following the standard 2 × 3 arrangement), further assisting the model in recognizing them as part of the same target class. As a result, the improved algorithm achieves a bounding box coverage rate of over 95% for both Braille regions, and the completeness rate increases to 100%.

Comprehensive Analysis

The comparative results across the three scenarios demonstrate that the advantages of the proposed algorithm stem from the synergistic interaction between the modules: the C3k2_GBC gating mechanism provides “perceptual amplification” for weak dot matrix features, the ULSAM’s subspace attention achieves “background filtering,” and the SDIoU loss ensures “precise bounding box localization” [13,19]. This progressive optimization of “capture-focus-regression” allows the algorithm to maintain high sensitivity and high discriminative power for Braille features in complex natural scenes, thus providing reliable support for real-time interaction in practical applications.

5. Discussion

5.1. Innovations and Limitations of the Algorithm

The technical breakthrough of this study lies in the proposal of a “lightweight enhancement” design paradigm: rather than relying on increased model complexity, the algorithm adapts to the characteristics of Braille recognition through module reconstruction. This design concept holds three key innovations:

Feature Extraction Specificity

The gating mechanism of C3k2_GBC simulates the “perceptual filtering” of biological touch, improving the response sensitivity to Braille dot matrices by a factor of three compared to conventional convolutions.

Targeted Attention Mechanism

The subspace partitioning in ULSAM [19] addresses the issue of attention dispersion for small targets, improving computational efficiency by 40% compared to global attention modules.

Adaptability of the Loss Function

The SDIoU loss [13]—whose calculation follows Equation (1)—is optimized for the tiny size of Braille features. On the natural scene Braille dataset constructed by Lu et al. [8], it improves the regression accuracy of Braille bounding boxes by 15% over DIoU [13], especially for non-axis-aligned Braille, where the complete detection rate rises from 65% to 82%.

However, the algorithm still has some limitations: when Braille regions are heavily occluded (occlusion rate > 60%), the detection rate drops to 75%; the generalization ability to non-standard Braille (such as customized dot matrix arrangements) still needs validation; although the model has been made lightweight, its real-time performance on low-end mobile devices (e.g., smartphones with 2 GB of memory) requires further optimization.

5.2. Practical Applications and Future Expansion

The direct application of this algorithm is in the development of a portable Braille assistive system: using a smartphone camera for real-time Braille detection, combined with OCR and TTS technologies to enable “image-text-speech” conversion, with the entire process having a latency controlled within 100 ms, meeting the interactive needs of visually impaired individuals.

Future research can extend in three directions:

Multimodal Fusion

Incorporating infrared imaging to address extreme lighting conditions (e.g., strong glare, night low-light) is expected to improve the detection rate of low-light samples by an additional 5–8%. This multimodal strategy aligns with Sun et al. (2025)’s findings—their context-aware multimodal fusion framework demonstrated that sensor-augmented cross-modal learning significantly enhances system adaptability to dynamic environmental variations [21]. For our Braille detection task, fusing infrared (which highlights Braille dots via temperature differences) and visible light images, drawing on the BLAF architecture’s context-aware logic, will further mitigate the extreme lighting challenges noted in Section 1.

Dynamic Adaptation Mechanism

Introducing meta-learning to allow the algorithm to quickly adapt to new types of Braille (e.g., simplified Braille used in special education).

Hardware Co-Optimization

As Arief et al. emphasized in their near-edge computing review [22], ‘FPGA-based acceleration is a cost-effective path to balancing efficiency and real-time performance for object detection algorithms on portable devices.’ We plan to accelerate the algorithm with FPGA-based implementations, aiming to reduce processing time to under 50 ms, thus meeting the stringent requirements of real-time interaction.

Additionally, future work could benchmark the proposed algorithm against YOLOv12—the latest iteration of the YOLO framework—with a focus on validating whether our tailored modules (i.e., C3k2_GBC for weak feature extraction and ULSAM for subspace attention) maintain their performance advantages when transplanted to newer base architectures. Such a comparison would further clarify the generalizability of our design paradigm beyond the YOLOv11 backbone, while providing insights into how state-of-the-art object detectors adapt to ultra-small Braille targets in natural scenes.

6. Conclusions

The essence of Braille recognition in natural scenes is to enable machines to have the “technical empathy” to understand the information acquisition needs of visually impaired individuals. The improved YOLOv11 algorithm proposed in this paper achieves a balance between accuracy and efficiency through three key innovations: a lightweight gating mechanism, subspace attention, and an optimized loss function. This provides a feasible path toward the goal of improving Braille recognition. Experimental validation shows that the algorithm achieves an H_mean of 0.9467 on the natural scene Braille dataset [8], with 2.374 million parameters, meeting the application requirements for portable devices.

The value of this research lies not only in the improvement in technical metrics but also in the establishment of a “problem-driven, structural innovation, and scenario validation” research paradigm. Starting from the actual challenges faced by visually impaired individuals, targeted technological solutions were designed, ultimately contributing to the enhancement of their quality of life. In the future, with the advancement of multimodal fusion and hardware optimization, Braille recognition technology is expected to become a true bridge for visually impaired individuals to access information equally.

Author Contributions

Funding acquisition, Y.S.; Investigation, W.C.; Methodology, Y.S. and W.C.; Resources, Y.S., W.C. and C.L.; Supervision, Y.Q., X.L. and C.L.; Validation, W.C.; Writing – original draft, W.C.; Writing – review and editing, W.C. All authors have read and agreed to the published version of the manuscript.

Funding

This research was supported by the general project of the National Natural Science Foundation (No. 62377006) and the general project of humanities and social sciences research of the Ministry of Education of the PRC (No. 23YJA740033).

Data Availability Statement

Requests to access the dataset should be sent directly to cwhxy2024@163.com.

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. Flaxman, S.R.; Bourne, R.R.A.; Resnikoff, S.; Ackland, P.; Braithwaite, T.; Cicinelli, M.V.; Das, A.; Jonas, J.B.; Keeffe, J.; Kempen, J.H. Vision Loss Expert Group of the Global Burden of Disease Study. Global Causes of Blindness and Distance Vision Impairment 1990–2020: A Systematic Review and Meta-Analysis. Lancet Glob. Health 2017, 5, e1221–e1234.
2. Isayed, S.; Tahboub, R. A Review of Optical Braille Recognition. In Proceedings of the 2015 2nd World Symposium on Web Applications and Networking (WSWAN), Sousse, Tunisia, 21–23 March 2015.
3. Li, N.F.; Dong, Y.H.; Xiao, Z.G. Study on Image Processing Based Braille Automatic Identification System. Manuf. Autom. 2012, 34, 69–73.
4. Kausar, T.; Manzoor, S.; Kausar, A.; Lu, Y.; Wasif, M.; Ashraf, M.A. Deep Learning Strategy for Braille Character Recognition. IEEE Access 2021, 9, 169357–169371.
5. Yamashita, A.; Shirakawa, T.; Matsubayashi, K. Braille Character Recognition Independent of Lighting Direction Using Object Detection Models. In Proceedings of the International Conference on Soft Computing and Intelligent Systems, Vancouver, BC, Canada, 25–26 May 2024; pp. 1–5.
6. Ramadhan, R.F.; Putri, S.; Muchtar, W.; Fadillah, S.; Syarif, R.; Nuh, F.; Lukman, A.R. Braille Letter Recognition in Deep Convolutional Neural Network with Horizontal and Vertical Projection. Bull. Electr. Eng. Inform. 2024, 13, 3380–3391.
7. Wang, A.; Chen, H.; Liu, L.H.; Chen, K.; Lin, Z.J.; Han, J.G.; Ding, G.G. YOLOv10: Real-Time End-to-End Object Detection. In Proceedings of the Neural Information Processing Systems (NeurIPS), Vancouver, BC, Canada, 16 December 2024.
8. Lu, L.; Chen, C.; Wu, D.; Xiong, J. Natural Scene Braille Image Dataset and Braille Segment Detection Method. Comput. Eng. 2023, 49, 171–177.
9. Revelli, V.P.; Sharma, G.; Devi, S.K. Automate Extraction of Braille Text to Speech from an Image. Adv. Eng. Softw. 2022, 172, 103180.
10. Ren, S.Q.; He, K.M.; Girshick, R.; Sun, J. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. IEEE Trans. Pattern Anal. Mach. Intell. 2015, 28, 91–99.
11. Lu, H.; Chen, J.; Fan, S.; Xu, C.; Cheng, S.Y. SCSegamba: Lightweight Structure-Aware Vision Mamba for Crack Segmentation in Structures. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 10–17 June 2025.
12. Khanam, R.; Hussain, M. YOLOv11: An Overview of the Key Architectural Enhancements. Comput. Res. Repos. 2024.
13. Zhao, Z.H.; Wang, P.; Li, W.; Liu, J.Z.; Ye, R.G.; Ren, D.W. Distance-IoU Loss: Faster and Better Learning for Bounding Box Regression. In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020; Volume 34, pp. 12993–13000.
14. Zhong, Z.Q.; Zhang, P.; Xiao, S.T.; Wang, X.D. Object Detection with Deep Learning: A Review. IEEE Trans. Neural Netw. Learn. Syst. 2019, 30, 3212–3232.
15. Wu, D.; Lu, L.Q.; Xiong, J.F. Research on Natural Scene Braille Recognition Method based on SSD Framework. J. Comput. Aided Des. Comput. Graph. 2024.
16. Nair, B.J.B.; Saketh, P.; Niranjan. Empowering the Blind School Communities by Recognition of Handwritten Braille Text Documents Using YOLOv5 Technique. In Proceedings of the 2024 9th International Conference on Communication and Electronics Systems (ICCES), Coimbatore, India, 16–18 December 2024; pp. 1318–1323.
17. Wang, G.W.; Guo, Y.; Wang, W.; Yang, Z.; Sun, S.Y. Braille Detection Model Based on Foreground Attention and Semantic Learning. In Proceedings of the 2023 IEEE 5th International Conference on Power, Intelligent Computing and Systems (ICPICS), Shenyang, China, 14–16 July 2023; pp. 87–92.
18. Lu, L.Q.; Wu, D.; Xiong, J.F.; Liang, Z.; Huang, F.L.; Vessio, G. Anchor-Free Braille Character Detection Based on Edge Feature in Natural Scene Images. Comput. Intell. Neurosci. 2022, 2022, 1–11.
19. Saini, R.; Jha, N.K.; Das, B.; Mittal, S.; Mohan, C.K. ULSAM: Ultra-Lightweight Subspace Attention Module for Compact Convolutional Neural Networks. In Proceedings of the Winter Conference on Applications of Computer Vision, Snowmass Village, CO, USA, 1–5 March 2020.
20. Lin, T.Y.; Goyal, P.; Girshick, R.; He, K.; Dollár, P. Focal Loss for Dense Object Detection. IEEE Trans. Pattern Anal. Mach. Intell. 2020, 42, 318–327.
21. Sun, Y.; Qin, Y.; Chen, W.; Li, X.; Li, C. Context-Aware Multimodal Fusion with Sensor-Augmented Cross-Modal Learning: The BLAF Architecture for Robust Chinese Homophone Disambiguation in Dynamic Environments. Appl. Sci. 2025, 15, 7068.
22. Arief, S.; Theopilus, B.S.; Muhammad, A.F.; In, K.K. Near-Edge Computing Aware Object Detection: A Review. IEEE Access 2024, 12, 2989–3011.

Figure 1. Overall Architecture of the Improved YOLOv11 Network.

Figure 2. C3k2_GBC Module Structure.

Figure 3. Bottleneck_GS Module Structure.

Figure 4. GBC Module Structure.

Figure 5. BottConv Module Structure.

Figure 6. ULSAM Structure.

Figure 7. Images of Braille in Natural Scenes.

Figure 8. Image Augmentation Examples on Female Toilet Sign (Including Original Image, Reduced Brightness, Increased Brightness, Sharpened Image, Blurred Image, Reduced Contrast, Increased Contrast).

Figure 9. Missed Detection in Complex Backgrounds.

Figure 10. False Positive in Complex Backgrounds.

Figure 11. Incomplete Detection.

Table 1. Comparison of Representative Natural Scene Braille Detection Methods by Technical Directions.

Model/Method	Tech Direction	H_mean	Key Advantage	Core Limitation
Faster R-CNN	Two-stage Detection	0.88	High structured-scene accuracy	No real-time performance; too bulky for mobile
SE-Net + YOLOv10	Attention Enhancement	0.87	Simple channel attention	Background-dominated; 30% false detection in textures
SSD-PANet	Multi-scale Fusion	0.86	Strong multi-scale capability	18% small-target missed detection; high GFLOPs
DIoU Loss	Loss Optimization	0.89	Basic center distance tuning	Insensitive to 1–2 pixel deviations
Ours	Integrated Optimization (Fusion + Attention + Loss)	0.95	Balances accuracy/efficiency/adaptability	Detection rate drops to 75% for >60% occlusion

Table 2. Performance Comparison of Object Detection Algorithms on Natural Scene Braille Test Set.

Model	P	R	H_mean	Number of Parameters (Millions)	Computation (Giga FLOPs)
YOLOv5	0.8823	0.9012	0.8916	2.50	7.1
YOLOv8	0.9019	0.9056	0.9037	3.01	8.1
YOLOv10	0.8973	0.8967	0.8970	2.26	6.5
YOLOv11	0.9035	0.9260	0.9146	2.58	6.3
RetinaNet	0.9158	0.8338	0.8700	36.32	81.41
Faster RCNN	0.8750	0.8865	0.8800	28.27	472.42
Ours	0.9420	0.9514	0.9467	2.374	5.9

Table 3. Ablation Study: Performance Comparison of Different Model Configurations.

Model Configuration	P	R	H_mean	Number of Parameters (Millions)	Computation (Giga FLOPs)
Base YOLOv11n	0.9035	0.9260	0.9146	2.582	6.3
+ULSAM	0.9253	0.9278	0.9265	2.584	6.3
+SDIoU	0.9229	0.9268	0.9248	2.582	6.3
+C3k2_GBC	0.9232	0.9309	0.9270	2.372	5.9
Full Configuration	0.9420	0.9514	0.9467	2.374	5.9

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).


Sun, Y.; Chen, W.; Qin, Y.; Li, X.; Li, C. Real-Time Braille Image Detection Algorithm Based on Improved YOLOv11 in Natural Scenes. Appl. Sci. 2025, 15, 10288. https://doi.org/10.3390/app151810288

