1. Introduction
According to global statistics, over 2.2 billion people worldwide suffer from some form of visual impairment [1]. In China alone, the visually impaired population is approximately 17.31 million, making it the country with the largest number of blind individuals. Notably, 23.5% of this group are adolescents or young to middle-aged adults. This substantial demographic faces an urgent demand for safe and efficient mobility assistance tools. At present, most navigation methods rely on smartphones for global route planning. While such systems can offer general directional guidance, they fall short in providing timely and detailed local navigation during movement, particularly when encountering dynamic obstacles such as pedestrians, bicycles, and vehicles, or static traffic elements such as crosswalks and traffic lights. This limitation poses serious challenges to the efficiency and safety of travel for the visually impaired.
With the rapid advancement of robotic assistance technologies and intelligent perception algorithms, machine guide dogs are increasingly regarded as an ideal alternative to traditional guide dogs. These systems are becoming a crucial component in the mobility solutions available to visually impaired individuals. Compared to real guide dogs, which are costly to train and difficult to scale, quadruped robotic guide dogs based on intelligent navigation and environmental perception offer advantages such as high replicability and flexible deployment. In recent years, Xiao et al. [2] proposed a robot guide dog system based on the hybrid physical interaction of the Mini Cheetah quadruped robot and a traction rope. The system guides a blind user safely through narrow environments by varying the tension and slack of the rope, demonstrating the feasibility and effectiveness of robots in assisting visually impaired people to navigate real-world scenarios, as shown in Figure 1. However, ensuring the reliable operation of such systems in complex environments hinges on the ability to accurately and efficiently perceive typical obstacles and traffic cues in real time.
Therefore, constructing a visual perception system capable of road target detection is essential for the effective operation of guide robots. On the one hand, such a system enables accurate detection of common obstacles and traffic signs in road environments, thereby preventing visually impaired users from deviating from the intended path due to occlusions or ambiguous route information. On the other hand, it provides a reliable environmental awareness foundation for subsequent local path planning and dynamic obstacle avoidance. Vision-based object detection modules have thus become a core component in achieving safe, stable, and intelligent guidance for machine guide dogs.
The detection of typical road obstacles and traffic signs has long been a research focus in the field of computer vision. Currently, mainstream approaches in object detection utilize a combination of radar, ultrasonic sensors, and vision-based systems. Among these, visual detection offers richer scene information, lower cost, and easier deployment compared to alternative sensing technologies. Visual detection typically involves capturing scenes using cameras and applying algorithms to identify objects of interest within the images.
At present, object detection methods fall into two categories: traditional detection methods and deep learning-based detection methods. Traditional detection algorithms rely on sliding windows and hand-crafted features [3,4]. The Regionlets [5] detector models an object through multiple local regions and trains them with a support vector machine (SVM), which helps it cope with morphological changes and complex backgrounds. The Deformable Part Model (DPM) [6] extracts local features to represent each object part and applies these features to the part model through convolution operations; its core idea was later carried forward in deep learning models. However, traditional detection methods suffer from redundant computation, slow running speed, and poor robustness in complex environments, making it difficult for them to achieve satisfactory detection results. In contrast, Girshick et al. [7] applied deep convolutional networks to object detection and designed the R-CNN architecture, which, when compared with traditional detectors on the VOC 2010 dataset, achieved higher accuracy for multiple categories such as bike, car, and person, as well as higher average precision. Deep learning algorithms require no manual feature extraction and exhibit strong robustness to interference, so they are now widely used in object detection.
Traditional methods usually rely on manually designed feature extractors and classifiers and struggle with complex backgrounds, occlusion, and scale changes; the introduction of deep learning-based object detection has brought breakthrough progress. Deep learning detectors are mainly divided into two-stage and single-stage algorithms. Two-stage detectors obtain results by first extracting candidate boxes and then refining them; representative algorithms include R-CNN [8], Fast R-CNN [9], and Faster R-CNN [10]. Single-stage detectors perform localization and classification in a single pass; representative algorithms include the SSD family [11,12,13] and the YOLO family [14]. Over the past few years, research emphasis has shifted from two-stage toward one-stage frameworks.
The YOLO [15,16] algorithm adopts a single-stage detection paradigm: it divides the entire image into grid cells and predicts multiple bounding boxes and class probabilities in each cell, enabling rapid detection of traffic signs. Compared with traditional methods, YOLO [17] offers higher detection speed and accuracy.
The core idea of the algorithm is to learn feature representations from the raw image through a deep convolutional neural network (CNN) and then use the prediction head to generate bounding boxes and class probabilities. YOLOv8 adopts a Darknet-based deep network structure and combines feature extraction at different levels to effectively capture the shape, texture, and contextual information of traffic signs. In addition, YOLOv8 introduces a series of optimization strategies, such as multi-scale training, data augmentation, and loss function optimization, to further improve the performance of obstacle and traffic sign detection.
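As a concrete point of reference, the following minimal sketch shows how a stock YOLOv8 model can be run on a road-scene image using the open-source ultralytics package; the image path and confidence threshold are illustrative placeholders rather than settings used in this work.

```python
from ultralytics import YOLO

# Load the pre-trained nano variant of YOLOv8 (weights are downloaded automatically).
model = YOLO("yolov8n.pt")

# Run inference on a road-scene image; conf filters out low-confidence boxes.
results = model.predict("street_scene.jpg", conf=0.25, imgsz=640)

# Each result holds the predicted boxes, class indices, and confidence scores.
for box in results[0].boxes:
    cls_name = model.names[int(box.cls)]
    print(f"{cls_name}: conf={float(box.conf):.2f}, xyxy={box.xyxy.tolist()}")
```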
Existing object detection methods, particularly those based on the YOLO series, have shown remarkable performance in real-time applications. YOLOv8, with its efficient Darknet backbone and multi-scale feature extraction, achieves a strong balance of speed and accuracy. However, it faces limitations in the context of machine guide dog applications. First, YOLOv8 struggles with small target detection (e.g., traffic lights that occupy only a few pixels), as deeper network layers lose fine-grained details due to downsampling. Second, blurred-edge features, such as crosswalks under occlusion or varying lighting, are often misdetected due to weak edge representations. Third, the Complete IoU (CIoU) loss function in YOLOv8 overemphasizes low-quality samples, reducing robustness in complex scenes with scale variations and boundary ambiguities. These shortcomings hinder reliable navigation assistance in diverse urban environments, necessitating targeted improvements for visually impaired applications.
To address these challenges, this paper designs a new feature fusion network architecture that combines local and global feature information to obtain more accurate feature maps. Compared with the original network structure, it has fewer parameters and higher average accuracy.
The contributions of this paper are summarized as follows, and are quantitatively verified on both public and custom datasets:
- (1) A lightweight Triplet Attention module is introduced into the backbone network. It captures cross-dimensional interactions to enhance the correlation among local regions and the interactions between feature channels. This mechanism significantly improves the network’s focus on indistinct features with blurred edges, such as crosswalks, resulting in an approximately 4.6% improvement in the recall rate for this category of objects (a minimal illustrative sketch of this module follows the list).
- (2) We design a multi-scale feature enhancement module, termed Triple Feature Encoding (TFE). It fuses spatial information from three different feature map resolutions (large, medium, and small). This structure facilitates the extraction of fine-grained details from small objects and reduces background noise interference. Working in concert with the P2 detection head, it achieves a 5.2% increase in average precision (AP) for small objects such as traffic lights.
- (3) A P2 detection head is employed to construct a four-head multi-scale detection architecture. It extracts lower-level features from higher-resolution feature maps, which aids in identifying small-scale targets. This design lowers the model’s effective detection size limit from 8 × 8 pixels to 4 × 4 pixels, significantly enhancing the perception of very small objects. It collaborates with other detection heads to effectively handle objects of varying scales.
- (4) The Complete IoU (CIoU) loss function is replaced with Wise-IoU v3 (WIoU). Its dynamic focusing mechanism addresses the challenges of blurred boundaries and scale variations by enhancing the focus on hard samples (e.g., small traffic lights) while reducing the emphasis on low-quality samples. This replacement ultimately improves the mean average precision (mAP@0.5) by 4.1% while reducing the number of parameters by 17.2%, achieving a better trade-off between accuracy and efficiency.
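To make the first contribution concrete, the following is a minimal PyTorch sketch of a Triplet Attention block following the published design (Z-pool followed by a 7 × 7 convolution in three rotated branches). It is an illustrative reimplementation rather than the exact code integrated into our backbone, and the class names are our own.

```python
import torch
import torch.nn as nn

class ZPool(nn.Module):
    """Concatenate channel-wise max and mean maps (the 'Z-pool' of Triplet Attention)."""
    def forward(self, x):
        return torch.cat([x.max(dim=1, keepdim=True)[0],
                          x.mean(dim=1, keepdim=True)], dim=1)

class AttentionGate(nn.Module):
    """Z-pool -> 7x7 conv -> BN -> sigmoid gate, applied multiplicatively."""
    def __init__(self, kernel_size=7):
        super().__init__()
        self.pool = ZPool()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2, bias=False)
        self.bn = nn.BatchNorm2d(1)

    def forward(self, x):
        attn = torch.sigmoid(self.bn(self.conv(self.pool(x))))
        return x * attn

class TripletAttention(nn.Module):
    """Parameter-light attention modelling (C,W), (C,H), and (H,W) interactions."""
    def __init__(self):
        super().__init__()
        self.cw = AttentionGate()  # channel-width branch
        self.ch = AttentionGate()  # channel-height branch
        self.hw = AttentionGate()  # plain spatial branch

    def forward(self, x):
        # Branch 1: rotate so H acts as the "channel" axis -> captures C-W interaction.
        x1 = self.cw(x.permute(0, 2, 1, 3)).permute(0, 2, 1, 3)
        # Branch 2: rotate so W acts as the "channel" axis -> captures C-H interaction.
        x2 = self.ch(x.permute(0, 3, 2, 1)).permute(0, 3, 2, 1)
        # Branch 3: ordinary spatial attention over (H, W).
        x3 = self.hw(x)
        return (x1 + x2 + x3) / 3.0
```

A quick shape check such as `TripletAttention()(torch.randn(1, 64, 80, 80))` returns a tensor of the same shape, which is what allows the block to be dropped into an existing feature pipeline without changing surrounding layers.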
2. Related Work
Recent advancements in object detection, particularly within the YOLO series, have focused on improving accuracy for small targets and complex urban scenes, which are critical for applications like machine guide dogs. These systems require real-time detection of obstacles and traffic signs to ensure safe navigation for visually impaired users.
Early improvements to YOLO models emphasized using attention mechanisms to enhance feature representation. For instance, Yan et al. [18] integrated the Squeeze-and-Excitation (SE) Block into YOLOv5, achieving a 1.44% mAP improvement by prioritizing salient features. Similarly, Li et al. [19] incorporated SE-Net and CBAM into YOLOv3’s backbone, boosting mAP by up to 8.50% through channel-wise importance learning. Ma et al. [20] introduced the Feature Select Module (FSM) in the neck layer of YOLOv3, YOLOv4, and YOLOv5-L, reducing noise in feature fusion and improving performance by 0.60% to 1.50%. Ju et al. [21] proposed AFFAM for YOLOv3, combining global and spatial attention for multi-scale feature fusion, yielding mAP gains of 5.08% to 7.41% on datasets like KITTI.
More recent works have targeted small object detection in traffic scenarios, directly relevant to urban navigation challenges. For example, the ETSR-YOLO model [22] enhanced YOLO for multi-scale traffic sign detection, improving robustness in complex environments. TSD-YOLO [23] introduced a Space-to-Depth module to handle scale variations in traffic signs, addressing missed detections. DP-YOLO [24] optimized YOLOv8s for small traffic signs by reducing parameters while boosting accuracy. SOD-YOLOv8 [25] specifically improved YOLOv8 for small objects in traffic scenes, incorporating optimizations for urban drone imagery. CAS-YOLOv8 [26] enhanced remote sensing object detection with contextual attention, showing promise for urban small targets.
Despite these advances, existing models often increase parameter complexity or compromise real-time performance, limiting their deployment in resource-constrained guide dog systems. Moreover, few address the unique needs of visually impaired navigation, such as detecting blurred-edge features (e.g., crosswalks) alongside small targets (e.g., traffic lights) in varied urban conditions. This study builds on YOLOv8’s efficiency, introducing targeted improvements to achieve a balance of accuracy, speed, and lightweight design for machine guide dog applications.
While robust perception is fundamental, a complete machine guide dog system also requires advanced path planning and navigation algorithms to ensure safe and efficient guidance. Traditional global planners such as A* [27] and Dijkstra’s algorithm [28] perform well in static environments but lack real-time reactivity to unknown obstacles. In contrast, local planners like the Dynamic Window Approach (DWA) [29] offer high reactivity but may suffer from local minima and suboptimal global performance.
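For context, a compact sketch of grid-based A* (the style of global planner cited above) is given below; the 4-connected grid, unit step cost, and Manhattan heuristic are illustrative choices and do not correspond to the configuration of any specific cited system.

```python
import heapq

def astar(grid, start, goal):
    """Minimal A* on a 2D occupancy grid (0 = free, 1 = blocked), 4-connected moves."""
    def h(p):  # Manhattan-distance heuristic
        return abs(p[0] - goal[0]) + abs(p[1] - goal[1])

    open_set = [(h(start), 0, start, None)]   # (f, g, node, parent)
    came_from, g_cost = {}, {start: 0}
    while open_set:
        _, g, node, parent = heapq.heappop(open_set)
        if node in came_from:                 # already expanded with an equal or better cost
            continue
        came_from[node] = parent
        if node == goal:                      # walk parents back to the start to recover the path
            path = []
            while node is not None:
                path.append(node)
                node = came_from[node]
            return path[::-1]
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nxt = (node[0] + dr, node[1] + dc)
            if (0 <= nxt[0] < len(grid) and 0 <= nxt[1] < len(grid[0])
                    and grid[nxt[0]][nxt[1]] == 0
                    and g + 1 < g_cost.get(nxt, float("inf"))):
                g_cost[nxt] = g + 1
                heapq.heappush(open_set, (g + 1 + h(nxt), g + 1, nxt, node))
    return None  # no path found
```

For example, `astar([[0, 0], [1, 0]], (0, 0), (1, 1))` returns `[(0, 0), (0, 1), (1, 1)]`, routing around the blocked cell.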
To address these limitations, hybrid approaches that integrate global and local planning have emerged. A notable example is the BRRT*-DWA framework with Adaptive Monte Carlo Localization (AMCL) proposed by Ayalew et al. [30], which combines bidirectional rapidly exploring random tree star (BRRT*) for global path generation with DWA for real-time obstacle avoidance in dynamic environments.
Crucially, the performance of such navigation systems highly depends on the accuracy of perceptual inputs. Our work enhances this pipeline by providing a highly accurate visual perception module that reliably detects obstacles (e.g., pedestrians, vehicles) and traffic elements (e.g., crosswalks, traffic lights). These outputs enable robust downstream path planning and localization, ultimately improving the safety and effectiveness of the machine guide dog system.
5. Results and Discussion
5.1. Comparison Experiment of Different Weight Versions of YOLOv8
Pre-trained weights provide valuable prior knowledge for object detection models through transfer learning, accelerating convergence and improving performance. Their effectiveness varies across different weighting schemes and datasets.
Table 3 shows the detection performance of YOLOv8 with different pre-trained weight variants on obstacles and traffic signs. The choice of weight variant affects detection accuracy, speed, and parameter count. YOLOv8n offers the best balance, with the fewest parameters (3.01 M) and the fastest speed, making it well suited to guide robot perception. We therefore select YOLOv8n as our base model.
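A minimal sketch of this transfer-learning setup is shown below, assuming the ultralytics training API; the dataset configuration file `road_scene.yaml` and the hyperparameter values are illustrative placeholders, not the exact settings used in our experiments.

```python
from ultralytics import YOLO

# Start from COCO pre-trained YOLOv8n weights so the backbone inherits generic features.
model = YOLO("yolov8n.pt")

# Fine-tune on the road obstacle / traffic-sign dataset described by a YOLO-format yaml.
model.train(
    data="road_scene.yaml",  # hypothetical dataset config: image paths and class names
    epochs=100,
    imgsz=640,
    batch=16,
    pretrained=True,         # keep the transferred weights as the starting point
)

# Evaluate the fine-tuned weights on the validation split.
metrics = model.val()
print(metrics.box.map50)     # mAP@0.5 on the validation set
```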
5.2. Attention Mechanism Comparison Experiment
To validate the superiority of our Triplet Attention module, we compared it against the CBAM, SE, CA, and EMA mechanisms in YOLOv8. As shown in Table 4, Triplet Attention outperforms the others by capturing cross-dimensional interactions more efficiently, enhancing local feature relevance and channel relationships while maintaining a lightweight design. While EMA also performs well, Triplet Attention achieves the best results in YOLOv8n, with 91.4% mAP0.5, 91.3% precision, and 87.2% recall.
5.3. Loss Function Comparison Experiment
To validate the improved YOLOv8n’s detection performance, we compared various IoU-based losses (Wise-IoUv3, CIoU, DIoU, EIoU, SIoU), as shown in Table 5. For road-scene detection tasks with imbalanced categories (e.g., pedestrians) and small targets (e.g., traffic lights), Wise-IoUv3 addresses these challenges through its dynamic weighting mechanism. It assigns larger weights to minority categories and easily neglected regions, preventing model bias toward dominant categories while improving small object detection. Table 5 shows that the enhanced YOLOv8n (with Triplet Attention, TFE, and P2) achieves 92.1% mAP0.5 (90.8% P, 86.3% R) using CIoU. Among the tested WIoUv3 hyperparameter settings, the best-performing configuration yields 93.9% mAP0.5 (90.2% P, 89.8% R), surpassing the other loss functions.
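For reference, a simplified PyTorch sketch of the Wise-IoU v3 loss is given below, following the published formulation (a distance-based WIoU v1 term scaled by a non-monotonic focusing coefficient computed from the outlier degree). The α, δ, and momentum values shown are illustrative defaults rather than the tuned settings reported in Table 5, and the function name is our own.

```python
import torch

def wiou_v3(pred, target, iou_mean, alpha=1.9, delta=3.0, momentum=0.01):
    """Illustrative Wise-IoU v3 loss for boxes in (x1, y1, x2, y2) format.

    iou_mean is a running mean of the IoU loss carried across batches;
    alpha and delta are the non-monotonic focusing hyperparameters.
    """
    eps = 1e-7
    # Plain IoU loss.
    lt = torch.max(pred[:, :2], target[:, :2])
    rb = torch.min(pred[:, 2:], target[:, 2:])
    wh = (rb - lt).clamp(min=0)
    inter = wh[:, 0] * wh[:, 1]
    area_p = (pred[:, 2] - pred[:, 0]) * (pred[:, 3] - pred[:, 1])
    area_t = (target[:, 2] - target[:, 0]) * (target[:, 3] - target[:, 1])
    iou = inter / (area_p + area_t - inter + eps)
    l_iou = 1.0 - iou

    # WIoU v1 distance term: centre offset normalised by the enclosing box (detached).
    cxy_p = (pred[:, :2] + pred[:, 2:]) / 2
    cxy_t = (target[:, :2] + target[:, 2:]) / 2
    enc_wh = (torch.max(pred[:, 2:], target[:, 2:])
              - torch.min(pred[:, :2], target[:, :2])).detach()
    r_wiou = torch.exp(((cxy_p - cxy_t) ** 2).sum(dim=1)
                       / ((enc_wh ** 2).sum(dim=1) + eps))

    # WIoU v3 focusing coefficient: down-weights both very easy and outlier boxes.
    beta = l_iou.detach() / iou_mean
    r_focus = beta / (delta * alpha ** (beta - delta))

    # Update the running mean of the IoU loss for the next batch.
    iou_mean = (1.0 - momentum) * iou_mean + momentum * l_iou.mean().item()

    return (r_focus * r_wiou * l_iou).mean(), iou_mean
```

In such a setup, `iou_mean` would be initialised to 1.0 and the returned value fed back in on the next iteration so that the focusing coefficient adapts as training progresses.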
5.4. Analysis of Object Detection Results
To validate the detection capability of the proposed algorithm model in complex road scenarios, experimental verification was conducted on two public datasets.
Table 6 and Table 7 present the comparative detection results between the YOLOv8 model and the proposed model on the validation sets of these two public datasets. Experimental results demonstrate that on the TSDD dataset, the proposed model achieves improvements of 5.3 percentage points in mAP50 and 7.6 percentage points in mAP50:95. Similarly, on the GTSDB dataset, it shows improvements of 5.7 percentage points in mAP50 and 3.4 percentage points in mAP50:95.
5.5. Ablation Experiment
To validate the effectiveness of each module in enhancing the YOLOv8n network, ablation experiments were conducted using YOLOv8n as the baseline. Metrics included average precision, accuracy, recall, and parameter count, as shown in Table 8 (where ✓ indicates inclusion of a module). Results show that each added module improved performance to varying degrees. The Wise-IoUv3 module achieved the highest AP gain of 1.7% without increasing parameters, demonstrating its effectiveness in improving bounding box localization for varied object shapes and scales. The P2 module enhanced AP by 1.1%, particularly benefiting small target detection such as traffic lights. Although combining P2 and Wise-IoUv3 yielded slightly lower AP than Wise-IoUv3 alone, it reduced the parameter count to 2.74 M. The full integration of Triplet Attention, TFE, P2, and Wise-IoUv3 boosted YOLOv8n’s accuracy to 93.9%, with 2.49 M parameters and 182 FPS.
The results presented in Table 8 not only demonstrate the performance improvements achieved by each proposed module but also reveal their distinct effect sizes and underlying mechanisms.
The Triplet Attention (TA) module boosts performance (+1.6% mAP) by enhancing cross-dimensional spatial–channel interactions, which is critical for recognizing objects with weak textural cues (e.g., crosswalks), as reflected in the increased recall.
The TFE module’s primary role is to aggregate and preserve multi-scale features, supplying richer representations for the detection heads. Its effect is most evident when combined with the P2 head.
The P2 detection head provides a substantial gain (+1.1% mAP) by leveraging high-resolution (160 × 160) features. This is decisive for small objects (e.g., traffic lights), as it drastically improves localization precision for targets below 8 × 8 pixels.
The Wise-IoU v3 (WIoU) loss brings the largest individual improvement (+1.7% mAP) by introducing a dynamic focusing mechanism that suppresses gradients from low-quality examples, thereby improving generalization and robustness.
The full model’s performance (93.9% mAP) demonstrates clear synergy: TFE provides multi-scale features, TA refines their representation, P2 detects small objects precisely, and WIoU ensures stable training. This integration achieves an optimal accuracy–efficiency balance.
5.6. Comparison of Different Models
We compared mainstream models including YOLOv5n, YOLOv6n, and YOLOv10n in terms of parameter count, average precision, accuracy, and FPS to evaluate the performance of the proposed model; the results are summarized in Table 9. Although YOLOv6n slightly surpasses YOLOv8n in average accuracy, it suffers from significantly lower FPS. Models such as YOLOv5n, YOLOv10n, and YOLOv11n achieve comparable accuracy but with reduced speed. YOLOv7 and Faster R-CNN not only have much larger parameter counts but also perform worse in both accuracy and FPS.
Overall, YOLOv8n offers a better balance of accuracy, speed, and model size than other unmodified models. The proposed improved model further reduces parameters by 0.52M, boosts accuracy by 4.1%, and achieves higher precision and FPS, validating the effectiveness of our approach.
5.7. Algorithm Verification
The improved algorithm was compared with YOLOv8n, YOLOv10n, and YOLOv11n (the top performers in average accuracy) to evaluate detection of typical obstacles for visually impaired users, including crosswalks and traffic signs, under various scenarios and weather conditions. Results are shown in Figure 9.
The improved algorithm reduces missed detections of small targets such as traffic lights and lowers false positives for blurred-edge targets compared with the other three algorithms. This strengthens obstacle and traffic sign detection for machine guide dogs in complex environments, improving the accuracy and reliability of local path planning and laying a solid foundation for future research.
5.8. Implementation of YOLO Framework on NVIDIA Jetson Orin Nano Super
Today, single-board computers such as the NVIDIA Jetson Orin Nano Super are gaining popularity for edge computing applications, including artificial intelligence and deep learning. The Jetson Orin Nano Super, featuring a 6-core Arm® Cortex®-A78AE v8.2 64-bit CPU, a 1024-core NVIDIA Ampere architecture GPU with 32 Tensor Cores, 8 GB of 128-bit LPDDR5 RAM offering 102 GB/s bandwidth, and comprehensive high-speed I/O support, delivers AI computational performance of up to 67 TOPS with remarkable power efficiency. In this study, we leverage this embedded edge computing device specifically for model inference, coupling it with an Intel RealSense D435i camera to form a complete perception system. Because model training demands more computing resources than inference, we perform model training on a GPU-equipped workstation in the cloud. The generated weight files are then deployed on the edge device, where the Jetson Orin Nano Super processes real-time visual data from the D435i camera to perform efficient obstacle and traffic sign detection under various environmental conditions.
To validate the practical deployment of the proposed system, we established a complete experimental setup utilizing the Jetson Orin Nano Super as the central processing unit and the Intel RealSense D435i camera as the visual perception module. In this configuration, the D435i camera serves as the “eyes” of the machine guide dog, continuously capturing RGB video streams at 1280 × 720 resolution with a frame rate of 30 FPS. The Jetson Orin Nano Super functions as the main controller, executing real-time inference using the optimized YOLO model. The physical implementation of this hardware system is illustrated in Figure 10.
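The following sketch outlines how such a perception loop can be wired together on the edge device, assuming the ultralytics and pyrealsense2 Python packages; the weight file name and loop structure are illustrative, and details such as TensorRT export or publishing detections to the navigation stack are omitted.

```python
import numpy as np
import pyrealsense2 as rs
from ultralytics import YOLO

# Load the deployed weight file produced by cloud-side training (name is a placeholder).
model = YOLO("guide_dog_yolov8n.pt")

# Configure the RealSense D435i to stream 1280x720 RGB frames at 30 FPS.
pipeline = rs.pipeline()
config = rs.config()
config.enable_stream(rs.stream.color, 1280, 720, rs.format.bgr8, 30)
pipeline.start(config)

try:
    while True:
        frames = pipeline.wait_for_frames()
        color_frame = frames.get_color_frame()
        if not color_frame:
            continue
        image = np.asanyarray(color_frame.get_data())  # HxWx3 BGR frame

        # Run real-time inference on the captured frame.
        results = model.predict(image, conf=0.25, verbose=False)
        for box in results[0].boxes:
            label = model.names[int(box.cls)]
            # Detected obstacles / traffic elements would be forwarded to the planner here.
            print(label, float(box.conf))
finally:
    pipeline.stop()
```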
The experimental results demonstrate that the integrated system achieves outstanding performance metrics: a detection precision of 94.2% and a recall rate of 91.6%, indicating high accuracy and reliability in obstacle and traffic sign recognition. The system maintains an average processing throughput of 28.7 FPS, ensuring smooth real-time operation. The per-frame inference latency remains below 50 ms, providing responsive feedback for navigation assistance. Regarding resource utilization, the GPU occupancy is approximately 42%, with memory consumption of 3.1 GB, indicating efficient resource management. The total power consumption is controlled at 8.7 W, demonstrating the energy efficiency of the edge deployment. All performance indicators meet the expected targets for real-world machine guide dog applications, validating the effectiveness of the proposed hardware–software co-design approach. The detailed performance metrics are systematically summarized in Table 10.