1. Introduction
Against the backdrop of continuously increasing global air transport volumes, ensuring flight safety and operational efficiency has become a primary focus in modern aviation technology research. Studies have demonstrated a significant correlation between the efficiency of runway marking identification during the approach and landing phases and overall flight safety. Specifically, navigational information such as runway numbers and lighting requires real-time and accurate identification to support safe landings. However, current identification methods predominantly rely on manual visual interpretation. Under adverse weather conditions, such as haze, low clouds, or nighttime operations, identification accuracy declines sharply. This technical limitation not only significantly reduces the safety margin during the approach phase but also poses serious challenges to pilots’ situational awareness and emergency response capabilities. In January 2022, the Civil Aviation Administration of China issued the Smart Civil Aviation Construction Roadmap, which outlined clear objectives for the development of intelligent air traffic management. The roadmap mandates strengthening national flight operation simulations and testing new-generation air traffic management systems by 2025, with the goal of operational deployment by 2030. The increasing complexity of low-altitude landing scenarios further challenges pilots, making the intelligentization of aviation information systems a pressing research direction.
For image data of aircraft landing runways, acquisition is affected by varying weather conditions, imaging wavelengths, and imaging modalities within complex airspace environments, which leads to inconsistent data distributions. This increases the likelihood of pilot misjudgment during landing decision-making, posing one of the key challenges in aviation target recognition. Currently, mainstream object detection algorithms can be broadly categorized into two types: one-stage detection and two-stage detection. One-stage detection methods directly perform localization and classification, with representative algorithms including YOLO [
1], SSD [
2], and RetinaNet [
3]. In contrast, two-stage detection methods first generate coarse anchor boxes and then refine localization and classification. Representative algorithms in this category include R-CNN, Fast R-CNN [
4], and Cascade R-CNN [
5].
These general-purpose detection algorithms, although demonstrating outstanding performance in recognizing common solid objects in everyday contexts, tend to yield poor results when applied to the detection of aviation-specific targets. Therefore, tailored optimizations are necessary to adapt these models to the precise demands of detecting runway markings in aviation scenarios. Yang [
6] integrated the Swin Transformer into a Mask R-CNN framework and combined it with natural language processing techniques. While this approach achieved a degree of recognition success, it suffered from relatively low processing speed. Similarly, Zhang [
7] employed Faster R-CNN integrated with a TIBAM module for two-stage detection. Although this algorithm improved recognition accuracy to some extent, it imposed a considerable computational burden. Chen et al. [
8] proposed a novel visual positioning algorithm that integrates YOLOv5 with Kalman filtering to address occlusion challenges in determining relative positional relationships using vision-based positioning modules. Rao et al. [
9] developed a landmark detection model by combining lightweight techniques with a fast contour optimization algorithm to achieve reliable position estimation under poor visual conditions. Liu et al. [
10] introduced a deep learning-based airport runway line detection method, providing precise positioning information for drone landings. Cao et al. [
11] presented an enhanced lightweight target detection approach for coal gangue, improving YOLOv5s by constructing new convolutional blocks and embedding an Efficient Channel Attention (ECA) module in the backbone network. This improvement significantly increased localization and recognition accuracy. Pan et al. [
12] optimized the YOLOv3 network by integrating a spatial pyramid pooling (SPP) module, a squeeze-and-excitation (SE) module, and dilated convolution, accelerating recognition speed while maintaining model accuracy. Yan et al. [
13] enhanced the YOLOv5 architecture by adding a spatial and channel squeeze-and-excitation (scSE) module, achieving an average detection accuracy of 0.983. Chen et al. [
14] proposed a track identification and monitoring method based on an improved YOLOv5s framework, incorporating a lightweight backbone, improved feature fusion strategies, and an optimized regression loss function. Wang et al. [
15] introduced an enhanced YOLOv5s object detection algorithm by integrating an inner convolutional module into the backbone and improving the feature fusion network using a GSConv module, thereby improving detection accuracy. Liu et al. [
16] developed a cotton seed damage detection method based on an improved YOLOv5 algorithm, incorporating the lightweight up-sampling operator CARAFE into the YOLOv5s framework and refining the loss function.
To address the industry challenge of insufficient recognition accuracy of runway markings during aircraft landings, this study proposes an enhanced target detection architecture, ours-YOLOv5s, based on deep learning. This model systematically improves image parsing efficiency through a series of multidimensional innovations. First, the model incorporates a Convolutional Block Attention Module (CBAM), which employs a channel-spatial dual-domain feature recalibration strategy. This significantly enhances the discriminative ability of feature representations under complex weather conditions. Second, it replaces the conventional feature fusion structure with a Bidirectional Feature Pyramid Network (BiFPN). Through bidirectional cross-scale concatenation and a weighted feature fusion mechanism, this architecture enhances multi-scale feature expression, thereby improving the recall rate (R%) for small-scale targets such as runway numbers in aviation imagery. In addition, the model introduces an adaptive Alpha-Complete Intersection over Union (CIoU) loss function based on the Alpha parameter. By incorporating dynamic balancing factors alongside traditional geometric constraints and aligning with a cosine annealing learning rate strategy, it achieves an optimal trade-off between localization accuracy and convergence speed. Furthermore, a comprehensive data augmentation strategy is implemented to address the limited quantity of available aviation image samples. This effectively mitigates overfitting, thus reducing generalization errors, particularly under conditions of fog or night operations. Experimental results indicate that, compared to the baseline model, the proposed ours-YOLOv5s exhibits a marked improvement in accuracy, offering a robust technical solution for mitigating runway incursion risks during the aircraft landing phase.
2. YOLOv5 Algorithm Principle
YOLOv5s adopts a modular architecture that forms a three-stage cascaded feature processing pipeline, as illustrated in
Figure 1. The collaborative framework consists of three core functional modules: a feature extraction network (backbone), feature fusion layers (neck), and a detection head (head).
During feature extraction, the backbone network performs progressive feature abstraction through the CSPDarknet53 framework. Initially, the Focus module restructures the input tensor dimensionally via slicing operations. Subsequently, a cross-stage residual concatenation and gradient flow diversion mechanism is employed to optimize the feature propagation path. Within this structure, the C3 module utilizes two strategies, branch-and-cut and channel compression, to construct a bottleneck layer that expands the receptive field while significantly reducing computational complexity. The feature fusion layers incorporate a BiFPN for multi-scale integration. A dynamic weighted fusion is achieved through bidirectional top-down and bottom-up cross-layer concatenation, effectively combining shallow positional information with deep semantic features. Among these, the PANet module employs a deformable convolution kernel to adaptively adjust the feature mapping, thereby enhancing the semantic expressiveness of shallow feature maps. The detection head module adopts a decoupled prediction structure that separates the tasks of localization regression and classification confidence prediction into parallel branches. In the localization branch, a dynamic anchor box optimization algorithm is applied, with geometric constraints constructed via the CIoU loss function. The classification branch utilizes a compound activation function (Sigmoid Linear Unit, SiLU) to enhance nonlinear representation capabilities. In the post-processing stage, an improved Non-Maximum Suppression (NMS) algorithm is employed. This version incorporates a Gaussian-weighted suppression strategy along with an adaptive threshold adjustment mechanism, effectively reducing the false match rate for dense and small-scale targets in aerial images.
3. Optimized YOLOv5 Algorithm
This study addresses several key limitations of the YOLOv5s object detection network in complex scenarios, including errors in small-target detection, low feature fusion efficiency, and insufficient model convergence stability. To overcome these challenges, a systematic optimization framework is proposed. This framework enhances the network’s detection accuracy and robustness through the construction of a multi-scale collaborative enhancement architecture, refinement of the bounding box regression mechanism, and integration of dynamic optimization strategies. As illustrated in
Figure 2, the optimized framework incorporates the following core technical modules.
3.1. Funnel ReLU (FReLU) Activation Function
Activation functions enable networks to perform hierarchical modeling of complex data patterns by introducing differentiable nonlinear transformation mechanisms. This nonlinearity breaks the superposition constraint of linear systems, allowing the network to approximate any continuous function through layered feature compositions. The core mechanism behind this nonlinear mapping lies in the design of the activation unit. The SiLU activation function adopts a composite formulation whose curve remains continuously differentiable around zero, thereby improving parameter update efficiency in regions prone to gradient saturation. By incorporating a sigmoid gating function, whose output is confined to the bounded interval (0, 1), the unit dynamically modulates its response, as expressed in Formula (1):
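$$f(x) = x \cdot \sigma(x) = \frac{x}{1 + e^{-x}} \tag{1}$$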
Ma et al. [
17] proposed the FReLU activation function, with its mechanism illustrated in
Figure 3. FReLU introduces a spatially aware mechanism with minimal computational overhead, extending traditional ReLU and PReLU functions into a two-dimensional activation framework characterized by regional correlations. Its formulation is given in Formula (2).
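In its standard two-dimensional form, FReLU replaces the fixed zero threshold of ReLU with the spatial funnel condition:

$$f(x_{c,i,j}) = \max\left(x_{c,i,j},\, T(x_{c,i,j})\right) \tag{2}$$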
where T(x) denotes a two-dimensional spatial condition operator that employs a parametric pooling window [
18] to establish spatial dependencies and facilitate effective local feature extraction. The design leverages learnable convolutional kernels to dynamically adjust the receptive field, enhancing sensitivity to local geometric structures. This process is formally described by Formula (3):
where $T(x_{c,i,j})$ denotes the funnel condition, $x^{\omega}_{c,i,j}$ denotes the $k_{h} \times k_{w}$ parametric pooling window centered on the 2D spatial location $(i, j)$ of the $c$-th channel, and $p^{\omega}_{c}$ is the coefficient shared by all pixels within the same window of that channel. This per-channel condition provides the nonlinear activation of the $c$-th channel and serves as the basis for generating the parameterized pooling window. Schematic diagrams of the FReLU, PReLU, and ReLU activation functions are illustrated in Figure 3.
A spatial conditional constraint mechanism is incorporated to strengthen spatial modeling capacity further. This mechanism enables refined spatial feature encoding via pixel-level parameter modulation and, when combined with standard convolution operations, facilitates multi-scale feature extraction. As a result, the model effectively captures complex visual layouts and spatial structural relationships within images while maintaining high computational efficiency.
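To make the mechanism concrete, the following is a minimal PyTorch sketch of an FReLU unit consistent with the description above: a per-channel (depthwise) 3 × 3 convolution with batch normalization produces the funnel condition T(x), and an element-wise maximum implements the activation. The class name and kernel size are illustrative choices, not details taken from the original implementation.

```python
import torch
import torch.nn as nn

class FReLU(nn.Module):
    """Funnel ReLU: y = max(x, T(x)), where T(x) is a learnable per-channel
    spatial condition (depthwise 3x3 convolution followed by batch norm)."""
    def __init__(self, channels: int, kernel_size: int = 3):
        super().__init__()
        # Depthwise convolution implements the parametric pooling window T(x)
        self.funnel = nn.Conv2d(channels, channels, kernel_size,
                                padding=kernel_size // 2, groups=channels, bias=False)
        self.bn = nn.BatchNorm2d(channels)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return torch.max(x, self.bn(self.funnel(x)))

# Example: activate a batch of 64-channel feature maps
feats = torch.randn(2, 64, 80, 80)
out = FReLU(64)(feats)  # output has the same shape as the input
```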
3.2. Alpha-CIoU Loss Function
Compared to the CIoU loss function, as illustrated in Figure 4, Alpha-CIoU [19] introduces a tunable power hyperparameter α, which enhances the model's adaptability and performance across different scenarios. While CIoU jointly optimizes three critical geometric factors, namely the center-point distance, the aspect ratio, and the overlapping area between bounding boxes, Alpha-CIoU additionally raises the IoU term and its penalty terms to the power α, placing greater emphasis on high-quality (high-IoU) predictions and thereby improving localization accuracy. Adjusting the α parameter dynamically rebalances the contributions of the individual components according to task-specific requirements. This flexibility allows the model to achieve a better trade-off between accuracy and computational efficiency, particularly in complex or irregular object detection tasks.
As shown in
Figure 4,
d represents the distance between the centers of the ground truth box and the predicted box and
c denotes the diagonal length of the smallest enclosing box that contains both. This design allows CIoU to effectively handle cases where there is no overlap between boxes. The CIoU metric, on which the CIoU loss is based, is defined as follows:
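$$\mathrm{CIoU} = \mathrm{IoU} - \frac{\rho^{2}(b, b^{gt})}{c^{2}} - \alpha v \tag{4}$$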
where IoU (Intersection over Union) is the ratio of the overlapping area to the union area of the predicted and ground truth boxes, measuring the degree of overlap between the two boxes; $\rho^{2}(b, b^{gt})$ is the squared Euclidean distance between the center of the predicted box $b$ and the center of the ground truth box $b^{gt}$, and thus measures the deviation between the two center points; $c$ is the diagonal length of the smallest enclosing box containing both $b$ and $b^{gt}$; $v$ measures the aspect-ratio discrepancy between the predicted and ground truth boxes; and $\alpha$ is a weight coefficient that balances the influence of $v$. The formulas for $\alpha$ and $v$ are expressed as Formulas (5) and (6):
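$$\alpha = \frac{v}{(1 - \mathrm{IoU}) + v} \tag{5}$$

$$v = \frac{4}{\pi^{2}}\left(\arctan\frac{w^{gt}}{h^{gt}} - \arctan\frac{w}{h}\right)^{2} \tag{6}$$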
where $w^{gt}$ and $h^{gt}$ denote the width and height of the ground truth box, and $w$ and $h$ denote the width and height of the predicted box. The constant factor $4/\pi^{2}$ normalizes the aspect-ratio difference to a stable range. Accordingly, the complete CIoU loss is expressed as Formula (7):
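$$L_{\mathrm{CIoU}} = 1 - \mathrm{CIoU} = 1 - \mathrm{IoU} + \frac{\rho^{2}(b, b^{gt})}{c^{2}} + \alpha v \tag{7}$$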
Alpha-CIoU additionally introduces a power regularization term governed by a single power parameter α. By adjusting α, the detector gains greater flexibility in achieving different levels of bounding box regression accuracy. The Alpha-CIoU loss function is expressed as Formula (8):
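$$L_{\alpha\text{-CIoU}} = 1 - \mathrm{IoU}^{\alpha} + \frac{\rho^{2\alpha}(b, b^{gt})}{c^{2\alpha}} + (\beta v)^{\alpha} \tag{8}$$

where the exponent α is the tunable Alpha-CIoU power parameter and β denotes the aspect-ratio trade-off weight of Formula (5), written here as β to avoid confusion with the power parameter.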
3.3. CBAM
CBAM [
20], illustrated in
Figure 5, is an attention mechanism that sequentially applies a Channel Attention Module (CAM) and Spatial Attention Module (SAM). By incorporating both channel-wise and spatial attention, CBAM enhances feature learning capability while maintaining computational efficiency and parameter economy.
The CBAM module consists of an input layer, the CAM, the SAM, and an output layer. Given an input feature map F ∈ R^(C×H×W), the CAM first infers a one-dimensional channel attention map M_c ∈ R^(C×1×1), which is multiplied element-wise with the input feature map. The resulting channel-refined feature is then fed into the SAM, which infers a two-dimensional spatial attention map M_s ∈ R^(1×H×W); multiplying this map element-wise with the channel-refined feature yields the final output. The entire attention process is formulated as Formula (9):
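$$F' = M_{c}(F) \otimes F, \qquad F'' = M_{s}(F') \otimes F' \tag{9}$$

where ⊗ denotes element-wise multiplication, F′ is the channel-refined feature, and F″ is the final output.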
The CAM, as depicted in
Figure 6, begins by applying global average pooling and global max pooling across the spatial dimensions (height and width) of the input feature map. The two pooled descriptors are then passed through a shared MLP, whose outputs are combined via element-wise summation and passed through a sigmoid activation function to produce the channel attention map. This map is then multiplied element-wise with the original input feature map to generate the channel-refined output of the CAM. The operation of the CAM is formulated as Formula (10):
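$$M_{c}(F) = \sigma\big(\mathrm{MLP}(\mathrm{AvgPool}(F)) + \mathrm{MLP}(\mathrm{MaxPool}(F))\big) \tag{10}$$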
The SAM, shown in
Figure 7, takes the output of the CAM as its input. First, global average pooling and global max pooling are applied, this time along the channel axis. The two resulting maps are concatenated along the channel dimension and passed through a convolution layer for dimensionality reduction; the resulting single-channel map then undergoes a sigmoid activation function to generate the spatial attention map. This map is multiplied element-wise with the input feature map to produce the final refined features. The SAM operation is formally defined as Formula (11):
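$$M_{s}(F) = \sigma\big(f^{7\times 7}\big([\mathrm{AvgPool}(F);\, \mathrm{MaxPool}(F)]\big)\big) \tag{11}$$

where $f^{7\times 7}$ denotes a convolution with a 7 × 7 kernel (the kernel size used in the original CBAM design) and $[\,\cdot\,;\,\cdot\,]$ denotes concatenation along the channel axis.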
3.4. BiFPN Structure
The architecture of the Bidirectional Feature Pyramid Network (BiFPN) [
21] is illustrated in
Figure 8. Unlike YOLOv5, which employs CSPDarknet as the backbone and PANet as its feature fusion module, BiFPN enhances multiscale feature representation by introducing bidirectional information flow, both bottom-up and top-down, within the network structure and adopting a weighted fusion strategy, resulting in multiple cross-scale connections.
3.5. Soft-Non-Maximum Suppression (NMS) Mechanism
The NMS algorithm operates by ranking the candidate box set S in descending order of confidence score. The IoU is calculated between the highest-scoring box b* and all remaining boxes, and those with an IoU exceeding a predefined threshold π_0 are removed. However, traditional NMS suffers from inherent limitations. In regions with overlapping targets, such a rigid suppression strategy can lead to false negatives, where valid detections are erroneously discarded. Moreover, the performance of NMS is highly sensitive to the choice of π_0: a lower threshold may result in the loss of critical information due to over-suppression, while a higher threshold may retain excessive redundant boxes, reducing detection accuracy and interpretability.
Unlike traditional NMS, Soft-NMS applies a smooth suppression to overlapping candidate boxes, significantly mitigating over-suppression. The algorithm dynamically adjusts the scores of candidate boxes rather than directly eliminating them, allowing some overlapping detections to be retained, which is particularly beneficial in complex visual environments, for example, where aircraft landing markings and runway lines, buildings, or other structures are spatially intertwined. The core idea of Soft-NMS [
22] is as follows: when the IoU between the highest-confidence box and another box exceeds the threshold π_0, the algorithm does not discard the other box; instead, it reduces its confidence score according to the level of overlap. The greater the IoU, the more significant the score decay. This dynamic adjustment enables better preservation of true positive detections in dense scenes. The algorithm flow is shown in
Table 1, and the Soft-NMS principle diagram is shown in
Figure 9.
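For illustration, the following is a minimal NumPy sketch of the Gaussian variant of Soft-NMS summarized above; the function name and the parameter values (sigma, score_thr) are illustrative assumptions rather than the exact settings used in this work.

```python
import numpy as np

def soft_nms(boxes: np.ndarray, scores: np.ndarray,
             sigma: float = 0.5, score_thr: float = 0.001) -> list:
    """Gaussian Soft-NMS: decay the scores of boxes overlapping the current
    best box instead of discarding them. boxes are (N, 4) as (x1, y1, x2, y2)."""
    scores = scores.astype(float).copy()
    keep, idxs = [], list(range(len(scores)))
    while idxs:
        # Pick the remaining box with the highest (possibly decayed) score
        best = max(idxs, key=lambda i: scores[i])
        keep.append(best)
        idxs.remove(best)
        for i in idxs:
            # IoU between the best box and box i
            x1 = max(boxes[best, 0], boxes[i, 0]); y1 = max(boxes[best, 1], boxes[i, 1])
            x2 = min(boxes[best, 2], boxes[i, 2]); y2 = min(boxes[best, 3], boxes[i, 3])
            inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
            area_b = (boxes[best, 2] - boxes[best, 0]) * (boxes[best, 3] - boxes[best, 1])
            area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
            iou = inter / (area_b + area_i - inter + 1e-9)
            # Gaussian decay: larger overlap leads to a stronger score reduction
            scores[i] *= np.exp(-(iou ** 2) / sigma)
    # Discard boxes whose decayed scores fall below the final score threshold
    return [i for i in keep if scores[i] > score_thr]
```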
3.6. Deformable Convolution (Deformable Conv)
Deformable Conv enhances the model’s ability to adapt to the geometric deformations of objects by dynamically adjusting the sampling grid, as illustrated in
Figure 10. Unlike standard convolution, which uses a fixed rectangular sampling pattern, deformable convolution exhibits stronger shape-awareness capabilities.
In the top-level feature maps generated by deformable convolution, the distribution of activated feature points shows a significant correlation with the object’s contours and structural characteristics, resulting in a selective response to object-specific features. This behavior suppresses background noise during feature extraction and enhances the representational power of the learned features, thereby improving target localization and recognition accuracy in complex and cluttered environments.
Compared to standard convolution, deformable convolution introduces learnable offsets that shift the sampling points toward more informative regions of the input. This mechanism is depicted in
Figure 11.
In traditional convolution, for an input feature map of size 7 × 7 and a convolution kernel of size 3 × 3, the kernel weights are multiplied by the corresponding elements of the input feature map and summed to obtain each output element; sliding this window across the input produces the complete output feature map. The formulation of traditional convolution is given by Formula (12):
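$$y(p_{0}) = \sum_{p_{n} \in R} w(p_{n}) \cdot x(p_{0} + p_{n}) \tag{12}$$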
where $p_{0}$ denotes a position on the output feature map, $p_{n}$ enumerates the offsets of the sampling points in the kernel relative to its center, $w(p_{n})$ is the corresponding kernel weight, and $x$ denotes the input feature map. The regular sampling grid $R$ is expressed as in Formula (13):
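$$R = \{(-1,-1),\, (-1,0),\, \ldots,\, (0,1),\, (1,1)\} \tag{13}$$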
The formula of deformable convolution is given as Formula (14), where Δp_n denotes the learnable offset generated from the input feature map by an additional convolution branch:
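$$y(p_{0}) = \sum_{p_{n} \in R} w(p_{n}) \cdot x(p_{0} + p_{n} + \Delta p_{n}) \tag{14}$$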
4. Experiment
4.1. Dataset
The experiments in this study primarily focus on the aircraft landing phase at altitudes below 3000 m.
In terms of dataset selection, the datasets used in this study include: a self-constructed dataset from the Roboflow platform; the “Landing Approach Runway Detection (LARD)” dataset—developed by Airbus France [
23] and hosted on
www.github.com, with its data distribution shown in
Figure 12; and the “FS2020 Runway Dataset” obtained from the Kaggle platform. Additionally, one of the authors, Huang Wei, supplemented the dataset with images of actual operational scenarios within airports based on his experience as airport staff. These supplementary data help improve the alignment between the dataset and real-world landing scenarios.
In terms of the screening of dataset images:
- (1)
From the perspective of viewing angle, all selected images simulate the top-down or forward-looking perspective during the aircraft’s approach to landing, which is consistent with the visual perspective of pilots during actual operations. This ensures the alignment between the detection scenario and real application scenarios. Meanwhile, images with irrelevant viewing angles such as ground side shots and high-altitude aerial shots are excluded to avoid interference from non-target viewing angles in model learning.
- (2)
In terms of target types, only images containing specific runway markings are retained. These markings include core detection targets such as runway numbers (e.g., “01L”, “36R”), runway centerlines, and touchdown zone marks. Images featuring non-marking targets such as airport buildings and aircraft bodies are excluded to ensure the dataset focuses on the key detection objects required by the research.
- (3)
From the dimension of environmental conditions, considering various situations that may be encountered in actual landing scenarios, the selected images cover diverse meteorological conditions (sunny, rainy, and foggy), lighting conditions (strong noon light, weak twilight light, and night lights), and imaging quality states (clear images, slightly motion-blurred images, images with sudden brightness changes, etc.). Among them, “Mixed Weather” in
Table 2 specifically includes complex scenarios such as low visibility in fog, night light reflection, runway surface reflection in rainy weather, and backlight at dusk. This ensures that the dataset can support the model’s ability to detect runway markings in different complex environments.
The final dataset comprises 10,362 images, annotated with a total of 206,725 bounding box labels. The dataset is divided into three subsets: 80% of images (8290) are the training set, 15% (1553) are the validation set, and 5% (519) are the testing set. The detailed distribution of targets across these subsets is presented in
Table 2. Additionally, representative sample images from each category are illustrated in
Figure 13.
4.2. Data Augmentation
To improve the model’s generalization and robustness, this study employed a series of data augmentation strategies. These techniques include both geometric transformations and color space manipulations, as illustrated in
Figure 14. The geometric transformations applied are as follows: rotation randomly rotates the image by a certain angle, enhancing the model's adaptability to rotational variations; translation applies horizontal and vertical shifts, increasing the model's tolerance to positional changes; and cropping randomly removes part of the image to improve the model's performance on partially visible targets. In addition to geometric transformations, HSV (hue, saturation, value) adjustments are used to simulate color variations caused by different times of day and weather conditions, further enhancing the model's robustness.
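As a concrete illustration, the geometric and HSV transformations described above can be composed with the Albumentations library roughly as follows; the parameter values and field names are illustrative assumptions, not the exact settings used in this study.

```python
import albumentations as A

# Illustrative augmentation pipeline (example values only), assuming inputs
# have already been resized to 640 x 640 as in the training preprocessing.
train_transforms = A.Compose(
    [
        A.ShiftScaleRotate(shift_limit=0.1, scale_limit=0.1,
                           rotate_limit=10, p=0.5),            # rotation + translation
        A.RandomCrop(height=576, width=576, p=0.3),             # partially visible targets
        A.HueSaturationValue(hue_shift_limit=15, sat_shift_limit=40,
                             val_shift_limit=40, p=0.5),        # HSV color jitter
    ],
    bbox_params=A.BboxParams(format="yolo", label_fields=["class_labels"]),
)

# Usage: augmented = train_transforms(image=image, bboxes=bboxes, class_labels=labels)
```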
4.3. Experimental Environment and Model Training
The experiment was conducted using the following hardware and software configurations. The experimental platform employed a heterogeneous computing architecture consisting of an AMD Ryzen 7 4800H processor (maximum clock speed: 4.2 GHz) and an NVIDIA GeForce RTX 3060 GPU, operating on Windows 10. The CUDA parallel computing framework (version 12.2) was used in conjunction with the PyTorch (version 2.1.0) deep learning framework for algorithm implementation. During the data preprocessing stage, all input images were uniformly resized to a resolution of 640 × 640 pixels, and a hierarchical random sampling strategy was applied to divide the dataset into training and validation sets in an 8:2 ratio. Model training was performed using mini-batch gradient descent with a batch size of 16. A cosine annealing strategy was employed to dynamically adjust the learning rate, thereby optimizing the convergence process. The complete training process consisted of 270 epochs. To evaluate the model’s feasibility in edge computing scenarios, the trained weight files were ultimately deployed on a Raspberry Pi 5 embedded system. The target platform is equipped with a Broadcom BCM2712 quad-core ARM Cortex-A76 processor (clock speed: 2.4 GHz) and a VideoCore VII GPU. Real-time detection following deployment in a simulated laboratory environment is shown in
Figure 15.
4.4. Analysis on the Influence of CBAM Module Placement
In the direct information chain from “marker recognition to pilot decision-making”, this study employs a 30 ms latency threshold based on a synthesis of human perceptual characteristics and aviation safety requirements. Concurrently, validation against the computational constraints of the edge-deployed hardware (Raspberry Pi 5) reveals that with 4 CBAM modules, the single-frame inference time approaches the deployment threshold (27 ms) at a corresponding frame rate of 28 FPS, which is sufficient to meet real-time performance criteria. In contrast, increasing the module count to 5 or more results in inference times exceeding 30 ms and a reduced frame rate of 22 FPS, which fails to accommodate the real-time detection demands of aviation scenarios. Consequently, this study establishes 4 as the optimal upper limit for the number of CBAM modules.
To validate the scientific rationale for integrating CBAM modules into runway marker detection tasks, this study designed ablation experiments using a controlled-variable approach as shown in
Table 3. In these experiments, configurations of other improved modules—including the Alpha-CIoU loss function, BiFPN feature fusion structure, and Soft-NMS post-processing—were kept constant, and this setup is designated as YOLOv5-0. Only the insertion positions and quantities of CBAM modules in the backbone and neck were adjusted as illustrated in
Figure 16, with a focused analysis on how positional parameters influence feature extraction efficiency, multi-scale fusion performance, and detection accuracy. The specific results are as follows.
- (1)
Optimal Placement and Mechanism of CBAM in the Backbone.
As the core component for low-level feature extraction, the backbone primarily captures basic visual features such as runway edges and pavement textures. Experimental results demonstrate that deploying one CBAM module after the deep C3-DCN module and before the SPPF layer yields optimal performance gains. This placement enables channel-spatial dual-domain recalibration of high-level semantic features output by the backbone, effectively suppressing non-target noise (e.g., sky background and ground clutter) while avoiding feature redundancy that would occur with shallow-layer insertion (e.g., after shallow C3 modules). Shallow features contain substantial irrelevant visual information (e.g., pavement stains), and excessive attention enhancement here would waste computational resources and reduce feature discriminability.
- (2)
Optimal Placement and Mechanism of CBAM in the Neck.
The neck handles multi-scale feature fusion, and its performance directly impacts the detection of small targets such as distant runway numbers. Experimental data show that inserting one CBAM module after each BiFPN fusion layer and before the C3 module (three modules in total) achieves the best results. Feature maps fused by BiFPN already integrate multi-scale semantic information; CBAM dynamically enhances feature weights of critical targets (e.g., runway number “01L” or center lines) via channel attention and focuses on target regions through spatial attention. This effectively addresses detection failures caused by blurred small-target features under complex meteorological conditions (e.g., fog or nighttime).
- (3)
Synergistic Enhancement of CBAM Placement in Backbone and Neck.
The combined configuration of “1 CBAM in backbone + 3 CBAMs in neck” achieves globally optimal detection performance: mean Average Precision (mAP@0.5) reaches 80.03%, a 2.20% improvement over the baseline model without CBAM, with precision and recall increased by 5.66% and 2.99%, respectively. Their synergy embodies “hierarchical progressive feature optimization”: the backbone CBAM reduces noise interference in subsequent fusion through “feature purification”, providing high-signal-to-noise-ratio base features for the neck; neck CBAMs further refine multi-scale feature expression via “target enhancement”. This forms a “base purification-refined enhancement” closed-loop feature processing pipeline, significantly boosting the model’s target discrimination capability in complex scenarios.
4.5. Evaluation Indicators and Performance Analysis
In target detection tasks, the primary metrics used to evaluate model performance include P, R, and mAP. Their mathematical definitions are provided in Equations (15) and (16). To verify the effectiveness of the proposed improvements of various modules in the ours-YOLOv5s model, systematic ablation experiments were conducted. The results of the module combination comparisons are detailed in
Table 3, while performance data for different model architectures are presented in
Table 4. Visual comparison results of detection outputs are shown in
Figure 17.
Among them, TP (true positives) refers to the number of positive samples correctly predicted by the model, while FP (false positives) indicates the number of negative samples incorrectly predicted as positive. FN (false negatives) represents the number of positive samples that the model failed to identify. P and R measure, respectively, the proportion of predictions that are correct and the proportion of actual targets that are detected; values closer to 1 indicate better model performance.
The AP quantifies the area under the P-R curve and is calculated using a definite integral. The mAP is the average AP across all detected categories, with n denoting the total number of categories. The formulas are defined as follows:
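$$P = \frac{TP}{TP + FP}, \qquad R = \frac{TP}{TP + FN} \tag{15}$$

$$AP = \int_{0}^{1} P(R)\, dR, \qquad mAP = \frac{1}{n}\sum_{i=1}^{n} AP_{i} \tag{16}$$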
To evaluate the effectiveness of the proposed improvements to the ours-YOLOv5s model, a series of ablation experiments were conducted on an enhanced multi-scale dataset, as shown in
Table 4. The experimental results demonstrate the performance gains of various model configurations compared to the baseline model. Specifically, Model A, which replaces the original loss function with the Alpha-CIoU loss function, achieves performance improvements of 0.72% in P, 0.15% in R, and 0.04% in mAP. Model B, which builds on Model A by incorporating the CBAM attention mechanism, further enhances P, R, and mAP by 1.26%, 0.62%, and 0.88%, respectively. Model C, which integrates a multi-scale feature enhancement strategy, improves these metrics by an additional 1.83%, 0.97%, and 0.68% compared to Model B. Model D, which introduces BiFPN for feature fusion, shows further improvements of 0.70%, 0.13%, and 0.24% in P, R, and mAP, respectively. Finally, the ours-YOLOv5s model proposed in this study incorporates all enhancements, including dynamic weight allocation and an optimized feature fusion path. It achieves the best overall performance, with improvements of 1.15% in P, 0.85% in R, and 0.90% in mAP compared to Model D.
The experimental results presented in
Table 5 demonstrate the performance differences among various target detection frameworks, highlighting the overall superiority of the improved ours-YOLOv5s model. Quantitative analysis shows that the proposed model achieves the highest P and R rates among all models tested, with values of 85.97% and 86.31%, respectively. Compared to YOLOv5m, the precision improves by 4.26 percentage points, and by 3.52 percentage points over YOLOv5l. In terms of R, it exceeds YOLOv5m and YOLOv5l by 2.59 and 2.42 percentage points, respectively. Regarding overall detection performance, ours-YOLOv5s achieves a mAP of 80.03%, significantly outperforming other models. Specifically, it outperforms YOLOv3s (62.24%) by 17.79 percentage points and exceeds YOLOv5m (76.81%) and YOLOv5l (75.42%) by 3.22 and 4.61 percentage points, respectively. When compared to the more recent YOLOv8s (76.39%), the proposed model shows an improvement of 3.64 percentage points. Notably, the proposed model also demonstrates substantial improvements over classical detection frameworks. It surpasses R-CNN (75.25%), RetinaNet (73.80%), SSD (76.29%), DETR (75.52%), and Transformer (77.47%) by 4.78, 6.23, 3.74, 4.51, and 2.56 percentage points, respectively, in terms of mAP.
5. Conclusions
This study proposes a novel detection framework optimized for identifying aircraft runway markings, targeting key technical challenges encountered during aircraft landing, such as signal-to-noise ratio attenuation of marking features, significant meteorological interference, and inefficient multi-scale feature coupling. To address these issues, a spatial-channel dual-domain attention mechanism (CBAM) was integrated to enhance the model’s ability to filter out background disturbances. Additionally, a BiFPN was constructed to strengthen cross-layer semantic feature interactions, while the Alpha-CIoU dynamic intersection-over-union loss function was introduced to improve the accuracy of bounding box regression. Furthermore, the incorporation of the FReLU nonlinear activation function, a periodic learning rate adjustment strategy, and deformable convolution operations collectively contributed to accelerating model convergence and improving overall detection performance. Experimental results validate the proposed architecture’s robustness and real-time detection capability under complex weather conditions, demonstrating its practical applicability in aviation engineering. This work lays a solid technical foundation for the future development of lightweight, edge-computing-compatible detection systems in the field of intelligent aviation safety.