1. Introduction
Against the backdrop of continuously increasing global air transport volumes, ensuring flight safety and operational efficiency has become a primary focus in modern aviation technology research. Studies have demonstrated a significant correlation between the efficiency of runway marking identification during the approach and landing phases and overall flight safety. Specifically, navigational information such as runway numbers and lighting requires real-time and accurate identification to support safe landings. However, current identification methods predominantly rely on manual visual interpretation. Under adverse weather conditions, such as haze, low clouds, or nighttime operations, identification accuracy declines sharply. This technical limitation not only significantly reduces the safety margin during the approach phase but also poses serious challenges to pilots’ situational awareness and emergency response capabilities. In January 2022, the Civil Aviation Administration of China issued the Smart Civil Aviation Construction Roadmap, which outlined clear objectives for the development of intelligent air traffic management. The roadmap mandates strengthening national flight operation simulations and testing new-generation air traffic management systems by 2025, with the goal of operational deployment by 2030. The increasing complexity of low-altitude landing scenarios further challenges pilots, making the intelligentization of aviation information systems a pressing research direction.
For image data of aircraft landing runways, acquisition is affected by varying weather conditions, imaging wavelengths, and imaging modalities within complex airspace environments, which leads to inconsistent data distributions. This increases the likelihood of pilot misjudgment during landing decision-making, posing one of the key challenges in aviation target recognition. Currently, mainstream object detection algorithms can be broadly categorized into two types: one-stage detection and two-stage detection. One-stage detection methods directly perform localization and classification, with representative algorithms including YOLO [
1], SSD [
2], and RetinaNet [
3]. In contrast, two-stage detection methods first generate coarse anchor boxes and then refine localization and classification. Representative algorithms in this category include R-CNN, Fast R-CNN [
4], and Cascade R-CNN [
5].
These general-purpose detection algorithms, although demonstrating outstanding performance in recognizing common solid objects in everyday contexts, tend to yield poor results when applied to the detection of aviation-specific targets. Therefore, tailored optimizations are necessary to adapt these models to the precise demands of detecting runway markings in aviation scenarios. Yang [
6] integrated the Swin Transformer into a Mask R-CNN framework and combined it with natural language processing techniques. While this approach achieved a degree of recognition success, it suffered from relatively low processing speed. Similarly, Zhang [
7] employed Faster R-CNN integrated with a TIBAM module for two-stage detection. Although this algorithm improved recognition accuracy to some extent, it imposed a considerable computational burden. Chen et al. [
8] proposed a novel visual positioning algorithm that integrates YOLOv5 with Kalman filtering to address occlusion challenges in determining relative positional relationships using vision-based positioning modules. Rao et al. [
9] developed a landmark detection model by combining lightweight techniques with a fast contour optimization algorithm to achieve reliable position estimation under poor visual conditions. Liu et al. [
10] introduced a deep learning-based airport runway line detection method, providing precise positioning information for drone landings. Cao et al. [
11] presented an enhanced lightweight target detection approach for coal gangue, improving YOLOv5s by constructing new convolutional blocks and embedding an Efficient Channel Attention (ECA) module in the backbone network. This improvement significantly increased localization and recognition accuracy. Pan et al. [
12] optimized the YOLOv3 network by integrating a spatial pyramid pooling (SPP) module, a squeeze-and-excitation (SE) module, and dilated convolution, accelerating recognition speed while maintaining model accuracy. Yan et al. [
13] enhanced the YOLOv5 architecture by adding a spatial and channel squeeze-and-excitation (scSE) module, achieving an average detection accuracy of 0.983. Chen et al. [
14] proposed a track identification and monitoring method based on an improved YOLOv5s framework, incorporating a lightweight backbone, improved feature fusion strategies, and an optimized regression loss function. Wang et al. [
15] introduced an enhanced YOLOv5s object detection algorithm by integrating an inner convolutional module into the backbone and improving the feature fusion network using a GSConv module, thereby improving detection accuracy. Liu et al. [
16] developed a cotton seed damage detection method based on an improved YOLOv5 algorithm, incorporating the lightweight up-sampling operator CARAFE into the YOLOv5s framework and refining the loss function.
To address the industry challenge of insufficient recognition accuracy of runway markings during aircraft landings, this study proposes an enhanced target detection architecture, ours-YOLOv5s, based on deep learning. This model systematically improves image parsing efficiency through a series of multidimensional innovations. First, the model incorporates a Convolutional Block Attention Module (CBAM), which employs a channel-spatial dual-domain feature recalibration strategy. This significantly enhances the discriminative ability of feature representations under complex weather conditions. Second, it replaces the conventional feature fusion structure with a Bidirectional Feature Pyramid Network (BiFPN). Through bidirectional cross-scale concatenation and a weighted feature fusion mechanism, this architecture enhances multi-scale feature expression, thereby improving the recall rate (R%) for small-scale targets such as runway numbers in aviation imagery. In addition, the model introduces an adaptive Alpha-Complete Intersection over Union (CIoU) loss function based on the Alpha parameter. By incorporating dynamic balancing factors alongside traditional geometric constraints and aligning with a cosine annealing learning rate strategy, it achieves an optimal trade-off between localization accuracy and convergence speed. Furthermore, a comprehensive data augmentation strategy is implemented to address the limited quantity of available aviation image samples. This effectively mitigates overfitting, thus reducing generalization errors, particularly under conditions of fog or night operations. Experimental results indicate that, compared to the baseline model, the proposed ours-YOLOv5s exhibits a marked improvement in accuracy, offering a robust technical solution for mitigating runway incursion risks during the aircraft landing phase.
2. YOLOv5 Algorithm Principle
YOLOv5s adopts a modular architecture that forms a three-stage cascaded feature processing pipeline, as illustrated in
Figure 1. The collaborative framework consists of three core functional modules: a feature extraction network (backbone), feature fusion layers (neck), and a detection head (head).
During feature extraction, the backbone network performs progressive feature abstraction through the CSPDarknet53 framework. Initially, the Focus module restructures the input tensor dimensionally via slicing operations. Subsequently, a cross-stage residual concatenation and gradient flow diversion mechanism is employed to optimize the feature propagation path. Within this structure, the C3 module utilizes two strategies, branch-and-cut and channel compression, to construct a bottleneck layer that expands the receptive field while significantly reducing computational complexity. The feature fusion layers incorporate a BiFPN for multi-scale integration. A dynamic weighted fusion is achieved through bidirectional top-down and bottom-up cross-layer concatenation, effectively combining shallow positional information with deep semantic features. Among these, the PANet module employs a deformable convolution kernel to adaptively adjust the feature mapping, thereby enhancing the semantic expressiveness of shallow feature maps. The detection head module adopts a decoupled prediction structure that separates the tasks of localization regression and classification confidence prediction into parallel branches. In the localization branch, a dynamic anchor box optimization algorithm is applied, with geometric constraints constructed via the CIoU loss function. The classification branch utilizes a compound activation function (Sigmoid Linear Unit, SiLU) to enhance nonlinear representation capabilities. In the post-processing stage, an improved Non-Maximum Suppression (NMS) algorithm is employed. This version incorporates a Gaussian-weighted suppression strategy along with an adaptive threshold adjustment mechanism, effectively reducing the false match rate for dense and small-scale targets in aerial images.
3. Optimized YOLOv5 Algorithm
This study addresses several key limitations of the YOLOv5s object detection network in complex scenarios, including errors in small-target detection, low feature fusion efficiency, and insufficient model convergence stability. To overcome these challenges, a systematic optimization framework is proposed. This framework enhances the network’s detection accuracy and robustness through the construction of a multi-scale collaborative enhancement architecture, refinement of the bounding box regression mechanism, and integration of dynamic optimization strategies. As illustrated in
Figure 2, the optimized framework incorporates the following core technical modules.
3.1. Funnel ReLU (FReLU) Activation Function
Activation functions enable networks to perform hierarchical modeling of complex data patterns by introducing differentiable nonlinear transformation mechanisms. This nonlinearity breaks the superposition constraint of linear systems, allowing the network to approximate any continuous function through layered feature compositions. The core mechanism behind this nonlinear mapping lies in the design of the activation unit. The SiLU activation function adopts a composite formulation whose curve remains continuously differentiable around zero, thereby improving parameter update efficiency in regions prone to gradient saturation. By incorporating a sigmoid gating function, whose output is confined to the bounded interval (0, 1), the unit dynamically modulates its response, as expressed in Formula (1):
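$$f(x) = x \cdot \sigma(x) = \frac{x}{1 + e^{-x}} \tag{1}$$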
Ma et al. [
17] proposed the FReLU activation function, with its mechanism illustrated in
Figure 3. FReLU introduces a spatially aware mechanism with minimal computational overhead, extending traditional ReLU and PReLU functions into a two-dimensional activation framework characterized by regional correlations. Its formulation is given in Formula (2).
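In its standard two-dimensional form, FReLU replaces the fixed zero threshold of ReLU with the spatial funnel condition:

$$f(x_{c,i,j}) = \max\left(x_{c,i,j},\, T(x_{c,i,j})\right) \tag{2}$$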
where T(x) denotes a two-dimensional spatial condition operator that employs a parametric pooling window [
18] to establish spatial dependencies and facilitate effective local feature extraction. The design leverages learnable convolutional kernels to dynamically adjust the receptive field, enhancing sensitivity to local geometric structures. This process is formally described by Formula (3):
where $T(x_{c,i,j})$ denotes the funnel condition, $x^{\omega}_{c,i,j}$ denotes the $k_{h} \times k_{w}$ parametric pooling window centered on the 2D spatial location $(i, j)$ of the $c$-th channel, and $p^{\omega}_{c}$ is the coefficient shared by all pixels within the same window of that channel. This per-channel condition provides the nonlinear activation of the $c$-th channel and serves as the basis for generating the parameterized pooling window. Schematic diagrams of the FReLU, PReLU, and ReLU activation functions are illustrated in Figure 3.
A spatial conditional constraint mechanism is incorporated to strengthen spatial modeling capacity further. This mechanism enables refined spatial feature encoding via pixel-level parameter modulation and, when combined with standard convolution operations, facilitates multi-scale feature extraction. As a result, the model effectively captures complex visual layouts and spatial structural relationships within images while maintaining high computational efficiency.
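To make the mechanism concrete, the following is a minimal PyTorch sketch of an FReLU unit consistent with the description above: a per-channel (depthwise) 3 × 3 convolution with batch normalization produces the funnel condition T(x), and an element-wise maximum implements the activation. The class name and kernel size are illustrative choices, not details taken from the original implementation.

```python
import torch
import torch.nn as nn

class FReLU(nn.Module):
    """Funnel ReLU: y = max(x, T(x)), where T(x) is a learnable per-channel
    spatial condition (depthwise 3x3 convolution followed by batch norm)."""
    def __init__(self, channels: int, kernel_size: int = 3):
        super().__init__()
        # Depthwise convolution implements the parametric pooling window T(x)
        self.funnel = nn.Conv2d(channels, channels, kernel_size,
                                padding=kernel_size // 2, groups=channels, bias=False)
        self.bn = nn.BatchNorm2d(channels)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return torch.max(x, self.bn(self.funnel(x)))

# Example: activate a batch of 64-channel feature maps
feats = torch.randn(2, 64, 80, 80)
out = FReLU(64)(feats)  # output has the same shape as the input
```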
3.2. Alpha-CIoU Loss Function
Compared to the CIoU loss function, as illustrated in Figure 4, Alpha-CIoU [19] introduces a tunable power hyperparameter α, which enhances the model's adaptability and performance across different scenarios. While CIoU jointly optimizes three critical geometric factors, namely the center-point distance, the aspect ratio, and the overlapping area between bounding boxes, Alpha-CIoU additionally raises the IoU term and its penalty terms to the power α, placing greater emphasis on high-quality (high-IoU) predictions and thereby improving localization accuracy. Adjusting the α parameter dynamically rebalances the contributions of the individual components according to task-specific requirements. This flexibility allows the model to achieve a better trade-off between accuracy and computational efficiency, particularly in complex or irregular object detection tasks.
As shown in
Figure 4,
d represents the distance between the centers of the ground truth box and the predicted box and
c denotes the diagonal length of the smallest enclosing box that contains both. This design allows CIoU to effectively handle cases where there is no overlap between boxes. The CIoU metric, on which the CIoU loss is based, is defined as follows:
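$$\mathrm{CIoU} = \mathrm{IoU} - \frac{\rho^{2}(b, b^{gt})}{c^{2}} - \alpha v \tag{4}$$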
where IoU (Intersection over Union) is the ratio of the overlapping area to the union area of the predicted and ground truth boxes, measuring the degree of overlap between the two boxes; $\rho^{2}(b, b^{gt})$ is the squared Euclidean distance between the center of the predicted box $b$ and the center of the ground truth box $b^{gt}$, and thus measures the deviation between the two center points; $c$ is the diagonal length of the smallest enclosing box containing both $b$ and $b^{gt}$; $v$ measures the aspect-ratio discrepancy between the predicted and ground truth boxes; and $\alpha$ is a weight coefficient that balances the influence of $v$. The formulas for $\alpha$ and $v$ are expressed as Formulas (5) and (6):
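$$\alpha = \frac{v}{(1 - \mathrm{IoU}) + v} \tag{5}$$

$$v = \frac{4}{\pi^{2}}\left(\arctan\frac{w^{gt}}{h^{gt}} - \arctan\frac{w}{h}\right)^{2} \tag{6}$$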
where $w^{gt}$ and $h^{gt}$ denote the width and height of the ground truth box, and $w$ and $h$ denote the width and height of the predicted box. The constant factor $4/\pi^{2}$ normalizes the aspect-ratio difference to a stable range. Accordingly, the complete CIoU loss is expressed as Formula (7):
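$$L_{\mathrm{CIoU}} = 1 - \mathrm{CIoU} = 1 - \mathrm{IoU} + \frac{\rho^{2}(b, b^{gt})}{c^{2}} + \alpha v \tag{7}$$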
Alpha-CIoU additionally introduces a power regularization term governed by a single power parameter α. By adjusting α, the detector gains greater flexibility in achieving different levels of bounding box regression accuracy. The Alpha-CIoU loss function is expressed as Formula (8):
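$$L_{\alpha\text{-CIoU}} = 1 - \mathrm{IoU}^{\alpha} + \frac{\rho^{2\alpha}(b, b^{gt})}{c^{2\alpha}} + (\beta v)^{\alpha} \tag{8}$$

where the exponent α is the tunable Alpha-CIoU power parameter and β denotes the aspect-ratio trade-off weight of Formula (5), written here as β to avoid confusion with the power parameter.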
3.3. CBAM
CBAM [
20], illustrated in
Figure 5, is an attention mechanism that sequentially applies a Channel Attention Module (CAM) and Spatial Attention Module (SAM). By incorporating both channel-wise and spatial attention, CBAM enhances feature learning capability while maintaining computational efficiency and parameter economy.
The CBAM module consists of an input layer, the CAM, the SAM, and an output layer. Given an input feature map F ∈ R^(C×H×W), the CAM first infers a one-dimensional channel attention map M_c ∈ R^(C×1×1), which is multiplied element-wise with the input feature map. The resulting channel-refined feature is then fed into the SAM, which infers a two-dimensional spatial attention map M_s ∈ R^(1×H×W); multiplying this map element-wise with the channel-refined feature yields the final output. The entire attention process is formulated as Formula (9):
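$$F' = M_{c}(F) \otimes F, \qquad F'' = M_{s}(F') \otimes F' \tag{9}$$

where ⊗ denotes element-wise multiplication, F′ is the channel-refined feature, and F″ is the final output.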
The CAM, as depicted in
Figure 6, begins by applying global average pooling and global max pooling across the spatial dimensions (height and width) of the input feature map. The two pooled descriptors are then passed through a shared MLP, whose outputs are combined via element-wise summation and passed through a sigmoid activation function to produce the channel attention map. This map is then multiplied element-wise with the original input feature map to generate the channel-refined output of the CAM. The operation of the CAM is formulated as Formula (10):
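$$M_{c}(F) = \sigma\big(\mathrm{MLP}(\mathrm{AvgPool}(F)) + \mathrm{MLP}(\mathrm{MaxPool}(F))\big) \tag{10}$$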
The SAM, shown in
Figure 7, takes the output of the CAM as its input. First, global average pooling and global max pooling are applied, this time along the channel axis. The two resulting maps are concatenated along the channel dimension and passed through a convolution layer for dimensionality reduction; the resulting single-channel map then undergoes a sigmoid activation function to generate the spatial attention map. This map is multiplied element-wise with the input feature map to produce the final refined features. The SAM operation is formally defined as Formula (11):
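$$M_{s}(F) = \sigma\big(f^{7\times 7}\big([\mathrm{AvgPool}(F);\, \mathrm{MaxPool}(F)]\big)\big) \tag{11}$$

where $f^{7\times 7}$ denotes a convolution with a 7 × 7 kernel (the kernel size used in the original CBAM design) and $[\,\cdot\,;\,\cdot\,]$ denotes concatenation along the channel axis.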
3.4. BiFPN Structure
The architecture of the Bidirectional Feature Pyramid Network (BiFPN) [
21] is illustrated in
Figure 8. Unlike YOLOv5, which employs CSPDarknet as the backbone and PANet as its feature fusion module, BiFPN enhances multiscale feature representation by introducing bidirectional information flow, both bottom-up and top-down, within the network structure and adopting a weighted fusion strategy, resulting in multiple cross-scale connections.
3.5. Soft-Non-Maximum Suppression (NMS) Mechanism
The NMS algorithm operates by ranking the candidate box set S in descending order of confidence score. The IoU is calculated between the highest-scoring box b* and all remaining boxes, and those with an IoU exceeding a predefined threshold π_0 are removed. However, traditional NMS suffers from inherent limitations. In regions with overlapping targets, such a rigid suppression strategy can lead to false negatives, where valid detections are erroneously discarded. Moreover, the performance of NMS is highly sensitive to the choice of π_0: a lower threshold may result in the loss of critical information due to over-suppression, while a higher threshold may retain excessive redundant boxes, reducing detection accuracy and interpretability.
Unlike traditional NMS, Soft-NMS applies a smooth suppression to overlapping candidate boxes, significantly mitigating over-suppression. The algorithm dynamically adjusts the scores of candidate boxes rather than directly eliminating them, allowing some overlapping detections to be retained, which is particularly beneficial in complex visual environments, for example, where aircraft landing markings and runway lines, buildings, or other structures are spatially intertwined. The core idea of Soft-NMS [
22] is as follows: when the IoU between the highest-confidence box and another box exceeds the threshold π_0, the algorithm does not discard the other box; instead, it reduces its confidence score according to the level of overlap. The greater the IoU, the more significant the score decay. This dynamic adjustment enables better preservation of true positive detections in dense scenes. The algorithm flow is shown in
Table 1, and the Soft-NMS principle diagram is shown in
Figure 9.
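For illustration, the following is a minimal NumPy sketch of the Gaussian variant of Soft-NMS summarized above; the function name and the parameter values (sigma, score_thr) are illustrative assumptions rather than the exact settings used in this work.

```python
import numpy as np

def soft_nms(boxes: np.ndarray, scores: np.ndarray,
             sigma: float = 0.5, score_thr: float = 0.001) -> list:
    """Gaussian Soft-NMS: decay the scores of boxes overlapping the current
    best box instead of discarding them. boxes are (N, 4) as (x1, y1, x2, y2)."""
    scores = scores.astype(float).copy()
    keep, idxs = [], list(range(len(scores)))
    while idxs:
        # Pick the remaining box with the highest (possibly decayed) score
        best = max(idxs, key=lambda i: scores[i])
        keep.append(best)
        idxs.remove(best)
        for i in idxs:
            # IoU between the best box and box i
            x1 = max(boxes[best, 0], boxes[i, 0]); y1 = max(boxes[best, 1], boxes[i, 1])
            x2 = min(boxes[best, 2], boxes[i, 2]); y2 = min(boxes[best, 3], boxes[i, 3])
            inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
            area_b = (boxes[best, 2] - boxes[best, 0]) * (boxes[best, 3] - boxes[best, 1])
            area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
            iou = inter / (area_b + area_i - inter + 1e-9)
            # Gaussian decay: larger overlap leads to a stronger score reduction
            scores[i] *= np.exp(-(iou ** 2) / sigma)
    # Discard boxes whose decayed scores fall below the final score threshold
    return [i for i in keep if scores[i] > score_thr]
```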
3.6. Deformable Convolution (Deformable Conv)
Deformable Conv enhances the model’s ability to adapt to the geometric deformations of objects by dynamically adjusting the sampling grid, as illustrated in
Figure 10. Unlike standard convolution, which uses a fixed rectangular sampling pattern, deformable convolution exhibits stronger shape-awareness capabilities.
In the top-level feature maps generated by deformable convolution, the distribution of activated feature points shows a significant correlation with the object’s contours and structural characteristics, resulting in a selective response to object-specific features. This behavior suppresses background noise during feature extraction and enhances the representational power of the learned features, thereby improving target localization and recognition accuracy in complex and cluttered environments.
Compared to standard convolution, deformable convolution introduces learnable offsets that shift the sampling points toward more informative regions of the input. This mechanism is depicted in
Figure 11.
In traditional convolution, for an input feature map of size 7 × 7 and a convolution kernel of size 3 × 3, the kernel weights are multiplied by the corresponding elements of the input feature map and summed to obtain each output element; sliding this window across the input produces the complete output feature map. The formulation of traditional convolution is given by Formula (12):
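$$y(p_{0}) = \sum_{p_{n} \in R} w(p_{n}) \cdot x(p_{0} + p_{n}) \tag{12}$$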
where $p_{0}$ denotes a position on the output feature map, $p_{n}$ enumerates the offsets of the sampling points in the kernel relative to its center, $w(p_{n})$ is the corresponding kernel weight, and $x$ denotes the input feature map. The regular sampling grid $R$ is expressed as in Formula (13):
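$$R = \{(-1,-1),\, (-1,0),\, \ldots,\, (0,1),\, (1,1)\} \tag{13}$$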
The formula of deformable convolution is given as Formula (14), where Δp_n denotes the learnable offset generated from the input feature map by an additional convolution branch:
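$$y(p_{0}) = \sum_{p_{n} \in R} w(p_{n}) \cdot x(p_{0} + p_{n} + \Delta p_{n}) \tag{14}$$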
4. Experiment
4.1. Dataset
The experiments in this study primarily focus on the aircraft landing phase at altitudes below 3000 m.
In terms of dataset selection, the datasets used in this study include: a self-constructed dataset from the Roboflow platform; the “Landing Approach Runway Detection (LARD)” dataset—developed by Airbus France [
23] and hosted on
www.github.com, with its data distribution shown in
Figure 12; and the “FS2020 Runway Dataset” obtained from the Kaggle platform. Additionally, one of the authors, Huang Wei, supplemented the dataset with images of actual operational scenarios within airports based on his experience as airport staff. These supplementary data help improve the alignment between the dataset and real-world landing scenarios.
In terms of the screening of dataset images:
- (1)
From the perspective of viewing angle, all selected images simulate the top-down or forward-looking perspective during the aircraft’s approach to landing, which is consistent with the visual perspective of pilots during actual operations. This ensures the alignment between the detection scenario and real application scenarios. Meanwhile, images with irrelevant viewing angles such as ground side shots and high-altitude aerial shots are excluded to avoid interference from non-target viewing angles in model learning.
- (2)
In terms of target types, only images containing specific runway markings are retained. These markings include core detection targets such as runway numbers (e.g., “01L”, “36R”), runway centerlines, and touchdown zone marks. Images featuring non-marking targets such as airport buildings and aircraft bodies are excluded to ensure the dataset focuses on the key detection objects required by the research.
- (3)
From the dimension of environmental conditions, considering various situations that may be encountered in actual landing scenarios, the selected images cover diverse meteorological conditions (sunny, rainy, and foggy), lighting conditions (strong noon light, weak twilight light, and night lights), and imaging quality states (clear images, slightly motion-blurred images, images with sudden brightness changes, etc.). Among them, “Mixed Weather” in
Table 2 specifically includes complex scenarios such as low visibility in fog, night light reflection, runway surface reflection in rainy weather, and backlight at dusk. This ensures that the dataset can support the model’s ability to detect runway markings in different complex environments.
The final dataset comprises 10,362 images, annotated with a total of 206,725 bounding box labels. The dataset is divided into three subsets: 80% of images (8290) are the training set, 15% (1553) are the validation set, and 5% (519) are the testing set. The detailed distribution of targets across these subsets is presented in
Table 2. Additionally, representative sample images from each category are illustrated in
Figure 13.
4.2. Data Augmentation
To improve the model’s generalization and robustness, this study employed a series of data augmentation strategies. These techniques include both geometric transformations and color space manipulations, as illustrated in
Figure 14. The geometric transformations applied are as follows: rotation randomly rotates the image by a certain angle, enhancing the model's adaptability to rotational variations; translation applies horizontal and vertical shifts, increasing the model's tolerance to positional changes; and cropping randomly removes part of the image to improve the model's performance on partially visible targets. In addition to geometric transformations, HSV (hue, saturation, value) adjustments are used to simulate color variations caused by different times of day and weather conditions, further enhancing the model's robustness.
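As a concrete illustration, the geometric and HSV transformations described above can be composed with the Albumentations library roughly as follows; the parameter values and field names are illustrative assumptions, not the exact settings used in this study.

```python
import albumentations as A

# Illustrative augmentation pipeline (example values only), assuming inputs
# have already been resized to 640 x 640 as in the training preprocessing.
train_transforms = A.Compose(
    [
        A.ShiftScaleRotate(shift_limit=0.1, scale_limit=0.1,
                           rotate_limit=10, p=0.5),            # rotation + translation
        A.RandomCrop(height=576, width=576, p=0.3),             # partially visible targets
        A.HueSaturationValue(hue_shift_limit=15, sat_shift_limit=40,
                             val_shift_limit=40, p=0.5),        # HSV color jitter
    ],
    bbox_params=A.BboxParams(format="yolo", label_fields=["class_labels"]),
)

# Usage: augmented = train_transforms(image=image, bboxes=bboxes, class_labels=labels)
```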
4.3. Experimental Environment and Model Training
The experiment was conducted using the following hardware and software configurations. The experimental platform employed a heterogeneous computing architecture consisting of an AMD Ryzen 7 4800H processor (maximum clock speed: 4.2 GHz) and an NVIDIA GeForce RTX 3060 GPU, operating on Windows 10. The CUDA parallel computing framework (version 12.2) was used in conjunction with the PyTorch (version 2.1.0) deep learning framework for algorithm implementation. During the data preprocessing stage, all input images were uniformly resized to a resolution of 640 × 640 pixels, and a hierarchical random sampling strategy was applied to divide the dataset into training and validation sets in an 8:2 ratio. Model training was performed using mini-batch gradient descent with a batch size of 16. A cosine annealing strategy was employed to dynamically adjust the learning rate, thereby optimizing the convergence process. The complete training process consisted of 270 epochs. To evaluate the model’s feasibility in edge computing scenarios, the trained weight files were ultimately deployed on a Raspberry Pi 5 embedded system. The target platform is equipped with a Broadcom BCM2712 quad-core ARM Cortex-A76 processor (clock speed: 2.4 GHz) and a VideoCore VII GPU. Real-time detection following deployment in a simulated laboratory environment is shown in
Figure 15.
4.4. Analysis on the Influence of CBAM Module Placement
In the direct information chain from “marker recognition to pilot decision-making”, this study employs a 30 ms latency threshold based on a synthesis of human perceptual characteristics and aviation safety requirements. Concurrently, validation against the computational constraints of the edge-deployed hardware (Raspberry Pi 5) reveals that with 4 CBAM modules, the single-frame inference time approaches the deployment threshold (27 ms) at a corresponding frame rate of 28 FPS, which is sufficient to meet real-time performance criteria. In contrast, increasing the module count to 5 or more results in inference times exceeding 30 ms and a reduced frame rate of 22 FPS, which fails to accommodate the real-time detection demands of aviation scenarios. Consequently, this study establishes 4 as the optimal upper limit for the number of CBAM modules.
To validate the scientific rationale for integrating CBAM modules into runway marker detection tasks, this study designed ablation experiments using a controlled-variable approach as shown in
Table 3. In these experiments, configurations of other improved modules—including the Alpha-CIoU loss function, BiFPN feature fusion structure, and Soft-NMS post-processing—were kept constant, and this setup is designated as YOLOv5-0. Only the insertion positions and quantities of CBAM modules in the backbone and neck were adjusted as illustrated in
Figure 16, with a focused analysis on how positional parameters influence feature extraction efficiency, multi-scale fusion performance, and detection accuracy. The specific results are as follows.
- (1)
Optimal Placement and Mechanism of CBAM in the Backbone.
As the core component for low-level feature extraction, the backbone primarily captures basic visual features such as runway edges and pavement textures. Experimental results demonstrate that deploying one CBAM module after the deep C3-DCN module and before the SPPF layer yields optimal performance gains. This placement enables channel-spatial dual-domain recalibration of high-level semantic features output by the backbone, effectively suppressing non-target noise (e.g., sky background and ground clutter) while avoiding feature redundancy that would occur with shallow-layer insertion (e.g., after shallow C3 modules). Shallow features contain substantial irrelevant visual information (e.g., pavement stains), and excessive attention enhancement here would waste computational resources and reduce feature discriminability.
- (2)
Optimal Placement and Mechanism of CBAM in the Neck.
The neck handles multi-scale feature fusion, and its performance directly impacts the detection of small targets such as distant runway numbers. Experimental data show that inserting one CBAM module after each BiFPN fusion layer and before the C3 module (three modules in total) achieves the best results. Feature maps fused by BiFPN already integrate multi-scale semantic information; CBAM dynamically enhances feature weights of critical targets (e.g., runway number “01L” or center lines) via channel attention and focuses on target regions through spatial attention. This effectively addresses detection failures caused by blurred small-target features under complex meteorological conditions (e.g., fog or nighttime).
- (3)
Synergistic Enhancement of CBAM Placement in Backbone and Neck.
The combined configuration of “1 CBAM in backbone + 3 CBAMs in neck” achieves globally optimal detection performance: mean Average Precision (mAP@0.5) reaches 80.03%, a 2.20% improvement over the baseline model without CBAM, with precision and recall increased by 5.66% and 2.99%, respectively. Their synergy embodies “hierarchical progressive feature optimization”: the backbone CBAM reduces noise interference in subsequent fusion through “feature purification”, providing high-signal-to-noise-ratio base features for the neck; neck CBAMs further refine multi-scale feature expression via “target enhancement”. This forms a “base purification-refined enhancement” closed-loop feature processing pipeline, significantly boosting the model’s target discrimination capability in complex scenarios.
4.5. Evaluation Indicators and Performance Analysis
In target detection tasks, the primary metrics used to evaluate model performance include P, R, and mAP. Their mathematical definitions are provided in Equations (15) and (16). To verify the effectiveness of the proposed improvements of various modules in the ours-YOLOv5s model, systematic ablation experiments were conducted. The results of the module combination comparisons are detailed in
Table 3, while performance data for different model architectures are presented in
Table 4. Visual comparison results of detection outputs are shown in
Figure 17.
Among them, TP (true positives) refers to the number of positive samples correctly predicted by the model, while FP (false positives) indicates the number of negative samples incorrectly predicted as positive. FN (false negatives) represents the number of positive samples that the model failed to identify. P and R measure, respectively, the proportion of predictions that are correct and the proportion of actual targets that are detected; values closer to 1 indicate better model performance.
The AP quantifies the area under the P-R curve and is calculated using a definite integral. The mAP is the average AP across all detected categories, with n denoting the total number of categories. The formulas are defined as follows:
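$$P = \frac{TP}{TP + FP}, \qquad R = \frac{TP}{TP + FN} \tag{15}$$

$$AP = \int_{0}^{1} P(R)\, dR, \qquad mAP = \frac{1}{n}\sum_{i=1}^{n} AP_{i} \tag{16}$$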
To evaluate the effectiveness of the proposed improvements to the ours-YOLOv5s model, a series of ablation experiments were conducted on an enhanced multi-scale dataset, as shown in
Table 4. The experimental results demonstrate the performance gains of various model configurations compared to the baseline model. Specifically, Model A, which replaces the original loss function with the Alpha-CIoU loss function, achieves performance improvements of 0.72% in P, 0.15% in R, and 0.04% in mAP. Model B, which builds on Model A by incorporating the CBAM attention mechanism, further enhances P, R, and mAP by 1.26%, 0.62%, and 0.88%, respectively. Model C, which integrates a multi-scale feature enhancement strategy, improves these metrics by an additional 1.83%, 0.97%, and 0.68% compared to Model B. Model D, which introduces BiFPN for feature fusion, shows further improvements of 0.70%, 0.13%, and 0.24% in P, R, and mAP, respectively. Finally, the ours-YOLOv5s model proposed in this study incorporates all enhancements, including dynamic weight allocation and an optimized feature fusion path. It achieves the best overall performance, with improvements of 1.15% in P, 0.85% in R, and 0.90% in mAP compared to Model D.
The experimental results presented in
Table 5 demonstrate the performance differences among various target detection frameworks, highlighting the overall superiority of the improved ours-YOLOv5s model. Quantitative analysis shows that the proposed model achieves the highest P and R rates among all models tested, with values of 85.97% and 86.31%, respectively. Compared to YOLOv5m, the precision improves by 4.26 percentage points, and by 3.52 percentage points over YOLOv5l. In terms of R, it exceeds YOLOv5m and YOLOv5l by 2.59 and 2.42 percentage points, respectively. Regarding overall detection performance, ours-YOLOv5s achieves a mAP of 80.03%, significantly outperforming other models. Specifically, it outperforms YOLOv3s (62.24%) by 17.79 percentage points and exceeds YOLOv5m (76.81%) and YOLOv5l (75.42%) by 3.22 and 4.61 percentage points, respectively. When compared to the more recent YOLOv8s (76.39%), the proposed model shows an improvement of 3.64 percentage points. Notably, the proposed model also demonstrates substantial improvements over classical detection frameworks. It surpasses R-CNN (75.25%), RetinaNet (73.80%), SSD (76.29%), DETR (75.52%), and Transformer (77.47%) by 4.78, 6.23, 3.74, 4.51, and 2.56 percentage points, respectively, in terms of mAP.
5. Conclusions
This study proposes a novel detection framework optimized for identifying aircraft runway markings, targeting key technical challenges encountered during aircraft landing, such as signal-to-noise ratio attenuation of marking features, significant meteorological interference, and inefficient multi-scale feature coupling. To address these issues, a spatial-channel dual-domain attention mechanism (CBAM) was integrated to enhance the model’s ability to filter out background disturbances. Additionally, a BiFPN was constructed to strengthen cross-layer semantic feature interactions, while the Alpha-CIoU dynamic intersection-over-union loss function was introduced to improve the accuracy of bounding box regression. Furthermore, the incorporation of the FReLU nonlinear activation function, a periodic learning rate adjustment strategy, and deformable convolution operations collectively contributed to accelerating model convergence and improving overall detection performance. Experimental results validate the proposed architecture’s robustness and real-time detection capability under complex weather conditions, demonstrating its practical applicability in aviation engineering. This work lays a solid technical foundation for the future development of lightweight, edge-computing-compatible detection systems in the field of intelligent aviation safety.