1. Introduction
As the demand for intelligent operation and maintenance of rail transportation continues to grow, track inspection is becoming increasingly important. Track inspection systems describe the geometric position parameters of the track with the running mileage as the independent variable; however, under the influence of cumulative error, the mileage recorded for a geometric deviation in long-distance measurement may differ significantly from its actual position. For the systematic condition monitoring of key components such as sleepers, fasteners, roadbeds, and turnouts, detection accuracy is therefore significantly constrained by the cumulative mileage error, which directly leads to deviations in condition assessment and inaccurate spatial localization, significantly affecting the reliability of operation and maintenance decisions [1].
To achieve dynamic error correction, a real-time error compensation mechanism based on feature matching can be established by detecting features with fixed patterns on the track as mileage calibration points [2,3]. Therefore, high-precision track feature detection has become the core link in improving the robustness of intelligent inspection systems and is of key significance for guaranteeing safe operation and maintenance over the entire track life cycle.
In recent years, researchers have used various sensors and optimization algorithms to extract track features for track feature detection [4]. Wang [5] used an odometer to measure the speed of the train and the distance traveled, but this approach remains limited by wheel slip during acceleration and deceleration and by wheel wear; Wei [6] introduced light detection and ranging (LIDAR) equipment to detect the track plane and used a moving average filter (MAF) algorithm to detect and classify the track, but the method is computationally intensive and places very high demands on the hardware terminal; Wang et al. [7] fused multi-cycle tunnel profile point clouds acquired by radar through a localization algorithm and used a subway tunnel modeling algorithm to establish a standard tunnel profile model for processing the fused data; Zhang [8] used existing communication optical fiber along the railroad line to detect the track plane and classify the track, proposing an interferometric technique based on Rayleigh backscattered signals in optical fibers for identifying and localizing railroad vehicles, but the method still has limitations; Olaby [9] proposed a railroad localization method that uses RFID technology to align vehicles with the locations of turnouts and crossings on the railroad network; Lian [10] proposed a new modular visual processing framework based on a multi-target tracking module over dynamic regions of interest, assigning a unique identification code to each landmark for continuous train localization; Spinsante [11] proposed a hybrid GNSS method for train localization, although it places certain requirements on GNSS signal availability; Qin [12] proposed a data fusion method that combines mileage-corrected track geometry inspection data with the uncorrected velocity information of axlebox acceleration inspection data to correct the mileage deviation of the axlebox acceleration data; and Chen [13] proposed an on-board railroad positioning system assisted by digital track maps using a strapdown inertial navigation system (SINS) and an odometer (OD), which effectively suppresses the accumulation of train position error. In summary, current track detection methods rely on high-precision equipment to achieve accurate detection, and compensating for the errors introduced by this equipment requires greater algorithmic complexity. At the same time, the detection model must be deployed on mobile edge devices, whose resource constraints and multi-threaded environments require the model to have a small number of parameters. Therefore, lightweight, efficient, and accurate track feature detection technology has become crucial for ensuring line safety and operation and maintenance efficiency.
The rapid development of deep learning technology has injected unprecedented vitality into the field of computer vision, and vision-based track inspection is gradually becoming a key research direction [14]. Phaphuangwittayakul [15] utilized the Dual Attention Vision Transformer (DaViT) to construct RailTrack-DaViT, effectively capturing both global and local information to achieve accurate track detection; Xiao [16] developed a novel fusion model combining the Segment Anything Model and the U-Net network to perform detailed identification and segmentation of track scaling areas; Bottalico [17] developed a 3D-vision-based method to identify inherent features already present on track structures; Hu [18] enhanced detection in complex slab track scenarios using synthetic images based on the YOLO architecture; Ma [19] improved the YOLOv8 algorithm for detecting train track fasteners, achieving good detection results; Luo [20] automated ballast detection using computer vision methods, employing BSV to thoroughly assess continuous track sections; Shen [21] improved YOLOv7 and CenterPoint for detecting visible-light images and point clouds, respectively, and used AED as a new metric in the data association module to track detection results between images and point clouds, effectively enhancing association robustness and reducing tracking errors. In terms of algorithm-level applications, Xia [22] proposed Odess, significantly reducing the computational overhead of similarity detection while achieving high detection accuracy and high compression ratios, and Zou [23] proposed MFDedup, a novel management-friendly data deduplication framework that maximizes locality; these methods open new horizons for embedded deployment in track detection applications.
Based on the current research status and existing problems, this paper proposes a lightweight track feature detection algorithm, YOLO-LWTD, for track inspection tasks, building upon YOLO11 [24]. First, StarNet replaces the original backbone network, significantly reducing the model complexity. Second, the feature fusion network is reconfigured to be lightweight, and the efficient C3K2-Light module is introduced. Finally, an improved detection head structure further enhances the model's detection performance.
The structure of this paper is as follows:
Section 1 provides a summary of the current state of research;
Section 2 systematically describes the overall architecture of the track detection algorithm based on the improved YOLO11 and the design principle of its optimization module;
Section 3 outlines the data acquisition and model-training process;
Section 4 analyzes the differences in the performance indexes between the proposed algorithm and the main benchmark methods through comparative experiments and systematically evaluates the experimental results; and
Section 5 summarizes the entire paper and presents constructive perspectives for future research directions.
2. Proposed Methodology
2.1. YOLO11
The YOLO (You Only Look Once) series of algorithms, a representative family in the field of target detection, employs an end-to-end single-stage detection architecture that enables the efficient detection and precise localization of multiple target objects in images. YOLO11 is the latest model proposed by the Ultralytics team; its main innovations include the introduction of the C3K2 module to optimize the shallow feature extraction process, the incorporation of the C2PSA attention mechanism to enhance feature capture, and the addition of depthwise separable convolution (DWConv) to the detection head. In terms of model architecture, YOLO11 consists of three core components: a feature extraction backbone network (backbone), a multi-scale feature fusion neck network (neck), and a target detection head (head); its overall architecture is shown in Figure 1.
The backbone network of YOLO11 extracts multi-scale feature maps from the input image and includes modules such as Conv, C3K2, SPPF, and C2PSA. C3K2 enhances the overall feature extraction performance, while the C2PSA spatial attention module, combined with SPPF, enables the model to adaptively focus on salient regions in the image and strengthen the expression of key features. The neck network adopts a bi-directional feature fusion structure (PANet) that combines FPN and PAN, in which the C3K2 module fuses features at different scales more efficiently. The detection head follows the decoupled head of YOLOv8, but YOLO11 adds two depthwise separable convolutions (DWConv) to the classification head, substantially reducing the computational cost without losing accuracy. For the regression loss, a composite loss function combining distribution focal loss (DFL) and CIoU (complete intersection over union) is used; for the classification loss, a binary cross-entropy loss is used, and the task-aligned sample assignment adaptively adjusts the weights of positive and negative samples, effectively alleviating the category imbalance problem [25].
YOLO11 provides five model variants with different network depths and widths: n (nano), s (small), m (medium), l (large), and x (extra-large). The parameter counts and floating-point operations (FLOPs) of each variant at an input resolution of 640 × 640 pixels are shown in Table 1.
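For orientation, the variants listed in Table 1 can be loaded and inspected directly with the Ultralytics package; the snippet below is a small illustrative example (checkpoint names follow the publicly released models).

```python
# Load YOLO11 variants and print their layer count, parameter count, and GFLOPs.
from ultralytics import YOLO

for variant in ("yolo11n.pt", "yolo11s.pt", "yolo11m.pt", "yolo11l.pt", "yolo11x.pt"):
    model = YOLO(variant)   # downloads the checkpoint on first use
    model.info()            # prints a summary: layers, parameters, gradients, GFLOPs
```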
2.2. YOLO-LWTD
Aiming to enhance feature extraction efficiency and meet the lightweight-model requirements of multi-target detection tasks in track scenes, this study uses YOLO11n as the base network and proposes a new lightweight track feature detection model. The innovative improvements are mainly reflected in the following three aspects:
In the backbone feature extraction part, the element-level feature interaction mechanism unique to StarNet is innovatively introduced to enhance the feature extraction;
In the neck part of the model, a lightweight extended path aggregation network and the C3K2-Light module are adopted to achieve efficient fusion of multi-scale features by optimizing the information propagation path of the feature pyramid;
In the head part, a lighter and more efficient detection head Detect-LADH is adopted, and this structure significantly reduces the computational complexity by simplifying the feature-decoding process while ensuring the detection accuracy.
The complete improved network structure is shown in Figure 2.
2.3. StarNet
The backbone network of YOLO11 constructs a multilevel feature extraction system by integrating components such as the C3K2 module, SPPF, C2PSA, and convolutional layers. Although this module combination strategy effectively improves the feature expression capability of the network, it also significantly increases the model's parameter count and computational cost, which lowers the inference speed and hinders deployment in real-world application scenarios. To address this performance bottleneck, this paper proposes a backbone network reconfiguration scheme based on StarNet.
StarNet [26] is an efficient neural network architecture built on element-wise multiplication. Unlike the standard linear dot-product operation, it uses element-wise multiplication to construct a mapping from a low-dimensional feature space to a high-dimensional, nonlinear space. This operation does not increase the width of the network; it both preserves the local specificity of the input features and significantly improves the discriminative ability of the model through nonlinear interactions. For target detection networks such as YOLO, the element-wise multiplication operation is particularly suitable for capturing visual features with subtle structural differences.
In a single-layer neural network, the response of a neuron under the star operation can be written as in Equation (1), where ⋆ represents element-wise multiplication, W represents the weights of the input neurons, and B represents the bias of the input neurons. Based on Equation (1), the weight matrix and bias are combined into a single entity, denoted as w = [W; B]^T, with the input correspondingly written as x = [X; 1]^T, and the fusion of the features of two linear transformations is then expressed by element-wise multiplication; for a single output channel and a single-element input, this takes the form shown in Equation (2). In this equation, w1, w2, x ∈ R^((d+1)×1), where d is the input channel number; the formulation can also be extended to the case of multiple output channels and the processing of multiple feature elements. Meanwhile, i and j are used to index the channels, and α(i,j) is the coefficient of each term; the expression for the coefficients is shown in Equation (3).
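For reference, the star operation can be written out explicitly; the following is a sketch reconstructed from the StarNet formulation, using w1, w2, x, d, and α(i,j) as defined above.

```latex
% Star operation: element-wise multiplication of two linear transforms (sketch of Eqs. (1)-(3))
(\mathrm{W}_1^{\top}\mathrm{X} + \mathrm{B}_1) \star (\mathrm{W}_2^{\top}\mathrm{X} + \mathrm{B}_2)
  = w_1^{\top}x \,*\, w_2^{\top}x
  = \sum_{i=1}^{d+1}\sum_{j=1}^{d+1} w_1^{i}\, w_2^{j}\, x^{i} x^{j}
  = \sum \alpha_{(i,j)}\, x^{i} x^{j},
\qquad
\alpha_{(i,j)} =
\begin{cases}
  w_1^{i} w_2^{j}, & i = j,\\
  w_1^{i} w_2^{j} + w_1^{j} w_2^{i}, & i \neq j .
\end{cases}
```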
From Equation (3), it can be seen that, except for the term α(d+1,:) x^(d+1) x, each of the remaining terms is nonlinearly associated with x, i.e., each corresponds to an independent implicit dimension. Therefore, the element-wise multiplication operation in a d-dimensional space yields a representation in an implicit feature space of roughly (d+2)(d+1)/2 ≈ (d/√2)^2 dimensions, which significantly increases the feature dimensionality. This computational mechanism strikes a good balance between computational complexity and model expressiveness, effectively retaining and extracting rich deep semantic information even under low-resolution input conditions, making it particularly suitable for real-time inspection tasks on or near the track.
The network structure of StarNet, built on the element-wise multiplication described above, is shown in Figure 3. It employs an efficient four-stage hierarchical feature extraction framework that progressively expands the feature dimensions by increasing the number of channels stage by stage. Specifically, the network first performs basic feature extraction on the input image through an initial convolutional layer, followed by deeper feature extraction through the four stages. Each stage is downsampled by a convolutional layer, and feature extraction is performed by the Star Block module, which consists of two depthwise convolutions and three fully connected layers. First, batch normalization is introduced after the depthwise convolution to facilitate information fusion and improve computational efficiency; the normalized result is then passed through the ReLU6 activation function to introduce nonlinearity; next, a new high-dimensional feature space is generated through the element-wise multiplication operation; the result is then passed through a fully connected layer to integrate the features; and, finally, the features are fused by a depthwise convolution at the end of the Star Block to further enhance the feature extraction capability. StarNet abandons the traditional approach of widening the network (i.e., increasing the number of channels) to enhance model expressiveness and instead realizes high-dimensional feature mapping within a low-dimensional space. This not only significantly improves the efficiency of feature extraction and the representational ability of the model but also markedly reduces the computational complexity, achieving the design goal of a lightweight model.
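To make the Star Block data flow concrete, a minimal PyTorch-style sketch is given below; the kernel sizes, expansion ratio, and residual connection are illustrative assumptions based on the StarNet design rather than the exact YOLO-LWTD configuration.

```python
# A minimal sketch of a Star Block: depthwise convolution -> two parallel 1x1
# "fully connected" branches -> element-wise multiplication -> 1x1 projection ->
# depthwise convolution, with a residual connection around the whole block.
import torch
import torch.nn as nn


class ConvBN(nn.Sequential):
    """Convolution followed by batch normalization."""
    def __init__(self, c_in, c_out, k=1, s=1, p=0, groups=1):
        super().__init__(
            nn.Conv2d(c_in, c_out, k, s, p, groups=groups, bias=False),
            nn.BatchNorm2d(c_out),
        )


class StarBlock(nn.Module):
    def __init__(self, dim, mlp_ratio=3):
        super().__init__()
        self.dwconv1 = ConvBN(dim, dim, k=7, p=3, groups=dim)    # depthwise conv + BN
        self.fc1 = ConvBN(dim, mlp_ratio * dim, k=1)             # branch 1 (1x1 "FC")
        self.fc2 = ConvBN(dim, mlp_ratio * dim, k=1)             # branch 2 (1x1 "FC")
        self.act = nn.ReLU6()
        self.fc3 = ConvBN(mlp_ratio * dim, dim, k=1)              # fuse back to dim channels
        self.dwconv2 = ConvBN(dim, dim, k=7, p=3, groups=dim)     # final depthwise conv

    def forward(self, x):
        identity = x
        x = self.dwconv1(x)
        x = self.act(self.fc1(x)) * self.fc2(x)   # the "star": element-wise multiplication
        x = self.dwconv2(self.fc3(x))
        return identity + x                        # residual connection


if __name__ == "__main__":
    feats = torch.randn(1, 32, 80, 80)
    print(StarBlock(32)(feats).shape)   # torch.Size([1, 32, 80, 80])
```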
In YOLO-LWTD, StarNet serves as the front-end component of the backbone, performing basic feature extraction on the input track images to obtain multi-scale feature information. Specifically, the first convolutional layer of StarNet corresponds to the stem module of the backbone shown in Figure 2. Subsequent convolutional layers and Star Blocks are alternately stacked to form four stage modules. Finally, SPPF and C2PSA perform additional processing on the low-resolution features to enhance the model's ability to express low-frequency (global) features in the image.
2.4. EPAN
YOLO11 adopts the path aggregation network (PANet) as the feature pyramid structure in its neck, as shown in Figure 4b. Compared with the traditional feature pyramid network (FPN) shown in Figure 4a, PANet adds a bottom-up pathway; this two-way feature fusion strategy efficiently transfers and fuses information between high-level and low-level features, so the feature maps are rich in both semantic information and precise location information. It also breaks the limitation of unidirectional information flow in the FPN and effectively alleviates the loss of shallow feature information. Meanwhile, features are distributed among layers according to scale, with smaller-scale features assigned to lower layers and larger-scale features to higher layers, thereby optimizing the utilization of multi-scale features. Thanks to this structure, YOLO11 can detect the scale, shape, and class of targets more accurately, while the model's representational ability is further enhanced by gradually increasing the depth and resolution of the feature maps.
However, in practical track detection applications, YOLO11 shows certain shortcomings. The primary reason is unsatisfactory feature fusion, which is insufficient for integrating low-level features (e.g., detailed information about the target) and high-level features (e.g., global context information), limiting the precision and recall of target detection. Therefore, for track images with variable background information and irregular image noise, PANet still has limitations in capturing detailed image features, which significantly affects the model's performance in complex scenes [27].
To address these limitations, this paper further optimizes the feature fusion mechanism based on the PANet architecture, achieving finer multi-scale feature interactions by introducing an efficient extended path aggregation network (EPAN). As shown in Figure 4c, the front end of EPAN introduces additional feature-processing modules to refine the backbone features, making them better suited to the complexity of track scenes and addressing the shortcomings of traditional multi-scale fusion networks in mining deep feature information. Meanwhile, a cross-layer skip connection, similar to a residual structure, is introduced to enhance the retention and utilization of spatial detail information, enabling the model to capture the key features of track scenes more effectively. Additionally, by optimizing the information flow path, the model carries richer effective information with fewer feature layers, achieving its lightweighting goal.
In Figure 4c, the feature maps in each row of EPAN have the same scale, but the feature maps differ in how they are processed. The specific implementation is as follows: First, three feature maps of different scales, P1, P2, and P3, are extracted from the backbone network. Next, P1, P2, and P3 are each processed through a 1 × 1 convolution module to generate P4, P5, and P6, respectively, providing a nonlinear mapping between the input channels. Subsequently, P6 is upsampled to generate P7; P7 is concatenated with P5 and then processed through the C3K2-Light module and a convolutional layer to generate P8; P8 is then concatenated with P4 and fed into the C3K2-Light module to generate P9. Finally, P9 is downsampled and concatenated with P8 and P5, then processed through the C3K2-Light module to generate P11; and P11 is downsampled and concatenated with P3 and P6, then processed through the C3K2-Light module to generate P10. Ultimately, the feature maps P9, P10, and P11 are output from EPAN and fed into the object detection head. The correspondence between the feature maps is shown in Figure 5.
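The wiring above can be made concrete with a structural sketch. The following is a minimal PyTorch transcription of the textual description, with the C3K2-Light blocks stubbed out as 1 × 1 convolutions, uniform channel widths, and an assumed upsampling of P8 before it is fused with P4 so that spatial scales align (the exact modules and alignments are defined by Figure 4c).

```python
# Structural sketch of the EPAN wiring described above (not the exact YOLO-LWTD modules).
import torch
import torch.nn as nn


class EPANSketch(nn.Module):
    def __init__(self, c=64):
        super().__init__()
        conv1x1 = lambda: nn.Conv2d(c, c, 1)
        self.p4, self.p5, self.p6 = conv1x1(), conv1x1(), conv1x1()   # 1x1 convs on P1-P3
        self.up = nn.Upsample(scale_factor=2, mode="nearest")
        self.down = nn.Conv2d(c, c, 3, stride=2, padding=1)           # strided-conv downsample
        # Stand-ins for the C3K2-Light blocks (see Section 2.5): channel-reducing 1x1 convs.
        self.f8 = nn.Conv2d(2 * c, c, 1)
        self.f9 = nn.Conv2d(2 * c, c, 1)
        self.f11 = nn.Conv2d(3 * c, c, 1)
        self.f10 = nn.Conv2d(3 * c, c, 1)

    def forward(self, p1, p2, p3):                 # backbone scales, large -> small
        p4, p5, p6 = self.p4(p1), self.p5(p2), self.p6(p3)
        p7 = self.up(p6)                           # P7: upsampled P6
        p8 = self.f8(torch.cat((p7, p5), 1))       # P8: fuse P7 with P5
        p9 = self.f9(torch.cat((self.up(p8), p4), 1))        # P9: fuse (upsampled) P8 with P4
        p11 = self.f11(torch.cat((self.down(p9), p8, p5), 1)) # P11: downsampled P9 + P8 + P5
        p10 = self.f10(torch.cat((self.down(p11), p3, p6), 1))# P10: downsampled P11 + P3 + P6
        return p9, p11, p10                        # multi-scale outputs fed to the head


if __name__ == "__main__":
    x1, x2, x3 = (torch.randn(1, 64, s, s) for s in (80, 40, 20))
    for f in EPANSketch()(x1, x2, x3):
        print(f.shape)
```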
2.5. C3K2-Light
To construct a more lightweight YOLO11 detection network and achieve an optimal balance between model efficiency and detection accuracy while preserving track detection performance, this study draws inspiration from FasterNet and improves the C3K2 module of YOLO11 using Partial Convolution (PConv [28]). PConv exploits the redundancy among feature-map channels by applying a conventional convolution to only a subset of the input channels while leaving the remaining channels untouched; this selective computation mechanism significantly reduces the number of floating-point operations (FLOPs). The structure of PConv is shown in Figure 6.
For a regular convolution whose input and output feature maps both have c channels and spatial size h × w, the FLOPs are given by Equation (4), where c is the number of channels, h and w are the height and width of the input data, and k is the size of the convolution kernel. The FLOPs of PConv are given by Equation (5), where c_p denotes the number of channels actually convolved. Since PConv performs the conventional convolution only on a consecutive block of channels (e.g., the first or last c_p channels) of the input feature map while keeping the remaining channels unchanged, this selective computation strategy makes its FLOPs significantly lower than those of conventional convolution and markedly reduces the model's overall parameter count, thereby balancing computational efficiency and feature expression capability.
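The comparison implied by Equations (4) and (5) can be summarized as follows; this is a sketch following the FasterNet formulation, and the ratio c_p = c/4 is the commonly used setting rather than a value stated in this paper.

```latex
\mathrm{FLOPs}_{\mathrm{Conv}} = h \times w \times k^{2} \times c^{2},
\qquad
\mathrm{FLOPs}_{\mathrm{PConv}} = h \times w \times k^{2} \times c_{p}^{2},
\qquad
\frac{\mathrm{FLOPs}_{\mathrm{PConv}}}{\mathrm{FLOPs}_{\mathrm{Conv}}}
  = \left(\frac{c_{p}}{c}\right)^{2}
  = \frac{1}{16}\ \ \text{when } c_{p} = \tfrac{c}{4}.
```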
We designed the C3K2-Light module, shown in Figure 7, based on PConv to optimize the balance between detection accuracy and computational efficiency. The core architecture of the module employs a PConv as its central layer, which retains the attention of a standard convolution on the central region of the receptive field while significantly reducing the computational complexity through the selective channel computation strategy. To enhance the feature characterization capability, the module cascades two convolutional layers after the PConv layer, extending the effective receptive field, and then fuses the features of the PConv path and the regular convolution path through a residual connection, ensuring feature diversity while facilitating fast inference. In particular, the module places batch normalization after the intermediate convolutional layer and follows it with a ReLU activation function, which accelerates convergence and improves inference efficiency.
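A minimal PyTorch sketch of a PConv-based light bottleneck in this spirit is shown below; the partial ratio (1/4), kernel sizes (3 × 3 partial convolution, 1 × 1 pointwise convolutions), and expansion factor are illustrative assumptions rather than the exact C3K2-Light settings.

```python
# Sketch of a PConv-based light bottleneck: PConv -> 1x1 conv + BN + ReLU -> 1x1 conv,
# with a residual connection around the transformed path.
import torch
import torch.nn as nn


class PConv(nn.Module):
    """Partial convolution: a regular conv applied to the first c_p channels only."""
    def __init__(self, dim, n_div=4, k=3):
        super().__init__()
        self.dim_conv = dim // n_div            # channels that are convolved (c_p)
        self.dim_keep = dim - self.dim_conv     # channels passed through untouched
        self.conv = nn.Conv2d(self.dim_conv, self.dim_conv, k, 1, k // 2, bias=False)

    def forward(self, x):
        x1, x2 = torch.split(x, [self.dim_conv, self.dim_keep], dim=1)
        return torch.cat((self.conv(x1), x2), dim=1)


class LightBottleneck(nn.Module):
    """PConv followed by two pointwise convolutions, with a residual connection."""
    def __init__(self, dim, expansion=2):
        super().__init__()
        hidden = dim * expansion
        self.pconv = PConv(dim)
        self.pw = nn.Sequential(
            nn.Conv2d(dim, hidden, 1, bias=False),
            nn.BatchNorm2d(hidden),
            nn.ReLU(inplace=True),
            nn.Conv2d(hidden, dim, 1, bias=False),
        )

    def forward(self, x):
        return x + self.pw(self.pconv(x))        # residual around the PConv + pointwise path


if __name__ == "__main__":
    x = torch.randn(1, 64, 40, 40)
    print(LightBottleneck(64)(x).shape)          # torch.Size([1, 64, 40, 40])
```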
2.6. Detect-LADH
The decoupled two-branch detection head structure adopted by YOLO11 improves the task specificity by independently processing classification and regression tasks. However, this architectural design has three significant drawbacks: first, the multiple convolutional operations in the two branches significantly increase the number of model parameters and computational complexity, making it difficult to achieve efficient deployment on low-computing-power devices or meet real-time detection requirements; second, since the feature processing of the classification and regression branches is completely isolated, the network cannot effectively utilize the complementary high-level semantic features extracted by the backbone network, thereby limiting the model’s representational capability and detection performance in track inspection tasks; additionally, the original decoupled head uses the same convolutional layer at the top of the network for both regression and classification, but these tasks have different focuses, leading to potential conflicts during the detection process.
We introduce a lightweight asymmetric detection head (LADH [29]) to address the above issues; its structure is illustrated in Figure 8. The architecture follows a task-driven design philosophy and uses a three-branch separated network to handle the classification, regression, and IoU prediction tasks independently. LADH consists of two core components: the Asymmetric Head and the Dual Head. The Asymmetric Head applies asymmetric multi-level compression, compressing features of different categories to different degrees so as to adapt to variations in target complexity. The Dual Head integrates the multi-scale feature outputs (P3–P5) from the Asymmetric Head and generates the final detection results.
LADH-Head uses depthwise separable convolution (DWConv) instead of standard convolution to avoid the performance bottlenecks that shared feature layers may cause. Depthwise separable convolution decomposes a traditional convolution into two independent operations: a depthwise convolution, which applies a separate spatial filter to each channel, and a pointwise convolution, which uses a 1 × 1 kernel to fuse cross-channel information while adjusting only the number of channels and preserving the spatial dimensions of the feature map. This design further reduces the model complexity. In the detection head, the introduction of depthwise separable convolution decouples the classification task from the bounding-box regression task, effectively avoiding the task interference caused by differences in positive-sample matching and loss design. Therefore, replacing the original decoupled head of YOLO11 with LADH-Head keeps the model lightweight while improving the detection accuracy and computational efficiency through asymmetric feature processing, making it particularly suitable for track detection scenarios.
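As a reference, a generic depthwise separable convolution block of the kind used in such heads is sketched below; the channel counts, kernel size, and SiLU activation are illustrative assumptions rather than the exact LADH configuration.

```python
# Depthwise separable convolution: per-channel depthwise conv followed by a 1x1 pointwise conv.
import torch
import torch.nn as nn


class DWSeparableConv(nn.Module):
    def __init__(self, c_in, c_out, k=3):
        super().__init__()
        # Depthwise: one k x k filter per input channel (groups = c_in).
        self.depthwise = nn.Conv2d(c_in, c_in, k, padding=k // 2, groups=c_in, bias=False)
        # Pointwise: 1 x 1 convolution that fuses information across channels.
        self.pointwise = nn.Conv2d(c_in, c_out, 1, bias=False)
        self.bn = nn.BatchNorm2d(c_out)
        self.act = nn.SiLU()

    def forward(self, x):
        return self.act(self.bn(self.pointwise(self.depthwise(x))))


if __name__ == "__main__":
    x = torch.randn(1, 128, 40, 40)
    print(DWSeparableConv(128, 64)(x).shape)   # torch.Size([1, 64, 40, 40])
```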
3. Experiments and Results Analysis
Current publicly available datasets focused on rail tracks primarily emphasize local features such as the track surface texture and localized defects, severely neglecting the holistic perspective essential for railway inspection tasks. This emphasis on local features results in significant discrepancies between existing datasets and real-world inspection scenarios, making it difficult to fully reflect the actual railway operational environment and thereby limiting the development of object detection algorithms tailored to this task. To address this issue, this study systematically constructed a dataset specifically tailored to the railway inspection context, thereby overcoming the limitations of existing datasets. This paper uses a self-built dataset to train and test the proposed model.
3.1. Experimental Datasets
Track image acquisition experiments were performed in a ballasted track field environment; the experimental site was a standard ballasted rail section. Fasteners and sleepers are crucial components of the track, with fasteners serving as the primary connectors between the rails and the sleepers. Owing to their distinctive shapes, this paper identifies and detects them by extracting fastener and sleeper features.
Additionally, to enhance the practical usability of the dataset, this study mounted the image acquisition system on the central section of the track inspection train, using a 45° installation angle to precisely target the bottom edge of the track for image capture. This design minimizes the inclusion of external background, thereby excluding, as far as possible, uncontrollable factors such as extreme weather and sudden changes in lighting intensity that could affect the image data, and ensures that the captured track images effectively highlight the key visual features of the track itself. The camera's installation position and framing angle are shown in Figure 9.
However, despite the rigorous image acquisition process, it is still difficult to fully account for more complex real-world track detection environments, for example, dynamic changes in lighting conditions, random noise, fog caused by weather, and varying degrees of occlusion of track structural components by foreign objects. Therefore, to enhance the completeness of the dataset, enable the model to learn more features, and improve the generalization ability of the deep learning model in complex scenarios, we employed image processing techniques to augment the original track image dataset, striving to approximate and cover the scenarios that may be encountered in practice.
Specifically, the five targeted image transformation methods shown in Figure 10 were implemented, and each augmentation method was parametrically controlled to ensure that the generated samples maintained the semantic authenticity of the original images while effectively extending the coverage of the data distribution. The methods are described below, and a minimal implementation sketch follows the list:
Geometric transformation: the original image is mirrored horizontally.
Luminance adjustment: the image brightness is randomly adjusted by a linear transformation within a fixed range to simulate track scenes under different lighting conditions.
Noise injection: Gaussian noise is added to the image to improve the robustness of the model to sensor noise.
Random rotation: the image is rotated by a random angle to increase the spatial diversity of the samples.
Atmospheric interference simulation: based on the atmospheric scattering model [30], fog of different densities is added to simulate imaging characteristics under rainy and foggy weather conditions.
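A minimal sketch of these five augmentations, using OpenCV and NumPy, is given below; all parameter ranges (brightness factors, noise sigma, rotation angle, fog strength) and the input file name are illustrative assumptions, not the values used to build the dataset.

```python
# Illustrative implementations of the five augmentation methods listed above.
import cv2
import numpy as np


def horizontal_flip(img):
    return cv2.flip(img, 1)                          # mirror flip around the vertical axis

def adjust_brightness(img, low=0.6, high=1.4):
    factor = np.random.uniform(low, high)            # linear brightness scaling
    return np.clip(img.astype(np.float32) * factor, 0, 255).astype(np.uint8)

def add_gaussian_noise(img, sigma=10.0):
    noise = np.random.normal(0.0, sigma, img.shape).astype(np.float32)
    return np.clip(img.astype(np.float32) + noise, 0, 255).astype(np.uint8)

def random_rotate(img, max_angle=10.0):
    h, w = img.shape[:2]
    angle = np.random.uniform(-max_angle, max_angle)
    m = cv2.getRotationMatrix2D((w / 2, h / 2), angle, 1.0)
    return cv2.warpAffine(img, m, (w, h), borderMode=cv2.BORDER_REFLECT)

def add_fog(img, airlight=220.0, beta=1.0):
    # Simplified atmospheric scattering model I = J * t + A * (1 - t),
    # with a spatially uniform transmission t = exp(-beta) as a crude depth proxy.
    t = np.exp(-beta)
    return np.clip(img.astype(np.float32) * t + airlight * (1 - t), 0, 255).astype(np.uint8)


if __name__ == "__main__":
    image = cv2.imread("track_sample.jpg")           # hypothetical input image
    for fn in (horizontal_flip, adjust_brightness, add_gaussian_noise, random_rotate, add_fog):
        augmented = fn(image)
```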
During the data collection experiments, a total of 566 valid track images were acquired at a fixed image resolution. After dataset augmentation, the LabelImg software was used to manually label the target areas in the track images, draw bounding boxes around the track features, and assign categories. The annotated dataset consisted of 3396 images, containing 3396 sleeper instances and 3396 fastener instances. Finally, all labeled images were divided into a training set (70%, 2377 images), a validation set (10%, 339 images), and a test set (20%, 680 images).
3.2. Experimental Environment and Parameter Configuration
The experiments were conducted in an environment with Windows 11, CUDA 12.6, Python 3.12.8, and PyTorch 2.5.1. The hardware configuration and training hyperparameters are presented in Table 2. During training, all input images were resized to a uniform size, and an L2 regularization term was used to penalize large weights and prevent overfitting.
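For reference, a typical Ultralytics training call takes the following form; the dataset YAML path and all hyperparameter values below are placeholders, not the settings listed in Table 2.

```python
# A hedged sketch of fine-tuning YOLO11n with the Ultralytics API.
from ultralytics import YOLO

model = YOLO("yolo11n.pt")            # base model that YOLO-LWTD modifies
model.train(
    data="track_dataset.yaml",        # hypothetical dataset config (train/val/test splits)
    epochs=200,
    imgsz=640,
    batch=16,
    weight_decay=0.0005,              # L2 regularization term on the weights
)
```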
3.3. Evaluation Indicators
To validate the performance of the proposed algorithm, we systematically evaluated the model using the following metrics: recall (R), precision (P), mean average precision (mAP), floating-point operations (FLOPs, denoted as F), parameter count, and frames per second (FPS), among others. Formulas for some of these metrics are provided below:
In Equations (6) and (7), TP stands for true positive samples, FP stands for false positive samples, and FN stands for false negative samples; in Equation (8), AP_i is the average precision of the i-th category; in Equations (9) and (10), K stands for the size of the convolutional kernel, C_in is the number of channels of the input feature layer, C_out is the number of channels of the output feature layer, H_out is the height of the output feature layer, and W_out is the width of the output feature layer, while B stands for the bias term; in Equation (11), T is the time required for the model to infer a single sample (in s).
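For completeness, standard forms of these metrics, consistent with the symbol definitions above, are sketched below; the per-layer FLOPs and parameter expressions are the commonly used forms (summed over all convolutional layers) rather than a verbatim reproduction of Equations (6)–(11).

```latex
P = \frac{TP}{TP + FP}, \qquad
R = \frac{TP}{TP + FN}, \qquad
mAP = \frac{1}{N}\sum_{i=1}^{N} AP_i, \qquad
FPS = \frac{1}{T},
\qquad
\mathrm{FLOPs} \approx \sum K^{2}\, C_{in}\, C_{out}\, H_{out}\, W_{out}, \qquad
\mathrm{Params} \approx \sum \left(K^{2}\, C_{in} + 1\right) C_{out}.
```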
3.4. Convergence Test
We evaluate the convergence performance of the model by monitoring the trend of the loss function and optimizing the training strategy accordingly. Specifically, we focus on the change process of the following three types of loss functions:
Bounding box loss, which is used to assess the regression effect of the target detection box;
Distribution focal loss, which is used to optimize the distributional characteristics of the bounding box prediction;
Classification loss, which is used to measure the classification performance of the model.
The change curves of the above loss functions during the training process are shown in Figure 11.
Throughout training, the total loss of the model decreased steadily as the number of training epochs increased, and the model displayed neither obvious overfitting nor underfitting. When the number of training epochs reached 175, the loss curves gradually stabilized and entered a state of convergence, after which the subsequent performance evaluation and analysis were carried out.
5. Summary and Future Work
5.1. Summary
In this study, the network structure of YOLO11n is enhanced by introducing StarNet into the backbone network, resulting in a smaller model and a significantly higher detection speed. The EPAN structure is introduced in the neck to improve the feature fusion effectiveness of the model, and the lightweight C3K2-Light modules further reduce its complexity. Finally, the Detect-LADH head further improves the detection performance of the model.
The results show that the proposed model improves the detection of track features: its precision, recall, and mean average precision are 0.5, 2.0, and 0.8 percentage points higher than those of the original model, respectively; the inference speed reaches 163 frames/s, 38.1% higher than the original model; and the model size is only 1.5 MB, a 71.1% reduction. The improved model thus achieves a more balanced trade-off between detection accuracy and lightweight design, with a smaller model size and a significantly higher detection speed, making it suitable for multi-threaded, real-time track detection tasks and deployment on mobile devices.
5.2. Future Work
This study conducted experimental validation based on a diverse dataset of track images. However, it is worth noting that the current dataset still has limitations in covering all possible track scenarios. Due to the lack of a widely recognized public benchmark dataset specifically designed for track feature detection within the industry, this research faces certain challenges regarding the comprehensiveness and universality of the dataset. This limitation also represents one of the key issues that future research needs to address.
Looking ahead, our work will focus on the following directions: (1) continuously expanding the scale and diversity of the datasets while building an open-access benchmark dataset for track features to foster collaborative development in this field; (2) applying more rigorous statistical methods, such as cross-validation, to further validate the model's generalization capability on larger datasets; (3) addressing the highly heterogeneous challenges of railway inspection operations, such as dynamic environmental changes, extreme weather, and complex terrain, by prioritizing research on robust perception and intelligent recognition technologies for complex, multi-variable scenarios (e.g., low-light conditions, rain/fog interference, track debris, and high-speed moving perspectives); and (4) exploring the deep integration and collaborative analysis of multi-source heterogeneous sensing methods (including high-resolution machine vision, 3D laser scanning, multispectral/infrared imaging, ground-penetrating radar, acoustic detection, and inertial measurement units) to build next-generation intelligent track inspection systems characterized by high precision, real-time capability, and reliability.