Abstract
Real-time, precise monitoring of the number and distribution of indoor personnel is crucial for building safety management, operational optimization, and personnel scheduling. However, narrow entrances and high-density passageways often lead to missed detections, false positives, and tracking failures in pedestrian detection, thereby reducing cross-line counting accuracy. Additionally, edge devices deployed in practical scenarios frequently process multiple video streams simultaneously, resulting in computational resource constraints. To address these challenges, this paper proposes a lightweight, enhanced multi-object pedestrian tracking and counting method tailored for indoor scenarios by optimizing deep learning models. Firstly, modular optimizations are applied to the YOLOv8n model to construct a more lightweight detector, RL_YOLOv8, reducing computational overhead while maintaining accuracy. Secondly, correlated pedestrian auxiliary prediction and pedestrian position change constraints are employed to mitigate ID switching, tracking interruptions, and trajectory jumps in dense scenes. Finally, a buffer zone auxiliary counting strategy is designed to further reduce missed detections of pedestrians crossing lines. Experimental results demonstrate that compared to the original detection-and-tracking-based line-crossing counting method, the improved approach effectively enhances counting accuracy and real-time performance, better meeting the requirements of practical intelligent security and crowd monitoring systems.
1. Introduction
Pedestrian cross-line counting aims to estimate the number of pedestrians passing a designated counting line within a given time interval [1]. It has been widely used in intelligent surveillance, shopping mall footfall analysis, and traffic management, playing an important role in public-safety assurance and operational optimization [2]. Existing cross-line counting methods can generally be categorized into two groups. The first group combines density-map estimation [3] with optical flow [4] or velocity-field information for flux-based flow estimation, whereas the second group performs individual counting based on detection and tracking [5]. The former computes crossing counts by integrating the crowd density over the flow region intersecting the counting line and is well suited for open areas or macro-level flow statistics. However, in typical indoor scenarios such as teaching building corridors and classroom entrances, camera viewpoints are often constrained, pedestrian trajectories are complex, and short-term back-and-forth motion occurs frequently. These characteristics make density-map-and-flow-based methods difficult to reliably associate with fine-grained line-crossing events, thereby limiting their applicability to precise indoor counting tasks [6].
Although existing detection-and-tracking frameworks can provide individual trajectories, they still face severe challenges in indoor environments. First, double-line (gate-based) crossing rules [7] can suppress duplicate counts caused by short-term oscillations, but in narrow indoor spaces, pedestrians may not fully traverse both lines, which can lead to missed detections and counting bias [8]. Second, heavy occlusions under high crowd density significantly degrade bounding-box stability; jitter in the lower-body region easily causes trajectory interruptions and crossing misjudgments. Some studies use the head region as a more stable reference point [9,10], yet in teaching building scenes, head motion is not always consistent with full-body crossing behavior, and near-wall regions may cause the head to miss the counting line, resulting in unreliable judgments.
To address these issues, this paper proposes a lightweight detection-and-tracking-based counting method for indoor scenarios. We introduce modular improvements to YOLOv8n [11] to build an efficient detector referred to as RL_YOLOv8, and develop an occlusion-robust trajectory refinement scheme that combines bottom-boundary suppression and correction with related-pedestrian-assisted motion prediction. This design mitigates occlusion-induced bounding-box jitter and tracking drift, thereby improving trajectory stability and the reliability of cross-line judgment. The experimental results demonstrate that the proposed method achieves notable gains in both counting accuracy and processing efficiency in dense indoor environments.
2. System Framework and Optimization Methods
2.1. System Framework and Process
The fundamental framework of the indoor pedestrian cross-line counting system based on video surveillance comprises three primary components: pedestrian detection, pedestrian tracking, and cross-line counting [12]. Initially, video frames are extracted from the surveillance footage, and the YOLO [13] detector is employed to identify pedestrian targets. Next, the DeepSort [14] tracker processes the pedestrian position and boundary information, performing ID association across consecutive frames to establish stable trajectories. Subsequently, the two nearest adjacent points within the pedestrian trajectory are combined to create motion vectors, which are then used in vector operations with the endpoints of the counting line to ascertain whether the pedestrian has crossed it. The direction of pedestrian movement is determined using the vector cross product, facilitating the calculation of bidirectional pedestrian flow. To improve the accuracy and efficiency of the counting system, this paper proposes the following optimizations for the system model and process:
- (1)
- Make lightweight improvements to the YOLOv8n model, reduce its size, and lower the consumption of computing resources while ensuring its detection performance.
- (2)
- Pedestrian trajectories are smoothed by correcting pedestrian position, which reduces the occurrence of trajectory jumps and enhances the stability of the trajectories.
- (3)
- The motion trends of related pedestrians are utilized to correct the predicted trajectory of the occluded pedestrians, significantly enhancing the accuracy of predicting the positions of occluded pedestrians.
- (4)
- An auxiliary counting zone is designed to reduce counting mistakes that are prone to occur in narrow areas, which will enhance the accuracy of counting pedestrians.
The overall optimization process is illustrated in Figure 1, where the yellow sections are improved. In the detecting phase, a more lightweight detector RL_YOLOv8 is constructed to reduce computational overhead while maintaining accuracy. In the tracking phase, the correction and prediction of the pedestrian position is proposed to make the predicted trajectory more closely align with the actual trajectory, thereby alleviating interruptions caused by occlusion.
Figure 1.
The overall flowchart of the optimized pedestrian counting system.
2.2. Improvement of Pedestrian Target Detection
Pedestrian target detection serves as the foundation for cross-line counting. Traditional detection methods, which depend on manual features and sliding windows, face challenges such as high computational cost and low efficiency, making it difficult to satisfy real-time requirements. With advancements in deep learning, object detection techniques utilizing convolutional neural networks (CNNs) have gradually gained prominence, and are typically categorized into two types: two-stage and single-stage methods [15]. Two-stage methods, such as Faster R-CNN, initially generate candidate regions for subsequent classification and regression. While these methods achieve high detection accuracy, their inference speed is comparatively low, which complicates deployment in scenarios that demand high real-time performance [16]. In contrast, single-stage methods, including SSD, RetinaNet, and the YOLO series, accomplish target localization and classification in a single forward pass. These methods are characterized by their compact structure and high speed, making them the preferred choice for real-time target detection. Among these, the YOLO series stands out due to its end-to-end architecture and exceptional inference efficiency, making it a widely adopted choice for real-time object detection.
Despite the continuous advancements in detection speed and accuracy of the YOLO series models, optimization for specific application scenarios remains essential. The teaching building environment presents two primary challenges: First, the large number of cameras produces a substantial volume of data, necessitating a lightweight model to improve throughput and deployment efficiency. Second, as illustrated in Figure 2, the scale variations of pedestrians in indoor settings are considerable due to varying camera–target distances and strong perspective effects [17], requiring the model to possess robust multiscale detection capabilities [18]. This paper proposes an optimization model in response to the requirements above.
Figure 2.
The corridor scene of a teaching building. In the scene, the detection bounding boxes of four pedestrians at different positions are 126 × 309, 119 × 328, 46 × 124, and 32 × 82, respectively.
Firstly, the RGCSPELAN (Residual Ghost Channel Spatial Pyramid Efficient Layer Attention Network) [19] modules are used to replace some C2f modules in the YOLOv8 structure to reduce the computational load while preserving detection accuracy. By integrating the lightweight design philosophy of GhostNet and the efficient feature aggregation strategy of the ELAN architecture, RGCSPELAN effectively lowers computational complexity and improves inference efficiency, which is particularly beneficial for deployment on resource-constrained devices. Secondly, the detection head is replaced with the LSCD (Light-weight Spatial-channel Dual-attention) module to minimize computational redundancy and enhance the detection capability of multiscale targets. Consequently, the proposed RL_YOLOv8 model (Figure 3) achieves substantial reductions in parameters and computational complexity while maintaining comparable detection accuracy, making it well suited for practical indoor applications with constrained computing resources.
Figure 3.
The structure of the RL_YOLOv8 model.
2.2.1. Lightweight Module Design: RGCSPELAN
On resource-constrained edge devices, especially in scenarios where multiple video streams need to be processed simultaneously, enhancing the model’s lightweight characteristics becomes crucial [20]. While YOLOv8n achieves high accuracy in handling complex environments and occluded pedestrians, its computational load is relatively heavy. To reduce the computational load of YOLOv8n, we introduced the RGCSPELAN module. RGCSPELAN, by integrating the advantages of CSPNet and ELAN (Efficient Layer Aggregation Networks), reduces computational complexity and optimizes gradient flow and feature aggregation, significantly improving inference speed and ensuring efficient operation on resource-constrained devices.
RepConv (Re- parameterization Convolution) [21] in the RGCSPELAN module enhances feature expression in occluded and complex target scenarios while improving inference efficiency. The RepConv structure, illustrated in Figure 4, employs convolution re-parameterization for optimizing computational efficiency. During the training phase, three branches, 3 × 3 Conv, 1 × 1 Conv, and BatchNorm (BN), are utilized for parallel computing to optimize feature extraction. The output feature maps from these branches are then summed element-wise and passed through an activation function layer. In the inference stage, the multiple branches are fused into a single 3 × 3 convolution using the convolution re-parameterization method. This approach not only preserves the original feature capabilities but also reduces the model’s computational load and memory consumption.
Figure 4.
The structure of RepConv.
2.2.2. LSCD Detection Head
The YOLOv8n architecture incorporates three detection heads, each designed to manage targets of varying scales. However, this configuration imposes a substantial computational burden during deployment. Each detection head features an independent convolutional layer and normalization module, leading to significant parameter overhead and redundant computations, which restricts the model’s practicality on edge devices [22]. To solve the problem, our approach adopts a lightweight shared convolutional detection LSCD head, illustrated in Figure 5.
Figure 5.
The structure of the LSCD module.
The LSCD head is designed to balance lightweight architecture and inference efficiency. A shared convolutional mechanism, where GroupNorm (GN) is incorporated to replace Batch Normalization (BN), is employed in it. Existing studies have demonstrated that Group Normalization (GN) can enhance the stability of gradient flow in the regression and classification branches within object detection frameworks, thereby benefiting overall performance [23]. The structure utilizes 1 × 1 convolution to process input features from the P3, P4, and P5 levels, followed by two 3 × 3 shared convolutions to fuse these features. Finally, a learnable scale-adaptive module, Scale Layer, is used to dynamically adjust the scale distribution of the predicted boxes, alleviating the problem of positioning deviation caused by excessive multiscale differences among pedestrians in monitoring and ensuring accurate positioning of pedestrians [24].
2.3. Improvement of Pedestrian Target Tracking and Counting Model
2.3.1. Problems Existing in DeepSort
Multi-object tracking is a crucial component of the cross-line counting system. The DeepSort algorithm combines spatio-temporal motion predictions from Kalman filtering with appearance features extracted by a deep convolutional network to ensure stable tracking of pedestrians. The process involves several steps: First, the Kalman filter utilizes historical motion state information to predict pedestrian positions in the current frame. Simultaneously, an appearance-embedding network extracts appearance features from the detection bounding box. Finally, target matching and trajectory updates are accomplished by assessing the similarity between the weighted fusion of motion trajectory predictions and the appearance features.
In the motion prediction stage, the DeepSort tracker employs an 8-dimensional state space to characterize the motion state of the target. Specifically, denote the center position of the bounding box, represents the aspect ratio of the target detection bounding box, indicates the height of the target bounding box, and reflect the variation values of the aforementioned four parameters. Utilizing the Kalman filtering algorithm, the DeepSort predicts the target’s position in the subsequent frame based on this state vector. When the target is unobstructed, the DeepSort tracker can continuously detect, update, and correct errors by incorporating the latest positional information of pedestrians, thereby maintaining stable tracking of the target. However, when the target becomes occluded, the system must rely solely on Kalman filtering to estimate the position of pedestrians. In this scenario, the linear prediction capabilities of the Kalman filter reveal significant limitations. Pedestrians exhibit nonlinear motion, characterized by deceleration, acceleration, turning, and other dynamic behaviors. Despite this, the Kalman filter continues to perform linear extrapolation based on the speed and direction prior to occlusion, failing to account for changes in the actual motion state. Consequently, this oversight results in an increasing deviation between the predicted and actual positions. Furthermore, discrepancies in the predicted positions of pedestrians contribute to variations in their apparent scale, which exacerbates the challenges associated with target re-identification and may lead to issues such as trajectory mismatch and identity loss [25].
2.3.2. Correction of Pedestrian Position
In this paper, we adopt the center point of the bottom boundary of the pedestrian target bounding box as the designated position of the pedestrian. However, in vision-based pedestrian detection and tracking tasks, occlusion frequently leads to jitter in the boundaries of the detection bounding boxes, with the bottom boundary exhibiting the most significant variability. This phenomenon occurs because the bottom portion of the pedestrian is typically the first to be obscured, resulting in the absence of bottom features, which causes the bottom boundary of the target bounding box to ascend. Once the occlusion ceases, the bottom boundary rapidly retracts, leading to sharp fluctuations in pedestrian positioning. The erratic movement of pedestrian trajectories directly impacts cross-line detection; specifically, the abrupt shifts in the bottom boundary can cause the trajectory to instantaneously jump over the counting line. Additionally, the recurrent fluctuations at the bottom boundary may result in pedestrians being counted multiple times when oscillations occur near the counting line.
In contrast, the head area is typically the last to become occluded, resulting in more stable temporal variation of the top boundary [26]. Consequently, the smooth motion of the top boundary can serve as a reference to constrain and correct the sudden changes of the bottom boundary. The correction of the bottom boundary will reduce misjudgments in trajectory cross-line detection. Firstly, the vertical movements of the top and bottom boundaries of the pedestrian target bounding box are calculated in two consecutive frames. Secondly, the jitter deviation of the bottom boundary can be corrected by utilizing the stable motion changes of the top boundary. Assuming that and denote the vertical displacements of the top and bottom boundaries between frames and respectively, they are computed as follows:
where , , , and denote the vertical positions of the pedestrian of the top and bottom boundaries of the target bounding box in frames and , respectively. We set a tolerance threshold If the difference between and exceeds , the change will be regarded as abnormal. Then, we use the top-boundary displacement as a reference to correct it. The corrected bottom-boundary displacement is defined as follows:
When detecting pedestrians, the lack of pedestrian bottom details will cause the lower boundary of the detection frame to rise, and its height will not be lower than the real bottom boundary of the pedestrian. Consequently, it is essential to ensure that the predicted bottom-boundary value is not higher than the detected bottom boundary. However, if this substitution is carried out for a long time, the predicted lower boundary will gradually accumulate deviation and deviate from the true bottom position. To suppress this drift, we introduce a stepwise alignment mechanism where the predicted bottom boundary of the target detection box is gradually pulled to the bottom boundary of the target detection box detected by YOLO in a certain step . The parameter represents the number of pixels that move from the predicted bottom boundary to the lower boundary of the detection box in each frame. Finally, the vertical position of the predicted bottom boundary is
By smoothing the bottom boundary and incorporating a stepwise alignment mechanism to correct drift, the proposed method yields trajectories that are smoother and more closely aligned with the real motion path. Figure 6 illustrates a typical example. As shown in the figure, the smoothed trajectory is much more stable, indicating that the bottom boundary smoothing effectively reduces the errors caused by occlusion, ensuring a stable trajectory. The improvement of trajectory smoothness and stability can increase the accuracy and robustness of cross-line counting.
Figure 6.
Comparison of original and smoothed pedestrian trajectories. Subplots (a–f) present representative frames for pedestrian ID 10, comparing the original bounding boxes (green) with the smoothed bounding boxes (yellow) across the occlusion period, including pre-occlusion, during-occlusion, and post-occlusion frames. Subplot (g) shows the corresponding trajectories, where the original trajectory is plotted in green and the smoothed trajectory in yellow.
2.3.3. Improvement of the Target Prediction Position
In educational environments such as classrooms and corridors, peak periods during class transitions result in increased pedestrian density and significant occlusions, which present considerable challenges for target detection and cross-line counting. Furthermore, specific social relationships, including friendships and class affiliations, create an attraction among individuals, leading them to maintain proximity throughout their movement. In high-pedestrian-density environments, pedestrian flow is further constrained by physical space, resulting in a phenomenon known as “flow alignment,” where movement direction remains consistent and speeds synchronize [27]. The combined effects of social relationships and flow alignment can lead to prolonged blockage of some pedestrians.
When the target is occluded and cannot be tracked temporarily, the traditional DeepSort tracker relies on the Kalman filter to estimate the target’s position. This reliance can lead to position drift, which is detrimental to counting accuracy [28]. To address this issue, this paper introduces using the surrounding crowd to assist in predicting the positional changes of the occluded target.
Assuming the frame where the target disappears is the current frame, we use the first frame and the th frame before the current frame to calculate the related pedestrians whose movements are consistent with that of the occluded pedestrian. The motion trends of related pedestrians are then utilized to correct the predicted trajectory of the occluded targets. The approach significantly enhances the accuracy of predicting the positions of occluded targets, thereby facilitating more-reliable cross-line counting.
Firstly, select the pedestrian whose Manhattan distance from the occluded target in the previous frame of the current frame is less than half the height of the bounding box of the target. Let and denote the distances of the th pedestrian from the lost target in the and directions, respectively. The proximity constraint condition is then expressed as
Secondly, we perform direction consistency filtering based on the pedestrian motion directions from the th frame to the first frame. Specifically, the displacement vectors of the target and the surrounding pedestrians are calculated by the two frames, and the direction angles are recorded as and , respectively. Individuals whose angle between the motion direction and the target direction is within 30 degrees are selected.
Thirdly, for the candidate pedestrians that satisfy Formula (6), the relative difference in moving distance between them and the target pedestrian between the two frames is calculated. When the relative difference meets the following conditions, these candidate pedestrians are identified as the final relevant pedestrians.
where denotes the total movement distance of the target pedestrian over the course of frames, while denotes the total movement distance of the th candidate pedestrian during the same interval. If the absolute value of the difference between these two distances is less than 0.5 times the total movement of the pedestrian across the frames, the candidate pedestrian is deemed related to the target pedestrian and is classified as a related pedestrian.
Finally, the average position change of these related pedestrians is utilized to estimate the predicted position change of the occluded target as follows.
where n denotes the number of related pedestrians, denote the position change of the ith related pedestrian.
In Figure 7a, the pedestrian with ID 2 is obstructed at the classroom door, resulting in detection failure. By considering proximity, directional consistency, and motion similarity, the model identifies pedestrians with IDs 3 and 4 as related. It then uses their average displacement instead of the predicted displacement of the pedestrian with ID 2 during the occluded frames, effectively preserving trajectory continuity and improving cross-line counting accuracy.
Figure 7.
Detection of occluded pedestrians based on pedestrian correlation: (a) shows the scene of occluded pedestrians in surveillance, with five pedestrians present, two of whom are occluded; (b) provides a schematic diagram of relevant pedestrian detection, illustrating how the model determines related pedestrians by utilizing the movement characteristics of surrounding pedestrians.
2.3.4. Counting with Auxiliary Zone
Cross-line counting is typically determined using the cross product to ascertain the movement direction of pedestrians across the counting line. Before deployment, it is essential to establish counting lines across various scenarios and to define the entry and exit directions. As illustrated in Figure 8, let the endpoints of the counting line be designated as C and D, forming a vector . When a pedestrian crosses the counting line from left to right, the action is recorded as exiting; conversely, crossing from right to left is recorded as entering. Let the pedestrian’s position in the previous frame be denoted as point A and the current frame’s position as point B. The coordinates of these two points create a vector . A crossing event is first detected by checking whether A and B lie on different sides of the counting line. The sign of the cross product of vectors and is then employed to ascertain whether the pedestrian is entering or exiting.
Figure 8.
The diagram of pedestrian crossing line counting. Here, the red line CD is a counting line and the black line AB represents the line connecting the positions of a pedestrian in two consecutive frames.
In actual monitoring scenarios, corridor entrances and lobbies serve as critical nodes for pedestrian flow and are significant areas where individuals tend to congregate densely, leading to frequent occlusions [29]. For pedestrians exiting a room and entering the monitored area, their initial positions are typically near the counting line. However, they are often impeded by individuals ahead, resulting in cases where the initial tracking position has already crossed the counting line, thereby causing counting omissions due to delayed detection or late track initialization. The small image on the left in Figure 9 depicts the scene at the classroom door as captured in the video. It illustrates that the two pedestrians in front partially obstruct the view of the three individuals exiting the classroom from behind. The small image on the right displays the positions of pedestrians within the two-dimensional building floor plan. This phenomenon indicates that in dense-crowd or occlusion scenarios, cross-line counting that relies solely on the positional data of trackers may encounter issues with counting omissions. In contrast, the buffer counting strategy effectively addresses this limitation by compensating for late-initialization cases near the counting line.
Figure 9.
Illustration of missed counts at a classroom doorway. (a) Surveillance scene where pedestrians exiting the classroom are partially occluded; due to delayed detection, a pedestrian may already be beyond the counting line when first detected. (b) Corresponding 2D floor-plan schematic.
To solve the problem, an auxiliary counting zone is established at one end of the counting line, as illustrated in Figure 10. The auxiliary counting zone is positioned at various exit locations, including corridor entrances, lobbies, and room doors, and extends outward along the normal direction of the counting line. The buffer width is adaptively set to one-third of the pedestrian’s bounding box height, accounting for perspective-induced scale variations. When a new ID first appears in the buffer, the system records the positional changes of the target over the subsequent 10 frames. If the movement direction aligns with the corresponding predefined crossing direction, it is classified as a valid cross-line event. The approach effectively identifies the appearance position and local movement direction of pedestrians across different scenarios, thereby reducing missed counts caused by delayed detection and improving counting robustness.
Figure 10.
Auxiliary counting zone for cross-line counting: (a) Surveillance scene with the auxiliary counting zone placed near the counting line; (b) Schematic illustration of the auxiliary-zone-based counting principle, where short-term motion cues are used to validate crossing events for newly appearing targets.
3. Experiment and Analysis
3.1. Experimental Dataset
To assess the effectiveness and generalization of the proposed RL_YOLOv8, we produced a campus indoor pedestrian dataset by extracting frames from surveillance video captured in a teaching building. The dataset covers typical conditions, including daytime and nighttime scenes, peak periods during class transitions and breaks, and varying weather conditions such as sunny and rainy days, thereby reflecting real-world challenges such as large pedestrian density fluctuations and complex illumination. The dataset comprises a total of 7604 images, which is partitioned into training, validation and test sets in a ratio of 7:2:1.
To further test the generalization ability of the model outside of the campus scenario, we also conducted experiments on two public pedestrian benchmarks, the WiderPerson and CityPersons datasets. On WiderPerson, we use the labels of full-body, partial-body, and riders as pedestrian instances.
To assess the applicability and robustness of the proposed cross-line pedestrian counting model in real-world scenarios, an experiment was conducted in a typical teaching building environment. The experimental area included a classroom and its adjacent corridor, with dense pedestrian flow during peak commuting hours. We set up three detection lines Line 1, Line 2 and Line 3 at the entrances of the two classrooms and along the corridor, and designated buffer zones at the classroom entrances. The experiment was conducted ten minutes prior to the first afternoon class to evaluate the model’s missed detection rate and ability to avoid duplicate counting under complex motion conditions such as continuous passing, parallel walking, and multidirectional interweaving. The specific settings are illustrated in Figure 11.
Figure 11.
Experimental scene with three pedestrian counting lines.
3.2. Environmental Configuration and Evaluation Indicators
All training and validation was performed on the training machine (test environment, TE1) listed in Table 1. To further evaluate deployability, we additionally conducted model testing on a test environment (TE2) with more limited computing resources, as also listed in Table 1.
Table 1.
Environment configurations.
For a fair comparison, all models were trained using identical hyperparameters, including 300 epochs, a batch size of 32 and the same data augmentation. For object detection, we use Precision, Recall, FLOPs, number of parameters, FPS, mAP@50, and mAP@50:95 to evaluate the quality of the models. For cross-line counting, we use Precision, Recall, and the F1-score to evaluate counting accuracy.
3.3. Comparative Experiment on the Improvement of Object Detection Models
To assess the performance of the enhanced model, the RL_YOLO model was compared with three other YOLO models: YOLOv5n, YOLOv8n, and YOLO11n. The experimental results are presented in Table 2.
Table 2.
Performance comparison between the improved model and existing models.
According to Table 2, RL_YOLOv8 achieves the best detection accuracy, with mAP@50 reaching 96.40% and mAP@50:95 reaching 76.50%. At the same time, it significantly reduces computational cost. The FLOPs drops to 5.5 G and the number of parameters decreases to 1.67 M, both clearly lower than the baseline models. These results indicate that RL_YOLOv8 provides a better balance between accuracy and efficiency, making it suitable for real-time pedestrian detection and cross-line counting on resource-constrained indoor devices.
In addition, we test the models under different runtime environments. The detection accuracy remains largely stable across platforms, while inference speed varies due to hardware and software differences. The FPS comparison is reported in Table 3.
Table 3.
Inference speed comparison under an alternative runtime environment.
To further compare the performance of RL_YOLOv8 with other models, we additionally conduct experiments on two public benchmark datasets, WiderPerson [30] and CityPersons [31], in the environment of TE1. Since neither of the two public datasets has an official verification set label, 20% of the training set is randomly selected as the verification set, the test set remains unchanged, and all model training parameters are consistent with those of the campus dataset. The test results are shown in Table 4 and Table 5.
Table 4.
Test results on WiderPerson.
Table 5.
Test results on CityPersons.
According to Table 2, Table 4 and Table 5, the detection accuracy of all models in the public datasets was significantly lower than that in the self-made data set from the campus, mainly because the public dataset is more crowded and sheltered by people, and there are some label errors in the public datasets. In addition, on the two public datasets, L_YOLOv8 has better detection accuracy than YOLOv5n and YOLO11n, and is close to the YOLOv8n model, but its FLOPs and parameters are significantly smaller than those of other models, which makes it easy to deploy and gives it better operation efficiency.
3.4. Ablation Experiment of Target Detection Model
To further assess the contribution of each module, a series of ablation experiments was performed utilizing the campus indoor pedestrian dataset. The symbol “√” denotes the application of the improvement module based on the baseline model YOLOv8n. The results of the experiment are presented in Table 6.
Table 6.
Results of the ablation experiment.
The ablation study in Table 6 evaluates the contributions of RGCSPELAN and LSCD on the YOLOv8n baseline using the campus indoor pedestrian dataset. Replacing the C2f blocks with RGCSPELAN reduces model complexity, decreasing FLOPs from 8.1 G to 7.1 G, and parameters from 3.01 M to 2.31 M, while keeping detection accuracy at a comparable level. Introducing the LSCD head further improves detection performance, increasing mAP@50 from 95.66% to 95.83%, and reduces computational cost and model size to 6.5 G FLOPs and 2.36 M parameters.
When RGCSPELAN and LSCD are combined, RL_YOLOv8 achieves the best accuracy on this dataset, with mAP@50 reaching 96.40% and mAP@50:95 reaching 76.50%. At the same time, the model becomes substantially more lightweight: FLOPs decreases from 8.1 G to 5.5 G, and the number of parameters drops from 3.01 M to 1.67 M. Overall, these results show a better accuracy–efficiency trade-off, making RL_YOLOv8 more suitable for real-time indoor pedestrian detection and cross-line counting on resource-constrained devices.
3.5. Influence of Parameters on Experiments
All parameters in the algorithm will affect the final counting result. To find the optimal parameters, we conducted multiple sets of experiments for different parameter values. Compared to Line 2 and Line 3 scenarios, Line 1 exhibits more severe occlusion and is more challenging, so we chose Line 1 for the experiment.
3.5.1. Influences of Tolerance Threshold and Pulling Step
As described in Section 2.3.2, the tolerance threshold and the pulling step are the two key parameters for suppressing the lower boundary. Both affect the effectiveness of pedestrian position point correction and thus have a certain impact on cross-line counting. To test the effect of these two variables on cross-line counting, experiments were conducted with different parameter settings while fixing other variables. Compared to Lines 2 and 3, the occlusion near Line 1 is more severe and representative, so we chose Line 1 for experimental analysis. The experiment sets as 1, and sets different values for . The results are shown in Table 7.
Table 7.
Counting results with different tolerance thresholds.
Figure 12 shows the impact of different tolerance thresholds on counting accuracy. From Table 7 and Figure 12, when is too small, the suppression effect becomes too sensitive, resulting in excessive constraints on the update of the bottom edge and an increase in missed counts. When is set to 5, 8, or 10, the best balance can be achieved between suppressing occlusion jitter and maintaining motion tracking, resulting in the best counting accuracy. When exceeds 15, some jumps in the lower boundary of the target box will be recognized as normal and no jump correction will be made, resulting in a decrease in counting accuracy. Therefore, we set the default value of in the system to 8.
Figure 12.
The impact of different tolerance thresholds on counting accuracy.
To test the influence of the pulling step on the experimental results, we fixed to 8, and set to 0.5, 1.0, 1.5, 2.0. The test results are shown in Table 8.
Table 8.
The impact of s setting on counting accuracy.
According to Table 8, when is 1, the counting accuracy is the highest; when the value of is low (0.5), it will cause a significant deviation between the corrected lower boundary and the true boundary; When the value of is high (1.5, 2), it will cause the lower boundary to quickly rise to the detected lower boundary, and the improvement in correction effect is not significant, resulting in a decrease in counting accuracy.
3.5.2. Analysis of Buffer
When a pedestrian walks out of a classroom, he will not be detected at first for he is obstructed by doors, walls, or other pedestrians, shown as Figure 13a. When the pedestrian is detected and assigned an ID number of 6, his detection position (the red point) has already appeared outside the counting line (the red line), shown as Figure 13b. At this point, relying on a single counting line cannot detect the pedestrian’s crossing behavior, resulting in missed detections. To address this issue, a buffer zone (the yellow rectangle) is introduced. The width of the buffer zone is affected by occlusion and pedestrians in other directions. When the width is too narrow, the problem of missed detections will still exist; When the width is too wide, there may be situations where pedestrians outside the classroom suddenly appear in the buffer zone due to obstruction, resulting in false positives. The buffer zone mainly solves whether pedestrians are obstructed when crossing lines. When the obstruction is light, there is no need to set a buffer zone, such as in scene Line 3.
Figure 13.
The buffer zone reduces the situation of omissions in recording. (a) At the 15th frame before the current frame, a pedestrian walking out of a classroom is not detected; (b) At the current frame, the pedestrian is detected and assigned an ID number of 6, and his position is located in the buffer; (c) At the 15th frame after the current frame, the pedestrian left the buffer zone.
3.5.3. Influences of the Length of Frame Window
To analyze the impact of frame window length on counting results, we conducted experiments using Line 1 as the test line. During the experiment, set τ to 8 and s to 1. The counting results under different values of are shown in Table 9.
Table 9.
Experimental results with different values.
According to Table 9, the highest accuracy is achieved when is set to 15; when the value of is too small or too large, it can lead to a decrease in pedestrian counting accuracy. The reason is that when the value of is too small, it is difficult to accurately determine the proximity relationship between the occluded person and other pedestrians around them before occlusion. When it is too large, pedestrians may exhibit sudden behaviors such as turning, slowing down, and pausing during walking, which can cause changes in the proximity relationship between pedestrians and those around them, resulting in inaccurate predictions.
3.6. Pedestrian Counting Experiment
The pedestrian detection and cross-line counting experiment was conducted in a corridor scene during the 10 min before class, when pedestrian density is high. In both the original method and the improved cross-line system, to avoid the counting interference caused by a pedestrian wandering near the counting line, we stipulate that the movement of frequent back and forth crossing lines in a short period of time is not included in the crossing count. That is, if a pedestrian crosses the counting line in one direction and then turns back to cross the line again in 15 frames, the two cross-line movements of the pedestrian back and forth are not counted. The detailed experimental results are shown in Table 10 and Figure 14.
Table 10.
Experimental results of pedestrian cross-line counting using original and improved methods.
Figure 14.
Comparison of cross-line counting results between original and improved methods.
For Line 1, located at the classroom entrance, pedestrians are frequently occluded by others, causing noticeable jitter in the bottom boundary of the bounding box. Such instability may lead to trajectory “line jumping” and duplicate counting near the counting line. With the proposed trajectory refinement and bottom-boundary stabilization, false positives decrease from nine to six and false negatives decrease from five to four. Consequently, the F1 score improves from 89.23% to 92.19%, indicating more stable counting under frequent occlusions.
For Line 2, it is farther from the camera, where pedestrians appear smaller and are more prone to missed detections under occlusion. The auxiliary counting zone helps compensate for missed crossings caused by delayed detection or late track initialization near the doorway. As a result, the number of missed counts is reduced (FNs decrease from four to two), which increases recall from 75.00% to 87.50%. The F1 score increases from 82.76% to 90.32%, showing that the buffer-zone strategy improves robustness mainly by recovering missed events.
For Line 3, placed in the corridor close to the camera with a long counting line, pedestrians occupy larger pixel heights and are easier to detect. The improved system notably reduces false counts (FPs drop from 12 to 3) while maintaining a similar Recall level. This yields a clear improvement in precision from 91.61% to 97.78% and in F1 score from 95.27% to 98.88%, demonstrating that trajectory stabilization and the anti-repetition rule effectively suppress duplicate counts caused by trajectory fluctuations.
Overall, by integrating trajectory refinement, bottom-boundary stabilization and the auxiliary counting zone, the improved system reduces missed counts and suppresses false counts across different lines, leading to more stable and reliable cross-line counting in crowded and occluded indoor scenarios.
4. Conclusions
To solve the low cross-line counting accuracy in critical areas, such as corridors and classroom entrances, this paper proposes an optimized approach for pedestrian cross-line counting. The approach utilizes an enhanced model of YOLOv8n (that is RL_YOLOv8) combined with DeepSort tracking. The RL_YOLOv8 model serves as the foundation for the enhanced cross-line counting system, which reduces computational complexity while maintaining high detection accuracy. To improve the accuracy of cross-line counting, we introduce three key strategies: First, bottom boundary suppression reduces cross-line counting errors caused by trajectory jumps by suppressing drastic fluctuations in pedestrian position points. Second, a related-pedestrian prediction method corrects the predicted trajectory during periods when pedestrians are occluded, ensuring that occluded pedestrians can still successfully trigger the counting line. Finally, the buffer counting mechanism addresses missed detections caused by pedestrians already crossing the counting line when first detected, especially in scenarios such as classroom entrances.
Experimental results show that the improved system exhibits significant improvement in counting accuracy, effectively avoiding the original cross-line counting error caused by factors such as occlusion, and overall performance is better than the original system. Future work could further expand the adaptability of this method to other application scenarios, such as large shopping malls, subway stations, and airports, and explore more advanced pedestrian detection and tracking models to further improve the accuracy and performance of pedestrian counting systems. In addition, to explore the practicality of the system, we will carry out further work, such as adaptation to varying camera viewpoints and integration with multi-camera systems.
Author Contributions
Conceptualization, Litao Han and Jiayan Wang; Methodology, Laihao Song, Hengjian Feng and Litao Han; Software, Laihao Song and Ran Ji; Validation, Laihao Song and Ran Ji; Writing—original draft, Laihao Song; Writing—review and editing, Litao Han, Laihao Song and Jiayan Wang. All authors have read and agreed to the published version of the manuscript.
Funding
This research is funded by the National Natural Science Foundation of China (Grant No. 42271436) and the Shandong Provincial Natural Science Foundation, China (Grant No. ZR2021MD030).
Institutional Review Board Statement
The research protocol was approved by the College of Geodesy and Geomatics of Shandong University of Science and Technology (CGG20250801) on 14 August 2025.
Informed Consent Statement
Informed consent for publication was obtained from all identifiable human participants.
Data Availability Statement
The data presented in this study are available on request from the corresponding author. The reason why the data can not be fully disclosed is that the experimental data in this paper belongs to video surveillance data, which involves the personal privacy of the people in the video. If readers really need it, they can contact the corresponding author and use it under certain guarantee.
Conflicts of Interest
The authors declare no conflicts of interest.
References
- Ma, Z.; Chan, A.B. Crossing the line: Crowd counting by integer programming with local features. In Proceedings of the 2013 IEEE Conference on Computer Vision and Pattern Recognition, Portland, OR, USA, 23–28 June 2013; pp. 2539–2546. [Google Scholar]
- Deng, L.; Zhou, Q.; Wang, S.; Górriz, J.M.; Zhang, Y. Deep learning in crowd counting: A survey. CAAI Trans. Intell. Technol. 2024, 9, 1043–1077. [Google Scholar] [CrossRef]
- Lempitsky, V.; Zisserman, A. Learning to count objects in images. In Proceedings of the 24th International Conference on Neural Information Processing Systems; Curran Associates Inc.: Red Hook, NY, USA, 2010; pp. 1324–1332. [Google Scholar]
- Taniguchi, Y.; Mizushima, M.; Hasegawa, G.; Nakano, H.; Matsuoka, M. Counting pedestrians passing through a line in crowded scenes by extracting optical flows. Information 2016, 19, 303–316. [Google Scholar]
- Huang, R. Research on Cross-Line Counting Method for High-Density Population in Surveillance Videos. Master’s Thesis, Yangzhou University, Yangzhou, China, 2023. [Google Scholar]
- Zheng, H.; Lin, Z.; Cen, J.; Wu, Z.; Zhao, Y. Cross-line pedestrian counting based on spatially-consistent two-stage local crowd density estimation and accumulation. IEEE Trans. Circuits Syst. Video Technol. 2019, 29, 787–799. [Google Scholar] [CrossRef]
- Satyanarayana, P.; Pavuluri, G.; Kunda, S.; Satvik, M.; CharanKumar, Y.J.A. A robust bi-directional algorithm for people count in crowded areas. Int. J. Pure Appl. Math. 2017, 116, 73–78. [Google Scholar]
- He, M.; Luo, H.; Hui, B.; Chang, Z. Pedestrian flow tracking and statistics of monocular camera based on convolutional neural network and kalman filter. Appl. Sci. 2019, 9, 1624. [Google Scholar] [CrossRef]
- Gochoo, M.; Rizwan, S.A.; Ghadi, Y.Y.; Jalal, A.; Kim, K. A systematic deep learning based overhead tracking and counting system using RGB-D remote cameras. Appl. Sci. 2021, 11, 5503. [Google Scholar] [CrossRef]
- Marczyk, M.; Kempski, A.; Socha, M.; Cogiel, M.; Foszner, P.; Staniszewski, M. Passenger location estimation in public transport: Evaluating methods and camera placement impact. IEEE Trans. Intell. Transp. Syst. 2024, 25, 17878–17887. [Google Scholar] [CrossRef]
- Hussain, M. YOLO-v1 to YOLO-v8, the rise of YOLO and its complementary nature toward digital manufacturing and industrial defect detection. Machines 2023, 11, 677. [Google Scholar] [CrossRef]
- Han, L.; Feng, H.; Liu, G.; Zhang, A.; Han, T. A real-time intelligent monitoring method for indoor evacuee distribution based on deep learning and spatial division. J. Build. Eng. 2024, 92, 109764. [Google Scholar] [CrossRef]
- Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You Only Look Once: Unified, real-time object detection. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 779–788. [Google Scholar]
- Wojke, N.; Bewley, A.; Paulus, D. Simple Online and Realtime Tracking with a Deep Association Metric. In Proceedings of the 2017 IEEE International Conference on Image Processing (ICIP), Beijing, China, 17–20 September 2017; pp. 3645–3649. [Google Scholar]
- Mohammed, S.Y. Architecture Review: Two-stage and one-stage object detection. Franklin. Open 2025, 12, 100322. [Google Scholar] [CrossRef]
- Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards real-time object detection with region proposal networks. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 1137–1149. [Google Scholar] [CrossRef]
- Bochkovskiy, A.; Wang, C.-Y.; Liao, H.-Y.M. YOLOv4: Optimal speed and accuracy of object detection. arXiv 2020, arXiv:2004.10934. [Google Scholar] [CrossRef]
- Zhuang, J.; Wang, K.; Yuan, Z.; Yan, Y. Frequency domain iterative clustering for boundary-preserving superpixel segmentation. Appl. Soft Comput. 2026, 191, 114717. [Google Scholar] [CrossRef]
- Yang, M.; Xu, R.; Yang, C.; Wu, H.; Wang, A. Hybrid-DETR: A differentiated module-based model for object detection in remote sensing images. Remote Sens. 2024, 13, 5014. [Google Scholar] [CrossRef]
- Howard, A.G.; Zhu, M.; Chen, B.; Kalenichenko, D.; Wang, W.; Weyand, T.; Andreetto, M.; Adam, H. MobileNets: Efficient convolutional neural networks for mobile vision applications. arXiv 2017, arXiv:1704.04861. [Google Scholar] [CrossRef]
- Ding, X.; Zhang, X.; Ma, N.; Han, J.; Ding, G.; Sun, J. RepVGG: Making VGG-Style ConvNets great again. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 19–25 June 2021; pp. 13728–13737. [Google Scholar]
- Tan, M.; Pang, R.; Le, Q.V. EfficientDet: Scalable and efficient object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 10778–10787. [Google Scholar]
- Wu, Y.; He, K. Group normalization. Int. J. Comput. Vis. 2018, 128, 742–753. [Google Scholar] [CrossRef]
- Zhao, W.; Wang, L.; Li, Y.; Liu, X.; Zhang, Y.; Yan, B.; Li, H. A multi-scale and multi-stage human pose recognition method based on convolutional neural networks for non-wearable ergonomic evaluation. Processes 2024, 12, 2419. [Google Scholar] [CrossRef]
- Pereira, R.; Carvalho, G.; Garrote, L.; Nunes, U.J. Sort and Deep-SORT based multi-object tracking for mobile robotics: Evaluation with new data association metrics. Appl. Sci. 2022, 12, 1319. [Google Scholar] [CrossRef]
- Wu, Z.; Teixeira, C.; Ke, W.; Xiong, Z. Head anchor enhanced detection and association for crowded pedestrian tracking. arXiv 2025, arXiv:2508.05514. [Google Scholar] [CrossRef]
- He, M.; Luan, Q.; Shui, W.; Yu, H.; Fan, D. An improved social force model considering pedestrian perception avoidance feature of peer groups. J. Highw. Transp. Res. Dev. 2017, 34, 125–130. [Google Scholar]
- Chen, K.; Zhao, X.; Dong, C.; Di, Z.; Chen, Z. Anti-occlusion object tracking algorithm based on filter prediction. J. Shanghai Jiaotong Univ. Sci. 2024, 29, 400–413. [Google Scholar] [CrossRef]
- Zhang, L.; Yue, H.; Li, M.; Wang, S.; Mi, X.Y. Simulation of pedestrian push-force in evacuation with congestion. Acta Phys. Sin. 2015, 64, 060505. [Google Scholar] [CrossRef]
- Zhang, S.; Xie, Y.; Wan, J.; Xia, H.; Li, S.Z.; Guo, G. WiderPerson: A diverse dataset for dense pedestrian detection in the wild. IEEE Trans. Multimed. 2020, 22, 380–393. [Google Scholar] [CrossRef]
- Zhang, S.; Benenson, R.; Schiele, B. CityPersons: A diverse dataset for pedestrian detection. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 4457–4465. [Google Scholar]
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content. |
© 2026 by the authors. Published by MDPI on behalf of the International Society for Photogrammetry and Remote Sensing. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license.













