1. Introduction
The construction industry is a pillar of China’s economic development [1]. As shown in Figure 1, the added value of the construction industry reached CNY 8994.9 billion in 2024, making important contributions to national economic growth and the improvement of people’s lives. However, alongside this rapid development, the industry faces a serious work-safety situation. Safety risks are especially concentrated during the construction stage, owing to complex working environments and frequently overlapping work processes. Previous studies have shown that construction accidents are closely related to human factors such as workers’ illegal operations, which are a core cause of construction safety accidents [2].
Among the many human factors behind construction safety accidents, prolonged violation behaviors of operators have become an important class of risk: highly concealed and difficult to detect in real time through site management [3]. Behaviors such as playing with mobile phones, sleeping, and leaving one’s post usually exhibit obvious time-series residence characteristics. Compared with transient high-risk behaviors such as falls and running, these behaviors evolve slowly and, once entered, remain relatively static. Recognition of such behaviors therefore still carries a real-time requirement, but a judgment can be made within seconds of onset, which improves recognition robustness. Moreover, because the behavior process often contains obvious transition stages, such as the progression from taking out a mobile phone to continuous browsing, the system can tolerate short-term behavioral fluctuations, making low-computation identification schemes well suited to the task. Finally, the judgment of prolonged violation behavior rests on the continuous, stable state of the target over a period of time rather than on the accidental characteristics of a single frame, which places higher demands on the time-series modeling and persistence-analysis capabilities of the behavior recognition scheme [4].
In recent years, with the continuous development of cutting-edge technologies, their influence across industries has significantly expanded. In the construction field, computer vision based on deep learning has gained wide attention due to its continuous, non-contact, and human-independent dynamic monitoring characteristics [5,6]. Among these technologies, the YOLO series has become a core approach for object detection because of its fast speed, high accuracy, and suitability for edge deployment, achieving remarkable results in practice [7,8]. Recent studies have introduced multi-scale attention mechanisms and data augmentation strategies to improve detection robustness in extreme construction environments, though challenges remain in handling complex behaviors and continuity modeling [9]. Other research has enhanced YOLOv5 with multi-scale feature fusion to improve detection accuracy in track construction [10]. Further advances include the use of instance segmentation for personnel tracking and improved robustness under adverse weather conditions, although these methods still overlooked behavior state information [11,12]. Efforts have also been made to improve YOLOv5 for accurate safety equipment detection, but limitations remain in dynamic behavior recognition [13]. In addition, sensor-based mobile systems have been developed to identify unsafe behaviors, though their ability to process complex visual information on construction sites is limited [14].
After the release of YOLOv8, its improved detection accuracy and network scalability further promoted its application and optimization in construction scenarios. Research has explored its use in safety helmets, construction equipment, and safety behaviors. Enhancements such as attention mechanisms, lightweight structures, and improved detection heads have generally boosted performance [15,16]. YOLOv8 has also been applied to tasks including helmet detection with UAV images [17], safety risk factor identification through comparisons with YOLOv5 [18], construction waste detection [19], and industrial defect inspection [20]. However, most of these studies focus on accuracy improvement for static objects, limiting model universality. Some efforts have extended YOLOv8 to personnel status recognition, such as multi-scale models for unsafe behavior detection [21], demonstrating its potential but still relying heavily on single-frame reasoning without dynamic behavior modeling. Compared with traditional risk management methods based on manual inspection, post-event reports, or static records, new measures integrating computer vision offer faster detection, timely response, and stronger adaptability to complex scenarios, thereby improving the initiative and accuracy of safety management while supporting risk control on construction sites [22]. Nonetheless, current technologies largely target simple static behaviors or wearable devices, lacking sufficient depth and adaptability for prolonged violations, which poses considerable challenges for practical application.
To overcome the limitations of existing methods in recognizing prolonged violation behaviors, recent studies have introduced algorithmic structures with enhanced time-series modeling to improve dynamic behavior recognition. A common approach is to build large-scale time-series models using depth information such as skeletal joints. For example, the combination of ST-GCN and YOLO has been applied to interaction behavior modeling, which improves the analysis of complex actions but suffers from a complex structure, large size, and high computational demands, making it unsuitable for deployment on low-power devices in real construction scenarios [23]. Other work has integrated modules such as target detection, attention estimation, and personnel re-identification, demonstrating certain recognition capabilities, but these methods still rely heavily on static spatial information, lack temporal awareness, and fail to align with risk management needs on construction sites [24]. With advances in multi-object tracking (MOT), combining it with detection algorithms for dynamic behavior recognition has also gained attention. Applications such as automatic efficiency analysis in sling transportation highlight the potential of MOT in spatio-temporal continuous analysis, though current use remains limited to mechanical objects rather than human behavior [25]. Overall, although these approaches represent progress in action recognition, systematic solutions that balance accuracy, real-time performance, and deployment feasibility for identifying prolonged violations in construction scenarios are still lacking.
Recent YOLO-based improvement studies have increasingly explored deformable convolutions, enhanced backbones, and attention mechanisms for small-object detection and occlusion scenarios across diverse domains. For instance, Gu et al. integrated multi-scale and occlusion-aware structures into YOLOv8 for improved robustness under complex visual conditions [26], while Jin et al. demonstrated the effectiveness of DCNv3 in handling non-rigid deformation through adaptive geometric modeling [27,28]. Lightweight backbone optimizations such as MSF-CSPNet and GELAN have shown advantages in balancing accuracy and inference efficiency [29,30], and attention-based designs, including ECA, have also been applied to strengthen fine-grained feature responses for small objects [31,32]. Additionally, recent works have combined YOLO variants with multi-object tracking frameworks such as ByteTrack to maintain identity consistency in dynamic scenes [33,34]. However, these studies typically address either structural enhancement or tracking accuracy in isolation, with limited consideration of prolonged behavioral cues, temporal stability, or the specific demands of construction scenarios. This gap underscores the need for an integrated architecture that jointly enhances geometric adaptability, multi-scale perception, and cross-frame consistency to support long-duration violation behavior recognition.
In summary, the construction site’s open structure, frequent personnel flow, complex edge equipment, and dynamic safety risks require safety risk management to ensure both response efficiency and adaptability to dynamic behavior identification under complex conditions. Prolonged violation behavior has become a key challenge in risk pre-control due to its high risk, concealment, and fuzzy boundaries. While previous research has made progress in target detection and behavior recognition, most methods focus on single-frame images or short-term static states, limiting their ability to model continuous behavior changes. Existing computer vision approaches struggle with prolonged behaviors like mobile phone use, sleeping, and leaving posts, due to insufficient temporal awareness, limited behavior persistence strategies, and high deployment costs. Therefore, it is crucial to design a behavior recognition system that balances accuracy, lightweight design, and temporal modeling to effectively monitor prolonged violations and support risk intervention and governance.
In construction behavior recognition, symmetry and asymmetry can be understood as the balance between data distribution and behavioral dynamics. Under ideal conditions, standardized operational behaviors exhibit a certain degree of spatiotemporal symmetry. However, prolonged violations such as “playing with mobile phones” disrupt this balance, resulting in significant spatiotemporal asymmetry. To address the shortcomings of existing models in recognizing prolonged violation behaviors of key personnel in construction scenarios, this paper proposes an integrated recognition framework, the DGEA-YOLOv8 algorithm, which combines an improved YOLO-based object detection model with a lightweight multi-object tracking algorithm. The specific implementation process is shown in Figure 2.
This approach effectively mitigates the aforementioned asymmetry through deformable convolution, attention-guided feature weighting, and multi-scale contextual perception; by restoring structural balance, it enhances the model’s robustness in detecting asymmetric feature distributions and irregular behaviors under complex construction conditions. Furthermore, the approach offers the following three key advantages in both technical and application aspects:
1. A specialized dataset for “playing with mobile phone” behavior is created, considering target pose, occlusion, and background complexity in construction scenes, with enhanced data diversity to support prolonged violation behavior recognition.
2. To meet the application’s specific needs, the YOLOv8 model is optimized with structural adjustments, integrating DCNv3, GELAN, ECA, and ASPP modules into the DGEA-YOLOv8 framework. This enhances the model’s adaptability to pose deformation, global feature capture, and lightweight performance, while strengthening small target attention and multi-scale perception for improved recognition of fine-grained action features.
3. Using the lightweight ByteTrack tracking architecture, the system improves target identity maintenance and trajectory continuity across frames, reducing ID switching and target loss, and better supporting stable tracking of prolonged dynamic behaviors.
2. Materials and Methods
2.1. Construction and Processing of the Dataset
This paper focuses on the intelligent identification of prolonged violation behaviors at construction sites, aiming to construct a new recognition scheme that combines target detection and multi-target tracking to efficiently perceive and assess these hidden risk behaviors. Given that the “playing with mobile phone” behavior of construction personnel in key positions is highly representative and exhibits typical time-series residence characteristics, this study uses it as the entry point for designing and verifying the recognition scheme. However, most current public datasets focus on general pedestrian detection, whereas key-position personnel are often seated, so such datasets offer limited posture diversity and scene adaptability for this research. Although some datasets in the construction field cover behaviors such as helmet-wearing, smoking, and answering calls, they cannot directly support effective modeling of “playing with mobile phone” behavior, which differs significantly in both posture and timing characteristics. Therefore, building on a review of existing datasets and considering real-world construction site environments, this paper collects a large number of supplementary image samples to create a dataset specifically for recognizing the “playing with mobile phone” behavior of construction personnel, providing reliable support for subsequent model training and performance validation.
2.1.1. Dataset Acquisition
In constructing the dataset of “mobile phone usage” behavior in key positions, this paper fully incorporates an ergonomic perspective. Using web crawlers, construction video monitoring platforms, and similar sources, image samples of construction workers using mobile phones in different postures in actual working environments were systematically collected. For standing behaviors, the data include actions such as one-handed holding with the arm hanging, two-handed operation in front of the chest, and body-leaning browsing. For seated scenarios, samples cover typical behaviors such as forward leaning, reclined relaxation, and sideways compensatory postures, capturing real-world instances of abnormal body posture during work. Additionally, for precise object recognition, this study attends not only to overall body-movement features but also to visibility factors of the mobile phone itself, including screen on/off state, phone orientation, brand, and size, to enhance the model’s ability to detect small and occluded targets. Beyond the interaction between person and phone, behaviors are further categorized by typical actions, such as one-handed or two-handed holding, typing, and screen scrolling, ensuring a fine-grained distribution of behavior samples in the dataset.
Figure 3 presents a selection of images from the dataset of mobile phone usage behavior of key positions constructed in this study.
2.1.2. Data Preprocessing and Augmentation
To enhance the model’s robustness and generalization in complex construction environments, this paper introduces various data preprocessing and augmentation strategies after image collection, further expanding the sample size and feature diversity. The processing scheme for image data can be seen in Figure 4. Starting with the original dataset of 1000 images and 2300 annotated targets, common construction site factors such as lighting variations, angle shifts, local occlusions, and posture changes were considered. Techniques like affine and perspective transformations were applied to enrich the spatial structure of the images, while HSV color perturbations improved the model’s adaptability to different lighting conditions. Additionally, the Copy–Paste strategy was used to create multi-target interference scenes, simulating the density of actual work environments. To further improve the model’s handling of missing local information and blurred behavioral boundaries, random occlusion and image mixing were incorporated.
Combining these strategies expanded the dataset to approximately 3000 images, substantially increasing coverage in terms of scene composition, target posture, and behavioral actions. Among them, 37% of the images contain varying degrees of local occlusion, such as partial coverage of the phone by the hand or obstruction of phone edges by clothing or tools, simulating the visibility limitations frequently encountered in real construction environments. In terms of illumination, 35% of the images correspond to bright lighting, 48% to medium lighting, and 17% to low-light or backlit conditions, supporting model robustness across diverse lighting scenarios.
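As an illustration of how such a multi-strategy pipeline can be assembled, the following sketch uses the albumentations library; the transform parameters, probabilities, and file names are illustrative assumptions rather than the exact settings used in this study, and Copy–Paste and image mixing would be implemented as separate custom steps.

```python
# A minimal augmentation-pipeline sketch, assuming the albumentations library.
import albumentations as A
import cv2

augment = A.Compose(
    [
        A.Affine(scale=(0.8, 1.2), translate_percent=(0.0, 0.1),
                 rotate=(-15, 15), p=0.7),                 # affine transforms
        A.Perspective(scale=(0.02, 0.08), p=0.3),          # viewpoint shifts
        A.HueSaturationValue(hue_shift_limit=10,
                             sat_shift_limit=25,
                             val_shift_limit=30, p=0.8),   # HSV lighting perturbation
        A.CoarseDropout(max_holes=4, max_height=48,
                        max_width=48, p=0.3),              # random local occlusion
    ],
    bbox_params=A.BboxParams(format="yolo", label_fields=["class_labels"]),
)

image = cv2.imread("sample.jpg")  # hypothetical sample image
out = augment(image=image, bboxes=[[0.52, 0.48, 0.10, 0.06]],
              class_labels=["phone"])
aug_image, aug_boxes = out["image"], out["bboxes"]
```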
For the above dataset, annotation was performed using LabelImg. To ensure annotation quality, a dual cross-verification procedure was adopted, in which all bounding boxes were manually reviewed by two annotators, supplemented by IoU-based consistency checks and missing-label inspection to guarantee the accuracy and consistency of annotations. After annotation, the labels were first saved in XML format and subsequently converted into the TXT format required by the YOLOv8 model using a Python (3.12) script, as sketched below. Finally, the dataset was randomly split into training, validation, and test sets in an 8:1:1 ratio to ensure sufficient stability during the model’s training and validation phases.
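A minimal sketch of this XML-to-TXT conversion follows; the class list and directory paths are hypothetical, not the authors’ actual configuration.

```python
# Convert Pascal-VOC-style XML labels to YOLO TXT labels (hedged sketch).
import glob
import os
import xml.etree.ElementTree as ET

CLASSES = ["person", "phone"]  # assumed class list

def voc_to_yolo(xml_path, out_dir):
    root = ET.parse(xml_path).getroot()
    w = float(root.find("size/width").text)
    h = float(root.find("size/height").text)
    lines = []
    for obj in root.iter("object"):
        cls = CLASSES.index(obj.find("name").text)
        b = obj.find("bndbox")
        x1, y1 = float(b.find("xmin").text), float(b.find("ymin").text)
        x2, y2 = float(b.find("xmax").text), float(b.find("ymax").text)
        # YOLO format: class x_center y_center width height, normalized to [0, 1]
        cx, cy = (x1 + x2) / 2 / w, (y1 + y2) / 2 / h
        bw, bh = (x2 - x1) / w, (y2 - y1) / h
        lines.append(f"{cls} {cx:.6f} {cy:.6f} {bw:.6f} {bh:.6f}")
    name = os.path.splitext(os.path.basename(xml_path))[0] + ".txt"
    with open(os.path.join(out_dir, name), "w") as f:
        f.write("\n".join(lines))

for xml_file in glob.glob("annotations/*.xml"):  # hypothetical paths
    voc_to_yolo(xml_file, "labels")
```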
2.2. Standard YOLOv8 Model
The YOLOv8 network architecture consists of three main parts: Backbone, Neck, and Head [26]. The Backbone uses a series of convolutional layers for feature extraction, incorporating residual connections and bottleneck structures to enhance performance. The C2f module, used as the basic building block, offers richer gradient flow and stronger feature extraction than the C3 module in YOLOv5. The Neck is responsible for multi-scale feature fusion, combining the feature maps generated at various stages to improve feature representation. The Head handles the final object detection and classification tasks. Input image data are processed through CBS (convolution-normalization-activation) modules, Cat modules concatenate feature maps from different layers, and the Detect module generates prediction boxes and class information.
Although the YOLOv8 model performs excellently in various general object detection tasks, its standard structure still has limitations when applied to identifying prolonged violation behaviors of key personnel at construction sites. Taking “playing with mobile phones” as an example, the target actions typically involve specific movements of the hands and head, such as looking down at the phone screen or holding the phone with one or both hands. These subtle actions are often similar to behaviors like resting with the head down or reading paper documents, demanding higher detail-capturing capabilities from the YOLOv8 algorithm. However, the standard YOLOv8 model is not fully suited to these tasks. The commonly used standard convolution (CBS) and C2f structures, while strong in feature expression, struggle to capture the small feature variations in the behavioral state changes. Additionally, the original Spatial Pyramid Pooling (SPPF) structure in YOLOv8 has limited multi-scale modeling capabilities, failing to fully extract the discriminative features of small targets, such as those seen in “playing with mobile phones,” in complex environments. Moreover, the model’s attention to local salient information is insufficient, making it less effective in handling occlusion, lighting changes, and other interference factors at construction sites. Therefore, it is necessary to optimize YOLOv8’s feature extraction, multi-scale adaptation, and attention mechanisms to better recognize key behavior states and improve accuracy and robustness for construction site applications.
2.3. The DGEA-YOLOv8 Algorithm
To address the challenges posed by prolonged violation behaviors, such as “playing with mobile phones,” at key positions on construction sites, characterized by significant target shape variations, scale differences, hidden feature details, and deployment constraints, this paper proposes an improved DGEA-YOLOv8 algorithm to enhance detection performance and deployment feasibility. The algorithm optimizes the YOLOv8 baseline model at multiple structural levels. On one hand, it introduces the DCNv3 module to replace some standard convolution structures, improving the model’s adaptability to behavior deformation features. On the other hand, the GELAN architecture replaces the original C2f module, balancing global feature capture and lightweight requirements to enhance the model’s efficiency in complex construction environments. Additionally, the Neck layer integrates the ECA channel attention mechanism to improve the model’s ability to recognize fine details and small targets, such as hands holding a phone. Furthermore, the ASPP module replaces the original SPPF structure, enhancing multi-scale perception and contextual modeling, effectively addressing issues related to varying target sizes, significant occlusion, and background interference.
On this basis, the enhanced modules do not function in isolation but instead form a complementary pipeline throughout the feature extraction process. DCNv3 provides richer and more task-aligned local details through stable geometric alignment, while GELAN efficiently aggregates multi-level semantic representations. The aggregated features are then further strengthened along the channel dimension by the ECA attention mechanism, which amplifies the model’s sensitivity to key fine-grained cues. Meanwhile, the ASPP structure supplements multi-scale contextual semantics, enabling more stable and context-aware feature perception in complex construction environments. Through this system-level collaborative enhancement, the DGEA-YOLOv8 model achieves a holistic performance gain beyond the simple accumulation of individual components, delivering significantly improved violation recognition accuracy and environmental adaptability while maintaining high inference efficiency. A comparison of the model structure before and after improvement is shown in Figure 5. The following sections further analyze the functional principles and advantages of each integrated module.
2.3.1. Deformable Convolutional Networks
In construction scenarios, prolonged violation behaviors such as “playing with mobile phones” are often accompanied by gradual head and arm movements, including nodding, turning, or subtle hand motions. These actions show significant individual variability and instability, and visually they are easily confused with non-violation states such as short breaks. This raises higher demands on the model’s sensitivity to local morphological changes. However, traditional standard convolution blocks (CBS), while stable for regular structures or static targets, rely on fixed sampling methods. As a result, they struggle to capture all key features when faced with large posture variations or complex deformations, thereby limiting the model’s ability to represent geometric changes.
To enhance the model’s flexibility in detecting human actions with large deformations, this study introduces the Deformable Convolutional Networks (DCN) module to replace the traditional CBS blocks, enabling more effective capture of irregular structures and fine-grained features [27]. As shown in Figure 6, the DCN module leverages convolutional offset learning and interpolation sampling. Unlike the fixed sampling strategy of conventional convolution, its kernels dynamically adjust sampling positions based on semantic information from the input feature map, thereby providing greater flexibility and generalization.
As the latest version of the series, DCNv3 further optimizes spatial modeling capability and computational efficiency relative to the previous two generations. As shown in Equations (1)–(3), this version adopts a more deeply coupled offset-prediction mechanism and a lightweight structural design, significantly reducing computational redundancy while maintaining high accuracy; it thus performs better in both the fusion of the offset mechanism and the feature-response capability, and adapts well to the edge-computing environments widely used in construction scenarios.

$$y(p_0) = \sum_{n=1}^{N} w_n \cdot x(p_0 + p_n + \Delta p_n) \quad (1)$$

$$x(p) = \sum_{q \in \mathcal{Q}} G(q, p) \cdot x(q) \quad (2)$$

$$p = p_0 + p_n + \Delta p_n \quad (3)$$

Here, $p_0$ denotes the center position of the current convolution window on the output feature map, $p_n$ represents the fixed relative coordinate of the n-th sampling point in the convolution kernel, $\Delta p_n$ is the learnable offset predicted by the network, $x$ denotes the input feature map, $w_n$ is the weight assigned to the n-th sampling point of the convolution kernel, $G(\cdot, \cdot)$ represents the bilinear interpolation kernel used when the sampling location lies on a non-integer spatial coordinate grid, $\mathcal{Q}$ denotes the set of integer grid points in the neighborhood of the sampling position, and the final sampling coordinate is given by Equation (3).
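The following PyTorch sketch illustrates the deformable sampling of Equations (1)–(3) using torchvision’s deform_conv2d operator. It is a simplified illustration only: the DCNv3 module used in DGEA-YOLOv8 additionally employs grouped sampling and softmax-normalized modulation, which are omitted here.

```python
# Minimal deformable-convolution sketch (Equations (1)-(3)), not the full DCNv3.
import torch
import torch.nn as nn
from torchvision.ops import deform_conv2d

class DeformConvBlock(nn.Module):
    def __init__(self, c_in, c_out, k=3):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(c_out, c_in, k, k) * 0.01)
        # One (dx, dy) offset pair and one modulation scalar per kernel sample point
        self.offset_conv = nn.Conv2d(c_in, 2 * k * k, k, padding=k // 2)
        self.mask_conv = nn.Conv2d(c_in, k * k, k, padding=k // 2)
        self.k = k

    def forward(self, x):
        offset = self.offset_conv(x)              # learnable offsets Δp_n
        mask = torch.sigmoid(self.mask_conv(x))   # per-sample modulation weights
        return deform_conv2d(x, offset, self.weight,
                             padding=self.k // 2, mask=mask)

y = DeformConvBlock(64, 64)(torch.randn(1, 64, 80, 80))  # -> (1, 64, 80, 80)
```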
In summary, the DCNv3 module enhances the model through learnable offsets and dynamic sampling, enabling it to better adapt to the complex geometric structures and diverse viewpoints of construction scenes and significantly improving detection accuracy under heavy background interference and frequent occlusion. At the same time, its strong deformation-alignment capability allows stable recognition of continuously changing prolonged violation behaviors. Unlike attention modules such as SE and CBAM that mainly reweight feature channels, or multi-scale fusion structures such as FPN, PAN, and BiFPN that focus only on scale aggregation, DCNv3 directly reconstructs visual features at the geometric level, providing more reliable cross-frame geometric consistency. This makes it particularly suitable for the dual requirements of small-object detection in construction environments and prolonged behavior determination. Therefore, integrating DCNv3 improves not only single-frame detection performance but also the stability and reliability of subsequent temporal behavior inference.
2.3.2. Generalized Efficient Layer Aggregation Network
In construction environments characterized by multiple overlapping tasks, frequent occlusions, and limited monitoring resources, the model must not only achieve accurate recognition but also ensure efficiency and deployment feasibility. Although the original C2f module provides strong feature-extraction accuracy, it leaves room for optimization in lightweight design and fast inference. As a generalized lightweight feature-aggregation architecture, the Generalized Efficient Layer Aggregation Network (GELAN) can extract key features more efficiently while significantly reducing computational load, making it highly suitable for resource-constrained scenarios [28].
The GELAN architecture combines the strengths of CSPNet and ELAN, creating a lightweight module with efficient feature representation and strong inference performance. As shown in Figure 7, CSPNet enhances gradient-flow efficiency through cross-stage residual structures [29], while ELAN improves feature reuse and representation with multi-branch convolutional paths [30]. Building on these advantages, GELAN introduces flexible Any Block units and path combinations, which preserve feature-aggregation depth while reducing computational complexity. This balance of structural optimization and performance makes GELAN particularly suitable for real-time multi-object recognition in construction monitoring, where both lightweight design and robustness are essential. A toy implementation of this split-and-aggregate pattern is sketched below.
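To make the split-and-aggregate idea concrete, the following toy PyTorch block combines a CSP-style channel split with ELAN-style sequential branches whose intermediate outputs are all concatenated. It is a structural sketch under simplifying assumptions, not the exact GELAN implementation.

```python
# Simplified GELAN-style block: CSP split + ELAN-style multi-path aggregation.
import torch
import torch.nn as nn

def conv_bn_silu(c_in, c_out, k=1):
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, k, padding=k // 2, bias=False),
        nn.BatchNorm2d(c_out),
        nn.SiLU(),
    )

class GELANBlock(nn.Module):
    def __init__(self, c_in, c_out, c_hidden):
        super().__init__()
        self.stem = conv_bn_silu(c_in, 2 * c_hidden)          # CSP-style split
        self.branch1 = conv_bn_silu(c_hidden, c_hidden, k=3)  # "Any Block" unit 1
        self.branch2 = conv_bn_silu(c_hidden, c_hidden, k=3)  # "Any Block" unit 2
        self.fuse = conv_bn_silu(4 * c_hidden, c_out)         # aggregate all paths

    def forward(self, x):
        a, b = self.stem(x).chunk(2, dim=1)
        c = self.branch1(b)
        d = self.branch2(c)
        # Concatenating every intermediate path preserves gradient diversity (ELAN)
        return self.fuse(torch.cat([a, b, c, d], dim=1))

y = GELANBlock(64, 128, 32)(torch.randn(1, 64, 40, 40))  # -> (1, 128, 40, 40)
```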
To further confirm the superiority of this architecture in terms of lightweight design and feature-extraction performance, this paper uses the MS COCO dataset to measure the operating parameters of the YOLOv8 benchmark model when built on the MobileNet, CSP-Darknet, FPN, and GELAN architectures, respectively. The specific data can be seen in Table 1. The data show that, compared with feature-extraction architectures such as CSP-Darknet and FPN, or simple lightweight backbones such as MobileNet, the GELAN architecture selected in this paper achieves a better balance between robustness and running efficiency. While remaining lightweight, it captures global features better, enhances the model’s sensitivity to key areas, identifies the action details of prolonged violation behaviors more effectively, and improves recognition accuracy. Additionally, ablation results on the computing-module type and depth of the ELAN convolution layers show that different computing modules have relatively little impact on GELAN’s performance and that the architecture is not sensitive to network depth. It therefore offers higher device portability and system stability, making it well suited to deployment across different device tiers, including edge devices in construction scenarios.
2.3.3. Enhanced Context-Attention
In construction scenarios, prolonged violation behaviors also place high demands on the model’s ability to detect small targets. For instance, recognizing mobile phone use often requires capturing subtle details such as screen brightness or hand gestures. Taking into account the application scenario’s requirements for model size and running efficiency, this paper introduces the ECA module rather than heavier attention modules or an additional small-target detection head, in order to enhance the YOLOv8 baseline’s ability to capture and analyze detailed features. This avoids failures in recognizing specific targets caused by complex backgrounds or changing lighting conditions on construction sites and improves detection accuracy for small targets in the region of interest.
ECA, as a lightweight channel attention mechanism, aims to effectively model the correlations between different channels, thereby enhancing the feature representation capability of the channels without introducing dimension compression or a significant increase in inference cost [31]. As illustrated in Figure 8, its input is a three-dimensional feature tensor of size H × W × C, where H and W denote height and width and C represents the number of channels. The process begins with global average pooling (GAP), which compresses the spatial dimensions H and W into a single value per channel, forming a 1 × C channel descriptor vector. This vector is then processed by a 1D convolution to capture cross-channel interactions. Finally, the generated channel weights are multiplied with the original feature tensor channel by channel, completing the weighted update and outputting an optimized feature tensor of the same size as the input.
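The operation described above can be expressed compactly in PyTorch. The sketch below follows the standard ECA formulation, including its adaptive 1D kernel size, and is an illustration rather than the exact module configuration used in this work.

```python
# ECA sketch: GAP -> k-sized 1D conv across channels -> channel reweighting.
import math
import torch
import torch.nn as nn

class ECA(nn.Module):
    def __init__(self, channels, gamma=2, b=1):
        super().__init__()
        # Adaptive kernel size: nearest odd integer to log2(C)/gamma + b/gamma
        t = int(abs(math.log2(channels) / gamma + b / gamma))
        k = t if t % 2 else t + 1
        self.conv = nn.Conv1d(1, 1, kernel_size=k, padding=k // 2, bias=False)

    def forward(self, x):                       # x: (N, C, H, W)
        y = x.mean(dim=(2, 3))                  # GAP -> (N, C) descriptor
        y = self.conv(y.unsqueeze(1))           # 1D conv over the channel axis
        w = torch.sigmoid(y).squeeze(1)         # per-channel weights (N, C)
        return x * w[:, :, None, None]          # channel-wise reweighting

out = ECA(256)(torch.randn(1, 256, 20, 20))    # same shape as the input
```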
2.3.4. Atrous Spatial Pyramid Pooling
The SPPF (Spatial Pyramid Pooling-Fast) module used in the YOLOv8 baseline offers advantages in maintaining inference efficiency. However, when applied to complex tasks such as recognizing the behaviors of key construction personnel, it shows limitations in multi-scale perception. Prolonged violation behaviors at construction sites often involve body parts of varying sizes, frequent posture changes, as well as challenges such as occlusion and background interference, which place higher demands on feature extraction. To address these issues, this study replaces the baseline SPPF module with the ASPP module, aiming to strengthen multi-scale feature perception and improve adaptability to complex scenarios.
The Atrous Spatial Pyramid Pooling (ASPP) structure extracts multi-scale features by applying dilated convolutions with different dilation rates, enabling more comprehensive capture of spatial information across objects of various sizes and enhancing the model’s ability to represent multi-scale features and contextual semantics [32]. As shown in Figure 9, the ASPP module introduces multiple parallel branches with varying dilation rates, effectively expanding the receptive field to capture spatial features at different scales. In addition, by incorporating a global average pooling channel and a feature fusion mechanism, it strengthens the model’s capability to represent scene context. Compared with SPPF, which relies only on pooling with fixed windows, ASPP is better suited to handle challenges such as blurred behavior features, large scale variations, and confusion between targets and background in construction scenarios. As a result, it significantly improves the accuracy and robustness of recognizing complex prolonged violation behaviors.
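A minimal PyTorch sketch of this structure is given below; the dilation rates (1, 6, 12, 18) follow common ASPP practice and are assumptions, not necessarily the rates used in DGEA-YOLOv8.

```python
# ASPP sketch: parallel dilated convolutions plus a global-pooling branch.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ASPP(nn.Module):
    def __init__(self, c_in, c_out, rates=(1, 6, 12, 18)):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Conv2d(c_in, c_out, 3 if r > 1 else 1,
                      padding=r if r > 1 else 0, dilation=r)
            for r in rates
        ])
        self.gap = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Conv2d(c_in, c_out, 1))
        self.project = nn.Conv2d(c_out * (len(rates) + 1), c_out, 1)

    def forward(self, x):
        h, w = x.shape[2:]
        feats = [branch(x) for branch in self.branches]
        # Global-context branch, upsampled back to the feature-map size
        feats.append(F.interpolate(self.gap(x), size=(h, w), mode="bilinear",
                                   align_corners=False))
        return self.project(torch.cat(feats, dim=1))

y = ASPP(512, 256)(torch.randn(1, 512, 20, 20))  # -> (1, 256, 20, 20)
```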
2.4. Target Tracking Module
In real construction operations, significant challenges arise in object detection due to diverse worker identities and the difficulty of matching prolonged behaviors. These challenges are often compounded by occluded key information, interfering objects, and blurred action details, making it difficult to accurately identify critical prolonged violation behaviors based solely on single-frame images. To address this, this study introduces Multiple Object Tracking (MOT) technology [33], which enables continuous tracking of multiple workers’ behavioral trajectories on site. By modeling the temporal evolution of target states, MOT provides valuable support for more accurate judgment of violation behaviors.
The MOT task generally involves two core steps. First is object detection, where the proposed DGEA-YOLO detector identifies all targets and key behaviors present in the current frame. Second is object association, which matches detected targets with historical trajectories to establish stable and continuous temporal tracking paths. The integration of MOT not only enhances the system’s coherence in dynamic scenes but also significantly improves robustness in detecting prolonged and low-saliency behaviors such as “playing with mobile phones.”
The lightweight ByteTrack framework is one of the most widely adopted efficient algorithms for Multiple Object Tracking (MOT), developed primarily in conjunction with YOLO-series detection models [34]. Built on the Tracking-by-Detection paradigm, it repeatedly matches and updates detected targets throughout the tracking process, achieving both computational efficiency and robustness to occlusion while ensuring continuity of target trajectories. This makes it well suited to scenarios with frequent occlusions and overlapping target movements. The core idea of ByteTrack is a simple yet effective association strategy (BYTE) that retains every detection box, integrates information across frames, reduces ID switches, and suppresses background noise, thereby enhancing tracking stability and accuracy. For prolonged violation behaviors such as “playing with mobile phones,” where visual cues are subtle, ByteTrack shows particular advantages through its low-confidence detection-box retention mechanism: by combining both high- and low-confidence detections, it increases coverage, prevents trajectory interruptions caused by temporary occlusion or weak movements, and reduces detection loss. A simplified sketch of this two-stage association is given below.
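The following Python skeleton conveys the two-stage BYTE association idea. Real ByteTrack matches Kalman-predicted track boxes with Hungarian assignment; this sketch uses greedy IoU matching purely for illustration, and the data layout is an assumption.

```python
# Two-stage association sketch: high-confidence boxes first, then low-confidence.
def iou(a, b):
    # boxes as (x1, y1, x2, y2)
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / (union + 1e-9)

def byte_associate(tracks, detections, high_thr=0.5, iou_thr=0.3):
    high = [d for d in detections if d["score"] >= high_thr]
    low = [d for d in detections if d["score"] < high_thr]
    matches, unmatched = [], list(tracks)
    for pool in (high, low):             # stage 1: high boxes; stage 2: low boxes
        remaining = []
        for trk in unmatched:
            best = max(pool, key=lambda d: iou(trk["box"], d["box"]), default=None)
            if best is not None and iou(trk["box"], best["box"]) >= iou_thr:
                matches.append((trk, best))
                pool.remove(best)        # each detection is matched at most once
            else:
                remaining.append(trk)
        unmatched = remaining
    return matches, unmatched, high      # leftover high boxes seed new tracks
```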
The overall workflow of object detection and multi-object tracking used in this study is shown in Figure 10. First, the DGEA-YOLOv8 model serves as the target detector, independently detecting the two recognition classes, “person” and “phone,” to locate construction workers and identify key features. Subsequently, ByteTrack applies the Kalman filtering algorithm in subsequent frames to predict and correct the spatiotemporal trajectories of the identified individuals, and a matching strategy based on location and appearance features associates the current frame’s detections with targets from the previous frame, achieving continuous tracking of each individual’s identity and behavior. To establish a clear association between people and phones, this research adopts an association rule based on spatial neighborhood and bounding-box IoU, binding each “phone” detection box to the “person” that is spatially closest at the current moment and has the most stable trajectory, ensuring that the phone target always belongs to the same individual trajectory over the long time-series window (a sketch of this binding rule follows). This ensures that even when the monitoring viewpoint is not ideal or brief occlusions occur, trajectory information is updated stably and accurately.
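A hedged sketch of the person–phone binding rule is shown below; the IoU threshold, helper names, and data layout are illustrative assumptions rather than the exact implementation.

```python
# Bind each phone box to the best-overlapping person trajectory (sketch).
def box_iou(a, b):
    # boxes as (x1, y1, x2, y2)
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / (union + 1e-9)

def bind_phones_to_persons(phone_boxes, person_tracks, min_iou=0.05):
    """Return {phone_index: track_id} mapping each phone to one person track."""
    bindings = {}
    for i, phone in enumerate(phone_boxes):
        best_id, best_ov = None, min_iou
        for trk in person_tracks:           # trk: {"track_id": int, "box": tuple}
            ov = box_iou(phone, trk["box"])
            if ov > best_ov:
                best_id, best_ov = trk["track_id"], ov
        if best_id is not None:
            bindings[i] = best_id           # phone i follows this trajectory
    return bindings
```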
During tracking, ByteTrack classifies matching outcomes into three types: (1) a successful match, where the target continues to be tracked and its trajectory is updated; (2) an unmatched detection box, for which a new tracking trajectory is created; and (3) an unmatched existing trajectory, for which tracking is temporarily suspended. In the third case, the algorithm retains the trajectory for up to 30 frames; if the target is re-identified within this period, tracking resumes, otherwise the trajectory is permanently removed.
This mechanism is particularly well-suited for recognizing prolonged violation behaviors in dynamic and highly occluded construction environments. By strengthening the continuity of temporal modeling, ByteTrack enhances both the stability and accuracy of detecting violation behaviors under dynamic conditions. Integrated with the improved YOLOv8 model, it not only increases detection accuracy but also ensures target continuity in video tracking, providing a practical solution for real-time video data analysis and processing in construction scenarios.
2.5. Temporal Window-Based Behavior Decision Strategy
In this study, a “prolonged violation behavior” is defined as an action with a clear semantic pattern and a duration exceeding the minimal unit of human motion. Its temporal window is inherently driven by the semantics of the behavior itself and should therefore be determined in accordance with the operational environment and the intrinsic persistence of different behavior types. Considering that the target behaviors examined in this research (such as playing with a mobile phone, sleeping on duty, and leaving the workstation) exhibit distinct temporal characteristics in real construction scenarios, and taking into account the need for the detection-tracking system to absorb short-term fluctuations, this study defines sustained phone use lasting no less than 5 s as a prolonged violation requiring an alert. Such a duration is sufficient to cover a complete sequence of phone-operation actions while effectively distinguishing it from incidental hand movements or object shifts, thereby providing an objective and reproducible temporal basis for trajectory-level behavior recognition.
In practice, based on the stable phone detection results from the target detector and the cross-frame identity information produced by ByteTrack, this study develops a behavior decision mechanism using a sliding temporal window. According to the above temporal definition, the window length is set to 30 frames with a stride of 10 frames, enabling temporal aggregation of detection outputs along each individual trajectory. To reduce the influence of short-term occlusion, transient action changes, or sporadic false detections, five temporal windows are used as the decision unit. For each window, the detection frequency, average confidence, and longest continuous detection subsequence are calculated. Statistical analysis of real construction-site video segments shows that in typical phone-use scenarios, affected by occlusion, viewpoint shifts, and lighting variations, the detection frequency typically falls within 63–78%, the mean confidence ranges from 0.40 to 0.65, and the continuous visible duration is usually between 8 and 22 frames. Therefore, the decision thresholds are set as follows: detection frequency ≥ 60%, mean confidence ≥ 0.45, and continuous detection length ≥ 10 frames, ensuring coverage of the stable visibility patterns observed in most real-world cases. When five consecutive windows along the same trajectory satisfy these criteria, the system determines that the worker exhibits sustained phone-use behavior (a sketch of this decision rule follows). This strategy enables reliable prolonged behavior identification at low computational cost and meets the real-time requirements of on-site monitoring.
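The decision rule can be summarized in a few lines of Python. The sketch below encodes the window length, stride, and thresholds stated above; the per-frame confidence sequence (0.0 where no bound phone was detected) is an assumed input format.

```python
# Sliding-window decision sketch using the thresholds defined in the text.
WIN, STRIDE, N_WINDOWS = 30, 10, 5
FREQ_THR, CONF_THR, RUN_THR = 0.60, 0.45, 10

def window_ok(conf):
    hits = [c for c in conf if c > 0]
    if len(hits) / len(conf) < FREQ_THR:          # detection frequency >= 60%
        return False
    if sum(hits) / len(hits) < CONF_THR:          # mean confidence >= 0.45
        return False
    run = best = 0
    for c in conf:                                # longest continuous detection run
        run = run + 1 if c > 0 else 0
        best = max(best, run)
    return best >= RUN_THR                        # >= 10 consecutive frames

def sustained_phone_use(confidences):
    windows = [confidences[s:s + WIN]
               for s in range(0, len(confidences) - WIN + 1, STRIDE)]
    ok = [window_ok(w) for w in windows]
    # Alert only when five consecutive windows all satisfy the criteria
    return any(all(ok[i:i + N_WINDOWS]) for i in range(len(ok) - N_WINDOWS + 1))
```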
In complex construction environments, relying solely on the presence of a detected phone in a single frame would lead to frequent false alarms, such as when a phone briefly appears at the frame boundary, reflective objects resemble phone contours, or a worker temporarily holds a phone without actually operating it. To prevent these cases, this study integrates the cross-frame association mechanism between person and phone (Section 2.4) with the multi-window temporal statistics described above, thereby filtering out pseudo-behaviors in a progressive manner. With this multi-stage decision framework, the system can effectively distinguish “actual phone-use violations” from “visible but unused phones” and other non-violation conditions. When the phone appears only sporadically or fails to maintain a stable association with the same trajectory, the behavior recognition module does not trigger a violation alert, substantially reducing false-positive risks in real construction scenarios.
Compared with conventional video-based temporal modeling approaches such as ST-GCN, the proposed technical framework is more suitable for practical deployment on construction sites. On the one hand, construction-site cameras often exhibit variable viewpoints and complex backgrounds, while behaviors such as phone use evolve slowly and show stable cross-frame patterns; thus, high-dimensional temporal modeling is unnecessary for accurate and efficient risk alerts. On the other hand, the proposed method provides superior real-time performance and feasibility on edge devices compared with high-cost temporal networks. Built upon the DGEA-YOLOv8 detector and ByteTrack cross-frame tracking, and supported by the sliding temporal window strategy, the framework provides a lightweight, stable, and engineering-oriented decision mechanism that effectively satisfies the practical requirements of prolonged violation behavior recognition in construction environments.
2.6. Experimental Environment Construction and Evaluation Indicators
The experiments in this study were conducted on a dedicated workstation running Ubuntu 18.04 with a GeForce RTX 2080Ti GPU and 16 GB of memory, accelerated by CUDA 10.2 and cuDNN 7.6. The deep learning framework was PyTorch 1.8.0 with the corresponding torchvision 0.9.0, and Python 3.7 was the programming language. The detailed training parameters are shown in Table 2.
The performance of object detection models is mainly evaluated in terms of accuracy and speed. Accuracy is typically measured using metrics such as precision, recall, average precision (AP), and mean average precision (mAP), while speed is assessed through frames per second (FPS) and floating-point operations (FLOPs). In practice, mAP and FPS are often used as the key indicators of model performance. After training, the model is evaluated using three primary metrics: mean average precision (mAP), precision, and recall.
Precision (P) is the proportion of samples predicted as positive that are truly positive. It measures the accuracy of the model’s positive detections. Its expression is shown in Equation (4):

$$P = \frac{TP}{TP + FP} \quad (4)$$
Recall (R) is the proportion of actual positive samples that the model correctly predicts as positive. It measures whether the model is complete, i.e., whether it can find all real positive samples. Its expression is shown in Equation (5):

$$R = \frac{TP}{TP + FN} \quad (5)$$
TP represents the number of samples correctly predicted as positive, FN denotes positive samples incorrectly predicted as negative, and FP refers to negative samples misclassified as positive. Recall ranges from 0 to 1, with higher values indicating stronger recognition ability for positive samples. A recall value of 1 means the model successfully identifies all positive samples, whereas a value of 0 indicates it fails to recognize any of them.
The Precision-Recall (P-R) curve plots precision on the vertical axis against recall on the horizontal axis. A model with higher precision and recall demonstrates stronger accuracy and completeness, and thus better overall performance; the larger the area under the P-R curve, the better the model’s effectiveness. Average Precision (AP) is defined as the area under this curve, representing detection performance for each class. Its expression, shown in Equation (6), corresponds to the integral of the P-R curve:

$$AP = \int_{0}^{1} P(R)\,dR \quad (6)$$
Mean Average Precision (mAP) is the mean of the AP values over all categories and measures the overall detection performance of the model. Its expression is shown in Equation (7):

$$mAP = \frac{1}{N} \sum_{i=1}^{N} AP_i \quad (7)$$

where $N$ denotes the number of categories and $AP_i$ the average precision of the i-th category.
Frames Per Second (FPS) is an important index for evaluating the real-time performance of an algorithm. FPS represents the number of image frames the model can process in one second, reflecting the speed and efficiency of the algorithm. Its calculation is shown in Equation (8):

$$FPS = \frac{1}{t} \quad (8)$$
Here, t denotes the single-frame processing time, referring to the duration required for the model to complete forward inference on one input image. Factors such as network architecture, parameter size, and computational complexity significantly affect FPS. In general, as the model becomes more complex, with larger parameter counts and higher computational demands, the FPS decreases accordingly.
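For concreteness, the following numpy snippet works through Equations (4)–(8); all counts, curve points, and timings are illustrative assumptions for demonstration, not results from this study.

```python
# Worked example of the evaluation metrics in Equations (4)-(8).
import time
import numpy as np

TP, FP, FN = 90, 7, 10                 # illustrative detection counts
precision = TP / (TP + FP)             # Eq. (4)
recall = TP / (TP + FN)                # Eq. (5)

# Eq. (6): AP as the area under the P-R curve (trapezoidal approximation)
r = np.array([0.0, 0.3, 0.6, 0.9, 1.0])
p = np.array([1.0, 0.95, 0.92, 0.85, 0.60])
ap = np.trapz(p, r)

# Eq. (7): mAP averages AP over all classes (here, hypothetical person/phone)
map50 = np.mean([ap, 0.93])

# Eq. (8): FPS = 1 / t, with t the single-frame inference time
t0 = time.perf_counter()
_ = np.sort(np.random.rand(1_000_000))  # stand-in for one forward pass
fps = 1.0 / (time.perf_counter() - t0)
```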
4. Discussion
To further interpret the experimental results presented in Section 3, this section discusses how the proposed architectural components jointly contribute to the overall performance and suitability of the DGEA-YOLOv8 framework for prolonged violation behavior recognition. The ablation experiments indicate that each integrated module brings measurable improvements in detection accuracy, recall, and robustness to small and occluded targets. Specifically, DCNv3 enhances geometric adaptability under pose variation and partial occlusion, GELAN provides efficient multi-scale feature aggregation with lightweight complexity, and the ECA and ASPP modules reinforce fine-grained attention and contextual perception. The combined synergy of these components confirms that the proposed architecture is structurally coherent and effectively aligned with the challenges of recognizing prolonged violation behaviors in complex construction environments. These findings form the foundation for the following discussions on the system’s practical applicability and its ability to maintain stable performance under real-world conditions.
To further and intuitively analyze the performance and detection effect of the DGEA-YOLOv8 algorithm, this paper takes SSD, YOLOv5, and YOLOv8 as comparison baselines and evaluates them on five indicators: input size, precision, recall, mAP-50, and FPS. The performance comparison is shown in Table 5.
From the comparative experimental results, DGEA-YOLOv8 achieved the best performance in Precision, Recall, and mAP-50, reaching 94.3%, 93.7%, and 94.50%, respectively; compared with YOLOv8, these values improved by 2.2%, 1.9%, and 2.95%, at the cost of a 7.2 f/s decrease in FPS. Both YOLOv8 and YOLOv5 performed well in Precision and Recall, with YOLOv8 achieving 92.1% Precision, 91.55% mAP-50, and 109.5 f/s. In contrast, SSD showed lower Precision, Recall, and mAP-50 than YOLOv5, YOLOv8, and DGEA-YOLOv8. Overall, DGEA-YOLOv8 demonstrated superior accuracy while retaining sufficient inference speed, making it more suitable for the high-precision requirements of real construction scenarios.
Figure 13 further compares the relationship between training epochs and mAP-50 for YOLOv5, YOLOv8, DGEA-YOLOv8, and SSD. The results show that YOLOv5 improved rapidly at the early stage but plateaued at a lower final accuracy. YOLOv8 performed better initially, with a rapid increase in mAP-50, but its final result was slightly inferior to DGEA-YOLOv8. By contrast, DGEA-YOLOv8 exhibited steady and consistent improvement throughout training, achieving a final mAP-50 significantly higher than YOLOv5 and YOLOv8, demonstrating the best training effect. SSD, however, lagged behind with a slower curve and lower final accuracy. The subfigures further illustrate the mAP-50 trends at different stages, providing an intuitive comparison of learning curves and convergence rates. Overall, DGEA-YOLOv8 showed the most robust performance during training.
Figure 14 presents a comparison of PR curves between the proposed improved algorithm and other related models, providing an intuitive analysis of the relationship between Precision and Recall. The DGEA-YOLOv8 algorithm (blue line) consistently maintains high values for both precision and recall, with its curve positioned above the others, demonstrating a clear advantage in balancing accuracy and completeness. YOLOv8 (red line) and YOLOv5 (green line) intersect, but the larger area under YOLOv8’s PR curve indicates its superior performance over YOLOv5. In contrast, the SSD model shows the lowest curve, with precision and recall values significantly below the other three models. Overall, DGEA-YOLOv8 achieves the best balance between precision and recall, confirming its strong performance.
Figure 15 compares the performance of SSD, YOLOv5, YOLOv8, and DGEA-YOLOv8 across different evaluation metrics. Figure 15a shows the relationship between Recall (x-axis) and mAP-50 (y-axis). The distribution reveals that YOLOv8 and DGEA-YOLOv8 are located in the upper-right corner, indicating a better balance between mAP-50 and Recall and thus superior performance, whereas SSD lies in the lower-left region, reflecting its relatively poor performance. Figure 15b illustrates the relationship between FPS (x-axis) and mAP-50 (y-axis). The results show that DGEA-YOLOv8 achieves relatively high FPS while maintaining near-maximum mAP-50, demonstrating an effective trade-off between speed and accuracy. Overall, DGEA-YOLOv8 outperforms the other models in terms of both precision and efficiency, offering the best comprehensive performance.
5. Conclusions
This study explores the recognition of prolonged violation behaviors, such as “playing with mobile phones,” on construction sites by proposing the improved DGEA-YOLOv8 detection algorithm combined with the ByteTrack tracking framework, which demonstrates strong applicability in this domain. In model design, considering that workers’ behaviors often involve diverse postures, slow variations, and confusion with non-violation states, several enhancements were introduced: DCNv3 to improve recognition of geometric deformations; GELAN to balance feature representation and lightweight performance, ensuring feasibility on resource-constrained devices; ECA attention to strengthen responsiveness to small local targets; and ASPP to expand the receptive field, improving adaptability to posture, scale, and occlusion variations.
Furthermore, given the strong reliance of prolonged behavior analysis on trajectory continuity, the ByteTrack multi-object tracking algorithm was incorporated for behavior process modeling. This integration maintains tracking accuracy while effectively suppressing frequent ID switches and trajectory losses, thereby improving the stability and reliability of prolonged behavior analysis. Comparative experiments demonstrate that the proposed method can achieve both precise recognition and continuous tracking in real construction scenarios.
Overall, the experimental results—including ablation studies, robustness tests, visual tracking analysis, and temporal-window evaluations—jointly support the conclusion that the proposed DGEA-YOLOv8 framework effectively addresses the core challenges of prolonged violation behavior recognition in construction sites. These results substantiate the feasibility and practicality of integrating enhanced spatial detection with lightweight temporal reasoning for real-world safety monitoring. Future work can be further extended in several directions: expanding the range of behaviors to include phone calls, smoking, napping, or leaving posts, thereby achieving more comprehensive monitoring of prolonged violations; developing multimodal recognition systems that integrate video, audio, and location data to enhance accuracy and robustness under complex conditions; and exploring more efficient lightweight deployment strategies to improve real-time performance on edge devices, meeting the practical needs for low-power, high-accuracy behavior recognition in construction environments.