4.1. Experiment Settings
This study employed the Google Colab Pro+ platform, leveraging an NVIDIA L4 GPU to expedite the training of the deep learning models. The architecture of the proposed CNN-ResBiGRU model, along with various baseline deep learning models, was implemented in Python (version 3.11.13) using TensorFlow (version 2.17.1) and CUDA (version 12.4) as computational backends.
Several essential Python libraries supported the development and experimentation process (see the environment sketch after this list):
NumPy (version 1.26.4) and Pandas (version 2.2.2) were utilized for efficient data manipulation, including data acquisition, processing, and sensor-based analysis.
Matplotlib (version 3.10.0) and Seaborn (version 0.13.2) served as visualization tools to graphically present the results of exploratory data analysis and model performance.
Scikit-learn (version 1.5.2) was employed for data sampling and pre-processing during various phases of experimentation.
TensorFlow in combination with Keras (version 2.5.0) was used to construct and train the deep learning models.
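As a quick sanity check of this software stack, a minimal environment-verification snippet of the kind shown below can be run at the start of a Colab session; the import aliases and the GPU query are illustrative and not taken from the study's code.

```python
# Illustrative environment check for the software stack listed above.
import matplotlib
import numpy as np
import pandas as pd
import seaborn as sns
import sklearn
import tensorflow as tf

for name, module in [("NumPy", np), ("Pandas", pd), ("Matplotlib", matplotlib),
                     ("Seaborn", sns), ("scikit-learn", sklearn), ("TensorFlow", tf)]:
    print(f"{name}: {module.__version__}")

# Confirm that TensorFlow can access the GPU runtime provided by Colab.
print(tf.config.list_physical_devices("GPU"))
```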
To assess the effectiveness of the proposed methodology, comprehensive experimental analyses were carried out using the VTT-ConIoT dataset. A 5-fold cross-validation strategy was adopted to ensure the consistency and generalizability of the results.
Three separate experimental setups were designed to evaluate the model’s performance across different activity classification challenges:
Scenario I addressed a binary classification task by differentiating between recommended and non-recommended construction activities. This setup emphasized safety and ergonomic considerations by organizing the 16 activities into two overarching categories.
Scenario II focused on a task-oriented classification framework, wherein the 16 activities were grouped into six general functional clusters: painting, cleaning, climbing, hands-up work, floor-related tasks, and walking movements. This scenario aimed to assess the model’s capability to detect broader construction task types.
Scenario III evaluated the model’s proficiency in identifying all 16 distinct activity classes individually. This configuration represented the most detailed and complex classification challenge, intended to test the comprehensive discriminative ability of the proposed CNN-ResBiGRU approach.
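To make the three scenarios concrete, the sketch below shows one way the 16 activity labels could be projected onto the binary, six-class, and 16-class targets. The activity-ID-to-group assignment is a hypothetical placeholder (only the six group names and the 11-of-16 recommended split come from the text), not the official VTT-ConIoT mapping.

```python
# Hypothetical label projection for the three evaluation scenarios.
# The ID-to-group assignment below is a placeholder, not the dataset's mapping.
ACTIVITY_TO_GROUP = {                     # Scenario II: six functional clusters
    0: "painting", 1: "painting",
    2: "cleaning", 3: "cleaning",
    4: "climbing", 5: "climbing",
    6: "hands-up work", 7: "hands-up work",
    8: "floor-related", 9: "floor-related", 10: "floor-related",
    11: "walking", 12: "walking", 13: "walking", 14: "walking", 15: "walking",
}
RECOMMENDED_IDS = set(range(11))          # Scenario I: 11 of 16 activities are recommended

def scenario_labels(activity_id: int):
    """Return (Scenario I, Scenario II, Scenario III) labels for one activity ID."""
    binary = "recommended" if activity_id in RECOMMENDED_IDS else "non-recommended"
    group = ACTIVITY_TO_GROUP[activity_id]
    return binary, group, activity_id     # Scenario III keeps all 16 classes

print(scenario_labels(12))                # -> ('non-recommended', 'walking', 12)
```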
4.1.1. Cross-Validation Strategy
To assess the robustness and adaptability of our proposed method, we employed a participant-independent 5-fold cross-validation technique. The dataset, comprising sensor data from 13 distinct individuals, was partitioned in such a way that no participant’s data appeared simultaneously in both the training and testing subsets within any given fold.
In each fold, approximately 80% of the participants (10 of the 13 individuals) were randomly selected for the training set, while the remaining 3 participants (around 20%) formed the testing set. This design ensured that the model was evaluated on data from participants it had not encountered during training, thereby offering a more realistic estimate of its generalization capability in real-world scenarios.
Moreover, within each fold’s training set, we allocated 20% of the samples, chosen at random, to serve as a validation set. This subset was used for tasks such as tuning model hyperparameters and implementing early stopping criteria. Consequently, each fold utilized roughly 64% of the entire dataset for model training, 16% for validation, and 20% for final testing.
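A minimal sketch of this participant-independent protocol, using scikit-learn's GroupKFold, is given below. The window shapes, channel counts, and random validation split are illustrative assumptions; note that GroupKFold assigns two to three held-out participants per fold for 13 subjects, which approximates but may not exactly reproduce the 10/3 split used in the study.

```python
# Sketch of participant-independent 5-fold cross-validation with a nested
# validation split. All data shapes below are illustrative placeholders.
import numpy as np
from sklearn.model_selection import GroupKFold, train_test_split

rng = np.random.default_rng(0)
X = rng.standard_normal((1300, 400, 6))        # windows x timesteps x channels
y = rng.integers(0, 16, size=1300)             # 16 activity classes
subject_ids = np.repeat(np.arange(13), 100)    # participant ID for each window

gkf = GroupKFold(n_splits=5)
for fold, (train_idx, test_idx) in enumerate(gkf.split(X, y, groups=subject_ids)):
    # No participant appears in both the training and testing subsets of a fold.
    X_trainval, y_trainval = X[train_idx], y[train_idx]
    X_test, y_test = X[test_idx], y[test_idx]

    # Hold out 20% of the training windows at random for validation
    # (hyperparameter tuning and early stopping), as described above.
    X_train, X_val, y_train, y_val = train_test_split(
        X_trainval, y_trainval, test_size=0.2, random_state=fold)
```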
This cross-validation methodology was uniformly adopted across all experimental configurations, including variations in sensor placement and time window size, to ensure a consistent and unbiased comparison of results.
4.1.2. Baseline Performance
To accurately interpret our experimental outcomes, we first established baseline performance thresholds for each classification task. In Scenario I, which involves binary classification, we derived a baseline based on the distribution of activity classes. Specifically, within the VTT-ConIoT dataset, 11 out of 16 activities (equivalent to 68.75%) are categorized as recommended. As a result, a naive classifier that always predicts the majority class would achieve a baseline accuracy of 68.75%.
In Scenario II, corresponding to the six-class classification, the activities are distributed approximately evenly across the six functional groups. Therefore, a simple model that always predicts the most frequent class would achieve an expected accuracy of around 16.7%.
For Scenario III, which involves classifying 16 distinct activities, assuming a uniform distribution across all classes, a random guess would yield an accuracy of approximately 6.25%.
These baseline values represent the minimal performance thresholds against which the efficacy of our CNN-ResBiGRU model is assessed. They provide essential reference points for evaluating how well the proposed approach performs across varying sensor placements and segmentation window lengths.
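These baselines follow directly from the class distributions; the short arithmetic check below reproduces the quoted figures.

```python
# Majority-class and chance-level baselines quoted above.
print(11 / 16)   # Scenario I  (binary, majority class): 0.6875  -> 68.75%
print(1 / 6)     # Scenario II (six balanced classes):   ~0.1667 -> 16.7%
print(1 / 16)    # Scenario III (16 balanced classes):   0.0625  ->  6.25%
```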
4.2. Experimental Results
This section presents the experimental assessments carried out to evaluate the effectiveness of the proposed CNN-ResBiGRU model. The recognition performance was examined across three distinct activity classification tasks.
The experimental findings from Scenario I, as detailed in Table 4, highlight the impact of sensor placement and window length on the binary classification of construction activities into recommended and non-recommended categories. The classification outcomes varied depending on the location of the sensors, with each position presenting unique strengths.
Among the configurations, the sensor positioned on the back displayed consistently strong performance across the various window durations, with accuracy values ranging from 92.75% to 93.91%. With a 6 s window, this configuration achieved 93.85% accuracy together with balanced precision and recall scores of 92.16% and 91.53%, respectively. These results indicate that back-mounted sensors can capture key distinguishing features of physical activity in construction environments.
The hand sensor configuration emerged as the most effective single-sensor setup. It reached the highest observed accuracy of 95.03% when a 4 s window size was applied. This superior performance can be attributed to the hand’s sensitivity to motion, which enables it to register nuanced activity patterns typical in construction tasks. In addition, this sensor maintained high levels of both precision and recall across all window lengths, suggesting reliable classification capability.
Although the hip-mounted sensor exhibited slightly reduced performance relative to the back and hand placements, it still delivered satisfactory results. The best accuracy for this position was 93.22% using a 4 s window. The comparatively lower performance is likely due to the hip region displaying less differentiated motion patterns between recommended and non-recommended behaviors.
The most notable outcome was observed in the multi-sensor configuration labeled “All,” which combined data from multiple sensor locations. This fusion-based approach significantly outperformed the individual sensor configurations across all evaluation metrics. It achieved the highest accuracy of 97.32% using a 4 s window, alongside excellent precision (96.45%) and recall (96.43%) values. These results affirm the advantages of integrating multiple sensor inputs to capture more comprehensive information about worker activities.
Regarding the analysis of window durations, the 4 s window generally offered the most favorable balance across different sensor setups. This duration appears well-suited to capture the temporal characteristics of construction activities while maintaining computational efficiency. The reliability of the results is further supported by the low standard deviations observed across evaluation metrics, indicating the consistent and stable performance of the CNN-ResBiGRU model.
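For reference, window lengths such as the 4 s segments discussed above can be produced with a simple fixed-length slicing routine like the sketch below; the 100 Hz sampling rate and 50% overlap are placeholder assumptions, not values taken from the experiments.

```python
# Sketch of fixed-length window segmentation. Sampling rate and overlap
# are illustrative placeholders.
import numpy as np

def segment(signal: np.ndarray, window_s: float, fs: int = 100, overlap: float = 0.5):
    """Slice a (timesteps, channels) signal into overlapping fixed-length windows."""
    win = int(window_s * fs)
    step = max(1, int(win * (1.0 - overlap)))
    starts = range(0, signal.shape[0] - win + 1, step)
    return np.stack([signal[s:s + win] for s in starts])

dummy = np.random.randn(6000, 3)          # 60 s of a 3-channel stream at 100 Hz
print(segment(dummy, window_s=4).shape)   # -> (29, 400, 3)
```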
The results from Scenario II (Table 5), which focused on classifying construction activities into six functional groups, confirm the effectiveness of the CNN-ResBiGRU model across various sensor placements and window durations. The back sensor showed strong and consistent performance, with accuracy ranging from 91.01% to 92.27%. Its best results were with 8 s windows, achieving 92.27% accuracy, 92.83% precision, and 92.17% recall, indicating its suitability for capturing full-body movements.
The hand sensor, while slightly less accurate than in binary classification (Scenario I), still performed well, particularly with 4 s windows—yielding 91.87% accuracy, 91.90% precision, and 92.01% recall. The minor decline may be due to the increased difficulty of distinguishing six classes rather than two.
The hip sensor showed a distinct pattern. Although generally lower in performance than the back and hand placements, it achieved its highest accuracy (91.90%) with 8 s windows. This suggests it benefits from longer temporal contexts to recognize complex activities.
The multi-sensor setup (“All”) delivered the highest performance overall. With just a 2 s window, it reached 97.14% accuracy, along with precision and recall values of 97.22% and 97.20%, respectively. Notably, its accuracy remained above 95% across all window sizes, demonstrating the robustness of sensor fusion and its potential for fast and reliable classification.
Although standard deviations were slightly higher than in Scenario I, they remained low overall, indicating stable model behavior despite the increased task complexity. These findings highlight the model’s effectiveness in multi-class recognition tasks. They also suggest that while individual sensors are adequate, combining multiple sensors greatly enhances recognition performance—particularly for real-time applications requiring rapid activity classification.
Table 6 reports the findings from Scenario III, which involves the most complex task—classifying all 16 individual construction activities. The results highlight noticeable variations in model performance across different sensor types and window durations, illustrating the challenges of fine-grained activity recognition.
The back sensor exhibited moderate effectiveness, with accuracy ranging from 80.48% to 83.40%. Its highest performance occurred with a 4 s window, achieving 83.40% accuracy, 84.01% precision, and 83.38% recall. Compared to previous scenarios, this decrease in accuracy is expected due to the increased classification complexity. Moreover, performance declined as the window lengthened, indicating that shorter windows more effectively capture distinct movement patterns.
The hand sensor slightly outperformed the back sensor, with accuracy ranging from 78.36% to 83.09%. Its optimal result, also at 4 s, yielded 83.09% accuracy, 83.81% precision, and 83.08% recall. However, performance dropped when using windows longer than 6 s, suggesting that brief time frames are more suitable for recognizing hand-based activities.
The hip sensor showed the lowest accuracy among all single-sensor setups, ranging from 76.96% to 82.61%. Still, its best performance also occurred at the 4 s window. This consistent trend across all placements reinforces the conclusion that a 4 s window offers a balanced duration for capturing relevant features while preserving temporal specificity.
The multi-sensor configuration (“All”) significantly outperformed the individual sensors. It reached its highest performance with a 4 s window, recording 98.68% accuracy, 98.73% precision, and 98.69% recall. The performance gap between this setup and the single-sensor configurations was wider than in earlier scenarios, emphasizing the growing importance of sensor fusion as classification complexity increases. Notably, even with extended windows, the multi-sensor approach maintained high reliability, though peak accuracy was still achieved with shorter durations.
Standard deviations were slightly higher than in Scenarios I and II, which aligns with the increased difficulty of differentiating between 16 distinct classes. The multi-sensor model exhibited greater consistency, showing less variability than single-sensor alternatives.
Therefore, these results confirm that fine-grained recognition tasks challenge single-sensor systems, but the CNN-ResBiGRU model with multi-sensor fusion provides a highly accurate and stable solution. The 4 s window consistently delivers optimal performance, and combining data from multiple sensors proves essential for high-precision, detailed activity classification in construction environments.
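One plausible realization of the “All” configuration is early fusion, i.e., concatenating synchronized windows from the back, hand, and hip sensors along the channel axis before they enter the CNN-ResBiGRU. The sketch below illustrates this idea with placeholder shapes; the fusion mechanism actually used in the experiments may differ.

```python
# Hedged sketch of early sensor fusion for the "All" configuration.
# Shapes are illustrative: (num_windows, timesteps, channels_per_sensor).
import numpy as np

back = np.random.randn(128, 400, 6)   # e.g. 3-axis accelerometer + gyroscope
hand = np.random.randn(128, 400, 6)
hip = np.random.randn(128, 400, 6)

fused = np.concatenate([back, hand, hip], axis=-1)   # -> (128, 400, 18)
print(fused.shape)
```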
4.3. Comparison Results with State-of-the-Art Models
To validate the performance and advantages of the proposed CNN-ResBiGRU architecture in the context of CWAR, we conducted an extensive comparative evaluation against state-of-the-art methods using the publicly available VTT-ConIoT benchmark dataset. By employing identical experimental setups and standardized evaluation protocols, we ensured a fair and consistent comparison across all models.
To maintain methodological integrity and avoid bias, we reproduced the experimental procedures established in the baseline study [19]. The evaluation process involved segmenting each pre-processed one-minute signal using fixed-duration sliding windows of 2, 5, and 10 s with a one-second overlap between consecutive windows. For model validation, we adopted a strict leave-one-subject-out (LOSO) cross-validation framework. In this scheme, for a dataset comprising N participants, N separate models were independently trained and assessed. Each iteration excluded one subject from the training set, using their data solely for testing, while the remaining participants contributed to the training set.
This validation strategy provides a robust measure of the model’s ability to generalize across unseen individuals, effectively addressing the challenge of inter-subject variability that commonly affects human activity recognition systems. Through the LOSO process, a total of thirteen models per classification algorithm and data modality were generated and evaluated.
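The LOSO loop can be expressed compactly with scikit-learn's LeaveOneGroupOut, as in the sketch below. X, y, and subject_ids are placeholders as in the earlier cross-validation sketch, and build_cnn_resbigru() stands in for a compiled Keras model with an accuracy metric; the name is hypothetical and not taken from the study's code.

```python
# Sketch of the leave-one-subject-out (LOSO) protocol used for comparison.
from sklearn.model_selection import LeaveOneGroupOut

logo = LeaveOneGroupOut()                 # 13 participants -> 13 folds
fold_accuracies = []
for train_idx, test_idx in logo.split(X, y, groups=subject_ids):
    model = build_cnn_resbigru()          # hypothetical model constructor
    model.fit(X[train_idx], y[train_idx], epochs=50, batch_size=64, verbose=0)
    _, accuracy = model.evaluate(X[test_idx], y[test_idx], verbose=0)
    fold_accuracies.append(accuracy)

print(sum(fold_accuracies) / len(fold_accuracies))   # mean LOSO accuracy
```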
Although this approach captures the nuanced behavioral patterns of individual construction workers, the aggregate performance metrics offer a clear representation of the model’s overall generalization strength. To further examine model reliability, we calculated and reported classification accuracy across all test subjects. These results reveal both the variability and consistency of the model when applied to participants whose data were not included in the training set.
Table 7 provides a detailed comparative analysis between the proposed CNN-ResBiGRU framework and the traditional SVM baseline under varying segmentation window lengths and classification complexities. The experimental findings indicate that our deep learning model consistently delivers superior performance across all tested configurations.
In the six-class classification scenario, the CNN-ResBiGRU model achieved recognition accuracies of 90.24%, 93.46%, and 95.79% for window lengths of 2, 5, and 10 s, respectively. In contrast, the corresponding SVM baseline produced accuracies of 86%, 89%, and 94% under the same conditions. This translates to improvements of 4.24, 4.46, and 1.79 percentage points across the respective window sizes, underscoring the effectiveness of our architecture in moderately complex classification tasks.
More notably, the advantages of our model become even more pronounced in the more complex 16-class classification task. The CNN-ResBiGRU achieved accuracies of 76.10%, 81.60%, and 91.20% for window sizes of 2 s, 5 s, and 10 s, respectively. In comparison, the baseline SVM model reached only 72%, 78%, and 84%. These results correspond to gains of 4.10, 3.60, and 7.20 percentage points, demonstrating the model's robustness and scalability in addressing high-dimensional, multi-class activity recognition challenges.