Article

Moving Toward Automated Construction Management: An Automated Construction Worker Efficiency Evaluation System †

1 School of Management Science & Real Estate, Chongqing University, Chongqing 400044, China
2 School of Civil Engineering, Chongqing University, Chongqing 400044, China
* Author to whom correspondence should be addressed.
This paper is an extended version of Mao, C.; Zhang, C.; Liao, Y.; Zhou, J.; Liu, H. An Automated Vision-Based Construction Efficiency Evaluation System with Recognizing Worker’s Activities and Counting Effective Working Hours. In Advances in Information Technology in Civil and Building Engineering, ICCCBE 2024; Francis, A., Miresco, E., Melhado, S., Eds.; Lecture Notes in Civil Engineering; Springer: Cham, Switzerland, 2025.
Buildings 2025, 15(14), 2479; https://doi.org/10.3390/buildings15142479
Submission received: 25 May 2025 / Revised: 20 June 2025 / Accepted: 6 July 2025 / Published: 15 July 2025

Abstract

In the Architecture, Engineering, and Construction (AEC) industry, traditional labor efficiency evaluation methods have limitations, while computer vision technology shows great potential. This study aims to develop a potential automated construction efficiency evaluation framework. We propose a method that integrates keypoint processing and extraction using the BlazePose model from MediaPipe, action classification with a Long Short-Term Memory (LSTM) network, and construction object recognition with the YOLO algorithm. A new model framework for action recognition and work hour statistics is introduced, and a specific construction scene dataset is developed under controlled experimental conditions. The experimental results on this dataset show that the worker action recognition accuracy can reach 82.23%, and the average accuracy of the classification model based on the confusion matrix is 81.67%. This research makes contributions in terms of innovative methodology, a new model framework, and a comprehensive dataset, which may have potential implications for enhancing construction efficiency, supporting cost-saving strategies, and providing decision support in the future. However, this study represents an initial validation under limited conditions, and it also has limitations such as its dependence on well-lit environments and high computational requirements. Future research should focus on addressing these limitations and further validating the approach in diverse and practical construction scenarios.

1. Introduction

The Architecture, Engineering, and Construction (AEC) industry is one of the largest industries in the world, with annual spending of USD 10 trillion, accounting for approximately 13% of global GDP [1,2,3]. In recent decades, the AEC industry has significantly lagged behind other industries in terms of productivity [4,5,6,7]. Despite substantial investments in new technologies, which have enhanced the efficiency of design, planning, and construction, the industry still faces numerous severe challenges, including increasing project complexity, tight deadlines, cost control pressures, and the critical tasks of ensuring safety and quality [8,9,10,11]. As a labor-intensive industry, construction depends heavily on labor efficiency, which is crucial to the success or failure of projects. Workers’ behavior directly impacts the progress and cost of a project [12,13]. Therefore, accurately monitoring workers’ work status is essential, as it allows construction managers to access real-time information about workers’ conditions and adjust their strategies accordingly. This has become a key factor in driving construction project performance.
To monitor the labor status of workers on construction sites effectively, a more automated efficiency evaluation method must be adopted. Such a method supports the real-time monitoring and analysis of construction sites, allowing potential issues to be addressed through timely recognition of the labor force [14]. The inefficiency of traditional methods stems from the lack of systematic and automated monitoring tools [15]. The manual recording of construction progress data often produces discrepancies between recorded and actual progress, hindering accurate efficiency evaluation and problem prediction. Additionally, the commonly used work sampling technique, which observes only a single worker at a time, suffers from inefficiency and low reliability [16].
Computer vision technology demonstrates great potential in the automatic monitoring and evaluation of construction activities. By utilizing advanced video processing algorithms and machine learning models, relevant worker actions can be automatically identified from construction site videos. This makes construction labor efficiency evaluation more objective and accurate. However, current labor recognition in the construction industry largely focuses on basic actions like walking and standing, with professional action recognition being mostly carried out in laboratory settings. This study aims to develop a system that utilizes action recognition to address the issue of construction labor efficiency monitoring. The system is applicable to various construction scenarios, providing continuous and comprehensive monitoring, improving information collection efficiency, ensuring timely identification and resolution of on-site problems, and ultimately meeting the needs of future unmanned construction.
Given the substantial potential of computer vision technology in automated construction monitoring and real-time evaluation, this study leverages this method to address the limitations of traditional methods. By utilizing advanced video processing algorithms and machine learning models, we aim to enhance the objectivity and accuracy of construction efficiency evaluation.
The remainder of this paper is organized as follows. Sections 1 and 2 outline the background and motivation for the study. Sections 3 and 4 describe the methodology and verify its feasibility using a single construction scenario. Sections 5 and 6 analyze the experimental data and research findings, summarize the main content of the paper, and offer prospects for future research directions.

2. Related Work

In the construction industry, traditional labor efficiency evaluation methods typically rely on manual recording and work sampling. However, with the development of modern technologies, especially computer-based methods, the accuracy and efficiency of evaluations have been significantly improved. This section reviews the main methods currently used for labor efficiency evaluation in the construction industry, analyzes the limitations of traditional methods and the shortcomings of current approaches, and provides the rationale for selecting the computer vision and activity recognition models used in this study.

2.1. Labor Efficiency Evaluation in Construction

In construction labor assessment, work sampling, a probability-based observation method, has been widely used for decades [1,17,18]. However, in the current context of the construction industry, where precise efficiency evaluation is required, this method is no longer sufficient [19,20]. The method involves randomly sampling observations of work time and non-work time, thereby inferring the overall work efficiency during a given period. The advantages of work sampling are its low cost and the absence of a need for continuous monitoring [9,21,22]. However, its accuracy is severely affected by sample size and sampling frequency. Work sampling is also time-consuming, as it typically requires long periods of personnel deployment at construction sites; it is therefore a labor-intensive task that requires several people to be present on-site to collect data [23,24,25]. Consequently, the subjectivity of the observers can have a significant impact on the results, as different observers may classify the same instance differently.
Currently, automated methods can address the issues of frequent data sampling and labor intensity. However, most of them merely classify the types of labor performed by construction workers in order to measure productivity [12,20,26]. For example, Luo et al. (2018) captured the working state of workers through surveillance videos, helping managers accurately quantify and measure labor productivity [15,16,27]. They classified workers’ activities into three modes: productive, semi-productive, and non-productive. Observers recorded the results under the observed activity, and all records were then summed to calculate the percentage of activities in each mode, representing productivity. Jacobsen et al. (2023), on the other hand, classified the productivity indicators of construction site workers [21]. Their method uses a combination of CNN and LSTM to classify multivariate time-series data collected from five inertial measurement units installed on workers, without any manual feature engineering. In their study, the indicators are overall work sampling metrics: direct work, indirect work, and waste. Unlike most other research focusing on computer vision, this study uses motion sensors and collects 3D acceleration and angular velocity data instead of videos or photos, thereby eliminating the main limitation of computer vision, which is occlusion [28,29,30,31]. This allows the system to monitor from the start to the end of the project [8]. Akhavian et al. (2016) used smartphone sensors as data collectors and fed the collected data into machine learning classification algorithms to detect and distinguish various types of human activities [1,32]. While their research uses the worker’s idle/busy state to provide the information needed for productivity analysis, it goes further by extracting more specific and accurate knowledge about the different activities performed by construction workers [11,33]. For example, in their experiment, wheelbarrow work was categorized into Loading, Pushing, Unloading, Returning, and Idling actions.
In conclusion, current labor monitoring methods require data collection for action recognition to infer productivity indicators. During the action recognition phase, detailed action classification of workers not only provides productivity inference information but also assists in further improving safety and ergonomics analysis for future construction worker activity recognition systems.

2.2. Recognition of Construction Worker Activities by Computer Vision

In the construction field, accurately identifying worker activities is crucial for improving construction efficiency, ensuring worker safety, and optimizing project management [3,34,35,36]. With continuous technological advancements, activity recognition methods have also been evolving.
In the early days, activity recognition in the construction field primarily relied on the combination of sensors and machine learning [27,30,37,38]. IoT-based systems focused on target objects equipped with electronic sensors, analyzing their activity states through information from mobile sensors such as accelerometers, gyroscopes, and GPS [3]. For example, Akhavian et al. (2015) successfully identified earthmoving equipment activities using mobile sensors and machine learning classifiers [1,32,39]. Around 2005, direct angle measurement systems based on electronic angle-measuring instruments could accurately measure joint angles on construction sites, exhibiting high resolution and accuracy [40]. However, their limitations lay in measuring only two-dimensional motion and in cumbersome operation. Indirect angle measurement systems use image-based systems (e.g., depth sensors and stereo cameras) or inertial measurement units (IMUs) to track human movement. In the early 21st century, depth sensors like Kinect™ cameras became widely used for tracking workers’ body movements, with applications ranging from activity classification to biomechanics evaluation and safety assessment [41,42,43,44]. Furthermore, around 2010, IMUs were used for full-body tracking and classifying construction activities. For instance, Joshua and Varghese (2010) used a wired accelerometer installed at the waist of a mason, combined with machine learning algorithms, to classify three types of masonry activities with an accuracy of 80% [45]. In 2016, Ryu et al. used wristbands equipped with accelerometers for activity classification in masonry work, achieving over 97% accuracy with a multi-layer perceptron neural network [46].
Since 2010, pose-based action recognition technologies have made significant progress in construction productivity monitoring. Early studies used wearable sensors to capture workers’ body postures. For example, Joshua and Varghese (2010) attached accelerometers to bricklayers and applied machine learning to classify masonry work into three activity categories, achieving about 80% accuracy [45]. Around the same time, the introduction of depth cameras such as Microsoft Kinect enabled on-site tracking of skeletal joints, and research began to use 3D skeleton data for activity classification and safety assessment [3]. With advances in computer vision, visual-based pose recognition gradually emerged. Escorcia et al. (2012) applied RGB-D cameras to automatically recognize specific activities of construction workers indoors [35]. Luo et al. (2018) utilized two-stream convolutional neural networks on construction site videos to automatically classify activities as “productive,” “semi-productive,” or “non-productive,” thus enabling more objective and efficient work sampling [15]. Researchers have also focused on using visually extracted keypoints to represent worker posture. Alwasel et al. (2017) used machine learning to identify “safe and efficient” typical postures of bricklayers, relating specific poses to productivity and safety outcomes [14]. Similarly, Nath et al. (2017) analyzed workers’ body postures using wearable inertial sensor data to assess ergonomic and efficiency factors during tasks [33]. More recently, advanced skeleton-based action recognition methods have emerged. For example, Tian et al. (2024) proposed a lightweight deep learning model that uses selected keypoint combinations to recognize specific construction actions [10]. Jacobsen et al. (2023) input kinematic skeleton data into deep learning models to automatically estimate labor input and productivity [21]. Another study demonstrated that applying action recognition to construction site training can also effectively assess productivity (Cheng et al., 2023) [26]. In summary, pose-based action recognition has been increasingly integrated into construction productivity monitoring, evolving from sensor-dependent posture classification to approaches that combine computer vision skeleton extraction with deep learning, substantially improving the accuracy and objectivity of on-site labor monitoring.
With technological advancements, computer vision has become increasingly important in construction worker activity recognition. The first convolutional neural network (CNN) models to emerge in the construction field were primarily region-based CNNs for detecting construction workers and equipment from RGB data and identifying unsafe activities or objects [47,48]. However, this approach had limitations because it processes video frames or images individually, failing to establish connections between objects across consecutive frames. For example, Fang et al. (2020) used a region-based CNN to detect construction workers and equipment from RGB data. Subsequently, the YOLO (You Only Look Once) algorithm was introduced into the construction field [36]. As a real-time object detection system, YOLO can quickly detect and classify multiple objects. In construction scenes, YOLO has been used to identify and classify tools and materials, providing strong support for a comprehensive understanding of the construction environment; for example, Zhang et al. (2018) proposed a real-time deep learning method based on YOLO for object detection in construction scenes [8]. The BlazePose model in the MediaPipe library performs excellently in keypoint processing and extraction. It can detect 33 key points on the human body in images and video streams, including 2D coordinates and relative depth information. Combined with a multi-object tracking framework that uses bounding box, motion, and appearance information, workers can be tracked to provide key data for analyzing construction activities. Recent studies have used keypoint extraction algorithms, such as MediaPipe and BlazePose, to estimate human poses in videos, and have further employed deep learning models, including LSTM and convolutional neural networks, for action recognition [15,32,49,50,51,52]. While increasing the complexity of network architectures or the number of subjects can make these methods computationally intensive, keypoint-based pose estimation techniques have proven effective for activity recognition tasks in various applications, including sign language recognition and real-time body tracking [50,51,52].
The advent of depth cameras has brought new opportunities for activity recognition in the construction field. Various types of depth cameras, such as structured light depth cameras, stereo depth cameras, and time-of-flight (ToF) depth cameras, have been applied in construction to generate distance images or 3D point clouds. For example, Kinect V1, a typical structured light depth camera, uses infrared light patterns to estimate depth and has been applied to estimate the joint position trajectories of construction workers [3]. Stereo depth cameras, which mimic the human binocular vision system to perceive depth, were used in a 2018 study that built a stereo setup from two ordinary smartphones to estimate 3D poses. Kinect V2, which includes a ToF camera, uses infrared light for 3D pose collection to detect unsafe behaviors, with applications starting around 2015 [3,6,34,53,54,55,56]. Additionally, LiDAR and radar sensors, as well as ToF depth cameras, have been used to recognize or estimate human poses from 3D point clouds [20,35,57,58,59].
Activity recognition in the construction field involves analyzing and determining workers’ actions, postures, interactions with objects, and group activities to identify the specific activity state of workers on the construction site. Algorithms are the specific methods and techniques for achieving activity recognition, and different algorithms have unique characteristics and applicable scenarios. For instance, CNNs are used for feature extraction and image classification to detect construction workers and equipment, YOLO can quickly and accurately identify tools and materials in construction scenes, and LSTM (Long Short-Term Memory) networks are suited to processing time-series data, enabling accurate action classification by analyzing the temporal features of worker movements [29]. Frameworks are software platforms that integrate multiple algorithms and tools to provide a complete solution for activity recognition. MediaPipe (version 0.8.7), for example, is a powerful framework designed for computer vision tasks that integrates multiple advanced deep learning models and algorithms, such as BlazePose for human pose estimation and keypoint processing. It performs well in construction activity recognition, provides efficient computation and processing capabilities, and meets the real-time requirements of construction sites [52]. In addition, MediaPipe is easy to use and integrate, providing rich interfaces and tools for rapid development and deployment, which allows developers to quickly integrate it into existing construction management systems for real-time monitoring and analysis of worker activities.
The advantages of choosing camera-based monitoring for recognition lie in its real-time nature, simplicity, and non-invasiveness. Widely distributed surveillance cameras on construction sites can capture images and videos in real time, providing a rich data source for activity recognition and enabling real-time monitoring. Such systems are easy to install and use and do not interfere with workers’ normal activities [30,60,61]. Compared to contact-based sensors, monitoring systems are non-invasive: workers do not need to wear sensors or markers, avoiding the discomfort and frustration caused by sensor usage, which makes the approach more acceptable to workers and enhances the system’s practicality and feasibility.
The basis for choosing the LSTM algorithm lies in its advantage in processing time-series data. In construction activity recognition, workers’ movements are typically continuous time-series data, and LSTM can effectively learn and extract temporal features of movements, enabling accurate action classification. By inputting the time-series data of detected keypoints into the LSTM network, the system can capture the duration and dynamic changes of actions, providing a more detailed analysis of worker behavior. At the same time, compared with other machine learning models, LSTM can better utilize the temporal features in the video, where the results of the previous frame affect the next frame, allowing for more accurate action recognition [50]. The output of the LSTM network provides a probability distribution of possible actions, enabling accurate classification of worker actions and improving the accuracy of activity recognition.
The basis for choosing the MediaPipe framework lies in its powerful functionality and performance. MediaPipe is a powerful framework specifically designed for computer vision tasks, integrating various advanced deep learning models and algorithms. BlazePose, used for human pose estimation and keypoint processing, performs well in construction activity recognition, offering high performance and accuracy. It provides efficient computation and processing capabilities, quickly processing image and video data to meet the real-time demands of construction sites. Additionally, MediaPipe is easy to use and integrate, offering a rich set of interfaces and tools that allow developers to quickly develop and deploy solutions. Developers can easily integrate MediaPipe into existing construction management systems to enable real-time monitoring and analysis of worker activities [35]. The framework’s ease of use lowers development costs and technical barriers, making it more accessible to construction companies looking to apply computer vision technology for activity recognition and management. Given that construction sites typically use RGB cameras instead of depth cameras, the BlazePose model in MediaPipe infers relative depth information from standard RGB camera data, providing a practical solution for construction site environments [35,62]. The multi-object tracking algorithm framework, combined with bounding box information, motion information, and appearance information, can effectively handle the complex environment of construction sites, tracking and analyzing multiple workers.
At present, mainstream human action recognition models have achieved very high accuracy on public benchmark datasets. For instance, the classical two-stream convolutional neural network (CNN) architecture proposed by Simonyan and Zisserman (2014) is specifically designed to recognize human actions in video clips by simultaneously processing RGB frames and optical flow [63]. This model achieved an accuracy of about 88% on the UCF101 dataset, which consists of 13,320 video clips categorized into 101 distinct human action classes such as “Basketball Dunk”, “Biking”, “Cliff Diving”, and “Handstand Walking.” Feichtenhofer et al. (2016) further enhanced the model by integrating spatial and temporal features at multiple levels, raising the accuracy to about 92.5% on the UCF101 dataset [64]. Temporal convolutional recurrent network models, which combine CNN feature extraction with long short-term memory (LSTM) units, have also demonstrated strong performance. For example, Donahue et al. (2015) applied LSTM on top of CNN features for sequential modeling of actions in videos, reporting up to 93.6% accuracy on UCF101 [65]. These models are capable of recognizing complex temporal dynamics, such as differentiating between actions like “Playing Piano” and “Playing Guitar” over multiple frames. Early 3D convolutional neural networks (C3D) directly model spatiotemporal features from short video clips, enabling the recognition of motion-dependent actions such as “Drumming” or “Front Crawl.” As shown by the study of Tran et al. (2015), C3D achieved approximately 85% accuracy on UCF101 [65,66]. More advanced deep architectures, such as Inflated 3D ConvNets (I3D), introduced by Carreira and Zisserman (2017), leverage large-scale pre-training on datasets like Kinetics and then fine-tune on benchmarks [67]. I3D achieved an impressive 98.0% accuracy on UCF101 and 80.9% on the more challenging HMDB51 dataset, which features 6766 videos across 51 diverse human action categories. Collectively, these results demonstrate that modern deep learning models are highly effective at recognizing a wide range of human actions—from sports and daily activities to complex group interactions—on large-scale, well-annotated video datasets [67].
In summary, the activity recognition methods in the construction field have evolved from early sensor-based machine learning methods to later computer vision technologies, continually improving the accuracy and practicality of recognition. Monitoring, as a non-invasive activity recognition method, offers advantages such as real-time nature and simplicity. The choice of LSTM algorithm and MediaPipe framework for activity recognition has solid foundations, providing effective solutions for construction companies to improve construction management and ensure worker safety. With continuous technological advancements, activity recognition methods in the construction field will keep innovating and improving, contributing to the development of the construction industry.

2.3. Construction Scene Theory

Although the recognition of construction worker activities has provided significant help for automated labor statistics in the construction field, most studies focus only on statistical methods for construction worker actions, treating activity recognition results as the sole conclusion. These studies lack further development and utilization of the data. Moreover, most research is concentrated on single scenes or laboratory environments. In response, we introduce scene theory into activity recognition to assist our analysis [68].
The concept of construction activity scenes comes from engineering practice and is a generalization of the real working state within a specific scope of the construction site. There are three main types of classification: the first type represents a generalization of a specific construction task (e.g., masonry work, rebar work), known as construction activity; the second focuses on a specific construction procedure (e.g., wall curing, bricklaying), referred to as direct construction work; and the third emphasizes actions driven by workers or machines, referred to as construction operations [68]. In this study, we primarily adopt the first classification to define construction activity scenes. Regardless of the classification level, the overall construction activity scene consists of multiple interconnected entities, relationships, and attributes, among other scene elements. In the intelligent era, a construction scene is defined as the totality of one or more physical objects, the relationships between these objects, and their attributes as contained in visual data related to construction operations. These scene elements, centered around the same construction activity and subject, include entities such as workers, equipment, materials, relationships (e.g., cooperation or coexistence between entities), and attributes of the entities (e.g., color, quantity, shape) [15,16,28,38].
To determine the indicators related to construction productivity in a construction activity scene, it is necessary to deconstruct the ontology of the scene. Ontological decomposition specifically refers to the structural breakdown of content in visual data based on visually identifiable objects and clarifying the relevance of each visible object to the subject and the relationships between objects [68]. A construction activity scene must break down the scene elements related to construction production activities in the visual data into entities, relationships, and attributes according to unified rules. Combined with the relevant work mentioned in construction labor productivity, this process mainly involves pre-classifying workers based on attributes in the construction activity scene and then analyzing the entities and their attributes to influence construction labor productivity.
In scene theory, activity recognition is an automated process aimed at matching the relationships between entities in visual data with predicted activity names. In construction management, it is primarily achieved by automatically analyzing the interactions or coexistence relationships between workers, construction tools, and operational objects in activity scenes. In existing scene analyses, the “workface” concept proposed by Luo et al. (2018) refers to a specific area where a worker performs a task [15,16,27,68]. This concept has limitations, as it isolates workers from the broader context of the construction site. In contrast, the “construction scene” method of Liu et al. (2020) provides a more comprehensive perspective. This method analyzes video clips and extracts information from all elements in the scene, including entities, relationships, and attributes, thereby providing a more comprehensive view of the construction process and facilitating better management and efficiency assessment [48,68].
Furthermore, the construction scene analysis of Liu et al. (2020) overcomes the limitation of single-object tracking. By using advanced models capable of multi-object detection and tracking, the system can monitor the interactions between multiple workers and tools in the same scene. This ability is crucial for accurately capturing the complex work processes and collaborative efforts typical of construction sites [68]. By combining scene elements such as tools, machinery, and materials with worker activities, a more detailed and nuanced analysis can be conducted. For instance, understanding the relationship between tool availability and usage and worker activity efficiency can provide insights that cannot be obtained through single-object tracking alone. This holistic view helps identify bottlenecks, optimize resource allocation, and improve overall site management.

2.4. Summary of Literature

A comprehensive review of the existing literature reveals several key limitations in the current research on construction efficiency assessment and action recognition, which this study addresses through targeted improvements.
First, traditional construction efficiency assessment methods, such as manual recording and work sampling, heavily rely on human observation and subjective judgment. These methods are not only labor-intensive and time-consuming but also prone to errors. The frequency and scope of manual monitoring are limited, which hinders comprehensive and timely data collection on construction sites. These limitations restrict the accuracy and efficiency of efficiency assessments. To address these challenges, this study proposes an automated construction efficiency evaluation method based on computer vision technology. This method provides a more comprehensive and accurate evaluation by integrating the recognition of worker actions, tools, equipment, and other relevant elements, surpassing the limitations of traditional methods.
Furthermore, current datasets for action recognition in the construction field are typically limited to specific types of activities or environments, lacking comprehensive data that captures the diversity of worker actions and construction site conditions. The lack of comprehensive datasets makes it difficult to develop models that can generalize across different construction environments. To fill this gap, this study develops a construction scene dataset specifically tailored to particular scenarios, encompassing unique worker activities, tools, and interactive scenes. This dataset enriches existing resources, making it more applicable to real-world applications.
Based on these improvements, the proposed method aims to enhance the understanding of construction productivity through an integrated process. This process begins with analyzing the tools in the hands of workers to determine their associated construction scene, followed by worker activity recognition. The recognition results are then integrated with actual outcomes to provide a comprehensive productivity analysis. This method not only optimizes the evaluation process but also offers deeper insight into the factors affecting construction efficiency, thus providing a more accurate and dynamic approach to construction productivity assessment.

3. Methodology

The automated analysis system comprises three primary modules: (1) Keypoint Processing and Extraction; (2) Action Classification; and (3) Construction Object Recognition.
Keypoint Processing and Extraction Module: This module utilizes the BlazePose model from the MediaPipe library, a deep learning framework designed for human pose estimation. BlazePose is capable of detecting 33 key points on the human body, including the head, shoulders, elbows, wrists, hips, knees, and ankles, in both images and video streams. Besides estimating the 2D coordinates (x,y) of these key points, BlazePose also infers relative depth information (z) to approximate the distance between the key points and the camera. However, it is important to note that the depth information provided by BlazePose is inferred from standard RGB camera data rather than obtained from physical depth cameras. This estimation offers a relative depth reference, not an absolute physical distance. For applications requiring precise depth measurements, such as specific physical distances, devices equipped with depth sensors like Kinect or Intel RealSense would be necessary. Given that construction sites typically utilize RGB cameras instead of depth cameras, this approach aligns well with the environmental conditions present on-site.
Action Classification Module: This module employs a Long Short-Term Memory (LSTM) network, a type of neural network adept at handling and predicting time-series data. LSTM is particularly effective for analyzing sequential actions in video frames. By feeding the time-series data of the detected key points into the LSTM network, the system can learn and extract temporal features of the actions, enabling accurate classification of worker movements. This module not only identifies the type of actions but also captures their duration and dynamic variations, providing a detailed analysis of worker behavior.
Construction Object Recognition Module: The YOLO (You Only Look Once) algorithm powers this module, a real-time object detection system renowned for its efficiency and accuracy. YOLO can detect and classify multiple objects within a single pass through the neural network. In the context of a construction site, YOLO is utilized to identify and classify tools and materials present in the video footage, such as bricks, trowels, and scaffolding. The model provides rapid identification of various objects in the scene, along with their corresponding categories and locations, which is crucial for comprehensively understanding the dynamics and layout of the construction environment.
Traditionally, the productivity of bricklaying work is measured by the output, such as the number of bricks laid or the area of wall completed per unit time. While this approach provides an overall assessment after the work is completed, it cannot reflect specific inefficiencies or delays that may occur during the construction process itself. In this study, we propose an action recognition-based approach that enables real-time and fine-grained monitoring of the work process. By continuously recognizing and analyzing individual bricklaying actions, our system can identify periods of waiting or idle time between tasks, which are not captured by traditional output metrics. This real-time, process-oriented analysis offers immediate feedback and allows for a more comprehensive understanding of construction productivity, complementing and extending beyond conventional productivity measures. The overall process is illustrated in Figure 1. The following sections detail the key technical modules of our approach, including keypoint extraction and tracking, activity segmentation, action recognition, object detection, and productivity analysis.

3.1. Extraction and Tracking of Worker Key Points

The extraction and tracking of worker key points is a fundamental step in analyzing construction activities. This process provides crucial data for understanding worker movements and behaviors.
We utilize the BlazePose model from the MediaPipe library for keypoint processing and extraction. BlazePose is a deep learning framework designed for human pose estimation and can detect 33 key points on the human body in images and video streams. The spatial distribution of these keypoints is illustrated in Figure 2. It not only provides 2D coordinates (x,y) but also infers relative depth information (z), which is valuable for assessing changes in posture and movement. Although the depth information is an estimation from standard RGB camera data rather than from physical depth cameras, it still offers a relative depth reference that is useful in construction site environments where RGB cameras are commonly used [51,69].
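To make this keypoint extraction step concrete, the following minimal Python sketch uses the MediaPipe Pose (BlazePose) API to read a video and yield the 33 landmarks per frame; the video path, frame handling, and output layout are illustrative assumptions rather than the exact pipeline used in this study.

```python
# Minimal keypoint-extraction sketch using MediaPipe Pose (BlazePose).
# The video path and output format are illustrative, not the authors' exact pipeline.
import cv2
import mediapipe as mp

mp_pose = mp.solutions.pose

def extract_keypoints(video_path):
    """Yield one (33 x 4) list of [x, y, z, visibility] rows per frame."""
    cap = cv2.VideoCapture(video_path)
    with mp_pose.Pose(static_image_mode=False, model_complexity=1) as pose:
        while cap.isOpened():
            ok, frame = cap.read()
            if not ok:
                break
            # MediaPipe expects RGB input; OpenCV reads BGR frames.
            results = pose.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
            if results.pose_landmarks is None:
                continue  # no person detected in this frame
            yield [[lm.x, lm.y, lm.z, lm.visibility]
                   for lm in results.pose_landmarks.landmark]
    cap.release()
```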
To handle the complexity of a dynamic construction site with multiple workers, we introduce a multi-target tracking algorithm framework. This framework consists of three main components: bounding box information, motion information, and appearance information. Bounding box information is generated based on the key point coordinate data, helping to locate workers in the image and providing detailed information for local regions in subsequent motion recognition [51,69]. Motion information is extracted using a Kalman filter, which predicts the worker’s position in the next moment using current and past key point position data. Mahalanobis distance is used to measure the similarity between newly detected targets and existing tracked targets to maintain effective tracking in a multi-target environment. Appearance information, such as clothing color and texture features, provides additional identification features for long-term tracking and reduces the risk of misidentification and tracking loss in complex backgrounds.
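The motion-information component can be sketched as a constant-velocity Kalman filter whose predicted worker center is gated against new detections with the Mahalanobis distance; the matrices and noise values below are illustrative defaults, not the tuned parameters of our tracker.

```python
# Sketch of the motion-association step: a constant-velocity Kalman filter predicts
# each tracked worker's bounding-box centre, and Mahalanobis distance gates new
# detections. Matrix values are illustrative defaults, not tuned parameters.
import numpy as np

class TrackKF:
    def __init__(self, cx, cy, dt=1.0):
        self.x = np.array([cx, cy, 0.0, 0.0])      # state: [cx, cy, vx, vy]
        self.P = np.eye(4) * 10.0                  # state covariance
        self.F = np.array([[1, 0, dt, 0],
                           [0, 1, 0, dt],
                           [0, 0, 1, 0],
                           [0, 0, 0, 1]], float)   # constant-velocity model
        self.H = np.array([[1, 0, 0, 0],
                           [0, 1, 0, 0]], float)   # only the centre is observed
        self.Q = np.eye(4) * 0.01                  # process noise
        self.R = np.eye(2) * 1.0                   # measurement noise

    def predict(self):
        self.x = self.F @ self.x
        self.P = self.F @ self.P @ self.F.T + self.Q
        return self.H @ self.x                     # predicted centre

    def mahalanobis(self, z):
        """Distance between a detection z = [cx, cy] and the predicted centre."""
        S = self.H @ self.P @ self.H.T + self.R    # innovation covariance
        y = np.asarray(z) - self.H @ self.x
        return float(np.sqrt(y @ np.linalg.inv(S) @ y))

    def update(self, z):
        S = self.H @ self.P @ self.H.T + self.R
        K = self.P @ self.H.T @ np.linalg.inv(S)   # Kalman gain
        self.x = self.x + K @ (np.asarray(z) - self.H @ self.x)
        self.P = (np.eye(4) - K @ self.H) @ self.P
```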
To ensure effective tracking in crowded and dynamic construction scenes, each worker is assigned a unique identification ID and tracked across frames using a combination of motion prediction (Kalman filter) and appearance features (e.g., clothing color and texture). This approach helps maintain worker identity even if a person is temporarily occluded or overlaps with others, thereby enhancing the stability and reliability of keypoint detection in multi-worker scenarios. However, it should be noted that in cases of severe or long-term occlusion, or when multiple workers have highly similar appearances, tracking continuity may still be challenged, leading to potential identity switches or tracking loss.
It is also important to note that BlazePose provides only relative depth estimation based on RGB images, not absolute or physical depth values. When multiple workers are present or when partial/full occlusions occur, the estimated depth values may become unreliable or inaccurate. This limitation can affect the accuracy of action and posture recognition, especially in challenging, crowded environments. Future work will focus on integrating depth cameras or multi-sensor fusion to address these challenges.
Each worker is assigned a unique identification ID, and the data, including key point coordinates, timestamps, bounding box information, and appearance features, is stored in a comprehensive database for subsequent analysis. Feature analysis is performed on the key point data, extracting positional coordinate features, center of mass features, and angle features. These features provide valuable insights into worker postures, movements, and task performance, enabling the identification and classification of various worker actions as well as the assessment of their efficiency and safety.
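As an illustration of the angle and center-of-mass features mentioned above, the following sketch computes a joint angle from three BlazePose landmarks and a simple body-center estimate; the exact feature set used in the system may differ.

```python
# Illustrative feature computations on the stored keypoints: a joint angle from
# three landmarks (e.g., shoulder-elbow-wrist) and a simple centre-of-mass proxy.
# Index choices follow the BlazePose landmark numbering; the feature set itself
# is a sketch of the kind of features described, not the exact ones used.
import numpy as np

def joint_angle(a, b, c):
    """Angle at point b (degrees) formed by segments b->a and b->c."""
    a, b, c = np.asarray(a), np.asarray(b), np.asarray(c)
    v1, v2 = a - b, c - b
    cos = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2) + 1e-8)
    return float(np.degrees(np.arccos(np.clip(cos, -1.0, 1.0))))

def center_of_mass(keypoints):
    """Mean of all landmark coordinates as a rough body-centre feature."""
    return np.mean(np.asarray(keypoints)[:, :2], axis=0)

# Example: right elbow angle from BlazePose landmarks 12 (shoulder), 14 (elbow), 16 (wrist):
# angle = joint_angle(kp[12][:2], kp[14][:2], kp[16][:2])
```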

3.2. The Process of Deconstructing Construction Activity Scenes

To accurately deconstruct the actions of bricklaying workers, we first segment the video into time units based on the minimum duration of each action, where each time unit represents a complete action. These time units ensure that the scene elements (including workers, tools, bricks, etc.) remain consistent within each unit, allowing us to ignore the influence of time in the analysis. This method simplifies the data processing process and allows us to focus on the analysis of scene elements without considering the temporal variations. The minimum duration of each action determines the division points for the video, ensuring the accuracy and simplicity of action recognition. For example, during actions such as picking up a brick, placing a brick, or applying mortar, the start and end times of the actions are very clear. By segmenting these time periods, we can more efficiently analyze each action performed by the worker.
Each action of the bricklaying worker can be further broken down into five basic actions based on their duration and details: picking up a brick, placing a brick, applying mortar, cleaning mortar, and placing mortar, as illustrated in Figure 3. By clearly defining these actions into individual time units, we can process each step of the bricklaying process more simply. The choice of these five actions is based on the necessity and sequence of each step in the bricklaying process. They represent the most critical and indispensable parts of the work, with each action having a clear temporal boundary. For each action’s deconstruction, we apply the analysis method of entities, relationships, and attributes in scene theory, which helps us to deeply understand the actual processes represented by each action.
The action of picking up a brick begins with the worker visually identifying the target brick and accurately grabbing it. In this process, the relationship between the worker and the brick is “grasping,” and the physical attributes of the brick (such as weight and size) influence the worker’s operation. Although this action seems simple, it requires the worker to have good coordination to ensure the stability of the brick. When placing the brick, the worker must consider factors such as the wall structure, the arrangement of the bricks, and the thickness of the mortar, all of which are reflected in the relationship between the worker, the brick, and the wall. In the process of applying mortar, the worker uses a tool to extract an appropriate amount of mortar and evenly applies it to the surface of the brick. This requires the worker to control the amount of mortar and the application pressure based on the surface of the brick, ensuring the mortar firmly binds the bricks together. Next, the worker cleans off excess mortar to maintain the neatness and aesthetic of the wall. This cleaning process is not just surface treatment; it ensures the stability of the connections between the bricks. Finally, placing mortar is undertaken to ensure that the gaps between the bricks are filled adequately, ensuring the stability and durability of the structure.
To ensure effective classification of bricklaying actions, we maintain the consistency of scene elements by considering the completeness of each action within the time unit. During the analysis, we deconstruct each action’s characteristics and impacts using the entities, relationships, and attributes method. For example, in the action of “picking up a brick,” the relationship between the worker and the brick is “grasping,” and the brick’s attributes (such as weight and shape) determine the worker’s movements. When placing the brick, the relationship between the worker, the brick, and the wall is “placing,” and the attributes of the brick are still important, but they focus more on the angle, position, and fit with the wall. The processes of applying mortar, cleaning mortar, and placing mortar follow a similar logic, where the interactions between the worker, the tools, the mortar, and the wall determine the final outcome of the actions.
Through the in-depth deconstruction of these five actions, we can clarify the steps of each action, the tools, materials, and environmental elements involved, and use this information for efficient labor productivity evaluation. This detailed analysis of actions provides essential foundational data for the subsequent automation of construction and offers theoretical support for improving the workflow of bricklaying workers. In the age of smart buildings and the continuous development of automation technology, accurate action classification and recognition will play a key role in enhancing construction productivity, optimizing workplace safety, and improving construction quality [68].

3.3. Action Recognition of Time Series Data

Worker action recognition is essential for optimizing workflows on construction sites. We employ Long Short-Term Memory (LSTM) networks, a type of recurrent neural network effective for sequential data analysis. The LSTM model receives input in the form of a sequence of coordinates corresponding to the worker’s key points over time. These coordinates are pre-processed to normalize scale differences and align time steps for consistent input data. The model processes this sequence, capturing temporal dependencies and identifying patterns that correspond to different actions, such as lifting, placing, or adjusting materials. The output of the LSTM network provides a probability distribution over possible actions, allowing for accurate classification of worker movements.
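A minimal Keras sketch of such an LSTM classifier is given below: the input is a fixed-length sequence of per-frame keypoint features, and the output is a softmax distribution over the six action labels defined in Section 4.1. Layer sizes, sequence length, and optimizer settings are illustrative assumptions, not the reported architecture.

```python
# Minimal LSTM action classifier sketch: sequences of keypoint features in,
# probability distribution over six action classes out. Hyperparameters are
# illustrative, not the architecture reported in this paper.
from tensorflow.keras import layers, models

SEQ_LEN = 20           # frames per clip (about 2 s of video in the dataset)
NUM_FEATURES = 33 * 4  # 33 BlazePose landmarks x (x, y, z, visibility)
NUM_CLASSES = 6        # five bricklaying actions + invalid action

def build_action_model():
    model = models.Sequential([
        layers.Input(shape=(SEQ_LEN, NUM_FEATURES)),
        layers.LSTM(64, return_sequences=True),  # keep per-step outputs
        layers.LSTM(32),                         # summarise the whole clip
        layers.Dense(32, activation="relu"),
        layers.Dense(NUM_CLASSES, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```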
By leveraging LSTM networks, we can make better use of the temporal features in videos. The processing result of the previous frame of video has an impact on the result of the next frame. Compared with other machine learning models, it can better utilize the video characteristics to improve the accuracy [29,51]. The overall workflow of multi-channel action sequence analysis, incorporating pose extraction via MediaPipe and sequential modeling with LSTM networks, is illustrated in Figure 4.
At each time step, features such as human pose keypoints are extracted from video frames using MediaPipe and input into the LSTM network. The LSTM then uses its internal memory (cell state) and several gates to decide what information to keep, update, or output. Through this sequence modeling, the network captures the temporal dynamics of worker movements. The final output label represents the recognized action class (e.g., “lifting,” “placing,” or “idle”) for each video segment. The overall workflow of action recognition using pose extraction and LSTM-based sequential analysis is illustrated in Figure 5. This process enables the system to automatically and robustly identify construction worker activities over time, based on sequential pose information.

3.4. Object Recognition in Scenes

In addition to recognizing worker actions, identifying and tracking construction tools is crucial for a comprehensive analysis of the construction site. We utilize the YOLO (You Only Look Once) algorithm, a state-of-the-art deep learning model for real-time object detection. The YOLO model is trained on a curated dataset of construction tools, including common items like hammers, bricks, trowels, and power tools. The model outputs bounding boxes with confidence scores, indicating the presence and identity of tools in the scene. This data is then integrated with worker tracking information to provide a holistic view of the site, allowing us to correlate tool usage with specific tasks and worker activities.
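The tool-detection step can be sketched with the Ultralytics YOLO package as follows; the weights file name and confidence threshold are placeholders for a model fine-tuned on the construction-tool dataset described above, and the specific YOLO version used in this study may differ.

```python
# Tool-detection sketch with the Ultralytics YOLO package. The weights file is a
# hypothetical model fine-tuned on construction tools; threshold is illustrative.
from ultralytics import YOLO

tool_model = YOLO("construction_tools.pt")  # hypothetical fine-tuned weights

def detect_tools(frame, conf_threshold=0.5):
    """Return (class_name, confidence, xyxy box) tuples for one video frame."""
    results = tool_model(frame, verbose=False)[0]
    detections = []
    for box in results.boxes:
        conf = float(box.conf)
        if conf < conf_threshold:
            continue
        cls_name = results.names[int(box.cls)]
        detections.append((cls_name, conf, box.xyxy[0].tolist()))
    return detections
```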
By continuously monitoring and identifying tools, the system not only ensures efficient tool management but also contributes to safety management. For instance, the system can alert supervisors if unauthorized tools are in use or if tools are being used in unsafe manners.

3.5. Comprehensive Framework for Construction Productivity Evaluation

Productivity analysis is a key objective of our research. By integrating the extraction and tracking of worker key points, action recognition of time-series data, and object recognition in scenes, we can comprehensively understand construction activities. This understanding enables us to analyze various factors that affect productivity, such as worker actions, tool usage, and environmental conditions. First, tool identification data are used to determine the task scenario. Next, the trained action recognition model is applied to the construction trade present in the scene. Finally, labor productivity in the construction field is better reflected by linking the counted work outputs of that trade with the recognized actions.
In our productivity evaluation framework, an action is counted as valid work only when both the LSTM-based action recognition module (using keypoints extracted by BlazePose) identifies a specific work action and the YOLO tool detection module simultaneously detects the corresponding construction tool. This “dual matching” mechanism is designed to reduce false positives by requiring evidence from both human pose and tool use for each labeled action.
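The dual-matching rule can be expressed as a simple check per time unit, as in the sketch below; the action-to-tool mapping shown is a hypothetical example for the bricklaying scenario, not an exhaustive specification. Multiplying the number of valid units by the unit duration then gives an estimate of effective working hours.

```python
# Sketch of the "dual matching" rule: a recognised action in a time unit counts as
# valid work only if the tool that the action implies is also detected in the same
# unit. The action-to-tool mapping is a hypothetical example for bricklaying.
ACTION_REQUIRED_TOOL = {
    "applying mortar": "trowel",
    "picking up bricks": "brick",
    "placing bricks": "brick",
    "adjusting bricks": "trowel",
    "cleaning the wall": "trowel",
}

def count_valid_actions(recognised_actions, detected_tools_per_unit):
    """recognised_actions[i] is the action label for time unit i;
    detected_tools_per_unit[i] is the set of tool classes seen in that unit."""
    valid = 0
    for action, tools in zip(recognised_actions, detected_tools_per_unit):
        required = ACTION_REQUIRED_TOOL.get(action)
        if required is not None and required in tools:
            valid += 1  # both the pose-based action and the matching tool are present
    return valid
```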
However, if either the pose recognition or the tool detection module fails to identify the expected action or object, the system will miss (i.e., not count) that work action. At present, the approach does not include advanced error correction or data fusion strategies; the two modules operate independently, and no further recovery is attempted for missed detections. This limitation is acknowledged in the conclusion, and future work will focus on developing more robust multi-source information fusion and error handling mechanisms to further improve system reliability.
In conclusion, our research method provides a comprehensive framework for analyzing construction activities and evaluating productivity. By utilizing advanced technologies such as deep learning and computer vision, we aim to improve the level of labor productivity analysis for construction industry workers.

4. Results

4.1. Dataset Production and Preprocessing

As depicted in Figure 6, the research process comprises seven steps and can be divided into three parts. The first part is action deconstruction analysis: a standard is established through deconstructive analysis to define workers’ actions. In the construction industry, bricklayers are selected as the research object, and their work process is divided into five steps: applying mortar, picking up bricks, placing bricks, adjusting bricks, and cleaning the wall. Figure 6 shows the actual actions and the deconstruction of workers’ actions. Picking up bricks (Figure 6a) involves lifting bricks from the ground or elsewhere and handing them to the bricklayer; this action is characterized by the right arm (the arm holding the trowel) held across the chest, the body twisted backward, and the left arm extended. Applying mortar (Figure 6b) is the act of spreading cement mortar paste on one or both sides of the brick; here the worker faces forward, and the two arms are extended to roughly equal, medium length. Placing the brick (Figure 6c) is the action of placing the brick with cement paste onto the position where it is to be laid; the left arm (the arm without the trowel) is extended and the right arm is retracted. Adjusting the brick (Figure 6d) is the action of firmly tapping the brick and mortar; both palms are held in front of the chest, and the two arms are extended to similar lengths. Cleaning the wall (Figure 6e) uses a scraper to smooth off the excess cement paste; both arms are held straight at the same time, with a slight swing during the movement.
In the experiment, videos of workers, both collected from YouTube and self-filmed, are manually cut into clips of the five worker actions plus invalid actions, yielding six classes in total. MediaPipe is then used to read the short videos of each action. The coordinates of each key point are represented in a normalized image coordinate system, and the three coordinate values are combined (summed and squared) to simplify the output. The dataset consists of 60 short video clips, each approximately two seconds in duration and containing about 20 frames. These clips correspond to 60 individual action instances used for training, resulting in a total of approximately 2400 frames. After the output for a given frame is obtained, it is labeled and fed into the LSTM model for training. The correspondence between labels and actions is as follows: 0—applying mortar; 1—picking up bricks; 2—placing bricks; 3—adjusting bricks; 4—cleaning the wall; and 5—invalid action.
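As a sketch of how the labeled sequences could be assembled, the snippet below maps per-action clip folders to the label indices listed above and pairs each clip’s keypoint sequence with its label; the folder names and the extract_keypoints helper (from the earlier MediaPipe sketch) are illustrative assumptions, not the actual data layout used in this study.

```python
# Hypothetical assembly of (sequence, label) pairs from per-action clip folders.
# Folder names and the extract_keypoints() helper are illustrative assumptions.
import os
import numpy as np

LABELS = {
    "applying_mortar": 0,
    "picking_up_bricks": 1,
    "placing_bricks": 2,
    "adjusting_bricks": 3,
    "cleaning_the_wall": 4,
    "invalid_action": 5,
}

def build_dataset(root_dir, seq_len=20):
    sequences, labels = [], []
    for folder, label in LABELS.items():
        clip_dir = os.path.join(root_dir, folder)
        for name in os.listdir(clip_dir):
            frames = list(extract_keypoints(os.path.join(clip_dir, name)))
            if len(frames) < seq_len:
                continue  # skip clips that are too short
            # Flatten 33 x 4 landmarks per frame and truncate to a fixed length.
            seq = np.asarray(frames[:seq_len]).reshape(seq_len, -1)
            sequences.append(seq)
            labels.append(label)
    return np.stack(sequences), np.asarray(labels)
```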

4.2. Action Recognition Model Training

The segmented 2 s pick-up brick action video is input into the action recognition model for further processing and evaluation. After the model processes the video, the results obtained are consistent with the manually labeled outcomes, indicating that the model has been successfully trained to a preliminary level. This validation confirms that the action recognition model is able to recognize and categorize actions effectively, providing initial evidence of its capability. The overall workflow of data preprocessing and pose alignment, including key frame extraction and landmark detection, is illustrated in Figure 7. The model, in particular, is based on the Long Short-Term Memory (LSTM) architecture, which excels in handling time-series data. This ability allows it to analyze sequential video frames, extract meaningful features, and ultimately perform accurate action classification. Each LSTM cell contains three essential components: the input gate, the forget gate, and the output gate, all of which work together to control the flow of information within the memory unit. The input gate determines which data should be fed into the memory unit, while the forget gate decides which data should be discarded. The output gate then regulates the output of the memory unit, ensuring that relevant information is retained for future prediction. The LSTM network is a supervised learning model, which means that it requires labeled data for training. As a result, videos need to be manually divided into smaller segments, typically between 1 and 2 s in length, and each segment must be labeled with one of six predefined action categories.
In this case, the video segments are first processed and converted into individual frames through the use of the MediaPipe framework, which efficiently handles keypoint extraction and video frame analysis. The frames are extracted at regular intervals—specifically every 0.25 s—to ensure that the data maintains high accuracy. This segmentation strategy also helps prevent chaotic data distribution, which could otherwise lead to noise and imprecision in the model’s predictions. With this approach, the LSTM model is provided with structured data, enabling it to learn the patterns and features of the actions within the video more effectively. By following this methodology, the system can be trained on time-sequenced video data, where each action is captured in discrete segments, allowing the LSTM model to learn how to classify complex actions accurately over time. The segmentation of video into small time units ensures that the model can identify and distinguish between various actions, such as “pick-up brick,” while accounting for the sequential nature of human motion. As the model progresses through the training process, it becomes increasingly capable of recognizing the nuanced dynamics of worker actions on a construction site, further enhancing its effectiveness for real-world applications.
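The 0.25 s sampling described above can be sketched with OpenCV as follows; the fallback frame rate is an illustrative assumption for videos whose FPS metadata is missing.

```python
# Sketch of the 0.25 s sampling step: read a clip with OpenCV and keep only the
# frames that fall on 0.25 s boundaries before keypoint extraction.
import cv2

def sample_frames(video_path, interval_s=0.25):
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 30.0  # fall back if FPS metadata is missing
    step = max(1, int(round(fps * interval_s)))
    frames, idx = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % step == 0:
            frames.append(frame)
        idx += 1
    cap.release()
    return frames
```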
The data format produced by MediaPipe is shown in Table 1.
In brief, the LSTM model processes the time-series keypoint data and extracts features for action classification, with each LSTM unit containing three gates and a memory cell. Because training is supervised, the videos are manually segmented into 1–2 s clips covering the six action types, converted into frames by MediaPipe, and output every 0.25 s to ensure accuracy and avoid a chaotic data distribution.
The training process of the LSTM model is as follows. First, the prepared dataset is divided into a training set, a validation set, and a test set in a ratio of 7:2:1. In the training stage, the video clips in the training set are fed into the LSTM model in sequence, and the model continuously adjusts its internal weight parameters so that its output approaches the true action label. The three gates (input gate, forget gate, and output gate) work together to control the inflow and outflow of information and the state updates of the memory cell. In each training round, the cross-entropy loss function is calculated to measure the difference between the model output and the true label, and the loss is minimized with stochastic gradient descent to optimize the model parameters. During training, the validation set is used to evaluate model performance so that hyperparameters can be adjusted in time and overfitting prevented; training is stopped when performance on the validation set no longer improves. Finally, the held-out test set is used for the final performance evaluation to determine the model's accuracy and generalization ability in practical applications. The training loss curve of the LSTM model during supervised learning is shown in Figure 8.
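The following TensorFlow/Keras sketch mirrors the training procedure just described (7:2:1 split, cross-entropy loss, stochastic gradient descent, early stopping on the validation set). The layer sizes, learning rate, and epoch count are illustrative assumptions, since the paper does not report the exact architecture or hyperparameters.

```python
# A hedged sketch of the LSTM classifier and its training loop; illustrative only.
import numpy as np
import tensorflow as tf

NUM_CLASSES = 6                 # the six labeled action categories (0-5)
TIMESTEPS, FEATURES = 8, 33     # 2 s clip sampled every 0.25 s, 33 BlazePose landmarks

def build_model() -> tf.keras.Model:
    model = tf.keras.Sequential([
        tf.keras.layers.Input(shape=(TIMESTEPS, FEATURES)),
        tf.keras.layers.LSTM(64, return_sequences=True),   # assumed layer sizes
        tf.keras.layers.LSTM(32),
        tf.keras.layers.Dense(NUM_CLASSES, activation="softmax"),
    ])
    model.compile(
        optimizer=tf.keras.optimizers.SGD(learning_rate=0.01),  # stochastic gradient descent
        loss="sparse_categorical_crossentropy",                 # cross-entropy on integer labels
        metrics=["accuracy"],
    )
    return model

def train(X: np.ndarray, y: np.ndarray) -> tf.keras.Model:
    """X: (n_clips, TIMESTEPS, FEATURES); y: integer labels 0-5, assumed already shuffled."""
    n = len(X)
    X_train, y_train = X[: int(0.7 * n)], y[: int(0.7 * n)]           # 70% training
    X_val, y_val = X[int(0.7 * n): int(0.9 * n)], y[int(0.7 * n): int(0.9 * n)]  # 20% validation
    X_test, y_test = X[int(0.9 * n):], y[int(0.9 * n):]               # 10% test
    model = build_model()
    model.fit(
        X_train, y_train,
        validation_data=(X_val, y_val),
        epochs=200,
        callbacks=[tf.keras.callbacks.EarlyStopping(patience=10, restore_best_weights=True)],
    )
    model.evaluate(X_test, y_test)  # final check on the held-out test set
    return model
```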
During the training process, the continuous decline of the loss function curve is a key indicator for evaluating the model’s optimization progress. As training progresses, the loss function value should gradually approach zero, indicating that the difference between the model’s output and the true labels is becoming smaller. In the initial stages, the loss function curve is typically higher because the model’s weights are randomly initialized, leading to large prediction errors. However, as training continues, the model gradually adjusts its internal parameters to better fit the training data, and the loss function starts to steadily decrease.
In this training process, the loss curve gradually becomes smoother, settles at a low value, and approaches zero. Once the curve stabilizes at a low value, the difference between the model's output and the true labels is small and the model's predictive ability has improved markedly, showing that it has learned the patterns and features within the data. At this stage, the loss curves for the training set and the validation set should closely align and stabilize, indicating that the model is not only fitting the training data accurately but also generalizing well. This gradual convergence also indicates that training has been effective and sufficient. By observing the loss curve, we can decide when to stop training, avoid overfitting caused by excessive training, and ensure the model's effectiveness in practical applications.

4.3. Model Test Results

Currently, the accuracy of worker action recognition reaches 82.23%. When the trained model is used for actual detection, the filmed actions must be consistent with those in the collected dataset, because the MediaPipe coordinates change with the worker's distance from the camera. To mitigate this, video data can be collected continuously to enrich the dataset and update the model. The model can count and visualize workers' working hours and generate bar charts. Figure 10 shows the position of each action on the timeline, the articulation between actions, and the duration of each action. A comparison between the model's predictions and the ground-truth data shows that, of a total duration of 38 s, 31.25 s are classified consistently, giving an accuracy of 82.23%.
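For clarity, this agreement-based accuracy can be computed as the fraction of the video timeline on which the predicted and manually annotated labels coincide; the short sketch below (with the hypothetical helper time_agreement) illustrates the calculation under an assumed 0.25 s labeling resolution.

```python
# A hedged sketch of reproducing the timeline-agreement accuracy (e.g., 31.25 s of 38 s).
import numpy as np

def time_agreement(pred_labels: np.ndarray, true_labels: np.ndarray,
                   step_s: float = 0.25) -> float:
    """Both arrays hold one action label per time step of `step_s` seconds."""
    assert len(pred_labels) == len(true_labels)
    agreeing_s = np.sum(pred_labels == true_labels) * step_s
    total_s = len(true_labels) * step_s
    return agreeing_s / total_s   # e.g., 31.25 / 38 ≈ 0.822
```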
After completing the training process, we evaluated the model’s action recognition performance using a confusion matrix. The confusion matrix is an intuitive tool that provides a comprehensive overview of the model’s classification results across all action categories. For multi-class tasks, each row of the confusion matrix represents the true class, while each column represents the predicted class. Each cell in the matrix indicates the number of samples of a given true class that were predicted as a particular class by the model.
In multi-class recognition tasks, the prediction results for each class can be divided into four categories:
True Positive (TP): The number of samples correctly identified as a given class by the model.
False Positive (FP): The number of samples from other classes that were incorrectly predicted as the given class.
False Negative (FN): The number of samples that actually belong to a given class but were predicted as another class.
True Negative (TN): The number of samples correctly identified as not belonging to the given class.
Based on the confusion matrix, we can calculate the accuracy.
Accuracy: The proportion of correctly predicted samples (TP + TN) to the total number of samples.
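Written out explicitly (the standard definition, consistent with the description above):

$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$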
In this study, we calculated the accuracy for each action category and obtained the overall average accuracy by taking the mean of the individual per-class accuracies. For example, if the model correctly predicts a certain class 8 times out of 10, the accuracy for that class is 80%. The confusion matrix of the LSTM action recognition model is presented in Figure 9; based on it, the model achieved an average accuracy of 81.67%. The visualization of the confusion matrix makes it easy to see which action categories the model handles well (darker colors) and which are more prone to confusion or misclassification (lighter colors).
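The per-class evaluation can be reproduced from raw predictions as in the sketch below, which computes the accuracy of each class as the share of its samples classified correctly (matching the 8-out-of-10 example) and averages over the six classes; the helper per_class_accuracy is hypothetical and uses scikit-learn only to build the confusion matrix.

```python
# A hedged sketch of the confusion-matrix evaluation for the six action classes.
import numpy as np
from sklearn.metrics import confusion_matrix

ACTIONS = ["applying mortar", "picking up bricks", "placing bricks",
           "adjusting bricks", "cleaning the wall", "invalid action"]

def per_class_accuracy(y_true, y_pred) -> np.ndarray:
    cm = confusion_matrix(y_true, y_pred, labels=range(len(ACTIONS)))
    # Diagonal = correctly classified samples of each class; row sums = class totals.
    class_acc = np.diag(cm) / cm.sum(axis=1).clip(min=1)
    for name, acc in zip(ACTIONS, class_acc):
        print(f"{name}: {acc:.2%}")
    print(f"average accuracy: {class_acc.mean():.2%}")
    return class_acc
```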
Through these methods, we can comprehensively and accurately evaluate the performance of the action recognition system, identify the model’s strengths and weaknesses, and provide a basis for further improvement.

4.4. Productivity Analysis

Table 2 shows the duration and percentage of the six actions (applying mortar, picking up bricks, placing bricks, adjusting bricks, cleaning the wall, and invalid action) relative to the entire video length. Comparing the durations obtained from model recognition with those from manual segmentation for each action provides a further indication of the model's accuracy.
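The tallies in Table 2 can be derived from the per-time-step predictions as sketched below; the helper action_time_shares and the 0.25 s step are assumptions for illustration, not the authors' implementation.

```python
# A hedged sketch of the productivity tally behind Table 2: sum the time assigned to each
# recognized action and express it as a share of the total video length.
from collections import Counter

ACTIONS = ["applying mortar", "picking up bricks", "placing bricks",
           "adjusting bricks", "cleaning the wall", "invalid action"]

def action_time_shares(pred_labels, step_s: float = 0.25):
    """pred_labels: one integer label (0-5) per time step of `step_s` seconds."""
    counts = Counter(pred_labels)
    total_s = len(pred_labels) * step_s
    rows = []
    for idx, name in enumerate(ACTIONS):
        duration = counts.get(idx, 0) * step_s
        rows.append((name, duration, 100.0 * duration / total_s))
    return rows  # (action, seconds, percentage of total video length)
```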
This performance evaluation shows that the model has made significant progress in recognizing worker actions, although there is still room for improvement. The accuracy of 81.67% indicates that while the model classifies most actions correctly, there are still instances where it assigns an action to the wrong category, as reflected in the false positives. The detailed time distribution of each action and the overall time proportions are illustrated in Figure 10 and Figure 11, respectively. These results are promising, indicating that the model is well on its way to reliable action recognition of construction worker movements; further fine-tuning and optimization should help reduce false positives and increase overall accuracy.

5. Discussion

5.1. Contributions

This study makes several significant contributions to construction management and labor efficiency assessment. We have developed an automated efficiency assessment method based on computer vision, which addresses the limitations of traditional approaches, such as manual recording and work sampling, that are labor-intensive, time-consuming, and prone to human error. By integrating advanced image processing algorithms and machine learning models, our approach substantially reduces the dependence on manual monitoring and enables more objective and real-time performance evaluation. The system not only analyzes worker actions but also incorporates information about tools, equipment, and other environmental factors to provide a holistic understanding of construction activities, thereby facilitating the identification of inefficiencies and opportunities for process optimization.
Moreover, this research introduces a novel framework for action recognition and work time statistics that is capable of handling complex, multi-worker construction scenes, resulting in improved tracking and assessment accuracy compared to single-object approaches. The establishment and application of a comprehensive construction site dataset further enhance the model’s robustness and generalizability, providing a valuable resource for future research and training.

5.2. Practical Implications

This study not only introduces new methods and a new model framework theoretically but also demonstrates, through practical application, their potential for enhancing construction efficiency, saving costs, and supporting decision-making. The proposed system enables the rapid identification of bottlenecks and non-productive delays in the construction workflow, providing managers with the data needed to optimize resource allocation and improve efficiency. Accurate tracking and comprehensive scene analysis support more informed management decisions and facilitate scientific project planning and adjustment. It is important to note, however, that the anticipated broader benefits, such as enhanced management optimization, cost reduction, and continuous, real-time site monitoring, represent long-term application prospects rather than outcomes already achieved in the present study; these potential advantages remain to be validated through larger-scale and longer-term deployments in real construction environments. Future work could further extend these findings to a broader range of applications, advancing the field of construction management and efficiency assessment.
This study also faces several limitations. The current system requires adequate lighting and appropriate camera placement and is sensitive to occlusion, factors that are not always controllable on real-world sites. The experimental validation was conducted under controlled conditions, and the dataset remains small and focused on a single trade, limiting the generalizability of the findings. Future research will extend the evaluation to more diverse and complex environments and explore the integration of infrared cameras, depth sensors, and edge computing to improve robustness and practical applicability. We also plan to expand the dataset in both scale and diversity and conduct cross-trade, cross-site, and cross-subject validation to ensure the broader utility of our approach.

6. Conclusions

This research presents an innovative computer vision–based automated method for construction efficiency assessment, which is highly significant for the field of construction management. The approach integrates detailed analysis of worker actions, tool interactions, and environmental factors to reduce reliance on manual monitoring and achieve more objective, real-time performance evaluation. The novel model framework and the construction site dataset lay a solid foundation for improving assessment accuracy and system applicability.
Despite promising results, several limitations should be acknowledged. The system’s effectiveness depends on ideal site conditions, such as adequate lighting and camera placement, which may be difficult to maintain on active construction sites. Additionally, the study was based on a small-scale, short-duration dataset as a proof-of-concept; the findings may not directly generalize to all practical scenarios, and broader validation remains necessary.
Looking ahead, broader benefits such as improved management optimization, cost efficiency, and the feasibility of continuous, real-time site monitoring are considered long-term prospects that require further validation in future research. We plan to expand the dataset, incorporate multi-sensor and edge computing technologies, and conduct systematic comparisons with manual and sensor-based monitoring to further establish the feasibility and value of the proposed method. This research provides a foundation for future innovation in construction management and lays the groundwork for more intelligent and efficient industry practices.

Author Contributions

Conceptualization, C.Z., J.Z. and Y.L.; methodology, C.Z.; software, C.Z.; validation, C.Z.; formal analysis, C.Z.; investigation, C.Z.; resources, H.L.; data curation, C.Z.; writing—original draft, C.Z.; writing—review and editing, H.L.; visualization, C.Z.; supervision, C.M.; project administration, C.M.; funding acquisition, H.L. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The data that support the findings of this study are available from the corresponding authors upon reasonable request.

Acknowledgments

The authors would like to thank Chongqing University for supporting this research.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Akhavian, R.; Behzadan, A.H. Productivity Analysis of Construction Worker Activities Using Smartphone Sensors. In Proceedings of the 16th International Conference on Computing in Civil and Building Engineering (ICCCBE), Osaka, Japan, 6–8 July 2016; pp. 1067–1074. [Google Scholar]
  2. Alashhab, S.; Gallego, A.J.; Lozano, M.Á. Efficient Gesture Recognition for the Assistance of Visually Impaired People Using Multi-Head Neural Networks. Eng. Appl. Artif. Intell. 2022, 114, 105188. [Google Scholar] [CrossRef]
  3. Arshad, S.; Akinade, O.; Bello, S.; Bilal, M. Computer Vision and IoT Research Landscape for Health and Safety Management on Construction Sites. J. Build. Eng. 2023, 76, 107049. [Google Scholar] [CrossRef]
  4. Bassier, M.; Vergauwen, M. Unsupervised Reconstruction of Building Information Modeling Wall Objects from Point Cloud Data. Autom. Constr. 2020, 120, 103338. [Google Scholar] [CrossRef]
  5. Chan, A.P.C.; Yi, W.; Wong, D.P.; Yam, M.C.H.; Chan, D.W.M. Determining an Optimal Recovery Time for Construction Rebar Workers after Working to Exhaustion in a Hot and Humid Environment. Build. Environ. 2012, 58, 163–171. [Google Scholar] [CrossRef]
  6. Gong, J.; Caldas, C.H. Computer Vision-Based Video Interpretation Model for Automated Productivity Analysis of Construction Operations. J. Comput. Civ. Eng. 2010, 24, 252–263. [Google Scholar] [CrossRef]
  7. Gouett, M.C.; Haas, C.T.; Goodrum, P.M.; Caldas, C.H. Activity Analysis for Direct-Work Rate Improvement in Construction. J. Constr. Eng. Manag. 2011, 137, 1117–1124. [Google Scholar] [CrossRef]
  8. Zhang, H.; Yan, X.; Li, H. Ergonomic Posture Recognition Using 3D View-Invariant Features from Single Ordinary Camera. Autom. Constr. 2018, 94, 1–10. [Google Scholar] [CrossRef]
  9. Zhao, J.; Zhu, N.; Lu, S. Productivity Model in Hot and Humid Environment Based on Heat Tolerance Time Analysis. Build. Environ. 2009, 44, 2202–2207. [Google Scholar] [CrossRef]
  10. Tian, Y.; Chen, J.; Kim, J.I.; Kim, J. Lightweight Deep Learning Framework for Recognizing Construction Workers’ Activities Based on Simplified Node Combinations. Autom. Constr. 2024, 158, 105236. [Google Scholar] [CrossRef]
  11. Sherafat, B.; Ahn, C.R.; Akhavian, R.; Behzadan, A.H.; Golparvar-Fard, M.; Kim, H.; Lee, Y.-C.; Rashidi, A.; Azar, E.R. Automated Methods for Activity Recognition of Construction Workers and Equipment: State-of-the-Art Review. J. Constr. Eng. Manag. 2020, 146, 03120002. [Google Scholar] [CrossRef]
  12. Dixit, S.; Mandal, S.N.; Thanikal, J.V.; Saurabh, K. Evolution of Studies in Construction Productivity: A Systematic Literature Review (2006–2017). Ain Shams Eng. J. 2019, 10, 555–564. [Google Scholar] [CrossRef]
  13. Ghodrati, N.; Yiu, T.W.; Wilkinson, S. Unintended Consequences of Management Strategies for Improving Labor Productivity in Construction Industry. J. Saf. Res. 2018, 67, 107–116. [Google Scholar] [CrossRef]
  14. Alwasel, A.; Sabet, A.; Nahangi, M.; Haas, C.T.; Abdel-Rahman, E. Identifying Poses of Safe and Productive Masons Using Machine Learning. Autom. Constr. 2017, 84, 345–355. [Google Scholar] [CrossRef]
  15. Luo, X.; Li, H.; Cao, D.; Yu, Y.; Yang, X.; Huang, T. Towards Efficient and Objective Work Sampling: Recognizing Workers’ Activities in Site Surveillance Videos with Two-Stream Convolutional Networks. Autom. Constr. 2018, 94, 360–370. [Google Scholar] [CrossRef]
  16. Luo, X.; Li, H.; Cao, D.; Dai, F.; Seo, J.; Lee, S. Recognizing Diverse Construction Activities in Site Images via Relevance Networks of Construction-Related Objects Detected by Convolutional Neural Networks. J. Comput. Civ. Eng. 2018, 32, 04018006. [Google Scholar] [CrossRef]
  17. Alaloul, W.S.; Alzubi, K.M.; Malkawi, A.B.; Al Salaheen, M.; Musarat, M.A. Productivity Monitoring in Building Construction Projects: A Systematic Review. Eng. Constr. Archit. Manag. 2022, 29, 2760–2785. [Google Scholar] [CrossRef]
  18. Baek, J.; Kim, D.; Choi, B. Deep Learning-Based Automated Productivity Monitoring for on-Site Module Installation in off-Site Construction. Dev. Built Environ. 2024, 18, 100382. [Google Scholar] [CrossRef]
  19. Chen, C.; Zhu, Z.; Hammad, A. Automated Excavators Activity Recognition and Productivity Analysis from Construction Site Surveillance Videos. Autom. Constr. 2020, 110, 103045. [Google Scholar] [CrossRef]
  20. Chen, X.; Wang, Y.; Wang, J.; Bouferguene, A.; Al-Hussein, M. Vision-Based Real-Time Process Monitoring and Problem Feedback for Productivity-Oriented Analysis in off-Site Construction. Autom. Constr. 2024, 162, 105389. [Google Scholar] [CrossRef]
  21. Jacobsen, E.L.; Teizer, J.; Wandahl, S. Work Estimation of Construction Workers for Productivity Monitoring Using Kinematic Data and Deep Learning. Autom. Constr. 2023, 152, 104932. [Google Scholar] [CrossRef]
  22. Teizer, J.; Cheng, T.; Fang, Y. Location Tracking and Data Visualization Technology to Advance Construction Ironworkers’ Education and Training in Safety and Productivity. Autom. Constr. 2013, 35, 53–68. [Google Scholar] [CrossRef]
  23. Small, E.P.; Baqer, M. Examination of Job-Site Layout Approaches and Their Impact on Construction Job-Site Productivity. Procedia Eng. 2016, 164, 383–388. [Google Scholar] [CrossRef]
  24. Qi, K.; Owusu, E.K.; Francis Siu, M.-F.; Albert Chan, P.-C. A Systematic Review of Construction Labor Productivity Studies: Clustering and Analysis through Hierarchical Latent Dirichlet Allocation. Ain Shams Eng. J. 2024, 15, 102896. [Google Scholar] [CrossRef]
  25. Oral, M.; Oral, E.L.; Aydın, A. Supervised vs. Unsupervised Learning for Construction Crew Productivity Prediction. Autom. Constr. 2012, 22, 271–276. [Google Scholar] [CrossRef]
  26. Cheng, M.-Y.; Khitam, A.F.K.; Tanto, H.H. Construction Worker Productivity Evaluation Using Action Recognition for Foreign Labor Training and Education: A Case Study of Taiwan. Autom. Constr. 2023, 150, 104809. [Google Scholar] [CrossRef]
  27. Luo, H.; Xiong, C.; Fang, W.; Love, P.E.D.; Zhang, B.; Ouyang, X. Convolutional Neural Networks: Computer Vision-Based Workforce Activity Assessment in Construction. Autom. Constr. 2018, 94, 282–289. [Google Scholar] [CrossRef]
  28. Xing, X.; Zhong, B.; Luo, H.; Rose, T.; Li, J.; Antwi-Afari, M.F. Effects of Physical Fatigue on the Induction of Mental Fatigue of Construction Workers: A Pilot Study Based on a Neurophysiological Approach. Autom. Constr. 2020, 120, 103381. [Google Scholar] [CrossRef]
  29. Bassino-Riglos, F.; Mosqueira-Chacon, C.; Ugarte, W. AutoPose: Pose Estimation for Prevention of Musculoskeletal Disorders Using LSTM. In Innovative Intelligent Industrial Production and Logistics; Terzi, S., Madani, K., Gusikhin, O., Panetto, H., Eds.; Communications in Computer and Information Science; Springer Nature: Cham, Switzerland, 2023; Volume 1886, pp. 223–238. ISBN 978-3-031-49338-6. [Google Scholar]
  30. Han, S.; Lee, S. A Vision-Based Motion Capture and Recognition Framework for Behavior-Based Safety Management. Autom. Constr. 2013, 35, 131–141. [Google Scholar] [CrossRef]
  31. Jarkas, A.M.; Bitar, C.G. Factors Affecting Construction Labor Productivity in Kuwait. J. Constr. Eng. Manag. 2012, 138, 811–820. [Google Scholar] [CrossRef]
  32. Akhavian, R.; Behzadan, A.H. Smartphone-Based Construction Workers’ Activity Recognition and Classification. Autom. Constr. 2016, 71, 198–209. [Google Scholar] [CrossRef]
  33. Nath, N.D.; Akhavian, R.; Behzadan, A.H. Ergonomic Analysis of Construction Worker’s Body Postures Using Wearable Mobile Sensors. Appl. Ergon. 2017, 62, 107–117. [Google Scholar] [CrossRef] [PubMed]
  34. Baduge, S.K.; Thilakarathna, S.; Perera, J.S.; Arashpour, M.; Sharafi, P.; Teodosio, B.; Shringi, A.; Mendis, P. Artificial Intelligence and Smart Vision for Building and Construction 4.0: Machine and Deep Learning Methods and Applications. Autom. Constr. 2022, 141, 104440. [Google Scholar] [CrossRef]
  35. Escorcia, V.; Dávila, M.A.; Golparvar-Fard, M.; Niebles, J.C. Automated Vision-Based Recognition of Construction Worker Actions for Building Interior Construction Operations Using RGBD Cameras. In Proceedings of the Construction Research Congress 2012, West Lafayette, IN, USA, 21–23 May 2012; American Society of Civil Engineers: Reston, VA, USA, 2012; pp. 879–888. [Google Scholar]
  36. Fang, W.; Ding, L.; Love, P.E.D.; Luo, H.; Li, H.; Peña-Mora, F.; Zhong, B.; Zhou, C. Computer Vision Applications in Construction Safety Assurance. Autom. Constr. 2020, 110, 103013. [Google Scholar] [CrossRef]
  37. Kim, J.; Chi, S. A Few-Shot Learning Approach for Database-Free Vision-Based Monitoring on Construction Sites. Autom. Constr. 2021, 124, 103566. [Google Scholar] [CrossRef]
  38. Luo, H.; Wang, M.; Wong, P.K.-Y.; Cheng, J.C.P. Full Body Pose Estimation of Construction Equipment Using Computer Vision and Deep Learning Techniques. Autom. Constr. 2020, 110, 103016. [Google Scholar] [CrossRef]
  39. Mansouri, S.; Castronovo, F.; Akhavian, R. Analysis of the Synergistic Effect of Data Analytics and Technology Trends in the AEC/FM Industry. J. Constr. Eng. Manag. 2020, 146, 04019113. [Google Scholar] [CrossRef]
  40. Balci, R.; Aghazadeh, F. The Effect of Work-Rest Schedules and Type of Task on the Discomfort and Performance of VDT Users. Ergonomics 2003, 46, 455–465. [Google Scholar] [CrossRef]
  41. Kim, H.; Ham, Y.; Kim, W.; Park, S.; Kim, H. Vision-Based Nonintrusive Context Documentation for Earthmoving Productivity Simulation. Autom. Constr. 2019, 102, 135–147. [Google Scholar] [CrossRef]
  42. Kim, J.; Golabchi, A.; Han, S.; Lee, D.-E. Manual Operation Simulation Using Motion-Time Analysis toward Labor Productivity Estimation: A Case Study of Concrete Pouring Operations. Autom. Constr. 2021, 126, 103669. [Google Scholar] [CrossRef]
  43. Mirahadi, F.; Zayed, T. Simulation-Based Construction Productivity Forecast Using Neural-Network-Driven Fuzzy Reasoning. Autom. Constr. 2016, 65, 102–115. [Google Scholar] [CrossRef]
  44. Rao, A.S.; Radanovic, M.; Liu, Y.; Hu, S.; Fang, Y.; Khoshelham, K.; Palaniswami, M.; Ngo, T. Real-Time Monitoring of Construction Sites: Sensors, Methods, and Applications. Autom. Constr. 2022, 136, 104099. [Google Scholar] [CrossRef]
  45. Joshua, D.; Varghese, K. Accelerometer-Based Activity Recognition in Construction. Autom. Constr. 2010, 19, 837–848. [Google Scholar] [CrossRef]
  46. Ryu, J.; McFarland, T.; Banting, B.; Haas, C.T.; Abdel-Rahman, E. Health and Productivity Impact of Semi-Automated Work Systems in Construction. Autom. Constr. 2020, 120, 103396. [Google Scholar] [CrossRef]
  47. Xiao, B.; Yin, X.; Kang, S.-C. Vision-Based Method of Automatically Detecting Construction Video Highlights by Integrating Machine Tracking and CNN Feature Extraction. Autom. Constr. 2021, 129, 103817, Erratum in Autom. Constr. 2021, 132, 103924. [Google Scholar] [CrossRef]
  48. Sheng, D.; Ding, L.; Zhong, B.; Love, P.E.D.; Luo, H.; Chen, J. Construction Quality Information Management with Blockchains. Autom. Constr. 2020, 120, 103373. [Google Scholar] [CrossRef]
  49. Zhang, M.; Zhou, Y.; Xu, X.; Ren, Z.; Zhang, Y.; Liu, S.; Luo, W. Multi-View Emotional Expressions Dataset Using 2D Pose Estimation. Sci. Data 2023, 10, 649. [Google Scholar] [CrossRef]
  50. Bora, J.; Dehingia, S.; Boruah, A.; Chetia, A.A.; Gogoi, D. Real-Time Assamese Sign Language Recognition Using MediaPipe and Deep Learning. Procedia Comput. Sci. 2023, 218, 1384–1393. [Google Scholar] [CrossRef]
  51. Sundar, B.; Bagyammal, T. American Sign Language Recognition for Alphabets Using MediaPipe and LSTM. Procedia Comput. Sci. 2022, 215, 642–651. [Google Scholar] [CrossRef]
  52. Bazarevsky, V.; Grishchenko, I.; Raveendran, K.; Zhu, T.; Zhang, F.; Grundmann, M. BlazePose: On-Device Real-Time Body Pose Tracking. arXiv 2020, arXiv:2006.10204. [Google Scholar]
  53. Konstantinou, E. Vision-Based Construction Worker Task Productivity Monitoring. Ph.D. Thesis, University of Cambridge, Cambridge, UK, 2018. [Google Scholar]
  54. Teizer, J. Status Quo and Open Challenges in Vision-Based Sensing and Tracking of Temporary Resources on Infrastructure Construction Sites. Adv. Eng. Inform. 2015, 29, 225–238. [Google Scholar] [CrossRef]
  55. Xiao, B.; Xiao, H.; Wang, J.; Chen, Y. Vision-Based Method for Tracking Workers by Integrating Deep Learning Instance Segmentation in off-Site Construction. Autom. Constr. 2022, 136, 104148. [Google Scholar] [CrossRef]
  56. Xiao, B.; Zhang, Y.; Chen, Y.; Yin, X. A Semi-Supervised Learning Detection Method for Vision-Based Monitoring of Construction Sites by Integrating Teacher-Student Networks and Data Augmentation. Adv. Eng. Inform. 2021, 50, 101372. [Google Scholar] [CrossRef]
  57. Gong, J.; Caldas, C.H. An Intelligent Video Computing Method for Automated Productivity Analysis of Cyclic Construction Operations. In Proceedings of the Computing in Civil Engineering (2009), Austin, TX, USA, 24–27 June 2009; American Society of Civil Engineers: Washington, DC, USA, 2009; pp. 64–73. [Google Scholar]
  58. Ishioka, H.; Weng, X.; Man, Y.; Kitani, K. Single Camera Worker Detection, Tracking and Action Recognition in Construction Site. In Proceedings of the 37th International Symposium on Automation and Robotics in Construction (ISARC), Kitakyushu, Japan, 14–18 June 2020; IAARC Publications: Oulu, Finland, 2020; Volume 37, pp. 653–660. [Google Scholar]
  59. Kikuta, T.; Chun, P. Development of an Action Classification Method for Construction Sites Combining Pose Assessment and Object Proximity Evaluation. J. Ambient Intell. Humaniz. Comput. 2024, 15, 2255–2267. [Google Scholar] [CrossRef]
  60. Li, C.; Lee, S. Computer Vision Techniques for Worker Motion Analysis to Reduce Musculoskeletal Disorders in Construction. In Proceedings of the Computing in Civil Engineering (2011), Miami, FL, USA, 19–22 June 2011; American Society of Civil Engineers: Washington, DC, USA, 2011; pp. 380–387. [Google Scholar]
  61. Panahi, R.; Louis, J.; Aziere, N.; Podder, A.; Swanson, C. Identifying Modular Construction Worker Tasks Using Computer Vision. In Proceedings of the Computing in Civil Engineering 2021, Orlando, FL, USA, 12–14 September 2022; American Society of Civil Engineers: Washington, DC, USA, 2022; pp. 959–966. [Google Scholar]
  62. Jang, Y.; Jeong, I.; Younesi Heravi, M.; Sarkar, S.; Shin, H.; Ahn, Y. Multi-Camera-Based Human Activity Recognition for Human–Robot Collaboration in Construction. Sensors 2023, 23, 6997. [Google Scholar] [CrossRef]
  63. Simonyan, K.; Zisserman, A. Two-Stream Convolutional Networks for Action Recognition in Videos. arXiv 2014, arXiv:1406.2199. [Google Scholar]
  64. Feichtenhofer, C.; Pinz, A.; Zisserman, A. Convolutional Two-Stream Network Fusion for Video Action Recognition. arXiv 2016, arXiv:1604.06573v2. [Google Scholar]
  65. Donahue, J.; Hendricks, L.A.; Rohrbach, M.; Venugopalan, S.; Guadarrama, S.; Saenko, K.; Darrell, T. Long-Term Recurrent Convolutional Networks for Visual Recognition and Description. arXiv 2014, arXiv:1411.4389v4. [Google Scholar]
  66. Tran, D.; Bourdev, L.; Fergus, R.; Torresani, L.; Paluri, M. Learning Spatiotemporal Features with 3D Convolutional Networks. In Proceedings of the 2015 IEEE International Conference on Computer Vision (ICCV), Santiago, Chile, 7–13 December 2015; pp. 4489–4497. [Google Scholar]
  67. Carreira, J.; Zisserman, A. Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 4724–4733. [Google Scholar]
  68. Liu, H.; Wang, G.; Huang, T.; He, P.; Skitmore, M.; Luo, X. Manifesting Construction Activity Scenes via Image Captioning. Autom. Constr. 2020, 119, 103334. [Google Scholar] [CrossRef]
  69. Mao, C.; Zhang, C.; Liao, Y.; Zhou, J.; Liu, H. An Automated Vision-Based Construction Efficiency Evaluation System with Recognizing Worker’s Activities and Counting Effective Working Hours. In Advances in Information Technology in Civil and Building Engineering, Proceedings of the ICCCBE 2024, Montreal, Canada, 25–28 August 2024; Francis, A., Miresco, E., Melhado, S., Eds.; Lecture Notes in Civil Engineering; Springer: Cham, Switzerland, 2025. [Google Scholar]
Figure 1. Framework for automatic recognition of labor force.
Figure 2. Distribution of human body key points.
Figure 3. Illustration of bricklaying action segmentation and scene element relationships.
Figure 4. Multi-frame action recognition pipeline using MediaPipe and LSTM networks.
Figure 5. LSTM-based worker action recognition framework for sequential pose analysis.
Figure 6. Fundamental actions in bricklaying.
Figure 7. The data recognition process of MediaPipe.
Figure 8. LSTM model training loss curve.
Figure 9. Confusion matrix of the LSTM action recognition model.
Figure 10. Comparison of time segmentation in bricklaying tasks as determined by automated action recognition and work sampling.
Figure 11. Proportion of time spent on different bricklaying activities as recognized by the automated productivity analysis model.
Table 1. The sum of the three coordinates obtained from each key point of MediaPipe.

Frame | X0 | X1 | X2 | X3 | … | X32
1 | 0.618238 | 0.232524 | 0.555385 | 0.78614 | … | 0.760563
2 | 0.651198 | 1.01199 | 0.407865 | 0.57514 | … | 0.499335
3 | 0.760563 | 0 | 1.32282 | 0.407865 | … | 0.760563
4 | 0.534423 | 0.555385 | 0.73582 | 0.794876 | … | 0.925179
5 | 0.509691 | 0.679963 | 1.01199 | 0.464099 | … | 0.543936
… | … | … | … | … | … | …
40 | 0.525569 | 0.599635 | 0.564189 | 0 | … | 0.760563
Table 2. Comparison and analysis of manual segmentation and model recognition results.

Name | Total Time of Manual Segmentation (s) | Proportion of Total Time, Manual Segmentation (%) | Total Time of Model Recognition (s) | Proportion of Total Time, Model Recognition (%)
Applying mortar | 6 | 15.78 | 5.75 | 15.13
Picking up the bricks | 7 | 18.42 | 7.25 | 19.07
Placing the bricks | 4 | 10.52 | 3.5 | 9.21
Adjusting the bricks | 4 | 10.52 | 4.75 | 12.5
Cleaning the wall | 7 | 18.42 | 7 | 18.42
Other work | 10 | 26.31 | 9.75 | 25.65