Article

Multi-Modal Excavator Activity Recognition Using Two-Stream CNN-LSTM with RGB and Point Cloud Inputs

Department of Civil and Environmental Engineering, Hanyang University, Seoul 04763, Republic of Korea
* Authors to whom correspondence should be addressed.
Appl. Sci. 2025, 15(15), 8505; https://doi.org/10.3390/app15158505
Submission received: 8 July 2025 / Revised: 26 July 2025 / Accepted: 28 July 2025 / Published: 31 July 2025
(This article belongs to the Special Issue AI-Based Machinery Health Monitoring)

Abstract

Recently, deep learning algorithms have been increasingly applied in construction for activity recognition, particularly for excavators, to automate processes and enhance safety and productivity through continuous monitoring of earthmoving activities. These deep learning algorithms analyze construction videos to classify excavator activities for earthmoving purposes. However, previous studies have solely focused on single-source external videos, which limits the activity recognition capabilities of the deep learning algorithm. This paper introduces a novel multi-modal deep learning-based methodology for recognizing excavator activities, utilizing multi-stream input data. It processes point clouds and RGB images using a two-stream convolutional neural network-long short-term memory (CNN-LSTM) method to extract spatiotemporal features, enabling the recognition of excavator activities. A comprehensive dataset comprising 495,000 video frames of synchronized RGB and point cloud data was collected across multiple construction sites under varying conditions. The dataset encompasses five key excavator activities: Approach, Digging, Dumping, Idle, and Leveling. To assess the effectiveness of the proposed method, the performance of the two-stream CNN-LSTM architecture is compared with that of single-stream CNN-LSTM models trained separately on the same RGB and point cloud datasets. The results demonstrate that the proposed multi-stream approach achieved an accuracy of 94.67%, outperforming existing state-of-the-art single-stream models, which achieved 90.67% accuracy for the RGB-based model and 92.00% for the point cloud-based model. These findings underscore the potential of the proposed activity recognition method, making it highly effective for automatic real-time monitoring of excavator activities, thereby laying the groundwork for future integration into digital twin systems for proactive maintenance and intelligent equipment management.

1. Introduction

The construction industry is one of the most significant contributors to the development and economic growth of a country [1]. However, the low utilization of technology and its labor-intensive nature are ongoing concerns for project managers and stakeholders [2]. With the increasing demand for proactive maintenance and operational efficiency in earthwork machinery operations, especially in the context of Industry 4.0 and 5.0, artificial intelligence (AI) has emerged as a transformative solution [3,4,5]. Integrating AI-based techniques such as deep learning and sensor fusion enables not just activity recognition but also intelligent fault detection and anomaly tracking [6]. A reliable activity recognition system serves as a foundational component of future AI-based health monitoring and digital twin frameworks in construction equipment operations [6,7]. Manual observation of the construction site remains the primary source for productivity estimation and safety monitoring of the crew and construction equipment [8]. Hydraulic excavators perform various operations at earthwork construction sites due to their maneuverability and robustness. Additionally, the risks associated with the harsh and hazardous working environments of construction sites encourage the automation of excavator operations [9]. Automating the detection of excavator activities during earthmoving operations enhances the safety, monitoring, and allocation of construction resources [10]. Internet of Things (IoT) and computer vision techniques are being explored for the automatic detection, action recognition, and productivity estimation during construction. The IoT-based approach utilizes global positioning systems (GPS), radio frequency identification (RFID), and inertial measurement units (IMUs), including accelerometers and gyroscopes, to detect, track, and monitor construction productivity [11]. IoT is a widely used technology that involves electronic sensors attached to construction equipment and its components to understand construction activity and estimate productivity by obtaining location coordinates and interpreting inertial values of physical motion [12,13,14]. However, IMU-based single-sensor solutions are sensitive to inherent sensor constraints, are highly affected by harsh environments, and limit the information acquisition capabilities for construction equipment.
Technological development supported by computer vision (CV) has been recognized as a robust approach to improving the level of automation in the construction industry [15]. Computer vision techniques have been widely used for activity recognition and productivity estimation of construction equipment [16,17,18]. Several algorithms have been used to perform the activity recognition of construction equipment [19,20,21,22]. A CNN network exploits spatial correlation in visual data to assist LSTM in processing data sequences and predicting excavator activities [23,24,25]. Additionally, the optical flow of construction machines and their components is evaluated and used during the training of various algorithms for activity recognition of construction machines [26]. Multiple algorithms have also used optical streams for the activity recognition of construction equipment [17,23]. However, previous studies have solely focused on single-source external videos, which limits the activity recognition capabilities of the deep learning algorithm. This reliance on single-source data often leads to issues such as occlusions, limited viewpoints, and sensitivity to varying environmental conditions, thereby hindering continuous and accurate monitoring of excavator operations in complex construction environments. This constraint limits the model’s capability and hinders its practical adoption in real-world construction scenarios. Therefore, an additional visual data source is necessary for continuous monitoring of construction machine progress and for improving the accuracy of the model training process for activity recognition. To address these limitations and enable robust activity recognition, there is a crucial need for methodologies that combine multi-stream data inputs from different sources, providing a more comprehensive understanding of the excavator’s operational context. Several other fields have implemented multi-stream input data for diagnosing health conditions [27,28,29]. A two-stream architecture incorporates spatial and temporal information using RGB and optical flow data [30].
However, the construction industry has not yet explored the use of multi-stream video data input, which combines external camera images and point cloud images, for activity recognition. Furthermore, point clouds are commonly used in the construction industry for tasks such as 3D site reconstruction, quality assessment, and progress tracking. Despite their widespread use, they have not been fully exploited for activity recognition. Due to their spatial richness, point clouds offer unique advantages, such as depth perception and robustness to environmental lighting conditions. Therefore, they provide a valuable input stream that complements traditional RGB video data, especially for capturing excavator geometry and movement in three-dimensional space. As a result, a two-stream deep learning architecture that simultaneously utilizes RGB frames and point cloud inputs needs to be explored, as it could significantly improve the robustness and generalizability of activity recognition models.
This paper presents a novel methodology for activity recognition of excavators through an integrated input stream of external camera images and point cloud images, utilizing a two-stream CNN-LSTM deep learning pipeline. This approach represents a significant advancement over single-source methods. This study simultaneously collects data from the external camera and lidar scanner to train a two-stream CNN-LSTM. The Farneback method analyzes visual information and estimates dense optical flows based on temporal sequences of images. A two-stream CNN pipeline processes an integrated input stream, including camera images, point cloud images, and motion optical flow. This architecture evaluates the sequential pattern of video frames and processes it using the LSTM architecture to enable the recognition of various excavator activities. The excavator activities, including Approach, Digging, Dumping, Idle, and Leveling, were chosen for activity recognition. The proposed deep learning model accurately categorizes excavator activities with a 94.67% accuracy rate. When compared with single-stream models trained on the same dataset for each input separately, the RGB-based model achieves an accuracy of 90.67%, while the point cloud-based model achieves an accuracy of 92.00%. This method outperforms existing algorithms for recognizing excavator activities due to its high accuracy. Additionally, a standard data labeling format is proposed for labeling the data gathered based on excavator activities at a specific timestamp. Moreover, a database containing 495,000 video frames of external camera images and point cloud images from various construction sites with different conditions has been developed. The main academic contributions of this paper are as follows:
  • A novel multi-stream input data-based methodology for excavators’ activity recognition using a two-stream CNN-LSTM DL algorithm.
  • Improvement of activity recognition through a two-stream CNN-LSTM algorithm by effectively combining spatial (RGB frames) and temporal (optical flow) data from RGB camera and Lidar streams under occlusion.
  • A distinct dataset has been developed with two different input sources for excavator activities during the earthmoving process at various site conditions. A database containing 495,000 video frames of RGB external camera images and point cloud images from various construction sites with different conditions has been developed.
  • A standard data labeling format is proposed for labeling the data gathered based on the excavator activities at a specific timestamp.
  • Comprehensive performance evaluation of CNN-LSTM for the single-stream and two-stream CNN-LSTM for multi-stream activity recognition of the excavator.

2. Literature Review

Construction machines are essential components of earthwork construction sites, performing various operations [31,32]. It is necessary to detect and track the earthmoving activities of these machines to facilitate cycle time estimation, productivity assessment, and resource allocation [10,33]. Traditional observational approaches to construction sites are time-consuming, costly, error-prone, and sensitive to the complexity of construction sites and the availability of manpower for monitoring [34]. Recently, extensive research has been conducted on the continuous monitoring of construction machines using various information technologies and artificial intelligence (AI) techniques to enhance construction efficiency, mitigate associated risks, and accelerate task completion at construction sites [35,36]. Various data types, including surveillance camera footage, images, and Internet of Things (IoT) sensor data, are processed through these techniques to facilitate construction tasks and safety monitoring [37,38,39]. Considering the input data type, these techniques can be categorized into three primary methods: audio-based, motion-sensor-based, and vision-based. The audio-based methods usually follow four steps: gathering the audio data, filtering and segmenting, selecting and extracting features, and model training for activity classification through support vector machines (SVM), Short-Time Fourier Transform (STFT), continuous wavelet transforms (CWT), and Bayesian statistical models [40,41,42]. However, audio-based methods can be adversely affected by crowded and noisy conditions, significantly reducing the accuracy of activity detection at construction sites [40]. In sensor-based methods, various motion sensors (accelerometers and gyroscopes) have been used to monitor construction equipment’s earthmoving activities [43]. Motion sensors often lack data acquisition capabilities for construction equipment, particularly upon reaching a certain level of maturity, and they encourage the use of additional sensors for mechanical parameters, such as pressure sensors and fuel consumption meters [44]. Bluetooth, ultrawideband, and pump pressure sensors have been utilized for working stage identification of excavators, while IMU sensors have been employed for automated cycle time estimation of excavators. Additionally, IMU sensors with GPS have been used to detect earthmoving activities of construction equipment [40,45,46,47,48,49]. However, single-sensor solutions are sensitive to inherent sensor constraints, which limit their information acquisition capabilities [10]. Additionally, sensor-based approaches require a complex installation and setup, focusing solely on internal inertial or mechanical parameters without providing any external information regarding the terrain or earthen material.

2.1. Vision-Based Methods for Construction Equipment Monitoring

Traditional observational approaches to construction sites are time-consuming, costly, error-prone, and sensitive to the complexity of construction sites and available manpower for monitoring [34,50]. Numerous vision-based methods for the detection, tracking, and activity recognition of objects, humans, or construction equipment have been reported [51,52]. Histograms of oriented gradients (HOG) classifiers and CNN-based Faster R-CNN, R-CNN-based IFaster R-CNN, and Region Proposal Network (RPN) techniques have been reported for the detection of objects, workers, non-hardhat use by construction workers, and excavators to maintain a safe environment at construction sites [22,53,54]. The potential of computer vision techniques for the pose estimation of construction equipment has also been explored using the Cascaded Pyramid Network and Stacked Hourglass Network [55,56,57]. A CNN-based algorithm has been proposed to enhance the accuracy of the 3D pose estimation of the excavator by using synthetic images during the training process and by utilizing hybrid datasets that integrate simulation and laboratory data under diverse scenarios and camera parameters [58]. Faster R-CNN, Deep Simple Online and Realtime Tracking (SORT), and a 3D ResNet classifier have been used to detect, track, recognize activities, and perform productivity analysis using surveillance videos of the construction site [19].

2.2. Vision-Based Earthmoving Activity Detection

Recently, deep learning techniques have been explored for detecting and continuously monitoring earthmoving activities of construction equipment, as well as assessing posture and cycle time and estimating the productivity of earthen material at construction sites. Various computer vision methods detect and recognize the activities of construction equipment by extracting features from consecutive frames of the visual stream from the cameras [24]. Various deep-learning algorithms have recently applied optical flow to track the motion of feature vectors and estimate the earthmoving activity of construction equipment. K-nearest neighbors (KNN), decision tree, and SVM have been utilized for human activity recognition using the Weizmann and UCF101 databases through an optical flow descriptor [26]. CNN-based backbone networks are fundamental in computer vision for detecting, tracking, and activity recognition of construction equipment [59]. A CNN architecture has been developed to automatically extract temporal gradient information before the activity decision based on the extracted features of raw RGB frames [60]. CNN methods use convolutional kernels to capture local data features and extract features from image and time series data by leveraging local connectivity, weight sharing, and pooling layers [10]. Using the hidden Markov model, a CNN network has been employed to detect and recognize sequential patterns in construction equipment activity [61]. The Multistage Temporal Convolutional Network (MS-TCN) has been utilized to enhance the performance of excavator activity recognition, synthesizing and analyzing 3D data [62]. RGB and optical flow data have been incorporated into a two-stream architecture for capturing spatial and temporal information in video action recognition [30]. An extended temporal gradient stream has been introduced to replace the optical flow stream, enhancing computing efficiency [63]. RGB, optical flow, and gray streams have been integrated for activity recognition and monitoring of construction workers using a deep stream CNN [17]. You Only Look Once (YOLO) has been used to identify idling reasons for excavators and dump trucks during the excavation operation [64]. An extended form of YOLOv3 has been trained to recognize different activities, including Excavation, Concrete Screeding, Bricklaying, and Carpentry, for intelligent construction monitoring [65]. You Only Watch Once (YOWO) is another computer vision-based deep learning algorithm used for activity recognition and productivity monitoring of construction equipment by estimating the cycle time of the earthwork operation [20]. Recently, a YOWO-based modified Distillation of temporal Gradient data for construction Entity activity Recognition (DIGER) algorithm was presented to classify three activities of the excavator: digging, swinging, and loading the trucks [66]. However, this study utilized single-source external video data and considered only three activities.
The sequential pattern in the spatiotemporal features of earthmoving activities of excavators has been exploited by various combinations of the CNN and LSTM models for excavator detection, tracking, and activity recognition through single-source external cameras [23]. Considering the computing capabilities of the model, the CNN-integrated Bidirectional LSTM (CNN-BiLSTM) algorithm has focused on directly classifying the single-action activity videos of the excavator (dumping, excavation, hauling, and swinging) [24]. Previous research has shown that CNN-LSTM effectively integrates CNNs for spatial feature extraction and LSTM networks for capturing temporal dependencies, making it highly suitable for analyzing sequential data in video-based activity recognition [67,68]. CNN-LSTM remains computationally efficient and requires fewer resources, which makes it practical for real-time applications on edge devices and mobile platforms, particularly in construction sites [69]. An upgraded Faster R-CNN-LSTM framework improved the mean Average Precision (mAP) by 15% and achieved a detection accuracy of 99.99%, effectively identifying unsafe behaviors and clarifying ambiguous actions among construction workers [70]. CNN-LSTM models effectively handle occlusion by learning long-term dependencies and motion dynamics, which improves recognition accuracy in construction environments that often encounter visual obstructions [71]. The single-source external video data collected through surveillance/RGB external cameras limits continuous monitoring due to obstructions in the camera’s line of sight or adverse visual conditions. The performance of the excavator’s 3D pose model has been evaluated by collecting data from 12 equal regions of 360 degrees, while the excavator constantly rotated its cabin to assess the effect of visual obstructions in the case of a single RGB camera source [58]. In case of single-source RGB data input, visual obstructions can significantly disrupt visual tracking, making it challenging to monitor a target and often leading to tracking failures [72,73]. Most of these studies rely on a single-source input data stream collected through surveillance or external cameras. However, accurate monitoring and recognition of construction equipment activities require additional data sources. A multi-stream approach has classified the behavior of bus drivers by processing the spatiotemporal features of multiple cameras attached to the bus [74]. Recently, the medical field has implemented multi-stream input data for electrocardiogram (ECG) signal classification, blood glucose estimation, and spatiotemporal-based behavioral multiparametric pain assessment system [27,28,29]. However, the multi-stream input data for activity recognition in construction equipment has yet to be explored in the earthwork construction industry. An overview of the recent literature is listed in Table 1, which includes different approaches used for object detection and activity recognition at construction sites.

2.3. Problem Statement and Objective

Despite recent advances in vision-based deep learning for excavator activity recognition, existing studies focus on single-source RGB video data captured by external cameras. These approaches are significantly hindered by obstructions, limited viewpoints, and sensitivity to lighting conditions, leading to reduced reliability in dynamic construction environments. While RGB video streams have been extensively used for excavator activity recognition, they inherently lack depth perception and spatial context, which are critical for accurately distinguishing between visually similar activities. A secondary source of visual data is required to enable uninterrupted streaming of excavators while working at the construction site, thereby enhancing the visual capabilities and improving the training process of the deep learning-based activity recognition model. The point cloud enables better depth perception and spatial context in three-dimensional space. However, point cloud data have not yet been fully leveraged for activity recognition tasks in earthmoving operations. Furthermore, no existing studies have effectively combined RGB and point cloud data using a unified deep learning framework to enhance the accuracy and robustness of excavator activity recognition.
This research addresses these limitations by proposing a two-stream CNN-LSTM architecture that fuses synchronized RGB and point cloud data to extract complementary spatial and temporal features. The objectives of the proposed study are: (i) utilize point cloud as one of the input data streams to enhance the spatiotemporal features; (ii) develop a two-stream deep learning architecture (CNN-LSTM) that integrates RGB images and 3D point cloud data for robust recognition of excavator activities in dynamic construction environments; (iii) improve the accuracy of the activity classification of excavators using the two-stream CNN-LSTM deep learning model; (iv) evaluate the performance of the proposed two-stream CNN-LSTM model against single-stream models using RGB or point cloud data alone, in terms of accuracy, precision, recall, and F1-score. The proposed approach addresses the limitations of single-modality inputs and lays the groundwork for real-time monitoring, predictive analytics, and digital twin integration in earthmoving operations.

3. Methodology

3.1. Overview

Hydraulic excavators are among the most essential pieces of equipment in the construction industry for performing various tasks at the construction site. Different methods have been developed and introduced to classify the types of work and activities performed by excavators. Recently, artificial intelligence and deep learning algorithms have been utilized to recognize the various tasks of excavators for the autonomous control and monitoring of construction site productivity. The use of multi-stream data input for activity recognition through deep learning is expected to enhance the accuracy of productivity estimation for construction machines, as mentioned in Section 2. Therefore, a deep learning-based activity recognition methodology utilizing multi-stream input data has been developed and graphically represented in Figure 1. The proposed method consists of four primary steps: (i) collecting, cleaning, and labeling input data obtained from camera and LiDAR sensors; (ii) developing datasets for both input sources for five excavator activities; (iii) creating a two-stream CNN-LSTM model for multi-stream activity recognition of the excavator and a CNN-LSTM model for single-stream input for comparison; and (iv) recognizing excavator activities using both single-stream and multi-stream input data, followed by a comparison of their performance. The proposed methodology developed an extensive database comprising five excavator activities for activity recognition and provides a foundation for future activity recognition models of other construction equipment.

3.2. Data Collection and Work Type Definition

The database developed for excavator activity recognition covers five excavator activities: Approach, Digging, Dumping, Idle, and Leveling. The source data, including video, image, and point cloud data, are collected at a rate of five frames per second. The video/image data are collected in Audio Video Interleave (AVI) or Joint Photographic Experts Group (JPG) format, and the point cloud data are collected in Compressed LiDAR (LAZ) format. An external/surveillance camera (Hanwha Vision Co., Ltd., Seongnam-si, Gyeonggi-do, Republic of Korea) is placed at the construction site to monitor the excavator’s activities. The LiDAR (Light Detection and Ranging) scanner also captured the point cloud at specific intervals during the earthwork operations, as shown in Figure 2. The LiDAR system consists of a laser scanner, GPS, and an IMU sensor, measuring distances and creating 3D maps of objects and environments using remote sensing technology (Leica Geosystems AG, Heerbrugg, Switzerland). The point cloud data are processed to curate an image dataset representing the point cloud data. The image datasets of the external camera and the lidar are synchronized. A total of 495,000 labeled data samples, comprising video frames and point cloud data, were utilized to facilitate the learning process of the two-stream CNN-LSTM network. A high-quality, synchronized database has been developed, which facilitates the autonomous identification of equipment activity and the development of autonomous construction sites.
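To make the point-cloud-to-image step concrete, the following is a minimal sketch of how a single LAZ frame could be rendered to a 2D image for synchronization with the camera frames. The laspy and Open3D libraries, the file names, and the rendering settings (white background, blue points, 1920 × 1080 resolution, following Section 3.3) are assumptions for illustration; the paper does not name the specific open-source software used.

```python
# Sketch: render one LAZ point cloud frame to a 2D image (white background,
# blue points, 1920x1080). Library choice and file names are assumptions.
import numpy as np
import laspy                      # reading LAZ requires a backend such as lazrs
import open3d as o3d

def laz_frame_to_image(laz_path: str, out_png: str) -> None:
    las = laspy.read(laz_path)
    xyz = np.vstack([las.x, las.y, las.z]).T          # N x 3 coordinates

    pcd = o3d.geometry.PointCloud()
    pcd.points = o3d.utility.Vector3dVector(xyz)
    pcd.paint_uniform_color([0.0, 0.0, 1.0])          # blue points

    vis = o3d.visualization.Visualizer()
    vis.create_window(width=1920, height=1080, visible=False)
    vis.add_geometry(pcd)
    vis.get_render_option().background_color = np.array([1.0, 1.0, 1.0])  # white background
    vis.capture_screen_image(out_png, do_render=True)
    vis.destroy_window()

# Hypothetical file names following the labeling format in Section 3.3
laz_frame_to_image("E_20240315_1430_PC_000001.laz", "E_20240315_1430_PC_000001.jpg")
```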
Hydraulic excavators are among the most essential pieces of equipment in the construction industry, performing various tasks at the construction site and playing a pivotal role in earthwork operations [33,76]. In this research, a medium-sized hydraulic crawler excavator manufactured by Doosan Infracore Co., Ltd., Incheon, Republic of Korea, is used. The work of excavators is distinguished based on the movement of the excavator body and categorized into four types: “Excavation and Loading”, “Leveling”, “Foundation and Trench Excavation”, and “Slope Excavation” (Figure 3). Excavation and Loading deals with the excavator-dump truck interaction for digging the earthen material and dumping it into the dump truck; Foundation and Trench Excavation deals with the trenches for underground utilities or foundation works in preparation for underground infrastructure installation; and Slope Excavation deals with slope or hillside work to create a flat and stable surface. These works are further divided into activities performed by the excavators. Idle is the state of the excavator when it is either stopped or waiting for other equipment to perform its task, with the bucket empty or filled with material. The digging activity starts when the bucket touches the ground for excavation and ends when it leaves the ground with earthen material in the bucket (reducing the total amount of earthen material at the earthwork workspace). Dumping refers to transferring earthen material from the bucket to the surroundings or a dump truck by closing and unfolding the bucket joints until the bucket is empty of earthen material. Leveling refers to flattening the earthen material from the front, rear, left, and right sides with the tip of the bucket without changing the total amount of earthen material in the earthwork workspace and without any earthen material in the bucket. Digging, dumping, and leveling are known as productive activities that directly contribute to the progress of the construction process. The approach activity refers to moving the excavator from one place to another or swinging to assist the productive activities. The approach includes moving and swinging. Moving refers to changing the location of the excavator due to uneven terrain, maneuvering around obstacles, or repositioning for the next task at the construction site. Swinging refers to the cab rotation and boom assembly maneuvering of the excavator around its base, allowing it to change the direction of work without moving its tracks. The maneuvering of the boom or arm of the excavator is recognized as either “swing full” or “swing empty”, depending on the availability of earthen material in the bucket. An overview of the images from the five selected activities, performed for different excavation work types, is shown in Figure 3.

3.3. Data Cleaning and Labeling Process

Data cleaning and preprocessing are the most important steps in preparing the dataset for the proposed two-stream CNN-LSTM approach. This process involves identifying and removing noisy data, such as duplicate entries, corrupted files, and samples of incorrect sizes or formats. This enhances the quality of the dataset and improves the accuracy of model training. Although data cleaning can be time-consuming, often taking 50% to 80% of the total classification process, it remains a crucial component of deep learning algorithms. The data cleaning and labeling process consisted of four stages: annotating tasks, segmenting videos, extracting frames, and reviewing the refined data. This process was performed on approximately 495,000 labeled data samples, which included synchronized video frames and point cloud images. Once cleaned and synchronized, the data were systematically organized for analysis using the two-stream CNN-LSTM architecture. During this process, a standard labeling format was used to ensure consistency in data structure, based on data type, file name, and sensor position, as shown in Figure 4. All data labels were carefully reviewed to guarantee reliability and consistency throughout the dataset.
To synchronize RGB and point cloud data temporally, timestamp metadata were used to match corresponding frames. For each point cloud instance, a synchronized RGB image was selected based on the same acquisition time. The 3D point cloud data were visualized and processed with open-source software, allowing for region segmentation, zooming, viewpoint adjustment, and background customization. The point cloud was exported as an image with a white background and blue points at a resolution of 1920 × 1080 pixels. Each point cloud image was paired with exactly one RGB frame corresponding to the same timestamp. RGB and point cloud data were labeled simultaneously using the proposed format, which includes metadata such as equipment type, acquisition date and time, sensor type, frame index, site conditions, and activity class. The final labeling scheme, shown in Figure 4, consists of seven components, starting with a letter denoting the equipment type (e.g., “E” for excavator), followed by the acquisition date and time, sensor source (e.g., CE for camera external or PC for point cloud), frame number, and a four-letter code describing the environment and activity. A detailed description of this format is provided in Table 2. This standardized format enables the construction of a structured, extensible database suitable for multi-stream activity recognition across various types of construction equipment and operational scenarios. The proposed format enables researchers to develop a database of various construction equipment and their associated activities, following a standard data labeling format.
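For illustration, the following sketch builds label names in the spirit of the proposed format. The authoritative definition is given in Figure 4 and Table 2; the exact field widths, separators, and the two-letter site and activity codes used here are assumptions, not the authors' specification.

```python
# Illustrative label builder loosely following the proposed format:
# equipment type, acquisition date and time, sensor source, frame index,
# and a code for site condition and activity. All field layouts are assumed.
from dataclasses import dataclass

@dataclass
class FrameLabel:
    equipment: str      # e.g., "E" for excavator
    date: str           # e.g., "20240315" (YYYYMMDD)
    time: str           # e.g., "1430" (HHMM)
    sensor: str         # "CE" = camera external, "PC" = point cloud
    frame_idx: int      # frame number within the clip
    site_code: str      # hypothetical two-letter site/environment code
    activity: str       # hypothetical two-letter activity code, e.g., "DG" = Digging

    def to_name(self) -> str:
        return (f"{self.equipment}_{self.date}_{self.time}_{self.sensor}_"
                f"{self.frame_idx:06d}_{self.site_code}{self.activity}")

# Example: an external-camera frame and its synchronized point-cloud frame
rgb_label = FrameLabel("E", "20240315", "1430", "CE", 1, "SU", "DG")
pc_label = FrameLabel("E", "20240315", "1430", "PC", 1, "SU", "DG")
print(rgb_label.to_name())   # E_20240315_1430_CE_000001_SUDG
print(pc_label.to_name())    # E_20240315_1430_PC_000001_SUDG
```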

3.4. Data Fusion

The image data samples of both streams are labeled and time-synchronized with each other so that each instant of the construction equipment at a specific time is described by two different spatial representations. The external camera records the movement of the excavator and the equipment around it in the surrounding environment. A standalone external camera view limits monitoring of the excavator’s performance due to occlusions and the camera’s angle or position. A second stream based on the point cloud data provides additional information where the external camera stream is limited. The point cloud data stream mainly focuses on the boom, arm, and bucket movement of the excavator, assisting the first stream in learning the various activities. A customized data loader is developed to achieve a single synchronized input for the two-stream CNN pipeline. An external camera image and a point cloud image form a single unit of the multi-stream input data, as shown in Figure 5. Each branch of the two-stream CNN architecture is linked with only one input stream to extract the spatiotemporal features of the input data streams. Additionally, the optical flows of the input data stream are estimated simultaneously. Before training, dense optical flow is calculated for consecutive frames using the Farneback algorithm to capture detailed motion dynamics. This optical flow estimation enhances the ability to recognize motion variations, particularly when static RGB frames may not provide sufficient distinctive features. After extracting the spatiotemporal features and optical flows of the multi-stream input data through the two branches of the CNN, the sequential information of both streams is constructed and concatenated for further learning. During the concatenation operation, the spatial features of both streams are horizontally stacked along the feature dimensions, and the spatial and channel-wise information of both streams is preserved through the concatenated tensor to enhance the learning of the LSTM model. The LSTM layer receives the single concatenated input of both streams for processing the spatiotemporal features and optical flows, followed by a Multi-Layer Perceptron (MLP) head that completes the two-stream CNN-LSTM model. The MLP head is responsible for making the final prediction based on the learned features and for evaluating the performance predictors and loss values. The multi-stream input data, consisting of five activities (approach, digging, dumping, idle, and leveling), are used to train the two-stream CNN-LSTM model.
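A minimal PyTorch-style sketch of such a customized data loader is shown below, assuming each sample is a 15-frame clip of paired, time-synchronized external-camera and point-cloud images sharing one activity label. The directory/index structure, resizing, and normalization are illustrative assumptions, not the authors' exact implementation.

```python
# Sketch of a synchronized two-stream dataset: each sample is a 15-frame clip
# of paired external-camera and point-cloud images with one activity label.
import torch
from torch.utils.data import Dataset, DataLoader
from torchvision.io import read_image

class TwoStreamClipDataset(Dataset):
    def __init__(self, clip_index, seq_len=15, size=256):
        # clip_index: list of (rgb_frame_paths, pc_frame_paths, label_id) tuples
        self.clip_index = clip_index
        self.seq_len = seq_len
        self.size = size

    def _load_clip(self, paths):
        frames = [read_image(p).float() / 255.0 for p in paths[: self.seq_len]]
        frames = [torch.nn.functional.interpolate(
            f.unsqueeze(0), size=(self.size, self.size)).squeeze(0) for f in frames]
        return torch.stack(frames)              # (T, C, H, W)

    def __len__(self):
        return len(self.clip_index)

    def __getitem__(self, idx):
        rgb_paths, pc_paths, label = self.clip_index[idx]
        return self._load_clip(rgb_paths), self._load_clip(pc_paths), label

# loader = DataLoader(TwoStreamClipDataset(clip_index), batch_size=8, shuffle=True)
```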

3.5. CNN Network for Activity Recognition Through Optical Flow

In this research, a multi-stream input data framework is developed to enhance the accuracy of excavator activity recognition, particularly under occlusion conditions. The multi-stream input data are collected through an external camera and a lidar at the construction site. The external camera records videos of excavator movements at the construction site, which are later converted into video frames. The lidar generates point clouds of the construction site at a specific frame rate, focusing on the movement of the excavator body parts. The point cloud data are processed using open-source 3D point cloud processing software to construct an image dataset that represents the point cloud data. The image datasets of the external camera and the lidar are synchronized, and the synchronized data are annotated manually based on the type of excavator activity at each timestamp of the video stream. The labeling process then follows the specific labeling format to label the image dataset systematically. The multi-stream, image-based labeled dataset serves as input for the activity recognition method, utilizing the two-stream CNN-LSTM deep learning algorithm, as illustrated in Figure 5. The RGB image data provide the spatial characteristics of the input data, whereas the temporal information is estimated through the optical flows of these RGB images. Optical flow is calculated by tracking and analyzing the motion patterns of objects’ edges and surfaces between consecutive grayscale frames, and the motion vectors of the video frames are computed to evaluate the dense optical flow. This study employs the Farneback method, a dense optical flow estimation technique that uses polynomial expansion to approximate pixel intensities within a local neighborhood, enabling efficient computation of dense motion fields. The method balances computational efficiency and accuracy and handles various types of motion, improving motion representation even when static RGB frames do not provide sufficiently distinctive features. The resulting optical flow representations are combined with the RGB images to create a comprehensive spatiotemporal feature set for activity recognition. The proposed research identifies the various excavator activity types (approach, digging, dumping, idle, and leveling) by analyzing the optical flows and spatial information of the RGB images.
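The following OpenCV sketch illustrates the Farneback dense optical flow computation between two consecutive frames. The parameter values shown are common defaults for this algorithm, not the authors' reported settings.

```python
# Sketch: dense optical flow between consecutive frames via the Farneback method.
import cv2
import numpy as np

def farneback_flow(prev_bgr: np.ndarray, next_bgr: np.ndarray) -> np.ndarray:
    # Convert to grayscale, as the flow is estimated on intensity images.
    prev_gray = cv2.cvtColor(prev_bgr, cv2.COLOR_BGR2GRAY)
    next_gray = cv2.cvtColor(next_bgr, cv2.COLOR_BGR2GRAY)
    # Arguments (typical defaults): pyr_scale=0.5, levels=3, winsize=15,
    # iterations=3, poly_n=5, poly_sigma=1.2, flags=0.
    flow = cv2.calcOpticalFlowFarneback(
        prev_gray, next_gray, None, 0.5, 3, 15, 3, 5, 1.2, 0)
    return flow   # shape (H, W, 2): per-pixel horizontal and vertical displacement
```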
After evaluating the dense optical flows, the two-stream CNN-LSTM processes an integrated input stream of labelled RGB images collected through an external camera and lidar. A two-stream architecture incorporating spatial and temporal information has been introduced using RGB and optical flow data [30]. The architecture of the two-stream CNN-LSTM, comprising 14 layers, is illustrated in Figure 5. It consists of two parallel branches, each designed to extract spatiotemporal features from the synchronized RGB and point cloud input streams. Each branch includes two convolutional layers, where the first uses 32 filters with a 3 × 3 kernel and the second uses 64 filters with a 5 × 5 kernel, followed by Rectified Linear Unit (ReLU) activations and 2 × 2 max-pooling layers with a stride of 2 to reduce spatial dimensionality. Padding is applied (1 pixel in the first stream and 2 pixels in the second) to preserve spatial resolution. The outputs of both branches are concatenated along the channel dimension to maintain spatial and channel-wise integrity before being passed to the LSTM, which has 64 hidden units to capture sequential dependencies. A Multi-Layer Perceptron (MLP) head, consisting of a fully connected layer, performs the final classification into five activity categories. The Adam optimizer is used to train the model, and optical flow computed via the Farneback method enriches temporal understanding by capturing pixel-wise motion dynamics between consecutive frames, improving robustness in real-world excavation scenarios.
Each stream in the CNN architecture independently extracts spatial and motion features, ensuring that structural and dynamic information is preserved. The convolutional layers in each stream are responsible for hierarchical feature extraction, capturing intricate patterns associated with excavator movements. Before the feature map information is passed to the LSTM layer, the spatial features of both convolutional branches are concatenated to preserve the spatial and channel-wise information of both branches and to enhance the learning capability of the LSTM on the concatenated tensor. The LSTM layer analyzes the feature maps to evaluate the sequential pattern and temporal dependencies, followed by the MLP head (classification layer). This MLP head serves as the classification layer and is responsible for producing the final activity prediction based on the spatiotemporal features learned by the CNN and LSTM components. The final classification is performed using a fully connected (FC) layer, which receives the LSTM output and generates activity predictions for five categories: approach, digging, dumping, idle, and leveling. The classification layer is followed by an optimization step using the ADAM optimizer, which fine-tunes the model by minimizing the classification loss and improving recognition accuracy. The proposed approach, integrating Farneback-based optical flow estimation, multi-stream feature extraction, and sequential pattern learning within a two-stream CNN-LSTM architecture, enables the model to outperform conventional single-stream deep learning techniques and provides a robust and reliable solution for analyzing real-world excavator activities, even in challenging excavation scenarios.
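The following PyTorch sketch approximates the described architecture: two convolutional branches (32 filters of 3 × 3, then 64 filters of 5 × 5, ReLU, 2 × 2 max-pooling with stride 2, padding of 1 and 2 pixels), channel-wise feature concatenation, a 64-unit LSTM, and a fully connected head for the five classes. The adaptive pooling and flattening step is an assumption used to make the fusion dimensions explicit; the paper does not fully specify them.

```python
# Minimal sketch of the two-stream CNN-LSTM; fusion dimensions are assumed.
import torch
import torch.nn as nn

class StreamCNN(nn.Module):
    def __init__(self, padding: int):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, padding=padding), nn.ReLU(),
            nn.MaxPool2d(kernel_size=2, stride=2),
            nn.Conv2d(32, 64, kernel_size=5, padding=padding), nn.ReLU(),
            nn.MaxPool2d(kernel_size=2, stride=2),
            nn.AdaptiveAvgPool2d((8, 8)),        # assumption: fixed-size feature map
        )

    def forward(self, x):                        # x: (B*T, 3, 256, 256)
        return self.features(x).flatten(1)       # (B*T, 64*8*8)

class TwoStreamCNNLSTM(nn.Module):
    def __init__(self, num_classes=5, hidden=64):
        super().__init__()
        self.rgb_branch = StreamCNN(padding=1)   # RGB stream: 1-pixel padding
        self.pc_branch = StreamCNN(padding=2)    # point cloud stream: 2-pixel padding
        self.lstm = nn.LSTM(input_size=2 * 64 * 8 * 8, hidden_size=hidden,
                            batch_first=True)
        self.head = nn.Linear(hidden, num_classes)   # MLP classification head

    def forward(self, rgb, pc):                  # each: (B, T, 3, 256, 256)
        b, t = rgb.shape[:2]
        f_rgb = self.rgb_branch(rgb.flatten(0, 1)).view(b, t, -1)
        f_pc = self.pc_branch(pc.flatten(0, 1)).view(b, t, -1)
        fused = torch.cat([f_rgb, f_pc], dim=-1)     # concatenate per frame
        out, _ = self.lstm(fused)                    # sequential dependencies
        return self.head(out[:, -1])                 # classify from the last step

# logits = TwoStreamCNNLSTM()(torch.rand(2, 15, 3, 256, 256),
#                             torch.rand(2, 15, 3, 256, 256))
```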

3.6. Performance Metrics

Four performance predictors, Precision, Recall, Accuracy, and F1 score, were evaluated to demonstrate the performance of the trained model using Equations (1)–(4). Precision is the ratio of true positive (TP) outcomes to the sum of true positive and false positive (FP) outcomes. Recall is the ratio of true positive (TP) outcomes to the sum of true positive (TP) and false negative (FN) outcomes. The harmonic mean of precision and recall is known as the F1 score of the model. Accuracy is the proportion of correct predictions, representing the model’s overall performance, and is used to calculate the loss of the model simultaneously. A positive outcome predicted as positive is a true positive (TP), a negative outcome predicted as positive is a false positive (FP), a positive outcome predicted as negative is a false negative (FN), and a negative outcome correctly predicted as negative is a true negative (TN).
Precision = TP/(TP + FP)	(1)
Recall = TP/(TP + FN)	(2)
Accuracy = (TP + TN)/(TP + FP + FN + TN)	(3)
F1 = 2 × (Precision × Recall)/(Precision + Recall)	(4)
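For reference, these metrics can be computed from predicted and ground-truth labels as in the short sketch below; the macro averaging over the five classes is an assumption, as the paper does not state the averaging scheme, and the label values are purely illustrative.

```python
# Sketch: Precision, Recall, Accuracy, and F1 from predictions and ground truth.
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

# Label ids: 0=Approach, 1=Digging, 2=Dumping, 3=Idle, 4=Leveling
y_true = [0, 1, 2, 3, 4, 1, 2]    # illustrative ground-truth labels
y_pred = [0, 1, 2, 3, 4, 1, 1]    # illustrative model predictions

accuracy = accuracy_score(y_true, y_pred)
precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="macro", zero_division=0)   # averaging scheme assumed
print(f"Accuracy={accuracy:.4f}  Precision={precision:.4f}  "
      f"Recall={recall:.4f}  F1={f1:.4f}")
```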

4. Experimental Implementation and Results

4.1. Datasets

The excavator database for activity recognition is developed from five excavator activities, approach, digging, dumping, idle, and leveling, utilizing an external camera and a lidar scanner. These sensors are placed at various positions and angles to capture different perspectives of the excavators; however, the external camera provides visual data only from fixed angles, whereas the point cloud allows the viewpoint to be adjusted after collection, offering the most informative view for training the model. A total of 495,000 labeled image frames for the activities approach, digging, dumping, idle, and leveling are collected to develop the database for activity recognition. For this research, a total of 980 video clips are used, with each activity comprising 98 video clips from the external camera stream and 98 from the point cloud, all with the same timestamps. The dataset is developed from different construction sites under various environmental conditions and synchronized. The multi-stream dataset was split into two parts. Seventy percent of the data, consisting of 68 video clips for each activity from each stream, totaling 680 activity video clips, was allocated for training. The remaining 30%, which includes 30 video clips per activity from each stream, totaling 300 video clips, was used for testing. The dataset includes video frames from four construction sites under different environmental conditions. The RGB camera and point cloud video frames are time-synchronized for dataset input, as shown in Figure 5. Each activity clip contains 15 video frames in each stream. The number of video frames for each activity in both streams is counted accurately to estimate the optical flow for both streams. An overview of the data distribution is shown in Table 3.
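A sketch of such a clip-level split, which keeps the synchronized RGB and point-cloud clips of each activity together and reproduces the 68/30 clips-per-activity split, is shown below; the index structure, helper name, and random seed are illustrative assumptions.

```python
# Sketch: clip-level 70/30 split per activity (68 training / 30 test clips per
# activity and stream), keeping paired RGB and point-cloud clips together.
import random

def split_clips(clips_by_activity, n_train=68, seed=42):
    # clips_by_activity: dict mapping activity name -> list of paired clip ids
    rng = random.Random(seed)
    train, test = [], []
    for activity, clips in clips_by_activity.items():
        shuffled = clips[:]
        rng.shuffle(shuffled)
        train += [(activity, c) for c in shuffled[:n_train]]
        test += [(activity, c) for c in shuffled[n_train:]]
    return train, test
```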

4.2. Classification Model Training and Testing Process

The multi-stream input data are trained and tested using the algorithm on a 64-bit Windows 10 operating system running on an HP Z8 G4 Workstation (Hewlett-Packard Development Company, L.P., Palo Alto, CA, USA) with an Intel(R) Xeon(R) Gold 6242R CPU @ 3.10 GHz (Intel Corporation, Santa Clara, CA, USA). The input image dataset is processed using the CNN network after resizing the input images to 256 × 256. The two-stream CNN-LSTM network consisted of 14 layers, with the first convolutional layers in both streams having 32 filters with a 3 × 3 kernel size and the second convolutional layers having 64 filters with a 5 × 5 kernel size. The first stream had a padding of 1 pixel in the convolutional layers, while the second stream had 2 pixels. Max-pooling in both streams used a 2 × 2 kernel size with a stride of 2 pixels. The ReLU activation function was utilized in the study. The sequence length can vary depending on the duration of activities; however, this study considers a 15-frame sequence for each activity. To ensure transparency and reproducibility, all reported performance metrics were computed on a dataset split into 70% for training and 30% for testing. Each video clip consisted of 15 consecutive frames, and activity labels were synchronized across both the RGB and point cloud streams. The final metrics, including accuracy, precision, recall, F1-score, and loss, were derived by evaluating the trained model on the test dataset. Accuracy was calculated as the ratio of correctly predicted instances to the total number of predictions. Loss values were directly obtained from the model’s softmax output during evaluation. Together, they provide a comprehensive evaluation of the model’s ability to generalize across various activity classes in unseen data. The model training involved optimizing different hyperparameters, and the chosen training hyperparameters were a learning rate of 0.0001, a mini-batch size of 8, 30 epochs for training, and 64 hidden units for the LSTM layer. The process of hyperparameter optimization, crucial for achieving robust model performance, involved evaluating various combinations of learning rates and training epochs. As detailed in Table 4, which illustrates the testing accuracies across different hyperparameter settings, a significant variation in model performance was observed. The learning rate of 0.0001 consistently outperformed other settings, achieving the highest accuracy of 94.67% at 30 epochs. Lower learning rates (e.g., 0.00001) led to underfitting, while higher learning rates (e.g., 0.01) exhibited unstable performance. This emphasizes the importance of fine-tuning hyperparameters to maximize model effectiveness. Figure 6 illustrates the learning curves for the training and testing datasets across 30 epochs. The training loss consistently decreased while the accuracy rapidly approached saturation, reaching a final training accuracy of 99.71%. The most effective learning rate identified was 0.0001, which consistently produced the highest testing accuracies, steadily improving from 90.43% at 20 epochs to a peak of 94.67% at 30 epochs. This empirically validates that a learning rate of 0.0001 combined with 30 epochs, along with a batch size of 8 and 64 hidden units, constitutes the optimal set of hyperparameters for this model, leading to the highest reported training accuracy of 99.62%.
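A minimal sketch of this training configuration (Adam optimizer, learning rate 0.0001, batch size 8, cross-entropy loss over 15-frame clips) is shown below, reusing the two-stream model sketch from Section 3.5; the loader construction, epoch count, and logging are illustrative, not the authors' exact training script.

```python
# Sketch of the training loop: Adam, lr=0.0001, batch size 8, cross-entropy loss.
import torch
import torch.nn as nn

model = TwoStreamCNNLSTM()                       # from the sketch in Section 3.5
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
criterion = nn.CrossEntropyLoss()

def train(model, train_loader, epochs=30):
    model.train()
    for epoch in range(epochs):
        total, correct = 0, 0
        for rgb_clip, pc_clip, labels in train_loader:   # batches of size 8
            optimizer.zero_grad()
            logits = model(rgb_clip, pc_clip)
            loss = criterion(logits, labels)
            loss.backward()
            optimizer.step()
            correct += (logits.argmax(dim=1) == labels).sum().item()
            total += labels.numel()
        print(f"epoch {epoch + 1}: train accuracy {correct / total:.4f}")
```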

4.3. Results

This section presents a detailed analysis of the experimental results, highlighting the performance of the proposed two-stream CNN-LSTM model for excavator activity recognition. During the training phase, the model demonstrated exceptional learning capabilities, with an overall training accuracy of 99.62%. As depicted in Figure 7a, activities such as Digging, Dumping, Idle, and Leveling were classified with 100% accuracy, indicating the model’s robustness in recognizing their distinct features. The Approach activity also exhibited outstanding performance, with 98.72% of its instances correctly identified, and only a single misclassification (1.28%) into the Dumping category. This near-perfect training performance across all classes underscores the model’s strong learning capacity with the optimized hyperparameters.
The true measure of the model’s efficacy lies in its generalization performance on unseen data. When evaluated on the 30% test dataset, the trained model achieved an overall test accuracy of 94.67%, an F1-score of 94.78%, and a test loss of 5.10%. The testing confusion matrix in Figure 7b provides a detailed breakdown of these results. The model achieved perfect accuracy for the Digging and Idle activities, classifying 100% of their instances correctly. For the Dumping activity, 95.65% of instances were correctly identified, with one instance (4.35%) being misclassified as Digging. The Leveling activity was classified with 91.30% accuracy, while 8.70% of its instances were misclassified as Approach. Lastly, the Approach activity had an 86.96% correct classification rate, with misclassifications into Dumping (8.70%) and Leveling (4.35%). These specific errors highlight subtle overlaps in visual cues or movement patterns between certain activity pairs during real-world operation, such as the initial phases of Approach and Dumping, or the characteristics shared by the Leveling and Approach activities performed by the excavator.
The error in predicted labels for activity recognition is visualized in Figure 8, comparing the predicted labels with the ground truth (GT). The experimental results explicitly demonstrate the robust performance of the multi-modal two-stream CNN-LSTM model for excavator activity recognition. The systematic hyperparameter tuning, particularly the identification of an optimal learning rate and epoch count, significantly contributed to the model’s ability to achieve high accuracy and strong generalization capabilities. While minor ambiguities exist between specific activity pairs, resulting in a small number of misclassifications, the overall high precision and low error rates validate the proposed methodology’s potential for accurately and automatically monitoring excavator earthwork operations.

4.4. Sensitivity Analysis of Hyperparameters

To evaluate the robustness of the proposed two-stream CNN-LSTM model, a sensitivity analysis was conducted by systematically adjusting key hyperparameters, including the learning rate, number of epochs, and input sequence length (Table 5). The analysis reveals how these parameters influence classification accuracy, F1-score, and loss, providing insights into the model’s stability and generalization capabilities.
The learning rate had a significant impact on model performance. A rate of 0.0001 achieved the highest accuracy (94.67%) and F1-score (94.78%) with minimal loss (5.33%), demonstrating optimal convergence. In contrast, higher rates (e.g., 0.01) caused unstable training, with accuracy dropping to 43.48% due to erratic weight updates, while lower rates (e.g., 0.00001) led to underfitting (86.09% accuracy) as the model failed to learn efficiently.
Increasing epochs improved performance up to a saturation point. At 20 epochs, the model achieved 90.43% accuracy but risked premature stopping. Training for 25 epochs improved accuracy to 93.91%, while 30 epochs yielded the best results (94.67%). Beyond this, overfitting was observed, as indicated by diverging training and testing losses (Figure 6), emphasizing the need for early stopping. The input sequence length (number of frames per activity sample) was tested at 10, 15, and 20 frames. A length of 15 frames struck the best balance, achieving 94.67% accuracy. Shorter sequences (10 frames) reduced accuracy by 5% due to insufficient temporal context, while longer sequences (20 frames) introduced computational overhead without significant gains (94.12% accuracy).
The overall findings confirm that learning rate is the most critical parameter affecting convergence and final accuracy; number of epochs must be optimized to avoid underfitting or overfitting; sequence length has a moderate effect, with an optimal range for temporal coverage. These observations support the reproducibility of the proposed model and demonstrate its robustness when appropriately tuned. The sensitivity analysis enhances the transparency and applicability of the methodology for real-world construction environments.

5. Evaluation and Discussion

To evaluate the effectiveness of the proposed two-stream CNN-LSTM model, its performance was compared with that of two baseline single-stream models, one using only RGB video frames and the other using 3D point cloud data. Both models share the same CNN-LSTM architecture and were trained and tested separately on the same dataset to ensure a fair comparison. Similar to the proposed multi-stream approach, the single-stream CNN-LSTM network extracts spatial features from camera RGB or point cloud data using a CNN and then utilizes these features as input for the LSTM to capture the temporal aspects of the input data. It integrates spatial and temporal features to recognize excavator activities and uses CNN layers to analyze input frames with optical flow, estimated via the Farneback algorithm. These spatiotemporal and optical flow features are then processed through an LSTM for activity recognition and classification. The dataset was split into training and testing data with a 70% and 30% ratio, respectively, to create distinct subsets for training and testing purposes. Figure 9 displays the training and testing results of the two models based on a single-stream CNN-LSTM architecture. Both models were trained on the same dataset to recognize five key excavator activities: Approach, Digging, Dumping, Idle, and Leveling. The figure includes the confusion matrix, along with class-specific precision, recall, and loss metrics, as well as overall accuracy and F1-score for training and testing.
The RGB-based model achieved a training accuracy of 98.53%, with high classification performance across all activity categories. The training confusion matrix indicates that the model learned most class boundaries effectively, with perfect classification in “Idle” and “Leveling”. The overall F1-score was 98.54, and the average loss was low (1.47). However, during the test phase, performance declined, with a test accuracy of 90.67%, an F1-score of 90.78, and a notable increase in class-wise losses (average of 9.33%). Misclassifications were more frequent between classes, such as “Approach” and “Leveling” or “Dumping”, likely due to similar motion patterns and RGB’s sensitivity to viewpoint and lighting changes.
The point cloud-based model exhibited perfect classification performance during training, achieving 100% in accuracy, precision, recall, and F1-score, with zero loss across all activity classes. This indicates potential overfitting. Nonetheless, the model performed better than the RGB model on the test set, achieving a higher test accuracy of 92.00%, an F1-score of 92.00%, and a reduced average loss of 8.00%. The confusion matrix shows fewer misclassifications, particularly in separating geometrically distinct actions such as “Leveling” and “Idle”. The results demonstrate the better robustness of point cloud data to visual occlusions and lighting variations, likely due to its 3D spatial structure and depth perception capabilities. However, a notable limitation of the point cloud-only model is its lack of rich texture and appearance cues, which are often essential for distinguishing between activities with subtle motion differences but similar spatial configurations. This limitation can hinder the model’s ability to differentiate between contextually similar actions, especially when spatial geometry alone is insufficient for classification.
The comparison results in Table 6 clearly indicate that the proposed two-stream CNN-LSTM model demonstrated superior performance in excavator activity recognition compared to single-stream approaches, as evidenced by the comparative metrics. When trained on multi-stream input data, the model achieved an accuracy of 94.67%, an F1-score of 94.78%, and a loss of only 5.33%, outperforming both the RGB-only (90.67% accuracy, 9.33% loss) and PC-only (92.00% accuracy, 8.00% loss) configurations. This improvement highlights the complementary strengths of RGB and point cloud data, where RGB provides rich spatial features and the point cloud enhances depth perception and robustness to occlusions and lighting variations. The balanced precision (94.90%) and recall (94.67%) further indicate that the model effectively mitigates misclassifications, particularly in challenging scenarios such as distinguishing between “Approach” and “Leveling” activities, where motion patterns often overlap.
These results support the paper’s hypothesis that multi-modal fusion addresses the limitations of single-source data, such as fixed viewpoints in RGB and sparse textures in point clouds. Notably, the point cloud-only model (92.00% accuracy) surpassed the RGB-only model (90.67% accuracy), reinforcing the value of 3D geometric features for recognizing complex excavator kinematics; even so, the two-stream architecture achieved the best overall performance, confirming the synergistic benefit of combining RGB’s rich imagery with the point cloud’s structural robustness. Furthermore, compared with the recent work summarized in Table 1, the proposed method sets a new benchmark, validating its potential for real-time construction monitoring and digital twin integration. Nevertheless, the model’s performance may degrade in multi-excavator environments, suggesting future enhancements such as IMU sensor fusion or attention mechanisms to improve scalability. Overall, this study confirms that two-stream learning significantly advances activity recognition accuracy while addressing critical challenges on dynamic construction sites.
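For completeness, aggregate accuracy, precision, recall, and F1-score of the kind reported in Table 6 can be derived from a confusion matrix in the standard way. The snippet below is a minimal sketch using scikit-learn; the example labels are placeholders rather than the study’s actual test outputs, and macro averaging is shown only as a plausible choice given the balanced class distribution in Table 3.

```python
# Minimal sketch: deriving overall accuracy, precision, recall, and F1-score
# from model predictions. Labels and predictions here are placeholders, not
# the study's actual test outputs.
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             precision_recall_fscore_support)

CLASSES = ["Approach", "Digging", "Dumping", "Idle", "Leveling"]

y_true = ["Digging", "Dumping", "Idle", "Approach", "Leveling", "Digging"]
y_pred = ["Digging", "Dumping", "Idle", "Leveling", "Leveling", "Digging"]

cm = confusion_matrix(y_true, y_pred, labels=CLASSES)
acc = accuracy_score(y_true, y_pred)
# Macro averaging weights every activity class equally, a plausible choice
# given the balanced per-class clip counts in Table 3.
prec, rec, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, labels=CLASSES, average="macro", zero_division=0)

print(cm)
print(f"accuracy={acc:.4f} precision={prec:.4f} recall={rec:.4f} f1={f1:.4f}")
```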

6. Conclusions

This study presented a multi-modal deep learning framework for recognizing excavator activities using synchronized RGB and point cloud data streams. Unlike previous approaches that rely solely on single-source RGB imagery, the two-stream CNN-LSTM model leverages the complementary strengths of 2D visual textures and 3D geometric depth to deliver robust activity recognition even under occlusion, variable lighting conditions, and complex site environments. The model achieved a training accuracy of 99.62% and a test accuracy of 94.67%, significantly outperforming single-stream RGB (90.67%) and point cloud (92.00%) CNN-LSTM models on the same dataset. These results demonstrate the effectiveness of spatiotemporal feature fusion in handling ambiguities between visually similar activities such as Approach and Leveling.
The key academic contributions of this study include the development of a two-stream CNN-LSTM architecture for fusing RGB and point cloud data, the creation of a large-scale synchronized dataset comprising 495,000 labeled samples, the introduction of a standard labeling scheme for multi-modal activity annotation, and a comprehensive performance comparison with single-stream models. The proposed method consistently outperformed existing baselines, confirming the advantages of multi-modal fusion for robust excavator activity recognition in dynamic construction environments.
The study has demonstrated that the proposed method significantly improves the recognition of excavator activities over recent state-of-the-art single-stream models, thereby enhancing the safety, efficiency, and automation of earthwork operations. However, some limitations should be considered. The model is optimized for recognizing the activities of a single excavator, and its performance may be affected when multiple excavators are present. Moreover, the study did not account for the influence of surrounding equipment, particularly the movement of dump trucks; such interactions may cause occlusions or overlapping motion that can affect recognition accuracy. Considering these limitations, the future objectives of this study are to (i) enhance the model’s architecture to enable recognition of activities for multiple excavators simultaneously; (ii) expand data collection to include different excavator activities from various construction sites with diverse earthen materials; and (iii) explore methods to enhance the AI model’s robustness in multi-equipment environments by incorporating occlusion handling and contextual understanding.
In addition to these goals, this study presents opportunities for integrating the proposed recognition system into broader AI-driven health monitoring platforms. By continuously observing operational patterns and behaviors through multi-stream data, the system could evolve to detect abnormal equipment states, anticipate component wear, and support prescriptive maintenance decisions. This paves the way for real-time digital twin environments where performance optimization, fault prediction, and safety management converge into a unified monitoring framework.

Author Contributions

Conceptualization, H.S.C., K.L., A.S. and J.S.; Methodology, H.S.C., K.L. and A.S.; Software, H.S.C. and K.L.; Validation, A.S. and K.L.; Formal analysis, H.S.C., A.S. and K.L.; Investigation, H.S.C., A.S. and K.L.; Resources, H.S.C. and J.S.; Data curation, A.S. and K.L.; Writing—original draft, H.S.C. and K.L.; Writing—review and editing, A.S. and J.S.; Visualization, K.L.; Supervision, A.S. and J.S.; Project administration, A.S. and J.S.; Funding acquisition, J.S. All authors contributed equally to manuscript preparation. All authors have read and agreed to the published version of the manuscript.

Funding

This work was funded by the National Research Foundation of Korea (NRF) grant funded by the Korean Government (MEST) (No. NRF 2018R1A5A1025137) and a Korea Agency for Infrastructure Technology Advancement (KAIA) grant funded by the Ministry of Land, Infrastructure, and Transport (National Research for Smart Construction Technology, No. RS-2020-KA157089).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The dataset used and analyzed during the current study is available from the corresponding authors upon reasonable request. The data are not publicly available due to privacy and confidentiality agreements with the construction site operators involved in the data collection.

Acknowledgments

Kamran Latif is highly thankful to the Higher Education Commission of Pakistan for a Human Resource Development Initiative scholarship at the Universities of Engineering Science and Technology. The authors hereby acknowledge the incorporation of AI-assisted instruments, specifically Grammarly (Version 8.932) and ChatGPT (GPT-4o, developed by OpenAI), to augment linguistic precision and rectify grammatical inaccuracies. The authors reviewed and edited the content as needed and take full responsibility for the content of the published article. The utilization of AI is in accordance with the journal’s stipulations regarding transparency and ethical standards in authorship practices.

Conflicts of Interest

The authors declare no conflict of interest.

Abbreviations

The following abbreviations are used in this manuscript:
AI: Artificial Intelligence
AVI: Audio Video Interleave (video format)
CE: Camera External
CNN: Convolutional Neural Network
CV: Computer Vision
DL: Deep Learning
DIGER: Deep Information Guided Excavator Recognition
DT: Decision Tree
FC: Fully Connected (layer)
FN: False Negative
FP: False Positive
GPS: Global Positioning System
GT: Ground Truth
HOG: Histogram of Oriented Gradients
HMM: Hidden Markov Model
IFaster R-CNN: Improved Faster Region-based CNN
IMU: Inertial Measurement Unit
JPG: Joint Photographic Experts Group (image format)
JSON: JavaScript Object Notation
KNN: K-Nearest Neighbors
LAZ: Compressed LAS (LiDAR data format)
LSTM: Long Short-Term Memory
MLP: Multi-Layer Perceptron
MS-TCN: Multi-Stage Temporal Convolutional Network
PC: Point Cloud
RGB: Red Green Blue
RNN: Recurrent Neural Network
ReLU: Rectified Linear Unit
RPN: Region Proposal Network
SVM: Support Vector Machine
TN: True Negative
TP: True Positive
YOLO: You Only Look Once
YOWO: You Only Watch Once

References

  1. Latif, K.; Sharafat, A.; Seo, J. Digital Twin-Driven Framework for TBM Performance Prediction, Visualization, and Monitoring through Machine Learning. Appl. Sci. 2023, 13, 11435. [Google Scholar] [CrossRef]
  2. Yang, J.; Park, M.-W.; Vela, P.A.; Golparvar-Fard, M. Construction Performance Monitoring via Still Images, Time-Lapse Photos, and Video Streams: Now, Tomorrow, and the Future. Adv. Eng. Inform. 2015, 29, 211–224. [Google Scholar] [CrossRef]
  3. Parente, M.; Gomes Correia, A.; Cortez, P. Artificial Neural Networks Applied to an Earthwork Construction Database. In Proceedings of the Second International Conference on Information Technology in Geo-Engineering, Durham, UK, 21–22 July 2014; pp. 200–205. [Google Scholar]
  4. Li, B.; Hou, B.; Yu, W.; Lu, X.; Yang, C. Applications of Artificial Intelligence in Intelligent Manufacturing: A Review. Front. Inf. Technol. Electron. Eng. 2017, 18, 86–96. [Google Scholar] [CrossRef]
  5. Aziz, R.F.; Hafez, S.M.; Abuel-Magd, Y.R. Smart Optimization for Mega Construction Projects Using Artificial Intelligence. Alex. Eng. J. 2014, 53, 591–606. [Google Scholar] [CrossRef]
  6. Sharafat, A.; Tanoli, W.A.; Zubair, M.U.; Mazher, K.M. Digital Twin-Driven Stability Optimization Framework for Large Underground Caverns. Appl. Sci. 2025, 15, 4481. [Google Scholar] [CrossRef]
  7. Lee, S.; Sharafat, A.; Kim, I.S.; Seo, J. Development and Assessment of an Intelligent Compaction System for Compaction Quality Monitoring, Assurance, and Management. Appl. Sci. 2022, 12, 6855. [Google Scholar] [CrossRef]
  8. Xiao, B.; Kang, S.-C. Development of an Image Data Set of Construction Machines for Deep Learning Object Detection. J. Comput. Civ. Eng. 2021, 35. [Google Scholar] [CrossRef]
  9. Liu, G.; Wang, Q.; Wang, T.; Li, B.; Xi, X. Vision-Based Excavator Pose Estimation for Automatic Control. Autom. Constr. 2024, 157, 105162. [Google Scholar] [CrossRef]
  10. Lu, Y.; You, K.; Zhou, C.; Chen, J.; Wu, Z.; Jiang, Y.; Huang, C. Video Surveillance-Based Multi-Task Learning with Swin Transformer for Earthwork Activity Classification. Eng. Appl. Artif. Intell. 2024, 131, 107814. [Google Scholar] [CrossRef]
  11. Kim, J.; Chi, S. Multi-Camera Vision-Based Productivity Monitoring of Earthmoving Operations. Autom. Constr. 2020, 112, 103121. [Google Scholar] [CrossRef]
  12. Kim, H.; Ahn, C.R.; Engelhaupt, D.; Lee, S. Application of Dynamic Time Warping to the Recognition of Mixed Equipment Activities in Cycle Time Measurement. Autom. Constr. 2018, 87, 225–234. [Google Scholar] [CrossRef]
  13. Ahn, C.R.; Lee, S.; Peña-Mora, F. Application of Low-Cost Accelerometers for Measuring the Operational Efficiency of a Construction Equipment Fleet. J. Comput. Civ. Eng. 2013, 29. [Google Scholar] [CrossRef]
  14. Akhavian, R.; Behzadan, A.H. Smartphone-Based Construction Workers’ Activity Recognition and Classification. Autom. Constr. 2016, 71, 198–209. [Google Scholar] [CrossRef]
  15. Teizer, J. Status Quo and Open Challenges in Vision-Based Sensing and Tracking of Temporary Resources on Infrastructure Construction Sites. Adv. Eng. Inform. 2015, 29, 225–238. [Google Scholar] [CrossRef]
  16. Yang, J.; Vela, P.; Teizer, J.; Shi, Z. Vision-Based Tower Crane Tracking for Understanding Construction Activity. J. Comput. Civ. Eng. 2014, 28, 103–112. [Google Scholar] [CrossRef]
  17. Luo, H.; Xiong, C.; Fang, W.; Love, P.E.D.; Zhang, B.; Ouyang, X. Convolutional Neural Networks: Computer Vision-Based Workforce Activity Assessment in Construction. Autom. Constr. 2018, 94, 282–289. [Google Scholar] [CrossRef]
  18. Deng, T.; Sharafat, A.; Lee, S.; Seo, J. Automatic Vision-Based Dump Truck Productivity Measurement Based on Deep-Learning Illumination Enhancement for Low-Visibility Harsh Construction Environment. J. Constr. Eng. Manag. 2024, 150. [Google Scholar] [CrossRef]
  19. Chen, C.; Zhu, Z.; Hammad, A. Automated Excavators Activity Recognition and Productivity Analysis from Construction Site Surveillance Videos. Autom. Constr. 2020, 110, 103045. [Google Scholar] [CrossRef]
  20. Cheng, M.-Y.; Cao, M.-T.; Nuralim, C.K. Computer Vision-Based Deep Learning for Supervising Excavator Operations and Measuring Real-Time Earthwork Productivity. J. Supercomput. 2022, 79, 4468–4492. [Google Scholar] [CrossRef]
  21. Gong, J.; Caldas, C.H.; Gordon, C. Learning and Classifying Actions of Construction Workers and Equipment Using Bag-of-Video-Feature-Words and Bayesian Network Models. Adv. Eng. Inform. 2011, 25, 771–782. [Google Scholar] [CrossRef]
  22. Fang, Q.; Li, H.; Luo, X.; Ding, L.; Luo, H.; Rose, T.M.; An, W. Detecting Non-Hardhat-Use by a Deep Learning Method from Far-Field Surveillance Videos. Autom. Constr. 2018, 85, 1–9. [Google Scholar] [CrossRef]
  23. Kim, J.; Chi, S. Action Recognition of Earthmoving Excavators Based on Sequential Pattern Analysis of Visual Features and Operation Cycles. Autom. Constr. 2019, 104, 255–264. [Google Scholar] [CrossRef]
  24. Kim, I.-S.; Latif, K.; Kim, J.; Sharafat, A.; Lee, D.-E.; Seo, J. Vision-Based Activity Classification of Excavators by Bidirectional LSTM. Appl. Sci. 2022, 13, 272. [Google Scholar] [CrossRef]
  25. Latif, K.; Sharafat, A.; Tao, D.; Park, S.; Seo, J. Digital Twin for Excavator-Dump Optimization Based on Two-Stream CNN-LSTM and DES. In Proceedings of the KSCE Convention Conference and Civil Expo, Jeju, Republic of Korea, 16–18 October 2024; pp. 17–18. Available online: https://www.dbpia.co.kr/Journal/articleDetail?nodeId=NODE12088514 (accessed on 25 July 2025).
  26. Ladjailia, A.; Bouchrika, I.; Merouani, H.F.; Harrati, N.; Mahfouf, Z. Human Activity Recognition via Optical Flow: Decomposing Activities into Basic Actions. Neural Comput. Appl. 2020, 32, 16387–16400. [Google Scholar] [CrossRef]
  27. Allam, J.P.; Sahoo, S.P.; Ari, S. Multi-Stream Bi-GRU Network to Extract a Comprehensive Feature Set for ECG Signal Classification. Biomed Signal Process Control 2024, 92, 106097. [Google Scholar] [CrossRef]
  28. Semwal, A.; Londhe, N.D. A Multi-Stream Spatio-Temporal Network Based Behavioural Multiparametric Pain Assessment System. Biomed Signal Process Control 2024, 90, 105820. [Google Scholar] [CrossRef]
  29. Chowdhury, M.H.; Chowdhury, M.E.H.; Alqahtani, A. MMG-Net: Multi Modal Approach to Estimate Blood Glucose Using Multi-Stream and Cross Modality Attention. Biomed Signal Process Control 2024, 92, 105975. [Google Scholar] [CrossRef]
  30. Simonyan, K.; Zisserman, A. Two-Stream Convolutional Networks for Action Recognition in Videos. In Advances in Neural Information Processing Systems; Ghahramani, Z., Welling, M., Cortes, C., Lawrence, N., Weinberger, K.Q., Eds.; Curran Associates, Inc.: Montreal, QC, Canada, 2014; Volume 27. [Google Scholar]
  31. Kuenzel, R.; Teizer, J.; Mueller, M.; Blickle, A. SmartSite: Intelligent and Autonomous Environments, Machinery, and Processes to Realize Smart Road Construction Projects. Autom. Constr. 2016, 71, 21–33. [Google Scholar] [CrossRef]
  32. Sharafat, A.; Kim, J.; Park, S.; Seo, J. Accuracy and Error Analysis of 3D Mapping Using Unmanned Aerial Vehicle (UAV) for Earthwork Project. In Proceedings of the KSCE 2019 Convention Conference and Civil Expo, KSCE, Pyeongchang, Republic of Korea, 16–18 October 2019; pp. 385–386. Available online: https://www.dbpia.co.kr/Journal/articleDetail?nodeId=NODE09328115 (accessed on 25 July 2025).
  33. Kim, J.; Lee, S.S.; Seo, J.; Kamat, V.R. Modular Data Communication Methods for a Robotic Excavator. Autom. Constr. 2018, 90, 166–177. [Google Scholar] [CrossRef]
  34. Golparvar-Fard, M.; Peña-Mora, F.; Arboleda, C.A.; Lee, S. Visualization of Construction Progress Monitoring with 4D Simulation Model Overlaid on Time-Lapsed Photographs. J. Comput. Civ. Eng. 2009, 23, 391–404. [Google Scholar] [CrossRef]
  35. Chen, C.; Xiao, B.; Zhang, Y.; Zhu, Z. Automatic Vision-Based Calculation of Excavator Earthmoving Productivity Using Zero-Shot Learning Activity Recognition. Autom. Constr. 2023, 146, 104702. [Google Scholar] [CrossRef]
  36. Deng, T.; Sharafat, A.; Wie, Y.M.; Lee, K.G.; Lee, E.; Lee, K.H. A Geospatial Analysis-Based Method for Railway Route Selection in Marine Glaciers: A Case Study of the Sichuan-Tibet Railway Network. Remote Sens. 2023, 15, 4175. [Google Scholar] [CrossRef]
  37. Fang, W.; Ding, L.; Love, P.E.D.; Luo, H.; Li, H.; Peña-Mora, F.; Zhong, B.; Zhou, C. Computer Vision Applications in Construction Safety Assurance. Autom. Constr. 2020, 110, 103013. [Google Scholar] [CrossRef]
  38. Alawad, H.; Kaewunruen, S.; An, M. Learning from Accidents: Machine Learning for Safety at Railway Stations. IEEE Access 2019, 8, 633–648. [Google Scholar] [CrossRef]
  39. Choi, H.; Seo, J. Safety Assessment Using Imprecise Reliability for Corrosion-damaged Structures. Comput.-Aided Civ. Infrastruct. Eng. 2009, 24, 293–301. [Google Scholar] [CrossRef]
  40. Molaei, A.; Kolu, A.; Lahtinen, K.; Geimer, M. Automatic Estimation of Excavator Actual and Relative Cycle Times in Loading Operations. Autom. Constr. 2023, 156, 105080. [Google Scholar] [CrossRef]
  41. Sabillon, C.; Rashidi, A.; Samanta, B.; Davenport, M.A.; Anderson, D.V. Audio-Based Bayesian Model for Productivity Estimation of Cyclic Construction Activities. J. Comput. Civ. Eng. 2020, 34. [Google Scholar] [CrossRef]
  42. Sherafat, B.; Rashidi, A.; Asgari, S. Sound-Based Multiple-Equipment Activity Recognition Using Convolutional Neural Networks. Autom. Constr. 2022, 135, 104104. [Google Scholar] [CrossRef]
  43. Rashid, K.M.; Louis, J. Times-Series Data Augmentation and Deep Learning for Construction Equipment Activity Recognition. Adv. Eng. Inform. 2019, 42, 100944. [Google Scholar] [CrossRef]
  44. Latif, K.; Sharafat, A.; Park, S.; Seo, J. Digital Twin-Based Hybrid Approach to Visualize the Performance of TBM. In Proceedings of the KSCE, Yeosu, Republic of Korea, 19–21 October 2022; pp. 3–4. Available online: https://www.dbpia.co.kr/pdf/pdfView.do?nodeId=NODE11223304 (accessed on 25 July 2025).
  45. Hu, S.; Wang, Y.; Yang, L.; Yi, L.; Nian, Y. Primary Non-Hodgkin’s Lymphoma of the Prostate with Intractable Hematuria: A Case Report and Review of the Literature. Oncol. Lett. 2015, 9, 1187–1190. [Google Scholar] [CrossRef]
  46. Teizer, J.; Venugopal, M.; Walia, A. Ultrawideband for Automated Real-Time Three-Dimensional Location Sensing for Workforce, Equipment, and Material Positioning and Tracking. Transp. Res. Rec. J. Transp. Res. Board 2008, 2081, 56–64. [Google Scholar] [CrossRef]
  47. Shi, Y.; Xia, Y.; Luo, L.; Xiong, Z.; Wang, C.; Lin, L. Working Stage Identification of Excavators Based on Control Signals of Operating Handles. Autom. Constr. 2021, 130, 103873. [Google Scholar] [CrossRef]
  48. Langroodi, A.K.; Vahdatikhaki, F.; Doree, A. Activity Recognition of Construction Equipment Using Fractional Random Forest. Autom. Constr. 2021, 122, 103465. [Google Scholar] [CrossRef]
  49. Pradhananga, N.; Teizer, J. Automatic Spatio-Temporal Analysis of Construction Site Equipment Operations Using GPS Data. Autom. Constr. 2013, 29, 107–122. [Google Scholar] [CrossRef]
  50. Sresakoolchai, J.; Kaewunruen, S. Railway Infrastructure Maintenance Efficiency Improvement Using Deep Reinforcement Learning Integrated with Digital Twin Based on Track Geometry and Component Defects. Sci. Rep. 2023, 13, 2439. [Google Scholar] [CrossRef]
  51. Zhang, Y.; Yuen, K. Crack Detection Using Fusion Features-based Broad Learning System and Image Processing. Comput.-Aided Civ. Infrastruct. Eng. 2021, 36, 1568–1584. [Google Scholar] [CrossRef]
  52. Pan, Y.; Zhang, L. Dual Attention Deep Learning Network for Automatic Steel Surface Defect Segmentation. Comput.-Aided Civ. Infrastruct. Eng. 2022, 37, 1468–1487. [Google Scholar] [CrossRef]
  53. Golparvar-Fard, M.; Heydarian, A.; Niebles, J.C. Vision-Based Action Recognition of Earthmoving Equipment Using Spatio-Temporal Features and Support Vector Machine Classifiers. Adv. Eng. Inform. 2013, 27, 652–663. [Google Scholar] [CrossRef]
  54. Kaewunruen, S.; Adesope, A.A.; Huang, J.; You, R.; Li, D. AI-Based Technology to Prognose and Diagnose Complex Crack Characteristics of Railway Concrete Sleepers. Discov. Appl. Sci. 2024, 6, 217. [Google Scholar] [CrossRef]
  55. Newell, A.; Yang, K.; Deng, J. Stacked Hourglass Networks for Human Pose Estimation. In Proceedings of the Computer Vision—ECCV 2016, Amsterdam, The Netherlands, 11–14 October 2016; Leibe, B., Matas, J., Sebe, N., Welling, M., Eds.; Springer International Publishing: Cham, Switzerland, 2016; pp. 483–499. [Google Scholar]
  56. Chen, Y.; Wang, Z.; Peng, Y.; Zhang, Z.; Yu, G.; Sun, J. Cascaded Pyramid Network for Multi-Person Pose Estimation. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 7103–7112. [Google Scholar] [CrossRef]
  57. Rezazadeh Azar, E.; McCabe, B. Part Based Model and Spatial-Temporal Reasoning to Recognize Hydraulic Excavators in Construction Images and Videos. Autom. Constr. 2012, 24, 194–202. [Google Scholar] [CrossRef]
  58. Assadzadeh, A.; Arashpour, M.; Li, H.; Hosseini, R.; Elghaish, F.; Baduge, S. Excavator 3D Pose Estimation Using Deep Learning and Hybrid Datasets. Adv. Eng. Inform. 2023, 55, 101875. [Google Scholar] [CrossRef]
  59. Huang, G.; Liu, Z.; Van Der Maaten, L.; Weinberger, K.Q. Densely Connected Convolutional Networks. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 2261–2269. [Google Scholar]
  60. Zhao, Y.; Xiong, Y.; Lin, D. Recognize Actions by Disentangling Components of Dynamics. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 6566–6575. [Google Scholar]
  61. Roberts, D.; Golparvar-Fard, M. End-to-End Vision-Based Detection, Tracking and Activity Analysis of Earthmoving Equipment Filmed at Ground Level. Autom. Constr. 2019, 105, 102811. [Google Scholar] [CrossRef]
  62. Torres Calderon, W.; Roberts, D.; Golparvar-Fard, M. Synthesizing Pose Sequences from 3D Assets for Vision-Based Activity Analysis. J. Comput. Civ. Eng. 2021, 35. [Google Scholar] [CrossRef]
  63. Zhang, B.; Wang, L.; Wang, Z.; Qiao, Y.; Wang, H. Real-Time Action Recognition with Enhanced Motion Vector CNNs. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 2718–2726. Available online: https://openaccess.thecvf.com/content_cvpr_2016/html/Zhang_Real-Time_Action_Recognition_CVPR_2016_paper.html (accessed on 25 July 2025).
  64. Chen, C.; Zhu, Z.; Hammad, A.; Akbarzadeh, M. Automatic Identification of Idling Reasons in Excavation Operations Based on Excavator–Truck Relationships. J. Comput. Civ. Eng. 2021, 35. [Google Scholar] [CrossRef]
  65. Bhokare, S.; Goyal, L.; Ren, R.; Zhang, J. Smart Construction Scheduling Monitoring Using YOLOv3-Based Activity Detection and Classification. J. Inf. Technol. Constr. 2022, 27, 240–252. [Google Scholar] [CrossRef]
  66. Ghelmani, A.; Hammad, A. Improving Single-stage Activity Recognition of Excavators Using Knowledge Distillation of Temporal Gradient Data. Comput.-Aided Civ. Infrastruct. Eng. 2024, 39, 2028–2053. [Google Scholar] [CrossRef]
  67. Chen, J.; Wang, J.; Yuan, Q.; Yang, Z. CNN-LSTM Model for Recognizing Video-Recorded Actions Performed in a Traditional Chinese Exercise. IEEE J. Transl. Eng. Health Med. 2023, 11, 351–359. [Google Scholar] [CrossRef]
  68. Imran, H.A.; Ikram, A.A.; Wazir, S.; Hamza, K. EdgeHARNet: An Edge-Friendly Shallow Convolutional Neural Network for Recognizing Human Activities Using Embedded Inertial Sensors of Smart-Wearables. In Proceedings of the 2023 International Conference on Communication, Computing and Digital Systems (C-CODE), Islamabad, Pakistan, 17–18 May 2023; pp. 1–6. [Google Scholar]
  69. Kaya, Y.; Topuz, E.K. Human Activity Recognition from Multiple Sensors Data Using Deep CNNs. Multimed. Tools Appl. 2024, 83, 10815–10838. [Google Scholar] [CrossRef]
  70. Li, X.; Hao, T.; Li, F.; Zhao, L.; Wang, Z. Faster R-CNN-LSTM Construction Site Unsafe Behavior Recognition Model. Appl. Sci. 2023, 13, 10700. [Google Scholar] [CrossRef]
  71. Younesi Heravi, M.; Jang, Y.; Jeong, I.; Sarkar, S. Deep Learning-Based Activity-Aware 3D Human Motion Trajectory Prediction in Construction. Expert. Syst. Appl. 2024, 239, 122423. [Google Scholar] [CrossRef]
  72. Zhu, Z.; Ren, X.; Chen, Z. Visual Tracking of Construction Jobsite Workforce and Equipment with Particle Filtering. J. Comput. Civ. Eng. 2016, 30. [Google Scholar] [CrossRef]
  73. Park, M.-W.; Brilakis, I. Continuous Localization of Construction Workers via Integration of Detection and Tracking. Autom. Constr. 2016, 72, 129–142. [Google Scholar] [CrossRef]
  74. Bilge, Y.C.; Mutlu, B.; Esin, Y.E. BusEye: A Multi-Stream Approach for Driver Behavior Analysis on Public Bus Driver Cameras. Expert. Syst. Appl. 2024, 245, 123148. [Google Scholar] [CrossRef]
  75. Yang, J.; Shi, Z.; Wu, Z. Vision-Based Action Recognition of Construction Workers Using Dense Trajectories. Adv. Eng. Inform. 2016, 30, 327–336. [Google Scholar] [CrossRef]
  76. Lee, S.; Bae, J.Y.; Sharafat, A.; Seo, J. Waste Lime Earthwork Management Using Drone and BIM Technology for Construction Projects: The Case Study of Urban Development Project. KSCE J. Civ. Eng. 2024, 28, 517–531. [Google Scholar] [CrossRef]
Figure 1. Overview of the proposed method based on a two-stream deep learning architecture (CNN-LSTM) for excavator activity recognition.
Figure 2. Data collection for the activity recognition of excavators at the construction sites.
Figure 3. Different work types and activities selected for the activity recognition of excavators.
Figure 4. Cleaning of input data and defining standard data labeling format for the video frames.
Figure 5. Two-stream CNN-LSTM network for activity recognition of excavator activities.
Figure 6. Learning curve graphs for training and testing.
Figure 7. Confusion matrix of the prediction results (a) for the training dataset and (b) for the test dataset. The bold values represent the overall performance metrics for all activity classes combined, reflecting the final results of the proposed method.
Figure 8. Example of activity recognition errors in the ground truth and prediction.
Figure 9. Confusion matrix for the training and testing single-stream CNN-LSTM.
Table 1. An overview of different vision-based methods for object detection and activity recognition.
Author | Year | Source Type | No. of Streams | Input Data | Method/Classifier | Purpose/Goal
Golparvar-Fard et al. [53] | 2013 | External Camera | One | RGB Image | 3D HOG features + SVM classifier | Activity recognition of excavator and status of dump truck
Yang et al. [75] | 2016 | External Camera | One | RGB Image | HOG, HOE, MBH features + SVM classifier | Action recognition of construction workers using dense trajectories
J. Kim and Chi [23] | 2019 | External Camera | One | RGB Image | CNN; CNN-LSTM; CNN-DLSTM | Detection of excavator (CNN); tracking of the excavator (CNN-LSTM); activity recognition of excavator (CNN-DLSTM)
Chen et al. [19] | 2020 | External Camera | One | RGB Image | Faster R-CNN + Deep SORT tracker + 3D ResNet classifier | Detection, tracking, and activity recognition of excavators
Ladjailia et al. [26] | 2020 | External Camera | One | RGB Image | Optical flow descriptor + KNN, DT, SVM | Optical flow descriptor to recognize human actions using only motion-derived features (Weizmann and UCF101 databases)
Bhokare et al. [65] | 2021 | External Camera | One | RGB Image | YOLOv3 | Detection and activity classification
Torres Calderon et al. [62] | 2021 | External Camera | One | RGB Image | Multi-Stage Temporal Convolutional Network (MS-TCN) | Synthetic data to improve the vision-based activity recognition of excavator
Chen et al. [64] | 2021 | External Camera | One | RGB Image | YOLOv3 | Activity recognition, identification of idling reasons based on excavator–truck relationships
M.-Y. Cheng et al. [20] | 2022 | External Camera | One | RGB Image | YOWO | Vision-based autonomous excavator productivity estimation
Chen et al. [35] | 2022 | External Camera | One | RGB Image | YOLO, SORT, zero-shot learning method CLIP | Activity recognition of excavator and productivity estimation
I.-S. Kim et al. [24] | 2023 | External Camera | One | RGB Image | CNN-BiLSTM | Activity classification of the excavator to improve the safety, monitoring, and productivity of the earthwork site
Ghelmani and Hammad [66] | 2024 | External Camera | One | RGB Image | DIGER based on YOWO | Simultaneous activity recognition and localization
Table 2. File naming rules for the standard labeling format of input data.
Components of the Proposed Standard | File Naming Standard
Example file name | E_230905_0907_CE_000750_SFSD.jpg
Equipment type | Excavator
Acquisition date | Year, month, date (230905~0907)
Acquisition time | Hour, min
Source sensor type | Camera External (CE); Point Cloud (PC)
Frame number | xxxxxx (000000~999999)
Work type | Excavating and Loading; Grading (Leveling); Foundation and Trenching; Slope Digging
Activity type | Approach; Digging; Dumping; Idle; Leveling
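When organizing such a dataset, the naming standard in Table 2 can be unpacked programmatically. The sketch below assumes underscore-delimited fields in the order suggested by the example file name (equipment, acquisition date, acquisition time, sensor type, frame number, and a combined work/activity code); this field interpretation and the FrameLabel helper are illustrative assumptions rather than a published parser.

```python
# Minimal sketch of unpacking the file naming standard in Table 2. The field
# order follows the example name E_230905_0907_CE_000750_SFSD.jpg; the meaning
# of the trailing work/activity code is an assumption and is left un-decoded.
from dataclasses import dataclass
from pathlib import Path

SENSOR_TYPES = {"CE": "Camera External", "PC": "Point Cloud"}

@dataclass
class FrameLabel:
    equipment: str      # "E" -> Excavator
    date: str           # YYMMDD, e.g. "230905"
    time: str           # HHMM, e.g. "0907"
    sensor: str         # "CE" or "PC"
    frame_number: int   # 000000-999999
    work_activity: str  # combined work-type/activity code, e.g. "SFSD"

def parse_frame_name(path: str) -> FrameLabel:
    stem = Path(path).stem  # drop the .jpg / .laz extension
    equipment, date, time, sensor, frame, code = stem.split("_")
    if sensor not in SENSOR_TYPES:
        raise ValueError(f"Unknown sensor code: {sensor}")
    return FrameLabel(equipment, date, time, sensor, int(frame), code)

print(parse_frame_name("E_230905_0907_CE_000750_SFSD.jpg"))
```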
Table 3. Data distribution for different datasets.
Activity | No. of Video Clips per Stream | Training Clips (70%) | Testing Clips (30%) | Total Clips (RGB + PC)
Approach | 98 (RGB *) + 98 (PC **) | 68 (RGB) + 68 (PC) | 30 (RGB) + 30 (PC) | 196
Digging | 98 (RGB) + 98 (PC) | 68 (RGB) + 68 (PC) | 30 (RGB) + 30 (PC) | 196
Dumping | 98 (RGB) + 98 (PC) | 68 (RGB) + 68 (PC) | 30 (RGB) + 30 (PC) | 196
Idle | 98 (RGB) + 98 (PC) | 68 (RGB) + 68 (PC) | 30 (RGB) + 30 (PC) | 196
Leveling | 98 (RGB) + 98 (PC) | 68 (RGB) + 68 (PC) | 30 (RGB) + 30 (PC) | 196
Total | 980 (RGB + PC) | 680 (RGB + PC) | 300 (RGB + PC) | 980
* RGB = RGB Video data clip; ** PC = Point cloud data clip.
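The paired 70%/30% split summarized in Table 3 can be reproduced with a few lines of code, provided each RGB clip stays paired with its synchronized point cloud clip so that both streams are trained and tested on the same examples. The sketch below follows the clip counts in Table 3; the file names and directory layout are illustrative assumptions.

```python
# Minimal sketch of the per-activity 70/30 split implied by Table 3, keeping each
# RGB clip paired with its synchronized point cloud clip. Clip counts follow
# Table 3; the file names and directory layout are assumptions.
import random

ACTIVITIES = ["Approach", "Digging", "Dumping", "Idle", "Leveling"]
CLIPS_PER_CLASS = 98
TRAIN_RATIO = 0.7

def split_dataset(seed: int = 42):
    rng = random.Random(seed)
    train, test = [], []
    for activity in ACTIVITIES:
        # One entry per synchronized RGB/point-cloud clip pair.
        pairs = [(f"{activity}/rgb_{i:03d}.avi", f"{activity}/pc_{i:03d}.laz")
                 for i in range(CLIPS_PER_CLASS)]
        rng.shuffle(pairs)
        cut = int(TRAIN_RATIO * CLIPS_PER_CLASS)   # 68 training pairs per class
        train += pairs[:cut]
        test += pairs[cut:]                        # 30 testing pairs per class
    return train, test

train_pairs, test_pairs = split_dataset()
print(len(train_pairs), "training pairs,", len(test_pairs), "testing pairs")
```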
Table 4. Testing the accuracies of the trained model at different hyperparameters.
Epochs | Learning Rate = 0.01 | 0.001 | 0.0001 | 0.00001
20 | 27.83 | 20 | 90.43 | 84.35
25 | 38.26 | 86.96 | 93.91 | 86.09
30 | 43.48 | 22.61 | 94.67 | 86.09
Table 5. Sensitivity analysis of hyperparameters on model performance.
Parameter | Accuracy (%) | F1-Score (%) | Loss (%) | Observation
Learning Rate
0.01 | 43.48 | 40.25 | 58.33 | Unstable training
0.001 | 22.61 | 21.9 | 75.14 | Divergence or slow learning
0.0001 | 94.67 | 94.78 | 5.33 | Optimal setting
0.00001 | 86.09 | 85.4 | 16.05 | Underfitting
Epochs
20 | 90.43 | 91.12 | 9.57 | Early stopping risk
25 | 93.91 | 94.1 | 6.09 | Better generalization
30 | 94.67 | 94.78 | 5.33 | Best performance
Sequence Length
10 | 89.32 | 89.5 | 10.68 | Temporal context insufficient
15 | 94.67 | 94.78 | 5.33 | Balanced setting
20 | 94.12 | 94.3 | 5.88 | Slight improvement, higher cost
The bolded rows represent the optimal settings used in the final model.
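A sensitivity study of this kind amounts to retraining the same model over a small grid of settings and recording the test metrics for each combination, as summarized in Tables 4 and 5. The loop below is a generic sketch of such a sweep; the train_and_evaluate helper is a placeholder for the actual training and testing routine and is not part of the published code.

```python
# Generic sketch of the hyperparameter sweep behind Table 4: test accuracy is
# recorded for every epochs/learning-rate combination and the best setting kept.
# train_and_evaluate() is a placeholder, not the study's published code.
from itertools import product

LEARNING_RATES = [1e-2, 1e-3, 1e-4, 1e-5]
EPOCH_SETTINGS = [20, 25, 30]

def train_and_evaluate(learning_rate: float, epochs: int) -> float:
    """Placeholder: train the two-stream CNN-LSTM and return test accuracy (%)."""
    raise NotImplementedError("hook up the actual training pipeline here")

def sweep():
    results = {}
    for epochs, lr in product(EPOCH_SETTINGS, LEARNING_RATES):
        acc = train_and_evaluate(lr, epochs)
        results[(epochs, lr)] = acc
        print(f"epochs={epochs} lr={lr:g} -> test accuracy {acc:.2f}%")
    best = max(results, key=results.get)
    print(f"best setting: epochs={best[0]}, lr={best[1]:g} "
          f"({results[best]:.2f}% accuracy)")
    return results
```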
Table 6. Comparison of the model performance.
Input Data | Algorithm | Accuracy | Recall | Precision | F1 Score | Loss
RGB * | CNN-LSTM | 90.67 | 90.67 | 90.89 | 90.78 | 9.33
PC ** | CNN-LSTM | 92.00 | 92.00 | 92.28 | 92.14 | 8.00
Multi-St. *** | Two-stream CNN-LSTM | 94.67 | 94.67 | 94.90 | 94.78 | 5.33
* RGB Camera = RGB, ** Point Cloud = PC, *** Multi-stream = Multi-St.
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
