1. Introduction
Industrial operations involve dynamic environments where the safe use of machinery is essential for preventing accidents and ensuring worker well-being. Timely detection of safety risks, particularly those involving mobile machinery, plays a critical role in maintaining operational integrity. Traditional safety risk detection systems—often reliant on manual oversight or static, rule-based algorithms—struggle to adapt to changing conditions such as varying lighting, camera angles, or equipment configurations. This lack of adaptability limits their effectiveness in real-world deployments.
Recent advancements in deep learning have shown promise in automating video-based safety monitoring by learning complex spatial and temporal patterns from operational footage [1]. However, these models frequently face challenges in generalizing across different scenarios, especially when trained on data from a single operational context. A noticeable decline in performance, often reflected by F1-scores dropping below 0.85 in unfamiliar environments, highlights the need for more adaptable and data-efficient solutions.
To address this limitation, this work explores the integration of feature transfer learning into the safety risk detection pipeline. By leveraging pre-trained models and progressively retraining them with limited data from new scenarios (5–50%), the approach enables the creation of generalizable detection systems without the need for extensive retraining or large annotated datasets [2]. This is particularly beneficial in industrial safety applications, where certain risk events are rare and difficult to reproduce [3,4].
The proposed methodology focuses on forklift operations and evaluates detection performance across nine safety risk categories defined by OSHA 3949. A hybrid deep learning architecture is implemented, combining convolutional and temporal processing to capture both object features and movement patterns. The system’s adaptability is benchmarked against commercial tools—NVIDIA DeepStream SDK and Amazon Rekognition Custom Labels—using F1-score as a primary performance metric.
To ensure practical deployment, the methodology also examines model efficiency in embedded environments such as Raspberry Pi and Jetson Nano. Metrics such as inference time, model weight, and data acquisition latency are assessed to validate real-time applicability. Furthermore, model interpretability is addressed through SHAP (Shapley Additive Explanations) analysis, which demonstrates a post-transfer shift in feature relevance toward critical elements like equipment forks, load position, and workspace boundaries. This not only enhances transparency but also supports efficient annotation in future training iterations.
The paper is organized as follows:
Section 2 reviews related work in deep learning-based risk detection and transfer learning;
Section 3 details the proposed methodology;
Section 4 presents classification results, discusses interpretability and feature relevance, and benchmarks inference performance across platforms; and
Section 5 offers concluding remarks and outlines future research directions.
3. Methodology
This study proposes a safety risk event detection system for forklift operations using a dual-stage deep learning architecture (Deep Risk Network, DRN), enhanced by progressive transfer learning and validated through a stratified scenario-based experimental design. In short, the proposed approach processes video data to identify safety-critical events. The methodology is structured into five sequential components: data acquisition, scenario structuring, model design, training protocol, and evaluation metrics.
3.1. Stratified Scenario Design and Data Processing
To ensure robust generalization under realistic deployment conditions, this study employs a scenario-based stratification strategy rather than conventional random or class-balanced stratified sampling. Each scenario reflects a distinct combination of operational conditions—such as camera angles, lighting environments, and workspace layouts—that are likely to influence model performance. The objective is to assess model generalization to unseen operational contexts.
The scenario design was informed by the following criteria: environmental variability (e.g., daylight vs. low light, reflective surfaces); camera positioning (e.g., front, lateral, and overhead perspectives); operational activity (e.g., routine operation vs. risk-prone behaviors); and layout constraints (e.g., open vs. cluttered zones and presence of blind spots).
The dataset includes video recordings from 9 stratified scenarios (see Table 1), each constructed to capture a unique combination of these attributes. Within each scenario, forklift operations were recorded using three synchronized RGB cameras at 15 frames per second.
Annotation of safety-relevant elements (such as forklift forks, operator presence, and load position) was performed using the LabelImg tool. Labels correspond to OSHA 3949 safety risk categories, supporting both object-level and event-level classification tasks. To mitigate class imbalance and enhance robustness against visual variability, synthetic data augmentation techniques were applied during preprocessing.
Data Labeling Process: To train the DRN’s object detection component, annotations were generated using LabelImg, an open-source graphical labeling tool implemented in Python 3.9 with a Qt-based GUI. This tool enables precise frame-by-frame annotation of relevant objects, such as forklift chassis, fork arms, load positions, and operator locations, among others. Each annotation includes bounding boxes and corresponding class labels, stored in structured JSON files. These labels provide the spatial supervision required for effective object detection and allow the model to learn visual patterns associated with risk-relevant elements in the scene.
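As an illustrative sketch of how these per-frame annotations can be consumed downstream, the following Python fragment parses one such JSON file into bounding boxes and class indices; the field names ("objects", "bbox", "label") and the class list are assumptions for illustration, not the exact schema used in this work.

import json
from pathlib import Path

# Hypothetical per-frame annotation schema; field names are assumptions.
CLASSES = ["forklift_chassis", "fork_arms", "load", "operator", "boundary_tape"]

def load_annotations(json_path):
    # Read one per-frame JSON file and return parallel lists of boxes and label indices.
    record = json.loads(Path(json_path).read_text())
    boxes, labels = [], []
    for obj in record.get("objects", []):
        x_min, y_min, x_max, y_max = obj["bbox"]   # assumed [x1, y1, x2, y2] layout
        boxes.append((x_min, y_min, x_max, y_max))
        labels.append(CLASSES.index(obj["label"]))
    return boxes, labels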
In parallel, annotations were prepared for two benchmark platforms used for performance comparison:
Amazon Rekognition Custom Labels: Requires bounding-box annotations for each frame, generated and managed using Amazon SageMaker. This method aligns with Rekognition’s object detection framework.
NVIDIA DeepStream SDK: Operates using video segments labeled with high-level activity classes (e.g., transporting elevated load, turning on a slope), rather than frame-level object annotations. This facilitates a more temporal labeling approach suited for event detection.
Synthetic Data Augmentation: To address the class imbalance inherent in industrial datasets—where hazardous events are significantly less frequent than normal operations—synthetic data augmentation techniques were applied to increase training diversity. These included image rotation (±15–30 degrees), salt-and-pepper noise injection to simulate sensor noise, saturation and brightness adjustments for lighting variability, and image composition to generate new scenes containing risk-related objects in novel arrangements.
These augmentations enhance the model’s robustness by exposing it to a wider distribution of visual patterns and conditions, thereby reducing overfitting and improving generalization in unseen environments.
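A minimal Python/OpenCV sketch of the augmentation operations described above (rotation within ±15–30 degrees, salt-and-pepper noise, and saturation/brightness adjustment) is shown below; the exact parameter values used in the experiments are only partially specified in the text, so the remaining settings here are indicative.

import cv2
import numpy as np

def rotate(img, max_deg=30):
    # Rotate by a random angle in ±[15, 30] degrees, as described above.
    angle = np.random.uniform(15, max_deg) * np.random.choice([-1, 1])
    h, w = img.shape[:2]
    m = cv2.getRotationMatrix2D((w / 2, h / 2), angle, 1.0)
    return cv2.warpAffine(img, m, (w, h), borderMode=cv2.BORDER_REFLECT)

def salt_and_pepper(img, amount=0.01):
    # Flip a small fraction of pixels to black or white to mimic sensor noise.
    out = img.copy()
    mask = np.random.rand(*img.shape[:2])
    out[mask < amount / 2] = 0
    out[mask > 1 - amount / 2] = 255
    return out

def jitter_hsv(img, sat_scale=0.2, val_scale=0.2):
    # Randomly scale saturation and brightness in HSV space.
    hsv = cv2.cvtColor(img, cv2.COLOR_BGR2HSV).astype(np.float32)
    hsv[..., 1] *= 1 + np.random.uniform(-sat_scale, sat_scale)
    hsv[..., 2] *= 1 + np.random.uniform(-val_scale, val_scale)
    return cv2.cvtColor(np.clip(hsv, 0, 255).astype(np.uint8), cv2.COLOR_HSV2BGR)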
3.2. Scenario-Based Data Splitting Strategy
Stratified sampling applies when splitting a single dataset (usually an image dataset) into training, validation, and test sets. The purpose is to ensure that each split has the same distribution of key labels or classes. This approach is particularly useful in cases of class imbalance, during cross-validation, and for preventing biased training or evaluation results caused by skewed class distributions. However, this method accounts only for class distribution (or other stratifying features) and does not control for real-world variations such as lighting conditions, camera types, environmental settings, or user differences [33].
Stratified scenario design, in contrast, refers to designing multiple experimental conditions (scenarios) in which strata are defined by external, domain-specific factors—e.g., weather conditions, time of day, camera angle, or hardware device. Here, the purpose is to evaluate model robustness and generalization across distinct, often real-world, meaningful scenarios—not just balanced class splits. This approach is particularly useful in computer vision applications with known sources of domain shift or operational variability (e.g., autonomous driving, medical imaging, or safety monitoring), especially when evaluating generalization across diverse environments or subpopulations to simulate real-world deployment scenarios [34]. This approach involves a more complex setup: it requires metadata annotations or scenario labels and often demands larger or specifically structured datasets.
Scenarios are defined as coherent groups of samples that share a specific combination of environmental and operational conditions. These scenarios serve as the primary unit of stratification. In this paper, the dataset is partitioned as follows:
Training Set: Includes samples from a subset of scenarios, capturing intra-scenario variability while limiting cross-scenario exposure.
Validation Set: Contains samples from the same scenario groups as the training set, but with disjoint identities or instances to prevent overlap. Held-out identities or time windows from the same scenarios are used, preserving distributional consistency.
Testing Set: Comprises samples from entirely unseen scenarios, thereby simulating domain shift at deployment and providing a realistic evaluation of generalization under such shift.
This setup reflects a cross-domain generalization task and imposes a more rigorous evaluation protocol than conventional i.i.d. class-based sampling, as the model is assessed on distributions not observed during training. Thus, it enforces non-i.i.d. generalization and prevents leakage of scenario-specific features between splits, emulating real deployment where the model encounters unfamiliar site conditions [35].
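The scenario-level partitioning can be expressed compactly as a grouping step followed by per-scenario assignment. The sketch below assumes each sample carries a "scenario" identifier; the field name, validation fraction, and seed are illustrative rather than the values used in the study.

import random
from collections import defaultdict

def scenario_split(samples, test_scenarios, val_fraction=0.15, seed=42):
    # Split samples so that test data come only from held-out scenarios.
    rng = random.Random(seed)
    by_scenario = defaultdict(list)
    for s in samples:
        by_scenario[s["scenario"]].append(s)

    train, val, test = [], [], []
    for scenario, items in by_scenario.items():
        if scenario in test_scenarios:      # entirely unseen scenarios -> test
            test.extend(items)
            continue
        rng.shuffle(items)                  # held-out instances from seen scenarios -> val
        cut = int(len(items) * (1 - val_fraction))
        train.extend(items[:cut])
        val.extend(items[cut:])
    return train, val, test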
3.3. Deep Risk Network Architecture: Structural Overview
The Deep Risk Network (DRN) employed in this work is a dual-stage deep learning architecture designed for robust and accurate detection of safety risk events in industrial environments. Its structure integrates both spatial and temporal feature learning, making it particularly well-suited for analyzing complex video data from forklift operations. The architecture has demonstrated efficiency, robustness, and adaptability in experimental evaluations, with performance comparable to state-of-the-art time-series classification models [36].
Figure 1 illustrates the dual-stage structure of the DRN architecture, composed of two main components:
Spatial Feature Extraction (Object Detection): (i) Input: single-frame or multimodal image (e.g., from three cameras); (ii) Conv2D + SE Blocks (x4): sequential convolution layers with Squeeze-and-Excitation for channel attention; (iii) Flatten: converts the spatial map into a vector representation; (iv) Fully Connected Layer + Softmax: performs object classification; and (v) Output: provides spatial labels for key visual elements (e.g., forks, loads, operators).
Temporal Feature Integration (Event Detection): (i) Input: combines 2D spatial features and temporal time-series data; (ii) Dimension Shuffle: rearranges data to fit the LSTM input format; (iii) LSTM Block: learns temporal relationships across frames; (iv) Additional Conv2D + SE Blocks: refine frame-level features; (v) Concatenation + Softmax: fuses spatial and temporal outputs for final classification; and (vi) Output: delivers event-level predictions (e.g., lifting high loads, turning on slopes, raising personnel) [37,38,39].
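A simplified Keras sketch of this dual-stage layout follows. It mirrors the blocks listed above (Conv2D + SE ×4, flatten + softmax for objects; dimension shuffle + LSTM fused with spatial features for events), but the filter counts, feature dimensions, and class counts are assumptions, and the additional Conv2D + SE refinement in the temporal stage is omitted for brevity.

from tensorflow.keras import layers, Model

def se_block(x, ratio=8):
    # Squeeze-and-Excitation: channel-wise attention over a Conv2D feature map.
    ch = x.shape[-1]
    s = layers.GlobalAveragePooling2D()(x)
    s = layers.Dense(ch // ratio, activation="relu")(s)
    s = layers.Dense(ch, activation="sigmoid")(s)
    return layers.Multiply()([x, layers.Reshape((1, 1, ch))(s)])

def conv_se(x, filters):
    x = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
    x = layers.MaxPooling2D()(x)
    return se_block(x)

def build_drn(frame_shape=(224, 224, 3), seq_len=15, feat_dim=256,
              n_objects=10, n_events=9):
    # Stage 1: spatial feature extraction (Conv2D + SE blocks x4) and object softmax.
    frame_in = layers.Input(frame_shape, name="frame")
    x = frame_in
    for f in (32, 64, 128, 256):
        x = conv_se(x, f)
    spatial_vec = layers.Flatten()(x)
    obj_out = layers.Dense(n_objects, activation="softmax", name="objects")(spatial_vec)

    # Stage 2: temporal integration of per-frame feature vectors via dimension shuffle + LSTM.
    seq_in = layers.Input((seq_len, feat_dim), name="frame_features")
    t = layers.Permute((2, 1))(seq_in)     # "dimension shuffle": swap time and feature axes
    t = layers.LSTM(128)(t)
    fused = layers.Concatenate()([layers.Dense(feat_dim)(spatial_vec), t])
    event_out = layers.Dense(n_events, activation="softmax", name="events")(fused)
    return Model([frame_in, seq_in], [obj_out, event_out])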
Key Strengths of the DRN Architecture:
In summary, the DRN architecture is a purpose-built, high-performance solution for intelligent video-based risk detection in industrial settings. By effectively combining spatial feature recognition and temporal sequence modeling, the DRN achieves strong generalization and accuracy even in variable conditions. Its modularity, explainability, and compatibility with transfer learning further position it as a scalable foundation for next-generation industrial safety systems.
3.4. Cloud and Edge Computing Platforms for Benchmarking
To benchmark the performance of the selected open-access, gold-standard DRN for prediction at the edge, this study incorporates two widely used commercial platforms: Amazon Rekognition Custom Labels and NVIDIA DeepStream SDK. These standardized tools provide industry-grade pipelines for training and deploying computer vision models, enabling meaningful comparisons with the proposed architecture under similar conditions [43,44].
Amazon Rekognition Custom Labels enables the rapid development of custom object detection and activity recognition models, leveraging pre-trained architectures developed from millions of annotated images across diverse categories. This platform allows efficient model training on smaller, domain-specific datasets, making it well-suited for industrial applications such as safety monitoring, process automation, and video analytics. The automated workflow consists of the following sequence:
A video file is uploaded to Amazon Simple Storage Service (S3).
This action triggers an AWS Lambda function, which in turn calls the Amazon Rekognition Custom Labels inference endpoint.
Detected results are processed and passed through the Amazon Simple Queue Service (SQS) for downstream actions or integration with dashboards.
This cloud-native architecture facilitates quick deployment, seamless scalability, and cost-effective model training for industrial video applications. It supports high-level abstraction, enabling users to train and deploy custom classifiers with minimal coding or infrastructure overhead.
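A minimal sketch of the Lambda step in this workflow is shown below, assuming the uploaded S3 object is a single extracted frame (Rekognition Custom Labels inference operates on images); the model ARN, queue URL, and confidence threshold are placeholders rather than the values used in this deployment.

import json
import boto3

rekognition = boto3.client("rekognition")
sqs = boto3.client("sqs")

# Placeholder identifiers; the actual ARN, queue URL, and threshold are not specified here.
MODEL_ARN = "arn:aws:rekognition:...:project-version/..."
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/.../risk-events"

def handler(event, context):
    # Triggered by an S3 upload; runs Custom Labels inference on the new object
    # and forwards detections to SQS for downstream dashboards.
    record = event["Records"][0]["s3"]
    bucket, key = record["bucket"]["name"], record["object"]["key"]

    response = rekognition.detect_custom_labels(
        ProjectVersionArn=MODEL_ARN,
        Image={"S3Object": {"Bucket": bucket, "Name": key}},
        MinConfidence=70,
    )
    sqs.send_message(
        QueueUrl=QUEUE_URL,
        MessageBody=json.dumps({"frame": key, "labels": response["CustomLabels"]}),
    )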
For on-edge processing, the NVIDIA DeepStream SDK was implemented using the Jetson AGX Xavier development board, which offers high-performance computing suitable for real-time deep learning inference. Unlike Amazon Rekognition, which is cloud-based, DeepStream is designed for edge computing, enabling video analytics directly on local devices without dependence on cloud services [45,46]. In this case, the implementation process involved the following steps:
Videos are stored and labeled locally to serve as a dataset for training.
A Convolutional Neural Network (CNN) is trained and fine-tuned using DeepStream’s native support for accelerated model training and inference.
The development followed NVIDIA’s recommended pipeline, which integrates pre-trained detection and classification models for rapid prototyping and deployment.
After training, the model is evaluated using a separate set of test videos not included in the training set. The SDK generates outputs in JSON format, with frame-by-frame classification results that indicate detected operational states. Additionally, DeepStream supports real-time video rendering, enabling overlays that visualize model predictions either during live inference or in post-processing.
Both platforms offer complementary advantages. Amazon Rekognition excels in cloud-based scalability and rapid prototyping, whereas NVIDIA DeepStream SDK provides low-latency, on-device inference optimized for real-time monitoring in bandwidth-limited or disconnected environments. These characteristics make them ideal reference points for evaluating the adaptability, performance, and deployment efficiency of the proposed DRN model in diverse industrial contexts. To assess model performance, we employed a set of standard classification metrics.
F1-score was used as the primary measure given its robustness to class imbalance, which is common in safety event detection, where risk categories are significantly rarer than normal operations. The F1-score is defined as the harmonic mean of precision and recall, balancing both false positives and false negatives.
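In standard notation, with TP, FP, and FN denoting true positives, false positives, and false negatives, Precision = TP / (TP + FP), Recall = TP / (TP + FN), and F1 = 2 · (Precision · Recall) / (Precision + Recall).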
In addition, we report the area under the receiver operating characteristic curve (AUC-ROC), which evaluates the ability of the classifier to discriminate between positive and negative classes across varying decision thresholds. AUC-ROC values close to 1.0 indicate strong separability, while values near 0.5 correspond to random classification.
To better capture performance on imbalanced scenarios, particularly rare risk events, we also included the Precision–Recall AUC (PR-AUC). Unlike AUC-ROC, PR-AUC focuses on the trade-off between precision and recall, providing a more informative measure of classifier reliability when positive cases are scarce.
This combination of metrics (F1, AUC-ROC, and PR-AUC) enables a comprehensive evaluation of both balanced and imbalanced safety event detection tasks.
3.5. Transfer Learning for Generalization
Transfer learning has proven to be an effective strategy to address the limited generalization of deep learning models, particularly in industrial contexts where labeled data are scarce or expensive to obtain. It enables the reuse of knowledge acquired from a source task or domain to enhance performance in a related target task, thereby improving the adaptability and scalability of video-based safety risk detection systems across heterogeneous operational environments [47]. Unlike conventional transfer learning approaches that rely on full model fine-tuning or one-shot adaptation to new domains, our method adopts an incremental forward-transfer strategy. In this approach, scenario-specific data are progressively integrated in small proportions (5–50%), while the model is evaluated only on previously unseen scenarios [48]. This differs fundamentally from existing industrial safety applications, which often perform a complete retraining step for each new deployment [49]. Our incremental design mitigates the degradation of previously acquired knowledge, reduces computational overhead, and enables rapid adaptation to heterogeneous environments where annotation resources are limited [50].
This study adopts a feature-based transfer learning approach, leveraging latent representations—intermediate features learned by convolutional and recurrent neural networks. These representations are assumed to encode high-level, domain-invariant properties, enabling recognition in new environments without requiring full model retraining [51]. This is especially advantageous in industrial applications, where collecting and annotating data for each new scenario can be impractical [52].
The proposed strategy is computationally efficient and particularly effective when source and target tasks are closely related, and the target dataset is limited. By reusing features from the final hidden layers, the model maintains reliable performance without retraining from scratch.
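In implementation terms, this amounts to freezing the pre-trained feature extractor and retraining only a lightweight classification head on the new-scenario data. The Keras sketch below illustrates the idea; the chosen layer boundary and optimizer settings are assumptions, not the exact configuration used for the DRN.

from tensorflow.keras import Model, layers, optimizers

def adapt_to_new_scenario(base_model, n_classes, learning_rate=1e-4):
    # Freeze the pre-trained feature extractor so source-domain representations are reused.
    for layer in base_model.layers:
        layer.trainable = False

    # Reuse the final hidden (pre-softmax) representation and retrain only a new head
    # on the small fraction of new-scenario data; the layer boundary is illustrative.
    features = base_model.layers[-2].output
    new_head = layers.Dense(n_classes, activation="softmax", name="target_head")(features)
    model = Model(base_model.inputs, new_head)
    model.compile(optimizer=optimizers.Adam(learning_rate),
                  loss="categorical_crossentropy", metrics=["accuracy"])
    return model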
To evaluate generalization and adaptability, an incremental transfer learning strategy is implemented, based on scenario-driven data expansion. Training data from new operational contexts is progressively integrated, allowing the model to adapt to novel environments while preserving previously learned knowledge. This structured approach supports robust deployment in diverse and dynamic industrial settings [53].
The proposed experiment involves training a feature-based transfer learning model in one scenario and subsequently testing it in a different, previously unseen scenario [42]. This approach mitigates overfitting and enables the evaluation of transfer learning performance across 10 different scenarios. Typically, feature transfer is conducted by reusing between 50% and 100% of the available data from the source domain [54,55]. In contrast, our method retrains the model using only a small fraction of the data, aiming to evaluate its generalization capabilities under conditions of extremely limited data availability.
The initial model M1 is trained using the complete training set from scenario E1, along with a specified percentage p% of the training set from scenario E2. The resulting model is evaluated on the test sets of scenarios T1, T2, and T3. Formally, the training set for M1 is defined in Equation (1):

D_train(M1) = D_train(E1) ∪ S_p%(D_train(E2))        (1)

where S_p%(·) denotes a randomly sampled subset containing p% of the corresponding scenario's training data.
For subsequent models Mi, where i ∈ [2, 9], the training set includes the full data from scenario E1 and incremental portions p% from scenarios E2 to Ei+1. These models are evaluated only on a single new test scenario Ti+2, distinct from the training inputs, to streamline responsiveness and focus evaluation on the most recently introduced conditions. This approach balances computational efficiency with adaptive performance tracking. The training set is defined as:

D_train(Mi) = D_train(E1) ∪ S_p%(D_train(E2)) ∪ … ∪ S_p%(D_train(Ei+1)),   i ∈ [2, 9]
The evaluated values of p% include 0%, 5%, 10%, 25%, 35%, and 50%, representing varying levels of exposure to novel operating environments.
For each test scenario, all models Mi trained with different percentages p% of additional scenario data are evaluated using a predefined performance metric (e.g., accuracy, F1-score). The model achieving the highest score is selected as the best-performing configuration. The corresponding training composition is then identified as the optimal percentage of new scenario data required to maximize performance under that scenario.
This selection procedure quantifies the trade-off between the amount of transferred knowledge and the performance gain achieved, while identifying the minimum data proportion necessary for effective generalization.
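The forward-transfer and selection procedure can be summarized by the loop below, where train_fn and f1_fn stand in for the actual training and evaluation routines; the sketch only illustrates sampling p% of each newly added scenario and keeping the best-scoring percentage.

import random

PERCENTAGES = [0, 5, 10, 25, 35, 50]

def sample_fraction(videos, pct, seed=0):
    # Draw pct% of a scenario's training videos (illustrative sampler).
    k = round(len(videos) * pct / 100)
    return random.Random(seed).sample(videos, k)

def forward_transfer_round(train_e1, new_scenarios, test_scenario, train_fn, f1_fn):
    # For one round Mi: try every retraining percentage and keep the best F1.
    results = {}
    for pct in PERCENTAGES:
        train_set = list(train_e1)
        for scenario_videos in new_scenarios:          # E2 ... Ei+1
            train_set += sample_fraction(scenario_videos, pct)
        model = train_fn(train_set)
        results[pct] = f1_fn(model, test_scenario)     # evaluate on the unseen Ti+2
    best_pct = max(results, key=results.get)
    return best_pct, results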
Only the first round of retraining (model M1) is evaluated across multiple test scenarios (T1, T2, and T3). All subsequent models are evaluated solely on a single, newly introduced test scenario Ti+2, in accordance with the forward-transfer protocol and responsiveness requirements.
This design provides a scalable framework for evaluating adaptive learning in evolving environments. It is especially relevant for real-world applications requiring continual model updating with limited retraining and testing budgets—such as industrial monitoring, safety-critical systems, and predictive maintenance in non-stationary contexts.
The experimental methodology, illustrated in Figure 2, is applied to a dataset comprising 9 different operational scenarios, each simulating realistic industrial conditions aligned with OSHA 3949 safety categories. Additionally, the commercial standard solutions—Amazon Rekognition Custom Labels and NVIDIA DeepStream SDK—are included for comparative analysis under the same experimental protocol as a benchmark.
The main goal is to optimize the trade-off between training data volume and generalization performance, particularly in data-scarce industrial settings. The proposed approach evaluates the scalability and transferability of safety risk detection models by systematically analyzing how incremental data inclusion impacts metrics such as F1-score, inference time, and robustness to domain shifts.
4. Results
4.1. Data Sources and Testbed
To ensure the effectiveness of any deep learning algorithm, whether for classification or regression, it is essential that the training dataset is both diverse and sufficient in volume. In this study, the datasets were constructed from a combination of third-party video sources and data captured in a controlled testbed environment specifically designed to replicate industrial forklift operations, referred to as Scenario 1.
The testbed for Scenario 1 was engineered to simulate the key operational scenarios outlined in Table 1. It includes the following elements: (i) an operational area marked on the ground, (ii) pallets with varying load configurations, (iii) a ramp to simulate slope navigation, and (iv) a video acquisition system consisting of three strategically positioned cameras. As illustrated in Figure 3, the testbed layout is organized around three main reference points: Point 1 (P1) for initial loading, Point 2 (P2) for unloading, and Point 3 (P3), a zone for inclined maneuvers.
Figure 4 shows the actual implementation of the testbed. Additionally, a curated set of forklift operation videos was sourced from external environments, featuring variations in lighting, camera angles, and background context.
Table 1 provides a detailed overview of the external video sources and the testbed recordings used to enhance the dataset’s generalization capacity for Scenario 1. Representative examples of key events from the scenarios (or datasets) used for transfer learning experimentation are shown in Figure 4.
4.2. Dataset Composition
In this study, a scenario is defined as a collection of audiovisual recordings obtained under a consistent operational context. This includes the physical characteristics of the environment (e.g., indoor or outdoor settings), as well as factors such as operator behavior, machinery, the video capture system, and the frequency of safety risk-related maneuvers. For instance, Scenario 1 corresponds to recordings obtained in the controlled testbed environment; Scenario 2 includes videos captured with mobile phone cameras during outdoor training sessions; and Scenario 3 consists of footage from surveillance cameras installed in an industrial warehouse. The remaining scenarios were grouped based on their contextual and technical coherence.
Thus, each scenario represents a unique domain instance dataset, characterized by distinct variations in lighting, resolution, camera perspective, physical environment, and operating style. This diversity is essential for evaluating the model’s ability to generalize to previously unseen conditions. Within this framework, a key methodological requirement is that performance evaluation following the transfer learning process must be conducted using data from a different scenario than the one used for training. This strategy not only prevents data leakage but also enables a more rigorous and representative assessment of the model’s generalization capacity across varied operational contexts.
The datasets comprised 115 videos across nine activity classes; each operation was recorded in both exterior and interior environments, amounting to approximately 250 min of annotated footage in total.
The datasets used for training and evaluation in this study demonstrate strong adherence to essential data quality standards for machine learning, particularly in industrial safety applications. The datasets exhibit notable diversity, encompassing nine distinct forklift operation types—including both normal and OSHA 3949-defined risk events—captured across interior and exterior environments with multi-angle video recording. This diversity supports robust generalization to operational variability.
In terms of volume, the datasets include over 135,000 estimated frames for normal operations, with other classes contributing 9000–40,000 frames each, providing sufficient data to support deep learning architectures. While class balance is not strictly uniform, it is thoughtfully engineered: high-risk or visually complex activities are weighted more heavily (15%), while rarer events are represented by 5%, aligning the data with practical risk modeling needs. The use of F1-score ensures fair evaluation despite these imbalances.
The datasets benefit from high-quality and consistent labeling, guided by the OSHA taxonomy and implemented with standardized annotation tools (LabelImg with JSON output). Relevance is strong, as all data correspond directly to safety-critical operational contexts. The datasets are also complete, with no missing data and full frame coverage across all scenarios.
The datasets remain representative of their intended domain. Efforts to assess generalization through scenario-based testing and transfer learning further strengthen their applicability.
Finally, annotation granularity is well-matched to the task: labels are detailed enough to capture specific risk events without being overly fine-grained. Overall, the datasets are well-structured, balanced for industrial risk detection, and suitable for training high-performance, generalizable models. The datasets satisfy key machine learning data principles for robustness and generalization. They have sufficient volume and class coverage for supervised deep learning and provide a well-calibrated benchmark for transfer learning across scenarios.
4.3. Dataset Labeling
The proposed safety risk event recognition system was composed of two classification stages. The first classifier focused on object detection, identifying elements such as the forklift, its forks, the load, and safety boundary tape. This process required frame-level segmentation of video sequences. Each video was divided into individual frames, where objects of interest were annotated using bounding boxes and class labels. The conversion of videos into frames was performed using a GStreamer-based pipeline. For manual annotation, the open-source tool LabelImg was employed [56]. This tool produced two outputs per frame: the annotated image and a corresponding JSON file specifying the bounding box coordinates and class label.
Figure 5 shows the LabelImg interface used during the annotation workflow.
The second classification algorithm analyzed the detected objects and context to classify the specific type of safety risk event taking place. For this purpose, a training set was generated from labeled video segments. The previously annotated frames were re-rendered into video clips with overlaid object labels. These rendered videos were then categorized according to the action being executed, forming the input for training the event classifier. The original video recordings used in this process ranged from 60 to 120 frames per second and were captured in Full HD (1080p). However, to reduce computational demands during processing, all videos were downsampled to 15 fps and HD resolution (720p).
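The conversion and downsampling steps can be reproduced with a few lines of Python; the project used a GStreamer-based pipeline, so the OpenCV version below is only an equivalent sketch of the same operations (decode, resample to roughly 15 fps, resize to 720p).

import cv2

def extract_frames(video_path, out_dir, target_fps=15, target_size=(1280, 720)):
    # Decode a video, keep roughly target_fps frames per second, resize to 720p, save as PNGs.
    cap = cv2.VideoCapture(video_path)
    src_fps = cap.get(cv2.CAP_PROP_FPS) or target_fps
    step = max(1, round(src_fps / target_fps))      # keep every `step`-th frame
    idx = saved = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % step == 0:
            frame = cv2.resize(frame, target_size)
            cv2.imwrite(f"{out_dir}/frame_{saved:06d}.png", frame)
            saved += 1
        idx += 1
    cap.release()
    return saved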
In the standardized tools employed for deep learning model evaluation—such as the NVIDIA DeepStream SDK—the input comprised video segments labeled by activity rather than by individual objects and context.
Despite the availability of a diverse dataset, deep learning models typically require a large volume of training data to achieve optimal performance. To address this challenge, synthetic image generation techniques were employed to augment the dataset. Synthetic data generation was a fundamental practice in this study, as it enhanced the training dataset by introducing variability and compensating for data limitations. It allowed the model to learn from a broader range of examples and become more robust to environmental variations.
In addition to improving generalization, synthetic data helped mitigate class imbalance—a common issue in real-world datasets where safety risk events occurred significantly less frequently than normal operations [57,58]. Several augmentation techniques were applied, including image rotation, salt-and-pepper noise injection, saturation modification, and composite image generation with objects of interest [59,60].
Figure 6 presents examples of synthetic images derived from the testbed environment.
Accurate labeling and a balanced count of images per classification category were critical for training supervised deep learning models. These labels served as ground truth, guiding the model to learn meaningful patterns while reducing the risk of overfitting [61]. The quantity and distribution of labeled samples had a direct impact on classification performance, especially in multi-class scenarios.
To support this study, the videos referenced in Table 1 and Table 2 were fully annotated, and the number of labeled frames per event type was systematically organized and analyzed by category. It is important to clarify that the term “normal operation” refers to a standardized sequence of forklift maneuvers performed under safe and controlled conditions. This sequence includes the following steps:
Positioning the forklift in front of the pallet and adjusting the height and tilt of the forks.
Inserting the forks into the pallet.
Lifting the loaded pallet to a height of 10 to 20 cm above the ground.
Transporting the load along designated pathways to the delivery location.
Lowering the load to the ground, reversing, and disengaging the forks from the pallet.
4.4. Binary Classification Results Without Transfer Learning
The initial analysis involved training safety risk event detection algorithms to perform binary classification—distinguishing between risk and non-risk events.
Figure 7 presents a comparison between the proposed DRN algorithm and two standardized tools: Amazon Rekognition Custom Labels and NVIDIA DeepStream SDK. The F1-score was used as the primary performance metric, as it is the standard evaluation criterion reported by both baseline tools.
In Scenario 1, where the inference data belong to the same context as the training data, all three models demonstrated high and consistent performance. The DRN outperformed AWS Rekognition and performed comparably to DeepStream, with all models requiring approximately the same amount of time for training and inference. The DRN achieved an average F1-score of 0.96, while AWS Rekognition and DeepStream reached average F1-scores of 0.94 and 0.96, respectively. All models exhibited low standard deviations (DRN: 0.023, AWS Rekognition: 0.016, DeepStream: 0.014), indicating stable performance and reliable classification under conditions similar to their training data. These results confirm the models’ ability to effectively identify both risk and non-risk events when operating within their trained domain.
In contrast, Figure 7 presents the results of applying the model to Scenario 2, where the inference data originate from a completely different context than the training set. In this setting, all models experienced a significant decline in performance. The DRN’s F1-score dropped to an average of 0.61, while AWS Rekognition and DeepStream recorded averages of 0.63 and 0.54, respectively, representing a relative decline exceeding 34% compared to Scenario 1.
This performance degradation highlights the models’ limited generalization capabilities across domains. Although AWS Rekognition slightly outperformed the others, the differences were marginal and insufficient to establish robustness against contextual variability. The substantial drop in accuracy is attributed to several factors, including overfitting to the specific characteristics of the training dataset. This overfitting prevents the models from generalizing to new conditions. Additionally, contextual variations—such as differences in camera angles, lighting conditions, and types of operations—introduced significant noise and complexity that further degraded performance.
The pre-trained architectures of AWS Rekognition and DeepStream, although effective within their domains, appear inadequately adapted for cross-domain applications in industrial settings. These findings underscore the importance of designing algorithms with improved generalization for deployment in real-world environments.
4.5. Multi-Class Classification Results Without Transfer Learning
The second analysis focused on training multi-state risk event detection algorithms to classify all activities listed in Table 1. As with the binary classification, models were first trained and tested using data from Scenario 1, and subsequently evaluated with data from Scenario 2. The results presented in Figure 8 show that all three models—DRN, AWS Rekognition Custom Labels, and NVIDIA DeepStream—achieve high F1-scores across all activity categories when both training and testing are conducted within the same context. In this controlled setting, AWS Rekognition and DeepStream exhibit slightly superior performance compared to DRN in most activity classes. Notably, DeepStream achieves the highest F1-scores in specific categories, such as “Raise or lower the load while transporting” (F1 = 0.970) and “Normal operation” (F1 = 0.966). While DRN generally performs slightly below the other two models, it still delivers competitive results, maintaining an average F1-score above 0.90 across all activities.
The activity “Turn on slopes” showed the lowest F1-scores among all categories (DRN: 0.85, Rekognition: 0.90, DeepStream: 0.91), suggesting it is more difficult to classify accurately. Despite these differences, the consistently low standard deviations across models suggest stable performance when operating under conditions similar to those seen during training. All models effectively distinguish between normal and risk activities in known operational contexts.
Figure 9 presents results when testing is conducted on data from Scenario 2, which differs significantly from the training context. The results reveal a notable decline in classification performance across all models.
Although DRN outperforms AWS Rekognition and DeepStream in most categories, all models suffer substantial performance drops—often exceeding 40% compared to Scenario 1. For example, the F1-score for “Normal operation” falls from 0.92 to 0.63 for DRN, 0.96 to 0.59 for AWS Rekognition, and 0.97 to 0.52 for DeepStream. Other complex activities, such as “Moving high loads with unbalanced loads” and “Turn on slopes”, also exhibit severe performance degradation, further highlighting the limited generalization capabilities of all models when faced with unseen environments. The noticeable performance decline when testing in Scenario 2 underscores the challenges these models face in generalizing to new industrial contexts. This limitation is critical for real-world applications, where operational conditions often vary significantly from those seen during training.
These findings underline the need for enhanced model architectures that can reduce domain dependency—particularly in industrial applications characterized by diverse and evolving operating conditions. Addressing this issue requires models capable of adapting to new contexts with minimal retraining. Integrating transfer learning techniques presents a promising direction for improving model adaptability. Developing deep learning architectures that can adapt to new conditions enhances both robustness and generalization, supporting reliable performance in dynamic and heterogeneous industrial environments.
4.6. Binary Classification Results with Transfer Learning
To evaluate the impact of transfer learning on model generalization, experiments were conducted using various retraining percentages (0%, 5%, 10%, 25%, 35%, and 50%) across multiple new scenarios. When analyzing the data presented in Table 1 and Table 2, the descriptive statistics of the selected video subsets (5% to 50% of the total number of videos and their corresponding durations) reveal considerable variability in both the quantity of videos and the time they represent across different events. On average, 5% of the videos correspond to 0.64 videos per event (SD = 0.11), while 50% equate to approximately 6.39 videos (SD = 1.08). Regarding viewing time, the mean duration for the 5% subset is 2.59 min (SD = 2.50), ranging from 0.55 to 7.5 min, whereas the 50% subset covers an average of 25.89 min (SD = 24.96), with a minimum of 5.5 min and a maximum of 75 min. These findings indicate substantial heterogeneity in content distribution, suggesting that a small percentage of videos may account for a disproportionately large—or small—portion of total viewing time, depending on the event.
This variability underscores the need for scenario-specific sampling and sourcing strategies when selecting representative subsets for training and evaluation in video-based safety risk detection algorithms and models. Two primary models, DRN and AWS Rekognition, were selected for detailed comparison due to their consistent behavior under the same retraining methodology. Although NVIDIA DeepStream followed the same retraining pipeline, it did not exhibit comparable improvements and is discussed separately. As shown in Figure 10, both binary classifiers exhibited performance gains as the retraining percentage increased. The highest F1-score (0.962) was achieved by the DRN model at 50% retraining, particularly for normal operations, indicating enhanced classification accuracy and completeness. This trend suggests that retraining data from the same operational context positively influences model performance.
Retraining consistently improved F1-scores in safety risk event classification. DRN reached a maximum of 0.95, while AWS Rekognition attained 0.93 at 50% retraining. These results confirm that feature transfer learning effectively recovers model performance in previously unseen conditions. For normal operations, performance improvements were less pronounced, with DRN showing slightly better results than AWS Rekognition, suggesting stronger adaptability in routine tasks.
A positive correlation was observed between retraining percentage and generalization capability (as expected). The 50% configuration consistently yielded the highest scores. Even with 25% retraining, performance remained strong, making it a viable approach when labeled data are limited. Notably, DRN outperformed AWS Rekognition across all configurations, particularly in normal operation classification.
Performance gains, however, varied with the testing scenario. In Scenario 1, where training and testing conditions aligned, high F1-scores were achieved without retraining (e.g., 0.975 for DRN). In contrast, when a new Scenario 10 introduced domain shifts, transfer learning significantly improved results—from 0.80 to 0.97 with 50% retraining—demonstrating its effectiveness in adapting to new conditions.
Overall, DRN consistently outperformed AWS Rekognition across all retraining percentages, especially in high-variability contexts. These findings suggest that DRN is more resilient to operational variability, offering superior generalization capabilities. To further consolidate these findings, we conducted a detailed analysis of the minimum data requirements and video quality conditions that enable reliable adaptation. Our experiments revealed that reliable adaptation can be achieved with as little as 10–25% of new scenario data, equivalent to approximately 2.5–25 min of annotated video per activity class. Below this threshold, performance deteriorates rapidly, with F1-scores dropping below 0.80 and high variance across runs. Regarding video quality, downsampled inputs (<720p, <10 fps) or segments with strong compression noise produced a 15–20% decline in performance. Incomplete or corrupted sequences further disrupted temporal modeling in the LSTM stage, resulting in delayed or incorrect event recognition. These findings emphasize that, while the method is efficient, it requires a minimum standard of video quality and completeness to maintain robustness. This is further illustrated in Table 2.
4.7. Multi-Class Classification Results with Transfer Learning
Following the methodology applied to binary classification models, multi-class classifiers were evaluated to assess the effectiveness of transfer learning under dynamic industrial conditions. The results, presented in Figure 11, Figure 12 and Figure 13, compare the performance of the DRN and AWS Rekognition models across multiple retraining configurations and scenarios. These results underline how varying the retraining percentage of data from a new scenario influences the generalization capacity of the models when faced with previously unseen operational contexts.
When trained and tested within the same scenario (Figure 10), both models demonstrated high F1-scores across all activity categories, with AWS Rekognition slightly outperforming DRN in most tasks. Notably, AWS Rekognition achieved superior accuracy in structured activities, such as “Raise operators on the forks” (0.98 vs. 0.96) and “Raise or lower the load while transporting” (0.97 vs. 0.92). The largest margin was observed in “Turn on slopes”, where AWS Rekognition (0.90) outperformed DRN (0.85). These results suggest that AWS Rekognition benefits from its pre-trained architecture in familiar operational conditions, while DRN remains highly competitive across diverse tasks.
However, introducing a new scenario with only 10% retraining (Figure 11) led to a noticeable performance drop for both models, particularly in safety risk-related activities. For instance, the F1-score for “Moving high loads with unbalanced loads” decreased from 0.93 to 0.73 for DRN and from 0.96 to 0.74 for AWS Rekognition. Similarly, scores dropped in “Go down slopes head-on with the crane loaded” (DRN: 0.72; AWS: 0.75). Even in normal operations, performance declined (DRN: 0.80; AWS: 0.81), indicating that minimal retraining is insufficient for effective generalization when adapting to new scenarios.
Increasing the retraining percentage to 35% (see Figure 12) significantly improved classification performance, restoring F1-scores close to original levels. DRN reached 0.93 in normal operation classification, nearly matching its baseline of 0.92, while AWS Rekognition recovered to 0.96. The performance gap between the two models narrowed, with DRN showing greater gains in safety risk-related tasks such as “Turn on slopes” (from 0.75 to 0.91) and “Moving high loads with unbalanced loads” (from 0.73 to 0.93). These findings emphasize DRN’s strong adaptability to previously unseen conditions.
At 50% retraining (see Figure 13), both models achieved their highest post-retraining scores. DRN reached 0.94 in normal operation classification and 0.93 in “Turn on slopes”, while AWS Rekognition achieved 0.96 and 0.93, respectively. The marginal differences at this stage indicate that with sufficient retraining, DRN achieves comparable performance to AWS Rekognition, even in complex multi-class scenarios.
Overall, the results confirm that increasing the retraining percentage substantially enhances model generalization. While AWS Rekognition initially shows an advantage in structured environments, DRN demonstrates superior adaptability to domain shifts, particularly in high-risk safety activities. These observations emphasize the importance of transfer learning in multi-class classification and underscore the necessity of a sufficiently high retraining percentage to ensure robust adaptation to dynamic operational environments for these models.
In contrast, models based on NVIDIA DeepStream showed limited gains in both binary and multi-class settings. As shown in Figure 14 and Figure 15, despite incorporating additional scenarios and retraining data under the proposed feature transfer learning strategy, both the binary and multi-class classifiers exhibit only a modest upward trend in F1-score, with no significant improvements in distinguishing normal operations from safety risk events. This limited responsiveness suggests that DeepStream models struggle to generalize effectively through transfer learning compared to DRN and AWS Rekognition for the use case and application under study.
To further validate our approach, we compared it with state-of-the-art domain adaptation (DANN) and few-shot learning (Prototypical Networks) baselines reported in recent studies. DANN typically required 35–50% of the target domain data to achieve F1-scores > 0.85, while our DRN with incremental transfer learning reached similar performance with only 10–25% [62]. Few-shot approaches achieved acceptable results in stable domains but dropped below F1 = 0.70 under severe domain shifts, especially with strong lighting variability. By contrast, our method consistently maintained F1 > 0.85 with minimal adaptation [49]. These results indicate that the incremental feature transfer strategy offers a better balance between generalization and efficiency, particularly for deployment on edge devices where computational resources are constrained. This is further illustrated in Table 3.
4.8. Feature Significance and Model Interpretability
SHAP (Shapley Additive Explanations) values are a widely used technique for interpreting machine learning models by quantifying the contribution of each input feature to the model’s output. This method provides insight into how specific variables influence predictions, enabling the identification of the most relevant features while minimizing the impact of non-informative or redundant inputs. In this study, SHAP values were applied to evaluate how transfer learning affects the relative importance of detected objects in image-based classification tasks for DRN.
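A condensed sketch of this analysis is shown below: it estimates the mean absolute SHAP value per detected-object feature for a risk classifier. The feature encoding and the choice of KernelExplainer are assumptions of the sketch rather than a description of the exact pipeline used here.

import numpy as np
import shap

def object_importance(predict_risk, background, samples, feature_names):
    # `predict_risk` maps object-level feature vectors (e.g., presence/position of the
    # forks, load, and workspace markers) to a risk probability; this encoding is assumed.
    explainer = shap.KernelExplainer(predict_risk, background)
    shap_values = explainer.shap_values(samples)      # shape: (n_samples, n_features)
    importance = np.abs(shap_values).mean(axis=0)     # mean |SHAP| per feature
    return dict(zip(feature_names, np.round(importance, 2)))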
The analysis demonstrated that transfer learning modifies the distribution of feature importance, resulting in a model that is more focused and generalizable. As shown in Table 4, features such as the forklift and workspace became significantly more influential after transfer learning, with SHAP values increasing from 0.53 to 0.88 and from 0.58 to 0.88, respectively. In contrast, the relevance of features such as the chassis and upper cover decreased notably—from 0.78 to 0.25 and from 0.35 to 0, respectively—indicating that the model deprioritized areas in video scenes contributing less to accurate classification.
This reallocation of feature importance towards semantically meaningful regions contributes to improved model performance, particularly in safety-critical applications. The trends observed in Figure 14 and Figure 15 reflect an overall improvement in F1-score for both normal and risk event classification across increasing retraining scenarios. By emphasizing the most representative visual features, the model becomes more robust and interpretable, while also reducing noise introduced by less discriminative elements.
Furthermore, the refined feature focus introduced by transfer learning can reduce the number of annotated samples required in training iterations, optimizing the annotation process and decreasing computational costs. This is particularly beneficial in industrial safety contexts, where manual labeling is time-consuming and costly. The improved interpretability also enhances transparency and trust in real-time decision-making systems.
SHAP-based analysis confirms that transfer learning not only improves classification performance but also contributes to a more efficient and interpretable model. These benefits support the scalability and operational viability of the proposed approach in dynamic industrial environments. By guiding annotators to concentrate solely on safety-critical regions (e.g., forks, load zones, safety boundaries), SHAP effectively eliminated the need to label non-informative areas such as background clutter [63]. This focus reduced redundant bounding boxes and labeling steps, resulting in a 20–30% decrease in average annotation time per frame [64]. Such savings are particularly relevant in industrial video datasets, where manual annotation is often the bottleneck in developing and adapting new models.
The evaluation of transfer learning across scenarios and retraining levels reveals distinctive patterns in model performance when comparing binary and multi-class classification tasks, particularly for the detection of normal versus safety risk events.
In the binary classification results (Figure 16), which include both normal and risk event detection, the DRN and AWS models demonstrate consistent improvements in F1-score with increasing retraining percentages. DRN generally outperforms AWS across most scenarios and retraining levels. For instance, in the T1 test scenario at 0% retraining, DRN achieves an F1-score of 0.95 for normal events and 0.97 for risk events, while AWS reaches slightly lower values. This trend continues across T2 and T3, indicating DRN’s superior generalization under minimal adaptation. As retraining increases to 50%, both models converge to high F1-scores (typically > 0.90), reflecting effective domain adaptation through incremental scenario inclusion.
Notably, binary classification performance is robust even with minimal retraining, particularly for DRN. This suggests that binary discrimination (normal vs. risk) is inherently simpler and benefits from the domain-invariant feature representations learned by the base models.
Figure 17 and Figure 18 disaggregate performance into multi-class classification tasks, allowing deeper insight into how the models handle increased label granularity. In Figure 17 (normal events only), a similar trend of performance improvement with retraining is observed. DRN consistently achieves higher F1-scores than AWS, especially in early retraining stages (0–25%), with scores improving sharply from ~0.67 at 0% retraining to ~0.94–0.96 at 50%. AWS shows more modest gains under low retraining levels, suggesting that DRN’s learned features are more transferable to unseen but related normal event classes.
However, the performance gap between DRN and AWS narrows at higher retraining levels, indicating that both models can effectively adapt given sufficient data from new scenarios. This convergence supports the feasibility of incremental retraining as a practical deployment strategy for evolving environments.
Figure 18 presents multi-class classification results for risk events, which are typically more variable and visually complex. Consequently, baseline performance at 0% retraining is lower compared to normal events, particularly for AWS (F1-scores < 0.55 across most test scenarios). DRN, while still affected, shows comparatively higher initial performance, with values around 0.58–0.66, reinforcing its generalization advantage. As retraining percentage increases, both models improve significantly, but the disparity remains evident. At 25% retraining, DRN surpasses 0.80 F1-score across several test scenarios, while AWS lags behind by a small margin in most cases. At 50% retraining, both models approach parity, reaching F1-scores above 0.92 in most scenarios.
Interestingly, the variance in performance across scenarios is higher for risk events than for normal events. This may be attributed to the increased contextual complexity and rarity of risk events, which require more training diversity to achieve stable classification performance. DRN exhibits higher transferability across unseen scenarios and minimal retraining conditions, particularly for risk-related tasks. AWS benefits more from higher retraining percentages. Incremental retraining yields substantial gains for both models, especially in multi-class risk classification, underscoring the necessity of scenario-specific data inclusion in training pipelines. Test scenarios T8–T10 show the most rapid improvement across retraining levels, suggesting favorable conditions for model generalization in these contexts (possibly due to event diversity or visual consistency). Binary classification is significantly easier and more stable than multi-class classification, especially for rare risk categories. This highlights the need for balanced dataset curation and tailored retraining strategies.
4.9. Inference Benchmark
To assess model performance under real-world operational conditions, where latency is critical for timely safety risk event detection, this section evaluates the end-to-end inference efficiency of trained models deployed across both edge and cloud platforms. Beyond conventional metrics such as F1-score, it is essential to consider the system’s ability to generate timely alerts. In industrial environments, where hazardous events can escalate rapidly, detection systems must provide not only accurate but also prompt warnings. Delays in detection and communication may significantly increase the risk of accidents and compromise the safety of operators and surrounding personnel. Therefore, analyzing the complete inference pipeline—from video acquisition to user notification—is fundamental for meeting real-time operational demands.
The literature emphasizes that acceptable response times vary according to industry and event severity. Critical systems in manufacturing and heavy machinery often require responses within milliseconds to a few seconds. To quantify these performance characteristics, three metrics are considered:
Input data read time: Duration required to acquire and preprocess video data before model inference.
Inference time: Time taken by the model to process inputs and generate predictions.
Delivery time: Time elapsed before the detection results reach the user or operator.
The benchmark replicates real-world operational flow, encompassing video capture, inference, and visualization. It includes Full HD camera acquisition, edge-based embedded processing, and both local and remote result presentation via a web interface. Four deployment configurations were analyzed:
Setup 1: DRN deployed on a Raspberry Pi 5, performing inference entirely on the edge.
Setup 2: Inference on a Raspberry Pi 5 with data transferred to AWS Rekognition for cloud-based processing and visualization.
Setup 3: DRN running on a Jetson Nano, utilizing onboard GPU acceleration.
Setup 4: DeepStream deployed on a Jetson Nano for comparative edge inference.
In cloud-based configurations, such as those involving AWS Rekognition, video segments are uploaded to an S3 bucket, triggering inference on EC2 instances. Results are then distributed through AWS Greengrass for visualization. In contrast, DeepStream—optimized for NVIDIA CUDA architectures using TensorRT—executes inference locally on Jetson Nano devices. On Raspberry Pi devices, TensorFlow Lite conversion is required due to limited hardware support, involving calibration, quantization, and export stages to ensure performance efficiency.
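For the Raspberry Pi path, the calibration, quantization, and export stages mentioned above can be realized with standard TensorFlow Lite post-training quantization. The sketch below is illustrative only: the SavedModel directory, the calibration batches, and the output filename are assumed placeholders rather than the exact conversion script used in this work.

```python
# Hedged sketch: TensorFlow Lite post-training integer quantization
# (calibration -> quantization -> export) for Raspberry Pi deployment.
import tensorflow as tf

def convert_to_tflite(saved_model_dir, representative_batches,
                      out_path="drn_int8.tflite"):
    converter = tf.lite.TFLiteConverter.from_saved_model(saved_model_dir)
    converter.optimizations = [tf.lite.Optimize.DEFAULT]
    # Calibration stage: feed a small set of representative input batches.
    converter.representative_dataset = lambda: ([b] for b in representative_batches)
    # Quantization stage: restrict to int8 kernels for full integer execution.
    converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
    converter.inference_input_type = tf.uint8
    converter.inference_output_type = tf.uint8
    # Export stage: write the quantized flatbuffer for on-device inference.
    tflite_model = converter.convert()
    with open(out_path, "wb") as f:
        f.write(tflite_model)
```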
While all implementations follow a similar data processing structure, fundamental differences in deployment—edge versus cloud—significantly impact timing, bandwidth, and system responsiveness (see Figure 19 and Figure 20).
The comparative analysis of inference runtimes across binary and multi-class classification modes reveals significant differences in performance among the evaluated setups. Quantitative results show that DRN deployed on the Jetson Nano in Setup 3 achieves the lowest end-to-end latency, with total processing times ranging from 2.25 to 2.89 s, followed closely by DeepStream in Setup 4 (2.05 to 2.57 s). In contrast, AWS Rekognition in Setup 2 exhibits the highest latency, with total runtimes exceeding 13 s due to video upload and result retrieval overhead. DRN on Raspberry Pi in Setup 1 also demonstrates competitive latency (5.84 to 6.8 s), outperforming the cloud-based alternative by avoiding cloud communication delays.
In terms of per-frame inference performance, DeepStream in Setup 4 achieves the fastest execution, ranging from 0.56 to 1.05 s. This efficiency is attributable to its optimized GPU pipelines and use of TensorRT acceleration. DRN also performs efficiently on the Jetson Nano in Setup 3, with inference times between 0.76 and 1.35 s across binary and multi-class tasks. While AWS Rekognition in Setup 2 shows comparable inference speeds (0.86 to 1.41 s), the total latency is dominated by communication steps such as video upload (2.9 to 3.1 s), API transmission (3.1 to 3.2 s), and remote visualization (4.22 to 4.25 s).
These findings underscore the advantage of edge-deployed DRN models (Setups 1 and 3) in reducing communication overhead. DRN eliminates the need for extensive data transmission and enables faster delivery of results to local edge monitors. Notably, DRN on Jetson Nano in Setup 3 achieves remote visualization times as low as 0.43 to 0.45 s, whereas AWS Rekognition in Setup 2 requires over four seconds for the same task. Additionally, DRN on Raspberry Pi in Setup 1 achieves moderate end-to-end runtimes, confirming its suitability for deployment on low-power devices when real-time processing is not critically constrained.
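To make the gap between the cloud and edge paths concrete, the following back-of-the-envelope sums the stage times quoted above, using the midpoints of the reported ranges (in seconds); acquisition and read time are omitted, so the totals are slightly below the end-to-end figures reported in the text.

```python
# Hedged sketch: stage-wise latency sums from the midpoints of the ranges
# reported above (seconds); read/acquisition time is not included here.
setup2_cloud = {"upload": 3.0, "inference": 1.14, "api": 3.15, "visualization": 4.24}
setup3_edge = {"inference": 1.06, "visualization": 0.44}

print(sum(setup2_cloud.values()))  # ~11.5 s before read time; >13 s end to end
print(sum(setup3_edge.values()))   # ~1.5 s before read time; 2.25-2.89 s end to end
```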
Model size also plays a critical role in deployment suitability. The DRN model has a compact footprint of 0.9 GB, compared to approximately 4 GB for AWS Rekognition and 2 GB for DeepStream. This smaller memory requirement enables efficient deployment in resource-constrained environments, reducing memory load and initialization time while enhancing system responsiveness.
Across both binary and multi-class classification modes, the relative ranking of total processing time remains consistent: DRN on Jetson Nano (Setup 3) is fastest, followed by DeepStream (Setup 4), DRN on Raspberry Pi (Setup 1), and AWS Rekognition (Setup 2). Although multi-class classification introduces additional latency due to increased computational complexity, the DRN model maintains a consistent advantage in both classification types.
In summary, DRN demonstrates superior performance across multiple dimensions, including latency, inference speed, communication efficiency, and model compactness. These characteristics make it a highly suitable candidate for deployment in real-time industrial safety monitoring applications. In contrast, AWS Rekognition is best suited for scenarios where cloud infrastructure is available and latency is not critical. DeepStream offers fast inference but requires more computational resources and larger memory allocation. Overall, DRN provides a balanced trade-off between performance and deployability, making it an optimal solution for scalable, low-latency edge deployments.
The edge deployment experiments also revealed important trade-offs. On Raspberry Pi 5, end-to-end latency ranged between 5.8 and 6.8 s with low power consumption (~5 W), sufficient for periodic monitoring but inadequate for real-time alerting [65]. On Jetson Nano, latency dropped below 3 s with a higher power draw (~10 W), making it suitable for real-time inference but less appropriate for battery-powered deployments. AWS Rekognition, although accurate, introduced delays exceeding 13 s due to communication overhead [66]. These results confirm that latency, energy, and model footprint must be balanced depending on application needs. In future work, we plan to incorporate quantization and pruning to reduce the DRN's size below 500 MB and achieve inference times below 2 s on ultra-low-power devices [67]. The performance comparison across platforms is summarized in Table 5.
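One possible route to the size reduction outlined above is magnitude pruning with the TensorFlow Model Optimization Toolkit, followed by the quantized export shown earlier. The sketch below is an assumption-laden illustration, not the planned implementation: the 50% target sparsity, optimizer, loss, and fine-tuning schedule are placeholders.

```python
# Hedged sketch: magnitude pruning with tensorflow_model_optimization as one
# possible way to shrink the DRN; sparsity target and schedule are assumptions.
import tensorflow_model_optimization as tfmot

def prune_model(model, train_ds, epochs=2):
    schedule = tfmot.sparsity.keras.ConstantSparsity(target_sparsity=0.5,
                                                     begin_step=0)
    pruned = tfmot.sparsity.keras.prune_low_magnitude(model,
                                                      pruning_schedule=schedule)
    pruned.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
                   metrics=["accuracy"])
    pruned.fit(train_ds, epochs=epochs,
               callbacks=[tfmot.sparsity.keras.UpdatePruningStep()])
    # Strip the pruning wrappers before export so the saved weights shrink.
    return tfmot.sparsity.keras.strip_pruning(pruned)
```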
5. Conclusions
This study presented a transfer learning-based methodology to enhance the generalization capacity of video-driven safety risk detection systems in industrial environments. Addressing the limitations of conventional deep learning models—which often underperform in unseen operational contexts—the proposed approach enables incremental adaptation to new scenarios using only small subsets of annotated data. By integrating a dual-stage deep neural architecture (DRN) with scenario-driven retraining, the system demonstrated consistent improvements in both binary and multi-class classification tasks across diverse forklift operation scenarios defined by OSHA 3949.
Extensive experimental results confirmed that transfer learning significantly improves detection accuracy and robustness in cross-scenario evaluations. The DRN consistently outperformed existing solutions such as AWS Rekognition and NVIDIA DeepStream, particularly under limited data availability. The transfer learning approach enabled effective adaptation with as little as 10–25% of new scenario data, making it highly practical for industrial applications where labeled data are often scarce. Moreover, the DRN achieved state-of-the-art performance on embedded edge devices such as the Jetson Nano, with low inference latency, small memory footprint, and strong real-time response—validating its suitability for resource-constrained industrial settings.
Interpretability analysis using SHAP further revealed that transfer learning reorients the model’s focus toward semantically meaningful regions, enhancing transparency and reducing annotation requirements. This shift not only supports more efficient model retraining but also increases trust in automated safety monitoring systems.
Despite promising results, limitations must be acknowledged. The current validation is restricted to forklift operations, and generalization to machinery with different kinematic profiles (e.g., mining shovels, overhead cranes, or CAEX trucks) has not yet been demonstrated. Similarly, extreme lighting conditions such as nighttime operation, glare, or infrared-only imaging were not tested. Preliminary simulations with artificially darkened videos showed an average performance drop of ~25% in F1-score, highlighting that multimodal sensing (thermal, depth, or radar) may be necessary for robust deployment in such conditions.
The performance of our DRN with incremental transfer learning, reaching F1-scores above 0.90 and AUC values exceeding 0.95 under domain shift, is competitive with or superior to recently published frameworks for video anomaly detection. Weakly supervised methods based on multimodal and multiscale feature fusion report AUC values around 0.97 on standard benchmarks [68], while semantic keyframe extraction combined with pre-trained deep models has achieved accuracy close to 96% [30]. Spatiotemporal architectures leveraging multi-stream fusion have reported AUC ≈ 0.95 even in noisy environments [69], and other multimodal solutions, including transformer-based models for violence recognition [70] and UAV-based anomaly detection frameworks [71], also reach performance levels above 90%. However, unlike these approaches, our framework achieves high performance with only 10–25% of new scenario data and a significantly smaller computational footprint. This efficiency is critical for industrial deployments where annotation costs are high and computational resources are limited. By enabling adaptation with minimal scenario-specific data and supporting inference on edge devices such as the Jetson Nano, the proposed DRN framework provides a practical, scalable, and resource-efficient alternative to more complex multimodal or transformer-based methods.
Overall, the results demonstrate that the proposed feature-based transfer learning strategy provides a scalable, efficient, and interpretable solution for deploying safety risk detection systems in dynamic and heterogeneous industrial environments. Future work will explore domain adaptation under federated learning settings, active learning strategies to further reduce annotation costs, and extending the architecture to multimodal sensor fusion.