1. Introduction
In recent years, human-centered perception research has made significant progress. Many methods have been developed to enhance the performance of pose estimation [1], human parsing [2], pedestrian detection [3], and many other human-centered tasks. These significant advancements play a key role in driving the application of visual models in many fields, such as sports analysis [4], autonomous driving [5], and e-commerce [6].
Although different human-centered perception tasks have their own relevant semantic information that needs attention, these semantics all rely on the same basic human body structure and the attributes of each body part [7,8]. Therefore, some researchers have attempted to leverage this homogeneity and train shared neural networks for multiple human-centered tasks [2,9,10,11,12,13,14,15,16]. For example, human parsing has been jointly trained with human keypoint detection [2,12,16], pedestrian attribute recognition [15], pedestrian detection [11], or person re-identification (ReID) [9]. Experimental results from these works empirically confirm that some human-centered tasks can influence each other when trained together. Inspired by these works, it is naturally expected that a more general integrated model might be a feasible solution.
Challenges in Low-Light Industrial Environments: Recognizing human actions in low-light industrial environments remains a significant challenge for safety-critical applications in power systems. Underground electrical vaults, smart power stations, and nighttime outdoor monitoring scenarios often suffer from severe illumination degradation, leading to poor visibility, low contrast, and high noise.
As illustrated in Figure 1, low-light action recognition presents multiple challenges that existing methods struggle to address effectively. Figure 1a shows a typical low-quality input image from industrial monitoring, where the average pixel intensity is only 0.18 (normalized to [0, 1]), severely limiting visibility. The baseline YOLO-Pose model, as shown in Figure 1b, produces a dispersed attention heatmap with an Attention Dispersion Ratio (ADR) of 0.67, indicating that the model fails to focus on critical body regions. This leads to the poor keypoint detection results in Figure 1c, where only 10 out of 17 keypoints are correctly identified. Our proposed LPAC-Net addresses these challenges through a multi-stage approach. Figure 1d demonstrates the enhanced image after our preprocessing pipeline, where brightness is increased by 156% (from 0.18 to 0.46) and contrast is enhanced by 82%. This improvement enables the model to generate a focused attention heatmap, as shown in Figure 1e, with an ADR of 0.23, indicating precise attention localization. The final recognition result in Figure 1f shows all 17 keypoints correctly identified, achieving 95.53% accuracy. This represents a 53.23 percentage point improvement over the baseline, demonstrating the effectiveness of our integrated framework.
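For reference, the brightness and contrast improvements quoted above can be measured with a short script such as the sketch below. It assumes frames normalized to [0, 1] and uses root-mean-square contrast; the synthetic frame data and the choice of RMS contrast are illustrative assumptions rather than the paper's exact definitions (the ADR computation is not reproduced here).

```python
import numpy as np

def brightness(frame: np.ndarray) -> float:
    """Mean pixel intensity of a frame normalized to [0, 1]."""
    return float(frame.mean())

def rms_contrast(frame: np.ndarray) -> float:
    """Root-mean-square contrast (standard deviation of intensities); an assumed contrast measure."""
    return float(frame.std())

def relative_gain(before: float, after: float) -> float:
    """Percentage improvement of a statistic after enhancement."""
    return 100.0 * (after - before) / before

# Toy frames standing in for a dark input and its enhanced version.
rng = np.random.default_rng(0)
dark = np.clip(rng.normal(0.18, 0.05, (384, 384)), 0.0, 1.0)
enhanced = np.clip(dark * 2.5, 0.0, 1.0)

print(f"brightness gain: {relative_gain(brightness(dark), brightness(enhanced)):.1f}%")
print(f"contrast gain:   {relative_gain(rms_contrast(dark), rms_contrast(enhanced)):.1f}%")
```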
While thermal imaging is an alternative that operates independently of visible light, it lacks color and texture information, which are crucial for fine-grained action understanding. Moreover, thermal cameras are more expensive and less commonly deployed in existing industrial monitoring systems. Therefore, we focus on RGB-based approaches, which are more cost-effective and widely available. Recent advances in low-light enhancement have shown that RGB images can be effectively enhanced to a level comparable to thermal imaging in terms of recognizability, while preserving richer semantic information.
In recent years, skeleton-based recognition methods, which extract semantically clear keypoint features, have significantly improved robustness to interference. Mainstream solutions fall into two groups: two-stage methods, such as OpenPose + ST-GCN, in which the separation between pose estimation and action recognition leads to error accumulation; and end-to-end methods, such as AlphaPose and other joint optimization frameworks, which suffer from missed detections in dense crowd scenarios.
Technical Gap and Motivation: Existing methods often treat low-light enhancement and action recognition as separate stages, which may lead to suboptimal performance due to information loss and error propagation. Moreover, most pose-based action recognition models are designed for well-lit conditions and degrade significantly in low-light environments. There is a pressing need for an integrated framework that jointly optimizes illumination correction, pose estimation, and temporal modeling for robust low-light action recognition.
Key Distinctions from Existing End-to-End Methods: As shown in Table 1, our LPAC-Net differs fundamentally from previous end-to-end approaches in three key aspects. First, in the coupling method, while DTCM employs loose coupling with separate networks and Dark-DSAR uses moderate coupling with a shared backbone, LPAC-Net achieves tight coupling through completely differentiable components that allow end-to-end gradient flow. Second, in the optimization objective, LPAC-Net uses a jointly optimized loss function with adaptive weighting, enabling balanced training of enhancement, pose estimation, and classification tasks. Third, in information flow, LPAC-Net establishes bidirectional feedback where the enhancer receives direct gradients from pose and classification losses, enabling task-aware enhancement. This tight coupling distinguishes our approach from previous methods that treat enhancement as fixed preprocessing or use unidirectional pipelines.
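To make the tight coupling concrete, the following PyTorch sketch illustrates, with tiny stand-in modules and an uncertainty-based adaptive weighting scheme, how enhancement, pose, and classification losses can be optimized jointly so that the enhancer receives gradients from the downstream tasks. The module definitions, loss choices, and weighting scheme are assumptions for illustration, not the paper's exact implementation.

```python
import torch
import torch.nn as nn

# Illustrative stand-ins for Zero-DCE++, YOLO-Pose, and the action classifier.
# In the real system these would be the full networks; tiny modules keep the sketch runnable.
enhancer   = nn.Sequential(nn.Conv2d(3, 3, 3, padding=1), nn.Sigmoid())
pose_net   = nn.Sequential(nn.Conv2d(3, 34, 3, padding=1), nn.AdaptiveAvgPool2d(1), nn.Flatten())
classifier = nn.Linear(34, 12)  # 12 ARID-Fall action classes

params = list(enhancer.parameters()) + list(pose_net.parameters()) + list(classifier.parameters())
optimizer = torch.optim.AdamW(params, lr=1e-4)

# Adaptive task weights via learned log-variances (one common weighting scheme; an assumption here).
log_vars = nn.Parameter(torch.zeros(3))
optimizer.add_param_group({"params": [log_vars]})

frames = torch.rand(4, 3, 64, 64)           # dark input frames
kp_target = torch.rand(4, 34)               # dummy keypoint regression targets
action_target = torch.randint(0, 12, (4,))  # dummy action labels

enhanced = enhancer(frames)                  # enhancement stays in the graph ...
keypoints = pose_net(enhanced)               # ... so pose and classification gradients
logits = classifier(keypoints)               # flow back into the enhancer.

losses = torch.stack([
    nn.functional.mse_loss(enhanced, frames.clamp(min=0.3)),  # proxy enhancement loss
    nn.functional.mse_loss(keypoints, kp_target),             # pose loss
    nn.functional.cross_entropy(logits, action_target),       # classification loss
])
total = (torch.exp(-log_vars) * losses + log_vars).sum()       # adaptive weighting of the three tasks

optimizer.zero_grad()
total.backward()
optimizer.step()
```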
This paper proposes the Low-Light Pose-Action Collaborative Network (LPAC-Net), an end-to-end recognition framework specifically designed for monitoring scenarios in low-light industrial environments. The main innovations are as follows:
- 1.
Tight Coupling of Zero-DCE++ Enhancement and YOLO-Pose Estimation (Architectural Innovation)
We present a first-of-its-kind framework that tightly couples a zero-reference low-light enhancer (Zero-DCE++) with a pose estimation model (YOLO-Pose) for action recognition. Unlike prior works that use enhancement as a fixed preprocessing step, our trainable Zero-DCE++ enhancer is optimized jointly with YOLO-Pose, adapting enhancement to pose estimation needs and achieving a 10.7% accuracy improvement compared to separate enhancement approaches. This allows the enhancement process to adaptively preserve and highlight motion-relevant features, addressing both appearance degradation and pose estimation challenges in low-light video scenarios.
- 2.
Collaborative Temporal Modeling with Attention Mechanisms (Algorithmic Innovation)
To capture dynamic human actions, we design a Bi-LSTM-based temporal module enhanced with multi-head self-attention (a simplified sketch of this module follows the contribution list). This module ingests both enhanced appearance frames and keypoint sequences, enabling the model to maintain strong temporal coherence even under visual noise. The attention mechanism allows the model to focus on discriminative temporal segments, improving recognition accuracy by 5.8% for fast-transition actions such as falls and rapid movements. Additionally, we introduce rotation-invariant pose encoding using pelvis-centered polar coordinates with adaptive scaling, which further improves viewpoint robustness and contributes a 4.4% accuracy gain compared to Cartesian coordinates.
- 3.
State-of-the-Art Performance on ARID-Fall
On the ARID-Fall dataset—a merged benchmark incorporating low-light indoor actions and real-world fall incidents—our LPAC-Net framework outperforms existing state-of-the-art methods in both accuracy and inference speed. Through comprehensive ablation studies, we demonstrate that the enhancement–pose–LSTM pipeline significantly improves performance over single-stream architectures, validating the contribution of each component in our joint optimization framework.
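As a companion to the second contribution above, the following is a minimal PyTorch sketch of a Bi-LSTM temporal module with multi-head self-attention over fused appearance and keypoint features. The feature dimensions, the concatenation-based fusion, and the temporal average pooling are illustrative assumptions rather than the exact LPAC-Net configuration.

```python
import torch
import torch.nn as nn

class TemporalActionHead(nn.Module):
    """Bi-LSTM + multi-head self-attention over a sequence of per-frame features."""

    def __init__(self, appearance_dim=128, pose_dim=34, hidden=128, heads=4, num_classes=12):
        super().__init__()
        self.lstm = nn.LSTM(appearance_dim + pose_dim, hidden,
                            batch_first=True, bidirectional=True)
        self.attn = nn.MultiheadAttention(embed_dim=2 * hidden, num_heads=heads,
                                          batch_first=True)
        self.classifier = nn.Linear(2 * hidden, num_classes)

    def forward(self, appearance_feats, pose_feats):
        # appearance_feats: (B, T, appearance_dim), pose_feats: (B, T, pose_dim)
        x = torch.cat([appearance_feats, pose_feats], dim=-1)  # simple fusion (assumed)
        h, _ = self.lstm(x)                                    # (B, T, 2*hidden)
        attended, _ = self.attn(h, h, h)                       # self-attention over time
        pooled = attended.mean(dim=1)                          # temporal average pooling
        return self.classifier(pooled)

# Usage with dummy data: 50-frame clips, 12 ARID-Fall classes.
head = TemporalActionHead()
logits = head(torch.rand(2, 50, 128), torch.rand(2, 50, 34))
print(logits.shape)  # torch.Size([2, 12])
```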
5. Experiments
5.1. Datasets and Evaluation Metrics
We evaluate the proposed LPAC-Net model using two public video datasets: ARID v1.5 [35] and the "Dataset Video for Human Action Recognition" from Kaggle. To address the absence of explicit fall events in ARID, we augment it with labeled fall-down sequences from the Kaggle dataset, relabeled as falling, to construct a more complete surveillance-oriented benchmark.
The Action Recognition in the Dark (ARID v1.5) dataset is a benchmark designed for low-illumination surveillance environments. It contains 5572 RGB video clips across 11 action categories such as walking, picking, sitting, and waving. Each video is sampled at 15 frames per second and resized to a fixed resolution. Following prior work, we adopt the standard train/val/test split: 4388 training clips, 558 validation clips, and 626 testing clips.
To enhance the dataset with critical human safety-related actions, we incorporate videos from the Kaggle-hosted dataset titled “Dataset Video for Human Action Recognition”. We extract samples from the class labeled fall down and relabel them as falling to be semantically compatible with the ARID taxonomy. We collect 380 unique fall-down video clips, resize and temporally normalize them to match the ARID frame format, and split them into 300 training, 40 validation, and 40 testing examples. This extension enriches the original dataset with real-world representations of dangerous motion patterns.
The integrated dataset (referred to as ARID-Fall) used for final training and evaluation consists of 12 action classes, including 11 from ARID v1.5 and an added class falling. The total dataset contains 5952 video clips distributed as follows: 4688 for training, 598 for validation, and 666 for testing. All clips are resized to a common resolution and normalized to 50 frames per video sequence.
We employ three primary evaluation metrics: (1) Top-1 accuracy (%), which measures the proportion of correct class predictions; (2) F1 Score (%), which balances precision and recall and is particularly relevant under class imbalance; and (3) statistical significance testing via paired t-tests with Bonferroni correction, which validates performance differences between methods and ensures a fair comparison across multiple trials.
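A minimal sketch of how these metrics can be computed with scikit-learn and SciPy is shown below; the per-trial accuracy values and the number of pairwise comparisons used for the Bonferroni correction are illustrative placeholders.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score
from scipy import stats

# Dummy predictions for a single test run.
y_true = np.array([0, 1, 2, 2, 1, 0, 3, 3])
y_pred = np.array([0, 1, 2, 1, 1, 0, 3, 2])

top1 = accuracy_score(y_true, y_pred) * 100              # Top-1 accuracy (%)
f1 = f1_score(y_true, y_pred, average="macro") * 100      # macro F1 handles class imbalance

# Paired t-test over per-trial accuracies of two methods, with Bonferroni correction.
ours     = np.array([95.1, 95.6, 95.4, 95.7, 95.8])        # illustrative per-trial accuracies
baseline = np.array([93.0, 93.4, 93.1, 93.6, 93.2])
t_stat, p_raw = stats.ttest_rel(ours, baseline)
num_comparisons = 5                                        # e.g. one test per baseline method (assumed)
p_corrected = min(p_raw * num_comparisons, 1.0)

print(f"Top-1: {top1:.2f}%  F1: {f1:.2f}%  corrected p = {p_corrected:.4f}")
```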
5.2. Implementation Details
All experiments were conducted on an NVIDIA A100 GPU with 40 GB VRAM and an Intel Xeon CPU with 32 GB RAM. The YOLO-Pose component was trained with a cosine decay learning-rate scheduler over 1000 total epochs. The LSTM component utilized the AdamW optimizer with a batch size of 64. Joint training employed mixed precision (FP16) to accelerate convergence.
To enhance model robustness, we applied several augmentation techniques, including random horizontal flipping, Gaussian noise injection, temporal interpolation, and brightness variation to simulate diverse low-light conditions.
For a fair comparison, all baseline methods (I3D-RGB, DarkLight-R101, DTCM, FRAGNet, URetinex-Net++) were re-trained on the ARID-Fall dataset using their original implementations with a consistent input resolution and training schedule (1000 epochs). DTCM was trained with its recommended spatio-temporal consistency loss, while DarkLight-R101 utilized its dual-path architecture with a ResNeXt-101 backbone.
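The sketch below shows one way to assemble the optimizer, cosine decay schedule, and FP16 mixed-precision training described above in PyTorch. The placeholder model, the learning rate, and the weight decay are assumptions, since not all hyperparameters are restated here.

```python
import torch
import torch.nn as nn

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = nn.Linear(34, 12).to(device)  # placeholder for the LPAC-Net components

# Learning rate and weight decay are illustrative assumptions.
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=1e-2)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=1000)   # 1000 epochs
scaler = torch.cuda.amp.GradScaler(enabled=device.type == "cuda")               # FP16 mixed precision

def train_epoch(batches):
    for features, labels in batches:
        features, labels = features.to(device), labels.to(device)
        optimizer.zero_grad()
        with torch.cuda.amp.autocast(enabled=device.type == "cuda"):
            loss = nn.functional.cross_entropy(model(features), labels)
        scaler.scale(loss).backward()
        scaler.step(optimizer)
        scaler.update()
    scheduler.step()

# One epoch over dummy data (batch size 64, as in the LSTM component).
dummy = [(torch.rand(64, 34), torch.randint(0, 12, (64,))) for _ in range(4)]
train_epoch(dummy)
```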
5.3. Overall Performance Comparison
Figure 5 presents the comprehensive training dynamics of our LPAC-Net model over 1000 epochs. The accuracy curves demonstrate remarkable learning progress, with both training and test accuracy showing consistent improvement throughout the training process. The model achieves approximately 85% accuracy by epoch 200 and continues to refine its performance, reaching the final test accuracy of 95.53% at epoch 1000. The minimal gap between training and test accuracy curves (less than 2% after epoch 500) indicates excellent generalization capability with negligible overfitting.
The training loss curve reveals a smooth and stable optimization process, decreasing from an initial value of 0.65 to a final value of 0.30. This substantial reduction in loss, combined with the consistent accuracy improvement, validates the effectiveness of our multi-task learning strategy and the compatibility between the enhancement, pose estimation, and temporal modeling components. The convergence behavior suggests that our model architecture and training methodology are well-suited for the challenging task of low-light action recognition.
The results in Table 3 demonstrate that LPAC-Net achieves state-of-the-art performance on the ARID-Fall benchmark, reaching 95.53% test accuracy as observed in our training curves. Compared to I3D-RGB, which relies purely on spatio-temporal RGB cues, LPAC-Net improves top-1 accuracy by over 22%, showing the benefit of combining structural pose features and illumination correction.
Relative to DarkLight-R101, which enhances image quality but lacks keypoint modeling and temporal sequence learning, LPAC-Net yields an 8.83% gain in accuracy. These gains are especially pronounced for fast transitions such as falling, where explicit geometry modeling is critical.
Compared to DTCM, a transformer-based model with strong baseline performance, LPAC-Net achieves comparable accuracy while maintaining a significantly more efficient architecture design (5.31 M vs. 28.3 M parameters). The inclusion of recent state-of-the-art methods FRAGNet and URetinex-Net++ further demonstrates LPAC-Net’s superiority, outperforming both by 1.43% and 2.03%, respectively.
The superior performance of LPAC-Net can be directly attributed to its three core innovations. First, the joint low-light and pose optimization addresses the fundamental information loss in dark scenes, which is the primary bottleneck for I3D-RGB and a limiting factor for DarkLight-R101. This is evidenced by the >10% accuracy gain over methods using fixed preprocessing. Second, the rotation-invariant pose encoding provides a geometrically stable representation, reducing variance in keypoint data and contributing to the model’s robustness. Third, the collaborative temporal modeling effectively combines enhanced appearance cues with skeletal dynamics, enabling the model to disambiguate visually similar but kinematically distinct actions (e.g., bending vs. picking), a challenge for purely RGB-based methods like DTCM. This synergy allows LPAC-Net to achieve state-of-the-art accuracy while maintaining efficiency.
5.4. Comparative Analysis of Rotation-Invariant Encoding
Table 4 shows performance under simulated electromagnetic interference. LPAC-Net with our filtering mechanisms maintains 91.28% accuracy under Gaussian noise, representing only 4.25% degradation from baseline performance, compared to 8.12% degradation without filtering. The recovery rate of 95.6% indicates effective noise suppression while preserving action recognition capability.
We further analyze the contribution of each component in our encoding. Pelvis-centric normalization provides a 2.8% accuracy improvement over global coordinates. Adaptive scaling contributes a 1.5% improvement in handling varying body sizes. Orientation compensation offers a 3.2% improvement under viewpoint changes.
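For illustration, the NumPy sketch below encodes 17 COCO-style keypoints as pelvis-centered polar coordinates with adaptive scaling and orientation compensation, yielding a 34-dimensional vector that is invariant to translation, scale, and in-plane rotation. The use of the torso length as the scale reference and the pelvis-to-neck axis for orientation compensation are plausible assumptions; the paper's exact formulation may differ.

```python
import numpy as np

def rotation_invariant_encoding(kps: np.ndarray) -> np.ndarray:
    """Encode 17 (x, y) keypoints as pelvis-centered polar coordinates.

    kps: array of shape (17, 2) in COCO order (5/6 shoulders, 11/12 hips).
    Returns a 34-dimensional vector of (radius, angle) pairs.
    """
    pelvis = (kps[11] + kps[12]) / 2.0                   # hip midpoint as origin
    centered = kps - pelvis                              # translation invariance

    neck = (kps[5] + kps[6]) / 2.0                       # shoulder midpoint
    torso = np.linalg.norm(neck - pelvis) + 1e-6
    scaled = centered / torso                            # adaptive scaling by torso length

    # Orientation compensation: measure angles relative to the pelvis->neck axis.
    body_angle = np.arctan2(neck[1] - pelvis[1], neck[0] - pelvis[0])
    radii = np.linalg.norm(scaled, axis=1)
    angles = np.arctan2(scaled[:, 1], scaled[:, 0]) - body_angle
    angles = np.arctan2(np.sin(angles), np.cos(angles))  # wrap to (-pi, pi]

    return np.stack([radii, angles], axis=1).reshape(-1)  # 34-dimensional vector

# A random skeleton and a rotated/scaled/translated copy give the same encoding.
rng = np.random.default_rng(1)
skeleton = rng.random((17, 2))
theta = 0.7
rot = np.array([[np.cos(theta), -np.sin(theta)], [np.sin(theta), np.cos(theta)]])
transformed = 2.5 * skeleton @ rot.T + np.array([10.0, -3.0])
print(np.allclose(rotation_invariant_encoding(skeleton),
                  rotation_invariant_encoding(transformed), atol=1e-6))  # True
```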
5.5. Real-Time Performance and Deployment Analysis
We evaluate the real-time performance of LPAC-Net across multiple hardware platforms to assess its practical deployment potential.
Table 5 evaluates performance under extreme lighting variations. LPAC-Net with adaptive gain control maintains 88.91% accuracy across illuminance levels from 0.1 to 500 lux, representing a balanced performance across challenging conditions. The adaptive enhancement module effectively compensates for extreme variations while maintaining reasonable false-positive rates and detection delays.
Table 6 shows the inference performance across different hardware platforms. LPAC-Net achieves real-time performance (>24 FPS) on edge devices like Jetson AGX Orin, making it suitable for industrial deployment. The power consumption ranges from 15 W on embedded devices to 320 W on high-end GPUs, providing deployment flexibility based on the available infrastructure.
The end-to-end latency breakdown per frame at 384 × 384 resolution is as follows: Frame input/preprocessing: 1.2 ms (6.1%); Zero-DCE++ enhancement: 0.9 ms (4.5%); YOLO-Pose forward pass: 12.5 ms (63.1%); keypoint post-processing: 2.8 ms (14.1%); Bi-LSTM temporal analysis: 2.4 ms (12.1%). The total latency is 19.8 ms, achieving approximately 50 FPS on high-end hardware.
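Such a per-stage breakdown can be reproduced with a simple timing harness like the sketch below; the stage functions are empty placeholders for the actual pipeline components, and GPU stages would additionally require device synchronization before reading the timer.

```python
import time

def profile_pipeline(frame, stages):
    """Time each named stage of a per-frame pipeline and report its share of total latency."""
    timings, out = {}, frame
    for name, fn in stages:
        start = time.perf_counter()
        out = fn(out)
        timings[name] = (time.perf_counter() - start) * 1000.0  # ms
    total = sum(timings.values())
    for name, ms in timings.items():
        print(f"{name:<22s} {ms:7.3f} ms ({100.0 * ms / total:4.1f}%)")
    print(f"{'total':<22s} {total:7.3f} ms ({1000.0 / total:.1f} FPS)")

# Placeholder stages standing in for preprocessing, Zero-DCE++, YOLO-Pose,
# keypoint post-processing, and the Bi-LSTM head.
stages = [
    ("preprocess", lambda x: x),
    ("zero_dce_enhance", lambda x: x),
    ("yolo_pose_forward", lambda x: x),
    ("keypoint_postprocess", lambda x: x),
    ("bilstm_temporal", lambda x: x),
]
profile_pipeline(object(), stages)
```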
Figure 6 presents the speed–accuracy Pareto frontier comparison. LPAC-Net achieves the optimal trade-off, operating within the desirable region for industrial applications where both high accuracy (>90%) and real-time performance (>25 FPS) are required.
5.6. Robustness Analysis Under Extreme Conditions
To evaluate practical applicability in industrial environments, we conduct comprehensive robustness testing under various adverse conditions.
Table 7 shows that the optimal loss weights were determined through a grid search over 125 combinations with 5-fold cross-validation, gradient flow analysis using gradient norm ratios, validation set performance with early stopping criteria, and a sensitivity analysis showing robustness within ±20% of the optimal values.
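A simplified version of this weight search is sketched below: a grid of 5 × 5 × 5 = 125 weight combinations evaluated with 5-fold cross-validation. The `evaluate_weights` function is a hypothetical stand-in for training and validating the model under a given weighting, and the grid values themselves are illustrative.

```python
import itertools
import numpy as np
from sklearn.model_selection import KFold

def evaluate_weights(w_enh, w_pose, w_cls, train_idx, val_idx):
    """Hypothetical stand-in: train with the given loss weights and return validation accuracy."""
    rng = np.random.default_rng(abs(hash((w_enh, w_pose, w_cls))) % (2**32))
    return 90.0 + 5.0 * rng.random()   # placeholder score

samples = np.arange(1000)               # indices of training clips (placeholder)
grid = [0.1, 0.3, 0.5, 0.7, 1.0]        # 5 values per weight -> 125 combinations
kfold = KFold(n_splits=5, shuffle=True, random_state=0)

best_score, best_weights = -np.inf, None
for w_enh, w_pose, w_cls in itertools.product(grid, repeat=3):
    scores = [evaluate_weights(w_enh, w_pose, w_cls, tr, va)
              for tr, va in kfold.split(samples)]
    mean_score = float(np.mean(scores))
    if mean_score > best_score:
        best_score, best_weights = mean_score, (w_enh, w_pose, w_cls)

print(f"best weights {best_weights} with 5-fold accuracy {best_score:.2f}%")
```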
Figure 7 illustrates robustness to partial occlusion. LPAC-Net maintains over 80% accuracy even with 40% keypoint occlusion, significantly outperforming baseline methods which typically drop below 60% accuracy at similar occlusion levels. This performance is critical for industrial scenarios where equipment or tools may partially obscure workers.
Table 8 compares different training strategies. Single-stage training suffers from optimization difficulties due to conflicting gradients, resulting in lower accuracy and higher overfitting risk. Our three-stage strategy provides 4.26% accuracy improvement over single-stage training. Gradual unfreezing allows stable optimization of interdependent components, reducing overfitting by 57% compared to single-stage training while maintaining efficient convergence.
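The gradual-unfreezing schedule can be expressed as in the sketch below, where each stage progressively unfreezes more of the pipeline before full end-to-end optimization. The stand-in modules and the stage ordering shown are illustrative assumptions rather than the exact training recipe.

```python
import torch.nn as nn

# Illustrative stand-ins for the three pipeline components.
enhancer = nn.Conv2d(3, 3, 3, padding=1)       # Zero-DCE++ stand-in
pose_net = nn.Conv2d(3, 34, 3, padding=1)      # YOLO-Pose stand-in
temporal_head = nn.LSTM(34, 64, batch_first=True, bidirectional=True)

def set_trainable(module: nn.Module, trainable: bool) -> None:
    for p in module.parameters():
        p.requires_grad = trainable

# Stage 1: train only the temporal head.
# Stage 2: unfreeze the pose network for joint pose + action fine-tuning.
# Stage 3: unfreeze the enhancer for full end-to-end optimization.
stages = [
    {"train": [temporal_head], "freeze": [enhancer, pose_net]},
    {"train": [temporal_head, pose_net], "freeze": [enhancer]},
    {"train": [temporal_head, pose_net, enhancer], "freeze": []},
]

for i, stage in enumerate(stages, start=1):
    for m in stage["freeze"]:
        set_trainable(m, False)
    for m in stage["train"]:
        set_trainable(m, True)
    trainable = sum(p.numel() for m in stage["train"] for p in m.parameters())
    print(f"stage {i}: {trainable} trainable parameters")
    # ... run the training loop for this stage here ...
```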
5.7. Ablation Study and Component Analysis
As shown in Table 9, each module in LPAC-Net contributes to its final performance on ARID-Fall. Starting from the Pose-LSTM baseline (a), adding rotation-invariant encoding (b) improves accuracy by 4.4%, indicating its role in view-consistent representation. The enhancer-only variant (c) achieves 90.5% accuracy, demonstrating the importance of illumination correction. The combination of enhancer and pose estimation (d) yields 93.4%, while the full model with all components reaches 95.53%, confirming the synergistic effect of our integrated design.
The ablation study provides direct evidence for the value of each innovation. The 4.4% improvement from adding rotation-invariant encoding (b vs. a) validates the effectiveness of geometric normalization. The 5.6% gain from adding the trainable enhancer (d vs. a) demonstrates the importance of adaptive illumination correction. The final 2.1% improvement from the full integration (f vs. e) shows the added value of collaborative temporal modeling when combined with the other components. This stepwise improvement pattern confirms that our innovations are complementary and collectively necessary for optimal performance.
5.8. Error Correction Methods Performance
As shown in Table 10, our multi-level error correction mechanism significantly improves recognition accuracy in noisy conditions. Compared to no processing, our method boosts accuracy from 68.3% to 83.6%, demonstrating its robustness against keypoint mis-detections. The improvement is statistically significant (p < 0.001, paired t-test).
We further evaluated LPAC-Net under challenging scenarios, including fast motion blur and partial occlusions. For fast motion blur in the "falling" action, the standard deviation of hip joint coordinates decreased markedly after Kalman filtering (averaged over 50 test sequences), indicating significantly improved trajectory smoothness. For occlusion recovery, when hand keypoints were lost for three consecutive frames, our spatial–temporal interpolation largely restored recognition accuracy (mean ± std over 30 occlusion events). For outlier suppression, the LSTM attention mask mechanism reduced misclassifications of "waving" as "pushing" by 62.3% (from an 18.7% to a 7.1% error rate), demonstrating effective temporal consistency enforcement.
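As an illustration of the keypoint-level smoothing discussed above, the sketch below applies a constant-velocity Kalman filter to a noisy one-dimensional hip-joint trajectory and bridges a short occlusion by pure prediction. The noise parameters and the constant-velocity motion model are assumptions, not the paper's exact filter settings.

```python
import numpy as np

def kalman_smooth_1d(observations, process_var=1e-2, meas_var=4.0):
    """Constant-velocity Kalman filter over a 1-D keypoint coordinate trajectory.

    NaN observations (e.g. occluded frames) are bridged by pure prediction.
    """
    x = np.array([observations[~np.isnan(observations)][0], 0.0])   # [position, velocity]
    P = np.eye(2)
    F = np.array([[1.0, 1.0], [0.0, 1.0]])                          # constant-velocity model
    Q = process_var * np.eye(2)
    H = np.array([[1.0, 0.0]])
    R = np.array([[meas_var]])

    smoothed = []
    for z in observations:
        x = F @ x                                                    # predict
        P = F @ P @ F.T + Q
        if not np.isnan(z):                                          # update only when observed
            S = H @ P @ H.T + R
            K = P @ H.T @ np.linalg.inv(S)
            x = x + (K @ (np.array([z]) - H @ x))
            P = (np.eye(2) - K @ H) @ P
        smoothed.append(x[0])
    return np.array(smoothed)

# Noisy hip trajectory with a 3-frame occlusion (NaN) in the middle.
t = np.arange(50, dtype=float)
truth = 100.0 + 2.0 * t
noisy = truth + np.random.default_rng(2).normal(0, 2.0, size=50)
noisy[20:23] = np.nan
smoothed = kalman_smooth_1d(noisy)
print(f"raw residual std: {np.nanstd(noisy - truth):.2f} px, "
      f"smoothed: {np.std(smoothed - truth):.2f} px")
```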
5.9. Computational and Efficiency Analysis
Table 11 details the computational requirements of each component in our framework. The entire LPAC-Net pipeline maintains a compact model size (5.31 M parameters) with moderate computational complexity (14.68 GFLOPs). The end-to-end inference time is 19.8 ms per frame, corresponding to approximately 50 FPS, meeting real-time requirements for industrial monitoring. The efficient design makes it suitable for deployment on resource-constrained devices. This computational efficiency, combined with the excellent convergence behavior and 95.53% accuracy demonstrated in our training curves, positions LPAC-Net as a practical solution for real-world low-light action recognition applications.
5.10. Visualization and Qualitative Results
Figure 8 presents qualitative results of our LPAC-Net framework on the ARID-Fall dataset. The top row (a–d) illustrates the sequential detection of a “falling” action, where our method accurately tracks the progressive body collapse despite low-light conditions. The bottom row (e–h) shows recognition results for a “bending” action, demonstrating the model’s capability to distinguish subtle motion patterns. These visualizations confirm that our rotation-invariant pose encoding effectively preserves motion semantics while being robust to viewpoint variations.
Figure 9 shows that the model demonstrates strong adaptability for multi-person pose recognition. Under conditions of dense personnel distribution, it effectively handles pose recognition for multiple targets while maintaining high recognition consistency.
We also conducted per-class confusion analysis on the ARID-Fall test set. Notably, after integrating geometric normalization and low-light enhancement, the model’s accuracy on previously confused actions—such as waving vs. pouring—increased from 74% to 89%. The temporal modeling capability of Bi-LSTM effectively captures the dynamic characteristics of different actions, significantly reducing inter-class confusion.
6. Conclusions and Future Work
This paper presented the Low-Light Pose-Action Collaborative Network (LPAC-Net), achieving state-of-the-art performance (95.53% accuracy) on low-light industrial action recognition. Our key contributions include architectural innovations: the first tight-coupling framework integrating Zero-DCE++ enhancement with YOLO-Pose estimation, novel rotation-invariant pose encoding with 4.4% accuracy improvement, and a multi-stage training strategy reducing overfitting by 57%. Performance achievements include real-time operation (50.5 FPS on A100 and 24.7 FPS on edge devices), robustness (maintains > 80% accuracy under 40% occlusion), and efficiency (5.31 M parameters and 14.68 GFLOPs).
Through extensive testing, we identify the following boundaries. Failure modes include extreme darkness (<0.1 lux), where performance drops to 72%, requiring thermal fusion; multiple overlapping persons, where keypoint association errors increase by 18%; rapid camera motion, where motion blur reduces accuracy by 12–15%; and unseen action types with a generalization gap of 8–12% for novel actions. Generalization boundaries include an illumination range of 0.1–1000 lux (covers 95% of industrial scenarios), viewing angles ±60° from frontal view, action duration of 0.5–10 s, and person distance of 1–15 m from camera.
Theoretical insights explain why joint optimization is superior to cascading. Gradient alignment allows the enhancer to receive direct feedback from recognition loss. Feature preservation enables task-aware enhancement to preserve motion-relevant features. Error compensation uses temporal modeling to compensate for enhancement artifacts. Geometric features excel in low light due to illumination invariance where spatial relationships are unaffected by lighting changes, noise robustness where structural patterns persist through image degradation, and compact representation (34-dimensional pose vs. 10 K+ pixel features).
Design principles for low-light action recognition include the following: geometric features take precedence—prioritize skeletal structure over appearance; temporal smoothness over single-frame accuracy—leverage motion continuity; adaptive enhancement over fixed preprocessing—task-aware illumination correction; progressive refinement over one-shot processing—multi-stage error correction; and edge-cloud collaboration—balance latency and accuracy.
Future research directions include multi-modal fusion (RGB + thermal + depth) for complete darkness, self-supervised adaptation for domain adaptation without labeled data, federated learning for privacy-preserving multi-site deployment, explainable AI for action reasoning with causal relationships, and proactive safety through predictive analytics for accident prevention.
LPAC-Net represents a significant step toward reliable industrial monitoring in challenging lighting conditions. By adhering to the established design principles and addressing the identified failure modes, future systems can achieve even greater robustness and practical utility in safety-critical applications.