1. Introduction
In recent years, human-centered perception research has made significant progress. Many methods have been developed to enhance the performance of pose estimation [1], human parsing [2], pedestrian detection [3], and many other human-centered tasks. These significant advancements play a key role in driving the application of visual models in many fields, such as sports analysis [4], autonomous driving [5], and e-commerce [6].
Although different human-centered perception tasks have their own relevant semantic information that needs attention, these semantics all rely on the same basic human body structure and the attributes of each body part [7,8]. Therefore, some researchers have attempted to leverage this homogeneity and train shared neural networks for multiple human-centered tasks [2,9,10,11,12,13,14,15,16]. For example, human parsing has been jointly trained with human keypoint detection [2,12,16], pedestrian attribute recognition [15], pedestrian detection [11], or person re-identification (ReID) [9]. Experimental results from these works empirically confirm that some human-centered tasks can influence each other when trained together. Inspired by these works, it is naturally expected that a more general integrated model might be a feasible solution.
Challenges in Low-Light Industrial Environments: Recognizing human actions in low-light industrial environments remains a significant challenge for safety-critical applications in power systems. Underground electrical vaults, smart power stations, and nighttime outdoor monitoring scenarios often suffer from severe illumination degradation, leading to poor visibility, low contrast, and high noise.
As illustrated in Figure 1, low-light action recognition presents multiple challenges that existing methods struggle to address effectively. Figure 1a shows a typical low-quality input image from industrial monitoring, where the average pixel intensity is only 0.18 (normalized to [0, 1]), severely limiting visibility. The baseline YOLO-Pose model, as shown in Figure 1b, produces a dispersed attention heatmap with an Attention Dispersion Ratio (ADR) of 0.67, indicating that the model fails to focus on critical body regions. This leads to the poor keypoint detection results in Figure 1c, where only 10 out of 17 keypoints are correctly identified. Our proposed LPAC-Net addresses these challenges through a multi-stage approach. Figure 1d demonstrates the enhanced image after our preprocessing pipeline, where brightness is increased by 156% (from 0.18 to 0.46) and contrast is enhanced by 82%. This improvement enables the model to generate a focused attention heatmap, as shown in Figure 1e, with an ADR of 0.23, indicating precise attention localization. The final recognition result in Figure 1f shows all 17 keypoints correctly identified, achieving 95.53% accuracy. This represents a 53.23 percentage point improvement over the baseline, demonstrating the effectiveness of our integrated framework.
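For reference, the brightness and contrast improvements quoted above can be measured with a short script such as the sketch below. It assumes frames normalized to [0, 1] and uses root-mean-square contrast; the synthetic frame data and the choice of RMS contrast are illustrative assumptions rather than the paper's exact definitions (the ADR computation is not reproduced here).

```python
import numpy as np

def brightness(frame: np.ndarray) -> float:
    """Mean pixel intensity of a frame normalized to [0, 1]."""
    return float(frame.mean())

def rms_contrast(frame: np.ndarray) -> float:
    """Root-mean-square contrast (standard deviation of intensities); an assumed contrast measure."""
    return float(frame.std())

def relative_gain(before: float, after: float) -> float:
    """Percentage improvement of a statistic after enhancement."""
    return 100.0 * (after - before) / before

# Toy frames standing in for a dark input and its enhanced version.
rng = np.random.default_rng(0)
dark = np.clip(rng.normal(0.18, 0.05, (384, 384)), 0.0, 1.0)
enhanced = np.clip(dark * 2.5, 0.0, 1.0)

print(f"brightness gain: {relative_gain(brightness(dark), brightness(enhanced)):.1f}%")
print(f"contrast gain:   {relative_gain(rms_contrast(dark), rms_contrast(enhanced)):.1f}%")
```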
While thermal imaging is an alternative that operates independently of visible light, it lacks color and texture information, which are crucial for fine-grained action understanding. Moreover, thermal cameras are more expensive and less commonly deployed in existing industrial monitoring systems. Therefore, we focus on RGB-based approaches, which are more cost-effective and widely available. Recent advances in low-light enhancement have shown that RGB images can be effectively enhanced to a level comparable to thermal imaging in terms of recognizability, while preserving richer semantic information.
In recent years, skeleton-based recognition methods, which extract semantically clear keypoint features, have significantly improved robustness to interference. Mainstream solutions fall into two groups: two-stage methods, such as OpenPose + ST-GCN, in which the separation between pose estimation and action recognition leads to error accumulation; and end-to-end methods, such as AlphaPose and other joint optimization frameworks, which suffer from missed detections in dense crowd scenarios.
Technical Gap and Motivation: Existing methods often treat low-light enhancement and action recognition as separate stages, which may lead to suboptimal performance due to information loss and error propagation. Moreover, most pose-based action recognition models are designed for well-lit conditions and degrade significantly in low-light environments. There is a pressing need for an integrated framework that jointly optimizes illumination correction, pose estimation, and temporal modeling for robust low-light action recognition.
Key Distinctions from Existing End-to-End Methods: As shown in Table 1, our LPAC-Net differs fundamentally from previous end-to-end approaches in three key aspects. First, in the coupling method, while DTCM employs loose coupling with separate networks and Dark-DSAR uses moderate coupling with a shared backbone, LPAC-Net achieves tight coupling through completely differentiable components that allow end-to-end gradient flow. Second, in the optimization objective, LPAC-Net uses a jointly optimized loss function with adaptive weighting, enabling balanced training of enhancement, pose estimation, and classification tasks. Third, in information flow, LPAC-Net establishes bidirectional feedback where the enhancer receives direct gradients from pose and classification losses, enabling task-aware enhancement. This tight coupling distinguishes our approach from previous methods that treat enhancement as fixed preprocessing or use unidirectional pipelines.
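To make the tight coupling concrete, the following PyTorch sketch illustrates, with tiny stand-in modules and an uncertainty-based adaptive weighting scheme, how enhancement, pose, and classification losses can be optimized jointly so that the enhancer receives gradients from the downstream tasks. The module definitions, loss choices, and weighting scheme are assumptions for illustration, not the paper's exact implementation.

```python
import torch
import torch.nn as nn

# Illustrative stand-ins for Zero-DCE++, YOLO-Pose, and the action classifier.
# In the real system these would be the full networks; tiny modules keep the sketch runnable.
enhancer   = nn.Sequential(nn.Conv2d(3, 3, 3, padding=1), nn.Sigmoid())
pose_net   = nn.Sequential(nn.Conv2d(3, 34, 3, padding=1), nn.AdaptiveAvgPool2d(1), nn.Flatten())
classifier = nn.Linear(34, 12)  # 12 ARID-Fall action classes

params = list(enhancer.parameters()) + list(pose_net.parameters()) + list(classifier.parameters())
optimizer = torch.optim.AdamW(params, lr=1e-4)

# Adaptive task weights via learned log-variances (one common weighting scheme; an assumption here).
log_vars = nn.Parameter(torch.zeros(3))
optimizer.add_param_group({"params": [log_vars]})

frames = torch.rand(4, 3, 64, 64)           # dark input frames
kp_target = torch.rand(4, 34)               # dummy keypoint regression targets
action_target = torch.randint(0, 12, (4,))  # dummy action labels

enhanced = enhancer(frames)                  # enhancement stays in the graph ...
keypoints = pose_net(enhanced)               # ... so pose and classification gradients
logits = classifier(keypoints)               # flow back into the enhancer.

losses = torch.stack([
    nn.functional.mse_loss(enhanced, frames.clamp(min=0.3)),  # proxy enhancement loss
    nn.functional.mse_loss(keypoints, kp_target),             # pose loss
    nn.functional.cross_entropy(logits, action_target),       # classification loss
])
total = (torch.exp(-log_vars) * losses + log_vars).sum()       # adaptive weighting of the three tasks

optimizer.zero_grad()
total.backward()
optimizer.step()
```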
This paper proposes the Low-Light Pose-Action Collaborative Network (LPAC-Net), an end-to-end recognition framework specifically designed for monitoring scenarios in low-light industrial environments. The main innovations are as follows:
- 1.
Tight Coupling of Zero-DCE++ Enhancement and YOLO-Pose Estimation (Architectural Innovation)
We present a first-of-its-kind framework that tightly couples a zero-reference low-light enhancer (Zero-DCE++) with a pose estimation model (YOLO-Pose) for action recognition. Unlike prior works that use enhancement as a fixed preprocessing step, our trainable Zero-DCE++ enhancer is optimized jointly with YOLO-Pose, adapting enhancement to pose estimation needs and achieving a 10.7% accuracy improvement compared to separate enhancement approaches. This allows the enhancement process to adaptively preserve and highlight motion-relevant features, addressing both appearance degradation and pose estimation challenges in low-light video scenarios.
- 2.
Collaborative Temporal Modeling with Attention Mechanisms (Algorithmic Innovation)
To capture dynamic human actions, we design a Bi-LSTM-based temporal module enhanced with multi-head self-attention (a simplified sketch of this module follows the contribution list). This module ingests both enhanced appearance frames and keypoint sequences, enabling the model to maintain strong temporal coherence even under visual noise. The attention mechanism allows the model to focus on discriminative temporal segments, improving recognition accuracy by 5.8% for fast-transition actions such as falls and rapid movements. Additionally, we introduce rotation-invariant pose encoding using pelvis-centered polar coordinates with adaptive scaling, which further improves viewpoint robustness and contributes a 4.4% accuracy gain compared to Cartesian coordinates.
- 3.
State-of-the-Art Performance on ARID-Fall
On the ARID-Fall dataset—a merged benchmark incorporating low-light indoor actions and real-world fall incidents—our LPAC-Net framework outperforms existing state-of-the-art methods in both accuracy and inference speed. Through comprehensive ablation studies, we demonstrate that the enhancement–pose–LSTM pipeline significantly improves performance over single-stream architectures, validating the contribution of each component in our joint optimization framework.
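As a companion to the second contribution above, the following is a minimal PyTorch sketch of a Bi-LSTM temporal module with multi-head self-attention over fused appearance and keypoint features. The feature dimensions, the concatenation-based fusion, and the temporal average pooling are illustrative assumptions rather than the exact LPAC-Net configuration.

```python
import torch
import torch.nn as nn

class TemporalActionHead(nn.Module):
    """Bi-LSTM + multi-head self-attention over a sequence of per-frame features."""

    def __init__(self, appearance_dim=128, pose_dim=34, hidden=128, heads=4, num_classes=12):
        super().__init__()
        self.lstm = nn.LSTM(appearance_dim + pose_dim, hidden,
                            batch_first=True, bidirectional=True)
        self.attn = nn.MultiheadAttention(embed_dim=2 * hidden, num_heads=heads,
                                          batch_first=True)
        self.classifier = nn.Linear(2 * hidden, num_classes)

    def forward(self, appearance_feats, pose_feats):
        # appearance_feats: (B, T, appearance_dim), pose_feats: (B, T, pose_dim)
        x = torch.cat([appearance_feats, pose_feats], dim=-1)  # simple fusion (assumed)
        h, _ = self.lstm(x)                                    # (B, T, 2*hidden)
        attended, _ = self.attn(h, h, h)                       # self-attention over time
        pooled = attended.mean(dim=1)                          # temporal average pooling
        return self.classifier(pooled)

# Usage with dummy data: 50-frame clips, 12 ARID-Fall classes.
head = TemporalActionHead()
logits = head(torch.rand(2, 50, 128), torch.rand(2, 50, 34))
print(logits.shape)  # torch.Size([2, 12])
```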
5. Experiments
5.1. Datasets and Evaluation Metrics
We evaluate the proposed LPAC-Net model using two public video datasets: ARID v1.5 [35] and the "Dataset Video for Human Action Recognition" from Kaggle. To address the absence of explicit fall events in ARID, we augment it with labeled fall-down sequences from the Kaggle dataset, relabeled as falling, to construct a more complete surveillance-oriented benchmark.
The Action Recognition in the Dark (ARID v1.5) dataset is a benchmark designed for low-illumination surveillance environments. It contains 5572 RGB video clips across 11 action categories such as walking, picking, sitting, and waving. Each video is sampled at 15 frames per second and resized to a fixed resolution. Following prior work, we adopt the standard train/val/test split: 4388 training clips, 558 validation clips, and 626 testing clips.
To enhance the dataset with critical human safety-related actions, we incorporate videos from the Kaggle-hosted dataset titled “Dataset Video for Human Action Recognition”. We extract samples from the class labeled fall down and relabel them as falling to be semantically compatible with the ARID taxonomy. We collect 380 unique fall-down video clips, resize and temporally normalize them to match the ARID frame format, and split them into 300 training, 40 validation, and 40 testing examples. This extension enriches the original dataset with real-world representations of dangerous motion patterns.
The integrated dataset (referred to as ARID-Fall) used for final training and evaluation consists of 12 action classes, including 11 from ARID v1.5 and an added class falling. The total dataset contains 5952 video clips distributed as follows: 4688 for training, 598 for validation, and 666 for testing. All clips are resized to a common resolution and normalized to 50 frames per video sequence.
We employ three primary evaluation metrics: (1) Top-1 accuracy (%), which measures the proportion of correct class predictions; (2) F1 Score (%), which balances precision and recall and is particularly relevant under class imbalance; and (3) statistical significance testing via paired t-tests with Bonferroni correction, which validates performance differences between methods and ensures a fair comparison across multiple trials.
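A minimal sketch of how these metrics can be computed with scikit-learn and SciPy is shown below; the per-trial accuracy values and the number of pairwise comparisons used for the Bonferroni correction are illustrative placeholders.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score
from scipy import stats

# Dummy predictions for a single test run.
y_true = np.array([0, 1, 2, 2, 1, 0, 3, 3])
y_pred = np.array([0, 1, 2, 1, 1, 0, 3, 2])

top1 = accuracy_score(y_true, y_pred) * 100              # Top-1 accuracy (%)
f1 = f1_score(y_true, y_pred, average="macro") * 100      # macro F1 handles class imbalance

# Paired t-test over per-trial accuracies of two methods, with Bonferroni correction.
ours     = np.array([95.1, 95.6, 95.4, 95.7, 95.8])        # illustrative per-trial accuracies
baseline = np.array([93.0, 93.4, 93.1, 93.6, 93.2])
t_stat, p_raw = stats.ttest_rel(ours, baseline)
num_comparisons = 5                                        # e.g. one test per baseline method (assumed)
p_corrected = min(p_raw * num_comparisons, 1.0)

print(f"Top-1: {top1:.2f}%  F1: {f1:.2f}%  corrected p = {p_corrected:.4f}")
```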
5.2. Implementation Details
All experiments were conducted on an NVIDIA A100 GPU with 40 GB VRAM and an Intel Xeon CPU with 32 GB RAM. The YOLO-Pose component was trained with a cosine decay learning-rate scheduler over 1000 total epochs. The LSTM component utilized the AdamW optimizer with a batch size of 64. Joint training employed mixed precision (FP16) to accelerate convergence.
To enhance model robustness, we applied several augmentation techniques, including random horizontal flipping, Gaussian noise injection, temporal interpolation, and brightness variation to simulate diverse low-light conditions.
For a fair comparison, all baseline methods (I3D-RGB, DarkLight-R101, DTCM, FRAGNet, URetinex-Net++) were re-trained on the ARID-Fall dataset using their original implementations with a consistent input resolution and training schedule (1000 epochs). DTCM was trained with its recommended spatio-temporal consistency loss, while DarkLight-R101 utilized its dual-path architecture with a ResNeXt-101 backbone.
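The sketch below shows one way to assemble the optimizer, cosine decay schedule, and FP16 mixed-precision training described above in PyTorch. The placeholder model, the learning rate, and the weight decay are assumptions, since not all hyperparameters are restated here.

```python
import torch
import torch.nn as nn

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = nn.Linear(34, 12).to(device)  # placeholder for the LPAC-Net components

# Learning rate and weight decay are illustrative assumptions.
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=1e-2)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=1000)   # 1000 epochs
scaler = torch.cuda.amp.GradScaler(enabled=device.type == "cuda")               # FP16 mixed precision

def train_epoch(batches):
    for features, labels in batches:
        features, labels = features.to(device), labels.to(device)
        optimizer.zero_grad()
        with torch.cuda.amp.autocast(enabled=device.type == "cuda"):
            loss = nn.functional.cross_entropy(model(features), labels)
        scaler.scale(loss).backward()
        scaler.step(optimizer)
        scaler.update()
    scheduler.step()

# One epoch over dummy data (batch size 64, as in the LSTM component).
dummy = [(torch.rand(64, 34), torch.randint(0, 12, (64,))) for _ in range(4)]
train_epoch(dummy)
```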
5.3. Overall Performance Comparison
Figure 5 presents the comprehensive training dynamics of our LPAC-Net model over 1000 epochs. The accuracy curves demonstrate remarkable learning progress, with both training and test accuracy showing consistent improvement throughout the training process. The model achieves approximately 85% accuracy by epoch 200 and continues to refine its performance, reaching the final test accuracy of 95.53% at epoch 1000. The minimal gap between training and test accuracy curves (less than 2% after epoch 500) indicates excellent generalization capability with negligible overfitting.
The training loss curve reveals a smooth and stable optimization process, decreasing from an initial value of 0.65 to a final value of 0.30. This substantial reduction in loss, combined with the consistent accuracy improvement, validates the effectiveness of our multi-task learning strategy and the compatibility between the enhancement, pose estimation, and temporal modeling components. The convergence behavior suggests that our model architecture and training methodology are well-suited for the challenging task of low-light action recognition.
The results in Table 3 demonstrate that LPAC-Net achieves state-of-the-art performance on the ARID-Fall benchmark, reaching 95.53% test accuracy as observed in our training curves. Compared to I3D-RGB, which relies purely on spatio-temporal RGB cues, LPAC-Net improves top-1 accuracy by over 22%, showing the benefit of combining structural pose features and illumination correction.
Relative to DarkLight-R101, which enhances image quality but lacks keypoint modeling and temporal sequence learning, LPAC-Net yields an 8.83% gain in accuracy. These gains are especially pronounced for fast transitions such as falling, where explicit geometry modeling is critical.
Compared to DTCM, a transformer-based model with strong baseline performance, LPAC-Net achieves comparable accuracy while maintaining a significantly more efficient architecture design (5.31 M vs. 28.3 M parameters). The inclusion of recent state-of-the-art methods FRAGNet and URetinex-Net++ further demonstrates LPAC-Net’s superiority, outperforming both by 1.43% and 2.03%, respectively.
The superior performance of LPAC-Net can be directly attributed to its three core innovations. First, the joint low-light and pose optimization addresses the fundamental information loss in dark scenes, which is the primary bottleneck for I3D-RGB and a limiting factor for DarkLight-R101. This is evidenced by the >10% accuracy gain over methods using fixed preprocessing. Second, the rotation-invariant pose encoding provides a geometrically stable representation, reducing variance in keypoint data and contributing to the model’s robustness. Third, the collaborative temporal modeling effectively combines enhanced appearance cues with skeletal dynamics, enabling the model to disambiguate visually similar but kinematically distinct actions (e.g., bending vs. picking), a challenge for purely RGB-based methods like DTCM. This synergy allows LPAC-Net to achieve state-of-the-art accuracy while maintaining efficiency.
5.4. Comparative Analysis of Rotation-Invariant Encoding
Table 4 shows performance under simulated electromagnetic interference. LPAC-Net with our filtering mechanisms maintains 91.28% accuracy under Gaussian noise, representing only 4.25% degradation from baseline performance, compared to 8.12% degradation without filtering. The recovery rate of 95.6% indicates effective noise suppression while preserving action recognition capability.
We further analyze the contribution of each component in our encoding. Pelvis-centric normalization provides a 2.8% accuracy improvement over global coordinates. Adaptive scaling contributes a 1.5% improvement in handling varying body sizes. Orientation compensation offers a 3.2% improvement under viewpoint changes.
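For illustration, the NumPy sketch below encodes 17 COCO-style keypoints as pelvis-centered polar coordinates with adaptive scaling and orientation compensation, yielding a 34-dimensional vector that is invariant to translation, scale, and in-plane rotation. The use of the torso length as the scale reference and the pelvis-to-neck axis for orientation compensation are plausible assumptions; the paper's exact formulation may differ.

```python
import numpy as np

def rotation_invariant_encoding(kps: np.ndarray) -> np.ndarray:
    """Encode 17 (x, y) keypoints as pelvis-centered polar coordinates.

    kps: array of shape (17, 2) in COCO order (5/6 shoulders, 11/12 hips).
    Returns a 34-dimensional vector of (radius, angle) pairs.
    """
    pelvis = (kps[11] + kps[12]) / 2.0                   # hip midpoint as origin
    centered = kps - pelvis                              # translation invariance

    neck = (kps[5] + kps[6]) / 2.0                       # shoulder midpoint
    torso = np.linalg.norm(neck - pelvis) + 1e-6
    scaled = centered / torso                            # adaptive scaling by torso length

    # Orientation compensation: measure angles relative to the pelvis->neck axis.
    body_angle = np.arctan2(neck[1] - pelvis[1], neck[0] - pelvis[0])
    radii = np.linalg.norm(scaled, axis=1)
    angles = np.arctan2(scaled[:, 1], scaled[:, 0]) - body_angle
    angles = np.arctan2(np.sin(angles), np.cos(angles))  # wrap to (-pi, pi]

    return np.stack([radii, angles], axis=1).reshape(-1)  # 34-dimensional vector

# A random skeleton and a rotated/scaled/translated copy give the same encoding.
rng = np.random.default_rng(1)
skeleton = rng.random((17, 2))
theta = 0.7
rot = np.array([[np.cos(theta), -np.sin(theta)], [np.sin(theta), np.cos(theta)]])
transformed = 2.5 * skeleton @ rot.T + np.array([10.0, -3.0])
print(np.allclose(rotation_invariant_encoding(skeleton),
                  rotation_invariant_encoding(transformed), atol=1e-6))  # True
```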
5.5. Real-Time Performance and Deployment Analysis
We evaluate the real-time performance of LPAC-Net across multiple hardware platforms to assess its practical deployment potential.
Table 5 evaluates performance under extreme lighting variations. LPAC-Net with adaptive gain control maintains 88.91% accuracy across illuminance levels from 0.1 to 500 lux, representing a balanced performance across challenging conditions. The adaptive enhancement module effectively compensates for extreme variations while maintaining reasonable false-positive rates and detection delays.
Table 6 shows the inference performance across different hardware platforms. LPAC-Net achieves real-time performance (>24 FPS) on edge devices like Jetson AGX Orin, making it suitable for industrial deployment. The power consumption ranges from 15 W on embedded devices to 320 W on high-end GPUs, providing deployment flexibility based on the available infrastructure.
The end-to-end latency breakdown per frame at 384 × 384 resolution is as follows: Frame input/preprocessing: 1.2 ms (6.1%); Zero-DCE++ enhancement: 0.9 ms (4.5%); YOLO-Pose forward pass: 12.5 ms (63.1%); keypoint post-processing: 2.8 ms (14.1%); Bi-LSTM temporal analysis: 2.4 ms (12.1%). The total latency is 19.8 ms, achieving approximately 50 FPS on high-end hardware.
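Such a per-stage breakdown can be reproduced with a simple timing harness like the sketch below; the stage functions are empty placeholders for the actual pipeline components, and GPU stages would additionally require device synchronization before reading the timer.

```python
import time

def profile_pipeline(frame, stages):
    """Time each named stage of a per-frame pipeline and report its share of total latency."""
    timings, out = {}, frame
    for name, fn in stages:
        start = time.perf_counter()
        out = fn(out)
        timings[name] = (time.perf_counter() - start) * 1000.0  # ms
    total = sum(timings.values())
    for name, ms in timings.items():
        print(f"{name:<22s} {ms:7.3f} ms ({100.0 * ms / total:4.1f}%)")
    print(f"{'total':<22s} {total:7.3f} ms ({1000.0 / total:.1f} FPS)")

# Placeholder stages standing in for preprocessing, Zero-DCE++, YOLO-Pose,
# keypoint post-processing, and the Bi-LSTM head.
stages = [
    ("preprocess", lambda x: x),
    ("zero_dce_enhance", lambda x: x),
    ("yolo_pose_forward", lambda x: x),
    ("keypoint_postprocess", lambda x: x),
    ("bilstm_temporal", lambda x: x),
]
profile_pipeline(object(), stages)
```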
Figure 6 presents the speed–accuracy Pareto frontier comparison. LPAC-Net achieves the optimal trade-off, operating within the desirable region for industrial applications where both high accuracy (>90%) and real-time performance (>25 FPS) are required.
5.6. Robustness Analysis Under Extreme Conditions
To evaluate practical applicability in industrial environments, we conduct comprehensive robustness testing under various adverse conditions.
Table 7 shows that the optimal loss weights were determined through a grid search over 125 combinations with 5-fold cross-validation, gradient flow analysis using gradient norm ratios, validation set performance with early stopping criteria, and a sensitivity analysis showing robustness within ±20% of the optimal values.
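A simplified version of this weight search is sketched below: a grid of 5 × 5 × 5 = 125 weight combinations evaluated with 5-fold cross-validation. The `evaluate_weights` function is a hypothetical stand-in for training and validating the model under a given weighting, and the grid values themselves are illustrative.

```python
import itertools
import numpy as np
from sklearn.model_selection import KFold

def evaluate_weights(w_enh, w_pose, w_cls, train_idx, val_idx):
    """Hypothetical stand-in: train with the given loss weights and return validation accuracy."""
    rng = np.random.default_rng(abs(hash((w_enh, w_pose, w_cls))) % (2**32))
    return 90.0 + 5.0 * rng.random()   # placeholder score

samples = np.arange(1000)               # indices of training clips (placeholder)
grid = [0.1, 0.3, 0.5, 0.7, 1.0]        # 5 values per weight -> 125 combinations
kfold = KFold(n_splits=5, shuffle=True, random_state=0)

best_score, best_weights = -np.inf, None
for w_enh, w_pose, w_cls in itertools.product(grid, repeat=3):
    scores = [evaluate_weights(w_enh, w_pose, w_cls, tr, va)
              for tr, va in kfold.split(samples)]
    mean_score = float(np.mean(scores))
    if mean_score > best_score:
        best_score, best_weights = mean_score, (w_enh, w_pose, w_cls)

print(f"best weights {best_weights} with 5-fold accuracy {best_score:.2f}%")
```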
Figure 7 illustrates robustness to partial occlusion. LPAC-Net maintains over 80% accuracy even with 40% keypoint occlusion, significantly outperforming baseline methods which typically drop below 60% accuracy at similar occlusion levels. This performance is critical for industrial scenarios where equipment or tools may partially obscure workers.
Table 8 compares different training strategies. Single-stage training suffers from optimization difficulties due to conflicting gradients, resulting in lower accuracy and higher overfitting risk. Our three-stage strategy provides 4.26% accuracy improvement over single-stage training. Gradual unfreezing allows stable optimization of interdependent components, reducing overfitting by 57% compared to single-stage training while maintaining efficient convergence.
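The gradual-unfreezing schedule can be expressed as in the sketch below, where each stage progressively unfreezes more of the pipeline before full end-to-end optimization. The stand-in modules and the stage ordering shown are illustrative assumptions rather than the exact training recipe.

```python
import torch.nn as nn

# Illustrative stand-ins for the three pipeline components.
enhancer = nn.Conv2d(3, 3, 3, padding=1)       # Zero-DCE++ stand-in
pose_net = nn.Conv2d(3, 34, 3, padding=1)      # YOLO-Pose stand-in
temporal_head = nn.LSTM(34, 64, batch_first=True, bidirectional=True)

def set_trainable(module: nn.Module, trainable: bool) -> None:
    for p in module.parameters():
        p.requires_grad = trainable

# Stage 1: train only the temporal head.
# Stage 2: unfreeze the pose network for joint pose + action fine-tuning.
# Stage 3: unfreeze the enhancer for full end-to-end optimization.
stages = [
    {"train": [temporal_head], "freeze": [enhancer, pose_net]},
    {"train": [temporal_head, pose_net], "freeze": [enhancer]},
    {"train": [temporal_head, pose_net, enhancer], "freeze": []},
]

for i, stage in enumerate(stages, start=1):
    for m in stage["freeze"]:
        set_trainable(m, False)
    for m in stage["train"]:
        set_trainable(m, True)
    trainable = sum(p.numel() for m in stage["train"] for p in m.parameters())
    print(f"stage {i}: {trainable} trainable parameters")
    # ... run the training loop for this stage here ...
```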
5.7. Ablation Study and Component Analysis
As shown in Table 9, each module in LPAC-Net contributes to its final performance on ARID-Fall. Starting from the Pose-LSTM baseline (a), adding rotation-invariant encoding (b) improves accuracy by 4.4%, indicating its role in view-consistent representation. The enhancer-only variant (c) achieves 90.5% accuracy, demonstrating the importance of illumination correction. The combination of enhancer and pose estimation (d) yields 93.4%, while the full model with all components reaches 95.53%, confirming the synergistic effect of our integrated design.
The ablation study provides direct evidence for the value of each innovation. The 4.4% improvement from adding rotation-invariant encoding (b vs. a) validates the effectiveness of geometric normalization. The 5.6% gain from adding the trainable enhancer (d vs. a) demonstrates the importance of adaptive illumination correction. The final 2.1% improvement from the full integration (f vs. e) shows the added value of collaborative temporal modeling when combined with the other components. This stepwise improvement pattern confirms that our innovations are complementary and collectively necessary for optimal performance.
5.8. Error Correction Methods Performance
As shown in Table 10, our multi-level error correction mechanism significantly improves recognition accuracy in noisy conditions. Compared to no processing, our method boosts accuracy from 68.3% to 83.6%, demonstrating its robustness against keypoint mis-detections. The improvement is statistically significant (p < 0.001, paired t-test).
We further evaluated LPAC-Net under challenging scenarios, including fast motion blur and partial occlusions. For fast motion blur in the "falling" action, the standard deviation of hip joint coordinates decreased markedly after Kalman filtering (averaged over 50 test sequences), indicating significantly improved trajectory smoothness. For occlusion recovery, when hand keypoints were lost for three consecutive frames, our spatial–temporal interpolation largely restored recognition accuracy (mean ± std over 30 occlusion events). For outlier suppression, the LSTM attention mask mechanism reduced misclassifications of "waving" as "pushing" by 62.3% (from an 18.7% to a 7.1% error rate), demonstrating effective temporal consistency enforcement.
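As an illustration of the keypoint-level smoothing discussed above, the sketch below applies a constant-velocity Kalman filter to a noisy one-dimensional hip-joint trajectory and bridges a short occlusion by pure prediction. The noise parameters and the constant-velocity motion model are assumptions, not the paper's exact filter settings.

```python
import numpy as np

def kalman_smooth_1d(observations, process_var=1e-2, meas_var=4.0):
    """Constant-velocity Kalman filter over a 1-D keypoint coordinate trajectory.

    NaN observations (e.g. occluded frames) are bridged by pure prediction.
    """
    x = np.array([observations[~np.isnan(observations)][0], 0.0])   # [position, velocity]
    P = np.eye(2)
    F = np.array([[1.0, 1.0], [0.0, 1.0]])                          # constant-velocity model
    Q = process_var * np.eye(2)
    H = np.array([[1.0, 0.0]])
    R = np.array([[meas_var]])

    smoothed = []
    for z in observations:
        x = F @ x                                                    # predict
        P = F @ P @ F.T + Q
        if not np.isnan(z):                                          # update only when observed
            S = H @ P @ H.T + R
            K = P @ H.T @ np.linalg.inv(S)
            x = x + (K @ (np.array([z]) - H @ x))
            P = (np.eye(2) - K @ H) @ P
        smoothed.append(x[0])
    return np.array(smoothed)

# Noisy hip trajectory with a 3-frame occlusion (NaN) in the middle.
t = np.arange(50, dtype=float)
truth = 100.0 + 2.0 * t
noisy = truth + np.random.default_rng(2).normal(0, 2.0, size=50)
noisy[20:23] = np.nan
smoothed = kalman_smooth_1d(noisy)
print(f"raw residual std: {np.nanstd(noisy - truth):.2f} px, "
      f"smoothed: {np.std(smoothed - truth):.2f} px")
```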
5.9. Computational and Efficiency Analysis
Table 11 details the computational requirements of each component in our framework. The entire LPAC-Net pipeline maintains a compact model size (5.31 M parameters) with moderate computational complexity (14.68 GFLOPs). The end-to-end inference time is 19.8 ms per frame, corresponding to approximately 50 FPS, meeting real-time requirements for industrial monitoring. The efficient design makes it suitable for deployment on resource-constrained devices. This computational efficiency, combined with the excellent convergence behavior and 95.53% accuracy demonstrated in our training curves, positions LPAC-Net as a practical solution for real-world low-light action recognition applications.
5.10. Visualization and Qualitative Results
Figure 8 presents qualitative results of our LPAC-Net framework on the ARID-Fall dataset. The top row (a–d) illustrates the sequential detection of a “falling” action, where our method accurately tracks the progressive body collapse despite low-light conditions. The bottom row (e–h) shows recognition results for a “bending” action, demonstrating the model’s capability to distinguish subtle motion patterns. These visualizations confirm that our rotation-invariant pose encoding effectively preserves motion semantics while being robust to viewpoint variations.
Figure 9 shows that the model demonstrates strong adaptability for multi-person pose recognition. Under conditions of dense personnel distribution, it effectively handles pose recognition for multiple targets while maintaining high recognition consistency.
We also conducted per-class confusion analysis on the ARID-Fall test set. Notably, after integrating geometric normalization and low-light enhancement, the model’s accuracy on previously confused actions—such as waving vs. pouring—increased from 74% to 89%. The temporal modeling capability of Bi-LSTM effectively captures the dynamic characteristics of different actions, significantly reducing inter-class confusion.
6. Conclusions and Future Work
This paper presented the Low-Light Pose-Action Collaborative Network (LPAC-Net), achieving state-of-the-art performance (95.53% accuracy) on low-light industrial action recognition. Our key contributions include architectural innovations: the first tight-coupling framework integrating Zero-DCE++ enhancement with YOLO-Pose estimation, novel rotation-invariant pose encoding with 4.4% accuracy improvement, and a multi-stage training strategy reducing overfitting by 57%. Performance achievements include real-time operation (50.5 FPS on A100 and 24.7 FPS on edge devices), robustness (maintains > 80% accuracy under 40% occlusion), and efficiency (5.31 M parameters and 14.68 GFLOPs).
Through extensive testing, we identify the following boundaries. Failure modes include extreme darkness (<0.1 lux), where performance drops to 72%, requiring thermal fusion; multiple overlapping persons, where keypoint association errors increase by 18%; rapid camera motion, where motion blur reduces accuracy by 12–15%; and unseen action types with a generalization gap of 8–12% for novel actions. Generalization boundaries include an illumination range of 0.1–1000 lux (covers 95% of industrial scenarios), viewing angles ±60° from frontal view, action duration of 0.5–10 s, and person distance of 1–15 m from camera.
Theoretical insights explain why joint optimization is superior to cascading. Gradient alignment allows the enhancer to receive direct feedback from recognition loss. Feature preservation enables task-aware enhancement to preserve motion-relevant features. Error compensation uses temporal modeling to compensate for enhancement artifacts. Geometric features excel in low light due to illumination invariance where spatial relationships are unaffected by lighting changes, noise robustness where structural patterns persist through image degradation, and compact representation (34-dimensional pose vs. 10 K+ pixel features).
Design principles for low-light action recognition include the following: geometric features take precedence—prioritize skeletal structure over appearance; temporal smoothness over single-frame accuracy—leverage motion continuity; adaptive enhancement over fixed preprocessing—task-aware illumination correction; progressive refinement over one-shot processing—multi-stage error correction; and edge-cloud collaboration—balance latency and accuracy.
Future research directions include multi-modal fusion (RGB + thermal + depth) for complete darkness, self-supervised adaptation for domain adaptation without labeled data, federated learning for privacy-preserving multi-site deployment, explainable AI for action reasoning with causal relationships, and proactive safety through predictive analytics for accident prevention.
LPAC-Net represents a significant step toward reliable industrial monitoring in challenging lighting conditions. By adhering to the established design principles and addressing the identified failure modes, future systems can achieve even greater robustness and practical utility in safety-critical applications.