Next Article in Journal
Advancements and Applications of Industry Foundation Classes Standards in Engineering: A Comprehensive Review
Previous Article in Journal
Research on the Integrative and Iterative Architecture Design Mechanism of Chinese Culture and Green Technology
 
 
Font Type:
Arial Georgia Verdana
Font Size:
Aa Aa Aa
Line Spacing:
Column Width:
Background:
Article

Hybrid Precision Gradient Accumulation for CNN-LSTM in Sports Venue Buildings Analytics: Energy-Efficient Spatiotemporal Modeling

1
Academy of Fine Arts, Anhui Normal University, Wuhu 241000, China
2
Martial Arts College, Henan University, Kaifeng 457001, China
3
Faculty of Humanities and Social Sciences, Macao Polytechnic University, Macao 999078, China
*
Author to whom correspondence should be addressed.
Buildings 2025, 15(16), 2926; https://doi.org/10.3390/buildings15162926
Submission received: 23 June 2025 / Revised: 11 August 2025 / Accepted: 14 August 2025 / Published: 18 August 2025
(This article belongs to the Section Building Energy, Physics, Environment, and Systems)

Abstract

We propose a hybrid CNN-LSTM architecture for energy-efficient spatiotemporal modeling in sports venue analytics, addressing the dual challenges of computational efficiency and prediction accuracy in dynamic environments. The proposed method integrates layered mixed-precision training with gradient accumulation, dynamically allocating bitwidths across the spatial (CNN) and temporal (LSTM) layers while maintaining robustness through a computational memory unit. The CNN feature extractor employs higher precision for early layers to preserve spatial details, whereas the LSTM reduces the precision for temporal sequences, optimizing energy consumption under a hardware-aware constraint. Furthermore, the gradient accumulation over micro-batches simulates large-batch training without memory overhead, and the computational memory unit mitigates precision loss by storing the intermediate gradients in high-precision buffers before quantization. The system is realized as a ResNet-18 variant with mixed-precision convolutions and a two-layer bidirectional LSTM, deployed on edge devices for real-time processing with sub 5 ms latency. Our theoretical analysis predicts a 35–45% energy reduction versus fixed-precision models while maintaining <2% accuracy degradation, crucial for large-scale deployment. The experimental results demonstrate a 40% reduction in energy consumption compared to fixed-precision models while achieving over 95% prediction accuracy in tasks such as occupancy forecasting and HVAC control. This work bridges the gap between energy efficiency and model performance, offering a scalable solution for large-scale venue analytics.

1. Introduction

Sports venues present unique computational challenges for real-time analytics due to their dynamic environments, high-resolution video streams, and stringent energy constraints. Traditional approaches using CNN-LSTM architectures [1] have shown promise in spatiotemporal modeling but face limitations in computational efficiency when deployed at scale. The growing demand for intelligent venue management systems—ranging from crowd behavior analysis [2] to energy optimization [3]—necessitates innovations that balance accuracy with resource consumption. This work specifically addresses fundamental gaps in current sports venue analytics, beginning with uniform quantization approaches’ inability to simultaneously handle spatial and temporal precision requirements, which typically forces undesirable trade-offs between excessive energy consumption (≥5.2 J/inference) and accuracy degradation (≥8.2% MAE increase). Furthermore, existing solutions lack integration between precision allocation and training stability optimization, while also missing hardware-aware deployment strategies capable of maintaining real-time performance (sub 5 ms latency) under strict energy constraints (≤3 J/inference).
Recent advances in mixed-precision training [4] and gradient accumulation [5] offer potential solutions, yet their application to hybrid CNN-LSTM systems remains underexplored. Convolutional neural networks (CNNs) demonstrate strong capabilities in spatial feature extraction [6], while long short-term memory (LSTM) networks effectively model temporal dependencies [7]. However, when combined in hybrid architectures, their aggregate computational requirements often exceed the practical limitations of edge devices, particularly in terms of the memory bandwidth and energy consumption. Prior work has either focused on standalone optimizations or sacrificed accuracy for efficiency, failing to address the interplay between spatial and temporal precision requirements. For instance, Wan et al. [8] achieved 3.8 J/inference but suffered a 12% accuracy drop in temporal prediction tasks, while Feigenbaum’s method [9] maintained accuracy but required 6.1 J/inference, exceeding the practical edge device limits. These trade-offs highlight the need for integrated solutions.
To address these challenges, we present a comprehensive framework that integrates hybrid precision training with gradient accumulation for CNN-LSTM architectures in sports venue analytics. This approach not only enhances computational efficiency but also maintains high prediction accuracy, which is crucial for real-time applications such as crowd flow prediction, HVAC control, and event detection. The following diagram provides an overview of our proposed methodology, highlighting the key components and their interactions within the system.
Figure 1 illustrates the proposed hybrid precision gradient accumulation framework for CNN-LSTM in sports venue analytics. The framework begins with the processing of high-resolution video data containing temporal sequences, which are then fed into a CNN feature extractor that employs a dynamic bitwidth allocation strategy to optimize the precision across layers. The extracted features are subsequently passed to a bidirectional LSTM for temporal modeling, which operates at reduced precision to further enhance the energy efficiency. A computational memory unit (CMU) is introduced to mitigate the precision loss during gradient accumulation, ensuring stable training and high prediction accuracy. Finally, the model is deployed on edge devices like the NVIDIA Jetson AGX Orin, enabling real-time processing with sub 5 ms latency. The key results section of the diagram summarizes the energy efficiency, accuracy, and performance improvements achieved by our method, demonstrating its effectiveness in practical applications.
The proposed framework addresses these challenges through three technical advances: a dynamic bitwidth allocation mechanism that automatically adjusts the precision across the CNN and LSTM layers to balance computational efficiency with feature preservation; a gradient accumulation scheme that maintains the training stability for high-resolution inputs while minimizing the memory overhead; and a computational memory unit that safeguards the backpropagation accuracy through high-precision gradient buffering. By integrating these components, our approach simultaneously optimizes the precision allocation and training stability—a fundamental departure from previous methods that treated these aspects independently.
Theoretical analysis reveals that our method reduces the energy consumption quadratically with precision reduction while maintaining linear accuracy scaling—a critical advantage for large-scale deployment. Practical implementation demonstrates sub 5 ms inference latency on edge devices, making it suitable for real-time applications like occupancy forecasting [10] and HVAC control [11]. These capabilities address the growing need for sustainable venue operations highlighted in recent studies [12]. Unlike prior work that either quantized CNNs and LSTMs separately or used static precision, our method dynamically adapts to both spatial and temporal requirements while maintaining training stability through novel gradient handling.
This article presents several significant contributions. Firstly, a hardware-aware bitwidth allocation algorithm is developed, which can dynamically optimize the numerical accuracy of the CNN and LSTM layers. Secondly, a memory-efficient gradient accumulation scheme is designed, and stable mixed-accuracy training is achieved by calculating the memory buffer. Experimental verification shows that this method can reduce the energy consumption by 40% compared with the fixed accuracy benchmark. Meanwhile, its practicability has been proved through case studies of multiple venue analysis tasks. These efforts have effectively bridged the gap between computational efficiency and model performance, promoting the advancement of intelligent venue management technology.

2. Related Work

Recent advances in deep learning have explored various techniques to improve computational efficiency while maintaining model accuracy. These efforts span mixed-precision training, gradient accumulation, and hybrid architectures—each addressing specific challenges in terms of large-scale deployment.

2.1. Mixed-Precision Training

Mixed-precision training has emerged as a key strategy to reduce the computational overhead without significant accuracy degradation [13]. Early work in this area focused on static quantization, where the weights and activations were uniformly reduced to lower the bitwidths. However, this approach often led to suboptimal performance in deeper networks due to error accumulation. Subsequent studies introduced dynamic bitwidth allocation, optimizing the precision layer by layer through gradient-based methods [14]. These techniques demonstrated notable improvements in energy efficiency, particularly for convolutional neural networks (CNNs).
Recent extensions to recurrent architectures, particularly LSTMs, demonstrate that temporal modeling achieves both computational efficiency and accuracy preservation through dynamic variable precision allocation [15], where different components of the memory cells can operate at distinct bitwidths. While these methods achieve significant energy savings, they typically treat spatial and temporal components independently, neglecting the interplay between CNN and LSTM layers in hybrid models.

2.2. Gradient Accumulation and Memory Optimization

Gradient accumulation has been widely adopted to simulate large batch training under memory constraints, particularly in scenarios where high-resolution inputs are required [16]. Traditional implementations accumulate gradients with full precision (FP32) before applying weight updates, but this approach introduces an additional memory overhead when combined with mixed-precision training.
To address this, recent work has explored hybrid buffer systems that maintain high-precision storage for accumulated gradients while allowing low-precision computation [17]. These methods mitigate the quantization errors during backpropagation but often lack integration with dynamic bitwidth allocation strategies.
Recent implementations have achieved notable efficiency gains through hardware-aware optimizations. Lu et al. [18] demonstrated that combining 4-bit gradient accumulation with FP32 master copies in an L2 cache reduced the memory bandwidth by 60% on TPUv4 chips. Their method used a sliding window approach to manage the buffer sizes dynamically, maintaining 99% of the full-precision accuracy while cutting the energy consumption per gradient update by 3.2×. These techniques prove particularly valuable for venue analytics where high-resolution inputs (e.g., 4 K video at 30 fps) would otherwise require prohibitive memory resources.

2.3. Hybrid CNN-LSTM Architectures

Hybrid CNN-LSTM models have gained traction in spatiotemporal applications, leveraging CNNs for spatial feature extraction and LSTMs for temporal modeling [19]. Prior work has demonstrated their effectiveness in domains such as indoor air quality prediction [20] and energy consumption forecasting [21]. However, these models typically employ uniform precision across all the layers, leading to inefficiencies in resource utilization.
Recent efforts have explored quantization-aware training for hybrid architectures, but they focus solely on either CNNs or LSTMs without considering their combined optimization [22]. Recent edge AI deployments have demonstrated the practical viability of hybrid CNN-LSTM models in constrained environments. For instance, Arif et al. [23] achieved real-time video frame prediction on Jetson Xavier boards by combining 3D CNNs with convolutional LSTMs, optimizing the memory access patterns through temporal tiling. Their implementation maintained 30 fps throughput while consuming under 15 W, showcasing the potential for efficient spatiotemporal modeling on edge devices. Another notable example is the work by Aravinda et al. [24], who deployed a hybrid CNN-LSTM model on a humanoid robot’s onboard computer, using layer-wise precision scaling to reduce the energy consumption by 35% during real-time object manipulation tasks. Additionally, existing methods often overlook the hardware-specific energy constraints critical for edge deployment.

2.4. Energy-Efficient Deep Learning for Smart Venues

The application of deep learning in sports venues and smart buildings has highlighted the need for energy-efficient models. Studies have shown that optimizing computational efficiency can significantly reduce operational costs while maintaining predictive accuracy [25]. However, most existing solutions rely on fixed-precision models or separate optimizations for spatial and temporal components, failing to address the unique challenges of hybrid architectures.
Practical deployments have validated these theoretical benefits. The SmartStadium initiative implemented quantized CNN-LSTM models across 15 venues, achieving a 35% energy reduction while maintaining 94% occupancy prediction accuracy. Their system used adaptive bitwidth switching between 4 and 8 bits based on the thermal conditions, demonstrating the feasibility of dynamic precision control in real-world settings. Similarly, the EcoArena project [26] combined spatial attention mechanisms with hybrid precision to optimize HVAC control, reducing the venue energy consumption by 28% during live events. These case studies highlight how algorithmic innovations translate to measurable sustainability improvements in operational environments.
The proposed method distinguishes itself by integrating layered mixed-precision training with gradient accumulation in a hardware-aware framework. Unlike prior work, our approach dynamically allocates bitwidths across both the CNN and LSTM layers while maintaining training stability through a computational memory unit. This holistic optimization enables substantial energy savings without compromising accuracy, making it particularly suitable for real-time venue analytics.

3. Material and Methods

3.1. Background and Preliminaries

To establish the foundation for our proposed method, we first review key concepts in numerical precision, gradient computation, and hybrid architectures. These components form the basis for understanding the trade-offs between computational efficiency and model accuracy in spatiotemporal learning tasks.

3.1.1. Numerical Representation in Deep Learning

Modern neural networks predominantly use floating-point arithmetic, with FP32 being the de facto standard for training. The IEEE 754 standard defines this format as having 1 sign bit, 8 exponent bits, and 23 fraction bits, providing approximately 7 decimal digits of precision [27]. Lower precision formats such as FP16 (16-bit) and BF16 (bfloat16) reduce the memory bandwidth and computational costs while maintaining a reasonable dynamic range [28]. Fixed-point representations further compress data by allocating bits between integer and fractional components, but they require careful scaling to avoid overflow [29].
The choice of numerical format impacts both the forward and backward passes during training. In forward propagation, reduced precision can lead to quantization errors that accumulate across layers. During backpropagation, insufficient precision may cause gradient vanishing or explosion, particularly in deep networks [30]. These effects become more pronounced in hybrid architectures where spatial and temporal components have different sensitivity to precision reduction.

3.1.2. Gradient Dynamics in Hybrid Architectures

The backpropagation through time (BPTT) algorithm computes the gradients in recurrent networks by unrolling the temporal sequence and applying the chain rule [31]. For a hybrid CNN-LSTM model [32], the gradients flow through both spatial and temporal pathways:
L W C N N = t = 1 T L h t h t W C N N
L W L S T M = t = 1 T L h t h t W L S T M
where L is the loss function, h t represents the hidden states, and T denotes the sequence length. The magnitude and stability of these gradients differ substantially—CNN gradients tend to be more stable due to weight sharing, while LSTM gradients exhibit higher variance from temporal dependencies. This divergence necessitates distinct precision handling strategies for each component.

3.1.3. Computational Memory in Training Systems

Computational memory refers to specialized buffers that maintain high-precision copies of critical variables during low-precision computation [33]. These systems typically employ a two-level hierarchy: (1) fast, low-precision storage for activations and weights during forward/backward passes, and (2) high-precision buffers for accumulated gradients and master weight copies
The memory unit coordinates transfers between these levels, ensuring numerical stability while minimizing energy consumption. Prior work has shown that maintaining FP32 copies of gradients can prevent quantization-induced training divergence [34], but existing implementations often lack integration with dynamic precision allocation.

3.1.4. Energy-Proportional Computing

The energy consumption of deep learning operations scales quadratically with the precision for arithmetic operations and the linearly for memory access [35,36]. For matrix multiplication Y = W X , the energy cost can be modeled as:
E o p = k b w b x N
where b w and b x represent the operand bitwidths, N is the operation count, and k is a hardware-dependent constant. This relationship motivates precision reduction as an effective energy optimization strategy, provided the accuracy constraints are met. Edge devices further constrain energy budgets through thermal design power (TDP) limits, making dynamic precision adaptation essential for sustainable deployment [37].
The interaction of these concepts forms the theoretical basis for our layered mixed-precision approach. The next section will detail how we integrate these components into a unified framework for energy-efficient spatiotemporal modeling.

3.2. Proposed Method: Layered Mixed-Precision Training with Gradient Accumulation

The proposed methodology introduces a systematic approach to optimize the energy efficiency in CNN-LSTM architectures while maintaining the model accuracy. The framework consists of three interconnected components: gradient-based bitwidth optimization, gradient accumulation with computational memory, and hybrid architecture implementation. These elements work synergistically to address the unique challenges of spatiotemporal modeling in resource-constrained environments.

3.2.1. Gradient-Based Bitwidth Optimization for Layered Mixed-Precision Training

The bitwidth allocation strategy formulates precision selection as a constrained optimization problem, where the objective minimizes the training loss L under an energy budget E max . For each layer l , we define a continuous relaxation of the discrete bitwidths b l b min , b max through a sigmoid parameterization:
b l = b min + b max b min σ θ l
where θ l represents the trainable parameters and σ denotes the sigmoid function. The energy constraint is enforced via a Lagrangian multiplier λ :
L total = L task + λ l = 1 L k l 2 b l E max
Here, k l captures the hardware-specific energy coefficients derived from profiling. The gradient with respect to θ l is computed using straight-through estimation:
L θ l L b l I b min b l b max
For the CNN layers, we introduce spatial sensitivity weighting w s that prioritizes the early convolutional stages:
w s = 1 1 + e α s β
where s denotes the layer depth, α controls the transition sharpness, and β sets the inflection point. This formulation automatically allocates higher precision to the initial spatial feature extractors while allowing deeper layers to operate at reduced bitwidths.

3.2.2. Gradient Accumulation with Computational Memory Unit

The computational memory unit (CMU) maintains two parallel representations of the gradients: a high-precision buffer G mem (FP32) and a quantized version G quant (low precision) [38]. For each micro-batch m , the update rule follows:
G mem m = G mem m 1 + Q 1 G in m
where Q 1 denotes dequantization to FP32. The quantized output is computed only at the weight update steps:
G out = Q b l G mem , Q b l x = r o u n d x Δ Δ
The scaling factor Δ adapts dynamically based on the gradient statistics:
Δ = m a x G mem 2 b l 1 1
This adaptive quantization prevents overflow while maximizing the dynamic range. The CMU implements a skip mechanism that bypasses quantization when the gradient magnitudes fall below a threshold τ :
G out = G mem if G mem 2 < τ Q b l G mem otherwise

3.2.3. Implementation Details of Hybrid CNN-LSTM Architecture and Edge Deployment

Our hybrid architecture builds upon a ResNet-18 foundation with substantial modifications for precision-aware processing. Each residual block incorporates mixed-precision convolutions with the following formulation:
Y = Q b l W Q b l X + X
where denotes convolution and X represents the identity connection. The LSTM component uses separate bitwidths for input ( b i ), recurrent ( b r ), and output ( b o ) operations:
i t f t o t g t = σ σ σ t a n h Q b i W x i X t + Q b r W h i h t 1 + b i
For edge deployment, we implement a pipelined execution model that overlaps the CNN and LSTM computations. The Jetson-AGX-Orin-specific optimizations comprise three key components: (1) efficient utilization of tensor cores for accelerated mixed-precision matrix operations, (2) optimized asynchronous memory transfers between CMU buffers to minimize data transfer overhead, and (3) intelligent dynamic voltage/frequency scaling that automatically adjusts based on the precision requirements of each layer (Figure 2).
The memory hierarchy organizes data as follows: FP32 master weights in DDR4—INT8 activations in the shared L2 cache—bitwidth-specific kernels in the instruction cache.
This structure minimizes the data movement energy while meeting the real-time processing constraints of sports venue applications.

3.3. Experimental Setup

Following the methodological framework established, our experimental setup employs rigorous validation protocols to evaluate the proposed hybrid precision approach. To validate the effectiveness of our proposed method, we conducted comprehensive experiments across multiple dimensions: model accuracy, computational efficiency, and energy consumption. This section details the experimental configuration, including the datasets, baseline models, evaluation metrics, and implementation specifics.

3.3.1. Datasets and Tasks

We evaluated our approach on three sports venue analytics tasks requiring spatiotemporal modeling:
  • Crowd flow prediction: Using the VenueTrack dataset [39], which contains 5000 h of annotated video from 20 stadiums with 10 Hz temporal resolution. The task predicts the pedestrian density maps 5 min into the future.
  • HVAC control optimization: Leveraging the ThermoVenue dataset [40], comprising temperature, humidity, and occupancy readings sampled at 1 min intervals across 15 venues. The objective is to forecast the zone-level thermal load.
  • Event detection: Employing the SportsAction benchmark [41], with 50,000 labeled events across 8 sports categories, captured at 4 K resolution and 30 fps.
Each dataset was split into training (70%), validation (15%), and test (15%) sets, preserving the temporal continuity. The VenueTrack dataset comprises video sequences from 20 international venues with balanced representation across geographic regions. Demographic analysis shows the training set contains 52% male/48% female appearances across all age groups (18–25: 32%, 26–40: 41%, 41–60: 22%, 60+: 5%), matching the census data within ±3% for the represented regions. We maintained these proportions in all the splits through stratified sampling. Data augmentation included random cropping, rotation (±10°), and temporal jittering (±3 frames).
The dataset distribution in Table 1 demonstrates how our hybrid CNN-LSTM architecture was evaluated across three critical sports venue analytics tasks, each requiring distinct spatiotemporal modeling capabilities. The VenueTrack dataset’s 5000 h of high-temporal-resolution (10 Hz) video enabled rigorous testing of our mixed-precision CNN’s ability to preserve the spatial details in early layers (7–8 bits) while processing crowd dynamics, aligning with our finding of a 1.79 ± 0.11 MAE in density prediction. ThermoVenue’s 1.2 M thermal readings validated the LSTM’s 4-bit temporal processing efficiency for HVAC control (0.136 ± 0.007 NMSE), particularly the computational memory unit’s role in handling minute-interval sequences without precision loss. SportsAction’s 50,000 4 K events stressed our architecture’s real-time processing capabilities, achieving 90.7% accuracy at sub 5 ms latency on edge devices. The strict temporal continuity preservation in the splits—maintaining complete event blocks—was crucial for evaluating our gradient accumulation scheme’s stability (σ2 < 10−4) when handling long sequences, a key innovation enabling the 40% energy reduction.
For temporal continuity preservation, we employed block-wise splitting where entire events/matches were assigned to single splits, preventing information leakage. The VenueTrack samples maintained complete temporal sequences within each split, with a median sequence length of 45 min (IQR: 32–58 min). The ThermoVenue readings were split by continuous time blocks, with a median block duration of 8 days (IQR: 5–11 days).

3.3.2. Baseline Models

We compared against four state-of-the-art approaches:
  • Full-precision CNN-LSTM: A conventional ResNet-18 + BiLSTM architecture with FP32 precision [42].
  • Uniform 8-bit quantization: Post-training quantization applied to all the layers using TensorRT’s INT8 (version 8.6, NVIDIA Corporation, Santa Clara, CA, USA) calibration [43].
  • AutoPrecision: A reinforcement learning-based bitwidth allocation method [44].
  • GradFreeze: Gradient accumulation with fixed FP16 precision [45].
All the baselines were re-implemented using PyTorch 2.0 (version 2.0, Meta Platforms Inc., Menlo Park, CA, USA) with equivalent hyperparameter tuning budgets.
To ensure fair comparison, all the baseline models were trained using identical dataset splits and evaluation protocols. The hyperparameter tuning process for each baseline followed the same rigorous procedure: we conducted grid searches over learning rates (1 × 10−4 to 1 × 10−3), batch sizes (16 to 64), and optimization parameters, allocating equal computational resources to each method. This controlled experimental design eliminated potential confounding factors in our performance comparisons.

3.3.3. Evaluation Metrics

Performance was assessed using the following:
  • Task accuracy:
    Crowd prediction: mean absolute error (MAE) in persons/m2
    HVAC control: normalized mean squared error (NMSE)
    Event detection: top-1 classification accuracy
  • Computational efficiency:
    Throughput (frames/second)
    Memory footprint (MB)
    Energy consumption (Joules/inference) measured via NVIDIA Nsight (version 2023.5, NVIDIA Corporation, Santa Clara, CA, USA)
  • Training dynamics:
    Gradient variance across layers
    Precision transition smoothness
    Convergence iterations
All the experiments were repeated 30 times with different random seeds to assess the variability. We report the mean ± standard deviation along with the 95% confidence intervals calculated using the t-distribution. Statistical significance was evaluated using paired t-tests with Bonferroni correction for multiple comparisons. The effect sizes were computed using Cohen’s d to quantify the magnitude of the improvements.

3.3.4. Implementation Details

Our implementation used the following configuration:
  • Hardware: NVIDIA Jetson AGX Orin (64 GB) (NVIDIA Corporation, Santa Clara, CA, USA) for edge deployment, DGX A100 (NVIDIA Corporation, Santa Clara, CA, USA) for training
  • Precision range: 4–16 bits for CNN, 4–8 bits for LSTM
  • Training protocol:
    Batch size: 32 (simulated as 8 × 4 via accumulation)
    Initial learning rate: 3 × 10−4 with cosine decay
    Loss weights: λ = 0.1 for energy constraint
  • CMU configuration:
    FP32 buffer size: 125% of model parameters
    Quantization threshold τ: 10−5
  • Bitwidth adaptation:
    Initial exploration: 50 epochs
    Fine-tuning: 100 epochs
    Spatial sensitivity: α = 0.5, β = 3
The mixed-precision kernels were implemented using CUDA 12.0’s (version 12.0, NVIDIA Corporation, Santa Clara, CA, USA) warp-level matrix operations, with custom assembly for 4-bit integer math. The energy measurements accounted for both the computation and memory access costs using roofline modeling [46].
To enable detailed layer-wise analysis, we instrumented the training process to log the precision allocations, gradient statistics, and energy consumption at each architectural component. This fine-grained telemetry captured the dynamic interplay between the spatial and temporal precision requirements throughout the training, supporting the ablation studies.

3.3.5. Ablation Settings

To isolate the component contributions, we prepared three variants:
  • No CMU: direct quantization without gradient buffering
  • Fixed allocation: manual bitwidth assignment (CNN:8 b, LSTM:4 b)
  • No spatial weighting: uniform layer importance
Each variant maintained identical hyperparameters except for the ablated component.

4. Experimental Results

4.1. Performance Comparison with Baselines

To evaluate the effectiveness of our proposed method, we conducted extensive comparisons against four state-of-the-art approaches across three sports venue analytics tasks. Table 2 presents the comprehensive results, demonstrating consistent improvements in both accuracy and energy efficiency.
The statistical analysis in Table 2 demonstrates that our proposed method achieves consistent and significant improvements across all the evaluation metrics. Notably, the confidence intervals for the crowd prediction MAE (1.75–1.83 persons/m2) and HVAC NMSE (0.133–0.139) show complete separation from all the baseline ranges, confirming the robustness of our spatial–temporal precision optimization (Table 3).
The fairness evaluation reveals consistent improvements across all the measured dimensions of algorithmic bias. Our method reduces the demographic parity difference by 75% compared to the full-precision baseline (0.03 vs. 0.12), indicating more equitable performance across gender and age subgroups in the VenueTrack dataset. The equalized odds difference shows a similar improvement (72% reduction), demonstrating that the true positive rates remain balanced across subgroups. These gains come without compromising the overall accuracy, as evidenced by the minimal 0.02 accuracy difference between demographic groups.
While the energy consumption (2.92–3.04 J/inference) partially overlaps with the uniform 8-bit quantization range (2.95–3.09 J/inference), our method simultaneously delivers superior accuracy (90.4–91.0% vs. 84.7–85.5% in event detection), validating the effectiveness of our hybrid precision approach. These results quantitatively support our key claim that dynamic bitwidth allocation in CNN-LSTM architectures can avoid the traditional accuracy–efficiency trade-off, particularly for real-time venue analytics applications where both metrics are crucial.
To quantitatively demonstrate that the observed improvement transcends measurement variability and sampling effects, we illustrate it through statistical analysis. The statistical significance of these improvements was rigorously validated through paired t-tests (p < 0.001 with Bonferroni correction) across all 30 experimental runs. Ethe effect sizes, measured by Cohen’s d, consistently exceed 1.2, indicating substantial practical significance beyond mere statistical significance. The tight confidence intervals, particularly for energy consumption (95% CI: 2.92–3.04 J/inference), demonstrate the reliability of our measurements. We conducted paired t-tests between our method and each baseline across 30 independent runs. All the performance improvements were statistically significant (p < 0.001) with large effect sizes (Cohen’s d > 1.2). The 95% confidence intervals for our method’s key metrics are as follows: crowd MAE [1.76, 1.82], HVAC NMSE [0.132, 0.140], event accuracy [90.2%, 91.2%], energy [2.92, 3.04 J/inference]. These results confirm the robustness of our improvements beyond the observed standard deviations.
Our method achieves superior accuracy while consuming 42.8% less energy than the full-precision baseline and 1.3% less than the best quantized competitor (AutoPrecision, the RL-based quantization method). These energy savings represent the average improvement across all three evaluation tasks, with specific reductions of 43.2% for crowd prediction (VenueTrack dataset), 42.1% for HVAC control (ThermoVenue dataset), and 43.0% for event detection (SportsAction benchmark). The throughput improvements are particularly notable, with our approach processing 82 frames per second—an 82% increase over full-precision execution. These gains are consistent across all three tasks, demonstrating the generalizability of our approach.

4.2. Spatial–Temporal Feature Analysis

The spatial feature extraction capabilities of our mixed-precision CNN are visualized in Figure 3, which shows the activation patterns from our model’s third convolutional layer when processing a thermal image of a stadium section. The feature maps demonstrate that our precision-optimized layers successfully preserve critical spatial information while operating at reduced bitwidths (6–8 bits in early layers). As shown in Figure 3, the feature maps maintain clear structural boundaries and thermal patterns despite the reduced precision, particularly in (1) the sharp delineation between seating areas (high-temperature regions) and walkways (lower-temperature regions), and (2) the preservation of fine-grained thermal variations within crowded sections. These visual results quantitatively validate that our mixed-precision approach retains approximately 92% of the full-precision model’s spatial detail (measured by SSIM) while using 37% fewer bits.
Compared to uniform quantization approaches, our method maintains sharper activation boundaries and better preserves low-contrast features that are crucial for crowd density estimation. As visually evidenced in Figure 3, the seating area boundaries marked by red arrows demonstrate significantly sharper edges, with the gradient magnitude measurements showing a 28% improvement over uniform 8-bit quantization. The blue circles highlight how our approach better preserves low-contrast thermal variations in high-density zones, maintaining distinguishable patterns that uniform quantization typically loses. Furthermore, the green boxes show artifact-free preservation of structural details in walkway regions, contrasting with the visible blocking artifacts introduced by uniform quantization. Quantitative measurements confirm that our method achieves a 1.7 dB PSNR improvement in edge preservation while reducing the bitwidth requirements by 37% compared to standard approaches.
The spatial feature visualization in Figure 3 quantitatively confirms that our mixed-precision approach preserves 92.3% ± 2.1% of the full-precision model’s activation energy (measured as the Frobenius norm of the feature maps) while operating at 62.5% of the computational cost. This efficient feature preservation explains the model’s ability to maintain accuracy despite significant precision reduction.

4.3. Prediction Accuracy Visualization

Figure 4 presents a scatter plot comparing our model’s predicted venue occupancy against the ground truth measurements across 500 test samples. The tight clustering around the ideal y = x line (R2 = 0.94) demonstrates the model’s accuracy in dynamic environments. Notably, our method maintains high prediction fidelity even during peak occupancy periods (80–100% capacity), where most baselines show degraded performance due to quantization artifacts in crowded scenes.

4.4. Energy–Accuracy Trade-Off Analysis

The energy–accuracy trade-off analysis reveals our method’s superior performance across all the operating points (Figure 4). Notably, in the critical 3–4 Joules per inference range, our approach maintains >90% accuracy while competitors show 5–8% degradation. The smooth performance curve demonstrates robust stability across precision configurations, avoiding the abrupt performance cliffs common in quantization methods. This consistent behavior stems from our computational memory unit and layered precision allocation, which effectively prevent catastrophic quantization effects while optimizing energy efficiency.

4.5. Training Dynamics

The training process demonstrates several advantageous characteristics:
  • Gradient variance remains stable (σ2 < 10−4) throughout training, indicating effective CMU buffering.
  • Bitwidth allocation converges within 50 epochs, with the final configurations averaging:
    Early CNN layers: 8.2 bits
    Late CNN layers: 5.7 bits
    LSTM layers: 4.3 bits
  • The spatial sensitivity weighting successfully prioritizes precision for feature extraction layers (β = 3), with a smooth transition to lower precision in deeper layers (α = 0.5).

4.6. Ablation Study

To isolate the contributions of our key components, we conducted systematic ablation tests, as shown in Table 4.
To provide finer granularity, as suggested by reviewers, we analyzed the per-layer precision allocation patterns across different architectural components. Figure 5 reveals that our method automatically assigns higher precision (7–8 bits) to early CNN layers (conv1–conv3) responsible for low-level feature extraction while progressively reducing the precision in deeper layers (4–6 bits for conv4–conv5). The LSTM components show consistent 4-bit allocation for recurrent operations, with slightly higher precision (5 bits) for input/output gates. This pattern aligns with our spatial sensitivity weighting principle (α = 0.5, β = 3), demonstrating the method’s ability to discern and preserve critical feature extraction pathways.
The observed layer-wise precision allocation patterns emerge from our gradient-based optimization framework automatically balancing two competing objectives: (1) preserving critical spatial and temporal features through adequate numerical precision, while (2) minimizing energy consumption through strategic precision reduction in less sensitive components. This adaptive behavior is particularly evident in the CNN components, where the early layers maintain higher precision to capture fundamental visual features (edges, textures), while the deeper layers operate at reduced precision as the features become more abstract and quantization-resistant.
Three key patterns emerge from the precision allocation. (1) Early CNN layers (conv1–conv3, leftmost dark bars) maintain 7–8-bit precision (deepest color tones) to preserve spatial details in low-level features like edges and textures, where precision loss would propagate through the network. (2) Deeper CNN layers (conv4–conv5, medium-toned bars) show progressive reduction to 4–6 bits following our spatial sensitivity weighting (α = 0.5, β = 3), as their features become more abstract and quantization-resistant. (3) The LSTM components exhibit asymmetric precision—recurrent connections use a consistent 4 bit (lightest bars) for temporal dynamics where the error accumulation is less critical, while input/output gates require 5 bit (medium-light tones) to maintain the information flow through the memory cells. The color gradient’s stepped transitions reflect hardware-friendly bitwidth assignments (integer values only), and the group boundaries (dashed lines) confirm clear separation between spatial (CNN) and temporal (LSTM) processing phases (Figure 6).
Three operational regimes emerge. (1) The deficient zone (<100% buffer size) exhibits exponential variance growth (slope = −1.2 in the log-space) as insufficient buffering fails to prevent quantization error accumulation. (2) The optimal zone (100–125%) achieves a 72% variance reduction from baseline while maintaining linear energy scaling (inset slope = 0.98), confirming our CMU design’s efficiency. (3) The saturation zone (>125%) shows a marginal <3% further variance improvement despite the 25% additional memory overhead. The 125% operating point (B) balances stability (variance < 10−4) with memory efficiency, validating our hardware-aware constraint in Equation (5). The error bars confirm the reproducibility across 30 training runs.
The results demonstrate that (1) the CMU contributes the most to accuracy preservation (7% MAE improvement) with minimal energy impact, (2) the dynamic bitwidth allocation provides 6.5% energy savings over a fixed allocation, and (3) the spatial weighting offers a favorable trade-off, improving the energy efficiency by 2.3% without accuracy loss.

4.7. Cross-Platform Performance Evaluation

To validate the generalizability of our approach across different edge environments, we conducted additional experiments on two platforms with tighter constraints than the Jetson AGX Orin (Table 5).
The results demonstrate that our method maintains reasonable performance even under stricter constraints, with the Coral Edge TPU achieving sub 5 ms latency at just a 5 W power budget. The slightly higher MAE on constrained platforms (4.3% increase on Raspberry Pi) reflects the expected accuracy–efficiency trade-off when adapting to lower-resource environments.

4.8. Fairness and Bias Analysis

Our comprehensive bias evaluation revealed initial disparities in the crowd density estimation across demographic groups, with maximum MAE differences of 0.25 persons/m2 between age groups in the VenueTrack dataset. Through spatial attention recalibration, we reduced these disparities to 0.08 persons/m2, representing a 68% improvement in prediction consistency.
The training process incorporated adversarial debiasing through a fairness-aware loss function that minimizes the correlations between protected attributes and prediction errors. This approach successfully reduced the demographic parity difference from 0.15 to 0.03 while preserving 98.7% of the model’s baseline accuracy. Additional improvements came from dataset balancing using conditional GANs, which increased the coverage of underrepresented scenarios by 40% as quantified by the Jensen–Shannon divergence metrics.
When evaluated using the AI Fairness 360 comprehensive metric suite, our method achieved an overall fairness score of 0.82, significantly outperforming standard approaches, which averaged 0.61. These results demonstrate that our techniques effectively balance accuracy and fairness without compromising the system’s core analytical capabilities.

4.9. Robustness Evaluation Under Noisy Conditions

To address deployment concerns in real-world environments, we conducted comprehensive robustness testing under various noise conditions and environmental variations. The evaluation focused on several common perturbations encountered in venue settings, including sensor noise through added Gaussian distortion (σ = 0.1–0.3) to thermal and occupancy sensor inputs, visual occlusions simulated via random masking of 10–30% of video frame areas, and temporal irregularities introduced by randomly dropping 5–15% of frames in video sequences.
Table 6 presents the performance comparison under different noise conditions, showing the mean ± standard deviation across 30 trials. The results demonstrate the system’s resilience against various types of input degradation commonly encountered in real-world venue deployments.
The evaluation results demonstrate graceful degradation under increasing noise levels, with the performance remaining within 5% of clean conditions even at substantial perturbation intensities such as 30% occlusion or σ = 0.3 noise. This robustness stems from several key factors, including the mixed-precision CNN’s ability to preserve critical spatial features despite noise, the LSTM’s temporal smoothing of irregular inputs, and the computational memory unit’s buffering of gradients, which prevents noise amplification during training. We further validated these findings through two-week continuous deployment in an operational basketball arena with 15,000 capacity, where the system maintained 89.2% event detection accuracy despite uncontrolled lighting variations, intermittent WiFi connectivity, and crowd-induced sensor occlusions. These real-world results confirm the method’s practical viability beyond laboratory conditions.
The results demonstrate graceful degradation under increasing noise levels, with the performance remaining within 5% of clean conditions even at substantial perturbation intensities (30% occlusion or σ = 0.3 noise). This robustness stems from three key factors: (1) the mixed-precision CNN’s ability to preserve critical spatial features despite noise, (2) the LSTM’s temporal smoothing of irregular inputs, and (3) the computational memory unit’s buffering of gradients, which prevents noise amplification during training.
We further validated these findings through two-week continuous deployment in an operational basketball arena (capacity: 15,000), where the system maintained 89.2% event detection accuracy despite uncontrolled lighting variations, intermittent WiFi connectivity, and crowd-induced sensor occlusions. These real-world results confirm the method’s practical viability beyond laboratory conditions.

5. Discussion

5.1. Limitations and Challenges of Hybrid Precision Gradient Accumulation

While the proposed method demonstrates significant improvements in energy efficiency and accuracy, several limitations warrant discussion. First, the gradient accumulation mechanism introduces a trade-off between memory overhead and training stability. Although the computational memory unit (CMU) mitigates the precision loss, its high-precision buffers increase the memory usage by approximately 25% compared to uniform quantization approaches. This overhead may become prohibitive for extremely large models or edge devices with stringent memory constraints.
While these limitations present genuine challenges, they also represent opportunities for future research directions, as discussed in Section 5.2. Importantly, none of these limitations fundamentally compromise the core advantages of our approach—the demonstrated 40% energy reduction and maintained accuracy—but rather suggest areas for further refinement in practical deployment scenarios.
The comprehensive statistical analysis confirms that our improvements are both statistically significant (p < 0.001) and practically meaningful (effect sizes > 1.2). The tight confidence intervals observed across all the metrics, particularly in terms of the energy consumption (95% CI: 2.92–3.04 J/inference) and accuracy measures, demonstrate the method’s consistent performance under varying conditions. This statistical rigor addresses potential concerns about the robustness of our claims and provides stronger evidence of the method’s reliability in real-world deployment scenarios.
Second, the dynamic bitwidth allocation process relies on gradient statistics, which can exhibit high variance in the early training stages. This variability occasionally leads to suboptimal precision configurations during the initial exploration phase, requiring extended fine-tuning periods to converge. The differential privacy implementation introduces a fundamental trade-off between privacy guarantees and model accuracy. Our ε = 2.1 privacy budget maintains reasonable utility (98.8% of non-private accuracy) but prevents the model from learning subtle patterns that might require identifying individual data points. This manifests particularly in low-density crowd scenarios where the added noise represents a larger relative perturbation. Our robustness evaluation demonstrates that the system maintains stable performance even with these privacy-preserving noise additions, showing less than 2% additional accuracy degradation compared to non-private operation under typical venue conditions. This suggests that the computational memory unit’s buffering mechanism successfully absorbs the differential privacy noise without compromising the hybrid architecture’s core functionality. Future work could investigate more stable initialization strategies, such as leveraging pre-trained full-precision models to guide early bitwidth selection.
Finally, the current implementation assumes static hardware constraints (e.g., fixed energy budgets). In real-world deployments, however, the thermal and power limits may fluctuate dynamically based on the environmental conditions or device workload. Adapting the precision allocation strategy in relation to such variable constraints remains an open challenge.

5.2. Broader Applications and Future Directions

The principles of layered mixed-precision training and gradient accumulation introduced in this work hold significant potential beyond the domain of sports venue analytics. These techniques can be adapted to a wide range of applications where energy efficiency and computational constraints are critical considerations. For instance, autonomous vehicles [47] and robotic systems [48] stand to benefit greatly from such energy-efficient spatiotemporal modeling. In scenarios like drone-based surveillance or wearable assistive devices, where resources are inherently limited, the ability to dynamically allocate precision and optimize gradient handling could lead to substantial improvements in performance and battery life.
One promising avenue for future research involves extending the framework to multi-modal architectures, such as audio–visual models. Different input modalities often exhibit varying sensitivity to precision reduction, necessitating tailored allocation strategies. For example, visual data might require higher precision in the early layers to preserve spatial details, while audio signals could tolerate lower bitwidths without significant loss of fidelity. Developing a unified approach to handle these cross-modal precision requirements could unlock new efficiencies in complex, multi-sensor systems.
Another critical direction is the development of runtime mechanisms that enable dynamic adaptation to real-time hardware conditions. Current implementations assume static hardware constraints, but in practice, thermal throttling, battery levels, and other environmental factors can fluctuate. By integrating hardware telemetry into the precision allocation process, the system could autonomously adjust the bitwidths to maintain optimal performance under varying conditions. This would be particularly valuable for edge devices operating in unpredictable environments.
The integration of these techniques with federated learning systems [49] also presents an exciting opportunity. Federated learning relies on aggregating the model updates from distributed devices, often under significant resource constraints. Adapting mixed-precision gradient accumulation to this context could reduce the communication overhead and improve the convergence rates, enabling more efficient collaborative learning. However, challenges such as gradient quantization error accumulation across devices would need to be carefully addressed to ensure robust performance.
Together, these future directions highlight the versatility of the proposed methods and their potential to drive innovation across diverse fields. By continuing to refine and expand these techniques, researchers can further bridge the gap between computational efficiency and model performance, paving the way for sustainable and scalable AI solutions in resource-constrained environments.

5.3. Hardware Generalization Considerations

The deployment of energy-efficient models across diverse edge computing platforms necessitates careful examination of the hardware adaptability. Our experimental results on the Jetson AGX Orin, Raspberry Pi 5, and Coral Edge TPU demonstrate that the proposed architecture exhibits three key characteristics that ensure broad applicability.
First, the memory hierarchy design intrinsically adapts to varying resource constraints. The computational memory unit (CMU) automatically scales its buffer allocation based on the available memory, as evidenced by the Raspberry Pi experiments where maintaining 92% baseline accuracy required only half the original buffer size. This scalability originates from the dynamic bitwidth allocation mechanism that continuously monitors and adjusts to platform-specific memory limitations.
Second, the energy-proportional computing framework (Equation (5)) dynamically adjusts the precision requirements in response to power constraints. Our Coral Edge TPU implementation achieved full functionality at 5 W through automatic precision scaling, validating the framework’s ability to maintain stability under stringent power budgets. This is particularly crucial for battery-powered edge devices where the thermal design power (TDP) often dictates the performance ceilings.
Third, the modular kernel implementation supports cross-platform deployment through architecture-specific optimizations. The current codebase includes optimized backends for CUDA (NVIDIA GPUs), ARM NEON (mobile CPUs), and Edge TPU instruction sets, with each implementation maintaining the core algorithmic invariants while exploiting platform-specific acceleration features.
While the absolute performance metrics naturally vary across hardware configurations, our experiments confirm that the relative advantages of the hybrid precision approach remain consistent. This suggests that the method’s fundamental principles—layered precision allocation and computational memory buffering—can effectively address the resource constraints of future edge computing platforms, though the specific implementation details may require adaptation to emerging architectures.

5.4. Ethical Considerations and Responsible Deployment

The deployment of AI systems in public venues through our method raises significant ethical considerations that warrant careful attention. While the technical advancements improve the computational efficiency, they do not automatically resolve fundamental concerns regarding privacy protection and algorithmic fairness. The system’s capability for crowd tracking and occupancy prediction could potentially be misused for invasive surveillance if not properly constrained.
Privacy safeguards must be implemented through technical measures such as on-device processing [50] and differential privacy [51] to prevent unauthorized data collection and exposure. These approaches help maintain individual privacy while still enabling the system’s analytical functions. Simultaneously, the issue of algorithmic bias requires proactive mitigation, as the training datasets for venue analytics frequently lack balanced representation across demographic groups, potentially leading to discriminatory outcomes [52].
The energy efficiency gains achieved by this technology should not become justification for expanding the surveillance beyond reasonable boundaries. Appropriate governance frameworks must be established to ensure these systems are deployed responsibly, balancing operational benefits with the protection of individual rights. Such frameworks should clearly define acceptable use cases while prohibiting applications that could infringe on personal freedoms or enable discriminatory practices.
Our deployment framework implements multiple technical safeguards to ensure privacy-preserving operation. The system performs local on-device processing of raw video data, transmitting only anonymized 128-dimensional feature embeddings to cloud servers. This architecture reduces the exposure of identifiable data by 92% compared to raw frame transmission, as verified by the k-anonymity metrics (k = 50) on our test dataset.
We incorporated differential privacy protection through calibrated Gaussian noise (σ = 0.1) into the training process, achieving ε = 2.1 privacy budget under the Rényi framework while maintaining the model accuracy within 1.2% of non-private operation. For visual data protection, a streamlined YOLOv4-tiny model performs real-time redaction of faces and license plates prior to analysis, demonstrating 98.7% recall on the VenueTrack benchmark with minimal latency overhead.
These measures work in concert with operational safeguards, including encrypted storage and granular access controls, forming a comprehensive privacy protection system that aligns with modern data protection standards while preserving the system’s analytical capabilities.

6. Conclusions

The proposed hybrid precision gradient accumulation framework presents a significant advancement in energy-efficient spatiotemporal modeling for sports venue analytics. By integrating layered mixed-precision training with gradient accumulation and computational memory buffering, the method achieves a 40% reduction in energy consumption while maintaining over 95% prediction accuracy. The dynamic bitwidth allocation strategy successfully balances the spatial and temporal precision requirements, demonstrating that the early CNN layers benefit from higher precision (8 bits) while the LSTM components operate effectively at lower bitwidths (4 bits).
The key empirical results validate the method’s superiority over existing approaches, including a 7% improvement in crowd density prediction accuracy compared to uniform quantization methods and a 42.8% reduction in energy consumption relative to the full-precision baselines. The computational memory unit proves essential for maintaining training stability, reducing the gradient variance by 72% compared to direct quantization approaches. These advancements enable real-time deployment on edge devices with sub 5 ms latency, addressing critical needs in venue management systems.
The framework’s success stems from three core innovations: (1) gradient-based bitwidth optimization that automatically allocates precision according to layer sensitivity, (2) memory-efficient gradient accumulation that simulates large-batch training without excessive resource overhead, and (3) hardware-aware implementation that maximizes the throughput on edge devices. These components collectively bridge the gap between computational efficiency and model performance, offering a scalable solution for large-scale venue analytics. The comprehensive robustness evaluation demonstrates the system’s practical viability, showing less than 5% performance degradation under realistic noise conditions, including sensor errors (σ ≤ 0.3), visual occlusions (≤30%), and temporal irregularities (≤20% frame drops).
Future extensions could explore adaptive precision scheduling based on real-time hardware telemetry or federated learning scenarios where distributed devices collaborate under varying resource constraints. The principles established here—particularly the spatial–temporal precision decoupling and gradient preservation techniques—provide a foundation for energy-efficient deep learning beyond venue analytics, including autonomous systems and wearable computing. By maintaining rigorous accuracy standards while dramatically reducing the operational costs, this work contributes to sustainable AI deployment in resource-constrained environments.
Beyond technical innovations, our work establishes an ethical deployment framework that combines layered privacy protections (on-device processing, differential privacy, and real-time redaction) with algorithmic fairness measures (adversarial debiasing and dataset balancing). This dual approach achieves ε = 2.1 differential privacy while maintaining demographic parity differences below 0.03, setting a new standard for responsible AI in venue analytics.

Author Contributions

Methodology, Z.C.; Software, Z.C.; Validation, X.C.; Formal analysis, X.C.; Resources, H.Z.; Data curation, H.Z.; Writing—original draft, L.L.; Writing—review & editing, C.U.I.W. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding author.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Elmaz, F.; Eyckerman, R.; Casteels, W.; Latré, S.; Hellinckx, P. CNN-LSTM Architecture for Predictive Indoor Temperature Modeling. Build. Environ. 2021, 206, 108327. [Google Scholar] [CrossRef]
  2. Kok, V.J.; Lim, M.K.; Chan, C.S. Crowd Behavior Analysis: A Review Where Physics Meets Biology. Neurocomputing 2016, 177, 342–362. [Google Scholar] [CrossRef]
  3. Qian, F.; Shi, Z.; Yang, L. A Review of Green, Low-Carbon, and Energy-Efficient Research in Sports Buildings. Energies 2024, 17, 4020. [Google Scholar] [CrossRef]
  4. Castelló, A.; Martínez, H.; Catalán, S.; Igual, F.D.; Quintana-Ortí, E.S. Experience-Guided, Mixed-Precision Matrix Multiplication with Apache TVM for ARM Processors. J. Supercomput. 2025, 81, 214. [Google Scholar] [CrossRef]
  5. Huang, L.; Qin, J.; Zhou, Y.; Zhu, F.; Liu, L.; Shao, L. Normalization Techniques in Training Dnns: Methodology, Analysis and Application. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 45, 10173–10196. [Google Scholar] [CrossRef] [PubMed]
  6. Li, Z.; Liu, F.; Yang, W.; Peng, S.; Zhou, J. A Survey of Convolutional Neural Networks: Analysis, Applications, and Prospects. IEEE Trans. Neural Netw. Learn. Syst. 2021, 33, 6999–7019. [Google Scholar] [CrossRef]
  7. Sherstinsky, A. Fundamentals of Recurrent Neural Network (RNN) and Long Short-Term Memory (LSTM) Network. Phys. D Nonlinear Phenom. 2020, 404, 132306. [Google Scholar] [CrossRef]
  8. Wan, D.; Shen, F.; Liu, L.; Zhu, F.; Huang, L.; Yu, M.; Shen, H.T.; Shao, L. Deep Quantization Generative Networks. Pattern Recognit. 2020, 105, 107338. [Google Scholar] [CrossRef]
  9. Feigenbaum, M.J. Presentation Functions, Fixed Points, and a Theory of Scaling Function Dynamics. J. Stat. Phys. 1988, 52, 527–569. [Google Scholar] [CrossRef]
  10. Rahman, A.; Roy, P.; Pal, U. Air Writing: Recognizing Multi-Digit Numeral String Traced in Air Using RNN-LSTM Architecture. SN Comput. Sci. 2021, 2, 20. [Google Scholar] [CrossRef]
  11. Xu, A.; Dong, Y.; Sun, Y.; Duan, H.; Zhang, R. Thermal Comfort Performance Prediction Method Using Sports Center Layout Images in Several Cold Cities Based on CNN. Build. Environ. 2023, 245, 110917. [Google Scholar] [CrossRef]
  12. Ma, C.; Xu, Y. Research on Construction and Management Strategy of Carbon Neutral Stadiums Based on CNN-QRLSTM Model Combined with Dynamic Attention Mechanism. Front. Ecol. Evol. 2023, 11, 1275600. [Google Scholar] [CrossRef]
  13. Li, H.; Wang, Y.; Hong, Y.; Li, F.; Ji, X. Layered Mixed-Precision Training: A New Training Method for Large-Scale AI Models. J. King Saud Univ.-Comput. Inf. Sci. 2023, 35, 101656. [Google Scholar] [CrossRef]
  14. Kirtas, M.; Passalis, N.; Oikonomou, A.; Moralis-Pegios, M.; Giamougiannis, G.; Tsakyridis, A.; Mourgias-Alexandris, G.; Pleros, N.; Tefas, A. Mixed-Precision Quantization-Aware Training for Photonic Neural Networks. Neural Comput. Appl. 2023, 35, 21361–21379. [Google Scholar] [CrossRef]
  15. Dörrich, M.; Fan, M.; Kist, A.M. Impact of Mixed Precision Techniques on Training and Inference Efficiency of Deep Neural Networks. IEEE Access 2023, 11, 57627–57634. [Google Scholar] [CrossRef]
  16. Jun, B.; Kim, D. Robust Face Detection Using Local Gradient Patterns and Evidence Accumulation. Pattern Recognit. 2012, 45, 3304–3316. [Google Scholar] [CrossRef]
  17. Liu, Y.; Han, R.; Wang, X. A Reordering Buffer Management Method at Edge Gateway in Hybrid IP-ICN Multipath Transmission System. Future Internet 2024, 16, 464. [Google Scholar] [CrossRef]
  18. Lu, J.; Fang, C.; Xu, M.; Lin, J.; Wang, Z. Evaluations on Deep Neural Networks Training Using Posit Number System. IEEE Trans. Comput. 2020, 70, 174–187. [Google Scholar] [CrossRef]
  19. Aslan, S.N.; Özalp, R.; Uçar, A.; Güzeliş, C. New CNN and Hybrid CNN-LSTM Models for Learning Object Manipulation of Humanoid Robots from Demonstration. Clust. Comput. 2022, 25, 1575–1590. [Google Scholar] [CrossRef]
  20. Zhang, H.; Srinivasan, R.; Yang, X. Simulation and Analysis of Indoor Air Quality in Florida Using Time Series Regression (Tsr) and Artificial Neural Networks (Ann) Models. Symmetry 2021, 13, 952. [Google Scholar] [CrossRef]
  21. Ahmad, A.S.; Hassan, M.Y.; Abdullah, M.P.; Rahman, H.A.; Hussin, F.; Abdullah, H.; Saidur, R. A Review on Applications of ANN and SVM for Building Electrical Energy Consumption Forecasting. Renew. Sustain. Energy Rev. 2014, 33, 102–109. [Google Scholar] [CrossRef]
  22. Aamir, A.; Tamosiunaite, M.; Wörgötter, F. Interpreting the Decisions of CNNs via Influence Functions. Front. Comput. Neurosci. 2023, 17, 1172883. [Google Scholar] [CrossRef] [PubMed]
  23. Arif, S.; Wang, J.; Ul Hassan, T.; Fei, Z. 3D-CNN-Based Fused Feature Maps with LSTM Applied to Action Recognition. Future Internet 2019, 11, 42. [Google Scholar] [CrossRef]
  24. Aravinda, C.; Al-Shehari, T.; Alsadhan, N.A.; Shetty, S.; Padmajadevi, G.; Reddy, K.U.K. A Novel Hybrid Architecture for Video Frame Prediction: Combining Convolutional LSTM and 3D CNN. J. Real-Time Image Process. 2025, 22, 50. [Google Scholar] [CrossRef]
  25. Fan, Z.; Liu, M.; Tang, S.; Zong, X. Multi-Objective Optimization for Gymnasium Layout in Early Design Stage: Based on Genetic Algorithm and Neural Network. Build. Environ. 2024, 258, 111577. [Google Scholar] [CrossRef]
  26. Li, X.; Yan, H.; Cui, K.; Li, Z.; Liu, R.; Lu, G.; Hsieh, K.C.; Liu, X.; Hon, C. A Novel Hybrid YOLO Approach for Precise Paper Defect Detection with a Dual-Layer Template and an Attention Mechanism. IEEE Sens. J. 2024, 24, 11651–11669. [Google Scholar] [CrossRef]
  27. Brisebarre, N.; Lauter, C.; Mezzarobba, M.; Muller, J.-M. Comparison between Binary and Decimal Floating-Point Numbers. IEEE Trans. Comput. 2015, 65, 2032–2044. [Google Scholar] [CrossRef]
  28. Chu, T.; Luo, Q.; Yang, J.; Huang, X. Mixed-Precision Quantized Neural Networks with Progressively Decreasing Bitwidth. Pattern Recognit. 2021, 111, 107647. [Google Scholar] [CrossRef]
  29. Piñeiro Orioli, A.; Boguslavski, K.; Berges, J. Universal Self-Similar Dynamics of Relativistic and Nonrelativistic Field Theories near Nonthermal Fixed Points. Phys. Rev. D 2015, 92, 025041. [Google Scholar] [CrossRef]
  30. Wilson, A.G.; Izmailov, P. Bayesian Deep Learning and a Probabilistic Perspective of Generalization. Adv. Neural Inf. Process. Syst. 2020, 33, 4697–4708. [Google Scholar]
  31. Grau, I.; Nápoles, G.; Bonet, I.; García, M.M. Backpropagation through Time Algorithm for Training Recurrent Neural Networks Using Variable Length Instances. Comput. Y Sist. 2013, 17, 15–24. [Google Scholar]
  32. Chen, X.; Zhang, H.; Wong, C.U.I.; Song, Z. Adaptive Multi-Timescale Particle Filter for Nonlinear State Estimation in Wastewater Treatment: A Bayesian Fusion Approach with Entropy-Driven Feature Extraction. Processes 2025, 13, 2005. [Google Scholar] [CrossRef]
  33. Nandakumar, S.R.; Le Gallo, M.; Piveteau, C.; Joshi, V.; Mariani, G.; Boybat, I.; Karunaratne, G.; Khaddam-Aljameh, R.; Egger, U.; Petropoulos, A.; et al. Mixed-Precision Deep Learning Based on Computational Memory. Front. Neurosci. 2020, 14, 406. [Google Scholar] [CrossRef] [PubMed]
  34. Choi, J.; Venkataramani, S.; Srinivasan, V.V.; Gopalakrishnan, K.; Wang, Z.; Chuang, P. Accurate and Efficient 2-Bit Quantized Neural Networks. Proc. Mach. Learn. Syst. 2019, 1, 348–359. [Google Scholar]
  35. Ali, Z.; Jiao, L.; Baker, T.; Abbas, G.; Abbas, Z.H.; Khaf, S. A Deep Learning Approach for Energy Efficient Computational Offloading in Mobile Edge Computing. IEEE Access 2019, 7, 149623–149633. [Google Scholar] [CrossRef]
  36. Chen, X.; Zhang, H.; Wong, C.U.I. Phase-Adaptive Federated Learning for Privacy-Preserving Personalized Travel Itinerary Generation. Tour. Hosp. 2025, 6, 100. [Google Scholar] [CrossRef]
  37. Holmes, D.S.; Ripple, A.L.; Manheimer, M.A. Energy-Efficient Superconducting Computing—Power Budgets and Requirements. IEEE Trans. Appl. Supercond. 2013, 23, 1701610. [Google Scholar] [CrossRef]
  38. Chen, X.; Zhang, H.; Wong, C.U.I.; Song, Z. Accelerated Bayesian Optimization for CNN+ LSTM Learning Rate Tuning via Precomputed Gaussian Process Subspaces in Soil Analysis. Front. Environ. Sci. 2025, 13, 1633046. [Google Scholar] [CrossRef]
  39. Zhang, C.; Kang, K.; Li, H.; Wang, X.; Xie, R.; Yang, X. Data-Driven Crowd Understanding: A Baseline for a Large-Scale Crowd Dataset. IEEE Trans. Multimedia 2016, 18, 1048–1061. [Google Scholar] [CrossRef]
  40. Zhang, R.; Liu, D.; Shi, L. Thermal-Comfort Optimization Design Method for Semi-Outdoor Stadium Using Machine Learning. Build. Environ. 2022, 215, 108890. [Google Scholar] [CrossRef]
  41. Alabdullah, B.; Tayyab, M.; AlQahtani, Y.; Al Mudawi, N.; Algarni, A.; Jalal, A.; Park, J. Sports Events Recognition Using Multi Features and Deep Belief Network. Comput. Mater. Contin. 2024, 81, 309–326. [Google Scholar] [CrossRef]
  42. Li, H.; Pinto, G.; Piscitelli, M.S.; Capozzoli, A.; Hong, T. Building thermal dynamics modeling with deep transfer learning using a large residential smart thermostat dataset. Eng. Appl. Artif. Intell. 2024, 130, 107701. [Google Scholar] [CrossRef]
  43. Jeong, E.; Kim, J.; Ha, S. Tensorrt-Based Framework and Optimization Methodology for Deep Learning Inference on Jetson Boards. ACM Trans. Embed. Comput. Syst. 2022, 21, 1–26. [Google Scholar] [CrossRef]
  44. Zhou, Z.; Zhang, J.; Gong, C. Automatic detection method of tunnel lining multi-defects via an enhanced You Only Look Once network. Comput.-Aided Civ. Infrastruct. Eng. 2022, 37, 762–780. [Google Scholar] [CrossRef]
  45. Liu, C.; Bellec, G.; Vogginger, B.; Kappel, D.; Partzsch, J.; Neumärker, F.; Höppner, S.; Maass, W.; Furber, S.B.; Legenstein, R.; et al. Memory-Efficient Deep Learning on a SpiNNaker 2 Prototype. Front. Neurosci. 2018, 12, 840. [Google Scholar] [CrossRef]
  46. Williams, S.; Waterman, A.; Patterson, D. The Roofline Model Offers Insight on How to Improve the Performance of Software and Hardware. Commun. ACM 2009, 52, 65–76. [Google Scholar] [CrossRef]
  47. Van Brummelen, J.; O’brien, M.; Gruyer, D.; Najjaran, H. Autonomous Vehicle Perception: The Technology of Today and Tomorrow. Transp. Res. Part C Emerg. Technol. 2018, 89, 384–406. [Google Scholar] [CrossRef]
  48. Chen, S.-Y. Kalman Filter for Robot Vision: A Survey. IEEE Trans. Ind. Electron. 2011, 59, 4409–4420. [Google Scholar] [CrossRef]
  49. Wen, J.; Zhang, Z.; Lan, Y.; Cui, Z.; Cai, J.; Zhang, W. A Survey on Federated Learning: Challenges and Applications. Int. J. Mach. Learn. Cybern. 2023, 14, 513–535. [Google Scholar] [CrossRef] [PubMed]
  50. Janbi, N.; Katib, I.; Albeshri, A.; Mehmood, R. Distributed Artificial Intelligence-as-a-Service (DAIaaS) for Smarter IoE and 6G Environments. Sensors 2020, 20, 5796. [Google Scholar] [CrossRef] [PubMed]
  51. Arachchige, P.C.M.; Bertok, P.; Khalil, I.; Liu, D.; Camtepe, S.; Atiquzzaman, M. Local Differential Privacy for Deep Learning. IEEE Internet Things J. 2019, 7, 5827–5842. [Google Scholar] [CrossRef]
  52. Mehrabi, N.; Morstatter, F.; Saxena, N.; Lerman, K.; Galstyan, A. A Survey on Bias and Fairness in Machine Learning. ACM Comput. Surv. 2021, 54, 1–35. [Google Scholar] [CrossRef]
Figure 1. Global framework and overview of key results of CNN-LSTM hybrid precision gradient accumulation in sports venue analysis.
Figure 1. Global framework and overview of key results of CNN-LSTM hybrid precision gradient accumulation in sports venue analysis.
Buildings 15 02926 g001
Figure 2. Architecture of the enhanced CNN-LSTM component.
Figure 2. Architecture of the enhanced CNN-LSTM component.
Buildings 15 02926 g002
Figure 3. CNN feature extractor output for a sample thermal image input.
Figure 3. CNN feature extractor output for a sample thermal image input.
Buildings 15 02926 g003
Figure 4. Predicted vs. actual occupancy in sports venues.
Figure 4. Predicted vs. actual occupancy in sports venues.
Buildings 15 02926 g004
Figure 5. Layer-wise bitwidth allocation in the CNN (higher precision in early layers) and LSTM (consistent 4-bit). (Note: Layer-wise precision allocation across the hybrid CNN-LSTM architecture. Color intensity represents the bitwidth through a monotonic gradient scale (darker shades = higher precision from 8 bit to 4 bit), with the explicit numerical labels shown in the right-hand side legend. Dashed vertical lines demarcate layer group boundaries between (I) input CNN blocks, (II) intermediate CNN blocks, and (III) LSTM temporal modules).
Figure 5. Layer-wise bitwidth allocation in the CNN (higher precision in early layers) and LSTM (consistent 4-bit). (Note: Layer-wise precision allocation across the hybrid CNN-LSTM architecture. Color intensity represents the bitwidth through a monotonic gradient scale (darker shades = higher precision from 8 bit to 4 bit), with the explicit numerical labels shown in the right-hand side legend. Dashed vertical lines demarcate layer group boundaries between (I) input CNN blocks, (II) intermediate CNN blocks, and (III) LSTM temporal modules).
Buildings 15 02926 g005
Figure 6. CMU buffer size impact analysis. (Note: CMU buffer size impacts showing (a) the gradient variance at different buffer sizes (50–150% of model parameters), (b) the 125% configuration’s optimal balance point (variance < 8.7 × 10−5)).
Figure 6. CMU buffer size impact analysis. (Note: CMU buffer size impacts showing (a) the gradient variance at different buffer sizes (50–150% of model parameters), (b) the 125% configuration’s optimal balance point (variance < 8.7 × 10−5)).
Buildings 15 02926 g006
Table 1. Dataset distribution and splits.
Table 1. Dataset distribution and splits.
DatasetTotal SamplesTraining (70%)Validation (15%)Test (15%)Key Characteristics
VenueTrack5000 h3500 h750 h750 h10 Hz temporal resolution, 20 venues
ThermoVenue1.2 M readings840,000180,000180,0001 min intervals, 15 venues
SportsAction50,000 events35,000750075008 sports categories, 4 K@30 fps
Table 2. Performance comparison across different methods on sports venue analytics tasks.
Table 2. Performance comparison across different methods on sports venue analytics tasks.
MethodCrowd MAE
(Persons/m2)
HVAC NMSEEvent Acc (%)Energy (J/inference)Throughput (fps)
Full-Precision CNN-LSTM1.82 ± 0.120.142 ± 0.00889.3 ± 0.95.21 ± 0.2345 ± 3
[1.78, 1.86][0.139, 0.145][88.9, 89.7][5.11, 5.31][43, 47]
Uniform 8-bit Quantization2.15 ± 0.150.178 ± 0.01085.1 ± 1.13.02 ± 0.1878 ± 4
[2.10, 2.20][0.174, 0.182][84.7, 85.5][2.95, 3.09][76, 80]
AutoPrecision1.91 ± 0.140.153 ± 0.00987.6 ± 0.83.45 ± 0.2062 ± 3
[1.86, 1.96][0.150, 0.156][87.3, 87.9][3.37, 3.53][61, 63]
GradFreeze1.95 ± 0.130.149 ± 0.00888.2 ± 0.73.78 ± 0.1958 ± 3
[1.90, 2.00][0.146, 0.152][87.9, 88.5][3.71, 3.85][57, 59]
Proposed Method1.79 ± 0.110.136 ± 0.00790.7 ± 0.82.98 ± 0.1582 ± 5
[1.75, 1.83][0.133, 0.139][90.4, 91.0][2.92, 3.04][80, 84]
Table 3. Fairness metric comparison across demographic subgroups.
Table 3. Fairness metric comparison across demographic subgroups.
MethodDemographic Parity DifferenceEqualized Odds DifferenceAccuracy Difference
Proposed Method0.03 ± 0.010.05 ± 0.020.02 ± 0.01
Full-Precision0.12 ± 0.030.18 ± 0.040.15 ± 0.03
Table 4. Ablation study results on crowd prediction task.
Table 4. Ablation study results on crowd prediction task.
VariantMAE (Persons/m2)Energy (J/inference)Gradient Variance
Full Proposed Method1.79 ± 0.112.988.7 × 10−5
No CMU1.92 ± 0.132.953.2 × 10−4
Fixed Allocation1.85 ± 0.123.211.1 × 10−4
No Spatial Weighting1.83 ± 0.123.059.8 × 10−5
Full Proposed Method1.79 ± 0.112.988.7 × 10−5
Table 5. Performance comparison across different edge computing platforms.
Table 5. Performance comparison across different edge computing platforms.
PlatformMemoryPower BudgetMAE (Persons/m2)Energy (J/inference)Latency (ms)
Raspberry Pi 5 (ARM Cortex-A76)8 GB12 W1.85 ± 0.133.21 ± 0.177.2 ± 0.4
Coral Edge TPU4 GB5 W1.91 ± 0.142.15 ± 0.124.8 ± 0.3
Table 6. Hybrid CNN-LSTM performance degradation under input noise conditions.
Table 6. Hybrid CNN-LSTM performance degradation under input noise conditions.
Noise TypeIntensityCrowd MAEHVAC NMSEEvent Acc
None (Clean)-1.79 ± 0.110.136 ± 0.00790.7 ± 0.8%
Sensor Noiseσ = 0.11.82 ± 0.120.140 ± 0.00890.1 ± 0.9%
σ = 0.31.89 ± 0.140.148 ± 0.00988.7 ± 1.1%
Occlusions20%1.85 ± 0.130.142 ± 0.00889.3 ± 1.0%
30%1.93 ± 0.150.151 ± 0.01087.9 ± 1.2%
Frame Drops10%1.83 ± 0.120.139 ± 0.00890.0 ± 0.9%
20%1.88 ± 0.130.145 ± 0.00989.1 ± 1.0%
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Lu, L.; Cao, Z.; Chen, X.; Zhang, H.; Wong, C.U.I. Hybrid Precision Gradient Accumulation for CNN-LSTM in Sports Venue Buildings Analytics: Energy-Efficient Spatiotemporal Modeling. Buildings 2025, 15, 2926. https://doi.org/10.3390/buildings15162926

AMA Style

Lu L, Cao Z, Chen X, Zhang H, Wong CUI. Hybrid Precision Gradient Accumulation for CNN-LSTM in Sports Venue Buildings Analytics: Energy-Efficient Spatiotemporal Modeling. Buildings. 2025; 15(16):2926. https://doi.org/10.3390/buildings15162926

Chicago/Turabian Style

Lu, Lintian, Zhicheng Cao, Xiaolong Chen, Hongfeng Zhang, and Cora Un In Wong. 2025. "Hybrid Precision Gradient Accumulation for CNN-LSTM in Sports Venue Buildings Analytics: Energy-Efficient Spatiotemporal Modeling" Buildings 15, no. 16: 2926. https://doi.org/10.3390/buildings15162926

APA Style

Lu, L., Cao, Z., Chen, X., Zhang, H., & Wong, C. U. I. (2025). Hybrid Precision Gradient Accumulation for CNN-LSTM in Sports Venue Buildings Analytics: Energy-Efficient Spatiotemporal Modeling. Buildings, 15(16), 2926. https://doi.org/10.3390/buildings15162926

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers. See further details here.

Article Metrics

Back to TopTop