4.1. Experimental Design
4.1.1. Investigated Data
The EduNet dataset [37], trademarked under DRSTA™ and copyrighted in 2021, has been specifically tailored for advancing research in artificial intelligence, with a focus on computer vision and HAR within educational settings. This subset version of the EduNet dataset comprises 10 distinct action classes (Table 3) that encompass a range of teacher and student activities observed in classroom environments. The collection contains around 929 manually annotated video clips, drawn from both YouTube sources and direct recordings in actual classrooms. Although the exact proportion of YouTube-sourced clips is not specified, this mixture of sources yields a rich and diverse dataset, as shown in Figure 2.
Data Preprocessing: The preprocessing pipeline involved several essential tasks to prepare the EduNet video dataset for classroom activity recognition using a deep learning model. All steps were implemented using Python and open-source libraries, including OpenCV, NumPy, TensorFlow/Keras and Scikit-learn.
Each video was uniformly sampled using a custom frame_extraction() function built with OpenCV's cv2.VideoCapture, which extracts a fixed-length sequence of 25 frames to ensure temporal consistency across samples. Frames were selected at regular intervals throughout each video's duration, regardless of its original frame rate, and resized with cv2.resize() to the fixed resolution required as input by the CNN model.
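For illustration, a minimal sketch of how such a frame_extraction() routine might look is given below; the 224 × 224 target resolution and the exact index-stepping strategy are placeholder assumptions, as the paper does not state them.

import cv2
import numpy as np

SEQUENCE_LENGTH = 25                   # fixed-length sequence stated above
FRAME_HEIGHT, FRAME_WIDTH = 224, 224   # target resolution (assumed value)

def frame_extraction(video_path):
    """Uniformly sample SEQUENCE_LENGTH frames from a video, resize and scale them."""
    frames = []
    capture = cv2.VideoCapture(video_path)
    total_frames = int(capture.get(cv2.CAP_PROP_FRAME_COUNT))
    # Regular sampling interval, independent of the video's frame rate.
    skip_window = max(int(total_frames / SEQUENCE_LENGTH), 1)
    for i in range(SEQUENCE_LENGTH):
        capture.set(cv2.CAP_PROP_POS_FRAMES, i * skip_window)
        success, frame = capture.read()
        if not success:
            break
        frame = cv2.resize(frame, (FRAME_WIDTH, FRAME_HEIGHT))
        frames.append(frame / 255.0)   # scale pixel intensities to [0, 1]
    capture.release()
    return np.asarray(frames, dtype=np.float32)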
Pixel intensities were scaled to the range [0, 1] by dividing all values by 255.0 using NumPy arrays, helping normalise input values and accelerate convergence during training. Each video was annotated with one of ten predefined activity classes (see Table 3), such as "Arguing," "Reading_Book," and "Writing_On_Board." These labels were encoded into one-hot categorical vectors using tensorflow.keras.utils.to_categorical() to support multi-class classification.
The complete dataset was split into training and testing subsets using an 80:20 ratio via Scikit-learn’s train_test_split(), with a fixed random seed to ensure reproducibility. Additionally, five-fold cross-validation was conducted using KFold from Scikit-learn to evaluate generalisation and validate the robustness and consistency of the model across different data splits. This comprehensive preprocessing pipeline ensures standardised formatting, consistent scaling and accurate labelling of input data, laying a solid foundation for effective model training and evaluation.
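A minimal sketch of the labelling and splitting steps under the settings described above follows; the seed value, array shapes and placeholder data are illustrative assumptions.

import numpy as np
from sklearn.model_selection import train_test_split, KFold
from tensorflow.keras.utils import to_categorical

# Placeholder data: a small stand-in for the extracted frame sequences.
features = np.random.rand(40, 25, 64, 64, 3).astype(np.float32)
labels = np.random.randint(0, 10, size=40)
one_hot_labels = to_categorical(labels, num_classes=10)

# 80:20 train-test split with a fixed seed (seed value is an assumption).
X_train, X_test, y_train, y_test = train_test_split(
    features, one_hot_labels, test_size=0.20, shuffle=True, random_state=27)

# Five-fold cross-validation over the same samples.
kfold = KFold(n_splits=5, shuffle=True, random_state=27)
for fold, (train_idx, test_idx) in enumerate(kfold.split(features), start=1):
    print(f"Fold {fold}: {len(train_idx)} training / {len(test_idx)} test samples")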
4.1.2. Method-Specific Settings
To ensure optimal performance and prevent overfitting, we carefully tuned key hyperparameters and employed multiple regularisation techniques to improve generalisation to unseen classroom activity data.
Table 4 summarises the key configuration used in the final Time-Distributed AlexNet model.
In our experiments, we fixed the number of frames per input video sequence to 25. This value was selected to provide a sufficient temporal window to capture meaningful student activities such as reading, writing, or hand-raising while keeping memory usage feasible for training. Although the model architecture is scalable to longer sequences, our preliminary trials showed diminishing returns in accuracy and increased training time due to redundant frames common in classroom scenarios.
The learning rate was tuned using a grid search over the range of 0.00005 to 0.001, with 0.0001 consistently yielding stable convergence and high validation accuracy (see Section 4.2.6 for statistical details). We applied Gaussian noise at the input layer to simulate real-world variability. Among the tested values (from 0.005 to 1.0), a standard deviation of 0.1 showed the best balance between performance and generalisation.
Batch Normalisation was applied after each convolutional layer to stabilise and accelerate training by reducing internal covariate shift. Dropout with a rate of 50% was added after the fully connected layers to reduce overfitting. Early stopping was implemented with a patience of 15 epochs; in the final run, training stopped at epoch 72 when no further validation improvement was observed, and the best weights were restored for final evaluation.
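The sketch below shows how these regularisation components fit together in Keras; the single convolutional block shown is abbreviated, and its filter sizes, the noise level, the input resolution and the temporal pooling step are illustrative assumptions rather than the exact published architecture.

from tensorflow.keras import layers, models, callbacks

model = models.Sequential([
    layers.Input(shape=(25, 224, 224, 3)),   # 25-frame clips; resolution assumed
    layers.GaussianNoise(0.1),               # input-layer noise injection
    layers.TimeDistributed(layers.Conv2D(96, 11, strides=4, activation="relu")),
    layers.TimeDistributed(layers.BatchNormalization()),  # after each conv layer
    # ... the remaining time-distributed AlexNet convolutional blocks go here ...
    layers.TimeDistributed(layers.GlobalAveragePooling2D()),
    layers.Flatten(),
    layers.Dense(4096, activation="relu"),
    layers.Dropout(0.5),                     # 50% dropout after fully connected layers
    layers.Dense(10, activation="softmax"),
])

# Early stopping with a patience of 15 epochs, restoring the best weights.
early_stopping = callbacks.EarlyStopping(
    monitor="val_loss", patience=15, restore_best_weights=True)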
4.1.3. List of Compared Methods and Parameter Settings
To evaluate the performance of the proposed time-distributed AlexNet architecture, two additional state-of-the-art methods were selected for comparison. These methods were applied to the same EduNet dataset to ensure consistency in evaluation. The experimental setup was crucial in evaluating how different models process and learn from the temporal and spatial dynamics of classroom activities. Additionally, the comparison with other methods allowed us to highlight the improvements made by our approach, especially in handling raw video data without the need for manual feature extraction. The compared methods and their respective parameter settings are as follows.
ViT (Vision Transformer) [50] Description: This model adopts a transformer-based architecture for spatiotemporal modelling by applying patch-level embeddings across video frames. Each video is represented as a sequence of frame-wise patch vectors, and temporal dependencies are captured using time-distributed transformer encoders.
Parameter Settings:
Input size: 224 × 224 frame dimensions with patch extraction.
Patch embedding dimension: 128.
Transformer heads: Four multi-head self-attention heads used per encoder.
Learning rate: 0.001 with Adam optimiser.
Dropout rate: 50% applied before classification to prevent overfitting.
Sequence length: 25 uniformly sampled frames per video.
ConvLSTM (Convolutional LSTM) [51] Description: This hybrid model combines convolutional layers for spatial feature extraction and LSTMs to capture temporal dependencies.
Parameter Settings:
Input size: 64 × 64 frame dimensions, scaled down for faster processing.
Learning rate: Default 0.001 with Adam optimiser.
Dropout rate: 20% was applied.
4.1.4. Evaluation Method Used
K-Fold Cross-Validation: The study employed five-fold cross-validation to evaluate the performance of the proposed time-distributed AlexNet model on the EduNet dataset. This method splits the dataset into five equal parts, using four parts for training and one part for testing in each iteration; the process was repeated over 10 runs to ensure robustness and to account for variability in results. During both the training–testing split and K-fold cross-validation, splitting was performed strictly at the video level, using video file paths only. This guarantees that all frames from a given video belong exclusively to either the training or the test set, never both.
Training–Testing Split: For initial evaluations, the dataset was split into 80% training data and 20% testing data, ensuring a consistent and balanced approach to assess generalisation.
Evaluation Metrics Used: We employed several metrics to assess the model’s performance, including overall accuracy, precision, recall, F1 score and the confusion matrix.
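All of these metrics can be computed directly from the test-set predictions with Scikit-learn; the sketch below uses randomly generated placeholder labels purely for illustration (186 matches the 20% test split reported later).

import numpy as np
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Placeholder predictions for illustration only.
y_true = np.random.randint(0, 10, size=186)
y_pred = np.random.randint(0, 10, size=186)

print(accuracy_score(y_true, y_pred))
print(classification_report(y_true, y_pred, digits=4))  # per-class precision, recall, F1
print(confusion_matrix(y_true, y_pred))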
4.2. Experimental Results Without Noise Injection
We first evaluated the time-distributed AlexNet model on the EduNet dataset without any noise injection. The classification performance across the various classroom activity classes yields insightful results, and the performance metrics for the highest-accuracy run are shown in
Table 5. These metrics indicate a high level of efficacy in recognising and distinguishing between the ten specified classroom behaviours. In our experiment, we divided the dataset into training and testing sets as 80% and 20%, respectively, where we had 743 training samples and 186 testing samples. The training configuration for the model employed the Adam optimiser due to its adaptive learning rate capabilities, which are essential for achieving faster convergence in deep learning networks. The model utilised a categorical cross-entropy loss function suitable for multi-class classification tasks. It was trained with a batch size of 8, carefully chosen to balance the trade-off between memory consumption and effective model performance. Furthermore, the training process was designed to continue for up to 100 epochs, incorporating an early stopping mechanism triggered by validation loss improvements to prevent over-training and ensure optimal generalisation.
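Continuing the sketch from Section 4.1.2, this training configuration maps onto the following Keras calls; the validation split used to drive early stopping is an assumption.

import tensorflow as tf

model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-4),
              loss="categorical_crossentropy",
              metrics=["accuracy"])

history = model.fit(X_train, y_train,
                    epochs=100,              # upper bound on training epochs
                    batch_size=8,
                    validation_split=0.1,    # held-out fraction (assumption)
                    callbacks=[early_stopping])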
To ensure the robustness of our evaluation, we performed five-fold cross-validation [52,53], repeating this process for 10 runs. The average performance and standard deviation [54] across these runs were calculated (see Table 6), providing a comprehensive assessment of the model's consistency. The average overall accuracy achieved was 90.11% with a standard deviation of 0.62%.
The confusion matrix [55] related to Table 5, shown in Figure 3 and computed on the test set (the 20% split of the EduNet dataset), offers insights into the model's performance across the various classes. The activity recognition performance of the model demonstrates varying degrees of accuracy across different activities. The activity of arguing was flawlessly identified in all seven instances without any misclassifications. The eating_in_classroom activity was mostly accurate, with 17 instances correctly classified, although there was one case where it was confused with holding_mobile_phone. Explaining_the_subject showed a high level of accuracy with 35 correct recognitions; however, some confusions led to seven misclassifications as holding_book and three as writing_on_board. HandRaise was correctly identified in all 11 instances, and holding_book was also perfectly recognised in all 18 cases without any errors. Holding_mobile_phone was mostly accurate with 18 correct classifications, but one instance was mistaken for eating_in_classroom. Reading_book was well classified with 13 correct recognitions, though one case was erroneously identified as HandRaise. Sitting_on_desk was correctly identified in 17 occurrences, with one misclassification as reading_book. Writing_on_board exhibited high accuracy with 19 correct recognitions, though two instances were misclassified as explaining_the_subject. Finally, writing_on_textbook was generally well classified with 15 correct recognitions, but there was one misclassification as arguing, indicating some visual similarities between these activities.
The model’s performance on the EduNet dataset, as detailed in the confusion matrix, showcases its capability to accurately identify classroom activities with varying degrees of complexity. Activities like “Arguing,” “HandRaise,” and “Holding_Book” were perfectly recognised without any misclassifications, indicating strong model reliability for distinct actions. However, some other activities involving more subtle differences in visual cues, such as “Explaining the Subject” and “Writing_on_board,” faced higher instances of misclassification due to their similarity to other action classes. This suggests that while the model is highly effective in distinguishing clear-cut activities, it encounters challenges when dealing with actions that share visual similarities.
The obtained results indicate the model’s robustness, particularly in distinguishing activities that have distinct visual cues. However, the errors in classifying actions with similar postures or hand movements point to potential areas for model refinement. Enhancing the model could involve more granular feature extraction techniques or incorporating additional contextual information to better differentiate between visually similar activities. The quality of the results is promising for the application of this model in educational tools, potentially aiding in the automated analysis of classroom dynamics to support pedagogical assessments and interventions.
4.2.1. Comparison with Existing Approaches
In the domain of educational human activity recognition, the proposed time-distributed AlexNet model on the EduNet dataset is compared with two other methods evaluated on the same dataset (see Table 7). We implemented ViT (Vision Transformer) [50], which achieved an accuracy of 76.88%, and a ConvLSTM model [51], which integrates convolutional layers with recurrent units and achieved an accuracy of 77.81%. Both underperformed compared to our method, likely due to limitations in explicitly modelling frame-level independence and cross-frame consistency.
To further validate the observed performance differences, we conducted statistical comparisons. A paired t-test between the time-distributed AlexNet and ConvLSTM yielded a t-value of 31.82 with a p-value well below 0.001, indicating a statistically significant improvement. Similarly, a paired t-test between the time-distributed AlexNet and the ViT model resulted in a t-value of 55.13, again with a p-value well below 0.001. These results confirm that the proposed model offers a more accurate and stable solution for HAR in educational environments.
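Such a comparison can be reproduced with scipy.stats.ttest_rel; the per-run accuracy values below are illustrative placeholders, not the paper's raw measurements.

from scipy.stats import ttest_rel

# Illustrative per-run accuracies (%) for the two models over 10 runs.
alexnet_acc = [90.4, 89.7, 90.6, 90.1, 90.3, 89.9, 90.7, 90.0, 90.5, 90.8]
convlstm_acc = [77.9, 77.5, 78.1, 77.6, 78.0, 77.7, 78.2, 77.4, 77.9, 78.1]

t_stat, p_value = ttest_rel(alexnet_acc, convlstm_acc)
print(f"t = {t_stat:.2f}, p = {p_value:.2e}")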
As shown in Table 7, the proposed time-distributed AlexNet contains substantially more parameters (58.36M) than ConvLSTM (7.49M) and ViT (1.84M). Its performance advantage over ConvLSTM and ViT can therefore be attributed primarily to its higher parameter count and representational capacity. The additional integration of Gaussian noise serves as a complementary regularisation strategy, providing incremental gains in generalisation and training stability, as shown in Section 4.2.2.
4.2.2. Gaussian Noise Injection into Time-Distributed AlexNet
To assess the effect of Gaussian noise injection on training dynamics and classification accuracy, we compared two configurations of the time-distributed AlexNet model: one without noise (baseline) and another with Gaussian noise of a fixed standard deviation applied at the input layer. Using five-fold cross-validation, the baseline model achieved an average accuracy of 90.11%, while the model trained with Gaussian noise injection achieved a higher average accuracy of 91.09%. This improvement indicates that noise injection serves as an effective regularisation strategy.
The training and validation performance of a representative fold for the baseline model is shown in Figure 4. The training loss steadily decreased and training accuracy improved, but the validation accuracy exhibited fluctuations during early epochs, indicating potential overfitting or sensitivity to intra-class variations. The confusion matrix (Figure 5) further highlights notable misclassifications, particularly in classes such as Explaining_the_Subject, Holding_Book and Reading_Book.
In contrast, the training curves for the noise-injected model (Figure 6) show better alignment between training and validation accuracy, with reduced variance and improved stability across epochs. The confusion matrix (Figure 7) reveals improved classification across most classes, although some confusion remains between visually similar actions such as Explaining_the_Subject.
Overall, these results demonstrate that injecting mild Gaussian noise improves model generalisation and stabilises learning dynamics in video-based HAR, particularly in the presence of subtle inter-class visual differences.
Furthermore, a grid search [56] was performed over a range of Gaussian noise values (0.01, 0.02, 0.03, 0.04, 0.05, 0.06, 0.07, 0.08, 0.09, 0.1, 0.5, 1.0), providing a comprehensive view of the relationship between noise injection and model performance. The results indicated that Gaussian noise injection could improve the model's ability to generalise, but the effect was highly dependent on the noise value chosen.
The noise value of 0.1 achieved the highest test accuracy in this study. However, performance was not entirely consistent across all runs, suggesting that while this noise level is optimal for generalisation, there is still some sensitivity to random initialisation and other factors.
As noise values increased, there was a noticeable decline in model performance. For instance, noise values such as 0.5 and 1.0 resulted in much lower accuracies, indicating that excessive noise can obscure the underlying patterns in the data to a point where the model can no longer generalise effectively.
This variability across different noise levels emphasises the importance of careful experimentation and tuning of the noise parameter. While noise injection can improve robustness, improper tuning can degrade performance, as observed with higher noise levels.
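A hedged sketch of such a noise grid search follows; build_model() is a hypothetical helper that assembles the time-distributed AlexNet with the given noise level, and the remaining names continue the earlier sketches.

noise_grid = [0.01, 0.02, 0.03, 0.04, 0.05, 0.06,
              0.07, 0.08, 0.09, 0.1, 0.5, 1.0]
results = {}
for sigma in noise_grid:
    model = build_model(noise_stddev=sigma)  # hypothetical model-builder helper
    model.fit(X_train, y_train, epochs=100, batch_size=8,
              validation_split=0.1, callbacks=[early_stopping], verbose=0)
    _, test_acc = model.evaluate(X_test, y_test, verbose=0)
    results[sigma] = test_acc

best_sigma = max(results, key=results.get)
print(f"Best noise level: {best_sigma} (accuracy {results[best_sigma]:.4f})")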
4.2.3. Analysis of Gaussian Noise Injection and Its Effect
This section presents a detailed analysis of the effect of Gaussian noise injection on the model's performance. The model was trained with Gaussian noise values ranging from 0.005 to 1.0 and was evaluated across five different runs to measure the variability in test accuracy. The focus of this analysis is on the overall performance of the model, particularly highlighting the noise value of 0.1, which consistently performed well across different runs.
Table 8 presents the test accuracy for each noise value across the five experimental runs. The noise value of 0.1 consistently provided high accuracy, while higher noise values (e.g., 1.0) led to degradation in performance.
The impact of Gaussian noise injection on the HAR model's performance, as displayed in Table 8, reveals several critical insights into how varying levels of noise influence test accuracy. By examining the performance across the different noise values (ranging from 0.005 to 1.0), we can observe a clear relationship between noise level and the model's ability to generalise effectively.
Low Noise Values (0.005 to 0.01): For the lowest noise values (0.005 to 0.01), the model achieved relatively high accuracy across all five runs. For instance, the noise value of 0.005 resulted in test accuracies ranging from 83.40% to 89.08% across runs, indicating that a minimal amount of noise injection helps prevent the model from overfitting to the training data. This moderate improvement suggests that adding a slight level of randomness to the input data enables the model to learn more generalised patterns rather than memorising specific details.
The noise value of 0.01 consistently achieved the highest test accuracy with an average accuracy of approximately 92.54% across all runs. This indicates that at very low noise levels, the model benefits from noise injection, enhancing its ability to handle variations in the data without being overwhelmed by the added randomness.
Moderate Noise Values (0.02 to 0.1): As the noise level increases to moderate values (0.02 to 0.1), there is a noticeable fluctuation in model performance. For example, with a noise value of 0.02, the average accuracy dropped to around 85.69%, which is lower than that of the 0.01 noise value. Similarly, at noise values such as 0.03, 0.04 and 0.05, the model’s accuracy remained relatively high but did not exceed the peak performance observed at the 0.01 level.
Interestingly, the noise value of 0.1 stood out as one of the most effective, achieving an average accuracy of 91.40%. This suggests that a noise value of 0.1 strikes a balance between adding enough randomness to enhance generalisation while not excessively distorting the input data. As seen from the results, this noise level allows the model to avoid overfitting while still capturing essential features required for accurate classification.
These observations imply that there is a threshold for noise injection where the positive effects of regularisation are maximised. Beyond this threshold, the benefits of noise injection start to diminish and the model’s performance begins to degrade.
High Noise Values (0.5 and 1.0): The most significant drop in performance was observed at the highest noise levels (0.5 and 1.0). With an average test accuracy of around 44.92% for a noise value of 0.5 and 22.19% for a noise value of 1.0, it is evident that the model struggled to extract meaningful patterns from the highly distorted data. At these levels, the injected noise overwhelmed the input features, making it difficult for the model to distinguish between relevant information and random variations.
This outcome highlights a critical drawback of excessive noise injection: while noise can help prevent overfitting, too much noise introduces substantial uncertainty, effectively “drowning out” the signal needed for accurate classification. The model’s inability to learn from such heavily corrupted data results in a reduction in performance, as observed with the lowest accuracy scores.
4.2.4. Key Takeaways from the Noise Analysis
Optimal Noise Range: The data indicates that the optimal noise level for this HAR model lies between 0.01 and 0.1. Within this range, the model consistently achieved high accuracy, with 0.1 emerging as the most effective noise value in balancing generalisation and performance.
Diminishing Returns with Increasing Noise: As the noise level increases beyond the moderate range, the accuracy of the model begins to decline. This suggests that although noise injection is a valuable regularisation technique, there is a point where its benefits plateau and further increases in noise become detrimental to performance.
Negative Impact of High Noise Values: Excessive noise (values of 0.5 and 1.0) has a clearly adverse effect on the model’s performance. These high noise levels introduce too much randomness, leading to a dramatic loss of accuracy and the model’s inability to learn effectively from the input data.
The detailed analysis of Table 8 demonstrates that Gaussian noise injection can be a powerful tool for improving model generalisation when applied correctly. The key to leveraging noise injection lies in selecting an appropriate noise level, as shown by the superior performance at noise values around 0.01 to 0.1. However, when noise injection is not carefully tuned, it can hinder model performance, as seen with the accuracy drop at higher noise values.
4.2.5. Statistical Analysis
To rigorously evaluate the effect of Gaussian noise injection on model performance, we conducted a statistical comparison between the baseline model and the model incorporating a Gaussian noise value of 0.01. Both models were trained using five-fold cross-validation, and each experiment was repeated 10 times to ensure the reliability of the results (see Table 9).
Statistical Comparison:
To determine whether the observed performance difference between the two models was statistically significant, we conducted a two-sample t-test on the sets of accuracy values obtained across the 10 runs. The test produced a t-value of −2.83 and a corresponding p-value of 0.01106.
This result indicates statistical significance at the commonly accepted α = 0.05 level, allowing us to reject the null hypothesis that there is no difference in performance between the models. The improvement in accuracy with noise injection is therefore unlikely to have occurred by chance.
Interpretation:
The statistical analysis confirms that injecting Gaussian noise with a standard deviation of 0.01 results in a meaningful and consistent improvement in model performance. The lower standard deviation of the accuracy across runs also indicates more stable convergence. These findings support the effectiveness of noise injection as a regularisation strategy to enhance generalisation in recognition tasks using the EduNet dataset.
4.2.6. Impact of Learning Rate on Model Performance
Learning rate [57,58] plays a pivotal role in the performance and stability of deep learning models, directly influencing convergence and generalisation. To explore this relationship, statistical t-tests [59] were conducted to evaluate the impact of various learning rates on the performance of the time-distributed AlexNet model. The learning rate of 0.0001 was used as the reference point because it represents a widely accepted baseline that balances stability and convergence, providing a reliable benchmark for assessing the impact of varying learning rates on the model's performance. Its performance was compared against other learning rates ranging from 0.00005 to 0.001. The statistical evaluation is summarised in Table 10.
4.2.7. Key Observations
Significant Differences at Certain Learning Rates: Statistically significant differences in performance were observed between the learning rate of 0.0001 and the learning rates of 0.0006, 0.0007 and 0.0008, as indicated by their p-values [59] (Table 10).
No Significant Differences Across Most Learning Rates: For smaller learning rates, such as 0.00005, 0.00006 and 0.00007, no significant differences in performance were observed.
Importance of Fine-Grained Tuning: While 0.0001 provided a robust baseline, the evaluation highlights the importance of fine-grained learning rate adjustments. Intermediate rates, such as 0.0007, offered a unique balance of high accuracy and stability, making them promising candidates for further exploration.
The analysis underscores the critical impact of learning rate selection on the time-distributed AlexNet model’s performance for human activity recognition tasks. Smaller learning rates exhibited improved stability, with notable reductions in variability across experimental runs. While the baseline rate of 0.0001 provided a strong foundation, the potential of intermediate rates, such as 0.0006 to 0.0008, for enhancing performance warrants further investigation.
These findings emphasise the importance of optimising learning rates to balance convergence, generalisation and stability. The insights gained from this study contribute to the broader understanding of hyperparameter tuning and its role in advancing deep learning models for real-world applications.
4.2.8. Comparison with Traditional Regularisation Techniques
To further evaluate the effectiveness of Gaussian noise injection, we compared it against traditional L2 regularisation. Both models included standard regularisation components, such as Batch Normalisation and Dropout, as part of the base time-distributed AlexNet architecture. We trained the model using L2 regularisation and performed five-fold cross-validation, achieving an average accuracy of 80.73%, precision of 85.71%, recall of 80.73% and F1 score of 81.12%. In contrast, Gaussian noise injection with a standard deviation of 0.01 yielded a higher average accuracy of 92.54%, as detailed in Table 8. While L2 regularisation penalises large weights to reduce overfitting, Gaussian noise injection perturbs the input distribution, encouraging the model to learn more robust and invariant spatiotemporal features. Given that both techniques were evaluated within an identical architectural and training setup, these findings indicate that biologically inspired noise injection offers superior generalisation, particularly for activity recognition in egocentric classroom video data, where environmental variability is prominent.
4.3. Computational Cost Analysis
Here, we evaluate computational efficiency by providing a comparative analysis of the computational cost of the proposed and baseline models, both with and without noise injection, alongside the alternative methods. The analysis covers the number of floating-point operations (FLOPs) and the number of trainable parameters. FLOPs were estimated using the keras-flops tool with a batch size of 1 and an input corresponding to 25-frame video clips at the model's native resolution. All models were implemented and profiled under equivalent input and training configurations to ensure a fair comparison.
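The profiling step corresponds to the following calls; get_flops is the entry point of the keras-flops package, and the model variable continues the earlier sketches.

from keras_flops import get_flops

flops = get_flops(model, batch_size=1)   # inference-time FLOPs estimate
params = model.count_params()            # total parameter count
print(f"{flops / 1e9:.2f} GFLOPs, {params / 1e6:.2f} M parameters")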
As shown in Table 11, the ConvLSTM model achieves a good balance between computational cost and capacity, with 5.26 GFLOPs and 7.49 million parameters. The time-distributed AlexNet-based models (with and without Gaussian noise) are significantly heavier, requiring 56.83 GFLOPs and 58.36 million parameters. On the other hand, the ViT transformer-based model is surprisingly lightweight, with only 2.72 GFLOPs and 1.84 million parameters, making it a computationally efficient alternative.
It is important to note that the FLOPs reported in Table 11 represent inference-only complexity as computed using the keras-flops tool. Since the GaussianNoise layer is non-trainable and only active during training [60], its computational cost is not reflected in these values. Specifically, the GaussianNoise layer does not contain any learnable parameters; it simply adds random perturbations sampled from a Gaussian distribution N(0, σ²) to the input frames during training. The noise level σ is a fixed hyperparameter defined by the user, and the layer is automatically bypassed during inference to ensure deterministic predictions. While this introduces a small overhead during training due to element-wise noise addition, the effect is negligible compared to the convolutional and fully connected operations that dominate computational complexity. Most importantly, Gaussian noise injection does not alter the model's architecture or inference-time efficiency, meaning that FLOPs and parameter counts remain identical for both the baseline and noise-injected variants. This ensures that the method delivers measurable improvements in generalisation without incurring additional deployment costs.
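This training-only behaviour is easy to verify directly: Keras's GaussianNoise layer perturbs its input only when called with training=True, as the short check below illustrates.

import tensorflow as tf

noise_layer = tf.keras.layers.GaussianNoise(stddev=0.1)
x = tf.ones((1, 4))

print(noise_layer(x, training=True))   # perturbed: noise drawn from N(0, 0.1^2)
print(noise_layer(x, training=False))  # identical to x: layer bypassed at inference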
4.4. Analysis of the Model Performance on UCF50 and UCF101 Datasets with Gaussian Noise Injection
While the primary focus was the evaluation on the EduNet dataset, we conducted further experiments on the UCF50 and UCF101 datasets to assess whether the effects of Gaussian noise observed on EduNet generalise to broader action recognition tasks. Initially, we experimented without noise (0.000); noise was then applied to the input video frames with standard deviations ranging from 0.005 to 1.000 across five independent runs. The objective was to examine the robustness and generalisation ability of the proposed time-distributed AlexNet model under varying noise perturbations.
As shown in Table 12, a clear trend emerges: introducing low to moderate levels of Gaussian noise (standard deviations up to roughly 0.04) consistently improves or maintains classification performance, whereas excessive noise (0.5 and above) severely degrades accuracy.
The baseline model without any noise injection (0.000) achieved an average accuracy of 87.88%. In comparison, the highest accuracy, 91.45%, was recorded at a noise level within the 0.02–0.04 band (Table 12), followed by 90.93%. This improvement suggests that mild stochastic perturbations during training act as effective regularisers, reducing overfitting and enhancing generalisation. These findings align with established theories in deep learning, where noise helps smooth decision boundaries and foster more robust feature representations.
Conversely, the model's performance deteriorated under high noise conditions: a noise level of 0.5 led to an average accuracy of only 49.33%, which dropped further to 10.15% at 1.000. Such excessive noise disrupts the discriminative spatial–temporal patterns required for accurate human activity recognition.
Compared to results on the EduNet dataset, similar conclusions hold—moderate noise enhances performance, while too much noise is detrimental. However, UCF50 showed more robustness across a wider noise range, possibly due to its more constrained and visually consistent action categories.
In conclusion, this evaluation reinforces the effectiveness of Gaussian noise injection as a lightweight regularisation technique. Based on the UCF50 results, the optimal performance is achieved with a noise standard deviation between 0.02 and 0.04, which provides the best balance between robustness and accuracy when compared to the noise-free baseline.
On the other hand, we conducted another experiment on the UCF101 dataset. Gaussian noise with varying standard deviations, from 0.000 (no noise) to 1.000, was introduced during training, and each configuration was evaluated across five independent runs. Table 13 presents the detailed results.
The baseline model, trained without any noise (0.000), achieved the highest average accuracy of 88.66%. Interestingly, low noise levels produced comparable results, with average accuracies ranging from 88.30% to 88.58%. The highest accuracy among the noisy models, an average of 88.58%, was observed at a low noise setting, indicating that minor stochastic perturbations can act as a soft regulariser without degrading performance.
However, as the noise level increased, a sharp decline in accuracy was observed: accuracy dropped to 86.49% at a moderate noise setting and plummeted to 29.57% and 1.95% at noise levels of 0.5 and 1.0, respectively. These results reaffirm that while mild noise levels may be tolerated or even beneficial, excessive noise can obscure the informative signal and severely hamper learning.
Overall, the analysis confirms that while Gaussian noise injection offers a lightweight regularisation technique, its benefits are bounded. Interestingly, for the large-scale UCF101 dataset, noise injection slightly reduced performance compared to the baseline model. This can be attributed to the inherent richness and diversity of UCF101, which already promotes generalisation. In such cases, additional noise may introduce excessive perturbations, thereby diminishing accuracy. This suggests that the effectiveness of noise injection is dataset-dependent and most beneficial in smaller or more constrained scenarios, such as EduNet.