Article

Enhanced Localisation and Handwritten Digit Recognition Using ConvCARU

1
Faculty of Applied Sciences, Macao Polytechnic University, Macau, China
2
Engineering Research Centre of Applied Technology on Machine Translation and Artificial Intelligence, Macao Polytechnic University, Macau, China
*
Author to whom correspondence should be addressed.
Appl. Sci. 2025, 15(12), 6772; https://doi.org/10.3390/app15126772
Submission received: 16 May 2025 / Revised: 12 June 2025 / Accepted: 12 June 2025 / Published: 16 June 2025

Abstract

Predicting the motion of handwritten digits in video sequences is challenging due to complex spatiotemporal dependencies, variable writing styles, and the need to preserve fine-grained visual details—all of which are essential for real-time handwriting recognition and digital learning applications. In this context, our study aims to develop a robust predictive framework that can accurately forecast digit trajectories while preserving structural integrity. To address these challenges, we propose a novel video prediction architecture integrating ConvCARU with a modified DCGAN to effectively separate the background from the foreground. This ensures the enhanced extraction and preservation of spatial and temporal features through convolution-based gating and adaptive fusion mechanisms. Based on extensive experiments conducted on the MNIST dataset, which comprises 70,000 grayscale images of 28 × 28 pixels, our approach achieves an SSIM of 0.901 and a PSNR of 29.31 dB. This reflects a statistically significant improvement in PSNR of +0.20 dB (p < 0.05) compared to current state-of-the-art models, thus demonstrating its superior capability in maintaining consistent structural fidelity in predicted video frames. Furthermore, our framework performs better in terms of computational efficiency, with lower memory consumption compared to most other approaches. This underscores its practicality for deployment in real-time, resource-constrained applications. These promising results consequently validate the effectiveness of our integrated ConvCARU–DCGAN approach in capturing fine-grained spatiotemporal dependencies, positioning it as a compelling solution for enhancing video-based handwriting recognition and sequence forecasting. This paves the way for its adoption in diverse applications requiring high-resolution, efficient motion prediction.

1. Introduction

The rapid development of deep learning technologies has had a significant impact on the field of video prediction [1,2], particularly in specialised areas such as the analysis of handwritten digit sequences [3]. While significant progress has been made in general video prediction tasks [4], the unique challenges of handwritten digit sequences require tailored solutions that effectively capture the spatial and temporal features specific to handwriting. Video prediction of handwritten digits is a complex challenge at the intersection of computer vision and pattern recognition [5,6]. The inherently dynamic nature of handwriting, with its fluid and ever-changing forms, poses a unique set of technical challenges for video prediction systems. These systems must capture the unpredictable movement and variation inherent in handwriting while ensuring that the predicted frames maintain the clarity of the written character and stylistic consistency throughout the sequence. Achieving this balance is essential, as even small deviations in either clarity or style can compromise the usability and reliability of the prediction results. In practical applications such as digital learning platforms and real-time handwriting recognition systems, accurate prediction of the trajectory and transformation of handwritten digits is essential to improve system performance [7,8]. Conventional video prediction methods often lack the sophisticated mechanisms required to address these intertwined challenges effectively, highlighting the need for more specialised approaches tailored to the intricacies of handwriting dynamics.
Furthermore, the prediction of handwritten digit video sequences poses numerous challenges [9,10]. Firstly, it is necessary to preserve the temporal continuity of digit appearances across consecutive frames while maintaining the unique traits of individual writing styles. Secondly, in sequences comprising multiple digits, the model must effectively manage their interactions and potential overlaps, thereby ensuring that the integrity of each character is preserved. Finally, the prediction system must operate efficiently enough to fulfil real-time processing demands in practical applications, while achieving high levels of accuracy [11]. Although state-of-the-art video prediction models such as SimVP [12], SimVPv2 [13], and DMVFN [14] have demonstrated exceptional performance in general video prediction tasks, they encounter limitations when applied to handwritten digit sequences. These generic models frequently fail to preserve the distinctive features of digits throughout the prediction sequence, resulting in a decline in character recognition accuracy [15]. They also struggle to preserve temporal consistency in writing style and require substantial computational resources for a task that is inherently specialised [16,17].
To improve performance, an advanced video prediction framework using convolutional content-adaptive recurrent units (ConvCARUs) is introduced, specifically targeting the prediction of handwritten digit sequences. This framework incorporates several key innovations that represent significant advances in the field. The primary innovation is an improved temporal modelling approach, achieved by modifying the CARU architecture [18]; existing components such as CARU [18], DCGAN [19], and SimVP [12] still struggle, on their own, to adequately predict the dynamic behaviour of handwritten digits. The inherent challenge lies in accurately capturing the intricate spatiotemporal dependencies while preserving fine structural details over long prediction horizons. This paper aims to overcome these challenges by introducing an enhanced prediction framework based on a convolutional content-adaptive recurrent unit (ConvCARU) integrated with a modified GAN. To clearly delineate our novel contributions from established methods, we highlight the following key innovations:
  • New gradient-regularised GAN loss: A modified GAN loss function is proposed that stabilises the adversarial training process and improves the quality of feature extraction, thus enhancing the visual fidelity of the predicted frames.
  • Parameter-efficient ConvCARU gate: By replacing traditional linear operations with CNN-based computations in the gating mechanisms, our ConvCARU design captures extended spatio-temporal dependencies while reducing computational overhead.
  • Ablation-verified performance gain: Systematic ablation studies confirm that each component of our approach, from the adaptive fusion of features to the decoupling of foreground and background information, significantly contributes to performance improvements measured in SSIM, PSNR, and memory efficiency compared to state-of-the-art methods.
These contributions collectively establish a novel framework that not only addresses the limitations of its predecessors but also sets a new benchmark in video prediction for handwritten digit recognition.

2. Related Work

In recent years, the widespread availability of high-speed internet and advanced photography equipment has greatly simplified the process of capturing and compiling digital videos [20]. This accessibility has resulted in the generation of diverse video datasets, which have driven research into video prediction technology [2,21]. These technologies hold potential for applications across various fields, including sports analysis, human activity monitoring, vehicle tracking, and even the documentation of routine tasks such as makeup application [22,23]. The video prediction methods explored in this study remain at an early stage, with a particular focus on forecasting future frames containing digit posture data. Recent progress in variational autoencoder (VAE) methodologies has played a pivotal role in advancing this domain [24]. More recently, researchers have proposed a robust graph-based VAE framework capable of capturing complex spatiotemporal relationships [25]. By incorporating temporal attention mechanisms, the framework achieves enhanced reliability in long-term predictions. Fundamentally, the key challenge in video prediction lies in predicting subsequent sequences of images based on the temporal continuity of preceding frames, thus emphasising the importance of sophisticated image generation algorithms [26,27].
Moreover, generative adversarial networks (GANs) have demonstrated their ability to achieve convergence by utilising the dynamic interaction between the generator and discriminator [28]. This approach eliminates the sole dependence on manually crafted loss functions during backpropagation. In many instances, the discriminator functions as a variant of an encoder, offering feedback that guides the generator to produce a diverse range of images. GANs have proven to be particularly effective in generating refined and distinct images, which has led to their widespread application across various network architectures. Researchers have also recently introduced a flow-based generative network that meticulously calculates the distribution space of images [29]; this advancement allows users greater control over image manipulation. A video prediction model additionally needs the ability to generate consistent and dynamically evolving sequences. Although image generation methods cannot be directly applied to video prediction, they provide valuable insights for developing more effective video prediction algorithms [2]. The development and enhancement of deep learning techniques, particularly VAEs and GANs, have laid a strong foundation for their integration into video prediction tasks [30,31]. By building on these image generation models, researchers have been able to delve into the temporal connections between successive video frames, a crucial factor for improving video prediction performance [32]. Such an approach must encompass not only the intricate details inherent within individual frames but also the interconnected spatiotemporal dynamics. Consequently, despite advancements in deep learning-based image generation, video prediction continues to be a considerably more intricate challenge [33,34].
In order to address these challenges, significant attention has been given to the development of unsupervised deep neural networks for future video sequence prediction. Deep learning, now a cornerstone of computer vision, has fostered the evolution of unsupervised prediction models that incorporate imagery or video data. In practice, these methods often rely on predefined semantics or fully annotated datasets, which can require significant resources [35]. The framework delineated in [36] employs a “maximum-margin” structured output support vector machine (SVM) to facilitate early detection and local event identification. Refs. [37,38] expanded on the “first principles” of structured random forest regression to predict object motion trajectories within videos. Ref. [39] used variational autoencoders to encode latent image variables and predict dense pixel trajectories to infer potential object motion in a scene. For instance, the authors of [40] introduced an innovative prediction technique that captures subtle behavioural cues, suggesting future actions and representing human motion across various levels of granularity. Meanwhile, a “bag-of-words” approach was presented that models object activity using histograms of spatiotemporal features to capture temporal changes [41]. Ref. [42] proposed a database-driven technique for anomaly detection in video sequences, enabling the evaluation of non-target events and the prediction of future occurrences. Similarly, Ref. [43] introduced the uncertain hypothesis test to assess the suitability of uncertain regression models. They further explored the uncertain significance test to determine whether certain pre-specified regression coefficients within an uncertain regression model can be considered zero [44,45]. As a result, deep learning-based predictive models primarily rely on maximum likelihood estimation to construct predictive frameworks and define associated loss functions [46,47].
In a separate study, the authors of [14] introduced the dynamic multi-scale voxel flow network (DMVFN), a groundbreaking model featuring a differentiable routing module capable of perceiving motion scales within video frames. After training, the model adapts during inference by selecting optimal subnetworks based on input features. By utilising only RGB images, DMVFN minimises computational cost while delivering superior video prediction performance compared to previous methods. Notably, DMVFN’s applicability extends to tasks involving video sequences of handwritten digits, where accurate motion prediction and feature extraction are essential for identifying character trajectories over time. Furthermore, Refs. [48,49,50,51] made significant advancements in uncertain regression models by proposing an innovative hypothesis testing framework to evaluate their suitability. They also introduced the uncertain significance test, which allows researchers to determine whether specific regression coefficients within these models can be considered null. These methods have shown promising outcomes in scenarios requiring robust predictions, including video sequences featuring handwritten digits, where uncertainty quantification plays a crucial role in ensuring model reliability [52]. Traditionally, deep learning predictive frameworks have relied heavily on maximum likelihood estimation (MLE) as the foundation for defining loss functions. Although MLE can improve accuracy, it may sacrifice flexibility in capturing complex, non-linear relationships within the data. Researchers are now exploring alternative approaches, such as Bayesian methods, ensemble models, or hybrid loss functions, which aim to strike a balance between accuracy, interpretability, and generalisation capability. This evolution is particularly relevant for tasks involving handwritten digits in video sequences, where capturing subtle spatiotemporal patterns is pivotal. Deep learning has evolved toward end-to-end trainable networks, which are designed to autonomously forecast future sequences without strict reliance on MLE principles [53]. To enhance conventional deep learning-based video prediction models, Ref. [54] proposed the combination of involution and convolution operators (CICO). This model addresses issues such as insufficient spatial feature extraction and lower prediction accuracy. Its design includes multi-scale convolution kernels for robust spatial feature extraction, an involution operator to replace larger convolution kernels for improved computational efficiency, and a 1 × 1 convolution kernel for linear mapping to foster integration among diverse features. These advancements have been particularly beneficial for applications involving handwritten digits in video sequences, enabling precise trajectory prediction and efficient feature extraction for improved overall performance. Following training, CICO can adaptively select appropriate subnetworks during inference based on varied input features, including those found in digit-based video data. These methods not only broaden the scope of model learning but also inspire innovation in the development of robust frameworks for diverse applications, including the intricate task of predicting handwritten digits in video sequences.

3. Proposed Method

A novel approach to video prediction is introduced by employing ConvCARU for temporal modelling while incorporating a deep convolutional generative adversarial network (DCGAN) [19] to construct the background feature encoder, target feature encoder, and decoder. This framework is designed to extract abstract representations, specifically background and target features, by applying convolutional operations to sequences of handwritten digit videos [55]. These extracted features serve as input to a ConvCARU-based predictive model, which captures and analyses temporal dependencies to anticipate future motion trajectories. In the final stage, a decoder constructed using transposed convolutional layers integrates the projected motion dynamics with the preserved background features from the last recorded frame, thereby generating future images that maintain coherence with prior visual information. Moreover, to enhance robustness and adaptability, this framework exploits the strengths of generative adversarial networks to ensure that feature extraction remains both effective and context-aware. By using ConvCARU for sequential modelling, the network is able to recognise complex temporal relationships and refine the prediction accuracy of evolving visual patterns. Integrating transposed convolution into the decoding process further enhances the reconstruction of future frames, resulting in images that are both consistent with the original sequence and visually coherent. This methodology represents a promising direction for advancing video prediction tasks, particularly in scenarios where accurate prediction of motion trajectories is critical.

3.1. Feature Extraction from Frame Sequences

The original GAN model presents several challenges in practical applications, including exploding and vanishing gradients. These issues make direct implementation problematic. The DCGAN framework has been rigorously tested and proven to be a versatile and effective solution to these issues, and is often cited in real-world applications. Its architecture is based on transposed convolutional operations in the generator and strided convolutional operations in the discriminator, creating a mirror-symmetric structure between the two components. A key benefit of the DCGAN framework is its ability to stabilise training and improve feature learning. Incorporating deep convolutional layers has been shown to enhance the quality of generated images while mitigating issues typically associated with standard GANs. Additionally, techniques such as batch normalisation and the elimination of fully connected (FC) layers have been shown to contribute to improved convergence and performance [56]. In the DCGAN architecture, traditional pooling layers in the generator are replaced by transposed convolutions with a stride mechanism, and those in the discriminator are replaced by strided convolutions that reduce spatial resolution. These design choices pave the way for more stable training processes, establishing DCGAN as a fundamental model for many generative tasks. This strategic shift enables improved feature preservation and more controlled transformations. The mathematical underpinnings of these operations are governed by the following equations:
$X' = \dfrac{X - K + 2P}{b} + 1 \quad (1)$
$X' = b(X - 1) - 2P + F \quad (2)$
In Equation (1), $X \times X$ denotes the spatial dimensions of the input image, $K \times K$ the convolution kernel, $P$ the padding, $b$ the stride, and $X'$ the resulting output size. Equation (2) is the corresponding formula for transposed convolutions, whose kernel is represented as $F \times F$; the stride $b$ and padding $P$ play the same roles in both operations, and both expressions align with the standard output-size formulas for convolutions and transposed CNNs. Relying on these purely convolutional operations also avoids the excessive number of parameters that FC networks introduce, which would otherwise increase the likelihood of overfitting.
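As a quick sanity check, the following minimal Python snippet evaluates Equations (1) and (2) and compares them against the shapes produced by PyTorch layers; the specific kernel, stride, and padding values, and the 28 × 28 input matching the MNIST frames used later, are illustrative choices rather than the paper's exact configuration.

```python
import torch
import torch.nn as nn

def conv_out_size(X, K, P, b):
    """Spatial output size of a convolution, following Equation (1): (X - K + 2P)/b + 1."""
    return (X - K + 2 * P) // b + 1

def deconv_out_size(X, F, P, b):
    """Spatial output size of a transposed convolution, following Equation (2): b(X - 1) - 2P + F."""
    return b * (X - 1) - 2 * P + F

# Example: a 28x28 frame through a stride-2 convolution and back up with a transposed convolution.
x = torch.randn(1, 1, 28, 28)
conv = nn.Conv2d(1, 16, kernel_size=4, stride=2, padding=1)
deconv = nn.ConvTranspose2d(16, 1, kernel_size=4, stride=2, padding=1)

assert conv(x).shape[-1] == conv_out_size(28, K=4, P=1, b=2)            # 14
assert deconv(conv(x)).shape[-1] == deconv_out_size(14, F=4, P=1, b=2)  # 28
```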
The FC layers are replaced by average pooling within these formulations to enhance the stability of the neural network model. An effective method of accelerating convergence is to establish direct connections between the generator input and the features extracted from the convolutional layers, while the discriminator output is linked to the feature maps derived from the same convolutional structures. To further improve performance, adaptive learning rate strategies can be employed to minimise oscillations during optimisation, and the careful selection of regularisation techniques improves generalisation and avoids issues associated with high variance in complex network architectures. In practice, applying normalisation throughout the architecture alleviates the excessive bias that can arise from a large number of network layers. The intermediate hidden layers of the generator use the Leaky Softplus activation function, which optimises gradient flow and prevents neuron inactivity [57], while the output layer uses the Sigmoid activation function to ensure smooth transitions in the produced data. In contrast, the discriminator uniformly applies the Maxout activation function across all layers to enhance model expressiveness and adaptability [58]. This refined approach provides superior spatial manipulation, allowing more precise control over feature distribution and mitigating the distortions typical of conventional downsampling methods.
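The sketch below illustrates these building blocks in PyTorch. Since Leaky Softplus is not a built-in activation, the formulation shown (Softplus plus a small linear leak) is an assumption, as is the channel-grouping Maxout variant; layer widths and kernel sizes are likewise illustrative rather than the paper's exact settings.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LeakySoftplus(nn.Module):
    """Assumed formulation: softplus with a small linear leak so gradients never vanish."""
    def __init__(self, alpha: float = 0.1):
        super().__init__()
        self.alpha = alpha
    def forward(self, x):
        return F.softplus(x) + self.alpha * x

class Maxout2d(nn.Module):
    """Channel-wise maxout: splits channels into `pieces` groups and keeps the element-wise maximum."""
    def __init__(self, pieces: int = 2):
        super().__init__()
        self.pieces = pieces
    def forward(self, x):
        n, c, h, w = x.shape
        return x.view(n, self.pieces, c // self.pieces, h, w).max(dim=1).values

def gen_block(c_in, c_out):
    """Generator block: a stride-2 transposed convolution doubles resolution; no pooling, no FC layers."""
    return nn.Sequential(
        nn.ConvTranspose2d(c_in, c_out, kernel_size=4, stride=2, padding=1, bias=False),
        nn.BatchNorm2d(c_out),
        LeakySoftplus(),
    )

def disc_block(c_in, c_out, pieces=2):
    """Discriminator block: a stride-2 convolution halves resolution instead of pooling."""
    return nn.Sequential(
        nn.Conv2d(c_in, c_out * pieces, kernel_size=4, stride=2, padding=1, bias=False),
        nn.BatchNorm2d(c_out * pieces),
        Maxout2d(pieces),
    )

# Example: a 14x14 feature map is upsampled to 28x28 and downsampled back to 14x14.
up = gen_block(32, 16)(torch.randn(2, 32, 14, 14))
down = disc_block(16, 32)(up)
assert up.shape == (2, 16, 28, 28) and down.shape == (2, 32, 14, 14)
```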

3.2. Enhanced Feature Extraction and Sequence Decoupling

In order to predict the future position of handwritten digits in video sequences, a decoupling-based approach is employed to separate the target object from the background. This method optimises computational efficiency by minimising the number of parameters required for training and accelerating the network model’s convergence. The decoupling module uses an autoencoder-style encoder to isolate the target’s dynamic features while preserving background information. As illustrated in Figure 1, the framework consists of several key components.
This diagram illustrates the process by which the input video is first processed via the ChebyPooling layer to extract robust spatial features. It is then separated into foreground and background components using a modified DCGAN. Next, temporal modelling is performed using ConvCARU. Finally, the video is fed into classification layers that combine FC and Conv2D operations.
  • Video feature extraction: The initial processing step involves using ConvCARU to refine the input features by applying a series of convolutional operations, followed by a ChebyPooling layer [59], batch normalisation, and activation functions. This structure ensures the efficient encoding of temporal dependencies within the video frames.
  • Spatial feature prediction: A sequence of convolutional layers extracts spatial attributes using multi-channel feature mapping and activation functions, such as Maxout, as well as normalisation techniques. These spatial descriptors provide valuable information about the object’s positional variations.
  • Feature fusion: The extracted video and spatial features are integrated to improve contextual perception. This mechanism uses attention layers to prioritise the most relevant information from different sources.
  • Feature aggregation: A DCGAN-based aggregation unit is used to process feature fusion, with the aim of further refining structured patterns before classification.
  • Final classification: The final prediction is generated by processing the refined feature representations through FC layers with dropout regularisation and activation functions (e.g., Sigmoid or Softmax).
To improve the efficiency and accuracy with which the target extraction module identifies and tracks target positions across frames, a discriminator is incorporated to form a complete DCGAN that optimises the feature extraction process. In our implementation, we focus on a unified scheme rather than employing multiple alternative approaches. Specifically, the architecture employs the following:
  • ConvCARU for temporal modelling: This component is responsible for capturing and propagating spatio-temporal dependencies with convolutional operations embedded in its gating mechanisms. By avoiding the use of traditional fully connected (FC) layers within the recurrent unit, it preserves crucial spatial information and improves parameter efficiency.
  • Modified DCGAN for background and foreground separation: Rather than integrating several potential pooling or segmentation strategies, we adopted a modified version of DCGAN that replaces conventional pooling with strided convolutions. This approach stabilises training and effectively decouples the dynamic (foreground) from the static (background) features.
  • ChebyPooling for feature abstraction: We incorporate a ChebyPooling layer, which uses the Chebyshev polynomial formulation to enhance the robustness of the pooling process, ensuring that spatial details are accurately retained.
Moreover, this structured approach enhances the accuracy and computational efficiency with which an object’s motion trajectory can be predicted. The framework’s ability to dynamically track objects across video sequences is strengthened by the integration of attention mechanisms and advanced feature aggregation techniques. The prediction module, in turn, is designed to forecast the future positions of target objects using refined feature representations extracted from video sequences. The discriminator is trained using a modified binary cross-entropy loss function, formulated as follows:
$l = -h \log \hat{h} - (1.0 - h) \log(1.0 - \hat{h}) \quad (3)$
where $h$ represents the true feature representation extracted from adjacent frames, and $\hat{h}$ denotes the motion feature predicted by the network. These features undergo a normalisation process to ensure their values remain within the range $[0.0, 1.0]$, which is a necessary condition for the binary classification framework before computing the loss. Furthermore, ChebyPooling is employed to enhance spatial feature abstraction by leveraging the mathematical properties of Chebyshev polynomials, thereby reducing sensitivity to input variations. This process captures the dynamic aspects of the video sequence while preserving feature information from content motion. As this work targets the extraction of depth information from colour features in the video sequence, the focus is primarily on motion feature extraction, so noisy information along the time axis of the frame sequence is discarded. VGGreNet applies Chebyshev pooling [59,60] over an $H \times W$ window of the current frame as follows:
$\mu = \mathrm{AvgPool}(f) \quad (4a)$
$\sigma^2 = \mathrm{AvgPool}(f^2) - \mu^2 \quad (4b)$
$t = \mathrm{Softplus}(\mathrm{MaxPool}(f)) \quad (4c)$
$\mathrm{ChebyPool}(f) = \dfrac{\sigma^2}{\sigma^2 + (t - \mu)^2} \quad (4d)$
where AvgPool and MaxPool denote the average and maximum pooling, respectively. The resulting feature, $\mathrm{ChebyPool}(f) \in \mathbb{R}^{H \times W}$, projects the input into a probability domain that can be used as a weighting feature. It draws on Chebyshev-type depth bounds, whereby the output is confined to a stable range without requiring a sigmoid function, while still providing a readable probabilistic result for subsequent processing. For comparison, the conventional sigmoid is as follows:
$f(x) = \dfrac{1}{1 + \exp(-x)} \quad (5)$
This activation function ensures that outputs remain within the $[0.0, 1.0]$ interval, which is essential for subsequent binary classification tasks.
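A minimal PyTorch sketch of this pooling operator is given below, assuming a 2 × 2 pooling window and the Cantelli-style ratio reconstructed in Equation (4d); the class name ChebyPool2d and the small stabilising constant are illustrative choices rather than the paper's exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ChebyPool2d(nn.Module):
    """Chebyshev-inspired pooling following Equations (4a)-(4d): local variance is weighed
    against the softplus-bounded local maximum, yielding weights in [0, 1] without a sigmoid."""
    def __init__(self, kernel_size: int = 2, eps: float = 1e-6):
        super().__init__()
        self.k, self.eps = kernel_size, eps

    def forward(self, f):
        mu = F.avg_pool2d(f, self.k)                                   # (4a) local mean
        var = (F.avg_pool2d(f * f, self.k) - mu * mu).clamp_min(0.0)   # (4b) local variance
        t = F.softplus(F.max_pool2d(f, self.k))                        # (4c) bounded local maximum
        return var / (var + (t - mu) ** 2 + self.eps)                  # (4d) probability-like weight

# Example: pooling a batch of single-channel 28x28 feature maps into 14x14 weights.
weights = ChebyPool2d(kernel_size=2)(torch.randn(8, 1, 28, 28))
assert weights.shape == (8, 1, 14, 14) and weights.min() >= 0.0 and weights.max() <= 1.0
```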

3.3. Feature Extraction and Adaptive Fusion

Taking into account the spatiotemporal characteristics of videos, the challenge of maintaining spatial and temporal coherence is addressed by replacing traditional linear operations in RNN units with convolutional mechanisms. This optimises parameter efficiency and accelerates convergence, ensuring more effective feature retention. ConvCARU enhances this approach further by incorporating convolutional operations within ‘cell’ units instead of FC layers, thereby maintaining crucial spatial information that evolves over time. At its core, ConvCARU processes current-time-step data and the hidden state from the previous time step using a feature extraction framework comprising distinct modules. Combining ConvCARU-based video feature extraction with Conv2D networks for spatial prediction improves the system’s ability to interpret dynamic sequences overall. Moreover, context-adaptive fusion is realised through a multi-level attention gate, which seamlessly integrates spatial and temporal elements. The primary components include the following:
$x_t = W e_t + B \quad (6a)$
$n_t = \tanh(W e_{t-1} + B + x_t) \quad (6b)$
$z_t = \sigma(W e_{t-1} + B + W e_t + B) \quad (6c)$
$l_t = \sigma(x_t) \otimes z_t \quad (6d)$
$e_t = (1 - l_t) \otimes e_{t-1} + l_t \otimes n_t \quad (6e)$
The received feature is encoded via Equations (6a)–(6e), with the bold variables W and B indicating trainable parameters (each occurrence denoting its own set of weights). The activation functions tanh and $\sigma$ push their inputs towards the extremes of their respective ranges, $[-1, 1]$ and $[0, 1]$. The symbol ⊗ denotes element-wise (Hadamard) multiplication, which applies the gate values to the corresponding feature maps. This procedure facilitates the integration of both images and content via content-adaptive gates and involves the following steps:
(6a) The input undergoes processing through a linear (here, convolutional) layer. The resulting output determines the next hidden state, which is passed to the content-adaptive gate. If the previous hidden state $e_{t-1}$ is unavailable, the output directly becomes $e_t$.
(6b) The layer combines the previous hidden state $e_{t-1}$ with the output of (6a), and the sum is processed by the tanh activation function to extract integrated information.
(6c) The CARU update gate facilitates hidden-state transitions by combining the current input with the previous hidden state. This step uncovers relationships within the content but introduces challenges in handling long-term dependencies.
(6d) To address long-term dependency issues, this step evaluates the features of the current input, which are dynamically adjusted through $z_t$, boosting or attenuating the input for accurate predictions during RNN decoding.
(6e) The final hidden state is derived through linear interpolation between $e_{t-1}$ and the candidate state $n_t$, weighted by the content-adaptive gate $l_t$.
ConvCARU is designed to extend CARU’s ability to capture long-term dependencies within sequential data. It incorporates early fusion strategies to provide a more comprehensive understanding of appearance and motion patterns. Unlike conventional techniques, this model enhances spatial feature extraction from high-level representations during video prediction. Despite its structural similarities to the basic CARU architecture, it has several key internal differences, as presented in Figure 2. Rather than using conventional FC layers, convolutional operations within ‘cell’ units govern the processing of input signals, resulting in improved feature extraction fidelity. These mathematical components are central to the design of ConvCARU and contribute to its effectiveness in modelling complex video sequences.
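The following sketch illustrates one ConvCARU step following Equations (6a)–(6e), with every linear projection replaced by a 3 × 3 convolution. Treating the incoming feature map and the hidden state as separate tensors, and using one convolution per branch, are assumptions made to keep the example runnable; the paper's exact parameterisation may differ.

```python
import torch
import torch.nn as nn

class ConvCARUCell(nn.Module):
    """One ConvCARU step per Equations (6a)-(6e), with convolutions in place of linear layers."""
    def __init__(self, in_ch: int, hid_ch: int, k: int = 3):
        super().__init__()
        p = k // 2
        self.proj_in = nn.Conv2d(in_ch, hid_ch, k, padding=p)   # (6a) projection of the current feature
        self.proj_n = nn.Conv2d(hid_ch, hid_ch, k, padding=p)   # (6b) contribution of the previous state
        self.gate_h = nn.Conv2d(hid_ch, hid_ch, k, padding=p)   # (6c) hidden-state branch of the gate
        self.gate_x = nn.Conv2d(in_ch, hid_ch, k, padding=p)    # (6c) input branch of the gate

    def forward(self, feat, state=None):
        x = self.proj_in(feat)                                     # (6a)
        if state is None:
            return x                                               # no previous state: output becomes e_t
        n = torch.tanh(self.proj_n(state) + x)                     # (6b) candidate content
        z = torch.sigmoid(self.gate_h(state) + self.gate_x(feat))  # (6c) update gate
        l = torch.sigmoid(x) * z                                   # (6d) content-adaptive gate
        return (1.0 - l) * state + l * n                           # (6e) interpolated new hidden state

# Example: unrolling the cell over a short sequence of 28x28 feature maps.
cell, state = ConvCARUCell(in_ch=16, hid_ch=32), None
for frame_feat in torch.randn(5, 2, 16, 28, 28):   # 5 time steps, batch of 2
    state = cell(frame_feat, state)
assert state.shape == (2, 32, 28, 28)
```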

3.4. Feature Aggregation and DCGAN

The feature aggregation module uses features from various attention mechanisms in a multimodal way to generate a unified representation. To further refine these representations, the module incorporates a DCGAN framework with an adversarial learning strategy. The generator is driven to synthesise feature aggregates that closely mimic the real aggregated feature distribution, while the discriminator strengthens the model’s ability to distinguish between genuine and synthesised features.

Role of DCGAN in Feature Aggregation

The adversarial training process can be formulated as a minimax game between two key agents: the generator G and the discriminator D. In this context, the objective of the generator is to produce realistic aggregated features from latent input features, while the objective of the discriminator is to classify whether the provided features originate from the true aggregated distribution or from the generator’s synthesis.
$V(D, G) = \mathbb{E}_{x \sim P_{\mathrm{data}}}[\log D(x)] + \mathbb{E}_{z \sim p_z}[\log(1 - D(G(z)))] \quad (7)$
Here, $\mathbb{E}_{x \sim P_{\mathrm{data}}}$ denotes the expectation of a function over a random variable sampled from the real data distribution $P_{\mathrm{data}}$; $x$ represents features aggregated from the attention mechanisms, and $z$ is drawn from the prior distribution $p_z$. However, in the early stages of training, the gradient received by G through the term $\log(1 - D(G(z)))$ in Equation (7) may diminish, thus impeding learning. To counteract this effect, a non-saturating heuristic is adopted. The discriminator objective then becomes the following:
$\arg\min_{D} L_D = \mathbb{E}_{x \sim P_{\mathrm{data}}}[\log(1 - D(x))] \quad (8)$
This modification ensures that the generator receives stronger gradient signals during training. The generator objective is likewise replaced by its non-saturating counterpart to encourage more efficient synthesis of aggregated features:
$\arg\max_{G} L_G = \mathbb{E}_{z \sim p_z}[\log D(G(z))] \quad (9)$
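A minimal sketch of these adversarial objectives is given below, assuming the discriminator outputs probabilities in [0, 1]; the discriminator term follows the conventional two-term reading of the value function in Equation (7), and the generator term implements the non-saturating objective of Equation (9). The function names are illustrative.

```python
import torch

def discriminator_loss(d_real, d_fake, eps=1e-7):
    """Two-term discriminator objective from the value function (7): push D(x) -> 1 on real
    aggregated features and D(G(z)) -> 0 on synthesised ones (D outputs probabilities)."""
    return -(torch.log(d_real + eps).mean() + torch.log(1.0 - d_fake + eps).mean())

def generator_loss(d_fake, eps=1e-7):
    """Non-saturating generator objective from Equation (9): maximise log D(G(z)) by
    minimising its negative, so gradients stay informative even when D rejects the fakes."""
    return -torch.log(d_fake + eps).mean()

# Example: even when D confidently rejects the synthesised features, the non-saturating
# generator loss remains large and informative instead of flattening out.
d_real, d_fake = torch.tensor([0.9, 0.8]), torch.tensor([0.05, 0.10])
print(discriminator_loss(d_real, d_fake).item(), generator_loss(d_fake).item())
```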

3.5. Integration with Attention Mechanism

In this model, feature aggregation is primarily achieved by incorporating video and spatial attention mechanisms. The feature aggregation module plays a pivotal role in refining the extracted features by combining these two mechanisms. These mechanisms enhance the most relevant aspects of the input and filter out redundant information. The video attention mechanism assigns a relevance score to each feature, enabling the model to focus on the most significant temporal attributes. Mathematically, this attention is structured using a scaled dot-product approach, in which the query, key, and value matrices, denoted as $Q_v$, $K_v$, and $V_v$, work together to compute the attention weight matrix. This process can be formulated as follows:
$A_v = \mathrm{Softmax}\!\left(\dfrac{Q_v K_v^{T}}{\sqrt{d_k}}\right) V_v \quad (10)$
where $d_k$ represents the dimensionality of the key features. This ensures that the video feature space is dynamically weighted based on contextual importance. Similarly, the spatial attention mechanism processes spatial feature representations to improve the model’s ability to recognise relevant regions within frames. As with its video counterpart, the spatial attention computation follows the same scaled dot-product formulation:
$A_s = \mathrm{Softmax}\!\left(\dfrac{Q_s K_s^{T}}{\sqrt{d_k}}\right) V_s \quad (11)$
Within the overall architecture, the feature aggregation module processes the video attention features $A_v$ and spatial attention features $A_s$ using separate layers. These are then combined to create a comprehensive feature vector $F_{\mathrm{agg}}$, which effectively captures temporal and spatial dependencies. This consolidated feature set is then transformed to enhance contextual richness before being forwarded for further processing.
$F_{\mathrm{agg}} = \mathrm{Concat}(A_v, A_s) \quad (12)$
The DCGAN framework then refines the aggregated representation $F_{\mathrm{agg}}$, with the generator G producing structured features that align with the real aggregated feature distribution. Meanwhile, the discriminator D continuously adapts to learn the differences between the original aggregated features $F_{\mathrm{agg}}$ and the synthesised output of G. This dynamic interaction enables the DCGAN module to reduce feature redundancy, enhance the contextual understanding of aggregated information, and strengthen downstream classification and decision-making processes, while also fostering a robust adversarial training mechanism.
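To make the flow from Equations (10)–(12) concrete, the sketch below implements a single-head version of the scaled dot-product attention and a concatenation-based fusion. The eight-headed attention and 512-channel fusion reported in Section 4.1 are only mirrored by the output width here, and the class and variable names are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def scaled_dot_attention(q, k, v):
    """Scaled dot-product attention used for both the video (Eq. 10) and spatial (Eq. 11) branches."""
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / d_k ** 0.5
    return F.softmax(scores, dim=-1) @ v

class FeatureAggregation(nn.Module):
    """Fuses the video and spatial attention outputs into F_agg (Eq. 12) before DCGAN refinement;
    the 512-dimensional projection width mirrors the fusion channels reported in Section 4.1."""
    def __init__(self, d_video: int, d_spatial: int, d_out: int = 512):
        super().__init__()
        self.fuse = nn.Linear(d_video + d_spatial, d_out)

    def forward(self, a_v, a_s):
        return self.fuse(torch.cat([a_v, a_s], dim=-1))   # concatenate, then project

# Example: self-attention over video and spatial tokens, pooled over tokens before fusion.
v_tokens = torch.randn(2, 8, 128)    # (batch, tokens, dim) for the video branch
s_tokens = torch.randn(2, 16, 128)   # spatial branch tokens
a_v = scaled_dot_attention(v_tokens, v_tokens, v_tokens).mean(dim=1)
a_s = scaled_dot_attention(s_tokens, s_tokens, s_tokens).mean(dim=1)
f_agg = FeatureAggregation(d_video=128, d_spatial=128)(a_v, a_s)
assert f_agg.shape == (2, 512)
```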

4. Experimental Results and Discussion

This section provides a detailed analysis of the experimental results, offering an in-depth examination of the model’s performance and practical applications. Key metrics and their visual representations are explored first to provide a comprehensive evaluation of the effectiveness and limitations of the proposed approach. The results are then compared with relevant benchmarks to highlight this method’s advantages and areas for improvement. In this experiment, the MNIST dataset [61] is used as a fundamental benchmark for recognising handwritten digits. This dataset comprises 70,000 grayscale images, each measuring 28 × 28 pixels and featuring a wide range of handwriting styles. This structured composition provides a valuable testing platform for classification models, enabling researchers to optimise performance without worrying about data inconsistencies. Additionally, this study employs ConvCARU to improve localisation accuracy and classification robustness. The consistency of the MNIST dataset enables reliable evaluation and assessment of the method’s ability to adapt to different handwriting styles while maintaining high recognition accuracy.

4.1. Training Configuration and Strategy

The proposed method is implemented in Python 3.12.4 using the PyTorch framework [62] and essential libraries, and runs on four Nvidia RTX A4000 GPUs. The training process uses a batch size of 100 and initialises all model parameters, including weights and biases, using a specific strategy. To maintain stability, each layer incorporates group normalisation to ensure independence from variations in batch size. The weight decay factor is configured at $5 \times 10^{-4}$ for the encoder, while the proposed layer is unaffected by weight decay. Training the network takes around 20 h when executed across the four GPUs in parallel. However, memory limitations arise when the memory footprint of a video batch exceeds the combined 4 × 16 GB capacity of the GPUs. Further configuration details are as follows:
  • Training configuration: The design and implementation of the model incorporate several crucial aspects to enhance its performance and adaptability. The training process employs the Adam optimiser [63] with a learning rate of 0.001, setting $\beta_1$ to 0.5 and $\beta_2$ to 0.999. Each GPU is allocated a batch size of 32, and training extends across 100 epochs. To refine the learning dynamics, a cosine annealing schedule is applied alongside a warm-up phase to stabilise the initial training stages.
  • Model architecture: The architecture incorporates ConvCARU layers with hidden dimensions of 128 and 256, enabling efficient sequential processing. An eight-headed attention mechanism is incorporated to strengthen the model’s ability to focus on relevant features. The feature fusion component, structured with 512 channels, further enhances multi-level information integration. With a total of 2.8 million parameters, the model is designed to handle complex pattern recognition tasks effectively.
  • Data augmentation: Various data augmentation techniques are employed to further improve robustness. Training images undergo random rotation within a range of ±15°, while scaling variations between 0.9 and 1.1 are applied. Positional transformations introduce random translations of up to 10%, and Gaussian noise with a standard deviation of 0.01 is added to simulate real-world uncertainties. These augmentation strategies enhance the model’s ability to adapt to different input conditions, reducing its susceptibility to minor distortions or variations.
This design ensures a high-performing, stable, and well-generalised network across diverse environments. Integrating multi-head attention enhances feature selection, and the fusion mechanism improves representation, leading to greater accuracy and efficiency in practical applications. A warm-up phase also helps to mitigate divergence issues encountered in the early stages of training. Further efficiency improvements are achieved through half-precision floating-point computation and Distributed Data Parallel (DDP) training, which together reduce memory consumption and enhance computational throughput. To strengthen robustness, the baseline method applies a 30% runtime random modality dropout rate and random video flipping to minimise overfitting concerns. These strategic optimisations significantly enhance the training framework’s overall effectiveness, ensuring improved performance and adaptability in real-world applications.
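The configuration listed above can be summarised in PyTorch roughly as follows; the warm-up length, the split of parameters into encoder and predictor groups, and the use of torchvision transforms are assumptions made for illustration rather than the paper's exact training script.

```python
import math
import torch
from torch.optim import Adam
from torch.optim.lr_scheduler import LambdaLR
from torchvision import transforms

def build_optimizer(encoder, predictor):
    """Adam with lr = 0.001 and betas = (0.5, 0.999); weight decay of 5e-4 on the
    encoder only, with the proposed (predictor) layers exempt from weight decay."""
    return Adam(
        [{"params": encoder.parameters(), "weight_decay": 5e-4},
         {"params": predictor.parameters(), "weight_decay": 0.0}],
        lr=1e-3, betas=(0.5, 0.999),
    )

def build_scheduler(optimizer, warmup_epochs=5, total_epochs=100):
    """Cosine annealing preceded by a linear warm-up; the 5-epoch warm-up length is assumed."""
    def lr_lambda(epoch):
        if epoch < warmup_epochs:
            return (epoch + 1) / warmup_epochs
        progress = (epoch - warmup_epochs) / max(1, total_epochs - warmup_epochs)
        return 0.5 * (1.0 + math.cos(math.pi * progress))
    return LambdaLR(optimizer, lr_lambda)

# Augmentation mirroring the listed settings (applied to PIL images): ±15° rotation,
# 0.9-1.1 scaling, up to 10% translation, and additive Gaussian noise with sigma = 0.01.
augment = transforms.Compose([
    transforms.RandomAffine(degrees=15, translate=(0.1, 0.1), scale=(0.9, 1.1)),
    transforms.ToTensor(),
    transforms.Lambda(lambda x: x + 0.01 * torch.randn_like(x)),
])
```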

4.2. Performance Comparison

A carefully selected set of evaluation metrics was used to measure both the visual fidelity of the predicted frames and the computational efficiency of the models. Several quantitative metrics are recommended for evaluating the accuracy and visual similarity of reconstructed or compressed media compared to a reference. In this context, the structural similarity index (SSIM) and the peak signal-to-noise ratio (PSNR) are key indicators of visual quality, with higher values denoting improved reconstruction.
$\mathrm{SSIM}(x, y) = \dfrac{(2\mu_x\mu_y + C_1)(2\sigma_{xy} + C_2)}{(\mu_x^2 + \mu_y^2 + C_1)(\sigma_x^2 + \sigma_y^2 + C_2)} \quad (13)$
where
  • $\mu_x$ and $\mu_y$ are the mean values of images x and y,
  • $\sigma_x^2$ and $\sigma_y^2$ are the variances,
  • $\sigma_{xy}$ is the covariance,
  • $C_1$ and $C_2$ are small constants to avoid division by zero.
$\mathrm{PSNR} = 10 \log_{10}\!\left(\dfrac{MAX_I^2}{\mathrm{MSE}}\right) \quad (14)$
where
  • $MAX_I$ is the maximum possible pixel value,
  • MSE is the mean squared error.
In turn, the mean squared error (MSE) quantifies the deviation between pixels in images. This deviation is minimised in high-quality outputs, making lower MSE values preferable. In video analysis, the Fréchet video distance (FVD) is a deep learning-based perceptual metric where lower scores correspond to enhanced temporal coherence and realism.
$\mathrm{MSE} = \dfrac{1}{N}\sum_{i=1}^{N}(x_i - y_i)^2 \quad (15)$
where
  • $x_i$ and $y_i$ are pixel values of the original and compressed images,
  • $N$ is the number of pixels.
$\mathrm{FVD} = \lVert \mu_r - \mu_g \rVert^2 + \mathrm{Tr}\!\left(\Sigma_r + \Sigma_g - 2(\Sigma_r \Sigma_g)^{1/2}\right) \quad (16)$
where
  • $\mu_r$ and $\mu_g$ are the mean feature vectors of real and generated videos,
  • $\Sigma_r$ and $\Sigma_g$ are the covariance matrices,
  • $\mathrm{Tr}(\cdot)$ denotes the trace operation.
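For reference, the sketch below computes MSE, PSNR, and FVD as defined in Equations (14)–(16) with NumPy and SciPy (SSIM is available, e.g., via skimage.metrics.structural_similarity); the feature extractor that would supply the per-video FVD feature vectors is left abstract, and pixel values are assumed to lie in [0, 1].

```python
import numpy as np
from scipy.linalg import sqrtm

def mse(x, y):
    """Equation (15): mean squared error between two images with pixel values in [0, 1]."""
    return float(np.mean((x - y) ** 2))

def psnr(x, y, max_i=1.0):
    """Equation (14): peak signal-to-noise ratio in dB."""
    return float(10.0 * np.log10(max_i ** 2 / mse(x, y)))

def fvd(feat_real, feat_gen):
    """Equation (16): Frechet distance between real and generated video features
    (rows are per-video feature vectors, e.g. from a pretrained video encoder)."""
    mu_r, mu_g = feat_real.mean(axis=0), feat_gen.mean(axis=0)
    sigma_r = np.cov(feat_real, rowvar=False)
    sigma_g = np.cov(feat_gen, rowvar=False)
    covmean = sqrtm(sigma_r @ sigma_g).real   # matrix square root; discard numerical imaginary part
    return float(np.sum((mu_r - mu_g) ** 2) + np.trace(sigma_r + sigma_g - 2.0 * covmean))

# Example: PSNR of a slightly noisy reconstruction of a random frame.
frame = np.random.rand(28, 28)
noisy = np.clip(frame + 0.01 * np.random.randn(28, 28), 0.0, 1.0)
print(round(psnr(frame, noisy), 2))
```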
To provide a detailed evaluation of the effectiveness of the proposed model, extensive experiments were conducted to compare its performance with that of the current state-of-the-art (SOTA) approaches in video prediction.
As presented in Table 1, the results demonstrate competitive or improved performance across the considered metrics. The proposed model achieves an SSIM value of 0.901, which is very close to the 0.902 reported by DMVFN, highlighting its ability to maintain structural integrity in predicted frames. The model nevertheless demonstrates higher overall image fidelity, achieving a PSNR of 29.31 dB and surpassing the 29.11 dB achieved by DMVFN. Although DMVFN achieves a lower MSE (0.039 vs. 0.041), our model offers superior video prediction quality, as evidenced by its significantly lower FVD score of 155.5. Although it requires 13.66% more computational time than DMVFN, this is justified by the improvements in PSNR and FVD, two critical metrics for maintaining visual consistency and realism in generated frames. Furthermore, the refined architecture improves the modelling of temporal dependencies, enabling more precise prediction of object trajectories and transformations. The feature decoupling mechanism strengthens spatial representation, preserving essential structural elements throughout the prediction process. Meanwhile, the optimised gating strategy mitigates information degradation to ensure superior temporal consistency across generated sequences. Overall, these results highlight the effectiveness of our approach in achieving a balance between predictive performance and computational efficiency. Further discussion of these aspects will clarify the advantages of our methodology and its potential to advance the field of video prediction.

5. Conclusions

This work introduces a novel video prediction model based on ConvCARU to address the challenges of predicting the movement of handwritten digits in video sequences. The model integrates a modified DCGAN for background–foreground separation and an optimised ConvCARU architecture that utilises convolutional operations and an advanced parameter-tuning strategy. The proposed model improves both prediction accuracy and computational efficiency. Extensive experiments conducted on the MNIST dataset demonstrate that the model achieves superior results in terms of structural fidelity and detail preservation compared with SOTA approaches. With an SSIM of 0.901 and a PSNR of 29.31 dB, the model demonstrates improved long-term prediction stability while consuming less memory than most competing approaches. These results validate the model’s ability to capture fine-grained spatiotemporal dependencies while maintaining high-resolution clarity. The proposed framework improves video-based handwriting recognition and broader sequence forecasting applications, providing a robust solution for complex predictive tasks in computer vision.

Author Contributions

Conceptualisation, S.-K.I., and K.-H.C.; Methodology, S.-K.I., and K.-H.C.; Software, K.-H.C.; Validation, S.-K.I.; Formal analysis, S.-K.I., and K.-H.C.; Investigation, K.-H.C.; Resources, S.-K.I., and K.-H.C.; Data curation, S.-K.I., and K.-H.C.; Writing—original draft, K.-H.C.; Writing—review and editing, S.-K.I.; Visualisation, K.-H.C.; Supervision, S.-K.I.; Project administration, S.-K.I.; Funding acquisition, S.-K.I. All authors have read and agreed to the published version of the manuscript.

Funding

This work is supported by the Macao Polytechnic University (RP/FCA-01/2025).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Data are contained within the article.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Sharma, V.; Gupta, M.; Kumar, A.; Mishra, D. Video Processing Using Deep Learning Techniques: A Systematic Literature Review. IEEE Access 2021, 9, 139489–139507. [Google Scholar] [CrossRef]
  2. Jiao, L.; Zhang, R.; Liu, F.; Yang, S.; Hou, B.; Li, L.; Tang, X. New Generation Deep Learning for Video Object Detection: A Survey. IEEE Trans. Neural Netw. Learn. Syst. 2022, 33, 3195–3215. [Google Scholar] [CrossRef] [PubMed]
  3. Zhao, X.; Wang, L.; Zhang, Y.; Han, X.; Deveci, M.; Parmar, M. A review of convolutional neural networks in computer vision. Artif. Intell. Rev. 2024, 57, 99. [Google Scholar] [CrossRef]
  4. Moetesum, M.; Diaz, M.; Masroor, U.; Siddiqi, I.; Vessio, G. A survey of visual and procedural handwriting analysis for neuropsychological assessment. Neural Comput. Appl. 2022, 34, 9561–9578. [Google Scholar] [CrossRef]
  5. Haddad, L.E.; Hanoune, M.; Ettaoufik, A. Computer Vision with Deep Learning for Human Activity Recognition: Features Representation. In Engineering Applications of Artificial Intelligence; Springer Nature: Cham, Switzerland, 2024; pp. 41–66. [Google Scholar] [CrossRef]
  6. Diaz, M.; Moetesum, M.; Siddiqi, I.; Vessio, G. Sequence-based dynamic handwriting analysis for Parkinson’s disease detection with one-dimensional convolutions and BiGRUs. Expert Syst. Appl. 2021, 168, 114405. [Google Scholar] [CrossRef]
  7. Hasan, T.; Rahim, M.A.; Shin, J.; Nishimura, S.; Hossain, M.N. Dynamics of Digital Pen-Tablet: Handwriting Analysis for Person Identification Using Machine and Deep Learning Techniques. IEEE Access 2024, 12, 8154–8177. [Google Scholar] [CrossRef]
  8. Tse, R.; Monti, L.; Im, M.; Mirri, S.; Pau, G.; Salomoni, P. DeepClass: Edge based class occupancy detection aided by deep learning and image cropping. In Proceedings of the Twelfth International Conference on Digital Image Processing (ICDIP 2020), Osaka, Japan, 19–22 May 2020; Fujita, H., Jiang, X., Eds.; SPIE: Bellingham, WA, USA, 2020; p. 13. [Google Scholar] [CrossRef]
  9. Sánchez-DelaCruz, E.; Loeza-Mejía, C.I. Importance and challenges of handwriting recognition with the implementation of machine learning techniques: A survey. Appl. Intell. 2024, 54, 6444–6465. [Google Scholar] [CrossRef]
  10. AlKendi, W.; Gechter, F.; Heyberger, L.; Guyeux, C. Advancements and Challenges in Handwritten Text Recognition: A Comprehensive Survey. J. Imaging 2024, 10, 18. [Google Scholar] [CrossRef]
  11. Huang, X.; Chan, K.H.; Wu, W.; Sheng, H.; Ke, W. Fusion of Multi-Modal Features to Enhance Dense Video Caption. Sensors 2023, 23, 5565. [Google Scholar] [CrossRef]
  12. Gao, Z.; Tan, C.; Wu, L.; Li, S.Z. SimVP: Simpler yet Better Video Prediction. In Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 3160–3170. [Google Scholar] [CrossRef]
  13. Tan, C.; Gao, Z.; Li, S.; Li, S.Z. SimVPv2: Towards Simple yet Powerful Spatiotemporal Predictive Learning. IEEE Trans. Multimed. 2025, 1–15. [Google Scholar] [CrossRef]
  14. Hu, X.; Huang, Z.; Huang, A.; Xu, J.; Zhou, S. A Dynamic Multi-Scale Voxel Flow Network for Video Prediction. In Proceedings of the 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 17–24 June 2023; pp. 6121–6131. [Google Scholar] [CrossRef]
  15. Barrere, K.; Soullard, Y.; Lemaitre, A.; Coüasnon, B. Training transformer architectures on few annotated data: An application to historical handwritten text recognition. Int. J. Doc. Anal. Recognit. (IJDAR) 2024, 27, 553–566. [Google Scholar] [CrossRef]
  16. Ahmed, S.F.; Alam, M.S.B.; Hassan, M.; Rozbu, M.R.; Ishtiak, T.; Rafa, N.; Mofijur, M.; Shawkat Ali, A.B.M.; Gandomi, A.H. Deep learning modelling techniques: Current progress, applications, advantages, and challenges. Artif. Intell. Rev. 2023, 56, 13521–13617. [Google Scholar] [CrossRef]
  17. Im, S.K.; Chan, K.H. More Probability Estimators for CABAC in Versatile Video Coding. In Proceedings of the 2020 IEEE 5th International Conference on Signal and Image Processing (ICSIP), Nanjing, China, 23–25 October 2020; pp. 366–370. [Google Scholar] [CrossRef]
  18. Chan, K.H.; Ke, W.; Im, S.K. CARU: A Content-Adaptive Recurrent Unit for the Transition of Hidden State in NLP. In Neural Information Processing; Springer International Publishing: Berlin/Heidelberg, Germany, 2020; pp. 693–703. [Google Scholar] [CrossRef]
  19. Liu, B.; Lv, J.; Fan, X.; Luo, J.; Zou, T. Application of an Improved DCGAN for Image Generation. Mob. Inf. Syst. 2022, 2022, 9005552. [Google Scholar] [CrossRef]
  20. Vilchis, C.; Perez-Guerrero, C.; Mendez-Ruiz, M.; Gonzalez-Mendoza, M. A survey on the pipeline evolution of facial capture and tracking for digital humans. Multimed. Syst. 2023, 29, 1917–1940. [Google Scholar] [CrossRef]
  21. Aldausari, N.; Sowmya, A.; Marcus, N.; Mohammadi, G. Video Generative Adversarial Networks: A Review. ACM Comput. Surv. 2022, 55, 1–25. [Google Scholar] [CrossRef]
  22. Ometov, A.; Shubina, V.; Klus, L.; Skibińska, J.; Saafi, S.; Pascacio, P.; Flueratoru, L.; Gaibor, D.Q.; Chukhno, N.; Chukhno, O.; et al. A Survey on Wearable Technology: History, State-of-the-Art and Current Challenges. Comput. Netw. 2021, 193, 108074. [Google Scholar] [CrossRef]
  23. Edriss, S.; Romagnoli, C.; Caprioli, L.; Zanela, A.; Panichi, E.; Campoli, F.; Padua, E.; Annino, G.; Bonaiuto, V. The Role of Emergent Technologies in the Dynamic and Kinematic Assessment of Human Movement in Sport and Clinical Applications. Appl. Sci. 2024, 14, 1012. [Google Scholar] [CrossRef]
  24. Pinheiro Cinelli, L.; Araújo Marins, M.; Barros da Silva, E.A.; Lima Netto, S. Variational Autoencoder. In Variational Methods for Machine Learning with Applications to Deep Networks; Springer International Publishing: Berlin/Heidelberg, Germany, 2021; pp. 111–149. [Google Scholar] [CrossRef]
  25. Zhang, Y.; Zhang, Y.; Yan, D.; Deng, S.; Yang, Y. Revisiting Graph-based Recommender Systems from the Perspective of Variational Auto-Encoder. ACM Trans. Inf. Syst. 2023, 41, 1–28. [Google Scholar] [CrossRef]
  26. Wang, T.; Hou, B.; Li, J.; Shi, P.; Zhang, B.; Snoussi, H. TASTA: Text-Assisted Spatial and Temporal Attention Network for Video Question Answering. Adv. Intell. Syst. 2023, 5, 2200131. [Google Scholar] [CrossRef]
  27. Wang, Z.; Zhang, Z.; Qi, W.; Yang, F.; Xu, J. FreqGAN: Infrared and Visible Image Fusion via Unified Frequency Adversarial Learning. IEEE Trans. Circuits Syst. Video Technol. 2025, 35, 728–740. [Google Scholar] [CrossRef]
  28. Navidan, H.; Moshiri, P.F.; Nabati, M.; Shahbazian, R.; Ghorashi, S.A.; Shah-Mansouri, V.; Windridge, D. Generative Adversarial Networks (GANs) in networking: A comprehensive survey & evaluation. Comput. Netw. 2021, 194, 108149. [Google Scholar] [CrossRef]
  29. Khoramnejad, F.; Hossain, E. Generative AI for the Optimization of Next-Generation Wireless Networks: Basics, State-of-the-Art, and Open Challenges. IEEE Commun. Surv. Tutor. 2025, 1. [Google Scholar] [CrossRef]
  30. Chan, K.H.; Im, S.K. Using Four Hypothesis Probability Estimators for CABAC in Versatile Video Coding. ACM Trans. Multimed. Comput. Commun. Appl. 2023, 19, 1–17. [Google Scholar] [CrossRef]
  31. Im, S.K.; Chan, K.H. Multi-lambda search for improved rate-distortion optimization of H.265/HEVC. In Proceedings of the 2015 10th International Conference on Information, Communications and Signal Processing (ICICS), Singapore, 2–4 December 2015; pp. 1–5. [Google Scholar] [CrossRef]
  32. Xing, Z.; Feng, Q.; Chen, H.; Dai, Q.; Hu, H.; Xu, H.; Wu, Z.; Jiang, Y.G. A Survey on Video Diffusion Models. ACM Comput. Surv. 2024, 57, 1–42. [Google Scholar] [CrossRef]
  33. Liu, Y.; Feng, S.; Liu, S.; Zhan, Y.; Tao, D.; Chen, Z.; Chen, Z. Sample-Cohesive Pose-Aware Contrastive Facial Representation Learning. Int. J. Comput. Vis. 2025, 133, 3727–3745. [Google Scholar] [CrossRef]
  34. Li, S.; Yang, J.; Bao, H.; Xia, D.; Zhang, Q.; Wang, G. Cost-Sensitive Neighborhood Granularity Selection for Hierarchical Classification. IEEE Trans. Knowl. Data Eng. 2025, 1–12. [Google Scholar] [CrossRef]
  35. Archana, R.; Jeevaraj, P.S.E. Deep learning models for digital image processing: A review. Artif. Intell. Rev. 2024, 57, 11. [Google Scholar] [CrossRef]
  36. Xie, L.; Luo, Y.; Su, S.F.; Wei, H. Graph Regularized Structured Output SVM for Early Expression Detection with Online Extension. IEEE Trans. Cybern. 2023, 53, 1419–1431. [Google Scholar] [CrossRef]
  37. Kuba, R.; Rahimi, S.; Smith, G.; Shute, V.; Dai, C.P. Using the first principles of instruction and multimedia learning principles to design and develop in-game learning support videos. Educ. Technol. Res. Dev. 2021, 69, 1201–1220. [Google Scholar] [CrossRef]
  38. Chan, K.H. Using admittance spectroscopy to quantify transport properties of P3HT thin films. J. Photonics Energy 2011, 1, 011112. [Google Scholar] [CrossRef][Green Version]
  39. Ehrhardt, J.; Wilms, M. Autoencoders and variational autoencoders in medical image analysis. In Biomedical Image Synthesis and Simulation; Elsevier: Amsterdam, The Netherlands, 2022; pp. 129–162. [Google Scholar] [CrossRef]
  40. Wang, J.Z.; Zhao, S.; Wu, C.; Adams, R.B.; Newman, M.G.; Shafir, T.; Tsachor, R. Unlocking the Emotional World of Visual Media: An Overview of the Science, Research, and Impact of Understanding Emotion. Proc. IEEE 2023, 111, 1236–1286. [Google Scholar] [CrossRef] [PubMed]
  41. Huang, T.; Ru, S.R.; Zeng, Z.H.; Zhang, L. Research on motion recognition algorithm based on bag-of-words model. Microsyst. Technol. 2019, 27, 1647–1654. [Google Scholar] [CrossRef]
  42. Xie, L.; Hang, F.; Guo, W.; Lv, Y.; Ou, W.; Vignesh, C.C. Machine learning-based security active defence model - security active defence technology in the communication network. Int. J. Internet Protoc. Technol. 2022, 15, 169. [Google Scholar] [CrossRef]
  43. Liu, Y.; Liu, B. A modified uncertain maximum likelihood estimation with applications in uncertain statistics. Commun. Stat. Theory Methods 2023, 53, 6649–6670. [Google Scholar] [CrossRef]
  44. Cheng, K.; Xue, X.; Chan, K. Zero emission electric vessel development. In Proceedings of the 2015 6th International Conference on Power Electronics Systems and Applications (PESA), Hong Kong, China, 15–17 December 2015; pp. 1–5. [Google Scholar] [CrossRef]
  45. Im, S.K.; Pearmain, A.J. Unequal error protection with the H.264 flexible macroblock ordering. In Proceedings of the Visual Communications and Image Processing 2005, Beijing, China, 12–15 July 2005; p. 110. [Google Scholar] [CrossRef]
  46. Xu, J.; Park, S.H.; Zhang, X. A Temporally Irreversible Visual Attention Model Inspired by Motion Sensitive Neurons. IEEE Trans. Ind. Inform. 2020, 16, 595–605. [Google Scholar] [CrossRef]
  47. Chan, K.; Im, S. Sentiment analysis by using Naïve-Bayes classifier with stacked CARU. Electron. Lett. 2022, 58, 411–413. [Google Scholar] [CrossRef]
  48. Li, L.; Chang, J.; Vakanski, A.; Wang, Y.; Yao, T.; Xian, M. Uncertainty quantification in multivariable regression for material property prediction with Bayesian neural networks. Sci. Rep. 2024, 14, 10543. [Google Scholar] [CrossRef]
  49. Gomes, A.; Ke, W.; Im, S.K.; Siu, A.; Mendes, A.J.; Marcelino, M.J. A teacher’s view about introductory programming teaching and learning—Portuguese and Macanese perspectives. In Proceedings of the 2017 IEEE Frontiers in Education Conference (FIE), Indianapolis, IN, USA, 18–21 October 2017; pp. 1–8. [Google Scholar] [CrossRef]
  50. Zahari, Z.; Shafie, N.A.B.; Razak, N.B.A.; Al-Sharqi, F.A.; Al-Quran, A.A.; Awad, A.M.A.B. Enhancing Digital Social Innovation Ecosystems: A Pythagorean Neutrosophic Bonferroni Mean (PNBM)-DEMATEL Analysis of Barriers Factors for Young Entrepreneurs. Int. J. Neutrosophic Sci. 2024, 23, 170–180. [Google Scholar] [CrossRef]
  51. Liu, B.; Lam, C.T.; Ng, B.K.; Yuan, X.; Im, S.K. A Graph-Based Framework for Traffic Forecasting and Congestion Detection Using Online Images From Multiple Cameras. IEEE Access 2024, 12, 3756–3767. [Google Scholar] [CrossRef]
  52. Benbakhti, B.; Kalna, K.; Chan, K.; Towie, E.; Hellings, G.; Eneman, G.; De Meyer, K.; Meuris, M.; Asenov, A. Design and analysis of the As implant-free quantum-well device structure. Microelectron. Eng. 2011, 88, 358–361. [Google Scholar] [CrossRef]
  53. Liu, Q.; Netrapalli, P.; Szepesvari, C.; Jin, C. Optimistic MLE: A Generic Model-Based Algorithm for Partially Observable Sequential Decision Making. In Proceedings of the 55th Annual ACM Symposium on Theory of Computing (STOC ’23), Orlando, FL, USA, 20–23 June 2023; pp. 363–376. [Google Scholar] [CrossRef]
  54. Li, D.; Hu, J.; Wang, C.; Li, X.; She, Q.; Zhu, L.; Zhang, T.; Chen, Q. Involution: Inverting the Inherence of Convolution for Visual Recognition. In Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021. [Google Scholar] [CrossRef]
  55. Li, Y.; Wang, Y.; Yang, X.; Im, S.K. Speech emotion recognition based on Graph-LSTM neural network. EURASIP J. Audio Speech Music Process. 2023, 2023, 40. [Google Scholar] [CrossRef]
  56. Xu, C.; Qiao, Y.; Zhou, Z.; Ni, F.; Xiong, J. Enhancing Convergence in Federated Learning: A Contribution-Aware Asynchronous Approach. Comput. Life 2024, 12, 1–4. [Google Scholar] [CrossRef]
  57. Zheng, H.; Yang, Z.; Liu, W.; Liang, J.; Li, Y. Improving deep neural networks using softplus units. In Proceedings of the 2015 International Joint Conference on Neural Networks (IJCNN), Killarney, Ireland, 12–17 July 2015; pp. 1–4. [Google Scholar] [CrossRef]
  58. Sun, W.; Su, F.; Wang, L. Improving deep neural networks with multi-layer maxout networks and a novel initialization method. Neurocomputing 2018, 278, 34–40. [Google Scholar] [CrossRef]
  59. Chan, K.H.; Pau, G.; Im, S.K. Chebyshev Pooling: An Alternative Layer for the Pooling of CNNs-Based Classifier. In Proceedings of the 2021 IEEE 4th International Conference on Computer and Communication Engineering Technology (CCET), Beijing, China, 13–15 August 2021; pp. 106–110. [Google Scholar] [CrossRef]
  60. Chan, K.H.; Im, S.K.; Ke, W. Variable-Depth Convolutional Neural Network for Text Classification. In Neural Information Processing; Springer International Publishing: Berlin/Heidelberg, Germany, 2020; pp. 685–692. [Google Scholar] [CrossRef]
  61. Deng, L. The MNIST Database of Handwritten Digit Images for Machine Learning Research [Best of the Web]. IEEE Signal Process. Mag. 2012, 29, 141–142. [Google Scholar] [CrossRef]
  62. Ansel, J.; Yang, E.; He, H.; Gimelshein, N.; Jain, A.; Voznesensky, M.; Bao, B.; Bell, P.; Berard, D.; Burovski, E.; et al. PyTorch 2: Faster Machine Learning Through Dynamic Python Bytecode Transformation and Graph Compilation. In Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2 (ASPLOS ’24), La Jolla, CA, USA, 27 April–1 May 2024; pp. 929–947. [Google Scholar] [CrossRef]
  63. Kingma, D.P.; Ba, J. Adam: A Method for Stochastic Optimization. arXiv 2014, arXiv:1412.6980. [Google Scholar]
  64. Shi, X.; Chen, Z.; Wang, H.; Yeung, D.Y.; Wong, W.K.; Woo, W.C. Convolutional LSTM Network: A Machine Learning Approach for Precipitation Nowcasting. In Proceedings of the Advances in Neural Information Processing Systems; Cortes, C., Lawrence, N., Lee, D., Sugiyama, M., Garnett, R., Eds.; Curran Associates, Inc.: Red Hook, NY, USA, 2015; Volume 28. [Google Scholar]
  65. Le Guen, V.; Thome, N. Disentangling Physical Dynamics From Unknown Factors for Unsupervised Video Prediction. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 11471–11481. [Google Scholar] [CrossRef]
  66. Wang, Y.; Long, M.; Wang, J.; Gao, Z.; Yu, P.S. PredRNN: Recurrent Neural Networks for Predictive Learning using Spatiotemporal LSTMs. In Proceedings of the Advances in Neural Information Processing Systems; Guyon, I., Luxburg, U.V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., Garnett, R., Eds.; Curran Associates, Inc.: Red Hook, NY, USA, 2017; Volume 30. [Google Scholar]
  67. Yu, W.; Lu, Y.; Easterbrook, S.; Fidler, S. Efficient and Information-Preserving Future Frame Prediction and Beyond. In Proceedings of the International Conference on Learning Representations, Addis Ababa, Ethiopia, 26 April–1 May 2020. [Google Scholar]
  68. Wang, Y.; Zhang, J.; Zhu, H.; Long, M.; Wang, J.; Yu, P.S. Memory in Memory: A Predictive Neural Network for Learning Higher-Order Non-Stationarity From Spatiotemporal Dynamics. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 9146–9154. [Google Scholar] [CrossRef]
  69. Wu, H.; Yao, Z.; Wang, J.; Long, M. MotionRNN: A Flexible Model for Video Prediction with Spacetime-Varying Motions. In Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021; pp. 15430–15439. [Google Scholar] [CrossRef]
  70. Tang, S.; Li, C.; Zhang, P.; Tang, R. SwinLSTM: Improving Spatiotemporal Prediction Accuracy using Swin Transformer and LSTM. In Proceedings of the 2023 IEEE/CVF International Conference on Computer Vision (ICCV), Paris, France, 1–6 October 2023; pp. 13424–13433. [Google Scholar] [CrossRef]
Figure 1. The proposed neural architecture for spatiotemporal feature learning using enhanced discriminators.
Figure 2. The internal structure of the proposed ConvCARU.
Table 1. Performance comparison with state-of-the-art methods in terms of SSIM, PSNR, MSE, FVD, and memory usage.
Model            SSIM↑   PSNR↑ (dB)   MSE↓    FVD↓    Memory Usage (GB)
ConvLSTM [64]    0.873   26.46        0.055   185.2   9.7
PhyDNet [65]     0.885   27.52        0.048   175.4   12.3
PredRNN [66]     0.887   27.83        0.046   172.1   10.5
CrevNet [67]     0.888   27.99        0.045   172.1   11.5
MIM [68]         0.892   24.29        0.041   169.5   13.7
SimVP [12]       0.893   28.34        0.043   168.2   11.2
MotionRNN [69]   0.899   28.46        0.042   159.7   10.1
SwinLSTM [70]    0.891   26.55        0.040   158.6   11.5
DMVFN [14]       0.902   29.11        0.039   156.3   10.8
Proposed Model   0.901   29.31        0.041   155.5   10.9
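
For readers who wish to reproduce the frame-quality figures reported in Table 1, the sketch below illustrates one way per-frame PSNR and SSIM can be computed with scikit-image. It is a minimal example under assumed settings (64 × 64 greyscale frames scaled to [0, 1], a hypothetical evaluate_sequence helper and randomly generated placeholder frames), not the exact evaluation pipeline used in this work.

```python
# Illustrative sketch (assumed setup, not the authors' evaluation code):
# compute per-frame PSNR and SSIM for predicted vs. ground-truth frames.
import numpy as np
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def evaluate_sequence(pred, target, data_range=1.0):
    """pred, target: arrays of shape (T, H, W) with values in [0, data_range]."""
    psnr_scores, ssim_scores = [], []
    for p, t in zip(pred, target):
        psnr_scores.append(peak_signal_noise_ratio(t, p, data_range=data_range))
        ssim_scores.append(structural_similarity(t, p, data_range=data_range))
    # Average over the predicted frames of one sequence
    return float(np.mean(psnr_scores)), float(np.mean(ssim_scores))

# Placeholder data: 10 future frames of 64x64 greyscale (hypothetical shapes)
rng = np.random.default_rng(0)
target = rng.random((10, 64, 64)).astype(np.float32)
pred = np.clip(target + 0.05 * rng.standard_normal(target.shape), 0.0, 1.0).astype(np.float32)

psnr, ssim = evaluate_sequence(pred, target)
print(f"PSNR: {psnr:.2f} dB, SSIM: {ssim:.3f}")
```

Averaging these per-frame scores over all predicted frames and all test sequences yields sequence-level values comparable in form to those listed in Table 1.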