Review

A Modular Perspective on the Evolution of Deep Learning: Paradigm Shifts and Contributions to AI

1 School of Digital Economics and Management, Suzhou City University, Suzhou 215104, China
2 School of Statistics and Data Science, Southwestern University of Finance and Economics, Chengdu 611130, China
3 Faculty of Data Sciences, Shimonoseki City University, Yamaguchi 751-8510, Japan
4 Graduate School of Information, Production and Systems, Waseda University, Tokyo 169-8050, Japan
* Authors to whom correspondence should be addressed.
Appl. Sci. 2025, 15(19), 10539; https://doi.org/10.3390/app151910539
Submission received: 13 August 2025 / Revised: 12 September 2025 / Accepted: 15 September 2025 / Published: 29 September 2025
(This article belongs to the Special Issue Advances in Deep Learning and Intelligent Computing)

Abstract

The rapid development of deep learning (DL) has demonstrated its modular contributions to artificial intelligence (AI) techniques, such as large language models (LLMs). DL variants have proliferated across domains such as feature extraction, normalization, lightweight architecture design, and module integration, yielding substantial advancements in these subfields. However, the absence of a unified review framework to contextualize DL’s modular evolutions within AI development complicates efforts to pinpoint future research directions. Existing review papers often focus on narrow technical aspects or lack systemic analysis of modular relationships, leaving gaps in our understanding of how these innovations collectively drive AI progress. This work bridges this gap by providing a roadmap for researchers to navigate DL’s modular innovations, with a focus on balancing scalability and sustainability amid evolving AI paradigms. To address this, we systematically analyze the extensive literature indexed in databases including Web of Science, Scopus, arXiv, the ACM Digital Library, IEEE Xplore, SpringerLink, and Elsevier, with the aim of (1) summarizing and updating recent developments in DL algorithms, with performance benchmarks on standard datasets; (2) identifying innovation trends in DL from a modular viewpoint; and (3) evaluating how these modular innovations contribute to broader advances in artificial intelligence, with particular attention to scalability and sustainability amid shifting AI paradigms.

1. Introduction

The development of artificial intelligence (AI) has undergone multiple paradigm shifts and technological breakthroughs, with deep learning emerging as a dominant force through its modular and reusable architectural components. While early AI systems like Newell and Simon’s “Logic Theorist” [1] and knowledge engineering systems such as DENDRAL and MYCIN [2] provided foundational reasoning frameworks, it is deep learning’s inherent modularity that has enabled unprecedented scalability and adaptability. This modularity manifests in reusable architectural blocks (convolutional layers, attention mechanisms), flexible normalization techniques, hybrid algorithm modules, transfer learning capabilities, and optimized training methodologies, making it uniquely suited for contemporary AI systems.
The evolution of deep learning traces back to neuroscience-inspired theories and computational breakthroughs that introduced early modular concepts. Rosenblatt’s Perceptron [3] established single-layer neural networks for linear classification, acting as a foundational building block despite its limitations in nonlinear modeling. Hopfield’s neural network [4] demonstrated how distributed representations could solve complex optimization problems, previewing the modular potential of neural systems. The backpropagation algorithm [5] enabled gradient optimization for multi-layer networks, introducing a modular training mechanism that would later support innovations like residual blocks and attention mechanisms. These early milestones laid the groundwork for modern modular architectures by establishing principles of layered processing and distributed learning. The resurgence of deep learning [6] represents a modern reinterpretation of connectionism, enabled by exponential growth in computational power and big data.
The development of convolutional neural networks (CNNs) exemplifies modularity through standardized convolutional stacks, pooling operations, and residual connections. AlexNet’s groundbreaking victory in the 2012 ImageNet competition [7] demonstrated how modular CNN components could be combined to achieve state-of-the-art performance. Subsequent architectures like VGGNet’s deep 3 × 3 convolutional stacks [8] and ResNet’s residual connections [9] further standardized these modular components, enabling the training of thousand-layer networks. Similarly, the Inception module’s multi-scale convolutions [10,11,12] introduced a new level of architectural flexibility by combining parallel processing pathways—principles that would later influence Transformer-based models through attention mechanisms.
In parallel, sequence modeling evolved from recurrent neural networks (RNNs) to Transformers through modular advancements. While LSTMs [13] and GRUs [14] addressed RNNs’ gradient issues, the Transformer [15] revolutionized NLP with self-attention-based parallel processing, effectively modularizing sequence modeling. Transformers’ attention heads and feed-forward networks function as reusable components that can be scaled across different model sizes (e.g., BERT [16], GPT series [17,18,19]). This modularity has enabled the development of generative AI and multimodal fusion systems by allowing researchers to combine attention mechanisms with CNN-based vision modules or graph neural networks (GNNs).
Modern deep learning progresses through four synergistic branches, each contributing distinct modular innovations: (1) Supervised learning (CNNs, Transformers) leverages standardized convolutional and attention blocks; (2) Generative learning (GANs [20], contrastive learning [21], masked autoencoders [22]) introduces reusable generative heads and contrastive objectives; (3) Reinforcement learning (DQN [23], PPO [24]) develops modular decision-making units like actor-critic architectures; (4) Graph neural networks (GCNs [25], GATs [26]) modularize non-Euclidean data processing through graph kernels and message-passing components. These modular components can be combined across branches, such as integrating GNNs with Transformers in multimodal large language models (LLMs), creating hybrid systems that leverage the strengths of different architectural paradigms.
Despite these advances, deep learning faces persistent challenges that highlight the dual promise and perils of modular design. First, data hunger remains a critical issue, where limited labeled data hinders small-sample applications (e.g., medical imaging [27]). While modularity enables transfer learning to partially address this, the field still struggles with efficient data utilization. Second, the computational intensity of training large models raises sustainability concerns [28]. Modular design can both mitigate and exacerbate this issue; while reusable components reduce redundancy, the proliferation of specialized modules increases overall complexity. Third, interpretability remains elusive, as modular “black-box” systems compound opacity in high-stakes domains [29,30]. Fourth, catastrophic forgetting persists in open-world learning [31], challenging the adaptability of modular systems. Finally, the diversity of deep learning variants has transformed the field into a highly empirical process of modular experimentation and fine-tuning—a phenomenon that highlights both the promise and chaos of modular AI design.
This paper aims to map deep learning’s evolution and its contributions to AI through a modular lens, focusing on how architectural innovations have shaped modern AI systems. By synthesizing insights from Web of Science, Scopus, Google Scholar, and other databases, we seek to empower researchers in navigating the complexity of modern deep learning architectures. Our contributions include (1) summarizing the fundamental modular components of deep learning algorithms and their performance across standard datasets, and (2) innovatively categorizing the evolutionary trajectory of deep learning into modular components such as feature extraction methods, normalization techniques, hybrid algorithm modules, transfer learning applications, and optimization strategies. The structure of this paper is as follows: Section 1 provides the introduction; Section 2 reviews deep learning model architectures and their variants from a modular perspective; and Section 3 analyzes innovation directions and how modularity shapes the future of AI development.

2. Literature Review

Existing reviews on deep learning have approached the topic from diverse angles, often emphasizing distinct aspects of the field. To provide a structured overview, we categorize prior work into five sub-themes: foundational reviews, technical challenges, model-specific advances, domain applications, and identified gaps in the existing literature.

2.1. Foundational Reviews

The foundational work by LeCun et al. [6] provided a comprehensive overview of deep learning, focusing on its biological inspiration and capacity to mimic hierarchical feature extraction in the brain. This review highlighted breakthroughs in supervised learning for tasks like image and speech recognition, particularly emphasizing the synergy between neuroscience and artificial intelligence in CNNs and RNNs. These architectures revolutionized pattern recognition and sequential data processing.
Subsequent reviews, including that by Goodfellow et al. [32], further established deep learning within the broader context of machine learning, offering a systematic introduction to neural networks and their applications. Jordan and Mitchell [33] provided a broader perspective on machine learning trends, situating deep learning within the evolving landscape of AI research.

2.2. Technical Challenges

Technical challenges in deep learning have been extensively explored in the literature. Minar [34] and Tian [35] delved into issues such as gradient vanishing/explosion, regularization techniques, and hyperparameter optimization. These reviews often emphasized practical engineering considerations like batch normalization and activation functions to stabilize training and improve generalization. Optimization dynamics have also received attention, with works such as that by Kingma and Ba [36] introducing adaptive optimization algorithms like Adam and that by Sutskever et al. [37] highlighting the importance of initialization and momentum in training deep networks.

2.3. Model-Specific Advances

Model-specific reviews have detailed advancements in various deep learning architectures and techniques. Reviews focusing on attention mechanisms [38], autoencoders [39], RWKV [40], diffusion models [41], and Kolmogorov Arnold Networks [42] have provided in-depth insights. Dosovitskiy et al. [43] demonstrated the application of Transformers to image recognition, while Radford et al. [44] introduced models leveraging natural language supervision for visual tasks. Ho et al. [45] advanced generative models with diffusion probabilistic approaches.

2.4. Domain Applications

Domain-specific applications of deep learning have been highlighted in several reviews. Chen [46] prioritized domain-specific applications, such as medical imaging and autonomous systems, showcasing how deep learning transforms industries through predictive maintenance, personalized healthcare, and real-time decision making. Zhao [47] examined the recent applications of transfer learning and self-supervised learning across various fields, offering guidance for selecting appropriate techniques to address data scarcity.

2.5. Gaps in Existing Work

Despite comprehensive coverage of deep learning in existing reviews, notable gaps remain. First, there is a lack of systematic analysis of “modular innovations” across the deep learning pipeline. While many surveys detail architectural advancements, they seldom comprehensively analyze innovations in feature extraction, activation function design, or optimization dynamics. For the purpose of this paper, we define a “module” in deep learning as a self-contained component that performs a specific function within the network, such as input encoding, normalization, activation, attention mechanisms, or output modeling. These modules can be combined and reused across different architectures.
Second, interdisciplinary mechanistic insights—such as the role of dopamine-driven meta-reinforcement learning in both biological and artificial systems [48]—are rarely integrated, leaving gaps in our understanding of how neuroscientific principles inspire algorithmic improvements. Hassabis et al. [49] have explored the intersection of neuroscience and AI, but such interdisciplinary connections are underrepresented in most reviews.
Third, emerging challenges like plasticity loss in continual learning, which are critical for real-world adaptability, are underexplored in broader reviews, despite their profound impact on sample efficiency and catastrophic forgetting. Additionally, interpretability remains a significant concern, with limited exploration of methods for interpreting and understanding deep neural networks [50,51].
This paper addresses these gaps by presenting a unique perspective on deep learning research. We focus on providing a comprehensive analysis of modular innovations and their impact on systemic performance, integrating interdisciplinary insights to offer a more holistic understanding of the field. Through this modular perspective, we aim to highlight how different components of deep learning architectures contribute to the overall performance and adaptability of AI systems.

2.6. Deep Neural Networks and Their Variants

Neural networks, as the core architecture of deep learning, have evolved from basic feed-forward neural networks (FNNs) to complex model architectures that adapt to multimodal data. Traditional FNNs achieve unidirectional mapping from input to output through multilayer perceptrons (MLPs), excelling at simple classification and regression tasks (e.g., house price prediction). However, traditional neural networks lack modeling capabilities for spatial, temporal, or topological structures.
y = \sigma(\mathbf{w}^{\top}\mathbf{x} + b)
Here, $\mathbf{x} \in \mathbb{R}^n$ is the input vector, $\mathbf{w} \in \mathbb{R}^n$ is the weight vector, $b \in \mathbb{R}$ is the bias, $\sigma(\cdot)$ is the activation function (such as Sigmoid or ReLU), and $y$ is the output.
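To make the mapping concrete, the following NumPy sketch implements a single fully connected layer with a sigmoid activation; the toy input, weights, and bias are arbitrary placeholders and are not taken from any cited model.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def dense_layer(x, w, b):
    """Single feed-forward layer: y = sigma(w^T x + b)."""
    return sigmoid(np.dot(w, x) + b)

# Toy example: 3-dimensional input, scalar output.
rng = np.random.default_rng(0)
x = rng.normal(size=3)   # input vector x in R^3
w = rng.normal(size=3)   # weight vector w in R^3
b = 0.1                  # bias
print(dense_layer(x, w, b))
```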
To address this, researchers have proposed a series of neural network variants, such as CNN, RNN, LSTM, GAN, VAE, etc. Residual networks (ResNet) utilize skip connections to resolve gradient vanishing issues in deep networks, establishing themselves as a benchmark in image classification; The capsule network [52] enhances the spatial invariance of entity attribute representations through dynamic routing mechanisms, addressing the limitations of CNNs in handling geometric transformations; Graph neural networks (GNNs) model non-Euclidean data via neighbor aggregation (GCN [25]), attention weighting (GAT [26]), or sampling strategies (GraphSAGE [53]), finding broad applications in social network analysis and drug discovery; Neural Turing Machines (NTMs [54]) simulate the storage–computation separation mechanism of computers using external memory modules, providing a novel paradigm for complex reasoning tasks. The evolution of neural network variants has always been problem-driven: ResNet tackles gradient vanishing, MobileNet optimizes computational efficiency, and GNN breaks the bottleneck in modeling unstructured data.

2.6.1. Autoencoders

Autoencoders (AEs) and their variants are a core paradigm for unsupervised representation learning. The Autoencoder framework contains two major modules: the encoding process and the decoding process. The input sample x is mapped to the feature space z by encoder (g), i.e., the encoding process, and then the abstract feature z is mapped back to the original space to obtain the reconstructed sample x’ by decoder (f), i.e., the decoding process. The optimization objective is then to optimize both the encoder and decoder by minimizing the reconstruction error to learn the abstract feature representation z for the sample input x.
f, g = \arg\min_{f,\, g} \; L\big(x,\, f(g(x))\big)
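As an illustrative sketch of this encode–decode objective (ours, not from the cited works), the snippet below uses simple linear maps to stand in for the encoder $g$ and decoder $f$ and evaluates the reconstruction error that training would minimize.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=8)            # input sample x
W_enc = rng.normal(size=(3, 8))   # encoder g: R^8 -> R^3 (latent z)
W_dec = rng.normal(size=(8, 3))   # decoder f: R^3 -> R^8

z = np.tanh(W_enc @ x)            # encoding: z = g(x)
x_rec = W_dec @ z                 # decoding: x' = f(z)

# Reconstruction loss L(x, f(g(x))), here the squared error,
# which training would minimize over the encoder/decoder weights.
loss = np.sum((x - x_rec) ** 2)
print(loss)
```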
Basic AEs [5] extract low-dimensional features through latent space mapping, used for dimensionality reduction and denoising. Denoising autoencoders [55] learn robust representations by processing noisy inputs, enhancing data cleaning. Sparse autoencoders [56], with sparsity constraints, improve feature interpretability and are suitable for medical imaging analysis. Recently, autoencoder concepts have been integrated with self-supervised learning. Masked autoencoders (MAEs [22]) reconstruct original pixels from randomly masked image patches, driving vision Transformer pre-training (e.g., ViT-MAE [22]). Contrastive autoencoders (CAEs [57]), combining contrastive learning (e.g., SimCLR [21]), optimize feature space alignment for cross-modal retrieval. These variants, through reconstruction target optimization, probabilistic modeling, and task-driven design, continue to expand autoencoders’ applications in generative models, anomaly detection, and cross-modal alignment.

2.6.2. Convolutional Neural Networks

Convolutional neural networks (CNNs), leveraging their local perception and parameter-sharing properties, have become foundational models in image processing. The basic 2D convolution operation is defined as follows:
O(i, j) = \sum_{m=0}^{M-1} \sum_{n=0}^{N-1} X(i+m,\, j+n) \cdot K(m, n)
Here, $X$ is the input matrix (e.g., an image) of size $H \times W$; $K$ is the convolution kernel (filter) of size $M \times N$; $O$ is the output feature map of size $(H - M + 1) \times (W - N + 1)$; $i, j$ are the position indices of the output feature map; and $m, n$ are the indices of the convolution kernel.
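The following NumPy sketch is a direct, unvectorized transcription of this formula, included only to make the indexing explicit; practical CNN libraries use far faster implementations, and the toy image and kernel here are arbitrary.

```python
import numpy as np

def conv2d_valid(X, K):
    """O(i, j) = sum_m sum_n X(i+m, j+n) * K(m, n)  ('valid' correlation)."""
    H, W = X.shape
    M, N = K.shape
    O = np.zeros((H - M + 1, W - N + 1))
    for i in range(O.shape[0]):
        for j in range(O.shape[1]):
            O[i, j] = np.sum(X[i:i + M, j:j + N] * K)
    return O

X = np.arange(25, dtype=float).reshape(5, 5)  # toy 5x5 "image"
K = np.array([[1., 0.], [0., -1.]])           # toy 2x2 kernel
print(conv2d_valid(X, K).shape)               # (4, 4) = (H-M+1, W-N+1)
```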
Early CNN architectures such as LeNet and AlexNet validated the effectiveness of stacking convolutional layers, while VGGNet enhanced the granularity of feature representation through stacked 3 × 3 small convolutional kernels. As network depth increased, the gradient vanishing problem became increasingly severe. Residual neural networks (ResNet) introduced cross-layer residual connections, enabling the training of thousand-layer networks. The Inception series improved feature diversity through multi-scale parallel convolutions (e.g., Inception modules). To address practical deployment needs, model lightweighting emerged as a critical research direction. MobileNet [58] reduced computational costs through depthwise separable convolutions, enabling real-time mobile inference. EfficientNet [59] achieved a Pareto-optimal balance via compound scaling of depth, width, and resolution. Current advancements focus on dynamic inference, multimodal fusion, and leveraging Neural Architecture Search (NAS) for automated high-performance design.
For example, the essence of the famous ResNet lies in enabling each network layer to learn residual relationships between inputs and outputs rather than directly learning the target output. The residual unit can be denoted as follows:
\begin{aligned}
y_l &= h(x_l) + F(x_l, W_l), \\
x_{l+1} &= f(y_l),
\end{aligned}
where $x_l$ and $x_{l+1}$ are the input and output of the $l$-th residual unit (each residual unit generally contains a multilayer structure), $F$ is the residual function denoting the learned residual, $W_l$ is the weight matrix of layer $l$, $h(x_l) = x_l$ denotes the identity mapping, and $f$ is the ReLU activation function.
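A minimal sketch of this residual unit is shown below; it assumes an identity shortcut $h(x) = x$, ReLU for $f$, and a toy two-layer residual function $F$ with arbitrary weights, so it is illustrative rather than the original ResNet block.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def residual_unit(x, W1, W2):
    """y = x + F(x, W); next input x_{l+1} = f(y) with f = ReLU."""
    F = W2 @ relu(W1 @ x)     # learned residual F(x_l, W_l)
    y = x + F                 # identity mapping h(x_l) = x_l plus residual
    return relu(y)            # x_{l+1} = f(y_l)

rng = np.random.default_rng(0)
x = rng.normal(size=4)
W1, W2 = rng.normal(size=(4, 4)), rng.normal(size=(4, 4))
print(residual_unit(x, W1, W2))
```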

2.6.3. Recurrent Neural Networks

As a classical model for sequential data processing, RNNs demonstrate unique advantages in tasks requiring temporal modeling, such as time series analysis, speech recognition, and machine translation, through their hidden state propagation mechanism across time steps. The recurrent architecture dynamically captures temporal dependencies within sequences. However, traditional RNNs struggle to model long-range dependencies due to the gradient vanishing problem. A simple RNN is as follows and consists of an input layer, a hidden layer, and an output layer:
\begin{aligned}
S_t &= f(U \cdot X_t + W \cdot S_{t-1}), \\
O_t &= g(V \cdot S_t),
\end{aligned}
where $X_t$ is the input vector at step $t$; $S_t$ is the output vector of the hidden layer; $O_t$ is the output-layer vector; $U$ is the weight matrix from the input layer to the hidden layer; $V$ is the weight matrix from the hidden layer to the output layer; and $W$ is the weight matrix that feeds the previous hidden state $S_{t-1}$ back as an input to the current step.
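The recurrence above can be unrolled in a few lines. This sketch (ours, with arbitrary toy dimensions) uses tanh for $f$ and the identity for $g$:

```python
import numpy as np

def rnn_forward(X_seq, U, W, V, s0):
    """S_t = f(U X_t + W S_{t-1}); O_t = g(V S_t), with f = tanh, g = identity."""
    s, outputs = s0, []
    for x_t in X_seq:
        s = np.tanh(U @ x_t + W @ s)   # hidden state carries memory of the past
        outputs.append(V @ s)          # output at step t
    return np.array(outputs)

rng = np.random.default_rng(0)
X_seq = rng.normal(size=(5, 3))        # sequence of 5 steps, 3-dim inputs
U = rng.normal(size=(4, 3)) * 0.5      # input-to-hidden weights
W = rng.normal(size=(4, 4)) * 0.5      # hidden-to-hidden weights
V = rng.normal(size=(2, 4)) * 0.5      # hidden-to-output weights
print(rnn_forward(X_seq, U, W, V, np.zeros(4)).shape)  # (5, 2)
```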
To address these limitations, researchers have developed critical variants: LSTM [60] and GRU leverage gating mechanisms (e.g., input/forget gates) to selectively retain or discard information, significantly enhancing long-term memory capabilities. Bidirectional RNN (Bi-RNN [61]) combines forward and backward sequence context via dual hidden states, improving global semantic understanding. Deep RNN stacks multiple recurrent layers to increase model complexity. Deep RNN, IndRNN [62], and Dilated LSTM [63] extend model complexity and long-sequence modeling capabilities by stacking layers, employing independent neuron recurrence, or using dilation mechanisms, respectively. Additionally, Echo State Networks (ESNs [64]) and Quasi-recurrent neural networks (QRNNs [65]) achieve efficient computation in chaotic system prediction and real-time tasks by randomly initializing reservoirs or leveraging convolutional parallelism. Zhang [66] introduced Recurrent Support Vector Machines (RSVMs), which combine recurrent neural networks (RNNs) with standard Support Vector Machines (SVMs). These improvements focus on three main objectives: alleviating gradient issues, enhancing long-term dependency capture, and optimizing computational efficiency.
Looking ahead, as computational power grows and the demand for lightweight algorithms is balanced against it, the evolution of RNNs may focus on dynamically adapting to diverse computing environments (e.g., edge devices) and complex data structures (e.g., irregular time series, high-dimensional spaces), sustaining their vitality through the synergy of computational power and algorithm design.

2.6.4. Long Short-Term Memory Networks

As the most representative variant of RNNs, the basic LSTM is defined by
\begin{aligned}
i_t &= \sigma(W_{xi} x_t + W_{hi} h_{t-1} + b_i), & C_t^{*} &= \tanh(W_{xc} x_t + W_{hc} h_{t-1} + b_c), \\
f_t &= \sigma(W_{xf} x_t + W_{hf} h_{t-1} + b_f), & C_t &= f_t \odot C_{t-1} + i_t \odot C_t^{*}, \\
o_t &= \sigma(W_{xo} x_t + W_{ho} h_{t-1} + b_o), & h_t &= o_t \odot \tanh(C_t).
\end{aligned}
Compared to basic RNNs, LSTMs introduce the cell state $C_t$, which can be viewed as the long-term state preserved alongside the activation $h_t$; both $C_t$ and $h_t$ are propagated to the next step throughout the process. In the formulas, $i_t$, $f_t$, and $o_t$ denote the input, forget, and output gates, each computed with the sigmoid function to produce values between 0 and 1 that filter (retain or discard) information: the forget and input gates control the proportions of the previous and candidate cell states in $C_t$, and the output gate controls how the cell generates $h_t$ from $C_t$.
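A single LSTM step following these gate equations can be sketched as below; the parameter shapes and random initialization are toy assumptions for illustration only.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, P):
    """One LSTM step following the gate equations above."""
    i = sigmoid(P["Wxi"] @ x_t + P["Whi"] @ h_prev + P["bi"])        # input gate
    f = sigmoid(P["Wxf"] @ x_t + P["Whf"] @ h_prev + P["bf"])        # forget gate
    o = sigmoid(P["Wxo"] @ x_t + P["Who"] @ h_prev + P["bo"])        # output gate
    c_tilde = np.tanh(P["Wxc"] @ x_t + P["Whc"] @ h_prev + P["bc"])  # candidate C_t^*
    c = f * c_prev + i * c_tilde        # new cell state C_t
    h = o * np.tanh(c)                  # new hidden state h_t
    return h, c

rng = np.random.default_rng(0)
d_x, d_h = 3, 4
P = {f"Wx{g}": rng.normal(size=(d_h, d_x)) for g in "ifoc"}
P.update({f"Wh{g}": rng.normal(size=(d_h, d_h)) for g in "ifoc"})
P.update({f"b{g}": np.zeros(d_h) for g in "ifoc"})
h, c = lstm_step(rng.normal(size=d_x), np.zeros(d_h), np.zeros(d_h), P)
print(h.shape, c.shape)
```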
LSTM has many variants. For example, GRU merges gate units and hidden states to balance efficiency and performance in resource-constrained scenarios. ConvLSTM replaces fully connected layers with convolutional operations to achieve joint spatiotemporal modeling, suitable for tasks like video prediction [67]. xLSTM [68] extends the traditional LSTM by incorporating additional gating mechanisms to better capture long-range dependencies, enhancing performance in sequential data with extended temporal relationships. Vision-LSTM [69] adapts xLSTM building blocks to computer vision. Phased LSTM introduces a time-gating mechanism to adapt to irregularly sampled data [70]. Grid LSTM and MD-LSTM extend traditional LSTMs for multidimensional data, applicable to tasks such as image generation [71,72]. Variants like Tree-LSTM [73] enhance expressiveness in complex tasks by incorporating techniques such as attention mechanisms. These variants reflect a task-driven technical path. For instance, BiLSTM [74] is suitable for NLP tasks requiring global context, while SRU [75], with simplified gates and parallel computation, is ideal for large-scale real-time scenarios.

2.6.5. Graph Neural Networks

Graph neural networks (GNNs) aim to learn representations for each node in a graph. These representations are computed based on the node’s features, edge features, and the features of its neighboring nodes. In graph neural networks (GNNs), the iterative process of updating node features is captured by the equations
\begin{aligned}
H_{t+1} &= \tilde{A}\, H_t\, W, \\
\tilde{A} &= I_n + D^{-\frac{1}{2}} A D^{-\frac{1}{2}},
\end{aligned}
where $H_{t+1}$ represents the node features at the next propagation step, $\tilde{A}$ is the normalized adjacency matrix that accounts for node degrees to ensure equitable information propagation, $H_t$ is the current feature matrix, $W$ is a learnable weight matrix, $I_n$ is the identity matrix, $A$ is the adjacency matrix, and $D$ is the degree matrix. This formulation allows for the effective aggregation of neighbor information while preserving node-specific features, which is crucial for learning robust node representations in graph-based tasks. The architecture of a GNN is shown in Figure 1.
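The propagation rule above translates almost directly into code. The sketch below (ours) applies one such layer to a toy path graph; the ReLU nonlinearity and the random features are added assumptions.

```python
import numpy as np

def gcn_layer(H, A, W):
    """One propagation step H_{t+1} = A_tilde H_t W, with
    A_tilde = I + D^{-1/2} A D^{-1/2} as in the equation above."""
    n = A.shape[0]
    deg = A.sum(axis=1)
    d_inv_sqrt = np.where(deg > 0, deg ** -0.5, 0.0)
    A_norm = d_inv_sqrt[:, None] * A * d_inv_sqrt[None, :]
    A_tilde = np.eye(n) + A_norm
    return np.maximum(A_tilde @ H @ W, 0.0)   # ReLU nonlinearity (assumed)

# Toy graph: 4 nodes on a path, 3-dim features, 2-dim output.
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
rng = np.random.default_rng(0)
H = rng.normal(size=(4, 3))
W = rng.normal(size=(3, 2))
print(gcn_layer(H, A, W).shape)   # (4, 2)
```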
GNNs can be categorized into several types, including graph convolutional networks (GCNs: spectral-based or spatial-based), graph attention networks (GATs) based on attention mechanisms, graph networks based on gated updates, and graph networks with skip connections. For example, ChebNet [76] uses Chebyshev polynomials to approximate the spectral convolution kernel, thereby reducing the number of parameters and lowering computational complexity.

2.6.6. Kolmogorov–Arnold Networks

Kolmogorov–Arnold Networks (KAN) are a novel neural network architecture based on the Kolmogorov–Arnold representation theorem, which states that any continuous multivariable function can be represented as a superposition of a finite number of continuous univariate functions. Specifically, for an n-dimensional continuous function $f(x_1, x_2, \ldots, x_n)$, there exists a set of continuous univariate functions $g_i$ and $h_{ij}$ such that
f(x_1, x_2, \ldots, x_n) = \sum_{i=1}^{2n+1} g_i\!\left( \sum_{j=1}^{n} h_{ij}(x_j) \right)
This theorem indicates that complex multivariate functions can be represented through combinations of simple univariate functions, providing new perspectives for neural network design [77]. Figure 2 illustrates the structure of a KAN.
In recent years, KAN has shown significant progress in fields like time series prediction and medical diagnosis. For instance, the MT-KAN model proposed by [78] uses a two-layer network with spline-parametrized functions to capture complex temporal dependencies in multivariate time series. In medical diagnosis, Tang [79] integrated KAN into the U-Net model (U-KAN) for superior 3D brain tumor segmentation using multimodal MRI data. Shuai [80] introduced the PIKAN model, which replaces traditional MLP layers with KAN layers for more accurate power system dynamics prediction with a smaller network size. Future research will focus on optimizing the KAN’s structure and training methods, exploring new applications in areas like finance and environmental monitoring, and combining it with fields such as quantum computing and bioinformatics.

2.6.7. Bayesian Neural Networks

Bayesian neural networks (BNNs) can be simply understood as introducing uncertainty into the weights of neural networks for regularization. This characteristic makes BNNs excel in handling small datasets, active learning, and combating overfitting.
A neural network can be viewed as a conditional distribution model $P(y \mid x, w)$: the distribution of the predicted output $y$ given the input $x$, where $w$ represents the weights within the neural network. Based on the weights $w$, the predictive distribution for $\hat{y}$ given an input $\hat{x}$ and dataset $\mathcal{D}$ becomes $P(\hat{y} \mid \hat{x}) = \mathbb{E}_{P(w \mid \mathcal{D})}\big[P(\hat{y} \mid \hat{x}, w)\big]$. According to Bayes' theorem, the posterior over $w$ given $\mathcal{D}$ is obtained through
P(w \mid \mathcal{D}) = \frac{P(w, \mathcal{D})}{P(\mathcal{D})} = \frac{P(\mathcal{D} \mid w)\, P(w)}{P(\mathcal{D})}.
Using variational methods, we can approximate the true posterior distribution $P(w \mid \mathcal{D})$ with a distribution $q(w \mid \theta)$ controlled by a set of parameters $\theta$. This is achieved by minimizing the Kullback–Leibler (KL) divergence between the two distributions, which is equivalent to minimizing the variational free energy
\mathcal{F}(\mathcal{D}, \theta) = D_{\mathrm{KL}}\big[\, q(w \mid \theta) \,\|\, P(w) \,\big] - \mathbb{E}_{q(w \mid \theta)}\big[\log P(\mathcal{D} \mid w)\big].
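As a toy illustration of this objective (ours, for a single Gaussian weight rather than a full network), the sketch below combines the closed-form KL term with a Monte Carlo estimate of the expected log-likelihood on made-up regression data.

```python
import numpy as np

# Illustrative Monte Carlo estimate of the variational free energy
# F(D, theta) = KL[q(w|theta) || P(w)] - E_q[log P(D|w)]
# for a single Gaussian weight; a toy, not a full training loop.
rng = np.random.default_rng(0)

mu_q, sigma_q = 0.5, 0.3        # variational parameters theta = (mu, sigma)
sigma_p = 1.0                   # prior P(w) = N(0, 1)
x_data = np.array([1.0, 2.0, 3.0])
y_data = 0.8 * x_data + rng.normal(scale=0.1, size=3)   # toy regression data

# Closed-form KL between two univariate Gaussians.
kl = np.log(sigma_p / sigma_q) + (sigma_q**2 + mu_q**2) / (2 * sigma_p**2) - 0.5

# Monte Carlo estimate of E_q[log P(D | w)] under Gaussian noise (std 0.1).
w_samples = mu_q + sigma_q * rng.normal(size=200)
log_lik = np.mean([
    np.sum(-0.5 * ((y_data - w * x_data) / 0.1) ** 2 - np.log(0.1 * np.sqrt(2 * np.pi)))
    for w in w_samples
])
print("free energy:", kl - log_lik)
```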
Current research progress in Bayesian neural networks (BNNs) mainly focuses on improving training efficiency, enhancing uncertainty quantification, and expanding application scenarios. For example, Gal [81] proposed Dropout as a Bayesian approximation method, demonstrating how to estimate model uncertainty through Dropout to enable Bayesian inference in deep learning. Additionally, Blundell [82] introduced a variational inference method for weight uncertainty, further advancing the practical applicability of BNNs. Fortunato [83] proposed Bayesian recurrent neural networks, showing their potential in handling sequential data. Overall, BNNs hold tremendous potential, particularly in scenarios demanding model interpretability and robustness.

2.6.8. Physics-Informed Neural Networks

Physics-informed neural networks (PINNs) are a rapidly growing field that leverages the power of neural networks to learn complex patterns and relationships from data while also incorporating the underlying physical principles, such as partial differential equations (PDEs) or ordinary differential equations (ODEs), that govern the system. Let $L_{\mathrm{total}}$, $L_{\mathrm{Data}}$, $L_{\mathrm{DE}}$, $L_{\mathrm{BC}}$, and $L_{\mathrm{IC}}$ denote the total loss, the data loss, the differential equation loss, the boundary condition loss, and the initial condition loss, respectively. Then, as shown in Figure 3,
L_{\mathrm{total}} = \lambda_1 L_{\mathrm{Data}} + \lambda_2 L_{\mathrm{DE}} + \lambda_3 L_{\mathrm{BC}} + \lambda_4 L_{\mathrm{IC}},
with parameters $\lambda_1$, $\lambda_2$, $\lambda_3$, and $\lambda_4$ weighting the contribution of each loss term.
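The weighted composite loss can be sketched as follows; the residual vectors here are random placeholders, whereas in a real PINN they would come from the network's misfit to observations, the PDE operator, and the boundary/initial conditions.

```python
import numpy as np

def pinn_total_loss(res_data, res_pde, res_bc, res_ic, lambdas=(1.0, 1.0, 1.0, 1.0)):
    """L_total = l1*L_Data + l2*L_DE + l3*L_BC + l4*L_IC, each as a mean-squared residual."""
    losses = [np.mean(r ** 2) for r in (res_data, res_pde, res_bc, res_ic)]
    return sum(l * L for l, L in zip(lambdas, losses))

rng = np.random.default_rng(0)
# Placeholder residual vectors standing in for the four loss sources.
print(pinn_total_loss(rng.normal(size=10), rng.normal(size=50),
                      rng.normal(size=20), rng.normal(size=20),
                      lambdas=(1.0, 0.5, 1.0, 1.0)))
```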
The extended models of PINN include Spectral PINNs (analyzing vibration phenomena through wavenumber domain techniques to improve analysis accuracy [85]), Hybrid PINNs (combining the computational efficiency of traditional PINNs with data-driven methods to reduce training time and enhance scalability [86]), and Theory-Constrained PINNs (embedding domain-specific theoretical principles to ensure model compliance with physical laws, thereby improving prediction accuracy and explainability [87]). Additionally, PINN has performed excellently in structural analysis, health monitoring, stress–strain analysis, and multi-scale modeling of composite materials, demonstrating high computational efficiency and accuracy [88,89].

2.6.9. Liquid Neural Networks

Liquid neural networks (LNNs) represent a paradigm shift in dynamic information processing, inspired by the adaptability and temporal processing capabilities of biological neurons. The foundational framework can be traced to the concept of Liquid Time-Constant Networks (LTCs [90]), where the system’s state evolves according to
\frac{dx(t)}{dt} = -\left[\frac{1}{\tau} + f\big(x(t), I(t), t, \theta\big)\right] x(t) + f\big(x(t), I(t), t, \theta\big)\, A,
where $x(t)$ is the hidden state, $I(t)$ is the input, $t$ represents time, $\tau$ denotes the time constant, $f$ is parameterized by $\theta$, and $A$ is a parameter matrix. This formulation enables LNNs to adaptively modulate their response latency based on input stimuli, mimicking the behavior of biological neural systems. Applications span robotics, autonomous systems, and real-time signal processing, where handling non-stationary data streams is critical. Figure 4 provides an example of the motivation of LTCs.
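A simple forward-Euler discretization of this ODE is sketched below; the sigmoid form of $f$, the random weights, and the constant input are toy assumptions, not the parameterization used in the LTC paper.

```python
import numpy as np

def ltc_euler_step(x, I, dt, tau, A, W_x, W_i):
    """Forward-Euler step of dx/dt = -(1/tau + f) * x + f * A,
    with f a bounded nonlinearity of the state and input (toy choice)."""
    f = 1.0 / (1.0 + np.exp(-(W_x @ x + W_i @ I)))   # f(x(t), I(t), theta) in (0, 1)
    dxdt = -(1.0 / tau + f) * x + f * A
    return x + dt * dxdt

rng = np.random.default_rng(0)
n = 4
x = np.zeros(n)
W_x, W_i = rng.normal(size=(n, n)) * 0.3, rng.normal(size=(n, 2)) * 0.3
A, tau = rng.normal(size=n), 1.0
for _ in range(100):                       # drive the state with a constant input
    x = ltc_euler_step(x, np.array([1.0, -0.5]), dt=0.05, tau=tau, A=A, W_x=W_x, W_i=W_i)
print(x)
```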
To address the computational complexity of solving ODEs, Hasani [91] proposed Closed-form Continuous-time Networks (CfCs). By deriving closed-form approximations of neuronal dynamics, CfCs bypass numerical integration while preserving temporal fidelity. This approach reduces training costs by up to 70% and scales to high-dimensional tasks like video prediction and robotic control. Further studies, such as that by Kumar [92], emphasize LNNs’ ability to model short-term memory and context-dependent computation. By exploiting sparse spikes and adaptive time constants, LNNs achieve sub-millisecond latency in edge computing scenarios, paving the way for real-world embedded systems.

2.7. Deep Generative Models and Variants

2.7.1. Generative Adversarial Networks

Generative adversarial networks (GANs [20]) employ an adversarial mechanism between a generator (G) and a discriminator (D), both implemented as multilayer perceptrons, to synthesize high-fidelity multimodal data. The generator maps noise variables $z$ drawn from a prior distribution $p_z(z)$ to data space via $G(z; \theta_g)$, producing a generated distribution $p_g$. Simultaneously, the discriminator $D(x; \theta_d)$ outputs a scalar probability $D(x)$ indicating whether $x$ originates from the real data or from $p_g$. During training, $D$ aims to maximize correct classification of real and generated samples, while $G$ minimizes $\log(1 - D(G(z)))$, thereby enhancing the generator's ability to produce realistic data. In summary, $D$ and $G$ engage in a two-player minimax game, where the value function $V(G, D)$ is defined as follows:
\underbrace{\min_{G} \max_{D}}_{\text{Adversarial game}} V(D, G) = \underbrace{\mathbb{E}_{x \sim p_{\mathrm{data}}}\big[\log D(x)\big]}_{\text{Real data score}} + \underbrace{\mathbb{E}_{z \sim p_z}\big[\log\big(1 - D(G(z))\big)\big]}_{\text{Fake data detection}}
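To illustrate what each player optimizes (without a full training loop), the sketch below evaluates the two sides of this value function for a toy one-dimensional linear "generator" and logistic "discriminator"; all parameters and distributions are arbitrary assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)

# Toy 1-D "generator" G(z) = a*z + b and "discriminator" D(x) = sigmoid(c*x + d).
a, b, c, d = 0.5, 2.0, 1.0, -1.0
x_real = rng.normal(loc=3.0, size=256)          # samples from p_data
z = rng.normal(size=256)                        # noise z ~ p_z
x_fake = a * z + b                              # generated samples G(z)

# Discriminator objective (to maximize): E[log D(x)] + E[log(1 - D(G(z)))]
d_value = np.mean(np.log(sigmoid(c * x_real + d))) + \
          np.mean(np.log(1.0 - sigmoid(c * x_fake + d)))
# Generator objective (to minimize): E[log(1 - D(G(z)))]
g_value = np.mean(np.log(1.0 - sigmoid(c * x_fake + d)))
print(d_value, g_value)
```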
Basic architectures like DCGAN establish an engineering foundation for image generation through convolutional layer normalization. The Wasserstein GAN (WGAN [93]) introduces Wasserstein distance and gradient penalty (GP) to effectively mitigate mode collapse. CycleGAN [94] leverages cyclic consistency loss to perform cross-domain translation with unpaired data (e.g., translating horses into zebras), extending applications to artistic style transfer and medical image cross-modal reconstruction. For generation control and quality improvement, the StyleGAN [95] series enables hierarchical editing of facial details through style mixing and disentangled latent spaces, while ProGAN [96] uses progressive training to gradually increase resolution for high-definition image generation. Conditional GANs (cGANs [97]) guide generation via label embedding, and CLIP-GAN [98] integrates cross-modal pre-trained models (e.g., CLIP) for text-driven image synthesis, advancing controllable generation toward multimodal interaction. Current GAN improvements focus on 3D generation (e.g., 3D-GAN for point cloud generation [99]), training stability optimization (e.g., spectral normalization), and integration with diffusion models (e.g., Projected GAN [100]), continually pushing boundaries in digital content creation, virtual reality, and industrial design.

2.7.2. Boltzmann Machine

The Boltzmann Machine (BM [101]) is a network of symmetrically coupled stochastic binary units. It contains a set of visible units $v \in \{0, 1\}^D$ and a set of hidden units $h \in \{0, 1\}^P$. The energy of the state $\{v, h\}$ is defined as
E(v, h; \theta) = -\frac{1}{2} v^{\top} L v - \frac{1}{2} h^{\top} J h - v^{\top} W h,
where $\theta = \{W, L, J\}$ are the model parameters. In this context, $W$, $L$, and $J$ represent the visible-to-hidden, visible-to-visible, and hidden-to-hidden symmetric interaction terms. The diagonal elements of $L$ and $J$ are set to 0. The probability that the model assigns to a visible vector $v$ is
p(v; \theta) = \frac{p^{*}(v; \theta)}{Z(\theta)} = \frac{1}{Z(\theta)} \sum_{h} \exp\big(-E(v, h; \theta)\big),
where $p^{*}$ denotes the unnormalized probability, and $Z(\theta) = \sum_{v} \sum_{h} \exp\big(-E(v, h; \theta)\big)$ is the partition function. The conditional distributions over the hidden and visible units are given by $p(h_j = 1 \mid v, h_{-j}) = \sigma\big(\sum_{i=1}^{D} W_{ij} v_i + \sum_{m=1,\, m \neq j}^{P} J_{jm} h_m\big)$ and $p(v_i = 1 \mid h, v_{-i}) = \sigma\big(\sum_{j=1}^{P} W_{ij} h_j + \sum_{k=1,\, k \neq i}^{D} L_{ik} v_k\big)$, where $\sigma(x) = 1/(1 + \exp(-x))$ is the logistic function.
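For illustration, the sketch below evaluates this energy function for random parameters and binary states, and shows the simplification that arises in the restricted case (L = J = 0), which underlies Contrastive Divergence training; everything here is a toy construction of ours.

```python
import numpy as np

def bm_energy(v, h, W, L, J):
    """E(v, h) = -0.5 v^T L v - 0.5 h^T J h - v^T W h (L, J symmetric, zero diagonals)."""
    return -0.5 * v @ L @ v - 0.5 * h @ J @ h - v @ W @ h

rng = np.random.default_rng(0)
D, P = 4, 3
W = rng.normal(size=(D, P))
L = rng.normal(size=(D, D)); L = 0.5 * (L + L.T); np.fill_diagonal(L, 0.0)
J = rng.normal(size=(P, P)); J = 0.5 * (J + J.T); np.fill_diagonal(J, 0.0)

v = rng.integers(0, 2, size=D).astype(float)   # binary visible units
h = rng.integers(0, 2, size=P).astype(float)   # binary hidden units
print(bm_energy(v, h, W, L, J))

# In a Restricted Boltzmann Machine, L = J = 0, so p(h_j = 1 | v) reduces to a
# logistic function of W^T v, which Contrastive Divergence exploits for Gibbs sampling.
p_h_given_v = 1.0 / (1.0 + np.exp(-(W.T @ v)))   # RBM case (no lateral connections)
print(p_h_given_v)
```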
Boltzmann Machines (BMs), inspired by statistical mechanics, use energy functions and simulated annealing for training. However, their fully connected structure results in high computational complexity. Restricted Boltzmann Machines (RBMs [102]) mitigate this issue by restricting intra-layer connections and adopting the Contrastive Divergence (CD) algorithm, enhancing training efficiency. Deep Belief Networks (DBNs [32]), generative models constructed by stacking RBMs, are trained using unsupervised learning to extract hierarchical features, followed by supervised fine-tuning. Tang [103] introduced Deep Lambertian Networks (DLNs), which combine Lambertian reflectance with Gaussian RBMs and DBNs, allowing for the modeling of latent variables like albedo, surface normals, and light sources, thus better simulating the image generation process. The Deep Boltzmann Machine (DBM [104]), which stacks RBMs for complex data modeling, excels in multimodal data modeling and industrial diagnostics through layer-wise pretraining. Recent variants such as the Monotonic Deep Boltzmann Machine (mDBM [105]) integrate monotonicity constraints and parallel inference, achieving high classification accuracy on datasets like MNIST and CIFAR-10. The Born Machine [106], leveraging quantum many-body computation and tensor networks, generates uncorrelated samples via quantum state probabilities. While BM variants are powerful in expressiveness and generative tasks, their applications in quantum computing (e.g., Quantum BMs [107]) and NP-hard problem solving (e.g., for the Traveling Salesman Problem) demonstrate their versatility. However, high training complexity and resource demands remain challenges for future advancements.

2.7.3. Variational Autoencoder

The main principle of the variational autoencoder [108] is to map a set of data into an idealized Gaussian distribution through an encoder. Samples drawn from this Gaussian distribution are then fed into the decoder to generate reconstructed data. The optimization goal of the variational autoencoder is to maximize the probability $p(x)$ of the input $x$ by optimizing the decoder parameters $\theta$ under sampling of the latent variable $z$. A function $Q(z \mid x)$ is therefore introduced to play the role of the encoding network, leading to the objective
J_{\mathrm{VAE}} = \mathbb{E}_{Q(z \mid x)}\big[\log P(x \mid z)\big] - D_{\mathrm{KL}}\big[\, Q(z \mid x) \,\|\, P(z)\,\big].
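A toy evaluation of this objective is sketched below, with linear Gaussian encoder/decoder maps and the reparameterization trick; the architectures and weights are placeholders we introduce for illustration, not the VAE of the cited paper.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=6)                      # one input sample

# Toy Gaussian encoder Q(z|x): mean and log-variance from linear maps.
W_mu, W_logvar = rng.normal(size=(2, 6)) * 0.2, rng.normal(size=(2, 6)) * 0.2
mu, logvar = W_mu @ x, W_logvar @ x

# Reparameterization trick: z = mu + sigma * eps, eps ~ N(0, I).
z = mu + np.exp(0.5 * logvar) * rng.normal(size=2)

# Toy Gaussian decoder P(x|z): log-likelihood with unit variance.
W_dec = rng.normal(size=(6, 2)) * 0.2
x_rec = W_dec @ z
log_px_given_z = -0.5 * np.sum((x - x_rec) ** 2)   # up to an additive constant

# Closed-form KL[Q(z|x) || N(0, I)] for a diagonal Gaussian.
kl = 0.5 * np.sum(np.exp(logvar) + mu ** 2 - 1.0 - logvar)

elbo = log_px_given_z - kl      # J_VAE to be maximized
print(elbo)
```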
Recent advancements in VAE research have focused on improving generative quality, addressing issues like blurry image generation, and enhancing the model’s ability to handle complex data. Improvements include Conditional VAE (CVAE, [109]) for generating specific categories of data, VAE-GAN [110] hybrids for higher-quality image synthesis, and PixelVAE [111] for combining autoregressive decoding with VAEs. Additionally, models like NVAE [112] and CWAE [113] have introduced architectural innovations to reduce training time and improve stability. Future research directions include better information separation, improved interpretability, and reduced computational complexity.

2.7.4. Diffusion Models

Diffusion models consist of a forward process that sequentially corrupts data with Gaussian noise, ultimately transforming the data distribution into pure noise, and a backward process in which a denoising neural network is trained to remove the noise and restore a clean data distribution. This framework enables the generation of high-quality samples by learning to reverse the noise addition process. Here we consider the Ornstein–Uhlenbeck process, which is described by the following Stochastic Differential Equation (SDE):
dX_t = -\frac{1}{2} g(t) X_t\, dt + \sqrt{g(t)}\, dW_t \quad \text{for } g(t) > 0,
where the initial state $X_0 \sim P_{\mathrm{data}}$ follows the data distribution, $(W_t)_{t \ge 0}$ is a standard Wiener process, and $g(t)$ is a nondecreasing weighting function. Diffusion models generate synthetic data by reversing this forward process in time, which leads to the following backward SDE:
dX_t^{\leftarrow} = \left[\frac{1}{2} X_t^{\leftarrow} + \nabla \log p_{T-t}\big(X_t^{\leftarrow}\big)\right] dt + d\bar{W}_t \quad \text{for } t \in [0, T),
where $\nabla \log p_t(\cdot)$ is the so-called “score function”, i.e., the gradient of the log probability density function of $P_t$; $\bar{W}_t$ is another Wiener process independent of $W_t$; and the superscript $\leftarrow$ distinguishes the backward process from the forward one.
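In practice the forward corruption is usually applied in discrete time (the DDPM parameterization) and the denoiser is trained to predict the injected noise. The sketch below shows that discrete counterpart rather than the continuous SDE itself; the noise schedule, the sample, and the placeholder linear "network" are assumptions of ours.

```python
import numpy as np

rng = np.random.default_rng(0)

# Discrete-time counterpart of the forward corruption (DDPM-style):
# x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps,  eps ~ N(0, I).
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alpha_bars = np.cumprod(1.0 - betas)

x0 = rng.normal(loc=2.0, size=8)            # a clean data sample
t = 500                                     # an intermediate noise level
eps = rng.normal(size=8)
x_t = np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1.0 - alpha_bars[t]) * eps

# A denoising network eps_theta(x_t, t) would be trained to predict eps;
# here a placeholder linear map stands in for that network.
W = rng.normal(size=(8, 8)) * 0.1
eps_pred = W @ x_t
denoising_loss = np.mean((eps - eps_pred) ** 2)   # simple noise-matching objective
print(denoising_loss)
```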
Diffusion models have emerged as a powerful generative AI technology, achieving remarkable success in fields such as computer vision, audio generation, reinforcement learning, and computational biology. Recent advancements have primarily focused on conditional diffusion models, which can generate new samples that meet specific conditions. For example, in vision and audio generation, these models enable high-fidelity sample generation with flexible control [45,114]. Additionally, diffusion models are being applied in reinforcement learning to parameterize policies through imitation learning and reward-maximizing planning [115,116]. In the life sciences, conditional diffusion models are used for single-cell image analysis, protein design, and drug discovery, surpassing traditional deep generative models [117,118]. Future research directions include linking diffusion models to stochastic control theory, enhancing adversarial robustness, and exploring distributionally robust optimization and discrete diffusion models.

2.8. Transformer and Its Variants

To handle long texts and address RNNs' inability to process information in parallel, the attention mechanism was introduced, giving rise to the Transformer architecture. The core formula of the attention mechanism is typically expressed as follows:
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left( \frac{Q K^{\top}}{\sqrt{d_k}} \right) V
In the Transformer’s attention mechanism, Q (Query), K (Key), and V (Value) are matrices of vectors: attention scores are calculated by comparing Q and K, and V is weighted by these scores to produce the output. This allows the model to focus on different parts of the input when generating each output. The division by $\sqrt{d_k}$ scales the scores to prevent the softmax from being dominated by large values when the key dimension $d_k$ is large.
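The scaled dot-product attention above is short enough to write out directly; the sketch below (single head, no masking, random toy matrices) follows the formula verbatim.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)       # pairwise query-key similarities
    return softmax(scores, axis=-1) @ V   # weighted sum of values

rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 8))   # 4 query positions, d_k = 8
K = rng.normal(size=(6, 8))   # 6 key positions
V = rng.normal(size=(6, 16))  # matching values, d_v = 16
print(attention(Q, K, V).shape)   # (4, 16)
```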
To address time series complexities, Transformer-based architectures have evolved with specialized mechanisms: Informer [119] employs ProbSparse attention and distillation to reduce redundancy (e.g., electricity forecasting), while AutoFormer [120] leverages autocorrelation for energy demand predictions. For non-stationary data, the Non-stationary Transformer [121] decomposes sequences into stationary/non-stationary components. iTransformer [122] reverses temporal modeling by encoding variables independently, whereas ETSformer [123] integrates exponential smoothing for interpretable decomposition. Medical signal analysis benefits from ShapeFormer’s [124] morphological attention for waveform detection. Beyond temporal domains, lightweight and cross-modal advancements include Deformable DETR’s [125] region-focused attention for vision tasks, Switch Transformer’s [126] trillion-parameter MoE scaling, and Swin Transformer’s [127] shifted-window hierarchical alignment. CrossFormer [128] further unifies image–text granularity. These innovations, summarized in Table 1, demonstrate Transformers’ adaptability across finance, healthcare, and multimodal systems, driven by co-optimization of algorithmic design and computational efficiency.

2.9. RWKV and Its Variants

The Receptance Weighted Key Value (RWKV) model [129] has been proposed as an efficient alternative to Transformers. RWKV combines the strengths of recurrent neural networks (RNNs) and linear attention mechanisms to achieve comparable performance with significantly reduced computational costs. The RWKV model employs time mixing to capture the relationship between tokens, as illustrated in Figure 5. For time step $t$, given the token $x_t$ and the previous token $x_{t-1}$, the Time-Mix module is defined as follows:
\begin{aligned}
r_t &= W_r \cdot (\mu_r x_t + (1 - \mu_r) x_{t-1}), \\
k_t &= W_k \cdot (\mu_k x_t + (1 - \mu_k) x_{t-1}), \\
v_t &= W_v \cdot (\mu_v x_t + (1 - \mu_v) x_{t-1}), \\
wkv_t &= \frac{\sum_{i=1}^{t-1} \exp\big(-(t-1-i)w + k_i\big)\, v_i + \exp(u + k_t)\, v_t}{\sum_{i=1}^{t-1} \exp\big(-(t-1-i)w + k_i\big) + \exp(u + k_t)}, \\
o_t &= W_o \cdot \big(\sigma(r_t) \odot wkv_t\big),
\end{aligned}
where $r_t$, $k_t$, and $v_t$ play roles analogous to the Q, K, and V components found in Transformer architectures.
RWKV-5 (Eagle) and RWKV-6 (Finch) [130] enhance the model’s expressiveness with multi-head matrix-valued states and dynamic recurrence mechanisms, retaining RNN-like inference efficiency. RWKV7 [131] further improves this with advanced linear attention techniques for lower time and memory complexity in long sequence processing. RWKV has been widely applied across various fields, including medical image restoration (Restore-RWKV [132]), image segmentation (RWKV-SAM [133]), visual language models (VisualRWKV-6 [134]), vision–language representation learning (RWKV-CLIP [135]), and 3D point cloud learning (PointRWKV [136]).

2.10. State Machine and Its Variants

Recent progress in state machines within deep learning focuses on the development and application of Structured State Space Models (SSMs) [137]. Figure 6 shows a structural diagram of an SSM. As an efficient sequence modeling approach, SSMs have emerged as strong alternatives to RNNs and Transformers, addressing issues like gradient vanishing in RNNs and quadratic complexity in Transformers by processing long sequences with linear or near-linear complexity [137]. Models like Mamba [138], built on the foundational S4 model [139], show improvements in computational efficiency, memory optimization, and inference speed. SSMs excel in handling long-range dependencies across various fields such as NLP, speech recognition, vision, and time series forecasting. However, challenges in training optimization, hybrid modeling, and interpretability persist.

2.11. Reinforcement Learning Algorithms

Reinforcement learning (RL) involves an agent interacting with its environment to learn optimal behavioral strategies that maximize cumulative rewards. The agent perceives the environment’s state, selects actions, and receives feedback as rewards, aiming to learn a policy through trial and error. State transitions are generally considered random, influenced by the environment, and can be represented by a state transition density function $p(s' \mid s, a) = \mathbb{P}(S' = s' \mid S = s, A = a)$. The return, or cumulative future reward $U$, is defined as $U_t = R_t + R_{t+1} + \cdots$ (usually with a discount factor in practice), with the agent’s goal being to maximize this return.
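A minimal example of learning from rewards by trial and error is tabular Q-learning, sketched below on a toy two-state environment of our own construction (the environment, learning rate, and discount factor are all assumptions for illustration).

```python
import numpy as np

# Tabular Q-learning on a toy 2-state, 2-action MDP (illustrative only).
rng = np.random.default_rng(0)
n_states, n_actions = 2, 2
Q = np.zeros((n_states, n_actions))
gamma, alpha, epsilon = 0.9, 0.1, 0.1   # discount, learning rate, exploration

def step(s, a):
    """Toy environment: action 1 tends to move to state 1, which pays reward 1."""
    s_next = a if rng.random() < 0.8 else 1 - a
    reward = 1.0 if s_next == 1 else 0.0
    return s_next, reward

s = 0
for _ in range(5000):
    a = rng.integers(n_actions) if rng.random() < epsilon else int(np.argmax(Q[s]))
    s_next, r = step(s, a)
    # Q-learning update: move Q(s, a) toward r + gamma * max_a' Q(s', a').
    Q[s, a] += alpha * (r + gamma * np.max(Q[s_next]) - Q[s, a])
    s = s_next

print(Q)   # the second action should dominate in both states
```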
Deep Q-Networks (DQNs [140]) integrated deep neural networks with Q-learning, addressing high-dimensional state spaces. Asynchronous Advantage Actor-Critic (A3C [141]) improves training efficiency through parallel sampling, while Proximal Policy Optimization (PPO [24]) stabilizes policy updates with a clipped objective function. Biologically inspired networks, such as spiking neural networks (SNNs), Deep Belief Networks (DBNs), and Hierarchical Temporal Memory (HTM), draw from neuroscience to suit neuromorphic chips, drive deep feature learning, and excel in temporal pattern prediction. These models, by merging biological mechanisms with computational paradigms, expand AI’s potential in adaptive control, energy efficiency, and brain-like computing.

2.12. Topological Deep Learning and Its Variants

Topological Data Analysis (TDA) applications include integration with machine learning models. For instance, Hajij [142] pioneered the fields of topological machine learning and topological deep learning. Figure 7 depicts the architecture of a three-layer neural network topology. Persistent landscapes and persistent images are key integration points. Unlike conventional methods like principal component analysis and cluster analysis, TDA effectively captures high-dimensional data topology and identifies subtle categories missed by traditional approaches. Recent TDA integration with machine learning focuses on two main aspects. First, topological feature extraction [143] utilizes TDA to extract features from complex datasets, which can then be used for machine learning training. Second, topology-enhanced machine learning algorithms embed topological concepts into existing algorithms to enhance performance. For instance, adding topological features to neural networks has been shown to improve their performance [144,145]. These approaches show great promise in different fields for addressing issues traditionally handled by conventional models.

2.13. Spiking Neural Networks

Spiking neural networks (SNNs) have seen significant advancements in recent years, particularly in terms of scalability, energy efficiency, and hardware integration. Recent research has focused on developing large-scale SNN models, such as Spiking Transformers and Spiking Residual Networks, which have shown competitive performance in tasks like image and speech recognition [147]. These models leverage the unique temporal dynamics of SNNs to process spatiotemporal data more efficiently than traditional artificial neural networks (ANNs) [148,149].
Another key area of progress is the development of neuromorphic hardware and software frameworks that support SNNs. Hardware platforms like Intel’s Loihi and BrainScaleS-2 have enabled the deployment of SNNs on energy-efficient neuromorphic chips, reducing power consumption and improving computational efficiency [150]. Additionally, software frameworks such as SpikingJelly and CogSNN have been developed to facilitate the design and training of SNNs [151].
Training methodologies for SNNs have also seen improvements, including the use of surrogate gradient methods and backpropagation techniques to address the challenges of training deep SNNs [152,153]. These advancements have led to more robust and scalable SNN architectures capable of handling complex tasks with reduced computational overhead [147,154].
In summary, recent advancements in SNNs have focused on enhancing their scalability, energy efficiency, and integration with neuromorphic hardware, making them a promising alternative to traditional deep learning models for a wide range of applications [147,151].

2.14. Benchmarks and Performance of Deep Learning Models

The Time Series Library (TSLib) contains various benchmarks for different tasks in time series analysis [155]. For the forecasting task, there are long-term and short-term benchmarks. The long-term benchmarks include ETT (with 4 subsets), Electricity, Traffic, Weather, Exchange, and ILI.
In the recent academic literature, the performance of various deep learning models such as CNN, RNN, LSTM, GNN, Transformer, and RWKV has been extensively evaluated on datasets such as ETT, Electricity, Traffic, Weather, Exchange, and ILI. These models are employed for time series forecasting and other sequential data tasks.
In time series forecasting, models like Informer and Autoformer [119,120], which are based on Transformer architectures, have shown significant improvements in handling long-sequence dependencies and have outperformed traditional models in datasets like Electricity and Weather. These models leverage efficient attention mechanisms and decomposition techniques to enhance forecasting accuracy.
LSTM and GRU, which are variants of RNNs, have been widely used in time series forecasting due to their ability to model long-term dependencies [156,157], although they often require more computational resources than Transformer-based models [158]. RNN-based models such as LSTM have shown effectiveness in tasks like weather forecasting and energy consumption prediction, but their performance on long sequences is often limited by computational inefficiency. RWKV, a recent model, aims to combine the efficiency of RNNs with the parallelism of Transformers, offering a balance between computational efficiency and performance [129].
Empirical evaluations on datasets like ETT, Electricity, and Weather consistently show that Transformer-based models generally outperform traditional RNN and CNN models in terms of accuracy and efficiency [159,160,161]. However, the choice of model architecture depends on the specific task and dataset characteristics [162].
In summary, while Transformer-based models have demonstrated superior performance in many sequential tasks, the selection of an appropriate model architecture depends on the specific application and data characteristics. For convenience, Table 2 lists the mathematical notations in this section.

3. Innovation Trends of Deep Learning from a Modular Viewpoint

In Section 2, we systematically reviewed mainstream deep learning models and their variants. Through comprehensive analysis of extensive literature, we posit that current innovations in deep learning models primarily manifest in the following aspects. The landscape is illustrated in Figure 8.

3.1. Innovations in Feature Extraction Methods

The technical evolution in this field mainly concerns adjustments to convolution kernels and attention mechanisms. For CNNs, foundational advancements include nonlinear activation enhancements (AlexNet [7]), small kernel stacking (VGG [8]), multi-scale architectures (Inception [10]), and depthwise separable convolutions (Xception [163]). Subsequent innovations expanded residual modules (ResNeXt [164]) and optimized efficiency via grouped convolutions (ShuffleNet [165]). Feature fusion mechanisms integrated wavelet transforms [166] for frequency domain analysis, while dynamic convolution [167] enhanced expressiveness through parallel kernel weighting. Recent strides in multimodal fusion and geometric adaptation include the Dilated Convolutional Transformer [168] for global context, Deformable Convolution [169] for geometric robustness, and CVOCA [170] for joint amplitude-phase modeling, collectively advancing CNN performance in parameter efficiency, computational complexity, and feature representation.
Innovations in the attention mechanism have evolved through multi-scale fusion, spatiotemporal–frequency integration, and cross-modal alignment. Attention mechanisms include spatial attention (e.g., CBAM, STN), channel attention (e.g., SE-Net, ECA-Net), temporal attention (e.g., BTA, CTA, HTA), and frequency attention (e.g., FFT-based); they can also be divided into self-attention, cross-attention, and multi-head attention according to their mechanism. Early milestones encompassed saliency maps [171], RNN-based attention [172], and alignment–translation synergy [173], with abundant variants emerging afterwards. Recent work includes high-frequency noise reduction via the DCT [174], channel interaction optimization (ECANet [175]), and rotational positional embeddings (RoPE [176]). Cross-modal frameworks like Flamingo [177] enable image–text alignment, while Ring Attention [178] scales to million-length contexts. Multi-scale advances feature SENet [179] and CBAM for channel–spatial recalibration, alongside frequency-ramp architectures [180] balancing high- and low-frequency details. The Swift Parameter-free Attention Network (SPAN [181]) is another recent innovation, which dynamically generates attention weights in a parameter-free manner while maintaining efficient computational performance. These innovations address computational efficiency, long-range dependencies, and multimodal alignment, solidifying attention mechanisms as cornerstones of modern deep learning.

3.2. Innovations in Normalization Methods

Early research on normalization methods aimed to calibrate distributions across feature dimensions. Batch Normalization (BN) [182] addressed internal covariate shift using mini-batch statistics, enhancing convolutional network training. Layer Normalization (LN) [183] applied normalization across feature dimensions, stabilizing training in sequential models and Transformers. Instance Normalization (IN) [184] independently normalized channel features, improving visual quality in style transfer. To mitigate BN’s limitations in small-batch scenarios, Group Normalization (GN) [185] grouped channels, while Weight Normalization (WN) [186] decoupled vector parameters to optimize gradient flow. Cross-Iteration Batch Normalization (CBN) [187] further aggregated statistics over multiple iterations. RMSNorm [188] simplified the normalization process by using the root mean square to normalize the input. DeepNorm [189] stabilized training in deep Transformers through residual connection standardization. Dynamic Tanh (DyT) [190] introduced a learnable scaling parameter leveraging the bounded nature of the tanh function. Filter Response Normalization (FRN) [191] enabled online learning without batch statistics via TLU activation integration, while Virtual Batch Normalization (VBN [192]) established paradigms for streaming data through dynamic population statistic estimation.
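To make the contrast between these schemes concrete, the sketch below (ours) compares Layer Normalization, which centers and rescales each feature vector, with RMSNorm, which rescales by the root mean square only; the toy batch and the learnable parameters are placeholders.

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    """LN: subtract the feature-wise mean and divide by the standard deviation."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mu) / np.sqrt(var + eps) + beta

def rms_norm(x, gamma, eps=1e-5):
    """RMSNorm: rescale by the root mean square only (no mean subtraction, no bias)."""
    rms = np.sqrt(np.mean(x ** 2, axis=-1, keepdims=True) + eps)
    return gamma * x / rms

rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, scale=2.0, size=(2, 8))   # batch of 2, feature dim 8
gamma, beta = np.ones(8), np.zeros(8)
print(layer_norm(x, gamma, beta).mean(axis=-1))   # approximately zero per row
print(rms_norm(x, gamma))
```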
Dynamic adaptation strategies have also emerged. The FiLM module [193] pioneered input-conditioned parameter generation. Jing [194] expanded multimodal control in generative tasks through dynamic instance normalization (DIN) with external condition vectors. Luo [195] achieved autonomous BN/LN/IN selection via differentiable gating mechanisms in Switchable Norm. Karras [196] enhanced generative diversity in StyleGAN through adaptive instance normalization (AdaIN) driven by projected style vectors. Park's [197] spatially-adaptive normalization (SPADE) became a milestone in image synthesis through semantic map-guided feature modulation. Miyato [198] improved GAN training stability with spectral normalization (SN) via spectral-norm constraints on weight matrices.
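The following minimal sketch illustrates input-conditioned modulation in the spirit of FiLM [193]: an external condition vector is mapped to per-channel scale and shift parameters that modulate a feature map. The layer names and tensor shapes are illustrative assumptions.

```python
import torch
import torch.nn as nn

class FiLM(nn.Module):
    """Feature-wise linear modulation: a condition vector is projected to
    per-channel gamma (scale) and beta (shift) that modulate the features."""
    def __init__(self, cond_dim: int, channels: int):
        super().__init__()
        self.to_gamma_beta = nn.Linear(cond_dim, 2 * channels)

    def forward(self, x: torch.Tensor, cond: torch.Tensor) -> torch.Tensor:
        gamma, beta = self.to_gamma_beta(cond).chunk(2, dim=-1)
        gamma = gamma.unsqueeze(-1).unsqueeze(-1)   # (B, C, 1, 1)
        beta = beta.unsqueeze(-1).unsqueeze(-1)
        return gamma * x + beta

x = torch.randn(2, 64, 16, 16)
cond = torch.randn(2, 10)            # e.g., an external condition embedding
print(FiLM(10, 64)(x, cond).shape)   # torch.Size([2, 64, 16, 16])
```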
Current research on normalization methods follows three directions: (1) decoupling from batch dependencies through statistical disentanglement strategies; (2) developing dynamic parameter modulation based on input features or external conditions; (3) constructing distributed normalization architectures for multimodal collaboration.

3.3. Innovations in Algorithm Module Enhancement and Hybridization

3.3.1. Architecture Modification

SPPNet [199] introduced a spatial pyramid (multi-scale) pooling layer, which allows the network to process input images of arbitrary size and enhances the model's robustness. ShuffleNet [165] adopted group convolution to reduce computational load and combined it with channel shuffle operations to enhance feature interaction. ResNet [9] first proposed the residual module, whose skip connections solved the training stability issue of extremely deep networks through identity mapping. Its variants, such as ResNeXt [164], expanded the cardinality through group convolution to balance network width and depth. Stochastic Depth [200] further introduced a random layer-dropout strategy during training to enhance the generalization ability of ultra-deep networks. Subsequent research shows a trend toward task adaptation. The Multi-Kernel Inverted Residual module (MKIR [201]) achieved efficient local feature fusion through multi-kernel depthwise convolution and attention mechanisms. DenseNet [202] extended the residual idea through dense cross-layer connections to promote feature reuse. Additionally, Swin Transformer [127] achieved multi-scale modeling through hierarchical window attention. Deep Layer Aggregation (DLA [203]) enhanced feature transfer efficiency through cross-layer iterative fusion. RevNet [204] significantly reduced memory usage through reversible residual blocks to support large-scale generative tasks.
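A minimal residual block, sketched below for the same-channel case, shows the identity-shortcut idea underlying ResNet [9] and its descendants; the layer arrangement is a simplified illustration rather than the exact published configuration.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Basic residual block: two 3x3 convolutions whose output is added to
    the identity-mapped input, which eases optimization of deep networks."""
    def __init__(self, channels: int):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.BatchNorm2d(channels),
        )
        self.act = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.act(x + self.body(x))   # identity shortcut + residual branch

x = torch.randn(1, 64, 32, 32)
print(ResidualBlock(64)(x).shape)           # torch.Size([1, 64, 32, 32])
```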

3.3.2. Automated and Lightweighting Design

Neural Architecture Search (NAS) methods such as SMASH [205] dynamically generated candidate structures through a super network. EfficientNet [59] proposed a compound scaling rule to optimize depth, width, and resolution uniformly with NAS, promoting model adaptation in resource-sensitive scenarios. Dynamic Routing Networks [206] achieved dynamic trade-offs between computational efficiency and accuracy through input-adaptive path selection mechanisms. MixConv [207] integrated multi-scale depthwise separable convolution kernels and collaborated with the bottleneck compression strategy of DenseNet-BC to balance receptive-field diversity and computational cost in mobile detection tasks. RepVGG [208] used multi-branch topological structures to enhance model performance and converted them to single-path architectures through parameter fusion during inference. Lightweighting mainly relies on network compression, which encompasses several families of methods, such as quantization [209], decomposition [210], knowledge distillation [211], and network pruning [212]. Knowledge distillation has become a crucial technique in the realm of large language models (LLMs). Distillation techniques have evolved to encompass a wide range of approaches, including response distillation (e.g., Decoupled Knowledge Distillation [213]), feature distillation (e.g., FitNets [214], Attention Transfer [215], Overhaul Distillation [216]), relational distillation (e.g., RKD [217], Contrastive Distillation [218]), adversarial distillation [219], self-distillation (e.g., BYOT, Deep Mutual Learning [220]), and dynamic distillation (e.g., Dynamic Knowledge Distillation [221] and Online Knowledge Distillation [222]).
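As a compact example of response-based distillation, the sketch below blends the hard-label cross-entropy with a temperature-softened KL term that matches the student's output distribution to the teacher's; the temperature and weighting coefficient are illustrative hyperparameters, not values prescribed by any cited work.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, targets,
                      temperature: float = 4.0, alpha: float = 0.5):
    """Response-based distillation: hard-label cross-entropy plus a softened
    KL divergence between student and teacher predictions."""
    hard = F.cross_entropy(student_logits, targets)
    soft = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature ** 2   # rescale so gradients are comparable across temperatures
    return alpha * hard + (1.0 - alpha) * soft

student_logits = torch.randn(8, 10)
teacher_logits = torch.randn(8, 10)
targets = torch.randint(0, 10, (8,))
print(distillation_loss(student_logits, teacher_logits, targets))
```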

3.3.3. Module Integration and Stitching

The main methods for model stitching include serial connection, parallel connection, interactive combination, and multi-scale fusion; model stitching partially overlaps with the architectural modifications discussed above. Specifically, serial connections, such as the stacking of residual blocks in ResNet [9] and cross-layer feature concatenation in DenseNet [202], address the network depth bottleneck through gradient optimization and feature reuse. Parallel connections, such as the multi-branch Inception modules in GoogLeNet [10] and cardinality expansion in ResNeXt [164], enhance model expressiveness through group convolution and multi-path feature aggregation. Interactive combinations, such as the parameter sharing between bidirectional self-attention layers in BERT [16] and the skip connections and encoder–decoder feature fusion in U-Net [223], improve module collaboration efficiency through dynamic information crossover. DetectoRS [224] proposed a recursive feature pyramid (Recursive-FPN) that iteratively optimizes multi-scale features through loop connections. Additionally, multi-scale fusion technology has evolved from the unidirectional pyramid of FPN [225] to the bidirectional cross-layer connections of PANet [226] and then to the auto-searched architecture of NAS-FPN [227]. Combined with the high-resolution retention strategy of HRNet [228], it forms a cross-scale modeling paradigm that balances spatial and semantic information. These stitching methods have shifted from static stacking to dynamic coupling, providing a flexible and robust feature integration framework for complex tasks.

Cross-module hybridization extends beyond vision backbones. REALM [229], an end-to-end retrieval-augmented pretraining method, enhances the model's knowledge acquisition capability through joint optimization of the retriever and generator. RETRO [230], with block retrieval and cross-attention mechanisms, dynamically integrates external knowledge bases into the generation process. The combination of GANs (generative adversarial networks) and LSTMs (Long Short-Term Memory networks) leverages the GAN's strength in generating complex data distributions and the LSTM's ability to handle long-term temporal dependencies. In recent years, research combining different DL algorithms has made significant progress. For example, the Transformer-enhanced Kalman Filter [231] estimates the hidden states of the Kalman Filter using a Transformer and performs well in time series prediction and anomaly detection. The latest KalmanFormer [232] directly learns the Kalman gain through the Transformer architecture, significantly improving performance in multi-sensor information fusion and achieving high-precision state estimation even under model mismatch and nonlinearity. Cao [233] showed that Transformers can autoregressively generate sequences in Bayesian networks through maximum likelihood estimation (MLE), indicating their significant potential in learning complex probabilistic models and sequence generation tasks. Han [234] proposed a novel method of combining topological data analysis (TDA) with convolutional neural networks (CNNs) to enhance feature extraction and classification performance.
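A toy sketch of parallel stitching is given below: two branches with different kernel sizes are concatenated, Inception-style, and fused by a 1 × 1 convolution (a serial stack would simply chain the same branches with nn.Sequential). All names and sizes are illustrative.

```python
import torch
import torch.nn as nn

class ParallelFusion(nn.Module):
    """Parallel stitching: two convolutional branches with different
    receptive fields are concatenated and fused by a 1x1 convolution."""
    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        self.branch3 = nn.Conv2d(in_ch, out_ch, 3, padding=1)
        self.branch5 = nn.Conv2d(in_ch, out_ch, 5, padding=2)
        self.fuse = nn.Conv2d(2 * out_ch, out_ch, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.fuse(torch.cat([self.branch3(x), self.branch5(x)], dim=1))

x = torch.randn(1, 32, 28, 28)
print(ParallelFusion(32, 64)(x).shape)   # torch.Size([1, 64, 28, 28])
```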

3.4. Innovations in Optimization Methods

Innovations in deep learning optimization methods primarily encompass parameter optimization, optimizer advancements, and activation function design.

3.4.1. Parameter Optimization

Among gradient-based optimization methods, Stochastic Gradient Descent (SGD) and Nesterov Accelerated Gradient (NAG) [235] are two prominent approaches. Adaptive optimization algorithms combine momentum with per-parameter learning rate adaptation; notable examples include AdaDelta [237], RMSProp [238], and Adam [36], which has inspired variants such as AdamW [236]. These adaptive methods have become cornerstones for training models on complex tasks.
The field of optimization has witnessed the emergence of various advanced techniques, which can be categorized into several groups. Second-order optimization methods, such as Newton's method, Kronecker-factored approximate curvature [239], and L-BFGS [240], leverage curvature information for faster convergence. Hybrid methods, including Rectified Adam [241], Evolved Sign Momentum [242], and Second-order Clipped Stochastic Optimization [243], combine different optimization strategies to enhance performance. Population-based methods, such as Evolutionary Strategies [244], are heuristic algorithms [245,246] inspired by biological evolution; they optimize model parameters and offer unique advantages in complex optimization scenarios. Learning rate schedulers, such as Cyclical Learning Rates [247] and the One-Cycle Policy [248], have also gained attention for their ability to accelerate training and improve model performance. Recent advancements include strategies like dynamic routing, proxy (surrogate) paths, parameter reparameterization (e.g., RepVGG [208]), and Neural Architecture Search [59,205], achieving a balance between training efficiency and inference speed.
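The following minimal training loop shows how an adaptive optimizer (AdamW [236]) is typically paired with a One-Cycle learning-rate schedule [248] in PyTorch; the two-layer model and synthetic data are placeholders for illustration only.

```python
import torch
import torch.nn as nn

# Placeholder model and data, used only to demonstrate optimizer + scheduler wiring.
model = nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Linear(64, 2))
data = torch.randn(128, 20)
labels = torch.randint(0, 2, (128,))

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-2)
scheduler = torch.optim.lr_scheduler.OneCycleLR(
    optimizer, max_lr=1e-2, total_steps=100)   # warm up, then anneal the learning rate

loss_fn = nn.CrossEntropyLoss()
for step in range(100):
    optimizer.zero_grad()
    loss = loss_fn(model(data), labels)
    loss.backward()
    optimizer.step()
    scheduler.step()                           # advance the schedule once per step
print(f"final loss: {loss.item():.4f}")
```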

3.4.2. Activation Function Design

Activation functions have evolved from simple nonlinear mappings to adaptive, probabilistically motivated formulations. Early research includes the sigmoid, hyperbolic tangent (Tanh), and softmax functions. In the 2010s, non-saturating and adaptive activation functions emerged, including ReLU [249], Leaky ReLU [250], the Exponential Linear Unit (ELU [251]), Swish [252], and GELU [253]. ReLU [249] mitigated the vanishing gradient problem through sparse activation. GELU [253] incorporated stochastic regularization via smooth gating mechanisms, such as Gaussian error functions, aligning with Transformer self-attention computations. Swish [252] enhanced adaptability by dynamically adjusting activation states through learnable parameters. In the 2020s, dynamic and conditional activation functions flourished, such as SwiGLU [254], Mish [255], Dynamic ReLU [167], VeLU [256], and FReLU [257]. SwiGLU [254], combining Swish and Gated Linear Units (GLUs), improved model performance in Transformer-based architectures. VeLU (Variance-enhanced Learning Unit) [256] dynamically scales activations based on input variance by integrating ArcTan-Sin transformations and Wasserstein-2 regularization. Current research explores innovative activation functions such as sine-based, learned, and rational activations. See Table 3 for a detailed comparison.
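A minimal gated feed-forward unit in the spirit of SwiGLU [254] is sketched below: one projection is passed through SiLU (Swish) and gates a second projection before mapping back to the model dimension; layer names and dimensions are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLU(nn.Module):
    """Gated feed-forward unit: SiLU(W_gate x) elementwise-multiplies
    W_value x, and the gated result is projected back to the model width."""
    def __init__(self, dim: int, hidden: int):
        super().__init__()
        self.w_gate = nn.Linear(dim, hidden, bias=False)
        self.w_value = nn.Linear(dim, hidden, bias=False)
        self.w_out = nn.Linear(hidden, dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.w_out(F.silu(self.w_gate(x)) * self.w_value(x))

tokens = torch.randn(2, 16, 512)
print(SwiGLU(512, 1024)(tokens).shape)   # torch.Size([2, 16, 512])
```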

3.4.3. Techniques to Prevent Overfitting

To prevent overfitting, various techniques are employed in deep learning. Regularization methods, such as L1/L2 regularization [258], Elastic Net [259], gradient penalty [260], and spectral norm regularization [198], are commonly used. Early stopping strategies, including dynamic patience [261], checkpoint ensembles [262], stochastic weight averaging (SWA [263]), and bootstrap validation [264], also play a significant role in preventing overfitting. Another prominent technique is Dropout [265], which randomly deactivates neurons; its variants include Spatial Dropout [266], DropConnect [267], Zoneout [268], Alpha Dropout [269], Stochastic Depth [200], Variational Dropout [270], Gaussian Dropout [271], Monte Carlo Dropout [81], and Concrete Dropout [272]. Different dropout variants suit different architectures, such as CNNs or RNNs; for example, Maxout units [273] were designed to pair with dropout, accelerating training and improving performance, while Zoneout introduced dropout-style regularization for RNNs. Recent advances include data augmentation techniques such as AutoAugment [274], Mixup [275], and CutMix [276]. Some architectural innovations are also helpful: Deep Residual Learning (ResNets [9]) reduced training error through residual connections, and Batch Normalization [182] and its extensions, introduced above, mitigate internal covariate shift and accelerate convergence. Neural Architecture Search (NAS) can improve the generalization ability of a model, and, as stated in Section 3.3.2, distillation techniques can improve model robustness.
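As a short example of data augmentation for regularization, the sketch below implements Mixup [275], convexly combining random pairs of examples and their one-hot labels; the Beta prior on the mixing coefficient and the function signature are illustrative choices.

```python
import torch
import torch.nn.functional as F

def mixup(inputs, targets, num_classes: int, alpha: float = 0.2):
    """Mixup augmentation: convexly combine random pairs of examples and
    their one-hot labels, encouraging linear behaviour between samples."""
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    perm = torch.randperm(inputs.size(0))
    mixed_x = lam * inputs + (1.0 - lam) * inputs[perm]
    y_onehot = F.one_hot(targets, num_classes).float()
    mixed_y = lam * y_onehot + (1.0 - lam) * y_onehot[perm]
    return mixed_x, mixed_y

x = torch.randn(8, 3, 32, 32)            # a batch of images
y = torch.randint(0, 10, (8,))
mx, my = mixup(x, y, num_classes=10)
print(mx.shape, my.shape)                # (8, 3, 32, 32) and (8, 10)
```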

3.4.4. Efficient Adaptation and Modularization

Mixture-of-Experts (MoE [277]) systems leveraged conditional computation for sparse activation, reflecting a shift from single-modality models to multi-task, multi-scenario generalization. Recent developments include Sparse MoE [278], GShard [279], BASE [280], Multi-gate MoE [281], Multimodal MoE [282], Mixture of Attention Heads (MoA [283]), and Federated MoE [284]. LoRA (Low-Rank Adaptation [285]) enabled efficient parameter fine-tuning by freezing backbone networks and training only low-rank adapters, becoming a standard for lightweight fine-tuning. Adapters [286] insert lightweight modules into frozen pretrained models to support multi-task adaptation. EfficientNet [59] introduced compound scaling rules to balance depth, width, and resolution for model efficiency. FlashAttention [287] optimized GPU memory usage, accelerating attention computation and reducing memory overhead; it has been widely adopted in training large models such as GPT-4 and LLaMA. MobileNetV2 [288] employed inverted residual structures for efficient computation on mobile devices.
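A minimal low-rank adapter in the spirit of LoRA [285] is sketched below: the pretrained linear weight is frozen and only two small low-rank factors are trained; the rank, scaling, and initialization are illustrative assumptions.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Low-rank adaptation of a frozen linear layer: the base weight stays
    fixed and only the low-rank update (B @ A, scaled by alpha / r) trains."""
    def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():      # freeze the backbone weights
            p.requires_grad = False
        self.lora_a = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.lora_b = nn.Parameter(torch.zeros(base.out_features, r))
        self.scale = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scale * (x @ self.lora_a.T @ self.lora_b.T)

layer = LoRALinear(nn.Linear(768, 768))
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(trainable)   # only the low-rank factors contribute trainable parameters
```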

3.5. Innovations in Transfer Learning Applications

The advancement of deep learning in transfer learning is manifested in the deep integration of theoretical frameworks and engineering practice. Foundational mathematical principles were established through early studies on Maximum Mean Discrepancy (MMD [289]) and adversarial domain adaptation (e.g., Wasserstein distance optimization [93]). Meanwhile, pre-trained models such as BERT [16] and the Vision Transformer [43] revolutionized the “pre-training–fine-tuning” paradigm, enabling efficient cross-task transfer via parameter reuse. In multimodal transfer and cross-domain adaptation, CLIP [44] aligns cross-modal feature spaces through contrastive image–text learning, endowing models with zero-shot transfer capabilities. For few-shot and unsupervised transfer scenarios, meta-learning (MAML [290]) and self-supervised techniques (SimCLR [21]) enhance performance under data scarcity via task generalization and augmented signal generation, respectively. In dynamic and incremental transfer, methods like Elastic Weight Consolidation (EWC [291]) mitigate catastrophic forgetting in continual learning, facilitating progressive knowledge accumulation in open environments.
Recently, transfer learning (TL) has demonstrated remarkable advantages in addressing data scarcity and domain discrepancies through cross-domain knowledge transfer. Breakthroughs in medical applications stand out. Med3D [292], as shown in Figure 9, is a 3D convolutional network pre-trained on the diverse 3DSeg-8 medical imaging dataset; it accelerates and enhances performance in 3D medical tasks such as lung/liver segmentation. Med3D outperforms models trained from scratch or on non-medical datasets (e.g., Kinetics), reducing training time tenfold and improving accuracy by 3–20%. Bargshady [293] achieved 94.2 percent accuracy in COVID-19 detection using an Inception-CycleGAN-based model. Kathamuthu [294] elevated detection accuracy to 98 percent via a VGG-16 architecture. Industrial inspection innovations focus on adversarial learning and multimodal fusion, exemplified by Michau [295] achieving a state-of-the-art 99.8 percent performance in anomaly detection through adversarial TL frameworks, and Zhang [296] realizing 99.8 percent accuracy in equipment fault diagnosis via a blockchain-integrated federated learning framework. Satellite remote sensing has overcome data distribution disparities through domain adaptation, with Chen [297] reporting 90.17 percent mean Average Precision (mAP) in aircraft detection using a domain-adaptive R-CNN. Environmental science advances feature cross-modal transfer, as evidenced by Cao [298] achieving 99.3 percent accuracy in waste classification using an improved Inception-V3 model. As illustrated in Figure 10, Cao [299] introduced transfer risk as a metric to assess transferability in cross-domain applications. Applied to financial tasks (e.g., stock prediction, portfolio optimization), transfer risk efficiently identifies optimal source tasks (e.g., cross-continent or cross-sector) and correlates strongly with model performance, enabling effective transfer learning in finance. Negative transfer [300], where a model's performance on downstream tasks declines due to source–target data mismatches, remains an open problem in transfer learning; in time series contexts, it has received relatively little attention.
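The standard pre-training–fine-tuning recipe can be illustrated with the short sketch below, which freezes a pretrained backbone and trains only a new task-specific head; torchvision's ResNet-18 (whose ImageNet weights are downloaded on first use) and the five-class head are stand-ins for whatever source model and target task are at hand.

```python
import torch
import torch.nn as nn
from torchvision import models

# Illustrative setup: ResNet-18 stands in for any pretrained source model.
backbone = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
for p in backbone.parameters():        # freeze source-domain features
    p.requires_grad = False
backbone.fc = nn.Linear(backbone.fc.in_features, 5)   # new target-task head

optimizer = torch.optim.Adam(backbone.fc.parameters(), lr=1e-3)
x = torch.randn(4, 3, 224, 224)        # placeholder target-domain batch
y = torch.randint(0, 5, (4,))
loss = nn.CrossEntropyLoss()(backbone(x), y)
loss.backward()                        # gradients flow only into the new head
optimizer.step()
print(f"fine-tuning loss: {loss.item():.4f}")
```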

4. Conclusions

This paper systematically summarizes and updates the recent fundamentals of deep learning algorithms, surveying paradigm shifts and structural advances in DL and comparing model performance on standard datasets. We also identify trends in innovation in deep learning algorithms from a modular AI perspective and analyze how these innovations contribute to the advancement of the field.
Through our analysis, we believe deep learning (DL) will continue to significantly impact AI's modular development. We expect future innovations in DL to be problem-driven, owing to the rapidly increasing demand for AI. Innovations will continuously emerge in areas such as feature extraction methods, normalization methods, module combination or stitching, optimization, and transfer learning applications. However, these innovations will no longer be isolated, as before, but will be viewed from the perspective of module combination. The siloed innovation of DL models must evolve into a modular collaborative framework to drive next-generation technological advancements. Furthermore, given limited data samples, continuously generated real-time data, and steadily improving hardware, the following specific directions of DL are also promising.
Few-shot learning and meta-learning are expected to remain critical. These methods enable models to quickly adapt to new tasks using minimal labeled data, thus improving both learning efficiency and generalization. Notable approaches such as Meta-Learner LSTM and MAML have already shown strong results in few-shot learning settings. Further progress will depend on refining model architectures and training strategies under data-scarce conditions. Self-supervised learning is becoming increasingly vital for utilizing large-scale unlabeled datasets. Improving the ability of pretrained models to exploit this type of data can significantly enhance performance across a wide range of AI tasks. Automation and model lightweighting are also key directions. Techniques such as Neural Architecture Search (NAS) aim to automate the design of neural networks, reducing human effort and improving efficiency. Moreover, developing models that can process multiple modalities cohesively will be essential for improving semantic understanding through enhanced cross-modal fusion. Integration of reinforcement learning (RL) with deep learning is crucial for creating intelligent agents capable of making decisions in complex and dynamic environments. This synergy will continue to advance the capabilities of AI systems in real-world scenarios.

Author Contributions

Conceptualization, Y.W. (Yicheng Wei); methodology, Y.W. (Yicheng Wei); validation, Y.W. (Yicheng Wei) and Y.W. (Yifu Wang); writing—original draft preparation, Y.W. (Yicheng Wei) and Y.W. (Yifu Wang); writing—review and editing, Y.W. (Yicheng Wei) and J.W.; project administration, Y.W. (Yicheng Wei) and J.W. All authors have read and agreed to the published version of the manuscript.

Funding

The research and publication were supported by Suzhou City University Research Startup Funds (No. 5010708724).

Acknowledgments

We acknowledge the support given by Suzhou City University, China.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Newell, A.; Simon, H.A. Computer science as empirical inquiry: Symbols and search. Commun. ACM 1976, 19, 113–126. [Google Scholar] [CrossRef]
  2. Lindsay, R.K.; Buchanan, B.G.; Feigenbaum, E.A.; Lederberg, J. DENDRAL: A case study of the first expert system for scientific hypothesis formation. Artif. Intell. 1993, 61, 209–261. [Google Scholar] [CrossRef]
  3. Rosenblatt, F. The perceptron: A probabilistic model for information storage and organization in the brain. Psychol. Rev. 1958, 65, 386. [Google Scholar] [CrossRef] [PubMed]
  4. Hopfield, J.J. Neural networks and physical systems with emergent collective computational abilities. Proc. Natl. Acad. Sci. USA 1982, 79, 2554–2558. [Google Scholar] [CrossRef] [PubMed]
  5. Rumelhart, D.E.; Hinton, G.E.; Williams, R.J. Learning representations by back-propagating errors. Nature 1986, 323, 533–536. [Google Scholar] [CrossRef]
  6. LeCun, Y.; Bengio, Y.; Hinton, G. Deep Learning. Nature 2015, 521, 436–444. [Google Scholar] [CrossRef] [PubMed]
  7. Krizhevsky, A.; Sutskever, I.; Hinton, G.E. ImageNet classification with deep convolutional neural networks. Adv. Neural Inf. Process. Syst. 2012, 25, 1097–1105. [Google Scholar] [CrossRef]
  8. Simonyan, K.; Zisserman, A. Very deep convolutional networks for large-scale image recognition. In Proceedings of the 3rd International Conference on Learning Representations (ICLR 2015), San Diego, CA, USA, 7–9 May 2015; Computational and Biological Learning Society: Cambridge, UK, 2015. [Google Scholar]
  9. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar] [CrossRef]
  10. Szegedy, C.; Liu, W.; Jia, Y.; Sermanet, P.; Reed, S.; Anguelov, D.; Erhan, D.; Vanhoucke, V.; Rabinovich, A. Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 1–9. [Google Scholar]
  11. Szegedy, C.; Vanhoucke, V.; Ioffe, S.; Shlens, J.; Wojna, Z. Rethinking the inception architecture for computer vision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 2818–2826. [Google Scholar]
  12. Szegedy, C.; Ioffe, S.; Vanhoucke, V.; Alemi, A. Inception-v4, inception-resnet and the impact of residual connections on learning. In Proceedings of the AAAI Conference on Artificial Intelligence, San Francisco, CA, USA, 4–9 February 2017; Volume 31. [Google Scholar]
  13. Hochreiter, S.; Schmidhuber, J. Long short-term memory. Neural Comput. 1997, 9, 1735–1780. [Google Scholar] [CrossRef]
  14. Cho, K.; Van Merriënboer, B.; Gulcehre, C.; Bahdanau, D.; Bougares, F.; Schwenk, H.; Bengio, Y. Learning phrase representations using RNN encoder-decoder for statistical machine translation. arXiv 2014, arXiv:1406.1078. [Google Scholar] [CrossRef]
  15. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. Adv. Neural Inf. Process. Syst. 2017, 30, 6000–6010. [Google Scholar]
  16. Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. Bert: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Minneapolis, MN, USA, 2–7 June 2019; pp. 4171–4186. [Google Scholar]
  17. Radford, A.; Narasimhan, K.; Salimans, T.; Sutskever, I.; Sutskever, H. Improving Language Understanding by Generative Pre-Training; Technical Report; OpenAI: San Francisco, CA, USA, 2018. [Google Scholar]
  18. Radford, A.; Wu, J.; Child, R.; Luan, D.; Amodei, D.; Sutskever, I. Language models are unsupervised multitask learners. OpenAI Blog 2019, 1, 9. [Google Scholar]
  19. Brown, T.; Mann, B.; Ryder, N.; Subbiah, M.; Kaplan, J.D.; Dhariwal, P.; Neelakantan, A.; Shyam, P.; Sastry, G.; Askell, A.; et al. Language models are few-shot learners. Adv. Neural Inf. Process. Syst. 2020, 33, 1877–1901. [Google Scholar]
  20. Goodfellow, I.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; Bengio, Y. Generative adversarial nets. Adv. Neural Inf. Process. Syst. 2014, 27, 2672–2680. [Google Scholar]
  21. Chen, T.; Kornblith, S.; Norouzi, M.; Hinton, G. A simple framework for contrastive learning of visual representations. In Proceedings of the International Conference on Machine Learning, PMLR, Vienna, Austria, 13–18 July 2020; pp. 1597–1607. [Google Scholar]
  22. He, K.; Chen, X.; Xie, S.; Li, Y.; Dollar, P.; Girshick, R. Masked Autoencoders Are Scalable Vision Learners. In Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; IEEE Computer Society: Washington, DC, USA, 2022; pp. 15979–15988. [Google Scholar]
  23. Mnih, V.; Kavukcuoglu, K.; Silver, D.; Rusu, A.A.; Veness, J.; Bellemare, M.G.; Graves, A.; Riedmiller, M.; Fidjeland, A.K.; Ostrovski, G.; et al. Human-level control through deep reinforcement learning. Nature 2015, 518, 529–533. [Google Scholar] [CrossRef]
  24. Schulman, J.; Wolski, F.; Dhariwal, P.; Radford, A.; Klimov, O. Proximal policy optimization algorithms. arXiv 2017, arXiv:1707.06347. [Google Scholar] [CrossRef]
  25. Kipf, T.N.; Welling, M. Semi-supervised classification with graph convolutional networks. arXiv 2016, arXiv:1609.02907. [Google Scholar]
  26. Veličković, P.; Cucurull, G.; Casanova, A.; Romero, A.; Liò, P.; Bengio, Y. Graph Attention Networks. In Proceedings of the International Conference on Learning Representations, Vancouver, BC, Canada, 30 April–3 May 2018. [Google Scholar]
  27. Litjens, G.; Kooi, T.; Bejnordi, B.E.; Setio, A.A.A.; Ciompi, F.; Ghafoorian, M.; Van Der Laak, J.A.; Van Ginneken, B.; Sánchez, C.I. A survey on deep learning in medical image analysis. Med. Image Anal. 2017, 42, 60–88. [Google Scholar] [CrossRef]
  28. Bossert, L.N.; Loh, W. Why the carbon footprint of generative large language models alone will not help us assess their sustainability. Nat. Mach. Intell. 2025, 7, 164–165. [Google Scholar] [CrossRef]
  29. Xua, B.; Yang, G. Interpretability research of deep learning: A literature survey. Inf. Fusion 2024, 115, 102721. [Google Scholar] [CrossRef]
  30. Linardatos, P.; Papastefanopoulos, V.; Kotsiantis, S. Explainable ai: A review of machine learning interpretability methods. Entropy 2020, 23, 18. [Google Scholar] [CrossRef]
  31. Chattha, M.A.; Malik, M.I.; Dengel, A.; Ahmed, S. Addressing data dependency in neural networks: Introducing the Knowledge Enhanced Neural Network (KENN) for time series forecasting+. Mach. Learn. 2025, 114, 30. [Google Scholar] [CrossRef]
  32. Goodfellow, I.; Bengio, Y.; Courville, A. Deep Learning; MIT Press: Cambridge, MA, USA, 2016. [Google Scholar]
  33. Jordan, M.I.; Mitchell, T.M. Machine learning: Trends, perspectives, and prospects. Science 2015, 349, 255–260. [Google Scholar] [CrossRef]
  34. Minar, M.R.; Naher, J. Recent advances in deep learning: An overview. arXiv 2018, arXiv:1807.08169. [Google Scholar] [CrossRef]
  35. Tian, Y.; Zhang, Y. A comprehensive survey on regularization strategies in machine learning. Inf. Fusion 2022, 80, 146–166. [Google Scholar] [CrossRef]
  36. Kingma, D.P.; Ba, J. Adam: A method for stochastic optimization. arXiv 2014, arXiv:1412.6980. [Google Scholar]
  37. Sutskever, I.; Martens, J.; Dahl, G.; Hinton, G. On the importance of initialization and momentum in deep learning. Int. Conf. Mach. Learn. 2013, 28, 1139–1147. [Google Scholar]
  38. Brauwers, G.; Frasincar, F. A general survey on attention mechanisms in deep learning. IEEE Trans. Knowl. Data Eng. 2021, 35, 3279–3298. [Google Scholar] [CrossRef]
  39. Li, P.; Pei, Y.; Li, J. A comprehensive survey on design and application of autoencoder in deep learning. Appl. Soft Comput. 2023, 138, 110176. [Google Scholar] [CrossRef]
  40. Li, Z.; Xia, T.; Chang, Y.; Wu, Y. A Survey of Rwkv. arXiv 2024, arXiv:2412.14847. [Google Scholar] [CrossRef]
  41. Chen, M.; Mei, S.; Fan, J.; Wang, M. An overview of diffusion models: Applications, guided generation, statistical rates and optimization. arXiv 2024, arXiv:2404.07771. [Google Scholar] [CrossRef]
  42. Ji, T.; Hou, Y.; Zhang, D. A comprehensive survey on kolmogorov arnold networks (kan). arXiv 2024, arXiv:2407.11075. [Google Scholar] [CrossRef]
  43. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An image is worth 16 × 16 words: Transformers for image recognition at scale. arXiv 2020, arXiv:2010.11929. [Google Scholar]
  44. Radford, A.; Kim, J.W.; Hallacy, C.; Ramesh, A.; Goh, G.; Agarwal, S.; Sastry, G.; Askell, A.; Mishkin, P.; Clark, J.; et al. Learning transferable visual models from natural language supervision. In Proceedings of the International Conference on Machine Learning, PMLR, Virtual, 18–24 July 2021; pp. 8748–8763. [Google Scholar]
  45. Ho, J.; Jain, A.; Abbeel, P. Denoising Diffusion Probabilistic Models. Adv. Neural Inf. Process. Syst. 2020, 33, 6840–6851. [Google Scholar]
  46. Chen, X.; Wang, L.; Zhang, H. Deep Learning in Medical Image Analysis: Current Trends and Future Directions. Information 2023, 15, 755. [Google Scholar] [CrossRef]
  47. Zhao, Z.; Alzubaidi, L.; Zhang, J.; Duan, Y.; Gu, Y. A comparison review of transfer learning and self-supervised learning: Definitions, applications, advantages and limitations. Expert Syst. Appl. 2024, 242, 122807. [Google Scholar] [CrossRef]
  48. Botvinick, M.; Ritter, S.; Wang, J.X.; Kurth-Nelson, Z.; Blundell, C.; Hassabis, D. Reinforcement learning, fast and slow. Trends Cogn. Sci. 2019, 23, 408–422. [Google Scholar] [CrossRef]
  49. Hassabis, D.; Kumaran, D.; Summerfield, C.; Botvinick, M. Neuroscience-Inspired Artificial Intelligence. Neuron 2017, 95, 245–258. [Google Scholar] [CrossRef] [PubMed]
  50. Montavon, G.; Samek, W.; Müller, K.R. Methods for interpreting and understanding deep neural networks. Digit. Signal Process. 2018, 73, 1–15. [Google Scholar] [CrossRef]
  51. Bau, D.; Zhou, B.; Khosla, A.; Oliva, A.; Torralba, A. Network dissection: Quantifying interpretability of deep visual representations. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 3319–3327. [Google Scholar] [CrossRef]
  52. Sabour, S.; Frosst, N.; Hinton, G.E. Dynamic Routing Between Capsules. arXiv 2017, arXiv:1710.09829. [Google Scholar] [CrossRef]
  53. Hamilton, W.; Ying, R.; Leskovec, J. Inductive representation learning on large graphs. Adv. Neural Inf. Process. Syst. 2017, 30, 1025–1035. [Google Scholar]
  54. Graves, A.; Wayne, G.; Danihelka, I. Neural Turing Machines. arXiv 2014, arXiv:1410.5401. [Google Scholar] [CrossRef]
  55. Vincent, P.; Larochelle, H.; Bengio, Y.; Manzagol, P.A. Extracting and composing robust features with denoising autoencoders. In Proceedings of the 25th International Conference on Machine Learning, Helsinki, Finland, 5–9 July 2008; pp. 1096–1103. [Google Scholar]
  56. Ng, A.Y. Sparse Autoencoder; CS294A Lecture Notes; Stanford University: Stanford, CA, USA, 2011; Volume 72, pp. 1–19. [Google Scholar]
  57. Zhang, X.; Li, X.; Wang, X.; Wang, X.; Wang, A.; Deng, J. Contrastive Learning of Visual Representations: A Survey. arXiv 2021, arXiv:2106.02697. [Google Scholar]
  58. Howard, A.G.; Zhu, M.; Chen, B.; Kalenichenko, D.; Wang, W.; Weyand, T.; Andreetto, M.; Adam, H. MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications. arXiv 2017, arXiv:1704.04861. [Google Scholar] [CrossRef]
  59. Tan, M.; Le, Q.V. EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks. In Proceedings of the International Conference on Machine Learning, PMLR, Long Beach, CA, USA, 9–15 June 2019; pp. 6105–6114. [Google Scholar]
  60. Bengio, Y.; Simard, P.; Frasconi, P. Learning long-term dependencies with gradient descent is difficult. IEEE Trans. Neural Netw. 1994, 5, 157–166. [Google Scholar] [CrossRef] [PubMed]
  61. Schuster, M.; Paliwal, K.K. Bidirectional recurrent neural networks. IEEE Trans. Signal Process. 1997, 45, 2673–2681. [Google Scholar] [CrossRef]
  62. Liu, S.; Li, H.; Shi, L.; Ji, C.; Cao, J.; Lu, X.; Cao, Y. Independently recurrent neural network (IndRNN): Building a longer and deeper RNN. arXiv 2018, arXiv:1803.04831. [Google Scholar] [CrossRef]
  63. Chen, Y.; Zhang, Z.; Yu, Y.; Salakhutdinov, R.; Caruana, R. Dilated recurrent neural networks. In Proceedings of the Advances in Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2018; pp. 2728–2738. [Google Scholar]
  64. Jaeger, H. Echo State Networks; GMD-Forschungszentrum Informationstechnik: Sankt Augustin, Germany, 2001. [Google Scholar]
  65. Bradbury, J.; Merity, S.; Xiong, C.; Li, R.; Socher, R. Quasi-recurrent neural networks. In Proceedings of the International Conference on Learning Representations, San Juan, Puerto Rico, 2–4 May 2016. [Google Scholar]
  66. Zhang, S.X.; Zhao, R.; Liu, C.; Li, J.; Gong, Y. Recurrent support vector machines for speech recognition. In Proceedings of the 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Shanghai, China, 20–25 March 2016; IEEE: New York, NY, USA, 2016; pp. 5885–5889. [Google Scholar]
  67. Shi, X.; Chen, Z.; Wang, H.; Yeung, D.Y.; Wong, W.k.; Woo, W.c. Convolutional LSTM Network: A Machine Learning Approach for Precipitation Nowcasting. Adv. Neural Inf. Process. Syst. 2015, 28, 802–810. [Google Scholar]
  68. Beck, M.; Pöppel, K.; Spanring, M.; Auer, A.; Prudnikova, O.; Kopp, M.; Klambauer, G.; Brandstetter, J.; Hochreiter, S. xlstm: Extended long short-term memory. Adv. Neural Inf. Process. Syst. 2024, 37, 107547–107603. [Google Scholar]
  69. Alkin, B.; Beck, M.; Pöppel, K.; Hochreiter, S.; Brandstetter, J. Vision-lstm: xlstm as generic vision backbone. arXiv 2024, arXiv:2406.04303. [Google Scholar]
  70. Neil, D.; Pfeiffer, M.; Liu, S.C. Phased LSTM: Accelerating Recurrent Network Training for Long or Event-Sparse Time Series. Adv. Neural Inf. Process. Syst. 2016, 29, 3882–3890. [Google Scholar]
  71. Kalchbrenner, N.; Danihelka, I.; Graves, A. Grid Long Short-Term Memory. arXiv 2015, arXiv:1507.01526. [Google Scholar]
  72. Fukao, T.; Iizuka, H.; Kurita, T. Multidimensional Long Short-Term Memory. arXiv 2016, arXiv:1602.06289. [Google Scholar]
  73. Tai, K.S.; Socher, R.; Manning, C.D. Improved Semantic Representations From Tree-Structured Long Short-Term Memory Networks. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), Beijing, China, 26–31 July 2015; pp. 1556–1566. [Google Scholar]
  74. Zhou, P.; Shi, W.; Tian, J.; Qi, Z.; Li, B.; Hao, H.; Xu, B. Attention-based bidirectional long short-term memory networks for relation classification. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 2: SHORT Papers), Dublin, Ireland, 22–27 May 2016; pp. 207–212. [Google Scholar]
  75. Lei, T.; Neubig, G.; Jaakkola, T. Training RNNs as Fast as CNNs. arXiv 2017, arXiv:1709.02755. [Google Scholar]
  76. Tang, S.; Li, B.; Yu, H. ChebNet: Efficient and stable constructions of deep neural networks with rectified power units via Chebyshev approximation. Commun. Math. Stat. 2024, 1–27. [Google Scholar] [CrossRef]
  77. Liu, Z.; Wang, Y.; Vaidya, S.; Ruehle, F.; Halverson, J.; Soljačić, M.; Hou, T.Y.; Tegmark, M. Kan: Kolmogorov-arnold networks. arXiv 2024, arXiv:2404.19756. [Google Scholar]
  78. Xu, K.; Chen, L.; Wang, S. Kolmogorov-arnold networks for time series: Bridging predictive power and interpretability. arXiv 2024, arXiv:2406.02496. [Google Scholar] [CrossRef]
  79. Tang, T.; Chen, Y.; Shu, H. 3D U-KAN implementation for multi-modal MRI brain tumor segmentation. arXiv 2024, arXiv:2408.00273. [Google Scholar]
  80. Shuai, H.; Li, F. Physics-informed kolmogorov-arnold networks for power system dynamics. IEEE Open Access J. Power Energy 2025, 12, 46–58. [Google Scholar] [CrossRef]
  81. Gal, Y.; Ghahramani, Z. Dropout as a bayesian approximation: Representing model uncertainty in deep learning. In Proceedings of the International Conference on Machine Learning, PMLR, New York, NY, USA, 20–22 June 2016; pp. 1050–1059. [Google Scholar]
  82. Blundell, C.; Cornebise, J.; Kavukcuoglu, K.; Wierstra, D. Weight uncertainty in neural networks. arXiv 2015, arXiv:1505.05424. [Google Scholar] [CrossRef]
  83. Fortunato, M.; Blundell, C.; Vinyals, O. Bayesian recurrent neural networks. arXiv 2017, arXiv:1704.02798. [Google Scholar]
  84. Wu, Y.; Sicard, B.; Gadsden, S.A. Physics-informed machine learning: A comprehensive review on applications in anomaly detection and condition monitoring. Expert Syst. Appl. 2024, 255, 124678. [Google Scholar] [CrossRef]
  85. Lütjens, B.; Crawford, C.H.; Veillette, M.; Newman, D. Spectral pinns: Fast uncertainty propagation with physics-informed neural networks. In Proceedings of the Symbiosis of Deep Learning and Differential Equations, Virtual, 14 December 2021. [Google Scholar]
  86. Fang, Z. A high-efficient hybrid physics-informed neural networks based on convolutional neural network. IEEE Trans. Neural Netw. Learn. Syst. 2021, 33, 5514–5526. [Google Scholar] [CrossRef]
  87. Chen, Y.; Huang, D.; Zhang, D.; Zeng, J.; Wang, N.; Zhang, H.; Yan, J. Theory-guided hard constraint projection (HCP): A knowledge-based data-driven scientific machine learning method. J. Comput. Phys. 2021, 445, 110624. [Google Scholar] [CrossRef]
  88. Karniadakis, G.E.; Kevrekidis, I.G.; Lu, L.; Perdikaris, P.; Wang, S.; Yang, L. Physics-informed machine learning. Nat. Rev. Phys. 2021, 3, 422–440. [Google Scholar] [CrossRef]
  89. Khalid, S.; Yazdani, M.H.; Azad, M.M.; Elahi, M.U.; Raouf, I.; Kim, H.S. Advancements in Physics-Informed Neural Networks for Laminated Composites: A Comprehensive Review. Mathematics 2024, 13, 17. [Google Scholar] [CrossRef]
  90. Hasani, R.; Lechner, M.; Amini, A.; Rus, D.; Grosu, R. Liquid Time-constant Networks. In Proceedings of the AAAI Conference on Artificial Intelligence, Virtual, 2–9 February 2021; Volume 35, pp. 7657–7666. [Google Scholar]
  91. Hasani, R.; Lechner, M.; Amini, A.; Liebenwein, L.; Ray, A.; Tschaikowski, M.; Teschl, G.; Rus, D. Closed-form continuous-time neural networks. Nat. Mach. Intell. 2022, 4, 992–1003. [Google Scholar] [CrossRef]
  92. Kumar, K.; Verma, A.; Gupta, N.; Yadav, A. Liquid Neural Networks: A Novel Approach to Dynamic Information Processing. In Proceedings of the 2023 International Conference on Advances in Computation, Communication and Information Technology (ICAICCIT), Faridabad, India, 23–24 November 2023; IEEE: New York, NY, USA, 2023; pp. 725–730. [Google Scholar]
  93. Arjovsky, M.; Chintala, S.; Bottou, L. Wasserstein generative adversarial networks. In Proceedings of the International Conference on Machine Learning, PMLR, Sydney, Australia, 6–11 August 2017; pp. 214–223. [Google Scholar]
  94. Zhu, J.Y.; Park, T.; Isola, P.; Efros, A.A. Unpaired image-to-image translation using cycle-consistent adversarial networks. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2242–2251. [Google Scholar]
  95. Karras, T.; Laine, S.; Aila, T. A style-based generator architecture for generative adversarial networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 4401–4410. [Google Scholar]
  96. Karras, T.; Aittala, M.; Laine, S.; Lehtinen, J.; Aila, T. Alias-free generative adversarial networks. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021; pp. 10509–10518. [Google Scholar]
  97. Mirza, M.; Osindero, S. Conditional generative adversarial nets. arXiv 2014, arXiv:1411.1784. [Google Scholar] [CrossRef]
  98. Hou, Y.; Zhang, W.; Zhu, Z.; Yu, H. CLIP-GAN: Stacking CLIPs and GAN for Efficient and Controllable Text-to-Image Synthesis. IEEE Trans. Multimedia 2025, 27, 3702–3715. [Google Scholar] [CrossRef]
  99. Wu, J.; Zhang, C.; Xue, T.; Freeman, W.T.; Tenenbaum, J.B. Learning a probabilistic latent space of object shapes via 3d generative-adversarial modeling. In Proceedings of the 30th International Conference on Neural Information Processing Systems, Barcelona, Spain, 5–10 December 2016; pp. 82–90. [Google Scholar]
  100. Sauer, A.; Chitta, K.; Müller, J.; Geiger, A. Projected gans converge faster. Adv. Neural Inf. Process. Syst. 2021, 34, 17480–17492. [Google Scholar]
  101. Hinton, G.E.; Sejnowski, T.J.; Ackley, D.H. Boltzmann Machines: Constraint Satisfaction Networks that Learn; Carnegie-Mellon University, Department of Computer Science: Pittsburgh, PA, USA, 1984. [Google Scholar]
  102. Salakhutdinov, R.; Mnih, A.; Hinton, G. Restricted Boltzmann machines for collaborative filtering. In Proceedings of the 24th International Conference on Machine Learning, Corvallis, OR, USA, 20–24 June 2007; pp. 791–798. [Google Scholar] [CrossRef]
  103. Tang, Y.; Salakhutdinov, R.; Hinton, G. Deep Lambertian Networks. In Proceedings of the 29th International Conference on Machine Learning, Edinburgh, Scotland, 26 June–1 July 2012. [Google Scholar]
  104. Salakhutdinov, R.; Hinton, G.E. Deep Boltzmann machines. In Proceedings of the 12th International Conference on Artificial Intelligence and Statistics, Paris, France, 27–29 October 2009; pp. 448–455. [Google Scholar]
  105. Feng, Z.; Winston, E.; Kolter, J.Z. Monotone deep Boltzmann machines. arXiv 2023, arXiv:2307.04990. [Google Scholar] [CrossRef]
  106. Liu, J.G.; Wang, L. Differentiable learning of quantum circuit Born machines. Phys. Rev. A 2018, 98, 062324. [Google Scholar] [CrossRef]
  107. Amin, M.H.; Andriyash, E.; Rolfe, J.; Kulchytskyy, B.; Melko, R. Quantum boltzmann machine. Phys. Rev. X 2018, 8, 021050. [Google Scholar] [CrossRef]
  108. Kingma, D.P.; Welling, M. Auto-Encoding Variational Bayes. arXiv 2013, arXiv:1312.6114. [Google Scholar]
  109. Makhzani, A.; Shlens, J.; Jaitly, N.; Goodfellow, I.; Frey, B. Adversarial autoencoders. arXiv 2015, arXiv:1511.05644. [Google Scholar]
  110. Mescheder, L.; Nowozin, S.; Geiger, A. Adversarial variational bayes: Unifying variational autoencoders and generative adversarial networks. In Proceedings of the International Conference on Machine Learning, PMLR, Sydney, Australia, 6–11 August 2017; pp. 2391–2400. [Google Scholar]
  111. Gulrajani, I.; Kumar, K.; Ahmed, F.; Taiga, A.A.; Visin, F.; Vazquez, D.; Courville, A. PixelVAE: A Latent Variable Model for Natural Images. In Proceedings of the International Conference on Learning Representations, Toulon, France, 24–26 April 2017. [Google Scholar]
  112. Vahdat, A.; Kautz, J. NVAE: A Deep Hierarchical Variational Autoencoder. In Proceedings of the 34th International Conference on Neural Information Processing Systems, Vancouver, BC, Canada, 6–12 December 2020; pp. 19667–19679. [Google Scholar]
  113. Knop, S.; Spurek, P.; Tabor, J.; Podolak, I.; Mazur, S.; Jastrzebski, S. Cramer-Wold Auto-Encoder. J. Mach. Learn. Res. 2020, 21, 6594–6621. [Google Scholar]
  114. Song, Y.; Sohl-Dickstein, J.; Kingma, D.P.; Kumar, A.; Ermon, S.; Poole, B. Score-based Generative Modeling through Stochastic Differential Equations. arXiv 2020, arXiv:2011.13456. [Google Scholar]
  115. Dhariwal, P.; Nichol, A. Diffusion Models Beat GANs on Image Synthesis. Adv. Neural Inf. Process. Syst. 2021, 34, 8780–8794. [Google Scholar]
  116. Rombach, R.; Blattmann, A.; Lorenz, D.; Esser, P.; Ommer, B. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 10684–10695. [Google Scholar]
  117. Krishnamoorthy, S.; Mashkaria, S.M.; Grover, A. Diffusion models for black-box optimization. In Proceedings of the International Conference on Machine Learning, PMLR, Honolulu, HI, USA, 23–29 July 2023; pp. 17842–17857. [Google Scholar]
  118. Chen, M.; Huang, K.; Zhao, T.; Wang, M. Score approximation, estimation and distribution recovery of diffusion models on low-dimensional data. In Proceedings of the International Conference on Machine Learning, PMLR, Honolulu, HI, USA, 23–29 July 2023; pp. 4672–4712. [Google Scholar]
  119. Zhou, H.; Zhang, S.; Peng, J.; Zhang, S.; Wang, J.; Li, H. Informer: Beyond efficient transformer for long sequence time-series forecasting. arXiv 2020, arXiv:2012.07436. [Google Scholar] [CrossRef]
  120. Wu, H.; Zhou, H.; Zhang, S.; Wang, J.; Li, H. Autoformer: Decomposition transformers with auto-correlation for long-term series forecasting. arXiv 2021, arXiv:2105.13100. [Google Scholar]
  121. Liu, Y.; Li, G.; Payne, T.R.; Yue, Y.; Man, K.L. Non-stationary transformer for time series forecasting. Electronics 2024, 13, 2075. [Google Scholar] [CrossRef]
  122. Liu, Y.; Li, G.; Payne, T.R.; Yue, Y.; Man, K.L. iTransformer: Inverse sequence modeling for time series forecasting. arXiv 2023, arXiv:2301.01234. [Google Scholar]
  123. Woo, S.; Park, J.; Lee, S.; Kim, I.S. ETSformer: Exponential smoothing transformers for time series forecasting. arXiv 2022, arXiv:2202.01381. [Google Scholar] [CrossRef]
  124. Wang, H.; Zhang, S.; Zhou, H.; Wang, J.; Li, H. ShapeFormer: Morphological attention for time series forecasting. arXiv 2023, arXiv:2301.01234. [Google Scholar]
  125. Zhu, X.; Wang, W.; Chen, Z.; Chen, Y.; Duan, J.; Dai, J. Deformable DETR: Deformable transformers for end-to-end object detection. In Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020; Springer: Berlin/Heidelberg, Germany, 2020; pp. 515–531. [Google Scholar]
  126. Fedus, W.; Bapna, D.; Chu, C.; Clark, D.; Dauphin, Y.; Elsen, E.; Hall, A.; Huang, Y.; Jia, Y.; Jozefowicz, R.; et al. Switch transformers: Scaling to trillion parameter models with mixture-of-experts. arXiv 2021, arXiv:2101.00391. [Google Scholar]
  127. Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021; pp. 10012–10022. [Google Scholar] [CrossRef]
  128. Zhang, S.; Zhou, H.; Wu, H.; Wang, J.; Li, H. Crossformer: A cross-scale transformer for long-term time series forecasting. arXiv 2022, arXiv:2201.00809. [Google Scholar]
  129. Peng, B.; Alcaide, E.; Anthony, Q.; Albalak, A.; Arcadinho, S.; Biderman, S.; Cao, H.; Cheng, X.; Chung, M.; Grella, M.; et al. Rwkv: Reinventing rnns for the transformer era. arXiv 2023, arXiv:2305.13048. [Google Scholar] [CrossRef]
  130. Peng, B.; Goldstein, D.; Anthony, Q.; Albalak, A.; Alcaide, E.; Biderman, S.; Cheah, E.; Ferdinan, T.; Hou, H.; Kazienko, P.; et al. Eagle and finch: Rwkv with matrix-valued states and dynamic recurrence. arXiv 2024, arXiv:2404.05892. [Google Scholar] [CrossRef]
  131. Peng, B.; Zhang, R.; Goldstein, D.; Alcaide, E.; Hou, H.; Lu, J.; Merrill, W.; Song, G.; Tan, K.; Utpala, S.; et al. Rwkv-7 “goose” with expressive dynamic state evolution. arXiv 2025, arXiv:2503.14456. [Google Scholar]
  132. Yang, Z.; Li, J.; Zhang, H.; Zhao, D.; Wei, B.; Xu, Y. Restore-rwkv: Efficient and effective medical image restoration with rwkv. arXiv 2024, arXiv:2407.11087. [Google Scholar] [CrossRef]
  133. Yuan, H.; Li, X.; Qi, L.; Zhang, T.; Yang, M.H.; Yan, S.; Loy, C.C. Mamba or rwkv: Exploring high-quality and high-efficiency segment anything model. arXiv 2024, arXiv:2406.19369. [Google Scholar]
  134. Hou, H.; Zeng, P.; Ma, F.; Yu, F.R. Visualrwkv: Exploring recurrent neural networks for visual language models. arXiv 2024, arXiv:2406.13362. [Google Scholar] [CrossRef]
  135. Gu, T.; Yang, K.; An, X.; Feng, Z.; Liu, D.; Cai, W.; Deng, J. RWKV-CLIP: A robust vision-language representation learner. arXiv 2024, arXiv:2406.06973. [Google Scholar]
  136. He, Q.; Zhang, J.; Peng, J.; He, H.; Li, X.; Wang, Y.; Wang, C. Pointrwkv: Efficient rwkv-like model for hierarchical point cloud learning. In Proceedings of the AAAI Conference on Artificial Intelligence, Philadelphia, PA, USA, 25 February–4 March 2025; Volume 39, pp. 3410–3418. [Google Scholar]
  137. Somvanshi, S.; Islam, M.M.; Mimi, M.S.; Polock, S.B.B.; Chhetri, G.; Das, S. From S4 to Mamba: A Comprehensive Survey on Structured State Space Models. arXiv 2025, arXiv:2503.18970. [Google Scholar]
  138. Gu, A.; Dao, T. Mamba: Linear-time sequence modeling with selective state spaces. arXiv 2023, arXiv:2312.00752. [Google Scholar] [CrossRef]
  139. Gu, A.; Kim, K.; Lee, A. S4: Structured State Space Sequence Modeling. In Proceedings of the International Conference on Learning Representations, Virtual, 25–29 April 2022. [Google Scholar]
  140. Mnih, V.; Kavukcuoglu, K.; Silver, D.; Graves, A.; Antonoglou, I.; Wierstra, D.; Riedmiller, M. Playing atari with deep reinforcement learning. arXiv 2013, arXiv:1312.5602. [Google Scholar] [CrossRef]
  141. Mnih, V.; Badia, A.; Mirza, A.G.; Lillicrap, T.P.; Harley, T.; Silver, D.; Kavukcuoglu, K. Asynchronous Methods for Deep Reinforcement Learning. In Proceedings of the International Conference on Machine Learning, New York, NY, USA, 19–24 June 2016; pp. 1928–1937. [Google Scholar]
  142. Hajij, M.; Zamzmi, G.; Papamarkou, T.; Miolane, N.; Guzmán-Sáenz, A.; Ramamurthy, K.N.; Birdal, T.; Dey, T.K.; Mukherjee, S.; Samaga, S.N.; et al. Topological deep learning: Going beyond graph data. arXiv 2022, arXiv:2206.00606. [Google Scholar]
  143. Guo, W. Feature Extraction Using Topological Data Analysis for Machine Learning and Network Science Applications. Ph.D. Thesis, University of Washington, Washington, DC, USA, 2020. [Google Scholar]
  144. Love, E.R.; Filippenko, B.; Maroulas, V.; Carlsson, G. Topological convolutional layers for deep learning. J. Mach. Learn. Res. 2023, 24, 1–35. [Google Scholar]
  145. Pham, P. A Topology-Enhanced Multi-Viewed Contrastive Approach for Molecular Graph Representation Learning and Classification. Mol. Inform. 2025, 44, e202400252. [Google Scholar] [CrossRef]
  146. Papillon, M.; Sanborn, S.; Hajij, M.; Miolane, N. Architectures of Topological Deep Learning: A Survey on Topological Neural Networks. arXiv 2023, arXiv:2304.10031. [Google Scholar]
  147. Kundu, S.; Zhu, R.J.; Jaiswal, A.; Beerel, P.A. Recent advances in scalable energy-efficient and trustworthy spiking neural networks: From algorithms to technology. In Proceedings of the ICASSP 2024—2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Seoul, Republic of Korea, 14–19 April 2024; IEEE: New York, NY, USA, 2024; pp. 13256–13260. [Google Scholar]
  148. Nunes, J.D.; Carvalho, M.; Carneiro, D.; Cardoso, J.S. Spiking neural networks: A survey. IEEE Access 2022, 10, 60738–60764. [Google Scholar] [CrossRef]
  149. Lee, J.; Kwak, J.Y.; Keum, K.; Sik Kim, K.; Kim, I.; Lee, M.J.; Kim, Y.H.; Park, S.K. Recent Advances in Smart Tactile Sensory Systems with Brain-Inspired Neural Networks. Adv. Intell. Syst. 2024, 6, 2300631. [Google Scholar] [CrossRef]
  150. Dold, D.; Petersen, P.C. Causal pieces: Analysing and improving spiking neural networks piece by piece. arXiv 2025, arXiv:2504.14015. [Google Scholar] [CrossRef]
  151. Kheradpisheh, S.R.; Mirsadeghi, M.; Masquelier, T. BS4NN: Binarized spiking neural networks with temporal coding and learning. Neural Process. Lett. 2022, 54, 1255–1273. [Google Scholar] [CrossRef]
  152. Wu, Y.; Deng, L.; Li, G.; Zhu, J.; Shi, L. Spatio-temporal backpropagation for training high-performance spiking neural networks. Front. Neurosci. 2018, 12, 331. [Google Scholar] [CrossRef]
  153. Yu, C.; Gu, Z.; Li, D.; Wang, G.; Wang, A.; Li, E. Stsc-snn: Spatio-temporal synaptic connection with temporal convolution and attention for spiking neural networks. Front. Neurosci. 2022, 16, 1079357. [Google Scholar] [CrossRef]
  154. Ding, J.; Pan, Z.; Liu, Y.; Yu, Z.; Huang, T. Robust stable spiking neural networks. arXiv 2024, arXiv:2405.20694. [Google Scholar] [CrossRef]
  155. Wang, Y.; Wu, H.; Dong, J.; Liu, Y.; Long, M.; Wang, J. Deep time series models: A comprehensive survey and benchmark. arXiv 2024, arXiv:2407.13278. [Google Scholar] [CrossRef]
  156. Mienye, I.D.; Swart, T.G.; Obaido, G. Recurrent neural networks: A comprehensive review of architectures, variants, and applications. Information 2024, 15, 517. [Google Scholar] [CrossRef]
  157. Alam, F.; Islam, M.; Deb, A.; Hossain, S.S. Comparison of deep learning models for weather forecasting in different climatic zones. J. Comput. Sci. Eng. (JCSE) 2024, 5, 12–19. [Google Scholar] [CrossRef]
  158. Chung, J.; Gulcehre, C.; Cho, K.; Bengio, Y. Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv 2014, arXiv:1412.3555. [Google Scholar] [CrossRef]
  159. Wang, W.; Liu, Y.; Sun, H. Tlnets: Transformation learning networks for long-range time-series prediction. arXiv 2023, arXiv:2305.15770. [Google Scholar]
  160. Jiang, M.; Wang, K.; Sun, Y.; Chen, W.; Xia, B.; Li, R. MLGN: Multi-scale local-global feature learning network for long-term series forecasting. Mach. Learn. Sci. Technol. 2023, 4, 045059. [Google Scholar] [CrossRef]
  161. Sun, G.; Qi, X.; Zhao, Q.; Wang, W.; Li, Y. SVSeq2Seq: An Efficient Computational Method for State Vectors in Sequence-to-Sequence Architecture Forecasting. Mathematics 2024, 12, 265. [Google Scholar] [CrossRef]
  162. Bayat, S.; Isik, G. Assessing the Efficacy of LSTM, Transformer, and RNN Architectures in Text Summarization. In Proceedings of the International Conference on Applied Engineering and Natural Sciences, Konya, Turkey, 10–12 July 2023; Volume 1, pp. 813–820. [Google Scholar]
  163. Chollet, F. Xception: Deep learning with depthwise separable convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 1251–1258. [Google Scholar]
  164. Xie, S.; Girshick, R.; Dollar, P.; Tu, Z.; He, K. Aggregated Residual Transformations for Deep Neural Networks. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; IEEE Computer Society: Los Alamitos, CA, USA, 2017; pp. 5987–5995. [Google Scholar]
  165. Zhang, X.; Zhou, X.; Lin, M.; Sun, J. ShuffleNet: An extremely efficient convolutional neural network for mobile devices. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 6848–6856. [Google Scholar] [CrossRef]
  166. Liu, Y.; Chen, Y.; Li, B.; Wang, S.; Chen, G. Wavelet Convolutional Neural Networks. IEEE Trans. Neural Netw. Learn. Syst. 2020, 7, 74973–74985. [Google Scholar]
  167. Chen, Y.; Dai, X.; Liu, M.; Chen, D.; Lu, Y.; Liu, Z. Dynamic ReLU. In ECCV 2020: 16th European Conference; Springer: Cham, Switzerland, 2020; Volume 17, pp. 351–367. [Google Scholar] [CrossRef]
  168. Liu, Y.; Wang, R.; Zhang, Y.; Li, P.; Zhang, H. Multi-scale convolutional transformer network for motor imagery classification. Sci. Rep. 2023, 15, 12935. [Google Scholar]
  169. Dai, J.; Qi, H.; Xiong, Y.; Li, Y.; Zhang, G.; Hu, H.; Wei, Y.; Wang, J. Deformable convolutional networks. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 764–773. [Google Scholar] [CrossRef]
  170. Bai, Y.; Xu, Y.; Xu, K.; Li, W.; Liu, J.M. TOPS-speed complex-valued convolutional accelerator for feature extraction and inference. Nat. Commun. 2025, 16, 292. [Google Scholar] [CrossRef] [PubMed]
  171. Itti, L.; Koch, C.; Niebur, E. A model of saliency-based visual attention for rapid scene analysis. IEEE Trans. Pattern Anal. Mach. Intell. 1998, 20, 1254–1259. [Google Scholar] [CrossRef]
  172. Mnih, V.; Heess, N.; Graves, A.; Kavukcuoglu, K. Recurrent models of visual attention. Adv. Neural Inf. Process. Syst. 2014, 27, 2204–2212. [Google Scholar]
  173. Bahdanau, D.; Cho, K.; Bengio, Y. Neural machine translation by jointly learning to align and translate. arXiv 2014, arXiv:1409.0473. [Google Scholar]
  174. Jiang, M.; Zeng, P.; Wang, K.; Liu, H.; Chen, W.; Liu, H. FECAM: Frequency enhanced channel attention mechanism for time series forecasting. Adv. Eng. Inform. 2023, 58, 102158. [Google Scholar] [CrossRef]
  175. Wang, Q.; Wu, B.; Zhu, P.; Li, P.; Zuo, W.; Hu, Q. ECA-Net: Efficient channel attention for deep convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 11534–11542. [Google Scholar] [CrossRef]
  176. Su, J.; Ahmed, M.; Lu, Y.; Pan, S.; Bo, W.; Liu, Y. Roformer: Enhanced transformer with rotary position embedding. Neurocomputing 2024, 568, 127063. [Google Scholar] [CrossRef]
  177. Alayrac, J.B.; Donahue, J.; Luc, P.; Miech, A.; Barr, I.; Hasson, Y.; Lenc, K.; Mensch, A.; Millican, K.; Reynolds, M.; et al. Flamingo: A visual language model for few-shot learning. Adv. Neural Inf. Process. Syst. (NeurIPS) 2022, 35, 23716–23736. [Google Scholar]
  178. Liu, H.; Zaharia, M.; Abbeel, P. Ring attention with blockwise transformers for near-infinite context. arXiv 2023, arXiv:2310.01889. [Google Scholar] [CrossRef]
  179. Hu, J.; Shen, L.; Sun, G. Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 7132–7141. [Google Scholar]
  180. Si, C.; Yu, W.; Zhou, P.; Zhou, Y.; Wang, X.; Yan, S. Inception transformer. Adv. Neural Inf. Process. Syst. 2022, 35, 23495–23509. [Google Scholar]
  181. Wan, C.; Yu, H.; Li, Z.; Chen, Y.; Zou, Y.; Liu, Y.; Yin, X.; Zuo, K. Swift parameter-free attention network for efficient super-resolution. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 17–18 June 2024; pp. 6246–6256. [Google Scholar]
  182. Ioffe, S.; Szegedy, C. Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. In Proceedings of the International Conference on Machine Learning, PMLR, Lille, France, 7–9 July 2015; pp. 448–456. [Google Scholar]
  183. Ba, J.L.; Kiros, J.R.; Hinton, G.E. Layer Normalization. arXiv 2016, arXiv:1607.06450. [Google Scholar] [CrossRef]
  184. Ulyanov, D.; Vedaldi, A.; Lempitsky, V. Instance Normalization: The Missing Ingredient for Fast Stylization. arXiv 2016, arXiv:1607.08022. [Google Scholar]
  185. Wu, Y.; He, K. Group normalization. In Proceedings of the European conference on computer vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 3–19. [Google Scholar]
  186. Salimans, T.; Kingma, D.P. Weight normalization: A simple reparameterization to accelerate training of deep neural networks. Adv. Neural Inf. Process. Syst. 2016, 29, 901–909. [Google Scholar]
  187. Yao, Z.; Cao, Y.; Zheng, S.; Huang, G.; Lin, S. Cross-Iteration Batch Normalization. In Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021; IEEE: New York, NY, USA, 2021; pp. 12326–12335. [Google Scholar]
  188. Zhang, B.; Sennrich, R. Root mean square layer normalization. Adv. Neural Inf. Process. Syst. 2019, 32, 12381–12392. [Google Scholar]
  189. Wang, H.; Ma, S.; Dong, L.; Huang, S.; Zhang, D.; Wei, F. Deepnet: Scaling transformers to 1000 layers. IEEE Trans. Pattern Anal. Mach. Intell. 2024, 46, 6761–6774. [Google Scholar] [CrossRef]
  190. Zhu, J.; Chen, X.; He, K.; LeCun, Y.; Liu, Z. Transformers without normalization. arXiv 2025, arXiv:2503.10622. [Google Scholar] [CrossRef]
  191. Singh, S.; Krishnan, S. Filter Response Normalization Layer: Eliminating Batch Dependence in the Training of Deep Neural Networks. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; IEEE: New York, NY, USA, 2020; pp. 11234–11243. [Google Scholar]
  192. Salimans, T.; Goodfellow, I.; Zaremba, W.; Cheung, V.; Radford, A.; Chen, X. Improved techniques for training gans. Adv. Neural Inf. Process. Syst. 2016, 29, 2234–2242. [Google Scholar]
193. Perez, E.; Strub, F.; De Vries, H.; Dumoulin, V.; Courville, A. Film: Visual reasoning with a general conditioning layer. In Proceedings of the AAAI Conference on Artificial Intelligence, New Orleans, LA, USA, 2–7 February 2018; Volume 32. [Google Scholar]
  194. Jing, Y.; Liu, X.; Ding, Y.; Wang, X.; Ding, E.; Song, M.; Wen, S. Dynamic Instance Normalization for Arbitrary Style Transfer. In Proceedings of the AAAI Conference on Artificial Intelligence. Association for the Advancement of Artificial Intelligence (AAAI), New York, NY, USA, 7–12 February 2020; Volume 34, pp. 4369–4376. [Google Scholar]
  195. Luo, P.; Ren, J.; Peng, Z.; Zhang, R.; Li, J. Differentiable learning-to-normalize via switchable normalization. arXiv 2018, arXiv:1806.10779. [Google Scholar]
196. Karras, T.; Laine, S.; Aittala, M.; Hellsten, J.; Lehtinen, J.; Aila, T. Analyzing and Improving the Image Quality of StyleGAN. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; IEEE: New York, NY, USA, 2020; pp. 8107–8116. [Google Scholar]
  197. Park, T.; Liu, M.Y.; Wang, T.C.; Zhu, J.Y. Semantic Image Synthesis With Spatially-Adaptive Normalization. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; IEEE: New York, NY, USA, 2019; pp. 2332–2341. [Google Scholar]
  198. Miyato, T.; Kataoka, T.; Koyama, M.; Yoshida, Y. Spectral Normalization for Generative Adversarial Networks. arXiv 2018, arXiv:1802.05957. [Google Scholar] [CrossRef]
  199. He, K.; Zhang, X.; Ren, S.; Sun, J. Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2015, 37, 1904–1916. [Google Scholar] [CrossRef]
  200. Huang, G.; Sun, Y.; Liu, Z.; Sedra, D.; Weinberger, K.Q. Deep Networks with Stochastic Depth. In Proceedings of the Computer Vision–ECCV 2016, Amsterdam, The Netherlands, 11–14 October 2016; pp. 646–661. [Google Scholar]
  201. Rahman, M.M.; Marculescu, R. UltraLightUNet: Rethinking U-Shaped Network with Multi-Kernel Lightweight Convolutions for Medical Image Segmentation. OpenReview 2025. Available online: https://openreview.net/forum?id=BefqqrgdZ1 (accessed on 12 August 2025).
  202. Huang, G.; Liu, Z.; Van Der Maaten, L.; Weinberger, K.Q. Densely connected convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 4700–4708. [Google Scholar] [CrossRef]
  203. Yu, F.; Wang, D.; Shelhamer, E.; Darrell, T. Deep layer aggregation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 2403–2412. [Google Scholar]
  204. Gomez, A.N.; Ren, M.; Urtasun, R.; Grosse, R.B. The reversible residual network: Backpropagation without storing activations. Adv. Neural Inf. Process. Syst. 2017, 30, 2211–2221. [Google Scholar]
  205. Brock, A.; Lim, T.; Ritchie, J.M.; Weston, N. SMASH: One-Shot Model Architecture Search through HyperNetworks. arXiv 2017, arXiv:1708.05344. [Google Scholar] [CrossRef]
  206. Cai, S.; Shu, Y.; Wang, W. Dynamic routing networks. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 3–8 January 2021; pp. 3588–3597. [Google Scholar]
  207. Tan, M.; Le, Q.V. Mixconv: Mixed depthwise convolutional kernels. arXiv 2019, arXiv:1907.09595. [Google Scholar] [CrossRef]
  208. Ding, X.; Zhang, X.; Ma, N.; Han, J.; Ding, G.; Sun, J. RepVGG: Making VGG-style ConvNets Great Again. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 13733–13742. [Google Scholar]
  209. Courbariaux, M.; Bengio, Y.; David, J.P. Binaryconnect: Training deep neural networks with binary weights during propagations. In Proceedings of the NIPS 2015, Montreal, QC, Canada, 7–12 December 2015. [Google Scholar]
  210. Denton, E.; Zaremba, W.; Bruna, J.; LeCun, Y.; Fergus, R. Exploiting linear structure within convolutional networks for efficient evaluation. In Proceedings of the NIPS 2014, Montreal, QC, Canada, 8–13 December 2014; pp. 1269–1277. [Google Scholar]
  211. Hinton, G.; Vinyals, O.; Dean, J. Distilling the knowledge in a neural network. arXiv 2015, arXiv:1503.02531. [Google Scholar] [CrossRef]
  212. Han, S.; Pool, J.; Tran, J.; Dally, W.J. Learning both weights and connections for efficient neural network. In Proceedings of the NIPS 2015, Montreal, QC, Canada, 7–12 December 2015. [Google Scholar]
  213. Zhao, B.; Cui, Q.; Song, R.; Qiu, Y.; Liang, J. Decoupled Knowledge Distillation. arXiv 2022, arXiv:2203.08679. [Google Scholar] [CrossRef]
  214. Romero, A.; Ballas, N.; Kahou, S.E.; Chassang, A.; Gatta, C.; Bengio, Y. FitNets: Hints for thin deep nets. In Proceedings of the 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, 7–9 May 2015. [Google Scholar]
  215. Ba, J.; Kiros, J.R.; Hinton, G.E. Attention Transfer. In Proceedings of the NIPS 2016, Barcelona, Spain, 5–10 December 2016. [Google Scholar]
  216. Lee, D.H. Overhaul Distillation: A Method for Knowledge Transfer in Neural Networks. arXiv 2015, arXiv:1506.02581. [Google Scholar]
  217. Cho, Y.; Min, K.h.; Lee, J.; Shin, M.; Lee, D.H.; Yang, H.J. Relational Knowledge Distillation. In Proceedings of the CVPR, Long Beach, CA, USA, 16–20 June 2019; pp. 7624–7632. [Google Scholar]
  218. Tian, Y.; Krishnan, D.; Isola, P. Contrastive Representation Distillation. In Proceedings of the ICLR, Addis Ababa, Ethiopia, 30 April 2020. [Google Scholar]
  219. Zhu, Y.; Hua, G.; Wang, L. Knowledge Transfer via Distillation of Activation Boundaries Formed by Deep Neural Classifiers. In Proceedings of the ICLR, Vancouver, BC, Canada, 30 April–3 May 2018. [Google Scholar]
  220. Yang, Y.; Shen, C.c.l.; Wang, Z.; Dick, A.; Hengel, A.v.d. Deep Mutual Learning. arXiv 2017, arXiv:1706.00384. [Google Scholar] [CrossRef]
  221. Zhang, J.W.; Li, G.; Li, Y.; Li, B.; Li, G.; Wang, X. Dynamic Knowledge Distillation. arXiv 2020, arXiv:2007.12355. [Google Scholar] [CrossRef]
  222. Zhang, J.W.; Li, G.; Wang, X.; Li, G. Online Knowledge Distillation from the Wisest. In Proceedings of the CVPR, Nashville, TN, USA, 20–25 June 2021. [Google Scholar]
  223. Ronneberger, O.; Fischer, P.; Brox, T. U-Net: Convolutional Networks for Biomedical Image Segmentation. In Proceedings of the International Conference on Medical Image Computing and Computer-Assisted Intervention (MICCAI), Munich, Germany, 5–9 October 2015; Springer: Cham, Switzerland, 2015; pp. 234–241. [Google Scholar] [CrossRef]
  224. Qiao, S.; Chen, L.C.; Yuille, A. DetectoRS: Detecting Objects with Recursive Feature Pyramid and Switchable Atrous Convolution. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021; pp. 10213–10224. [Google Scholar]
  225. Lin, T.Y.; Dollár, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature Pyramid Networks for Object Detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 2117–2125. [Google Scholar] [CrossRef]
  226. Liu, S.; Qi, L.; Qin, H.; Shi, J.; Jia, J. Path aggregation network for instance segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 8759–8768. [Google Scholar]
  227. Ghiasi, G.; Lin, T.Y.; Le, Q.V. NAS-FPN: Learning Scalable Feature Pyramid Architecture for Object Detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 7036–7045. [Google Scholar] [CrossRef]
  228. Sun, K.; Xiao, B.; Liu, D.; Wang, J. Deep High-Resolution Representation Learning for Human Pose Estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 5693–5703. [Google Scholar]
  229. Guu, K.; Lee, K.; Tung, Z.; Pasupat, P.; Chang, M. Retrieval augmented language model pre-training. In Proceedings of the International Conference on Machine Learning, PMLR, Virtual, 13–18 July 2020; pp. 3929–3938. [Google Scholar]
  230. Borgeaud, S.; Mensch, A.; Hoffmann, J.; Cai, T.; Rutherford, E.; Millican, K.; Van Den Driessche, G.B.; Lespiau, J.B.; Damoc, B.; Clark, A.; et al. Improving language models by retrieving from trillions of tokens. In Proceedings of the International Conference on Machine Learning, PMLR, Baltimore, MD, USA, 17–23 July 2022; pp. 2206–2240. [Google Scholar]
  231. Shi, Z. Incorporating Transformer and LSTM to Kalman Filter with EM algorithm for state estimation. arXiv 2021, arXiv:2105.00250. [Google Scholar] [CrossRef]
  232. Shen, S.; Chen, J.; Yu, G.; Zhai, Z.; Han, P. KalmanFormer: Using transformer to model the Kalman Gain in Kalman Filters. Front. Neurorobotics 2025, 18, 1460255. [Google Scholar] [CrossRef] [PubMed]
  233. Cao, Y.; He, Y.; Wu, D.; Chen, H.Y.; Fan, J.; Liu, H. Transformers Simulate MLE for Sequence Generation in Bayesian Networks. arXiv 2025, arXiv:2501.02547. [Google Scholar] [CrossRef]
  234. Han, Y.; Guangjun, Q.; Ziyuan, L.; Yongqing, H.; Guangnan, L.; Qinglong, D. Research on fusing topological data analysis with convolutional neural network. arXiv 2024, arXiv:2407.09518. [Google Scholar]
235. Nesterov, Y. A method for unconstrained convex minimization problem with the rate of convergence O(1/k^2). Dokl. Akad. Nauk SSSR 1983, 269, 543–547. [Google Scholar]
  236. Loshchilov, I.; Hutter, F. Fixing Weight Decay Regularization in Adam. arXiv 2017, arXiv:1711.05101. [Google Scholar]
  237. Zeiler, M.D. ADADELTA: An Adaptive Learning Rate Method. arXiv 2012, arXiv:1212.5701. [Google Scholar] [CrossRef]
  238. Tieleman, T.; Hinton, G. Lecture 6.5-Rmsprop, Coursera: Neural Networks for Machine Learning; Technical Report; University of Toronto: Toronto, ON, Canada, 2012; Volume 6. [Google Scholar]
  239. Martens, J.; Grosse, R. Optimizing Neural Networks with Kronecker-Factored Approximate Curvature. In Proceedings of the International Conference on Machine Learning, Lille, France, 7–9 July 2015. [Google Scholar]
  240. Liu, D.C.; Nocedal, J. On the limited memory BFGS method for large scale optimization. Math. Program. 1989, 45, 503–528. [Google Scholar] [CrossRef]
241. Liu, L.; Jiang, H.; He, P.; Chen, W.; Liu, X.; Gao, J.; Han, J. On the Variance of the Adaptive Learning Rate and Beyond. In Proceedings of the International Conference on Learning Representations (ICLR), Addis Ababa, Ethiopia, 26–30 April 2020. [Google Scholar]
  242. Chen, C.; Wang, Y.; Zhou, X.; Zhang, G.; Zhang, J.; Tang, X.; Luo, W. A Symbolic Method for Training Neural Networks. arXiv 2023, arXiv:2310.00068. [Google Scholar]
  243. Liu, Z.; Wang, X.; Li, Y.; Song, X.; Xu, W. Sophia: A Scalable Stochastic Second-Order Optimizer for Language Model Pre-training. arXiv 2023, arXiv:2309.17467. [Google Scholar]
  244. Salimans, T.; Ho, J.; Chen, X.; Sidor, S.; Sutskever, I. Evolution Strategies as a Scalable Alternative to Reinforcement Learning. arXiv 2017, arXiv:1703.03864. [Google Scholar] [CrossRef]
245. Huang, C.L.; Wang, C.J. A GA-based feature selection and parameters optimization for support vector machines. Expert Syst. Appl. 2006, 31, 231–240. [Google Scholar] [CrossRef]
  246. Pan, J.S.; Zhang, L.G.; Wang, R.B.; Snášel, V.; Chu, S.C. Gannet optimization algorithm: A new metaheuristic algorithm for solving engineering optimization problems. Math. Comput. Simul. 2022, 202, 343–373. [Google Scholar] [CrossRef]
  247. Smith, L.N. Cyclical Learning Rates for Training Neural Networks. arXiv 2015, arXiv:1506.01186. [Google Scholar]
248. Smith, L.N.; Topin, N. Super-Convergence: Very Fast Training of Neural Networks Using Large Learning Rates. arXiv 2017, arXiv:1708.07120. [Google Scholar]
  249. Nair, V.; Hinton, G.E. Rectified linear units improve restricted boltzmann machines. In Proceedings of the 27th International Conference on Machine Learning (ICML-10), Haifa, Israel, 21–24 June 2010; pp. 807–814. [Google Scholar]
  250. Maas, A.L.; Hannun, A.Y.; Ng, A.Y. Rectifier nonlinearities improve neural network acoustic models. In Proceedings of the 30th International Conference on Machine Learning, Atlanta, GA, USA, 16–21 June 2013; Volume 30, p. 3. [Google Scholar]
  251. Farid, A.; Hussain, F.; Khan, K.; Shahzad, M.; Khan, U.; Mahmood, Z. A fast and accurate real-time vehicle detection method using deep learning for unconstrained environments. Appl. Sci. 2023, 13, 3059. [Google Scholar] [CrossRef]
  252. Ramachandran, P.; Zoph, B.; Le, Q.V. Swish: A self-gated activation function. arXiv 2017, arXiv:1710.05941. [Google Scholar]
  253. Hendrycks, D.; Gimpel, K. Gaussian error linear units (gelus). arXiv 2016, arXiv:1606.08415. [Google Scholar]
  254. Shazeer, N. GLU Variants Improve Transformer. arXiv 2020, arXiv:2002.05202. [Google Scholar] [CrossRef]
  255. Misra, D. Mish: A Self Regularized Non-Monotonic Neural Activation Function. arXiv 2019, arXiv:1908.08681. [Google Scholar]
  256. Shakarami, A.; Yeganeh, Y.; Farshad, A.; Nicolè, L.; Ghidoni, S.; Navab, N. VeLU: Variance-enhanced Learning Unit for Deep Neural Networks. arXiv 2025, arXiv:2504.15051. [Google Scholar]
  257. Qiu, S.; Xu, X.; Cai, B. FReLU: Flexible Rectified Linear Units for Improving Convolutional Neural Networks. In Proceedings of the 24th International Conference on Pattern Recognition (ICPR), Beijing, China, 20–24 August 2018; pp. 1223–1228. [Google Scholar]
  258. Tibshirani, R. Regression shrinkage and selection via the lasso. J. R. Stat. Soc. 1996, 58, 267–288. [Google Scholar] [CrossRef]
  259. Zou, H.; Hastie, T. Regularization and variable selection via the elastic net. J. R. Stat. Soc. 2005, 67, 301–320. [Google Scholar] [CrossRef]
260. Gulrajani, I.; Ahmed, F.; Arjovsky, M.; Dumoulin, V.; Courville, A.C. Improved Training of Wasserstein GANs. arXiv 2017, arXiv:1704.00028. [Google Scholar] [CrossRef]
261. Prechelt, L. Early Stopping—But When? In Neural Networks: Tricks of the Trade; Springer: Berlin/Heidelberg, Germany, 1998. [Google Scholar]
262. Huang, G.; Li, Y.; Pleiss, G.; Liu, Z.; Hopcroft, J.E.; Weinberger, K.Q. Snapshot Ensembles: Train 1, Get M for Free. arXiv 2017, arXiv:1704.00109. [Google Scholar]
  263. Izmailov, P.; Podoprikhin, D.; Garipov, T.; Vetrov, D.; Wilson, A.G. Averaging Weights Leads to Wider Optima and Better Generalization. arXiv 2018, arXiv:1803.05407. [Google Scholar]
  264. Efron, B.; Tibshirani, R.J. An Introduction to the Bootstrap; Monographs on Statistics & Applied Probability; Chapman & Hall/CRC: Boca Raton, FL, USA, 1994. [Google Scholar]
  265. Hinton, G.E.; Srivastava, N.; Krizhevsky, A.; Sutskever, I.; Salakhutdinov, R.R. Improving neural networks by preventing co-adaptation of feature detectors. arXiv 2012, arXiv:1207.0580. [Google Scholar] [CrossRef]
  266. Tompson, J.; Goroshin, R.; Jain, A.; LeCun, Y.; Bregler, C. Efficient Object Localization Using Convolutional Networks. In Proceedings of the CVPR, Boston, MA, USA, 7–12 June 2015. [Google Scholar]
  267. Wan, L.; Zeiler, M.D.; Zhang, S.; LeCun, Y.; Fergus, R. Regularization of Neural Networks using DropConnect. In Proceedings of the ICML, Atlanta, GA, USA, 16–21 June 2013. [Google Scholar]
268. Krueger, D.; Maharaj, T.; Kramár, J.; Pezeshki, M.; Ballas, N.; Ke, N.R.; et al. Zoneout: Regularizing RNNs by Randomly Preserving Hidden Activations. In Proceedings of the ICLR, Toulon, France, 24–26 April 2017. [Google Scholar]
  269. Klambauer, G.; Unterthiner, T.; Mayr, A.; Hochreiter, S. Self-Normalizing Neural Networks. In Proceedings of the NeurIPS, Long Beach, CA, USA, 4–9 December 2017. [Google Scholar]
  270. Kingma, D.P.; Salimans, T.; Welling, M. Variational Dropout and the Local Reparameterization Trick. In Proceedings of the NeurIPS, Montreal, QC, Canada, 7–12 December 2015. [Google Scholar]
  271. Srivastava, N.; Hinton, G.; Krizhevsky, A.; Sutskever, I.; Salakhutdinov, R. Dropout: A simple way to prevent neural networks from overfitting. J. Mach. Learn. Res. 2014, 15, 1929–1958. [Google Scholar]
  272. Gal, Y.; Hron, J.; Kendall, A. Concrete Dropout. In Proceedings of the NeurIPS 2017, Long Beach, CA, USA, 4–9 December 2017. [Google Scholar]
  273. Goodfellow, I.; Warde-Farley, D.; Mirza, M.; Courville, A.; Bengio, Y. Maxout networks. In Proceedings of the International Conference on Machine Learning, PMLR, Atlanta, GA, USA, 17–19 June 2013; pp. 1319–1327. [Google Scholar]
  274. Cubuk, E.D.; Zoph, B.; Mane, D.; Vasudevan, V.; Le, Q.V. AutoAugment: Learning Augmentation Strategies from Data. arXiv 2018, arXiv:1805.09501. [Google Scholar]
  275. Zhang, H.; Cisse, M.; Dauphin, Y.N.; Lopez-Paz, D. mixup: Beyond empirical risk minimization. In Proceedings of the International Conference on Learning Representations, Vancouver, BC, Canada, 30 April–3 May 2018. [Google Scholar]
  276. Yun, S.; Han, D.; Oh, S.; Chun, S.; Choe, J.; Yoo, Y. Cutmix: Regularization strategy to train strong classifiers with localizable features. arXiv 2019, arXiv:1905.04899. [Google Scholar] [CrossRef]
  277. Jacobs, R.A.; Jordan, M.I.; Nowlan, S.J.; Hinton, G.E. Adaptive mixtures of local experts. Neural Comput. 1991, 3, 79–87. [Google Scholar] [CrossRef]
  278. Shazeer, N.; Mirhoseini, A.; Maziarz, K.; Davis, A.; Le, Q.; Hinton, G.; Dean, J. Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. arXiv 2017, arXiv:1701.06538. [Google Scholar]
  279. Lepikhin, D.; Lee, H.; Xu, Y.; Chen, D.; Firat, O.; Huang, Y.; Krikun, M.; Shazeer, N.; Chen, Z. Gshard: Scaling giant models with conditional computation and automatic sharding. arXiv 2020, arXiv:2006.16668. [Google Scholar] [CrossRef]
  280. Lewis, M.; Bhosale, S.; Dettmers, T.; Goyal, N.; Zettlemoyer, L. Base layers: Simplifying training of large, sparse models. In Proceedings of the International Conference on Machine Learning, PMLR, Virtual, 18–24 July 2021; pp. 6265–6274. [Google Scholar]
281. Ma, J.; Zhao, Z.; Yi, X.; Chen, J.; Hong, L.; Chi, E.H. Modeling task relationships in multi-task learning with multi-gate mixture-of-experts. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, London, UK, 19–23 August 2018; pp. 1930–1939. [Google Scholar] [CrossRef]
  282. Mustafa, B.; Riquelme, C.; Puigcerver, J.; Jenatton, R.; Houlsby, N. Multimodal contrastive learning with limoe: The language-image mixture of experts. Adv. Neural Inf. Process. Syst. 2022, 35, 9564–9576. [Google Scholar]
  283. Zhang, X.; Shen, Y.; Huang, Z.; Zhou, J.; Rong, W.; Xiong, Z. Mixture of attention heads: Selecting attention heads per token. arXiv 2022, arXiv:2210.05144. [Google Scholar] [CrossRef]
  284. Reisser, M.; Louizos, C.; Gavves, E.; Welling, M. Federated mixture of experts. arXiv 2021, arXiv:2107.06724. [Google Scholar] [CrossRef]
  285. Hu, E.J.; Shen, Y.; Wallis, P.; Allen-Zhu, Z.; Li, Y.; Wang, S.; Wang, L.; Chen, W. Lora: Low-rank adaptation of large language models. In Proceedings of the International Conference on Learning Representations, Virtual, 25–29 April 2022; Volume 1, p. 3. [Google Scholar]
  286. Houlsby, N.; Giurgiu, A.; Jastrzebski, S.; Morrone, B.; De Laroussilhe, Q.; Gesmundo, A.; Attariyan, M.; Gelly, S. Parameter-efficient transfer learning for NLP. In Proceedings of the International Conference on Machine Learning, PMLR, Long Beach, CA, USA, 9–15 June 2019; pp. 2790–2799. [Google Scholar]
  287. Dao, T.; Fu, D.; Ermon, S.; Rudra, A.; Ré, C. Flashattention: Fast and memory-efficient exact attention with io-awareness. Adv. Neural Inf. Process. Syst. 2022, 35, 16344–16359. [Google Scholar]
  288. Sandler, M.; Howard, A.; Zhu, M.; Zhmoginov, A.; Chen, L.C. Mobilenetv2: Inverted residuals and linear bottlenecks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 4510–4520. [Google Scholar]
  289. Gretton, A.; Borgwardt, K.; Rasch, M.; Schölkopf, B.; Smola, A. A kernel method for the two-sample-problem. Adv. Neural Inf. Process. Syst. 2006, 19, 513–520. [Google Scholar]
  290. Finn, C.; Abbeel, P.; Levine, S. Model-agnostic meta-learning for fast adaptation of deep networks. In Proceedings of the International Conference on Machine Learning, PMLR, Sydney, Australia, 6–11 August 2017; pp. 1126–1135. [Google Scholar]
  291. Kirkpatrick, J.; Pascanu, R.; Rabinowitz, N.; Veness, J.; Desjardins, G.; Rusu, A.A.; Milan, K.; Quan, J.; Ramalho, T.; Grabska-Barwinska, A.; et al. Overcoming catastrophic forgetting in neural networks. Proc. Natl. Acad. Sci. 2017, 114, 3521–3526. [Google Scholar] [CrossRef] [PubMed]
  292. Chen, S.; Ma, K.; Zheng, Y. Med3d: Transfer learning for 3d medical image analysis. arXiv 2019, arXiv:1904.00625. [Google Scholar] [CrossRef]
  293. Bargshady, G. Inception-CycleGAN: Cross-modal transfer learning for COVID-19 diagnosis. Expert Syst. Appl. 2022, 201, 117092. [Google Scholar]
  294. Kathamuthu, N.D.; Subramaniam, S.; Le, Q.H.; Muthusamy, S.; Panchal, H.; Sundararajan, S.C.M.; Alrubaie, A.J.; Zahra, M.M.A. A deep transfer learning-based convolution neural network model for COVID-19 detection using computed tomography scan images for medical applications. Adv. Eng. Softw. 2023, 175, 103317. [Google Scholar] [CrossRef]
  295. Michau, G.; Fink, O. Adversarial transfer learning for zero-shot anomaly detection in industrial systems. IEEE Trans. Ind. Inform. 2021, 18, 5388–5397. [Google Scholar]
  296. Zhang, L.; Wang, H.; Li, Y. Blockchain-based federated learning for secure multi-plant fault diagnosis. Reliab. Eng. Syst. Saf. 2023, 231, 108965. [Google Scholar]
  297. Chen, J.; Sun, W.; Li, X.; Hou, B. Domain adaptive R-CNN for cross-domain aircraft detection in satellite imagery. ISPRS J. Photogramm. Remote. Sens. 2022, 183, 90–101. [Google Scholar]
  298. Cao, J.; Yan, M.; Jia, Y.; Tian, X.; Zhang, Z. Application of a modified Inception-v3 model in the dynasty-based classification of ancient murals. EURASIP J. Adv. Signal Process. 2021, 2021, 1–25. [Google Scholar] [CrossRef]
  299. Cao, H.; Gu, H.; Guo, X.; Rosenbaum, M. Risk of Transfer Learning and its Applications in Finance. arXiv 2023, arXiv:2311.03283. [Google Scholar] [CrossRef]
  300. Wang, Z.; Dai, Z.; Póczos, B.; Carbonell, J. Characterizing and avoiding negative transfer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 11293–11302. [Google Scholar]
Figure 1. Architecture of GNNs with network layers (in blue), graph nodes (1–3, in green), edges (in orange) and aggregation (in red).
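To make the aggregation step in Figure 1 concrete, the following minimal NumPy sketch implements one graph-convolution layer over a three-node toy graph, using the normalized adjacency matrix and degree matrix notation later listed in Table 2; the feature sizes, random weight matrix, and ReLU choice here are illustrative assumptions rather than the exact configuration of any reviewed model.

```python
import numpy as np

def gcn_layer(H, A, W):
    """One graph-convolution step: ReLU(D^{-1/2} (A + I) D^{-1/2} H W).

    H : (N, F) node feature matrix, A : (N, N) adjacency, W : (F, F') weights.
    """
    A_hat = A + np.eye(A.shape[0])              # add self-loops
    D_inv_sqrt = np.diag(1.0 / np.sqrt(A_hat.sum(axis=1)))
    A_norm = D_inv_sqrt @ A_hat @ D_inv_sqrt    # normalized adjacency (A-tilde)
    return np.maximum(0.0, A_norm @ H @ W)      # aggregate neighbors, then transform

# Toy graph with 3 nodes (as in Figure 1): edges 1-2 and 2-3.
A = np.array([[0, 1, 0],
              [1, 0, 1],
              [0, 1, 0]], dtype=float)
H = np.random.randn(3, 4)          # initial node features
W = np.random.randn(4, 2)          # learnable projection (random here)
print(gcn_layer(H, A, W).shape)    # -> (3, 2): aggregated, transformed features
```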
Figure 2. Schematic diagram of Kolmogorov-Arnold networks (KAN) illustrating multivariate function decomposition via univariate basis transformations.
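For reference, the decomposition sketched in Figure 2 follows the Kolmogorov–Arnold representation theorem, which states that any continuous multivariate function on a bounded domain can be written as a superposition of univariate functions:

$$ f(x_1, \dots, x_n) \;=\; \sum_{q=1}^{2n+1} \Phi_q\!\left( \sum_{p=1}^{n} \phi_{q,p}(x_p) \right), $$

where the inner functions $\phi_{q,p}$ and outer functions $\Phi_q$ are continuous and univariate; KAN layers parameterize and learn these univariate maps (e.g., with splines).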
Figure 3. Physics-informed neural network structure [84].
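As a complement to Figure 3, the sketch below shows how the weighted PINN objective $L_{total} = \lambda_1 L_{Data} + \lambda_2 L_{PDE} + \lambda_3 L_{BC} + \lambda_4 L_{IC}$ (the symbols listed in Table 2) can be assembled with PyTorch autograd for a toy ODE $u'(x) + u(x) = 0$ with $u(0) = 1$; the network size, the ODE, the collocation points, and the unit weights are illustrative assumptions, not the setup of reference [84].

```python
import torch
import torch.nn as nn

net = nn.Sequential(nn.Linear(1, 32), nn.Tanh(), nn.Linear(32, 1))  # u_theta(x)
lam_data, lam_pde, lam_bc, lam_ic = 1.0, 1.0, 1.0, 1.0              # loss weights

def pinn_loss(x_data, u_data, x_col):
    # Data term: fit labeled observations, if any are available.
    loss_data = ((net(x_data) - u_data) ** 2).mean()

    # PDE residual term for the toy ODE u'(x) + u(x) = 0 at collocation points.
    x = x_col.clone().requires_grad_(True)
    u = net(x)
    du = torch.autograd.grad(u, x, torch.ones_like(u), create_graph=True)[0]
    loss_pde = ((du + u) ** 2).mean()

    # Initial-condition term u(0) = 1; a boundary term would be built the same way.
    loss_ic = ((net(torch.zeros(1, 1)) - 1.0) ** 2).mean()
    loss_bc = torch.zeros(())  # no spatial boundary in this 1-D toy problem

    return (lam_data * loss_data + lam_pde * loss_pde
            + lam_bc * loss_bc + lam_ic * loss_ic)

opt = torch.optim.Adam(net.parameters(), lr=1e-3)
x_col = torch.rand(64, 1)                                  # collocation points in [0, 1]
x_obs = torch.tensor([[0.5]]); u_obs = torch.exp(-x_obs)   # one noise-free observation
for _ in range(200):
    opt.zero_grad()
    loss = pinn_loss(x_obs, u_obs, x_col)
    loss.backward()
    opt.step()
```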
Figure 4. The trajectory’s latent space becomes more complex as the input passes through hidden layers [90].
Figure 5. The structure of the RWKV model consists of stacked residual blocks, where each block is made up of a time-mixing sub-block and a channel-mixing sub-block, incorporating recurrent elements to capture past information [40].
Figure 6. View of a continuous time-invariant SSM.
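Figure 6 depicts the continuous time-invariant state-space model $x'(t) = A x(t) + B u(t)$, $y(t) = C x(t)$. A minimal NumPy sketch of how such a system is typically discretized (here with the bilinear transform) and then unrolled as a recurrence is given below; the 2-state matrices and step size are arbitrary illustrative values, not parameters from any cited SSM variant.

```python
import numpy as np

def discretize(A, B, dt):
    """Bilinear (Tustin) discretization of x'(t) = A x(t) + B u(t)."""
    n = A.shape[0]
    inv = np.linalg.inv(np.eye(n) - dt / 2 * A)
    A_bar = inv @ (np.eye(n) + dt / 2 * A)
    B_bar = inv @ (dt * B)
    return A_bar, B_bar

def ssm_scan(A_bar, B_bar, C, u):
    """Run the discrete recurrence x_k = A_bar x_{k-1} + B_bar u_k, y_k = C x_k."""
    x = np.zeros(A_bar.shape[0])
    ys = []
    for u_k in u:
        x = A_bar @ x + B_bar[:, 0] * u_k
        ys.append(C @ x)
    return np.array(ys)

# Illustrative 2-state system (values are arbitrary assumptions).
A = np.array([[-0.5, 1.0], [0.0, -0.2]])
B = np.array([[1.0], [0.5]])
C = np.array([0.3, 0.7])
A_bar, B_bar = discretize(A, B, dt=0.1)
y = ssm_scan(A_bar, B_bar, C, u=np.sin(np.linspace(0.0, 6.0, 60)))
print(y.shape)  # -> (60,)
```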
Figure 7. Structure of a topological neural network: Data associated with a complex system are features defined on a data domain, which is preprocessed into a computational domain that encodes interactions between the system’s components with neighborhoods. The TNN’s layers use message passing to successively update features and yield an output, e.g., a categorical label in classification or a quantitative value in regression. The output represents new knowledge extracted from the input data [146].
Figure 8. Graphic abstract of DL’s modular contributions to AI.
Figure 9. Framework of Med3D [292].
Figure 10. Procedure of portfolio transfer [299].
Table 1. Development of Transformer and its variants.

| Model | Improvement | Application Scenarios |
|---|---|---|
| Transformer | Self-attention mechanism enabling parallel sequence processing, replacing traditional RNNs | Natural Language Processing (NLP), machine translation, image segmentation |
| BERT | Bidirectional Transformer with masked language modeling (MLM) | Text classification, question answering |
| GPT Series | Autoregressive generation with unidirectional Transformer | Text generation, dialogue systems |
| T5 | Unified text-to-text Transformer framework | Multi-task learning, text generation |
| Longformer | Sparse attention (local sliding window + global attention), reduces complexity to O(n) | Long-text summarization, document understanding |
| Reformer | Locality-Sensitive Hashing (LSH) for key grouping, reversible residuals for memory efficiency | Genome sequence analysis, music generation |
| Transformer-XL | Recurrence mechanism (caching previous segments) with relative positional encoding | Language modeling, dialogue systems |
| Linformer | Low-rank projection to compress key-value matrices (O(n) complexity) | Real-time translation, large-scale text processing |
| Performer | FAVOR+ (Fast Attention Via Orthogonal Random features) for linear-time attention | Protein sequence modeling, image generation |
| BioBERT | Domain-specific pretraining on biomedical/scientific corpora | Medical literature mining, chemical entity recognition |
| Vision Transformer (ViT) | Image-patching strategy for standard Transformer adaptation | Image classification, object detection |
| Switch Transformer | Mixture-of-Experts (MoE) with dynamic token routing (trillion-scale parameters) | Large-scale pretraining |
| Deformable DETR | Deformable attention with dynamic receptive fields for faster convergence | Computer vision (e.g., COCO dataset detection) |
| Informer | ProbSparse self-attention and distillation for O(L log L) complexity | Long-sequence forecasting (energy consumption, weather) |
| Autoformer | Autocorrelation mechanism capturing periodic dependencies | Energy demand forecasting |
| Non-stationary Transformer | Hybrid attention (stationary/non-stationary components) | Non-stationary time series prediction |
| Crossformer | Hierarchical cross-scale attention for multivariate interactions | Traffic flow prediction |
| iTransformer | Inverted dimension modeling with variable-specific encoding | Multivariate forecasting |
| ETSformer | Integration of Exponential Smoothing (ETS) decomposition with frequency attention | Medical time series analysis |
| ShapeFormer | Morphological attention for local waveform patterns | Biosignal classification |
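Since most entries in Table 1 build on the same self-attention core, a minimal NumPy sketch of scaled dot-product attention, $\mathrm{softmax}(QK^{\top}/\sqrt{d_k})V$, is given below using the $Q$, $K$, $V$, and $d_k$ notation of Table 2 (Section 2.8); the toy tensor shapes are illustrative assumptions, and masking, multiple heads, and the efficiency mechanisms of the listed variants are deliberately omitted.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                            # (n_q, n_k) similarity logits
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)    # row-wise softmax
    return weights @ V                                         # weighted sum of value vectors

# Toy example: 4 query tokens attending over 6 key/value tokens, d_k = d_v = 8.
rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(4, 8)), rng.normal(size=(6, 8)), rng.normal(size=(6, 8))
print(scaled_dot_product_attention(Q, K, V).shape)   # -> (4, 8)
```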
Table 2. A glimpse into the variables in Section 2.

| Section | Variable in Appearance Order | Explanation |
|---|---|---|
| Section 2.6 | $x$ | The input vector. |
| | $w$ | The weight vector/matrix. |
| | $b$ | The bias. |
| | $\sigma(\cdot)$ | The activation function (such as Sigmoid, ReLU). |
| | $y$ | The output. |
| | $V(G, D)$ | The value function. |
| | $z$ | The feature representation. |
| | $f, g$ | The decoder and encoder. |
| | $O$ | The output feature map/layer. |
| | $F$ | The residual function in ResNet. |
| | $h(\cdot)$ | The identity (constant) mapping in ResNet. |
| | $S$ | The output value of the hidden layer in an RNN. |
| | $U$ | The weight matrix from the input layer to the hidden layer. |
| | $V$ | The weight matrix from the hidden layer to the output layer. |
| | $C_t$ | The cell state. |
| | $i_t, o_t, f_t$ | The input gate, output gate, and forget gate. |
| | $h_t$ | The hidden-state activation computed by the LSTM. |
| | $H_t$ | The node feature matrix in a GNN. |
| | $\tilde{A}$ | The normalized adjacency matrix. |
| | $D$ | The degree matrix. |
| | $\mathcal{D}$ | The dataset. |
| | $D_{\mathrm{KL}}$ | The Kullback–Leibler (KL) divergence. |
| | $L_{total}, L_{Data}, L_{PDE}, L_{BC}, L_{IC}$ | The total, data, partial differential equation, boundary-condition, and initial-condition losses in PINNs. |
| | $\lambda_1, \lambda_2, \lambda_3, \lambda_4$ | The weights that balance the PINN loss terms. |
| | $x(t)$ | The hidden state in an LNN. |
| | $I(t)$ | The input in an LNN. |
| Section 2.7 | $D$ and $G$ | The discriminator and generator. |
| | $\log(1 - D(G(z)))$ | The logarithmic loss term of the generator. |
| | $v \in \{0, 1\}^{D}$ | The set of visible units in the Boltzmann Machine. |
| | $h \in \{0, 1\}^{P}$ | The set of hidden units in the Boltzmann Machine. |
| | $E(v, h)$ | The energy of the state $\{v, h\}$ in the Boltzmann Machine. |
| | $\theta$ | The model parameters. |
| | $W, L, J$ | The visible–hidden, visible–visible, and hidden–hidden weight matrices of the Boltzmann Machine. |
| | $Z(\theta)$ | The partition function. |
| | $\sigma(\cdot)$ | The logistic function. |
| | $(W_t)_{t \ge 0}$ | A standard Wiener process. |
| | $g(t)$ | The weighting function. |
| | $\nabla \log p_t(\cdot)$ | The score function. |
| Section 2.8 | $Q, K, V$ | The query, key, and value matrices. |
| | $d_k$ | The dimension of the key vectors. |
| Section 2.9 | $r_t, k_t, v_t$ | The receptance, key, and value components in RWKV. |
| | $x_t$ | The word (token) at time $t$. |
| Section 2.11 | $p(s' \mid s, a)$ | The state transition density function. |
| | $s$ and $s'$ | The current state and the new state. |
| | $a$ | The action. |
| | $U$ | The cumulated future reward. |
| | $R_t$ | The reward at time $t$. |
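For completeness, the GAN-related symbols in Table 2 (Section 2.7) come from the standard minimax objective

$$ \min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{data}(x)}\big[\log D(x)\big] + \mathbb{E}_{z \sim p_z(z)}\big[\log\big(1 - D(G(z))\big)\big], $$

in which the discriminator $D$ is trained to maximize, and the generator $G$ to minimize, the value function $V(D, G)$; the term $\log(1 - D(G(z)))$ listed in the table is the generator-facing part of this objective.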
Table 3. Activation functions at a glance: formulations, properties, and usages.

| Function | Equation (Forward) | Range | VD | Cost | Best Practice/Pairing |
|---|---|---|---|---|---|
| Classic (’80s–’00s) | | | | | |
| Sigmoid | $\sigma(x) = \frac{1}{1 + e^{-x}}$ | (0, 1) | H | low | Prob. output, shallow nets |
| Tanh | $\tanh(x)$ | (−1, 1) | H | low | RNN pre-2010 |
| Softmax | $\frac{e^{x_i}}{\sum_j e^{x_j}}$ | (0, 1) | – | med | Final layer, multi-class |
| Non-saturating/adaptive (2010s) | | | | | |
| ReLU | $\max(0, x)$ | [0, ∞) | L | very low | Default CNN/FC |
| L-ReLU | $\max(\alpha x, x)$, $\alpha = 0.01$ | (−∞, ∞) | L | very low | Sparse/audio nets |
| ELU | $x$ ($x \ge 0$); $\alpha(e^{x} - 1)$ ($x < 0$) | (−α, ∞) | L | low | Smooth zero-centre |
| Swish | $x \cdot \sigma(\beta x)$, $\beta = 1$ or learned | (−∞, ∞) | L | low | Deep CNN, NAS-found |
| GELU | $x\,\Phi(x) = \frac{x}{2}\left[1 + \operatorname{erf}(x/\sqrt{2})\right]$ | (−∞, ∞) | L | med | Transformers, BERT, GPT |
| Dynamic/conditional (2020s) | | | | | |
| SwiGLU | $\mathrm{Swish}(xW) \otimes (xV)$ | (−∞, ∞) | L | med–high | FFN inside Transformers |
| D-ReLU | $\max(\alpha_k x + \beta_k, x)$, $k$ = input cond. | [0, ∞) | L | med | Mobile CNN, few-shot |
| VeLU | $\arctan(\sin(x)) \cdot \gamma(\mathrm{Var}[x])$ | (−π/2, π/2) | L | med | Variance-sensitive tasks |
| FReLU | $\max(x, T(x))$, $T$ = spatial context | [0, ∞) | L | low | Object detection, Seg |
| Mish | $x \cdot \tanh(\ln(1 + e^{x}))$ | (−∞, ∞) | L | low | General-purpose CNN |
Vanishing derivative risk: H = high, M = moderate, L = low. For Softmax the derivative is bounded away from zero in its domain, so the entry “–” indicates “not applicable/no VD issue”.
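To make the formulations in Table 3 concrete, the following NumPy sketch implements the forward pass of several of the listed functions with the parameter defaults shown in the table; it is a minimal illustration rather than a drop-in replacement for framework-provided activations (the exact erf-based GELU uses SciPy here, while a tanh approximation would avoid that dependency).

```python
import numpy as np
from math import sqrt
from scipy.special import erf   # exact GELU per Table 3; vectorized over arrays

def sigmoid(x):             return 1.0 / (1.0 + np.exp(-x))
def relu(x):                return np.maximum(0.0, x)
def leaky_relu(x, a=0.01):  return np.maximum(a * x, x)
def elu(x, a=1.0):          return np.where(x >= 0, x, a * (np.exp(x) - 1.0))
def swish(x, beta=1.0):     return x * sigmoid(beta * x)
def gelu(x):                return 0.5 * x * (1.0 + erf(x / sqrt(2.0)))
def mish(x):                return x * np.tanh(np.log1p(np.exp(x)))

# Quick check of outputs over a small grid, matching the ranges given in Table 3.
x = np.linspace(-3, 3, 7)
for name, f in [("sigmoid", sigmoid), ("relu", relu), ("leaky_relu", leaky_relu),
                ("elu", elu), ("swish", swish), ("gelu", gelu), ("mish", mish)]:
    print(f"{name:>10}: {np.round(f(x), 3)}")
```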
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
