Article

Efficient and Lightweight Differentiable Architecture Search

Min Zhou, Wenqi Du, Jianming Li and Xin Li
School of Computer Science and Artificial Intelligence, Civil Aviation Flight University of China, Guanghan 618307, China
* Authors to whom correspondence should be addressed.
Electronics 2026, 15(2), 314; https://doi.org/10.3390/electronics15020314
Submission received: 5 December 2025 / Revised: 4 January 2026 / Accepted: 9 January 2026 / Published: 10 January 2026

Abstract

While Neural Architecture Search (NAS) has revolutionized the automation of deep learning model design, gradient-based approaches like DARTS often suffer from high computational overhead, skip-connection-induced performance collapse, and optimization instability. To address these limitations, we propose Efficient and Lightweight Differentiable Architecture Search (EL-DARTS). EL-DARTS constructs a compact and redundancy-reduced search space, integrates a partial-channel strategy to lower memory usage, employs a Dynamic Coefficient Scheduling Strategy to balance edge importance, and introduces entropy regularization to sharpen operator selection. Experiments on CIFAR-10 and ImageNet demonstrate that EL-DARTS substantially improves both search efficiency and accuracy. Remarkably, it attains a 2.47% error rate on CIFAR-10, requiring merely 0.075 GPU-days for the search. On ImageNet, the discovered architecture achieves a 26.2% top-1 error while strictly adhering to the mobile setting (<600 M MACs). These findings confirm that EL-DARTS effectively stabilizes the search process and pushes the efficiency frontier of differentiable NAS.

1. Introduction

In recent years, deep learning (DL) has advanced at an extraordinary pace and has been successfully deployed across a wide spectrum of application domains [1,2,3]. These advancements are largely driven by the intrinsic flexibility of deep learning models and their strong capability for hierarchical feature representation. Although architecture fundamentally governs model performance, designing optimal structures remains a challenge. It entails the meticulous specification of hyperparameters—including layer types and dimensions—that dictate the model’s efficiency and generalization capabilities. As neural networks become increasingly deep and complex, the design of convolutional neural networks (CNNs) requires accounting for an expanding set of considerations, while the availability of large-scale datasets such as ImageNet [4] further amplifies the cost of manual trial-and-error exploration. Against this backdrop, neural architecture search (NAS) has emerged as a promising research direction and has garnered substantial attention within the deep learning community [5,6,7]. The central aim of NAS is to automate the construction of neural network architectures, thereby reducing human intervention and discovering architectures that surpass those crafted through manual design.
Popular NAS approaches typically construct a predefined search space that enumerates all candidate architectures, and then employ heuristic search strategies to identify an optimal design. Early mainstream methods largely relied on evolutionary algorithms (EAs) and reinforcement learning (RL) as the primary optimization paradigms.
By optimizing a recurrent neural network (RNN) controller with reinforcement learning, Zoph et al. [8] achieved a remarkable 96.35% accuracy on CIFAR-10. Other notable RL-based contributions include MetaQNN by Baker et al. [9], which adopts greedy search, and BlockQNN by Zhong et al. [10], which focuses on block-structured spaces.
In the domain of evolutionary algorithms, NASNet [11] systematically eliminates underperforming architectures during the search process. In addition, So et al. [12] introduced an evolutionary Transformer framework, and Real et al. [13] ensured population diversity through regularized evolution.
However, traditional architecture search strategies employing Bayesian optimization [14], evolutionary computation, or reinforcement learning incur prohibitive computational costs. For instance, obtaining a state-of-the-art architecture may require 2000 GPU-days using reinforcement learning [15] or 3150 GPU-days using evolutionary algorithms [13]. To address the issue of computational cost, Liu et al. proposed Differentiable Architecture Search (DARTS) [16]. DARTS relaxes the discrete architecture representation into a continuous search space and optimizes validation performance directly via gradient descent, reducing the computational overhead from thousands of GPU-days to only a few GPU-days.
In recent years, DARTS has rapidly established itself as a mainstream paradigm for the automated design of deep neural network architectures [17]. Despite its remarkable efficiency, subsequent studies have identified several limitations of DARTS.
Depth Gap Problem: K. Yu et al. [18] pointed out that the discrepancy in network depth between the search phase (shallow) and the evaluation phase (deep) can lead the original DARTS to perform no better than, and sometimes even worse than, random search.
Performance Collapse and Skip-Connection Dominance: Studies have shown that due to the cumulative advantage of parameter-free operations, the final architectures are often dominated by skip connections, resulting in severe performance degradation [19].
To address these issues, researchers have proposed various solutions, including mitigating the dominance of skip connections [20]. By introducing a regularization term, Movahedi et al. [21] aimed to mitigate collapse through the harmonization of cell operations. Similarly, a collaborative competition technique was employed by Xie et al. [22] to improve perturbation-based architecture selection. Additionally, Huang et al. [23] investigated the efficacy of DARTS in super-resolution scenarios. Luo et al. [24] introduced hardware-aware SurgeNAS to alleviate memory bottlenecks, while Li et al. [25] employed a polarization regularizer to discover more effective models.
Although DARTS represents a major breakthrough in search efficiency, it still faces severe memory bottlenecks when dealing with high-dimensional search spaces. To further enhance search efficiency and reduce dependence on hardware resources, extensive research has focused on lightweight design and memory optimization. To address the issue of excessive memory consumption, PC-DARTS [26] introduced a Partial Channel Connections strategy. By sampling and selecting operations on only a subset of channels during the search process, this method significantly improves memory utilization efficiency [27]. Such a design not only greatly reduces computational cost but also enables the use of larger batch sizes under the same hardware constraints, thereby improving search stability. To mitigate the potential information loss caused by channel sampling, Xue et al. [28] incorporated the Convolutional Block Attention Module (CBAM) [29] into the architecture search. Li et al. proposed DLW-NAS [30], which designs a differentiable sampler on top of a super-network to avoid exhaustively enumerating all possible sub-networks [27].
To address the aforementioned issues, we propose an efficient and lightweight differentiable architecture search method (EL-DARTS). In the design of the search space, redundant operations in DARTS are removed and a partial-channel strategy is adopted to reduce memory consumption, thereby substantially improving search efficiency. During architecture modeling, a Dynamic Coefficient Scheduling Strategy is introduced to enhance the stability of the architecture parameters, alleviate the overuse of skip connections, and ensure fair competition among different candidate operators. In addition, an entropy-based regularization term is incorporated into the loss function to sharpen the distribution of architecture parameters, which accelerates search convergence and reduces the risk of becoming trapped in suboptimal local minima. Our main contributions are summarized as follows:
  • A lightweight neural architecture search algorithm is presented that preserves accuracy while substantially reducing parameter count and search time.
  • A FLOP-, parameter-, and latency-driven pruning strategy streamlines the conventional DARTS search space by removing redundant and computationally expensive operators, yielding a more compact operation set that improves both search efficiency and stability.
  • A Dynamic Coefficient Scheduling Strategy is employed to suppress the early dominance of skip connections—resulting from their parameter-free nature—and to promote fair competition among all candidate operators.
  • A progressively enhanced entropy-based regularization term is incorporated to sharpen the distribution of architecture parameters, accelerating operator differentiation, stabilizing structure selection during discretization, and reducing final accuracy degradation.
EL-DARTS effectively mitigates the excessive dominance of skip connections, accelerates operator differentiation, and achieves a superior balance between efficiency and performance compared with DARTS on both CIFAR-10 and ImageNet. Specifically, we achieved an error rate of 2.47% within less than 0.1 GPU-days (approximately 1.8 h) on a single GTX 1080 Ti, substantially surpassing the 3.15% error rate reported by DARTS, which requires 1.0 GPU-day for its search process.

2. Materials and Methods

2.1. Principle of DARTS

DARTS [16] formulates the neural architecture search space as a directed acyclic graph (DAG) where edges represent candidate operations from a predefined set $\mathcal{O}$. To enable gradient-based optimization, the discrete selection of operations is relaxed into a continuous search space using a SoftMax distribution over learnable architecture parameters $\alpha$. For an edge $(i, j)$, the mixed operation $m^{(i,j)}(x)$ is computed as:
$$m^{(i,j)}(x) = \sum_{o \in \mathcal{O}} \frac{\exp\left(\alpha_{o}^{(i,j)}\right)}{\sum_{o' \in \mathcal{O}} \exp\left(\alpha_{o'}^{(i,j)}\right)}\, o(x)$$
where $x$ is the input feature map, and $o(\cdot)$ denotes a candidate operation from the search space $\mathcal{O}$.
The search process is modeled as a bilevel optimization problem: the goal is to find the optimal $\alpha$ that minimizes the validation loss $\mathcal{L}_{val}$, subject to (s.t.) the constraint that the network weights $w^{*}(\alpha)$ are the optimal weights obtained by minimizing the training loss $\mathcal{L}_{train}$:
$$\min_{\alpha}\; \mathcal{L}_{val}\left(w^{*}(\alpha), \alpha\right) \quad \text{s.t.} \quad w^{*}(\alpha) = \arg\min_{w}\; \mathcal{L}_{train}(w, \alpha)$$
After the search phase, the final discrete architecture is derived by selecting the operation with the highest α value for each edge. While efficient, this standard formulation suffers from memory bottlenecks and instability, which we address in the following sections.
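As a concrete illustration of this continuous relaxation, the following PyTorch sketch implements a single mixed edge; the class name `MixedOp` and its local `alpha` parameter are illustrative choices rather than the original DARTS code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MixedOp(nn.Module):
    """One edge of the DAG: a SoftMax-weighted sum over candidate operations."""
    def __init__(self, candidate_ops):
        super().__init__()
        self.ops = nn.ModuleList(candidate_ops)
        # One architecture parameter (logit) per candidate operation on this edge.
        self.alpha = nn.Parameter(1e-3 * torch.randn(len(candidate_ops)))

    def forward(self, x):
        weights = F.softmax(self.alpha, dim=0)          # continuous relaxation over operations
        return sum(w * op(x) for w, op in zip(weights, self.ops))
```

Discretization then simply keeps, for each edge, the operation with the largest entry of `alpha`.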

2.2. Lightweight Methods

2.2.1. More Efficient Search Space

The conventional DARTS search space typically includes eight fundamental operators. To clarify the notations used in this paper, we define the shortcuts used for these operations: none represents the zero operation (indicating no connection between nodes), max_pool denotes Max Pooling, avg_pool denotes Average Pooling, skip_connect represents an identity mapping (direct connection), sep_conv stands for Depthwise Separable Convolution, and dil_conv refers to Dilated Convolution. The suffixes (e.g., 3 × 3, 5 × 5) indicate the kernel size. The complete set includes: none, max_pool_3 × 3, avg_pool_3 × 3, skip_connect, sep_conv_3 × 3, sep_conv_5 × 5, dil_conv_3 × 3, and dil_conv_5 × 5. Although these operators offer diverse combinatorial possibilities for network architecture design, they exhibit substantial functional redundancy. In particular, the multiple SepConv operators with different kernel sizes, together with pooling-based operators, provide only limited differences in representational capability while significantly increasing the redundancy of the search space. Moreover, the computational costs of candidate operators exhibit significant disparities. For instance, parameter-free operations like skip connections and pooling incur negligible overhead, whereas large-kernel convolutions require substantially higher FLOPs and latency. This asymmetry skews the search process toward low-cost operators, destabilizing the optimization of architectural parameters and leading to degenerate solutions, typically networks dominated by skip-connections.
For mobile or lightweight application scenarios, the practical benefits of certain operators are limited, and some may even increase inference latency and energy consumption. As described in VGG [31], two 3 × 3 convolutions require fewer parameters and FLOPs than a single 5 × 5 convolution, while yielding comparable performance. Moreover, NAS-Bench-201 [32], a standard benchmark suite in the neural architecture search (NAS) field, also excludes 5 × 5 and 7 × 7 convolutional operations. Therefore, following the design principles adopted in VGG and NAS-Bench-201, we discard the high-FLOP operators used in DARTS. After measuring the latency of each operator on the target platform (as shown in Figure 1) and making an overall assessment, we select sep_conv_3 × 3, max_pool_3 × 3, skip_connect, and dil_conv_3 × 3 as the fundamental operators. This reduces redundancy in the search space and improves search efficiency.
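A minimal sketch of the retained operation set is shown below, assuming standard DARTS-style building blocks; the constructor signatures and the simplified strided skip connection are illustrative and not the authors' exact implementation.

```python
import torch.nn as nn

def sep_conv_3x3(c, stride):
    # Depthwise-separable 3x3 convolution: depthwise conv followed by a pointwise conv.
    return nn.Sequential(
        nn.Conv2d(c, c, 3, stride=stride, padding=1, groups=c, bias=False),
        nn.Conv2d(c, c, 1, bias=False),
        nn.BatchNorm2d(c),
        nn.ReLU(inplace=True),
    )

def dil_conv_3x3(c, stride):
    # Dilated depthwise-separable 3x3 convolution (dilation 2, effective receptive field 5x5).
    return nn.Sequential(
        nn.Conv2d(c, c, 3, stride=stride, padding=2, dilation=2, groups=c, bias=False),
        nn.Conv2d(c, c, 1, bias=False),
        nn.BatchNorm2d(c),
        nn.ReLU(inplace=True),
    )

# The four fundamental operators retained in the EL-DARTS search space.
EL_DARTS_OPS = {
    "sep_conv_3x3": sep_conv_3x3,
    "dil_conv_3x3": dil_conv_3x3,
    "max_pool_3x3": lambda c, stride: nn.MaxPool2d(3, stride=stride, padding=1),
    # Simplified skip: identity at stride 1, a strided 1x1 conv otherwise.
    "skip_connect": lambda c, stride: nn.Identity() if stride == 1
                    else nn.Conv2d(c, c, 1, stride=stride, bias=False),
}
```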
The overall framework is illustrated in Figure 2. On the left, input1 and input2 denote the input nodes of the cell, while output represents the cell’s output node, and Node0–Node3 are the intermediate nodes. The black edges indicate mixed operations, with each edge containing four candidate operators (skip_connect, sep_conv_3 × 3, max_pool_3 × 3, and dil_conv_3 × 3). The direction of the arrows reflects the flow of information (i.e., the inputs to Nodes 0–3). The four blue lines represent the channel concatenation that forms the final output of the cell. The diagram on the right depicts the macro-level architecture during the search phase, where the green arrows indicate that the outputs of preceding cells serve as the inputs to subsequent cells.

2.2.2. More Efficient Training Process

To improve training efficiency in terms of GPU memory utilization, this study adopts the partial channel connection strategy proposed in PC-DARTS [26]. Specifically, to effectively reduce memory overhead and computational cost while maintaining high search performance, the feature channels c of each edge are divided into two separate branches. One subset of channels is directed to the Operation Selection Block, where different candidate operations are evaluated during the architecture search. The remaining channels—referred to as the masked part—are directly propagated to the output, preserving the input feature information through an identity mapping and thereby maintaining feature consistency across layers.
After the operation selection process, the outputs from both branches are concatenated and processed by a Channel Shuffle operation to facilitate inter-channel information exchange and feature fusion. This operation effectively mitigates the information fragmentation problem introduced by channel partitioning, ensuring that the resulting architecture maintains strong complementarity and connectivity in feature representation.
Unlike conventional DARTS, which performs operation selection over all feature channels, PC-DARTS and our method apply the search process only to a subset of channels, significantly reducing computational and memory demands during the search phase. This improvement enables neural architecture search to scale to larger networks and more complex datasets. Meanwhile, the channels excluded from the search still participate in feature propagation via identity mapping, which enhances training stability and generalization capability, preventing performance degradation caused by over-optimization.
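The partial-channel mechanism can be sketched as follows, following the PC-DARTS idea with a sampling ratio of 1/k; the `channel_shuffle` helper and class names are illustrative, and the sketch assumes every candidate operation preserves the channel count of its input.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def channel_shuffle(x, groups):
    # Interleave channels across groups so the searched and bypassed parts exchange information.
    n, c, h, w = x.shape
    x = x.view(n, groups, c // groups, h, w)
    return x.transpose(1, 2).contiguous().view(n, c, h, w)

class PartialChannelMixedOp(nn.Module):
    """Evaluate candidate operations on 1/k of the channels; bypass the rest via identity."""
    def __init__(self, candidate_ops, k=4):
        super().__init__()
        self.k = k
        self.ops = nn.ModuleList(candidate_ops)          # each op acts on channels // k
        self.alpha = nn.Parameter(1e-3 * torch.randn(len(candidate_ops)))

    def forward(self, x):
        c = x.size(1)
        x_sel, x_rest = x[:, : c // self.k], x[:, c // self.k:]   # split the channels
        w = F.softmax(self.alpha, dim=0)
        searched = sum(wi * op(x_sel) for wi, op in zip(w, self.ops))  # search on the subset
        out = torch.cat([searched, x_rest], dim=1)                     # identity for the masked part
        return channel_shuffle(out, self.k)                            # restore inter-channel mixing
```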

2.3. Dynamic Coefficient Scheduling Strategy

In the original DARTS framework, each edge performs a SoftMax normalization over the architecture parameters α ( i , j ) corresponding to all candidate operations, forming a mixed operation. However, this mechanism introduces a clear unfair competition problem during optimization. Specifically, the skip-connection tends to dominate the search process due to its parameter-free nature, shorter gradient propagation path, and higher sensitivity to training loss reduction. Consequently, it rapidly accumulates advantages in the early search stages, leading to significantly larger SoftMax weights compared to other operations. This phenomenon, known as skip-connection dominance, causes the searched architectures to become overly simplified in later stages, thereby limiting their representational capacity.
Moreover, DARTS implicitly assumes that all incoming edges to a given node contribute equally to feature aggregation. In practice, however, the importance of different edges varies dynamically throughout training. When all edges are treated with equal importance, potentially valuable connections may suffer from persistent gradient disadvantages, preventing them from being adequately activated. As a result, the diversity and robustness of the final genotype are substantially reduced.
To address these issues, we propose a Dynamic Coefficient Scheduling Strategy (DCSS) that introduces learnable, dynamically adjusted weighting coefficients γ ( i , j ) at the edge level. This mechanism enables adaptive modeling of the relative importance of different input edges. By decoupling these dynamic coefficients from the architecture parameters α during optimization, the proposed strategy allows the model to flexibly adjust edge importance according to the training dynamics.
Specifically, for all incoming edges $(i, j)$ connected to a node $j$ within a cell, we first initialize a set of random coefficients $\gamma_{0}^{(i,j)}$ drawn from a uniform distribution:
$$\gamma_{0}^{(i,j)} \sim U(a, b), \quad a, b \in (0, 1)$$
These coefficients serve as the initial dynamic weights that reflect the relative importance of each edge. Subsequently, a SoftMax normalization is applied to obtain the normalized weight $p_{\gamma}^{(i,j)}$ of each edge:
$$p_{\gamma}^{(i,j)} = \frac{\exp\left(\gamma^{(i,j)}\right)}{\sum_{i' < j} \exp\left(\gamma^{(i',j)}\right)}$$
The output of a node $x^{(j)}$ can then be expressed as:
$$x^{(j)} = \sum_{i < j} p_{\gamma}^{(i,j)}\, m^{(i,j)}\left(x^{(i)}\right)$$
Here, $m^{(i,j)}(x^{(i)})$ is the mixed operation defined in Section 2.1 over the candidate set $\mathcal{O}$. This design introduces a two-level learnable weighting mechanism: the outer-level coefficient $p_{\gamma}^{(i,j)}$ adjusts the relative importance of different edges, while the inner-level mixed operation differentiates among the candidate operators within each edge.
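The two-level weighting can be summarized by the sketch below, where `gamma` plays the role of the edge coefficients $\gamma^{(i,j)}$ and each incoming edge is a mixed operation as in Section 2.1; the uniform-initialization bounds (0.1, 0.9) are illustrative values within $(0, 1)$.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DCSSNode(nn.Module):
    """Aggregate incoming edges with learnable, SoftMax-normalized edge coefficients."""
    def __init__(self, incoming_mixed_ops):
        super().__init__()
        self.edges = nn.ModuleList(incoming_mixed_ops)   # one mixed operation per incoming edge
        # gamma_0 ~ U(a, b) with a, b in (0, 1); the concrete bounds here are illustrative.
        self.gamma = nn.Parameter(torch.empty(len(incoming_mixed_ops)).uniform_(0.1, 0.9))

    def forward(self, inputs):
        # inputs[i] is the feature map arriving along the i-th incoming edge.
        p = F.softmax(self.gamma, dim=0)                 # normalized edge weights p_gamma
        return sum(p[i] * edge(inputs[i]) for i, edge in enumerate(self.edges))
```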
During the training phase, γ ( i , j ) is dynamically updated through the back-propagation of the loss function. In the initial stages, the edge weights exhibit minimal variance, which enables the model to adequately explore diverse connectivity patterns within a broad search space. As training progresses, the edge weights gradually reflect their actual contribution to task performance, leading to the attenuation of weights corresponding to ineffective or redundant connections, while reinforcing those with robust feature transmission capabilities.
The primary benefit of this approach is achieved by explicitly introducing a normalized edge weight between nodes. Consequently, the contribution of the skip operation no longer solely relies on its parameter-free nature; instead, it must compete with other operators under the same distribution, thereby mitigating the issue of unbounded growth typically associated with skip connections. Simultaneously, the dynamic adjustment opportunities afforded to different input edges during training result in a more balanced operator distribution, which effectively reduces the oscillation and collapse phenomena observed in the architectural parameter α . This dynamic coefficient enhances the network’s exploratory capacity during the early training phase, leading to the generation of a greater number of differentiated yet performance-similar candidate architectures during the discretization stage, ultimately improving the model’s transferability across various tasks.
It is noteworthy that, unlike strategies such as Fair-DARTS [33] which enforce fairness through gradient clipping or weight projection, the proposed method achieves a “soft constraint” balance via the adaptive evolution of γ , eliminating the need for explicit modifications to the architectural parameter gradients. This characteristic results in minimal computational overhead and facilitates easy implementation. Furthermore, the dynamic coefficient scheduling mechanism and the subsequent entropy regularization term are complementary in their optimization granularity: the former regulates the balance of feature flow at the edge level, while the latter promotes the sharpening of operator weights at the operation level. The synergistic effect of these two components collectively enhances the convergence and stability of the search process.

2.4. Entropy Regularization

In the late stages of the search, the tendency of the multi-operator weight distribution to become smooth may result in the discarding of potentially effective operations during the final discretization phase (i.e., genotype extraction). This ambiguity often leads to performance degradation and instability in the derived model. To mitigate this issue, we introduce an Entropy Regularization term. This mechanism is designed to constrain the distribution entropy of the operator weights on each edge, compelling it to gradually decrease during the latter part of training.
To further control this sharpening process effectively, we incorporate a Temperature Annealing strategy. Instead of utilizing the standard SoftMax function, we formulate the entropy calculation based on a temperature-scaled probability distribution. For the architectural parameters $\alpha$ associated with a single edge, the temperature-scaled probability $p_i$ is defined as:
$$p_i = \frac{\exp(\alpha_i / T)}{\sum_{j} \exp(\alpha_j / T)}$$
where $T$ is the temperature coefficient that controls the sharpness of the distribution. Based on this probability distribution, the entropy $H(p)$ for a given edge is formally defined as:
$$H(p) = -\sum_{i} p_i \log(p_i)$$
A higher T yields a softer distribution (high entropy) suitable for broad exploration in the early phase, while a lower T leads to a sharper distribution (low entropy) for precise exploitation in the later phase. To achieve a progressive sharpening effect, we employ an exponential decay schedule for T with a lower bound limit:
$$T(t) = \max\left(T_{\min},\; T_{\mathrm{init}} \times \beta^{t}\right)$$
where $t$ denotes the current epoch, $T_{\mathrm{init}} = 5.0$ is the initial temperature, $\beta = 0.95$ is the decay factor, and $T_{\min} = 0.5$ serves as the clipping threshold to prevent numerical instability.
Let $w$ denote the network weights. Given the original validation loss $\mathcal{L}_{val}$, the total loss function $\mathcal{L}_{total}$ incorporating the temperature-aware entropy regularization is formulated as:
$$\mathcal{L}_{total} = \mathcal{L}_{val}\left(w^{*}(\alpha), \alpha\right) + \lambda \sum_{e \in E} H\left(p^{(e)}\right)$$
where λ is the regularization weight coefficient. In our experiments, λ is set to a constant value of 0.001 to maintain a consistent regularization strength. This combined strategy forces the model to gradually favor deterministic architectural selection, thereby mitigating the randomness and the risk of overfitting during the search phase. Conceptually, this approach resonates with the inverse of the Maximum Entropy Principle in information theory, where certainty in model decisions is reinforced by minimizing entropy. Furthermore, relevant studies [34] have demonstrated that reducing the entropy of the architectural parameters effectively enhances structural consistency and search reproducibility during the discretization phase.
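A small sketch of the temperature-annealed entropy term is given below; it is added to the validation loss with weight λ = 0.001 and operates on the list of per-edge architecture parameters. The helper names are ours, not the authors' released code.

```python
import torch
import torch.nn.functional as F

def temperature(epoch, t_init=5.0, beta=0.95, t_min=0.5):
    # Exponential decay with a lower bound: T(t) = max(T_min, T_init * beta^t).
    return max(t_min, t_init * (beta ** epoch))

def entropy_regularizer(edge_alphas, T):
    # Sum of Shannon entropies of the temperature-scaled operation distributions, one per edge.
    total = 0.0
    for alpha in edge_alphas:                 # alpha: 1-D tensor of operation logits for one edge
        p = F.softmax(alpha / T, dim=0)
        total = total - (p * torch.log(p + 1e-8)).sum()
    return total

# loss_total = loss_val + 0.001 * entropy_regularizer(edge_alphas, temperature(epoch))
```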
Finally, incorporating the efficient search space, dynamic coefficient scheduling, and entropy regularization, the overall training procedure of EL-DARTS during the search phase is summarized in Algorithm 1.
Algorithm 1: EL-DARTS Search Phase Optimization
Input: training data D_train, validation data D_val
Initialize the network weights w, the architecture parameters α, the edge coefficients γ, and the temperature T
While not converged do:
 Update T according to T(t) = max(T_min, T_init × β^t)
 For each mini-batch do:
  Sample a batch from D_val and generate the channel mask M
  Compute the total loss L_total and update the architecture parameters (α, γ) on D_val
  Sample a batch from D_train and generate the channel mask M
  Compute the training loss L_train and update the network weights w on D_train
 End for
End while
Output: Derive the final discrete architecture (Genotype)
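A condensed PyTorch-style rendering of Algorithm 1 is shown below, reusing the temperature and entropy helpers sketched in Section 2.4. It assumes the supernet exposes its weights and its architecture parameters (α, γ) separately through the hypothetical methods `weight_parameters()`, `arch_parameters()`, `loss()`, `edge_alphas()`, and `genotype()`; the weight-optimizer settings follow Section 3.2, while the architecture-optimizer settings are placeholders.

```python
import torch

def el_darts_search(supernet, train_loader, val_loader, epochs=50, lam=1e-3):
    # Separate optimizers for network weights w and architecture parameters (alpha, gamma).
    w_opt = torch.optim.SGD(supernet.weight_parameters(), lr=0.1,
                            momentum=0.9, weight_decay=3e-4)
    arch_opt = torch.optim.Adam(supernet.arch_parameters(), lr=6e-4)  # placeholder setting

    for epoch in range(epochs):
        T = temperature(epoch)                                   # anneal distribution sharpness
        for (x_tr, y_tr), (x_val, y_val) in zip(train_loader, val_loader):
            # 1) Architecture step on validation data (first-order approximation).
            arch_opt.zero_grad()
            loss_val = supernet.loss(x_val, y_val) \
                       + lam * entropy_regularizer(supernet.edge_alphas(), T)
            loss_val.backward()
            arch_opt.step()

            # 2) Weight step on training data under the current mixture.
            w_opt.zero_grad()
            supernet.loss(x_tr, y_tr).backward()
            w_opt.step()
    return supernet.genotype()                                   # derive the discrete architecture
```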

3. Experiments

3.1. Experimental Setup and Datasets

The experiments were conducted in an environment running Ubuntu 22.04 (Canonical Ltd., London, UK) with an NVIDIA GTX 1080 Ti GPU (NVIDIA Corporation, Santa Clara, CA, USA). The programming environment was Python 3.10 (Python Software Foundation, Wilmington, DE, USA) with the PyTorch 2.1.0 (Meta Platforms Inc., Menlo Park, CA, USA) deep learning framework. Two standard datasets commonly used for evaluating Neural Architecture Search (NAS) methods, CIFAR-10 and ImageNet [4], were employed for image classification.
  • CIFAR-10: This dataset contains 10 object classes with 6000 images per class, amounting to 60,000 images in total. All images are RGB formatted with a resolution of 32 × 32. We follow the standard split, utilizing 50,000 images for model training and the remaining 10,000 for evaluation.
  • ImageNet: The ImageNet dataset serves as a large-scale benchmark, covering 1000 distinct object classes. It provides a training set of approximately 1.28 million images and a validation set of 50,000 images. The images are high-resolution, with a roughly balanced distribution across the categories.
Consistent with the standard practice of DARTS, the architecture search process was carried out on CIFAR-10, and the derived architecture was subsequently evaluated on both CIFAR-10 and ImageNet. For the ImageNet evaluation, we adopt the Mobile Setting, where the input image size is fixed at 224 × 224 and the total number of Multi-Add Operations (often denoted as MACs or FLOPs) is constrained to less than 600 M. A comprehensive set of experiments was performed to assess the proposed method, utilizing Test Error (%), the number of Parameters (Param), and Search Cost (GPU-Days) as the primary evaluation metrics.
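To verify that a candidate network satisfies the mobile setting, its MAC count at 224 × 224 can be checked with a profiler. The snippet below uses the third-party `thop` package purely as an illustration; the paper does not state which counting tool was actually used.

```python
import torch
from thop import profile  # third-party MAC/parameter counter, used here only as an example

def satisfies_mobile_setting(model, mac_budget=600e6):
    # Profile a single forward pass at the ImageNet mobile-setting resolution.
    dummy = torch.randn(1, 3, 224, 224)
    macs, params = profile(model, inputs=(dummy,), verbose=False)
    print(f"MACs: {macs / 1e6:.1f} M, Params: {params / 1e6:.2f} M")
    return macs < mac_budget
```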

3.2. Results on CIFAR-10

Our training configurations largely adhere to the settings established by DARTS. During the architecture search stage, we build a supernet consisting of eight cells (two reduction and six normal cells), where each cell contains N = 6 nodes (two input nodes and four intermediate nodes). The search process runs for 50 epochs with an initial channel width of 16. To perform bi-level optimization, we partition the 50,000 CIFAR-10 training images into two equal halves: the first half is used to optimize the model weights (w), while the second is reserved for updating the architecture parameters ($\alpha$). We adopt the partial connection strategy with a sampling ratio of K = 4, meaning that only 1/4 of the channels on each edge are routed through the candidate operations. Due to the high memory consumption of full channel connections, the original DARTS is limited to a batch size of 64. In contrast, our memory-efficient strategy allows us to increase the search-phase batch size to 256, which improves the stability of the optimization process. The network weights (w) are optimized using SGD with momentum 0.9, an initial learning rate of 0.1 (annealed to zero using a cosine schedule without restarts), and a weight decay of 3 × 10−4; the entropy regularizer coefficient is initialized to 0.001. Benefiting from the increased batch size, the entire search process takes only 1.8 h on a single NVIDIA GTX 1080 Ti GPU for CIFAR-10.
Regarding the results obtained from the search conducted on the CIFAR-10 dataset, the heatmap of the architectural parameters α is presented in Figure 3, and the resulting cell architectures (Normal and Reduction Cells) are displayed in Figure 4.
As shown in Figure 4, the final normal cell contains only 2 skip-connections out of 8 edges. This stands in contrast to collapsed architectures often observed in standard DARTS, where skip-connections can occupy the majority of the edges. This structural balance confirms that our Dynamic Coefficient Scheduling Strategy effectively alleviates skip-connection dominance.
To ensure a fair comparison, our evaluation phase adheres strictly to the standard DARTS framework. The constructed network comprises a stack of 20 cells, consisting of 18 Normal Cells and 2 Reduction Cells, where all cells of a given type employ the same discovered topology. We set the initial channel count to 36 and train the model from scratch for 600 epochs using the complete set of 50,000 training images. We employ the SGD optimizer with an initial learning rate of 0.025 (annealed to zero via a cosine schedule without restarts), a momentum of 0.9, and a weight decay of 3 × 10−4. The comparative results are presented in Table 1.
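For reference, the evaluation-phase optimizer and learning-rate schedule described above can be set up as in the following minimal sketch; `model` and `train_one_epoch` are placeholders.

```python
import torch

def make_eval_optimizer(model, epochs=600):
    # SGD settings used to retrain the discovered 20-cell network from scratch.
    optimizer = torch.optim.SGD(model.parameters(), lr=0.025,
                                momentum=0.9, weight_decay=3e-4)
    # Cosine annealing without restarts, decaying the learning rate to zero.
    scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=epochs, eta_min=0.0)
    return optimizer, scheduler

# optimizer, scheduler = make_eval_optimizer(model)
# for epoch in range(600):
#     train_one_epoch(model, optimizer)   # placeholder training routine
#     scheduler.step()
```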
Note: #ops denotes the number of candidate operations. The reported Search Cost represents the GPU-days required for the architecture search phase only. It does not include the cost of the final retraining (evaluation) phase.
Analysis of the CIFAR-10 search results demonstrates that EL-DARTS achieves a superior performance–cost balance. Most notably, its search cost is the lowest among all compared algorithms at just 0.075 GPU-days, a further 25% reduction relative to the PC-DARTS baseline (0.1 GPU-days). Despite this efficiency, EL-DARTS maintains highly competitive performance, yielding a test error of 2.47%, lower than that of both DARTS (2.76%) and PC-DARTS (2.57%). Furthermore, the resulting architecture is highly compact, featuring only 3.1 M parameters. Collectively, these results confirm that EL-DARTS successfully pushes the efficiency frontier of differentiable architecture search while ensuring top-tier performance.

3.3. Results on ImageNet

To validate the transferability of the discovered architecture, we further conducted evaluation experiments on ImageNet. The evaluation network was constructed in a manner consistent with the DARTS algorithm: the network depth was set to 14 cells, the initial number of channels $C_0$ to 48, and the network was trained for 250 epochs. Adhering to the constraints of the DARTS algorithm, we selected models for comparison that meet mobile-device operational requirements, specifically architectures whose number of multiply–accumulate operations (MACs) is less than 600 M at an input resolution of 224 × 224. The performance of our architecture, along with the architectures obtained by comparison algorithms, is presented in Table 2.
Analysis of the ImageNet evaluation results demonstrates that EL-DARTS achieves a superior performance–efficiency trade-off compared to its peers. Most notably, its search cost is the lowest among all comparison architectures, requiring only 0.075 GPU-days. This represents a significant acceleration over its gradient-based baselines, such as PC-DARTS (0.1 GPU-days) and the original DARTS (4.0 GPU-days). Concurrently, EL-DARTS yields a competitive Top-1 error rate of 26.2% and a Top-5 error rate of 8.0%, surpassing DARTS (26.7%) and closely approaching NASNet-A (26.0%), which required roughly 2000 GPU-days to discover.

3.4. Ablation Study

To rigorously validate the independent contribution and synergistic effect of our three proposed innovative components on the final architectural performance and search efficiency, we conduct a detailed ablation study on the CIFAR-10 dataset. Our full proposed method is built upon the established Partial Connection baseline (X) derived from PC-DARTS, integrating the following three novel points:
A: Efficient Search Space (ESS).
B: Dynamic Coefficient Scheduling Strategy (DCSS).
C: Entropy Regularization (ER).
Our ablation focuses on systematically removing each novel component from the Full Model to quantify its specific impact on the key performance metrics. Each algorithm was run three times on CIFAR-10, and the results are summarized in Table 3. When all three components are disabled, the model exhibits the worst performance. As the components are progressively enabled, the test error decreases consistently. Notably, activating components A and B together, or enabling all components (A, B, and C), leads to substantial accuracy gains, with the full configuration achieving the lowest test error of 2.47%. Moreover, incorporating additional components does not significantly increase the search cost, which remains around 0.075 GPU days for most settings. These results indicate that the proposed designs improve performance while maintaining high search efficiency. Overall, each component contributes positively to performance, and their combined effect yields the best results.

4. Discussion

4.1. Performance Analysis

Our primary hypothesis posited that the core bottlenecks of traditional gradient-based Neural Architecture Search (such as DARTS)—namely low search efficiency, structural over-dominance, and discretization inconsistency—could be concurrently resolved through targeted improvements to the search space, optimization path, and structural constraints. The experimental results strongly validate this hypothesis.
First, the search cost was sharply reduced to 0.075 GPU-days, lowering the resource barrier faced by NAS research, especially when compared with earlier reinforcement learning or evolutionary algorithms requiring thousands of GPU-days. This leap in efficiency is a direct result of the Efficient Search Space (ESS) proposed in EL-DARTS, which streamlines redundant operators and works in tandem with the memory optimization afforded by the Partial Channel Mechanism (from PC-DARTS) to stabilize the learning process.
Second, the architectural parameter stability issue was successfully addressed. The ablation study confirms that Entropy Regularization is a core factor in ensuring high final architectural accuracy. This finding clarifies that utilizing a smooth regularization constraint to guide distribution sharpening within the continuous relaxation space is an effective way to resolve discretization inconsistency, a known constraint on DARTS performance in prior research. Simultaneously, the Dynamic Coefficient Scheduling Strategy successfully mitigated the pathological over-dominance of the skip-connection inherent in the original DARTS framework, ensuring that high-contributing convolutional operations were fully explored and selected.

4.2. Limitations

While EL-DARTS achieves significant efficiency gains, it relies on proxy metrics (MACs) which do not always perfectly correlate with real-world latency on specific hardware (e.g., FPGA or specialized ASICs). Future work could integrate direct hardware-in-the-loop latency feedback. Additionally, our search space is currently confined to CNN-based operations; extending the EL-DARTS framework to Vision Transformers (ViTs) or other emerging architectures remains an open challenge. Finally, although the search cost is low on a GTX 1080 Ti, performance scaling on ultra-low-power edge devices (e.g., microcontrollers) requires further investigation.

5. Conclusions

The significance of this study lies in pushing the efficiency frontier of differentiable NAS. By minimizing the search cost, EL-DARTS enables the rapid discovery of high-performance mobile architectures on commodity hardware, thereby democratizing NAS research. The final achievement of a 26.2% Top-1 error rate on ImageNet while strictly adhering to the 600 M MAC constraint further validates the method’s practical value and transferability.
This demonstrates that through the refined design of the optimization path and structural constraints, we can resolve the efficiency bottleneck imposed by computational resource limits without sacrificing performance. This work points to a broader insight for the future of NAS: optimization stability and search efficiency are not conflicting goals, but can be synergistically enhanced through strategic regularization and optimization-phase separation.
Regarding future work, current optimization focuses on the proxy metrics of error rate and MACs. Future research should integrate non-differentiable metrics such as actual hardware latency or energy consumption directly into the loss function, enabling multi-objective architectural search that is more closely aligned with real-world deployment requirements.

Author Contributions

Conceptualization, M.Z. and J.L.; Data curation, W.D., M.Z.; Funding acquisition, M.Z. and J.L.; Methodology, M.Z., W.D. and J.L.; Software, W.D., M.Z.; Validation, W.D., M.Z., J.L. and X.L.; Formal analysis, M.Z., W.D. and J.L.; Investigation, M.Z., W.D. and J.L.; Project administration, M.Z.; Writing—original draft, W.D., M.Z.; Writing—review and editing, M.Z., W.D., J.L. and X.L. All authors have read and agreed to the published version of the manuscript.

Funding

This research was sponsored by the Fundamental Research Funds for the Central Universities 25CAFUC03038, 25CAFUC01006, and 24CAFUC04015.

Data Availability Statement

The CIFAR-10 dataset is publicly available at https://www.cs.toronto.edu/~kriz/cifar.html (accessed on 2 December 2025). The ImageNet dataset is publicly available at https://image-net.org (accessed on 2 December 2025).

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Wei, J.C.; Tang, X.; Liu, J.X.; Zhang, Z.Y. Detection of Pig Movement and Aggression Using Deep Learning Approaches. Animals 2023, 13, 3074. [Google Scholar] [CrossRef]
  2. Zeng, N.Y.; Wu, P.S.; Wang, Z.D.; Li, H.; Liu, W.B.; Liu, X.H. A Small-Sized Object Detection Oriented Multi-Scale Feature Fusion Approach With Application to Defect Detection. IEEE Trans. Instrum. Meas. 2022, 71, 1–14. [Google Scholar] [CrossRef]
  3. Szankin, M.; Kwasniewska, A. Can AI See Bias in X-ray Images? Int. J. Netw. Dyn. Intell. 2022, 1, 48–64. [Google Scholar] [CrossRef]
  4. Russakovsky, O.; Deng, J.; Su, H.; Krause, J.; Satheesh, S.; Ma, S.; Huang, Z.H.; Karpathy, A.; Khosla, A.; Bernstein, M.; et al. ImageNet Large Scale Visual Recognition Challenge. Int. J. Comput. Vis. 2015, 115, 211–252. [Google Scholar] [CrossRef]
  5. Wang, X.L.; Sun, Y.; Ding, D.R. Adaptive Dynamic Programming for Networked Control Systems Under Communication Constraints: A Survey of Trends and Techniques. Int. J. Netw. Dyn. Intell. 2022, 1, 85–98. [Google Scholar] [CrossRef]
  6. Wang, J.W.; Zhuang, Y.; Liu, Y.S. FSS-Net: A Fast Search Structure for 3D Point Clouds in Deep Learning. Int. J. Netw. Dyn. Intell. 2023, 2, 100005. [Google Scholar] [CrossRef]
  7. Hu, L.W.; Wang, Z.D.; Li, H.; Wu, P.S.; Mao, J.F.; Zeng, N.Y. 8-DARTS: Light-weight differentiable architecture search with robustness enhancement strategy. Knowl.-Based Syst. 2024, 288, 111466. [Google Scholar] [CrossRef]
  8. Zoph, B.; Le, Q.V. Neural Architecture Search with Reinforcement Learning. arXiv 2017, arXiv:1611.01578. [Google Scholar] [CrossRef]
  9. Baker, B.; Gupta, O.; Naik, N.; Raskar, R. Designing Neural Network Architectures using Reinforcement Learning. arXiv 2017, arXiv:1611.02167. [Google Scholar] [CrossRef]
  10. Zhong, Z.; Yan, J.J.; Wu, W.; Shao, J.; Liu, C.L. Practical Block-wise Neural Network Architecture Generation. In Proceedings of the 31st IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–22 June 2018; IEEE: Salt Lake City, UT, USA, 2018. [Google Scholar]
  11. Qin, X.; Wang, Z. NASNet: A Neuron Attention Stage-by-Stage Net for Single Image Deraining. arXiv 2020, arXiv:1912.03151. [Google Scholar]
  12. So, D.R.; Liang, C.; Le, Q.V. The Evolved Transformer. In Proceedings of the 36th International Conference on Machine Learning (ICML), Long Beach, CA, USA, 9–15 June 2019. Jmlr-Journal Machine Learning Research. [Google Scholar]
  13. Real, E.; Aggarwal, A.; Huang, Y.P.; Le, Q.V. Regularized Evolution for Image Classifier Architecture Search. In Proceedings of the AAAI Conference on Artificial Intelligence, Honolulu, HI, USA, 27 January–1 February 2019. [Google Scholar]
  14. Kandasamy, K.; Neiswanger, W.; Schneider, J.; Póczos, B.; Xing, E.P. Neural Architecture Search with Bayesian Optimisation and Optimal Transport. In Proceedings of the 32nd Conference on Neural Information Processing Systems (NIPS), Montreal, QC, Canada, 2–8 December 2018. Neural Information Processing Systems (Nips). [Google Scholar]
  15. Zoph, B.; Vasudevan, V.; Shlens, J.; Le, Q.V. Learning Transferable Architectures for Scalable Image Recognition. In Proceedings of the 31st IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–22 June 2018; IEEE: Salt Lake City, UT, USA, 2018. [Google Scholar]
  16. Liu, H.; Simonyan, K.; Yang, Y. DARTS: Differentiable Architecture Search. arXiv 2019, arXiv:1806.09055. [Google Scholar] [CrossRef]
  17. Heuillet, A.; Nasser, A.; Arioui, H.; Tabia, H. Efficient Automation of Neural Network Design: A Survey on Differentiable Neural Architecture Search. ACM Comput. Surv. 2024, 56, 270. [Google Scholar] [CrossRef]
  18. Yu, K.; Sciuto, C.; Jaggi, M.; Musat, C.; Salzmann, M. Evaluating the Search Phase of Neural Architecture Search. arXiv 2019, arXiv:1902.08142. [Google Scholar] [CrossRef]
  19. He, H.Y.; Liu, L.J.; Zhang, H.N.; Zheng, N.N. IS-DARTS: Stabilizing DARTS through Precise Measurement on Candidate Importance. In Proceedings of the Thirty-Eighth AAAI Conference on Artificial Intelligence, Vancouver, BC, Canada, 20–27 February 2024; Volume 38, pp. 12367–12375. [Google Scholar]
  20. Chu, X.; Wang, X.; Zhang, B.; Lu, S.; Wei, X.; Yan, J. DARTS-: Robustly Stepping Out of Performance Collapse Without Indicators. arXiv 2021, arXiv:2009.01027. [Google Scholar]
  21. Movahedi, S.; Adabinejad, M.; Imani, A.; Keshavarz, A.; Dehghani, M.; Shakery, A.; Araabi, B.N. Λ-DARTS: Mitigating Performance Collapse by Harmonizing Operation Selection Among Cells. arXiv 2023, arXiv:2210.07998. [Google Scholar]
  22. Xie, W.S.; Li, H.; Fang, X.W.; Li, S.Y. DARTS-PT-CORE: Collaborative and Regularized Perturbation-based Architecture Selection for differentiable NAS. Neurocomputing 2024, 580, 127522. [Google Scholar] [CrossRef]
  23. Huang, H.; Shen, L.; He, C.Y.; Dong, W.S.; Liu, W. Differentiable Neural Architecture Search for Extremely Lightweight Image Super-Resolution. IEEE Trans. Circuits Syst. Video Technol. 2023, 33, 2672–2682. [Google Scholar] [CrossRef]
  24. Luo, X.Z.; Liu, D.; Kong, H.; Huai, S.; Chen, H.; Liu, W.C. SurgeNAS: A Comprehensive Surgery on Hardware-Aware Differentiable Neural Architecture Search. IEEE Trans. Comput. 2023, 72, 1081–1094. [Google Scholar] [CrossRef]
  25. Li, Y.H.; Li, S.; Yu, Z.H. DARTS-PAP: Differentiable Neural Architecture Search by Polarization of Instance Complexity Weighted Architecture Parameters. In Proceedings of the 29th International Conference on MultiMedia Modeling (MMM), Bergen, Norway, 9–12 January 2023; Springer International Publishing: Bergen, Norway, 2023. [Google Scholar]
  26. Xu, Y.; Xie, L.; Zhang, X.; Chen, X.; Qi, G.-J.; Tian, Q.; Xiong, H. PC-DARTS: Partial Channel Connections for Memory-Efficient Architecture Search. arXiv 2020, arXiv:1907.05737. [Google Scholar]
  27. Tan, M.X.; Chen, B.; Pang, R.M.; Vasudevan, V.; Sandler, M.; Howard, A.; Le, Q.V. MnasNet: Platform-Aware Neural Architecture Search for Mobile. In Proceedings of the 32nd IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 16–20 June 2019; IEEE Computer Soc: Long Beach, CA, USA, 2019. [Google Scholar]
  28. Xue, Y.; Qin, J.F. Partial Connection Based on Channel Attention for Differentiable Neural Architecture Search. IEEE Trans. Ind. Inform. 2023, 19, 6804–6813. [Google Scholar] [CrossRef]
  29. Woo, S.H.; Park, J.; Lee, J.Y.; Kweon, I.S. CBAM: Convolutional Block Attention Module. In Proceedings of the 15th European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; Springer International Publishing: Munich, Germany, 2018. [Google Scholar]
  30. Li, S.; Mao, Y.X.; Zhang, F.C.; Wang, D.; Zhong, G.Q. DLW-NAS: Differentiable Light-Weight Neural Architecture Search. Cogn. Comput. 2023, 15, 429–439. [Google Scholar] [CrossRef]
  31. Sercu, T.; Puhrsch, C.; Kingsbury, B.; LeCun, Y. Very Deep Multilingual Convolutional Neural Networks for LVCSR. In Proceedings of the 41st IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Shanghai, China, 20–25 March 2016; IEEE: Shanghai, China, 2016. [Google Scholar]
  32. Dong, X.; Yang, Y. NAS-Bench-201: Extending the Scope of Reproducible Neural Architecture Search. arXiv 2020, arXiv:2001.00326. [Google Scholar] [CrossRef]
  33. Chu, X.X.; Zhou, T.B.; Zhang, B.; Li, J.X. Fair DARTS: Eliminating Unfair Advantages in Differentiable Architecture Search. In Proceedings of the 16th European Conference on Computer Vision-ECCV-Biennial, Online, 23–28 August 2020; Springer International Publishing: Berlin/Heidelberg, Germany, 2020. [Google Scholar]
  34. Jing, K.; Chen, L.; Xu, J. An architecture entropy regularizer for differentiable neural architecture search. Neural Netw. 2023, 158, 111–120. [Google Scholar] [CrossRef] [PubMed]
  35. Huang, G.; Liu, Z.; Van Der Maaten, L.; Weinberger, K.Q. Densely connected convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017. [Google Scholar]
  36. Lin, Y.; Endo, Y.; Lee, J.; Kamijo, S. Bandit-NAS: Bandit sampling and training method for Neural Architecture Search. Neurocomputing 2024, 597, 127684. [Google Scholar] [CrossRef]
  37. Pham, H.; Guan, M.; Zoph, B.; Le, Q.; Dean, J. Efficient neural architecture search via parameters sharing. In Proceedings of the International Conference on Machine Learning, Stockholm, Sweden, 10–15 July 2018; PMLR: Stockholm, Sweden, 2018. [Google Scholar]
  38. Liu, H.; Simonyan, K.; Vinyals, O.; Fernando, C.; Kavukcuoglu, K. Hierarchical representations for efficient architecture search. arXiv 2017, arXiv:1711.00436. [Google Scholar]
  39. Chen, X.; Xie, L.X.; Wu, J.; Tian, Q. Progressive Differentiable Architecture Search: Bridging the Depth Gap between Search and Evaluation. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019; IEEE Computer Soc: Seoul, Republic of Korea, 2019. [Google Scholar]
  40. Xue, Y.; Han, X.L.; Wang, Z.H. Self-Adaptive Weight Based on Dual-Attention for Differentiable Neural Architecture Search. IEEE Trans. Ind. Inform. 2024, 20, 6394–6403. [Google Scholar] [CrossRef]
  41. Howard, A.G.; Zhu, M.; Chen, B.; Kalenichenko, D.; Wang, W.; Weyand, T.; Andreetto, M.; Adam, H. Mobilenets: Efficient convolutional neural networks for mobile vision applications. arXiv 2017, arXiv:1704.04861. [Google Scholar] [CrossRef]
  42. Zhang, X.; Zhou, X.; Lin, M.; Sun, J. Shufflenet: An extremely efficient convolutional neural network for mobile devices. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018. [Google Scholar]
Figure 1. Latency of different DARTS operators measured on a GTX 1080 Ti.
Figure 2. Overview of the proposed neural architecture search framework.
Figure 3. Heatmap of the architectural parameters $\alpha$. The horizontal axis denotes the operation types, and the vertical axis represents the edge indices. Darker cell colors indicate larger values of $\alpha$, implying a higher probability that the corresponding operation will be selected. (a) Heatmap of the architectural parameters for the Normal Cell during the search process. (b) Heatmap of the architectural parameters for the Reduction Cell during the search process.
Figure 4. The final cell architectures discovered on CIFAR-10. In the visualization, the green nodes (c_{k-2}, c_{k-1}) represent the inputs from previous cells, the blue nodes (0–3) denote intermediate hidden states, and the yellow node (c_{k}) indicates the cell’s output. The arrows represent the selected operations and indicate the direction of information flow. These two cell structures are assembled into the complete architecture according to the stacking scheme illustrated in Figure 2. (a) The resulting Normal Cell architecture. (b) The resulting Reduction Cell architecture.
Table 1. Results with the CIFAR-10 dataset.

| Architecture | Test Error (%) | Params (M) | Search Cost (GPU Days) | #ops | Search Method |
|---|---|---|---|---|---|
| DenseNet-BC [35] | 3.46 | 25.6 | – | – | manual |
| NASNet-A [15] | 2.83 | 3.1 | 2000 | 13 | RL |
| BlockQNN [10] | 3.54 | 39.8 | 96 | 8 | RL |
| Bandit-NAS [36] | 2.94 | 3.4 | 0.3 | 6 | RL |
| ENAS [37] | 2.89 | 4.6 | 0.5 | 6 | RL |
| AmoebaNet-A [13] | 3.12 | 3.1 | 3150 | 19 | evolution |
| Hierarchical evolution [38] | 3.75 ± 0.12 | 15.7 | 300 | 6 | evolution |
| DARTS_V1 [16] | 3.00 ± 0.14 | 3.3 | 1.5 | 8 | gradient |
| DARTS_V2 [16] | 2.76 ± 0.09 | 3.3 | 4 | 8 | gradient |
| P-DARTS [39] | 2.50 | 3.4 | 0.3 | 8 | gradient |
| PC-DARTS [26] | 2.57 | 3.6 | 0.1 | 8 | gradient |
| SWD-NAS [40] | 2.51 | 3.17 | 0.13 | 8 | gradient |
| IS-DARTS [19] | 2.4 | 4.47 | 0.42 | 8 | gradient |
| EL-DARTS | 2.47 | 3.1 | 0.075 | 4 | gradient |
Table 2. Results with the ImageNet dataset.

| Architecture | Top-1 Test Error (%) | Top-5 Test Error (%) | Params (M) | MACs (M) | Search Cost (GPU Days) | Search Method |
|---|---|---|---|---|---|---|
| MobileNet [41] | 29.4 | 10.5 | 4.2 | 569 | – | manual |
| ShuffleNet 2× (g = 3) [42] | 26.3 | – | ~5 | 524 | – | manual |
| NASNet-A [15] | 26.0 | 8.4 | 5.3 | 564 | 2000 | RL |
| NASNet-B [15] | 27.2 | 8.7 | 5.3 | 488 | 2000 | RL |
| AmoebaNet-A [13] | 25.5 | 8.0 | 5.1 | 555 | 3150 | evolution |
| AmoebaNet-B [13] | 26.0 | 8.5 | 5.3 | 555 | 3150 | evolution |
| P-DARTS (CIFAR-10) [39] | 24.4 | 7.4 | 4.9 | 557 | 0.3 | gradient-based |
| PC-DARTS (CIFAR-10) [26] | 25.1 | 7.8 | 5.3 | 586 | 0.1 | gradient-based |
| DARTS [16] | 26.7 | 8.7 | 4.7 | 574 | 4 | gradient-based |
| EL-DARTS | 26.2 | 8.0 | 4.5 | 582 | 0.075 | gradient-based |
Table 3. Ablation Study Results (CIFAR-10). The symbol "√" indicates that the component is enabled, and "×" indicates that it is disabled.

| A | B | C | Test Error (%) | Search Cost (GPU Days) |
|---|---|---|---|---|
| × | × | × | 2.92 ± 0.21 | 0.1 |
| √ | √ | × | 2.64 ± 0.2 | 0.075 |
| √ | × | √ | 2.73 ± 0.18 | 0.075 |
| × | √ | √ | 2.8 ± 0.2 | 0.1 |
| √ | √ | √ | 2.6 ± 0.13 | 0.075 |
