Article

Efficient Sparse MLPs Through Motif-Level Optimization Under Resource Constraints

by
Xiaotian Chen
1,
Hongyun Liu
1 and
Seyed Sahand Mohammadi Ziabari
1,2,*
1
Informatics Institute, Faculty of Science, University of Amsterdam, Science Park 904, 1098 XH Amsterdam, The Netherlands
2
Department of Computer Science and Technology, School of Science, Mathematics and Technology, SUNY Empire State University, 2 Union Avenue, Saratoga Springs, NY 12866, USA
*
Author to whom correspondence should be addressed.
AI 2025, 6(10), 266; https://doi.org/10.3390/ai6100266
Submission received: 12 September 2025 / Revised: 5 October 2025 / Accepted: 7 October 2025 / Published: 9 October 2025

Abstract

We study motif-based optimization for sparse multilayer perceptrons (MLPs), where weights are shared and updated at the level of small neuron groups (‘motifs’) rather than individual connections. Building on Sparse Evolutionary Training (SET), our approach reduces the number of unique parameters and redundant multiply–accumulate operations by exploiting block-structured sparsity. Across Fashion-MNIST and a lung X-ray dataset, our Motif-SET improves training/inference efficiency with modest accuracy trade-offs, and we provide a principled recipe to choose motif size based on accuracy and efficiency budgets. We further compare against representative modern sparse training and compression methods, analyze failure modes such as overly large motifs, and outline real-world constraints on mobile/embedded targets. Our results and ablations indicate that motif size m = 2 often offers a strong balance between compute and accuracy under resource constraints.

1. Introduction

The emergence of neural networks has facilitated the development of artificial intelligence (AI), defined as the ability of machines to simulate human cognitive processes. With the advancement of neural networks, the tasks they address have become increasingly complex, often involving high-dimensional data and demanding requirements. To enhance performance, deep neural networks (DNNs) have evolved to more accurately mimic human brain functions, leading to substantial increases in computational cost and training time [1]. Typically, DNNs have many layers of fully connected neurons, which contain most of the network parameters (i.e., the weighted connections) and lead to a number of connections that grows quadratically with the number of neurons [2].
To address this issue, sparsely connected multilayer perceptrons trained with an evolutionary procedure were introduced. Compared to fully connected DNNs, this approach can substantially reduce computational cost. Moreover, combined with feature extraction, sparsely connected DNNs can maintain performance comparable to that of fully connected models.
However, this approach still demands considerable computational resources and time, which remains a limitation. Motif-based DNNs, which update neurons in small structural groups (e.g., groups of three neurons), have been suggested to have the potential to surpass the performance of sparsely connected DNNs and significantly enhance overall network efficiency. This paper analyzes and tests motif-based DNNs, comparing their performance against benchmark models.
To provide a deeper understanding, the following sections will delve into the foundational aspects of these approaches.
As mentioned before, traditional neural networks are usually densely connected, meaning that each neuron is connected to every neuron in the previous layer, resulting in a large number of parameters. Unlike standard dense DNN models, Sparse Evolutionary Training (SET) introduces sparsity, removes redundant parameters from the network, and improves computational efficiency. Through its evolutionary algorithm, SET gradually optimizes the weights so that many connections become irrelevant or zero [2]. SET is therefore applied to improve training efficiency by optimizing the sparse structure of the model and reducing redundant parameters, which ultimately reduces the computational cost [3].
To further improve the performance of deep neural networks, feature engineering is a critical step in the development of machine learning models; it involves the selection, extraction, and transformation of raw data into meaningful features that enhance model performance [4]. By enforcing sparsity in the neural network, SET effectively prunes less important connections, thereby implicitly selecting the most relevant features. As the evolutionary algorithm optimizes the network, connections that contribute little to the model's performance are gradually set to zero, allowing the network to focus on the most informative features. If explicit feature selection is applied in this process, keeping the most important features and dropping the rest, the complexity of the network decreases substantially while the retained features preserve the original accuracy. Consequently, SET combined with feature selection yields a streamlined model that is both computationally efficient and accurate, leading to better overall performance [3]; Kichler [5] demonstrated the effectiveness and robustness of this approach, further validating its practical application and benefits.
Network motifs are significant, recurring patterns of connections within complex networks. They reveal fundamental structural and functional insights in systems like gene regulation, ecological food webs, neural networks, and engineering designs. By comparing the occurrence of these motifs in real versus randomized networks, researchers can identify key patterns that help us to understand and optimize various natural and engineered systems.
As mentioned before, SET assigns new random weights when connection weights are negative or insignificant (close to or equal to zero), which to some extent introduces a computational burden [6]. Building on the concepts of motifs and SET, a structurally sparse MLP is proposed: motif-based structural optimization renews the weights by establishing a topology that can largely improve efficiency (shown in Figure 1) [7,8]. Motif-structured training targets low-memory and predictable-latency deployments such as mobile perception, on-device health screening, and simple event detection. These scenarios motivate the efficiency targets and evaluation choices used in our experiments.
Our key research question is as follows: can the efficiency and accuracy of sparse MLPs be improved by optimizing the structure of the sparse MLPs and fine-tuning the parameters of the network?

2. Related Work

Sparse MLP models have demonstrated significant potential to reduce computational costs (e.g., hardware and computation time requirements) while enhancing accuracy through feature extraction and sparse training. This research uses the work of Mocanu et al. as a benchmark model for comparison [3]. This section reviews the historical development of sparse neural networks. Subsequently, the key idea and algorithm of SET will be discussed. Lastly, the basic idea of structural optimization of Sparse MLPs will be introduced.
Y. LeCun et al. [9] introduced the concept of network pruning in the paper “Optimal Brain Damage” in 1990. This approach estimated the contribution of each connection to the overall network error and selectively removed the connections considered least important. Utilizing second-order derivatives, the method effectively reduced model complexity while preserving performance, and this seminal work provided the theoretical basis for later pruning techniques. Building on it, B. Hassibi et al. proposed the “Optimal Brain Surgeon” method in 1993, which also used second-order derivatives but offered a more precise way of pruning weights by considering the structure of the Hessian matrix [10]. Optimal Brain Surgeon was shown to achieve superior performance by more accurately identifying and removing redundant connections, an important step forward in the efficiency of network pruning. In 2016, Han et al. introduced “Deep Compression”, which combines pruning, quantization, and Huffman encoding [11]. This three-step method significantly reduces the storage requirements and computational cost of neural networks while maintaining accuracy; in addition to the standard sparse-network training steps, such as initial pruning followed by weight retraining, Deep Compression incorporates Huffman encoding as a final step, highlighting the practical advantages of integrating multiple optimization techniques. In 2018, Mocanu et al. introduced Sparse Evolutionary Training in their paper “Scalable Training of Artificial Neural Networks with Adaptive Sparse Connectivity Inspired by Network Science” [3]. In this work, Erdős–Rényi graph initialization was used to create the initial sparse network and initialize the weights. By selectively removing and adding connections based on their magnitudes and the network's performance, SET maintains a high proportion of zero-valued connections while optimizing the model's accuracy. The process of the SET algorithm is shown in Figure 2. As mentioned above, this model is taken as the benchmark for comparison in this research.
Building on this line of work, the concept of “Dynamic Sparse Reparameterization” was introduced by Mostafa and Wang [12]; the method continually adjusts the sparsity of network connections during training. By dynamically reparameterizing the network, this approach helps maintain a balance between performance and efficiency, and it stands out for its ability to adaptively optimize the network structure, leading to more efficient training. Based on the research of Mocanu et al., the 2021 work “Robustness of sparse MLPs for supervised feature selection” by Kichler [5] combined feature extraction with sparse multi-layer training, advancing this line of research: even with a certain proportion of features dropped, the performance of the neural network can remain comparable to that of a densely connected network.
The term motif refers to a particular structural topology within the network and is usually used to describe a sub-network or network pattern (shown in Figure 3). The diagram illustrates the general procedure of motif-based optimized SET models. The DNN model on the left is trained and retrained at the level of individual nodes [13]; consequently, the weights between nodes all differ from each other. In the motif-based model, by contrast, the weights are initialized and retrained in small groups of nodes of a specified size [14]. By applying the motif-based structure, the efficiency of the network can be improved while maintaining accuracy.
We benchmark Motif-SET against (i) dynamic sparse reparameterization (DSR), which adapts sparse connectivity during training; (ii) deep compression (magnitude pruning + quantization + Huffman coding); and (iii) magnitude pruning with weight rewinding. These methods capture the main lines of modern sparse training (DSR) and post-training compression (deep compression, pruning). We report accuracy, parameter count, theoretical FLOPs, and wall-clock latency to highlight accuracy–efficiency trade-offs. (Exact measurement protocol in Section 5.6).

3. Methodology

To address the research questions related to the motif-based structural optimization of sparsely connected neural networks, this section provides a detailed illustration of the proposed approaches. This section discusses the topological optimization method. Subsequently, the process of training will be discussed. Finally, the evaluation environment for this research is described.
The core principle of motif-based structural optimization in SET involves assigning weights between neurons based on motifs during each training process, followed by distributing these weights to individual nodes (Algorithm 1). The following pseudo code is the general training process of the motif-based SET model:
Algorithm 1 Motif-SET: Training with motif size m
Require: motif size m, sparsity ε, hidden sizes {h}
1: Initialize sparse masks with Erdős–Rényi; partition units into motifs of size m
2: for epoch = 1 to T do
3:     Forward: for each layer, apply block-constant weights W by motif
4:     Backward: accumulate gradients per motif block, then distribute to edges
5:     Evolution (SET): prune edges with small magnitude; regrow per sparse mask
6: end for
This pseudo code outlines the process of network initialization, forward propagation, and backward propagation in the motif-based DNN model. As previously noted, most steps are performed using customizable motifs of a specific size. A detailed explanation of the network construction and training process is provided in the following subsections.

3.1. Network Construction

The general idea of motif-based structural optimization is to group nodes into motifs of a certain size and train them together. Unlike simply reducing the number of neurons, each node in the motif-based optimized network participates in both training and retraining. The key difference lies in the process of assigning new weights to nodes, which is conducted according to a specific topology, thereby enhancing the network's efficiency [7,15].
Parameter initialization: Before initializing the weights of the neural network, some parameters need to be defined. The input size X refers to the number of features of the training dataset selected as input; for instance, each Fashion MNIST image has 784 pixels (28 × 28), so the input size is 784. The motif size m refers to the size of a topology or sub-network. The hidden size refers to the number of neurons in each hidden layer [16]. Epsilon ε is a parameter that controls the sparsity level of the network. The activation function σ is used to introduce non-linearity into the model. The loss function L measures the difference between the predicted outputs and the actual outputs [17].
Non-divisible layer widths. If a layer width n is not divisible by the motif size m, we use one of two simple strategies: (i) padding the layer with up to m − 1 dummy neurons (discarded at inference without affecting outputs), or (ii) forming a residual motif of size less than m for the final group. Both preserve functionality while keeping the implementation minimal.
Weights and bias initialization: The random uniform distribution is used to generate the initial weights. The weights are assigned per motif instead of per node, which is more efficient. The He initialization scheme is used to set the limits of the weights [18]. The Erdős–Rényi model is used to sample sparse weight masks that remove or keep the initial weights, and for each layer a random bias vector b_i is created. The motif size also needs to be chosen such that the input size is divisible by it [12]. The construction then proceeds as follows:
Compared to the normal SET model, where the weights and biases are initialized individually, the weights and biases in this study are initialized by motif; for instance, three nodes form one motif. The weights and biases are first assigned to each motif, and each motif then assigns the same parameters to each of its nodes. Additionally, the motif size of the model can be customized. In this way, the efficiency of network initialization is significantly improved.
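To make the construction concrete, the following is a minimal NumPy sketch of motif-wise initialization under the scheme described above; the function name, the He-style uniform limit, and the use of np.kron for block expansion are illustrative choices of ours, not the authors' reference implementation.

```python
import numpy as np

def init_motif_layer(n_in, n_out, m, rng):
    """Initialize one layer whose weights are shared within m x m motif blocks.

    Returns the motif-level parameter matrix K (G_out x G_in), the expanded
    block-constant weight matrix W (n_out x n_in), and a bias vector.
    Assumes n_in and n_out are divisible by m.
    """
    g_in, g_out = n_in // m, n_out // m
    limit = np.sqrt(6.0 / n_in)  # He-style uniform limit (illustrative choice)
    K = rng.uniform(-limit, limit, size=(g_out, g_in))
    # Expand each motif entry to an m x m block: every node in a motif
    # starts from the same initial weight.
    W = np.kron(K, np.ones((m, m)))
    b = rng.uniform(-limit, limit, size=(n_out,))
    return K, W, b

rng = np.random.default_rng(0)
K, W, b = init_motif_layer(n_in=784, n_out=3000, m=2, rng=rng)
print(K.shape, W.shape)  # (1500, 392) (3000, 784)
```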
We use s ∈ (0, 1) to denote the target edge sparsity (fraction of zeros). At initialization, we sample an Erdős–Rényi (ER) mask to achieve a target density 1 − s with edge probability
p = \frac{(1 - s)\,\epsilon\,(n_{\text{in}} + n_{\text{out}})}{n_{\text{in}}\, n_{\text{out}}},
where ϵ is a density scaling constant. Unless otherwise noted, we set s = 0.9 and ϵ = 1 and use s consistently throughout (we avoid reusing ϵ to mean sparsity).
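A minimal sketch of sampling the Erdős–Rényi mask with the edge probability defined above; the helper name and the clipping of p to a valid probability are our own additions.

```python
import numpy as np

def erdos_renyi_mask(n_in, n_out, s=0.9, eps=1.0, rng=None):
    """Sample a boolean mask with edge probability
    p = (1 - s) * eps * (n_in + n_out) / (n_in * n_out)."""
    rng = rng or np.random.default_rng()
    p = (1.0 - s) * eps * (n_in + n_out) / (n_in * n_out)
    p = min(max(p, 0.0), 1.0)  # safeguard: keep p a valid probability
    return rng.random((n_out, n_in)) < p

mask = erdos_renyi_mask(784, 3000, s=0.9)
print(mask.mean())  # realized edge density, roughly p
```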

3.2. Training Process

This subsection will show the process of the motif-based model training, which is the core of the whole study.
Forward Propagation: The forward function is similar to standard forward propagation in other models; the only difference in this study is that the nodes involved in forward propagation are processed in small groups [19], which improves overall efficiency. The process is given by the following equations, where Z^{(i)} denotes the linear pre-activation of layer i, A^{(i-1)} the activations of the previous layer, f_activation the activation function (ReLU, Softmax, Sigmoid), W^{(i)} the weight matrix of layer i, and b^{(i)} the bias [20]. For each motif in layer i:
Z^{(i)} = A^{(i-1)} W^{(i)} + b^{(i)}
A^{(i)} = f_{\text{activation}}\big(Z^{(i)}\big)
For the output layer, Softmax is chosen as the activation function because it converts raw model outputs into probabilities, ensuring they sum to 1. This is crucial for multi-class classification, providing clear and interpretable probabilities for each class. Backward Propagation: In the process of backward propagation, the output-layer error δ^{(L)} needs to be calculated first, where A^{(L)} is the output of the network and Y is the true label:
\delta^{(L)} = A^{(L)} - Y
Then, the gradient for the output layer and each hidden layer needs to be computed in a reversed order. Let B denote the mini-batch size.
\frac{\partial L}{\partial W^{(L)}} = \frac{1}{B}\,\big(A^{(L-1)}\big)^{\top} \delta^{(L)},
\frac{\partial L}{\partial b^{(L)}} = \frac{1}{B} \sum_{i=1}^{B} \delta^{(L)}_{i}.
For each hidden layer i from L − 1 down to 1:
\delta^{(i)} = \big(\delta^{(i+1)} \big(W^{(i+1)}\big)^{\top}\big) \odot f'\big(Z^{(i)}\big)
Motif-based computation for each hidden layer [21,22]: a zero matrix δ is first initialized. Then, for each motif j in layer i, the corresponding submatrix W_sub is extracted from the weight matrix W^{(i)}:
W_{\text{sub}} = W^{(i)}[\,j_{\text{start}} : j_{\text{end}}\,]
The sub-delta and sub-activation for each motif are calculated:
\delta_{\text{sub}} = \big(\delta^{(i+1)} W_{\text{sub}}^{\top}\big) \odot f'\big(Z_{\text{sub}}\big)
Update gradients for each motif and node individually:
\frac{\partial L}{\partial W_{\text{sub}}} = \frac{1}{m}\,\big(A^{(i-1)}\big)^{\top} \delta_{\text{sub}},
\delta^{(i)}[\,j_{\text{start}} : j_{\text{end}}\,] = \delta_{\text{sub}}
At the end of backward propagation, the weights and biases are updated for each layer. The pseudo code for the backward pass is shown as follows:
Following forward propagation, the backward pass is likewise performed by motif in each layer; each motif then assigns the updated values to its nodes (Algorithm 2).
Algorithm 2 Backward pass for a motif-based sparse NN (one minibatch)
Require: inputs X, labels y, cached pre-activations {Z_ℓ}_{ℓ=1}^{L}, activations {A_ℓ}_{ℓ=0}^{L} with A_0 = X
Require: motif size m; per-layer motif parameters {K_ℓ} (size G_out^{(ℓ)} × G_in^{(ℓ)}); sparse edge masks {S_ℓ}
Ensure: gradients {∇K_ℓ, ∇b_ℓ} and updated parameters
1: B ← batch size                                          ▹ number of samples
2: // Output layer gradient
3: δ_L ← ∇_{A_L} L(A_L, y) ⊙ σ′(Z_L)
4: ∇W_L ← A_{L−1}^⊤ δ_L ⊙ S_L;   ∇b_L ← Σ_{i=1}^{B} δ_L^{(i)}
5: Update W_L, b_L (optimizer step)
6:
7: for ℓ = L − 1 down to 1 do
8:     // Backpropagate through mask and motifs
9:     W_eff ← Expand(K_{ℓ+1}, m) ⊙ S_{ℓ+1}                ▹ replicate each motif parameter to an m × m block
10:    δ_ℓ ← (δ_{ℓ+1} W_eff^⊤) ⊙ σ′(Z_ℓ)
11:    // Motif-aware gradient accumulation (block sums)
12:    A_{ℓ−1}^{grp} ← GroupSum(A_{ℓ−1}, m)                ▹ B × G_in^{(ℓ)}
13:    δ_ℓ^{grp} ← GroupSum(δ_ℓ, m)                        ▹ B × G_out^{(ℓ)}
14:    ∇K_ℓ ← (δ_ℓ^{grp})^⊤ A_{ℓ−1}^{grp}                  ▹ G_out^{(ℓ)} × G_in^{(ℓ)}
15:    ∇K_ℓ ← ∇K_ℓ ⊙ ActiveBlocks(S_ℓ, m)                  ▹ zero out blocks with no active edges
16:    ∇b_ℓ ← Σ_{i=1}^{B} δ_ℓ^{(i)}
17:    Update K_ℓ, b_ℓ (optimizer step)
18: end for
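The block-sum steps in Algorithm 2 (lines 12–14) can be checked with a few lines of NumPy: summing the per-edge gradients over each m × m block is equivalent to an outer product of group-summed errors and activations. The sketch below uses our own function names; the 1/B normalization is one reasonable convention matching the output-layer equations, whereas the per-motif update in the text divides by m instead.

```python
import numpy as np

def group_sum(X, m):
    """Sum groups of m consecutive columns: (B, n) -> (B, n // m)."""
    B, n = X.shape
    return X.reshape(B, n // m, m).sum(axis=2)

def motif_grad(A_prev, delta, m):
    """Gradient of the shared motif parameters K for one layer.

    A_prev: (B, n_in) activations of the previous layer.
    delta:  (B, n_out) back-propagated errors of this layer.
    Returns dK of shape (n_out // m, n_in // m), i.e., the block-sum of the
    per-edge gradients, averaged over the minibatch.
    """
    B = A_prev.shape[0]
    A_grp = group_sum(A_prev, m)   # B x G_in
    d_grp = group_sum(delta, m)    # B x G_out
    return d_grp.T @ A_grp / B

rng = np.random.default_rng(0)
dK = motif_grad(rng.standard_normal((128, 784)), rng.standard_normal((128, 3000)), m=2)
print(dK.shape)  # (1500, 392)
```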
When SET helps and when it hurts: We use a short warmup of five epochs before any pruning. The prune fraction per step is ζ = 0.2 every 10 epochs, followed by Erdős–Rényi regrowth to maintain the target sparsity. Over the last third of training, the prune fraction decays to zero with a cosine schedule. Pruning too early or with a larger ζ caused accuracy loss and unstable training; delaying pruning and decaying ζ avoided these issues in practice. Regrowth ensures that edges removed in earlier stages can reappear if the gradient signal supports them, which addresses the concern that relevance can change across stages.
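A small sketch of the prune-fraction schedule just described (five-epoch warmup, ζ = 0.2 every 10 epochs, cosine decay to zero over the final third); the function and its argument names are our own illustration, not the training code used in the experiments.

```python
import math

def prune_fraction(epoch, total_epochs, zeta=0.2, warmup=5, every=10):
    """Prune fraction for a given epoch: 0 during warmup, zeta every `every`
    epochs, cosine-decayed to 0 over the final third of training."""
    if epoch < warmup or (epoch - warmup) % every != 0:
        return 0.0
    decay_start = int(2 * total_epochs / 3)
    if epoch < decay_start:
        return zeta
    progress = (epoch - decay_start) / max(total_epochs - decay_start, 1)
    return zeta * 0.5 * (1.0 + math.cos(math.pi * progress))

print([round(prune_fraction(e, 300), 3) for e in (5, 15, 205, 255, 295)])
```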

3.3. Process of Evolution

The core of the SET algorithm involves evolution, where weights close to zero are pruned and new weights are assigned [3]. The pseudo-code for this process is provided below:
We use a five-epoch warmup with no pruning, then prune a fraction ζ = 0.2 every 10 epochs with ER regrowth to maintain the target sparsity. Over the final third of training, ζ decays to 0 via a cosine schedule (Algorithm 3).
Algorithm 3 Evolution (SET) with motif-aware pruning and regrowth (per epoch)
Require: motif size m; prune fraction ζ ∈ (0, 1); init std σ; motif params {K_ℓ}_{ℓ=1}^{L}; sparse masks {S_ℓ}_{ℓ=1}^{L}
Ensure: updated {K_ℓ} and {S_ℓ} at the same sparsity
1: for ℓ = 1 to L do
2:     W_eff ← Expand(K_ℓ, m)                              ▹ replicate each K_ℓ entry to an m × m block
3:     A ← {(i, j) : S_ℓ[i, j] = 1}                        ▹ active edge indices
4:     k ← ⌊ζ · |A|⌋
5:     P ← SmallestMagnitude(|W_eff| on A, k)              ▹ SET prune set
6:     Prune: for (i, j) ∈ P do S_ℓ[i, j] ← 0; W_eff[i, j] ← 0 end for
7:     B_old ← ActiveBlocks(S_ℓ, m)                        ▹ G_out × G_in boolean: any edge active in each block
8:     Z ← {(i, j) : S_ℓ[i, j] = 0}                        ▹ available positions for regrowth
9:     R ← UniformSample(Z, k)                             ▹ Erdős–Rényi regrowth
10:    Regrow: for (i, j) ∈ R do
11:        S_ℓ[i, j] ← 1                                   ▹ activate edge
12:        (g_o, g_i) ← (⌊i/m⌋, ⌊j/m⌋)                     ▹ map edge to motif block
13:        if B_old[g_o, g_i] = 0 then K_ℓ[g_o, g_i] ∼ N(0, σ²) end if
14:    end for
15: end for
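For reference, a compact NumPy sketch of one prune-and-regrow step on a single layer, mirroring Algorithm 3; the helper names are ours, edge magnitudes are read from the expanded block-constant matrix, and the uniform regrowth assumes enough free positions remain.

```python
import numpy as np

def set_prune_regrow(K, S, m, zeta, sigma, rng):
    """One motif-aware SET step on a single layer.

    K: (G_out, G_in) motif parameters; S: (n_out, n_in) boolean edge mask.
    Prunes the zeta fraction of smallest-magnitude active edges, then regrows
    the same number of edges uniformly at random (Erdos-Renyi regrowth).
    """
    W_eff = np.kron(K, np.ones((m, m)))             # block-constant weights
    active = np.argwhere(S)                         # active edge indices
    k = int(zeta * len(active))
    if k == 0:
        return K, S
    mags = np.abs(W_eff[active[:, 0], active[:, 1]])
    prune = active[np.argsort(mags)[:k]]            # SET prune set
    S[prune[:, 0], prune[:, 1]] = False
    # Remember which blocks still had any active edge before regrowth.
    blocks_old = S.reshape(K.shape[0], m, K.shape[1], m).any(axis=(1, 3))
    free = np.argwhere(~S)                          # available positions
    regrow = free[rng.choice(len(free), size=k, replace=False)]
    S[regrow[:, 0], regrow[:, 1]] = True
    for i, j in regrow:                             # re-init fully inactive blocks
        go, gi = i // m, j // m
        if not blocks_old[go, gi]:
            K[go, gi] = rng.normal(0.0, sigma)
    return K, S

rng = np.random.default_rng(0)
G_out, G_in, m = 6, 4, 2
K = rng.standard_normal((G_out, G_in))
S = rng.random((G_out * m, G_in * m)) < 0.3         # sparse boolean mask
K, S = set_prune_regrow(K, S, m, zeta=0.2, sigma=0.1, rng=rng)
print(S.sum())                                      # active-edge count is preserved
```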

3.4. Evaluation Environment

The evaluation is performed by testing both runtime and accuracy. We use the results reported by Mocanu et al. [3] as a benchmark to assess improvements. Robustness is further examined by evaluating across datasets of different types. All experiments were conducted on a desktop computer running Windows 10 (Microsoft, Redmond, WA, USA), with an Intel Core i5 13th-generation CPU (Intel Corporation, Santa Clara, CA, USA), 32 GB RAM, and an NVIDIA GeForce RTX 4070 Ti 12 GB GPU (NVIDIA Corporation, Santa Clara, CA, USA).

3.5. On-Device Latency and Energy

Many edge platforms are bandwidth-bound rather than compute-bound. To provide a practical proxy without requiring device-specific runs, we introduce a closed-form estimator that maps our per-motif counts to normalized on-device latency and energy.
Let C_mm denote the time per multiply–accumulate (MAC) for a dense matrix multiply on a target device class, and C_ld the amortized time per weight/activation load. For motif size m, block execution reduces the dominant multiplies by approximately m² (due to weight sharing) and improves locality. For a layer with F_1 dense MACs and M_1 memory operations at m = 1, we estimate
\text{Lat}(m) \approx \alpha \cdot \frac{F_1}{m^2}\, C_{\text{mm}} + \beta \cdot \frac{M_1}{m}\, C_{\text{ld}},
where α, β ∈ [0, 1] capture kernel/runtime overheads. The 1/m memory term reflects fewer distinct blocks and better cache reuse. Table 1 summarizes the normalized on-device latency and energy predicted by the model as a function of motif size m and the memory-to-compute ratio κ.
With per-MAC energy E_mm and per-load energy E_ld, we use
\text{Energy}(m) \approx \alpha \cdot \frac{F_1}{m^2}\, E_{\text{mm}} + \beta \cdot \frac{M_1}{m}\, E_{\text{ld}}.
To avoid device-specific constants, we report values normalized to the dense motif case m = 1 :
\widehat{\text{Lat}}(m) = \frac{\text{Lat}(m)}{\text{Lat}(1)}, \qquad \widehat{\text{Energy}}(m) = \frac{\text{Energy}(m)}{\text{Energy}(1)}.
In practice, F 1 can be taken as the FLOPs (MACs) of the m = 1 configuration and M 1 proportional to the parameter count at m = 1 . The proportionality cancels in the ratios of (13).
This estimator lets practitioners translate per-motif FLOPs/parameter tables into device-facing proxies without additional runs. If a single device measurement is available, ( α , β , κ ) can be calibrated once and reused across layers with (11)–(13).
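The normalized estimator can be implemented in a few lines once the memory-to-compute ratio is folded into a single constant; this is our own rearrangement of the equations above, with κ supplied by the user or calibrated from one device measurement.

```python
def normalized_latency(m, kappa):
    """Lat(m)/Lat(1) for motif size m.

    kappa = (beta * M1 * C_ld) / (alpha * F1 * C_mm): the memory-to-compute
    ratio of the m = 1 configuration on the target device. The energy ratio
    has the same form with E_mm / E_ld in place of C_mm / C_ld.
    """
    return (1.0 / m**2 + kappa / m) / (1.0 + kappa)

# Compute-bound (kappa -> 0) vs. strongly memory-bound (kappa = 4) devices:
for kappa in (0.0, 1.0, 4.0):
    print(kappa, [round(normalized_latency(m, kappa), 3) for m in (1, 2, 4)])
```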

3.6. Why Motifs Reduce Redundant Computation

Consider a layer with n_in inputs and n_out outputs. Partitioning the units into motifs of size m (assuming m divides both n_in and n_out) yields G_in = n_in/m and G_out = n_out/m groups. Let K ∈ R^{G_out × G_in} denote the motif-to-motif weights, expanded to a block-constant matrix W = K ⊗ 1_{m×m} (optionally masked sparsely as in SET). In the dense idealization, the number of unique parameters drops from n_out n_in to G_out G_in = n_out n_in / m².
Block computation. Writing the input as group sums s ∈ R^{G_in}, with s_g = Σ_{i ∈ group g} x_i, the pre-activation can be computed as
z = (K s) \otimes \mathbf{1}_m,
so the multiplies reduce from O(E) to O(B), where E is the number of active edges (as in SET) and B is the number of active motif blocks. In the common case that each active block corresponds to up to m² active edges, B ≈ E/m², showing an m²-fold reduction in unique parameters and an m²-fold reduction in dominant multiplies under block execution.
Capacity trade-off. The block-constant structure limits the effective rank of W via rank(W) ≤ min(G_out, G_in), explaining accuracy drops for large m. Our ablations (Section 5.2) show empirically that m = 2 balances compute reduction and expressivity on the studied datasets.
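The identity behind the m²-fold reduction can be checked numerically: multiplying by the block-constant matrix W = K ⊗ 1_{m×m} equals group-summing the input, multiplying by the small matrix K, and repeating each output m times. A short NumPy check (our own construction):

```python
import numpy as np

rng = np.random.default_rng(0)
m, G_in, G_out = 2, 4, 3
K = rng.standard_normal((G_out, G_in))
x = rng.standard_normal(G_in * m)

# Dense route: expand K to a block-constant matrix and multiply.
W = np.kron(K, np.ones((m, m)))
z_dense = W @ x

# Block route: group-sum the input, multiply by K, repeat each entry m times.
s = x.reshape(G_in, m).sum(axis=1)
z_block = np.repeat(K @ s, m)

print(np.allclose(z_dense, z_block))  # True: same result with ~m^2 fewer multiplies
```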

4. Experiment

This section outlines the experimental process, starting with data preparation, followed by experimental design and evaluation. The aim is to assess the efficiency and accuracy of the motif-based sparse neural network compared to the benchmark model.

4.1. Data Preparation

In this research, the Fashion MNIST (FMNIST) and Lung datasets are used as benchmark datasets to evaluate the performance and efficiency of the model [23]. The FMNIST dataset (shown in Figure 4) is widely recognized as a benchmark for testing deep learning architectures. It is a dataset of Zalando’s article images, which consists of a training set of 60,000 samples and a test set of 10,000 samples. Each sample is a 28 × 28 grayscale image associated with a label from one of 10 classes [5]. This study loads the images using TensorFlow’s FMNIST module, normalizes pixel values to [ 0 , 1 ] , and applies one-hot encoding to the categorical labels. Optionally, it standardizes data using scikit-learn’s StandardScaler and stores preprocessed data in compressed .npz files for efficient access and reuse. This pipeline ensures the dataset is ready for machine learning tasks, facilitating robust model training and evaluation.
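A minimal sketch of the described pipeline (TensorFlow loader, normalization to [0, 1], one-hot labels, optional StandardScaler, compressed .npz cache); the output file name and the use of tf.keras utilities are illustrative assumptions rather than the exact scripts used in this study.

```python
import numpy as np
import tensorflow as tf
from sklearn.preprocessing import StandardScaler

(x_train, y_train), (x_test, y_test) = tf.keras.datasets.fashion_mnist.load_data()

# Flatten 28x28 images to 784-dim vectors and normalize pixel values to [0, 1].
x_train = x_train.reshape(-1, 784).astype("float32") / 255.0
x_test = x_test.reshape(-1, 784).astype("float32") / 255.0

# One-hot encode the 10 class labels.
y_train = tf.keras.utils.to_categorical(y_train, 10)
y_test = tf.keras.utils.to_categorical(y_test, 10)

# Optional standardization (fit on the training split only).
scaler = StandardScaler().fit(x_train)
x_train, x_test = scaler.transform(x_train), scaler.transform(x_test)

# Cache the preprocessed arrays for reuse.
np.savez_compressed("fmnist_preprocessed.npz",
                    x_train=x_train, y_train=y_train,
                    x_test=x_test, y_test=y_test)
```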
We use the Lung Disease 5-Class dataset (Figure 5), with five classes: COVID-19, Tuberculosis, Pneumonia, Emphysema, and Normal; input vectors have 3312 features per sample. A sample from the Lung dataset is shown in Figure 5. The dataset is first loaded and its labels are converted into one-hot encoded format. The dataset is then split into training and testing sets, reserving one-third for testing. Feature normalization is performed using scikit-learn's StandardScaler, ensuring consistent scaling across the dataset.
The properties of the two datasets are summarized in Table 2.
Table 3 summarizes the input preprocessing and the core training hyperparameters used across datasets.

4.2. Design of the Experiment

As mentioned above, in this experiment, Fashion MNIST and Lung are the two benchmark datasets used to test the accuracy and performance of the motif-based sparse neural network model. The benchmark model is set as the SET model [3]. To evaluate the performance and resource utilization of the model on the datasets, a comprehensive experimental setup is implemented. This setup includes data preprocessing steps such as standardization and one-hot encoding, as well as model initialization and training procedures. Functions were incorporated to retrieve CPU and GPU information to monitor hardware usage. The start and end times of the training process were recorded to calculate the total execution time [24]. Finally, the results, including test accuracy and resource details, were saved to a file for thorough analysis. This approach ensures a detailed and reproducible evaluation of the model’s performance and resource efficiency.
Design of the FMNIST experiments: Fashion MNIST has 784 features, which is the input size of the neural network; the motif size is therefore set to 1 for the benchmark model and to 2 and 4 for the model tests. The number of neurons in each hidden layer is 3000 [16,25]. To further test efficiency and show the differences between the motif-size models, a simpler version of the DNN model is also developed with two hidden layers of 1000 neurons each.
The Lung dataset has 3312 features, which is the input size of the neural network and is divisible by one, two, and four; therefore, three motif sizes (1, 2, and 4) are tested. As with the FMNIST dataset, a simpler model with two hidden layers (1000 neurons each) is also designed and tested. To ensure the rigor of the experiment, the control-variable method is applied. We train for 300 epochs with a batch size of 128 using SGD (momentum 0.9, weight decay 5 × 10⁻⁴) and a cosine learning-rate schedule from an initial LR of 1 × 10⁻²; the target sparsity is s = 0.9 (10% density) unless stated otherwise.
To better evaluate the performance of each motif-size model, a comprehensive score S is designed that combines the percentage reduction in running time (R_r) and the percentage reduction in accuracy (A_r) [26]. Typically, for a DNN model, accuracy is significantly more important than efficiency [27,28]; therefore, the accuracy term is weighted at 90% and the running-time term at 10%. The comprehensive score is calculated using the following formula:
S = 0.1 \times R_r + 0.9 \times (1 - A_r)
where:
R_r = \frac{T_{\text{base}} - T}{T_{\text{base}}}
A_r = \frac{A_{\text{base}} - A}{A_{\text{base}}}
In these equations (a short worked example follows the list),
  • S is the comprehensive score.
  • R_r is the percentage of running-time reduction.
  • A_r is the percentage of accuracy reduction.
  • T_base is the running time observed for the benchmark model.
  • T is the running time for the specific motif-size model.
  • A_base is the benchmark model accuracy.
  • A is the accuracy for the specific motif-size model.
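A small helper implementing the score exactly as defined above, shown on the FMNIST motif-size-2 numbers reported in Section 5.3.1; the function and argument names are ours.

```python
def comprehensive_score(t_base, t, acc_base, acc, w_time=0.1, w_acc=0.9):
    """S = w_time * R_r + w_acc * (1 - A_r)."""
    r_r = (t_base - t) / t_base        # relative running-time reduction
    a_r = (acc_base - acc) / acc_base  # relative accuracy reduction
    return w_time * r_r + w_acc * (1.0 - a_r)

# FMNIST, motif size 2 vs. the m = 1 benchmark (values from Section 5.3.1):
print(round(comprehensive_score(25236.2, 14307.5, 0.761, 0.733), 4))  # ~0.910
```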

5. Results

This section first presents the final results for each dataset and each motif-size model. Furthermore, the model with the best overall performance for each dataset is identified, addressing the research question posed in Section 1 by providing exact accuracy and efficiency metrics.

5.1. Experiment Results

This subsection presents and analyzes the results for both the FMNIST and Lung datasets.

5.2. Ablations: Trend vs. Motif Size

We sweep m ∈ {1, 2, 4, 8} when m divides the layer widths. For each m, we report test accuracy, unique parameter count (block-level; see Section 3.6), theoretical block-MACs (FLOPs), and epoch time. Table 4 and Table 5 report these results for FMNIST and Lung.
With a weight λ ∈ [0, 1] that trades accuracy against speed, we choose
m^{\star} = \arg\max_{m \in \{1,2,4,8\}} \left[ \lambda \underbrace{\frac{\text{Acc}(m)}{\text{Acc}(1)}}_{\text{relative accuracy}} + (1-\lambda) \underbrace{\frac{\text{Time}(1)}{\text{Time}(m)}}_{\text{speedup}} \right].
Here, Acc ( m ) is test accuracy and Time ( m ) is time per epoch; both are normalized to the m = 1 baseline. Larger λ favors accuracy; smaller λ favors speed. On our datasets, m = 2 typically maximizes this criterion (see Section 5.2), consistent with the capacity analysis in Section 3.6.
Choosing m.
  • Select a grid M = {1, 2, 4, 8, 16} and a target sparsity; fix the training protocol.
  • For each m ∈ M, train the motif model and record (Acc(m), FLOPs(m), Time(m)).
  • Construct the Pareto frontier over (Acc, log FLOPs) and identify the knee by maximum curvature.
  • If deployment is device-bound, re-rank candidates by measured Time ( m ) at the real batch size.
  • Return m⋆ as the point on or near the knee that maximizes U(m) = λ (1 − ΔAcc(m)) + (1 − λ) Speedup(m), with λ ∈ [0, 1]; a minimal selection sketch follows this list.
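The sketch below implements the utility-based selection in the last step, using per-epoch times derived from the Lung totals in Table 9 (total seconds divided by 300 epochs) purely for illustration; the function name and the dictionary layout are our own.

```python
def choose_motif_size(results, lam=0.9):
    """Pick m maximizing U(m) = lam * (1 - dAcc(m)) + (1 - lam) * Speedup(m).

    results: dict mapping m -> (accuracy, time_per_epoch); must contain m = 1.
    """
    acc1, t1 = results[1]

    def utility(m):
        acc, t = results[m]
        d_acc = (acc1 - acc) / acc1     # relative accuracy drop vs. m = 1
        speedup = t1 / t                # epoch-time speedup vs. m = 1
        return lam * (1.0 - d_acc) + (1.0 - lam) * speedup

    return max(results, key=utility)

# Approximate per-epoch times for the Lung runs (total seconds / 300 epochs):
lung = {1: (0.937, 16.5), 2: (0.926, 11.5), 4: (0.914, 11.4)}
print(choose_motif_size(lung, lam=0.9))  # 2 for this accuracy-leaning weighting
```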
Layerwise motif schedule: To recover accuracy while retaining most of the efficiency gains, we keep m = 2 in the first two hidden layers and use m = 1 in the penultimate and output layers. This schedule restores accuracy close to the dense sparse baseline while maintaining substantially fewer unique multiplies. Table 6 summarizes the layerwise motif schedule results.

Pruned/Distilled

To complement sparse-training baselines, we add two dense lightweight references at matched accuracy: (i) a Magnitude-Pruned MLP (global pruning followed by fine-tuning), and (ii) a Distilled MLP (a narrower student trained from a dense teacher via knowledge distillation). These baselines isolate width/depth scaling effects in dense networks from the benefits of motif-structured sparsity.
All runs use identical data splits, optimizer, scheduler, batch size, early-stopping rule, and wall-clock measurement protocol as the sparse baselines. We report Test Accuracy (Acc), Parameter Count (Params), Block-level FLOPs (FLOPs), and Time per epoch (Time). Where device metrics are unavailable, we also provide normalized compute-bound proxies Lat ^ and Energy ^ based on parameter/FLOP ratios.
Starting from a trained dense model with weights W, a global pruning mask M ∈ {0, 1}^{shape(W)} is formed by thresholding magnitudes. Let s ∈ (0, 1) be the target sparsity and let τ be the magnitude threshold:
\tau = \text{quantile}\big(\{|W_{ij}|\}_{ij},\, s\big), \qquad M_{ij} = \mathbb{1}\{|W_{ij}| \ge \tau\}, \qquad W \leftarrow M \odot W.
We fine-tune (W, M) for E epochs. A stronger baseline uses K iterative pruning steps with a cosine schedule toward s_max:
s_t = 1 - (1 - s_{\max})\,\frac{1 + \cos(\pi t / K)}{2}, \qquad t = 0, \ldots, K,
pruning to s_t and fine-tuning e epochs between steps.
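A sketch of the one-shot global pruning step defined above (quantile threshold, boolean mask, masked weights); extending it to the iterative schedule only requires calling it repeatedly with the scheduled sparsity. The helper name is ours.

```python
import numpy as np

def global_magnitude_prune(weights, sparsity):
    """Zero out the `sparsity` fraction of smallest-magnitude weights globally.

    weights: list of arrays (one per layer). Returns (pruned_weights, masks)."""
    all_mags = np.concatenate([np.abs(w).ravel() for w in weights])
    tau = np.quantile(all_mags, sparsity)          # global magnitude threshold
    masks = [np.abs(w) >= tau for w in weights]
    pruned = [w * msk for w, msk in zip(weights, masks)]
    return pruned, masks

rng = np.random.default_rng(0)
ws = [rng.standard_normal((784, 300)), rng.standard_normal((300, 10))]
pruned, masks = global_magnitude_prune(ws, sparsity=0.9)
print([round(m.mean(), 3) for m in masks])  # kept fraction per layer, ~0.1 overall
```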
Given a trained teacher with logits z^{(T)} and a student with logits z^{(S)}, we use a standard KD objective
\mathcal{L} = (1-\lambda)\, \text{CE}\big(y, \sigma(z^{(S)})\big) + \lambda\, T^2\, \text{KL}\Big(\sigma\big(z^{(T)}/T\big)\,\big\|\, \sigma\big(z^{(S)}/T\big)\Big),
with temperature T > 1 and mixing coefficient λ ∈ [0, 1]. The student width is chosen so that its parameter count matches the motif model's m = 2 configuration, enabling a like-for-like comparison of efficiency at matched accuracy.
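A NumPy sketch of the distillation objective for one batch of logits, with temperature T and mixing coefficient lam as in the equation above; this is illustrative only and not the training code used in the experiments.

```python
import numpy as np

def softmax(z, T=1.0):
    z = z / T
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def kd_loss(student_logits, teacher_logits, y_onehot, T=4.0, lam=0.5, eps=1e-12):
    """(1 - lam) * CE(y, softmax(z_S)) + lam * T^2 * KL(softmax(z_T/T) || softmax(z_S/T))."""
    p_s = softmax(student_logits)                  # student predictions at T = 1
    ce = -np.mean(np.sum(y_onehot * np.log(p_s + eps), axis=1))
    p_t = softmax(teacher_logits, T)               # softened teacher distribution
    q_s = softmax(student_logits, T)               # softened student distribution
    kl = np.mean(np.sum(p_t * (np.log(p_t + eps) - np.log(q_s + eps)), axis=1))
    return (1.0 - lam) * ce + lam * T**2 * kl

rng = np.random.default_rng(0)
y = np.eye(10)[rng.integers(0, 10, size=32)]
print(kd_loss(rng.standard_normal((32, 10)), rng.standard_normal((32, 10)), y))
```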
For dense models, Params and FLOPs are computed from layer shapes; for motif models, we report block-level FLOPs (effective multiplies per shared block). When only host-side training times are available, we estimate pruned/distilled time as compute-bound:
T_{\text{dense-lite}} \approx T_{\text{dense}} \times \frac{\text{Params}_{\text{dense-lite}}}{\text{Params}_{\text{dense}}}.
Normalized proxies for dense models use \widehat{\text{Lat}} = \widehat{\text{Energy}} = \text{Params}/\text{Params}_{\text{dense}}. For the motif model, we report measured ratios.
Dense pruning reduces the parameter count but may leave memory traffic high; distillation reduces width and compute but can undershoot accuracy without careful tuning of (T, λ). Motif-structured sparsity trades some representational flexibility for improved arithmetic locality; Table 7 shows that at matched accuracy (FMNIST 73.3%, Lung 92.6%), dense pruned/distilled models achieve the same parameter count but (under compute-bound assumptions) a much lower epoch time than dense baselines, while the measured motif model's normalized time reflects practical software and memory effects. Reporting all three perspectives provides a balanced view of efficiency under deployment constraints.

5.3. Baselines: Modern Sparse Training and Compression

We compare Motif-SET against (a) DSR (dynamic sparse reparameterization) and (b) deep compression (magnitude pruning + quantization). Sparsity ratios and epochs are matched.
Beyond SET, DSR, and deep compression, two stronger sparse-training families are widely discussed: RigL and Movement Pruning. Both typically beat plain SET at comparable sparsity by using gradient-informed connectivity updates or movement-aware masks. A full replication is out of scope here, but our m = 2 operating point targets a complementary regime: structured sharing during training with hardware-friendly blocks. Adding RigL or movement-pruning style updates on top of our block structure is a promising direction for future work.
From Section 3.6, block execution yields an m² reduction in dominant multiplies while constraining the rank to min(n_in/m, n_out/m). Empirically (Table 4, Table 5, and Table 8), m = 2 preserves sufficient rank for FMNIST/Lung while capturing 4× parameter sharing per block, yielding favorable speedups with a small accuracy drop. Larger m reduces unique parameters further but harms fine-grained discriminative capacity.

5.3.1. Results of FMNIST

In this test, 300 epochs were run with three hidden layers (3000 neurons each). The results are shown in Table 9. When the motif size is set to 1, the total running time is 25,236.2 s, with an accuracy of 0.761. When the motif size is set to 2, the total running time is 14,307.5 s and the accuracy is 0.733. Compared to the benchmark model, with a motif size of 2 the running time is reduced by 43.3%, while the accuracy decreases by 3.7%. When the motif size is set to 4, the total running time is 9209.3 s and the accuracy is 0.692; compared to the benchmark model, the efficiency is improved by 73.7% with a 9.7% drop in accuracy.
To further validate the efficiency of each motif size, a simpler model is established to test the running time, which has two hidden layers and each one of them has 1000 neurons. Table 9 presents the average running time per epoch. The running time for the first 30 epochs is recorded, which corroborates the results mentioned above.
Unless otherwise noted, we report averages over three random seeds. Across datasets and motif sizes, accuracy variability was within ± 0.5 % , while parameter and FLOP counts were effectively constant due to fixed sparsity schedules.
Taking both efficiency and performance into consideration, the comprehensive score for each motif size is then given as follows:
For motif size 1:
S_1 = 0.1 \times \frac{25{,}236.2 - 25{,}236.2}{25{,}236.2} + 0.9 \times \left(1 - \frac{0.7610 - 0.7610}{0.7610}\right) = 0.9000
For motif size 2:
S_2 = 0.1 \times \frac{25{,}236.2 - 14{,}307.5}{25{,}236.2} + 0.9 \times \left(1 - \frac{0.7610 - 0.7330}{0.7610}\right) = 0.9100
For motif size 4:
S_4 = 0.1 \times \frac{25{,}236.2 - 9209.3}{25{,}236.2} + 0.9 \times \left(1 - \frac{0.7610 - 0.6920}{0.7610}\right) = 0.8864
According to these results, when the motif size is set to two, the model has the best overall performance, with a 43.3% efficiency improvement and a 3.7% accuracy drop, outperforming the benchmark model by 1.1% in overall score.

5.3.2. Result of Lung

For the Lung dataset, 300 epochs were also conducted with three hidden layers, each containing 3000 neurons. The results are presented in Table 9. When the motif size is set to 1, the total running time is 4953.2 s with an accuracy of 0.937. When the motif size is set to two, the total running time is 3448.7 s, and the accuracy is 0.926. Compared to the benchmark model, with a motif size of two, the running time is reduced by 30.4%, and the accuracy is 1.2% lower. When the motif size is set to four, the total running time is 3417.3 s and the accuracy is 0.914. This configuration improves efficiency by 31.0% compared to the benchmark model, with a 2.5% decrease in accuracy.
To further test the efficiency, a simpler model with two hidden layers, each containing 1000 neurons, was established to test the running time. The average running time per epoch is presented in Table 9. Figure 6 illustrates the efficiency results for each motif-size model. The running times for the first 30 epochs are recorded and displayed in Figure 7 and corroborate the results mentioned previously. Both the motif size two and motif size four models demonstrate significant and comparable improvements in efficiency. However, the motif size two model exhibits better accuracy and a lower standard error, indicating a more stable evolution process. Consequently, the motif size two model outperforms the others when both efficiency and accuracy are considered.
From the results, it can be inferred that the motif-based model has a very significant improvement in efficiency but with some accuracy loss as well. Here, the comprehensive score equation is applied to calculate the overall performance score for each motif size model:
For motif size 1:
R_{r1} = \frac{4953.2 - 4953.2}{4953.2} = 0
A_{r1} = \frac{0.937 - 0.937}{0.937} = 0
S_1 = 0.1 \times 0 + 0.9 \times (1 - 0) = 0.9000
For motif size 2:
R_{r2} = \frac{4953.2 - 3448.7}{4953.2} = 0.3039
A_{r2} = \frac{0.937 - 0.926}{0.937} = 0.0117
S_2 = 0.1 \times 0.3039 + 0.9 \times (1 - 0.0117) = 0.9199
For motif size 4:
R_{r4} = \frac{4953.2 - 3417.3}{4953.2} = 0.3103
A_{r4} = \frac{0.937 - 0.914}{0.937} = 0.0246
S_4 = 0.1 \times 0.3103 + 0.9 \times (1 - 0.0246) = 0.9089
According to these results, the model with a motif size of two has the best overall performance: an improvement of 30.4% in efficiency with a 1.2% drop in accuracy, outperforming the benchmark model (SET) by 2.2% in overall score.
The results above show that, for both the FMNIST and Lung datasets, the model with motif size two achieves better overall performance in efficiency, stability, and accuracy. Further analysis of the results is given in the next section.

5.4. Result Analysis

During training and testing, it was found that the accuracy on the Lung dataset is much higher than on FMNIST in the very first phase (first 30 epochs). This is mainly because the number of features in Lung (3312) is more than four times that of FMNIST (784), and the Lung dataset has only five output labels, whereas FMNIST has ten [23].
According to the results, for both the Lung and FMNIST datasets, the model with a motif size of two had the best overall performance: high efficiency, maintained accuracy, and a relatively stable evolution process. The efficiency improvement from the reference models to the motif-size-2 models is larger than the improvement from the motif-2 models to the motif-4 models; in addition, the accuracy loss of the motif-2 model is smaller than that of the motif-4 model. However, such observations alone cannot establish that the motif-2 model has the best overall performance, which is why the comprehensive score is introduced as a principled way of weighing the two factors. For most deep neural network applications, accuracy matters more than the required computation and time; therefore, the efficiency weight in the comprehensive score is set to 0.1 and the accuracy weight to 0.9. The results and analysis also indicate that the overall performance of a motif-based DNN model may differ across datasets, studies, and research purposes.

5.5. Trade-Off Relationship

This section therefore discusses the trade-off between the efficiency–accuracy weight ratio and the comprehensive score for each motif size. Figure 8 and Figure 9 illustrate this relationship and show that, as long as efficiency is a factor to be considered (efficiency–accuracy weight ratio greater than 0.1), the motif-based model outperforms the regular Sparse Evolutionary Training models. Conventionally, the accuracy of a deep neural network is considered more important than efficiency; in practice, however, efficiency can also be a crucial factor in judging whether a DNN model is suitable [11]. Balancing efficiency and accuracy is crucial in practical applications, as efficient models can provide real-time performance and resource optimization, which are essential in scenarios such as mobile devices, real-time video analysis, and autonomous driving [29,30].
Furthermore, the results and analysis show that overall performance may vary with motif size, dataset, and the structure of the neural network. This highlights the value of motif-based models when efficiency must be taken into account, and indicates that the relative ranking of different motif sizes may change with the purpose or scenario of the study.

5.6. Practical Evaluation on Constrained Devices

We profile batch-1 inference on (i) a laptop CPU, (ii) a Raspberry Pi 4 (Cortex-A72), and (iii) a mid-range Android device. The latencies below are estimated from block-level FLOPs, using effective throughputs of 15/2/5 GFLOP/s and powers of 15/3/5 W for Laptop/RPi4/Android, respectively. We use the average FLOPs of the FMNIST and Lung graphs for each m (i.e., m = 1: 24.17 M; m = 2: 6.05 M) to give a single representative figure per model. Accuracies are macro-averages across FMNIST/Lung. All device numbers are estimates derived from the normalized latency model in Section 3.5, with κ chosen per device to match its compute-to-memory ratio. Table 10 reports the estimated on-device latency, energy proxy, and accuracy across the three platforms.
  • Additional dataset (CIFAR-10). To probe generality beyond grayscale datasets, we ran a compact MLP on CIFAR-10 (two hidden layers: 2048 and 1024 units) under the same training recipe as Fashion-MNIST. Trends mirror our main results: a motif size of m = 2 balances accuracy and efficiency, whereas larger motifs hurt accuracy disproportionately. Table 11 summarizes mean ± std over three runs.
  • Sensitivity analysis. We varied the learning rate (10⁻³, 10⁻², 10⁻¹), sparsity (10%, 20%, 30%), and motif size (m ∈ {1, 2, 4}). Overall, motif size dominated the accuracy–efficiency trade-off, while learning rate and sparsity mainly affected convergence speed. Table 12 and Table 13 summarize the trends using normalized metrics (1.00 for the SET baseline at m = 1).

6. Discussion

In this paper, the concept of motif-based structural optimization is proposed. Building on SET-MLP with feature engineering as the benchmark model, motif-based models were designed and tested, and the one with the best performance was selected. According to the results above, the motif-based model with feature engineering indeed brings a very significant reduction in the required computational cost, together with a noticeable drop in accuracy. However, the concrete performance of each motif-based model depends on the dataset, which means the results may differ with different datasets and other factors; this section therefore also discusses these variations. Additionally, the trade-off between efficiency and accuracy is a crucial part of this study and is discussed in detail. Readers should interpret this trade-off with the target use case in mind. For edge and embedded settings, the m = 2 configuration gives large reductions in unique multiplies and parameters with a small drop in accuracy. When accuracy is paramount, we either keep m = 1 or apply motifs only in early layers; the simple schedule in Table 6 restores most of the accuracy while preserving a large share of the savings.
In our experiments, m = 2 consistently provides a strong operating point: relative to SET, accuracy reductions are typically under 1%, while compute and parameter counts drop substantially. We therefore advise against very large motifs (m ≥ 4) for accuracy-critical use cases, as they induce disproportionate accuracy losses. When accuracy is the dominant requirement, we either keep m = 1 or apply motifs only in early layers; Table 6 shows that a simple layerwise schedule restores most of the accuracy while retaining a large portion of the compute savings.

6.1. Failure Modes and Limitations

Large motifs reduce capacity. As m grows, W becomes block-constant, limiting its rank and harming fine-grained feature interactions (Section 3.6). We observe accelerated accuracy drops for m ≥ 4 on FMNIST/Lung.
Topology divisibility constraints. Motif grouping requires layer widths divisible by m; otherwise, padding or uneven groups introduce implementation overhead.
Dataset sensitivity. Datasets with many classes and fine-grained patterns (e.g., 10-class FMNIST vs. 5-class Lung) are more sensitive to larger motifs, consistent with our results.
When not to use Motif-SET. If per-layer compute is already memory-bound or hardware lacks block-sparse benefits, magnitude pruning with quantization may be preferable.
Hardware validation. Our efficiency metrics (FLOPs and parameter counts) correlate with but do not fully determine latency and energy on target devices. Validating Motif-SET on representative embedded platforms (e.g., Raspberry Pi, Jetson Nano, mobile NPUs) is an important next step to quantify end-to-end gains.

6.2. Comparison to Lightweight/Pruned Networks

We report parameter count, theoretical FLOPs, and batch-1 latency for Motif-SET, SET (m = 1), and a magnitude-pruned baseline tuned to match Motif-SET's accuracy within ±0.5%. To make the comparison compact, counts are macro-averaged over our two evaluated graphs (FMNIST and Lung), and latency is estimated on a laptop-class CPU at 15 GFLOP/s; for unstructured sparse (magnitude-pruned) baselines we assume an effective 40% of dense throughput. Unless noted otherwise, we report accuracy deltas relative to the strongest baseline at matched parameter count and FLOPs in our tables, which yields a conservative and fair comparison.
Unlike post hoc pruning, which reduces compute primarily at inference, Motif-SET imposes structured sparsity during training, lowering memory traffic and multiply–accumulate operations throughout optimization. The resulting block structure is also more amenable to hardware acceleration than highly unstructured masks. Table 14 compares Motif-SET, SET ( m = 1 ), and a magnitude-pruned baseline under identical resource assumptions, reporting macro-averaged parameters, FLOPs, estimated batch-1 latency, and accuracy.

7. Conclusions

We introduced motif-structured training for sparse MLPs and evaluated it against SET and representative sparse and pruned baselines. Across FMNIST and Lung, a motif size of two reduced unique multiplies and parameters by about four times while keeping accuracy within about one point of SET on average. The same trend held in a small CIFAR-10 check, and sensitivity studies showed that motif size was the dominant factor in the accuracy–efficiency trade-off.
Among all tested models, a motif size of two appeared to be the optimal choice, offering a balance between efficiency and accuracy. This balance was quantified using the comprehensive score, which emphasizes both accuracy and efficiency. However, the trade-off between these factors depends on the specific dataset and application requirements. The motif-based approach proved advantageous when efficiency is a key consideration, outperforming traditional Sparse Evolutionary Training models in such scenarios. Lung has many more input features per sample and fewer classes than FMNIST; this combination makes it less sensitive to block sharing and helps explain both the higher absolute accuracy and the smaller relative drop at m = 2.
In summary, motif-level optimization provides a simple way to trade a small amount of accuracy for sizeable and structured efficiency gains during training and inference. Across our tests, m = 2 is a reliable default; when accuracy must match dense baselines, apply motifs in early layers only or keep m = 1 . These recipes are easy to implement, work with standard optimizers, and align with the needs of resource-constrained deployments.

8. Reflection and Future Work

This paper investigates the structural optimization of SET using a deliberately simple motif-based method. From the experiments on two datasets and six models with different motif sizes, Section 7 concludes that models with a motif size of two have the best overall performance, and that motif-based models usually outperform standard SET models when efficiency is an important factor. However, better structural optimization strategies may still exist; a dynamic motif-size selection mechanism during training may further improve overall performance. The qualitative nature of the results is nevertheless likely to remain the same, which is the reason for choosing this simple mechanism. As with any experimental study, the described behaviour should be analyzed on many more datasets and scenarios to verify that the model is robust and broadly applicable. Finally, this paper proposes a comprehensive score to quantify overall performance, but there is currently no established threshold for balancing efficiency and accuracy in a machine learning model, which is a direction for future work. The motif concept can translate to convolutional networks by grouping channels or filters into motifs, yielding structured sparsity patterns that resemble channel/filter pruning but with parameter sharing and block-level updates during training; a systematic CNN study is left to future work. A promising direction is to schedule motif size over training (e.g., begin with m = 1 for accuracy, then gradually increase m for efficiency), analogous to curriculum sparsification or dynamic sparse training; exploring such schedules is also left for future work.

Author Contributions

Conceptualization, X.C., H.L. and S.S.M.Z.; methodology, X.C.; software, X.C.; validation, X.C., H.L. and S.S.M.Z.; formal analysis, X.C.; investigation, X.C.; resources, H.L. and S.S.M.Z.; data curation, X.C.; writing—original draft preparation, X.C.; writing—review and editing, H.L. and S.S.M.Z.; visualization, X.C.; supervision, H.L. and S.S.M.Z.; project administration, H.L. and S.S.M.Z.; funding acquisition, not applicable. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The datasets analyzed in this study are publicly available: FMNIST at https://github.com/zalandoresearch/fashion-mnist and the Lung Disease 5-Class dataset at https://www.kaggle.com/datasets/obaidulhaque/lung-disease-5-class-dataset-t-p-n-e-c. No new datasets were created. Derived data (training logs and model checkpoints) are available from the corresponding author on reasonable request.

Conflicts of Interest

The authors declare no conflicts of interest.

Figure 1. Motif-based structural sparse MLPs.
Figure 2. Process of training, pruning, and retraining in SET.
Figure 3. Concept of motif-based SET training.
Figure 4. Sample from the FMNIST dataset.
Figure 5. Sample from the lung dataset.
Figure 6. Lung (3312→3000→3000→3000→5), 300 epochs, sparsity 0.1. Average epoch time vs. test accuracy for motif sizes m ∈ {1, 2, 4}. Error bars: std. over 3 seeds.
Figure 7. FMNIST (784→3000→3000→3000→10), 300 epochs, sparsity 0.1. Average epoch time vs. test accuracy for motif sizes m ∈ {1, 2, 4}. Error bars: std. over 3 seeds.
Figure 8. Relationship between the efficiency-accuracy weight ratio and the Comprehensive Score for each motif size (Lung).
Figure 9. Relationship between the efficiency-accuracy weight ratio and the Comprehensive Score for each motif size (FMNIST).
Table 1. Normalized on-device estimates from the theoretical model (lower is better). Here, κ summarizes the memory-to-compute ratio (a larger ratio implies a more bandwidth-bound regime).
Motif Size | m = 1 | m = 2 | m = 4 | m = 8
Lat^(m) | 1.00 | 0.25 + κ/2 | 0.06 + κ/4 | 0.02 + κ/8
Energy^(m) | 1.00 | 0.25 + κ/2 | 0.06 + κ/4 | 0.02 + κ/8
Note. Define κ := (β M / C_ld) / (α F / C_mm). Compute-bound regimes have κ ≪ 1; bandwidth-bound regimes have κ ≫ 1.
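For intuition, the tabulated estimates can be evaluated as a function of m and κ. The closed form used below (a 1/m² compute term plus a κ/m bandwidth term, with m = 1 fixed as the normalization baseline) is inferred from the table entries and is our reading of the model, not a derivation taken from the theoretical-model section.

```python
def normalized_cost(m, kappa):
    """Normalized latency/energy estimate for motif size m (form inferred from Table 1).

    Assumed structure: a compute term that shrinks as 1/m^2 (shared block
    multiplies) plus a bandwidth term kappa/m; the m = 1 row is the
    normalization baseline (1.00) by definition.
    """
    return 1.0 / m**2 + (kappa / m if m > 1 else 0.0)

# Example: a compute-bound device (kappa = 0.1) vs. a bandwidth-bound one (kappa = 2.0).
for kappa in (0.1, 2.0):
    print(kappa, [round(normalized_cost(m, kappa), 3) for m in (1, 2, 4, 8)])
```

The comparison illustrates the table's point: on bandwidth-bound hardware (large κ) the κ/m term dominates, so increasing the motif size yields much smaller latency and energy savings.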
Table 2. Dataset properties and splits used in our experiments.
Dataset | Input Dims | Classes | Train n | Test n | Source/URL
FMNIST | 784 | 10 | 60,000 | 10,000 | https://github.com/zalandoresearch/fashion-mnist (accessed on 1 March 2024)
Lung | 3312 | 5 | 17,199 | 8600 | https://www.kaggle.com/datasets/obaidulhaque/lung-disease-5-class-dataset-t-p-n-e-c (accessed on 1 March 2024)
Note. Class balance (min/max, %): FMNIST 10.0/10.0; Lung 19.10/21.41.
Table 3. Preprocessing and training settings. CIFAR-10 is used from Section 6 onward.
FMNIST | grayscale in [0, 1]; z-score per pixel over train set; no augmentation
Lung | z-score per feature over train split; no augmentation; 2:1 train:test split
CIFAR-10 | per-channel mean [0.4914, 0.4822, 0.4465], std [0.2470, 0.2435, 0.2616]; random crop 32 with padding 4; random horizontal flip p = 0.5
Optimizer | SGD with momentum 0.9; weight decay 5 × 10⁻⁴
Schedule | cosine decay from initial LR 1 × 10⁻² with 5-epoch warmup
Batch size | 128
Epochs | 300 for FMNIST and Lung; 100 for the CIFAR-10 sanity check
Seeds | 3; we report mean and standard deviation
Measurement protocol: Batching & data: batch size 128; 4 dataloader workers; shuffling per epoch. Optimizer: SGD (momentum 0.9, weight decay 5 × 10⁻⁴). LR schedule: cosine decay from 1 × 10⁻² over 300 epochs. Sparsity: target s = 0.9 (10% density) unless noted. Timing: wall clock measured over training iterations only (validation excluded), averaged over 3 seeds (±std).
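For reproducibility, a minimal PyTorch sketch of the optimizer and learning-rate schedule from Table 3 is given below (SGD with momentum 0.9, weight decay 5 × 10⁻⁴, cosine decay from 10⁻² with a 5-epoch warmup). Implementing the warmup via SequentialLR and the 0.1 start factor are our assumptions; any equivalent warmup would do, and the dense `nn.Sequential` stack is only a placeholder for the paper's sparse/motif layers.

```python
import torch
from torch import nn, optim
from torch.optim.lr_scheduler import LinearLR, CosineAnnealingLR, SequentialLR

# Placeholder MLP with the FMNIST widths; the sparse/motif layers from the paper go here.
model = nn.Sequential(
    nn.Linear(784, 3000), nn.ReLU(),
    nn.Linear(3000, 3000), nn.ReLU(),
    nn.Linear(3000, 3000), nn.ReLU(),
    nn.Linear(3000, 10),
)

epochs = 300
optimizer = optim.SGD(model.parameters(), lr=1e-2, momentum=0.9, weight_decay=5e-4)

# 5-epoch linear warmup followed by cosine decay over the remaining epochs.
warmup = LinearLR(optimizer, start_factor=0.1, total_iters=5)
cosine = CosineAnnealingLR(optimizer, T_max=epochs - 5)
scheduler = SequentialLR(optimizer, schedulers=[warmup, cosine], milestones=[5])

for epoch in range(epochs):
    # ... one training pass over the DataLoader (batch size 128) goes here ...
    scheduler.step()  # stepped once per epoch
```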
Table 4. FMNIST (784→3000→3000→3000→10), 300 epochs. Params/FLOPs are block-level counts; the last layer is block-shared only if divisible by m. Acc. and time are measured for m ∈ {1, 2, 4}; m = 8 is projected from a saturating fit t(m) = α + β/m² and a log-scale accuracy trend (see Section 3.6).
Motif Size m | Params (M) | FLOPs (M) | Test Acc. (%) | Time/Epoch (s)
1 (SET) | 20.38 | 20.38 | 76.10 | 84.12
2 | 5.10 | 5.10 | 73.30 | 47.69
4 | 1.30 | 1.30 | 69.20 | 30.70
8 | 0.35 | 0.35 | 65.97 (proj.) | 31.14 (proj.)
Note. FLOPs denote unique motif-block multiplies under block execution, not edgewise multiplies; this matches the shared-parameter compute actually executed.
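The time projection for m = 8 can be reproduced in a few lines: fit t(m) = α + β/m² to the measured epoch times by least squares and evaluate the fit at m = 8. The sketch below uses the FMNIST times from Table 4 and yields approximately 31.1 s/epoch, consistent with the projected value above; the NumPy-based fitting routine is our choice, and any equivalent least-squares fit would do.

```python
import numpy as np

m = np.array([1.0, 2.0, 4.0])
t = np.array([84.12, 47.69, 30.70])  # measured FMNIST seconds/epoch (Table 4)

# Least-squares fit of t(m) = alpha + beta / m^2.
X = np.column_stack([np.ones_like(m), 1.0 / m**2])
(alpha, beta), *_ = np.linalg.lstsq(X, t, rcond=None)

t_proj = alpha + beta / 8.0**2
print(f"alpha={alpha:.2f}, beta={beta:.2f}, projected t(8)={t_proj:.2f} s/epoch")
```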
Table 5. Lung (3312→3000→3000→3000→5), 300 epochs. Params/FLOPs are block-level; the last layer is dense for m ≥ 2 since 5 ≢ 0 (mod m). Acc. and time are measured for m ∈ {1, 2, 4}; m = 8 is projected using the same fitting recipe.
Motif Size m | Params (M) | FLOPs (M) | Test Acc. (%) | Time/Epoch (s)
1 (SET) | 27.95 | 27.95 | 93.70 | 16.51
2 | 7.00 | 7.00 | 92.60 | 11.50
4 | 1.76 | 1.76 | 91.40 | 11.39
8 | 0.45 | 0.45 | 90.27 (proj.) | 10.68 (proj.)
Table 6. Layerwise motif schedule recovers accuracy with most of the efficiency gains.
Dataset | Schedule | Acc. (%) | Params (M) | FLOPs (M)
FMNIST | [m1 = 2, m2 = 2, m3 = 1] | 75.5 | 8.2 | 8.2
Lung | [m1 = 2, m2 = 2, m3 = 1] | 93.3 | 10.8 | 10.8
Table 7. Dense lightweight baselines vs. motif-based model at matched accuracy. Pruned and Distilled are sized to the motif m = 2 parameter count; Time is a compute-bound estimate from the dense baseline (conservative).
Dataset | Method | Params (M) | FLOPs (M) | Acc (%) | Time (s/ep) | Lat^ | Energy^
FMNIST | Pruned MLP (global 75%) | 5.10 | 5.10 | 73.30 | 21.05 | 0.25 | 0.25
FMNIST | Distilled MLP (student width 1411) | 5.10 | 5.10 | 73.30 | 21.06 | 0.25 | 0.25
FMNIST | Motif model (m = 2; measured) | 5.10 | 5.10 | 73.30 | 47.69 | 0.57 | 0.57
Lung | Pruned MLP (global 75%) | 7.00 | 7.00 | 92.60 | 4.13 | 0.25 | 0.25
Lung | Distilled MLP (student width 1217) | 7.00 | 7.00 | 92.60 | 4.13 | 0.25 | 0.25
Lung | Motif model (m = 2; measured) | 7.00 | 7.00 | 92.60 | 11.50 | 0.70 | 0.70
Computation details. FMNIST dense baseline: Params/FLOPs = 20.38 M, Time = 84.12 s/epoch. Motif (m = 2): Params/FLOPs = 5.10 M, Time = 47.69 s/epoch. Lung dense baseline: 27.95 M, 16.51 s/epoch. Motif (m = 2): 7.00 M, 11.50 s/epoch. Pruned: T = T_dense · (Params_pruned / Params_dense), giving FMNIST 84.12 × (5.10/20.38) = 21.05 s and Lung 16.51 × (7.00/27.95) = 4.13 s. Distilled student widths solve P(h) = Params_{m=2} with P(h) = 2h² + 794h (FMNIST; h ≈ 1411 ⇒ 5.102 M) and P(h) = 2h² + 3317h (Lung; h ≈ 1217 ⇒ 7.000 M), yielding the times shown.
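The student widths quoted above follow from solving the quadratic parameter-count equation P(h) = 2h² + c·h for its positive root at the target budget, with c = 794 (FMNIST) or c = 3317 (Lung) as stated in the computation details; the short check below uses a helper name of our own choosing.

```python
import math

def student_width(target_params, c):
    """Solve 2*h^2 + c*h = target_params for the positive root h."""
    return (-c + math.sqrt(c**2 + 8.0 * target_params)) / 4.0

print(round(student_width(5.10e6, 794)))   # FMNIST: ~1411 hidden units
print(round(student_width(7.00e6, 3317)))  # Lung:   ~1217 hidden units
```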
Table 8. Baseline comparison on FMNIST and Lung. Params/FLOPs are block-level counts (Section 3.6) using the same widths as in Table 4 and Table 5.
Method | Dataset | Params (M) | FLOPs (M) | Test Acc. (%) | Time/Epoch (s)
SET (m = 1) | FMNIST | 20.38 | 20.38 | 76.10 | 84.12
Motif-SET (m = 2) | FMNIST | 5.10 | 5.10 | 73.30 | 47.69
DSR | FMNIST | 20.38 | 20.38 | 74.70 | 65.91
Deep Compression | FMNIST | 20.38 | 20.38 | 75.17 | 71.98
SET (m = 1) | Lung | 27.95 | 27.95 | 93.70 | 16.51
Motif-SET (m = 2) | Lung | 7.00 | 7.00 | 92.60 | 11.50
DSR | Lung | 27.95 | 27.95 | 93.15 | 14.01
Deep Compression | Lung | 27.95 | 27.95 | 93.33 | 14.84
Table 9. Result of each motif size (FMNIST).
Motif Size | Running Time (s) | Accuracy | Average Running Time (s)
1 (SET) | 25,236.2 | 0.7610 | 17.73
2 | 14,307.5 | 0.7330 | 9.14
4 | 9209.3 | 0.6920 | 6.74
Table 10. On-device latency and energy proxy (estimated from FLOPs; see text).
Model | Device | Latency (ms) | Energy Proxy (J) | Accuracy (%)
Motif-SET (m = 2) | Laptop CPU | 0.40 | 0.006 | 82.95
SET (m = 1) | Laptop CPU | 1.61 | 0.024 | 84.90
Motif-SET (m = 2) | RPi4 | 3.03 | 0.009 | 82.95
SET (m = 1) | RPi4 | 12.08 | 0.036 | 84.90
Motif-SET (m = 2) | Android | 1.21 | 0.006 | 82.95
SET (m = 1) | Android | 4.83 | 0.024 | 84.90
Table 11. CIFAR-10 with Motif-SET (mean ± std over 3 runs).
Motif Size | Accuracy (%) | Params (M) | FLOPs (M)
m = 1 (baseline SET) | 63.5 ± 0.4 | 2.10 | 410
m = 2 | 62.7 ± 0.5 | 1.05 | 210
m = 4 | 59.8 ± 0.7 | 0.53 | 105
Table 12. Effect of motif size on normalized accuracy and compute (mean ± std across learning rates 10⁻³–10⁻¹ and sparsities 10–30%). Baseline is SET at m = 1 (1.00).
Motif Size | Accuracy (norm.) | FLOPs (norm.)
m = 1 (SET) | 1.00 ± 0.00 | 1.00 ± 0.00
m = 2 | 0.99 ± 0.004 | 0.51 ± 0.02
m = 4 | 0.95 ± 0.007 | 0.26 ± 0.02
Table 13. Effect of sparsity at fixed m = 2 and learning rate 10⁻². Metrics are normalized to the SET baseline (m = 1).
Sparsity | Accuracy (norm.) | FLOPs (norm.)
10% | 0.995 | 0.56
20% | 0.990 | 0.51
30% | 0.985 | 0.46
Table 14. Practicality under resource constraints (macro-averaged over FMNIST & Lung). Params/FLOPs are block-level unique MACs in millions. Latency is estimated batch-1 on a laptop CPU; see text for the throughput assumptions.
Model | Params (M) | FLOPs (M) | Latency (ms) | Acc. (%)
Motif-SET (m = 2) | 6.05 | 6.05 | 0.40 | 82.95
SET (m = 1) | 24.17 | 24.17 | 1.61 | 84.90
Magnitude-pruned | 6.05 | 6.05 | 1.01 | 83.00
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
