Article

Divide-and-Merge Parallel Hierarchical Ensemble DNNs with Local Knowledge Augmentation

1 Department of Computer Science and Engineering, Shaoxing University, Shaoxing 312000, China
2 Institute of Artificial Intelligence, Shaoxing University, Shaoxing 312000, China
3 School of Information Engineering, Huzhou University, Huzhou 313000, China
4 Zhejiang Province Key Laboratory of Smart Management & Application of Modern Agricultural Resources, Huzhou University, Huzhou 313000, China
* Author to whom correspondence should be addressed.
Symmetry 2025, 17(8), 1362; https://doi.org/10.3390/sym17081362
Submission received: 25 June 2025 / Revised: 3 August 2025 / Accepted: 8 August 2025 / Published: 20 August 2025
(This article belongs to the Special Issue Advances in Neural Network/Deep Learning and Symmetry/Asymmetry)

Abstract

Traditional deep neural networks (DNNs) often suffer from a time-consuming training process, constrained by the accumulation of excessive network layers and a large number of parameters: more neural units must be stacked to achieve desirable performance. In particular, when dealing with large-scale datasets, a single DNN can hardly obtain the best performance on the limited computing resources available. To address these issues, this paper proposes a novel Parallel Hierarchical Ensemble Deep Neural Network (PH-E-DNN) to improve the accuracy and efficiency of deep networks. Firstly, the fuzzy C-means (FCM) algorithm is adopted to separate the large-scale dataset into several small data partitions. Benefiting from the fuzzy partitioning of the FCM, several sub-models can be obtained by learning their respective data partitions in isolation from the others. Secondly, the prediction results of each sub-model in the current level are used as discriminative knowledge appended to the original regional subsets, so that predictions from each level symmetrically augment the inputs of the next level, creating a symmetrical flow of discriminative knowledge across the hierarchical structure. Finally, the multiple regional subsets are merged to form a global augmented dataset, while the multi-level parallel sub-models are stacked to organize a large-scale deep ensemble network. More importantly, only the multiple DNNs in the last level are ensembled to generate the decision result of the proposed PH-E-DNN. Extensive experiments demonstrate that the PH-E-DNN is superior to several traditional and deep learning models while requiring only a few parameters to be set, which demonstrates its efficiency and flexibility.

1. Introduction

In the current era of big data, deep learning has become increasingly popular in numerous fields [1,2,3]. Deep neural networks (DNNs) undoubtedly owe their remarkable success to their powerful ability to represent complex relationships [4,5]. Differing from the single-layer nonlinear transformation mechanism of shallow networks, DNNs can capture high-level abstract features from the original data with a hierarchical structure, which has become an efficient approach for extracting hidden information. The hierarchical architecture extracts different levels of features, building a mapping from the raw signal at the bottom to high-level features at the top. The chief restriction of DNNs is their enormous computing burden: DNNs often require a lengthy training procedure to achieve the desired level of performance due to the excessive number of parameters involved. Determining how to effectively downsize the network scale without weakening DNNs’ capabilities has thus become a key focus in this field. In order to exploit the available computing resources and downsize traditional deep networks, deep ensemble models have been proposed to explore hierarchical representation performance.
Traditional deep networks include deep belief networks (DBNs) [6,7], stacked autoencoders (SAEs) [8,9,10,11], recurrent neural networks (RNNs) [12,13], and so forth. In addition, common strategies employed to enhance machine learning models in different fields include (1) enhancing performance from the perspective of width and (2) ensemble learning. The broad learning system (BLS) [14,15] was proposed as an alternative to deep structures, expanding nodes in a broader manner without stacking deeper networks and rapidly obtaining the solution via the pseudo-inverse. Ensemble learning often starts with several learners, and, if required, more learners may be integrated into the community. Several surveys in the literature primarily focus on enhancing performance based on the fusion of deep learning and ensemble learning [16,17,18]. At present, the fusion of standard schemes such as ensemble learning and deep learning has become a key research trend in machine learning.
To address the aforementioned challenges, we propose a deep ensemble model that combines the best of hierarchical and ensemble models simultaneously. Inspired by previous work on wide learning [19,20,21], the PH-E-DNN extends the stacking structure from a single DNN to a parallel multinetwork learning system through the FCM. The use of smaller regional subsets accelerates local feature learning and decomposes a large-scale complex problem across several computational nodes. Each individual module can be regarded as a specialist for any instance belonging to the corresponding subset, and the results are undoubtedly more satisfactory when several such expert DNNs are ensembled. Our model recasts the current data blocks individually, and its capability is incrementally enhanced as learning experience accumulates. It should be noted that all computational steps of the learning algorithm are block-based and thus amenable to parallel implementation on several CPUs. The PH-E-DNN is able to progressively expand the data space to acquire deeper insight into all available instances. At the start of each level, the FCM algorithm [22,23,24] is employed to divide the entire dataset into small but informative input subsets. Thereafter, these subsets are recast through a simple fusion that merges the various regional subsets into a global dataset, expanding the original samples. The multiple regional subsets and the corresponding regional knowledge are combined into global augmented data, which serve as the input for the second level. The same procedure repeats at the second and third levels. Each level operates with several DNNs implemented in parallel, which process the valuable information extracted by the sub-models of the preceding level. After three levels of knowledge augmentation, the global data are enhanced with a deeper representation. The main contributions of this work can be summarized as follows:
(1)
As a deep ensemble scheme of layer-by-layer architecture, the model tends to achieve a more comprehensive and better representation of the original input. The deep hierarchical representation ensures that valuable knowledge is effectively captured and not abandoned. The model adopts a lightweight DNN as a basic building block, whose parameters do not vary; therefore, it can be trained quickly.
(2)
The deep ensemble model takes advantage of the predictions from all previous levels to enhance generalization performance, which conforms to the theory of stacked generalization [25] and thus opens up the manifold structure of the original input space. Through the proposed stacking architecture, the augmented features progressively move away from the original manifold in a serial and parallel manner, such that improved classification performance can be achieved.
(3)
The knowledge augmentation strategy in local data partitions of each level enables valuable information to be preserved and knowledge to be supplemented for the original input. The proposed model is highly parallelizable since all learners at the same level can be implemented in parallel on various CPU computing nodes.
All in all, in this paper we propose a novel deep learning framework that differs from existing deep learning-based DNNs. Benefiting from the divide-and-merge strategy and the augmented knowledge, the proposed PH-E-DNN can obtain outstanding performance at a comparably low computation cost. Extensive experimental results on benchmark and image datasets demonstrate the effectiveness of the proposed PH-E-DNN.

2. Related Work

In this section, some background knowledge relevant to deep networks and ensemble methods is introduced. Many machine learning algorithms, such as support vector machines (SVMs) [26,27] and extreme learning machines (ELMs) [28,29], are shallow learning networks; as a result, they are more likely to discard valuable implied fragments scattered in the original data. In contrast to shallow learning, deep representation learning realizes complex nonlinear function approximation, enabling it to capture higher-order representations and abstractions of the original data. In deep learning, learning an ever more complete and accurate representation of the input is a constant pursuit.
Typical ensemble learning schemes include Bagging and Boosting [30,31]. Bagging selects n instances from the initial training data D by random sampling with replacement, which means that some data in the dataset may never be picked while other data may appear in $D_i$ frequently. Boosting trains a base classifier in each iteration by concentrating mainly on the samples that the classifier of the previous iteration misclassified. Additionally, stacked generalization [25] is another means of achieving the goal of ensemble learning and provides a new approach to learning hierarchical representations. In [32], the Deep Convex Network (DCN) was proposed, which is composed of multiple layers of sub-models; each sub-model in a DCN is a single-hidden-layer neural network, and these units are connected serially to obtain better performance. Wang et al. [33] proposed a deep transfer additive kernel least square support vector machine (DTA-LS-SVM) combining several AK-LS-SVM modules, in which the knowledge implied in each module is well mined. A hierarchical TSK fuzzy classifier [34] was proposed by Zhou et al. based on the stacked generalization principle, in which the training set, plus random shifts obtained from random projections of the predictions of the current base unit, is presented as the input of the next base building unit.

3. Proposed Method

A brief introduction to SAEs is offered first for further reference. Though SAEs possess a strong capability originating from representation learning in DNNs, this approach tends to organize a very deep network, which leads to an overwhelming storage requirement. Therefore, in this section, we deepen a DNN by stacking its simple structure into the proposed classifier so as to downsize the hierarchical structure while sharing the promising advantages of traditional deep learning.

3.1. Basic Building Block

3.1.1. Autoencoder

The autoencoder is an unsupervised learning algorithm that is composed of three layers; namely, the input layer, the hidden layer, and the output layer. The network architecture of an autoencoder is shown in Figure 1.
The goal of an autoencoder is to reconstruct the output of the network as close to the input as possible. The learning procedure of an autoencoder is divided into two parts, namely the encoder and the decoder. The encoder maps the input to the hidden representation, while the decoder is responsible for mapping it back. The encoding process is typically expressed as
$h_i = s(W_1 x_i + b_1)$  (1)
where $W_1$ and $b_1$ represent the weight and the bias between the input layer and hidden layer. In order to realize the reconstruction of data in the input space, the hidden representation will then be mapped back, and the formula can be written as
$z_i = s(W_2 h_i + b_2)$  (2)
where $W_2$ and $b_2$ represent the weight and the bias between the hidden layer and output layer, and $s(\cdot)$ is the activation function; a sigmoid function is used in this paper. All training data can be mapped to the hidden representation, and the parameters of the network are optimized to minimize the reconstruction error:
$L(x, z) = \| x - z \|^2$  (3)
The reconstruction error $L(x, z)$ keeps $z$ as close as possible to the original input $x$. Several such AEs can be employed so that the multilayer autoencoder is stacked up to form a stacked autoencoder: the basic AE is considered a building unit, and the mapped representation of the $k$th layer is used as the input of the $(k+1)$th layer. The network determines its parameters via layer-wise greedy learning and fine-tunes them with back propagation.
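To make Equations (1)–(3) concrete, the following minimal NumPy sketch runs one encode-decode pass and evaluates the reconstruction error; the layer sizes, random weights, and toy data are illustrative assumptions, not settings used in the paper.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

rng = np.random.default_rng(0)
n_samples, n_in, n_hidden = 100, 20, 10     # illustrative sizes only

X = rng.random((n_samples, n_in))           # toy input data
W1 = rng.normal(0, 0.1, (n_in, n_hidden))   # encoder weights
b1 = np.zeros(n_hidden)
W2 = rng.normal(0, 0.1, (n_hidden, n_in))   # decoder weights
b2 = np.zeros(n_in)

H = sigmoid(X @ W1 + b1)                    # Eq. (1): hidden representation
Z = sigmoid(H @ W2 + b2)                    # Eq. (2): reconstruction
recon_error = np.sum((X - Z) ** 2) / n_samples   # Eq. (3), averaged over the samples
print(f"mean reconstruction error: {recon_error:.4f}")
```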

3.1.2. Other Variants of Autoencoder

The autoencoder is prone to overfitting in the presence of noise. One effective remedy is to add noise to the original data. The denoising autoencoder (DAE) is trained to reconstruct clean input from corrupted input, which helps to improve the robustness of the trained model. Firstly, the initial input $Q$ is corrupted into $Q_{\text{corrupt}}$, a dirty version of the original input, where $C$ indicates the degree of corruption. The noise added in our model forces a randomly chosen fraction of the elements to zero.
$Q_{\text{corrupt}} = \mathrm{rand}(C) \times Q$  (4)
Similar to the process in Equations (1) and (2), the corrupted data act as the input and are encoded and decoded in the network. The corresponding cost function is
$J(W, b) = \frac{1}{m}\sum_{i=1}^{m}\frac{1}{2}\left\| x_i - z_i \right\|^2 + \frac{\lambda}{2}\sum_{l=1}^{n_l - 1}\sum_{i=1}^{S_l}\sum_{j=1}^{S_{l+1}}\left( W_{ji}^{(l)} \right)^2$  (5)
where $n_l$ is the number of network layers and $S_l$ is the number of hidden nodes in the $l$th layer; here, $n_l$ is set to 3 in our example. A neuron is considered active if its output is close to 1 and inactive if its output is close to 0. The first term in the above definition is an average sum-of-squares error term. The second is a regularization term aimed at reducing the magnitude of the weights and alleviating the over-fitting problem. To constrain neurons to be suppressed most of the time, a sparsity constraint is imposed on the hidden layer. Let $a_j^{(2)}(x)$ denote the activation of hidden unit $j$ when a specific input $x$ is fed into the network; $\hat{\rho}_j$ then denotes its average activation over the training set:
$\hat{\rho}_j = \frac{1}{m}\sum_{i=1}^{m} a_j^{(2)}(x_i)$  (6)
Equation (6) gives the average activation of hidden unit $j$. We enforce the constraint $\hat{\rho}_j = \rho$, where $\rho$ is a sparsity parameter. In order to satisfy this constraint, an extra penalty term is added to the objective function that penalizes $\hat{\rho}_j$ for deviating significantly from $\rho$. The penalty term is written as
$\sum_{j=1}^{S_{i+1}} \mathrm{KL}\left(\rho \,\middle\|\, \hat{\rho}_j\right) = \sum_{j=1}^{S_{i+1}} \left[ \rho \log\frac{\rho}{\hat{\rho}_j} + (1-\rho)\log\frac{1-\rho}{1-\hat{\rho}_j} \right]$  (7)
where $\mathrm{KL}(\cdot\|\cdot)$ is the Kullback–Leibler divergence and $S_{i+1}$ is the number of neurons in the hidden layer. The overall cost function of the denoising sparse autoencoder is
$J_{\text{sparse}}(W, b) = J(W, b) + \beta \sum_{j=1}^{S_{i+1}} \mathrm{KL}\left(\rho \,\middle\|\, \hat{\rho}_j\right)$  (8)
This variant of the autoencoder serves as the basic building block in our work and is referred to as a deep neural network with autoencoders.
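The sketch below assembles the full objective of Equation (8) from its pieces: masking corruption as in Equation (4), the reconstruction and weight-decay terms of Equation (5), the average activations of Equation (6), and the KL penalty of Equation (7). The hyperparameter values, function names, and random toy data are illustrative assumptions rather than the paper's settings.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def kl_div(rho, rho_hat, eps=1e-8):
    # KL divergence between Bernoulli(rho) and Bernoulli(rho_hat), Eq. (7)
    rho_hat = np.clip(rho_hat, eps, 1 - eps)
    return rho * np.log(rho / rho_hat) + (1 - rho) * np.log((1 - rho) / (1 - rho_hat))

def sparse_dae_cost(Q, W1, b1, W2, b2, corruption=0.3, lam=1e-4, beta=3.0, rho=0.05, seed=0):
    """Cost of a denoising sparse autoencoder; hyperparameter values are illustrative."""
    rng = np.random.default_rng(seed)
    m = Q.shape[0]
    # Eq. (4): randomly force a fraction `corruption` of the input elements to zero
    mask = rng.random(Q.shape) > corruption
    Q_corrupt = Q * mask
    # encode the corrupted input, decode, and compare against the clean input
    H = sigmoid(Q_corrupt @ W1 + b1)
    Z = sigmoid(H @ W2 + b2)
    recon = np.sum(0.5 * np.sum((Q - Z) ** 2, axis=1)) / m            # first term of Eq. (5)
    weight_decay = 0.5 * lam * (np.sum(W1 ** 2) + np.sum(W2 ** 2))    # second term of Eq. (5)
    rho_hat = H.mean(axis=0)                                          # Eq. (6)
    sparsity = beta * np.sum(kl_div(rho, rho_hat))                    # penalty of Eq. (8)
    return recon + weight_decay + sparsity

# toy usage with random parameters
rng = np.random.default_rng(1)
Q = rng.random((64, 20))
W1, b1 = rng.normal(0, 0.1, (20, 10)), np.zeros(10)
W2, b2 = rng.normal(0, 0.1, (10, 20)), np.zeros(20)
print(f"sparse DAE cost: {sparse_dae_cost(Q, W1, b1, W2, b2):.4f}")
```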

3.2. The Proposed Architecture of PH-E-DNN

The hierarchical architecture of the proposed PH-E-DNN is shown in Figure 2. The PH-E-DNN consists of two consecutive stages: a local knowledge augmentation stage with three levels, and a global classification stage that integrates the opinions of the different individual sub-models. In the first stage, we divide the whole raw input dataset $D_0$ into small but informative subsets, which the PH-E-DNN uses to train several individual DNNs, as shown in the upper section of Figure 2. The inputs and their corresponding predictions from the regional subsets of the same level are combined to form the global input of the next level. This procedure is repeated in levels 2 and 3, which are composed of the same modules stacked in a parallel manner. The PH-E-DNN can therefore be regarded as a method for transitioning from a single deep learning mechanism to a parallel deep learning mechanism by separating the input into multiple small groups. Instances sharing similar characteristics belong to the same subset of the entire dataset, which is used to train an individual DNN and broaden local knowledge. In the second stage, with the global augmented data generated by the three levels, the output combiner gathers the results of the sub-models to obtain the final output by the voting strategy. The details are given in the following sections.
In the former stage, the raw data first go through a clustering process with the FCM algorithm, and the whole training dataset is divided into subsets whose samples are disjoint. The FCM is beneficial for preprocessing the raw data before a DNN learns the various features, thus saving computation time and improving classification efficiency. The FCM is a soft clustering algorithm that obtains the membership of every instance to each cluster center by optimizing an objective function. Given a dataset $X = \{X_1, X_2, \ldots, X_N\}$ with $N$ samples, the objective function is as follows:
$J = \sum_{i=1}^{c}\sum_{j=1}^{N} \mu_{ij}^{m} \left\| X_j - V_i \right\|^2$  (9)
where $c$ is the number of clusters, $N$ is the number of samples, $m \in (1, +\infty)$ is a weighting exponent, $\mu_{ij} \in [0, 1]$ is the degree of membership of $X_j$ to the $i$th cluster, and $V_i$ is the center of cluster $i$. By optimizing the objective function, we can derive the following update rules for the membership degree $\mu_{ij}$ and the cluster center $V_i$:
$\mu_{ij} = \left[ \sum_{k=1}^{c} \left( \frac{\left\| X_j - V_i \right\|}{\left\| X_j - V_k \right\|} \right)^{\frac{2}{m-1}} \right]^{-1}$  (10)
$V_i = \frac{\sum_{j=1}^{N} \mu_{ij}^{m} X_j}{\sum_{j=1}^{N} \mu_{ij}^{m}}$  (11)
The iterative calculation is halted once the change in the cluster centers satisfies $\left\| V_i^{(t+1)} - V_i^{(t)} \right\| < \varepsilon$, where $\varepsilon$ is a predetermined threshold value.
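A minimal NumPy sketch of the FCM updates in Equations (9)–(11) follows; the random initialization, tolerance, and the hard arg-max assignment used to form regional subsets are assumptions for illustration, not the authors' exact implementation.

```python
import numpy as np

def fcm(X, c, m=2.0, eps=1e-4, max_iter=100, seed=0):
    """Minimal fuzzy C-means sketch following Eqs. (9)-(11)."""
    rng = np.random.default_rng(seed)
    N = X.shape[0]
    # random initial membership matrix with columns summing to 1
    U = rng.random((c, N))
    U /= U.sum(axis=0, keepdims=True)
    V = np.zeros((c, X.shape[1]))
    for _ in range(max_iter):
        Um = U ** m
        V_new = (Um @ X) / Um.sum(axis=1, keepdims=True)                   # Eq. (11)
        dist = np.linalg.norm(X[None, :, :] - V_new[:, None, :], axis=2)   # (c, N) distances
        dist = np.maximum(dist, 1e-10)
        # Eq. (10): membership update
        ratio = (dist[:, None, :] / dist[None, :, :]) ** (2.0 / (m - 1.0)) # (c, c, N)
        U = 1.0 / ratio.sum(axis=1)
        if np.linalg.norm(V_new - V) < eps:   # stop when the centers barely move
            V = V_new
            break
        V = V_new
    labels = U.argmax(axis=0)  # hard assignment used to form the regional subsets
    return U, V, labels

# toy usage: split random 2-D data into c = 3 regional subsets
X = np.random.default_rng(2).random((300, 2))
U, V, labels = fcm(X, c=3)
print("subset sizes:", np.bincount(labels))
```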
$D_0$ denotes the given training data, and the regional subsets are $S_1^1, S_2^1, \ldots, S_L^1$, respectively. It should be noted that the subsets satisfy $D_0 = S_1^1 \cup S_2^1 \cup \cdots \cup S_L^1$ and $S_i^1 \cap S_j^1 = \emptyset$ for $i \neq j$, $1 \le i, j \le L$. Thus, each DNN module is allocated a subset separated from the other data blocks, and the corresponding inputs are $S_1^1, S_2^1, \ldots, S_L^1$. Then, we employ DNNs to learn the relationships implied in the data. The local classification models are able to learn more discriminative information, since each is specific to a group of instances sharing similar characteristics. The data blocks are trained by multiple modules in parallel, and the integration of the weak predictions derived from the multiple modules is shown to perform hierarchical learning well. The outputs are represented as $Y_1^1, Y_2^1, \ldots, Y_L^1$. We treat the output predictions derived from the multiple modules as new knowledge, which is converted into the augmented feature segments $K_1^1, K_2^1, \ldots, K_L^1$. After all modules in the first level accomplish their tasks, the $L$ local augmented feature fragments are appended to the regional subsets, and the new augmented input is generated.
Our method generates a supplementary knowledge base, and the actual weak predictions are appended to the original input. The regional augmented subsets can then be formed as
$A_1^1 = \left[ S_1^1, K_1^1 \right], \quad A_2^1 = \left[ S_2^1, K_2^1 \right], \quad \ldots, \quad A_L^1 = \left[ S_L^1, K_L^1 \right]$  (12)
where $A_L^1$ represents the local augmented subset reconstructed by the $L$th local classification model in the first level, and $[\cdot,\cdot]$ denotes feature concatenation. The newly produced knowledge acts as a set of guidance rules, which drives the model in a positive direction. More precisely, knowledge extracted from the raw data is preserved in the augmented data, which tends to be more accurate than the raw data representation. By fusing the various regional augmented subsets, a highly discriminative representation of the global augmented instances can be obtained. The new input assigned to the next level is exactly this augmented dataset, whose matrix form is $D_1 = \left[ (A_1^1)^T, (A_2^1)^T, \ldots, (A_L^1)^T \right]^T$. In this augmented dataset, the $L$ local augmented subsets are merged into $D_1$. Next, we turn to the next level of stacked generalization, which proceeds as follows. When the new global dataset with augmented features arrives at the second level, the FCM is adopted to partition the global augmented samples $D_1$ once more, and the DNNs are again trained in parallel. With the fuzzy partition of the FCM, we derive $S_1^2, S_2^2, \ldots, S_L^2$. The process in the second level is then the same as that in the previous level. The local prediction results from the second level are saved and transmitted in the form of knowledge. After that, the new output is concatenated with the input space of the second level, i.e., the original input space and the weak prediction outputs of the two levels are fused to constitute the augmented input of the next level.
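To make the divide-train-augment-merge cycle concrete, the following sketch implements one level of local knowledge augmentation under stated assumptions: scikit-learn's MLPClassifier stands in for the DNN-with-autoencoders sub-model, the FCM partition is represented by a precomputed label vector, and names such as augment_level are illustrative, not from the paper.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier  # stand-in for the DNN sub-models

def augment_level(D, y, partition, n_classes, seed=0):
    """One level of local knowledge augmentation in the spirit of Eq. (12).
    `partition` holds the FCM-derived subset index of every sample."""
    subsets = []
    for r in np.unique(partition):
        idx = np.where(partition == r)[0]
        S_r, y_r = D[idx], y[idx]                       # regional subset S_r
        clf = MLPClassifier(hidden_layer_sizes=(30,), max_iter=300, random_state=seed)
        clf.fit(S_r, y_r)                               # train the local sub-model in isolation
        K_r = np.eye(n_classes)[clf.predict(S_r)]       # predictions recast as knowledge K_r
        A_r = np.hstack([S_r, K_r])                     # A_r = [S_r, K_r]: augmented subset
        subsets.append((idx, A_r))
    # merge the regional augmented subsets back into one global augmented dataset
    D_next = np.zeros((D.shape[0], D.shape[1] + n_classes))
    for idx, A_r in subsets:
        D_next[idx] = A_r
    return D_next

# toy usage: 3 partitions on random data with 2 classes
rng = np.random.default_rng(3)
D0 = rng.random((300, 8))
y = rng.integers(0, 2, 300)
partition = rng.integers(0, 3, 300)   # stands in for the FCM hard assignment
D1 = augment_level(D0, y, partition, n_classes=2)
print(D0.shape, "->", D1.shape)       # the feature dimension grows by n_classes
```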
At the last level of this stage, the learning process of the first and second levels is followed, performing the same training with the intention of augmenting the regional subsets. What we acquire by enhancing the regional data is the global augmented dataset $D_3 = \left[ (A_1^3)^T, (A_2^3)^T, \ldots, (A_L^3)^T \right]^T$. Similar to the previous stage, multiple sub-models still operate in parallel; however, instead of local data, we use the global data to train the multiple DNNs, with no parameters changed. Therefore, the input used in the latter stage is the whole dataset, which involves all instances together with the triply augmented features. According to the theory of ensemble learning, a sample may still be classified correctly through the fusion of multiple classifiers even when some individual classifiers misclassify it, provided enough members vote correctly. The second stage employs the voting principle [35] of ensemble learning, which integrates multiple individual learners into a model committee to perform complex classification tasks. The majority vote is adopted to fuse all the recognition results produced by the individual sub-models of the second stage. A pattern belonging to an unknown class is then classified by the joint judgment of the committee members and identified as pertaining to the class that receives the majority vote. Even if the estimate of some individual sub-model for a particular hypothesis is questionable, the result may be accepted, provided it receives sufficient support from the other sub-models.
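A minimal sketch of the second-stage combiner: given the class predictions of the L final-level sub-models, the class receiving the most votes is returned for each sample. The function name majority_vote and the toy predictions are illustrative.

```python
import numpy as np

def majority_vote(predictions):
    """Fuse per-sample class predictions from L sub-models by majority vote.
    `predictions` has shape (L, n_samples) and contains integer class labels."""
    predictions = np.asarray(predictions)
    n_classes = predictions.max() + 1
    # count the votes each class receives for every sample, then pick the winner
    votes = np.apply_along_axis(lambda col: np.bincount(col, minlength=n_classes),
                                axis=0, arr=predictions)
    return votes.argmax(axis=0)

# toy usage: 5 sub-models (an odd number avoids ties) voting on 4 samples
preds = np.array([[0, 1, 2, 1],
                  [0, 1, 2, 2],
                  [1, 1, 2, 1],
                  [0, 0, 2, 1],
                  [0, 1, 1, 1]])
print(majority_vote(preds))   # -> [0 1 2 1]
```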
The PH-E-DNN is proposed to achieve enhanced performance and simultaneously downsize the network architecture of DNNs. By triply augmenting the original input space with the predictions of the three levels, the PH-E-DNN stacks several lightweight DNN modules to guarantee a stronger feature representation capability. Additionally, it should be noted that focusing on exploiting local knowledge is beneficial for obtaining a more discriminative feature expression. All such augmented data are serially fed into the upper level, where the implied information is transformed into knowledge. On the one hand, the small number of instances in each subset can alleviate the lengthy training of deep learning models; on the other hand, the correlations and causalities analyzed by each module can be captured while learning potential tendencies.

3.3. Knowledge Augmentation Based on Regional Subsets

In general, deeper and broader networks can learn better representations. However, deep networks are computationally expensive to train: a large number of network units must be over-stacked to obtain an excellent hierarchical nonlinear expression. As such, DNNs often require numerous units to achieve the desired performance, which inevitably results in heavy storage requirements. In addition, even though deep learning can achieve accurate data representation, it remains challenging to guarantee a favorable recognition rate in the final layer when solely using a traditional deep learning model. This is one of the reasons for proposing a lightweight local classification model. Using the FCM clustering algorithm, we split the data into several separate regions without overlap. The local classification model extends this knowledge by using a pre-specified dataset. Before global classification, the local classification models can achieve an efficient representation, since additional features are learned only if they are useful for characterizing the data samples.
Deep learning is also referred to as hierarchical learning. Unlike shallow machines [36,37,38,39], the power of a more efficient hierarchical model lies in higher-order abstractions and more precise feature description. In this study, we aim to cultivate new augmented features from old ones. The concept of hierarchical learning described herein covers two aspects of deep learning. First, the module chosen as the basic building block is itself a deep neural network, albeit one with a fairly simple structure. Second, multiple parallel DNN systems are cascaded: a total of three levels of parallel DNNs are combined to implement parallel training on the corresponding regional subsets. Hierarchical learning in the present case closely aligns with the methodology of stacking in ensemble learning and thus follows the philosophy of stacked generalization. The aim of this DNN-based ensemble scheme is to obtain a complementary representation: additional insight into the recognition problem is gained by improving the quality of data representations whose interpretation of the instances would otherwise be inadequate. After the entire network is established, knowledge is transmitted through it in a bottom-up manner. The schematic of the knowledge augmentation strategy is depicted in Figure 3.
As hierarchical stacked networks, traditional over-stacked DNNs show a deficiency in declarative expression of the multi-hidden-layer learning process. A reasonable and acceptable explanation should be provided for users that reveals the trend of a particular recognition task. In our model, triply augmented features can aid in exploring implied correlations in raw data while inferring potential prediction trends of actual output. The method conforms to the philosophy of stacked generalization, which indeed provides sufficient knowledge in learning complex functions. Through this deep stacking architecture, the augmented data open manifolds in the original data space in a parallel manner to achieve more efficient separability.
Traditional individual DNNs have difficulty leveraging the available computing resources of multiple CPUs, an issue that requires an urgent solution. Information in the classification layer is dispersed. In order to extract effective information from the data, a method is developed to extract classification knowledge from regional subsets, which can not only reduce information loss but also improve the efficiency of feature extraction. The present study was conducted in the hope of providing an alternative to the tedious training process of traditional deep networks. Incremental learning of knowledge enables researchers to observe the local augmented features required at each level without the issues commonly encountered in traditional deep algorithms. In the three levels, the multiple regional subsets obtained by fuzzy partitioning are prepared for training by multiple sub-processes on CPUs. The proposed network helps us understand what each DNN has learned from its local instances and makes it easier to pass this knowledge on to the next level. By analyzing the execution of the PH-E-DNN framework, we can learn how the sub-models guide the learning process toward our aim. As several simple DNNs cooperate to fulfill their tasks, the model can accelerate the learning process. The algorithm of the basic building block DNN is shown in Algorithm 1, and the algorithm of the proposed deep ensemble model is listed in Algorithm 2.
Algorithm 1: Basic building block: Deep neural network with autoencoder
Require: Input data.
Ensure: The actual output label of each basic building block.
  1: Perform the corrupting process to obtain the input $Q_{\text{corrupt}}$.
  2: Perform encoding and decoding, and achieve the output of the network.
  3: Calculate the objective function $J_{\text{sparse}}(W, b)$ in Equation (8).
  4: Fine-tune the parameters $(W, b)$ with the back propagation algorithm.
  5: Repeat Step 2 to Step 4 until the cost function converges.
Algorithm 2: PH-E-DNN
Require: Input feature matrix $D_0$ and the corresponding label $T$, the number of sub-models in each level $L$, the number of hidden units, the number of epochs, and the batchsize.
Ensure: Fused decision output.
  1: Initialization: Use the FCM algorithm to divide the total set $D_0$ into multiple regional subsets $S_1^1, S_2^1, \ldots, S_L^1$.
  2: Parallel training of sub-models: Call Algorithm 1 to train several DNNs in parallel with these subsets, so that the classification labels $Y_1^1, Y_2^1, \ldots, Y_L^1$ can be determined.
  3: Creation of augmented data: Consider the actual classification labels $Y_1^1, Y_2^1, \ldots, Y_L^1$ as the created knowledge $K_1^1, K_2^1, \ldots, K_L^1$, and concatenate this new feature with the previous input. The feature fusion is formed as $A_1^1 = [S_1^1, K_1^1], A_2^1 = [S_2^1, K_2^1], \ldots, A_L^1 = [S_L^1, K_L^1]$. Collecting all local augmented data, we derive the new total dataset $D_1 = [(A_1^1)^T, (A_2^1)^T, \ldots, (A_L^1)^T]^T$. Here, $K$ denotes the new feature learned by the architecture, $A$ represents the combined features integrating both the new feature and the previous input, and $D$ refers to the new dataset aggregated from all regional subsets.
  4: Repeat the process similar to the first local knowledge augmentation. The augmented global dataset can be written as $D_2 = [(A_1^2)^T, (A_2^2)^T, \ldots, (A_L^2)^T]^T$.
  5: Repeat the process similar to the first local knowledge augmentation. The final global dataset can be expressed as $D_3 = [(A_1^3)^T, (A_2^3)^T, \ldots, (A_L^3)^T]^T$.
  6: Parallel learning of sub-models: Call Algorithm 1 on the global dataset $D_3$ in parallel to obtain the individual results $Y_1^4, Y_2^4, \ldots, Y_L^4$.
  7: Final decision: Determine the class of each testing sample according to the majority vote.
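The compact sketch below strings Algorithm 2 together end to end under simplifying assumptions: KMeans stands in for the FCM partitioner, scikit-learn's MLPClassifier stands in for the lightweight DNN block, and the final-stage sub-models are trained sequentially here although the paper runs them in parallel. Function names such as ph_e_dnn are illustrative, not the authors' code.

```python
import numpy as np
from sklearn.cluster import KMeans                 # stand-in for the FCM partitioner
from sklearn.neural_network import MLPClassifier   # stand-in for the lightweight DNN block

def train_level(D, y, L, n_classes, seed):
    """Partition D into L regional subsets, train one sub-model per subset in isolation,
    and append its predictions as knowledge (Algorithm 2, steps 1-3)."""
    part = KMeans(n_clusters=L, n_init=10, random_state=seed).fit_predict(D)
    D_next = np.zeros((D.shape[0], D.shape[1] + n_classes))
    for r in range(L):
        idx = np.where(part == r)[0]
        clf = MLPClassifier(hidden_layer_sizes=(30,), max_iter=300, random_state=seed)
        clf.fit(D[idx], y[idx])
        K = np.eye(n_classes)[clf.predict(D[idx])]          # knowledge fragment K_r
        D_next[idx] = np.hstack([D[idx], K])                # A_r = [S_r, K_r]
    return D_next

def ph_e_dnn(D0, y, L=3, n_levels=3, n_classes=None, seed=0):
    """A PH-E-DNN-style sketch: three augmentation levels, then an L-member ensemble
    trained on the global augmented data and fused by majority vote."""
    n_classes = n_classes or int(y.max()) + 1
    D = D0
    for level in range(n_levels):                           # steps 1-5 of Algorithm 2
        D = train_level(D, y, L, n_classes, seed + level)
    ensemble = [MLPClassifier(hidden_layer_sizes=(30,), max_iter=300, random_state=seed + i)
                .fit(D, y) for i in range(L)]               # step 6: parallelizable in practice
    votes = np.stack([clf.predict(D) for clf in ensemble])
    fused = np.apply_along_axis(lambda c: np.bincount(c, minlength=n_classes).argmax(),
                                0, votes)                   # step 7: majority vote
    return fused

# toy run on random data; real use would split into training and testing sets
rng = np.random.default_rng(4)
D0, y = rng.random((300, 8)), rng.integers(0, 2, 300)
print("training accuracy of the sketch:", (ph_e_dnn(D0, y) == y).mean())
```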

4. Experiments

In this section, we will present the results of our evaluations of the parallel hierarchical classifier PH-E-DNN on benchmark datasets. All experiments were implemented in MATLAB on a computer with an Intel Core i5-9400 2.90 GHz CPU and 8.0 GB of RAM.

4.1. Benchmark Datasets

Many datasets from the UCI [40] and KEEL [41] repositories are chosen to evaluate the proposed method. The details of these datasets are summarized in Table 1. The adopted datasets consist of binary-class and multi-class classification tasks with different numbers of attributes and instances. In order to exhibit the merits of parallel learning in our approach, we prefer medium-scale and large-scale datasets; small datasets are not employed, not because the results on them are poor, but because they make little sense for demonstrating the parallelization benefits of our model. Among them, some datasets, such as SAT, OPT, and CON, have a relatively large number of features, while others, like SHU, ADU, and CON, have a large sample size. Datasets such as SAT, PEND, and LET concern character recognition. Some datasets, including OPT, PEND, and HARS, are originally provided with separate training and testing splits; we merge them and reconstruct the corresponding training and testing data at random.

4.2. Experimental Setup

In our experimental setup, the whole dataset is randomly split into a training set and a testing set, where 80% is used for training and the remainder for testing. In order to guarantee fairness, the basic building block is trained with parameter settings identical to those of the SAE comparison method. In the proposed model, neither the number of hidden units nor the number of modules is set very large, so that the network structure of the DNNs remains easy to implement. We employ a grid search strategy: the number of hidden units in each DNN is optimized within the range [10:10:50], and the epoch and batchsize are searched within the ranges [30:10:100] and [10:10:100], respectively. The numbers of DNN modules in the same hierarchy are L = 3 and L = 5.
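As a small illustration of the search described above, the sketch below enumerates the MATLAB-style ranges as an explicit grid; the evaluate function is a hypothetical placeholder for training and validating the basic DNN block, not the paper's scoring code.

```python
import itertools

# the ranges [10:10:50], [30:10:100], and [10:10:100] from the text, expanded into a grid
hidden_units = range(10, 51, 10)
epochs = range(30, 101, 10)
batch_sizes = range(10, 101, 10)

def evaluate(h, e, b):
    # placeholder validation score; a real run would train and validate the DNN block here
    return -(abs(h - 30) + abs(e - 50) + abs(b - 10))

best = max(itertools.product(hidden_units, epochs, batch_sizes),
           key=lambda cfg: evaluate(*cfg))
print("selected (hidden units, epochs, batchsize):", best)
```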
To verify whether the creation of new features leads to substantive and effective improvement, the proposed method and several comparative algorithms are evaluated as follows. The recognition performance of the new structure is compared in turn to a traditional SVM (LibSVM), the well-known deep networks SAE and DBN, and typical ensemble methods such as Adaboost and Bagging. The comparison methods Adaboost, Bagging, and SVM [26] are implemented with toolboxes in MATLAB 2019, while the SAE and DBN are implemented with the MATLAB code encapsulated in the DeepLearn Toolbox [42]. To further compare the performance of deep fusion methods, some other fusion methods are also designed as rival methods. Similar to the proposed method, the PH-DNN and E-DNN share the same basic building block and represent the former stage and the latter stage, respectively, while the PH-E-DBN replaces the basic building block of our proposed method with a DBN. The parameter settings of our scheme are given in Table 2. The number of sub-classifiers of the ensemble methods is determined in the range [5, 20]. In the SVM, the penalty parameter C and kernel width parameter g are determined by grid search. The number of hidden units of the DBN is searched in the interval [10:10:50], while the epoch and batchsize are optimized in [30:10:60] and [10:10:100], respectively. The PH-DNN is a deep stacking classification model composed of the three levels of the first stage of the PH-E-DNN, while the E-DNN is an ensemble model based on the majority vote of the later stage. Both methods are listed in the results as comparison methods.

4.3. Experimental Analysis and Comparison of Results

In our experiments, all samples are first partitioned through the FCM so as to prepare sufficient subsets for multi-module learning. The results for three and five partitions are presented below, where the number of partitions refers to the number of sub-models used in each level. According to the average values of the metrics, the proposed method constructed from local DNNs possesses better inter-class discriminating capability under testing conditions. Each of the L data partitions is trained independently, so the other L−1 sub-models do not interfere with its training, which is fundamental to learning specific local knowledge. We can observe that the deep ensemble model with local knowledge augmentation improves the use of computation resources for large-scale databases. This is due to the subdivision of the original data into disjoint sets, which remarkably reduces the number of samples in each subset; each individual DNN is executed on a single node under parallel computation. Table 3 shows the results of the basic building block, including training accuracy, testing accuracy, training time, and testing time. In our architecture, each level has several subsets, and each subset performs its own independent training. After multi-module learning in each level is accomplished, the multiple subsets are merged into a global dataset and divided by the FCM once more; all of the newly generated vectors carry the knowledge of the previous regional subsets. To guarantee consistency in the comparison, the same settings, such as the network structures and training epochs, are retained. In the second stage, it is advisable to choose the class label supported by the majority of the experts to obtain a more reliable estimate; at the final decision, sub-classifiers in the minority are excluded and their opinions need not be taken into account. An odd number of sub-models is selected to break ties in the case of a judgment deadlock. After comprehensive consideration, we selected the same number of sub-models in each level of the two stages.
The total delay can be approximately divided into three levels of local knowledge augmentation in the first stage and final global classification of the entire dataset in the second stage. The time cost of the three levels of feature enhancement is depicted as follows. Owing to the parallelism in each enhancement level, our approach can drastically decrease the latency of triple knowledge augmentation. As presented in detail in Table 4 and Table 5, the time spent on knowledge enhancement is reduced with the increase in the number of sub-models.
The proposed local knowledge augmentation is highly consistent with the methodology of stacked generalization, which essentially consists of feature expansion and pattern classification. We tend to hide the complex operations performed therein and attach more significance to the reuse of the actual classification labels. This process may help move the instances closer to their actual positions. Though our model adds to the complexity of the original sub-model, the improvement brings significant benefits. Traditional DNNs consistently suffer from the burden of excessively complex computations; in contrast, the simple structures of the sub-models employed in our model effectively decrease the time spent on training. As more divisions appear at each level, the learning time decreases further, because fewer computations are required for layer-wise training than for a large DNN. The triple-level knowledge learning process in the former stage merely lays the foundation for final classification, and more attention should be paid to the utilization of the knowledge. Progressive levels of feature augmentation form the next step; thereafter, our only concern is the outcome of the final recognition. The results presented in Table 6 report the training and testing accuracy in terms of both mean and standard deviation for three partitions and five partitions.
The information hidden in the subsets is extracted using the stacked generalization principle, and the recognition performance is gradually improved. Only a slight improvement is observed on some datasets, but the improvement is relatively steady. Regarding classification accuracy, the new model achieves excellent recognition accuracy, at least comparable to commonly employed deep learning methods. Moreover, its superiority is evident, fully validating the ability of deep learning methods with stacked generalization to automatically mine discriminative high-level features.
In Table 7 and Table 8, we report the results of all approaches, including the traditional methods and fusion methods, in terms of average testing accuracy and F-measure. On PEND and PENB, our two-stage ensemble algorithm is not as good as the PH-DNN, which implies that the voting strategy is not an ideal choice on these two datasets. On PAG, the SVM performs better than our deep ensemble method. Further, the PH-E-DNN is slightly inferior to the PH-E-DBN on the OPT and LET character recognition datasets. In terms of training and testing time, the proposed model can compete with the deep learning algorithms in the DeepLearn Toolbox. It can be noted from the results that the proposed method has certain advantages over traditional ensemble methods. The stacking of sub-models inevitably increases the complexity of the original model, but the process of knowledge augmentation reveals trends in the identification results. The PH-E-DNN tends to deliver performance better than or at least comparable to the other rival methods because it reuses useful augmented features.

5. Case Study

5.1. Datasets

The MNIST handwritten digit recognition dataset is one of the most popular image datasets, being frequently employed to validate the performance of machine learning models in experiments. It covers 60,000 training instances and 10,000 testing instances, each of which consists of images of handwritten digits (from 0 to 9). All digits are represented by an image with a scale of 28 × 28 pixels. In this section, the PH-E-DNN is also evaluated on the Fashion-MNIST dataset. This image dataset is a variant of the MNIST dataset. The dataset is also composed of 70,000 samples (60,000 training images and 10,000 testing images) in 10 classes. The instances are images of fashion items, which involve t-shirts, trousers, pullovers, dresses, coats, and so on.

5.2. Comparison with Other Approaches

To ensure a fair comparison, the parameters of the SAE are still exactly the same as those in our structure (784-100-10). SAE2 represents a deep network with a relatively larger architecture. The CNN (6-2-12-2) consists of two convolutional layers with filters of 5 × 5 and two pooling layers of size 2 × 2 on these two datasets. The DBN network structure uses 784-100-100-10 for MNIST and 784-200-10 for Fashion-MNIST, respectively.

5.3. Discussion and Analysis

The experimental results of the adopted deep learning methods and the proposed PH-E-DNN on the MNIST and Fashion-MNIST datasets are reported in Table 9 in terms of accuracy and time. The new deep learning method shows clear advantages, although it may not reach the best possible results on these datasets. The PH-E-DNN achieves slightly better results than the other methods, and we attribute this improvement to the embedded local augmented knowledge. In particular, the result on Fashion-MNIST demonstrates the superiority of the proposed method, with a remarkable improvement. Here, we conjecture that the performance benefits profoundly from the deep ensemble structure rather than from time-consuming training or tricky network parameters. Note that we compare our architecture horizontally with the algorithms in the DeepLearn Toolbox 2022b without specific preprocessing of the raw images. Considering both recognition performance and computation resource utilization simultaneously, our proposed approach is a competitive candidate.

6. Conclusions

Based on local knowledge augmentation and the principle of stacked generalization, a novel deep ensemble large-scale classifier built from DNNs has been developed in this article. The classifier, termed the PH-E-DNN, is proposed to deal with large-scale datasets and is beneficial for fully mobilizing computing resources and enhancing recognition performance. Traditional DNNs require an excessive number of network units and pre-specified parameters to learn all input instances. Here, the FCM is adopted to preprocess the original input before multiple modules at each level learn the implied information, leading to better inter-class discrimination. In addition, the adoption of a concise deep module retains the powerful representation learning ability of a conventional deep network. Furthermore, the PH-E-DNN is well suited to a parallel environment, owing to the independence between modules in each level of the feature augmentation stage; in contrast, an individual lengthy DNN faces significant challenges in parallel computation due to its serial architecture. Considering both accuracy and resource utilization, our method demonstrates clear advantages compared to numerous others.
However, the reason why this knowledge augmentation is so effective remains to be fully explained. In our future work, we will devote our efforts to studying how the local augmented knowledge drives the instances toward their correct locations. Further, we will strive to apply the method to real-world data to deal with large-scale application problems.

Author Contributions

Conceptualization, X.Z.; Methodology, Z.J. and X.Z.; Software, S.D. and K.L.; Validation, Z.J., S.D., K.L. and J.Z.; Formal analysis, K.L. and J.Z.; Resources, Z.J.; Writing—original draft, Z.J. and S.D.; Writing—review & editing, J.Z. and X.Z.; Visualization, S.D. and K.L.; Supervision, X.Z.; Project administration, J.Z.; Funding acquisition, Z.J. and J.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported in part by the National Natural Science Foundation of China under Grants 62376094, 62206177, and 62106145; by the Zhejiang Provincial Natural Science Foundation of China under Grants LY23F020007 and LQ22F020024; and by the General Scientific Research Project of Zhejiang Education Department under Grant Y202248951.

Data Availability Statement

The data presented in this study are openly available in UCI at (https://archive.ics.uci.edu/), reference number [40].

Conflicts of Interest

The authors declare there are no competing interests.

References

  1. Kim, J.; Nguyen, A.D.; Lee, S. Deep CNN-based blind image quality predictor. IEEE Trans. Neural Netw. Learn. Syst. 2019, 30, 11–24. [Google Scholar] [CrossRef] [PubMed]
  2. Zhang, C.; Lim, P.; Qin, A.K.; Tan, K.C. Multiobjective deep belief networks ensemble for remaining useful life estimation in prognostics. IEEE Trans. Neural Netw. Learn. Syst. 2017, 28, 2306–2318. [Google Scholar] [CrossRef]
  3. Zhu, L.; Hill, D.J.; Lu, C. Intelligent short-term voltage stability assessment via spatial attention rectified RNN learning. IEEE Trans. Ind. Inform. 2021, 17, 7005–7016. [Google Scholar] [CrossRef]
  4. Bengio, Y.; Courville, A.; Vincent, P. Representation learning: A review and new perspectives. IEEE Trans. Pattern Anal. Mach. Intell. 2013, 35, 1798–1828. [Google Scholar] [CrossRef]
  5. Deng, L.; Yu, D. Deep learning: Methods and applications. Found. Trends Signal Process. 2014, 7, 197–387. [Google Scholar] [CrossRef]
  6. Hinton, G.E.; Salakhutdinov, R.R. Reducing the dimensionality of data with neural networks. Science 2006, 313, 504–507. [Google Scholar] [CrossRef] [PubMed]
  7. Hinton, G.E.; Osindero, S.; Teh, Y.W. A fast learning algorithm for deep belief nets. Neural Comput. 2006, 18, 1527–1554. [Google Scholar] [CrossRef]
  8. Vincent, P.; Larochelle, H.; Lajoie, I.; Bengio, Y.; Manzagol, P.A.; Bottou, L. Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion. J. Mach. Learn. Res. 2010, 11, 3371–3408. [Google Scholar]
  9. Xu, J.; Xiang, L.; Liu, Q.; Gilmore, H.; Wu, J.; Tang, J.; Madabhushi, A. Stacked sparse autoencoder (SSAE) for nuclei detection on breast cancer histopathology images. IEEE Trans. Med. Imaging 2016, 35, 119–130. [Google Scholar] [CrossRef] [PubMed]
  10. D’Angelo, G.; Palmieri, F. A stacked autoencoder-based convolutional and recurrent deep neural network for detecting cyberattacks in interconnected power control systems. Int. J. Intell. Syst. 2021, 36, 7080–7102. [Google Scholar] [CrossRef]
  11. Zeng, K.; Yu, J.; Wang, R.; Li, C.; Tao, D. Coupled deep autoencoder for single image super-resolution. IEEE Trans. Cybern. 2017, 47, 27–37. [Google Scholar] [CrossRef]
  12. Graves, A.; Mohamed, A.-R.; Hinton, G. Speech recognition with deep recurrent neural networks. In International Conference on Acoustics, Speech and Signal Processing; IEEE: Piscataway, NJ, USA, 2013; pp. 6645–6649. [Google Scholar]
  13. Zhang, H.; Wang, Z.; Liu, D. A comprehensive review of stability analysis of continuous-time recurrent neural networks. IEEE Trans. Neural Netw. Learn. Syst. 2014, 25, 1229–1262. [Google Scholar] [CrossRef]
  14. Chen, C.L.P.; Liu, Z. Broad learning system: An effective and efficient incremental learning system without the need for deep architecture. IEEE Trans. Neural Netw. Learn. Syst. 2018, 29, 10–24. [Google Scholar] [CrossRef]
  15. Chen, C.L.P.; Liu, Z.; Feng, S. Universal approximation capability of broad learning system and its structural variations. IEEE Trans. Neural Netw. Learn. Syst. 2019, 30, 1191–1204. [Google Scholar] [CrossRef]
  16. Chen, Z.; Wu, M.; Gao, K.; Wu, J.; Ding, J.; Zeng, Z.; Li, X. A novel ensemble deep learning approach for sleep-wake detection using heart rate variability and acceleration. IEEE Trans. Emerg. Top. Comput. Intell. 2021, 5, 803–812. [Google Scholar] [CrossRef]
  17. Zheng, J.; Cao, X.; Zhang, B.; Zhen, X.; Su, X. Deep ensemble machine for video classification. IEEE Trans. Neural Netw. Learn. Syst. 2019, 30, 553–565. [Google Scholar] [CrossRef] [PubMed]
  18. Sun, Q.; Ge, Z. Gated stacked target-related autoencoder: A novel deep feature extraction and layerwise ensemble method for industrial soft sensor application. IEEE Trans. Cybern. 2022, 52, 3457–3468. [Google Scholar] [CrossRef] [PubMed]
  19. Gou, J.; He, X.; Du, L.; Yu, B.; Chen, W.; Yi, Z. Hierarchical Locality-Aware Deep Dictionary Learning for Classification. IEEE Trans. Multimed. 2024, 26, 447–461. [Google Scholar] [CrossRef]
  20. Zhang, W.; Wu, Q.M.J.; Yang, Y.; Akilan, T.; Zhang, H. A width-growth model with subnetwork nodes and refinement structure for representation learning and image classification. IEEE Trans. Ind. Inform. 2021, 17, 1562–1572. [Google Scholar] [CrossRef]
  21. Duan, M.; Li, K.; Liao, X.; Li, K. A parallel multiclassification algorithm for big data using an extreme learning machine. IEEE Trans. Neural Netw. Learn. Syst. 2018, 29, 2337–2351. [Google Scholar] [CrossRef]
  22. Lin, K.P. A novel evolutionary kernel intuitionistic fuzzy c-means clustering algorithm. IEEE Trans. Fuzzy Syst. 2014, 22, 1074–1087. [Google Scholar] [CrossRef]
  23. Narayanan, S.J.; Baby, C.J.; Perumal, B.; Bhatt, R.B.; Cheng, X.; Ghalib, M.R.; Shankar, A. Fuzzy decision trees embedded with evolutionary fuzzy clustering for locating users using wireless signal strength in an indoor environment. Int. J. Intell. Syst. 2021, 36, 4280–4297. [Google Scholar] [CrossRef]
  24. Zhang, X.; Nojima, Y.; Ishibuchi, H.; Hu, W.; Wang, S. Prediction by fuzzy clustering and KNN on validation data with parallel ensemble of interpretable TSK fuzzy classifiers. IEEE Trans. Syst. Man Cybern. Syst. 2022, 52, 400–414. [Google Scholar] [CrossRef]
  25. Wolpert, D.H. Stacked generalization. Neural Netw. 1992, 5, 241–259. [Google Scholar] [CrossRef]
  26. Zhang, L.; Zhou, W.; Jiao, L. Wavelet support vector machine. IEEE Trans. Syst. Man Cybern. Part B (Cybern.) 2004, 34, 34–39. [Google Scholar] [CrossRef] [PubMed]
  27. Oskoei, M.A.; Hu, H. Support vector machine-based classification scheme for myoelectric control applied to upper limb. IEEE Trans. Biomed. Eng. 2008, 55, 1956–1965. [Google Scholar] [CrossRef] [PubMed]
  28. Huang, G.B.; Zhou, H.; Ding, X.; Zhang, R. Extreme learning machine for regression and multiclass classification. IEEE Trans. Syst. Man Cybern. Part B (Cybern.) 2012, 42, 513–529. [Google Scholar] [CrossRef] [PubMed]
  29. Liang, N.Y.; Huang, G.B.; Saratchandran, P.; Sundararajan, N. A fast and accurate online sequential learning algorithm for feedforward networks. IEEE Trans. Neural Netw. 2006, 17, 1411–1423. [Google Scholar] [CrossRef]
  30. Zhao, P.; Fang, J.; Jie, C.; Zhang, J.; Wang, E.; Zhang, S. Multiscale Deep Learning Reparameterized Full Waveform Inversion With the Adjoint Method. IEEE Trans. Geosci. Remote Sens. 2025, 63, 1–12. [Google Scholar] [CrossRef]
  31. Wang, B.; Pineau, J. Online bagging and boosting for imbalanced data streams. IEEE Trans. Knowl. Data Eng. 2016, 28, 3353–3366. [Google Scholar] [CrossRef]
  32. Deng, L.; Yu, D. Deep convex net: A scalable architecture for speech pattern classification. In Twelfth Annual Conference of the International Speech Communication Association; ISCA: Florence, Italy, 2011; pp. 2285–2288. [Google Scholar]
  33. Wang, G.; Zhang, G.; Choi, K.S.; Lu, J. Deep additive least squares support vector machines for classification with model transfer. IEEE Trans. Syst. Man Cybern. Syst. 2019, 49, 1527–1540. [Google Scholar] [CrossRef]
  34. Zhou, T.; Chung, F.; Wang, S. Deep TSK fuzzy classifier with stacked generalization and triplely concise interpretability guarantee for large data. IEEE Trans. Fuzzy Syst. 2017, 25, 1207–1221. [Google Scholar] [CrossRef]
  35. Li, D.; Chi, Z.; Wang, B.; Wang, Z.; Yang, H.; Du, W. Entropy-based hybrid sampling ensemble learning for imbalanced data. Int. J. Intell. Syst. 2021, 36, 3039–3067. [Google Scholar]
  36. Bai, Z.; Huang, G.B.; Wang, D.; Wang, H.; Westover, M.B. Sparse extreme learning machine for classification. IEEE Trans. Cybern. 2014, 44, 1858–1870. [Google Scholar] [CrossRef]
  37. Fan, B.; Lu, X.; Li, H.X. Probabilistic inference-based least squares support vector machine for modeling under noisy environment. IEEE Trans. Syst. Man Cybern. Syst. 2016, 46, 1703–1710. [Google Scholar] [CrossRef]
  38. Razzak, I.; Blumenstein, M.; Xu, G. Multiclass support matrix machines by maximizing the inter-class margin for single trial EEG classification. IEEE Trans. Neural Syst. Rehabil. Eng. 2019, 27, 1117–1127. [Google Scholar] [CrossRef]
  39. Sun, S.; Dong, Z.; Zhao, J. Conditional random fields for multiview sequential data modeling. IEEE Trans. Neural Netw. Learn. Syst. 2022, 33, 1242–1253. [Google Scholar] [CrossRef] [PubMed]
  40. Bache, K.; Lichman, M. UCI Machine Learning Repository; University California, School of Information and Computer Science: Irvine, CA, USA, 2013; Available online: http://archive.ics.uci.edu/ml (accessed on 28 July 2024).
  41. Derrac, J.; Garcia, S.; Sanchez, L.; Herrera, F. Keel data-mining software tool: Data set repository, integration of algorithms and experimental analysis framework. J. Mult.-Valued Log. Soft Comput. 2011, 17, 255–287. [Google Scholar]
  42. Li, Y.; Zhu, Q.; Liu, Z. Deep Learning for Image Reconstruction in Electrical Tomography: A Review. IEEE Sens. J. 2025, 25, 14522–14538. [Google Scholar] [CrossRef]
Figure 1. Autoencoder network.
Figure 2. The hierarchical architecture of the proposed PH-E-DNN.
Figure 3. Regional knowledge augmentation in hierarchical structure.
Table 1. Dataset description.

Dataset           | Samples | Features | Classes
Winequality (WIN) | 4898    | 11       | 7
Waveform3 (WAV)   | 5000    | 21       | 3
Pageblock (PAG)   | 5472    | 10       | 5
Optdigits (OPT)   | 5620    | 64       | 10
Satimage (SAT)    | 6435    | 36       | 6
Pendigits (PEND)  | 7494    | 16       | 10
Mushroom (MUS)    | 8124    | 21       | 2
Penbased (PENB)   | 10,992  | 16       | 10
Nursery (NUR)     | 12,960  | 8        | 5
Magic (MAG)       | 19,020  | 10       | 2
Letter (LET)      | 20,000  | 16       | 26
Adult (ADU)       | 48,841  | 14       | 2
Shuttle (SHU)     | 57,999  | 10       | 7
Connect-4 (CON)   | 67,557  | 42       | 3
ARWS (ARWS)       | 75,128  | 8        | 4
HARS (HARS)       | 10,299  | 561      | 6
Table 2. Parameters of basic building block.

Dataset | Hidden Units | Epochs | Batchsize
WIN     | 20 | 50  | 10
WAV     | 30 | 30  | 10
PAG     | 30 | 50  | 10
OPT     | 30 | 30  | 10
SAT     | 30 | 50  | 10
PEND    | 30 | 50  | 10
MUS     | 30 | 50  | 20
PENB    | 30 | 50  | 10
NUR     | 20 | 100 | 10
MAG     | 30 | 60  | 20
LET     | 20 | 50  | 10
ADU     | 40 | 40  | 20
SHU     | 10 | 50  | 100
CON     | 50 | 50  | 60
ARWS    | 20 | 50  | 100
HARS    | 50 | 50  | 60
Table 3. The recognition results of DNN (the unit of training/testing time is second).

Dataset | Training Accuracy | Testing Accuracy | Training Time | Testing Time
WIN  | 0.5415 (0.0070) | 0.5358 (0.0152) | 2.5350 (0.1092)  | 0.0012 (0.0005)
WAV  | 0.8681 (0.0068) | 0.8605 (0.0103) | 1.9949 (0.0561)  | 0.0012 (0.0005)
PAG  | 0.9519 (0.0045) | 0.9475 (0.0069) | 3.3017 (0.0481)  | 0.0105 (0.0416)
OPT  | 0.9905 (0.0012) | 0.9782 (0.0059) | 2.8216 (0.0186)  | 0.0018 (0.0004)
SAT  | 0.8937 (0.0046) | 0.8832 (0.0109) | 4.8640 (0.1638)  | 0.0020 (0.0009)
PEND | 0.9686 (0.0169) | 0.9626 (0.0188) | 4.5772 (0.0290)  | 0.0015 (0.0005)
MUS  | 0.9986 (0.0026) | 0.9983 (0.0033) | 5.3348 (0.2391)  | 0.0017 (0.0007)
PENB | 0.9573 (0.0130) | 0.9546 (0.0135) | 6.6304 (0.0811)  | 0.0021 (0.0005)
NUR  | 0.9708 (0.0036) | 0.9681 (0.0051) | 12.2063 (0.0828) | 0.0012 (0.0002)
MAG  | 0.8561 (0.0052) | 0.8536 (0.0089) | 7.2087 (0.0790)  | 0.0022 (0.0005)
LET  | 0.7933 (0.0080) | 0.7886 (0.0100) | 11.4005 (0.0992) | 0.0030 (0.0004)
ADU  | 0.8537 (0.0027) | 0.8515 (0.0029) | 14.5649 (0.4329) | 0.0079 (0.0010)
SHU  | 0.9720 (0.0154) | 0.9718 (0.0150) | 5.9225 (0.1595)  | 0.0036 (0.0007)
CON  | 0.7763 (0.0077) | 0.7691 (0.0063) | 26.0889 (0.9251) | 0.0138 (0.0011)
ARWS | 0.9458 (0.0190) | 0.9451 (0.0185) | 6.2372 (0.0307)  | 0.0032 (0.0005)
HARS | 0.9843 (0.0057) | 0.9761 (0.0059) | 29.1670 (0.2035) | 0.0091 (0.0007)
Table 4. Time cost (seconds) in knowledge augmentation (L = 3).
Dataset | Knowledge Augmentation in Level 1 | Knowledge Augmentation in Level 2 | Knowledge Augmentation in Level 3
WIN | 1.31 (0.25) | 1.53 (0.23) | 1.62 (0.25)
WAV | 1.02 (0.20) | 1.07 (0.13) | 0.96 (0.09)
PAG | 1.82 (0.40) | 1.99 (0.41) | 2.46 (0.39)
OPT | 1.68 (0.25) | 1.61 (0.12) | 1.75 (0.12)
SAT | 2.30 (0.18) | 2.33 (0.04) | 2.42 (0.13)
PEND | 2.07 (0.37) | 2.21 (0.37) | 2.60 (0.18)
MUS | 2.54 (0.18) | 2.43 (0.06) | 2.96 (0.12)
PENB | 3.10 (0.50) | 3.91 (0.36) | 4.12 (0.12)
NUR | 6.84 (0.22) | 4.80 (0.14) | 4.84 (0.15)
MAG | 4.42 (0.45) | 4.70 (1.10) | 5.16 (1.10)
LET | 5.41 (0.50) | 5.25 (0.36) | 5.63 (0.32)
ADU | 8.71 (0.22) | 9.16 (0.61) | 9.42 (0.76)
SHU | 4.91 (0.45) | 6.00 (0.52) | 6.39 (0.48)
CON | 16.36 (0.50) | 17.12 (1.56) | 16.78 (0.53)
ARWS | 5.03 (0.70) | 4.50 (0.52) | 4.58 (0.59)
HARS | 25.04 (1.71) | 24.37 (1.24) | 24.73 (2.08)
Table 5. Time cost (seconds) in knowledge augmentation (L = 5).
Dataset | Knowledge Augmentation in Level 1 | Knowledge Augmentation in Level 2 | Knowledge Augmentation in Level 3
WIN | 1.05 (0.36) | 1.07 (0.14) | 1.13 (0.13)
WAV | 0.95 (0.28) | 0.81 (0.10) | 0.88 (0.11)
PAG | 1.80 (0.34) | 1.73 (0.13) | 1.71 (0.13)
OPT | 1.64 (0.25) | 1.61 (0.23) | 1.65 (0.25)
SAT | 1.84 (0.40) | 1.73 (0.21) | 1.68 (0.22)
PEND | 1.70 (0.25) | 1.69 (0.21) | 1.64 (0.17)
MUS | 1.64 (0.30) | 1.76 (0.12) | 1.80 (0.20)
PENB | 2.52 (0.28) | 2.40 (0.18) | 2.42 (0.17)
NUR | 4.20 (0.32) | 5.07 (0.19) | 5.21 (0.16)
MAG | 3.35 (0.55) | 3.33 (0.45) | 3.48 (0.50)
LET | 4.05 (0.36) | 4.35 (0.40) | 4.03 (0.47)
ADU | 9.28 (1.41) | 7.60 (0.92) | 7.76 (1.17)
SHU | 4.46 (0.63) | 5.15 (1.08) | 5.99 (0.77)
CON | 17.36 (1.01) | 18.90 (1.97) | 19.31 (1.84)
ARWS | 3.85 (0.65) | 4.37 (0.84) | 4.38 (0.50)
HARS | 23.85 (2.47) | 23.65 (2.12) | 24.59 (3.98)
Table 6. Average training and testing accuracy of PH-E-DNN with different numbers of modules.
Dataset | L = 3 Training | L = 3 Testing | L = 5 Training | L = 5 Testing
WIN | 0.5545 (0.0059) | 0.5439 (0.0142) | 0.5550 (0.0047) | 0.5493 (0.0140)
WAV | 0.8749 (0.0037) | 0.8674 (0.0082) | 0.8730 (0.0031) | 0.8690 (0.0107)
PAG | 0.9516 (0.0031) | 0.9504 (0.0063) | 0.9513 (0.0013) | 0.9508 (0.0076)
OPT | 0.9931 (0.0011) | 0.9883 (0.0033) | 0.9928 (0.0010) | 0.9861 (0.0023)
SAT | 0.8954 (0.0043) | 0.8894 (0.0084) | 0.8975 (0.0035) | 0.8887 (0.0092)
PEND | 0.9812 (0.0162) | 0.9797 (0.0160) | 0.9778 (0.0137) | 0.9763 (0.0148)
MUS | 0.9999 (0.0002) | 0.9998 (0.0003) | 1 (0) | 1 (0.0001)
PENB | 0.9807 (0.0179) | 0.9785 (0.0191) | 0.9837 (0.0128) | 0.9825 (0.0135)
NUR | 0.9893 (0.0035) | 0.9883 (0.0039) | 0.9863 (0.0032) | 0.9854 (0.0032)
MAG | 0.8602 (0.0026) | 0.8570 (0.0039) | 0.8612 (0.0018) | 0.8587 (0.0051)
LET | 0.8282 (0.0043) | 0.8242 (0.0073) | 0.8356 (0.0047) | 0.8329 (0.0056)
ADU | 0.8557 (0.0015) | 0.8528 (0.0027) | 0.8564 (0.0011) | 0.8534 (0.0034)
SHU | 0.9807 (0.0045) | 0.9803 (0.0049) | 0.9814 (0.0047) | 0.9812 (0.0044)
CON | 0.7836 (0.0025) | 0.7782 (0.0016) | 0.7857 (0.0031) | 0.7820 (0.0061)
ARWS | 0.9573 (0.0056) | 0.9572 (0.0063) | 0.9539 (0.0036) | 0.9534 (0.0037)
HARS | 0.9891 (0.0008) | 0.9879 (0.0016) | 0.9881 (0.0022) | 0.9840 (0.0025)
The best results are highlighted in bold.
Table 7. Average testing accuracy in various schemes.
Dataset | Adaboost | Bagging | SVM | SAE | SAE2 | PH-DNN | E-DNN | DBN | PH-E-DBN | PH-E-DNN
WIN | 0.4609 (0.0140) | 0.5212 (0.0104) | 0.5343 (0.0127) | 0.5358 (0.0152) | 0.5288 (0.0196) | 0.5411 (0.0217) | 0.5316 (0.0218) | 0.5330 (0.0189) | 0.5387 (0.0139) | 0.5493 (0.0140)
WAV | 0.8153 (0.0122) | 0.8580 (0.0092) | 0.8643 (0.0100) | 0.8605 (0.0103) | 0.8591 (0.0125) | 0.8634 (0.0111) | 0.8646 (0.0116) | 0.8623 (0.0095) | 0.8648 (0.0070) | 0.8690 (0.0107)
PAG | 0.9350 (0.0031) | 0.9459 (0.0067) | 0.9537 (0.0063) | 0.9475 (0.0069) | 0.9420 (0.0169) | 0.9490 (0.0078) | 0.9497 (0.0070) | 0.9516 (0.0078) | 0.9506 (0.0045) | 0.9508 (0.0076)
OPT | 0.7222 (0.0201) | 0.9519 (0.0061) | 0.9762 (0.0046) | 0.9782 (0.0059) | 0.9831 (0.0036) | 0.9793 (0.0056) | 0.9818 (0.0039) | 0.9815 (0.0042) | 0.9900 (0.0032) | 0.9883 (0.0033)
SAT | 0.7929 (0.0094) | 0.8414 (0.0085) | 0.8673 (0.0063) | 0.8832 (0.0109) | 0.8760 (0.0122) | 0.8800 (0.0097) | 0.8867 (0.0081) | 0.8742 (0.0281) | 0.8866 (0.0124) | 0.8894 (0.0084)
PEND | 0.6827 (0.0125) | 0.8874 (0.0073) | 0.9872 (0.0027) | 0.9626 (0.0188) | 0.9645 (0.0169) | 0.9885 (0.0032) | 0.9662 (0.0145) | 0.9406 (0.0110) | 0.9660 (0.0175) | 0.9797 (0.0160)
MUS | 0.9986 (0.0010) | 0.9319 (0.0070) | 1 (0) | 0.9983 (0.0033) | 0.9988 (0.0029) | 0.9989 (0.0014) | 0.9996 (0.0017) | 0.9994 (0.0014) | 0.9997 (0.0005) | 1 (0.0001)
PENB | 0.6888 (0.0108) | 0.8794 (0.0043) | 0.9759 (0.0025) | 0.9546 (0.0135) | 0.9600 (0.0156) | 0.9862 (0.0047) | 0.9641 (0.0155) | 0.9470 (0.0106) | 0.9519 (0.0108) | 0.9825 (0.0135)
NUR | 0.8278 (0.0050) | 0.7956 (0.0053) | 0.9853 (0.0022) | 0.9681 (0.0051) | 0.9571 (0.0147) | 0.9865 (0.0049) | 0.9702 (0.0043) | 0.9651 (0.0054) | 0.9689 (0.0053) | 0.9883 (0.0039)
MAG | 0.7547 (0.0051) | 0.7834 (0.0060) | 0.8322 (0.0053) | 0.8536 (0.0089) | 0.8438 (0.0115) | 0.8551 (0.0055) | 0.8547 (0.0139) | 0.8527 (0.0053) | 0.8533 (0.0065) | 0.8587 (0.0051)
LET | 0.4590 (0.0061) | 0.7009 (0.0076) | 0.8203 (0.0080) | 0.7886 (0.0100) | 0.8024 (0.0138) | 0.8234 (0.0082) | 0.7981 (0.0068) | 0.8193 (0.0060) | 0.8556 (0.0083) | 0.8329 (0.0056)
ADU | 0.8319 (0.0017) | 0.8318 (0.0033) | 0.8352 (0.0028) | 0.8515 (0.0029) | 0.8453 (0.0042) | 0.8512 (0.0068) | 0.8518 (0.0033) | 0.8488 (0.0046) | 0.8505 (0.0035) | 0.8534 (0.0034)
SHU | 0.9035 (0.0027) | 0.9436 (0.0026) | 0.9747 (0.0012) | 0.9718 (0.0150) | 0.9709 (0.0117) | 0.9766 (0.0016) | 0.9737 (0.0037) | 0.9882 (0.0089) | 0.9855 (0.0043) | 0.9812 (0.0044)
CON | 0.6596 (0.0043) | 0.6607 (0.0049) | 0.7477 (0.0040) | 0.7691 (0.0063) | 0.7652 (0.0069) | 0.7661 (0.0045) | 0.7797 (0.0054) | 0.7044 (0.0147) | 0.7262 (0.0065) | 0.7820 (0.0061)
ARWS | 0.8973 (0.0028) | 0.9091 (0.0018) | 0.9704 (0.0015) | 0.9451 (0.0185) | 0.9659 (0.0064) | 0.9600 (0.0044) | 0.9534 (0.0072) | 0.9709 (0.0013) | 0.9693 (0.0024) | 0.9572 (0.0063)
HARS | 0.5326 (0.0129) | 0.9810 (0.0023) | 0.9758 (0.0025) | 0.9761 (0.0059) | 0.9845 (0.0062) | 0.9797 (0.0046) | 0.9804 (0.0036) | 0.9633 (0.0471) | 0.9824 (0.0111) | 0.9879 (0.0016)
The best results are highlighted in bold.
Table 8. Average F-measure in various schemes.
Dataset | Adaboost | Bagging | SVM | SAE | SAE2 | PH-DNN | E-DNN | DBN | PH-E-DBN | PH-E-DNN
WIN | 0.3518 (0.0700) | 0.5200 (0.0202) | 0.4296 (0.0124) | 0.4977 (0.0104) | 0.5122 (0.0075) | 0.5136 (0.0199) | 0.5108 (0.0101) | 0.4936 (0.0155) | 0.5166 (0.0200) | 0.5250 (0.0249)
WAV | 0.8055 (0.0163) | 0.8170 (0.0136) | 0.8663 (0.0213) | 0.8591 (0.0167) | 0.8600 (0.0149) | 0.8621 (0.0019) | 0.8695 (0.0142) | 0.8640 (0.0085) | 0.8681 (0.0053) | 0.8731 (0.0114)
PAG | 0.9163 (0.0055) | 0.9574 (0.0056) | 0.9003 (0.0122) | 0.9355 (0.0082) | 0.9555 (0.0020) | 0.9338 (0.0111) | 0.8695 (0.0142) | 0.9343 (0.0059) | 0.9511 (0.0007) | 0.9443 (0.0076)
OPT | 0.7076 (0.0109) | 0.9604 (0.0036) | 0.9849 (0.0039) | 0.9803 (0.0023) | 0.9861 (0.0021) | 0.9774 (0.0054) | 0.9755 (0.0037) | 0.9833 (0.0031) | 0.9836 (0.0045) | 0.9890 (0.0026)
SAT | 0.7859 (0.0106) | 0.8743 (0.0107) | 0.8697 (0.0052) | 0.8720 (0.0147) | 0.8850 (0.0172) | 0.8814 (0.0132) | 0.8736 (0.0179) | 0.8670 (0.0225) | 0.8713 (0.0080) | 0.8861 (0.0089)
PEND | 0.6618 (0.0111) | 0.9635 (0.0051) | 0.9692 (0.0032) | 0.9599 (0.0166) | 0.9677 (0.0216) | 0.9877 (0.0024) | 0.9531 (0.0030) | 0.9339 (0.0065) | 0.9483 (0.0082) | 0.9778 (0.0200)
MUS | 0.9989 (0.0008) | 0.9986 (0.0015) | 0.9647 (0.0034) | 0.9999 (0.0003) | 1 (0) | 0.9991 (0.0008) | 1 (0) | 0.9996 (0.0008) | 1 (0) | 1 (0)
PENB | 0.6618 (0.0083) | 0.9664 (0.0024) | 0.9743 (0.0057) | 0.9648 (0.0336) | 0.9933 (0.0028) | 0.9801 (0.0066) | 0.9636 (0.0164) | 0.9457 (0.0050) | 0.9405 (0.0025) | 0.9879 (0.0082)
NUR | 0.8168 (0.0072) | 0.9547 (0.0006) | 0.8328 (0.0096) | 0.9604 (0.0022) | 0.9622 (0.0024) | 0.9872 (0.0017) | 0.9608 (0.0038) | 0.9541 (0.0058) | 0.9468 (0.0124) | 0.9896 (0.0040)
MAG | 0.8398 (0.0058) | 0.8499 (0.0030) | 0.8100 (0.0042) | 0.8462 (0.0149) | 0.8490 (0.0092) | 0.8521 (0.0040) | 0.8576 (0.0014) | 0.8483 (0.0074) | 0.8442 (0.0038) | 0.8589 (0.0057)
LET | 0.1255 (0.0192) | 0.8138 (0.0038) | 0.8253 (0.0077) | 0.7877 (0.0105) | 0.8237 (0.0101) | 0.8246 (0.0071) | 0.8096 (0.0092) | 0.8158 (0.0040) | 0.8584 (0.0065) | 0.8272 (0.0073)
ADU | 0.8497 (0.0042) | 0.8285 (0.0027) | 0.8337 (0.0042) | 0.8439 (0.0054) | 0.8479 (0.0082) | 0.8436 (0.0036) | 0.8447 (0.0041) | 0.8466 (0.0037) | 0.8434 (0.0005) | 0.8487 (0.0043)
SHU | 0.9930 (0.0008) | 0.9982 (0.0002) | 0.9662 (0.0012) | 0.9710 (0.0050) | 0.9709 (0.0070) | 0.9749 (0.0074) | 0.9749 (0.0062) | 0.9843 (0.0072) | 0.9916 (0.0034) | 0.9763 (0.0086)
CON | 0.5203 (0.0029) | 0.6993 (0.0032) | 0.7298 (0.0026) | 0.7283 (0.0125) | 0.7299 (0.0026) | 0.7121 (0.0044) | 0.7367 (0.0059) | 0.6921 (0.0174) | 0.6798 (0.0030) | 0.7403 (0.0048)
ARWS | 0.8564 (0.0047) | 0.9828 (0.0017) | 0.8812 (0.0029) | 0.9657 (0.0022) | 0.9660 (0.0014) | 0.9620 (0.0033) | 0.9688 (0.0019) | 0.9683 (0.0017) | 0.9685 (0.0019) | 0.9636 (0.0016)
HARS | 0.4052 (0.0053) | 0.8739 (0.0060) | 0.9786 (0.0034) | 0.9811 (0.0032) | 0.9812 (0.0029) | 0.9767 (0.0046) | 0.9845 (0.0029) | 0.9720 (0.0195) | 0.9783 (0.0028) | 0.9854 (0.0036)
The best results are highlighted in bold.
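As a reference for how entries like those in Table 8 can be computed from predictions, the snippet below uses scikit-learn's f1_score. Treating the multi-class F-measure as the unweighted (macro) mean of per-class F1 scores is an assumption made here for illustration; the labels are dummy values.

# Illustrative only: a macro-averaged F-measure via scikit-learn, assuming
# the multi-class F-measure is the unweighted mean of per-class F1 scores.
from sklearn.metrics import f1_score

y_true = [0, 0, 1, 1, 2, 2]   # dummy ground-truth labels
y_pred = [0, 1, 1, 1, 2, 0]   # dummy predictions
print(f1_score(y_true, y_pred, average="macro"))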
Table 9. Recognition results of deep methods on MNIST and Fashion-MNIST.
Dataset | Metrics | CNN | DBN | SAE | SAE2 | PH-E-DNN
MNIST | Accuracy | 0.9671 | 0.9748 | 0.9744 | 0.9794 | 0.9849
MNIST | Time | 2282.22521.90688.991178.061260.03
Fashion-MNIST | Accuracy | 0.8704 | 0.8775 | 0.8938 | 0.8963 | 0.9378
Fashion-MNIST | Time | 5849.53726.5254713.5071266.531293.38
The best results are highlighted in bold.
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
