Evolving Deep DenseBlock Architecture Ensembles for Image Classiﬁcation

: Automatic deep architecture generation is a challenging task, owing to the large number of controlling parameters inherent in the construction of deep networks. The combination of these parameters leads to the creation of large, complex search spaces that are feasibly impossible to properly navigate without a huge amount of resources for parallelisation. To deal with such challenges, in this research we propose a Swarm Optimised DenseBlock Architecture Ensemble (SODBAE) method, a joint optimisation and training process that explores a constrained search space over a skeleton DenseBlock Convolutional Neural Network (CNN) architecture. Specifically, we employ novel weight inheritance learning mechanisms, a DenseBlock skeleton architecture, as well as adaptive Particle Swarm Optimisation (PSO) with cosine search coefficients to devise networks whilst maintaining practical computational costs. Moreover, the architecture design takes advantage of recent advancements of the concepts of residual connections and dense connectivity, in order to yield CNN models with a much wider variety of structural variations. The proposed weight inheritance learning schemes perform joint optimisation and training of the architectures to reduce the computational costs. Being evaluated using the CIFAR-10 dataset, the proposed model shows great superiority in classification performance over other state-of-the-art methods while illustrating a greater versatility in architecture generation.


Introduction
Automatic deep architecture design is a growing practice within the domain of deep learning.The rapid pace of deep learning research has brought continued advancements in tools and methods for network generation.Many architecture generation/optimisation studies [1][2][3][4][5] focus on attempting to design deep networks from primitive building blocks, allowing for their methods to generate a huge variety of potential architectures.Such strategies are interesting but create search spaces that are enormous and feasibly impossible to properly navigate without a huge amount of resources for parallelisation.In this research, we propose novel weight inheritance training schemes, a DenseBlock skeleton architecture, in conjunction with adaptive Particle Swarm Optimisation (PSO) with cosine search trajectories, in order to improve efficiency while generating diverse architectures.
Specifically, in this research, we propose a Swarm Optimised DenseBlock Architecture Ensemble (SODBAE) method, a joint optimisation and training process that explores a constrained search space over a skeleton DenseBlock Convolutional Neural Network (CNN) architecture.We conduct the architecture search while using a PSO variant with adaptive cosine search coefficients to balance between exploration and exploitation.Importantly, in order to significantly improve computational efficiency, two weight inheritance learning schemes, i.e., Block Lookup Size Key-Name and Size (BLSK-NS), and Block Lookup Size Key-Size (BLSK-S), are proposed for joint optimisation and training of the CNN architectures.Precisely, the individual particles are jointly optimised and trained while using the above proposed continual training strategies that operate on the individual layers of the network.These weight inheritance schemes are subsequently applied to any CNN architecture devised by the swarm particles and utilise combinations of the name of the layer and the size of its parameter matrix to construct search keys for the centralised lookup tables.Following the aforementioned optimisation and training process, ensemble classification models are constructed from local best positions, as well as the distinctive particles from the final iteration in order to take advantage of the diversity of the jointly trained models.
The research contributions are as follows.
1.A deep ensemble architecture optimisation method is proposed for devising ensembles of deep neural networks while using dense connectivity principles.2. A DenseBlock-based skeleton architecture is proposed for effective optimisation.3. Two weight inheritance strategies, i.e., the BLSK-NS and BLSK-S methods, are proposed for the joint optimisation and training of complex CNN architectures.4. A comprehensive evaluation of the devised ensemble and base CNN models using various weight inheritance configurations on the CIFAR-10 dataset is provided.
The proposed SODBAE model provides an automatic approach for densely connected CNN model generation.Additionally, the BLSK-NS and BLSK-S weight inheritance methods provide continual training to any selection of CNN networks by functioning over the individual layers of a network.The proposed approach is able to devise sophisticated deep networks with versatile architectures while maintaining a reasonable trade-off between performance and computational efficiency.

Related Work
Automatic CNN architecture generation/optimisation and hyper-parameter fine-tuning have drawn great attention in computer vision [6].In parallel, evolutionary algorithms, such as PSO, show significant search capabilities in solving a variety of optimization problems.Proposed by Kennedy and Eberhart [7], the PSO algorithm employs the personal and global best experiences to guide the swarm to search for global optimality.Equations ( 1) and (2) define the velocity and position updating operations of the original PSO algorithm.
where v t+1 i and x t+1 i represent the velocity and position of particle i in the (t + 1)-th iteration, with w representing the inertia weight.The personal best solution of the particle i is denoted as pbest i , while the global best solution is represented as gbest.The acceleration coefficients c 1 and c 2 determine the focus of the search process between the cognitive and social components, with r 1 and r 2 representing the random search parameters.The acceleration coefficients c 1 and c 2 are assigned as constant values in the original PSO algorithm.As indicated in Equations ( 1) and (2), the particle velocity and position updating operations are guided by the personal and global best solutions to move the swarm towards global optimality.
The PSO algorithm has been widely adopted for optimal architecture generation and hyper-parameter fine-tuning applications.As an example, Yamasaki et al. [8] conducted optimal hyper-parameter selection with respect to AlexNet while using the PSO algorithm.Their work optimised several key configuration settings such as the kernel sizes, padding, the number of filters, and types of pooling, for optimal network identification.The ranking correlation and volatility strategies were proposed in order to optimise computational efficiency at the training stage.The PSO-optimised AlexNet model obtained accuracy rates of 80.15% and 53.28% for the CIFAR-10 and CIFAR-100 datasets, respectively.Domhan et al. [9] proposed strategies to accelerate optimal hyper-parameter identification for deep networks.A probabilistic learning curve mechanism was proposed in order eliminate learning options that were very unlikely to result in optimal performances in early training stages.A comprehensive evaluation was conducted for the optimisation of a number of learning configurations for both small and large networks, which included network hyper-parameters (such as initial learning rate, learning rate decay, momentum, weight decay, dropout rate, and the number of pooling layers), as well as convolutional layer hyper-parameters (such as the number of units, weight filler type (Gaussian and Xavier), and weight learning rate multiplier).Evaluated with the CIFAR-10 dataset, the optimised small and large CNN models obtained error rates of 17.2% and 8.81%, respectively, for image classification tasks.Lievski et al. [10] proposed a mixed integer deterministic-surrogate optimisation algorithm, namely Hyper-parameter Optimisation using radial basis function (RBF) and Dynamic coordinate search (HORD), for hyper-parameter identification in deep networks.Specifically, the RBF interpolation model was employed as the error surrogates for optimal hyper-parameter search.Besides that, the new offspring solutions were yielded in the close neighbourhood of the current global best solution in each iteration to significantly reduce the number of evaluations.The model showed superior performances for both low and high dimensional hyper-parameter optimisation problems when evaluated with four deep neural networks with six, eight, 15, and 19 hyper-parameters, respectively.Their work obtained an error rate of 20.54% for the CIFAR-10 dataset by optimising a CNN model with 19 hyper-parameters.
Albelwi and Mahmood [11] proposed the Nelder-Mead Method (NMM) in conjunction with a newly proposed objective function based on the error rate and the quality of the feature visualization using deconvolutional networks for deep architecture optimisation.Their objective function was able to reduce the error rate and improve the visualization results of the feature maps simultaneously.Because reconstructing images using feature activations was an effective way to indicate model efficiency, their work subsequently generated a correlation coefficient score between a pair of the original input and the reconstructed images for quality measurement.The hyper-parameters, such as the depths of convolutional and fully-connected layers, the number of filters, the kernel sizes, the number of pooling layers, pooling region sizes, and the number of neurons in fully-connected layers were optimised.Evaluated while using the CIFAR-10 dataset, their work achieved an error rate of 10.05% for image classification.Tan et al. [3] proposed a PSO variant for deep CNN architecture generation for skin cancer classification.The search of the optimal network topology was conducted based on an initial skeleton network with distinctive block architectures.The optimisation objective was to identify optimal training hyper-parameters (e.g., learning rate and weight decay), and the numbers of distinctive convolutional blocks with differentiated feature map sizes.Their PSO variant hybridized with crossover, mutation, and spiral search operations as well as attraction actions guided by the mean and randomly selected positions of the neighbouring elite solutions.Besides the deep architecture generation, the model was also used to enhance the lesion-skin segmentation performance by further enhancing the cluster centroids identified by the K-Means clustering.Evaluated with several well-known skin lesion datasets, such as ISIC 2017, the model outperformed a number of advanced PSO algorithms [12][13][14] for architecture generation and foreground-background segmentation for melanoma classification.Sun et al. [15] proposed automatic PSO-based architecture generation based on the proposal of flexible convolutional autoencoders, while Liang [16] performed the optimisation of fully connected (FC) layers in CNN models and Liu et al. [17] employed multi-objective optimisation for the design of optimal layer structures in deep networks.
Lu et al. [18] devised deep networks while using the Bayesian optimisation algorithm for an electrochemical drilling process.First, a network architecture consisting of a convolutional layer followed by three dense layers was determined based on manual trial-and-error.The Bayesian algorithm optimised the following CNN network hyper-parameters, i.e., the training-validation split, the dropout ratio, the numbers of neurons in the three dense layers, and the training iterations.Random dropout layers were also implanted to the generated networks in order to avoid overfitting.The Bayesian-optimised CNN model outperformed a physics embedded neural network and a Support Vector Regression (SVR) model for the prediction of the diameters of the drilled holes in electrochemical drilling tasks.Zhang et al. [19] proposed ensemble transfer learning models for optic disc (OD) segmentation with PSO-based hyper-parameter identification.Mask R-CNN was employed as the base evaluator.A new PSO algorithm was proposed in order ro optimise the hyper-parameters of each base segmenter.It comprised a number of mechanisms, such as accelerated and refined search actions with versatile super-ellipse trajectories, a super-formula oriented leader enhancement, as well as mean and random leader-guided search operations, to mitigate early stagnation.In order to further test model efficiency, the PSO algorithm was also evaluated using benchmark test functions, as well as hyper-parameter fine-tuning of the VGG16 model for a diagnostic problem, i.e., the detection of Diabetic Macular Edema (DME).Evaluated using Messidor and Drions datasets, their PSO-based ensemble transfer learning models showed superior performance over those yielded while using classical and advanced search methods, as well as state-of-the-art related studies based on handcrafted features, for OD segmentation and DME classification.Junior and Yen [5] devised deep architectures based on a modified PSO operation where each particle with a variable length was used to denote a potential deep CNN topology.New operators for the calculation of position difference and velocity updating were proposed to facilitate the optimal deep architecture search.Their optimised CNN networks achieved superior performance for MNIST and a number of its variant datasets with respect to image classification tasks.

Search Space Design
Deep architecture generation is a challenging task, since related studies [1,2,4,20] tend to create search spaces that are immense and feasibly impossible to properly navigate without a huge amount of resources for parallelisation.In this research, we approach the problem differently by proposing novel weight inheritance schemes, a DenseBlock skeleton architecture, as well as an adaptive cosine oriented PSO search mechanism to balance between devising versatile architectures and maintaining optimal training costs.In addition, we devise the skeleton architecture by considering more recent developments in the world of CNN design.Specifically, we employ the concept of residual or skip connections [21], as well as the concept of dense connectivity [22], in order to create architectures from a set of parameters with a much wider variety of structural variations.Using our DenseBlock skeleton architecture, we obtain a search process to include a total of K × D i−1 potential architectures, where K represents the denormalised range of the growth rate (k), while D represents the denormalised range of the depth (number of internal layers) of each dense block, and i represents the number of dense blocks.Such a potentially large search space provides a vastly more difficult task, albeit with the benefit of greater oautonomyn the part of the optimisation process.To significantly reduce computational costs and effectively explore the search space, two weight inheritance strategies, i.e., the BLSK-NS and BLSK-S methods, are proposed for the joint optimisation and training of the CNN architectures.These weight inheritance strategies purely rely on the parameters of the individual layers in the CNN architecture, meaning they can be used with any CNN construction method, rather than being tied to the optimisation process itself.Two ensemble models are subsequently built based on the optimised deep networks with dense connectivity to further enhance performance.We introduce each key proposed strategy comprehensively below.

Model Construction
Similar to the VGG networks, a typical DenseNet CNN model for image classification is constructed as a hierarchical graph of convolutional, batch normalisation, ReLU activation, and average pooling layers, followed by a final fully connected classification layer and a Softmax output layer.However, whilst the VGG models attach the output of a layer only with the next contiguous layer, DenseNets build up a 'global state' by connecting the output of a layer with all of the subsequent layers.This global state takes the form of a stack of feature maps, where the output of a layer is simply concatenated onto the stack following its processing of the previous global state.Constructing the network in this way allows the features from earlier in the network to persist for longer in the network hierarchy, rather than being repeatedly reprocessed by the intermediate layers and losing entropy.
Moreover, in the proposed DenseBlock skeleton architecture, as many existing studies [3,5,15], we employ the concept of blocks separated by spatial downsampling layers, in order to reduce the spatial size of the feature maps and increase the receptive field size of the subsequent layers.However, we achieve this by using the DenseNet concept of the 'growth rate' (k), which defines the number of feature maps that each new layer can concatenate onto the global state matrix.The growth rate is also used to determine the maximum size of the global state, as an excessive growth rate can result in drastically reduced performance.In our experiments, we set the maximum size of the global state as 4k, based on the recommendation of [22].This maximum size is maintained through the use of BottleNeck Layers consisting of a batch normalisation layer, followed by a ReLU activation and a 1 × 1 convolution mapping the number of input features to the maximum global state size.A visualisation of a BottleNeck Layer can be seen in Figure 1.As compared with related studies, including a BottleNeck layer before every convolution creates a fundamental change in structure, resulting in a definition of Processing Layers that are stacked within the blocks of our resulting architectures.Figure 2 depicts the Processing Layer definition, which consists of a sequence of actual component layers, i.e., batch normalisation, ReLU Activation, 1 × 1 convolution, batch normalisation, ReLU Activation, and finally a 3 × 3 convolution to form the bulk of the processing.
One effect of the number of feature maps produced and processed by the Processing Layers is that the parameter matrices do not grow to enormous sizes.This allows for us to construct much deeper networks by stacking many more Processing Layers in each block, without running out of GPU memory in the process.Due to the dense connections that are inherent in the SODBAE skeleton architecture, we also do not suffer from the vanishing entropy problem commonly seen when drastically increasing the depth of a CNN.Moreover, the optimization process enables each processing layer to always add the identified optimal (e.g., k) number of feature maps to the global state.We do not enforce any constraints when adding the number of feature maps that are recommended by the optimization process.The maximum number of feature maps in the global state is maintained by the bottleneck layers, as illustrated in Figure 1.Each bottleneck layer uses a 1 × 1 convolution to map the number of input features to the maximum size of the global state, after adding each set of new feature maps from each processing layer.As previously mentioned, in order to periodically reduce the spatial size of the internal feature maps, and increase the receptive field size of later Processing Layers, we implement spatial downsampling between the blocks in the SODBAE skeleton architecture.The spatial downsampling is performed by Transition Layers, rather than simply a single average pooling layer as in many existing studies.Transition Layers, as seen in [22], consist of a batch normalisation layer, followed by a ReLU activation, a 1 × 1 convolution, and finally an average pooling layer in order to perform the actual spatial downsampling.A visualisation of a Transition Layer can be seen in Figure 3.In the proposed model, we construct a SODBAE skeleton architecture by defining a number of design choices as parameters to an objective function, where the objective function can then use the parameters with the skeleton architecture in order to construct a full CNN model.The objective function used by the SODBAE method can be seen in Algorithm 1.For the model initialisation process, the skeleton architecture design choices we choose to optimise include: Train(model) We use ranges of k = [1, . . ., 48] giving K = 48 for the growth rate and d = [0, . . ., 48] giving D = 49 for the depth of each dense block, with a total number of blocks of i = 4.This means that each dense block will consist of between 0 and 48 Processing Layers, with each Processing Layer adding between 1 and 48 feature maps to the global state, which will contain a range of 4 to 192 feature maps at a time.Being motivated by [22], we simply use a single fully connected layer, followed by a Softmax layer to perform the final predictions.As with the earlier reduction in feature maps, this again saves GPU memory, allowing for us to construct deeper models with more convolutional processing.
Table 1 shows the parameters of the DenseBlock skeleton architecture to be optimised, leading to the total distinct architecture variants of 276, 710, 448.The skeleton architecture of the resulting CNN is visualised in Figure 4, where the Processing Layer refers to an encompassing SODBAE Processing Layer, as depicted in Figure 2 and Transition Layer refers to the one, as depicted in Figure 3.The ellipses in the visualisations represent the repeating layers in the respective blocks, where the dense connections will continue between the repeated layers and all subsequent layers, potentially creating very deep, very densely connected architectures.It is important to note that, whilst the dense connections are shown throughout the network, in actual processing this is represented as a global state that passes linearly through the network, allowing for the spatial size of the feature maps inside to be downsampled by the Transition Layers.

Particle Swarm Optimisation
With the proposed DenseBlock skeleton architecture defined, we subsequently optimise both the growth rate and size of each block by encoding the parameters as dimensions in a swarm optimisation process.The PSO model is selected for deep network generation in this research, owing to its superior search capabilities in solving diverse optimisation problems [23][24][25][26].Specifically, the proposed SODBAE model utilises a PSO variant where the acceleration coefficients are implemented by adaptive cosine functions, as defined in (3), to optimise the DenseBlock architecture that is described above.According to Mirjalili et al. [27], the acceleration coefficients defined in (3) while using the cosine functions provide comparatively smoother transitions of the search focus between the personal and global best solutions, in comparison with linearly increasing and decreasing search parameters for the social and cognitive components.Such adaptive search parameters enable the model to focus on global exploration at the beginning of the search process to thoroughly explore the search space, and gradually switch to local exploitation towards the final iterations to increase intensification around the global best solution.Therefore, the proposed model is equipped with enhanced capabilities in overcoming local optima traps and finding global optimality.
where c 1 and c 2 represent the acceleration coefficients for the cognitive and social components, respectively, in the PSO algorithm, as defined in Equation ( 1), and t and T denote the current and the maximum numbers of iterations, respectively.Each element in each particle is used to represent a key optimised factor for deep architecture search.Specifically, the first dimension of the search space is used in order to optimise the growth rate (k) of the DenseBlock skeleton architecture, whilst the remaining dimensions are used to optimise the sizes of the DenseBlocks.We normalise all dimensions to a range of [0, . . ., 1] for the purposes of the optimisation process, although, during the optimisation function evaluation, the values are denormalised to [1, . . . , 48] for the growth rate and [0, . . ., 48] for the DenseBlock sizes.Our skeleton architecture uses a total of 4 DenseBlocks, which allows for us to use the minimisation as illustrated in (4).argmin The search space for the hyperparameters is represented, as follows.
We actually explore the search space using, where the parameters themselves are cast to integers following the denormalisation in the objective function and the growth rate (k) is clipped to the range [1, . . . , 48].We describe the process for fitness evaluation as follows.Each particle is initialised with a random position in the search space, where each element of each particle represents one of the optimised hyper-parameters (i.e., the growth rate or the depth of each of the four DenseBlocks).We define the search range of each dimension of each particle as [0, 1].The particles move continuously in the search space by following the personal and global best solutions.During the fitness evaluation, we convert or denormalise each element of each particle from a numerical value between [0, 1] to an integer value representing a valid network hyper-parameter.Subsequently, a CNN model is devised based on the recommended network configurations for each particle.The model is subsequently trained using the training set and evaluated with the validation set.The error rate of the validation set is used as the fitness score.The global best solution represents the identified most optimal network.

Continual Training Through Weight Inheritance
We propose the continual training strategies with weight inheritance in SODBAE in order to effectively explore the search space and improve efficiency.Continual training refers to the process of jointly training the final model that will be produced by the optimisation process, during the process itself.This is performed using a central weight inheritance table designed to work with any CNN to be jointly optimised.Specifically, we propose two methods of weight inheritance.These new strategies have the added benefit of being generalisable to any CNN generation process, or even any deep neural networks, as they rely purely on the parameters of the individual layers in the model.We introduce the two proposed weight inheritance mechanisms below.

BLSK-NS
The first proposed weight inheritance lookup approach is the Block Lookup Size Key-Name and Size (BLSK-NS) method.
Owing to the fact that the number of feature maps produced by a layer in a block is variable depending on the growth rate of the network, we design the central weight inheritance lookup table method to take into account the size of the parameter matrix for a layer (rather than a block), ensuring that layers will only inherit weights from the table that match the position and the parameter matrix size.This results in an inheritance process that works over any deep CNN model, including dense architectures, rather than a CNN model with specific block-construction, as seen in the existing studies [3,5].Pseudocode for the newly proposed weight storage and weight inheritance processes can be seen in Algorithms 2 and 3, respectively.
Algorithm 2 Pseudocode to store weights of a partially trained CNN model with fitness value using centralised lookup tables.The key for each layer is constructed starting with the name of the layer, which itself encodes the position of the layer in the overall architecture, e.g., features.denseblock3.denselayer35.conv1.weightrepresents the weights of the first convolution of the 35th layer of the third block in the feature extraction portion of the architecture.Next, the size of the parameter matrix is iteratively concatenated onto the name string.The size of a parameter matrix depends on the layer type, with the convolutional layer weight matrices being R C out C in HW , where C out represents the number of feature maps to be generated, C in represents the number of input feature maps, H represents the height of the convolutional kernel, and W represents the width of the kernel.Knowing this, we can treat the size of the matrix as simply a vector of non-negative integers N 4 , where each element is the size of the corresponding dimension in the weight matrix.This can be seen in Algorithm 4, where ψ represents the string name of the layer mentioned above and ω represents the vector of integer dimension sizes N n , where n is the number of dimensions.The behavioural result of this key construction method is for every single convolutional layer to only inherit weights from previous convolutional layers in their exact position in previous architectures, with their exact size.A significant downside of this central weight inheritance method is the specificity of individual layer positions and matrix sizes when constructing a DenseNet model, resulting in a large number of entries into our centralised lookup table.This can cause the lookup table to grow to occupy a large amount of memory during the optimisation process (around 15 GB in our experiments).The specificity of the layer position also results in a degradation in the effectiveness of weight sharing, as small changes in architecture can result in a knock-on effect whereby none of the previous parameters can be inherited, despite the similarity in the configuration of the layers.

BLSK-S
In order to alleviate some of the specificity and memory usage issues of the BLSK-NS weight inheritance method, we also propose another weight inheritance strategy, i.e., the Block Lookup Size Key-Size (BLSK-S) method.This method is conceptually similar to BLSK-NS but discards the name of the parameter in the key construction method, meaning the parameters are only stored alongside their sizes.Given the construction of the DenseNet skeleton architecture, this has the effect of ensuring that contiguous layers within an individual DenseBlock will all share identical parameters upon inheritance.Algorithm 5 shows the BLSK-S key construction function, which only requires ω representing the parameter matrix size vector as a sole parameter.The BLSK-S weight inheritance method uses around a third of the memory (5 GB) of the BLSK-NS strategy (15 GB) during our experiments, leading to a minor decrease in optimisation time due to the faster indexing of parameter tables.
Algorithm 5 Block Lookup Size Key-Size (BLSK-S) weight inheritance key construction algorithm.return key 7: end function

Hyperparameters
In this research, we use stochastic gradient descent (SGD) to optimise the weights of each candidate CNN, with a nesterov momentum factor [28,29] of 0.9, weight decay of 1 × 10 −4 , a minibatch size of 64, performing a single epoch of training per fitness function evaluation.We use a population size of 50 individuals in the swarm, over 100 total iterations.Moreover, we use a cosine annealing learning rate schedule, which gradually decreases the learning rate over the total iterations using (7): where a and b represent the initial and final learning rate values, respectively, and φ and Φ represent the current iteration and total number of iterations, respectively.For our experiments, we use values of a = 1 × 10 −1 , b = 1 × 10 −3 for the joint optimisation and training process, and a = 1 × 10 −4 , b = 1 × 10 −7 for the fine-tuning process.

Ensembling
In this proposed research, we subsequently construct two ensemble models to further enhance classification performance.The first ensemble method, i.e., a local best position ensembling method, employs the personal best experiences, identified by the PSO model with adaptive cosine oriented coefficients, in order to improve upon the performance of the single global best model recommended by the optimisation process.The second ensemble model is directly constructed using the particle positions that were obtained in the final iteration.The above two processes provide primary and secondary sources of potential candidate architectures, resulting in two position-based ensembling methods.In other words, the swarm ensembling method uses the set of unique positions that are taken from all of the final particle positions for ensemble construction, while the local best ensembling method employs the set of unique positions taken from all of the local best positions for ensemble building.
After the optimisation process, each individual position is used in order to create an architecture which inherits weights from the central lookup table using one of the aforementioned configured inheritance methods.Each of the resulting CNN models is then fine-tuned on a combination of the training and validation datasets for a certain number of epochs and then evaluated on the test set to obtain its final performance result.These final performance results of each ensemble model are combined while using a plurality voting scheme to obtain the final prediction outcome.The plurality voting strategy simply tallies 'votes' from each candidate model for each class given an input image, choosing the class with the highest number of votes as the predicted class for the image.
Pseudocode for the overall SODBAE process for jointly optimising and training new DenseBlock CNN architectures for image classification can be seen in Algorithm 6.  Create the centralised weight inheritance controller 14: end if 15: Initialise the time logging 16: Initialise particle positions 17: Initialise particle velocities

Experiments and Analysis
We evaluate the proposed approach on the CIFAR-10 [30] image classification dataset, which consists of 60,000 32 × 32 colour images, from a set of 10 possible classes.We use inline augmentations during data loading, based on the ones used in [21].The splitting techniques provided by [30] are employed for model evaluation, whereby the dataset is divided into 50,000 training and 10,000 test images.The training set of 50,000 images is then further divided into 45,000 training and 5000 validation images.In the training phase, the training images are used inside the objective function to update the weights of the candidate CNN through backpropagation, while the validation images are used in order to determine the fitness score of the candidate architecture through evaluation (without backpropagation).
Following optimisation, the training and validation datasets are recombined to form a larger dataset, which is used for the final model fine-tuning step.This comes in contrast to many studies in this area, which tend to completely retrain their optimal architectures using the combined dataset.This often leads to improved performance, but at the cost of a drastically increased processing time and resource usage as any training progress from the optimisation process is discarded in order to start from scratch.In our implementation, we use the combined dataset to fine-tune the optimised trained architectures for a small number of epochs prior to ensembling to enhance performance.
During our empirical studies, we explore and discuss the performance of the single and ensemble networks of the two proposed methods of weight inheritance applied to the proposed DenseBlock optimisation process.We also perform evaluation without using weight sharing, in order to demonstrate how necessary this weight inheritance procedure is to the optimisation process.We perform a set of five runs for each weight inheritance strategy as well as the no weight inheritance procedure in combination with PSO, where the mean result of the five runs for the unseen test set (10,000 images) is used for performance comparison.

BLSK-NS
First, we evaluate the performance of the proposed BLSK-NS method.Figure 5 presents the progression of the global best fitness throughout the optimisation process, whilst Figure 6 illustrates the progression of the fitness scores of the individual particles and their local bests, throughout the optimisation process.These magnitudes are calculated by taking the l2-norm of the velocity vector for each particle, as below, which gives us a scalar value representing the size of the movement required from a particle by the position update.
where v i represents the i-th element of the particle velocity and ||m|| denotes the magnitude of the overall velocity.Fom the magnitude of the velocity for each particle, we can see that, initially, whilst the particles are mostly moving towards their own local best positions, the magnitudes of movements start to slow, indicating that they are approaching the local optimum positions, as illustrated in Figure 7.
As the adaptive acceleration coefficients begin to change the behaviour of the position updates, we can see that the magnitudes of the updates begin to spike again, indicating that the particles are moving further away from their respective local optimum positions.Towards the end of the optimisation process, this spike lessens, which indicates that the particles are settling into their new optimum positions.We can further understand the progress of the optimisation process by visualising the movements of the particles.Figure 8 shows the particle local and global best positions projected into a three-dimensional (3D) space, where the first two dimensions represent the position of the particles projected into two-dimensional (2D) space while using Principal Component Analysis (PCA) with the fitness as the third dimension.From these projections, we can see that the particles explore a large area as the fitness values descend, eventually settling in the region of the global optimum, which is then explored locally, as can be seen in the initial large movements of the global best becoming small movements.Table 2 contains the final accuracy and error rates of the system on the unseen test set after 30 fine-tuning epochs.The results for the local best ensembling, the last swarm position ensembling, and the final global best position provided by the optimisation process are given and compared.The best result is achieved by the swarm ensembling method with an error rate of 9.72%.It is probably owing to the fact that the swarm ensembling method contains more base model diversity.Specifically, in ensemble learning mechanisms [31], the performance of the ensemble models can be significantly affected by the base model diversity and complementary characteristics.For the swarm ensembling method, its base models are recommended by the final particle positions, while, for the local best ensembling strategy, its base models are contributed by the particle personal best solutions.In the final iterations, owing to the convergence of the search process, the particle personal best solutions demonstrate high position proximities (i.e., all close to the global best solution); therefore, the base CNN models devised by these personal best solutions are highly likely to have very similar model configurations.In other words, these base models show less complementary capabilities in further enhancing ensemble model performance.In contrast, the final swarm positions may show comparatively more diversity in comparison with that of the personal best solutions.The swarm ensemble composed of base models devised by such final swarm positions shows more complementary characteristics, therefore better performance.
Table 2. Performance measures on the CIFAR-10 test set for 30 fine-tuning epochs following optimisation while using the BLSK-NS weight inheritance method.With the best results highlighted in bold.

Fine-Tuning Details
Accuracy (%) Error Rate (%) Figure 9 shows an analysis of the best performing parameters for each dimension of the swarm, equating to the best parameters for the growth rate, and sizes of the first, second, third, and fourth dense blocks, respectively.It is interesting to note that the optimal size of growth rate, found by the optimisation process, is relatively small, followed by initially larger dense block sizes, tapering down to smaller dense blocks towards the end.This finding on the growth rate corroborates the conclusions of [22], which discovered that DenseNets performed favourably with narrower layers, contrasting with recent research into ResNets [32] that found much wider layers to be effective.This difference comes down to the feature re-use inherent in a DenseNet architecture due to the dense connectivity of all the internal layers.

BLSK-S
Next, we evaluate the performance of our BLSK-S method.Again, we present a graph of the progression of the global best fitness throughout the optimisation process in Figure 10.It is interesting to note that using the BLSK-S weight inheritance method, we see a significantly more gradual improvement initially, contrasting with Figure 5, which dropped sharply to then level out gradually.Figure 11 shows the progression of the fitness scores of the individual particles and their respective local bests.Again, we note that the improvement is much more gradual than was seen in the identical graph for the BLSK-NS weight inheritance method presented in Figure 6.In Figure 12, we again visualise the magnitude of the velocity for each of the individual particles using the l2-norm, as seen in (8).Interestingly, we can see that the magnitudes of the position updates using the BLSK-S weight inheritance method quickly taper off when compared with the BLSK-NS weight inheritance method presented in Figure 7.This is most likely due to the homogeneity of the particles when inheriting their weights using only the size.Each internal layer inside a specific block inherits the same weights as its surrounding layers, as long as they are of the identical size.This leads to very little variation between individuals with identical or very similar positions in the search space.As the adaptive acceleration coefficients begin to trend towards promoting movement towards the global best position, rather than the local best signals, we can see that the movement activity of the individual particles greatly increases as they strive for large movements away from their local minima.
Following the BLSK-NS results analysis, we again present the graphs of the best performing values for each dimension of the search space in Figure 13.This demonstrates similar findings to Figure 9, with a relatively small growth rate, followed by initially larger dense block sizes, tapering down to smaller dense blocks towards the end.We present the final testing results of the BLSK-S weight inheritance method on the unseen test set in Table 3 with the same variations, as seen in Table 2.
Table 3. Performance measures on the CIFAR-10 test set for 30 fine-tuning epochs following optimisation using the BLSK-S weight inheritance method.With the best results highlighted in bold.

Fine-Tuning Hyperparams
Accuracy (%) Error Rate (%) Here, we can see that the local best ensembling proves to be marginally better than the swarm ensembling method, with the final result of 13.74% error rate being slightly worse than those of both ensemble models of the BLSK-NS method.
We again present the 3D position projections in Figure 14.Moreover, we employ the same number of function evaluations as the termination criterion for the proposed BLSK-S and BLSK-NS weight inheritance strategies combined with PSO, in order to observe their convergence rates.The training process of the BLSK-NS method shows a comparatively faster convergence rate than that of the BLSK-S method, as indicated in Figures 5 and 6 and Figures 10 and 11

No Weight Inheritance
We also evaluate our method without weight inheritance, in order to demonstrate the ineffectiveness of the process without the key continual training procedures.Table 4 contains the results of the optimisation process without the weight inheritance methods on the test set, whilst Figure 15 shows the progression of the global best fitness throughout.As can be clearly seen, the optimisation initially improves as the particles move around the search space, but having to train each CNN from scratch before evaluating leads to a rapid plateau of the fitness.Figure 16 shows the same effect on the individual particle fitnesses, and the local best fitnesses.Interestingly, Figure 16a shows a large upwards spike in the fitness scores of the individual particles towards the end of the optimisation process.This is likely due to the adaptive acceleration coefficients promoting global best exploration rather than local best exploitation, effectively kicking the particles out of the local optima that they have settled in due to the sustained lack of fitness improvement.Figure 18 illustrates that there are relatively few changes in the global best throughout optimisation and, whilst the overall swarm did converge to an improved position, the particles then failed to improve upon that position.Additionally, the findings of the architecture search in Figure 19 are very similar to those of the two proposed weight inheritance methods, where the networks benefit from a comparatively smaller growth rate, followed by initially larger dense block sizes, tapering down to smaller dense blocks towards the end.

Comparison with Related Studies
We compare the proposed approach with recent state-of-the-art studies, in order to further indicate model efficiency.Table 5 contains detailed comparison results.
Moreover, cPSO-CNN [33] depicts a slightly better accuracy rate than those of the proposed SODBAE (BLSK-NS) model, but it only optimises the hyper-parameters of each layer within AlexNet, instead of devising competitive CNN architectures, as in the proposed approach.Specifically, the task of cPSO-CNN is to optimize the following hyper-parameters for each layer within AlexNet, i.e., kernel size and number, stride and padding.In other words, cPSO-CNN does not generate deep network architectures from scratch, but merely optimises the configuration of each layer within AlexNet.Therefore their optimization tasks are comparatively easier.In contrast, the proposed approach generates deep CNN architectures with dense connectivities from scratch; therefore, the optimization task is comparatively more challenging.
Particularly, one key method for comparison is the Q-NAS system [20].This work used a quantum-inspired evolutionary search method in order to generate architectures and evaluate them on the CIFAR-10 dataset, achieving an error rate of 11.00% following 50 h of optimisation using 20 GPUs (1000 GPU-hours).However, this work did not take advantage of weight inheritance or continual training processes, with each potential architecture being trained only on a small subset of the training set per iteration.Additionally, this related work as well as the majority of studies in this area required the final model to be retrained from scratch, vastly increasing the amount of redundant processing performed in order to generate the final, usable model.In comparison, our SODBAE (BLSK-NS) model achieved an error rate of 9.72% following ∼150 h of optimisation using a single GPU (150 GPU-hours).Besides the above, it also used weight inheritance strategies to train all of the models simultaneously, and did not require final retraining, but only fine-tuning for a small number of epochs.
In short, owing to the adoption of dense connectivity, the centralised weight inheritance schemes, as well as adaptive cosine oriented PSO-based search strategies, the proposed models show great superiority in classification performance while demonstrating greater versatility in architecture generation.
We also discuss the advantage of the proposed approach against handcrafted networks below.The existing handcrafted DenseNet and other CNN models are generated by a large number of trial-and-error with significant prerequisite expertise knowledge needed.In contrast, this research proposes an automatic approach for devising deep CNN models with dense connectivity without human intervene or specialist knowledge required.The PSO search process is also able to generate multiple effective CNN architectures with dense connections automatically for different problems at hand.Such a search process may significantly benefit the deployment of a CNN model to a new application domain.Moreover, by embedding the proposed weight inheritance strategies, the proposed approach achieves reasonable trade-off between performance and cost in comparison with some of the other deep architecture generation methods, as discussed above and as indicated in Table 5.Therefore, the proposed approach solves the bottleneck design problems of deploying deep learning models to a new application domain, as well as alleviating the large computational costs of the optimal architecture search by embedding the novel weight inheritance learning mechanisms in PSO.Although the performance of the proposed approach could be further improved in comparison with those of the handcrafted models in some cases, in future work we aim to adopt hybrid search strategies and other ensembling techniques in order to further enhance performance.Evolutionary and other optimisation techniques Large-Scale Evolution [1] 5.40 250 264 unknown Genetic CNN [2] 7.10 ∼20 ∼24 unknown Hierarchical Representations [4] 3.60 200 36 unknown Q-NAS [20] 11.00 20 50 unknown cPSO-CNN [33] 8.67 --unknown EPSO-CNN [8] 19.85 --unknown SMAC with predictive termination [9] 17.2 --unknown SMAC [9] 17.47 --unknown HORD [10] 20.54 --unknown NMM [11] 10.05 --unknown PSO-b [34] 18.53 --unknown GA-CNN [35] 25.41 --unknown Our system SODBAE (BLSK-S) 13

Conclusions and Future Work
In this research, we have demonstrated a new deep architecture optimisation model, namely SODBAE, based on the proposed strategies, such as weight inheritance mechanisms and dense connectivity using PSO.We have designed a new skeleton architecture with a large search space based on the concept of densely connected layers from [22] and created an accompanying objective function in order to optimise the network structures using the adaptive PSO model.We have created two new centralised weight inheritance methods that can theoretically work with any generative CNN architecture systems to provide continual training throughout an optimisation process.This allows for the joint optimisation and training of new CNN architectures without requiring a costly retraining step and wasting the progress from the optimisation process.We have evaluated the proposed SODBAE model on the CIFAR-10 dataset and found impressive results given the large search space and low resource requirements of our approach.The SODBAE (BLSK-NS) model is able to achieve an error rate of 9.72% on the CIFAR-10 dataset following around 150 h of joint optimisation and training, not requiring a final retraining step.In particular, it shows a proportion (i.e., 0.2273%, 31.25%,2.0833% and 15%) of the computational costs of Large-Scale Evolution [1], Genetic CNN [2], Hierarchical

Figure 2 .
Figure 2. A Processing Layer in the SODBAE skeleton architecture (consisting of an initial BottleNeck Layer, followed by component batch normalisation, ReLU, and convolutional layers making up an Inner Processing Layer).

Figure 3 .
Figure 3.A Transition Layer in the SODBAE skeleton architecture (consisting of batch normalisation, ReLU, convolutional, and average pooling layers).

1 .Algorithm 1
The growth rate (k) 2. The depth (No. of layers) of the first DenseBlock (d 0 ) 3. The depth (No. of layers) of the second DenseBlock (d 1 ) 4. The depth (No. of layers) of the third DenseBlock (d 2 ) 5. The depth (No. of layers) of the fourth DenseBlock (d 3 ) The objective function to be optimised.1: function OBJECTIVEFUNCTION(position):

Algorithm 6
Pseudocode of the SODBAE-based CNN architecture optimisation and training process.

1 : Start 2 :
Initialise the cosine learning rate decay controller 3: Initialise the adaptive search weight function 4: Create the dataset loaders 5: Define the search space boundaries 6: Create the results logging objects 7: if Using weight lookup then

Figure 5 .Figure 6 .
Figure 5. Fitness value of the global best particle throughout the optimisation process using the BLSK-NS weight inheritance method.

Figure 7
Figure 7 depicts the magnitudes of the velocities of the individual particles throughout the optimisation process.

Figure 7 .
Figure 7. Velocity magnitudes for each of the particle position updates throughout the optimisation process using the BLSK-NS weight inheritance method.

Figure 9 .
Figure 9. Visualisation of the best fitness scores achieved for different parameter values in each dimension using the BLSK-NS weight inheritance method.

Figure 10 .
Figure 10.Fitness value of the global best particle throughout the optimisation process using the BLSK-S weight inheritance method.
(a) Fitness values of the current positions of each individual particle.(b) Fitness values of the local best positions of each individual particle.

Figure 11 .
Figure 11.Fitness values for the current and local best positions of each individual particle throughout the optimisation process using the BLSK-S weight inheritance method

Figure 12 .
Figure 12.Velocity magnitudes for each of the particle position updates throughout the optimisation process using the BLSK-S weight inheritance method.
(a) Local best positions & their fitness scores by iteration for all 50 particles (point indicates start position) (b) All local & global best positions & their fitness scores (c) Global best positions & their fitness scores

Figure 14 .
Figure 14.Example local and global best positions over time for the BLSK-S weight inheritance method (PCA for dimensionality reduction to display particle positions on two axes with fitness on the third).

Figure 15 .
Figure 15.Fitness value of the global best particle throughout the optimisation process while using no weight inheritance method.

Figure 17 .
Figure 17.Velocity magnitudes for each of the particle position updates throughout the optimisation process using no weight inheritance method.
(a) Local best positions & their fitness scores by iteration for all 50 particles (point indicates start position) (b) All local & global best positions & their fitness scores (c) Global best positions & their fitness scores

Figure 18 .
Figure 18.Example local and global best positions over time with no weight inheritance method (PCA for dimensionality reduction to display particle positions on two axes with fitness on the third).

Figure 19 .
Figure 19.Visualisation of the best fitness scores achieved for different parameter values in each dimension using no weight inheritance method.

Table 1 .
The parameters of the DenseBlock convolutional architecture to be optimised.

Table 4 .
Performance measures on the CIFAR-10 test set for 30 fine-tuning epochs following optimisation using no weight inheritance method.With the best results highlighted in bold.

Table 5 .
Classification error rates (%) for the proposed SODBAE model and other related studies for the CIFAR-10 dataset