# A Framework for Designing the Architectures of Deep Convolutional Neural Networks


## Abstract


## 1. Introduction

- We propose an efficient framework for automatically discovering a high-performing CNN architecture for a given problem through a very large search space without any human intervention. This framework also allows for an effective parallel and distributed execution.
- We introduce a novel objective function that exploits the error rate on the validation set and the quality of the feature visualization via deconvnet. This objective function adjusts the CNN architecture design, which reduces the classification error and enhances the reconstruction via the use of visualization feature maps at the same time. Further, our new objective function results in much faster convergence towards a better architecture.

## 2. Related Work

## 3. Background

#### 3.1. Convolutional Neural Networks

#### 3.2. CNN Architecture Design

## 4. Framework Model

#### 4.1. Reducing the Training Set

#### 4.2. CNN Feature Visualization Methods

#### Deconvolutional Networks

#### 4.3. Correlation Coefficient

We select $N_{fm}$ feature maps at random from the last layer to visualize their learned parts using deconvnet. The motivation behind selecting the last convolutional layer is that it should show the highest level of visualization as compared to preceding layers. We choose $N_{img}$ images from the training sample at random to test the deconvnet. The correlation coefficient is used to calculate the similarity between the $N_{img}$ input images and their reconstructions. Since each image of $N_{img}$ has a correlation coefficient (Corr) value, all Corr values are accumulated in a scalar value called $Corr_{Res}$. Algorithm 1 summarizes the processing procedure for training a CNN architecture:

Algorithm 1. Processing Steps for Training a Single CNN Architecture. | |
---|---|
1: | Input: training sample $T_S$, validation set $T_V$, $N_{fm}$ feature maps, and $N_{img}$ images |
2: | Output: Err and $Corr_{Res}$ |
3: | Train the CNN architecture design using SGD |
4: | Compute the error rate (Err) on the validation set $T_V$ |
5: | $Corr_{Res} = 0$ |
6: | For i = 1 to $N_{fm}$: |
7: |   Pick a feature map fm at random from the last convolutional layer |
8: |   For j = 1 to $N_{img}$: |
9: |     Use deconvnet to visualize the selected feature map fm on image $N_{img}[j]$ |
10: |     $Corr_{Res} = Corr_{Res}$ + correlation coefficient($N_{img}[j]$, reconstructed image) |
11: | Return Err and $Corr_{Res}$ |
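The accumulation loop of Algorithm 1 can be sketched as follows. This is a minimal sketch, not the paper's implementation: `reconstruct` is a hypothetical callable standing in for the deconvnet visualization of one feature map, and Pearson's correlation coefficient is computed directly on flattened pixel arrays.

```python
import numpy as np

def pearson_corr(a, b):
    """Pearson correlation coefficient between two images (flattened)."""
    a = a.ravel().astype(float) - a.mean()
    b = b.ravel().astype(float) - b.mean()
    denom = np.sqrt((a ** 2).sum() * (b ** 2).sum())
    return (a * b).sum() / denom if denom > 0 else 0.0

def corr_res(images, reconstruct, n_fm=10):
    """Accumulate Corr values over N_fm feature maps and N_img images.

    `reconstruct(img, fm)` is a hypothetical stand-in that returns the
    deconvnet reconstruction of `img` through feature map `fm`.
    """
    total = 0.0
    for fm in range(n_fm):        # N_fm randomly chosen feature maps
        for img in images:        # N_img sampled training images
            total += pearson_corr(img, reconstruct(img, fm))
    return total
```

With a perfect reconstruction (identity), each term contributes a correlation of 1, so $Corr_{Res}$ equals $N_{fm} \times N_{img}$; poorer reconstructions pull the accumulated value down.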

#### 4.4. Objective Function

#### 4.5. Nelder Mead Method

Let [$Z_1$, $Z_2$, …, $Z_{n+1}$] refer to the simplex vertices, where each vertex represents a CNN architecture. The vertices are sorted in ascending order of objective function value, f($Z_1$) ≤ f($Z_2$) ≤ … ≤ f($Z_{n+1}$), so that $Z_1$ is the best vertex, which provides the best CNN architecture, and $Z_{n+1}$ is the worst vertex. NMM seeks the hyperparameters λ* that design a CNN architecture minimizing the objective function in Equation (8), as follows:

#### 4.6. Accelerating Processing Time with Parallelism

Algorithm 2. The Proposed Framework Pseudocode | |
---|---|
1: | Input: n: number of hyperparameters |
2: | Output: best vertex ($Z_1$) found that minimizes the objective function in Equation (8) |
3: | Determine the training sample $T_S$ using RMHC |
4: | Initialize the simplex vertices ($Z_{1:n+1}$) randomly from the ranges in Table 1 |
5: | L = $\lceil (n+1)/3 \rceil$  # 3 is the number of workers |
6: | For j = 1 to L: |
7: |   Train the next 3 vertices of $Z_{1:n+1}$ in parallel according to Algorithm 1 |
8: | For l = 1 to Max_iterations: |
9: |   Normalize the $Corr_{Res}$ values between the max and min Err of the vertices of $Z$ |
10: |   Compute $f(Z_i)$ based on Equation (8) for all vertices i = 1:n+1 |
11: |   Order the vertices so that f($Z_1$) ≤ f($Z_2$) ≤ … ≤ f($Z_{n+1}$) |
12: |   Set $B = Z_1$, $A = Z_n$, $W = Z_{n+1}$ |
13: |   Compute the centroid of the vertices, excluding the worst vertex: $C = \frac{1}{n}\sum_{i=1}^{n} Z_i$ |
14: |   Compute the reflected vertex: $R = C + \alpha(C - W)$ |
15: |   Compute the expanded vertex: $E = R + \gamma(R - C)$ |
16: |   Compute the contracted vertex: $Con = \rho R + (1 - \rho)C$ |
17: |   Train R, E, and Con simultaneously on workers 1, 2, and 3 according to Algorithm 1 |
18: |   Normalize $Corr_{Res}$ of R, E, and Con between the max and min Err of the vertices of $Z$ |
19: |   Compute f(R), f(E), and f(Con) based on Equation (8) |
20: |   If f(B) ≤ f(R) < f(A): |
21: |     $Z_{n+1}$ = R |
22: |   Else if f(R) < f(B): |
23: |     If f(E) < f(R): |
24: |       $Z_{n+1}$ = E |
25: |     Else: |
26: |       $Z_{n+1}$ = R |
27: |   Else: |
28: |     d = true |
29: |     If f(R) < f(W): |
30: |       If f(Con) ≤ f(R): |
31: |         $Z_{n+1}$ = Con |
32: |         d = false |
33: |     If d = true:  # contraction failed; shrink toward the best vertex |
34: |       L = $\lceil n/3 \rceil$ |
35: |       For k = 2 to n+1:  # do not include the best vertex |
36: |         $Z_k = B + \sigma(Z_k - B)$ |
37: |       For j = 1 to L: |
38: |         Train the next 3 vertices of $Z_{2:n+1}$ in parallel on workers 1, 2, and 3 according to Algorithm 1 |
39: | Return $Z_1$ |
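The geometry of the update step (reflection, expansion, contraction, shrink) can be sketched on an ordinary numerical objective. This is a minimal sketch under stated assumptions, not the paper's implementation: `f` is a cheap stand-in for the architecture objective of Equation (8), the branches follow the standard Nelder-Mead acceptance rules, and no parallel workers or $Corr_{Res}$ normalization are modeled.

```python
import numpy as np

def nelder_mead(f, z0, alpha=1.0, gamma=1.0, rho=0.5, sigma=0.5, iters=200):
    """Minimal Nelder-Mead sketch over n+1 simplex vertices z0."""
    z = [np.asarray(v, dtype=float) for v in z0]
    for _ in range(iters):
        z.sort(key=f)                           # ascending objective value
        b, a, w = z[0], z[-2], z[-1]            # best, second-worst, worst
        c = np.mean(z[:-1], axis=0)             # centroid excluding worst
        r = c + alpha * (c - w)                 # reflected vertex R
        if f(b) <= f(r) < f(a):
            z[-1] = r
        elif f(r) < f(b):
            e = r + gamma * (r - c)             # expanded vertex E
            z[-1] = e if f(e) < f(r) else r
        else:
            con = rho * r + (1 - rho) * c       # contracted vertex Con
            if f(r) < f(w) and f(con) <= f(r):
                z[-1] = con
            else:                               # shrink toward the best vertex
                z = [b] + [b + sigma * (v - b) for v in z[1:]]
    z.sort(key=f)
    return z[0]
```

In the framework itself, each evaluation of `f` corresponds to training a full CNN (Algorithm 1), which is why R, E, and Con are dispatched to three workers at once rather than evaluated sequentially as here.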

## 5. Results and Discussion

#### 5.1. Datasets

#### 5.2. Experimental Setup

We determine the training sample $T_S$ using an RMHC algorithm, with a sample size based on a margin of error of 1 and a confidence level of 95%. Then, we select 8000 images at random from ($T_{TR}$ − $T_S$) for the validation set $T_V$.

**Training Settings**: We use SGD to train the CNN architectures. The learning rate is set to 0.08 for the first 25 epochs and to 0.008 for the remaining epochs; these values were selected after a small grid search over different values on the validation set. We set the batch size to 32 images and the weight decay to 0.0005. The weights of all layers are initialized according to the Xavier initialization technique [42], and biases are set to zero. The advantage of Xavier initialization is that it makes the network converge much faster than other approaches, and the weight sets it produces are more consistent than those produced by other techniques. We apply ReLU to all layers and employ early stopping on the validation set to prevent overfitting: once the error rate increases or saturates for a number of iterations, training stops. Since training a CNN is expensive and some designs perform poorly, early stopping saves time by terminating poor architecture designs early. Dropout [11] is applied to the fully-connected layers with a rate of 0.5; it has proven effective in combating overfitting in CNNs, and a rate of 0.5 is common practice. During the exploration phase of NMM, each experiment is run for 35 epochs. Once the best CNN architecture is obtained, we train it on the training set $T_{TR}$ for 200 epochs and evaluate it on the testing set.
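The early-stopping rule and the two-stage learning-rate schedule described above can be sketched as follows. This is a hedged sketch: the `patience` threshold is an assumed knob (the text only says "a number of iterations"), and the schedule values are the ones quoted in the text.

```python
class EarlyStopping:
    """Stop when validation error fails to improve for `patience`
    consecutive checks (`patience` is an assumed value, not the paper's)."""
    def __init__(self, patience=5):
        self.patience = patience
        self.best = float("inf")
        self.bad_checks = 0

    def step(self, val_error):
        if val_error < self.best:
            self.best = val_error
            self.bad_checks = 0
        else:
            self.bad_checks += 1
        return self.bad_checks >= self.patience   # True means: stop training

def learning_rate(epoch):
    """Schedule from the text: 0.08 for the first 25 epochs, then 0.008."""
    return 0.08 if epoch < 25 else 0.008
```

Terminating a poorly performing architecture after a few stagnant checks is what keeps the overall search affordable, since each NMM vertex requires its own training run.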

**Nelder Mead Settings**: The total number of hyperparameters n is required to construct the initial simplex with n + 1 vertices; however, this number differs for each dataset. To define n for a given dataset, we initialize 80 random CNN architectures as an additional step and take the maximum number of convolutional layers ($C_{max}$) across all architectures. Then, according to Equation (4), the number of hyperparameters n is given by n = $C_{max}$ × 4 + the maximum number of fully-connected layers.
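The dimension count of Equation (4) can be written out directly; which four per-convolutional-layer hyperparameters are counted is as defined in the paper, so the comment below only restates the formula.

```python
def num_hyperparameters(c_max, max_fc_layers):
    """Equation (4): four hyperparameters per convolutional layer
    plus the maximum number of fully-connected layers."""
    return c_max * 4 + max_fc_layers
```

For example, if the 80 random architectures reach at most 6 convolutional layers and 2 fully-connected layers, the simplex lives in n = 26 dimensions and has 27 vertices.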

We then construct the initial simplex ($Z_0$) with n + 1 vertices. For all datasets, we set the value of the correlation coefficient parameter to $\eta$ = 0.20. We select at random $N_{fm}$ = 10 feature maps from the last convolutional layer to visualize their learned features and $N_{img}$ = 100 images from the training sample to assess the visualization. The number of NMM iterations is 25.

#### 5.3. Results and Discussion

Starting from an initialization ($Z_0$) of NMM, we optimize the architecture using NMM based on the proposed objective function (error rate as well as visualization). Then, from the same initialization ($Z_0$), we execute NMM based on the error rate objective function alone by setting $\eta$ to zero. Table 2 compares the error rates of five experiment runs obtained from the best CNN architectures found using the two objective functions on the CIFAR-10 and CIFAR-100 datasets, respectively. The results illustrate that our new objective function outperforms optimization based on the error rate alone: average error rates of 15.87% and 40.70% are obtained with our objective function, compared to 17.69% and 42.72% with the error rate objective function alone, on CIFAR-10 and CIFAR-100, respectively. Our objective function searches for the architecture that minimizes the error and improves the visualization of learned features, which steers the direction of the search and thus produces a better CNN architecture.

## 6. Conclusions

## Acknowledgments

## Author Contributions

## Conflicts of Interest

## References

- Farabet, C.; Couprie, C.; Najman, L.; LeCun, Y. Learning hierarchical features for scene labeling. IEEE Trans. Pattern Anal. Mach. Intell.
**2013**, 35, 1915–1929. [Google Scholar] [CrossRef] [PubMed] - Simonyan, K.; Zisserman, A. Very deep convolutional networks for large-scale image recognition. arXiv, 2014; arXiv:1409.1556. [Google Scholar]
- Krizhevsky, A.; Sutskever, I.; Hinton, G.E. Imagenet classification with deep convolutional neural networks. In Proceedings of the 25th International Conference on Neural Information Processing Systems, Lake Tahoe, NV, USA, 3–6 December 2012; pp. 1097–1105. [Google Scholar]
- Kalchbrenner, N.; Grefenstette, E.; Blunsom, P. A convolutional neural network for modelling sentences. arXiv, 2014; arXiv:1404.2188. [Google Scholar]
- Kim, Y. Convolutional neural networks for sentence classification. arXiv, 2014; arXiv:1408.5882. [Google Scholar]
- Conneau, A.; Schwenk, H.; LeCun, Y.; Barrault, L. Very deep convolutional networks for text classification. arXiv, 2016; arXiv:1606.01781. [Google Scholar]
- Hubel, D.H.; Wiesel, T.N. Receptive fields and functional architecture of monkey striate cortex. J. Physiol.
**1968**, 195, 215–243. [Google Scholar] [CrossRef] [PubMed] - Sermanet, P.; Eigen, D.; Zhang, X.; Mathieu, M.; Fergus, R.; LeCun, Y. Overfeat: Integrated recognition, localization and detection using convolutional networks. arXiv, 2013; arXiv:1312.6229. [Google Scholar]
- Bengio, Y. Learning deep architectures for AI. Found. Trends Mach. Learn.
**2009**, 2, 1–127. [Google Scholar] [CrossRef] - Deng, J.; Dong, W.; Socher, R.; Li, L.J.; Li, K.; Li, F. Imagenet: A large-scale hierarchical image database. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Miami, FL, USA, 20–25 June 2009; pp. 248–255. [Google Scholar]
- Srivastava, N.; Hinton, G.E.; Krizhevsky, A.; Sutskever, I.; Salakhutdinov, R. Dropout: A simple way to prevent neural networks from overfitting. J. Mach. Learn. Res.
**2014**, 15, 1929–1958. [Google Scholar] - Liu, Y.; Racah, E.; Correa, J.; Khosrowshahi, A.; Lavers, D.; Kunkel, K.; Wehner, M.; Collins, W. Application of deep convolutional neural networks for detecting extreme weather in climate datasets. arXiv, 2016; arXiv:1605.01156. [Google Scholar]
- Esteva, A.; Kuprel, B.; Novoa, R.A.; Ko, J.; Swetter, S.M.; Blau, H.M.; Thrun, S. Dermatologist-level classification of skin cancer with deep neural networks. Nature
**2017**, 542, 115–118. [Google Scholar] [CrossRef] [PubMed] - Zeiler, M.D.; Fergus, R. Visualizing and understanding convolutional networks. In Proceedings of the European Conference on Computer Vision, Zurich, Switzerland, 6–12 September 2014; pp. 818–833. [Google Scholar]
- He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. arXiv, 2015; arXiv:1512.03385. [Google Scholar]
- Srivastava, R.K.; Greff, K.; Schmidhuber, J. Training very deep networks. In Proceedings of the 28th International Conference on Neural Information Processing Systems, Montreal, QC, Canada, 7–12 December 2015; pp. 2377–2385. [Google Scholar]
- He, K.; Sun, J. Convolutional neural networks at constrained time cost. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 5353–5360. [Google Scholar]
- Gu, J.; Wang, Z.; Kuen, J.; Ma, L.; Shahroudy, A.; Shuai, B.; Liu, T.; Wang, X.; Wang, G. Recent advances in convolutional neural networks. arXiv, 2015; arXiv:1512.07108. [Google Scholar]
- De Andrade, A. Best Practices for Convolutional Neural Networks Applied to Object Recognition in Images; Technical Report; University of Toronto: Toronto, ON, Canada, 2014. [Google Scholar]
- Zheng, A.X.; Bilenko, M. Lazy paired hyper-parameter tuning. In Proceedings of the 23rd International Joint Conference on Artificial Intelligence, Beijing, China, 3–9 August 2013; pp. 1924–1931. [Google Scholar]
- Li, L.; Jamieson, K.; DeSalvo, G.; Rostamizadeh, A.; Talwalkar, A. Hyperband: A novel bandit-based approach to hyperparameter optimization. arXiv, 2016; arXiv:1603.06560. [Google Scholar]
- Young, S.R.; Rose, D.C.; Karnowski, T.P.; Lim, S.-H.; Patton, R.M. Optimizing deep learning hyper-parameters through an evolutionary algorithm. In Proceedings of the Workshop on Machine Learning in High-Performance Computing Environments, Austin, TX, USA, 15–20 November 2015. [Google Scholar]
- Bergstra, J.S.; Bardenet, R.; Bengio, Y.; Kégl, B. Algorithms for hyper-parameter optimization. In Proceedings of the 24th International Conference on Neural Information Processing Systems, Granada, Spain, 12–15 December 2011; pp. 2546–2554. [Google Scholar]
- Snoek, J.; Larochelle, H.; Adams, R.P. Practical bayesian optimization of machine learning algorithms. In Proceedings of the 25th International Conference on Neural Information Processing System, Lake Tahoe, NV, USA, 3–6 December 2012; pp. 2951–2959. [Google Scholar]
- Wang, B.; Pan, H.; Du, H. Motion sequence decomposition-based hybrid entropy feature and its application to fault diagnosis of a high-speed automatic mechanism. Entropy
**2017**, 19, 86. [Google Scholar] [CrossRef] - Albelwi, S.; Mahmood, A. Automated optimal architecture of deep convolutional neural networks for image recognition. In Proceedings of the IEEE International Conference on Machine Learning and Applications, Anaheim, CA, USA, 18–20 December 2016; pp. 53–60. [Google Scholar]
- Kohavi, R. A study of cross-validation and bootstrap for accuracy estimation and model selection. In Proceedings of the 14th International Joint Conference on Artificial Intelligence, Montreal, QC, Canada, 20–25 August 1995; pp. 1137–1143. [Google Scholar]
- Schaer, R.; Müller, H.; Depeursinge, A. Optimized distributed hyperparameter search and simulation for lung texture classification in CT using hadoop. J. Imaging
**2016**, 2, 19. [Google Scholar] [CrossRef] - Bergstra, J.; Bengio, Y. Random search for hyper-parameter optimization. J. Mach. Learn. Res.
**2012**, 13, 281–305. [Google Scholar] - Hutter, F.; Hoos, H.H.; Leyton-Brown, K. Sequential model-based optimization for general algorithm configuration. In Proceedings of the 5th International Conference on Learning and Intelligent Optimization, Rome, Italy, 17–21 January 2011; pp. 507–523. [Google Scholar]
- Murray, I.; Adams, R.P. Slice sampling covariance hyperparameters of latent gaussian models. In Proceedings of the 24th Annual Conference on Neural Information Processing Systems, Vancouver, BC, Canada, 6–9 December 2010; pp. 1723–1731. [Google Scholar]
- Gelbart, M.A. Constrained Bayesian Optimization and Applications. Ph.D. Thesis, Harvard University, Cambridge, MA, USA, 2015. [Google Scholar]
- Loshchilov, I.; Hutter, F. CMA-ES for hyperparameter optimization of deep neural networks. arXiv, 2016; arXiv:1604.07269. [Google Scholar]
- Luketina, J.; Berglund, M.; Greff, K.; Raiko, C.T. Scalable gradient-based tuning of continuous regularization hyperparameters. arXiv, 2015; arXiv:1511.06727. [Google Scholar]
- Chan, L.-W.; Fallside, F. An adaptive training algorithm for back propagation networks. Comput. Speech Lang.
**1987**, 2, 205–218. [Google Scholar] [CrossRef] - Larsen, J.; Svarer, C.; Andersen, L.N.; Hansen, L.K. Adaptive Regularization in Neural Network Modeling. In Neural Networks: Tricks of the Trade; Springer: Berlin, Germany, 1998; pp. 113–132. [Google Scholar]
- Pedregosa, F. Hyperparameter optimization with approximate gradient. arXiv, 2016; arXiv:1602.02355. [Google Scholar]
- Yu, C.; Liu, B. A backpropagation algorithm with adaptive learning rate and momentum coefficient. In Proceedings of the 2002 International Joint Conference on Neural Networks, Piscataway, NJ, USA, 12–17 May 2002; pp. 1218–1223. [Google Scholar]
- Zeiler, M.D. Adadelta: An adaptive learning rate method. arXiv, 2012; arXiv:1212.5701. [Google Scholar]
- Caruana, R.; Lawrence, S.; Giles, L. Overfitting in neural nets: Backpropagation, conjugate gradient, and early stopping. In Proceedings of the 2001 Neural Information Processing Systems Conference, Vancouver, BC, Canada, 3–8 December 2001; pp. 402–408. [Google Scholar]
- Graves, A.; Mohamed, A.; Hinton, G. Speech recognition with deep recurrent neural networks. In Proceedings of the 2013 IEEE International Conference on Acoustics, Speech and Signal Processing, Vancouver, BC, Canada, 26–31 May 2013; pp. 6645–6649. [Google Scholar]
- Glorot, X.; Bengio, Y. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the International Conference on Artificial Intelligence and Statistics, Sardinia, Italy, 13–15 May 2010; pp. 249–256. [Google Scholar]
- Garro, B.A.; Vázquez, R.A. Designing artificial neural networks using particle swarm optimization algorithms. Comput. Intell. Neurosci.
**2015**, 2015, 61. [Google Scholar] [CrossRef] [PubMed] - Chau, K.; Wu, C. A hybrid model coupled with singular spectrum analysis for daily rainfall prediction. J. Hydroinform.
**2010**, 12, 458–473. [Google Scholar] [CrossRef] - Wang, W.; Chau, K.; Xu, D.; Chen, X. Improving forecasting accuracy of annual runoff time series using arima based on eemd decomposition. Water Resour. Manag.
**2015**, 29, 2655–2675. [Google Scholar] [CrossRef] - Taormina, R.; Chau, K.-W. Data-driven input variable selection for rainfall–runoff modeling using binary-coded particle swarm optimization and extreme learning machines. J. Hydrol.
**2015**, 529, 1617–1632. [Google Scholar] [CrossRef] - Zhang, J.; Chau, K.-W. Multilayer ensemble pruning via novel multi-sub-swarm particle swarm optimization. J. UCS
**2009**, 15, 840–858. [Google Scholar] - Kulkarni, P.; Zepeda, J.; Jurie, F.; Pérez, P.; Chevallier, L. Learning the structure of deep architectures using L1 regularization. In Proceedings of the British Machine Vision Conference, Swansea, UK, 7–10 September 2015; pp. 23.1–23.11. [Google Scholar]
- Zoph, B.; Le, Q.V. Neural architecture search with reinforcement learning. arXiv, 2016; arXiv:1611.01578. [Google Scholar]
- Miikkulainen, R.; Liang, J.; Meyerson, E.; Rawal, A.; Fink, D.; Francon, O.; Raju, B.; Navruzyan, A.; Duffy, N.; Hodjat, B. Evolving deep neural networks. arXiv, 2017; arXiv:1703.00548. [Google Scholar]
- Real, E.; Moore, S.; Selle, A.; Saxena, S.; Suematsu, Y.L.; Le, Q.; Kurakin, A. Large-scale evolution of image classifiers. arXiv, 2017; arXiv:1703.01041. [Google Scholar]
- Baker, B.; Gupta, O.; Naik, N.; Raskar, R. Designing neural network architectures using reinforcement learning. arXiv, 2016; arXiv:1611.02167. [Google Scholar]
- Yosinski, J.; Clune, J.; Nguyen, A.; Fuchs, T.; Lipson, H. Understanding neural networks through deep visualization. arXiv, 2015; arXiv:1506.06579. [Google Scholar]
- LeCun, Y.; Bottou, L.; Bengio, Y.; Haffner, P. Gradient-based learning applied to document recognition. Proc. IEEE
**1998**, 86, 2278–2324. [Google Scholar] [CrossRef] - Chen, T.; Xu, R.; He, Y.; Wang, X. A gloss composition and context clustering based distributed word sense representation model. Entropy
**2015**, 17, 6007–6024. [Google Scholar] [CrossRef] - Nair, V.; Hinton, G.E. Rectified linear units improve restricted boltzmann machines. In Proceedings of the 27th International Conference on Machine Learning, Haifa, Israel, 21–24 June 2010; pp. 807–814. [Google Scholar]
- Aldhaheri, A.; Lee, J. Event detection on large social media using temporal analysis. In Proceedings of the Computing and Communication Workshop and Conference, Las Vegas, NV, USA, 9–11 January 2017; pp. 1–6. [Google Scholar]
- Hijazi, S.; Kumar, R.; Rowen, C. Using Convolutional Neural Networks for Image Recognition. 2015. Available online: https://ip.cadence.com/uploads/901/cnn_wp-pdf (accessed on 20 May 2017).
- Olvera-López, J.A.; Carrasco-Ochoa, J.A.; Martínez-Trinidad, J.F.; Kittler, J. A review of instance selection methods. Artif. Intell. Rev.
**2010**, 34, 133–143. [Google Scholar] [CrossRef] - Albelwi, S.; Mahmood, A. Analysis of instance selection algorithms on large datasets with deep convolutional neural networks. In Proceedings of the IEEE Long Island Systems, Applications and Technology Conference, Farmingdale, NY, USA, 29 April 2016; pp. 1–5. [Google Scholar]
- Skalak, D.B. Prototype and feature selection by sampling and random mutation hill climbing algorithms. In Proceedings of the 11th International Conference on Machine Learning, New Brunswick, NJ, USA, 10–13 July 1994; pp. 293–301. [Google Scholar]
- Karpathy, A.; Johnson, J.; Li, F.F. Visualizing and understanding recurrent networks. arXiv, 2015; arXiv:1506.02078. [Google Scholar]
- Erhan, D.; Bengio, Y.; Courville, A.; Vincent, P. Visualizing Higher-Layer Features of a Deep Network; University of Montreal: Montréal, QC, Canada, 2009; p. 3. [Google Scholar]
- Ahlgren, P.; Jarneving, B.; Rousseau, R. Requirements for a cocitation similarity measure, with special reference to pearson’s correlation coefficient. J. Am. Soc. Inf. Sci. Technol.
**2003**, 54, 550–560. [Google Scholar] [CrossRef] - Dragomir, A.; Post, A.; Akay, Y.M.; Jneid, H.; Paniagua, D.; Denktas, A.; Bozkurt, B.; Akay, M. Acoustic detection of coronary occlusions before and after stent placement using an electronic stethoscope. Entropy
**2016**, 18, 281. [Google Scholar] [CrossRef] - Katoh, K.; Misawa, K.; Kuma, K.; Miyata, T. Mafft: A novel method for rapid multiple sequence alignment based on fast fourier transform. Nucleic Acids Res.
**2002**, 30, 3059–3066. [Google Scholar] [CrossRef] [PubMed] - Nelder, J.A.; Mead, R. A simplex method for function minimization. Comput. J.
**1965**, 7, 308–313. [Google Scholar] [CrossRef] - Erl, T. Service-Oriented Architecture. A Field Guide to Integrating XML and Web Services; Prentice Hall PTR: Upper Saddle River, NJ, USA, 2004. [Google Scholar]
- Gu, X.; Gu, X. On the detection of fake certificates via attribute correlation. Entropy
**2015**, 17, 3806–3837. [Google Scholar] [CrossRef] - Alshinina, R.; Elleithy, K. Performance and challenges of service-oriented architecture for wireless sensor networks. Sensors
**2017**, 17, 536. [Google Scholar] [CrossRef] [PubMed] - Fielding, R.T. Architectural Styles and the Design of Network-Based Software Architectures. Ph.D. Thesis, University of California, Irvine, CA, USA, 2000. [Google Scholar]
- Krizhevsky, A.; Hinton, G. Learning Multiple Layers of Features from Tiny Images. 2009. Available online: https://www.cs.toronto.edu/~kriz/learning-features-2009-TR.pdf (accessed on 20 May 2017).
- Bergstra, J.; Breuleux, O.; Bastien, F.; Lamblin, P.; Pascanu, R.; Desjardins, G.; Turian, J.; Warde-Farley, D.; Bengio, Y. Theano: A CPU and GPU math compiler. In Proceedings of the Python for Scientific Computing Conference, Austin, TX, USA, 30 June–3 July 2010; pp. 1–7. [Google Scholar]
- Domhan, T.; Springenberg, J.T.; Hutter, F. Speeding up automatic hyperparameter optimization of deep neural networks by extrapolation of learning curves. In Proceedings of the 24th International Joint Conference on Artificial Intelligence, Buenos Aires, Argentina, 25–31 July 2015; pp. 3460–3468. [Google Scholar]
- Goodfellow, I.J.; Warde-Farley, D.; Mirza, M.; Courville, A.; Bengio, Y. Maxout networks. arXiv, 2013; arXiv:1302.4389. [Google Scholar]
- Wan, L.; Zeiler, M.; Zhang, S.; Cun, Y.L.; Fergus, R. Regularization of neural networks using dropconnect. In Proceedings of the 30th International Conference on Machine Learning, Atlanta, GA, USA, 16–21 June 2013; pp. 1058–1066. [Google Scholar]
- Lee, C.Y.; Xie, S.; Gallagher, P.W.; Zhang, Z.; Tu, Z. Deeply-supervised nets. In Proceedings of the 18th International Conference on Artificial Intelligence and Statistics, San Diego, CA, USA, 9–12 May 2015; pp. 562–570. [Google Scholar]

**Figure 2.** General components and a flowchart of our framework for discovering a high-performing CNN architecture.

**Figure 3.** The top part illustrates the deconvnet layer on the left, attached to the convolutional layer on the right. The bottom part illustrates the pooling and unpooling operations [14].

**Figure 4.** Visualization from the last convolutional layer for three different CNN architectures. Grayscale input images are visualized after preprocessing.

**Figure 7.** The average of the best CNN architectures obtained by both objective functions. (**a**) The architecture averages for our framework; (**b**) the architecture averages for the error rate objective function.

**Table 1.** Ranges of hyperparameter values used to initialize the simplex vertices.

Hyperparameter | Min. | Max.
---|---|---
Depth | 5 | 10
Number of fully-connected layers | 1 | 2
Number of filters | 50 | 150
Kernel sizes | 3 | 11
Number of pooling layers | 4 | 7
Pooling region sizes | 1 | 4
Number of neurons in fully-connected layers | 250 | 800
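Sampling one random simplex vertex from the Table 1 ranges can be sketched as below. The dictionary keys and the flat encoding are illustrative assumptions, not the paper's actual vertex representation.

```python
import random

# Hypothetical encoding of the Table 1 ranges (min, max), inclusive.
RANGES = {
    "depth": (5, 10),
    "fc_layers": (1, 2),
    "num_filters": (50, 150),
    "kernel_size": (3, 11),
    "pooling_layers": (4, 7),
    "pool_region": (1, 4),
    "fc_neurons": (250, 800),
}

def random_vertex(rng=random):
    """Draw one randomly initialized CNN architecture (a simplex vertex)."""
    return {k: rng.randint(lo, hi) for k, (lo, hi) in RANGES.items()}
```

Drawing n + 1 such vertices gives the initial simplex ($Z_0$) that Algorithm 2 starts from.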

**Table 2.** Error rate comparisons between the top CNN architectures obtained by our objective function and the error rate objective function via NMM.

Expt. Num. | Error Rate Based on the Error Objective Function | Error Rate Based on Our Objective Function
---|---|---
*Results comparison on CIFAR-10* | |
1 | 18.10% | 15.27%
2 | 18.15% | 16.65%
3 | 17.81% | 16.14%
4 | 17.12% | 15.52%
5 | 17.27% | 15.79%
Avg. | 17.69% | 15.87%
*Results comparison on CIFAR-100* | |
1 | 42.10% | 41.21%
2 | 43.84% | 40.68%
3 | 42.44% | 40.15%
4 | 42.98% | 41.37%
5 | 42.26% | 40.12%
Avg. | 42.72% | 40.70%

**Table 3.** Error rate comparison for different methods of designing CNN architectures on CIFAR-10 and CIFAR-100. These results are achieved without data augmentation.

Method | CIFAR-10 | CIFAR-100
---|---|---
Human experts design [72] | 18% | -
Random search (our implementation) | 21.74% | 44.97%
Genetic algorithms [22] | 25% | -
SMAC [74] | 17.47% | 42.21%
Our approach | 15.27% | 40.12%

**Table 4.** Execution time of the framework on a single computer versus parallel execution on three workers.

Dataset | Single-Computer Execution Time | Parallel Execution Time
---|---|---
CIFAR-10 | 48 h | 18 h
CIFAR-100 | 50 h | 24 h
MNIST | 42 h | 14 h

**Table 5.** Error rate comparisons with state-of-the-art methods and recent works on architecture design search. We report results for CIFAR-10 and CIFAR-100 after applying data augmentation and results for MNIST without any data augmentation.

© 2017 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

## Share and Cite

**MDPI and ACS Style**

Albelwi, S.; Mahmood, A. A Framework for Designing the Architectures of Deep Convolutional Neural Networks. *Entropy* **2017**, *19*, 242.
https://doi.org/10.3390/e19060242
