Optimization of Deep Neural Networks Using SoCs with OpenCL
Abstract
1. Introduction
2. Evolutionary Optimization Method
- We have a set of samples and a number of clear targets for the problem that we wish to solve (in general, classification, regression or pattern matching).
- For each sample, we have several inputs. There is no clear evidence indicating whether each of these inputs is necessary or important.
- Finally, we do not know whether all of the available samples are suitable for the training, validation or testing of our neural network.
- Phase 1:
- Selection of the best inputs via evolutionary computation based on the delta test. When performing feature selection, it is important to avoid intervention from the neural network to ensure that the selection of the variables is independent of the network topology. Before conducting the current study, we investigated delta test optimization using genetic algorithms for regression problems with only one output [18]; however, this approach can also be extended to classification problems with multiple outputs. Because of the particular focus of this study, this phase of the methodology was not considered here. We are certain that different, more efficient methods could be used; however, in this experiment, our aim was to devise an orderly and efficient partitioning method for independent optimization and implementation.
- Phase 2:
- Filtering of the samples through replicator neural networks [19].
- Phase 3:
- Optimization of the neural network topology via heterogeneous evolutionary computation (this experimentation platform is implemented and evaluated in Section 4.2 of this article).
- Phase 4:
- Optimization of the initial neural network weights via evolutionary computation (this experimentation platform was implemented and evaluated in [20]).
- Phase 5:
- Final training of a neural network with the topology obtained in Phase 3 and the initial weights obtained in Phase 4.
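The delta test that drives the Phase 1 feature selection can be sketched in a few lines: it estimates the residual noise variance as half the mean squared difference between each target and the target of its nearest neighbour in input space, so a feature subset scores better when nearby inputs have nearby outputs. The function names below are ours, not the paper's:

```python
import numpy as np

def delta_test(X, y):
    """First-nearest-neighbour delta test: half the mean squared
    difference between each target and the target of its nearest
    neighbour in input space (an estimate of residual noise variance)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    # Pairwise squared Euclidean distances between all samples.
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    np.fill_diagonal(d2, np.inf)   # exclude self-matches
    nn = d2.argmin(axis=1)         # index of each sample's nearest neighbour
    return 0.5 * np.mean((y - y[nn]) ** 2)

def subset_score(X, y, mask):
    """Score a candidate feature subset: lower delta test is better."""
    return delta_test(X[:, mask], y)
```

An evolutionary algorithm can then minimise `subset_score` over binary masks without ever training a network, which is exactly the independence from the network topology that Phase 1 requires.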
- The set of samples after Phase 2 (filtering) was divided into four subsets: training, validation, optimization and testing. The training subset was used to train MLPs based on the fitness function of the evolutionary algorithm in Phases 3 and 4. The validation subset was used for early termination of the MLP training in Phases 3 and 4. The purpose of the optimization subset was to obtain the objective value(s) (for a single objective or multiple objectives) of the fitness function for each individual of the population. Finally, the test subset was used to evaluate the final optimized neural network. We followed the recommendation that the test subset should not be used for the identification of the best-performing trained neural network [21].
- In Phases 3 and 4, many MLP training runs were necessary. In our experiments for Phase 4 in [20], we used the resilient backpropagation algorithm. For the Phase 3 experiments presented in Section 4.2, the RMSprop algorithm was selected for this purpose based on two considerations: the speed of the algorithm and, more importantly, the ease of hardware implementation for all technologies used in the heterogeneous platform. RMSprop is a very effective, though formally unpublished, adaptive learning rate method; moreover, it shares with many other algorithms (e.g., Adam [22] and Adadelta [23]) the same characteristics of the forward and backward phases that are of interest to us for acceleration through OpenCL. To improve the generalization properties of the RMSprop algorithm, we applied early termination based on the validation subset.
- The final training run was performed using the Bayesian regularization algorithm [24] on the union of the training, validation and optimization subsets.
- In Phase 3 (topology optimization), we performed a multi-objective optimization in which the second objective of the evolutionary computation was to minimize the number of connections. This technique improved the generalizability of the resulting neural network [25].
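The per-weight update that makes RMSprop attractive for hardware reduces to one moving average and one normalised step. A minimal sketch (the hyperparameter values are our assumptions, not taken from the paper):

```python
import numpy as np

def rmsprop_step(w, grad, cache, lr=0.01, decay=0.9, eps=1e-8):
    """One RMSprop update: maintain an exponential moving average of
    the squared gradient and scale the step by its square root."""
    cache = decay * cache + (1.0 - decay) * grad ** 2
    w = w - lr * grad / (np.sqrt(cache) + eps)
    return w, cache

# Toy usage: minimise f(w) = w^2, whose gradient is 2w.
w, cache = np.array([5.0]), np.zeros(1)
for _ in range(2000):
    w, cache = rmsprop_step(w, 2.0 * w, cache)
```

Because the update touches each weight independently with the same fixed sequence of operations, it maps naturally onto the data-parallel OpenCL kernels discussed later.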
2.1. Optimization of the Topology
- Enormous computational effort is required; our work and further details on this computational effort are presented in Section 3.
- Control of random number generation, which is necessary for the initialization of the weights during training, is difficult. Without such control, the best individuals in the population may be lost, because a good individual can suffer a decrease in its fitness function value in subsequent generations of the evolutionary algorithm (see Section 2.1.1).
- It is difficult to determine the best method for extracting the best individuals when different initializations are averaged (see Section 2.1.2).
2.1.1. Control of Random Number Generation
- Definition of a random generator stream with a unique seed for each individual. Consequently, three variables were necessary for the evolutionary computations related to the neural network topology: the number of neurons in the first hidden layer, the number of neurons in the second hidden layer and the seed of the random generator stream.
- Resetting the number generator streams to reproduce the results for the best individuals.
- Utilization of different sub-streams with different weight initializations for each individual.
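These three measures can be sketched with NumPy's seed sequences, where spawned child sequences play the role of sub-streams (this is an illustration of the concept; the paper does not specify which random-stream library it uses, and `init_weights` is our placeholder name):

```python
import numpy as np

def init_weights(individual_seed, shapes, init_index=0):
    """Reproducible weight initialisation: each individual owns a root
    stream derived from its unique seed; spawned children act as
    sub-streams, one per weight initialisation."""
    root = np.random.SeedSequence(individual_seed)
    sub = root.spawn(init_index + 1)[init_index]  # k-th sub-stream
    rng = np.random.default_rng(sub)
    return [rng.standard_normal(s) * 0.1 for s in shapes]
```

Calling `init_weights` again with the same seed and index reproduces the identical weights, so the result of the best individual can be replayed exactly (resetting the stream), while different indices give independent initialisations for the same individual.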
2.1.2. Weight Initialization
- A fitness function with the goal of minimizing the minimum or the mean of the mean squared error (MSE) values obtained when training the same topology over a given number of executions of the RMSprop algorithm, each beginning with a different weight initialization. These alternatives are called GATOPOMIN and GATOPOMEAN, respectively.
- A fitness function with the goal of obtaining the best fitness function value for the evolutionary computation of the initial weights (as in Phase 4); in other words, the execution of a subordinate evolutionary computation for each individual as part of the main evolutionary computation. This alternative is called GATOPOGA or GATOPODE, depending on the evolutionary algorithm used in the second-level evolutionary computation: a genetic algorithm (GA) or differential evolution (DE) (see Figure 1). This scheme is based on the studies carried out in [27].
- A fitness function with the goal of minimizing the MSE of resilient BP when the initial weights have been established using auto-encoders [28]. This technique is very commonly used for the non-supervised training of deep neural networks (DNNs). This alternative is called GATOPODNN.
- With 50 different initializations, we would need 5000 RMSprop training runs in each generation.
- With 500 individuals considered in the second-level evolutionary computation (GA or DE), we would need 50,000 RMSprop training runs in each generation of the main genetic algorithm.
- With an MLP with three layers of weights, we would need 300 RMSprop training runs for two layers of MLP weights in each generation and 100 RMSprop training runs for three layers of MLP weights in each generation.
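The GATOPOMIN/GATOPOMEAN alternatives above can be expressed compactly: train the same topology from several initialisations and aggregate the resulting MSE values. A sketch (here `train_fn` stands in for a full RMSprop training run returning the optimization-subset MSE; it is our placeholder, not the paper's API):

```python
import numpy as np

def topology_fitness(train_fn, topology, n_inits=50, reduce="mean"):
    """GATOPOMEAN / GATOPOMIN style fitness: train one topology from
    n_inits different weight initialisations and aggregate the MSEs
    (mean for GATOPOMEAN, min for GATOPOMIN)."""
    mses = np.array([train_fn(topology, seed=k) for k in range(n_inits)])
    return float(mses.mean()) if reduce == "mean" else float(mses.min())
```

The per-generation run counts quoted above follow directly: with a population of 100 and 50 initialisations each, the inner loop executes 5000 training runs per generation, which is the workload the heterogeneous platform in Section 3 is designed to absorb.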
3. Heterogeneous Computational Platform
- Local individuals:
  - CPU cores
  - GPU processors
- Remote individuals on FPGAs:
  - FPGA0: soft embedded processor
  - FPGA1: soft embedded processor + soft embedded processor with neural instructions
  - FPGA2: soft embedded processor + programmable coprocessor in software (technological solution provided in [20])
  - FPGA3: soft embedded processor + reconfigurable coprocessor
- Remote individuals on FPGA-SoCs:
  - FPGA-SoC0: hard embedded processor
  - FPGA-SoC1: hard embedded processor + soft embedded processor with neural instructions
  - FPGA-SoC2: hard embedded processor + programmable coprocessor in software
  - FPGA-SoC3: hard embedded processor + reconfigurable coprocessor (technological solution provided in this paper)
- Remote individuals with OpenCL solutions:
  - FPGA-SoC4: hard embedded processor + reconfigurable kernel
  - FPGA4: soft embedded processor + reconfigurable kernel
  - CPU1 + PCI Express GPU kernel
  - CPU2 + PCI Express FPGA kernel

The main features of the OpenCL-based FPGA-SoC solution are as follows:

- Embedded hard cores (integrated ARM Cortex-A9 MPCore processor system)
- Runtime reconfiguration
- Partial reconfiguration
- Use of OpenCL solutions and, therefore, quasi-compatibility with other technologies (CPU-GPU)
3.1. Implementation of the Individuals
- Sending the training set for the application from the host of the main genetic algorithm to the remote board that calculates the individuals. This information must be sent to the remote board only once; if the number of generations of the main genetic algorithm is sufficiently large, the time required for this communication is usually negligible. The time spent on this task is denoted by $T_{ts}$.
- Transmitting the variables necessary for the evolutionary computation of the neural network topology: the number of neurons in the first hidden layer, the number of neurons in the second hidden layer and the seed of the random generator stream. This information must be sent for each new individual evaluated by the remote board. The time spent on this task is denoted by $T_{var}$.
- Sending the command from the host to the remote board to initiate the calculation of the fitness function for an individual. The time spent on this task is denoted by $T_{go}$.
- Performing a large number of training iterations for an ANN via RBP. The time required for this training can be decomposed into a fixed initialization time for the algorithm ($T_{ini}$) and the time required for a number of iterations ($E$) of a loop, commonly referred to as epochs; within each iteration of this loop, there is another set of iterations determined by the number of mini-batches ($B$) into which the training samples have been divided, and each mini-batch iteration is broken down into two main phases: the calculation of the partial derivative of the error with respect to each weight and the updating of the weights. The amounts of time spent on these phases are denoted by $T_{\Delta w}$ and $T_{upd}$, respectively. After training, we must calculate the fitness values for the test set ($T_{fit}$).
- Sending the response for each individual from the remote board to the host to signal the end of the fitness function calculation and communicate the calculated value. The time spent on this task is denoted by $T_{res}$.
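Putting these contributions together (with symbols of our own choosing, since the paper's notation was lost in extraction: $T_{ts}$ for the one-off training-set transfer, amortised here over the $N$ individuals evaluated on the board; $T_{var}$, $T_{go}$ and $T_{res}$ for the per-individual communication steps; $T_{ini}$ for training initialisation; $E$ epochs of $B$ mini-batches, each costing derivative-calculation time $T_{\Delta w}$ plus weight-update time $T_{upd}$; and $T_{fit}$ for the test-set evaluation), the time to evaluate one individual is approximately:

```latex
T_{ind} \approx \frac{T_{ts}}{N} + T_{var} + T_{go} + T_{ini}
        + E \, B \,\bigl(T_{\Delta w} + T_{upd}\bigr) + T_{fit} + T_{res}
```

The dominant term is $E\,B\,(T_{\Delta w} + T_{upd})$, which is why the kernels in Sections 3.1.1 and 3.1.2 target precisely those two phases.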
3.1.1. Calculation of Weight Changes
- Forward step: apply the pattern to the input layer and propagate the signal forward through the network until the outputs $y_i^l$ have been calculated for each $i$ (the neuron index) and $l$ (the layer index), with $net_i^l = \sum_j w_{ij}^l \, y_j^{l-1}$:
$$y_i^l = f\bigl(net_i^l\bigr)$$
- Backward Step 1: compute the delta values $\delta_i^L$ for the output layer $L$ and $\delta_i^l$ for the preceding layers by propagating the errors backwards using:
$$\delta_i^L = f'\bigl(net_i^L\bigr)\,\bigl(y_i^L - t_i\bigr), \qquad \delta_i^l = f'\bigl(net_i^l\bigr) \sum_k \delta_k^{l+1} \, w_{ki}^{l+1}$$
- Backward Step 2: once the delta errors have been obtained in Backward Step 1, we can simultaneously obtain the accumulated partial derivative of the error with respect to each weight as follows:
$$\frac{\partial E}{\partial w_{ij}^l} \mathrel{+}= \delta_i^l \, y_j^{l-1}$$
3.1.2. Update of Weights
- To the right are the equations involved, as listed above.
- To the left is an identification that allows us to understand which of these operations are implemented by our OpenCL kernel in each of the three main versions developed, which we now detail.
3.2. Version 1: Algorithm with Kernel of Type Matrix Multiplication
Algorithm 1: RMSprop backpropagation.
3.3. Version 2: Introduction of Non-Linear Function in the Kernel
3.4. Version 3: Introduction of the Derivative of the Tangent and Internal Storage of Variables in the Kernel
4. Performance Evaluation
4.1. ARM Versus ARM + FPGA Kernel
4.1.1. Preliminary Comments
- Use single-instruction, multiple-data (SIMD) techniques for vectorized data load/store. We use `reqd_work_group_size` and `num_simd_work_items` as kernel attributes; multiple work-items from the same work-group will run at the same time. In all versions, with regard to kernel parameterization, we consider the following attributes:
  - `reqd_work_group_size`: four sizes of two-dimensional work-groups (4 × 4, 8 × 8, 16 × 16 and 32 × 32), which correspond to the size of the sub-matrices (tiles) into which each of the arrays is divided (see Figure 4b).
  - `num_simd_work_items`: three values (1, 2 and 4).
- Replicate the compute resources on the remote device (FPGA). We use `num_compute_units` as a kernel attribute; multiple work-groups will run at the same time.
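The effect of the work-group tiling can be illustrated in plain Python: each tile-sized work-group computes one sub-matrix of the result, staging tiles of both operands (the role local memory plays on the FPGA). This is a NumPy illustration of the blocking scheme, not the actual OpenCL kernel:

```python
import numpy as np

def tiled_matmul(A, B, tile=16):
    """Block (tiled) matrix multiplication: each (tile x tile) block of
    C is accumulated from tile-sized slices of A and B, mirroring how
    an OpenCL work-group stages tiles in local memory."""
    n, k = A.shape
    k2, m = B.shape
    assert k == k2, "inner dimensions must match"
    C = np.zeros((n, m))
    for i in range(0, n, tile):
        for j in range(0, m, tile):
            for p in range(0, k, tile):  # walk the tiles along k
                C[i:i+tile, j:j+tile] += (
                    A[i:i+tile, p:p+tile] @ B[p:p+tile, j:j+tile]
                )
    return C
```

On the device, the tile size (the `reqd_work_group_size` choice above) trades local-memory usage against the number of global-memory accesses per output element.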
4.1.2. Analysis of Hardware Results
4.1.3. Analysis of Performance Results
4.2. Phase 3 Heterogeneous Platform
4.2.1. Preliminary Comments
- FPGA device:
  - Cyclone V SoC 5CSEMA5F31C6 device
  - Dual-core ARM Cortex-A9 (HPS)
  - 85 K programmable logic elements
  - 4450 Kbits embedded memory
  - 6 fractional PLLs
  - 2 hard memory controllers
- Memory device:
  - 64-MB (32 M × 16) SDRAM on FPGA
  - 1-GB (2 × 256 M × 16) DDR3 SDRAM on HPS, shared with the FPGA. OpenCL kernels access this shared physical memory through a direct connection to the HPS DDR hard memory controller.
4.2.2. Performance Efficiency
5. Conclusions
Author Contributions
Funding
Conflicts of Interest
References
- Farmahini-Farahani, A.; Vakili, S.; Fakhraie, S.M.; Safari, S.; Lucas, C. Parallel scalable hardware implementation of asynchronous discrete particle swarm optimization. Eng. Appl. Artif. Intell. 2010, 23, 177–187. [Google Scholar] [CrossRef]
- Curteanu, S.; Cartwright, H. Neural networks applied in chemistry. I. Determination of the optimal topology of multilayer perceptron neural networks. J. Chem. 2011, 25, 527–549. [Google Scholar] [CrossRef]
- Islam, M.M.; Sattar, M.A.; Amin, M.F.; Yao, X.; Murase, K. A New Adaptive Merging and Growing Algorithm for Designing Artificial Neural Networks. IEEE Trans. Syst. Man Cybern. Part B Cybern. 2009, 39, 705–722. [Google Scholar] [CrossRef] [PubMed]
- Han, K.H.; Kim, J.H. Quantum-inspired evolutionary algorithms with a new termination criterion, H-epsilon gate, and two-phase scheme. IEEE Trans. Evol. Comput. 2004, 8, 156–169. [Google Scholar] [CrossRef]
- Leung, F.H.F.; Lam, H.K.; Ling, S.H.; Tam, P.K.S. Tuning of the structure and parameters of a neural network using an improved genetic algorithm. IEEE Trans. Neural Netw. 2003, 14, 79–88. [Google Scholar] [CrossRef] [PubMed]
- Tsai, J.T.; Chou, J.H.; Liu, T.K. Tuning the structure and parameters of a neural network by using hybrid Taguchi-genetic algorithm. IEEE Trans. Neural Netw. 2006, 17, 69–80. [Google Scholar] [CrossRef] [PubMed]
- Ludermir, T.B.; Yamazaki, A.; Zanchettin, C. An optimization methodology for neural network weights and architectures. IEEE Trans. Neural Netw. 2006, 17, 1452–1459. [Google Scholar] [CrossRef] [PubMed]
- Palmes, P.P.; Hayasaka, T.; Usui, S. Mutation-based Genetic Neural Network. IEEE Trans. Neural Netw. 2005, 16, 587–600. [Google Scholar] [CrossRef] [PubMed]
- Niu, B.; Li, L. A Hybrid Particle Swarm Optimization for Feed-Forward Neural Network Training. In Advanced Intelligent Computing Theories and Applications: With Aspects of Artificial Intelligence; Huang, D.-S., Wunsch, D.C., Levine, D.S., Jo, K.-H., Eds.; Springer: Berlin/Heidelberg, Germany, 2008; pp. 494–501. [Google Scholar]
- Lu, T.-C.; Yu, G.-R.; Juang, J.-C. Quantum-Based Algorithm for Optimizing Artificial Neural Networks. IEEE Trans. Neural Netw. Learn. Syst. 2013, 24, 1266–1278. [Google Scholar]
- Garro, B.A.; Vazquez, R.A. Designing Artificial Neural Networks Using Particle Swarm Optimization Algorithms. Comput. Intell. Neurosci. 2015, 2015, 369298. [Google Scholar] [CrossRef] [PubMed]
- Young, S.R.; Rose, D.C.; Karnowski, T.P.; Lim, S.-H.; Patton, R.M. Optimizing Deep Learning Hyper-parameters Through an Evolutionary Algorithm. In Proceedings of the MLHPC ’15 Workshop on Machine Learning in High-Performance Computing Environments, Austin, TX, USA, 15 November 2015; ACM: New York, NY, USA, 2015; p. 5. [Google Scholar] [CrossRef]
- Miikkulainen, R.; Liang, J.-Z.; Meyerson, E.; Rawal, A.; Fink, D.; Francon, O.; Raju, B.; Shahrzad, H.; Navruzyan, A.; Duffy, N.; et al. Evolving Deep Neural Networks. arXiv, 2017; arXiv:1703.00548. [Google Scholar]
- Real, E.; Moore, S.; Selle, A.; Saxena, S.; Suematsu, Y.L.; Le, Q.V.; Kurakin, A. Large-Scale Evolution of Image Classifiers. arXiv, 2017; arXiv:1703.01041. [Google Scholar]
- Baker, B.; Gupta, O.; Naik, N.; Raskar, R. Designing Neural Network Architectures Using Reinforcement Learning. arXiv, 2016; arXiv:1611.02167. [Google Scholar]
- Yao, X. Evolving artificial neural networks. Proc. IEEE 1999, 87, 1423–1447. [Google Scholar]
- Yao, X.; Liu, Y. A new evolutionary system for evolving artificial neural networks. IEEE Trans. Neural Netw. 1997, 8, 694–713. [Google Scholar] [CrossRef] [PubMed]
- Mateo, F.; Sovilj, D.; Gadea-Gironés, R. Approximate k-NN delta test minimization method using genetic algorithms: Application to time series. Neurocomputing 2010, 73, 2017–2029. [Google Scholar] [CrossRef]
- Hawkins, S.; He, H.; Williams, G.; Baxter, R. Outlier Detection Using Replicator Neural Networks. In Proceedings of the Fifth International Conference on Data Warehousing and Knowledge Discovery (DaWaK02), Toulouse, France, 2 September 2002; pp. 170–180. [Google Scholar]
- Fe, J.D.; Aliaga, R.J.; Gadea-Gironés, R. Evolutionary Optimization of Neural Networks with Heterogeneous Computation: Study and Implementation. J. Supercomput. 2015, 71, 2944–2962. [Google Scholar] [CrossRef]
- Prechelt, L. PROBEN1—A Set of Neural Network Benchmark Problems and Benchmarking Rules; Technical Report; The University of Karlsruhe: Karlsruhe, Germany, 1994. [Google Scholar]
- Kingma, D.P.; Jimmy, B. Adam: A Method for Stochastic Optimization. arXiv, 2014; arXiv:1412.6980. [Google Scholar]
- Zeiler, M.D. ADADELTA: An Adaptive Learning Rate Method. arXiv, 2012; arXiv:1212.5701. [Google Scholar]
- MacKay, D.J.C. Bayesian Interpolation. Neural Comput. 1991, 4, 415–447. [Google Scholar] [CrossRef]
- Abbass, H.A. An Evolutionary Artificial Neural Networks Approach for Breast Cancer Diagnosis. Artif. Intell. Med. 2002, 25, 265–281. [Google Scholar] [CrossRef]
- Ahmad, F.; Isa, N.A.M.; Hussain, Z.; Sulaiman, S.N. A genetic algorithm-based multi-objective optimization of an artificial neural network classifier for breast cancer diagnosis. Neural Comput. Appl. 2013, 23, 1427–1435. [Google Scholar] [CrossRef]
- Sarangi, P.P.; Sahu, A.; Panda, M. Article: A Hybrid Differential Evolution and Back-Propagation Algorithm for Feedforward Neural Network Training. Int. J. Comput. Appl. 2013, 84, 1–9. [Google Scholar]
- Bengio, Y.; Lamblin, P.; Popovici, D.; Larochelle, H. Greedy layer-wise training of deep networks. Adv. Neural Inf. Process. Syst. 2007, 19, 153. [Google Scholar]
- Sankaradas, M.; Jakkula, V.; Cadambi, S.; Chakradhar, S.; Durdanovic, I.; Cosatto, E.; Graf, H.P. A Massively Parallel Coprocessor for Convolutional Neural Networks. In Proceedings of the ASAP 2009 20th IEEE International Conference on Application-Specific Systems, Architectures and Processors, Boston, MA, USA, 7–9 July 2009; pp. 53–60. [Google Scholar] [CrossRef]
- Prado, R.N.A.; Melo, J.D.; Oliveira, J.A.N.; Neto, A.D.D. FPGA based implementation of a Fuzzy Neural Network modular architecture for embedded systems. In Proceedings of the 2012 International Joint Conference on Neural Networks (IJCNN), Brisbane, Australia, 10–15 June 2012; pp. 1–7. [Google Scholar] [CrossRef]
- Çavuşlu, M.; Karakuzu, C.; Şahin, S.; Yakut, M. Neural network training based on FPGA with floating point number format and it’s performance. Neural Comput. Appl. 2011, 20, 195–202. [Google Scholar] [CrossRef]
- Wu, G.-D.; Zhu, Z.-W.; Lin, B.-W. Reconfigurable back propagation based neural network architecture. In Proceedings of the 2011 13th International Symposium on Integrated Circuits (ISIC), Singapore, 12–14 December 2011; pp. 67–70. [Google Scholar] [CrossRef]
- Pinjare, S.L.; Arun Kumar, M. Implementation of Neural Network Back Propagation Training Algorithm on FPGA. Int. J. Comput. Appl. 2012, 52, 1–7. [Google Scholar]
- Zhang, C.; Li, P.; Sun, G.; Guan, Y.; Xiao, B.; Cong, J. Optimizing FPGA-based Accelerator Design for Deep Convolutional Neural Networks. In Proceedings of the FPGA ’15 2015 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, Monterey, CA, USA, 22–24 February 2015; ACM: New York, NY, USA, 2015; pp. 161–170. [Google Scholar] [CrossRef]
- Wang, C.; Gong, L.; Yu, Q.; Li, X.; Xie, Y.; Zhou, X. DLAU: A Scalable Deep Learning Accelerator Unit on FPGA. IEEE Trans. Comput. Aided Des. Integr. Circuits Syst. 2017, 36, 513–517. [Google Scholar] [CrossRef]
- Zhao, R.; Song, W.; Zhang, W.; Xing, T.; Lin, J.-H.; Srivastava, M.; Gupta, R.; Zhang, Z. Accelerating Binarized Convolutional Neural Networks with Software-Programmable FPGAs. In Proceedings of the FPGA ’17 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, Monterey, CA, USA, 22–24 February 2017; ACM: New York, NY, USA, 2017; pp. 15–24. [Google Scholar] [CrossRef]
- Bettoni, M.; Urgese, G.; Kobayashi, Y.; Macii, E.; Acquaviva, A. A Convolutional Neural Network Fully Implemented on FPGA for Embedded Platforms. In 2017 New Generation of CAS (NGCAS); IEEE: Piscataway, NJ, USA, 2017; pp. 49–52. [Google Scholar] [CrossRef]
- Qiao, Y.; Shen, J.; Huang, D.; Yang, Q.; Wen, M.; Zhang, C. Optimizing OpenCL Implementation of Deep Convolutional Neural Network on FPGA. In Network and Parallel Computing; Shi, X., An, H., Wang, C., Kandemir, M., Jin, H., Eds.; Springer International Publishing: Cham, Switzerland, 2017; pp. 100–111. [Google Scholar]
- Suda, N.; Chandra, V.; Dasika, G.; Mohanty, A.; Ma, Y.; Vrudhula, S.; Seo, J.-S.; Cao, Y. Throughput-Optimized OpenCL-based FPGA Accelerator for Large-Scale Convolutional Neural Networks. In Proceedings of the FPGA ’16 2016 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, Monterey, CA, USA, 21–23 February 2016. [Google Scholar] [CrossRef]
- Zhu, M.; Liu, L.; Wang, C.; Xie, Y. CNNLab: A Novel Parallel Framework for Neural Networks using GPU and FPGA-a Practical Study with Trade-off Analysis. arXiv, 2016; arXiv:1606.06234. [Google Scholar]
- Gadea-Gironés, R.; Herrero, V.; Sebastia, A.; Salcedo, A.-M. The Role of the Embedded Memories in the Implementation of Artificial Neural Networks. In Proceedings of the FPL ’00 The Roadmap to Reconfigurable Computing, 10th International Workshop on Field-Programmable Logic and Applications, Villach, Austria, 27–30 August 2000; Springer: London, UK, 2000; pp. 785–788. [Google Scholar]
| | LUTs | Registers | Memory Blocks (10 kbits) | DSP Blocks | Latency (Cycles) | Power Dissipation (mW) | Frequency (MHz) |
|---|---|---|---|---|---|---|---|
| OpenCL tanh | 4511 | 9127 | 49 | 25 | 45 | 202.86 | 136.67 |
| RTL IP tanh | 161 | 163 | 10 | 0 | 4 | 18.98 | 133.07 |
| | Version 1 | Version 2 | Version 3 |
|---|---|---|---|
| reqd_work_group_size | (16,16,1) | (16,16,1) | (16,16,1) |
| num_simd_work_items | 4 | 4 | 1 |
| num_compute_units | 1 | 1 | 2 |
| Logic utilization (in ALMs) | 22,995/32,070 (72%) | 23,500/32,070 (73%) | 22,293/32,070 (70%) |
| Total registers | 51,357 | 52,259 | 52,011 |
| Total block memory bits | 1,323,804/4,065,280 (33%) | 1,653,928/4,065,280 (41%) | 1,770,586/4,065,280 (44%) |
| Total DSP blocks | 72/87 (83%) | 72/87 (83%) | 48/87 (55%) |
| HPS power | 1392.92 mW | 1392.92 mW | 1392.92 mW |
| FPGA and HPS power | 2524.70 mW | 2959.60 mW | 3147.25 mW |
© 2018 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).
Gadea-Gironés, R.; Colom-Palero, R.; Herrero-Bosch, V. Optimization of Deep Neural Networks Using SoCs with OpenCL. Sensors 2018, 18, 1384. https://doi.org/10.3390/s18051384