Embedded Deep Learning Prototyping Approach for Cyber-Physical Systems: Smart LIDAR Case Study

Abstract: Cyber-Physical Systems (CPSs) are a mature research topic at the intersection of Artificial Intelligence (AI) and Embedded Systems (ES). They interact with the physical world via sensors and actuators to solve problems in several application domains (robotics, transportation, health, etc.). These CPSs rely on data analysis, which requires powerful algorithms combined with robust hardware architectures. On one hand, Deep Learning (DL) is proposed as the main solution algorithm. On the other hand, the standard design and prototyping methodologies for ES are not adapted to modern DL-based CPS. In this paper, we investigate AI design for CPS around embedded DL. The contribution of this work is threefold: (1) we define an embedded DL methodology based on a Multi-CPU/FPGA platform; (2) we propose a new hardware design architecture of a Neural Network Processor (NNP) for DL algorithms, whose feed-forward computation time is estimated at 23 ns per parameter; (3) we validate the proposed methodology and the DL-based NNP on a smart LIDAR application use case. The input of our NNP is a voxel grid computed in hardware from a 3D point cloud. Finally, the results show that our NNP is able to process a Dense Neural Network (DNN) architecture without bias.


Introduction
Nowadays, Cyber-Physical Systems (CPS) interact with the physical world by analyzing their environment using a variety of sensors. For this purpose, a powerful analysis tool is needed, such as Artificial Intelligence (AI), or more precisely Deep Learning (DL) algorithms. DL technologies have become a hot topic for solving problems such as data analytics and object recognition [1]. Since the late 20th century, they have evolved substantially and are now applied in many different fields related to computer science and engineering, such as CPS [2,3]. However, with increasing accuracy requirements and the complexity of Neural Network (NN) architectures, DL technologies are known to need a lot of computational power, mostly because of their huge number of parameters. Unlike distributed cloud computing, where abundant processing power is available, embedded systems impose restrictions on the use of DL technologies. Even when optimizing/compressing NNs or using embedded Graphics Processing Units (GPUs), further optimization is still possible through specialized processing systems [4,5]. Additionally, to build an application using specialized hardware processing for NNs (e.g., FPGA [Field-Programmable Gate Array] or ASIC [Application-Specific Integrated Circuit]), a complete design methodology for embedded DL is needed in order to speed up prototyping. In this paper, we introduce a new methodology for smart applications in CPS around DL technologies. We present and share the design of a hardware Neural Network Processor (NNP). We validate the methodology with a smart LIDAR (LIght Detection And Ranging) application case study. The new embedded DL methodology is oriented toward a hybrid CPU/FPGA-based design in order to simplify prototyping.
In this work, we share our experiences and the difficulties encountered while developing a smart LIDAR application for pedestrian detection to validate the proposed methodology. This paper is structured as follows: Section 2 presents the related works, Section 3 describes the proposed design methodology, Section 4 gives details about the NNP architecture design, Section 5 presents the experimentation results, Section 6 is dedicated to the discussion and analysis. Finally, Section 7 concludes this work.

Related Works
This work deals with two main research topics: (1) platform-based design and prototyping of deep neural network accelerators for efficient DL processing and (2) DL approaches for 3D object classification and detection using 3D sensors (e.g., LIDAR and 3D cameras). In this section, we give an overview and highlight (to the best of our knowledge) the related works on these two topics.

FPGA-Based Design and Prototyping of Deep Neural Network Accelerator
Platform-based design and prototyping have been proposed as solutions to the time-to-market and design cost problems in circuits and systems design, e.g., Pinto et al. [6], and this knowledge needs to be updated toward CPS using deep neural network accelerators. Platform-based design in the context of CPS was already addressed by Nuzzo et al. [7], who proposed an approach to abstract the CPS design flow. Lacey et al. [8] presented the evolution of DL using FPGAs. They described different tools to design a DL accelerator on an FPGA platform, from high-level abstraction tools to deep learning frameworks. Sze et al. [9] presented a tutorial and survey about DNNs and hardware for DNN processing, covering efficient ways to implement co-designed DNN processing using various optimizations. Li et al. [10] presented a survey about general-purpose processors (GPP) for neural network processing, with a specific spotlight on the DianNao series of accelerators. Abdelouahab et al. [11] presented a survey about FPGA Convolutional Neural Network (CNN) accelerators, focusing mostly on algorithm and data management optimizations. Guo et al. [12] surveyed FPGA neural network accelerators and summarized the different techniques used, showing that the FPGA is a promising platform for neural network acceleration. Li et al. [13] proposed a model-based design methodology involving deep NNs, together with an integrated set of tools and libraries to assist designers of signal processing systems. Shawahna et al. [14] presented a survey about FPGA-based accelerators for DL networks, in particular CNNs, and tried to isolate a methodology for their design. Their survey revealed a specific pattern for FPGA-based accelerated NN architectures, which is presented alongside techniques to optimize and automate the design.
Those works were mainly focused on the design and prototyping of a deep neural network accelerator; however, there is a lack of a standard and global methodology that takes into account the design of NNP/DL in the context of an embedded application. Our work is more focused on a methodology for prototyping an embedded DL application on a hybrid CPU/FPGA platform rather than just a NN accelerator. Therefore, our interest revolves around the design and integration of deep neural network accelerators for CPS-based DL application using hybrid CPU/FPGA-based design and prototyping.

3D Object Classification in Deep Learning
Three-dimensional object classification is a hot topic considering current sensors such as LIDAR or 3D cameras, and DL may help reach high accuracy in the classification of 3D objects. Maturana and Scherer [15] proposed a 3D CNN using voxels as input, together with a way to convert a point cloud to a voxel model. Brock et al. [16] proposed a voxel-based auto-encoder and CNN to generate and classify 3D objects. Huang and You [17] proposed different 3D CNN architectures using voxels as input to classify objects. Qi et al. [18] proposed a deep learning architecture to directly classify and segment point clouds instead of voxels. Zhi et al. [19] proposed a lightweight version of a 3D CNN. A three-dimensional volumetric binary voxel grid thus appears to be a fine way to process 3D data for pattern prediction and object classification, but despite the accuracy it can yield, it remains heavy on computing power.
To our knowledge, most of the proposed works lack results and precise information about the reproducibility of their experiments. In this paper, we try to share as much experimental data as possible regarding our methodology, NNP implementation, and performance, as well as the source code.

Proposed Methodology
In this section, we propose an embedded DL-based methodology for an FPGA-based CPS platform design using a hardware NNP.
The first step of our methodology was the definition of the system with its requirements and architecture. Then, different software algorithms were designed for data processing and DL. Those algorithms were hardware accelerated using High Level Synthesis (HLS) software tools or designed from scratch with a Hardware Description Language (HDL). Finally, the hardware accelerators were synthesized and uploaded onto a hardware prototype to be tested. Across those steps, several hidden tasks were present, from data processing and data management to the configuration of the hardware prototype. The goal of our methodology was to mitigate those hidden tasks, either through simplification or automation.
Our approach toward making a smart application for CPS was built around an FPGA-based DL methodology using an NNP. This methodology was divided into four parts (Figure 1): hardware platform, hardware acceleration, embedded processing, and DL software. The transition between each part was as follows: the DL weight matrices were extracted and transferred to the hardware platform, and the embedded processing was hardware accelerated. The development and use of the NNP as a part of the methodology was an important step in order to handle the DL processing. The description of the methodology was made with a top-to-bottom approach, by disassembling the different tasks needed to make a prototype and by explaining our design process in order to share our experiences. A design flow detailing the approach of our proposed methodology is presented in Figure 2. It shows the different steps of the four parts of the methodology and indicates how the parts are connected to each other. Our proposed methodology for embedded DL differs from standard ones because of its constraints. The differences mostly concern the integration of the NNP and its associated configuration software. Figure 3 illustrates the implementation of our methodology. Our DL processing was based on a hardware NNP that we created. In order to create it, we first designed it as software using an event-driven simulation framework (SystemC) and then migrated it to a hardware accelerator using HLS tools. This NNP can be considered a fully functional Intellectual Property (IP) block to be integrated alongside the other IPs from the hardware-accelerated embedded processing. This processor uses the extracted weight matrices, which come from the offline DL training and testing.

Hardware Prototype
One contribution of this work is to develop a real prototype by acquiring physical data from the real world and by processing it in order to obtain an accurate analysis. This analysis was conducted by transforming physical data into specific features that were classified with a DL approach. The hardware prototype needed to host an embedded processing application to transform physical data into features for an NN so that the data could be classified by our DL application (NNP). Hence, the hardware prototype hosted an FPGA bitstream containing the hardware-accelerated embedded processing that calculates features from data, as well as the NNP IP classifying those data. We also needed to set up the hardware prototype with an OS and a devicetree in order to execute the FPGA control software. The hardware prototype setup was automated with an automation tool that we developed for this purpose [20]. This tool deploys a bootloader (U-Boot) and a First Stage Bootloader (FSBL) to bring up the first stages of the platform. The tool also deploys a Unix kernel with its initial ramdisk, preconfigured system files, and a generic devicetree to gain access to all components on the hardware prototype. Once set up, the hardware prototype needed software to control the FPGA processing and to transmit data between the different processing elements; most data reside in the platform DRAM, which acts as the interface between the different processing components, such as the CPU and the FPGA.

Hardware Acceleration
The embedded processing was built as an FPGA hardware IP. The hardware accelerator development was simplified using HLS software tools or HDL, but data management remained a sensitive part of the development because of the FPGA constraints. In this methodology, we considered that data are received and transmitted as a FIFO (First In, First Out) queue in order to simplify data management, even if this may mean extra calculation for processing tasks. This led to the embedded processing application receiving data from the sensors and directly transmitting the processed information to the NNP. It also meant that the NNP received its data (input vector and weight matrices) as a FIFO queue and needed to compute the classification as the data transmission progressed. The embedded processing algorithms needed to be tweaked in order to compute FIFO-transmitted data and to use the smallest amount of internal cache (BRAM) possible, since an FPGA does not have infinite internal memory. We mainly considered the usage of HLS software to synthesize the embedded processing software from a High Level Language (HLL) to the Register-Transfer Level (RTL), thus smoothing the software-to-hardware transfer.

Embedded Processing
The data perceived by the CPS should be processed so that the NNP can use it correctly. This is necessary for two main reasons: (1) the data needs to be transformed so the NN can handle it and (2) computing some features beforehand decreases the size of the neural network. The main constraint concerns data management: with data as a FIFO queue, most algorithms need to be redesigned in order to use as little memory cache as possible.

Deep Learning Software
A common method to make a DL application is to use specific tools to train and test NN architectures with a dataset. In this work, we considered the NN architecture as already defined and the offline training as already conducted. The weights were then extracted to be used directly by the NNP embedded in the hardware platform. In this methodology, we considered the weights as extracted into a binary file containing the weight matrices between all layers.
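As an illustration, the extraction step can be as simple as flattening each layer's weight matrix into one binary file that the configuration software later reads back given the topology. This is a hypothetical sketch (the file name and layer sizes are ours, not from the paper):

```python
import os
import tempfile

import numpy as np

def export_weights(weight_matrices, path):
    """Flatten each layer's weight matrix to float32 and concatenate
    them into a single binary file (one file for the whole network)."""
    flat = np.concatenate([w.astype(np.float32).ravel() for w in weight_matrices])
    flat.tofile(path)

# Toy no-bias DNN with layers 4 -> 8 -> 3 (illustrative sizes only)
rng = np.random.default_rng(0)
weights = [rng.standard_normal((4, 8)), rng.standard_normal((8, 3))]
path = os.path.join(tempfile.gettempdir(), "nnp_weights.bin")
export_weights(weights, path)

# Knowing the topology, the matrices can be recovered from the raw file:
loaded = np.fromfile(path, dtype=np.float32)
w1 = loaded[: 4 * 8].reshape(4, 8)
w2 = loaded[4 * 8 :].reshape(8, 3)
```

Since the file is raw float32 data, the reader must know the layer dimensions in advance, which is exactly what the configuration software derives from its configuration file.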

Neural Network Processor Implementation
The NNP was designed to simplify the integration of DL in embedded CPS applications. To keep this processor simple, some constraints were defined: process the simplest NN architecture (fully connected NN without bias) with as few activation functions as possible, process any number of layers independently of their depth and width, and be re-configurable at runtime. A no-bias architecture was chosen here because bias calculation needs more computational power and time.
Computing a fully connected NN, also called a Dense Neural Network (DNN), is mainly a matter of matrix calculation. The main implementation problem is not the computation but the data management. Weight matrices need to be loaded from external memory and then transmitted to the FPGA, both to allow for different configurations and because the FPGA memory cache (BRAM) is limited compared to the size of today's weight matrices, which can reach dozens of megabits when using 32-bit floating-point values. However, to optimize data transmission, each hidden layer's output needs to be kept in the FPGA cache. Moreover, weights need to be sorted correctly depending on the scheduling of the neuron processing units. In this implementation, inputs and weights are floating-point numbers, and no compression is currently done.
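For reference, the feed-forward computation such a no-bias DNN performs reduces to a chain of matrix products, each followed by an activation. A minimal NumPy sketch (layer sizes are illustrative, not the paper's topology):

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def softmax(z):
    e = np.exp(z - z.max())  # subtract max for numerical stability
    return e / e.sum()

def forward(x, weight_matrices):
    """Feed-forward pass of a fully connected (dense) NN without bias:
    every layer is a single matrix product followed by an activation."""
    a = x
    for w in weight_matrices[:-1]:
        a = relu(a @ w)                      # hidden layers
    return softmax(a @ weight_matrices[-1])  # output layer

rng = np.random.default_rng(1)
ws = [rng.standard_normal((32, 16)), rng.standard_normal((16, 10))]
y = forward(rng.standard_normal(32), ws)
```

The absence of a bias vector is what makes each layer a pure matrix product, which is why the NNP design reduces to streaming multiply-accumulate operations.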
In order to correctly control and configure the NNP, a configuration software was made whose main tasks are loading the binary files containing the weight matrices into the DRAM, preparing the instructions for the NNP depending on each weight matrix dimension, and initializing the DMA transmissions.

Neural Network Processor Architecture
First, we detail the hardware architecture of the NNP and how it calculates layers and neurons. Figure 4 represents the different parts of the processor and the communication interfaces between them. There are four communication channels with the NNP, each carrying different data: the input vector, which comes directly from the embedded processing; the instructions and the weight matrix, which are loaded from the external DRAM; and the output vector, which is written to the external DRAM.

Scheduler Module
The scheduler module loads all of the instructions from the DRAM to know how many weights and inputs should be loaded and which activation function to use for each layer. Each instruction represents information about one layer and is coded as a 64-bit word containing three pieces of information: the number of neurons in the previous layer (30 bits), the number of neurons in the current layer (30 bits), and the activation function of the layer (4 bits).
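This 64-bit layout can be modelled with simple bit packing; note that only the field widths come from the text, so the field order below is our assumption:

```python
def pack_layer_instruction(n_prev, n_curr, act):
    """Pack one layer descriptor into a 64-bit word: previous-layer
    size (30 bits), current-layer size (30 bits), activation id (4 bits).
    The field order is assumed; only the widths are specified in the text."""
    assert n_prev < (1 << 30) and n_curr < (1 << 30) and act < (1 << 4)
    return (n_prev << 34) | (n_curr << 4) | act

def unpack_layer_instruction(word):
    return ((word >> 34) & 0x3FFFFFFF,  # previous-layer size
            (word >> 4) & 0x3FFFFFFF,   # current-layer size
            word & 0xF)                 # activation function id

word = pack_layer_instruction(784, 128, 1)
fields = unpack_layer_instruction(word)
```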
Once all instructions are loaded, the input vector is read from the data processing IP (Intellectual Property) into a local cache and the processing is started. For each layer, each neuron is represented by a neuron processing unit, also called a core, and thus each neuron is computed individually. The scheduler starts a core by sending it specific instructions, which are not the same as the ones received by the scheduler module. Then, each input and weight connected to the computed neuron is sent. Once the core has finished the calculation, the output is returned to the scheduler, which stores it in the local cache (BRAM) to be used for the next layer. Once all neurons of the layer are computed, the output vector is used as the input vector of the next layer and the process starts again. Once the last layer is reached, the output vector of the NNP is written to the external DRAM. Algorithm 1 summarizes the scheduler module process. Given the 30-bit neuron counts in the scheduler module instructions, a layer could in principle have over 1 billion nodes. However, there is a hard limit on the scheduler module cache (BRAM) for resource utilization purposes, which means that a layer cannot be larger than 65,536 nodes. The number of instructions that can be loaded in the scheduler is set to 512, which makes the instruction buffer 4096 bytes. Thus, the scheduler module uses at least 528,384 bytes of FPGA memory cache. The resources used for the hardware scheduler component are shown in Table 1. Each neuron processing unit calculates one neuron at a time. It takes instructions from the scheduler module; each instruction is a 34-bit word containing two pieces of information: the number of inputs and weights (30 bits) and the activation function to be used (4 bits). When an instruction is received, the processing engine module starts listening for weights and inputs.
Each time a pair of inputs and weights is received, they are multiplied and the product is added to the running sum. Once all inputs and weights for one neuron are received, the activation function is applied and the result is sent to the output. Algorithm 2 summarizes the neuron processing unit.

Algorithm 2: Neuron processing unit algorithm.
Data: Weights, Inputs, Instructions
Result: Output
instructionCache ← read one instruction from scheduler;
sum ← 0;
repeat
    input ← read input from scheduler;
    weight ← read weight from scheduler;
    sum ← sum + input × weight;
until all weights and inputs are received;
/* the activation function is defined by the instruction */
output ← activationFunction(sum);
write output to scheduler;

As said before, the activation functions are limited to four: relu, linear, sigmoid, and softmax. Relu, linear, and sigmoid are computed directly inside the neuron processing unit. In the case of softmax, however, the exponential is computed in the neuron processing unit, while the division by the sum of the output vector is performed inside the scheduler module, because only the scheduler has access to the whole output vector. All the SystemC source code we wrote for the NNP is available online [21]. A testbench is available to load datasets and neural network models. The synthesis of each of those components was done using Vivado HLS [22]. The resources used for the hardware neuron processing component are shown in Table 2.
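The split of the softmax between the cores and the scheduler can be modelled as follows; this is a behavioural Python sketch, not the SystemC source:

```python
import math

def core_activation(sum_value):
    """Softmax case: the neuron processing unit only computes the
    exponential of its accumulated sum."""
    return math.exp(sum_value)

def scheduler_softmax(core_outputs):
    """The scheduler, the only module holding the whole output vector,
    finishes the softmax by dividing each value by their sum."""
    total = sum(core_outputs)
    return [v / total for v in core_outputs]

sums = [1.0, 2.0, 0.5]                                       # accumulated neuron sums
probs = scheduler_softmax([core_activation(s) for s in sums])
```

Splitting the operation this way keeps each core independent of the others, at the cost of a final normalization pass in the scheduler.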

Configuration Software
Once the hardware was designed, a software stack was needed to load data into the DRAM and to control the NNP. This software's purposes are twofold: to read a configuration file that groups all weight matrices inside a binary file, determining the network topology from it and generating the instructions, and to sort all weights in the matrices for scheduling purposes. The weight matrices were then written to the external DRAM, waiting to be read by the NNP, which was started and waited for the input vector from the data processing IP. Every time the NNP finished a calculation, the output vector was read from the external DRAM. Control of the NNP was performed through DMA registers, since the NNP is always waiting for instructions. Algorithm 3 describes how the configuration software behaves. Regarding weight sorting: since data were a FIFO queue, and to reduce data memory cache usage inside the FPGA, the weights were transmitted in the same order as the transmissions to their associated neuron processing units. The scheduling algorithm loaded pairs of inputs and weights to each core until all neurons were processed. This means that weights must be sorted depending on the layer size and the number of available cores. The basic process divides the weight matrix into sub-matrices whose size depends on the number of available cores (Figure 5). Each sub-matrix represents the weights for one set of K neuron processing units. Those weights needed to be sorted by processing unit, which was done with a transposition. Then, each transposed matrix was vectorized for memory writing purposes. All vectors were then merged into one vector and written into the DRAM. Here, K is the number of neuron processing units, and K′ is the size of the last sub-matrix, with K′ ≤ K.
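Our reading of this sorting scheme can be sketched in a few lines of NumPy; the exact grouping axis is inferred from Figure 5, which we cannot reproduce here:

```python
import numpy as np

def sort_weights(w, k):
    """Reorder one layer's weight matrix for streaming: split the neurons
    (columns) into groups of at most K, transpose each sub-matrix so that
    one neuron's weights become contiguous, then flatten and concatenate."""
    chunks = [w[:, j:j + k].T.ravel() for j in range(0, w.shape[1], k)]
    return np.concatenate(chunks)

w = np.arange(12).reshape(3, 4)   # 3 inputs, 4 neurons
stream = sort_weights(w, k=2)     # two groups of K = 2 neurons
```

With K = 2, the resulting stream first carries all weights of neurons 0 and 1 (one neuron at a time), then all weights of neurons 2 and 3, matching the order in which the scheduler dispatches neurons to the cores.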

Experimentation and Results
In this section, two experiments are presented: the test and validation of our NNP, and the smart LIDAR for pedestrian detection case study. The NNP test and validation present the time performance and accuracy of our processor. The smart LIDAR case study tests our proposed methodology together with the NNP. We share our experiments on a smart LIDAR for object classification and their results, describing in detail the workflow and each component of the application as well as the performance and resource utilization. It is noted that no real LIDAR was involved in this work; instead, we used the 3D Point Cloud People Dataset acquired with a Velodyne HDL 64E S2 sensor [23,24]. This dataset was recorded in downtown Zurich (Switzerland) and mostly contains pedestrians. The main goal of this validation is to determine whether we can detect them using an application designed with our proposed methodology.

NNP: Test and Validation
Tests were done on a Zedboard development kit [25] using the Xilinx Zynq-7000 SoC (XC7Z020) [26]. Each test corresponds to the usage of a specific known dataset with a specific NN architecture for that dataset. NN model training and testing were done with Tensorflow [27]. The NN topologies are shown in Figure 6. Table 3 compares the accuracy achieved on each dataset by our NNP, depending on its number of cores, against the Tensorflow software executed on a CPU. These results show that our NNP correctly computes floating-point numbers without error. It is noted that the maximum number of cores possible on this FPGA platform is 4 because of the DSP (Digital Signal Processor) limitation (220 DSPs are available on our FPGA platform, with each core using 48 DSPs; see Table 2). In this work, the execution time is our main concern, and it is obviously related to the number of parameters in the NN and the number of cores in the processor. Figure 7a shows the execution time for the three datasets depending on the number of cores. The execution time appears close to linear for the same architecture with a different number of cores, except when there is only one core, which shows a bottleneck. Moreover, the execution time per parameter appears to be the same across the different datasets, as shown in Figure 7b.
With Vivado HLS transforming the SystemC models to HDL, our hardware threads run in parallel (the scheduler and neuron processing units are independent finite state machines using the same clock). The hardware threads run at 100 MHz, which is 150 MHz below the maximal frequency of our hardware; however, increasing the design frequency beyond this point means that its timing constraints are no longer met. The use of parallel hardware threads improved the processing time of our system. However, we want to point out the data transfer bottleneck in the AXI system bus, which affects the whole processing time of the system. This bottleneck is mainly due to the number of parameters transmitted: since we use 32-bit floating points, the parameter matrices of the NN are in the MB scale, and our AXI channels run at a theoretical maximum of 300 MB/s. We would obtain better results if we used compression such as 16-bit fixed-point integers or binary weights. Another option to improve the time consumption would be to run the scheduler, which controls the neuron processing units, with a faster clock than the neuron processing units, so that data are read from DMA and distributed to the processing units faster; however, we did not confirm that this would bypass the data transfer bottleneck. In the context of the defined topologies, MNIST [28] has 26,432 parameters, Cifar 10 [29] has 611,648 parameters (since the input is grayscaled, the image size is 32 × 32 × 1), and Cifar 100 [29] has 2,089,984 parameters. In Figure 7b, the analysis shows that the execution time of the feed-forward sequence of a specific NN model may be predicted. This means that we can determine the number of NNP cores needed for a given application with real-time constraints.
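Given the roughly constant cost per parameter (about 23 ns, as estimated earlier in the paper), the feed-forward time of a topology can be predicted from its parameter count. A small sketch; the 784→32→32→10 topology is our reading of the MNIST model, chosen because it yields exactly the 26,432 parameters quoted above (the exact layer widths are in Figure 6):

```python
NS_PER_PARAMETER = 23  # measured feed-forward cost per parameter

def count_parameters(layer_sizes):
    """Parameter count of a no-bias DNN: one weight per connection."""
    return sum(a * b for a, b in zip(layer_sizes, layer_sizes[1:]))

def predict_feed_forward_ns(layer_sizes):
    """Estimated feed-forward time, assuming ~23 ns per parameter."""
    return count_parameters(layer_sizes) * NS_PER_PARAMETER

params = count_parameters([784, 32, 32, 10])          # assumed MNIST topology
t_ns = predict_feed_forward_ns([784, 32, 32, 10])     # ~0.61 ms
```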

Smart LIDAR for Pedestrian Detection Case Study
The first step of this experiment was to define the workflow of the application. Figure 8 represents all of its steps. The application starts with physical-world data acquired through the LIDAR sensor. The sensor transmits its raw data to the embedded processing hardware IP, which processes and transforms the information so it can be used by the NNP IP, which in turn analyzes the data and classifies the objects.

Deep Learning Software
The first step when working on this prototype was to define how to classify objects from 3D data, such as the point cloud received from the LIDAR. One way to classify objects is to convert the point cloud to voxels and then use deep learning to determine the category [15]. The dataset used in our case study was the Sydney Urban Objects Dataset (SUOD) [30,31], and we converted its point clouds into 32 × 32 × 32 voxel grids using a volumetric binary occupancy grid approach. The training was performed with the architecture represented in Figure 9 using the Tensorflow software [27]. The hardware used for the training was an Intel Core i7-8550U CPU (4 cores, 8 threads, a 1.8 GHz base frequency up to a 4 GHz turbo frequency, and an 8 MB cache) with 16 GB of RAM. Once the architecture was defined and the training/testing was performed, the weights were extracted in the NumPy binary format [32].

Embedded Processing
Three-dimensional point cloud data were the input of the smart LIDAR. Each object in the point cloud needed to be extracted and transformed to voxels as an input for the NNP. To achieve this, four tasks were needed, as shown in Figure 8. The first step was to build an occupancy grid to detect objects inside the point cloud. The second step was to remove all points that were not considered objects. The third step was to isolate objects with a "sliding box". The results for those three steps were presented in a previous paper [33]. The fourth step was to convert the extracted objects into a volumetric binary occupancy grid. Once an object was transformed into a 32 × 32 × 32 voxel grid, it was sent to the NNP. The embedded processing was written in SystemC and synthesized to RTL with HLS software (Xilinx Vivado HLS [22]). The algorithm for the "points to voxels" module is presented in Algorithm 4. The module received all the points of a box twice: the first time to calculate the bounding box and the second time to calculate the volumetric binary occupancy grid. The input and output of the "points to voxels" module are shown in Figure 10. The pedestrians were extracted in boxes of 2 × 2 × 2 meters and then converted to a 32 × 32 × 32 voxel grid. The FPGA synthesis results are shown in Table 4. After the hardware IPs were implemented, we tested all extracted pedestrian boxes to find the mean execution time per point (Table 5).
Algorithm 4 (initialization excerpt):
voxel size ← (24, 24, 24);
padding size ← (32, 32, 32);
resolution ← 0.1 m;
minimum coordinates ← (+inf, +inf, +inf);

The embedded processing was synthesized with the NNP. The bitstream was then ported onto the platform. The hardware platform used was a ZedBoard Zynq-7000. The SD card generation was automated using our script [20]. This script deploys a UNIX operating system (OS) and all other resources required to boot this OS. The weight matrices were also integrated within the SD card, along with the configuration software. In order to evaluate the system, two tests were performed. The first test was related to the SUOD: the accuracy was evaluated with all classes contained in the dataset. The second test was related to the 3D Point Cloud People Dataset: we extracted all pedestrian boxes from the Polyterrasse set, i.e., 599 fully visible pedestrians, to test whether they were correctly classified. Thus, the accuracy is the proportion of boxes that were correctly classified as pedestrians. With 4,204,160 parameters, the processing time of this network topology is shown in Figure 11. The results are shown in Table 6. The SUOD accuracy for multiple-object detection is quite low compared to state-of-the-art neural networks. This is mainly due to two things: the use of a dense NN instead of a CNN, and the limited number of parameters in the NN compared to the number of classified objects. However, when applying the same topology to detecting only pedestrians, the results are far better, which means that a dense NN with this number of parameters is enough to classify one type of object.
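A software model of the two-pass "points to voxels" conversion might look as follows. This sketch is simplified: it scales the bounding box to the grid instead of using the fixed 0.1 m resolution and padding of Algorithm 4:

```python
import numpy as np

def points_to_voxels(points, grid=32):
    """Two-pass conversion: the first pass finds the bounding box, the
    second pass marks the occupied cells of a binary occupancy grid (the
    hardware streams the points twice over a FIFO instead of storing them)."""
    points = np.asarray(points, dtype=np.float64)
    lo, hi = points.min(axis=0), points.max(axis=0)        # pass 1: bounding box
    span = np.where(hi > lo, hi - lo, 1.0)                 # avoid division by zero
    idx = ((points - lo) / span * (grid - 1)).astype(int)  # pass 2: fill grid
    voxels = np.zeros((grid, grid, grid), dtype=np.uint8)
    voxels[idx[:, 0], idx[:, 1], idx[:, 2]] = 1
    return voxels

v = points_to_voxels([[0, 0, 0], [1, 1, 1], [0.5, 0.5, 0.5]])
```

The two-pass structure is the point here: because the FPGA cannot buffer the whole point cloud, the bounding box must be known before any voxel index can be computed.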

Discussion
The investigation of embedded DL within the design and prototyping of CPS using Multi-CPU/FPGA platforms shows that our proposed methodology simplifies the prototyping of DL-based CPS (e.g., autonomous vehicles). The critical step of the proposed methodology is the design and integration of the NNP. It is considered an accelerator for DL computation, mainly for the inference step, and it also simplifies the migration from software deep learning to the hardware platform. This work is a first step toward simplifying the prototyping process of embedded DL for CPS. The scope of this methodology covers DL applications dedicated to embedded classification in constrained environments, and the methodology is oriented toward DNN topologies. In such embedded contexts, migrating the DL processing from the CPU to hardware accelerators increases the performance of the whole system, helps reach specific real-time constraints, and makes the inference step easier for optimized embedded AI. Even if the NNP integration step is not automated yet, the NNP is portable across several applications and platforms. In addition, the current NNP can be improved at several levels: (1) data management needs to be refactored and optimized in order to speed up computation, and (2) a more flexible architecture is needed to integrate more activation functions and topology types. Moreover, we think that the use of the Vivado HLS software tool to implement the NNP might slow down the final design compared to a from-scratch HDL model. Currently, our NNP is adapted to lightweight neural networks, whose execution time might be enough to reach real-time constraints in some applications while using low power.
Although the limited choice of topologies and activation functions might be a constraint for some applications, NNP design reuse, in the context of platform-based design for CPS, motivates us to investigate the automated generation of this NNP with the needed number of cores, and then the automation of the whole methodology, since several steps are already automated using our automation software. Attempting to automate such generation for every case is not a realistic goal. However, we are convinced that automating the NNP integration in the whole design methodology would bring a huge improvement in terms of design time, exploration, prototyping, and CPS portability.

Conclusions
In this paper, a new methodology for Cyber-Physical Systems (CPS) design and prototyping is presented. It is defined around an embedded Deep Learning (DL) approach. In order to compute this embedded DL algorithm, we propose a new hardware Neural Network Processor (NNP) architecture. We also share our experience of the design/implementation and the porting of the DL-based NNP onto a real hardware Multi-CPU/FPGA platform (Zynq). Our DL-NNP model used real data coming from a LIDAR sensor. Hardware threads were used to transform the 3D point cloud (LIDAR) data into a voxel grid, which serves as the input of our NNP. Results related to the NNP performance are presented, and the whole methodology is validated with a smart LIDAR for pedestrian detection case study. The NNP execution time can be predicted from the number of parameters (weight matrices) in the Neural Network (NN) and the number of NNP cores. We still have some work to do to optimize the proposed NNP. In future work, we aim to automate the generation and integration of the NNP and to simplify its design reuse through automation. The work presented in this paper is a first step toward understanding the design and implementation of Artificial Intelligence (AI) in the context of embedded systems related to CPS. The proposed methodology, which is oriented toward embedded hardware DL based on FPGAs, shows real progress in understanding the complex relationship between the embedded world, AI, and CPS. Through this work, we defined the different steps of this relationship. We think that automating those steps will be extremely helpful for embedded AI designers, simplifying the prototyping steps and enabling more significant design space exploration.