GPU-Based Embedded Intelligence Architectures and Applications

: This paper present contributions to the state-of-the art for graphics processing unit (GPU-based) embedded intelligence (EI) research for architectures and applications. This paper gives a comprehensive review and representative studies of the emerging and current paradigms for GPU-based EI with the focus on the architecture, technologies and applications: (1) First, the overview and classiﬁcations of GPU-based EI research are presented to give the full spectrum in this area that also serves as a concise summary of the scope of the paper; (2) Second, various architecture technologies for GPU-based deep learning techniques and applications are discussed in detail; and (3) Third, various architecture technologies for machine learning techniques and applications are discussed. This paper aims to give useful insights for the research area and motivate researchers towards the development of GPU-based EI for practical deployment and applications.


Introduction
Embedded Intelligence (EI) in products or systems gives it the capability to reflect on its own operational performance. EI requires the use of sensors, communications and computational processing that are embedded into the product or system to meet specific operational goals. In recent years, machine learning [1], deep learning [2] and artificial intelligence (AI) have seen wide adoption across different platforms and impose new requirements on existing computing systems and architectures. They can exist solely as software, but in most cases, they require the use of hardware components to build standalone intelligent machines. This is where the relation between "intelligence" and embedded systems becomes important. There are different platforms for the deployment of machine learning, deep learning and AI: (1) Graphics Processing Unit (GPU); (2) Field Programmable Gate Array (FPGA); (3) Central Processing Unit (CPU); (4) Application Specific Integrated Circuit (ASIC); and (5) Field Programmable System on Chip (FPSoC).
Over the past few years, advancements in GPU architecture have provided significant increases in computational power. GPU has hundreds of smaller cores and is designed for the types of computations used to render lightning-fast graphics. GPUs which were initially designed as the drivers for video game screen-to-controller responsiveness have impacted the entire IT industry. As accelerators have been adopted, GPUs have been extended for a wide range of other applications and are also deployed in high-performance computing systems to provide powerful computational capability. GPUs have been utilized to function as hardware accelerators [3] in speeding up training and inference of machine learning, deep learning and AI. The density of their cores and power efficiency enable them to meet real-time requirements and the intensive computation challenges of machine learning, deep learning and AI. Machine learning algorithms have seen wide adoption across different hardware platforms including GPU, for energy-efficiency, small formfactor and affordable devices and applications. Modern smartphones include hardware for accelerating machine learning algorithms, and software frameworks have been optimized for embedded platforms, and hardware accelerators are being commoditized such as the edge Tensor Processing Unit (TPU) [4].
Similarly, GPUs are also widely used to serve the purpose of accelerating deep learning which has been demonstrated to be an effective tool for many applications. Emerging deep learning cloud services provided by AI service providers further accelerate the usage of deep learning in many business-critical processes. Deep learning platforms of large companies provide customized hardware such as servers, storage and networking communications to support the processing of high computational workloads. Different from the above, deep learning in centralized cloud environments incurs longer latency, energy usage and financial overheads. Research and development platforms have their own specific features and prefer focusing on cost-effective GPUs for the development of limited-scale computational clusters to handle diverse deep learning workloads. One trend can be observed in recently established competitions such as "low-power image recognition challenge" (LPIRC) [5] where there is a balanced emphasis on performance accuracy, computational throughput and power consumption. Other evidence is the increasing trends from major vendors and startups for providing low-power hardware accelerators for deep learning.
Some current development trends for GPU-based EI are towards: (1) Using lower precision arithmetic (e.g., from 32 bit to 16 bit representations); (2) Exploiting operations on sparse matrices (e.g., convolutional neural networks (CNNs) often have many zero weights which can be exploited with a mechanism termed ZeroSkip [6]). The ZeroSkip mechanism is discussed further in Section 3.6; and (3) Binary neural networks (BNNs) which are customized DNNs that use binary representations for weights and activations values. BNNs are discussed further in Section 3.1. Therefore, embedded intelligence with deep learning approaches for small and power efficient devices is attractive in several domains. Some examples of GPU-based EI for real-world domains and applications with social impacts can be found in predictive systems for disaster early warning and forecasting management (large-scale water supply systems management [7], simulation and forecasting for floods [8] and fires [9]) and usage in the electronics industry (circuit solver for electronic systems with large numbers of components [10]). Further examples of GPU-based EI applications and domains will be discussed in Sections 3 and 4.
Currently, a comprehensive survey or review for embedded intelligence research and development on GPU is not available. Most reviews on machine learning, deep learning or AI do not include or focus on hardware or embedded intelligence. This paper presents a comprehensive review and several representative studies of the emerging and current paradigms for embedded intelligence research and development on GPUs with the focus on the enabling technologies, applications and challenges. The overview and classifications of embedded intelligence research and development on GPUs are shown in Table 1. The research works are classified into the various categories: (1) First, the overview and classifications of GPU-based EI research are presented to show the full spectrum in this area which also serves as a concise summary of the scope of the paper; (2) Second, various architecture technologies for deep learning techniques and applications are discussed in detail; and (3) Third, various architecture technologies for machine learning techniques and applications are discussed. This paper aims to give useful insights for the research area and motivate researchers towards the development of GPU-based EI for practical deployment and applications. The remainder of the paper is as follows. Section 2 presents the overview and the different classification areas of EI research on GPU. This is followed by Sections 3 and 4 which give discussions on the GPU-based deep learning and machine learning techniques and applications. Section 5 concludes the paper.

Deep Learning on GPU Architecture
Deep learning approaches and techniques have been proposed and deployed to address many real-world problems such as bioinformatics, manufacturing, robotics, computer vision and natural language processing. Some well-known deep learning models are convolutional neural networks (CNNs) (e.g., AlexNet, GoogleNet, etc). Commercial cloud services offered by large technology companies have expanded the adoption of deep learning in various business critical processes. Several other deep learning frameworks have also been proposed such as TensorFlow (from Google), CNTK (from Microsoft) and Caffe 2 (from Facebook) to facilitate the training required for large datasets on GPU-enabled computational clusters.

Architecture Framework and Strategy
Ultra-deep neural networks (UDNNs) have been proposed to produce high-quality models. However, the training process for the UDNN is resource intensive and timeconsuming. This limits the training efficiency of UDNN on modern GPUs due to the limited DRAM capacity. The authors in [11] proposed a new architecture called AccUDNN which is an accelerator to optimize the limited GPU memory resources to speed up the UDNN training. The architecture of AccUDNN consists of several interconnected modules: (1) The information collector gathers the features and attributes to build the performance model; (2) The performance model builder analyses the run-time characteristics and behavior in terms of computational performance, memory utilization and communication requirements; (3) The constraint unit develops the conditions that do not lead to performance degradation; and (4) The hyperparameter tuner computes the optimal minibatch size to meet the efficiency constraints. Their experimental results showed that the proposed architecture resulted in a reduction of memory requirements for training the ResNet-152 from 24 GB to 8 GB.
The authors in [12] proposed the DeepSpotCloud architecture to execute deep learning tasks with cost efficiency and fault-tolerance. Figure 1 shows a system architecture of DeepSpotCloud. There are several important modules in this architecture: (1) The Spot Instance Orchestrator which executes the tasks for spot price monitoring, recommendation and arbitration; (2) The Spot Price Monitor which accesses the current GPU spot price; and (3) The Instance Arbitrator which monitors the running spots to identify the interrupted tasks. Their experimental results showed significant cost gain with a marginal increase in the task running time.  [12]. AWS: Amazon Web Services. (Reprinted with permission from ref. [12]. Copyright 2021 IEEE).
The authors in [13] proposed a scalable architecture for large-scale DNN training in a distributed environment. The framework consists of a four-tier technology stack: (1) Hardware infrastructure consisting of Nvidia CUDA GPU cluster and nodes; (2) Nvidia CUDA drivers; (3) Middleware for HDFS storage and cluster resource management (Apache Flink); and (4) The application framework which enables the user to manage the storage of huge amounts of data (GB). The framework also includes the learning pipeline for scaling across distributed environments, a neural network package and deep architecture builder, GPU executor, linear algebra library and data format parsers.
A significant challenge for training DNNs such as CNNs is the high amount of computational resources and memory bandwidth. The design optimization requires the accurate modelling of how the GPU performance increases with the available computational and memory resources. The authors in [14] proposed DeLTA which is an analytical GPU model for CNNs for various parameters (arithmetic performance, memory hierarchy traffic, data reuse) to optimize computation throughput and memory bandwidth. Their work deployed two NVIDIA Pascal GPUs (P100 and TITAN Xp) and a Volta GPU (V100) and was validated on four CNN architectures (AlexNet, VGG, GoogLeNet, ResNet). Their experimental results showed that the DeLTA architecture could be used for the resource space exploration and the identification of tradeoffs using various scaling parameters for GPU designs to meet different design requirements.
The authors in [15] proposed GRAMARCH, a heterogeneous 3D NoC-enabled GPU and ReRAM (Resistive random-access memory) architecture that exploits the advantages of ReRAM and GPUs for 3D Network-on-Chip. Figure 2 shows the proposed GRAMARCH architecture. The GRAMARCH architecture consists of two layers: (1) The bottom layer contains the GPU and Last Level Cache (LLC) titles, and (2) The top layer contains the ReRAM for storage and computation, and includes the eDRAM buffers, in-situ multipleaccumulate units (IMA), output registers, shift-and-add, sigmoid and max-pool units. Their experimental results showed performance improvement of up to 53 times compared to conventional GPUs for image segmentation. A further paper by the authors in [16] proposed an M3D-enabled architecture termed AccuReD, that combined ReRAM arrays together with GPU cores for training CNNs with high performance and accuracy. Their experimental results showed that the proposed architecture could accelerate the CNN training process by up to twelve times compared to conventional GPU platforms. The authors in [17] proposed a performance model for an asynchronous SGD deep learning system called SPRINT on GPU supercomputers. There are two stages in the training strategy: (1) Computation of cost derivatives, and (2) Addition of derivatives and current weights for weights updating. In their proposed architecture, the GPU threads process computes the gradient of randomly-picked samples and accumulate them to the host memory, and the update threads process executes MPI all-reduce to update the weights. Their experimental results used 192 GPUs and gave an average error of 5% to 19% for various DCNN architectures to perform image classification tasks on GPU supercomputers.
The authors in [18] proposed the vDNN++ (virtualized DNN) architecture. The proposed VDNN++ had three main solutions: (1) improved asynchronous transfer; (2) usage of heuristics to reduce the memory fragmentation; and (3) compression to reduce the memory footprint. Their experimental results showed that the performance of the proposed approach (vDNN++) gave an improvement of up to 60% compared with a previous approach (vDNN).
The authors in [19] proposed an approach for training RNNs (recurrent neural networks) on multiple GPUs. Their approach partitions the training data using map-reduce and batch-bucketing optimization and is distributed among the GPU processes. Their experimental results validated the approach and compared the approach using a different number of buckets for a number of performance parameters (clock cycle time, number of epochs and loss value).
The authors in [20] proposed a pipeline-hybrid parallelism training approach for training DNNs termed Pipe-Torch. In their approach, they formulated DNN training with pipeline-hybrid parallelism, and designed an algorithm to map and model the GPU cluster with a heterogeneous network environment. Their experimental results showed that the Pipe-Torch architecture and approach gave a 1.4 times performance speedup improvement.
The authors in [21] proposed a combination design of parameter-server and all-reduce schemes. Their proposed design deploys the asynchronous parameter-server aggregation for node communications among worker nodes and the synchronous all-reduce method for aggregation among intranode GPUs. Their experimental results showed that the approach led to computational performance improvement and increased the resource utilization in the GPU cluster.
The authors in [22] proposed an architecture to pipeline the training of multiple deep learning models on a GPU cluster. The dataset is partitioned to be handled by a specific GPU computation node. In their approach, each computation node uses preloaded data for training a new model or refining a partially trained model from a previous node. The authors also proposed a circle-based pipeline where data are distributed at each computation node. Their experimental results showed that the proposed pipeline architecture and framework could reduce the overall training time by several hours compared to the baseline method.
Another significant challenge is for the deployment of DNNs which have high computational complexity on mobile devices with the power and performance constraints. A solution is to use binary neural networks (BNNs) which are customized DNNs that use binary representations for weights and activations values. The authors in [23] proposed PhoneBit, a GPU-accelerated BNN inference engine which can be deployed on Androidbased mobile devices. Their work was validated using BNN variants for DNNs (AlexNet, YOLOv2 Tiny and VGG16). Their experimental results showed that the PhoneBit gave significant performance speedup and energy efficiency advantages when compared with other DNN approaches and deployments on mobile devices.
The authors in [24] proposed a BNN hardware accelerator design. Figure 3 shows their proposed BNN accelerator design. The architecture consists of many PEs (processing elements) which can be increased or decreased. The architecture can be scaled by adding more PEs. The proposed accelerator approach supports the hardware operations to perform the computations for the BNNs. Figure 4 illustrates the sequences of operations that a processing element takes to process a BNN layer. Their architecture was implemented on Aria 10 FPGA and their experimental results showed that the FPGA BNN implementation gave higher performance efficiency over the CPU (Xeon server) and GPU implementations (Titan X and TX1).  . Sequences of operations that a processing element takes to process a BNN layer [24]. (Reprinted with permission from ref. [24]. Copyright 2021 IEEE).
The authors in [25] proposed a power-efficient CNN implementation for FPGAs and GPUs to accelerate the DNN computations. Their GPU-FPGA approach implemented a DNN architecture (CNN). It contained six layers (three convolutional layers, two max pooling layers and one fully-connected softmax activation layer). The DNN architecture implemented the fully-connected layer on the FPGA and used the GPU for the other components. Their experimental work used a TX2 GPU (from Nvidia) and a proposed Artix-7 FPGA (from Xilinx), and showed increased computation speed and decreased power consumption.

Scheduling and Communication
The authors in [26] proposed an approach termed GENIE for dynamic quality of service (QoS) scheduling for a GPU cluster framework. The GENIE scheduling framework. contained two sections: (1) Offline profiling and (2) Online scheduling. The offline profiling section implements the prediction model based on the characterization of workloads, whereas the online profiling section implements the scheduling strategy based on the QoS specifications. Their experimental results were conducted on a GPU cluster and showed an improvement of 67% in comparison with other baseline schedulers.
The authors in [27] studied the communication requirements for DNN training on GPU and identified hardware approaches to reduce the bandwidth. Their proposed architecture focused on utilizing distributed GPU software tools (e.g., CUDA and MPI). The authors investigated the communication overheads involved in training DNNs for both weak and strong scaling of training and showed that overlapping could significantly reduce them. The authors evaluated the performance of these methods on microbenchmarks and end-to-end training using the opensource LBANN deep learning toolkit, and showed that their approach could improve on existing methods, particularly for larger-scale DNN implementations.
The authors in [28] modelled broadcast communication schemes and analyzed the performance of various approaches on GPU clusters. Their work investigated several different broadcast algorithms (ring, K-nomial and IB-MCAST) on GPU clusters. The IB-MCAST header is stored in the host (in this case CPU) memory and the GPU data are collected in a single step via an IB gather operation with the GDR feature enabled. Their experimental results gave a 68% latency reduction in comparison with other state-ofthe-art methods. The authors in [29] proposed an approach to improve the performance communications between GPU nodes by reducing the bottlenecks and I/O communications in CUDA aware MPI runtimes. In this approach, minibatches are further divided into several sub-minibatches and copied to GPU buffers. Their work was validated on the Cray CS-Storm cluster and the GPU cluster contained 12 nodes containing 8 NVIDIA K-80 GPUs. The results showed that their designs gave performance improvements of 23%, 21% and 15% for the CIFAR-10, MNIST and ImageNet datasets, respectively.

Image Processing and Computer Vision
The authors in [30] proposed a GPU-based framework for training 3D CNNs using the voxelization of polygonal models. Their approach uses geometric transformations and vertex displacement computations for data augmentation in the GPU. The authors framework was tested with the ModelNet10 and ModelNet40 datasets from Princeton. In their approach, the voxelization was carried out on-the-fly compared with the standard approach of carrying out the voxelization separately. Their experimental work which was implemented using C++ with a single-thread implementation showed that their approach gave higher performance in comparison with the standard method which performs the voxelization of models separately.
The authors in [31] proposed an efficient GPU-based image target detection approach using CNNs. Their proposed approach focused on accelerating the high-level detection and scheduling task and less on accelerating the low-level convolution computations. Figure 5 shows their proposed approach which consists of three modules: (1) Slidingwindow module-to preprocess the image; (2) parameter-generating module-to perform the control and generates a combination instruction which is used as input into the CNN module; and (3) CNN feedforward execution module. The authors in [32] presented a GPU-based real-time stereo estimation architecture termed StereoBit using the BNN (binary neural network) for computation of the disparity map. Figure 6 shows their proposed approach. Their experiments used a StereoBit network architecture with four layers. Their experimental results used a TITAN XP GPU and showed that the framework gave a performance improvement for the stereo computations (60 fps) and reduced the memory usage for storing parameters. The authors in [33] proposed a hardware/software GPU-based design for an embedded image recognition system. Their proposed GPU-based embedded image recognition approach used the NVIDIA GPU to accelerate the computations for an immune CNN to perform the image recognition. Their experimental results showed that the proposed hardware/software system design gave improved recognition performance and computational speed compared with the traditional approach.
The authors in [34] proposed a real-time face recognition architecture using DNN on an embedded GPU system. Their face detection system involved tracking faces and using a CNN to perform the recognition task. The main challenge of the design is the computational requirements of the CNN when implemented into a hardware platform with limited computational resources. Their experimental work used a Jetson TX2 which contains a high-performance processor in a power-efficient design. The authors in [35] also worked on the image/face recognition. They performed video analysis to identify the face attributes and to analyze the gender and age features of the customers in business. The face recognition task was realized using a convolution neural network (CNN) on a hardware platform utilizing an embedded GPU.
The authors in [36] developed a GPU-based emotion recognition robotic eye for a service robot. Human emotional states such happy, sad and relaxed were recognized by a trained deep learning architecture (ConvNet). The robot eye performs two tasks in the robotic system. Human heads and faces are processed by pretrained haar cascade classifiers, followed by the recognition of human emotional states by the trained ConvNet. The authors in [78] proposed a bird species classification system using deep learning (CNN) on a GPU parallel computing platform. Their approach used the Caltech-UCSD Birds 200 data set for training the CNN. Their experimental work was performed using TensorFlow and showed that the DCNN computing platform and implementation could predict 88% of bird species accurately.
The authors in [37] proposed an embedded GPU cluster computing approach for the inference of CNN. Their work involved the use of pre-trained CNN models developed by the U.S. Air Force Research Laboratory. Their experimental platform consisted of six Jetson TX2 development boards which were operated in parallel. The TX2 development board cluster was used to perform evaluations for a range of image sizes, tile sizes, tile overlaps and network configurations. Their experimental results showed that the parallel GPU cluster computing framework gave a performance increase of 4.3 times. The authors in [38] applied CNN for the classification of adjective noun pairs on a GPU cluster. Their approach used the ResNet50 CNN with 50 layers in their experiments that was used to map a 224 × 224 × 3 image to a 1200-dimensional vector for representing a probability distribution over the various classes.
The authors in [39] proposed an efficient gradient-based neural architecture search approach. Their approach represents the search space as a DAG (directed acyclic graph) which contains multiple sub-graphs. Each sub-graph represents a particular neural architecture. The authors used a differentiable sampler over the DAG to avoid the need for traversing all the sub-graph possibilities. Their experimental work showed that the approach was computationally efficient and reduced the search cost of the standard approach by about 104 times. In addition, the CNN and RNN models which were discovered indicated a competitive performance compared to state-of-the-art models. The authors in [40] evaluated the reliability of the YOLO object detection framework. In their approach, they used GPUs designed with various architectures (Pascal, Kepler, Maxwell) for the detection of objects. The authors also proposed an approach termed the Algorithm-Based Fault-Tolerance (ABFT) technique for the matrix multiplication kernels of DNNs. Their experimental work used the Caltech and Visual Object Classes datasets and gave 50-60% detections and corrections with induced corruptions.

Medical or Health
Magnetic resonance imaging (MRI) is popular in modern medicine. To obtain satisfying resolution within a limited scanning time, some fast MRI image reconstruction algorithms based on parallel imaging such as GRAPPA are used. RAKI is an improved GRAPPA-like method that reconstructs the under-sampled k-space with estimated pointspread functions (kernel) in the k-space. In RAKI, kernels are trained by a multi-layer CNN. However, RAKI reconstruction depends on multiple CNNs and the implementation of RAKI reconstruction is a time-consuming process. The authors in [41] proposed accelerated strategies for RAKI implementation aided by GPU parallel programming and TensorFlow, a popular deep learning framework. In their approach, they limited the iteration number of solving optimization problems in the CNN training, while ensuring a satisfying result. They also parallelized the training tasks by CPU multiprocessing to maximize the performance by fully utilizing GPU resources to achieve further acceleration. Their experimental work used two Intel E5-2643 CPUs and a NVIDIA Tesla K80 GPU and showed that their approach gave more than 60 times performance improvement compared with the conventional sequential implementation.
Computed tomography (CT) imaging is used for several applications from diagnosis to health care. The authors in [42] proposed a fast reconstruction technique termed Deep Learning Model Based Iterative Reconstruction (DL-MBIR). Their proposed DL-MBIR approach was trained using a 16-layer residual CNN to produce reconstructions that approximated MBIR images. Their experimental work was implemented using multiple GPUs (Titan X) and TensorFlow, and showed that the 2.5D reconstruction method achieved similar quality to the 3D reconstruction at a lower computational cost. Cervical cancer is a common cancer in women. The authors in [43] proposed an approach to classify the cervix type using DNN and GPU. Their experimental results used TensorFlow and Keras models and techniques and a CUDA parallel computing platform. The CUDA architecture provided a large API for applications.
Birth asphyxia (perinatal asphyxia) is a medical condition characterized by abnormal breathing patterns in newborn children. The authors in [44] developed a CNN approach to enable asphyxia to be determined at its early stages. Their approach observed the patterns in a children's cries after training the CNN on samples obtained from affected children. Their experimental work was implemented on NVIDIA DIGITS and gave a classification accuracy of 94%.

Modeling, Prediction and Memory
To achieve exascale computing, energy-efficiency is an essential parameter for highperformance computing systems. The authors in [45] proposed an approach to model GPU parameters (computational performance, energy and power consumption). Their work aimed to predict the changes in the computational time and energy consumption for different scaling and frequencies of various GPU domains. Their approach used the LSTM recurrent neural model which includes the input of the PTX assembly code of a GPU kernel. Their experimental work was performed on four GPU architectures (Tesla T4, Titan V, Titan Xp and GTX Titan X) and achieved energy savings of 8.0% for Tesla T4, 6.0% for Titan V, 29.0% for Titan Xp and 11.5% for GTX Titan X.
The limited and small GPU memory poses challenges to fit the full model and data and limits the size of the training model. To improve the efficiency of training a large deep neural network on a GPU, the authors in [46] proposed a data pinning algorithm that reduces GPU memory pressure. They also analyzed access patterns of deep learning computation, formally formulated the GPU data pinning problem, and provided a data pinning algorithm that reduces the data movement overheads between the CPU memory and GPU memory. Their technique reduced the GPU memory usage during the backward phase by performing gradient computation and weight update simultaneously. Their experimental results used the VGG-19 deep neural network implementation model and showed improvements over other approaches (e.g., GeePS [47]).
The authors in [48] proposed an approach for overhead reduction between a host PC and the GPU method based on data swapping for reducing the memory usage. The proposed method introduced the virtual layer concept on a GPU. Their experimental results showed that their approach could reduce memory usage by about 73% and the training time by about 33% compared with the traditional method. The authors in [49] proposed a memory management approach for efficient GPU memory utilization. Their approach reduced the I/O contention among many GPUs by performing an adjustment of the number of GPUs that can perform swapping operations on a specific layer. The authors also proposed an intelligent pre-fetching algorithm that operated from the CPU memory to the GPU. Their experimental results showed that their proposed approach could achieve high throughput processing while sustaining a large batch size.
The authors in [50] proposed the design and technique for an out-of-core cuDNN library which enables CNN computations which exceed the size of the GPU memory capacity. The software stack using out-of-core (OOC) cuDNN includes extensions of some CUDA functions for memory management. Their approach was validated by computing a CNN requiring 60 GB memory on a GPU with a capacity of 16 GB physical memory. The authors in [51] proposed a memory-efficient architecture for GPU termed GPU-only Unified ConvMM. In their approach, the matrix multiplication convolutional layer is accelerated using the parallel computational resources of the GPU. Furthermore, a unified memory architecture is used to optimize the flow of this GPU-only ConvMM layer. The unified memory (UM) architecture which contains the shared memory for the CPU and GPU. Their experimental results showed that the performance of the proposed algorithm could be improved using the GPU implementation because the GPU implementation minimized the data transfers between host and device and could provide improvements in performance.

Convolution and Performance Analysis
Convolution operations form a large part of the computational requirements for CNNs. The authors in [6] targeted to enhance the performance of the state-of-the-art convolution algorithm (termed Winograd convolution) on the GPU. To address the issue of CNNs often having many zero weights, they proposed a hardware mechanism (termed ZeroSkip) to skip the multiplication operations that will give zero results regardless of input data. Their approach used a customized hardware module in the GPU which would dynamically check for zero elements and manipulate the program counter (PC) accordingly. Their experimental results used a Titan X and showed that the ZeroSkip method gave a 51.8% higher performance than the baseline Winograd convolution.
The use of distributed deep learning (DDL) presents some challenges in terms of portability and scalability issues. The authors in [52] provided a detailed performance analysis of distributed TensorFlow using the Horovod framework for scalability under various runtime parameters. The authors implemented DDL approaches for three architectures (AlexNet, GoogleNet and ResNet50). Their experimental work used the Nvidia K40, K80 and P100 GPUs. Their testing used synthetic image data with various parameters (e.g., batch size and number of GPUs). Their results showed that the Horovod framework gave linear performance (images/s) for scalability up to 256 GPUs.
The authors in [53] proposed a fault-tolerant approach with soft errors as faults in the network and introduced triple modular redundancy to detect and rectify the faults. Their work showed that the presence of faults in the network would cause a significant decrease in the accuracy of the network. CNNs and other neural networks use activation functions to introduce nonlinearity into models for the computation of complex functions. The authors in [54] investigated four activation functions (sigmoid, hyperbolic tangent, rectified linear unit (ReLU) and exponential linear unit (ELU)) by implementing them on CNN using a GPU. Their experiments used the Nvidia GPU 940MX for training and testing of the CNN model. Their work showed that the ReLU activation function had better performance than the sigmoid and hyperbolic tangent activation functions. Another observed result was that the ELU had better performance than ReLU.

VLSI Placement
Analytical placement techniques represent the current state-of-the-art for VLSI placement with the capability to solve a nonlinear optimization problem. The authors in [55] proposed a GPU-accelerated placement approach termed DREAMPlace. The DREAMPlace algorithm implementation used deep learning toolkits which consists of three stacks (lowlevel operators, gradient derivation and optimization engines). Their experimental results showed that their DREAMPlace approach had a performance improvement of thirty times speedup in global placement compared to other state-of-the-art methods.

Machine Learning in GPU Architecture
Machine learning (ML) algorithms have seen wide adoption across different domains and hardware platforms in recent years. This section describes different types of machine learning techniques for embedded intelligence on GPUs.

Architecture/Platform/Framework and Strategy
The authors in [56] proposed a parallel approach termed the H-ELM (hierarchical extreme learning machine) algorithm based on GPU and Flink which is an in-memory cluster computing platform. Figure 7 shows the architecture and workflow of H-ELM. Flink uses Java interfaces to communicate with the GPU. The GFlink architecture is shown in Figure 8.   [56]. (Reprinted with permission from ref. [56]. Copyright 2021 IEEE).
The authors in [57] proposed a GPU architecture which is integrated with the Spark Big data framework. The HeteroSpark clusters can be visualized as Spark clusters with GPUs connected with Spark worker nodes. The HeteroSpark architecture approach integrated the GPU accelerator into the Spark framework to increase data parallelism and algorithm acceleration. Their experimental work validated the HeteroSpark architecture using popular machine learning applications, and showed that the HeteroSpark architecture could achieve a performance improvement of 18 times.
The authors in [58] proposed an efficient GPU-based MapReduce framework to accelerate SVM learning. In their framework, GPUs were deployed to perform the parallel numerical calculations and MapReduce was used to perform parallel task scheduling and processing. In their approach, the search task for the SVM is performed in parallel using the MapReduce computational model. The experimental results showed that their proposed GPU-based MapReduce framework had significant performance improvements for SVM learning.
The authors in [59] proposed fast and low precision learning for GPU-accelerated spiking neural network (SNN) simulator architecture termed ParallelSpikeSim. The Paral-lelSpikeSim simulator uses unsupervised learning and stochasticity and enables fast and accurate learning with low precision operations. Their experimental results showed up to two to three times speedup in learning performance compared to deterministic SNN architectures for simple and complex datasets.
The authors in [60] proposed an event-based and time-driven SNN simulator for a hybrid CPU-GPU platform. The authors performed a comparative study for the different simulation methods (event-driven and time-driven methods) with various simulation techniques and computational platforms (single-core CPU, multi-core CPU and GPU). Their experimental work implemented the EDLUT (event-driven neural simulator based on lookup tables) in CPU/GPU clusters, and their results showed improvements in the spike propagation and queue management time. The authors in [61] proposed a neural accelerated architecture for GPUs termed NGPU which enables the scalable integration of neural accelerators for large numbers of GPU cores. The proposed architecture has three features for improving performance: (1) elimination of fetch/decode during the neural execution; (2) reduction of memory/register file accesses by storage of partial results and parameters in dedicated small buffers within the SIMD lanes; and (3) implementation of the sigmoid using a lookup table. Their experimental results showed that NGPU had a 2.4 times average performance improvement speedup and a 2.8 times average energy reduction for a diverse set of benchmarks.
The authors in [62] proposed a novel machine learning approach to find the optimal choice of GPU memory requirements for CUDA applications. The workflow of the proposed approach has two phases: (1) Offline learning; and (2) Online inference. The Offline learning phase used the NSight CUDA Profiler to collect a set of profiling metrics. Their work investigated various classifiers including the random forest, random tree and Logit-Boost to identify the classifier which would give the highest accuracy. Their experimental results showed that the proposed approach was able to predict accurately the optimal memory requirements for discrete memory or unified memory space.
The authors in [63] proposed a generic sparsity pattern termed Regularized Multi Block (RMB) sparsity pattern, an efficient storage format (CRMB), and a fast GPU algorithm for processing the RMBMM (SDMM with the multiplicand having the RMB sparsity pattern). Figure 9 shows the CRMB storage format for storing an RMB sparse matric. Their work showed that the RMB sparsity pattern enabled efficient implementations for parallel algorithms and a reduction in storage for a sparse matrix. Figure 9. Regularized Multi Block (RMB) sparsity pattern for GPU memory [63]. (Reprinted with permission from ref. [63]. Copyright 2021 IEEE).
The authors in [64] proposed an approach to allocate computations on GPU kernels to optimize the problem parameters (neural structure and training set size) and GPU parameters. Figure 10 shows the neural model and its mapping into a GPU. They termed this representation BNL (basic neural layer) with the objective to optimize the computational speed for a matrix of samples. Their experimental results showed that performance increases of 100 to 250 times could be achieved by optimizing the number of threads and the GPU global memory. The authors in [65] proposed the implementation of a multichannel HNN (MHNN) on a GPU platform. Their work used three levels of parallelism (thread, block and stream) and was designed according to the MHNN-based unmixing procedure. Their experimental results showed that the GPU implementation gave improvements of several hundred times compared to the sequential CPU implementation.

Applications
The authors in [66] proposed an approach to optimize the energy consumption of a GPU by using dynamic voltage and frequency scaling (DVFS). Their approach implemented the DVFS energy management model in a GPU. Their experimental results used three GPU platforms (Tesla, Fermi and Kepler) and showed improvements for performance and power. The authors in [67] proposed another GPU and memory coordinated energy saving approach using the ELM termed EDVFS. The EDVFS energy saving approach performs the voltage and frequency adjustments based on the extracted runtime characteristics. Their experimental results showed that the EDVFS approach gave a maximum energy savings and average energy savings of 10.63% and 2.68%, respectively, when compared to the traditional DVFS.
The authors in [68] proposed a GPU approach for PLV (phase locking value) biomarkers. Their experimental results showed that their approach gave a 21.3 times improvement from a search space of over 1.2 million and a significant reduction in complexity for ondevice processing. The authors in [69] proposed an approach for an efficient GPU-based implementation of multivariate empirical mode decomposition (MEMD) for massive neural data processing. The authors exploited the computational power of GPUs for parallelizing the MEMD algorithm using CUDA. The parallel MEMD implementation used the recursive sifting process which is performed until the stopping criterion. Their experimental results showed an improved performance of 6 to 16 times compared to traditional PC-based implementations.
The authors in [70] explored the usage of GPUs for simulating large-scale neuronal networks based on the AdEx (Adaptive Exponential) neuron-model. Their approach used the AdEx with STDP synapses model. Their implementation used the Thrust GPU library to bypass the expensive host-device memory operations. Their experimental results showed that the optimized GPU implementation gave a performance improvement of fifty times when compared to a reference multicore implementation. Furthermore, their work showed that a dual-GPU configuration could push the acceleration to a speedup of ninety times for networks of 20,000 neurons.
The authors in [71] proposed a GPU Simulator of MLMVN (multilayer neural network with multi-valued neurons). The MLMVN requires fewer neurons and iterations for its learning algorithm to optimize a network learning error. Their experimental work showed that utilizing the massive parallelism of a GPU could give a thirty times performance improvement for the MLMVN learning process. The authors in [72] proposed a novel approach for ECG recognition over GPU platforms using the probabilistic neural network (PNN). They also studied various features (discrete wavelet transform (DWT), autoregressive (AR), and wave Statistic) for the PNN. Their experimental results showed improvements in computational time using a parallel PNN (P-PNN) using CUDA. The algorithm performance of P-PNN in terms of the recognition rate was also improved compared to other learning models such as SVM and ELM.
The authors in [73] proposed an approach for fast soma cell detection in knife-edge scanning microscopy (KESM) for high-throughput imagery using GPU-accelerated machine learning. Their proposed approach requires few training data and performs real-time cell detection at a rate that exceeded the data rate of traditional KESM. The authors in [74] proposed a parallel implementation of chaos neural networks for an embedded GPU using OpenCL (Open Computing Language). Their experimental results showed that the parallel CNN approach implemented on an embedded GPU gave a pseudo-random number generator that was 49% faster than the AES in counter mode.
The authors in [75] proposed an approach to identify abnormal behaviors for anomalybased intrusion detection system (IDS). The training and operation of the neural architecture on GPU are performed in three stages: (1) preparation of initial data; (2) transfer of data to the CUDA device; and (3) kernel routine and transfer of the result to the host device. Their experimental results showed that the GPU implementation gave a thirty times performance improvement compared to a CPU implementation.
The authors in [76] proposed an approach for robot trajectory generation and the computation of collision-free trajectories for robot swarms using GPU. The performance of the proposed framework was evaluated on two case studies: (1) a swarm of 200 quadcopters traversing a maze and (2) a fleet of 100 bicycle robots changing their formation. Their experimental results showed that the proposed method required just seconds to compute the feasible and collision-free trajectories for the entire swarm. The authors in [77] proposed the GPU WiSARD Vessel Tracker GWVT which tracked maritime vessels in dynamic environments (varying maneuverability, appearance and shape). The GWVT used the WiSARD weightless neural network and was implemented on a GPU. The parallel processing capa-bilities of the GPU enabled fast execution of the tracking algorithm. Their experimental results showed that the GWVT gave improved performance over a CPU tracker.

Conclusions
This paper has given a comprehensive survey of the research area of embedded intelligence (EI) for GPU-based architectures and hardware implementations. The review has covered both the newer deep learning approaches and the traditional machine learning approaches. The area of GPU memory scheduling and communication has also been included. GPU-based EI applications for several areas including image processing and computer vision, medical applications, modeling or prediction, convolution or performance analysis and VLSI placement have also been discussed to demonstrate the wide potential of these EI technologies for real-world deployments. The paper aims to be useful and to motivate researchers towards performing increasing research in this emerging and important technology area.