Adaptive Deep Learning for Soft Real-Time Image Classification

Abstract: CNNs (Convolutional Neural Networks) are becoming increasingly important for real-time applications, such as image classification in traffic control, visual surveillance, and smart manufacturing. It is challenging, however, to meet the timing constraints of image processing tasks using CNNs due to their complexity. Performing dynamic trade-offs between inference accuracy and time is challenging too: by evaluating hundreds of CNN models in terms of time and accuracy using two popular data sets, MNIST and CIFAR-10, we observe that more complex CNNs that take longer to run even lead to lower accuracy in many cases. To address these challenges, we propose a new approach that (1) generates CNN models and analyzes their average inference time and accuracy for image classification, (2) stores a small subset of the CNNs with monotonic time and accuracy relationships offline, and (3) efficiently selects, at run time, the CNN expected to support the highest possible accuracy among the stored CNNs subject to the remaining time to the deadline. In our extensive evaluation, we verify that the CNNs derived by our approach are more flexible and cost-efficient than two baseline approaches. We also verify that our approach can effectively build a compact set of CNNs and efficiently support systematic time vs. accuracy trade-offs, if necessary, to meet the user-specified timing and accuracy requirements. Moreover, the overhead of our approach is small and acceptable in terms of latency and memory consumption.


Introduction
Machine learning [1] has numerous applications, including image processing [2], natural language processing [3], and recommendation systems [4]. In particular, deep learning [5][6][7] is gaining popularity due to its superior accuracy, enabled in recent years by algorithmic advances as well as the availability of big data and abundant resources in the cloud. Furthermore, deep learning supports automated feature selection [8] without the manual feature engineering required in other machine learning paradigms. Important soft real-time applications, such as object detection, recognition, tracking, and visual inspection for traffic control, surveillance, and smart manufacturing [9][10][11][12][13][14][15][16], can benefit from deep learning [2,5,11]. In particular, Convolutional Neural Networks (CNNs) [2] are very effective for image processing and computer vision [2,14,17]. In 2012, CNNs made a breakthrough in terms of the accuracy for image classification [2], and they have been the go-to algorithm for computer vision tasks since then. Thus, real-time deep learning via CNNs is an important issue.
Supporting real-time deep learning, however, is challenging. A deeper network with more layers may increase the accuracy; however, it significantly increases computation and memory requirements. Because training deep neural networks requires massive resources and big data sets [13], it is usually performed offline in the cloud [18]. To support real-time applications, such as traffic control, surveillance, and smart manufacturing, it is essential to perform sensor data analytics near sensors, such as cameras, in a timely fashion using the models trained in the cloud [12,14]. Supporting timely sensor data analysis using a trained model, called prediction or inference, is challenging, though, in resource-constrained embedded systems. Transmitting all sensor data to the cloud for analytics via machine learning incurs long, unpredictable latency, which may result in many deadline misses. Further, such a naive approach may saturate the backbone network, which has limited bandwidth, as the number of sensors and Internet of Things (IoT) devices is increasing fast [19].
Technologies 2021, 9, 20
A promising approach to tackling these challenges is imprecise computation [20]. If the remaining time to the deadline is insufficient, the accuracy of soft real-time image classification could be adapted to meet the timing constraint. In this paper, however, we empirically observe that the relation between the inference execution time and accuracy is not necessarily linear and is often even counterintuitive. In particular, more complex CNNs with longer execution times do not necessarily support higher accuracy; they even provide lower accuracy in many cases (Section 4). A potential reason is that CNNs (and other neural networks) are designed to support non-linear and sophisticated data manipulations for robust learning [2,6]. Generally speaking, understanding why and how deep learning performs well is an open problem [21]. Therefore, a direct application of imprecise computation is infeasible. In-depth research is required to support systematic trade-offs between the time and accuracy for deep learning via CNNs.
To shed light on the problem, in this paper, we derive Pareto optimal CNNs with different architectures defined by their hyper-parameters. Specifically, we consider a CNN with a longer execution time to be Pareto optimal if it provides higher accuracy for image classification than the other CNN models with shorter execution times. By requiring monotonically increasing accuracy for longer execution times, we eliminate all suboptimal CNNs that do not support higher accuracy despite longer execution times. Notably, we do not propose to replace advanced hyper-parameter optimization techniques, such as the Random Search, Bayesian Optimization, and Hyperband algorithms provided by popular machine learning frameworks, such as [22,23], with our approach. Instead, our approach includes a CNN optimized using the advanced hyper-parameter tuning methods in the set of Pareto optimal CNNs only if its accuracy is higher than that of any CNN that is already in the set and has a shorter execution time. In this way, our approach derives Pareto optimal CNNs for adaptive real-time image classification that support higher accuracy for a longer inference time. Thus, it is compatible with existing algorithms for hyper-parameter tuning.
On top of that, to further decrease the number of CNNs stored in memory, we obtain a smaller set of CNNs, called δ-Pareto optimal CNNs in this paper. It consists only of the Pareto optimal CNNs that enhance accuracy by more than a specified threshold δ compared to the preceding δ-Pareto optimal CNN with the shorter execution time. (Note that a Pareto or δ-Pareto optimal CNN in this paper is the CNN expected to support the highest accuracy within the remaining time to the deadline among the CNNs available in a soft real-time image classification system. We do not claim that any of our models is ultimately optimal among all possible CNN models, since an exhaustive search that considers every possible CNN model is subject to a combinatorial explosion.) At run time, our approach efficiently selects the most cost-effective CNN model that is expected to support the highest possible accuracy among the CNN models estimated to complete the inference (image classification) within the remaining time to the deadline.
Our key contributions are summarized as follows.
• δ-Pareto Optimal CNN Design and Run-Time Adaptation: We propose a new approach for efficient neural architecture search to derive Pareto and δ-Pareto optimal CNNs offline. Specifically, we first derive a lightweight model that can support the user-specified minimum accuracy for image classification, such as 0.7. By extending the model incrementally, we explore more complex CNN models with longer execution times and higher accuracy, while rejecting models that are not Pareto optimal. Moreover, we derive a compact set of δ-Pareto optimal CNNs to minimize the number of CNNs stored in memory.
To address the issue, recent works, e.g., [24,25], have investigated how to dynamically skip or add layers to meet timing constraints. Unlike the non-adaptive baseline, we support methodical trade-offs between the inference time and accuracy. Moreover, our approach provides more flexibility and opportunities for robust, timely adaptation by considering not only the number of the layers but also the other key hyper-parameters.
• Evaluation: We undertake extensive performance evaluation in terms of the prediction time and accuracy. We analyze the impacts of the hyper-parameters used to configure hundreds of CNN models on the inference time and accuracy, while comparing the effectiveness of our approach and the two baselines discussed above. In the evaluation presented in Section 4, our approach derives three different CNN models for each of MNIST [26] and CIFAR-10 [27]. For MNIST, the inference time and accuracy of the models range between 48.84-298.99 µs and 0.95-0.992, respectively. For CIFAR-10, which is more complex than MNIST, the inference time and accuracy range between 71.97-183.55 µs and 0.808-0.893, respectively (for CIFAR-10, we have used a more powerful machine due to the relative complexity of the data set; a detailed description is given in Section 4).
Different from the proposed approach, the non-adaptive vanilla baseline is unable to support stringent timing constraints, if the remaining time to the deadline is shorter than the inference time.
Notably, the layer-wise adaptation method in a single CNN [24,25] has less flexibility for CNN design and run-time adaptation than our approach. When the depth is increased from 8 to 19 in the layer-adaptive baseline, the execution time increases by more than 1.8×, but the accuracy improves by only 3.6% for CIFAR-10. In contrast, compared to the basic CNN with eight layers, two more powerful CNNs with 8 and 19 layers derived by our approach increase the accuracy by 4.9% and 8.5% while increasing the inference time by 1.4× and 2.55×, respectively. Thus, our 8-layer model supports higher accuracy than the 19-layer model of the baseline does, even though its execution time is 40% shorter than that of the baseline. In addition, if the remaining time to the deadline is sufficient, our approach can use our 19-layer model that improves the accuracy by 8.5%. Finally, our approach has small, acceptable overhead. In our approach, the latency for switching between two CNNs for adaptation is at most 20 ns; therefore, our timing overhead is negligible. The total memory footprint of the models does not exceed the user-specified bound. Compared to the non-adaptive baseline that stores only one CNN model, our approach increases the memory consumption by at most 11.211 MB, which is acceptable in modern edge servers or IoT gateways. Overall, our approach for adaptive real-time image classification is more effective than the state-of-the-art baselines.
The rest of the paper is organized as follows. In Section 2, we give CNN background and formulate the problem investigated in this paper. In Section 3, we describe our proposed approach. In Section 4, we evaluate performance via extensive experiments. Related work is discussed in Section 5. Limitations and future work are discussed in Section 6. Finally, Section 7 concludes the paper.

Background and Problem Formulation
In this paper, we consider CNNs for image classification, which is one of the most essential applications in image processing and computer vision [2,14]. Given an input image, the CNN classifies it into one of the predefined classes, e.g., an alphanumeric symbol for plate number detection [28,29] or a real-world object for object detection/recognition [9][10][11][12][13][14][15] as illustrated in Figure 1. Figure 2 illustrates the general CNN architecture [2]. When an input image is provided, a CNN extracts features from the image using multiple pairs of convolutional and pooling layers and classifies the image into a class using fully connected layers. In general, a CNN model consists of an input layer, convolutional layers, pooling layers, fully connected layers, and an output layer. The input layer takes an input image and the output layer produces the predicted class, e.g., a pedestrian, bicycle, car, or truck. The number of convolutional and pooling layers varies depending on applications and accuracy requirements. Generally (but not necessarily), the deeper the network, the higher the accuracy, with potentially diminishing returns.
Figure 1 ([30]). When an input image is given, the proposed approach is required to classify the image into a class, e.g., a car or airplane, within the deadline via adaptive deep learning.

An Overview of the CNN Structure
In principle, deep learning is supervised learning. The architecture and complexity of a CNN model are determined by its hyper-parameters, such as the number of the layers and the sizes of the filter, stride, and pooling window (discussed shortly), which have different impacts on the execution time and inference accuracy. Once the architecture of a CNN model is determined, the model is trained using the labeled data in the training set, where each image is associated with the ground-truth label (class). The parameters, i.e., the weights of different features, are determined in the training phase based on the back propagation and gradient descent algorithms applied to minimize the loss, i.e., the distance between the class predicted by the CNN and the true class [2,6].
A description of layers and key components in CNNs follows.
• Input layer [26,27,29]: In a CNN for computer vision, an image is represented by a 3D matrix defined by the image width, image height, and the depth of the channels, e.g., RGB. A grayscale image is stored as a 2D matrix. Images are pre-processed, if necessary, to conform to the width, height, and depth requirements and provided to the input layer. In CNNs, key operations, such as convolution and pooling to be discussed shortly, are independently applied to each channel. Therefore, for the sake of clarity, we mainly discuss convolution and pooling for 2D data in this section.
• Convolutional layers [2,31,32,33]: In a CNN, a convolution filter, also called a kernel, is applied to the input image. More specifically, element-wise multiplications between the filter and the data in one segment are performed, and the multiplication results are summed to produce one value in the feature map. For example, in Figure 3, a 3 × 3 kernel is applied to the first 3 × 3 segment of the input data. By performing the element-wise multiplications and sum, the first feature is produced in the feature map in Figure 3c. A new feature is generated each time the filter slides by the number of positions specified by the stride size (in Figure 3, stride = 1, which produces nine convolutional results in the feature map). A convolutional layer usually uses multiple filters. As a result, it produces multiple feature maps and stacks them together [8]. A CNN usually consists of multiple convolutional layers. The first convolutional layers detect low-level features, e.g., color, gradient orientation, and edges. The next layers detect middle-level features such as shapes, and the following layers detect an object, e.g., a car. Kernel sizes and the number of convolutional layers are key hyper-parameters that determine the architectural configuration of convolutional layers. In addition, each convolutional output is provided to an activation function to expedite training. In this paper, we use the Rectified Linear Unit (ReLU), which is one of the most popular activation functions. It is an element-wise function applied to each value x produced by convolution; it simply returns max(0, x). ReLU is popular since it is nonlinear and computationally efficient.
• Pooling layers [2,17]: The feature maps that are the output of one or more convolutional layers are fed into a pooling layer that, via downsampling, reduces the dimensionality and the risk of overfitting [1,34,35], where the CNN memorizes the training data rather than generalizing to predict/infer the classes of new, unseen images. Max pooling and average pooling are the most common pooling techniques. For example, max pooling of size 2 × 2 with depth 1 and stride 2 is depicted in Figure 4. As shown in the figure, pooling keeps representative features, while halving the width and height. The pooling window size, stride, and number of pooling layers are important hyper-parameters that also affect the time and accuracy of image classification via CNNs.
• Fully connected layers [2]: In a CNN, feature maps processed through the convolution and pooling layers are flattened, i.e., converted to a single-dimension vector, and fed to the fully connected layers. Each neuron in the first fully connected layer then computes the weighted sum of the features provided to it. The following fully connected layer computes the weighted sum of the output signals provided as its input. This process is repeated through the fully connected layers.
• Output layer and training: By giving different features different weights, the convolutional and fully connected layers find the features most correlated to a particular class. In the prediction phase, the output layer gives the probabilities that the input image belongs to the different predefined classes based on the detected features. For image classification with more than two classes, the softmax function [36] used in this paper is a common technique to compute the probabilities [6]. Finally, the class with the highest probability is selected and compared to the label, the ground truth. Based on the comparison results, the weights are adjusted to enhance the classification accuracy via back propagation [37] and gradient descent [38] techniques. By repeating the whole procedure for a big training data set, the CNN learns the model for a specific application, such as computer vision.
• CNN model evaluation: The accuracy of the trained CNN model $H_i$ is $\alpha(H_i) = n_c / n_t$, where $n_c$ and $n_t$ represent the number of the images classified correctly and the total number of classified images, respectively. Specifically, accuracy is evaluated using a separate set of data, called the test set, that the model has not seen during training (for more details of CNNs, please refer to [2,6]). Following this approach, in Section 4, each data set is divided into the training set and test set. We use the training set to train our CNN models and use the test set to evaluate the generalizability of the models derived by our approach discussed in Section 3 in terms of the prediction accuracy and time.
Figure 3. In deep learning, convolution consists of element-wise multiplications between the input data and filter as well as the summation of the multiplication results. In this specific example, element-wise multiplications between the data in (a) and the filter in (b) are performed and summed up to produce the output 4 in (c). The filter slides through the data to produce the nine results in (c). In this example, the stride is 1; that is, the filter slides by one position to the right or down after completing each convolution operation described above.
Figure 4 ([39]). In this example, max pooling is applied for dimensionality reduction via downsampling. The 2 × 2 max pooling window with stride 2 is applied through the data, producing the four results.
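The convolution, ReLU, and max pooling operations described above can be illustrated with a minimal pure-Python sketch for a single 2D channel. The function names and the input values are illustrative assumptions (the values are not taken from Figures 3 and 4, although they are chosen so that the first convolution output is 4 and a 5 × 5 input with a 3 × 3 kernel and stride 1 yields nine results); no padding is assumed.

```python
def conv2d(data, kernel, stride=1):
    """2D convolution without padding: multiply each segment element-wise
    by the kernel and sum the products (one value per kernel position)."""
    k = len(kernel)
    n_rows = (len(data) - k) // stride + 1
    n_cols = (len(data[0]) - k) // stride + 1
    out = []
    for r in range(n_rows):
        row = []
        for c in range(n_cols):
            total = 0
            for i in range(k):
                for j in range(k):
                    total += data[r * stride + i][c * stride + j] * kernel[i][j]
            row.append(total)
        out.append(row)
    return out


def relu(feature_map):
    """Element-wise ReLU: max(0, x) for every value x."""
    return [[max(0, x) for x in row] for row in feature_map]


def max_pool(feature_map, size=2, stride=2):
    """Max pooling: keep the largest value in each pooling window."""
    n_rows = (len(feature_map) - size) // stride + 1
    n_cols = (len(feature_map[0]) - size) // stride + 1
    return [
        [
            max(
                feature_map[r * stride + i][c * stride + j]
                for i in range(size)
                for j in range(size)
            )
            for c in range(n_cols)
        ]
        for r in range(n_rows)
    ]


# Illustrative (assumed) 5 x 5 input and 3 x 3 kernel.
data = [
    [1, 1, 1, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 1, 1, 1],
    [0, 0, 1, 1, 0],
    [0, 1, 1, 0, 0],
]
kernel = [
    [1, 0, 1],
    [0, 1, 0],
    [1, 0, 1],
]
feature_map = relu(conv2d(data, kernel, stride=1))  # 3 x 3 = nine results

# Illustrative (assumed) 4 x 4 input for 2 x 2 max pooling with stride 2.
pool_input = [
    [1, 1, 2, 4],
    [5, 6, 7, 8],
    [3, 2, 1, 0],
    [1, 2, 3, 4],
]
pooled = max_pool(pool_input, size=2, stride=2)  # 2 x 2 = four results
```

In this sketch, a larger stride reduces the number of kernel positions and therefore the amount of computation, which is the mechanism behind the time vs. accuracy effects evaluated in Section 4.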

Problem Formulation
In this paper, we assume that a soft real-time framework for image classification is deployed in an IoT gateway connected to one-hop (wired/wireless) cameras or in an edge server directly connected to the gateway. We consider a sporadic task model, since a camera can quickly determine whether there is any moving object using, for example, a motion sensor and submit an image to the server. Thus, the minimum inter-arrival time between two consecutive jobs of task $\tau_i$ for camera $i$ is equal to the inverse of the frame rate of the camera. Upon the arrival of the $j$th image from camera $i$, the CNN is required to complete job $\tau_{ij}$ to classify the image within the deadline = arrival time + $D_i$, where $D_i$ is the relative deadline of $\tau_i$. Our approach is orthogonal to real-time scheduling [40]. A popular scheduling algorithm, e.g., Earliest Deadline First (EDF), can be used to schedule image classification tasks.
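The timing quantities in this task model can be sketched as follows. This is a minimal illustration, not the scheduler used in the paper: the class and function names, the 25 frames/s rate, and the 1 ms relative deadline are assumptions chosen for the example.

```python
from dataclasses import dataclass


@dataclass
class Job:
    """One image classification job of a sporadic task (times in seconds)."""
    camera_id: int
    arrival_time: float
    relative_deadline: float  # D_i of the task the job belongs to

    @property
    def absolute_deadline(self) -> float:
        # deadline = arrival time + D_i
        return self.arrival_time + self.relative_deadline

    def remaining_time(self, now: float) -> float:
        """Time left until the absolute deadline (0 if already missed)."""
        return max(0.0, self.absolute_deadline - now)


def min_inter_arrival_time(frame_rate_hz: float) -> float:
    """Minimum inter-arrival time = inverse of the camera frame rate."""
    return 1.0 / frame_rate_hz


def edf_order(jobs):
    """Earliest Deadline First: the job with the earliest absolute deadline runs first."""
    return sorted(jobs, key=lambda job: job.absolute_deadline)


# Hypothetical values for illustration only.
job_a = Job(camera_id=1, arrival_time=0.010, relative_deadline=0.001)
job_b = Job(camera_id=2, arrival_time=0.0102, relative_deadline=0.0005)
ordered = edf_order([job_a, job_b])
```

The remaining time returned by `remaining_time` is the quantity that bounds which CNN model can still finish before the deadline when a job starts late or is preempted.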
We assume that the system is dedicated to (soft) real-time image classification, since image classification via deep learning within stringent timing constraints is computationally demanding. We also assume that input images from (wired/wireless) cameras may arrive late at the edge and image classification jobs can be preempted by higher priority jobs (if any) too. Given that, at run time, we dynamically select one of the CNN models expected to efficiently classify an input image with the best possible accuracy among the models in the system within the remaining time to the inference job deadline.
The block diagram in Figure 5 illustrates the proposed research step by step. First, our framework allows a user to specify the key requirements for adaptive real-time image classification, e.g., the required minimum accuracy, deadline, and memory budget to store the adaptive CNN models, for an application of interest, e.g., traffic control or smart manufacturing. Second, we propose an effective approach that derives a set of δ-Pareto optimal CNN models to meet the user requirements. Third, we design a lightweight algorithm that dynamically chooses the CNN expected to support the highest accuracy subject to the remaining time to the deadline at run time with minimal overhead. Fourth, the proposed approach and the two baselines described before are evaluated via extensive experiments. Furthermore, we discuss related work, advantages and limitations of the proposed approach, and future work issues followed by the conclusions.

Exploring CNN Models for Timely, Adaptive Image Classification
In this section, we discuss how to perform neural architecture search and find δ-Pareto optimal CNN models. Furthermore, we describe how to choose an appropriate δ-Pareto optimal model subject to the remaining time to the deadline at run time.

Overview
In our approach, a user (real-time application designer) aware of the semantics of a real-time data analytics application of interest specifies four user requirements, $\{\alpha_{min}, D, C, \delta_\alpha\}$, where $\alpha_{min}$ is the minimum acceptable accuracy (for image classification in this paper), $D$ is the inference deadline such as 1 ms, $C$ is the allowed amount of memory space to store CNNs with different inference times and accuracy, and $\delta_\alpha$ is the minimum accuracy gain. To find Pareto optimal CNNs cost-efficiently, we take the following steps illustrated in Figure 6:
1. We first derive a lightweight CNN, $H_1$, whose accuracy is at least $\alpha_{min}$ by exploring CNNs with different architectures (defined by their hyper-parameters).
2. If the current set of Pareto optimal CNN models is $H_P = \{H_1, \ldots, H_i\}$, sorted in ascending order of accuracy and inference time, we search for $H_{i+1}$ whose accuracy, $\alpha(H_{i+1})$, is higher than the accuracy of $H_i$, $\alpha(H_i)$, and whose inference time, $\text{exec\_time}(H_{i+1})$, is not longer than $D$. To find $H_{i+1}$ efficiently, we incrementally modify the hyper-parameters of $H_i$ in the neighborhood of the search space.
3. We repeat this process until we cannot find a new CNN $H_{i+1}$ such that $\alpha(H_{i+1}) - \alpha(H_i) > 0$ and $\text{exec\_time}(H_{i+1}) \leq D$ after a predetermined number of trials.
In this way, we build $H_P$ that meets the user requirements offline. Based on $H_P$, we build the set of δ-Pareto optimal CNNs, $H_{\delta P} \subseteq H_P$. We store $H_{\delta P}$ in memory using no more than $C$ bytes to support effective trade-offs between the inference time and accuracy at run time, requiring no I/O to retrieve a CNN model. To classify an input image at time $t$, our approach efficiently selects the CNN model expected to provide the highest accuracy at $t$ among the feasible CNN models in $H_f \subset H_{\delta P}$, where the estimated inference time of every CNN model $H_k \in H_f$ is not longer than the remaining time to the deadline. A more detailed description of deriving $H_{\delta P}$ and selecting an appropriate CNN in $H_{\delta P}$ at run time for adaptive real-time image classification follows.
Figure 6. The approach first derives a lightweight CNN $H_1$ such that $\alpha(H_1) \geq \alpha_{min}$ (the user-specified minimum accuracy) and its execution time is not longer than the user-specified inference deadline $D$. It inserts $H_1$ into $H_{\delta P}$ (the set of δ-Pareto optimal CNNs). It then adds further CNNs to $H_{\delta P}$ as long as the accuracy gain is at least $\delta_\alpha$, the user-specified accuracy gain, and the total memory footprint of the δ-Pareto CNN models does not exceed the memory budget $C$. In the flowchart, the purple units specify and enforce user requirements for adaptive real-time image classification using our proposed approach, the blue boxes initialize and control the overall process, and the brown ones derive adaptive CNN models and add eligible CNNs that meet the user requirements into $H_{\delta P}$.
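The monotonicity requirement on $H_P$ can be illustrated with a small Python sketch. Note the simplifying assumption: instead of the incremental neighborhood search described above, the sketch filters a list of candidate models that have already been trained and measured. The `Model` type, the function name, and all numbers are hypothetical.

```python
from typing import List, NamedTuple


class Model(NamedTuple):
    name: str
    exec_time_us: float  # measured average inference time
    accuracy: float      # measured test accuracy


def pareto_set(candidates: List[Model], alpha_min: float, deadline_us: float) -> List[Model]:
    """Keep only models whose accuracy strictly increases with execution time.

    Models below the minimum accuracy or above the deadline are rejected first.
    """
    feasible = [
        m for m in candidates
        if m.accuracy >= alpha_min and m.exec_time_us <= deadline_us
    ]
    # Sort by time; for equal times, consider the more accurate model first.
    feasible.sort(key=lambda m: (m.exec_time_us, -m.accuracy))
    front: List[Model] = []
    for m in feasible:
        # A slower model is kept only if it is more accurate than every faster one.
        if not front or m.accuracy > front[-1].accuracy:
            front.append(m)
    return front


# Hypothetical measurements for illustration only.
candidates = [
    Model("A", 50.0, 0.95),
    Model("B", 80.0, 0.94),     # slower but less accurate than A: rejected
    Model("C", 120.0, 0.98),
    Model("D", 200.0, 0.97),    # slower but less accurate than C: rejected
    Model("E", 300.0, 0.99),
    Model("F", 1500.0, 0.995),  # exceeds the deadline: rejected
    Model("G", 30.0, 0.60),     # below the minimum accuracy: rejected
]
h_p = pareto_set(candidates, alpha_min=0.7, deadline_us=1000.0)
```

The result is sorted in ascending order of both execution time and accuracy, which is the ordering assumed for $H_P$ in the rest of this section.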

Finding δ-Pareto Optimal CNNs Offline
The CNN models in $H_P$, the input to Algorithm 1, are sorted in ascending order of accuracy and inference time. In line 1 of Algorithm 1, we initialize the set of δ-Pareto optimal CNNs: $H_{\delta P} = \{H_1\}$. In lines 2-4, we initialize several variables: the number of CNNs in $H_{\delta P}$, the total memory footprint, and the highest accuracy provided by the CNNs currently in $H_{\delta P}$.
In lines 5-10, $H_i \in H_P$ is appended to $H_{\delta P}$ if its accuracy is higher than the accuracy of the CNN most recently appended to $H_{\delta P}$ by at least $\delta_\alpha$ and the user-specified memory budget, $C$, will not be exceeded after adding $H_i$ to $H_{\delta P}$. Finally, in line 11, the algorithm returns $H_{\delta P}$ and $N_\delta$ (the number of the CNNs in $H_{\delta P}$), where $N_\delta \leq N$.
We only store $H_{\delta P}$ in the edge server for efficient time vs. accuracy trade-offs at run time. A user (i.e., an application designer) can specify $\delta_\alpha$ and $C$ in Algorithm 1 to control $N_\delta$ based on application requirements and the available memory space to store CNNs. Furthermore, the prediction time of every CNN in $H_P$ and $H_{\delta P}$ does not exceed the user-specified deadline $D$ by construction, as discussed in the previous subsection.
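Since the listing of Algorithm 1 is not reproduced here, the following Python sketch reconstructs it from the prose description above. It should be read as an approximation under stated assumptions: the `Model` type, the field names, the use of megabytes, and the example values are hypothetical, and the input list is assumed to be non-empty or handled as shown.

```python
from typing import List, NamedTuple, Tuple


class Model(NamedTuple):
    name: str
    exec_time_us: float
    accuracy: float
    size_mb: float  # memory footprint of the stored model


def delta_pareto(h_p: List[Model], delta_alpha: float, budget_mb: float) -> Tuple[List[Model], int]:
    """Select the delta-Pareto subset of h_p (sorted by accuracy and time)."""
    if not h_p:
        return [], 0
    h_dp = [h_p[0]]            # initialize the set with H_1
    n_delta = 1                # number of CNNs in the set
    footprint = h_p[0].size_mb # total memory footprint so far
    best_acc = h_p[0].accuracy # highest accuracy currently in the set
    for model in h_p[1:]:
        gain = model.accuracy - best_acc
        fits = footprint + model.size_mb <= budget_mb
        if gain >= delta_alpha and fits:
            h_dp.append(model)
            n_delta += 1
            footprint += model.size_mb
            best_acc = model.accuracy
    return h_dp, n_delta


# Hypothetical Pareto optimal models for illustration only.
h_p = [
    Model("H1", 70.0, 0.80, 2.0),
    Model("H2", 90.0, 0.82, 2.5),   # gain of 0.02 is below delta_alpha
    Model("H3", 120.0, 0.86, 3.0),
    Model("H4", 180.0, 0.90, 4.0),
]
small_set, n_small = delta_pareto(h_p, delta_alpha=0.03, budget_mb=6.0)
large_set, n_large = delta_pareto(h_p, delta_alpha=0.03, budget_mb=10.0)
```

With the 6 MB budget in this example, H4 is skipped because adding it would exceed the budget; raising the budget to 10 MB admits it. Increasing `delta_alpha` or lowering `budget_mb` shrinks the stored set in the same way.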

Efficient Run-Time Selection of a CNN for Timely Image Classification
When an input image should be classified by a sporadic task instance $\tau_{ij}$, the real-time image classification system picks $H_{opt} \in H_{\delta P}$, one of the δ-Pareto CNN models stored in the system, using Algorithm 2. The algorithm simply looks up the table of the δ-Pareto optimal CNNs in $H_{\delta P}$ and picks the CNN, $H_{opt}$, expected to support the highest possible accuracy subject to $\rho_{ij}$, i.e., the remaining time to the (absolute) deadline of the inference job $\tau_{ij}$:
$$H_{opt} = \arg\max_{H_k \in L,\ \text{exec\_time}(H_k) \leq \rho_{ij}} \alpha(H_k)$$
where $L$ is the lookup table of the CNNs in $H_{\delta P}$. In this paper, we require that $N_\delta$ be a small constant. We make this design choice, since the algorithm needs to run frequently to support high accuracy image classification subject to timing constraints. Thus, the time complexity of Algorithm 2 is O(1).
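The run-time lookup can be sketched in a few lines of Python. This is an illustration based on the prose description rather than the paper's Algorithm 2 listing: the table layout, the names, and the numbers are hypothetical, and returning `None` when no stored model fits the remaining time is an assumption (the fallback policy is not specified in the text).

```python
from typing import List, NamedTuple, Optional


class Entry(NamedTuple):
    name: str
    exec_time_us: float  # estimated inference time
    accuracy: float      # expected accuracy


def select_model(table: List[Entry], remaining_us: float) -> Optional[Entry]:
    """Pick the most accurate stored model that fits the remaining time.

    The table is sorted in ascending order of time and accuracy, so the
    last feasible entry is also the most accurate feasible entry.
    """
    for entry in reversed(table):
        if entry.exec_time_us <= remaining_us:
            return entry
    return None  # assumption: no model can meet the deadline


# Hypothetical lookup table with N_delta = 3 entries.
table = [
    Entry("M1", 50.0, 0.95),
    Entry("M2", 120.0, 0.98),
    Entry("M3", 300.0, 0.99),
]
choice = select_model(table, remaining_us=130.0)
```

Because the table holds only a small constant number of entries, the linear scan performs a bounded number of comparisons per job, which matches the constant-time argument above.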

Evaluation Results
In this section, we first evaluate the impacts of hyper-parameters on the accuracy and latency of predictions using two popular data sets, to verify our assertion that hyper-parameters affect the inference time and accuracy and to analyze which hyper-parameters have greater impact. After that, we evaluate the feasibility and effectiveness of our approach in comparison to layer-wise adaptation, e.g., [24,25]. Since the source code of [24,25] was not available, we have used the depth of a CNN as another hyper-parameter in our model search and evaluation. In this paper, we have used TensorFlow [22] to implement our CNN models.

Data Sets and Hyper-Parameters
The MNIST data set of handwritten digits [26] consists of 60,000 samples in the training set and 10,000 samples in the test set. As summarized in Table 1, all the CNN models we trained for the MNIST data set consist of two convolutional and ReLU layers, two pooling layers, and two fully connected layers, where the number of neurons in the second fully connected layer is half of that in the first fully connected layer. We have fixed the basic CNN architecture using these hyper-parameters, since we have achieved relatively high accuracy that ranges between 0.9 and 0.99 with different execution times for different values of the other hyper-parameters, namely the kernel sizes, pool sizes, stride sizes, and numbers of neurons in the fully connected layers. For training, we have performed 40,000 iterations. The batch size is 64; therefore, the model is updated after processing 64 random samples via the gradient descent algorithm [6]. The numbers of channels in the first and second convolutional layers are 16 and 36, respectively. The learning rate controls how quickly a model adjusts the weights. It is typically a small positive number less than 1. For the MNIST data set, we use a learning rate of $10^{-4}$. For training and inference using the MNIST data set, we have used a machine with 4 cores and 8 GB of memory to mimic an IoT gateway or a low-end edge server such as [41].

CIFAR-10 Data Set
The CIFAR-10 data set [27] consists of 50,000 images in the training set and 10,000 images in the test set. Every image is a labeled color image that belongs to one of 10 different classes of objects. As CIFAR-10 data are more complex, making training and prediction harder, we consider a more diverse set of hyper-parameters to support systematic trade-offs between the inference time and accuracy. Table 2 shows the hyper-parameters of the bare-bones CNN architecture. Based on these hyper-parameters, we train many models that use different kernel, pool, and stride sizes to empirically evaluate their impacts on inference time and accuracy. Furthermore, we consider different numbers of convolutional, ReLU, and pooling layers. To train a model, we use 150 epochs, where each epoch processes all the images in the training set rather than small batches as we have done for the MNIST data set. The initial learning rate is $10^{-3}$, but it is reduced to $3 \times 10^{-4}$ after 100 epochs to make the weight adjustments less aggressive in the later part of training. As training and inference are more challenging for the CIFAR-10 data set, we use a more powerful machine with a 16-core Intel Xeon E5-2667 processor and 32 GB of memory to mimic a real-time edge server for classifying images from embedded cameras.
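The step-wise learning rate schedule described for CIFAR-10 can be written as a small function. This is a minimal sketch: the function name and the 0-based epoch indexing (so that the first 100 epochs, indices 0 to 99, use the initial rate) are assumptions, and the code is independent of any particular training framework.

```python
NUM_EPOCHS = 150            # total number of training epochs
SWITCH_EPOCH = 100          # number of epochs trained with the initial rate
INITIAL_RATE = 1e-3
REDUCED_RATE = 3e-4


def learning_rate(epoch: int) -> float:
    """Return the learning rate for a 0-based epoch index."""
    return INITIAL_RATE if epoch < SWITCH_EPOCH else REDUCED_RATE


# Learning rate used in each of the 150 epochs.
schedule = [learning_rate(epoch) for epoch in range(NUM_EPOCHS)]
```

A function of this shape can typically be passed to a framework's learning rate scheduling hook, although the exact hook depends on the framework and version in use.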
For each measurement of the inference time, we have used 1000 randomly selected images in the test set of the MNIST or CIFAR-10 data set. We report the results of 20 such measurements using box plots, since the execution time varies even for the same CNN model with the same hyper-parameters and parameters (weights). To measure the inference accuracy, however, we have used the entire test set following a common practice in the machine learning literature [2]; therefore, there is only one accuracy measurement with respect to a specific set of hyper-parameters (and parameters). For brevity, Table 3 introduces a few notations that represent the hyper-parameters varied for experiments. In the following subsection, we evaluate the impacts of k, p, f, s, and d on the prediction accuracy and time.
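The measurement procedure above can be approximated by the following Python harness. It is a sketch under stated assumptions: `predict` stands for any callable that classifies one image, the per-image averaging within each measurement is an assumption about how the reported times were obtained, and the dummy classifier and data exist only to make the example runnable.

```python
import random
import statistics
import time


def measure_inference_time(predict, test_images, n_images=1000, n_runs=20, seed=0):
    """Return n_runs average per-image inference times (in seconds).

    Each run classifies n_images randomly selected test images.
    """
    rng = random.Random(seed)
    per_image_times = []
    for _ in range(n_runs):
        sample = rng.sample(test_images, n_images)
        start = time.perf_counter()
        for image in sample:
            predict(image)
        elapsed = time.perf_counter() - start
        per_image_times.append(elapsed / n_images)
    return per_image_times


def box_plot_summary(values):
    """Five-number summary of the kind shown in a box plot."""
    q1, median, q3 = statistics.quantiles(values, n=4)
    return {"min": min(values), "q1": q1, "median": median, "q3": q3, "max": max(values)}


# Dummy stand-ins for a trained CNN and a test set (illustration only).
dummy_images = list(range(10000))
dummy_predict = lambda image: image % 10
times = measure_inference_time(dummy_predict, dummy_images)
summary = box_plot_summary(times)
```

Repeating the measurement and summarizing the spread, rather than reporting a single run, captures the run-to-run variation in execution time noted above.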

Impacts of the Convolutional Kernel and Stride Sizes
First, we evaluate the impact of the convolutional kernel size k on the prediction time and accuracy. We measure the time and accuracy using four different kernel sizes in the convolutional layers, k = 3 × 3, 5 × 5, 7 × 7, and 9 × 9, for each combination of p, s, and f (defined in Table 3). More specifically, we have used 72 different combinations of p, s, and f for each k. For the clarity of the presentation, we only plot the most representative results in Figures 7 and 8 where f = 384.
From Figures 7 and 8, we observe that the accuracy and execution time for s = 2 are generally higher than those for s = 6, since a small stride performs relatively fine-grained data analysis via more convolutions. As shown in Figure 7, the execution time also generally increases as the kernel size increases due to the higher computational load. In Figure 8, however, the accuracy does not show an obvious trend for different kernel sizes. For s = 2, the accuracy actually decreases as k increases, because there could be too much overlapped data between adjacent kernel executions as k increases. For s = 6, the accuracy initially increases but eventually plateaus or even drops as depicted in Figure 8. When k is relatively small (e.g., k = 3 × 3), the big stride skips certain data, resulting in low accuracy. As k increases, each kernel execution processes a bigger data block so that important features are skipped less; therefore, the accuracy increases. However, when k > s, the kernel starts to process overlapped data repeatedly, increasing the likelihood of a drop in accuracy. Our experiments using the MNIST data set have shown similar results. For the CIFAR-10 data set, the accuracy differences between the tested CNNs with different kernel and stride sizes range between 1.50 and 9.25%. The biggest execution time gap between the CNN models with different kernel sizes is 21.02%. For the MNIST data set, the impacts on the accuracy and execution time are between 0.1 and 3.65% and at most 19.86%, respectively. The difference in the results between the different kernel and stride sizes is less noticeable for MNIST, because the data set is less complex and, therefore, it is easier to achieve high accuracy using a relatively simple CNN architecture.
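The interplay between the kernel size k and the stride s discussed above can be made concrete with the standard output-size formula for a convolution. The sketch below assumes no padding and a 32 × 32 input (the CIFAR-10 image size) at the first layer; the padding actually used in the experiments is not stated, so the numbers are illustrative only.

```python
def conv_output_size(n: int, k: int, s: int, padding: int = 0) -> int:
    """Number of kernel positions along one dimension of an n-wide input."""
    return (n + 2 * padding - k) // s + 1


def overlap(k: int, s: int) -> int:
    """Pixels shared by adjacent kernel positions along one dimension (k > s)."""
    return max(0, k - s)


def skipped(k: int, s: int) -> int:
    """Pixels skipped between adjacent kernel positions along one dimension (k < s)."""
    return max(0, s - k)


# Kernel positions per 32 x 32 channel for the kernel and stride sizes in the text.
positions = {
    (k, s): conv_output_size(32, k, s) ** 2
    for k in (3, 5, 7, 9)
    for s in (2, 6)
}
```

Under these assumptions, s = 2 yields far more kernel positions than s = 6 (e.g., 225 vs. 25 for k = 3), which is consistent with the longer execution time for small strides. For s = 6, a 3 × 3 kernel skips three pixels between adjacent positions, whereas 7 × 7 and 9 × 9 kernels overlap.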

Impacts of the Pooling Window and Stride Sizes
In CNNs, pooling is used for downsampling to reduce redundant features while keeping representative ones. In this set of experiments, we vary the pooling window and stride sizes in the pooling layers, which jointly determine the degree of downsampling, and evaluate their impacts on the inference time and accuracy. Specifically, we consider p = 3 × 3, 5 × 5, 7 × 7, and 9 × 9, and s = 2, 4, 6, 8, 10, and 12. For each p (pooling window size), we have used 72 combinations of k, s, and f. In addition, to evaluate the possible impacts of each s (stride size) on the time and accuracy, we have considered 48 combinations of k, p, and f. In particular, we present the results for p = 3 × 3; the results for the other pooling window sizes were similar.
In Figures 9 and 10, both the time and accuracy drop as s increases. In Figure 9, the time initially decreases as s increases and then the trend slows down. This is because less computation is needed as s increases initially. When s > 6 in Figure 9, however, data skipping cannot make the models run much faster, because the basic computations in the convolutional and pooling layers do not decrease significantly. On the other hand, skipping results in loss of features, incurring accuracy drops. As a result, in Figure 10, the accuracy drops almost linearly as the stride size increases.

Interestingly, the inference time varies more widely for the different pooling window and stride combinations than it did for the different combinations of the kernel and stride sizes (discussed in Section 4.2.1). For the CIFAR-10 data set, the accuracy changes between 1.4 and 11.02% across all the CNNs tested in this section (1.25-9.25% in Section 4.2.1). The biggest difference in the inference time between any two different CNNs is 38.77% (21.02% in Section 4.2.1). We think this is because the goal of pooling is downsampling [6]. Thus, combinations of pooling window and stride sizes give the real-time image classification system more adaptability and control in terms of time vs. accuracy trade-offs. We observe similar patterns for MNIST: the impacts on the accuracy and latency are between 0.1 and 3.99% (0.1-3.65% in Section 4.2.1) and at most 21.9% (19.86% in Section 4.2.1), respectively.
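The flattening of the time curve for larger strides can be illustrated with a rough count of window positions. The sketch below counts how many positions a p × p window visits on a square input for each stride considered in this section. The 32 × 32 input size and the function name are hypothetical choices for illustration; the actual feature-map sizes seen by the pooling layers depend on the preceding layers and are not reproduced here.

```python
def window_positions(n: int, p: int, s: int) -> int:
    """Total number of p x p window positions on an n x n input
    with stride s (assumes no padding)."""
    if p > n:
        return 0
    per_axis = (n - p) // s + 1
    return per_axis * per_axis


if __name__ == "__main__":
    n, p = 32, 3  # hypothetical 32 x 32 input with a 3 x 3 window
    previous = None
    for s in (2, 4, 6, 8, 10, 12):
        count = window_positions(n, p, s)
        saved = "" if previous is None else f" (saves {previous - count})"
        print(f"s={s:2d}: {count:3d} window positions{saved}")
        previous = count
```

In this toy setting, most of the reduction in window positions is already obtained by s = 6; larger strides remove only a few more positions while skipping more of the input, which is consistent with the diminishing time savings and the continued accuracy loss discussed above.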

Impacts of the Fully Connected Layers
In this subsection, we evaluate the accuracy and time for different sizes of the first fully connected layer, which range between 114 and 144 neurons for MNIST and between 320 and 448 neurons for CIFAR-10 (the second fully connected layer uses half as many neurons, as discussed in Section 4.1.1). However, we have not observed any clear pattern: using more neurons in the fully connected layers does not necessarily increase the accuracy or time to a noticeable degree. We think this is because the main computations in a CNN occur in the convolutional and pooling layers. Furthermore, certain connections between the neurons in the fully connected layers may have little impact on accuracy due to their small weights. The number of neurons in the fully connected layers may have more impact on the time and accuracy in deep learning models other than CNNs. However, this is beyond the scope of this paper, which focuses on adaptive real-time image classification using CNNs.
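A back-of-the-envelope count of multiply-accumulate (MAC) operations suggests why the fully connected layer size has little effect on time. The sketch below compares one convolutional layer with the second fully connected layer, which has half as many neurons as the first. The convolutional layer shape (16 × 16 output, 3 × 3 kernel, 64 input and 64 output channels) is a hypothetical example and is not taken from the tables of this paper; only the neuron counts 320 and 448 come from the text.

```python
def conv_macs(out_h: int, out_w: int, k: int, c_in: int, c_out: int) -> int:
    """MACs of one convolutional layer with a k x k kernel."""
    return out_h * out_w * k * k * c_in * c_out


def fc_macs(n_in: int, n_out: int) -> int:
    """MACs of one fully connected layer."""
    return n_in * n_out


def second_fc_macs(f: int) -> int:
    """MACs of the second fully connected layer, which has f // 2 neurons
    and takes the f outputs of the first fully connected layer as input."""
    return fc_macs(f, f // 2)


if __name__ == "__main__":
    conv = conv_macs(16, 16, 3, 64, 64)  # hypothetical layer shape
    small, large = second_fc_macs(320), second_fc_macs(448)
    print(f"conv layer:        {conv} MACs")
    print(f"second FC, f=320:  {small} MACs")
    print(f"second FC, f=448:  {large} MACs")
    print(f"extra MACs from the larger f: {large - small}")
```

With these assumed shapes, moving from f = 320 to f = 448 adds tens of thousands of MACs in this layer, while a single convolutional layer already costs millions, so the change is small relative to the total.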

Impacts of the Total Depth
In this set of experiments, we evaluate the impacts of the total depth of a CNN on the inference time and accuracy. We use several different depths, i.e., the total number of layers in the tested CNNs, summarized in Table 4 for the CIFAR-10 data set. For the MNIST data set, we have achieved up to 0.99 accuracy using the CNN of depth 8 outlined in Table 1; thus, we do not consider increasing its depth any further. We set k = 3 × 3, p = 2 × 2, s = 2, and f = 512 to compare the CNNs with different depths on a common basis. We have trained each model and tuned its parameters independently.
In Figure 11, as the depth increases from 8 to 19, the accuracy and the median time for an inference increase by approximately 0.036 and 0.08 ms, respectively. Although the accuracy increases by 3.6%, the median execution time for one inference increases by more than 80%. From this, we observe that increasing the total depth of a CNN is a relatively expensive and less cost-effective option. Essentially, real-time image classification that only adapts the number of layers to execute at runtime (e.g., [24,25]) is substantially more restricted and provides a much narrower scope of adaptation than our approach does. Thus, our approach is more cost-effective.

Figure 11. CNN Depth vs. Accuracy and Execution Time (CIFAR-10). When the depth is increased from 8 to 19, the execution time increases by a factor of more than 1.8, but the accuracy improves by only 3.6%. From these results, we observe that considering only the depth of a CNN for adaptation is less effective than our approach for robust real-time image classification via systematic time vs. accuracy trade-offs. The hyper-parameter values used to design the CNNs with different depths are specified in this figure.

Figures 12 and 13 plot the relationships between the inference time and accuracy for all the combinations of different hyper-parameters tested in this section. Each data point in the 2D figures shows the time and accuracy of one CNN fully configured by its hyper-parameters (and weights). In the figures, the relationship between the CNN execution time and accuracy is nonlinear and considerably irregular in that there are many zigzags and wide swings of accuracy for similar execution times. Hence, it is naive and often erroneous to assume that a longer execution time necessarily leads to higher accuracy. To address the issue, in this paper, we explore organized trade-offs between time and accuracy for real-time image classification to support monotonically increasing accuracy with respect to the execution time.
Figure 12. Time vs. Accuracy for the MNIST Data Set. As plotted in this figure, there are many CNN models that are Pareto suboptimal; that is, their accuracy is lower than that supported by one or more CNNs with shorter execution times. Our approach eliminates them to support adaptive real-time image classification cost-efficiently.

Figure 13. Time vs. Accuracy for the CIFAR-10 Data Set. There are many Pareto suboptimal CNNs for this data set too. Our approach only considers Pareto optimal CNNs as candidates to be included in the set of CNNs for adaptive real-time image classification, $H_{\delta P}$.
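The notion of Pareto suboptimality used in Figures 12 and 13 can be sketched in a few lines. The function below keeps a model only if it is more accurate than every model that runs at least as fast, so that accuracy increases monotonically with the execution time. This is a minimal illustration rather than the procedure used in the paper; the `Model` type and the sample values are hypothetical.

```python
from typing import List, NamedTuple


class Model(NamedTuple):
    name: str
    time_us: float   # average inference time in microseconds
    accuracy: float  # classification accuracy in [0, 1]


def pareto_front(models: List[Model]) -> List[Model]:
    """Return the models that are not dominated by a faster (or equally fast)
    model with equal or higher accuracy, sorted by increasing time."""
    # Sort by time; among equal times, consider the most accurate model first.
    ordered = sorted(models, key=lambda m: (m.time_us, -m.accuracy))
    front: List[Model] = []
    best_accuracy = float("-inf")
    for m in ordered:
        if m.accuracy > best_accuracy:  # strictly better than all faster models
            front.append(m)
            best_accuracy = m.accuracy
    return front


if __name__ == "__main__":
    # Hypothetical measurements, not values from the paper.
    candidates = [
        Model("A", 50.0, 0.95),
        Model("B", 60.0, 0.94),  # slower and less accurate than A
        Model("C", 70.0, 0.97),
        Model("D", 80.0, 0.97),  # slower than C with no accuracy gain
        Model("E", 90.0, 0.99),
    ]
    print([m.name for m in pareto_front(candidates)])
```

Sorting dominates the cost, so the filter runs in O(n log n) time for n candidate models, which is negligible for an offline step over a few hundred CNNs.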

Effectiveness of Our Model Selections and Adaptation
Given the hundreds of CNN models whose inference times and accuracy are plotted in Figures 12 and 13, our approach picks only a few δ-Pareto optimal CNN models for cost-efficient real-time image classification as discussed next.

Evaluation Using MNIST
To evaluate the effectiveness of our approach, let us consider an example user requirement specification for MNIST: $\alpha_{\min} = 0.95$, $\delta_{\alpha} = 0.02$, $D = 1$ ms, and $C = 10$ MB. To meet the specified user requirements, we first derive $H_1$ that supports $\alpha_{\min}$, $D$, and $C$. By incrementally modifying the hyper-parameters of $H_i$ ($i \geq 1$) in the neighborhood of $H_i$, we derive the set of 11 Pareto optimal CNN models, $H = \{H_1, \ldots, H_{11}\}$, in ascending order of accuracy and inference time as illustrated in Figure 14 and summarized in Table 5, while discarding Pareto-inefficient CNNs that fail to support monotonically increasing accuracy for longer inference times. Furthermore, using Algorithm 1, we extract the set of δ-Pareto optimal CNNs, $H_{\delta P} = \{H_1, H_6, H_{11}\}$, boldfaced in Table 5. Our first baseline, which is a common approach for image classification via deep learning, uses only $H_{11}$, which supports the highest accuracy, without any runtime adaptation via imprecise computation based on the remaining time to the deadline. The accuracy of $H_1$ and $H_6$ in $H_{\delta P}$ is lower than that of $H_{11}$ by 0.042 and 0.007, but their inference times are only 16.8% and 25% of that of the baseline, respectively. Therefore, our real-time image classification system can pick $H_{opt}$ among $H_{\delta P}$ using Equation (2) based on the remaining time to the deadline, if necessary, to meet stringent timing constraints of real-time image classification tasks, incurring only a relatively small accuracy loss when $H_1$ or $H_6$ is chosen. In Table 5, every inference time $T < D$ (1 ms). Moreover, the total memory consumption to store $H_{\delta P}$ is 6.625 MB $< C$. Therefore, our approach meets the user-required $\alpha_{\min}$, $\delta_{\alpha}$, $D$, and $C$ by constructing $H_{\delta P}$.

Figure 14. Pareto Optimal CNNs for MNIST. As illustrated in this figure, the set $H_P$ that consists of Pareto optimal CNNs may include several models with similar execution times and accuracy. Thus, it is necessary to select δ-Pareto CNNs only and include them in $H_{\delta P}$ to meet user requirements in a cost-efficient manner.

Evaluation Using CIFAR-10
For the evaluation using CIFAR-10, let us consider an example user specification: $\alpha_{\min} = 0.8$, $\delta_{\alpha} = 0.03$, $D = 1$ ms, and $C \leq 50$ MB. By incrementally modifying the hyper-parameters of $H_i$, we derive the set of 11 Pareto optimal CNN models, $H = \{H_1, \ldots, H_{11}\}$, with increasing accuracy and inference time as shown in Figure 15 and summarized in Table 6, while discarding the Pareto-inefficient CNNs that fail to support increasing accuracy for longer inference times. In addition, using Algorithm 1, we extract the set of δ-Pareto optimal CNNs, $H_{\delta P} = \{H_1, H_6, H_{11}\}$, boldfaced in Table 6. Compared with the baseline that only uses $H_{11}$, the accuracy of $H_1$ and $H_6$ is lower by 0.085 and 0.036, but their inference times are only 39.2% and 55.1% of that of the baseline, respectively. Thus, our real-time image classification system can efficiently pick $H_{opt}$ among $H_{\delta P}$ using Equation (2), if necessary, to meet tight timing constraints. In Table 6, every inference time $T < D$ (1 ms). Furthermore, the total memory consumption to store $H_{\delta P}$ is 48.857 MB $< C$. Hence, our approach meets the user-specified $\alpha_{\min}$, $D$, $C$, and $\delta_{\alpha}$ requirements by building $H_{\delta P}$.

Unlike the second baseline, which only considers the depth of the CNN for adaptation [24,25], our approach can also leverage several CNNs with the same number of layers, such as the CNNs in Tables 5 and 6, if they meet the user requirements. Thus, our approach is more flexible and cost-effective as discussed before. Compared with the first baseline that only uses $H_{11}$, our approach consumes an additional 3.067 MB and 9.899 MB to keep $H_1$ and $H_6$ in memory for MNIST and CIFAR-10, respectively (Tables 5 and 6). By comparing Figures 12 and 13 to Tables 5 and 6, we observe that Algorithm 1 reduces the total number of candidate models for runtime adaptation by two orders of magnitude.
Furthermore, Algorithm 2 for dynamic adaptation is O(1) as discussed in Section 3. It selects a CNN in $H_{\delta P}$ expected to meet $D$ in only 7-8 ns and 17-20 ns for MNIST and CIFAR-10, respectively. The additional memory consumption and latency of our adaptive approach are acceptable for modern edge gateways or servers. Overall, our approach has significant advantages over the common baseline approaches at a relatively small overhead.
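The runtime selection step and the requirement checks described above can be sketched as follows. The sketch is not Algorithm 2 or Equation (2) themselves, which are defined in Section 3; it only illustrates the idea of picking the most accurate stored model whose inference time fits in the remaining time. The `StoredModel` type, the function names, and all numeric values are hypothetical placeholders rather than entries of Tables 5 and 6.

```python
from typing import NamedTuple, Optional, Sequence


class StoredModel(NamedTuple):
    name: str
    time_us: float   # average inference time in microseconds
    accuracy: float  # classification accuracy in [0, 1]
    size_mb: float   # memory footprint in MB


def meets_requirements(models: Sequence[StoredModel], alpha_min: float,
                       deadline_us: float, capacity_mb: float) -> bool:
    """Check the minimum accuracy and deadline for every model and the
    memory budget for the whole stored set."""
    per_model_ok = all(
        m.accuracy >= alpha_min and m.time_us < deadline_us for m in models
    )
    return per_model_ok and sum(m.size_mb for m in models) < capacity_mb


def select_model(models: Sequence[StoredModel],
                 remaining_us: float) -> Optional[StoredModel]:
    """Pick the most accurate model that fits in the remaining time.
    `models` must be sorted by increasing time (and thus accuracy)."""
    for m in reversed(models):  # try the slowest, most accurate model first
        if m.time_us <= remaining_us:
            return m
    return None  # even the fastest stored model would miss the deadline


if __name__ == "__main__":
    # Placeholder values for illustration only.
    stored = [
        StoredModel("fast", 50.0, 0.950, 1.5),
        StoredModel("medium", 75.0, 0.985, 1.5),
        StoredModel("accurate", 300.0, 0.992, 3.5),
    ]
    print(meets_requirements(stored, alpha_min=0.95,
                             deadline_us=1000.0, capacity_mb=10.0))
    for remaining in (1000.0, 100.0, 40.0):
        chosen = select_model(stored, remaining)
        print(remaining, chosen.name if chosen else None)
```

Because the stored set contains only a handful of models, the scan is bounded by a small constant number of comparisons, which fits the very low selection latency reported above.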

Related Work
In early studies, accuracy was the standard metric for evaluating performance in machine learning applications [9,13]. Most object detection papers [9][10][11][12][13][14][15] focus on how to detect objects with high accuracy. More recent works, e.g., R-FCN [11], SSD [42], and YOLO [9], however, evaluate not only the accuracy but also the frame rate. In [13], trade-offs between the accuracy and frame rate are evaluated for three different neural networks, R-FCN [11], R-CNN [12], and SSD [42]. In this paper, we instead derive a set of δ-Pareto CNNs that satisfy the user-specified accuracy and memory requirements to efficiently meet stringent timing constraints of real-time image classification based on imprecise computation, since the average frame rate over an extended time interval may fail to measure and control transient fluctuations of latency (and timeliness) of individual inferences. Thus, our approach is complementary to these approaches.
Recently, real-time image classification and object detection have been drawing increasing attention [24,25,43,44]. The key difference between our proposed approach and such works is that we systematically study the impacts of hyper-parameters on the accuracy and execution time of CNNs to support a monotonic increase in accuracy for a longer execution time with little overhead at runtime. There are existing works on the evaluation of CNN hyper-parameters [13,45]; however, they do not consider timing constraints. In [13], different types of activation functions and classifiers are evaluated; however, we do not consider them since they do not affect execution times significantly.
Partial execution of some neurons or layers in CNNs based on the input or deadline has been considered. Input-dependent execution has been widely used in computer vision, such as in cascaded detectors [46,47]. In Dynamic Deep Neural Networks (D²NN) [48], in addition to normal neurons, there are control nodes that dynamically decide to skip neurons so that the execution time can be adjusted at run-time based on the input. However, D²NN does not deal with time vs. accuracy trade-offs to meet stringent timing constraints for real-time image classification.
Deadline-based dynamic frameworks make the neural network model dynamically and selectively execute a subset of layers. AnytimeNet [24] is a framework that enables gradual insertion of additional layers in an attempt to enhance accuracy if time permits. In multipath neural networks [25], a model is trained with multiple paths that contain different numbers of layers. At run-time, it is possible to change paths based on deadlines. In [49], a ResNet [50] is divided into a mandatory part and an optional part, where the former is always executed but the latter is run only when enough time is available. In fact, Refs. [24,25,49] are most closely related to our work. Our work is, however, significantly more comprehensive in that the number of layers in a CNN, or the number of resblocks in a ResNet, is only one hyper-parameter whose applicability is relatively limited as thoroughly analyzed in Section 4. Depending on the application, a shallow neural network may provide high accuracy. In this paper, for example, a CNN with only eight layers in total supports over 0.99 accuracy for the MNIST data set. In [51], a CNN with only two hidden layers supports wireless channel state information classification with up to 0.98 accuracy. Adding more layers in such applications will increase the inference time with largely diminishing returns. Therefore, we consider not only the number of layers but also the other important hyper-parameters, such as the kernel size, pooling window size, stride, and number of neurons in the fully connected layers, to support more robust, cost-efficient trade-offs between the inference time and accuracy. Thus, our work is complementary to them.

Discussion
Generally speaking, research on real-time machine learning, which is explored in this paper, is at an early stage with many open issues [52]. Supporting real-time machine learning is a challenging problem, since machine learning methodologies, e.g., deep learning, are computationally expensive. Further, new deep learning models are becoming increasingly complicated to enhance the prediction performance. In this regard, the advantage of the proposed approach is that it provides systematic, robust trade-offs between the accuracy and timeliness of real-time image classification based on deep learning. Our design goal was to minimize the complexity and the resulting uncertainties that are detrimental to safety-critical real-time systems, e.g., traffic control and smart manufacturing.
A drawback of our approach is the increased memory consumption to store multiple CNN models, even though the memory overhead is acceptable as analyzed in Section 4. To further reduce the memory consumption, several techniques, such as model pruning [53] and compression [54], can be applied to prune less important weights and compress the remaining ones. Another limitation of the proposed approach is that the trained models are fixed. As a result, the prediction performance may drop if the real-world environment, e.g., traffic status or lighting conditions, changes dramatically. To address this issue, incremental learning [55] can be applied to continually update the model as necessary. A related challenge is how to support incremental learning without impairing the timeliness and current prediction accuracy. A thorough investigation is reserved for future work.
In addition, our approach could be integrated with other advanced works, such as [56][57][58], to create synergy. For example, our approach can be combined with [56] to diagnose faulty machine components in real-time in a smart factory. Furthermore, it can be synthesized with [57,58] to support efficient real-time video compression and nighttime image classification, respectively. These issues are reserved for future work.

Conclusions and Future Work
Although deep learning can significantly improve real-time applications, e.g., traffic control or smart manufacturing, it is computationally demanding. As a result, deadlines could be missed, raising potential safety issues. To shed light on the problem, we design a new adaptive approach for soft real-time image classification based on imprecise computation. In this paper, we analyze the relationship between the prediction time and accuracy of many CNN models offline. We then construct a set of δ-Pareto optimal CNNs that support higher accuracy for a longer execution time. At run-time, our approach efficiently selects, among the stored δ-Pareto optimal CNNs, the CNN model expected to support the highest accuracy for image classification subject to the inference deadline. In our evaluation undertaken using two popular data sets, we verify that our approach can find a set of δ-Pareto CNN models for cost-efficient time vs. accuracy trade-offs. For the MNIST data set, the inference time and accuracy of the models range between 48.84 and 298.99 µs and between 0.95 and 0.992, respectively. For the CIFAR-10 data set, the inference time and accuracy range between 71.97 and 183.55 µs and between 0.808 and 0.893, respectively. In contrast, the vanilla baseline that uses one non-adaptive model cannot support dynamic runtime adaptation to meet timing constraints. The second baseline, which only adapts the number of layers in a single CNN model, if necessary, to meet timing constraints, is considerably less effective and flexible than our approach. By increasing the number of layers from 8 to 19, its inference time is increased by a factor of more than 1.8, while the accuracy is improved by only 3.6%. In our approach, however, we enhance the accuracy by 4.9% and 8.5%, while increasing the inference time by factors of 1.4 and 2.55 using 8 and 19 layers, respectively. Thus, using the same number of layers, we can support higher accuracy than the layer-wise adaptation method, while further enhancing the accuracy when sufficient time remains until the deadline. Furthermore, our approach is lightweight with little overhead.

In general, real-time sensor data analytics via machine learning is an emerging topic. In this paper, we have presented early work on systematic trade-offs between the time and accuracy for image classification. In the future, we will continue to investigate related research issues including the ones discussed in Section 6.