Deep Learning Models Compression for Agricultural Plants

Abstract: Deep learning has been showing promising results in plant disease detection, fruit counting, and yield estimation, and is gaining increasing interest in agriculture. Deep learning models are generally based on several millions of parameters that generate exceptionally large weight matrices. The latter require large memory and computational power for training, testing, and deployment. Unfortunately, these requirements make it difficult to deploy models on the low-cost, resource-limited devices present in the field. In addition, the lack or poor quality of connectivity in farms does not allow remote computation. An approach that has been used to save memory and speed up processing is to compress the models. In this work, we tackle the challenges related to resource limitation by compressing some state-of-the-art models very often used in image classification. For this, we apply model pruning and quantization to LeNet5, VGG16, and AlexNet. Original and compressed models were applied to the benchmark of plant seedling classification (V2 Plant Seedlings Dataset) and the Flavia database. Results reveal that it is possible to compress the size of these models by a factor of 38 and to reduce the FLOPs of VGG16 by a factor of 99 without considerable loss of accuracy.


Introduction
Deep learning (DL) is playing a crucial role in precision agriculture to improve farm yields [1][2][3]. However, applying deep learning in agriculture involves the acquisition and processing of a large amount of crop-related data. These data are essentially collected with wireless sensors (aboveground or underground), drones, satellites, robots, etc. Due to their huge number of parameters, DL models are usually inefficient on low-cost devices with limited resources [4]. As a result, they are usually deployed on remote servers. Remote processing raises two issues: network latency and bandwidth consumption. Farm automation solutions such as automatic harvesting [5], fruit counting [6], etc., require a very short response time that is not easily provided by remote computation. The closer the processing unit is to the field, the smaller the latency. Apart from latency, most weakly industrialized economies experience low Internet penetration rates, especially in rural areas where agricultural activities are mainly performed. Many rural areas that do have connectivity have low bandwidths that only allow limited data traffic [7,8]. This can also increase the response time, since DL models process a huge amount of data. A recent approach to overcome those limitations is infield deployment, which can provide more efficient real-time reaction, critical for crop monitoring. Infield deployment generally relies on edge computing [9], which allows incoming data to be analyzed close to the source with a minimal footprint.
Edge computing adapts well to limited computing performance and is suitable for infield inference, which enables real-time reaction and facilitates the sending of refined data or results over narrowband networks. Since devices used in the agricultural field are resource-limited, running infield DL models is considered a significant challenge.
Few works [10,11] on DL in agriculture take the limitation of resources into account. In those works, the authors designed a model compression technique based on separable convolution and singular value decomposition. They applied their technique to very deep convolutional neural networks (CNNs) for plant image segmentation, with the aim of deploying the new model infield.
In contrast to the works of [10,11], which focus on segmentation (background/flower), this paper focuses on classification, which is a more common task in the application of DL in agriculture. This work proposes a combination of two methods, namely pruning and quantization, which are applicable to any type of neural network and allow higher compression ratios to be obtained. Three DL models (LeNet5, VGG16, and AlexNet) are compressed and applied to two datasets used in agriculture: the V2 Plant Seedlings Dataset [12] and the Flavia database [13].
The rest of the paper is organized as follows: Section 2 briefly presents common DL models as well as the constraints of deploying such models on low-cost and limited-resource devices. Section 3 describes up-to-date techniques in model compression. The proposed compression approach and experimental setup are discussed, respectively, in Sections 4 and 5. The evaluation on the two plant datasets is done in Section 6, and Section 7 concludes the present work.

Deep Learning Models
Several DL models have been proposed in the literature. This section presents common models that usually serve as a basis to other models, and the constraints of deploying DL models on low resource devices.

State-of-the-Art Models
The first major success of CNNs in the task of pattern recognition dates back to LeNet (LeNet5) [14]. It was designed for handwritten and machine-printed character recognition. According to current trends, LeNet5 is a very simple model. The network architecture encompasses two sets of convolutional and average pooling layers. Those layers are followed by a flattening layer, then two fully connected layers and finally a softmax classifier.
AlexNet [15] follows the ideas of LeNet5, incorporating several new concepts and more depth. It consists of eight layers: five convolutional layers and three fully connected (FC) layers. The great success of AlexNet is due to several advantages such as the ReLU activation function, the dropout, the data augmentation, as well as the use of GPUs to speed up computation.
Later, VGGNet [16] was developed as an enhancement of AlexNet that replaces the large convolutional filters of size 11 × 11 and 5 × 5 in the first and second convolutional layers, respectively, with multiple small filters of size 3 × 3. For a given receptive field (the effective area of the input image on which an output depends), several smaller filters stacked together are better than one larger filter, because the additional nonlinear layers increase the depth of the network, allowing it to learn more complex features at a lower cost.
Another important model is ResNet [17]. The fundamental breakthrough of ResNet was to enable the training of extremely deep neural networks, with more than 150 layers, without degradation of the gradient. Before residual layers, very deep neural networks were difficult to train because of the vanishing gradient problem. ResNet introduced the skip connection (residual connection), which allows the model to learn an identity function ensuring that upper layers will perform at least as well as lower layers.
In the trend of developing DL models, Google has proposed GoogLeNet, also known as Inception [18]. It is essentially a CNN that is 27 layers deep. In CNNs, much of the design work consists of choosing the right layer configuration among the most common options (filter sizes, types of pooling, etc.). Rather than committing to a single option, it is sufficient to find the optimal local construction and repeat it spatially: GoogLeNet applies all these options in parallel and then combines their results.
These models are very often reused for different tasks in agriculture. Table 1 shows some important works that use state-of-the-art models in agriculture.

Deep Learning Models Constraints
Typically, these deep learning models are trained and deployed on GPUs because of their computing power requirements. Infield training and/or deployment of such models in low-cost equipment are therefore subject to some main constraints: the memory, the computing power, and the energy.
Memory: The most obvious bottleneck for DL is the working memory, but also the storage space, especially when working with low-resource devices. Models commonly require storage space in the order of hundreds of MB due to their complex representation, far above the 32 MB offered by IMote 2.0 or the 64 MB proposed by Arduino Yun [19]. From a practical point of view, this inflates sensor apps dramatically and usually dwarfs the application logic [20]. Concerning the runtime requirements, the working memory required at peak time will often preclude a particular platform from using a model, even if this platform has the necessary storage space to hold the model representation. It is not uncommon for a single model layer to require tens or hundreds of MB, which might be 90% more than the average layer consumption within the model. In such cases, manual customization of the model is necessary before it runs completely up to the final layer.
Table 1. Some work using state-of-the-art models in agriculture.

Application                                  Models Used              Reference
Plant recognition                            AlexNet                  [21,22]
Plant disease detection                      AlexNet and Inception    [23]
Tomato fruit counting                        Inception-ResNet         [6]
Plant species classification                 VGG16                    [24]
Mixed crops semantic segmentation            VGG16                    [25]
Obstacles detection in the farm              VGG16 and AlexNet        [26]
Fruit detection                              Faster R-CNN and VGG16   [27,28]
Mango fruit detection and yield estimation   Faster R-CNN             [29]
Plant disease recognition                    CaffeNet                 [30]

Computing power: Models such as VGGNet, even when carefully implemented on embedded platforms, have proven to take a couple of minutes for a single image due to the paging required to overcome a large memory footprint [20]. Likewise, despite their high computing power, devices such as smartphones present challenges in running deep models with latency tailored to user interaction. This typically leads to the use of models that offload their computation to the cloud, or at least to a GPU. Computational requirements are also strongly dependent on the architecture of the model, with some layers such as convolutional layers being much more bound by computation than feed-forward layers (which are more bound by memory). As a result, simple architecture changes can have a significant impact on the running time.
Energy: Related to computation, the power consumption of deep models, due to the excessive running time for a single inference, can be very expensive for equipment used in continuous monitoring. In addition, access to a power source in remote areas such as farms is not always guaranteed, which is why equipment is usually battery powered. The limited energy of batteries therefore requires minimizing the power consumption of deep learning models.
Overcoming these challenges and allowing the deployment of DL models on resource-limited equipment requires the compression of the models.

Related Works on Models Compression
The idea of compressing models originated from the observation that many of the parameters used in deep neural networks (DNNs) are redundant [31] and thus may be removed without loss of performance. In addition, using a very deep model (in terms of number of layers) is not always necessary, and it may be possible to eliminate some of the hidden layers without considerably decreasing the accuracy [32]. As a consequence of the compression, the complexity is reduced [9], which makes the application suitable for low-resource devices.

Parameter Pruning
In many neural networks, a large number of parameters do not contribute significantly to the network's predictions [33]. Pruning is a simple but efficient method to introduce sparsity into deep neural networks. The idea of parameter pruning consists of reducing the size of the model by removing unnecessary connections from the neural network. Pruning helps decrease the computation cost along with the storage and memory, while keeping an acceptable performance. Several heuristics have been studied to select and delete irrelevant connections.

Pruning Schedule
The pruning schedule falls into two categories: one-shot pruning and iterative pruning. One-shot pruning aims to achieve the desired compression ratio in a single step. It is more aggressive than iterative pruning: redundant connections are pruned all at once with respect to the saliency criterion, and the sparse pruned network is then retrained. Its main advantage is that it does not require additional hyper-parameters or a pruning schedule. The authors of [34] propose a data-dependent selective pruning of redundant connections prior to training. For this purpose, the importance of each connection is evaluated by measuring its effect on the loss function.
In contrast to one-shot pruning, iterative pruning consists of gradually removing connections until the targeted compression ratio is obtained. As the configuration of the model changes, it has to be retrained after each pruning iteration to readapt the parameters and recover the accuracy drop [35].

Salience Criteria
Optimal Brain Damage [36] and Optimal Brain Surgeon [37] used the second-order Taylor expansion to estimate the parameters' importance: the parameters whose removal yields the least increase of the error, as approximated by second-order derivatives, are removed. However, in very deep networks, computing the Hessian or its inverse over all the parameters can be too expensive. A simpler method consists of using the magnitude of the weight values.
Let us consider W the set of weight tensors and w_ij ∈ W. Magnitude-based connection pruning removes all connections whose weight absolute value is lower than a certain threshold T, as given in Equation (1):

w_ij = w_ij if |w_ij| ≥ T, 0 otherwise.  (1)
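As an illustration, Equation (1) can be sketched in a few lines of NumPy (the weight values and the threshold T below are hypothetical):

```python
import numpy as np

def magnitude_prune(W, T):
    """Zero out every weight whose absolute value falls below threshold T (Equation (1))."""
    return np.where(np.abs(W) >= T, W, 0.0)

# hypothetical 2 x 3 weight tensor and threshold
W = np.array([[0.8, -0.05, 0.3],
              [-0.02, 1.2, -0.4]])
pruned = magnitude_prune(W, T=0.1)  # small-magnitude weights become exact zeros
```

Only the surviving weights keep their values; the rest are set to zero, making the tensor sparse.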
An intuitive constraint for selecting non-zero connections consists of weight regularization. Using the L1 and L2 norms, Han et al. [33] reduced the number of parameters of AlexNet and VGG-16 by factors of 9 and 13, respectively. Some other methods learn pruning criteria using reinforcement learning (RL) [38,39]. RL-based pruning focuses on learning the proper pruning criteria for different layers via the differential sampler.

Granulation
Depending on the structure, pruning can be classified into two main categories: unstructured pruning and structured pruning. The unstructured approach, also called fine-grained pruning, removes individual weights without constraint from the local structure, whereas structured pruning removes entire structures (kernels, filters, etc.). Figure 1 illustrates the different levels of granularity in neural network pruning.
Fine-grained pruning methods remove network parameters independently of the network structure: any unimportant parameter in the convolutional kernels can be pruned. This consists of using saliency criteria to rank individual weights and then removing the least important ones. Recently, the authors of [33] proposed a deep compression framework that compresses deep neural networks in three steps: pruning, quantization, and Huffman encoding. By using this method, AlexNet could be compressed by a factor of 35 without drops in accuracy. The work uses a polynomial decay function to control the step sparsity level. High sparsity levels can be achieved, so the model can be compressed to require less memory space and bandwidth. However, poor pruning can lead to a drop in accuracy. In addition, these approaches result in unstructured sparsity of the model: without specialized parallelization techniques and hardware that can perform computations on sparse tensors, they do not accelerate the computation. Another strategy is to remove entire data structures, such as kernels or filters.
Vector-level pruning methods prune 1D vectors in the convolutional kernels, and kernel-level pruning methods prune 2D kernels in the filters. Like fine-grained pruning, vector-level and kernel-level pruning convert dense connectivity into sparse connectivity, but in a structured way [40].
Group-level pruning tries to eliminate network parameters according to the same sparse pattern on the filters [41]. In this way, convolutional computation can be efficiently implemented with reduced matrix multiplication. In [42], the authors propose a group-wise convolutional kernel pruning approach inspired by optimal brain damage [36]. The authors of [35] used group Lasso to learn a sparse structure of neural networks.
Filter-level pruning reduces the size of convolutional layers by removing unimportant filters, which simplifies the network architecture. When layer filters are removed, the size of the output feature maps is also reduced, which directly speeds up the convolutional layers. Pruning filters with a low sum of absolute weights is similar to pruning low-magnitude weights. Filter-level pruning maintains dense tensors, which leads to a lower compression ratio but is more efficient for accelerating the model, with fewer convolution operations. The authors of [43] proposed ThiNet, a filter-level pruning framework to simultaneously accelerate and compress convolutional neural networks. They used the next layer's feature map to guide the filter pruning in the current layer.
Unstructured sparsity produces very high compression ratios but yields little acceleration in practice. Structured sparsity is more beneficial for direct savings of computational resources on embedded systems, in parallel computing environments, and for efficient inference acceleration. As proposed by [33], pruning can be combined with quantization to achieve a maximal compression ratio.

Quantization
Quantization reduces the model to a lower-precision representation, which reduces computational time and memory footprint. Quantization techniques can be grouped into two main categories: scalar and vector quantization, and fixed-point quantization.
Scalar and vector quantization techniques were originally used in data compression. They make use of a codebook and a set of quantization codes to represent the original data. Since the size of the codebook is much smaller than the original data, the original data can be effectively compressed. Inspired by this, scalar or vector quantization approaches represent the parameters or weights of a deep network with such a codebook for compression. In [44], the authors applied k-means clustering to the weights or performed product quantization and obtained a very good balance between model size and recognition accuracy. They achieved a network compression factor of 16-24 with only 1% loss of accuracy on the ImageNet classification task.
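A minimal sketch of scalar quantization in this spirit, using a 1-D k-means codebook (the example weights, codebook size, and deterministic initialization are illustrative assumptions, not the exact procedure of [44]):

```python
import numpy as np

def kmeans_quantize(weights, n_clusters=4, iters=20):
    """Scalar quantization: cluster the weights into a small codebook with
    1-D k-means; each weight is then stored as a short index into the codebook."""
    flat = weights.ravel()
    # deterministic initialization: spread centers over the weight range
    codebook = np.linspace(flat.min(), flat.max(), n_clusters)
    for _ in range(iters):
        codes = np.argmin(np.abs(flat[:, None] - codebook[None, :]), axis=1)
        for c in range(n_clusters):
            members = flat[codes == c]
            if members.size:
                codebook[c] = members.mean()
    codes = np.argmin(np.abs(flat[:, None] - codebook[None, :]), axis=1)
    return codebook, codes.reshape(weights.shape)

# hypothetical weights clustered into 3 shared values
W = np.array([0.1, 0.12, 0.9, 0.88, -0.5, -0.52])
codebook, codes = kmeans_quantize(W, n_clusters=3)
restored = codebook[codes]  # dequantized weights
```

Storing each weight as a 2-bit index plus a tiny codebook is what makes the representation far smaller than the original 32-bit floats.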
Fixed-point quantization decreases computational complexity by decreasing the number of bits used in the numeric representation. The precision of the result is also decreased, but in a controlled fashion. Quantization may be done during training (prequantization) or after it (postquantization). In prequantization, a model is trained directly using fixed-point representation training techniques. An alternative approach consists of first training the floating-point model and then using quantization training techniques at the fine-tuning step. Postquantization consists of converting a pretrained floating-point model to a fixed-point representation; inferences are drawn using fixed-point computation, and the model is used without any retraining or fine-tuning step.
Quantization techniques achieve high compression ratios and accuracies but also require an appropriate software approach or sophisticated hardware to support inference.

Low-Rank Factorization
Convolution operations comprise the bulk of the calculations in CNNs, so any reduction or simplification of the convolutional layers improves inference speed. The inputs to convolutional layers in a typical CNN are four-dimensional tensors, and the key observation is that there can be a significant amount of redundancy in these tensors. Ideas based on tensor decomposition are a particularly promising way to remove this redundancy: decomposing the weight tensor into a low-rank approximation reduces the number of convolution operations and improves the model speed-up. A fully connected layer can be seen as a 2D matrix, and low-rank factorization helps there as well.

Separable Convolution
MobileNet [45] and Xception [46] use separable convolutions to reduce model parameters. Separable convolution lowers the number of multiplications and additions in the convolution operation, with the direct consequence of reducing the model weight matrix and speeding up the training and testing of large CNNs. A spatially separable convolution decomposes a filter into two smaller filters, while a depthwise separable convolution separates the filter into per-channel filters of depth 1 followed by a 1 × 1 (pointwise) convolution.
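The parameter saving of depthwise separable convolution can be verified with a short computation (the 3 × 3 kernel and the 64 → 128 channel counts are illustrative choices):

```python
def conv_params(k, c_in, c_out):
    # standard k x k convolution: one k x k x c_in kernel per output channel
    return k * k * c_in * c_out

def depthwise_separable_params(k, c_in, c_out):
    # depthwise: one k x k filter per input channel;
    # pointwise: a 1 x 1 convolution mapping c_in channels to c_out
    return k * k * c_in + c_in * c_out

std = conv_params(3, 64, 128)                  # 73,728 parameters
sep = depthwise_separable_params(3, 64, 128)   # 576 + 8192 = 8768 parameters
ratio = std / sep                              # roughly 8.4x fewer parameters
```

The saving grows with the kernel size and the number of output channels, which is why MobileNet-style networks are so much lighter than standard CNNs.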

Knowledge Distillation
The basic idea of knowledge distillation is to train a smaller "student" model to reproduce the results of a larger "teacher" model. The cross-entropy loss against the "teacher" model's outputs is added to the loss value of the "student" model. In this way, in addition to learning labels from the input data, the student also learns the "dark knowledge" [23] of the teacher model, i.e., the relations between closely associated categories.

Model Compression Metric
Two metrics are generally used to evaluate the compression of a model: the compression ratio and the speed-up. The compression ratio relates the memory required for inference before and after compression. It can be evaluated by the compression factor (CF) given in Equation (2) or by the gain (R) in memory footprint given in Equation (3):

CF = FBC / FAC  (2)

R = 1 − FAC / FBC  (3)
where FBC is the footprint before compression and FAC is the footprint after compression. The speed up measures the model acceleration before and after compression in terms of inference time. It is usually expressed in terms of Floating-Point Operations (FLOPs). The FLOPs evaluate the number of multiplication and addition used to complete a process, which is a common practice to evaluate the computational time. Improvements in FLOPs result in decreasing inference time of the networks because of removing unnecessary operations. However, time consumed by inference depends on the implementation of convolution operator, parallelization algorithm, hardware, scheduling, memory transfer rate, etc.
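Read literally from these definitions, CF = FBC/FAC and R = 1 − FAC/FBC; a minimal sketch follows (the example footprint values, in MB, are hypothetical and merely chosen to illustrate a factor-38 compression):

```python
def compression_factor(fbc, fac):
    """CF = FBC / FAC: how many times smaller the compressed model is."""
    return fbc / fac

def memory_gain(fbc, fac):
    """R = 1 - FAC / FBC: the fraction of the memory footprint saved."""
    return 1.0 - fac / fbc

# hypothetical footprints before/after compression (MB)
cf = compression_factor(528.0, 13.9)   # about 38x
gain = memory_gain(528.0, 13.9)        # about 97% of memory saved
```

Note that CF and R carry the same information: R = 1 − 1/CF.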

Proposed Compression Approach
The compression approach used in this paper is composed of two main steps: model pruning and quantization. The flowchart of the compression approach is provided in Figure 2.

Initialization
As proposed in [44], we implement the removal of the weights using a mask tensor (M), initialized to 1 at the early stage of training and of the same size as the weight tensor (W).

Pretraining
The first training phase performs normal Stochastic Gradient Descent (SGD). L2 regularization reduces the magnitude of the weights and also ranks them in order of importance.

Pruning
The connections are pruned iteratively by defining a step sparsity level at each stage. A polynomial decay function is used to determine the step sparsity level s_k, defined in Equation (4):

s_k = S + (S_r − S) × (1 − (k − r)/(n − r))^3,  r ≤ k ≤ n  (4)

where S is the targeted sparsity level, S_r the initial sparsity level, r the initial iteration, n the last iteration, and k the current iteration. The step sparsity level is used to determine the threshold value of important connections.
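A sketch of this schedule, assuming the cubic exponent used by the common polynomial-decay pruning schedule (the exponent is not stated in the text, so it is an assumption here, as are the example start/end sparsity levels):

```python
def step_sparsity(k, S, S_r, r, n, exponent=3):
    """Polynomial decay of the step sparsity level (Equation (4)):
    s_k = S + (S_r - S) * (1 - (k - r)/(n - r))**exponent, for r <= k <= n."""
    frac = (k - r) / float(n - r)
    return S + (S_r - S) * (1.0 - frac) ** exponent

# hypothetical schedule: start at 50% sparsity and ramp up to 95%
start = step_sparsity(k=0, S=0.95, S_r=0.5, r=0, n=100)
end = step_sparsity(k=100, S=0.95, S_r=0.5, r=0, n=100)
```

The schedule prunes aggressively early (when many weights are clearly redundant) and slows down as it approaches the target sparsity, giving the network time to recover.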
To determine the threshold, the weights of each layer are ranked according to their magnitude. The threshold value T is the value at the position given by the step sparsity level multiplied by the number of weights. Let us consider m_ij an element of the mask tensor M: when the absolute value of the weight associated with m_ij is below the threshold T, m_ij takes the value 0, as defined by Equation (5):

m_ij = 0 if |w_ij| < T, 1 otherwise.  (5)

The element-wise product given in Equation (6) forces the pruned weights to be set to zero:

W = W ⊙ M  (6)
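The threshold derivation, mask construction, and mask application described above can be sketched as follows (the example weight tensor and sparsity level are hypothetical):

```python
import numpy as np

def prune_layer(W, sparsity):
    """Rank weights by magnitude, take the threshold T at position
    sparsity * num_weights in the sorted magnitudes, build the mask M,
    and apply it element-wise so pruned weights become zero."""
    flat = np.sort(np.abs(W).ravel())
    idx = int(sparsity * flat.size)
    T = flat[min(idx, flat.size - 1)]
    M = (np.abs(W) >= T).astype(W.dtype)   # mask: 1 keeps a weight, 0 prunes it
    return W * M, M                        # element-wise product zeroes pruned weights

# hypothetical 2 x 3 layer pruned to 50% sparsity
W = np.array([[0.9, -0.1, 0.05],
              [0.6, -0.02, 0.3]])
pruned, mask = prune_layer(W, sparsity=0.5)
```

Because the mask is kept alongside the weights, the same element-wise product can re-zero pruned connections after every retraining step.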
Unlike connection pruning, in filter pruning the saliency criterion is the sum of the absolute values of the filter weights. All the values in the mask corresponding to the weights of a filter f are set to 0 if the sum of the absolute values of its weights is less than the threshold, as defined in Equation (7):

M_f = 0 if Σ_ij |w_ij^(f)| < T  (7)

Then the element-wise product given in Equation (6) forces the pruned weights to be set to zero.
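A corresponding sketch for filter-level pruning, where a filter's saliency is the sum of the absolute values of its weights (the filter shapes and sparsity level below are illustrative):

```python
import numpy as np

def prune_filters(filters, sparsity):
    """filters: array of shape (n_filters, k, k, c_in). A filter's saliency is
    the sum of the absolute values of its weights; filters below the threshold
    get an all-zero mask, removing them as whole structures."""
    saliency = np.abs(filters).sum(axis=(1, 2, 3))
    order = np.sort(saliency)
    T = order[min(int(sparsity * order.size), order.size - 1)]
    keep = saliency >= T
    mask = keep[:, None, None, None].astype(filters.dtype)
    return filters * mask, keep

# hypothetical layer of eight 3 x 3 x 4 filters, half of them pruned
rng = np.random.default_rng(0)
filters = rng.normal(size=(8, 3, 3, 4))
pruned, keep = prune_filters(filters, sparsity=0.5)
```

Since entire filters are zeroed, they can later be physically removed from the architecture, shrinking the next layer's input channels as well.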

Retraining
The retraining of the model allows correcting the damage caused by the removal of connections. After each backpropagation step, the weights of pruned connections are reset to zero by the element-wise product in Equation (6).

Compression of the Model
At this step, we can either compress the size of the weights or extract the reduced architecture of the model. Since the weights are zeroed in an unstructured fashion, they cannot be removed from the layer's weight tensor; compressing the model file allows us to observe the variation in size. We use Gzip for file compression.
After retraining, connections with null masks can be directly removed from the model architecture:
Convolutional layers: each filter is removed if its mask consists of zero values. Channels corresponding to the pruned filters are deleted in the next layer's input.
Batch-normalization layers: the channels corresponding to the pruned filters are also deleted.
Softmax layer: removal of filters at the last convolutional layer also requires removal of the corresponding connections in the softmax layer.

Quantization
We adopt quantization only at post-training. We apply fixed-point weight quantization: the matrix of weights is compressed from 32-bit to 16-bit floating-point values.
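This post-training step amounts to a precision cast that halves the storage of every weight tensor; a NumPy sketch (the weight tensor is hypothetical):

```python
import numpy as np

# hypothetical trained weight vector in 32-bit floating point
W32 = np.random.randn(1000).astype(np.float32)

# post-training quantization: cast from 32-bit to 16-bit floating point
W16 = W32.astype(np.float16)

size_before = W32.nbytes   # 4000 bytes
size_after = W16.nbytes    # 2000 bytes: half the footprint
max_err = float(np.max(np.abs(W32 - W16.astype(np.float32))))
```

The rounding error of the float16 representation is small relative to typical weight magnitudes, which is why accuracy is largely preserved.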

Datasets
We conducted a set of experiments using two plant datasets, namely the plant seedlings dataset (V2) and the Flavia database. The plant seedlings dataset [12] is a public image database for the benchmark of plant seedling classification. It contains 5537 images of 5184 × 3456 pixels of crop plants and weeds, all in the seedling stage. The database includes 12 species of Danish agricultural plants and weeds, with 253-762 images per species. Images of the different plants were collected with a DSLR camera at different stages separated by 2-3 days. We randomly selected 80% of the images for training and 20% for evaluation. The second dataset (Flavia dataset) consists of 1907 leaf images of 1600 × 1200 pixels covering 32 different plant species, with 50-77 images per species [13]. Each image contains a single trimmed and isolated leaf. The images were collected on the Nanjing University campus and at the Sun Yat-Sen arboretum using scanners or digital cameras on a plain background. The provided database contains RGB color images with a white background. Here as well, we randomly selected 80% of the images for training and 20% for evaluation.

Experimentation Setup
We performed both parameter pruning and quantization on LeNet5, AlexNet, and VGG-16. For each model, we modified only the last layer with respect to the number of classes in the dataset. For training, we used Stochastic Gradient Descent with a learning rate of 10^-2. For these experiments, we used a softmax output layer with categorical cross-entropy loss. We trained both the original models and the compressed models over 100 epochs using the training set of the plant seedlings dataset, and over 500 epochs for the Flavia database because of its small number of images. The input images were all resized to 50 × 50 pixels without any data augmentation.

Pruning
In the experiments, we used iterative pruning and L2 regularization. The pruning was performed every 100 iterations of 50 images per batch; the least relevant connections were then set to zero according to their magnitude and the step sparsity level. The first pruning iteration removes 50% of the parameters, then pruning continues gradually to achieve the desired sparsity level. The first pruning iteration was performed after 1/5 of the total number of epochs, and 1/10 of the total epochs was reserved to allow the model to recover from the accuracy drop.

Experimentation Setting 1
The first experiment consists of performing the pruning on each layer of the model in order to obtain a completely sparse model. Each layer has the same sparsity level, with L2 regularization. For the same model, we observed the accuracies in four cases: removing 80%, 85%, 90%, and 95% of the parameters.

Experimentation Setting 2
Convolutional layers contain fewer weights than fully connected layers due to local connectivity and weight sharing. Therefore, if the model is not too deep, these layers do not represent much of the overall volume of the model weights. For this experiment, the pruning is mainly focused on the fully connected module of our models. We first studied the proportion of dense-layer parameters in the total number of model parameters.

Experimentation Setting 3
The third experiment consists of filter-level pruning. To simplify the architecture, we replaced the fully connected module with a Global Average Pooling layer followed by a softmax layer. The new model (Model + GAP) was pruned according to filter magnitude, where the magnitude of a filter is the sum of the absolute values of its weights.
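A minimal sketch of this filter-level criterion, assuming filters stored as (n_filters, c_in, k, k): rank each filter by the sum of the absolute values of its weights (its L1 norm) and drop the weakest, which shrinks the layer itself rather than merely zeroing entries.

```python
import numpy as np

def prune_filters(conv_w, keep_ratio):
    """Keep only the strongest filters of a convolutional layer,
    ranked by the sum of the absolute values of their weights."""
    scores = np.abs(conv_w).reshape(conv_w.shape[0], -1).sum(axis=1)
    n_keep = max(1, int(round(keep_ratio * conv_w.shape[0])))
    keep = np.sort(np.argsort(scores)[::-1][:n_keep])  # indices of kept filters
    return conv_w[keep], keep

w = np.random.randn(64, 3, 3, 3)        # a 64-filter layer on RGB input
w_small, kept = prune_filters(w, 0.1)   # keep 10% of the filters
print(w_small.shape)  # (6, 3, 3, 3)
```

Note that the input channels of the following layer must be sliced with the same `kept` indices for the reduced network to remain consistent.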

Evaluation
To evaluate the performance of model pruning and quantization as model compression for agricultural applications, we focused on the reduction of the memory footprint and on the speed-up. The core algorithms for fine-grained pruning and filter-level pruning are provided in Appendix A and Appendix B, respectively.

Pruning with Same Pruning Ratio per Layer
Tables 2-4 present the results of pruning followed by post-quantization on the plant seedlings dataset for VGGNet16, AlexNet, and LeNet5, respectively. The comparison was made in terms of size and accuracy at different pruning levels. Results show that by combining pruning and post-quantization we can reduce the size of the models by an average factor of 38. In other words, pruning alone achieves reductions by factors of 3.5, 4.2, 5.3, and 7.5 when removing, respectively, 80%, 85%, 90%, and 95% of the trainable weights, without considerable accuracy variation. Applying post-quantization on top of pruning then yields reductions by factors of 12, 16, 22, and 38 with the same precision. In all the following tables, bold numbers indicate the best accuracy. We obtained similar results by applying the same process on the Flavia database. Since only the last layer of each architecture changes between the two datasets, the variation in model size is exceedingly small. Table 5 presents the accuracy values obtained by pruning LeNet5, AlexNet, and VGG16 at sparsity levels of 80%, 85%, 90%, and 95% for all the architectures.
Figure 3 presents the distribution of the number of parameters for LeNet5 (top panel), AlexNet (middle panel), and VGGNet16 (bottom panel). Between the convolutional layers lie Pooling and Batch-Normalization layers with insignificant weights, as well as Drop-out layers between the fully connected layers. The convolutional layers have fewer weights than the fully connected layers, mainly due to weight sharing and the local connectivity of the neurons. Unlike AlexNet and LeNet5, VGGNet16 still has a considerable number of weights in its convolutional layers: although the kernels are very small, VGGNet16 uses a large number of filters. Figure 4 compares the fully connected layers to the rest of the network.
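The post-quantization step above can be illustrated with a simple uniform scheme that maps float32 weights to int8, a 4x storage reduction on its own, which multiplies with the savings from sparse storage of pruned weights. The symmetric per-tensor variant below is a sketch under that assumption, not necessarily the exact scheme of our toolchain.

```python
import numpy as np

def quantize_int8(w):
    """Symmetric per-tensor quantization of a float32 tensor to int8."""
    scale = np.abs(w).max() / 127.0                     # one scale per tensor
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

w = np.random.randn(256, 256).astype(np.float32)
q, scale = quantize_int8(w)
print(q.nbytes / w.nbytes)                                  # 0.25: 4x smaller
print(np.abs(dequantize(q, scale) - w).max() <= scale / 2)  # True: bounded error
```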

Only Fully Connected Layers Pruning
Dense layers have an exceptionally large number of parameters, which greatly influences the size of the model. In this experiment, the convolutional layers are not pruned during training. Tables 6-8 present the results obtained after pruning 99% of the weights of the fully connected (FC) layers of VGGNet16, AlexNet, and LeNet5, respectively.
On VGGNet16, by keeping only 1% of the weights in the fully connected layers, the model size is reduced by a factor of 2.5 with an accuracy drop of 7.99 points. By combining pruning and quantization, the model is reduced by a factor of 11 with the same 7.99-point accuracy drop. With AlexNet (see Table 7), keeping only 1% of the weights in the fully connected layers reduces the model size by a factor of 21.1 with an accuracy gain of 1.79 points. By combining pruning and quantization, the model is reduced by a factor of 85.2 with an accuracy gain of 2.15 points.
With LeNet5 (see Table 8), keeping only 1% of the weights in the fully connected layers reduces the model size by a factor of 12.01 with an accuracy drop of 1.3 points. By combining pruning and quantization, however, the model is reduced by a factor of 49.4 with an accuracy gain of 3.14 points.
These results show that pruning only the fully connected layers can sometimes be sufficient when the goal is solely footprint reduction: pruning them at a high sparsity level already yields a high compression of the whole model.

Filter Pruning
Tables 9-11 present the results of filter-level pruning on VGGNet, AlexNet, and LeNet5, each combined with Global Average Pooling, applied on the plant seedlings dataset.
Results show that filter-level pruning is well suited to the VGGNet model: it can compress the model by a factor of 352 and accelerate it by a factor of 99 at a sparsity level of 90%, while even gaining accuracy. On the other hand, AlexNet and LeNet5 are relatively small models, and filter-level pruning appears less suitable for them. AlexNet maintains only a small accuracy loss at a sparsity level of 80%, whereas LeNet5 loses considerable accuracy already at a sparsity level of 75%.

Input Layer Resizing
The size of the input image is also an important parameter of the model architecture, and an appropriate choice of input size can considerably reduce the size of the model. Table 12 presents the use of VGGNet16 on the plant seedlings dataset with different input layer sizes. Using input images of size 50 × 50 × 3 instead of the 224 × 224 × 3 proposed by [6] on ImageNet yields a VGG16 model more than three times smaller.
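The effect is easy to quantify for VGG16, whose first dense layer dominates: its size depends quadratically on the feature-map side remaining after the five 2 × 2 pooling stages. A back-of-the-envelope computation, assuming the standard VGG16 configuration (512 final channels, 4096-unit first FC layer):

```python
# Weights in VGG16's first fully connected layer as a function of input side.
def first_fc_weights(side, channels=512, fc_width=4096):
    feat = side
    for _ in range(5):       # five 2x2 max-pooling stages halve the map
        feat //= 2
    return feat * feat * channels * fc_width

print(first_fc_weights(224))  # 102760448 (~103M weights for 224x224 inputs)
print(first_fc_weights(50))   # 2097152   (~2M weights for 50x50 inputs)
```

The first FC layer alone shrinks by a factor of about 49; since the convolutional layers keep their sizes, the whole model shrinks by a smaller but still substantial factor, consistent with the roughly threefold reduction reported above.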

Conclusions and Future Work
In this paper, we highlighted the relevance of DL model compression for plant disease detection in agriculture. We discussed DL inference on low-resource devices, then summarized important techniques for size compression, and the consequent speed-up, that have been applied to enable DL on devices with limited resources. To evaluate the performance of model compression, we applied two pruning methods, combined with quantization, to compress LeNet5, VGG16, and AlexNet on two databases used in agriculture. The results show, on the one hand, that fine-grained pruning can compress the size of these models by an average factor of 38 when pruning 95% of the model, and that in some cases only 1% of the weights of the fully connected layers is needed; exploiting such sparsity at deployment, however, requires specialized hardware. On the other hand, filter-level pruning can compress the size of large models such as VGGNet by a factor of 352 and accelerate them by a factor of 99, without considerable accuracy loss; it is, however, less efficient on smaller models.
Filter-level pruning has the advantage of directly reducing the architecture of the model, which allows it to be deployed directly in farms on limited resources such as the Arduino Yun (64 MB), Raspberry Pi (256 MB-1 GB), drones, and smartphones. In future work, we will experiment with fine-grained pruning in real farming environments using hardware such as the EIE (Efficient Inference Engine) and FPGAs (Field Programmable Gate Arrays) to speed up pruned-model inference.