Local Motion Planner for Autonomous Navigation in Vineyards with a RGB-D Camera-Based Algorithm and Deep Learning Synergy

With the advent of agriculture 3.0 and 4.0, and in view of an efficient and sustainable use of resources, researchers are increasingly focusing on the development of innovative smart farming and precision agriculture technologies by introducing automation and robotics into agricultural processes. Autonomous agricultural field machines have been gaining significant attention from farmers and industries as a way to reduce costs, human workload, and required resources. Nevertheless, achieving sufficient autonomous navigation capabilities requires the simultaneous cooperation of different processes; localization, mapping, and path planning are just some of the steps that aim at providing the machine with the right set of skills to operate in semi-structured and unstructured environments. In this context, this study presents a low-cost, power-efficient local motion planner for autonomous navigation in vineyards based only on an RGB-D camera, low-range hardware, and a dual-layer control algorithm. The first algorithm makes use of the disparity map and its depth representation to generate a proportional control for the robotic platform. Concurrently, a second back-up algorithm, based on representation learning and resilient to illumination variations, can take control of the machine in case of a momentary failure of the first block, generating high-level motion primitives. Moreover, due to the dual nature of the system, after initial training of the deep learning model with an initial dataset, the strict synergy between the two algorithms opens the possibility of exploiting new automatically labeled data, coming from the field, to extend the existing model’s knowledge. The machine learning algorithm has been trained and tested, using transfer learning, with images acquired during different field surveys in the North region of Italy and then optimized for on-device inference with model pruning and quantization.
Finally, the overall system has been validated with a customized robot platform in the appropriate environment.


I. INTRODUCTION
Nowadays, with the continuous growth of the human population, agriculture industries and farmers have been facing an exponential increase in the global demand for food production. According to the growth projections established in 2017 by the United Nations [1], by 2050 the global population will be around 9.7 billion and it is expected to reach 11. technologies aimed at maximizing efficiency and productivity of every single land sustainably.
Over the years, precision agriculture [2] and digital farming [3] have gradually contributed, through autonomous robotic machines and information collection, to improving crop yield and resource management, reducing labor costs, and, in part, increasing production efficiency. This has led to equipping harvesting machinery with driverless systems in order to maximize navigation efficiency by reducing the number of intersections in the path and, therefore, the amount of fuel consumed [4]. Once endowed with the appropriate effectors, these robotic vehicles can harvest [5,6], spray [7,8,9], seed [10], irrigate [11], and collect tree and crop data for inventory management [12,13,14]; when configured as platforms, they can carry laborers to prune and thin trees, hence reducing inefficiencies and injuries in the workplace [15]. Research on applications of mobile robotic systems in agricultural tasks has been increasing vastly [16]. However, despite the rise in investments and research activities on the subject, many implementations remain experimental and far from being applied on a large scale. Indeed, most of the proposed solutions require a combination of real-time kinematic GPS (RTK-GPS) [17,18] and costly sensors like three-dimensional multichannel Light Detection and Ranging (LIDAR) [19]. Besides being very expensive, those solutions are unreliable and prone to failure and malfunction due to their complexity.
On the other hand, several recent works [20,21] focus their efforts on finding an affordable solution for the generation of a global map with related way-points. However, path following inside vineyard rows is still a challenging task due to localization problems and the variability of the environment. Indeed, GPS receivers need an open area with a clear view of the sky in order to function [22]; hence, expensive sensors and solutions are needed in order to navigate through vineyard rows and follow the generated paths.
In this context, we present a low-cost, robust local motion planner for autonomous navigation in vineyards that tries to overcome some of the present limitations. Indeed, our power-efficient solution makes use only of an RGB-D camera, without involving any other expensive sensor such as LIDAR or RTK-GPS receivers. Moreover, we exploit recent advancements in Deep Learning [23] techniques and optimization practices for Edge AI [24] in order to create an overall resilient local navigation algorithm able to navigate inside vineyard rows without any external localization system. The machine learning model has been trained using transfer learning with images acquired during different field surveys in the North region of Italy, and we then validated the navigation system with a real robot platform in the relevant environment.

II. RELATED WORKS
As far as autonomous navigation is concerned, classic autonomous systems capable of navigating a vineyard adopt high-precision RTK-GPS [25,26,27,28] or laser scanners combined with GPS [29,30]. However, the lack of GPS availability due to environmental conditions such as large canopies, the need for prior surveying of the area, and unreliable connectivity in certain scenarios make GPS-free approaches desirable [31]. On the other hand, more recent approaches employ different types of sensors, usually combined with each other [19]. For example, Zaidner et al. introduced a data fusion algorithm for navigation, which optimally fuses the localization data from various sensors: GPS, inertial navigation system (INS), visual odometry (VO) and wheel odometry are fused in order to estimate the state and localization of the mobile platform [32]. However, as highlighted by the authors, there is a trade-off between cost and accuracy, and the data fusion algorithm could fail if the sensors differ greatly from one another.
Regarding affordable and low-cost solutions, Riggio et al. proposed a low-cost solution based only on a single-channel LIDAR and odometry, but it is greatly affected by the type of canopy and the condition of the specific vineyard [33]. Instead, the authors of [16] proposed a vision-based control system using a clustering algorithm and the Hough Transform in order to detect the central path between two rows. However, it is extremely sensitive to illumination conditions and intra-class variations.
On the other hand, the emerging need for automation in agricultural production systems and crop life cycles, together with the rapid expansion of deep learning, has led to the development of several architectures for a variety of applications in precision agriculture. For instance, in [34] and [35], the authors proposed solutions based on known architectures such as AlexNet [36], GoogleNet [37] and CaffeNet [38] to detect diseases in plants and leaves, respectively. Moreover, deep learning has been used for crop type classification [39,40,41,42], crop yield estimation [43,44,45,46], fruit counting [24,47,48], and even to predict the weather, forecasting temperature and rainfall [49].
We propose a novel approach based only on a consumer-grade RGB-D camera and the latest innovations in the machine learning field to create a noise-robust local motion planner. The algorithm that exploits the depth map produces a proportional control and is supported, in case of failure, by a deep learning model that produces high-level primitives [50]. Moreover, the strict synergy between the two system blocks opens the possibility to easily create an incremental learning architecture where new labeled data coming from the field are used to extend the existing capabilities of the machine learning model.
The remainder of this paper is organized as follows. Section III introduces the materials and data used for this research. Sections IV and V give a detailed overview of the proposed methodology and of the obtained experimental results, followed by the conclusion and future works.

III. MATERIALS AND DATA
In order to acquire a dataset for training and testing the deep neural network, we performed field surveys in two distinct rural areas in the North of Italy: Grugliasco, near the metropolitan city of Turin in the Italian region of Piedmont, and Valle San Giorgio di Baone, in the Province of Padua in the Italian region of Veneto. The collected video samples present different types of terrain and wine quality, and they were acquired at different times of the day, with diverse meteorological conditions. Videos were shot at 1080p with a 16:9 ratio in order to have more flexibility during data processing.
On the other hand, to acquire images and compute the depth map on the platform, we employed the Intel RealSense Depth Camera D435i stereo camera. It is a vision system equipped with an RGB camera and two infrared cameras, which computes the depth of each pixel of the acquired frame up to 10 meters.
Finally, for the practical in-field evaluations, the stereo camera has been installed on an unmanned ground vehicle (UGV): the model Jackal from Clearpath Robotics, endowed with an Intel Core i3-4330TE. The camera mounted on the chosen robotic platform is depicted in Figure 1.

IV. PROPOSED METHODOLOGY
Our goal is to develop a real-time local motion planner with an ultra-light computational load, able to overcome the practical problems faced by the GPS device when carrying out autonomous navigation along a vineyard row.
The workflow of our proposal is the following: first, the stereo camera acquires the frames with the RGB camera and, simultaneously, it provides a depth map computed through the two infrared cameras. Subsequently, a light depth-map-based algorithm processes the depth maps, detecting the end of the vineyard row, and consequently calculates the control values with a proportional controller on both linear and angular velocities. Unfortunately, in particular weather and lighting conditions, the depth map generation is unreliable and prone to error. Indeed, as in many outdoor applications, sunlight negatively influences the quality of the results and compromises the control given by the local navigation algorithm. To face this problem, as a back-up solution, we implemented a Convolutional Neural Network (CNN) trained to classify whether the camera is pointing at the center of the end of the vineyard row or at one of its sides. Once an output prediction is obtained, we can route the path of the robot properly to avoid collisions with the sides of the vineyard. Moreover, we exploited the latest advancements in model optimization techniques in order to obtain an efficient and lightweight neural network able to run inference in real time on a low-cost, low-power device with limited computational capabilities. The overall algorithm pseudo-code is reported in Algorithm 1. We integrated the proposed algorithm with the open-source Robot Operating System (ROS) to apply the generated control to the actuators of the selected UGV. Finally, to prevent the robot platform from colliding with unexpected obstacles that obstruct its way, we use the depth map provided by the stereo camera and apply a simple threshold value in order to immediately stop the motion in case of an impending collision.
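The collision check mentioned at the end of the paragraph above can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `should_stop`, the 600 mm stop distance, and the minimum pixel count are hypothetical values chosen for illustration, and the sketch assumes that a depth value of 0 marks a pixel without a valid reading.

```python
def should_stop(depth_mm, stop_distance_mm=600, min_pixels=3):
    """Return True when enough valid pixels are closer than the stop distance.

    depth_mm is a 2D list of depths in millimeters. Both thresholds are
    hypothetical; a real system would calibrate them on the platform.
    """
    close = 0
    for row in depth_mm:
        for value in row:
            # Assumption: 0 marks a pixel with no valid depth reading.
            if 0 < value < stop_distance_mm:
                close += 1
    return close >= min_pixels


clear_path = [[3000, 3200, 0], [2900, 3100, 3050]]
blocked = [[500, 450, 0], [480, 3100, 3050]]
print(should_stop(clear_path), should_stop(blocked))
```

Requiring several close pixels rather than one is a design choice in this sketch that makes the check less sensitive to isolated depth noise.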
The resulting system is a low-cost, power-efficient, and connection-free local path planner that can be easily integrated with a global system achieving fully autonomous navigation in vineyards.

A. Continuous Depth Map Control
In order to obtain a proportional control, we detect the center of the end of the vineyard row exploiting the depth-map provided by the stereo camera. Subsequently, the control values for the linear velocity and the angular velocity are calculated proportionally to the horizontal distance between the center of the end of the vineyard row and the longitudinal axis of the UGV.
To this end, we compute the largest area that gathers all the points beyond a certain depth value and then, we bound that area with a rectangle that will be used to compute the control values.
The depth map is a single-channel matrix with the same dimensions as the image, where each entry represents the depth in millimeters of the corresponding pixel in the camera frame. The limits of the depth computation are 0 and 8 meters; therefore, the values in the depth matrix range from 0 to 8000.
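The idea of isolating the largest far region of the depth matrix and bounding it with a rectangle can be sketched as follows. This is an illustrative sketch under stated assumptions, not the paper's code: it replaces edge detection and contour extraction with a plain flood fill over the binary mask, and the function name `detect_window`, the relative threshold of 0.7, and the minimum area are hypothetical.

```python
from collections import deque


def detect_window(depth_mm, rel_threshold=0.7, min_area=4):
    """Return (x, y, width, height) of the largest far-field region, or None."""
    rows, cols = len(depth_mm), len(depth_mm[0])
    max_depth = max(max(row) for row in depth_mm)
    if max_depth == 0:
        return None  # no valid depth in the frame
    # Normalize by the frame maximum, then threshold into a binary mask
    # (1 = far field, 0 = near field).
    mask = [[1 if v / max_depth > rel_threshold else 0 for v in row]
            for row in depth_mm]
    seen = [[False] * cols for _ in range(rows)]
    best = None
    for r in range(rows):
        for c in range(cols):
            if mask[r][c] == 0 or seen[r][c]:
                continue
            # Flood-fill one 4-connected region, tracking its bounding box.
            queue = deque([(r, c)])
            seen[r][c] = True
            top, bottom, left, right = r, r, c, c
            while queue:
                y, x = queue.popleft()
                top, bottom = min(top, y), max(bottom, y)
                left, right = min(left, x), max(right, x)
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if (0 <= ny < rows and 0 <= nx < cols
                            and mask[ny][nx] and not seen[ny][nx]):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            box = (left, top, right - left + 1, bottom - top + 1)
            if best is None or box[2] * box[3] > best[2] * best[3]:
                best = box
    # A missing or too-small rectangle is where a back-up model would take over.
    if best is None or best[2] * best[3] < min_area:
        return None
    return best


frame = [
    [1000, 1000, 1000, 1000, 1000, 1000],
    [1000, 8000, 8000, 8000, 1000, 1000],
    [1000, 8000, 8000, 8000, 1000, 8000],
    [1000, 1000, 1000, 1000, 1000, 1000],
]
print(detect_window(frame))
```

In the toy frame, the isolated far pixel on the right edge plays the role of a gap in the side of the row: it forms its own small rectangle and is discarded in favor of the larger one.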
The main steps of the proposed methodology, described in Algorithm 1, are detailed in the following points:
1) Matrix normalization: In order to have a solution adaptable to different outdoor scenarios, we need a dynamic definition of near field and far field. Therefore, we employ a dynamic threshold computed proportionally to the maximum acquired depth value. Hence, by normalizing the matrix, we obtain a threshold that changes dynamically depending on the values of the depth map.
2) Depth threshold: We apply a threshold on the depth matrix, obtained through a detailed calibration, in order to define which is the near field (represented with a "0") and which is the far field (represented with a "1"). At this point the depth matrix is a binary mask.
3) Bounding operation: We perform edge detection on the binary image, extracting the contours of the white areas, and then we bound these contours with a rectangle.
4) Back-up solution: If no white area is detected, or if the area of the largest rectangle is less than a certain threshold, we activate the back-up model based on machine learning.
5) Window selection: On the other hand, if there are multiple detected rectangles, we evaluate only the biggest one in order to get rid of the noise. The threshold value for the area is obtained through a calibration and is used to avoid false positive detections. In fact, the holes on the sides of the vineyard row can be detected as large areas with all the points beyond the distance threshold, and therefore they can lead to a wrong command computation. To prevent the system from navigating autonomously on the basis of an erroneous detection of the end of the vineyard row, we calibrated the threshold to drastically reduce the possibility of this occurring. From now on, with the term window we will refer to the largest rectangle detected in the processed frame whose area is greater than the area threshold.
6) Control values: The angular velocity and the linear velocity values are both proportional to the horizontal distance (in pixels) between the center of the detected window and the center of the camera frame, and are based on a parabolic function. The distance d is computed as:

$$d = X_w - X_c \tag{1}$$

where X_w is the horizontal coordinate of the center of the detected rectangle and X_c is the horizontal coordinate of the center of the frame. Figure 2 shows a graphical representation of the computation of the distance d.
The control value for the angular velocity (ang_vel) is calculated through formula (2), in which max_ang_vel is the maximum angular velocity achievable and w is the width of the frame.

Algorithm 1 (excerpt)
    ... continue from line 18
13: else
14:     angular velocity()
15:     linear velocity()
16:     acquire next frame and restart from line 1
17: end if
18: I_{1×rh×rw×3} ← preprocessing(F_{h×w×3})
19: model prediction(I_{1×rh×rw×3})
20: ML controller()

As far as the linear velocity (lin_vel) control function is concerned, it is still a parabola, but this time the lower the distance d, the higher its value. In the corresponding formula, max_lin_vel is the maximum linear velocity achievable and w is the width of the frame. Both control characteristic curves are depicted in Figure 3: the linear velocity control function is plotted in Figure 3a, whereas the angular velocity control function is shown in Figure 3b.
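To make the parabolic control law more concrete, the sketch below shows one pair of functions that matches the verbal description: the angular velocity grows quadratically with the offset of the window from the frame center, and the linear velocity falls quadratically with it. The exact expressions are an assumption and are not taken from the paper; the normalization by w/2, the sign convention (positive angular velocity meaning a counter-clockwise turn, as is customary in ROS), and the function name `control_values` are all illustrative.

```python
def control_values(x_w, x_c, w, max_lin_vel, max_ang_vel):
    """Return (lin_vel, ang_vel) from the horizontal offset of the window.

    Assumed parabolic forms, not the paper's exact equations:
      ang_vel = -sign(d) * max_ang_vel * (d / (w / 2)) ** 2
      lin_vel = max_lin_vel * (1 - (d / (w / 2)) ** 2)
    """
    d = x_w - x_c
    # Normalized offset in [0, 1]; w / 2 is the largest offset inside the frame.
    ratio = min(abs(d) / (w / 2), 1.0)
    # Assumption: a window on the right (d > 0) requires a clockwise turn,
    # which is a negative angular velocity under the ROS convention.
    direction = -1.0 if d > 0 else 1.0
    ang_vel = direction * max_ang_vel * ratio ** 2
    lin_vel = max_lin_vel * (1.0 - ratio ** 2)
    return lin_vel, ang_vel


print(control_values(x_w=480, x_c=320, w=640, max_lin_vel=0.5, max_ang_vel=1.0))
```

With this form, a centered window yields full forward speed and no rotation, while a window at the edge of the frame yields a pure rotation at the maximum angular velocity.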

B. Discrete CNN Control
Navigating in an outdoor environment can be extremely challenging. Among several issues, we noticed that sunlight can be very deceptive when performing edge detection, and it could lead to hazardous situations when using a fully camera-based navigation system. Therefore, besides the depth-map-based algorithm, we propose a back-up solution that exploits machine learning methodologies in order to assist the main algorithm in case of failure. These are the last points described in Algorithm 1.
Greatly inspired by Giusti et al. [51], our second approach relies on a convolutional neural network (CNN) that classifies the frames acquired by the camera into the following three classes: left, center and right. In a vineyard scenario, the class center describes the view of the camera when the vehicle is pointing at the end of the vineyard row, whereas the classes left and right indicate whether the vehicle is pointing at the left side or at the right side of the vineyard row, respectively.
Subsequently, using the predictions of the trained network, we designed a basic control system to route the path of the robot through vineyard rows. Moreover, we exploited the latest advancements in model optimization techniques in order to obtain an efficient and lightweight network able to run inference in real time on a low-cost edge AI platform.
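A basic controller of this kind can be sketched as a mapping from the predicted class to a motion primitive. The mapping and the velocity values below are assumptions made for illustration, since the text does not specify them, and `ml_controller` is a hypothetical name.

```python
CLASSES = ("left", "center", "right")


def ml_controller(probabilities, lin_vel=0.3, ang_vel=0.5):
    """Map softmax scores over (left, center, right) to (linear, angular)."""
    best = max(range(len(CLASSES)), key=lambda i: probabilities[i])
    label = CLASSES[best]
    if label == "center":
        return lin_vel, 0.0   # aligned with the row: drive straight
    if label == "left":
        return 0.0, -ang_vel  # pointing at the left side: rotate clockwise
    return 0.0, ang_vel       # pointing at the right side: rotate counter-clockwise


print(ml_controller([0.05, 0.90, 0.05]))
print(ml_controller([0.80, 0.15, 0.05]))
```

Rotating in place when the robot is misaligned is the most conservative choice for this sketch; a real controller could instead blend a reduced forward speed with the rotation.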
1) Network Architecture: We carefully selected from the literature a deep learning architecture that reaches high performance while containing computational requirements and hardware costs. The MobileNet [52] network, due to its efficient design, works reasonably fast on mobile devices and embedded systems without too much memory allocation.
The structure of MobileNet, illustrated in Figure 4, consists of a first convolutional layer with n = 32 filters and stride s = 2, followed by 13 layers that include depthwise (dw) and pointwise convolutions. This design largely reduces the number of parameters and the inference time while still providing a reasonable accuracy level. After each convolution, batch normalization [53] and the ReLU activation function [54] are applied. Every two blocks, the number of filters is doubled while the first two dimensions are reduced with a stride greater than one. Finally, an average pooling layer resizes the output of the last convolutional block and feeds a fully connected layer with a softmax activation function that produces the final classification predictions.
We modified the original final fully connected layers of MobileNet by substituting them with two fully connected layers of 256 and three neurons, respectively. The resulting model is a CNN with an overall depth of 90 layers and just 3,492,035 parameters.
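The reported parameter count can be checked with a short calculation. The sketch assumes that the convolutional base is the standard MobileNet at width multiplier 1.0 without its classification head, whose size of 3,228,864 parameters and 1024 output features after average pooling are figures not stated in the text.

```python
# Assumption: parameters of the MobileNet convolutional base (no classifier).
MOBILENET_BASE_PARAMS = 3_228_864
POOLED_FEATURES = 1024  # assumption: channels after the average pooling layer


def dense_params(inputs, units):
    """Weights plus biases of one fully connected layer."""
    return inputs * units + units


head = dense_params(POOLED_FEATURES, 256) + dense_params(256, 3)
total = MOBILENET_BASE_PARAMS + head
print(head, total)
```

Under these assumptions the two new layers add 263,171 parameters and the total matches the 3,492,035 parameters quoted above.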
Moreover, we optimized the network model and we sped up the inference procedure by using the framework provided by NVIDIA TensorRT [55].
2) Pre-Processing: In the pre-processing phase, before feeding the network, we normalize and resize the images to the input dimensions rh × rw expected by the model. Indeed, the last two fully connected layers constrain the network to a fixed input size. More specifically, for our modified MobileNet, the input dimensions are 224 × 224.
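The pre-processing step can be illustrated with a dependency-free sketch. The interpolation method and the normalization range are assumptions: nearest-neighbor sampling and scaling to [-1, 1] are used here only because they are simple to show, and `preprocess` is a hypothetical name.

```python
def preprocess(frame, rh=224, rw=224):
    """Resize an H x W x 3 frame (nested lists, values 0-255) to rh x rw,
    scale it to [-1, 1], and return it as a batch of one image."""
    h, w = len(frame), len(frame[0])
    image = []
    for y in range(rh):
        src_row = frame[y * h // rh]  # nearest source row
        image.append([[c / 127.5 - 1.0 for c in src_row[x * w // rw]]
                      for x in range(rw)])
    return [image]  # shape 1 x rh x rw x 3


tiny = [[[255, 0, 51] for _ in range(4)] for _ in range(2)]
batch = preprocess(tiny)
print(len(batch), len(batch[0]), len(batch[0][0]), len(batch[0][0][0]))
```

The output shape mirrors the tensor I of size 1 × rh × rw × 3 that appears in the algorithm listing.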

V. EXPERIMENTAL DISCUSSION AND RESULTS
In this section, we discuss the details of the deep learning model training, together with its dataset generation and evaluation. Furthermore, we introduce the optimization adjustments applied to the network in order to boost the control frequency to 47.15 Hz. Finally, we conclude with the experimental data and results gathered during the field tests.

A. Dataset Creation
We used a dataset of 33,616 images equally balanced across the three previously introduced classes. In order to create the training dataset, as previously introduced in Section III, we took several videos in a variety of vineyard rows with a 1080p resolution camera in order to have more flexibility during the pre-processing phase. In particular, for the first video, of the center class, we recorded rows with the camera pointing at their center. For the other two videos, of the classes left and right, we recorded with the camera rotated by 45 degrees with respect to the longitudinal axis of the row towards the left and the right side, respectively. Finally, we treated each video as a stream of images and selected the best frame out of every six consecutive ones, using a Laplacian filter to detect the least blurred one. Figure 5 shows an example for each class.
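The frame selection step can be sketched by scoring each frame with the variance of its Laplacian, a common sharpness measure, and keeping the highest-scoring frame in every group of six. The use of the variance as the score is an assumption, since the text only states that a Laplacian filter was used, and the function names are hypothetical.

```python
def laplacian_variance(gray):
    """Variance of the 4-neighbor Laplacian over the interior pixels of a
    2D grayscale image (nested lists, at least 3 x 3)."""
    values = []
    for y in range(1, len(gray) - 1):
        for x in range(1, len(gray[0]) - 1):
            values.append(gray[y - 1][x] + gray[y + 1][x]
                          + gray[y][x - 1] + gray[y][x + 1]
                          - 4 * gray[y][x])
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


def select_sharpest(frames, group_size=6):
    """Keep the frame with the highest Laplacian variance in each group."""
    return [max(frames[i:i + group_size], key=laplacian_variance)
            for i in range(0, len(frames), group_size)]


flat = [[10] * 5 for _ in range(5)]                      # no detail at all
sharp = [[255 * ((x + y) % 2) for x in range(5)] for y in range(5)]
kept = select_sharpest([flat] * 5 + [sharp] + [flat] * 2)
print(len(kept), kept[0] is sharp)
```

A blurred frame has weak edges, so its Laplacian response is nearly constant and the variance is low; the checkerboard in the example is the opposite extreme.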

B. Model Training
As already introduced, we trained the network using a technique known as transfer learning [56]: instead of starting the training with randomly initialized weights, we used variables obtained in an earlier training session. In particular, we exploited weights obtained by fitting MobileNet on the ImageNet classification dataset [57]. Using this technique, we were able to take advantage of the low-level features previously learned by the network, greatly reducing the number of images and epochs required for the training. Indeed, edges, contours, and basic textures are general-purpose features that can be reused for different tasks. In order to properly train, validate and test the model, we randomly divided the dataset into three subsets as follows: 70% for the training set, 15% for the development set, and the remaining 15% for the test set. We trained the resulting network for only six epochs with a batch size of 64.
To increase the robustness of the network and to overcome possible problems of overfitting, we used different techniques such as dropout [58], weight decay [59] and data augmentation with changes in zoom and brightness [60]. Finally, we used mini-batches with the RMSprop optimizer [61], the accuracy metric, and cross entropy as the loss function.
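The random 70/15/15 split described above can be sketched as follows. The function name, the fixed seed, and the rounding rule (fractions are truncated and the remainder goes to the test set) are assumptions made for illustration.

```python
import random


def split_dataset(items, train=0.70, dev=0.15, seed=0):
    """Shuffle the items and split them into train, dev and test subsets."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    n_train = int(len(shuffled) * train)
    n_dev = int(len(shuffled) * dev)
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_dev],
            shuffled[n_train + n_dev:])


train_set, dev_set, test_set = split_dataset(range(33616))
print(len(train_set), len(dev_set), len(test_set))
```

With the 33,616 images of the dataset, this rounding rule gives 23,531 training, 5,042 development and 5,043 test images.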

C. Machine Learning Model Evaluation and Optimization
The implemented model has been trained and tested with the subdivision of the dataset introduced in previous Section V-B, giving an accuracy of 1.0 over the test set. Therefore, this model is the one employed for the navigation.
In order to inspect the model and justify its high accuracy, we plotted the intermediate activations of the trained network and adopted Grad-CAM [62] to highlight the regions of the image that are important for predicting the correct class. Figure 7 shows some feature maps at different levels of depth: immediately after the first convolution, at an intermediate point, and before the average pooling layer. It is clear how the deep learning model is able to generate robust feature maps already after the first convolution. Later, those representations are exploited in order to produce disentangled representations that easily allow the model to predict the three different classes with a high level of confidence. Figure 6, instead, presents the regions of interest for the three different classes. With Grad-CAM we can visually validate that the network is activating around the proper patterns of the input image and that it is not exploiting shortcuts to achieve a high level of accuracy. Indeed, we can easily assess that the model, trained with transfer learning, is exploiting the vineyard rows and their vanishing point to obtain an effective generalization power.
Fig. 7. Three input images belonging to different classes, with their respective activation maps taken at different levels of depth of the network. Already at early stages, the network, pre-trained on ImageNet, is able to extract useful representations that lead to robust, disentangled activations in the final layers. It is possible to notice how the two spatial dimensions are increasingly reduced.

Moreover, in order to evaluate the robustness of the network over new scenarios, and to show how effective transfer learning is for this specific application, we performed an experiment in which the model was trained with only a small part of the available dataset. We trained the architecture with a single vineyard type and tested the resulting model on five completely different scenarios with diverse wine quality and weather conditions. In particular, we used only 18% of the images as training examples, corresponding to 6,068 images, due to the number of images available for each region in the dataset. Consequently, we tested the newly trained network with the remaining 27458 samples. As shown in Table I, an accuracy of 0.94 is achieved by the re-trained model in this second case. That is a very good result considering that the network has been trained with a very small dataset and tested on completely different vineyard scenarios. This clearly demonstrates how transfer learning, for this specific task, is very effective at providing good generalization capabilities even with a small training set. We also compared, using this last split, the selected network with other notable architectures from the literature. As is clear from Table II, MobileNet offers the right balance between average accuracy and computational requirements.
However, in the presence of a platform with more flexible computational constraints, EfficientNet-B1, or networks with a higher compound coefficient φ [63], would be much more likely to generalize over new scenarios while maintaining an optimal level of efficiency. Finally, as previously introduced, the employed network has been optimized with the TensorRT framework, discarding all redundant operations and reducing the floating-point precision from 32 to 16 bits. The optimization process, besides not affecting the accuracy of the predictions, significantly increases the number of frames processed per second by our model on the same hardware supplied with the robot. In fact, the control frequency using TensorFlow with a frozen graph (the computational graph of the network without optimization and training nodes) was 21.92 Hz, whereas, with the performed optimization, we reached 47.15 Hz.
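The two control frequencies reported above translate into a speed-up factor and a per-frame time budget, which the following short calculation makes explicit. Only the two frequencies come from the text; the derived quantities are plain arithmetic.

```python
baseline_hz = 21.92   # TensorFlow frozen graph
optimized_hz = 47.15  # after the TensorRT optimization

speedup = optimized_hz / baseline_hz
baseline_ms = 1000 / baseline_hz    # time per control step before
optimized_ms = 1000 / optimized_hz  # time per control step after

print(f"{speedup:.2f}x, {baseline_ms:.1f} ms -> {optimized_ms:.1f} ms")
```

The optimized model therefore runs a little more than twice as fast, with each control step dropping from roughly 45.6 ms to roughly 21.2 ms.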

D. Field Experimentation
As far as the deployment is concerned, the system has been implemented on a ROS-oriented robot platform. The robot on which the local planner has been tested is an unmanned ground vehicle: the model Jackal from Clearpath Robotics (Figure 8) introduced in Section III.
The tests have been carried out in a new vineyard scenario. In order to navigate correctly, the stereo camera has been installed in such a way that the center of the camera frame corresponds to the longitudinal axis of the vehicle. The proposed solution, after several trials with different vineyard rows but similar weather conditions, proved able to perform autonomous navigation along the given paths, even when lowering the resolution of the camera to 640 × 480. More specifically, for the two infrared cameras with which the camera computes the depth map, the resolution has been set to 640 × 480, whereas for the RGB images processed by the machine learning model we started with a resolution of 1280 × 720 and then gradually reduced it to 640 × 480, as mentioned. Moreover, when acquiring images we used the default calibration provided by Intel, with the camera lens distortion based on the Brown-Conrady model [67]. The intrinsic parameters for the final configuration of the camera in both the so-called depth and color modes are shown in Table III. All our tests showed precise trajectories, comparable with the ones obtained with data fusion techniques that make use of several expensive sensors to maintain the correct course. As noticeable from the depth-map samples in Figure 9, taken during the field experimentation, the first method can detect the end of the vineyard row independently of the direction of the longitudinal axis of the robot. The detected rectangle is subsequently used to generate the control signals. In case of a fault of this solution, as previously introduced, the machine learning based algorithm takes control. Finally, it is possible to exploit the distance value d to easily collect new, already labeled, sample data during the operational work of the robotic platform.
Indeed, due to the nature of all mini-batch gradient descent based optimizers, it is possible to continuously use new data points to extend the existing model's knowledge, obtaining a more robust neural network that generalizes better.

Fig. 9. Instances of the depth-map based algorithm while performing tests in the vineyards. Wherever the robot is pointing, it is capable of correctly detecting the end of the vineyard rows.
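The automatic labeling idea, in which the distance d computed by the depth-map algorithm supplies a class label for the RGB frame acquired at the same instant, can be sketched as follows. The tolerance that separates center from the side classes and the mapping between the sign of d and the class names are assumptions; `label_from_distance` is a hypothetical name.

```python
def label_from_distance(d, w, center_tolerance=0.1):
    """Derive a class label from the signed pixel offset d = X_w - X_c.

    center_tolerance is a hypothetical fraction of the frame width w within
    which the robot is considered aligned with the row.
    """
    if abs(d) <= center_tolerance * w:
        return "center"
    # Assumption: when the window lies to the right of the frame center
    # (d > 0), the camera is pointing towards the left side of the row.
    return "left" if d > 0 else "right"


for offset in (10, 200, -200):
    print(offset, label_from_distance(offset, w=640))
```

Frames labeled this way while the depth-map controller is active could then be appended to the training set and used for further mini-batch updates of the network.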

VI. CONCLUSIONS
We proposed a local motion planner for autonomous navigation in vineyard rows. We exploited the stereo vision properties of an RGB-D camera and the latest advancements in deep learning optimization techniques in order to obtain a lightweight, power-efficient algorithm able to run on low-cost hardware.
The proposed overall methodology provides a real-time control frequency using only hardware with limited computational capabilities, containing costs and required resources. The back-up neural network is robust to different factors of variation and, after the optimization procedure, it provides a control frequency of 47.15 Hz without the need for external hardware accelerators.
Finally, the proposed local motion planner has been implemented on a robotic platform and tested in the relevant environment, demonstrating that it scales to real working conditions even with a low resolution.
As future work, we plan to integrate the presented work with a concrete application and to extend the methodology to orchards and other analogous scenarios.