Visual SLAM for Indoor Livestock and Farming Using a Small Drone with a Monocular Camera: A Feasibility Study

Real-time data collection and decision making with drones will play an important role in precision livestock and farming. Drones are already being used in precision agriculture. Nevertheless, this is not the case for indoor livestock and farming environments due to several challenges and constraints. These indoor environments are limited in physical space, and localization is a problem because GPS is unavailable. Therefore, this work aims to take a step toward the usage of drones for indoor farming and livestock management. To investigate drone positioning in these workspaces, two visual simultaneous localization and mapping (VSLAM) algorithms, LSD-SLAM and ORB-SLAM, were compared using a monocular camera onboard a small drone. Several experiments were carried out in a greenhouse and a dairy farm barn, and the absolute trajectory error and the relative pose error were analyzed. ORB-SLAM was found to be the approach that best suits these workspaces. This algorithm was tested by performing waypoint navigation and generating maps from the clustered areas. It was shown that aerial VSLAM could be achieved within these workspaces and that plant and cattle monitoring could benefit from using affordable and off-the-shelf drone technology.


Introduction
As the world population increases every day, with the prospect of reaching 9 billion people by 2050 [1], there is a growing demand for food, which means that food production also needs to increase. At the current growth rate, food production needs to increase by about 70% [2]. Meeting this demand requires new techniques that increase productivity and decrease the amount of human labor. Intelligent and automated systems for agricultural operations are essential to meet challenges such as labor shortage, human safety, and production costs by saving time, money, and energy [3]. Precision agriculture can be explained as a technique that helps the producer make better decisions per unit area of land and per unit of time [4].
Currently, more crops and vegetables are grown in greenhouses, and monitoring indoor cultivars is as important as monitoring field-grown crops [5]. Image sensors have become popular and have been used in greenhouses to collect data for applications such as plant monitoring. In the research of [6], a method was designed for detecting and classifying bacterial spot diseases in tomato fields using camera images. In the research of [7], the authors used a handheld camera to acquire images of cucumbers inside a greenhouse, and deep learning was then used for detection.
Precision livestock is defined as the application of the precision agriculture concept within livestock farming, using a variety of sensors and actuators to improve the management capacity for large groups of animals [8]. The precision livestock concept is based on real-time data collection and analysis for management purposes [9]. In the research of [10], camera images were combined with vision algorithms to detect changes in animal posture for lameness detection within a herd. Another example can be found in [11], where camera image analysis was used. Approaches for determining the animal condition score have also been based on digital image analysis [12].
A current challenge in indoor precision farming and livestock is the need for real-time site monitoring. Collecting data for monitoring crop growth or animal condition is among the most frequent and time-consuming processes in on-farm operations [13]. Therefore, new remote sensing approaches relying on autonomous robots could be an important asset for future indoor farming and dairy farm management. Remote sensing in greenhouses using mobile robots was previously reported. In the research of [14,15], a robotic system was proposed for detecting diseases of bell peppers. Plant issues such as nitrogen and calcium deficiencies could be detected by using image processing, aiming for higher overall plant product efficiency [16]. Nevertheless, all these works use ground robot platforms that are usually large and have a limited field of view.
Unmanned aerial vehicles (UAVs) have become popular as they can be easily deployed and, when combined with advanced sensing technology, they are suitable to overcome many challenges in the agricultural sector such as pest control, weed detection, spraying, and crop monitoring [17,18,19]. Furthermore, the authors of [2] found that the most appropriate applications for drones in agriculture are irrigation, crop monitoring, soil and field analysis, and control of birds and diseases. UAV localization is important for achieving these applications. Outdoors, UAVs rely on the global positioning system (GPS). However, GPS is not an option for indoor environments. Therefore, other solutions need to be developed to enable autonomous UAV navigation inside greenhouses or in dairy farm barns.
The application of UAVs in such environments is still in an early stage. In the research of [20], a small quadrotor was used for measuring environmental variables within a greenhouse. The optimal locations of the sensors on the UAV were determined. Moreover, the use of a UAV as a platform for measuring environmental variables was validated by field experiments in a greenhouse. The authors concluded that a mini-UAV sensory system could be used for obtaining environmental variables. Another study of UAVs in an indoor agricultural environment is the work of [21], which covers a precision landing control approach for agricultural UAVs. By using GPS for localization and image recognition for locating the landing platform, the authors established a basic setup for landing control in animal houses. In the study of [22], a multirobot setup consisting of a UAV and a ground robot was proposed for monitoring environmental variables in greenhouses. In these previous works, the authors mention the challenge of localization and navigation of UAVs in greenhouse environments. Due to the complexity of the environment, the drone was controlled manually.
Estimating the UAV position for indoor farming could be achieved in several ways. In the research of [23], a localization principle was developed for ground robot navigation on a large-scale farm. The authors combined a light detection and ranging sensor with the wheel odometry obtained from the robot, proposing a mapping and localization system in terms of position and orientation for challenging agricultural scenarios. In the study of [24], a non-vision-based localization technique was used to obtain the position and orientation of robots within a greenhouse environment. The position and orientation of a small robot were obtained using a sound signal that was recognized by a receiver architecture installed in the greenhouse. A drawback of these systems is that they are either too heavy and large to be carried onboard a drone or rely on a more complex setup to determine the position of the robot. In the research of [25], the position and orientation of a wheeled mobile robot in a greenhouse were estimated using an onboard camera. An onboard visual positioning approach would be more suitable for UAV navigation as well.
Several solutions for performing localization tasks have been proposed in the literature, and they can be divided into two main categories: vision-based and non-vision-based approaches. An example of a non-vision-based principle can be found in [26], which uses acoustic signals for performing localization tasks. Although the authors claimed that higher tracking accuracies could be achieved by using acoustic signals, the UAVs require additional acoustic sensors, there are distance limitations, and the environment needs to be adapted beforehand. In [27], ultrawide band signals were proposed as a localization method. However, obstacles between the transmitter and the receiver reduce the system performance. Another approach using a combination of light detection and ranging sensors and inertial measurement unit (IMU) data was proposed in [28]. Although high accuracy was observed, light detection and ranging sensors are relatively large, heavy, and expensive, which can make this method difficult to adopt in UAV applications. Regarding vision-based localization methods, there has been a growing interest in recent years in using only visual information as localization input. The authors in [29] proposed a system setup where a stereo camera was used to achieve localization in greenhouse environments. However, stereo cameras usually do not come integrated in the UAV platform, and setting up such a system requires further complex steps; thus, the system will be less affordable. Taking all these challenges into account, the current study focuses on monocular approaches, since most current UAVs are equipped with off-the-shelf cameras. Cameras are lightweight sensors and they have proved to be a suitable data source for UAV perception, while no additional sensors are needed.
In order to deploy UAVs in indoor agricultural environments, an additional degree of perception is required [30]. Simultaneous localization and mapping (SLAM) is the principle of obtaining a map of the environment using different types of sensors, enabling a robot to localize itself within the built map [31]. To perform those tasks in real time, data from different sensors need to be processed, and hence the computational power requirements increase dramatically. This can become a serious bottleneck since small UAVs are not equipped with adequate computational units, due to power supply and weight constraints. Therefore, the computationally intensive processes are assigned to a ground station. This method increases the possibilities of the small UAV and, at the same time, it enables the implementation of any commercial drone on the same setup without major changes. However, localization and mapping are computationally heavy tasks, and the resources of the ground station can easily be exhausted, especially when other complex tasks such as image processing are assigned simultaneously to the ground unit.
In recent years, a growing interest toward system setups using only visual information as input for SLAM principles, called visual SLAM (VSLAM), could be identified. This is due to the recent big improvements made in terms of localization performances of VSLAM algorithms [32]. Due to the increasing computational power of computers [33], VSLAM localization principles became more popular for indoor localization purposes. VSLAM principles were only used a few times in agricultural applications [29,34]. Therefore, it can be seen that the usage of VSLAM algorithms for UAV localization tasks in indoor agricultural environments is still at an early stage. Limited research was conducted with the purpose of exploring localization performances of such VSLAM approaches regarding indoor agricultural environments. As such, this work aims to compare two open-source state-of-the-art VSLAM algorithms for localization tasks applied to indoor precision livestock and farming using a small UAV with an onboard camera toward real-time data collection without human interaction for agricultural applications.
The aim of this work is to evaluate the performance and consistency of state-of-the-art VSLAM algorithms in specific agricultural environments, implemented on inexpensive small UAVs. Since these environments have not been studied in this context before, we evaluate two monocular VSLAM algorithms and examine the feasibility of their integration in such cases.

Visual SLAM Algorithms
SLAM is used for creating a map of an unknown area while the algorithm estimates its own position based on this created map. SLAM can be achieved using data from different sensors, but the two main categories are the laser-scan-based principles and the visual-based principles, also known as VSLAM [35]. Within monocular VSLAM, two main methods of extracting information from images can be distinguished: the direct and the feature-based methods [36]. The direct method (dense or semidense) optimizes the geometry directly from the image pixel intensities. For the feature-based method, a set of feature observations is extracted from the image, and thereafter, the camera position and scene geometry are computed as a function of these feature observations [37]. An advantage of direct methods is that, because they compare the entire image, they can obtain a denser representation of the environment than feature-based methods.
As both feature-based and direct methods have their pros and cons according to different environment circumstances and purposes [38], it can be concluded that it is important to take a feature-based method into account as well as a direct method to assess their performances in indoor agricultural environments.
For the algorithm selection in the direct and feature-based categories, the currently best performing VSLAM algorithms according to benchmark data sets were chosen. They also had to meet the following requirements: the VSLAM algorithm must be open-source, and it must be able to work in real time, as the aim is real-time localization. Furthermore, the VSLAM algorithm must be able to work within the Robot Operating System (ROS) and, finally, it has to support monocular input, as monocular VSLAM is part of the aim. LSD-SLAM [39] was found to meet these criteria for the direct methods, and ORB-SLAM, in the implementation of ORB-SLAM2 [40], for the feature-based methods. Therefore, those two were selected for comparison.
ORB-SLAM, introduced by [40], is a feature-based method that builds on the earlier parallel tracking and mapping (PTAM) method [41]. The features extracted from the input image are used in every step of the algorithm, and the extraction is done by the ORB extractor [42], hence the ORB-SLAM naming. An improved and updated version of ORB-SLAM can be found in ORB-SLAM2 [40]. Even more recently, a beta version of ORB-SLAM3 was released [43].
LSD-SLAM, introduced by [39], is a direct method. This algorithm takes the whole image as an input and semidense point clouds are built up by applying pixelwise depth filters in small-baseline stereo comparisons.
Both ORB-SLAM and LSD-SLAM were found to be able to work in a real-time setup. In this setup, ORB-SLAM produces sparse point clouds and LSD-SLAM produces semidense point clouds.

UAV and Software Architecture
The small UAV that was used for the experiments and data acquisition is the DJI Tello. It is an inexpensive off-the-shelf platform equipped with a monocular camera with a 2592 × 1936 image resolution. This platform can fly for up to 13 min within a range of 100 m. For communication, it uses WiFi at 2.4 GHz. To control the drone, a PC was allocated as a ground station. The ground station was configured to run ROS Melodic under Ubuntu 18.04.
In Figure 1, the system architecture is illustrated. Navigation data and RGB image data were transmitted to the ground station. Using ROS, the data were fed as an input to each of the two algorithms, LSD-SLAM and ORB-SLAM, respectively. Then, the outputs of the VSLAM algorithm were used as an input to the software development kit (SDK) to control the drone, and at the same time, they were sent to EVO, a trajectory evaluation package by [44], to evaluate the accuracy of each algorithm. It is worth mentioning that each algorithm was evaluated individually to avoid potential issues such as computational bottlenecks, which could affect the results.

Experimental Setup
The indoor agricultural environment selected for the experiments comprised the greenhouses of the company Fa. Duijnisveld. The crop that was cultivated is the bell pepper. The total farm size is 9 hectares of greenhouses, and they are located in De Kwakel (The Netherlands). The indoor livestock environment selected was the dairy farm of the company Mts. Van Leusen, located in Zwolle (The Netherlands). This dairy farm has 90 dairy cows that are used for milk production. Figure 2 shows the two sites described. These agricultural and livestock companies were selected because the activities carried out there are labor-intensive and could benefit from smart techniques using drones to perform several tasks. Within greenhouses, the grown crops can vary, but a common feature is that they all contain a main corridor. Furthermore, the crop is grown in rows and has a rich canopy. Considering this, a flying strategy was defined to fly the drone inside the greenhouse: between the rows, which is the ideal place to monitor the crop, and over the main corridor, as that is the place where a drone performs turns to go to the next row (and lands). The experimental setup in the greenhouse is illustrated in Figure 3.
A common feature in all dairy barn environments is the feeding alley, to which the animals come several times a day to eat. The main difference between the greenhouse and the barn is that the crop canopy is static, whereas the animals can move in the feeding alley. This workspace would ideally use a drone that autonomously monitors the animals' health and behavior. Therefore, it is important to know how the algorithms perform in such an environment. The experimental setup in the dairy farm is illustrated in Figure 4. Drones could reduce the labor needed and save money. Some of the challenges that will be found in these environments are that the lighting conditions may change (mainly inside the greenhouse) and that the scene changes over time (it is more dynamic in the barn because the animals move constantly).

Evaluation Metrics
Monocular SLAM algorithms work with an arbitrary scale. Therefore, to make a fair comparison, the trajectories first need to be aligned so that the scale factors are equal. In short, one of the multiple runs performed with an algorithm was taken as the reference, and the other runs were scaled to the same scale factor. As in [38], the references were obtained directly from the VSLAM algorithms. Scaling the runs to each other was necessary; otherwise, the error due to the differences in scale would also be taken into account, which was not desired. This is because the aim was to study the variation in positioning rather than the variation in scale. Therefore, before making a comparison, the trajectories were aligned by the method of [45], which aligns two sets of points in k-dimensional Euclidean space.
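As an illustration of this alignment step, the following is a minimal NumPy sketch of a least-squares similarity alignment (scale, rotation, translation) between two corresponding point sets, in the spirit of the method cited as [45]. The function name `umeyama_alignment` and its argument names are assumptions made for this sketch; they are not taken from the original work or from the EVO package, which provides its own implementation.

```python
import numpy as np


def umeyama_alignment(est, ref, with_scale=True):
    """Least-squares similarity transform mapping `est` onto `ref`.

    est, ref: (N, k) arrays of corresponding points.
    Returns (scale, R, t) such that ref ~= scale * est @ R.T + t.
    (Hypothetical helper written for illustration only.)
    """
    est = np.asarray(est, dtype=float)
    ref = np.asarray(ref, dtype=float)
    n, k = est.shape

    # Centre both point sets on their means.
    mu_est, mu_ref = est.mean(axis=0), ref.mean(axis=0)
    est_c, ref_c = est - mu_est, ref - mu_ref

    # Cross-covariance between the centred sets and its SVD.
    cov = ref_c.T @ est_c / n
    U, d, Vt = np.linalg.svd(cov)

    # Guard against reflections so that det(R) = +1.
    S = np.eye(k)
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:
        S[-1, -1] = -1.0

    R = U @ S @ Vt
    var_est = (est_c ** 2).sum() / n
    scale = float(np.trace(np.diag(d) @ S) / var_est) if with_scale else 1.0
    t = mu_ref - scale * R @ mu_est
    return scale, R, t
```

Applying the returned transform to every position of a run (`scale * est @ R.T + t`) brings it into the frame and scale of the reference run, so that the remaining differences reflect positioning variation rather than scale.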
After the alignment step, evaluations between and within the algorithms could take place. For evaluating the performance and accuracy of both LSD-SLAM and ORB-SLAM, the approach in [46] was used. Using this method, two main comparisons were made: the relative pose error and the absolute trajectory error, which is often also called the absolute pose error.
The relative pose error gives insight into the local accuracy of the trajectory, which is evaluated over a fixed time interval ∆t. The relative pose error therefore corresponds to the drift of the trajectory. It was calculated using Equation (1), in which the terms Pref and Pest are the sequences of poses from the trajectories that are being compared. Whereas in [46] comparisons were performed to find the absolute accuracy of VSLAM algorithms, this work focused on the robustness and consistency of VSLAM algorithms in terms of variation in positioning.
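The idea of the relative pose error can be sketched in a few lines of NumPy. This is a hedged illustration that follows the commonly used benchmark formulation (compare the motion over a fixed interval in both trajectories); it is not a reproduction of Equation (1). Poses are assumed to be 4 × 4 homogeneous matrices, the interval is expressed as a number of frames (`delta`) rather than seconds, and all function names are hypothetical.

```python
import numpy as np


def relative_pose_errors(P_ref, P_est, delta=1):
    """Relative pose error matrices between two pose sequences.

    P_ref, P_est: equal-length sequences of 4x4 homogeneous poses.
    delta: interval in frames over which the motion is compared
           (an assumption; the text uses a time interval).
    """
    errors = []
    for i in range(len(P_ref) - delta):
        # Motion over the interval in each trajectory.
        ref_motion = np.linalg.inv(P_ref[i]) @ P_ref[i + delta]
        est_motion = np.linalg.inv(P_est[i]) @ P_est[i + delta]
        # Discrepancy between the two motions.
        errors.append(np.linalg.inv(ref_motion) @ est_motion)
    return errors


def translation_rmse(errors):
    """RMSE of the translational part of a list of 4x4 error matrices."""
    trans = np.array([E[:3, 3] for E in errors])
    return float(np.sqrt(np.mean(np.sum(trans ** 2, axis=1))))
```

Because only motions over short intervals are compared, a constant offset between the two trajectories does not contribute to this error, which is why it is read as a measure of drift.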
The absolute trajectory error gives insight into the absolute error between two corresponding poses at the keyframes that are compared. This evaluation method is suited to determining the global consistency when comparing trajectories with each other [46]. For calculating the absolute pose error, Equation (2) was used, in which R represents the rigid-body transformation for the set of keyframes that is compared.
The root mean squared error (RMSE) summarizes the error over all the time steps within a sequence. This RMSE was calculated using Equation (3). In this equation, trans(E_i) is the translation component of the relative or absolute pose error calculated in Equations (1) and (2), respectively.
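A minimal sketch of the absolute pose error and its RMSE is given below, again assuming 4 × 4 homogeneous pose matrices. The optional argument `R_align` stands for the rigid-body transformation that maps the estimated trajectory into the reference frame; it defaults to the identity, which assumes the trajectories are already aligned. The function names are illustrative and are not taken from the original work or from EVO.

```python
import numpy as np


def absolute_pose_errors(P_ref, P_est, R_align=None):
    """Per-pose absolute error matrices between two trajectories.

    P_ref, P_est: equal-length sequences of 4x4 homogeneous poses.
    R_align: 4x4 rigid-body transform applied to the estimate
             (identity if the trajectories are already aligned).
    """
    if R_align is None:
        R_align = np.eye(4)
    return [np.linalg.inv(Pr) @ R_align @ Pe for Pr, Pe in zip(P_ref, P_est)]


def rmse_translation(errors):
    """Root mean squared norm of the translation part of each error."""
    norms = np.array([np.linalg.norm(E[:3, 3]) for E in errors])
    return float(np.sqrt(np.mean(norms ** 2)))
```

Unlike the relative error, this measure compares poses directly, so any accumulated drift shows up in full at the later poses of a trajectory.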

Results
To find variations in positioning between the VSLAM algorithms, each sequence obtained from the selected scenes was run 20 times for the LSD-SLAM and ORB-SLAM algorithms. Running each sequence multiple times was also necessary to assess the consistency and robustness of each algorithm. The results of each run were stored separately and were evaluated afterward using the methods explained in the sections above. The results were evaluated by using EVO, which is a tool made for trajectory evaluation [44]. Figure 5 shows the drone camera processing outcomes when applying the direct method and the feature-based method. ORB-SLAM uses only key features, as can be seen from the green squares in the left picture; LSD-SLAM takes the whole input image into account (right picture). In Figure 6, the variation in mean RMSE of ORB-SLAM over the four scenarios is presented. The overall variance is quite constant over the four scenes, except for the greenhouse corridor. For both scenes within the greenhouse, the variation in errors was lower than in the dairy barn. In Figure 7, the variation in mean RMSE of LSD-SLAM over the four scenarios is presented. Here, the relative pose error of the greenhouse corridor is the lowest, with the lowest variance between the runs. When comparing the overall distribution, the variance was lower in the greenhouse than in the dairy barn.

Comparing Absolute and Relative Position Errors
In Figure 8, the cumulative distribution function is illustrated for both ORB-SLAM and LSD-SLAM for the greenhouse environment.
In Figure 9, the cumulative distribution function is illustrated for both ORB-SLAM and LSD-SLAM for the dairy barn environment.
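For readers who want to reproduce this kind of plot from their own per-run errors, a cumulative distribution function can be computed empirically as sketched below. The function names are hypothetical, and the input is assumed to be a flat list of error values in metres (for example, one RMSE per run).

```python
import numpy as np


def empirical_cdf(errors):
    """Return sorted error values and the fraction of samples <= each value."""
    x = np.sort(np.asarray(errors, dtype=float))
    y = np.arange(1, x.size + 1) / x.size
    return x, y


def fraction_below(errors, threshold):
    """Fraction of samples whose error does not exceed `threshold`."""
    return float(np.mean(np.asarray(errors, dtype=float) <= threshold))
```

A curve that rises steeply at small error values indicates that most runs have a low error, which makes such plots a compact way of comparing two algorithms on the same scene.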

Waypoints Navigation
In order to test the scale calculation and the positioning performance of the proposed setup, ORB-SLAM was selected as the algorithm for testing. This is because, according to the results, ORB-SLAM achieved the lowest variation in terms of absolute pose error and relative pose error in the clustered agricultural indoor scenes. For this testing step, a waypoint navigation mission was set up inside the greenhouse and the dairy farm barn. Two sets of waypoints were defined for each environment: 4 points and 6 points. Moreover, the flying height was set to 10 m in the greenhouse and to 5 m in the barn. Figure 10 shows the top-down view of the drone flying trajectory and the reference trajectory.
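The sequencing logic of a waypoint mission can be illustrated with a very small sketch. This is not the controller used in the experiments; it only shows one plausible way of deciding when a waypoint counts as reached, given a position estimate that has already been converted to metres. The function name and the arrival tolerance of 0.2 m are assumptions made for illustration.

```python
import math


def next_waypoint_index(position, waypoints, current_index, tolerance=0.2):
    """Advance to the following waypoint once the current one is reached.

    position: current (x, y, z) estimate in metres.
    waypoints: list of (x, y, z) targets in metres.
    tolerance: arrival radius in metres (hypothetical value).
    """
    if current_index >= len(waypoints):
        return current_index  # mission already complete
    if math.dist(position, waypoints[current_index]) <= tolerance:
        return current_index + 1
    return current_index
```

Called once per pose update, this returns the index of the waypoint to steer toward; the mission is finished when the index equals the number of waypoints.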

Semantic Mapping with Octomaps
Octomaps can use different resolutions for displaying the map derived from the SLAM algorithm. This is more memory-efficient than using raw point clouds, which can contain millions of points. The obstacle maps generated using octomaps were analyzed qualitatively by visual observation, as in [47], with respect to detail level and noise. This assessment gives more insight into the potential of obstacle map building using a small off-the-shelf drone with a monocular camera.
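The effect of map resolution on memory use can be illustrated with a simplified sketch. Note that this is not the OctoMap library: an octomap stores probabilistic occupancy in an octree, whereas the hypothetical function below only quantizes a point cloud into a set of occupied cells at one fixed resolution.

```python
import numpy as np


def voxelize(points, resolution):
    """Return the set of occupied voxel indices for a point cloud.

    points: (N, 3) array of positions in map units.
    resolution: edge length of one voxel in the same units.
    """
    points = np.asarray(points, dtype=float)
    idx = np.floor(points / resolution).astype(np.int64)
    # Many nearby points collapse into one voxel index.
    return {tuple(int(c) for c in v) for v in idx}
```

With a coarser resolution, more points fall into the same cell, so the number of stored cells drops while fine detail, such as a single leaf, is lost.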
The resulting map of the greenhouse scene (main corridor) is shown in Figure 11. The crop rows can be seen in the octomap, indicated with the red boxes. When inspecting the image closely, the concrete floor is also visible. There are some loose points in the obstacle map due to the textureless areas on the floor, as the octomap uses the point cloud obtained from ORB-SLAM. It was found that only large building elements within an area, such as floor and wall elements, could be visualized (larger than 40 cm³). On the other hand, small elements such as a single crop leaf could not be observed. This was somewhat expected due to the working principle of ORB-SLAM. In Figure 12, more loose points can be observed than in the greenhouse environment. This is due to the fact that there are more features to track in the greenhouse than in the barn. Furthermore, it was found that straight lines, such as the feeding fence in the dairy barn environment, can be detected as well, whereas the profile of the cows' heads could not be detected clearly. Moreover, the density of the octomap varies over the flown path (see the blue square in Figure 12), and more detail was found within the blue square than at the place indicated with the blue arrow. This could be explained by the amount of flying time at the beginning of the path, as that was the takeoff place.

Discussion
In this research, a comparison was made between open-source VSLAM algorithms for drone localization using monocular camera input in indoor agricultural environments to study the feasibility for drone positioning in indoor livestock and farming.
The absolute differences found between the two compared VSLAM algorithms (Figures 6 and 7) were 0.16 m in terms of root mean square error of the absolute pose and 0.012 m in terms of the relative pose. Although these differences in performance seem relatively small, they could be important. For example, in greenhouse environments with tomato cultivars, row distances can vary between 0.23 and 0.60 m depending on the crop growth stage [48], and in the bell pepper greenhouse where the data were obtained, the inter-row space was 0.5 m. An absolute difference in error of 0.16 m between the VSLAM algorithms therefore equals 27% up to 70% of the inter-row width, depending on the crop growth stage. For livestock purposes, centimetre-level differences in localization can also be very important, as they directly affect dairy monitoring tasks and the side parameters used to minimize disturbance [49].
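The percentages quoted above follow from dividing the 0.16 m difference by the row spacing. The short check below reproduces that arithmetic for the two tomato row distances from [48] and for the 0.5 m spacing of the bell pepper greenhouse; the function name is only illustrative.

```python
def error_share_of_row_width(error_m, row_width_m):
    """Error expressed as a percentage of the inter-row width."""
    return 100.0 * error_m / row_width_m


# Row spacings in metres: widest and narrowest tomato rows, bell pepper rows.
for width in (0.60, 0.50, 0.23):
    share = error_share_of_row_width(0.16, width)
    print(f"{width:.2f} m row spacing: {share:.0f}% of the row width")
```

For the 0.5 m spacing measured in the bell pepper greenhouse, the same calculation gives 32%, which lies inside the 27% to 70% range.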
For the comparison of methods, it was shown that the feature-based method ORB-SLAM outperformed the direct method LSD-SLAM in terms of average absolute translation RMSE and average relative rotation RMSE as shown in the cumulative distribution functions (Figures 8 and 9). The results are in line with the studies from [50,51]. It is important to mention that a part of the variation found could be due to the initialization steps of the algorithms. This was also pointed out by the authors of both ORB-SLAM [40] and LSD-SLAM [39]. To be able to draw conclusions about the variance following from initialization steps, examination of more different agricultural indoor data sequences is needed for finding possible causes.
Another point that needs to be taken into account regarding system robustness is that the localization is highly dependent on the camera input. When flying within a workspace with fewer features to track, or when the light conditions change significantly, the positioning system might fail. A potential solution for improving the system robustness could be, for example, to integrate the drone navigation data, such as IMU data [52].
When looking toward the implementation of the proposed system setup in greenhouses or livestock barns, an important aspect to consider is communication. In the proposed experimental setup, the WiFi receiver of the ground station was used as a proof of concept; this communication channel should be replaced for large-scale usage. The connectivity performance was not investigated, as it was beyond the scope of this research. However, this is an important aspect to be studied in future research to achieve reliable real-time data processing in smart farming [53]. Recent work on different communication technologies for smart farming purposes, and on the limitations of specific techniques, was performed by [54]. Regarding the communication challenge of UAV implementation in large-scale indoor agricultural environments, the authors found there was potential for 5G cellular technologies, promising improvements in communication distances and data-processing speeds for meeting the mentioned connectivity challenge.
When performing flights with the drone in the dairy barn, it was found that flying the drone close to the animals did not result in a noticeably higher level of stress in the animals. This is an important aspect to observe during experiments with drone data collection within livestock barns. The impact of on-farm drone technologies on animal welfare is a critically under-researched area; therefore, in the research of [55], a preliminary investigation of the impact on stress levels of sheep in outdoor drone applications was performed. However, no conclusions on animal stress and behavior could be drawn, and recommendations for further research were made.
Regarding environment mapping to obtain a 3D map of the environment using a drone and octomaps, several challenges from different research domains arise. As for drone navigation, situational awareness and control within six degrees of freedom is important [56]. In this research, it was found that mapping indoor agricultural environments using octomaps and monocular camera input can lead to big variations in detail level between the environments (see Figures 11 and 12). This is because monocular VSLAM algorithms use an arbitrary scale [57], which is set in the initialization step. For improving the quality and accuracy of the octomaps, research toward integrating real-world scale with the obstacle maps is required. However, as this was beyond the scope of this research, it was left open for further investigation.
Although the usage of drones in indoor agricultural environments has been demonstrated as a very promising concept, the usage for agricultural applications is still in an early stage [58]. Therefore, further investigations will be needed toward a practical, robust, and farmer-friendly implementation. Besides the localization aspect, such investigations could address the energy aspect and the computational aspect, such as the computational needs and limitations of the proposed positioning system. Furthermore, the maximum area coverage of a single drone could be investigated [59]. Another option that could be explored is the usage of multiple drones that work together simultaneously. For outdoor usage, this concept was already proposed in the research of [17].

Conclusions
The goal of this research was to investigate the output consistency and the limitations of two open-source VSLAM algorithms implemented on an off-the-shelf small drone for indoor livestock and farming environments. Additionally, the localization and mapping performance in these environments was evaluated. From this work we can conclude that, using the ORB-SLAM and LSD-SLAM algorithms in indoor farming environments, the drone can achieve positioning and waypoint navigation with a relatively small position error. Finally, the maps generated from the drone flights are sufficient to obtain a semantic representation of the workspaces. This study shows a proof of concept of the usage of drones for plant and cattle monitoring with an affordable budget (less than 150 euros) and limited computational resources.
Author Contributions: Conceptualization, S.K. and J.V.; methodology, J.V., C.P. and S.K.; investigation, S.K. and C.P.; writing-original draft preparation, all authors equally; writing-review and editing, all authors equally; project administration, J.V.; funding acquisition, J.V. All authors have read and agreed to the published version of the manuscript.
Funding: This research was funded by H2020 ROSIN project of European Union grant number 732287.

Institutional Review Board Statement: Not applicable.
Informed Consent Statement: Not applicable.

Acknowledgments:
The authors would like to thank Wageningen Unifarm and the greenhouse manager Andre Maassen, the Firma Duijnisveld greenhouses company, and Melkveebedrijf van Leusen dairy farm for supporting the experiments.

Conflicts of Interest:
The authors declare no conflict of interest.
