Unimodal and Multimodal Perception for Forest Management: Review and Dataset

Abstract: Robotic navigation and perception for forest management are challenging because of the many obstacles to detect and avoid and the sharp illumination changes. Advanced perception systems are needed because they can enable the development of robotic and machinery solutions for a smarter, more precise, and more sustainable forestry. This article presents a state-of-the-art review of unimodal and multimodal perception in forests, detailing current work on perception using a single type of sensor (unimodal) and on perception that combines data from different kinds of sensors (multimodal). This work also compares existing perception datasets in the literature and presents a new multimodal dataset, composed of images and laser scanning data, as a contribution to this research field. Lastly, a critical analysis of the collected works is conducted by identifying strengths and research trends in this domain.


Introduction
In robotics, perception is the ability of a system to identify and interpret sensory information to better understand, and enhance its awareness of, the surrounding environment. This article formally reviews the state of the art in unimodal perception (using a single type of sensor) and multimodal perception (combining data from distinct kinds of sensors) in forestry environments. These types of perception are also relevant for agricultural purposes, but this work focuses only on the forestry domain. Therefore, this article covers scientific works tested in the woods and/or in forests. Both terms denote forestry environments: in the woods, 25-60% of the land is covered by trees, whereas in forests, the tree canopy covers 60-100% of the land (https://www.reconnectwithnature.org/news-events/the-buzz/what-the-difference-woods-vs-forest, accessed on 6 October 2021). For simplicity, throughout this article, we refer to forestry environments using the term "forests".
Over the years, several advances in perception systems have appeared, with a positive impact on the forestry domain. These systems, combined with robotics, have improved the precision and intelligence of several tasks and operations that have long been performed in forests. Some years back, such operations were performed with obvious limitations and without regard for forest sustainability. In forestry, perception is of utmost importance, as it is required for detecting trees, stems, bushes, and rocks [1], and for measuring certain parameters of valuable vegetation whilst ignoring nonvaluable plants [2]. Such tasks are inherently difficult because of illumination changes caused by tree-derived shading. With this in mind, this article presents an overview of recent scientific developments in multimodal perception in forests for several purposes: species detection, disease detection, structural measurement, biomass and carbon dynamics assessment, and monitoring through autonomous navigation. The addition of cutting-edge technology to these and other operations not only leads to a smarter and more precise forestry but also helps to prevent and deal with natural disasters such as wildfires, which were estimated to have affected the lives of over 6.2 million people since 1998 [3].
The selection of works about perception systems for forestry was based on the current state of the art in this domain; therefore, the majority of the cited works are from the past 10 years. The main focus of this review was the production of a scientific survey; therefore, articles published in journals and conferences were preferred. The literature databases used to search for scientific information were Scopus, ScienceDirect, IEEE Xplore, SpringerLink, and Google Scholar.
To perform this search, the following keywords were used: forests, sensor fusion, multimodal perception, images, lidar, radar, and navigation. These keywords were combined to form the following search strings: "forests AND images", "forests AND lidar", "forests" AND "sensor fusion" AND "multimodal perception", "forests AND images AND lidar AND radar", and "forests AND navigation".
The contributions of this work are the following:
• A review of perception methods and datasets for multimodal systems and applications;
• A publicly available dataset with multimodal perception data.
The rest of this article is structured as follows. Section 2 presents a review of unimodal and multimodal perception methods for forestry. In Section 3, our dataset and other perception datasets found in the literature are presented and detailed. Section 4 ends this article, drawing the main conclusions about the forestry unimodal and multimodal perception domains.

Unimodal and Multimodal Perception in Forestry
This section presents a literature review of scientific works about unimodal (using only images or LiDAR data) and multimodal perception in forestry.

Vision-Based Perception
Over the years, several works have appeared whose main goal was to use only vision-based data to perform forestry-related tasks. This section covers works related to vision-based perception in forest areas.
The use of images to inspect forestry environments can serve multiple purposes: disease detection in vegetation, vegetation inventory reports, vegetation health monitoring, detection of forest obstacles for safe autonomous or semiautonomous navigation, and assessment of the structure and mapping of forest land, among others.
Health monitoring and disease detection in trees are frequent topics in forest contexts and can be performed using only cameras. In [4], a study about the detection of pine wilt disease was conducted. The authors used an Unmanned Aerial Vehicle (UAV) equipped with a camera to gather aerial images for further processing. The UAV captured several images over three consecutive months to form a dataset for training four different Deep Learning (DL) object detection methods (Faster R-CNN ResNet50, Faster R-CNN ResNet101, YOLOv3 DarkNet53, and YOLOv3 MobileNet) to diagnose the disease. The authors claimed that the four methods achieved similar precision, but the YOLOv3-based models were lighter and faster than the Faster R-CNN variations. In [5], another study about disease detection in Pinus trees was made. In this work, the authors also used UAV-based images to detect the disease; however, they developed a method that combines Convolutional Neural Networks (CNN) and Generative Adversarial Networks (GAN) with an AdaBoost classifier. The GAN was used to extend the diseased samples of the dataset; the CNN was used to remove content that acts as noise for the recognition task, such as roads, rocks, and soils; and the AdaBoost classifier was used to distinguish diseased trees from healthy ones and to identify shadows in the images. The proposed method attained better recognition performance than several well-known methods, such as support vector machines, AlexNet, VGG, and Inception-V3. Another work where UAVs were used to capture aerial images for the further identification of sick trees was proposed in [6]. In this work, the authors wanted to detect sick fir trees; they started by obtaining a Digital Surface Model (DSM) from the aerial images, on top of which they ran their own algorithm to detect treetops. Then, the detected tree crowns were classified using five DL models: AlexNet, SqueezeNet, VGG, ResNet, and DenseNet. The obtained results showed that the proposed tree crown detection algorithm achieved, on average, the best matching and counting of treetops. In terms of treetop classification, DenseNet, ResNet, and VGG were the DL models with the most stable detection results. In [7], the authors presented an approach for diagnosing forest health based on the detection of dead trees in aerial images. In this work, the authors used their own aerial image datasets and eight fine-tuned variations of a DL method called Mask R-CNN to produce dead tree detections, with the best variation achieving a mean average precision of about 54%. The purpose of the detections was to serve as an indicator of environmental changes and even as an alert for the possible occurrence of forest fires. A study about monitoring tree health was made in [8], where the authors collected aerial images using a UAV and identified individual trees by using a k-means method for tree segmentation, followed by a histogram of oriented gradients to localise the treetops. Afterwards, the images went through a multipyramid feature extraction step, where important features were extracted to identify the health of the trees. The results showed that the proposed method performed better than other state-of-the-art methods.
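The k-means segmentation step mentioned for [8] can be illustrated with a minimal sketch. The feature space used in [8] is not stated here, so this example assumes a single hypothetical per-pixel "greenness" value and clusters it into two groups (canopy and background); the function names, the initialisation, and the sample values are illustrative assumptions, not the authors' implementation.

```python
def nearest(value, centres):
    """Index of the centre closest to value."""
    return min(range(len(centres)), key=lambda j: abs(value - centres[j]))


def kmeans_1d(values, k=2, iters=20):
    """Plain k-means on scalar features with a deterministic initialisation."""
    lo, hi = min(values), max(values)
    # Assumption: spread the initial centres evenly over the value range.
    centres = [lo + (hi - lo) * (i + 0.5) / k for i in range(k)]
    for _ in range(iters):
        labels = [nearest(v, centres) for v in values]
        for j in range(k):
            members = [v for v, lab in zip(values, labels) if lab == j]
            if members:  # keep the old centre if a cluster is empty
                centres[j] = sum(members) / len(members)
    return [nearest(v, centres) for v in values], centres


# Made-up greenness values for six pixels (low = soil, high = canopy).
greenness = [0.10, 0.15, 0.12, 0.80, 0.85, 0.90]
labels, centres = kmeans_1d(greenness, k=2)
```

In a real pipeline the clustering would run on multi-channel pixel features, and the resulting canopy mask would be passed to the treetop localisation step.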
The production of inventory reports and the assessment and characterisation of forest structure are important issues that indicate the productivity of the forest land. In [9], the authors used a CNN called RetinaNet to detect palm trees in aerial images, achieving 89% and 77% precision on the validation and test datasets, respectively. The authors also presented a similar work in [10], where they went deeper into the inventory report of palm trees, attaining a very high number of counted palms with a confidence score above 50%. In [11], the authors used a photogrammetric technique called Structure from Motion (SfM) to generate point clouds from which forest parameters were derived, such as tree positions, Diameter at Breast Height (DBH), and stem curves (curves that define the stem diameter at different heights). The images were captured at two locations in Austria and two locations in Slovakia, where Terrestrial Laser Scanning (TLS) measurements were also performed and used as ground truth for the SfM parameter estimation. The results show that SfM is an accurate solution for forest inventory purposes and for measuring forest parameters, not falling far behind TLS. Another work where an SfM-based strategy was used to obtain forest plot characteristics such as tree positions, DBH, tree height, and tree density was presented in [12]. In this work, the authors combined the image acquisition with a type of differential Global Navigation Satellite System (GNSS) technology. This differs from the common method of simply using photogrammetry to reconstruct the 3D point cloud of a scene, because their method can directly extract real geographical coordinates of the points. The results showed small differences in positioning accuracy (between 0.162 and 0.201 m), in trunk DBH measurements (between 3.07% and 4.51%), and in tree height measurements (between 11.26% and 11.91%). In [13], the authors also used SfM to perform 3D reconstruction from aerial data collected using a UAV. They applied a watershed segmentation method along with local maxima to detect individual trees; afterwards, tree heights were calculated using DSMs and digital terrain models. The tree detection procedure was carried out with a maximum error of 6%, and the tree height estimation error was around 1 m and 0.7 m for the pinus and eucalyptus stands, respectively. While it is important to quantify forest parameters such as DBH, tree height, and tree position, the measurement of tree crowns must not be neglected, as this measure is quite difficult to assess manually and it provides insight into the stand timber volume. For this, the authors in [14] studied methods for the detection and extraction of tree crowns from UAV-based images and for subsequent crown measurement. They used three DL models, namely Faster R-CNN, YOLOv3, and Single-Shot MultiBox Detector (SSD). In terms of detection, the three models behaved similarly; however, in terms of crown width estimation (computed directly from the bounding boxes generated by the methods), SSD presented the lowest error.
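The height computation described for [13] (tree heights from DSMs and digital terrain models, with trees located at local maxima) can be sketched as follows. This is a minimal illustration under stated assumptions: the rasters are tiny made-up grids, the `min_height` threshold and function names are hypothetical, and the watershed segmentation step used in [13] is omitted.

```python
def canopy_height_model(dsm, dtm):
    """Subtract terrain elevation (DTM) from surface elevation (DSM), cell by cell."""
    return [[s - t for s, t in zip(srow, trow)] for srow, trow in zip(dsm, dtm)]


def local_maxima(chm, min_height=2.0):
    """Return (row, col, height) for cells strictly higher than all 8 neighbours."""
    rows, cols = len(chm), len(chm[0])
    tops = []
    for r in range(rows):
        for c in range(cols):
            h = chm[r][c]
            if h < min_height:  # assumed threshold to skip ground and low shrubs
                continue
            neighbours = [
                chm[rr][cc]
                for rr in range(max(0, r - 1), min(rows, r + 2))
                for cc in range(max(0, c - 1), min(cols, c + 2))
                if (rr, cc) != (r, c)
            ]
            if all(h > n for n in neighbours):
                tops.append((r, c, h))
    return tops


# Toy 4x4 rasters in metres (made-up values for illustration only).
dsm = [
    [100.0, 101.0, 100.5, 100.0],
    [101.0, 112.0, 101.0, 100.0],
    [100.5, 101.0, 100.0, 108.0],
    [100.0, 100.0, 100.5, 101.0],
]
dtm = [[100.0] * 4 for _ in range(4)]

chm = canopy_height_model(dsm, dtm)
treetops = local_maxima(chm)  # each entry: (row, col, estimated tree height)
```

Each detected maximum gives both a tree position (raster cell) and a height estimate; in practice the raster would usually be smoothed first to avoid spurious maxima inside a single crown.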
Commonly, forestry inventory is estimated by detecting the trees. Several works proposed this approach as a way of quantitatively assessing the forest yield, forest biomass, and carbon dynamics from high-resolution remote sensing or UAV-based imagery [15-22]. The inventory of a certain ecosystem can also be estimated by mapping it through satellite images, as was done in [23] for a mangrove ecosystem. The authors used a pixel-based random forest classifier that resulted in a mangrove map with an overall accuracy of 93%. This work demonstrated that the production of detailed ecosystem maps can have a high impact on monitoring and managing natural resources.
Autonomous navigation in forests is a relevant challenge. It is fundamental that all obstacles in the forest are detected to avoid hazardous situations and damage. In [24], the authors installed a camera on a forwarder (a forestry vehicle that transports logs) and developed an algorithm that uses the images to detect trees and measure the distance to them. They trained an Artificial Neural Network (ANN) and a K-Nearest Neighbours (KNN) classifier to perform the detections. After the trees are detected, a distance measurement process is executed, based on the intrinsic and extrinsic parameters of the camera and on other parameters related to the vehicle. Then, if the distance is below a proximity threshold, the vehicle is stopped, and it waits for a command from the operator before returning to operation. Similar work was presented in [25], where the authors developed an autonomous navigation and obstacle avoidance system for a robotic mower with a mounted camera. The obstacles and landmarks were detected using a CNN. Autonomous navigation in forests can also happen in the air with UAVs, as was shown in [26], where the authors developed a UAV-based system capable of following footpaths in forest terrain. The CNN-based perception system detected the footpaths and was followed by a decision-making system that calculated the deviation angle of the UAV's motion vector from the desired path: if the angle did not exceed 80 degrees, the UAV would move forward; otherwise, it would turn left or right depending on the sign of the angle. In [27], the authors also used a UAV to develop an autonomous flight system using monocular vision in forest environments. The system is an enhanced version of an existing algorithm for rovers. The proposed method is capable of computing the distance to obstacles, calculating the angle to the nearest obstacle, and applying the correct yaw-velocity pair to manoeuvre the UAV around the obstacles. A similar work was developed in [28], where the authors proposed a DL-based system for obstacle avoidance in forests. The system was tested in a simulated and in a real environment: in the simulated environment, the UAV completed 85% of the test flights without collisions, and in the real environment, the UAV completed all test flights without collisions. Other studies focused on detecting tree trunks in street images using Deep Learning methods [29,30], on detecting tree trunks in dense forests using visible and thermal imagery combined with Deep Learning [31], and even on the detection of stumps in harvested forests [32] to enhance the surrounding awareness of the operators and to endow machines with intelligent object avoidance systems.
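The decision rule reported for the footpath-following UAV in [26] (move forward while the deviation angle stays within 80 degrees, otherwise turn according to the sign of the angle) can be sketched as below. The 2D vector representation, the function names, and in particular the sign convention (a positive angle is taken to mean the motion vector lies counter-clockwise of the path, corrected by turning right) are assumptions made for illustration; the source does not specify them.

```python
import math


def deviation_angle_deg(motion, path):
    """Signed angle in degrees from the path direction to the motion vector (2D)."""
    cross = path[0] * motion[1] - path[1] * motion[0]
    dot = path[0] * motion[0] + path[1] * motion[1]
    return math.degrees(math.atan2(cross, dot))


def decide(angle_deg, threshold_deg=80.0):
    """Return the manoeuvre for a given deviation angle."""
    if abs(angle_deg) <= threshold_deg:
        return "forward"
    # Assumed convention: positive angle -> drifted to the left of the path.
    return "turn_right" if angle_deg > 0 else "turn_left"


# Hypothetical example: path points along +x, UAV is moving along +y.
angle = deviation_angle_deg(motion=(0.0, 1.0), path=(1.0, 0.0))
command = decide(angle)
```

A similar threshold test could express the forwarder behaviour described for [24], where the vehicle stops once the measured distance to a tree falls below a proximity threshold.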
Table 1 shows a summary of works related to vision-based perception in forests, where studies are presented by category, by the type of processing needed, and by the number of relevant works found in each category. The aim of the "Health and diseases" category is to monitor the health of forest lands and detect the existence of diseases that affect forest trees, destroying some forest crops and ecosystems. Data from this category are mostly processed offline, that is, the data are collected in the field and processed later. For this category, only five works were presented, since the way of acquiring the data for this purpose is quite similar (the majority use UAVs or satellite imagery) and their processing is mostly targeted at the detection of sick or dead trees in aerial images. The "Inventory and structure" category is based on remote sensing to produce inventory reports about forest content, such as biomass volume, and to study the structure of forests using plot-level parameters to assess their growth and yield. The data of this category are also processed offline, and 15 works were collected for this category. Most of these works focus on detecting and counting treetops in high-resolution and UAV images, and some of them focus on measuring parameters of forest trees, such as DBH, height, tree density, and crown size. Lastly, the "Navigation" category mostly comprises works focused on detecting trees in visible and thermal images and on measuring the distance to them to perform avoidance manoeuvres. Other studies focused on following footpaths with decision-making systems capable of deciding which route to choose. Such works are the foundations needed to attain fully autonomous or semiautonomous navigation in forests, hence the importance of the works in this area. The works of this category are all characterised by online data processing; otherwise, the aerial or terrestrial vehicles, which rely on visual perception, would crash and suffer serious hardware damage.
Another perspective on the collected works on vision-based perception in forests concerns the nature of their perception systems, i.e., whether the systems are terrestrial or airborne. Of the 28 works, eight are terrestrial and 20 are aerial. These numbers may indicate that the domain of terrestrial perception is still under development and that more research is needed, since advanced ground-level perception can enable the development of technological solutions for biomass harvesting and for cleaning and planting operations, which in turn can help to tackle environmental issues such as the greenhouse effect, global warming, and even wildfires.

LiDAR Perception
This section is about forest perception using Light Detection And Ranging (LiDAR) technology. In this domain, the literature is divided into two main areas: LiDAR-based perception for estimating forestry inventory and structure, and LiDAR-based perception for achieving autonomous navigation and other operations in forests.
Regarding the forest structure and inventory assessment domain, several works focus on the development of methods to precisely perform tree detection and segmentation on LiDAR-based point clouds. In [33], the authors based their work on low-density full-waveform airborne laser scanning data for Individual Tree Detection (ITD) and for tree species classification using a random forest classifier whose input was the features extracted from the detected trees. This work covered three tree species and, in the end, the results were compared with those obtained from discrete return laser scanning data. In [34], a benchmark of eight ITD techniques was made over a dataset of Canopy Height Models (CHMs) obtained from Airborne Laser Scanning (ALS). Additionally, an automated tree-matching procedure was presented that was capable of linking each detection result to the reference tree. The method proved to work efficiently. In [35], the authors presented an ITD method based on the watershed algorithm for further computing several tree-related variables. Deep Learning is another way of performing ITD, as was shown in some works [36,37] where distinct DL methods were used, such as Faster R-CNN, 3D-FCN, K-D Tree, and PointNet. Other works focused on Individual Tree Crown (ITC) detection and segmentation (or delineation), such as the one presented in [38]. In this work, the authors developed a framework that receives LiDAR-derived CHMs and 3D point cloud data and generates estimations of tree parameters such as tree height, mean crown width, and Above-Ground Biomass (AGB). The authors concluded that their framework is very accurate at ITC delineation, even in dense forestry areas. For the task of ITC, some works also used Deep Learning. In [39], the method PointNet [40] was used, and the authors presented a method that started by turning the point clouds containing the trees into voxels; then, the voxels were the input samples for training PointNet to detect the tree crowns; lastly, with the segmentation results provided by PointNet, height-related gradient information was used to distinguish the boundaries of each tree crown. Over the years, novel tree segmentation methods have appeared: gradient orientation clustering [41], graph-cut variations [42], region-based segmentation [43], mean-shift segmentation [44], and layer stacking [45]. The majority of works based on the use of LiDAR in forests are aimed at estimating and assessing forestry parameters and biomass. Some focus on computing DBH [46-49]; others measure AGB [47,48,50-57], Leaf Area Index (LAI) [58-60], canopy height [47-49,54,61-68], tree crown diameter for estimating biomass and volume [69], basal area and tree density [52,70], land cover classification [71], and above-ground carbon density using an ITC segmentation that locates trees in the CHM, measures their heights and crown widths, and computes the biomass [72].
Autonomous navigation and automated tasks in forests are still a challenge due to the unstructured nature of such environments and to the unavailability and/or degradation of GNSS signals [2,3]. In [73], the authors claimed to solve this localisation problem in sparse forests, where GNSS signals can be sporadically detected. They proposed a method that fuses GNSS information with a LiDAR-based odometry solution, which uses tree trunks as a feature input for a scan matching algorithm to estimate the relative movement of the aerial robot used in this work. The method employs a robust adaptive unscented Kalman filter, and, for motion control, the authors implemented an obstacle avoidance system based on a probabilistic planner. In [74], an autonomous rubber-tapping ground robot was presented. The robot achieves autonomous navigation by collecting a sparse point cloud of tree trunks using a low-cost LiDAR and a gyroscope; the centre points of the trees are acquired; then, the points are connected to form a line that serves as the robot's navigation path. Additionally, a fuzzy controller was used to analyse the heading and lateral errors while the robot performed certain operations: straight-line walking in a row at a fixed lateral distance, stopping at certain points, turning from one row to another, and gathering specific information regarding row spacing, plant spacing, and tree diameter. In [75], the authors presented a point cloud-based collision-free navigation system for UAVs. The system collects the point cloud using a LiDAR and converts it to an occupancy map, which is the input for a random tree that generates path candidates. They used a modified version of the Covariant Hamiltonian Optimisation for Motion Planning objective function to choose the best candidate, whose trajectory is in turn the input of a model predictive controller. The authors' strategy was tested in four different simulated environments, and the results showed that their method is more successful and has a "shorter goal-reaching distance" than the ground-truth ones. Most of the time, the problem of navigation and localisation in forests can be resolved by using Simultaneous Localisation and Mapping (SLAM) algorithms [2]. SLAM normally combines the data from a perception sensor, such as a camera, a LiDAR, or both, with the data from an Inertial Navigation System (INS). The authors in [76] developed a GNSS/INS/LiDAR-based SLAM method to perform highly precise stem mapping. The heading angles and velocities were extracted from GNSS/INS, enhancing the positioning accuracy of the SLAM method. In [77], the authors also used an INS and LiDAR-based SLAM method to attain a stable and long-term navigation solution. They assessed the performance of two different approaches: SLAM with only a LiDAR, and SLAM with a LiDAR and an Inertial Measurement Unit (IMU). They concluded that the positioning error improved when the second approach (LiDAR+IMU-based SLAM) was in use. Similarly, in [78], the goal was stem mapping, and to accomplish that, the authors combined GNSS+IMU with a LiDAR mounted on a terrestrial vehicle and performed SLAM. They concluded that the addition of LiDAR contributed to an improvement of 38% compared to the traditional approach of using only GNSS+IMU. In [79], the authors proposed a SLAM method called sparse SLAM (sSLAM), whose main application is in forests and for sparse point clouds. They tested their method in the field with a LiDAR and a GNSS mounted on a harvester and compared it with LeGO-LOAM. The results showed that sSLAM generates a lighter point cloud, incurs a lower GNSS parallel error, and is more consistent than LeGO-LOAM. Lastly, in [80], the authors proposed a new approach to match point clouds to tree maps using Delaunay triangulation. They tested their method with a dataset corresponding to a 200 m path travelled by a harvester with a LiDAR and a GNSS mounted on it. Initially, the tree trunks are extracted from the map, resulting in a sparser map that is triangulated; then, a local submap of the harvester is registered, triangulated, and matched using triangular similarity maximisation, estimating the harvester's position.
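The idea of matching triangulated trunk positions by triangle similarity, as described for [80], can be illustrated with a small sketch. The actual similarity measure of [80] is not given here, so this example assumes a simple descriptor (the sorted side lengths of each triangle) and picks the map triangle with the smallest descriptor difference. The triangles are supplied directly as made-up coordinates; computing the Delaunay triangulation itself is outside this sketch.

```python
import math


def side_lengths(tri):
    """Sorted side lengths of a triangle given as three (x, y) points."""
    a, b, c = tri
    return sorted((math.dist(a, b), math.dist(b, c), math.dist(c, a)))


def dissimilarity(t1, t2):
    """Sum of absolute differences between sorted side lengths (0 = congruent)."""
    return sum(abs(x - y) for x, y in zip(side_lengths(t1), side_lengths(t2)))


def best_match(local_tri, map_tris):
    """Index of the map triangle most similar to the local one."""
    return min(range(len(map_tris)), key=lambda i: dissimilarity(local_tri, map_tris[i]))


# Made-up trunk positions in metres: two triangles from the global tree map.
map_triangles = [
    ((0.0, 0.0), (4.0, 0.0), (0.0, 3.0)),
    ((10.0, 10.0), (16.0, 10.0), (10.0, 18.0)),
]
# A triangle observed in the local submap: same shape as the first, shifted and reordered.
local_triangle = ((1.0, 1.0), (1.0, 4.0), (5.0, 1.0))

match_index = best_match(local_triangle, map_triangles)
```

Because side lengths are invariant to translation and rotation, the descriptor does not depend on the unknown vehicle pose; a matched pair of triangles then provides point correspondences from which the pose could be estimated.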
Table 2 presents a summary of the aforementioned works related to LiDAR-based perception in forests, where the category, processing type, and works found are highlighted.

Category                   Processing Type   Works
Inventory and structure    Offline           [33-39,41-72]
Navigation                 Online            [73-80]

The majority of works focus on perceiving the forest structure and estimating its inventory, and on drawing conclusions about the forest carbon stock and vegetation yield from it. With respect to navigation purposes, more research is needed, as only eight works were found to be relevant to the study at hand. Additionally, the majority of works (around 30) were made on aerial systems (including spaceborne ones), showing the predominance of aerial systems in forestry, similarly to the domain of vision-based perception.

Multimodal Perception
In this section, the domain of forest multimodal perception is addressed, and the works that meet the formal search are presented.
Multimodal perception combines data from different kinds of sensors through a sensor fusion approach to attain richer, more robust, and more accurate perception systems. In this sense, multimodal sensing is likely to outperform unimodal sensing, which demonstrates the relevance of this type of sensing for the forestry domain.
One of the main applications of multimodal perception systems in forests is the classification of vegetation and the distinction of tree species. Fusing aerial visible and hyperspectral or multispectral images with LiDAR data is the most common practice for classifying forestry vegetation. In [81], the authors proposed a data fusion system that combined aerial hyperspectral images with aerial LiDAR data to distinguish 23 classes, including 19 tree species, shadows, snags, and grassy areas. The authors tested three classifiers: support vector machines, Gaussian maximum likelihood with leave-one-out-covariance, and k-nearest neighbours. The results showed that the best classifier was support vector machines; the system benefited from the addition of LiDAR, improving the classification accuracies in almost all classes; and the system attained accuracies over 90% for some classes. In [82], the authors studied sensor fusion approaches to perform species classification. Initially, the trees were detected in the CHM derived from the ALS data, and then the detected trees were distinguished among four classes using different combinations of 23 features provided by ALS data and coloured orthoimages. Seven classification methods were studied: decision trees, discriminant analyses, support vector machines, k-nearest neighbours, ensemble classifiers, neural networks, and random forests. Again, the use of ALS-based features proved to improve the overall accuracy. The authors recommended using quadratic support vector machines for tree species classification, as this method performed better than the others. A similar work was produced in [83], where the authors proposed a method that first performed ITC in a LiDAR-derived CHM, followed by a hyperspectral extraction in each segmented tree for further classification through two classifiers: random forests and a multiclass classifier. In this study, seven tree species were classified. The authors compared the use of all 118 bands against the use of only 20 optimal bands (obtained by minimum noise fraction transformation) in terms of classification performance. The results showed that using only 20 bands is beneficial, as it increases the overall accuracy of the two classifiers, and that the multiclass classifier is more robust with high-dimensional datasets with small sample sizes. Another similar study was made in [84]. In this work, the authors presented a CNN-based algorithm for tree species classification. Firstly, the algorithm performs ITD by using the local maximum method over a LiDAR-derived CHM; then, the trees are cropped from aerial images into patches that are classified by a ResNet50 CNN into one of seven classes. The CNN of the algorithm was compared with a traditional method (random forest) and with two other CNNs (ResNet18 and DenseNet121), and ResNet50 outperformed the other methods. In addition, a study regarding the resolution of the patches was made, where it was concluded that the biggest tested patch size generated better results. A study involving the classification of land use and land cover was made in [85], where the authors used a combination of satellite images with satellite Radio Detection And Ranging (RaDAR). In [86], the authors presented a study about the classification of the vertical structure of the forest. They fused the information of aerial orthophotos (an orthophoto is an aerial photograph or satellite imagery geometrically corrected (orthorectified) such that the scale is uniform (https://en.wikipedia.org/w/index.php?title=Orthophoto&oldid=1020970836, accessed on 6 October 2021)) with aerial LiDAR data and used an ANN to produce the classifications. In [87], the authors also used CNNs, but instead of classifying trees, they wanted to detect them in fused data composed of aerial images and an ALS-derived DSM. They concatenated a DSM with a Normalised Difference Vegetation Index (NDVI) and with a concatenation of red, green, and near-infrared features. Their goal was to use a single CNN to process such a combination of data, and for that, they used AlexNet. The results showed that the input data pair NDVI-DSM achieved the best results. With respect to tree detection, in 2005, a work about the detection of obstacles behind foliage using LiDAR and RaDAR [88] was published. The detection of occluded obstacles in forests is a major issue, as it is important that mobile platforms avoid crashing into other objects while traversing the forest. With this in mind, the authors in [88] were capable of detecting a tree trunk behind a maximum foliage thickness of 2.5 m. Some works related to detection in forests focus on detecting terrain surfaces by means of LiDAR and vision-based data [89], while others, which also combined LiDAR data with images, focus on detecting roads instead [90].
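The NDVI-DSM input pair mentioned for [87] can be illustrated with a short sketch that computes NDVI per pixel using its standard definition, NDVI = (NIR - Red) / (NIR + Red), and stacks it with a DSM raster as a two-channel input. The raster values, the `eps` guard against division by zero, and the function names are assumptions for illustration; the preprocessing actually used in [87] is not described here.

```python
def ndvi(nir, red, eps=1e-9):
    """Per-pixel NDVI from near-infrared and red reflectance rasters."""
    return [
        [(n - r) / (n + r + eps) for n, r in zip(nir_row, red_row)]
        for nir_row, red_row in zip(nir, red)
    ]


def stack_channels(*bands):
    """Combine same-sized rasters into one raster of per-pixel channel tuples."""
    return [[tuple(px) for px in zip(*rows)] for rows in zip(*bands)]


# Made-up 1x2 rasters: a vegetated pixel and a bare-ground pixel.
nir = [[0.8, 0.4]]
red = [[0.2, 0.4]]
dsm = [[12.0, 0.5]]  # surface heights in metres

fused = stack_channels(ndvi(nir, red), dsm)  # each pixel: (NDVI, height)
```

High NDVI together with high surface elevation is a strong cue for tree canopy, which gives some intuition for why fusing a spectral index with elevation can help a detector.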
There are application areas where multimodal perception is crucial. The estimation of biomass is an important process that can help predict the forest yield and its carbon cycle. Such assessment can be made by means of combined LiDAR data and multispectral imagery [91]; combined multispectral imagery and RaDAR imagery [92]; and combined inventory data, multispectral imagery, and RaDAR imagery [93]. Moreover, the estimation of vegetation or canopy height is also a field where multimodal perception plays an important role. This kind of estimation can be made by combining aerial photogrammetry with LiDAR-derived point clouds [94] and by combining LiDAR data with multispectral optical data [95]. Other applications are aimed at: autonomous navigation in forests using sensor fusion of GNSS, IMUs, LiDAR, and cameras [96,97]; mapping the forest using a LiDAR and a camera mounted on a ground vehicle by means of a SLAM approach [98]; characterising the root and canopy structure of the forest by combining ground-level LiDAR-derived point clouds with Ground Penetrating RaDAR (GPR) [99]; and measuring forest structure parameters, such as average height, canopy openness, AGB, tree density, basal area, and number of species, by combining spaceborne RaDAR images with multispectral images [100].

Perception in Other Contexts
Multimodal or unimodal perception also plays an important role in other contexts. Digital and precision agriculture, military robotics, and disaster robotics are some of the areas where robots can be combined with advanced perception systems to enhance the robots' knowledge of their surroundings in several tasks.
In agriculture, the introduction of digital and automated solutions in recent years has fostered the emergence of precision agriculture procedures that can be applied to farmers' crops, increasing the production yield and decreasing the environmental impact of using fertilisers. With precision agriculture, fertiliser is applied at the right time, in the right place, and in the right amount, fulfilling the crop needs. With this in mind, several scientific works have appeared in recent years. The majority of them are about autonomous harvesting, where the fruit or vegetable must be detected and/or segmented prior to its picking [101-109]. The detection of vegetables or fruits is also important to count them and estimate the production yield [110-113]. Similarly to forestry contexts, some works are about disease detection and monitoring [114,115], and others are focused on detecting woody trunks, weeds, and general obstacles in crops for navigation [116-119], operation purposes [120,121], and cleaning tasks [122,123]. Another application of perception systems in precision agriculture is characterising, monitoring, and phenotyping crops using stereo vision [124,125], point clouds [126,127], satellite imagery [128], low-altitude aerial images [129], or multispectral imagery [130]. Along with these advances in perception, several robotic platforms have appeared: for harvesting [101,102,104-106], for precise spraying [131], for plant counting [132], and for general agricultural tasks [133-136]. A topic that is being increasingly studied in the agricultural sector is localisation and consequent autonomous navigation in crops. To achieve this, the robots can rely on topological maps for path planning [137,138], ground-based sensing [119,135,136], aerial-based sensing [139,140], and simulated sensing [141].
The military and disaster robotics domains share some of the perception issues that exist in forestry and agriculture, such as illumination changes, occlusions, and possible dust and fog. Several scientific advances have been made in these domains. With respect to disastrous events, some solutions have appeared that use UAVs with cameras to perform surveillance of shipwreck survivors at sea [142], to search for people after an avalanche [143], and to detect objects and people in buildings after calamity events [144-147]. Other developments relate to inspecting bridges after disasters [148], rescuing people using a mobile robot similar to a crane [149], and scouting and counting fallen distribution line poles [150]. In a military context, perception plays an important role, as it helps to detect airports, airplanes, and ships from satellite images [151,152] and even from RaDAR images [153]. Some autonomous and semiautonomous systems have appeared that travel by air [154,155] or by land [156], relying on vision sensors and/or LiDARs. Additionally, work has been conducted in specific areas, namely: opening doors using a robotic arm and 3D vision for an unmanned ground vehicle [157]; detecting obstacles in adverse conditions (fog, smoke, rain, snow, and camouflage) using an ultra wide-band RaDAR [158] or a spectral laser [159,160]; fusing camera and LiDAR data to recognise and follow soldiers [161]; avoiding obstacles using a 2D LiDAR [162]; autonomously following roads and trails using a visual perception algorithm [163]; and even multitarget detection and tracking of intruders [164].

Discussion
Normally, vision sensing is more beneficial than LiDAR, because each image datum comprises up to four values (red, green, blue, and possibly depth), whereas each LiDAR datum has only two values, which are distance and intensity. Even so, for perception applications in forests, the combination of vision sensor(s) with LiDAR(s) is the favoured approach, due to the existence of sharp illumination changes that can compromise the performance of cameras. Thus, even if the cameras temporarily fail, the system can continue operating using LiDAR-based perception. A relevant detail is that, when using multimodal data, the diverse nature of the measurements is expected to produce uncorrelated errors, interference, and noise. The expected consequence is that multimodal sensing is likely to be superior to unimodal sensing, provided that an adequate data fusion technique is used to exploit these uncorrelated limitations and improve the quality of the final perception. Compared with a set of high-quality sensors of the same nature, a set of sensors of different natures is likely to have its limitations in distinct situations, rather than sharing persistent noise and interference that are merely measured with high accuracy. Admittedly, multimodal data come from diverse sensors, which makes the overall system more expensive.
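The benefit of uncorrelated errors can be illustrated with a minimal sketch of inverse-variance weighting, a textbook way of fusing two independent estimates of the same quantity. This is not a method taken from the reviewed works; the function name `fuse_independent` and the numeric values (a camera-derived range and a LiDAR range with assumed noise variances) are hypothetical and chosen only for illustration.

```python
def fuse_independent(z1, var1, z2, var2):
    """Fuse two independent estimates of one quantity by inverse-variance weighting.

    Assumes the two error sources are uncorrelated and zero-mean.
    Returns the fused estimate and its variance.
    """
    if var1 <= 0 or var2 <= 0:
        raise ValueError("variances must be positive")
    w1 = 1.0 / var1  # weight of sensor 1 (more precise sensors weigh more)
    w2 = 1.0 / var2  # weight of sensor 2
    fused = (w1 * z1 + w2 * z2) / (w1 + w2)
    fused_var = 1.0 / (w1 + w2)  # always smaller than min(var1, var2)
    return fused, fused_var


# Hypothetical readings of the distance to a trunk, in metres.
camera_range, camera_var = 10.4, 0.25  # assumed stereo-depth noise
lidar_range, lidar_var = 10.0, 0.01    # assumed LiDAR noise

fused_range, fused_var = fuse_independent(
    camera_range, camera_var, lidar_range, lidar_var
)
print(f"fused range: {fused_range:.3f} m, variance: {fused_var:.5f} m^2")
```

Under the independence assumption, the fused variance is lower than that of either sensor alone. If the errors were correlated, as can happen with several sensors of the same nature, this gain would shrink, which is the intuition behind the argument above.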
Table 3 shows the most innovative and disruptive works regarding the categories mentioned in Sections 2.1-2.3.

| Category | Reference | Application | Perception type | Platform |
| Inventory and structure | | Biomass parameters | Unimodal | Airborne |
| Inventory and structure | [93] | Biomass estimation | Multimodal | Spaceborne |
| Navigation | [28] | Autonomous flight | Unimodal | UAV |
| Navigation | [74] | Autonomous rubber-tapping | Unimodal | Caterpillar robot |
| Navigation | [97] | Autonomous navigation | Multimodal | Quadruped robot |
| Species classification | [81] | Vegetation classification | Multimodal | Spaceborne, airborne |
| Species classification | [82] | Vegetation classification | Multimodal | Airborne |

In the category "Health and diseases", the two highlighted works were about detecting diseases in trees using only aerial images captured from UAVs. The work developed in [5] focused on detecting diseases in pine trees using deep learning models to remove the noisy background from the UAV images (such as soils, roads, and rocks), followed by disease recognition using the AdaBoost algorithm. The authors went even further and used a GAN to expand their training dataset. After the study, they concluded that their method achieved superior results compared to the state-of-the-art methods. The novelty of this article is that the proposed method recognises not only diseased trees but also other forestry objects, which can be used to assess other forest parameters, such as LAI and the rockiness of the forest terrain. The use of a GAN to augment the dataset is a relevant point as well. The other work in this category was aimed at identifying sick fir trees [6]. The authors' proposed method differs from the common methods by combining a DSM, for detecting treetops, with UAV images, to classify the detected treetops. For the classification, the authors made a benchmark involving 10 deep learning classifiers. Their method achieved better results compared to three state-of-the-art methods.
Three works from the category "Inventory and structure" are of great relevance. The work proposed in [11] was aimed at using SfM with a handheld camera to measure some inventory characteristics (tree positions and DBH). The point cloud resulting from terrestrial SfM was then evaluated against a TLS-based point cloud, which showed that the SfM method is an accurate solution for deriving inventory parameters from image-based point clouds. This is an important breakthrough, as this type of work is normally performed by means of aerial or terrestrial LiDAR data. A similar work was proposed in [38], where the authors wanted to obtain biomass attributes, such as tree height, mean crown width, and AGB, but in this case using aerial LiDAR data. To achieve that, they proposed a method that performs ITC detection over a CHM derived from a LiDAR point cloud. The authors claim that their method is very accurate and efficient even in dense forests, where traditional methods tend to present a limited performance, hence the inclusion of this work in Table 3. The last work to be mentioned in this category was about estimating biomass using a spaceborne multimodal approach [93]. The authors used a combination of in-field inventory data reports with satellite images and satellite RaDAR to estimate and map forest biomass. Features were extracted from the data and served as input to two different estimation models. The interesting aspect of this article is the combination of forest inventory reports with remote sensing data to attain a low-cost method with a high level of reliability and efficiency.
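To make the idea of ITC detection over a CHM more concrete, the following minimal sketch marks treetop candidates as local maxima of a small height grid. This is a common baseline and not the method proposed in [38]; the function name `detect_treetops`, the `min_height` threshold, and the toy CHM values are assumptions made for illustration only.

```python
def detect_treetops(chm, min_height=2.0):
    """Return (row, col, height) for cells that are strict local maxima.

    chm is a list of equally long rows of canopy heights in metres.
    A cell is a treetop candidate if it is at least min_height tall and
    strictly higher than every cell in its 8-neighbourhood.
    """
    rows, cols = len(chm), len(chm[0])
    tops = []
    for r in range(rows):
        for c in range(cols):
            h = chm[r][c]
            if h < min_height:
                continue  # ignore ground and low vegetation
            neighbours = [
                chm[rr][cc]
                for rr in range(max(0, r - 1), min(rows, r + 2))
                for cc in range(max(0, c - 1), min(cols, c + 2))
                if (rr, cc) != (r, c)
            ]
            if all(h > n for n in neighbours):
                tops.append((r, c, h))
    return tops


# Toy 5 x 5 CHM with two crowns (heights in metres, illustrative values).
toy_chm = [
    [0.0, 1.0, 0.5, 0.0, 0.0],
    [1.0, 12.0, 3.0, 0.5, 0.0],
    [0.5, 3.0, 2.0, 4.0, 5.0],
    [0.0, 0.5, 4.0, 9.0, 6.0],
    [0.0, 0.0, 3.0, 6.0, 5.0],
]
print(detect_treetops(toy_chm))
```

In practice, a CHM is usually smoothed first and the neighbourhood size is tuned to the expected crown width; in dense forests, overlapping crowns make this simple baseline unreliable, which is the situation the method in [38] is reported to handle.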
Within the "Navigation" category, there are three relevant works that were considered in Table 3. One is focused on performing autonomous flight with a UAV using only vision sensing [28]. In this work, the authors developed an aerial system that uses a vision-based DL method to detect obstacles and then performs evasion manoeuvres. The results of this work are surprising and are the reason why it was chosen: out of 100 flights carried out in a simulated environment, the rate of success was 85%, while in a real environment the rate of success was 100% for 10 flights. Another surprising work was developed in [74]. The aim of this work was to perform rubber-tapping autonomously. For that, the authors used a caterpillar robot with a gyroscope and a LiDAR mounted on it. The robotic system was capable of moving along one row at a fixed lateral distance and then turning from one row into another, while performing rubber-tapping automatically. Moreover, the system collected forest information and mapped the forest during the operation. These developments constitute a tremendous breakthrough, as this is one of the first works to implement an autonomous system for performing a forestry operation without human interaction. Back in 2010, another work was published about autonomous navigation using a quadruped robot with a sensing system composed of a LiDAR, a stereo camera, a GNSS receiver, and an IMU [97]. The robot was tested in a forest environment, where it successfully completed 23 out of 26 autonomous runs and travelled more than 130 m in one of them.
The last category covered in this review is "Species classification". In this category, two works based on multimodal sensing for classification of vegetation were considered. In one of them, the authors combined hyperspectral images with LiDAR data to distinguish 23 classes, of which 21 were vegetative, and benchmarked three classifiers [81]. In the other work, the authors developed a method that performs ITD over an ALS-derived CHM and then classifies the detected tree crowns into four classes. The classification was performed by combining 23 features from ALS data and orthoimages [82]. These two works were selected because they covered a considerable number of classes to identify, made several combinations of multimodal features to serve as input for classification, and benchmarked state-of-the-art methods for classifying forestry vegetation.
Figure 1 presents a year distribution of the works that were studied and reviewed in this article.
From Figure 1, it can be concluded that the majority of works (around 70%) are from 2017 onwards. This not only means that there is a growing interest in developing technology for forestry, but also that in the last four years, there have been technological developments with higher impact in the forestry domain.
The categories presented in Sections 2.1-2.3 are detailed in Figure 2 according to their coverage in this article.
The category with the greatest presence in this article (more than 50%) is "Inventory and structure". Such dominance over the other categories can be explained by the fact that the works in this category are mostly about forest characterisation through the extraction of vegetation parameters and biomass estimation (using vision and/or LiDAR), which are the most common lines of work for estimating important socioeconomic variables, such as forest yield, carbon stock, and wildfire risk. The second category with the most coverage in this article is "Navigation". This indicates a growing demand for automation systems capable of navigating and performing autonomous tasks in forests. This is crucial, and it is expected that a larger number of autonomous robotic solutions may appear in the upcoming years, since the lack of manpower is a constant issue in forestry, for both manual and machine work. The third category is "Species classification/detection", which is part of a relevant application domain that allows one to detect invasive species and avoid future implications for the biodynamics of forestry areas.
Another aspect that must be discussed in this article is the applicability and adequacy of sensor platforms to perform specific forestry operations. Table 4 presents an overview of different sensor platform types and their details regarding area coverage, data resolution, and whether a given platform is suitable for real-time operation. From Table 4, one can verify that the sensor platforms that should be used for collecting data from large forest areas are the airborne and spaceborne ones (spacecraft, aircraft, and UAVs), as they can cover more terrain than ground vehicles and in less time. Ground vehicles and UAVs are the platforms to be employed to obtain highly precise data and for real-time operation. However, ground vehicles are mostly preferred over UAVs, since they typically have more energy autonomy and support much larger payloads. Such characteristics are ideal for performing forestry tasks, which usually involve spending several hours in the terrain and carrying heavy loads. Nonetheless, ground vehicles/robots require advanced perception systems. To develop and test these systems, more datasets are needed.

Perception Datasets for Forestry
The existence of sufficient data is crucial for further developing multimodal perception in forests. Therefore, this section introduces our dataset as a contribution of this paper and also emphasises existing datasets within this field.

Proposed Dataset
The dataset proposed in this work is called the QuintaRei Forest Multimodal Dataset (QuintaReiFMD). It was acquired in a eucalyptus forest located in Valongo (Portugal) using a robotic platform named AgRob V16, which is presented in Figure 3. The dataset is available in the Robot Operating System (ROS) format, is made up of nine rosbags, and includes visible, thermal, and depth images, as well as point clouds. The dataset was recorded while the robot navigated the forest (manually controlled using a remote controller), on flat and also steep terrain, at a maximum velocity of 0.5 m/s. These data were collected by means of different sensors mounted on the front of AgRob V16: a ZED stereo camera (https://www.stereolabs.com/zed, accessed on 24 September 2021), pointing forward, mounted 96 cm above ground and tilted by 10 degrees, was used to acquire visible and depth images; a FLIR M232 camera (https://www.flir.eu/products/m232, accessed on 24 September 2021), pointing forward and mounted 70 cm above ground with no tilt, was used to capture thermal images; an OAK-D camera (https://store.opencv.ai/products/oak-d, accessed on 24 September 2021), pointing to the left of the robot, mounted 96 cm above ground and tilted by 10 degrees, was used to collect visible and depth images; and a Velodyne Puck LiDAR (https://velodynelidar.com/products/puck, accessed on 24 September 2021), mounted 100 cm above ground, was used to acquire point clouds. These sensors are also presented in Figure 3 with coloured annotations. The dataset is publicly available at https://doi.org/10.5281/zenodo.5045354 (accessed on 24 September 2021), and a partial description of it is presented in Table 5, where the data types, data resolution, frame rate, Field Of View (FOV), and number of messages associated with each sensor are detailed.
Other publicly available datasets were found in the literature. In [34], the authors built a unimodal dataset made of laser scanning data (available at https://www.newfor.net/download-newfor-single-tree-detection-benchmark-dataset, accessed on 24 September 2021) to perform a tree detection benchmark. In [165], the authors constructed a dataset of low-viewpoint colour and depth images (available at https://doi.org/10.5281/zenodo.3690210, accessed on 24 September 2021) to enhance the intelligence of smaller robots, possibly achieving autonomous navigation in forests. In [31], the authors built a dataset of manually annotated visible and thermal images (available at https://doi.org/10.5281/zenodo.5213824, accessed on 24 September 2021) to perform trunk detection and enhance robot awareness in the forest. In [166], the authors presented a multimodal dataset of laser scans and colour and grey images (available at http://autonomy.cs.sfu.ca/sfu-mountain-dataset, accessed on 24 September 2021), whose data correspond to eight hours of trail navigation. In [167], the authors produced a dataset composed of colour images (available at https://etsin.fairdata.fi/dataset/06926f4b-b36a-4d6e-873c-aa3e7d84ab49, accessed on 24 September 2021) for forestry operations in general. Lastly, in [168], the authors proposed two multimodal datasets made of laser scans and thermal images (available as DS_AG_34 and DS_AG_35 at https://doi.org/10.5281/zenodo.5357238, accessed on 12 October 2021) for forestry robotics, and they used the datasets to perform a SLAM benchmark. Table 6 summarises and describes the aforementioned datasets and our dataset. All datasets were acquired at ground level.
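Since several of these datasets, including QuintaReiFMD, are distributed as rosbags, a first step when exploring them is to count the messages recorded per topic. The sketch below assumes ROS 1 `.bag` files and the ROS 1 `rosbag` Python package; the bag file name and the topic names are hypothetical and do not come from the dataset description.

```python
from collections import Counter


def count_messages_per_topic(messages):
    """Count messages per topic.

    messages is any iterable of (topic, message, timestamp) tuples,
    which is the shape yielded by rosbag.Bag.read_messages() in ROS 1.
    """
    counts = Counter()
    for topic, _message, _stamp in messages:
        counts[topic] += 1
    return dict(counts)


def summarise_bag(path):
    """Open a ROS 1 bag file and return its per-topic message counts."""
    import rosbag  # imported here so the counting helper works without ROS

    with rosbag.Bag(path) as bag:
        return count_messages_per_topic(bag.read_messages())


# Stand-in messages with hypothetical topic names, used instead of a real bag.
fake_messages = [
    ("/camera/rgb", None, 0.00),
    ("/lidar/points", None, 0.05),
    ("/camera/rgb", None, 0.10),
    ("/camera/thermal", None, 0.12),
]
print(count_messages_per_topic(fake_messages))

# With ROS 1 installed and a downloaded bag (file name is hypothetical):
# print(summarise_bag("quintarei_01.bag"))
```

Keeping the counting logic separate from the bag reader allows it to be checked without a ROS installation; the resulting counts can then be compared with the number of messages per sensor reported in Table 5.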

| Reference | Perception Data | Data Format |
| Eysn et al. [34] | Laser scans | LAS, TIFF, SHP |
| Niu et al. [165] | Colour and depth images | PNG, CSV |
| da Silva et al. [31] | Visible and thermal images | JPG, XML |
| Bruce et al. [166] | Laser scans; colour and monochrome images | ROS |
| Ali et al. [167] | Colour images | ROS |
| Reis et al. [168] | Laser scans; thermal images | ROS |
| QuintaReiFMD | Laser scans; visible, thermal, and depth images | ROS |

From Table 6, it can be seen that our dataset complements the existing datasets, as it contains laser scans together with three different image types (more than any other), enabling the development of more forestry applications, possibly in real time, during day and night, using multimodal data.

Conclusions
Perception in forests is of utmost interest, since the combination of perception systems with robotics and machinery can enable smarter, more precise, and more sustainable forestry. In this sense, this work presents a formal review of several scientific articles on forestry applications and operations that rely on perceiving the forest environment.
This work reviewed unimodal and multimodal perception in forest environments. Additionally, this work contributes to the enrichment of multimodal data in forests by providing a public dataset composed of LiDAR data and three different types of imagery: visible, thermal, and depth. This dataset is more complete than the other datasets summarised in Table 6, as it includes four different types of sensor data.
Regarding unimodal sensing, the most common sensors are vision and LiDAR. Multimodal sensing takes advantage of a set of data coming from vision, LiDAR, and RaDAR. The most common usages for perception are divided into categories, such as health and diseases, inventory, and navigation.
Processing can be performed online, in real time, onboard a given vehicle, or offline, to reach conclusions after the mission that collected the data. With the literature review made in this article, the perception trends in forestry environments can be detailed. Vision-based perception is mainly used along with aerial vehicles and in offline tasks, such as detecting diseases in vegetation and assessing the forest yield from its inventory. LiDAR-based perception is mostly used along with aerial vehicles (sometimes even spaceborne), and its data are most of the time processed offline for biomass estimation and structure measurement purposes. Multimodal perception is especially focused on offline operations, such as detecting and distinguishing vegetation species from aerial imagery and laser scanning systems, estimating biomass using multispectral and hyperspectral images with LiDAR and RaDAR data, and measuring the forest canopy.
In the coming years, perception in forests should focus on ground-based systems that perform forestry operations in real time, relying on visual perception alone, LiDAR perception alone, and/or a fusion of the two. Advances in these topics can enable further technological developments in forestry, including fully unmanned navigation for monitoring and the autonomous execution of operations such as cleaning, pruning, fertilising, and planting.

Figure 1. Year distribution of the reviewed articles.

Figure 2. Category distribution of the reviewed articles.

Figure 3. Lateral view of the ground mobile robotic platform AgRob V16 with the sensors annotated.

Table 1. Summary of the collected works about vision-based perception.

Table 2. Summary of the collected works about LiDAR-based perception.

Table 3. Best works in terms of innovation in each category.

Table 5. Partial description of the dataset acquired by the sensors mounted on AgRob V16: FOV, data types, data resolution, frame rate, and total number of messages related to each sensor.

Table 6. Summary of works presenting perception datasets in forests and our dataset, as well as their characteristics.