A Deep Residual U-Type Network for Semantic Segmentation of Orchard Environments

Abstract: Recognition of the orchard environment is a prerequisite for realizing the autonomous operation of intelligent horticultural tractors. Due to the complexity of the environment and the algorithm's dependence on ambient light, traditional recognition algorithms based on machine vision are limited and have low accuracy. The deep residual U-type network is more effective in this situation. In an orchard, the deep residual U-type network can perform semantic segmentation on trees, drivable roads, debris, etc. The basic structure of the network adopts a U-type network, and residual learning is added in the coding and bottleneck layers. Firstly, the residual module is used to increase the network depth, enhance the fusion of semantic information at different levels, and improve feature expression capability and recognition accuracy. Secondly, the decoding layer uses up-sampling for feature mapping, which is convenient and fast. Thirdly, the semantic information of the coding layer is integrated by skip connections, which reduces the network parameters and accelerates training. Finally, the network was built with the PyTorch deep learning framework, trained on the data set, and compared with the fully convolutional neural network, the U-type network, and the Front-end+Large network. The results show that the deep residual U-type network has the highest recognition accuracy, with an average of 85.95%, making it more suitable for environment recognition in orchards.


Introduction
The "Made in China 2025 strategy" puts forward new requirements for China's agricultural equipment and requires continuous improvement of the ability of agricultural machinery intelligence and precision operation [1]. The horticultural tractor is an important piece of agricultural machinery working in orchards, forest gardens, and other environments. One of the basic tasks of realizing intelligent operation of the horticultural tractor is to identify the working environment. With the development of science and technology, more and more methods are being used for environmental recognition, such as lidar for scanning and recognizing the surrounding environment. However, due to the high cost of lidar, it is difficult to apply to agricultural products. In contrast, the use of ordinary cameras as sensors has many advantages, like comprehensive information collection and low price [2]. For vision-based environmental recognition, related algorithms based on the target of achieving rapid and accurate recognition have mainly been formulated.
In the research of environment recognition, Radcliffe [3] developed a small autonomous navigation vehicle for a peach orchard based on machine vision. The machine vision system used a multispectral camera to capture real-time images and processed them to obtain trajectories for autonomous navigation. Lyu [4] used naïve Bayesian classification to detect the boundary between the trunks and the ground of an orchard and proposed an algorithm to determine the centerline of the orchard road.

The main contributions of this paper are as follows: Section 1 introduces the current research status of environment identification and proposes an orchard environment identification algorithm based on the deep residual U-type network. Section 2 describes the acquisition and processing of orchard environment data sets. Section 3 presents the construction of the orchard environment model and constructs the deep residual U-type network segmentation model by analyzing the characteristics of the residual network and the U-type network and combining them with the actual needs of orchard environment identification. Section 4 carries out an experimental comparative analysis of the fully convolutional neural network, the U-type network, the Front-end+Large network, and the deep residual U-type network. Finally, the conclusions and future work are presented in Section 5.

Data Set Acquisition
According to the different working conditions of the horticultural tractor in the orchard, the orchard environment can be divided into the transportation environment and the working environment, as shown in Figure 1. The working environment is characterized by the horticultural tractor working between two rows of fruit trees, with minor environmental changes, mainly carrying out pesticide spraying, picking, and other tasks. The transportation environment is characterized by horticultural tractors driving on unstructured roads with indistinct boundaries and scattered debris.
To realize pixel-level segmentation of the orchard environment, the first step is to obtain real images of the orchard environment. In this paper, a GoPro4 HD camera was used as the orchard environment acquisition tool, with a pixel resolution of 4000 × 3000. In order to improve the robustness and universality of the network model, we collected data under various conditions according to changes in weather and light. The collected data were saved in the form of video. We used Python scripts to capture pictures from the video at a speed of 30 fps. In order to reduce the consumption of computer graphics memory, the pictures were resized to 1024 × 512, and a total of 2337 images of the orchard environment were obtained after processing; the images were arranged according to their corresponding serial numbers.

Data Set Preprocessing
Semantic segmentation requires preprocessing of the obtained image data. Orchard environment recognition based on the deep residual U-type network is a type of supervised learning. Original pictures and labeled pictures need to be input during model training, so the collected images need to be labeled manually. According to the characteristics of the orchard environment and the autonomous operation requirements of horticultural tractors, the objects in the orchard environment were divided into four categories: background, road, peach trees, and debris. In this paper, the orchard environment data set was manually labeled using the semantic segmentation labeling tool Labelme [17], which labels different categories in different colors. Table 1 provides information on the orchard environment categories and Figure 2 shows a labeled map of the orchard environment.

In order to solve the problem of insufficient training data, we used a data enhancement method [18] to expand the orchard environment data set, which is convenient, fast, and effective in keeping the enhancement of the original and labeled pictures consistent. We expanded the data set to 3 times its original size by flipping, cropping, and scaling the images. After the above processing, the data set was divided into training and test sets at a ratio of 7:3 by random selection. There were 6214 training pictures and 2663 test pictures, and the serial number of each original picture corresponded to the serial number of its labeled picture.
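A minimal sketch of the consistent image/label augmentation and the 7:3 random split, assuming images and label maps are NumPy arrays; the 90% crop factor and the seeding scheme are illustrative choices, not values taken from the paper:

```python
import random
import numpy as np

def augment_pair(image, label, seed=None):
    """Apply the same random flip / crop / scale to an image and its label map.

    `image` is H x W x 3, `label` is H x W. Drawing all randomness from one
    RNG guarantees the geometric transforms stay consistent for the pair.
    """
    rng = random.Random(seed)
    h, w = label.shape
    # Horizontal flip with probability 0.5, applied to both arrays.
    if rng.random() < 0.5:
        image, label = image[:, ::-1], label[:, ::-1]
    # Random crop to 90% of the original size, then scale back to the input
    # resolution by nearest-neighbour indexing (labels must stay integer).
    ch, cw = int(h * 0.9), int(w * 0.9)
    top, left = rng.randint(0, h - ch), rng.randint(0, w - cw)
    image = image[top:top + ch, left:left + cw]
    label = label[top:top + ch, left:left + cw]
    rows = np.arange(h) * ch // h
    cols = np.arange(w) * cw // w
    return image[rows][:, cols], label[rows][:, cols]

def split_dataset(indices, ratio=0.7, seed=0):
    """Randomly split sample indices into training and test sets (7:3)."""
    rng = random.Random(seed)
    shuffled = list(indices)
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * ratio)
    return shuffled[:cut], shuffled[cut:]
```

Nearest-neighbour scaling is used for both arrays here so that the label map keeps valid class indices; for the RGB image a smoother interpolation could equally be used.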

Construction of the Orchard Environmental Identification Model
In recent years, with the application of fully convolutional networks (FCNs), convolutional neural networks (CNNs) have been used to generate semantic segmentation charts of any size on the basis of feature diagrams, which can split images at the pixel level [19]. Based on this method, many algorithms have been derived, such as the deep-splitting network framework DeepLab series [20,21], the scene resolution network PSPNet using a pyramid pooling module [22], and the U-Net network for medical image segmentation [23].
The U-Net network can achieve good segmentation accuracy even with few samples, and it was the first to use skip connections to add encoder features to decoder features, creating an information propagation path that allows signals to spread more easily between low-level and high-level features; this not only facilitates backpropagation during training but also improves segmentation accuracy. However, the U-Net network has an insufficient ability to obtain contextual information from images, especially for complex scene data with large differences in category scales. Multi-scale fusion is usually used to increase the depth of the network and improve the U-Net network's ability to obtain contextual information. However, with increasing network depth, the recognition accuracy saturates and then degrades rapidly, and the recognition error increases. To solve this problem, He et al. [24] proposed a deep residual network that uses identity mapping to obtain more contextual information; the error does not increase with network depth, which solves the problem of training degradation.

Residual Network
In traditional convolutional neural networks, multi-layer features become richer as network layers are stacked, but simply stacking layers causes the vanishing gradient problem and hinders model convergence. In order to improve the accuracy of environment recognition and prevent vanishing gradients, a residual network was added to the network structure.
The residual network is mainly designed as a residual block with a shortcut connection, which is equivalent to adding a direct channel in the network, so that the network has a stronger identity mapping ability, thus increasing the network depth and improving network performance without overfitting. The residual network consists of a series of stacked residual units. Each residual unit can be expressed in the general form

x_{i+1} = f(h(x_i) + F(x_i, w_i)),

where x_i is the input of the residual unit of layer i; w_i denotes the network parameters of the residual unit of layer i; F(·) denotes the residual function; h(·) denotes the identity mapping function; and f(·) denotes the activation function.

The residual neural network unit consists of two parts: the identity mapping part and the residual part. The identity mapping integrates the input with the output of the residual part, which facilitates the fusion of subsequent feature information; the residual part is generally composed of multiple convolutional layers, normalization layers, and activation functions. Through the superposition of the identity mapping and the residual part, information interaction is realized, compensating for the residual part's limited ability to extract underlying features. Figure 3 shows the difference between the common neural network unit and the residual neural network unit: Figure 3a shows the structure of the common neural network unit, and Figure 3b shows the structure of the residual neural network unit.
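One possible PyTorch rendering of such a residual unit is sketched below; the two-convolution residual branch and the channel widths are illustrative assumptions, not the paper's exact configuration:

```python
import torch
import torch.nn as nn

class ResidualUnit(nn.Module):
    """Residual unit of the form x_{i+1} = f(h(x_i) + F(x_i, w_i)).

    F(.) is two conv-BN(-ReLU) stages; h(.) is the identity when channel
    counts match, otherwise a 1x1 convolution with batch normalization
    (the projection described later in the paper); f(.) is a final ReLU.
    """
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.residual = nn.Sequential(          # F(x, w)
            nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1),
            nn.BatchNorm2d(out_ch),
        )
        if in_ch == out_ch:
            self.identity = nn.Identity()       # h(x) = x
        else:
            self.identity = nn.Sequential(      # 1x1 projection, stride 1
                nn.Conv2d(in_ch, out_ch, kernel_size=1),
                nn.BatchNorm2d(out_ch),
            )
        self.activation = nn.ReLU(inplace=True)  # f(.)

    def forward(self, x):
        return self.activation(self.identity(x) + self.residual(x))
```

The shortcut branch adds nothing to the parameter count when shapes match, which is why stacking such units deepens the network without the training degradation discussed above.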


Construction of the Deep Residual U-Net Model
Based on the characteristics of the deep residual network and the U-Net network, we propose the deep residual U-type network, which introduces residual layers to deepen the U-Net structure while avoiding excessive training time, too many training parameters, and overfitting. In semantic segmentation, both low-level detail information and high-level semantic information are needed for good results, and the deep residual U-type network retains both well. The deep residual U-type network has two specific benefits: (1) For complex environment recognition, adding residual units helps network training and improves recognition accuracy. (2) The long connections between low-level and high-level information and the skip connections of the residual units are conducive to the propagation of information, the parameter updates are distributed more uniformly, and the network model achieves better performance.
In this paper, the nine-level architecture of the deep residual U-type network was applied to the identification of targets in the orchard environment, and the network consists of three parts: a coding layer, a bottleneck layer, and a decoding layer. The coding layer extracts features from the image, forming a feature map. The bottleneck layer connects the coding layer and the decoding layer, acting as a bridge, to obtain low-frequency information in the image. The decoding layer restores the feature map to a pixel-level classification, that is, a semantic segmentation. Residual units were added to the coding and bottleneck layers to obtain contextual information, and the convolutional modules in the network contain a convolution layer, a batch normalization (BN) layer, and an activation function (Rectified Linear Unit, ReLU). Adopting the batch normalization layer prevents instability of network performance caused by excessively large activations before the activation function, which effectively alleviates the problems of vanishing or exploding gradients [25]. Using the ReLU activation function effectively reduces the amount of computation and increases the nonlinearity between the layers of the neural network [26]. The identity mapping connects the input and output of the residual neural network unit. Since the dimensionality of the input changes during convolution, the corresponding dimension of the input also needs to be changed during identity mapping. In this paper, a convolution kernel with a size of 1 × 1 and a stride of 1, followed by a batch normalization layer, was used as the identity mapping function.
There are four residual units in the coding layer, each of which is activated by the ReLU function after the residual function and identity mapping function are added together, and then the feature map size is halved by max pooling, which can effectively reduce parameters, reduce overfitting, improve model performance, and save computational memory [27]. The decoding layer consists of four basic units, using bilinear up-sampling and the convolutional module for decoding. Compared with deconvolution, this method is easier to implement in engineering and does not involve as many hyperparameter settings [28]. At the same time, the feature information in the coding layer is fused with the feature information in the decoding layer by a skip connection, which makes full use of the semantic information and improves recognition accuracy. After the last decoding layer, a 1 × 1 convolution and the Softmax activation function are used to achieve multi-class identification of the orchard environment. The network model in this paper has 25 convolutional layers and 4 max-pooling layers, and the structure of the network model is shown in Figure 4.
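One decoding unit with bilinear up-sampling and a skip connection can be sketched in PyTorch as follows; the channel widths are assumptions, since the paper does not list exact layer sizes:

```python
import torch
import torch.nn as nn

def conv_block(in_ch, out_ch):
    """Convolutional module used throughout the network: conv + BN + ReLU."""
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
    )

class DecoderUnit(nn.Module):
    """One decoding step: bilinear up-sampling, skip connection, convolution.

    `skip_ch` is whatever the matching encoder level produced; concatenating
    it with the up-sampled features fuses coding-layer semantic information
    into the decoding path.
    """
    def __init__(self, in_ch, skip_ch, out_ch):
        super().__init__()
        self.up = nn.Upsample(scale_factor=2, mode="bilinear",
                              align_corners=False)
        self.conv = conv_block(in_ch + skip_ch, out_ch)

    def forward(self, x, skip):
        x = self.up(x)                   # double the spatial resolution
        x = torch.cat([x, skip], dim=1)  # skip connection from the encoder
        return self.conv(x)
```

Unlike a transposed convolution, the bilinear up-sampling step has no learnable weights, which is the parameter saving referred to above.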

Loss Function
When training the segmentation network, the training images are passed through the segmentation network to obtain segmented images S(X_n), where S(·) is the segmentation network model and X_n is the input image. The segmented image S(X_n) is compared with the corresponding label image Y_n, and the loss function is minimized to make the segmented image close to the original labeled image, which ensures that the segmentation network produces accurate predictions and has good robustness. In this paper, the standard cross-entropy loss function was used to measure the difference between the segmented image S(X_n) and the labeled image Y_n [29]. The cross-entropy loss function is

L_ce = −∑_{h,w} ∑_{c} Y_n(h, w, c) log S(X_n)(h, w, c),

where L_ce is the cross-entropy loss function; Y_n is the ground truth (GT); X_n is the input image; S(·) is the segmentation network; and h, w, c denote the height, width, and number of channels of the image, respectively.
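The per-pixel cross-entropy loss can be computed directly in PyTorch; here `logits` stands in for the network output before Softmax and `target` for the class-index label map (the shapes are illustrative):

```python
import torch
import torch.nn.functional as F

# logits: (N, C, H, W) network output before Softmax, C = 4 orchard classes;
# target: (N, H, W) ground-truth class index per pixel.
logits = torch.randn(2, 4, 8, 8)
target = torch.randint(0, 4, (2, 8, 8))

# F.cross_entropy applies log-Softmax internally and averages the per-pixel
# negative log-likelihoods, matching the summed form given above up to a
# normalization constant.
loss = F.cross_entropy(logits, target)
```

Feeding class indices rather than one-hot maps avoids materializing Y_n over the channel dimension.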



Orchard Environment Identification Test
The fully convolutional neural network, the U-type network, the Front-end+Large network, and the deep residual U-type network can all achieve pixel-level semantic segmentation, but the four networks have their own characteristics. Among them, the fully convolutional neural network uses deconvolution to achieve semantic segmentation without a bottleneck layer, passing directly from the encoding layer to the decoding layer by pooling. The U-type network adds a bottleneck layer between the encoding and decoding layers to realize a smooth transition, and it adopts up-sampling and skip connections for decoding to achieve semantic segmentation. The Front-end+Large network has high recognition accuracy and high adaptability to different images, so it is widely used in farmland road recognition. The deep residual U-type network adopts a U-type network structure; by adding residual blocks in the coding layer and the bottleneck layer, the image context information and the multi-layer network are fully utilized to realize detailed processing of the image. We conducted an experimental comparative analysis of the above four networks, as detailed in the following.

Test Implementation Details
Based on the deep residual U-type network model proposed above, the deep learning framework PyTorch was used to build the orchard environment recognition and segmentation model. There were a total of 6214 training images with a size of 1024 × 512. The hardware environment of the experiment was an Intel Core i7-9700K 8-core processor with a GeForce RTX 2070 with 8 GB of memory.
As training deepens, the model can easily become trapped in local minima. In order to solve this problem, we adopted the RMSProp algorithm [30], which, following the principle of loss function minimization, dynamically adjusts the network model parameters to make the objective function converge faster. The initial learning rate in the RMSProp algorithm was set to 0.4, and the weight decay coefficient was 10⁻⁸. During model training, the batch size of data loading was 8, the number of iterations was 300, and the loss function value was recorded for each iteration.
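The optimizer setup can be sketched in PyTorch as follows; the one-layer `model` and the in-memory `loader` are stand-ins for the real network and DataLoader, while the learning rate, weight decay, and batch size follow the values above:

```python
import torch
import torch.nn.functional as F

# Stand-in for the deep residual U-type network: a single conv layer mapping
# 3 input channels to the 4 orchard classes.
model = torch.nn.Conv2d(3, 4, kernel_size=3, padding=1)

# RMSProp with the paper's settings: initial lr 0.4, weight decay 1e-8.
optimizer = torch.optim.RMSprop(model.parameters(), lr=0.4, weight_decay=1e-8)

# Stand-in loader: one batch of 8 images with per-pixel class labels.
loader = [(torch.randn(8, 3, 16, 16), torch.randint(0, 4, (8, 16, 16)))]

for images, labels in loader:
    optimizer.zero_grad()
    loss = F.cross_entropy(model(images), labels)  # per-pixel cross-entropy
    loss.backward()
    optimizer.step()
```

In practice the loop would run for 300 epochs over the real DataLoader, logging `loss.item()` at each iteration as described above.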

Evaluation Indicators
There are three evaluation criteria for semantic segmentation: execution time, memory footprint, and accuracy. Accuracy takes the manually annotated image as the reference standard and is judged by comparing the prediction image from the segmentation network with the annotated image and computing the pixel error between them.
Suppose there are k + 1 categories (k target categories and one background category), P_ii represents the number of pixels of class i correctly predicted as class i, and P_ij and P_ji represent incorrect predictions. The general evaluation criteria are as follows:
(1) Pixel Accuracy (PA): the ratio of the number of correctly classified pixels to the total number of pixels:

PA = ∑_i P_ii / ∑_i ∑_j P_ij

(2) Mean Pixel Accuracy (MPA): the average, over all categories, of the ratio of the number of correctly classified pixels in each category to the total number of pixels in that category:

MPA = (1 / (k + 1)) ∑_i (P_ii / ∑_j P_ij)

(3) Mean Intersection over Union (MIoU): the average, over all categories, of the ratio of the intersection to the union of the predicted and ground-truth pixel sets:

MIoU = (1 / (k + 1)) ∑_i (P_ii / (∑_j P_ij + ∑_j P_ji − P_ii))

Among the above criteria, the MIoU is the most representative and easy to implement, and many competitions and researchers use it to evaluate their results [31]. In this paper, PA and MIoU were used as evaluation indicators for different categories of segmentation and overall network models.
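The three criteria can be computed from a (k + 1) × (k + 1) confusion matrix; a minimal NumPy sketch (assuming every class appears at least once in the ground truth):

```python
import numpy as np

def segmentation_metrics(conf):
    """Compute PA, MPA, and MIoU from a (k+1) x (k+1) confusion matrix.

    conf[i, j] counts pixels of true class i predicted as class j, so the
    diagonal holds the correctly classified pixel counts P_ii.
    """
    diag = np.diag(conf).astype(float)       # P_ii
    rows = conf.sum(axis=1).astype(float)    # sum_j P_ij (per true class)
    cols = conf.sum(axis=0).astype(float)    # sum_j P_ji (per predicted class)
    pa = diag.sum() / conf.sum()             # pixel accuracy
    mpa = np.mean(diag / rows)               # mean per-class pixel accuracy
    miou = np.mean(diag / (rows + cols - diag))  # mean intersection over union
    return pa, mpa, miou
```

The denominator of each per-class IoU is the union of predicted and ground-truth pixels; subtracting `diag` avoids double-counting the intersection.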

Test Results and Analysis
When the deep residual U-type network model was trained, the model was saved every five iterations, and the saved model with the highest mean intersection over union was selected as the test model. In order to verify the superiority of the proposed network model, it was compared with the fully convolutional neural network model, the U-type network model, and the Front-end+Large network. Figures 5–8 show the loss values and MIoU over the iterations for the four network models. Table 2 reports the highest pixel accuracy (PA) and the highest MIoU for category segmentation.

From its training effects, the fully convolutional neural network showed a larger fluctuation amplitude and higher frequency. After about 100 iterations, the loss value and the MIoU tended to stabilize, but the fluctuation range remained large. The fully convolutional neural network discards the fully connected layer and uses deconvolution to realize semantic segmentation. When recognizing the complex environment of the orchard, the fully convolutional neural network has more network parameters, which leads to problems such as insufficiently clear segmentation of details in the image, blurred segmentation boundaries between different categories, and long training time. The training effect of the U-type network in Figure 6 was significantly better than that of the fully convolutional neural network. The network tended to be stable after about 60 iterations. Up-sampling was adopted to map the feature images, which greatly reduced the network parameters. However, there were still fluctuations early in training, owing to insufficient network depth and a poor ability to distinguish the boundaries of different categories in the image during training. The training effect of the Front-end+Large network in Figure 7 was significantly improved compared with the first two networks.
The fluctuations early in training were small, and the network tended to be stable at around 54 iterations. The training effect of the deep residual U-type network in Figure 8 was significantly better than that of the first three networks. There was no obvious fluctuation in the early training period, the change in the loss value and MIoU was small after about 40 iterations, and the overall training time was short and stable. The deep residual U-type network adopts the U-type network structure but adds residual blocks in the coding layer and fuses image feature information, which better processes image boundary information. In addition, skip connections were added to the decoding layer, which effectively reduced the network parameters. Among the four networks, the deep residual U-type network had the lowest loss value, the highest MIoU, and the best training effect.
However, there were still fluctuations in the preliminary training process, which was due to insufficient network depth and poor ability to distinguish the boundaries of different categories in the image during training. The effect of Front-end-Large network training in Figure 7 was significantly improved compared with that  As can be seen from Table 2, in the category segmentation of the orchard environment, the four semantic segmentation network models had high accuracy in pixel recognition for backgrounds and roads, but low accuracy in recognition of fruit trees and debris. The average intersection ratio of the deep residual U-type network proposed in this paper was 85.95%, which is higher than those of the first three network models, and achieved better results in orchard environment recognition. Figure 9 shows the respective semantic segmentation prediction images from the four network models. Among them, the segmentation image generated by the fully convolutional neural network model has the following shortcomings: One is that some areas' categories were lost in the segmentation, and small objects such as small branches could not be identified. The other is that for the segmentation of a large region, the boundary detail information processing capability is insufficient; as shown in the figure, the segmentation of the road boundary is not clear enough. This result is due to the fact that the fully convolutional neural network model uses deconvolution in the decoding process. Although this method is simple and feasible, it causes problems such as violent pooling, blurring of segmented images, and lack of spatial consistency. Compared with the fully convolutional neural network model, the segmentation image generated by the U-type network model has higher MIoU and better segmentation effect. 
However, where categories overlap, the features of the separated categories are not distinct: the boundary of the overlapping region is rough, and overlapping parts are easily lost, such as details at the intersection of fruit trees and debris. For the complex orchard environment, the insufficient depth of the U-type network prevents the detailed information of each category from being fully exploited, and overfitting occurs during training. The segmentation image generated by the Front-end+Large network improves on those of the fully convolutional neural network and the U-type network overall, but some details, such as branches and debris, are still lost. The segmentation results show that the orchard environment recognition model based on the deep residual U-type network reflects boundary information well in both large-area and small-region category segmentation, and its recognition accuracy was higher than that of the previous three segmentation models.
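The PA and MIoU figures compared in Table 2 follow the standard confusion-matrix definitions. A minimal sketch of these metrics (function and array names are illustrative, not from the paper):

```python
import numpy as np


def confusion_matrix(pred, gt, num_classes):
    """Confusion matrix with ground-truth classes as rows and
    predicted classes as columns, from flat label arrays."""
    mask = (gt >= 0) & (gt < num_classes)
    return np.bincount(num_classes * gt[mask] + pred[mask],
                       minlength=num_classes ** 2).reshape(num_classes, num_classes)


def pixel_accuracy(cm):
    """PA: correctly labeled pixels over all pixels."""
    return np.diag(cm).sum() / cm.sum()


def mean_iou(cm):
    """MIoU: per-class TP / (TP + FP + FN), averaged over classes."""
    tp = np.diag(cm)
    denom = cm.sum(axis=1) + cm.sum(axis=0) - tp
    return np.nanmean(tp / denom)


# Toy example with 3 classes (e.g. background, road, tree) and 4 pixels:
pred = np.array([0, 1, 1, 2])
gt = np.array([0, 1, 2, 2])
cm = confusion_matrix(pred, gt, num_classes=3)
print(pixel_accuracy(cm))  # 0.75
print(mean_iou(cm))        # 0.666... (IoUs are 1.0, 0.5, 0.5)
```

PA can stay high while MIoU drops when small classes (branches, debris) are missed, which is consistent with the per-category pattern discussed above.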

In summary, the deep residual U-type network model proposed in this paper can effectively improve the recognition accuracy of orchard environments, and the segmentation model also shows better robustness to complex orchard scenes and light changes.
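The encoder-side residual learning summarized above can be sketched in PyTorch (the framework the paper states it uses); the channel sizes and layer arrangement here are illustrative assumptions, not the authors' exact architecture:

```python
import torch
import torch.nn as nn


class ResidualBlock(nn.Module):
    """Two 3x3 conv layers with an identity shortcut, as used in the
    coding (encoder) layers to deepen the network while easing training."""

    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1),
            nn.BatchNorm2d(out_ch),
        )
        # A 1x1 conv aligns channel counts on the shortcut path when needed.
        self.shortcut = (nn.Identity() if in_ch == out_ch
                         else nn.Conv2d(in_ch, out_ch, kernel_size=1))
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        # Residual learning: add the shortcut to the conv branch output.
        return self.act(self.body(x) + self.shortcut(x))


# One RGB patch through a 3 -> 64 channel encoder block: spatial size is kept.
x = torch.randn(1, 3, 64, 64)
y = ResidualBlock(3, 64)(x)
print(y.shape)  # torch.Size([1, 64, 64, 64])
```

In a U-type layout, the output of each such encoder block would also be carried across to the decoder by a skip connection before down-sampling.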

Conclusions and Future Work
(1) Orchard environment recognition based on the deep residual U-type network was realized by collecting orchard environment information and constructing a deep residual U-type network. Compared with the fully convolutional neural network, the U-type network, and the Front-end+Large network, the deep residual U-type network showed a stronger ability to extract contextual information and process the details of an orchard environment image.

(2) The fully convolutional neural network, the U-type network, the Front-end+Large network, and the deep residual U-type network were tested and compared. The test results showed that the segmentation accuracy of the deep residual U-type network was better than that of the other three networks.

(3) The semantic segmentation model based on the deep residual U-type network presented high recognition accuracy and strong robustness in actual orchard environment recognition, showing potential to provide environmental perception for the autonomous operation of horticultural tractors in orchards. At the same time, this method also has shortcomings: producing a data set requires a large number of annotations, which is time-consuming and labor-intensive; training consumes a large amount of graphics card memory; and post-processing of the predicted segmentation image is insufficiently optimized. Future work will focus on these shortcomings to further improve the model's recognition accuracy.

Institutional Review Board Statement: Not applicable.
Informed Consent Statement: Not applicable.

Data Availability Statement: The data presented in this study are available on request from the corresponding author.