Cherry Tree Crown Extraction from Natural Orchard Images with Complex Backgrounds

Abstract: Highly effective pesticide application requires continual adjustment of the spray flow rate to match different canopy characteristics. Real-time image processing with rapid target detection and data-processing technologies is vital for precision pesticide application. However, existing studies do not provide an efficient and reliable method of extracting individual trees with irregular crown shapes from complicated backgrounds. This paper proposes a segmentation model based on Mahalanobis distance and a conditional random field (CRF) to extract cherry trees accurately in a natural orchard environment. The study computed the Mahalanobis distance from the image's color, brightness and location features to acquire an initial classification of the canopy and background. A CRF was then created by using the Mahalanobis distance calculations as the unary potential energy and a Gaussian kernel function based on image color and pixel distance as the binary potential energy. Finally, image segmentation was completed using mean-field approximation. The results show that the proposed method achieves higher accuracy than the traditional K-means and GrabCut algorithms and lower labeling and training costs than the deep learning algorithm DeepLabv3+, with an average precision (P), recall (R) and F1-score of 92.1%, 94.5% and 93.3%, respectively. Moreover, experiments on datasets with different overlap conditions and image acquisition times, as well as in different years and seasons, show that the method performs well under complex background conditions, with an average F1-score higher than 87.7%.


Introduction
Precision agriculture is a management strategy that uses modern science and technology to obtain the agricultural information required for efficient precision crop management, such as formula fertilization, precision seeding, pest control, weed removal and water management [1][2][3]. Precision-spraying technology is vital for pest prevention and control. However, although precision-spraying technologies have been widely used in precision agricultural production, their efficient application in cherry orchards remains a big challenge [4]. For instance, applying established spraying strategies to tree crops with different canopy characteristics, such as irregular sizes and shapes, may lead to spray drift and pesticide overdosing, posing a great risk to farmers and the environment [5,6]. The situation may worsen when the crown's size and shape change significantly across growth stages [7]. To reduce the negative impact of pesticide application, it is necessary to develop a canopy extraction technology that provides accurate tree-canopy data for precision-spraying systems.
Proximal sensing vehicle-mounted technologies are defined as the use of sensors and traction systems to identify and detect agricultural parameters [8,9]. At present, field-based sensors are widely adopted for automatic tree identification, such as Visible-Near-Infrared imaging [10,11], stereo imaging [12,13] and thermal imaging [14,15]. Target extraction based on RGB (Red Green Blue) digital cameras has seen wide application in precision farming due to its low cost and non-contact data collection [16][17][18]. In this process, color index-based segmentation techniques are most often applied to complete the crucial background removal. Several studies have summarized the color indices that perform well in distinguishing plants from backgrounds [19][20][21], including the excess green index (ExG) [22], the excess green minus excess red index (ExGR) in RGB color space [23] and spectral vegetation indices such as the Normalized Difference Vegetation Index (NDVI) [24] and the Green-Red Vegetation Index (GRVI) [25]. However, such color index-based methods are ineffective when the background and plant share similar colors, e.g., green weeds and canopies. Omair extracted canopies from artificial turf in high-altitude communities using color and texture features [26], which is not applicable to a natural orchard environment where the physiological features are unstable, especially when the canopy overlaps the weeds [27]. Past studies showed that thresholding and filtering techniques based on grayscale or edge characteristics are often used in image pre-processing or combined with other segmentation methods [28,29], which suggests that methods based on a single feature struggle to remove complex backgrounds.
By contrast, statistics-based machine learning (ML) methods can overcome the limitations of feature-based segmentation [30][31][32]. ML methods can be divided into two categories: unsupervised and supervised learning algorithms. Unsupervised learning algorithms typically adopt clustering methods, such as Fuzzy C-Means (FCM), K-means and the Gaussian Mixture Model (GMM). Liu et al. [33] used Type-2 FCM to extract Ginkgo and Platanus canopies from UAV (Unmanned Aerial Vehicle) images without a complex background. Qi et al. [34] proposed an effective fruit tree segmentation method based on K-means clustering and color features to separate the background from the canopy. This method, however, is not ideal for input images that contain a weed background. Abdalla et al. [35] employed GMM, self-organizing maps, Fuzzy C-means and K-means algorithms calculated with the highest color features from ten color models to segment oilseed rape images, but the results cannot be generalized to other complex situations, as the features of oilseed rape and background are obviously different and easy to distinguish. In short, unsupervised learning algorithms do not require model training and are simple to use, but they cannot process images with complex backgrounds.
Supervised learning algorithms, which can be divided into traditional supervised learning algorithms and deep learning, can perform well in the segmentation of complex scenes [36][37][38]. Chen et al. [39] proposed a citrus canopy segmentation method based on an SVM (Support Vector Machine) segmentation model trained with 14 color features and five statistical textures. Mattos et al. [40] used a CNN (Convolutional Neural Network) to segment the citrus canopy from the background, achieving an overall accuracy of 94% in seven different orchards. Wu et al. [41] proposed a deep learning model to extract apple tree canopies and their parameters, with segmentation accuracy and recall both above 90%. This body of research exemplifies the application of artificial intelligence in the farming industry. However, traditional supervised learning algorithms rely on complex feature engineering (FE), while deep learning requires large labeled datasets and high-performance computers. Lu et al. [42] summarized 34 available deep learning datasets in the agricultural field, including weeds, fruits and common ground crops, but no open-source datasets on fruit trees are included. Hence, supervised learning is not always the best choice.
Compared with structured environments, a natural orchard environment poses more challenges to image segmentation. For example, the images taken from cherry orchards may include various non-target elements such as sky, land, cover films (Figure 1a) and houses (Figure 1b). Moreover, the tree's physiological properties, especially the porous media, lead to an uneven distribution of light within the canopy (Figure 1c). It is also difficult to differentiate the canopy from the weeds when the two overlap (Figure 1d). The objective of the current study was, therefore, to propose a method for extracting cherry-tree canopies from complex backgrounds that has a higher accuracy rate than traditional learning methods and lower computational costs than deep learning algorithms. This study provides a new way to characterize the image features of the tree crown by computing the Mahalanobis distance from the image's color, brightness and location features to acquire an initial classification of the canopy and background. Moreover, the tree-crown area features and global image features were considered by using a conditional random field (CRF), created by using the Mahalanobis distance calculations as the unary potential energy and a Gaussian kernel function based on image color and pixel distance as the binary potential energy. Finally, the study completed image segmentation using mean-field approximation. The proposed work will contribute to future machine-vision-based tree-crop extraction.

Test Site and Image Acquisition
The study was conducted in a cherry orchard of Zhongnong Futong Company in Tongzhou, Beijing (116°48′32.1″E, 39°51′46.37″N) (Figure 2a). The test area has climatic characteristics representative of local orchards, with weeds germinating and flourishing in May and June. Cherry trees are spaced 4 m apart, with a 5 m path between rows. The average height of the canopy was about 1.7 m. An untrained tree experimental field was selected, with a total area of 10,660 m² (82 × 130 m).
The trees were photographed with a digital camera (ILCE-7M2, Sony Inc., Tokyo, Japan), which features a 24.3-megapixel sensor that enables high-resolution images, 5-Axis SteadyShot INSIDE stabilization, an ISO sensitivity of 400, a focal length of 135 mm and a body size of 126.9 × 95.7 × 59.7 mm. The camera was mounted on a Beno IT15 gimbal and positioned 1 to 1.5 m above the ground and 3 to 3.5 m in front of the tree trunk. To run the image processing program, each input image was adjusted to the following unified parameters: PNG format (lossless compression), 900:1600-pixel aspect ratio and 24 bits (three channels, 8 bits per channel). The images were collected under different light conditions and in different growth stages from April to October of each year between 2018 and 2020. Figure 2 illustrates images captured under different lighting and weed conditions in the field. Figure 2b exemplifies a cloudy day with high-density weeds, whereas Figure 2c displays a sunny day with normal-density weeds.

Template and Ground Truth Generation
The template and the ground truth are images in which the canopy area has been manually delineated, with the former used for image classification and the latter for algorithm performance comparisons. To minimize subjectivity in image labeling, we invited an expert with experience in agricultural image processing to use Photoshop to generate the ground truth and the template. The template includes two manually segmented images, which capture only the crown area and are representative of the colors, light and shooting angles in the image datasets. They were captured on a sunny day and a cloudy day, respectively. The template was created with Adobe Photoshop 2018 as an image-labeling tool [43], following three steps: selecting the lasso tool to outline the edge of the tree crown, taking the region inside the closed curve as the image foreground and using the fill function to set the area outside the foreground to black (Figure 3b). The ground truth used for algorithm performance comparison follows the same workflow but adds one step: setting the image foreground to white (Figure 3c).

Feature Construction of Tree Crown
Feature extraction consists of two steps: extracting color features to remove the non-green background and extracting brightness and height features to separate canopies from weeds.

Color Feature Extraction
Since the plants were distinctively green, the study removed non-green background regions based on color features [44]. The HSV (hue, saturation and value) color model was adopted to extract color features, where H represents hue, S represents saturation and V represents brightness. In this model, only the hue (H) and saturation (S) channels describe color information, while brightness (V) is a separate channel [45]. Thus, the HSV model deals effectively with lighting changes or uneven lighting in the orchard image, which is difficult to achieve in RGB color space. Converting RGB color space to HSV follows the calculations below [45]:

$$H = \begin{cases} 0^{\circ}, & \Delta = 0 \\ 60^{\circ} \times \left( \dfrac{G' - B'}{\Delta} \bmod 6 \right), & C_{\max} = R' \\ 60^{\circ} \times \left( \dfrac{B' - R'}{\Delta} + 2 \right), & C_{\max} = G' \\ 60^{\circ} \times \left( \dfrac{R' - G'}{\Delta} + 4 \right), & C_{\max} = B' \end{cases}$$

$$S = \begin{cases} 0, & C_{\max} = 0 \\ \dfrac{\Delta}{C_{\max}}, & C_{\max} \neq 0 \end{cases}$$

$$V = C_{\max} \tag{4}$$

where we have the following:

$$R' = \frac{r}{255}, \quad G' = \frac{g}{255}, \quad B' = \frac{b}{255}, \quad C_{\max} = \max(R', G', B'), \quad C_{\min} = \min(R', G', B'), \quad \Delta = C_{\max} - C_{\min}$$

where r, g and b are the red, green and blue channels in RGB color space; R′, G′ and B′ are the normalized red, green and blue channels; H represents the type of color; S indicates the degree of color saturation; and V is the value of brightness. Figure 5 is an example of the variation of spectral components at the same location (marked with a red dotted line) under different lighting conditions. Figure 5b shows that the G component in the RGB color model changes sharply, while the H and S components in the HSV color space are stable. Therefore, the hue (H) and saturation (S) are selected as the color features for non-green background removal.
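The RGB-to-HSV conversion described above can be sketched in NumPy. This is a minimal illustration that assumes the standard conversion formulas, with hue expressed in degrees and saturation and value scaled to [0, 1]; the function name `rgb_to_hsv` is hypothetical and not taken from the study.

```python
import numpy as np

def rgb_to_hsv(rgb):
    """Convert an (..., 3) array of 8-bit RGB values to HSV.

    Returns H in degrees [0, 360) and S, V in [0, 1].
    Assumption: standard textbook conversion, not the study's own code.
    """
    rgb = np.asarray(rgb, dtype=np.float64) / 255.0   # normalized R', G', B'
    r, g, b = rgb[..., 0], rgb[..., 1], rgb[..., 2]
    c_max = rgb.max(axis=-1)
    c_min = rgb.min(axis=-1)
    delta = c_max - c_min

    # Replace zero denominators by 1 so the divisions below never fail;
    # those entries are overridden by the explicit zero cases.
    safe_delta = np.where(delta > 0, delta, 1.0)
    safe_max = np.where(c_max > 0, c_max, 1.0)

    h = np.select(
        [delta == 0, c_max == r, c_max == g],
        [
            0.0,
            60.0 * (((g - b) / safe_delta) % 6.0),
            60.0 * ((b - r) / safe_delta + 2.0),
        ],
        default=60.0 * ((r - g) / safe_delta + 4.0),
    )
    s = np.where(c_max > 0, delta / safe_max, 0.0)
    v = c_max
    return h, s, v
```

For example, `rgb_to_hsv(np.array([0, 255, 0]))` gives a hue of 120 degrees with full saturation and value, which is the region of HSV space that green vegetation tends to occupy.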

Brightness and Height Feature Crossing
Weeds have a green color similar to that of the canopy; thus, they cannot be removed from the image by using color features only. Weeds are annual herbaceous plants usually growing to about 0.5 m tall, while tree crops can reach 2.5-4.0 m. However, this significant height difference may not be sufficient to distinguish them when the weeds and the tree canopy come into contact and overlap in the image. Therefore, the target region can be divided into two parts by plant height: the upper canopy area and the overlapping area between the canopy and the weeds. In the upper canopy area, it is easy to distinguish the canopy from weeds based on height. In the overlapping area, the distinction can be based on light intensity. The light intensity gradually weakens from the upper crown layer to the lower layer [46]. Thus, the lower crown layer is mainly in shadow due to insufficient light, while the weed area is in the sun (Figure 6). This study selected the brightness distribution in the vertical direction of the image as the feature to distinguish between the canopy and weeds. The height and brightness features can be extracted from the vertical pixel coordinates of the image and the V component of the HSV space, respectively. Pixel coordinates represent the position of pixels in the image. For an image with height H, each pixel has a height $Y_j$ ($j = 1, 2, \ldots, H$). The V component was obtained by Equation (4).
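As a rough sketch of how the height and brightness features might be assembled, the snippet below builds a per-pixel feature matrix (H, S, V, Y) and a vertical brightness profile. It assumes that Y is the 1-based row index counted from the top of the image, which the text does not state explicitly; all function names are hypothetical.

```python
import numpy as np

def height_feature(n_rows, n_cols):
    """Per-pixel height Y: the row index (1..n_rows) repeated across columns.

    Assumption: rows are counted from the top of the image.
    """
    rows = np.arange(1, n_rows + 1, dtype=np.float64)
    return np.tile(rows[:, None], (1, n_cols))

def build_features(h, s, v):
    """Stack H, S, V and Y planes into an (n_pixels, 4) feature matrix."""
    y = height_feature(*h.shape)
    return np.stack([h, s, v, y], axis=-1).reshape(-1, 4)

def vertical_brightness_profile(v):
    """Mean brightness of each image row, from top to bottom."""
    return v.mean(axis=1)
```

In an image where the lower crown is shaded and the weeds are sunlit, the profile returned by `vertical_brightness_profile` would be expected to dip across the lower crown rows and rise again in the weed rows.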

Mahalanobis Distance Computation
Mahalanobis distance is a distance criterion that assigns each pixel to a prediction group, i.e., tree crown or background, by measuring pixel similarity [47]. This study used Mahalanobis distance rather than other measures, e.g., Euclidean distance, because Mahalanobis distance considers the correlation between features. The Mahalanobis distance classified the canopy and background pixels by measuring the feature similarity between the original image and the template. The calculation follows two steps:
(a) Computing mean vectors and covariance matrix: The mean vector is the average value of the feature and corresponds to the centroid of the data distribution. The feature is a four-dimensional vector (H, S, V and Y) extracted from the template and sample images. The mean vectors are calculated as follows:

$$\mu = \frac{1}{n} \sum_{i=1}^{n} f_i, \qquad f_i = (H_i, S_i, V_i, Y_i)^{T} \tag{5}$$

where H, S and V are the hue, saturation and brightness components of HSV color space, respectively; Y is the pixel height; n is the number of pixels; $i = 1, 2, \ldots, n$; and f is a vector composed of H, S, V and Y.
The covariance matrix is a square, symmetric matrix containing the variances and covariances of the components of feature f (H, S, V and Y). The covariance between two variables is computed as follows:

$$\mathrm{Cov}(f_a, f_b) = \frac{1}{n - 1} \sum_{i=1}^{n} \left( f_{a,i} - \mu_a \right) \left( f_{b,i} - \mu_b \right) \tag{7}$$

where $f_a$ and $f_b$ are a pair of variables drawn from the four components (H, S, V, Y); $\mu_a$ and $\mu_b$ are the corresponding entries of the mean vector obtained by Equation (5); and n is the number of pixels.
(b) Computing the Mahalanobis distance: The Mahalanobis distance assigns each pixel to one of two groups, each described by its own mean vector and covariance matrix. Its formula, Equation (8), is as follows:

$$D_M(f) = \sqrt{(f - \mu)^{T} \, \mathrm{Cov}^{-1} \, (f - \mu)} \tag{8}$$

where f is the four-dimensional vector containing the H, S, V and Y values of each pixel; μ is the mean vector calculated by Equation (5); and Cov is the covariance matrix calculated by Equation (7).
Figure 7 exemplifies the Mahalanobis distance calculation results. Figure 7b is the Mahalanobis distance obtained from the H and S features only, where the non-green background was removed based on color features but the weeds remained in the image. Figure 7c is the Mahalanobis distance based on the H, S, V and Y features. In Figure 7, the Mahalanobis distance of the canopy regions is small, so the gray value is low and the color is close to black. Where the background has a low similarity to the canopy, the Mahalanobis distance is larger and the color is close to white.
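The two steps above can be sketched as follows. This is a minimal NumPy illustration rather than the study's implementation: it assumes the sample covariance (normalized by n - 1, the default of `np.cov`) and uses a pseudo-inverse to stay stable if the covariance matrix is near-singular; the function names are hypothetical.

```python
import numpy as np

def fit_class_statistics(features):
    """Mean vector and covariance matrix of an (n_pixels, d) feature matrix."""
    mu = features.mean(axis=0)
    cov = np.cov(features, rowvar=False)  # assumption: n - 1 normalization
    return mu, cov

def mahalanobis_distance(features, mu, cov):
    """Mahalanobis distance of every row of `features` to (mu, cov)."""
    diff = features - mu
    cov_inv = np.linalg.pinv(cov)  # pseudo-inverse guards against singular cov
    d2 = np.einsum("ni,ij,nj->n", diff, cov_inv, diff)
    return np.sqrt(np.maximum(d2, 0.0))  # clip tiny negative rounding errors
```

In use, `fit_class_statistics` would be applied to the (H, S, V, Y) features of the template pixels, and `mahalanobis_distance` to the features of every pixel of the image to be segmented; small distances indicate pixels that resemble the template canopy.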

Conditional Random Field for Image Segmentation
After the pre-classification of the image based on Mahalanobis distance, this section discusses conditional random field (CRF) modeling for image segmentation.

Energy Function Construction
A conditional random field (CRF) is a conditional probability distribution model of a set of output random variables given a set of input random variables [48]. In the image segmentation task, the CRF treats pixels or pixel features as random input variables with a probability distribution and pixel labels as output variables. If the random variables $Y = (y_1, y_2, \ldots, y_n)$ obey the Markov property, the distribution of Y constitutes a conditional random field. Each pixel i is assigned a corresponding label $y_i$ through the observable variables $X = (x_1, x_2, \ldots, x_n)$ in this random field. Figure 8 illustrates the overall operation of the CRF model. The specific steps are as follows:
(1) Model construction: establishing the mapping relationship between X and Y through the conditional probability distribution P(Y|X). In the fully connected conditional random field model, P(Y|X) is expressed in the form of a Gibbs distribution:

$$P(Y \mid X) = \frac{1}{Z} \exp\left( -E(Y \mid X) \right) \tag{9}$$

where X indicates the feature set f, and Y corresponds to the class labels, $Y \in \{L_1, L_2\}$. $L_1$ represents the tree crown, and $L_2$ is the background. Z is a normalization term that ensures the distribution P sums to 1 and is defined as follows:

$$Z = \sum_{Y} \exp\left( -E(Y \mid X) \right) \tag{10}$$

where E(Y|X) denotes the energy function.
(2) Energy function minimization: The CRF aims to find the output Y with the maximum conditional probability P(Y|X). According to Equation (9), maximizing the conditional probability is equivalent to minimizing the energy, which can be expressed as follows:

$$y^{*} = \underset{y}{\arg\min} \; E(Y \mid X) \tag{11}$$

where y* is the labeling that minimizes the energy function E(Y|X). E(Y|X) consists of two types of potential energy, unary potentials and pairwise potentials:

$$E(Y \mid X) = \sum_{i} \psi_u(y_i) + \sum_{i < j} \psi_p(y_i, y_j) \tag{12}$$

where $\psi_u(y_i)$ is the unary potential for the probability of pixel i taking the label $y_i$, denoting the pixel's local information; $\psi_p(y_i, y_j)$ is the pairwise potential, representing the label class similarity relationship between nearby pixels i and j, including inter-pixel global information; and $i, j \in \{1, 2, \ldots, N\}$ are the pixel indices. Equation (12) shows that the unary and pairwise potential functions are the crux of conditional random field modeling. Their respective definitions follow.
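To make the link between the Gibbs distribution and energy minimization concrete, the sketch below enumerates every labeling of a tiny three-pixel problem. It is illustrative only: exhaustive enumeration is feasible only for a handful of pixels, the pairwise term assumes a simple Potts-style penalty (the weight is paid only when two labels differ), and all names and numbers are hypothetical.

```python
import itertools
import numpy as np

def energy(labels, unary, kernel):
    """Total energy: unary terms plus a pairwise penalty when labels differ.

    unary:  (N, L) array of unary potentials.
    kernel: (N, N) symmetric array of pairwise weights.
    """
    n = len(labels)
    total = sum(unary[i, labels[i]] for i in range(n))
    for i in range(n):
        for j in range(i + 1, n):
            if labels[i] != labels[j]:  # assumed Potts-style compatibility
                total += kernel[i, j]
    return float(total)

def gibbs_distribution(unary, kernel):
    """Enumerate all labelings and return (labelings, energies, probabilities)."""
    n, n_labels = unary.shape
    labelings = list(itertools.product(range(n_labels), repeat=n))
    energies = np.array([energy(y, unary, kernel) for y in labelings])
    weights = np.exp(-energies)
    return labelings, energies, weights / weights.sum()  # division by Z
```

Because the probability is a decreasing function of the energy, the labeling with the lowest energy is also the one with the highest probability, which is why inference can be posed as energy minimization.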
(3) The unary potential construction: The unary potential is derived from the probability that a pixel obtains the corresponding label, indicating the category information of the current observation point. The study employed the Mahalanobis distance classifier results described in Section 3.2 to construct the unary potential energy. The unary potential takes the negative logarithm of this probability, which provides a unified framework for energy minimization:

$$\psi_u(y_i) = -\log P_M(y_i) \tag{13}$$

where $P_M(y_i)$ is the label assignment probability for each pixel given by the Mahalanobis distance classifier, which is calculated by Equation (8). The smaller the Mahalanobis distance, the greater the probability that the pixel is assigned to the canopy category. When the probability that pixel i takes the label $y_i$ is large, the unary potential and the energy are small.
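The text does not spell out how a Mahalanobis distance is turned into the probability $P_M(y_i)$, so the sketch below makes an explicit assumption: each class is scored with a Gaussian-style likelihood exp(-d²/2) and the two scores are normalized to sum to one. The function name and this mapping are hypothetical, not the study's method.

```python
import numpy as np

def unary_from_distances(d_canopy, d_background, eps=1e-12):
    """Unary potentials -log P_M from per-pixel Mahalanobis distances.

    Assumption: P_M is a softmax over -d^2 / 2 for the two classes.
    Returns (unary, probs), both of shape (n_pixels, 2), where column 0
    is the canopy label and column 1 is the background label.
    """
    d_canopy = np.asarray(d_canopy, dtype=np.float64)
    d_background = np.asarray(d_background, dtype=np.float64)
    scores = np.stack([-0.5 * d_canopy**2, -0.5 * d_background**2], axis=-1)
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    probs = np.exp(scores)
    probs /= probs.sum(axis=-1, keepdims=True)
    unary = -np.log(np.clip(probs, eps, 1.0))     # clip avoids log(0)
    return unary, probs
```

Under this assumption, a pixel close to the canopy statistics and far from the background statistics receives a small canopy potential and a large background potential, matching the behavior described above.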
(4) The pairwise potential computation: The pairwise potentials act as constraints on the final label assignment. Their goal is to assign nearby pixels with similar characteristics to the same category. The penalty for giving two pixels different labels grows with the similarity of their features, thereby restricting the classifier's misclassification behavior. The general form of the pairwise potential function is a linear combination of Gaussian kernel functions:

$$\psi_p(y_i, y_j) = u(y_i, y_j) \sum_{m=1}^{M} \omega^{(m)} K^{(m)}(f_i, f_j) \tag{14}$$

where $u(y_i, y_j)$ is a constant symmetric label compatibility function between the labels $y_i$ and $y_j$ that penalizes similar pixels with different class labels. When the classifier assigns different labels to adjacent pixels, the greater the difference between the pixel features, the smaller the penalty, which is consistent with Gibbs energy minimization. Moreover, $\omega^{(m)}$ is the weight of the m-th kernel; $m = 1, 2, \ldots, M$ indexes the kernels $K^{(m)}$; $K^{(m)}(f_i, f_j)$ is the kernel potential function on the feature vectors; and $f_i$ and $f_j$ are the feature vectors of pixels i and j, respectively. This study used two Gaussian kernels to construct K, which is primarily composed of the pixels' spectral and distance information (M = 2):

$$K(f_i, f_j) = \omega^{(1)} \exp\left( -\frac{\lVert p_i - p_j \rVert^2}{2\theta_\alpha^2} - \frac{\lVert C_i - C_j \rVert^2}{2\theta_\beta^2} \right) + \omega^{(2)} \exp\left( -\frac{\lVert p_i - p_j \rVert^2}{2\theta_\gamma^2} \right) \tag{15}$$

where the first term is an appearance kernel based on RGB color and distance information; C is a three-dimensional vector composed of the R, G and B components; p is a two-dimensional position vector composed of the vertical and horizontal coordinates; and $C_i$ and $C_j$ are the color vectors of the pixels at positions $p_i$ and $p_j$. The second term is a smoothness kernel used to remove small isolated areas; $\omega^{(1)}$ and $\omega^{(2)}$ are the weights of the two kernels; and $\theta_\alpha$, $\theta_\beta$ and $\theta_\gamma$ are the parameters of the Gaussian kernels.
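The two-kernel pairwise term can be sketched as a dense matrix over a small set of pixels. This is a minimal illustration assuming the usual fully connected CRF form (an appearance kernel on position and color plus a smoothness kernel on position only); the default weights and bandwidths are placeholder values chosen for the example, not parameters reported by the study, and the explicit N × N matrix is only practical for small N.

```python
import numpy as np

def pairwise_kernel(pos, color, w1=1.0, w2=1.0,
                    theta_alpha=60.0, theta_beta=10.0, theta_gamma=3.0):
    """Dense (N, N) kernel matrix: appearance kernel + smoothness kernel.

    pos:   (N, 2) pixel positions (row, column).
    color: (N, 3) RGB color vectors.
    All default parameter values are hypothetical placeholders.
    """
    pos = np.asarray(pos, dtype=np.float64)
    color = np.asarray(color, dtype=np.float64)
    # Squared Euclidean distances between every pair of pixels.
    dp2 = ((pos[:, None, :] - pos[None, :, :]) ** 2).sum(axis=-1)
    dc2 = ((color[:, None, :] - color[None, :, :]) ** 2).sum(axis=-1)

    appearance = w1 * np.exp(-dp2 / (2 * theta_alpha**2)
                             - dc2 / (2 * theta_beta**2))
    smoothness = w2 * np.exp(-dp2 / (2 * theta_gamma**2))
    k = appearance + smoothness
    np.fill_diagonal(k, 0.0)  # a pixel does not interact with itself
    return k
```

Two neighboring pixels with nearly identical color get a kernel value close to `w1 + w2`, so labeling them differently is expensive, whereas a distant pixel of a very different color contributes almost nothing.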

CRF Inference
Mean-field approximation is an efficient inference method that approximates the conditional probability distribution P(Y) with a simpler distribution Q(Y), thereby simplifying the calculation process [49]. The mean-field approximation is calculated with Equation (16):

$$Q(Y) = \prod_i Q_i(y_i) \qquad (16)$$

where $Q_i(y_i)$ is the independent marginal distribution of the random variable $y_i$; that is, Q(Y) is assumed to be the product of multiple independent distributions. To make the distribution Q(Y) approximate the true distribution P(Y), this study used the Kullback-Leibler (KL) distance as a metric, which is defined as follows:

$$D(Q \,\|\, P) = \sum_Y Q(Y) \log \frac{Q(Y)}{P(Y)}$$

where D is the KL distance between the Q distribution and the P distribution. The Q distribution is obtained by minimizing the KL distance, which also serves as the convergence criterion. Each iteration includes five steps: message passing, weighting the filter outputs, compatibility transformation, adding the unary potentials and normalizing the probabilities. Liu has introduced this iteration process in detail [50].

To validate the proposed method's crown segmentation performance on cherry trees, this study compared it with the K-means clustering algorithm, Convolutional Neural Networks (CNN) and the GrabCut algorithm, which are widely used in tree image segmentation [51][52][53].

K-means is an unsupervised machine learning algorithm that does not require labeled datasets or model training [54]. To extract crowns with the K-means clustering algorithm, the elbow method was employed to determine the optimal number of clusters (k) by computing the sum of squared errors (SSE). Figure 9a shows the relationship between k and SSE. As k increases from small values, the SSE drops sharply; beyond a certain point, the SSE no longer changes significantly as k continues to increase. Therefore, the k value at the bend of the curve identifies the optimal number of clusters.

CNN is a supervised machine learning algorithm that relies on labeled data and model training. DeepLabV3+ is one of the best CNN-based semantic segmentation models at present [55]. It improves the Xception network and adopts an encoder-decoder structure, thereby optimizing boundary details by restoring the low-level features (Figure 9d). DeepLabV3+ retains the Atrous Spatial Pyramid Pooling (ASPP) module to acquire multi-scale information. To segment canopies using DeepLabV3+, this study created 500 single-channel labeled images (see Figure 9c for an example), of which 400 were used for training, 50 for validation and 50 for testing. The model was trained on Ubuntu 18.04 with an NVIDIA RTX 2080 Ti GPU.

GrabCut is an interactive algorithm that requires user interaction to implement image segmentation [56]. GrabCut segments images by creating a new pixel distribution that is close to the foreground's pixel distribution. The foreground is the area inside the red bounding box, which is manually drawn by experts with image processing experience (see Figure 9b).
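Returning to the CRF inference step, the mean-field iteration can be sketched on a toy example. This is a naive O(N²) version for a handful of pixels, not the efficient filtering-based procedure referenced in [50]; it assumes a Potts-style compatibility function and a precomputed kernel matrix in which the kernel weights are already folded in. The names `softmax` and `mean_field`, and the five-pixel example, are hypothetical.

```python
import numpy as np

def softmax(x):
    """Row-wise normalization of negative energies into probabilities."""
    x = x - x.max(axis=1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=1, keepdims=True)

def mean_field(unary, kernel, n_iters=10):
    """Naive mean-field inference for a fully connected CRF (Potts compatibility).

    unary:  (N, L) unary energies, e.g. -log P_M(y_i).
    kernel: (N, N) weighted pairwise kernel values; the diagonal is ignored.
    Returns Q, an (N, L) array of approximate marginals.
    """
    unary = np.asarray(unary, dtype=float)
    kernel = np.array(kernel, dtype=float)   # copy so the caller's array is untouched
    np.fill_diagonal(kernel, 0.0)            # a pixel sends no message to itself
    q = softmax(-unary)                      # initialize from the unary term
    for _ in range(n_iters):
        msg = kernel @ q                     # message passing with weighted kernels
        # Potts compatibility: label l is penalized by the mass on all other labels.
        pairwise = msg.sum(axis=1, keepdims=True) - msg
        q = softmax(-unary - pairwise)       # add unary potentials and normalize
    return q

# Toy row of five pixels; column 0 = canopy, column 1 = background.
positions = np.arange(5, dtype=float)
kernel = np.exp(-(positions[:, None] - positions[None, :]) ** 2 / (2 * 1.0 ** 2))
p_canopy = np.array([0.9, 0.9, 0.4, 0.9, 0.9])   # middle pixel is initially misclassified
unary = -np.log(np.stack([p_canopy, 1.0 - p_canopy], axis=1))
initial_labels = unary.argmin(axis=1)
q = mean_field(unary, kernel)
labels = q.argmax(axis=1)
```

In this example, the middle pixel starts with the background label because its canopy probability is 0.4, and the messages from its canopy neighbors pull it to the canopy label after inference. The synchronous update used here is a simplification; it is meant only to show how the five steps fit together.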

Evaluation Indices
Three performance measures, namely Precision (P), Recall (R) and F1-score (F1), are introduced to evaluate the cherry canopy segmentation results. P is the percentage of extracted pixels that are true canopy pixels, indicating the segmentation accuracy. R is the percentage of true canopy pixels that are extracted, measuring the segmentation completeness. F1 is the harmonic mean of P and R, representing a global metric of canopy segmentation accuracy. The actual tree area was manually counted using Adobe Photoshop 2018. These metrics are defined as follows:

$$P = \frac{TP}{TP + FP}, \qquad R = \frac{TP}{TP + FN}, \qquad F1 = \frac{2 \times P \times R}{P + R}$$

where TP is the number of canopy pixels correctly identified by the segmentation algorithm, FP is the number of background pixels that are misidentified as trees, and FN is the number of tree pixels that are misidentified as background. Higher values of these three metrics indicate better performance of the segmentation method.
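The three metrics can be computed directly from a predicted mask and a ground-truth mask. The sketch below assumes binary masks in which 1 marks canopy pixels and uses the standard TP/FP/FN definitions of precision, recall and F1; the function name `segmentation_scores` and the toy masks are illustrative only.

```python
import numpy as np

def segmentation_scores(pred, truth):
    """Return (precision, recall, f1) for binary canopy masks."""
    pred = np.asarray(pred, dtype=bool)
    truth = np.asarray(truth, dtype=bool)
    tp = int(np.sum(pred & truth))     # canopy pixels correctly extracted
    fp = int(np.sum(pred & ~truth))    # background pixels labeled as canopy
    fn = int(np.sum(~pred & truth))    # canopy pixels labeled as background
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Toy 2x4 masks: 3 true positives, 1 false positive, 1 false negative.
truth = [[1, 1, 0, 0],
         [1, 1, 0, 0]]
pred = [[1, 1, 1, 0],
        [1, 0, 0, 0]]
precision, recall, f1 = segmentation_scores(pred, truth)
```

For these toy masks, TP = 3, FP = 1 and FN = 1, so all three scores equal 0.75.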

Segmentation Results of the Four Competing Methods
To evaluate the performance of the proposed method, the study selected 200 images with complex backgrounds, including bare soil, weeds, sidewalks, houses, plastic films and shelters, taken under different lighting conditions. Figure 10 shows original images of cherry trees in four rows, taken under different weather conditions (sunny or cloudy) and with different weed densities (high or low), together with the segmentation results of the four competing methods. From left to right are the original image, the result of the proposed method, the K-means algorithm's result, the DeepLabV3+ algorithm's result, the GrabCut algorithm's result and the ground truth.
Figure 10 shows that the K-means algorithm failed to discriminate the canopy from the background. Lighting conditions have a greater impact on the K-means algorithm than weed density, as this method over-segmented all shaded areas as tree crowns. The remaining three methods performed well in canopy identification and were robust under different light conditions and weed densities. However, the DeepLabV3+ results are not satisfactory in crown branch recognition, because local information is lost in convolution and upsampling. The overall segmentation results of the GrabCut method are better than those of the K-means and DeepLabV3+ algorithms. However, the GrabCut method lost many image details, resulting in overly smooth tree-crown edges. This experiment shows that the proposed method could accurately identify tree crowns and retain more image details than the other algorithms.
Table 1 shows the average segmentation results and the computational cost for the 200 test images using different segmentation methods. The average P, R and F1 values of K-means are 58.1%, 79.7% and 68.9%, respectively. The segmentation accuracy of DeepLabV3+ is higher than that of K-means, with the average P increased by 24.3% and the average R reduced by 5.8%. The GrabCut algorithm's average P, R and F1 values have increased by 4.4%, 6.5% and 5.7% compared with K-means and 28.2%, 0.6% and 14.9% compared with DeepLabV3+. The results also show that the proposed method performs better than the remaining three methods, with average P, R and F1 values of 92.1%, 94.5% and 93.3%, respectively.
The K-means algorithm has the lowest computational cost and does not require labeled data or model training. In contrast, the DeepLabV3+ algorithm needs extensive model training and image annotation. GrabCut requires labeling all test images and therefore takes the longest time. Hence, the proposed method has a higher accuracy rate than the traditional unsupervised algorithm and a lower computational cost than the interactive and supervised algorithms.

Performance Results under Different Overlapping Conditions and at Different Day Times
Different shooting angles may lead to two types of image samples, i.e., crowns heavily overlapping with weeds, and crowns and weeds barely touching each other. Moreover, orchard images taken at different times of the day may have different exposure. For instance, images captured at noon, under strong sunlight, suffer from overexposure. This section compares the segmentation results of images taken under different overlapping conditions and at different times of the day. Figure 11 exemplifies cherry-tree images taken under a slightly overlapping condition (Figure 11a), a partially overlapping condition (Figure 11b) and a highly overlapping condition (Figure 11c), or in the morning (Figure 11d), at noon (Figure 11e) and in the evening (Figure 11f).

Figure 11 shows that the tree crown was accurately extracted using the proposed method. Although the proposed method missed some treetop leaves under high-exposure conditions (see Figure 11e) and lost some crown details under low-exposure conditions (see Figure 11d), the extraction results were relatively accurate. Table 2 lists the proposed segmentation method's average P, R and F1 values for 100 sample images. The average P, R and F1 values are 93.2%, 93.5% and 93.4%, respectively, with an accuracy rate over 90%. On the other hand, the average P value decreases significantly in the test set when the canopy and weeds overlap heavily. Hence, the degree of overlap between the canopy and weeds affects the segmentation accuracy. Meanwhile, the average R values drop in the test set under both overexposure and underexposure conditions. Thus, the time of image acquisition mainly affects the segmentation completeness. These findings support the proposed method's effectiveness in tree-crown recognition under different overlapping conditions and at different times of the day.

Segmentation Results in Different Years and Seasons
This section analyzes the proposed method's effectiveness in different years and seasons, and the image segmentation results are shown in Figure 12. Figure 12a shows images and their segmentation results from spring and summer 2018. Figure 12b displays images taken in spring and summer 2019. Figure 12c only contains images from autumn 2020, as the COVID-19 pandemic interrupted the image acquisition [57]. The highlighted patches show the image acquisition data. The test results on these datasets show that the proposed method can segment images taken in different seasons and years. To carry out quantitative verification, this study analyzed a total of 150 sets of tree images that were taken continuously in 2018, 2019 and 2020. Figure 13 shows that the season has a greater impact on the segmentation effect than the growth year. The average F1 value in spring is higher than in other seasons because the images taken in spring have brighter colors and fewer weeds. Images taken in autumn have the lowest segmentation accuracy and completeness because the canopy characteristics change in autumn, with altered leaf color and a sparser crown. Overall, the average P, R and F1 values in 2018, 2019 and 2020 are consistently above 87.7%. The results indicate that the proposed method is robust in canopy recognition across different years and seasons.

Conclusions
This study aimed to extract cherry-tree crowns accurately from complex backgrounds. The proposed method has three stages: first, computing the Mahalanobis distance from the image's color, brightness and location features to acquire an initial classification of the canopy and background; second, creating a conditional random field, using the Mahalanobis distance calculations as the unary potential energy and a Gaussian kernel function based on the image color and pixel distance as the binary potential energy; and, finally, completing the image segmentation using mean-field approximation. In comparison with the other methods, the proposed method has the highest average P, R and F1-score values, i.e., 92.1%, 94.5% and 93.3%, respectively, which were 34%, 14.8% and 24.2% higher than those of the traditional unsupervised K-means algorithm. Compared with the GrabCut interactive segmentation algorithm and the DeepLabV3+ deep learning algorithm, the proposed method has lower image annotation and model training costs. The study also verified the feasibility and validity of the proposed method under different overlapping conditions, at different times of image acquisition, and in different years and seasons. The results indicate that the overlapping conditions mainly affect the accuracy of the algorithm, whereas the image acquisition time affects the completeness of the segmentation. The results also demonstrate that the season has a greater impact on the segmentation effect than the growth year. In short, the proposed method can withstand different environmental conditions, with overall average P, R and F1 values higher than 87.7%. This study shows that computer vision technology has great potential in crop identification. Future work will test the proposed method on other orchard tree crops and study new techniques that do not require data labeling.

Agriculture 2021 ,Figure 1 .
Figure 1. Challenges of crown segmentation in the unstructured environment: (a) sky, land and cover films; (b) house; (c) uneven light distribution within the canopy images; and (d) weeds.

Figure 2. Test site and illustrations of images under different lighting conditions and weed densities: (a) cherry orchard in Tongzhou, Beijing; (b) cloudy day with high-density weeds; and (c) sunny day with normal-density weeds.

Figure 3. Image annotation schematic: (a) original canopy image, (b) standard image and (c) ground truth image.

Figure 4 is a flowchart of the proposed method: feature extraction, Mahalanobis distance computation and conditional random field (CRF) building. All algorithms were developed in MathWorks MATLAB R2018a and Python 3.6 on a PC equipped with an Intel® Core™ i7-6700 central processing unit (CPU) and 16 GB of random-access memory (RAM).

Figure 4. Overall flowchart of the proposed method.

Figure 5. Influence of lighting conditions on spectral components: (a) an image of a fruit tree with fading light from left to right and (b) the curves of green, hue and saturation under different lighting conditions.

Figure 6. Illustrations of the brightness and height distribution of the canopy under different lighting conditions. The yellow dotted lines indicate the mingled areas, and the color bar indicates the brightness value. From left to right are the crown- and weed-height distribution, the original image and the brightness image; from top to bottom are the sunny fruit-tree images and the cloudy fruit-tree images.

Figure 7. Examples of Mahalanobis distance computing: (a) original image; (b) three-dimensional image of Mahalanobis distance based on H and S features; and (c) three-dimensional image of Mahalanobis distance based on H, S, V and Y features.

Figure 8. The CRF operation chart. X is the feature sequence, Y is the label sequence, L1 indicates the tree-crown class, and L2 indicates the background class. R, G and B are the red, green and blue channels in RGB color space; w and h represent the vertical and horizontal coordinates of the image, respectively.

Figure 9. Illustrations of the competing segmentation methods. (a) Relationship between k and SSE. (b) Image foreground labeling. (c) Labeled data for model training. (d) The structure of the DeepLabV3+ network.

Figure 10. Comparison of different segmentation results: (a) sunny day and low-density weeds; (b) sunny day and high-density weeds; (c) cloudy day and low-density weeds; and (d) cloudy day and high-density weeds. From left to right are the original image, the result of the proposed method, the K-means algorithm's result, the DeepLabV3+ algorithm's result, the GrabCut algorithm's result and the ground truth.

Figure 11. Comparison of segmentation results under different overlapping conditions and at different times of the day: (a) samples under a slightly overlapping condition; (b) samples under a partially overlapping condition; (c) samples under a highly overlapping condition; (d) samples collected in the morning; (e) samples collected at noon; and (f) samples collected in the evening.

Figure 12. Segmentation results of the proposed method in different seasons and years: (a) samples from spring and summer 2018; (b) samples from spring and summer 2019; and (c) samples from autumn 2020.

Figure 13. Average statistics of the proposed method for 200 sample images.

Table 1. Average results for 200 images using different algorithms.

Table 2. Average statistics for 100 images under different overlapping conditions and at different times of the day.