Real-Time Counting and Height Measurement of Nursery Seedlings Based on Ghostnet–YoloV4 Network and Binocular Vision Technology

Abstract: Traditional nursery seedling detection often relies on manual sampling counts and height measurement with rulers. This is not only inefficient and inaccurate, but also requires considerable human resources for nurseries that need to monitor sapling growth, making it difficult to meet the fast and efficient management requirements of modern forestry. To solve this problem, this paper proposes a real-time seedling detection framework based on an improved YoloV4 network and a binocular camera, which can quickly and efficiently provide real-time measurements of the height and number of saplings in a nursery. The methodology is as follows: (i) creating a training dataset using binocular-camera field photography and data augmentation; (ii) replacing the backbone network of YoloV4 with Ghostnet and replacing the normal convolutional blocks of PANet in YoloV4 with depthwise separable convolutional blocks, which allows the improved Ghostnet–YoloV4 network to maintain efficient feature extraction while massively reducing the number of operations for real-time counting; (iii) integrating binocular vision technology into neural network detection to perform real-time height measurement of saplings; and (iv) making corresponding parameter and equipment adjustments based on the specific morphology of the various saplings, and adding comparative experiments to enhance generalisability. The results of field testing on nursery saplings show that the method is effective in overcoming noise in a large field environment, meets the load-carrying capacity of embedded mobile devices with low-configuration management systems in real time and achieves over 92% accuracy in both counting and measurement. These results can provide technical support for the precise cultivation of nursery saplings.


Introduction
The relatively new worldwide trend of 'precision forestry' refers to the use of high-tech sensors and analytical tools to support site-specific forest management for the conservation and use of forest resources [1,2]. According to McKinsey & Company Research, precision forestry plays an important role in nursery and forest management, forestry fees, timber delivery and value chains [3]. The global precision forestry market is projected to be worth USD 6.1 billion by 2024 [4] and has become an important industry in China.
Afforestation and reforestation operations constitute an important part of forest management [5], and the quality of seedlings produced by nurseries is related to the survival rate of planted trees, so it is crucial to advance research on nursery techniques. The number of seedlings per unit area is a common indicator used to characterise seedling production. The rapid and accurate identification of saplings and the detection of the number of saplings per unit area play an important role not only in estimating production, but also in breeding and plant phenotyping. The height of a sapling is not only a reflection of its current growth status and the conditions required for its cultivation, but also determines its production trend and yield size as a cash crop. The traditional methods of obtaining these indicators are manual sampling counts and ruler measurements, which are inefficient and labour intensive. One related vision-based study identified flower species and localised them with the left camera of a binocular system, completing real-time positioning and ranging of flowers at a detection frame rate of 16 FPS, with an average absolute error of 18.1 mm for flower-centre positioning and a maximum positioning error of 25.8 mm under different light radiation conditions.
The above studies provide an insight into the detection of forest trees. However, the application of these techniques to nurseries with complex backgrounds and smaller individual saplings for target counting and height measurement still suffers from inaccuracies and is time consuming and labour intensive [33]. This study proposes the use of the Ghostnet-YoloV4 network and binocular vision technology to solve this type of problem in order to obtain the number and height of target saplings in real time, as well as to investigate whether the improved network can have a better detection capability to satisfy the rudimentary management equipment of most small nurseries with a lower computing power and the use of inexpensive binocular cameras. Field tests were carried out in nurseries to check that the method achieves the practical requirements of sapling counting and height measurement and enables intelligent mechanical operation at minimal cost. The short-term aim is to use the research results to help nursery staff reduce the burden of manually counting saplings and measuring height, while the long-term goal is to improve the effectiveness of forestry machinery automation and lay the foundation for intelligent forestry management.

Process of Nursery Sapling Detection Based on Ghostnet-YoloV4 Network and Binocular Cameras
As shown in Figure 1, the original left images of various saplings were first obtained using binocular cameras, labelled and then augmented to produce a dataset, which was fed into the improved neural network for training to obtain detection weights. At the same time, the binocular camera was stereo-calibrated to obtain its internal parameters; once the neural network can detect the saplings in the still image and the block-matching (BM) stereo algorithm integrated into the network can acquire the depth image, the nursery can be surveyed for real-time counting and height measurement of the saplings. The number and height of saplings acquired in real time are displayed directly in the image, and more detailed information on the location and height of each sapling can be seen in the PyCharm output window.
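Under the standard pinhole-camera model, the height computation that this pipeline describes can be sketched as follows. The function name and the example focal length, baseline and disparity values are illustrative assumptions, not the paper's calibration results:

```python
def sapling_height_px_to_m(y_top, y_bottom, disparity_px, focal_px, baseline_m):
    """Estimate real-world sapling height from a detection box and stereo disparity.

    Standard pinhole triangulation: depth Z = f * B / d, and a vertical extent of
    (y_bottom - y_top) pixels at depth Z spans (pixels * Z / f) metres.
    """
    if disparity_px <= 0:
        raise ValueError("disparity must be positive")
    depth_m = focal_px * baseline_m / disparity_px   # distance to the sapling
    return (y_bottom - y_top) * depth_m / focal_px   # height in metres

# Hypothetical example: box from y = 80 to y = 400 px, disparity 64 px,
# focal length 800 px, baseline 0.12 m -> height 0.6 m at a depth of 1.5 m.
h = sapling_height_px_to_m(80, 400, 64, 800.0, 0.12)
```

Because the labelled box edges are aligned with the sapling's top and bottom (see the labelling notes below), the box height in pixels can stand in directly for the sapling's vertical extent.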

Data Collection
The original experimental data were images of nursery saplings taken with the left camera of the binocular rig, each from a different angle. Considering the effects of complex natural backgrounds and camera parameters, some lower-quality images were excluded during data pre-processing. The original images of the saplings are shown in Figure 2. Images of (a) large, (b) medium and (c) small spruce (the Latin name is Picea asperata Mast; large, medium and small indicate body size at different growth periods), (d) Mongolian Scots pine (the Latin name is Pinus sylvestris var. mongholica Litv) and (e) Manchurian ash (the Latin name is Fraxinus mandshurica Rupr) saplings were collected from Xiaoling Town Forestry, Acheng District, Harbin, Heilongjiang Province, China, on 6 May 2022 from 13:30 to 16:30. Using the binocular camera, 500 images (640 × 480 pixels, 100-130 KB each) were taken of each sample type, totalling 2500 original images; 1500 images remained after screening. The PASCAL VOC dataset format was used for this study, and the label files (XML format) were created manually with the 'LabelImg' labelling tool.
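Since LabelImg writes PASCAL VOC annotations as XML, the boxes can be read back with the Python standard library. The annotation below is a hypothetical example in the VOC layout, not an actual file from the dataset:

```python
import xml.etree.ElementTree as ET

# Hypothetical VOC annotation as LabelImg would produce it.
VOC_XML = """<annotation>
  <filename>spruce_0001.jpg</filename>
  <size><width>640</width><height>480</height><depth>3</depth></size>
  <object>
    <name>spruce</name>
    <bndbox><xmin>120</xmin><ymin>60</ymin><xmax>210</xmax><ymax>410</ymax></bndbox>
  </object>
</annotation>"""

def parse_voc(xml_text):
    """Return a list of (class_name, (xmin, ymin, xmax, ymax)) tuples."""
    root = ET.fromstring(xml_text)
    boxes = []
    for obj in root.iter("object"):
        bb = obj.find("bndbox")
        boxes.append((obj.findtext("name"),
                      tuple(int(bb.findtext(k)) for k in ("xmin", "ymin", "xmax", "ymax"))))
    return boxes

boxes = parse_voc(VOC_XML)   # [('spruce', (120, 60, 210, 410))]
```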


Data Augmentation
YoloV4 [34] includes built-in data augmentation functions, such as Mosaic, which enrich the variety of objects and backgrounds, greatly improving performance and mitigating the poor detection of small objects during model training. However, for the YOLO series, a larger dataset for training and validation yields weights that better fit the data and improves detection [35,36]. To prevent overfitting, this study used 15 methods to expand the original data, including flipping, adding noise, cropping, rotating, stretching and adjusting brightness, as shown in Table 1. The label files were expanded simultaneously with the images, and the expanded images and labels were divided into a training set and a validation set at a ratio of 9:1, giving a training set of 22,500 images and a validation set of 2500 images. Testing of the trained model was carried out directly at the nursery through the real-time detection of saplings, and the correctness of the method was evaluated by comparing manual counts with system counts to verify its timeliness and robustness.
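When images are flipped, the label files must be transformed with them. A minimal sketch of the box arithmetic for a horizontal flip (coordinates in the VOC xmin/ymin/xmax/ymax convention; the example box and image width are hypothetical):

```python
def hflip_boxes(boxes, img_width):
    """Mirror VOC boxes (xmin, ymin, xmax, ymax) to match a horizontally
    flipped image: the new xmin is the old xmax reflected about the width,
    so the box stays well-formed (xmin < xmax)."""
    return [(img_width - xmax, ymin, img_width - xmin, ymax)
            for (xmin, ymin, xmax, ymax) in boxes]

# A box at x = 120..210 in a 640 px wide image lands at x = 430..520 after the flip.
flipped = hflip_boxes([(120, 60, 210, 410)], 640)   # [(430, 60, 520, 410)]
```

The same pattern applies to the other geometric augmentations (rotation, stretching, cropping), each with its own coordinate transform; photometric augmentations such as noise and brightness leave the boxes unchanged.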


Experimental Architecture
The experiments were conducted on Windows 10 using the PyTorch 1.7.1 deep learning framework. The development platform was PyCharm 2020.2.3 with Python 3.6.5. The binocular camera was calibrated with MATLAB 2021a. The hardware configuration was as follows: a WN-L2110.K350L binocular camera; a tripod mount and checkerboard calibration board; an Intel Core i7-10710U @ 1.1 GHz six-core processor; 16 GB of memory; an NVIDIA GeForce MX350 graphics card (2 GB video memory and 8 GB virtual memory); and a 512 GB hard drive. As this study required a close fit of the detection box to the sapling, the upper and lower edges of the labelled rectangle were aligned with the top and bottom points of the sapling during labelling, and the confidence threshold for network training and detection was set high to improve the fit of the detection box during detection.
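The high confidence threshold mentioned above amounts to a simple filter over the network's raw detections. A minimal sketch, in which the threshold value and the detection record format are assumptions (the paper does not specify either):

```python
def filter_detections(detections, conf_thresh=0.9):
    """Keep only detections at or above the confidence threshold.

    A high threshold trades recall for reliable, tightly fitting boxes,
    which matters here because the box edges are read off as the sapling's
    top and bottom for the height measurement.
    """
    return [d for d in detections if d["conf"] >= conf_thresh]

# Hypothetical raw output: the low-confidence box is discarded.
dets = [{"label": "spruce", "conf": 0.95, "box": (120, 60, 210, 410)},
        {"label": "spruce", "conf": 0.48, "box": (300, 90, 360, 380)}]
kept = filter_detections(dets)
```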

Ghostnet-YoloV4 Network Architecture
YoloV4 is essentially a large, speed-conscious CNN that converts the detection problem into a regression problem, and various optimised versions are under development [34]. The CSPDarknet53 backbone of the YoloV4 network performs the initial feature extraction on the input image; the SPP module pools at multiple scales to enlarge the receptive field; the enhanced feature-extraction network PANet then performs more expressive feature fusion; and finally, the Yolo head uses the extracted features for detection. Although real-time, high-quality and convincing target detection can be achieved on a single GPU, the network structure is complex, and the number of parameters and operations is high, which presents a considerable burden for embedded devices [16]. This study, therefore, introduced Ghostnet into the YoloV4 network to maintain feature extraction while reducing the network's computational burden.
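The scale of the saving from swapping PANet's normal convolutions for depthwise separable ones (as noted in the abstract) can be checked with a quick weight count; the layer sizes below are illustrative, not taken from the paper:

```python
def conv_params(k, c_in, c_out):
    """Weights in a standard k x k convolution (biases ignored)."""
    return k * k * c_in * c_out

def dw_separable_params(k, c_in, c_out):
    """Weights in the depthwise separable replacement: a k x k depthwise
    convolution (one filter per input channel) followed by a 1 x 1
    pointwise convolution that mixes channels."""
    return k * k * c_in + c_in * c_out

std = conv_params(3, 256, 256)           # 589,824 weights
sep = dw_separable_params(3, 256, 256)   # 67,840 weights, ~8.7x fewer
```

The operation count scales the same way, which is why this substitution shrinks the PANet stage so sharply for embedded deployment.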

Ghostnet
Ghostnet is a lightweight neural network architecture proposed by researchers at Huawei in a paper published in June 2020 [37]. The network is designed to ease the difficult deployment of convolutional neural networks on embedded devices; Ghostnet outperforms Google's MobileNetV3 and Facebook's FBNet across the board. Instead of normal convolution, the authors propose a novel Ghost module, as shown in Figure 3, that uses fewer parameters to generate more feature maps: the module applies a series of cheap linear transformations to generate these redundant feature maps, reducing the amount of computation.
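Following the Ghost module description in [37], its weight count can be compared with an ordinary convolution. The kernel sizes and the ratio s = 2 below are common defaults from that paper, assumed here purely for illustration:

```python
def ghost_params(c_in, n_out, k=1, d=3, s=2):
    """Approximate weight count of a Ghost module: a primary k x k
    convolution produces n_out // s intrinsic feature maps, then cheap
    d x d depthwise operations generate the remaining (s - 1) 'ghost'
    maps from each intrinsic map."""
    m = n_out // s
    return k * k * c_in * m + d * d * m * (s - 1)

plain = 1 * 1 * 256 * 256    # 65,536 weights for an ordinary 1x1 conv, 256 -> 256
ghost = ghost_params(256, 256)   # 33,920 weights, roughly half
```

With s = 2, half of the output maps come from the cheap depthwise branch, so both weights and operations drop by roughly the factor s, which is the source of Ghostnet's efficiency.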

Experimental Architecture
The experiments were conducted on Windows 10 using the pytorch 1.7.1 deep learning framework. The experimental platform was PyCharm 2020.2.3 with built-in python version 3.6.5. The binocular camera calibration software was Matlab 2021a. The hardware equipment configuration was as follows: binocular camera model: WN-L2110.K350L; triangular mount and checkerboard calibration board; Intel Core i7-10710U@1.1 GHz hexacore computer processor; 16 GB memory; NVDIA GeForce MX350 graphics card (2 GB video memory and 8 GB virtual memory); 512 GB hard drive. As this study required a close fit of the detection frame to the sapling, the upper and lower edges of the labelled rectangular frame were overlapped with the vertices and bottom points of the sapling during labelling; the confidence threshold for network training and detection was set to high to improve the fit of the detection frame to the sapling during detection.

Ghostnet-YoloV4 Network Architecture
YoloV4 is essentially a large CNN network that is more speed conscious, converting the detection problem into a regression problem, and various optimized versions are under development [34]. The main function of the CSPDaeknet53 of the YoloV4 network is to perform the initial feature extraction of the input image; the SPP is used to build a bottom-up feature pyramid to improve the perceptual field; the enhanced feature extraction network PANet is then used to perform more expressive feature fusion; and finally, Yolo-Head uses the extracted features for detection. Although real-time, high-quality and convincing target detection results can be applied on a single GPU, a network structure is more complex, and the number of parameters and operations is high, which presents a considerable burden for embedded devices [16]. This study, therefore, introduced Ghostnet into the YOLOV4 network to enhance feature extraction while reducing the network's transport burden.

Ghostnet
Ghostnet is a new lightweight neural network architecture proposed by a Huawei researcher in a paper published in June 2020 [37]. The abstract of the paper states that the network is designed to help solve the problem of the highly difficult deployment of con-

Experimental Architecture
The experiments were conducted on Windows 10 using the pytorch 1.7.1 deep learning framework. The experimental platform was PyCharm 2020.2.3 with built-in python version 3.6.5. The binocular camera calibration software was Matlab 2021a. The hardware equipment configuration was as follows: binocular camera model: WN-L2110.K350L; triangular mount and checkerboard calibration board; Intel Core i7-10710U@1.1 GHz hexacore computer processor; 16 GB memory; NVDIA GeForce MX350 graphics card (2 GB video memory and 8 GB virtual memory); 512 GB hard drive. As this study required a close fit of the detection frame to the sapling, the upper and lower edges of the labelled rectangular frame were overlapped with the vertices and bottom points of the sapling during labelling; the confidence threshold for network training and detection was set to high to improve the fit of the detection frame to the sapling during detection.

Ghostnet-YoloV4 Network Architecture
YoloV4 is essentially a large CNN network that is more speed conscious, converting the detection problem into a regression problem, and various optimized versions are under development [34]. The main function of the CSPDaeknet53 of the YoloV4 network is to perform the initial feature extraction of the input image; the SPP is used to build a bottom-up feature pyramid to improve the perceptual field; the enhanced feature extraction network PANet is then used to perform more expressive feature fusion; and finally, Yolo-Head uses the extracted features for detection. Although real-time, high-quality and convincing target detection results can be applied on a single GPU, a network structure is more complex, and the number of parameters and operations is high, which presents a considerable burden for embedded devices [16]. This study, therefore, introduced Ghostnet into the YOLOV4 network to enhance feature extraction while reducing the network's transport burden.

Ghostnet
Ghostnet is a new lightweight neural network architecture proposed by a Huawei researcher in a paper published in June 2020 [37]. The abstract of the paper states that the network is designed to help solve the problem of the highly difficult deployment of convolutional neural networks in embedded devices; Ghostnet networks outperform Google MobileNetV3 and Facebook's FBNet across the board. Instead of normal convolution, the authors propose a novel Ghost module, as shown in Figure 3, that uses fewer parameters to generate more feature maps. The Ghost module applies a series of linear transformations to generate these redundant feature maps, reducing the amount of computation

Experimental Architecture
The experiments were conducted on Windows 10 using the pytorch 1.7.1 deep learning framework. The experimental platform was PyCharm 2020.2.3 with built-in python version 3.6.5. The binocular camera calibration software was Matlab 2021a. The hardware equipment configuration was as follows: binocular camera model: WN-L2110.K350L; triangular mount and checkerboard calibration board; Intel Core i7-10710U@1.1 GHz hexacore computer processor; 16 GB memory; NVDIA GeForce MX350 graphics card (2 GB video memory and 8 GB virtual memory); 512 GB hard drive. As this study required a close fit of the detection frame to the sapling, the upper and lower edges of the labelled rectangular frame were overlapped with the vertices and bottom points of the sapling during labelling; the confidence threshold for network training and detection was set to high to improve the fit of the detection frame to the sapling during detection.

Ghostnet-YoloV4 Network Architecture
YoloV4 is essentially a large CNN network that is more speed conscious, converting the detection problem into a regression problem, and various optimized versions are under development [34]. The main function of the CSPDaeknet53 of the YoloV4 network is to perform the initial feature extraction of the input image; the SPP is used to build a bottom-up feature pyramid to improve the perceptual field; the enhanced feature extraction network PANet is then used to perform more expressive feature fusion; and finally, Yolo-Head uses the extracted features for detection. Although real-time, high-quality and convincing target detection results can be applied on a single GPU, a network structure is more complex, and the number of parameters and operations is high, which presents a considerable burden for embedded devices [16]. This study, therefore, introduced Ghostnet into the YOLOV4 network to enhance feature extraction while reducing the network's transport burden.

Ghostnet
Ghostnet is a new lightweight neural network architecture proposed by a Huawei researcher in a paper published in June 2020 [37]. The abstract of the paper states that the network is designed to help solve the problem of the highly difficult deployment of con-

Experimental Architecture
The experiments were conducted on Windows 10 using the pytorch 1.7.1 deep learning framework. The experimental platform was PyCharm 2020.2.3 with built-in python version 3.6.5. The binocular camera calibration software was Matlab 2021a. The hardware equipment configuration was as follows: binocular camera model: WN-L2110.K350L; triangular mount and checkerboard calibration board; Intel Core i7-10710U@1.1 GHz hexacore computer processor; 16 GB memory; NVDIA GeForce MX350 graphics card (2 GB video memory and 8 GB virtual memory); 512 GB hard drive. As this study required a close fit of the detection frame to the sapling, the upper and lower edges of the labelled rectangular frame were overlapped with the vertices and bottom points of the sapling during labelling; the confidence threshold for network training and detection was set to high to improve the fit of the detection frame to the sapling during detection.

Ghostnet-YoloV4 Network Architecture
YoloV4 is essentially a large CNN network that is more speed conscious, converting the detection problem into a regression problem, and various optimized versions are under development [34]. The main function of the CSPDaeknet53 of the YoloV4 network is to perform the initial feature extraction of the input image; the SPP is used to build a bottom-up feature pyramid to improve the perceptual field; the enhanced feature extraction network PANet is then used to perform more expressive feature fusion; and finally, Yolo-Head uses the extracted features for detection. Although real-time, high-quality and convincing target detection results can be applied on a single GPU, a network structure is more complex, and the number of parameters and operations is high, which presents a considerable burden for embedded devices [16]. This study, therefore, introduced Ghostnet into the YOLOV4 network to enhance feature extraction while reducing the network's transport burden.

Ghostnet
Ghostnet is a new lightweight neural network architecture proposed by a Huawei researcher in a paper published in June 2020 [37]. The abstract of the paper states that the network is designed to help solve the problem of the highly difficult deployment of con-

Experimental Architecture
The experiments were conducted on Windows 10 using the pytorch 1.7.1 deep learning framework. The experimental platform was PyCharm 2020.2.3 with built-in python version 3.6.5. The binocular camera calibration software was Matlab 2021a. The hardware equipment configuration was as follows: binocular camera model: WN-L2110.K350L; triangular mount and checkerboard calibration board; Intel Core i7-10710U@1.1 GHz hexacore computer processor; 16 GB memory; NVDIA GeForce MX350 graphics card (2 GB video memory and 8 GB virtual memory); 512 GB hard drive. As this study required a close fit of the detection frame to the sapling, the upper and lower edges of the labelled rectangular frame were overlapped with the vertices and bottom points of the sapling during labelling; the confidence threshold for network training and detection was set to high to improve the fit of the detection frame to the sapling during detection.

Ghostnet-YoloV4 Network Architecture
YoloV4 is essentially a large CNN network that is more speed conscious, converting the detection problem into a regression problem, and various optimized versions are under development [34]. The main function of the CSPDaeknet53 of the YoloV4 network is to perform the initial feature extraction of the input image; the SPP is used to build a bottom-up feature pyramid to improve the perceptual field; the enhanced feature extraction network PANet is then used to perform more expressive feature fusion; and finally, Yolo-Head uses the extracted features for detection. Although real-time, high-quality and convincing target detection results can be applied on a single GPU, a network structure is more complex, and the number of parameters and operations is high, which presents a considerable burden for embedded devices [16]. This study, therefore, introduced Ghostnet into the YOLOV4 network to enhance feature extraction while reducing the network's transport burden.

Ghostnet
Ghostnet is a lightweight neural network architecture proposed by Huawei researchers in a paper published in June 2020 [37]. The network is designed to address the difficulty of deploying convolutional neural networks on embedded devices; Ghostnet outperforms Google's MobileNetV3 and Facebook's FBNet across the board. Instead of ordinary convolution, the authors propose a novel Ghost module, as shown in Figure 3, that uses fewer parameters to generate more feature maps. The Ghost module applies a series of cheap linear transformations to generate these redundant feature maps, reducing the computation incurred by some convolution operations. This operation generates several Ghost feature maps that extract the required information from the original features at a small cost. The Ghost module is plug-and-play and can be stacked to produce the Ghost bottleneck, from which the lightweight neural network Ghostnet is built. The authors conducted comparative experiments on the ImageNet classification dataset and showed that the network can perform fast inference on mobile devices. This study, therefore, introduced Ghostnet into YoloV4 for deployment on laptops for real-time detection.
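The parameter saving behind the Ghost module can be illustrated with a back-of-the-envelope count. The sketch below is ours, not the paper's code; the channel counts, kernel sizes and ratio s = 2 are illustrative assumptions.

```python
def conv_params(c_in, c_out, k):
    """Parameters of an ordinary k x k convolution (bias ignored)."""
    return c_in * c_out * k * k

def ghost_module_params(c_in, c_out, k=3, d=3, s=2):
    """Parameters of a Ghost module: a primary convolution producing
    c_out/s intrinsic feature maps, followed by cheap d x d depthwise
    linear transformations generating the remaining 'ghost' maps."""
    intrinsic = c_out // s
    primary = c_in * intrinsic * k * k      # ordinary convolution part
    cheap = intrinsic * (s - 1) * d * d     # cheap linear operations
    return primary + cheap

# With illustrative 256-channel layers, the Ghost module needs roughly
# half the parameters of the ordinary 3 x 3 convolution (a factor of ~s).
print(conv_params(256, 256, 3))       # 589824
print(ghost_module_params(256, 256))  # 296064
```

The near-factor-of-s reduction is the reason stacking Ghost bottlenecks yields a backbone light enough for embedded deployment.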

Ghostnet-YoloV4 Improvement Method
The main function of YoloV4's CSPDarknet53 is to perform a series of convolution operations on the input image to complete the initial feature extraction and obtain feature maps. The main function of Ghostnet is to bypass some of these convolution operations and use cheaper linear operations to obtain feature maps. Our improvement, therefore, replaces CSPDarknet53 with Ghostnet for the initial feature extraction of the input image, which maintains similar recognition performance while reducing the computational cost of the ordinary convolutional layers.
CSPDarknet53 ultimately outputs three effective feature layers, whose feature map height, width and number of channels are 52 × 52 × 256, 26 × 26 × 512 and 13 × 13 × 1024, respectively, for the subsequent construction of the enhanced feature extraction network. Directly replacing CSPDarknet53 with Ghostnet to process the input image would produce feature maps whose height, width and number of channels do not match the inputs expected by SPP and PANet. Therefore, we segmented the Ghostnet sequential model constructed from Ghost bottlenecks by deriving the list of Ghostnet's cfgs parameters, identified the stages whose output feature maps have the heights and widths required by SPP and PANet, and finally took these out as the input feature maps for SPP and PANet.
The cfgs parameter table is shown in Table 2, where k represents the convolutional kernel size, indicating the feature extraction capability across feature points; t represents the number of channels of the first Ghost module; c represents the final output channel count of the bottleneck structure; SE indicates whether the attention mechanism is used (SE > 0 indicates use); and s represents the stride: if it is 2, the height and width of the incoming feature layer are compressed. The output column gives the derived height, width and number of channels of the output feature map after each stage. The image first changes from 416 × 416 × 3 to 208 × 208 × 16 and then enters the feature extraction stages in the table.
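The shape bookkeeping in this paragraph can be checked with a few lines of arithmetic. The stride sequence below is a hypothetical stand-in for the s column of Table 2; only the stride-2 stages change the spatial size.

```python
def spatial_sizes(input_hw, strides):
    """Trace the feature-map height/width through a stack of bottleneck
    stages; a stride-2 stage halves the spatial size."""
    sizes = []
    hw = input_hw
    for s in strides:
        hw //= s
        sizes.append(hw)
    return sizes

# A 416 x 416 input is first compressed to 208 x 208 by the stem, then
# four hypothetical stride-2 stages produce the familiar pyramid.
print(spatial_sizes(416 // 2, [2, 2, 2, 2]))  # [104, 52, 26, 13]
```

The 52, 26 and 13 entries are exactly the spatial sizes SPP and PANet expect, which is how the matching stages are identified.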

As we can see from the table, the heights and widths of the output feature maps of stage 3, stage 4 and stage 5 match those of the final output feature maps of CSPDarknet53. After Ghostnet extracted these three feature layers, which were passed into YoloV4 as the initial feature extraction output, we also needed to modify the number of input channels of the convolution operations used in SPP and PANet. This was carried out by making each input channel count equal to the corresponding output channel count, so that the number of output channels of Ghostnet as the backbone feature extraction network matched the number of input channels of the subsequent enhanced feature extraction network. At this point, the modification introducing Ghostnet into YoloV4 was complete.

PANet Improvements
The main role of YoloV4's enhanced feature extraction network PANet is to perform feature fusion on the three initial effective feature layers. In this way, better features are extracted and three more effective feature layers are obtained, resulting in a higher detection accuracy for the detection network YoloHead. The parameters of PANet are mostly concentrated in its ordinary 3 × 3 convolutions. Therefore, to further reduce the number of parameters, we used depthwise separable convolutions [38,39] to replace all the ordinary 3 × 3 convolutions used in PANet. We used the summary function to traverse the entire network structure and, based on its output, derived the Ghostnet-YoloV4 network structure diagram with reference to the YoloV4 network, as shown in Figure 4.
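The saving from this replacement follows from a simple parameter count. The sketch below uses illustrative channel counts and is not the paper's code.

```python
def ordinary_conv3x3_params(c_in, c_out):
    """Parameters of an ordinary 3 x 3 convolution (bias ignored)."""
    return c_in * c_out * 3 * 3

def depthwise_separable_params(c_in, c_out):
    """Depthwise 3 x 3 (one filter per input channel) + pointwise 1 x 1."""
    return c_in * 3 * 3 + c_in * c_out

# Illustrative 512-channel layer: roughly a 9x parameter reduction.
print(ordinary_conv3x3_params(512, 512))     # 2359296
print(depthwise_separable_params(512, 512))  # 266752
```

Because PANet concentrates most of its parameters in these 3 × 3 convolutions, the substitution accounts for a large share of the overall reduction reported in Table 3.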

The same approach as above was used to introduce Mobilenetv3 into YoloV4 to compare the number of parameters. The summary function was used to calculate the total number of network parameters for the two different networks before and after the PANet modification. The totals are shown in Table 3, where (1) represents the original YoloV4 network; (2) represents Mobilenetv3-YoloV4, in which Mobilenetv3 replaces the YoloV4 backbone; (3) represents Ghostnet-YoloV4, in which Ghostnet replaces the YoloV4 backbone; (4) represents Mobilenetv3-YoloV4 with both the backbone replacement and the PANet modification; and (5) represents Ghostnet-YoloV4 with both the backbone replacement and the PANet modification. The number of network parameters was significantly reduced. Compared with the Mobilenetv3-YoloV4 network, the Ghostnet-YoloV4 network has the smallest number of parameters, thus eliminating many unnecessary calculations in the repetitive training and detection process.

Principle of Binocular Stereo Vision for Height Measurement
The height of a sapling is the length from the root neck of the sapling to the tip of the main trunk, and this length at production is defined as the height of the sapling. As shown in Figure 5, point P is the apex of the sapling, and point Q is the point of contact between the rootstock and the ground, so the length L of PQ is the height of the sapling.
The 3D coordinates of objects in the actual scene can be determined using binocular stereo vision techniques [32,40]. For 3D reconstruction using binocular cameras, we first obtained the depth information of the current binocular view and then performed a conversion from the camera coordinate system to the world coordinate system to obtain the 3D coordinates of the object.
As shown in Figure 6, the left and right eye cameras were fixed at points B and E; point P is the actual point to be measured; C and C′ are the imaging points of point P in the binocular images, and the distance between the two points is d; the distance BE between the left and right cameras is the baseline distance m; GE is the focal length f of the binocular camera; and the depth of point P is D. The depth information D can then be determined according to the triangular relationship D = (f × m)/d.
The distance d is the parallax of the binocular view and was calculated using the camera calibration parameters and a stereo matching algorithm. In this paper, the classical BM stereo matching algorithm was used to obtain the parallax depth map of the binocular image, and the left camera optical centre was used as the origin for the 3D reconstruction, which in turn yielded the 3D coordinates of the vertex P (X1, Y1, Z1) and the base point Q (X2, Y2, Z2) of the sapling, so that the tree height L was L = √((X1 − X2)² + (Y1 − Y2)² + (Z1 − Z2)²).
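The two relationships above can be written as a short sketch. The focal length, baseline and parallax values in the example are illustrative, not measurements from the paper.

```python
import math

def depth_from_disparity(f, m, d):
    """Depth from the triangular relationship D = f * m / d, where f is
    the focal length (pixels), m the baseline and d the parallax."""
    return f * m / d

def sapling_height(p, q):
    """Height L as the Euclidean distance between the 3D vertex P and
    the 3D base point Q of the sapling."""
    return math.dist(p, q)

# Illustrative values: a 700 px focal length, 6 cm baseline and 14 px
# parallax place the point about 3 m away.
print(depth_from_disparity(700, 0.06, 14))               # ~3.0 m
print(sapling_height((0.1, 0.0, 3.0), (0.1, 1.2, 3.0)))  # 1.2 m
```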

Binocular Camera Calibration
In this study, the Camera Calibrator toolbox of Matlab was used for calibration. More than 40 images of the checkerboard grid were taken with the binocular camera and transferred to Matlab to obtain the calibration parameters for the binocular camera and fill in the parameter file in PyCharm. During calibration, clicking on the error histogram to remove poorly angled checkerboard images reduces the calibration error. The parameters to be filled in were the left and right camera intrinsic parameters, the radial and tangential distortions of the left and right cameras, the rotation matrix of camera 2 with respect to camera 1, the translation matrix of camera 2 with respect to camera 1 and the image size. The intrinsic parameters were used to correct distortion in the captured binocular images, to obtain a rectified image pair (in which the corresponding pixels of the binocular images lie on the same line) and to apply it to the depth map in stereo matching.

BM Stereo Matching Algorithm
BM stands for block matching. It works by dividing the frames of the two cameras into a number of small blocks for matching: each block is shifted to find the matching block in the other image, locating the pixel positions of the corresponding blocks, and the relationship between the two cameras (the translation and rotation matrices from the calibration parameters) is then used to calculate the actual depth of the object and generate the corresponding depth map. The BM algorithm is extremely fast, giving it a great speed advantage in real-time stereo matching for binocular cameras. The main working steps of the BM algorithm are described below.
(1) Image acquisition: images are captured by the left and right eye cameras of the binocular camera at the same time, rectified according to the camera parameters and distortion coefficients, and then passed to the pre-filtering step.
(2) Pre-filtering: the window of the image filter is shifted across the acquired images to normalise the brightness of each part of the image while highlighting the relevant details and enhancing the texture, so that features in the image can be extracted more easily.
(3) Matching: this is the most critical step of the BM algorithm, which finds the best-matching pairs of points in the left and right images, i.e., for each pixel in the left (right) image, the best-matching pixel in the right (left) image is found with the maximum correctness.
(4) Post-filtering: the purpose of post-filtering is to retain high-quality matching points and to remove incorrect matching points to the maximum extent.
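The matching step can be sketched as a toy scanline block matcher. This is a minimal sum-of-absolute-differences (SAD) search for illustration only, not OpenCV's StereoBM; the window radius and disparity range are arbitrary choices.

```python
def sad(left, right, y, xl, xr, w):
    """SAD cost between two (2w+1) x (2w+1) windows centred on row y."""
    return sum(
        abs(left[y + dy][xl + dx] - right[y + dy][xr + dx])
        for dy in range(-w, w + 1)
        for dx in range(-w, w + 1)
    )

def disparity_at(left, right, y, x, max_disp, w=1):
    """Best disparity for left-image pixel (y, x): slide the window along
    the same scanline of the right image and keep the minimum SAD."""
    best_d, best_cost = 0, float("inf")
    for d in range(max_disp + 1):
        if x - d < w:  # window would fall off the image
            break
        cost = sad(left, right, y, x, x - d, w)
        if cost < best_cost:
            best_d, best_cost = d, cost
    return best_d

# Synthetic pair: the right view is the left view shifted 2 px, so every
# interior pixel should report a disparity of 2.
import random
random.seed(0)
H, W, SHIFT = 8, 16, 2
left = [[random.randint(0, 255) for _ in range(W)] for _ in range(H)]
right = [[left[y][min(x + SHIFT, W - 1)] for x in range(W)] for y in range(H)]
print(disparity_at(left, right, 4, 8, max_disp=4))
```

The real BM implementation adds the pre- and post-filtering steps described above around exactly this kind of windowed search.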

Introduction of the YoloHead Method for Binocular Vision Technology
In order to achieve both the counting and the height measurement of saplings, this study integrated the binocular vision code into YoloHead, the detection part of YoloV4. The basic principle is as follows. The left view of each frame was used as the image to be detected and fed into the Ghostnet-YoloV4 network, where the current sapling category and count were quickly obtained and displayed in the top left corner of the output window. At the same time, each binocular image pair was stereo matched using the BM algorithm to obtain a depth map and complete the 3D reconstruction, giving the 3D position of each pixel in the current scene, which was stored via the points_3d function. The 2D coordinates of the detection frame of each sapling in the left image were then taken out; the vertex and base point of the height-fitted detection frame were taken as the vertex and base point of the sapling and converted to the corresponding 3D coordinates using the points_3d function; and finally, the sapling height was obtained. To handle the small number of saplings whose key points fit the detection frame poorly, or that had fallen over or were leaning, the manual selection of sapling key points was introduced: a mouse click event was added to the depth map, and the heights of these saplings were found in real time by clicking on their top and bottom points in the depth image.
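The box-to-height step can be sketched as follows. Here the per-pixel 3D coordinates are modelled as a plain (H, W, 3) array (our stand-in for the paper's points_3d function), and the (x1, y1, x2, y2) box layout is an assumption.

```python
import numpy as np

def height_from_bbox(points_3d, bbox):
    """Sapling height from a detection box.

    points_3d : (H, W, 3) array of per-pixel 3D coordinates reconstructed
                from the depth map.
    bbox      : (x1, y1, x2, y2) box whose top-centre pixel stands in for
                the vertex P and bottom-centre pixel for the base point Q.
    """
    x1, y1, x2, y2 = bbox
    cx = (x1 + x2) // 2
    p = points_3d[y1, cx]  # vertex P
    q = points_3d[y2, cx]  # base point Q
    return float(np.linalg.norm(p - q))

# Toy scene: a fronto-parallel plane 3 m away with 2 mm per pixel, so the
# height reduces to the vertical extent of the box (300 px -> 0.6 m).
H, W = 480, 640
ys, xs = np.mgrid[0:H, 0:W]
pts = np.stack([xs * 0.002, ys * 0.002, np.full((H, W), 3.0)], axis=-1)
print(round(height_from_bbox(pts, (300, 100, 340, 400)), 3))
```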

Output Window Design
For real-time inspection, three windows were designed to display the results. The depth window shows the 3D depth image of the current scene; the left window shows the centre of each sapling in the current scene and the corresponding height; and the video window shows the current detection frames and the number of saplings. Two parameters were placed on the depth window adjustment bar: 'num' is the difference between the maximum and minimum disparity values, while 'blocksize' is the matching block size. Both parameters have a large impact on the depth map, and adjusting them can significantly improve its display and accuracy when the distance to the target varies. For real-time detection, the height of the tripod and the angle of the binocular camera are adjusted so that the saplings optimally fit the field of view, and the two parameters above the depth image are tuned until the block outline in the current scene is complete and resembles a sapling.
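For reference, OpenCV's StereoBM requires 'num' (numDisparities) to be a positive multiple of 16 and 'blocksize' an odd value in the range [5, 255]. A small sketch of how raw trackbar readings might be clamped before being passed to the matcher (the function name is ours):

```python
def sanitize_bm_params(num_raw, block_raw):
    """Clamp raw trackbar values to what StereoBM accepts:
    'num' -> positive multiple of 16, 'blocksize' -> odd value in [5, 255]."""
    num = max(16, (num_raw // 16) * 16)
    block = min(255, max(5, block_raw | 1))  # |1 forces the value odd
    return num, block
```

The sanitized pair can then be applied with `cv2.StereoBM_create(numDisparities=num, blockSize=block)` each time a trackbar moves.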

Results Statistics Method
Manual testing was carried out with a research team of five people, and the average results of the five people were compared with the computerized testing results to verify the accuracy of the method. As the effective depth of the binocular camera for this experiment was 3-10 m, and the nursery field was large, we could not extract the number and height of saplings from the whole field at once. Therefore, for each type of spruce sapling, the experiment was set up to detect and count at three different points in the field.

Training Parameters and Results
The Ghostnet-YoloV4 network parameters and training results are shown in Table 4. The original detection image size was 640 × 480, but each image was automatically scaled to 416 × 416 before being fed into the network for training and detection. The mAP values and recall rates were not particularly high due to the use of data augmentation. We spent a total of nearly 120 h training on 25,000 images, resulting in a training loss of 0.35. We achieved a frame rate of 15 FPS, which meets the requirements for real-time detection.

Presentation of Sapling Detection Results
As shown in Figure 7, the spruce saplings varied in colour, size, texture and planting density across the three different growth periods. The colour and texture features helped the network to distinguish saplings from their surroundings, which directly affected the accuracy of the counts. The different sizes and planting densities of the saplings made the depth images of each sapling different. The larger the sapling, the more accurate the height measurement; the sparser the planting, the lower the level of occlusion, and the more accurate the count. From the diagram, we can quickly determine the number and height of saplings of this type.

The results of the real-time inspection of Mongolian scotch pine are displayed in Figure 8, which also shows the complete design of the output window during detection. For the sparsely planted Mongolian scotch pine, no adjustment of the camera shooting angle was required; counting within the effective depth of the binocular camera was virtually error-free, and the height measured by the system was very close to the actual height.
As shown in Figure 9, the slender trunk of the Manchurian ash sapling was relatively large compared to its crown; the extended crowns obscured the Manchurian ash saplings from one another, which resulted in their poor detection. Additionally, the roots of the saplings in the back rows were easily blocked by the saplings nearer the binocular camera, making it difficult to detect the roots of some of the saplings. Moreover, the trunks of the saplings were relatively thin, so the smaller trunks were not visible in the depth image, making it difficult to match the bottom of the detection frame with the bottom of the sapling. In this study, upon adjusting the position of the binocular camera downwards, the roots were exposed as much as possible, while manual intervention, i.e., manually selecting the top and bottom points of severely obscured saplings, improved the detection accuracy for Manchurian ash.
Table 5 demonstrates the detection accuracy for the three different forms of spruce saplings. The table shows the number of spruce saplings and the average sapling height for the three forms; the 3D coordinates of the centre points, obtained simultaneously, could be used to locate the saplings for future operations, such as precise automatic watering and the application of pesticides. For each point, the following measures were calculated: TP indicates the number of true saplings correctly detected as saplings; FP indicates the number of false saplings incorrectly detected as saplings; FN indicates the number of true saplings incorrectly detected or missed; count indicates the average number of saplings counted manually; H indicates the average height of saplings measured by the system in centimetres; and TH indicates the average height of saplings measured manually in centimetres.
For the large spruce saplings, the nurserymen chose to plant them at a higher density in order to ensure their growth rate, and they were counted with 100% accuracy. As the crowns of the large spruce saplings were far from the roots on the ground, it was easy to distinguish between them; and because these saplings were taller, the binocular camera photographed them from the side, so the top and bottom points were selected more accurately and the height measurement accuracy was also higher. The medium and small spruces were photographed diagonally downwards.
The medium spruce was denser, and the shading between saplings had a greater impact on detection, making it easier for two or even more adjacent saplings to be mistakenly detected as one. The small spruce tilted and fell easily and the plants were shorter, making it easier to find the wrong top and bottom points of the saplings when taking height measurements, and resulting in a slightly lower accuracy than the other two spruce forms. However, the small spruce had a lower spacing, so the number of missed detections was lower, but there were slightly more false detections due to the similarity of its form to the surrounding weeds.
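The paper does not spell out its accuracy formula, but given the TP/FP/FN and H/TH quantities defined above, per-point metrics could plausibly be computed as follows (a hedged sketch, not the authors' exact code):

```python
def count_metrics(tp, fp, fn):
    """Precision and recall of the sapling counts from the quantities in Table 5."""
    precision = tp / (tp + fp)   # fraction of detections that were real saplings
    recall = tp / (tp + fn)      # fraction of real saplings that were detected
    return precision, recall

def height_accuracy(h_system, h_manual):
    """Relative agreement (%) between the system height H and manual height TH."""
    return (1.0 - abs(h_system - h_manual) / h_manual) * 100.0
```

For example, 90 correct detections with 5 false and 5 missed saplings give precision and recall of about 94.7% each, in the range of the accuracies reported in Tables 5 and 6.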

Analysis of Test Results
Of the three sapling species, the Mongolian scotch pine showed the best counting results. This is because the Mongolian scotch pine was planted more sparsely, while the spruce and Manchurian ash saplings were planted more densely, so the number of errors and omissions at each detection point for the Mongolian scotch pine was comparatively low. In terms of height measurement, the Mongolian scotch pine saplings were the furthest apart from one another, so their root and crown features were the most pronounced and their height measurements were the most accurate. In contrast, although manual intervention improved the detection accuracy of Manchurian ash, there were still a small number of missed detections, especially for smaller saplings that stood very close to a slightly larger sapling and could easily be detected together as one sapling.
The difference between the system counts and the manual counts was minimal, which shows that the data source was a good fit for the network's detection function and, therefore, that it worked well. The slightly larger difference in the sapling height measurements is due to the errors inherent in the binocular camera and the fact that the roots of some saplings were not detected, which made the gap between the bottom point of the rectangular frame and the bottom point of the sapling too large and ultimately pulled down the average height. During the inspection, the binocular camera was very sensitive to changes in light, which was highlighted by the depth map displays of Manchurian ash and spruce. When comparing tests of medium-sized spruce in shade and in direct sunlight, and comparing tests of Manchurian ash under cloud with Mongolian scotch pine under sun, we found that under good lighting conditions the grey-scale contours of the saplings on the depth image closely matched the actual sapling outlines, which made the height measurements of spruce and Mongolian scotch pine saplings under sunlight more accurate. The reflection of sunlight brought out the colour and texture characteristics of the saplings and allowed more accurate results for spruce and Mongolian scotch pine. In the shade or on cloudy days, the grey-scale contours of the Manchurian ash and medium-sized spruce on the depth images differed less from the background and neighbouring saplings, and the colour and texture characteristics were somewhat weakened, which reduced the accuracy of both the height measurements and the count results. Adequate light made features such as the texture and colour of the saplings more visible, facilitating detection by the network. It is worth noting that, with sufficient light, the rate at which multiple saplings were missed by being detected as one was significantly reduced.
Table 6 shows the overall detection accuracy of the saplings. The Mongolian scotch pine benefitted from its larger spacing, with the highest count accuracy of 96.97% and a high measurement accuracy of 96.55%. Although the Manchurian ash saplings obscured one another front and back, they could still be counted and measured with a high accuracy of over 92%, which could be further improved by combining the system with human intervention.

Network Performance Analysis
In order to verify the performance of the improved network, the following four networks were trained separately and tested for comparison to complete the ablation experiment, with the results shown in Table 7: (1) represents the original YoloV4 network; (2) represents Ghostnet-YoloV4, where Ghostnet is introduced to replace the YoloV4 backbone; (3) represents YoloV4 with the PANet modification only; and (4) represents Ghostnet-YoloV4 with both the Ghostnet backbone replacement and the PANet modification. This experiment used the original dataset of saplings, a total of 1500 images, divided into training and validation sets in a ratio of 8:2, and each of the four neural networks was trained for 400 epochs. Of these, (1) had the longest training time and (4) had the shortest training time as well as the best training results, with an mAP value of 92.93%. Additionally, all four networks reached the maximum real-time frame rate of 15 FPS for this binocular camera. As the amount of training and detection data for the four networks was not very large, a network with better training performance could still yield poorer detection results. In order to control for possible uncertainties of this type, we kept the influence of external factors as low as possible: all variables were kept the same, the detection locations were identical and the four neural networks were run separately for counting and height measurement. Accuracy was still calculated by comparing the computer tests with the manual tests. As can be seen in Table 7, Ghostnet-YoloV4, which introduces Ghostnet to replace the YoloV4 backbone and modifies the PANet, demonstrated the highest accuracy in counting and height measurement.
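The saving from the PANet modification in variants (3) and (4) can be sketched with the usual parameter-counting formulas for standard versus depth-separable convolution (bias terms ignored; the channel sizes below are illustrative):

```python
def conv_params(c_in, c_out, k):
    # parameters of a standard k x k convolution (bias ignored)
    return c_in * c_out * k * k

def dw_separable_params(c_in, c_out, k):
    # a depthwise k x k filter per input channel, then a 1 x 1 pointwise convolution
    return c_in * k * k + c_in * c_out
```

For a typical 3 × 3 layer with 256 input and output channels, the depth-separable version needs 67,840 parameters against 589,824 for the standard convolution, roughly a 9× reduction, which is why variant (3) trains faster than (1).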

Reliability of the Ghostnet-YoloV4 Network
The experimental results show that the Ghostnet-YoloV4 network achieves good accuracy in the real-time counting of all three saplings. This result validates the prediction that using the Ghostnet network and depth-separable convolution to improve YoloV4 not only reduces the network load massively but also yields better detection results. Judging from the real-time frame rate of 15 FPS achieved, the detection speed of the Ghostnet-YoloV4 network is very high. From the above, it is clear that there is no obstacle to deploying the Ghostnet-YoloV4 network on personal computers. It will also be possible to apply the neural network to other mobile devices, such as mobile phones and tablets, in the future, which will greatly enhance the practical and generalisation capabilities of the network and allow it to be applied to more fields of detection.
It is worth noting that Ghostnet-YoloV4 has a much lower number of parameters compared to YoloV4 when training the network, so it is much faster, thereby saving computer training time. This makes sense for practical applications, as for each different tree species, we need to carry out data collection, labelling and network training, and with a very large variety of saplings in the nursery, the training time is particularly important when conducting large-scale tree counts. If the training time is too long, it will cause a reduction in detection capability as the saplings grow and, more importantly, will delay the production process.
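The reduction in trainable parameters follows from the Ghost module's design: a small primary convolution produces the intrinsic feature maps, and cheap depthwise operations generate the rest. A sketch using the standard counting formulas (`s` is the number of ghost groups, `d` the cheap-operation kernel size; bias terms ignored, values illustrative):

```python
def standard_conv_params(c_in, c_out, k):
    # an ordinary k x k convolution producing all c_out channels
    return c_in * c_out * k * k

def ghost_module_params(c_in, c_out, k, d=3, s=2):
    m = c_out // s                  # intrinsic maps from the primary convolution
    primary = c_in * m * k * k      # ordinary convolution on a reduced channel count
    cheap = m * (s - 1) * d * d     # depthwise "cheap" operations for the rest
    return primary + cheap
```

With s = 2 the module needs roughly half the parameters of the standard convolution it replaces, which compounds across the backbone and explains the shorter training time observed for Ghostnet-YoloV4.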

Binocular Camera 3D Reconstruction Capability
This experiment used images of three tree sapling species taken with a binocular camera as the main study dataset, and the binocular views allowed the reconstruction of spatial location information. We chose a binocular camera with a low resolution of only 640 × 480 in consideration of two factors: (1) since the training data comprised 25,000 images, increasing the resolution of each image, and thus the size of the training set, would massively increase the training time, which is unfavourable for both experimental research and field applications; (2) real-time network processing and real-time binocular 3D reconstruction of higher-resolution images would run at inconsistent speeds and cause delays, so lower-resolution images were needed to preserve the real-time effect. Although the lower-resolution images are sufficient to support the extraction of sapling key points and height measurement, the accuracy and generalisability of the system would be improved if this inexpensive system could process high-resolution images quickly. This will become possible as hardware computing power increases and information sources become more abundant. On the one hand, the increase in computing power will allow the system to obtain better colour, texture and depth information for learning and detection, which will certainly improve the accuracy of counting and height measurement; on the other hand, other data sources, such as UAV point clouds [17][18][19][20] and hyperspectral imagery [41][42][43][44][45], can effectively reconstruct the structure of individual tree types and thus aid their detection.
It is clear from the experimental results that the binocular camera can complete a 3D reconstruction of the current scene and generate a depth map. The depth map contains the 3D coordinates of the pixel points, from which we can obtain the vertex and base points of the saplings, as well as the centre point. The vertex and base points were used to calculate the height of the saplings, while the centroids could be used to position them. Compared to studies that estimate tree height using point cloud images from a drone, we have the advantage that height measurements can be carried out for small saplings shorter than 30 cm, and the binocular camera is cheaper and simpler to operate. The centroid location and height estimation of saplings could provide the basic capability to sense, distinguish, measure and locate target objects in future fully automated nursery management, which in turn would enable unmanned cultivation operations, such as automatic watering, fertilization and temperature control.
However, the low cost of the binocular camera dictates the simplicity of its hardware architecture. As a result, compared to sensors such as UAVs and LiDAR, binocular cameras pose larger data errors and require more tedious pre-calibration and other tasks. Binocular cameras are also susceptible to terrain, light and weather [46], which negatively affects subsequent processes, such as height measurement and localisation, and limits the ability of binocular cameras to generalise.
In addition, the binocular camera is unable to measure the height of saplings that are heavily obscured by one another, and will output a height value containing a large error when it cannot access the key points of a sapling. For this reason, we proposed a method of manually selecting key points for height measurement. This method solves the problem of poorly fitting and missed saplings through simple human-machine interaction; in addition, by changing the position of the binocular camera, the height of some heavily occluded saplings can be detected. However, occlusion is still a problem in vision technology, and the accuracy of counting and height measurement using the binocular camera in this study was greatly reduced in the case of dense hibiscus and other saplings. To address these shortcomings, we will introduce other sensors in future work, such as using LiDAR to acquire point cloud data [17], to segment saplings for height measurement and enhance the generalisability of the system.

Experimental Errors
Although experimental errors were avoided as much as possible, some errors are inevitable due to objective conditions [47]. The errors mainly originate from the following: (1) There is always some discrepancy between the calibrated parameters and the real parameters of the binocular camera in actual use, which is due to the errors inherent in the binocular camera and errors in the checkerboard calibration images taken.
(2) Saplings that are too close together can be mistakenly detected as one sapling, which is due to the system identification errors caused by the mutual forking of sapling branches.
(3) There will always be some incomplete fit between a sapling and its detection frame, and for very dense saplings the count and measurement accuracy will be drastically reduced, which is a drawback of the Yolo series' use of rectangular frames as the detection tool. (4) Poor lighting conditions will reduce the differentiation between saplings and their surroundings, weakening sapling features and causing detection errors; for saplings on sloping ground, unsuitable detection points will decrease the accuracy rate. These errors can cause some discrepancies between the manually measured height TH and the system-measured height H. This can be countered by applying manual intervention and finding the best camera angle, as described above. Considering that some saplings may stand in overgrown grass that obscures their roots, the estimated height of the grass needs to be added to the average height of those saplings.
To reduce binocular camera errors, the more expensive Zed2 integrated binocular camera could be applied, which poses a much lower risk of calibration error and allows for higher-resolution images, but this would require more hardware, such as a computer graphics card. In addition, techniques such as density mapping [28] for estimating the number of saplings, or LiDAR, may alleviate the difficulty of counting and measuring the height of dense saplings.

Conclusions
This study constructed and enhanced a sapling image dataset using commercially available inexpensive binocular cameras to sample nursery data, and proposed a framework for the counting and height measurement of saplings using the Ghostnet-YoloV4 network combined with binocular vision techniques. The following conclusions are drawn:
1. The Ghostnet network, which is suitable for loading onto mobile devices, was introduced into the YoloV4 network with an improved PANet and compared with networks such as Mobilenetv3-YoloV4; the Ghostnet-YoloV4 network was found to have the lowest total number of network parameters. Through field testing, the Ghostnet-YoloV4 network achieved more than 92% accuracy for all three types of tree saplings, and the overall accuracy remained above 90% even under different light and terrain conditions. These results validate the reliability of the Ghostnet-YoloV4 network.
2. Binocular vision technology was integrated into the Ghostnet-YoloV4 detection section to complete the 3D reconstruction of the current binocular view and obtain sapling heights. The results show that the binocular camera extracted sapling heights with an overall accuracy of 92.2%, which is sufficient to support precision forestry in nurseries and nursery management. The field detection accuracy can be further improved if the binocular camera parameters and shooting positions are adjusted according to the different light conditions and sapling morphologies, and if human intervention is added.
This study constructed a new network structure with high detection accuracy and applicability. This demonstrates the feasibility of using a low-profile binocular camera and a personal computer to achieve the real-time counting and height measurement of nursery saplings. It can currently be used to help nursery staff reduce the burden of manual inspection. In the future, it could also be used to help automate forestry machinery for the real-time detection, classification, localization and acquisition of the shape and size of objects of interest around the machine, providing guidance for subsequent automated operations.
Author Contributions: Methodology, X.Y. and D.L.; resources, X.Y. and Y.M.; software, X.Y. and P.S.; writing, X.Y.; format calibration, Y.M. and G.W. All authors have read and agreed to the published version of the manuscript.
Funding: This research received no external funding.

Data Availability Statement:
The data presented in this study are available on request from the corresponding author. The data are not publicly available as they are also needed for future research publications.

Conflicts of Interest:
The authors declare no conflict of interest.