3D Point Cloud on Semantic Information for Wheat Reconstruction

Abstract: Phenotypic analysis has always played an important role in breeding research. At present, wheat phenotypic analysis mostly relies on high-precision instruments, which raises its cost. Thanks to the development of 3D reconstruction technology, a reconstructed 3D model of wheat can also be used for phenotypic analysis. In this paper, we propose a method to reconstruct a 3D wheat model from semantic information. The method generates the 3D point cloud model of wheat that corresponds to a semantic description. First, an object detection algorithm is used to detect some phenotypic characteristics of wheat during growth. Second, the growth environment information and these phenotypic features are combined into semantic information. Third, a text-to-image algorithm is used to generate a 2D image of the wheat. Finally, the wheat in the 2D image is transformed into a rough 3D point cloud, from which a deep learning algorithm produces a higher-precision point cloud model. Extensive experiments indicate that the method can reconstruct 3D models of wheat and that it has heuristic value for phenotypic analysis and breeding research based on deep learning.


Introduction
Wheat is a cereal crop that is planted widely throughout the world, and its caryopsis is one of the staple foods of human beings. According to statistics, wheat provides more than 20% of the protein and dietary energy (calories) consumed by humans worldwide [1]. One study has indicated that crop yield needs to double by 2050 to meet the demands of rapid population growth [2]. As the climate changes, the breeding of high-yield and drought-resistant wheat varieties has attracted wide attention.
Screening wheat seeds that carry high-yield and disease-resistance genes is one way to increase yield. At present, phenotypic analysis is one of the crucial methods for screening fine varieties in the breeding laboratory. Usually, researchers have to measure the phenotypic data manually with instruments, which lengthens the research process and lowers its efficiency. Fortunately, the rapid development of deep learning has made it possible to combine computer vision with breeding research. A 3D point cloud model of wheat reconstructed by deep learning algorithms can be used to measure phenotypic data, and deep learning models can also take over some of the trivial, repetitive tasks that otherwise have to be completed manually. The reconstructed 3D wheat model can be used to calculate plant height, leaf area, leaf thickness, and other information. Moreover, the point cloud model can be used for segmentation tasks, which makes it easy to distinguish the stems and leaves of wheat and to measure each of them separately. Our method consists of the following three parts.
Object-Detection: we used an object detection algorithm to detect and judge whether the wheat leaves are unfolded. JC Zadoks et al. [3] proposed the decimal code for the growth stages of cereals. We drew on the method proposed by P Sadeghi-Tehran et al. [4] to judge the growth stage of wheat. The unfolding of each successive leaf indicates that the wheat has entered a different growth stage. With the help of the detection model, we can automatically judge which growth stage the wheat is in and record the time from sowing to that growth stage. Compared with the traditional machine learning and image processing methods [4], our method does not need complex image preprocessing, its detection speed is about 30% higher, and its detection accuracy is higher. The object detection algorithm is used to collect wheat phenotype information, and this information is turned into text descriptions of the wheat.
Text-to-Image: it is very difficult to transform semantic information into a point cloud directly, so we used 2D images as the intermediate medium. We used Attentional Generative Adversarial Networks (AttnGAN) [5] to transform the growth environment and phenotypic information collected during wheat growth from the text domain to the image domain. During wheat growth, we used a temperature and humidity sensor to record the temperature and humidity of the growth environment in real time and stored this information. We then combined it with the leaf-unfolding probability and time detected in the first part to form a complete text description, which was used to train the AttnGAN. Finally, the AttnGAN model outputs images according to the text description. In testing, the inception score (IS) [6] of the generated images reached 4.41 and the R-precision [5] reached 64.78%.
Three-Dimensional Point Cloud: in this part, we used the images generated in the second part to reconstruct the 3D model of wheat. Reconstructing a 3D point cloud directly from the generated images is hard, so we completed the task in two stages. In the first stage, we reconstructed the wheat in a 2D image as a rough point cloud. Although this point cloud is somewhat ambiguous, it still matches the shape characteristics of wheat. In the second stage, the point cloud generated in the first stage is used as the input, and an unsupervised learning method generates a more accurate point cloud that is closer to the shape of real wheat.
This paper is organized into five sections, including the present one. Section 2 introduces the development and important contributions of the fields covered in this paper. Section 3 describes how the datasets were collected and preprocessed and illustrates in detail the theoretical derivation of the models used in this paper. Section 4 presents and discusses the training process and the experimental results, lists the comparative experiments that we performed, and discusses the feasibility and effectiveness of the method. The last section summarizes the contribution of this paper and proposes future research directions.
The contribution of our method is threefold:
1. A wheat dataset is proposed, which contains wheat data annotations for object detection, text-to-image, and 3D point cloud tasks; it can be used by other researchers.
2. Object detection is used to automatically detect when wheat enters each growth stage.
3. We propose a method to reconstruct a 3D point cloud model of wheat from a text description; the method is based on multi-task cooperation.

Related Work
Reconstructing a 3D point cloud of wheat is not an easy task, and end-to-end generation is particularly difficult to implement. Multi-task cooperation is therefore a better choice. Our solution combines three algorithms, namely object detection, text-to-image generation, and 3D point cloud reconstruction, to achieve this purpose.

• Deep Learning in Wheat Breeding
With increasing population pressure and the resulting demand for agricultural products, countries around the world will face the problem of insufficient crop production. Plant researchers have been trying to propose strategies for increasing the production of wheat. Nimai Senapati et al. [7] pointed out the importance of drought tolerance during reproductive development for increasing wheat yield under climate change. Lin Ma et al. [8] isolated TaGS5 homoeologues in wheat and mapped them on chromosomes 3A, 3B, and 3D; temporal and spatial expression analysis showed that TaGS5-3A was preferentially expressed in young spikes and developing grains. Muhammad Adeel Hassan et al. [9] evaluated the vegetation indices (VIs) of crops at different growth stages using multispectral images from an unmanned aerial vehicle (UAV). Some researchers have analyzed wheat phenotypes to judge the advantages and disadvantages of wheat varieties so as to select good varieties and increase yield. More recently, some researchers have used deep learning to assist wheat research. Aleksandra Wolanin et al. [10] estimated the yield of wheat with explainable deep learning. Xu Wang et al. [11] used high-throughput phenotyping with deep learning to understand the genetic structure of flowering time in wheat. Liheng Zhong et al. [12] mapped winter wheat with deep learning. The above work has made a great contribution to wheat breeding, but these methods generally require a large amount of manual operation and measurement with dedicated instruments. In contrast, our work focuses on the automatic reconstruction of the 3D point cloud model of each growth stage of wheat.

• Object Detection Algorithms
Object detection is one of the most mature areas of deep learning and has been applied in many industries. The growth stage of wheat is usually judged by the unfolding of its leaves, and an object detection algorithm can effectively detect whether a leaf is fully unfolded. Current object detection algorithms can be divided into two categories. The first category is two-stage algorithms, the most representative of which is the Region-Convolutional Neural Networks (R-CNN) series, including fast R-CNN [13], faster R-CNN [14], Region-Fully Convolutional Neural Networks (R-FCN) [15], and Libra R-CNN [16]. These methods rely on a CNN to generate region proposals and then perform classification and regression on the proposals. Their accuracy is generally higher than that of one-stage methods, but their speed is lower. The second category is one-stage algorithms, the most representative of which are the You Only Look Once (YOLO) series [17][18][19][20], Single Shot MultiBox Detector (SSD) [21], and RetinaNet [22], which directly predict bounding boxes and class probabilities from the input image. Because the growth of wheat has to be monitored in real time, a one-stage method is the better fit. Early YOLO models such as YOLOv1 and YOLOv2 only support detection on low-resolution images, and their performance on small objects cannot satisfy practical needs. YOLOv4 performs well in both detection accuracy and speed, so we chose YOLOv4 as the detection model with CSPDarknet53 [23] as the backbone and used an attention mechanism to improve the performance of the model on our own dataset.

• Text-to-Image Algorithms
Recently, great progress has been achieved in image generation with the emergence of Generative Adversarial Networks (GANs) [24]. GANs have enabled many interesting applications in fields such as image restoration, style transfer, video generation, music generation, and text-to-image synthesis.
Because we need to reconstruct the 3D point cloud of wheat from 2D images, text-to-image algorithms fit our application scenario well. Compared with traditional generative models, GANs have two major characteristics. (1) GANs do not need to rely on any prior distribution; they only need to sample from a distribution (usually a Gaussian distribution) for training. (2) GANs generate realistic samples in a very simple way: a sample only needs to be passed forward through the generator.
Generating high-resolution images from text descriptions is a challenging task. Initially, models could only translate text directly into image pixels [25]. Stacked Generative Adversarial Networks (StackGAN) used a two-stage GAN to translate text information into a 256 × 256 realistic image for the first time [26]. Building on StackGAN, StackGAN-v2 is composed of multiple generators and discriminators arranged in a tree-like structure and generates multi-scale images of the same scene from different branches of the tree [27]. AttnGAN allows for attention-driven, multi-stage refinement for fine-grained text-to-image generation. This model pays more attention to the details of the relevant words in the semantic description, and the quality of the generated images is better, which is why we chose AttnGAN.

• Reconstruction of Wheat 3D Model
Three-dimensional images are a special form of information expression, characterized by expressing data in the three dimensions of space. Their forms of expression include depth maps, geometric models, and point cloud models. Point cloud data are the most common and basic 3D representation. Recently, deep learning on point clouds has thrived. Currently, there are many methods based on multiple views, such as Multi-view convolutional neural networks (MVCNN) [28] and Multi-view harmonized bilinear network (MHBN) [29]. Other methods, such as DensePoint [30] and ConvPoint [31,32], are based on 3D discrete convolution; these methods define convolutional kernels on regular grids, where the weights for neighboring points are related to their offsets with respect to the center point. Some researchers have tried to reconstruct the 3D model of wheat. Wei Fang et al. [33] proposed high-throughput volumetric reconstruction for 3D wheat plant architecture. Research centers such as the Donald Danforth Plant Science Center and the Commonwealth Scientific and Industrial Research Organisation (CSIRO) proposed a solution for 3D model reconstruction of plants based on 2D imaging [34]. Michael P. Pound et al. [35] proposed using single-view images to optimize the model based on image information, curvature constraints, and the position of neighboring surfaces in order to reconstruct a three-dimensional model of the plant. All of these works have achieved good results. However, they require high-precision instruments or manual measurement of certain plant parameters to reconstruct the three-dimensional model of the plant well, so the workload and cost are relatively high. In this paper, we use the method of [36] to build the 3D structure points of wheat. Specifically, this method takes a 3D point cloud as input and encodes it as a set of local features. The local features are then passed through a novel point integration module to produce a set of 3D structure points.

Materials and Methods
This section is divided into two parts. The first part mainly introduces data acquisition and preprocessing. The second part introduces the details of the algorithm we used.

RGB Image and Semantic Information
In order to collect data continuously, we developed a set of equipment based on a Raspberry Pi. The device is equipped with a Raspberry Pi Camera Module V2 (Premier Farnell, London, UK), which has a fixed-focus lens and an image resolution of up to 3280 × 2464 pixels. During wheat growth, occlusion between leaves is common, so single-view images alone cannot meet the data requirements of the detection task. A rotatable turntable solves this problem well: we simply put the wheat culture dish on it and let the turntable rotate slowly, which makes it easy to collect multi-view images. The whole collection process was completed in the breeding laboratory, so it was not affected by environmental factors. In total, we collected 2000 images of the wheat growing process. The dataset covers the growing process of 50 wheat plants, with at least 30 images collected for each plant. These images were used to train both the object detection task and the text-to-image task. The DHT11 [37] is a temperature and humidity sensor with a calibrated digital signal output. We used it to collect the temperature and soil humidity of the wheat-growing environment. Then, all information, such as temperature, soil humidity, wheat plant height, and leaf-unfolding probability, was combined into semantic information, which was used to train the text-to-image task. In total, we made 2000 textual annotations. The semantic information and a corresponding image example are shown in Figure 2.
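As an illustration of how sensor readings and detection results might be combined into one text annotation, the following minimal sketch builds a sentence from a single observation. The field names, the sentence template, and the `describe` helper are all hypothetical and are not taken from the dataset; the 0.85 threshold follows the confidence threshold used in the detection experiments.

```python
from dataclasses import dataclass


@dataclass
class WheatObservation:
    """One observation of a wheat plant; all field names are hypothetical."""
    days_after_sowing: int
    temperature_c: float       # temperature reported by the sensor
    soil_humidity_pct: float   # soil humidity reported by the sensor
    plant_height_cm: float
    leaf_index: int            # which leaf the detector reported on
    unfolded_prob: float       # detector confidence in [0, 1]


def describe(obs: WheatObservation, threshold: float = 0.85) -> str:
    """Build a text description from sensor and detection values.

    The sentence template is an assumption made for illustration only.
    """
    state = "unfolded" if obs.unfolded_prob > threshold else "not yet unfolded"
    return (
        f"Wheat {obs.days_after_sowing} days after sowing, "
        f"grown at {obs.temperature_c:.1f} degrees Celsius and "
        f"{obs.soil_humidity_pct:.0f} percent soil humidity, "
        f"is {obs.plant_height_cm:.1f} cm tall; "
        f"leaf {obs.leaf_index} is {state} "
        f"with probability {obs.unfolded_prob:.2f}."
    )


if __name__ == "__main__":
    print(describe(WheatObservation(21, 22.5, 60.0, 14.2, 3, 0.91)))
```

Keeping the template fixed means that each word position carries the same kind of information across annotations, which may help a word-level attention model, although the actual annotation format of the dataset may differ.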

Data Preprocessing
Since the number of input nodes of the deep learning network is fixed but the collected images differ in pixel size, the images need to be resized first. We resized the original images to 1024 × 1024 pixels before feeding them into YOLOv4 for training. The growth environment of wheat is changeable: different weather conditions lead to different light intensities, and different wind speeds change the posture of the wheat. Therefore, in order to improve the robustness of the model, we flipped the original images at several angles and applied gamma transformations to them. In addition, to account for the hardware noise of the imaging sensor, such as the electronic circuit noise caused by low illumination or high temperature in the camera sensor, we added Gaussian noise and salt-and-pepper noise so that the model fits better in an uncertain environment. After data augmentation, our dataset was expanded to 5000 images. The dataset also contains labels for object detection training, point cloud markers of the wheat model, and text descriptions of the images.
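The augmentations described above can be sketched with NumPy as follows. This is a minimal illustration under assumed parameter values (gamma 0.7, noise standard deviation 10, 2% salt-and-pepper pixels); the function names are hypothetical and the settings actually used for the dataset are not stated in the text.

```python
import numpy as np


def gamma_transform(img: np.ndarray, gamma: float) -> np.ndarray:
    """Apply gamma correction to a uint8 image (gamma < 1 brightens)."""
    scaled = img.astype(np.float64) / 255.0
    return np.clip(np.rint(255.0 * scaled ** gamma), 0, 255).astype(np.uint8)


def add_gaussian_noise(img: np.ndarray, sigma: float, rng) -> np.ndarray:
    """Add zero-mean Gaussian noise with standard deviation sigma."""
    noisy = img.astype(np.float64) + rng.normal(0.0, sigma, size=img.shape)
    return np.clip(np.rint(noisy), 0, 255).astype(np.uint8)


def add_salt_pepper_noise(img: np.ndarray, amount: float, rng) -> np.ndarray:
    """Set roughly a fraction `amount` of pixels to black or white."""
    out = img.copy()
    mask = rng.random(img.shape[:2])     # one random draw per pixel
    out[mask < amount / 2] = 0           # pepper
    out[mask > 1 - amount / 2] = 255     # salt
    return out


def augment(img: np.ndarray, rng) -> list:
    """Return augmented copies of one image (parameter values are assumed)."""
    return [
        np.fliplr(img),                          # horizontal flip
        np.flipud(img),                          # vertical flip
        gamma_transform(img, 0.7),               # simulate brighter light
        add_gaussian_noise(img, 10.0, rng),      # sensor circuit noise
        add_salt_pepper_noise(img, 0.02, rng),   # impulse noise
    ]


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    image = np.full((64, 64, 3), 128, dtype=np.uint8)
    print([a.shape for a in augment(image, rng)])
```

Note that flips change the bounding-box coordinates, so the detection labels would have to be transformed together with the image; that step is omitted here.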

Detection Model
Object detection is the first part of the whole work and is mainly used to detect whether a leaf is unfolded. The structure of YOLOv4 [20] can be divided into three parts: the backbone feature extraction network, the enhanced feature extraction network (neck), and the YOLO head. The anchors used in YOLOv4 are the same as those in YOLOv3. As its backbone, YOLOv4 adopts the Cross Stage Partial network CSPDarknet53. Its main idea is to stack residual blocks repeatedly and to add a large residual edge that spans each stack, which helps to extract edge information better. It is worth noting that the last three effective feature layers produced by CSPDarknet53 are all used as inputs to feature fusion to improve network performance. The neck of YOLOv4 consists of Spatial Pyramid Pooling (SPP) [38] and Feature Pyramid Networks (FPN) [39]. The most prominent feature of SPP is that it makes multi-scale training easy: SPP can extract features from images of different sizes, and it can output features of any size by adjusting the size and stride of the pooling kernel. FPN adopts a skip connection structure and obtains a multi-scale fused feature layer by repeated convolution, sampling, and concatenation of multiple effective feature layers. The bottom-up and top-down paths allow fine-grained feature information to be integrated directly into the final feature layer. This shortcut makes fine-grained, localized information available at the top layer.
In the anchor (prior box) part, YOLOv4 does not directly predict the width, height, and center coordinates of the bounding box; it predicts offsets instead. Compared with direct location prediction, offsets are easier to predict, and they avoid the problem that the bounding box may appear at any position in the image. The offset formula is defined as follows:
$$b_x = \sigma(t_x) + C_x, \quad b_y = \sigma(t_y) + C_y, \quad b_w = p_w e^{t_w}, \quad b_h = p_h e^{t_h}, \quad \sigma(t_o) = \Pr(\text{object}) \cdot IoU(b, \text{object}) \tag{1}$$
where $b_x$, $b_y$ are the center coordinates of the prediction box; $b_w$, $b_h$ are the width and height of the prediction box; $\sigma(t_o)$ is the confidence score; $C_x$, $C_y$ are the upper-left coordinates of the grid cell in the feature map; and $p_w$ and $p_h$ are the width and height of the default bounding box mapped to the feature map. During training, the correct bounding box is obtained by fitting the four parameters $t_x$, $t_y$, $t_w$, and $t_h$.
The loss function of YOLOv4 is divided into three parts: confidence loss, classification loss, and bounding box regression loss. Compared with YOLOv3, YOLOv4 changes only the bounding box regression loss: YOLOv3 uses Mean Squared Error (MSE) loss, while YOLOv4 uses Complete-Intersection over Union (CIoU) [40] loss. CIoU loss is defined as follows:
$$L_{CIoU} = 1 - IoU + R_{DIoU} + \alpha v, \quad R_{DIoU} = \frac{\rho^2(b, b^{gt})}{c^2}, \quad v = \frac{4}{\pi^2}\left(\arctan\frac{w^{gt}}{h^{gt}} - \arctan\frac{w}{h}\right)^2, \quad \alpha = \frac{v}{(1 - IoU) + v} \tag{2}$$
where $\alpha$ is the weight factor, $v$ measures the similarity of the aspect ratio, and $R_{DIoU}$ is the penalty term of Distance-Intersection over Union (DIoU) [40], in which $\rho(b, b^{gt})$ is the distance between the centers of the predicted box and the ground-truth box and $c$ is the diagonal length of the smallest box enclosing both. CIoU combines the advantages of several loss functions and fully considers the relationship between the prediction indicators. IoU expresses the overlap between the bounding box and the ground truth. The DIoU term makes the bounding box regress better. The term $\alpha v$ accounts for the aspect ratio of the bounding box, which reflects the difference in shape between the bounding box and the ground truth. The information detected by the model is used to train Attentional Generative Adversarial Networks (AttnGAN) [5].
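To make the offset decoding and the CIoU loss concrete, the following minimal sketch implements both for a single box in plain Python. It follows the standard YOLO parameterization and the published CIoU definition; the function names and the (center x, center y, width, height) box format are choices made for this illustration, not the interface of any particular YOLOv4 implementation.

```python
import math


def sigmoid(t: float) -> float:
    return 1.0 / (1.0 + math.exp(-t))


def decode_box(t_x, t_y, t_w, t_h, c_x, c_y, p_w, p_h):
    """Map predicted offsets to a box (center x, center y, width, height)."""
    b_x = sigmoid(t_x) + c_x      # center stays inside the grid cell
    b_y = sigmoid(t_y) + c_y
    b_w = p_w * math.exp(t_w)     # scale the anchor width
    b_h = p_h * math.exp(t_h)     # scale the anchor height
    return b_x, b_y, b_w, b_h


def ciou_loss(box, gt, eps: float = 1e-9) -> float:
    """CIoU loss between two boxes given as (cx, cy, w, h)."""
    (x1, y1, w1, h1), (x2, y2, w2, h2) = box, gt
    # Corner coordinates of both boxes.
    ax1, ay1, ax2, ay2 = x1 - w1 / 2, y1 - h1 / 2, x1 + w1 / 2, y1 + h1 / 2
    bx1, by1, bx2, by2 = x2 - w2 / 2, y2 - h2 / 2, x2 + w2 / 2, y2 + h2 / 2
    # Intersection over union.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = w1 * h1 + w2 * h2 - inter
    iou = inter / (union + eps)
    # DIoU penalty: squared center distance over squared enclosing diagonal.
    rho2 = (x1 - x2) ** 2 + (y1 - y2) ** 2
    c2 = (max(ax2, bx2) - min(ax1, bx1)) ** 2 \
        + (max(ay2, by2) - min(ay1, by1)) ** 2 + eps
    # Aspect-ratio term v and its weight alpha.
    v = (4 / math.pi ** 2) * (math.atan(w2 / h2) - math.atan(w1 / h1)) ** 2
    alpha = v / ((1 - iou) + v + eps)
    return 1 - iou + rho2 / c2 + alpha * v


if __name__ == "__main__":
    pred = decode_box(0.2, -0.1, 0.3, 0.1, c_x=3, c_y=4, p_w=2.0, p_h=5.0)
    print(pred, ciou_loss(pred, (3.5, 4.5, 2.0, 5.0)))
```

Because the sigmoid bounds the center offset to the interval (0, 1), a decoded box center can never leave the grid cell that predicts it, which is the constraint the text refers to.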

Text-to-Image Model
Compared with other GAN models, AttnGAN has two special components: (1) an attentional generative network and (2) a Deep attentional multimodal similarity model (DAMSM) [5]. Most recently proposed text-to-image synthesis methods are based on GANs. These methods usually encode the whole text description into a global sentence vector as the condition for GAN-based image generation [41]. This loses important fine-grained information at the word level and prevents the generation of high-quality images. AttnGAN not only encodes the natural language description into a global sentence vector but also encodes each word in the sentence into a word vector. In the first stage, the network uses the global sentence vector to generate a low-resolution image. In each following stage, it uses the image vector of each subregion to query the word vectors through an attention layer and form a word-context vector. The final objective function of the AttnGAN is defined as follows:
$$\xi = \xi_G + \lambda \xi_{DAMSM}, \quad \xi_G = \sum_i \xi_{G_i} \tag{3}$$
where $\xi_G$ is the GAN loss that jointly approximates conditional and unconditional distributions, summed over the generators $G_i$ of all stages, and $\lambda$ is a hyperparameter that balances the two terms. $\xi_{DAMSM}$ is a word-level fine-grained image-text matching loss computed by the DAMSM. The loss for $G_i$ is defined as follows:
$$\xi_{G_i} = -\frac{1}{2}\mathbb{E}_{\hat{x}_i \sim p_{G_i}}\left[\log D_i(\hat{x}_i)\right] - \frac{1}{2}\mathbb{E}_{\hat{x}_i \sim p_{G_i}}\left[\log D_i(\hat{x}_i, \bar{e})\right] \tag{4}$$
where $\hat{x}_i$ is from the model distribution $p_{G_i}$. The function is divided into two parts: the unconditional loss determines whether the image is real or fake, and the conditional loss determines whether the image and the semantic information match. At each stage of the AttnGAN, the generator $G_i$ has a corresponding discriminator $D_i$. Each discriminator $D_i$ is trained to classify its input as real or fake, and the loss for $D_i$ is defined as follows:
$$\xi_{D_i} = -\frac{1}{2}\mathbb{E}_{x_i \sim p_{data_i}}\left[\log D_i(x_i)\right] - \frac{1}{2}\mathbb{E}_{\hat{x}_i \sim p_{G_i}}\left[\log\left(1 - D_i(\hat{x}_i)\right)\right] - \frac{1}{2}\mathbb{E}_{x_i \sim p_{data_i}}\left[\log D_i(x_i, \bar{e})\right] - \frac{1}{2}\mathbb{E}_{\hat{x}_i \sim p_{G_i}}\left[\log\left(1 - D_i(\hat{x}_i, \bar{e})\right)\right] \tag{5}$$
where $x_i$ is from the true image distribution $p_{data_i}$, $\hat{x}_i$ is from the model distribution $p_{G_i}$, both at the $i$th scale, and $\bar{e}$ is the global sentence vector.
The second part of the objective function is the loss of the Deep attentional multimodal similarity model (DAMSM) [5]. DAMSM learns two neural networks that map the words of the sentence and the subregions of the image to a common semantic space, and it calculates a fine-grained loss for image generation. The two networks are a Long Short-Term Memory (LSTM) [42] network and a Convolutional Neural Network (CNN); their specific structures are not described in this paper. The LSTM is used to extract semantic vectors from the text descriptions, and the CNN is built upon the Inception-v3 [43] model pretrained on ImageNet [44]. We extracted global features from the last average pooling layer of Inception-v3 and added a perceptron layer to map the image features into the common semantic space shared with the text features. DAMSM uses an image-text matching score to evaluate the result. The 2D image generated by the model is an important medium for reconstructing the 3D point cloud.
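The generator and discriminator losses described above can be evaluated numerically as in the following minimal NumPy sketch. It assumes that the discriminator outputs are already probabilities in (0, 1) and that expectations are approximated by batch means; the function names are hypothetical, and the default of 5 for the balancing hyperparameter follows the best setting reported in the experiments.

```python
import numpy as np


def generator_loss(d_fake_uncond, d_fake_cond) -> float:
    """Loss of one generator G_i from discriminator outputs on fake images.

    d_fake_uncond: D_i(x_hat), probability that a generated image is real.
    d_fake_cond:   D_i(x_hat, e), probability that image and text match.
    """
    return float(-0.5 * np.mean(np.log(d_fake_uncond))
                 - 0.5 * np.mean(np.log(d_fake_cond)))


def discriminator_loss(d_real_uncond, d_fake_uncond,
                       d_real_cond, d_fake_cond) -> float:
    """Loss of one discriminator D_i (unconditional plus conditional part)."""
    uncond = (-0.5 * np.mean(np.log(d_real_uncond))
              - 0.5 * np.mean(np.log(1.0 - d_fake_uncond)))
    cond = (-0.5 * np.mean(np.log(d_real_cond))
            - 0.5 * np.mean(np.log(1.0 - d_fake_cond)))
    return float(uncond + cond)


def total_generator_objective(stage_losses, damsm_loss, lam=5.0) -> float:
    """Sum the per-stage generator losses and add the weighted DAMSM loss."""
    return float(sum(stage_losses) + lam * damsm_loss)


if __name__ == "__main__":
    half = np.full(8, 0.5)  # a discriminator that cannot tell real from fake
    print(generator_loss(half, half))                  # ln 2
    print(discriminator_loss(half, half, half, half))  # 2 ln 2
```

When the discriminator outputs 0.5 everywhere, the generator loss equals ln 2 and the discriminator loss equals 2 ln 2, which is a convenient sanity check for an implementation.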

Three-Dimensional Point Cloud Model
Our ultimate goal is to reconstruct the 3D point cloud model of wheat from a text description; the generated 2D image is only an intermediate medium. In this stage, the 3D point cloud needs to be reconstructed from a single 2D image. Because the 2D images are generated by a model, it is impossible to use a depth camera or similar devices to collect point cloud data, so we reconstructed the point cloud in two stages. In the first stage, we used a model that can generate a point cloud from a single image [45]. Due to the lack of depth information, the shape of the point cloud reconstructed in the first stage is somewhat ambiguous. In the second stage, we used an end-to-end framework [36] that can learn intrinsic structure points from point clouds. The framework consists of two parts: PointNet++ [46] and a point integration module. The whole structure is shown in Figure 3. The input to PointNet++ is a point cloud, which first enters an encoder. The encoder extracts sample points $Q = \{q_1, q_2, \ldots, q_l\}$ with $q_i \in \mathbb{R}^3$, together with the features $F = \{f_1, f_2, \ldots, f_l\}$ with $f_i \in \mathbb{R}^c$, where $l$ is the number of sample points and $c$ is the dimension of the feature representation. The input to the point integration module is the points $Q$ with the local contextual features $F$ obtained by PointNet++. A shared Multi-Layer Perceptron (MLP) block followed by a softmax activation generates the probability maps $P = \{p_1, p_2, \ldots, p_m\}$. The entry $p_i^j$ in the probability map $p_i$ indicates the probability of the point $q_j$ being the structure point $s_i$. Therefore, the output points $S = \{s_1, s_2, \ldots, s_m\}$ can be defined as follows:
$$s_i = \sum_{j=1}^{l} p_i^j q_j, \quad \sum_{j=1}^{l} p_i^j = 1, \quad i = 1, 2, \ldots, m \tag{6}$$
For unsupervised training of the network, the reconstruction loss is defined based on the Chamfer distance (CD) [45]. The loss is the CD between the structure points $S$ and the input points $X$ and is computed as follows:
$$L_{rec}(S, X) = \sum_{s_i \in S} \min_{x_j \in X} \|s_i - x_j\|_2^2 + \sum_{x_j \in X} \min_{s_i \in S} \|x_j - s_i\|_2^2 \tag{7}$$
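The point integration step and the reconstruction loss can be sketched in NumPy as follows. This is a minimal illustration, not the framework of [36]: the logits stand in for the output of the shared MLP, which is not modeled here, and the function names are hypothetical. Each structure point is a convex combination of the sample points, weighted by a softmax probability map, and the loss is the symmetric Chamfer distance with squared Euclidean distances.

```python
import numpy as np


def softmax(logits: np.ndarray, axis: int = -1) -> np.ndarray:
    """Numerically stable softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=axis, keepdims=True)


def structure_points(q: np.ndarray, logits: np.ndarray) -> np.ndarray:
    """Integrate sample points into structure points.

    q:      (l, 3) sample points from the encoder.
    logits: (m, l) scores, one row per structure point (stand-in for the MLP).
    Returns (m, 3) structure points, each a convex combination of q.
    """
    p = softmax(logits, axis=1)   # probability maps, each row sums to 1
    return p @ q


def chamfer_distance(s: np.ndarray, x: np.ndarray) -> float:
    """Symmetric Chamfer distance between point sets s (m, 3) and x (n, 3)."""
    d2 = ((s[:, None, :] - x[None, :, :]) ** 2).sum(axis=-1)   # (m, n)
    return float(d2.min(axis=1).sum() + d2.min(axis=0).sum())


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    cloud = rng.normal(size=(256, 3))      # input points X
    samples = cloud[:64]                   # pretend encoder samples Q
    scores = rng.normal(size=(16, 64))     # pretend MLP output
    s = structure_points(samples, scores)
    print(s.shape, chamfer_distance(s, cloud))
```

Because every structure point is a convex combination of sample points, it always lies inside the convex hull of the input, which keeps the output close to the object even early in training.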

Experimental Results
The experiments were run on Ubuntu 16.04 with PyTorch 1.2 as the deep learning framework, and all experimental results were obtained on an NVIDIA GeForce RTX 2080 SUPER GPU with 8 GB of video memory. In this section, we use four subsections to show the experimental results of the three models and discuss them in detail.
Training a good detector is the basis of our work. On our own dataset, the highest mean Average Precision (mAP) [47] of YOLOv4 is 0.917. After many experiments, we found that some tricks can improve the accuracy of the model on our dataset. In the final configuration, we set the image size to 512 and the number of epochs to 200 and used multi-scale training; this configuration produced the model with the highest mAP value. The experimental results showed that attention mechanisms, such as the Convolutional Block Attention Module (CBAM) [48] and Squeeze-and-Excitation Networks (SENet) [49], and multi-scale training have a great influence on the results. We also used other tricks to assist in training the model. Figure 4 shows the training details of the comparative experiments, and Table 1 shows all of the results of the comparative experiments. According to these results, we can draw the following conclusions:
• The attention mechanism and multi-scale training both help to improve the mAP value. When the image size is 416, the mAP value using SENet or CBAM is 0.015 higher than using multi-scale training. However, when the image size is 512, the mAP value using multi-scale training is 0.1 higher than using SENet or CBAM.
• When the attention mechanism is used together with multi-scale training, the improvement is not obvious; when the image size is 416, the mAP value is even reduced. This shows that the combination of multi-scale training and an attention mechanism requires a larger image size to provide more information.
• When using CBAM, the mAP value is 0.01 higher than when using SENet in all experiments. It can also be seen from the training process that the loss decreases more smoothly when using CBAM. The reason is that CBAM has a spatial attention module in addition to the channel attention that SENet provides.
To verify the robustness of our model, we collected some wheat images from the field and tested our model on them. The results, shown in Figure 5, indicate that our model can detect whether wheat leaves are unfolded in different environments.
Figure 5. The test results of wheat images collected from the field. The crease between the wheat leaf and the main stem indicates that the wheat leaf has been fully unfolded, and the crease is the target of detection. The probability in the figure is the confidence score. In the experiment, we set the threshold to 85%. When the confidence score is higher than 85%, the leaf is considered to be unfolded.
After the phenotypic information of wheat is detected, the semantic information is used to generate the corresponding 2D image. The quality of the 2D image directly determines the quality of the final 3D point cloud. GAN models are usually evaluated with the inception score, which scores the generated images in terms of clarity and diversity; the higher the value, the better the trained model. However, its disadvantage is that it cannot reflect whether the image is well conditioned on the given text description, so we added another evaluation index, R-precision, which is a complementary metric for the text-to-image synthesis task. More details about R-precision are presented in [5]. In the training stage, we first trained the DAMSM to obtain the image and text encoders. Then, the text vector produced by the text encoder and a vector sampled from a Gaussian distribution were used to train the generator. The parameter λ in Equation (3) and the DAMSM have a great influence on the experimental results. Table 2 shows the experimental results for different λ values with and without DAMSM. We obtained the best model when λ = 5 and DAMSM was used. We also tested our model on a series of text descriptions; Figure 6 shows the results. Comparing the generated images with the real images, we find that the generated images reflect the details of the semantic information and basically conform to the text descriptions. The quality of the generated images meets the requirements of the next stage, 3D reconstruction.
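For readers unfamiliar with the two metrics, the following minimal NumPy sketch shows how they are commonly computed. It assumes that class probabilities from a pretrained classifier and image and text embeddings are already available as arrays; the function names are hypothetical. The inception score is the exponential of the mean KL divergence between the per-image class distribution and the marginal class distribution, and R-precision with R = 1 is the fraction of images whose true description ranks first among the candidate descriptions by cosine similarity.

```python
import numpy as np


def inception_score(probs: np.ndarray, eps: float = 1e-12) -> float:
    """Inception score from class probabilities.

    probs: (N, K) array, row n is p(y | x_n) for generated image x_n.
    """
    p_y = probs.mean(axis=0, keepdims=True)                 # marginal p(y)
    kl = (probs * (np.log(probs + eps) - np.log(p_y + eps))).sum(axis=1)
    return float(np.exp(kl.mean()))


def r_precision(image_vecs: np.ndarray, candidate_text_vecs: np.ndarray) -> float:
    """R-precision with R = 1.

    image_vecs:          (N, D) embeddings of generated images.
    candidate_text_vecs: (N, C, D) text embeddings per image; index 0 holds
                         the true description, the rest are mismatched ones.
    """
    img = image_vecs / np.linalg.norm(image_vecs, axis=1, keepdims=True)
    txt = candidate_text_vecs / np.linalg.norm(
        candidate_text_vecs, axis=2, keepdims=True)
    sims = np.einsum("nd,ncd->nc", img, txt)                # cosine similarity
    return float((sims.argmax(axis=1) == 0).mean())


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    probs = rng.dirichlet(np.ones(10), size=100)
    print(inception_score(probs))
```

The inception score ranges from 1 (every image gets the same class distribution) up to the number of classes (confident and evenly spread predictions), which explains why it rewards both clarity and diversity but ignores the text condition.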
The last part of the work is to reconstruct the 3D point cloud of wheat from the generated image. The task is so difficult that it needs two stages to complete. Although the point cloud obtained in the first stage has the shape of the real object, it still differs considerably from the real object. Therefore, this point cloud is used as the input to the model introduced in Section 3.2.3. To generate point cloud models with more details, we set the number of structure points to 1024. To evaluate the robustness of the model to input point clouds of different densities, we used the point-wise average Euclidean distance to measure the stability of the structure points. Table 3 shows the results of the experiment. The growth of wheat can be divided into 11 stages, such as germination, emergence, and tillering, and the morphological and physiological characteristics of each stage are different. Here, we roughly divide them into three growth stages: the early growth stage, the middle booting stage, and the mature stage. Figure 7 shows the structure points of each stage. As can be seen from Figure 7, the morphology of wheat has different characteristics at different growth stages. The generated point cloud model is very similar to the shape of real wheat, and its key features also conform to the text description. This is because the reconstruction is completed in two stages and the feature information of the previous stage is retained. The results also show that it is feasible to divide the task into two stages. The 3D point cloud model makes it possible to calculate the phenotypic parameters of wheat leaves from the coordinates of the points and to construct a realistic virtual model of leaf surfaces. Such a realistic virtual model is important for several applications in plant sciences, such as modelling the movement and spreading of agrichemical spray droplets on the leaf surface.
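Two of the quantities mentioned above are simple to compute from point coordinates, as the following minimal NumPy sketch shows. It assumes that the structure points from two input densities are in one-to-one correspondence and that the vertical axis of the point cloud is the third coordinate; both are assumptions made for this illustration, and the function names are hypothetical.

```python
import numpy as np


def pointwise_average_distance(a: np.ndarray, b: np.ndarray) -> float:
    """Mean Euclidean distance between corresponding points of a and b.

    a, b: (m, 3) structure points predicted from two input densities,
          where row i of a corresponds to row i of b.
    """
    return float(np.linalg.norm(a - b, axis=1).mean())


def plant_height(points: np.ndarray, up_axis: int = 2) -> float:
    """Height estimate: extent of the point cloud along the vertical axis."""
    coords = points[:, up_axis]
    return float(coords.max() - coords.min())


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    dense = rng.uniform(0.0, 1.0, size=(1024, 3))
    sparse = dense + rng.normal(0.0, 0.01, size=dense.shape)
    print(pointwise_average_distance(dense, sparse), plant_height(dense))
```

A small point-wise distance between the two outputs indicates that each structure point stays at a consistent location when the input density changes; the height is in the same units as the point coordinates and would need a known scale to be converted to centimeters.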

Discussion
The above experimental results indicate that our method is feasible and effective. The quality of the generated image is largely determined by the text description, so a detailed and accurate text description is particularly important. By using an object detection algorithm to detect the probability that wheat leaves are unfolded, we can judge the growth stage of wheat. The text-to-image experiment shows that the detection results play an important role in image generation. The images used to train the object detection model were collected continuously during wheat growth and cover each growth stage. According to the experimental results, YOLOv4 can detect the probability that wheat leaves are unfolded, from which the growth stage of wheat can be judged correctly. In the actual research process, this method was able to replace part of the manual work. During wheat growth, changes in environmental factors and transitions between growth stages are very subtle. With DAMSM, the generated image depends more strongly on each word of the description, which helps to generate images with different details. The inception score and R-precision values in Table 2 show that the model we used can generate high-quality images and that the generated images match the text descriptions well, which makes 3D reconstruction from semantic information feasible. We used a single image to generate the final point cloud model in two stages, and the training process is unsupervised. The generated point clouds and the evaluation indexes show that our model can reconstruct a reliable 3D point cloud model of wheat.
DM Kempthorne et al. [50] used 3D scan data to reconstruct a 3D model of the wheat leaf. Compared with their method, our method does not require an expensive instrument such as a 3D scanner and greatly reduces the calculation time. Jonathon A. Gibbs et al. [51] studied the use of voxels to build three-dimensional models of plants; a 3D model reconstructed in this way is composed of many small cubes. Because the shape of wheat is usually not a regular geometry, models built from voxels have lower accuracy than those reconstructed from point clouds, and the calculation of phenotypic parameters is affected as well. Taking these factors into consideration, our method performs better in practicability and accuracy and is more suitable for daily breeding research.
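The loss of accuracy in a voxel representation can be illustrated with a small sketch. Under the assumption of a regular grid of cubic voxels, each surface point is replaced by the centre of the voxel that contains it, so its position can be off by up to half the voxel diagonal. The function names and the voxel size below are hypothetical and serve only as an illustration; they do not reproduce the method of [51].

```python
import math


def snap_to_voxel_center(point, voxel_size):
    """Replace a point by the centre of the cubic voxel that contains it."""
    return tuple((math.floor(c / voxel_size) + 0.5) * voxel_size for c in point)


def max_quantization_error(voxel_size):
    """Largest possible displacement: half the diagonal of a cubic voxel."""
    return voxel_size * math.sqrt(3) / 2


# A point on a voxel corner is the worst case for a grid with unit spacing.
point = (0.0, 0.0, 0.0)
center = snap_to_voxel_center(point, 1.0)
print(center)                    # (0.5, 0.5, 0.5)
print(math.dist(point, center))  # equals max_quantization_error(1.0)
```

A point cloud keeps the original coordinates, so phenotypic parameters computed from it are not subject to this grid-dependent error.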

Conclusions
In this paper, we propose a method to reconstruct a 3D point cloud model of wheat from semantic information and verify its feasibility through experiments. We also present a dataset that contains images of wheat, the text descriptions matching the images, and the point cloud data corresponding to the images, which may be helpful to other researchers. So far, we have been able to generate 3D point clouds from semantic information. Each point of the 3D point cloud model has definite coordinates, which can be used to estimate leaf area, to calculate plant height, and to measure leaf thickness and other phenotypic data. In addition, the point cloud model can be used for classification and segmentation tasks. With the point cloud model, it is easy to distinguish which growth stage the wheat is in. If the point cloud model is used for a segmentation task, points of different colors in the 3D point cloud represent different parts of the wheat, and the phenotypic data of each part can be calculated separately.
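As a simple illustration of how phenotypic data can be derived from point coordinates, the sketch below estimates plant height as the extent of the point cloud along the vertical axis and repeats the calculation for each segmented part. It assumes that the third coordinate is the vertical axis and that every point carries a part label; the function names, labels, and coordinates are hypothetical.

```python
def plant_height(points, up_axis=2):
    """Extent of the point cloud along the (assumed) vertical axis."""
    if not points:
        raise ValueError("point cloud must not be empty")
    heights = [p[up_axis] for p in points]
    return max(heights) - min(heights)


def height_by_part(points, labels):
    """Vertical extent of each segmented part, keyed by its label."""
    if len(points) != len(labels):
        raise ValueError("each point needs exactly one label")
    groups = {}
    for point, label in zip(points, labels):
        groups.setdefault(label, []).append(point)
    return {label: plant_height(part) for label, part in groups.items()}


# Toy point cloud with two labelled parts (units are arbitrary).
points = [(0.0, 0.0, 0.0), (0.0, 0.0, 0.5), (0.1, 0.0, 0.5), (0.3, 0.0, 0.75)]
labels = ["stem", "stem", "leaf", "leaf"]
print(plant_height(points))            # 0.75
print(height_by_part(points, labels))  # {'stem': 0.5, 'leaf': 0.25}
```

Other parameters, such as leaf area or leaf thickness, would require additional steps (for example, fitting a surface to the leaf points) that are not shown here.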
In practical application, only a data acquisition device and a computer with the algorithms deployed are required. All calculation processes are completed automatically. Breeding researchers only need to perform some simple auxiliary work and can then use the data for further ecophysiological research. Gramineae plants share many characteristics, so our method may also serve as a heuristic algorithm for other Gramineae plants. We currently still use the multi-task method to reconstruct the point cloud, and end-to-end training has not yet been implemented. In the future, we will continue to explore effective methods to achieve end-to-end training of the whole structure.

Data Availability Statement: The data are available online at https://drive.google.com/drive/folders/1ko6rlE1LThkNG_fcm5C12LcBaUWwdsPc?usp=sharing (accessed on 25 April 2021). As we are still conducting more research on the dataset, we will upload our dataset to the same link later.