Peer-Review Record

Multi-View Jujube Tree Trunks Stereo Reconstruction Based on UAV Remote Sensing Imaging Acquisition System

Appl. Sci. 2024, 14(4), 1364; https://doi.org/10.3390/app14041364
by Shunkang Ling 1, Jingbin Li 1,*, Longpeng Ding 1,* and Nianyi Wang 2
Reviewer 1: Anonymous
Reviewer 2: Anonymous
Submission received: 8 January 2024 / Revised: 4 February 2024 / Accepted: 5 February 2024 / Published: 7 February 2024
(This article belongs to the Special Issue Advances in Unmanned Aerial Vehicle (UAV) System)

Round 1

Reviewer 1 Report

Comments and Suggestions for Authors

The paper is sound, and the application is interesting and original. However, the manuscript needs moderate revision of its English; carefully revise grammar and style consistency. In recent years, numerous papers with similar approaches have been accepted in the literature.

Some minor comments:

Line 25-26: consider rephrasing "towards precision and informationization" to "toward precision and information technology integration" for clarity.

Lines 41-45: This sentence is lengthy and complex. Rewrite for clarity.

Line 36-47: Replace "paper tends to start from the algorithm" with "this paper begins with an algorithmic approach".

Line 111: "and so on" is informal; consider replacing with "among other issues".

Line 157: Remove the repeated phrase "Net structure is usually able to achieve better performance in tasks with".

Line 165: add a citation for the Transformer model.

Line 172: Self-attentive -> self-attention

Line 188: Capitalize All

Line 205-206: There is a repetition, rewrite for clarity.

Line 290: Specify "DJI Mavic3 UAV equipped with an Insta360 x2 panoramic camera" for clarity.

Line 300-308: The explanation of image distortion and correction is technical; a simplified summary or a figure illustrating this process could aid understanding.

Lines 430-489: The ablation study is a strong aspect of the paper, demonstrating the effectiveness of each component of your approach. Consider adding a brief summary at the end to reinforce the key findings.

Line 381: Replace "a reference view with two neighboring source views is input" with something like "the input consists of a reference view and two neighboring source views".

Line 466-467: Replace "overfitting due to its more complex" with something like "overfitting because of its complexity".

Lines 490-535: Add future work.

Line 496: Change "Aiming at the actual image sampling process of jujube garden" to something like "Aiming at the specific image sampling process in the jujube garden".

 

Some general comments:

Consider adding a brief qualitative summary, in table format, of the key findings and their importance, the strengths of the methodology, and the key points.

Add recent bibliography about the topic:

[1] Ma, B., Du, J., Wang, L., Jiang, H., & Zhou, M. (2021). Automatic branch detection of jujube trees based on 3D reconstruction for dormant pruning using the deep learning-based method. Computers and Electronics in Agriculture, 190, 106484.

[2] Li, Y., Zhang, Z., Wang, X., Fu, W., & Li, J. (2023). Automatic reconstruction and modeling of dormant jujube trees using three-view image constraints for intelligent pruning applications. Computers and Electronics in Agriculture, 212, 108149.

[3] Li, J., Wu, M., & Li, H. (2023). 3D reconstruction and volume estimation of jujube using consumer-grade RGB-depth sensor (June 2023). IEEE Access.

And consider adding into the pipeline tabular information, by the use of models like TabNet [4,5,6] and ensemble methods:

[4] Arik, S. Ö., & Pfister, T. (2021, May). Tabnet: Attentive interpretable tabular learning. In Proceedings of the AAAI conference on artificial intelligence (Vol. 35, No. 8, pp. 6679-6687).

[5] de Zarzà, I., de Curtò, J., & Calafate, C. T. (2023). Area estimation of forest fires using TabNet with Transformers. Procedia Computer Science, 225, 553-563. https://doi.org/10.1016/j.procs.2023.10.040

[6] Shah, C., Du, Q., & Xu, Y. (2022). Enhanced TabNet: Attentive interpretable tabular learning for hyperspectral image classification. Remote Sensing, 14(3), 716.

Comments on the Quality of English Language

Moderate editing of English is needed. Carefully revise grammar and style consistency.

Author Response

Thank you very much for your suggested revisions. We have made line-by-line corrections as follows:

Lines 25-26, 46-47, 111, 157, 172, 188, 381, 466-467, and 496 have been corrected as indicated, as shown in the highlighted sections of the manuscript.

 

Lines 41-45: the long and complex statement "However, there are some challenges and difficulties in using remote sensing and AI in sustainable agriculture, for example, there is a large amount of redundant video data in the video sequences captured by the cameras in the orchard, and whether the required target image data can be extracted from them quickly and accurately will directly affect the accuracy of the subsequent processing of the AI model, and may even lead to incorrect management decisions, which will bring about incalculable economic losses [10,11]." has been replaced with "Using remote sensing and AI in sustainable agriculture poses challenges. For instance, video sequences from orchard cameras generate a substantial amount of redundant data. The quick and accurate extraction of the required target image data from these sequences directly impacts the accuracy of subsequent AI model processing and may even lead to incorrect management decisions, resulting in significant economic losses [10,11]."

 

Line 165: a citation for the Transformer model has been added:

[1] Kitaev, N., Kaiser, Ł., & Levskaya, A. (2020). Reformer: The efficient Transformer. arXiv preprint arXiv:2001.04451.

 

Lines 205-206: the repetitive expression has been revised to "while spaced convolution operations can increase the receptive field without affecting the resolution".
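For illustration only: "spaced convolution" presumably refers to dilated (atrous) convolution. The following minimal PyTorch sketch (not the manuscript's code) shows that a dilated kernel enlarges the receptive field while the output resolution is preserved.

import torch
import torch.nn as nn

x = torch.randn(1, 8, 64, 64)  # one 8-channel 64x64 feature map

# standard 3x3 convolution: 3x3 receptive field, resolution kept by padding=1
standard = nn.Conv2d(8, 8, kernel_size=3, padding=1, dilation=1)

# dilated ("spaced") 3x3 convolution: 5x5 effective receptive field, resolution still kept
dilated = nn.Conv2d(8, 8, kernel_size=3, padding=2, dilation=2)

print(standard(x).shape)  # torch.Size([1, 8, 64, 64])
print(dilated(x).shape)   # torch.Size([1, 8, 64, 64])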

 

Line 290: "DJI Mavic3 UAV equipped with an Insta360 x2 panoramic camera" is described in more detail as "The jujube tree image acquisition equipment used in this study is an Insta360 x2 panoramic camera, which has an F2.0 aperture and an equivalent focal length of 7.2 mm; the image settings are set to record at 30 fps at 5.7K. The panoramic camera is suspended and mounted on a DJI Mavic3 drone in the form shown in Figure 9."

 

Lines 300-380: A simplified summary of the explanation of image distortion and correction has been added to the manuscript: The polar coordinate mapping model uses the idea of a polar coordinate system, converting an original pixel point from the rectangular coordinate system (x, y) to the polar coordinate system (r, θ), where r is the distance from the origin (the centre of the image) to the point and θ is the angle from the origin to the point. During polar mapping, a pixel point on a panoramic image is determined by its polar coordinates, which means that each pixel point is equivalent to a point radiating outward from the centre. This helps to adjust the shape of the image in order to overcome the problems caused by lens distortion.
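As an illustrative aside, the (x, y) to (r, θ) conversion described above can be sketched in a few lines of NumPy; the image centre is assumed to be the origin, and the actual correction pipeline in the manuscript may differ.

import numpy as np

def to_polar(x, y, cx, cy):
    # shift the origin to the image centre (cx, cy), then convert to (r, theta)
    dx, dy = x - cx, y - cy
    r = np.hypot(dx, dy)        # distance from the centre to the pixel
    theta = np.arctan2(dy, dx)  # angle of the pixel relative to the centre
    return r, theta

def to_cartesian(r, theta, cx, cy):
    # inverse mapping used when resampling the corrected image
    return cx + r * np.cos(theta), cy + r * np.sin(theta)

# example: pixel (120, 80) in an image whose centre is (100, 100)
r, theta = to_polar(120.0, 80.0, 100.0, 100.0)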

 

Lines 430-489: The summary of the ablation study has been clearly reflected in the last paragraph of the experimental discussion and analysis: Therefore, in general, the MVSNet + Co U-Net + DRE-Net_SA method proposed in this paper shows a large improvement in each performance index relative to the base model, with Acc improved by 20.4%, Comp by 12.8%, and the combined Overall metric by 16.8%.

 

Lines 490-535: future work has been added to the last paragraph of the conclusions: However, there is still room for further improvement and enhancement of the work in this paper. The increase in convolutional scale leads to higher GPU memory occupancy and higher training costs. In future research, we will adjust the network structure or utilize multi-GPU parallelism to reduce memory consumption and improve efficiency.

 

An up-to-date bibliography on the subject, as mentioned in the comments, has been added, and case studies using models and ensemble methods such as TabNet have been analysed.

Reviewer 2 Report

Comments and Suggestions for Authors

1) It is recommended to summarise the scientific challenge in the abstract rather than just pointing out the practical challenge.

2) The potential application of UAV’s monitoring in civil engineering structures like [1-2] should be mentioned in the introduction to attract more readers.

[1] Liu, Z., Song, Y., Gao, S., & Wang, H. (2023). Review of Perspectives on Pantograph-Catenary Interaction Research for High-Speed Railways Operating at 400 km/h and Above. IEEE Transactions on Transportation Electrification, 1–1. https://doi.org/10.1109/TTE.2023.3346379

[2] Jenie, Y. I., van Kampen, E. J., Ellerbroek, J., & Hoekstra, J. M. (2018). Safety assessment of a UAV CDR system in high density airspace using monte carlo simulations. IEEE Transactions on Intelligent Transportation Systems, 19(8), 2686–2695. https://doi.org/10.1109/TITS.2017.2758859

 

 

3) A more specific section title should be given. The current one ‘methods’ is too general and the readers find it difficult to gain useful information.

 

4) There are two steps named ‘2D convolution’ in Figure 1. Please give an appropriate explanation.

 

5) It is difficult to see the neural network details in Figure 2. What is the architecture of the network like? How are the layers and neurons arranged?

 

6) How to overcome the motion blur issue of the UAV? Please comment.

 

7) Section 4 should focus on the field test instead of experiments.

 

 

8) Can any explanation be given to clarify why the stack of multiple existing methods can improve the performance as shown in Table 2?

Author Response

1) We have added a summary of the scientific challenge alongside the practical challenge: 'High-quality agricultural multi-view stereo reconstruction technology is the key to precision and informatization in agriculture. Multi-view stereo reconstruction methods are an important part of 3D vision technology. In deep-learning-based multi-view stereo 3D reconstruction, the quality of feature extraction directly affects the accuracy of reconstruction. Aiming at the actual problems in orchard fruit tree reconstruction, this paper designs an improved multi-view stereo structure...'

2) In order to appeal to a wider audience and extend the potential applications, we have included the suggested additional application references in the introduction.

3) We have refined the section headings in the 'Methods' section (e.g., 2.1.2, 2.2, 2.3) so that the reader can more clearly understand the information about the improved models in this paper.

4) We have added an explanation in Section 2.2: 'the input multi-view image first undergoes a 2D convolution to capture simple features of the image (e.g., lines, colour blocks) to help the model understand the image and provide raw information for 3D reconstruction. Next, another 2D convolution is performed so that the network can start learning to understand more complex features (e.g., shapes, textures) and use them to obtain depth information. The features then enter the feature extraction network, ...'
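As an illustration of the two successive 2D convolutions described in this response, the following minimal PyTorch sketch uses assumed layer sizes (8 and 16 channels) rather than the manuscript's exact configuration; sharing the same weights across all views is likewise an assumption made for the sketch.

import torch
import torch.nn as nn

class InitialFeatureExtractor(nn.Module):
    def __init__(self):
        super().__init__()
        # first 2D convolution: low-level cues (lines, colour blocks)
        self.conv1 = nn.Sequential(nn.Conv2d(3, 8, 3, padding=1), nn.ReLU())
        # second 2D convolution: more complex cues (shapes, textures)
        self.conv2 = nn.Sequential(nn.Conv2d(8, 16, 3, padding=1), nn.ReLU())

    def forward(self, views):           # views: (B, N, 3, H, W) multi-view batch
        b, n, c, h, w = views.shape
        x = views.view(b * n, c, h, w)  # every view goes through the same weights
        return self.conv2(self.conv1(x))

feats = InitialFeatureExtractor()(torch.randn(2, 3, 3, 128, 160))
print(feats.shape)  # torch.Size([6, 16, 128, 160])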

5) The composite U-Net feature extraction network structure proposed in this paper is based on the idea of the classical U-Net model: a second U-Net is appended after the single U-Net structure to form a composite network. We have added an explanation of the network architecture and the arrangement of its layers in Section 2.1.1: 'this paper introduces a U-Net network structure after the basic network, so that the coarse-to-fine features are convolved once again; the network structure is shown in Figure 2. The resolution is first reduced by max pooling operations and then gradually restored to the original resolution by up-sampling operations, while keeping the number of channels unchanged. In the first half of the network, i.e., the encoding stage, the global features are extracted: the input image is first subjected to a convolution operation with a kernel size of 3×3, followed by four 2×2 convolution and max pooling operations. In the up-sampling stage, the feature maps with the same resolution in the encoding and decoding structures are concatenated and fused along the channel dimension; after merging at each side connection, convolution and up-sampling are continued to obtain 32-channel feature maps with the same resolution as the original image. This output is then passed through the network once more, and finally three groups of feature maps of different sizes are obtained; in order from the largest to the smallest feature map, the corresponding channel numbers are 8, 16 and 32, and the resolutions are 1, 1/4 and 1/16 of the original image.'
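To make the encoder-decoder pass easier to follow, here is a deliberately simplified PyTorch sketch with only two pooling levels and illustrative channel counts; the manuscript's composite ("Co") network is deeper, chains two such passes, and outputs feature maps at 1, 1/4 and 1/16 of the input resolution with 8, 16 and 32 channels.

import torch
import torch.nn as nn
import torch.nn.functional as F

def conv_block(cin, cout):
    return nn.Sequential(nn.Conv2d(cin, cout, 3, padding=1), nn.ReLU())

class MiniUNet(nn.Module):
    # one coarse-to-fine pass; a composite network would feed the output through a second pass
    def __init__(self):
        super().__init__()
        self.enc1 = conv_block(3, 8)    # full resolution
        self.enc2 = conv_block(8, 16)   # after one 2x2 max pooling
        self.enc3 = conv_block(16, 32)  # after a second 2x2 max pooling
        self.dec2 = conv_block(32 + 16, 16)
        self.dec1 = conv_block(16 + 8, 8)

    def forward(self, x):
        e1 = self.enc1(x)
        e2 = self.enc2(F.max_pool2d(e1, 2))
        e3 = self.enc3(F.max_pool2d(e2, 2))
        # skip connections: concatenate same-resolution encoder and decoder maps
        d2 = self.dec2(torch.cat([F.interpolate(e3, scale_factor=2), e2], dim=1))
        d1 = self.dec1(torch.cat([F.interpolate(d2, scale_factor=2), e1], dim=1))
        return d1, d2, e3               # three feature maps at decreasing resolution

outputs = MiniUNet()(torch.randn(1, 3, 64, 64))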

6) Regarding how this paper overcomes the motion blur produced during shooting, we provide a further illustration: 'Motion blur inevitably appears during mobile shooting, but we want the jujube trees to remain sharp enough while maintaining sampling efficiency. Firstly, we set the motion blur tolerance as follows: the camera's moving distance in each frame should be less than 5% of the target size. Secondly, based on the target blur length combined with the camera's frame rate, resolution and field-of-view hardware parameters, the moving distance of the image on the photosensitive element can be calculated, and the theoretical range of moving speed can then be derived. However, in the actual orchard environment, factors such as light intensity and the stability of the UAV's movement also matter, so to ensure the actual shooting quality we conducted trials at a variety of speeds and finally set a reasonable flight speed of 1.2 m/s.'
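As a rough, purely illustrative check of the 5% per-frame displacement tolerance stated above: with the 30 fps frame rate quoted earlier and an assumed imaged trunk section of 0.8 m (this value is our assumption, not taken from the manuscript), the bound evaluates to 1.2 m/s, the same order as the flight speed finally chosen.

# upper bound on UAV speed from the motion-blur tolerance:
# per-frame displacement < 5 % of the target size
frame_rate = 30.0   # frames per second (camera setting quoted above)
target_size = 0.8   # metres, assumed imaged trunk section (illustrative)
tolerance = 0.05    # allowed displacement per frame, as a fraction of target size

max_speed = tolerance * target_size * frame_rate  # = 1.2 metres per second
print(max_speed)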

7) We provide additional explanation of this issue in the paper. In order to objectively compare the performance gap between similar models, we selected a public dataset for experimental validation in the comparison experiment section. In order to highlight and validate the performance of our model in a real environment, we chose the field-sampled dataset for experimental validation in the ablation experiment section, which demonstrates the feasibility of the field test.

8) In the Table 2 ablation experiment, this paper takes the MVSNet model as the basic research object and makes a series of improvements to this base model. After analysing the most important feature extraction part, we first designed MVSNet + U-Net, using U-Net as the extraction network framework, and found that it achieved better experimental results. Subsequently, we improved the U-Net to design MVSNet + Co U-Net, and finally designed the composite feature extraction network constituting the MVSNet + Co U-Net + DRE-Net_SA model. Comparison and ablation experiments showed that better results were achieved; the specific reasons are analysed in Section 4.4 of the paper.

Round 2

Reviewer 2 Report

Comments and Suggestions for Authors

No further comments.
