Smart agriculture addresses inefficient resource utilization and disease control in poultry farming. Existing smart monitoring systems effectively detect birds. However, applying these models to new environments or to new species requires a large amount of labeled data and manual work, which limits their wide application. To address this limitation, this paper presents the SSNFNet method, leveraging an enhanced few-shot object detection framework tailored for poultry farming contexts. SSNFNet integrates Sharpness-Aware Minimization (SAM) to enhance model generalization by smoothing the loss landscape and improving training stability. To further improve detection in densely populated scenes, we incorporate the Soft Non-Maximum Suppression (Soft-NMS) algorithm to mitigate overlapping bounding box issues. Through quantitative analysis and comparison, exemplified by a five-shot scenario on the poultry farming dataset, our method demonstrates significantly better performance than traditional object detection models. Specifically, it improves mean Average Precision (mAP) by 3.93 percentage points over the state-of-the-art HTRPN model, raising the mAP from 78.00% to 81.93% while maintaining 8 FPS inference speed on Jetson Nano-class hardware. These results confirm the effectiveness and adaptability of our approach in real-world smart farming environments. Full article

(This article belongs to the Section Poultry)
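
The abstract above names Soft-NMS as the step that handles overlapping boxes in crowded scenes. As a rough illustration only, the following is a minimal pure-Python sketch of the commonly used Gaussian variant of Soft-NMS. It is not the SSNFNet implementation: the function names, the `sigma` value, and the `score_threshold` cutoff are assumptions chosen for the example.

```python
import math


def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def soft_nms(boxes, scores, sigma=0.5, score_threshold=0.001):
    """Gaussian Soft-NMS (illustrative; sigma and threshold are assumed values).

    Instead of discarding boxes that overlap the current best box, their
    scores are decayed by exp(-IoU^2 / sigma). Returns a list of
    (box_index, rescored_value) in the order the boxes were selected.
    """
    remaining = list(enumerate(scores))
    kept = []
    while remaining:
        # Select the highest-scoring box that is still in play.
        best = max(range(len(remaining)), key=lambda k: remaining[k][1])
        best_idx, best_score = remaining.pop(best)
        kept.append((best_idx, best_score))
        # Decay the scores of the other boxes according to their overlap.
        rescored = []
        for idx, score in remaining:
            overlap = iou(boxes[best_idx], boxes[idx])
            score *= math.exp(-(overlap ** 2) / sigma)
            if score >= score_threshold:
                rescored.append((idx, score))
        remaining = rescored
    return kept
```

With hard NMS, a heavily overlapping neighbour would be dropped outright; here it is kept with a reduced score, which is why the technique is often preferred when animals stand close together.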

21 pages, 3729 KiB

Open Access Article

Can AIGC Aid Intelligent Robot Design? A Tentative Research of Apple-Harvesting Robot

by Qichun Jin, Jiayu Zhao, Wei Bao, Ji Zhao, Yujuan Zhang and Fuwen Hu

Processes 2025, 13(8), 2422; https://doi.org/10.3390/pr13082422 - 30 Jul 2025

Abstract

Recently, artificial intelligence (AI)-generated content (AIGC) has been fundamentally transforming multiple sectors, including materials discovery, healthcare, education, scientific research, and industrial manufacturing. Given the complexities and challenges of intelligent robot design, AIGC has the potential to offer a new paradigm, assisting in conceptual and technical design, functional module design, and the training of perception ability to accelerate prototyping. Taking the design of an apple-harvesting robot as an example, we first demonstrate a basic framework of the AIGC-assisted robot design methodology, leveraging the generation capabilities of available multimodal large language models, as well as human intervention to alleviate AI hallucination and hidden risks. Second, we study the enhancement effect on the robot perception system of using apple images generated by large vision-language models to expand the dataset of actual apple images. Further, an apple-harvesting robot prototype based on an AIGC-aided design is demonstrated, and a pick-up experiment in a simulated scene indicates that it achieves a harvesting success rate of 92.2% and good terrain traversability with a maximum climbing angle of 32°. According to this tentative research, although it is not an autonomous design agent, the AIGC-driven design workflow can alleviate the significant complexities and challenges of intelligent robot design, especially for beginners or young engineers. Full article

(This article belongs to the Special Issue Design and Control of Complex and Intelligent Systems)

31 pages, 11269 KiB

Open Access Review

Advancements in Semantic Segmentation of 3D Point Clouds for Scene Understanding Using Deep Learning

by Hafsa Benallal, Nadine Abdallah Saab, Hamid Tairi, Ayman Alfalou and Jamal Riffi

Technologies 2025, 13(8), 322; https://doi.org/10.3390/technologies13080322 - 30 Jul 2025

Abstract

Three-dimensional semantic segmentation is a fundamental problem in computer vision with a wide range of applications in autonomous driving, robotics, and urban scene understanding. The task involves assigning semantic labels to each point in a 3D point cloud, a data representation that is inherently unstructured, irregular, and spatially sparse. In recent years, deep learning has become the dominant framework for addressing this task, leading to a broad variety of models and techniques designed to tackle the unique challenges posed by 3D data. This survey presents a comprehensive overview of deep learning methods for 3D semantic segmentation. We organize the literature into a taxonomy that distinguishes between supervised and unsupervised approaches. Supervised methods are further classified into point-based, projection-based, voxel-based, and hybrid architectures, while unsupervised methods include self-supervised learning strategies, generative models, and implicit representation techniques. In addition to presenting and categorizing these approaches, we provide a comparative analysis of their performance on widely used benchmark datasets, discuss key challenges such as generalization, model transferability, and computational efficiency, and examine the limitations of current datasets. The survey concludes by identifying potential directions for future research in this rapidly evolving field. Full article

(This article belongs to the Section Information and Communication Technologies)

23 pages, 7371 KiB

Open Access Article

A Novel Method for Estimating Building Height from Baidu Panoramic Street View Images

by Shibo Ge, Jiping Liu, Xianghong Che, Yong Wang and Haosheng Huang

ISPRS Int. J. Geo-Inf. 2025, 14(8), 297; https://doi.org/10.3390/ijgi14080297 - 30 Jul 2025

Viewed by 44

Abstract

Building height information plays an important role in many urban-related applications, such as urban planning, disaster management, and environmental studies. With the rapid development of real scene maps, street view images are becoming a new data source for building height estimation, considering their easy collection and low cost. However, existing studies on building height estimation primarily utilize remote sensing images, with little exploration of height estimation from street-view images. In this study, we proposed a deep learning-based method for estimating the height of a single building in Baidu panoramic street view imagery. Firstly, the Segment Anything Model was used to extract the region of interest image and location features of individual buildings from the panorama. Subsequently, a cross-view matching algorithm was proposed by combining Baidu panorama and building footprint data with height information to generate building height samples. Finally, a Two-Branch feature fusion model (TBFF) was constructed to combine building location features and visual features, enabling accurate height estimation for individual buildings. The experimental results showed that the TBFF model had the best performance, with an RMSE of 5.69 m, MAE of 3.97 m, and MAPE of 0.11. Compared with two state-of-the-art methods, the TBFF model exhibited robustness and higher accuracy. The Random Forest model had an RMSE of 11.83 m, MAE of 4.76 m, and MAPE of 0.32, and the Pano2Geo model had an RMSE of 10.51 m, MAE of 6.52 m, and MAPE of 0.22. The ablation analysis demonstrated that fusing building location and visual features can improve the accuracy of height estimation by 14.98% to 69.99%. Moreover, the accuracy of the proposed method meets the LOD1 level 3D modeling requirements defined by the OGC (height error ≤ 5 m), which can provide data support for urban research. Full article
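
The abstract above compares models by RMSE, MAE, and MAPE. For readers unfamiliar with these metrics, here is a minimal sketch of how they are usually computed. The function names and sample heights are hypothetical and not taken from the paper; MAPE is returned as a fraction (for example 0.11) to match the way the abstract reports it.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error, in the same unit as the inputs (e.g. metres)."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def mae(y_true, y_pred):
    """Mean absolute error, in the same unit as the inputs."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def mape(y_true, y_pred):
    """Mean absolute percentage error as a fraction; true values must be non-zero."""
    return sum(abs(t - p) / abs(t) for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical building heights in metres, for illustration only.
true_heights = [10.0, 20.0, 40.0]
predicted_heights = [12.0, 18.0, 44.0]
print(rmse(true_heights, predicted_heights))
print(mae(true_heights, predicted_heights))
print(mape(true_heights, predicted_heights))
```

RMSE penalizes large individual errors more strongly than MAE, which is one reason a model can rank differently on the two metrics.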

28 pages, 3441 KiB

Open Access Article

Which AI Sees Like Us? Investigating the Cognitive Plausibility of Language and Vision Models via Eye-Tracking in Human-Robot Interaction

by Khashayar Ghamati, Maryam Banitalebi Dehkordi and Abolfazl Zaraki

Sensors 2025, 25(15), 4687; https://doi.org/10.3390/s25154687 - 29 Jul 2025

Viewed by 196

Abstract

As large language models (LLMs) and vision–language models (VLMs) become increasingly used in robotics, a crucial question arises: to what extent do these models replicate human-like cognitive processes, particularly within socially interactive contexts? Whilst these models demonstrate impressive multimodal reasoning and perception capabilities, their cognitive plausibility remains underexplored. In this study, we address this gap by using human visual attention as a behavioural proxy for cognition in a naturalistic human-robot interaction (HRI) scenario. Eye-tracking data were previously collected from participants engaging in social human-human interactions, providing frame-level gaze fixations as a human attentional ground truth. We then prompted a state-of-the-art VLM (LLaVA) to generate scene descriptions, which were processed by four LLMs (DeepSeek-R1-Distill-Qwen-7B, Qwen1.5-7B-Chat, LLaMA-3.1-8b-instruct, and Gemma-7b-it) to infer saliency points. Critically, we evaluated each model in both stateless and memory-augmented (short-term memory, STM) modes to assess the influence of temporal context on saliency prediction. Our results showed that whilst stateless LLaVA most closely replicates human gaze patterns, STM confers measurable benefits only for DeepSeek, whose lexical anchoring mirrors human rehearsal mechanisms. Other models exhibited degraded performance with memory due to prompt interference or limited contextual integration. This work introduces a novel, empirically grounded framework for assessing cognitive plausibility in generative models and underscores the role of short-term memory in shaping human-like visual attention in robotic systems. Full article

(This article belongs to the Special Issue Multimodal Human Behavior Understanding in Human–AI Interaction: Sensor-Based Signal Processing and Interaction Techniques)

19 pages, 9284 KiB

Open Access Article

UAV-YOLO12: A Multi-Scale Road Segmentation Model for UAV Remote Sensing Imagery

by Bingyan Cui, Zhen Liu and Qifeng Yang

Drones 2025, 9(8), 533; https://doi.org/10.3390/drones9080533 - 29 Jul 2025

Viewed by 205

Abstract

Unmanned aerial vehicles (UAVs) are increasingly used for road infrastructure inspection and monitoring. However, challenges such as scale variation, complex background interference, and the scarcity of annotated UAV datasets limit the performance of traditional segmentation models. To address these challenges, this study proposes UAV-YOLOv12, a multi-scale segmentation model specifically designed for UAV-based road imagery analysis. The proposed model builds on the YOLOv12 architecture by adding two key modules. It uses a Selective Kernel Network (SKNet) to adjust receptive fields dynamically and a Partial Convolution (PConv) module to improve spatial focus and robustness in occluded regions. These enhancements help the model better detect small and irregular road features in complex aerial scenes. Experimental results on a custom UAV dataset collected from national highways in Wuxi, China, show that UAV-YOLOv12 achieves F1-scores of 0.902 for highways (road-H) and 0.825 for paths (road-P), outperforming the original YOLOv12 by 5% and 3.2%, respectively. Inference speed is maintained at 11.1 ms per image, supporting near real-time performance. Moreover, comparative evaluations with U-Net show that UAV-YOLOv12 improves by 7.1% and 9.5%. The model also exhibits strong generalization ability, achieving F1-scores above 0.87 on public datasets such as VHR-10 and the Drone Vehicle dataset. These results demonstrate that the proposed UAV-YOLOv12 can achieve high accuracy and robustness in diverse road environments and object scales. Full article

(This article belongs to the Special Issue Advances in Civil Applications of Unmanned Aircraft Systems: 2nd Edition)

17 pages, 1603 KiB

Open Access Perspective

A Perspective on Quality Evaluation for AI-Generated Videos

by Zhichao Zhang, Wei Sun and Guangtao Zhai

Sensors 2025, 25(15), 4668; https://doi.org/10.3390/s25154668 - 28 Jul 2025

Viewed by 148

Abstract

Recent breakthroughs in AI-generated content (AIGC) have transformed video creation, empowering systems to translate text, images, or audio into visually compelling stories. Yet reliable evaluation of these machine-crafted videos remains elusive because quality is governed not only by spatial fidelity within individual frames but also by temporal coherence across frames and precise semantic alignment with the intended message. The foundational role of sensor technologies is critical, as they determine the physical plausibility of AIGC outputs. In this perspective, we argue that multimodal large language models (MLLMs) are poised to become the cornerstone of next-generation video quality assessment (VQA). By jointly encoding cues from multiple modalities such as vision, language, sound, and even depth, the MLLM can leverage its powerful language understanding capabilities to assess the quality of scene composition, motion dynamics, and narrative consistency, overcoming the fragmentation of hand-engineered metrics and the poor generalization ability of CNN-based methods. Furthermore, we provide a comprehensive analysis of current methodologies for assessing AIGC video quality, including the evolution of generation models, dataset design, quality dimensions, and evaluation frameworks. We argue that advances in sensor fusion enable MLLMs to combine low-level physical constraints with high-level semantic interpretations, further enhancing the accuracy of visual quality assessment. Full article

(This article belongs to the Special Issue Perspectives in Intelligent Sensors and Sensing Systems)

13 pages, 3887 KiB

Open Access Article

Exploring 3D Roadway Modeling Techniques Using CAD and Unity3D

by Yingbing Yang, Yunchuan Sun and Yuhong Wang

Processes 2025, 13(8), 2399; https://doi.org/10.3390/pr13082399 - 28 Jul 2025

Viewed by 169

Abstract

To tackle the inefficiencies in 3D mine tunnel modeling and the tedious task of drawing centerlines, this study introduces a faster method for generating centerlines using CAD secondary development. Starting with the tunnel centerline, the research then dives into techniques for creating detailed 3D tunnel models. The team first broke down the steps and logic behind tunnel modeling, designing a 3D tunnel framework and its data structure—complete with key geometric components like traverse points, junctions, nodes, and centerlines. By refining older centerline drawing techniques, they built a CAD-powered tool that slashes time and effort. The study also harnessed advanced algorithms, such as surface fitting and curve lofting, to swiftly model tricky tunnel sections like curves and crossings. This method fixes common problems like warped or incomplete surfaces in linked tunnel models, delivering precise and lifelike 3D scenes for VR-based mining safety drills and simulations. Full article

(This article belongs to the Special Issue Oil and Gas Well Engineering: Experimental and Numerical Investigation)

18 pages, 5229 KiB

Open Access Article

Exploring the Spectral Variability of Estonian Lakes Using Spaceborne Imaging Spectroscopy

by Alice Fabbretto, Mariano Bresciani, Andrea Pellegrino, Kersti Kangro, Anna Joelle Greife, Lodovica Panizza, François Steinmetz, Joel Kuusk, Claudia Giardino and Krista Alikas

Appl. Sci. 2025, 15(15), 8357; https://doi.org/10.3390/app15158357 - 27 Jul 2025

Viewed by 239

Abstract

This study investigates the potential of spaceborne imaging spectroscopy to support the analysis of the status of two major Estonian lakes, i.e., Lake Peipsi and Lake Võrtsjärv, using data from the PRISMA and EnMAP missions. The study encompasses nine specific applications across 12 satellite scenes, including the validation of remote sensing reflectance (Rrs), optical water type classification, estimation of phycocyanin concentration, detection of macrophytes, and characterization of reflectance for lake ice/snow coverage. Rrs validation, which was performed using in situ measurements and Sentinel-2 and Sentinel-3 as references, showed a level of agreement with Spectral Angle < 16°. Hyperspectral imagery successfully captured fine-scale spatial and spectral features not detectable by multispectral sensors; in particular, it was possible to identify cyanobacterial pigments and optical variations driven by seasonal and meteorological dynamics. Through the combined use of in situ observations, the study can serve as a starting point for the use of hyperspectral data in northern freshwater systems, offering new insights into ecological processes. Given the increasing global concern over freshwater ecosystem health, this work provides a transferable framework for leveraging new-generation hyperspectral missions to enhance water quality monitoring on a global scale. Full article

(This article belongs to the Special Issue Applications of Remote Sensing for Natural Hazard and Environment Monitoring)

30 pages, 92065 KiB

Open Access Article

A Picking Point Localization Method for Table Grapes Based on PGSS-YOLOv11s and Morphological Strategies

by Jin Lu, Zhongji Cao, Jin Wang, Zhao Wang, Jia Zhao and Minjie Zhang

Agriculture 2025, 15(15), 1622; https://doi.org/10.3390/agriculture15151622 - 26 Jul 2025

Viewed by 237

Abstract

During the automated picking of table grapes, the automatic recognition and segmentation of grape pedicels, along with the positioning of picking points, are vital components for all the following operations of the harvesting robot. In the actual scene of a grape plantation, however, it is extremely difficult to accurately and efficiently identify and segment grape pedicels and then reliably locate the picking points. This is attributable to the low distinguishability between grape pedicels and the surrounding environment such as branches, as well as the impacts of other conditions like weather, lighting, and occlusion, which are coupled with the requirements for model deployment on edge devices with limited computing resources. To address these issues, this study proposes a novel picking point localization method for table grapes based on an instance segmentation network called Progressive Global-Local Structure-Sensitive Segmentation (PGSS-YOLOv11s) and a simple combination strategy of morphological operators. More specifically, the network PGSS-YOLOv11s is composed of an original backbone of the YOLOv11s-seg, a spatial feature aggregation module (SFAM), an adaptive feature fusion module (AFFM), and a detail-enhanced convolutional shared detection head (DE-SCSH). PGSS-YOLOv11s was trained on a new grape segmentation dataset called Grape-⊥, which includes 4455 grape pixel-level instances with the annotation of ⊥-shaped regions. After PGSS-YOLOv11s segments the ⊥-shaped regions of grapes, morphological operations such as erosion, dilation, and skeletonization are combined to effectively extract grape pedicels and locate picking points. Finally, several experiments have been conducted to confirm the validity, effectiveness, and superiority of the proposed method. Compared with the other state-of-the-art models, the main metrics F1 score and mask mAP@0.5 of the PGSS-YOLOv11s reached 94.6% and 95.2% on the Grape-⊥ dataset, as well as 85.4% and 90.0% on the Winegrape dataset. Multi-scenario tests indicated that the success rate of positioning the picking points reached up to 89.44%. In orchards, real-time tests on the edge device demonstrated the practical performance of our method. Nevertheless, for grapes with short pedicels or occluded pedicels, the designed morphological algorithm failed to compute some picking points. In future work, we will enrich the grape dataset by collecting images under different lighting conditions, from various shooting angles, and including more grape varieties to improve the method’s generalization performance. Full article

(This article belongs to the Section Artificial Intelligence and Digital Agriculture)
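
The abstract above mentions erosion and dilation as part of the morphological post-processing of the segmented mask. The sketch below illustrates these two operators on a small binary mask with a 3x3 square structuring element. It is a generic illustration under assumed conventions (list-of-lists masks, zero padding at the border), not the authors' pedicel-extraction algorithm, and it omits skeletonization.

```python
def _apply_3x3(mask, combine):
    """Slide a 3x3 window over a binary mask; pixels outside the image count as 0."""
    height, width = len(mask), len(mask[0])
    out = [[0] * width for _ in range(height)]
    for r in range(height):
        for c in range(width):
            window = [
                mask[rr][cc] if 0 <= rr < height and 0 <= cc < width else 0
                for rr in (r - 1, r, r + 1)
                for cc in (c - 1, c, c + 1)
            ]
            out[r][c] = 1 if combine(window) else 0
    return out


def erode(mask):
    """Keep a pixel only if its whole 3x3 neighbourhood is foreground."""
    return _apply_3x3(mask, all)


def dilate(mask):
    """Set a pixel if any pixel in its 3x3 neighbourhood is foreground."""
    return _apply_3x3(mask, any)


def opening(mask):
    """Erosion followed by dilation: removes specks smaller than the window."""
    return dilate(erode(mask))
```

Applying `opening` to a mask removes isolated noise pixels while restoring the shape of regions that are at least as large as the structuring element, which is a typical clean-up step before extracting thin structures.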

20 pages, 3386 KiB

Open Access Article

Design of Realistic and Artistically Expressive 3D Facial Models for Film AIGC: A Cross-Modal Framework Integrating Audience Perception Evaluation

by Yihuan Tian, Xinyang Li, Zuling Cheng, Yang Huang and Tao Yu

Sensors 2025, 25(15), 4646; https://doi.org/10.3390/s25154646 - 26 Jul 2025

Viewed by 317

Abstract

The rise of virtual production has created an urgent need for both efficient and high-fidelity 3D face generation schemes for cinema and immersive media, but existing methods are often limited by lighting–geometry coupling, multi-view dependency, and insufficient artistic quality. To address this, this study proposes a cross-modal 3D face generation framework based on single-view semantic masks. It utilizes Swin Transformer for multi-level feature extraction and combines with NeRF for illumination decoupled rendering. We utilize physical rendering equations to explicitly separate surface reflectance from ambient lighting to achieve robust adaptation to complex lighting variations. In addition, to address geometric errors across illumination scenes, we construct geometric a priori constraint networks by mapping 2D facial features to 3D parameter space as regular terms with the help of semantic masks. On the CelebAMask-HQ dataset, this method achieves a leading score of SSIM = 0.892 (37.6% improvement from baseline) with FID = 40.6. The generated faces excel in symmetry and detail fidelity with realism and aesthetic scores of 8/10 and 7/10, respectively, in a perceptual evaluation with 1000 viewers. By combining physical-level illumination decoupling with semantic geometry a priori, this paper establishes a quantifiable feedback mechanism between objective metrics and human aesthetic evaluation, providing a new paradigm for aesthetic quality assessment of AI-generated content. Full article

(This article belongs to the Special Issue Convolutional Neural Network Technology for 3D Imaging and Sensing)

21 pages, 4388 KiB

Open Access Feature Paper Article

An Omni-Dimensional Dynamic Convolutional Network for Single-Image Super-Resolution Tasks

by Xi Chen, Ziang Wu, Weiping Zhang, Tingting Bi and Chunwei Tian

Mathematics 2025, 13(15), 2388; https://doi.org/10.3390/math13152388 - 25 Jul 2025

Viewed by 229

Abstract

The goal of single-image super-resolution (SISR) tasks is to generate high-definition images from low-quality inputs, with practical uses spanning healthcare diagnostics, aerial imaging, and surveillance systems. Although CNNs have considerably improved image reconstruction quality, existing methods still face limitations, including inadequate restoration of high-frequency details, high computational complexity, and insufficient adaptability to complex scenes. To address these challenges, we propose an Omni-dimensional Dynamic Convolutional Network (ODConvNet) tailored for SISR tasks. Specifically, ODConvNet comprises four key components: a Feature Extraction Block (FEB) that captures low-level spatial features; an Omni-dimensional Dynamic Convolution Block (DCB), which utilizes a multidimensional attention mechanism to dynamically reweight convolution kernels across spatial, channel, and kernel dimensions, thereby enhancing feature expressiveness and context modeling; a Deep Feature Extraction Block (DFEB) that stacks multiple convolutional layers with residual connections to progressively extract and fuse high-level features; and a Reconstruction Block (RB) that employs subpixel convolution to upscale features and refine the final HR output. This mechanism significantly enhances feature extraction and effectively captures rich contextual information. Additionally, we employ an improved residual network structure combined with a refined Charbonnier loss function to alleviate vanishing and exploding gradients and enhance the robustness of model training. Extensive experiments conducted on widely used benchmark datasets, including DIV2K, Set5, Set14, B100, and Urban100, demonstrate that, compared with existing deep learning-based SR methods, our ODConvNet method improves Peak Signal-to-Noise Ratio (PSNR) and Structural Similarity Index (SSIM), and the visual quality of SR images is also improved. Ablation studies further validate the effectiveness and contribution of each component in our network. The proposed ODConvNet offers an effective, flexible, and efficient solution for the SISR task and provides promising directions for future research. Full article
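
The abstract above refers to a Charbonnier loss and to PSNR as an evaluation metric. As a hedged illustration, the sketch below shows the standard Charbonnier loss and the usual PSNR definition on flat lists of pixel values. The paper's "refined" loss is not specified in the abstract, so this is the textbook form only; the `eps` value and the `max_value` of 1.0 are assumptions.

```python
import math


def charbonnier_loss(pred, target, eps=1e-3):
    """Standard Charbonnier loss: mean of sqrt((p - t)^2 + eps^2).

    It behaves like L1 for large errors but stays smooth near zero;
    eps is an assumed small constant.
    """
    return sum(math.sqrt((p - t) ** 2 + eps ** 2) for p, t in zip(pred, target)) / len(pred)


def psnr(pred, target, max_value=1.0):
    """Peak signal-to-noise ratio in dB for pixel values in [0, max_value]."""
    mse = sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)
```

Because the square root never reaches a non-differentiable point, the Charbonnier form tends to give more stable gradients than plain L1 when the prediction is already close to the target.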

8 pages, 4452 KiB

Open Access Proceeding Paper

Synthetic Aperture Radar Imagery Modelling and Simulation for Investigating the Composite Scattering Between Targets and the Environment

by Raphaël Valeri, Fabrice Comblet, Ali Khenchaf, Jacques Petit-Frère and Philippe Pouliguen

Eng. Proc. 2025, 94(1), 11; https://doi.org/10.3390/engproc2025094011 - 25 Jul 2025

Viewed by 192

Abstract

The high resolution of Synthetic Aperture Radar (SAR) imagery, in addition to its capability to see through clouds and rain, makes it a crucial remote sensing technique. However, SAR images are very sensitive to radar parameters, the observation geometry and the scene’s characteristics. Moreover, for a complex scene of interest with targets located on a rough soil, a composite scattering between the target and the surface occurs and creates distortions on the SAR image. These characteristics can make the SAR images difficult to analyse and process. To better understand the complex EM phenomena and their signature in the SAR image, we propose a methodology to generate raw SAR signals and SAR images for scenes of interest with a target located on a rough surface. With this prospect, the entire radar acquisition chain is considered: the sensor parameters, the atmospheric attenuation, the interactions between the incident EM field and the scene, and the SAR image formation. Simulation results are presented for a rough dielectric soil and a canonical target considered as a Perfect Electric Conductor (PEC). These results highlight the importance of the composite scattering signature between the target and the soil. Its power is 21 dB higher than that of the target for the target–soil configuration considered. Finally, these simulations allow for the retrieval of characteristics present in actual SAR images and show the potential of the presented model in investigating EM phenomena and their signatures in SAR images. Full article

(This article belongs to the Proceedings of The 1st International Conference on Advanced Remote Sensing – Shaping Sustainable Global Landscapes (ICARS 2025))

26 pages, 15535 KiB

Open Access Article

BCA-MVSNet: Integrating BIFPN and CA for Enhanced Detail Texture in Multi-View Stereo Reconstruction

by Ning Long, Zhengxu Duan, Xiao Hu and Mingju Chen

Electronics 2025, 14(15), 2958; https://doi.org/10.3390/electronics14152958 - 24 Jul 2025

Viewed by 129

Abstract

The 3D point cloud generated by MVSNet has good scene integrity but lacks sensitivity to details, causing holes and non-dense areas in flat and weak-texture regions. To address this problem and enhance the point cloud information of weak-texture areas, the BCA-MVSNet network is proposed in this paper. The network integrates the Bidirectional Feature Pyramid Network (BIFPN) into the feature processing of the MVSNet backbone network to accurately extract the features of weak-texture regions. In the feature map fusion stage, the Coordinate Attention (CA) mechanism is introduced into the 3D U-Net to obtain direction-related position information along the channel dimension, improve detail feature extraction, optimize the depth map, and improve depth accuracy. The experimental results show that BCA-MVSNet not only improves the accuracy of detail texture reconstruction, but also effectively controls the computational overhead. On the DTU dataset, the Overall and Comp metrics of BCA-MVSNet are reduced by 10.2% and 2.6%, respectively; on the Tanks and Temples dataset, the Mean metric over the eight scenarios is improved by 6.51%. Three scenes were also captured with a binocular camera, and combining the camera parameters with the BCA-MVSNet model yields excellent reconstruction quality in the weak-texture areas. Full article

22 pages, 9071 KiB

Open Access Article

Integrating UAV-Based RGB Imagery with Semi-Supervised Learning for Tree Species Identification in Heterogeneous Forests

by Bingru Hou, Chenfeng Lin, Mengyuan Chen, Mostafa M. Gouda, Yunpeng Zhao, Yuefeng Chen, Fei Liu and Xuping Feng

Remote Sens. 2025, 17(15), 2541; https://doi.org/10.3390/rs17152541 - 22 Jul 2025

Viewed by 271

Abstract

The integration of unmanned aerial vehicle (UAV) remote sensing and deep learning has emerged as a highly effective strategy for inventorying forest resources. However, the spatiotemporal variability of forest environments and the scarcity of annotated data hinder the performance of conventional supervised deep-learning models. To overcome these challenges, this study has developed efficient tree (ET), a semi-supervised tree detector designed for forest scenes. ET employed an enhanced YOLO model (YOLO-Tree) as a base detector and incorporated a teacher–student semi-supervised learning (SSL) framework based on pseudo-labeling, effectively leveraging abundant unlabeled data to bolster model robustness. The results revealed that SSL significantly improved outcomes in scenarios with sparse labeled data, specifically when the annotation proportion was below 50%. Additionally, employing overlapping cropping as a data augmentation strategy mitigated instability during semi-supervised training under conditions of limited sample size. Notably, introducing unlabeled data from external sites enhances the accuracy and cross-site generalization of models trained on diverse datasets, achieving impressive results with F1, mAP50, and mAP50-95 scores of 0.979, 0.992, and 0.871, respectively. In conclusion, this study highlights the potential of combining UAV-based RGB imagery with SSL to advance tree species identification in heterogeneous forests. Full article

(This article belongs to the Special Issue Remote Sensing-Assisted Forest Inventory Planning)
