Sensors

Research

25 pages, 8383 KB

Open Access Article

MemLoTrack: Enhancing TIR Anti-UAV Tracking with Memory-Integrated Low-Rank Adaptation

by Jae Kwan Park and Ji-Hyeong Han

Sensors 2025, 25(23), 7359; https://doi.org/10.3390/s25237359 - 3 Dec 2025

Viewed by 142

Abstract

Tracking small, fast-moving unmanned aerial vehicles (UAVs) in thermal infrared (TIR) imagery is a significant challenge due to low-resolution targets, dynamic background clutter, and frequent occlusions. To address this, we introduce MemLoTrack, a novel one-stream Vision Transformer tracker that integrates a memory mechanism into a parameter-efficient LoRA framework. MemLoTrack enhances a baseline tracker (LoRAT) with two key components: (i) a gated First-In, First-Out (FIFO) memory bank (MB) for temporal context aggregation and (ii) a lightweight Memory Attention Layer (MAL) for effective information retrieval. A key component of our method is a selective memory update policy, which commits a frame to the memory bank only when it satisfies both a classification confidence threshold (τ) and a Kalman filter-based motion consistency check. This gating mechanism robustly prevents memory contamination due to distractors, occlusions, and reappearance events. Our training is highly efficient, updating only the LoRA adapters, MAL, and prediction head while the pretrained DINOv2 backbone remains frozen. Evaluated on the challenging Anti-UAV410 benchmark, MemLoTrack (L_mem = 7, τ = 0.8) achieves an AUC of 63.6 and a State Accuracy (SA) of 64.0, improving on the LoRAT baseline by +1.4 AUC and +1.5 SA. Compared to the state-of-the-art method FocusTrack, MemLoTrack demonstrates superior robustness with higher AUC (63.6 vs. 62.8) and SA (64.0 vs. 63.9), at the cost of lower precision (P/P-Norm) scores. Furthermore, MemLoTrack operates at 153 FPS on a single RTX 4070 Ti SUPER, demonstrating that parameter-efficient fine-tuning with a selective memory mechanism is a powerful and deployable strategy for real-time Anti-UAV tracking in demanding TIR environments.
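The selective memory update described in this abstract can be pictured with a minimal Python sketch. It is not the authors' implementation: the class name, the `max_dist` pixel gate, and the use of a plain distance between a predicted center (for example, from a Kalman filter) and the observed center are all assumptions made for illustration. The defaults mirror the reported L_mem = 7 and τ = 0.8, on the assumption that L_mem denotes the memory bank length.

```python
from collections import deque
from math import hypot


class GatedMemoryBank:
    """FIFO memory that accepts a frame only if it passes both gates (illustrative)."""

    def __init__(self, capacity=7, tau=0.8, max_dist=20.0):
        self.frames = deque(maxlen=capacity)  # oldest entry is evicted first
        self.tau = tau                        # classification confidence threshold
        self.max_dist = max_dist              # hypothetical motion gate, in pixels

    def maybe_commit(self, feature, confidence, predicted_center, observed_center):
        """Store `feature` only when it is confident and consistent with predicted motion."""
        dx = observed_center[0] - predicted_center[0]
        dy = observed_center[1] - predicted_center[1]
        motion_ok = hypot(dx, dy) <= self.max_dist
        if confidence >= self.tau and motion_ok:
            self.frames.append(feature)
            return True
        return False
```

A frame that is confident but far from the predicted position (a likely distractor) is rejected, as is a frame near the prediction with low confidence (a likely occlusion), so neither can contaminate the stored context.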

(This article belongs to the Special Issue Vision Sensors for Object Detection and Tracking)

26 pages, 11944 KB

Open Access Article

Lightweight 3D Multi-Object Tracking via Collaborative Camera and LiDAR Sensors

by Dong Feng, Hengyuan Liu and Zhiyu Liu

Sensors 2025, 25(23), 7351; https://doi.org/10.3390/s25237351 - 3 Dec 2025

Viewed by 198

Abstract

With the widespread adoption of camera and LiDAR sensors, 3D multi-object tracking (MOT) technology has been extensively applied across numerous fields such as robotics, autonomous driving, and surveillance. However, existing 3D MOT methods still face significant challenges in addressing issues such as false detections, ghost trajectories, incorrect associations, and identity switches. To address these challenges, we propose a lightweight 3D multi-object tracking framework that uses collaborative camera and LiDAR sensors. First, we design a confidence inverse normalization-guided ghost trajectory suppression module (CIGTS). This module suppresses false detections and ghost trajectories at their source using inverse normalization and a virtual trajectory survival frame strategy. Second, we propose an adaptive matching space-driven lightweight association module (AMSLA). By discarding global association strategies, this module improves association efficiency and accuracy using low-cost decision factors. Finally, we construct a multi-factor collaborative perception-based intelligent trajectory management module (MFCTM). This module enables accurate retention or deletion decisions for unmatched trajectories, thereby reducing computational overhead and the risk of identity mismatches. Extensive experiments on the KITTI dataset show that the proposed method outperforms state-of-the-art methods across multiple performance metrics, achieving Higher Order Tracking Accuracy (HOTA) scores of 80.13% and 53.24% for the Car and Pedestrian categories, respectively.

(This article belongs to the Special Issue Vision Sensors for Object Detection and Tracking)

19 pages, 1528 KB

Open Access Article

Keyword-Conditioned Image Segmentation via the Cross-Attentive Alignment of Language and Vision Sensor Data

by Hye Rim Kim and Byoung Chul Ko

Sensors 2025, 25(20), 6353; https://doi.org/10.3390/s25206353 - 14 Oct 2025

Viewed by 867

Abstract

Advancements in multimodal large language models have opened up new possibilities for reasoning-based image segmentation by jointly processing visual and linguistic information. However, existing approaches often suffer from a semantic discrepancy between language interpretation and visual segmentation because query understanding and segmentation execution lack a structural connection. To address this issue, we propose a keyword-conditioned image segmentation model (KeySeg), a novel architecture that explicitly encodes and integrates inferred query conditions into the segmentation process. KeySeg embeds the core concepts extracted from multimodal inputs into a dedicated [KEY] token, which is then fused with a [SEG] token through a cross-attention-based fusion module. This design enables the model to reflect query conditions explicitly and precisely in the segmentation criteria. Additionally, we introduce a keyword alignment loss that guides the [KEY] token to align closely with the semantic core of the input query, thereby enhancing the accuracy of condition interpretation. By separating the roles of condition reasoning and segmentation instruction, and making their interactions explicit, KeySeg achieves both expressive capacity and interpretative stability, even under complex language conditions.

(This article belongs to the Special Issue Vision Sensors for Object Detection and Tracking)

19 pages, 4802 KB

Open Access Article

Enhanced SOLOv2: An Effective Instance Segmentation Algorithm for Densely Overlapping Silkworms

by Jianying Yuan, Hao Li, Chen Cheng, Zugui Liu, Sidong Wu and Dequan Guo

Sensors 2025, 25(18), 5703; https://doi.org/10.3390/s25185703 - 12 Sep 2025

Viewed by 730

Abstract

Silkworm instance segmentation is crucial for individual silkworm behavior analysis and health monitoring in intelligent sericulture, as the segmentation accuracy directly influences the reliability of subsequent biological parameter estimation. In real farming environments, silkworms often exhibit high density and severe mutual occlusion, posing significant challenges for traditional instance segmentation algorithms. To address these issues, this paper proposes an enhanced SOLOv2 algorithm. Specifically, (1) in the backbone network, Linear Deformable Convolution (LDC) is incorporated to strengthen the geometric feature modeling of curved silkworms. A Haar Wavelet Downsampling (HWD) module is designed to better preserve details for partially visible targets, and an Edge-Augmented Multi-attention Fusion Network (EAMF-Net) is constructed to improve boundary discrimination in overlapping regions. (2) In the mask branch, Dynamic Upsampling (Dysample), Adaptive Spatial Feature Fusion (ASFF), and Simple Attention Module (SimAM) are integrated to refine the quality of segmentation masks. Experiments conducted on a self-built high-density silkworm dataset demonstrate that the proposed method achieves an Average Precision (AP) of 85.1%, with significant improvements over the baseline model in small- (APs: +10.2%), medium- (APm: +4.0%), and large-target (APl: +2.0%) segmentation accuracy. This effectively advances precision in dense silkworm segmentation scenarios.
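To illustrate why a Haar wavelet step can preserve detail that strided pooling discards, the sketch below applies a single-level 2D Haar transform in plain Python. It covers only the transform itself; how the paper's HWD module processes the resulting subbands is not stated in the abstract, and the function name and subband names here are assumptions.

```python
def haar_downsample(image):
    """Split an even-sized 2D grid into four half-resolution Haar subbands."""
    rows, cols = len(image), len(image[0])
    if rows % 2 or cols % 2:
        raise ValueError("height and width must be even")
    avg, horiz, vert, diag = [], [], [], []
    for r in range(0, rows, 2):
        avg_row, horiz_row, vert_row, diag_row = [], [], [], []
        for c in range(0, cols, 2):
            a, b = image[r][c], image[r][c + 1]
            d, e = image[r + 1][c], image[r + 1][c + 1]
            avg_row.append((a + b + d + e) / 2)    # scaled local average
            horiz_row.append((a - b + d - e) / 2)  # left-right difference
            vert_row.append((a + b - d - e) / 2)   # top-bottom difference
            diag_row.append((a - b - d + e) / 2)   # diagonal difference
        avg.append(avg_row)
        horiz.append(horiz_row)
        vert.append(vert_row)
        diag.append(diag_row)
    return avg, horiz, vert, diag
```

The four outputs together hold the same information as the input at half the spatial resolution, so edge detail moves into extra channels instead of being averaged away.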

(This article belongs to the Special Issue Vision Sensors for Object Detection and Tracking)

20 pages, 33417 KB

Open Access Article

Enhancing UAV Object Detection in Low-Light Conditions with ELS-YOLO: A Lightweight Model Based on Improved YOLOv11

by Tianhang Weng and Xiaopeng Niu

Sensors 2025, 25(14), 4463; https://doi.org/10.3390/s25144463 - 17 Jul 2025

Cited by 3 | Viewed by 2119

Abstract

Drone-view object detection models operating under low-light conditions face several challenges, such as object scale variations, high image noise, and limited computational resources. Existing models often struggle to balance accuracy and lightweight architecture. This paper introduces ELS-YOLO, a lightweight object detection model tailored for low-light environments, built upon the YOLOv11s framework. ELS-YOLO features a re-parameterized backbone (ER-HGNetV2) with integrated Re-parameterized Convolution and Efficient Channel Attention mechanisms, a Lightweight Feature Selection Pyramid Network (LFSPN) for multi-scale object detection, and a Shared Convolution Separate Batch Normalization Head (SCSHead) to reduce computational complexity. Layer-Adaptive Magnitude-Based Pruning (LAMP) is employed to compress the model size. Experiments on the ExDark and DroneVehicle datasets demonstrate that ELS-YOLO achieves high detection accuracy with a compact model: it attains a mAP@0.5 of 74.3% and 68.7% on the ExDark and DroneVehicle datasets, respectively, while maintaining real-time inference capability.
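For readers unfamiliar with LAMP, the following Python sketch computes per-weight scores for one layer using the usual published definition: each weight's squared magnitude divided by the sum of squared magnitudes of all weights in the layer that are at least as large. How ELS-YOLO applies these scores (sparsity target, structured versus unstructured pruning) is not given in the abstract, so the function name and the flat weight list are illustrative assumptions.

```python
def lamp_scores(weights):
    """Return a LAMP score for every weight in a single layer (flat list)."""
    # Visit weights from smallest to largest magnitude.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    scores = [0.0] * len(weights)
    tail = sum(w * w for w in weights)  # squared norm of weights not yet visited
    for i in order:
        squared = weights[i] ** 2
        scores[i] = squared / tail if tail > 0 else 0.0
        tail -= squared  # drop this weight from the remaining mass
    return scores
```

Because the largest weight in every layer scores 1.0, pruning the globally lowest scores never removes an entire layer, which is the property that makes the score layer-adaptive.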

(This article belongs to the Special Issue Vision Sensors for Object Detection and Tracking)

19 pages, 1636 KB

Open Access Article

Scene Graph and Natural Language-Based Semantic Image Retrieval Using Vision Sensor Data

by Jaehoon Kim and Byoung Chul Ko

Sensors 2025, 25(11), 3252; https://doi.org/10.3390/s25113252 - 22 May 2025

Cited by 1 | Viewed by 2293

Abstract

Text-based image retrieval is one of the most common approaches for searching images acquired from vision sensors such as cameras. However, this method suffers from limitations in retrieval accuracy, particularly when the query contains limited information or involves previously unseen sentences. These challenges arise because keyword-based matching fails to adequately capture contextual and semantic meanings. To address these limitations, we propose a novel approach that transforms sentences and images into semantic graphs and scene graphs, enabling a quantitative comparison between them. Specifically, we utilize a graph neural network (GNN) to learn features of nodes and edges and generate graph embeddings, enabling image retrieval through natural language queries without relying on additional image metadata. We introduce a contrastive GNN-based framework that matches semantic graphs with scene graphs to retrieve semantically similar images. In addition, we incorporate a hard negative mining strategy, allowing the model to effectively learn from more challenging negative samples. The experimental results on the Visual Genome dataset show that the proposed method achieves a top nDCG@50 score of 0.745, improving retrieval performance by approximately 7.7 percentage points compared to random sampling with full graphs. This confirms that the model effectively retrieves semantically relevant images by structurally interpreting complex scenes.
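The nDCG@50 figure reported above is a ranking metric. A minimal Python sketch of nDCG@k follows, assuming a linear gain and an ideal ranking built from the same candidate list; the paper's exact gain function and relevance definition are not stated in the abstract, so treat both as assumptions.

```python
from math import log2


def dcg_at_k(relevances, k):
    """Discounted cumulative gain of the first k items (linear gain)."""
    return sum(rel / log2(rank + 2) for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(relevances, k):
    """DCG normalized by the DCG of the best possible ordering of the same items."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0
```

Here `relevances` lists the relevance of each retrieved image in ranked order, so a perfectly ordered result scores 1.0 and relevant images pushed down the list lower the score.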

(This article belongs to the Special Issue Vision Sensors for Object Detection and Tracking)

25 pages, 8373 KB

Open Access Article

Efficacy of Segmentation for Hyperspectral Target Detection

by Yoram Furth and Stanley R. Rotman

Sensors 2025, 25(1), 272; https://doi.org/10.3390/s25010272 - 6 Jan 2025

Viewed by 1277

Abstract

Algorithms for detecting point targets in hyperspectral imaging commonly employ the spectral inverse covariance matrix to whiten inherent image noise. Since data cubes often lack stationarity, segmentation appears to be an attractive preprocessing operation. Surprisingly, the literature reports both successful and unsuccessful segmentation cases, with no clear explanations for these divergent outcomes. This paper elucidates the conditions under which segmentation might improve detector performance. Focusing on a representative algorithm and assuming a target additive model, the study examines all influential factors through theoretical analysis and extensive simulations. The findings offer fundamental insights and practical guidelines for characterizing segmented datasets, enabling a thorough evaluation of segmentation’s utility for detector performance. They outline the range of target scenarios and parameters where segmentation may prove beneficial and help assess the potential impact of proposed segmentation strategies on detection outcomes.

(This article belongs to the Special Issue Vision Sensors for Object Detection and Tracking)

20 pages, 2772 KB

Open Access Article

Activities of Daily Living Object Dataset: Advancing Assistive Robotic Manipulation with a Tailored Dataset

by Md Tanzil Shahria and Mohammad H. Rahman

Sensors 2024, 24(23), 7566; https://doi.org/10.3390/s24237566 - 27 Nov 2024

Cited by 4 | Viewed by 2770

Abstract

The increasing number of individuals with disabilities (over 61 million adults in the United States alone) underscores the urgent need for technologies that enhance autonomy and independence. Among these individuals, millions rely on wheelchairs and often require assistance from another person with activities of daily living (ADLs), such as eating, grooming, and dressing. Wheelchair-mounted assistive robotic arms offer a promising solution to enhance independence, but their complex control interfaces can be challenging for users. Automating control through deep learning-based object detection models presents a viable pathway to simplify operation, yet progress is impeded by the absence of specialized datasets tailored for ADL objects suitable for robotic manipulation in home environments. To bridge this gap, we present a novel ADL object dataset explicitly designed for training deep learning models in assistive robotic applications. We curated over 112,000 high-quality images from four major open-source datasets (COCO, Open Images, LVIS, and Roboflow Universe), focusing on objects pertinent to daily living tasks. Annotations were standardized to the YOLO Darknet format, and data quality was enhanced through a rigorous filtering process involving a pre-trained YOLOv5x model and manual validation. Our dataset provides a valuable resource that facilitates the development of more effective and user-friendly semi-autonomous control systems for assistive robots. By offering a focused collection of ADL-related objects, we aim to advance assistive technologies that empower individuals with mobility impairments, addressing a pressing societal need and laying the foundation for future innovations in human–robot interaction within home settings.
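Standardizing annotations to the YOLO Darknet format mostly means converting each box to a normalized center-based form. The Python sketch below shows that conversion for a COCO-style box given as `[x_min, y_min, width, height]` in pixels. The function names are hypothetical and this is not the authors' conversion script; the other source datasets use their own box conventions and would need separate handling.

```python
def coco_to_yolo(bbox, img_w, img_h):
    """Convert [x_min, y_min, width, height] in pixels to normalized center form."""
    x_min, y_min, w, h = bbox
    x_center = (x_min + w / 2) / img_w
    y_center = (y_min + h / 2) / img_h
    return x_center, y_center, w / img_w, h / img_h


def yolo_label_line(class_id, bbox, img_w, img_h):
    """Format one annotation as a line of a YOLO label file."""
    xc, yc, w, h = coco_to_yolo(bbox, img_w, img_h)
    return f"{class_id} {xc:.6f} {yc:.6f} {w:.6f} {h:.6f}"
```

For example, `yolo_label_line(3, [100, 50, 200, 100], 400, 200)` produces `3 0.500000 0.500000 0.500000 0.500000`, one such line per object in the image's `.txt` label file.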

(This article belongs to the Special Issue Vision Sensors for Object Detection and Tracking)

21 pages, 2544 KB

Open Access Article

An Energy-Efficient Dynamic Feedback Image Signal Processor for Three-Dimensional Time-of-Flight Sensors

by Yongsoo Kim, Jaehyeon So, Chanwook Hwang, Wencan Cheng and Jong Hwan Ko

Sensors 2024, 24(21), 6918; https://doi.org/10.3390/s24216918 - 28 Oct 2024

Cited by 1 | Viewed by 1656

Abstract

With the recent prominence of artificial intelligence (AI) technology, various research outcomes and applications in the field of image recognition and processing utilizing AI have been continuously emerging. In particular, the domain of object recognition using 3D time-of-flight (ToF) sensors has been actively researched, often in conjunction with augmented reality (AR) and virtual reality (VR). However, more precise analysis requires high-quality images, which in turn demand significantly more parameters and computation. These requirements can pose challenges, especially in developing AR and VR technologies for low-power portable devices. Therefore, we propose a dynamic feedback configuration image signal processor (ISP) for 3D ToF sensors. The ISP achieves both accuracy and energy efficiency through dynamic feedback. The proposed ISP employs dynamic area extraction to perform computations and post-processing only for pixels within the valid area used by the application in each frame. Additionally, it uses dynamic resolution to determine and apply the appropriate resolution for each frame. This approach enhances energy efficiency by avoiding the processing of all sensor data while maintaining or surpassing accuracy levels. Furthermore, these functionalities are designed for hardware-efficient implementation, improving processing speed and minimizing power consumption. The results show a maximum performance of 178 fps and a high energy efficiency of up to 123.15 fps/W. When connected to the hand pose estimation (HPE) accelerator, it demonstrates an average mean squared error (MSE) of 10.03 mm, surpassing the baseline ISP value of 20.25 mm. Therefore, the proposed ISP can be effectively utilized in low-power, small form-factor devices.

(This article belongs to the Special Issue Vision Sensors for Object Detection and Tracking)

20 pages, 3099 KB

Open Access Article

Diffusion Models-Based Purification for Common Corruptions on Robust 3D Object Detection

by Mumuxin Cai, Xupeng Wang, Ferdous Sohel and Hang Lei

Sensors 2024, 24(16), 5440; https://doi.org/10.3390/s24165440 - 22 Aug 2024

Cited by 5 | Viewed by 2353

Abstract

LiDAR sensors have been shown to generate data with various common corruptions, which seriously affect their applications in 3D vision tasks, particularly object detection. At the same time, it has been demonstrated that traditional defense strategies, including adversarial training, are prone to suffering from gradient confusion during training. Moreover, they can only improve their robustness against specific types of data corruption. In this work, we propose LiDARPure, which leverages the powerful generation ability of diffusion models to purify corruption in the LiDAR scene data. By dividing the entire scene into voxels to facilitate the processes of diffusion and reverse diffusion, LiDARPure overcomes challenges induced by adversarial training, such as sparse point clouds in large-scale LiDAR data and gradient confusion. In addition, we utilize the latent geometric features of a scene as a condition to assist the generation of diffusion models. Detailed experiments show that LiDARPure can effectively purify 19 common types of LiDAR data corruption. Further evaluation results demonstrate that it can improve the average precision of 3D object detectors to an extent of 20% in the face of data corruption, much higher than existing defense strategies.
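Dividing a scene into voxels, as described above, amounts to bucketing points by their integer grid coordinates. The Python sketch below shows only that bucketing step, with a hypothetical function name and a uniform cubic voxel size chosen for illustration; LiDARPure's actual voxel layout and the diffusion stages that follow are not specified in the abstract.

```python
from collections import defaultdict
from math import floor


def voxelize(points, voxel_size):
    """Group (x, y, z) points by the index of the cubic voxel that contains them."""
    voxels = defaultdict(list)
    for x, y, z in points:
        key = (floor(x / voxel_size), floor(y / voxel_size), floor(z / voxel_size))
        voxels[key].append((x, y, z))
    return dict(voxels)
```

Each non-empty voxel can then be processed on its own, which keeps the per-step input small even when the full scan is large and sparse.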

(This article belongs to the Special Issue Vision Sensors for Object Detection and Tracking)

17 pages, 7257 KB

Open Access Article

Integrating Heuristic Methods with Deep Reinforcement Learning for Online 3D Bin-Packing Optimization

by Ching-Chang Wong, Tai-Ting Tsai and Can-Kun Ou

Sensors 2024, 24(16), 5370; https://doi.org/10.3390/s24165370 - 20 Aug 2024

Cited by 7 | Viewed by 4754

Abstract

This study proposes a method named Hybrid Heuristic Proximal Policy Optimization (HHPPO) for online 3D bin-packing tasks. The method integrates heuristic bin-packing algorithms with the Proximal Policy Optimization (PPO) algorithm of deep reinforcement learning. In the heuristic algorithms for bin-packing, an extreme point priority sorting method is proposed to sort the generated extreme points according to their waste spaces to improve space utilization. In addition, a 3D grid representation of the space status of the container is used, and partial support constraints are proposed to increase the possibilities for stacking objects and enhance overall space utilization. In the PPO algorithm, the heuristic algorithms are integrated, and the reward function and the action space of the policy network are designed so that the proposed method can effectively complete the online 3D bin-packing task. Experimental results show that the proposed method performs well on online 3D bin-packing tasks in several simulation environments. In addition, an environment with image vision is constructed to show that the proposed method enables an actual robot manipulator to complete the bin-packing task successfully and effectively in a real environment.
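As background on the PPO component, the sketch below computes the standard clipped surrogate objective in plain Python. The abstract does not say which PPO variant or hyperparameters HHPPO uses, so the clipped form, the function name, and the default `epsilon=0.2` are assumptions drawn from common PPO practice, not from the paper.

```python
def ppo_clip_objective(ratios, advantages, epsilon=0.2):
    """Mean clipped surrogate objective over a batch of samples.

    ratios: new-policy / old-policy action probabilities, one per sample.
    advantages: advantage estimates aligned with `ratios`.
    """
    terms = []
    for ratio, advantage in zip(ratios, advantages):
        clipped_ratio = min(max(ratio, 1 - epsilon), 1 + epsilon)
        # Taking the minimum removes any incentive to move the ratio outside the clip range.
        terms.append(min(ratio * advantage, clipped_ratio * advantage))
    return sum(terms) / len(terms)
```

In a hybrid design like the one described, heuristics such as extreme point sorting would plausibly shape which actions the policy can choose, while an objective of this kind drives the policy update.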

(This article belongs to the Special Issue Vision Sensors for Object Detection and Tracking)
