Search Results (948)

Search Parameters:
Keywords = spatial self-attention

22 pages, 2166 KB  
Article
Sound-to-Image Translation Through Direct Cross-Modal Connection Using a Convolutional–Attention Generative Model
by Leonardo A. Fanzeres, Climent Nadeu and José A. R. Fonollosa
Appl. Sci. 2026, 16(6), 2942; https://doi.org/10.3390/app16062942 - 18 Mar 2026
Abstract
Sound plays a fundamental role in human perception, conveying information about events, objects, and spatial dynamics that may not be visually accessible. However, current technologies such as Acoustic Event Detection typically reduce complex soundscapes to textual labels, often failing to preserve their semantic richness. This limitation motivates the exploration of sound-to-image (S2I) translation as an alternative connection between audio and visual modalities. Unlike multimodal approaches guided by intermediary constraints during the learning process, we investigate S2I translation without class supervision, cluster-based alignment, or textual mediation, a paradigm we refer to as direct S2I translation. To the best of our knowledge, apart from our previous work, no prior study addresses S2I translation under this fully direct setting. We propose a convolutional–attention generative framework composed of an audio encoder and a densely connected GAN integrating self-attention and cross-attention mechanisms. The attention-based model is systematically compared with a purely convolutional baseline. Results show that introducing attention at early stages of the generator significantly improves translation performance, increasing the likelihood of producing interpretable and semantically coherent visual representations of sound. These findings indicate that attention strengthens semantic correspondence between audio and vision while preserving the fully direct nature of the translation process. Full article
(This article belongs to the Section Computing and Artificial Intelligence)
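As a rough illustration of the cross-modal connection this abstract describes, the sketch below shows a generator feature map attending to audio-encoder tokens via cross-attention. The module and all dimensions are assumptions for exposition, not the authors' implementation.

```python
# Hypothetical sketch of audio-to-image cross-attention: generator feature
# maps (queries) attend to audio-encoder tokens (keys/values).
import torch
import torch.nn as nn

class AudioImageCrossAttention(nn.Module):
    def __init__(self, img_channels=128, audio_dim=256, num_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(
            embed_dim=img_channels, kdim=audio_dim, vdim=audio_dim,
            num_heads=num_heads, batch_first=True)
        self.norm = nn.LayerNorm(img_channels)

    def forward(self, img_feat, audio_tokens):
        # img_feat: (B, C, H, W) generator feature map
        # audio_tokens: (B, T, audio_dim) audio-encoder outputs
        b, c, h, w = img_feat.shape
        q = img_feat.flatten(2).transpose(1, 2)   # (B, H*W, C)
        out, _ = self.attn(q, audio_tokens, audio_tokens)
        out = self.norm(q + out)                  # residual + norm
        return out.transpose(1, 2).view(b, c, h, w)

x = torch.randn(2, 128, 16, 16)
a = torch.randn(2, 50, 256)
print(AudioImageCrossAttention()(x, a).shape)  # torch.Size([2, 128, 16, 16])
```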

23 pages, 13050 KB  
Article
BAWSeg: A UAV Multispectral Benchmark for Barley Weed Segmentation
by Haitian Wang, Xinyu Wang, Muhammad Ibrahim, Dustin Severtson and Ajmal Mian
Remote Sens. 2026, 18(6), 915; https://doi.org/10.3390/rs18060915 - 17 Mar 2026
Abstract
Accurate weed mapping in cereal fields requires pixel-level segmentation from unmanned aerial vehicle (UAV) imagery that remains reliable across fields, seasons, and illumination. Existing multispectral pipelines often depend on thresholded vegetation indices, which are brittle under radiometric drift and mixed crop–weed pixels, or on single-stream convolutional neural network (CNN) and Transformer backbones that ingest stacked bands and indices, where radiance cues and normalized index cues interfere and reduce sensitivity to small weed clusters embedded in crop canopy. We propose VISA (Vegetation Index and Spectral Attention), a two-stream segmentation network that decouples these cues and fuses them at native resolution. The radiance stream learns from calibrated five-band reflectance using local residual convolutions, channel recalibration, spatial gating, and skip-connected decoding, which preserve fine textures, row boundaries, and small weed structures that are often weakened after ratio-based index compression. The index stream operates on vegetation-index maps with windowed self-attention to model local structure efficiently, state-space layers to propagate field-scale context without quadratic attention cost, and Slot Attention to form stable region descriptors that improve discrimination of sparse weeds under canopy mixing. To support supervised training and deployment-oriented evaluation, we introduce BAWSeg, a four-year UAV multispectral dataset collected over commercial barley paddocks in Western Australia, providing radiometrically calibrated blue, green, red, red edge, and near-infrared orthomosaics, derived vegetation indices, and dense crop, weed, and other labels with leakage-free block splits. On BAWSeg, VISA achieves 75.6% mean Intersection over Union (mIoU) and 63.5% weed Intersection over Union (IoU) with 22.8 M parameters, outperforming a multispectral SegFormer-B1 baseline by 1.2 mIoU and 1.9 weed IoU. Under cross-plot and cross-year protocols, VISA maintains 71.2% and 69.2% mIoU, respectively. The full BAWSeg benchmark dataset, VISA code, trained model weights, and protocol files will be released upon publication. Full article
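The windowed self-attention used on the index stream can be sketched as follows: attention is restricted to non-overlapping windows so cost grows linearly with image area rather than quadratically. This is a generic sketch with assumed sizes, not the VISA code (which the authors say will be released upon publication).

```python
# Minimal windowed self-attention over a feature map: attention is computed
# inside non-overlapping windows only.
import torch
import torch.nn as nn

class WindowSelfAttention(nn.Module):
    def __init__(self, dim=64, window=8, heads=4):
        super().__init__()
        self.window = window
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x):          # x: (B, C, H, W), H and W divisible by window
        b, c, h, w = x.shape
        s = self.window
        # partition into (B * num_windows, s*s, C) token groups
        x = x.view(b, c, h // s, s, w // s, s)
        x = x.permute(0, 2, 4, 3, 5, 1).reshape(-1, s * s, c)
        x, _ = self.attn(x, x, x)  # attention within each window only
        x = x.view(b, h // s, w // s, s, s, c)
        return x.permute(0, 5, 1, 3, 2, 4).reshape(b, c, h, w)

x = torch.randn(1, 64, 32, 32)
print(WindowSelfAttention()(x).shape)  # torch.Size([1, 64, 32, 32])
```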

17 pages, 1167 KB  
Article
HOIMamba: Bidirectional State-Space Modeling for Monocular 3D Human–Object Interaction Reconstruction
by Jinsong Zhang and Yuqin Lin
Biomimetics 2026, 11(3), 214; https://doi.org/10.3390/biomimetics11030214 - 17 Mar 2026
Abstract
Monocular 3D human–object interaction (HOI) reconstruction requires jointly recovering articulated human geometry, object pose, and physically plausible contact from a single RGB image. While recent token-based methods commonly employ dense self-attention to capture global dependencies, isotropic all-to-all mixing tends to entangle spatial-geometric cues (e.g., contact locality) with channel-wise semantic cues (e.g., action/affordance), and provides limited control for representing directional and asymmetric physical influence between humans and objects. This paper presents HOIMamba, a state-space sequence modeling framework that reformulates HOI reconstruction as bidirectional, multi-scale interaction state inference. Instead of relying on symmetric correlation aggregation, HOIMamba uses structured state evolution to propagate interaction evidence. We introduce a multi-scale state-space module (MSSM) to capture interaction dependencies spanning local contact details and global body–object coordination. Building on MSSM, we propose a spatial-channel grouped SSM (SCSSM) block that factorizes interaction modeling into a spatial pathway for geometric/contact dependencies and a channel pathway for semantic/functional correlations, followed by gated fusion. HOIMamba further performs explicit bidirectional propagation between human and object states to better reflect asymmetric reciprocity in physical interactions. We evaluate HOIMamba on two public benchmarks, BEHAVE and InterCap, using Chamfer distance for human/object meshes and contact precision/recall induced by reconstructed geometry. HOIMamba achieves consistent improvements over representative prior methods. On the BEHAVE dataset, it reduces human Chamfer distance by 8.6% and improves contact recall by 13.5% compared to the strongest Transformer-based baseline, with similar gains observed on the InterCap dataset. Ablation studies on BEHAVE verify the contributions of state-space modeling, multi-scale inference, spatial-channel factorization, and bidirectional interaction reasoning. Full article
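A toy version of the bidirectional state-space idea: a learned linear recurrence scanned forward and backward over the token sequence, with the two directions fused. Mamba-style blocks use far more elaborate selective parameterizations; everything below (names, dimensions) is an illustrative assumption.

```python
# Bidirectional state-space propagation over a token sequence:
# h_t = a * h_{t-1} + u_t, scanned in both directions and fused.
import torch
import torch.nn as nn

class BiDirectionalSSM(nn.Module):
    def __init__(self, dim=64):
        super().__init__()
        self.decay = nn.Parameter(torch.rand(dim))   # per-channel decay logits
        self.in_proj = nn.Linear(dim, dim)
        self.out_proj = nn.Linear(2 * dim, dim)

    def scan(self, x, reverse=False):                # x: (B, T, D)
        a = torch.sigmoid(self.decay)                # keep decay in (0, 1)
        steps = reversed(range(x.size(1))) if reverse else range(x.size(1))
        h, outs = torch.zeros_like(x[:, 0]), []
        for t in steps:
            h = a * h + x[:, t]
            outs.append(h)
        if reverse:
            outs = outs[::-1]                        # restore time order
        return torch.stack(outs, dim=1)

    def forward(self, tokens):
        u = self.in_proj(tokens)
        fwd = self.scan(u)                           # past -> future
        bwd = self.scan(u, reverse=True)             # future -> past
        return self.out_proj(torch.cat([fwd, bwd], dim=-1))

print(BiDirectionalSSM()(torch.randn(2, 10, 64)).shape)  # torch.Size([2, 10, 64])
```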

18 pages, 9795 KB  
Article
Potential Accessibility to Population as an Instrument for Sustainable Territorial Development: The Case Study of Serbia
by Danijela Srnić, Aleksandra Gajić Protić, Nikola Krunić, Nebojša Stefanović and Marija R. Jeftić
Sustainability 2026, 18(6), 2894; https://doi.org/10.3390/su18062894 - 16 Mar 2026
Abstract
Potential accessibility has featured widely in the scientific literature since the 1950s, beginning with the work of Hansen. Since then, various sets of measures have been developed to evaluate it across fields such as transport infrastructure, land-use planning, and regional development. In the Serbian scientific literature, however, the concept has received limited attention, so this paper offers a new contribution to the management of territorial development within spatial planning. The main aim of this research is to view the sustainable territorial development of the Republic of Serbia from a new perspective that combines demographic and socioeconomic indicators with infrastructure development. To that end, an index of potential accessibility to the population of local self-government centers was calculated at the settlement level. This approach corresponds with the demographic and economic trends observed in Serbia over recent decades, as well as with more recent analyses in the scientific and professional literature of processes within the Serbian urban system. The findings can contribute significantly to further understanding of Serbian urban system patterns and sustainable territorial development. Full article
(This article belongs to the Special Issue Sustainable Urban Planning and Regional Development: 2nd Edition)
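For readers unfamiliar with Hansen-type measures: potential accessibility sums the population of every reachable center, discounted by travel cost. The decay form and parameter in this sketch are assumptions for illustration; the paper's exact specification may differ.

```python
# Hansen-type potential accessibility for one origin settlement i:
# A_i = sum_j P_j * exp(-beta * t_ij), with population P_j and travel time t_ij.
import math

def potential_accessibility(populations, travel_times, beta=0.05):
    return sum(p * math.exp(-beta * t)
               for p, t in zip(populations, travel_times))

# Three reachable centers with populations and travel times (minutes):
print(potential_accessibility([120_000, 45_000, 8_000], [20, 35, 10]))
```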

24 pages, 4692 KB  
Article
SSTNT: A Spatial–Spectral Similarity Guided Transformer-in-Transformer for Hyperspectral Unmixing
by Xinyu Cui, Xinyue Zhang, Aoran Dai and Da Sun
Photonics 2026, 13(3), 276; https://doi.org/10.3390/photonics13030276 - 13 Mar 2026
Abstract
Vision Transformers (ViTs), owing to their strong capability in modeling global contextual dependencies, have been widely adopted in hyperspectral image unmixing (HU). However, standard ViTs process images by partitioning them into non-overlapping patches, which disrupts spatial continuity at the pixel level and neglects the fine-grained structural relationships among pixels within local regions. Consequently, effectively capturing the detailed spatial–spectral features required for accurate unmixing remains challenging. Furthermore, the high computational complexity of global self-attention and its sensitivity to noise limit the applicability of conventional Transformers to HU. To address these issues, we propose a spatial–spectral similarity guided Transformer-in-Transformer (SSTNT) framework. The proposed network adopts a modified TNT architecture, in which the inner Transformer employs a linear self-attention (LSA) mechanism to efficiently exploit pixel-level local features within sliding windows, while the outer Transformer preserves global attention to aggregate contextual information, thereby forming a cooperative local–global optimization scheme. Furthermore, a lightweight spatial–spectral similarity module is introduced to enhance the modeling of neighborhood structures. Finally, spectral reconstruction is achieved through a trainable endmember decoder and a normalized abundance estimation module. Extensive experiments conducted on both synthetic and real hyperspectral datasets demonstrate the effectiveness and robustness of the proposed method. Full article
(This article belongs to the Special Issue Computational Optical Imaging: Theories, Algorithms, and Applications)
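The linear self-attention (LSA) trick the inner Transformer relies on can be shown compactly: replacing softmax attention with a kernel feature map lets the key-value product be computed once, reducing cost from quadratic to linear in sequence length. The phi = elu + 1 choice below follows the common linear-attention formulation (Katharopoulos et al.) and is an assumption about this paper's exact variant.

```python
# Linear attention: (phi(Q) phi(K)^T) V rewritten as phi(Q) (phi(K)^T V),
# which costs O(N) in sequence length instead of O(N^2).
import torch
import torch.nn.functional as F

def linear_attention(q, k, v, eps=1e-6):
    # q, k: (B, N, D), v: (B, N, Dv)
    q = F.elu(q) + 1
    k = F.elu(k) + 1
    kv = torch.einsum("bnd,bne->bde", k, v)            # (B, D, Dv), single pass
    z = 1 / (torch.einsum("bnd,bd->bn", q, k.sum(1)) + eps)  # normalizer
    return torch.einsum("bnd,bde,bn->bne", q, kv, z)

q = k = torch.randn(2, 1024, 32)
v = torch.randn(2, 1024, 32)
print(linear_attention(q, k, v).shape)  # torch.Size([2, 1024, 32])
```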

23 pages, 8147 KB  
Article
SDENet: A Novel Approach for Single Image Depth of Field Extension
by Xu Zhang, Miaomiao Wen, Junyang Jia and Yan Liu
Algorithms 2026, 19(3), 216; https://doi.org/10.3390/a19030216 - 13 Mar 2026
Abstract
Traditional hardware-based approaches for depth-of-field extension (DOF-E), such as optimized lens design or focus-stacking via layer scanning, are often plagued by bulkiness and prohibitive costs. Meanwhile, conventional multi-focus image fusion algorithms demand precise spatial alignment, a challenge that becomes particularly acute in applications like microscopy. To address these limitations, this paper proposes a novel single-image DOF-E method termed SDENet. The method adopts an encoder–decoder architecture enhanced with multi-scale self-attention and depth enhancement modules, enabling the transformation of a single partially focused image into a fully focused output while effectively recovering regions outside the original depth of field (DOF). To support model training and performance evaluation, we introduce a dedicated dataset (MSED) containing 1772 pairs of single-focus and all-focus images covering diverse scenes. Experimental results on multiple datasets verify that SDENet significantly outperforms state-of-the-art deblurring methods, achieving a PSNR of 26.98 dB and SSIM of 0.846 on the DPDD dataset, which represents a substantial improvement in clarity and visual coherence compared to existing techniques. Furthermore, SDENet demonstrates competitive performance with multi-image fusion methods while requiring only a single input. Full article
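For reference, the PSNR figure quoted above is computed as follows for images scaled to [0, 1]; this is the standard definition, not code from the paper.

```python
# Peak signal-to-noise ratio in dB: 10 * log10(max_val^2 / MSE).
import torch

def psnr(pred, target, max_val=1.0):
    mse = torch.mean((pred - target) ** 2)
    return 10 * torch.log10(max_val ** 2 / mse)

a, b = torch.rand(3, 256, 256), torch.rand(3, 256, 256)
print(float(psnr(a, b)))  # roughly 7-8 dB for unrelated random images
```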

23 pages, 5616 KB  
Article
Informer–UNet: A Hybrid Deep Learning Framework for Multi-Point Soil Moisture Prediction and Precision Irrigation in Winter Wheat
by Dingkun Zheng, Chenghan Yang, Gang Zheng, Baurzhan Belgibaev, Madina Mansurova, Sholpan Jomartova and Baidong Zhao
Agriculture 2026, 16(6), 648; https://doi.org/10.3390/agriculture16060648 - 12 Mar 2026
Abstract
Soil moisture prediction is essential for precision irrigation in water-limited agricultural systems. This study presents a deep learning-driven irrigation framework for winter wheat, integrating a novel Informer–UNet model with a Comprehensive Irrigation Index for adaptive water management. The Informer–UNet combines ProbSparse self-attention mechanisms with UNet’s multi-scale feature fusion, enabling simultaneous prediction of soil moisture at 27 monitoring points across three depths (10, 30, and 50 cm), while quantifying prediction uncertainty through Monte Carlo Dropout. A Comprehensive Irrigation Index incorporating moisture deviation, spatial variance, and confidence interval width was developed, with weights optimized via genetic algorithm. Field experiments were conducted in Chengdu, China, over two winter wheat growing seasons. The Informer–UNet achieved superior prediction accuracy (R² > 0.98, RMSE < 0.65) compared to LSTM, Transformer, and standard Informer models, with the fastest convergence and lowest validation loss. The proposed DeepIndexIrr strategy maintained soil moisture within the target range (55% to 75%) for over 81% of the irrigation period, reducing water consumption by 38.2% compared to fixed-threshold control and 19.2% compared to expert manual scheduling. These results demonstrate that integrating spatially distributed deep learning predictions with uncertainty-informed decision rules offers a promising approach for sustainable precision irrigation. Full article
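The Monte Carlo Dropout uncertainty estimate mentioned above works by keeping dropout active at inference and aggregating repeated stochastic forward passes; the spread across passes can then feed an irrigation index. The tiny regressor in this sketch is a stand-in for the Informer–UNet, and the shapes are illustrative.

```python
# MC Dropout: T stochastic passes give a mean prediction and an
# uncertainty estimate (standard deviation across passes).
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(8, 64), nn.ReLU(),
                      nn.Dropout(p=0.2), nn.Linear(64, 1))

def mc_dropout_predict(model, x, passes=50):
    model.train()                   # keep dropout stochastic at inference
    with torch.no_grad():
        samples = torch.stack([model(x) for _ in range(passes)])
    return samples.mean(0), samples.std(0)   # prediction, uncertainty

x = torch.randn(27, 8)              # e.g., features for 27 monitoring points
mean, std = mc_dropout_predict(model, x)
print(mean.shape, std.shape)        # torch.Size([27, 1]) twice
```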

26 pages, 2632 KB  
Article
Automated Malaria Ring Form Classification in Blood Smear Images Using Ensemble Parallel Neural Networks
by Pongphan Pongpanitanont, Naparat Suttidate, Manit Nuinoon, Natthida Khampeeramao, Sakhone Laymanivong and Penchom Janwan
J. Imaging 2026, 12(3), 127; https://doi.org/10.3390/jimaging12030127 - 12 Mar 2026
Abstract
Manual microscopy for malaria diagnosis is labor-intensive and prone to inter-observer variability. This study presents an automated binary classification approach for detecting malaria ring-form infections in thin blood smear single-cell images using a parallel neural network framework. Utilizing a balanced Kaggle dataset of 27,558 erythrocyte crops, images were standardized to 128 × 128 pixels and subjected to on-the-fly augmentation. The proposed architecture employs a dual-branch fusion strategy, integrating a convolutional neural network for local morphological feature extraction with a multi-head self-attention branch to capture global spatial relationships. Performance was rigorously evaluated using 10-fold stratified cross-validation and an independent 10% hold-out test set. Results demonstrated high-level discrimination, with all models achieving an ROC–AUC of approximately 0.99. The primary model (Model#1) attained a peak mean accuracy of 0.9567 during cross-validation and 0.97 accuracy (macro F1-score: 0.97) on the independent test set. In contrast, increasing architectural complexity in Model#3 led to a performance decline (0.95 accuracy) due to higher false-positive rates. These findings suggest that moderate-capacity feature fusion, combining convolutional descriptors with attention-based aggregation, provides a robust and generalizable solution for automated malaria screening without the risks associated with over-parameterization. Despite a strong performance, immediate clinical use remains limited because the model was developed on pre-segmented single-cell images, and external validation is still required before routine implementation. Full article
(This article belongs to the Section AI in Imaging)
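A minimal sketch of the dual-branch fusion pattern described here: a small CNN for local morphology, a self-attention branch over patches for global relations, and concatenation before the classifier head. Layer sizes and the wiring are assumptions, not the paper's architecture.

```python
# Dual-branch fusion: convolutional descriptor + attention-pooled
# global descriptor, concatenated for binary classification.
import torch
import torch.nn as nn

class DualBranchClassifier(nn.Module):
    def __init__(self, dim=64, patch=16):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, dim, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten())
        self.embed = nn.Conv2d(3, dim, patch, stride=patch)   # patchify
        self.attn = nn.MultiheadAttention(dim, 4, batch_first=True)
        self.head = nn.Linear(2 * dim, 2)                     # infected / not

    def forward(self, x):                     # x: (B, 3, 128, 128)
        local = self.cnn(x)                   # (B, dim) local morphology
        tokens = self.embed(x).flatten(2).transpose(1, 2)     # (B, 64, dim)
        glob, _ = self.attn(tokens, tokens, tokens)
        glob = glob.mean(dim=1)               # pooled global descriptor
        return self.head(torch.cat([local, glob], dim=-1))

print(DualBranchClassifier()(torch.randn(2, 3, 128, 128)).shape)  # (2, 2)
```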

22 pages, 18777 KB  
Article
LSOD-YOLO: A Visual Object Detection Method for AGV Perception Systems Based on a Lightweight Backbone and Detection Head
by Sijing Cai, Zhanzheng Wu, Kang Liu, Tianbai Zhang, Wei Weng and Xiaoyi Zheng
Technologies 2026, 14(3), 173; https://doi.org/10.3390/technologies14030173 - 12 Mar 2026
Abstract
In smart logistics and intelligent manufacturing scenarios, the deployment of Autonomous Guided Vehicles (AGVs) necessitates vision systems that balance stringent real-time constraints with high detection accuracy. However, contemporary lightweight models often struggle with multi-scale feature representation and precision degradation. To address these challenges, this study presents LSOD-YOLO, a tailored evolution of YOLO11n designed for embedded AGV systems. Our methodology focuses on three architectural innovations: (1) we propose a Lightweight Shared Convolution Detection (LSCD) head integrated with Group Normalization (GN) and a scale-adaptive mechanism to harmonize multi-scale feature responses; (2) we re-engineer the backbone using a Star-Net architecture enhanced by Gated MLPs and Depthwise Attention to refine local spatial modeling; and (3) we integrate multi-branch residuals and Channel Attention (CAA) into the C3k2-Star-CAA module to enhance robustness against occlusions and complex backgrounds. The experimental validation on a self-built AGV industrial dataset and COCO128 reveals a compelling performance leap: a 30 FPS increase in throughput and a 1.5% gain in precision, all achieved with 32.8% fewer parameters. These findings confirm that LSOD-YOLO achieves a superior trade-off between computational efficiency and reliability, showing great potential for seamless deployment in resource-constrained AGV visual tasks. Full article
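The shared-convolution detection head with Group Normalization and a scale-adaptive mechanism might look roughly like this: one conv tower reused across pyramid levels, with a learnable per-level scale re-adapting its output. This is a hedged reconstruction from the abstract, not the LSOD-YOLO code.

```python
# Lightweight shared detection head: the same GN conv tower and prediction
# conv serve every pyramid level; a per-level scale compensates for scale shift.
import torch
import torch.nn as nn

class SharedDetectionHead(nn.Module):
    def __init__(self, ch=64, num_levels=3, num_outputs=4 + 1):
        super().__init__()
        self.tower = nn.Sequential(            # shared across levels
            nn.Conv2d(ch, ch, 3, padding=1), nn.GroupNorm(8, ch), nn.SiLU(),
            nn.Conv2d(ch, ch, 3, padding=1), nn.GroupNorm(8, ch), nn.SiLU())
        self.pred = nn.Conv2d(ch, num_outputs, 1)             # also shared
        self.scales = nn.Parameter(torch.ones(num_levels))    # per-level scale

    def forward(self, feats):                  # list of (B, ch, H, W)
        return [self.pred(self.tower(f)) * self.scales[i]
                for i, f in enumerate(feats)]

feats = [torch.randn(1, 64, s, s) for s in (80, 40, 20)]
print([o.shape for o in SharedDetectionHead()(feats)])
```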

26 pages, 7392 KB  
Article
A CLIP-Based Zero-Shot Photovoltaic Segmentation Framework for Remote Sensing Imagery
by Hailong Li, Man Zhao, Lu Bai, Yan Liu, Xiaoqing He, Liangfu Chen, Jinhua Tao, Guangyan He and Zhibao Wang
Remote Sens. 2026, 18(6), 865; https://doi.org/10.3390/rs18060865 - 11 Mar 2026
Abstract
In photovoltaic remote sensing image segmentation tasks, fully supervised methods can achieve high accuracy. However, the high cost of pixel-level annotation significantly limits their scalability in large-scale scenarios. To overcome this annotation bottleneck, this paper proposes a zero-shot cross-modal segmentation framework based on the visual-language pre-trained foundation model (CLIP). This approach harnesses CLIP’s cross-modal knowledge transfer capabilities to achieve precise extraction of photovoltaic targets without requiring any downstream training. This paper first introduces the Layer-wise Augmented Residual Attention (LARA) mechanism to enhance fine-grained detail representation in the feature space. Subsequently, a Cross-modal Semantic Attribution Module (CMSA) is designed to generate precise activation maps by leveraging image-text alignment gradient information. Finally, the Confidence-Aware Refinement Strategy (CARS) replaces the conventional training-based denoising process, directly producing high-quality binary segmentation masks through adaptive thresholding. Comparative experiments were conducted to evaluate the proposed method against various baselines on several public datasets of varying resolution covering Jiangsu Province, including Unmanned Aerial Vehicle imagery, Beijing-2, and Gaofen-2, as well as a self-created Sentinel-2 dataset covering multiple countries. Notably, the proposed method achieved an IoU of 70.3% on the Gaofen-2 PV03 dataset (spatial resolution of approximately 0.3 m) and 50.8% on the self-created PV_Sentinel-2 dataset (spatial resolution of 10 m). Experimental results demonstrate that our proposed approach maintains excellent cross-domain generalisation capabilities while reducing annotation costs, thereby providing an efficient and viable technical pathway for the automated monitoring of large-scale photovoltaic facilities. Full article
(This article belongs to the Section AI Remote Sensing)
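A crude stand-in for the zero-shot step, assuming patch embeddings from a CLIP image tower and a text embedding for a prompt such as "photovoltaic panels" are already extracted: cosine similarity yields an activation map, and an adaptive threshold yields a binary mask. The actual CMSA and CARS modules are considerably more involved.

```python
# Zero-shot mask from precomputed CLIP features: similarity map + threshold.
import torch
import torch.nn.functional as F

def zero_shot_mask(patch_emb, text_emb, grid=(14, 14)):
    # patch_emb: (N, D) patch embeddings; text_emb: (D,) prompt embedding
    sim = F.normalize(patch_emb, dim=-1) @ F.normalize(text_emb, dim=-1)
    sim = sim.view(*grid)                      # coarse activation map
    thresh = sim.mean() + 0.5 * sim.std()      # simple adaptive threshold
    mask = (sim > thresh).float()
    # upsample the coarse mask to image resolution
    return F.interpolate(mask[None, None], size=(224, 224),
                         mode="nearest")[0, 0]

mask = zero_shot_mask(torch.randn(196, 512), torch.randn(512))
print(mask.shape)  # torch.Size([224, 224]), values in {0., 1.}
```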

17 pages, 3180 KB  
Article
Adaptive Projector Photometric Compensation Under Dynamic Environment Lightings
by Feng Zhang, Siyu Xu, Bingyan Duan, Cheng Han and Fabin Wang
Algorithms 2026, 19(3), 209; https://doi.org/10.3390/a19030209 - 10 Mar 2026
Abstract
To mitigate color crosstalk in projected images induced by non-uniform projection surfaces and dynamic ambient lighting, we propose MAPCNet, an adaptive photometric compensation network that enables robust color reproduction across diverse environments. The experimental process begins with a mask-based preprocessing operation on a multi-surface photometric compensation dataset captured under varying ambient light conditions. The algorithm incorporates a multi-attention mechanism and a triple-interaction attention mechanism, and further integrates a hybrid attention module inspired by the Transformer architecture. By combining channel attention and window-based self-attention, this module improves the network’s adaptability to spatial and illumination changes. Experimental results demonstrate that the photometric compensation achieved by this algorithm satisfies human visual perception requirements. Full article
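The channel-attention half of such a hybrid module is typically a squeeze-and-excitation gate, sketched below with assumed sizes; the window-based self-attention half would resemble the windowed-attention sketch shown earlier on this page.

```python
# Squeeze-and-excitation channel attention: pool each channel to a scalar,
# learn per-channel weights, and reweight the feature map.
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    def __init__(self, channels=64, reduction=8):
        super().__init__()
        self.gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(channels, channels // reduction), nn.ReLU(),
            nn.Linear(channels // reduction, channels), nn.Sigmoid())

    def forward(self, x):              # x: (B, C, H, W)
        w = self.gate(x)               # (B, C) per-channel weights
        return x * w[:, :, None, None]

x = torch.randn(2, 64, 32, 32)
print(ChannelAttention()(x).shape)  # torch.Size([2, 64, 32, 32])
```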

25 pages, 11205 KB  
Article
Remote Sensing Image Captioning via Self-Supervised DINOv3 and Transformer Fusion
by Maryam Mehmood, Ahsan Shahzad, Farhan Hussain, Lismer Andres Caceres-Najarro and Muhammad Usman
Remote Sens. 2026, 18(6), 846; https://doi.org/10.3390/rs18060846 - 10 Mar 2026
Abstract
Effective interpretation of coherent and usable information from aerial images (e.g., satellite imagery or high-altitude drone photography) can greatly reduce human effort in many situations, both natural (e.g., earthquakes, forest fires, tsunamis) and man-made (e.g., highway pile-ups, traffic congestion), particularly in disaster management. This research proposes a novel encoder–decoder framework for captioning of remote sensing images that integrates self-supervised DINOv3 visual features with a hybrid Transformer–LSTM decoder. Unlike existing approaches that rely on supervised CNN-based encoders (e.g., ResNet, VGG), the proposed method leverages DINOv3’s self-supervised learning capabilities to extract dense, semantically rich features from aerial images without requiring domain-specific labeled pretraining. The proposed hybrid decoder combines Transformer layers for global context modeling with LSTM layers for sequential caption generation, producing coherent and context-aware descriptions. Feature extraction is performed using the DINOv3 model, which employs the gram-anchoring technique to stabilize dense feature maps. Captions are generated through a hybrid of Transformer and Long Short-Term Memory (LSTM) layers, which adds contextual meaning to captions through sequential hidden-layer modeling with gated memory. The model is first evaluated on two traditional remote sensing image captioning datasets: RSICD and UCM-Captions. Multiple evaluation metrics, including Bilingual Evaluation Understudy (BLEU), Consensus-based Image Description Evaluation (CIDEr), Recall-Oriented Understudy for Gisting Evaluation (ROUGE-L), and Metric for Evaluation of Translation with Explicit Ordering (METEOR), are used to quantify the performance and robustness of the proposed DINOv3 hybrid model. The proposed model outperforms conventional Convolutional Neural Network (CNN) and Vision Transformer (ViT)-based models by approximately 9–12% across most evaluation metrics. Attention heatmaps are also employed to qualitatively validate the proposed model when identifying and describing key spatial elements. In addition, the proposed model is evaluated on advanced remote sensing datasets, including RSITMD, DisasterM3, and GeoChat. The results demonstrate that self-supervised vision transformers are robust encoders for multi-modal understanding in remote sensing image analysis and captioning. Full article
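A compact sketch of the hybrid decoding idea: visual tokens from a frozen self-supervised encoder pass through Transformer layers for global context, and an LSTM then generates the caption token by token. Dimensions and the overall wiring are assumptions; for brevity the LSTM here only sees a pooled context vector, whereas the paper's decoder is richer.

```python
# Hybrid Transformer + LSTM captioner over precomputed visual tokens.
import torch
import torch.nn as nn

class HybridCaptioner(nn.Module):
    def __init__(self, feat_dim=768, hid=512, vocab=10_000):
        super().__init__()
        enc_layer = nn.TransformerEncoderLayer(feat_dim, nhead=8,
                                               batch_first=True)
        self.context = nn.TransformerEncoder(enc_layer, num_layers=2)
        self.embed = nn.Embedding(vocab, hid)
        self.lstm = nn.LSTM(hid + feat_dim, hid, batch_first=True)
        self.out = nn.Linear(hid, vocab)

    def forward(self, vis_tokens, captions):
        # vis_tokens: (B, N, feat_dim), e.g., patch features from DINOv3
        ctx = self.context(vis_tokens).mean(dim=1)       # pooled global context
        emb = self.embed(captions)                       # (B, T, hid)
        ctx = ctx[:, None].expand(-1, emb.size(1), -1)   # repeat per step
        h, _ = self.lstm(torch.cat([emb, ctx], dim=-1))
        return self.out(h)                               # (B, T, vocab) logits

logits = HybridCaptioner()(torch.randn(2, 196, 768),
                           torch.randint(0, 10_000, (2, 12)))
print(logits.shape)  # torch.Size([2, 12, 10000])
```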

27 pages, 8552 KB  
Article
A Data-Constrained and Physics-Guided Conditional Diffusion Model for Electrical Impedance Tomography Image Reconstruction
by Xiaolei Zhang and Zhou Rong
Sensors 2026, 26(5), 1728; https://doi.org/10.3390/s26051728 - 9 Mar 2026
Abstract
Electrical impedance tomography (EIT) provides noninvasive, high-temporal-resolution imaging for medical and industrial applications. However, accurate image reconstruction remains challenging due to the severe ill-posedness and nonlinearity of the inverse problem, as well as the limited robustness of existing single-source learning-based methods in real measurement scenarios. To address these limitations, a data-constrained and physics-guided Multi-Source Conditional Diffusion Model (MS-CDM) is proposed for EIT image reconstruction. Unlike conventional conditional diffusion methods that rely on a single measurement or an image prior, MS-CDM utilizes boundary voltage measurements as data-driven constraints and incorporates coarse reconstructions as physics-guided structural priors. This multi-source conditioning strategy provides complementary guidance during the reverse diffusion process, enabling balanced recovery of fine boundary details and global topological consistency. To support this framework, a Hybrid Swin–Mamba Denoising U-Net is developed, combining hierarchical window-based self-attention for local spatial modeling with bidirectional state-space modeling for efficient global dependency capture. Extensive experiments on simulated datasets and three real EIT experimental platforms demonstrate that MS-CDM consistently outperforms state-of-the-art numerical, supervised, and diffusion-based methods in terms of reconstruction accuracy, structural consistency, and noise robustness. Moreover, the proposed model exhibits robust cross-system applicability without system-specific retraining under multi-protocol training, highlighting its practical applicability in diverse real-world EIT scenarios. Full article
(This article belongs to the Section Sensing and Imaging)
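Multi-source conditioning can be sketched as a denoiser that sees the noisy image concatenated with the physics-guided coarse reconstruction, while boundary-voltage measurements are lifted to an extra input channel. The placeholder conv net below stands in for the Hybrid Swin–Mamba Denoising U-Net, and all sizes are assumptions.

```python
# Conditional denoiser for one reverse diffusion step: inputs are the noisy
# conductivity image x_t, a coarse reconstruction, and measured voltages.
import torch
import torch.nn as nn

class ConditionalDenoiser(nn.Module):
    def __init__(self, meas_dim=104, ch=32):
        super().__init__()
        # 1 noisy channel + 1 coarse-reconstruction channel + 1 voltage map
        self.net = nn.Sequential(
            nn.Conv2d(3, ch, 3, padding=1), nn.SiLU(),
            nn.Conv2d(ch, 1, 3, padding=1))
        self.meas = nn.Linear(meas_dim, 64 * 64)   # lift voltages to a map

    def forward(self, x_t, coarse, voltages):
        # x_t, coarse: (B, 1, 64, 64); voltages: (B, meas_dim)
        v_map = self.meas(voltages).view(-1, 1, 64, 64)
        return self.net(torch.cat([x_t, coarse, v_map], dim=1))  # noise estimate

eps = ConditionalDenoiser()(torch.randn(2, 1, 64, 64),
                            torch.randn(2, 1, 64, 64),
                            torch.randn(2, 104))
print(eps.shape)  # torch.Size([2, 1, 64, 64])
```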

28 pages, 48517 KB  
Article
DDF-DETR: A Multi-Scale Spatial Context Method for Field Cotton Seedling Detection
by Feng Xu, Huade Zhou, Yinyi Pan, Yi Lu and Luan Dong
Agriculture 2026, 16(5), 615; https://doi.org/10.3390/agriculture16050615 - 7 Mar 2026
Abstract
Accurate assessment of cotton emergence rates is essential for precision agriculture management, and unmanned aerial vehicle (UAV) imagery provides a scalable means for field-level monitoring. However, cotton seedling detection from UAV images faces persistent challenges: individual seedlings appear as small targets with diverse morphologies across varying flight altitudes; strong plastic film reflections, weeds, and soil cracks introduce substantial background interference; and “missing seedling” targets, which manifest as negative space features, exhibit high similarity to background noise. Existing CNN–Transformer hybrid detection architectures are limited by fixed convolutional receptive fields that cannot adapt to multi-scale target variations, attention mechanisms that lack explicit directional geometric modeling, and interpolation-based upsampling that attenuates high-frequency edge details of small targets. To address these issues, this paper proposes DDF-DETR (Dynamic-Direction-Frequency Detection Transformer), a multi-scale spatial context detection method based on RT-DETR. The method incorporates three components: a Dynamic Gated Mixer Block (DGMB) for adaptive multi-scale feature extraction with background noise suppression, a Direction-Aware Adaptive Transformer Encoder (DAATE) for directional geometric feature modeling at linear computational complexity, and a Frequency-Aware Sub-pixel Upsampling Network (FASN) for high-frequency detail recovery in the feature pyramid. On the self-constructed Xinjiang cotton field dataset, DDF-DETR achieves 83.72% mAP@0.5 and 63.46% mAP@0.5:0.95, representing improvements of 2.38% and 5.28% over the baseline RT-DETR-R18, while reducing the parameter count by 30.6% and computational cost to 42.8 GFLOPs. Generalization experiments on the VisDrone2019 and TinyPerson datasets further validate the robustness of the proposed method for small target detection across different scenarios. Full article
(This article belongs to the Section Artificial Intelligence and Digital Agriculture)
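Sub-pixel upsampling, the mechanism behind FASN's high-frequency detail recovery, expands channels by r² with a convolution and then rearranges them into an r-times larger map, avoiding the smoothing of interpolation-based upsampling. The frequency-aware weighting itself is omitted in this sketch, and the channel counts are assumptions.

```python
# Sub-pixel (PixelShuffle) upsampling: conv expands channels by r^2,
# PixelShuffle folds them into spatial resolution.
import torch
import torch.nn as nn

class SubPixelUpsample(nn.Module):
    def __init__(self, in_ch=128, out_ch=64, r=2):
        super().__init__()
        self.expand = nn.Conv2d(in_ch, out_ch * r * r, 3, padding=1)
        self.shuffle = nn.PixelShuffle(r)   # (B, C*r^2, H, W) -> (B, C, rH, rW)

    def forward(self, x):
        return self.shuffle(self.expand(x))

x = torch.randn(1, 128, 20, 20)
print(SubPixelUpsample()(x).shape)  # torch.Size([1, 64, 40, 40])
```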

32 pages, 9401 KB  
Article
A Leakage-Aware Multimodal Machine Learning Framework for Nutrition Supply–Demand Forecasting Using Temporal and Spatial Data Fusion
by Abdullah, Muhammad Ateeb Ather, Jose Luis Oropeza Rodriguez, Carlos Guzmán Sánchez-Mejorada, Miguel Jesús Torres Ruiz and Rolando Quintero Tellez
Computers 2026, 15(3), 156; https://doi.org/10.3390/computers15030156 - 2 Mar 2026
Abstract
Accurate forecasting of nutrition supply–demand dynamics is essential for reducing resource wastage and improving equitable allocation. However, this task remains challenging due to heterogeneous data sources, cold-start regions, and the risk of information leakage in spatiotemporal modeling. This study presents a leakage-aware multimodal machine learning framework for nutrition supply–demand forecasting. The framework integrates temporal, spatial, and contextual information within a unified architecture. It combines self-supervised temporal representation learning, causal time-lag modeling, and few-shot adaptation to improve generalization under limited or previously unseen data conditions. Heterogeneous inputs include epidemiological, environmental, demographic, sentiment, and biologically derived indicators. These signals are encoded using a PatchTST-inspired temporal backbone coupled with a feature-token transformer employing cross-modal attention. Spatial dependencies are explicitly modeled using graph neural networks. Hierarchical decoding enables multi-horizon forecasting with calibrated uncertainty estimates. Model evaluation is conducted under strict spatiotemporal hold-out protocols with explicit leakage detection. All synthetic signals are excluded from testing. Across geographically and temporally disjoint datasets, the proposed framework consistently outperforms strong unimodal and multimodal baselines. It achieves macro-F1 scores above 99.5% and stable early-warning lead times of approximately 9 days under distribution shift. Ablation studies indicate that causal time-lag enforcement and few-shot adaptation contribute most strongly to performance robustness. Closed-loop simulation experiments suggest potential reductions in nutrient wastage of approximately 38%, response latency of 19%, and operational costs of 16% when deployed as a decision-support tool. External validation on fully unseen regions confirms the generalizability of the framework under realistic forecasting constraints. Full article
(This article belongs to the Special Issue AI in Bioinformatics)
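The strict spatiotemporal hold-out described above can be illustrated in a few lines: test regions never appear in training, and the test period lies strictly after the training period, with explicit leakage checks. Column names and the split logic are illustrative, not the paper's protocol code.

```python
# Leakage-aware spatiotemporal split: disjoint regions AND disjoint time.
import pandas as pd

def spatiotemporal_split(df, test_regions, split_date):
    train = df[(~df["region"].isin(test_regions)) & (df["date"] < split_date)]
    test = df[(df["region"].isin(test_regions)) & (df["date"] >= split_date)]
    # leakage checks: no shared regions, no temporal overlap
    assert not set(train["region"]) & set(test["region"])
    assert train["date"].max() < test["date"].min()
    return train, test

df = pd.DataFrame({"region": ["A", "A", "B", "B"],
                   "date": pd.to_datetime(
                       ["2024-01-01", "2024-06-01"] * 2),
                   "demand": [1.0, 2.0, 3.0, 4.0]})
train, test = spatiotemporal_split(df, {"B"}, pd.Timestamp("2024-03-01"))
print(len(train), len(test))  # 1 1
```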
