Search Results (57)

Search Parameters:
Keywords = global token generator

36 pages, 507 KB  
Article
Introducing a Resolvable Network-Based SAT Solver Using Monotone CNF–DNF Dualization and Resolution
by Gábor Kusper and Benedek Nagy
Mathematics 2026, 14(2), 317; https://doi.org/10.3390/math14020317 - 16 Jan 2026
Abstract
This paper is a theoretical contribution that introduces a new reasoning framework for SAT solving based on resolvable networks (RNs). RNs provide a graph-based representation of propositional satisfiability in which clauses are interpreted as directed reaches between disjoint subsets of Boolean variables (nodes). Building on this framework, we introduce a novel RN-based SAT solver, called RN-Solver, which replaces local assignment-driven branching by global reasoning over token distributions. Token distributions, interpreted as truth assignments, are generated by monotone CNF–DNF dualization applied to white (all-positive) clauses. New white clauses are derived via resolution along private-pivot chains, and the solver’s progression is governed by a taxonomy of token distributions (black-blocked, terminal, active, resolved, and non-resolved). The main results establish the soundness and completeness of the RN-Solver. Experimentally, the solver performs very well on pigeonhole formulas, where the separation between white and black clauses enables effective global reasoning. In contrast, its current implementation performs poorly on random 3-SAT instances, highlighting both practical limitations and significant opportunities for optimization and theoretical refinement. The presented RN-Solver implementation is a proof-of-concept which validates the underlying theory rather than a state-of-the-art competitive solver. One promising direction is the generalization of strongly connected components from directed graphs to resolvable networks. Finally, the token-based perspective naturally suggests a connection to token-superposition Petri net models. Full article
(This article belongs to the Special Issue Graph Theory and Applications, 3rd Edition)
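The white-clause dualization step is easy to picture with a tiny, generic sketch: monotone CNF–DNF dualization of an all-positive formula amounts to enumerating its minimal hitting sets (the prime implicants of the dual DNF). The brute-force routine below only illustrates that combinatorial operation under assumed data structures; it is not the RN-Solver's actual token-distribution generator.

```python
from itertools import combinations

def dualize_monotone_cnf(white_clauses):
    """Enumerate the minimal hitting sets of a monotone, all-positive CNF.
    Each returned variable set, assigned True, satisfies every white clause,
    and no proper subset of it does (i.e., a prime implicant of the dual DNF)."""
    variables = sorted(set().union(*white_clauses))
    duals = []
    for size in range(1, len(variables) + 1):
        for cand in combinations(variables, size):
            s = set(cand)
            # keep candidates that intersect every clause...
            if all(s & clause for clause in white_clauses):
                # ...and are not supersets of an already-found (smaller) hitting set
                if not any(d <= s for d in duals):
                    duals.append(s)
    return duals

# Example: (x1 v x2) AND (x2 v x3) dualizes to {x2} and {x1, x3}
print(dualize_monotone_cnf([{1, 2}, {2, 3}]))
```

Because candidates are enumerated by increasing size, any hitting set that contains no previously found one is itself minimal, so the subset check suffices.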
20 pages, 636 KB  
Article
Using Denoising Diffusion Model for Predicting Global Style Tokens in an Expressive Text-to-Speech System
by Wiktor Prosowicz and Tomasz Hachaj
Electronics 2025, 14(23), 4759; https://doi.org/10.3390/electronics14234759 - 3 Dec 2025
Viewed by 710
Abstract
Text-to-speech (TTS) systems based on neural networks have undergone a significant evolution, taking a step forward towards achieving human-like quality and expressiveness, which is crucial for applications such as social media content creation and voice interfaces for visually impaired individuals. An entire branch of research, known as Expressive Text-to-speech (ETTS), has emerged to address the so-called one-to-many mapping problem, which limits the naturalness of generated output. However, most ETTS systems applying explicit style modeling treat the prediction of prosodic features as a regressive, rather than generative, process and, consequently, do not capture prosodic diversity. We address this problem by proposing a novel technique for inference-time prediction of speaking-style features, which leverages a diffusion framework for sampling from a learned space of Global Style Tokens-based embeddings, which are then used to condition a neural TTS model. By incorporating the diffusion model, we can leverage its powerful modeling capabilities to learn the distribution of possible stylistic features and, during inference, sample them non-deterministically, which makes the generated speech more human-like by alleviating prosodic monotony across multiple sentences. Our system blends a regressive predictor with a diffusion-based generator to enable smooth control over the diversity of generated speech. Through quantitative and qualitative (human-centered) experiments, we demonstrated that our system generates expressive human speech with non-deterministic high-level prosodic features. Full article
(This article belongs to the Special Issue Advances in Algorithm Optimization and Computational Intelligence)
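The inference-time sampling idea can be sketched as an ordinary DDPM-style ancestral sampling loop over a Global-Style-Token weight vector: start from Gaussian noise and iteratively denoise it, conditioned on the text, so different runs yield different prosody. This is a minimal sketch under assumed shapes and a hypothetical `denoiser` network, not the authors' implementation.

```python
import torch

@torch.no_grad()
def sample_style_embedding(denoiser, text_cond, dim=256, steps=50):
    """Toy DDPM ancestral sampler for a style-embedding vector."""
    betas = torch.linspace(1e-4, 0.02, steps)
    alphas = 1.0 - betas
    alpha_bar = torch.cumprod(alphas, dim=0)
    x = torch.randn(1, dim)                                  # start from pure noise
    for t in reversed(range(steps)):
        eps_hat = denoiser(x, torch.tensor([t]), text_cond)  # predicted noise (hypothetical net)
        coef = betas[t] / torch.sqrt(1.0 - alpha_bar[t])
        mean = (x - coef * eps_hat) / torch.sqrt(alphas[t])
        noise = torch.randn_like(x) if t > 0 else torch.zeros_like(x)
        x = mean + torch.sqrt(betas[t]) * noise
    return x   # non-deterministic style embedding used to condition the TTS model
```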
29 pages, 2296 KB  
Article
V-MHESA: A Verifiable Masking and Homomorphic Encryption-Combined Secure Aggregation Strategy for Privacy-Preserving Federated Learning
by Soyoung Park and Jeonghee Chi
Mathematics 2025, 13(22), 3687; https://doi.org/10.3390/math13223687 - 17 Nov 2025
Viewed by 503
Abstract
In federated learning, secure aggregation is essential to protect the confidentiality of local model updates, ensuring that the server can access only the aggregated result without exposing individual contributions. However, conventional secure aggregation schemes lack mechanisms that allow participating nodes to verify whether the aggregation has been performed correctly, thereby raising concerns about the integrity of the global model. To address this limitation, we propose V-MHESA (Verifiable Masking-and-Homomorphic Encryption–combined Secure Aggregation), an enhanced protocol extending our previous MHESA scheme. V-MHESA incorporates verification tokens and shared-key management to simultaneously ensure verifiability, confidentiality, and authentication. Each node generates masked updates using its own mask, the server’s secret, and a node-only shared random nonce, ensuring that only the server can compute a blinded global update while the actual global model remains accessible solely to the nodes. Verification tokens corresponding to randomly selected model parameters enable nodes to efficiently verify the correctness of the aggregated model with minimal communication overhead. Moreover, the protocol achieves inherent authentication of the server and legitimate nodes and remains robust under node dropout scenarios. The confidentiality of local updates and the unforgeability of verification tokens are analyzed under the honest-but-curious threat model, and experimental evaluations on the MNIST dataset demonstrate that V-MHESA achieves accuracy comparable to prior MHESA while introducing only negligible computational and communication overhead. Full article
(This article belongs to the Special Issue Applied Cryptography and Blockchain Security, 2nd Edition)
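The masking side of such schemes is easy to illustrate in isolation: if nodes add pairwise masks that cancel in the sum, the server only ever sees masked vectors yet still recovers the exact aggregate. The sketch below shows only this generic cancellation idea; V-MHESA's actual construction (server secret, node-only nonce, homomorphic encryption, and verification tokens) is not reproduced here.

```python
import numpy as np

rng = np.random.default_rng(0)

def masked_updates(local_updates):
    """Toy additive masking: node i adds +m_ij for j > i and -m_ji for j < i,
    so every pairwise mask appears once with each sign and cancels in the sum."""
    n = len(local_updates)
    dim = local_updates[0].shape[0]
    masks = {(i, j): rng.normal(size=dim) for i in range(n) for j in range(i + 1, n)}
    sent = []
    for i, u in enumerate(local_updates):
        m = sum(masks[(i, j)] for j in range(i + 1, n)) \
          - sum(masks[(j, i)] for j in range(i))
        sent.append(u + m)          # what the server actually sees
    return sent

updates = [rng.normal(size=4) for _ in range(3)]
aggregate = sum(masked_updates(updates))          # masks cancel pairwise
assert np.allclose(aggregate, sum(updates))
```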
17 pages, 294 KB  
Article
Approximate Fiber Products of Schemes and Their Étale Homotopical Invariants
by Dongfang Zhao
Mathematics 2025, 13(21), 3448; https://doi.org/10.3390/math13213448 - 29 Oct 2025
Viewed by 470
Abstract
The classical fiber product in algebraic geometry provides a powerful tool for studying loci where two morphisms to a base scheme, ϕ: X → S and ψ: Y → S, coincide exactly. This condition of strict equality, however, is insufficient for describing many real-world applications, such as the geometric structure of semantic spaces in modern large language models whose foundational architecture is the Transformer neural network: The token spaces of these models are fundamentally approximate, and recent work has revealed complex geometric singularities, challenging the classical manifold hypothesis. This paper develops a new framework to study and quantify the nature of approximate alignment between morphisms in the context of arithmetic geometry, using the tools of étale homotopy theory. We introduce the central object of our work, the étale mismatch torsor, which is a sheaf of torsors over the product scheme X ×_S Y. The structure of this sheaf serves as a rich, intrinsic, and purely algebraic object amenable to both qualitative classification and quantitative analysis of the global relationship between the two morphisms. Our main results are twofold. First, we provide a complete classification of these structures, establishing a bijection between their isomorphism classes and the first étale cohomology group H^1_ét(X ×_S Y, π_1^ét(S)), with coefficients in the constant sheaf associated with the étale fundamental group of S. Second, we construct a canonical filtration on this classifying cohomology group based on the theory of infinitesimal neighborhoods. This filtration induces a new invariant, which we term the order of mismatch, providing a hierarchical, algebraic measure for the degree of approximation between the morphisms. We apply this framework to the concrete case of generalized Howe curves over finite fields, demonstrating how both the characteristic class and its order reveal subtle arithmetic properties. Full article
(This article belongs to the Section B: Geometry and Topology)
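The first main result rests on the standard correspondence between torsors under a sheaf of groups and non-abelian first cohomology; a schematic statement, with notation as in the abstract and the coefficient group regarded as a constant sheaf, is:

```latex
\{\text{\'etale mismatch torsors on } X \times_S Y\}/\!\cong
\;\;\longleftrightarrow\;\;
H^{1}_{\text{\'et}}\!\left(X \times_S Y,\; \underline{\pi_1^{\text{\'et}}(S)}\right)
```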
17 pages, 3194 KB  
Article
Improved Real-Time Detection Transformer with Low-Frequency Feature Integrator and Token Statistics Self-Attention for Automated Grading of Stropharia rugoso-annulata Mushroom
by Yu-Hang He, Shi-Yun Duan and Wen-Hao Su
Foods 2025, 14(20), 3581; https://doi.org/10.3390/foods14203581 - 21 Oct 2025
Viewed by 604
Abstract
Manual grading of Stropharia rugoso-annulata mushroom is plagued by inefficiency and subjectivity, while existing detection models face inherent trade-offs between accuracy, real-time performance, and deployability on resource-constrained edge devices. To address these challenges, this study presents an Improved Real-Time Detection Transformer (RT-DETR) tailored for automated grading of Stropharia rugoso-annulata. Two innovative modules underpin the model: (1) the low-frequency feature integrator (LFFI), which leverages wavelet decomposition to preserve critical low-frequency global structural information, thereby enhancing the capture of large mushroom morphology; (2) the Token Statistics Self-Attention (TSSA) mechanism, which replaces traditional self-attention with second-moment statistical computations. This reduces complexity from O(n²) to O(n) and inherently generates interpretable attention patterns, augmenting model explainability. Experimental results demonstrate that the improved model achieves 95.2% mAP@0.5:0.95 at 262 FPS, with a substantial reduction in computational overhead compared to the original RT-DETR. It outperforms APHS-YOLO in both accuracy and efficiency, eliminates the need for non-maximum suppression (NMS) post-processing, and balances global structural awareness with local detail sensitivity. These attributes render it highly suitable for industrial edge deployment. This work offers an efficient framework for the automated grading of large-target crop detection. Full article
(This article belongs to the Section Food Engineering and Technology)
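The complexity claim can be made concrete with a generic stand-in: if per-token weights are derived from second-moment statistics rather than pairwise query–key similarities, mixing n tokens costs O(n·d) instead of O(n²·d). The toy mixer below illustrates only that scaling argument; it is not the TSSA operator defined in the paper.

```python
import torch
import torch.nn as nn

class SecondMomentMixer(nn.Module):
    """Toy O(n) token mixer: weight each token by the softmax-normalized second
    moment of its projection and mix with one weighted pool, avoiding any
    n x n similarity matrix."""
    def __init__(self, dim):
        super().__init__()
        self.proj = nn.Linear(dim, dim)

    def forward(self, x):                                  # x: (batch, n, dim)
        z = self.proj(x)
        m2 = z.pow(2).mean(dim=-1, keepdim=True)           # per-token second moment
        w = torch.softmax(m2, dim=1)                       # normalize over the n tokens
        pooled = (w * z).sum(dim=1, keepdim=True)          # (batch, 1, dim) summary
        return x + pooled.expand_as(x)                     # total cost O(n * dim)
```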
28 pages, 32815 KB  
Article
LiteSAM: Lightweight and Robust Feature Matching for Satellite and Aerial Imagery
by Boya Wang, Shuo Wang, Yibin Han, Linfeng Xu and Dong Ye
Remote Sens. 2025, 17(19), 3349; https://doi.org/10.3390/rs17193349 - 1 Oct 2025
Viewed by 1208
Abstract
We present a (Light)weight (S)atellite–(A)erial feature (M)atching framework (LiteSAM) for robust UAV absolute visual localization (AVL) in GPS-denied environments. Existing satellite–aerial matching methods struggle with large appearance variations, texture-scarce regions, and limited efficiency for real-time UAV applications. LiteSAM integrates three key components to address these issues. First, efficient multi-scale feature extraction optimizes representation, reducing inference latency for edge devices. Second, a Token Aggregation–Interaction Transformer (TAIFormer) with a convolutional token mixer (CTM) models inter- and intra-image correlations, enabling robust global–local feature fusion. Third, a MinGRU-based dynamic subpixel refinement module adaptively learns spatial offsets, enhancing subpixel-level matching accuracy and cross-scenario generalization. The experiments show that LiteSAM achieves competitive performance across multiple datasets. On UAV-VisLoc, LiteSAM attains an RMSE@30 of 17.86 m, outperforming state-of-the-art semi-dense methods such as EfficientLoFTR. Its optimized variant, LiteSAM (opt., without dual softmax), delivers inference times of 61.98 ms on standard GPUs and 497.49 ms on NVIDIA Jetson AGX Orin, which are 22.9% and 19.8% faster than EfficientLoFTR (opt.), respectively. With 6.31M parameters, which is 2.4× fewer than EfficientLoFTR’s 15.05M, LiteSAM proves to be suitable for edge deployment. Extensive evaluations on natural image matching and downstream vision tasks confirm its superior accuracy and efficiency for general feature matching. Full article
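For the MinGRU-based refinement module, a minimal sketch of the minGRU recurrence (gates computed from the current input only, so each step is just two linear maps and a convex blend) conveys why it is cheap enough for iterative subpixel refinement. The formulation below follows the commonly cited minimal-GRU definition; the paper's exact refiner, offset heads, and dimensions are assumptions.

```python
import torch
import torch.nn as nn

class MinGRU(nn.Module):
    """Minimal GRU: update gate and candidate state depend on the input only."""
    def __init__(self, dim):
        super().__init__()
        self.to_z = nn.Linear(dim, dim)      # update gate from the input
        self.to_h = nn.Linear(dim, dim)      # candidate state from the input

    def forward(self, x):                    # x: (batch, seq, dim)
        h = torch.zeros_like(x[:, 0])
        outs = []
        for t in range(x.size(1)):
            z = torch.sigmoid(self.to_z(x[:, t]))
            h_tilde = self.to_h(x[:, t])
            h = (1 - z) * h + z * h_tilde    # convex blend of old and candidate state
            outs.append(h)
        return torch.stack(outs, dim=1)
```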
32 pages, 9638 KB  
Article
MSSA: A Multi-Scale Semantic-Aware Method for Remote Sensing Image–Text Retrieval
by Yun Liao, Zongxiao Hu, Fangwei Jin, Junhui Liu, Nan Chen, Jiayi Lv and Qing Duan
Remote Sens. 2025, 17(19), 3341; https://doi.org/10.3390/rs17193341 - 30 Sep 2025
Cited by 1 | Viewed by 1218
Abstract
In recent years, the convenience and potential for information extraction offered by Remote Sensing Image–Text Retrieval (RSITR) have made it a significant focus of research in remote sensing (RS) knowledge services. Current mainstream methods for RSITR generally align fused image features at multiple scales with textual features, primarily focusing on the local information of RS images while neglecting potential semantic information. This results in insufficient alignment in the cross-modal semantic space. To overcome this limitation, we propose a Multi-Scale Semantic-Aware Remote Sensing Image–Text Retrieval method (MSSA). This method introduces Progressive Spatial Channel Joint Attention (PSCJA), which enhances the expressive capability of multi-scale image features through Window-Region-Global Progressive Attention (WRGPA) and Segmented Channel Attention (SCA). Additionally, the Image-Guided Text Attention (IGTA) mechanism dynamically adjusts textual attention weights based on visual context. Furthermore, the Cross-Modal Semantic Extraction Module (CMSE) incorporates learnable semantic tokens at each scale, enabling attention interaction between multi-scale features of different modalities and the capture of hierarchical semantic associations. This multi-scale semantic-guided retrieval method ensures cross-modal semantic consistency, significantly improving the accuracy of cross-modal retrieval in RS. MSSA demonstrates superior retrieval accuracy in experiments across three baseline datasets, achieving a new state-of-the-art performance. Full article
(This article belongs to the Section Remote Sensing Image Processing)
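The learnable-semantic-token idea can be sketched with a small cross-attention block: a handful of learnable tokens query the image features at one scale and return a compact semantic summary that can then interact with the text branch. Shapes, token count, and the use of nn.MultiheadAttention are illustrative assumptions, not the CMSE implementation.

```python
import torch
import torch.nn as nn

class SemanticTokenBlock(nn.Module):
    """Illustrative semantic tokens: learnable queries attend to one scale's
    image features and emit a compact, alignable semantic summary."""
    def __init__(self, dim, num_tokens=8, heads=4):
        super().__init__()
        self.tokens = nn.Parameter(torch.randn(1, num_tokens, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, feats):                          # feats: (batch, n, dim)
        q = self.tokens.expand(feats.size(0), -1, -1)  # shared tokens per sample
        summary, _ = self.attn(q, feats, feats)        # tokens query the features
        return summary                                 # (batch, num_tokens, dim)
```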
22 pages, 4113 KB  
Article
PathGen-LLM: A Large Language Model for Dynamic Path Generation in Complex Transportation Networks
by Xun Li, Kai Xian, Huimin Wen, Shengguang Bai, Han Xu and Yun Yu
Mathematics 2025, 13(19), 3073; https://doi.org/10.3390/math13193073 - 24 Sep 2025
Cited by 2 | Viewed by 1686
Abstract
Dynamic path generation in complex transportation networks is essential for intelligent transportation systems. Traditional methods, such as shortest path algorithms or heuristic-based models, often fail to capture real-world travel behaviors due to their reliance on simplified assumptions and limited ability to handle long-range dependencies or non-linear patterns. To address these limitations, we propose PathGen-LLM, a large language model (LLM) designed to learn spatial–temporal patterns from historical paths without requiring handcrafted features or graph-specific architectures. Exploiting the structural similarity between path sequences and natural language, PathGen-LLM converts spatiotemporal trajectories into text-formatted token sequences by encoding node IDs and timestamps. This enables the model to learn global dependencies and semantic relationships through self-supervised pretraining. The model integrates a hierarchical Transformer architecture with dynamic constraint decoding, which synchronizes spatial node transitions with temporal timestamps to ensure physically valid paths in large-scale road networks. Experimental results on real-world urban datasets demonstrate that PathGen-LLM outperforms baseline methods, particularly in long-distance path generation. By bridging sequence modeling and complex network analysis, PathGen-LLM offers a novel framework for intelligent transportation systems, highlighting the potential of LLMs to address challenges in large-scale, real-time network tasks. Full article
(This article belongs to the Special Issue Modeling and Data Analysis of Complex Networks)
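The trajectory-to-text serialization is the load-bearing trick, and it is easy to sketch: each visited node and its timestamp become discrete tokens so a standard language model can be pretrained on paths exactly as it would be on sentences. The node/timestamp token format and the 5-minute time bucketing below are invented for illustration, not taken from the paper.

```python
def path_to_tokens(path):
    """Serialize a timestamped path into a text-like token sequence,
    one (node, time) pair per step."""
    tokens = ["<BOS>"]
    for node_id, ts in path:
        tokens.append(f"N{node_id}")          # road-network node identifier
        tokens.append(f"T{ts // 300}")        # timestamp bucketed into 5-min slots
    tokens.append("<EOS>")
    return tokens

# Example trajectory: node 17 at t=600 s, node 42 at t=930 s
print(path_to_tokens([(17, 600), (42, 930)]))
# ['<BOS>', 'N17', 'T2', 'N42', 'T3', '<EOS>']
```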
21 pages, 37484 KB  
Article
Reconstructing Hyperspectral Images from RGB Images by Multi-Scale Spectral–Spatial Sequence Learning
by Wenjing Chen, Lang Liu and Rong Gao
Entropy 2025, 27(9), 959; https://doi.org/10.3390/e27090959 - 15 Sep 2025
Cited by 1 | Viewed by 2491
Abstract
With rapid advancements in transformers, the reconstruction of hyperspectral images from RGB images, also known as spectral super-resolution (SSR), has made significant breakthroughs. However, existing transformer-based methods often struggle to balance computational efficiency with long-range receptive fields. Recently, Mamba has demonstrated linear complexity in modeling long-range dependencies and shown broad applicability in vision tasks. This paper proposes a multi-scale spectral–spatial sequence learning method, named MSS-Mamba, for reconstructing hyperspectral images from RGB images. First, we introduce a continuous spectral–spatial scan (CS3) mechanism to improve cross-dimensional feature extraction of the foundational Mamba model. Second, we propose a sequence tokenization strategy that generates multi-scale-aware sequences to overcome Mamba’s limitations in hierarchically learning multi-scale information. Specifically, we design the multi-scale information fusion (MIF) module, which tokenizes input sequences before feeding them into Mamba. The MIF employs a dual-branch architecture to process global and local information separately, dynamically fusing features through an adaptive router that generates weighting coefficients. This produces feature maps that contain both global contextual information and local details, ultimately reconstructing a high-fidelity hyperspectral image. Experimental results on the ARAD_1k, CAVE and grss_dfc_2018 datasets demonstrate the performance of MSS-Mamba. Full article
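The adaptive router in the MIF module can be illustrated with a small gating network that predicts per-sample weights for blending a global branch and a local branch. This is only a schematic of weighted dual-branch fusion; the actual MIF tokenization and routing details are not shown.

```python
import torch
import torch.nn as nn

class AdaptiveRouter(nn.Module):
    """Illustrative dual-branch fusion: a tiny router predicts two weights
    that blend global-branch and local-branch features."""
    def __init__(self, dim):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.GELU(),
                                  nn.Linear(dim, 2))

    def forward(self, global_feat, local_feat):        # both (batch, n, dim)
        pooled = torch.cat([global_feat.mean(1), local_feat.mean(1)], dim=-1)
        w = torch.softmax(self.gate(pooled), dim=-1)   # (batch, 2) fusion weights
        return (w[:, 0, None, None] * global_feat +
                w[:, 1, None, None] * local_feat)
```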
21 pages, 2093 KB  
Article
Dual-Stream Time-Series Transformer-Based Encrypted Traffic Data Augmentation Framework
by Daeho Choi, Yeog Kim, Changhoon Lee and Kiwook Sohn
Appl. Sci. 2025, 15(18), 9879; https://doi.org/10.3390/app15189879 - 9 Sep 2025
Viewed by 1240
Abstract
We propose a Transformer-based data augmentation framework with a time-series dual-stream architecture to address performance degradation in encrypted network traffic classification caused by class imbalance between attack and benign traffic. The proposed framework independently processes the complete flow’s sequential packet information and statistical characteristics by extracting and normalizing a local channel (comprising packet size, inter-arrival time, and direction) and a set of six global flow-level statistical features. These are used to generate a fixed-length multivariate sequence and an auxiliary vector. The sequence and vector are then fed into an encoder-only Transformer that integrates learnable positional embeddings with a FiLM + context token-based injection mechanism, enabling complementary representation of sequential patterns and global statistical distributions. Large-scale experiments demonstrate that the proposed method reduces reconstruction RMSE and additional feature restoration MSE by over 50%, while improving accuracy, F1-Score, and AUC by 5–7%p compared to classification on the original imbalanced datasets. Furthermore, the augmentation process achieves practical levels of processing time and memory overhead. These results show that the proposed approach effectively mitigates class imbalance in encrypted traffic classification and offers a promising pathway to achieving more robust model generalization in real-world deployment scenarios. Full article
(This article belongs to the Special Issue AI-Enabled Next-Generation Computing and Its Applications)
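The FiLM half of the "FiLM + context token" injection is straightforward to sketch: the six flow-level statistics are mapped to a per-channel scale (gamma) and shift (beta) that modulate every position of the packet-sequence embedding. The module below is a generic FiLM conditioner under assumed dimensions, not the paper's exact injection mechanism.

```python
import torch
import torch.nn as nn

class FiLMConditioner(nn.Module):
    """Generic FiLM: global flow statistics modulate sequence embeddings
    with a learned per-channel scale and shift."""
    def __init__(self, stat_dim, model_dim):
        super().__init__()
        self.to_gamma_beta = nn.Linear(stat_dim, 2 * model_dim)

    def forward(self, seq_emb, flow_stats):            # (B, T, D), (B, stat_dim)
        gamma, beta = self.to_gamma_beta(flow_stats).chunk(2, dim=-1)
        return gamma.unsqueeze(1) * seq_emb + beta.unsqueeze(1)
```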
14 pages, 954 KB  
Article
A Benchmark for Symbolic Reasoning from Pixel Sequences: Grid-Level Visual Completion and Correction
by Lei Kang, Xuanshuo Fu, Mohamed Ali Souibgui, Andrey Barsky, Lluis Gomez, Javier Vazquez-Corral, Alicia Fornés, Ernest Valveny and Dimosthenis Karatzas
Mathematics 2025, 13(17), 2851; https://doi.org/10.3390/math13172851 - 4 Sep 2025
Viewed by 974
Abstract
Grid-structured visual data such as forms, tables, and game boards require models that pair pixel-level perception with symbolic consistency under global constraints. Recent Pixel Language Models (PLMs) map images to token sequences with promising flexibility, yet we find they generalize poorly when observable evidence becomes sparse or corrupted. We present GridMNIST-Sudoku, a benchmark that renders large numbers of Sudoku instances with style-diverse handwritten digits and provides parameterized stress tracks for two tasks: Completion (predict missing cells) and Correction (detect and repair incorrect cells) across difficulty levels ranging from 1 to 90 altered positions in a 9 × 9 grid. Attention diagnostics on PLMs trained with conventional one-dimensional positional encodings reveal weak structure awareness outside the natural Sudoku sparsity band. Motivated by these findings, we propose a lightweight Row-Column-Box (RCB) positional prior that injects grid-aligned coordinates and combine it with simple sparsity and corruption augmentations. Trained only on the natural distribution, the resulting model substantially improves out-of-distribution accuracy across wide sparsity and corruption ranges while maintaining strong in-distribution performance. Full article
(This article belongs to the Section E1: Mathematics and Computer Science)
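The Row-Column-Box positional prior is simple enough to state in a few lines: each of the 81 cells receives the sum of a row, a column, and a 3×3-box embedding, giving the model grid-aligned coordinates directly. The sketch below is one straightforward parameterization of that idea, not necessarily the authors' exact one.

```python
import torch
import torch.nn as nn

class RCBPositionalPrior(nn.Module):
    """Row-Column-Box positional prior for a 9x9 grid (schematic)."""
    def __init__(self, dim):
        super().__init__()
        self.row = nn.Embedding(9, dim)
        self.col = nn.Embedding(9, dim)
        self.box = nn.Embedding(9, dim)

    def forward(self, cell_emb):                       # cell_emb: (batch, 81, dim)
        idx = torch.arange(81, device=cell_emb.device)
        r, c = idx // 9, idx % 9
        b = (r // 3) * 3 + (c // 3)                    # 3x3 box index, 0..8
        pos = self.row(r) + self.col(c) + self.box(b)  # (81, dim)
        return cell_emb + pos.unsqueeze(0)
```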
21 pages, 4900 KB  
Article
RingFormer-Seg: A Scalable and Context-Preserving Vision Transformer Framework for Semantic Segmentation of Ultra-High-Resolution Remote Sensing Imagery
by Zhan Zhang, Daoyu Shu, Guihe Gu, Wenkai Hu, Ru Wang, Xiaoling Chen and Bingnan Yang
Remote Sens. 2025, 17(17), 3064; https://doi.org/10.3390/rs17173064 - 3 Sep 2025
Cited by 2 | Viewed by 1477
Abstract
Semantic segmentation of ultra-high-resolution remote sensing (UHR-RS) imagery plays a critical role in land use and land cover analysis, yet it remains computationally intensive due to the enormous input size and high spatial complexity. Existing studies have commonly employed strategies such as patch-wise processing, multi-scale model architectures, lightweight networks, and representation sparsification to reduce resource demands, but they have often struggled to maintain long-range contextual awareness and scalability for inputs of arbitrary size. To address this, we propose RingFormer-Seg, a scalable Vision Transformer framework that enables long-range context learning through multi-device parallelism in UHR-RS image segmentation. RingFormer-Seg decomposes the input into spatial subregions and processes them through a distributed three-stage pipeline. First, the Saliency-Aware Token Filter (STF) selects informative tokens to reduce redundancy. Next, the Efficient Local Context Module (ELCM) enhances intra-region features via memory-efficient attention. Finally, the Cross-Device Context Router (CDCR) exchanges token-level information across devices to capture global dependencies. Fine-grained detail is preserved through the residual integration of unselected tokens, and a hierarchical decoder generates high-resolution segmentation outputs. We conducted extensive experiments on three benchmarks covering UHR-RS images from 2048 × 2048 to 8192 × 8192 pixels. Results show that our framework achieves top segmentation accuracy while significantly improving computational efficiency across the DeepGlobe, Wuhan, and Guangdong datasets. RingFormer-Seg offers a versatile solution for UHR-RS image segmentation and demonstrates potential for practical deployment in nationwide land cover mapping, supporting informed decision-making in land resource management, environmental policy planning, and sustainable development. Full article
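The Saliency-Aware Token Filter stage can be sketched independently of the distributed pipeline: score the tokens, keep the top fraction for the expensive attention path, and remember the indices so the unselected (residual) tokens can be reintegrated later. Scoring, keep ratio, and tensor shapes below are illustrative assumptions, not the STF as implemented.

```python
import torch

def select_salient_tokens(tokens, scores, keep_ratio=0.25):
    """Keep the top-k tokens by saliency score; return kept tokens plus the
    indices needed to scatter the residual tokens back afterwards."""
    b, n, d = tokens.shape
    k = max(1, int(n * keep_ratio))
    top = scores.topk(k, dim=1).indices                          # (b, k)
    kept = torch.gather(tokens, 1, top.unsqueeze(-1).expand(b, k, d))
    return kept, top
```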
30 pages, 12920 KB  
Article
CSCN: A Cross-Scan Semantic Cluster Network with Scene Coupling Attention for Remote Sensing Segmentation
by Lei Zhang, Xing Xing, Changfeng Jing, Min Kong and Gaoran Xu
Remote Sens. 2025, 17(16), 2803; https://doi.org/10.3390/rs17162803 - 13 Aug 2025
Viewed by 1166
Abstract
The spatial attention mechanism has been widely employed in the semantic segmentation of remote sensing images due to its exceptional capacity for modeling long-range dependencies. However, the analysis performance of remote sensing images can be reduced owing to their large intra-class variance and complex spatial structures. The vanilla spatial attention mechanism relies on the dense affine operations and a fixed scanning mechanism, which often introduces a large amount of redundant contextual semantic information and lacks consideration of cross-directional semantic connections. This paper proposes a new Cross-scan Semantic Cluster Network (CSCN) with integrated Semantic Filtering Contextual Cluster (SFCC) and Cross-scan Scene Coupling Attention (CSCA) modules to address these limitations. Specifically, the SFCC is designed to filter redundant information; feature tokens are clustered into semantically related regions, effectively identifying local features and reducing the impact of intra-class variance. CSCA effectively addresses the challenges of complex spatial geographic backgrounds by decomposing scene information into object distributions and global representations, using scene coupling and cross-scanning mechanisms and computing attention from different directions. Combining SFCC and CSCA, CSCN not only effectively segments various geographic spatial objects in complex scenes but also has low model complexity. The experimental results on three benchmark datasets demonstrate the outstanding performance of the attention model generated using this approach. Full article
35 pages, 14152 KB  
Article
Hyperspectral and Multispectral Remote Sensing Image Fusion Based on a Retractable Spatial–Spectral Transformer Network
by Yilin He, Heng Li, Miaosen Zhang, Shuangqi Liu, Chunyu Zhu, Bingxia Xin, Jun Wang and Qiong Wu
Remote Sens. 2025, 17(12), 1973; https://doi.org/10.3390/rs17121973 - 6 Jun 2025
Cited by 3 | Viewed by 3502
Abstract
Hyperspectral and multispectral remote sensing image fusion is an optimal approach for generating images with both high spectral and high spatial resolution, effectively overcoming the physical limitations of sensors. In transformer-based image fusion methods constrained by the local window self-attention mechanism, the extraction of global information and coordinated contextual features is often insufficient. Fusion that aims to emphasize spatial–spectral heterogeneous characteristics may significantly enhance the robustness of joint representation for multi-source data. To address these issues, this study proposes a hyperspectral and multispectral remote sensing image fusion method based on a retractable spatial–spectral transformer network (RSST) and introduces the attention retractable mechanism into the field of remote sensing image fusion. Furthermore, a gradient spatial–spectral recovery block is incorporated to effectively mitigate the limitations of token interactions and the loss of spatial–spectral edge information. A series of experiments across multiple scales demonstrate that RSST exhibits significant advantages over existing mainstream image fusion algorithms. Full article
(This article belongs to the Section Remote Sensing Image Processing)
25 pages, 5508 KB  
Article
A Lightweight Network for Water Body Segmentation in Agricultural Remote Sensing Using Learnable Kalman Filters and Attention Mechanisms
by Dingyi Liao, Jun Sun, Zhiyong Deng, Yudong Zhao, Jiani Zhang and Dinghua Ou
Appl. Sci. 2025, 15(11), 6292; https://doi.org/10.3390/app15116292 - 3 Jun 2025
Cited by 1 | Viewed by 1420
Abstract
Precise identification of water bodies in agricultural watersheds is crucial for irrigation, water resource management, and flood disaster prevention. However, the spectral noise caused by complex light and shadow interference and water quality differences, combined with the diverse shapes of water bodies and the high computational cost of image processing, severely limits the accuracy of water body recognition in agricultural watersheds. This paper proposes a lightweight and efficient learnable Kalman filter and Deformable Convolutional Attention Network (LKF-DCANet). The encoder is built using a shallow Channel Attention-Enhanced Deformable Convolution module (CADCN), while the decoder combines a Convolutional Additive Token Mixer (CATM) and a learnable Kalman filter (LKF) to achieve adaptive noise suppression and enhance global context modeling. Additionally, a feature-based knowledge distillation strategy is employed to further improve the representational capacity of the lightweight model. Experimental results show that LKF-DCANet achieves an Intersection over Union (IoU) of 85.95% with only 0.22 M parameters on a public dataset. When transferred to a self-constructed UAV dataset, it achieves an IoU of 96.28%, demonstrating strong generalization ability. All experiments are conducted on RGB optical imagery, confirming that LKF-DCANet offers an efficient and highly versatile solution for water body segmentation in precision agriculture. Full article
(This article belongs to the Section Computing and Artificial Intelligence)