Search Results (2,269)

Search Parameters:
Keywords = semantic representation

23 pages, 2725 KB  
Article
Text- and Face-Conditioned Multi-Anchor Conditional Embedding for Robust Periocular Recognition
by Po-Ling Fong, Tiong-Sik Ng and Andrew Beng Jin Teoh
Appl. Sci. 2026, 16(2), 942; https://doi.org/10.3390/app16020942 - 16 Jan 2026
Abstract
Periocular recognition is essential when full-face images cannot be used because of occlusion, privacy constraints, or sensor limitations, yet in many deployments, only periocular images are available at run time, while richer evidence, such as archival face photos and textual metadata, exists offline. This mismatch makes it hard to deploy conventional multimodal fusion. This motivates the notion of conditional biometrics, where auxiliary modalities are used only during training to learn stronger periocular representations while keeping deployment strictly periocular-only. In this paper, we propose Multi-Anchor Conditional Periocular Embedding (MACPE), which maps periocular, facial, and textual features into a shared anchor-conditioned space via a learnable anchor bank that preserves periocular micro-textures while aligning higher-level semantics. Training combines identity classification losses on periocular and face branches with a symmetric InfoNCE loss over anchors and a pulling regularizer that jointly aligns periocular, facial, and textual embeddings without collapsing into face-dominated solutions; captions generated by a vision language model provide complementary semantic supervision. At deployment, only the periocular encoder is used. Experiments across five periocular datasets show that MACPE consistently improves Rank-1 identification and reduces EER at a fixed FAR compared with periocular-only baselines and alternative conditioning methods. Ablation studies verify the contributions of anchor-conditioned embeddings, textual supervision, and the proposed loss design. Full article
(This article belongs to the Section Computing and Artificial Intelligence)
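
The loss design named in the abstract includes a symmetric InfoNCE term. A two-branch symmetric InfoNCE can be sketched in PyTorch as below; the function name, the 0.07 temperature, and the reduction to two modalities are assumptions, and the paper's anchor-conditioned three-branch version is not reproduced here.

```python
import torch
import torch.nn.functional as F

def symmetric_info_nce(peri_emb, face_emb, temperature=0.07):
    """Two-branch symmetric InfoNCE: row i of each tensor is the same identity.

    peri_emb, face_emb: (N, D) embeddings; matched rows form positive pairs,
    all other rows in the batch act as negatives.
    """
    peri = F.normalize(peri_emb, dim=-1)
    face = F.normalize(face_emb, dim=-1)
    logits = peri @ face.t() / temperature                 # (N, N) cosine similarities
    targets = torch.arange(peri.size(0), device=peri.device)
    # cross-entropy in both retrieval directions, then averaged (the "symmetric" part)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
```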

28 pages, 8826 KB  
Article
A Lightweight LLM-Based Semantic–Spatial Inference Framework for Fine-Grained Urban POI Analysis
by Zhuo Huang, Yixing Guo, Shuo Huang and Miaoxi Zhao
Smart Cities 2026, 9(1), 13; https://doi.org/10.3390/smartcities9010013 - 16 Jan 2026
Abstract
Unstructured POI name texts are widely used in fine-grained urban analysis, yet missing labels and semantic ambiguity often limit their value for spatial inference. This study proposes a large language model-based semantic–spatial inference framework (LLM-SSIF), a lightweight semantic–spatial pipeline that translates POI texts into interpretable, fine-grained spatial evidence through an end-to-end workflow that couples scalable label expansion with scale-controlled spatial diagnostics at a 500 m resolution. A key advantage of LLM-SSIF is its deployability: LoRA-based parameter-efficient fine-tuning of an open LLM enables lightweight adaptation under limited compute while scaling fine-label coverage. Trained on a nationwide cuisine-labeled dataset (~220,000 records), the model achieves strong multi-class short-text recognition (macro-F1 = 0.843) and, in the Guangzhou–Shenzhen demonstration, expands usable fine-category labels by ~14–15× to support grid-level inference under long-tail sparsity. The spatial module then isolates cuisine-specific over/under-representation beyond overall restaurant intensity, revealing contrasting cultural configurations between Guangzhou and Shenzhen. Overall, LLM-SSIF provides a reproducible and transferable way to translate unstructured POI texts into spatial–statistical evidence for comparative urban analysis. Full article
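
The abstract's spatial module isolates cuisine-specific over/under-representation beyond overall restaurant intensity. The exact diagnostic is not given in the abstract; a location-quotient-style ratio per grid cell, sketched below purely as an assumption, is one conventional way to express that idea.

```python
import numpy as np

def cuisine_representation_index(cuisine_counts, total_counts):
    """Location-quotient-style index per 500 m grid cell (hypothetical diagnostic).

    cuisine_counts, total_counts: 1-D arrays of POI counts per cell.
    Values > 1 suggest the cuisine is over-represented relative to overall
    restaurant intensity in that cell; values < 1 suggest under-representation.
    """
    cuisine_counts = np.asarray(cuisine_counts, dtype=float)
    total_counts = np.asarray(total_counts, dtype=float)
    cuisine_share = cuisine_counts / max(cuisine_counts.sum(), 1.0)
    overall_share = total_counts / max(total_counts.sum(), 1.0)
    return np.where(overall_share > 0,
                    cuisine_share / np.maximum(overall_share, 1e-12),
                    np.nan)
```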

29 pages, 44276 KB  
Article
MSFFDet: A Meta-Learning-Based Support-Guided Feature Fusion Detector for Few-Shot Remote Sensing Detection
by Haoxiang Qi, Wenzhe Zhao, Ting Zhang and Guangyao Zhou
Appl. Sci. 2026, 16(2), 917; https://doi.org/10.3390/app16020917 - 15 Jan 2026
Abstract
Few-shot object detection in remote sensing imagery faces significant challenges, including limited labeled samples, complex scene backgrounds, and subtle inter-class differences. To tackle these issues, we design a novel detection framework that effectively transfers supervision from a few annotated support examples to the query domain. We introduce a feature enhancement mechanism that injects fine-grained support cues into the query representation, helping the model focus on relevant regions and suppress background noise. This allows the model to generate more accurate proposals and perform robust classification, especially for visually confusing or small objects. Additionally, our method enhances feature interaction between support and query images through a nonlinear combination strategy, which captures both semantic similarity and discriminative differences. The proposed framework is fully end-to-end and jointly optimizes the feature fusion and detection processes. Experiments on three challenging benchmarks, NWPU VHR-10, iSAID and DIOR, demonstrate that our method consistently achieves state-of-the-art results under different few-shot settings and category splits. Compared with other advanced methods, it yields superior performance, highlighting its strong generalization ability in low-data remote sensing scenarios. Full article
(This article belongs to the Special Issue AI in Object Detection)
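
The support-guided feature enhancement described here (injecting support cues into the query representation) could be realized in many ways. The sketch below is a minimal hedged example, not the paper's architecture: the class name, the sigmoid gate, and the residual form are all assumptions.

```python
import torch
import torch.nn as nn

class SupportGuidedEnhancer(nn.Module):
    """Hypothetical sketch: turn a support-class prototype into channel weights
    that re-emphasize support-relevant channels in the query feature map."""

    def __init__(self, channels):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(channels, channels), nn.Sigmoid())

    def forward(self, query_feat, support_feats):
        # query_feat: (B, C, H, W); support_feats: (K, C) pooled support embeddings
        prototype = support_feats.mean(dim=0)              # average the K support shots
        weights = self.gate(prototype).view(1, -1, 1, 1)   # per-channel gate in (0, 1)
        return query_feat * (1.0 + weights)                # residual-style enhancement
```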

63 pages, 10763 KB  
Review
The State of HBIM in Digital Heritage: A Critical and Bibliometric Assessment of Six Emerging Frontiers (2015–2025)
by Fabrizio Banfi and Wanqin Liu
Appl. Sci. 2026, 16(2), 906; https://doi.org/10.3390/app16020906 - 15 Jan 2026
Abstract
After nearly two decades of developments in Historic/Heritage Building Information Modeling (HBIM), the field has reached a stage of maturity that calls for a critical reassessment of its evolution, achievements, and remaining challenges. Digital representation has become a central component of contemporary heritage conservation, enabling advanced methods for analysis, management, and communication. This review examines the maturation of HBIM as a comprehensive framework that integrates extended reality (XR), artificial intelligence (AI), machine learning (ML), semantic segmentation and Digital Twin (DT). Six major research domains that have shaped recent progress are outlined: (1) the application of HBIM to restoration and conservation workflows; (2) the expansion of public engagement through XR, virtual museums, and serious games; (3) the stratigraphic documentation of building archaeology, historical phases, and material decay; (4) data-exchange mechanisms and interoperability with open formats and Common Data Environments (CDEs); (5) strategies for modeling geometric and semantic complexity using traditional, applied, and AI-driven approaches; and (6) the emergence of heritage DT as dynamic, semantically enriched systems integrating real-time and lifecycle data. A comparative assessment of international case studies and bibliometric trends (2015–2025) illustrates how HBIM is transforming proactive and data-informed conservation practice. The review concludes by identifying persistent gaps and outlining strategic directions for the next phase of research and implementation. Full article

22 pages, 5928 KB  
Article
PromptTrace: A Fine-Grained Prompt Stealing Attack via CLIP-Guided Beam Search for Text-to-Image Models
by Shaofeng Ming, Yuhao Zhang, Yang Liu, Tianyu Han, Dengmu Liu, Tong Yu, Jieke Lu and Bo Xu
Symmetry 2026, 18(1), 161; https://doi.org/10.3390/sym18010161 - 15 Jan 2026
Abstract
The inherent semantic symmetry and cross-modal alignment between textual prompts and generated images have fueled the success of text-to-image (T2I) generation. However, this strong correlation also introduces security vulnerabilities, specifically prompt stealing attacks, where valuable prompts are reverse-engineered from images. In this paper, we address the challenge of information asymmetry in black-box attack scenarios and propose PromptTrace, a fine-grained prompt stealing framework via Contrastive Language-Image Pre-training (CLIP)-guided beam search. Unlike existing methods that rely on single-stage generation, PromptTrace structurally decomposes prompt reconstruction into subject generation, modifier extraction, and iterative search optimization to effectively restore the visual–textual correspondence. By leveraging a CLIP-guided beam search strategy, our method progressively optimizes candidate prompts based on image–text similarity feedback, ensuring the stolen prompt achieves high fidelity in both semantic intent and stylistic representation. Extensive evaluations across multiple datasets and T2I models demonstrate that PromptTrace outperforms existing methods, highlighting the feasibility of exploiting cross-modal symmetry for attacks and underscoring the urgent need for defense mechanisms in the T2I ecosystem. Full article
(This article belongs to the Section Computer)
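
A CLIP-guided beam search of the kind described can be sketched with the Hugging Face transformers CLIP wrappers. The checkpoint, the way candidates are built from a modifier pool, and the beam width and step count below are illustrative assumptions, not the PromptTrace pipeline.

```python
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_scores(image, prompts):
    """Similarity scores between one PIL image and a list of candidate prompts."""
    inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        out = model(**inputs)
    return out.logits_per_image.squeeze(0)                 # shape: (len(prompts),)

def beam_search_prompt(image, subject, modifier_pool, beam_width=4, steps=3):
    """Iteratively append the modifiers that most increase image-text similarity."""
    beams = [subject]
    for _ in range(steps):
        candidates = [f"{b}, {m}" for b in beams for m in modifier_pool]
        scores = clip_scores(image, candidates)
        top = scores.topk(min(beam_width, len(candidates))).indices
        beams = [candidates[i] for i in top]
    return beams[0]                                        # highest-scoring reconstruction
```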

24 pages, 1009 KB  
Article
HiSem-RAG: A Hierarchical Semantic-Driven Retrieval-Augmented Generation Method
by Dongju Yang and Junming Wang
Appl. Sci. 2026, 16(2), 903; https://doi.org/10.3390/app16020903 - 15 Jan 2026
Abstract
Traditional retrieval-augmented generation (RAG) methods struggle with hierarchical documents, often causing semantic fragmentation, structural loss, and inefficient retrieval due to fixed strategies. To address these challenges, this paper proposes HiSem-RAG, a hierarchical semantic-driven RAG method. It comprises three key modules: (1) hierarchical semantic indexing, which preserves boundaries and relationships between sections and paragraphs to reconstruct document context; (2) a bidirectional semantic enhancement mechanism that incorporates titles and summaries to facilitate two-way information flow; and (3) a distribution-aware adaptive threshold strategy that dynamically adjusts retrieval scope based on similarity distributions, balancing accuracy with computational efficiency. On the domain-specific EleQA dataset, HiSem-RAG achieves 82.00% accuracy, outperforming HyDE and RAPTOR by 5.04% and 3.98%, respectively, with reduced computational costs. On the LongQA dataset, it attains a ROUGE-L score of 0.599 and a BERT_F1 score of 0.839. Ablation studies confirm the complementarity of these modules, particularly in long-document scenarios. Full article
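
The distribution-aware adaptive threshold can be pictured as a cutoff derived from each query's own similarity distribution. The mean-plus-k-sigma rule and the keep bounds in the sketch below are assumptions chosen for illustration, not the paper's exact strategy.

```python
import numpy as np

def adaptive_retrieval(similarities, k_sigma=1.0, min_keep=1, max_keep=10):
    """Keep chunks whose similarity exceeds mean + k_sigma * std of this query's
    own score distribution, bounded by min_keep/max_keep (all values assumed)."""
    sims = np.asarray(similarities, dtype=float)
    threshold = sims.mean() + k_sigma * sims.std()
    ranked = np.argsort(sims)[::-1]                        # indices, best first
    selected = [int(i) for i in ranked if sims[i] >= threshold]
    if len(selected) < min_keep:
        selected = [int(i) for i in ranked[:min_keep]]
    return selected[:max_keep]
```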

31 pages, 15918 KB  
Article
Cross-Domain Landslide Mapping in Remote Sensing Images Based on Unsupervised Domain Adaptation Framework
by Jing Yang, Mingtao Ding, Wubiao Huang, Qiang Xue, Ying Dong, Bo Chen, Lulu Peng, Fuling Zhang and Zhenhong Li
Remote Sens. 2026, 18(2), 286; https://doi.org/10.3390/rs18020286 - 15 Jan 2026
Abstract
Rapid and accurate acquisition of landslide inventories is essential for effective disaster relief. Deep learning-based pixel-wise semantic segmentation of remote sensing imagery has greatly advanced in landslide mapping. However, the heavy dependence on extensive annotated labels and sensitivity to domain shifts severely constrain the model performance in unseen domains, leading to poor generalization. To address these limitations, we propose LandsDANet, an innovative unsupervised domain adaptation framework for cross-domain landslide identification. Firstly, adversarial learning is employed to reduce the data distribution discrepancies between the source and target domains, thereby achieving output space alignment. The improved SegFormer serves as the segmentation network, incorporating hierarchical Transformer blocks and an attention mechanism to enhance feature representation capabilities. Secondly, to alleviate inter-domain radiometric discrepancies and attain image-level alignment, a Wallis filter is utilized to perform image style transformation. Considering the class imbalance present in the landslide dataset, a Rare Class Sampling strategy is introduced to mitigate bias towards common classes and strengthen the learning of the rare landslide class. Finally, a contrastive loss is adopted to further optimize and enhance the model’s ability to delineate fine-grained class boundaries. The proposed model is validated on the Potsdam and Vaihingen benchmark datasets, followed by validation in two landslide scenarios induced by earthquakes and rainfall to evaluate its adaptability across different disaster domains. Compared to the source-only model, LandsDANet achieved improvements in IoU of 27.04% and 35.73% in two cross-domain landslide disaster recognition tasks, respectively. This performance not only showcases its outstanding capabilities but also underscores its robust potential to meet the demands for rapid response. Full article
(This article belongs to the Section AI Remote Sensing)
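
The Wallis filter used for image-level radiometric alignment rescales local mean and contrast toward target statistics. The sketch below is a deliberately simplified global version (per-band mean/std matching); the function name, clipping range, and the omission of local windows and damping parameters are assumptions.

```python
import numpy as np

def radiometric_match(src, tgt_mean, tgt_std, eps=1e-6):
    """Per-band mean/std matching toward target-domain statistics.

    src: (H, W, C) float array; tgt_mean, tgt_std: length-C target statistics.
    A real Wallis filter does this over local windows with damping parameters;
    this global version is only meant to convey the idea.
    """
    out = np.empty_like(src, dtype=float)
    for c in range(src.shape[-1]):
        band = src[..., c].astype(float)
        out[..., c] = (band - band.mean()) / (band.std() + eps) * tgt_std[c] + tgt_mean[c]
    return np.clip(out, 0.0, 255.0)
```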

19 pages, 4395 KB  
Article
An Attention-Based Bidirectional Feature Fusion Algorithm for Insulator Detection
by Binghao Gao, Jinyu Guo, Yongyue Wang, Dong Li and Xiaoqiang Jia
Sensors 2026, 26(2), 584; https://doi.org/10.3390/s26020584 - 15 Jan 2026
Abstract
To maintain reliability, safety, and sustainability in power transmission, insulator defect detection has become a critical task in power line inspection. Due to the complex backgrounds and small defect sizes encountered in insulator defect images, issues such as false detections and missed detections often occur. The existing You Only Look Once (YOLO) object detection algorithm is currently the mainstream method for image-based insulator defect detection in power lines. However, existing models suffer from low detection accuracy. To address this issue, this paper presents an improved YOLOv5-based MC-YOLO insulator detection algorithm. To effectively extract multi-scale information and enhance the model’s ability to represent feature information, a multi-scale attention convolutional fusion (MACF) module incorporating an attention mechanism is proposed. This module utilises parallel convolutions with different kernel sizes to effectively extract features at various scales and highlights the feature representation of key targets through the attention mechanism, thereby improving the detection accuracy. Additionally, a cross-context feature fusion module (CCFM) is designed, where shallow features gain partial deep semantic supplementation and deep features absorb shallow spatial information, achieving bidirectional information flow. Furthermore, the Spatial-Channel Dual Attention Module (SCDAM) is introduced into CCFM. By incorporating a dynamic attention-guided bidirectional cross-fusion mechanism, it effectively resolves the feature deviation between shallow details and deep semantics during multi-scale feature fusion. The experimental results show that the MC-YOLO algorithm achieves an mAP@0.5 of 67.4% on the dataset used in this study, which is a 4.1% improvement over the original YOLOv5. Although the FPS is slightly reduced compared to the original model, it remains practical and capable of rapidly and accurately detecting insulator defects. Full article
(This article belongs to the Section Industrial Sensors)
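
A multi-scale attention convolutional fusion block of the kind described (parallel convolutions with different kernel sizes followed by attention reweighting) might look roughly like the PyTorch sketch below. The kernel sizes, the squeeze-and-excitation-style attention, and the reduction ratio are assumptions rather than the paper's MACF design.

```python
import torch
import torch.nn as nn

class MultiScaleAttnFusion(nn.Module):
    """Parallel convolutions at several kernel sizes, summed and reweighted by a
    squeeze-and-excitation-style channel attention (all design choices assumed)."""

    def __init__(self, channels, reduction=4):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Conv2d(channels, channels, k, padding=k // 2) for k in (1, 3, 5)
        )
        self.attn = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // reduction, 1), nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1), nn.Sigmoid(),
        )

    def forward(self, x):
        fused = sum(branch(x) for branch in self.branches)  # multi-scale aggregation
        return fused * self.attn(fused)                     # highlight key channels
```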

27 pages, 24824 KB  
Article
UGFF-VLM: Uncertainty-Guided and Frequency-Fused Vision-Language Model for Remote Sensing Farmland Segmentation
by Kai Tan, Yanlan Wu, Hui Yang and Xiaoshuang Ma
Remote Sens. 2026, 18(2), 282; https://doi.org/10.3390/rs18020282 - 15 Jan 2026
Abstract
Vision-language models can leverage natural language descriptions to encode stable farmland characteristics, providing a new paradigm for farmland extraction, yet existing methods face challenges in ambiguous text-visual alignment and loss of high-frequency boundary details during fusion. To address this, this article utilizes the semantic prior knowledge provided by textual descriptions in vision–language models to enhance the model’s ability to recognize polymorphic features, and proposes an Uncertainty-Guided and Frequency-Fused Vision-Language Model (UGFF-VLM) for remote sensing farmland extraction. The UGFF-VLM combines the semantic representation ability of vision-language models, further integrates an Uncertainty-Guided Adaptive Alignment (UGAA) module to dynamically adjust cross-modal fusion based on alignment confidence, and a Frequency-Enhanced Cross-Modal Fusion (FECF) mechanism to preserve high-frequency boundary details in the frequency domain. Experimental results on the FarmSeg-VL dataset demonstrate that the proposed method delivers excellent and stable performance, achieving the highest mIoU across diverse geographical environments while showing significant improvements in boundary precision and robustness against false positives. Therefore, the proposed UGFF-VLM not only mitigates the issues of recognition confusion and poor generalization in purely vision-based models caused by farmland feature polymorphism but also effectively enhances boundary segmentation accuracy, providing a reliable method for the precise delineation of agricultural parcels in diverse landscapes. Full article
(This article belongs to the Special Issue Advanced AI Technology for Remote Sensing Analysis)

23 pages, 1486 KB  
Article
AI-Based Emoji Recommendation for Early Childhood Education Using Deep Learning Techniques
by Shaya A. Alshaya
Computers 2026, 15(1), 59; https://doi.org/10.3390/computers15010059 - 15 Jan 2026
Abstract
The integration of emojis into Early Childhood Education (ECE) presents a promising avenue for enhancing student engagement, emotional expression, and comprehension. While prior studies suggest the benefit of visual aids in learning, systematic frameworks for pedagogically aligned emoji recommendation remain underdeveloped. This paper presents EduEmoji-ECE, a pedagogically annotated dataset of early-childhood learning text segments. Specifically, the proposed model incorporates Bidirectional Encoder Representations from Transformers (BERTs) for contextual embedding extraction, Gated Recurrent Units (GRUs) for sequential pattern recognition, Deep Neural Networks (DNNs) for classification and emoji recommendation, and DECOC for improving emoji class prediction robustness. This hybrid BERT-GRU-DNN-DECOC architecture effectively captures textual semantics, emotional tone, and pedagogical intent, ensuring the alignment of emoji class recommendation with learning objectives. The experimental results show that the system is effective, with an accuracy of 95.3%, a precision of 93%, a recall of 91.8%, and an F1-score of 92.3%, outperforming baseline models in terms of contextual understanding and overall accuracy. This work helps fill a gap in AI-based education by combining learning with visual support for young children. The results suggest an association between emoji-enhanced materials and improved engagement/comprehension indicators in our exploratory classroom setting; however, causal attribution to the AI placement mechanism is not supported by the current study design. Full article

30 pages, 3060 KB  
Article
LLM-Based Multimodal Feature Extraction and Hierarchical Fusion for Phishing Email Detection
by Xinyang Yuan, Jiarong Wang, Tian Yan and Fazhi Qi
Electronics 2026, 15(2), 368; https://doi.org/10.3390/electronics15020368 - 14 Jan 2026
Abstract
Phishing emails continue to evade conventional detection systems due to their increasingly sophisticated, multi-faceted social engineering tactics. To address the limitations of single-modality or rule-based approaches, we propose SAHF-PD, a novel phishing detection framework that integrates multi-modal feature extraction with semantic-aware hierarchical fusion, based on large language models (LLMs). Our method leverages modality-specialized large models, each guided by domain-specific prompts and constrained to a standardized output schema, to extract structured feature representations from four complementary sources associated with each phishing email: email body text; open-source intelligence (OSINT) derived from the key embedded URL; screenshot of the landing page; and the corresponding HTML/JavaScript source code. This design mitigates the unstructured and stochastic nature of raw generative outputs, yielding consistent, interpretable, and machine-readable features. These features are then integrated through our Semantic-Aware Hierarchical Fusion (SAHF) mechanism, which organizes them into core, auxiliary, and weakly associated layers according to their semantic relevance to phishing intent. This layered architecture enables dynamic weighting and redundancy reduction based on semantic relevance, which in turn highlights the most discriminative signals across modalities and enhances model interpretability. We also introduce PhishMMF, a publicly released multimodal feature dataset for phishing detection, comprising 11,672 human-verified samples with meticulously extracted structured features from all four modalities. Experiments with eight diverse classifiers demonstrate that the SAHF-PD framework enables exceptional performance. For instance, XGBoost equipped with SAHF attains an AUC of 0.99927 and an F1-score of 0.98728, outperforming the same model using the original feature representation. Moreover, SAHF compresses the original 228-dimensional feature space into a compact 56-dimensional representation (a 75.4% reduction), reducing the average training time across all eight classifiers by 43.7% while maintaining comparable detection accuracy. Ablation studies confirm the unique contribution of each modality. Our work establishes a transparent, efficient, and high-performance foundation for next-generation anti-phishing systems. Full article
(This article belongs to the Section Artificial Intelligence)
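
The semantic-aware hierarchical fusion idea, grouping extracted features into core, auxiliary, and weakly associated layers and weighting them by relevance, can be caricatured with a fixed-weight sketch. Every feature name and weight below is hypothetical, and the paper's layering and weighting are dynamic rather than hard-coded.

```python
import numpy as np

# Hypothetical layer assignment and weights; the paper's grouping is learned from
# semantic relevance and its weighting is dynamic, not hard-coded like this.
LAYERS = {
    "core":      (["urgency_tone", "credential_request", "url_brand_mismatch"], 1.0),
    "auxiliary": (["domain_age_days", "tls_valid", "redirect_count"],           0.5),
    "weak":      (["html_length", "num_images"],                                0.2),
}

def hierarchical_fusion(features: dict) -> np.ndarray:
    """Concatenate feature groups, scaling each group by its layer weight."""
    parts = []
    for names, weight in LAYERS.values():
        parts.append(weight * np.array([float(features.get(n, 0.0)) for n in names]))
    return np.concatenate(parts)
```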

19 pages, 3064 KB  
Article
Frequency-Aware Unsupervised Domain Adaptation for Semantic Segmentation of Laparoscopic Images
by Huiwen Dong and Gaofeng Zhang
Appl. Sci. 2026, 16(2), 840; https://doi.org/10.3390/app16020840 - 14 Jan 2026
Abstract
Semantic segmentation of laparoscopic images requires costly pixel-level annotations, which are often unavailable for real surgical data. This gives rise to an unsupervised domain adaptation scenario, where labeled synthetic images serve as the source domain and unlabeled real images as the target. We propose a frequency-aware unsupervised domain adaptation framework to mitigate the domain gap between simulated and real laparoscopic images. Specifically, we introduce a Radial Frequency Masking module that selectively masks frequency components of real images, and employ a Mean Teacher framework to enforce consistency between high- and low-frequency representations. In addition, we propose a module called Fourier Domain Adaptation-Blend, a style transfer strategy based on low-frequency blending, and apply entropy minimization to enhance prediction confidence on the target domain. Experiments are conducted on public datasets by jointly training on simulated and real laparoscopic images. Our method consistently outperforms representative baselines. These results demonstrate the effectiveness of frequency-aware adaptation in surgical image segmentation without relying on manual annotations from the target domain. Full article
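
Low-frequency blending of the kind behind the Fourier Domain Adaptation-Blend module mixes the low-frequency amplitude spectrum of one image with that of another while keeping the original phase. The sketch below follows the standard Fourier domain adaptation recipe; the beta and alpha values and the single-channel simplification are assumptions.

```python
import numpy as np

def low_frequency_blend(src, ref, beta=0.01, alpha=0.5):
    """Blend the low-frequency amplitude spectrum of src with that of ref while
    keeping src's phase, so content survives but global appearance shifts.

    src, ref: (H, W) single-channel float arrays of the same size.
    beta sets the half-width of the low-frequency square; alpha the blend strength.
    """
    fft_src, fft_ref = np.fft.fft2(src), np.fft.fft2(ref)
    amp_src, pha_src = np.abs(fft_src), np.angle(fft_src)
    amp_ref = np.abs(fft_ref)

    amp_src, amp_ref = np.fft.fftshift(amp_src), np.fft.fftshift(amp_ref)
    h, w = src.shape
    b = max(1, int(min(h, w) * beta))
    ch, cw = h // 2, w // 2
    low = (slice(ch - b, ch + b), slice(cw - b, cw + b))   # central low-frequency block
    amp_src[low] = (1 - alpha) * amp_src[low] + alpha * amp_ref[low]
    amp_src = np.fft.ifftshift(amp_src)

    return np.real(np.fft.ifft2(amp_src * np.exp(1j * pha_src)))
```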

27 pages, 5686 KB  
Article
MAFMamba: A Multi-Scale Adaptive Fusion Network for Semantic Segmentation of High-Resolution Remote Sensing Images
by Boxu Li, Xiaobing Yang and Yingjie Fan
Sensors 2026, 26(2), 531; https://doi.org/10.3390/s26020531 - 13 Jan 2026
Abstract
With rapid advancements in sub-meter satellite and aerial imaging technologies, high-resolution remote sensing imagery has become a pivotal source for geospatial information acquisition. However, current semantic segmentation models encounter two primary challenges: (1) the inherent trade-off between capturing long-range global context and preserving precise local structural details—where excessive reliance on downsampled deep semantics often results in blurred boundaries and the loss of small objects and (2) the difficulty in modeling complex scenes with extreme scale variations, where objects of the same category exhibit drastically different morphological features. To address these issues, this paper introduces MAFMamba, a multi-scale adaptive fusion visual Mamba network tailored for high-resolution remote sensing images. To mitigate scale variation, we design a lightweight hybrid encoder incorporating an Adaptive Multi-scale Mamba Block (AMMB) in each stage. Driven by a Multi-scale Adaptive Fusion (MSAF) mechanism, the AMMB dynamically generates pixel-level weights to recalibrate cross-level features, establishing a robust multi-scale representation. Simultaneously, to strictly balance local details and global semantics, we introduce a Global–Local Feature Enhancement Mamba (GLMamba) in the decoder. This module synergistically integrates local fine-grained features extracted by convolutions with global long-range dependencies modeled by the Visual State Space (VSS) layer. Furthermore, we propose a Multi-Scale Cross-Attention Fusion (MSCAF) module to bridge the semantic gap between the encoder’s shallow details and the decoder’s high-level semantics via an efficient cross-attention mechanism. Extensive experiments on the ISPRS Potsdam and Vaihingen datasets demonstrate that MAFMamba surpasses state-of-the-art Convolutional Neural Network (CNN), Transformer, and Mamba-based methods in terms of mIoU and mF1 scores. Notably, it achieves superior accuracy while maintaining linear computational complexity and low memory usage, underscoring its efficiency in complex remote sensing scenarios. Full article
(This article belongs to the Special Issue Intelligent Sensors and Artificial Intelligence in Building)

23 pages, 54003 KB  
Article
TRACE: Topical Reasoning with Adaptive Contextual Experts
by Jiabin Ye, Qiuyi Xin, Chu Zhang and Hengnian Qi
Big Data Cogn. Comput. 2026, 10(1), 31; https://doi.org/10.3390/bdcc10010031 - 13 Jan 2026
Abstract
Retrieval-Augmented Generation (RAG) is widely used for long-text summarization due to its efficiency and scalability. However, standard RAG methods flatten documents into independent chunks, disrupting sequential flow and thematic structure, resulting in significant loss of contextual information. This paper presents MOEGAT, a novel graph-enhanced retrieval framework that addresses this limitation by explicitly modeling document structure. MOEGAT constructs an Orthogonal Context Graph to capture sequential discourse and global semantic relationships—long-range dependencies between non-adjacent text spans that reflect topical similarity and logical associations beyond local context. It then employs a query-aware Mixture-of-Experts Graph Attention Network to dynamically activate specialized reasoning pathways. Experiments conducted on three public long-text summarization datasets demonstrate that MOEGAT achieves state-of-the-art performance. Notably, on the WCEP dataset, it outperforms the previous state-of-the-art Graph of Records (GOR) baseline by 14.9%, 18.1%, and 18.4% on ROUGE-L, ROUGE-1, and ROUGE-2, respectively. These substantial gains, especially the 14.9% improvement in ROUGE-L, reflect significantly better capture of long-range coherence and thematic integrity in summaries. Ablation studies confirm the effectiveness of the orthogonal graph and Mixture-of-Experts components. Overall, this work introduces a novel structure-aware approach to RAG that explicitly models and leverages document structure through an orthogonal graph representation and query-aware Mixture-of-Experts reasoning. Full article
(This article belongs to the Special Issue Generative AI and Large Language Models)
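
An Orthogonal Context Graph that couples sequential discourse with global semantic relationships can be sketched with networkx. The similarity threshold, the skip-adjacent rule, and the undirected graph below are assumptions; the paper's construction may differ.

```python
import numpy as np
import networkx as nx

def build_context_graph(chunks, embeddings, sim_threshold=0.75):
    """Graph over document chunks with two edge types: sequential edges between
    adjacent chunks and semantic edges between non-adjacent, similar chunks."""
    emb = np.asarray(embeddings, dtype=float)
    emb = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    sims = emb @ emb.T

    graph = nx.Graph()
    graph.add_nodes_from(range(len(chunks)))
    for i in range(len(chunks) - 1):
        graph.add_edge(i, i + 1, kind="sequential")        # discourse order
    for i in range(len(chunks)):
        for j in range(i + 2, len(chunks)):                # skip adjacent pairs
            if sims[i, j] >= sim_threshold:
                graph.add_edge(i, j, kind="semantic")      # long-range topical link
    return graph
```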

15 pages, 1527 KB  
Article
Learning Complementary Representations for Targeted Multimodal Sentiment Analysis
by Binfen Ding, Jieyu An and Yumeng Lei
Computers 2026, 15(1), 52; https://doi.org/10.3390/computers15010052 - 13 Jan 2026
Abstract
Targeted multimodal sentiment classification is frequently impeded by the semantic sparsity of social media content, where text is brief and context is implicit. Traditional methods that rely on direct concatenation of textual and visual features often fail to resolve the ambiguity of specific targets due to a lack of alignment between modalities. In this paper, we propose the Complementary Description Network (CDNet) to bridge this informational gap. CDNet incorporates automatically generated image descriptions as an additional semantic bridge, in contrast to methods that handle text and images as distinct streams. The framework enhances the input representation by directly translating visual content into text, allowing for more accurate interactions between the opinion target and the visual narrative. We further introduce a complementary reconstruction module that functions as a regularizer, forcing the model to retain deep semantic cues during fusion. Empirical results on the Twitter-2015 and Twitter-2017 benchmarks confirm that CDNet outperforms existing baselines. The findings suggest that visual-to-text augmentation is an effective strategy for compensating for the limited context inherent in short texts. Full article
(This article belongs to the Section AI-Driven Innovations)
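
CDNet's central move, translating the image into text and splicing that description in beside the tweet and the opinion target, can be sketched with an off-the-shelf captioner. BLIP and the prompt template below are illustrative assumptions, not the captioning model or input format used in the paper.

```python
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
captioner = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")

def caption_bridge(tweet_text: str, target: str, image_path: str) -> str:
    """Render the image as words and splice it into the classifier's text input."""
    image = Image.open(image_path).convert("RGB")
    inputs = processor(images=image, return_tensors="pt")
    out = captioner.generate(**inputs, max_new_tokens=30)
    caption = processor.decode(out[0], skip_special_tokens=True)
    return f"target: {target} | text: {tweet_text} | image: {caption}"
```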
