This is an early access version; the complete PDF, HTML, and XML versions will be available soon.
Article

Text-Injected Discriminative Model for Remote Sensing Visual Grounding

School of Computer Science and Technology, University of Science and Technology of China, Hefei 230000, China
* Author to whom correspondence should be addressed.
Remote Sens. 2026, 18(1), 161; https://doi.org/10.3390/rs18010161
Submission received: 22 November 2025 / Revised: 23 December 2025 / Accepted: 31 December 2025 / Published: 4 January 2026
(This article belongs to the Section AI Remote Sensing)

Abstract

Remote Sensing Visual Grounding (RSVG) requires a fine-grained understanding of language descriptions in order to localize the image regions they refer to. Conventional methods typically employ a pipeline of separate visual and textual encoders followed by a fusion module. However, because visual and textual features are extracted independently, the visual features lack semantic focus on the referred object, which leads to suboptimal localization. Some recent attempts incorporate textual cues into visual feature extraction, but they often rely on complex fusion modules. To address this, we introduce a simple fusion strategy that integrates textual information into the visual backbone with minimal architectural changes. Moreover, most current works use standard object detection losses, which focus only on features inside the bounding box and neglect background features. In remote sensing images, the high visual similarity between objects can confuse models, making it difficult to localize the correct target accurately. To this end, we design a novel attention regularization strategy that enhances the model's ability to distinguish similar features outside the bounding box region. Experiments on three benchmark datasets demonstrate the promising performance of our approach.
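To make the two components described above more concrete, the sketch below is a minimal, hypothetical PyTorch illustration: a residual cross-attention block that injects text tokens into a visual backbone stage, and an attention regularization term that penalizes attention mass falling outside the ground-truth box. The class name TextInjectionBlock, the placement of the cross-attention, the token shapes, and the loss formulation are assumptions for illustration only, not the paper's actual implementation.

```python
# Hypothetical sketch of the two ideas outlined in the abstract (PyTorch).
# All names, shapes, and hyperparameters are assumptions for illustration;
# the paper's actual architecture and loss may differ.
import torch
import torch.nn as nn


class TextInjectionBlock(nn.Module):
    """Injects textual cues into a visual backbone stage via cross-attention.

    Assumed interface: visual tokens (B, Nv, C) attend to text tokens (B, Nt, C).
    A residual connection preserves the backbone's original features, so the
    block can be dropped into an existing stage with minimal changes.
    """

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.norm_v = nn.LayerNorm(dim)
        self.norm_t = nn.LayerNorm(dim)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, vis_tokens: torch.Tensor, txt_tokens: torch.Tensor) -> torch.Tensor:
        q = self.norm_v(vis_tokens)
        kv = self.norm_t(txt_tokens)
        injected, _ = self.cross_attn(q, kv, kv)   # text-conditioned visual update
        return vis_tokens + injected               # residual: original features kept


def attention_regularization_loss(attn_map: torch.Tensor,
                                  box_mask: torch.Tensor) -> torch.Tensor:
    """Penalizes attention mass that falls outside the ground-truth box.

    attn_map: (B, N) attention weights over N visual tokens (sum to 1 per sample).
    box_mask: (B, N) binary mask, 1 for tokens inside the annotated bounding box.
    Suppressing background attention discourages the model from latching onto
    visually similar distractor objects outside the box.
    """
    outside_mass = (attn_map * (1.0 - box_mask)).sum(dim=-1)
    return outside_mass.mean()


if __name__ == "__main__":
    block = TextInjectionBlock(dim=256)
    vis = torch.randn(2, 400, 256)                       # e.g., 20x20 feature map, flattened
    txt = torch.randn(2, 12, 256)                        # 12 projected text tokens
    fused = block(vis, txt)                              # (2, 400, 256)

    attn = torch.softmax(torch.randn(2, 400), dim=-1)    # toy attention over visual tokens
    mask = torch.zeros(2, 400)
    mask[:, :50] = 1.0                                   # pretend the first 50 tokens lie in the box
    reg = attention_regularization_loss(attn, mask)
    print(fused.shape, reg.item())
```

Keeping the injection residual means the backbone's pretrained features pass through unchanged when the text adds little information, which is one way such a block could remain lightweight; the regularization term complements a standard detection loss rather than replacing it.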
Keywords: remote sensing; visual grounding; multimodal learning

