Review

Fine-Grained Interpretation of Remote Sensing Image: A Review

by Dongbo Wang 1, Zedong Yan 2,3 and Peng Liu 2,3,*

1 China Nuclear Power Engineering Co., Ltd., Beijing 100840, China
2 Aerospace Information Research Institute, Chinese Academy of Sciences, Beijing 100049, China
3 School of Electronic, Electrical and Communication Engineering, University of Chinese Academy of Sciences, Beijing 100049, China
* Author to whom correspondence should be addressed.
Remote Sens. 2025, 17(23), 3887; https://doi.org/10.3390/rs17233887
Submission received: 19 September 2025 / Revised: 13 November 2025 / Accepted: 20 November 2025 / Published: 30 November 2025

Highlights

What are the main findings?
  • This review systematically analyzes the methods of fine-grained remote sensing image interpretation at different levels (pixel level, object level, and scene level) and discusses the challenges faced by fine-grained methods, such as the lack of a unified definition of fine granularity, heavy reliance on computer vision methods despite domain gaps, limited cross-domain generalization, and open-world recognition.
  • Current fine-grained interpretation datasets cover typical objects and scenarios but suffer from limitations such as high intra-class similarity, limited modality diversity, geographic imbalance, and high annotation costs, while future datasets will develop toward multi-modal integration, global coverage, and temporal dynamics.
What is the implication of the main finding?
  • The summarized method system and dataset optimization direction provide a clear technical framework for subsequent research, helping to solve key challenges such as small inter-class differences and poor cross-domain generalization in fine-grained interpretation.
  • Dataset optimization will remove a key bottleneck and promote the application of fine-grained interpretation technology in environmental monitoring, agriculture, urban planning, and other fields.

Abstract

This article presents a systematic review of the fine-grained interpretation of remote sensing images, delving into its background, current status, datasets, methodology, and future trends, with the aim of providing a comprehensive reference framework for research in this field. Regarding fine-grained interpretation datasets, we introduce representative datasets and analyze their key characteristics, such as the number of categories, sample size, and resolution, as well as their benchmarking role in research. For methodologies, by classifying the core methods according to the interpretation level system, this paper systematically summarizes the deep-learning-based methods, models, and architectures for fine-grained remote sensing image interpretation at different levels, including pixel-level classification and segmentation, object-level detection, and scene-level recognition. Finally, the review concludes that although deep learning has driven substantial advances in accuracy and applicability, fine-grained interpretation remains an inherently challenging problem due to issues such as the distinction of highly similar categories, cross-sensor domain shift, and high annotation costs. We also outline future directions, emphasizing the need to further enhance generalization, support open-world recognition, and adapt to complex real-world scenarios. This review aims to promote the application of fine-grained interpretation technology for remote sensing images across a broader range of fields.

1. Introduction

The rapid advance of satellite imaging technology and the proliferation of advanced remote sensing platforms have ushered in a new era for Earth observation. Modern sensors are capable of capturing imagery with higher spatial resolution, richer spectral bands, and higher revisit frequency, creating rich datasets that go well beyond coarse land-cover maps. However, the interpretive methods for these data commonly remain at coarse granularity—that is, assigning broad semantic labels such as “urban,” “forest,” or “water” to pixels or regions. As demands for more precise, context-sensitive, and operationally useful products grow, the idea of “fine-grained remote sensing interpretation” has emerged as an indispensable research direction.
In the remote sensing community, the term fine-grained interpretation has been used with different connotations across semantic levels of analysis. Unlike the computer vision field, where “fine-grained” typically refers to discriminating sub-categories within an object class (e.g., species of birds or types of vehicles), remote sensing imagery embodies a hierarchical spatial–semantic structure, where fine-grainedness manifests differently at the pixel, object, and scene levels. Pixel-level fine-grainedness primarily relates to spectral and radiometric resolution. It concerns the discrimination of subtle spectral variations within mixed or adjacent pixels—for instance, distinguishing different wetland types [1] or tree species [2]. Here, fine-grained interpretation is more about spectral granularity and sub-pixel information modeling. Object-level fine-grainedness parallels the computer vision notion of part-based or subclass recognition. It focuses on identifying specific structures or components within a broader class (as in Figure 1), such as distinguishing different types of buildings or ships [3,4]. The “fine-grainedness” lies in structural and morphological variability rather than subtle spectral differences. Scene-level fine-grainedness extends beyond local patterns to the semantic composition and contextual relationships within complex landscapes. It captures nuanced differences between semantically similar environments—e.g., distinguishing residential neighborhoods from mixed industrial–commercial zones [5], or subtle patterns of urban sprawl [6]. At this level, fine-grained interpretation involves contextual semantics and hierarchical scene understanding.
Yet despite many advances, there is no universally accepted formal definition of fine-grained remote sensing interpretation. Different authors may emphasize subclass discrimination, structural detail, or semantic subdivision, leading to blurred conceptual boundaries. To bring clarity, in this work we propose that fine-grained remote sensing interpretation (FRSI) should refer to the family of remote sensing image interpretation tasks that aim to achieve “higher semantic resolution and discriminative capability” than conventional coarse-level labeling. In essence, FRSI seeks “multi-dimensional finer granularity”—spectral, semantic, spatial, or structural—and produces interpretable attribution or reasoning along these dimensions.
In recent years, several surveys and reviews on “fine-grained” analysis have appeared. For instance, “Fine-Grained Image Analysis With Deep Learning: A Survey” [7] focuses on fine-grained image analysis (FGIA) in the field of computer vision, emphasizing the identification of subcategories within the same major category (such as bird species or car models). Meanwhile, “Fine-Grained Image Recognition Methods and Their Applications in Remote Sensing Images: A Review” [8] transfers these ideas into remote sensing contexts, but its scope remains relatively close to object detection in computer vision. However, our review does not fully follow the perspectives and understandings of fine granularity presented in these existing reviews. We give more consideration to the different semantic levels and contextual meanings of fine-grained interpretation.
Figure 1. Examples of fine-grained objects (ships [9]) in remote sensing images. Top line: coarse-grained interpretation. Bottom line: fine-grained interpretation.
The contributions of this survey are threefold: (1) Different from previous reviews on fine-grained interpretation, we review fine-grained interpretation from the perspective of the different semantic levels (pixel, object, and scene levels) of remote sensing image interpretation and summarize the existing methods. (2) We trace the “evolution” of remote sensing interpretation and systematically summarize the development trends and challenges of fine-grained interpretation in the field of remote sensing. (3) We identify open issues and future directions, such as cross-domain adaptation, annotation efficiency, interpretability, and open-world generalization.

2. The Datasets for Fine-Grained Interpretation

2.1. Current Status of the Dataset

Remote sensing datasets play an extremely important role in research on fine-grained remote sensing image interpretation. Current fine-grained remote sensing interpretation datasets can roughly be divided into three categories: pixel-level, target (object)-level, and scene-level. When choosing datasets for this review, we mainly select task types that are directly related to fine-grained interpretation and try to cover all three levels.
Pixel-level datasets are mostly used for land cover classification and feature change detection, such as TREE [10], Belgium Data [11], and FUSU [12]. The data sources are mostly airborne or ground systems (such as the LiCHy hyperspectral system). They emphasize subtle distinctions in the spectral dimension, such as the spectral differences among land covers. Object-level datasets are mainly constructed for individual targets such as ships, aircraft, and buildings, for example, HRSC2016 [13], FGSCR-42 [14], ShipRSImageNet [15], and MFBFS [16]. They often consist of high-resolution (0.1–6 m) remote sensing images and contain a large number of categories, emphasizing the subtle differences between similar categories (such as ship models and aircraft models). The data sources mainly include Google Earth, WorldView, GaoFen series satellites, etc. Scene-level datasets have the widest coverage and are applied in remote sensing scene classification and retrieval, such as AID [17], NWPU-RESISC45 [18], PatternNet [19], MLRSNet [20], Million-AID [21], and MEET [22]. They have a wide resolution range (0.06–153 m), and the sample sizes are large (ranging from tens of thousands to millions of images). The sources mainly include Google Earth, Bing Maps, Sentinel, OpenStreetMap, etc. In Table 1, common datasets for fine-grained remote sensing image interpretation are summarized.
Overall, the existing datasets have basically covered typical fine-grained objects and scenarios such as ships, aircraft, buildings, vegetation, and land use/cover, providing important support for related research.

2.2. Existing Deficiencies of Datasets

Despite substantial progress in fine-grained remote sensing interpretation, current datasets face fundamental limitations rooted in the theoretical constraints of machine learning and deep learning, hindering model generalization and real-world applicability.
High Intra-Class Similarity. (1) Imbalanced Feature Space and Blurred Decision Boundaries: From statistical learning theory, fine-grained classification suffers from skewed intra-class vs. inter-class variance. Categories are differentiated only by subtle local, spectral, or texture differences, resulting in narrow inter-class distances in the feature space. Meanwhile, intra-class samples are scattered due to imaging angle shifts, atmospheric interference, and target state changes (e.g., vegetation phenology), often making intra-class variance exceed inter-class variance. This violates the “compact intra-class, separated inter-class” assumption, masking discriminative features with noise. For CNNs and Transformers, the low signal-to-noise ratio of fine-grained features impedes effective gradient descent, failing to encode stable discriminative representations. (2) Bias–Variance Imbalance and Generalization Failures: High intra-class similarity disrupts the bias–variance tradeoff: increasing model complexity to capture subtle differences reduces bias but amplifies sensitivity to intra-class variations, triggering overfitting. From generalization error decomposition, this overfitting stems from overlearning “non-essential variations”—datasets lack systematic coverage of multi-dimensional factors (e.g., imaging geometry, meteorology), leaving representations vulnerable to environmental perturbations (out-of-distribution failures). Additionally, metric learning methods (e.g., Triplet Loss) fail to optimize, as minimal gaps between sample pairs prevent a discriminative metric space.
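To make the metric-learning point above concrete, the following is a minimal sketch (in PyTorch, not tied to any cited method) of the triplet objective: when fine-grained classes are nearly identical, the anchor–negative distance barely exceeds the anchor–positive distance, so the hinge term stays close to the margin and provides little useful learning signal.

```python
import torch

def triplet_loss(anchor, positive, negative, margin=0.2):
    # Squared Euclidean distances in the embedding space.
    d_ap = (anchor - positive).pow(2).sum(dim=1)   # intra-class distance
    d_an = (anchor - negative).pow(2).sum(dim=1)   # inter-class distance
    # Hinge: push d_an to exceed d_ap by at least the margin.
    return torch.clamp(d_ap - d_an + margin, min=0).mean()

# Synthetic embeddings imitating a fine-grained case: the "negative" class is
# barely farther from the anchor than the "positive" samples are.
emb = torch.nn.functional.normalize(torch.randn(16, 128), dim=1)
loss = triplet_loss(emb,
                    emb + 0.01 * torch.randn_like(emb),
                    emb + 0.02 * torch.randn_like(emb))
```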
Limited Modality Diversity. (1) Information Bottleneck of Single Modalities: From information theory, single-modality data (e.g., optical imagery) has an inherent entropy ceiling. Optical data captures spectral and spatial features but lacks 3D geometry, material composition, or dielectric property information due to band and physical constraints, leading to incomplete representation. This violates deep learning’s demand for “distribution integrity,” creating an information bottleneck that caps performance. Single modalities also lack robustness: optical data is prone to cloud cover and illumination changes, while SAR suffers from speckle noise. Multi-modal data offsets interference via complementary noise distributions, but its absence abandons this theoretical advantage. (2) Cross-Modal Heterogeneity and Alignment Deficits: Modality heterogeneity is the core fusion challenge. Remote sensing modalities (2D optical grids, 3D LiDAR point clouds, SAR coherence features) differ in dimension, scale, and physics, requiring alignment to a unified semantic space. Effective alignment depends on large-scale paired samples, but current datasets lack such data, trapping fusion algorithms in “data–theory decoupling.” Advanced methods (e.g., modal embedding) cannot be validated, forcing reliance on simple concatenation that achieves only “weak complementarity.”
Geographic Imbalance. (1) Domain Shift and Violated IID Assumption: From domain adaptation theory, geographic imbalance causes severe domain shift. Training domains (China, U.S., Europe) and test domains (Africa, South America) differ in distribution due to climate, terrain, and human activity—manifesting as covariate shift and concept shift. This violates the independent and identically distributed (IID) assumption: when test domains deviate, learned representations and decision functions fail. Insufficient domain coverage means models cannot adapt to unseen regional feature complexity, leading to irreducible generalization errors. (2) Barriers to Domain-Invariant Feature Learning: Ideal representations should be “domain-invariant,” encoding intrinsic ground object attributes. From meta-learning theory, this requires cross-domain diverse data to separate domain noise. However, geographic imbalance creates a “data sparsity trap”: over-represented single-domain samples dominate learning with domain-specific features (e.g., regional roof preferences), preventing cross-domain generalization. This triggers “negative transfer,” where domain-specific features interfere with test domain classification, raising errors beyond random guessing.
Annotation Bottlenecks. (1) Scarce Strong Supervision and Data Inefficiency: Fine-grained annotation demands expert-driven pixel/object-level labels—high-information-density but costly signals. This limits dataset scale and quality. Deep learning models (especially large-parameter ones) rely on sufficient data; below a “critical threshold,” gradient descent fails, causing the curse of dimensionality and underfitting. Weak supervision alternatives (coarse labels, pseudo-labels) are flawed: coarse labels lack gradient signals, while pseudo-labels suffer from high noise, exacerbating overfitting. (2) Dataset Saturation and Generalization Risks: Benchmark saturation reflects divergent empirical and generalization risks. As algorithms iterate, training error drops to near-zero, but test error stagnates or rises—consistent with the “overfitting limit.” When model complexity exceeds data capacity, models learn “benchmark-specific biases” (annotation errors, imaging noise) instead of intrinsic features. Saturated datasets deviate from real-world distributions, leading to sharp performance drops in practice and “pseudo-positive progress” driven by bias adaptation.

2.3. Future Outlook of Fine-Grained Datasets

Future dataset development for fine-grained remote sensing interpretation is expected to follow several important directions:
1. Multi-modal integration. Most existing benchmarks are dominated by optical imagery, which captures rich spectral and spatial details but is often limited by weather, lighting, and occlusion. To address these challenges, constructing datasets that integrate optical, SAR, LiDAR, and hyperspectral modalities will be critical. SAR can penetrate clouds and provide structural backscatter features, LiDAR captures accurate 3D geometry and elevation information, and hyperspectral imaging offers detailed spectral signatures for material identification. By combining these complementary data sources, future datasets will enable models to recognize fine-grained categories even under challenging conditions (e.g., distinguishing tree species in dense canopies or identifying military targets under camouflage). Multi-modal benchmarks will also foster the development of fusion-based algorithms that better reflect real-world operational requirements.
2. Global coverage and domain diversity. Current datasets are geographically imbalanced, with most samples collected from regions such as China, the United States, and parts of Europe. This geographic bias restricts the generalization ability of models to unseen domains. Expanding datasets to cover diverse climates, cultures, and ecosystems—for instance, tropical rainforests in South America, arid deserts in Africa, or island regions in Oceania—will help mitigate domain bias. In addition, datasets should incorporate varying socio-economic environments (urban, rural, coastal, industrial) to ensure broader representativeness. Such global and cross-domain coverage will make fine-grained datasets more reliable for worldwide applications such as biodiversity monitoring, agricultural assessment, and disaster response.
3. Temporal and dynamic monitoring. Most existing benchmarks are static snapshots, which limits their use for monitoring changes over time. However, many fine-grained tasks are inherently dynamic, such as crop phenology, urban expansion, forest succession, and water resource fluctuation. Incorporating time-series data will allow researchers to capture temporal evolution and model long-term trends. For example, crop species might be indistinguishable at a single time point but reveal distinct spectral or structural patterns when tracked across multiple growth stages. Similarly, urban construction stages or seasonal flooding patterns can only be fully captured in temporal datasets. Building fine-grained time-series benchmarks will thus support more realistic monitoring and predictive modeling tasks.
4. Efficient annotation strategies. The creation of fine-grained datasets is constrained by the costly and time-consuming nature of expert annotations, especially when subtle distinctions (e.g., between aircraft variants or tree species) require domain expertise. To reduce labeling costs, future work should explore weakly supervised learning (using coarse labels or incomplete annotations), self-supervised learning (leveraging large-scale unlabeled imagery), and crowdsourcing platforms that engage non-experts under expert validation. Additionally, incorporating knowledge graphs and generative augmentation can help generate pseudo-labels or synthetic samples to expand datasets efficiently. These strategies will make it feasible to construct large-scale fine-grained benchmarks in a scalable and sustainable way.
5. Open-world and zero-shot benchmarks. In real-world applications, remote sensing systems often encounter novel classes that were not present in the training data. However, most current datasets assume closed-world settings, where the label space is fixed. Future benchmarks should explicitly support open-world recognition and zero-shot learning, where models can detect and reason about unseen categories by leveraging semantic embeddings, textual descriptions, or external knowledge bases. Initiatives such as OpenEarthSensing [30] exemplify this trend, providing benchmarks that require models to generalize to novel classes and handle uncertain environments. Such benchmarks will be vital for practical deployments in tasks like disaster monitoring, where emergent phenomena (e.g., new building types or unusual environmental events) cannot be predefined.
In summary, fine-grained datasets at the pixel, object, and scene levels have substantially advanced research in remote sensing interpretation, enriching both the scale and complexity of available benchmarks. Nevertheless, limitations such as high intra-class similarity, modality constraints, geographic imbalance, and annotation costs continue to hinder broader applicability. The future of fine-grained dataset construction will rely on multi-modality, global-scale diversity, temporal dynamics, efficient labeling strategies, and open-world settings, enabling more generalizable, intelligent, and application-ready solutions for remote sensing interpretation.

3. Methodology Taxonomy

Remote sensing image interpretation refers to the comprehensive technical process of analyzing, identifying, and interpreting the spectral, spatial, textural, and temporal characteristics of objects or phenomena in remote sensing images. Essentially, it serves as a “bridge” between remote sensing data and practical Earth observation applications. According to the granularity and objectives of information extraction, it can be divided into three core levels: pixel-level, object-level, and scene-level, as in Figure 2. Each level is interrelated yet has a clear differentiated positioning, while fine-grained interpretation is an in-depth extension of the demand for “subclass distinction” based on these levels.
Pixel-level interpretation, as the foundation of remote sensing interpretation, focuses on the semantic attribution of individual or local pixels. Its core tasks include pixel-level classification (e.g., distinguishing basic ground objects such as farmland, water bodies, and buildings) and semantic segmentation (delineating pixel-level boundaries of ground objects). Traditional methods rely on spectral features (e.g., the low near-infrared reflectance of water bodies) or simple texture features, which are suitable for macro ground object classification in medium- and low-resolution images (e.g., large-scale land use classification). With the development of high-resolution remote sensing technology, pixel-level interpretation has gradually advanced toward “fine-grained attribute distinction.” For example, it can distinguish different crop varieties in hyperspectral images and identify building roof materials in high-resolution optical images. This demand for “subclass segmentation under basic ground objects” has become the prototype of fine-grained interpretation at the pixel level.
Object-level interpretation centers on “discrete ground object targets” and requires both spatial localization of targets (e.g., bounding box annotation) and category judgment. Typical applications include ship detection, aircraft recognition, and building extraction. Traditional object-level interpretation focuses on “presence/absence” and “broad category distinction” (e.g., distinguishing “ships” from “aircraft”). However, practical scenarios often require more refined target classification: for instance, ships need to be distinguished into “frigates” and “destroyers,” aircraft into “passenger planes” and “military transport planes,” and buildings into “historic protected buildings” and “ordinary residential buildings.” This type of “subclass identification under the same broad category” has driven object-level interpretation toward fine-grained development, which needs to overcome the technical challenge of “feature confusion between highly similar targets” (e.g., the similar outlines of different ship models).
Scene-level interpretation takes the “entire image scene” as the analysis unit. By integrating pixels, targets, and contextual information, it judges the overall semantics of the scene (e.g., “airport,” “port,” “urban residential area”) and supports regional-scale applications (e.g., urban functional zone division, disaster scene assessment). Traditional scene-level interpretation focuses on “broad scene category distinction” (e.g., distinguishing “forests” from “cities”). However, refined applications require more detailed scene subclass division: for example, “urban residential areas” need to be subdivided into “high-density high-rise communities” and “low-density villa areas,” “wetlands” into “swamp wetlands” and “tidal flat wetlands,” and “airports” into “military–civilian joint-use airports” and “civil airports.” This “functional/morphological subclass identification under broad scene categories” has become the core demand for fine-grained scene-level interpretation.
Before the deep learning era, fine-grained interpretation of remote sensing images largely relied on handcrafted features designed to capture spectral, textural, structural, and spatial nuances within and across land-cover categories. These approaches formed the foundation of modern semantic interpretation, achieving remarkable success in scenarios with subtle intra-class variability and limited labeled data. At the pixel level, fine-grainedness was primarily expressed through the use of spectral and radiometric indicators. Researchers developed feature-based models to discriminate minute spectral differences in vegetation, soil, or water conditions. Typical examples include the use of spectral indices and texture descriptors to characterize vegetation stress or sub-pixel composition [39,40]. At the object level, the focus shifted toward the structural and geometric variability within object categories, such as ships, buildings, or aircraft. Before convolutional models emerged, these tasks were accomplished through manually designed descriptors, including Histogram of Oriented Gradients (HOG) [41], Local Binary Patterns (LBPs) [42], and Gabor filters. At the scene level, fine-grained interpretation aimed to distinguish semantically similar environments—such as residential, industrial, and commercial zones—by integrating global appearance and spatial context. Bag-of-Visual-Words (BoVW) [43] and Spatial Pyramid Matching (SPM) models [44] became particularly influential in remote sensing scene classification.
In summary, traditional handcrafted-feature-based approaches laid the conceptual and technical groundwork for modern fine-grained remote sensing interpretation. They introduced the notions of spectral granularity, structural detail, and contextual hierarchy—ideas later absorbed and generalized by deep learning frameworks. While limited in scalability and robustness, these methods provided interpretability and physical insight that remain valuable for hybrid and explainable AI paradigms.
Fine-grained interpretation is not a new paradigm independent of the three levels, but a technical deepening centered on the goal of “subclass distinction” based on each level. Its core value lies in breaking the bottleneck of semantic ambiguity of ground objects in the same broad category. In the following sections, a comprehensive and in-depth review of the methodologies for these three levels of fine-grained interpretation will be presented.

3.1. Fine-Grained Pixel-Level Classification or Segmentation

The fine-grained remote sensing interpretation at the pixel level mainly includes the classification at the pixel level and the semantic segmentation of the remote sensing images. Generally, there are more studies on pixel-level classification for hyperspectral images and more on semantic segmentation for high-resolution multispectral images. In recent years, research based on spatial–spectral joint classification has become popular. Many classification applications also take advantage of the correlation between pixels and feature consistency, and the boundary between segmentation and classification has gradually blurred. Whether it is classification or segmentation, the challenges faced by pixel-level fine-grained interpretation mainly come from the similarity of the spectral characteristics of subclass pixels within a large category. Some subclasses are even almost indistinguishable in the spectral dimension and can only be distinguished by features such as spatial texture or consistency in the temporal dimension.
An example of this is shown in Figure 3 from the GSFF dataset [10], which originally has 12 different land-cover classes, containing 9 forest vegetation categories. However, we find that these nine types of vegetation are very difficult to distinguish because their spectra are all very similar. Distinguishing these nine types of vegetation is a typical fine-grained classification problem. In addition, in natural image research like ImageNet, this fine-grained pixel-level classification based on similar spectra is not common. This is one of the significant differences between fine-grained research in the field of remote sensing and traditional computer vision.
Common methods for fine-grained pixel-level classification or segmentation of remote sensing images include novel data representation, coarse–fine category relationship modeling, multi-source data fusion, and advanced data annotation strategies.

3.1.1. Novel Data Representation

This category of methods focuses on breaking through the limitations of traditional, purely spectral features by means of innovative feature extraction mechanisms, enabling more accurate capture of the key information required for fine-grained classification, such as morphological structures, subtle texture differences, and edge textures.
Spatial–Spectral Joint Representation. Spatial–spectral joint representation is one of the most popular methods in fine-grained classification at the pixel level. For example, CASST [45] establishes long-range mappings between spectral sequences (inter-band dependencies) and spatial features (neighboring pixel correlations) through a dual-branch Transformer and cross-attention; GRetNet [46] introduces Gaussian multi-head attention to dynamically calibrate the saliency of spectral–spatial features, enhancing the discriminability of fine-grained differences (e.g., spectral peak shifts in closely related tree species); CenterFormer [47] focuses on the spatial–spectral features of target pixels through a central pixel enhancement mechanism to reduce background interference; E2TNet [48] designs an efficient multi-granularity fusion module to balance global correlations of coarse/fine-grained spatial–spectral features; FGSCNN [49] fuses high-level semantic features with fine-grained spatial details (e.g., edge textures) through an encoder–decoder architecture. Some studies do not simply jointly extract features in the spatial–spectral dimension, but introduce new spatial features such as gradients. For example, ref. [50] proposes G2C-Conv3D, which weights and combines traditional convolution with gradient-centralized convolution to simultaneously capture pixel-intensity semantic information and gradient changes, supplementing intensity features with gradient information to improve the model’s sensitivity to subtle structures such as edges and textures. Spatial–spectral joint representations break the independent modeling of spectral and spatial features, capturing their intrinsic correlations via mechanisms like attention and Transformer to improve classification accuracy in complex scenes with fine-grained classes.
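As an illustration of the dual-branch spatial–spectral idea, the sketch below (an assumption written in PyTorch, not the CASST or GRetNet architecture) lets spectral tokens of the centre pixel query the spatial context of its patch through cross-attention.

```python
import torch
import torch.nn as nn

class DualBranchCrossAttention(nn.Module):
    def __init__(self, bands=103, dim=64, heads=4, classes=9):
        super().__init__()
        # Spectral branch: each band value of the centre pixel becomes a token.
        self.spec_embed = nn.Linear(1, dim)
        # Spatial branch: each pixel of the patch neighbourhood becomes a token.
        self.spat_embed = nn.Linear(bands, dim)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.head = nn.Linear(dim, classes)

    def forward(self, patch_cube):
        # patch_cube: (B, bands, H, W) hyperspectral patch centred on the target pixel
        b, c, h, w = patch_cube.shape
        centre = patch_cube[:, :, h // 2, w // 2]            # (B, bands)
        spec_tokens = self.spec_embed(centre.unsqueeze(-1))   # (B, bands, dim)
        spat_tokens = self.spat_embed(
            patch_cube.flatten(2).transpose(1, 2))            # (B, H*W, dim)
        # Cross-attention: spectral tokens query the spatial context.
        fused, _ = self.cross_attn(spec_tokens, spat_tokens, spat_tokens)
        return self.head(fused.mean(dim=1))                   # (B, classes)

logits = DualBranchCrossAttention()(torch.randn(2, 103, 9, 9))
```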
Morphological Representation. Apart from spatial–spectral joint representation, some studies explore feature extraction methods that differ entirely from convolution or transform operations to deal with fine-grained classification problems. In [51], the authors propose SLA-NET, which combines morphological operators (erosion and dilation) with trainable structuring elements to extract fine morphological features (e.g., contours and compactness) of tree crowns; ref. [2] designs a dual-concentrated network (DNMF) that separates spectral and spatial information before fusing morphological features to enhance the robustness of tree species classification; morphFormer [52] models the interaction between the structure and shape of trees/minerals through spectral–spatial morphological convolution and attention mechanisms. This kind of method focuses on the geometric morphology of objects (e.g., crown shape and texture distribution), compensating for the inability of traditional convolution to capture non-Euclidean features.
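The sketch below illustrates, assuming PyTorch is available, how grey-scale dilation and erosion can be expressed with max-pooling so that morphological features become differentiable; it uses flat structuring elements rather than the trainable ones of SLA-NET.

```python
import torch
import torch.nn.functional as F

def dilation(x, kernel_size=3):
    # Grey-scale dilation = sliding-window maximum.
    return F.max_pool2d(x, kernel_size, stride=1, padding=kernel_size // 2)

def erosion(x, kernel_size=3):
    # Grey-scale erosion = sliding-window minimum (max of the negated image).
    return -F.max_pool2d(-x, kernel_size, stride=1, padding=kernel_size // 2)

x = torch.rand(1, 1, 32, 32)
opening = dilation(erosion(x))   # removes small bright structures
closing = erosion(dilation(x))   # fills small dark gaps
top_hat = x - opening            # highlights fine bright details (e.g., crown texture)
```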
Edge and Area Representation. These methods focus on edge continuity and regional integrity, addressing the fragmentation of classification results in traditional methods. PatchOut [53] adopts a Transformer–CNN hybrid architecture and a feature reconstruction module to retain large-scale regional features while restoring edge details, enabling patch-free fine land-cover classification; SSUN-CRF [54] combines a spectral–spatial unified network with a fully connected conditional random field to smooth the edges of classification results and enhance regional consistency; the edge feature enhancement framework (EDFEM+ESM) [55] improves the segmentation accuracy of mineral edges through multi-level feature fusion and edge supervision.
The advantages of novel data representation mainly lie in the following: (1) Strong fine-grained feature capture: Innovative representation mechanisms accurately capture key information such as morphology, spatial–spectral correlations, gradient changes, and edge textures, significantly improving the discriminability of closely related categories (e.g., tree species and minerals). (2) Flexible model adaptability: Modular designs (e.g., morphological modules and attention modules) can be embedded into mainstream architectures like CNN and Transformer, compatible with diverse scene requirements. Their main limitation is high model complexity: modules such as multi-scale fusion and morphological transformation increase parameter scales and computational loads, imposing strict requirements on training data volume and hardware computing power.

3.1.2. Modeling Relationships Between Coarse and Fine Classes

This category of methods reduces the reliance of fine-grained tasks on annotated data by modeling the hierarchical relationship between coarse-grained categories (e.g., “vegetation”) and fine-grained categories (e.g., “oak” and “poplar”), using prior knowledge of coarse categories to guide fine category classification.
Typical methods are as follows: ref. [56] uses a GAN with DenseNet, where the generator learns coarse category distributions and the discriminator distinguishes fine category differences to achieve semi-supervised fine-grained classification; the coarse-to-fine joint distribution alignment framework [57] matches cross-domain coarse category distributions and then calibrates fine category feature differences through a coupled VAE and adversarial learning; CSSD [58] maps patch-level coarse-grained information to pixel-level fine category classification through central spectral self-distillation, solving the “granularity mismatch” problem; the CPDIC framework [59] aligns cross-domain coarse–fine category distributions using a calibrated prototype loss to enhance domain adaptability; the fine-grained multi-scale network [60] combines superpixel post-processing to iteratively optimize fine category boundaries from coarse classification results; CFSSL [61] performs coarse classification with a small number of labels and then uses high-confidence pseudo-labels to guide fine-grained classification of small categories.
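A minimal sketch of the coarse-to-fine idea is given below (a hypothetical head, not any of the cited frameworks): fine-class logits are re-weighted by the predicted coarse-class probabilities through a fixed coarse-to-fine mapping.

```python
import torch
import torch.nn as nn

class CoarseToFineHead(nn.Module):
    def __init__(self, dim, hierarchy):
        # hierarchy: list mapping each fine class index to its coarse class index,
        # e.g., [0, 0, 1, 1, 1] for 2 coarse and 5 fine classes.
        super().__init__()
        self.coarse_fc = nn.Linear(dim, max(hierarchy) + 1)
        self.fine_fc = nn.Linear(dim, len(hierarchy))
        self.register_buffer("fine_to_coarse", torch.tensor(hierarchy))

    def forward(self, feat):
        coarse_logits = self.coarse_fc(feat)                 # (B, n_coarse)
        fine_logits = self.fine_fc(feat)                     # (B, n_fine)
        coarse_prob = coarse_logits.softmax(dim=1)           # coarse prior
        prior = coarse_prob[:, self.fine_to_coarse]          # broadcast to fine classes
        # Bayes-style re-weighting: fine prediction guided by the coarse prior.
        return coarse_logits, fine_logits + prior.log()

head = CoarseToFineHead(dim=64, hierarchy=[0, 0, 1, 1, 1])
coarse, fine = head(torch.randn(4, 64))
```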
The advantages of modeling relationships between coarse and fine classes are as follows: (1) High data efficiency: By reusing coarse category knowledge (e.g., spectral commonalities of “vegetation”), the demand for annotated samples for fine-grained categories (e.g., specific tree species) is reduced, making it particularly suitable for few-shot scenarios. (2) Strong generalization ability: Hierarchical modeling mitigates the interference of intra-fine-category variations (e.g., different growth stages of the same tree species) on classification, improving the model’s adaptability to scene changes. The limitations are as follows: (1) Risk of hierarchical bias: Unreasonable definition of hierarchical relationships between coarse and fine categories (e.g., incorrectly classifying “shrubs” as a subclass of “arbor”) can lead to systematic bias in fine-grained classification. (2) Limited cross-domain adaptability: In scenes with severe spectral variation (e.g., vegetation in different seasons), differences in feature distribution between coarse and fine categories may disrupt hierarchical relationships, reducing classification accuracy.

3.1.3. Multi-Source Data Integration

The core of this category of methods is to break through the information dimensional limitations of single-source data by fusing complementary data sources (e.g., hyperspectral and LiDAR, remote sensing and crowdsourced data), thereby improving the robustness and accuracy of fine-grained classification.
The fusion of hyperspectral data with LiDAR data and the fusion of remote sensing imagery with geographic information data are two of the most common approaches for fine-grained classification based on data fusion. Ref. [62] proposes a coarse-to-fine high-order network that fuses the spectral features of hyperspectral data with the 3D structural information of LiDAR to capture multi-dimensional attributes of land cover through hierarchical modeling; ref. [63] designs a multi-scale and multi-directional feature extraction network that integrates the spectral–spatial–height features of hyperspectral and LiDAR data to enhance category discriminability in complex scenes; in [64], Sentinel-1 radar images (capturing microwave scattering characteristics of flooded areas) are combined with OpenStreetMap crowdsourced data (providing semantic labels of urban functional zones) to improve the accuracy of fine-grained urban flood detection.
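The following is a minimal PyTorch sketch (an assumption, not the cited networks) of feature-level hyperspectral–LiDAR fusion, where a learned gate weights the two modality branches per sample before classification.

```python
import torch
import torch.nn as nn

class HsiLidarFusion(nn.Module):
    def __init__(self, bands=103, dim=64, classes=10):
        super().__init__()
        self.hsi_branch = nn.Sequential(nn.Conv2d(bands, dim, 3, padding=1),
                                        nn.ReLU(), nn.AdaptiveAvgPool2d(1))
        self.lidar_branch = nn.Sequential(nn.Conv2d(1, dim, 3, padding=1),
                                          nn.ReLU(), nn.AdaptiveAvgPool2d(1))
        self.gate = nn.Sequential(nn.Linear(2 * dim, 2), nn.Softmax(dim=1))
        self.classifier = nn.Linear(dim, classes)

    def forward(self, hsi_patch, lidar_patch):
        f_h = self.hsi_branch(hsi_patch).flatten(1)      # spectral-spatial features
        f_l = self.lidar_branch(lidar_patch).flatten(1)  # 3D structure / height features
        w = self.gate(torch.cat([f_h, f_l], dim=1))      # per-sample modality weights
        fused = w[:, :1] * f_h + w[:, 1:] * f_l
        return self.classifier(fused)

model = HsiLidarFusion()
logits = model(torch.randn(2, 103, 9, 9), torch.randn(2, 1, 9, 9))
```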
The advantages of multi-source data integration methods are as follows: (1) Information complementarity: Multi-source data provide multi-dimensional information (spectral, spatial, structural, and semantic), compensating for the lack of discriminability of single-source data in complex scenes (e.g., vegetation coverage and urban heterogeneous areas). (2) Broad applicability: These methods are applicable to diverse scenarios such as forests, cities, and hydrology, and are especially effective in distinguishing fine-grained subcategories (e.g., different tree species and flood-submerged buildings/roads). Their main limitation is data heterogeneity: differences in spatial resolution (e.g., 10 m for hyperspectral vs. 1 m for LiDAR), coordinate systems, and noise levels among data sources require complex registration and preprocessing steps, increasing the difficulty of method implementation.

3.1.4. Advanced Data Annotation Strategies

This category of methods focuses on reducing the reliance of fine-grained classification on large-scale accurately annotated data, optimizing annotation efficiency through strategies such as few-shot learning and semi-supervised annotation, and addressing the practical pain points of “high annotation cost and scarce samples”.
The most common approach is to introduce active learning [65] or incremental learning into fine-grained remote sensing image classification. The LPILC algorithm [66], based on linear programming, enables incremental learning with only a small number of new-category samples and without requiring the original category data, adapting to dynamically updated classification needs; CSSD [58] uses central spectral self-distillation, taking the model’s own predictions as pseudo-labels to reduce dependence on manual annotation; CFSSL [61] screens high-confidence pseudo-labels through “breaking-tie” (BT) sampling to reduce the impact of noisy annotations on the model.
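A minimal sketch of the breaking-tie criterion is shown below (an assumption, not the CFSSL implementation): the margin between the two most probable classes is used to pick ambiguous samples for expert labeling and confident samples as pseudo-labels.

```python
import numpy as np

def breaking_tie_margin(probs):
    # probs: (N, C) class-probability matrix for N unlabeled samples
    top2 = np.sort(probs, axis=1)[:, -2:]
    return top2[:, 1] - top2[:, 0]          # small margin = ambiguous sample

probs = np.random.dirichlet(np.ones(5), size=100)
margins = breaking_tie_margin(probs)
query_idx = np.argsort(margins)[:10]        # most ambiguous: send to an expert
pseudo_idx = np.argsort(margins)[-10:]      # most confident: use as pseudo-labels
```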
The main advantage of advanced data annotation strategies is a significantly reduced annotation cost: these strategies reduce the need for manual annotation, making them particularly suitable for data whose labeling requires professional knowledge, such as hyperspectral imagery. Their limitation is that the performance of few-shot/incremental learning depends heavily on the robustness of pre-trained models: if the initial model is biased (e.g., tends to misclassify certain categories), this bias will continue to affect the classification of new categories.

3.2. Fine-Grained Object-Level Detection

In the context of remote sensing, fine-grained object detection refers to the task of not only identifying major target categories such as vehicles, airplanes, and ships, but also distinguishing their more detailed subcategories. As illustrated in Figure 4, conventional object detection merely recognizes broad categories like Vehicle, Airplane, or Ship. In contrast, fine-grained object detection is able to further differentiate vehicles into Van, Small Car, and Other Vehicle; airplanes into A330, A321, A220, and Boeing 737; and ships into Tugboat and Dry Cargo Ship, among others. This enables a more precise and detailed recognition and classification of targets in remote sensing imagery.
Object detection, a core task in computer vision, aims to localize and classify objects in images. It has evolved into two dominant paradigms: two-stage detectors and one-stage detectors, each with distinct architectural designs and trade-offs between accuracy and speed. Most object detection methods in remote sensing are derived from these two paradigms in computer vision. The following subsections review and summarize, for each of the two paradigms (two-stage and one-stage), the improvements proposed for fine-grained object detection tasks.

3.2.1. Two-Stage Detectors

Two-stage methods separate object detection into two sequential steps: (1) generating region proposals (potential object locations) and (2) classifying these proposals and refining their bounding boxes. This modular design typically achieves higher accuracy but at the cost of computational complexity.
R-CNN (Region-based Convolutional Neural Networks) [68] introduced the first two-stage framework. It uses selective search to generate region proposals, extracts features via CNNs, and applies SVMs for classification. Despite its pioneering nature, redundant computations make it inefficient. Fast R-CNN [69] addressed R-CNN’s inefficiencies by sharing convolutional features across proposals, using a RoI (Region of Interest) pooling layer to unify feature sizes, and integrating classification and regression into a single network. Faster R-CNN [70] revolutionized the field by replacing selective search with a Region Proposal Network (RPN), a fully convolutional network that predicts proposals directly from feature maps. This made two-stage detection end-to-end trainable and significantly faster. Mask R-CNN [71] extended Faster R-CNN by adding a branch for instance segmentation, demonstrating the flexibility of two-stage architectures in handling complex tasks beyond detection. Cascade R-CNN [72] improved bounding box regression by iteratively refining proposals with increasing IoU thresholds, addressing the mismatch between training and inference in standard two-stage methods.
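As a hedged usage sketch of the two-stage pipeline, assuming a recent torchvision, the standard Faster R-CNN can be adapted to fine-grained categories simply by replacing its box-predictor head; the number of subclasses below is hypothetical.

```python
import torchvision
from torchvision.models.detection.faster_rcnn import FastRCNNPredictor

num_fine_classes = 43  # e.g., 42 ship subtypes + background (hypothetical)
model = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")
# Swap the box-predictor head so it outputs fine-grained subclasses instead of
# broad categories; the RPN and FPN remain unchanged.
in_features = model.roi_heads.box_predictor.cls_score.in_features
model.roi_heads.box_predictor = FastRCNNPredictor(in_features, num_fine_classes)
```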
R-CNN-based methods rely on a two-stage framework to address the core challenge of distinguishing highly similar targets (e.g., aircraft subtypes, ship models) in remote sensing images. In the research of fine-grained object detection, the two-stage structure is more popular than the one-stage structure. Two-stage object detection architectures can be further decomposed into a feature extraction backbone (Backbone) with a feature pyramid network (FPN), a region proposal network (RPN) for candidate regions, a region of interest alignment module (RoIAlign) for precise feature mapping, and task-specific heads for object classification (Cls), bounding box regression (Reg), and optional mask prediction (Mask Branch). Table 2 summarizes the improvements of fine-grained object detection on these components. Below is a detailed analysis of their improvements with different methods.
Contrastive Learning. This subcategory focuses on optimizing the feature space of highly similar targets through inter-sample contrast to amplify inter-class differences and reduce intra-class variations. Its core logic is to construct positive/negative sample pairs and use contrastive loss to guide the model in learning discriminative features, which is particularly effective for scenarios where visual similarity leads to feature confusion.
Existing studies in this subcategory (Contrastive Learning) mainly focus on the following: To address insufficient feature discrimination caused by long-tailed distributions, ref. [67] proposed PCLDet, which builds a category prototype library to store feature centers of targets (e.g., ships, aircraft) and introduces Prototypical Contrastive Loss (ProtoCL) to maximize inter-class distances while minimizing intra-class distances. A Class-Balanced Sampler (CBS) further balances sample distribution, ensuring that rare subtypes receive sufficient attention. For the problem of intra-class diversity in fine-grained aircraft detection, ref. [78] designed an Instance Switching-Based Contrastive Learning method. The Contrastive Learning Module (CLM) uses InfoNCE+ loss to expand the feature gap between aircraft subtypes (e.g., passenger aircraft models), while the Refined Instance Switching (ReIS) module mitigates class imbalance and iteratively optimizes features of discriminative regions (e.g., wings, engines). For oriented highly similar targets (e.g., ships), ref. [85] combined Oriented R-CNN (ORCNN) with Adaptive Prototypical Contrastive Learning (APCL) as in Figure 5. The Spatial-Aligned FPN (SAFPN) solves the spatial misalignment issue of traditional FPN, providing high-quality feature inputs for contrastive learning, and significantly improves the separability of features for ship subtypes (e.g., frigates vs. destroyers) on datasets such as FGSD and ShipRSImageNet. With regard to unknown ship detection via memory bank and uncertainty reduction, ref. [86] proposed a method that uses a Class-Balanced Proposal Sampler (CBPS) to balance sample learning and a fine-grained memory bank-based Contrastive Learning (FGCL) strategy to separate known/unknown ships. The Uncertainty-Aware Unknown Learner (UAUL) module reduces prediction uncertainty, solving the misjudgment of unknown highly similar ships (e.g., new military ships).
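The core of these prototype-based contrastive objectives can be sketched as follows (an assumption, not the exact PCLDet or APCL loss): each RoI feature is pulled toward its own class prototype and pushed away from the prototypes of other, visually similar classes.

```python
import torch
import torch.nn.functional as F

def prototypical_contrastive_loss(feats, labels, prototypes, temperature=0.1):
    # feats: (N, D) RoI features, labels: (N,), prototypes: (C, D) class centres
    feats = F.normalize(feats, dim=1)
    protos = F.normalize(prototypes, dim=1)
    logits = feats @ protos.t() / temperature   # similarity to every class prototype
    # InfoNCE-style objective: the sample's own prototype is the positive.
    return F.cross_entropy(logits, labels)

feats = torch.randn(32, 128)
labels = torch.randint(0, 10, (32,))
prototypes = torch.randn(10, 128)
loss = prototypical_contrastive_loss(feats, labels, prototypes)
```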
Knowledge Distillation. This subcategory aims to balance detection accuracy and model efficiency by transferring fine-grained knowledge from complex “teacher models” to lightweight “student models.” It has expanded from traditional multi-model distillation to self-distillation, enabling knowledge reuse within a single model and adapting to scenarios such as lightweight deployment and few-shot learning.
The technical evolution of this subcategory is reflected in three directions. For multi-teacher knowledge distillation aimed at an accuracy–efficiency balance, ref. [77] used oriented R-CNN as the first teacher to locate vehicles/ships and the Coarse-to-Fine Object Recognition Network (CF-ORNet) as the second teacher for fine-grained recognition. By distilling knowledge from both teachers into a student model and combining filter grafting, the model achieves high accuracy on high-resolution remote sensing images while reducing computational costs. For decoupled distillation for lightweight underwater detection, ref. [87] proposed the Prototypical Contrastive Distillation (PCD) framework, which uses R-CNN as the teacher model to transfer fine-grained knowledge of underwater targets (e.g., submersibles) via prototypical contrastive learning. The decoupled distillation mechanism allows the student model to focus on discriminative features, and the contrastive loss enhances semantic structural attributes, improving the robustness of lightweight models in underwater environments. For self-distillation in few-shot scenarios, ref. [88] proposed Decoupled Self-Distillation for fine-grained few-shot detection. The model uses its “high-confidence branch” as an implicit teacher and its “low-confidence branch” as a student to transfer knowledge of rare, highly similar subtypes (e.g., rare aircraft models). Combined with progressive prototype calibration, this method addresses the problem of insufficient knowledge transfer due to limited data in few-shot scenarios.
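A minimal sketch of the underlying logit-distillation objective is given below (an assumption, not the cited frameworks): the student matches the teacher's softened fine-grained class distribution in addition to the hard labels.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.5):
    # Soft-target term: KL divergence between softened distributions.
    soft = F.kl_div(F.log_softmax(student_logits / T, dim=1),
                    F.softmax(teacher_logits / T, dim=1),
                    reduction="batchmean") * (T * T)
    # Hard-label term: standard cross-entropy against the ground truth.
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard

loss = distillation_loss(torch.randn(8, 42), torch.randn(8, 42),
                         torch.randint(0, 42, (8,)))
```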
Hierarchical Feature Optimization and Highly Similar Feature Mining (HFOSFM). This subcategory follows the logic of “from low-level feature purification to high-level feature fusion” to iteratively improve feature quality, with the ultimate goal of mining subtle discriminative features of highly similar targets. Low-level optimization focuses on eliminating noise (e.g., background interference, posture misalignment), while high-level optimization emphasizes integrating semantic information to enhance feature completeness.
Key innovations across these HFOSFM studies include the following: For low-level noise filtering and high-level feature matching, ref. [79] proposed PETDet, which uses the Quality-Oriented Proposal Network (QOPN) to generate high-quality oriented proposals (low-level purification) and the Bilinear Channel Fusion Network (BCFN) to extract independent discriminative features for proposals (high-level refinement). Adaptive Recognition Loss (ARL) further guides the R-CNN head to focus on high-quality proposals, solving the mismatch between proposals and features for highly similar targets. For multi-domain feature fusion and semantic association construction, ref. [76] proposed DIMA, which synchronously learns image and frequency-domain features via the Frequency-Aware Representation Supplement (FARS) mechanism (low-level detail enhancement) and builds coarse-fine feature relationships using the Hierarchical Classification Paradigm (HCP) (high-level semantic integration). This approach effectively amplifies structural differences between highly similar samples (e.g., ships of different tonnages). For oriented targets (e.g., rotating ships), ref. [80] proposed SFRNet, which uses the Spatial-Channel Transformer (SC-Former) to correct feature misalignment caused by posture variations (low-level spatial interaction) and the Oriented Transformer (OR-Former) to encode rotation angles (high-level semantic supplementation). This ensures that local differences (e.g., wing angles of tilted aircraft) are fully captured.
Category Relationship Modeling and Similarity Measurement Optimization (CRMSMO). This subcategory explicitly models intrinsic relationships between categories (e.g., hierarchical, structural, or functional relationships) to optimize similarity measurement logic, addressing the issue where traditional methods fail to distinguish highly similar targets due to over-reliance on visual features.
Representative studies of CRMSMO involve the following: Regarding semantic decoupling and anchor matching optimization, ref. [82] proposed a method for fine-grained ship detection that decouples classification and regression features using a polarized feature-focusing module and selects high-quality anchors via adaptive harmony anchor labeling. By optimizing the matching between anchors and category features, it improves the localization accuracy of highly similar ships. For hierarchical relationship constraint and feature distance expansion, ref. [83] proposed HMS-Net, which reinforces features at different semantic levels (e.g., ship contours vs. local components) and uses hierarchical relationship constraint loss to model the semantic hierarchy of ship subtypes (e.g., destroyer models). This explicitly expands the feature distance between highly similar subcategories. For invariant structural feature extraction via graph modeling, ref. [84] proposed Invariant Structure Representation, which uses the Graph Focusing Process (GFP) module to extract invariant structural features (e.g., cross-shaped aircraft, rectangular vehicles) based on graph convolution. The Graph Aggregation Network (GAN) updates node weights to enhance structural feature expression, enabling the model to distinguish visually similar targets by their inherent structural relationships. Shape-aware modeling for large aspect ratio targets [75] addressed the high similarity and large aspect ratio of ships in high-resolution satellite images by designing a Shape-Aware Feature Learning module to alleviate feature alignment bias and a Shape-Aware Instance Switching module to balance category distribution. This ensures sufficient learning of rare ship subtypes (e.g., special operation ships).
Multi-Source Feature Fusion and Context Utilization. This subcategory compensates for the lack of discriminative information caused by visual similarity by fusing multi-modal data (e.g., RGB, multispectral, LiDAR) and leveraging contextual relationships. It is particularly effective for scenarios where single-modal features are insufficient to distinguish highly similar targets (e.g., street tree subtypes). For example, ref. [74] proposed a multisource region attention network that fuses RGB, multispectral, and LiDAR data. A multisource region attention module assigns weights to the features of highly similar street tree subtypes, using multi-modal differences (e.g., spectral reflectance, elevation information) to fill the information gap caused by visual similarity. This approach significantly improves the fine-grained classification accuracy of street trees in remote sensing imagery. For few-shot aircraft detection via cross-modal knowledge guidance, ref. [89] proposed the TEMO method, which introduces text-modal descriptions of aircraft and fuses text–visual features via a cross-modal assembly module. This reduces confusion between new categories and known similar aircraft, enabling fine-grained recognition in few-shot scenarios based on the R-CNN two-stage framework.

3.2.2. One-Stage Detectors

One-stage detectors (such as the YOLO series) omit the separate steps of candidate region generation and subsequent classification, and directly perform category prediction and bounding box regression on the feature map. This end-to-end structure significantly reduces model complexity and inference latency, thereby enabling real-time detection. Although early one-stage methods generally lagged behind two-stage detectors in accuracy, the YOLO family and related methods have made many improvements and achieved significant gains through network structure optimization, loss function improvement, and the introduction of feature enhancement modules.
YOLO (You Only Look Once) [90] pioneered one-stage detection by treating object detection as a regression task. It divides the image into a grid, with each grid cell predicting bounding boxes and class probabilities, enabling real-time performance. SSD (Single Shot MultiBox Detector) [91] introduced multi-scale feature maps to detect objects of varying sizes, using default bounding boxes (anchors) at different layers to improve small object detection. RetinaNet [92] addressed the class imbalance issue in one-stage detectors with Focal Loss, a modified cross-entropy loss that down-weights easy background examples. This closed the accuracy gap with two-stage methods. YOLOv3 [93] enhanced the original YOLO with multi-scale prediction, a more efficient backbone (Darknet-53), and better class prediction, balancing speed and accuracy. EfficientDet [94] optimized both accuracy and efficiency through compound scaling (co-scaling depth, width, and resolution) and a weighted bi-directional feature pyramid network (BiFPN), achieving state-of-the-art results on COCO. YOLOv7 [95] introduced trainable bag-of-freebies (e.g., ELAN architecture, model scaling) and bag-of-specials (e.g., reparameterization) to boost performance, outperforming previous YOLO variants and other one-stage detectors on speed–accuracy curves.
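For reference, the focal loss idea mentioned above can be sketched as follows (a minimal binary-classification version, not RetinaNet's full implementation): easy background anchors are down-weighted so that training concentrates on hard, easily confused targets.

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, alpha=0.25, gamma=2.0):
    # logits, targets: (N,) per-anchor binary classification (1 = object)
    p = torch.sigmoid(logits)
    ce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p_t = p * targets + (1 - p) * (1 - targets)           # prob. of the true class
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)
    # (1 - p_t)^gamma shrinks the loss of easy (confident) examples.
    return (alpha_t * (1 - p_t) ** gamma * ce).mean()

loss = focal_loss(torch.randn(100), torch.randint(0, 2, (100,)).float())
```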
Methods based on YOLO can be structurally decomposed into the input stage, backbone, neck, and head. Table 3 summarizes existing fine-grained object detection approaches and the purposes of their improvements with respect to these components.
These methods can also be broadly categorized into four groups: data and input augmentation-driven, attention and feature fusion-driven, discriminative learning and task design-driven, and optimization and post-processing-driven. Each direction addresses different technical aspects, yet they share the common goal of enhancing the ability to distinguish visually similar targets and to improve the detection of small objects in complex remote sensing scenes.
Data and Input Augmentation-Driven Methods. This category mainly focuses on enriching input data and sample representation, alleviating the challenges of limited training samples and class imbalance in remote sensing. For instance, the improved YOLOv7-Tiny [96] applies multi-scale/rotation augmentation to expand input sample diversity; Lightweight FE-YOLO [97] preprocesses input data to highlight the fine-grained features of small targets; YOLOv8 (G-HG) [98] adjusts input feature resolution to match multi-scale remote sensing targets; YOLO-RS [99] adopts context-aware input sampling to focus on crop fine-grained regions; and YOLOX-DW [100] applies adaptive sampling to balance the distribution of fine-grained classes in the input data. Moreover, DEDet [101] and MFL [102] explore image degradation recovery and super-resolution enhancement, offering new approaches to restore fine details in low-quality remote sensing images. These studies highlight that input-level improvements not only enhance robustness but also provide a stronger foundation for fine-grained discrimination.
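As a minimal illustration of this input-level strategy, the sketch below builds a generic rotation and multi-scale cropping pipeline with torchvision; it is a schematic example rather than the exact augmentation of any cited method, and the crop size and jitter values are placeholders.

```python
import torchvision.transforms as T

# Generic input-stage augmentation for fine-grained remote sensing targets:
# arbitrary rotations and multi-scale crops enrich training sample diversity.
augment = T.Compose([
    T.RandomRotation(degrees=180),               # targets have no canonical orientation
    T.RandomResizedCrop(640, scale=(0.5, 1.0)),  # multi-scale sampling of the input
    T.ColorJitter(brightness=0.2, contrast=0.2), # mild radiometric perturbation
    T.ToTensor(),
])
```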
Attention and Feature Fusion-Driven Methods. Methods in this category emphasize enhancing discriminative feature representations by leveraging attention mechanisms and multi-scale fusion. For example, FGA-YOLO [103] and SR-YOLO [104] combine global multi-scale modules, bidirectional FPNs, and super-resolution convolutions to strengthen fine-grained representation of aircraft and UAV targets. WDFA-YOLOX [105] and YOLOv5+CAM [106] address SAR feature loss and wide-area vehicle detection through wavelet-based compensation and attention mechanisms. IF-YOLO [107] and FiFoNet [108] improve feature pyramid and fusion strategies to preserve small-object features and suppress background noise. These works demonstrate that precise feature modeling under complex backgrounds and scale variations is crucial for fine-grained detection.
Table 3. Summary of one-stage (YOLO-based) methods for fine-grained object detection; each method improves one or more of the YOLO stages (input, backbone, neck, head).

| Method | Purpose | Reference |
| --- | --- | --- |
| FGA-YOLO | Aggregate multi-layer features to enhance multi-scale information; extract key discriminative features to improve fine-grained recognition; alleviate imbalance between easy/hard samples via EMA Slide Loss | [103] |
| SR-YOLO | Extract small-target fine-grained features via SR-Conv module; enhance small-target feature fusion with bidirectional FPN; improve detection accuracy via Normalized Wasserstein Distance Loss | [104] |
| IF-YOLO | Preserve small-target intrinsic features via IPFA module; suppress conflicting information with CSFM; fuse multi-scale features via FGAFPN | [107] |
| WDFA-YOLOX | Compensate SAR fine-grained feature loss via WSPP module; enhance small-ship features with GLFAE; improve bounding-box regression via Chebyshev distance-GIoU Loss | [105] |
| Related-YOLO | Model ship component geometric relationships via relational attention; adapt to rotated ships with deformable convolution; optimize anchors via hierarchical clustering | [109] |
| YOLOv5+CAM | Capture key regions via CAM attention module; fuse multi-scale features with CAM-FPN; enhance training via coarse-grained judgment + background supervision | [106] |
| FiFoNet | Capture global–local context via GLCC module; select valid multi-scale features to block redundant information; improve small-target detection in UAV images | [108] |
| FD-YOLOv8 | Preserve aircraft local details via local feature module; enhance local–global interaction via focus modulation; improve fine-grained accuracy in complex backgrounds | [110] |
| YOLOX (GTDet) | Adapt to oriented targets via GCOTA label assignment; improve angle prediction via DLAAH; enhance localization via anchor-free detection | [111] |
| DEDet | Restore nighttime details via FPP module; filter background interference via progressive filtering; improve nighttime UAV target detection | [101] |
| MFL | Realize SR-OD mutual feedback via MFL closed loop; focus on ROI details via FROI module; narrow target feature differences via MSOI | [102] |
| InterMamba | Capture long-range dependencies via VMamba backbone; fuse multi-scale features via cross-VSSM; optimize dense detection via UIL loss | [112] |
| Improved YOLOv7-Tiny | Construct diverse remote sensing aircraft dataset; apply multi-scale/rotation augmentation to enrich input samples | [96] |
| Lightweight FE-YOLO | Preprocess input data to highlight small-target fine-grained features; reduce input noise interference via similarity-based channel screening; optimize input feature distribution for remote sensing scenarios | [97] |
| YOLOv8 (G-HG) | Adjust input feature resolution to match multi-scale remote sensing targets; retain fine-grained details in input via redundant feature map sampling; optimize input data utilization for complex background scenarios | [98] |
| YOLO-RS | Adopt context-aware input sampling to focus on crop fine-grained regions; balance input class distribution via AC mix module | [99] |
| YOLOX-DW | Apply adaptive sampling to balance fine-grained class distribution in input; optimize input sample selection to avoid rare-class underrepresentation | [100] |
Discriminative Learning and Task Design-Driven Methods. This research line emphasizes introducing additional discriminative constraints or multi-task mechanisms to improve the separation of visually similar categories. FD-YOLOv8 [110] captures subtle differences in aircraft through local detail modules and focused modulation mechanisms. Related-YOLO [109] leverages relational attention, hierarchical clustering, and deformable convolutions to model structural relations between ship components. GTDet [111] enhances classification–regression consistency for oriented objects using optimal transport-based label assignment and decoupled angle prediction. MFL [102] builds a closed-loop between detection and super-resolution, guiding degraded images to recover discriminative details. Overall, these methods contribute discriminative signals by focusing on local part modeling, relational learning, and multi-task integration.
Optimization and Post-Processing-Driven Methods. This category centers on loss function design and post-processing optimization, improving adaptation to fine-grained targets during both training and inference. WDFA-YOLOX [105] and SR-YOLO [104] introduce novel regression losses (Chebyshev distance-GIoU and normalized Wasserstein distance) to improve bounding box localization for small objects. GTDet [111] applies optimal transport-based assignment to address the scarcity of positive samples for oriented objects with large aspect ratios. SA-YOLO [113] dynamically adjusts class weights with adaptive loss functions to mitigate bias from data imbalance. DEDet [101] employs iterative filtering during post-processing to suppress noise and false positives in nighttime UAV imagery. These strategies demonstrate that careful optimization and post-processing not only stabilize training but also ensure the preservation of small and fine-grained targets during inference.
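As an example of such a small-object regression objective, the normalized Wasserstein distance treats each axis-aligned box as a 2-D Gaussian and maps their Wasserstein distance into (0, 1] so it can replace IoU. The sketch below is a generic formulation (the constant c is a dataset-dependent hyperparameter), not necessarily SR-YOLO's exact loss; a regression loss is then typically taken as one minus this similarity.

```python
import torch

def normalized_wasserstein_distance(box1, box2, c=12.8):
    """Similarity between boxes given as (cx, cy, w, h) tensors.

    Each box is treated as a 2-D Gaussian N([cx, cy], diag((w/2)^2, (h/2)^2));
    the squared 2-Wasserstein distance between two such Gaussians has the
    closed form below, and exp(-sqrt(.)/c) turns it into a similarity that is
    far less sensitive than IoU to small localization errors on tiny objects.
    """
    cx1, cy1, w1, h1 = box1.unbind(-1)
    cx2, cy2, w2, h2 = box2.unbind(-1)
    w2_sq = (cx1 - cx2) ** 2 + (cy1 - cy2) ** 2 \
          + ((w1 - w2) ** 2 + (h1 - h2) ** 2) / 4.0
    return torch.exp(-torch.sqrt(w2_sq) / c)
```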
Overall, YOLO-based fine-grained detection research in remote sensing has established a comprehensive improvement pathway spanning input augmentation, feature modeling, discriminative learning, and optimization strategies. Data and input enhancements improve baseline robustness, attention and feature fusion strengthen discriminative representations, discriminative learning and task design introduce novel supervision signals, and optimization and post-processing ensure stability and reliability across stages. Future trends are expected to further integrate these directions, such as combining input augmentation with discriminative learning, or unifying feature modeling and optimization strategies into an end-to-end framework, to comprehensively improve fine-grained detection performance in remote sensing.

3.2.3. Other Methods for Fine-Grained Object Detection

In the field of fine-grained object detection in remote sensing, aside from YOLO and RCNN-based methods, existing studies can be broadly categorized into four classes: methods based on Transformer/DETR, classification/recognition networks, customized approaches for special modalities or scenarios, and graph-based or structural feature modeling methods. These approaches address challenges such as category ambiguity, feature indistinctness, and complex scene conditions from different perspectives, including global feature modeling, fine-grained feature optimization, environment-specific adaptation, and structural information exploitation.
Transformer/DETR-Based Methods. Transformer and DETR-based methods leverage self-attention mechanisms to model global feature dependencies, enabling the capture of fine-grained target relationships across the entire image and supporting end-to-end detection. Typical studies include the following: FSDA-DETR [114], shown in Figure 6, employs cross-domain style alignment and category-aware feature calibration to achieve effective adaptation in optical–SAR cross-domain few-shot scenarios; GMODet [115] integrates region-aware and semantic–spatial progressive interaction modules within the DETR framework to capture spatio-temporal correlations of ground-moving targets, enabling efficient detection in large-scale remote sensing images; InterMamba [112] combines cross-visual selective scanning with global attention and user interaction feedback to optimize detection and annotation in dense scenes, enhancing discriminability in crowded environments.
CNN-Based Feature Interaction and Classification (Non-YOLO/RCNN). This category primarily focuses on fine-grained object classification or recognition. Most methods do not involve explicit detection or YOLO/RCNN structures (i.e., they have no localization modules), relying instead on feature optimization, data augmentation, and feature purification to improve category discriminability; a few approaches (e.g., the context-aware method [116]) incorporate lightweight localization modules in addition to classification. Representative works include [117], which combines CNN features with natural language attributes for zero-shot recognition; ref. [118], which uses region-aware instance modeling and adversarial generation to mitigate inter-class similarity; EFM-Net [119], which leverages feature purification and data augmentation to enhance fine-grained characteristics; ref. [120], which integrates weak and strong features to iteratively optimize discriminative regions in low-resolution images; ref. [121], which proposes a coarse-to-fine hierarchical framework for urban village classification; and ref. [122], which uses feature decoupling and pyramid transformer encoding to distinguish visually similar targets in UAV videos. Overall, these methods emphasize enhancing classification capability under limited or ambiguous feature conditions.
Customized Methods for Special Modalities or Scenarios. These methods target fine-grained object detection under specific modalities (e.g., thermal infrared, underwater) or challenging scenarios (e.g., low-light, nighttime), optimizing feature extraction and localization through specialized modules. Typical studies include the following: U-MATIR [123] constructs a multi-angle thermal infrared dataset and leverages heterogeneous label spaces with hybrid-view cascade modules to enable efficient detection of thermal infrared targets; DEDet [101] employs pixel-level exposure correction and background noise filtering to improve feature quality and detection performance under low-light UAV imagery; the PCD method [87] uses prototype contrastive learning and decoupled distillation to transfer features and lighten models for underwater fine-grained targets, enhancing overall detection performance.
Graph-Based or Structural Feature Modeling Methods. Graph-based methods model structural relationships among target components, reinforcing classification and localization through structural consistency. Typical studies include the following: GFA-Net [84] employs a graph-focused aggregation network to model structural features and node relations, achieving precise detection of structurally deformed targets; ref. [124] integrates geospatial priors with frequency-domain analysis to infer the distribution and class relationships of aircraft in large-scale SAR images, enabling efficient localization.
Overall, the three categories of fine-grained object detection methods form a complementary technical system targeting the core challenge of “high inter-class similarity”: R-CNN-based methods mainly achieve high precision through specialized technical paths (contrastive learning, knowledge distillation, hierarchical feature optimization) and are suitable for complex scenarios (few-shot, unknown categories); YOLO-based methods mainly prioritize efficiency via multi-scale fusion and attention mechanisms, making them ideal for real-time scenarios (UAV, SAR); other methods break through traditional frameworks to address special scenarios (cross-domain, nighttime, zero-shot), providing innovative supplements.

3.3. Fine-Grained Scene-Level Recognition

Fine-grained scene-level recognition is playing an increasingly important role in remote sensing applications, where distinguishing subtle differences between visually similar scenes has become more complex and challenging. Many studies on scene understanding in remote sensing images draw on methods and models from the field of computer vision. In the field of computer vision, image scene understanding generally follows two fundamental paradigms: bottom-up and top-down approaches.
Bottom-up methods start from pixels and low-level features, progressively extracting textures, shapes, and spectral information, and then aggregating them into high-level semantics through deep neural networks, as in Figure 7. Their advantages lie in being data-driven, well-suited for large-scale imagery, and capable of automatic feature learning with good transferability. However, they often lack high-level semantic constraints, making them vulnerable to intra-class variability and complex backgrounds, which may lead to insufficient semantic interpretability.
Top-down methods, in contrast, begin with task objectives or prior knowledge, employing geographic knowledge graphs, ontologies, or semantic rules to guide and constrain the interpretation of low-level features, as in Figure 7. These approaches have the strengths of semantic clarity and interpretability, aligning more closely with human cognition. Their limitations, however, include dependence on high-quality prior knowledge, high construction costs, and limited scalability in large-scale automated tasks.
In terms of research trends, bottom-up methods dominate the current literature. In fine-grained remote sensing scene understanding in particular, most studies rely on multi-scale feature modeling, attention mechanisms, convolutional neural networks, and Transformer architectures to capture subtle inter-class differences through hierarchical abstraction. These approaches are well-suited to large-scale data-driven training and have therefore become the mainstream. By contrast, top-down methods are mainly explored in knowledge-based scene parsing, cross-modal alignment, and zero-shot learning, and remain relatively limited in number, though they show promise for enhancing semantic interpretability and cross-domain generalization.
In summary, fine-grained remote sensing scene understanding is currently almost exclusively driven by bottom-up feature learning approaches, while top-down methods remain at an exploratory stage. This paper mainly reviews two lines of research at the scene level of remote sensing images: scene classification and image retrieval. Most of the reviewed studies are based on bottom-up fine-grained image recognition or understanding.

3.3.1. Scene Classification

The core challenges of fine-grained remote sensing image scene classification converge on four dimensions: feature confusion caused by “large intra-class variation and high inter-class similarity” as in Figure 8, data constraints from “high annotation costs and scarce samples”, modeling imbalance between “local details and global semantics”, and domain shift across “sensors and regions”. The studies can be categorized into four core classes and one category of scattered research according to technical objectives and methodological logic.
Multi-Granularity Feature Modeling. This is one of the core approaches to resolving “intra-class variation–inter-class similarity” and represents the fundamental technical direction for fine-grained classification. Its core logic involves mining multi-dimensional features (e.g., “local–global”, “low–high resolution”, “high–low frequency”) to capture subtle discriminative information between subclasses, thereby addressing the pain points of “large intra-class variation and high inter-class similarity” in remote sensing scenes. Its technical evolution has progressed from single-granularity enhancement to multi-granularity collaborative decoupling and can be further divided into two technical branches. (1) Multi-Level Feature Fusion and Semantic Collaboration. This branch strengthens the transmission and discriminability of fine-grained semantics through feature interaction across different network levels. Typical studies include the following: Ref. [126] proposed the MGML-FENet framework, innovatively designing a Channel-Separate Feature Generator (CS-FG) to extract multi-granularity local features (e.g., building edge textures, crop ridge structures) at different network levels. Ref. [127] proposed MGSN, pioneering a bidirectional mechanism in which coarse-grained features guide fine-grained learning, and its MGSL module enables simultaneous learning of global scene structures and local details. Ref. [128] proposed the MG-CAP framework, which generates multi-granularity features via progressive image cropping and uses Gaussian covariance matrices (replacing traditional first-order CNN features) to capture high-order feature correlations (a generic sketch of such covariance pooling is given below). (2) Frequency/Scale Decoupling and Enhancement. Targeting the characteristic of remote sensing images in which high-frequency details are separated from low-frequency structures, this branch strengthens the independence and discriminability of fine-grained features through frequency decomposition or multi-scale modeling. Typical studies include the following: Ref. [129] proposed MF2CNet, which realizes parallel extraction and decoupling of high/low-frequency features (high-frequency for fine-grained details such as road markings, low-frequency for global structures such as road orientation). Ref. [130] designed a Multi-Granularity Decoupling Network (MGDNet) focusing on fine-grained feature learning under class-imbalanced scenarios, in which region-level supervision guides the network to focus on subclass differences. Ref. [131] proposed ECA-MSDWNet, which integrates multi-scale feature extraction with incremental learning: an Efficient Channel Attention module focuses on key fine-grained features, while multi-scale depthwise convolution reduces computational costs.
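The covariance-based representation used by MG-CAP-style methods can be sketched as generic second-order pooling over a CNN feature map; this is an illustrative formulation, not the exact module of ref. [128].

```python
import torch

def covariance_pooling(features):
    """Second-order (covariance) pooling of a CNN feature map.

    features: tensor of shape (B, C, H, W).
    Returns a (B, C, C) covariance matrix per image that captures pairwise
    channel correlations, a higher-order statistic that first-order pooling
    (e.g., global average pooling) discards.
    """
    b, c, h, w = features.shape
    x = features.flatten(2)                      # (B, C, H*W)
    x = x - x.mean(dim=2, keepdim=True)          # center each channel
    cov = x @ x.transpose(1, 2) / (h * w - 1)    # (B, C, C)
    return cov
```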
Cross-Domain and Domain Adaptation Learning. This is one of the key technologies for addressing “sensor–region” shift. In practical applications of fine-grained remote sensing classification, distribution shifts between training data (source domain) and test data (target domain) (e.g., optical–SAR sensor differences, regional differences between southern and northern farmlands) drastically reduce model generalization. This category mainly includes the following: (1) Open-Set Domain Adaptation. This branch addresses real-world scenarios with unknown subclasses in the target domain. Traditional domain adaptation assumes complete category overlap between source and target domains, whereas Open-Set Domain Adaptation (OSDA) is more aligned with remote sensing reality (e.g., unknown subclasses such as new artificial islands or special crops appearing in the target domain); its core lies in separating unknown classes while aligning the fine-grained features of known classes. Ref. [132] proposed IAFAN, which innovatively designs a USS mechanism (calculating sample semantic correlations via an instance affinity matrix to identify unknown classes) and uses an SDE loss to expand fine-grained differences between known classes (e.g., differences in vehicle arrangement density between parking lots and industrial parks). (2) Multi-Source Domain Adaptation. This branch achieves fine-grained alignment with limited annotations: for scenarios where the target domain contains only a small number of annotations, it improves the domain adaptability of fine-grained features through pseudo-label optimization and multi-source subdomain modeling. Ref. [133] uses a bidirectional prototype module for source–target category/pseudo-label alignment and introduces adversarial training to optimize pseudo-labels (reducing mislabeling such as “elevated roads as bridges”). For multi-source, single-target domain shifts, ref. [134] adopts a “shared + dual-domain feature extractor” architecture (first learning multi-source shared features, such as the low reflectance of water bodies, and then performing fine-grained subdomain alignment per source–target pair). To address fine-grained domain shift in the frequency dimension, ref. [135] uses an HFE module to align source–target fine-grained details (e.g., road marking edge intensity) and an LFE module to align global structures (e.g., road network topology).
Semi-Supervised and Zero-Shot Learning. This is the inevitable path to reducing fine-grained annotation dependence. Annotations for fine-grained remote sensing scene classification require dual expertise in pixel-level labeling and subclass semantics, resulting in extremely high annotation costs. This category overcomes data constraints through “limited annotations + unlabeled sample utilization” or knowledge transfer, serving as a key enabler for the large-scale application of fine-grained classification. (1) Semi-Supervised Learning. Through pseudo-label optimization and consistency constraints, this branch mainly enhances the model’s ability to learn fine-grained features from “supervised signals + unsupervised consistency”. Ref. [136] modeled fine-grained road scene understanding as semi-supervised semantic segmentation: it optimizes a supervised loss on annotated samples and a consistency loss (e.g., perturbed prediction consistency) on unlabeled ones via ensemble prediction, using few annotated samples to approach the accuracy of fully supervised models on a self-built dataset and thereby cutting annotation cost (a generic sketch of this training scheme is given below). Ref. [130] addressed the mislabeling of fine-grained minority samples by evaluating pseudo-label reliability via model confidence and class priors, prioritizing high-reliability samples for updates, and combining a DCF loss to improve minority-class accuracy in subclass classification. (2) Zero-Shot Learning. Via knowledge graphs or cross-modal transfer, this branch targets scenarios with no annotations for the target subclasses and achieves fine-grained classification through knowledge transfer, with a core focus on establishing accurate mappings between visual features and semantic descriptions. For example, ref. [35] constructs a Remote Sensing Knowledge Graph (RSKG) for the first time and generates semantic representations of remote sensing scene categories through graph representation learning, so as to improve domain semantic expression.
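A generic form of the supervised-plus-consistency objective used in such semi-supervised pipelines can be sketched as follows; the weak/strong augmentation pair, the KL-based consistency term, and the weighting factor are illustrative assumptions rather than the exact scheme of any cited study.

```python
import torch
import torch.nn.functional as F

def semi_supervised_loss(model, x_lab, y_lab, x_unlab, weak_aug, strong_aug, lam=1.0):
    """Supervised cross-entropy on labeled data plus a consistency term that
    asks predictions on a strongly perturbed view of an unlabeled image to
    match those on a weakly perturbed view of the same image."""
    sup = F.cross_entropy(model(x_lab), y_lab)

    with torch.no_grad():                                   # pseudo-targets
        target = model(weak_aug(x_unlab)).softmax(dim=1)
    pred = model(strong_aug(x_unlab)).log_softmax(dim=1)
    cons = F.kl_div(pred, target, reduction="batchmean")    # consistency loss

    return sup + lam * cons
```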
CNN and Transformer Fusion Modeling. This is one of the technical trends for balancing local details and global semantics. Traditional CNNs excel at extracting local fine-grained features (e.g., edges, textures) but lack strong global semantic modeling capabilities, whereas Transformers excel at capturing global dependencies (e.g., spatial correlations between targets) but have insufficient local detail perception. This category achieves complementary advantages through strategies such as knowledge distillation, feature embedding, and hierarchical fusion, addressing the core contradiction of fragmented local features and missing global semantics in fine-grained classification. (1) Knowledge Distillation. With Transformers as teachers and CNNs as students, global semantic knowledge is transferred to lightweight models, balancing fine-grained accuracy and computational efficiency. Ref. [137] proposed ET-GSNet, using ViT as a fine-grained semantic teacher and ResNet18 as the student; dynamic knowledge distillation lets the student retain CNN local details while absorbing ViT global semantics. Ref. [138] introduced a multi-path self-distillation mechanism: with ResNet34 as the backbone, it fuses multi-scale fine-grained features and builds bidirectional self-distillation to couple global semantics and local details, achieving high accuracy in fine-grained urban functional zone classification. (2) Feature Embedding and Hierarchical Fusion. By embedding CNN features into Transformers or combining hierarchical Transformers with local modules, in-depth collaboration between local details and global semantics is achieved for fine-grained features. Ref. [139] embeds CNN-extracted local fine-grained features into ViT’s patch embedding layer, enabling “local + global” learning with limited data while accelerating convergence and improving accuracy compared with pure ViTs. For multi-scale fine-grained targets, ref. [140] uses a Swin Transformer to capture global structures and a CNN to extract multi-resolution local features, with multi-scale fusion achieving high urban scene classification accuracy and better discrimination than pure Transformers. As a representative example (Figure 9), ref. [141] introduces GNNs for fine-grained structural features: it designs a dual attention (DA) module to suppress background noise by extracting channel–spatial key features, builds a three-stage hierarchical Transformer extractor to capture multiscale global features, and develops a pixel-level GNN extractor to distinguish similar scenes via spatial topology.
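The teacher–student transfer underlying these distillation approaches can be summarized by standard logit distillation; the temperature, weighting, and loss form below are the conventional choices and are given only as an illustrative stand-in for the more elaborate dynamic distillation schemes cited above.

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, targets, T=4.0, alpha=0.5):
    """Combine the hard-label loss with a softened teacher-matching term.

    student_logits: outputs of the lightweight CNN student.
    teacher_logits: outputs of the (frozen) ViT teacher on the same batch.
    The temperature T softens both distributions so the student also learns
    the teacher's relative similarities between fine-grained classes.
    """
    hard = F.cross_entropy(student_logits, targets)
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=1),
        F.softmax(teacher_logits / T, dim=1),
        reduction="batchmean",
    ) * (T * T)                      # rescale gradients to the hard-loss range
    return alpha * hard + (1.0 - alpha) * soft
```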
Other related studies include the following: Ref. [142] systematically reviewed a large number of deep learning methods, clarifying for the first time that fine-grained classification needs to overcome the local limitations of CNNs and the data dependence of Transformers, and summarizing key directions such as multi-granularity features and cross-domain adaptation, providing a framework for subsequent research. Ref. [143] conducted a meta-analysis of multiple studies, quantitatively showing the rise of Transformers in fine-grained classification in recent years, with AID and NWPU-RESISC45 as the most commonly used benchmarks; it also identified unresolved issues such as fine-grained feature confusion and cross-sensor domain shift, providing data support for future research directions. For efficiency and robustness optimization, ref. [144] proposed a bilinear model based on MobileNetv2, enhancing fine-grained features through dual convolutional-layer feature transformation and the Hadamard product; on the UC-Merced dataset, its parameter count is much lower than that of traditional CNNs while accuracy is improved, making it suitable for fine-grained classification on edge devices. Ref. [145] proposed the Confounder-Free Fusion Network (CFF-NET), eliminating spurious correlations between background interference and fine-grained features (e.g., misclassifying cloud shadows as water textures) through three branches (global, local, and target); it achieves state-of-the-art performance in fine-grained classification and retrieval tasks for aerial images, providing new ideas for robust fine-grained modeling.

3.3.2. Image Retrieval

Fine-grained remote sensing image retrieval (FRSIR) has become an active research area, aiming to capture subtle distinctions among highly similar remote sensing (RS) images. The existing methods have conducted fine-grained optimization for remote sensing image retrieval from different perspectives.
Global–local and multi-scale feature fusion methods integrate information across different levels. For instance, GaLR introduces global and local representation with dynamic fusion and relational enhancement [146], while GLISA leverages global–local soft alignment with adaptive local information extraction [147]. FAAMI aggregates multi-scale information with cross-layer connections and consistency enhancement [148], and MSITA learns salient information with multiscale fusion and image-guided text alignment [149]. Related approaches such as AMFMN also focus on multiscale representation and semantic consistency [150].
Fine-grained semantic alignment focuses on aligning visual patches with textual words. FGVLA introduces spatial mask and contrastive losses to enhance patch-to-word correspondence [85]. MAFA-Net combines multi-attention fusion with fine-grained alignment for bidirectional retrieval [151]. JGDN captures intra-modal fine-grained semantics to guide cross-modal learning [152], while CDMAN introduces cues-driven alignment and multi-granularity association learning [125]. Earlier work on semantic alignment networks also paved the way for this line of research [153].
Optimization and training strategies aim to improve generalization and address modality imbalance. RSITR-FFT adapts CLIP to RS domains through fine-grained tuning with consistency regularization [154]. FGIS introduces information supplementation and value-guided learning [155]. SWPE uses strong and weak prompts to capture both global and fine-grained semantics [156], while RDB bridges representation discrepancies with differential and hierarchical attention [157]. SMLGN further integrates multi-subspace joint learning with adversarial training to achieve modality consistency [158].
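Many of these retrieval methods build on CLIP-style symmetric image–text contrastive training before adding fine-grained alignment or regularization terms. The sketch below shows this shared backbone objective in a generic form; the temperature value and the in-batch-negatives construction are standard assumptions, not the specific formulation of any cited method.

```python
import torch
import torch.nn.functional as F

def image_text_contrastive(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired image/caption embeddings.

    img_emb, txt_emb: (B, D) embeddings of matched remote sensing images and
    their textual descriptions; all other pairs in the batch act as negatives.
    """
    img = F.normalize(img_emb, dim=1)
    txt = F.normalize(txt_emb, dim=1)
    logits = img @ txt.t() / temperature                   # (B, B) similarities
    labels = torch.arange(img.size(0), device=img.device)  # matched pairs on the diagonal
    loss_i2t = F.cross_entropy(logits, labels)             # image -> text retrieval
    loss_t2i = F.cross_entropy(logits.t(), labels)         # text -> image retrieval
    return (loss_i2t + loss_t2i) / 2.0
```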
Other novel feature modeling methods include the following: FRORS integrates fine-grained prototype memory and Gram-based learning to capture intra-class heterogeneity and inter-class commonality [159]. DMFH introduces multiscale fine-grained hashing for efficient retrieval [160]. GNN-based methods explicitly model associations between text and images for fine-grained matching [161]. SWAN employs scene-aware aggregation to mitigate semantic confusion in RS retrieval [162]. Other explorations, such as sketch-based fine-grained retrieval, also extend the paradigm [163,164].
This section mainly reviews two aspects of research on fine-grained scene-level recognition: scene classification and image retrieval. Among them, research on fine-grained scene classification is relatively abundant, while research on fine-grained image retrieval is still in the exploratory stage. From a technical perspective, most studies adopt a bottom-up paradigm. “Large intra-class variation and high inter-class similarity” remain core challenges in fine-grained scene recognition. Compared with pixel-level and object-level fine-grained recognition, scene-level fine-grained recognition employs more diverse and comprehensive technologies, and thus is more challenging.

3.4. The Common Methods Used Across Different Levels in Fine-Grained Interpretation

We also find that some methods appear repeatedly in fine-grained interpretation at different levels, such as multi-source data utilization, inter-class relationship modeling, advanced data annotation strategies, knowledge distillation, novel data representation, component relationship learning, enhanced attention mechanisms, few-shot or zero-shot learning, and prototypical contrastive learning. We summarize the commonly used methods at different levels in Table 4; more fundamental frameworks such as the Transformer are not included.
The purpose and form of the same technique (such as the attention mechanism) differ when it is used at different levels of fine-grained interpretation, which makes it necessary to distinguish these applications clearly within the three-tier structure. Pixel-level attention focuses on fine calibration of spectral–spatial features (for example, GRetNet’s Gaussian multi-head attention [46] is used to capture the spectral differences of tree species); object-level attention focuses on enhancing local key features of the target (such as SFRNet’s spatial–channel Transformer for extracting features of inclined aircraft components [80]); and scene-level attention emphasizes collaborative modeling of the global–local context (such as MGML-FENet’s channel-separated attention for fusing scene texture and structure [126]).
Transformer-based architectures and large vision models (e.g., ViT, Swin Transformer, Segment Anything) have emerged as universal technical backbones across pixel-, object-, and scene-level fine-grained interpretation, owing to their superior capability in modeling global dependencies and capturing subtle discriminative features. Unlike traditional CNNs limited by local receptive fields, these models leverage self-attention mechanisms to dynamically weight spatial–spectral relationships, addressing core challenges such as intra-class variation and inter-class similarity in remote sensing tasks. At the pixel level, ViT-based variants (e.g., GRetNet, CenterFormer) split hyperspectral or high-resolution images into patches, using multi-head attention to model long-range correlations between spectral bands and neighboring pixels. This enables precise capture of subtle spectral–spatial differences (e.g., tree species with similar spectra but distinct texture distributions) that CNNs often miss. Swin Transformer’s hierarchical window attention further optimizes computational efficiency, making it suitable for large-scale pixel-level segmentation by balancing local detail extraction and global context integration. For object-level detection, Transformer/DETR frameworks (e.g., FSDA-DETR, GMODet) replace handcrafted proposal mechanisms with end-to-end set prediction, effectively handling highly similar targets (e.g., ship subtypes, aircraft models). The self-attention module aggregates global contextual information to distinguish structural nuances (e.g., hull shapes of frigates vs. destroyers), while cross-attention aligns multi-modal features (RGB, LiDAR) for enhanced robustness. Segment Anything Model (SAM) contributes to fine-grained object segmentation by generating high-quality masks, supporting precise boundary delineation for deformed or small targets (e.g., urban buildings, street trees). In scene-level recognition, ViT and Swin Transformer excel at modeling semantic hierarchies and contextual relationships. By encoding global scene structures and local fine-grained details (e.g., road markings in urban scenes, crop ridge patterns in agricultural areas), they mitigate the impact of intra-class variation (e.g., seasonal changes in vegetation) and inter-class similarity (e.g., residential vs. mixed-use zones). Hybrid models (e.g., P2FEViT) that embed CNN-extracted local features into Transformer backbones further fuse complementary strengths, achieving both discriminative power and efficiency.
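The shared computational core of the ViT/Swin-style backbones discussed above is multi-head self-attention over a sequence of image patches. The minimal sketch below shows this operation; the weight matrices w_qkv and w_out are assumed learnable parameters, and real implementations add layer normalization, dropout, positional encoding, and (for Swin) windowed attention.

```python
import torch

def multi_head_self_attention(x, w_qkv, w_out, num_heads=8):
    """Scaled dot-product self-attention over patch tokens.

    x:      (B, N, D) patch embeddings.
    w_qkv:  (D, 3*D) projection producing queries, keys, and values.
    w_out:  (D, D) output projection.
    Each head attends over all N tokens, so every patch can aggregate
    information from the whole image rather than a local receptive field.
    """
    b, n, d = x.shape
    head_dim = d // num_heads
    q, k, v = (x @ w_qkv).chunk(3, dim=-1)

    def split(t):                                           # (B, heads, N, head_dim)
        return t.view(b, n, num_heads, head_dim).transpose(1, 2)

    q, k, v = split(q), split(k), split(v)
    attn = (q @ k.transpose(-2, -1)) / head_dim ** 0.5      # (B, heads, N, N)
    out = attn.softmax(dim=-1) @ v                          # weighted aggregation
    out = out.transpose(1, 2).reshape(b, n, d)
    return out @ w_out
```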
Transformer and large-model paradigms are not a panacea but a toolbox: their core mechanisms—adaptive non-local aggregation, head-wise factorization, and transfer-friendly pretraining—directly address the representational needs of fine-grained remote sensing. Realizing their full potential requires deliberate architectural priors, sensor-aware tokenization, domain-centric pretraining, and interpretability validation tailored to geospatial imagery. However, challenges remain, including high computational costs and reliance on massive annotated data. Future advancements may focus on lightweight adaptations (e.g., efficient attention mechanisms) and self-supervised pre-training tailored to remote sensing characteristics, further expanding their applicability in fine-grained interpretation.

3.5. Summary of Methods

In this section, taking the three core levels of remote sensing image interpretation (pixel-level, object-level, and scene-level) as the framework, fine-grained interpretation is treated as a technical deepening centered on the demand for subclass distinction at each level rather than as an independent paradigm. Centering on core issues of fine-grained interpretation such as small inter-class differences and large intra-class variations, we systematically sort out representative fine-grained interpretation methods at the three levels. Meanwhile, this section analyzes the technical logic and applicable scenarios of the different methods, and finally constructs a classification system of fine-grained remote sensing image interpretation methods covering the core tasks, providing a systematic reference for method research in this field.

4. Main Development Trends

4.1. Journal Distribution

From a journal distribution perspective, top-tier remote sensing publications have become the primary platforms for papers focused on fine-grained interpretation of remote sensing images. As shown in Figure 10, between 2015 and 2025, research on fine-grained remote sensing image interpretation has been published widely across a number of high-impact journals in the geoscience and remote sensing domain. The distribution of articles shows several clear patterns.
Leading journals by output: With 455 published papers, IEEE Transactions on Geoscience and Remote Sensing serves as the primary platform for theoretical and technological innovation in this field. Its research covers the core content of the entire fine-grained interpretation chain. It also acts as a key venue for publishing achievements related to theoretical breakthroughs and technological optimization. Remote Sensing (MDPI) has published 246 papers, focusing on multi-directional application exploration and methodological research in fine-grained interpretation. Additionally, it incorporates extensive dataset validation work, providing abundant practical case support for the field. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing has 235 published papers, with a core focus on the application and implementation of fine-grained interpretation technologies. Its research emphasizes the adaptive practice of technologies in specific scenarios and pays attention to fine-grained processing methods for satellite data.
IEEE Geoscience and Remote Sensing Letters has published 113 papers, featuring short-format research. Its content focuses on single-point innovation and preliminary verification in fine-grained interpretation. It also rapidly disseminates cutting-edge innovative ideas in the field. Journal of Applied Remote Sensing has 103 published papers, emphasizing the practical verification of fine-grained interpretation methods, providing references for the engineering application of methods. With 65 published papers, International Journal of Applied Earth Observation and Geoinformation conducts research from a geospatial information perspective. It focuses on publishing achievements related to the integration of remote sensing and geospatial relationships. ISPRS Journal of Photogrammetry and Remote Sensing has 49 published papers, focusing on the integration of photogrammetry and remote sensing technologies, reflecting the in-depth linkage between fine-grained interpretation and traditional surveying and mapping technologies. Remote Sensing of Environment has 30 published papers, focusing on the high-value application of fine-grained interpretation in environmental monitoring.
The following trends are observed: While IEEE and ISPRS outlets remain dominant, the substantial number of publications in IJAEOG and Remote Sensing (MDPI) reflects a move toward more application-oriented and open-access journals, increasing global visibility. High outputs in Remote Sensing (MDPI) and IEEE JSTARS suggest that fine-grained interpretation is increasingly intersecting with computer vision, data science, and earth observation applications. TGRS and RSE continue to publish core theoretical and algorithmic advances, while IJAEOG and MDPI’s Remote Sensing emphasize applied and case-driven studies.

4.2. Annually Published Articles

Figure 11 adopts a combined bar-and-line chart to present the changes in the number of papers published in the field of fine-grained remote sensing image interpretation from 2015 to 2025. The horizontal axis represents the year and the vertical axis the annual number of published papers; the bars correspond to the actual number of papers published each year, while the red line fits the overall trend. In terms of data distribution, during 2015–2023 the annual number of papers grew steadily at a relatively moderate rate, reflecting the gradual technical and theoretical development of the field during this stage. The year 2023 marked a key turning point, after which the number of papers entered a phase of rapid growth and reached a periodic peak in 2024, demonstrating a significant surge in research interest. Although the 2025 bar is lower than that of 2024, this does not indicate an actual decline, because the year 2025 had not yet ended at the time of counting.
In view of the current development trend, it is expected that fine-grained remote sensing image interpretation will remain a research hotspot in 2025 and the coming years. With the continuous development of multi-source data fusion and artificial intelligence (especially deep learning technology), the accuracy and efficiency of interpretation are expected to be further improved. Interdisciplinary research will also deepen, promoting remote sensing technology to move from macro observation to micro fine-grained interpretation and providing core technical support for the digital and intelligent transformation of various industries.

4.3. Keyword Co-Occurrence Network

Figure 12 shows a keyword co-occurrence network for the field of fine-grained remote sensing interpretation. Nodes of different colors represent different categories: blue for methods/models, red for tasks, green for datasets/sensors, and yellow for applications. The lines between nodes indicate associations among keywords, while the size of each node reflects the importance or frequency of occurrence of the corresponding keyword. The diagram covers a wide range of keywords, spanning data acquisition (e.g., satellite sensors such as WorldView and Landsat), processing methods (e.g., self-supervised learning and graph convolutional networks), tasks (e.g., fine-grained ship recognition and change detection), and multi-domain applications (e.g., agricultural monitoring and disaster monitoring), presenting the complex interconnections among the various elements of this field.
In the methods/models category (blue nodes), the largest nodes correspond to deep learning-related technologies—specifically Transformer architecture and contrastive learning. These two keywords not only have the highest frequency of occurrence but also connect to the most other nodes (e.g., linking to “fine-grained object recognition” in tasks and “hyperspectral datasets” in data), becoming the core driving forces of technical innovation in the field.
In the tasks category (red nodes), “fine-grained object recognition” and “change detection” are the largest nodes. Their prominent size and dense connecting lines (e.g., linking to multiple processing methods and application scenarios) confirm that high-precision, detail-oriented interpretation tasks have become the primary research focus.
In the datasets/sensors category (green nodes), “hyperspectral datasets” and “high-resolution satellite data (e.g., WorldView)” are the largest nodes. Their frequent co-occurrence with methods and tasks reflects that multi-source, high-precision data has become the basic support for advancing research, with increasing attention paid to data quality and diversity.
In the applications category (yellow nodes), “agricultural monitoring” and “urban planning” are the most prominent nodes. Their strong associations with core tasks indicate that practical application in key fields is the main orientation of research, and the integration between technical methods and industry needs is becoming increasingly close.
Overall, the field of fine-grained remote sensing interpretation shows a trend of multi-dimensional coordinated development, with clear core nodes standing out in the keyword network. The cross-integration among these core nodes (e.g., Transformer connecting to hyperspectral datasets and fine-grained object recognition) is driving the field toward a more refined, intelligent, and application-oriented direction.

5. Discussion

Although fine-grained interpretation of remote sensing images has made significant progress in recent years, the field still faces many unresolved challenges before it can achieve reliable and scalable applications in real-world scenarios. These challenges span conceptual definitions, methodological limitations, dataset construction, and the relationship to human cognition. Below, we discuss several key challenges in depth and also survey several promising directions, many of which directly correspond to the challenges outlined.

5.1.1. Theory and Framework

(1) Challenges: Lack of a unified definition of fine granularity. One of the most fundamental problems is the absence of a unified definition of “fine-grained” in the context of remote sensing. At the pixel level, fine granularity may refer to distinguishing subtle spectral variations; at the object level, it often refers to recognizing subcategories of the same class, such as different ship types or aircraft models; at the scene level, it may mean subdividing complex environments into finer categories, such as different types of residential or agricultural areas. These different perspectives are not hierarchical and are often defined independently, which creates inconsistencies across studies. As a result, it is difficult to fairly compare methods or transfer models between datasets. The lack of a common framework also complicates the creation of benchmarks, since each dataset may adopt its own class definitions and granularity standards. Establishing a standardized and hierarchical definition of fine granularity would provide a foundation for dataset construction, model evaluation, and cross-domain generalization. Future Direction: Establishing unified definitions and hierarchical frameworks. A first priority is to standardize the definition of fine granularity across pixel-level, object-level, and scene-level tasks. Developing a unified and hierarchical framework will enable more consistent dataset construction, fairer benchmarking, and clearer methodological comparisons. Such a framework could also serve as a foundation for cross-task transfer and multi-level integration.
(2) Challenges: Limited integration with cognitive theories. Humans are adept at fine-grained recognition because of cognitive mechanisms such as selective attention, hierarchical semantic reasoning, and multi-modal integration. However, current fine-grained remote sensing methods are largely disconnected from cognitive science. Most rely on purely data-driven deep learning approaches without incorporating principles inspired by human perception. Questions such as how to replicate human-like selective attention to key image regions, how to leverage semantic hierarchies for more interpretable reasoning, and how to dynamically integrate multiple information sources remain underexplored. Bridging remote sensing interpretation with cognitive theories could lead to new paradigms in feature learning and model design, potentially making models both more accurate and more interpretable. Future Direction: Bridging remote sensing interpretation with cognitive science. Finally, future research may benefit from stronger integration with cognitive theories. Models inspired by human perception—such as attention-guided reasoning, semantic hierarchy utilization, and multi-modal integration—can potentially achieve both higher accuracy and interpretability. This interdisciplinary approach could open up new directions for fine-grained interpretation, enabling systems that not only match but also mimic human analytical capabilities.

5.1.2. Data and Annotation

(1) Challenges: High annotation cost and lack of consistency. Fine-grained datasets require extensive high-quality annotations, which are both costly and difficult to obtain. Unlike coarse-grained classification, fine-grained labeling often requires domain expertise—for example, identifying tree species or aircraft variants demands specialist knowledge that crowdsourcing cannot easily provide. Furthermore, fine-grained annotation is prone to inconsistencies, as different annotators may disagree on subtle distinctions. This variability undermines dataset reliability and can introduce noise into model training and evaluation. While some efforts have been made to reduce costs using weakly supervised learning, semi-supervised learning, or automatic labeling techniques, these methods are still insufficient for achieving the level of accuracy required for fine-grained interpretation. Without scalable annotation solutions, progress in this field will remain constrained. Future research should explore integrating expert knowledge with automatic or semi-automatic annotation strategies, as well as leveraging knowledge graphs and large language models to support label consistency. Future Direction: Reducing annotation cost and improving label consistency. Novel strategies are needed to address the bottleneck of high annotation cost. These include active learning to prioritize informative samples, semi- and weakly supervised learning to leverage partially labeled data, and automated annotation assisted by knowledge graphs or large foundation models. Such approaches can reduce reliance on expensive expert labeling while maintaining high label quality and consistency.
(2) Challenges: Insufficient multi-modal and multi-source data fusion. Remote sensing inherently involves multiple modalities of data, each capturing complementary information. Optical images provide rich texture and color cues, SAR images capture structural and backscatter properties, LiDAR data reveals 3D geometry, and hyperspectral sensors provide dense spectral signatures that can separate visually similar categories. Despite this, most fine-grained interpretation research still relies almost exclusively on optical images. This limitation arises partly from the lack of multi-modal benchmark datasets with aligned and consistent annotations, and partly from the difficulty of designing models that can effectively integrate heterogeneous modalities. Without multi-source fusion, models may fail to fully exploit the complementary strengths of different sensors, leading to suboptimal performance in complex or ambiguous scenarios. Future progress will depend on building large-scale, well-annotated multi-modal datasets and developing fusion strategies that can dynamically balance modality contributions under varying conditions. Future Direction: Building and exploiting multi-modal and multi-source datasets. The integration of optical, SAR, LiDAR, hyperspectral, and temporal data will be crucial for fine-grained interpretation. Future efforts should focus on constructing large-scale, well-aligned multi-modal datasets and developing fusion strategies that adaptively balance complementary modalities. Advances in cross-modal representation learning and self-supervised pretraining are expected to play an important role.

5.1.3. Model and Method

(1) Challenges: Heavy reliance on computer vision methods but with domain gaps. Fine-grained remote sensing interpretation has borrowed heavily from advances in computer vision, including deep convolutional neural networks, attention mechanisms, and transformer-based architectures. These methods have enabled rapid progress, but the direct transfer of computer vision techniques has limitations. Remote sensing images differ fundamentally from natural images in several ways: they are top views and cover much larger spatial extents, often include multiple modalities (optical, SAR, LiDAR, hyperspectral), and are acquired under widely varying conditions such as seasons, illumination, and sensor platforms. For instance, models that perform well on ImageNet-style natural images may fail to handle the scale variation and sensor noise inherent in remote sensing. Furthermore, remote sensing tasks often demand recognition of subtle differences across highly similar categories, which is less common in computer vision benchmarks. This domain gap indicates the need to adapt or redesign methods specifically for remote sensing, rather than relying solely on transferring computer vision techniques. Future Direction: Designing domain-specific methods beyond computer vision transfer. While computer vision remains an important source of inspiration, the remote sensing community needs to design methods tailored to the unique properties of remote sensing imagery. This includes accounting for large-scale spatial structures, multi-resolution data, and sensor-specific characteristics. Hybrid models that integrate domain knowledge with advanced architectures such as transformers may offer a way forward.
(2) Challenges: Intrinsic difficulty of small inter-class differences and large intra-class variations. The coexistence of small inter-class differences and large intra-class variations is an intrinsic difficulty of fine-grained tasks. Many fine-grained remote sensing categories are visually similar, such as different models of aircraft, ships, or ecologically related tree species. At the same time, instances of the same class can appear very different depending on seasonality, vegetation phenology, viewing angle, or illumination. For example, the same type of crop field can exhibit drastic changes in spectral characteristics over its growth cycle, while the same residential area may look very different under varying imaging resolutions or atmospheric conditions. This dual challenge requires models to learn highly discriminative features while also being robust to intra-class variability. Existing approaches, such as attention mechanisms or part-based feature learning, partially address these issues, but their effectiveness is still limited in real-world, large-scale scenarios. Tackling this challenge will likely require more sophisticated strategies, including dynamic feature adaptation, multi-scale representation learning, and domain-aware training. Future Direction: Addressing the dual challenge of inter-class similarity and intra-class variability. Future models must be capable of simultaneously handling subtle inter-class differences and large intra-class variations. Promising directions include multi-scale representation learning, part-based feature modeling, and dynamic adaptation mechanisms that adjust feature extraction based on environmental conditions. Incorporating temporal and contextual cues may also help reduce ambiguity in challenging cases.
(3) Challenges: Limited cross-domain generalization and open-world recognition. Most fine-grained models perform well on specific benchmark datasets but generalize poorly when applied to new regions, sensors, or time periods. This performance drop is largely due to domain shifts caused by differences in geography, climate, imaging conditions, and sensor characteristics. In real-world applications, remote sensing imagery frequently contains novel classes not seen during training, making the traditional closed-set assumption unrealistic. Addressing this challenge requires developing models with stronger cross-domain adaptability and the ability to detect and learn from unseen categories. Open-world recognition and zero-shot learning have been proposed as potential solutions, but current work in these areas is still at an exploratory stage. Progress will require not only new methods but also large-scale open-world benchmarks to evaluate and train models under realistic conditions. Future Direction: Enhancing cross-domain generalization and supporting open-world recognition. Developing robust models that can generalize across regions, sensors, and time periods will be a key research focus. Domain adaptation, domain generalization, and meta-learning provide potential solutions. At the same time, constructing open-world benchmarks and incorporating zero-shot and few-shot learning methods will allow models to better handle previously unseen categories, bringing remote sensing closer to real-world requirements.
Due to space limitations, the discussion of large foundation models in open-world scenarios is kept relatively brief in this review.

6. Conclusions

This study delivers a systematic and comprehensive review of fine-grained interpretation for remote sensing images, encompassing three core tiers: pixel-level classification and segmentation, object-level detection and recognition, and scene-level understanding. It rigorously scrutinizes existing benchmark datasets, representative methodological approaches, and their respective advantages and limitations. The review underscores that while deep learning has driven substantial advancements in the accuracy and applicability of fine-grained interpretation, this domain remains encumbered by inherent challenges. These encompass seven pivotal issues, including the absence of a standardized definition for “fine-grained”, the coexistence of minimal inter-class disparities and extensive intra-class variations, prohibitive annotation costs, and constrained cross-domain generalization capabilities.
The mitigation of these challenges necessitates synergistic efforts across dataset construction, algorithmic innovation, and interdisciplinary collaboration. Looking forward, fine-grained interpretation of remote sensing images is anticipated to evolve toward unified frameworks, multi-modal integration, scalable annotation strategies, and robust open-world recognition. Concurrently, integrating remote sensing research with cognitive science and harnessing advances in foundation models and self-supervised learning are poised to unlock novel research paradigms. With sustained progress along these avenues, fine-grained interpretation technology is expected to exert a more prominent impact in domains such as environmental monitoring, agricultural production, urban planning, and disaster management.

Author Contributions

Conceptualization, D.W., Z.Y., and P.L.; formal analysis, D.W., Z.Y., and P.L.; investigation, P.L.; resources, P.L.; writing—original draft preparation, D.W., Z.Y., and P.L.; writing—review and editing, D.W., Z.Y., and P.L.; supervision, P.L. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Comprehensive Site Selection System Project under Grant E5E2180501, the National Key Research and Development Program of China under Grant 2024YFF1307204, and the National Natural Science Foundation of China under Grant 61731022.

Data Availability Statement

We encourage all authors of articles published in MDPI journals to share their research data.

Acknowledgments

We would like to thank the anonymous reviewers for their many valuable comments.

Conflicts of Interest

Author Dongbo Wang was employed by China Nuclear Power Engineering Co., Ltd. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

References

  1. Liu, H.; Liao, T.; Wang, Y.; Qian, X.; Liu, X.; Li, C.; Li, S.; Guan, Z.; Zhu, L.; Zhou, X.; et al. Fine-grained wetland classification for national wetland reserves using multi-source remote sensing data and Pixel Information Expert Engine (PIE-Engine). GIScience Remote Sens. 2023, 60, 2286746. [Google Scholar] [CrossRef]
  2. Guo, Z.; Zhang, M.; Jia, W.; Zhang, J.; Li, W. Dual-concentrated network with morphological features for tree species classification using hyperspectral image. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2022, 15, 7013–7024. [Google Scholar] [CrossRef]
  3. Ai, J.; Mao, Y.; Luo, Q.; Jia, L.; Xing, M. SAR Target Classification Using the Multikernel-Size Feature Fusion-Based Convolutional Neural Network. IEEE Trans. Geosci. Remote Sens. 2022, 60, 5214313. [Google Scholar] [CrossRef]
  4. Xue, W.; Ai, J.; Zhu, Y.; Chen, J.; Zhuang, S. AIS-FCANet: Long-Term AIS Data Assisted Frequency-Spatial Contextual Awareness Network for Salient Ship Detection in SAR Imagery. IEEE Trans. Aerosp. Electron. Syst. 2025, 61, 15166–15171. [Google Scholar] [CrossRef]
  5. Chen, J.; Chen, Y.; Zheng, Z.; Ling, Z.; Meng, X.; Kuang, J.; Shi, X.; Yang, Y.; Chen, W.; Wu, Z. Urban Functional Zone Classification Based on High-Resolution Remote Sensing Imagery and Nighttime Light Imagery. Remote Sens. 2025, 17, 1588. [Google Scholar] [CrossRef]
  6. Yao, Y.; Liang, H.; Li, X.; Zhang, J.; He, J. Sensing Urban Land-Use Patterns by Integrating Google Tensorflow and Scene-Classification Models. Int. Arch. Photogramm. Remote Sens. Spat. Inf. Sci. 2017. [Google Scholar] [CrossRef]
  7. Wei, X.S.; Song, Y.Z.; Aodha, O.M.; Wu, J.; Peng, Y.; Tang, J.; Yang, J.; Belongie, S. Fine-Grained Image Analysis with Deep Learning: A Survey. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 44, 8927–8948. [Google Scholar] [CrossRef]
  8. Chu, Y.; Ye, M.; Qian, Y. Fine-Grained Image Recognition Methods and Their Applications in Remote Sensing Images: A Review. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2024, 17, 19640–19667. [Google Scholar] [CrossRef]
  9. Wang, J.; Hu, J.; Zhi, X.; Shi, T.; Cui, Q. Dataset and Benchmark for Fine-Grained Ship Recognition in Complex Optical Remote Sensing Scenarios. IEEE Trans. Geosci. Remote Sens. 2025, 63, 5624111. [Google Scholar] [CrossRef]
  10. Zhang, B.; Zhao, L.; Zhang, X. Three-dimensional convolutional neural network model for tree species classification using airborne hyperspectral images. Remote Sens. Environ. 2020, 247, 111938. [Google Scholar] [CrossRef]
  11. Zhu, Y.; Li, W.; Zhang, M.; Pang, Y.; Tao, R.; Du, Q. Joint feature extraction for multi-source data using similar double-concentrated network. Neurocomputing 2021, 450, 70–79. [Google Scholar] [CrossRef]
  12. Yuan, S.; Lin, G.; Zhang, L.; Dong, R.; Zhang, J.; Chen, S.; Zheng, J.; Wang, J.; Fu, H. FUSU: A multi-temporal-source land use change segmentation dataset for fine-grained urban semantic understanding. Adv. Neural Inf. Process. Syst. 2024, 37, 132417–132439. [Google Scholar]
  13. Liu, Z.; Yuan, L.; Weng, L.; Yang, Y. A high resolution optical satellite image dataset for ship recognition and some new baselines. In Proceedings of the International Conference on Pattern Recognition Applications and Methods, Porto, Portugal, 24–26 February 2017; SciTePress: Setubal, Portugal, 2017; Volume 2, pp. 324–331. [Google Scholar]
  14. Di, Y.; Jiang, Z.; Zhang, H. A public dataset for fine-grained ship classification in optical remote sensing images. Remote Sens. 2021, 13, 747. [Google Scholar] [CrossRef]
  15. Zhang, Z.; Zhang, L.; Wang, Y.; Feng, P.; He, R. Shiprsimagenet: A large-scale fine-grained dataset for ship detection in high-resolution optical remote sensing images. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2021, 14, 8458–8472. [Google Scholar] [CrossRef]
  16. Wang, Z.; Zhou, Y.; Wang, F.; Wang, S.; Gao, G.; Zhu, J.; Wang, P.; Hu, K. Mfbfs: High-resolution multispectral remote sensing image fine-grained building feature set. J. Remote Sens. 2024, 28, 2780–2791. [Google Scholar]
  17. Xia, G.S.; Hu, J.; Hu, F.; Shi, B.; Bai, X.; Zhong, Y.; Zhang, L.; Lu, X. Aid: A benchmark data set for performance evaluation of aerial scene classification. IEEE Trans. Geosci. Remote Sens. 2017, 55, 3965–3981. [Google Scholar] [CrossRef]
  18. Cheng, G.; Han, J.; Lu, X. Remote Sensing Image Scene Classification: Benchmark and State of the Art. Proc. IEEE 2017, 105, 1865–1883. [Google Scholar] [CrossRef]
  19. Zhou, W.; Newsam, S.; Li, C.; Shao, Z. Patternnet: A benchmark dataset for performance evaluation of remote sensing image retrieval. ISPRS J. Photogramm. Remote Sens. 2018, 145, 197–209. [Google Scholar] [CrossRef]
  20. Qi, X.; Zhu, P.; Wang, Y.; Zhang, L.; Peng, J.; Wu, M.; Chen, J.; Zhao, X.; Zang, N.; Mathiopoulos, P.T. Mlrsnet: A multi-label high spatial resolution remote sensing dataset for semantic scene understanding. ISPRS J. Photogramm. Remote Sens. 2020, 169, 337–350. [Google Scholar] [CrossRef]
  21. Long, Y.; Xia, G.S.; Li, S.; Yang, W.; Yang, M.Y.; Zhu, X.X.; Zhang, L.; Li, D. On creating benchmark dataset for aerial image interpretation: Reviews, guidances, and million-aid. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2021, 14, 4205–4230. [Google Scholar] [CrossRef]
  22. Li, Y.; Wu, Y.; Cheng, G.; Tao, C.; Dang, B.; Wang, Y.; Zhang, J.; Zhang, C.; Liu, Y.; Tang, X.; et al. MEET: A Million-Scale Dataset for Fine-Grained Geospatial Scene Classification with Zoom-Free Remote Sensing Imagery. arXiv 2025, arXiv:2503.11219. [Google Scholar] [CrossRef]
  23. Chen, K.; Wu, M.; Liu, J.; Zhang, C. Fgsd: A dataset for fine-grained ship detection in high resolution satellite images. arXiv 2020, arXiv:2003.06832. [Google Scholar] [CrossRef]
  24. Huang, X.; Ren, L.; Liu, C.; Wang, Y.; Yu, H.; Schmitt, M.; Hänsch, R.; Sun, X.; Huang, H.; Mayer, H. Urban building classification (ubc)-a dataset for individual building detection and classification from satellite imagery. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 1413–1421. [Google Scholar]
  25. Huang, X.; Chen, K.; Tang, D.; Liu, C.; Ren, L.; Sun, Z.; Hänsch, R.; Schmitt, M.; Sun, X.; Huang, H.; et al. Urban building classification (ubc) v2—A benchmark for global building detection and fine-grained classification from satellite imagery. IEEE Trans. Geosci. Remote Sens. 2023, 61, 5620116. [Google Scholar] [CrossRef]
  26. Liu, G.; Peng, B.; Liu, T.; Zhang, P.; Yuan, M.; Lu, C.; Cao, N.; Zhang, S.; Huang, S.; Wang, T.; et al. Large-scale fine-grained building classification and height estimation for semantic urban reconstruction: Outcome of the 2023 IEEE GRSS data fusion contest. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2024, 17, 11194–11207. [Google Scholar] [CrossRef]
  27. Wu, Z.Z.; Wan, S.H.; Wang, X.F.; Tan, M.; Zou, L.; Li, X.L.; Chen, Y. A benchmark data set for aircraft type recognition from remote sensing images. Appl. Soft Comput. 2020, 89, 106132. [Google Scholar] [CrossRef]
  28. Yu, W.; Cheng, G.; Wang, M.; Yao, Y.; Xie, X.; Yao, X.; Han, J. Mar20: Remote sensing image military aircraft target recognition dataset. J. Remote Sens. 2023, 27, 2688–2696. [Google Scholar]
  29. Sun, X.; Wang, P.; Yan, Z.; Xu, F.; Wang, R.; Diao, W.; Chen, J.; Li, J.; Feng, Y.; Xu, T.; et al. Fair1m: A benchmark dataset for fine-grained object recognition in high-resolution remote sensing imagery. ISPRS J. Photogramm. Remote Sens. 2022, 184, 116–130. [Google Scholar] [CrossRef]
  30. Xiang, X.; Xu, Z.; Deng, Y.; Zhou, Q.; Liang, Y.; Chen, K.; Zheng, Q.; Wang, Y.; Chen, X.; Gao, W. Openearthsensing: Large-scale fine-grained benchmark for open-world remote sensing. arXiv 2025, arXiv:2502.20668. [Google Scholar]
  31. Jin, P.; Xia, G.S.; Hu, F.; Lu, Q.; Zhang, L. AID++: An Updated Version of AID on Scene Classification. In Proceedings of the IGARSS 2018—2018 IEEE International Geoscience and Remote Sensing Symposium, Valencia, Spain, 22–27 July 2018; pp. 4721–4724. [Google Scholar] [CrossRef]
  32. Xiao, Z.; Long, Y.; Li, D.; Wei, C.; Tang, G.; Liu, J. High-resolution remote sensing image retrieval based on cnns from a dimensional perspective. Remote Sens. 2017, 9, 725. [Google Scholar] [CrossRef]
  33. Wang, Q.; Liu, S.; Chanussot, J.; Li, X. Scene classification with recurrent attention of vhr remote sensing images. IEEE Trans. Geosci. Remote Sens. 2018, 57, 1155–1167. [Google Scholar] [CrossRef]
  34. Li, H.; Dou, X.; Tao, C.; Wu, Z.; Chen, J.; Peng, J.; Deng, M.; Zhao, L. Rsi-cb: A large-scale remote sensing image classification benchmark using crowdsourced data. Sensors 2020, 20, 1594. [Google Scholar] [CrossRef]
  35. Li, Y.; Kong, D.; Zhang, Y.; Tan, Y.; Chen, L. Robust deep alignment network with remote sensing knowledge graph for zero-shot and generalized zero-shot remote sensing image scene classification. ISPRS J. Photogramm. Remote Sens. 2021, 179, 145–158. [Google Scholar] [CrossRef]
  36. Hua, Y.; Mou, L.; Jin, P.; Zhu, X.X. Multiscene: A large-scale dataset and benchmark for multiscene recognition in single aerial images. IEEE Trans. Geosci. Remote Sens. 2021, 60, 5610213. [Google Scholar] [CrossRef]
  37. Yuan, J.; Ru, L.; Wang, S.; Wu, C. Wh-mavs: A novel dataset and deep learning benchmark for multiple land use and land cover applications. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2022, 15, 1575–1590. [Google Scholar] [CrossRef]
  38. Zhao, D.; Yuan, B.; Chen, Z.; Li, T.; Liu, Z.; Li, W.; Gao, Y. Panoptic Perception: A Novel Task and Fine-Grained Dataset for Universal Remote Sensing Image Interpretation. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5620714. [Google Scholar] [CrossRef]
  39. Pu, R.; Gong, P.; Tian, Y. Wavelet transform applied to EO-1 Hyperion hyperspectral data for forest LAI estimation. IEEE Trans. Geosci. Remote Sens. 2004, 42, 1181–1191. [Google Scholar]
  40. Somers, B.; Asner, G.P.; Tits, L.; Coppin, P. Endmember variability in Spectral Mixture Analysis: A review. Remote Sens. Environ. 2011, 115, 1603–1616. [Google Scholar] [CrossRef]
  41. Dalal, N.; Triggs, B. Histograms of oriented gradients for human detection. In Proceedings of the 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’05), San Diego, CA, USA, 20–25 June 2005; Volume 1, pp. 886–893. [Google Scholar]
  42. Heikkilä, M.; Pietikäinen, M.; Schmid, C. Description of interest regions with local binary patterns. Pattern Recognit. 2009, 42, 425–436. [Google Scholar] [CrossRef]
  43. Zhu, Q.; Zhong, Y.; Zhao, B.; Xia, G.S.; Zhang, L. Bag-of-visual-words scene classifier with local and global features for high spatial resolution remote sensing imagery. IEEE Geosci. Remote Sens. Lett. 2016, 13, 747–751. [Google Scholar] [CrossRef]
  44. Yang, J.; Yu, K.; Gong, Y.; Huang, T. Linear spatial pyramid matching using sparse coding for image classification. In Proceedings of the 2009 IEEE Conference on Computer Vision and Pattern Recognition, Miami, FL, USA, 20–25 June 2009; pp. 1794–1801. [Google Scholar]
  45. Peng, Y.; Zhang, Y.; Tu, B.; Li, Q.; Li, W. Spatial–Spectral Transformer with Cross-Attention for Hyperspectral Image Classification. IEEE Trans. Geosci. Remote Sens. 2022, 60, 5537415. [Google Scholar] [CrossRef]
  46. Han, Z.; Xu, S.; Gao, L.; Li, Z.; Zhang, B. GRetNet: Gaussian Retentive Network for Hyperspectral Image Classification. IEEE Geosci. Remote Sens. Lett. 2024, 21, 5509105. [Google Scholar] [CrossRef]
  47. Jia, C.; Zhang, X.; Meng, H.; Xia, S.; Jiao, L. CenterFormer: A Center Spatial–Spectral Attention Transformer Network for Hyperspectral Image Classification. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2025, 18, 5523–5539. [Google Scholar] [CrossRef]
  48. Zhao, Y.; Bao, W.; Xu, X.; Zhou, Y. E2TNet: Efficient enhancement Transformer network for hyperspectral image classification. Infrared Phys. Technol. 2024, 142, 105569. [Google Scholar] [CrossRef]
  49. Li, Z.; Guo, F.; Li, Q.; Ren, G.; Wang, L. An Encoder–Decoder Convolution Network With Fine-Grained Spatial Information for Hyperspectral Images Classification. IEEE Access 2020, 8, 33600–33608. [Google Scholar] [CrossRef]
  50. Roy, S.K.; Kar, P.; Hong, D.; Wu, X.; Plaza, A.; Chanussot, J. Revisiting Deep Hyperspectral Feature Extraction Networks via Gradient Centralized Convolution. IEEE Trans. Geosci. Remote Sens. 2022, 60, 5516619. [Google Scholar] [CrossRef]
  51. Zhang, M.; Li, W.; Zhao, X.; Liu, H.; Tao, R.; Du, Q. Morphological Transformation and Spatial-Logical Aggregation for Tree Species Classification Using Hyperspectral Imagery. IEEE Trans. Geosci. Remote Sens. 2023, 61, 5501212. [Google Scholar] [CrossRef]
  52. Roy, S.K.; Deria, A.; Shah, C.; Haut, J.M.; Du, Q.; Plaza, A. Spectral–Spatial Morphological Attention Transformer for Hyperspectral Image Classification. IEEE Trans. Geosci. Remote Sens. 2023, 61, 5503615. [Google Scholar] [CrossRef]
  53. Ji, R.; Tan, K.; Wang, X.; Tang, S.; Sun, J.; Niu, C.; Pan, C. PatchOut: A novel patch-free approach based on a transformer-CNN hybrid framework for fine-grained land-cover classification on large-scale airborne hyperspectral images. Int. J. Appl. Earth Obs. Geoinf. 2025, 138, 104457. [Google Scholar] [CrossRef]
  54. Yuan, J.; Wang, S.; Wu, C.; Xu, Y. Fine-Grained Classification of Urban Functional Zones and Landscape Pattern Analysis Using Hyperspectral Satellite Imagery: A Case Study of Wuhan. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2022, 15, 3972–3991. [Google Scholar] [CrossRef]
  55. Chen, Z.; Xu, T.; Pan, Y.; Shen, N.; Chen, H.; Li, J. Edge Feature Enhancement for Fine-Grained Segmentation of Remote Sensing Images. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5636613. [Google Scholar] [CrossRef]
  56. Chen, Y.; Huang, L.; Zhu, L.; Yokoya, N.; Jia, X. Fine-Grained Classification of Hyperspectral Imagery Based on Deep Learning. Remote Sens. 2019, 11, 2690. [Google Scholar] [CrossRef]
  57. Miao, J.; Zhang, B.; Wang, B. Coarse-to-Fine Joint Distribution Alignment for Cross-Domain Hyperspectral Image Classification. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2021, 14, 12415–12428. [Google Scholar] [CrossRef]
  58. Wu, H.; Xue, Z.; Zhou, S.; Su, H. Overcoming Granularity Mismatch in Knowledge Distillation for Few-Shot Hyperspectral Image Classification. IEEE Trans. Geosci. Remote Sens. 2025, 63, 5503517. [Google Scholar] [CrossRef]
  59. Huang, Y.; Peng, J.; Zhang, G.; Sun, W.; Chen, N.; Du, Q. Adversarial Domain Adaptation Network with Calibrated Prototype and Dynamic Instance Convolution for Hyperspectral Image Classification. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5514613. [Google Scholar] [CrossRef]
  60. Ma, Y.; Deng, X.; Wei, J. Land Use Classification of High-Resolution Multispectral Satellite Images With Fine-Grained Multiscale Networks and Superpixel Postprocessing. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2023, 16, 3264–3278. [Google Scholar] [CrossRef]
  61. Zhao, C.; Chen, M.; Feng, S.; Qin, B.; Zhang, L. A Coarse-to-Fine Semisupervised Learning Method Based on Superpixel Graph and Breaking-Tie Sampling for Hyperspectral Image Classification. IEEE Geosci. Remote Sens. Lett. 2023, 20, 5507705. [Google Scholar] [CrossRef]
  62. Ni, K.; Xie, Y.; Zhao, G.; Zheng, Z.; Wang, P.; Lu, T. Coarse-to-Fine High-Order Network for Hyperspectral and LiDAR Classification. IEEE Trans. Geosci. Remote Sens. 2025, 63, 1–16. [Google Scholar] [CrossRef]
  63. Liu, Y.; Ye, Z.; Xi, Y.; Liu, H.; Li, W.; Bai, L. Multiscale and Multidirection Feature Extraction Network for Hyperspectral and LiDAR Classification. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2024, 17, 9961–9973. [Google Scholar] [CrossRef]
  64. Liu, Z.; Li, J.; Wang, L.; Plaza, A. Integration of Remote Sensing and Crowdsourced Data for Fine-Grained Urban Flood Detection. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2024, 17, 13523–13532. [Google Scholar] [CrossRef]
  65. Liu, P.; Wang, L.; Ranjan, R.; He, G.; Zhao, L. A survey on active deep learning: From model driven to data driven. ACM Comput. Surv. (CSUR) 2022, 54, 1–34. [Google Scholar] [CrossRef]
  66. Bai, J.; Yuan, A.; Xiao, Z.; Zhou, H.; Wang, D.; Jiang, H.; Jiao, L. Class Incremental Learning with Few-Shots Based on Linear Programming for Hyperspectral Image Classification. IEEE Trans. Cybern. 2022, 52, 5474–5485. [Google Scholar] [CrossRef]
  67. Ouyang, L.; Guo, G.; Fang, L.; Ghamisi, P.; Yue, J. PCLDet: Prototypical Contrastive Learning for Fine-Grained Object Detection in Remote Sensing Images. IEEE Trans. Geosci. Remote Sens. 2023, 61, 5613911. [Google Scholar] [CrossRef]
  68. Girshick, R.; Donahue, J.; Darrell, T.; Malik, J. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014; pp. 580–587. [Google Scholar]
  69. Girshick, R. Fast r-cnn. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 1440–1448. [Google Scholar]
  70. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster r-cnn: Towards real-time object detection with region proposal networks. In Proceedings of the Advances in Neural Information Processing Systems, Montreal, QC, Canada, 7–12 December 2015; pp. 91–99. [Google Scholar]
  71. He, K.; Gkioxari, G.; Dollár, P.; Girshick, R. Mask r-cnn. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2961–2969. [Google Scholar]
  72. Cai, Z.; Vasconcelos, N. Cascade r-cnn: Delving into high quality object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 6154–6162. [Google Scholar]
  73. Han, Y.; Yang, X.; Pu, T.; Peng, Z. Fine-Grained Recognition for Oriented Ship Against Complex Scenes in Optical Remote Sensing Images. IEEE Trans. Geosci. Remote Sens. 2022, 60, 5612318. [Google Scholar] [CrossRef]
  74. Sumbul, G.; Cinbis, R.G.; Aksoy, S. Multisource Region Attention Network for Fine-Grained Object Recognition in Remote Sensing Imagery. IEEE Trans. Geosci. Remote Sens. 2019, 57, 4929–4937. [Google Scholar] [CrossRef]
  75. Guo, B.; Zhang, R.; Guo, H.; Yang, W.; Yu, H.; Zhang, P.; Zou, T. Fine-Grained Ship Detection in High-Resolution Satellite Images With Shape-Aware Feature Learning. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2023, 16, 1914–1926. [Google Scholar] [CrossRef]
  76. Cheng, J.; Yao, X.; Yang, X.; Yuan, X.; Feng, X.; Cheng, G.; Huang, X.; Han, J. DIMA: Digging Into Multigranular Archetype for Fine-Grained Object Detection. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5628714. [Google Scholar] [CrossRef]
  77. Wang, L.; Zhang, J.; Tian, J.; Li, J.; Zhuo, L.; Tian, Q. Efficient Fine-Grained Object Recognition in High-Resolution Remote Sensing Images from Knowledge Distillation to Filter Grafting. IEEE Trans. Geosci. Remote Sens. 2023, 61, 4701016. [Google Scholar] [CrossRef]
  78. Zeng, L.; Guo, H.; Yang, W.; Yu, H.; Yu, L.; Zhang, P.; Zou, T. Instance Switching-Based Contrastive Learning for Fine-Grained Airplane Detection. IEEE Trans. Geosci. Remote Sens. 2022, 60, 5633416. [Google Scholar] [CrossRef]
  79. Li, W.; Zhao, D.; Yuan, B.; Gao, Y.; Shi, Z. PETDet: Proposal Enhancement for Two-Stage Fine-Grained Object Detection. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5602214. [Google Scholar] [CrossRef]
  80. Cheng, G.; Li, Q.; Wang, G.; Xie, X.; Min, L.; Han, J. SFRNet: Fine-Grained Oriented Object Recognition via Separate Feature Refinement. IEEE Trans. Geosci. Remote Sens. 2023, 61, 5610510. [Google Scholar] [CrossRef]
  81. Ouyang, L.; Fang, L.; Ji, X. Multigranularity Self-Attention Network for Fine-Grained Ship Detection in Remote Sensing Images. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2022, 15, 9722–9732. [Google Scholar] [CrossRef]
  82. Liu, Y.; Liu, J.; Li, X.; Wei, L.; Wu, Z.; Han, B.; Dai, W. Exploiting Discriminating Features for Fine-Grained Ship Detection in Optical Remote Sensing Images. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2024, 17, 20098–20115. [Google Scholar] [CrossRef]
  83. Yang, Y.; Zhang, Z.; Feng, P.; Yan, Y.; He, G.; Liu, S.; Zhang, P.; Gao, H. HMS-Net: A Hierarchical Multilabel Fine-Grained Ship Detection Network in Remote Sensing Images. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2025, 18, 15394–15411. [Google Scholar] [CrossRef]
  84. Zhu, Z.; Sun, X.; Diao, W.; Chen, K.; Xu, G.; Fu, K. Invariant Structure Representation for Remote Sensing Object Detection Based on Graph Modeling. IEEE Trans. Geosci. Remote Sens. 2022, 60, 5625217. [Google Scholar] [CrossRef]
  85. Li, Y.; Chen, L.; Li, W. Fine-Grained Ship Recognition With Spatial-Aligned Feature Pyramid Network and Adaptive Prototypical Contrastive Learning. IEEE Trans. Geosci. Remote Sens. 2025, 63, 5604313. [Google Scholar] [CrossRef]
  86. Gong, T.; Cheng, W.; Chen, Y.; Xiong, S.; Lu, X. Discover the Unknown Ones in Fine-Grained Ship Detection. IEEE Trans. Geosci. Remote Sens. 2025, 63, 4208214. [Google Scholar] [CrossRef]
  87. Chen, X.; Chen, X.; Ge, X.; Chen, J.; Wang, H. Online Decoupled Distillation Based on Prototype Contrastive Learning for Lightweight Underwater Object Detection Models. IEEE Trans. Geosci. Remote Sens. 2025, 63, 4203514. [Google Scholar] [CrossRef]
  88. Guo, H.; Liu, Y.; Pan, Z.; Hu, Y. Advancing Fine-Grained Few-Shot Object Detection on Remote Sensing Images with Decoupled Self-Distillation and Progressive Prototype Calibration. Remote Sens. 2025, 17, 495. [Google Scholar] [CrossRef]
  89. Lu, X.; Sun, X.; Diao, W.; Mao, Y.; Li, J.; Zhang, Y.; Wang, P.; Fu, K. Few-Shot Object Detection in Aerial Imagery Guided by Text-Modal Knowledge. IEEE Trans. Geosci. Remote Sens. 2023, 61, 5604719. [Google Scholar] [CrossRef]
  90. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You only look once: Unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 779–788. [Google Scholar]
  91. Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.Y.; Berg, A.C. Ssd: Single shot multibox detector. In Proceedings of the European Conference on Computer Vision, Amsterdam, The Netherlands, 11–14 October 2016; Springer: Cham, Switzerland, 2016; pp. 21–37. [Google Scholar]
  92. Lin, T.Y.; Goyal, P.; Girshick, R.; He, K.; Dollár, P. Focal loss for dense object detection. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2980–2988. [Google Scholar]
  93. Redmon, J.; Farhadi, A. Yolov3: An incremental improvement. arXiv 2018, arXiv:1804.02767. [Google Scholar] [CrossRef]
  94. Tan, M.; Pang, R.; Le, Q.V. Efficientdet: Scalable and efficient object detection. In Proceedings of the IEEE/cvf Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 10781–10790. [Google Scholar]
  95. Wang, C.Y.; Bochkovskiy, A.; Liao, H.Y.M. YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors. arXiv 2022, arXiv:2207.02696. [Google Scholar]
  96. Zhang, Y.; Li, S.; Wang, H.; Liu, Y.; Zhang, J. Aircraft Target Detection in Remote Sensing Images Based on Improved YOLOv7-Tiny Network. IEEE Geosci. Remote Sens. Lett. 2024, 21, 48904–48922. [Google Scholar] [CrossRef]
  97. Chen, Y.; Liu, J.; Zhang, Y.; Li, W.; Wang, H. A Remote Sensing Target Detection Model Based on Lightweight Feature Enhancement and Feature Refinement Extraction. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2024, 17, 5265–5279. [Google Scholar] [CrossRef]
  98. Luo, Y.; Xiong, G.; Li, X.; Wang, Z.; Chen, J. An Improved YOLOv8 Detector for Multi-Scale Target Detection in Remote Sensing Images. IEEE Trans. Geosci. Remote Sens. 2024, 62, 114123–114136. [Google Scholar] [CrossRef]
  99. Li, M.; Zhang, W.; Wang, Q.; Zhao, Y. YOLO-RS: Remote Sensing Enhanced Crop Detection Methods. arXiv 2025, arXiv:2504.11165. [Google Scholar] [CrossRef]
  100. Wang, C.; Li, J.; Zhang, H.; Liu, X. YOLOX-DW: A Fine-Grained Object Detection Algorithm for Remote Sensing Images. Remote Sens. 2024. Available online: https://www.researchsquare.com/article/rs-5122331/v1 (accessed on 19 November 2025).
  101. Xi, Y.; Jia, W.; Miao, Q.; Feng, J.; Ren, J.; Luo, H. Detection-Driven Exposure-Correction Network for Nighttime Drone-View Object Detection. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5605014. [Google Scholar] [CrossRef]
  102. Yang, J.; Fu, K.; Wu, Y.; Diao, W.; Dai, W.; Sun, X. Mutual-Feed Learning for Super-Resolution and Object Detection in Degraded Aerial Imagery. IEEE Trans. Geosci. Remote Sens. 2022, 60, 5628016. [Google Scholar] [CrossRef]
  103. Wu, J.; Zhao, F.; Yao, G.; Jin, Z. FGA-YOLO: A one-stage and high-precision detector designed for fine-grained aircraft recognition. Neurocomputing 2025, 618, 129067. [Google Scholar] [CrossRef]
  104. Zhao, S.; Chen, H.; Zhang, D.; Tao, Y.; Feng, X.; Zhang, D. SR-YOLO: Spatial-to-Depth Enhanced Multi-Scale Attention Network for Small Target Detection in UAV Aerial Imagery. Remote Sens. 2025, 17, 2441. [Google Scholar] [CrossRef]
  105. Wu, F.; Hu, T.; Xia, Y.; Ma, B.; Sarwar, S.; Zhang, C. WDFA-YOLOX: A Wavelet-Driven and Feature-Enhanced Attention YOLOX Network for Ship Detection in SAR Images. Remote Sens. 2024, 16, 1760. [Google Scholar] [CrossRef]
  106. Song, Y.; Wang, S.; Li, Q.; Mu, H.; Feng, R.; Tian, T.; Tian, J. Vehicle Target Detection Method for Wide-Area SAR Images Based on Coarse-Grained Judgment and Fine-Grained Detection. Remote Sens. 2023, 15, 3242. [Google Scholar] [CrossRef]
  107. Zhang, J.; Zhang, Y.; Shi, Z.; Zhang, Y.; Gao, R. Unmanned Aerial Vehicle Object Detection Based on Information-Preserving and Fine-Grained Feature Aggregation. Remote Sens. 2024, 16, 2590. [Google Scholar] [CrossRef]
  108. Xi, Y.; Jia, W.; Miao, Q.; Liu, X.; Fan, X.; Li, H. FiFoNet: Fine-Grained Target Focusing Network for Object Detection in UAV Images. Remote Sens. 2022, 14, 3919. [Google Scholar] [CrossRef]
  109. Ma, S.; Wang, W.; Pan, Z.; Hu, Y.; Zhou, G.; Wang, Q. A Recognition Model Incorporating Geometric Relationships of Ship Components. Remote Sens. 2024, 16, 130. [Google Scholar] [CrossRef]
  110. Jiang, X.N.; Niu, X.Q.; Wu, F.L.; Fu, Y.; Bao, H.; Fan, Y.C.; Zhang, Y.; Pei, J.Y. A Fine-Grained Aircraft Target Recognition Algorithm for Remote Sensing Images Based on YOLOV8. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2025, 18, 4060–4073. [Google Scholar] [CrossRef]
  111. Huang, Q.; Yao, R.; Lu, X.; Zhu, J.; Xiong, S.; Chen, Y. Oriented Object Detector With Gaussian Distribution Cost Label Assignment and Task-Decoupled Head. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5621916. [Google Scholar] [CrossRef]
  112. Liu, S.; Yang, Z.; Li, Q.; Wang, Q. InterMamba: A Visual-Prompted Interactive Framework for Dense Object Detection and Annotation. IEEE Trans. Geosci. Remote Sens. 2025, 63, 5619811. [Google Scholar] [CrossRef]
  113. Su, Y.; Zhang, T.; Li, F. SA-YOLO: Self-Adaptive Loss Function for Imbalanced Sample Detection. J. Electron. Inf. Technol. 2024, 46, 123–134. [Google Scholar]
  114. Yang, B.; Han, J.; Hou, X.; Zhou, D.; Liu, W.; Bi, F. FSDA-DETR: Few-Shot Domain-Adaptive Object Detection Transformer in Remote Sensing Imagery. IEEE Trans. Geosci. Remote Sens. 2025, 63, 4412016. [Google Scholar] [CrossRef]
  115. Wang, B.; Sui, H.; Ma, G.; Zhou, Y.; Zhou, M. GMODet: A Real-Time Detector for Ground-Moving Objects in Optical Remote Sensing Images With Regional Awareness and Semantic–Spatial Progressive Interaction. IEEE Trans. Geosci. Remote Sens. 2025, 63, 5605623. [Google Scholar] [CrossRef]
  116. Xu, X.; Chen, Z.; Zhang, X.; Wang, G. Context-Aware Content Interaction: Grasp Subtle Clues for Fine-Grained Aircraft Detection. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5641319. [Google Scholar] [CrossRef]
  117. Sumbul, G.; Cinbis, R.G.; Aksoy, S. Fine-Grained Object Recognition and Zero-Shot Learning in Remote Sensing Imagery. IEEE Trans. Geosci. Remote Sens. 2018, 56, 770–779. [Google Scholar] [CrossRef]
  118. Zhang, J.; Zhong, Z.; Wei, X.; Wu, X.; Li, Y. Remote Sensing Image Harmonization Method for Fine-Grained Ship Classification. Remote Sens. 2024, 16, 2192. [Google Scholar] [CrossRef]
  119. Yi, Y.; You, Y.; Li, C.; Zhou, W. EFM-Net: An Essential Feature Mining Network for Target Fine-Grained Classification in Optical Remote Sensing Images. IEEE Trans. Geosci. Remote Sens. 2023, 61, 5606416. [Google Scholar] [CrossRef]
  120. Zhao, W.; Tong, T.; Yao, L.; Liu, Y.; Xu, C.; He, Y.; Lu, H. Feature Balance for Fine-Grained Object Classification in Aerial Images. IEEE Trans. Geosci. Remote Sens. 2022, 60, 5620413. [Google Scholar] [CrossRef]
  121. Chen, D.; Tu, W.; Cao, R.; Zhang, Y.; He, B.; Wang, C.; Shi, T.; Li, Q. A hierarchical approach for fine-grained urban villages recognition fusing remote and social sensing data. Int. J. Appl. Earth Obs. Geoinf. 2022, 106, 102661. [Google Scholar] [CrossRef]
  122. Wu, H.; Nie, J.; He, Z.; Zhu, Z.; Gao, M. One-Shot Multiple Object Tracking in UAV Videos Using Task-Specific Fine-Grained Features. Remote Sens. 2022, 14, 3853. [Google Scholar] [CrossRef]
  123. Jiang, C.; Ren, H.; Li, F.; Hong, Z.; Huo, H.; Zhang, J.; Xin, J. Object detection from aerial multi-angle thermal infrared remote sensing images: Dataset and method. ISPRS J. Photogramm. Remote Sens. 2025, 228, 438–452. [Google Scholar] [CrossRef]
  124. Luo, R.; He, Q.; Zhao, L.; Zhang, S.; Kuang, G.; Ji, K. Geospatial Contextual Prior-Enabled Knowledge Reasoning Framework for Fine-Grained Aircraft Detection in Panoramic SAR Imagery. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5226213. [Google Scholar] [CrossRef]
  125. Chen, Y.; Huang, J.; Sun, Z.; Xiong, S.; Lu, X. Thread the Needle: Cues-Driven Multiassociation for Remote Sensing Cross-Modal Retrieval. IEEE Trans. Geosci. Remote Sens. 2024, 62, 4709813. [Google Scholar] [CrossRef]
  126. Zhao, Q.; Lyu, S.; Li, Y.; Ma, Y.; Chen, L. MGML: Multigranularity Multilevel Feature Ensemble Network for Remote Sensing Scene Classification. IEEE Trans. Neural Netw. Learn. Syst. 2023, 34, 2308–2322. [Google Scholar] [CrossRef]
  127. Guo, W.; Li, S.; Yang, J.; Zhou, Z.; Liu, Y.; Lu, J.; Kou, L.; Zhao, M. Remote Sensing Image Scene Classification by Multiple Granularity Semantic Learning. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2022, 15, 2546–2562. [Google Scholar] [CrossRef]
  128. Wang, S.; Guan, Y.; Shao, L. Multi-Granularity Canonical Appearance Pooling for Remote Sensing Scene Classification. IEEE Trans. Image Process. 2020, 29, 5396–5407. [Google Scholar] [CrossRef]
  129. Bai, L.; Liu, Q.; Li, C.; Ye, Z.; Hui, M.; Jia, X. Remote Sensing Image Scene Classification Using Multiscale Feature Fusion Covariance Network With Octave Convolution. IEEE Trans. Image Process. 2022, 60, 5396–5407. [Google Scholar] [CrossRef]
  130. Miao, W.; Geng, J.; Jiang, W. Multigranularity Decoupling Network With Pseudolabel Selection for Remote Sensing Image Scene Classification. IEEE Trans. Geosci. Remote Sens. 2023, 61, 5603813. [Google Scholar] [CrossRef]
  131. Ye, Z.; Zhang, Y.; Zhang, J.; Li, W.; Bai, L. A Multiscale Incremental Learning Network for Remote Sensing Scene Classification. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5606015. [Google Scholar] [CrossRef]
  132. Niu, B.; Pan, Z.; Chen, K.; Hu, Y.; Lei, B. Open Set Domain Adaptation via Instance Affinity Metric and Fine-Grained Alignment for Remote Sensing Scene Classification. IEEE Geosci. Remote Sens. Lett. 2023, 20, 6005805. [Google Scholar] [CrossRef]
  133. Li, Y.; Li, Z.; Su, A.; Wang, K.; Wang, Z.; Yu, Q. Semisupervised Cross-Domain Remote Sensing Scene Classification via Category-Level Feature Alignment Network. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5621614. [Google Scholar] [CrossRef]
  134. Wang, Y.; Shu, Z.; Feng, Y.; Liu, R.; Cao, Q.; Li, D.; Wang, L. Enhancing Cross-Domain Remote Sensing Scene Classification by Multi-Source Subdomain Distribution Alignment Network. Remote Sens. 2025, 17, 1302. [Google Scholar] [CrossRef]
  135. Zhu, P.; Zhang, X.; Han, X.; Cheng, X.; Gu, J.; Chen, P.; Jiao, L. Cross-Domain Classification Based on Frequency Component Adaptation for Remote Sensing Images. Remote Sens. 2024, 16, 2134. [Google Scholar] [CrossRef]
  136. Xiao, R.; Wang, Y.; Tao, C. Fine-Grained Road Scene Understanding from Aerial Images Based on Semisupervised Semantic Segmentation Networks. IEEE Geosci. Remote Sens. Lett. 2022, 19, 3001705. [Google Scholar] [CrossRef]
  137. Xu, K.; Deng, P.; Huang, H. Vision Transformer: An Excellent Teacher for Guiding Small Networks in Remote Sensing Image Scene Classification. IEEE Trans. Geosci. Remote Sens. 2022, 60, 5618715. [Google Scholar] [CrossRef]
  138. Shi, C.; Ding, M.; Wang, L.; Pan, H. Learn by Yourself: A Feature-Augmented Self-Distillation Convolutional Neural Network for Remote Sensing Scene Image Classification. Remote Sens. 2023, 15, 5620. [Google Scholar] [CrossRef]
  139. Wang, G.; Chen, H.; Chen, L.; Zhuang, Y.; Zhang, S.; Zhang, T.; Dong, H.; Gao, P. P2FEViT: Plug-and-Play CNN Feature Embedded Hybrid Vision Transformer for Remote Sensing Image Classification. Remote Sens. 2023, 15, 1773. [Google Scholar] [CrossRef]
  140. Solomon, A.A.; Agnes, S.A. MSCAC: A Multi-Scale Swin–CNN Framework for Progressive Remote Sensing Scene Classification. Remote Sens. 2024, 4, 462–480. [Google Scholar] [CrossRef]
  141. Li, Z.; Xu, W.; Yang, S.; Wang, J.; Su, H.; Huang, Z.; Wu, S. A Hierarchical Graph-Enhanced Transformer Network for Remote Sensing Scene Classification. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2024, 17, 20315–20330. [Google Scholar] [CrossRef]
  142. Cheng, G.; Xie, X.; Han, J.; Guo, L.; Xia, G.S. Remote Sensing Image Scene Classification Meets Deep Learning: Challenges, Methods, Benchmarks, and Opportunities. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2020, 13, 3735–3756. [Google Scholar] [CrossRef]
  143. Thapa, A.; Horanont, T.; Neupane, B.; Aryal, J. Deep Learning for Remote Sensing Image Scene Classification: A Review and Meta-Analysis. Remote Sens. 2023, 15, 4804. [Google Scholar] [CrossRef]
  144. Yu, D.; Xu, Q.; Guo, H.; Zhao, C.; Lin, Y.; Li, D. An Efficient and Lightweight Convolutional Neural Network for Remote Sensing Image Scene Classification. Sensors 2020, 20, 1999. [Google Scholar] [CrossRef]
  145. Xiong, W.; Xiong, Z.; Cui, Y. A Confounder-Free Fusion Network for Aerial Image Scene Feature Representation. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2022, 15, 5440–5454. [Google Scholar] [CrossRef]
  146. Yuan, Z.; Zhang, W.; Tian, C.; Rong, X.; Zhang, Z.; Wang, H.; Fu, K.; Sun, X. Remote Sensing Cross-Modal Text-Image Retrieval Based on Global and Local Information. IEEE Trans. Geosci. Remote Sens. 2022, 60, 5620616. [Google Scholar] [CrossRef]
  147. Hu, G.; Wen, Z.; Lv, Y.; Zhang, J.; Wu, Q. Global–Local Information Soft-Alignment for Cross-Modal Remote-Sensing Image–Text Retrieval. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5623915. [Google Scholar] [CrossRef]
  148. Zheng, F.; Wang, X.; Wang, L.; Zhang, X.; Zhu, H.; Wang, L.; Zhang, H. A Fine-Grained Semantic Alignment Method Specific to Aggregate Multi-Scale Information for Cross-Modal Remote Sensing Image Retrieval. Sensors 2023, 23, 8437. [Google Scholar] [CrossRef]
  149. Chen, Y.; Huang, J.; Li, X.; Xiong, S.; Lu, X. Multiscale Salient Alignment Learning for Remote-Sensing Image–Text Retrieval. IEEE Trans. Geosci. Remote Sens. 2024, 62, 4700413. [Google Scholar] [CrossRef]
  150. Yuan, Z.; Zhang, W.; Fu, K.; Li, X.; Deng, C.; Wang, H.; Sun, X. Exploring a Fine-Grained Multiscale Method for Cross-Modal Remote Sensing Image Retrieval. IEEE Trans. Geosci. Remote Sens. 2022, 60, 4404119. [Google Scholar] [CrossRef]
  151. Cheng, Q.; Zhou, Y.; Huang, H.; Wang, Z. Multi-Attention Fusion and Fine-Grained Alignment for Bidirectional Image-Sentence Retrieval in Remote Sensing. IEEE/CAA J. Autom. Sin. 2022, 9, 1532–1535. [Google Scholar] [CrossRef]
  152. Yang, L.; Feng, Y.; Zhou, M.; Xiong, X.; Wang, Y.; Qiang, B. A Jointly Guided Deep Network for Fine-Grained Cross-Modal Remote Sensing Text–Image Retrieval. J. Circuits, Syst. Comput. 2023, 32, 2350221. [Google Scholar] [CrossRef]
  153. Cheng, Q.; Zhou, Y.; Fu, P.; Xu, Y.; Zhang, L. A Deep Semantic Alignment Network for the Cross-Modal Image-Text Retrieval in Remote Sensing. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2021, 14, 4284–4297. [Google Scholar] [CrossRef]
  154. Xiu, D.; Ji, L.; Geng, X.; Wu, Y. RSITR-FFT: Efficient Fine-Grained Fine-Tuning Framework With Consistency Regularization for Remote Sensing Image-Text Retrieval. IEEE Geosci. Remote Sens. Lett. 2024, 21, 6016805. [Google Scholar] [CrossRef]
  155. Zhou, Z.; Feng, Y.; Qiu, A.; Duan, G.; Zhou, M. Fine-Grained Information Supplementation and Value-Guided Learning for Remote Sensing Image-Text Retrieval. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2024, 17, 19194–19210. [Google Scholar] [CrossRef]
  156. Sun, T.; Zheng, C.; Li, X.; Gao, Y.; Nie, J.; Huang, L.; Wei, Z. Strong and Weak Prompt Engineering for Remote Sensing Image-Text Cross-Modal Retrieval. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2025, 18, 6968–6980. [Google Scholar] [CrossRef]
  157. Ning, H.; Wang, S.; Lei, T.; Cao, X.; Dou, H.; Zhao, B.; Nandi, A.K.; Radeva, P. Representation discrepancy bridging method for remote sensing image-text retrieval. Neurocomputing 2025, 650, 130915. [Google Scholar] [CrossRef]
  158. Chen, Y.; Huang, J.; Xiong, S.; Lu, X. Integrating Multisubspace Joint Learning with Multilevel Guidance for Cross-Modal Retrieval of Remote Sensing Images. IEEE Trans. Geosci. Remote Sens. 2024, 62, 4702217. [Google Scholar] [CrossRef]
  159. Mao, Y.Q.; Jiang, Z.; Liu, Y.; Zhang, Y.; Qi, K.; Bi, H.; He, Y. FRORS: An Effective Fine-Grained Retrieval Framework for Optical Remote Sensing Images. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2025, 18, 7406–7419. [Google Scholar] [CrossRef]
  160. Huang, J.; Feng, Y.; Zhou, M.; Xiong, X.; Wang, Y.; Qiang, B. Deep Multiscale Fine-Grained Hashing for Remote Sensing Cross-Modal Retrieval. IEEE Geosci. Remote Sens. Lett. 2024, 21, 6002205. [Google Scholar] [CrossRef]
  161. Yu, H.; Yao, F.; Lu, W.; Liu, N.; Li, P.; You, H.; Sun, X. Text-Image Matching for Cross-Modal Remote Sensing Image Retrieval via Graph Neural Network. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2023, 16, 812–824. [Google Scholar] [CrossRef]
  162. Pan, J.; Ma, Q.; Bai, C. Reducing Semantic Confusion: Scene-aware Aggregation Network for Remote Sensing Cross-modal Retrieval. In Proceedings of the 2023 ACM International Conference on Multimedia Retrieval. Association for Computing Machinery, Thessaloniki, Greece, 12–15 June 2023; ICMR ’23. pp. 398–406. [Google Scholar] [CrossRef]
  163. Yang, B.; Wang, C.; Ma, X.; Song, B.; Liu, Z.; Sun, F. Zero-Shot Sketch-Based Remote-Sensing Image Retrieval Based on Multi-Level and Attention-Guided Tokenization. Remote Sens. 2024, 16, 1653. [Google Scholar] [CrossRef]
  164. Liu, Y.; Dang, Y.; Qi, H.; Han, J.; Shao, L. Zero-shot sketch-based remote sensing image retrieval based on cross-modal fusion. Neural Netw. 2025, 191, 107796. [Google Scholar] [CrossRef]
Figure 2. Examples of fine-grained interpretation at different levels [38]. The cited work proposed a new “panoptic perception” task, which reflects a similar conception of image interpretation at different levels; its instance level corresponds to the object level in this paper.
Figure 3. Examples of fine-grained pixel-level classification of remote sensing images [2,10].
Figure 4. Comparison between ordinary remote sensing target detection and fine-grained remote sensing target detection [67].
Figure 5. Prototypical contrastive learning [85].
Figure 6. FSDA-DETR [114]. Domain shift is observed between the source and target domains when the given target-domain data is scarce.
Figure 7. Paradigms and example of scene-level recognition. (a) Bottom-up vs. top-down. (b) Example of scene-level recognition (bottom-up) [125].
Figure 8. Illustration of intra-class variation and inter-class similarity in scene-level recognition in MEET [22].
Figure 9. In [141], a hierarchical graph-enhanced Transformer network based on GNN is proposed, which improves the accuracy of remote sensing scene classification through dual attention mechanisms, multi-stage feature extraction, and graph structure modeling.
Figure 10. Distribution of published articles across several core journals from 2015 to 2025: IEEE Trans. Geosci. Remote Sens. (455), Remote Sens. MDPI (246), IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. (235), IEEE Geosci. Remote Sens. Lett. (113), J. Appl. Remote Sens. (103), ISPRS J. Photogramm. Remote Sens. (49), Int. J. Appl. Earth Obs. Geoinf. (65), Remote Sens. Environ. (30), J. Remote Sens. (23).
Figure 11. Number of published articles annually in selected journals.
Figure 12. Keyword co-occurrence network.
Table 1. Summary of datasets for fine-grained remote sensing image interpretation. “M” indicates whether the dataset is multimodal; “T” indicates whether it is temporal.
| Dataset Name | Resolution (m) | Content | Categories | Total Images | Source | M | T |
| --- | --- | --- | --- | --- | --- | --- | --- |
| FGSCR-42 [14] | 0.1–4.5 | Ship | 42 | 9320 | GoogleEarth, ISPRS, GaoFen, etc. | × | × |
| FGSD [23] | 0.3–2 | Ship | 43 | 4736 | GoogleEarth | × | × |
| ShipRSImageNet [15] | 0.12–6 | Ship | 50 | 3435 | WorldView-3, GaoFen-2, etc. | × | × |
| MFBFS [16] | 1–4 | Building | 3 | 11,005 | GaoFen-2 | × | × |
| UBC [24] | 0.5–0.8 | Building | 6 | 1800 | SuperView, GaoFen-2 | × | × |
| UBC-v2 [25] | 0.5–1 | Building | 12 | 11,336 | SuperView, GaoFen-2, GaoFen-3 | ✓ | × |
| DFC2023 [26] | 0.5–1 | Building | 12 | 1773 | SuperView, GaoFen-2, GaoFen-3 | ✓ | × |
| MTARSI [27] | 0.3–2 | Aircraft | 20 | 9598 | GoogleEarth | × | × |
| MAR20 [28] | 0.3–2 | Aircraft | 20 | 3842 | GoogleEarth | × | × |
| FAIR1M [29] | 0.3–0.8 | Airplanes, Ships, Vehicles, Courts, Road | 37 | 15,000 | GaoFen, GoogleEarth | × | × |
| OpenEarthSensing [30] | 0.3–10 | Objects and Scenes | 189 | 157,674 | Different Public Datasets | ✓ | × |
| TREE [10] | 0.68 | Tree Species | 12 | 1 | LiCHy Hyperspectral system | × | × |
| Belgium Data [11] | 0.68 | Tree Species | 7 | 1450 | LiCHy Hyperspectral system | × | × |
| FUSU [12] | 0.2–0.5 | Land Use Change | 17 | 62,752 | Google Earth, Sentinel | × | × |
| MEET [22] | 2025 | Scene | 80 | 1,033,778 | OpenStreetMap | × | × |
| NWPU [18] | 0.2–30 | Scene | 45 | 31,500 | GoogleEarth | × | × |
| AID [17] | 0.5–8 | Scene | 30 | 10,000 | GoogleEarth | × | × |
| AID++ [31] | 0.5–8 | Scene | 46 | 400,000 | GoogleEarth, OpenStreetMap | × | × |
| RSD46-WHU [32] | 0.5–2 | Scene | 46 | 117,000 | GoogleEarth | × | × |
| MLRSNet [20] | 0.1–10 | Scene | 46 | 109,161 | GoogleEarth | × | × |
| Million-AID [21] | 0.5–153 | Scene | 51 | 10,000 | GoogleEarth | × | × |
| PatternNet [19] | 0.06–4.7 | Scene | 38 | 30,400 | Different Public Datasets | × | × |
| OPTIMAL-31 [33] | - | Scene | 31 | 1860 | GoogleEarth, Bing maps | × | × |
| RSI-CB256 [34] | 0.3–3 | Scene | 35 | 24,000 | GoogleEarth, Bing maps | × | × |
| RSI-CB128 [34] | 0.3–3 | Scene | 45 | 36,000 | GoogleEarth, Bing maps | × | × |
| SR-RSKG [35] | 0.2–30 | Scene | 70 | 56,000 | GoogleEarth, etc. | × | × |
| Multiscene [36] | 0.3–0.6 | Scene | 36 | 100,000 | GoogleEarth, OpenStreetMap | × | × |
| WH-MAVS [37] | 1.2 | Scene | 14 | 47,137 | GoogleEarth | × | × |
Table 2. Summary of the improvements of fine-grained object detection on different components.
| Method | Purpose / Key Improvements | Reference |
| --- | --- | --- |
| EIRNet | Bidirectional feature fusion via DFF-Net; Optimize proposals with Mask-RPN (reuse attention mask); Mine interclass relations for ship fine-grained Cls | [73] |
| MRAN | Fuse RGB/multispectral/LiDAR features; Proposal generation via attention scores; Optimize RoI sampling for small trees; Multisource feature-driven Cls/Reg; Refine tree canopy segmentation | [74] |
| PCLDet | Prototype learning for fine-grained features; Class-balanced sampler (CBS) for long-tail data; ProtoCL loss for Bbox Cls/Reg; Prototype constraint for Mask segmentation | [67] |
| SAM (Shape-Aware Model) | Shape-aware Conv for large-aspect-ratio ships; Dynamic anchor adjustment; RoI sampling optimization for deformed ship parts; Shape loss for Bbox fitting; Shape-constrained Mask for ship–background distinction | [75] |
| HCP-Mask-RCNN | Frequency-aware (FARS) module for detail features; Fine-grained proposal prioritization; RoI alignment with frequency features; Coarse-fine hierarchy (HCP) for Cls; Frequency-guided Mask for fine structures | [76] |
| Oriented R-CNN | Oriented feature enhancement via FPN; Oriented proposal generation; Geospatial object localization/Cls; Serves as teacher network for knowledge distillation | [77] |
| ISCL-Mask-RCNN | Contrastive learning (CLM) to widen interclass distance; Refined instance switching (ReIS) for class imbalance; Improve airplane fine-grained detection (HBB/OBB) | [78] |
| PETDet | Anchor-free QOPN for high-quality proposals; Bilinear channel fusion (BCFN) for RoI features; Adaptive recognition loss (ARL) for Cls/Reg; Focus on fine-grained target distinction | [79] |
| SFRNet | SC-Former for spatial–channel interaction; OR-Former for rotation-sensitive features; Multi-RoI loss (MRL) for Cls; Separate feature refinement for Cls/segmentation | [80] |
| MGANet | Local–global alignment (LAM) for ship features; Multigranularity self-attention (MSM) for fusion; RoIAlign optimization for local ship parts; Improve dense ship fine-grained Cls/Reg | [81] |
| FineShipNet | Blend synchronization module for feature reuse; Polarized feature focusing for task decoupling; Adaptive harmony anchor labeling; RoIAlign for ship discriminative features (Cls/Reg) | [82] |
| HMS-Net | Multiscale region feature re-extraction; Top-down feature fusion with guidance; Hierarchical loss for interclass relations (Cls); RoIAlign for ship fine-grained features | [83] |
| GFA-Net | Graph focusing process (GFP) for structural features; Graph aggregation network (GAN) for node weights; RoIAlign for invariant structure features (Cls/Reg); Mask segmentation for object structure preservation | [84] |
| DIMA | FARS module for frequency-domain features; Hierarchical classification (HCP) for Cls; RoI alignment with frequency details; Mask refinement for fine target structures | [76] |
Note: Abbreviations: Bbox Cls/Reg = Bounding Box Classification and Regression; Cls = Classification; Reg = Regression; FPN = Feature Pyramid Network; RPN = Region Proposal Network; RoIAlign = Region of Interest Align; MRAN = Multisource Region Attention Network; SAM = Shape-Aware Model; QOPN = Quality-Oriented Proposal Network; BCFN = Bilinear Channel Fusion Network; FARS = Frequency-Aware Representation Supplement.
Table 4. Common methods used across different interpretation levels in fine-grained interpretation.
| Methods | Pixel-Level Classification | Object-Level Detection | Scene-Level Recognition |
| --- | --- | --- | --- |
| Utilizing Multi-Source Data | [62,63,64] | [74,89] | [133,134,135] |
| Modeling Relationships Between Classes | [56,57,58,59,60,61] | [75,82,83,84] | × |
| Knowledge Distillation | × | [77,87,88] | [137,138] |
| Enhanced Attention Mechanism | [45,46] | [105,106,109] | [131,141,151,157] |
| Few-Shot or Zero-Shot Learning | × | [88] | [35] |
| Prototypical Contrastive Learning | × | [67,78,85,86,87,159] | [133,159] |
