Article

Adaptive Global Dense Nested Reasoning Network into Small Target Detection in Large-Scale Hyperspectral Remote Sensing Image

Institute of Intelligent Computing, University of Electronic Science and Technology of China, Chengdu 611731, China
* Author to whom correspondence should be addressed.
Remote Sens. 2025, 17(6), 948; https://doi.org/10.3390/rs17060948
Submission received: 3 February 2025 / Revised: 4 March 2025 / Accepted: 5 March 2025 / Published: 7 March 2025

Abstract

Small and dim target detection is a critical challenge in hyperspectral remote sensing, particularly in complex, large-scale scenes where spectral variability across diverse land cover types complicates the detection process. In this paper, we propose a novel target reasoning algorithm named Adaptive Global Dense Nested Reasoning Network (AGDNR). This algorithm integrates spatial, spectral, and domain knowledge to enhance the detection accuracy of small and dim targets in large-scale environments and simultaneously enables reasoning about target categories. The proposed method involves three key innovations. Firstly, we develop a high-dimensional, multi-layer nested U-Net that facilitates cross-layer feature transfer, preserving high-level features of small and dim targets throughout the network. Secondly, we present a novel approach for computing physicochemical parameters, which enhances the spectral characteristics of targets while minimizing environmental interference. Thirdly, we construct a geographic knowledge graph that incorporates both target and environmental information, enabling global target reasoning and more effective detection of small targets across large-scale scenes. Experimental results on four challenging datasets show that our method outperforms state-of-the-art approaches in detection accuracy and achieves successful classification of different small targets. Consequently, the proposed method offers a robust solution for the precise detection of hyperspectral small targets in large-scale scenarios.

1. Introduction

With the rapid advancement of remote sensing technology, hyperspectral imaging (HSI) has become an indispensable tool for capturing spectral information from surface materials. By acquiring hundreds of contiguous spectral bands, HSI provides detailed spectral data, enabling precise identification and analysis of surface materials. This capability has made HSI a promising technology with vast potential applications in fields such as military reconnaissance, environmental monitoring, agricultural management, and resource exploration. Hyperspectral sensors are particularly adept at detecting subtle spectral variations, which are often closely linked to the chemical composition and physical structure of specific substances. As a result, HSI plays a critical role in target detection and classification tasks.
Hyperspectral target detection is typically defined as the task of identifying and locating targets with unique spectral characteristics within hyperspectral remote sensing images. Unlike optical imagery, hyperspectral sensors typically have lower spatial resolution to balance energy across spatial and spectral dimensions, despite offering rich spectral information. As a result, small targets, such as cars and airplanes, occupy only a few pixels, making accurate detection in HSI particularly challenging.
Since the 1990s, researchers have explored various information processing algorithms for small target detection in hyperspectral images (HSIs). Traditional target detection algorithms typically extract target spectra from image pixels and use spectral matching methods for detection and classification. Algorithms that directly compare target spectra with pixel spectra, such as those based on Euclidean distance and spectral angle mapping (SAM) [1], are computationally efficient and fast. However, they rely solely on spectral information, leading to low detection accuracy and poor performance. Additionally, spectral matched filters (SMFs) [2], constrained energy minimization (CEM) [3], and orthogonal subspace projection (OSP) [4] are classical algorithms in HSI small target detection, and numerous enhancements to these methods have been proposed. For example, the CEM algorithm distinguishes between target and background by designing a filter that minimizes the output energy under specific constraints, but it struggles with complex backgrounds and noise; several variants of CEM have since been developed, such as those in [5,6,7,8]. The orthogonal subspace projection algorithm, introduced by Chang et al. [4], has inspired derivatives like Signature Space Orthogonal Projection (SSP) [9], Target Signature Space Orthogonal Projection (TSSP) [10], and oblique subspace projection. However, projection-based algorithms tend to perform poorly when targets and backgrounds are linearly inseparable.
Subsequently, various machine learning techniques have been applied to hyperspectral target detection. Kernel methods [7,11,12,13] map data into a high-dimensional space, effectively addressing the issue of linear inseparability. When combined with classical techniques, these kernel methods improve detection performance, as demonstrated by approaches such as kernel-based constrained energy minimization [14], the kernel spectral matched filter (KSMF) [15], and the kernel target-constrained interference minimization filter (KTCIMF) [7]. In addition, tree-structured models [16,17] organize pixels into a binary tree, allowing target detection by assessing the distance between the target pixel and the other pixels in the tree.
Sparse representation-based detectors (SRDs) have become a significant focus in HSI target detection. Among traditional methods, the sparse target detector (STD) [18] is particularly well known. This approach involves learning and constructing a dictionary containing both target and background information, which allows for the separation of target and background pixels through coefficient computation and residual detection. To improve performance, detectors that combine sparse and collaborative representations (CSCRs) [19,20] introduce a joint representation that utilizes neighboring pixels to enhance background reconstruction. This method strengthens the modeling capacity for background features, making it especially effective for detecting small targets. Beyond dictionary-based methods, alternative approaches for target and background reconstruction have been explored. For example, Xu et al. [21] proposed a method for HSI reconstruction and anomaly detection using tensor robust principal component analysis (RPCA), which decomposes the hyperspectral image into low-rank and sparse components to separately reconstruct the background and targets. Similarly, an effective infrared small target detection method based on third-order tensor construction and Tucker decomposition (TCTD) was introduced in [22]. However, these reconstruction-based methods often rely heavily on prior knowledge to extract features supporting the reconstruction process, which may limit their broader applicability.
In recent years, convolutional neural networks (CNNs) have made significant breakthroughs in image processing, driving their application in HSI target detection. Du et al. [23] proposed a novel CNN-based target detector, CNNTD. Building upon this, Zhang et al. developed HTD-IRN [24], a specialized target detection framework for HSI. Expanding on these advancements, Zhu et al. introduced a two-stream convolutional neural target detector (TSCNTD) [25], which consists of two branches: the lower branch extracts sample features, while the upper branch extracts prior target features. The network then performs binary classification by comparing these features to determine whether a test pixel is a target. However, CNN-based HSI target detection methods often face challenges due to the limited availability of samples for HSI small targets.
To address the challenge of limited sample sizes in HSI small target detection, many researchers have turned to self-supervised learning techniques. Yao et al. [26] introduced a self-supervised learning method based on spectral hybrid features, allowing the feature extraction network to learn more discriminative representations with limited labeled data. Wang et al. [27] proposed a spectral contrastive learning-based hyperspectral target detection method (SCLHTD), which uses a trained adversarial convolutional autoencoder to extract feature representations from augmented samples. Sun et al. [28] developed a novel approach for extracting regions of interest (ROIs) using self-supervised learning. However, these methods often require training and testing on the same dataset, limiting their applicability across different scenes and resulting in significant computational overhead. To overcome these limitations, Chen et al. [29] proposed a generative self-supervised pre-training model with a spectral–spatial masking (S2M) strategy. This method employs a lightweight vision transformer (ViT) as the backbone network to learn generic feature representations without the need for labeled samples, which can then be transferred to various hyperspectral target detection tasks. Additionally, Shi et al. [30] introduced a semi-supervised adaptive fewer-samples learning (SSDA-FSL) method, which enables sensor-independent HSI target detection and significantly enhances the model’s generalization capabilities.
However, deep learning-based algorithms for weak small target detection often overlook two critical issues: the loss of deep features of the target and the failure to exploit the relationship between the target and surrounding elements for detection reasoning. To address these challenges, Li et al. [31] proposed DNANet, which achieves effective small target detection in deep networks by designing a dense nested interaction module (DNIM) and a cascaded channel and spatial attention module (CSAM). These modules enhance and retain target information in the deep network, leading to superior performance in infrared small target detection tasks. Incorporating prior knowledge into neural networks has also proven to significantly reduce human intervention while enhancing feature extraction capabilities. For example, Marino et al. [32] utilized graph convolutional networks (GCNs) to process knowledge graphs, leveraging high-level semantic information to achieve more accurate classification. Similarly, Chen et al. [33] employed knowledge graphs to encode semantic relationships among objects. By iteratively reasoning over these relationships, their model progressively refined its understanding of object interactions within images. Xu et al. [34] proposed a detection system that emulates human reasoning processes. By integrating extensive human commonsense knowledge, the system achieved enhanced detection performance on the COCO dataset. In the field of remote sensing, Qian et al. [35] introduced ARNet, which uses prior knowledge encoded in knowledge graphs to improve the accuracy of aircraft detection and fine-grained classification in remote sensing images. In the field of HSI target detection, researchers have also explored the integration of prior knowledge. For instance, Li et al. [36] developed a network architecture capable of processing both hyperspectral and multispectral data while incorporating external prior knowledge. This approach enhances the spatial resolution and spectral information of the resultant images, supporting a variety of remote sensing applications more effectively.
This paper focuses on detecting small and dim targets in hyperspectral images (HSIs), addressing three main challenges:
(1) Small Target Size: HSIs often have low spatial resolution, causing small targets like vehicles and airplanes to occupy only a few pixels or even sub-pixel regions. Pooling layers in convolutional neural networks (CNNs) can result in the loss of high-level feature information crucial for detecting these weak targets. (2) Target Recognition Difficulty: Accurate detection of small targets requires fine-grained classification. The small scale of targets often leads to their spectra blending with surrounding background signatures, making spectral variations within the same target class more complex. (3) Random Target Distribution in Remote Sensing Scenes: Targets are distributed unpredictably, ranging from dense clusters to scattered locations, and appear in varied environments like runways, water, or dense vegetation. The randomness and complexity of these distributions make small target detection particularly challenging.
This study presents a small and dim target detection algorithm that integrates spatial, spectral, and knowledge information from HSI. (1) Spatial dimension: the algorithm utilizes a multi-layer nested U-Net architecture combined with a spatial attention mechanism to capture deep spatial features of targets while minimizing feature loss. (2) Spectral dimension: to address spectral overlap between small targets and their environment, as well as noise interference, a method for inverting biochemical parameters is introduced, which enhances spectral differentiation and emphasizes the biochemical characteristics of the targets. (3) Knowledge dimension: semantic features of the targets, surface types, and surrounding elements are constructed and propagated through a knowledge graph; these semantic features are fused with the extracted HSI features to enrich the target representations. The main contributions of this paper are summarized as follows:
(1)
Dense Cross-Scale Feature Aggregation U-Net: This approach breaks through the limitations of traditional U-Net’s unidirectional feature transmission by designing a high-dimensional densely nested structure, which is a variant of U-Net++. Through a cross-layer multi-path feature interaction mechanism, it enables dynamic fusion of shallow details and deep semantics. An adaptive feature weighting strategy is introduced to significantly enhance the deep feature retention capability for weak small targets, addressing the core issue of progressive feature attenuation in conventional networks.
(2)
Spectral–Physical Correlation Enhancement Model: A spectral–physical coupling framework is proposed, integrating biochemical parameters. By modeling the band relationships with physical constraints and incorporating an adaptive parameter fusion mechanism, this framework constructs a nonlinear correlation model between spectral responses and target biochemical attributes. It enhances the spectral distinguishability of ground objects across different sensor scenarios, effectively overcoming the interference caused by sensor parameter differences.
(3)
Knowledge Graph-Driven Adaptive Reasoning Framework: A graph attention network is constructed that integrates a geospatial knowledge graph, encoding prior knowledge such as environmental topology and target occurrence patterns into graph nodes. A hierarchical attention mechanism is used to achieve probabilistic reasoning of target–background relationships. This framework overcomes the semantic fragmentation bottleneck of traditional purely data-driven methods, enabling adaptive suppression of background interference and collaborative enhancement of target semantics in complex scenarios.
The subsequent sections of this paper are structured as follows: Section 2 explores the technical details, Section 3 introduces the datasets and presents the experimental results, and Section 4 concludes with a summary of the paper.

2. Methods

2.1. Overview

This section introduces the Adaptive Global Dense Nested Reasoning Network (AGDNR), designed for small target detection in HSI. The model comprises three primary components: (1) a feature extraction module, which accurately extracts features from the HSI to ensure effective detection of small targets; (2) a surface feature extraction module, which first performs surface classification and then extracts features from the different surface types; and (3) a knowledge reasoning module, which integrates a knowledge graph into the end-to-end network, incorporating prior knowledge into the convolutional neural network and enhancing the model’s detection capability. The overall structure of the proposed model is shown in Figure 1. In Figure 1a, the feature extraction module extracts an initial feature map from the HSI and performs classification with 1 × 1 convolutions. In Figure 1b, the land cover feature extraction module classifies pixels into land surface types and extracts land type features. In Figure 1c, the classification weights from Figure 1a are stored in the semantic pool as target class semantics, the land type features from Figure 1b are stored as land cover semantics, and the reasoning module combines the pixel categories obtained from the convolutional classification in Figure 1a and the land surface classification in Figure 1b with a knowledge graph for reasoning. During reasoning, attention over the pixel classes, derived from the initial features, guides the model’s focus toward the primary pixel classes, ultimately producing pixel-level reasoning-enhanced features. These enhanced features are concatenated with the initial features for final detection and classification.
The feature extraction module is designed to effectively extract HSI features. The structure of a traditional U-Net is shown in Figure 2a, which consists of two main parts. The encoding part is composed of a series of convolutional and pooling layers, which capture global context information in deep-layer features. The decoding part is fully symmetric to the encoding part, consisting of up-sampling layers and convolutional layers. In the U-Net, the up-sampling layers increase the resolution of the output, while skip connections pass features from the pooling layers to the outputs of the up-sampling layers, allowing subsequent layers to recover more accurate features. However, in HSI small target detection, targets often range in size from a single pixel to a dozen pixels. In a traditional U-Net, small target features are likely to vanish during the repeated pooling operations, leaving the deep network unable to retain these critical features. To address this issue, this paper adopts a dense nested convolutional neural network, which connects shallow features to deep features and thereby preserves small target features within the deep layers.

2.2. Feature Extraction Module

2.2.1. The Dense Nested Network

As shown in Figure 2b, we construct a dense nested convolutional neural network by stacking multiple U-shaped fully convolutional networks. In HSI small target detection, different target sizes require different receptive fields. Therefore, different layers of the convolutional network may capture features of targets of various sizes. In our network, multiple nodes are designed in each layer, which can integrate the output features of the nodes in the same layer and the adjacent layers. This multi-layer fusion helps preserve small target features in the deeper layers, leading to enhanced small target detection performance. The structure of the feature extraction module with five layers is shown in Figure 2b.
As shown in Figure 2b, the nodes of the dense nested neural network are denoted as $L_{i,j}$, where $i$ indexes the down-sampling layer and $j$ indexes the horizontal dense block; that is, $L_{i,j}$ is the output of the node in the $(i+1)$-th row and $(j+1)$-th column. The output of $L_{i,j}$ can be represented as follows:
$$L_{i,j} = F\big(\big[L_{i,0},\, L_{i,1},\, \ldots,\, L_{i,j-1},\, D(L_{i-1,j}),\, U(L_{i+1,j-1})\big]\big)$$
where $F(\cdot)$ denotes the convolutional module followed by the spatial and channel attention modules of the node, $[\cdot\,,\cdot]$ denotes the concatenation layer, $U(\cdot)$ denotes the up-sampling layer with a scaling factor of 2, and $D(\cdot)$ denotes the down-sampling layer with a scaling factor of 2. As shown in Equation (1), the input of node $L_{i,j}$ consists of the skip connections from all previous nodes in the same layer, the down-sampled output of the node in the upper layer, and the up-sampled output of the previous node in the lower layer.
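To make Equation (1) concrete, the following PyTorch sketch (hypothetical class and layer sizes, not the authors' released code) shows how a node could aggregate the same-layer skip connections with the down-sampled upper-layer feature and the up-sampled lower-layer feature before convolution; the attention modules of Section 2.2.2 are omitted here for brevity.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DenseNestedNode(nn.Module):
    """Sketch of one node L_{i,j}: concatenates same-layer skips, a down-sampled
    upper-layer feature, and an up-sampled lower-layer feature, then convolves."""
    def __init__(self, in_channels, out_channels):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(in_channels, out_channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(out_channels),
            nn.ReLU(inplace=True),
        )

    def forward(self, same_layer_feats, upper_feat=None, lower_feat=None):
        feats = list(same_layer_feats)                 # [L_{i,0}, ..., L_{i,j-1}]
        if upper_feat is not None:                     # D(L_{i-1,j}), factor-2 down-sampling
            feats.append(F.max_pool2d(upper_feat, kernel_size=2))
        if lower_feat is not None:                     # U(L_{i+1,j-1}), factor-2 up-sampling
            feats.append(F.interpolate(lower_feat, scale_factor=2,
                                       mode="bilinear", align_corners=False))
        # apply the node's convolutional module to the concatenation, as in Equation (1)
        return self.conv(torch.cat(feats, dim=1))
```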

2.2.2. Channel and Spatial Attention Module

As shown in Figure 3, each node in the network applies multi-layer convolution to the input before passing it through the channel and spatial attention modules. These attention modules focus on the channels and pixels most relevant to HSI small target detection to improve detection accuracy. Given an input feature $L' \in \mathbb{R}^{C \times H \times W}$, the channel attention map is $M_c \in \mathbb{R}^{C \times 1 \times 1}$ and the spatial attention map is $M_s \in \mathbb{R}^{1 \times H \times W}$.
As shown in Figure 3, let the output of a node $L_{i,j}$ after multi-layer convolution be $L'$; the output of this node is then computed as follows:
$$L'' = M_c(L') \otimes L'$$
$$L''' = M_s(L'') \otimes L''$$
where $\otimes$ denotes element-wise multiplication. Before multiplying $M_c$ and $M_s$ with $L'$ and $L''$, their sizes are broadcast (stretched) to $C \times H \times W$.
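A minimal CBAM-style sketch of the channel and spatial attention described above is given below; the reduction ratio and the 7 × 7 spatial kernel are illustrative assumptions rather than the paper's exact configuration.

```python
import torch
import torch.nn as nn

class ChannelSpatialAttention(nn.Module):
    """L'' = M_c(L') * L';  L''' = M_s(L'') * L''  (broadcast multiplication)."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.channel_mlp = nn.Sequential(           # shared MLP for channel attention
            nn.Conv2d(channels, channels // reduction, 1, bias=False),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1, bias=False),
        )
        self.spatial_conv = nn.Conv2d(2, 1, kernel_size=7, padding=3, bias=False)

    def forward(self, x):
        # channel attention map M_c in R^{C x 1 x 1}
        avg = torch.mean(x, dim=(2, 3), keepdim=True)
        mx, _ = torch.max(x.flatten(2), dim=2)
        m_c = torch.sigmoid(self.channel_mlp(avg) + self.channel_mlp(mx[..., None, None]))
        x = m_c * x                                  # broadcast to C x H x W
        # spatial attention map M_s in R^{1 x H x W}
        avg_s = torch.mean(x, dim=1, keepdim=True)
        max_s, _ = torch.max(x, dim=1, keepdim=True)
        m_s = torch.sigmoid(self.spatial_conv(torch.cat([avg_s, max_s], dim=1)))
        return m_s * x
```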

2.2.3. Feature Output Module

Finally, the output of the last node in each layer is up-sampled to the same dimension as the output of the first layer and then concatenated to obtain the final global feature map. The width and height of the feature $L_{i,\,n-i}$ ($i = 0, 1, \ldots, n$) of each layer are $1/2^{i}$ of those of the first layer's feature $L_{0,n}$. Therefore, the output of the $i$-th layer is passed through an up-sampling layer with a scaling factor of $2^{i}$. The global feature map is computed as follows:
$$F = \big[L_{0,n},\, U^{1}(L_{1,n-1}),\, \ldots,\, U^{n-1}(L_{n-1,1}),\, U^{n}(L_{n,0})\big]$$
where $U^{i}$ denotes an up-sampling layer with a scaling factor of $2^{i}$ and $[\cdot\,,\cdot]$ denotes the concatenation layer.
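A short sketch of this global feature map construction, assuming bilinear up-sampling:

```python
import torch
import torch.nn.functional as F

def global_feature_map(layer_outputs):
    """layer_outputs[i] is the last node of layer i, L_{i, n-i}, whose spatial
    size is 1/2^i of layer 0; each is up-sampled to layer 0's size and concatenated."""
    target_hw = layer_outputs[0].shape[-2:]
    upsampled = [layer_outputs[0]] + [
        F.interpolate(feat, size=target_hw, mode="bilinear", align_corners=False)
        for feat in layer_outputs[1:]
    ]
    return torch.cat(upsampled, dim=1)   # F = [L_{0,n}, U^1(L_{1,n-1}), ..., U^n(L_{n,0})]
```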

2.2.4. Convolutional Classification Module

Finally, the output features are passed through two 1 × 1 convolutional layers to obtain the detection probability for each pixel. The probability is computed as follows:
$$P = \mathrm{softmax}\big(\mathrm{Conv}_{1\times 1}(\mathrm{Conv}_{1\times 1}(F))\big)$$
where $P \in \mathbb{R}^{N \times H \times W}$ denotes the probability that each pixel in the image belongs to each category, and $N$ is the number of categories to be detected (including the background). Thus, we obtain the initial detection probability $P$ of the targets.
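The classification head therefore reduces to two 1 × 1 convolutions followed by a per-pixel softmax; a sketch (the hidden width and the intermediate ReLU are assumptions):

```python
import torch.nn as nn

class PixelClassifier(nn.Module):
    """P = softmax(Conv1x1(Conv1x1(F))) over the class dimension."""
    def __init__(self, feat_channels, num_classes, hidden=64):
        super().__init__()
        self.head = nn.Sequential(
            nn.Conv2d(feat_channels, hidden, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(hidden, num_classes, kernel_size=1),
        )

    def forward(self, feat):
        return self.head(feat).softmax(dim=1)   # (B, N, H, W) per-pixel probabilities
```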

2.3. Surface Feature Extraction Module

The surface feature extraction module aims to classify the surface types and uses a convolutional neural network to extract the semantic features of different surfaces.

2.3.1. Land Surface Parameters

Commonly used land surface parameters include the Normalized Difference Vegetation Index (NDVI), Leaf Area Index (LAI), Ratio Vegetation Index (RVI), Difference Vegetation Index (DVI), and Temperature–Vegetation Dryness Index (TDVI). The following is a detailed introduction to these surface parameters:
(1)
Normalized Difference Vegetation Index (NDVI) [37]
The Normalized Difference Vegetation Index (NDVI) is a widely used parameter in remote sensing for assessing crop growth and nutritional status. It reflects vegetation coverage and the influence of background factors like soil, water, and built-up areas. The NDVI value is computed using the near-infrared (NIR) and red channels:
$$NDVI = \frac{\rho_{NIR} - \rho_{R}}{\rho_{NIR} + \rho_{R}}$$
where $\rho_{NIR}$ is the reflectance of the NIR channel and $\rho_{R}$ is the reflectance of the red channel. The range of the NDVI is $[-1, 1]$.
(2)
Leaf Area Index (LAI) [38]
The Leaf Area Index measures the total leaf area relative to land area. It is linked to the Soil-Adjusted Vegetation Index (SAVI) [39], which considers the influence of the soil reflectance on the NDVI, providing a more accurate evaluation of vegetation health [40]. The calculation formula of the SAVI value is as follows:
$$SAVI = (1 + L)\,\frac{\rho_{NIR} - \rho_{R}}{\rho_{NIR} + \rho_{R} + L}$$
where $L$ is an adjustment factor that reduces the influence of soil reflectance on the NDVI. Based on the SAVI, the LAI is computed as follows:
$$LAI = 0.0654 \times \exp(4.0901 \times SAVI)$$
where 0.0654 and 4.0901 are empirically determined coefficients.
(3)
Ratio Vegetation Index (RVI) [41]
The RVI, defined as the ratio of near-infrared to red reflectance, is critical for assessing vegetation coverage and growth. It is particularly sensitive to vegetation coverage above 50%, but its sensitivity decreases when vegetation coverage is below 50%. The RVI is computed as follows:
$$RVI = \frac{\rho_{NIR}}{\rho_{R}}$$
(4)
Difference Vegetation Index (DVI)
The DVI, computed as the difference between near-infrared and red reflectance, is more sensitive to soil background changes than the RVI. However, its accuracy diminishes in areas with complex soil backgrounds and low vegetation coverage. The DVI is computed as follows:
$$DVI = \rho_{NIR} - \rho_{R}$$
(5)
Temperature–Vegetation Dryness Index (TDVI) [42]
The TDVI is a remote sensing index used to monitor and evaluate drought conditions. It combines a vegetation index (such as the NDVI) and land surface temperature (LST) to provide information about the vegetation health status and drought stress. The calculation formula of the TDVI value is as follows:
$$TDVI = \frac{T_s - T_{s\min}}{T_{s\max} - T_{s\min}}$$
where $T_{s\min}$ is the lower temperature value (wet edge) among the pixels with the same NDVI, and $T_{s\max}$ is the higher temperature value (dry edge) among the pixels with the same NDVI. If the NDVI feature space is assumed to be trapezoidal, the approximate temperatures of the dry edge and the wet edge can be obtained by fitting the equations of the two edges:
$$T_{s\min} = a_1 + b_1 \cdot NDVI$$
$$T_{s\max} = a_2 + b_2 \cdot NDVI$$
where $a_1, b_1$ are the coefficients of the fitted wet-edge equation and $a_2, b_2$ are the coefficients of the fitted dry-edge equation.
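For reference, a NumPy sketch of these five surface parameters is given below; the soil adjustment factor L = 0.5 and the per-NDVI-bin edge fitting for the TDVI are illustrative assumptions rather than the paper's exact implementation.

```python
import numpy as np

def surface_parameters(nir, red, ts, ndvi_bins=50, soil_l=0.5):
    """nir, red: reflectance arrays (H, W); ts: land surface temperature (H, W)."""
    eps = 1e-8
    ndvi = (nir - red) / (nir + red + eps)
    savi = (1.0 + soil_l) * (nir - red) / (nir + red + soil_l)
    lai = 0.0654 * np.exp(4.0901 * savi)
    rvi = nir / (red + eps)
    dvi = nir - red

    # TDVI: fit the wet edge (per-NDVI-bin minimum Ts) and dry edge (maximum Ts),
    # then normalise Ts between the two fitted edges.
    bins = np.linspace(ndvi.min(), ndvi.max(), ndvi_bins + 1)
    centers, t_min, t_max = [], [], []
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (ndvi >= lo) & (ndvi < hi)
        if mask.any():
            centers.append(0.5 * (lo + hi))
            t_min.append(ts[mask].min())
            t_max.append(ts[mask].max())
    a1, b1 = np.polyfit(centers, t_min, 1)[::-1]     # wet edge:  Ts_min = a1 + b1 * NDVI
    a2, b2 = np.polyfit(centers, t_max, 1)[::-1]     # dry edge:  Ts_max = a2 + b2 * NDVI
    ts_min, ts_max = a1 + b1 * ndvi, a2 + b2 * ndvi
    tdvi = (ts - ts_min) / (ts_max - ts_min + eps)
    return ndvi, lai, rvi, dvi, tdvi
```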

2.3.2. Land Surface Classification and Feature Extraction Based on NDVI

The range of the NDVI is $[-1, 1]$. Based on the NDVI value, the land surface can be divided into five distinct categories. The numerator in the NDVI formula is the difference between the reflectance of the near-infrared and red bands. Water, clouds, and oceans typically exhibit high reflectance in the visible spectrum but low reflectance in the near-infrared spectrum; therefore, regions with an NDVI value below −0.2 can be identified as water, ocean, or cloud. Vegetation reflects strongly in the infrared spectrum and absorbs red light, so an NDVI value greater than 0 indicates vegetation, with higher values corresponding to denser vegetation coverage. Areas with NDVI values between −0.2 and 0 are generally considered urban regions. The NDVI value ranges used for surface classification are shown in Table 1.
In this paper, surface-related parameters such as the LAI, RVI, DVI, and TDVI are utilized to enhance the input features. The surface feature extraction module first performs threshold-based classification of the surface types according to the NDVI value. The hyperspectral information of the five surface categories is then fed into a multi-layer convolutional neural network for feature extraction, ultimately producing a one-dimensional feature vector.
Initially, the five surface parameters (NDVI, LAI, RVI, DVI, and TDVI) are calculated. Among these, the NDVI is used in the first step to classify surface types, while the LAI, RVI, DVI, and TDVI serve as enhancement features, which are concatenated with the original HSI data to enrich the feature information. The process of incorporating these surface parameters into the original map can be represented as follows:
$$input' = [\,input,\, LAI,\, RVI,\, DVI,\, TDVI\,]$$
where $input$ denotes the preprocessed HSI data and $[\cdot\,,\cdot]$ denotes the concatenation layer. $input'$ is used as the input to the dense nested network of Section 2.2 for feature extraction.
The process of surface classification based on NDVI values is as follows:
$$F^{k}_{i,j} = \begin{cases} input'_{i,j}, & \text{if } \mathrm{condition}_k \\ 0, & \text{otherwise} \end{cases}$$
where $\mathrm{condition}_k$ denotes the range of NDVI values corresponding to the $k$-th land surface type. Subsequently, the images segmented according to the NDVI values are fed into a convolutional neural network:
$$M^{k} = \mathrm{MLP}\big(\mathrm{Conv}(F^{k})\big)$$
where $\mathrm{Conv}$ represents a multi-layer deep convolutional neural network for feature extraction and $\mathrm{MLP}$ denotes a multi-layer perceptron with one hidden layer that reduces the features to one dimension. Additionally, $M^{k} \in \mathbb{R}^{1 \times L}$, where $L$ denotes the length of the semantic features of the surface types. The one-dimensional features obtained after dimensionality reduction are used for pixel-level knowledge reasoning.
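A minimal PyTorch sketch of this step is shown below: pixels are first assigned to surface types by NDVI thresholds (the threshold values are assumptions, since Table 1 is not reproduced here), and each masked image is then reduced to a one-dimensional semantic vector; the backbone widths are likewise illustrative.

```python
import torch
import torch.nn as nn

def surface_class_masks(ndvi, thresholds=(-0.2, 0.0, 0.2, 0.5)):
    """Split the NDVI range into five surface types (threshold values are assumed)."""
    edges = (-1.0,) + tuple(thresholds) + (1.0 + 1e-6,)
    return [((ndvi >= lo) & (ndvi < hi)).float() for lo, hi in zip(edges[:-1], edges[1:])]

class SurfaceSemanticExtractor(nn.Module):
    """M^k = MLP(Conv(F^k)), where F^k keeps only the pixels of surface type k."""
    def __init__(self, in_channels, semantic_len=64):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(in_channels, 32, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool2d(1),                     # global pooling to a 64-d vector
        )
        self.mlp = nn.Sequential(nn.Linear(64, 128), nn.ReLU(inplace=True),
                                 nn.Linear(128, semantic_len))

    def forward(self, hsi, class_mask):
        # F^k: zero out all pixels that do not belong to surface type k
        masked = hsi * class_mask.unsqueeze(1)           # hsi: (B, C, H, W); mask: (B, H, W)
        return self.mlp(self.conv(masked).flatten(1))    # M^k: (B, semantic_len)
```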

2.4. Pixel-Level Knowledge Reasoning Module

One characteristic of HSI small target detection tasks is that the targets occupy very few pixels, typically only a few to a dozen. Therefore, merging the pixel-level recognition results from Section 2.2 into a single target in advance makes it difficult to uniformly extract classification features for targets of different scales. Consequently, this paper proposes a pixel-level knowledge graph reasoning method, which integrates pixel-level knowledge into small target detection and recognition and effectively handles targets with inconsistent pixel scales. The entire reasoning process is shown in Figure 4. Starting from the initial feature map extracted from the HSI in Section 2.2.3, the 1 × 1 convolution classification weights are used as the target category semantics, and the land cover feature vectors extracted in Section 2.3.2 serve as the land surface type semantics, together forming a global semantic pool. In the orange box of Figure 4, the constructed global semantic pool is propagated over the knowledge graph built in Section 2.4.1 to obtain the reasoning results. Simultaneously, the probability matrix of each pixel’s class is multiplied by the reasoning results to obtain pixel-level reasoning, as described in Section 2.4.2. Additionally, the results of the initial features after convolution and an MLP are multiplied by the semantic pool to obtain attention weights for each pixel class. This attention mechanism is integrated into the reasoning process, allowing the model to focus on the primary target and land surface classes, as detailed in Section 2.4.3.

2.4.1. Knowledge Graph Construction

A knowledge graph is a structured semantic knowledge base that can be used to describe the semantic relationships between various entities. In this paper, different detection targets, such as cars, ships, and airplanes, are considered as target entities, and land surface types like water, soil, and vegetation are regarded as land surface type entities. By constructing the semantic relationships between entities using prior knowledge, a knowledge graph is built for use in this paper.
Pixel-Level Knowledge: Knowledge graphs are utilized to establish semantic relationships among different target pixels, targets, and surface types. For example, in the San Diego dataset (illustrated in Figure 5), pixels belonging to the same target type, such as airplanes, are often located in close proximity, and have adjacency relationships. Moreover, airplanes are generally found on land surfaces, while water, urban areas, and vegetation are not typical landing sites. Thus, adjacency relationships can also be established between land surface pixels and target pixels. Beyond adjacency, other positional or logical relationships can similarly be defined to enrich the semantic connections within the knowledge graph.
Knowledge Graph Construction: To utilize the prior knowledge in the knowledge graph, it is essential to transform the structured graph-based knowledge into a machine-readable format. The knowledge graph constructed in this paper is an entity–relationship graph consisting of entities and the relationships between them, and it is stored as an adjacency matrix for efficient computation. Taking a knowledge graph constructed from adjacency relationships as an example, an adjacency matrix $E$ is first established to quantify the relationships between targets. The row and column indices of $E$ correspond to the different target types and the five land surface types. Targets of the same type exhibit adjacency relationships, so the diagonal elements $E(i,i)$ are set to 1. If adjacency exists between different targets, the corresponding element $E(i,j)$ is set to 1; otherwise, it is set to 0. For relationships between targets and land surface types, the frequency $F(i,j)$ with which pixels of a given land surface type surround a given target in the training dataset is used as the value of $E(i,j)$. Here, $i$ denotes the index of a target type and $j$ denotes the index of a land surface type.
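As a concrete illustration, the adjacency matrix $E$ might be assembled as follows; normalizing the target–surface co-occurrence frequencies per target is an assumption rather than a detail stated in the paper.

```python
import numpy as np

def build_adjacency(num_targets, target_adjacent_pairs, target_surface_freq):
    """E is (C+5) x (C+5): target entities first, then the five surface types.
    target_adjacent_pairs: iterable of (i, j) target-index pairs known to co-occur.
    target_surface_freq:   (C, 5) counts of how often surface j surrounds target i
                           in the training set (used as E(i, j))."""
    num_surfaces = 5
    n = num_targets + num_surfaces
    E = np.zeros((n, n), dtype=np.float32)
    E[np.arange(num_targets), np.arange(num_targets)] = 1.0      # same-type adjacency
    for i, j in target_adjacent_pairs:                            # cross-type adjacency
        E[i, j] = E[j, i] = 1.0
    freq = np.asarray(target_surface_freq, dtype=np.float32)
    freq = freq / (freq.sum(axis=1, keepdims=True) + 1e-8)        # normalise per target (assumed)
    E[:num_targets, num_targets:] = freq                          # target -> surface edges
    E[num_targets:, :num_targets] = freq.T                        # surface -> target edges
    return E
```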

2.4.2. Knowledge Reasoning

Construction of the Global Semantic Pool: First, we construct the semantic features of target pixels and land surface types in two ways. Inspired by works such as Reasoning-RCNN [34], which use classifier weights to identify unknown targets, the weights of the classifiers can represent the semantic features of the various targets to a certain extent. In Section 2.2, for features of uniform dimensions, we used two 1 × 1 convolutional layers for pixel-level classification; we extract the filter weights corresponding to the different categories from the last convolutional layer to construct the global target semantic pool $M_1 \in \mathbb{R}^{C \times L}$, where $C$ denotes the number of target categories and $L$ is the length of the convolutional filter weights. At the same time, for the different land surface types, we use the features extracted in Section 2.3 as the semantic features of the land surface types, $M_2 \in \mathbb{R}^{5 \times L}$, where $L$ is the length of the obtained features. Finally, the global semantic pool for targets and land surface types is $M = [M_1, M_2]$. The global semantic pool is continuously updated during network training, and as the number of iterations increases, the semantic pool becomes more accurate.
Semantic Propagation and Feature Augmentation: After semantic features are established for all target categories and land surface types, these semantics can be propagated along the edges of the knowledge graph. However, an effective approach is still required to link pixels in the HSI to the target categories. In this model, surface categories are hard-linked to pixels, while target categories are soft-linked. The soft link between target categories and pixels is the classification probability matrix $P_1 \in \mathbb{R}^{N \times C}$ for $C$ categories and $N$ pixels, providing a flexible representation of each pixel's category proportions. Conversely, the link between pixels and surface categories is a hard link: each pixel is assigned a unique surface type according to its NDVI value, so the probability of a pixel belonging to its surface type is 1 and the probabilities for the other surface types are 0. This yields a probability matrix $P_2 \in \mathbb{R}^{N \times 5}$, where $N$ is the number of pixels. Finally, the total probability matrix $P = [P_1, P_2] \in \mathbb{R}^{N \times (C+5)}$ integrates both target and surface probabilities. This matrix is used to construct new feature vectors via matrix multiplication as follows: $F' = P E M W_G$, where $W_G \in \mathbb{R}^{L \times D}$ denotes the weight matrix of the neural network, and the resulting augmented feature $F' \in \mathbb{R}^{N \times D}$ provides each pixel with a comprehensive augmented feature representation.
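A minimal sketch of this propagation, assuming the matrix shapes defined above (the module name and the linear-layer implementation of $W_G$ are illustrative):

```python
import torch
import torch.nn as nn

class SemanticPropagation(nn.Module):
    """F' = P @ E @ M @ W_G: propagate graph semantics to every pixel."""
    def __init__(self, semantic_len, out_dim):
        super().__init__()
        self.w_g = nn.Linear(semantic_len, out_dim, bias=False)   # W_G in R^{L x D}

    def forward(self, p, e, m):
        # p: (N, C+5) pixel-category probabilities (soft target links, hard surface links)
        # e: (C+5, C+5) knowledge-graph adjacency matrix
        # m: (C+5, L)   global semantic pool [M_1; M_2]
        return self.w_g(p @ e @ m)                                # F': (N, D) augmented features
```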

2.4.3. Attention Mechanism

Since all categories contribute to the generation of enhanced features, noise is inevitably introduced into the network. To mitigate this, an attention module is employed to focus the network's attention on the relevant categories, reducing the impact of irrelevant targets that do not exist in the HSI. In this paper, we use an attention mechanism to identify the target categories and land surface types on which each input image primarily focuses. The process begins by convolving the feature map of the entire image with a 3 × 3 filter, followed by global pooling and a fully connected layer. The output is then passed through a softmax function to compute the attention weights for each category: $\alpha = \mathrm{softmax}(z_\alpha W_\alpha M^{T})$, where $z_\alpha \in \mathbb{R}^{L}$ denotes the feature vector obtained from the convolution of the original image features, $W_\alpha \in \mathbb{R}^{L \times L}$ denotes the fully connected weight matrix, and $\alpha \in \mathbb{R}^{C+5}$. The final enhanced feature is then generated as $F' = P\,(\alpha \odot E M)\, W_G$, where $\odot$ denotes element-wise multiplication.
Finally, the enhanced feature $F'$ is concatenated with the original feature $F$, resulting in the combined feature vector $[F, F']$. This concatenated feature is then fed into a new 1 × 1 convolutional network with a hidden layer to obtain the final detection output. The enhanced feature $F'$, derived from prior knowledge, acts as a supplement to the target features, thereby improving detection accuracy.
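Extending the sketch above with this category attention, under the same shape assumptions (the computation of $z_\alpha$ via 3 × 3 convolution and global pooling is assumed to have been done beforehand):

```python
import torch
import torch.nn as nn

class AttentionGuidedReasoning(nn.Module):
    """alpha = softmax(z_a W_a M^T);  F' = P (alpha ⊙ (E M)) W_G."""
    def __init__(self, semantic_len, out_dim):
        super().__init__()
        self.w_alpha = nn.Linear(semantic_len, semantic_len, bias=False)  # W_alpha in R^{L x L}
        self.w_g = nn.Linear(semantic_len, out_dim, bias=False)           # W_G in R^{L x D}

    def forward(self, z_alpha, p, e, m):
        # z_alpha: (L,) global image descriptor; p: (N, C+5); e: (C+5, C+5); m: (C+5, L)
        alpha = torch.softmax(self.w_alpha(z_alpha) @ m.t(), dim=-1)      # (C+5,) category weights
        weighted = alpha.unsqueeze(1) * (e @ m)                           # alpha broadcast over rows of E M
        return self.w_g(p @ weighted)                                     # F': (N, D) enhanced features
```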

2.4.4. Loss Function

In this paper, we perform two classification steps. The first classification occurs after the feature extraction module in Section 2.2.4, and the second follows the enhanced feature concatenation in Section 2.4.3. As a multi-class classification task, the output represents the probability of each pixel belonging to each category. The two classification outputs are denoted as $pred_1 \in \mathbb{R}^{H \times W \times (C+1)}$ and $pred_2 \in \mathbb{R}^{H \times W \times (C+1)}$, where $H$ is the image height, $W$ is the image width, and $C$ is the number of target classes. The ground-truth one-hot labels are denoted as $g \in \mathbb{R}^{H \times W \times (C+1)}$. We use the multi-class soft-IoU as our loss function. The loss for the first classification can be expressed as follows:
$$L_{1,c} = 1 - \frac{\sum \big(pred_1^{\,c} \odot g^{c}\big)}{\sum \big(pred_1^{\,c} + g^{c} - pred_1^{\,c} \odot g^{c}\big)}$$
$$L_1 = \frac{1}{C}\sum_{c=1}^{C} L_{1,c}$$
where $pred_1^{\,c}$ and $g^{c}$ are the $c$-th channels of $pred_1$ and $g$, $\odot$ denotes element-wise multiplication, and the summation symbol denotes summation over all elements. Similarly, the second classification loss can be written as follows:
$$L_{2,c} = 1 - \frac{\sum \big(pred_2^{\,c} \odot g^{c}\big)}{\sum \big(pred_2^{\,c} + g^{c} - pred_2^{\,c} \odot g^{c}\big)}$$
$$L_2 = \frac{1}{C}\sum_{c=1}^{C} L_{2,c}$$
where $pred_2^{\,c}$ is the $c$-th channel of $pred_2$. The final loss function is as follows:
$$L = w_1 L_1 + w_2 L_2$$
where $w_1$ and $w_2$ are selectable weights; in this paper, we choose $w_1 = w_2 = 0.5$.
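A minimal PyTorch sketch of this loss is given below; treating the last channel as the background and averaging over the $C$ target channels, as the formulas suggest, is an assumption.

```python
import torch

def soft_iou_loss(pred, target, num_target_classes, eps=1e-8):
    """pred, target: (H, W, C+1) per-pixel probabilities and one-hot labels;
    the last channel is assumed to be the background."""
    inter = (pred * target).sum(dim=(0, 1))                    # per-class intersection
    union = (pred + target - pred * target).sum(dim=(0, 1))    # per-class soft union
    per_class = 1.0 - inter / (union + eps)
    return per_class[:num_target_classes].mean()               # average over the C target classes

def total_loss(pred1, pred2, target, num_target_classes, w1=0.5, w2=0.5):
    """L = w1 * L1 + w2 * L2 for the two classification heads."""
    return (w1 * soft_iou_loss(pred1, target, num_target_classes)
            + w2 * soft_iou_loss(pred2, target, num_target_classes))
```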

3. Experiments and Results

3.1. Dataset Introduction

This work uses three real datasets and one synthetic dataset for training and testing the target detection task.
(1)
San Diego Dataset: The San Diego dataset contains hyperspectral remote sensing images captured by the AVIRIS sensor at the San Diego airport in the United States. The image has a spatial resolution of 3.5 m and a size of 400 × 400 pixels. It includes 224 bands ranging from 370 to 2510 nm. After removing bands influenced by atmospheric effects, 203 usable bands remain. The dataset includes 1103 target pixels, covering six aircraft and other small targets of interest. These targets vary in size, ranging from a few pixels to several dozen, making it suitable for evaluating the model’s ability to detect targets of different scales. The RGB image of this dataset is shown in Figure 5a [43].
(2)
Mosaic Avon Dataset: The Mosaic Avon dataset is part of the “SpecTIR Hyperspectral Airborne Experiment 2012” project. It captures a section of Avon Park in New York, USA, using a push-broom sensor. The sensor collects spectral data across 360 bands, ranging from 400 to 2450 nm, where each band’s wavelength is carefully labeled. The spatial resolution of this dataset ranges from 1 to 5 m. For this study, a 256 × 256 pixel sub-region was selected, which includes large grassland areas interspersed with smaller land and road patches. The dataset contains 228 target pixels for detection [44].
(3)
Synthetic Dataset: The background of our synthetic dataset is derived from the TG1HRSSC hyperspectral dataset [45], collected by the Chinese Academy of Sciences Space Application Engineering and Technology Center in 2021 from the Tiangong-1 satellite. This dataset includes images across three spectral ranges: full-color (PAN), visible near-infrared (VNIR), and short-wave infrared (SWIR). It covers nine geographic categories such as urban areas, farmland, ports, and airports. In this paper, three hyperspectral images of port areas were selected. These images, sized at 256 × 256 pixels with 54 bands ranging from 400 to 1000 nm, consist of sea areas, vegetation, and urban land surfaces. The targets for this dataset were extracted from the ABU (Airport–Beach–Urban) dataset [46], which was manually curated from the AVIRIS website. The targets include aircraft, ships, and cars, which were resized, spectrally adjusted, and then randomly selected and dropped into the water and urban areas of the background dataset, creating three synthetic HSI images. Each of these images contains more than ten unique targets for detection.
(4)
HAD100 Dataset: The HAD100 dataset is collected by the AVIRIS sensor. The dataset is uniformly cropped into 64 × 64 image patches; each image contains one to several small targets, each consisting of a few to tens of pixels. The dataset includes 276 bands ranging from 400 to 2500 nm. We selected 40 images from the HAD100 dataset for training and testing, which contain multiple types of targets, and categorized the targets into four classes [47].

3.2. Experimental Details

3.2.1. Evaluation Metrics

In this paper, we use detection maps, the receiver operating characteristic (ROC) curve, three types of area under the curve (AUC), and the separability map to evaluate model performance. Detection maps provide an intuitive visualization of the model's detection effectiveness and its ability to suppress background interference. The ROC curve is generated by dividing the detection results into targets and backgrounds using various thresholds between 0 and 1, calculating the detection probability ($D$) and false alarm rate ($F$) for each threshold [48], and plotting these as a curve. The closer the ROC curve is to the upper-left corner, the higher the detection probability the model attains at the same false alarm rate, reflecting a better ability to distinguish between positive and negative examples.
The AUC (area under the curve) is a quantitative measure of how closely the ROC curve approaches the upper-left corner, reflecting the quality of the detection [49,50]. The most common AUC, denoted AUC$(D,F)$, is calculated from the true positive rate ($D$) and the false positive rate ($F$) at different thresholds ($\tau$). A higher AUC$(D,F)$ value indicates better overall detection performance.
Additionally, two variant AUC metrics are employed for more specific assessments: AUC$(\tau,D)$ and AUC$(\tau,F)$. A higher AUC$(\tau,D)$ value indicates higher detection effectiveness, and a lower AUC$(\tau,F)$ value indicates better background suppression.
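For reference, the three AUC variants can be computed from per-pixel detection scores by sweeping a threshold $\tau$, as in the following NumPy sketch (the threshold grid size is arbitrary):

```python
import numpy as np

def _trapezoid(y, x):
    """Trapezoidal integration of y over x (x assumed monotonically increasing)."""
    return float(np.sum(0.5 * (y[1:] + y[:-1]) * (x[1:] - x[:-1])))

def auc_metrics(scores, labels, num_thresholds=200):
    """scores: detection values in [0, 1]; labels: 1 for target pixels, 0 for background."""
    taus = np.linspace(0.0, 1.0, num_thresholds)
    d = np.array([(scores[labels == 1] >= t).mean() for t in taus])   # detection probability
    f = np.array([(scores[labels == 0] >= t).mean() for t in taus])   # false alarm rate
    auc_df = _trapezoid(d[::-1], f[::-1])   # ROC area: D against F
    auc_td = _trapezoid(d, taus)            # AUC(tau, D): higher means more effective detection
    auc_tf = _trapezoid(f, taus)            # AUC(tau, F): lower means better background suppression
    return auc_df, auc_td, auc_tf
```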

3.2.2. Parameter Settings

To evaluate the detection performance of the proposed AGDNR model, we compared it with six detection methods: two classic approaches, CEM [3] and ACE [51], and four deep learning-based methods, MLSN [52], HTD-IRN [24], TSTTD [53], and CS-TTD [54]. The parameters used for these comparisons were taken from their publicly released implementations.
For our AGDNR method, the optimizer is set to Adagrad, and the loss function is cross-entropy loss. The learning rate is initialized to 0.05, the number of training epochs is set to 1500, and the batch size is set to the number of training images in the dataset. During training, 20% to 40% of the pixels labeled as targets are randomly selected to form the training set for every dataset used. The numbers of targets in the training and testing sets are shown in Table 2.
All model training and testing were conducted on a Linux-based remote server equipped with an NVIDIA Tesla T4 GPU and 15 GB of RAM. This experimental setup ensures a fair and consistent comparison across all methods.

3.3. Comparison of Algorithm Performance

Figure 6 demonstrates the detection results of six baseline models and the model proposed in this paper. The traditional CEM algorithm can detect the majority of targets across all four datasets and effectively suppress the background. ACE, HTD-IRN, and TSTTD exhibit strong background suppression capabilities but can only detect a limited number of targets. MLSN and CS-TTD have weaker background suppression abilities, but both can detect a relatively large number of targets in the Avon dataset. Among them, CS-TTD detects almost all targets in the synthetic dataset, with only a few missed detections. The AGDNR model performs excellently across all four datasets, detecting the locations of all targets, with only a few false detections in the San Diego dataset.
Figure 7 shows the ROC curves of the seven models. On the San Diego, Avon, and synthetic datasets, AGDNR has the best ROC curve, achieving a higher detection probability at a lower false alarm rate, which indicates that the AGDNR model has stable detection accuracy and the best overall detection performance.
Table 3 presents the AUC values for the seven models, including AUC$(D,F)$, AUC$(\tau,D)$, and AUC$(\tau,F)$, with the best values highlighted in bold. Overall, our method performs best on the majority of AUC values. Specifically, on the Avon and HAD100 datasets, AGDNR achieves the best values for all three AUC metrics, indicating good detection accuracy, effectiveness, and background suppression capability. On the San Diego dataset, its AUC$(\tau,F)$ value is slightly lower than that of TSTTD, and its AUC$(\tau,D)$ is lower than that of ACE; however, ACE's high AUC$(\tau,F)$ indicates a complete lack of background suppression capability. On the synthetic dataset, CS-TTD's AUC$(\tau,D)$ surpasses that of AGDNR, yet its AUC$(\tau,F)$ is inferior to AGDNR's, suggesting weaker background suppression performance.
Figure 8 presents the separability maps of each method on the four datasets, which intuitively demonstrate the effectiveness of AGDNR in separating targets from the background. Compared with the other methods, the distance between the orange boxes (background) and green boxes (targets) of AGDNR is the greatest across the datasets. Additionally, the orange boxes are close to 0, indicating strong background suppression capability. At the same time, the central line of the green boxes is generally higher, suggesting that the model has strong target detection capability.

3.4. Target Type Reasoning Analysis

The model constructed in this paper can not only detect the positions of small targets but also reason about their categories. To analyze the classification and detection ability of AGDNR, we drew the detection maps, calculated the ROC curves on the four datasets, and report the three types of AUC values for the different target types to verify the target category detection performance of AGDNR.
As shown in Figure 9, AGDNR performs well on the four datasets; the model is capable not only of inferring the category of a target but also of accurately identifying its contour. Specifically, on the synthetic and HAD100 datasets, it detects all target locations completely and reasons the categories of all targets accurately. On the San Diego and Avon datasets, the model detects the positions and contours of all targets. For the targets of category 3 in the San Diego dataset, there were no missed detections, but a few false positives were scattered in the background, likely because category 3 targets are very small and their spectra are strongly influenced by the background. Regarding the targets of category 4 in the synthetic dataset, some pixels of category 3 were incorrectly reasoned as category 4, which may be because the two are very close to each other, causing the network to mistakenly propagate a few category 4 features to category 3. However, the false positive pixels are assigned a relatively low probability of belonging to the incorrect target class, which is within an acceptable range and does not affect the final classification results.
As shown in Figure 10 and Table 4, the ROC curves show that the model can accurately classify target pixels on all four datasets. On the four datasets, the model's AUC$(D,F)$ value is close to 1 for each category, indicating that the model can accurately detect and classify targets. The model's AUC$(\tau,D)$ values are relatively high on the four datasets, indicating that the detected pixels have high effectiveness; however, on category 3 of the San Diego dataset, category 4 of the Mosaic Avon dataset, and category 3 of the synthetic dataset, the detection effectiveness is slightly weaker. The AUC$(\tau,F)$ values are close to 0 on the four datasets, indicating strong background suppression capability. In general, AGDNR achieves good results on all four datasets and can accurately and effectively detect and classify targets.

3.5. Ablation Experiment

3.5.1. Module Ablation

The deep nested network employed in this paper facilitates direct interaction between shallow and deep features, effectively feeding the features of small targets directly into the deeper layers of the network, thus preventing the features of small targets from being lost due to deep convolution and pooling.
In this paper, skip connections and down-sampling in the backbone feature extraction network are removed to form two network variants, while keeping other network structures and output scales unchanged, in order to investigate the role of skip connections and down-sampling in preserving the features of small targets.
(1)
AGDNR w/o SC: As shown in Figure 11a, the skip connections in the original model link features within the same layer, which helps to preserve the features of the preceding layers in subsequent convolutional networks. As illustrated in Figure 11b, all skip connections in the model are removed to obtain a variant.
(2)
AGDNR w/o SC&DS: Down-sampling layers are used to maintain the features of small targets in deep networks. As shown in Figure 11c, we remove all the down-sampling layers except for the first column to form a variant.
To further investigate the roles of skip connections and down-sampling layers, we removed the first skip connection and down-sampling layer, respectively, from the AGDNR and AGDNR w/o SC models, thereby creating two variants.
(1)
AGDNR w/o SC1: As shown in Figure 12a, we removed the skip connection layer between the nodes in the first column and subsequent nodes, preventing the raw information from the preceding layer from being passed to the subsequent network layers.
(2)
AGDNR w/o SC&DS1: As depicted in Figure 12b, we eliminated the down-sampling layer in the first row, hindering the transfer of detailed information from the upper layers to the lower network layers.
Figure 13 displays the average values of the output features from different nodes in the various models after being up-sampled to the same size. These nodes exhibit the output features at different depths of the network, which intuitively demonstrates the feature representation capability of each node. Specifically, for the original AGDNR model, the deeper nodes already possess the ability to distinguish small targets from the background: in the features of $L_{2,2}$, the features of the small targets are completely separated from the background features. Moreover, the features at node $L_{4,0}$ mainly represent the common features of a cluster of small targets, which helps localize the small targets. The features at the shallow nodes express the small target features with sufficient granularity, allowing the model to accurately identify small target pixels.
For AGDNR w/o SC, there is a significant decline in the feature representation capability of the deeper nodes. At nodes deeper than $L_{1,3}$, the nodes receive less information from the shallow layers and cannot effectively distinguish between background and targets. Additionally, at node $L_{3,1}$, the distinction between targets and background is unclear. For AGDNR w/o SC&DS, the representation capability of small target features declines further: among all nodes, only $L_{1,3}$ and $L_{0,4}$ show a clear distinction, and the other nodes cannot effectively differentiate small targets from their surroundings. In summary, as the network structure is simplified, the ability of small target features to persist in deeper layers significantly decreases. However, the output of the shallow nodes is less affected by skip connections and down-sampling; both variants can still produce effective features for target detection and classification at shallow depth.
Figure 14 presents the AUC$(\tau,D)$ values of the different models across the four datasets. For AUC$(D,F)$ and AUC$(\tau,F)$, the output values of all models show little variation, with AUC$(D,F)$ close to 1 and AUC$(\tau,F)$ close to 0, indicating that all models have excellent target detection and classification capabilities as well as background suppression capabilities. From the figure, it is evident that for AUC$(\tau,D)$, AGDNR w/o SC and AGDNR w/o SC&DS exhibit weaker target detection and classification capabilities than AGDNR. On the San Diego dataset, the classification effectiveness of AGDNR w/o SC across all categories is about 1% lower than that of AGDNR, while AGDNR w/o SC&DS sees a reduction of about 10% in the classification effectiveness for category 1. On the Avon dataset, although AGDNR w/o SC and AGDNR w/o SC&DS outperform AGDNR in the detection of categories 3 and 4, this comes at the cost of almost no detection effectiveness for category 2, which suggests that these two variants cannot effectively classify and detect all target types on the Avon dataset. On the synthetic dataset, AGDNR performs the most stably, maintaining high detection effectiveness across all categories. In summary, AGDNR exhibits the most stable detection accuracy, background suppression, and detection effectiveness, whereas AGDNR w/o SC has a lower overall detection effectiveness and AGDNR w/o SC&DS the lowest.
Figure 15 shows the AUC$(\tau,D)$ values of AGDNR and its four variants across the four datasets. On the San Diego, Avon, and synthetic datasets, the AUC$(\tau,D)$ values of all models decrease as components of the dense nested convolutional network are removed. On the HAD100 dataset, the AUC$(\tau,D)$ value of AGDNR w/o SC surpasses that of AGDNR, likely because the targets are large relative to the image size, rendering the skip connection layers less effective. On the San Diego, Avon, and HAD100 datasets, the AUC$(\tau,D)$ value dropped by approximately 1% when the first skip connection layer was removed, showing a significant decrease. Additionally, on the San Diego and Avon datasets, removing one or all skip connection layers resulted in similar AUC$(\tau,D)$ values, indicating that the first skip connection layer is crucial for preserving small target features. Compared with AGDNR w/o SC, the AUC$(\tau,D)$ value of AGDNR w/o SC&DS1 did not decrease significantly across the three datasets. However, on two datasets, AGDNR w/o SC&DS showed a 10% drop in the AUC$(\tau,D)$ value compared with AGDNR w/o SC&DS1, suggesting that the first down-sampling layer has less impact, possibly because it does not shorten the path between the deep network and the first-layer network as significantly as the first skip connection layer does.
Table 5 presents the IoU values of AGDNR and its four variants across the four datasets, with the optimal values highlighted in bold. The full model achieved the highest IoU values on the first three datasets. Most data trends in the table are similar to those in Figure 15. The table indicates that the first skip connection layer has a significant impact on model performance. Models without down-sampling or skip connection layers performed the worst across all datasets, indicating the lowest capability for extracting small target features.

3.5.2. Data Ablation

The knowledge reasoning component of this paper seeks to improve detection accuracy by exploiting land surface categories. During reasoning, the learned land surface features are propagated to the target pixel features according to the weights of the knowledge graph adjacency matrix, producing enhanced features that raise detection performance. To investigate the impact of land surface reasoning on model performance, this study removes the land surface classification step and the generation of land surface semantic features, while keeping the rest of the model unchanged. This variant is named AGDNR w/o LSF.
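To make the propagation step concrete, the sketch below shows one simple way such adjacency-weighted reasoning can be realized: target and land-surface semantics form a shared pool, the pool is propagated along the knowledge-graph adjacency, and each pixel receives an enhanced feature weighted by its category scores. All tensor shapes, the random adjacency, the softmax weighting, and the final concatenation are illustrative assumptions, not the authors' exact implementation.

```python
import torch

# Illustrative sizes (assumptions): C target classes, S land-surface types, D semantic dim, N pixels
C, S, D, N = 4, 5, 32, 100 * 100

semantic_pool = torch.randn(C + S, D)       # target semantics (classifier weights) + surface semantics
adjacency = torch.rand(C + S, C + S)        # knowledge-graph weights between categories (random here)
pixel_scores = torch.softmax(torch.randn(N, C + S), dim=1)  # per-pixel category scores

# Propagate semantics along the graph, then project them back onto each pixel
propagated = adjacency @ semantic_pool      # (C+S, D): each node aggregates its neighbours
enhanced = pixel_scores @ propagated        # (N, D): per-pixel enhanced feature

initial_features = torch.randn(N, 64)               # features from the backbone (assumed dimension)
fused = torch.cat([initial_features, enhanced], 1)  # concatenated features used for final detection
print(fused.shape)  # torch.Size([10000, 96])
```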
Figure 16 compares the detection results on the San Diego dataset with and without land surface features in target reasoning. Both models successfully detect most target pixels. However, the original model exhibits only a few false positives, whereas AGDNR w/o LSF not only produces false positives but also misclassifies certain targets, for example, identifying targets of category 2 as targets of category 1. Additionally, for correctly detected targets, the detection response of AGDNR w/o LSF is dimmer than that of AGDNR, indicating that incorporating land surface categories in reasoning significantly improves the accuracy and effectiveness of target detection and classification.
Figure 17 illustrates the ROC curves of the two models on the San Diego dataset for the different target categories, and Figure 18 shows the corresponding AUC values. The ROC curves show that, for every category, the curve of AGDNR w/o LSF lies below that of the original model AGDNR, indicating clearly inferior detection and classification performance. Consistent with this, the AUC values in Figure 18 show that the AUC ( τ , D ) value of AGDNR w/o LSF is up to about 0.1 lower than that of AGDNR, confirming that its classification effectiveness is significantly worse than that of the original model.
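For readers unfamiliar with these measures, the following sketch computes AUC(D,F), AUC(τ,D), and AUC(τ,F) from a score map by sweeping a detection threshold τ, in the spirit of 3D ROC analysis for hyperspectral detection; the normalization of scores to [0, 1] and the toy data are assumptions for illustration.

```python
import numpy as np

def trapezoid(y: np.ndarray, x: np.ndarray) -> float:
    """Trapezoidal-rule integral of y over x."""
    return float(np.sum((y[1:] + y[:-1]) * np.diff(x)) / 2.0)

def three_d_roc_aucs(scores: np.ndarray, labels: np.ndarray, num_taus: int = 200):
    """Compute AUC(D,F), AUC(tau,D), and AUC(tau,F) by sweeping a detection threshold tau.

    scores : per-pixel detection scores, assumed already normalized to [0, 1]
    labels : 1 for target pixels, 0 for background pixels
    """
    taus = np.linspace(0.0, 1.0, num_taus)
    pd = np.array([(scores[labels == 1] >= t).mean() for t in taus])  # detection probability
    pf = np.array([(scores[labels == 0] >= t).mean() for t in taus])  # false-alarm rate

    order = np.argsort(pf)                    # integrate Pd over Pf from left to right
    auc_df = trapezoid(pd[order], pf[order])  # AUC(D,F): detection vs. false alarm
    auc_td = trapezoid(pd, taus)              # AUC(tau,D): detection vs. threshold
    auc_tf = trapezoid(pf, taus)              # AUC(tau,F): false alarm vs. threshold
    return auc_df, auc_td, auc_tf

# Toy data: sparse targets whose scores are well separated from the background
rng = np.random.default_rng(0)
labels = (rng.random(10_000) < 0.01).astype(int)
scores = np.clip(0.2 * rng.random(10_000) + 0.7 * labels, 0.0, 1.0)
print(three_d_roc_aucs(scores, labels))  # AUC(D,F) near 1, AUC(tau,F) near 0
```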
Figure 19 displays the enhanced features generated by the original model AGDNR and by AGDNR w/o LSF. AGDNR generates enhanced features for every target category, and the brightness of the targets varies across the maps, indicating that the enhanced features effectively help the model distinguish between categories. In contrast, AGDNR w/o LSF learns enhanced features for only one type of target, and even these are not uniform, suggesting that it struggles to generate enhanced features for multiple targets simultaneously. This is likely due to the absence of land surface reasoning, which makes it difficult to infer effective enhanced features and significantly reduces the model’s reasoning capability. Conversely, AGDNR can effectively reason out enhanced features for the targets, thereby improving recognition and detection accuracy.

4. Conclusions

This paper proposes AGDNR for hyperspectral small target detection and classification tasks on large-scale datasets. By designing a dense nested convolutional neural network, AGDNR can maintain the features of small targets without loss in deep networks on large-scale datasets. Moreover, the model extracts land surface semantic features through land surface classification and feature extraction. Finally, the model’s classification module and reasoning module reason about target categories based on the semantic features of land surfaces and targets. Extensive experiments on three challenging hyperspectral datasets have verified the superiority of the AGDNR method over state-of-the-art approaches. It can better adapt to detection tasks in various scenarios and accurately identify the categories of small targets.
However, the proposed method still has deficiencies in reasoning. (1) The entries of the adjacency matrix used in reasoning are scalars rather than vectors, so they cannot represent multiple or complex relationships between targets. This prevents the network from reasoning over several relationships simultaneously and limits further improvements in detection performance. In future work, we will explore vector-based knowledge graph reasoning and integrate graph neural networks to strengthen the reasoning capability. (2) The adjacency matrix also relies on human prior knowledge and adapts poorly across datasets, restricting the model’s generalization capability. We will investigate an adaptive knowledge graph that learns its relationships automatically from the training images. (3) The model constructs an enhanced feature for every pixel, which leads to high computational cost and long processing time. We will explore methods to jointly represent the features of pixels belonging to the same category in order to reduce the computational load.

Author Contributions

Conceptualization, S.Z. and X.Z.; Methodology, X.Z.; Formal analysis, Y.Y.; Investigation, Y.Y. and M.Z.; Resources, G.L.; Data curation, Y.Y. and M.Z.; Writing—original draft, Y.Y.; Writing—review & editing, S.Z., M.Z. and X.Z.; Visualization, Y.Y.; Supervision, S.Z. and G.L.; Project administration, S.Z. and G.L.; Funding acquisition, S.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding author.

Conflicts of Interest

The authors declare no conflict of interest.

Figure 1. Overview of the hyperspectral target detection framework. (a) Feature extraction module. (b) Surface feature extraction module. (c) Pixel-level knowledge reasoning module. The proposed framework begins by extracting the original features of the image using a deeply nested convolutional neural network. A 1 × 1 convolutional classifier is then applied for pixel-level target detection. At this stage, the convolutional kernel weights are extracted as the semantic representation of the target. These semantics are further enriched by combining them with surface semantics derived from feature extraction after NDVI classification, creating a global semantic pool. Finally, the semantics are propagated among nodes based on a priori knowledge graph, resulting in enhanced features that are utilized for the final target detection.
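The step of reading the classifier weights back out as target semantics can be illustrated with a short sketch (channel counts and variable names are assumptions): a 1 × 1 convolution produces per-pixel category scores, and its weight tensor doubles as one semantic vector per category.

```python
import torch
import torch.nn as nn

# Assumed sizes: 64 backbone feature channels and C = 4 target categories
features = torch.randn(1, 64, 100, 100)        # output of the dense nested backbone (illustrative)
classifier = nn.Conv2d(64, 4, kernel_size=1)   # pixel-level 1 x 1 convolutional classifier

score_maps = classifier(features)                 # (1, 4, 100, 100): per-category detection scores
target_semantics = classifier.weight.view(4, 64)  # the kernel weights double as category semantics
print(score_maps.shape, target_semantics.shape)   # torch.Size([1, 4, 100, 100]) torch.Size([4, 64])
```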
Figure 2. The structure of U-Net and dense nested convolutional neural network. (a) U-net. (b) Dense nested convolutional neural network. In the dense nested network, each node consists of a concatenation module, a convolutional module, and an attention module. Meanwhile, the output features of each node will be input to the nodes in the lower layer, other nodes in the same layer, and the nodes in the upper layer, enabling the preservation of small target features in the deep convolution.
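A minimal sketch of one such node is given below, assuming a 3 × 3 convolution block and a simple channel-attention stand-in; the actual channel counts and attention design of AGDNR may differ.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DenseNestedNode(nn.Module):
    """One node of a dense nested network: concatenate incoming features, convolve, apply attention.

    The 3 x 3 convolution block and the SE-style channel attention are illustrative assumptions.
    """
    def __init__(self, in_channels: int, out_channels: int):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(in_channels, out_channels, 3, padding=1),
            nn.BatchNorm2d(out_channels),
            nn.ReLU(inplace=True),
        )
        self.attn = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(out_channels, out_channels, 1),
            nn.Sigmoid(),
        )

    def forward(self, inputs):
        # inputs: feature maps from the same layer, the layer below (upsampled), and the layer
        # above (downsampled), all resized to the same spatial resolution before concatenation
        x = torch.cat(inputs, dim=1)
        x = self.conv(x)
        return x * self.attn(x)

# Toy usage: a node that receives a same-level feature and an upsampled deeper feature
same_level = torch.randn(1, 16, 64, 64)
deeper = F.interpolate(torch.randn(1, 32, 32, 32), scale_factor=2, mode="bilinear", align_corners=False)
node = DenseNestedNode(16 + 32, 16)
print(node([same_level, deeper]).shape)  # torch.Size([1, 16, 64, 64])
```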
Figure 3. Channel attention mechanism and spatial attention mechanism after multi-layer convolution.
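The following sketch shows a CBAM-style combination of channel and spatial attention of the kind depicted in Figure 3; the reduction ratio, kernel size, and pooling choices are assumptions for illustration rather than the exact configuration used in AGDNR.

```python
import torch
import torch.nn as nn

class ChannelSpatialAttention(nn.Module):
    """Channel attention followed by spatial attention (an illustrative stand-in
    for the attention applied after the multi-layer convolutions)."""
    def __init__(self, channels: int, reduction: int = 8):
        super().__init__()
        self.channel_mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
        )
        self.spatial_conv = nn.Conv2d(2, 1, kernel_size=7, padding=3)

    def forward(self, x):
        b, c, _, _ = x.shape
        # Channel attention: pool over space, then weight each channel
        avg = x.mean(dim=(2, 3))
        mx = x.amax(dim=(2, 3))
        ca = torch.sigmoid(self.channel_mlp(avg) + self.channel_mlp(mx)).view(b, c, 1, 1)
        x = x * ca
        # Spatial attention: pool over channels, then weight each pixel
        sa = torch.sigmoid(self.spatial_conv(torch.cat([x.mean(1, keepdim=True),
                                                        x.amax(1, keepdim=True)], dim=1)))
        return x * sa

attn = ChannelSpatialAttention(32)
print(attn(torch.randn(1, 32, 64, 64)).shape)  # torch.Size([1, 32, 64, 64])
```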
Figure 4. Global reasoning module. The global semantic pool integrates the weights of the convolutional filters and the extracted surface category features into a semantic pool of the targets and the surface type. Then, this semantic pool is propagated according to the knowledge graph for global reasoning. The attention mechanism is used to adaptively emphasize the most relevant classes of targets. Finally, a soft link of the target category and a hard link of the surface type are applied to obtain enhanced features. The enhanced features and the initial features are concatenated to obtain better detection results.
Figure 5. (a) RGB image from the San Diego dataset. (b) Surface classification results based on NDVI values. (c) Ground truth, in which targets of different categories are represented in different colors.
Figure 6. Detection maps of different methods. (a) RGB images of the San Diego dataset, Avon dataset, synthetic dataset, and HAD100 dataset. (b) Ground truth. (c) ACE. (d) CEM. (e) HTD-IRN. (f) MLSN. (g) TSTTD. (h) CS-TTD. (i) AGDNR.
Figure 7. ROC curves of different methods. (a) San Diego dataset. (b) Avon dataset. (c) Synthetic dataset. (d) HAD100 dataset.
Figure 8. Separability maps of different methods. (a) San Diego dataset. (b) Avon dataset. (c) Synthetic dataset. (d) HAD100 dataset.
Figure 9. Classification and detection maps of the four datasets. (a) RGB image. (b) Ground truth. (c) Small target detection map. (d) Category 1. (e) Category 2. (f) Category 3. (g) Category 4.
Figure 10. Classification ROC curves of four datasets. (a) San Diego dataset. (b) Masic Avon dataset. (c) Synthetic dataset. (d) HAD100 dataset.
Figure 11. Network architectures of AGDNR and its two variants. (a) AGDNR. (b) AGDNR w/o SC. (c) AGDNR w/o SC&DS.
Figure 12. Network architectures of two variants. (a) AGDNR without SC1. (b) AGDNR without SC&DS1.
Figure 13. Output features of nodes at different levels for AGDNR and its two variants. (a) RGB image of the San Diego dataset. (b) Ground truth. (c) Output features of node L 4 , 0 . (d) Output features of node L 3 , 1 . (e) Output features of node L 2 , 2 . (f) Output features of node L 1 , 3 . (g) Output features of node L 0 , 4 .
Figure 14. The AUC ( τ , D ) values of different target categories of the four datasets. (a) San Diego dataset. (b) Masic Avon dataset. (c) Synthetic dataset. (d) HAD100 dataset.
Figure 15. The AUC ( τ , D ) values of AGDNR and four variants on all categories of the four datasets.
Figure 16. Detection maps of AGDNR and AGDNR with land surface reasoning removed (AGDNR w/o LSF). (a) RGB image. (b) Ground truth map. (c) Small target detection map. (d) Category 1. (e) Category 2. (f) Category 3.
Figure 17. ROC curves of AGDNR and AGDNR w/o LSF on different target categories in the San Diego dataset. (a) Category 1. (b) Category 2. (c) Category 3. (d) All Categories.
Figure 18. AUC values of AGDNR and AGDNR w/o LSF on different target categories in the San Diego dataset.
Figure 19. The enhanced features generated by AGDNR and AGDNR w/o LSF on the San Diego dataset.
Table 1. Land surface classification method based on NDVI values.
NDVI Interval      Land Surface Type
[−1, −0.2]         water, clouds, or oceans
[−0.2, 0]          urban
[0, 0.2]           soil
[0.2, 0.5]         soil/vegetation mixed area
[0.5, 1]           vegetation
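A small sketch of how Table 1 can be applied in practice is given below: NDVI is computed as (NIR − Red)/(NIR + Red) and each pixel is assigned a land-surface type according to the interval it falls into. The choice of which hyperspectral bands act as the NIR and red channels is dataset-specific and assumed here.

```python
import numpy as np

def classify_ndvi(nir: np.ndarray, red: np.ndarray) -> np.ndarray:
    """Assign the land-surface types of Table 1 from NDVI = (NIR - Red) / (NIR + Red)."""
    ndvi = (nir - red) / (nir + red + 1e-8)      # small epsilon avoids division by zero
    labels = np.empty(ndvi.shape, dtype=object)
    labels[ndvi <= -0.2] = "water/clouds/ocean"
    labels[(ndvi > -0.2) & (ndvi <= 0.0)] = "urban"
    labels[(ndvi > 0.0) & (ndvi <= 0.2)] = "soil"
    labels[(ndvi > 0.2) & (ndvi <= 0.5)] = "soil/vegetation mixed"
    labels[ndvi > 0.5] = "vegetation"
    return labels

# Toy reflectance values for four pixels (illustrative)
nir = np.array([0.05, 0.20, 0.30, 0.60])
red = np.array([0.20, 0.18, 0.18, 0.10])
print(classify_ndvi(nir, red))
# ['water/clouds/ocean' 'soil' 'soil/vegetation mixed' 'vegetation']
```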
Table 2. The number of training and testing targets in the four datasets. Due to the small data volume of some targets, the targets will be split.
Dataset              Train or Test   Category 1   Category 2   Category 3   Category 4   All Categories
San Diego dataset    train           5            2            11           --           18
                     test            15           6            38           --           59
Avon dataset         train           4            4            0.5          0.3          8.8
                     test            12           12           2            1            27
Synthetic dataset    train           5            5            5            --           15
                     test            15           15           15           --           45
HAD100 dataset       train           6            6            4            4            20
                     test            26           25           13           13           77
Table 3. The AUC values of different methods on four datasets.
Dataset             AUC         ACE        E-CEM      HTD-IRN    TSTTD      CS-TTD     AGDNR
San Diego dataset   AUC(D,F)    0.673917   0.904242   0.769122   0.832509   0.910121   0.999537
                    AUC(τ,D)    0.986446   0.160722   0.024773   0.025760   0.698200   0.805841
                    AUC(τ,F)    0.965915   0.064171   0.004925   0.001483   0.205343   0.006516
Avon dataset        AUC(D,F)    0.753610   0.899847   0.898329   0.954091   0.988532   0.999844
                    AUC(τ,D)    0.904321   0.498903   0.026905   0.061970   0.809364   0.831935
                    AUC(τ,F)    0.872586   0.300727   0.015234   0.003224   0.041702   0.000284
Synthetic dataset   AUC(D,F)    0.423741   0.630519   0.491987   0.999370   1.000000   1.000000
                    AUC(τ,D)    0.466774   0.476457   0.514915   0.375661   0.998860   0.858989
                    AUC(τ,F)    0.439160   0.402847   0.495558   0.000497   0.003939   0.000005
HAD100 dataset      AUC(D,F)    0.793811   0.935234   0.698967   0.811994   0.626478   0.999534
                    AUC(τ,D)    0.895538   0.504348   0.549460   0.024640   0.278232   0.808148
                    AUC(τ,F)    0.559483   0.015566   0.332796   0.002070   0.181698   0.001311
Table 4. The AUC values of AGDNR for object detection results in different categories of the four datasets.
Dataset              AUC         Category 1   Category 2   Category 3   Category 4   All Categories
San Diego            AUC(D,F)    0.999768     0.999904     0.999846     --           0.999829
                     AUC(τ,D)    0.787623     0.842028     0.772183     --           0.800620
                     AUC(τ,F)    0.000495     0.000387     0.000610     --           0.000497
Masic Avon           AUC(D,F)    0.999990     0.999836     0.999996     0.999984     0.999947
                     AUC(τ,D)    0.861449     0.773185     0.733864     0.861592     0.811629
                     AUC(τ,F)    0.000136     0.000110     0.000029     0.000087     0.000089
Synthetic dataset    AUC(D,F)    1.000000     1.000000     1.000000     --           1.000000
                     AUC(τ,D)    0.799914     0.895986     0.716441     --           0.775847
                     AUC(τ,F)    0.000005     0.000005     0.000022     --           0.000007
HAD100               AUC(D,F)    0.999995     0.999863     0.999672     0.999992     0.999868
                     AUC(τ,D)    0.862036     0.807737     0.712738     0.658273     0.796993
                     AUC(τ,F)    0.000185     0.000617     0.000327     0.000339     0.000356
Table 5. The IoU values of AGDNR and four variants on all categories of four datasets.
Model                 San Diego Dataset   Avon Dataset   Synthetic Dataset   HAD100 Dataset
AGDNR                 0.806               0.899          1.000               0.820
AGDNR w/o SC1         0.778               0.869          1.000               0.844
AGDNR w/o SC          0.776               0.883          0.944               0.810
AGDNR w/o SC&DS1      0.760               0.883          0.930               0.800
AGDNR w/o SC&DS       0.758               0.630          0.793               0.796