QiGSAN: A Novel Probability-Informed Approach for Small Object Segmentation in the Case of Limited Image Datasets

Andrey Gorshenin; Anastasia Dostovalova

doi:10.3390/bdcc9090239

and

Federal Research Center “Computer Science and Control” of the Russian Academy of Sciences, 119333 Moscow, Russia

^*

Author to whom correspondence should be addressed.

Big Data Cogn. Comput.2025, 9(9), 239;https://doi.org/10.3390/bdcc9090239

This article belongs to the Special Issue Recent Advances in Machine Learning Methods for Imperfect Large-Scale Data

Version Notes

Order Reprints

Abstract

The paper presents a novel probability-informed approach to improving the accuracy of small object semantic segmentation in high-resolution imagery datasets with imbalanced classes and a limited volume of samples. Small objects imply having a small pixel footprint on the input image, for example, ships in the ocean. Informing in this context means using mathematical models to represent data in the layers of deep neural networks. Thus, the ensemble Quadtree-informed Graph Self-Attention Networks (QiGSANs) are proposed. New architectural blocks, informed by types of Markov random fields such as quadtrees, have been introduced to capture the interconnections between features in images at different spatial resolutions during the graph convolution of superpixel subregions. It has been analytically proven that quadtree-informed graph convolutional neural networks, a part of QiGSAN, tend to achieve faster loss reduction compared to convolutional architectures. This justifies the effectiveness of probability-informed modifications based on quadtrees. To empirically demonstrate the processing of real small data with imbalanced object classes using QiGSAN, two open datasets of synthetic aperture radar (SAR) imagery (up to

0.5

m per pixel) are used: the High Resolution SAR Images Dataset (HRSID) and the SAR Ship Detection Dataset (SSDD). The results of QiGSAN are compared to those of the transformers SegFormer and LWGANet, which constitute a new state-of-the-art model for UAV (Unmanned Aerial Vehicles) and SAR image processing. They are also compared to convolutional neural networks and several ensemble implementations using other graph neural networks. QiGSAN significantly increases the

F_{1}

-score values by up to

63.93 %

,

48.57 %

, and

9.84 %

compared to transformers, convolutional neural networks, and other ensemble architectures, respectively. QiGSAN outperformed the base segmentors with the mIOU (mean intersection-over-union) metric too: the highest increase was

35.79 %

. Therefore, our approach to knowledge extraction using mathematical models allows us to significantly improve modern computer vision techniques for imbalanced data.

Keywords:

probability-informed neural networks; semantic segmentation; imbalanced data; small sample learning

1. Introduction

Semantic segmentation is a relevant approach used in various applied fields, such as autonomous driving [], navigation [], or Earth monitoring []. However, there is often a class imbalance in real data. One of the most significant examples is the segmentation of small or background objects in an image []. The number of elements in the training data for such objects is typically much lower than for other categories. This leads to greater intra-class variation than inter-class one, resulting in poor generalization of neural network (NN) architectures.

Modern large neural network architectures that deal with high-resolution images, such as visual transformers, demonstrate great performance in processing large datasets []. However, when the dataset is highly unbalanced, for instance, when some classes correspond to

1 %

of all training examples, these NNs demonstrate lower performance. Various modifications to NN architecture, the training process, and augmentation techniques are still being investigated to improve the efficiency of handling imbalanced image datasets [,,]. For imbalanced semantic segmentation, specific (usually weighted) loss functions such as weighted cross entropy, dual cross entropy, or focal loss and their modifications [,,,] can be used. Another approach consists of NN architecture modifications such as attention modules [] or special hierarchical decoders [,,,], which are designed to capture the details of small objects. Finally, the features from another modalities [] can be used to improve the segmentation accuracy.

However, such approaches often face difficulties when dealing with real-world problems, where the training datasets are limited [,,,,]. For example, in the small data case, the amount of data available is often limited to 1000 or fewer elements. In this case, NNs tend to just remember small objects [,]. The decrease in training sample size and intra-class variance can lead to the overfitting. Therefore, more complex and deep NN architectures, even with custom loss functions, are less effective. A solution can be found in ensemble approaches and mathematical models.

This paper introduces a novel probability-informed approach to improving the accuracy of neural network processing for small datasets with imbalanced classes. Our methodology is based on the possibility of obtaining additional information in the case of a limited volume of data using their mathematical (probability) models []. It is inspired by the principles of physics-informed machine learning []. Previously, this approach was shown to be effective for the segmentation problems with a small receptive field (

32 \times 32

pixels) [,]. This paper extends the basic idea to high-resolution image processing.

The aim of this paper is to develop a novel ensemble graph convolution neural network (GCN) architecture, informed by the Markov random field model, namely the quadtrees [], for the effective segmentation of small objects on high-resolution images with limited training data volumes. Therefore, Quadtree-informed Graph Self-Attention Network (QiGSAN) is proposed. It consists of novel quadtree-informed self-attention and compression blocks as well as graph convolutional layers to take into account the complex spatial and hierarchical interconnections between image pixels. Convolutional and transformer encoders are also tested in QiGSAN to extract image pixel features.

As an example of a real imbalanced segmentation problem, ship segmentation on high-resolution synthetic aperture radar (SAR) imagery is used. There are two imbalanced classes in this task: the target, which is the ship, and the background, which includes other surfaces. The target label can occupy only a few pixels or be barely distinguishable from noise. At the same time, real SAR image datasets are usually small, as the variability and uniqueness of the images makes them difficult to label [,], so they fit the definition of a small imbalanced dataset well.

The main contributions of the paper are as follows:

A novel probability-informed ensemble architecture QiGSAN was developed to improve the accuracy of small object semantic segmentation in high-resolution imagery datasets with imbalanced classes and a limited volume of samples;
New quadtree-informed architectural blocks have been introduced to capture the interconnections between features in images at different spatial resolutions during the graph convolution of superpixel subregions [];
The theorem concerning faster loss reduction in the quadtree-informed graph convolutional neural networks was proved;
Using QiGSAN, the ship segmentation accuracy ( $F_{1}$ -score) on high resolution SAR images (HRSID [] and SSDD [] datasets) increased up to $63.93 %$ compared SegFormer [] and LWGANet, a new state-of-the-art transformer for UAV (Unmanned Aerial Vehicles) and SAR image processing []. It also improves the results of non-informed convolutional architectures, such as ENet [] and DeepLabV3 [] (up to $48.57 %$ in $F_{1}$ -score), and other ensemble implementations with the various graph NNs (up to $9.84 %$ in $F_{1}$ -score).

The rest of the paper is organized as follows. Section 2 discusses known approaches to the construction of graph architectures and the division of images into superpixels for small sample learning cases. Section 3 presents details of our approach to small object segmentation based on QiGSAN architecture, including a proof of the theorem concerning faster loss reduction in the quadtree-informed graph convolutional neural networks. Section 4 is devoted to the results obtained by QiGSANs on the open datasets HRSID and SSDD. Section 5 presents a comprehensive ablation study. Section 6 summarizes the results achieved and discusses future research directions.

2. Related Works

In this section, we will briefly review some previous research on the subject area that is relevant to our research.

2.1. Small Data in Image Processing

Small datasets consist of a small number of observations. They are often used in practical data processing applications, including SAR objects segmentation. The high variability in the data, which is typical of high resolution images, and the difficulty in labeling SAR images limit the number of available images for training. Traditional machine learning approaches, such as modern transformer architectures or Faster R-CNN-based methods [], demonstrate a weak generalization ability and a tendency towards overfitting when processing small datasets []. The most popular methods for improving the efficiency of small data processing are augmentation, regularization [], and transfer learning []. However, in SAR image processing, these methods often face difficulties due to differences in the subject areas or the insufficient diversity in the augmented data []. Indeed, augmentations are constructed on a limited part of data, so they cannot reproduce the variability found in real data. Architectural modifications allowing for more efficient identification of the distinctive features of objects seem to be a promising approach. Such solutions may lie in finding the most important regions of an image for the correct identification of an object [,] or in methods for comparing global and local features of objects [].

Small object processing relates to class imbalance problems. The core issue is the lack of data to distinguish between rare small objects and the background. Weighted loss functions [] can be used for small objects. Modifications to the training process include range training data for segmentation []. Architecture modifications, such as the use of attention blocks and hierarchical encoder features [], and small object activation branches [] are also implemented.

2.2. Graph Neural Network Image Analysis

GCNs [] are used for the analysis of data whose internal structure can be described by a graph. Modern graph architectures used for practical tasks are often modifications of graph neural networks with attention (GAT) [] or simple GCN []. For example, GraphSAGE [] differs from simple GCN only in that it aggregates node features using a predefined number of neighbours. There are also GCN modifications that implement the self-attention mechanisms. For example, they can be used for preprocessing concatenated node features [,] or to detect global node attributes [].

Typically, graph processing architectures consist of the following parts: a feature encoder and a graph block that performs convolution using a two-dimensional grid. The grid represents pixel features at a single spatial resolution and does not consider their hierarchical properties. Methods that suggest using features at different resolutions are uncommon, and these features are not integrated into a single (trainable) structure like quadtrees. Therefore, implementing a quadtree into an NN architecture can significantly improve the efficiency of convolution on graphs. However, a quadtree has more nodes than a comparable grid graph, making it necessary to optimise it, particularly to process high resolution images.

Existing quadtree implementations are only used in specific problems, such as image tokenization [], or the implementation of multi-scale attention [,]. In addition, convolutions rather than graphs are usually used to represent spatial relationships [].

A well-known method for optimizing graph convolution is to reduce the number of nodes in the graph being processed from the total number of pixels in an image to a fixed number N. The structure of sub-areas may be irregular, when pixels are grouped according to their brightness properties []. It may also be regular if the superpixels have a predetermined size and shape.

The reduction in graph node numbers often leads to a decrease in the accuracy of data processing, which is similar to reducing the number of NN weights. In order to prevent this decrease in accuracy, especially in small sample learning, various modifications to GCNs have been proposed. Specifically, an effective method is related to different attention mechanisms, such as those based on the concatenation of feature vectors [] or the dot product [].

GCNs are also used for SAR image segmentation [] or object detection []. High resolution images processed by GCNs are divided into superpixels, which are image regions consisting of spatially adjacent pixels with similar brightness properties. The division into superpixels with different sizes takes into account the similarity between neighbouring pixels. However, this requires time-intensive preprocessing of images using specific algorithms such as Bayesian adaptive superpixel segmentation []. Using superpixels of the same size does not require the graph to adapt to each new image. However, it does not take into account the similarities in the reflective properties of the pixels. A solution to this problem is to use a hierarchical graph that aggregates node information at various spatial resolutions.

2.3. Probability-Informed Neural Networks

Probability informing is used to solve a wide range of tasks related to analyzing data with stochastic properties. It was demonstrated in papers [,,] that using probabilistic characteristics as additional NN features or constructing the NN architecture based on probability models (for example, deep Gaussian mixture one []) can improve the NN predictions for time-series analysis. Promising results were also obtained for SAR image analysis []. Probability informing can be used to estimate the uncertainty in NN predictions [], reliability of engineering systems, and model risk functions using NN []. It can also be utilized to refine cost function estimates [] or to improve the interpretability of spiking NN predictions []. In particular, these studies demonstrate that using probability models improves the accuracy of the processing with a limited number of training examples. This approach has previously been used to analyze turbulent plasma data [,] and information flows [], but there are no published studies in the literature on SAR images based on these principles.

2.4. Summary

This paper develops a methodology for using GCNs with attention to process small objects in datasets of high-resolution images with a limited volume of training samples. A probability-informed approach has been proposed to improve GCN’s accuracy. Division into regular superpixels organized into a quadtree structure has been introduced in order to improve the computational efficiency of NNs, while taking account of the similarity between neighbouring pixels. Our model requires less training data and computational resources than other types of neural networks, which makes it applicable to small sample learning problems. Therefore, this approach can also be applied in cases where commonly used NN models do not produce good results.

3. Methodology

This section is devoted to the mathematical formulation of the research problem and a description of our methodology for solving it.

3.1. Problem Statement

Let

X = {X^{k}}_{k = 1, \dots, K}

be a set of images that needs to be segmented. That is, for each image, we need to determine for each pixel

X_{i j}^{k}

,

i = 1, \dots, H_{X}

,

j = 1, \dots, W_{X}

, where

H_{X}, W_{X}

are height and width of images, the most probable class label

y_{i j}^{k}

from two variants (t denotes the target or background class number, the target class corresponds to index 0):

y_{i j}^{s} = \underset{t = 0, \dots, 1}{arg max} P (X_{i j}^{k} ∣ t) .

(1)

Let

F (\cdot)

be a function of some neural encoder, and let the set

X

correspond to small data, that is,

K \sim 1000

. The research problem is to improve the accuracy of the small target segmentation obtained by

F (\cdot)

and avoid overfitting.

The suggested solution is based the construction of an ensemble of

F (\cdot)

with the another neural network

G (\cdot)

to model the spatial connections between pixels using a quadtree adjacency matrix A (that is, a quadtree-informed model):

G : R^{c h + 2} \times A \to R^{2},

(2)

where

c h

is the number of image channels.

G (\cdot)

takes a vector

X_{c a t}^{k} = (F (X^{k}), X^{k}) \in R^{c h + 2}

as input data, where

F (X^{k}) \in R^{2}

(there are two classes: target and background).

G (\cdot)

has to improve the segmentation of the target class with respect to the probability of the correct class label choosing, as follows:

P (G {(X^{k}, F (X^{k}), A)}_{0} = Y_{0}^{k}) ⩾ P (F {(X^{k})}_{0} = Y_{0}^{k}),

(3)

where

Y^{k}

is the real class label for

X^{k}

pixels, and index 0 corresponds to small objects.

3.2. Quadtree-Informed Graph Self-Attention Networks

The section presents a methodology for SAR image segmentation using QiGSAN. Figure 1 presents the corresponding NN architecture.

Figure 1. Architecture of Quadtree-informed Graph Self-Attention Networks. It consists of 5 blocks:: the first is the preprocessing block with convolutional layers; the second is the block for forming features using quadtree layers; the third is a superpixel forming block; the fourth performs graph convolution operations; and the fifth is a compression block using quadtree layer.

This is an ensemble neural network consisting of a pre-trained encoder

F (\cdot)

, which can be implemented by a convolutional or transformer network, and a quadtree-informed GCN

G (\cdot)

.

F (\cdot)

and

G (\cdot)

are trained separately. A graph self-attention mechanism is used in

G (\cdot)

that is based on the dot product. This architecture consists of 5 blocks. Quadtree-informed stages, see blocks “2” and “5” on Figure 1, are placed within the dashed borders.

3.3. Pre-Processing of Image Features by Convolutional Layers

Image features obtained from the output and preceding n-channel layers of the

F (\cdot)

are used as additional input data to the graph neural network

G (\cdot)

, which takes the composition of the source image and the encoder features as inputs. First, these features are consistently processed in the first architectural block by three two-dimensional convolutional layers

c_{1 - 3} (\cdot)

, with a kernel size

3 \times 3

, and a GeLU (Gaussian Error Linear Units) activation function [,,]. In the first and second layer, the number of output channels increases by two times compared to input ones. In the third layer, it decreases to one. Convolutional layers compress the information about the pixel features before dividing the image into superpixels and creating the quadtree.

3.4. Superpixels

Before performing the graph convolution operation, the outputs of the three convolutional layers and the constructed quadtree layers on their basis

{X_{Q_{i}}^{k}}_{Q_{i} = 1, \dots, h}

(see Section 3.5) are divided into superpixels of size

M \times M

, where M is a hyperparameter.

In the block “3” (see Figure 1), pixels from different quadtree layers with sizes

H_{X}^{Q_{i}} \times W_{X}^{Q_{i}}

,

Q_{i} = 1, \dots, h

are concatenated to form a vector

x_{Q}

and then combined to create a superpixel vector

x_{Q}^{(sp)}

in the

C (\cdot)

procedure, which is equivalent to the PyTorch unfold operation that creates a vector of vectors of image segments with size

M \times M

:

C : R^{H_{X}^{Q_{i}} \times W_{X}^{Q_{i}}} \to R^{\frac{H_{X}^{Q_{i}} \cdot W_{X}^{Q_{i}}}{M^{2}} \times M^{*}},

(4)

where

M^{*} = M^{2}

is the size of the pixel features in a superpixel. The superpixel vector is obtained by applying

C (\cdot)

to each of the h of quadtree layers of

x_{Q}

. Obtained vector

x_{Q}^{(sp)}

is a quadtree node vector that is used in the graph convolution operation.

3.5. Quadtree Informing

The quadtree informing of QiGSAN is implemented in the second and fifth blocks of

G (\cdot)

called “Quadtree layers compression” and “Quadtree structure forming” (see Figure 1), which precede superpixel formation.

A quadtree of height h is a hierarchical structure, where elements are organized into

h > 1

layers, with each element on the

k - 1

-th layer being connected to four elements (children) on the

k^{t h}

layer and no more than one of the elements on the

k - 2

-th layer. In the

S_{k}

layer, elements are often organized as

{s_{i, j}^{k}}

,

i, j = 1, \dots, L_{k}

and

L_{k} = 2 \cdot L_{k - 1}

. Figure 2 presents an example of a quadtree with height

h = 2

constructed from an image of size

L \times L

, which is divided into superpixels. A source image is placed in the

h^{t h}

layer of the quadtree. Features of other quadtree layers are usually [] constructed with average pooling with a kernel size of

2^{h - i}

,

i = 1, \dots, h - 1

. The input of the “Quadtree structure forming” block is processed through

h - 1

pooling layers, and these results are concatenated to form a vector of quadtree layers divided into superpixels.

Figure 2. Quadtree scheme when with height

h = 2

.

The neighboring elements of the

S_{k}

layer are also connected to each other. Then, the quadtree becomes a spatial-hierarchical one []. An approach to establishing interconnections within the layer is using a two-dimensional grid, as shown in Figure 3. Such a graph structure does not have hierarchical interconnections and its elements,

{s_{i, j}^{k}}

, are connected to their nearest neighbors horizontally, vertically, and along diagonals. This quadtree model has the Markov property [] because the state of an element depends only on its nearest neighbors.

Figure 3. Spatial relations for the

(i, j)

element of a two-dimensional grid.

The modern graph convolutional neural networks for image analysis use two-dimensional grids and their variations as image representations. Let us demonstrate that that the use of quadtrees improves the training of neural networks. Let

X \in R^{N \times N}

,

X \in X

, be a processed image. Denote the pixels of X as

X_{i, j}

,

i, j = 1, \dots, N

and the vector of these pixels as

x \in R^{N^{2} \times 1}

. The elements of the vector

x

are obtained from X by the rule

x_{i \cdot N + j} = X_{i, j}

. Let

x^{(s p)} \in R^{\frac{N^{2}}{M^{2}} \times M^{2}}

be a vector of superpixels of size M, obtained from the image X. Denote

x_{Q}

as a vector that contains all the elements in a quadtree with h layers. Vector

x_{Q}

is constructed in the block Quadtree structure forming, see Figure 1. Features for the

h - 1

quadtree layers are constructed from the output of the previous convolutional layer

c_{1 - 3} (c o n c a t (X, F (X))

using several average pooling with a kernel of size

2^{h - i}

,

i = 1, \dots, h - 1

. Vector

x_{Q}^{(s p)}

is the quadtree one divided into superpixels:

x_{Q}^{(s p)} = C (x_{Q})

, see Section 3.4.

Linear graph convolution network (GCN)

G (\cdot)

can be defined as a network of the following form []:

G (x^{(s p)}) = A \cdot x^{(s p)} \cdot B,

(5)

where

A \in R^{\frac{N^{2}}{M^{2}} \times \frac{N^{2}}{M^{2}}}

is an adjacency matrix, and B is a linear transformation matrix, for which we propose that

x^{(s p)} \cdot B \in R^{\frac{N^{2}}{M^{2}} \times M^{2}}

. Let

G_{s k i p} (\cdot)

be the linear GCN with the skip connections [], that is

G_{s k i p} (x^{(s p)}) = \sum_{l = 0}^{H} W_{(l)} \cdot X_{(l)},

(6)

where

X_{(l)} = A \cdot B_{(l)} \cdot X_{(l - 1)}

,

X_{(0)} = x^{(s p)}

, and H is the number of graph convolution layers. According to the Formula (6), when

H = 1

, we have

G_{s k i p} (x^{(s p)}) = W_{(0)} \cdot x^{(s p)} + W_{(1)} \cdot G (x^{(s p)})

.

Theorem 1.

Let

G (\cdot)

and

G_{Q} (\cdot)

be one-layer linear GCNs. If the graph convolution of

G (\cdot)

is a two-dimensional grid and the graph convolution of

G_{Q} (\cdot)

is a spatial-hierarchical quadtree with h layers, then

\frac{d}{d t} (L_{t} (G (x^{(s p)}), Y)) < \frac{d}{d t} (L_{t} (G_{Q} (x_{Q}^{(s p)}), Y)),

(7)

where

L_{t} (\cdot, y)

is an arbitrary differentiable loss function, Y are the labels, and t is the number of training epoch. This implies that the loss function decreases faster in the quadtree-informed GCNs.

Proof.

It is well known [] that

\frac{d}{d t} (L_{t} (G (\cdot), y)) < \frac{d}{d t} (L_{t} (G_{s k i p} (\cdot), y))

(8)

for an arbitrary differentiable loss function

L_{t} (\cdot, y)

in the case of linear models. So, to prove theorem’s claim, it suffices to show that the

G_{Q} (\cdot)

is a NN with skip connections.

For a vector

x_{Q}

, constructed using average pooling, the following representation holds:

x_{Q} = U \cdot x,

(9)

where U is the block matrix composed from h blocks

{(U_{2^{h - 1}}, \dots, U_{2}, I)}^{T}

, I is the identity matrix, and

U_{p} \in R^{\frac{N^{2}}{p^{2}} \times N^{2}}

is a matrix, in which rows contain p sequences of units of the length p,

p = 1, \dots, h - 1

, separated by

N - 1

zeros.

U_{p}

is a pooling matrix, and p represents the size of a pooling area. Multiplication at

U_{p}

means that each element in the result vector is obtained as the mean of the elements in the

p \times p

field from the image X. When X is represented as a vector

x

,

p \times p

, field is placed in

x

row by row with each row consisting of p units and the end and beginning of each row separated by

N - 1

zeros. All elements after the final row are zeros. One can calculate directly that

a v g_{p} (x) = a v g_p o o l (x, 2^{p} \times 2^{p})

is equal to the

U_{p} \cdot x

. Therefore,

x_{Q} = (a v g (x), x)

where

a v g (x) = (a v g_{h - 1} (x) \dots, a v g_{2} (x))

can be obtained by multiplying the block matrix U and the vector x.

Next, denote

x_{f l a t}

as the vector

x^{(s p)}

resized to

N^{2} \times 1

, preserving the ordering of pixels in the superpixel structure. The difference between the image X and vectors

x_{f l a t}

,

x

, and

x^{(s p)}

is shown in Figure 4.

Figure 4. Difference between image X and vectors

x_{f l a t}

,

x

, and

x^{(s p)}

(

N = 4

,

M = 2

).

Figure 4 shows that each consecutive

M^{2}

element of

x_{f l a t}

belongs to a single image superpixel. The unfolding operation is carried out on the vector

x^{(s p)}

from left to right and from top to down. Then, by performing direct multiplication, one can verify that

x_{f l a t}

can be obtained by matrix multiplication from

x

:

x_{f l a t} = D \cdot x

where

D = {(D_{1}, \dots, D_{\frac{N^{2}}{M^{2}}})}^{T}

and

D_{i} \in R^{M^{2} \times N^{2}}

and have the following form:

D_{i} = (\begin{matrix} 0_{M \times M \cdot i}, I_{M \times M}, 0_{M \times (N^{2} - M \cdot (i + 1))} \\ 0_{M \times (N + M \cdot i)}, I_{M \times M}, 0_{M \times (N^{2} - M \cdot (i + 1) - N)} \\ \dots \\ 0_{M \times (N + i) \cdot M}, I_{M \times M}, 0_{M \times N^{2} - (N + i + 1) \cdot M} \end{matrix}),

(10)

Matrix D reorders the elements of

x

, forming the flattened vector of superpixels

x^{(s p)}

. The multiplication

D_{i} \cdot x

forms the vector of a length

M^{2}

of pixels from the

i^{t h}

superpixel. This is constructed using the fact that superpixel elements are placed into M consecutive rows, with each element located one after another in the vector

x

.

D_{i}

is a block diagonal matrix containing M identical matrices

I_{M \times M}

, shifted left by

N \cdot j + i \cdot M

where j is a number of identical matrix in

D_{i}

. The jth identical matrix extracts the j-th row of the i-th superpixel, preserving the order of the elements in the resulting vector. The shift of the next identical matrix in

D_{i}

is equal to the shift of the

j + 1

superpixel row in the vector

x

.

The vector

x_{f l a t}

was defined to show that the vector of superpixels

x^{(s p)}

can be represented as a linear transformation of the vector

x

, as follows:

x^{(s p)} = \sum_{i = 1}^{\frac{N^{2}}{M^{2}}} \sum_{j = 1}^{M^{2}} {(e_{i}^{1})}^{T} \cdot e_{(i - 1) \cdot M^{2} + j}^{2} \cdot x_{f l a t} \cdot e_{j}^{(3)},

(11)

where

e_{i}^{k}

,

k = 1, \dots, 3

are the basis vectors, and

e_{i}^{1} \in R^{1 \times \frac{N^{2}}{M^{2}}}

,

e_{i}^{2} \in R^{1 \times N^{2}}

and

e_{i}^{3} \in R^{1 \times M^{2}}

.

For the vector

x^{(s p)}

and the matrices

A = {a_{k l}} \in R^{\frac{N^{2}}{M^{2}} \times \frac{N^{2}}{M^{2}}}, B = {b_{k l}} \in R^{M^{2} \times M^{2}}

, the following equations hold:

\begin{matrix} A \cdot x^{(s p)} \cdot B = \sum_{i = 1}^{\frac{N^{2}}{M^{2}}} \sum_{j = 1}^{M^{2}} {(e_{i}^{1})}^{T} \cdot e_{(i - 1) \cdot M^{2} + j}^{2} \cdot A^{f l a t} B^{f l a t} \cdot D \cdot x \cdot e_{j}^{3}, \end{matrix}

(12)

where

A^{f l a t} = {A_{k}^{f l a t}}, B^{f l a t} = {B_{k}^{f l a t}}, B_{k}^{f l a t} \in R^{\frac{N^{2}}{M^{2}} \times N^{2}}, A_{k}^{f l a t} \in R^{\frac{N^{2}}{M^{2}} \times N^{2}}

,

k = 1, \dots, J

, and

J = \frac{N^{2}}{M^{2}}

:

\begin{matrix} A_{k}^{f l a t} = (\begin{matrix} a_{k 1}, \underset{M^{2} - 1}{\underset{︸}{0 . . 0}}, a_{k 2}, . ., a_{k J}, 0 . . 0 \\ 0, a_{k 1}, \underset{M^{2} - 1}{\underset{︸}{0 . . 0}}, a_{k 2}, . ., a_{k J}, 0 . . 0 \\ \dots \\ \underset{M^{2} - 1}{\underset{︸}{0 . . 0}}, a_{k 1}, \underset{M^{2} - 1}{\underset{︸}{0 . . 0}}, a_{k 2}, \dots a_{k J} \end{matrix}), B_{k}^{f l a t} = (\begin{matrix} \underset{k \cdot M^{2}}{\underset{︸}{0 . . 0}}, b_{11} . . b_{M 1}, b_{M + 1, 1} . . b_{M^{2}, 1}, . . 0 \\ \dots \\ \underset{k \cdot M^{2}}{\underset{︸}{0 . . 0}}, b_{1 M} . . b_{M M}, b_{M + 1, M} . . b_{M^{2}, M}, . . 0 \end{matrix}) . \end{matrix}

(13)

Multiplication

A_{k}^{f l a t} \cdot D \cdot x

presents the results of multiplying the kth row of matrix A and

x^{(s p)}

. The offsets

M^{2} - 1

in

A_{k}^{f l a t}

are used to perform the multiplication of the kth row with the column containing the elements in the same positions from different superpixels. Multiplication

B_{k}^{f l a t} \cdot D \cdot x

presents the results of multiplying each kth superpixel elements and the matrix B. Each row of

B_{k}^{f l a t}

contains the elements from each column of B. Direct multiplication leads to the Equation (12), which shows that linear modifications of

x^{(s p)}

can be represented by a linear transformation of

x

.

Let A be the adjacency matrix for a two-dimensional grid graph convolution and

A_{Q}

be the adjacency matrix of the spatial-hierarchical quadtree. A describes the spatial connections between the elements in the h-layers quadtree, and

A_{Q}

can be represented as follows:

A_{Q} = A_{q} + (\begin{matrix} 0 & 0 \\ 0 & A \end{matrix})

. For the vector of quadtree’s features divided into superpixels

x_{Q}^{(s p)}

, Formula (9) implies that

x_{Q}^{(s p)} = U \cdot x^{(s p)}

. Then, the

G_{Q} (x_{Q}^{(s p)})

can be represented as follows:

\begin{matrix} G_{Q} (x_{Q}^{(s p)}) = A_{Q} \cdot x_{Q}^{(s p)} \cdot B = A_{q} \cdot x_{Q}^{(s p)} \cdot B + (\begin{matrix} 0 & 0 \\ 0 & A \end{matrix}) \cdot x_{Q}^{(s p)} \cdot B . \end{matrix}

(14)

In this formula

(\begin{matrix} 0 & 0 \\ 0 & A \end{matrix}) \cdot x_{Q}^{(s p)} = (\begin{matrix} 0 & 0 \\ 0 & A \end{matrix}) \cdot (\begin{matrix} 0 \\ x^{(s p)} \end{matrix}) = (\begin{matrix} 0 \\ I_{\frac{N^{2}}{M^{2}} \times \frac{N^{2}}{M^{2}}} \end{matrix}) \cdot A \cdot x^{(s p)} .

(15)

Moreover,

A_{q} \cdot x_{Q}^{(s p)} \cdot B = A_{q} \cdot U \cdot x^{(s p)} \cdot B = A_{q U} \cdot x^{(s p)} \cdot B

. Then

\begin{matrix} G_{Q} (x_{Q}^{(s p)}) = (A_{q U} \cdot x^{(s p)} + (\begin{matrix} 0 \\ I_{\frac{N^{2}}{M^{2}} \times \frac{N^{2}}{M^{2}}} \end{matrix}) \cdot A \cdot x^{(s p)}) \cdot B = \\ = A_{q U} \cdot x^{(s p)} \cdot B + (\begin{matrix} 0 \\ I_{\frac{N^{2}}{M^{2}} \times \frac{N^{2}}{M^{2}}} \end{matrix}) \cdot G (x^{(s p)}) . \end{matrix}

(16)

According to the Formula (6) for the case of

H = 1

and Formula (12), which proposes that

A_{q U} \cdot x^{(s p)} \cdot B

can be denoted by

W_{(0)} \cdot x^{(s p)}

),

G_{Q} (\cdot)

is a graph convolutional network with a skip connection and an adjacency matrix A. □

The theorem presented above demonstrates that using quadtrees to describe spatial interconnections between pixels improves the training process of graph neural networks. This is seen as a theoretical justification for the results of testing QiGSAN and comparing it to other types of graph.

In the block “Quadtree layers compression”, the graph convolution output is divided into h subvectors (Figure 1 presents the case where

h = 5

), where each of them corresponds to one of the quadtree layers. An image is formed from each subvector. Starting with

k = 1

, each

k^{t h}

layer is scaled by bilinear interpolation to match the size of of

k + 1

-th layer and then concatenated with it. The resulting multi-channel image is processed using two-dimensional convolutions. This method allows you to consistently take into account the influence of elements from the upper layers of the quadtree on the final image segmentation result.

3.6. Graph Self-Attention Operation

The graph convolution operation in block “4” (see Figure 1) is performed on a self-attention graph layer (the corresponding NN can be called “GSAN”), incorporating the self-attention mechanism into the graph network []. We use a graph self-attention based on the dot product (see, for example, []) instead of a standard GAT’s variant, which is based on concatenation (see []):

x_{i}^{(a t t - s p)} = \sum_{j = 1}^{N} s o f t m a x (A \cdot L e a k y R e L U (a^{T} \cdot x_{i}^{(s p)} \cdot x_{j}^{(s p)})) \cdot x_{j}^{(s p)} \cdot W .

(17)

where

X_{i}

is a feature vector for the

i^{t h}

node,

X_{i}^{*}

is a vector of updated node features, A is the adjacency matrix, and W and a are the trainable weights. The use of GSAN allows us for a better understanding of the relationships between the features of a quadtree.

4. Experiments

This section contains details about testing QiGSAN using open SAR datasets.

4.1. Datasets

QiGSAN is tested on SAR images of ships from HRSID [] and SSDD [] datasets. To simulate a lack of training data for HRSID, only images from the testing part are used. Table 1 presents descriptions of the datasets, including the sizes and resolutions of the images processed as well as the size of objects within them. Following a structure proposed by the creators of HRSID [], the objects were split into three categories: small (the object area is less than

32 \times 32

pixels), middle (less than

64 \times 64

), and large (greater than

64 \times 64

) objects. Examples of the segmented images are presented in Figure 5 and Figure 6.

Table 1. Characteristics of datasets.

Figure 5. Examples of images from HRSID. The ships are marked (see bottom row).

Figure 6. Examples of images from SSDD. The ships are marked (see bottom row).

4.2. Training, Metrics, and Hyperparameters

To estimate the efficiency of the proposed approach, a five-fold cross-validation was conducted. Overall,

70 %

of the four folds were used for training, and the last

30 %

was used as the validation set. So, the training sample volume is about

56 %

of the full data. All source images were processed using patches of size

256 \times 256

pixels.

An augmentation used for training increases the diversity of the dataset. Each image patch is rotated at a random angle (from 0 to 180 degrees) and is shifted by a random value less than 128 pixels. In the case of the HRSID dataset, objects are mostly small. Patches containing ships were enhanced by applying the Gaussian blur transform [] to them.

The

F_{1}

-score calculated using the Precision and Recall values is a metric for estimating segmentation accuracy, as follows:

\begin{matrix} F_{1} = \frac{2 \cdot R e c a l l \cdot P r e c i s i o n}{R e c a l l + P r e c i s i o n}, P r e c i s i o n = \frac{T P}{T P + F P}, R e c a l l = \frac{T P}{T P + F N}, \end{matrix}

(18)

where True Positive (TP) are pixels, correctly classified as the chosen class, False Positive (FP) are pixels, incorrectly determined to be the chosen class, and False Negative (FN) are pixels of the chosen class that are classified incorrectly.

The batch size is 32. Adam [] is used as an optimizer. This demonstrates better convergence compared to AdamW [] and RMSProp []. The cross-entropy

H (p, q)

is the loss function, as follows:

H (p, q) = D_{K L} (p ∣ q) - \sum_{x} p (x) l n (q (x)),

(19)

where

p (x)

and

q (x)

are the target and estimated distributions, and

D_{K L} (p ∣ q)

is a Kullback–Leibler divergence [].

The description of the hyperparameters and their possible values are presented in Table 2.

Table 2. The description and range of change for the NNs hyperparameters.

For graph network

G (\cdot)

processing, all images are divided into superpixels. The best results for both datasets were obtained when the superpixel size was

8 \times 8

(

M = 8

) and the quadtree height was

h = 5

. Most HRSID and SSDD objects are very small (less than

32 \times 32

pixels), and using larger superpixels results in losing object details. In addition, using smaller ones increases computational complexity and introduces noise distortion.

4.3. Variants of the $G (\cdot)$ Network

QiGSAN is compared with several graph architectures that can be used as alternatives to

G (\cdot)

although they are all based on the same encoders (ENet, FCN, DeepLabV3, SegFormer, and LWGANet). These networks form two groups: quadtree-informed networks and NNs (see Figure 7). The acronyms for these architectures also depend on the implementation of graph block.

Figure 7. Variants of the

G (\cdot)

network.

Quadtree-informed architectures (see the orange dashed block at the bottom left corner of Figure 7) are similar to QiGSAN (see Figure 1), except for the graph block:

QiGSAN uses self-attention graph layers;
GCN-QT and GAT-QT are based on vanilla GCN and GAT, respectively.

The NNs from the second group use a two-dimensional grid graph for image processing. These networks do not process the multi-scale features from quadtree layers. So, these architectures consist of only blocks “1”, “3”, and “4” (see Figure 1). Uninformed alternatives of QiGSAN are called GCN, GSAN, and GAT by the acronyms of the chosen layer in the graph block.

4.4. Main Results on SAR Images

This section compares the results of QiGSAN with those of NNs using alternative implementations of

G (\cdot)

, as well as various convolutional and transformer architectures. Some of these, such as LWGANet and DeepLabV3 and U-Net++ with across feature map attention (AFMA) [], were designed to improve the accuracy of small object recognition.

Table 3 presents the best Recall, Precision, and

F_{1}

-score values for ship segmentation accuracy obtained in five-fold cross-validation on HRSID and SSDD datasets (the best metric values for each dataset are highlighted in bold). It contains the results of convolutional NNs (ENet, DeepLabV3 and FCN), transformers (SegFormer, LWGANet), and proposed graph ensemble architecture with different implementations of the

G (\cdot)

network, such as GCN-QT, QiGSAN, GAT-QT, GCN, GSAN, and GAT. A more detailed analysis of the ensemble results, including the accuracy estimates obtained for each encoder, is provided in Section 5.

Table 3. Mean and median Recall, Precision, and

F_{1}

-score values (

\times 100 %

) for 5-fold HRSID and SSDD cross-validation. The best metric values for dataset are highlighted in bold.

QiGSAN improves the accuracy of ship segmentation, compared to the results of all tested configurations. The increase in the mean

F_{1}

values is

0.33

–

63.93 %

compared to the transformer results (

0.33

–

36.97 %

for the best ones). For the convolutional NNs, the mean increase in

F_{1}

is

9.91

–

48.57 %

(

9.91

–

23.81 %

for the best ones).

QiGSAN achieves the highest accuracy results among all ensemble modifications. The mean

F_{1}

increase for informed (GCN-QT and GAT-QT) networks is

1.15

–

4.45 %

, whereas for non-informed networks, it is

2.95

–

9.84 %

. QiGSAN also outperformed AFMA DeepLabV3, AFMA U-Net++, and LWGANet with an increase in

F_{1}

-score of

5.24 %

to

63.93 %

.

QiGSAN achieves the best Precision values for both datasets. The increase is 1.34–70.38%, compared to the baseline convolution networks and transformers. According to Table 3, QiGSAN’s Recall value is often lower than that of other networks, but this difference is offset by the significant increase in precision, resulting in an increase in

F 1

-score and improved image segmentation quality. Examples of HRSID and SSDD image segmentation are shown in Figure 8 and Figure 9.

Figure 8. Segmentation of HRSID images by FCN, GCN, and QiGSAN.

Figure 9. Segmentation of SSDD images by ENet, GCN, and QiGSAN.

The results obtained by QiGSAN and the base segmentors (FCN, ENet, SegFormer, LWGANet, and DeepLabV3) were also compared using the mIoU (mean Intersection-over-Union) metric. For HRSID, the QiGSAN’s mIoU was

82.76 %

, while the mIoUs of the base segmentor models ranged from

73.66 %

to

80.87 %

. This represents an increase of 1.88–9.09%. For SSDD, QiGSAN had an mIoU score of

83.94 %

, and the mIoUs of the other models ranged from

48.14 %

to

66.85 %

. This represents a range of increases from

17.08 %

to

35.79 %

.

5. Ablation Study

Within the ablation study, several issues related to the influence of architectural elements on results are considered.

5.1. HRSID

The estimations of the ship segmentation accuracy obtained in five-fold cross-validation on HRSID images are presented in Table 4. The column “Vanilla values” contains the

F_{1}

-score values obtained by the pre-trained convolutional NNs and transformers used in the ensembles as an

F (\cdot)

encoders, which are mentioned in the first column. Each cell in other columns contains a mean

F_{1}

value and its standard deviation, as well as the median (in brackets) for the graph ensemble NNs. The ensemble configuration is defined by the cell location:

F (\cdot)

and

G (\cdot)

are defined by row and column name, respectively. The best metric values for each

F (\cdot)

encoder are highlighted in bold.

Table 4. Mean and Median

F_{1}

-score values (

\times 100 %

) for five-fold HRSID cross-validation. The best metric values for each

F (\cdot)

encoder are highlighted in bold.

QiGSAN improves the accuracy of ship segmentation according to the results from all convolutional and transformer encoders. The increase in mean

F_{1}

values is

0.33

–

5.24 %

for transformer encoders and

1.89

–

15.07 %

for convolutional ones. For median values, the increase is

0.47

–

6.92 %

for both encoder type. QiGSAN also outperforms simpler graph block implementations, such as GCN-QT and GAT-QT in terms of accuracy. The advantage is between

2.8 %

and

25.85 %

for the networks with the same encoders. Furthermore, the quadtree-informed networks significantly outperforms GSAN, GAT, and GCN using a graph-grid. The difference in mean

F_{1}

value is between

2.53

and

24.25 %

. Applying networks without informing images from HRSID in two out of the three cases does not improve the segmentation accuracy compared to the basic convolutional encoder.

5.2. SSDD

The estimations of the ship segmentation accuracy obtained in five-fold cross-validation on SSDD images are presented in Table 5. The best metric values for each

F (\cdot)

encoder are highlighted in bold.

Table 5. Mean and Median

F_{1}

-score values (

\times 100 %

) for 5-fold SSDD cross-validation. The best metric values for each

F (\cdot)

encoder are highlighted in bold.

For all base convolutional and transformer encoders, the graph neural network processing improves the ship segmentation results. The best accuracy metric values are achieved using QiGSAN. The increase in mean

F_{1}

values is 10.3–56.48% for convolutional encoders and 27.36–63.93% for transformer ones. For median values, the increase is

12.47

–

64.94 %

. QiGSAN outperforms the results of GCN-QT and GAT-QT. An advantage in

F_{1}

-score is up to

16.11 %

for the networks with the same encoders. The quadtree-informed networks significantly outperform GSAN, GAT, and GCN: the difference in the mean

F_{1}

value is between

2.47 %

and

17.73 %

.

The presented results are independent of the fold splitting in cross-validation. Additional experiments for three-fold cross-validation on SSDD are shown in Table 6. The best metric values for each

F (\cdot)

encoder are highlighted in bold. In all cases, quadtree-informed networks demonstrate the best accuracy metrics. The increase in

F_{1}

values is

17.02

–

63.36 %

, while GSAN, GAT, and GCN accuracies increase by

11.15

–

59.27 %

.

Table 6. Mean and median

F_{1}

-score values (

\times 100 %

), 3-fold cross-validation (SSDD). The best metric values for each

F (\cdot)

encoder are highlighted in bold.

5.3. Comparison of QiGSAN with Other Configurations of Graph Networks

Theorem 1 implies a faster loss function reduction in quadtree-informed neural networks. This theoretical result was confirmed by experiments, as shown in the curves for informed and non-informed graph neural network with the LWGANet encoder, presented in Figure 10, for both datasets.

Figure 10. Examples of loss functions for HRSID (a) and SSDD (b).

According to the results of the experiments presented in Section 5.1 and Section 5.2, quadtree-informed architectures are able to segment ships more accurately than the networks without informing processes, see Figure 11 and Figure 12.

Figure 11. Changes in

F_{1}

-score for different architectures, compared to the values for the corresponding vanilla encoder (HRSID).

Figure 12. Changes in

F_{1}

-score for different architectures, compared to the values for the corresponding vanilla encoder (SSDD).

They present the differences in ship segmentation accuracy averaged over five-fold cross-validation between the ensemble graph network and vanilla encoders. Proposed architectures learn effectively with a small number of training samples: both HRSID and SSDD contain no more than 1962 images. In all cases, QiGSAN demonstrates the best results in accuracy metric values. In addition, this architecture shows superior results on HRSID, in which

87.4 %

of objects are small (see Section 4.1) and comparable to the superpixel size. The ensemble architectures based on the FCN (see Figure 11) and LWGANet (see Figure 12) encoders demonstrate the maximum differences in

F_{1}

value. While LWGANet shows the lowest accuracy in ship segmentation, QiGSAN based on LWGANet achieves the best accuracy for data segmentation.

The results obtained by vanilla FCN and LWGANet show the lowest accuracy in ship segmentation for both datasets. These are the NNs with the largest number of parameters among other base segmentors, as shown in Table 7, so these NNs are prone to overfitting if the training dataset is small. Ensemble with graph networks combine the information from networks with large numbers of parameters to produce more accurate predictions.

Table 7. The number of parameters (in thousands) and the computing performance (in GFLOPS) of base segmentors

F (\cdot)

and graph networks

G (\cdot)

.

Table 7 shows that the graph NN

G (\cdot)

from QGISAN contains about 370,000 parameters. This small network combined with convolution or transformer NNs can improve segmentation accuracy by up to

59.27 %

. For SSDD, the ensemble NN that achieves one of the highest segmentation accuracies (based on the ENet encoder) has much fewer parameters compared to the best baseline network (DeeplabV3). Table 7 also demonstrates that quadtree-informed NNs have more parameters compared to non-informed ones. However, the majority of these extra parameters correspond to normalization layers in block “5” in the QiGSAN architecture, see Figure 1. Therefore, increasing the number of parameters does not necessarily lead to decreased performance: the increase in FLOPS is up to

3.57 %

.

6. Conclusions and Discussion

6.1. Discussion

QiGSAN leads to more accurate results for small object segmentation than all other architectures, including transformers, and can also be used on large datasets with imbalanced or heterogeneous classes, which is common in real-world datasets. The main limitation of neural network architectures that QiGSAN compares with is the low training efficiency on small and imbalanced datasets. Another limitation of QiGSAN is the correlation between the accuracy values obtained from the base encoder network and the graph part of the architecture. Accurate predictions from the base encoder generally improve the results of QiGSAN.

In a few aspects, QiGSAN can be described as a trustworthy graph neural network []. The first aspect is explainability, as QiGSAN has a graph block based on a quadtree model. This is an effective method for describing the connections between features in images with different spatial resolutions. Second, fairness is important, as QiGSAN helps improve the segmentation of smaller objects. Third, environmental well-being is also important, since the graph network used in QiGSAN has only a few thousand parameters, significantly increasing segmentation accuracy even with small and weak encoder models. However, robustness and privacy issues are not addressed in this paper.

6.2. Summary

The paper proposes a novel ensemble graph neural network QiGSAN to perform semantic segmentation of imbalanced small datasets of high-resolution images. It is composed of a convolutional or transformer encoder and a graph network with a special self-attention block and block of multi-resolution feature compression informed by a quadtree. To process high resolution images, QiGSAN contains a special module for dividing images into superpixels of a fixed size incorporated into NN architecture, which reduces the number of graph nodes. Different well-known implementations of graph convolution, GCN and GAT, were tested. Comparison is also made to the traditional approach, which uses a two-dimensional grid to represent connections between pixels as a graph.

QiGSAN was applied to segment ships on real SAR images from two open datasets. Experiments showed that quadtree-informed networks are more efficient for small objects than transformer or convolutional NNs. Using QiGSAN, the increase in

F_{1}

-score values compared to convolutional NNs is up to

48.57 %

and for transformers is up to

63.93 %

. QiGSAN also demonstrates the best accuracy values compared to other variations in graph ensemble networks: the increase in

F_{1}

-score values is up to

9.84 %

. QiGSAN showed the greatest efficiency when processing objects comparable in size with superpixels. NNs without informing processes, when processing the HRSID dataset, where

87.4 %

of the objects are small, could not improve the results of the convolutional and transformer networks used as encoders in most cases.

6.3. Furher Research Directions

First of all, architectural modifications can be made to effectively process both global and local superpixels features to detect objects of various sizes. The analytical results for new architectures are also required. Second, the types of analyzed data should be extended, for example, by using images and videos from unmanned vehicles. Third, interconnection scaling is another research direction, and QiGSAN is able to handle this challenge, as it successfully deals with hierarchical structures. Finally, detection problems can also be solved using QiGSAN-segmented images and standard OpenCV [] tools, as well as by incorporating a quadtree-informed approach into the architectures of widely used neural networks detectors for the development of smart systems, such as those used in ports [].

Author Contributions

Conceptualization, A.G.; formal analysis, A.G. and A.D.; funding acquisition, A.G.; investigation, A.G. and A.D.; methodology, A.G. and A.D.; project administration, A.G.; resources, A.G.; supervision, A.G.; validation, A.G. and A.D.; visualization, A.G. and A.D.; writing—original draft, A.G. and A.D.; writing—review and editing, A.G. and A.D. All authors have read and agreed to the published version of the manuscript.

Funding

The research was supported by the Ministry of Science and Higher Education of the Russian Federation, project No. 075-15-2024-544.

Data Availability Statement

The original data presented in the study are openly available in HRSID and SSDD. HRSID: https://github.com/chaozhong2010/HRSID (accessed on 8 July 2025); SSDD: https://github.com/TianwenZhang0825/Official-SSDD (accessed on 8 July 2025).

Acknowledgments

The research was carried out using the infrastructure of the Shared Research Facilities “High Performance Computing and Big Data” (CKP “Informatics”) of the Federal Research Center “Computer Science and Control” of the Russian Academy of Sciences.

Conflicts of Interest

The author declares no conflicts of interest.

References

Elhassan, M.A.; Zhou, C.; Khan, A.; Benabid, A.; Adam, A.B.; Mehmood, A.; Wambugu, N. Real-time semantic segmentation for autonomous driving: A review of CNNs, Transformers, and Beyond. J. King Saud Univ.-Comput. Inf. Sci. 2024, 36, 102226. [Google Scholar] [CrossRef]
Song, Y.; Xu, F.; Yao, Q.; Liu, J.; Yang, S. Navigation algorithm based on semantic segmentation in wheat fields using an RGB-D camera. Inf. Process. Agric. 2023, 10, 475–490. [Google Scholar] [CrossRef]
Lei, P.; Yi, J.; Li, S.; Li, Y.; Lin, H. Agricultural surface water extraction in environmental remote sensing: A novel semantic segmentation model emphasizing contextual information enhancement and foreground detail attention. Neurocomputing 2025, 617, 129110. [Google Scholar] [CrossRef]
Kampffmeyer, M.; Salberg, A.B.; Jenssen, R. Semantic Segmentation of Small Objects and Modeling of Uncertainty in Urban Remote Sensing Images Using Deep Convolutional Neural Networks. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Las Vegas, NV, USA, 26 June–1 July 2016; pp. 680–688. [Google Scholar] [CrossRef]
Dede, A.; Nunoo-Mensah, H.; Tchao, E.T.; Agbemenu, A.S.; Adjei, P.E.; Acheampong, F.A.; Kponyo, J.J. Deep learning for efficient high-resolution image processing: A systematic review. Intell. Syst. Appl. 2025, 26, 200505. [Google Scholar] [CrossRef]
Chen, P.; Liu, Y.; Ren, Y.; Zhang, B.; Zhao, Y. A Deep Learning-Based Solution to the Class Imbalance Problem in High-Resolution Land Cover Classification. Remote Sens. 2025, 17, 1845. [Google Scholar] [CrossRef]
Nguyen, Q.D.; Thai, H.T. Crack segmentation of imbalanced data: The role of loss functions. Eng. Struct. 2023, 297, 116988. [Google Scholar] [CrossRef]
Genc, A.; Kovarik, L.; Fraser, H.L. A deep learning approach for semantic segmentation of unbalanced data in electron tomography of catalytic materials. Sci. Rep. 2022, 12, 16267. [Google Scholar] [CrossRef]
Hossain, M.S.; Betts, J.M.; Paplinski, A.P. Dual Focal Loss to address class imbalance in semantic segmentation. Neurocomputing 2021, 462, 69–87. [Google Scholar] [CrossRef]
Chopade, R.; Stanam, A.; Pawar, S. Addressing Class Imbalance Problem in Semantic Segmentation Using Binary Focal Loss. In Proceedings of the Ninth International Congress on Information and Communication Technology, London, UK, 19–22 February 2024; Yang, X.S., Sherratt, S., Dey, N., Joshi, A., Eds.; Springer: Singapore, 2024; pp. 351–357. [Google Scholar] [CrossRef]
Saeedizadeh, N.; Jalali, S.M.J.; Khan, B.; Kebria, P.M.; Mohamed, S. A new optimization approach based on neural architecture search to enhance deep U-Net for efficient road segmentation. Knowl.-Based Syst. 2024, 296, 111966. [Google Scholar] [CrossRef]
Debnath, R.; Das, K.; Bhowmik, M.K. GSNet: A new small object attention based deep classifier for presence of gun in complex scenes. Neurocomputing 2025, 635, 129855. [Google Scholar] [CrossRef]
Liu, W.; Kang, X.; Duan, P.; Xie, Z.; Wei, X.; Li, S. SOSNet: Real-Time Small Object Segmentation via Hierarchical Decoding and Example Mining. IEEE Trans. Neural Netw. Learn. Syst. 2025, 36, 3071–3083. [Google Scholar] [CrossRef]
Sang, S.; Zhou, Y.; Islam, M.T.; Xing, L. Small-Object Sensitive Segmentation Using Across Feature Map Attention. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 45, 6289–6306. [Google Scholar] [CrossRef]
Kiobya, T.; Zhou, J.; Maiseli, B. A multi-scale semantically enriched feature pyramid network with enhanced focal loss for small-object detection. Knowl.-Based Syst. 2025, 310, 113003. [Google Scholar] [CrossRef]
Hou, X.; Bai, Y.; Xie, Y.; Ge, H.; Li, Y.; Shang, C.; Shen, Q. Deep collaborative learning with class-rebalancing for semi-supervised change detection in SAR images. Knowl.-Based Syst. 2023, 264, 110281. [Google Scholar] [CrossRef]
Chen, Y.; Li, X.; Luan, C.; Hou, W.; Liu, H.; Zhu, Z.; Xue, L.; Zhang, J.; Liu, D.; Wu, X.; et al. Cross-level interaction fusion network-based RGB-T semantic segmentation for distant targets. Pattern Recognit. 2025, 161, 111218. [Google Scholar] [CrossRef]
Liu, Z.; Lv, Q.; Lee, C.H.; Shen, L. Segmenting medical images with limited data. Neural Netw. 2024, 177, 106367. [Google Scholar] [CrossRef]
Wang, C.; Gu, H.; Su, W. SAR Image Classification Using Contrastive Learning and Pseudo-Labels with Limited Data. IEEE Geosci. Remote. Sens. Lett. 2022, 19, 4012505. [Google Scholar] [CrossRef]
Wang, C.; Luo, S.; Pei, J.; Huang, Y.; Zhang, Y.; Yang, J. Crucial feature capture and discrimination for limited training data SAR ATR. ISPRS J. Photogramm. Remote Sens. 2023, 204, 291–305. [Google Scholar] [CrossRef]
Dong, Y.; Li, F.; Hong, W.; Zhou, X.; Ren, H. Land cover semantic segmentation of Port Area with High Resolution SAR Images Based on SegNet. In Proceedings of the 2021 SAR in Big Data Era (BIGSARDATA), Nanjing, China, 22–24 September 2021; pp. 1–4. [Google Scholar] [CrossRef]
Chen, Y.; Wang, Z.; Wang, R.; Zhang, S.; Zhang, Y. Limited Data-Driven Multi-Task Deep Learning Approach for Target Classification in SAR Imagery. In Proceedings of the 2024 5th International Conference on Geology, Mapping and Remote Sensing (ICGMRS), Wuhan, China, 12–14 April 2024; pp. 239–242. [Google Scholar] [CrossRef]
Lyu, J.; Zhou, K.; Zhong, Y. A statistical theory of overfitting for imbalanced classification. arXiv 2025, arXiv:2502.11323. [Google Scholar] [CrossRef]
Li, Z.; Kamnitsas, K.; Glocker, B. Analyzing Overfitting Under Class Imbalance in Neural Networks for Image Segmentation. IEEE Trans. Med. Imaging 2021, 40, 1065–1077. [Google Scholar] [CrossRef]
Gorshenin, A.; Kozlovskaya, A.; Gorbunov, S.; Kochetkova, I. Mobile network traffic analysis based on probability-informed machine learning approach. Comput. Netw. 2024, 247, 110433. [Google Scholar] [CrossRef]
Karniadakis, G.; Kevrekidis, I.; Lu, L.; Perdikaris, P.; Wang, S.; Yang, L. Physics-Informed Machine Learning. Nat. Rev. Phys. 2021, 3, 422–440. [Google Scholar] [CrossRef]
Dostovalova, A. Neural Quadtree and its Applications for SAR Imagery Segmentations. Inform. Primen. 2024, 18, 77–85. [Google Scholar] [CrossRef]
Dostovalova, A.; Gorshenin, A. Small sample learning based on probability-informed neural networks for SAR image segmentation. Neural Comput. Appl. 2025, 37, 8285–8308. [Google Scholar] [CrossRef]
Pastorino, M.; Moser, G.; Serpico, S.B.; Zerubia, J. Semantic Segmentation of Remote-Sensing Images Through Fully Convolutional Neural Networks and Hierarchical Probabilistic Graphical Models. IEEE Trans. Geosci. Remote Sens. 2022, 60, 5407116. [Google Scholar] [CrossRef]
Schmitt, M.; Ahmadi, S.; Hansch, R. There is No Data Like More Data - Current Status of Machine Learning Datasets in Remote Sensing. In Proceedings of the 2021 IEEE International Geoscience and Remote Sensing Symposium (IGARSS 2021), Brussels, Belgium, 12–16 July 2021; pp. 1206–1209. [Google Scholar] [CrossRef]
Jhaveri, R.H.; Revathi, A.; Ramana, K.; Raut, R.; Dhanaraj, R.K. A Review on Machine Learning Strategies for Real-World Engineering Applications. Mob. Inf. Syst. 2022, 2022, 1833507. [Google Scholar] [CrossRef]
Bae, J.H.; Yu, G.H.; Lee, J.H.; Vu, D.T.; Anh, L.H.; Kim, H.G.; Kim, J.Y. Superpixel Image Classification with Graph Convolutional Neural Networks Based on Learnable Positional Embedding. Appl. Sci. 2022, 12, 9176. [Google Scholar] [CrossRef]
Wei, S.; Zeng, X.; Qu, Q.; Wang, M.; Su, H.; Shi, J. HRSID: A High-Resolution SAR Images Dataset for Ship Detection and Instance Segmentation. IEEE Access 2020, 8, 120234–120254. [Google Scholar] [CrossRef]
Zhang, T.; Zhang, X.; Li, J.; Xu, X.; Wang, B.; Zhan, X.; Xu, Y.; Ke, X.; Zeng, T.; Su, H.; et al. SAR Ship Detection Dataset (SSDD): Official Release and Comprehensive Data Analysis. Remote Sens. 2021, 13, 3690. [Google Scholar] [CrossRef]
Spasev, V.; Dimitrovski, I.; Chorbev, I.; Kitanovski, I. Semantic Segmentation of Unmanned Aerial Vehicle Remote Sensing Images Using SegFormer. In Proceedings of the Intelligent Systems and Pattern Recognition. Communications in Computer and Information Science, Istanbul, Turkey, 26–28 June 2024; Volume 2305, pp. 1416–1425. [Google Scholar] [CrossRef]
Lu, W.; Chen, S.B.; Ding, C.H.Q.; Tang, J.; Luo, B. LWGANet: A Lightweight Group Attention Backbone for Remote Sensing Visual Tasks. arXiv 2025, arXiv:2501.10040. [Google Scholar] [CrossRef]
Paszke, A.; Chaurasia, A.; Kim, S.; Culurciello, E. Enet: A deep neural network architecture for real-time semantic segmentation. arXiv 2016, arXiv:1606.02147. [Google Scholar] [CrossRef]
Chen, L.C.; Papandreou, G.; Schroff, F.; Adam, H. Rethinking Atrous Convolution for Semantic Image Segmentation. In Proceedings of the 2017 Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017. [Google Scholar] [CrossRef]
Li, J.; Wei, X. Research on efficient detection network method for remote sensing images based on self attention mechanism. Image Vis. Comput. 2024, 142, 104884. [Google Scholar] [CrossRef]
Brigato, L.; Iocchi, L. A Close Look at Deep Learning with Small Data. In Proceedings of the 2020 25th International Conference on Pattern Recognition (ICPR), Virtual, 10–15 January 2021; pp. 2490–2497. [Google Scholar] [CrossRef]
Huang, Z.; Pan, Z.; Lei, B. Transfer Learning with Deep Convolutional Neural Network for SAR Target Classification with Limited Labeled Data. Remote Sens. 2017, 9, 907. [Google Scholar] [CrossRef]
Zhou, Y.; Jiang, X.; Li, Z.; Liu, X. SAR Target Classification with Limited Data via Data Driven Active Learning. In Proceedings of the IGARSS 2020–2020 IEEE International Geoscience and Remote Sensing Symposium, Virtual, 26 September–2 October 2020; pp. 2475–2478. [Google Scholar] [CrossRef]
Yu, J.; Zhou, G.; Zhou, S.; Yin, J. A Lightweight Fully Convolutional Neural Network for SAR Automatic Target Recognition. Remote Sens. 2021, 13, 3029. [Google Scholar] [CrossRef]
Wang, W.; Jiang, Z.; Liao, J.; Ying, Z.; Zhai, Y. Explorations of Contrastive Learning in the Field of Small Sample SAR ATR. Procedia Comput. Sci. 2022, 208, 190–195. [Google Scholar] [CrossRef]
Chong, Q.; Ni, M.; Huang, J.; Wei, G.; Li, Z.; Xu, J. Let the loss impartial: A hierarchical unbiased loss for small object segmentation in high-resolution remote sensing images. Eur. J. Remote Sens. 2023, 56, 2254473. [Google Scholar] [CrossRef]
Chong, Q.; Ni, M.; Huang, J.; Liang, Z.; Wang, J.; Li, Z.; Xu, J. Pos-DANet: A dual-branch awareness network for small object segmentation within high-resolution remote sensing images. Eng. Appl. Artif. Intell. 2024, 133, 107960. [Google Scholar] [CrossRef]
Wu, S.; Sun, F.; Zhang, W.; Xie, X.; Cui, B. Graph Neural Networks in Recommender Systems: A Survey. ACM Comput. Surv. 2022, 55, 1–37. [Google Scholar] [CrossRef]
Vrahatis, A.G.; Lazaros, K.; Kotsiantis, S. Graph Attention Networks: A Comprehensive Review of Methods and Applications. Future Internet 2024, 16, 318. [Google Scholar] [CrossRef]
Tanis, J.H.; Giannella, C.; Mariano, A.V. Introduction to Graph Neural Networks: A Starting Point for Machine Learning Engineers. arXiv 2024, arXiv:2412.19419. [Google Scholar] [CrossRef]
Hamilton, W.; Ying, Z.; Leskovec, J. Inductive representation learning on large graphs. Adv. Neural Inf. Process. Syst. 2017, 30, 1025–1035. [Google Scholar]
Jiang, M.; Liu, G.; Su, Y.; Wu, X. Self-attention empowered graph convolutional network for structure learning and node embedding. Pattern Recognit. 2024, 153, 110537. [Google Scholar] [CrossRef]
Ihalage, A.; Hao, Y. Formula Graph Self-Attention Network for Representation-Domain Independent Materials Discovery. Adv. Sci. 2022, 9, 2200164. [Google Scholar] [CrossRef]
Peng, Y.; Xia, J.; Liu, D.; Liu, M.; Xiao, L.; Shi, B. Unifying topological structure and self-attention mechanism for node classification in directed networks. Sci. Rep. 2025, 15, 805. [Google Scholar] [CrossRef]
Ronen, T.; Levy, O.; Golbert, A. Vision Transformers with Mixed-Resolution Tokenization. In Proceedings of the 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Vancouver, BC, Canada, 17–24 June 2023; pp. 4613–4622. [Google Scholar]
Ke, L.; Danelljan, M.; Li, X.; Tai, Y.W.; Tang, C.K.; Yu, F. Mask Transfiner for High-Quality Instance Segmentation. In Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 4402–4411. [Google Scholar] [CrossRef]
Tang, S.; Zhang, J.; Zhu, S.; Tan, P. Quadtree Attention for Vision Transformers. In Proceedings of the International Conference on Learning Representations, Virtual, 25–29 April 2022. [Google Scholar]
Chitta, K.; Álvarez, J.M.; Hebert, M. Quadtree Generating Networks: Efficient Hierarchical Scene Parsing with Sparse Convolutions. In Proceedings of the 2020 IEEE Winter Conference on Applications of Computer Vision (WACV), Snowmass Village, CO, USA, 1–5 March 2019; pp. 2009–2018. [Google Scholar]
Yang, F.; Ma, Z.; Xie, M. Image classification with superpixels and feature fusion method. J. Electron. Sci. Technol. 2021, 19, 100096. [Google Scholar] [CrossRef]
Zi, W.; Xiong, W.; Chen, H.; Li, J.; Jing, N. SGA-Net: Self-Constructing Graph Attention Neural Network for Semantic Segmentation of Remote Sensing Images. Remote Sens. 2021, 13, 4201. [Google Scholar] [CrossRef]
Cao, P.; Jin, Y.; Ruan, B.; Niu, Q. DIGCN: A Dynamic Interaction Graph Convolutional Network Based on Learnable Proposals for Object Detection. J. Artif. Intell. Res. 2024, 79, 1091–1112. [Google Scholar] [CrossRef]
Liu, X.; Li, Y.; Liu, X.; Zou, H. Dark Spot Detection from SAR Images Based on Superpixel Deeper Graph Convolutional Network. Remote Sens. 2022, 14, 5618. [Google Scholar] [CrossRef]
Gorshenin, A.; Kuzmin, V. Statistical Feature Construction for Forecasting Accuracy Increase and Its Applications in Neural Network Based Analysis. Mathematics 2022, 10, 589. [Google Scholar] [CrossRef]
Gorshenin, A.; Vilyaev, A. Finite Normal Mixture Models for the Ensemble Learning of Recurrent Neural Networks with Applications to Currency Pairs. Pattern Recognit. Image Anal. 2022, 32, 780–792. [Google Scholar] [CrossRef]
Gorshenin, A.; Vilyaev, A. Machine Learning Models Informed by Connected Mixture Components for Short- and Medium-Term Time Series Forecasting. AI 2024, 5, 1955–1976. [Google Scholar] [CrossRef]
Tyralis, H.; Papacharalampous, G. A review of predictive uncertainty estimation with machine learning. Artif. Intell. Rev. 2024, 57, 94. [Google Scholar] [CrossRef]
Wang, Z.; Nakahira, Y. A Generalizable Physics-informed Learning Framework for Risk Probability Estimation. In Proceedings of the 5th Annual Learning for Dynamics and Control Conference, Philadelphia, PA, USA, 15–16 June 2023; Matni, N., Morari, M., Pappas, G.J., Eds.; PMLR: Cambridge, MA, USA, 2023; Volume 211, pp. 358–370. [Google Scholar]
Zhang, Z.; Li, J.; Liu, B. Annealed adaptive importance sampling method in PINNs for solving high dimensional partial differential equations. J. Comput. Phys. 2025, 521, 113561. [Google Scholar] [CrossRef]
Zuo, L.; Chen, Y.; Zhang, L.; Chen, C. A spiking neural network with probability information transmission. Neurocomputing 2020, 408, 1–12. [Google Scholar] [CrossRef]
Batanov, G.; Gorshenin, A.; Korolev, V.Y.; Malakhov, D.; Skvortsova, N. The evolution of probability characteristics of low-frequency plasma turbulence. Math. Model. Comput. Simulations 2012, 4, 10–25. [Google Scholar] [CrossRef]
Batanov, G.; Borzosekov, V.; Gorshenin, A.; Kharchev, N.; Korolev, V.; Sarksyan, K. Evolution of statistical properties of microturbulence during transient process under electron cyclotron resonance heating of the L-2M stellarator plasma. Plasma Phys. Control. Fusion 2019, 61, 075006. [Google Scholar] [CrossRef]
Gorshenin, A. Concept of online service for stochastic modeling of real processes. Inform. Primen. 2016, 10, 72–81. [Google Scholar] [CrossRef]
Hendrycks, D.; Gimpel, K. Gaussian Error Linear Units GELUs. arXiv 2023, arXiv:1606.08415. [Google Scholar] [CrossRef]
Kang, J.; Liu, R.; Li, Y.; Liu, Q.; Wang, P.; Zhang, Q.; Zhou, D. An Improved 3D Human Pose Estimation Model Based on Temporal Convolution with Gaussian Error Linear Units. In Proceedings of the International Conference on Virtual Rehabilitation, ICVR, Nanjing, China, 26–28 May 2022; pp. 21–32. [Google Scholar] [CrossRef]
Satyanarayana, D.; Saikiran, E. Nonlinear Dynamic Weight-Salp Swarm Algorithm and Long Short-Term Memory with Gaussian Error Linear Units for Network Intrusion Detection System. Int. J. Intell. Eng. Syst. 2024, 17, 463–473. [Google Scholar] [CrossRef]
Dostovalova, A. Using a Model of a Spatial-Hierarchical Quadtree with Truncated Branches to Improve the Accuracy of Image Classification. Izv. Atmos. Ocean. Phys. 2023, 59, 1255–1262. [Google Scholar] [CrossRef]
Xu, K.; Zhang, M.; Jegelka, S.; Kawaguchi, K. Optimization of Graph Neural Networks: Implicit Acceleration by Skip Connections and More Depth. In Proceedings of the 38th International Conference on Machine Learning, Virtual, 18–24 July 2021; PMLR: Cambridge, MA, USA, 2021; Volume 139. [Google Scholar]
Veličkovič, P.; Casanova, A.; Lió, P.; Cucurull, G.; Romero, A.; Bengio, Y. Graph attention networks. In Proceedings of the 6th International Conference on Learning Representations, ICLR 2018—Conference Track Proceedings, Vancouver, BC, Canada, 30 April–3 May 2018. [Google Scholar] [CrossRef]
Qin, F. Blind Single-Image Super Resolution Reconstruction with Gaussian Blur. In Proceedings of the Mechatronics and Automatic Control Systems, Hangzhou, China, 10–11 August 2013; Wang, W., Ed.; Springer: Cham, Switzerland, 2014; pp. 293–301. [Google Scholar] [CrossRef]
Kingma, D.P.; Ba, J.L. Adam: A method for stochastic optimization. In Proceedings of the 3rd International Conference on Learning Representations, ICLR 2015—Conference Track Proceedings, San Diego, CA, USA, 7–9 May 2015. [Google Scholar] [CrossRef]
Zhou, P.; Xie, X.; Lin, Z.; Yan, S. Towards Understanding Convergence and Generalization of AdamW. IEEE Trans. Pattern Anal. Mach. Intell. 2024, 46, 6486–6493. [Google Scholar] [CrossRef]
Zou, F.; Shen, L.; Jie, Z.; Zhang, W.; Liu, W. A sufficient condition for convergences of Adam and RMSProp. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 11119–11127. [Google Scholar]
Kullback, S.; Leibler, R.A. On Information and Sufficiency. Ann. Math. Stat. 1951, 22, 79–86. [Google Scholar] [CrossRef]
Long, J.; Shelhamer, E.; Darrell, T. Fully convolutional networks for semantic segmentation. In Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, 7–12 June 2015; pp. 3431–3440. [Google Scholar] [CrossRef]
Zhang, H.; Wu, B.; Yuan, X.; Pan, S.; Tong, H.; Pei, J. Trustworthy Graph Neural Networks: Aspects, Methods, and Trends. Proc. IEEE 2024, 112, 97–139. [Google Scholar] [CrossRef]
Chandan, G.; Jain, A.; Jain, H.; Mohana. Real Time Object Detection and Tracking Using Deep Learning and OpenCV. In Proceedings of the International Conference on Inventive Research in Computing Applications, ICIRCA 2018, Coimbatore, India, 11–12 July 2018; pp. 1305–1308. [Google Scholar] [CrossRef]
Belmoukari, B.; Audy, J.F.; Forget, P. Smart port: A systematic literature review. Eur. Transp. Res. Rev. 2023, 15, 4. [Google Scholar] [CrossRef]

Figure 1. Architecture of Quadtree-informed Graph Self-Attention Networks. It consists of 5 blocks:: the first is the preprocessing block with convolutional layers; the second is the block for forming features using quadtree layers; the third is a superpixel forming block; the fourth performs graph convolution operations; and the fifth is a compression block using quadtree layer.

Figure 2. Quadtree scheme when with height

h = 2

.

Figure 3. Spatial relations for the

(i, j)

element of a two-dimensional grid.

Figure 4. Difference between image X and vectors

x_{f l a t}

,

x

, and

x^{(s p)}

(

N = 4

,

M = 2

).

Figure 5. Examples of images from HRSID. The ships are marked (see bottom row).

Figure 6. Examples of images from SSDD. The ships are marked (see bottom row).

Figure 7. Variants of the

G (\cdot)

network.

Figure 8. Segmentation of HRSID images by FCN, GCN, and QiGSAN.

Figure 9. Segmentation of SSDD images by ENet, GCN, and QiGSAN.

Figure 10. Examples of loss functions for HRSID (a) and SSDD (b).

Figure 11. Changes in

F_{1}

-score for different architectures, compared to the values for the corresponding vanilla encoder (HRSID).

Figure 12. Changes in

F_{1}

-score for different architectures, compared to the values for the corresponding vanilla encoder (SSDD).

Table 1. Characteristics of datasets.

Dataset	Number of Images	Image Size	Resolution (m)	Small Objects	Middle Objects	Large Objects
SSDD	1160	190–668	1–10	1529	935	76
HRSID (test)	1962	800	0.5–3	5176	619	127

Table 2. The description and range of change for the NNs hyperparameters.

Hyperparameter	Description	Values
Encoder	Architecture of the basic segmenter	ENet, Fully-Conv. Network (FCN) based on ResNet50 [], DeepLabV3 (DLv3), SegFormer, LWGANet
Pre-train	Pre-trained weights for base networks	None; PyTorch’s weights for DeepLabV3 and FCN; Imagenet, ADE20K
Epochs	Number of training epochs of base segmenter	100–300
lr_decay	Learning rate decay	None; 0.5; 0.8
h	Quadtree height	4; 5
M	Superpixel size	4; 8; 16

Table 3. Mean and median Recall, Precision, and

F_{1}

-score values (

\times 100 %

) for 5-fold HRSID and SSDD cross-validation. The best metric values for dataset are highlighted in bold.

Table 3. Mean and median Recall, Precision, and

F_{1}

-score values (

\times 100 %

) for 5-fold HRSID and SSDD cross-validation. The best metric values for dataset are highlighted in bold.

NN		HRSID			SSDD
	Recall	Precision	$F_{1}$ -Score	Recall	Precision	$F_{1}$ -Score
ENet	79.70 ± 10.02 (82.12)	68.87 ± 10.46 (70.87)	73.77 ± 9.59 (77.74)	74.54 ± 7.45 (75.02)	23.51 ± 18.66 (17.96)	33.04 ± 19.4 (29.28)
DeepLabV3	74.68 ± 15.25 (80.1)	62.84 ± 5.65 (62.75)	60.45 ± 18.67 (69.97)	66.93 ± 11.68 (68.72)	52.76 ± 5.26 (53.58)	58.35 ± 4.47 (57.81)
FCN	77.82 ± 8.25 (81.06)	69.55 ± 6.83 (66.99)	59.87 ± 32.15 (74.13)	80.54 ± 4.15 (80.55)	12.70 ± 3.48 (12.71)	21.87 ± 5.35 (21.86)
SegFormer	73.98 ± 12.84 (78.56)	76.50 ± 20.42 (85.85)	74.98 ± 16.59 (82.61)	57.22 ± 26.57 (68.68)	38.15 ± 8.43 (39.83)	44.64 ± 16.05 (50.36)
LWGANet	78.74 ± 12.35 (80.74)	69.28 ± 16.14 (73.36)	73.36 ± 13.58 (78.34)	72.72 ± 9.39 (74.46)	10.41 ± 2.43 (9.94)	17.68 ± 3.29 (16.82)
Small object-oriented models
AFMA DeepLabV3	86.02 ± 11.42 (89.33)	51.61 ± 15.19 (55.06)	64.05 ± 14.92 (69.14)	72.85 ± 8.87 (73.01)	16.48 ± 5.08 (16.48)	26.47 ± 6.49 (27.22)
AFMA U-Net++	88.72 ± 8.39 (91.51)	47.56 ± 9.47 (51.54)	62.06 ± 10.43 (66.12)	63.37 ± 11.59 (64.22)	33.23 ± 11.42 (31.19)	40.04 ± 6.31 (39.29)
Ensembles specified by type of the $G (\cdot)$ network (on the left column)
GCN	73.89 ± 11.72 (76.72)	74.50 ± 8.39 (75.79)	73.84 ± 9.56 (76.92)	74.22 ± 4.64 (72.93)	78.14 ± 3.21 (79.28)	76.09 ± 3.48 (75.55)
GAT	33.68 ± 29.03 (34.44)	54.17 ± 39.36 (65.43)	68.08 ± 9.02 (71.54)	72.41 ± 5.64 (71.62)	78.48 ± 9.99 (80.03)	74.87 ± 4.01 (76.58)
GSAN	74.77 ± 16.32 (80.1)	67.03 ± 8.18 (65.79)	69.38 ± 13.55 (72.22)	81.87 ± 2.21 (82.11)	75.95 ± 4.97 (76.74)	78.71 ± 2.36 (78.38)
GCN-QT	68.58 ± 13.36 (74.12)	74.78 ± 18.42 (82.91)	75.27 ± 10.00 (78.72)	78.01 ± 6.46 (78.92)	79.66 ± 5.67 (80.98)	78.64 ± 4.27 (77.98)
GAT-QT	72.57 ± 18.00 (78.38)	75.95 ± 20.98 (86.11)	76.53 ± 10.86 (80.36)	78.14 ± 4.42 (77.92)	82.52 ± 4.21 (83.89)	80.46 ± 3.78 (80.27)
QiGSAN	73.44 ± 12.71 (77.90)	77.84 ± 21.31 (87.55)	78.60 ± 11.38 (82.87)	78.27 ± 4.12 (78.72)	84.79 ± 4.59 (86.29)	81.61 ± 3.6 (81.76)

Table 4. Mean and Median

F_{1}

-score values (

\times 100 %

) for five-fold HRSID cross-validation. The best metric values for each

F (\cdot)

encoder are highlighted in bold.

Table 4. Mean and Median

F_{1}

-score values (

\times 100 %

) for five-fold HRSID cross-validation. The best metric values for each

F (\cdot)

encoder are highlighted in bold.

Encoder $F (\cdot)$	VanillaValues	Graph NN $G (\cdot)$
Encoder $F (\cdot)$	VanillaValues	GCN-QT	QiGSAN	GAT-QT	GCN	GSAN	GAT
ENet	73.77 ± 9.59 (77.74)	71.97 ± 9.01 (75.38)	75.66 ± 14.6 (79.62)	72.68±6.79 (74.81)	61.57 ± 17.39 (64.86)	56.40 ± 5.1 (58.40)	11.77 ± 10.29 (9.65)
DeepLabV3	60.45 ± 18.67 (69.97)	67.23 ± 14.69 (75.32)	70.03 ± 14.1 (76.97)	44.18 ± 28.9 (51.15)	66.28 ± 13.9 (74.44)	50.91 ± 19.68 (63.00)	31.46 ± 19.62 (38.41)
FCN	59.87 ± 32.15 (74.13)	72.77 ± 8.16 (76.02)	74.94 ± 7.9 (78.47)	72.64 ± 4.95 (74.45)	50.65 ± 34.05 (65.58)	58.15 ± 6.47 (60.09)	23.65 ± 21.7 (22.49)
SegFormer	74.98 ± 16.59 (82.61)	71.48 ± 15.74 (78.82)	75.31 ± 16.89 (82.59)	71.67 ± 17.70 (78.75)	65.51 ± 15.19 (66.83)	41.06 ± 33.14 (44.86)	51.75 ± 34.72 (66.55)
LWGANet	73.36 ± 13.58 (78.34)	75.27 ± 10.00 (78.72)	78.60 ± 11.38 (82.87)	76.53 ± 10.86 (80.36)	73.84 ± 9.56 (76.92)	69.38 ± 13.55 (72.22)	68.08 ± 9.02 (71.54)

Table 5. Mean and Median

F_{1}

-score values (

\times 100 %

) for 5-fold SSDD cross-validation. The best metric values for each

F (\cdot)

encoder are highlighted in bold.

Table 5. Mean and Median

F_{1}

-score values (

\times 100 %

) for 5-fold SSDD cross-validation. The best metric values for each

F (\cdot)

encoder are highlighted in bold.

Encoder $F (\cdot)$	VanillaValues	Graph NN $G (\cdot)$
Encoder $F (\cdot)$	VanillaValues	GCN-QT	QiGSAN	GAT-QT	GCN	GSAN	GAT
ENet	33.04 ± 19.4 (29.28)	72.63 ± 5.78 (74.56)	$81.18 \pm 2.15$ $(81.41)$	75.54 ± 8.5 (80.07)	76.09 ± 3.48 (75.55)	78.71 ± 2.36 (78.38)	74.87 ± 4.01 (76.58)
DeepLabV3	58.35 ± 4.47 (57.81)	68.20 ± 7.51 (70.23)	$68.65 \pm 5.9$ $(70.28)$	67.81 ± 3.24 (68.51)	65.84 ± 7.25 (65.73)	50.92 ± 34.4 (65.25)	52.72 ± 5.98 (55.80)
FCN	21.87 ± 5.35 (21.86)	78.64 ± 4.27 (77.98)	$78.35 \pm 6.0$ $(78.09)$	72.71 ± 10.8 (76.82)	74.61 ± 5.62 (74.56)	74.96 ± 3.92 (73.86)	71.98 ± 7.63 (73.83)
SegFormer	44.64 ± 16.05 (50.36)	55.89 ± 19.80 (63.64)	$72.00 \pm 3.85$ $(73.18)$	58.33 ± 20.75 (65.43)	50.73 ± 20.4 (56.58)	59.37 ± 13.14 (60.93)	54.35 ± 15.10 (59.02)
LWGANet	17.68 ± 3.29 (16.82)	65.59 ± 21.89 (74.16)	$81.61 \pm 3.6$ $(81.76)$	80.46 ± 3.78 (80.27)	73.19 ± 1.44 (73.16)	69.93 ± 17.53 (78.08)	72.11 ± 8.99 (75.58)

Table 6. Mean and median

F_{1}

-score values (

\times 100 %

), 3-fold cross-validation (SSDD). The best metric values for each

F (\cdot)

encoder are highlighted in bold.

Table 6. Mean and median

F_{1}

-score values (

\times 100 %

), 3-fold cross-validation (SSDD). The best metric values for each

F (\cdot)

encoder are highlighted in bold.

Encoder $F (\cdot)$	VanillaValues	Graph NN $G (\cdot)$
Encoder $F (\cdot)$	VanillaValues	GCN-QT	QiGSAN	GAT-QT	GCN	GSAN	GAT
ENet	25.08 ± 3.83 (25.98)	75.05 ± 3.08 (75.05)	$80.47 \pm 3.9$ $(80.87)$	77.14 ± 0.41 (77.94)	73.32 ± 0.88 (73.32)	78.90 ± 4.53 (78.92)	71.02 ± 4.3 (73.12)
DeepLabV3	51.32 ± 5.80 (50.59)	63.26 ± 7.54 (63.09)	$68.34 \pm 4.1$ $(66.19)$	66.11 ± 3.75 (63.74)	62.47 ± 4.80 (60.51)	55.82 ± 19.73 (62.84)	50.32 ± 12.56 (51.52)
FCN	25.69 ± 13.06 (25.37)	$74.39 \pm 0.2$ $(74.39)$	69.78 ± 5.68 (69.38)	68.24 ± 9.13 (68.14)	71.06 ± 2.09 (71.16)	67.35 ± 10.53 (67.75)	70.41 ± 2.51 (70.85)
SegFormer	45.32 ± 5.12 (44.47)	67.19 ± 2.49 (68.09)	$74.0 \pm 1.84$ $(74.22)$	72.95 ± 1.17 (72.75)	67.28 ± 0.81 (65.18)	65.04 ± 1.94 (63.74)	66.2 ± 0.28 (65.9)
LWGANet	20.34 ± 8.82 (20.34)	80.41 ± 1.47 (81.91)	$83.7 \pm 0.38$ $(84.2)$	82.83 ± 0.78 (82.22)	79.61 ± 1.97 (78.06)	80.01 ± 2.63 (79.11)	78.16 ± 3.39 (76.55)

Table 7. The number of parameters (in thousands) and the computing performance (in GFLOPS) of base segmentors

F (\cdot)

and graph networks

G (\cdot)

.

Table 7. The number of parameters (in thousands) and the computing performance (in GFLOPS) of base segmentors

F (\cdot)

and graph networks

G (\cdot)

.

	ENet	DeepLabV3	FCN	LWGANet	Segformer	GCN	GAT	GSAN	QiGCN	QiGAT	QiGSAN
GFLOPs	46.6	157.74	2360	196	88.7	168.16	168.08	168.08	174.84	174.72	174.76
Params	358.7	11029	35322	12583	3755	49.8	48.5	48.5	370.3	370.1	370.1

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

QiGSAN: A Novel Probability-Informed Approach for Small Object Segmentation in the Case of Limited Image Datasets

Abstract

1. Introduction

2. Related Works

2.1. Small Data in Image Processing

2.2. Graph Neural Network Image Analysis

2.3. Probability-Informed Neural Networks

2.4. Summary

3. Methodology

3.1. Problem Statement

3.2. Quadtree-Informed Graph Self-Attention Networks

3.3. Pre-Processing of Image Features by Convolutional Layers

3.4. Superpixels

3.5. Quadtree Informing

3.6. Graph Self-Attention Operation

4. Experiments

4.1. Datasets

4.2. Training, Metrics, and Hyperparameters

4.3. Variants of the $G (\cdot)$ Network

4.4. Main Results on SAR Images

5. Ablation Study

5.1. HRSID

5.2. SSDD

5.3. Comparison of QiGSAN with Other Configurations of Graph Networks

6. Conclusions and Discussion

6.1. Discussion

6.2. Summary

6.3. Furher Research Directions

Author Contributions

Funding

Data Availability Statement

Acknowledgments

Conflicts of Interest

References

Article Metrics

Citations

Article Access Statistics

QiGSAN: A Novel Probability-Informed Approach for Small Object Segmentation in the Case of Limited Image Datasets

Abstract

1. Introduction

2. Related Works

2.1. Small Data in Image Processing

2.2. Graph Neural Network Image Analysis

2.3. Probability-Informed Neural Networks

2.4. Summary

3. Methodology

3.1. Problem Statement

3.2. Quadtree-Informed Graph Self-Attention Networks

3.3. Pre-Processing of Image Features by Convolutional Layers

3.4. Superpixels

3.5. Quadtree Informing

3.6. Graph Self-Attention Operation

4. Experiments

4.1. Datasets

4.2. Training, Metrics, and Hyperparameters

4.3. Variants of the G ( · ) Network

4.4. Main Results on SAR Images

5. Ablation Study

5.1. HRSID

5.2. SSDD

5.3. Comparison of QiGSAN with Other Configurations of Graph Networks

6. Conclusions and Discussion

6.1. Discussion

6.2. Summary

6.3. Furher Research Directions

Author Contributions

Funding

Data Availability Statement

Acknowledgments

Conflicts of Interest

References

Article Metrics

Citations

Article Access Statistics

4.3. Variants of the $G (\cdot)$ Network