1. Introduction
River sand is a non-metallic mineral produced by the repeated collision and friction of natural stones under the action of water [1]. It has become an important building material due to its high quality, easy mining and low cost. However, excessive sand mining on riverbanks has occurred frequently in recent years, destabilizing river channels, damaging the environment and threatening the safety of bridges and shipping. Therefore, it is vital to obtain the real-time status of sand mining areas on riverbanks and prevent excessive sand mining.
River management departments need a distribution map of the sand mining areas on the riverbank that meets the following specific requirements. First, the sand mining areas need to be acquired automatically by a computer. Second, the overall accuracy (OA) must exceed 93%, and no large sand mining area may be missed. Third, the extracted areas need clear boundaries, without blurring or fragmentation. Lastly, the method needs to be fast and memory-efficient so that it can run on mobile devices such as laptops and tablets, making it convenient for managers to acquire the information. The traditional way to capture this information is field measurement, which struggles to meet real-time requirements.
Recently, remote sensing and satellite imaging technology has greatly promoted earth observation [2]. Compared with manual methods, remote sensing technology has obvious advantages in its ability to facilitate the continuous monitoring of riverbank sand mining [3], and high-resolution satellite images enable the necessary quick extraction of riverbank sand mining areas. Semantic segmentation is an important method in remote sensing imaging; it enables the pixel-level classification of images and suits the monitoring of riverbank sand mining areas.
Traditional methods of semantic segmentation are mainly based on simple features of images such as points and values. For example, segmentation based on gray values uses thresholds and gradation to detect different categories of objects [4]. Edge detection algorithms use detection operators to obtain the edges of objects [5,6]. However, these methods ignore the spatial features and the texture and color information of the images, and an improper selection of thresholds can lead to mistakes. For images with complex pixel changes, these methods perform relatively poorly. There are also methods based on structured regression forests (SRFs) [7] or support vector machines (SVMs) [8], but many parameters, such as the kernel function, need to be repeatedly adjusted in these methods. These parameters significantly impact the results; setting them improperly may lead to poor performance. Therefore, it is difficult to meet the accuracy requirements using traditional methods.
Recently, AI technology has attracted attention from many scholars [9]. Image segmentation methods based on AI technology have been commonly used in many kinds of tasks, including satellite images, due to their excellent precision [10,11]. Fully convolutional networks (FCNs) [12] have shown favorable results in remote sensing image segmentation tasks [13], such as Unet [14], Unet++ [15], the Deeplab series [16] and HRnet [17]. However, these networks have too many parameters, which seriously reduces the efficiency of segmentation, especially for large-scale remote sensing images. The method should be improved to adapt to the real-time monitoring of sand mining behavior on riverbanks.
Feature information in the images is mainly extracted by the encoder in deep neural networks. There are two main encoder structures: the convolutional structure and the Transformer structure. The traditional fully convolutional structure is used in numerous models for semantic segmentation tasks, and some scholars have proposed adding plug-and-play modules into the model to improve the precision of feature extraction and prevent the loss of key features [18]. However, the method of adding modules leads to limited improvements in accuracy and also adds many parameters to the model, resulting in a decrease in computational efficiency. In recent years, models using Transformer based on self-attention have found more usage in image segmentation tasks [19]. Although Transformer can handle more complex data to generate accurate output, it also brings computational and storage challenges, requiring a large amount of computing resources and optimization algorithms to support its training and inference processes. The number of parameters in this structure even exceeds that of an entire fully convolutional neural network, making it unsuitable for riverbank sand mining area segmentation. Therefore, it is necessary to make lightweight improvements to convolutional neural networks to reduce the number of parameters and improve efficiency.
In addition, because sand mining areas vary in extent, targets of different sizes should be considered, and the network should be able to extract multi-scale features.
Generalizable and real-time models are essential for the analysis of riverbank sand mining areas. Current models encounter challenges in computational demands and in multi-scale feature extraction [20]. To overcome these problems, we propose a lightweight multi-scale network (LMS Net) for the fast segmentation of riverbank sand mining areas in high-resolution satellite images.
Our research makes the following contributions:
The use of LMS Net is proposed to improve the efficiency of detecting sand mining areas on the riverbank. The number of network parameters was reduced to at most one-tenth of that of typical networks like Unet, while maintaining similar accuracy.
An LMS block was designed and added into our network. It enhances the multi-scale feature extraction ability and has shown good performance in multi-scale sand mining area segmentation.
Compared to other lightweight networks, our network greatly improves segmentation accuracy while adding only a few parameters, achieving a balance between computational resources and segmentation performance.
The remainder of this paper is organized as follows. Section 2 summarizes the existing lightweight semantic segmentation networks. Section 3 discusses the overall architecture of LMS Net and the detailed structure of the LMS block. Section 4 presents the results of the comparative experiments and an ablation study using tables and images. The last two sections draw conclusions and discuss the experimental results.
2. Related Works
This section summarizes and elaborates on the advantages and disadvantages of lightweight image segmentation networks proposed by previous researchers.
The core of lightweight networks is to modify the network from both the volume and speed perspectives while maintaining accuracy as much as possible. Several successful lightweight networks have emerged and been applied in remote sensing image processing.
SqueezeNet is an early lightweight network [21]. It uses the Fire module for parameter compression. Although it is not as widely used as other lightweight networks, its architectural ideas and experimental conclusions are still worth learning from.
The MobileNet model is a classic and widely used lightweight network. The core of the first generation of MobileNet is the depthwise separable convolution module, which is composed of a depthwise convolution and a pointwise convolution [22]. MobileNet v2 uses linear bottlenecks and inverted residuals to reduce information loss [23]. MobileNet v3 introduced a squeeze-and-excitation module and the hard-swish activation function to reduce the computational complexity [24].
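The parameter savings of the depthwise separable convolution at the core of MobileNet can be illustrated with simple arithmetic. This is an illustrative sketch; the layer sizes below are hypothetical examples, not taken from the paper.

```python
# Parameter counts for a standard vs. a depthwise separable convolution
# (ignoring biases). Layer sizes are hypothetical examples.

def standard_conv_params(k, c_in, c_out):
    # A standard k x k convolution mixes space and channels in one step.
    return k * k * c_in * c_out

def depthwise_separable_params(k, c_in, c_out):
    # Depthwise: one k x k filter per input channel (spatial filtering only).
    depthwise = k * k * c_in
    # Pointwise: a 1 x 1 convolution that mixes channels.
    pointwise = c_in * c_out
    return depthwise + pointwise

k, c_in, c_out = 3, 64, 128
std = standard_conv_params(k, c_in, c_out)        # 73,728 parameters
sep = depthwise_separable_params(k, c_in, c_out)  # 8,768 parameters
print(f"standard: {std}, separable: {sep}, ratio: {std / sep:.1f}x")
```

For a 3 × 3 kernel the separable form needs roughly 8–9 times fewer parameters, which is the main source of MobileNet's efficiency.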
The ShuffleNet series is also very important among lightweight networks. ShuffleNet v1 uses a pointwise group convolution module to accelerate calculation and shuffles the channels to boost the information flow between the feature channels [25]. ShuffleNet v2 introduced a channel split to further reduce the parameters [26].
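The channel shuffle operation of ShuffleNet v1 amounts to a fixed permutation of channel indices (reshape into groups, transpose, flatten). A minimal sketch on a flat index list, not the paper's implementation:

```python
def channel_shuffle(channels, groups):
    """Reorder a flat list of channel indices as ShuffleNet v1 does:
    reshape to (groups, channels_per_group), transpose, flatten."""
    n = len(channels)
    assert n % groups == 0, "channel count must divide evenly into groups"
    per_group = n // groups
    # Read column-by-column across the groups (i.e., transpose).
    return [channels[g * per_group + i]
            for i in range(per_group)
            for g in range(groups)]

# 6 channels in 2 groups: [0,1,2 | 3,4,5] -> interleaved [0, 3, 1, 4, 2, 5]
print(channel_shuffle(list(range(6)), 2))
```

After the shuffle, each subsequent group convolution sees channels from every previous group, which is what restores cross-group information flow.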
GhostNet was designed to solve the problem of redundancy in feature maps [27]. It uses a ghost module with an identity mapping operation to expand the dimensions of the feature map, which is a cost-efficient way to generate more features.
In recent years, some scholars have conducted studies on the lightweight segmentation of satellite images, which is partially related to the segmentation of riverbank sand mining areas. Chen et al. [3] proposed an improved DeepLabv3+ lightweight neural network, combining the MobileNetv2 backbone, a hybrid dilated convolution (HDC) module and a strip pooling module to alleviate the gridding effect. Inuwa et al. [28] proposed a lightweight and memory-efficient network that can be deployed on resource-constrained devices and is suitable for real-time remote sensing applications. Bo et al. [29] proposed an ultra-lightweight network (ULN) to reduce the amount of calculation and improve the computation speed; this network achieved competitive results with fewer parameters. He et al. [30] proposed an enhanced end-to-end lightweight segmentation network dedicated to satellite imagery, where a superpixel segmentation pooling module is added to improve the accuracy. Wang et al. [31] explored a cost-effective multimodal sensing semantic segmentation model, which employed multiple lightweight modality-specific experts, an adaptive multimodal matching module and a feature extraction pipeline to improve the efficiency. Luo et al. [32] combined the advantages of convolutional neural networks and Transformer to propose the FSegNet network, introducing FasterViT and utilizing its efficient hierarchical attention to mitigate the surge in self-attention computation for remote sensing images. Wang et al. [33] proposed a novel lightweight edge-supervised neural network for optical remote sensing images, where the backbone is lightened by a feature encoding subnet to achieve better performance compared with some typical methods. Yan et al. [34] introduced a lightweight network based on multi-scale asymmetric convolutional neural networks with an attention mechanism (MA-CNN-A) for ship-radiated noise classification. Wang et al. [35] proposed a multi-scale graph encoder–decoder network (MGEN) for multimodal data classification. Song et al. [36] employed a convolutional neural network model named DeeperLabC for the semantic segmentation of corals.
In summary, these networks mentioned above were all designed for general tasks in the segmentation of remote sensing images. Lightweight modifications were made to the networks to balance computational speed and accuracy. However, these general networks perform poorly in the special scenario of detecting sand mining behavior from riverbank images. Thus, it is vital to build a specific lightweight network dedicated to the task of evaluating riverbank sand mining area segmentation.
In our research, we propose a brand-new method of achieving riverbank sand mining area segmentation using high-resolution satellite images. Our method demonstrates lightweight improvements compared with traditional networks and also considers multi-scale feature extraction. The following sections introduce the detailed structure and experimental results of the model.
3. Materials and Methods
3.1. Image Dataset and Preprocessing
We built an image dataset to test the accuracy and efficiency of our network. The experimental data products are from the Gaofen-2 (GF-2) satellite, which has an orbital altitude of 631 km and a 5-day revisit period. The resolution of the panchromatic band is 1 m and the resolution of the multispectral bands (RGB and near-infrared) is 4 m. The wavelength ranges of the bands are 450–520 nm (blue), 520–590 nm (green), 630–690 nm (red) and 770–890 nm (near-infrared) [30].
The dataset for our research was from 5 sets of satellite image products. Four of them were taken over Shulan, Jilin Province, China, and were acquired on 4 November 2022, 14 June 2023, 27 August 2023 and 15 October 2023. The other one was taken over Changsha, Hunan Province, China, on 20 July 2023. We selected 20 regions in these 5 images and marked all the sand mining areas at the riverbank as the ground-truth labels. All these images and corresponding labels were clipped to a size of 256 × 256. We used random rotations, shifts, blurring, salt-and-pepper noise and horizontal and vertical flips to expand the dataset, and a ±0–5% random brightness jitter in the RGB bands for data augmentation. The augmented dataset was then divided into a training dataset, validation dataset and testing dataset at an approximate 8:1:1 ratio. Each dataset contained images with and without riverbank sand mining areas. The types and quantities of the images in each dataset are shown in Table 1, and Figure 1 uses 3 typical scenarios as examples to show the images in the dataset. We have already made part of this dataset open source. The 4 images taken over Shulan and the corresponding labels can be downloaded at https://pan.baidu.com/s/1symaNsAmXzamDR2Ljf7nlQ?pwd=uafb (accessed on 4 January 2025).
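A key detail of the geometric augmentations above (flips and rotations) is that each transform must be applied identically to an image and its label mask. A minimal pure-Python sketch of this pairing, with hypothetical helper names and nested lists standing in for image arrays:

```python
import random

def hflip(img):
    # Horizontal flip: reverse each row.
    return [row[::-1] for row in img]

def vflip(img):
    # Vertical flip: reverse the order of the rows.
    return img[::-1]

def rot90(img):
    # 90-degree clockwise rotation: reverse rows, then transpose.
    return [list(r) for r in zip(*img[::-1])]

def augment(img, label, seed=None):
    """Apply 1-2 random geometric transforms, identically to both
    the image and its ground-truth mask."""
    rng = random.Random(seed)
    ops = rng.sample([hflip, vflip, rot90], k=rng.randint(1, 2))
    for op in ops:
        img, label = op(img), op(label)
    return img, label
```

Photometric augmentations such as brightness jitter, by contrast, would be applied to the image only, never to the label.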
3.2. Overall Architecture of LMS Net
Figure 2 shows the overall architecture of our LMS Net. The numbers in the figure represent the sizes of the images or features. As shown in the figure, the LMS network has 5 levels from shallow to deep. In the first level, the original image $X$ was input to the LMS block, and the output was $F_1$. This process can be represented by the following formula:

$$F_1 = \mathrm{LMS}(X)$$

Starting from the second level, the output features from the previous level were 2 × 2 maximum-pooled and then input to another LMS block. This step can be formulated as follows:

$$F_i = \mathrm{LMS}\big(\mathrm{MaxPool}_{2\times 2}(F_{i-1})\big), \quad i = 2, \dots, 5$$

The output features of the LMS block at each level were input into a channel attention module (CAM), which is defined in Section 3.4. The output feature was $A_i$. This can be described as follows:

$$A_i = \mathrm{CAM}(F_i)$$
In the decoding process, first, the CAM output of the deepest (fifth) level, denoted $A_5$, was upsampled through a 2 × 2 bilinear interpolation and then concatenated by channel with the CAM output of the fourth level, $A_4$, to form the feature $D_4$:

$$D_4 = \mathrm{Concat}\big(\mathrm{Up}_{2\times 2}(A_5),\, A_4\big)$$

Then, $D_4$ was input to the LMS block and upsampled through a 2 × 2 bilinear interpolation before being concatenated with $A_3$ in the next stage:

$$D_i = \mathrm{Concat}\big(\mathrm{Up}_{2\times 2}(\mathrm{LMS}(D_{i+1})),\, A_i\big), \quad i = 3, 2, 1$$

These operations were repeated at each level until the features returned to the first level.
At the end of the network, the first-level decoder feature was passed through a 3 × 3 convolutional layer that adjusted the number of channels. Finally, the output result was obtained.
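The level-by-level feature sizes implied by the description above can be tracked with a short script. This is a shape-bookkeeping sketch only; the channel widths are hypothetical placeholders, since the actual widths are defined in Figure 2.

```python
# Track (height, width, channels) through the 5-level encoder-decoder:
# 2x2 max pooling halves the spatial size between encoder levels, and
# 2x2 bilinear upsampling doubles it again in the decoder.
# Channel widths below are hypothetical examples.

def lms_net_shapes(size=256, channels=(16, 32, 64, 128, 256)):
    enc = []
    s = size
    for lvl, c in enumerate(channels, start=1):
        if lvl > 1:
            s //= 2              # 2x2 max pooling between levels
        enc.append((s, s, c))    # LMS block + CAM keep the spatial size
    dec = []
    for lvl in range(len(channels) - 1, 0, -1):
        s *= 2                   # 2x2 bilinear upsampling
        # Concatenation with the skip connection doubles the channels.
        dec.append((s, s, channels[lvl - 1] * 2))
    return enc, dec

enc, dec = lms_net_shapes()
print(enc)  # [(256,256,16), (128,128,32), (64,64,64), (32,32,128), (16,16,256)]
print(dec)  # [(32,32,256), (64,64,128), (128,128,64), (256,256,32)]
```

For a 256 × 256 input the deepest encoder feature is 16 × 16, and the decoder returns the feature to 256 × 256 before the final 3 × 3 convolution adjusts the channel count.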
3.3. Detailed Structure of LMS Block
In our lightweight multi-scale network, we propose a new module for the encoder and decoder, a lightweight multi-scale (LMS) block, which is important for the quick extraction of the multi-scale features of the sand mining area from satellite images. This section shows the detailed structure of the LMS block.
The input of this module is a satellite image or a feature map from the middle levels. The input matrix is first processed by 5 parallel convolutional layers. The first 4 of the 5 convolutional layers have a kernel size of 3; a stride of 1; and dilation rates of 1, 2, 3 and 5, respectively. These 4 parallel layers use different dilation rates to extract image features of different scales. The output matrices of these 4 layers are $C_1$, $C_2$, $C_3$ and $C_4$, which have the same length and width as the output matrix and 1/8 its number of channels. Then, these four matrices are concatenated along the channel dimension to form a new feature, $C_c$, which has half the number of channels of the output matrix. Next, $C_c$ undergoes a depthwise convolution, and the output matrix is $C_d$. These two matrices are concatenated by channel and finally form the output matrix $O$. This final step can be described as follows:

$$O = \mathrm{Concat}(C_c, C_d)$$

The detailed architecture of our LMS block is shown in Figure 3.
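The channel arithmetic of the block can be checked with a few lines. This sketch only verifies that the branch widths described above add up to the output width; it is not an implementation of the block itself.

```python
# Channel bookkeeping for the LMS block: four dilated 3x3 branches
# (dilation rates 1, 2, 3, 5), each producing out_ch // 8 channels,
# are concatenated (giving out_ch // 2), passed through a depthwise
# convolution (which preserves the channel count), and the two halves
# are concatenated to form the out_ch-channel output.

def lms_block_channels(out_ch):
    assert out_ch % 8 == 0, "output channels must be divisible by 8"
    branch = out_ch // 8          # each of the 4 dilated branches
    concat = 4 * branch           # concatenation of the four branches
    depthwise = concat            # depthwise conv keeps the channel count
    output = concat + depthwise   # final concatenation
    return branch, concat, output

print(lms_block_channels(64))   # (8, 32, 64)
print(lms_block_channels(128))  # (16, 64, 128)
```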
3.4. Channel Attention Module
A channel attention module (CAM) [37] is used in our network to selectively emphasize the informative features in the images while suppressing features that are less informative. Figure 4 shows the detailed architecture of this module.

First, the input feature map $F$, containing $C$ channels, is compressed in the spatial dimensions to form two 1 × 1 × $C$ matrices, $F_{max}$ and $F_{avg}$; $F_{max}$ uses max pooling and $F_{avg}$ uses average pooling. Then, $F_{max}$ and $F_{avg}$ go through two fully connected layers, the first of which has $C/r$ nodes (where $r$ is the reduction ratio) and the second of which has $C$ nodes. After this step, we obtain two weight matrices. Next, these two matrices are added to form one new vector, which is passed through a sigmoid function $\sigma$. Finally, it is multiplied with the input feature map, and the output result is $F'$. This mechanism can be described as follows:

$$F' = \sigma\big(\mathrm{MLP}(\mathrm{MaxPool}(F)) + \mathrm{MLP}(\mathrm{AvgPool}(F))\big) \otimes F$$
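The data flow of the module can be sketched in pure Python. The MLP weights below are uniform placeholders (the real module learns them), so this only illustrates the pooling, shared MLP, addition and sigmoid steps, not a trained attention module.

```python
import math

def cam_weights(feature, reduction=2):
    """Per-channel attention weights for `feature`, a list of C
    two-dimensional maps (lists of rows). Placeholder MLP weights."""
    C = len(feature)
    assert C % reduction == 0
    # Global max pooling and global average pooling over each channel.
    maxp = [max(max(row) for row in ch) for ch in feature]
    avgp = [sum(sum(row) for row in ch) / (len(ch) * len(ch[0]))
            for ch in feature]

    def mlp(v):
        # Shared 2-layer MLP, C -> C/r -> C, with uniform placeholder
        # weights 1/C and 1/(C/r) instead of learned parameters.
        hidden = [sum(v) / C for _ in range(C // reduction)]
        return [sum(hidden) / (C // reduction) for _ in range(C)]

    sigmoid = lambda x: 1.0 / (1.0 + math.exp(-x))
    # Add the two MLP outputs and squash to (0, 1) per channel.
    return [sigmoid(a + b) for a, b in zip(mlp(maxp), mlp(avgp))]
```

Multiplying these weights back onto the input feature map (channel by channel) completes the attention step described above.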
3.5. Experimental Design and Criteria
All the experiments in our research were deployed with Tensorflow 2.10.1 on an NVIDIA 4070 Graphics Processing Unit (GPU) manufactured by the MSI company in Kunshan, Jiangsu Province, China. To accelerate the training process, CUDA toolkit 11.2 and cuDNN library 8.1 were used.
Our LMS network was compared with Deeplab V3+ [16], MobileNet V2 [23], ShuffleNet V2 [26], SqueezeNet [21], GhostNet [27], Enet [38], Ultra-lightweight Net [29] and FSegNet [32]. All the networks were trained with an Adam optimizer. We set the hyperparameter $\beta_1$ to 0.9, $\beta_2$ to 0.999 and the initial learning rate to 0.0001. The networks were trained for 100 iterations, and the batch size was 10. We adopted an early stopping strategy, which terminates training when the loss does not decrease for 20 consecutive iterations. We used the cross-entropy function to calculate the loss of all these networks, which is formulated as follows:

$$L = -\frac{1}{N}\sum_{i=1}^{N}\sum_{c=1}^{M} y_{i,c}\log(\hat{y}_{i,c})$$

where $N$ is the number of pixels, $M$ is the number of classes, $y_{i,c}$ is the ground-truth label and $\hat{y}_{i,c}$ is the predicted probability. All the network experiments were 9-fold cross-validated and averaged.
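As a reference for the loss computation, a minimal pure-Python version of the standard cross-entropy over one-hot labels is shown below; in practice the authors would use their framework's built-in loss, so this is only an illustrative sketch.

```python
import math

def cross_entropy(y_true, y_pred, eps=1e-7):
    """Mean pixel-wise cross-entropy. `y_true` holds one-hot labels and
    `y_pred` holds predicted class probabilities (lists of equal length).
    `eps` guards against log(0)."""
    total = 0.0
    for t, p in zip(y_true, y_pred):
        total += -sum(ti * math.log(max(pi, eps)) for ti, pi in zip(t, p))
    return total / len(y_true)

# Two pixels, two classes (background / sand mining area).
loss = cross_entropy([[1, 0], [0, 1]], [[0.9, 0.1], [0.2, 0.8]])
print(round(loss, 4))  # ≈ 0.1643
```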
We also designed an ablation study to compare the performance before and after the network improvement. Three networks, Unet, Unet-half and our LMS Net, were trained to verify the improvement in performance. Unet-half reduces the number of channels in each convolution layer of Unet by half. The optimizer, hyperparameters and early stopping strategy were the same in the ablation study as in the comparative experiment. The ablation experiment was also 9-fold cross-validated and averaged.
The output pixels of the results fall into 4 categories: true positive (TP) refers to pixels that are correctly segmented as sand mining areas; false positive (FP) refers to those that are segmented as sand mining areas but are labeled as background; true negative (TN) refers to those that are correctly segmented as background; and false negative (FN) refers to those that are segmented as background but are labeled as sand mining areas.
We evaluated the results of the experiment from the following perspectives. The overall accuracy (OA) refers to the ratio of TP and TN to the total number of output pixels. The precision (P), recall (R), mean intersection over union (mIoU), F1-score (F1) and Kappa are formulated as follows:

$$\mathrm{OA} = \frac{TP + TN}{TP + TN + FP + FN}, \quad P = \frac{TP}{TP + FP}, \quad R = \frac{TP}{TP + FN}$$

$$\mathrm{mIoU} = \frac{1}{2}\left(\frac{TP}{TP + FP + FN} + \frac{TN}{TN + FP + FN}\right), \quad F1 = \frac{2PR}{P + R}, \quad \mathrm{Kappa} = \frac{p_o - p_e}{1 - p_e}$$

where $p_o$ is the observed agreement (equal to OA) and $p_e$ is the expected agreement by chance.
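These metric definitions can be checked with a short helper; the confusion-matrix counts in the example are made-up numbers, not results from the paper.

```python
def metrics(tp, fp, tn, fn):
    """Standard binary segmentation metrics from confusion-matrix counts."""
    total = tp + fp + tn + fn
    oa = (tp + tn) / total                    # overall accuracy
    p = tp / (tp + fp)                        # precision
    r = tp / (tp + fn)                        # recall
    f1 = 2 * p * r / (p + r)                  # F1-score
    # mIoU averaged over the two classes: sand mining area and background.
    iou_fg = tp / (tp + fp + fn)
    iou_bg = tn / (tn + fp + fn)
    miou = (iou_fg + iou_bg) / 2
    # Cohen's kappa: observed vs. chance agreement.
    po = oa
    pe = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / total ** 2
    kappa = (po - pe) / (1 - pe)
    return dict(OA=oa, P=p, R=r, F1=f1, mIoU=miou, Kappa=kappa)

print(metrics(80, 10, 100, 10))  # hypothetical counts
```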
In addition, we also compared the number of parameters (Params), the number of floating-point operations (FLOPs), the output frame rate (frames per second, FPS) and the prediction time of each model.
5. Discussion
From the results of the comparative experiment, we can divide the networks used for comparison into four categories. The performance of the networks with fewer than 1M parameters, including Deeplab V3+ [16], MobileNet V2 [23], ShuffleNet V2 [26] and SqueezeNet [21], was relatively poor. Ultra-lightweight net [29] and LMS Net had fewer than 1M more parameters than those four networks but showed greatly improved accuracy. The performance of ENet [38] and GhostNet [27] fell between that of the two categories above. FSegNet [32] consumed the most computational resources among these networks because of its use of Transformer, and its improvement in segmentation quality was limited.
Considering that ultra-lightweight net [29] and LMS Net have more complex decoder designs compared to the other networks, we can speculate that the rationality of the decoder design greatly affects the segmentation accuracy. Although the numbers of network parameters in the other two categories are different, their decoder designs are very simple. Therefore, an increase in the number of parameters can bring a slight improvement in accuracy, but this improvement is far inferior to that achieved with a more reasonable decoder design.
From the results of the ablation experiment, it can be seen that the LMS network can greatly reduce the number of parameters and improve computational speed while maintaining accuracy that is not significantly different from that of complex networks like Unet. The LMS network is even better than Unet in extracting targets of different sizes. These results indicate that our LMS network achieves the goal of lightweight and multi-scale object extraction.
In future research, we will focus on the design of decoder structures in lightweight networks and how to balance the number of parameters between encoders and decoders. We will seek new methods to optimize encoder parameters and design suitable lightweight segmentation decoders.