A Semantic Segmentation Method for Winter Wheat in North China Based on Improved HRNet

Wang, Chunshan; Zhu, Penglei; Yang, Shuo; Zhang, Lijie

doi:10.3390/agronomy14112462

Open AccessArticle

A Semantic Segmentation Method for Winter Wheat in North China Based on Improved HRNet

¹

School of Information Science and Technology, Hebei Agricultural University, Baoding 071001, China

²

Hebei Engineering Research Center for Agricultural Remote Sensing Application, Baoding 071001, China

³

National Engineering Research Center for Information Technology in Agriculture, Beijing 100097, China

^*

Author to whom correspondence should be addressed.

Agronomy 2024, 14(11), 2462; https://doi.org/10.3390/agronomy14112462

Submission received: 8 September 2024 / Revised: 15 October 2024 / Accepted: 17 October 2024 / Published: 22 October 2024

(This article belongs to the Section Precision and Digital Agriculture)

Download

Browse Figures

Versions Notes

Abstract

Winter wheat is one of the major crops for global food security. Accurate statistics of its planting area play a crucial role in agricultural policy formulation and resource management. However, the existing semantic segmentation methods for remote sensing images are subjected to limitations in dealing with noise, ambiguity, and intra-class heterogeneity, posing a negative impact on the segmentation performance of the spatial distribution and area of winter wheat fields in practical applications. In response to the above challenges, we proposed an improved HRNet-based semantic segmentation model in this paper. First, this model incorporates a semantic domain module (SDM), which improves the model’s precision of pixel-level semantic parsing and reduces the interference from noise through multi-confidence scale class representation. Second, a nested attention module (NAM) is embedded, which enhances the model’s capability of recognizing correct correlations in pixel classes. The experimental results show that the proposed model achieved a mean intersection over union (mIoU) of 80.51%, a precision of 88.64%, a recall of 89.14%, an overall accuracy (OA) of 90.12%, and an F1-score of 88.89% on the testing set. Compared to traditional methods, our model demonstrated better segmentation performance in winter wheat semantic segmentation tasks. The achievements of this study not only provide an effective tool and technical support for accurately measuring the area of winter wheat fields, but also have important practical value and profound strategic significance for optimizing agricultural resource allocation and achieving precision agriculture.

Keywords:

remote sensing image; winter wheat; semantic segmentation; HRNet; semantic domain optimization; nested attention optimization

1. Introduction

Winter wheat is one of the major crops for global food security. Accurate statistics of its planting area are crucial for guiding agricultural production, ensuring food supply, and optimizing planting structure [1]. The timely and accurate acquisition of crop spatial distribution information is of great practical significance for optimizing crop planting layout, enhancing crop monitoring and production management, accurately estimating and predicting yield, and providing decision support for government agencies in formulating scientifically sound agricultural policies. With the advancement of modern agricultural technology, especially following the integration of remote sensing image processing and machine learning, the development of crop recognition and planting area statistics techniques has been significantly promoted [2,3].

Although traditional machine learning methods have achieved good results in crop semantic segmentation tasks using remote sensing images, there are still many limitations to be addressed. For example, the clustering-based segmentation methods are largely affected by lighting and seasonal changes [4]; the threshold segmentation methods are sensitive to initial parameters [5]; and the edge segmentation methods have difficulty in accurately distinguishing the boundaries of crop planting plots [6]. In recent years, deep learning technology has made an important breakthrough in the field of crop segmentation. The multi-layer neural network in deep learning models can automatically learn image features and effectively capture multi-scale information of crops. Compared to traditional machine learning methods, deep learning models exhibit higher accuracy and stronger generalization ability [7,8,9].

In crop semantic segmentation tasks based on remote sensing images, the model must accurately recognize the class of each pixel and determine the positions of crops. This requires the model to not only learn high-level semantic information of wheat plots, but also capture low-level boundary details to accurately obtain the contour and shape of a target plot [10,11,12]. The Variable Spatial Attention Module proposed by Yang et al. [13] achieved a fine recognition of crop features by calculating the feature weights in spatial dimensions, which improved the model’s capability of capturing crop details as well as its adaptability to complex farmland scenes. The multi-task learning encoder–decoder network proposed by Long et al. [14] was able to help the classification model to learn both local details and the overall structure of crops by unifying the perception of boundaries and shapes of planting plots. The HRNet model introduced by Zhang et al. [15] successfully maintained high-resolution feature representation through its parallel structure and multi-scale feature fusion, allowing the model to accurately capture the fine structure and spatial distribution of crops. This model provides accurate spatial information of the target object in tasks of crop semantic segmentation. Zhang et al. [16] improved DeepLabV3+ by replacing its backbone network with the lightweight MobileNetV2 and introducing a Convolutional Block Attention Module (CBAM), which combines channel attention and spatial attention modules, thereby achieving lightweight semantic segmentation extraction for winter wheat.

Although the above methods have integrated the spatial contextual information at multiple scales through feature fusion, which is conducive to enhancing the model’s ability of pixel-level feature representation and improving the overall segmentation performance, there are still two main drawbacks. Firstly, the independence of the correlation computation process will inevitably introduce significant noise and fuzziness into the model [17,18,19], which may seriously affect the classification accuracy. For example, in semantic segmentation tasks of winter wheat planting plots, it may lead to fuzzy and erroneous boundary recognition. Secondly, single-confidence scale class representation methods exhibit shortcomings in dealing with intra-class heterogeneity [17,18,20], making it difficult to adapt to changes in wheat growth status within the same planting plot caused by different environmental and growth factors. This phenomenon will result in the poor performance of the model in distinguishing differences within the same planting plot, which in turn affects the overall evaluation of the planting plot distribution.

To address the above issues, a classification model based on an improved HRNet network for the accurate recognition of winter wheat planting plots was proposed in this paper, using Gaofen-1 satellite images as the data source. The main contributions are summarized as follows:

(1): A diverse semantic segmentation dataset for winter wheat in North China was constructed, encompassing various categories. The study area includes both small, fragmented plots in mountainous regions and relatively concentrated farmland in the plains, reflecting a certain degree of regional heterogeneity.
(2): A semantic domain module was incorporated into the classification model to extend the semantic domain of pixels. Additionally, the concept of class confidence was introduced as a scale criterion within the semantic domain to extract multi-confidence scale class representations. Consequently, the model’s parsing accuracy of pixel-level semantic information significantly improved, and the discrepancies between pixel-level semantic recognition results and actual classes were effectively reduced, leading to enhanced semantic segmentation accuracy.
(3): A nested attention module was introduced to enhance the model’s sensitivity to local features, strengthening key features in images while suppressing correlations with non-target features. Consequently, the model’s ability to recognize crop boundaries in complex agricultural scenes was greatly improved.

2. Materials and Methods

2.1. Overview of the Research Area

The research area of this paper is confined to Shijiazhuang City, Hebei Province, China. Shijiazhuang City is situated in the central southern part of Hebei Province, with a geographical scope ranging from 113°30′–115°20′ E and 37°27′–38°47′ N. At the northern edge of the Huang Huai Plain, Shijiazhuang City has four distinct seasons (hot summers, cold winters, and relatively short springs and autumns). Our research area covers parts of the mountainous regions and plains in Shijiazhuang City, with complex and diverse terrain conditions. Generally speaking, the planting plots in mountainous regions are small in scale and scattered in distribution, while the plots in plains are concentrated in distribution. A total of 156 villages were sampled from the counties under the jurisdiction of Shijiazhuang City (locations shown in Figure 1), covering different terrain types. The total area of the samples is 624.1 km², including several typical agricultural production regions. Overall, the distribution of the selected planting plots can well reflect the general characteristics of land distribution in Shijiazhuang City. The climate in the research area belongs to the temperate monsoon climate, with precipitation mainly concentrated in summers and relative dryness in winters. Winter wheat, as the main crop planted over the winter season, is an important component of local agricultural production. It is usually sown in early October, and harvested from late May to early June in the following year after experiencing the low temperature of winter.

2.2. Data Preprocessing and Sample Construction

In this study, the remote sensing image data obtained by the PMS sensor carried by the China-made Gaofen-1 satellite were used as the source data. In the process of image data selection, priority was given to images with less cloud content and a uniform color tone distribution to ensure that the data quality meets our research requirements. The original multispectral images have an 8 m resolution and include four bands: red, green, blue, and near-infrared. The panchromatic images have a 2 m resolution; despite having a high spatial resolution, the panchromatic images only provide spectral information for a single band. In order to correct the geometric deformation of images, a 30 m resolution Digital Elevation Model (DEM) was used to perform orthorectification on both multispectral and panchromatic images. After orthorectification, the images were treated by the Gram Schmidt Orthogonalization (GS) fusion technique to achieve effective fusion between multispectral images and panchromatic images. This process can not only preserve the rich spectral features in multispectral images, but also improve the spatial resolution to 2 m, so that the applicability of the images in fine terrain recognition was greatly improved. After the above steps, the cloud coverage of the images was significantly reduced, and the color tone distribution became more uniform, which is conducive to improving the model’s recognition accuracy of crops, buildings, water bodies, forests, and other background classes that are contained in the source data images. A sample image after the above processing steps is shown in Figure 2.

In terms of geographical scope selection, to ensure the reliability of our research, representative locations that can comprehensively cover different geographical environments in Shijiazhuang City were selected through precise analysis of the remote sensing image data. The selected samples cover typical terrain types across the city, including a total of 156 villages. Based on the remote sensing images, the boundaries of the land cover types in the samples were drawn to generate vector polygons, and the specific classes of each planting plot were determined through field investigations. In addition, the class information was annotated in the corresponding vector attribute table. The annotated vector data were then converted into raster data, which were used as the label data for model training in ArcGIS10.8 software. To meet the model’s input requirements, the images in the source data and the corresponding labels were cropped into 512 × 512 pixel blocks, and a dataset containing 3488 images was obtained. These images were divided into the training set, validation set, and testing set in a ratio of 7:2:1, which were used for model training, parameter tuning, and performance evaluation, respectively. To improve the model’s robustness, data augmentation was performed on the training and validation sets from the following aspects: (1) randomly rotating the images by 90°, 180°, or 270° to simulate land cover types at different angles and directions; (2) randomly scaling the images by 0.5, 0.75, 1.25, and 1.5 times to simulate the size changes of objects at different scales; (3) increasing or decreasing the contrast of images, such as histogram equalization and gamma correction, to simulate different image qualities or weather conditions.

2.3. Overall Structure of the Model

In this study, the HRNet network [21] was used as the backbone feature extraction network of the classification model, and a semantic domain module (SDM) and a nested attention module (NAM) [22] were introduced to improve the model’s performance. The overall architecture of the model is illustrated in Figure 3. First, the features extracted by the HRNet backbone network are denoted as R; meanwhile, a class probability distribution D is obtained through a 1 × 1 convolution network to serve as the representations of initial classification. Then, the SDM is used to extract multi-confidence scale class representations

S_{m}

and global class representations

S_{g}

from R and D. Subsequently, the NAM is used to process R and

S_{m}

for the purpose of optimizing the pixel class relationships

A_{g}

.

A_{g}

then interacts with

S_{g}

to generate the enhanced semantic representations

R_{a}

. Lastly, the image resolution is restored through bilinear interpolation.

2.4. Semantic Domain Module

The background complexity of remote sensing images is an important consideration of researchers. Due to the diversity of shooting conditions (e.g., geographical locations, time, and shooting angles), even for the same type of land cover type, different remote sensing images may show significant variability. For example, winter wheat may exhibit obvious differences in color and texture in the images captured from different regions, seasons, and shooting angles. The traditional class context modeling methods typically focus on extracting the global feature center of each class. This type of method often overlooks the diversity and complexity within the same class; consequently, the classification model may not be able to accurately distinguish the internal differences of classes with high similarity, thereby increasing the risk of false classification.

To solve the above problem, a scaling criterion based on class confidence was introduced in this paper to explore class representations with multi-confidence scales. Specifically, class representations of small-scale semantics (including only high confidence levels) can help quickly model prominent features, while class representations of large-scale semantics (including all confidence levels) can help model global features. By strengthening the interaction between pixels and multi-confidence scale class representations, the perceptual ability of pixels to different classes can be enhanced, and the interference from noise can be reduced, enabling the model to extract more accurate relationships between pixels and classes to achieve accurate winter wheat classification. The SDM introduced in this paper is shown in Figure 4. According to the class probability distribution

D \in R^{H \times W \times K}

, the feature representations

R \in R^{H \times W \times C}

are grouped into multiple class regions as follows:

R_{k} = {R_{[i, j, *]} | argmax (D_{[i, j, *]}) = k}

(1)

where

R_{k}

is an

N_{k} \times C

matrix; k refers to a class label; and

N_{k}

refers to the number of representations belonging to Class k. Similarly, a matrix

D_{k}

of size

N_{k} \times K

is defined as follows:

D_{k} = {D_{[i, j, *]} | argmax (D_{[i, j, *]}) = k}

(2)

Let

{T_{i}}^{j} (x)

represent a function that returns the range from the i-th element to the j-th element, sorted in a descending order with a starting index of 1. The absolute difference between the highest and second highest probability values in Class k is defined as the certainty that each pixel belongs to Class k. Then, the confidence level of whether a target pixel belongs to Class k can be determined (the larger the difference, the higher the certainty that this pixel belongs to Class k), as shown below:

W_{k} = {{T_{1}}^{1} (d) - {T_{2}}^{2} (d) | d \in D_{k}}

(3)

where

W_{k}

refers to a matrix of

N_{k}

× 1, representing the confidence that a pixel belongs to Class k. In addition, the certainty that a pixel belongs to Scale m of Class k is defined as follows:

W_{k_{m}} = {T_{1}}^{⌊\frac{m \times N_{k}}{M}⌋} (W_{k})

(4)

where

W_{k_{m}}

refers to a matrix of

⌊(m \times N_{k}) / M⌋

× 1, m ∈ [1, M]. For each Class k, the context representations are computed as follows:

S_{k_{m}} = \sum_{n = 1}^{⌊\frac{m \times N_{k}}{M}⌋} \frac{e^{W_{k_{m}} [n]}}{\sum e^{W_{k_{m}}}} \cdot R_{k_{m} [n, *]}

(5)

where

R_{k_{m} [n, *]}

refers to the part in

R_{k}

corresponding to the Weight

W_{k_{m}}

, while

S_{k_{m}} \in R^{1 \times C}

refers to the center of Class k at Scale m.

The output of the SDM is a tensor

S_{m} \in R^{M \times K \times C}

, which is considered as a multi-confidence scale class representation, as shown below:

S_{m} = {S_{k_{m}} | m \in [1, M], k \in K}

(6)

Moreover, large-scale class representations

S_{g} \in R^{K \times C}

, as expressed below, are adopted to participate in class context integration:

S_{g} = {S_{k_{m}} | m = M, k \in K}

(7)

2.5. Nested Attention Module

To seek common features in the original pixel class relationships for the purpose of enhancing correct correlations and suppressing erroneous correlations, an NAM was introduced into our model to reduce the noise and ambiguity of pixel-level class relationship weights. The relationship between pixels and global classes can be obtained from the relationship between pixels and multi-confidence class representations. The specific network structure is shown in Figure 5.

Firstly, the relationship between pixels and multi-confidence scale class representations is computed as follows:

F_{m} = R \otimes S_{m}

(8)

Then, the NAM takes

F_{m}

as the input:

{F^{'}}_{m} = S o f t m a x (\frac{A_{q k} (F_{m}) \otimes A_{q k} (F_{m})}{\sqrt{H W}}) \otimes A_{v} (F_{m})

(9)

The network structure of the NAM is shown in Figure 6, where

{F^{'}}_{m} \in R^{H \times W \times K}

represents the relationship

A_{q k}

after pixel class optimization; and

A_{v} \in R^{K \times MK}

refers to the fully connected layer used for projecting pixels to multi-confidence scale class representations. Finally, the optimized pixel class relationship

A_{g}

is obtained through residual linking.

2.6. Loss Function

To achieve the best training effect, the model proposed in this paper was trained by the Cross-Entropy Loss (CE Loss) function. An advantage of CE Loss is that it can effectively reduce the noise and ambiguity in pixel class relationships, and improve the perceptual ability of pixels to classes, thereby enhancing the model’s discriminative ability and overall performance. CE Loss is capable of handling class imbalance, providing stable gradient signals, and accelerating model convergence. In addition, it incorporates soft labeling to handle uncertainty, which is conducive to improving the model’s robustness and generalization ability, making it a suitable option for optimizing multi-confidence scale class representations. Mathematically, CE Loss can be expressed as follows:

L_{C E} = - \frac{1}{N} \sum_{i = 1}^{N} o n e_h o t_{y_{i}} \cdot l o g (p_{i})

(10)

where

p_{i}

refers to the value of the model output after being processed by softmax;

one_h o t_{y_{i}}

refers to the one hot encoding of a true value. In this way, CE Loss can play an important role in the optimization process of multi-confidence scale class representations, ensuring that the model can still maintain efficient and accurate classification performance under complex backgrounds and varying conditions. More specifically, in dealing with high-complexity and high-diversity tasks such as the segmentation of remote sensing images, the advantage of CE Loss is very prominent. It can not only handle data imbalance, but also effectively optimize class representations with different confidence levels, therefore further improving the model’s overall performance and practicality.

2.7. Experimental Environment and Configurations

The server used in our experiment is equipped with Intel Core i9-9820X @ 3.30 GHz CPU, 64 GB memory, and 4 pieces of 24 GB RTX 3090 GPU for computation. The CUDA version is 11.7. The development environment is Python 3.9. The model construction, training, and parameter tuning are based on the Python 2.0.0 deep learning framework with the following parameters: optimizer, Adam; initial learning rate, 0.01; weight decay, 0.0005; batch size, 4; and epoch, 100.

2.8. Model Evaluation Metrics

Similar to previous studies related to remote sensing semantic segmentation [23,24,25], the mean intersection over union (mIoU), recall, precision, overall accuracy (OA), and F1-score were used in this paper as evaluation metrics to compare the computation results between different models. Specifically, the mIoU represents the mean ratio of the intersection over union of all classes, which can be computed as follows:

m I o U = \frac{1}{k + 1} \sum_{i = 0}^{k} \frac{p_{i i}}{\sum_{i = 0}^{k} p_{i j} + \sum_{i = 0}^{k} p_{j i} - p i i} \times 100 %

(11)

In this paper, TP, TN, FP, and FN refer to the true positive, true negative, false positive, and false negative pixels predicted by the model, respectively. Precision refers to the proportion of samples that actually belong to a certain class among all the samples predicted to be this class.

P r e c i s i o n = \frac{T P}{T P + F P}

(12)

Recall is used to evaluate the model’s capability of recognizing positive samples. It can be computed as follows:

R e c a l l = \frac{T P}{T P + F N}

(13)

OA represents the proportion of correctly classified samples to the total number of samples, which can be computed as follows:

O A = \frac{T P + T N}{T P + F N + F P + T N}

(14)

The F₁-score represents the weighted harmonic average of recall and precision, which can be computed as follows:

F_{1} = \frac{2 \times P r e c i s i o n \times R e c a l l}{P r e c i s i o n + R e c a l l}

(15)

3. Results and Discussion

3.1. Subsection Model Comparison

To verify the effectiveness and accuracy of the proposed method, we selected several traditional deep learning models for remote sensing semantic segmentation to compare their performance with ours. The selected models include U-Net [26], DeepLabv3+ [27], Segformer [28], and PSPNet [29]. The comparison results are presented in Table 1. It can be seen that the mIoU of our method was increased by 4.5, 3,3, 1.8, and 1.4 percentage points compared to U-Net, DeepLabv3+, Segformer, and PSPNet, respectively, indicating that our model can effectively address the issue of intra-class differences in remote sensing image segmentation tasks. By enhancing the perceptual ability of pixels to different classes and suppressing the interference from noise, our model can more accurately capture pixel class relationships, thereby greatly improving the winter wheat segmentation accuracy.

According to the visualization results of semantic segmentation shown in Figure 7, U-Net and Segformer had false classifications in classifying narrow and elongated winter wheat planting plots, especially when dealing with smaller plots, and DeepLabv3+ had poor performance in boundary extraction. In terms of river segmentation, our model delivered the best performance in maintaining river continuity; U-Net and PSPNet had problems in recognizing water bodies; and DeepLabv3+ and Segformer had generally poor performance. In terms of trees and background segmentation, U-Net tended to classify trees as background content; DeepLabv3+ also misclassified a part of trees that were similar to background content; and Segformer had a small number of false classifications.

Comprehensive analysis shows that the model proposed in this paper has significant advantages in segmentation continuity, and is able to accurately recognize and segment planting plots while significantly reducing noise interference. These findings indicate that the SDM and NAM have prominent effects in optimizing the semantic segmentation performance. By enhancing the model’s capability of recognizing internal differences within various land cover classes, the mode’s segmentation performance targeting complex land cover structures was significantly improved, and the pixel-level class relationships were optimized, allowing it to achieve higher accuracy and robustness in semantic segmentation tasks.

3.2. Ablation Tests

In order to clarify the effectiveness of our method in improving the model accuracy, a series of ablation tests were conducted on the two optimization modules introduced in this paper based on the evaluation metrics mentioned earlier. Table 2 shows the results of the ablation tests on each module. For the model only incorporating an SDM, the mIoU was increased by 3.37%; for the model only incorporating an NAM, the mIoU was increased by 3.29%; for the model incorporating both modules, the mIoU was increased by 4.74%. The results prove that both optimization modules have a positive effect on the semantic segmentation of remote sensing images.

While exploring the influencing factors on the model’s performance, the scale number of the SDM and the range of the NAM were considered as key parameters. Under the same control environment and data conditions, a series of comparative tests were conducted. First, the range of the NAM was set to 16 × 16 pixels, while the scale number of the SDM was adjusted on such a basis. Table 3 shows the changing trend of model performance as the scale number increases from 1 to 16. Specifically, when the scale number was set to 1 (i.e., single-confidence scale), the model’s performance was unsatisfactory. As the scale number increased, the model’s segmentation performance was improved gradually, indicating that multi-scale representation is an effective strategy for capturing complex features in images. However, when the scale number was too large, the mIoU tended to turn downward, which may be due to overfitting or limitations in computing resources. In our tests, it was found that the model achieved the best performance in terms of IoU in classes including winter wheat, as well as in terms of overall mIoU, when the scale number was set to 8. Therefore, the scale number of the SDM of the final model was set to 8.

Subsequently, we further investigated the impact of the range of the NAM on the model performance. Specifically, the range of the NAM was gradually increased from 8×8 pixels to 64×64 pixels to observe the model performance under different range sizes. According to the results summarized in Table 4, the mIoU of the model reached its highest value when the range of the NAM was set to 16 × 16 pixels, indicating that the model can most effectively capture local features in images while maintaining good sensitivity to global contextual information at this range, so as to achieve the optimal segmentation performance. As the range of the NAM increased further, although the model could cover a wider range of contextual information, the mIoU did not continue to improve but instead showed a downward trend. This is probably because an excessively large range may reduce the sensitivity of the model to local details or increase the computational complexity, leading to performance weakening.

3.3. Migration Tests

Based on the test results and analyses in earlier sections, it can be concluded that the improved model proposed in this paper can deliver excellent performance in extracting winter wheat planting plots. However, due to differences in planting structure, texture features, and crop growth periods in remote sensing images across different regions, the segmentation performance of the model may be significantly affected in other areas. These differences could lead to a decline in the model’s generalization ability in various environments, making it particularly important to test the model’s performance in different regions to ensure its applicability and stability. In order to further evaluate the model’s generalization ability to different geographical areas, we selected some regions from Xingtai City, Hebei Province, China, as new validation regions. The climate conditions in Xingtai City are similar to those in Shijiazhuang City, both belonging to a temperate monsoon climate with cold winters, providing a suitable environment for the growth of winter wheat. For the selected regions in Xingtai City, the same method as the training set was used for sample extraction and data annotation, and a total of 84 villages were sampled, covering a total area of 285.4 km². The collected remote sensing images were treated by a series of preprocessing steps, including radiometric correction, atmospheric correction, orthorectified correction, and image fusion, to ensure the quality of the data. Subsequently, based on the results of field investigations, the classes of planting plots were determined and accurately annotated in the corresponding vector attribute table, which was then converted into raster data for further use. The geographical distribution of the selected validation regions is shown in Figure 8.

The selected regions in Xingtai City were used to test the previously trained model and the results are summarized in Table 5. It can be seen that the model exhibited a good generalization ability in these new regions. Specifically, in terms of winter wheat recognition, the precision, recall, F1-score, and IoU of the model reached 92.71%, 95.21%, 93.95%, and 88.58%, respectively. In particular, compared to the results in Shijiazhuang City, the difference in IoU was only 2.82%, indicating that our model has high consistency and stability in applications across different regions.

Notably, the class with the largest difference in IoU between the regions in Shijiazhuang City and Xingtai City is trees. Through in-depth analysis of the image data and field conditions, it was found that there is a significant temperature difference between Shijiazhuang and Xingtai in April, which leads to differences in the growth cycle of trees between the two cities. This biological difference is manifested as significant spectral and texture feature changes in remote sensing images, which affects the segmentation accuracy of the model for the class of trees.

4. Conclusions

In order to improve the semantic segmentation performance of winter wheat in remote sensing images, we proposed an improved model based on HRNet as the backbone network by incorporating an SDM and NAM as optimization modules. By introducing multi-confidence scale class representations through the SDM, we significantly enhanced the model’s pixel-level class perceptual ability, effectively reduced the interference from noise, and achieved more accurate extraction of pixel class relationships. The NAM utilizes the original pixel relationship class as a query to identify common features within its scope, in order to strengthen correct correlations and suppress erroneous correlations. On the test set, the model achieved an average intersection to union ratio (mIoU) of 80.51%, precision of 88.64%, recall of 89.14%, overall accuracy of 90.12% (OA), and F1 score of 88.89%. Compared to the existing models of U-Net, DeepLabv3+, Segformer, and PSPNet, the mIoU of our model was increased by 4.5, 3.3, 1.8, and 1.4 percentage points, respectively. In tests conducted in Xingtai, a region with varying spatial heterogeneity, the model achieved a precision of 92.71%, a recall of 95.21%, an F1 score of 93.95%, and an IoU of 88.58% for winter wheat recognition. The IoU difference compared to Shijiazhuang was 2.82%. This demonstrates that the model exhibits high consistency and stability when applied across different regions. It provides an effective tool and technical support for accurately measuring the area of winter wheat fields.

Although the model proposed in this paper has achieved significant performance improvements, there are still some limitations. First, the relatively high complexity of the model structure leads to a high demand for hardware resources, which may cause the problem of a long training time. Second, under the influence of environmental differences and differences in image acquisition time in different regions, the generalization ability of our model needs to be further enhanced.

In response to these limitations, our future research will focus on the following key directions:

(1): Model lightweighting: We will focus on developing lightweight neural network models by reducing model parameters through structural optimization and knowledge distillation techniques to lower the overall computational costs. In the meantime, we aim to maintain or even improve the segmentation accuracy to meet varying deployment requirements on edge devices.
(2): Dataset expansion and diversification: in order to enhance the generalization ability of our model, we will focus on constructing or expanding remote sensing image datasets by including images reflecting winter wheat growth under different geographical and climatic conditions, images reflecting different wheat growing stages from sowing to maturity, as well as images reflecting different planting patterns and influence from a range of pests and diseases.
(3): Innovation in optimization strategies and algorithms: we will focus on exploring new training strategies and optimization algorithms to further improve the training efficiency and performance of our model and reduce training time, while maintaining or enhancing model accuracy.

Through the above measures, we expect to maintain high precision and accuracy of our model while reducing resource consumption and improving its feasibility and applicability in practical applications. We hope to make a profound impact on the research field of remote sensing semantic segmentation and promote the development and application of related techniques, especially in precision agriculture and crop monitoring.

Author Contributions

C.W.: Writing—Original draft preparation; P.Z.: Methodology, Software; S.Y.: Data curation, Visualization; L.Z.: Writing—Reviewing and Editing. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Natural Science Foundation of Hebei Province of China, grant number F2022204004.

Data Availability Statement

Data are contained within the article; further inquiries may be directed to the corresponding author.

Acknowledgments

We are grateful to our colleagues at the Hebei Key Laboratory of Agricultural Big Data and National Engineering Research Center for Information Technology in Agriculture for their help and input, without which this study would not have been possible.

Conflicts of Interest

The authors declare no conflicts of interest.

References

Li, F.; Ren, J.; Wu, S.; Zhao, H.; Zhang, N. Comparison of Regional Winter Wheat Mapping Results from Different Similarity Measurement Indicators of NDVI Time Series and Their Optimized Thresholds. Remote Sens. 2021, 13, 1162. [Google Scholar] [CrossRef]
Yan, S.; Yao, X.; Zhu, D.; Liu, D.; Zhang, L.; Yu, G.; Gao, B.; Yang, J.; Yun, W. Large-scale crop mapping from multi-source optical satellite imageries using machine learning with discrete grids. Int. J. Appl. Earth Obs. Geoinf. 2021, 103, 102485. [Google Scholar]
Pittman, K.; Hansen, M.C.; Becker-Reshef, I.; Potapov, P.V.; Justice, C.O. Estimating global cropland extent with multi-year MODIS data. Remote Sens. 2010, 2, 1844–1863. [Google Scholar] [CrossRef]
Coates, A.; Ng, A.Y. Learning feature representations with k-means. In Neural Networks: Tricks of the Trade: Second Edition; Springer: Berlin/Heidelberg, Germany, 2012; pp. 561–580. [Google Scholar]
Al-Amri, S.S.; Kalyankar, N.V. Image segmentation by using threshold techniques. arXiv 2010, arXiv:1005.4020. [Google Scholar]
Al-Amri, S.S.; Kalyankar, N.V.; Khamitkar, S.D. Image segmentation by using edge detection. Int. J. Comput. Sci. Eng. 2010, 2, 804–807. [Google Scholar]
Long, J.; Shelhamer, E.; Darrell, T. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 3431–3440. [Google Scholar]
Fu, Y.; Zhang, X.; Wang, M. DSHNet: A Semantic Segmentation Model of Remote Sensing Images Based on Dual Stream Hybrid Network. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2024, 17, 4164–4175. [Google Scholar]
Wei, P.; Ye, H.C.; Qiao, S.T.; Liu, R.H.; Nie, C.J.; Zhang, B.R.; Song, L.J.; Huang, S.Y. Early Crop Mapping Based on Sentinel-2 Time-Series Data and the Random Forest Algorithm. Remote Sens. 2023, 15, 3212. [Google Scholar] [CrossRef]
Chen, J.; Zhu, J.; Sun, G.; Li, J.; Deng, M. SMAF-Net: Sharing Multiscale Adversarial Feature for High-Resolution Remote Sensing Imagery Semantic Segmentation. IEEE Geosci. Remote Sens. Lett. 2021, 18, 1921–1925. [Google Scholar] [CrossRef]
Zhang, G.; Jiang, W. Remote Sensing Image Semantic Segmentation Method Based on a Deep Convolutional Neural Network and Multiscale Feature Fusion. Int. J. Semant. Web Inf. Syst. 2023, 19, 1–16. [Google Scholar] [CrossRef]
Gao, L.; Qian, Y.R.; Liu, H.; Zhong, X.W.; Xiao, Z.Q. SRANet: Semantic relation aware network for semantic segmentation of remote sensing images. J. Appl. Remote Sens. 2022, 16, 014515. [Google Scholar] [CrossRef]
Yang, X.; Li, S.; Chen, Z.; Chanussot, J.; Jia, X.; Zhang, B.; Li, B.; Chen, P. An attention-fused network for semantic segmentation of very-high-resolution remote sensing imagery. ISPRS J. Photogramm. Remote Sens. 2021, 177, 238–262. [Google Scholar] [CrossRef]
Long, J.; Li, M.; Wang, X.; Stein, A. Delineation of agricultural fields using multi-task BsiNet from high-resolution satellite images. Int. J. Appl. Earth Obs. Geoinf. 2022, 112, 102871. [Google Scholar] [CrossRef]
Zhang, J.; Lin, S.; Ding, L.; Bruzzone, L. Multi-scale context aggregation for semantic segmentation of remote sensing images. Remote Sens. 2020, 12, 701. [Google Scholar] [CrossRef]
Zhang, Y.; Wang, H.; Liu, J.; Zhao, X.; Lu, Y.; Qu, T.; Tian, H.; Su, J.; Luo, D.; Yang, Y. A Lightweight Winter Wheat Planting Area Extraction Model Based on Improved DeepLabv3+ and CBAM. Remote Sens. 2023, 15, 4156. [Google Scholar] [CrossRef]
Zhang, F.; Chen, Y.; Li, Z.; Hong, Z.; Liu, J.; Ma, F.; Han, J.; Ding, E. Acfnet: Attentional class feature network for semantic segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 6798–6807. [Google Scholar]
Yuan, Y.; Chen, X.; Wang, J. Object-contextual representations for semantic segmentation. In Proceedings of the Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, 23–28 August 2020; Proceedings, Part VI 16. Springer: Cham, Switzerland, 2020; pp. 173–190. [Google Scholar]
Yu, C.; Wang, J.; Gao, C.; Yu, G.; Shen, C.; Sang, N. Context prior for scene segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 12416–12425. [Google Scholar]
Ma, X.; Ma, M.; Hu, C.; Song, Z.; Zhao, Z.; Feng, T.; Zhang, W. Log-can: Local-global class-aware network for semantic segmentation of remote sensing images. In Proceedings of the ICASSP 2023—2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Rhodes Island, Greece, 4–10 June 2023; pp. 1–5. [Google Scholar]
Sun, K.; Xiao, B.; Liu, D.; Wang, J. Deep high-resolution representation learning for human pose estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 5693–5703. [Google Scholar]
Ma, X.; Che, R.; Wang, X.; Ma, M.; Wu, S.; Feng, T.; Zhang, W. DOCNet: Dual-Domain Optimized Class-Aware Network for Remote Sensing Image Segmentation. IEEE Geosci. Remote Sens. Lett. 2024, 21, 1–5. [Google Scholar] [CrossRef]
Liu, Y.; Shi, S.; Wang, J.; Zhong, Y. Seeing beyond the patch: Scale-adaptive semantic segmentation of high-resolution remote sensing imagery based on reinforcement learning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 1–6 October 2023; pp. 16868–16878. [Google Scholar]
Peng, C.; Li, Y.; Jiao, L.; Chen, Y.; Shang, R. Densely based multi-scale and multi-modal fully convolutional networks for high-resolution remote-sensing image semantic segmentation. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2019, 12, 2612–2626. [Google Scholar] [CrossRef]
Wang, Z.; Guo, J.X.; Huang, W.Z.; Zhang, S.W. High-resolution remote sensing image semantic segmentation based on a deep feature aggregation network. Meas. Sci. Technol. 2021, 32, 095002. [Google Scholar] [CrossRef]
Ronneberger, O.; Fischer, P.; Brox, T. U-net: Convolutional networks for biomedical image segmentation. In Medical Image Computing and Computer-Assisted Intervention–MICCAI 2015: 18th International Conference, Munich, Germany, 5–9 October 2015; Proceedings, part III 18; Springer: Cham, Switzerland, 2015; pp. 234–241. [Google Scholar]
Chen, L.; Zhu, Y.; Papandreou, G.; Schroff, F.; Adam, H. Encoder-decoder with atrous separable convolution for semantic image segmentation. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 801–818. [Google Scholar]
Xie, E.; Wang, W.; Yu, Z.; Anandkumar, A.; Alvarez, J.M.; Luo, P. SegFormer: Simple and efficient design for semantic segmentation with transformers. Adv. Neural Inf. Process. Syst. 2021, 34, 12077–12090. [Google Scholar]
Zhao, H.; Shi, J.; Qi, X.; Wang, X.; Jia, J. Pyramid scene parsing network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 2881–2890. [Google Scholar]

Figure 1. Distribution map of the research area.

Figure 2. Schematic diagram of a sample image.

Figure 3. Overall architecture of the model.

Figure 4. Network structure of SDM.

Figure 5. The improved NAM.

Figure 6. The NAM structure.

Figure 7. Visualization results among different models. (a) represents a complex terrain, (b) represents a river region, and (c) represents a simple terrain.

Figure 8. Geographical distribution of selected regions in Xingtai City.

Table 1. Comparison results of different semantic segmentation models.

Model	Miou%	Recall%	Precision%	OA%	F1%
U-Net	76.02	84.64	87.19	89.53	85.54
DeepLabv3+	77.23	85.14	89.10	88.92	86.83
Segformer	78.69	86.84	88.75	89.91	87.76
PSPNet	79.11	86.84	88.04	90.11	88.04
Improved HRNet	80.51	89.14	88.64	91.42	88.89

Table 2. The impact of different modules on model performance results.

Exp	SDM	NAM	MIoU%	Recall%	Precision%	OA%	F1%
1	×	×	75.77	85.36	85.67	89.97	85.47
2	√	×	79.14	87.85	88.11	90.90	87.95
3	×	√	79.06	88.22	87.64	90.99	87.93
4	√	√	80.51	89.14	88.64	91.42	88.89

Table 3. The impact of the number of semantic domain optimization module scales on IoU and MIou of different categories.

Scale Quantity	Wheat	Tree	Water	Build	Background	Miou%
1	89.34	69.61	70.09	86.58	69.53	79.06
2	89.87	71.96	66.91	87.35	71.47	79.78
4	91.21	75.34	67.08	86.87	71.25	80.12
8	91.40	76.28	67.17	87.23	72.41	80.51
16	91.08	75.84	66.85	87.01	72.56	80.28

Table 4. The impact of nested attention optimization module range size on IoU and mIoU of different categories.

Range Size	Wheat	Tree	Water	Build	Background	Miou%
8 × 8	90.81	72.28	69.03	87.31	71.75	79.59
16 × 16	91.40	76.28	67.17	87.23	72.41	80.51
32 × 32	90.91	73.13	71.38	86.94	73.05	80.47
64 × 64	90.88	71.67	71.02	85.89	71.65	79.76

Table 5. Experimental results in Xingtai City.

Class	OA%	Precision%	Recall%	F1%	IoU%
Wheat		92.71	95.21	93.95	88.58
Tree		69.59	69.04	69.31	53.04
Water		83.98	59.37	69.57	53.33
Build		57.52	90.03	88.75	79.78
Background		70.57	65.39	67.88	51.38
Total	85.11	83.45	78.41	80.39	68.68

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Wang, C.; Zhu, P.; Yang, S.; Zhang, L. A Semantic Segmentation Method for Winter Wheat in North China Based on Improved HRNet. Agronomy 2024, 14, 2462. https://doi.org/10.3390/agronomy14112462

AMA Style

Wang C, Zhu P, Yang S, Zhang L. A Semantic Segmentation Method for Winter Wheat in North China Based on Improved HRNet. Agronomy. 2024; 14(11):2462. https://doi.org/10.3390/agronomy14112462

Chicago/Turabian Style

Wang, Chunshan, Penglei Zhu, Shuo Yang, and Lijie Zhang. 2024. "A Semantic Segmentation Method for Winter Wheat in North China Based on Improved HRNet" Agronomy 14, no. 11: 2462. https://doi.org/10.3390/agronomy14112462

APA Style

Wang, C., Zhu, P., Yang, S., & Zhang, L. (2024). A Semantic Segmentation Method for Winter Wheat in North China Based on Improved HRNet. Agronomy, 14(11), 2462. https://doi.org/10.3390/agronomy14112462

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers. See further details here.

Article Menu

A Semantic Segmentation Method for Winter Wheat in North China Based on Improved HRNet

Abstract

1. Introduction

2. Materials and Methods

2.1. Overview of the Research Area

2.2. Data Preprocessing and Sample Construction

2.3. Overall Structure of the Model

2.4. Semantic Domain Module

2.5. Nested Attention Module

2.6. Loss Function

2.7. Experimental Environment and Configurations

2.8. Model Evaluation Metrics

3. Results and Discussion

3.1. Subsection Model Comparison

3.2. Ablation Tests

3.3. Migration Tests

4. Conclusions

Author Contributions

Funding

Data Availability Statement

Acknowledgments

Conflicts of Interest

References

Share and Cite

Article Metrics

Article Access Statistics

Further Information

Guidelines

MDPI Initiatives

Follow MDPI