Article

MAF-RecNet: A Lightweight Wheat and Corn Recognition Model Integrating Multiple Attention Mechanisms

1
School of Land Science and Space Planning, Hebei GEO University, Shijiazhuang 050031, China
2
Hebei International Joint Research Center for Remote Sensing of Agricultural Drought Monitoring, School of Land Science and Space Planning, Hebei GEO University, Shijiazhuang 050031, China
3
Hebei Institute of Hydrogeology and Engineering Geology, Hebei Remote Sensing Center, Shijiazhuang 050021, China
*
Author to whom correspondence should be addressed.
Remote Sens. 2026, 18(3), 497; https://doi.org/10.3390/rs18030497
Submission received: 16 December 2025 / Revised: 14 January 2026 / Accepted: 28 January 2026 / Published: 3 February 2026

Highlights

What are the main findings?
  • MAF-RecNet achieves an excellent balance between accuracy and efficiency. For southern Hebei farmland recognition, it attains 87.57% mIoU and 95.42% mAP, outperforming models like SegNeXt and FastSAM, while remaining lightweight (15.25 M parameters, 21.81 GFLOPs).
  • The model also shows strong generalization, reaching 90.20% mIoU on a global wheat disease dataset, and maintains robust performance under noise and degradation tests, confirming its reliability in diverse real-world scenarios.
What are the implications of the main findings?
  • This study provides a practical solution for intelligent agricultural identification. By tackling key challenges like high model complexity, small-sample overfitting, and limited cross-domain generalization, MAF-RecNet achieves high accuracy with a lightweight design, offering a deployable tool for tasks such as crop census and disease monitoring.
  • Furthermore, it offers methodological insights for designing lightweight models. Its validated modular components—including a hybrid attention mechanism, a pre-trained backbone, and dual-attention skip connections—not only boost performance but also provide a transferable framework for addressing similar lightweight, small-sample recognition problems in other fields.

Abstract

This study is grounded in the macro-context of smart agriculture and global food security. Due to population growth and climate change, precise and efficient monitoring of crop distribution and growth is vital for stable production and optimal resource use. Remote sensing combined with deep learning enables multi-scale agricultural monitoring from field identification to disease diagnosis. However, current models face three deployment bottlenecks: high complexity hinders operation on edge devices; scarce labeled data causes overfitting in small-sample cases; and there is insufficient generalization across regions, crops, and imaging conditions. These issues limit the large-scale adoption of intelligent agricultural technologies. To tackle them, this paper proposes a lightweight crop recognition model, MAF-RecNet. It aims to achieve high accuracy, efficiency, and strong generalization with limited data through structural optimization and attention mechanism fusion, offering a viable path for deployable intelligent monitoring systems. Built on a U-Net with a pre-trained ResNet18 backbone, MAF-RecNet integrates multiple attention mechanisms (Coordinate, External, Pyramid Split, and Efficient Channel Attention) into a hybrid attention module, improving multi-scale feature discrimination. On the Southern Hebei Farmland dataset, it achieves 87.57% mIoU and 95.42% mAP, outperforming models like SegNeXt and FastSAM, while remaining lightweight and efficient (15.25 M parameters, 21.81 GFLOPs). The model also shows strong cross-task generalization, with mIoU scores of 80.56% (Wheat Health Status Dataset in Southern Hebei), 90.20% (Global Wheat Health Dataset), and 84.07% (Corn Health Status Dataset). Ablation studies confirm the contribution of the attention-enhanced skip connections and decoder. This study not only provides an efficient and lightweight solution for few-shot agricultural image recognition but also offers valuable insights into the design of generalizable models for complex farmland environments. It contributes to promoting the scalable and practical application of artificial intelligence technologies in precision agriculture.

1. Introduction

Accurate and timely monitoring of crop spatial distribution and health status is crucial for global food security [1]. This study focuses on two of the world’s primary staple crops—wheat and maize—as their yield stability is directly linked to socioeconomic stability [1]. At the macro scale, remote sensing data enable large-area crop identification and dynamic monitoring through deep learning [1], providing more timely and spatially detailed information for yield forecasting [2]. At the field scale, high-resolution canopy images captured by handheld devices support fine-grained crop health diagnostics [3]. Deep learning models can identify stress responses induced by pests, diseases, or nutrient deficiencies [4], informing precision management practices such as variable-rate fertilization and pesticide application [5]. This approach not only safeguards yield but also enhances resource-use efficiency [6,7].
However, most existing models are designed for specific scales or tasks, lacking an integrated framework capable of simultaneously supporting macro-scale area monitoring and micro-scale disease identification. Therefore, developing deep learning models that fuse multi-scale data is of significant strategic importance for advancing smart agriculture [8]. Critically, the practical deployment of such models faces two major obstacles: the few-shot learning problem, where models must perform with scarce high-quality annotations, and the cross-domain generalization challenge, where performance degrades due to shifts in region, growth stage, or imaging conditions [8]. These constraints severely limit the large-scale, real-world application of deep learning in agriculture. To comprehensively validate the robustness and generalizability of multi-scale models, it is essential not only to evaluate them on structured farmland datasets but also to extend the analysis to more diverse and globally representative crop disease datasets. This will ensure that the models can adapt to varying agricultural environments and meet practical application requirements [7].
Current deep learning research in agricultural remote sensing primarily focuses on three key areas: model lightweighting, cross-domain generalization, and accuracy optimization in complex scenarios [9]. In lightweight design, researchers employ efficient architectures and compression techniques to reduce computational costs, enabling deployment on mobile or edge devices [10]. Relevant studies include lightweight CNNs for mobile-based pest and disease recognition [10], networks enhanced for robustness under adverse weather conditions [11], and lightweight models optimized through multidimensional saliency [12]. However, when confronted with the high target diversity, complex structures, and imbalanced sample distributions characteristic of real agricultural environments, these methods still exhibit insufficient representational capacity and limited generalization performance, with particularly pronounced degradation in few-shot and cross-domain scenarios. Consequently, constructing a multi-scale fusion model that balances lightweight design, strong generalization, and high accuracy has become a critical and urgent research priority in the field of intelligent interpretation for agricultural remote sensing.
To effectively address the challenges in wheat and maize identification, such as high model complexity, weak generalization, and insufficient accuracy in small-target recognition, this study proposes a lightweight Multi-Attention Fusion Recognition Network (MAF-RecNet). By organically integrating hybrid attention mechanisms with lightweight preprocessing modules, the model significantly enhances feature representation capabilities under limited sample conditions. This approach thus aims to achieve an excellent balance between computational efficiency and recognition accuracy, while enhancing the model’s adaptability and practical value in complex agricultural recognition scenarios.
In summary, the proposed MAF-RecNet method delivers notable improvements in mitigating overfitting in small-sample settings, enhancing small-object recognition, and strengthening generalization performance. It also represents a significant advance in balancing model lightweighting with recognition accuracy. The main objectives of this study are as follows:
(1)
Construct a lightweight network architecture that integrates multi-attention mechanisms to reduce model complexity while enhancing the ability to discern multi-scale features and small targets in wheat and maize images.
(2)
Develop model optimization methods tailored for small-sample conditions, by incorporating pre-trained knowledge, designing efficient feature fusion modules, and employing hybrid loss functions, to mitigate overfitting with limited data and balance recognition accuracy with computational efficiency.
(3)
Establish a hierarchical, multi-task performance evaluation framework to systematically validate the model’s performance across crop recognition tasks of varying regions and scales, and to test its cross-domain generalization capability and robustness under noisy conditions.

2. Materials and Methods

2.1. Acquisition and Preprocessing of Remote Sensing Images

This study utilized 103 scenes of Chinese Gaofen-2 (GF-2) satellite imagery covering southern Hebei Province (including Shijiazhuang, Xingtai, and Handan). All images were acquired between February and April 2024, with cloud coverage below 20% per scene. Data were sourced from http://sasclouds.com. Detailed acquisition dates are listed in Table 1.
The panchromatic and multispectral imagery from the GF-2 satellite used in this study were all processed as surface reflectance products, eliminating the effects of illumination and atmosphere to ensure spectral comparability across images from different time phases. Subsequently, orthorectification was performed on the imagery to remove terrain relief and sensor geometric distortions. A pan-sharpening fusion technique based on the Gram-Schmidt method was then applied to integrate the high-spatial-resolution panchromatic band with the spectrally rich multispectral bands [13].
This technique effectively preserves the clarity of object edges while maintaining the integrity of multispectral features, thereby generating high-quality fused imagery with a spatial resolution of 0.8 m [14]. Finally, precise cropping of the fused imagery was conducted according to the vector boundaries of the study area, removing irrelevant background regions to obtain target area imagery for subsequent analysis [15].

2.2. Dataset Construction

2.2.1. Pre-Enhanced Image Dataset

This study constructed the Southern Hebei Farmland Dataset (SHFD) using the sliding-window method based on preprocessed GF-2 satellite imagery. This dataset covers southern Hebei Province (geographic coordinates: 36°03′–38°47′ N, 113°27′–115°49′ E) and targets the identification of corn and wheat fields. The specific workflow is as follows: a fixed 256 × 256-pixel window slides row by row and column by column across the image at a preset stride, and each movement crops out one image-block sample [16,17]. The total number of image blocks generated is calculated using Equation (1).
$$N = \left[ \frac{H - 256}{s} + 1 \right] \times \left[ \frac{W - 256}{s} + 1 \right]$$
Here, $H$ and $W$ are the input image's height and width, $s$ is the sliding-window stride, and $N$ is the total number of generated samples. Field surveys confirmed 2976 wheat and corn field areas in southern Hebei, from which the aforementioned method generated 700 target samples.
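For illustration, a minimal Python sketch of this sliding-window cropping is given below; the stride value shown is a placeholder, since the paper specifies only a "preset stride," and the patch-count check mirrors Equation (1).

```python
import numpy as np

def sliding_window_patches(image: np.ndarray, size: int = 256, stride: int = 128):
    """Crop size x size patches row by row and column by column at a fixed stride."""
    H, W = image.shape[:2]
    patches = [
        image[top:top + size, left:left + size]
        for top in range(0, H - size + 1, stride)
        for left in range(0, W - size + 1, stride)
    ]
    # Equation (1): N = [(H - size)/s + 1] x [(W - size)/s + 1]
    n_expected = ((H - size) // stride + 1) * ((W - size) // stride + 1)
    assert len(patches) == n_expected
    return patches
```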
To comprehensively evaluate the performance of the MAF-RecNet model, this study selected four datasets with different sources, scales, and task objectives. The Southern Hebei Farmland Dataset (SHFD) is constructed from Chinese Gaofen-2 (GF-2) satellite imagery and targets macro-scale remote sensing farmland segmentation, validating the model's capability to identify crop spatial distribution within complex geographical scenes. The other three datasets focus on micro-scale crop disease identification: the Wheat Health Status Dataset in Southern Hebei (WHSSH) was collected via handheld devices in southern Hebei, covering leaf rust, loose smut, stripe rust, stem rust, and healthy wheat, and comprises 620 images of 256 × 256 pixels; the Global Wheat Health Dataset (GWHD) [18] is derived from an internationally published multi-source wheat head detection dataset, from which this study selected wheat ear samples infected with Fusarium head blight, powdery mildew, loose smut, and stripe rust to build a disease-recognition subset of 230 image patches of 256 × 256 pixels; and the Corn Health Status Dataset (CHSD) [19] is based on the publicly available PlantVillage plant disease dataset and includes samples of corn brown spot, rust, smut, downy mildew, gray leaf spot, and leaf blight, totaling 723 samples of 256 × 256 pixels.
This study adopts a multimodal validation framework aimed at systematically evaluating the model’s end-to-end agricultural monitoring capability from “macro remote sensing monitoring to close-range micro-diagnosis.” The selection of the four datasets (SHFD, WHSSH, GWHD, CHSD) is based on their representativeness and complementarity. The primary considerations for choosing these four datasets are as follows:
(1)
Complementarity in Modality and Scale: SHFD (macro-scale satellite imagery) and the other three datasets (micro-scale close-range imagery) together form a complete span of “spatial scales,” covering core agricultural vision tasks ranging from field-level distribution identification to organ-level pathological diagnosis.
(2)
Representativeness of Data Sources: The combination includes both internationally recognized public benchmark datasets (GWHD, CHSD), ensuring comparability with existing research, and non-public datasets (SHFD, WHSSH) specifically constructed to reflect the practical demands and challenges in regional monitoring scenarios.
(3)
Diversity in Crops and Tasks: The datasets cover two major crops (wheat and maize) and encompass multiple tasks such as health status recognition, disease classification, and farmland segmentation, allowing for an initial assessment of the model’s adaptability across crops and tasks.
(4)
Controllability at the Current Research Stage: By limiting the number of datasets to four while ensuring comprehensive validation dimensions, this approach facilitates focused analysis of the model’s behavior under key variations (e.g., scale, data characteristics), preventing the dilution of analytical depth due to an excessive number of test sets.
The decision to “select only four” represents a balanced research design consideration: these datasets constitute a “minimal yet sufficient test set” capable of systematically reflecting the model’s performance in terms of “scale transition,” “modality switching,” and “task migration,” while ensuring concentrated and in-depth experimental analysis.

2.2.2. Post-Enhanced Image Dataset

To address the limited sample sizes in the respective datasets, this study employed a systematic image augmentation strategy to effectively expand the data. First, all original samples were retained as the foundation. Each sample then underwent multiple geometric transformations and color-space adjustments. All augmented samples were subjected to a quality filter: only those with a target pixel ratio exceeding 3% were retained, ensuring that key discriminative features were preserved while scaling the dataset. The specific application frequencies of each augmentation method are summarized in Table 2, and examples of the resulting augmented samples are shown in Figure 1. The main augmentation methods were as follows (an illustrative code sketch follows the list):
(1)
Random horizontal or vertical flipping, enriching the spatial distribution characteristics of farmland [18];
(2)
Random rotation between −30° and 30°, enhancing orientation diversity [19];
(3)
Random scaling from 0.8 to 1.2, simulating scale variation at different capture distances [19];
(4)
Color enhancement through combined adjustments in brightness (0.7–1.3×), contrast (0.7–1.3×), saturation (0.8–1.2×), and sharpness (0.5–1.5×), improving robustness under varying illumination [20];
(5)
Introduction of both Gaussian noise (σ = 5–15) and Gaussian blur (radius = 0.5–2.0 pixels), increasing model tolerance to image noise [21,22]; together, these transformations form a combined augmentation strategy that further diversifies feature representation.
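To make this recipe concrete, the sketch below assembles the listed transformations with torchvision and adds the 3% quality filter. The library choice, application probabilities, and the mapping of the noise/blur parameters are assumptions, as the paper does not name its implementation; for segmentation, the geometric transforms would additionally have to be applied identically to the label masks.

```python
import torch
from torchvision import transforms

# Sketch of the augmentation recipe above (items (1)-(5)); parameter
# choices not stated in the paper are assumptions.
augment = transforms.Compose([
    transforms.RandomHorizontalFlip(p=0.5),                 # (1) horizontal flip
    transforms.RandomVerticalFlip(p=0.5),                   # (1) vertical flip
    transforms.RandomRotation(degrees=30),                  # (2) rotation in [-30, 30] deg
    transforms.RandomAffine(degrees=0, scale=(0.8, 1.2)),   # (3) random scaling
    transforms.ColorJitter(brightness=(0.7, 1.3),           # (4) color enhancement
                           contrast=(0.7, 1.3),
                           saturation=(0.8, 1.2)),
    transforms.RandomAdjustSharpness(sharpness_factor=1.5, p=0.5),  # (4) sharpness
    transforms.GaussianBlur(kernel_size=5, sigma=(0.5, 2.0)),       # (5) blur
    transforms.ToTensor(),
])

def add_gaussian_noise(img: torch.Tensor, lo: float = 5 / 255, hi: float = 15 / 255):
    """(5) Additive Gaussian noise with sigma drawn from [5, 15] on the 0-255 scale."""
    sigma = torch.empty(1).uniform_(lo, hi).item()
    return (img + torch.randn_like(img) * sigma).clamp(0.0, 1.0)

def keep_sample(mask: torch.Tensor, min_ratio: float = 0.03) -> bool:
    """Quality filter: retain a patch only if target pixels exceed 3% of its area."""
    return (mask > 0).float().mean().item() > min_ratio
```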

2.2.3. Robustness Testing Dataset

To address the complex environmental interference that deep learning models may encounter in practical deployment, this study constructed robustness testing and evaluation datasets (SHFD-RTD, WHSSH-RTD, GWHD-RTD, CHSD-RTD) based on the augmented datasets. By introducing multi-dimensional image distortion and interference techniques, the sample size of each dataset was systematically expanded to 1.2 times the original. Detailed information on the robustness test datasets is provided in Table 3, and representative samples are shown in Figure 2. The specific construction methods include:
(1)
Simulating sensor acquisition noise using uniform noise of varying intensities [19];
(2)
Employing spatial filtering techniques such as box blurring and threshold filtering to reproduce image degradation scenarios [21];
(3)
Simulating real-world image processing workflows through post-processing methods like de-sharpening masks and detail enhancement [18];
(4)
Combining color balance and color temperature adjustment techniques to restore complex lighting conditions [19];
(5)
Introducing geometric distortion and other transformations to simulate imaging anomalies [20].

2.3. MAF-RecNet

2.3.1. Overall Architecture

For agricultural multi-classification tasks with limited training samples, this paper proposes a lightweight model named MAF-RecNet. The model is based on an encoder–decoder architecture (Figure 3), where both the encoder and decoder consist of five stages. The encoder performs downsampling, with each stage comprising two residual blocks and a hybrid attention module. The attention mechanisms employed include Coordinate Attention, Efficient Channel Attention, External Attention, and Pyramid Split Attention, which enhance the model’s feature representation capability and cross-regional generalization under sample-sparse conditions.
The decoder performs upsampling; the first four stages progressively reconstruct features using transposed convolution, two standard convolutional layers, and the same hybrid attention mechanisms, maintaining classification accuracy while improving computational efficiency. The final upsampling stage consists of a transposed convolution layer followed by a 1 × 1 convolution. Additionally, skip connections incorporating an attention gate mechanism and Efficient Channel Attention are introduced between the encoder and decoder to enhance the detection of small-scale crop targets.
Before entering the decoder, a dedicated feature extraction layer further refines semantic information. During training, a hybrid loss function combining weighted cross-entropy and Dice loss is adopted to mitigate class imbalance, and the AdamW optimizer with cosine annealing learning rate scheduling is used to strengthen cross-task generalization. Through the synergistic design of these components, MAF-RecNet provides an effective solution for few-shot, multi-class crop recognition tasks.

2.3.2. Preprocessing Module

To address the challenges of few-shot and small-target recognition in crop multi-classification tasks, this study designs an encoder-front preprocessing module that balances high recognition accuracy with lightweight deployment requirements. Its overall structure is illustrated in Figure 4. By incorporating multi-level feature enhancement and fusion mechanisms, the module significantly improves the model’s feature discrimination capability in complex agricultural scenarios. The core of the module consists of three components: an improved initial feature extraction layer, a coordinate attention mechanism, and an efficient channel attention mechanism.
The improved initial feature extraction layer adopts a composite structure that integrates feature extraction, enhancement, and reorganization. First, spatial downsampling is performed using a 3 × 3 convolution with a stride of 2, which reduces computational complexity while preserving key crop details [23]. Then, a dilated convolution with a dilation rate of 2 is introduced to expand the receptive field, enhancing adaptability to multi-scale crop images [23]. Finally, a 1 × 1 convolution is employed to achieve cross-channel feature fusion and reorganization, further improving feature discriminability while maintaining computational efficiency [24].
This preprocessing module forms a hierarchical feature optimization workflow via synergistic interaction of its three components: an initial feature extraction layer that establishes multi-scale base representations, a coordinate attention mechanism that enhances spatial localization, and an efficient channel attention mechanism that refines channel responses [25,26]. These components function as a complementary system—ensuring rich and diverse features, improving small-target recognition, and boosting feature discriminability, respectively.
This collaborative workflow supplies the subsequent encoder with highly discriminative features, improving crop classification accuracy and efficiency in small-sample settings while maintaining a lightweight architecture. The resulting feature maps integrate multi-scale spatial context with enhanced channel correlations, providing a superior feature foundation for agricultural multi-classification tasks.
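A minimal PyTorch sketch of the improved initial feature extraction layer is shown below. The BatchNorm/ReLU placement and the 64-channel output width are assumptions not specified in the paper; the coordinate attention and efficient channel attention blocks that complete the preprocessing module are sketched in Section 2.3.4.

```python
import torch
import torch.nn as nn

class PreprocessStem(nn.Module):
    """Sketch of the initial feature extraction layer (Section 2.3.2).
    Normalization/activation choices and channel width are assumptions."""
    def __init__(self, in_ch: int = 3, out_ch: int = 64):
        super().__init__()
        self.extract = nn.Sequential(
            # 3x3 stride-2 convolution: spatial downsampling with detail retention
            nn.Conv2d(in_ch, out_ch, 3, stride=2, padding=1, bias=False),
            nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True),
            # dilation-2 convolution: enlarged receptive field at equal resolution
            nn.Conv2d(out_ch, out_ch, 3, padding=2, dilation=2, bias=False),
            nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True),
            # 1x1 convolution: cross-channel feature fusion and reorganization
            nn.Conv2d(out_ch, out_ch, 1, bias=False),
            nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.extract(x)
```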

2.3.3. Skeleton Model

To address the challenges of few-shot learning and model lightweighting in crop multi-classification, this study employs a pre-trained ResNet18 as the backbone network (its structure is illustrated in Figure 5). Based on a deep residual learning framework, ResNet18 alleviates gradient vanishing and network degradation through residual connections [27]. The prior knowledge acquired from ImageNet pre-training provides strong visual feature representation capabilities [28], which helps suppress overfitting under limited training data and accelerates convergence for crop-specific recognition tasks.
To tackle issues such as class imbalance, blurred boundaries, and small-target detection in wheat and maize image recognition, a hybrid attention module is incorporated into the ResNet18 backbone. Conventional networks often treat spatial and channel features uniformly, making it difficult to focus on key crop regions, especially amid complex background interference. Therefore, the proposed module integrates four complementary attention mechanisms: Coordinate Attention (CA), External Attention (EA), Pyramid Split Attention (PSA), and Efficient Channel Attention (ECA). The module aims to achieve the following objectives: (1) dynamically recalibrating feature importance to mitigate class imbalance [29]; (2) enhancing feature responses at crop boundaries through spatial reinforcement; (3) extracting discriminative multi-scale features to better characterize small-scale crop targets [30].

2.3.4. Hybrid Attention

The composition and ordering of this hybrid attention module (whose structure is shown in Figure 5) are based on a hierarchical, progressive visual information processing framework. Its core design concept is to simulate the cognitive process of the human visual system from “focusing on a location” to “understanding context,” then to “parsing structure,” and finally to “optimizing representation.”
Specifically, Coordinate Attention (CA) first provides the network with precise spatial location awareness. This mechanism establishes long-range spatial dependencies by decomposing and encoding the feature map along coordinate dimensions, thereby helping the network accurately locate key regions in the image (such as specific farmland or crop disease areas). The structure of CA is shown in Figure 6, and its computational workflow is as follows: First, global average pooling is performed separately along the height and width directions to obtain feature vectors for the two orientations in Equations (2) and (3) [25]. Next, these two feature vectors are concatenated, transformed and fused via a shared 1 × 1 convolution, and then passed through a Sigmoid function to generate the spatial attention weights in Equation (4) [31]. The final output is the directionally weighted result of the input features using these weights in Equation (5) [32].
$$z_c^h(h) = \frac{1}{W} \sum_{0 \le i < W} x_c(h, i)$$
$$z_c^w(w) = \frac{1}{H} \sum_{0 \le j < H} x_c(j, w)$$
$$g^h, g^w = \mathrm{Split}\!\left( \sigma\!\left( f\!\left( \mathrm{Concat}(z^h, z^w) \right) \right) \right)$$
$$y_c(h, w) = x_c(h, w) \times g_c^h(h) \times g_c^w(w)$$
Here, the input feature map is denoted $x_c$, where $H$, $W$, and $C$ represent the height, width, and number of channels of the feature map, respectively; $z_c^h(h)$ and $z_c^w(w)$ are the feature tensors obtained by global average pooling along the height and width directions; $\sigma$ denotes the Sigmoid activation function; $f$ represents the shared 1 × 1 convolution; $\mathrm{Concat}$ indicates feature concatenation; $g^h$ and $g^w$ are the attention weights generated for the height and width directions, respectively; and $y_c(h, w)$ denotes the output feature of the $c$-th channel after Coordinate Attention weighting.
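The following PyTorch sketch implements Equations (2)–(5); the reduction ratio and the two-layer form of the shared transform $f$ are assumptions.

```python
import torch
import torch.nn as nn

class CoordinateAttention(nn.Module):
    """Minimal sketch of CA following Equations (2)-(5)."""
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        mid = max(8, channels // reduction)
        self.fuse = nn.Sequential(                        # shared 1x1 transform f
            nn.Conv2d(channels, mid, 1), nn.ReLU(inplace=True),
            nn.Conv2d(mid, channels, 1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        z_h = x.mean(dim=3, keepdim=True)                 # Eq. (2): pool along width  -> (B, C, H, 1)
        z_w = x.mean(dim=2, keepdim=True)                 # Eq. (3): pool along height -> (B, C, 1, W)
        z = torch.cat([z_h, z_w.transpose(2, 3)], dim=2)  # concat along spatial axis  -> (B, C, H+W, 1)
        g = torch.sigmoid(self.fuse(z))                   # Eq. (4): fuse and activate
        g_h, g_w = g.split([h, w], dim=2)                 # split back into directions
        return x * g_h * g_w.transpose(2, 3)              # Eq. (5): directional weighting
```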
Building upon coordinate attention, External Attention (EA) further incorporates lightweight external learnable memory units to simulate contextual reasoning based on prior knowledge [33]. By enabling cross-sample feature interaction, it effectively enhances the semantic expression and generalization capability of the features. The specific computational workflow is as follows: First, the input features are mapped into query vectors via linear transformation in Equation (6) [33]. Then, two independent external memory units are utilized as the key and value, respectively, to compute attention weights and generate the output features through weighted aggregation in Equation (7) [34]. Finally, the channel dimension is restored to its original size via a residual connection and linear projection.
$$Q = X W_Q$$
$$A = \mathrm{Softmax}\!\left( Q M_K^{\top} \right), \qquad Y = A M_V$$
Here, $X$ denotes the input features, $W_Q$ is the linear transformation matrix, $Q$ represents the query features, $\mathrm{Softmax}$ is the activation function, $M_K$ stands for the external key memory, $M_V$ stands for the external value memory, $A$ corresponds to the resulting attention map, and $Y$ indicates the enhanced features.
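A compact PyTorch sketch of Equations (6) and (7) is given below; the memory size is an assumption, and the single-softmax normalization follows Equation (7) as written (the original external-attention formulation applies a double normalization).

```python
import torch
import torch.nn as nn

class ExternalAttention(nn.Module):
    """Sketch of EA following Equations (6)-(7) with learnable memories M_K, M_V."""
    def __init__(self, channels: int, mem_size: int = 64):
        super().__init__()
        self.to_query = nn.Linear(channels, channels)            # Eq. (6): Q = X W_Q
        self.mem_k = nn.Linear(channels, mem_size, bias=False)   # external key memory M_K
        self.mem_v = nn.Linear(mem_size, channels, bias=False)   # external value memory M_V
        self.proj = nn.Linear(channels, channels)                # restore channel dimension

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, N, C) token features, e.g. a flattened H*W feature map
        q = self.to_query(x)                                     # (B, N, C)
        attn = torch.softmax(self.mem_k(q), dim=-1)              # Eq. (7): A = Softmax(Q M_K^T)
        y = self.mem_v(attn)                                     # Y = A M_V
        return x + self.proj(y)                                  # residual connection + projection
```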
Next, the Pyramid Split Attention (PSA) module employs a parallel multi-branch architecture for multi-scale feature extraction. This design explicitly enhances the model’s adaptability to variations in target scale, enabling it to capture multi-level visual features ranging from fine-grained textures to global structures (its structure is illustrated in Figure 6) [35].
The specific computational workflow is as follows: First, the input features are evenly split into four groups along the channel dimension in Equation (8) [35]. The first three groups are processed by depthwise separable convolutions with different kernel sizes (set to 3 × 3, 5 × 5, and 7 × 7 in this study), while the last group retains the original features in Equation (9) [36]. Subsequently, the outputs from all branches are concatenated along the channel dimension. This concatenated result then passes through a 1 × 1 convolution followed by a ReLU activation, and finally a Sigmoid function to generate the channel attention weights in Equation (10) [36]. The final output is obtained by performing channel-wise multiplication between the input features and these attention weights in Equation (11) [36].
$$G_1, G_2, \ldots, G_S = \mathrm{Split}(X), \qquad G_i \in \mathbb{R}^{H \times W \times C/S}$$
$$Y_i = \mathrm{DWConv}_{K_i \times K_i}(G_i), \qquad i = 1, \ldots, S - 1$$
$$w = \sigma\!\left( F_{1 \times 1}\!\left( \mathrm{BN}\!\left( \mathrm{ReLU}\!\left( \mathrm{Concat}(Y_1, \ldots, Y_{S-1}, G_S) \right) \right) \right) \right)$$
$$Y = X \odot w, \qquad w \in \mathbb{R}^{1 \times 1 \times C}$$
Here, $X$ denotes the input features; $S$ (set to 4 in this study) represents the number of split groups; $\mathrm{DWConv}_{K_i \times K_i}$ refers to the depthwise separable convolutions with different kernel sizes ($K_i$ = 3, 5, and 7 in this study); $w$ stands for the channel attention weights; $F_{1 \times 1}$ indicates the 1 × 1 convolution; $\mathrm{Concat}$ denotes channel-wise concatenation; $\mathrm{BN}$ represents batch normalization; and $Y$ corresponds to the output features.
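The sketch below follows Equations (8)–(11); placing a global average pooling before the 1 × 1 convolution is an assumption made so that the weights satisfy $w \in \mathbb{R}^{1 \times 1 \times C}$.

```python
import torch
import torch.nn as nn

class PyramidSplitAttention(nn.Module):
    """Sketch of PSA: S=4 channel groups, three passed through depthwise
    separable convolutions (3x3, 5x5, 7x7), the fourth kept as-is."""
    def __init__(self, channels: int, groups: int = 4):
        super().__init__()
        assert channels % groups == 0
        gc = channels // groups
        def dw_sep(k: int) -> nn.Sequential:   # depthwise separable convolution
            return nn.Sequential(
                nn.Conv2d(gc, gc, k, padding=k // 2, groups=gc, bias=False),  # depthwise
                nn.Conv2d(gc, gc, 1, bias=False),                             # pointwise
            )
        self.branches = nn.ModuleList(dw_sep(k) for k in (3, 5, 7))  # Eq. (9)
        self.gc = gc
        self.weight = nn.Sequential(                    # Eq. (10), pooled to 1x1xC
            nn.ReLU(inplace=True),
            nn.BatchNorm2d(channels),
            nn.AdaptiveAvgPool2d(1),                    # pooling placement: assumption
            nn.Conv2d(channels, channels, 1),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        g = torch.split(x, self.gc, dim=1)              # Eq. (8): four channel groups
        y = [b(gi) for b, gi in zip(self.branches, g)]  # first three groups convolved
        y.append(g[3])                                  # fourth group unchanged
        w = self.weight(torch.cat(y, dim=1))            # Eq. (10): channel weights
        return x * w                                    # Eq. (11): reweighting
```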
Finally, the Efficient Channel Attention (ECA) module performs a lightweight recalibration of the channel dimension on the features refined by the preceding steps. It adaptively enhances discriminative channels while suppressing redundant information, thereby achieving the final optimization of feature representation at minimal computational cost [37]. The specific computational workflow is as follows: First, global spatial compression is applied to the input features in Equation (12) [37]. Then, a one-dimensional convolution is used to capture local cross-channel interactions, followed by a Sigmoid function to generate the channel attention weights in Equation (13) [38]. The final output is obtained by performing channel-wise multiplication between the input features and these weights in Equation (14) [38].
$$z_c = \frac{1}{H \times W} \sum_{i=1}^{H} \sum_{j=1}^{W} x_c(i, j), \qquad z \in \mathbb{R}^{C}$$
$$s = \sigma\!\left( \mathrm{Conv1D}_k(z) \right), \qquad s \in \mathbb{R}^{C}$$
$$y_c = s_c \, x_c$$
Here, $x$ denotes the input features; $z$ represents the channel descriptor obtained by global spatial pooling; $\mathrm{Conv1D}_k$ refers to the one-dimensional convolution with kernel size $k$; $\sigma$ indicates the Sigmoid activation function; $s$ stands for the channel attention weights; and $y_c$ corresponds to the output features.
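Equations (12)–(14) translate almost directly into code; in the sketch below, the kernel size $k = 3$ is an assumption (ECA often derives $k$ adaptively from the channel count).

```python
import torch
import torch.nn as nn

class EfficientChannelAttention(nn.Module):
    """Sketch of ECA following Equations (12)-(14)."""
    def __init__(self, k: int = 3):
        super().__init__()
        self.conv = nn.Conv1d(1, 1, kernel_size=k, padding=k // 2, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = x.shape
        z = x.mean(dim=(2, 3))                          # Eq. (12): global pooling -> (B, C)
        s = torch.sigmoid(self.conv(z.unsqueeze(1)))    # Eq. (13): local cross-channel interaction
        return x * s.view(b, c, 1, 1)                   # Eq. (14): channel-wise reweighting
```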
Overall, the entire hybrid module follows a progressive, sequential design of “spatial localization → contextual enhancement → multi-scale fusion → channel refinement.” This constructs a hierarchical and systematic visual information processing pathway in theory, where features are progressively refined and purified at each stage. This provides a solid theoretical foundation for accurate recognition in complex agricultural scenarios.

2.3.5. Decoder

To meet the requirements for generalization, robustness, and computational efficiency in wheat and maize identification tasks, this study designs an efficient multi-stage decoder, with its overall structure illustrated in Figure 7. The decoder adopts a progressive upsampling strategy to gradually restore the spatial resolution of feature maps, thereby avoiding the detail loss that may result from single-step upsampling and ensuring smooth and complete feature reconstruction.
The decoder follows a four-stage architecture, each stage consisting of a hybrid convolutional module and a hybrid attention module. The hybrid convolutional module comprises a transposed convolution layer followed by two standard 3 × 3 convolutional layers (the overall module layout is shown in Figure 7). Transposed convolution (also known as fractionally strided convolution) achieves upsampling by inserting zeros into the input features and then performing standard convolution [39]. For a stride of 2, for example, this operation inserts one row and one column of zeros along the height and width dimensions, respectively, after which the convolution kernel slides over the expanded feature map to compute the output values [39]. Subsequently, two consecutive 3 × 3 convolutional layers perform deep feature refinement: the first layer integrates spatial context via its local receptive field to preliminarily extract features and suppress noise [40,41]; the second layer further enhances feature discriminability and spatial consistency through added nonlinear transformations, thereby improving the representation of crop morphological structures.
While progressively enlarging the feature maps, this hybrid convolutional module effectively fuses multi-scale features transferred from the encoder, enabling the recovery of crop details without sacrificing high-level semantic information. The dual-convolution structure expands the model’s nonlinear representational capacity and effective receptive field, helping to reduce feature loss and boundary blur for small-scale targets during decoding, thus improving boundary clarity and morphological completeness.
To address specific challenges in crop recognition, such as complex background interference and scale variation, a hybrid attention module is introduced after each hybrid convolutional module. Building upon the features extracted by convolution, this module achieves a transition from global perception to local focus through dynamic feature weighting, which is particularly suited for locating discriminative regions (e.g., crops versus background weeds) in agricultural scenes. At the final output stage of the decoder, a 1 × 1 convolutional layer is employed to directly map the deep features to pixel-level recognition results [42]. This layer compresses the feature channels to the target output dimension, realizing an end-to-end transformation from high-level semantic features to classification or segmentation maps. This design maintains the spatial resolution of the feature maps while generating the output with minimal computational overhead, thereby significantly improving the inference efficiency of the model without compromising recognition accuracy.

2.3.6. Skip Connections

In encoder–decoder architectures, conventional skip connections often propagate background noise along with features, harming accuracy, especially for small crop targets. To address this, a dual-attention-enhanced skip connection is introduced, using spatial and channel attention to selectively refine encoder features before fusion, thereby suppressing noise and improving feature quality.
Specifically, an Attention Gate employs high-level decoder features to guide adaptive spatial filtering of encoder outputs [43]. The process involves: resizing decoder features via bilinear interpolation to match encoder spatial dimensions [43]; projecting both feature sets to the same channel dimension using 1 × 1 convolutions in Equation (15) [44]; performing element-wise addition followed by ReLU activation in Equation (16) [44]; applying another 1 × 1 convolution to produce a single-channel spatial map; normalizing it with Sigmoid to generate a spatial attention weight matrix in Equation (17) [43]; and finally, using element-wise multiplication to selectively weight the original encoder features in Equation (18) [43].
$$g' = W_g * g, \qquad x' = W_x * x$$
$$\psi = \mathrm{ReLU}(g' + x')$$
$$\alpha = \sigma(W_\psi * \psi)$$
$$x_{\mathrm{out}} = x \otimes \alpha$$
Here, $W_g$ and $W_x$ are the 1 × 1 convolution weights for the decoder and encoder features, respectively; $*$ denotes the convolution operation; $g'$ and $x'$ represent the transformed decoder and encoder features; $\mathrm{ReLU}$ denotes the activation function; $\psi$ denotes the fused features; $\sigma$ denotes the Sigmoid function; $W_\psi$ represents the 1 × 1 convolution weights of the output projection; $\alpha$ denotes the generated spatial attention weight matrix; and $\otimes$ denotes element-wise multiplication.
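The gate of Equations (15)–(18) can be sketched as follows; the intermediate channel width is an assumption.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionGate(nn.Module):
    """Sketch of the attention gate in Equations (15)-(18)."""
    def __init__(self, enc_ch: int, dec_ch: int, mid: int):
        super().__init__()
        self.w_x = nn.Conv2d(enc_ch, mid, 1)   # Eq. (15): project encoder features
        self.w_g = nn.Conv2d(dec_ch, mid, 1)   # Eq. (15): project decoder features
        self.w_psi = nn.Conv2d(mid, 1, 1)      # single-channel spatial map

    def forward(self, x: torch.Tensor, g: torch.Tensor) -> torch.Tensor:
        # resize decoder features to the encoder's spatial size (bilinear)
        g = F.interpolate(g, size=x.shape[2:], mode="bilinear", align_corners=False)
        psi = F.relu(self.w_g(g) + self.w_x(x))         # Eq. (16): fuse and activate
        alpha = torch.sigmoid(self.w_psi(psi))          # Eq. (17): spatial weights
        return x * alpha                                # Eq. (18): gated encoder features
```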
Building on the attention gate, the Efficient Channel Attention (ECA) mechanism is further applied for channel-wise refinement. This dual attention-enhanced skip connection offers key advantages for wheat and maize recognition: spatially, it enhances feature responses in key crop areas while suppressing background interference through adaptive selection; channel-wise, it recalibrates features to strengthen discriminative expression [44].
In small-sample learning, it extracts more discriminative correlations from limited data, improving model generalization [45]. Designed to be lightweight, the module adds minimal computational overhead while significantly boosting feature fusion quality and efficiency, enabling high-precision crop identification.

2.3.7. Loss Function and Optimizer

To address challenges such as class imbalance and blurred boundaries in wheat and corn recognition tasks, this study designed a comprehensive loss function and optimization strategy. For the loss function, a composite loss function combining cross-entropy loss and Dice loss was adopted, with an adjustable weight parameter α balancing the contributions of both. This design leverages the complementary properties of different loss functions, ensuring classification accuracy while optimizing the geometric consistency of segmentation regions.
The cross-entropy loss function calculates the discrepancy between the predicted probability distribution and the true labels to ensure clear classification boundaries, as shown in Equation (19) [46]. The Dice loss function is specifically designed to address class imbalance by directly optimizing the overlap between predicted and ground-truth regions, thereby enhancing recognition performance for minority classes. Its calculation formula is shown in Equation (20) [46]. The combined loss function integrates these two losses through linear weighting, as shown in Equation (21).
This study employs grid search to determine the optimal weight parameter, ultimately setting α = 0.7. This configuration preserves the advantages of cross-entropy loss in classification tasks while fully leveraging the strengths of Dice loss in handling class imbalance.
$$L_{CE} = -\frac{1}{N} \sum_{i=1}^{N} \sum_{c=1}^{C} y_{i,c} \log(p_{i,c})$$
$$L_{Dice} = \frac{1}{C} \sum_{c=1}^{C} \left( 1 - \frac{2 \sum_{i=1}^{N} p_{i,c} \, y_{i,c} + \epsilon}{\sum_{i=1}^{N} p_{i,c} + \sum_{i=1}^{N} y_{i,c} + \epsilon} \right)$$
$$L_{\mathrm{Total}} = \alpha \cdot L_{CE} + (1 - \alpha) \cdot L_{Dice}$$
Here, $N$ represents the number of samples; $C$ represents the number of classes; $y_{i,c}$ denotes the ground-truth label of sample $i$ for class $c$; $p_{i,c}$ is the corresponding predicted probability; $\epsilon$ is a smoothing factor used to ensure numerical stability; and $\alpha$ is the weight parameter.
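A direct implementation sketch of Equations (19)–(21) is shown below; the class weighting mentioned in the text is omitted for brevity, so plain (unweighted) cross-entropy is used instead.

```python
import torch
import torch.nn.functional as F

def hybrid_loss(logits: torch.Tensor, target: torch.Tensor,
                alpha: float = 0.7, eps: float = 1e-6) -> torch.Tensor:
    """Eq. (21): alpha * L_CE + (1 - alpha) * L_Dice.

    logits: (B, C, H, W) raw scores; target: (B, H, W) integer class labels.
    """
    target = target.long()
    ce = F.cross_entropy(logits, target)                        # Eq. (19)
    probs = torch.softmax(logits, dim=1)
    onehot = F.one_hot(target, probs.shape[1]).permute(0, 3, 1, 2).float()
    inter = (probs * onehot).sum(dim=(0, 2, 3))                 # per-class overlap
    denom = probs.sum(dim=(0, 2, 3)) + onehot.sum(dim=(0, 2, 3))
    dice = 1.0 - ((2 * inter + eps) / (denom + eps)).mean()     # Eq. (20)
    return alpha * ce + (1.0 - alpha) * dice                    # Eq. (21)
```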
This study employs the AdamW optimizer for parameter updates. By decoupling weight decay from gradient updates, it achieves superior regularization effects, thereby enhancing the model’s generalization capability in agricultural scenarios [47]. Building on this, we introduce a cosine annealing with warm restarts learning rate scheduling strategy. This strategy periodically decays the learning rate from its initial value to near zero following a cosine function, then performs a “warm restart,” rapidly recovering it to near the initial value and commencing the next cycle [48]. This helps the model escape local optima during training and accelerates the convergence process. Simultaneously, to prevent overfitting and determine the optimal training epoch, an early stopping mechanism is implemented [48].
It continuously monitors the target performance metric (mIoU) on the validation set; if no improvement is observed within a predefined patience period (set to 20 epochs in this experiment), training is halted early, and the model parameters are restored to the checkpoint with the best validation performance. The complete hyperparameter configuration includes: an initial learning rate of 1 × 10−4, a weight decay coefficient of 1 × 10−4, a period multiplier T_mult of 1.5, a batch size of 2, a training epoch limit of 200, and an early stopping patience of 20.
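Assuming a PyTorch implementation, the optimization recipe can be sketched as follows. The restart period T_0 and the training/evaluation helpers are placeholders, and PyTorch's built-in scheduler accepts only integer period multipliers, so the reported T_mult of 1.5 would require a custom scheduler.

```python
import torch

def train_with_early_stopping(model, train_one_epoch, evaluate_miou,
                              max_epochs: int = 200, patience: int = 20):
    """Sketch of Section 2.3.7: AdamW, cosine annealing with warm restarts,
    and early stopping on validation mIoU. `train_one_epoch` and
    `evaluate_miou` are assumed user-supplied callables."""
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=1e-4)
    scheduler = torch.optim.lr_scheduler.CosineAnnealingWarmRestarts(
        optimizer, T_0=10, T_mult=2)                 # integer T_mult; paper uses 1.5
    best_miou, wait = 0.0, 0
    for epoch in range(max_epochs):
        train_one_epoch(model, optimizer)
        scheduler.step()                             # cosine decay with warm restarts
        miou = evaluate_miou(model)                  # validation mIoU
        if miou > best_miou:
            best_miou, wait = miou, 0
            torch.save(model.state_dict(), "best.pt")    # checkpoint best model
        else:
            wait += 1
            if wait >= patience:                     # stop after 20 idle epochs
                break
    model.load_state_dict(torch.load("best.pt"))     # restore best checkpoint
    return model
```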

2.4. Model Performance and Reliability Evaluation

2.4.1. Hardware Configuration and Evaluation Metrics

This study employs a comprehensive evaluation framework to objectively assess model performance, covering both accuracy and computational efficiency. For accuracy evaluation, the following metrics are adopted: mean Average Precision (mAP), mean Precision (mPrecision), mean Recall (mRecall), mean F1-Score (mF1), and mean Intersection over Union (mIoU) [49]. These metrics collectively measure the model’s detection accuracy, classification reliability, coverage completeness, overall balance, and segmentation consistency, as calculated in Equations (22)–(26).
$$\mathrm{mPrecision} = \frac{1}{K} \sum_{i=1}^{K} \frac{TP_i}{TP_i + FP_i}$$
$$\mathrm{mRecall} = \frac{1}{K} \sum_{i=1}^{K} \frac{TP_i}{TP_i + FN_i}$$
$$\mathrm{mF1\mbox{-}Score} = \frac{1}{K} \sum_{i=1}^{K} \frac{2 \times \mathrm{Precision}_i \times \mathrm{Recall}_i}{\mathrm{Precision}_i + \mathrm{Recall}_i}$$
$$\mathrm{mIoU} = \frac{1}{K} \sum_{i=1}^{K} \frac{TP_i}{TP_i + FP_i + FN_i}$$
$$\mathrm{mAP} = \frac{1}{K} \sum_{i=1}^{K} \sum_{k=1}^{n} P(k) \times \Delta R(k)$$
Here, $TP_i$ denotes the number of samples that truly belong to class $i$ and are predicted as class $i$ (True Positives for class $i$), $FP_i$ represents the number of samples predicted as class $i$ but whose true class is not $i$ (False Positives for class $i$), and $FN_i$ signifies the number of samples that truly belong to class $i$ but are predicted otherwise (False Negatives for class $i$). $n$ is the total number of samples, $K$ is the number of classes, $P(k)$ is the precision at the $k$-th point on the precision–recall curve, and $\Delta R(k)$ is the corresponding change in recall.
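For reference, a sketch of the mIoU computation of Equation (25) from a pixel-level confusion matrix is shown below; the other metrics follow analogously from the same matrix.

```python
import numpy as np

def mean_iou(pred: np.ndarray, label: np.ndarray, num_classes: int) -> float:
    """Per-class IoU averaged over classes (Equation (25)).
    `pred` and `label` hold integer class indices per pixel."""
    idx = num_classes * label.ravel().astype(np.int64) + pred.ravel()
    cm = np.bincount(idx, minlength=num_classes ** 2).reshape(num_classes, num_classes)
    tp = np.diag(cm)
    fp = cm.sum(axis=0) - tp    # predicted as class i, true class differs
    fn = cm.sum(axis=1) - tp    # true class i, predicted otherwise
    iou = tp / np.maximum(tp + fp + fn, 1)
    return float(iou.mean())
```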
For efficiency evaluation, this study employs total parameters [50] and floating-point operations (FLOPs) [50] as key metrics of model complexity to assess its lightweight design and computational efficiency, providing references for deployment in practical scenarios. All models are implemented in Python (version 3.9; Python Software Foundation, Wilmington, DE, USA) and trained and validated on a hardware platform (Dell Inc., Round Rock, TX, USA) equipped with an NVIDIA GeForce RTX 4060 GPU to ensure consistency in the experimental environment.
The dataset was partitioned into training and validation sets following a region-based split strategy (in an approximately 8:2 ratio) to prevent data leakage. This approach ensures that all images from the same geographic area or acquisition sequence are assigned exclusively to either set, compelling the model to generalize to unseen conditions rather than memorizing location-specific features. To mitigate potential bias from the resulting class distribution shift, we ensured that the proportion of each category was kept as balanced as possible between the two sets. All evaluation metrics were calculated solely on the validation set to ensure a fair assessment of model generalization.
All models were trained under identical hyperparameter settings to enable a direct and fair comparison. Loss values and key performance metrics were recorded after each epoch, allowing for continuous monitoring of overfitting or underfitting based on loss curve trends. This practice ensures full traceability of the training process and guarantees the reproducibility of the experimental outcomes.

2.4.2. Comparative Experiment

This study evaluated the performance of the MAF-RecNet model for farmland identification tasks in the Southern Hebei region. The experimental procedure was as follows. First, based on the self-constructed SHFD, five models—MAF-RecNet, MDFNet [51], SegFormer [52], FastSAM [53], and SegNeXt [54]—were trained and tested under identical conditions.
Subsequently, the prediction results of each model were compared with the ground truth annotations, and model performance was evaluated on the validation set to examine their generalization capability and classification accuracy. Finally, through a comprehensive comparison of the quantitative results, a comparative analysis was conducted to identify the strengths and limitations of MAF-RecNet relative to other mainstream lightweight models for farmland identification in Southern Hebei.

2.4.3. Ablation Studies

This study conducted an ablation experiment to evaluate the performance of MAF-RecNet and its two variants—MA-RecNet (with the feature modification components removed from the skip connections) and M-RecNet (with the feature modification components further removed from the decoder)—compared to the baseline U-Net [55] model in the farmland identification task in Southern Hebei, aiming to verify the effectiveness and contribution of each proposed module.
In the experiment, all models were trained and tested on the unified SHFD. Model predictions were compared with ground-truth annotations, and performance evaluation was completed on the validation set. Finally, by systematically comparing the performance of each model, an in-depth analysis was conducted on the strengths and weaknesses of MAF-RecNet and its variants in the farmland recognition task, thereby clarifying the specific contribution of different modules to the overall model performance.

2.4.4. Generalization Capability Testing

This study systematically evaluates the generalization capability and reliability of the MAF-RecNet model across multiple tasks, including: farmland identification in Southern Hebei, wheat health monitoring in Southern Hebei, cross-regional wheat health monitoring, and corn disease identification. Experiments were conducted by training and testing the MAF-RecNet model on four corresponding datasets—SHFD, WHSSH, GWHD, and CHSD.
For each task, the model’s predictions were first compared against the ground truth labels. Subsequently, the model was evaluated on the validation set using multiple metrics to comprehensively assess its performance across different scenarios. Finally, by conducting a comparative analysis of the results from various tasks, the generalization ability and potential limitations of MAF-RecNet in multi-scenario wheat and corn recognition were assessed.

2.4.5. Confusion Matrix Analysis

Leveraging the results from the generalization testing, this study employs a confusion matrix to conduct an in-depth analysis of the recognition results of MAF-RecNet. The specific workflow is as follows:
First, at the pixel level, the model's predictions on the validation set across the different datasets were categorized into four types: True Positive (TP), True Negative (TN), False Positive (FP), and False Negative (FN). Next, the counts and proportions of these four pixel categories were calculated for each dataset. Finally, by comparing the distribution differences of each pixel category across datasets, the adaptability and primary limitations of MAF-RecNet in multi-scenario recognition tasks were evaluated.

2.4.6. Robustness Testing

To evaluate the robustness of MAF-RecNet under noisy conditions, this study conducted comprehensive testing of the model using corrupted test datasets incorporating synthetic noise and simulated degradation: SHFD-RTD, WHSSH-RTD, GWHD-RTD, and CHSD-RTD.
During the experiments, the model’s predictions on each corrupted dataset were first compared against the ground truth labels. Subsequently, the model was evaluated on the validation set using multiple metrics to objectively assess its recognition stability under varying noise levels. Through a systematic analysis of the model’s performance across different noise conditions, this study evaluated MAF-RecNet’s ability to maintain recognition accuracy in complex environments and identified its primary limitations.

3. Results

3.1. Comparative Experimental Results

In the model comparison experiment for farmland identification in the Southern Hebei region, the MAF-RecNet proposed in this study achieved a good balance between accuracy and efficiency, demonstrating superior overall performance compared to other mainstream segmentation models.
As shown in Table 4, MAF-RecNet reached 87.57% on the key evaluation metric mIoU, exceeding MDFNet (82.32%), SegFormer (74.43%), FastSAM (78.81%), and SegNeXt (80.56%) by 5.25, 13.14, 8.76, and 7.01 percentage points, respectively. These results indicate its higher accuracy in identifying farmland boundaries and regions, which is further supported by the visual results in Figure 8.
In terms of model complexity, MAF-RecNet has 15.25 million parameters. Among all compared models, FastSAM (17.43 million parameters) is the closest in parameter count, yet its mIoU is 8.76 percentage points lower than that of MAF-RecNet. This shows that with a similar number of parameters, MAF-RecNet can learn more discriminative feature representations and achieve better performance.
Regarding computational efficiency, MAF-RecNet requires 21.81 GFLOPs, which is higher than SegNeXt’s 16.8 GFLOPs. However, it achieves a 7.01% improvement in mIoU at this limited computational cost, demonstrating a favorable trade-off for high-precision applications such as farmland identification. Compared to FastSAM, which has a similar parameter count, MAF-RecNet reduces FLOPs by 21% while maintaining higher recognition accuracy, further highlighting its computational efficiency advantage.
Comprehensive analysis shows that FastSAM is the most comparable to MAF-RecNet in terms of model scale and can serve as a direct benchmark model. While maintaining a similar parameter count, MAF-RecNet outperforms on both mIoU and FLOPs—two key metrics—indicating that its network architecture can more efficiently translate computational resources into model performance. In summary, with a moderate complexity of 15.25 million parameters and 21.81 GFLOPs, MAF-RecNet achieves an effective balance between accuracy and efficiency in the farmland identification task in Southern Hebei, providing a reliable technical foundation for practical deployment.

3.2. Ablation Study Results

Ablation experiments systematically validate the contribution of each module in MAF-RecNet. Quantitative results are shown in Table 5, and a comparison of recognition performance across models is presented in Figure 9.
MAF-RecNet achieved the best performance across all evaluation metrics: mIoU of 87.57%, mAP of 95.42%, mF1 of 92.83%, mPrecision of 90.56%, and mRecall of 95.18%. This indicates that all proposed modules significantly enhance farmland recognition. As shown in Figure 9, the model excels in farmland boundary segmentation and small-area detection, effectively preserving the complete geometric features of farmland regions.
Removing the feature modification component from the skip connections (resulting in MA-RecNet) led to a significant decline in all metrics: mIoU dropped by 4.38 percentage points to 83.19%, and mAP decreased by 4.77 percentage points to 90.65%. Figure 9 shows that MA-RecNet produced more discontinuous and misclassified field edges, confirming the component’s importance for effective multi-level feature fusion.
Further removal of this component from the decoder (resulting in M-RecNet) caused an even more pronounced performance drop: mIoU plummeted by 9.63 percentage points to 77.94%. As seen in Figure 9, M-RecNet struggled to retain details and performed poorly on farmlands with complex shapes, demonstrating that decoder optimization is crucial for recovering fine-grained features and accurately locating farmland boundaries.
Compared to the baseline U-Net, the complete MAF-RecNet achieved a 17.51 percentage-point improvement in mIoU. Figure 9 shows that while U-Net results exhibit noticeable boundary blurring and internal holes, MAF-RecNet produces more complete and accurate segmentation.
In summary, the skip connection mechanism facilitates effective feature fusion between the encoder and decoder, while the improved decoder architecture significantly enhances detail recovery. The synergistic effect of these components collectively boosts model performance. MAF-RecNet achieves an effective balance between accuracy and efficiency, making it well-suited for practical farmland identification tasks.

3.3. Generalization Capability Test Results

This study systematically evaluated the generalization capability of the MAF-RecNet model across four datasets with distinct characteristics. The quantitative results are presented in Table 6. MAF-RecNet demonstrates robust performance across all datasets and metrics. It achieved the best performance on the GWHD, with an mIoU of 90.20%, mAP of 98.28%, mPrecision of 93.28%, and mRecall of 98.04%. It also delivered strong results on the SHFD, with an mIoU of 87.57%, mAP of 95.42%, mPrecision of 90.56%, and mRecall of 95.18%. On the more challenging WHSSH and CHSD, it still attained competitive mIoU scores of 80.56% and 84.07%, respectively, confirming the model’s adaptability across diverse environments.
Analysis of the precision metrics indicates that MAF-RecNet maintains an mPrecision above 83% across all four datasets, peaking at 93.28% on GWHD. This suggests effective control of false positives in varied scenarios. Regarding recall metrics, the model achieved an mRecall exceeding 87% on all datasets, notably reaching 98.04% on GWHD, highlighting its robust detection capability for targets with varying morphologies.
Visual analysis in Figure 10 illustrates the model’s performance on the SHFD. Its high recall rate of 95.18% reflects complete detection of scattered small fields, while a precision of 90.56% indicates accurate boundary delineation. On the WHSSH dataset, a recall of 87% ensures effective target identification within complex backgrounds, and a precision of 83.32% demonstrates its resistance to interference in challenging environments. On the CHSD, a recall of 91.37% confirms the model’s sensitivity in detecting target areas, and its precision of 86.94% reflects accurate boundary localization.
From the perspective of the comprehensive F1-score metric, the model maintains high performance across all four datasets: 95.61% on GWHD, 92.83% on SHFD, 89.12% on CHSD, and 85.40% on WHSSH. These results validate the model’s effective balance between precision and recall across diverse scenarios.
A comprehensive analysis shows that MAF-RecNet delivers consistent performance across all evaluation metrics in varied agricultural settings, which differ in geographical environments, imaging conditions, and crop types. The model not only achieves high scores in composite metrics such as mIoU and mAP but also exhibits balanced results in specialized metrics including mPrecision, mRecall, and mF1-score. This well-rounded capability provides a reliable technical foundation for practical applications in diverse agricultural recognition scenarios.

3.4. Confusion Matrix Analysis Results

Based on the confusion matrix evaluation in Figure 11, MAF-RecNet shows varied generalization across four datasets. For True Negatives (TN), it performed best on CHSD (51.05%), followed by WHSSH (49.29%), GWHD (43.18%), and SHFD (26.33%), indicating increasing difficulty in background suppression as scene complexity rises.
True Positive (TP) rates were highest on SHFD (64.51%), with GWHD at 51.25%, CHSD at 41.15%, and WHSSH lowest at 40.85%, showing stronger target detection in simpler scenes. False Negative (FN) rates were highest on WHSSH (5.80%), followed by CHSD (3.89%), SHFD (3.27%), and GWHD (1.02%), suggesting greater missed detections in complex or small-object scenarios.
False Positive (FP) rates peaked on SHFD (5.89%), with WHSSH at 4.06%, GWHD at 4.54%, and CHSD at the lowest at 3.91%, reflecting increased background misclassification in complex environments.
Overall, MAF-RecNet achieved the most balanced performance on GWHD, with high TP and low FN. On CHSD, it excelled in background recognition but had lower target detection. On SHFD, target identification was strong but background suppression was weaker, leading to higher FP. On the most challenging WHSSH, it maintained a decent TN rate but struggled with both TP and FN. These results highlight the model’s performance variability across different remote sensing scenarios, pointing to the need for improved object perception and background discrimination in complex settings in future work.

3.5. Robustness Test Results

Based on the quantitative results in Table 7 and the visual analysis in Figure 12, this study systematically evaluated the robustness of the MAF-RecNet model on four noisy datasets. The quantitative results indicate noticeable differences in the model’s performance across different tasks. For the farmland identification task (SHFD-RTD), the model achieved an mIoU of 77.94% and an mF1-score of 82.62%, demonstrating moderate resistance to interference. For the corn disease identification task (CHSD-RTD), it attained an mIoU of 78.19% and an mAP of 85.19%, indicating strong adaptability.
Notably, the model’s performance in wheat health monitoring exhibited significant regional variation. On the Southern Hebei dataset (WHSSH-RTD), the mIoU decreased to 73.31% and the mAP dropped to 79.89%, reflecting the challenges that regional complexity poses to model robustness. In contrast, on the global dataset (GWHD-RTD), the model maintained superior performance, achieving an mIoU of 82.98% and an mAP as high as 90.42%, highlighting its strong generalization capability.
From a visual analysis perspective, Figure 12 presents the model’s recognition results across the datasets. For SHFD-RTD, the model preserved overall continuity in large-scale farmland areas but showed minor boundary deviations, consistent with its moderate quantitative performance. On GWHD-RTD, the model not only accurately identified wheat ear regions but also clearly distinguished between areas affected by different types of diseases, aligning with its excellent quantitative outcomes.
On CHSD-RTD, the model demonstrated high sensitivity in detecting disease patches, though further improvement is needed in segmenting the boundaries of severely affected regions. On WHSSH-RTD, the model was significantly affected by complex field environments (e.g., weeds, shadows), leading to misclassifications at the crop-background interface, which directly explains its relatively lower quantitative metrics.
A comprehensive analysis indicates that MAF-RecNet exhibits strong robustness in global tasks characterized by clear target features and standardized scenarios. However, it still faces challenges in tasks with strong regional specificity and complex backgrounds. Future work should focus on enhancing the model’s ability to preserve details and recognize boundaries in complex environments, particularly through adaptive optimization tailored to region-specific scenarios.
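The -RTD test sets themselves are constructed as described earlier in the paper (Table 3; Figure 2 shows unperturbed and perturbed examples). Purely as an illustration of this kind of degradation, and noting that the specific perturbations and parameters below are our assumptions rather than the paper’s recipe, a perturbed test image can be produced with Gaussian blur plus additive sensor-like noise:
```python
import numpy as np
from PIL import Image, ImageFilter

def perturb(img: Image.Image, sigma: float = 10.0, radius: float = 1.5) -> Image.Image:
    """Degrade an image for robustness testing: blur, then additive Gaussian noise."""
    blurred = img.filter(ImageFilter.GaussianBlur(radius=radius))
    arr = np.asarray(blurred).astype(np.float32)
    arr += np.random.normal(0.0, sigma, arr.shape)        # sensor-like noise
    return Image.fromarray(np.clip(arr, 0, 255).astype(np.uint8))

# noisy = perturb(Image.open("sample.png"))  # path is illustrative
```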

4. Discussion

4.1. Impact of Sample Quality on Recognition Accuracy and Generalization

Model performance relies not only on architectural design but fundamentally on high-quality sample data. This study analyzed three data quality dimensions—diversity, annotation accuracy, and perturbation resistance—confirming their significant positive correlation with the model’s recognition accuracy and generalization ability. Diversity was achieved through multi-source, multi-scale datasets (SHFD, WHSSH, GWHD, CHSD) and systematic augmentation, compelling the model to learn discriminative features. For example, the balanced and standardized GWHD yielded optimal performance (mIoU 90.20%), while the complex WHSSH scenes posed greater challenges, underscoring diversity’s role in generalization. High-precision annotation, ensured by rigorous labeling procedures and public benchmarks, provided reliable supervision for learning clear class boundaries [56]. The observed performance drop in cross-task testing can be partially attributed to the model’s reduced ability to utilize subtle spatial cues across varying data distributions.
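For semantic segmentation, the systematic augmentation referred to above must transform the image and its label mask consistently. As a minimal sketch (the specific transforms and their parameters are our assumptions, not the paper’s recipe), a joint image-mask pipeline might look like this:
```python
import random
import torchvision.transforms.functional as TF

def augment_pair(image, mask):
    """Apply identical geometric perturbations to an image and its mask,
    plus photometric jitter to the image only (parameters illustrative)."""
    if random.random() < 0.5:                     # horizontal flip
        image, mask = TF.hflip(image), TF.hflip(mask)
    if random.random() < 0.5:                     # vertical flip
        image, mask = TF.vflip(image), TF.vflip(mask)
    angle = random.uniform(-15.0, 15.0)           # small random rotation
    image, mask = TF.rotate(image, angle), TF.rotate(mask, angle)
    image = TF.adjust_brightness(image, random.uniform(0.8, 1.2))
    return image, mask
```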
To quantify the impact of data quality, robustness test sets were constructed in this study. Performance declined across all noise-added test sets, but to varying degrees: GWHD-RTD showed a smaller drop, indicating strong discriminability and higher robustness in its original data; whereas WHSSH-RTD exhibited a more pronounced decline, revealing that samples under complex backgrounds are more sensitive to quality degradation. As shown in Figure 13, training on enhanced high-quality data resulted in stable convergence of the loss curve, with no signs of overfitting or underfitting. In contrast, when trained on lower-quality robustness test data, mild overfitting commonly occurred, and even underfitting trends were observed on WHSSH-RTD. This indicates that a decline in sample quality directly impairs the model’s recognition accuracy, leading the model to fit noise in the data more easily rather than learning the essential features of the target [57]. In conclusion, achieving high model performance requires both architectural innovation and a foundation of high-quality data.

4.2. Deciphering the Contribution Mechanisms of Attention Modules via Ablation Studies

Based on ablation results (Table 5), the contribution of MAF-RecNet’s hybrid attention module stems from its stage-specific optimization of the feature processing pipeline. Removing the dual attention (Attention Gate + ECA) from skip connections (MA-RecNet) caused a 4.38% mIoU drop. This module acts as an intelligent filter for cross-level fusion, suppressing encoder background noise and enhancing target features via spatial selection and channel recalibration, thereby supplying the decoder with cleaner, more discriminative low-level features.
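Since the text names the two components of this skip-connection filter, their composition can be sketched directly. The PyTorch code below is a minimal sketch, not the authors’ implementation: the channel sizes, the ECA kernel-size rule, and the assumption that gate and skip features share spatial resolution are ours.
```python
import math
import torch
import torch.nn as nn

class ECA(nn.Module):
    """Efficient Channel Attention: a 1D conv over pooled channel descriptors."""
    def __init__(self, channels: int, gamma: int = 2, b: int = 1):
        super().__init__()
        k = int(abs((math.log2(channels) + b) / gamma))
        k = k if k % 2 else k + 1                       # kernel size must be odd
        self.conv = nn.Conv1d(1, 1, k, padding=k // 2, bias=False)

    def forward(self, x):                               # x: (B, C, H, W)
        w = x.mean(dim=(2, 3))                          # global average pool -> (B, C)
        w = self.conv(w.unsqueeze(1)).squeeze(1)        # local cross-channel mixing
        return x * torch.sigmoid(w)[..., None, None]    # channel recalibration

class AttentionGate(nn.Module):
    """Spatial selection: gate encoder skip features with the decoder signal."""
    def __init__(self, skip_ch: int, gate_ch: int, inter_ch: int):
        super().__init__()
        self.w_x = nn.Conv2d(skip_ch, inter_ch, 1, bias=False)
        self.w_g = nn.Conv2d(gate_ch, inter_ch, 1, bias=False)
        self.psi = nn.Conv2d(inter_ch, 1, 1)

    def forward(self, skip, gate):                      # same spatial size assumed
        a = torch.sigmoid(self.psi(torch.relu(self.w_x(skip) + self.w_g(gate))))
        return skip * a                                 # suppress background regions

class DualAttentionSkip(nn.Module):
    """Attention Gate (spatial selection) followed by ECA (channel recalibration)."""
    def __init__(self, skip_ch: int, gate_ch: int, inter_ch: int):
        super().__init__()
        self.gate = AttentionGate(skip_ch, gate_ch, inter_ch)
        self.eca = ECA(skip_ch)

    def forward(self, skip, gate):
        return self.eca(self.gate(skip, gate))
```
Applying spatial gating before channel recalibration mirrors the "spatial selection and channel recalibration" order described above.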
Further removal of the decoder’s hybrid attention (M-RecNet) led to an additional 5.25% mIoU decrease. This module’s role is to refine features during upsampling, maintaining discriminability and preventing detail loss or semantic dilution as resolution is restored—critical for reconstructing small or complex crop targets. The full MAF-RecNet’s 17.51% mIoU gain over U-Net results from the synergy of all modules. The encoder’s CA, EA, PSA, and ECA form a “feature refinement pipeline” that builds a foundational representation with strong spatial awareness, semantic context, and channel efficiency. This high-quality base, combined with the decoder and skip connection mechanisms, creates an end-to-end optimized system.
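Among the encoder modules, Coordinate Attention illustrates how spatial awareness enters this pipeline: it pools features along each spatial axis separately, so positional information survives the squeeze step. The sketch below is our rendering of the standard CA design; the reduction ratio and minimum width are assumptions:
```python
import torch
import torch.nn as nn

class CoordinateAttention(nn.Module):
    """Factorizes attention into H- and W-direction encodings (standard CA design)."""
    def __init__(self, channels: int, reduction: int = 32):
        super().__init__()
        mid = max(8, channels // reduction)
        self.conv1 = nn.Conv2d(channels, mid, 1)
        self.bn = nn.BatchNorm2d(mid)
        self.conv_h = nn.Conv2d(mid, channels, 1)
        self.conv_w = nn.Conv2d(mid, channels, 1)

    def forward(self, x):                                        # x: (B, C, H, W)
        _, _, h, w = x.shape
        x_h = x.mean(dim=3, keepdim=True)                        # (B, C, H, 1)
        x_w = x.mean(dim=2, keepdim=True).permute(0, 1, 3, 2)    # (B, C, W, 1)
        y = torch.relu(self.bn(self.conv1(torch.cat([x_h, x_w], dim=2))))
        y_h, y_w = torch.split(y, [h, w], dim=2)
        a_h = torch.sigmoid(self.conv_h(y_h))                    # attention along H
        a_w = torch.sigmoid(self.conv_w(y_w.permute(0, 1, 3, 2)))  # attention along W
        return x * a_h * a_w                                     # position-aware reweighting
```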
In summary, each attention component addresses a specific weakness: encoder modules enhance feature extraction, skip connections optimize fusion, and decoder modules preserve detail during reconstruction. Their collaborative, hierarchical operation ensures features are progressively purified and focused, enabling high accuracy with a lightweight 15.25 M-parameter design.

4.3. Model Limitations and Future Directions

Although MAF-RecNet exhibits strong performance in wheat and corn identification, several limitations remain. Its ability to discern subtle disease features under complex backgrounds is insufficient, particularly when leaf textures blend with the environment, affecting boundary precision and detail retention. While the model balances parameters and computational cost well, its inference efficiency on resource-constrained edge devices needs further improvement. Additionally, adaptability to extreme climates, regional variations, and specialized cultivation patterns is limited, leading to potential performance drops outside the training data distribution.
The model also relies heavily on large volumes of high-quality annotations, challenging its stability in real-world settings with scarce or poor-quality labels. Furthermore, it is trained only on visible-light imagery without integrating multi-source remote sensing data (e.g., hyperspectral, LiDAR), which restricts its capacity to analyze crop biochemistry and 3D structure [58,59].
To address these issues, future research should focus on the following directions:
(1)
Model Lightweighting and Efficiency Optimization: Explore efficient attention module designs combined with neural architecture search (NAS) techniques to reduce model size and computational overhead while maintaining accuracy, thereby meeting stricter requirements for embedded deployment.
(2)
Few-shot and Weakly Supervised Learning Frameworks: Develop more effective image augmentation and synthetic data methods to reduce reliance on large-scale, high-quality annotations, thereby enhancing learning efficiency and model robustness in data-scarce scenarios.
(3)
Multi-Source Data Fusion: Integrate hyperspectral and LiDAR technologies, particularly emerging hyperspectral LiDAR systems that collect 3D and spectral data simultaneously, to characterize crop biochemical and structural traits in detail. This helps minimize spatial aggregation errors and enhances monitoring accuracy and adaptability in complex field environments.
(4)
Cross-domain Adaptation and Generalization Enhancement: Develop lightweight domain adaptation algorithms to improve model transferability across geographic regions, imaging conditions, and climatic backgrounds, thereby strengthening real-world generalization and robustness.

5. Conclusions

To address key challenges in corn and wheat recognition—including few-shot learning, model complexity, and cross-domain generalization—this study proposes a lightweight Multi-Attention Fusion Recognition Network (MAF-RecNet). Using farmland identification in southern Hebei as the primary validation scenario, the model’s generalization ability was systematically evaluated across multiple agricultural tasks. The main findings are as follows:
(1)
MAF-RecNet achieves an effective balance between recognition accuracy and model efficiency. For farmland identification in southern Hebei, it attains an mIoU of 87.57% and a mAP of 95.42%, outperforming mainstream models such as SegNeXt. Through the synergistic design of multi-level attention mechanisms and lightweight components, the model achieves high-precision recognition with only 15.25 million parameters.
(2)
Ablation experiments validate the effectiveness of each module: coordinate attention enhances spatial perception of small-target boundaries, integrated attention improves discriminative representation of multi-scale features, and dual-attention skip connections optimize feature fusion. The collaborative operation of these modules provides essential support for model performance.
(3)
The model demonstrates strong cross-task generalization and robustness. In the global wheat-health identification task, it achieves an mIoU of 90.20% and an mAP of 98.28%, reflecting effective knowledge transfer. Moreover, it maintains stable performance under noise-interference robustness testing, verifying its practicality in complex agricultural environments.
In summary, MAF-RecNet provides an effective technical solution for precise crop identification via its lightweight architecture and multi-level attention mechanisms. The model not only balances accuracy and efficiency under few-shot conditions, but its modular design also offers valuable insights for developing lightweight models in related fields. Future work will focus on further model compression, enhanced few-shot learning capabilities, and multi-modal data fusion to continuously improve the model’s applicability and adaptability across diverse agricultural scenarios. The model code for this study has been made publicly available at https://doi.org/10.5281/zenodo.18201092, accessed on 27 January 2026. The dataset can be obtained from the corresponding author upon reasonable request.

Author Contributions

Conceptualization, J.Z. and H.Y. (Haiming Yan); Data curation, Y.L. and W.F.; Formal analysis, J.Z. and Y.L.; Investigation, H.Y. (Hao Yao); Methodology, L.N. and Z.W.; Software, H.Y. (Hao Yao); Supervision, L.N. and Z.W.; Validation, J.Z. and H.Y. (Haiming Yan); Writing—original draft, H.Y. (Hao Yao); Writing—review and editing, J.Z. and H.Y. (Haiming Yan). All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the Central Guidance on Local Science and Technology Development Fund of Hebei Province, China (No. 236Z4201G) and the Natural Science Foundation of Hebei Province (G2023403002).

Data Availability Statement

The data presented in this study are available on request from the corresponding author.

Acknowledgments

The authors thank the editors and reviewers for their careful review, constructive suggestions, and reminders, which helped improve the quality of the paper.

Conflicts of Interest

The authors declare no competing financial interests or personal relationships that could have appeared to influence the research reported in this paper.

References

1. Karthikeyan, L.; Chawla, I.; Mishra, A.K. A review of remote sensing applications in agriculture for food security: Crop growth and yield, irrigation, and crop losses. J. Hydrol. 2020, 586, 124905.
2. Yang, G.; Li, X.; Xiong, Y.; He, M.; Zhang, L.; Jiang, C.; Yao, X.; Zhu, Y.; Cao, W.; Cheng, T. Annual winter wheat mapping for unveiling spatiotemporal patterns in China with a knowledge-guided approach and multi-source datasets. ISPRS J. Photogramm. Remote Sens. 2025, 225, 163–179.
3. Dong, J.; Fu, Y.; Wang, J.; Tian, H.; Fu, S.; Niu, Z.; Han, W.; Zheng, Y.; Huang, J.; Yuan, W. Early-season mapping of winter wheat in China based on Landsat and Sentinel images. Earth Syst. Sci. Data 2020, 12, 3081–3095.
4. Su, X.; Wang, L.; Zhang, M.; Qin, W.; Bilal, M. A High-Precision Aerosol Retrieval Algorithm (HiPARA) for Advanced Himawari Imager (AHI) data: Development and verification. Remote Sens. Environ. 2021, 253, 112221.
5. Li, Z.; Cheng, Q.; Chen, L.; Zhang, B.; Guo, S.; Zhou, X.; Chen, Z. Predicting Winter Wheat Yield with Dual-Year Spectral Fusion, Bayesian Wisdom, and Cross-Environmental Validation. Remote Sens. 2024, 16, 2098.
6. Xiong, Y.; McCarthy, C.; Humpal, J.; Percy, C. Near-infrared spectroscopy and deep neural networks for early common root rot detection in wheat from multi-season trials. Agron. J. 2024, 116, 2370–2390.
7. Qi, H.; Qian, X.; Shang, S.; Wan, H. Multi-year mapping of cropping systems in regions with smallholder farms from Sentinel-2 images in Google Earth engine. GISci. Remote Sens. 2024, 61, 2309843.
8. Moldvai, L.; Mesterházi, P.Á.; Teschner, G.; Nyéki, A. Weed Detection and Classification with Computer Vision Using a Limited Image Dataset. Appl. Sci. 2024, 14, 4839.
9. Niloofar, P.; Francis, D.P.; Lazarova-Molnar, S.; Vulpe, A.; Vochin, M.-C.; Suciu, G.; Balanescu, M.; Anestis, V.; Bartzanas, T. Data-driven decision support in livestock farming for improved animal health, welfare and greenhouse gas emissions: Overview and challenges. Comput. Electron. Agric. 2021, 190, 106406.
10. Rai, N.; Sun, X. WeedVision: A single-stage deep learning architecture to perform weed detection and segmentation using drone-acquired images. Comput. Electron. Agric. 2024, 219, 108792.
11. Katsini, L.; Muñoz López, C.A.; Bhonsale, S.; Roufou, S.; Griffin, S.; Valdramidis, V.; Akkermans, S.; Polanska, M.; Van Impe, J. Modeling climatic effects on milk production. Comput. Electron. Agric. 2024, 225, 109218.
12. Nie, M.; Sun, J.; Guoyang, H.; Niu, A.; Hu, Y.; Yan, Q.; Zhu, Y.; Zhang, Y. FSCFNet: Lightweight neural networks via multi-dimensional importance-aware optimization. Neurocomputing 2026, 660, 131823.
13. Zhu, Y.; Pan, Y.; Hu, T.; Zhang, D.; Dai, J. An Integrated Sample-Free Method for Agricultural Field Delineation From High-Resolution Remote Sensing Imagery. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2025, 18, 13112–13134.
14. Chen, C.; Zhang, C.; Zhao, L.; Yang, C.; Yao, X.; Fu, B. Soybean cultivation and crop rotation monitoring based on multi-source remote sensing data and Bi-LSTM enhanced model. Comput. Electron. Agric. 2025, 239, 110959.
15. Zhou, G.; Qian, L.; Gamba, P. Advances on Multimodal Remote Sensing Foundation Models for Earth Observation Downstream Tasks: A Survey. Remote Sens. 2025, 17, 3532.
16. Wu, X.; Guo, P.; Sun, Y.; Liang, H.; Zhang, X.; Bai, W. Recent Progress on Vegetation Remote Sensing Using Spaceborne GNSS-Reflectometry. Remote Sens. 2021, 13, 4244.
17. Liu, Q.; Zhao, L.; Sun, R.; Yu, T.; Cheng, S.; Wang, M.; Zhu, A.; Li, Q. Estimation and Spatiotemporal Variation Analysis of Net Primary Productivity in the Upper Luanhe River Basin in China From 2001 to 2017 Combining With a Downscaling Method. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2022, 15, 353–363.
18. Liu, J.; Fan, G.; Maimutimin, B. Efficient Image Segmentation of Coal Blocks Using an Improved DIRU-Net Model. Mathematics 2025, 13, 3541.
19. Shin, Y.; Park, J.; Hong, J.; Sung, H. Runtime Support for Accelerating CNN Models on Digital DRAM Processing-in-Memory Hardware. IEEE Comput. Archit. Lett. 2022, 21, 33–36.
20. Guo, J.; Bao, W.; Wang, J.; Ma, Y.; Gao, X.; Xiao, G.; Liu, A.; Dong, J.; Liu, X.; Wu, W. A comprehensive evaluation framework for deep model robustness. Pattern Recognit. 2023, 137, 109308.
21. Fu, X.; Wang, J.; Zeng, D.; Huang, Y.; Ding, X. Remote Sensing Image Enhancement Using Regularized-Histogram Equalization and DCT. IEEE Geosci. Remote Sens. Lett. 2015, 12, 2301–2305.
22. Baskota, A.; Ghimire, S.; Ghimire, A.; Baskaran, P. Herbguard: An Ensemble Deep Learning Framework With Efficientnet and Vision Transformers for Fine-Grained Classification of Medicinal and Poisonous Plants. IEEE Access 2025, 13, 179333–179350.
23. Jiao, L.; Liu, H.; Liang, Z.; Chen, P.; Wang, R.; Liu, K. An Anchor-Free Refining Feature Pyramid Network for Dense and Multioriented Wheat Spikes Detection Under UAV. IEEE Trans. Instrum. Meas. 2025, 74, 5003314.
24. Prommakhot, A.; Onshaunjit, J.; Ooppakaew, W.; Samseemoung, G.; Srinonchat, J. Hybrid CNN and Transformer-Based Sequential Learning Techniques for Plant Disease Classification. IEEE Access 2025, 13, 122876–122887.
25. Wu, P.; Huang, H.; Qian, H.; Su, S.; Sun, B.; Zuo, Z. SRCANet: Stacked Residual Coordinate Attention Network for Infrared Ship Detection. IEEE Trans. Geosci. Remote Sens. 2022, 60, 5003614.
26. Zhang, G.; Wang, S.; Xie, Y.; Xie, S.Q.; Hu, Y.; Zhang, Y. Robotic Grasp Detection via Residual Efficient Channel Attention and Multiscale Feature Learning. IEEE Trans. Instrum. Meas. 2025, 74, 7513311.
27. Zhang, Z.; Wang, Y.; Chai, S.; Tian, Y. Detection and Maturity Classification of Dense Small Lychees Using an Improved Kolmogorov–Arnold Network–Transformer. Plants 2025, 14, 3378.
28. She, S.; Meng, T.; Zheng, X.; Shao, Y.; Hu, G.; Yin, W.; Shen, J.; He, Y. Evaluation of Defects Depth for Metal Sheets Using Four-Coil Excitation Array Eddy Current Sensor and Improved ResNet18 Network. IEEE Sens. J. 2024, 24, 18955–18967.
29. Wang, P.; Feng, Y.; Sun, X.; Cheng, X. Swin Transformer Based Recognition for Hydraulic Fracturing Microseismic Signals from Coal Seam Roof with Ultra Large Mining Height. Sensors 2025, 25, 6750.
30. Mahareek, E.A.; Cifci, M.A.; Desuky, A.S. Integrating Convolutional, Transformer, and Graph Neural Networks for Precision Agriculture and Food Security. AgriEngineering 2025, 7, 353.
31. Yao, H.; Li, Y.; Feng, W.; Zhu, J.; Yan, H.; Zhang, S.; Zhao, H. CAGM-Seg: A Symmetry-Driven Lightweight Model for Small Object Detection in Multi-Scenario Remote Sensing. Symmetry 2025, 17, 2137.
32. Wu, H.; Lv, H.; Wang, A.; Yan, S.; Molnar, G.; Yu, L.; Wang, M. CNN-GCN Coordinated Multimodal Frequency Network for Hyperspectral Image and LiDAR Classification. Remote Sens. 2026, 18, 216.
33. Liu, X.; Chen, Y.; Zhang, D.; Yan, R.; Ni, H. A Multichannel Long-Term External Attention Network for Aeroengine Remaining Useful Life Prediction. IEEE Trans. Artif. Intell. 2024, 5, 5130–5140.
34. Chen, Z.; Zheng, Y.; Weng, T.-H.; Li, L.-H.; Li, K.-C.; Poniszewska-Maranda, A. An Improved Dilated-Transposed Convolution Detector of Weld Proximity Defects. IEEE Access 2024, 12, 157127–157139.
35. Zhao, N.; Huang, B.; Yang, J.; Radenkovic, M.; Chen, G. Oceanic Eddy Identification Using Pyramid Split Attention U-Net With Remote Sensing Imagery. IEEE Geosci. Remote Sens. Lett. 2023, 20, 1500605.
36. Peng, C.; Li, B.; Zou, K.; Zhang, B.; Dai, G.; Tsoi, A.C. Dual-Branch Multi-Dimensional Attention Mechanism for Joint Facial Expression Detection and Classification. Sensors 2025, 25, 3815.
37. Dong, Z.; Zhu, D.; Huang, M.; Lin, Q.; Møller-Jensen, L.; Silva, E.A. MSCANet: Multi-Scale Spatial-Channel Attention Network for Urbanization Intelligent Monitoring. Remote Sens. 2026, 18, 159.
38. Li, Z.; Zhen, Z.; Chen, S.; Zhang, L.; Cao, L. Dual-Level Attention Relearning for Cross-Modality Rotated Object Detection in UAV RGB–Thermal Imagery. Remote Sens. 2025, 18, 107.
39. Yuan, Z.; Ding, X.; Xia, X.; He, Y.; Fang, H.; Yang, B.; Fu, W. Structure-Aware Progressive Multi-Modal Fusion Network for RGB-T Crack Segmentation. J. Imaging 2025, 11, 384.
40. Huang, X.; Zhang, X.; Wang, L.; Yuan, D.; Xu, S.; Zhou, F.; Zhou, Z. MMA-Net: A Semantic Segmentation Network for High-Resolution Remote Sensing Images Based on Multimodal Fusion and Multi-Scale Multi-Attention Mechanisms. Remote Sens. 2025, 17, 3572.
41. Wei, Y.; Guo, X.; Lu, Y.; Hu, H.; Wang, F.; Li, R.; Li, X. Phenology-Guided Wheat and Corn Identification in Xinjiang: An Improved U-Net Semantic Segmentation Model Using PCA and CBAM-ASPP. Remote Sens. 2025, 17, 3563.
42. He, W.; Mei, S.; Hu, J.; Ma, L.; Hao, S.; Lv, Z. Filter-Wise Mask Pruning and FPGA Acceleration for Object Classification and Detection. Remote Sens. 2025, 17, 3582.
43. Jiang, K.; Wu, C. LGCANet: Local–Global and Change-Aware Network via Segment Anything Model for Remote Sensing Images Change Detection. IEEE Trans. Geosci. Remote Sens. 2025, 63, 5629213.
44. Makarov, I.; Koshevaya, E.; Pechenina, A.; Boyko, G.; Starshinova, A.; Kudlay, D.; Makarova, T.; Mitrofanova, L. A Comparative Analysis of SegFormer, FabE-Net and VGG-UNet Models for the Segmentation of Neural Structures on Histological Sections. Diagnostics 2025, 15, 2408.
45. Gu, G.; Wang, Z.; Weng, L.; Lin, H.; Zhao, Z.; Zhao, L. Attention Guide Axial Sharing Mixed Attention (AGASMA) Network for Cloud Segmentation and Cloud Shadow Segmentation. Remote Sens. 2024, 16, 2435.
46. Lizzi, F.; Saponaro, S.; Giuliano, A.; Talamonti, C.; Ubaldi, L.; Retico, A. Radiomics and Deep Learning Interplay for Predicting MGMT Methylation in Glioblastoma: The Crucial Role of Segmentation Quality. Cancers 2025, 17, 3417.
47. Zhou, P.; Xie, X.; Lin, Z.; Yan, S. Towards Understanding Convergence and Generalization of AdamW. IEEE Trans. Pattern Anal. Mach. Intell. 2024, 46, 6486–6493.
48. Lu, F.; Xu, J.; Sun, Q.; Lou, Q. An Efficient Vision Mamba–Transformer Hybrid Architecture for Abdominal Multi-Organ Image Segmentation. Sensors 2025, 25, 6785.
49. Wu, X.; Ren, X.; Zhai, D.; Wang, X.; Tarif, M. Lights-Transformer: An Efficient Transformer-Based Landslide Detection Model for High-Resolution Remote Sensing Images. Sensors 2025, 25, 3646.
50. Xiong, B.; Fan, S.; He, X.; Xu, T.; Chang, Y. Small Logarithmic Floating-Point Multiplier Based on FPGA and Its Application on MobileNet. IEEE Trans. Circuits Syst. II Express Briefs 2022, 69, 5119–5123.
51. Zhang, H.; Zhang, P.; Wang, Z.; Chao, L.; Chen, Y.; Li, Q. Multi-Feature Decision Fusion Network for Heart Sound Abnormality Detection and Classification. IEEE J. Biomed. Health Inform. 2024, 28, 1386–1397.
52. Xie, Y.; Hao, Z.-W.; Wang, X.-M.; Wang, H.-L.; Yang, J.-M.; Zhou, H.; Wang, X.-D.; Zhang, J.-Y.; Yang, H.-W.; Liu, P.-R.; et al. Dual-Stream Attention-Based Classification Network for Tibial Plateau Fractures via Diffusion Model Augmentation and Segmentation Map Integration. Curr. Med. Sci. 2025, 45, 57–69.
53. Korfhage, N.; Mühling, M.; Freisleben, B. Search anything: Segmentation-based similarity search via region prompts. Multimed. Tools Appl. 2024, 84, 32593–32618.
54. Ortega-Ruíz, M.A.; Karabağ, C.; Roman-Rangel, E.; Reyes-Aldasoro, C.C. DRD-UNet, a UNet-Like Architecture for Multi-Class Breast Cancer Semantic Segmentation. IEEE Access 2024, 12, 40412–40424.
55. de Oliveira, C.E.G.; Vieira, S.L.; Paranaiba, C.F.B.; Itikawa, E.N. Breast tumor segmentation in ultrasound images: Comparing U-Net and U-Net++. Res. Biomed. Eng. 2025, 41, 16.
56. Li, J.; Jin, Z.; Wu, T. A Spatiotemporal Forecasting Method for Cooling Load of Chillers Based on Patch-Specific Dynamic Filtering. Sustainability 2025, 17, 9883.
57. Chen, R.; Wu, J.; Zhao, X.; Luo, Y.; Xu, G. SC-CNN: LiDAR point cloud filtering CNN under slope and copula correlation constraint. ISPRS J. Photogramm. Remote Sens. 2024, 212, 381–395.
58. Sun, J.; Shi, S.; Gong, W.; Yang, J.; Du, L.; Song, S.; Chen, B.; Zhang, Z. Evaluation of hyperspectral LiDAR for monitoring rice leaf nitrogen by comparison with multispectral LiDAR and passive spectrometer. Sci. Rep. 2017, 7, 40362.
59. Bai, J.; Niu, Z.; Huang, Y.; Bi, K.; Fu, Y.; Gao, S.; Wu, M.; Wang, L. Full-waveform hyperspectral LiDAR data decomposition via ranking central locations of natural target echoes (Rclonte) at different wavelengths. Remote Sens. Environ. 2024, 310, 114227.
Figure 1. Illustration of samples before and after image augmentation. (A): before augmentation; (a): after augmentation.
Figure 2. Examples from the Robustness Testing Dataset. (A): unperturbed; (a): perturbed.
Figure 3. Model Architecture Diagram.
Figure 4. Preprocessing Module Architecture Diagram.
Figure 5. Architecture Diagram of ResNet18 and Hybrid Attention Module. (A): ResNet18; (B): Hybrid Attention Module.
Figure 6. Structure of Coordinate Attention and Pyramid Split Attention. (A): Coordinate Attention; (B): Pyramid Split Attention.
Figure 7. Architecture Diagram of Decoder and Hybrid Convolution Module. (A): Decoder; (B): Hybrid Convolution Module.
Figure 8. Recognition results of MAF-RecNet on the SHFD. (A): ground truth; (B): prediction results; red regions: target.
Figure 9. Comparison of recognition results by different models in the ablation study on the SHFD. (A): ground truth; (B–E): predictions by MAF-RecNet, MA-RecNet, M-RecNet, and U-Net, respectively; red regions: target.
Figure 10. Recognition results of MAF-RecNet in the generalization test across datasets. (A–D): ground truth for SHFD, WHSSH, GWHD, and CHSD, respectively; (a–d): corresponding prediction results; red regions: target.
Figure 11. Evaluation Results Based on Confusion Matrix.
Figure 12. Recognition results of MAF-RecNet in the robustness test. (A–D): ground truth for SHFD-RTD, WHSSH-RTD, GWHD-RTD, and CHSD-RTD, respectively; (a–d): corresponding prediction results; red regions: target.
Figure 13. Loss Function of MAF-RecNet.
Table 1. Temporal Distribution of Remote Sensing Images.
Time | Number of Scenes
February | 14
March | 82
April | 7
Total | 103
Table 2. Details of the augmented dataset.
Dataset Name | Number of Samples | Total Pixels (M) | Target Pixels (M) | Average Target Pixel Ratio per Sample (%)
SHFD | 1500 | 98.36 | 66.63 | 67.78
WHSSH | 1500 | 98.34 | 45.86 | 46.65
GWHD | 2300 | 150.73 | 78.79 | 52.27
CHSD | 2000 | 131.07 | 59.03 | 45.04
Table 3. Detailed Description of the Robustness Test Dataset.
Dataset Name | Number of Samples | Total Pixels (M) | Target Pixels (M) | Average Target Pixel Ratio per Sample (%)
SHFD-RTD | 1800 | 117.96 | 79.77 | 67.62
WHSSH-RTD | 1797 | 117.77 | 54.78 | 46.52
GWHD-RTD | 2758 | 180.75 | 93.8 | 51.89
CHSD-RTD | 2397 | 157.01 | 70.82 | 45.09
Table 4. Comparative experiment results.
Model Name | mPrecision (%) | mRecall (%) | mF1-Score (%) | mIoU (%) | mAP (%) | Parameters (M) | FLOPs (G)
MAF-RecNet | 90.56 | 95.18 | 92.83 | 87.57 | 95.42 | 15.25 | 21.81
MDFNet | 85.13 | 89.47 | 87.26 | 82.32 | 89.69 | 36.18 | 106.16
SegFormer | 76.98 | 80.90 | 78.91 | 74.43 | 81.10 | 3.72 | 31.93
FastSAM | 81.50 | 85.66 | 83.55 | 78.81 | 85.88 | 17.43 | 27.63
SegNeXt | 83.32 | 87.57 | 85.40 | 80.56 | 87.79 | 24.91 | 6.8
Table 5. Results of Ablation Experiments for Each Model.
Model Name | mPrecision (%) | mRecall (%) | mF1-Score (%) | mIoU (%) | mAP (%) | Parameters (M) | FLOPs (G)
MAF-RecNet | 90.56 | 95.18 | 92.83 | 87.57 | 95.42 | 15.25 | 21.81
MA-RecNet | 86.03 | 90.42 | 88.19 | 83.19 | 90.65 | 15.8 | 21.32
M-RecNet | 80.60 | 84.71 | 82.62 | 77.94 | 84.92 | 15.01 | 19.64
U-Net | 72.45 | 76.14 | 74.26 | 70.06 | 76.34 | 13.39 | 14.77
Table 6. Results of Generalization Ability Tests.
Dataset Name | mPrecision (%) | mRecall (%) | mF1-Score (%) | mIoU (%) | mAP (%)
SHFD | 90.56 | 95.18 | 92.83 | 87.57 | 95.42
WHSSH | 83.32 | 87.57 | 85.40 | 80.56 | 87.79
GWHD | 93.28 | 98.04 | 95.61 | 90.20 | 98.28
CHSD | 86.94 | 91.37 | 89.12 | 84.07 | 91.60
Table 7. Results of Robustness Tests.
Dataset Name | mPrecision (%) | mRecall (%) | mF1-Score (%) | mIoU (%) | mAP (%)
SHFD-RTD | 80.60 | 84.71 | 82.62 | 77.94 | 84.92
WHSSH-RTD | 75.82 | 79.69 | 77.71 | 73.31 | 79.89
GWHD-RTD | 85.82 | 90.20 | 87.96 | 82.98 | 90.42
CHSD-RTD | 80.85 | 84.97 | 82.88 | 78.19 | 85.19