1. Introduction
Wheat is widely recognized as one of the most important staple crops in the world, playing a vital role in ensuring global food security, supporting agricultural economic development, and maintaining social stability [
1]. In China, wheat is not only the primary food source for a large population in the northern regions, but also occupies a central position in food processing, the feed industry, and related agricultural value chains [
2]. Consequently, wheat production and quality are of great strategic importance to national food security and agricultural sustainability [
3]. However, with the growing demand for food quality and safety, focusing solely on yield is no longer sufficient to meet the requirements of modern agriculture [
4]. Wheat quality directly affects its nutritional value, processing performance, and market competitiveness [
5]. As a result, wheat quality inspection has become an essential component of ensuring food safety and promoting refined agricultural production. The adoption of scientific and accurate quality assessment methods enables reliable grading and evaluation of wheat, while also providing a solid basis for storage management, transportation, and industrial utilization, thereby offering significant theoretical and practical value [
6].
At present, methods for wheat kernel identification can be broadly categorized into two groups: physicochemical analysis-based techniques and manual inspection approaches [
7]. Although both categories are capable of evaluating wheat quality to a certain extent, they suffer from notable limitations in terms of efficiency, accuracy, cost, and practical applicability. Physicochemical methods primarily assess wheat quality by measuring physical and chemical properties of the kernels. For example, near-infrared spectroscopy (NIR) has been widely applied to determine protein content and moisture levels in cereals, which can indirectly indicate the presence of mold or insect damage [
8]. In addition, physical separation techniques, such as flotation and sieving, are employed to remove defective kernels based on variations in density or weight. Under controlled laboratory conditions, these methods can achieve high precision and offer clear advantages in detecting potential quality issues, including mycotoxin contamination or excessive moisture content [
9]. However, their practical limitations are evident. The reliance on expensive analytical instruments, such as spectrometers and chromatographic systems, along with associated consumables, substantially increases detection costs. Moreover, the lengthy testing cycles, combined with time-consuming sample preparation and analysis procedures, render these techniques impractical for large-scale and real-time screening applications. In parallel, manual visual inspection and hand sorting remain common approaches for identifying defective wheat kernels [
10]. In this process, trained operators evaluate kernel color, shape, surface defects, and other visual characteristics to distinguish sound kernels from those affected by insect damage, mold growth, black embryo, or scab infection. The primary advantage of manual inspection lies in its minimal equipment requirements and its flexibility in handling diverse defect types, which makes it suitable for small-scale production or specialized batch inspections. Nevertheless, manual methods are inherently characterized by low efficiency, strong subjectivity, and a high dependence on skilled labor, all of which lead to substantially increased operational costs [
11]. As labor expenses continue to rise, the economic feasibility and long-term sustainability of manual inspection are steadily diminishing.
Traditional machine learning techniques, such as support vector machines (SVMs) and random forests, rely heavily on handcrafted features for wheat kernel classification [
12]. This dependence limits their ability to effectively capture the subtle morphological and textural variations present in wheat kernels, thereby constraining recognition accuracy. These approaches require domain expertise to design and select appropriate features; however, the characteristics of wheat kernels affected by developmental defects or damage often exhibit pronounced heterogeneity. As a result, manually engineered features struggle to adequately represent such complex variations [
13], making these models prone to misclassification when distinguishing defective kernels from normal samples.
In recent years, deep learning-based image classification has achieved remarkable success across a wide range of application domains. Notably, in the agricultural sector, the adoption of deep learning techniques has demonstrated substantial potential and clear advantages [
14]. For example, established deep learning models such as GoogLeNet [
15], VGG [
16], and ResNet [
17] have been effectively applied to crop seed variety classification, consistently exhibiting strong classification performance [
18]. By training on large-scale seed image datasets, deep learning models are able to automatically learn discriminative representations that capture subtle inter-class differences, thereby enabling efficient and accurate classification [
19]. These advances have not only improved the level of automation in agricultural production, but also provided robust technical support for seed quality control and varietal improvement.
The characteristics of defective wheat kernels are not readily discernible, resulting in a high degree of overall similarity among them. Consequently, their classification and recognition can be regarded as a fine-grained image classification challenge. Wang et al. [
20] proposed the FFDNet-VGG terahertz image enhancement algorithm to address the problem of low image quality in unsound wheat kernel imaging, and further evaluated its effectiveness by employing a CNN-based classification network for unsound kernel recognition based on the enhanced images, thereby achieving improved classification accuracy. Gao et al. [
21] modified the ResNet architecture by integrating attention mechanisms and proposed the Res24_D_CBAM model, which achieved an identification accuracy of 94% for six types of wheat kernels. Shen et al. [
22] incorporated an efficient channel attention (ECA) module into the Mask R-CNN framework and optimized both the feature pyramid network (FPN) and region proposal network (RPN), resulting in an average classification accuracy of 86% across six wheat kernel categories. Zhao et al. [
23] enhanced the YOLOv5 model by pruning the backbone network and introducing a channel-mixing attention module in the neck, leading to an average recognition accuracy of 99.2% for six wheat kernel types. More recently, Zhang et al. [
24] proposed a lightweight network termed GFRNet by integrating low-rank multi-scale convolutional operators with inverted residual structures to address the high computational cost of self-attention mechanisms and large convolution kernels in wheat quality inspection. The enhanced model achieved an average accuracy of 83.89%, a Top-1 accuracy of 84.90%, and a Macro-F1 score of 83.20% on the GrainSpace dataset, outperforming ConvNeXt-T, InceptionNeXt-T, FasterNet, ResNet50, and Swin-T by approximately 2–5 percentage points, while reducing the parameters and FLOPs to 15.915 M and 2.28 G, respectively. In 2023, Zhang et al. [
25] designed an improved YOLOv5 model by integrating an efficient channel attention (ECA) mechanism into the baseline network, achieving an average detection accuracy of 96.24% for defective wheat kernels, which represented improvements of 10% over CBAM-YOLOv5 and SENet-YOLOv5, and 13% over the original YOLOv5 model. In 2025, Zhang et al. [
26] further proposed a multi-scale parallel convolutional network (MSPCNeXt) to overcome the limitations of single-scale CNNs in detecting defective kernels during storage. By parallelizing convolution kernels of different scales and optimizing MSPCNeXt-v1 and MSPCNeXt-v2 blocks through large-kernel decomposition and feature splitting, the receptive field was effectively enlarged while computational complexity was reduced. Experimental results on the GrainSpace dataset showed that, compared with ConvNeXt, MSPCNeXt-v1 improved the average accuracy and Top-1 accuracy by 2.531% and 2.286%, respectively, while MSPCNeXt-v2 achieved improvements of 2.224% and 1.857%, with reductions of 0.456 G FLOPs and 5.986 M parameters. Earlier, Zhang et al. [
27] integrated attention mechanisms into residual networks of varying depths, and the optimized attention-enhanced ResNet-50 model achieved an average accuracy of 97.5% in identifying defective wheat kernels.
Based on the above review, although attention-based deep learning models have achieved promising results in defective wheat kernel recognition, several challenges remain. Most existing methods adopt conventional Softmax-based multiclass classifiers, which enforce strong inter-class competition and may limit discrimination when defect categories exhibit high visual similarity and ambiguous boundaries. In fine-grained wheat kernel recognition tasks, this limitation can suppress subtle but critical defect-related features.
To address this issue, this study proposes an EfficientNet-B1-CBAM-based recognition framework with a multi-binary classification strategy. EfficientNet-B1 is employed as the backbone for efficient feature extraction, while the Convolutional Block Attention Module (CBAM) enhances both channel-wise and spatial feature representations. In addition, the traditional single-head multiclass classifier is replaced with five independent Sigmoid-based binary classifiers, allowing each category to learn class-specific decision boundaries with reduced inter-class interference. Extensive experiments, including comparative analysis, ablation studies, and cross-validation, are conducted to validate the effectiveness and robustness of the proposed method.
2. Materials and Methods
2.1. Data Collection and Dataset Development
The defective wheat kernel dataset used in this study was collected from local farmers’ households (Heze, China). Since wheat from different cultivars is commonly stored together after harvesting, the dataset comprises kernels from multiple varieties. Nevertheless, the defective kernels exhibit consistent visual characteristics across varieties, enabling unified analysis and classification. The dataset includes five categories of wheat kernels: sound kernels, insect-damaged kernels, Fusarium-damaged kernels, moldy kernels, and kernels with black embryos. Representative examples of each category are shown in
Figure 1. These kernel types are critical factors affecting wheat quality and commercial grading. During data acquisition, wheat kernel images were captured using a high-resolution camera under standardized lighting conditions to ensure consistent image quality and to minimize the influence of background noise. For each wheat kernel category, a total of 300 images were collected, resulting in an overall dataset of 1500 images. Within each category, 200 images were allocated to the training set, while the remaining 100 images were used for testing. The dataset used in this study has been made publicly available at Zenodo and can be accessed at:
https://doi.org/10.5281/zenodo.18222641, accessed on 17 December 2025.
As all images were captured under identical acquisition conditions, two data augmentation techniques—Gaussian blurring and Poisson noise injection—were applied to enhance data diversity and improve model generalization. Through this augmentation process, the number of images for each wheat kernel category was expanded from the original 200 images to 600 images, resulting in a total augmented dataset of 3000 images. For each wheat kernel category, the augmented dataset was further divided into training and validation sets at a ratio of 9:1. Specifically, 540 images per category were used for network training, while the remaining 60 images constituted the validation set. The training set was employed for model learning, whereas the validation set was used for hyperparameter tuning and optimal model selection. The independent test set was retained for the final performance evaluation of the proposed model.
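As an illustration of the two augmentation operations, the following NumPy sketch applies Gaussian blurring and Poisson noise injection to an image array. The function names, sigma value, and image sizes are illustrative assumptions, not the authors' actual pipeline:

```python
import numpy as np

def gaussian_blur(img: np.ndarray, sigma: float = 1.0) -> np.ndarray:
    """Separable Gaussian blur applied to an (H, W, C) uint8 image."""
    radius = max(1, int(3 * sigma))
    x = np.arange(-radius, radius + 1)
    kernel = np.exp(-(x ** 2) / (2 * sigma ** 2))
    kernel /= kernel.sum()
    out = img.astype(np.float64)
    # Convolve along height, then width (the Gaussian kernel is separable).
    out = np.apply_along_axis(lambda m: np.convolve(m, kernel, mode="same"), 0, out)
    out = np.apply_along_axis(lambda m: np.convolve(m, kernel, mode="same"), 1, out)
    return np.clip(out, 0, 255).astype(np.uint8)

def poisson_noise(img: np.ndarray, rng: np.random.Generator) -> np.ndarray:
    """Inject Poisson (shot) noise by resampling each pixel from a Poisson law."""
    noisy = rng.poisson(img.astype(np.float64))
    return np.clip(noisy, 0, 255).astype(np.uint8)

rng = np.random.default_rng(0)
img = rng.integers(0, 256, size=(64, 64, 3), dtype=np.uint8)
# Each original image yields two augmented variants (200 -> 600 per class).
augmented = [img, gaussian_blur(img, sigma=1.2), poisson_noise(img, rng)]
```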
To mitigate the risk of overfitting caused by the limited dataset size and homogeneous acquisition conditions, a five-fold cross-validation strategy was adopted in this study. The dataset was randomly partitioned into five subsets, where in each fold, four subsets were used for training and the remaining subset was used for validation. No data augmentation was applied during the cross-validation process to strictly evaluate the generalization performance of the proposed model on raw images.
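The fold partitioning described above can be sketched at the index level in plain Python; the function name and seed are illustrative:

```python
import random

def five_fold_splits(n_samples: int, n_folds: int = 5, seed: int = 42):
    """Randomly partition sample indices into n_folds disjoint subsets and
    yield (train_idx, val_idx) pairs, one per fold."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    folds = [idx[k::n_folds] for k in range(n_folds)]
    for k in range(n_folds):
        val = folds[k]
        # The remaining four subsets form the training portion of this fold.
        train = [i for j, f in enumerate(folds) if j != k for i in f]
        yield train, val

# 1500 raw images: each fold validates on 300 and trains on 1200.
splits = list(five_fold_splits(1500))
```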
2.2. EfficientNet-B1-CBAM Network Model
In this study, an EfficientNet-B1-based network [
28] is developed by integrating the Convolutional Block Attention Module (CBAM), an attention mechanism originally proposed by Woo et al. [
29], resulting in the EfficientNet-B1-CBAM model for high-accuracy wheat kernel identification. The overall architecture of the proposed model is illustrated in
Figure 2. It consists of a feature extraction layer, an attention-enhancement layer, a feature aggregation layer, and a multi-level binary classification output layer.
First, the input images are fed into the EfficientNet-B1 backbone for feature extraction. This network employs a compound scaling strategy to jointly balance depth, width, and resolution, and utilizes lightweight MBConv blocks to achieve efficient feature extraction and information compression. Subsequently, the Convolutional Block Attention Module (CBAM), which consists of channel attention and spatial attention submodules, is integrated to enhance feature representation. The channel attention mechanism highlights informative feature channels through adaptive weighting, while the spatial attention mechanism emphasizes salient regions via spatial attention maps. Together, these mechanisms enable the model to focus more effectively on subtle surface variations and pathological characteristics of wheat kernels. After global average pooling, the extracted global features are fed into five independent binary classifiers in a one-vs.-rest manner, corresponding to Fusarium-damaged, insect-damaged, sound, black-embryo, and moldy wheat kernels, respectively. This multi-binary classification design allows each category to learn class-specific decision boundaries independently, thereby reducing feature interference among visually similar classes and improving overall recognition performance.
2.3. EfficientNet-B1 Network Model
EfficientNet is a family of efficient convolutional neural networks proposed by Google, whose core idea lies in the compound scaling strategy that uniformly scales network depth, width, and input resolution. By jointly balancing these three dimensions, EfficientNet achieves superior performance with significantly reduced computational cost. In contrast to conventional architectures that scale the network along a single dimension, compound scaling enables EfficientNet to enhance feature representation capability while maintaining a well-controlled model complexity.
EfficientNet-B0 serves as the baseline architecture of the EfficientNet family, from which multiple variants, ranging from EfficientNet-B1 to EfficientNet-B7, are derived through parameter scaling.
Table 1 presents a detailed description of the EfficientNet-B1 architecture, including the network stages, module types, input resolutions, kernel sizes, input tensor dimensions, output channel numbers, and the number of layers in each stage.
The entire network is composed of a series of stacked MBConv blocks, with MBConv modules at different stages varying in terms of channel width, kernel size, and stride. This design enables a progressive feature extraction process that transitions from shallow to deep representations and from low-dimensional to high-dimensional features. The network begins with a standard convolutional layer responsible for extracting low-level features from the input images. The intermediate stages consist of multiple groups of MBConv blocks, where each group reduces the spatial resolution to varying degrees while gradually increasing the number of channels to enhance feature representation capability. Finally, global average pooling followed by a fully connected classification head is employed to generate the final predictions. Compared with traditional convolutional neural networks, EfficientNet-B1 offers a notable advantage in achieving high accuracy while maintaining relatively low computational complexity. On large-scale datasets such as ImageNet, EfficientNet-B1 has been shown to achieve accuracy comparable to, or even exceeding, that of larger models such as ResNet and Inception. Owing to these characteristics, EfficientNet-B1 is particularly well suited for agricultural applications involving limited data and computational resources, such as wheat kernel classification and inspection tasks, where both accuracy and efficiency are critical.
The mobile inverted bottleneck convolution (MBConv) module is the core building block of the EfficientNet family, and its structure is illustrated in
Figure 3. This module was originally introduced in MobileNetV2 [
30], and is characterized by an inverted residual structure. By establishing efficient shortcut connections between the input and output, the inverted residual design enables effective utilization of both low-dimensional and high-dimensional feature spaces, thereby improving the efficiency of feature extraction.
The MBConv module consists of five main components, enabling the network to maintain a lightweight structure while preserving strong feature representation capability. The basic architecture of the MBConv module can be divided into the following five key parts:
Let the input feature map be denoted as:

$$X \in \mathbb{R}^{H \times W \times C}$$

Here, $H$ and $W$ denote the height and width of the input feature, respectively, while $C$ signifies the number of input channels.
First, a 1 × 1 convolution is applied to expand the number of input channels from $C$ to $tC$, where $t$ is the expansion factor. This operation increases the dimensionality of the feature space, thereby enhancing the model’s ability to represent complex features:

$$X_1 = \delta\big(\mathrm{BN}(\mathrm{Conv}_{1\times1}(X))\big)$$

Here, $\mathrm{BN}(\cdot)$ denotes the batch normalization operation, and $\delta(\cdot)$ represents the activation function.
Next, a $k \times k$ depthwise convolution (typically $k = 3$ or $5$) is applied independently to each channel. This operation allows the model to effectively capture local spatial information while significantly reducing computational complexity:

$$X_2 = \delta\big(\mathrm{BN}(\mathrm{DWConv}_{k\times k}(X_1))\big)$$

In this context, $\mathrm{DWConv}(\cdot)$ denotes the depth-wise convolution operation.
3. SE (Squeeze-and-Excitation) Module
This module is an important attention mechanism in EfficientNet-B1 for enhancing feature representation [
31], as illustrated in
Figure 4. In conventional convolutional neural networks, each channel is typically assigned equal importance during feature extraction, neglecting differences in the representational capacity of individual channels. The SE module explicitly models inter-channel dependencies and adaptively assigns weights to each channel, thereby achieving dynamic feature recalibration and effectively strengthening the network’s feature representation capability. The SE module consists of two main stages: Squeeze and Excitation. In the Squeeze stage, the input feature map is first compressed into a single global descriptor for each channel via global average pooling (GAP), capturing channel-level holistic information. In the Excitation stage, a bottleneck structure composed of two fully connected layers (FC1 and FC2) performs non-linear feature modeling. The first fully connected layer reduces the number of channels by a ratio r (the channel reduction ratio) to lower computational cost while capturing compact inter-channel dependencies. The second fully connected layer restores the channel dimension to its original size, and a Sigmoid function generates weight coefficients in the range [0, 1], reflecting the relative importance of each channel. Finally, these weights are multiplied with the original feature map channel-wise, enhancing salient features and suppressing redundant information, thereby improving the network’s capacity for effective feature representation and discrimination. The computational process of the SE module can be expressed as:
$$s = \sigma\big(W_2\,\delta(W_1\,\mathrm{GAP}(X_2))\big), \qquad X_3 = s \otimes X_2$$

Here, $\mathrm{GAP}(\cdot)$ denotes the global average pooling operation, $\delta(\cdot)$ represents the ReLU activation function, $\sigma(\cdot)$ is the Sigmoid function, and $\otimes$ indicates the channel-wise weighting operation. In the absence of the attention module, the feature map is represented as $X_3 = X_2$.
In EfficientNet, the SE module is embedded within each MBConv block, enabling the network to dynamically adjust the contribution of different channels during the layer-by-layer feature extraction process. Since the surface features of wheat kernels are often subtle and different categories are highly similar, the SE module helps emphasize key feature channels, such as disease spot textures or mold discoloration, while suppressing background noise and irrelevant channels. This selective feature enhancement contributes to improved overall classification performance.
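A minimal PyTorch sketch of the SE computation described above (the reduction ratio of 4 is an assumption for illustration; EfficientNet sets it per stage):

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Squeeze-and-Excitation: GAP -> FC (reduce by r) -> ReLU ->
    FC (restore) -> Sigmoid -> channel-wise rescaling."""
    def __init__(self, channels: int, reduction: int = 4):
        super().__init__()
        self.fc1 = nn.Linear(channels, channels // reduction)
        self.fc2 = nn.Linear(channels // reduction, channels)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = x.shape
        s = x.mean(dim=(2, 3))                                # Squeeze: GAP
        s = torch.sigmoid(self.fc2(torch.relu(self.fc1(s))))  # Excitation
        return x * s.view(b, c, 1, 1)                         # reweight channels

out = SEBlock(32)(torch.randn(2, 32, 8, 8))
```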
After feature weighting, a 1 × 1 convolution is applied to reduce the number of channels back to the target dimension $C'$. This step effectively controls the model’s parameter size while preserving the integrity of the extracted information, allowing the module to remain lightweight. The computation can be expressed as:

$$X_4 = \mathrm{BN}\big(\mathrm{Conv}_{1\times1}(X_3)\big)$$
Finally, when the input and output feature dimensions are the same (i.e., stride = 1 and $C = C'$), the MBConv module incorporates a residual connection, allowing the input features to be directly added to the output features. This facilitates feature reuse and enables efficient gradient propagation throughout the network:

$$Y = X + X_4$$

If the dimensions do not match, only the output $X_4$ is used:

$$Y = X_4$$
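The five-part MBConv structure can be sketched in PyTorch as follows. The SE submodule is omitted here for brevity, and the class and parameter names are illustrative assumptions:

```python
import torch
import torch.nn as nn

class MBConv(nn.Module):
    """Minimal MBConv sketch: 1x1 expansion -> k x k depthwise convolution
    -> 1x1 projection, with a residual connection when stride == 1 and the
    input/output channel counts match (SE submodule omitted)."""
    def __init__(self, c_in: int, c_out: int, k: int = 3, stride: int = 1,
                 expand: int = 6):
        super().__init__()
        c_mid = c_in * expand
        self.use_res = stride == 1 and c_in == c_out
        self.block = nn.Sequential(
            nn.Conv2d(c_in, c_mid, 1, bias=False),            # expansion
            nn.BatchNorm2d(c_mid), nn.SiLU(),
            nn.Conv2d(c_mid, c_mid, k, stride, k // 2,        # depthwise
                      groups=c_mid, bias=False),
            nn.BatchNorm2d(c_mid), nn.SiLU(),
            nn.Conv2d(c_mid, c_out, 1, bias=False),           # projection
            nn.BatchNorm2d(c_out),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        y = self.block(x)
        return x + y if self.use_res else y

y1 = MBConv(16, 16)(torch.randn(1, 16, 32, 32))            # residual path
y2 = MBConv(16, 24, stride=2)(torch.randn(1, 16, 32, 32))  # downsampling path
```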
2.4. Attention Mechanism
The Convolutional Block Attention Module (CBAM) is a lightweight and plug-and-play attention mechanism, as illustrated in
Figure 5. Its core idea is to sequentially apply attention mechanisms along the channel and spatial dimensions, enabling the network to focus more effectively on important feature regions while suppressing irrelevant or redundant information. CBAM consists of two sequential components: the Channel Attention Module (CAM) and the Spatial Attention Module (SAM). The channel attention module assigns weights across different channels of the feature map, emphasizing channels that are most informative for classification. The spatial attention module then assigns weights across the two-dimensional spatial domain, highlighting more discriminative local regions within the image. Compared to the single SE module, CBAM combines both channel and spatial attention, optimizing feature representation at a finer level and significantly improving the network’s performance in object recognition and classification tasks. Its simple design and low computational overhead make it well-suited for integration into lightweight networks such as EfficientNet-B1. When applied to wheat kernel recognition, CBAM effectively emphasizes texture anomalies and morphological differences, thereby enhancing classification accuracy.
2.4.1. Channel Attention Module
The Channel Attention Module (CAM) is designed to capture the importance of different channels in the input feature map, guiding the network to focus on the most discriminative feature channels. Its structure is illustrated in
Figure 6. The core idea of CAM is to explicitly model global dependencies along the channel dimension and assign adaptive weights to each channel, thereby enhancing the model’s feature representation capability. Let the input feature map be denoted as:
$$F \in \mathbb{R}^{C \times H \times W}$$

First, global average pooling (GAP) and global max pooling (GMP) are applied separately to capture global spatial information, producing two one-dimensional channel descriptors:

$$F^{c}_{avg} = \mathrm{GAP}(F), \qquad F^{c}_{max} = \mathrm{GMP}(F), \qquad F^{c}_{avg},\, F^{c}_{max} \in \mathbb{R}^{C \times 1 \times 1}$$

Subsequently, these two descriptors are fed into a shared-parameter multilayer perceptron (MLP) to extract high-level inter-channel relationships, resulting in:

$$M_c(F) = \sigma\big(W_1\,\delta(W_0\,F^{c}_{avg}) + W_1\,\delta(W_0\,F^{c}_{max})\big)$$

Here, $W_0$ and $W_1$ represent the weight matrices of the two fully connected layers, and $\delta(\cdot)$ denotes the ReLU activation function. $\sigma(\cdot)$ is the Sigmoid activation function, which constrains the output weights to the range [0, 1] and enables adaptive modulation of different channels.
Finally, the obtained attention weights are multiplied channel-wise with the input feature map to produce the weighted feature representation:
$$F' = M_c(F) \otimes F$$

In this context, $\otimes$ denotes the element-wise multiplication operation. Through this process, the model is able to emphasize informative channels while suppressing irrelevant or noisy features, thereby significantly enhancing the discriminative power of the feature representation and improving the overall recognition performance of the network.
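A minimal PyTorch sketch of the channel attention computation described above (the reduction ratio of 16 follows the common CBAM default and is an assumption here):

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """CBAM channel attention: a shared MLP processes the GAP and GMP
    descriptors; their sum passes through a Sigmoid to weight channels."""
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(),
            nn.Linear(channels // reduction, channels),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        avg = self.mlp(x.mean(dim=(2, 3)))   # GAP descriptor
        mx = self.mlp(x.amax(dim=(2, 3)))    # GMP descriptor
        w = torch.sigmoid(avg + mx)          # channel weights in [0, 1]
        return x * w[:, :, None, None]       # channel-wise reweighting

out = ChannelAttention(64)(torch.randn(2, 64, 14, 14))
```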
2.4.2. Spatial Attention Module
The Spatial Attention Module (SAM) is designed to guide the network to focus on where discriminative information is located within the feature map, enhancing responses at key spatial locations. Its structure is illustrated in
Figure 7. Unlike channel attention, spatial attention explicitly models the spatial dependencies of features to emphasize important regional patterns. Let the input be the feature map weighted by the channel attention module:
$$F' \in \mathbb{R}^{C \times H \times W}$$

In practice, the input feature map is first processed using both channel-wise average pooling and max pooling operations to generate spatial descriptors that capture salient features at the spatial level:

$$F^{s}_{avg} = \mathrm{AvgPool}(F'), \qquad F^{s}_{max} = \mathrm{MaxPool}(F'), \qquad F^{s}_{avg},\, F^{s}_{max} \in \mathbb{R}^{1 \times H \times W}$$

These two feature maps represent the average and maximum responses across all channels at each spatial location, jointly reflecting the importance of target regions. The two maps are then concatenated along the channel dimension and passed through a 7 × 7 convolution layer to aggregate spatial context information, thereby generating the spatial attention map:

$$M_s(F') = \sigma\big(f^{7\times7}([F^{s}_{avg};\, F^{s}_{max}])\big)$$

Here, $f^{7\times7}(\cdot)$ denotes the convolution operation, while $\sigma(\cdot)$ represents the Sigmoid function, which normalizes the feature values.
Finally, spatial feature weighting is performed by element-wise multiplication:
$$F'' = M_s(F') \otimes F'$$

Through this mechanism, the model can effectively focus on discriminative spatial regions while suppressing background noise, thereby further enhancing the spatial sensitivity of feature representations and improving recognition performance.
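The spatial attention computation can be sketched in PyTorch as follows, using the 7 × 7 kernel stated above (class and variable names are illustrative):

```python
import torch
import torch.nn as nn

class SpatialAttention(nn.Module):
    """CBAM spatial attention: concatenate channel-wise mean and max maps,
    aggregate spatial context with a 7x7 convolution, then weight positions."""
    def __init__(self, kernel_size: int = 7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        avg = x.mean(dim=1, keepdim=True)    # average response over channels
        mx = x.amax(dim=1, keepdim=True)     # maximum response over channels
        attn = torch.sigmoid(self.conv(torch.cat([avg, mx], dim=1)))
        return x * attn                      # spatial reweighting

out = SpatialAttention()(torch.randn(2, 64, 14, 14))
```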
2.5. Design of Multiple-Binary Classifiers
In the proposed EfficientNet-B1-CBAM model for wheat kernel recognition, the output layer is composed of five independent binary classifiers, corresponding to Fusarium-damaged, insect-damaged, sound, black-embryo, and moldy kernels, respectively.
Each binary classifier receives the feature vector $z$ from the global average pooling layer and maps it through a fully connected layer to obtain a linear output $o_i$:

$$o_i = W_i z + b_i$$
In this context, $W_i$ and $b_i$ represent the weight and bias parameters of the $i$-th classifier, respectively. The linear output is then passed through a Sigmoid function to convert it into a probability:

$$p_i = \sigma(o_i) = \frac{1}{1 + e^{-o_i}}$$
In this context, $p_i$ represents the probability of the sample belonging to the corresponding class. When $p_i \geq 0.5$, the model classifies the sample as belonging to that class; otherwise, it is classified as not belonging. To further improve classification stability, the threshold $\tau$ can be fine-tuned on the validation set using metrics such as the ROC curve or F1 score.
Each classifier is trained with a weighted binary cross-entropy loss:

$$\mathcal{L}_i = -\big[w_i\, y_i \log p_i + (1 - y_i)\log(1 - p_i)\big]$$

In this formula, $y_i$ denotes the ground-truth label, and $w_i$ is an adjustable weight used to mitigate biases arising from class imbalance. By independently optimizing the parameters of each binary classifier, the model can simultaneously learn the defect features of multiple wheat kernel categories while sharing a common feature representation space.
Unlike conventional multi-class classification that employs a Softmax activation to enforce mutual exclusivity among categories, this study adopts a multiple binary classification strategy consisting of five independent Sigmoid-based classification heads. Although the five wheat kernel categories are mutually exclusive at the semantic level, their visual characteristics exhibit strong inter-class similarity and ambiguous boundaries, particularly among damaged and defective seeds. Under such conditions, a Softmax-based classifier may suffer from excessive inter-class competition, which suppresses subtle discriminative cues during training. By contrast, the proposed multi-binary formulation allows each classifier to independently learn class-specific decision boundaries in a one-vs.-rest manner. This design decouples the learning process of different defect categories and enables the network to focus on fine-grained morphological features relevant to each class. During inference, although multiple classifiers may output high confidence scores, the final predicted category is determined by selecting the class with the maximum confidence value. Therefore, the proposed framework remains suitable for mutually exclusive classification tasks. To validate the effectiveness of this design, a comparative experiment with a conventional Softmax-based multi-class classification head was conducted, and the results are discussed in
Section 3.2.
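The one-vs.-rest training and inference logic described above can be sketched in PyTorch. The feature dimension, the `pos_weight` value (reflecting one positive class against four negatives), and all names are illustrative assumptions, not the authors' code:

```python
import torch
import torch.nn as nn

# Five one-vs.-rest heads over a shared pooled feature vector.
feat_dim, num_classes = 1280, 5
heads = nn.Linear(feat_dim, num_classes)    # five independent logits
criterion = nn.BCEWithLogitsLoss(pos_weight=torch.full((num_classes,), 4.0))

z = torch.randn(8, feat_dim)                # pooled features for a batch of 8
labels = torch.randint(0, num_classes, (8,))
targets = torch.nn.functional.one_hot(labels, num_classes).float()

logits = heads(z)
loss = criterion(logits, targets)           # weighted BCE, one-vs.-rest
# Inference: the class with the maximum Sigmoid confidence wins, so the
# scheme stays valid for mutually exclusive categories.
pred = torch.sigmoid(logits).argmax(dim=1)
```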
2.6. Evaluation Indicators
In the wheat kernel recognition task, to comprehensively evaluate the model’s classification performance, this study employs four metrics: Precision, Accuracy, Recall, and F1-score. Their calculations are defined as follows:
$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}, \qquad \mathrm{Precision} = \frac{TP}{TP + FP}$$

$$\mathrm{Recall} = \frac{TP}{TP + FN}, \qquad F1 = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$$

In these formulas, TP (True Positive) denotes the number of positive samples correctly predicted as positive; TN (True Negative) denotes the number of negative samples correctly predicted as negative; FP (False Positive) represents the number of negative samples incorrectly predicted as positive; and FN (False Negative) represents the number of positive samples incorrectly predicted as negative.
Accuracy reflects the overall correctness of the model’s predictions and serves as the most intuitive evaluation metric, but it can be biased under imbalanced class distributions. Precision measures the proportion of correctly predicted positive samples among all samples predicted as positive. Recall evaluates the model’s coverage of actual positive samples, indicating how many true positives are correctly identified. Since Precision and Recall often exhibit a trade-off, the F1-score, defined as their harmonic mean, provides a balanced assessment of the model’s performance across both metrics.
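The four metrics can be computed directly from the confusion-matrix counts; a minimal Python sketch follows (the example counts are illustrative, not results from the paper):

```python
def binary_metrics(tp: int, tn: int, fp: int, fn: int) -> dict:
    """Compute Accuracy, Precision, Recall, and F1-score from counts,
    guarding against division by zero for degenerate cases."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

# Hypothetical counts for one one-vs.-rest classifier head.
m = binary_metrics(tp=90, tn=380, fp=10, fn=20)
```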
2.7. Hardware and Software Preparation
The images in this study were captured using the ultra-wide camera of an iPhone 14 Pro, ensuring high-quality and clear photographs. The experiments were conducted on a laboratory system running Windows 11. The system was equipped with an AMD Ryzen 5 5600H CPU with Radeon Graphics, an AMD Radeon (TM) GPU with 512 MB of VRAM, and 16 GB of RAM (Advanced Micro Devices, Inc. (AMD), Santa Clara, CA, USA). Python 3.11 was used as the programming language, and the deep learning experiments were implemented using the PyTorch 2.4.1 framework.
4. Conclusions
In this study, a novel deep learning framework, EfficientNet-B1-CBAM with a multi-binary classification strategy, was proposed for automated and fine-grained identification of defective wheat kernels. The model integrates the convolutional block attention module (CBAM) to enhance feature representation and employs five independent binary classifiers corresponding to Fusarium-damaged, insect-damaged, sound, black-embryo, and moldy kernels. Extensive experiments, including standard testing, ablation studies, comparative experiments with mainstream CNN models, and 5-fold cross-validation, demonstrated the proposed model’s superior performance. Specifically, the integration of CBAM and multi-binary classification improved classification precision, recall, and F1-score, achieving an average precision of 0.9925 across folds. These results confirm that the proposed approach can accurately and reliably capture subtle morphological differences among defective wheat kernels, providing a strong basis for intelligent wheat quality inspection and grading.
Despite the promising results, several limitations remain. All experiments were conducted on single-kernel images captured under controlled conditions. The model has not yet been evaluated on bulk-grain scenarios, overlapping kernels, or images acquired from conveyor belts, which are common in industrial environments. As a result, the practical applicability of the proposed method in real-world production lines is still unclear. Future work will focus on extending the dataset to include multi-kernel, overlapping, and conveyor belt imagery to better simulate industrial conditions. Additionally, exploring model adaptation techniques, such as domain adaptation or lightweight deployment on edge devices, will be pursued to enhance the method’s robustness and scalability in practical wheat processing systems.