1. Introduction
Wheat is widely recognized as one of the most important staple crops in the world, playing a vital role in ensuring global food security, supporting agricultural economic development, and maintaining social stability [
1]. In China, wheat is not only the primary food source for a large population in the northern regions, but also occupies a central position in food processing, the feed industry, and related agricultural value chains [
2]. Consequently, wheat production and quality are of great strategic importance to national food security and agricultural sustainability [
3]. However, with the growing demand for food quality and safety, focusing solely on yield is no longer sufficient to meet the requirements of modern agriculture [
4]. Wheat quality directly affects its nutritional value, processing performance, and market competitiveness [
5]. As a result, wheat quality inspection has become an essential component of ensuring food safety and promoting refined agricultural production. The adoption of scientific and accurate quality assessment methods enables reliable grading and evaluation of wheat, while also providing a solid basis for storage management, transportation, and industrial utilization, thereby offering significant theoretical and practical value [
6].
At present, methods for wheat kernel identification can be broadly categorized into two groups: physicochemical analysis-based techniques and manual inspection approaches [
7]. Although both categories are capable of evaluating wheat quality to a certain extent, they suffer from notable limitations in terms of efficiency, accuracy, cost, and practical applicability. Physicochemical methods primarily assess wheat quality by measuring physical and chemical properties of the kernels. For example, near-infrared spectroscopy (NIR) has been widely applied to determine protein content and moisture levels in cereals, which can indirectly indicate the presence of mold or insect damage [
8]. In addition, physical separation techniques, such as flotation and sieving, are employed to remove defective kernels based on variations in density or weight. Under controlled laboratory conditions, these methods can achieve high precision and offer clear advantages in detecting potential quality issues, including mycotoxin contamination or excessive moisture content [
9]. However, their practical limitations are evident. The reliance on expensive analytical instruments, such as spectrometers and chromatographic systems, along with associated consumables, substantially increases detection costs. Moreover, the lengthy testing cycles, combined with time-consuming sample preparation and analysis procedures, render these techniques impractical for large-scale and real-time screening applications. In parallel, manual visual inspection and hand sorting remain common approaches for identifying defective wheat kernels [
10]. In this process, trained operators evaluate kernel color, shape, surface defects, and other visual characteristics to distinguish sound kernels from those affected by insect damage, mold growth, black embryo, or scab infection. The primary advantage of manual inspection lies in its minimal equipment requirements and its flexibility in handling diverse defect types, which makes it suitable for small-scale production or specialized batch inspections. Nevertheless, manual methods are inherently characterized by low efficiency, strong subjectivity, and a high dependence on skilled labor, all of which lead to substantially increased operational costs [
11]. As labor expenses continue to rise, the economic feasibility and long-term sustainability of manual inspection are steadily diminishing.
Traditional machine learning techniques, such as support vector machines (SVMs) and random forests, rely heavily on handcrafted features for wheat kernel classification [
12]. This dependence limits their ability to effectively capture the subtle morphological and textural variations present in wheat kernels, thereby constraining recognition accuracy. These approaches require domain expertise to design and select appropriate features; however, the characteristics of wheat kernels affected by developmental defects or damage often exhibit pronounced heterogeneity. As a result, manually engineered features struggle to adequately represent such complex variations [
13], making these models prone to misclassification when distinguishing defective kernels from normal samples.
In recent years, deep learning-based image classification has achieved remarkable success across a wide range of application domains. Notably, in the agricultural sector, the adoption of deep learning techniques has demonstrated substantial potential and clear advantages [
14]. For example, established deep learning models such as GoogLeNet [
15], VGG [
16], and ResNet [
17] have been effectively applied to crop seed variety classification, consistently exhibiting strong classification performance [
18]. By training on large-scale seed image datasets, deep learning models are able to automatically learn discriminative representations that capture subtle inter-class differences, thereby enabling efficient and accurate classification [
19]. These advances have not only improved the level of automation in agricultural production, but also provided robust technical support for seed quality control and varietal improvement.
The characteristics of defective wheat kernels are not readily discernible, resulting in a high degree of overall similarity among them. Consequently, their classification and recognition can be regarded as a fine-grained image classification challenge. Wang et al. [
20] proposed the FFDNet-VGG terahertz image enhancement algorithm to address the problem of low image quality in unsound wheat kernel imaging, and further evaluated its effectiveness by employing a CNN-based classification network for unsound kernel recognition based on the enhanced images, thereby achieving improved classification accuracy. Gao et al. [
21] modified the ResNet architecture by integrating attention mechanisms and proposed the Res24_D_CBAM model, which achieved an identification accuracy of 94% for six types of wheat kernels. Shen et al. [
22] incorporated an efficient channel attention (ECA) module into the Mask R-CNN framework and optimized both the feature pyramid network (FPN) and region proposal network (RPN), resulting in an average classification accuracy of 86% across six wheat kernel categories. Zhao et al. [
23] enhanced the YOLOv5 model by pruning the backbone network and introducing a channel-mixing attention module in the neck, leading to an average recognition accuracy of 99.2% for six wheat kernel types. More recently, Zhang et al. [
24] proposed a lightweight network termed GFRNet by integrating low-rank multi-scale convolutional operators with inverted residual structures to address the high computational cost of self-attention mechanisms and large convolution kernels in wheat quality inspection. The enhanced model achieved an average accuracy of 83.89%, a Top-1 accuracy of 84.90%, and a Macro-F1 score of 83.20% on the GrainSpace dataset, outperforming ConvNeXt-T, InceptionNeXt-T, FasterNet, ResNet50, and Swin-T by approximately 2–5 percentage points, while reducing the parameters and FLOPs to 15.915 M and 2.28 G, respectively. In 2023, Zhang et al. [
25] designed an improved YOLOv5 model by integrating an efficient channel attention (ECA) mechanism into the baseline network, achieving an average detection accuracy of 96.24% for defective wheat kernels, which represented improvements of 10% over CBAM-YOLOv5 and SENet-YOLOv5, and 13% over the original YOLOv5 model. In 2025, Zhang et al. [
26] further proposed a multi-scale parallel convolutional network (MSPCNeXt) to overcome the limitations of single-scale CNNs in detecting defective kernels during storage. By parallelizing convolution kernels of different scales and optimizing MSPCNeXt-v1 and MSPCNeXt-v2 blocks through large-kernel decomposition and feature splitting, the receptive field was effectively enlarged while computational complexity was reduced. Experimental results on the GrainSpace dataset showed that, compared with ConvNeXt, MSPCNeXt-v1 improved the average accuracy and Top-1 accuracy by 2.531% and 2.286%, respectively, while MSPCNeXt-v2 achieved improvements of 2.224% and 1.857%, with reductions of 0.456 G FLOPs and 5.986 M parameters. Earlier, Zhang et al. [
27] integrated attention mechanisms into residual networks of varying depths, and the optimized attention-enhanced ResNet-50 model achieved an average accuracy of 97.5% in identifying defective wheat kernels.
Based on the above review, although attention-based deep learning models have achieved promising results in defective wheat kernel recognition, several challenges remain. Most existing methods adopt conventional Softmax-based multiclass classifiers, which enforce strong inter-class competition and may limit discrimination when defect categories exhibit high visual similarity and ambiguous boundaries. In fine-grained wheat kernel recognition tasks, this limitation can suppress subtle but critical defect-related features.
To address this issue, this study proposes an EfficientNet-B1-CBAM-based recognition framework with a multi-binary classification strategy. EfficientNet-B1 is employed as the backbone for efficient feature extraction, while the Convolutional Block Attention Module (CBAM) enhances both channel-wise and spatial feature representations. In addition, the traditional single-head multiclass classifier is replaced with five independent Sigmoid-based binary classifiers, allowing each category to learn class-specific decision boundaries with reduced inter-class interference. Extensive experiments, including comparative analysis, ablation studies, and cross-validation, are conducted to validate the effectiveness and robustness of the proposed method.
2. Materials and Methods
2.1. Data Collection and Dataset Development
The defective wheat kernel dataset used in this study was collected from local farmers’ households (Heze, China). Since wheat from different cultivars is commonly stored together after harvesting, the dataset comprises kernels from multiple varieties. Nevertheless, the defective kernels exhibit consistent visual characteristics across varieties, enabling unified analysis and classification. The dataset includes five categories of wheat kernels: sound kernels, insect-damaged kernels, Fusarium-damaged kernels, moldy kernels, and kernels with black embryos. Representative examples of each category are shown in
Figure 1. These kernel types are critical factors affecting wheat quality and commercial grading. During data acquisition, wheat kernel images were captured using a high-resolution camera under standardized lighting conditions to ensure consistent image quality and to minimize the influence of background noise. For each wheat kernel category, a total of 300 images were collected, resulting in an overall dataset of 1500 images. Within each category, 200 images were allocated to the training set, while the remaining 100 images were used for testing. The dataset used in this study has been made publicly available at Zenodo and can be accessed at:
https://doi.org/10.5281/zenodo.18222641, accessed on 17 December 2025.
As all images were captured under identical acquisition conditions, two data augmentation techniques—Gaussian blurring and Poisson noise injection—were applied to enhance data diversity and improve model generalization. Through this augmentation process, the number of images for each wheat kernel category was expanded from the original 200 images to 600 images, resulting in a total augmented dataset of 3000 images. For each wheat kernel category, the augmented dataset was further divided into training and validation sets at a ratio of 9:1. Specifically, 540 images per category were used for network training, while the remaining 60 images constituted the validation set. The training set was employed for model learning, whereas the validation set was used for hyperparameter tuning and optimal model selection. The independent test set was retained for the final performance evaluation of the proposed model.
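As an illustration of the two augmentation operations, the following NumPy sketch applies Gaussian blurring and Poisson noise injection to an image array. The function names, sigma value, and image sizes are illustrative assumptions, not the authors' actual pipeline:

```python
import numpy as np

def gaussian_blur(img: np.ndarray, sigma: float = 1.0) -> np.ndarray:
    """Separable Gaussian blur applied to an (H, W, C) uint8 image."""
    radius = max(1, int(3 * sigma))
    x = np.arange(-radius, radius + 1)
    kernel = np.exp(-(x ** 2) / (2 * sigma ** 2))
    kernel /= kernel.sum()
    out = img.astype(np.float64)
    # Convolve along height, then width (the Gaussian kernel is separable).
    out = np.apply_along_axis(lambda m: np.convolve(m, kernel, mode="same"), 0, out)
    out = np.apply_along_axis(lambda m: np.convolve(m, kernel, mode="same"), 1, out)
    return np.clip(out, 0, 255).astype(np.uint8)

def poisson_noise(img: np.ndarray, rng: np.random.Generator) -> np.ndarray:
    """Inject Poisson (shot) noise by resampling each pixel from a Poisson law."""
    noisy = rng.poisson(img.astype(np.float64))
    return np.clip(noisy, 0, 255).astype(np.uint8)

rng = np.random.default_rng(0)
img = rng.integers(0, 256, size=(64, 64, 3), dtype=np.uint8)
# Each original image yields two augmented variants (200 -> 600 per class).
augmented = [img, gaussian_blur(img, sigma=1.2), poisson_noise(img, rng)]
```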
To mitigate the risk of overfitting caused by the limited dataset size and homogeneous acquisition conditions, a five-fold cross-validation strategy was adopted in this study. The dataset was randomly partitioned into five subsets, where in each fold, four subsets were used for training and the remaining subset was used for validation. No data augmentation was applied during the cross-validation process to strictly evaluate the generalization performance of the proposed model on raw images.
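The fold partitioning described above can be sketched at the index level in plain Python; the function name and seed are illustrative:

```python
import random

def five_fold_splits(n_samples: int, n_folds: int = 5, seed: int = 42):
    """Randomly partition sample indices into n_folds disjoint subsets and
    yield (train_idx, val_idx) pairs, one per fold."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    folds = [idx[k::n_folds] for k in range(n_folds)]
    for k in range(n_folds):
        val = folds[k]
        # The remaining four subsets form the training portion of this fold.
        train = [i for j, f in enumerate(folds) if j != k for i in f]
        yield train, val

# 1500 raw images: each fold validates on 300 and trains on 1200.
splits = list(five_fold_splits(1500))
```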
2.2. EfficientNet-B1-CBAM Network Model
In this study, an EfficientNet-B1-based network [
28] is developed by integrating the Convolutional Block Attention Module (CBAM), an attention mechanism originally proposed by Woo et al. [
29], resulting in the EfficientNet-B1-CBAM model for high-accuracy wheat kernel identification. The overall architecture of the proposed model is illustrated in
Figure 2. It consists of a feature extraction layer, an attention-enhancement layer, a feature aggregation layer, and a multi-level binary classification output layer.
First, the input images are fed into the EfficientNet-B1 backbone for feature extraction. This network employs a compound scaling strategy to jointly balance depth, width, and resolution, and utilizes lightweight MBConv blocks to achieve efficient feature extraction and information compression. Subsequently, the Convolutional Block Attention Module (CBAM), which consists of channel attention and spatial attention submodules, is integrated to enhance feature representation. The channel attention mechanism highlights informative feature channels through adaptive weighting, while the spatial attention mechanism emphasizes salient regions via spatial attention maps. Together, these mechanisms enable the model to focus more effectively on subtle surface variations and pathological characteristics of wheat kernels. After global average pooling, the extracted global features are fed into five independent binary classifiers in a one-vs.-rest manner, corresponding to Fusarium-damaged, insect-damaged, sound, black-embryo, and moldy wheat kernels, respectively. This multi-binary classification design allows each category to learn class-specific decision boundaries independently, thereby reducing feature interference among visually similar classes and improving overall recognition performance.
2.3. EfficientNet-B1 Network Model
EfficientNet is a family of efficient convolutional neural networks proposed by Google, whose core idea lies in the compound scaling strategy that uniformly scales network depth, width, and input resolution. By jointly balancing these three dimensions, EfficientNet achieves superior performance with significantly reduced computational cost. In contrast to conventional architectures that scale the network along a single dimension, compound scaling enables EfficientNet to enhance feature representation capability while maintaining a well-controlled model complexity.
EfficientNet-B0 serves as the baseline architecture of the EfficientNet family, from which multiple variants, ranging from EfficientNet-B1 to EfficientNet-B7, are derived through parameter scaling.
Table 1 presents a detailed description of the EfficientNet-B1 architecture, including the network stages, module types, input resolutions, kernel sizes, input tensor dimensions, output channel numbers, and the number of layers in each stage.
The entire network is composed of a series of stacked MBConv blocks, with MBConv modules at different stages varying in terms of channel width, kernel size, and stride. This design enables a progressive feature extraction process that transitions from shallow to deep representations and from low-dimensional to high-dimensional features. The network begins with a standard convolutional layer responsible for extracting low-level features from the input images. The intermediate stages consist of multiple groups of MBConv blocks, where each group reduces the spatial resolution to varying degrees while gradually increasing the number of channels to enhance feature representation capability. Finally, global average pooling followed by a fully connected classification head is employed to generate the final predictions. Compared with traditional convolutional neural networks, EfficientNet-B1 offers a notable advantage in achieving high accuracy while maintaining relatively low computational complexity. On large-scale datasets such as ImageNet, EfficientNet-B1 has been shown to achieve accuracy comparable to, or even exceeding, that of larger models such as ResNet and Inception. Owing to these characteristics, EfficientNet-B1 is particularly well suited for agricultural applications involving limited data and computational resources, such as wheat kernel classification and inspection tasks, where both accuracy and efficiency are critical.
The mobile inverted bottleneck convolution (MBConv) module is the core building block of the EfficientNet family, and its structure is illustrated in
Figure 3. This module was originally introduced in MobileNetV2 [
30], and is characterized by an inverted residual structure. By establishing efficient shortcut connections between the input and output, the inverted residual design enables effective utilization of both low-dimensional and high-dimensional feature spaces, thereby improving the efficiency of feature extraction.
The MBConv module consists of five main components, enabling the network to maintain a lightweight structure while preserving strong feature representation capability. The basic architecture of the MBConv module can be divided into the following five key parts:
Let the input feature map be denoted as:

$$X \in \mathbb{R}^{H \times W \times C}$$

Here, $H$ and $W$ denote the height and width of the input feature, respectively, while $C$ signifies the number of input channels.
First, a 1 × 1 convolution is applied to expand the number of input channels from $C$ to $tC$, where $t$ is the expansion factor. This operation increases the dimensionality of the feature space, thereby enhancing the model’s ability to represent complex features:

$$X_1 = \delta\big(\mathrm{BN}(\mathrm{Conv}_{1\times1}(X))\big)$$

Here, $\mathrm{BN}(\cdot)$ denotes the batch normalization operation, and $\delta(\cdot)$ represents the activation function.
Next, a $k \times k$ depthwise convolution (typically $k = 3$ or $5$) is applied independently to each channel. This operation allows the model to effectively capture local spatial information while significantly reducing computational complexity:

$$X_2 = \delta\big(\mathrm{BN}(\mathrm{DWConv}_{k\times k}(X_1))\big)$$

In this context, $\mathrm{DWConv}(\cdot)$ denotes the depth-wise convolution operation.
3. SE (Squeeze-and-Excitation) Module
This module is an important attention mechanism in EfficientNet-B1 for enhancing feature representation [
31], as illustrated in
Figure 4. In conventional convolutional neural networks, each channel is typically assigned equal importance during feature extraction, neglecting differences in the representational capacity of individual channels. The SE module explicitly models inter-channel dependencies and adaptively assigns weights to each channel, thereby achieving dynamic feature recalibration and effectively strengthening the network’s feature representation capability. The SE module consists of two main stages: Squeeze and Excitation. In the Squeeze stage, the input feature map is first compressed into a single global descriptor for each channel via global average pooling (GAP), capturing channel-level holistic information. In the Excitation stage, a bottleneck structure composed of two fully connected layers (FC1 and FC2) performs non-linear feature modeling. The first fully connected layer reduces the number of channels by a ratio r (the channel reduction ratio) to lower computational cost while capturing compact inter-channel dependencies. The second fully connected layer restores the channel dimension to its original size, and a Sigmoid function generates weight coefficients in the range [0, 1], reflecting the relative importance of each channel. Finally, these weights are multiplied with the original feature map channel-wise, enhancing salient features and suppressing redundant information, thereby improving the network’s capacity for effective feature representation and discrimination. The computational process of the SE module can be expressed as:
$$s = \sigma\big(W_2\,\delta(W_1\,\mathrm{GAP}(X_2))\big), \qquad X_3 = s \otimes X_2$$

Here, $\mathrm{GAP}(\cdot)$ denotes the global average pooling operation, $\delta(\cdot)$ represents the ReLU activation function, $\sigma(\cdot)$ is the Sigmoid function, and $\otimes$ indicates the channel-wise weighting operation. In the absence of the attention module, the feature map is represented as $X_3 = X_2$.
In EfficientNet, the SE module is embedded within each MBConv block, enabling the network to dynamically adjust the contribution of different channels during the layer-by-layer feature extraction process. Since the surface features of wheat kernels are often subtle and different categories are highly similar, the SE module helps emphasize key feature channels, such as disease spot textures or mold discoloration, while suppressing background noise and irrelevant channels. This selective feature enhancement contributes to improved overall classification performance.
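A minimal PyTorch sketch of the SE computation described above (the reduction ratio of 4 is an assumption for illustration; EfficientNet sets it per stage):

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Squeeze-and-Excitation: GAP -> FC (reduce by r) -> ReLU ->
    FC (restore) -> Sigmoid -> channel-wise rescaling."""
    def __init__(self, channels: int, reduction: int = 4):
        super().__init__()
        self.fc1 = nn.Linear(channels, channels // reduction)
        self.fc2 = nn.Linear(channels // reduction, channels)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = x.shape
        s = x.mean(dim=(2, 3))                                # Squeeze: GAP
        s = torch.sigmoid(self.fc2(torch.relu(self.fc1(s))))  # Excitation
        return x * s.view(b, c, 1, 1)                         # reweight channels

out = SEBlock(32)(torch.randn(2, 32, 8, 8))
```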
After feature weighting, a 1 × 1 convolution is applied to reduce the number of channels back to the target dimension $C'$. This step effectively controls the model’s parameter size while preserving the integrity of the extracted information, allowing the module to remain lightweight. The computation can be expressed as:

$$X_4 = \mathrm{BN}\big(\mathrm{Conv}_{1\times1}(X_3)\big)$$
Finally, when the input and output feature dimensions are the same (i.e., stride = 1 and $C = C'$), the MBConv module incorporates a residual connection, allowing the input features to be directly added to the output features. This facilitates feature reuse and enables efficient gradient propagation throughout the network:

$$Y = X + X_4$$

If the dimensions do not match, only the output $X_4$ is used:

$$Y = X_4$$
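The five-part MBConv structure can be sketched in PyTorch as follows. The SE submodule is omitted here for brevity, and the class and parameter names are illustrative assumptions:

```python
import torch
import torch.nn as nn

class MBConv(nn.Module):
    """Minimal MBConv sketch: 1x1 expansion -> k x k depthwise convolution
    -> 1x1 projection, with a residual connection when stride == 1 and the
    input/output channel counts match (SE submodule omitted)."""
    def __init__(self, c_in: int, c_out: int, k: int = 3, stride: int = 1,
                 expand: int = 6):
        super().__init__()
        c_mid = c_in * expand
        self.use_res = stride == 1 and c_in == c_out
        self.block = nn.Sequential(
            nn.Conv2d(c_in, c_mid, 1, bias=False),            # expansion
            nn.BatchNorm2d(c_mid), nn.SiLU(),
            nn.Conv2d(c_mid, c_mid, k, stride, k // 2,        # depthwise
                      groups=c_mid, bias=False),
            nn.BatchNorm2d(c_mid), nn.SiLU(),
            nn.Conv2d(c_mid, c_out, 1, bias=False),           # projection
            nn.BatchNorm2d(c_out),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        y = self.block(x)
        return x + y if self.use_res else y

y1 = MBConv(16, 16)(torch.randn(1, 16, 32, 32))            # residual path
y2 = MBConv(16, 24, stride=2)(torch.randn(1, 16, 32, 32))  # downsampling path
```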
2.4. Attention Mechanism
The Convolutional Block Attention Module (CBAM) is a lightweight and plug-and-play attention mechanism, as illustrated in
Figure 5. Its core idea is to sequentially apply attention mechanisms along the channel and spatial dimensions, enabling the network to focus more effectively on important feature regions while suppressing irrelevant or redundant information. CBAM consists of two sequential components: the Channel Attention Module (CAM) and the Spatial Attention Module (SAM). The channel attention module assigns weights across different channels of the feature map, emphasizing channels that are most informative for classification. The spatial attention module then assigns weights across the two-dimensional spatial domain, highlighting more discriminative local regions within the image. Compared to the single SE module, CBAM combines both channel and spatial attention, optimizing feature representation at a finer level and significantly improving the network’s performance in object recognition and classification tasks. Its simple design and low computational overhead make it well-suited for integration into lightweight networks such as EfficientNet-B1. When applied to wheat kernel recognition, CBAM effectively emphasizes texture anomalies and morphological differences, thereby enhancing classification accuracy.
2.4.1. Channel Attention Module
The Channel Attention Module (CAM) is designed to capture the importance of different channels in the input feature map, guiding the network to focus on the most discriminative feature channels. Its structure is illustrated in
Figure 6. The core idea of CAM is to explicitly model global dependencies along the channel dimension and assign adaptive weights to each channel, thereby enhancing the model’s feature representation capability. Let the input feature map be denoted as:
$$F \in \mathbb{R}^{C \times H \times W}$$

First, global average pooling (GAP) and global max pooling (GMP) are applied separately to capture global spatial information, producing two one-dimensional channel descriptors:

$$F^{c}_{avg} = \mathrm{GAP}(F), \qquad F^{c}_{max} = \mathrm{GMP}(F), \qquad F^{c}_{avg},\, F^{c}_{max} \in \mathbb{R}^{C \times 1 \times 1}$$

Subsequently, these two descriptors are fed into a shared-parameter multilayer perceptron (MLP) to extract high-level inter-channel relationships, resulting in:

$$M_c(F) = \sigma\big(W_1\,\delta(W_0\,F^{c}_{avg}) + W_1\,\delta(W_0\,F^{c}_{max})\big)$$

Here, $W_0$ and $W_1$ represent the weight matrices of the two fully connected layers, and $\delta(\cdot)$ denotes the ReLU activation function. $\sigma(\cdot)$ is the Sigmoid activation function, which constrains the output weights to the range [0, 1] and enables adaptive modulation of different channels.
Finally, the obtained attention weights are multiplied channel-wise with the input feature map to produce the weighted feature representation:
$$F' = M_c(F) \otimes F$$

In this context, $\otimes$ denotes the element-wise multiplication operation. Through this process, the model is able to emphasize informative channels while suppressing irrelevant or noisy features, thereby significantly enhancing the discriminative power of the feature representation and improving the overall recognition performance of the network.
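A minimal PyTorch sketch of the channel attention computation described above (the reduction ratio of 16 follows the common CBAM default and is an assumption here):

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """CBAM channel attention: a shared MLP processes the GAP and GMP
    descriptors; their sum passes through a Sigmoid to weight channels."""
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(),
            nn.Linear(channels // reduction, channels),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        avg = self.mlp(x.mean(dim=(2, 3)))   # GAP descriptor
        mx = self.mlp(x.amax(dim=(2, 3)))    # GMP descriptor
        w = torch.sigmoid(avg + mx)          # channel weights in [0, 1]
        return x * w[:, :, None, None]       # channel-wise reweighting

out = ChannelAttention(64)(torch.randn(2, 64, 14, 14))
```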
2.4.2. Spatial Attention Module
The Spatial Attention Module (SAM) is designed to guide the network to focus on where discriminative information is located within the feature map, enhancing responses at key spatial locations. Its structure is illustrated in
Figure 7. Unlike channel attention, spatial attention explicitly models the spatial dependencies of features to emphasize important regional patterns. Let the input be the feature map weighted by the channel attention module:
$$F' \in \mathbb{R}^{C \times H \times W}$$

In practice, the input feature map is first processed using both channel-wise average pooling and max pooling operations to generate spatial descriptors that capture salient features at the spatial level:

$$F^{s}_{avg} = \mathrm{AvgPool}(F'), \qquad F^{s}_{max} = \mathrm{MaxPool}(F'), \qquad F^{s}_{avg},\, F^{s}_{max} \in \mathbb{R}^{1 \times H \times W}$$

These two feature maps represent the average and maximum responses across all channels at each spatial location, jointly reflecting the importance of target regions. The two maps are then concatenated along the channel dimension and passed through a 7 × 7 convolution layer to aggregate spatial context information, thereby generating the spatial attention map:

$$M_s(F') = \sigma\big(f^{7\times7}([F^{s}_{avg};\, F^{s}_{max}])\big)$$

Here, $f^{7\times7}(\cdot)$ denotes the convolution operation, while $\sigma(\cdot)$ represents the Sigmoid function, which normalizes the feature values.
Finally, spatial feature weighting is performed by element-wise multiplication:
$$F'' = M_s(F') \otimes F'$$

Through this mechanism, the model can effectively focus on discriminative spatial regions while suppressing background noise, thereby further enhancing the spatial sensitivity of feature representations and improving recognition performance.
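The spatial attention computation can be sketched in PyTorch as follows, using the 7 × 7 kernel stated above (class and variable names are illustrative):

```python
import torch
import torch.nn as nn

class SpatialAttention(nn.Module):
    """CBAM spatial attention: concatenate channel-wise mean and max maps,
    aggregate spatial context with a 7x7 convolution, then weight positions."""
    def __init__(self, kernel_size: int = 7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        avg = x.mean(dim=1, keepdim=True)    # average response over channels
        mx = x.amax(dim=1, keepdim=True)     # maximum response over channels
        attn = torch.sigmoid(self.conv(torch.cat([avg, mx], dim=1)))
        return x * attn                      # spatial reweighting

out = SpatialAttention()(torch.randn(2, 64, 14, 14))
```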
2.5. Design of Multiple-Binary Classifiers
In the proposed EfficientNet-B1-CBAM model for wheat kernel recognition, the output layer is composed of five independent binary classifiers, corresponding to Fusarium-damaged, insect-damaged, sound, black-embryo, and moldy kernels, respectively.
Each binary classifier receives the feature vector $z$ from the global average pooling layer and maps it through a fully connected layer to obtain a linear output $o_i$:

$$o_i = W_i z + b_i$$
In this context, $W_i$ and $b_i$ represent the weight and bias parameters of the $i$-th classifier, respectively. The linear output is then passed through a Sigmoid function to convert it into a probability:

$$p_i = \sigma(o_i) = \frac{1}{1 + e^{-o_i}}$$
In this context, $p_i$ represents the probability of the sample belonging to the corresponding class. When $p_i \geq 0.5$, the model classifies the sample as belonging to that class; otherwise, it is classified as not belonging. To further improve classification stability, the threshold $\tau$ can be fine-tuned on the validation set using metrics such as the ROC curve or F1 score.
Each classifier is trained with a weighted binary cross-entropy loss:

$$\mathcal{L}_i = -\big[w_i\, y_i \log p_i + (1 - y_i)\log(1 - p_i)\big]$$

In this formula, $y_i$ denotes the ground-truth label, and $w_i$ is an adjustable weight used to mitigate biases arising from class imbalance. By independently optimizing the parameters of each binary classifier, the model can simultaneously learn the defect features of multiple wheat kernel categories while sharing a common feature representation space.
Unlike conventional multi-class classification that employs a Softmax activation to enforce mutual exclusivity among categories, this study adopts a multiple binary classification strategy consisting of five independent Sigmoid-based classification heads. Although the five wheat kernel categories are mutually exclusive at the semantic level, their visual characteristics exhibit strong inter-class similarity and ambiguous boundaries, particularly among damaged and defective seeds. Under such conditions, a Softmax-based classifier may suffer from excessive inter-class competition, which suppresses subtle discriminative cues during training. By contrast, the proposed multi-binary formulation allows each classifier to independently learn class-specific decision boundaries in a one-vs.-rest manner. This design decouples the learning process of different defect categories and enables the network to focus on fine-grained morphological features relevant to each class. During inference, although multiple classifiers may output high confidence scores, the final predicted category is determined by selecting the class with the maximum confidence value. Therefore, the proposed framework remains suitable for mutually exclusive classification tasks. To validate the effectiveness of this design, a comparative experiment with a conventional Softmax-based multi-class classification head was conducted, and the results are discussed in
Section 3.2.
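The one-vs.-rest training and inference logic described above can be sketched in PyTorch. The feature dimension, the `pos_weight` value (reflecting one positive class against four negatives), and all names are illustrative assumptions, not the authors' code:

```python
import torch
import torch.nn as nn

# Five one-vs.-rest heads over a shared pooled feature vector.
feat_dim, num_classes = 1280, 5
heads = nn.Linear(feat_dim, num_classes)    # five independent logits
criterion = nn.BCEWithLogitsLoss(pos_weight=torch.full((num_classes,), 4.0))

z = torch.randn(8, feat_dim)                # pooled features for a batch of 8
labels = torch.randint(0, num_classes, (8,))
targets = torch.nn.functional.one_hot(labels, num_classes).float()

logits = heads(z)
loss = criterion(logits, targets)           # weighted BCE, one-vs.-rest
# Inference: the class with the maximum Sigmoid confidence wins, so the
# scheme stays valid for mutually exclusive categories.
pred = torch.sigmoid(logits).argmax(dim=1)
```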
2.6. Evaluation Indicators
In the wheat kernel recognition task, to comprehensively evaluate the model’s classification performance, this study employs four metrics: Precision, Accuracy, Recall, and F1-score. Their calculations are defined as follows:
$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}, \qquad \mathrm{Precision} = \frac{TP}{TP + FP}$$

$$\mathrm{Recall} = \frac{TP}{TP + FN}, \qquad F1 = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$$

In these formulas, TP (True Positive) denotes the number of positive samples correctly predicted as positive; TN (True Negative) denotes the number of negative samples correctly predicted as negative; FP (False Positive) represents the number of negative samples incorrectly predicted as positive; and FN (False Negative) represents the number of positive samples incorrectly predicted as negative.
Accuracy reflects the overall correctness of the model’s predictions and serves as the most intuitive evaluation metric, but it can be biased under imbalanced class distributions. Precision measures the proportion of correctly predicted positive samples among all samples predicted as positive. Recall evaluates the model’s coverage of actual positive samples, indicating how many true positives are correctly identified. Since Precision and Recall often exhibit a trade-off, the F1-score, defined as their harmonic mean, provides a balanced assessment of the model’s performance across both metrics.
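The four metrics can be computed directly from the confusion-matrix counts; a minimal Python sketch follows (the example counts are illustrative, not results from the paper):

```python
def binary_metrics(tp: int, tn: int, fp: int, fn: int) -> dict:
    """Compute Accuracy, Precision, Recall, and F1-score from counts,
    guarding against division by zero for degenerate cases."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

# Hypothetical counts for one one-vs.-rest classifier head.
m = binary_metrics(tp=90, tn=380, fp=10, fn=20)
```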
2.7. Hardware and Software Preparation
The images in this study were captured using the ultra-wide camera of an iPhone 14 Pro, ensuring high-quality and clear photographs. The experiments were conducted on a laboratory system running Windows 11. The system was equipped with an AMD Ryzen 5 5600H CPU with Radeon Graphics, an AMD Radeon (TM) GPU with 512 MB of VRAM, and 16 GB of RAM (Advanced Micro Devices, Inc. (AMD), Santa Clara, CA, USA). Python 3.11 was used as the programming language, and the deep learning experiments were implemented using the PyTorch 2.4.1 framework.
4. Conclusions
In this study, a novel deep learning framework, EfficientNet-B1-CBAM with a multi-binary classification strategy, was proposed for automated and fine-grained identification of defective wheat kernels. The model integrates the convolutional block attention module (CBAM) to enhance feature representation and employs five independent binary classifiers corresponding to Fusarium-damaged, insect-damaged, sound, black-embryo, and moldy kernels. Extensive experiments, including standard testing, ablation studies, comparative experiments with mainstream CNN models, and 5-fold cross-validation, demonstrated the proposed model’s superior performance. Specifically, the integration of CBAM and multi-binary classification improved classification precision, recall, and F1-score, achieving an average precision of 0.9925 across folds. These results confirm that the proposed approach can accurately and reliably capture subtle morphological differences among defective wheat kernels, providing a strong basis for intelligent wheat quality inspection and grading.
Despite the promising results, several limitations remain. All experiments were conducted on single-kernel images captured under controlled conditions. The model has not yet been evaluated on bulk-grain scenarios, overlapping kernels, or images acquired from conveyor belts, which are common in industrial environments. As a result, the practical applicability of the proposed method in real-world production lines is still unclear. Future work will focus on extending the dataset to include multi-kernel, overlapping, and conveyor belt imagery to better simulate industrial conditions. Additionally, exploring model adaptation techniques, such as domain adaptation or lightweight deployment on edge devices, will be pursued to enhance the method’s robustness and scalability in practical wheat processing systems.