Agriculture
  • Article
  • Open Access

3 March 2026

Segmentation Is Not the Purpose: A Wheat Impurity Regression Network Integrating Semantic Segmentation

1 College of Engineering, China Agricultural University, Beijing 100083, China
2 School of Computer and Artificial Intelligence, Beijing Technology and Business University, Beijing 100048, China
* Author to whom correspondence should be addressed.

Abstract

Real-time, accurate acquisition of the wheat impurity rate is a key technology for realizing intelligent cleaning operations and directly influences the quality of the wheat harvest. This study proposes a novel impurity rate regression network named Segmentation Is Not the Purpose (SNTP). SNTP integrates a semantic segmentation network and an impurity rate regression network into a single neural architecture, and its segmentation branch replaces the DeepLabV3+ backbone with MobileNetV4. Furthermore, a Transformer block is introduced into the regression branch to enable global feature extraction, and a Generalized Categorical Regression head based on Distribution Focal Loss is designed to improve regression accuracy. The SNTP model ultimately achieves an MIoU of 77.7%, an MPA of 83.3%, an MAE of 0.045, and an MSE of 0.005 on the validation set, with only 9.51M parameters and 17.98 GMACs of computation, solving the overfitting problem in impurity rate regression networks while achieving high regression accuracy. SNTP is easy to optimize and requires no additional prior knowledge, and its performance is unaffected by camera mounting height. This makes it exceptionally versatile for deployment and enables real-time impurity rate detection, a key technology for intelligent cleaning.

1. Introduction

Wheat (Triticum aestivum L.) is one of the three major food crops in the world. According to statistics from the Food and Agriculture Organization (FAO) of the United Nations, annual wheat output reached 798 million tonnes in 2023. Ensuring the safety of wheat production operations is thus of great significance to global food security [1]. In the current harvesting process, threshing inevitably mixes wheat with impurities, so cleaning operations are required to ensure wheat quality. After cleaning, the probability of mildew in wheat products is significantly reduced, which means higher economic value [2]. Introducing intelligent control into cleaning operations is an effective way to improve cleaning quality. The key to this approach lies in accurate and timely monitoring of cleaning quality during operation, namely the real-time impurity rate and breakage rate. For this reason, it is necessary to develop an online monitoring system for wheat cleaning quality.
Currently, cleaning operations mainly adopt air-screen cleaning, which separates grains from impurities based on their different concentrations and physicomechanical properties [3]. Traditional methods for detecting cleaning quality are therefore naturally based on physical principles. Specifically, since grain flows with different compositions generate different impacts in the cleaning box, piezoelectric sensors are installed in the cleaning box to collect impact signals, thereby obtaining the grain breakage rate [4,5,6]. The grain impurity rate is mainly obtained through visual sensors [7]: industrial cameras capture images of grains with impurities, which are then processed to obtain the impurity rate. Chen et al. (2020) [8] employed Niblack's local threshold algorithm to segment the preprocessed images and extract features, and then used a decision tree algorithm to classify the features for impurity rate estimation. Chen et al. (2024) [9] performed binarization, hole filling, and denoising on preprocessed images of rapeseed with impurities to segment impurities and grains, then calculated the impurity rate from the pixel counts. These methods, based on traditional image segmentation and machine learning algorithms, have the advantages of low computational cost and a solid theoretical foundation, but they exhibit low accuracy and limited universality.
In recent years, with the development of deep learning, computer vision technology has also achieved considerable progress. An increasing number of studies have introduced deep learning to process images of grains with impurities, thereby obtaining the impurity rate [10]. These studies share a consistent approach: first, object detection or image segmentation networks are used to detect grains and impurities separately; then, based on the detection results, grains and impurities are counted individually, or their pixel counts are tallied. Finally, a mathematical model is established to relate these statistics to material mass, and the grain mass and impurity mass are calculated through this model so that the impurity rate can be derived. The focus of these studies is therefore placed on improving recognition accuracy.
Such studies can be roughly divided into three categories according to the calculation method of impurity rate: only segmentation without calculating the impurity rate; calculating the impurity rate by simply counting pixels based on the segmentation results; and calculating the impurity rate using a complex model that combines object counting and pixel statistics. The relevant studies are introduced one by one below. Liu et al. [11] used the improved EfficientNetV2 to achieve the segmentation of rice (Oryza sativa L.) grains and impurities, achieving a precision of 94.44%. Yu et al. [12] introduced EfficientNetB7 into the Faster-RCNN network to realize the detection of impurities and broken corn (Zea mays L.) kernels in images of corn with impurities. Jiang et al. [13] collected wheat images with impurities using a THz spectral imaging system and created a dataset, thereby highlighting the image features of wheat and impurities. Subsequently, YOLOv9 was used to detect wheat and impurities, achieving high accuracy. The above three studies only performed segmentation or detection of grains and impurities. These technologies themselves cannot provide the impurity rate. Therefore, in addition to detecting and segmenting grain images with impurities, other studies have further developed models for calculating impurity rates. Chen et al. [14] used DeepLabV3+ to segment wheat images containing impurities, counted the number of pixels of impurities and wheat, respectively, and finally calculated the impurity rate of the sample using a quantitative model. Zhang et al. [15] introduced a detail detection head into DeepLabV3+ and developed the DeepLab-EDA model for wheat and impurities segmentation. The mass per pixel of wheat and impurities was measured experimentally, and the mass was calculated by counting the number of pixels for each target in the segmentation results to determine the impurity rate. 
Although the above two studies estimated impurity rates by combining image segmentation results with mathematical models, the models used are based on mass per pixel obtained from simple experiments and cannot handle material occlusion and stacking. The next two studies attempted to solve this problem by constructing more complex impurity rate calculation models. When building the peanut (Arachis hypogaea L.) impurity rate calculation model, Gu et al. [16] established a regression equation between quantity and mass for peanut pods, which have a relatively uniform shape and account for a large proportion of impurities, and estimated their mass by counting. For other impurities with irregular shapes and a small proportion, the pixel-counting method was still used to calculate mass. The segmentation model adopted was the improved YOLOv8n-Seg. Wu et al. [17] adopted a similar idea when calculating the rice impurity rate: for intact rice grains with a uniform shape, target counting was used to calculate mass, and for impurities, pixel counting was used. Accordingly, that study used Mask R-CNN for image processing: only impurities and broken rice grains were segmented, and only intact rice grains were detected. This approach achieves the goal while reducing the computational load. However, whether mass is calculated by pixel counting or by target counting, artificially constructed impurity rate calculation models generalize poorly. For models based on pixel counting, changing the installation height of the visual sensor invalidates the original calculation model, and new experiments must be conducted at the new installation height to build a new one.
For models based on target counting, on the one hand, they cannot fully leverage the advantages of object detection or image segmentation, resulting in wasted performance; on the other hand, even for the same crop, thousand-grain weight differs across varieties. Moreover, once these two types of calculation models are established, they cannot be further optimized, which fails to exploit the key advantage of neural networks: automatic parameter optimization through training.
To thoroughly address these issues, we propose a method for estimating impurity rates from grain images with impurities: a regression head is added to the neural network, allowing it to output the impurity rate directly, without relying on a manually designed, fixed computational model. In fact, Yu et al. [18] attempted to use an improved ResNet-18 to perform impurity rate regression on images of corn with impurities. However, when we applied this approach to impurity-containing wheat images using a standard ResNet-18, a significant gap emerged between the training MAE loss and the validation MAE loss. The training MAE loss gradually converged and decreased, whereas the validation MAE loss remained persistently high, with severe fluctuations, throughout training. This indicates that the model overfitted the training set. Our analysis revealed that the dataset constructed in [18] has relatively homogeneous features, whereas the dataset constructed in this paper has a more diverse impurity rate label distribution and more varied image features. The heatmap generated by the model further corroborates the overfitting, as shown in Figure 1. The model was expected to focus on the target features of wheat and impurities, but the high-activation regions in the heatmap indicate that it actually focused on uninterpretable features. This suggests that the model did not regress the mass ratio of wheat and impurities as expected, but instead learned features from other dimensions, which is highly consistent with the definition of overfitting.
Figure 1. Heatmap of the output of the last residual block in ResNet-18.
This phenomenon is not surprising: Image data contains complex and diverse information, and deriving the impurity rate from image information requires multiple steps. If these steps are skipped and the impurity rate is directly used as the optimization target, the neural network is easily disturbed by other information in the images. To help the model learn the correct features, we constructed a novel network with both a segmentation and a regression branch, named “Segmentation Is Not the Purpose”. As the name suggests, the optimization target of this network is no longer to achieve better segmentation accuracy, unlike existing studies, but to obtain an impurity rate with lower error.
Semantic segmentation networks have been proven to help network weights focus on the features of segmented targets [19]. The semantic segmentation task extracts low-dimensional features from the input image and is relatively easy to optimize; the segmentation loss can effectively guide the backbone to learn correct features during training. In summary, SNTP is a multi-task model with two output branches, a segmentation branch and a regression branch, which learn from the image segmentation labels and the impurity rate labels, respectively. The segmentation branch ensures that the backbone extracts the correct target features; since the segmentation head and regression head share one backbone, this in turn helps the regression head extract correct image information. The design scheme of the model is shown in Figure 2. To further enhance regression accuracy, we also introduced self-attention into the regression branch to extract global semantic information and changed the regression method to distribution-based regression. Ultimately, an end-to-end impurity rate detection method that is easy to optimize and has good universality is constructed.
Figure 2. The design scheme of SNTP.

2. Materials and Methods

2.1. Data Acquisition and Augmentation

The wheat variety used in this research is Zhengmai 336, planted in Henan, China, and harvested in June. All impurities except glass fragments were collected from threshing machines.
The system for collecting wheat impurity images is shown in Figure 3. It consists of a height-adjustable industrial camera bracket, a ring light (model RI15045-W, OPT-Machine, Dongguan, China), an illumination power supply, an industrial camera (model BFS-U3-51S5C-C, FLIR, Wilsonville, OR, USA), and an electronic balance (model AB135-S, METTLER TOLEDO, Columbus, OH, USA).
Figure 3. Image collecting system.
Five types of impurities commonly found in wheat harvesting operations were selected as the data collection objects in this study: wheat bran, straw, weeds, gravel, and glass. The data collection process is as follows. Before each image was captured, the various impurities and wheat grains were weighed separately on an electronic balance, and their masses were recorded. The impurities and wheat grains were then mixed, and images were captured with the industrial camera. After image collection, the total mass of impurities and the total mass of the mixture of impurities and wheat grains were calculated, and their ratio was taken as the impurity rate. During data collection, the camera height was not fixed; it was adjusted flexibly according to the number of targets. The images were annotated using X-AnyLabeling (v2.3.5) [20], with the impurity rate and the annotated mask images serving as labels, as shown in Figure 4a.
Figure 4. Dataset and image augmentation. (a) Image of wheat with impurities and its mask. (b) Original image and image augmentation effect.
Producing such data is costly and time-consuming; therefore, a total of 1000 impurity-containing wheat images were collected in this study. They were divided into a training set and a validation set at a ratio of 4:1: 800 images for training and 200 for validation. The details of the dataset are shown in Table 1. Given the small scale of the dataset, to avoid overfitting and accelerate optimization, this study used the Albumentations library [21] to apply online augmentation to the training set. A total of 10 online augmentation methods were employed, as shown in Figure 4b.
Table 1. Number of each instance in the dataset.

2.2. Segmentation Branch: Improved DeepLabV3+

The network architecture used in this study is shown in Figure 5, where the segmentation branch adopts DeepLabV3+ [22]. The segmentation branch is not the focus of this study; however, a standard DeepLabV3+ would incur excessively high computational overhead and unacceptable memory usage, which is unsuitable for the extremely cost-sensitive agricultural production field. Therefore, measures are needed to reduce the computational and memory overhead of the network.
Figure 5. The architecture of SNTP.
The method adopted in this study is to replace the backbone with the advanced MobileNetV4 [23], specifically MNV4-Conv-S. Additionally, since the standard MNV4 is a classification network with 32× downsampling, this rate is too high for a semantic segmentation network. Meanwhile, DeepLabV3+ uses dilated convolution in the bottom layers of the network to increase the receptive field of the convolution kernels. Thus, we modified part of the MNV4 structure to reduce the downsampling rate to 8× and converted some convolution layers to dilated convolutions, while keeping the ASPP module unchanged. The modified backbone structure is listed in Table 2. It should be specifically noted that since the segmentation branch and the regression branch share the same backbone, these modifications also affect the regression branch.
Table 2. Structure of modified MNV4.
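The arithmetic behind this modification can be sketched with the standard convolution output-size formula. The sketch below is illustrative only: the 512-px input, kernel sizes, and exact stage layout are assumptions, not the paper's actual layer configuration; it simply shows why replacing stride-2 stages with stride-1 dilated stages keeps the feature map at 1/8 resolution while preserving an enlarged receptive field.

```python
def conv_out_size(size, kernel, stride=1, padding=0, dilation=1):
    """Spatial output size of a 2-D convolution along one axis."""
    effective_k = dilation * (kernel - 1) + 1
    return (size + 2 * padding - effective_k) // stride + 1

# Standard classification head: five stride-2 stages -> 32x downsampling.
size = 512
for _ in range(5):
    size = conv_out_size(size, kernel=3, stride=2, padding=1)
print(size)  # 16  (512 / 32)

# Modified backbone: keep only three stride-2 stages (8x downsampling) and
# run the deeper stages with stride 1 and dilation 2 ("same" padding), which
# preserves resolution while keeping an enlarged receptive field.
size = 512
for _ in range(3):
    size = conv_out_size(size, kernel=3, stride=2, padding=1)
for _ in range(2):
    size = conv_out_size(size, kernel=3, stride=1, padding=2, dilation=2)
print(size)  # 64  (512 / 8)
```

A dilated 3×3 kernel with dilation 2 covers the same 5-px span as a 5×5 kernel, so the receptive field grows even though no further downsampling occurs.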

2.3. Regression Branch: Generalized Categorical Regression (GCR)

Regression networks often suffer from being difficult to optimize due to the properties of squared loss [24]. The true labels in such tasks are usually continuous rather than discrete quantities, and their representation can be regarded as a Dirac delta distribution. This representation is overly strict: a prediction is considered correct only if it exactly equals the true label, ignoring the systematic errors common in regression tasks. Conversely, even if the network's prediction perfectly matches the true label, the label's inherent systematic error means the network has not necessarily extracted the features of the input data accurately.
Ref. [25] was the first to point out this issue when designing bounding box regression for object detection networks, and proposed Distribution Focal Loss (DFL) to address it; its core idea is to transform regression into the prediction of a classification vector. Ref. [26] extended this idea to spacecraft pose estimation and achieved favorable results.
This study argues that this method can be generalized to all regression tasks with the following properties:
  • The regression target is a continuous variable.
  • The regression target has a definite value range.
  • The true labels of the regression target have unavoidable errors.
We refer to this method as Generalized Categorical Regression (GCR). Specifically, for regression tasks satisfying the above properties, the general approach is single-point regression, which assumes the probability distribution concentrates at a single point, i.e., the Dirac delta distribution $\delta(x-a)$, whose normalization and mathematical expectation are

$$\int_{-\infty}^{+\infty} \delta(x-a)\,dx = 1 \tag{1}$$

$$E(x) = \int_{-\infty}^{+\infty} \delta(x-a)\,x\,dx = a \tag{2}$$
Specifically, for regression tasks, the predicted value output by the network is the mathematical expectation of this distribution; therefore, the predicted value $y_p$ can be calculated as

$$y_p = \int_{Y_0}^{Y_n} \delta(x - y_p)\,x\,dx \tag{3}$$

where $Y_0$ and $Y_n$ represent the minimum and maximum values of the label $y$, respectively. Following the idea of DFL, the distribution is replaced with a learnable general distribution $P(x)$. Since the regression target has a definite value range $[Y_0, Y_n]$, this range can be divided into $n+1$ equally spaced nodes $\{y_0, y_1, \ldots, y_{n-1}, y_n\}$. The neural network then no longer predicts a single point but a classification probability vector $\{P(y_0), P(y_1), \ldots, P(y_{n-1}), P(y_n)\}$, essentially a weight vector of size $n+1$ whose elements sum to 1. All nodes are weighted and summed according to these weights, so the predicted value $y_p$ can be simply calculated as

$$y_p = \sum_{i=0}^{n} P(y_i)\,y_i. \tag{4}$$
It is easy to prove that $y_p$ in the above formula is a continuous quantity within $[y_0, y_n]$, which meets the requirements of the task.
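The GCR prediction step described above can be sketched in a few lines: a softmax turns raw scores into the probability vector, and Formula (4) takes its expectation over the nodes. The logit values below are hypothetical, chosen only to illustrate the computation.

```python
import math

def softmax(logits):
    """Convert raw scores into a probability vector that sums to 1."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    s = sum(exps)
    return [e / s for e in exps]

# 11 equally spaced nodes over [0, 1] with a step of 0.1.
nodes = [i / 10 for i in range(11)]

# Hypothetical raw scores from the final fully connected layer (one per node).
logits = [0.1, 0.3, 2.0, 3.5, 1.2, 0.2, -0.5, -1.0, -1.2, -1.5, -2.0]
probs = softmax(logits)

# GCR prediction: the expectation of the discrete distribution (Formula (4)).
y_p = sum(p * y for p, y in zip(probs, nodes))
assert 0.0 <= y_p <= 1.0  # always a continuous value inside [y_0, y_n]
```

Because the weights are non-negative and sum to 1, the expectation is a convex combination of the nodes, which is why the output is guaranteed to stay within the label range.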
Specifically, in this study, the impurity rate is a continuous value within $[0, 1]$, and the electronic balance used for weighing during label production has systematic errors; therefore, this study meets the conditions for using GCR. A step size of 0.1 is adopted, dividing $[0, 1]$ into 11 nodes: $\{0, 0.1, 0.2, \ldots, 0.9, 1\}$. The final fully connected layer thus has an output dimension of 11, and a classification vector is obtained through a softmax layer. The final impurity rate is then computed according to Formula (4). We also experimented with step sizes of 0.05 and 0.2, as shown in Table 3. The experimental results indicate that a step size of 0.1 achieves the best performance.
Table 3. Different step sizes of GCR.
When calculating the loss, since the regression layer has been converted into a classification layer, the cross-entropy loss function can be used directly. For this purpose, the representation of the labels must also be modified. The labels in the dataset are still single points describing the impurity rate, and they need to be converted into classification vectors of equal length. This study follows the idea of DFL, assigning probabilities only to the two nodes adjacent to the true label while setting the probabilities of all other nodes to 0. The probabilities are calculated as follows:
$$s_l = 10 \times (y_r - y_t) \tag{5}$$

$$s_r = 10 \times (y_t - y_l) \tag{6}$$

where $y_l$ represents the nearest node to the left of the true label $y_t$, $y_r$ the nearest node to the right, and $s_l$ and $s_r$ the label probabilities assigned to nodes $y_l$ and $y_r$, respectively. The constant 10 appears because the node step size is 0.1; multiplying by 10 converts a distance into a probability. The conversion process is shown in Figure 6. This conversion forces the network to focus on the two nearest nodes, facilitating optimization. The loss $L_{reg}$ of the regression branch is thus calculated as
$$L_{reg} = -\left(s_r \log(p_r) + s_l \log(p_l)\right) \tag{7}$$

where $p_l$ and $p_r$ represent the predicted probabilities at nodes $y_l$ and $y_r$, respectively.
Figure 6. The conversion process of labels; the lower half of the figure shows an example.

2.4. Attention Module

Transformer blocks [27] are renowned for their powerful capability to extract global semantic information and have demonstrated excellent performance on large-scale datasets. In this study, determining the impurity rate requires analyzing global image information, which makes Transformer blocks highly suitable for the task. The key question is where to insert the Transformer block. The first approach is to insert it after the backbone. The advantage of this method is that both the segmentation branch and the regression branch benefit from the Transformer's global semantic information extraction. However, due to the presence of the segmentation branch, the backbone in this study adopts a low downsampling rate. Inserting a Transformer block at the bottom of the branches, similar to the design of Vision Transformer [28], would incur enormous computational overhead. To address this, the feature dimension would need to be reduced before being fed into the Transformer block. Since the input features are obtained by flattening image patches, this reduction effectively downsamples the low-level feature map of the backbone, leading to severe feature loss, particularly for the segmentation branch, even if the output of the Transformer block is later upsampled back to the original dimension. This insertion method is referred to as Mode B in this study. The other approach is to place the Transformer block solely at the bottom of the regression branch. To reduce computational overhead, the input feature map is further downsampled by 8×, and the resulting feature map is directly flattened and used as input to the Transformer block. Although the segmentation branch cannot benefit from the Transformer block in this way, the information loss caused by downsampling is avoided.
For the regression branch, because it outputs only a single vector, it inherently requires dimensionality reduction to obtain the desired output shape. Therefore, it is not sensitive to information loss from downsampling; instead, introducing the Transformer block enables better extraction of the mass relationship between impurities and wheat. This insertion method is referred to as Mode C in this study. In addition, the network structure without a Transformer block serves as the control group, referred to as Mode A. The structures of the three modes are shown in Figure 7. Table 4 presents the training results for the three network configurations. Both the above analysis and the experimental results indicate that Mode C is the optimal insertion method for the Transformer block.
Figure 7. The structure of the three modes of Transformer insertion.
Table 4. Different modes of Transformer block insertion.
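The cost gap between Mode B and Mode C comes from the quadratic scaling of self-attention with token count. The sketch below makes this concrete; the 512-px input resolution is an illustrative assumption, not a value from the paper.

```python
# Self-attention cost grows with the square of the token count, which is why
# Mode B (Transformer on the shared 8x-downsampled backbone output) is far
# more expensive than Mode C (a further 8x reduction in the regression branch).

def attention_pairs(feature_hw):
    """Token count and pairwise interactions for a square feature map."""
    tokens = feature_hw * feature_hw  # one token per spatial position
    return tokens, tokens * tokens    # pairwise interactions per head

mode_b = attention_pairs(512 // 8)        # backbone output: 64x64 feature map
mode_c = attention_pairs(512 // 8 // 8)   # further 8x downsampling: 8x8 map

print(mode_b)  # (4096, 16777216)
print(mode_c)  # (64, 4096)
```

Mode C processes 4096× fewer token pairs, at the cost of spatial resolution that the regression branch, which outputs only a single vector, does not need anyway.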

3. Results

3.1. Experiment Setting and Evaluation Metrics

The experimental environment and hyperparameter settings are shown in Table 5. In terms of training strategy, we adopted a cosine annealing learning rate schedule, with the validation MIoU used as the criterion for judging model convergence. The weights for both the segmentation loss and regression loss are set to 1.
Table 5. Experiment setting.
The evaluation metrics of the model mainly include the Mean Absolute Error (MAE) and Mean Squared Error (MSE) between the model outputs and the true labels. Meanwhile, the Mean Intersection over Union (MIoU) and Mean Pixel Accuracy (MPA) from image segmentation are recorded for model analysis. MAE and MSE are calculated as follows:
$$\mathrm{MAE} = \frac{1}{m} \sum_{i=1}^{m} \left| y_t - y_p \right| \tag{8}$$

$$\mathrm{MSE} = \frac{1}{m} \sum_{i=1}^{m} \left( y_t - y_p \right)^2 \tag{9}$$
where m represents the number of data points in a mini-batch. MAE is one of the most intuitive error evaluation metrics in regression tasks, as it directly reflects the average error between the model’s predictions and the actual values. MSE can amplify the impact of large errors on the metric results, thereby more sensitively reflecting the model’s fit to samples with large deviations.
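The two metrics above translate directly into code. The mini-batch values below are illustrative, not measured results.

```python
def mae(y_true, y_pred):
    """Mean Absolute Error over a mini-batch."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def mse(y_true, y_pred):
    """Mean Squared Error over a mini-batch."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

# Illustrative mini-batch of impurity rates (not measured values).
y_true = [0.12, 0.25, 0.40, 0.08]
y_pred = [0.10, 0.28, 0.35, 0.09]
print(round(mae(y_true, y_pred), 4))  # 0.0275
print(round(mse(y_true, y_pred), 4))  # 0.001
```

Note how the single 0.05 deviation dominates the MSE, reflecting its sensitivity to large errors described above.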
Since the regression target of this study, the impurity rate, is a ratio in the range $[0, 1]$, and the dataset simulates impurity-containing wheat material collected from the threshing mechanism and fed to the cleaning mechanism, the impurity rates are relatively high. The distribution of regression labels is shown in Figure 8. Data with an impurity rate exceeding 10% account for more than 70% of the dataset, and data exceeding 20% account for about half. Therefore, the optimization target for MAE in this study is set to less than 0.05 (i.e., an absolute error below 5%).
Figure 8. Distribution of regression labels. (a) Distribution of labels in the training set. (b) Distribution of labels in the validation set.
The formulas for calculating MIoU and MPA are as follows:
$$\mathrm{MIoU} = \frac{1}{N} \sum_{i=1}^{N} \frac{TP_i}{TP_i + FP_i + FN_i} \tag{10}$$

$$\mathrm{MPA} = \frac{1}{N} \sum_{i=1}^{N} \frac{TP_i}{TP_i + FN_i} \tag{11}$$
where N represents the number of classes; T P i (True Positive) denotes the number of pixels in class i that are correctly predicted as belonging to that class; F P i (False Positive) denotes the number of pixels from other classes that are incorrectly predicted as belonging to class i; and F N i (False Negative) denotes the number of pixels in class i that are incorrectly predicted as belonging to other classes.
The advantage of MIoU lies in its ability to balance the segmentation performance across all classes, avoiding evaluation bias caused by an excessively high proportion of samples from major classes. It enables a comprehensive evaluation of the model’s segmentation performance and is therefore widely used as the core evaluation criterion for semantic segmentation models. MPA can intuitively reflect the classification accuracy of each class but is insensitive to class confusion. In this study, it serves as a supplementary evaluation metric for MIoU. Both metrics range from [ 0 , 1 ] , with values closer to 1 indicating better segmentation performance.
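Both segmentation metrics can be computed from a per-class confusion matrix. The sketch below uses per-class accuracy $TP_i/(TP_i + FN_i)$ for MPA; the 3-class pixel counts are a toy example, not data from this study.

```python
def miou_mpa(conf):
    """MIoU and MPA from an NxN confusion matrix (rows: true, cols: pred)."""
    n = len(conf)
    ious, pas = [], []
    for i in range(n):
        tp = conf[i][i]
        fn = sum(conf[i]) - tp                       # true i, predicted other
        fp = sum(conf[r][i] for r in range(n)) - tp  # other, predicted as i
        ious.append(tp / (tp + fp + fn))
        pas.append(tp / (tp + fn))
    return sum(ious) / n, sum(pas) / n

# Toy 3-class pixel counts (e.g., background, wheat, impurity) for illustration.
conf = [[90, 5, 5],
        [4, 80, 16],
        [6, 10, 84]]
miou, mpa = miou_mpa(conf)
print(round(miou, 3), round(mpa, 3))  # 0.736 0.847
```

Because every class contributes equally to the mean, a class with few pixels (here, impurities) weighs as much as a dominant one, which is exactly the balancing property described above.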
The SNTP model achieves the following metrics on the validation set: MIoU of 77.7%, MPA of 83.3%, MAE of 0.045, and MSE of 0.005.

3.2. Ablation Experiment

To verify the effectiveness of GCR and the Transformer block, ablation experiments were conducted. The results of the ablation experiments are shown in Table 6. We have also conducted ablation experiments on the modified MobileNetV4 to demonstrate the importance of reducing the downsampling rate and adding dilated convolutions. Since this modification is intended solely to meet the architectural requirements of DeepLabV3+, no additional analysis is provided.
Table 6. Ablation experiment result.
Based on the results of the ablation experiments, several interesting conclusions can be drawn. First, the separate application of GCR or the Attention module fails to yield performance improvements—an outcome that is actually consistent with our expectations. When GCR is used alone, the structure of the regression branch is overly simple and contains relatively few optimizable parameters, preventing GCR from demonstrating its advantages. When the Attention module is used alone, the regression branch suffers from gradient vanishing, most likely because the L2 loss function becomes trapped in a saddle point or a local optimum. Meanwhile, without interference from the regression loss, network optimization is fully devoted to improving segmentation accuracy, allowing this configuration to achieve the highest segmentation accuracy among all four ablation experiments.
This phenomenon is not unexpected. As mentioned earlier, regression networks inherently suffer from optimization difficulties, and introducing the Attention module further increases the complexity of the regression branch, which exacerbates these optimization challenges. Consequently, under the same training hyperparameters, applying the Attention module alone easily leads the regression branch to fall into a saddle point or local optimum. The motivation for applying GCR to transform the regression task into a classification task is precisely to address this issue, and the module performs as intended. The experimental results also reveal a phenomenon: the performance of the segmentation branch and the regression branch does not always improve synchronously. Excessively high performance in either branch may degrade performance in the other, a common characteristic of multitask networks.
The results of the ablation experiments demonstrate that the combined use of the Attention module and GCR yields a significant performance improvement: MIoU increases by 2.7%, MPA increases by 3.6%, and MAE decreases by 0.019 (i.e., the absolute error is reduced by 1.9%). These results strongly validate the effectiveness of the network enhancements and confirm that both modules function as expected.

3.3. Comparison Experiment

Comparison experiments were conducted to demonstrate the performance advantages of SNTP. Since SNTP is essentially an impurity rate regression network, the primary baselines in the experiments are various regression models. The results of the comparative experiments are shown in Table 7.
Table 7. Comparison experiment.
From these results, it can be observed that without the segmentation network assisting in extracting correct image features, even neural networks with complex and advanced architectures struggle to achieve satisfactory validation metrics. All regression networks exhibited overfitting on the training set; even the MambaOut-Base model, despite having a large number of parameters, suffered from gradient vanishing at the very beginning of training—a problem that remained unresolved despite extensive adjustments to the training hyperparameters.
This also indicates that, in the impurity rate regression task, the complex relationship between the impurity rate (the label) and the images of wheat with impurities (the training data) greatly increases the difficulty of network optimization; simply increasing the number of network layers is ineffective. In addition, thanks to the lightweight design, even with the segmentation branch added, SNTP's parameter count is kept within 10 M and its computational complexity within 20 GMACs. The model therefore has a low deployment cost and real-time processing capability, which matches the actual needs of agricultural production.

3.4. Model Fitting Performance

To further validate the regression accuracy of the SNTP model and quantitatively evaluate its ability to explain variance in the wheat impurity rate, the coefficient of determination (R²) and a ground-truth vs. predicted-value scatter plot were analyzed on the validation set. The model's inference results on the validation set were saved and plotted with the ground-truth values on the abscissa and the predicted values on the ordinate, with the line y = x added as a reference, as shown in Figure 9. The closer a point lies to the reference line, the closer its predicted value is to the true value. Among the 200 validation samples, the vast majority cluster closely around the ideal prediction line y = x; despite a few outliers, the deviation between predicted and true values is small. The least-squares fit is y = 0.98x − 0.000063, with a slope of 0.98 (very close to 1) and an intercept magnitude of only 0.000063, consistent with the behavior expected of the SNTP regression model. In addition, the coefficient of determination is 0.93, further demonstrating that the model has good fitting performance.
Figure 9. Scatter plot of ground truth vs. predicted values.
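For readers reproducing this analysis, the slope, intercept, and R² can be computed in a few lines of numpy. This is a generic sketch (the function name is ours), not the authors' evaluation script; R² is defined here relative to the ground-truth variance, matching the conventional regression formulation.

```python
import numpy as np

def fit_and_r2(y_true, y_pred):
    """Least-squares line y_pred ~ slope * y_true + intercept,
    plus R^2 = 1 - SS_res / SS_tot of the predictions."""
    slope, intercept = np.polyfit(y_true, y_pred, 1)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    r2 = 1.0 - ss_res / ss_tot
    return slope, intercept, r2
```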

3.5. Model Visualization

The heatmaps of the model intuitively illustrate the features learned at each layer, which helps explain the results of the comparative experiments. Heatmaps are generated with Grad-CAM via the pytorch-grad-cam library (v1.5.5) [38]. The heatmaps of the bottom layer of SNTP's backbone are shown in Figure 10a. It can be observed that, aided by the segmentation branch, the backbone extracts the features of each target in the image as expected. The high-activation regions of the heatmaps are concentrated around the target areas and essentially outline the target contours, indicating that the model's behavior is consistent with the design theory.
Figure 10. Heatmaps of models. (a) Heatmaps of the SNTP backbone. (b) Heatmaps of regression networks.
The heatmaps of the bottom convolutional layers of various regression networks are shown in Figure 10b. As analyzed for ResNet-18 in Section 1, none of the regression networks successfully extracted the target features. Their heatmaps indicate that these networks learned dataset-specific patterns that are not interpretable to humans, which explains why their training loss kept decreasing while validation performance did not improve. Among all regression networks, only MobileViTv2-2.0 learned part of the target features; even so, it suffered from substantial false activations in background regions, lost many target features, and failed to correctly outline the target contours. The heatmap results are consistent with the overfitting of each regression model on the training set, further demonstrating that pure regression models cannot be applied to the impurity rate regression task for impurity-containing wheat images.
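The Grad-CAM maps above follow the standard recipe: channel weights from spatially averaged gradients, a weighted sum of activations, then ReLU and normalization. A minimal numpy sketch of that computation follows; it is illustrative only, since the paper uses the pytorch-grad-cam library rather than this code.

```python
import numpy as np

def grad_cam(activations, gradients):
    """Minimal Grad-CAM: channel weights = spatially averaged gradients;
    heatmap = ReLU of the weighted sum of activation channels, scaled to [0, 1].
    activations, gradients: arrays of shape (C, H, W)."""
    weights = gradients.mean(axis=(1, 2))             # (C,) per-channel weight
    cam = np.tensordot(weights, activations, axes=1)  # (H, W) weighted sum
    cam = np.maximum(cam, 0.0)                        # ReLU: keep positive evidence
    if cam.max() > 0:
        cam /= cam.max()                              # normalize for display
    return cam
```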
The output visualization of the SNTP’s segmentation branch intuitively reflects its segmentation performance. Figure 11 shows the segmentation results and ground truth labels for three validation set images. By comparing the segmentation results with the ground-truth labels, it can be seen that the segmentation branch of SNTP can generally reconstruct the contours of the targets to be segmented, but there are some pixel-level classification errors—especially between wheat and wheat bran with similar color features, and between weeds and straw with similar shapes. This may be due to interference from the regression branch. However, since SNTP does not take segmentation accuracy as its primary optimization target, the segmentation performance of the current model is sufficient for the task.
Figure 11. Visualization of segmentation results. (a) Original image 1. (b) Mask 1. (c) Segmentation result 1. (d) Original image 2. (e) Mask 2. (f) Segmentation result 2. (g) Original image 3. (h) Mask 3. (i) Segmentation result 3.
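The pixel-level agreement discussed above is what the MIoU and MPA metrics quantify. The following numpy sketch shows the standard confusion-matrix definitions of both metrics; the function name and layout are ours, not taken from the authors' code.

```python
import numpy as np

def miou_mpa(pred, gt, num_classes):
    """MIoU and MPA from pixel-wise predictions and ground-truth labels.
    Classes absent from both pred and gt yield NaN and are skipped by nanmean."""
    cm = np.zeros((num_classes, num_classes), dtype=np.int64)
    for g, p in zip(gt.ravel(), pred.ravel()):
        cm[g, p] += 1                                  # rows: gt, cols: pred
    tp = np.diag(cm).astype(float)
    iou = tp / (cm.sum(axis=0) + cm.sum(axis=1) - tp)  # per-class IoU
    pa = tp / cm.sum(axis=1)                           # per-class pixel accuracy
    return float(np.nanmean(iou)), float(np.nanmean(pa))
```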
Feature map visualization is also an effective method to explain model performance. By extracting feature maps during the model’s inference process, we can intuitively understand the function of each convolutional layer. The feature map visualization method adopted in this study involves summing the weights of all channels into a single color channel, and the resulting feature maps of representative layers in SNTP are shown in Figure 12.
Figure 12. Feature maps of SNTP layers. (a) Feature maps of backbone layers. (b) Feature maps of the segmentation branch. (c) Feature maps of the regression branch.
The backbone feature maps and segmentation branch feature maps reveal that, as in most semantic segmentation networks, the early convolutional layers act as edge detectors, with highlighted regions outlining the target contours, while the deeper convolutional layers extract high-dimensional features inside the targets, with highlighted regions gradually concentrating toward the target centers. At the output layer of the segmentation branch, thanks to the fusion of high-resolution feature maps, the targets are largely restored. In the regression branch, by contrast, the feature map resolution is low due to the high downsampling rate, and the high-dimensional features learned by the model are no longer easily interpretable by humans. Nevertheless, the highlighted regions in the UIB 1 feature map correspond to the locations of gravel particles with relatively large mass in the input images, while those in the UIB 2 feature map correspond to other, lighter targets. This suggests that the model has extracted mass-related features, which may be one reason it can accurately regress the impurity rate.
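The channel-summing visualization described above can be sketched in a few lines of numpy. This is illustrative only; the authors' exact normalization for display may differ.

```python
import numpy as np

def feature_map_to_image(fmap):
    """Collapse a (C, H, W) feature map into one channel by summing over
    channels, then min-max normalize to [0, 1] for display."""
    img = fmap.sum(axis=0)
    lo, hi = img.min(), img.max()
    return (img - lo) / (hi - lo) if hi > lo else np.zeros_like(img)
```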

4. Discussion

Generalization ability has always been a critical factor determining whether neural networks can be applied in engineering practice [39]. Networks that only work on specific datasets are difficult to deploy in engineering applications. Previous studies adopted a method of manually counting pixels in segmentation results, which depends heavily on the inherent characteristics of the dataset itself. This approach requires that objects of the same category with identical mass have similar pixel-level properties at data collection time, which restricts the flexibility of dataset acquisition. In addition, since the network outputs masked 2D images, this calculation method must ignore the thickness effect. All these drawbacks can introduce systematic errors and thus hinder practical application. In contrast, the SNTP network takes the impurity rate directly as its optimization target, which allows it to fully leverage the advantages of neural networks and removes restrictions on data collection. Moreover, neural networks can capture high-dimensional features in the data [40], giving the model the potential to reduce systematic errors. Results from the comparative experiments and model visualization demonstrate that integrating a semantic segmentation branch can effectively reduce the optimization difficulty of the regression network, resolve the overfitting issue, stabilize network training, and raise network accuracy to a high level. This realizes the concept of end-to-end impurity rate acquisition and enhances model generalization.
The dataset employed in this study can still be improved. The main limitations are twofold: first, the dataset construction process incurs high time costs, which limits its overall scale; second, the data acquisition process differs somewhat from real-world operating scenarios. Since SNTP is designed as a multi-task network, the dataset requires two types of annotations. This makes the data preparation process more time-consuming than for single-task semantic segmentation or regression alone, resulting in high costs for building large-scale datasets. A feasible solution is to introduce unsupervised pre-training strategies for SNTP. Specifically, the model can be pre-trained on extensive unlabeled image data through carefully designed proxy tasks, and then fine-tuned on a small amount of labeled data [41]. Unlabeled data can be collected at very low cost, enabling the construction of large-scale datasets economically. This approach has been proven to bring significant performance gains in fields such as semantic segmentation [42] and represents a highly promising direction for future research. Improving data authenticity can be achieved by directly integrating a data acquisition platform into the cleaning equipment, which also represents a future improvement direction. Additionally, this dataset contains only one wheat variety. Considering that different wheat varieties may vary in physical properties such as thousand-kernel weight and color, this represents a potential limitation of the dataset in terms of generalization.
Neural networks designed for multi-task objectives may exhibit performance synergy. For instance, the regression branch of the SNTP model achieved substantial accuracy improvements after integrating the segmentation branch. However, such architectures can also lead to performance trade-offs—in other words, the accuracy of the segmentation branch may decline due to interference from the regression branch. This issue can be mitigated by adjusting training strategies, such as assigning distinct loss weights to different tasks [43]. Future research directions will focus on optimizing the network structure to achieve a more rational integration of the regression and segmentation branches, adopting more sophisticated training strategies, and further refining the design of the regression head. These approaches are expected to alleviate the performance trade-off phenomenon and further enhance the overall accuracy of the network.
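The loss-weighting strategy of [43] mentioned above replaces fixed task weights with learned homoscedastic-uncertainty terms: each task loss Lᵢ is scaled by exp(−sᵢ), where sᵢ = log σᵢ² is a learnable parameter, and sᵢ itself is added as a regularizer so the weights cannot collapse to zero. A minimal numeric sketch (names are ours, and in practice the sᵢ would be trainable parameters updated by the optimizer):

```python
import numpy as np

def uncertainty_weighted_loss(losses, log_vars):
    """Kendall et al. multi-task weighting: total = sum_i exp(-s_i) * L_i + s_i,
    where s_i = log(sigma_i^2) is learned per task."""
    total = 0.0
    for L, s in zip(losses, log_vars):
        total += np.exp(-s) * L + s
    return total
```

With all sᵢ = 0 this reduces to a plain sum of task losses; during training, tasks whose losses are noisy drive their sᵢ up and are automatically down-weighted.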

5. Conclusions

This study proposes a novel impurity rate regression network for images of wheat with impurities, named "Segmentation is Not The Purpose" (SNTP). The SNTP model integrates a semantic segmentation model and a regression model into a single framework, leveraging the segmentation branch to help the model learn accurate target features, thereby enabling the regression branch to yield precise impurity rates. Employing MNV4-Conv-S as the backbone and the segmentation head of DeepLabV3+ as the segmentation branch, SNTP introduces a Transformer block into the regression branch, extends Distribution Focal Loss to regression tasks, and designs a Generalized Categorical Regression head. Through these designs, SNTP successfully addresses the overfitting issue commonly encountered in single regression models. Ultimately, on the validation set of the constructed dataset, the model achieves an MIoU of 77.7%, an MPA of 83.3%, an MAE of 0.045, and an MSE of 0.005. The regression metrics significantly outperform those of single regression models and meet the design targets. Meanwhile, the model has fewer than 10 M parameters and a computational complexity below 20 GMACs, satisfying the low-cost and real-time requirements of agricultural production. Furthermore, SNTP does not require the manual construction of additional mathematical models, meaning less prior knowledge is needed. These research findings are conducive to realizing intelligent cleaning operations.

Author Contributions

Conceptualization, Y.B. and H.Y.; methodology, Y.B.; software, Y.B.; data curation, Y.B., H.Y. and X.L.; writing—original draft preparation, Y.B.; writing—review and editing, X.Z., H.Y. and D.L.; supervision, D.L.; project administration, D.L.; funding acquisition, D.L. All authors have read and agreed to the published version of the manuscript.

Funding

This research was supported by the National Key Research and Development Program of China (2021YFD2100600).

Data Availability Statement

The code is available at https://github.com/byh00/SNTP (accessed on 20 January 2026), and the dataset is available at https://www.kaggle.com/datasets/byh0007/wheat-images-with-impurity (accessed on 20 January 2026).

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Liu, L.; Li, P. An improved YOLOv5-based algorithm for small wheat spikes detection. Signal Image Video Process. 2023, 17, 4485–4493. [Google Scholar] [CrossRef]
  2. Tibola, C.S.; Fernandes, J.M.C.; Guarienti, E.M. Effect of cleaning, sorting and milling processes in wheat mycotoxin content. Food Control 2016, 60, 174–179. [Google Scholar] [CrossRef]
  3. Badretdinov, I.; Mudarisov, S.; Lukmanov, R.; Permyakov, V.; Ibragimov, R.; Nasyrov, R. Mathematical modeling and research of the work of the grain combine harvester cleaning system. Comput. Electron. Agric. 2019, 165, 104966. [Google Scholar] [CrossRef]
  4. Wu, Y.; Li, X.; Mao, E.; Du, Y.; Yang, F. Design and development of monitoring device for corn grain cleaning loss based on piezoelectric effect. Comput. Electron. Agric. 2020, 179, 105793. [Google Scholar] [CrossRef]
  5. Zhang, M.; Jiang, L.; Wu, C.; Wang, G. Design and Test of Cleaning Loss Kernel Recognition System for Corn Combine Harvester. Agronomy 2022, 12, 1145. [Google Scholar] [CrossRef]
  6. Wei, D.; Wu, C.; Jiang, L.; Wang, G.; Chen, H. Design and Test of Sensor for Monitoring Corn Cleaning Loss. Agriculture 2023, 13, 663. [Google Scholar] [CrossRef]
  7. Zhao, J.; Yin, Y.; Shao, M.; Wang, Q.; Wang, F.; Lu, L. Development and testing of an online detection system for impurity content in wheat for combine harvesters. Measurement 2025, 256, 117942. [Google Scholar] [CrossRef]
  8. Chen, J.; Lian, Y.; Li, Y. Real-time grain impurity sensing for rice combine harvesters using image processing and decision-tree algorithm. Comput. Electron. Agric. 2020, 175, 105591. [Google Scholar] [CrossRef]
  9. Chen, X.; Guan, Z.; Li, H.; Zhang, M. A Machine Vision-Based Method of Impurity Detection for Rapeseed Harvesters. Processes 2024, 12, 2684. [Google Scholar] [CrossRef]
  10. Liu, L.; Du, Y.; Chen, D.; Li, Y.; Li, X.; Zhao, X.; Li, G.; Mao, E. Impurity monitoring study for corn kernel harvesting based on machine vision and CPU-Net. Comput. Electron. Agric. 2022, 202, 107436. [Google Scholar] [CrossRef]
  11. Liu, Q.; Liu, W.; Liu, Y.; Zhe, T.; Ding, B.; Liang, Z. Rice grains and grain impurity segmentation method based on a deep learning algorithm-NAM-EfficientNetv2. Comput. Electron. Agric. 2023, 209, 107824. [Google Scholar] [CrossRef]
  12. Yu, H.; Li, Z.; Li, W.; Guo, W.; Li, D.; Wang, L.; Wu, M.; Wang, Y. A Tiny Object Detection Approach for Maize Cleaning Operations. Foods 2023, 12, 2885. [Google Scholar] [CrossRef]
  13. Jiang, Y.; Jiang, M.; Abbas, T.; Ge, H.; Wen, X.; Chen, H. Detection of wheat impurities using terahertz imaging technology and deep learning. LWT 2025, 229, 118201. [Google Scholar] [CrossRef]
  14. Chen, M.; Jin, C.; Ni, Y.; Xu, J.; Yang, T. Online Detection System for Wheat Machine Harvesting Impurity Rate Based on DeepLabV3+. Sensors 2022, 22, 7627. [Google Scholar] [CrossRef]
  15. Zhang, Q.; Wang, L.; Ni, X.; Wang, F.; Chen, D.; Wang, S. Research on wheat broken rate and impurity rate detection method based on DeepLab-EDA model and system construction. Comput. Electron. Agric. 2024, 226, 109375. [Google Scholar] [CrossRef]
  16. Gu, M.; Shen, H.; Ling, J.; Yu, Z.; Luo, W.; Wu, F.; Gu, F.; Hu, Z. Online detection of broken and impurity rates in half-feed peanut combine harvesters based on improved YOLOv8-Seg. Comput. Electron. Agric. 2025, 237, 110494. [Google Scholar] [CrossRef]
  17. Wu, Z.; Chen, J.; Ma, Z.; Li, Y.; Zhu, Y. Development of a lightweight online detection system for impurity content and broken rate in rice for combine harvesters. Comput. Electron. Agric. 2024, 218, 108689. [Google Scholar] [CrossRef]
  18. Yu, H.; Li, Z.; Guo, W.; Li, D.; Wang, L.; Wang, Y. An estimation method of maize impurity rate based on the deep residual networks. Ind. Crops Prod. 2023, 196, 116455. [Google Scholar] [CrossRef]
  19. Qin, X.; Dai, H.; Hu, X.; Fan, D.P.; Shao, L.; Van Gool, L. Highly Accurate Dichotomous Image Segmentation. In Computer Vision—ECCV 2022; Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T., Eds.; Springer: Cham, Switzerland, 2022; pp. 38–56. [Google Scholar]
  20. Wang, W. Advanced Auto Labeling Solution with Added Features. Github Repository, 2023. Available online: https://github.com/CVHub520/X-AnyLabeling (accessed on 23 February 2023).
  21. Buslaev, A.; Iglovikov, V.I.; Khvedchenya, E.; Parinov, A.; Druzhinin, M.; Kalinin, A.A. Albumentations: Fast and Flexible Image Augmentations. Information 2020, 11, 125. [Google Scholar] [CrossRef]
  22. Chen, L.C.; Zhu, Y.; Papandreou, G.; Schroff, F.; Adam, H. Encoder-Decoder with Atrous Separable Convolution for Semantic Image Segmentation. In Computer Vision—ECCV 2018; Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y., Eds.; Springer: Cham, Switzerland, 2018; pp. 833–851. [Google Scholar]
  23. Qin, D.; Leichner, C.; Delakis, M.; Fornoni, M.; Luo, S.; Yang, F.; Wang, W.; Banbury, C.; Ye, C.; Akin, B.; et al. MobileNetV4: Universal Models for the Mobile Ecosystem. In Computer Vision—ECCV 2024; Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G., Eds.; Springer: Cham, Switzerland, 2025; pp. 78–96. [Google Scholar]
  24. Stewart, L.; Bach, F.; Berthet, Q.; Vert, J.P. Regression as Classification: Influence of Task Formulation on Neural Network Features. In 26th International Conference on Artificial Intelligence and Statistics, Valencia, Spain, 25–27 April 2023; Ruiz, F., Dy, J., van de Meent, J.W., Eds.; Proceedings of Machine Learning Research; PMLR: Cambridge MA, USA, 2023; Volume 206, pp. 11563–11582. [Google Scholar]
  25. Li, X.; Wang, W.; Wu, L.; Chen, S.; Hu, X.; Li, J.; Tang, J.; Yang, J. Generalized Focal Loss: Learning Qualified and Distributed Bounding Boxes for Dense Object Detection. In Advances in Neural Information Processing Systems; Larochelle, H., Ranzato, M., Hadsell, R., Balcan, M., Lin, H., Eds.; Curran Associates, Inc.: Red Hook, NY, USA, 2020; Volume 33, pp. 21002–21012. [Google Scholar]
  26. Zhou, H.; Yao, L.; She, H.; Si, W. SDPENet: A Lightweight Spacecraft Pose Estimation Network with Discrete Euler Angle Probability Distribution. IEEE Robot. Autom. Lett. 2025, 10, 3086–3093. [Google Scholar] [CrossRef]
  27. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I. Attention is All you Need. In Advances in Neural Information Processing Systems; Curran Associates, Inc.: Red Hook, NY, USA, 2017; Volume 30. [Google Scholar]
  28. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. arXiv 2021, arXiv:2010.11929. [Google Scholar] [CrossRef]
  29. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016. [Google Scholar]
  30. Tan, M.; Le, Q. EfficientNetV2: Smaller Models and Faster Training. In Proceedings of the 38th International Conference on Machine Learning, Virtual, 18–24 July 2021; pp. 10096–10106. [Google Scholar]
  31. Mehta, S.; Rastegari, M. Separable Self-attention for Mobile Vision Transformers. arXiv 2022, arXiv:2206.02680. [Google Scholar] [CrossRef]
  32. Cai, H.; Li, J.; Hu, M.; Gan, C.; Han, S. EfficientViT: Lightweight Multi-Scale Attention for High-Resolution Dense Prediction. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Paris, France, 2–3 October 2023; pp. 17302–17313. [Google Scholar]
  33. Vasu, P.K.A.; Gabriel, J.; Zhu, J.; Tuzel, O.; Ranjan, A. FastViT: A Fast Hybrid Vision Transformer Using Structural Reparameterization. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Paris, France, 2–3 October 2023; pp. 5785–5795. [Google Scholar]
  34. Woo, S.; Debnath, S.; Hu, R.; Chen, X.; Liu, Z.; Kweon, I.S.; Xie, S. ConvNeXt V2: Co-Designing and Scaling ConvNets With Masked Autoencoders. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, Canada, 18–22 June 2023; pp. 16133–16142. [Google Scholar]
  35. Chen, J.; Kao, S.; He, H.; Zhuo, W.; Wen, S.; Lee, C.H.; Chan, S.H.G. Run, Don’t Walk: Chasing Higher FLOPS for Faster Neural Networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, Canada, 18–22 June 2023; pp. 12021–12031. [Google Scholar]
  36. Howard, A.; Sandler, M.; Chu, G.; Chen, L.C.; Chen, B.; Tan, M.; Wang, W.; Zhu, Y.; Pang, R.; Vasudevan, V.; et al. Searching for MobileNetV3. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019. [Google Scholar]
  37. Yu, W.; Wang, X. MambaOut: Do We Really Need Mamba for Vision? In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 11–15 June 2025; pp. 4484–4496. [Google Scholar]
  38. Gildenblat, J.; Contributors. PyTorch Library for CAM Methods. Github Repository, 2021. Available online: https://github.com/jacobgil/pytorch-grad-cam (accessed on 31 May 2017).
  39. Rohlfs, C. Generalization in neural networks: A broad survey. Neurocomputing 2025, 611, 128701. [Google Scholar] [CrossRef]
  40. Recanatesi, S.; Farrell, M.; Advani, M.; Moore, T.; Lajoie, G.; Shea-Brown, E. Dimensionality compression and expansion in Deep Neural Networks. arXiv 2019, arXiv:1906.00443. [Google Scholar] [CrossRef]
  41. He, K.; Fan, H.; Wu, Y.; Xie, S.; Girshick, R. Momentum Contrast for Unsupervised Visual Representation Learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 14–19 June 2020. [Google Scholar]
  42. He, K.; Chen, X.; Xie, S.; Li, Y.; Dollár, P.; Girshick, R. Masked Autoencoders Are Scalable Vision Learners. arXiv 2021, arXiv:2111.06377. [Google Scholar] [CrossRef]
  43. Kendall, A.; Gal, Y.; Cipolla, R. Multi-Task Learning Using Uncertainty to Weigh Losses for Scene Geometry and Semantics. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–22 June 2018. [Google Scholar]
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
