TCNet: Transformer Convolution Network for Cutting-Edge Detection of Unharvested Rice Regions

Cutting-edge detection is a critical step in mechanized rice harvesting. Through visual cutting-edge detection, an algorithm can sense in real time whether the harvester is operating along the cutting-edge, reducing loss and improving the efficiency of mechanized harvesting. Convolutional neural network (CNN)-based models, which have strong local feature extraction ability, have been widely used in rice production, but they only obtain large receptive fields in the deep layers of the network. In contrast, a self-attention-based Transformer can effectively provide global features that complement the shortcomings of CNNs. Hence, to quickly and accurately complete cutting-edge detection in a complex rice harvesting environment, this article develops a Transformer Convolution Network (TCNet), a cutting-edge detection algorithm that combines a Transformer with a CNN. Specifically, the Transformer realizes patch embedding through a 3 × 3 convolution, and the output is employed as the input of the Transformer module. Additionally, the multi-head attention in the Transformer module undergoes dimensionality reduction to reduce overall network computation. In the feed-forward network, a 7 × 7 convolution realizes the position coding of different patches. Moreover, the CNN branch uses depthwise separable convolutions to extract local features from the images. The global features extracted by the Transformer and the local features extracted by the CNN are integrated in a fusion module. The test results show that TCNet achieves 97.88% Intersection over Union (IOU) and 98.95% accuracy when segmenting the unharvested region, with only 10.796 M parameters. Its cutting-edge detection is better than that of common lightweight backbone networks, matching the detection performance of a deep convolutional network (ResNet-50) with fewer parameters. The proposed TCNet shows the advantages of combining a Transformer with a CNN and provides real-time and reliable reference information for the subsequent operation of rice harvesting.


Introduction
The emergence of rice combine harvesters has significantly improved the efficiency of rice harvesting, and they are currently the primary rice harvesting method [1]. During the operation of a rice combine harvester, the driver has to pay attention to whether the cutting width of the harvester's cutting table is fully utilized. When the cutting table does not align with the cutting edge of the unharvested region, harvesting efficiency suffers and losses may occur [2]. Additionally, the effects of light and noise on harvesting operations prevent the driver from maintaining the full cutting width at all times, especially in harsh environments. Effective detection of the cutting-edge of the unharvested region can significantly reduce the driver's workload and improve efficiency [3].
Global navigation satellite systems (GNSS) have been extensively used in agricultural machinery [4]. Although GNSS makes automatic driving of the rice combine harvester feasible, it cannot perceive real-time changes in the field environment or crop growth [5]. Furthermore, when the satellite signal is lost or the field environment changes, the harvester cannot ensure that the cutting table operates effectively along the cutting-edge. Lidar is highly accurate for crop line detection [6,7]. Zhao et al. acquired three-dimensional point clouds in wheat fields with lidar [8]; the Otsu method was employed to detect the crop edge position on each scanning contour, and the cutting-edge of the unharvested region was obtained using the least squares method. Lidar can extract crop lines, but the high sensor cost limits further application [9]. In addition, lidar sensors require high computing power to process large amounts of data, which is a bottleneck for real-time applications [10].
Machine vision is a low-cost option with strong real-time capabilities and has therefore received increasing attention in crop boundary detection [11]. Currently, crop cutting-edge detection based on machine vision involves three approaches. The first relies on traditional image processing. For instance, Zhang et al. [2] obtained boundary points based on color features and vertical projection and obtained the cutting-edge by polynomial fitting of the boundary points; the average error in locating the cutting-edge was 2.84 cm. Zhang et al. [12] segmented rice, rape, and wheat based on a custom color factor combination and obtained the cutting-edge navigation path through denoising and least squares fitting, attaining 96.7% accuracy. Debain et al. [13] used texture parameters and Markov fields for crop edge segmentation, and the guidance assistance system achieved an accuracy within 10 cm. Although the accuracy of cutting-edge detection using image processing is high, it is greatly affected by light intensity, shadow, and crop type. Thus, ensuring high detection accuracy in outdoor field conditions is very challenging.
The second approach relies on stereo cameras. For example, Luo et al. [14] used coordinate transformation and the Otsu algorithm to obtain the unharvested crop region, and simultaneously detected the unharvested crop edge and the crop end edge according to the horizontal length characteristics of the unharvested region contour. Kneip et al. [15] used the Expectation-Maximization algorithm with a Gaussian Mixture Model to segment crops and the ground based on different 3D point cloud characteristics; the final cutting-edge was obtained through post-processing. Zhang et al. [16] proposed a spatial clustering method based on point cloud density to extract crop regions of interest, extracted feature points from each region of interest, and obtained the cutting-edges through polynomial fitting. Stereo cameras usually obtain height differences through point cloud processing, but binocular cameras are prone to false detection when crop plants are short or lodging [2]. In addition, the accuracy of cutting-edge estimation is related to the number of points in the cloud: as the point cloud cardinality increases, the processing burden grows, affecting the algorithm's real-time detection performance.
The third approach is cutting-edge detection based on deep learning. Among the current methods, the most promising is semantic segmentation, in which the boundary between the unharvested and harvested regions is delineated by a segmentation model and the cutting-edge is obtained by fitting that boundary. For instance, Kim et al. [17] introduced a weakly supervised crop region segmentation model, which required only a small amount of data to localize the crop region and obtained the final crop edge through an edge detection algorithm. Zhu et al. [18] proposed The Robotic Combine Network (TRCNet), a semantic segmentation model for rice field segmentation, whose optimal IOU was 0.834. These studies demonstrate that semantic segmentation models can detect the cutting-edge of unharvested rice regions. Deep learning-based methods are more robust and universal than traditional algorithms and point cloud processing.
Common semantic segmentation models typically rely on a convolutional neural network (CNN) as the backbone for feature extraction. CNNs have limited receptive fields and can only process limited context information. The model's receptive field can be increased by adding convolutional layers, but the additional computational burden must then be considered [19]. In recent years, attention mechanisms have been widely used, and adding attention modules to a model can help extract valuable information. For instance, Dosovitskiy et al. [20] developed the Vision Transformer (ViT), in which self-attention replaces the CNN architecture; ViT attained excellent results in computer vision tasks. Zheng et al. [21] applied the Transformer model to the semantic segmentation task. In this model, each Transformer layer has a global receptive field, which yields a powerful segmentation model when combined with a simple decoder. The Transformer is stronger than the CNN in global feature acquisition but weaker in local feature perception. In addition, deep Transformer models involve many parameters. According to the recently released SegFormer semantic segmentation model [22], the accuracy of a shallow Transformer model is much lower than that of a deep model, but the parameter count of the deep model is ten times larger.
The cutting-edge detection task requires high real-time performance to facilitate subsequent machine control. However, deep Transformer models are too complex to meet real-time requirements due to their large number of parameters. In addition, the Transformer model lacks local features, which a CNN can effectively supplement. Thus, the following question arises: can we effectively integrate the two to improve the practical performance of a shallow Transformer model in cutting-edge detection tasks? This article combines a CNN and a Transformer and proposes TCNet for cutting-edge detection of unharvested rice regions. The main contributions of this article are as follows: 1. Improve the calculation of patch embedding, multi-head attention, and position coding in the original ViT.
2. Design a fusion module to integrate features extracted from the convolutional module with those extracted from the Transformer module.
3. Build neck and decoder parts applicable to the backbone network for feature collection and final pixel prediction.
4. Propose an effective cutting-edge fitting method for cutting-edge detection.

Image Data Set
The images used in this study were collected at the Zengcheng Teaching and Research Bases of South China Agricultural University, where a live video of a rice combine harvester (RG60, Weichai Lovol Intelligent Agriculture Technology Co., Ltd., Weifang, China) was recorded using a camera (ZED2, Stereo Labs Inc., San Francisco, CA, USA). The camera is rigidly fixed on the harvester through mechanical parts and moves with the harvester during operation. The left-eye image of the camera was exported at one-second intervals. A schematic diagram of the acquired images is shown in Figure 1. A total of 3984 pictures were collected at a resolution of 1920 × 1080. The data set was annotated using Labelme, and the annotations were converted into the common PASCAL VOC format. The data set was randomly divided into a training set, a validation set, and a test set at a 7:2:1 ratio. Data augmentation strategies can reduce overfitting and enhance a model's generalization ability. Thus, this article adds random perturbations of brightness, saturation, contrast, and hue to simulate the influence of different weather conditions during acquisition. Image rotation, horizontal flipping, and vertical flipping simulate different camera positions relative to the unharvested region. Random scaling of the image size mimics segmentation at different resolutions, and random cropping reduces the impact of different proportions of segmented regions in the image. Augmentation is applied randomly to the training set only, while the original validation set and test set are preserved.
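As a concrete illustration, the augmentation pipeline described above could be sketched as follows. The paper does not state which library or parameter ranges were used, so the albumentations-based pipeline and all jitter limits below are assumptions; image and mask are transformed jointly so the labels stay aligned.

```python
import albumentations as A

# Hypothetical augmentation pipeline; the jitter ranges and 512 x 512 crop size are
# illustrative values, not parameters reported in the paper.
train_transform = A.Compose([
    A.ColorJitter(brightness=0.2, contrast=0.2, saturation=0.2, hue=0.1, p=0.5),  # weather-like variation
    A.Rotate(limit=15, p=0.5),                      # camera pose relative to the unharvested region
    A.HorizontalFlip(p=0.5),
    A.VerticalFlip(p=0.5),
    A.RandomScale(scale_limit=0.2, p=0.5),          # simulate different input resolutions
    A.PadIfNeeded(min_height=512, min_width=512),   # guarantee the crop size is available
    A.RandomCrop(height=512, width=512),            # vary the proportion of segmented regions
])

# Usage: image and mask are transformed together so the labels stay aligned
# out = train_transform(image=image, mask=mask); image_aug, mask_aug = out["image"], out["mask"]
```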

Overview of Model Framework
The proposed model architecture is illustrated in Figure 2 and is based on a CNN [23] and the Transformer module in ViT [20]. The overall network structure adopts an encoder-decoder framework, with a neck placed before the decoder. The proposed segmentation model mainly comprises four modules: (1) a convolutional module for extracting local features of images at different resolutions, (2) a Transformer module for extracting global features of images at different resolutions, (3) a fusion block for blending global and local features, and (4) the neck-decoder part, which generates the final segmentation mask by fusing multi-level features. An image of size height (H) × width (W) × 3 is used as an example to describe the network's processing. Specifically, the stem layer reduces the input image resolution to H/4 × W/4 × 3 through a 7 × 7 convolutional operation with a stride of 4. The reduced-resolution image then enters two branches. The first branch is a feature extraction branch composed of the Conv block and is responsible for extracting local features from the image. The second is a feature extraction branch composed of the Transformer block: the patch embedding transforms the image into the format required by the Transformer block, and the Transformer module extracts the global features of the image. The fusion block blends the features extracted by the Conv block with the global features from the Transformer block. Once feature extraction is completed, the encoder part is finished. The neck part adopts a pyramid structure, which integrates features of different resolutions and feeds them into the decoder for decoding. In decoding, features of different resolutions are up-sampled one by one to a unified resolution of H/4 × W/4. Finally, the image is classified by a 1 × 1 convolution.

Patch Embedding
The Transformer in ViT [20] follows the same 1D input structure as natural language processing (NLP), so there is a mismatch between two-dimensional images and one-dimensional sequences. To process 2D images, an image must be converted into a 1D sequence, where H and W are the height and width of the original image, C is the number of channels, and L is the length of the converted 1D sequence, which is also the effective input sequence length of the Transformer. Patch embedding is the operation that converts a 2D image into this 1D sequence. In ViT, a 16 × 16 convolutional kernel with a stride of 16 implements the convolution and produces the corresponding patch-embedding length. Nevertheless, patch embedding with such large kernel and stride sizes causes two problems. The first is that the 1D sequence length of the patch embedding is fixed, so the trained model cannot adapt to segmentation tasks with inputs of different resolutions. The second is that a large stride is likely to lead to weak local connections between adjacent patches and excessive independence between patches.
This article adopts a 3 × 3 convolutional kernel to replace the 16 × 16 kernel of traditional patch embedding. The patch embedding in the model has four parts, corresponding to the input of each Transformer block, and the convolutional strides of the four parts are set to [1, 2, 2, 2]. Since the stem module already reduces the image resolution to a quarter of the original, the stride is set to 1 in the patch embedding of the first stage; that is, the resolution of the first part of the patch embedding is not reduced. The stride of the last three parts is set to 2 to generate 1D sequences at different resolutions, and a padding operation is added before the convolution to keep the output 1D sequence length consistent with the intended downsampling ratio. For the patch embeddings of the four parts, the feature resolutions are H/4 × W/4, H/8 × W/8, H/16 × W/16, and H/32 × W/32.
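To make the patch embedding concrete, a minimal PyTorch sketch is given below. The embedding dimensions, padding value, and placement of layer normalization are assumptions for illustration; the paper specifies only the 3 × 3 kernel and the per-stage strides [1, 2, 2, 2].

```python
import torch.nn as nn

class PatchEmbed(nn.Module):
    """3 x 3 convolutional patch embedding (stride 1 in stage 1, stride 2 in later stages)."""
    def __init__(self, in_channels, embed_dim, stride):
        super().__init__()
        # padding=1 (zero padding) keeps the downsampling ratio exactly equal to the stride
        self.proj = nn.Conv2d(in_channels, embed_dim, kernel_size=3, stride=stride, padding=1)
        self.norm = nn.LayerNorm(embed_dim)

    def forward(self, x):                        # x: (B, C, H, W) feature map
        x = self.proj(x)                         # (B, embed_dim, H/stride, W/stride)
        _, _, h, w = x.shape
        x = x.flatten(2).transpose(1, 2)         # (B, h*w, embed_dim): the 1D token sequence
        return self.norm(x), h, w

# Example stage configuration (channel widths are assumptions), with strides [1, 2, 2, 2]:
# stage1 = PatchEmbed(32, 32, stride=1); stage2 = PatchEmbed(32, 64, stride=2); ...
```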

Transformer Block
The Transformer block is a Transformer encoder [24] mainly composed of multi-head attention, a multilayer perceptron (MLP), and a feed-forward network (FFN). Additionally, layer normalization and dropout layers are applied between the modules to prevent the network from overfitting during training. Figure 3 illustrates the overall structure.

Resolution Reduction Multi-Head Attention
The Transformer branch of the model comprises four Transformer blocks, each receiving the 1D sequence produced by the patch embedding as input. Since a 3 × 3 convolutional kernel is used for patch embedding, the resulting 1D sequence is longer than that produced by the 16 × 16 convolution with a stride of 16 in ViT, so using traditional multi-head attention would lead to excessive computation. We follow the study of Wang et al. [25] and reduce the resolution dimension of the input to the multi-head attention.
The input of multi-head attention is usually divided into Query (Q), Key (K), and Value (V). In ViT, the one-dimensional sequence from the patch embedding is the initial input, and the layer-normalized sequence is linearly projected to obtain Q, K, and V. The difference from ViT is that our method reduces the resolution of the input 2D feature map when calculating K and V. The resolution reduction attention (RRA) for Stage i is calculated as follows:

$$\mathrm{RRA}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_{N_i})\,W^{O} \quad (1)$$

$$\mathrm{head}_j = \mathrm{Attention}\left(Q W_j^{Q},\; \mathrm{RR}(K) W_j^{K},\; \mathrm{RR}(V) W_j^{V}\right) \quad (2)$$

where Concat(·) splices the different heads along the channel dimension, $W_j^{Q}, W_j^{K}, W_j^{V} \in \mathbb{R}^{C_i \times (C_i/N_i)}$ and $W^{O} \in \mathbb{R}^{C_i \times C_i}$ are linear projections, and $N_i$ is the number of heads of the attention layer in Stage i. Therefore, the dimension of each head is $C_i/N_i$. RR(·) is the operation for reducing the resolution dimension of the input sequence, which is written as:

$$\mathrm{RR}(x) = \mathrm{LayerNorm}\left(\mathrm{Reshape}(x, R_i)\,W^{RR}\right) \quad (3)$$

where $x \in \mathbb{R}^{(H_i W_i) \times C_i}$ represents an input sequence and $R_i$ denotes the reduction ratio of the attention layers in Stage i. Reshape(x, R_i) reshapes the input sequence x to a sequence of size $\frac{H_i W_i}{R_i^2} \times (R_i^2 C_i)$, the linear projection $W^{RR} \in \mathbb{R}^{(R_i^2 C_i) \times C_i}$ reduces the input dimension from $R_i^2 C_i$ to $C_i$, and LayerNorm(·) refers to layer normalization. The multi-head attention operation Attention(·) is calculated as:

$$\mathrm{Attention}(Q, K, V) = \mathrm{Softmax}\!\left(\frac{Q K^{T}}{\sqrt{d_{head}}}\right) V \quad (4)$$

where $d_{head}$ represents the dimension of each head. Equations (1)-(4) show that the computational cost of the resolution reduction attention operation is $R_i^2$ times lower than that of traditional multi-head attention.
Take the multi-head attention calculation in the first Transformer block as an example. The 1D sequence length of the image output by the stem module is $\frac{HW}{16}$. After the LayerNorm layer and the linear projection, $Q \in \mathbb{R}^{\frac{HW}{16} \times C}$ is obtained, and $K, V \in \mathbb{R}^{\frac{HW}{16 R_i^2} \times C}$ are obtained after resolution reduction and linear projection. In Equation (4), $QK^{T}$ transposes K, where the length $L = \frac{HW}{16 R_i^2}$ of K does not affect the matrix multiplication. The result $QK^{T} \in \mathbb{R}^{\frac{HW}{16} \times \frac{HW}{16 R_i^2}}$, after scaling and normalization, is multiplied by V to obtain the final attention value of size $\frac{HW}{16} \times C$, whose dimension matches the original input and is consistent with the output 1D sequence length of traditional multi-head attention. The resolution reduction ratios in the four Transformer blocks are set to [8, 4, 2, 1]. As the network deepens, the image resolution gradually decreases and the sequence length after patch embedding shortens accordingly; therefore, the sequence length of the last two stages would become too short if the same resolution reduction were adopted. The number of heads is set to [1, 2, 5, 8] because the convolutions in the patch embedding process continuously increase the feature channel dimension; the deeper the stage, the more heads are required to allocate the channels reasonably and keep the operation efficient.
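The following is a minimal PyTorch sketch of the resolution-reduction attention described by Equations (1)-(4). It follows the spatial-reduction pattern of Wang et al. [25]; implementing RR(·) as a strided convolution and omitting dropout are assumptions, not details confirmed by the paper.

```python
import torch.nn as nn

class ResolutionReductionAttention(nn.Module):
    """Multi-head attention whose K and V are spatially reduced by a factor R_i (Eqs. (1)-(4))."""
    def __init__(self, dim, num_heads, reduction):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.scale = self.head_dim ** -0.5
        self.q = nn.Linear(dim, dim)
        self.kv = nn.Linear(dim, dim * 2)
        self.proj = nn.Linear(dim, dim)          # W^O
        self.reduction = reduction
        if reduction > 1:
            # an R_i x R_i strided conv shortens the K/V sequence by a factor of R_i^2
            self.rr = nn.Conv2d(dim, dim, kernel_size=reduction, stride=reduction)
            self.norm = nn.LayerNorm(dim)

    def forward(self, x, h, w):                  # x: (B, N, C) with N = h * w
        b, n, c = x.shape
        q = self.q(x).reshape(b, n, self.num_heads, self.head_dim).transpose(1, 2)
        if self.reduction > 1:
            x_ = x.transpose(1, 2).reshape(b, c, h, w)
            x_ = self.rr(x_).reshape(b, c, -1).transpose(1, 2)   # (B, N / R_i^2, C)
            x_ = self.norm(x_)
        else:
            x_ = x
        kv = self.kv(x_).reshape(b, -1, 2, self.num_heads, self.head_dim).permute(2, 0, 3, 1, 4)
        k, v = kv[0], kv[1]
        attn = (q @ k.transpose(-2, -1)) * self.scale            # QK^T / sqrt(d_head)
        out = (attn.softmax(dim=-1) @ v).transpose(1, 2).reshape(b, n, c)
        return self.proj(out)
```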

Feed-Forward Network
ViT uses position embedding to introduce the location information of different patches, which improves the correlation between patches. However, position embedding requires the 1D sequence length of every patch embedding to be the same. The Transformer block used in this article takes multi-resolution inputs and therefore cannot provide effective position information through position embedding. In addition, Chu et al. [26] showed that the position code used in ViT lacks translation invariance because each patch is assigned a unique position code, and the model's classification performance decreases if the position coding is removed. According to the study of Islam et al. [27], a model can implicitly learn the absolute positions of different patches through zero padding and convolutional operations, and the positions of the other patches can then be inferred from the relative relationships between these absolute positions and the patches.
Xie et al. [22] also showed that explicit position coding is unnecessary in the Transformer module and can be replaced by zero-padded convolutional operations. This article therefore uses a convolutional operation to encode the positions of the patches. The encoding does not use the common 3 × 3 convolution because the kernel size affects the absolute position calculation and the relative position inference between different patches: a larger convolutional kernel lets different patches share more relevant information and thus better infer relative positions, but it also increases the processing burden. Therefore, we use a 7 × 7 convolution with a stride of 1 for the position encoding. The 7 × 7 convolution is followed by the activation function and a dropout layer, and the output is finally connected to the original input through a skip connection. The calculation is formulated as follows:

$$x_{out} = x_{in} + \mathrm{Dropout}\big(\mathrm{GELU}\big(\mathrm{Conv}_{7\times7}(x_{in})\big)\big) \quad (5)$$

where $x_{in}$ represents the input of the FFN, $\mathrm{Conv}_{7\times7}$ is a 7 × 7 convolution, GELU is the activation function, and Dropout is the random drop layer. The overall FFN structure is depicted in Figure 4.
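A minimal sketch of this FFN is shown below: a zero-padded 7 × 7 convolution followed by GELU and dropout, added back to the input through a skip connection (Equation (5)). The dropout rate and the reshaping between the token sequence and the 2D feature map are illustrative assumptions.

```python
import torch.nn as nn

class FeedForward(nn.Module):
    """FFN that encodes patch positions with a zero-padded 7 x 7 convolution (Eq. (5))."""
    def __init__(self, dim, drop=0.1):
        super().__init__()
        self.conv = nn.Conv2d(dim, dim, kernel_size=7, stride=1, padding=3)   # zero padding keeps H x W
        self.act = nn.GELU()
        self.drop = nn.Dropout(drop)

    def forward(self, x, h, w):                  # x: (B, N, C) token sequence, N = h * w
        b, n, c = x.shape
        feat = x.transpose(1, 2).reshape(b, c, h, w)     # back to a 2D map for the 7 x 7 conv
        feat = self.drop(self.act(self.conv(feat)))
        feat = feat.reshape(b, c, n).transpose(1, 2)
        return x + feat                                  # skip connection with the original input
```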

Convolutional Block
The convolutional block is responsible for learning local features in the image. Considering the effect of the number of parameters on the execution time, we do not rely on traditional convolutions but build the convolutional blocks on depthwise separable convolution (DWConv). DWConv was first proposed by Chollet et al. [23] and, compared with traditional convolution, has fewer parameters and executes faster.
Figure 5 illustrates the convolutional block. The input feature map first undergoes a 1 × 1 convolution with a stride of 1 followed by a BatchNorm&Act operation, where BatchNorm&Act denotes the regularization and activation layers that accelerate network convergence and increase its nonlinear expression. The second step is the DWConv: since the FFN module uses 7 × 7 position coding, we likewise apply a 7 × 7 convolution with a stride of 1 to the feature maps of each channel separately to complete the DWConv, adopting zero padding to maintain the resolution. The third step is a 1 × 1 convolution for integrated learning of information across all channels. The input and the final output are added through a skip connection to produce the output of the convolutional block. Considering the operation speed and the skip connection, we keep the number of input and output channels identical for every convolution in the block, so the number of feature maps does not change throughout the convolutional block.
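The convolutional block could be sketched as follows; the choice of ReLU and the placement of BatchNorm&Act after every convolution are assumptions, while the 1 × 1, 7 × 7 depthwise, 1 × 1 structure and the skip connection follow the description above.

```python
import torch.nn as nn

class ConvBlock(nn.Module):
    """Depthwise separable convolutional block: 1x1 conv -> 7x7 depthwise conv -> 1x1 conv, plus a skip connection."""
    def __init__(self, channels):
        super().__init__()
        def bn_act():
            return nn.Sequential(nn.BatchNorm2d(channels), nn.ReLU(inplace=True))  # "BatchNorm&Act"
        self.pw1, self.bn1 = nn.Conv2d(channels, channels, kernel_size=1), bn_act()
        # groups=channels makes this a per-channel (depthwise) 7x7 convolution; zero padding keeps H x W
        self.dw, self.bn2 = nn.Conv2d(channels, channels, 7, stride=1, padding=3, groups=channels), bn_act()
        self.pw2, self.bn3 = nn.Conv2d(channels, channels, kernel_size=1), bn_act()

    def forward(self, x):
        out = self.bn1(self.pw1(x))
        out = self.bn2(self.dw(out))
        out = self.bn3(self.pw2(out))
        return x + out       # the channel count never changes, so the skip connection adds directly
```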

Neck and Decoder
The neck collects the fused features extracted from different stages, with F1, F2, F3, and F4 representing the fusion features of each stage.
The neck comprises 1 × 1 and 3 × 3 convolutions that unify the channel counts of the four features to 256. Thus, after passing through the neck, the number of channels of the F1, F2, F3, and F4 features is 256. The size of the feature maps is not modified; only the number of feature channels at the different stages is unified.
The decoder comprises a convolution and an up-sampling operation that unify the resolution of the different feature maps so that multi-scale features can be fused. Taking H/4 × W/4 as the target resolution, the features of the four stages are convolved and up-sampled to H/4 × W/4. Note that the convolution does not change the number of channels, which remains 256. The four feature maps are then added element-wise to combine the feature information extracted at different scales, and a 1 × 1 convolution converts the number of channels into the number of categories for the final pixel classification.
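A compact sketch of the neck and decoder is given below. The exact arrangement of the 1 × 1 and 3 × 3 convolutions and the stage channel counts fed into the neck are assumptions for illustration; unifying the channels to 256, up-sampling all stages to H/4 × W/4, element-wise addition, and the final 1 × 1 classifier follow the text.

```python
import torch.nn as nn
import torch.nn.functional as F

class NeckDecoder(nn.Module):
    """Unify F1-F4 to 256 channels, upsample all to H/4 x W/4, add, then 1x1 conv to class logits."""
    def __init__(self, in_channels=(32, 64, 128, 256), mid_channels=256, num_classes=2):
        super().__init__()
        self.necks = nn.ModuleList(
            nn.Sequential(
                nn.Conv2d(c, mid_channels, kernel_size=1),                        # unify channel count
                nn.Conv2d(mid_channels, mid_channels, kernel_size=3, padding=1),
            ) for c in in_channels
        )
        self.decoder_convs = nn.ModuleList(
            nn.Conv2d(mid_channels, mid_channels, kernel_size=3, padding=1) for _ in in_channels
        )
        self.classifier = nn.Conv2d(mid_channels, num_classes, kernel_size=1)

    def forward(self, feats):                    # feats: [F1, F2, F3, F4], F1 at H/4 x W/4
        target = feats[0].shape[2:]
        fused = 0
        for feat, neck, conv in zip(feats, self.necks, self.decoder_convs):
            y = conv(neck(feat))                 # channels stay at 256
            y = F.interpolate(y, size=target, mode="bilinear", align_corners=False)
            fused = fused + y                    # element-wise addition of the four scales
        return self.classifier(fused)            # per-pixel class logits
```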

Cutting-Edge Detection
After semantic segmentation, the resulting image is converted into a binary image for subsequent processing; a labeled mask image is used here as an example, and the overall process is shown in Figure 7. First, the contours in the binary image are extracted, the areas of all extracted contours are compared, and the contour with the largest area is kept. Proceeding line by line from right to left and from top to bottom, we obtain the boundary points, save all boundary point coordinates, and output the rightmost boundary point. Next, we search the pixels around the rightmost boundary point (x_max, y_max) for cutting-edge fitting. Given the row constraint (Equation (6)) and the column constraint (Equation (7)), the boundary feature points are screened; a boundary point must satisfy both constraints. The feature points near the rightmost boundary point that are to be fitted are thus obtained, and this screening effectively avoids the effects of missing rows and missing rice boundaries. The final cutting-edge is obtained by fitting the selected feature points with the least squares method.
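The cutting-edge extraction pipeline could be sketched with OpenCV as follows. The row and column constraints of Equations (6) and (7) are not reproduced in this excerpt, so the dx/dy windows below are hypothetical stand-ins, and fitting a first-order (straight-line) model is likewise an assumption; only the largest-contour selection, rightmost-point search, feature-point screening, and least-squares fitting follow the described procedure.

```python
import cv2
import numpy as np

def detect_cutting_edge(mask, dx=40, dy=400):
    """Fit the cutting-edge from a binary mask (uint8, 255 = unharvested region).
    dx / dy are hypothetical windows standing in for the row and column
    constraints of Equations (6) and (7), which are not reproduced here."""
    # keep only the largest contour to suppress small false/missed detections
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_NONE)
    largest = max(contours, key=cv2.contourArea)
    clean = np.zeros_like(mask)
    cv2.drawContours(clean, [largest], -1, 255, thickness=cv2.FILLED)

    # scan each row (top to bottom) and record its rightmost foreground pixel
    boundary = []
    for y in range(clean.shape[0]):
        xs = np.flatnonzero(clean[y] > 0)
        if xs.size:
            boundary.append((xs[-1], y))
    boundary = np.array(boundary)

    # rightmost boundary point and the feature points screened around it
    x_max, y_max = boundary[boundary[:, 0].argmax()]
    keep = (np.abs(boundary[:, 0] - x_max) < dx) & (np.abs(boundary[:, 1] - y_max) < dy)
    pts = boundary[keep]

    # least-squares fit x = a*y + b through the selected feature points
    a, b = np.polyfit(pts[:, 1], pts[:, 0], 1)
    return a, b
```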

Evaluation Indexes and Hardware Parameters
The change in the loss value is used to determine whether the model's training is stable. Cross entropy is selected as the loss (Equation (8)) [28]:

$$Loss = -\frac{1}{N}\sum_{i=1}^{N}\sum_{c=1}^{C} y_{i,c}\,\log(p_{i,c}) \quad (8)$$

where N is the number of pixels in the image, C is the number of classes, $y_{i,c}$ is the ground truth label for the i-th pixel and c-th class, and $p_{i,c}$ is the predicted probability for the i-th pixel and c-th class. The performance metrics used to evaluate the model are the Intersection over Union (IOU) (Equation (9)) [29], the accuracy (Acc) (Equation (10)) [28], and the number of model parameters:

$$IOU = \frac{TP}{TP + FP + FN} \quad (9)$$
$$Acc = \frac{TP + TN}{TP + TN + FP + FN} \quad (10)$$

where TP is the number of positive samples correctly classified, TN is the number of negative samples correctly classified, FP is the number of negative samples classified as positive, and FN is the number of positive samples classified as negative. The weight of the SETR model [21] is employed as the pre-training weight of the TCNet model. The experimental setup involves an Intel Core i7-12700F running Windows 11 and an RTX 3080 GPU. The CUDA (version 11.6) and CUDNN (version 8.9.2) libraries provided by NVIDIA are used to accelerate training. Additionally, this work uses the MMsegmentation framework.
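For reference, the metrics of Equations (9) and (10) can be computed from a confusion matrix as in the short sketch below, shown here for the unharvested-region class of a binary mask.

```python
import numpy as np

def iou_and_acc(pred, gt):
    """IOU (Eq. (9)) and Acc (Eq. (10)) for the unharvested-region class of binary masks."""
    tp = np.sum((pred == 1) & (gt == 1))
    tn = np.sum((pred == 0) & (gt == 0))
    fp = np.sum((pred == 1) & (gt == 0))
    fn = np.sum((pred == 0) & (gt == 1))
    iou = tp / (tp + fp + fn)
    acc = (tp + tn) / (tp + tn + fp + fn)
    return iou, acc
```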

Model Training and Unharvested Region Segmentation
The TCNet model is trained for 500 epochs, and Figure 8 presents the corresponding loss values. According to Figure 8, the overall loss decreases as the number of epochs increases because the model's weights are continually updated to fit the data set used in this article. The loss oscillates most strongly around 150 epochs and gradually converges afterwards. After 450 epochs, the loss stabilizes near 0.01. Therefore, training is terminated at 500 epochs, and the model weights at 475 epochs are selected as the final weights.
The trained model achieves 97.88% segmentation IOU and 98.95% Acc for the unharvested region, while the corresponding background segmentation results are 99.63% and 99.81%. The whole model has 10.796 M parameters. Table 1 reports the model's results. Four representative images were selected for the segmentation test, and Figure 9 illustrates the unharvested-region segmentation results: the first row shows the input images, the second row the manually labeled ground truth, and the third row the TCNet segmentation results, with the blue dashed boxes in the third row marking regions with apparent segmentation errors.
In Figure 9a, the rice grows well without obvious lodging or missing seedlings. The boundary between the right side of the unharvested region and the harvested region is relatively clear, and the boundary between the edge of the field and the unharvested region is partially captured at the top of the image. Since the edge of the field is covered with a certain amount of broken straw, the boundary between the field edge and the unharvested region is blurred, making it challenging even for the human eye to locate; the model therefore shows some misdetection at this boundary. The input image in Figure 9b exhibits minor lodging, with the boundaries of the unharvested region slightly obscured by scattered straw residues. In this case, the model segments the unharvested region well, and the straw residues do not affect boundary extraction. It is worth noting that part of a sunshade appears in the upper left corner of the image, and the model misses some detections at the boundary because of it. In Figure 9c there is slight lodging in the input image; the overall segmentation is good, but there are some missed detections at the border between the unharvested region and the harvester's cutting table in the lower left corner. In the input image of Figure 9d, severe lodging exists right before the harvester. The lodged rice is oriented in different directions, and part of it covers the harvested region. To avoid missing lodged rice, we label the lodged rice lying in the harvested region as unharvested. In this scenario, the model effectively segments the unharvested region, the mixing of lodged rice and background does not affect the segmentation, and only a small erroneous segmentation appears at the lodged rice boundary in the upper right corner. There is also a small amount of missed detection in the upper left corner of Figure 9d, which we attribute to the blurring of the rice features farther from the camera.
Overall, TCNet performs well in segmenting the unharvested regions, with accurate segmentation in regions where the boundaries are clearly defined. When the model is disturbed by the ground or external objects, some erroneous segmentation occurs, but the total falsely detected area is small and located mainly in the upper left and lower left corners of the unharvested region, which does not affect the subsequent cutting-edge detection. In addition, a small amount of false detection occurs when lodged rice obscures the boundary between unharvested and harvested regions, but the model can still segment the overall boundary effectively.

Cutting-Edge Detection Effect
The effective segmentation of unharvested regions ensures the accuracy of cutting-edge detection. Figure 10 depicts the cutting-edge detection results for four representative images: the first row shows the input images, the second row the manually labeled ground truth, the third row the TCNet segmentation results, and the fourth row the cutting-edge extraction results.
The rice grows well in Figure 10a, with a sunshade and other harvesters in the upper left corner. In the model's segmentation, many regions are missed where this interference occurs in the upper left corner, the largest missed area in the test set. There is also a small amount of missed detection in the lower left corner due to interference from the adjusting mechanism of the harvester's cutting table. The detection results indicate that the missed regions do not affect cutting-edge detection because the feature points on the right edge are extracted effectively. In Figure 10b, the harvester is working at the field's boundary. Although there is broken straw interference at the top and right side of the unharvested region, the segmentation of the unharvested region is good, there is no apparent false detection, and the detected cutting-edge fits the unharvested boundary. In Figure 10c, the harvester is harvesting lodged rice, and the segmentation shows that the boundary between the lodged rice and the unharvested region can be effectively separated. The model again misses a small amount in the upper left corner, which does not affect the cutting-edge detection, and the detected cutting-edge fits the boundary of the unharvested region. In Figure 10d, two obvious dividing lines appear on the right side of the unharvested region (marked by the blue box in the second row of (d)), caused by the driver's failure to follow the cutting-edge during the previous harvesting pass. In this case, the segmentation is good and the unharvested region is effectively segmented; however, because the two boundaries are close together, the feature points to be fitted contain pixels from both boundaries, and the detected cutting-edge deviates somewhat from the right boundary. Overall, cutting-edge detection depends on effective segmentation of the right side of the unharvested region, and small-area contours caused by false or missed detection are effectively eliminated through area screening. The row and column constraints effectively filter out noise on the left side of the cutting table and some dividing lines. The overall cutting-edge detection is relatively stable: incorrectly detected unharvested regions do not affect the edge detection, and the edge can also be extracted effectively in field-boundary and lodging scenarios.

Performance Comparison of Different Models
We compare the proposed segmentation model with other advanced semantic segmentation models: PSPNet [30], Deeplabv3 [31], Deeplabv3+ [32], SegFormer [22], and SETR [21]. For the CNN-based models, we replace the main backbone with representative feature extraction backbones. As Table 1 reports, the backbones of PSPNet, Deeplabv3, and Deeplabv3+ are MobileNetV2, ResNet-18, and ResNet-50, respectively, while SegFormer and SETR keep their original MiT-b0 and T-Large backbones. We train the models separately until their loss values stabilize. The evaluation indexes are IOU, Acc, and the number of parameters, and the weights with the best IOU and Acc for each model are selected as the test weights.
Table 1 reports the comparison results. In terms of IOU, TCNet is superior to the other algorithms in background segmentation, and its unharvested-region segmentation equals that of Deeplabv3+. In terms of Acc, TCNet is also better than the other models in background segmentation, while its accuracy on the unharvested region is slightly lower than that of Deeplabv3+. Regarding the number of parameters, SegFormer has the fewest; TCNet has more parameters than SegFormer but fewer than most of the CNN-based models.
For the task of segmenting unharvested regions, background segmentation is generally good, and the differences between models in the background metrics are small; the main differences appear in the foreground. With a similar number of parameters, TCNet is 6.75% and 2.86% higher than PSPNet in IOU and Acc, respectively. In summary, TCNet attains the best evaluation metrics while requiring relatively few parameters. It should be noted that ResNet-50 attains a better Acc than TCNet on unharvested-region segmentation, but its larger parameter count increases the computational cost. Although the lightweight MobileNetV2 and ResNet-18 backbones have fewer parameters, their segmentation metrics for the unharvested region suffer considerably. SegFormer employs the lightweight MiT-b0 backbone, which has the fewest parameters but also lowers the unharvested-region segmentation metrics. SETR has the most parameters and the worst performance of all models because it is a large segmentation model; even though its weights were pre-trained on large data sets, the unharvested-region data set is still too small for the network to learn effective classification features.
Figure 11 illustrates the cutting-edge detection results of the different models on four images. The rice growth in Figure 11a is normal, with interference from a harvester and the field edge at the top of the image. In this scenario, both PSPNet and SETR show large mis-segmented areas. Deeplabv3 and SegFormer miss detections on the right side of the unharvested region due to the interference of the harvester and the field edge, which directly leads to errors in cutting-edge fitting. TCNet and Deeplabv3+ effectively complete the segmentation and cutting-edge detection of the unharvested region without being affected by the harvester or field edge. Figure 11b shows the harvester operating at the field's edge. SegFormer and PSPNet miss some detections, but their final cutting-edge detection is satisfactory. Although SETR has many false detections, they do not affect its cutting-edge detection. Deeplabv3, Deeplabv3+, and TCNet effectively complete segmentation and cutting-edge detection of the unharvested region. Figure 11c presents a normal harvest with much broken straw at the boundary between harvested and unharvested regions. The results show that PSPNet and SETR are prone to interference from broken stalks, which leads to false detection in the harvested region; nevertheless, PSPNet still completes cutting-edge detection effectively, indicating that the cutting-edge detection algorithm has a certain stability. SegFormer has partial missed detections in the upper part of the unharvested region that do not affect cutting-edge detection. Deeplabv3, Deeplabv3+, and TCNet all effectively complete unharvested-region segmentation and cutting-edge detection, and the broken stalks do not affect the cutting-edge fitting. Figure 11d illustrates a lodged region during harvest. The results reveal that PSPNet and SETR cannot effectively segment lodged rice, while the other algorithms extract the cutting-edge effectively. It is worth noting that TCNet misses some detections in the upper left corner of the unharvested region, while SegFormer, Deeplabv3, and Deeplabv3+ show no obvious misdetection in this image. According to the cutting-edge detection results, Deeplabv3+ and TCNet complete cutting-edge detection effectively, while the other models show varying degrees of error. For convolutional neural networks, increasing the number of layers can improve the segmentation effect, but the number of parameters also increases; thus, although Deeplabv3+ segments better than Deeplabv3 and PSPNet, its parameter count is roughly three times larger. The Transformer-based SegFormer still shows good segmentation performance with few parameters, but as Figure 11 shows, it is prone to missed detection caused by the lack of local feature information. TCNet retains the advantages of both convolutional neural networks and Transformers, achieving the segmentation performance of a deep convolutional neural network (ResNet-50) with fewer parameters than lightweight convolutional networks. Compared with SegFormer, TCNet significantly reduces the missed area in the unharvested region and is more robust to external interference (Figure 11a). The test results show that a Transformer combined with a CNN can maintain high segmentation performance with a low parameter count.

Discussion
Overall, TCNet completes the segmentation of unharvested regions well. Part of the missed detection is due to external interference, such as other harvesters or sunshades, and the small missed and falsely detected regions do not affect the cutting-edge detection results. The feature points obtained through area thresholding and constraint screening are accurate and stable and can effectively fit the cutting-edge.
Compared with previous cutting-edge detection methods based on image processing, TCNet is more generalizable and stable, and its performance is not easily affected by outdoor conditions such as light or shadow. It also adapts better to unharvested lodged rice than stereo-camera-based cutting-edge detection: a stereo camera fits the cutting-edge from crop height and therefore easily classifies unharvested lodged rice as a harvested region. Compared with existing deep learning-based cutting-edge detection, TCNet achieves better detection performance with fewer parameters.
TCNet's training data come from a single region, so in real-world applications its detection accuracy for other regions and rice varieties may be affected. In addition, different harvester operating speeds, cutting table heights, and harvesting patterns may also affect cutting-edge detection performance. Follow-up work should gradually add more samples to improve TCNet's robustness to unseen data.

Conclusions
TCNet achieves 97.88% segmentation IOU and 98.95% Acc for the unharvested region, while the corresponding background segmentation results are 99.63% and 99.81%; the whole model has 10.796 M parameters. The model segments the unharvested region well, and disturbances from lodging, broken straw, and the ground do not affect the segmentation. A small amount of detection is missed at the edge of the field because the rice features there are blurred, which affects the model's pixel classification. TCNet has better segmentation accuracy and therefore better cutting-edge detection than models employing a lightweight CNN as the backbone, and it segments local regions better than a pure Transformer segmentation model. Compared with the deep convolutional neural network ResNet-50, TCNet achieves the same performance with less than a quarter of the parameters.
Overall, TCNet provides an effective technical route for cutting-edge detection. The Transformer has excellent global attention, and by integrating the local features extracted by the CNN, the shallow network achieves the performance of a deep convolutional neural network. The proposed TCNet realizes effective cutting-edge detection with a low number of parameters and can provide reliable reference information on whether the harvester is operating along the cutting-edge, reducing loss and improving the efficiency of mechanized harvesting. In addition, the cutting width can be calculated in real time by further locating the cutting-edge, which provides effective technical support for real-time monitoring of the harvester's feed rate.

Figure 2. Overall structure of the model.



Figure 6 illustrates the fusion block, which integrates the local and global features. The local features extracted by the convolutional block and the global features extracted by the Transformer block are fused through a concat operation, and the combined features are reduced in dimension by a 1 × 1 convolution so that the number of channels stays constant. The subsequent Transformer block then extracts global features after the patch embedding operation. The proposed fusion block aims to supplement the global features extracted by the Transformer block with the local features extracted by the convolutional block; therefore, the features processed by the fusion block are fed only into the next Transformer block, not into the convolutional block.
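A minimal sketch of the fusion block, assuming both feature maps share the same spatial size and channel count:

```python
import torch
import torch.nn as nn

class FusionBlock(nn.Module):
    """Concatenate local (CNN) and global (Transformer) features, then restore the channel count with a 1x1 conv."""
    def __init__(self, channels):
        super().__init__()
        self.fuse = nn.Conv2d(channels * 2, channels, kernel_size=1)

    def forward(self, local_feat, global_feat):          # both: (B, C, H, W)
        fused = torch.cat([local_feat, global_feat], dim=1)
        return self.fuse(fused)                          # fed only to the next Transformer block
```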

Figure 9. Effect of rice unharvested region segmentation. The serial numbers (a-d) indicate the results under different images.

Figure 10. Effect of cutting-edge detection. The serial numbers (a-d) indicate the results under different images.

Figure 11. Cutting-edge detection effects of different models. The serial numbers (a-d) indicate the results under different images.
The corresponding channel numbers are 32, 64, 128, and 256, and the corresponding resolutions are H/4 × W/4, H/8 × W/8, H/16 × W/16, and H/32 × W/32.

Table 1. Model comparison results (the optimal index is indicated in bold).