An Enhanced SegNeXt with Adaptive ROI for a Robust Navigation Line Extraction in Multi-Growth-Stage Maize Fields

Zhai, Yuting; Gao, Zongmei; Li, Jian; Zhou, Yang; Xu, Yanlei

doi:10.3390/agriculture16030367

Open AccessArticle

An Enhanced SegNeXt with Adaptive ROI for a Robust Navigation Line Extraction in Multi-Growth-Stage Maize Fields

by

Yuting Zhai

¹,

Zongmei Gao

²

,

Jian Li

¹,

Yang Zhou

¹

and

Yanlei Xu

^1,*

¹

School of Information Technology, Jilin Agricultural University, Changchun 130012, China

²

Department of Biological Systems Engineering, Washington State University, Pullman, WA 99164-6120, USA

^*

Author to whom correspondence should be addressed.

Agriculture 2026, 16(3), 367; https://doi.org/10.3390/agriculture16030367

Submission received: 14 January 2026 / Revised: 29 January 2026 / Accepted: 3 February 2026 / Published: 4 February 2026

(This article belongs to the Section Artificial Intelligence and Digital Agriculture)

Download

Browse Figures

Versions Notes

Abstract

Navigation line extraction is essential for visual navigation in agricultural machinery, yet existing methods often perform poorly in complex environments due to challenges such as weed interference, broken crop rows, and leaf adhesion. To enhance the accuracy and robustness of crop row centerline identification, this study proposes an improved segmentation model based on SegNeXt with integrated adaptive region of interest (ROI) extraction for multi-growth-stage maize row perception. Improvements include constructing a Local module via pooling layers to refine contour features of seedling rows and enhance complementary information across feature maps. A multi-scale fusion attention (MFA) is also designed for adaptive weighted fusion during decoding, improving detail representation and generalization. Additionally, Focal Loss is introduced to mitigate background dominance and strengthen learning from sparse positive samples. An adaptive ROI extraction method was also developed to dynamically focus on navigable regions, thereby improving efficiency and localization accuracy. The outcomes revealed that the proposed model achieves a segmentation accuracy of 95.13% and an IoU of 93.86%. The experimental results show that the proposed algorithm achieves a processing speed of 27 frames per second (fps) on GPU and 16.8 fps on an embedded Jetson TX2 platform. This performance meets the real-time requirements for agricultural machinery operations. This study offers an efficient and reliable perception solution for vision-based navigation in maize fields.

Keywords:

segmentation model; navigation line extraction; adaptive ROI; RANSAC

1. Introduction

The autonomous navigation of agricultural machinery predominantly relies on Global Navigation Satellite System (GNSS) and real-time kinematic (RTK) technologies in recent years [1,2,3]. However, this technology lacks a field environment sensing module to accurately locate the field crops’ position and necessitates global path planning for the operational area prior to each use [4,5]. In contrast, visual navigation technology utilizes sensors mounted in front of the wheels to facilitate real-time path planning between crop rows, thereby minimizing mechanical damage to crops caused by the wheels. Its low computational cost and wide applicability make it a popular choice for crop row detection and navigation in agriculture [6,7].

In crop row extraction research, target detection and semantic segmentation are two technical tools that effectively identify and extract crop row information. Recent studies have employed deep learning-based detection networks, such as improved YOLO variants, to extract navigation trajectories by identifying crop plants and using the center points of detection bounding boxes as reference features for path fitting [8,9,10]. While effective for crop row extraction, these methods require extensive manual labeling of individual plants, which is highly time-consuming. Moreover, the accuracy of such center point-based fitting heavily depends on annotation quality. To reduce labeling burden, semantic segmentation methods that only require row-level labels have been explored as a more efficient alternative [11]. Guo et al. further formulated maize crop-row detection as a parameter prediction problem and introduced a transformer-based parameter prediction framework to model long-range row geometry for visual navigation [12]. Cheng et al. developed a Swin-Transformer-enhanced segmentation network for full-width harvesting scenarios and extracted navigation lines from the segmented harvesting boundaries, improving robustness in complex field conditions [13]. Yang et al. proposed a method for potato crop row detection using a VGG16-based U-Net with an adaptive midpoint fitting algorithm [14]. Wei et al. developed a lightweight neural network GhostNet along with a Gaussian cross-entropy loss function to automatically extract image features and track early maize lanes [15]. In maize, growth is commonly divided into vegetative (V) and reproductive (R) phases. Most vision-based row-extraction studies concentrate on early vegetative stages (e.g., seedling/early-leaf), when inter-row corridors are clearly visible; in later vegetative stages, leaf expansion increases row overlap, occlusion, and shadows. These stage-dependent changes exacerbate scale variation and foreground–background ambiguity, thereby requiring both long-range context modeling along row structures and robust preservation of fine-grained boundaries. However, the convolution operation in the semantic segmentation network is performed on localized regions of the image, which constrains the network’s ability to extract features from ROIs, such as crop rows that consist of multiple adjacent polygons spanning the entire image. Consequently, optimization is required for agricultural machinery navigation in complex environments.

The plant centroids are extracted based on the results of semantic segmentation of crop rows, after which a straight-line fitting method is employed to determine the optimal route for autonomous navigation. Commonly utilized straight-line fitting methods include the Hough Transform (HT), Least Squares Method (LSM), and Random Sample Consensus (RANSAC). García et al. applied HT in localized ROIs to detect both straight and curved crop rows in maize fields during early growth stages [16]. However, HT is computationally expensive and impractical for real-time applications. Liu et al. adopted LSM to fit crop row centerlines after classifying centroids, though the method is sensitive to outliers and performs poorly in areas with missing plants [17]. Fu et al. grouped feature points using region-growing clustering and fitted seedling row centerlines with RANSAC, providing a navigation baseline for field robots [18]. RANSAC iteratively estimates model parameters from subsets of data, making it robust to noise and capable of handling up to 50% outliers [19]. Nevertheless, its iterative nature leads to high computational costs, limiting real-time performance in agricultural settings.

To effectively address the challenges associated with the visual navigation of agricultural machines across multiple maize growth stages, we present the following contributions: (1) A novel crop segmentation method is introduced to extract crop row features through the application of strip convolution, complemented by local position and multi-scale fusion attention mechanism coding to mitigate the interference from complex backgrounds and inter-row weeds, thereby improving the model’s robustness. (2) An adaptive region of interest extraction method is proposed to identify the navigation region, eliminating the necessity for subsequent localization point planning across different crop rows, which significantly reduces computational demands and enables real-time navigation. (3) An RANSAC algorithm is enhanced by incorporating gradient direction constraints on the randomly selected samples, enabling the identification of a high-quality centroid fitting navigation line within the ROI region. This improvement provides reliable route information, offering guidance for field cultivation, fertilization, and other agricultural operations.

2. Materials and Methods

2.1. Data Collection

Maize crops are cultivated in fields characterized by regular parallel structures and spaced line patterns, which facilitate field operations and management while also providing essential visual information for subsequent crop row segmentation model learning. This paper focuses on the growth stage of maize from seedling to mid-planting period, during which the maize leaves have not yet closed the field paths, resulting in minimal overlap or shading between different crop rows. This period is also critical for spraying and fertilizing in the field. The experiment utilized a homemade platform equipped with a fixed D435i depth camera (Intel^® RealSense™, Intel Corporation, Santa Clara, CA, USA), with a resolution 1920 × 1080 pixels, capturing RGB images of maize to establish a maize image database in the experimental field of Jilin Agricultural University, located in Changchun City, Jilin Province (longitude: 125.41, latitude: 43.81), where maize rows are spaced 60 cm apart, as illustrated in Figure 1. The images are saved in JPG format, totaling 2400 images.

Figure 1a presents maps of various growth periods of the crop. These images were captured on 2 June 2024 (first stage, average height 15 cm), 17 June 2024 (second stage, average height 30 cm), 1 July 2024 (third stage, average height 45 cm), and 20 July 2024 (fourth stage, average height 60 cm). Figure 1b illustrates images captured at varying angles by positioning the camera atop a platform approximately 1.5 m above the ground. Through optimization of the camera’s position and angle, over-capturing of crop rows, which could increase computation time, and too short crop rows, leading to significant centerline deviation, were prevented. An acquisition angle of 45 degrees below the horizontal plane yielded images most suitable for crop row segmentation. The platform was moved at a constant speed, generating stable video sequences that were extracted into images at a rate of 5 frames per second. These sequences encompassed a variety of scenarios within the maize rows, including weeds, broken rows, cloudy days, and normal conditions, as depicted in Figure 1d. Although the dataset contains 2400 images, it was collected across four growth stages on four different dates and includes multiple in-field scenarios (e.g., weeds, broken rows, cloudy/normal illumination) from continuous video streams sampled at 5 fps, which provides substantial intra-field variability for training and testing. The dataset was split into training/validation/testing sets at a ratio of 8:1:1. To avoid stage bias, the split was performed in a growth-stage stratified manner, ensuring that each subset contains samples from all four growth stages. In addition, to reduce temporal correlation, frames originating from the same continuous video segment were assigned to the same subset.

Deep learning methods can yield accurate pixel segmentation results. However, the leaves of the crop canopy exhibit characteristics such as dispersion, crossing, and asymmetry, which significantly interfere with the extraction of crop canopy row detection lines. Additionally, a substantial amount of labeled data is required for the training of irregularly shaped maize plants. Therefore, we proposed an edge-fit labeling strategy, rather than employing the traditional semantic segmentation labeling method, Crop Row Coverage Area (CRCA). This new approach involves drawing polygonal frames immediately above and below the edges of each crop row in the image, enabling comprehensive labeling of the entire crop rows, referred to as the edge-fit line (EFL) labeling method, as illustrated in Figure 1c. This technique ensures continuous labeling of crop rows, even in the presence of missing seedlings, thereby addressing the issue of inaccurate route extraction in fragmented row regions. Furthermore, this strategy effectively distinguishes between different crop rows, preventing interference from crops in adjacent rows during the centerline fitting process of the current crop row line.

2.2. LMF-SegNext Crop Row Extraction Model

In the field of visual navigation, semantic segmentation networks are frequently employed for scene segmentation, with their accuracy significantly influencing subsequent navigation line extraction, which ultimately affects the overall precision of machine vision navigation [20]. Classical segmentation network models include UNet [21], Deeplabv3 [22], CCNet [23], and DANet [24]. Given that the experiments necessitate model deployment on an embedded device, specifically the Jetson TX2, several lightweight networks were also selected, including BiseNet [25], ERFnet [26], Segmenter [27], Fast-rcnn [28], and SegNeXt [29]. A detailed description of the experiments and the performance comparison of each model is provided in Section 3.1. After a comprehensive evaluation, SegNeXt demonstrates distinct advantages across several assessment metrics.

The SegNeXt network employs MSCA (Multi-Scale Context Attention) as its encoder, which comprises three primary components: deep convolution for aggregating local information; multi-branch deep strip convolution for capturing multi-scale context; and 1 × 1 convolution for modeling the relationships between different channels. The network employs Hamburger layers, which are based on matrix decomposition, to enhance the computational efficiency and feature extraction process. Notably, the strip convolution effectively extracts strip-like feature information, rendering it particularly suitable for the task of crop strip segmentation. Furthermore, the feature map output from Stage 1 of the network architecture contains an excess of low-level information. To address this, the decoder utilizes matrix decomposition to aggregate only the features from the last three stages (Stages 2–4) for global spatial information modeling, thereby enhancing both performance and computational efficiency. While the network is adept at fusing global contextual information, it struggles with extracting fine-grained local features. To mitigate the deficiency of local coded information in the SegNeXt model, which hampers the perception of crop row image features, this paper introduces the LMF-SegNext network. This new architecture aims to enhance the fine-grained segmentation capability of crop row boundaries, with the Attention module stacking configured to (2, 1, 3, 2) to reduce model parameters and floating-point computations. For clarity, throughout this paper, “SegNeXt” denotes the original SegNeXt configuration in (standard backbone without our stacking reduction). We refer to the variant that only adjusts the Attention-module stacking to (2, 1, 3, 2) as “Slim SegNeXt”. Our LMF-SegNeXt is built upon Slim SegNeXt. Beyond the attention stacking configuration, LMF-SegNeXt introduces three optimization changes with respect to the original SegNeXt. (i) We insert a Local Block after the MFA features to explicitly enhance fine-grained contour cues via multi-level pooling and cross-layer feature complementarity, mitigating the loss of local boundary details that is often observed in SegNeXt under perspective shrinkage and fragmented rows. (ii) We design a MFA module as a lightweight feature calibration unit before the decoder upsampling heads, which performs cross-scale spatial–channel collaborative attention to adaptively fuse local details and global semantics. (iii) We replace the standard cross-entropy with Focal Loss to alleviate severe foreground–background imbalance in seedling-belt masks, enabling the model to focus on hard pixels and sparse positive samples. The architecture is illustrated in Figure 2.

2.2.1. Local Block

The Local block is a refined local semantic feature extraction module designed for the LMF-SegNext model, which progressively aggregates multilevel features through pooling layers to enhance the contour features of crop rows. Initially, the output features from the MFA module are input into the Local block following batch normalization, where they are downscaled using 1 × 1 convolution to reduce computational complexity. Subsequently, a parallel maximum pooling operation is introduced, with all sensory field sizes set to 5 × 5. The Local Block adopts a 5 × 5 pooling window to enhance fine-grained boundary cues while limiting feature mixing across adjacent crop rows. This setting provides stable boundary continuity and robustness to local noise with good computational efficiency. This configuration mitigates issues such as image distortion resulting from image processing operations and addresses the problem of convolutional neural networks extracting redundant features from the image. Through a bottom-up connection, the pooled feature map is fused with its corresponding upper-layer feature map, thereby enhancing the local information localization capability of the deep features and alleviating the common issue of detail loss in deep feature maps. In this cross-layer connected feature fusion process, a 3 × 3 convolution operation is applied to each scale feature map extracted post-pooling, providing a more comprehensive and rich feature representation for the network. Additionally, a jump connection is established between the modular input and the deeper features, allowing for the fusion of shallow feature maps that contain rich semantic information with the pooled feature maps. This approach further increases the utilization of underlying information, ensuring that the output nodes of the network retain critical information from the shallow edges to a greater extent. Finally, the output dimension of the module is adjusted using 1 × 1 convolution. The constructed Local module enhances the complementarity among different feature maps, enabling the simultaneous capture of local detail information and assisting the network in comprehensively understanding image content, which is essential for the accurate segmentation of crop rows across various environments. The specific structure is shown in Figure 3.

2.2.2. Multi-Scale Fusion Attention

Building upon the refined local feature extraction of the Local Block, we propose the Multi-scale Fusion Attention (MFA) module to address the model’s limitations in modeling multi-scale contextual dependencies and long-range spatial relationships within complex agricultural scenes. This module establishes a cross-scale spatial-channel collaborative attention mechanism to dynamically calibrate and fuse local details with global semantics without significantly increasing computational overhead, thereby enhancing the model’s ability to model irregular crop row structures. The module aims to establish a cross-scale spatial-channel collaborative attention mechanism. Without significantly increasing computational overhead, it dynamically calibrates and fuses local details with global semantics, further enhancing the model’s segmentation robustness for irregular crop row structures. Its architecture is shown in Figure 4. MFA is positioned before the decoder’s upsampling and prediction heads as a feature calibration unit, performing global reorganization and refinement of the decoder’s current layer aggregated features. This module is designed to address the limited spatial relationship modeling capability of the decoder’s MLP structure, complementing the Local Block in the encoder that focuses on local contour modeling. This synergy enhances the model’s ability to understand the overall structure of crop rows in complex agricultural environments.

The MFA takes the feature map output from the decoder MLP as input, first extracting feature representations across different receptive fields through a parallel multi-scale context modeling architecture. The module comprises three parallel branches: the local branch employs 3 × 3 depthwise separable convolutions to capture fine-grained spatial details such as crop row edges and local textures; the intermediate branch utilizes 3 × 3 dilated depthwise separable convolutions with a dilation rate of 2 to expand the receptive field without significantly increasing computational overhead, modeling intermediate spatial relationships between crop row segments; The global branch extracts global contextual features through global average pooling combined with 1 × 1 convolutions, representing overall vein distribution and scene semantic information, then upscales to restore spatial resolution matching the input features. The three feature channels are concatenated to form

F_{m s}

.

This multi-scale parallel architecture enables MFA to simultaneously perceive multi-level information ranging from local details to global structures. Building upon multi-scale feature fusion, MFA further introduces a channel-guided spatial attention generation mechanism. First, the fused feature

F_{m s}

undergoes global average pooling to obtain channel-level features. A lightweight fully connected network then learns inter-channel dependencies to generate a channel weight vector s. Unlike traditional channel attention that directly performs channel recalibration on features, MFA uses this channel weight as a guidance signal to modulate multi-scale spatial features. Specifically, s is applied to

F_{m s}

via per-channel weighting and aggregated through 1 × 1 convolutions to generate the spatial attention weight map

M_{s}

. This ensures that spatially significant regions are jointly determined by channel relationships infused with multi-scale semantic information, enabling effective channel-guided spatial attention. Subsequently, MFA performs spatially weighted global average pooling on the original input features using the generated spatial attention map, yielding more discriminative channel statistics.

2.2.3. Focal Loss

In the actual corn crop row dataset collected, since crop rows typically exhibit elongated and continuous spatial structures, their proportion at the pixel level is significantly smaller than that of background areas. This leads to a pronounced class imbalance problem. Positive samples (crop row pixels) are markedly fewer than negative samples (background pixels). When trained directly using standard cross-entropy loss, the model tends to minimize overall loss by prioritizing fitting the dominant background category. This results in insufficient attention to crop row regions during training, impairing the model’s segmentation performance for the minority class structure.

To mitigate this training bias caused by class imbalance, this paper employs Focal Loss as the model’s loss function. Focal Loss dynamically modulates the contribution of sample gradients at the loss function level, guiding the model to focus more on hard-to-classify samples and minority class samples during optimization. For pixel-level binary segmentation tasks, it is defined as:

L_{f o c a l} = {- α (1 - p_{t})}^{γ} l o g (p_{t})

(1)

Here,

p_{t}

denotes the model’s predicted probability for the true class,

α

is the class balancing factor used to adjust the relative weight of positive and negative samples in the loss function, and

γ

is the modulation factor controlling the suppression level of losses for easily classified samples. In all experiments, the Focal Loss hyperparameters were fixed to

α = 0.25

and

γ = 2.0

.

During training, the modulation factor

{(1 - p_{t})}^{γ}

significantly reduces the loss weight for samples with high prediction confidence and low classification difficulty, while assigning greater gradient contribution to samples with uncertain predictions or classification errors.

2.2.4. Adaptive ROI Extraction Scheme

To maintain class balance during training, while only the central seedling band is utilized for navigation guidance, all visible seedling bands in the training images are annotated to provide sufficient positive samples (crop regions) against the negative background samples. Upon completion of the semantic segmentation inference phase, the network generates pixel-level segmentation that labels all crop rows within the image. The crop row information in the resulting binary image can then be utilized to fit the centerline of the crop rows. However, performing route extraction for every crop row in the image would significantly increase computational burden. To improve the efficiency of route extraction, an ROI extraction method was proposed, which automatically eliminates redundant labels while retaining only the crop rows essential for navigation as the region of interest.

The process is initiated by extracting all contours from the binary image. A bottom buffer zone (BBZ) is defined to select candidate seedling bands that intersect with this region; those that do not overlap with the BBZ are considered part of the background. The BBZ is defined as follows:

B B Z = \{(x, y | x \in [0, W], y \in [y_{m a x} - 0.2 H, y_{m a x}])\}

(2)

where

W

and

H

represent the width and height of the image, respectively. y_max denotes the maximum value of the image’s vertical axis.

The proposed adaptive ROI extraction algorithm is based on an edge-fitting labeling strategy. Due to constraints imposed by camera angle and height, the number of seedling bands intersecting the BBZ typically does not exceed four. If one or two bands remain after filtering, they are directly designated as the ROI. If more than two bands are present, the central band (in the case of an odd number) or the two central bands (in the case of an even number) are selected as the ROI. Finally, regression fitting is applied to the pixel points within the selected ROI to determine the navigation path.

2.3. Navigation Line Fitting

After identifying the ROI for the crop row, the image is segmented into 50-pixel-wide strips that are evenly spaced in the vertical direction. Each strip encompasses a segment of the ROI feature, and the intersection points of the strip with this feature segment are calculated. The midpoints of the four intersection points are then designated as the center point of the crop row [30,31]. This paper proposes a gradient-constrained RANSAC (G-RANSAC) algorithm for extracting crop row centerlines in maize fields under weedy or disrupted row conditions. The core idea is to use gradient direction to constrain the randomly sampled points within the RANSAC process, thereby improving the selection of high-quality inliers and enhancing the reliability of the final model.

Due to the agricultural machinery traveling parallel to the seedling rows, the centerline may occasionally appear vertical in the image, resulting in an infinite slope. Therefore, a linear model of the form

y = a x + b

is used to represent the seedling centerline [32]. The initial value of the dataset utilized for line fitting is typically set to 2, meaning that only the first two points yielding the best fitting results are employed as initial values [33]. The gradient direction

G_{1}

and

G_{2}

are computed based on the coordinates of the two randomly selected points

P_{1} (x_{1}, y_{1})

and

P_{2} (x_{2}, y_{2})

. Specifically, the gradient direction for each point is calculated using the direction vector between the two points, which is determined by the difference in their coordinates. The gradient direction

G

is then computed as the angle of the direction vector relative to the horizontal axis:

G = a r c t a n \frac{y_{2} - y_{1}}{x_{2} - x_{1}}

(3)

If the two randomly selected points lie on the same straight line, the absolute value of the difference between their gradient directions must be less than a specified threshold T. In this study, T was empirically set to 4° based on the comparative analysis. The straight-line estimation model is computed using these two points, and the average of their gradient directions is taken as the principal direction of the straight line, denoted as

Y_{g}

. The remaining points are then iterated through; if both conditions outlined in Equations (4) and (5) are satisfied, the point is classified as a local point:

|G_{i} - Y_{g}| < T

(4)

D_{i} = \frac{|y_{i} - a x_{i} - b|}{\sqrt{1 + a^{2}}}

(5)

where

G_{i}

is the gradient direction of a point and

D_{i}

is the distance from the point to the straight line. The number of in-bound points is constrained by defining a threshold k as one of the stopping conditions of the algorithm. To determine the value of k, 20 samples were randomly selected from each ROI of the dataset and a total of 100 samples were obtained. The proportion of interior points was manually counted, with a mean value of 0.83 and a minimum value of 0.75. k was set to 0.78, which allowed the samples to be balanced between the mean value and a conservative lower bound on its minimum value. In addition, the calculation of the iteration termination number

n

based on the prior probability

P

can also stop the algorithm early and improve the fitting accuracy, as shown in Equation (6):

n = \frac{\log (1 - P)}{\log (1 - k^{2})}

(6)

where the a priori probability

P

is usually set to 0.99 [34], which ensures that there is a 99% probability of successful sampling at least once. Therefore, the best model can be obtained in no more than 5 iterations.

Let

N

denote the number of centroid points within the selected ROI. In each iteration, a line hypothesis is generated from two sampled points and all

N

points are checked to count inliers, resulting in

O (N)

time per iteration (as in standard RANSAC). Compared with vanilla RANSAC, G-RANSAC adds an

O (1)

gradient-consistency pre-check

(∣ θ_{1} - θ_{2} ∣ < T)

on the sampled pair; hypotheses that are unlikely to be correct are discarded before the full inlier evaluation, which reduces the expected runtime in practice. With the inlier ratio

w \geq k

and success probability

p = 0.99

, Equation (5) yields an iteration bound

I \leq 5

under our setting. Therefore, the overall fitting complexity is

O (I N)

, and with bounded

I

, the fitting stage is effectively linear in

N

.

3. Results and Discussion

To evaluate the performance of the proposed method presented in this paper, a series of experimental comparisons was conducted. The server device used for model training operated on Windows 10, with the software environment configured as Python 3.8 and PyTorch 2.0.0. The hardware utilized an Intel i7-7820X CPU, featuring a main frequency of 3.60 GHz, and was equipped with two Titan Xp GPUs, each possessing 12.0 GB of video memory and running CUDA version 12.0. During the model training, images were scaled to a resolution of 1024 × 1024 pixels. The learning rate was initially set to 0.01 and was dynamically adjusted using the PolyLR strategy, while Stochastic Gradient Descent (SGD) served as the optimizer. The training process comprised 100 epochs, with a batch size of 4, a momentum of 0.9, and a weight decay of 0.0005.

3.1. Comparison of Semantic Segmentation Models

To evaluate the performance of various models, nine widely used segmentation networks were trained and tested on our maize crop row dataset, with the results presented in Table 1. The reported values are computed on the held-out test split and represent averages over images from all four growth stages described. The focus was on the Seed class, as it directly impacts navigation-line fitting, which relies on accurate segmentation of the seedling-belt mask. We reported key metrics, including Intersection over Union (IoU), Accuracy (ACC), and Frames Per Second (FPS), among others, to provide a comprehensive assessment. The table indicates that, in terms of segmentation accuracy, the seedling band identification accuracy and segmentation intersection and merger ratio of the LMF-SegNext model were 95.13% and 93.86%, respectively, surpassing the corresponding indices of the other segmentation models. Regarding model parameters, the DeepLabv3 (65.74 MB), DaNet (47.46 MB), and CCNet (47.45 MB) models are relatively large, limiting their application in real-time crop row recognition scenarios. In contrast, the LMF-SegNext model has a parameter size of 13.21 MB, which, while larger than that of the ERFNet and BiseNetv2 models, is justified as the additional parameters primarily enhance recognition and segmentation accuracy. From a navigation perspective, the Seed class is the only region directly used for computing centroids and fitting the navigation line. Therefore, segmentation reliability for this specific class is more critical than overall visual quality. As shown in Table 1, LMF-SegNeXt improves the Seed IoU to 93.86% and the ACC to 95.13%. This represents an increase of 4.80% in IoU and 2.10% in ACC over ERFNet, and an improvement of 3.08% in IoU and 2.12% in ACC over BiSeNetv2. These improvements are particularly meaningful for agricultural navigation, as even small gains in Seed IoU typically lead to less fragmented masks and more stable centroid distributions, which directly benefits subsequent line fitting. Furthermore, LMF-SegNeXt achieves lower computational complexity with 18.67 GFLOPs, compared to 29.1 GFLOPs for ERFNet and 24.6 GFLOPs for BiSeNetv2, while maintaining fast inference at 3.6 s per 100 images. When integrated with ROI extraction, centroid computation, and centerline fitting, the complete pipeline requires only 37.03 ms per frame, enabling a processing rate of 27 frames per second. On-device testing on the Jetson TX2 platform confirms the feasibility of real-time performance. Therefore, the LMF-SegNext model effectively ensures real-time recognition of maize crop rows while improving recognition accuracy.

To illustrate the differences among the models in crop row segmentation, particularly regarding detail processing, edge recognition, and overall segmentation accuracy. Figure 5 presents the results of various models in the crop row segmentation task. As shown in Figure 5, the Fast-SCNN, DaNet, BiseNetv2, and DeepLabv3 models are prone to producing broken rows during segmentation. In contrast, the UNet and Segmenter models may incorrectly classify weed regions as crop row regions, which leads to multiple crop rows merging together. The ERFNet and CCNet models exhibit significant fluctuations in crop row edge segmentation. Although the SegNext model reduces the misidentification of weeds, the presence of perspective distortion causes the crop rows at the edges of the image to appear smaller, resulting in broken rows. Conversely, the LMF-SegNext model effectively models global airspace information by aggregating multi-stage features, allowing for a more comprehensive analysis of contextual semantic information. The strip convolution within its MFA module effectively extracts crop row features, while the Local module further enhances the model’s local feature extraction capabilities. Consequently, when the LMF-SegNext model is employed for crop row recognition, the edge contours of the segmented crop rows are smooth and clear, which reduces the incidence of broken rows and misidentification of weed areas, thereby ensuring that agricultural machinery can accurately recognize the intermediate crop rows during operation.

3.2. Evaluation of LMF-SegNext Performances

To systematically evaluate the impact of the Local module, Focal Loss, and the multi-scale fusion attention module on the performance of the LMF-SegNeXt model, this paper designed multiple sets of ablation experiments based on the original SegNeXt model. These experiments tested: (i) SegNeXt (Original); (ii) Slim SegNeXt; (iii) Slim SegNeXt + Focal Loss; (iv) Slim SegNeXt + Focal Loss + Local; and (v) LMF-SegNeXt (Slim SegNeXt + Focal Loss + Local + MFA). The pixel-level segmentation accuracy and loss evolution curves during training for each model are shown in Figure 6.

When using SegNeXt as the baseline, the model achieved a mean IoU of 92.04% and a maximum IoU of 96.08%, accompanied by a mean loss of 0.21 and a maximum loss of 0.76. These results indicate that while the model can attain high-quality segmentation in complex field conditions, it remains sensitive to background interference and structurally ambiguous samples during training. The introduction of an edge-aligning crop row annotation strategy provides strong geometric supervision, shifting the model’s focus toward structural continuity. Under this guidance, incorporating Focal Loss effectively enhanced learning in challenging regions such as broken rows, elevating the mean IoU to 92.58% and the maximum IoU to 96.19%, while reducing the mean loss to 0.14 and the maximum loss to 0.49. Further integration of the Local module reinforced spatial constraints aligned with the crop row direction. This enabled more consistent recovery of local continuity in ambiguous scenarios, raising the mean IoU to 93.57% and the maximum IoU to 96.80%, and further lowering the mean loss to 0.12 along with the maximum loss to 0.41. Finally, the multi-scale fusion attention mechanism was incorporated to dynamically integrate contextual information across scales. This allowed the model to maintain structural guidance while enhancing inference capability in discontinuous regions, achieving a mean IoU of 93.86% and a maximum IoU of 97.12%, with the mean loss significantly decreased to 0.09 and the maximum loss recorded at 0.41. This progressive refinement demonstrates that the combined contributions of structured supervision and multi-scale feature adaptation enable robust and stable convergence, even under challenging field variations.

In order to validate the edge-preservation benefit introduced by the Local module, we performed a dedicated boundary-aware evaluation. Specifically, we introduced two boundary-aware measures: (1) Boundary F1 (BFScore): extract 1-pixel boundaries from the prediction and ground truth, then compute precision/recall by counting boundary pixels whose nearest counterpart lies within a tolerance τ pixels, and report the corresponding F1 score; (2) Boundary IoU (B-IoU): build a boundary band by dilating the ground-truth boundary with radius r, restrict both masks to this band, and compute IoU within the band. We set τ = 5 pixels and r = 5 pixels for all experiments. The ablation study results for the Seed class, presented in Table 2, provide clear evidence of improvement. Incorporating the Local module elevates the BFScore from 0.89 to 0.91 and raises the B-IoU from 0.81 to 0.84. This enhancement stems from the Local Block’s ability to aggregate fine-grained spatial features, a capability that is further reinforced by our structure-guided training pipeline utilizing edge-aligning annotations, which explicitly emphasize geometric contours during optimization. Consequently, the module not only improves region overlap but, more critically, strengthens boundary fidelity. The resulting segmentation masks exhibit superior contour integrity and alignment, which directly minimizes centroid jitter and reduces broken-row artifacts. These improvements are essential for achieving stable region of interest extraction and reliable navigation-line fitting in our agricultural guidance pipeline, a benefit qualitatively corroborated by the cleaner crop-row boundaries shown in Figure 5.

3.3. Evaluation of Model Generalization and Robustness

The paper further analyzes the performance of the method under varying conditions, including strong light, cloudy days, weed distribution, and broken rows. The heat map analysis is conducted on images encompassing these four scenarios, as illustrated in Figure 7. The heat map reveals that slight variations in lighting conditions have a minimal impact on the accuracy of maize crop row segmentation. The model consistently depicts the contours of the crop row accurately, even in low overall brightness. During camera imaging, the perspective effect causes parallel crop rows to converge at vanishing points within the image, particularly at the top left and right edges, where the crop rows appear shorter and less distinct. Segmentation accuracy experiences a slight decline when weed density is high due to the similarity in appearance between weeds and crops. Nevertheless, our LMF-SegNext model successfully identifies and segments each independent crop row area, including those that are spaced apart, even in instances of missing crops. This capability ensures that the model consistently provides reliable navigational guidance, even amid irregular crop distributions, thereby confirming the model’s robustness in complex environments.

The proposed method demonstrated significant segmentation performance on our custom crop row dataset. To further demonstrate that the proposed method is transferable beyond our in-field acquisition settings, we additionally evaluated it on the representative public Crop Row Benchmark Dataset (CRBD). This public dataset contains 281 field images covering six crop types: corn, celery, potato, onion, sunflower, and soybean. The acquisition conditions encompass various growth stages, multi-density weed infestations, and complex lighting conditions such as dynamic shadow occlusion, which reflect practical agricultural challenges. The model was trained on our in-field maize dataset and then directly evaluated on CRBD without any additional fine-tuning or retraining; all hyperparameters and inference settings were kept identical. Quantitative analysis revealed that our model achieved an IoU of 90.7% on the CRBD.

Figure 8 visually demonstrates the method’s adaptive advantages across various crop varieties and complex field environments through comparative visualizations. The model achieves a precision segmentation score of 95% IoU for typical crop rows, such as potato and corn, thereby confirming its generalization capacity across different crop morphologies. Although there is a moderate decline in accuracy in weed-intensive scenarios, the system continues to exhibit effective recognition capabilities. It is noteworthy that while minor over-segmentation occurs in unannotated apical regions of high-angle images, the model adheres fully to engineering expectations in extracting annotated target areas. Experimental results indicate that the segmentation accuracy of central crop rows across various test scenarios meets the requirements of agricultural navigation systems. The method shows particularly strong robustness in core recognition regions that are critical for agricultural machinery path planning, thus validating its practical value in real-world agricultural applications.

3.4. Evaluation of the Effects of Different Labeling Methods

When utilizing deep learning models to identify crop rows, labeling the crop row area contour is typically preferred over individual plant area labeling. This approach simplifies the labeling process and accommodates the diverse growth patterns of maize leaves. Our proposed EFL labeling method delineates a polygonal area that conforms to the edge of the image based on the centerline of the crop row, as illustrated in Figure 9. Our practical evaluation under consistent conditions (same tool and annotator) showed that EFL reduces the average annotation time per image from about 90 s (using traditional CRCA polygon tracing) to about 25 s, achieving a time reduction of approximately 72%.

Figure 9a presents the original image of crop rows; Figure 9b illustrates the graph of the EFL labeling strategy; Figure 9c depicts the graph of the CRCA labeling strategy; Figure 9d shows the graph detailing specific labels of the EFL labeling strategy; Figure 9e presents the training results of the EFL labeling strategy; and Figure 9f displays the training results of the CRCA labeling strategy. As observed in Figure 9b,c, both labeling schemes (EFL and CRCA) effectively cover the core region of the crop row. However, they exhibit differences in the stability of the mask shape and adaptability to crop characteristics. The mask shape generated by the CRCA labeling strategy is influenced by the characteristics of maize plant leaves, leading to instability. The irregularity of the leaves results in fluctuations in the mask shape, which adversely affects the accuracy of the segmentation of crop row areas. This issue is particularly pronounced when the leaves of neighboring crop rows overlap, causing adhesion between masks. In Figure 9d, it is evident that the mask structure produced by the EFL labeling strategy is straighter and smoother, aligning with the general trend of the crop row stripes’ extension direction. In instances where seedlings are sparse or rows are broken, labeling the extended region of the crop rows within the field of view aids the model in comprehending the continuity of the crop rows. This approach enhances the model’s ability to recognize crop rows under non-ideal conditions. Ensuring accurate recognition and tracking of crop rows in each image frame significantly improves the robustness and utility of the segmentation model.

Figure 9e,f demonstrate the evaluation of the impact of two crop row labeling strategies on the training process of the Local-SegNeXt model. The model trained with the EFL labeling strategy demonstrates a significant performance improvement, achieving a 6.11 percentage point increase in the accuracy of crop row identification and a 8.34 percentage point enhancement in the intersection over union (IoU) ratio of the segmentation results. The EFL labeling strategy facilitates the model’s ability to learn the continuity and structural features of the crop rows more effectively by providing a smooth and uniform mask structure, which results in faster convergence and higher segmentation accuracy. Conversely, the CRCA labeling strategy led to challenges in the model’s ability to learn consistent feature representations while processing images, resulting in greater fluctuations during training. This instability was particularly pronounced in scenarios involving crossing leaves between crop rows or uneven crop growth. These findings indicate that the EFL labeling strategy offers significant advantages in enhancing the performance of the Local-SegNeXt model, particularly regarding stability and adaptability when handling complex crop scenes.

3.5. Performance of Navigation Line Fitting

To effectively evaluate the fitting performance of the crop row line, two indicators were utilized: the average lateral distance

L

and the angle

θ

between the fitted line and the ground truth line. Figure 10 illustrates the calculation principle of this evaluation method. In this context, the upper left corner of the image serves as the origin of the coordinate system, where the red line represents the predicted route and the black line denotes the actual route. The deviation angle

θ

between the predicted and actual routes is calculated using Equation (7), where

a_{1}, a_{2}

are the slopes of the two straight lines, respectively.

θ = a r c t a n \frac{a_{1} - a_{2}}{a_{1} \times a_{2}}

(7)

Take two well-separated points on the Y-axis,

(0, y 1)

and

(0, y 2)

. Two horizontal scanlines (i.e., lines perpendicular to the Y-axis) are then drawn at

y = y 1

and

y = y 2

, respectively. Let

{∆ x}_{1}

and

{∆ x}_{2}

denote the horizontal pixel offsets between the intersections of each scanline with the ground-truth route and the predicted route. We use |

{∆ x}_{1}

| and |

{∆ x}_{2}

| to prevent error cancelation in cases where the predicted route crosses the ground-truth route. Since

Δ x

is evaluated at specific scanline positions, changing

(y 1, y 2)

can lead to slight variations in the resulting average lateral distance L. Following the evaluation protocol in [35], we therefore fix

(y 1, y 2)

as two well-separated locations defined by fixed fractions of the image height for all methods and all experiments, ensuring fair and reproducible comparisons.

Take two distant points on the Y-axis with coordinates (0,

y 1

) and (0,

y 2

), make straight lines perpendicular to the Y-axis, and calculate the pixel errors

{∆ x}_{1}

and

{∆ x}_{2}

between the intersections of the two straight lines with the real course and the predicted course. The absolute value of the two-pixel errors is used in order to avoid offsetting errors due to the intersection of the predicted course with the real course.

The specific parameters of the customized threshold

T

for the gradient direction difference in the sampling points in the G-RANSAC algorithm are determined through comparative tests, with the analysis results presented in Figure 11. When the image contains two crop rows, the rows are fitted to the left and right regions of interest (ROIs), and their midpoint is extracted as the navigation line. As illustrated in Figure 11, a threshold

T

that is set too small results in the exclusion of points that slightly deviate from the true linear structure, thereby reducing the algorithm’s tolerance to noise and outliers. Conversely, a threshold set too large may incorrectly include points that do not belong to the same straight line, leading to an inaccurate fit. When

T

= 4°, the fitting line deviation error is minimized, achieving a better balance between accuracy and robustness. Therefore,

T

= 4° is adopted as the default setting for G-RANSAC in all experiments reported in this paper.

Table 3 compares four line-fitting methods in terms of runtime, angular deviation, and lateral distance error. Because the adaptive ROI extraction limits fitting to only one or two crop rows, all methods except Hough Transform (HT) finish within 10 ms. Furthermore, compared to the original RANSAC algorithm, pre-checking the initial sampling model by utilizing the gradient direction of the sampling points accelerates the algorithm’s processing. The average lateral distance errors of 7.65 and 5.28 pixels for LSM and RANSAC, respectively, are higher than those of 4.24 and 1.87 pixels for the G-RANSAC algorithm. Notably, the average angular deviation achieved by the G-RANSAC route extraction algorithm is 3.41°. This indicates that the G-RANSAC algorithm not only enhances accuracy but also improves processing efficiency, thereby providing more reliable route information for navigation.

3.6. Effect of Various Growth Stages on Navigation Line Fitting

Randomly selected from four maize growth stages, 50 images were used for route-fitting evaluation, and the mean errors are reported in Table 4. The fitting errors increased first and then decreased as the crop grew. In the seedling stage, the canopy was sparse, and inter-row interference was minimal. The angular error was 0.48°, and the average lateral offset was 1.32 pixels, which were the lowest among all stages. Errors rose in the three-leaf and five-leaf stages because trumpet-like leaf expansion increased occlusion and blurred row boundaries, which degraded line fitting performance. In the late stage, plant growth became more vertical and the inter-row structure stabilized, leading to a reduction in fitting error. The angular error decreased to 0.62°, and the average lateral offset decreased to 2.06 pixels. Overall, although the errors vary across stages, the maximum error remains within an acceptable range, providing reliable navigation cues for plant-protection robots in real farmland conditions.

Figure 12 presents the crop row segmentation model developed. It assesses the model’s effectiveness in detecting crop rows within the maize canopy across four distinct growth stages. The evaluation involves a comparison of the model’s results with manually calibrated lines. The figure indicates that the model demonstrates robust detection performance across various growth stages, successfully addressing the challenges posed by the interconnection of leaves between different rows as the maize crop matures. As the crop develops, the increasing dispersion of maize leaves affects the extracted seedling zone, leading to fluctuations in the edges of the segmentation results. Nevertheless, by integrating the center-of-mass point extraction method, it is still possible to obtain accurate linear fitting parameters. In practical field applications, crop seedling deficiency presents a common challenge; the inability to extract sufficiently precise feature points results in deviations in the extracted navigation line, complicating crop row detection. In contrast, the algorithm presented in this paper effectively extracts the canopy region of interest (ROI) even in the presence of missing plants, thereby generating enough feature points to accurately simulate the crop canopy row detection line.

3.7. Execution Time Performance Evaluation

In practical applications, prompt decision-making and response time are critical. The response time was evaluated, with the results indicating that the LMF-SegNext model segments the maize seedling crop row image in 36 ms. The total time for ROI extraction from the segmented image, center of mass computation, and centerline fitting via the G-RANSAC algorithm is 1.03 ms. Consequently, the overall time required for extracting the navigation lines from a maize crop row image is 37.03 ms, yielding a processing speed of 27 frames per second using the method presented in this paper.

To this end, this paper implemented real-time testing of the algorithm on a Jetson TX2 edge computing device, utilizing the ONNX Runtime efficient runtime library to accelerate model inference. The Jetson TX2 was equipped with an image acquisition system and mounted on an electric four-wheel drive farm machine, specifically designed for simulated seedling route extraction tests with a soil trough width of 1.8 m. According to the methodology presented in this paper, one, two, and three rows of crops were detected, as illustrated in Figure 13. The results indicate that the average angular error of the method on the edge computing device is 3.62°, with an average lateral offset of 4.89 cm. The algorithm achieves a video stream processing frame rate of 16.8 frames per second, which means that when the farm machine travels at 10 km/h, it can process 6 frames of images for every meter traveled. Therefore, the method proposed in this paper effectively identifies and extracts the centerline of maize crop rows, ensuring real-time autonomous navigation of agricultural machinery.

4. Conclusions

This study presents a vision-based autonomous navigation pipeline for maize fields across multiple growth stages under complex field conditions. To alleviate the scarcity of public multi-stage navigation datasets beyond the seedling stage, we constructed a comprehensive dataset covering diverse scenarios and growth stages, enabling stage-aware evaluation of segmentation and navigation performance.

In terms of scientific novelty, the proposed method integrates (1) an enhanced SegNeXt-based segmentation network (LMF-SegNeXt) that strengthens both long-range row-structure modeling and local boundary preservation, and (2) a complete navigation-line extraction pipeline consisting of adaptive ROI selection, equidistant horizontal strip-based centroid extraction, and a gradient-constrained RANSAC fitting strategy to reduce outlier interference and improve fitting stability. On the constructed dataset, LMF-SegNeXt achieves an IoU of 93.86%, outperforming the compared baselines and demonstrating robust canopy/row segmentation across different growth stages.

Regarding practical significance, the proposed pipeline is designed for real-time deployment. The adaptive ROI strategy eliminates instance-specific mask planning and focuses computation on navigation-critical regions, simplifying row detection and improving efficiency. Combined with the lightweight fitting strategy, the system was deployed on a Jetson TX2 edge device and achieved a processing speed of 16.8 fps, with an average angular error of 3.62° and an average lateral offset of 4.89 cm, indicating its feasibility for real-time in-row navigation of agricultural machinery.

Limitations of the current approach should also be noted. Occasional extreme environments (e.g., intense shadows or large heading errors) may influence ROI selection and subsequently affect line fitting robustness. Future work will refine the BBZ/ROI strategy to improve robustness under extreme conditions, which is essential for closed-loop autonomous operations in fully vision-driven agricultural robotics.

Author Contributions

Methodology, Y.Z. (Yuting Zhai); Validation, Z.G.; Formal analysis, Y.Z. (Yang Zhou); Data curation, J.L.; Writing—original draft, Y.Z. (Yuting Zhai); Writing—review & editing, Z.G., Y.Z. (Yang Zhou) and Y.X.; Supervision, J.L.; Project administration, Y.X.; Funding acquisition, Y.X. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by Jilin Provincial Scientific and Technological Development Program, grant number: 20230202035NC.

Institutional Review Board Statement

Not applicable.

Data Availability Statement

The data presented in this study are available on request from the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

References

Guan, H.; Deng, H.; Ma, X.; Zhang, T.; Zhang, Y.; Zhu, T.; Zhou, H.; Gu, Z.; Lu, Y. A corn canopy organs detection method based on improved DBi-YOLOv8 network. Eur. J. Agron. 2024, 154, 127076. [Google Scholar] [CrossRef]
Maseko, S.; Van der Laan, M.; Tesfamariam, E.H.; Delport, M.; Otterman, H. Evaluating machine learning models and identifying key factors influencing spatial maize yield predictions in data intensive farm management. Eur. J. Agron. 2024, 157, 127193. [Google Scholar] [CrossRef]
Li, D.; Li, B.; Long, S.; Feng, H.; Xi, T.; Kang, S.; Wang, J. Rice seedling row detection based on morphological anchor points of rice stems. Biosyst. Eng. 2023, 226, 71–85. [Google Scholar] [CrossRef]
Li, Y.; Hong, Z.; Cai, D.; Huang, Y.; Gong, L.; Liu, C. A SVM and SLIC based detection method for paddy field boundary line. Sensors 2020, 20, 2610. [Google Scholar] [CrossRef] [PubMed]
Wang, T.; Chen, B.; Zhang, Z.; Li, H.; Zhang, M. Applications of machine vision in agricultural robot navigation: A review. Comput. Electron. Agric. 2022, 198, 107085. [Google Scholar] [CrossRef]
Chen, S.; Zhang, M.; Li, X.; Li, X.; Liu, W.; Ji, B. Precision agriculture intelligent connection network based on visual navigation. IET Netw. 2023, 12, 167–178. [Google Scholar] [CrossRef]
Panda, S.K.; Lee, Y.; Jawed, M.K. Agronav: Autonomous navigation framework for agricultural robots and vehicles using semantic segmentation and semantic line detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; IEEE: Piscataway, NJ, USA, 2023; pp. 6272–6281. [Google Scholar]
Diao, Z.; Guo, P.; Zhang, B.; Zhang, D.; Yan, J.; He, Z.; Zhao, S.; Zhao, C.; Zhang, J. Navigation line extraction algorithm for corn spraying robot based on improved YOLOv8s network. Comput. Electron. Agric. 2023, 212, 108049. [Google Scholar] [CrossRef]
Gong, H.; Zhuang, W. An improved method for extracting inter-row navigation lines in nighttime maize crops using YOLOv7-tiny. IEEE Access 2024, 12, 27444–27455. [Google Scholar] [CrossRef]
Wang, S.; Yu, S.; Zhang, W.; Wang, X.; Li, J. The seedling line extraction of automatic weeding machinery in paddy field. Comput. Electron. Agric. 2023, 205, 107648. [Google Scholar] [CrossRef]
Yang, Y.; Zhou, Y.; Yue, X.; Zhang, G.; Wen, X.; Ma, B.; Xu, L.; Chen, L. Real-time detection of crop rows in maize fields based on autonomous extraction of ROI. Expert Syst. Appl. 2023, 213, 118826. [Google Scholar] [CrossRef]
Guo, Z.; Quan, L.; Sun, D.; Lou, Z.; Geng, Y.; Chen, T.; Xue, Y.; He, J.; Hou, P.; Wang, C.; et al. Efficient crop row detection using transformer-based parameter prediction. Biosyst. Eng. 2024, 246, 13–25. [Google Scholar] [CrossRef]
Cheng, G.; Jin, C.; Chen, M.; Cai, Z.; Liu, Z. Wheat Full-Width harvesting navigation line extraction method using improved Swin-Transformer. Comput. Electron. Agric. 2025, 239, 110881. [Google Scholar] [CrossRef]
Yang, R.; Zhai, Y.; Zhang, J.; Zhang, H.; Tian, G.; Zhang, J.; Huang, P.; Li, L. Potato visual navigation line detection based on deep learning and feature midpoint adaptation. Agriculture 2022, 12, 1363. [Google Scholar] [CrossRef]
Wei, C.; Li, H.; Shi, J.; Zhao, G.; Feng, H.; Quan, L. Row anchor selection classification method for early-stage crop row-following. Comput. Electron. Agric. 2022, 192, 106577. [Google Scholar] [CrossRef]
García-Santillán, I.; Guerrero, J.M.; Montalvo, M.; Pajares, G. Curved and straight crop row detection by accumulation of green pixels from images in maize fields. Precis. Agric. 2018, 19, 18–41. [Google Scholar] [CrossRef]
Liu, X.; Qi, J.; Zhang, W.; Bao, Z.; Wang, K.; Li, N. Recognition method of maize crop rows at the seedling stage based on MS-ERFNet model. Comput. Electron. Agric. 2023, 211, 107964. [Google Scholar] [CrossRef]
Fu, D.; Jiang, Q.; Qi, L.; Xing, H.; Chen, Z.; Yang, X. Detection of the centerline of rice seedling belts based on region growth sequential clustering-RANSAC. Trans. CSAE 2023, 39, 47–57. [Google Scholar]
Zhang, S.; Liu, Y.; Xiong, K.; Tian, Y.; Du, Y.; Zhu, Z.; Du, M.; Zhai, Z. A review of vision-based crop row detection method: Focusing on field ground autonomous navigation operations. Comput. Electron. Agric. 2024, 222, 109086. [Google Scholar] [CrossRef]
Liu, Y.; Guo, Y.; Wang, X.; Yang, Y.; Zhang, J.; An, D.; Han, H.; Zhang, S.; Bai, T. Crop Root Rows Detection Based on Crop Canopy Image. Agriculture 2024, 14, 969. [Google Scholar] [CrossRef]
Ronneberger, O.; Fischer, P.; Brox, T. U-net: Convolutional networks for biomedical image segmentation. In Proceedings of the International Conference on Medical Image Computing and Computer-Assisted Intervention; Springer: Cham, Switzerland, 2015; pp. 234–241. [Google Scholar]
Chen, L.-C.; Papandreou, G.; Schroff, F.; Adam, H. Rethinking atrous convolution for semantic image segmentation. arXiv 2017, arXiv:1706.05587. [Google Scholar] [CrossRef]
Wenzek, G.; Lachaux, M.-A.; Conneau, A.; Chaudhary, V.; Guzmán, F.; Joulin, A.; Grave, E. CCNet: Extracting high quality monolingual datasets from web crawl data. In Proceedings of the Twelfth Language Resources and Evaluation Conference; European Language Resources Association: Paris, France, 2020; pp. 4003–4012. [Google Scholar]
Fu, J.; Liu, J.; Tian, H.; Li, Y.; Bao, Y.; Fang, Z.; Lu, H. Dual attention network for scene segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; IEEE: Piscataway, NJ, USA, 2019; pp. 3146–3154. [Google Scholar]
Yu, C.; Wang, J.; Peng, C.; Gao, C.; Yu, G.; Sang, N. Bisenet: Bilateral segmentation network for real-time semantic segmentation. In Proceedings of the European Conference on Computer Vision (ECCV); Springer: Berlin/Heidelberg, Germany, 2018; pp. 325–341. [Google Scholar]
Romera, E.; Alvarez, J.M.; Bergasa, L.M.; Arroyo, R. Erfnet: Efficient residual factorized convnet for real-time semantic segmentation. IEEE Trans. Intell. Transp. Syst. 2017, 19, 263–272. [Google Scholar] [CrossRef]
Strudel, R.; Garcia, R.; Laptev, I.; Schmid, C. Segmenter: Transformer for semantic segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision; IEEE: Piscataway, NJ, USA, 2021; pp. 7262–7272. [Google Scholar]
Poudel, R.P.; Liwicki, S.; Cipolla, R. Fast-scnn: Fast semantic segmentation network. arXiv 2019, arXiv:1902.04502. [Google Scholar] [CrossRef]
Guo, M.-H.; Lu, C.-Z.; Hou, Q.; Liu, Z.; Cheng, M.-M.; Hu, S.-M. Segnext: Rethinking convolutional attention design for semantic segmentation. Adv. Neural Inf. Process. Syst. 2022, 35, 1140–1156. [Google Scholar]
Shi, J.; Bai, Y.; Diao, Z.; Zhou, J.; Yao, X.; Zhang, B. Row detection BASED navigation and guidance for agricultural robots and autonomous vehicles in row-crop fields: Methods and applications. Agronomy 2023, 13, 1780. [Google Scholar] [CrossRef]
Fan, M.; Lai, S.; Huang, J.; Wei, X.; Chai, Z.; Luo, J.; Wei, X. Rethinking bisenet for real-time semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; IEEE: Piscataway, NJ, USA, 2021; pp. 9716–9725. [Google Scholar]
Riu, C.; Nozick, V.; Monasse, P.; Dehais, J. Classification performance of RANSAC algorithms with automatic threshold estimation. In Proceedings of the 17th International Joint Conference on Computer Vision, Imaging and Computer Graphics Theory and Applications (VISIGRAPP 2022), Online, 6–8 February 2022; HAL: Villeurbanne, France, 2022; pp. 723–733. [Google Scholar]
He, Y.; Zhang, X.; Zhang, Z.; Fang, H. Automated detection of boundary line in paddy field using MobileV2-UNet and RANSAC. Comput. Electron. Agric. 2022, 194, 106697. [Google Scholar] [CrossRef]
Wen, Y.; Dai, W.; Yu, W.; Chen, B.; Pan, L. An enhanced RANSAC-RTK algorithm in GNSS-challenged environments. J. Spat. Sci. 2025, 70, 65–84. [Google Scholar] [CrossRef]
Yu, J.; Zhang, J.; Shu, A.; Chen, Y.; Chen, J.; Yang, Y.; Tang, W.; Zhang, Y. Study of convolutional neural network-based semantic segmentation methods on edge intelligence devices for field agricultural robot navigation line extraction. Comput. Electron. Agric. 2023, 209, 107811. [Google Scholar] [CrossRef]

Figure 1. Dataset construction schematic diagram. (a) Different periods of crop growth; (b) capture images at varied angles; (c) Edge-fit line labeling strategy; (d) different scenarios of crop growth.

Figure 2. LMF-SegNeXt network architecture.

Figure 3. Local block structure used in the LMF-SegNext network.

Figure 4. Multi-scale Fusion Attention module.

Figure 5. Images of different models’ output in the crop row segmentation task. Crop rows (black), background (white).

Figure 6. IoU during training (left) and loss curve (right).

Figure 7. Model Heatmap Analysis in Different Scenarios. (The color gradient (from blue to red) represents the relative confidence or activation level of the model at each pixel location.).

Figure 8. Visualization of segmentation performance on the CRBD (White: Crop rows). (a) potato crop rows, (b) high-angle perspective scenes, (c) weed-dense challenging scenarios, and (d) corn crop rows.

Figure 9. Comparison of prediction effects between edge-fit line (EFL) and Crop Row Coverage Area (CRCA) labeling strategy. (a) Crop row original image; (b) EFL labeling strategy; (c) CRCA labeling strategy; (d) EFL labeling details; (e) training result with EFL labeling strategy; (f) training result with CRCA labeling strategy.

Figure 10. Schematic diagram of the calculation principles for the average lateral distance (

L

) and angle (

θ

).

Figure 10. Schematic diagram of the calculation principles for the average lateral distance (

L

) and angle (

θ

).

Figure 11. Specific parameter analysis results for the customized threshold

T

. (a1,a2) Input image. (b1,b2) Fitting-based labels. (c1,c2) Segmentation output. (d1,d2) Adaptive row extraction. (e1,e2) Centroid computation. (f1,f2) Centroid mapping. (g1–i2) G-RANSAC fitting results for T = 2°, 4°, and 6°.

Figure 11. Specific parameter analysis results for the customized threshold

T

. (a1,a2) Input image. (b1,b2) Fitting-based labels. (c1,c2) Segmentation output. (d1,d2) Adaptive row extraction. (e1,e2) Centroid computation. (f1,f2) Centroid mapping. (g1–i2) G-RANSAC fitting results for T = 2°, 4°, and 6°.

Figure 12. Maize Crop row detection results at various growth stages: seedling to seven-leaf stage. (a) Original image, (b) LMF-SegNext network segmentation result, (c) region of interest extraction result, (d) navigation line fitting result.

Figure 13. Experimental deployment of the algorithm in simulated field conditions. The red line indicates the fitted navigation line.

Table 1. Different model performance evaluation using our in-field maize datasets.

Model	Seed		Background		Flops/G	Params/M	Time/s	Seed	Seed	Seed	Speed
Model	IOU	ACC	IOU	ACC	Flops/G	Params/M	Time/s	Dice	Precision	Recall	FPS
DeepLabv3	90.67	93.82	78.09	90.38	540.00	65.74	5.40	95.11	95.61	94.61	18.52
ERFNet	89.06	93.03	74.74	85.95	29.10	2.08	3.70	94.21	94.71	93.72	27.03
BiseNetv2	90.78	93.01	80.38	91.87	24.60	3.34	4.10	95.17	95.67	94.67	24.39
Fast-scnn	90.67	93.85	78.97	91.17	185.00	1.40	6.80	95.11	95.71	94.51	14.71
Segmenter	89.43	91.80	77.38	93.32	297.00	102.00	15.50	94.42	94.92	93.93	6.45
DaNet	89.84	93.87	76.87	88.71	441.00	47.46	12.60	94.65	95.15	94.15	7.94
UNet	73.95	89.02	44.53	55.26	406.00	28.98	14.50	85.02	86.52	83.58	6.90
CCNet	91.06	93.95	80.82	90.19	403.00	47.45	9.30	95.32	96.12	94.53	10.75
SegNext	92.04	94.39	81.82	93.28	32.78	27.63	5.50	95.86	96.66	95.07	18.18
Slim-SegNext	91.85	94.08	81.25	92.19	22.34	18.08	4.28	95.02	96.54	94.75	23.36
LMF-SegNext	93.86	95.13	83.43	93.78	18.67	13.21	3.60	96.83	97.63	96.05	27.78

Table 2. Boundary-aware quantitative evaluation for the Seed class in the Local module ablation study.

Model	Mean IoU	$BFScore @ τ = 5 p x$	$B - Iou @ r = 5 p x$
SegNeXt + Focal	92.58	0.89	0.81
SegNeXt + Focal + Local	93.57	0.91	0.84

Note: BFScore and B-Iou are higher-is-better metrics (range [0, 1]).

Table 3. Performance of navigation line fitting method (mean ± SD).

Method	HT	LSM	RANSAC	G-RANSAC
Time (ms)	15.0 ± 20.45	10 ± 0.30	3 ± 0.12	0.9 ± 0.05
$θ$ (°)	5.58 ± 0.74	3.22 ± 0.48	1.54 ± 0.25	0.32 ± 0.08
L (pixels)	14.46 ± 2.10	7.65 ± 0.98	5.28 ± 0.98	3.41 ± 0.42

Table 4. Mean error of route fitting for four growth stages of maize (mean ± SD).

	Seeding Stage	Three-Leaf Stage	Five-Leaf Stage	Seven-Leaf Stage
$θ$ (°)	0.48 ± 0.33	1.47 ± 0.41	1.75 ± 0.50	0.62 ± 0.65
L (pixels)	1.32 ± 1.22	4.25 ± 1.35	5.13 ± 1.49	2.06 ± 1.68

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Zhai, Y.; Gao, Z.; Li, J.; Zhou, Y.; Xu, Y. An Enhanced SegNeXt with Adaptive ROI for a Robust Navigation Line Extraction in Multi-Growth-Stage Maize Fields. Agriculture 2026, 16, 367. https://doi.org/10.3390/agriculture16030367

AMA Style

Zhai Y, Gao Z, Li J, Zhou Y, Xu Y. An Enhanced SegNeXt with Adaptive ROI for a Robust Navigation Line Extraction in Multi-Growth-Stage Maize Fields. Agriculture. 2026; 16(3):367. https://doi.org/10.3390/agriculture16030367

Chicago/Turabian Style

Zhai, Yuting, Zongmei Gao, Jian Li, Yang Zhou, and Yanlei Xu. 2026. "An Enhanced SegNeXt with Adaptive ROI for a Robust Navigation Line Extraction in Multi-Growth-Stage Maize Fields" Agriculture 16, no. 3: 367. https://doi.org/10.3390/agriculture16030367

APA Style

Zhai, Y., Gao, Z., Li, J., Zhou, Y., & Xu, Y. (2026). An Enhanced SegNeXt with Adaptive ROI for a Robust Navigation Line Extraction in Multi-Growth-Stage Maize Fields. Agriculture, 16(3), 367. https://doi.org/10.3390/agriculture16030367

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers. See further details here.

Article Menu

An Enhanced SegNeXt with Adaptive ROI for a Robust Navigation Line Extraction in Multi-Growth-Stage Maize Fields

Abstract

1. Introduction

2. Materials and Methods

2.1. Data Collection

2.2. LMF-SegNext Crop Row Extraction Model

2.2.1. Local Block

2.2.2. Multi-Scale Fusion Attention

2.2.3. Focal Loss

2.2.4. Adaptive ROI Extraction Scheme

2.3. Navigation Line Fitting

3. Results and Discussion

3.1. Comparison of Semantic Segmentation Models

3.2. Evaluation of LMF-SegNext Performances

3.3. Evaluation of Model Generalization and Robustness

3.4. Evaluation of the Effects of Different Labeling Methods

3.5. Performance of Navigation Line Fitting

3.6. Effect of Various Growth Stages on Navigation Line Fitting

3.7. Execution Time Performance Evaluation

4. Conclusions

Author Contributions

Funding

Institutional Review Board Statement

Data Availability Statement

Conflicts of Interest

References

Share and Cite

Article Metrics

Article Access Statistics

Further Information

Guidelines

MDPI Initiatives

Follow MDPI