1. Introduction
Maize (*Zea mays*) is a major global cereal crop used for food, feed, and bioenergy, and its kernel quality strongly influences germination, early vigor, and final yield. Kernel attributes such as physical purity, damage, and varietal consistency also affect downstream processes, including milling, grain fractionation, and industrial applications [1]. High standards of purity and kernel quality have been established by seed certification programs. However, these qualitative standards are still largely enforced through a manual process in which trained personnel inspect, count, and sort kernels [2]. Current kernel evaluation processes are time-consuming, resource-intensive, and difficult to scale to the demands of modern high-throughput breeding and commercial seed production. As a result, there is growing interest in automation, particularly using machine vision and deep learning techniques, to assist in seed evaluation and processing tasks such as kernel counting, defect detection, breakage estimation, and seed vigor assessment.
Machine vision [3] and deep learning [4] have rapidly advanced kernel-level inspection by providing scalable alternatives to manual evaluation. Early systems relied on handcrafted shape and texture descriptors paired with classical classifiers to distinguish whole from broken kernels or to assess purity. With modern deep learning, CNN-based [5] models have been developed for tasks such as on-ear kernel detection and counting, high-throughput phenotyping, and classification of good, defective, and impurity kernels [6]. These results demonstrate that deep feature extractors can match or surpass human graders on specific tasks. Self-supervised and contrastive learning approaches further reduce annotation costs by learning transferable kernel representations that support embryo-orientation detection and segmentation. At the same time, hyperspectral and RGB-based models have been used to predict maize variety, vigor, and germination [7]. These studies collectively establish that image-based deep learning can capture many aspects of seed and kernel quality and form a strong foundation for automation.
Despite these advances, current image-based kernel analysis methods still face several major challenges. First, most rely on a single, monolithic classifier that maps directly from an image to a composite label such as “good”, “defective”, or “impurity”. This compresses diverse visual cues (purity, morphology, and orientation) into one decision, making error analysis difficult and providing limited interpretability for agronomists and seed analysts. Second, the field remains dominated by CNN-based and handcrafted feature pipelines that are effective at local pattern extraction but less suited to capturing the global shape context and subtle structural cues needed to distinguish borderline impurities or embryo orientation. Finally, many prior works address only one sub-task at a time (e.g., purity or embryo orientation) or are tightly coupled to specific imaging setups, which complicates reuse across quality dimensions or deployment environments. Although Vision Transformers (ViTs) [8] and related architectures offer improved capacity for modeling long-range relationships and have shown strong performance in plant disease detection and phenotyping [9], their use for fine-grained, kernel-level evaluation is still limited. In contrast, human graders naturally follow a structured, multi-stage reasoning process: they begin by assessing whether a kernel is pure, then determine its morphological category, and finally examine its orientation and finer anatomical features. Emulating this hierarchical decision pathway in automated models helps address both the interpretability and modeling limitations of prior work.
At the same time, there is a broader trend toward deep learning-based image analysis in precision agriculture, which further motivates our approach. In crop-weed management, CWRepViT-Net has shown that encoder–decoder frameworks built from RepViT blocks can perform accurate semantic segmentation of crops and multiple weed species throughout the life cycle of soybean fields, highlighting the potential of transformer-style backbones for fine-grained canopy understanding and decision support in the field [10]. In aerial monitoring, Succulent-YOLO combines a CLIP-enhanced YOLOv10 detector, dynamic group convolutions, and a multi-scale fusion neck with a Mamba-based super-resolution module (MambaIR) to detect succulent plants in UAV imagery, achieving high mean average precision even on low-resolution inputs [11]. Closer to our application domain, Rocha et al. developed a real-time system that mounts a camera on a self-propelled forage harvester to capture images of chopped corn silage and uses machine learning to count whole kernels and estimate the Kernel Processing Score, with strong agreement to laboratory sieve analysis, demonstrating that image-based models can provide actionable quality metrics directly in harvesting equipment [12]. Together, these examples show that deep learning, including transformer-based and CLIP-enhanced architectures, is increasingly used to automate image-based assessment of crops and grain products at canopy, field, and processing levels.
In this study, we investigate whether multi-stage Convolutional Vision Transformers (CvTs) [13] can model the hierarchical decision-making process used by human seed analysts for single-kernel evaluation. To this end, we introduce CornViT, a three-stage CvT-based framework that sequentially classifies kernel purity, morphological type, and embryo orientation. Each stage employs an independently fine-tuned CvT-13 model [14] operating on RGB images, leveraging CvT's combination of convolutional inductive biases with the global context modeling of Transformers. By decomposing grading into three explicit decisions and using a transformer-based backbone, CornViT is designed to address the limitations of prior work: it replaces monolithic labels with interpretable intermediate outputs, uses convolution-augmented self-attention to capture both local surface cues and global kernel shape, and provides modular stages that can be adapted to different imaging setups or extended to additional quality attributes. Across the three stages, CornViT substantially outperforms strong CNN baselines: ResNet-50 [15,16] attains only around 77 to 81 percent accuracy and DenseNet-121 [17] around 87 to 89 percent, whereas CornViT achieves 93.76%, 94.11%, and 91.12% test accuracy for purity, shape, and embryo-orientation classification, respectively. This highlights the advantages of transformer-based architectures for fine-grained agricultural vision tasks. We deploy the models in a lightweight web application that supports stage-wise inference and exposes interpretable outputs through a simple browser interface, illustrating how such a framework can be integrated into seed-quality and precision-agriculture workflows.
Three major contributions from this research are as follows:
Introduction of CornViT, a three-stage CvT-based framework that mirrors human-style hierarchical reasoning for kernel purity, morphology, and embryo orientation.
Construction and release of a stage-wise annotated corn kernel dataset comprising three curated subsets aligned with the purity, shape, and embryo-orientation tasks.
Development of a ready-to-use web application that exposes the full CornViT pipeline through a browser interface, enabling easy adoption in seed quality assessment and precision-agriculture workflows.
3. Materials and Methods
3.1. Problem Formulation
This study considers single-kernel RGB images in which each image contains exactly one corn kernel on a uniform background. The goal is to obtain a hierarchical description of the kernel through three binary decisions:
Purity (Stage 1):
Pure: visually acceptable kernels without obvious defects or discoloration.
Impure: kernels that are broken, discolored, silkcut (intact kernels with visible surface cracks and silk-embedded fissures), or otherwise unsuitable.
Shape (Stage 2, conditioned on purity):
Flat: pure kernels with a flattened silhouette, on which the embryo side is visible.
Round: pure kernels with a rounded silhouette.
Embryo orientation (Stage 3, conditioned on purity and flat shape):
Embryo Up: the embryo side faces the camera.
Embryo Down: the embryo side faces away from the camera.
Let $x$ denote an RGB kernel image, and let $f_1$, $f_2$, and $f_3$ denote the Stage 1–3 models. The overall pipeline maps

$$x \;\mapsto\; (\hat{y}_1,\, \hat{y}_2,\, \hat{y}_3),$$

where $\hat{y}_1 = f_1(x)$, $\hat{y}_2 = f_2(x)$ (if defined), and $\hat{y}_3 = f_3(x)$ (if defined).
For kernels that are predicted as impure in Stage 1, no further classification is attempted, so $\hat{y}_2$ and $\hat{y}_3$ remain undefined. Similarly, for kernels that are pure but round, Stage 3 is skipped and $\hat{y}_3$ is undefined.
In practice, this hierarchy is implemented using three independent binary classifiers, each trained on a stage-specific dataset: $f_1$ on $D_1$ for purity, $f_2$ on $D_2$ for shape (pure kernels only), and $f_3$ on $D_3$ for embryo orientation (pure–flat kernels only). This design allows each stage to specialize in its own decision boundary while maintaining a simple and interpretable global pipeline.
The overall workflow of the proposed CornViT framework is summarized in the flowchart in Figure 1. Starting from a single-kernel RGB image on a uniform background, the image is first passed through a standardized preprocessing pipeline (resize, augmentation, and ImageNet-style normalization). The preprocessed image is then processed sequentially by three CvT-13 classifiers corresponding to Stage 1 (purity), Stage 2 (shape), and Stage 3 (embryo orientation). At each stage, a binary decision is made, and the hierarchy either terminates (for impure or pure-round kernels) or progresses to the next classifier (for pure and pure–flat kernels). The final output is a hierarchical label tuple $(\hat{y}_1, \hat{y}_2, \hat{y}_3)$ that describes purity, morphology, and embryo orientation, with undefined components for skipped stages.
3.2. Dataset Preparation
A publicly available corn seed image dataset [37] hosted on Kaggle served as the starting point for this study. The original dataset contains four labeled classes: broken, discolored, pure, and silkcut, but closer inspection revealed substantial inconsistencies between the class labels and corresponding images. Several images were misplaced across folders and did not correctly represent their annotated class.
Moreover, the predefined class structure did not align with the hierarchical purity, shape, and embryo-orientation objectives considered here. Due to these inconsistencies, the dataset was unsuitable for direct model training. Furthermore, publicly available image datasets for corn kernel analysis are extremely limited, particularly for classification tasks involved in this study.
To address this gap, we performed a comprehensive manual curation process in which each image was visually inspected, hand-picked, and reassigned to its correct class. In total, 17,801 single-kernel images from the Kaggle download were examined. Of these, 10,536 images were duplicates and were therefore discarded. A total of 7265 images were retained in the curated pool used to construct the stage-wise datasets (Table 1, Table 2 and Table 3). Approximately 50% of these retained images required relabeling during curation.
Using this cleaned pool of images, we then constructed three progressively filtered datasets, each tailored to a specific stage of the CornViT pipeline. We regard these three curated datasets as a key contribution of this work. They provide a ready-to-use benchmark suite for researchers interested in corn kernel purity, morphology, and embryo orientation. The full datasets, along with train/validation/test splits, will be made publicly available at https://doi.org/10.5281/zenodo.17693853.
3.2.1. Stage 1 Dataset: Purity Classification
The first stage aims to distinguish pure kernels from impure ones. To this end, the curated dataset was reorganized into two classes: (1) Pure: kernels that are visually free from defects or discoloration, (2) Impure: an aggregated class combining broken, discolored, and silkcut kernels.
Figure 2 presents sample images, and Table 1 summarizes the train/validation/test partitioning, which follows a 70/15/15 split.
Table 1. Summary of the Stage 1 dataset, showing the number of pure and impure samples in the training, validation, and test subsets.
| Subset | Pure | Impure | Total |
|---|---|---|---|
| Training Set | 2586 | 2499 | 5085 |
| Validation Set | 555 | 535 | 1090 |
| Test Set | 554 | 536 | 1090 |
| Overall Total | 3695 | 3570 | 7265 |
3.2.2. Stage 2 Dataset: Morphological Classification
The second stage focuses on morphological categorization of kernels based on shape characteristics. Since impure kernels are not relevant for further morphological or orientation analysis, only the pure samples from Stage 1 were included in Stage 2. From these, two new classes were identified: (1) Flat and (2) Round. This refinement is motivated by the observation that embryo orientation, the target of Stage 3, is only visually meaningful for flat kernels.
Figure 3 presents sample images, and Table 2 summarizes the train/validation/test split.
Table 2. Summary of the Stage 2 dataset, showing the number of flat and round samples in the training, validation, and test subsets following the 70/15/15 split.
| Subset | Flat | Round | Total |
|---|---|---|---|
| Training Set | 1374 | 1329 | 2703 |
| Validation Set | 294 | 284 | 578 |
| Test Set | 294 | 284 | 578 |
| Overall Total | 1962 | 1897 | 3859 |
3.2.3. Stage 3 Dataset: Embryo Orientation Classification
The final stage addresses embryo orientation detection, which is critical for kernel viability and downstream seed processing. Because embryo orientation is visible only on flat kernels, Stage 3 uses the flat subset of the Stage 2 dataset as its base. Two classes were manually derived: (1) Embryo Up and (2) Embryo Down.
Figure 4 presents sample images, and Table 3 summarizes the train/validation/test split.
Table 3. Summary of the Stage 3 dataset, showing the number of embryo-up and embryo-down samples in the training, validation, and test subsets following the 70/15/15 split.
| Subset | Embryo Up | Embryo Down | Total |
|---|---|---|---|
| Training Set | 813 | 561 | 1374 |
| Validation Set | 174 | 119 | 293 |
| Test Set | 174 | 119 | 293 |
| Overall Total | 1161 | 799 | 1960 |
3.3. Image Preprocessing
All images were processed through a standardized preprocessing pipeline before being fed into the CornViT models. Each image was first resized to a fixed resolution of 384 × 384 pixels to match the CvT-13 backbone configuration.
To improve model robustness and mitigate overfitting, we applied a set of common on-the-fly data augmentations during training. These augmentations included random horizontal and vertical flips, color jittering (adjustments to brightness, contrast, and saturation), and small in-plane rotations of up to 15°. After augmentation, each image was converted into a normalized PyTorch tensor. Normalization followed the standard ImageNet statistics (mean = (0.485, 0.456, 0.406), standard deviation = (0.229, 0.224, 0.225)), which are typically used for models pretrained on ImageNet.
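As an illustration, this pipeline can be expressed with torchvision transforms as follows (a minimal sketch: the jitter strengths shown are placeholders, and the exact values are defined in the released repository):

```python
from torchvision import transforms

# ImageNet statistics used for normalization
IMAGENET_MEAN = (0.485, 0.456, 0.406)
IMAGENET_STD = (0.229, 0.224, 0.225)

# Training-time pipeline: resize, augment, normalize
train_transforms = transforms.Compose([
    transforms.Resize((384, 384)),
    transforms.RandomHorizontalFlip(),
    transforms.RandomVerticalFlip(),
    transforms.ColorJitter(brightness=0.2, contrast=0.2, saturation=0.2),  # illustrative strengths
    transforms.RandomRotation(degrees=15),
    transforms.ToTensor(),
    transforms.Normalize(IMAGENET_MEAN, IMAGENET_STD),
])

# Validation/test pipeline: deterministic resize and normalize only
val_transforms = transforms.Compose([
    transforms.Resize((384, 384)),
    transforms.ToTensor(),
    transforms.Normalize(IMAGENET_MEAN, IMAGENET_STD),
])
```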
Overall, this preprocessing and augmentation pipeline ensures consistent image scaling, improves generalization under modest domain shifts, and provides well-conditioned inputs for all three CornViT stages. The full implementation is available in the accompanying code repository: https://github.com/SaiTeja-Erukude/CornViT (accessed on 17 December 2025).
3.4. CornViT Architecture
Each stage of the CornViT framework is implemented as an independent Convolutional Vision Transformer (CvT-13) classifier [13] built on top of the official Microsoft CvT implementation. All three stages share the same backbone architecture but are trained with their own binary classification heads and datasets.
3.4.1. Convolutional Vision Transformer (CvT-13)
The Convolutional Vision Transformer (CvT) [13] is a hybrid vision backbone that combines the global self-attention of Vision Transformers (ViTs) [8] with the local inductive biases of CNNs. CvT introduces two key modifications to the vanilla ViT architecture: (1) convolutional token embedding and (2) convolutional projection in self-attention.
Instead of partitioning the image into non-overlapping patches and flattening them with a linear projection (as in ViT), CvT uses convolutional layers to generate tokens. These convolutions define local receptive fields and perform spatial down-sampling, allowing each stage to operate on progressively coarser yet semantically richer feature maps. This injects CNN-like properties such as shift, scale, and distortion invariance into the transformer.
In the transformer blocks, CvT replaces pure linear projections for queries, keys, and values with convolutional projections. This enables the attention mechanism to be aware of local spatial neighborhoods while still modeling long-range dependencies through multi-head self-attention. As a result, CvT can better capture fine-grained textures (e.g., kernel surface cues) and global shape simultaneously, often with fewer parameters and FLOPs than comparable ViT or deep CNN backbones.
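To make the idea concrete, the convolutional projection can be sketched as a depthwise separable convolution over the feature map (an illustrative reduction, not the exact CvT-13 configuration, which uses separate projections with different strides for keys and values):

```python
import torch
import torch.nn as nn

class ConvProjection(nn.Module):
    """Simplified depthwise-separable convolutional projection, in the spirit of CvT."""
    def __init__(self, dim: int, kernel_size: int = 3, stride: int = 1):
        super().__init__()
        pad = kernel_size // 2
        self.proj = nn.Sequential(
            # Depthwise conv mixes local spatial neighborhoods per channel
            nn.Conv2d(dim, dim, kernel_size, stride, pad, groups=dim, bias=False),
            nn.BatchNorm2d(dim),
            # Pointwise conv plays the role of the linear projection in a vanilla ViT
            nn.Conv2d(dim, dim, kernel_size=1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, C, H, W) feature map; output tokens: (B, H'*W', C)
        y = self.proj(x)
        return y.flatten(2).transpose(1, 2)
```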
CvT-13 is the smallest variant described by Wu et al., with three transformer stages of increasing channel width and decreasing spatial resolution, as depicted in Figure 5. For 384 × 384 inputs, the final stage produces a compact global representation that feeds a lightweight classification head. This makes CvT-13 a good fit for our single-kernel classification setting, where both local surface detail and global kernel morphology are important.
A block-level overview of the proposed CornViT network is shown in Figure 6, which details the internal structure of a single stage: the three transformer stages of the CvT backbone, followed by a global pooling module and a 2-unit stage-specific classification head. All stages of CornViT follow the same internal structure.
3.4.2. Experimental Setup
All stages in the proposed framework are implemented using the official Microsoft CvT codebase [14] with the CvT-13 configuration for 384 × 384 inputs. For each stage, we initialize a CvT-13 backbone from the publicly available ImageNet-22k pretrained checkpoint [38] and adapt it to the corresponding binary task by attaching a 2-unit classification head. Low-level configuration details (e.g., repository cloning and config files) follow the official implementation and are documented in the public code repository.
We adopt a head-only fine-tuning strategy in which the CvT backbone remains frozen and only the final linear classification layer is updated. This choice was motivated by three factors: (i) the curated stage-specific datasets are relatively small compared with large-scale vision corpora, increasing the risk of overfitting when unfreezing deeper transformer blocks; (ii) head-only tuning substantially reduces training time and GPU memory requirements, enabling fully independent training for the three stages; (iii) freezing the backbone ensures stable, comparable feature representations across purity, morphology, and embryo-orientation tasks. Although partial unfreezing (e.g., unfreezing the final transformer stage) may offer additional performance gains—particularly for the visually subtle Stage 3 embryo-orientation task—we leave this investigation for future work.
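As an illustration of this strategy, the following sketch uses the Hugging Face port of CvT-13 rather than the official Microsoft repository used in our experiments; the checkpoint name and attribute layout follow the `transformers` library:

```python
from transformers import CvtForImageClassification

# ImageNet-22k pretrained CvT-13 for 384 x 384 inputs, with the original
# classification head swapped for a 2-unit head for one CornViT stage.
model = CvtForImageClassification.from_pretrained(
    "microsoft/cvt-13-384-22k",
    num_labels=2,
    ignore_mismatched_sizes=True,  # allows replacing the pretrained head
)

# Head-only fine-tuning: freeze everything, then unfreeze the linear head.
for param in model.parameters():
    param.requires_grad = False
for param in model.classifier.parameters():
    param.requires_grad = True
```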
All stages share the same training configuration, including optimizer, learning-rate scheduler, label smoothing, and number of epochs; the full set of hyperparameters is summarized in Table 4. The entire training and inference pipeline is implemented in PyTorch (version 2.9.0) [39], and the complete source code is publicly available at https://github.com/SaiTeja-Erukude/CornViT (accessed on 17 December 2025).
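Under these settings, one training epoch for a single stage reduces to a standard head-only loop. The sketch below uses illustrative hyperparameter values; the actual values appear in Table 4:

```python
import torch
import torch.nn as nn

# `model` is a CvT-13 classifier with a frozen backbone (Section 3.4.2) and
# `train_loader` yields (image, label) batches from one stage's dataset.
criterion = nn.CrossEntropyLoss(label_smoothing=0.1)     # smoothing value is illustrative
optimizer = torch.optim.AdamW(
    (p for p in model.parameters() if p.requires_grad),  # head parameters only
    lr=1e-3,                                             # illustrative learning rate
)

def train_one_epoch(model, train_loader, device="cuda"):
    model.train()
    for images, labels in train_loader:
        images, labels = images.to(device), labels.to(device)
        logits = model(images)  # for the Hugging Face port, use model(images).logits
        loss = criterion(logits, labels)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```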
3.5. Algorithms
The full training and inference procedures of the proposed CornViT framework are summarized using two stage-wise algorithms. Algorithm 1 describes the independent training strategy for each classification stage, and Algorithm 2 outlines the hierarchical inference pipeline used to obtain the final kernel labels.
For clarity, we recall that $D_1$, $D_2$, and $D_3$ denote the three curated datasets used in this work: $D_1$ contains all kernels used for purity classification; $D_2$ is a subset of $D_1$ that contains only kernels labeled as pure and is used for shape classification; and $D_3$ is a subset of $D_2$ that contains only pure–flat kernels and is used for embryo-orientation classification.
Algorithm 1 CornViT Stage-wise Training

Require:
1: $D_1$: Stage 1 data (impure vs. pure)
2: $D_2$: Stage 2 data (flat vs. round; pure only)
3: $D_3$: Stage 3 data (embryo up vs. down; pure–flat only)
4: $f_1, f_2, f_3$: CvT-13 models with 2-class heads
5: train_transforms, val_transforms, number of epochs $T$
6: for $s = 1, 2, 3$ do ▹ stage-wise training loop
7:  Initialize $f_s$ with ImageNet-pretrained CvT-13 weights
8:  Replace final classification head with a 2-unit linear layer
9:  Freeze all backbone parameters of $f_s$ (head-only fine-tuning)
10:  if $s = 1$ then
11:   $D \leftarrow D_1$
12:  else if $s = 2$ then
13:   $D \leftarrow D_2$
14:  else
15:   $D \leftarrow D_3$
16:  end if
17:  for $t = 1$ to $T$ do
18:   for all $(x, y)$ in training split of $D$ do
19:    $\tilde{x} \leftarrow$ train_transforms$(x)$
20:    $\hat{y} \leftarrow f_s(\tilde{x})$
21:    $L \leftarrow$ CrossEntropy$(\hat{y}, y)$ with label smoothing
22:    Update head parameters of $f_s$ using AdamW to minimize $L$
23:   end for
24:   Evaluate $f_s$ on the validation split of $D$ using val_transforms
25:  end for
26: end for
Algorithm 2 CornViT Hierarchical Inference

Require:
1: $x$: RGB image of a single kernel
2: Load trained models $f_1, f_2, f_3$
3: val_transforms
4: $\tilde{x} \leftarrow$ val_transforms$(x)$
5: Stage 1: Pure vs. Impure
6: $\hat{y}_1 \leftarrow \arg\max f_1(\tilde{x})$
7: if $\hat{y}_1 =$ impure then
8:  Output: (impure, –, –)
9:  return
10: end if
11: Stage 2: Flat vs. Round
12: $z_2 \leftarrow f_2(\tilde{x})$
13: $\hat{y}_2 \leftarrow \arg\max z_2$
14: if $\hat{y}_2 =$ round then
15:  Output: (pure, round, –)
16:  return
17: end if
18: Stage 3: Embryo Up vs. Down
19: $z_3 \leftarrow f_3(\tilde{x})$
20: $\hat{y}_3 \leftarrow \arg\max z_3$
21: Output: (pure, flat, $\hat{y}_3$)
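For concreteness, Algorithm 2 can be rendered in PyTorch as follows. This is a sketch: the class-index orderings and the assumption that each model returns raw logits are ours, and the Hugging Face port would require taking `.logits` from the output object:

```python
import torch

# Assumed class-index orderings for the three binary heads
LABELS_STAGE1 = ("impure", "pure")
LABELS_STAGE2 = ("flat", "round")
LABELS_STAGE3 = ("embryo_down", "embryo_up")

@torch.no_grad()
def classify_kernel(image, f1, f2, f3, val_transforms, device="cpu"):
    """Run the three-stage CornViT hierarchy on a single kernel image."""
    x = val_transforms(image).unsqueeze(0).to(device)

    # Stage 1: purity; impure kernels terminate the pipeline
    y1 = LABELS_STAGE1[f1(x).argmax(dim=1).item()]
    if y1 == "impure":
        return ("impure", None, None)

    # Stage 2: shape; round kernels skip orientation analysis
    y2 = LABELS_STAGE2[f2(x).argmax(dim=1).item()]
    if y2 == "round":
        return ("pure", "round", None)

    # Stage 3: embryo orientation, defined only for pure, flat kernels
    y3 = LABELS_STAGE3[f3(x).argmax(dim=1).item()]
    return ("pure", "flat", y3)
```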
3.6. Evaluation Metrics
To comprehensively assess the performance of each stage in the proposed CornViT framework, we employ standard classification metrics, including Accuracy (Acc), Precision (Pre), Recall (Re), F1 score, and their macro and weighted averages. Since each stage is formulated as a binary classification problem, these metrics characterize different aspects of the model's behavior, such as overall correctness, reliability on the positive class, and robustness under class imbalance [41].
Let $TP$, $FP$, $TN$, and $FN$ denote the number of true positives, false positives, true negatives, and false negatives, respectively.

The Accuracy measures the overall proportion of correctly classified samples:

$$\mathrm{Acc} = \frac{TP + TN}{TP + FP + TN + FN}.$$

Precision measures the proportion of predicted positive samples that are actually positive:

$$\mathrm{Pre} = \frac{TP}{TP + FP}.$$

Recall (also known as sensitivity) quantifies the proportion of actual positive samples that are correctly identified:

$$\mathrm{Re} = \frac{TP}{TP + FN}.$$

The F1 score is the harmonic mean of Precision and Recall:

$$\mathrm{F1} = \frac{2 \cdot \mathrm{Pre} \cdot \mathrm{Re}}{\mathrm{Pre} + \mathrm{Re}}.$$
For completeness, we also report macro and weighted averages across classes. Let $C$ be the number of classes (here, $C = 2$), and let $\mathrm{Pre}_c$, $\mathrm{Re}_c$, and $\mathrm{F1}_c$ denote the class-wise metrics for class $c$, with $n_c$ samples in class $c$ and $N$ the total number of samples.

The macro-average of a generic metric $M \in \{\mathrm{Pre}, \mathrm{Re}, \mathrm{F1}\}$ is computed as

$$M_{\text{macro}} = \frac{1}{C} \sum_{c=1}^{C} M_c,$$

which treats all classes equally.

The weighted-average version accounts for class imbalance by weighting each class by its support:

$$M_{\text{weighted}} = \sum_{c=1}^{C} \frac{n_c}{N} M_c.$$
In our experiments, we report Accuracy together with macro- and weighted-average Precision, Recall, and F1, providing a balanced view of performance under potentially imbalanced class distributions at each CornViT stage. Model selection for each stage is based primarily on validation Accuracy, while also inspecting macro- and weighted-average F1; when candidate models achieve similar validation Accuracy, we prefer those with higher macro-F1 to avoid degrading minority-class performance.
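All of these quantities can be computed with standard tooling, for example with scikit-learn (the label arrays below are hypothetical and shown only for illustration):

```python
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

# y_true, y_pred: integer class labels (0/1) for one CornViT stage
y_true = [0, 1, 1, 0, 1]
y_pred = [0, 1, 0, 0, 1]

acc = accuracy_score(y_true, y_pred)
macro_p, macro_r, macro_f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="macro")
wt_p, wt_r, wt_f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="weighted")
print(f"Acc={acc:.4f}  macro-F1={macro_f1:.4f}  weighted-F1={wt_f1:.4f}")
```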
4. Results
4.1. Baseline CNNs
To establish performance benchmarks and contextualize the performance of the proposed CornViT framework, baseline experiments were carried out. Two convolutional neural networks (CNNs) [5] were utilized: ResNet-50 [15,16] and DenseNet-121 [17]. Both models were chosen because they represent strong, well-established backbones for image classification and have been extensively used in agricultural and plant-phenotyping tasks.
A broader survey including additional lightweight CNNs such as EfficientNet or MobileNet would be valuable, but lies outside the scope of this study. Our goal in this work is not to exhaustively benchmark all CNN variants, but rather to compare a representative pair of strong, widely used CNN backbones against the proposed multi-stage CvT framework on a newly curated kernel dataset. Exploring a broader range of mobile-oriented CNNs is therefore left as complementary future work, orthogonal to the main question of whether CvT offers advantages for hierarchical kernel-level analysis.
ResNet-50 is a 50-layer residual network that introduces identity-based skip connections (residual blocks) to ease the optimization of deep models and mitigate vanishing-gradient issues. Each residual block learns a residual mapping with respect to its input, allowing gradients to flow more directly through the network and enabling effective training of very deep architectures [16].
DenseNet-121 is a densely connected convolutional network in which each layer receives, as input, the concatenation of all feature maps from preceding layers within the same dense block. This design encourages feature reuse, improves information flow, and reduces the number of parameters compared to traditional feed-forward CNNs with comparable depth [17].
For all three CornViT stages (purity, shape, embryo orientation), the baselines were configured as follows:
Both ResNet-50 and DenseNet-121 were initialized from ImageNet-pretrained weights.
The final classification layers were replaced with new 2-class fully connected heads, matching the binary tasks at each stage (see the sketch after this list).
The same preprocessing and data augmentation pipeline described in Section 3.3 was applied (resize, random flips, color jitter, small rotations, and ImageNet normalization), ensuring that differences in performance were attributable primarily to the backbone architecture rather than to differing data pipelines.
Training was performed using a binary cross-entropy loss and the Adam optimizer. The number of epochs, batch size, and learning-rate schedule were kept comparable to those used for the CvT models to provide a fair comparison.
The train/validation/test splits were the same as those used for the corresponding CornViT models, again ensuring a fair comparison.
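A sketch of the head replacement described above, using torchvision (the weight-enum names follow recent torchvision versions; the exact initialization in our experiments may differ):

```python
import torch.nn as nn
from torchvision import models

# ResNet-50 baseline: swap the 1000-class ImageNet head for a 2-class head
resnet = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)
resnet.fc = nn.Linear(resnet.fc.in_features, 2)

# DenseNet-121 baseline: same replacement on its classifier layer
densenet = models.densenet121(weights=models.DenseNet121_Weights.IMAGENET1K_V1)
densenet.classifier = nn.Linear(densenet.classifier.in_features, 2)
```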
Results from Table 5 confirm that DenseNet-121 provides a stronger baseline than ResNet-50 across all three tasks, likely due to its enhanced feature reuse and gradient flow. Nevertheless, the proposed CvT-based CornViT models achieve higher accuracies in all stages (93.76%, 94.11%, and 91.12%, respectively), highlighting the benefit of transformer-based global reasoning and the hierarchical design for fine-grained kernel classification.
4.2. Stage 1—Pure vs. Impure Classification
Stage 1 distinguishes impure from pure kernels across 1090 test images. Table 6 summarizes per-class metrics, and the corresponding confusion matrix is shown in Figure 7. Performance is well-balanced across classes, with both pure and impure kernels achieving F1 scores above 0.93. This is important in practice: false positives (impure labeled as pure) can contaminate subsequent grading, while false negatives (pure labeled as impure) reduce usable yield. The symmetry of Precision and Recall suggests that CornViT Stage 1 maintains a good trade-off between sensitivity and specificity.
Compared to the CNN baselines, the CvT-based model improves absolute accuracy by roughly 7–17 percentage points and substantially increases F1 scores, highlighting the benefit of transformer-based global reasoning even for relatively simple binary tasks.
4.3. Stage 2—Shape Classification
Stage 2 receives pure kernels and predicts flat versus round morphology on a test set of 578 images. Table 7 reports the quantitative results, and the corresponding confusion matrix is shown in Figure 8. Accuracy improves slightly relative to Stage 1, and both shape classes show nearly identical F1 scores (≈0.94). Shape classification is inherently more subtle than gross impurity detection, as it primarily depends on global geometry and kernel silhouette. The high performance of Stage 2 indicates that the CvT backbone effectively captures these morphological cues, supporting its suitability for shape-sensitive grading tasks.
This performance clearly surpasses the ResNet-50 and DenseNet-121 baselines, likely because those architectures are less effective at modeling the long-range spatial context required for accurate kernel shape analysis.
4.4. Stage 3—Embryo Orientation Classification
Stage 3 is the most challenging step: given pure, flat kernels, it predicts whether the embryo is facing up or down. The test set contains 293 images. Table 8 shows per-class performance, and the corresponding confusion matrix is shown in Figure 9.
Despite the fine-grained nature of the task and the smaller sample size, Stage 3 achieves over 91% accuracy with a correspondingly high macro F1. Embryo-up kernels are slightly easier to classify (higher Recall and F1), likely because embryo structures (e.g., scutellum, germ region) produce distinctive surface cues when facing the camera, whereas embryo-down kernels present fewer distinctive cues and can vary with lighting and angle.
The performance gap relative to the CNN baselines is even more pronounced here; CNNs find it difficult to distinguish such subtle orientation cues, whereas the CvT’s attention mechanisms appear to better capture global texture and shape patterns that signal orientation.
To summarize the comparative performance across all three stages, Table 9 reports test-set accuracies for CornViT and the two CNN baselines. CornViT achieves the highest accuracy in every stage, with gains of roughly 7 to 17 percentage points over ResNet-50 and about 2 to 7 percentage points over DenseNet-121.
4.5. Overall Pipeline Behavior and Error Propagation
Since the proposed framework, CornViT, is a hierarchical pipeline, errors in an early stage can propagate downstream. For example, a kernel misclassified as impure in Stage 1 will never be considered for shape or orientation analysis. However, the very high Stage 1 accuracy (≈93.8%) limits the number of such cases.
One way to quantify end-to-end performance is to consider the effective accuracy for particularly important label combinations, such as “pure, flat, embryo up.” Assuming independence between stages, a rough lower bound on the probability of correctly classifying a kernel as pure, flat, and embryo up is

$$0.9376 \times 0.9411 \times 0.9112 \approx 0.804.$$

This indicates that roughly 80% of kernels that traverse all three stages would be classified correctly. In practice, the true end-to-end accuracy depends on the distribution of samples and the correlation of errors across stages. A more precise characterization of real-world performance would require a direct end-to-end evaluation of the complete three-stage pipeline on a joint test set covering all purity–shape–orientation combinations, which we leave for future work.
4.6. Visual Analysis
To complement the quantitative metrics reported so far, we performed qualitative visual analyses of CornViT's predictions. Figure 10 illustrates representative examples from each stage, including both correctly and incorrectly classified kernels, where $y$ is the true label and $\hat{y}$ is the predicted label.
For Stage 1, typical misclassifications involve borderline impurities: kernels that carry small artifacts such as a tip cap (the red/brown patch where the kernel was attached to the cob) or that deviate slightly from the idealized shape. These subtle irregularities can cause the model to hesitate or assign the wrong label. In Stage 2, errors are concentrated on kernels with intermediate shapes that lie between the “flat” and “round” prototypes. In Stage 3, the most challenging cases are kernels where the embryo is partially visible or where the embryo-facing side differs only subtly in appearance from the opposite side.
6. Discussion
The results demonstrate that the proposed CornViT framework can reliably reproduce human-style hierarchical reasoning for corn kernel grading. Across all three stages, the CvT-13 backbone paired with stage-specific binary heads achieved high and well-balanced performance, with accuracies exceeding 91%. This indicates that the combination of convolutional token embedding and convolutional self-attention projections provides a strong inductive bias for capturing both local texture and global morphology in kernel images.
A first observation is that the hierarchical design is both effective and practical. Rather than forcing the model to infer a composite label in a single step, CornViT decomposes grading into three interpretable decisions: whether the kernel is pure, what its gross shape is, and, for pure–flat kernels, how the embryo is oriented. Each decision aligns with a question human graders routinely ask, making the intermediate outputs meaningful in their own right. The high and symmetric Precision–Recall values in Stages 1 and 2 suggest that these initial filters are robust, providing a stable foundation for more subtle orientation analysis in Stage 3.
The comparison with CNN baselines further highlights the benefits of transformer-based backbones in this domain. Under identical training conditions, CornViT consistently outperforms strong CNN architectures such as ResNet-50 and DenseNet-121 across all stages, with the largest gains observed in the embryo-orientation task, where discrimination depends on subtle cues in kernel surface structure and silhouette. The results suggest that convolution-augmented self-attention is especially well-suited to tasks that combine fine-grained texture analysis with global shape reasoning.
The stage-wise dataset design also plays a crucial role. By constructing separate, carefully curated datasets for purity, shape, and embryo orientation, we avoid conflating label noise in the original source with model capacity. Each stage can be trained and evaluated on labels tailored to its specific decision, enabling a cleaner analysis of model behavior and error modes. At the same time, the three datasets can be reassembled to study error propagation in the full pipeline, e.g., how misclassifications in Stage 1 affect downstream shape and orientation predictions.
From a deployment perspective, the hierarchical structure and the accompanying Flask-based web application make the system attractive for practical seed quality workflows. The web interface exposes stage-wise predictions and confidences through a simple browser front-end, lowering the barrier to adoption for agronomists and seed technicians who may not be familiar with deep learning frameworks. The ability to skip later stages when a kernel is clearly impure also saves computation in potential high-throughput settings.
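The web front-end itself is not reproduced here, but a minimal Flask endpoint wrapping the hierarchical pipeline might look like the following. The route name, payload format, and the reuse of the classify_kernel sketch from Section 3.5 are our assumptions, not the released application's exact API:

```python
from flask import Flask, jsonify, request
from PIL import Image

app = Flask(__name__)

# Assumed to be loaded once at startup: the three trained CvT-13 stage models
# (f1, f2, f3) and the deterministic val_transforms pipeline; classify_kernel
# is the hierarchical inference sketch from Section 3.5.

@app.route("/classify", methods=["POST"])
def classify():
    # A single-kernel image is uploaded as multipart form data
    image = Image.open(request.files["image"].stream).convert("RGB")
    purity, shape, orientation = classify_kernel(
        image, f1, f2, f3, val_transforms)
    return jsonify({
        "purity": purity,
        "shape": shape,              # None if the kernel is impure
        "orientation": orientation,  # None unless the kernel is pure and flat
    })

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)
```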
Beyond corn kernels, CornViT fits into a broader trend toward automated, image-based classification of agricultural products. Similar hierarchical pipelines could be designed for fruit or vegetable grading (e.g., separating healthy from damaged fruit before finer defect categorization) or for multi-stage assessment of silage quality, where stages might reflect purity, particle-size characteristics, and kernel processing score. In this sense, CornViT complements recent work such as Succulent-YOLO [11], which uses a modern YOLO-based pipeline with CLIP-enhanced features for agricultural product classification from UAV imagery, and the 2022 study on real-time estimation of corn silage kernel processing score from image data [12], both of which demonstrate how deep learning can support decision-making across the crop production chain.
At the same time, several limitations merit discussion. First, the imaging setup is relatively controlled, with a single general style of background, lighting, and camera placement. The extent to which CornViT generalizes to other cameras, lighting conditions, or varieties remains to be fully explored. Second, the current model operates on single kernels placed on a uniform background; extending the approach to multi-kernel scenes or conveyor-belt imagery would require additional detection or segmentation modules. Third, the independence assumption used in our rough joint-accuracy estimate may not hold perfectly in practice, as errors across stages can be correlated. Finally, although head-only fine-tuning is computationally efficient and performed well here, it constrains the model's ability to adapt to the fine-grained visual differences that drive the Stage 3 embryo-orientation decision. More flexible strategies, such as selectively unfreezing the final CvT stage or applying layer-wise learning-rate decay, could enable the model to better capture subtle structural cues associated with embryo position.
Future work will investigate several directions, including progressive unfreezing and low-rank adaptation (LoRA), as alternatives that balance stability with representational flexibility. They will also examine joint multi-task training with a shared backbone and multiple heads, which may improve parameter efficiency and exploit shared structure across stages. Domain adaptation and data augmentation strategies could be explored to improve robustness to new imaging setups and additional corn varieties. Incorporating attention or saliency visualizations directly into the web interface may further enhance interpretability for end users. Finally, extending the hierarchical framework to other crop species and quality attributes such as damage, disease, or varietal classification could broaden its relevance to precision agriculture and seed-processing pipelines.
7. Conclusions
This paper presented CornViT, a multi-stage Convolutional Vision Transformer framework for hierarchical corn kernel analysis. By explicitly mirroring the reasoning process of human seed analysts through three sequential decisions (purity: impure vs. pure; shape: flat vs. round; embryo orientation: up vs. down), CornViT delivers accurate and interpretable kernel-level grading from RGB images. Across dedicated test sets, the three CvT-13 models achieved stage-wise accuracies of 93.76%, 94.11%, and 91.12%, respectively, with strong per-class F1 scores, including for the more challenging embryo-orientation task.
Compared with strong CNN baselines, CornViT consistently delivers higher stage-wise performance, underscoring the benefits of convolution-augmented self-attention for capturing global morphology and subtle surface cues within a hierarchical decision framework.
A further contribution of this work is the construction of a ready-to-use, stage-wise annotated dataset tailored to hierarchical kernel classification. Each image is labeled for purity, morphology, and, where applicable, embryo orientation, enabling both isolated stage-wise training and end-to-end pipeline evaluation. Together with the curated datasets, the lightweight Flask-based web application that encapsulates the full CornViT pipeline makes the approach readily deployable in laboratory and industrial environments.
Future research directions include investigating progressive unfreezing, joint multi-task training of a shared backbone, exploring more advanced fine-tuning strategies, and improving robustness to diverse imaging conditions and corn varieties. Extending the framework to multi-kernel scenes, conveyor-based inspection, and other crop species would further enhance its impact on smart seed processing and precision agriculture. Overall, the CornViT framework and accompanying dataset provide a solid foundation for accurate, interpretable, and deployable seed- and kernel-level quality control systems.