Featured Application
This study presents a neural network-based approach for predicting peach maturity using tabular sensor data. The proposed model can be integrated into real-time fruit sorting pipelines, enhancing classification accuracy and decision support in the agri-food sector, with additional applicability in autonomous harvesting platforms such as drone-operated picking systems.
Abstract
Peach maturity at harvest is a critical factor influencing fruit quality and postharvest life. Traditional destructive methods for maturity assessment, although effective, compromise fruit integrity and are unsuitable for practical implementation in modern production. This study presents a machine learning approach for non-destructive peach maturity prediction using tabular data collected from 701 ‘Redhaven’ peaches. Three neural network models suitable for small tabular datasets (TabNet, SAINT, and NODE) were applied and evaluated using classification metrics, including accuracy, F1-score, and AUC. The models demonstrated consistently strong performance across several feature configurations, with TabNet achieving the highest accuracy when all non-destructive measurements were available and providing the most robust and practical performance both on the comprehensive non-destructive subset and in optimized minimal-feature settings. These findings indicate that non-destructive sensing methods, particularly when combined with modern neural architectures, can reliably predict maturity and offer potential for real-time, automated fruit selection after harvest. The integration of such models into autonomous harvesting systems, for instance, through drone-based platforms equipped with appropriate sensors, could significantly improve efficiency and fruit quality management in horticultural peach production.
1. Introduction
Monitoring fruit maturity is crucial for determining the optimal harvest time to achieve the best balance between eating quality (flavour, sweetness) and the durability required for shipping and handling. Harvesting too early results in tasteless, underripe fruit, while harvesting too late leads to overripe, soft fruit that spoils quickly and has a very short post-harvest life [1]. The firmness of the peach fruit is closely related to harvest maturity and is the main factor that affects the post-harvest storage characteristics of the fruit. Firmness has been identified as the most reliable parameter for determining maturity compared to other quality indicators [2]. Peach firmness is typically measured using a penetrometer, which applies a standardized force to penetrate the fruit flesh. In this research, peach firmness was measured using a penetrometer equipped with an 8 mm wide cylindrical probe: after skin removal at four equatorial sides, the maximum penetration force was recorded [3,4]. This presents a significant limitation, as this maturity assessment method is destructive to the fruit. Alongside mechanical tests, dielectric properties have also been explored for maturity evaluation. Although dielectric property measurements have proven to be reliable predictors in models for evaluating peach maturity, traditional dielectric measurement methods require inserting electrodes into the fruit. This makes the technique equally destructive and impractical for high-throughput sorting [5]. Therefore, there is a need for an accurate measure of fruit maturity based on parameters that can be measured non-destructively.
Based on a literature review, the primary non-destructive methods for assessing fruit maturity can be categorized into several groups: Near-Infrared Spectroscopy (NIRS), Computer Vision, and other techniques such as acoustic methods and electronic nose. The complex data obtained from these techniques are subsequently processed by advanced machine learning methods, which translate the information into specific maturity indicators such as fruit firmness and soluble solid content, both of which are commonly associated with peach maturity.
As early as 2009, it was demonstrated that portable Vis-NIR spectroscopy can accurately and non-destructively predict apricot fruit quality attributes such as firmness and soluble solids content. The results showed a strong correlation between spectral data and internal quality, confirming the method’s potential for real-time, in-field maturity assessment [6]. More recently, Vis-NIR spectroscopy has been combined with machine learning models (such as decision trees and gradient boosting) for non-destructive assessment of grape ripeness. The authors of that study achieved high accuracy in predicting key ripeness indicators, including anthocyanin content, soluble solids content, and titratable acidity, confirming the method’s efficiency for in-field analysis [7].
In parallel with spectroscopic approaches, computer vision and deep learning techniques have been used to develop one-stage instance segmentation models utilizing architectures such as Mask R-CNN and YOLO to detect and classify the maturity stages of peaches. This method demonstrates high classification accuracy and offers a robust solution for the automated harvesting and sorting of peaches under real-world orchard conditions [8].
The use of machine learning, and especially neural networks, in fruit maturity prediction is becoming increasingly common due to their ability to perform accurate and automated classification based on visual or sensor data.
The application of deep learning is not limited to peaches. For instance, a study on bananas [9], which are also highly perishable, focuses on the development of deep learning models for maturity classification using convolutional neural networks (CNN) and the AlexNet architecture applied to image data. The emphasis is on computer vision, data augmentation, and implementation in the Keras/TensorFlow framework for robust and accurate classification. The proposed CNN model achieved up to 99.36% validation accuracy, outperforming AlexNet in most evaluated scenarios.
However, these neural network approaches to fruit maturity prediction rely almost exclusively on images. Few models focus on non-destructive prediction using only tabular data, even though sensors providing such features are significantly cheaper, and model training is considerably faster.
A study by Ropelewska [10] explores the use of non-destructive sensor measurements combined with machine learning models, including Random Forest, Naive Bayes, k-NN, and Decision Tree, for classifying peaches at different maturity stages. Based on tabular data such as firmness, skin color (L*, a*, b*), size, and weight collected via low-cost sensors, the Random Forest model achieved classification accuracy exceeding 90%. These findings highlight that low-cost, non-destructive sensing technologies paired with machine learning provide an efficient and reliable approach for maturity prediction in peaches.
Another study [11] aligns well with other machine learning approaches for fruit maturity prediction, as it relies exclusively on structured tabular data rather than image-based inputs. By evaluating models such as Random Forest, Gradient Boosting, XGBoost, and Decision Trees on physicochemical measurements of oil palm fruit, it confirms that high predictive performance can be achieved without visual features. In particular, the XGBoost model demonstrated superior accuracy, reinforcing the value of ensemble methods in agricultural classification tasks. While these classical models have shown strong performance, the question remains whether newer, specialized neural network architectures for tabular data can offer superior performance or different advantages such as interpretability.
In a recent study [12], the performance of eight machine learning models was evaluated for predicting peach maturity using exclusively non-destructive tabular features, including fruit dimensions and skin color parameters. Following feature selection with the LASSO method, the artificial neural network (ANN) model demonstrated the highest classification accuracy among the tested approaches.
As neural network models demonstrated the highest accuracy in the study [12], a review of the literature identified three neural architectures that have shown strong performance on tabular data: TabNet, NODE, and SAINT [13,14,15].
TabNet employs a sequential attention mechanism to select relevant features at each decision step, enabling both interpretability and efficient representation learning [16]. NODE integrates differentiable decision trees into deep networks, leveraging the strengths of ensemble methods while retaining end-to-end trainability [17]. SAINT introduces a transformer-based architecture with both feature-level and intersample attention, capturing complex dependencies within and across data instances. These models represent a new generation of deep learning methods specifically optimized for structured/tabular datasets, often matching or surpassing traditional gradient boosting techniques in predictive performance [18].
The aim of this study is to evaluate the applicability and performance of recent neural network architectures specifically optimized for tabular data in the context of non-destructive peach maturity prediction. By comparing TabNet, SAINT, and NODE on a real-world dataset of sensor-based measurements, this research seeks to identify models that combine high classification accuracy with efficient training and potential for integration into autonomous harvesting systems. Ultimately, the study contributes to the development of reliable, low-cost, and scalable machine learning solutions for decision support in precision horticulture.
Unlike previous studies that primarily relied on classical machine learning algorithms or deep learning models applied to image data, this research focuses on neural network architectures specifically designed for tabular inputs. While earlier work demonstrated the feasibility of using low-cost sensor data with models such as Random Forest or ANN, this study systematically compares advanced tabular-specific neural networks (TabNet, SAINT, and NODE), highlighting their potential to achieve high accuracy while preserving interpretability (TabNet) and efficiency in non-destructive fruit maturity classification.
In this study, the Index of Absorbance Difference (iad) emerged as a particularly strong predictor of peach maturity, often dominating model performance. However, although iad sensors can be integrated into industrial and drone-based systems, they are relatively expensive. Since iad also represents a dominant predictor of maturity, additional experiments were conducted to evaluate model performance when this strong feature was excluded and when models were trained on compact subsets of the most informative non-destructive variables. The results demonstrated that even in the absence of iad, and when relying on reduced feature sets, the evaluated neural network models achieved satisfactory classification accuracy, suggesting that reliable maturity prediction can be accomplished using features derived from low-cost sensors performing non-destructive measurements.
2. Materials and Methods
All data processing, analysis, and model development were performed using the Python programming language (v3.9). Key libraries included Pandas (v2.3.1) for data manipulation, Scikit-learn (v1.7.1) for machine learning, and Matplotlib (v3.10.3) for data visualization, together with PyTorch (v2.7.1)-based libraries for neural network modelling (including pytorch_tabnet, node-tabular, and an open-source implementation of SAINT).
2.1. Dataset Description
Generating a dataset containing this type of measurement is a resource-intensive process, demanding significant investment in both time and labor. Consequently, the datasets utilized in many of the aforementioned studies are of a relatively limited scale, particularly within the context of neural network applications. For instance, the dataset presented in [12] comprises 180 samples, whereas the one in [2] includes merely 120 measurements. The variables used in the former study are marked with a (+) symbol in Table 1, as they correspond closely to those measured in the present work and are therefore directly comparable.
Table 1.
Description of measured variables used in the dataset. The (+) symbol denotes variables also used in the study by Ljubobratović et al. [12].
The dataset employed in this research comprises 701 samples of peach fruit (Prunus persica), each characterized by 44 variables (one target variable and 43 predictors) utilized for classifying maturity stages. All samples were collected from a single cultivar (‘Redhaven’) grown at one geographical location, which may limit the generalizability of the models to other cultivars or growing conditions. Therefore, model performance should be independently validated before application to different peach varieties or environments. The data are maintained in a CSV file format. In the USA, lack of taste and failure to ripen are the main reasons that limit stone fruit consumption [19]. Peaches, being susceptible to bruising due to their fleshy mesocarp, are usually harvested earlier so that they last longer and withstand handling. However, although they are climacteric fruit, there is a connection between on-tree physiological maturity and the development of key fruit quality traits [20]. Hence, harvesting too early can lead to consumer dissatisfaction and reduced further consumption. In order to meet the quality standards, peach fruit at harvest should not exceed firmness higher than (Ramina et al. [21], according to Neri et al. [22]). This firmness threshold was therefore used in this study, primarily to address consumers' growing dissatisfaction with peach fruit eating quality. Following the main maturity segmentation presented in this paper, further segmentation should also be carried out in packinghouses (overripe, ready to eat, ready to buy, and storable peaches).
The remaining 43 variables function as predictors and are composed of a diverse array of physical, colorimetric, biochemical, and electrical measurements. These can be broadly categorized into the following groups:
- Physical and Morphometric Attributes: vol: Peach volume, mass: Peach mass, firm: Fruit firmness
- Biochemical and Sensory Features: tst: Soluble solids content, ta: Titratable acidity
- Colorimetric Descriptors: udb: share of additional color, md_X, moo_X, mop_X: Maximum values of additional color, ground, and petiole ground color
- Impedance-Based Electrical Properties: zs_nd, th_nd: Zs and Ts components (non-destructive), zs_d, th_d: Zs and Ts components (destructive), zs_nd_X, th_nd_X: non-destructively obtained Ts and Zs adjusted for deviation from mean volume
- Ripening Index: iad: index of absorbance difference
With a clearly defined categorical target and a heterogeneous mix of continuous, ordinal, and categorical features, this dataset is highly suitable for the development and evaluation of supervised machine learning models for fruit quality assessment and maturity classification.
A total of 701 fruits at various ripening stages were analyzed during this period, resulting in a dataset comprising 701 samples with 43 predictor variables. The dataset does not include the complete photographic set from four different angles nor the multispectral imaging results, which were conducted only on one-third of the fruits. Nevertheless, it provides a highly comprehensive representation of the morphological and physicochemical characteristics of peaches. The list of all variables, including their dataset names and descriptions, is provided in Table 1. The (+) symbol in the description indicates variables that were also used in the study by Ljubobratović et al. [12], representing comparable measurement features. Variables th_nd_X and zs_nd_X were adjusted based on deviation from the average fruit volume. All of the above-mentioned variables, with the exception of the destructive and non-destructive dielectric properties, were measured as reported in our previous study [23]. Dielectric property measurements were conducted on all fruits in two orthogonal orientations: first along the narrower longitudinal axis of the peach, and subsequently along the wider transverse axis. All measurements were carried out using an ET430 handheld LCR meter with a drive voltage of 600 mV and a frequency of 10 kHz, parameters previously identified as optimal for characterization of peach physicochemical properties [24]. Additionally, the udb assessment was performed using a slightly modified protocol, which included an additional category (0) denoting peaches exhibiting no additional coloration.
2.2. Data Preprocessing
Prior to model training, the dataset underwent a structured preprocessing pipeline. The raw data were first imported from a semicolon-delimited CSV file with comma used as the decimal separator. A non-informative identifier column (br) was removed. The target variable (ripe) was isolated, while the remaining 43 variables were used as predictors. Categorical features were encoded using the OrdinalEncoder, and all numerical features were standardized to zero mean and unit variance using the StandardScaler. The fitted encoder and scaler were retained for future use on test data to ensure consistency in data transformation.
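The preprocessing steps described above can be sketched as follows. This is a minimal illustration rather than the study's actual script: the in-memory CSV content and its values are invented, only a few columns are shown, and treating udb as the sole categorical column is an assumption. The column names br, ripe, mass, iad, and udb are taken from the text.

```python
import io

import pandas as pd
from sklearn.preprocessing import OrdinalEncoder, StandardScaler

# Invented stand-in for the real file: semicolon-delimited, comma as decimal mark.
RAW = """br;mass;iad;udb;ripe
1;152,4;1,21;2;0
2;171,9;0,64;3;1
3;149,0;1,35;0;0
4;180,2;0,48;4;1
"""

df = pd.read_csv(io.StringIO(RAW), sep=";", decimal=",")
df = df.drop(columns=["br"])   # drop the non-informative identifier
y = df.pop("ripe")             # isolate the binary target
X = df                         # remaining columns are predictors

categorical = ["udb"]          # assumption: which columns are categorical
numerical = [c for c in X.columns if c not in categorical]

# Fit on the training data only, then reuse the fitted objects on held-out data.
encoder = OrdinalEncoder().fit(X[categorical])
scaler = StandardScaler().fit(X[numerical])


def transform(frame):
    """Apply the already-fitted encoder and scaler to a predictor frame."""
    out = frame.copy()
    out[categorical] = encoder.transform(frame[categorical])
    out[numerical] = scaler.transform(frame[numerical])
    return out


X_prep = transform(X)
```

Keeping the fitted encoder and scaler as separate objects makes it straightforward to call the same transform on the test set without refitting.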
To ensure robust model development and to avoid any form of information leakage, all model selection procedures were performed exclusively within the training portion of the data. After the initial stratified train-test split, the 75% training subset (525 samples) was further partitioned using stratified 5-fold cross-validation. In each fold, approximately 80% of the training set (420 samples) was used for model training, while the remaining 20% (105 samples) served as a validation fold for monitoring training behaviour, tuning hyperparameters, and applying early stopping.
Importantly, no sample from the independent 25% test set was ever used during cross-validation or any phase of model construction. This ensures that all reported test-set results reflect true model generalization on entirely unseen data.
Train-Test Partitioning
Before any model development, the complete dataset of 701 Redhaven peach samples was partitioned into a stratified training-test split. A total of 75% of the data (525 samples) was allocated to the training set, while the remaining 25% (176 samples) formed an independent held-out test set. The split was stratified with respect to the binary maturity label to preserve the original class distribution in both subsets. The test set was not used at any stage of model development, including feature preprocessing, hyperparameter tuning, early stopping, or model selection. All preprocessing steps (imputation, encoding, and standardization) were fitted exclusively on the training set and subsequently applied to the test set using the same fitted parameters, ensuring a fully isolated and unbiased evaluation on unseen data.
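The partition sizes quoted in this section can be reproduced with scikit-learn on synthetic data. The sketch below is illustrative only: the features, labels, class balance, and random seeds are invented, and whether the original pipeline shuffled the folds is an assumption.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, train_test_split

# Synthetic stand-in for the 701-sample dataset (values and labels are invented).
rng = np.random.default_rng(0)
X = rng.normal(size=(701, 5))
y = (rng.random(701) < 0.5).astype(int)

# Stratified 75/25 split: the test part is set aside and not touched again.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

# Stratified 5-fold cross-validation inside the training part only.
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
fold_sizes = [(len(tr), len(va)) for tr, va in skf.split(X_train, y_train)]

print(len(X_train), len(X_test))  # sizes of the training and test subsets
print(fold_sizes)                 # (inner training, validation) sizes per fold
```

With 701 samples, this yields 525 training and 176 test samples, and each fold uses 420 samples for training and 105 for validation, matching the figures reported in the text.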
2.3. Models Description
Three neural network models specifically designed for tabular data were trained on the preprocessed dataset: TabNet, NODE, and SAINT. Each model embodies a distinct architectural approach to learning from structured data. TabNet employs sequential attention for feature selection and interpretability [13], NODE utilizes ensembles of differentiable oblivious decision trees [14], while SAINT incorporates both self-attention across features and intersample attention across rows [15].
All models were trained exclusively on the training portion of the dataset (75% of the full dataset; 525 samples). Model selection, hyperparameter tuning, and early stopping were performed using stratified 5-fold cross-validation applied only within the training subset. The independent 25% test set (176 samples) was withheld throughout the entire development process and used solely for final evaluation. Detailed architectural and implementation settings for each model are described below.
TabNet is a deep learning architecture specifically designed for tabular data. It utilizes a sequential attention mechanism that enables the model to focus on the most relevant features at each decision step. This structure not only enhances predictive performance but also provides interpretability through feature importance masks [13]. In our implementation, the TabNetClassifier from the pytorch_tabnet library was used. The model was trained for up to 200 epochs with the following hyperparameters: 3 decision steps, 8-dimensional feature transformer, 8-dimensional attention transformer, and a batch size of 256. Early stopping was applied with a patience of 20 epochs.
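To illustrate how an attention step can select only a few features, the sketch below implements sparsemax, the sparse normalization that the published TabNet architecture uses to produce its feature masks. It is a standalone NumPy illustration, not code from pytorch_tabnet, and the input scores are invented.

```python
import numpy as np


def sparsemax(z):
    """Project a score vector onto the probability simplex (sparse output)."""
    z = np.asarray(z, dtype=float)
    z_sorted = np.sort(z)[::-1]               # scores in descending order
    cumsum = np.cumsum(z_sorted)
    k = np.arange(1, z.size + 1)
    support = 1 + k * z_sorted > cumsum       # which sorted entries stay non-zero
    k_max = k[support][-1]                    # size of the support
    tau = (cumsum[k_max - 1] - 1) / k_max     # threshold subtracted from all scores
    return np.maximum(z - tau, 0.0)


mask = sparsemax([0.5, 0.4, -1.0])  # invented attention scores for three features
```

Unlike softmax, the result contains exact zeros, so features with low scores are excluded from a decision step entirely, which is what makes the masks readable as feature selections.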
NODE (Neural Oblivious Decision Ensembles) replaces traditional decision tree ensembles with differentiable oblivious decision trees, in which all nodes at the same depth share the same splitting feature and threshold. This architecture enables end-to-end gradient-based optimization while preserving tree-like inductive biases [14]. The model was implemented using the node package by Popov et al. [17] and trained using the inner 5-fold cross-validation procedure applied only to the training subset. No information from the test set was used during training, validation, hyperparameter selection, or early stopping. The architecture consisted of an Oblivious Differentiable Sparsemax Trees (ODST) layer with 160 trees, each having a depth of 7, a batch size of 256, and a maximum of 200 training epochs with early stopping (patience = 25). Optimization was carried out using the Adam optimizer with a learning rate of 0.001.
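The structure of an oblivious decision tree can be sketched in a few lines. The example below is a simplified, non-differentiable version: NODE replaces the hard comparisons with soft, learnable choices, which are omitted here. All feature indices, thresholds, and leaf values are invented for illustration.

```python
def oblivious_tree_predict(x, features, thresholds, leaf_values):
    """Evaluate one hard oblivious decision tree on a single sample x.

    Every level of the tree uses one (feature, threshold) pair, so a tree of
    depth d is fully described by d pairs and 2**d leaf values.
    """
    leaf = 0
    for feat, thr in zip(features, thresholds):
        bit = 1 if x[feat] > thr else 0   # same test for every node at this depth
        leaf = (leaf << 1) | bit          # build the leaf index bit by bit
    return leaf_values[leaf]


# Depth-2 example with invented parameters: 2 levels, 4 leaves.
features = [0, 1]
thresholds = [0.5, 10.0]
leaf_values = [0.1, 0.2, 0.7, 0.9]

pred_a = oblivious_tree_predict([0.8, 5.0], features, thresholds, leaf_values)
pred_b = oblivious_tree_predict([0.2, 12.0], features, thresholds, leaf_values)
```

At the depth of 7 reported above, each tree would have 2^7 = 128 leaves while still requiring only seven feature-threshold pairs.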
SAINT (Self-Attention and Intersample Attention Transformer) extends the transformer architecture to tabular data by combining both self-attention across features and intersample attention across rows. This dual-attention mechanism allows the model to capture complex inter-feature and inter-sample dependencies [15]. We used an adapted implementation based on the open-source SAINT repository. The model was trained for up to 500 epochs with a batch size of 128, a transformer dimension of 64, 6 attention layers, and 8 attention heads, with a dropout rate of 0.05. Categorical features were embedded, and numerical features were normalized. The training pipeline included label smoothing, stochastic depth, and CutMix augmentation (lambda = 0.1). Optimization was performed using the AdamW optimizer (learning rate = 0.0001, weight decay = 0.01), with early stopping applied using a patience of 25 epochs.
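The idea of intersample attention, where each row attends to the other rows of the same batch, can be illustrated with a deliberately simplified NumPy sketch. It assumes a single attention head and identity query, key, and value projections, none of which reflects the actual SAINT implementation. The batch shape and values are invented.

```python
import numpy as np


def softmax(a, axis=-1):
    """Numerically stable softmax along the given axis."""
    a = a - a.max(axis=axis, keepdims=True)
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)


def intersample_attention(batch):
    """Mix information across rows of a batch.

    batch has shape (n_samples, n_features, d_embed). Each sample is flattened
    to one vector, and attention weights are computed between samples.
    """
    n, f, d = batch.shape
    rows = batch.reshape(n, f * d)
    scores = rows @ rows.T / np.sqrt(f * d)   # (n, n) sample-to-sample similarity
    weights = softmax(scores, axis=1)         # each row of weights sums to 1
    mixed = weights @ rows                    # weighted combination of samples
    return mixed.reshape(n, f, d), weights


rng = np.random.default_rng(0)
batch = rng.normal(size=(4, 3, 2))            # 4 samples, 3 features, 2-dim embeddings
mixed, weights = intersample_attention(batch)
```

The (n, n) weight matrix is what distinguishes this step from feature-level self-attention, where the attention matrix would instead be of size (n_features, n_features) within each sample.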
2.4. Evaluation Setup and Metrics
To evaluate the predictive performance of the models, three standard classification metrics were employed: Accuracy, F1-score, and Area Under the Receiver Operating Characteristic Curve (AUC).
Train-validation-test protocol. All models were developed using a strict separation between training, validation, and testing data. The dataset was first split into a training subset (75%; 525 samples) and an independent test subset (25%; 176 samples). Only the training subset was used for model development, including preprocessing, standardization, hyperparameter selection, and early stopping. A stratified 5-fold cross-validation procedure was applied exclusively within the training subset. In each fold, 4/5 of the training data served as the inner training split, while 1/5 was used as the validation split for tuning and early stopping. No information from the test subset was used at any stage of model selection or parameter optimization.
After cross-validation, the final trained models were evaluated once on the independent test subset to obtain the final Accuracy, F1-score, and AUC metrics reported in Section 3. These test-set metrics represent the true generalization performance of each model, while the cross-validation metrics are used only for model selection and comparative analysis.
Accuracy quantifies the proportion of correctly classified samples among all predictions. While widely used, this metric may be less informative in the presence of class imbalance, as it does not account for the distribution of true positives and false negatives across individual classes [25]. F1-score represents the harmonic mean of precision and recall, providing a balanced assessment of both false positives and false negatives. It is especially suitable when the cost of misclassification is asymmetric or when dealing with uneven class distributions, as in the case of fruit maturity stages [25]. AUC measures the ability of the classifier to discriminate between classes, by computing the area under the ROC curve. It is threshold-independent and reflects the trade-off between the true positive rate and false positive rate across all decision thresholds. In multi-class settings, AUC is computed using a one-vs-rest approach and averaged across all classes [26].
These metrics together offer a comprehensive evaluation of model performance, considering both classification correctness and class-wise discrimination capacity.
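The three metrics can be made concrete with a small worked example in plain Python. The labels and scores below are invented, and the binary F1-score for the positive class is an assumption, since the averaging scheme is not specified here. In practice, library routines such as scikit-learn's accuracy_score, f1_score, and roc_auc_score would normally be used.

```python
def accuracy(y_true, y_pred):
    """Share of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def f1_binary(y_true, y_pred, positive=1):
    """Harmonic mean of precision and recall for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def auc_binary(y_true, scores, positive=1):
    """Probability that a random positive is scored above a random negative."""
    pos = [s for t, s in zip(y_true, scores) if t == positive]
    neg = [s for t, s in zip(y_true, scores) if t != positive]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


y_true = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.6, 0.3, 0.1]          # invented predicted probabilities
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # threshold at 0.5

acc = accuracy(y_true, y_pred)    # 4 of 6 correct
f1 = f1_binary(y_true, y_pred)    # precision = recall = 2/3
auc = auc_binary(y_true, scores)  # 8 of 9 positive-negative pairs ranked correctly
```

The example shows why AUC is threshold-independent: it is computed from the ranking of the scores, whereas accuracy and F1-score change if the 0.5 cut-off is moved.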
Cross-validation metrics (Accuracy, F1-score, AUC) reported in the tables represent the mean and standard deviation computed across the five validation folds within the training subset. These values quantify internal model stability and guide model selection, but they do not represent final model performance. Final test-set metrics were computed only once on the independent 176-sample test subset withheld from all stages of model development.
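A fold-level summary of this kind can be computed as sketched below. The per-fold accuracies are invented, and whether the sample or the population standard deviation was used is not stated in the text, so both are shown.

```python
from statistics import mean, pstdev, stdev

# Invented per-fold validation accuracies for one model configuration.
fold_accuracy = [0.86, 0.90, 0.88, 0.91, 0.87]

fold_mean = mean(fold_accuracy)
sample_sd = stdev(fold_accuracy)       # divides by n - 1
population_sd = pstdev(fold_accuracy)  # divides by n

summary = f"{fold_mean:.4f} ± {sample_sd:.4f}"
```

With only five folds the two definitions differ noticeably, so the choice is worth stating alongside any reported values.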
2.5. Main Experimental Design
The primary experiment was designed to evaluate model performance using only features obtained through non-destructive measurement techniques. This scenario reflects a practical application setting where fruit quality assessment must be performed without damaging the samples, such as in automated harvesting systems. For comparative analysis, models were also evaluated on the full feature set, which includes both non-destructive and destructive attributes. This comparison provides insight into the trade-off between prediction accuracy and measurement feasibility in real-world scenarios.
During the analysis, the iad (index of absorbance difference) variable emerged as a particularly strong predictor, frequently appearing as the top-ranked feature in model importance analyses. While iad is obtained non-destructively, it relies on specialized optical instrumentation that may not always be available in practical deployments. To assess the dependence of model performance on this variable, an additional experiment was performed under identical training and validation conditions, using a modified dataset excluding the iad feature. This allowed for the evaluation of predictive robustness in scenarios where simplified sensing configurations are required. Building on this, a final experiment was performed to identify a minimal yet powerful feature set, training the models using only the top-ranked predictors from the non-iad feature group. This aimed to define the most efficient model in terms of both predictive power and sensor complexity.
3. Results
The subsequent section presents the results of the conducted experiments, offering a comprehensive comparison of model performance across four distinct feature configurations. These configurations comprise: (1) training on the complete feature set, encompassing both destructive and non-destructive variables; (2) training exclusively on features derived from non-destructive methods; (3) training on the non-destructive subset following the exclusion of the iad variable; and (4) training on a select subset of top features from the non-destructive set, also excluding the iad variable. The primary objective of this analysis is to evaluate the predictive capacity of each model and to determine the impact of feature availability on classification performance.
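The first three configurations can be expressed as simple column selections, as in the sketch below. The feature list is a partial, illustrative subset of variable names mentioned in the text, and the grouping of firm, tst, ta, zs_d, and th_d as the destructive measurements is an assumption; the fourth configuration additionally depends on a feature ranking and is not shown.

```python
# Partial, illustrative list of variable names taken from the dataset description.
ALL_FEATURES = [
    "vol", "mass", "firm", "tst", "ta", "udb",
    "zs_nd", "th_nd", "zs_d", "th_d", "zs_nd_X", "th_nd_X", "iad",
]

# Assumption: these measurements are treated as destructive.
DESTRUCTIVE = {"firm", "tst", "ta", "zs_d", "th_d"}


def build_configs(all_features, destructive, strong_feature="iad"):
    """Return the feature lists for configurations (1) to (3)."""
    non_destructive = [f for f in all_features if f not in destructive]
    without_strong = [f for f in non_destructive if f != strong_feature]
    return {
        "full": list(all_features),
        "non_destructive": non_destructive,
        "non_destructive_no_iad": without_strong,
    }


configs = build_configs(ALL_FEATURES, DESTRUCTIVE)
```

Defining the subsets once in this way keeps the training and validation code identical across experiments, so that only the selected columns differ between configurations.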
Unless otherwise stated, all cross-validation performance values reported in this section correspond to the mean ± standard deviation across the five cross-validation folds, computed from predictions on the held-out validation partitions.
3.1. TabNet Evaluation
The TabNet model was trained using stratified 5-fold cross-validation on the selected feature subsets. Prior to cross-validation, a stratified 75/25 train-test split was applied, and the 5-fold procedure was conducted exclusively on the 525-sample training partition. After cross-validation, the model was retrained on the complete training subset and subsequently evaluated on the independent test set of 176 samples. The implementation was based on the pytorch_tabnet library, with training performed over a maximum of 200 epochs and early stopping (patience = 20). The architecture consisted of 3 decision steps, 8-dimensional feature and attention transformers, and a batch size of 256. Optimization was carried out using the Adam optimizer with learning rate scheduling.
In addition to predicted class labels and probabilities, TabNet provides interpretable outputs in the form of feature selection masks, which capture the attention allocated to each input feature at every decision step. By aggregating these masks across folds and samples, global feature importance scores were derived.
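One plausible way to aggregate masks into global importance scores is sketched below. The exact aggregation used in the study is not specified, so the normalization within each fold followed by averaging across folds is an assumption, and the mask values are invented.

```python
import numpy as np


def global_importance(fold_masks):
    """Aggregate per-sample feature masks into one importance vector.

    fold_masks is a list with one array per fold, each of shape
    (n_samples, n_features). Masks are summed over samples, normalized to
    sum to 1 within each fold, and then averaged across folds.
    """
    per_fold = []
    for masks in fold_masks:
        totals = np.asarray(masks, dtype=float).sum(axis=0)
        per_fold.append(totals / totals.sum())
    return np.mean(per_fold, axis=0)


# Two invented folds, two samples each, three features.
fold_masks = [
    [[1.0, 0.0, 0.0], [0.5, 0.5, 0.0]],
    [[0.0, 1.0, 0.0], [0.0, 0.0, 1.0]],
]
importance = global_importance(fold_masks)
```

Normalizing within each fold before averaging gives every fold equal weight, regardless of how many validation samples it contains.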
Model performance for each fold was evaluated using Accuracy, F1-score, and AUC, and results were summarized in tabular format. Moreover, the averaged feature importance values were visualized as bar plots to highlight the most influential predictors across models and experimental conditions.
3.1.1. TabNet Model Evaluation Using the Full Feature Set
Using the full set of 43 predictors, the TabNet model showed stable performance across the validation folds. Cross-validation resulted in a mean accuracy of 0.8819 ± 0.0222, a mean AUC of 0.9344 ± 0.0138, and a mean F1-score of 0.8782 ± 0.0210.
Following cross-validation, the final model trained on the entire training subset was evaluated on the independent test set. The test results showed an accuracy of 0.8920, an F1-score of 0.8927, and an AUC of 0.9614, confirming strong generalization to unseen samples.
Feature importance analysis identified iad as the most influential feature, followed by colourimetric and dielectric properties such as moo_h, moo_a_b, and d_th. The corresponding feature importance plot, ROC curve, and learning curves are presented in Figure 1a, Figure 2, and Figure 3a.
Figure 1.
Mean feature importance plots for TabNet models: (a) full feature set, (b) non-destructive feature set, and (c) non-destructive feature set without the iad variable.
Figure 2.
Mean Receiver Operating Characteristic (ROC) curves for three TabNet configurations: the blue curve represents the full feature set, the red curve corresponds to the non-destructive feature set, and the green curve denotes the non-destructive feature set without the iad variable.
Figure 3.
Model training and validation loss curves for the three TabNet configurations. (a) Full feature set; (b) non-destructive feature set; (c) non-destructive feature set without iad. Solid lines represent mean loss values across the five cross-validation folds, while the shaded areas indicate variability across folds.
3.1.2. TabNet Model Evaluation Using Non-Destructive Features
Using the non-destructive feature set (38 features), the TabNet model achieved consistent results across validation folds, with a mean accuracy of 0.8819 ± 0.0214, a mean AUC of 0.9415 ± 0.0205, and a mean F1-score of 0.8774 ± 0.0222.
Evaluation on the independent test set resulted in an accuracy of 0.8977, an F1-score of 0.8941, and an AUC of 0.9528. These values confirm that the model generalizes well using only non-destructive measurements.
3.1.3. TabNet Performance on the Non-Destructive Set Without iad Feature
For this feature subset, a stratified 75/25 train-test split was applied, with 5-fold cross-validation performed exclusively on the training portion (525 samples). After cross-validation, the model was retrained on the complete training set and evaluated on the independent test set (176 samples).
Cross-validation produced a mean accuracy of 0.8571, a mean F1-score of 0.8479, and a mean AUC of 0.8992. These values represent an expected decrease relative to the full non-destructive configuration, reflecting the removal of the highly influential iad variable.
Feature importance analysis showed that several non-destructive optical and dielectric properties remained strong predictors, including nd_zs, moo_a, moo_ccl, mass, and mop_a.
Evaluation on the independent test set yielded an accuracy of 0.8409, an F1-score of 0.8313, and an AUC of 0.9230. These metrics closely reflect the cross-validation performance and confirm that the reduced feature set remains capable of reliable maturity classification.
3.1.4. Optimized TabNet Model with Top-Performing Features
To investigate whether strong performance can be maintained using a more compact feature set, a series of experiments was conducted using only the highest-ranked non-destructive predictors (excluding iad). TabNet models were trained using progressively larger subsets of the most influential features, ranging from 9 to 15 variables.
The resulting test-set metrics are summarized in Table 2. Performance did not improve monotonically with additional features; instead, a peak occurred at 13 features.
Table 2.
Model performance with a reduced feature set. The table shows the test-set performance for models trained with a varying number of the most important features. The best overall performance, achieved with 13 features, is highlighted in bold.
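The subset sweep from 9 to 15 features can be illustrated with the sketch below. It is a hedged example on synthetic data: a random forest is used both to rank features and as the classifier, whereas the study ranked features with TabNet importances and trained TabNet models. The names `ranking`, `results`, and `best_k` are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the 37 non-destructive features without iad.
X, y = make_classification(n_samples=701, n_features=37, n_informative=12,
                           random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=1)

# Rank features once, using importances from a model fitted on training data.
ranker = RandomForestClassifier(n_estimators=100, random_state=1)
ranker.fit(X_train, y_train)
ranking = np.argsort(ranker.feature_importances_)[::-1]  # most important first

# Train one model per subset size and record its test-set AUC.
results = {}
for k in range(9, 16):
    top_k = ranking[:k]
    model = RandomForestClassifier(n_estimators=100, random_state=1)
    model.fit(X_train[:, top_k], y_train)
    prob = model.predict_proba(X_test[:, top_k])[:, 1]
    results[k] = roc_auc_score(y_test, prob)

# Subset size with the highest test AUC; need not be the largest subset.
best_k = max(results, key=results.get)
```

Because each subset is a prefix of the same ranking, the sweep isolates the effect of adding the next-ranked feature, which is how a non-monotonic performance profile can be observed.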
3.1.5. Comparative Analysis and Summary of TabNet Model Performance
A consolidated overview of the four TabNet configurations is presented in Table 3. The comparison highlights the influence of feature selection on model performance based on test-set metrics.
Table 3.
Consolidated TabNet model performance across four feature-set configurations, evaluated on the independent test set. The table reports Accuracy, AUC, and F1-score, together with Precision and Recall values from the corresponding test-set classification reports. The best-performing configuration, the Non-Destructive Set (38 features), is highlighted in bold.
To provide further insight into the predictive behaviour of the best-performing configuration (Non-Destructive Set), the corresponding confusion matrix on the independent test set is shown in Table 4.
Table 4.
Confusion matrix for the TabNet model using the Non-Destructive Feature Set (test set).
3.2. NODE Evaluation
The Neural Oblivious Decision Ensembles (NODE) model was trained using a stratified 5-fold cross-validation approach on the specified feature subsets. The implementation was based on the node-tabular library, with training performed for a maximum of 200 epochs and an early stopping mechanism with a patience of 25 epochs to prevent overfitting. The architecture consisted of an Oblivious Differentiable Sparsemax Trees (ODST) layer with 160 trees, each of depth 7, and training used a batch size of 256. Optimization was carried out using the Adam optimizer with a specified learning rate and weight decay.
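The early stopping rule mentioned above can be sketched in a few lines. This is an illustrative assumption about the stopping criterion (stop once the validation loss has not improved for `patience` consecutive epochs); the exact rule used by the library may differ, and the function name `early_stopping_epoch` is hypothetical.

```python
def early_stopping_epoch(val_losses, max_epochs=200, patience=25):
    """Return (best_epoch, stop_epoch) for per-epoch validation losses.

    Training stops once `patience` consecutive epochs have passed without
    a new minimum of the validation loss, or when max_epochs is reached.
    """
    best_loss = float("inf")
    best_epoch = -1
    for epoch, loss in enumerate(val_losses[:max_epochs]):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            return best_epoch, epoch
    return best_epoch, min(len(val_losses), max_epochs) - 1


# Hypothetical loss trace: improves for 40 epochs, then stays worse.
losses = [1.0 - 0.01 * e for e in range(40)] + [0.7 + 0.001 * e for e in range(160)]
best_epoch, stop_epoch = early_stopping_epoch(losses)
print(best_epoch, stop_epoch)  # 39 64
```

In this trace the best loss occurs at epoch 39, and training halts at epoch 64 after 25 epochs without improvement, well before the 200-epoch limit.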
To assess interpretability, global feature importance scores were derived post-hoc using a permutation-based method. After training on each fold, the importance of each feature was calculated by measuring the drop in model accuracy when that feature’s values were randomly shuffled. These scores were then averaged across all folds.
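A minimal sketch of this permutation procedure is given below, assuming the accuracy drop is measured on each fold's validation samples and using a logistic regression on synthetic data in place of NODE. The helper `permutation_importance_fold` and the variable names are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import StratifiedKFold

# Synthetic stand-in data (not the peach measurements).
X, y = make_classification(n_samples=525, n_features=8, n_informative=3,
                           random_state=0)
rng = np.random.default_rng(0)


def permutation_importance_fold(model, X_val, y_val, rng):
    """Accuracy drop when each column of X_val is shuffled in turn."""
    baseline = accuracy_score(y_val, model.predict(X_val))
    drops = np.zeros(X_val.shape[1])
    for j in range(X_val.shape[1]):
        X_perm = X_val.copy()
        X_perm[:, j] = rng.permutation(X_perm[:, j])  # break feature-target link
        drops[j] = baseline - accuracy_score(y_val, model.predict(X_perm))
    return drops


cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
fold_importances = []
for fit_idx, val_idx in cv.split(X, y):
    model = LogisticRegression(max_iter=1000).fit(X[fit_idx], y[fit_idx])
    fold_importances.append(
        permutation_importance_fold(model, X[val_idx], y[val_idx], rng))

# Average the per-fold scores and rank features from most to least important.
mean_importance = np.mean(fold_importances, axis=0)
ranking = np.argsort(mean_importance)[::-1]
```

A score near zero (or slightly negative) indicates that shuffling the feature did not reduce accuracy, which is how weak predictors appear in such rankings.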
Model performance for each fold was evaluated using Accuracy, AUC, and F1-score, with the aggregated results summarized in tabular format. Furthermore, the averaged feature importance values were presented in tables to highlight the most influential predictors.
3.2.1. NODE Model Evaluation Using the Full Feature Set
This evaluation of the Neural Oblivious Decision Ensembles (NODE) model also utilizes the complete set of 43 predictor variables. This scenario provides a direct comparison to the TabNet model under identical data conditions, establishing a performance baseline with the full spectrum of available information.
A stratified 75/25 train-test split was applied, after which a 5-fold cross-validation procedure was conducted exclusively on the training portion (525 samples). The average performance across the 5-fold cross-validation indicates robust predictive capabilities, with a mean accuracy of 0.8590 ± 0.0244, a weighted F1-score of 0.8510 ± 0.0239, and a mean Area Under the Curve (AUC) of 0.9440 ± 0.0117. The standard deviations reflect consistent performance across folds.
Following cross-validation, the model was retrained on the full training set and subsequently evaluated on the independent test set (176 samples). The final test results showed an accuracy of 0.8636, an F1-score of 0.8588, and an AUC of 0.9496, confirming strong generalization to unseen data.
Feature importance was evaluated using the permutation importance method. Figure 4a shows the ten most influential predictors. Consistent with the TabNet results, the iad variable was identified as the dominant predictor, followed by colour and dielectric properties such as moo_b, mop_cirg1, and d_zs, although with substantially smaller contributions.
Figure 4.
Top 10 most influential features for the NODE model determined by Permutation Importance, trained on: (a) full feature set, (b) non-destructive feature set, and (c) non-destructive feature set without the iad variable.
The model’s overall discriminative capability is illustrated in Figure 5 and Figure 6a. The mean receiver operating characteristic (ROC) curve attained an average cross-validation AUC of 0.944, while the learning dynamics demonstrated stable convergence of training and validation losses, with early stopping effectively preventing overfitting across folds.
Figure 5.
Mean Receiver Operating Characteristic (ROC) curves for three NODE configurations: the blue curve represents the full feature set, the pink curve corresponds to the non-destructive feature set, and the green curve denotes the non-destructive feature set without the iad variable.
Figure 6.
Mean training and validation loss curves for NODE models, illustrating learning behavior across three experimental configurations: (a) full feature set, (b) non-destructive feature set, and (c) non-destructive feature set without the iad feature. Solid lines represent mean loss values across the five cross-validation folds, while the shaded areas indicate variability across folds.
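One common way to obtain a mean ROC curve across cross-validation folds is to interpolate each fold's curve onto a shared false-positive-rate grid and average the true-positive rates. The sketch below illustrates this under the assumption that the study's mean curves were built in a similar way; the function `mean_roc` and the synthetic fold scores are hypothetical.

```python
import numpy as np
from sklearn.metrics import auc, roc_curve


def mean_roc(fold_labels, fold_scores, grid_size=101):
    """Average per-fold ROC curves on a common false-positive-rate grid."""
    fpr_grid = np.linspace(0.0, 1.0, grid_size)
    tprs = []
    for y_true, y_score in zip(fold_labels, fold_scores):
        fpr, tpr, _ = roc_curve(y_true, y_score)
        tpr_interp = np.interp(fpr_grid, fpr, tpr)
        tpr_interp[0] = 0.0  # anchor every curve at the origin
        tprs.append(tpr_interp)
    tprs = np.asarray(tprs)
    mean_tpr = tprs.mean(axis=0)
    std_tpr = tprs.std(axis=0)  # can be drawn as a band around the mean
    return fpr_grid, mean_tpr, std_tpr, auc(fpr_grid, mean_tpr)


# Synthetic validation scores for five folds of 105 samples each.
rng = np.random.default_rng(0)
fold_labels, fold_scores = [], []
for _ in range(5):
    y_true = rng.integers(0, 2, size=105)
    fold_labels.append(y_true)
    fold_scores.append(0.5 * y_true + rng.normal(0.0, 0.5, size=105))

fpr_grid, mean_tpr, std_tpr, mean_auc = mean_roc(fold_labels, fold_scores)
```

Note that the area under the averaged curve is close to, but not necessarily identical to, the average of the per-fold AUC values.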
3.2.2. NODE Model Evaluation Using Non-Destructive Features
In this scenario, the NODE model was trained using a reduced feature set comprising 38 non-destructive variables. The purpose of this evaluation is to determine whether high predictive performance can be maintained when relying exclusively on measurements obtainable without physically altering or damaging the fruit, which is essential for real-world applications such as automated harvesting and in-field maturity estimation.
Across the 5-fold cross-validation, the model achieved a mean accuracy of 0.8667 ± 0.0217, a mean F1-score of 0.8575 ± 0.0205, and a mean AUC of 0.9369 ± 0.0127. These results demonstrate that the model retains strong predictive capability even without destructive measurements, performing at a level nearly equivalent to the full 43-feature configuration. The relatively small standard deviations indicate stable behaviour across folds, suggesting that the non-destructive feature set provides sufficiently rich and consistent information for reliable model training.
Permutation-based feature importance analysis (Figure 4b) highlighted iad as by far the most influential predictor, with a markedly higher importance score than any other feature. Following iad, the most relevant variables were the dielectric properties nd_zs and nd_th, as well as colourimetric attributes such as moo_b, moo_c, and udb. These findings are consistent with the structure uncovered by the TabNet model in the same experimental condition and reinforce the central role of dielectric and optical measurements in predicting peach maturity.
Evaluation on the independent test set yielded an accuracy of 0.8239, an F1-score of 0.8144, and an AUC of 0.9345. Although accuracy and F1-score are roughly four percentage points below the cross-validation means, the AUC is nearly unchanged (0.9345 versus 0.9369), so these values still reflect solid generalisation to unseen data. A gap of this size between validation folds and the held-out test set is consistent with expectations for moderately sized datasets and does not by itself indicate overfitting.
Overall, the results of this configuration demonstrate that the NODE model can achieve high discriminative performance while relying solely on non-destructive measurements, making this approach highly suitable for practical deployment scenarios. The ROC curve and learning behaviour for this configuration are presented in Figure 5 and Figure 6b.
3.2.3. NODE Model Evaluation Using Non-Destructive Features Without iad
This experimental scenario investigates the ability of the NODE model to perform maturity classification when the dominant iad variable is removed from the non-destructive feature set. The purpose of this analysis is twofold: (i) to quantify the extent to which the iad measurement contributes to the model’s discriminative performance, and (ii) to evaluate whether the remaining 37 non-destructive features contain sufficient information to support a reliable and practically usable classifier.
Across the 5-fold cross-validation, the model achieved a mean accuracy of 0.8457 ± 0.0258, a mean F1-score of 0.8341 ± 0.0248, and a mean AUC of 0.9294 ± 0.0157.
Although these metrics represent a noticeable decrease relative to the full non-destructive configuration that includes the iad feature, the overall reduction remains moderate. This confirms that several dielectric and colourimetric measurements still encode meaningful discriminatory structure even in the absence of the most influential predictor. The standard deviations are slightly larger than in the configuration with iad (for example, 0.0258 versus 0.0217 for accuracy), indicating marginally higher fold-to-fold variability once the dominant predictor is removed.
Permutation-based feature importance analysis (Figure 4c) revealed a substantial reordering of the predictor hierarchy. With iad removed, the dielectric properties nd_zs and nd_th emerged as the two most influential variables, followed by colourimetric attributes such as moo_l, moo_a, moo_co, and mop_cirg2. Additional contributors included physical traits such as vol, as well as secondary colourimetric features (moo_b, moo_a_b). This ranking is consistent with the patterns observed in the TabNet analysis under the same feature constraints, reinforcing the robustness of these variables across modelling approaches.
Evaluation on the independent test set produced an accuracy of 0.8295, an F1-score of 0.8276, and an AUC of 0.9297. These results closely correspond to the cross-validation averages and demonstrate good generalisation to unseen samples. Despite the absence of the iad feature, the model retains a strong ability to distinguish between maturity classes.
Overall, while a decline in absolute performance is unavoidable when the dominant predictor is removed, the NODE model continues to exhibit stable and meaningful classification behaviour. This finding is relevant for practical deployments where iad measurements may be unavailable, and it highlights the potential of the remaining non-destructive dielectric and colourimetric features to support autonomous maturity assessment systems. The ROC curve and learning dynamics for this configuration are presented in Figure 5 and Figure 6c.
3.2.4. Optimized NODE Model with Top-Performing Features
Following the removal of the iad feature, a final series of experiments was conducted to determine the optimal number of remaining non-destructive predictors required to maintain high model performance. The NODE architecture was trained on progressively larger subsets of the most influential features, starting from the top 9 and extending to the top 15. This procedure enabled the identification of the point at which additional features cease to improve predictive performance or introduce unnecessary noise.
The performance metrics for each configuration are summarized in Table 5. For each subset, the model was evaluated using the independent test set following retraining on the full 525-sample training portion. This evaluation protocol ensures that the reported values reflect genuine generalization performance rather than cross-validation variability.
Table 5.
Performance summary of the NODE model trained on subsets of top-performing non-destructive features (after excluding iad). The best-performing configuration based on test set AUC is highlighted in bold.
The results demonstrate that model performance remains relatively stable across subsets of 9–15 features, with only minor fluctuations in accuracy and F1-score. However, the configuration using 13 features achieves the highest test AUC (0.9227) and a competitive F1-score (0.8391), indicating the most effective balance between discriminative capability and model compactness.
This outcome suggests that enlarging the feature set beyond 13 variables offers no clear advantage, while smaller subsets occasionally reduce predictive strength. The 13-feature configuration therefore represents the optimal trade-off, retaining sufficient variability to support robust decision boundaries while avoiding the diminishing returns observed when additional features are included.
For completeness, the performance of this optimal 13-feature model is also highlighted separately in the overall NODE comparison presented in Table 6.
Table 6.
Consolidated NODE model performance across four feature-set configurations, evaluated on the independent test set.
3.2.5. Comparative Analysis and Summary of NODE Model Performance
This section synthesizes the results from the four experimental scenarios for the Neural Oblivious Decision Ensembles (NODE) model. A direct comparison, presented in Table 6, provides an overview of the model’s behaviour under different feature availability constraints, based exclusively on the independent test set, in full alignment with the revised evaluation protocol.
The comparative analysis reveals several consistent trends. The full feature set yields the strongest overall performance across all evaluated configurations, with the highest AUC and a balanced combination of precision and recall. The non-destructive feature set produces lower values (test accuracy of 0.8239 versus 0.8636) but remains competitive, suggesting that destructive measurements contribute only a modest amount of additional information for NODE.
Removing the dominant iad measurement results in an observable performance decline, consistent with findings from the TabNet experiments. Nevertheless, the model retains stable discriminative capability even without this feature. The optimized Top-13 feature configuration provides a compact alternative with competitive test-set performance, achieving the highest F1-score among the reduced-feature models.
To illustrate practical classification behaviour, Table 7 presents the confusion matrix for the Top-13 Feature configuration evaluated on the independent test set.
Table 7.
Confusion matrix for the NODE model using the Top-13 Feature Set (test set).
3.3. SAINT Evaluation
The Self-Attention and Intersample Attention Transformer (SAINT) model was evaluated under the same revised protocol as the TabNet and NODE architectures. Following a stratified 75/25 train-test split, a 5-fold cross-validation procedure was applied exclusively to the training portion (525 samples), after which the model was retrained on the full training set and assessed on the independent test set (176 samples). This ensured a strictly controlled evaluation in which the final metrics reflect performance on data unseen during both model development and hyperparameter selection.
The model was implemented in a custom PyTorch (v2.7.1) framework and trained for up to 500 epochs, using early stopping (patience = 25) to prevent overfitting. The architecture employed a transformer dimension of 64, a depth of 6 attention layers, 8 attention heads, and a dropout rate of 0.05. Optimization was performed using AdamW (learning rate = 0.0001; weight decay = 0.01). To enhance generalization, CutMix augmentation (lambda = 0.1) was applied during training.
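For readers unfamiliar with CutMix on tabular inputs, the sketch below shows one common formulation: each entry of a batch is replaced, with probability lambda, by the same feature taken from a randomly chosen partner row. Interpreting lambda = 0.1 as this replacement probability is an assumption, since the custom implementation is not described in detail; the function `cutmix_tabular` is hypothetical and uses NumPy rather than PyTorch for brevity.

```python
import numpy as np


def cutmix_tabular(X, lam=0.1, rng=None):
    """Swap each entry of X, with probability lam, for the same feature
    taken from a randomly chosen partner row of the batch."""
    rng = np.random.default_rng() if rng is None else rng
    partner = rng.permutation(X.shape[0])  # one partner row per sample
    swap = rng.random(X.shape) < lam       # True where an entry is replaced
    return np.where(swap, X[partner], X), swap


# Hypothetical batch of 256 samples with 37 features.
rng = np.random.default_rng(0)
X = rng.normal(size=(256, 37))
X_aug, swap = cutmix_tabular(X, lam=0.1, rng=rng)
```

With a small lambda, most of each row is left untouched, so the augmented samples stay close to real measurements while still discouraging reliance on any single feature.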
Since transformer-based architectures do not produce intrinsic feature importance, interpretability was addressed via a post-hoc Permutation Importance analysis, quantifying the reduction in validation accuracy when each feature was individually perturbed. Model performance across the four experimental feature configurations was evaluated using Accuracy, AUC, and F1-score, consistent with the other models in this study.
3.3.1. SAINT Model Evaluation Using the Full Feature Set
The initial evaluation of the Self-Attention and Intersample Attention Transformer (SAINT) model was conducted using the complete set of 43 predictor variables. As with the previous models, the revised evaluation protocol was applied: after the stratified 75/25 train-test split, a 5-fold cross-validation was performed exclusively on the training subset (525 samples), followed by retraining on the full training set and final testing on the independent test set (176 samples).
Across the cross-validation folds, the SAINT model achieved a mean accuracy of 0.8914, a mean AUC of 0.9552, and a mean F1-score of 0.8866. These results indicate strong and stable predictive performance, with low variability across folds (e.g., a standard deviation of approximately 0.02 for both accuracy and F1-score).
Training used the configuration described in Section 3.3 (maximum of 500 epochs, AdamW optimizer, CutMix augmentation), with early stopping (patience = 25) triggered on the validation loss.
As the SAINT architecture does not yield intrinsic feature importance scores, a post-hoc interpretability analysis was performed using the Permutation Importance method. The average permutation scores (summarized in Figure 7a) again identified the non-destructive iad measurement as the most influential feature, followed by several additional non-destructive and destructive properties, including nd_zs, moo_a, moo_h, and ta. These results are consistent with the patterns observed in the other neural models.
Figure 7.
Top 10 most influential features for the SAINT model determined by Permutation Importance, trained on: (a) full feature set, (b) non-destructive feature set, and (c) non-destructive feature set without the iad variable.
The ROC and learning curves, shown in Figure 8 and Figure 9a, illustrate the high discriminative capacity and stable optimization behaviour of the SAINT model. The cross-validation mean ROC curve confirms a strong average AUC of 0.955, while the smooth convergence of the learning curves indicates effective regularization and the successful functioning of the early stopping mechanism.
Figure 8.
Mean Receiver Operating Characteristic (ROC) curves for three SAINT configurations: the blue curve represents the full feature set, the red curve corresponds to the non-destructive feature set, and the green curve denotes the non-destructive feature set without the iad variable.
Figure 9.
Mean training and validation loss curves for SAINT models, illustrating learning behavior across three experimental configurations: (a) full feature set, (b) non-destructive feature set, and (c) non-destructive feature set without the iad feature. Solid lines represent mean loss values across the five cross-validation folds, while the shaded areas indicate variability across folds.
Final evaluation on the independent test set yielded an accuracy of 0.8977, an AUC of 0.9735, and an F1-score of 0.9011. The confusion matrix shows balanced behaviour across the two classes, with only a small number of misclassifications. These results position SAINT as one of the top-performing models in this study when provided with the full combined destructive and non-destructive feature space.
3.3.2. SAINT Model Evaluation Using Non-Destructive Features
In the second scenario, the SAINT model was evaluated using a reduced set of 38 non-destructive features. This configuration is particularly relevant for practical, real-world deployments in which the fruit must remain intact, and the goal is to assess the extent to which the model can preserve high predictive performance without relying on destructive measurements.
Following the revised evaluation protocol, a stratified 75/25 train-test split was applied, and a 5-fold cross-validation was conducted exclusively on the training subset (525 samples). Across the validation folds, the model achieved a mean accuracy of 0.8876, a mean AUC of 0.9537, and a mean F1-score of 0.8831. These values are only slightly lower than those obtained with the full feature set, confirming that the non-destructive measurements encode nearly all of the discriminatory information required for reliable maturity classification.
The permutation-based feature importance analysis (Figure 7b) again identified iad as the dominant predictor, with a substantially higher importance score than any other variable. Secondary contributors included dielectric properties such as nd_th and nd_zs, followed by colourimetric attributes including moo_a_b, moo_a, mop_l, and moo_b. These findings align closely with the patterns observed for the TabNet and NODE models under the same experimental conditions.
The ROC and learning curves (Figure 8 and Figure 9b) further illustrate the model’s strong discriminative behaviour and stable optimization dynamics. The mean ROC curve confirms a high average AUC of approximately 0.954, while the learning curve demonstrates efficient convergence with effective regularization provided by CutMix and early stopping.
Final evaluation on the independent test set yielded an accuracy of 0.8864, an AUC of 0.9689, and an F1-score of 0.8913. The confusion matrix indicates balanced performance across both maturity classes, with only a small number of misclassifications. These results confirm that SAINT maintains strong generalization ability even when restricted to non-destructive measurements, making this configuration highly suitable for autonomous maturity assessment systems.
3.3.3. SAINT Model Evaluation Using Non-Destructive Features Without iad
This third experimental scenario evaluates the robustness of the SAINT architecture when its most influential predictor, the iad variable, is removed. This configuration is essential for assessing the standalone predictive value of the remaining 37 non-destructive features, which primarily consist of dielectric and colourimetric measurements. The results provide insight into the feasibility of practical, low-cost sensor systems that do not rely on the specialised instrumentation required for iad acquisition.
Following the revised protocol, a stratified 75/25 train-test split was applied, and a 5-fold cross-validation was performed exclusively on the training subset. Across the validation folds, the SAINT model achieved a mean accuracy of 0.8743, a mean AUC of 0.9428, and a mean F1-score of 0.8658. As expected, these values represent a moderate decline relative to the full non-destructive feature set, but the model nevertheless maintains strong discriminative performance, indicating that substantial predictive information remains available even without iad.
The feature-importance hierarchy (Figure 7c) undergoes a clear reorganisation in the absence of iad. Dielectric properties such as nd_th and nd_zs emerge as the dominant predictors, followed by several colour-related variables including moo_ccl, moo_b, moo_col, moo_c, and moo_a. This pattern mirrors the shifts observed in the TabNet and NODE models under the same constraints, underscoring the consistent importance of optical and dielectric measurements.
Figure 8 and Figure 9c illustrate that the overall discriminative capability remains high. The mean ROC curve confirms a robust average AUC of approximately 0.943, while the learning curves show stable convergence and effective regularisation.
Final evaluation on the independent test set yielded an accuracy of 0.8636, an AUC of 0.9594, and an F1-score of 0.8667. The confusion matrix indicates a well-balanced classification performance, with only a modest increase in misclassifications compared with the full non-destructive configuration. These findings demonstrate that SAINT remains a competitive model even without access to the iad measurement, highlighting the practical value of the remaining non-destructive features for autonomous maturity assessment systems.
3.3.4. Optimized SAINT Model with Top-Performing Features
In the final experimental stage, the SAINT model’s performance was evaluated on incrementally larger subsets of the most influential non-destructive features (after excluding iad). This analysis, conducted on feature sets ranging from 9 to 15 predictors, aimed to identify the optimal configuration that balances model complexity with predictive accuracy, simulating a cost-effective, targeted sensor system. As in the previous experiments, a stratified 75/25 train-test split was used, with 5-fold cross-validation performed exclusively on the 525-sample training subset and final evaluation on the independent test set of 176 samples.
The test-set performance metrics for each feature subset are summarized in Table 8.
Table 8.
Performance summary of the SAINT model trained on subsets of top-performing non-destructive features (after excluding iad), evaluated on the independent test set. The best overall performing configuration is highlighted in bold.
The results demonstrate the robustness of the SAINT architecture. Across all tested subsets from 9 to 15 features, test-set accuracy remains in a narrow range between 0.8352 and 0.8807, with F1-scores between 0.8415 and 0.8786 and all AUC values above 0.95. The configuration with 15 features achieves the highest test AUC (0.9607), together with the joint-best test accuracy (0.8807) and F1-score (0.8786), and is therefore selected as the optimal trade-off between model compactness and predictive performance.
The permutation importance analysis identified a set of fifteen non-destructive variables as the most informative predictors for the SAINT model. For clarity of presentation, Figure 7c displays only the ten highest-ranked features; however, these correspond closely to the top part of the full fifteen-feature ranking used during model optimization.
The performance metrics obtained using the complete optimized subset of fifteen features (excluding iad) are also included in the overall SAINT comparison presented in Table 9.
Table 9.
Summary of SAINT model performance across four feature-set configurations, evaluated on the independent test set. The optimal reduced model is highlighted in bold.
3.3.5. Comparative Analysis and Summary of SAINT Model Performance
This section consolidates the independent test-set results for all SAINT configurations. The full feature set is included exclusively as an upper-bound reference and is not considered for operational deployment. Among the reduced-feature models, the configuration based on the top-performing fifteen non-destructive features offers the most balanced trade-off between predictive accuracy, computational efficiency, and system simplicity, and is therefore highlighted as the preferred solution.
Although the full feature configuration achieves the highest absolute performance, the differences between it and the reduced models remain relatively modest. For practical deployment, particularly in real-time or embedded contexts, the complexity, sensor requirements, and computational overhead become equally important considerations.
The top-performing subset of fifteen features delivers slightly lower accuracy and F1-score compared with the full non-destructive set; however, it provides a markedly more efficient and lightweight architecture. This configuration reduces the dimensionality by over 60%, eliminates reliance on destructive measurements, and requires only a compact set of inexpensive sensors. It also trains faster, converges more consistently across folds, and exhibits excellent generalization on the independent test set. These properties make it the most attractive configuration for realistic, resource-constrained maturity prediction systems.
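The stated reduction in dimensionality can be checked directly from the feature counts reported above. The text does not specify the baseline, so both candidates are shown: the 38-feature non-destructive set and the full 43-feature set.

```latex
\[
1 - \frac{15}{38} \approx 0.605,
\qquad
1 - \frac{15}{43} \approx 0.651
\]
```

Under either baseline the reduction exceeds 60%, which is consistent with the figure quoted in the text.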
A class-wise evaluation for the selected top-performing model is provided in Table 10.
Table 10.
Confusion matrix for the SAINT model trained on the top-performing 15-feature subset. Independent test-set results.
4. Discussion
This study systematically evaluated the performance of three modern neural network architectures (TabNet, NODE, and SAINT) for the non-destructive classification of peach maturity. Our findings align with a growing body of literature that positions neural networks and machine learning as state-of-the-art tools for fruit quality assessment. The superiority of Artificial Neural Networks (ANNs) over traditional models has been decisively shown; for instance, Abdel-Sattar et al. [27] demonstrated that an ANN achieved an R² greater than 0.97 for predicting peach quality attributes, while regression models struggled. Similarly, a study by Kangune et al. [28] showed a CNN model achieving significantly higher accuracy (79.49%) compared to an SVM model (69%) for grape ripeness, which supports our decision to focus exclusively on neural architectures.
The performance of our models is highly competitive, especially when compared to studies on the same ‘Redhaven’ cultivar. Scalisi et al. [29] used a fluorescence spectrometer on ‘Redhaven’ peaches and achieved a pooled F1-score of 0.85 for classifying maturity. In our study, the optimized SAINT model using 15 non-destructive top features achieved an F1-score of 0.8786, while the optimized TabNet model using 13 features achieved an F1-score of 0.8623. These results demonstrate that compact, potentially lower-cost sensor configurations combined with specialized tabular neural networks can match or exceed the performance of more complex spectrometric approaches. While modern CNNs achieve high accuracies on image data, often exceeding 96% for bananas [30] or 94% for plums in uncontrolled environments [31], our study highlights that high accuracy is also attainable using less expensive, non-image, tabular data. This modernizes earlier approaches, such as that of Llobet et al. [32], who used an electronic nose and neural networks to achieve over 90% accuracy for banana ripeness. Our work contributes to this field by applying the latest architectures specifically designed for the tabular data that such sensor systems produce.
Beyond direct maturity sensing, the utility of machine learning in peach orchards is well-established for modeling complex agricultural systems. Studies have successfully used ANNs to predict fruit quality from soil mineral content [33] and Random Forest models to diagnose tree nutrition with over 80% accuracy [34], underscoring the power of these methods to handle intricate, non-linear interactions. The final comparative analysis of our models, presented in Table 11, further explores these capabilities.
Table 11.
Final comparison of the best-performing configurations for TabNet, NODE, and SAINT based solely on independent test-set performance. Precision and Recall are reported separately.
A consolidated view of the results shows that no single model dominates in all scenarios; rather, each architecture performs best under specific conditions. The SAINT model consistently achieved the highest scores when the dataset included the full feature set, which is in line with its design that leverages attention mechanisms to model interactions between features.
In contrast, when relying entirely on the comprehensive non-destructive feature set (38 features), the TabNet model delivered the strongest overall performance (Accuracy = 0.8977; F1 = 0.8941), surpassing both NODE and SAINT in this practically relevant scenario. This indicates that TabNet’s sequential attention mechanism is particularly effective when a full collection of low-cost non-destructive measurements is available.
In the reduced-feature settings, the optimized SAINT configuration using 15 features provided the strongest overall performance (Accuracy = 0.8807; F1 = 0.8786). However, the 13-feature TabNet configuration (Accuracy = 0.8693; F1 = 0.8623) remained highly competitive, offering a favorable balance of accuracy, robustness, and interpretability.
All three models exhibit a measurable decline once the iad variable is removed, confirming its relevance for maturity prediction. Nevertheless, the optimized configurations—particularly the 13-feature TabNet and 15-feature SAINT models—demonstrate that reliable classification is achievable using compact, non-destructive feature subsets, which is important for practical implementation.
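For reference, the accuracy, precision, recall, and F1 values compared above follow the standard confusion-matrix definitions. The sketch below computes them for a binary case on invented labels. The function name `binary_metrics` and the toy data are illustrative assumptions, and the averaging scheme applied to the maturity classes in this study is not restated here.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision, recall, and F1 for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Invented labels for illustration only (not data from the study).
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]
metrics = binary_metrics(y_true, y_pred)
```

Because F1 combines precision and recall, it can rank models differently from accuracy when the classes are imbalanced, which is one reason to report both.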
The practical implications of our research are best viewed in the context of large-scale precision agriculture. As demonstrated by Zhang et al. [35] for almond yield prediction, machine learning models are essential tools for orchard-level management. Our conclusion regarding the practicality of compact, optimized models is consistent with the findings of Mazen and Nashat [36], who also highlighted the importance of feature selection for achieving high performance in banana classification. Taken together, these results suggest that the future of practical systems lies not only in model complexity but in the synergy between an optimized architecture and a minimal, yet informative, set of low-cost sensor data, offering a tangible pathway towards robust and scalable systems for real-time decision support.
4.1. Limitations of the Study
Despite the robust findings, this study has several limitations that should be acknowledged. First, all models were trained and validated exclusively on the ‘Redhaven’ peach cultivar. Physiological and spectral properties can vary significantly between cultivars, and further research is needed to validate the models’ generalizability. Second, the data were collected from a single geographical location and season, so the models might not account for variations caused by different climatic conditions or horticultural practices. Finally, while the dataset of 701 samples is comprehensive for this type of agricultural study, it is relatively small by deep learning standards, which could prevent the models from learning more complex patterns. To reduce the risk of optimistic performance estimates on a dataset of this size, the study employed a stratified 75/25 train-test split combined with 5-fold cross-validation performed exclusively on the training subset. Future research should validate the models on external datasets from additional seasons and cultivars.
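The evaluation protocol mentioned above (a stratified hold-out split followed by cross-validation restricted to the training subset) can be sketched as follows. This is a minimal standard-library illustration, not the code used in the study: the helper names `stratified_split` and `stratified_kfold`, the random seed, and the synthetic two-class labels are all assumptions made for the example.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_fraction=0.25, seed=0):
    """Return (train_idx, test_idx), sampling the test part per class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    train_idx, test_idx = [], []
    for y in sorted(by_class):
        idx = by_class[y][:]
        rng.shuffle(idx)
        n_test = round(len(idx) * test_fraction)
        test_idx.extend(idx[:n_test])
        train_idx.extend(idx[n_test:])
    return sorted(train_idx), sorted(test_idx)


def stratified_kfold(indices, labels, k=5, seed=0):
    """Yield (fit_idx, val_idx) pairs built only from the given indices."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i in indices:
        by_class[labels[i]].append(i)
    folds = [[] for _ in range(k)]
    for y in sorted(by_class):
        idx = by_class[y][:]
        rng.shuffle(idx)
        # Deal each class round-robin so every fold keeps the class balance.
        for pos, i in enumerate(idx):
            folds[pos % k].append(i)
    for f in range(k):
        val_idx = sorted(folds[f])
        fit_idx = sorted(i for g in range(k) if g != f for i in folds[g])
        yield fit_idx, val_idx


# Synthetic stand-in labels: 701 samples with invented class sizes.
labels = [0] * 400 + [1] * 301
train_idx, test_idx = stratified_split(labels)
# Cross-validation never sees the held-out test indices.
cv_folds = list(stratified_kfold(train_idx, labels))
```

In practice, scikit-learn's `train_test_split(..., stratify=y)` and `StratifiedKFold` provide the same functionality. The key design point is that model selection uses only the folds drawn from the training indices, so the held-out test set remains an independent check.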
4.2. Future Research Directions
Based on these limitations, several avenues for future research emerge. A crucial next step is to evaluate the performance of the optimized TabNet and SAINT models on a multi-cultivar dataset collected across different seasons and locations. Another promising direction is the development and field-testing of a prototype sensor system based on the 13 optimal non-destructive features. This would involve integrating the required sensors into a portable unit or an autonomous platform to validate the model’s performance in real time. Finally, exploring data fusion techniques that combine these tabular features with low-resolution multispectral images could potentially unlock further performance gains.
4.3. Model Interpretability and Explainability
From the perspective of interpretability, all three models provided valuable insights into the underlying feature relevance. TabNet, by design, offers inherent interpretability through its sequential attention mechanism, which highlights the contribution of each feature at every decision step. This allowed visualization of feature importance dynamics and identification of the most influential attributes contributing to classification decisions. For the SAINT and NODE models, which are less interpretable by nature, post hoc permutation-based analysis was performed, revealing a high degree of consistency in the ranking of dominant predictors. In particular, dielectric properties and ground color-based features frequently appeared among the top-ranked variables, confirming their physiological relevance to fruit maturity. The concordance across different model architectures suggests that the observed feature relationships are robust and not driven by random correlations or overfitting effects.
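The post hoc permutation-based analysis described above can be illustrated with a short sketch: each feature column is shuffled in turn, and the resulting drop in accuracy is taken as that feature's importance. The code below is a generic standard-library illustration under stated assumptions, not the implementation used for SAINT and NODE; the function names, the toy threshold model, and the synthetic data are all hypothetical.

```python
import random


def accuracy(model, X, y):
    """Fraction of rows for which the model predicts the true label."""
    return sum(model(row) == t for row, t in zip(X, y)) / len(y)


def permutation_importance(model, X, y, n_repeats=10, seed=0):
    """Mean drop in accuracy when each column is shuffled on its own."""
    rng = random.Random(seed)
    baseline = accuracy(model, X, y)
    importances = []
    for j in range(len(X[0])):
        drops = []
        for _ in range(n_repeats):
            column = [row[j] for row in X]
            rng.shuffle(column)
            # Rebuild the rows with only column j permuted.
            X_perm = [row[:j] + [v] + row[j + 1:]
                      for row, v in zip(X, column)]
            drops.append(baseline - accuracy(model, X_perm, y))
        importances.append(sum(drops) / n_repeats)
    return importances


# Synthetic data (hypothetical): feature 0 decides the class, feature 1 is noise.
data_rng = random.Random(42)
X = [[data_rng.random(), data_rng.random()] for _ in range(200)]
y = [int(row[0] > 0.5) for row in X]


def toy_model(row):
    # Stand-in for a trained classifier; it only looks at feature 0.
    return int(row[0] > 0.5)


scores = permutation_importance(toy_model, X, y)
```

Here the informative feature receives a clearly positive score, while the ignored feature scores zero. Because the procedure treats the model as a black box, the same routine can be applied to any fitted classifier, which is what makes rankings comparable across architectures.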
5. Conclusions
This study successfully demonstrated the high potential of modern neural network architectures for non-destructive classification of peach maturity using tabular sensor data. Through a systematic evaluation of TabNet, NODE, and SAINT models across four distinct feature configurations, we have shown that it is possible to achieve high classification accuracy (up to 89.77%) without resorting to destructive measurement techniques.
The key findings are threefold:
- The transformer-based SAINT architecture exhibited the highest overall predictive performance when the full feature set (43 variables) was available, achieving an F1-score of 0.9011 and an AUC of 0.9735, and thus establishing the upper benchmark for this dataset.
- The iad index was confirmed as the single most influential predictor of maturity. However, all models demonstrated the capacity to maintain strong performance even in its absence, relying on a combination of dielectric, colorimetric, and morphometric features.
- The TabNet model, when optimized to use only the top 13 non-destructive features (excluding iad), emerged as a highly practical and well-balanced solution. It achieved an accuracy of 86.93% and an F1-score of 0.8623, offering a compelling balance of performance, efficiency, and interpretability for real-world deployment.
These results suggest that future research should focus on deploying such optimized, low-cost sensor systems in real-world agricultural settings. The integration of a model like the 13-feature TabNet into autonomous platforms could significantly enhance the efficiency of selective harvesting, improve fruit quality management, and reduce post-harvest losses, thereby contributing to more sustainable and profitable horticultural practices.
A crucial next step will be to evaluate the performance of the optimized TabNet and SAINT models on a multi-cultivar dataset collected across different seasons and locations. Additionally, future efforts should explore the development of a prototype sensor system based on the 13 optimal features, integrating the required sensors into a portable or autonomous platform to validate the model’s performance in real time. Finally, exploring data fusion techniques that combine these tabular features with low-resolution multispectral imaging may further enhance predictive accuracy and interpretability.
Author Contributions
Conceptualization, D.L., M.M. and M.B.B.; methodology, D.L.; software, D.L.; validation, D.L., M.M. and M.B.B.; formal analysis, D.L.; investigation, D.L.; data curation, D.L. and M.V.; writing—original draft preparation, D.L.; writing—review and editing, M.M., M.B.B., M.V. and T.J.; visualization, D.L.; supervision, M.M., M.B.B. and T.J.; project administration, M.M. and T.J.; funding acquisition, M.M. All authors have read and agreed to the published version of the manuscript.
Funding
This research was funded by the University of Rijeka, grant number uniri-drustv-18-122.
Institutional Review Board Statement
Not applicable.
Informed Consent Statement
Not applicable.
Data Availability Statement
The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding authors.
Conflicts of Interest
The authors declare no conflicts of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript, or in the decision to publish the results.
References
- Giannopoulos, O.; Deltsidis, A.; Chavez, D. Peaches: Appropriate Harvest and Postharvest Handling. In Bulletin 1555; University of Georgia Cooperative Extension: Athens, GA, USA, 2024.
- Zhu, Y.; Zhang, Y.; Li, Y.; Xu, L.; Zhang, J.; Ma, X.; Zhang, W.; Jiang, H. Determination of fruit maturity and its prediction model based on the pericarp index of absorbance difference (IAD) for peaches. PLoS ONE 2017, 12, e0177511.
- Ljubobratović, D.; Vuković, M.; Brkić Bakarić, M.; Jemrić, T.; Matetić, M. Assessment of various machine learning models for peach maturity prediction using non-destructive sensor data. Sensors 2022, 22, 5791.
- Vuković, M.; Jurić, S.; Vinceković, M.; Levaj, B.; Fruk, G.; Jemrić, T. Effect of yellow and Stop Drosophila Normal anti-insect photoselective nets on vegetative, generative and bioactive traits of peach (cv. Suncrest). J. Agric. Sci. 2023, 29, 111–121.
- Ljubobratović, D.; Zhang, G.; Brkić Bakarić, M.; Jemrić, T.; Matetić, M. Predicting peach fruit ripeness using explainable machine learning. In Proceedings of the 31st International DAAAM Symposium, Virtually, 21–24 October 2020; pp. 0717–0723.
- Camps, C.; Christen, D. Non-destructive assessment of apricot fruit quality by portable visible-near infrared spectroscopy. LWT-Food Sci. Technol. 2009, 42, 1125–1131.
- Elsherbiny, O.; El-Hendawy, S.; Elsayed, S.; Elwakeel, A.E.; Alebidi, A.; Yue, X.; Elmessery, W.M.; Galal, H. Incorporation of visible/near-infrared spectroscopy and machine learning models for indirect assessment of grape ripening indicators. Sci. Rep. 2025, 15, 12345.
- Zhao, Z.; Hicks, Y.; Sun, X.; Luo, C. Peach ripeness classification based on a new one-stage instance segmentation model. Comput. Electron. Agric. 2023, 214, 108369.
- Aherwadi, N.; Mittal, U.; Singla, J.; Jhanjhi, N.Z.; Yassine, A.; Hossain, M.S. Prediction of fruit maturity, quality, and its life using deep learning algorithms. Electronics 2022, 11, 4100.
- Ropelewska, E.; Rutkowski, K.P. The classification of peaches at different ripening stages using machine learning models based on texture parameters of flesh images. Agriculture 2023, 13, 498.
- Shuaib, S.E.; Riyapan, P.; Jumrat, S.; Pianroj, Y.; Muangprathub, J. Predictions of oil volume in palm fruit and estimates of their ripeness: A comparative study of machine learning algorithms. Acta Agrobot. 2024, 77, 196783.
- Ljubobratović, D.; Vuković, M.; Brkić Bakarić, M.; Jemrić, T.; Matetić, M. Utilization of Explainable Machine Learning Algorithms for Determination of Important Features in ‘Suncrest’ Peach Maturity Prediction. Electronics 2021, 10, 3115.
- Somvanshi, S.; Das, S.; Javed, S.A.; Antariksa, G.; Hossain, A. A survey on deep tabular learning. arXiv 2024, arXiv:2410.12034.
- Hwang, Y.; Song, J. Recent deep learning methods for tabular data. Commun. Stat. Appl. Methods 2023, 30, 215–226.
- Fan, Y.; Waldmann, P. Tabular deep learning: A comparative study applied to multi-task genome-wide prediction. BMC Bioinform. 2024, 25, 322.
- Arik, S.Ö.; Pfister, T. TabNet: Attentive interpretable tabular learning. In Proceedings of the AAAI Conference on Artificial Intelligence, Virtual Event, 2–9 February 2021; Volume 35, pp. 6679–6687.
- Popov, S.; Morozov, S.; Babenko, A. Neural oblivious decision ensembles for deep learning on tabular data. arXiv 2019, arXiv:1909.06312.
- Somepalli, G.; Goldblum, M.; Schwarzschild, A.; Bruss, C.B.; Goldstein, T. SAINT: Improved neural networks for tabular data via row attention and contrastive pre-training. arXiv 2021, arXiv:2106.01342.
- Crisosto, C. How do we increase peach consumption? In Proceedings of the V International Peach Symposium 592, Davis, CA, USA, 8–11 July 2001; pp. 601–605.
- Ziosi, V.; Noferini, M.; Fiori, G.; Tadiello, A.; Trainotti, L.; Casadoro, G.; Costa, G. A new index based on vis spectroscopy to characterize the progression of ripening in peach fruit. Postharvest Biol. Technol. 2008, 49, 319–329.
- Ramina, A.; Tonutti, P.; McGlasson, W. Ripening, nutrition and postharvest physiology. In The Peach: Botany, Production and Uses; CABI: Wallingford, UK, 2008; pp. 550–574.
- Neri, F.; Brigati, S. Sensory and objective evaluation of peaches. In COST 94. The Post-Harvest Treatment of Fruit and Vegetables: Quality Criteria; COST: Brussels, Belgium, 1994; pp. 107–115.
- Vuković, M.; Ljubobratović, D.; Matetić, M.; Brkić Bakarić, M.; Jurić, S.; Jemrić, T. Evolution of Non-Destructive and Destructive Peach ‘Redhaven’ Quality Traits During Maturation. Agronomy 2025, 15, 1476.
- Zhang, G.; Fu, Q.; Fu, Z.; Li, X.; Matetić, M.; Brkic Bakaric, M.; Jemrić, T. A comprehensive peach fruit quality evaluation method for grading and consumption. Appl. Sci. 2020, 10, 1348.
- Hastie, T.; Tibshirani, R.; Friedman, J. The Elements of Statistical Learning: Data Mining, Inference, and Prediction; Springer Science & Business Media: Berlin/Heidelberg, Germany, 2009.
- Fawcett, T. An introduction to ROC analysis. Pattern Recognit. Lett. 2006, 27, 861–874.
- Abdel-Sattar, M.; Al-Obeed, R.S.; Aboukarima, A.M.; Eshra, D.H. Development of an artificial neural network as a tool for predicting the chemical attributes of fresh peach fruits. PLoS ONE 2021, 16, e0251185.
- Kangune, K.; Kulkarni, V.; Kosamkar, P. Grapes Ripeness Estimation using Convolutional Neural network and Support Vector Machine. In Proceedings of the 2019 Global Conference for Advancement in Technology (GCAT), Bangalore, India, 18–20 October 2019; pp. 1–5.
- Scalisi, A.; Pelliccia, D.; O’Connell, M.G. Maturity Prediction in Yellow Peach (Prunus persica L.) Cultivars Using a Fluorescence Spectrometer. Sensors 2020, 20, 6555.
- Saragih, R.E.; Emanuel, A.W.R. Banana Ripeness Classification Based on Deep Learning using Convolutional Neural Network. In Proceedings of the 2021 3rd East Indonesia Conference on Computer and Information Technology (EIConCIT), Virtually, 9–11 April 2021; pp. 85–89.
- Miragaia, R.; Chávez, F.; Díaz, J.; Vivas, A.; Prieto, M.H.; Moñino, M.J. Plum Ripeness Analysis in Real Environments Using Deep Learning with Convolutional Neural Networks. Agronomy 2021, 11, 2353.
- Llobet, E.; Hines, E.L.; Gardner, J.W.; Franco, S. Non-destructive banana ripeness determination using a neural network-based electronic nose. Meas. Sci. Technol. 1999, 10, 538.
- Sun, H.; Huang, X.; Chen, T.; Zhou, P.; Huang, X.; Jin, W.; Liu, D.; Zhang, H.; Zhou, J.; Wang, Z.; et al. Fruit quality prediction based on soil mineral element content in peach orchard. Food Sci. Nutr. 2022, 10, 1756–1767.
- Betemps, D.L.; de Paula, B.V.; Parent, S.É.; Galarça, S.P.; Mayer, N.A.; Marodin, G.A.B.; Rozane, D.E.; Natale, W.; Melo, G.W.B.; Parent, L.E.; et al. Humboldtian Diagnosis of Peach Tree (Prunus persica) Nutrition Using Machine-Learning and Compositional Methods. Agronomy 2020, 10, 900.
- Zhang, Z.; Jin, Y.; Chen, B.; Brown, P. California Almond Yield Prediction at the Orchard Level with a Machine Learning Approach. Front. Plant Sci. 2019, 10, 809.
- Mazen, F.M.A.; Nashat, A.A. Ripeness Classification of Bananas Using an Artificial Neural Network. Arab. J. Sci. Eng. 2019, 44, 6901–6910.
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).