A General Machine Learning Model for Assessing Fruit Quality Using Deep Image Features

Abstract: Fruit quality is a critical factor in the produce industry, affecting producers, distributors, consumers, and the economy. High-quality fruits are more appealing, nutritious, and safe, boosting consumer satisfaction and revenue for producers. Artificial intelligence can aid in assessing the quality of fruit using images. This paper presents a general machine learning model for assessing fruit quality using deep image features. This model leverages the learning capabilities of the recent successful networks for image classification called vision transformers (ViT). The ViT model is built and trained with a combination of various fruit datasets and taught to distinguish between good and rotten fruit images based on their visual appearance and not predefined quality attributes. The general model demonstrated impressive results in accurately identifying the quality of various fruits, such as apples (with a 99.50% accuracy), cucumbers (99%), grapes (100%), kakis (99.50%), oranges (99.50%), papayas (98%), peaches (98%), tomatoes (99.50%), and watermelons (98%). However, it showed slightly lower performance in identifying guavas (97%), lemons (97%), limes (97.50%), mangoes (97.50%), pears (97%), and pomegranates (97%).


Introduction
Fruit quality refers to a fruit's overall characteristics that determine its desirability, nutritional content, and safety for consumption [1]. It is determined by the fruit's appearance, flavour, texture, nutritional value, and safety [2]. For several reasons, high fruit quality is crucial for the industry, consumers, and the economy.
High-quality fruits benefit growers and sellers economically, promote healthy eating habits, reduce healthcare costs, positively impact the environment, ensure food safety, and promote international trade [3]. Promoting high fruit quality requires using sustainable farming practices, implementing food safety regulations, and promoting healthy eating habits [3]. For the industry, fruit quality is critical for market competitiveness and profitability. The produce industry is highly competitive, and consumers are more discerning than ever, demanding high-quality fruits that meet their flavour, appearance, and nutrition expectations. Furthermore, the reputation of producers and distributors depends on the quality of their products [3]. Consumers who are satisfied with the quality of fruits are more likely to become repeat customers and recommend the products to others, which can help to build a strong brand image and increase sales [3].
In addition, fruit quality is critical for food safety [1]. Poor-quality fruits are more prone to contamination by pathogens and spoilage microorganisms, leading to foodborne illness outbreaks and damaging the industry's reputation. For people, fruit quality is crucial because it determines the taste, nutritional value, and safety of their consumed fruits [1]. High-quality fruits are more nutritious, flavourful, and appealing, making them more likely to be consumed and incorporated into a healthy diet. Furthermore, high-quality fruits are less likely to contain harmful contaminants or spoilage microorganisms, reducing the risk of foodborne illness and promoting public health.
Fruit quality impacts the entire supply chain, from producers to distributors to retailers. High-quality fruits are less likely to spoil during transportation and storage, reducing waste and increasing profits for all parties involved. Furthermore, high-quality fruits are more likely to be sold at premium prices, increasing the value of the entire supply chain.
Several factors determine fruit quality, including variety, growing conditions, harvesting practices, transportation, and storage [1]. For example, the timing of the harvest can have a significant impact. Harvesting fruits too early can result in poor taste, texture, and aroma; harvesting fruits too late can lead to overripening, loss of nutrients, and spoilage. Growing conditions such as soil quality, irrigation, and pest management can also impact fruit quality. Fruits grown in nutrient-rich soil, with proper irrigation and pest management practices, are more likely to be of higher quality than those grown in poor soil conditions or with inadequate pest control measures. Transportation and storage conditions are also crucial for maintaining fruit quality. Fruits must be transported and stored at optimal temperatures and humidity levels to prevent spoilage, maintain freshness, and preserve nutritional value.
Artificial intelligence (AI) can aid in assessing the quality of the fruit using images [4][5][6][7]. AI-based technologies such as computer vision and machine learning (ML) algorithms can analyse the visual characteristics of the fruit and provide an objective quality assessment [8,9]. The AI algorithms can be trained using a large dataset of images [10] of different fruits with varying quality. They can learn to identify the specific features that indicate the quality of the fruit [11,12].
This study is the first to introduce the concept of a general ML model for visually assessing the fruit quality of various types of fruits.While our research focuses on this specific application, it is important to acknowledge that the field of machine learning has witnessed the development of general models for various other applications as well, such as low-cost sensor calibration [13], small molecule substrates of enzyme prediction [14], and topology optimization [15].
We considered the development of a vision transformer (ViT) network [16], a type of neural network architecture designed for image classification tasks that uses the transformer architecture, initially introduced for natural language processing. In ViT, an image is first divided into fixed-size patches. These patches are then flattened and linearly projected into a lower-dimensional space, creating a sequence of embeddings. These embeddings are then fed into a multi-head self-attention mechanism, which allows the network to learn to attend to essential patches and relationships between patches.
The self-attention mechanism [17] is followed by a feedforward neural network, which processes the attended embeddings and outputs class probabilities. ViT also includes additional techniques, such as layer normalisation, residual connections, and token embedding, which help improve the network's performance. ViT allows for effective self-attention mechanisms in image classification tasks, providing a promising alternative to traditional convolutional neural networks (CNNs) [18].
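The patch-and-project step described above can be sketched with plain array operations. All sizes below (a 200 × 200 RGB image, 20 × 20 patches, 64-dimensional embeddings) are illustrative assumptions, not the paper's actual configuration, and the projection matrix is random rather than trained.

```python
import numpy as np

# ViT patch embedding sketch (hypothetical sizes):
# a 200x200 RGB image is cut into non-overlapping 20x20 patches,
# each patch is flattened and linearly projected to a 64-d embedding.
rng = np.random.default_rng(0)
image = rng.random((200, 200, 3))   # H x W x C
patch = 20                          # patch side length (assumed)
dim = 64                            # embedding dimension (assumed)

# Split into patches: (10, 20, 10, 20, 3) -> (100, 1200)
patches = image.reshape(200 // patch, patch, 200 // patch, patch, 3)
patches = patches.transpose(0, 2, 1, 3, 4).reshape(-1, patch * patch * 3)

# Learnable linear projection (random here, trained in practice)
W = rng.standard_normal((patch * patch * 3, dim))
embeddings = patches @ W            # sequence of 100 patch embeddings

print(patches.shape)      # (100, 1200)
print(embeddings.shape)   # (100, 64)
```

The resulting sequence of embeddings is what the multi-head self-attention layers consume.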
A collection of fruit-quality datasets of various fruit types, such as apples, cucumbers, grapes, kakis, oranges, papayas, peaches, tomatoes, watermelons, guavas, lemons, limes, mangoes, pears, and pomegranates, served to train the general model and inspect its performance against fruit-dedicated trained models.
The contributions of this study can be summarised as follows:
• We present a general ML model for determining the quality of various fruits based on their visual appearance;
• This general model performs better than or equal to dedicated per-fruit models;
• Comparisons with State-of-the-Art architectures reveal the superiority of ViTs in fruit quality assessment.

Related Work
Recent studies have reported remarkable success in visually estimating fruit quality. Rodríguez et al. [19] focused on identifying plum varieties during early maturity stages, a difficult task even for experts. The authors proposed a two-step approach where images are first processed to isolate the plum. Then, a deep convolutional neural network is used to determine its variety. The results demonstrate high accuracy, ranging from 91 to 97%.
In [20], the authors proposed a CNN to help with sorting by detecting defects in mangosteen. Indonesia has identified mangosteen as a fruit with significant export potential, but not all are defect-free. Quality assurance for export is performed manually by sorting experts, which can lead to inconsistent and inaccurate results due to human error. The suggested method achieved a classification accuracy of 97% in defect recognition.
During the growth process of apple fruit crops, there are instances where biological damage occurs on the surface or inside of the fruit. These lesions are typically caused by external factors such as the incorrect application of fertilisers, pest infestations, or changes in meteorological conditions such as temperature, sunlight, and humidity. Wenxue et al. [21] employed a CNN for real-time recognition of apple skin lesions captured by infrared video sensors, capable of intelligent, unattended alerting for disease pests. Experimental results show that the proposed method achieves a high accuracy and recall rate of up to 97.5% and 98.5%, respectively.
In [22], the authors proposed an automated method to distinguish between naturally and artificially ripened bananas using spectral and RGB data. Using a neural network on RGB data, they achieved an accuracy of up to 90%. On the spectral data, they used classifiers such as random forest, multilayer perceptron, and feedforward neural networks, achieving accuracies of up to 98.74% and 89.49%, respectively. These findings could help ensure the safety of banana consumption by identifying artificially ripened bananas, which can harm human health.
In [23], hyperspectral reflectance imaging (400–1000 nm) was used to evaluate and classify three common types of peach diseases by analysing spectral and imaging information. Principal component analysis was used to reduce the high dimensionality of the hyperspectral images, and 54 imaging features were extracted from each sample. The proposed model had 82.5%, 92.5%, and 100% accuracy for slightly decayed, moderately decayed, and severely decayed samples, respectively.
Ref. [24] proposed a deep learning-based model called Fruit-CNN for recognising fruits and assessing their quality. The dataset used in this study includes twelve categories of six different fruits based on their quality. It comprises 12,000 images in real-world situations with varying backgrounds. The proposed model outperformed other State-of-the-Art models, achieving an accuracy of 99.6% on a test set of previously unseen images. In [25], the authors utilised a CNN to create an efficient fruit classification model. The model was trained using the Fruits 360 dataset, which consists of 131 varieties of fruits and vegetables. This study focused on three specific fruits, divided into the following three categories based on quality: good, raw, and damaged. The model was developed using Keras and trained for 50 epochs, achieving an accuracy rate of 95%. In [11], the authors used two banana fruit datasets to train and assess their presented model. The original dataset contains 2100 images categorised into ripe, unripe, and over-ripe, with 700 images in each category. This study employed a handcrafted CNN for the classification. The CNN model achieved accuracies of 98.25% and 81.96% on the two datasets, respectively.
In [26], the authors developed a model to identify rotting fruits from input images. This study used three types of fruits: apples, bananas, and oranges. The features of the fruit images were collected using the MobileNetV2 [27] architecture. The model's performance was evaluated on a Kaggle dataset, and it achieved a validation accuracy of 99.61%. In [28], the authors proposed two approaches for classifying the maturity status of papaya: machine learning (ML) and transfer learning. The experiment used 300 papaya fruit images, with 100 images for each maturity level. The ML approach utilised local binary patterns, histograms of oriented gradients, and grey-level co-occurrence matrices, with classification approaches including k-nearest neighbours, support vector machines, and naive Bayes. In contrast, transfer learning utilised seven pre-trained models, including VGG-19 [29]. Both methods achieved 100% accuracy, with the ML method requiring only 0.0995 s of training time.
Most related works have focused on building fruit-specific models. Consequently, they utilised datasets containing fruits from a single variety. There is a need for general fruit quality prediction models, which are transferable from industry to industry and are trained using large-scale datasets. Moreover, recent advances in deep learning models can be benchmarked for fruit quality assessment to investigate their performance.

Deep Learning Framework
We propose a ViT model for the classification task. The current section describes the fundamental concepts of the ViT model and the parameters of the proposed model.

Convolutional Neural Networks (CNNs)
CNNs are a class of neural networks designed explicitly for image-processing tasks [30,31]. CNNs use convolutional and pooling layers to extract features from an input image. Convolutional layers work by convolving a set of learnable filters (kernels) over the input image to produce feature maps [18]. The filters are designed to detect specific patterns in the image, such as edges or corners.
Pooling layers are used to downsample the feature maps produced by convolutional layers, reducing their size while retaining the most critical information. The most common type of pooling layer is max pooling, which takes the maximum value from each subregion of the feature map.
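Max pooling as just described can be shown on a tiny feature map; the 4 × 4 input and 2 × 2 window below are toy values chosen only for illustration.

```python
import numpy as np

# 2x2 max pooling with stride 2 on a 4x4 feature map:
# each output cell keeps the maximum of its 2x2 subregion.
fmap = np.array([[1, 3, 2, 4],
                 [5, 6, 1, 0],
                 [7, 2, 9, 8],
                 [0, 1, 3, 5]], dtype=float)

# Group the map into 2x2 blocks, then take the max of each block.
pooled = fmap.reshape(2, 2, 2, 2).max(axis=(1, 3))
print(pooled)
# [[6. 4.]
#  [7. 9.]]
```

The 4 × 4 map shrinks to 2 × 2 while each region's strongest activation survives.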
CNNs have been highly successful in image classification tasks, achieving State-of-the-Art performance on benchmark datasets such as ImageNet. However, they are limited in their ability to capture global relationships between different parts of an image.

Transformers
Transformers are a type of neural network architecture initially developed for natural language processing tasks, such as machine translation and text summarisation. Transformers use a self-attention mechanism [32] to capture relationships between different parts of an input sequence [33].
The self-attention mechanism works by computing a weighted sum of the input sequence, where the weights are learned based on the importance of each element to the other elements in the sequence. This allows the model to focus on relevant parts of the input sequence while ignoring irrelevant parts.
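The weighted sum described above is, in the standard transformer formulation, a softmax over scaled dot products. The sketch below uses toy sizes and random (untrained) projection matrices purely to show the mechanics.

```python
import numpy as np

# Scaled dot-product self-attention for a sequence of 4 tokens
# with 8-d embeddings (toy sizes, random weights).
rng = np.random.default_rng(1)
x = rng.standard_normal((4, 8))             # input sequence of embeddings
Wq, Wk, Wv = (rng.standard_normal((8, 8)) for _ in range(3))

q, k, v = x @ Wq, x @ Wk, x @ Wv            # queries, keys, values
scores = q @ k.T / np.sqrt(8)               # pairwise relevance scores
# Numerically stable row-wise softmax -> attention weights
shifted = scores - scores.max(axis=1, keepdims=True)
weights = np.exp(shifted) / np.exp(shifted).sum(axis=1, keepdims=True)
out = weights @ v                           # weighted sum of values

print(weights.sum(axis=1))   # each row sums to 1 (a distribution)
print(out.shape)             # (4, 8)
```

Each output token is thus a mixture of all value vectors, weighted by learned relevance.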
Transformers have been highly successful in natural language processing tasks, achieving State-of-the-Art performance on benchmark datasets such as GLUE and SuperGLUE.

ViT Model
ViTs are a type of deep learning model that combines the power of CNNs with the attention mechanism of transformers to process images.This hybrid architecture is highly effective for image classification tasks, as it allows the model to focus on relevant parts of an image while capturing spatial relationships between them.
ViTs use the following two main components: CNNs and transformer networks. The CNNs are used for feature extraction from the images, while transformer networks are used for attention mechanisms. CNNs are particularly good at capturing local image features such as edges and corners. In contrast, transformer networks can capture the global structure of images by attending to relevant regions. By combining the two, ViTs can capture both local and global features, improving performance.
The ViT of this study divides the input image into a 3 × 3 grid of smaller patches, similar to how image segmentation works [16]. Each patch is flattened and passed through convolutional layers to extract features. The transformer network then processes these features, attending to the most relevant ones and aggregating them to generate a representation of the image. This representation is then passed through a series of fully connected layers to classify the image.
The proposed ViT model in Figure 1 consists of multiple layers of self-attention and feedforward networks. The self-attention mechanism allows the network to attend to different parts of the input and weight them based on relevance. The feedforward network generates a new representation of the input, which is then used in the next self-attention layer. The model first processes the input images by dividing them into smaller patches. Each patch is then encoded using a patch encoder layer, which applies a dense layer and an embedding layer. The encoded patches are then passed through a series of transformer blocks. Each block applies a layer of multi-head attention followed by an MLP. The multi-head attention layer allows the model to attend to different image parts, while the MLP layer applies non-linear transformations to the encoded patches.
After the final transformer block, the encoded patches are flattened and fed into an MLP that produces the final classification. The MLP applies two dense layers with 500 and 250 units to the encoded patches. The output of the MLP is then passed through a dense layer with two units and a Softmax activation function to produce the final prediction.
The model is trained using the sparse categorical cross-entropy loss function, which compares the predicted class probabilities to the actual class labels. The model is optimised with the AdamW optimiser, which applies weight decay to the model parameters. The model is evaluated using the sparse categorical accuracy metric, which measures the proportion of correctly classified examples.
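The loss and accuracy metrics named above can be sketched directly. The probabilities and integer labels below are toy values, not model outputs; "sparse" here means the labels are class indices rather than one-hot vectors.

```python
import numpy as np

# Sparse categorical cross-entropy and sparse categorical accuracy
# on toy predictions for the two classes [good, rotten].
probs = np.array([[0.9, 0.1],
                  [0.2, 0.8],
                  [0.6, 0.4]])
labels = np.array([0, 1, 1])   # true classes as integers

# Mean negative log-likelihood of the true class
loss = -np.mean(np.log(probs[np.arange(len(labels)), labels]))
# Fraction of rows whose argmax matches the label
acc = np.mean(probs.argmax(axis=1) == labels)

print(round(loss, 4))   # 0.4149
print(acc)              # 2 of 3 correct
```

In practice the Keras implementations of these quantities are applied batch-wise during training.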

Sources
We used various sources for collecting fruit images classified between quality-related categories.We used the extracted image collection to develop this study's large-scale dataset.The image sources comprise the following:

Characteristics and Preprocessing
The datasets mentioned above were processed to create this study's dataset. The analysis identified 16 fruit types.
We followed the steps described below to create the dataset:
Step 1. Download all files from each source.
Step 2. Create the initial list of examined fruit types.
Step 3. For each dataset, validate the availability of each fruit in the list.
Step 4. For each dataset, exclude corrupted and low-resolution images.
Step 5. Create a large-scale dataset that contains all available fruit types.
Step 6. Exclude fruits that are not labelled.
Step 7. Define the two classes: good quality (GQ) and bad quality (BQ).
Step 8. Exclude fruit types that include fewer than 50 images per class.
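Steps 7 and 8 can be sketched as a simple filter over per-class image counts. The fruit names and counts below are purely illustrative and do not come from the study's dataset.

```python
# Keep only fruits with at least 50 images in BOTH classes
# (good quality, GQ, and bad quality, BQ). Counts are hypothetical.
counts = {
    "apple":  {"GQ": 1693, "BQ": 2342},   # illustrative numbers
    "quince": {"GQ": 40,   "BQ": 120},    # too few GQ images -> excluded
}

MIN_PER_CLASS = 50
kept = {fruit: c for fruit, c in counts.items()
        if c["GQ"] >= MIN_PER_CLASS and c["BQ"] >= MIN_PER_CLASS}

print(sorted(kept))   # ['apple']
```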
Table 1 presents the image distribution between the classes of the final dataset, the total number of images per fruit, the initial image format, and the image size. Apart from the 16 separate datasets, which have been organised to represent one fruit each, we created a unified dataset of all fruit types for training the general model. This dataset will henceforth be referred to as the Union dataset (UD).
We also collected 200 images per fruit that serve the purpose of the external evaluation dataset. The characteristics of this dataset are presented in Table 2. We emphasize that, in this study, we exclusively assessed the quality of fruits based on their visual appearance. We did not consider other features, such as taste, texture, nutritional content, or internal characteristics such as ripeness, which are undoubtedly critical factors in determining overall fruit quality. This limitation is important to acknowledge, as it implies that our quality assessment is solely based on external attributes such as colour, shape, size, and visual defects. While visual appearance can provide valuable insights into fruit quality, it is not a comprehensive measure.
Dataset preprocessing includes sorting the images by fruit, excluding low-resolution and corrupted images, grouping the images into classes, resizing the images to fit on a black 200 × 200-pixel canvas, and normalisation.
CNNs require input images to have a consistent size. Resizing ensures that all input images have the same dimensions, which is essential for the network to process them effectively. This standardization simplifies the architecture and reduces the need for complex resizing operations within the network. Resized images are computationally more efficient to process. Large variations in image sizes can increase the computational load on the network, slowing down training and inference. Resizing images to a uniform size reduces this computational burden.
Normalizing pixel values to a common range (e.g., [0, 1] or [−1, 1]) helps to stabilize and accelerate the training process. It ensures that the network's weights are updated uniformly, preventing saturation of activation functions. Normalization also helps mitigate the effects of the differences in lighting and contrast across images, making the network more robust to variations in input data. Normalizing inputs helps maintain a consistent scale of gradients across layers during backpropagation. This can prevent vanishing or exploding gradients, which are common issues in deep networks, and enable more stable and faster convergence during training. Normalization can act as a form of regularization by reducing the likelihood of overfitting. It imposes constraints on the network's weights and activations, making the model more resistant to noise in the training data.
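The canvas-and-normalise preprocessing can be sketched as follows. Centring a smaller image on a black 200 × 200 canvas is a simplification of the actual fit-to-canvas resizing, and the all-white toy image is hypothetical.

```python
import numpy as np

# Place an image on a black 200x200 canvas (preserving its content)
# and normalise pixel values to [0, 1].
def to_canvas(img, size=200):
    canvas = np.zeros((size, size, 3), dtype=np.float32)  # black background
    h, w = img.shape[:2]
    top, left = (size - h) // 2, (size - w) // 2          # centre the image
    canvas[top:top + h, left:left + w] = img
    return canvas / 255.0                                 # normalise to [0, 1]

img = np.full((120, 80, 3), 255, dtype=np.uint8)          # toy white image
out = to_canvas(img)
print(out.shape, out.min(), out.max())   # (200, 200, 3) 0.0 1.0
```

A full pipeline would first downscale images larger than the canvas so the fruit always fits.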
Data augmentation is a crucial strategy to artificially increase the effective size of the training dataset and improve model generalization.The following methods were applied:

• Width shift: we randomly shifted the image horizontally, changing the position of the fruit within the frame. This helps the model learn to recognise the fruit regardless of its horizontal position.
• Height shift: similar to the width shift, we randomly shifted the image vertically to introduce variations in the fruit's vertical position within the frame.
• Rotation: we applied random rotations to the images to simulate different orientations of the fruits. This helps the model become more invariant to rotation.
• Gaussian noise: we added Gaussian noise to the images to simulate variations in lighting conditions and improve the model's robustness to noise.
• Shear: shear transformations were applied to deform the image, introducing slight distortions that mimic real-world deformations in fruit appearance.
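Two of these augmentations can be sketched with plain array operations; this is a minimal illustration, not the training pipeline's implementation, and the shift amount and noise level are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(2)
img = rng.random((200, 200, 3))   # toy normalised image

# Width shift: move the content 20 pixels right, padding with black.
shifted = np.zeros_like(img)
shifted[:, 20:] = img[:, :-20]

# Gaussian noise: simulate lighting variation, then clip to [0, 1].
noisy = np.clip(img + rng.normal(0.0, 0.05, img.shape), 0.0, 1.0)

print(shifted.shape, noisy.shape)   # (200, 200, 3) (200, 200, 3)
```

In a Keras workflow, such transformations are typically applied on the fly to each training batch.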

Experiment Design
Figure 3 illustrates the methodology of this research study. We designed the experiments as follows:
a. Build a ViT network and perform a 10-fold cross-validation procedure using the UD dataset.
b. Evaluate the model's per-fruit performance in detecting rotten- and good-quality fruits.
c. Build ViT models for each fruit and perform a 10-fold cross-validation procedure using data from the specific fruit.
d. Evaluate the models' performance in detecting rotten- and good-quality fruits.
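The 10-fold cross-validation in steps a and c can be sketched as index partitions: each fold is held out once for validation while the remaining nine train the model. This hand-rolled version is only illustrative; in practice a library routine would be used.

```python
import numpy as np

# Split n sample indices into k (train, validation) folds.
def kfold_indices(n, k=10, seed=0):
    idx = np.random.default_rng(seed).permutation(n)
    parts = np.array_split(idx, k)
    return [(np.concatenate(parts[:i] + parts[i + 1:]), parts[i])
            for i in range(k)]

folds = kfold_indices(100, k=10)     # toy dataset of 100 samples
train, val = folds[0]
print(len(folds), len(train), len(val))   # 10 90 10
```

The per-fold scores are then aggregated, as carried out for the results reported below.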
In evaluating a classification model's performance, several key metrics are commonly used, such as accuracy, precision, recall, and the F1 score. Accuracy measures the proportion of correctly classified instances, providing a general overview of a model's correctness. Precision, conversely, gauges the model's ability to correctly identify positive instances among those it predicted as positive, focusing on minimizing false positives. Recall, also known as sensitivity, assesses the model's capability to identify all positive instances among the actual positives, concentrating on minimizing false negatives. The F1 score, which harmonizes precision and recall, offers a balanced metric that considers false positives and false negatives, making it particularly useful when class imbalance is present in the data. These evaluation criteria collectively provide a comprehensive assessment of a model's performance, aiding in informed decision-making and model refinement.

General Model
In this section, we present the classification results of the general model, which was trained using the large-scale UD dataset.

Training and Validation Performance
Under the 10-fold cross-validation procedure, the general model achieves an accuracy of 0.9794. The latter is computed regardless of the fruit under examination. The model obtains a 0.9886 precision, 0.9733 recall, and 0.9809 F1 score (Table 3). The above scores represent the aggregated scores derived from each iteration over the ten-fold procedure. The model performs excellently in identifying the general condition of any fruit of the dataset. It yields 178 false-good predictions and 424 false-rotten predictions. Correct predictions include 15,476 true-good cases and 13,137 true-rotten cases.
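These scores can be reproduced directly from the confusion counts reported above, treating good quality as the positive class:

```python
# Confusion counts from the cross-validation results
tp, fp = 15476, 178     # true-good, false-good
tn, fn = 13137, 424     # true-rotten, false-rotten

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(round(accuracy, 4), round(precision, 4),
      round(recall, 4), round(f1, 4))
# 0.9794 0.9886 0.9733 0.9809
```

The computed values match the reported metrics to four decimal places.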

External Per-Fruit Evaluation
The general model has been evaluated using the external datasets of various fruit types.The reader shall recall that each external dataset includes 100 good and 100 rotten

General Model
In this section, we present the classification results of the general model, which was trained using the large-scale UD dataset.

Training and Validation Performance
Under the 10-fold cross-validation procedure, the general model achieves an accuracy of 0.9794.The latter is computed regardless of the fruit under examination.The model obtains a 0.9886 precision, 0.9733 recall, and 0.9809 F1 score (Table 3).The above scores represent the aggregated scores derived from each iteration over the ten-fold procedure.The model performs excellently in identifying the general condition of any fruit of the dataset.It yields 178 false-good predictions and 424 false-rotten predictions.Correct predictions include 15,476 true-good cases and 13,137 true-rotten cases.

External Per-Fruit Evaluation
The general model has been evaluated using the external datasets of various fruit types. The reader shall recall that each external dataset includes 100 good and 100 rotten fruit representations. Table 4 presents the results. The general model shows remarkable performance in identifying the quality of apples (accuracy of 0.9950), cucumbers (accuracy of 0.99), grapes (accuracy of 1.00), kakis (accuracy of 0.9950), oranges (accuracy of 0.9950), papayas (accuracy of 0.98), peaches (accuracy of 0.98), tomatoes (accuracy of 0.9950), and watermelons (accuracy of 0.98).
It is worth noting that the general model achieved equal or higher classification scores on the external datasets than on the Union dataset (UD), which contains the training data. This phenomenon is strong evidence of the generalisation capabilities of the model.

Dedicated Models
In this section, we present the results of the dedicated models. Each model is trained to distinguish between good and rotten images of a specific fruit. Consequently, each model can operate using images of a single fruit variety.

Training and Validation Performance
Table 5 summarises the 10-fold cross-validation results of the dedicated models. All models obtain high performance metrics except for the grape and papaya models. The general model is more effective than the dedicated models for predicting the quality of cucumbers, grapes, kakis, mangoes, papayas, pears, tomatoes, and watermelons.
It yields equal classification accuracy for apples, oranges, and peaches. Conversely, the dedicated models are better when built for bananas, guavas, lemons, limes, and pomegranates. Of the sixteen fruit types, the dedicated models performed better in only five of them (Table 7, Figure 4).

Comparison with Classic Machine Learning Models
We also compare the proposed general model (ViT) against various classic machine learning models implemented with the aid of the scikit-learn Python library. Each model was trained and evaluated under the same conditions. To prepare the images for such models, the initial image was flattened to form the one-dimensional input vector that these models require. The comparison reveals that the suggested general and dedicated models are consistent with the literature and may exhibit better performance regarding specific fruit types. More precisely, most studies report an accuracy between 97% and 99% in determining the quality of the fruits. The general model of this study reports per-fruit accuracies that vary between 97% and 100%.
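The flattening step mentioned above can be sketched as follows; the batch size is a toy value, while the 200 × 200 × 3 image shape follows the preprocessing described earlier.

```python
import numpy as np

# Flatten a batch of images into the one-dimensional feature vectors
# that classic (scikit-learn style) models expect.
rng = np.random.default_rng(3)
batch = rng.random((4, 200, 200, 3))   # 4 preprocessed images
X = batch.reshape(len(batch), -1)      # one row of 120,000 features per image

print(X.shape)   # (4, 120000)
```

Each row of `X` can then be passed to an estimator such as a random forest or support vector machine.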
The comparisons also verify that the general model is often better than the dedicated models.

Discussion
The quality of fruits is essential in determining their market value and consumer satisfaction. High-quality fruits are visually appealing, flavourful, and nutritionally dense. However, assessing fruit quality can be laborious and time-consuming, especially when performed manually. This is where deep learning technology can be applied to automate and optimise the process of fruit quality assessment. By processing a large dataset of fruit images, deep learning algorithms can be trained to recognise specific patterns and features indicative of fruit quality. For instance, a deep learning model can be trained to identify specific colouration, texture, and shape characteristics that indicate freshness, ripeness, or maturity in a fruit. Deep learning can be used to assess the quality of fruits at different stages of production, from the farm to the market. Farmers can use deep learning algorithms to assess the quality of their products in real-time, allowing them to make informed decisions on when to harvest or transport their fruits.
Additionally, food retailers can use deep learning technology to sort and grade fruits based on their quality, reducing waste and ensuring consistent product quality for consumers. Furthermore, deep learning can also be applied to preserve fruit quality during storage and transportation. By detecting and removing low-quality fruits before shipping, deep learning algorithms can reduce the chances of damage or spoilage during transportation, ensuring that consumers receive only high-quality fruits.
This research study presented a general ML model based on vision transformers for estimating fruit quality based on photographs. We proposed a general model that can be trained with multiple fruits and predict the quality of any fruit variety that participated in the training set. This general model was superior to dedicated models, in which training was performed using a single fruit variety. According to the results, a generalised model predicts the quality of cucumbers, grapes, kakis, mangoes, papayas, pears, tomatoes, and watermelons more efficiently than dedicated models. However, the classification accuracy of the generalised and dedicated models is similar for apples, oranges, and peaches.
On the other hand, the dedicated models perform better for bananas, guavas, lemons, limes, and pomegranates.Only five of the sixteen fruits analysed showed improved results when using dedicated models.
This suggests that while a generalised model may provide satisfactory results for most fruits, dedicated models tailored to specific fruits can significantly enhance the accuracy of the predictions, particularly for fruits with unique characteristics or qualities that are difficult to generalise.
To summarise, we presented a machine learning model based on ViT networks capable of assessing the quality of various fruits based solely on their visual appearance, eliminating the need for fruit-specific models. Our general model showcases performance that either equals or surpasses dedicated, fruit-specific models, simplifying the process while maintaining or enhancing accuracy. Through rigorous comparisons with State-of-the-Art techniques, our research establishes vision transformers (ViTs) as the superior choice for fruit quality assessment, setting a new benchmark in computer vision for agriculture and quality control. This study has some limitations. Firstly, fruit quality can be evaluated based on several factors, including appearance, flavour, texture, and nutritional content. While the appearance of the fruit can be an indicator of quality, it is not always reliable.
In some cases, the appearance of the fruit can provide some clues about its quality. For example, ripe fruit should have a bright and uniform colour, be free of bruises or blemishes, and have a firm and smooth texture. However, some exceptions exist to these guidelines, such as bananas, which develop brown spots as they ripen but are still perfectly edible. Other factors affecting fruit quality, such as flavour and nutritional content, cannot be assessed based on appearance alone. For example, some fruits may look perfectly fine but lack flavour or be low in certain nutrients. While some fruit characteristics, such as colour, shape, and texture, can be visually evaluated, other vital factors, such as flavour, aroma, and nutritional content, cannot be assessed visually. Moreover, the visual appearance of the fruit can be influenced by various factors, such as lighting, the camera angle, and post-harvest treatments, which can affect the quality assessment. The latter can be considered a limitation of this study.
Integrating machine learning models into existing fruit sorting and grading systems may improve efficiency and accuracy, and also opens the door to a holistic approach that combines image and non-image characteristics for more comprehensive fruit quality assessments. This synergy between different data sources maximises the potential for optimising fruit grading processes across various agricultural contexts.
Adapting machine learning models to account for variations in fruit quality stemming from diverse factors such as climate, soil, and growing conditions is crucial for ensuring the robustness and applicability of these models in real-world agricultural settings. One approach involves incorporating these environmental variables as features in the training dataset. By including climate data (e.g., temperature, humidity, and precipitation); soil characteristics (e.g., pH levels and nutrient content); and growing conditions (e.g., irrigation methods and pesticide usage), the existing model can learn to recognise patterns and correlations between these variables and fruit quality. This enables the model to make more nuanced and context-aware quality assessments. Regular updates of these environmental data help the model adapt to changing conditions over time.
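The feature-fusion idea described above can be sketched in a few lines of Python. This is a hedged illustration only: all field names, values, and the embedding dimensionality are assumptions for the sake of the example, not the paper's actual schema.

```python
# Hedged sketch: fusing a deep image embedding with environmental
# variables (climate, soil, growing conditions) into a single feature
# vector for a downstream quality classifier. All field names and
# values are illustrative assumptions, not the paper's actual schema.

def build_feature_vector(image_embedding, climate, soil, growing):
    env = [
        climate["temperature_c"],
        climate["humidity_pct"],
        soil["ph"],
        growing["irrigated"],   # encoded as 1.0 if irrigated, else 0.0
    ]
    return list(image_embedding) + env

features = build_feature_vector(
    image_embedding=[0.12, -0.53, 0.88],  # e.g., pooled output of the image model
    climate={"temperature_c": 24.5, "humidity_pct": 61.0},
    soil={"ph": 6.4},
    growing={"irrigated": 1.0},
)
print(len(features))  # → 7
```

In practice, the environmental features would be normalised and the combined vector passed to a classifier alongside, or in place of, the raw image pathway.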
Secondly, while studying 16 fruit types provides valuable insights, it is essential to note that this sample size may not represent all fruit types. To fully assess the effectiveness of generalised versus dedicated models for predicting fruit quality, a more comprehensive and diverse dataset should be used.
Including a broader range of fruit varieties in future studies can help to identify patterns and trends across different types of fruit and further establish the efficacy of generalised and dedicated models. Additionally, expanding the sample size can provide more accurate and robust results, allowing for greater confidence in the findings and a better understanding of the strengths and limitations of these modelling approaches.
The integration of machine learning into fruit quality assessment raises important ethical considerations. Privacy and consent are paramount, demanding robust data anonymisation and comprehensive consent procedures. Transparency and fairness are crucial: biases inherited from the data must be addressed with fairness-aware algorithms, ongoing monitoring, and clear model explanations. Environmental responsibility is also key, as machine learning can impact resource consumption; ethical practice involves optimising algorithms for sustainability. Labour displacement concerns call for plans to retrain and reskill affected workers. Finally, ensuring equitable access to these technologies, especially for small-scale farmers, is vital. Initiatives for technology transfer and knowledge sharing promote fairness and broad benefits.

Conclusions
AI-based technologies can potentially revolutionise the fruit industry by providing objective and efficient quality assessment. This study introduced a general machine learning model based on vision transformers to assess fruit quality from images. The model outperformed dedicated models trained on single fruit types, except for apples, oranges, and peaches, where both approaches had similar accuracy. Dedicated models were better for specific fruits such as bananas and pomegranates. Overall, a generalised model worked well for most fruit types; however, dedicated models could improve the accuracy for fruit types with unique features. Fruit quality depends on multiple factors, including appearance, flavour, and nutrition. Appearance can be misleading and is affected by various factors, which constitutes a limitation of this study. Finally, while the 16 fruit types used in this study provide a valid starting point, future research should include a more diverse and extensive range of fruit types to better evaluate the effectiveness of generalised and dedicated models in predicting fruit quality.
A vision transformer consists of multiple layers of self-attention and feedforward networks. The self-attention mechanism allows the network to attend to different parts of the input and weight them based on relevance. The feedforward network generates a new representation of the input, which is then used in the next self-attention layer. CNNs are particularly good at capturing local image features such as edges and corners, whereas transformer networks can capture the global structure of images by attending to relevant regions. By combining the two, hybrid visual transformer-CNN architectures can capture both local and global features, improving performance.
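The self-attention mechanism described here can be sketched in a few lines of NumPy. This is a simplified, single-head illustration under the usual scaled dot-product formulation; the actual network uses multi-head attention with learned query/key/value projections.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Single-head self-attention: each output row is a relevance-weighted
    mixture of the value vectors (a minimal sketch, not the full
    multi-head layer used in the paper's model)."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # pairwise relevance
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V, weights

# Toy input: 4 "patch" embeddings of dimension 8
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
out, w = scaled_dot_product_attention(x, x, x)      # self-attention: Q = K = V = x
```

Each row of `w` sums to one, so every output token is a convex combination of all input tokens, which is what lets the network weigh distant image regions against each other.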

Figure 1.
Figure 1. The proposed vision transformer network. The model takes input images of size (200, 200, 3) and returns a prediction of one of the two classes. The model's architecture consists of a series of transformer blocks, each with a multi-head attention layer and a multilayer perceptron (MLP) layer. The input images are divided into patches and fed into the transformer blocks. The model is trained using the sparse categorical cross-entropy loss function and the AdamW optimiser. The model first processes the input images by dividing them into smaller patches. Each patch is then encoded using a patch encoder layer, which applies a dense layer and an embedding layer. The encoded patches are then passed through a series of transformer blocks. Each block applies a layer of multi-head attention followed by an MLP. The multi-head attention layer allows the model to attend to different image parts, whereas the MLP layer applies non-linear transformations to the encoded patches. After the final transformer block, the encoded patches are flattened and fed into an MLP that produces the final classification. The MLP applies two dense layers with 500 and 250 units to the encoded patches. The output of the MLP is then passed through a dense layer with two units and a Softmax activation function to produce the final prediction.
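As a rough illustration of the patch-splitting arithmetic described in the caption, the following sketch computes the number of tokens produced from a (200, 200, 3) input. The patch size used here is a hypothetical value chosen for illustration; the excerpt does not state the size actually used.

```python
# Patch-splitting arithmetic for a (200, 200, 3) input image.
# PATCH_SIZE is a hypothetical value for illustration; the excerpt
# does not state the patch size actually used in the paper.
IMAGE_SIZE = 200
CHANNELS = 3
PATCH_SIZE = 20

patches_per_side = IMAGE_SIZE // PATCH_SIZE       # patches along each axis
num_patches = patches_per_side ** 2               # tokens fed to the transformer blocks
patch_dim = PATCH_SIZE * PATCH_SIZE * CHANNELS    # length of each flattened patch

print(num_patches, patch_dim)  # → 100 1200
```

Each flattened patch is what the patch encoder layer projects with a dense layer before adding a positional embedding.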

Figure 2
Figure 2 illustrates the data collection and preprocessing steps for creating the datasets of this study.

Figure 2.
Figure 2. Data collection and processing procedure. The top-left box describes the process of creating the UD dataset. The lower-left box presents the creation of the external evaluation datasets. Both datasets share the same pre-processing steps, visualised in the right box.

Figure 4.
Figure 4. Column plot comparing the dedicated and the general models' per-fruit performance.

4.3. Comparison with State-of-the-Art Models under a 10-Fold Cross-Validation Procedure on the UD Dataset

Figure 5 .
Figure 5. UD dataset classification performance comparison between various State-of-the-Art CNN-based networks under a 10-fold cross-validation procedure.

Table 1 .
Per-fruit characteristics of this study's dataset.

Table 2 .
Per-fruit characteristics of this study's external evaluation dataset.

Table 3 .
Results of the general model under a 10-fold cross-validation procedure. UD refers to the training dataset.

Table 4 .
Results of the general model when testing with external data. The testing fruit column refers to the type of fruits used for testing the model; these images originate from the external test dataset.

Table 7 .
Comparison between dedicated models and the general model in per-fruit accuracy measured using the external test set.

Table 8 .
UD dataset classification of various State-of-the-Art CNN-based networks under a 10-fold cross-validation procedure.

Table 10 .
Comparison with the literature.