Counterfeit Detection of Iranian Black Tea Using Image Processing and Deep Learning Based on Patched and Unpatched Images

Tea is central to the culture and economy of Middle Eastern countries, especially Iran. At some levels of society, it has become one of the main food items consumed by households. Bioactive compounds in tea, known for their antioxidant and anti-inflammatory properties, have been shown to confer neuroprotective effects, potentially mitigating diseases such as Parkinson’s, Alzheimer’s, and depression. However, the popularity of black tea has also made it a target for fraud, including the mixing of genuine tea with foreign substitutes, expired batches, or lower-quality leaves to boost profits. This paper presents a novel approach to identifying counterfeit Iranian black tea and quantifying adulteration with tea waste. We employed five deep learning classifiers (RegNetY, MobileNet V3, EfficientNet V2, ShuffleNet V2, and Swin V2T) to analyze tea samples categorized into four classes, ranging from pure tea to 100% waste. The classifiers, tested in both patched and non-patched formats, achieved high accuracy, with the patched MobileNet V3 model reaching an accuracy of 95% and the non-patched EfficientNet V2 model achieving 90.6%. These results demonstrate the potential of image processing and deep learning techniques in combating tea fraud and ensuring product integrity in the tea industry.


Introduction
Since its inception, tea has been used as a traditional beverage to promote health and mental peace, as it contains compounds with antioxidant, anti-inflammatory, immune-enhancing, metabolic-regulating, and cytoprotective properties [1,2]. In addition to its unique taste, tea is rich in many bioactive compounds such as catechins, polyphenols, polysaccharides, polypeptides, pigments, and alkaloids, which are effective health factors [3][4][5]. In Iran, where this study was conducted, about 32,000 ha of farmland in the northern part of the country is devoted to tea cultivation due to the suitable climatic conditions [6]. The most common and popular type of tea is black tea, which is made from plucked green tea leaves through dehumidification, rubbing, fermentation, and drying [7][8][9].
However, food, as a fundamental human necessity, often becomes a target for fraudulent activities within its lucrative market. Fraudsters, driven by the prospect of substantial profits, may resort to various deceptive practices that can jeopardize public health [10]. Not all food fraud is intentional; some instances arise inadvertently due to negligence or inadequate hygiene standards during production [11]. Thus, it is crucial to identify and address food fraud due to its potential health, economic, social, and psychological impacts [12,13], including fraud in tea production.
The adulteration of black tea is a challenging problem due to its universality and popularity [14,15]. The major forms of tea adulteration include mixing tea with similar foreign varieties of much lower economic value [16] and adding flavoring and coloring materials, including pigments and other dyes, to improve the appearance of tea or of damaged tea leaves [17]. Some profiteers also offer tea waste in the form of high-quality tea with the addition of colorants and additives. Tea leaves may be mixed with sand, sawdust, and plant roots and marketed to increase volume and weight [18,19]. Even the very small amounts of iron shavings that are added during tea processing [20] lower the final price and increase the profit margin of fraudulent producers.
There are several methods to detect food fraud. One of the most traditional methods is visual inspection, which is a very tedious and time-consuming task [21]. It is also highly dependent on the skill of the analyst and may not be suitable for all types of fraud. For tea, physical and microscopic examinations by trained technicians are proven approaches to detect foreign material. However, this is not a reliable method for distinguishing colors [22]. With advances in artificial intelligence technologies, the development and implementation of food quality control systems are becoming increasingly possible. There is potential to use various artificial intelligence techniques such as machine learning models, natural language processing, and image processing to improve food safety [10]. Deep learning is a subset of machine learning that does not necessarily require a labeled dataset, in the sense that it can also be trained with unsupervised learning [23].
In the work of Xu et al. [24], deep learning techniques were used to distinguish high-quality tea from old tea; this approach showed high potential for tea quality analysis using image processing. In another study by Amsaraj and Mutturi [25], a machine learning method was used to identify the artificial color added to tea in order to classify and determine the amount of fraud, and the result showed the potential of the deep learning technique in identifying color fraud. Also, in India, deep learning and convolutional neural networks were used to identify various diseases of tea leaves, and the results indicated that the method was successful in diagnosing and analyzing these diseases accurately [26]. Hu et al. [27] identified burnt leaves in tea using a deep learning method based on an algorithm to improve the original images and reduce the effects of light and shadow changes. In addition, they showed that the accuracy of the deep learning technique is higher than that of classical learning. In research on the classification and differentiation of tea quality, spectrometry and a convolutional neural network technique were used [15]. Ding et al. [28] identified different varieties of tea using a deep learning technique and compared it with other analytical methods. The results showed that deep learning is an accurate, non-destructive, and promising method.
Given the widespread problem of tea adulteration, which varies by geographic, temporal, qualitative, and biological factors (including classifications of safe and dangerous), it is imperative to create new methods to distinguish between authentic and adulterated tea. This distinction is crucial to safeguarding economic interests, public health, and consumer trust. This paper focuses on detecting tea adulteration and quantifying the extent of its contamination with tea waste. To achieve this, we employed five deep learning classifiers (RegNetY, MobileNet V3, EfficientNet V2, ShuffleNet V2, and Swin V2T) to analyze the data. These classifiers categorized the samples into four distinct classes: class 1 (100% tea), class 2 (85% tea, 15% waste), class 3 (55% tea, 45% waste), and class 4 (100% waste).

Dataset Acquisition
In the present paper, 2 kg of a well-known Iranian tea, the Momtaz variety, was obtained from a tea processing factory located in Rasht, Iran (37°16′51.0024″N; 49°34′59.0052″E). A total of 1 kg of tea from a mixture of different varieties was brewed to obtain tea waste, that is, the discarded part of tea after brewing. The waste was dried to take on the appearance of intact tea. Then, 800 samples were prepared, evenly distributed among four classes: 0%, 15%, 45%, and 100% tea waste. RGB images of all the samples were captured by a digital CCD camera Canon EOS 4000D (Canon Inc., Ota, Tokyo, Japan) with EF-S 18-55 mm-24.1 MP. The shutter speed was 1/25 s, the 35 mm equivalent focal length was 25 mm, the aperture (F-stop) was f/1.8, and the ISO speed was ISO-250. The images were taken without flash, using laboratory lighting, which was kept constant in all the photographs. Concerning the digital parameters, the images were captured at a resolution of 4000 × 1800 pixels, with a depth of 8 bits per pixel and channel, in RGB, using the camera's JPEG format with a low compression ratio. In order to make the volume of the samples of each class uniform, a frame of 25 × 35 × 5 cm was prepared. The camera was placed vertically above the samples and was fixed by a holder at a distance of 20 cm from the samples. The samples of each class were placed in the frame and mixed until a uniform sample was obtained. For example, to prepare a sample of class 2 (15% tea waste), the frame volume was filled with 85% black tea and 15% tea waste and mixed homogeneously. The frame was then removed, and the sample number and class type were noted next to the sample (Figure 1). Finally, the images were manually cropped one by one by the researchers to remove the background (Figure 2). These images were randomly divided into three disjoint groups (train, validation, and test) with a ratio of 6:2:2, equally distributed by class.
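A class-balanced 6:2:2 split of this kind can be sketched as follows. This is a minimal illustration, not the authors' procedure: the function name `stratified_split`, the random seed, and the integer class labels are assumptions made for the example.

```python
import random

def stratified_split(labels, ratios=(0.6, 0.2, 0.2), seed=0):
    """Return train/val/test index lists, splitting every class by `ratios`."""
    rng = random.Random(seed)
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)

    train, val, test = [], [], []
    for indices in by_class.values():
        rng.shuffle(indices)  # random assignment within each class
        n_train = round(ratios[0] * len(indices))
        n_val = round(ratios[1] * len(indices))
        train += indices[:n_train]
        val += indices[n_train:n_train + n_val]
        test += indices[n_train + n_val:]  # remainder goes to the test split
    return train, val, test

# 800 samples, 200 per class, labeled by tea-waste percentage.
labels = [c for c in (0, 15, 45, 100) for _ in range(200)]
train_idx, val_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(val_idx), len(test_idx))  # 480 160 160
```

With 200 samples per class, each class contributes 120 training, 40 validation, and 40 test images.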

Methodology
To analyze the dataset, two different approaches were compared: dividing the training and validation images into equal-sized patches and using the whole images unpatched. The reason for the first approach was to increase the size of the dataset and also not to overload the models with features. Additionally, since this approach results in smaller images, we were able to increase our batch size to 20 to improve model generalization and reduce noise. Note that batch size is an important hyperparameter in convolutional neural networks; it is the number of images processed at a time during training. Typically, larger batch sizes result in better model generalization but at the cost of higher memory consumption.
The original resolution of the images was 1320 × 880 pixels. In the patched approach, the images were divided into six patches of 440 × 440 pixels. All the patches obtained from a given image were placed into the same split (training, validation, or test) of the dataset. In the unpatched approach, the images were resized to 660 × 440 pixels to mitigate the out-of-memory problem. In the first case, a batch size of 20 was used for all models, and in the second case, a batch size of 10 was used. In both cases, the images were fed to the models as 3D arrays (width × height × channels). The process of training, validation, and testing the models was implemented in the Google Colab environment using a Tesla T4 GPU (NVIDIA, Santa Clara, CA, USA).
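The patch layout can be illustrated with a short sketch that computes the crop boxes of the non-overlapping patches. The helper name `patch_boxes` is hypothetical, and the row-by-row tiling order is an assumption; the paper does not state how the patches were ordered.

```python
def patch_boxes(width, height, patch):
    """Return (left, top, right, bottom) boxes of non-overlapping square patches."""
    if width % patch or height % patch:
        raise ValueError("image dimensions must be multiples of the patch size")
    return [
        (left, top, left + patch, top + patch)
        for top in range(0, height, patch)    # rows, top to bottom
        for left in range(0, width, patch)    # columns, left to right
    ]

boxes = patch_boxes(1320, 880, 440)
print(len(boxes))  # 6
print(boxes[0])    # (0, 0, 440, 440)
```

Each box uses the (left, upper, right, lower) convention, so it could be passed directly to a cropping routine such as Pillow's `Image.crop`.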

Classification Models
Five classifiers were used to classify the dataset images in both the patched and unpatched approaches: RegNet, MobileNet, EfficientNet, ShuffleNet, and Swin Transformer. All of these models have different versions; the latest versions were used and compared as described below.

RegNet Y800MF
RegNetY and RegNetX are among the RegNet models that use regularization techniques to constrain the design space while fine-tuning the model to find the optimal parameters and hyperparameters. These models are a more generalized form of the proposed AnyNetY and AnyNetX models. The main difference between the AnyNetX and AnyNetY models is that AnyNetY models have a finer design space, which yields better classification results. RegNet models are trained from a more general design space, and after every 10 epochs, the design space is shrunk based on the performance of the models in that design space. This process results in higher accuracy while using less memory [29].

MobileNet V3 Large
MobileNet is one of the most popular convolutional neural networks. It is mainly used for classification, object detection, and semantic segmentation. These models use various techniques to balance accuracy, latency, and computational efficiency, making them well suited to hardware-constrained and real-time tasks. MobileNet has three main variants, called v1, v2, and v3.
The MobileNet v1 model uses depthwise-separable convolutions instead of standard convolutions, which requires much less memory and computation [30].
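The saving can be made concrete by counting weights. The sketch below compares a standard k × k convolution with its depthwise-separable counterpart (a k × k depthwise step followed by a 1 × 1 pointwise step). The layer sizes are hypothetical and biases are ignored.

```python
def standard_conv_params(k, c_in, c_out):
    """Weights of a standard k x k convolution (biases ignored)."""
    return k * k * c_in * c_out

def depthwise_separable_params(k, c_in, c_out):
    """Weights of a depthwise k x k step plus a pointwise 1 x 1 step."""
    depthwise = k * k * c_in   # one k x k filter per input channel
    pointwise = c_in * c_out   # 1 x 1 convolution that mixes channels
    return depthwise + pointwise

# Hypothetical layer: 3 x 3 kernel, 64 input channels, 128 output channels.
k, c_in, c_out = 3, 64, 128
std = standard_conv_params(k, c_in, c_out)
sep = depthwise_separable_params(k, c_in, c_out)
print(std, sep)  # 73728 8768
```

The ratio of the two counts equals 1/c_out + 1/k², so for 3 × 3 kernels the separable form needs roughly one-eighth to one-ninth of the weights.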
The MobileNet v2 model was an improvement over the first version in several ways. It introduced the concept of inverted residuals, which reduces computational cost while preserving representational capacity by reducing the number of channels for residual inputs and outputs compared to the intermediate bottleneck layers. In addition, the nonlinear activations in the bottleneck layers were replaced by linear ones to improve the propagation of gradients during training. Furthermore, the introduction of the width and resolution multipliers helped to better control the number of channels in each layer and the input resolution of the model, respectively, providing further control over the tradeoff between model size, computational cost, and accuracy. It also improved the residual connections by adding a skip connection from the input to the final layer, which improved information flow and gradient propagation during training. Overall, these changes improved the efficiency, flexibility, complexity, and accuracy of the model while reducing the likelihood of overfitting [31].
MobileNet v3 improved on the previous version by further optimizing the number of parameters and reducing the computational complexity. The most notable features of this model were the introduction of Squeeze-and-Excitation blocks and the Hard-Swish activation function. The Hard-Swish function provided better nonlinearity and gradient flow than the ReLU function, thus improving the model performance. The Squeeze-and-Excitation block allowed the network to adaptively recalibrate channelwise feature responses by explicitly modeling interdependencies between channels, increasing the representational power of the network and improving its ability to capture informative features [32].
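For reference, Hard-Swish is usually defined as x · ReLU6(x + 3) / 6, a piecewise-linear approximation of the Swish function. The scalar sketch below follows that definition; the function names are chosen for this example only.

```python
def relu6(x):
    """ReLU clipped at 6."""
    return min(max(x, 0.0), 6.0)

def hard_swish(x):
    """Hard-Swish activation: x * ReLU6(x + 3) / 6."""
    return x * relu6(x + 3.0) / 6.0

for x in (-4.0, -1.0, 0.0, 1.0, 3.0, 6.0):
    print(x, hard_swish(x))
```

The function is zero for inputs at or below -3, equals the identity for inputs at or above 3, and interpolates smoothly in between using only cheap arithmetic.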

EfficientNet V2S
The EfficientNet model has two versions, v1 and v2. EfficientNet v1 is a convolutional neural network architecture that introduces the concept of compound scaling to efficiently balance model depth, width, and resolution. By simultaneously scaling the depth, width, and resolution of the network, EfficientNet v1 achieves state-of-the-art performance across a wide range of tasks while maintaining computational efficiency. Its advantages include superior performance compared to other architectures at similar computational cost, adaptability to different resource constraints through scaling, and robustness to different computer vision tasks. However, training and fine-tuning the model can be computationally expensive and time-consuming due to its larger size and complexity, especially when training on larger images [33].
EfficientNet v2 improves on its predecessor by introducing stochastic depth regularization and an improved compound scaling method that focuses on improving training speed and accuracy. Stochastic depth randomly drops layers during training, helping to regularize the network and improve generalization. In addition, EfficientNet v2 optimizes model training using a more sophisticated compound scaling method that effectively balances network depth, width, and resolution. These enhancements result in improved performance and efficiency over its predecessor.

ShuffleNet V2X10
The ShuffleNet model also has two versions, v1 and v2. ShuffleNet v1 is a convolutional neural network designed to significantly reduce computational complexity while maintaining high accuracy, making it a strong candidate for embedded and real-time tasks where both accuracy requirements and hardware constraints are demanding. Instead of standard convolutional layers, ShuffleNet v1 uses grouped pointwise convolutions to divide the input channels into multiple groups and apply separate pointwise convolutions to each, significantly reducing computational cost. However, due to the problem of overfitting posed by grouped pointwise convolutions, ShuffleNet v1 shuffles the output channels after each such layer, allowing for better information exchange between layers and improving feature representation [34].
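The channel shuffle operation can be sketched on a plain list of channel indices: the channels are viewed as a (groups × channels-per-group) grid, the grid is transposed, and the result is flattened. This is an illustrative sketch with a hypothetical function name, not code from any particular library.

```python
def channel_shuffle(channels, groups):
    """Interleave channels so each new group draws from every original group."""
    n = len(channels)
    if n % groups:
        raise ValueError("number of channels must be divisible by groups")
    per_group = n // groups
    # Read the (groups x per_group) grid column by column, i.e. transposed.
    return [
        channels[g * per_group + i]
        for i in range(per_group)
        for g in range(groups)
    ]

print(channel_shuffle(list(range(6)), groups=2))  # [0, 3, 1, 4, 2, 5]
```

After the shuffle, consecutive channels come from different original groups, so the next grouped convolution sees features from all groups.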
ShuffleNet v2 modified ShuffleNet v1 in several significant ways. First, ShuffleNet v2 introduced a novel channel splitting operation in the depthwise separable convolution, which allows for a more efficient use of computational resources and reduces computational cost. Second, ShuffleNet v2 introduced a finer granularity in the channel shuffling operation, which improves feature interaction across channels while maintaining low computational overhead. In addition, ShuffleNet v2 adopted a multi-resolution strategy in its architectural design, incorporating different scales of feature maps to effectively capture global and local information. Together, these improvements resulted in better accuracy and computational efficiency than ShuffleNet v1 [35].

Swin V2T
Swin Transformer is a variant of the Transformer architecture designed for image processing. It introduces a novel hierarchical structure called shifted windows to efficiently capture local and global dependencies in images.
Swin Transformer v1 introduces shifted windows, which enable the capture of hierarchical representations of images, allowing the model to effectively capture local and global features. This architecture allows for scaling to larger image sizes without significantly increasing the computational cost. Swin Transformer v1 achieves state-of-the-art performance on various image recognition tasks, surpassing previous architectures such as Vision Transformer [36].
Swin Transformer v2 builds on the success of its predecessor by addressing three key challenges in training large vision models. (1) Training instability: Swin Transformer v2 employs a novel residual post-normalization technique combined with a cosine attention mechanism that improves the stability of the training process, allowing for larger and more powerful models. (2) Resolution gap: There has been a challenge in transferring knowledge from models pretrained on low-resolution images to tasks requiring high-resolution inputs. Swin Transformer v2 addresses this with a log-spaced continuous position bias method, allowing the model to effectively bridge the resolution gap and perform well on high-resolution tasks. (3) Dependence on labeled data: Large-scale vision models typically require massive amounts of labeled data for training, which can be expensive and time-consuming to acquire. Swin Transformer v2 includes a self-supervised pretraining method called SimMIM. SimMIM alleviates data hunger by allowing the model to learn from unlabeled images, reducing the dependence on large amounts of labeled data [37].
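The benefit of log-spaced coordinates can be illustrated numerically. The sketch below assumes the commonly cited mapping sign(Δ) · log(1 + |Δ|) for a relative offset Δ between two positions in a window; actual implementations may normalize the coordinates differently, and the window sizes 8 and 16 are example values only.

```python
import math

def log_spaced(delta):
    """Map a relative offset to log-spaced coordinates: sign(d) * log(1 + |d|)."""
    return math.copysign(math.log1p(abs(delta)), delta)

# Largest relative offset in a window of size w is w - 1.
small_window, large_window = 8, 16
linear_ratio = (large_window - 1) / (small_window - 1)
log_ratio = log_spaced(large_window - 1) / log_spaced(small_window - 1)
print(round(linear_ratio, 2), round(log_ratio, 2))  # 2.14 1.33
```

When the window grows from 8 to 16, the coordinate range that must be extrapolated grows by about 2.14 times in linear space but only about 1.33 times in log space, which is why the position bias transfers more gracefully to higher resolutions.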

Analysis Metrics
The following metrics were used to evaluate the different models in the classification of adulterated tea: cross-entropy, confusion matrix, accuracy, precision, recall, and F1-score. Cross-entropy loss, or log loss, measures the dissimilarity between the data's true distribution and the predicted distribution.
In the context of multi-class classification, let us denote $N$ as the number of samples, $C$ the number of classes, $y_{i,c} = 1$ if the $i$-th sample belongs to class $c$ and $y_{i,c} = 0$ otherwise, and $p_{i,c} = \mathbb{P}(y_{i,c} = 1)$ the predicted probability that the $i$-th sample belongs to class $c$.
The cross-entropy loss over all samples will be as follows:
$$L = -\frac{1}{N} \sum_{i=1}^{N} \sum_{c=1}^{C} y_{i,c} \log p_{i,c} \qquad (1)$$
The confusion matrix is a tool used in machine learning, specifically in classification tasks. It helps visualize the performance of a classification model by summarizing the number of correct and incorrect predictions. The confusion matrix $M$ is constructed such that $M_{i,j}$ corresponds to the number of observations known to be in class $i$ and predicted to be in class $j$. The number of true positives (TPs), false positives (FPs), true negatives (TNs), and false negatives (FNs) for each arbitrary class $c$ is as Equation (2):
$$TP_c = M_{c,c}, \qquad FP_c = \sum_{i \neq c} M_{i,c}, \qquad FN_c = \sum_{j \neq c} M_{c,j}, \qquad TN_c = \sum_{i \neq c} \sum_{j \neq c} M_{i,j} \qquad (2)$$
An example of the values for $TP_c$, $FP_c$, $TN_c$, and $FN_c$ is shown in Table 1 for a case of 4 classes.
Precision denotes the accuracy of the model's positive predictions. In the case of multi-class classification, we calculate each class's precision separately and then the average over all the classes:
$$\mathrm{Precision}_c = \frac{TP_c}{TP_c + FP_c}, \qquad \mathrm{Precision} = \frac{1}{C} \sum_{c=1}^{C} \mathrm{Precision}_c \qquad (3)$$
Recall measures the model's ability to find all the positive samples. Similar to precision, in the case of multi-class classification, the overall recall score is the average value of the class recall scores of the four classes:
$$\mathrm{Recall}_c = \frac{TP_c}{TP_c + FN_c}, \qquad \mathrm{Recall} = \frac{1}{C} \sum_{c=1}^{C} \mathrm{Recall}_c \qquad (4)$$
The F1-score denotes the harmonic mean of the precision and recall scores. This value represents the tradeoff between precision and recall, i.e., whether the model is making neither too few nor too many positive predictions:
$$F1_c = \frac{2 \cdot \mathrm{Precision}_c \cdot \mathrm{Recall}_c}{\mathrm{Precision}_c + \mathrm{Recall}_c} \qquad (5)$$
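The per-class counts and scores can be computed directly from a confusion matrix whose rows are true classes and whose columns are predicted classes, as defined above. The following is a minimal sketch: the matrix values are invented for illustration and are not results from this study, and averaging the per-class F1-scores to obtain an overall F1 is an assumption.

```python
def class_counts(M, c):
    """Return (TP, FP, FN, TN) for class c; rows = true, columns = predicted."""
    n = len(M)
    tp = M[c][c]
    fp = sum(M[i][c] for i in range(n) if i != c)
    fn = sum(M[c][j] for j in range(n) if j != c)
    tn = sum(M[i][j] for i in range(n) for j in range(n) if i != c and j != c)
    return tp, fp, fn, tn

def safe_div(a, b):
    """Division that returns 0.0 when the denominator is zero."""
    return a / b if b else 0.0

def report(M):
    """Accuracy plus per-class and macro-averaged precision, recall, and F1."""
    n = len(M)
    precision, recall, f1 = [], [], []
    for c in range(n):
        tp, fp, fn, _ = class_counts(M, c)
        p, r = safe_div(tp, tp + fp), safe_div(tp, tp + fn)
        precision.append(p)
        recall.append(r)
        f1.append(safe_div(2 * p * r, p + r))
    total = sum(sum(row) for row in M)
    return {
        "accuracy": sum(M[c][c] for c in range(n)) / total,
        "precision": precision, "recall": recall, "f1": f1,
        "macro_precision": sum(precision) / n,
        "macro_recall": sum(recall) / n,
        "macro_f1": sum(f1) / n,  # assumption: mean of per-class F1-scores
    }

# Hypothetical 4-class confusion matrix (classes 0%, 15%, 45%, 100% waste).
M = [[38, 2, 0, 0],
     [1, 36, 3, 0],
     [0, 4, 30, 6],
     [0, 0, 1, 39]]
scores = report(M)
print(class_counts(M, 2))   # (30, 4, 10, 116)
print(scores["accuracy"])   # 0.89375
```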

Implementation Details
All models, in their latest versions, were trained twice, once for each of the two approaches described in Section 2.2. In the first approach, using patched images, models were trained using a batch size of 20, and in the second approach, using the whole images, the batch size was 10.
The cyclic learning rate scheduler was used to update the learning rate after each step. In this paper, each step is the application of one training or validation batch to the model. The scheduler was of type "triangular2", which starts from a value of base_lr, linearly increases the learning rate up to the specified max_lr, and then decreases it linearly back to base_lr to complete a cycle. After each cycle, the amplitude of the cycle (the difference between max_lr and base_lr) is halved until training is finished. The operation of this "triangular2" scheduler is shown in Section 3.1. The cycle size of the scheduler is 10,000 and 5000 for the patched and non-patched models, respectively, because the batch size for the former is twice that of the latter [38]. The SGD optimizer was used to train the models for 200 epochs.
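The schedule can be written in closed form. The sketch below follows the usual definition of the "triangular2" policy, in which the cycle amplitude is halved after every cycle while the lower bound stays at base_lr. The names `triangular2_lr` and `step_size` (the number of steps in the rising half of a cycle) are chosen for this example, and the numeric values are placeholders rather than the settings used in the experiments.

```python
import math

def triangular2_lr(step, base_lr, max_lr, step_size):
    """Learning rate at `step` for a triangular2 cyclic schedule.

    `step_size` is the number of steps in half a cycle (the rising part).
    """
    cycle = math.floor(1 + step / (2 * step_size))   # 1-based cycle index
    x = abs(step / step_size - 2 * cycle + 1)        # 1 at cycle ends, 0 at peak
    amplitude = (max_lr - base_lr) / (2 ** (cycle - 1))  # halved every cycle
    return base_lr + amplitude * max(0.0, 1.0 - x)

# Placeholder values for illustration only.
BASE_LR, MAX_LR, STEP_SIZE = 0.001, 0.4, 5000
for step in (0, 5000, 10000, 15000, 20000):
    print(step, round(triangular2_lr(step, BASE_LR, MAX_LR, STEP_SIZE), 4))
```

With these placeholder values, the rate peaks at 0.4 in the first cycle and at 0.2005 in the second, returning to 0.001 at each cycle boundary.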
At the end of each epoch, the model was benchmarked against the validation dataset, and if its accuracy was higher than the previous best checkpoint, it was exported as a new best checkpoint. After the 200th epoch, the best checkpoint of the model was evaluated using the test dataset, and the results are reported and analyzed in Section 3.

Results and Discussion
Figure 3 and Table 2 show a global comparison between the proposed models based on model size, training time, and accuracy. In most cases, patching the dataset increases the training time per epoch several-fold compared to not patching. This is because the patched version divides the original images into parts at full resolution, while the unpatched version downscales them to half the width and height. Therefore, considering the same original images, the former must process a larger total number of pixels.
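The difference in workload follows from simple arithmetic on the image sizes given in the methodology (1320 × 880 originals, 440 × 440 patches, 660 × 440 resized images), as the short calculation below shows.

```python
orig_w, orig_h = 1320, 880   # original image size in pixels
patch = 440                  # patch side length in the patched approach
resized_w, resized_h = 660, 440  # input size in the unpatched approach

patches_per_image = (orig_w // patch) * (orig_h // patch)
patched_pixels = patches_per_image * patch * patch   # pixels per original image
unpatched_pixels = resized_w * resized_h

print(patches_per_image)                  # 6
print(patched_pixels, unpatched_pixels)   # 1161600 290400
print(patched_pixels / unpatched_pixels)  # 4.0
```

For every original image, the patched approach therefore processes four times as many pixels as the unpatched approach.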
In most cases, except for the EfficientNet model, using the patched dataset resulted in higher accuracy, which could be due to the fact that the patched dataset contains four times as many pixels as the unpatched dataset (six 440 × 440 patches versus one 660 × 440 image per sample). The exception of EfficientNet may be due to the use of the training-aware NAS technique, which attempts to stabilize the training time by using various regularization techniques, resulting in lower accuracy.
In the following subsections, the results of the five models are described in detail, including the patched and unpatched approaches.

An Evaluation of the RegNetY800MF Classifier
The plots of the loss, accuracy, and learning rate of the patched and unpatched RegNetY800MF model are shown in Figure 4. This variant of the RegNet family performed much better than its counterparts in terms of test accuracy, precision, recall, and F1-score. Its learning rate scheduler was configured with a base_lr of 10 and a max_lr of 0.4; however, since the patched model used twice the batch size of the non-patched model, its cycle size is double that of the non-patched model so as not to rush it into small learning rates and thus slower fitting. The validation accuracy and loss of the patched model experienced fewer oscillations than the non-patched model, but it achieved lower loss and higher accuracy overall. The confusion matrices of the patched and non-patched RegNetY800MF model are shown in Figure 5. The performance measures obtained from the confusion matrix are presented in Table 3. Both models performed well in predicting the 0% and 100% classes correctly; however, they both incorrectly predicted some 45% samples as 100%, which lowered the recall of the 45% class and the precision of the 100% class.
For the 15% class, the non-patched model predicted all samples correctly, resulting in a recall of 100% for this class. However, it also predicted some 0% and 45% samples as 15%, which negatively affected its precision for the 15% class. In contrast, the patched model performed worse for this class, both in terms of its false positives and false negatives, resulting in a poor recall for the 15% class. But because its number of true positives was also lower, its precision for this class was higher than that of the non-patched model.
For the 45% class, the non-patched model had worse performance in almost every aspect; it had fewer true positives and more false negatives, resulting in a recall of only 52.5% for this class. But since it had no false positives, its precision for the 45% class was perfect. The patched model had more false positives and true positives while having fewer false negatives, which resulted in a lower precision and higher recall for the 45% class.
The F1-score can help compare the performance of the models for the 15% and 45% classes in terms of precision-recall tradeoff analysis.Overall, the patched model had a more consistent prediction probability for these classes, while the non-patched model was more likely to predict the 45% class as 15% or 100%.
Overall, the patched model performed slightly better in the case of the F1-score; however, they both had an accuracy of 87.5%.

Evaluation of MobileNet V3Large Classifier
The loss, accuracy, and learning rate plots of the patched and non-patched MobileNetV3Large model are shown in Figure A1 in Appendix A. Its learning rate scheduler was configured with a base_lr of 2.5 × 10 and max_lr of 0.25; the reason for the different values for this model was that it did not converge with the usual values for base_lr and max_lr. The validation accuracy and loss of both models experienced a lot of fluctuations, with the non-patched model experiencing more. Even so, the non-patched model achieved higher validation accuracy more frequently than the patched model.
The patched and non-patched MobileNetV3Large models' confusion matrices are displayed in Figure 6, and the accuracy measures are in Table 4. The patched model performed better in all aspects. The 0%, 15%, and 100% classes had almost no false negatives and slightly more false positives, resulting in high precision and recall. Also, the 45% class had more false negatives, which resulted in a slight decrease in recall compared to other classes. The non-patched model correctly predicted 34 samples of the 0% class, while the rest were predicted as 15%. However, it classified all 15% samples correctly. This model performed the worst in predicting the 45% samples, resulting in more false negatives, which caused the recall for the 45% class to suffer. Additionally, due to the absence of false positives for this class, its precision is 100%. In conclusion, the patched model achieved an accuracy of 95%, while the non-patched model correctly predicted only 87.5% of the samples.

Evaluation of EfficientNet V2S Classifier
The loss, accuracy, and learning rate plots for the patched and non-patched EfficientNetV2S models are given in Figure A2. Its learning rate scheduler was configured with a base_lr of 10 and max_lr of 0.4. Both models' validation accuracy experienced some fluctuations, with the non-patched one experiencing more while achieving higher values in general. Also, it bears mentioning that while both models' validation loss experienced a lot of oscillation, the general trend of the patched model is upward after the 10,000th step (equivalent to the 60th epoch). In contrast, the non-patched model's validation loss reached 0.25 multiple times when its learning rate was below 0.05. Therefore, if the non-patched model had been trained for another 100 epochs, it could have reached a more satisfactory fitting state.
The patched and non-patched EfficientNetV2S models' confusion matrices are shown in Figure 7, and the performance measures are in Table 5. Among these models, the patched model performed worse and had lower accuracy. Both models correctly predicted almost all 0% and 100% samples, resulting in near-perfect recall for the 0% and 100% classes.
The patched model performed the worst when classifying the 45% class, not even predicting half of the samples correctly. This resulted in this model's low recall for the 45% class while also lowering the other classes' precision scores. However, 85% of the 15% class's samples were predicted correctly, which is acceptable compared to the former class.
The non-patched model performed much better, achieving an accuracy of 90.6% compared to the patched model's 83.1%. But this model classified about one-quarter of the 45% class samples as 100%, which caused its precision for the 100% class and its recall for the 45% class to be lower than for its other classes. Additionally, this model performed well in the case of the 15% class, predicting 92.5% of the samples correctly while predicting the others as 0% or 45%. Both models had decent precision-recall tradeoffs, with the non-patched model having about a 10% higher F1-score.

Evaluation of ShuffleNet V2x10 Classifier
The accuracy, loss, and learning rate plots for the patched and non-patched ShuffleNetV2X10 models are shown in Figure A3. Its learning rate scheduler was configured with a base_lr of 10 and max_lr of 0.4. Both models' validation loss forms an upward concave curve; i.e., they reach a global minimum at around 60 epochs and slowly rise afterward. Additionally, the validation accuracy of the non-patched model surpassed 90% but experienced a slight downward trend afterward. Furthermore, the patched model's validation accuracy plateaued at around 85%. As explained below, the patched model's test accuracy far exceeds that of the non-patched model, likely due to the patched model's better fit concerning its validation accuracy.
The patched and non-patched ShuffleNetV2X10 models' confusion matrices are shown in Figure 8, and the performance measures are in Table 6. Similar to previous models, the patched model predicted almost all 0% and 100% samples correctly, with little to no confusion, resulting in near-perfect recall and precision scores for the 0% and 100% classes. This model also predicted a fair number of the 45% samples correctly while also over-classifying about half of the 15% samples as 45%. This resulted in a high recall and a low precision for the 45% class, and a low recall for the 15% class. Furthermore, the model under-classified the 15% samples, which resulted in a high precision for the 15% class. Based on the fact that the 45% class had the lowest precision, it can be deduced that our model is skewed towards predicting samples as this class.
The non-patched model performed markedly worse than all the others. This model predicted all samples of the 100% class correctly, resulting in a recall of 100% for this class; however, since more than half of the 45% samples were also misclassified as 100%, its precision for the 100% class was 60.6%. This model also correctly classified nearly 90% of the 15% samples and misclassified the rest as 45% or 100%, resulting in a high recall for the 15% class. However, similar to the 100% class, the 15% class also had more false positives than the 45% and 0% classes, which also lowered its precision to less than 60%. On the other hand, unlike the rest of our models, this model's 0% class had a lot of false negatives, resulting in a recall of less than 60% for the 0% class. In conclusion, this model was highly skewed towards predicting our samples as either 15% or 100%, as indicated by the precision scores of these two classes being lower than those of the other classes.
Overall, based on the per-class and average F1-scores of the two models, it can be concluded that the non-patched model had a far worse precision-recall tradeoff, with its low per-class recall scores leading to low per-class F1-scores and ultimately to a lower average F1-score.
Table 6. The per-class precision, recall, and F1-score of the patched and non-patched approach for the four classes (0%, 15%, 45%, and 100% tea waste) for the ShuffleNetV2X10 model.

Evaluation of Swin V2T Classifier
The accuracy, loss, and learning rate plots for the patched and non-patched SwinV2T models are shown in Figure A4. Its learning rate scheduler was configured with a base_lr of 2 × 10 and max_lr of 8 × 10 . Similar to the MobileNetV3Large model, this model diverged with the usual values for base_lr and max_lr. These models' validation accuracy is almost equal, both plateauing at around 85%, with a near-90% maximum value. However, the non-patched model had slightly higher test accuracy, which could be because it achieved 90% validation accuracy later during the training process, suggesting a better fit. Both models' validation loss experienced some fluctuations. Still, when comparing their general trend after plateauing, the non-patched model's trend was linear. In contrast, the patched model had a slightly increasing trend, which could be another possible reason for its somewhat lower test accuracy.
The patched and non-patched SwinV2T models' confusion matrices are shown in Figure 9.The scores presented in Table 7 show that the patched model had slightly higher average precision, recall, and F1-scores, even though its accuracy is 81.2%, while the nonpatched model is 85%.
The patched model correctly classified almost all 0%, 15%, and 100% samples, resulting in high recall scores. However, it correctly classified only half of the 45% samples, classifying most of the rest as 100%. This lowered the recall of the 45% class and the precision of the 100% class. Overall, it can be deduced that this model was skewed towards the 100% class.
The non-patched model predicted all 15% samples correctly and even classified some 0% and 45% samples as 15%. This gave the 15% class a perfect recall but lowered its precision to 75%. In comparison, the model predicted only three-quarters of the 45% samples correctly, misclassifying the rest as 15% and 0%. Additionally, this model correctly predicted less than three-quarters of the 100% samples, classifying the rest primarily as 45%, which lowered the recall of the 100% class and the precision of the 45% class.
Overall, the patched model had four more false predictions than its non-patched counterpart, resulting in slightly lower accuracy.
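The relation between a confusion matrix and the per-class scores discussed above can be sketched as follows. The matrix used here is invented for illustration (it only mimics the pattern of a 45% class that is half misclassified as 100%) and does not reproduce Figure 9 or Table 7; the function names are likewise our own.

```python
def per_class_scores(cm):
    """Return a list of (precision, recall, f1) tuples, one per class.

    cm[i][j] is the number of samples of true class i predicted as class j.
    """
    n = len(cm)
    scores = []
    for k in range(n):
        tp = cm[k][k]
        fn = sum(cm[k]) - tp                        # true class k, predicted otherwise
        fp = sum(cm[i][k] for i in range(n)) - tp   # predicted k, true class differs
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        scores.append((precision, recall, f1))
    return scores


def accuracy(cm):
    """Fraction of all samples that lie on the diagonal."""
    total = sum(sum(row) for row in cm)
    return sum(cm[k][k] for k in range(len(cm))) / total


# Hypothetical 4-class matrix; rows and columns follow the order below.
LABELS = ["0%", "15%", "45%", "100%"]
cm = [
    [9, 1, 0, 0],
    [0, 10, 0, 0],
    [0, 1, 5, 4],
    [0, 0, 0, 10],
]

scores = per_class_scores(cm)
for label, (p, r, f1) in zip(LABELS, scores):
    print(f"{label:>4}: precision={p:.3f} recall={r:.3f} f1={f1:.3f}")
print("accuracy:", accuracy(cm))
```

In this toy case, the 45% samples sent to the 100% column count as false negatives for the 45% class and as false positives for the 100% class, so one kind of error lowers the recall of one class and the precision of another.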

Analysis of Image Size and Resolution
The comparison between the patched and unpatched approaches is useful for analyzing the effectiveness of the classifiers as a function of image size and resolution, i.e., the level of detail of the images. While the unpatched approach uses larger images (660 × 440 pixels), the patched approach uses smaller images (440 × 440) but with a higher level of detail, since the original images are not rescaled. However, these two cases are insufficient to draw precise conclusions about the effect of image size and resolution on the classifiers. Therefore, we performed an additional experiment to test other sizes in both approaches.
Specifically, in the unpatched version, we tested the use of 330 × 220-pixel images (a ×4 reduction of the original ones) and 160 × 110-pixel images (a ×8 reduction). In the patched version, the extraction of patches of 220 × 220 pixels (24 patches per original image) and of 110 × 110 pixels (96 patches per image) was tested. For this experiment, only the model that obtained the best overall result in the previous tests was used, i.e., MobileNetV3Large. Although the optimal model in the unpatched approach was EfficientNet V2S, the MobileNetV3Large network was also used for it, in order to properly compare the effect of size in both approaches. The confusion matrices obtained for each of these variants are shown in Figure 10. Table 8 summarizes the accuracy and efficiency of all of them.
First, the results obtained clearly demonstrate the negative effect of image reduction on classification. The accuracy of the unpatched approach drops from 82.5% for the highest resolution to 72.5% and 46.9% for the reductions by ×4 and ×8, respectively. Training time is also reduced, but in practice, these versions are not feasible. The reason can be seen in Figure 2: the details that differentiate intact tea from adulterants are very small. By reducing the images ×4 and ×8, these details are lost, so the classifiers are unable to produce a good result. Thus, sufficient resolution is required to see the details of interest in the image.
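The patch counts quoted above (24 and 96 patches per image) follow directly from tiling the original images. The sketch below assumes 1320 × 880-pixel originals and a non-overlapping grid; the tiling scheme is an assumption consistent with the reported counts, and the function name is hypothetical.

```python
def patch_boxes(width, height, patch):
    """Yield (left, top, right, bottom) boxes that tile an image without overlap."""
    for top in range(0, height - patch + 1, patch):
        for left in range(0, width - patch + 1, patch):
            yield (left, top, left + patch, top + patch)


WIDTH, HEIGHT = 1320, 880  # assumed size of the original images

for patch in (440, 220, 110):
    boxes = list(patch_boxes(WIDTH, HEIGHT, patch))
    # Share of the original scene covered by a single patch.
    fraction = (patch * patch) / (WIDTH * HEIGHT)
    print(f"{patch} x {patch}: {len(boxes)} patches per image, "
          f"each covering {fraction:.1%} of the scene")
```

Each box could be passed to an image-cropping routine to produce the actual patches. The coverage figure makes the trade-off explicit: smaller patches multiply the number of training samples, but each one sees a smaller share of the scene.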
However, the size of the images is also relevant. Recall that in the patched versions, the images are not resized but divided into smaller pieces of different sizes. The patched version with 440 × 440-pixel patches achieved the best accuracy, 95%, compared to the versions with 220 × 220 and 110 × 110 pixels. This worsening occurs even though the smaller-patch versions have more images for training (which, in turn, translates into longer training times, as shown in Table 8). For example, the 110 × 110 version has 16 times more images than the 440 × 440 version, but its accuracy does not exceed 80%. The reason for these poor results can be seen in the confusion matrices in Figure 10. Although both reduced patched versions achieve the perfect classification of the 0% class, they tend to confuse the classes with 15% and 45% adulterants. In other words, not only is the resolution of the images important in order not to lose details (as we have seen before), but it is also necessary that the images be large enough to form a meaningful sample of the scene. An image of 110 × 110 pixels shows only about 1% of the original 1320 × 880-pixel image. This small percentage is enough to determine the presence of adulterants, as shown in the confusion matrices, but not to quantify them, producing considerable confusion between the 15% and 45% classes.
As seen in Figures 11 and 12, almost all models achieved scores higher than 80%, with the non-patched EfficientNetV2S and patched MobileNetV3Large surpassing 90%. Of these models, the worst performing was the non-patched ShuffleNetV2X10 model, which did not seem to find a proper fit, owing to its under-classification of the 45% class. Additionally, almost all models achieved a higher precision score than recall score, because the number of false positives was almost always lower than the number of false negatives. As mentioned, the patched MobileNetV3Large model performed much better than the other models, achieving an accuracy of 95% in the best case. We attribute this to its architectural features, namely the Hard-Swish activation function and the Squeeze-and-Excitation blocks, which provide better nonlinearity and better modeling of feature channel interdependencies, respectively.
Furthermore, Table 9 compares our results with several other similar investigations. It is important to keep in mind that these papers used datasets different from ours, so their results cannot be directly compared to our own. However, it can be said that Malyjurek et al. [39] and Kelis et al. [40] used non-CNN methods to analyze hyperspectral images, which resulted in lower accuracy. Additionally, Zheng et al. [41] used a symmetric all CNN (SACNN) to analyze hyperspectral images and achieved an accuracy of 92.4%. The usage of hyperspectral images aided in their high accuracy. In contrast, our methods used RGB images, and the patching technique resulted in an accuracy of 95% for the MobileNetV3Large model. The use of standard RGB images has a great advantage over hyperspectral images, since it implies lower camera costs, greater availability of cameras, and easier assembly in factories.

Conclusions
The development of new techniques for estimating the percentage of tea residues in an image is beneficial in the field of agriculture, for example, for the detection of adulteration. In the present work, a new method for adulteration detection in Momtaz tea has been proposed using RGB images and five deep learning models. The problem is posed as a four-class classification problem, where the mixtures range from 100% tea to 100% residue. Two different approaches have been compared: using the whole images as obtained and using patches extracted from them. The use of patches is clearly beneficial, as it increases the number of training samples for the models. For the best model, MobileNetV3Large, a classification accuracy of 95% is achieved in the patched approach. Considering that this is a four-class classification problem, this accuracy makes the method feasible in practice. The proposed models could be used with other tea varieties and with other types of adulterants. However, this would require obtaining new datasets with which to retrain the models.
As future work, an alternative to classification would be to estimate the percentage of tea waste using regression methods. This would allow the degree of adulteration to be given as a percentage of the mixture. However, for this to be possible, it would be necessary to have many more samples, better distributed across the percentages between 0% and 100%. With the available dataset, the models did not achieve acceptable results for the regression problem. One possibility is to divide the input images into smaller patches. However, the experiments have shown that in order to achieve good classification accuracy, the images must have not only a good level of detail but also a sufficient size to sample the scene to be studied. Another approach is to use some form of ensemble technique and take a vote between multiple models. Some models are better at classifying certain classes, while others over- or under-classify them; in such cases, a model is said to be biased for or against those classes. An ensemble in which all models receive the same test images, and each image is assigned the class that most models predict for it, could mitigate this bias.
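The voting scheme outlined above can be sketched as a simple majority vote over the labels predicted by each model. This is only one possible ensemble rule and was not evaluated in this work; the function name, the tie-breaking rule, and the example predictions are assumptions made for illustration.

```python
from collections import Counter


def majority_vote(predictions):
    """Combine per-model label lists into one list by majority vote.

    predictions: one list of predicted labels per model, all of equal length.
    Ties are broken in favor of the model that appears first in the list.
    """
    voted = []
    for labels in zip(*predictions):   # labels given to one image by every model
        counts = Counter(labels)
        top = max(counts.values())
        winner = next(label for label in labels if counts[label] == top)
        voted.append(winner)
    return voted


# Hypothetical predictions of three models for four test images.
model_a = ["0%", "15%", "100%", "100%"]
model_b = ["0%", "15%", "45%", "100%"]
model_c = ["0%", "45%", "45%", "45%"]

print(majority_vote([model_a, model_b, model_c]))
```

In this toy example, the over-classification of the 100% class by the first model and of the 45% class by the third model are both outvoted. A weighted vote or an average of the class probabilities would be natural alternatives.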

Figure 1. A sample image of the preparation of a tea sample.

Figure 3. Comparison of training time, model size (number of parameters), and obtained accuracy for all proposed models.

Figure 5. Confusion matrices of patched and unpatched approach for RegNetY800MF model.

Figure 6.Confusion matrices of the patched and non-patched MobileNetV3Large model.

Figure 7. Confusion matrices of the patched and non-patched EfficientNetV2S model.

Figure 8. Confusion matrices of the patched and non-patched ShuffleNetV2X10 model.

Figure 9. Confusion matrices of the patched and non-patched SwinV2T model.

Figures 11 and 12 give a comparison of all patched and non-patched models according to the defined evaluation criteria, namely accuracy, precision, recall, and F1-score.

Figure 11.A comparison of the models using the evaluating criteria and the unpatched approach.

Figure 12.A comparison of the models using the evaluating criteria and the patched approach.

Table 1. An example of the values for TP, TN, FP, and FN for a case of 4 classes.

Table 2. Comparison of training time, model size (number of parameters), and obtained accuracy for proposed models.

Table 3. The per-class precision, recall, and F1-score of the patched and non-patched approach for the four classes (0%, 15%, 45%, and 100% tea waste) for the RegNetY800MF model.

Table 8. A comparison of the training time, accuracy, precision, recall, and F1-score achieved by the proposed approaches and image sizes, for the MobileNetV3Large model.

Table 9 compares the correct classification rate (CCR) of the proposed classifiers with other similar works in the current state of the art of visual adulteration detection methods.

Table 9. A comparison of the correct classification rates (CCRs) achieved by the proposed classification models and other similar works in the literature for adulteration detection using computer vision systems.