Discriminative Power of Geometric Parameters of Different Cultivars of Sour Cherry Pits Determined Using Machine Learning

Abstract: The aim of this study was to develop models based on linear dimensions or shape factors, and on sets of combined linear dimensions and shape factors, for the discrimination of sour cherry pits of different cultivars ('Debreceni botermo', 'Łutówka', 'Nefris', 'Kelleris'). The geometric parameters were calculated using image processing. The pits of different sour cherry cultivars differed statistically significantly in terms of selected dimensions and shape factors. The discriminative models built based on linear dimensions produced average accuracies of up to 95% for distinguishing the pit cultivars in the case of 'Nefris' vs. 'Kelleris' and 72% for all four cultivars. The average accuracies for the discriminative models built based on shape factors were up to 95% for the 'Nefris' and 'Kelleris' pits and 73% for four cultivars. The models combining the linear dimensions and shape factors produced accuracies reaching 96% for the 'Nefris' vs. 'Kelleris' pits and 75% for all cultivars. The geometric parameters with high discriminative power may be used for distinguishing different cultivars of sour cherry pits. This can be of great importance for practical applications, as it may help to avoid the adulteration and mixing of different cultivars.


Introduction
Sour (tart) cherry (Prunus cerasus L.) is one of the two main species from the Prunus genus, besides sweet cherry (Prunus avium L.), with fruits globally traded. These fruit crops have been used by humans since 5000-4000 BCE, which was determined based on cherry pits from archaeological sites. Nowadays, there are many sour cherry cultivars. Due to the health benefits of cherries, tree crop cultivation should increase, and processing technology should be improved [1]. The cherry fruit has low caloric content and significant amounts of nutrients and bioactive components, e.g., polyphenols, fiber, vitamin C, carotenoids, potassium, as well as melatonin, serotonin, and tryptophan. A small number of sour cherries is consumed fresh. Up to 97% of fruits are processed mainly for cooking or baking [2]. Before processing, cherries are usually accurately pitted, as the unintended pits in processed cherry products may be a major concern for consumers (potential for injury) and processors (litigation) [3]. The pit of cherry fruit accounts for 6.30% by weight or even 7-15% of the whole fruit and it consists of the shell (75-80%) and kernel (20-25%) [4,5]. The very hard shell contains sclerenchyma and fiber matters. The kernel contains dietary proteins and fiber, and it has antimicrobial and antioxidant activities. The kernels may be used for the production of oils for the pharmaceutical, perfume and cosmetic industries or the production of biodiesel [4]. Additionally, cherry pit biomass may be potentially used for conversion into biochar for water remediation. This biomass may be also cofired with coal for the generation of electricity. The cherry pit biochar may be applied as catalyst supports, alkaline-functionalized gas adsorbents, electrode materials, or soil amendments for greenhouse crop production [6][7][8][9][10][11]. However, pits are still an important waste disposal problem for the processing industry [4]. 
The traditional waste disposal should be replaced by greener ways of cherry pit biomass application [11].
Depending on the extraction procedure and roasting process, the nutrients may pass from the sour cherry kernels into the oil at different percentages [12]. The sour cherry cultivar may also influence the oil content of the kernel that is about 17-36% [5]. The cultivar of cherry kernel also has a great effect on lipophilic bioactive compounds, e.g., sterols, essential fatty acids, tocopherols, tocochromanols, squalene, carotenoids [5,13]. Due to the dependence of the chemical properties of sour cherry kernels on the cultivar, correct cultivar recognition may be important in practice. The processing of cherry kernels may require a uniform sample of kernels with the same characteristics. Some cultivars with certain chemical properties may be more desirable for processing than others. Therefore, there may be a need for authentication to avoid adulteration and mixing different cultivars.
Machine learning may be useful for plant research. As a subfield of artificial intelligence, machine learning is an important topic in computer science. Currently, researchers strive to increase the precision of algorithms and the intelligence of machines, and learning has become a significant capability of machines. Through computer vision, which is a domain of machine learning, machines can be trained to process, analyze, and recognize visual data [14]. Machine learning is intended to enable machines to learn from the available data and make predictions. The ability of computers to learn automatically, without human intervention, may be important for precise prediction [15]. Prediction models developed using machine learning and artificial intelligence can provide promising and accurate results. Models based on artificial intelligence can learn from existing data and then predict even nonlinear phenomena, e.g., food production, crop yield, or the number of immature fruits [16]. The application of machine learning in modern agriculture is important due to the increasing demand for food and the necessity of increasing the effectiveness of agricultural practices while decreasing the environmental burden. Machine learning offers greater computational capability than conventional data-processing techniques, which can be incapable of extracting all necessary information from field data and thus of meeting the growing demands of smart farming [17]. Machine learning focused on the detection of diseases, species, and weeds in crops, the prediction of crop yield and soil parameters, and the classification of crop images to evaluate plant quality and yield can be one of the key components of the agricultural revolution [18].
In the case of the seed industry, machine learning may be important for the production, correct cultivar identification, identification of contaminations, and quality control. The use of machine vision techniques can result in more accurate and faster classification results compared to the manual inspection performed by specialists based on the color and morphological features of seeds [19]. Machine learning caused significant advances in seed research by providing decision-making support and facilitating the development of robust approaches in the seed industry [20]. The usefulness of the application of machine learning for seed classification was reported in the available literature. The machine learning models were built based on various image features. In the case of cultivar discrimination of fruit seeds or pits and stones, the high efficiency of models based on texture parameters was reported for pepper seeds [21], apple seeds [22], peach seeds and stones [23], sour cherry pits [24], and sweet cherry pits [25]. Furthermore, the geometric features proved to be useful for the pit or stone discrimination for different cultivars of apricot [26], plum [27][28][29], olive [30], jujube [31], and sweet cherry [25]. However, in the present study, extensive research using dozens of geometric parameters, including linear dimensions and shape factors, was performed for the first time to discriminate sour cherry pits 'Debreceni botermo', 'Łutówka', 'Nefris', 'Kelleris' using different classifiers (machine learning algorithms). The innovative models based on the sets of selected linear dimensions, shape factors, and combined linear dimensions and shape factors were developed. This approach to distinguishing cultivars of sour cherry pits is original.
The aim of this study was to develop discriminative models based on geometric features including linear dimensions and, separately, shape factors, as well as the combination of linear dimensions and shape factors for the discrimination of the sour cherry pits of different cultivars. The discriminative power of geometric parameters for distinguishing the pairs of cultivars and all four cultivars was compared.

Image Analysis
The pits were imaged using a flatbed scanner. The sour cherry pits were scanned on a black background at a resolution of 1200 dpi and the pit images were saved in TIFF format. The images of sour cherry pits were analyzed with the use of Mazda software (Łódź University of Technology, Institute of Electronics, Poland) [32]. For each pit, the region of interest (ROI) including the whole pit was determined. A caliper image was used for the calibration. Then, for each pit with overlaid ROI, the geometric parameters were computed. Among the linear dimensions, the following features were determined: length (L); width (S); length of the skeletonized object (Lsz); area of circumscribing ellipse on the object (FE); maximal length of the ellipse axis on the object (LmaxE); minimal length of the ellipse axis on the object (LminE); area of circumscribing circle (Fd2); radius of circumscribing circle (D2); profile specific perimeter (Ul); Martin's maximal radius (Mmax); Martin's minimal radius (Mmin); vertical Feret diameter (Fv); convex perimeter (Uw); object boundary specific perimeter (Ug); equivalent circular area diameter (Spol); total object specific area (Ft); horizontal Feret diameter (Fh); maximal Feret diameter (Fmax); minimal Feret diameter (Fmin); Martin's average radius (Maver). The calculated shape factors included: elliptic shape factor (W1); circular shape factor (W2); circularity (W3); folding factor (W4); mean thickness factor (W5); elongation and irregularity ratio (W7); rectangular aspect ratio (W8); area ratio (W9); radius ratio (W10); diameter range (W11); roundness (4πF/(πSmax^2)) (W12); roundness (Smax/F) (W13); roundness (F/Smax^3) (W14); roundness (4F/(πSminSmax)) (W15); standard deviation of all radii (SigR); Haralick ratio (RH); Blair-Bliss ratio (RB); Malinowska ratio (RM);
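The roundness factors W12 to W15 can be illustrated with a minimal sketch that evaluates the expressions exactly as they are written above. It assumes that F is the object area and that Smax and Smin are its maximal and minimal diameters (the text does not define these symbols), and it uses the nominal pixel size of a 1200 dpi scan rather than the caliper-based calibration used in the study. The function and variable names are hypothetical and are not part of the Mazda software.

```python
import math

# Nominal pixel size of a 1200 dpi scan in millimetres (assumption: the study
# calibrated with a caliper image, so the real scale may differ slightly).
MM_PER_PIXEL = 25.4 / 1200


def roundness_factors(area, s_max, s_min):
    """Evaluate W12-W15 as written in the text.

    Assumption: `area` is F, and `s_max` / `s_min` are the maximal and
    minimal diameters of the object, all in consistent units.
    """
    return {
        "W12": (4 * math.pi * area) / (math.pi * s_max ** 2),
        "W13": s_max / area,
        "W14": area / s_max ** 3,
        "W15": (4 * area) / (math.pi * s_min * s_max),
    }


# Hypothetical example: a perfectly circular outline 400 pixels across.
diameter_mm = 400 * MM_PER_PIXEL
circle_area = math.pi * diameter_mm ** 2 / 4
factors = roundness_factors(circle_area, diameter_mm, diameter_mm)
print(round(diameter_mm, 3), {name: round(value, 3) for name, value in factors.items()})
```

For a circular outline W15 evaluates to 1 and W12 to π, which makes a convenient sanity check before applying the function to measured pits.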

Statistical Analysis
The mean values of the linear dimensions and shape factors of the pits of sour cherries 'Debreceni botermo', 'Łutówka', 'Nefris', and 'Kelleris' were compared to determine the differences in parameters between sour cherry cultivars. The STATISTICA (StatSoft Inc., Tulsa, OK, USA) software program was used at a significance level of p ≤ 0.05. The normality of the distribution was checked using the Kolmogorov-Smirnov, Lilliefors and Shapiro-Wilk tests. The Newman-Keuls test was used for the comparison of the means. The homogeneous groups of sour cherry pits, which had no statistically significant differences in the geometric parameters, were indicated by the same letters in columns. The separate groups with statistically significant differences in linear dimensions or shape factors were indicated by different letters in columns.
The usefulness of geometric parameters including linear dimensions and shape factors for distinguishing the pits of sour cherries belonging to different cultivars was analyzed using the WEKA (Machine Learning Group, University of Waikato) application [33]. In the first step of the analysis, the discriminative models were built based on linear dimensions. In the next step, the models based on shape factors were developed. Then, the discriminative models were built based on datasets of the combined linear dimensions and shape factors. The discriminative models were developed separately for each pair of cultivars and for all four cultivars. The attribute selection to choose the parameters with the highest discriminative power was carried out using the Best First method with the correlation-based feature selection (CFS) subset evaluator, the Ranker method with the Info Gain attribute evaluator, the Ranker method with the OneR attribute evaluator, and the Genetic Search method with the CFS subset evaluator. The criterion for evaluating the usefulness of datasets selected with the use of search methods was the highest correctness of discrimination. However, a great reduction in the number of parameters decreased the correctness of the discrimination, so the analyses were performed with the exclusion of only a few attributes. The datasets were manually split into a training set (70%) and a test set (30%). The application of a separate test set that was not used for training ensured the objectivity of the results. The discrimination was performed using the classifiers (machine learning algorithms): NaiveBayes, BayesNet (from the group of Bayes), JRip, PART (Rules), J48, RandomTree (decision trees), Logistic, MultilayerPerceptron (Functions), MultiClassClassifier, FilteredClassifier (Meta), and IBk, KStar (Lazy) [34].
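The 70%/30% division into training and test sets can be sketched as follows. This is only an illustration under stated assumptions: the study split the datasets manually, whereas the sketch shuffles each cultivar with a fixed seed, and the number of pits per cultivar (100) is hypothetical because the text does not report it. The function name `split_70_30` is likewise hypothetical.

```python
import random
from collections import defaultdict


def split_70_30(samples, labels, train_fraction=0.7, seed=0):
    """Split samples into training and test sets separately for each class."""
    by_class = defaultdict(list)
    for sample, label in zip(samples, labels):
        by_class[label].append(sample)

    rng = random.Random(seed)  # fixed seed so the split is reproducible
    train, test = [], []
    for label, items in by_class.items():
        rng.shuffle(items)
        cut = round(len(items) * train_fraction)
        train += [(item, label) for item in items[:cut]]
        test += [(item, label) for item in items[cut:]]
    return train, test


# Hypothetical data: 100 pits per cultivar, identified only by an index.
cultivars = ["Debreceni botermo", "Łutówka", "Nefris", "Kelleris"]
samples = list(range(400))
labels = [cultivars[i // 100] for i in samples]
train_set, test_set = split_70_30(samples, labels)
print(len(train_set), len(test_set))
```

Splitting within each class keeps the cultivar proportions identical in both sets, so no cultivar is under-represented in the test set.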
Based on preliminary observations, the highest classification accuracy for discriminative models was found for the Logistic method, and the results obtained for this classifier are shown in this paper. The results are presented as confusion matrices and average accuracies (rounded to integers), as well as the values of the true positive (TP) rate, precision, F-measure, receiver operating characteristic (ROC) area and precision-recall (PRC) area calculated using the WEKA application. The TP rate, precision and F-measure were based on the formulas:
$$\text{TP rate} = \frac{TP}{TP + FN}$$
$$\text{Precision} = \frac{TP}{TP + FP}$$
$$\text{F-measure} = \frac{2\,TP}{2\,TP + FP + FN}$$
where TP is true positive; FP is false positive; FN is false negative.
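As an illustration of how these measures follow from a confusion matrix, the sketch below applies the standard definitions in terms of TP, FP and FN to one class. The matrix values are hypothetical counts, not results from this study, and the function name `class_metrics` is an assumption made for the example; it does not reproduce the WEKA implementation.

```python
def class_metrics(confusion, k):
    """Return (TP rate, precision, F-measure) for class k.

    `confusion[i][j]` is the number of samples of actual class i that
    were predicted as class j.
    """
    tp = confusion[k][k]
    fn = sum(confusion[k]) - tp                  # class k predicted as another class
    fp = sum(row[k] for row in confusion) - tp   # other classes predicted as k
    tp_rate = tp / (tp + fn)
    precision = tp / (tp + fp)
    f_measure = 2 * tp / (2 * tp + fp + fn)
    return tp_rate, precision, f_measure


# Hypothetical two-class confusion matrix with 20 test samples per class.
confusion = [[19, 1],
             [2, 18]]
tp_rate, precision, f_measure = class_metrics(confusion, 0)
accuracy = sum(confusion[i][i] for i in range(2)) / sum(map(sum, confusion))
print(tp_rate, round(precision, 4), round(f_measure, 4), accuracy)
```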
In the first step of the discriminant analysis, the cherry pits were compared in pairs of two different cultivars. The results of the discrimination based on selected linear dimensions are presented in Table 3. The highest average accuracy of 95% was determined in the case of distinguishing between 'Nefris' and 'Kelleris' pits. The confusion matrix revealed that 95% of the pits belonging to 'Nefris' were correctly included in the class 'Nefris' and 5% were incorrectly assigned to the class 'Kelleris', whereas 94% of 'Kelleris' pits were correctly included in the class 'Kelleris' and 6% were incorrectly included in the class 'Nefris'. This agrees with the comparison of the mean linear dimensions (Table 1), which indicated that for most parameters, the 'Nefris' and 'Kelleris' pits were not in one homogeneous group and in some cases formed two of the most distant groups. The lowest average accuracies were observed for the discrimination of the pits of cherry 'Łutówka' vs. 'Nefris' (78%) and 'Debreceni botermo' vs. 'Łutówka' (84%). In these cases, the linear dimensions had the lowest discriminative power. The 'Łutówka' and 'Nefris' pits, as well as those of 'Debreceni botermo' and 'Łutówka', were the most similar in terms of length. The difference in length between the 'Łutówka' and 'Nefris' pits was 0.26 mm and the difference between the 'Debreceni botermo' and 'Łutówka' pits was equal to 0.21 mm (Table 1). In the case of the other pairs of cherry pits, an average accuracy of 90% was found for distinguishing 'Debreceni botermo' vs. 'Kelleris', and 87% for both 'Debreceni botermo' vs. 'Nefris' and 'Łutówka' vs. 'Kelleris' (Table 3).
The results of discrimination of the pairs of pits of cherry 'Debreceni botermo', 'Łutówka', 'Nefris', 'Kelleris' based on shape factors are shown in Table 4. The tendency was similar to the results of discriminative models built based on linear dimensions (Table 3). In both cases, the 'Nefris' and 'Kelleris' pits were characterized by the highest average discrimination accuracy of 95% (Tables 3 and 4). The sour cherry pits of 'Łutówka' vs. 'Nefris' (78%) (Tables 3 and 4) and 'Debreceni botermo' vs. 'Łutówka' (84% (Table 3), 85% (Table 4)) had the lowest average accuracies. The other discriminative models built based on shape factors produced average accuracies of 92% for 'Debreceni botermo' vs. 'Kelleris' pits, 88% for 'Debreceni botermo' vs. 'Nefris' pits, and 87% for 'Łutówka' vs. 'Kelleris' pits (Table 4). This indicated that the accuracies for models built based on shape factors (Table 4) were equal to or slightly higher than those for models built based on linear dimensions (Table 3).
The accuracies of discrimination based on selected combined linear dimensions and shape factors (Table 5) were higher than for the discrimination performed with shape factors (Table 4) and linear dimensions (Table 3). In the case of models built based on sets of combined linear dimensions and shape factors (Table 5), the average accuracy reached 96% for distinguishing 'Nefris' and 'Kelleris'. This is 1 percentage point higher than for the discrimination of the 'Nefris' and 'Kelleris' pits using models built based on linear dimensions (95%, Table 3) and shape factors (95%, Table 4). In addition, the lowest accuracy of 79%, determined based on combined linear dimensions and shape factors for 'Łutówka' vs. 'Nefris' pits (Table 5), was 1 percentage point higher than for the models based on linear dimensions (78%, Table 3) and shape factors (78%, Table 4) for the same pair. Furthermore, the discrimination accuracies for all other pairs of cherry pits based on combined linear dimensions and shape factors (Table 5) increased and were equal to 86% for 'Debreceni botermo' vs. 'Łutówka', 89% for 'Debreceni botermo' vs. 'Nefris', 93% for 'Debreceni botermo' vs. 'Kelleris', and 90% for 'Łutówka' vs. 'Kelleris'.
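The pairwise accuracies quoted above from Tables 3-5 can be collected in one place to make the comparison explicit. The short sketch below only transcribes the percentages stated in the text and computes, for each pair, the gain of the combined feature set over the better of the two single feature sets; the variable names are hypothetical and no new results are introduced.

```python
# Average accuracies (%) per cultivar pair as quoted in the text:
# (linear dimensions, shape factors, combined set), i.e. Tables 3, 4 and 5.
accuracies = {
    ("Debreceni botermo", "Łutówka"): (84, 85, 86),
    ("Debreceni botermo", "Nefris"): (87, 88, 89),
    ("Debreceni botermo", "Kelleris"): (90, 92, 93),
    ("Łutówka", "Nefris"): (78, 78, 79),
    ("Łutówka", "Kelleris"): (87, 87, 90),
    ("Nefris", "Kelleris"): (95, 95, 96),
}

# Gain of the combined set over the better single feature set, in percentage points.
gains = {
    pair: combined - max(linear, shape)
    for pair, (linear, shape, combined) in accuracies.items()
}

# Mean accuracy over the six pairs for each feature set.
means = [sum(column) / len(accuracies) for column in zip(*accuracies.values())]

for pair, gain in gains.items():
    print(f"{pair[0]} vs. {pair[1]}: +{gain}")
print([round(m, 2) for m in means])
```

In this tabulation the combined set improves every pair by at least 1 percentage point, with the largest gain (3 percentage points) for 'Łutówka' vs. 'Kelleris'.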
The performance of the discrimination for all four cultivars was compared for the models built separately for linear dimensions, shape factors and combined linear dimensions and shape factors (Table 6). The average accuracy of 75% was the highest for discriminative models including combined linear dimensions and shape factors. In this analysis, the pits 'Debreceni botermo' and 'Kelleris' were each characterized by an accuracy of 82%. An accuracy of 76% was determined for the pits 'Nefris' and 59% for the pits 'Łutówka'. The fewest misclassified cases occurred between the pits 'Nefris' and 'Kelleris', and the most between the pits 'Łutówka' and 'Nefris'. The discriminative models built based on shape factors produced an accuracy of 73%. The lowest average accuracy of discrimination of four cherry cultivars was observed for models built based on linear dimensions (72%). This indicated that combined linear dimensions and shape factors had the highest discriminative power for distinguishing the cherry pits belonging to different cultivars, and the discriminative power of linear dimensions was the lowest.
The results of the studies revealed the usefulness of the geometric parameters for the discrimination of different cultivars of sour cherry pits. Both linear dimensions and shape factors had a high discriminative power. However, the models built based on combined linear dimensions and shape factors provided the highest accuracies, equal to 96% for the discrimination of two pit cultivars and 75% for four pit cultivars. The results obtained by Ropelewska [24] indicated that the textures had even higher discriminative power for the discrimination of the pits of different sour cherry cultivars. The pairs of cultivars were discriminated with an average accuracy of up to 100%, whereas, for the discrimination of four cultivars, an accuracy of up to 96.25% was achieved. Ropelewska [25] reported that for sweet cherry pits as well, the discrimination accuracies for models built based on textural features (up to 100% for two pit cultivars and 95% for three cultivars) were equal to or higher than for geometric parameters (up to 99% for two cultivars and 95% for three cultivars). Additionally, Ropelewska [25] found that the models combining geometric and textural parameters provided the highest accuracies of up to 100% for two cultivars and 98% for three pit cultivars.
The results of cultivar discrimination of sour cherry pits based on geometric parameters presented in this paper did not reach 100%. This may indicate some limitations of the developed models that make it impossible to distinguish 'Debreceni botermo', 'Łutówka', 'Nefris', and 'Kelleris' sour cherry pits based on geometric features with 100% accuracy. It prompts further research on sour cherry pits to build discriminative models combining selected geometric and other features. However, the contribution of this study to distinguishing sour cherry pit cultivars using machine learning is significant. The linear dimensions and shape factors with the highest discriminative power were indicated. The mean values of these selected parameters differed the most among the cultivars. The next stage of the research may involve combining these geometric features and selected textures in the model to increase the discrimination accuracy. The developed models based on geometric and textural features could be more successfully applied in practice to detect falsification of sour cherry pit cultivars.
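As a quick arithmetic check on the four-cultivar result discussed above, the per-cultivar accuracies reported for the combined model can be averaged. The sketch assumes that each cultivar contributed the same number of test pits, which the text does not state; under that assumption the unweighted mean reproduces the reported average accuracy of 75%.

```python
# Per-cultivar accuracies (%) quoted in the text for the model that combines
# linear dimensions and shape factors (four-cultivar discrimination).
per_class = {
    "Debreceni botermo": 82,
    "Łutówka": 59,
    "Nefris": 76,
    "Kelleris": 82,
}

# Assumption: equal numbers of test pits per cultivar, so the overall
# accuracy is the unweighted mean of the per-class accuracies.
mean_accuracy = sum(per_class.values()) / len(per_class)
hardest = min(per_class, key=per_class.get)
print(mean_accuracy, round(mean_accuracy), hardest)
```

The check also shows that 'Łutówka' is the cultivar that pulls the average down, in line with its frequent confusion with 'Nefris'.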

Conclusions
The geometric parameters such as linear dimensions and shape factors proved to be useful for the discrimination of sour cherry pits belonging to different cultivars. Higher accuracies were observed when distinguishing pairs of pit cultivars (up to 96%) than four cultivars (up to 75%). The highest discriminative power for distinguishing the different cultivars of sour cherry pits was observed for combined linear dimensions and shape factors, whereas the linear dimensions were characterized by the lowest discriminative power. The present study was the first extensive approach to classify sour cherry pits belonging to different cultivars using models built based on geometric features by machine learning algorithms. Such models developed using the sets of selected linear dimensions, shape factors and combined linear dimensions and shape factors for the discrimination of 'Debreceni botermo', 'Łutówka', 'Nefris', and 'Kelleris' sour cherry pits were not found in the available literature. The accuracies of pairwise discrimination based on geometric features were high and approached those reported in previous studies for models built using texture parameters, although the four-cultivar accuracy remained lower. Demonstrating the usefulness of geometric features to distinguish sour cherry pit cultivars can have practical importance for authenticating pit samples and avoiding the mixing of cultivars with different chemical properties. However, a limitation of the proposed approach is the accuracy of the discrimination, which was less than 100%. Therefore, future research may focus on developing models combining geometric and texture features to increase the discrimination accuracy.