Classification of Pepper Seeds by Machine Learning Using Color Filter Array Images

The purpose of this work is to classify pepper seeds using color filter array (CFA) images. This study focused specifically on Penja pepper, a type of Piper nigrum found in the Littoral region of Cameroon. India and Brazil are among the largest producers of Piper nigrum, whereas Penja pepper is produced in much smaller quantities. However, it is still highly sought after and one of the most expensive types of pepper on the market. It can be difficult for humans to distinguish between different types of pepper based solely on the appearance of their seeds. To address this challenge, we collected 5618 samples of white and black Penja pepper and other varieties for classification using image processing and a supervised machine learning method. We extracted 18 attributes from the images and used them to train four different models. The most successful model was the support vector machine (SVM), which achieved an accuracy of 0.87, a precision of 0.874, a recall of 0.873, and an F1-score of 0.874.


Introduction
Pepper is a spice obtained from the berries of different species of pepper plants that belong to the botanical family Piperaceae. It is important to differentiate between genuine and counterfeit pepper, as only the fruits of Piper nigrum, Piper cubeba, and Piper longum are legally recognized as "pepper" [1].
The species Piper nigrum produces green, white, or black pepper, depending on the stage of its harvest and the method of preparation. The species Piper longum produces long pepper, which was widely used in the Middle Ages but has become almost forgotten today. Piper cubeba produces cubeb pepper, which is round and has a small tail, hence its name, "tailed pepper".
According to data from [1], the top five pepper-producing countries were Vietnam, Brazil, Indonesia, India, and China. Vietnam was the largest producer, with a production of 482,977 tons, followed by Brazil, with 113,374 tons, and Indonesia, with 105,817 tons. India and China also produce significant amounts of pepper, with 81,958 and 68,000 tons, respectively.
Black pepper is traditionally used for its anti-inflammatory properties. Several studies have been carried out in this area, showing a targeted effect of piperine. In particular, it is thought to act by reducing the number of messengers responsible for inflammation in cells affected by osteoarthritis (joint disorders leading to joint pain) [2]. When taken as a course of treatment, it is also thought to reduce pain [3,4]. Laboratory studies have shown the protective effect of piperine on damage linked to oxidative stress. It is also thought to have a beneficial effect on antioxidant enzymes, increasing our protection against oxidation and premature cell ageing [5,6].
In any case, it can be very challenging to distinguish between different types of pepper based on their seeds, as they have similar morphologies [7,8]. This creates a problem of mislabeling on the market. This is the case for Penja pepper in Cameroon, which is one of the rare and exceptional varieties of Piper nigrum and is highly coveted by top chefs and gourmands. Its superior quality is due to the unique terroir of Penja, which offers exceptional soil and climate conditions, as well as the specialized knowledge and expertise of the local craftsmen.
Computer vision, specifically image processing, is a non-destructive testing solution that can be used to address classification problems. The methods employed include machine learning and deep learning, among others. Several studies have already been conducted in this area for spices, with a particular focus on the classification of pepper and chili seeds. For example, in [9], fuzzy logic is used to classify chili and bell pepper seeds, with an accuracy of 85%. The same study was repeated in [7] using 23 different machine learning algorithms. The algorithms that achieved 100% accuracy were Fine KNN, Weighted KNN, Boosted Trees, Bagged Trees, and Subspace KNN.
Another study was conducted by Awang Iskandar and his team on the detection of foreign bodies in a sample of Piper nigrum pepper seeds [8]. They were able to detect foreign bodies, such as pebbles and strings, with 100% accuracy. They employed several segmentation techniques, including the Color and Erodes Segmentation Technique, the Color Erode and Clarify Segmentation Technique, and the Color and Texture Segmentation Technique. The most effective method was found to be the Color and Texture Segmentation Technique.
Several studies have demonstrated that the use of color filter array (CFA) images yields improved results for both segmentation and classification. CFA data are obtained from cameras with a single monochromatic sensor, where the color filter array makes each photosensor sensitive to only one color component. CFA images must be demosaiced to obtain the final color images, but this process can negatively impact textural information, because demosaicing affects color texture descriptors such as chromatic co-occurrence matrices (CCMs) [10]. A more recent work analyzed automatic image classification methods for Urticaceae pollen, comparing machine learning and deep learning methods for classifying Urticaceae pollen grains. It shows the power of machine learning and deep learning algorithms in the classification of objects from images [11].
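To make the idea of a CFA image concrete, the following minimal sketch builds a single-channel mosaic from a color image by keeping only one color component per pixel. The RGGB Bayer layout and the function name `mosaic_rggb` are assumptions chosen purely for illustration; the filter layout of the camera used in this work may differ.

```python
def mosaic_rggb(rgb):
    """Keep one color component per pixel, following an assumed RGGB layout.

    rgb is a list of rows; each pixel is an (r, g, b) tuple.
    Returns a single-channel image of the same size.
    """
    cfa = []
    for y, row in enumerate(rgb):
        out_row = []
        for x, (r, g, b) in enumerate(row):
            if y % 2 == 0 and x % 2 == 0:
                out_row.append(r)  # red photosite
            elif y % 2 == 1 and x % 2 == 1:
                out_row.append(b)  # blue photosite
            else:
                out_row.append(g)  # green photosites
        cfa.append(out_row)
    return cfa


# Hypothetical 4 x 4 image in which every pixel has the same color.
rgb = [[(10, 20, 30)] * 4 for _ in range(4)]
cfa = mosaic_rggb(rgb)
```

Even for a uniform color, the mosaic alternates between the three component values, which is why a CFA image looks like a textured grayscale image before demosaicing.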
This work aimed to improve the authenticity of the product on the market and reduce the problem of fraudulent labeling. By creating a model that can accurately distinguish Penja pepper seeds from others, the industry can ensure that consumers are getting the product that they are paying for. Additionally, this work will contribute to the protection of the exceptional terroir of Penja and the know-how of the local craftsmen by making it easier to identify real Penja pepper seeds. This can help to support the local economy and promote sustainable agricultural practices.
The main contributions are as follows:
• The creation of a large CFA image database;
• The improvement of the experimental set-up used by Bitjoka et al., 2015 [12];
• A segmentation and attribute extraction method that can be used for the automation of seed identification in general.
The rest of the work is organized as follows: in Section 2, we first present related works on seed classification. Section 3 is devoted to our samples and image acquisition processes. Then, we show the different classification methods used as well as the selected attributes. In Section 4, we present the results and discussions. The paper is concluded in Section 5.

Related Work
Several works have been carried out on classification in the field of agri-food, and in relation to spices, classification work has mostly been carried out on peppers and chili peppers. Almost no classification work has been carried out on pepper seeds. However, the techniques and methods used for other spices can also be applied to pepper seeds, whose usefulness is well established, particularly in the health and culinary fields.
One of the works on pepper seed classification using machine vision is based on convolutional neural networks (CNNs) [13]. In this work, the best classification score, a precision of 84.94%, was achieved using a desktop scanner with a resolution of 1200 dpi. The use of this equipment can be justified by the fact that chili pepper seeds are flat in appearance. As we did not have this equipment at our disposal, we were not able to reproduce the approach adopted in this work. However, this work clearly shows that neural networks are an effective means of classifying spices.
Another work was carried out on corn seeds [14]. This work focused on the classification of five maize species using computer-based recognition. The models used were the multilayer perceptron (MLP), decision tree (DT), linear discriminant analysis (LDA), naive Bayes (NB), support vector machine (SVM), and k-nearest neighbors (KNN), and the one that yielded the greatest performance was the SVM. These classification models have also been used in several classification projects in the agri-food sector. These works [15][16][17] and many others have shown the effectiveness of these models. As peppercorns have almost the same structure as corn seeds, it is possible that these methods can also work in identifying pepper seeds.
In the food industry, the attributes extracted from a product image directly convey information about the state of the product in the image [18]. To make a classification, it is important to carefully choose the attributes that will serve as elements of comparison in the chosen model. Several works in the literature show that attributes are often selected from among shape, color, and texture attributes [19][20][21][22][23]. Among the different texture analysis approaches used in the food industry, the majority of applications use either histograms of sums and differences or chromatic co-occurrence matrices.
Regarding classification performance evaluation methods, several measures have been used in the literature. The most used measures are described in the reviews [24,25].
In [26], Sabanci et al. (2022) worked on the classification of chili pepper seeds using convolutional neural networks (CNNs). Although their objectives are the same as ours, we worked on pepper seeds, which have a round appearance, whereas chili pepper seeds are rather flat. The device used in [13] for chili pepper seeds is well suited to such seeds but not to pepper seeds. The accuracy of the results is closely related to the equipment and the size of the database used.


Sample and Image Preparation
We used both white and black pepper seeds. The samples were divided into four groups: white seeds from Penja, black seeds from Penja, white seeds from other origins, and black seeds from other origins, as presented in Figure 1. The Penja pepper seeds were obtained directly from eight different sources in Penja, resulting in eight distinct samples of Penja pepper, five of which were white and three of which were black. The other origins comprised a mixture of peppers imported into Cameroon, such as those from Dubai, India, and Brazil. The different sources of the pepper samples used are described in Table 1.

Image Acquisition Device
We used a device similar to the one used in [12]. This set-up was installed in the Mechanics laboratory of the University Institute of Technology of Brest. The device is shown in Figure 2. The box is made of wood and is sealed off from external light. The only source of light is the 150 W LED ribbon on the top inside, and the intensity of the light can be adjusted using a potentiometer.
The Fujifilm X-E1 (Amazon France, Brest, France) digital camera was selected for taking images because of its high resolution and good image quality. The images were taken with a resolution of 4896 × 3264 pixels. The aperture was set to f/8, the ISO was set to 400, and the shutter speed was set to 1/60 s to ensure that the captured images had good depth of field, low noise, and good sharpness. The images were taken in 14-bit RAW format and later converted to the PGM format for further processing. The device was tested and validated by a team. The device was used as follows:
➢ The drawer is placed inside the lightproof box;
➢ The camera is positioned above the drawer and focused on the seeds;
➢ The image is captured with the camera.
This process is repeated for 10 pinches of the same sample. The images are taken in RAW (.RAF) + JPG (1920 × 1280 pixels, Size: 24.9 Mb, No flash).

Creation of the Dataset
Using the Python library rawpy (Python 3.9.13, Anaconda environment, JupyterLab 3.4.4), we generated a 16-bit grayscale image in the PGM (Portable Graymap) format. The procedure for creating the image data for classification is shown in the flowchart in Figure 3. These CFA grayscale images (Figure 4) were then segmented using the Otsu method, allowing us to create binary masks (Figure 5) to extract the seeds. Using the masks, we identified each seed (Figure 6) and then created a bounding box (the smallest rectangle that contains the detected object) around each seed (Figure 7) on the CFA image. Finally, we saved the image of each seed in a PNG file (Figure 8). The extracted images were stored in four different folders, one per class: PBP (for white pepper from Penja), PBA (for white pepper from other origins), PNP (for black pepper from Penja), and PNA (for black pepper from other origins). We had a total of 5618 seed images: 1335 were PBP, 1416 were PBA, 1437 were PNP, and 1430 were PNA.
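The segmentation and bounding-box steps can be sketched without any imaging library. The example below is a dependency-free illustration rather than the pipeline used in this work: the function names, the 8-bit toy image, the assumption that seeds are brighter than the background, and the use of 4-connectivity are all choices made for the sketch. For 16-bit data, `levels` would be 65536.

```python
from collections import deque


def otsu_threshold(image, levels=256):
    """Return the grey level t that maximizes the between-class variance."""
    hist = [0] * levels
    for row in image:
        for v in row:
            hist[v] += 1
    total = sum(hist)
    sum_all = sum(i * h for i, h in enumerate(hist))
    best_t, best_var = 0, -1.0
    w0 = 0    # number of pixels with value <= t
    sum0 = 0  # sum of their grey levels
    for t in range(levels):
        w0 += hist[t]
        sum0 += t * hist[t]
        if w0 == 0:
            continue
        w1 = total - w0
        if w1 == 0:
            break
        m0 = sum0 / w0
        m1 = (sum_all - sum0) / w1
        var = w0 * w1 * (m0 - m1) ** 2
        if var > best_var:
            best_t, best_var = t, var
    return best_t


def bounding_boxes(mask):
    """Label 4-connected foreground regions; return (y0, x0, y1, x1) per region."""
    h, w = len(mask), len(mask[0])
    seen = [[False] * w for _ in range(h)]
    boxes = []
    for y in range(h):
        for x in range(w):
            if not mask[y][x] or seen[y][x]:
                continue
            queue = deque([(y, x)])
            seen[y][x] = True
            y0 = y1 = y
            x0 = x1 = x
            while queue:
                cy, cx = queue.popleft()
                y0, y1 = min(y0, cy), max(y1, cy)
                x0, x1 = min(x0, cx), max(x1, cx)
                for ny, nx in ((cy - 1, cx), (cy + 1, cx), (cy, cx - 1), (cy, cx + 1)):
                    if 0 <= ny < h and 0 <= nx < w and mask[ny][nx] and not seen[ny][nx]:
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            boxes.append((y0, x0, y1, x1))
    return boxes


# Hypothetical image: dark background (10) with two bright "seeds".
image = [
    [10, 10, 10, 10, 10, 10],
    [10, 200, 210, 10, 10, 10],
    [10, 205, 200, 10, 190, 10],
    [10, 10, 10, 10, 195, 10],
    [10, 10, 10, 10, 10, 10],
]
threshold = otsu_threshold(image)
mask = [[v > threshold for v in row] for row in image]  # assumes bright seeds
boxes = bounding_boxes(mask)
```

Each box can then be used to crop the corresponding seed from the CFA image. If the seeds are darker than the background, the comparison used to build `mask` would be inverted.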
Machine learning methods require the manual selection of relevant features prior to extracting them from the images. One challenge lies in the appropriate selection of a set of features for classification [6]. The attributes retained for calculation on each seed image are primarily texture attributes, as texture is an important characteristic used in identifying objects or regions of interest in an image, whether it be a photomicrograph, aerial photograph, or satellite image [29].
The image attributes used can be grouped into four main groups, as in [29,30]:
- Shape attributes: area, perimeter, compactness, extent, width, and height;
- Gabor filter characteristics: the mean and the standard deviation;
- Local Binary Patterns (LBP) transform characteristics: contrast, correlation, energy, homogeneity, and entropy;
- Co-occurrence matrix (GLCM) characteristics: dissimilarity, correlation, contrast, homogeneity, and ASM.
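As an illustration of the shape group, the sketch below computes the six shape attributes from a binary mask of a single seed. The definitions are assumptions made for this example, since the text does not specify them: the perimeter is taken as the number of exposed pixel edges, compactness as 4π·area/perimeter², and extent as the area divided by the bounding-box area.

```python
import math


def shape_attributes(mask):
    """Shape attributes of the single foreground object in a binary mask."""
    h, w = len(mask), len(mask[0])
    pixels = [(y, x) for y in range(h) for x in range(w) if mask[y][x]]
    area = len(pixels)
    ys = [y for y, _ in pixels]
    xs = [x for _, x in pixels]
    height = max(ys) - min(ys) + 1
    width = max(xs) - min(xs) + 1
    # Perimeter: count pixel edges that face the background or the image border.
    perimeter = 0
    for y, x in pixels:
        for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
            inside = 0 <= ny < h and 0 <= nx < w
            if not inside or not mask[ny][nx]:
                perimeter += 1
    return {
        "area": area,
        "perimeter": perimeter,
        "compactness": 4 * math.pi * area / perimeter ** 2,  # assumed definition
        "extent": area / (width * height),                    # assumed definition
        "width": width,
        "height": height,
    }


# Hypothetical mask: a 3 x 3 square inside a 5 x 5 image.
mask = [[1 if 1 <= y <= 3 and 1 <= x <= 3 else 0 for x in range(5)] for y in range(5)]
attrs = shape_attributes(mask)
```

With this definition, a perfect disc would have a compactness close to 1 and a square gives π/4, so the attribute measures how far the seed outline departs from a circle.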
Following [29], a Grey Level Co-occurrence Matrix (GLCM) was created using neighboring grey tones (Figure 3) in order to derive the textural features. The GLCM gives an indication of the spatial relationship of pixels and characterizes the texture of an image by calculating how often pairs of pixels with specific values and in an unambiguous spatial relationship occur in an image. Specifically, the GLCM contains the normalized relative frequency, p(i, j), indicating how often two pixels with grey levels i and j separated by a distance d along the angle θ occur within an image block. The separation distance d has been assumed to be d = 1, while the angles are assumed to be θ = 0°, 45°, 90°, and 135°. This is illustrated in Figure 9.
The co-occurrence matrix was calculated with a distance of 5 pixels and an angle of 0 degrees. These data are then saved in an Excel file. The feature extraction process for the pepper seed images is described by the diagram in Figure 10.
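The definition of p(i, j) and the five GLCM attributes can be made concrete with a small pure-Python sketch. It is an illustration under stated assumptions rather than the implementation used here: the function names and the four-level toy image are hypothetical, the matrix is not symmetrized, and the feature formulas are the usual textbook ones. An angle of 0° at distance d corresponds to the offset (dy, dx) = (0, d).

```python
import math


def glcm(image, dy, dx, levels):
    """Normalized co-occurrence matrix p(i, j) for the pixel offset (dy, dx)."""
    counts = [[0] * levels for _ in range(levels)]
    h, w = len(image), len(image[0])
    for y in range(h):
        for x in range(w):
            ny, nx = y + dy, x + dx
            if 0 <= ny < h and 0 <= nx < w:
                counts[image[y][x]][image[ny][nx]] += 1
    total = sum(map(sum, counts))
    return [[c / total for c in row] for row in counts]


def glcm_features(p):
    """Dissimilarity, correlation, contrast, homogeneity and ASM of p."""
    n = len(p)
    cells = [(i, j, p[i][j]) for i in range(n) for j in range(n)]
    mu_i = sum(i * v for i, j, v in cells)
    mu_j = sum(j * v for i, j, v in cells)
    sd_i = math.sqrt(sum(v * (i - mu_i) ** 2 for i, j, v in cells))
    sd_j = math.sqrt(sum(v * (j - mu_j) ** 2 for i, j, v in cells))
    cov = sum(v * (i - mu_i) * (j - mu_j) for i, j, v in cells)
    return {
        "contrast": sum(v * (i - j) ** 2 for i, j, v in cells),
        "dissimilarity": sum(v * abs(i - j) for i, j, v in cells),
        "homogeneity": sum(v / (1 + (i - j) ** 2) for i, j, v in cells),
        "ASM": sum(v * v for i, j, v in cells),
        # Convention chosen for this sketch: a constant image gets correlation 1.
        "correlation": cov / (sd_i * sd_j) if sd_i * sd_j > 0 else 1.0,
    }


# Hypothetical 4-level image; offset (0, 1) means d = 1 and an angle of 0 degrees.
image = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 2, 2, 2],
    [2, 2, 3, 3],
]
p = glcm(image, dy=0, dx=1, levels=4)
features = glcm_features(p)
```

In this toy image there are 12 horizontal pixel pairs, so each entry of p is a count divided by 12; for example, the pair (2, 2) occurs three times and p(2, 2) = 0.25.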

Classification
Before building the different classification models, we first eliminated outliers using the Isolation Forest algorithm from the sklearn library in Python 3.9, with a contamination rate of 0.05 [26]. This process removed 282 data points, reducing the number of samples from 5618 to 5336. The presence of outliers can be attributed to the high variance in shape variables such as area and perimeter. The results of the cleaning process are shown in Figures 11 to 14. In the images before cleaning (a), the yellow color represents all of the seeds in the treated group. In the images after cleaning (Figures 12b, 13b, and 14b), the red seeds represent the retained seeds, and those in green are the outliers. The red seeds shown in Figures 13a and 14a represent those that were not recognized.
Next, we normalized the data using the RobustScaler algorithm [26]. This scaler removes the median and scales the data based on the quantile range (by default, it uses the interquartile range, or IQR). The IQR is the range between the first quartile (25th percentile) and the third quartile (75th percentile). Standardization of a dataset is a common requirement for many machine learning algorithms. Normally, this is carried out by removing the mean and scaling to unit variance, but outliers can negatively impact the sample mean and variance. In such cases, the median and the interquartile range often provide better results. The list of texture attributes used in the literature is presented in Table 2, and the feature selection is shown in Table 3.
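The median and IQR scaling described above amounts to computing (x − median) / IQR for each attribute. The following sketch applies it to one attribute using only the standard library; the function name, the sample values, the quartile interpolation method, and the fallback for a zero IQR are assumptions made for the illustration, not a description of the library's internals.

```python
import statistics


def robust_scale(values):
    """Center one attribute on its median and divide by its interquartile range."""
    median = statistics.median(values)
    # Quartiles by linear interpolation between data points.
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    if iqr == 0:
        iqr = 1.0  # choice made in this sketch to avoid dividing by zero
    return [(v - median) / iqr for v in values]


# Hypothetical seed areas with one extreme value.
areas = [1, 2, 3, 4, 100]
scaled = robust_scale(areas)
```

The extreme value 100 does not affect the median (3) or the IQR (2), so the four ordinary values keep a sensible spread, whereas a mean and standard deviation computed on the same data would be dominated by the outlier.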
Table 2. List of texture attributes used in the literature.

The Confusion Matrix
We notice that the SGD classifier and the SVM are the models that manage to distinguish black pepper seeds from white pepper seeds. The SVM is the model that has the highest accuracy and produces the least confusion. In this study, 86% of the 271 white Penja pepper seeds and 87% of the 270 black Penja pepper seeds were predicted correctly. Figure 16 presents the confusion matrices for the four models used.
J. Imaging 2024, 10, x FOR PEER REVIEW


Learning Curve
By analyzing the learning curves, we can observe that the random forest model suffers from overfitting and fails to generalize well. The SGD model converged at around 3100 data points, after which it too experienced overfitting. The KNeighbors and SVM models, however, continue to converge and appear to learn effectively. Hence, the SVM model, which achieved the highest accuracy in classifying pepper seeds, can be considered for use. Figure 17 presents the learning curves of the four methods.

Discussion
The classification performance, as described in Table 2, shows that the highest performance achieved is 87%. This result can be attributed to the use of shape and size attributes for the classification. Pepper seeds, in general, are similar in appearance and have a similar shape and texture, making them almost indistinguishable to the eye [8]. We worked with 16-bit greyscale images (.pgm) obtained from the sensor's 14-bit RAW data (.RAF). Other similar works use color images, which have already undergone transformations during RAW conversion (demosaicing). Using RAW data gives more information than color images.
A similar study [7] on bell pepper and pimiento seeds produced better results, as the differences between these two species are already visible to the eye. They achieved a score of 89.2% using the SVM and 100% using the KNeighbors and tree classifiers.

Conclusions
In summary, this study aimed to classify pepper seeds using CFA images. The data used focused on Penja pepper, one of the most coveted in the world, which comes from the Littoral region of Cameroon, and the classification achieved an accuracy of 87%. The model was trained on 4268 images (80% of the data) and tested on 1068 images (20% of the data).
Several machine learning methods were employed, and the most successful was the SVM [13]. The precision obtained is higher than the 84.94% reported in [13] with the same linear SVM method, which can be explained by the fact that that work used chili seeds, which are almost flat in appearance.

The evaluation measures used are the following:
- Precision measures the proportion of correctly identified positive instances among all instances predicted as positive. It is calculated by dividing the number of true positives by the sum of true positives and false positives.
- Recall measures the proportion of correctly identified positive instances among all truly positive instances. It is calculated by dividing the number of true positives by the sum of true positives and false negatives.
- F-measure, also known as the F1 measure, is the harmonic mean of precision and recall. It provides a balanced measure between the two. It is calculated using the formula F1 = 2 × (precision × recall)/(precision + recall).
- Accuracy measures the proportion of correctly classified instances among all instances. It is calculated by dividing the total number of correct predictions by the total number of instances.
- The confusion matrix summarizes the performance of a model in terms of true positives, true negatives, false positives, and false negatives. It can be used to calculate other metrics such as precision, recall, and accuracy.
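These measures can all be derived from a confusion matrix, as the following minimal sketch shows. The function name and the two-class matrix are hypothetical and are not results from this study; rows are taken as true classes and columns as predicted classes.

```python
def per_class_metrics(cm):
    """cm[i][j] = number of samples of true class i predicted as class j.

    Returns the overall accuracy and a (precision, recall, F1) tuple per class.
    """
    n = len(cm)
    total = sum(map(sum, cm))
    accuracy = sum(cm[k][k] for k in range(n)) / total
    report = []
    for k in range(n):
        tp = cm[k][k]
        fp = sum(cm[i][k] for i in range(n)) - tp  # predicted k, true class differs
        fn = sum(cm[k]) - tp                       # true k, predicted otherwise
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        report.append((precision, recall, f1))
    return accuracy, report


# Hypothetical two-class confusion matrix.
cm = [
    [8, 2],
    [1, 9],
]
accuracy, report = per_class_metrics(cm)
```

For the first class, 8 of the 9 samples predicted as that class are correct (precision 8/9) and 8 of its 10 true samples are found (recall 0.8). Averaging the per-class values gives macro-averaged scores for a multi-class problem such as the four seed classes.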
Figure 2. (a) The image-taking box, (b) the light source, and (c) the camera used.

Figure 6. Seed detection using the mask.

Figure 7. Creation of the bounding boxes.

Figure 8. Images of extracted seeds.

Figure 12. (a) The distribution of values for the three main attributes on the PBA batches and (b) the cleaning of outliers.

Figure 13. (a) The distribution of values for the three main attributes on the PBP batches and (b) the cleaning of outliers.

Figure 14. (a) The distribution of values for the three main attributes on the PNA batches and (b) the cleaning of outliers.


Table 1. Origin of the seed samples.