Development of an Optimal Algorithm for Detecting Damaged and Diseased Potato Tubers Moving along a Conveyor Belt Using Computer Vision Systems

Abstract: The article discusses the problem of detecting diseased or mechanically damaged potatoes using machine learning methods. We proposed an algorithm and developed a system for the rapid detection of damaged tubers. The system can be installed on a conveyor belt in a vegetable store, and it consists of a laptop computer and an action camera synchronized with a flashlight system. The algorithm consists of two phases. The first phase uses the Viola-Jones algorithm, applied to the filtered action camera image, to detect separate potato tubers on the conveyor belt. The second phase applies a method chosen based on the video capturing conditions. To isolate potatoes infected with certain types of diseases (dry rot, for example), we use the Scale-Invariant Feature Transform (SIFT)-Support Vector Machine (SVM) method. In case of inconsistent or weak lighting, the histogram of oriented gradients (HOG)-Bag-of-Visual-Words (BOVW)-neural network (BPNN) method is used. Otherwise, Otsu's threshold binarization with a convolutional neural network (CNN) is used. The first phase's result depends on the conveyor's speed, the density of tubers on the conveyor, and the accuracy of the video system. With the optimal setting, the result reaches 97%. The second phase's outcome depends on the method and varies from 80% to 97%. When evaluating the performance of the system, it was found that it can detect and classify up to 100 tubers per second, which significantly exceeds the performance of most similar systems.


Introduction
Rapid population growth requires an increase in the efficiency and productivity of the agricultural sector. To increase productivity, farmers use pesticides widely, which negatively affect the human body. Recently, chemical pesticides have gradually been replaced by pesticides based on bacteria, fungi, and viruses [1]. However, this process is hampered both at the level of state legislation and at the level of individual farms [2]. In this situation, the automatic selection of vegetables, root crops, and fruits at the time of laying them down for storage and during the preparation of seed material using computer vision can reduce the volume of pesticides applied and storage costs.

•
Computer vision systems working with piece objects will not be able to realize the throughput required for a vegetable store;
•
It is necessary to develop a sorting mechanism that will allow removing damaged objects from a moving conveyor belt at the recommended speeds of its movement (Table 1), following their current coordinates established by the computer vision system.

Table 1. Recommended speeds of the conveyor for root crops [18].

Each disease has its characteristic features and reveals itself at different times of growth and storage of potatoes [19] (Figure 1). We chose the most harmful of them, late blight, as the subject of our research (Figure 1(1)). The main danger of the disease is the high rate of its development. The annual shortage of potato tubers due to late blight is about 10% of the gross harvest. In the case of late blight's significant and early damage to the potato tops, the yield shortfall reaches 70% or more. Storing potatoes with a large number of infected tubers often leads to the rotting of the entire batch.
On the affected tubers, lead-gray or brown (depending on the variety and color of the skin), slightly depressed hard spots are formed, extending inward in the form of uneven brown smudges ("tongues"). Late blight rot often turns into dry Fusarium rot (Figure 1(4)) by the middle or end of storage.
For a selection of tubers for planting, tubers damaged by rodents can be considered acceptable (Figure 2(4-6)). If potatoes are selected for sale, they must be discarded. The authors faced mainly three problems: • Tubers moving on the conveyor are captured by the camera several times, and each time they are perceived by the system as different. So, re-capturing the image slows down the system significantly. • While moving on a conveyor, tubers may overlap each other and thereby hinder precise shape recognition.

•
The problem of direct identification of tubers affected by disease or rodents.
When solving the identified problems, the authors studied articles on computer vision that solve similar tasks in various spheres of human activity to some extent. The first task is to process one object once. We believe that the problem is similar to the problem of vehicle detection in a complex driving environment [20][21][22][23].
S. Aqel et al. [20] split this task into several subtasks. First, moving vehicles are detected using the background subtraction method. The researchers used morphological operators to reduce false areas and remove moving shadows, and finally, they performed classification using invariant Charlier moments. F. Liu et al. [21] studied the problem of real-time updating the background. They suggested using an indicator to avoid excessive background refresh. A. Soin and M. Chahande [22] found solutions to vehicle detection and recognition using deep neural networks (DNNs).
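The background subtraction and background-refresh ideas referenced above can be sketched in a few lines. The following is a minimal illustrative NumPy version (not the implementation from [20] or [21]); the function names and the `alpha` refresh rate are our own assumptions:

```python
import numpy as np

def update_background(background, frame, alpha=0.05):
    """Running-average background model; alpha controls the refresh rate."""
    return (1 - alpha) * background + alpha * frame

def foreground_mask(background, frame, threshold=25):
    """Mark pixels that differ from the background by more than `threshold`."""
    return np.abs(frame.astype(float) - background) > threshold

# Static background with one bright moving "object".
background = np.full((8, 8), 50.0)
frame = background.copy()
frame[2:4, 2:4] = 200.0          # the moving object

mask = foreground_mask(background, frame)
```

On a conveyor, the belt plays the role of the nearly static background, so even this simple model isolates the moving tubers; the refresh rate controls how quickly belt stains are absorbed into the background.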
Y. Wei et al. noted that when vehicles move between rows, one of them may overlap the other. In their paper [23], they proposed using object tracking algorithms. This made it possible to significantly reduce the processing time of the video signal since the processed objects were not subjected to repeated analysis. The vehicle detection algorithm itself consisted of two stages: target extraction using the strong descriptive power of the HOG and segmentation of the region of interest (ROI region) based on Haar cascades.
The authors of [24] successfully addressed the problem of partial overlapping of objects. To accurately capture moving objects on weak mobile devices, they used the Boosted Haar Cascade method. It significantly outperforms deep neural networks in image processing speed and allows the detection of partially occluded objects.
We believe that: • The background subtraction method is applicable in conveyor conditions; • Getting rid of shadows is carried out by selecting the position of light sources; • Elimination of partial overlapping of objects is carried out by metering the supply of these objects to the conveyor and increasing the conveyor speed.
To recognize damaged tubers, we must consider a set of factors related to the conditions in which tubers are selected. Most often, convolutional neural networks (CNNs) are used to solve such problems, which have recently significantly improved their performance [25][26][27][28][29][30][31]. However, as the authors of [32] have shown, convolutional neural networks working with high-resolution images are not intended to be implemented on devices with weak processors. It is necessary to use large kernels (for example, 7 × 7 or 9 × 9) or a large number of layers to obtain an acceptable receptive field with convolutional layers [33]. Both of these schemes lead to a very significant slowdown of the system. Therefore, most low-performance systems are limited to image sizes of less than 41 × 41 pixels to achieve acceptable image processing time. Moreover, the processing time for each such frame can reach several seconds. In the conditions of a continuously moving conveyor, this is unacceptable.
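The kernel-size/depth trade-off above can be made concrete: for stride-1 convolutions, each layer grows the receptive field by (kernel − 1). A small illustrative helper (our own naming, not taken from the cited works):

```python
def receptive_field(layers):
    """Receptive field of a stack of conv layers given as (kernel, stride)."""
    rf, jump = 1, 1
    for kernel, stride in layers:
        rf += (kernel - 1) * jump   # each layer widens the field
        jump *= stride              # stride compounds across layers
    return rf

# Three stacked 3x3 convolutions see the same 7x7 window as one 7x7 kernel,
# which is why both routes to a large receptive field cost comparable compute.
three_small = receptive_field([(3, 1), (3, 1), (3, 1)])
one_large = receptive_field([(7, 1)])
```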
Under these conditions, algorithms with a favorable ratio of speed to resource and power consumption deserve special interest [34]. For this reason, several methods of equipment optimization have recently been proposed [35][36][37][38].
We believe that attention should be paid to methods that use descriptor structures. A descriptor is a method that identifies a certain area of an image based on a set of features. Having studied the literature of the past three years, we concluded that it is advisable to use the following descriptors in our work: HOG [38,39] and Haar features [40].
Saad Abouzahir et al. [38] used HOG to identify weeds in crop images. They showed that the method in its classical form does not work well with images of damaged and deformed leaves. The use of BOVW in combination with HOG has significantly improved the quality of the material supplied to the classifier. With the BPNN, the weed detection rate ranged from 93% to 98%, depending on the crop. This combined approach significantly exceeded the results of other descriptors considered in their work.
M. Yogeshwari et al. [44] used the Adaptive Otsu's thresholding algorithm to binarize images of crop leaves. This algorithm considerably simplified the architecture and accelerated the operation of the convolutional neural network used as a classifier.
L.P. Saxena [46] used Niblack's local binarization method and its modifications in real-time applications. These techniques work to distinguish objects from their background. The author noted that these methods are helpful in various applications, such as restoration of damaged documents or manuscripts, text search in video frames, license plate recognition, and reading product barcodes.
Artificial neural networks process images mainly to recognize the depicted objects. A training sample is collected and the neural network is trained; if the necessary objects are detected in the test image, they are replaced with the corresponding analogs. S. Kang et al. [48] used cascading modular U-networks (CMU-Nets) to binarize document images. In their work, the researchers solved the problem of an insufficient number of training samples. In the case of text documents, the method showed high accuracy and higher productivity than analogs [49]; however, despite this improvement, the productivity of the method does not allow it to be used in real-time applications.
The simplest morphological operations are used primarily for noise reduction. However, they can also be used to implement more complex image processing techniques. E. Imani et al. [47] managed to separate eyeball lesions and blood vessels with high precision. They used the Morphological Component Analysis (MCA) algorithm. This method allows separating structurally different objects, for each of which different transformations work effectively.
Der-Chang Tseng et al. [41] proposed to break down the image noise removal process into stages. The initially noisy image is analyzed using the enhanced MCA algorithm. After a series of transformations, the image is broken down into texture, structure, and edges. Noise removal is then performed for each individual part: the texture is denoised using the BM3D algorithm, the structure using the ANLM method with an adaptive search window, and the edges using the K-SVD method. The first two are filters. The third (K-SVD) is a generalization of the k-means clustering method. The main idea of K-SVD is to split data into small chunks and alternate sparse coding and dictionary updates.
Using filters can be extremely useful to improve the results and speed of the method. D. Sharifrazi et al. [42] showed that using a Sobel filter improves the performance of a convolutional neural network for detecting COVID-19 using X-ray images. They ranked this combination of methods as the best of the wide variety of options.
G. Ravivarma et al. [43] compared Sobel, Canny, Prewitt, Roberts, and fuzzy logic methods for edge detection. They noted that the Sobel filter has the smallest temporal and spatial complexity in comparison with the others. Testing was carried out on pest-affected leaves of crops.
We believe that we should use real-time algorithms for image preprocessing. Threshold binarization algorithms (Adaptive Otsu's, Niblack's) and the simplest morphological operations for noise suppression (erosion, closure, opening, etc.) are suitable. And we can improve the result with a filter (one of the most promising is the Sobel filter). It is not advisable to use artificial neural networks at this stage.
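The morphological noise-suppression operations recommended above (erosion, dilation, and their composition, opening) can be sketched directly in NumPy. This is an illustrative implementation for binary images, not the code used in the system:

```python
import numpy as np

def erode(img, k=3):
    """Binary erosion with a k x k square structuring element."""
    pad = k // 2
    padded = np.pad(img, pad, constant_values=0)
    out = np.ones_like(img)
    for dy in range(k):
        for dx in range(k):
            out &= padded[dy:dy + img.shape[0], dx:dx + img.shape[1]]
    return out

def dilate(img, k=3):
    """Binary dilation with a k x k square structuring element."""
    pad = k // 2
    padded = np.pad(img, pad, constant_values=0)
    out = np.zeros_like(img)
    for dy in range(k):
        for dx in range(k):
            out |= padded[dy:dy + img.shape[0], dx:dx + img.shape[1]]
    return out

def opening(img, k=3):
    """Opening = erosion followed by dilation; removes speckle noise."""
    return dilate(erode(img, k), k)

# Example: a 3x3 blob survives opening, an isolated noise pixel does not.
img = np.zeros((7, 7), dtype=np.uint8)
img[1, 1] = 1                 # speckle noise
img[3:6, 3:6] = 1             # object
opened = opening(img)
```

Opening removes isolated speckle pixels while preserving larger connected regions, which is exactly the behavior wanted after threshold binarization and before contour analysis.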
U. Ahmad et al. considered in their work [50] the decrease in the accuracy of image processing depending on the speed of the conveyor belt. The speed varied from 0.08 to 0.3 m/s. The researchers found a significant increase in the error (on average, two times) at the lower and upper limits of the speed in determining the color index, saturation, and intensity of images, which significantly influenced the method's result for the classification of mango fruits. In an actual vegetable store, this conveyor speed is unacceptable. When analyzing the design of the fruit image-recording camera [50] (Figure 3), we concluded that we could significantly improve its characteristics by using pulsed light with a frequency that coincides with the frequency of taking pictures.
For an apparent fixation of defects in vegetables, fruits, and root crops, in some studies, the researchers used video cameras operating in the infrared range [51][52][53][54][55]. P.V. Balabanov et al. [51] studied the emission spectra of healthy and diseased fruits. They used a Vis-NIR (Visible-Near Infrared) hyperspectral camera in the 400-1000 nm range. Studies have shown that all significant differences in the radiation of healthy and diseased fruits are in the visible part of the spectrum. A. Ibrahim et al. [54] examined the possibility of identifying internal damage to potatoes resulting from impacts during harvest (blackspot) by examining the absorption of a wavelength of 730 nm (near-infrared). They found that the tuber's damaged and undamaged inner parts have similar characteristics for most of the samples.
They concluded that infrared radiation, located at the border with the visible part of the spectrum, is not suitable for solving such problems.
A. López-Maestresalas et al. [52] studied subsurface damage to potatoes (blackspot) using hyperspectral systems like Vis-NIR and SWIR (shortwave infrared) in the 1000-2500 nm range. They found that on the three studied potato varieties, the SWIR system allows determining the presence of blackspot with an accuracy higher than 93%, five hours after harvesting; Vis-NIR also detects subsurface damage, but with less accuracy. It was noted that at this stage of research, hyperspectral systems of this kind are not suitable for operation in industry and can only be used in laboratory conditions. P.V. Balabanov et al. [56,57] suggested using thermal methods of nondestructive and noncontact control with technical vision systems in the infrared spectral range of 8-14 microns when sorting agricultural products. The method is based on the fact that damaged, diseased, and healthy plant tissues have different thermophysical characteristics. During contactless measurements, the surface of the object under study was heated using a laser [56] and an IR radiation source [57]. A FLIR A35 thermal imager was used to obtain information.
After analyzing the results of [51][52][53][54][55][56][57], we came to the following conclusions:
•
It is enough to use video cameras operating in the visible part of the spectrum to identify diseased tubers;
•
Some laboratory methods are available to detect subsurface damage, but these methods are not suitable for vegetable stores.

Grayscale transition and threshold binarization
For binarization, we considered two algorithms: Otsu's method and Niblack's method.
Otsu's method is a threshold binarization algorithm. The choice of the method is due to the following properties:
•
adaptation to various kinds of images by choosing the optimal threshold;
•
fast execution time.
With this method, a threshold t is calculated to minimize the average segmentation error, i.e., the average error of deciding whether the image pixels belong to an object or a background. The image pixels' brightness values can be considered random, and their histogram can be taken as an estimate of the probability distribution density. If the probability distribution densities are known, then it is possible to determine the optimal (in the sense of the minimum error) threshold for image segmentation into two classes c0 and c1 (objects and background).
The histogram is plotted according to the values p_i = n_i/N. In this formula, N is the total number of image pixels, n_i is the number of pixels with a brightness level i (0 ≤ i ≤ L). The threshold t is an integer value from 0 to L = max. With the help of the histogram, we can divide all the pixels into "useful" (object) and background ones. Relative frequencies W_0 and W_1 correspond to each type:

W_0(t) = \sum_{i=0}^{t} p_i, \quad W_1(t) = \sum_{i=t+1}^{L} p_i = 1 - W_0(t)

Next, we calculate the average levels for each type of image using the formulas:

\mu_0(t) = \frac{1}{W_0(t)} \sum_{i=0}^{t} i \, p_i, \quad \mu_1(t) = \frac{1}{W_1(t)} \sum_{i=t+1}^{L} i \, p_i

Next, we find a threshold that reduces the variance of pixels of a particular type, determined by the following formula:

\sigma_w^2(t) = W_0(t) \, \sigma_0^2(t) + W_1(t) \, \sigma_1^2(t)

The next step is to determine the interclass variance using the formula below:

\sigma_b^2(t) = W_0(t) \, W_1(t) \, (\mu_0(t) - \mu_1(t))^2

Then the maximum value is calculated to assess the quality of dividing the image into two parts, which corresponds to the desired threshold:

t^* = \arg\max_{0 \le t \le L} \sigma_b^2(t)

Figure 3(3) and Figure 4(3) show a fragment of the image binarization by Otsu's method.
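The procedure above can be written compactly in NumPy: the cumulative sums give W_0, W_1, and the class means for every candidate threshold at once, and the argmax of the interclass variance gives t. This is an illustrative sketch using the same notation, not the system's production code:

```python
import numpy as np

def otsu_threshold(image, levels=256):
    """Return the threshold t that maximizes the between-class variance."""
    hist = np.bincount(image.ravel(), minlength=levels).astype(float)
    p = hist / hist.sum()                      # p_i = n_i / N
    i = np.arange(levels)
    w0 = np.cumsum(p)                          # W_0(t)
    w1 = 1.0 - w0                              # W_1(t)
    m0 = np.cumsum(p * i)                      # cumulative first moment
    mu_total = m0[-1]
    with np.errstate(divide='ignore', invalid='ignore'):
        mu0 = m0 / w0                          # class-0 mean
        mu1 = (mu_total - m0) / w1             # class-1 mean
        sigma_b = w0 * w1 * (mu0 - mu1) ** 2   # interclass variance
    sigma_b = np.nan_to_num(sigma_b)           # empty classes contribute 0
    return int(np.argmax(sigma_b))

# Bimodal image: dark background (20) and a bright object (200).
img = np.full((10, 10), 20, dtype=np.uint8)
img[3:7, 3:7] = 200
t = otsu_threshold(img)
```

Any threshold between the two modes separates this image; the vectorized scan over all 256 candidates is what makes the method fast enough for real-time use.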
In addition, we considered the possibility of using adaptive binarization (Niblack's algorithm), where the global binarization threshold for the entire image is not sought, but local information is used.
The idea behind this method is to vary the brightness threshold B of binarization from dot to dot based on the local value of the standard deviation. The brightness threshold at the (x, y) dot is calculated as follows:

B(x, y) = \mu(x, y) + k \cdot s(x, y)

where µ(x, y) is the mean and s(x, y) is the standard deviation of the sample for some neighborhood of the dot. The size of the neighborhood should be as small as possible but sufficient to preserve local image details. At the same time, the size should be large enough to reduce the effect of noise on the result. The value of k determines which part of the object's border to take as the object itself. A value of k = −0.2 specifies a fairly good separation of objects represented in black, and k = +0.2 when objects are in white. Figure 3(4) and Figure 4(4) show a fragment of the image binarization by Niblack's method. The authors applied the considered two binarization methods to both usual and inverted images of potatoes (Figures 3 and 4). Figure 3 shows that tubers in this representation are difficult to distinguish as separate objects, but if we knew in advance the location of each tuber in the image, damage to the tubers themselves is very noticeable. Moreover, it is easy to calculate the damaged area.
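A direct (unoptimized) sketch of Niblack's rule B(x, y) = µ(x, y) + k·s(x, y) follows; real implementations compute the local mean and deviation with integral images, but this illustrative version shows the per-pixel logic:

```python
import numpy as np

def niblack_threshold(image, window=15, k=-0.2):
    """Local threshold map B(x, y) = mu(x, y) + k * s(x, y) (Niblack)."""
    img = image.astype(float)
    pad = window // 2
    padded = np.pad(img, pad, mode='reflect')
    h, w = img.shape
    B = np.empty_like(img)
    for y in range(h):
        for x in range(w):
            patch = padded[y:y + window, x:x + window]
            B[y, x] = patch.mean() + k * patch.std()
    return B

# Dark object on a bright background: pixels below the local threshold
# are marked as object.
img = np.full((20, 20), 200.0)
img[8:12, 8:12] = 30.0
mask = img < niblack_threshold(img, window=15, k=-0.2)
```

Note how uniform regions (zero local deviation) produce no spurious detections, which is the main advantage over a single global threshold under uneven lighting.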
Based on this, the authors proposed determining the tubers' location in the image, not in the usual color but the inverted one with subsequent binarization.
Thus, by comparing the two representations of tubers in Figures 3 and 4, in one case, it is quite easy to identify the locations of tubers, and in the other case, to identify tubers with damage.

Using filters to update boundaries
The Sobel operator is used to define the boundaries of objects. This operator is based on the convolution of the image with integer filters. The operator uses the G_x and G_y kernels, with which the image is convolved to calculate the horizontal and vertical derivatives:

G_x = \begin{pmatrix} -1 & 0 & +1 \\ -2 & 0 & +2 \\ -1 & 0 & +1 \end{pmatrix}, \quad G_y = \begin{pmatrix} -1 & -2 & -1 \\ 0 & 0 & 0 \\ +1 & +2 & +1 \end{pmatrix} (9)

This operator is used to approximate the gradient of the pixel intensity function. To detect the presence of gradient discontinuity, we can calculate the gradient change at the dot (i, j). This can be done by finding the following value:

G(i, j) = \sqrt{G_x(i, j)^2 + G_y(i, j)^2}

The following expression determines the direction of the gradient Q:

Q(i, j) = \arctan\left( G_y(i, j) / G_x(i, j) \right)

Figure 5 shows the result of applying the Sobel filter to the image shown in Figure 2. When implementing the method, the kernels (9) were used.
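The two convolutions and the magnitude/direction computation can be sketched as follows (an illustrative NumPy version of the operator, not the system's code; `convolve2d` is our own helper):

```python
import numpy as np

def convolve2d(img, kernel):
    """'Same'-size 2-D convolution with zero padding."""
    kh, kw = kernel.shape
    ph, pw = kh // 2, kw // 2
    padded = np.pad(img, ((ph, ph), (pw, pw)))
    out = np.zeros_like(img, dtype=float)
    for y in range(kh):
        for x in range(kw):
            # kernel is flipped, as convolution requires
            out += kernel[kh - 1 - y, kw - 1 - x] * \
                   padded[y:y + img.shape[0], x:x + img.shape[1]]
    return out

SOBEL_X = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=float)
SOBEL_Y = SOBEL_X.T

def sobel_edges(img):
    gx = convolve2d(img.astype(float), SOBEL_X)
    gy = convolve2d(img.astype(float), SOBEL_Y)
    magnitude = np.sqrt(gx**2 + gy**2)   # G(i, j)
    direction = np.arctan2(gy, gx)       # full-range form of Q(i, j)
    return magnitude, direction

# A vertical step edge: the gradient is strong at the boundary, zero inside.
img = np.zeros((8, 8))
img[:, 4:] = 100.0
mag, _ = sobel_edges(img)
```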
Figure 5. The result of processing the image shown in Figure 2 using the Sobel filter: (1-3) dry rotted tubers; (4-6) tubers affected by rodents; (7-9) healthy tubers.

The use of the descriptor
The HOG method assumes that the type of distribution of image intensity gradients makes it possible to accurately determine the presence and shape of objects present on it.
The image is split into cells. The histograms h_i of the directional gradients of the interior dots are calculated in the cells. They are combined into one histogram (h = f(h_1, ..., h_k)), after which it is normalized to brightness. We can obtain the normalization factor in several ways, but they show approximately the same results. We will use the following equation:

h \leftarrow \frac{h}{\sqrt{\|h\|_2^2 + \varepsilon^2}}

where \|h\|_2 is the used norm and ε is some small constant. When calculating the gradients, the image is convolved with the kernels [−1, 0, 1] and [−1, 0, 1]^T, resulting in two matrices of derivatives D_x and D_y along the x and y axes, respectively. These matrices are used to calculate the angles and values (moduli) of the gradients at each dot in the image. Figure 6 shows the result of applying the HOG method to the image shown in Figure 2. Only the gradient value is shown for clarity (the brighter the pixel, the larger the gradient).
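The per-cell histogram construction can be sketched as follows (an illustrative NumPy version with unsigned gradients; cell size, bin count, and function names are our assumptions, not the paper's exact configuration):

```python
import numpy as np

def hog_cell_histograms(img, cell=8, bins=9):
    """Per-cell histograms of gradient orientations, weighted by magnitude."""
    img = img.astype(float)
    # Derivatives from the [-1, 0, 1] kernels (D_x and D_y in the text).
    dx = np.zeros_like(img); dx[:, 1:-1] = img[:, 2:] - img[:, :-2]
    dy = np.zeros_like(img); dy[1:-1, :] = img[2:, :] - img[:-2, :]
    mag = np.hypot(dx, dy)
    ang = np.rad2deg(np.arctan2(dy, dx)) % 180           # unsigned gradients
    bin_idx = (ang // (180 / bins)).astype(int) % bins
    h, w = img.shape
    hists = np.zeros((h // cell, w // cell, bins))
    for i in range(h // cell):
        for j in range(w // cell):
            sl = np.s_[i * cell:(i + 1) * cell, j * cell:(j + 1) * cell]
            hists[i, j] = np.bincount(bin_idx[sl].ravel(),
                                      weights=mag[sl].ravel(), minlength=bins)
    return hists

def normalize(h, eps=1e-6):
    """L2 normalization from the text: h / sqrt(||h||_2^2 + eps^2)."""
    return h / np.sqrt(np.sum(h**2) + eps**2)

# A vertical edge puts all the weight into the horizontal-gradient bin.
img = np.zeros((16, 16))
img[:, 8:] = 100.0
hists = hog_cell_histograms(img, cell=8, bins=9)
```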
The SIFT descriptor is used to extract feature points from the image, which are later used in classifiers. The key point in finding them is building a pyramid of Gaussians and the difference of Gaussians. A Gaussian is an image blurred with a Gaussian filter:

L(x, y, σ) = G(x, y, σ) * I(x, y)

where L(x, y, σ) is the value of the Gaussian at the dot with coordinates (x, y) and blur radius σ; G(x, y, σ) is the Gaussian kernel; I(x, y) is the value of the original image; * is the convolution operation. The difference of Gaussians is an image obtained with pixel-by-pixel subtraction of the Gaussian of the original image from a Gaussian with a different blur radius (kσ):

D(x, y, σ) = L(x, y, kσ) − L(x, y, σ)

A pyramid of Gaussians and Gaussian differences is built. When moving from one level of the pyramid to another, the dimensions of the images are halved.
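The blur-and-subtract step can be sketched with a separable Gaussian filter (an illustrative version; the truncation radius and k = 1.6 are common conventions, not values stated in the text):

```python
import numpy as np

def gaussian_blur(img, sigma):
    """Separable Gaussian blur: an approximation of L(x, y, sigma) = G * I."""
    radius = max(1, int(3 * sigma))
    x = np.arange(-radius, radius + 1)
    g = np.exp(-x**2 / (2 * sigma**2))
    g /= g.sum()
    # Blur rows, then columns (separability of the Gaussian kernel).
    rows = np.apply_along_axis(lambda r: np.convolve(r, g, mode='same'), 1,
                               img.astype(float))
    return np.apply_along_axis(lambda c: np.convolve(c, g, mode='same'), 0,
                               rows)

def difference_of_gaussians(img, sigma, k=1.6):
    """D(x, y, sigma) = L(x, y, k*sigma) - L(x, y, sigma)."""
    return gaussian_blur(img, k * sigma) - gaussian_blur(img, sigma)

# A flat image gives ~zero response; an isolated bright dot gives a strong
# (negative) response at its center, since the wider blur flattens the peak.
flat = difference_of_gaussians(np.full((21, 21), 5.0), 1.0)
spot = np.zeros((21, 21)); spot[10, 10] = 100.0
dog = difference_of_gaussians(spot, 1.0)
```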
After building the pyramids, key points are determined, which are the local extrema of the differences between the Gaussians. False key points are discarded, and for the remaining ones, their orientation is calculated. We determine the gradient's value m and direction θ from the formulas:

m(x, y) = \sqrt{(L(x + 1, y) − L(x − 1, y))^2 + (L(x, y + 1) − L(x, y − 1))^2} (15)

θ(x, y) = \tan^{−1} \frac{L(x, y + 1) − L(x, y − 1)}{L(x + 1, y) − L(x − 1, y)} (16)

The SIFT method operates the descriptor as a vector. The method takes a 4 × 4 square area centered at the particular dot and rotates it according to the singular point's direction. Each element of the area indicates the value of the gradient in eight directions.
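Equations (15)-(16) translate directly into code; the sketch below uses `arctan2` as the full-range form of the inverse tangent (an illustrative helper, not the paper's implementation):

```python
import numpy as np

def keypoint_orientation(L, x, y):
    """Gradient value m and direction theta at dot (x, y), per Eqs. (15)-(16).

    L is a blurred image indexed as L[row, col], i.e., L[y, x].
    """
    dx = L[y, x + 1] - L[y, x - 1]
    dy = L[y + 1, x] - L[y - 1, x]
    m = float(np.sqrt(dx**2 + dy**2))
    theta = float(np.arctan2(dy, dx))   # quadrant-aware tan^-1(dy / dx)
    return m, theta

# On a horizontal intensity ramp the gradient points along +x.
ramp = np.tile(np.arange(5.0), (5, 1))
m, theta = keypoint_orientation(ramp, 2, 2)
```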

Viola-Jones method
This algorithm uses Haar features [40] to classify objects in the image. These features are similar to convolution kernels and are rectangular regions composed of several adjacent parts (Figure 7). The Viola-Jones method uses the AdaBoost algorithm to construct a cascading classifier. When forming each new level, the AdaBoost algorithm selects the most informative features. The formation of the classifier ends when a predetermined target quality of the classifier is reached.
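Haar features are cheap to evaluate because any rectangle sum costs four lookups in an integral image. The building block (not the authors' cascade; the feature layout here is just the classic two-rectangle example) can be sketched as:

```python
import numpy as np

def integral_image(img):
    """Summed-area table with a leading zero row/column."""
    ii = img.astype(float).cumsum(axis=0).cumsum(axis=1)
    return np.pad(ii, ((1, 0), (1, 0)))

def rect_sum(ii, y, x, h, w):
    """Sum of img[y:y+h, x:x+w] in four lookups."""
    return ii[y + h, x + w] - ii[y, x + w] - ii[y + h, x] + ii[y, x]

def haar_two_rect_vertical(ii, y, x, h, w):
    """Left-minus-right two-rectangle Haar feature."""
    half = w // 2
    return rect_sum(ii, y, x, h, half) - rect_sum(ii, y, x + half, h, half)

# A dark/bright vertical split produces a strong (here negative) response.
img = np.zeros((6, 6))
img[:, 3:] = 10.0
ii = integral_image(img)
response = haar_two_rect_vertical(ii, 0, 0, 6, 6)
```

AdaBoost then treats each such feature (over all positions and scales) as a weak classifier and keeps only the most informative ones at every cascade level.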
This algorithm uses Haar features [40] to classify objects in the image. These features are similar to convolution kernels and are rectangular regions composed of several adjacent parts (Figure 7). The Viola-Jones method uses the AdaBoost algorithm to construct a cascading classifier. When forming each new level, the AdaBoost algorithm selects the most informative features. The formation of the classifier ends when a predetermined target quality of the classifier is reached.   For the correct operation of the method, the article's authors investigated several options for image preprocessing for subsequent training and a combination of trained classifiers.

Results
For video shooting, we took a 4k action camera with a frequency of 25 fps. For reliable fixation of the image, we used pulsed light of the same frequency. A rectangular pulse The SIFT method operates the descriptor as a vector. The method takes a 4 × 4 square area centered at the particular dot and rotates it according to the singular point's direction. Each element of the area indicates the value of the gradient in eight directions.

Viola-Jones method
This algorithm uses Haar features [40] to classify objects in the image. These features are similar to convolution kernels and are rectangular regions composed of several adjacent parts (Figure 7). The Viola-Jones method uses the AdaBoost algorithm to construct a cascading classifier. When forming each new level, the AdaBoost algorithm selects the most informative features. The formation of the classifier ends when a predetermined target quality of the classifier is reached.   For the correct operation of the method, the article's authors investigated several options for image preprocessing for subsequent training and a combination of trained classifiers.

Results
For video shooting, we took a 4k action camera with a frequency of 25 fps. For reliable fixation of the image, we used pulsed light of the same frequency. A rectangular pulse For the correct operation of the method, the article's authors investigated several options for image preprocessing for subsequent training and a combination of trained classifiers.

Results
For video shooting, we used a 4K action camera at 25 fps. For reliable image capture, we used pulsed light of the same frequency; a rectangular-pulse generator controlled the frequency and duty cycle. Several white-LED strips were fixed on a rectangular frame of 800 × 500 mm, with a total luminous flux of 4500 lm in regular operation. At this luminous flux, the duty cycle was selected manually and varied from 15% to 25% to obtain a clear image. Pulsed lighting made it possible to use a conventional action camera instead of an expensive industrial camera [51]. The number of frames per second was adjusted so that the camera would shoot each tuber twice; this works out to four frames per second with an aspect ratio of 16:9, a conveyor width of 800 mm, and a conveyor speed of 1 m/s. As shown above, due to the large number of computational operations required by convolutional networks, it is not feasible to use them for fast object recognition in high-resolution images.
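The frame-rate figure can be checked with simple arithmetic: at 16:9 with the long side spanning the 800 mm belt width, the field of view along the belt is 800 × 9/16 = 450 mm, and the belt may advance by at most half of that between frames for every tuber to be photographed twice. A small sketch of this derivation (the assumption that the long side spans the belt is ours):

```python
def min_fps(belt_speed_mm_s, frame_width_mm, aspect_w=16, aspect_h=9):
    """Frames per second needed so every tuber appears in >= 2 frames.

    Assumes the camera's long side spans the belt width, so the field
    of view along the belt is frame_width_mm * aspect_h / aspect_w.
    """
    along_belt_mm = frame_width_mm * aspect_h / aspect_w  # 450 mm here
    # Between frames the belt may advance at most half the field of view.
    return belt_speed_mm_s / (along_belt_mm / 2)

fps = min_fps(belt_speed_mm_s=1000, frame_width_mm=800)
# about 4.4 fps, consistent with the 4-5 frames per second quoted above
```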
We divided the diagnostic procedure into separate phases, allowing us to speed up identifying individual tubers in the video stream and direct analysis.
First phase: identifying individual tubers in the image. Because the system must process 4-5 images with a resolution of 3840 × 2160 pixels per second, the Viola-Jones method [58][59][60][61][62][63] is the most suitable for detecting tubers. But even given the high performance of this algorithm, processing images of this size would take unacceptably long. Therefore, it is necessary to reduce the resolution by an order of magnitude along each coordinate axis, to 384 × 216 pixels. A medium tuber of 60 × 50 mm then occupies about 30 × 25 pixels, so we choose a scanning window of 30 × 25 pixels.
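The 30 × 25 pixel window follows directly from the downscaling, assuming the 384-pixel image width spans the full 800 mm belt width (our reading of the setup):

```python
def tuber_size_px(tuber_mm, belt_width_mm=800, image_width_px=384):
    """On-screen size of a tuber after downscaling, assuming the
    384-pixel image width spans the full 800 mm belt width."""
    return tuber_mm * image_width_px / belt_width_mm

w = tuber_size_px(60)  # 28.8 px, rounded up to the 30 px window width
h = tuber_size_px(50)  # 24.0 px, close to the 25 px window height
```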
We trained the Haar cascade and implemented the Viola-Jones recognition algorithm in Python using the standard OpenCV traincascade application. As training data we used 850 images of tubers (positive images) and 1000 images of the working area of the conveyor without control objects (negative images).
However, in practical use the classifier showed relatively poor results (Figure 8): up to half of the tubers went undetected, and the method also classified some non-tuber objects as tubers.
To improve the classification results, we applied image preprocessing using the Sobel filter. As a result, the probability of detecting tubers lying in a single layer reached 97%.
When tubers are recognized, they are marked, and their images are recorded and sent for further processing. Since each tuber appears in at least two frames, further work is performed on only one of its images (the one not cropped by the edge of the video frame). This optimization significantly speeds up the algorithm.
The result of phase 1 is thus a set of selected tuber images for subsequent diagnostics.
Second phase, Option 1: SIFT-SVM. The choice of method depends on the task at hand. Using the SIFT descriptor followed by classification with the SVM method identifies damage localized in small areas with over 95% accuracy (Figure 9(1)), but performs much worse on damage spread over large areas: 52% (Figure 9(2)).

Option 2: HOG-CNN and HOG-BOVW-BPNN
HOG is another promising candidate for the role of a reliable descriptor. According to our experiments, HOG with a CNN identifies damaged tubers with up to 75% accuracy, and diseased tubers with 67%. It should also be noted that the classic HOG method is relatively slow. The proposed BOVW-based HOG method works considerably faster and more accurately.
For more accurate results, the image is converted to grayscale to minimize noise and brightness effects, and its histogram is equalized. Training images are taken at a minimum size of 384 × 384 pixels and can be enlarged horizontally and vertically in multiples of 64. The image is divided into small grids (128 × 128 with 50% overlap) and standard HOG is applied. The result is passed to an additional processing stage whose aim is to significantly reduce further calculations: a potato tuber has no predominant direction and is practically symmetric with respect to rotation, so a simple transformation of the matrices can reduce the number of possible options several times. This reduces the number of effective clusters and the number of letters in the visual dictionary (Figure 10).
For binary classification (damaged or undamaged tuber), we trained a BPNN.

Option 3: Otsu's Threshold Binarization-CNN
For binary classification, we used another method. To achieve greater accuracy, in addition to the usual images of the potato tubers, we also used their inverted copies. Otsu's threshold binarization was applied to both the images and their inverted copies, and the results were passed to two classifiers, for which we used convolutional networks with 3 × 3 convolutional kernels (Figure 11). Reducing the dimension of the convolutional kernels made it possible to significantly speed up the classifiers, and the results of the two classifiers complement each other. Table 2 shows the implementation methods and the results of the second phase of the algorithm. It should be noted that illumination and the overlapping of tubers had a strong influence on the results. Moving the lamp a considerable distance from the video camera, or a mismatch between the flash frequency and the camera's fps, significantly degrades the algorithms' results; HOG-based algorithms turned out to be less sensitive to such mismatches. It should also be noted that the HOG-BOVW algorithm is significantly faster than the CNN.

Comparison and selection of options for implementing the second phase
Tuber overlap can be partially eliminated by increasing the conveyor speed, which significantly increases recognition accuracy. However, the recommended conveyor speeds for transporting root crops should be taken into account (Table 1). In addition, an increase in speed degrades the sharpness of the resulting images.
To speed up the operation of convolutional networks, we propose reducing the dimension of the convolutional kernels to 3 × 3 while applying the networks not to the original images (Figure 2) but to images differentiated with the Sobel filter (Figure 5).

Discussion
At the current stage of development of computer vision technologies, object classification is no longer a problem: convolutional neural networks, decision trees, and similar methods perform it far more accurately than a human. However, we drew attention to several limitations of these methods when they are applied in the real conditions of a vegetable store. For example, according to the conclusions of recent researchers, convolutional networks, although the most promising approach to computer vision problems, cannot process a video stream of numerous small, rapidly moving objects [32]. An algorithm running on inexpensive computer hardware simply cannot keep up with the conveyor belt, and stale data, even if processed correctly, is of little use once the damaged objects have already left the conveyor.
At the first stage of processing the camera image, we proposed using the Viola-Jones algorithm, which, unlike convolutional neural networks, works in real time [58][59][60][61][62][63]. This method was created for recognizing human faces and did not give good results when applied directly to detecting potato tubers; however, by selecting preprocessing filters we achieved a detection probability of 97%, which matches the results of convolutional neural networks (91-95% in works on convolutional networks over the last three years) [25][26][27][28][29][30][31].
At the second stage, we work with an image in which the sizes and coordinates of the tubers have already been determined. The classification task reduces to deciding whether a tuber is suitable for further use or should be rejected as damaged or diseased. Dividing the image into small fragments greatly simplifies the task: when processing small images, a convolutional neural network running on a personal computer can handle up to a hundred images per second. This corresponds to a complete classification of tubers on a sparsely loaded conveyor moving at low speed. The rate of correct image classification by the convolutional network reached 97%. The combination of HOG-BOVW-BPNN methods processes on average 50% more tubers, but its classification accuracy is about 2% lower. These results significantly exceed the processing speed of computer vision systems installed on conveyors [9][10][11][12][13][14][15][16][17]. We plan to continue improving the second stage of image processing to increase the number of tubers classified per second to several hundred, which will allow the computer vision system to be installed on conveyors with almost any load moving at the maximum permissible speed. For this, we propose to create parallel computation systems and split the objects selected at the first stage into several concurrently processed threads.
The methods used allow us to find the number of tubers and the percentage of damaged ones. Taking into account the area of each tuber in the image, we can determine the mass of potatoes that passed along the conveyor with an error of up to 10%. This is consistent with the article by A. Kalantar et al., who determined the weight of agricultural produce from its image [64].
We believe that the proposed algorithm can be used in a picking robot installed on the conveyor belt of a vegetable store [65,66].

Conclusions
For reliable video recording of objects moving on a conveyor, it is necessary to use pulsed light with a pulse frequency equal to the fps of the video camera. Selecting the duty cycle and luminous flux can significantly improve image quality even with a cheap action camera.
A video camera operating in the visible range is sufficient for detecting external signs of potato tuber disease. Existing work on identifying internal damage involves expensive infrared cameras; however, the methods proposed in those articles cannot be used in a vegetable store.
Most computer vision methods are unsuitable for separating individual potato tubers in a conveyor image. The Viola-Jones method turned out to be the most acceptable for this task: it works much faster than similar object detection methods, and, provided the image is preprocessed, the quality of tuber selection is comparable to theirs.
Once objects have been selected and need to be classified, various methods can be used, each with its advantages and disadvantages. The choice may depend on the conditions of use and the objects being classified: for example, SIFT-SVM is well suited to detecting dry rot, while HOG-BOVW-BPNN is better for uneven or low lighting.

Data Availability Statement:
The data presented in this study are available on request from the corresponding author.