A Study on Enhancement of Fish Recognition Using Cumulative Mean of YOLO Network in Underwater Video Images

In the underwater environment, classifying objects and estimating their number is essential for preserving rare and endangered species and for eliminating exotic invasive species that can destroy ecosystems, but both tasks are very difficult. While YOLO shows excellent performance in object recognition, it recognizes objects by processing the image of each frame independently. We propose a method that accurately classifies objects and counts their number in sequential video images by accumulating the object classification results from the past frames to the current frame. The proposed method achieves high classification rates of 93.94% and 97.06% in the test videos of Bluegill and Largemouth bass, respectively, and shows very good classification performance in video images taken in the underwater environment.


Introduction
Techniques for classifying and estimating populations in aquatic ecosystems are important and essential for conserving rare and endangered populations and eliminating exotic species that destroy ecosystems. In general, for small individuals, the number is estimated either by direct counting or by using the cross-line method [1,2] or the mark collection method [3,4]. In the case of large numbers of them, we generally use a camera, and must make efforts to count individuals directly from camera images or video images [5,6]. It is very difficult to classify populations and estimate the number of individuals in any method.
Convolutional Neural Network (CNN) is widely used for object recognition and classification, and shows very good results. Many methods have been proposed based on the principles of CNN, and their performance has been demonstrated in various fields [7][8][9][10]. However, some studies in the field have been conducted using CNN [11][12][13][14]. The issue of image classification began in AlexNet [8] and further research has been carried out in GoogLeNet and VGGNet [15,16]. ResNet, which appeared in 2015, outperformed human judgment [17]. Based on these studies, research has focused not only on the image classification problem, but also on the image detection problem that classifies various objects of the image into specific classes, and predicts the location of the specific objects [10]. R-CNN [18,19], which shows good performance in image detection problems, creates potential bounding boxes on an image, and then runs a classifier on the proposed bounding boxes. After the classification, post-processing is used to refine the bounding boxes, eliminate duplicate detection, and calculate classification scores based on other objects [20]. In contrast, You Only Look Once (YOLO) [20][21][22] is the fastest system to detect and classify various objects. YOLO is a simple structure with a single convolutional network that simultaneously predicts bounding boxes and classification probabilities. The much faster operating time allows real-time processing, and plays a role in filtering the background image by reasoning globally. Furthermore, a general CNN cannot classify multiple objects in one image, but the YOLO network can classify multiple objects using a bounding box. This is useful for recognizing different objects in a single image and counting the number of those that are recognized. In particular, it can be used very effectively for classifying populations and estimating the number of individuals in video images.
However, despite the advantages of YOLO, it is difficult to obtain accurate results in every frame for low-illumination or unfocused images, such as video images taken in an underwater environment [11,23]. To solve this problem, a data collection method has been proposed and has become a very useful alternative [11]. This method can improve the classification performance of images, but it is difficult to use it to classify multiple objects in a single image or to count objects in real time. The human visual system can classify objects by constantly looking at them. In contrast, YOLO does not use sequential images; it processes the image of each frame individually. YOLO can process video images in real time, but it determines the locations and classes of objects independently, using only one frame at a time. This means that the classification results of the previous frame do not affect those of the current frame.
Therefore, we propose a method that accurately classifies objects and counts their number in a video image by accumulating the classification results from the past frames to the current frame. Depending on the underwater environment, the classification performance of some frames may be poor, but this disadvantage is compensated for by applying the cumulative average. This is a heuristic method that mimics human experience and learning.
In this study, we use YOLOv2 [21] for object recognition in video images taken in the underwater environment, and we apply the human heuristic approach by accumulating the mean of the classification results of past frames to improve object classification and count the number of objects accurately. We verified that the proposed method improves the classification and counting of objects in video images.

YOLO
There are many studies on how to apply CNN to classify objects in unedited real-time video images [22][24][25][26][27]. In order to apply CNN, it is necessary to crop an image to fit the input size of the CNN [24][25][26]. Recent studies have used a saliency map [26,27] to select the region of the image to crop. However, when a saliency map is used, the processing time and performance vary depending on the number of filters, and processing time is the most important factor in real-time processing. In the case of YOLO, there is no need to crop the input image for object recognition. Additionally, its structure and processing time are suitable for real-time processing. YOLO predicts bounding boxes and class probabilities over the entire image at more than 45 fps, making it very fast. Furthermore, if the image contains no objects, or only objects that are not subject to classification, YOLO is less likely to detect a wrong object [20][21][22]. However, from the viewpoint of real-time processing, it is difficult to obtain an accurate result in every frame from unfocused or low-illumination video images. If YOLO output accurate classification results in every frame, the proposed method would not be necessary.

Learning Data and Video Image
For this study, we needed image data for learning and video images taken in the underwater environment for performance evaluation. It was difficult both to construct a large set of images of the same kind of object for learning and to obtain video images taken in the underwater environment. Although this study can be applied to various kinds of object classification, we selected fish, which are easy to shoot in the underwater environment. Therefore, video images of fish were taken directly in the laboratory environment, and learning data images of fish species were made from the captured video images [28]. The fish species to be classified are Largemouth bass, Bluegill, Common carp, Crucian carp, Catfish, Mandarin fish, and Skin carp. The seven fish species have a similar streamlined shape, and the image characteristics of the objects to be classified vary greatly with the underwater environment [28]. We used 5000 images per fish species as basic learning data for YOLO. Figure 1 shows the labeled fish in fish images.
Using anchor boxes to help predict the position and size of objects in an image increases the speed and efficiency of object detection in deep learning systems. In YOLO networks, anchor boxes are a set of bounding boxes with predefined heights and widths. These boxes are defined to capture the scale and aspect ratio of the specific object classes to be detected, and are typically selected according to the sizes of the objects in the learning data set, as shown in Figure 1. In this study, the average Intersection over Union (IoU) was calculated by k-means clustering of the bounding box sizes, and k = 4, with an average IoU of more than 0.74, was selected.
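The anchor box selection described above can be illustrated with a short sketch. The study itself used MATLAB; the Python code below is only an assumption-laden illustration in which the function names (`iou_wh`, `kmeans_anchors`, `mean_iou`), the sample box sizes, and the use of the cluster mean as the centroid update are all hypothetical choices, not details taken from the paper.

```python
import random


def iou_wh(box, anchor):
    """IoU of two boxes given as (width, height), aligned at a common corner."""
    inter = min(box[0], anchor[0]) * min(box[1], anchor[1])
    union = box[0] * box[1] + anchor[0] * anchor[1] - inter
    return inter / union


def kmeans_anchors(boxes, k, iters=50, seed=0):
    """Cluster (width, height) pairs, assigning each box to the anchor with highest IoU."""
    rng = random.Random(seed)
    anchors = rng.sample(boxes, k)  # initial anchors drawn from the data
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for b in boxes:
            best = max(range(k), key=lambda j: iou_wh(b, anchors[j]))
            clusters[best].append(b)
        new_anchors = []
        for j, members in enumerate(clusters):
            if members:
                # Assumed update rule: mean width and height of the cluster
                w = sum(m[0] for m in members) / len(members)
                h = sum(m[1] for m in members) / len(members)
                new_anchors.append((w, h))
            else:
                new_anchors.append(anchors[j])  # keep an empty cluster's anchor
        if new_anchors == anchors:
            break
        anchors = new_anchors
    return anchors


def mean_iou(boxes, anchors):
    """Average, over all boxes, of the IoU with the best-matching anchor."""
    return sum(max(iou_wh(b, a) for a in anchors) for b in boxes) / len(boxes)


# Hypothetical bounding box sizes in pixels (not data from the study)
boxes = [(10, 20), (12, 22), (50, 30), (52, 28),
         (100, 100), (98, 102), (30, 80), (32, 78)]
for k in (2, 3, 4):
    print(k, round(mean_iou(boxes, kmeans_anchors(boxes, k)), 3))
```

Running the loop for several values of k and comparing the resulting mean IoU is one way to arrive at a choice such as k = 4; the trade-off is that more anchors raise the mean IoU but also increase the cost of detection.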
YOLO was trained using the YOLOv2 implementation provided by MATLAB. The optimization method for YOLOv2 was Stochastic Gradient Descent with Momentum (SGDM), the initial learning rate was set to 1.0e-4, and the size of the mini-batch was set to 256. For the hardware, a CPU (Intel i9-7900 3.30 GHz) and four GPUs (NVIDIA GeForce GTX1080Ti) were used. Figure 2 shows the learning results of YOLOv2. In the case of Catfish, the average precision is 82%. The other six species of fish were learned with more than 93% precision.
After YOLO was trained, a heuristic method was applied to classify objects from video images taken in the underwater environment. Figure 3 shows the underwater photography system installed to capture the test video images. Because there is much floating matter in the aquatic environment, the video image changes with variations in sunlight and external light. Our underwater photography system was equipped with wireless communication, and transmitted classified fish images and classification probabilities. Therefore, we needed a method that can accurately classify fish and count the number of classified fish in video images.

Heuristic Method
Humans can sometimes classify an object in a video image by looking at it only once, but in general, humans observe objects continuously and classify them accurately from sequential images, using information from their experience and learning. For example, if an animal suddenly appears in a dark forest, few people identify it precisely from the beginning. At first, they are not sure whether it is a particular animal. However, once the animal comes into close proximity, most people will recognize it. In the same way, the proposed method sequentially applies real-time images to YOLO and computes the classification probability of an object as the cumulative mean of the accumulated YOLO outputs. Classifying objects using the cumulative mean of sequential classification values yields higher classification performance than classifying them from a single image with a CNN or YOLO.
In YOLO, the classification result for each object is represented by probability values. Assuming these classification results follow a normal distribution, the mean of the sample means for the same object equals the mean of the population, and, by the central limit theorem, the standard error of the sample means decreases as the number of samples increases, as shown in Equation (1) [29,30]:
$$s_{\bar{x}} = \frac{\sigma}{\sqrt{n}} \tag{1}$$
where $s_{\bar{x}}$ is the standard error of the sample means, $\sigma$ is the standard deviation of the population, and $n$ is the number of samples. Therefore, the more classified samples of the same object there are in the video images, the higher the confidence level for the classified object. In general, the time between the recognition and the disappearance of an object in video images is not constant, but assuming that it is at least 1 s, a sample of 30 frames or more is available. With 30 frames or more, the standard error is reduced to $s_{\bar{x}} \le \sigma/\sqrt{30} \approx 0.1826\sigma$. By applying the heuristic method to YOLO, our method can maintain higher accuracy than CNN or YOLO classification results that use one frame for the object classification of video images, because its standard error decreases with the number of frames, as shown in Equation (1). The mean of the sample means was calculated as the cumulative mean for each object using the classification results for successive images, as shown in Equation (2):
$$\mathrm{Avg}_i(k) = \frac{1}{i}\sum_{j=1}^{i} p_j(k) \tag{2}$$
where $i$ denotes the number of frames, $k$ denotes a classification object, $\mathrm{Avg}_i(k)$ is the cumulative mean for $i$ and $k$, and $p_i(k)$ is the probability of classification for $i$ and $k$.
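As a minimal sketch of the cumulative mean per classification object, the following Python class updates the mean incrementally, frame by frame. The class name `CumulativeMean` is hypothetical, and one detail is an assumption rather than something stated in the text: a class that is not reported in a given frame is treated as having probability 0 for that frame.

```python
from collections import defaultdict


class CumulativeMean:
    """Running per-class mean of classification probabilities over frames."""

    def __init__(self):
        self.frames = 0                # i: number of frames seen so far
        self.avg = defaultdict(float)  # k -> cumulative mean for class k

    def update(self, probs):
        """probs maps class name -> probability for the current frame.

        Assumption: classes missing from `probs` contribute 0 for this frame.
        """
        self.frames += 1
        i = self.frames
        for k in set(self.avg) | set(probs):
            p = probs.get(k, 0.0)
            # Incremental form of the mean: avg_i = avg_{i-1} + (p_i - avg_{i-1}) / i
            self.avg[k] += (p - self.avg[k]) / i
        return dict(self.avg)

    def best(self):
        """Return (class, cumulative mean) with the highest mean, or None."""
        if not self.avg:
            return None
        k = max(self.avg, key=self.avg.get)
        return k, self.avg[k]


# Hypothetical per-frame outputs: one early misclassification, then two correct frames
cm = CumulativeMean()
cm.update({"Mandarin fish": 0.6})
cm.update({"Largemouth bass": 0.9})
cm.update({"Largemouth bass": 0.9})
print(cm.best())
```

The incremental update avoids storing all past probabilities, which suits frame-by-frame processing of a video stream.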

Cumulative Mean of The YOLO Network
We describe how to enhance recognition using the cumulative mean of the YOLO network. The proposed method uses YOLO for object recognition, and uses the heuristic approach to improve the object classification results. Figure 4 shows the overall flow of the proposed method. Our method uses YOLO to recognize fish in all frame images of the video, and calculates the cumulative mean using the heuristic method whenever fish are recognized. Next, the number of fish is counted: the count is increased each time a fish disappears after having been recognized for a certain period of time within the capture region of the image. Fish that have not been recognized in the capture region for that period of time are not classified, as they are considered less reliable.
Figure 5 shows the capture region. The capture lines are set adaptively according to the size of the detected object, as shown in Equation (3). If the object is large, the bounding box of the recognized object is large, and the recognition probability is also high. Therefore, the capture region is set to be narrow, so that the object is classified only when its center is a little away from the center of the image. If the object is small, the capture region is set to be wide, so that the object can be classified even when its center is far from the center of the image. When YOLO does not recognize a fish for more than 20 frames after the fish was recognized in the capture region, the fish is assumed to have left the image, and it is classified and counted. In Equation (3), $A$ is any constant, $w_l$ is the width of the bounding box of the object, $w_h$ is the height of the bounding box, $c_l$ is the width of the capture region, $c_s$ is the width of the minimum capture region, and $c_w$ is the width of the maximum capture region.
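The counting rule can be sketched as a small state machine. This is only an illustration under several assumptions that are not taken from the paper: a single fish is tracked at a time, the caller decides whether the box center lies inside the capture region (so Equation (3) is not implemented here), `min_hits` is a hypothetical stand-in for "a certain period of time", and the mean is taken over frames that contain a detection. The class name `FishCounter` is likewise hypothetical.

```python
from collections import Counter, defaultdict


class FishCounter:
    """Counts fish that were seen in the capture region and then disappeared."""

    def __init__(self, miss_limit=20, min_hits=5):
        self.miss_limit = miss_limit  # frames without detection tolerated before a fish is gone
        self.min_hits = min_hits      # hypothetical minimum number of frames in the capture region
        self.counts = Counter()       # class name -> number of counted fish
        self._reset()

    def _reset(self):
        self.detected = 0               # frames with a detection for the current fish
        self.hits = 0                   # of those, frames inside the capture region
        self.misses = 0                 # consecutive frames without any detection
        self.sums = defaultdict(float)  # per-class sum of probabilities

    def step(self, detection):
        """detection is None or (label, prob, in_capture).

        Returns (label, mean probability) when a fish is counted, else None.
        """
        if detection is not None:
            label, prob, in_capture = detection
            self.detected += 1
            self.misses = 0
            self.sums[label] += prob
            if in_capture:
                self.hits += 1
            return None
        if self.detected == 0:
            return None  # nothing is being tracked
        self.misses += 1
        if self.misses <= self.miss_limit:
            return None  # the fish may still reappear
        result = None
        if self.hits >= self.min_hits:
            # Classify by the highest mean probability over the detected frames
            label = max(self.sums, key=self.sums.get)
            result = (label, self.sums[label] / self.detected)
            self.counts[label] += 1
        self._reset()
        return result


# Hypothetical sequence: 5 detections in the capture region, then 21 empty frames
counter = FishCounter(miss_limit=20, min_hits=3)
frames = [("Largemouth bass", 0.8, True)] * 5 + [None] * 21
results = [counter.step(f) for f in frames]
print(results[-1], dict(counter.counts))
```

A fish that never stays long enough in the capture region is discarded without being counted, which mirrors the statement that such detections are considered less reliable.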

Experiments
Video images of Largemouth bass and Bluegill taken in a pond were used for the performance evaluation. A general CNN classifies objects from a single frame, so its classification performance and recognition rate depend on the learning. In particular, because the video images are taken underwater, they are sensitive to changes in sunlight or external lighting, and thus a secondary method of recognizing fish in a single frame is required [20][21][22]. In addition, if an object other than those to be classified is captured in the video image and input to a CNN, the CNN has the disadvantage of forcibly classifying it as one of the fish species. YOLO also classifies objects from a single image, which can degrade classification performance depending on the learning, and it is difficult to obtain accurate classification results in every frame.
First, the proposed method was evaluated on Largemouth bass. In the video images of 34 Largemouth bass, the proposed method classified 33 as Largemouth bass (97.06%), and one was not recognized as Largemouth bass. Figure 6 shows the classification probabilities of the 33 Largemouth bass, which were recognized with a value of 60% or greater.
Figure 7 shows the classification results and frame images of YOLOv2 for a case in which the proposed method classifies the fish as Largemouth bass with a final value of 0.83. It took 322 frames for the Largemouth bass to appear at the right edge and disappear at the left edge. The fish was recognized as a Mandarin fish from frame 1 to frame 22, but was correctly recognized as Largemouth bass from frame 23 onward. In the proposed method, the classification result is represented by the cumulative average of the classification results up to the last frame, even though the classification was wrong up to frame 22. Therefore, the proposed method is less likely to yield incorrect classification results. In particular, it has a high classification probability for very slow-moving fish. Figure 7i,j indicate the classification probability of YOLOv2 and the proposed method, respectively, for each frame. It can be seen that the proposed method accurately recognizes a Largemouth bass after frame 37.
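To illustrate how the cumulative mean recovers from an early run of wrong classifications, the following sketch finds the first frame at which the cumulative mean of the correct class exceeds that of the wrong class. The per-frame probabilities (0.6 for the wrong class, 0.9 for the correct class) are hypothetical constants chosen for illustration; the actual per-frame values of Figure 7 are not given in the text, and the function name `crossover_frame` is an assumption.

```python
def crossover_frame(wrong_frames, p_wrong, total_frames, p_right):
    """First frame where the correct class has the larger cumulative mean.

    Frames 1..wrong_frames report the wrong class with probability p_wrong;
    later frames report the correct class with probability p_right.
    Returns (frame, cumulative mean of the correct class) or None.
    """
    sum_wrong = 0.0
    sum_right = 0.0
    for i in range(1, total_frames + 1):
        if i <= wrong_frames:
            sum_wrong += p_wrong
        else:
            sum_right += p_right
        # Both means share the denominator i (frames seen so far)
        if sum_right / i > sum_wrong / i:
            return i, sum_right / i
    return None


# 22 misclassified frames followed by correct frames, 322 frames in total
frame, mean_at_crossover = crossover_frame(22, 0.6, 322, 0.9)
final_mean = (322 - 22) * 0.9 / 322  # cumulative mean of the correct class at the last frame
print(frame, round(mean_at_crossover, 3), round(final_mean, 3))
```

With these assumed values, the correct class takes the lead 15 frames after the misclassification ends, and the remaining frames keep raising its cumulative mean. This is why a slow-moving fish, which stays in view for many frames, ends with a high classification probability.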
Figure 8 shows the YOLOv2 classification results of each frame for the one fish that the proposed method did not recognize as Largemouth bass. The Largemouth bass appears at the top left of the camera, comes very close, and disappears to the top right. YOLOv2 misclassified it as Common carp from frame 9 to frame 15, after which it did not recognize any fish. YOLOv2 did not classify it correctly because each frame image showed only one part of the fish rather than its overall outline. For such video images, CNN and YOLOv2 produce wrong classification results and wrong counts of the fish species. The proposed method may also fail to classify the fish species; however, it does not count individuals on the basis of misclassified results.
Second, we evaluated the proposed method on Bluegill. The proposed method recognized 62 (93.94%) of a total of 66 fish as Bluegill, and did not classify 4 Bluegills. Most of the 62 Bluegills were recognized with classification probabilities of more than 60%, as shown in Figure 9. The proposed method using the heuristic method shows a very high recognition rate for the detection of fish, and can accurately count the population of fish.
Figure 10 shows the classification results and frame images of YOLOv2 for the Bluegill with the lowest classification probability, 27%. The Bluegill appears from the bottom right and quickly disappears toward the bottom left, taking a total of 30 frames. From frame 1 to frame 8 it was recognized as Common carp, but from frame 9 to frame 30 it was either recognized as Bluegill or not recognized as any object. The frames in which YOLOv2 does not recognize the fish indicate that the learning of YOLOv2 is not perfect.
The proposed method using the heuristic method shows a very high recognition rate for the detection of fish, and can accurately count the population of fish.  Figure 10 shows the classification results and frame images of YOLOv2, with the lowest classification probability of 27 % for Bluegill. The Bluegill appears from the bottom right, disappears quickly down the left, and takes a total of 30 frames. From frame 1 to frame 8 it was recognized as Common carp, but from frame 9 to frame 30, it was recognized as Bluegill, or not as any object. In the case that YOLOv2 does not recognize the fish, the learning of YOLOv2 is not perfect.  Figure 10 shows the classification results and frame images of YOLOv2, with the lowest classification probability of 27% for Bluegill. The Bluegill appears from the bottom right, disappears quickly down the left, and takes a total of 30 frames. From frame 1 to frame 8 it was recognized as Common carp, but from frame 9 to frame 30, it was recognized as Bluegill, or not as any object. In the case that YOLOv2 does not recognize the fish, the learning of YOLOv2 is not perfect. Figure 10i,j show the classification performances of YOLOv2 and the proposed method, respectively, for each frame. If the CNN or YOLO network recognizes the images in frame 1 to frame 8 as Common carp, and fails to recognize the images in frames 11,12,15,16,17,25, and 26 as fish, the fish species is recognized incorrectly. The proposed method has a low probability after frame 21, but is correctly recognized as Bluegill. Figure 11 shows one example of four cases where the proposed method did not recognize the Bluegill. It took a total of 23 frames until the Bluegill appeared from the bottom right and disappeared down the left. YOLOv2 recognized the Bluegill in frames 5, 7, 8 and 23, but did not recognize any fish in the other frames. 
In the proposed method, if a fish has not been recognized in the capture region of the image for a certain period of time, it is not classified as an object. The video image used in the experiment is very different from the image in Figure 1a, which was used to train YOLOv2. YOLOv2 is therefore not fully trained, because the training images lack very fast-moving fish such as those in the video images.
The proposed method is very simple and intuitive, while retaining the advantages of YOLO in video images of underwater environments. The heuristic method has shown excellent performance in classifying and counting objects in video images. Therefore, the proposed method is considered to be useful not only for objects in the underwater environment, but also for other objects.
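The rule stated above, that a fish not recognized in the capture region for a certain period is not classified, might be sketched as follows. The text does not give the length of that period, so the threshold `max_missed` (10 frames here) and the function `is_classified` are assumptions made only for illustration, as is the choice to measure the period as a run of consecutive frames without a detection.

```python
def is_classified(detected_frames, total_frames, max_missed=10):
    """Return False when a run of undetected frames exceeds max_missed.

    max_missed is a hypothetical threshold; the source only says
    "a certain period of time".
    """
    detected = set(detected_frames)
    missed_run = 0
    for frame in range(1, total_frames + 1):
        if frame in detected:
            missed_run = 0          # a detection resets the run
        else:
            missed_run += 1
            if missed_run > max_missed:
                return False        # unrecognized for too long
    return bool(detected)           # never detected means not classified

# Figure 11 case: detections only in frames 5, 7, 8, and 23 of 23 frames.
figure11_result = is_classified([5, 7, 8, 23], 23)
print(figure11_result)
```

With this assumed threshold, the long gap between frame 8 and frame 23 is enough for the fish to go unclassified.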
Figure 12 shows the results of a comparative experiment on recognition rates with other deep learning-based methods. For GoogLeNet, Vgg16, and Vgg19, the recognition rate was measured at the point in time when the fish is in the center of the video image. For YOLOv2, the recognition rate was measured over all frames from the point when a fish is recognized in the video image to the moment it leaves. Furthermore, YOLOv2 and the proposed method used the same trained YOLO network. All methods showed a high recognition rate of 0.85 or higher: the proposed method has a recognition rate of 0.95, and the other methods have recognition rates of 0.88 to 0.89. In the proposed method, the result of the previous frame affects the recognition result of the current frame. This cancels out recognition errors in single frames, giving a performance improvement of about 0.08 compared to the other methods.
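The way earlier frames influence the current result can be illustrated with a running mean of per-class confidences. This is a minimal sketch under stated assumptions, not the exact formulation used in the experiments: it assumes one detection per frame, that a frame contributes zero to every class it does not report, and that the confidences below are made-up illustrative values rather than measurements. The function name `cumulative_mean_scores` is hypothetical.

```python
from collections import defaultdict

def cumulative_mean_scores(frames):
    """Yield, after each frame, the running mean confidence per class.

    frames: list of (label, confidence) pairs, one per frame, with
    label None when the detector reports nothing in that frame.
    """
    totals = defaultdict(float)
    for n, (label, conf) in enumerate(frames, start=1):
        if label is not None:
            totals[label] += conf
        # Dividing by the frame count lets missed frames lower the mean.
        yield {cls: total / n for cls, total in totals.items()}

# Illustrative values only, not data from the experiments.
frames = [
    ("Common carp", 0.6),
    ("Bluegill", 0.7),
    (None, 0.0),
    ("Bluegill", 0.9),
]
history = list(cumulative_mean_scores(frames))
final = history[-1]
best = max(final, key=final.get)
print(best, round(final[best], 2))
```

In this toy sequence the first frame alone would give Common carp, but the accumulated mean after four frames favors Bluegill, which is the single-frame error cancellation described above.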

Figure 12. Experimental results compared with other methods (GoogLeNet, Vgg16, Vgg19, and YOLOv2) for recognition rate.

Conclusions
YOLO shows excellent performance in object recognition, but its performance varies depending on how the network is trained. It recognizes objects by processing the image of each frame independently, so the classification results of the previous frame do not affect those of the current frame. By accumulating the object classification results from past frames up to the current frame, we propose a method that accurately classifies objects and counts them in sequential video images. The proposed method shows very good classification performance in video images taken in underwater environments, with high classification probabilities of 93.94% and 97.06% in the test videos of Bluegill and Largemouth bass, respectively. The proposed method is also affected by the performance of YOLO, but its performance was improved by applying a heuristic method that mimics human experience and learning.
