Live Fish Species Classification in Underwater Images by Using Convolutional Neural Networks Based on Incremental Learning with Knowledge Distillation Loss

Abstract: Nowadays, underwater video systems are widely used by marine ecologists to study biodiversity in underwater environments. These systems are non-destructive, do not perturb the environment and generate a large amount of visual data usable at any time. However, automatic video analysis requires efficient image processing techniques due to the poor quality of underwater images and the challenging underwater environment. In this paper, we address live reef fish species classification in an unconstrained underwater environment. We propose using a deep Convolutional Neural Network (CNN) and training this network with a new strategy based on incremental learning. This training strategy consists of training the CNN progressively by focusing first on learning the difficult species well and then gradually learning the new species incrementally, using a knowledge distillation loss to keep the high performance on the old species already learned. The proposed approach yields an accuracy of 81.83% on the LifeClef 2015 Fish benchmark dataset.


Introduction
The underwater environment, particularly coral reefs, contains diverse ecosystems rich in biodiversity. These reefs are composed of assemblages of corals, algae and sponges. This complex structure offers an ideal habitat for many species, especially for protection and feeding [1]. The underwater environment is ecologically and economically important, but is threatened by pollution [2], over-fishing [3], and climate change [4]. These factors are destroying the ecosystem and accelerating the loss of the coral and fish species living there [5]. It is now necessary to monitor the evolution of these ecosystems in order to identify and even anticipate any threatening degradation [6]. This monitoring is achieved by observing and then estimating the diversity and abundance of fish species to understand the structure and dynamics of the coral reef community [7]. Traditional techniques used to observe ecosystems and monitor biodiversity, such as capture techniques (fishing [8], anesthesia [9]) and underwater visual census (UVC) [10], are destructive and/or do not ensure continuous monitoring of underwater biodiversity. It is therefore important to adopt more advanced techniques that are non-destructive and provide continuity in ecosystem monitoring.
In recent years, underwater video techniques have been increasingly used to observe macrofauna and habitat in marine ecosystems. Advances in video camera technology, battery life and information storage make these techniques accessible to the majority of users. Underwater video has some notable advantages. Inexpensive in terms of cost and time, it allows for a large number of observations that can be reused at any time. It also makes it possible to monitor the aquatic communities of the ecosystem without disturbing its functioning. This is a major advantage over UVCs, which require the presence of a diver in the environment. In addition, this technique is preferable for monitoring large areas such as marine protected areas (MPAs), marine parks or World Heritage sites. It is also easy to implement, and the recorded videos can be used by non-specialists.
However, current underwater video observation techniques require human experts to analyze the rapidly increasing amount of data collected. Automatic processing of underwater videos is the key to analyzing these large amounts of data efficiently and objectively. Such automation is not yet available due to the difficulties presented by the underwater environment, which poses great challenges for computer vision (Figure 1). The luminosity changes frequently due to ocean currents, visibility is limited, and the complex coral background sometimes changes rapidly due to moving aquatic plants. Object recognition in underwater video images is an open challenge in pattern recognition, especially when dealing with fish species recognition. Fish move freely in all directions, and they can also hide behind rocks and algae. In addition, fish overlap and similarities in shape and pattern between fish of different species pose significant challenges in our application. We can divide automatic fish recognition into two steps: (1) fish detection, which aims to localize every single fish in the underwater image, and (2) fish classification, which aims to identify the species of each detected fish. In our previous work [11], we focused on fish detection in unconstrained underwater videos. In this paper, we address fish species classification in underwater images.
Over the last decade, several works have developed methods for live fish species recognition in the open sea. The early works mainly used hand-crafted techniques such as forward sequential feature selection (FSFS) [12], a discriminant analysis approach [13], histograms of oriented gradients (HOG) [14], and SURF [15].
All of these CNN-based works used a classical multi-class CNN, which is trained at once on all classes in the dataset. With this structure, the CNN treats all classes the same. However, some classes are naturally more likely to be misclassified than others, especially classes that have fewer samples or are difficult to classify. In a live fish recognition task, the species in the training set can be grouped into categories according to their degree of difficulty. Difficult species require special treatment when training the CNN model. On the other hand, incremental learning allows a model to integrate new examples without having to perform a complete re-training and without destroying the knowledge acquired from old data. We propose to build a CNN classifier starting with the difficult species. Initially, the model focuses on learning the difficult species well and then gradually learns the other species in an incremental way. We aim to keep the model stable when introducing new species while maintaining high performance on the old species already learned.
The main contributions of this paper are summarized as follows:
• We propose a novel approach of using knowledge distillation for training the CNN architecture for the live coral reef fish species classification task in unconstrained underwater images.
• We propose to train the pre-trained ResNet50 [36] progressively by focusing at the beginning on hard fish species and then integrating the easier species.
• Extensive experiments and comparisons of results with other methods are presented. The proposed approach outperforms state-of-the-art fish identification approaches on the LifeClef 2015 Fish (www.imageclef.org/lifeclef/2015/fish accessed on 7 July 2022) benchmark dataset.
The rest of this paper is organized as follows. Section 2 outlines related work. Section 3 presents the proposed approach for underwater live fish species classification. Section 4 describes the benchmark dataset used in this work, provides the experimental results and performs a comparative study. Finally, the conclusion and perspectives are discussed in Section 5.

Related Works
In this section, we first review state-of-the-art computer vision approaches for fish species classification (Section 2.1). Then, we briefly review related works on incremental learning (Section 2.2).

Fish Species Classification
Early works used hand-crafted features to recognize fish species in the open sea. Spampinato et al. [13] combined texture and shape features; an affine transformation is also applied to the acquired images to represent the fish in 3D. Cabrera-Gámez et al. [14] computed different local descriptors such as histograms of oriented gradients (HOG), local binary patterns (LBP), uniform local binary patterns (ULBP) and local gradient patterns (LGP). To improve classification results, they adopted a score-level fusion approach in which the first layer is composed of a set of classifiers, one dedicated to each chosen descriptor, while the second-layer classifier takes as input the scores of the first layer. SVM-based techniques can be considered flat classifiers because they classify all classes at the same time using the same features for all classes. Sometimes it may be useful to choose specific features for different classes; hierarchical classification tree techniques take this into account. The idea is to gradually separate the set of images into sub-classes, with each node in the tree having its own set of features. The main disadvantage of this structure is the accumulation of errors: if an error is made at a node, it will necessarily lead to new errors in the child nodes. Huang et al. [12] extracted 66 types of features describing the color, shape, and texture of different parts of the fish. They then proposed a hierarchical classification method called the "Balance-Guaranteed Optimized Tree (BGOT)" that is designed to minimize the error accumulation problem. Szűcs et al. [15] used speeded up robust features (SURF) [37] to classify fish species. Dhar and Guha [38] extracted the robust gist feature and the gray level co-occurrence matrix feature from a fish image, then combined these features to feed an XGBoost classifier.
With the advent of deep learning, in particular convolutional neural networks (CNNs), many studies have investigated the contribution of CNNs to solving different computer vision tasks. Villon et al. [25] presented two methods for recognizing coral reef fish in HD underwater videos. The first method is based on a traditional two-step approach: extraction of HOG features and use of an SVM classifier. The second method is based on deep learning using the GoogleNet architecture [39]. They compared the two methods and found that deep learning is more efficient than HOG+SVM. Salman et al. [31] proposed a CNN of three convolutional layers to extract features and feed standard classifiers such as SVM and K-nearest neighbors (KNN). Qin et al. [30] proposed a CNN with three convolutional layers trained from scratch on the Fish Recognition Ground-Truth dataset. They also proposed in [33] a hybrid deep architecture combining traditional methods to extract features from fish images. In this architecture, principal component analysis (PCA) is used in two convolutional layers, followed by binary hashing in the non-linear layer and a block-wise histogram in the pooling layer. Then, spatial pyramid pooling (SPP) [40] is applied. Finally, classification is performed with a linear SVM. Compared to their first work [30], the proposed hybrid deep architecture improved the accuracy by only 0.07%. Sun et al. [34] applied two deep architectures, PCANet [41] and NIN [42], to extract features from underwater images; a linear SVM classifier is used for classification. Jäger et al. [21] used features extracted from the activations of the seventh hidden layer of the pre-trained AlexNet model [43] to feed a multi-class SVM classifier. Sun et al. [22] proposed to extract fish features from a pre-trained deep CNN using transfer learning [44]; they retrained AlexNet with artificial data augmentation and used an SVM classifier. Mathur et al. [28] fine-tuned a ResNet50 model by only re-training the last fully connected layers without any data augmentation, using Adamax as an optimizer. Zhang et al. [29] proposed AdvFish, which addresses the noisy background problem. They fine-tuned the ResNet50 model by adding a new term to the loss function; this term encourages the network to automatically differentiate the fish regions from the noisy background and pay more attention to the fish regions. Pang et al. [45] used the teacher-student model to reduce the impact of interference on fish species classification. They distilled interference information by reducing the discrepancy between two distance matrices generated separately from a processed fish image and a raw fish image; KL-divergence is used to further reduce noise in the raw data distribution.
Other works tried to modify the original pre-trained models. Ju and Xue [23] proposed an improved AlexNet model with less structural complexity: it consists of four convolutional layers instead of five, plus an added item-based soft attention layer. Iqbal et al. [24] used a reduced version of AlexNet consisting of four convolutional layers and two fully-connected layers. Cheng and He [46] proposed a deep residual shrinkage network. This network is an improved attention mechanism algorithm based on a deep residual network, which embeds a soft threshold as a shrinkage layer into a residual module to automatically set thresholds for each feature channel.
Other works attempted to address fish detection and species classification in the same framework. Li et al. [18] applied Fast R-CNN [47] to detect and recognize fish species. Knausgard et al. [48] used the You Only Look Once (YOLO) object detection technique [49] to detect fish in underwater images. Then, they adopted a CNN with the squeeze-and-excitation (SE) architecture [50] to classify each fish in the image within a transfer learning framework. Jalal et al. [19] proposed a hybrid approach combining spatial and temporal information based on YOLOv3 to detect and classify fish images. Interested readers can refer to recent surveys covering works on fish species recognition [51,52].
The training of our approach differs from classical multi-class CNN training, which trains the CNN at once on all classes. In our application, we group species according to their degree of difficulty. We start training our CNN with the difficult species, and it then gradually learns the other species in an incremental way.

Incremental Learning
Incremental learning [53,54] refers to learning from streaming data, which arrives over time. It allows a model to receive and integrate new examples without having to perform a complete re-training and without destroying the knowledge acquired from old data. An incremental learning algorithm is defined in [55] as one that meets the following criteria:
• It should be able to learn additional knowledge from new data;
• It should not require access to the original data (i.e., the data that were used to learn the current classifier);
• It should preserve previously acquired knowledge;
• It should be able to learn new classes that may be introduced with new data.
These four points apply to any general incremental learning problem. In our application, we want to use the principle of incremental learning in a classical transfer learning context. In classical transfer learning, a pre-trained model is re-trained once on a new dataset with a predefined number of classes. Our approach based on incremental learning trains a model progressively while adding new classes at each transfer of knowledge. The common point between this learning and classical incremental learning is that the model learns new data without destroying the knowledge acquired from the old data. On the other hand, the difference between them is that our approach requires re-training the system on both old and new data.
We can distinguish mainly three types of incremental learning algorithms:
• Architectural strategy [56]: the architecture of the model is modified in order to mitigate forgetting, e.g., by adding layers or freezing weights.
• Regularization strategy [57,58]: loss terms are added to the loss function to promote the selection of important weights and thus keep the knowledge gained. This type also includes basic regularization techniques such as dropout and early stopping.
• Repetition strategy [59]: old data are periodically replayed to the model to strengthen the connections associated with the learned knowledge. A simple approach is to store some of the previous training data and interleave it with new data for future training.
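As an illustration of the repetition strategy described above, the following hypothetical sketch keeps a small buffer of old training samples and interleaves it with each batch of new data (the buffer size, batch size and replay ratio are arbitrary choices for illustration, not taken from the cited works):

```python
import random

def rehearsal_batches(new_data, memory_buffer, batch_size=8, replay_ratio=0.25):
    """Yield mixed batches: mostly new samples, plus replayed old ones."""
    n_replay = int(batch_size * replay_ratio)
    n_new = batch_size - n_replay
    for start in range(0, len(new_data), n_new):
        # Slice of fresh samples, topped up with a random draw of old ones.
        batch = new_data[start:start + n_new]
        batch += random.sample(memory_buffer, min(n_replay, len(memory_buffer)))
        random.shuffle(batch)
        yield batch

old = [("old", i) for i in range(20)]   # stored samples from earlier training
new = [("new", i) for i in range(12)]   # incoming stream of new samples
batches = list(rehearsal_batches(new, old))
assert all(any(tag == "old" for tag, _ in b) for b in batches)
```

Every batch then reinforces connections learned on old data while the model fits the new data, which is the core idea the repetition strategy relies on.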
In our approach, we use the regularization strategy by modifying the loss function of the system.
For validation purposes, we carried out experiments on the LifeClef 2015 Fish (LCF-15) dataset. This benchmark dataset was captured by underwater cameras in the open sea.

Proposed Approach
Here we propose a CNN approach for efficient fish species identification based on transfer learning and incremental learning. This approach trains the CNN progressively while adding new species. For learning the new species, we use the Learning Without Forgetting approach (Section 3.2), which modifies the loss function by adding a loss term called the knowledge distillation loss (Section 3.3).

Architecture of the Approach
The architecture of our proposed approach is illustrated in Figure 2. We consider first a CNN with a set of shared parameters θ s (the convolutional layers). In the output layer, we consider specific parameters tuned using old species θ o (the weights of neurons in the output layer corresponding to the old species). Finally, we have specific parameters assigned to the new classes, randomly initialized θ n (the weights of neurons in the output layer corresponding to the new species). Our goal is to learn the specific parameters θ n and update the parameters θ s and θ o in order to ensure that the entire model performs well on both old and new species.
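The parameter split can be sketched as follows. This is a minimal NumPy illustration with stand-in weight matrices (not the actual trained layers); the dimensions assume the 2048-d feature vector of a ResNet50 backbone and the 8 difficult / 7 easy species grouping used later in the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes: a ResNet50 backbone (theta_s) yields 2048-d features;
# the species counts match the LCF-15 grouping used in this paper.
feat_dim, n_old, n_new = 2048, 8, 7

# theta_o: output-layer weights already tuned on the old (difficult) species.
W_old = rng.normal(scale=0.01, size=(n_old, feat_dim))

# theta_n: neurons for the new species, randomly initialized and appended
# to the classification layer, while the old weights are kept unchanged.
W_new = rng.normal(scale=0.01, size=(n_new, feat_dim))
W_full = np.vstack([W_old, W_new])  # joint head over all 15 species

assert W_full.shape == (n_old + n_new, feat_dim)
assert np.array_equal(W_full[:n_old], W_old)  # old knowledge preserved
```

Joint training then updates θ s, θ o and θ n together, with the distillation loss keeping the old rows close to their recorded behavior.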

Learning Phase
Let X o be the set of samples of the k difficult species (also called old species), X n the set of samples of the N − k remaining species, where N is the total number of species in the dataset, and X = X o ∪ X n the total training set. We train our model in three steps:

• Step 1 (train parameters θ s and θ o ): First, using classical transfer learning, we train a pre-trained network, here ResNet50, on X o .
• Step 2 (calculate probabilities): At the end of the first step, each image x i ∈ X is passed through the trained CNN (of parameters θ s and θ o ) to generate a vector of probabilities of belonging to the k old species, p o (i). The set P o = f (θ s , θ o , X) of probabilities serves as labels corresponding to the training image set X, where f (.) is the output of the CNN using the parameters θ s and θ o . The objective is to train the network without moving these predictions much.
• Step 3 (train all parameters): In order to incorporate the new species, we add nodes for each new species to the classification layer with randomly initialized weights (parameters θ n ). When training the new model, we jointly train all model parameters θ s , θ o and θ n until convergence. This procedure, called joint-optimize training, encourages the computed output probabilities P̂ o to approximate the recorded probabilities P o . To achieve this, we modify the network loss function by adding a knowledge distillation term.
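Step 2 above, recording the old head's predictions for every training image, can be sketched as follows. This is a hedged NumPy illustration in which the features and old-head weights are random stand-ins for the trained f (θ s , θ o , ·):

```python
import numpy as np

def softmax(z, axis=-1):
    """Numerically stable softmax over the class axis."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(1)

# Stand-ins for f(theta_s, theta_o, X): random backbone features for a
# batch of 32 images and an old 8-species classification head.
features = rng.normal(size=(32, 2048))
W_old = rng.normal(scale=0.01, size=(2048, 8))

# Record P_o, the old model's probabilities for every training image;
# they serve as soft targets during the later joint-optimize training.
P_o = softmax(features @ W_old)

assert P_o.shape == (32, 8)
assert np.allclose(P_o.sum(axis=1), 1.0)
```

In practice these recorded probabilities are computed once, stored alongside the ground-truth labels, and reused at every epoch of Step 3.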

Knowledge Distillation
Knowledge distillation is an approach originally proposed by Hinton et al. [60] with the goal of reducing the size of a network. It uses two networks: a powerful but complex and expensive one, called the master, and a smaller one, called the student. First, the master network is trained; then, the student network is trained to imitate it. In practice, the student predicts the outputs of the master by imitating the probabilities the master assigns to each class. In the end, we have two networks of different sizes that produce similar outputs, yielding a lighter student model with comparable performance.
In our approach, we propose to use knowledge distillation to train the network when adding the new species without forgetting the old knowledge. The knowledge distillation allows the network to reconcile its outputs after the integration of new classes with the outputs of the network before the integration. This can be modeled by a modified cross-entropy loss that increases the weight for smaller probabilities:

L_distillation(p_o , p̂_o ) = −∑_{i=1}^{l} p′_o (i) log p̂′_o (i),        (1)

where l is the number of old labels, and p′_o and p̂′_o are modified versions of the recorded probabilities p_o and the calculated probabilities p̂_o :

p′_o (i) = p_o (i)^{1/T} / ∑_j p_o (j)^{1/T},  p̂′_o (i) = p̂_o (i)^{1/T} / ∑_j p̂_o (j)^{1/T},        (2)

where p_o are softmax outputs and T is a parameter called temperature, which for a standard softmax function is normally set to 1. As T grows, the probability distribution generated by the softmax function becomes softer, providing more information about which classes the master finds more similar to the predicted class.
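A minimal NumPy sketch of this temperature-softened distillation loss, assuming the 1/T re-normalization described above (the probability vectors are toy examples, not model outputs):

```python
import numpy as np

def soften(p, T):
    """Raise probabilities to 1/T and renormalize, as described in the text."""
    q = p ** (1.0 / T)
    return q / q.sum(axis=-1, keepdims=True)

def distillation_loss(p_old, p_hat, T=2.0):
    """Modified cross-entropy between softened recorded (p_old) and
    softened computed (p_hat) probabilities over the old labels."""
    pp, ph = soften(p_old, T), soften(p_hat, T)
    return float(-np.mean(np.sum(pp * np.log(ph + 1e-12), axis=-1)))

p_rec = np.array([[0.7, 0.2, 0.1]])
# Matching the recorded outputs costs less than contradicting them.
assert distillation_loss(p_rec, p_rec) < distillation_loss(
    p_rec, np.array([[0.1, 0.2, 0.7]]))
# T > 1 flattens the distribution, boosting the small probabilities.
assert soften(p_rec, 2.0)[0, 2] > p_rec[0, 2]
```

Minimizing this term therefore penalizes the joint model whenever its outputs on the old species drift away from the recorded predictions.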

Total Loss Function
The total loss function L_total of the network is the sum of the knowledge distillation loss (L_distillation ) and the loss function used by the network to learn the classes (L_loss ):

L_total (Y, Ŷ, P_o , P̂_o ) = λ_o L_distillation (P_o , P̂_o ) + L_loss (Y, Ŷ),        (3)

where λ_o is a loss balance weight between old and new classes. By increasing its value, we favor training the old images over the new images. Y and Ŷ are the vectors of ground-truth and calculated labels, respectively, corresponding to the training image set X = X o ∪ X n . This approach has advantages over classic transfer learning. Indeed, the strategy of feature extraction without fine-tuning generally performs worse on a new dataset because the shared parameters θ s are related to the old classes and have not learned to extract discriminative features for the new classes. On the other hand, fine-tuning degrades performance on the old classes because the shared parameters are relearned. The proposed incremental learning, however, allows training a network without forgetting the old knowledge.
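The combined objective can be sketched as follows (a NumPy illustration; λ_o and T match the values explored in the experiments, while the label and probability vectors are toy examples):

```python
import numpy as np

def cross_entropy(y, y_hat):
    """Standard classification loss over all (old + new) species."""
    return float(-np.mean(np.sum(y * np.log(y_hat + 1e-12), axis=-1)))

def soften(p, T):
    q = p ** (1.0 / T)
    return q / q.sum(axis=-1, keepdims=True)

def total_loss(y, y_hat, p_old, p_hat_old, lam_o=0.5, T=2.0):
    """L_total = lam_o * L_distillation + L_loss, with lam_o balancing
    old versus new classes as described in the text."""
    dist = float(-np.mean(np.sum(
        soften(p_old, T) * np.log(soften(p_hat_old, T) + 1e-12), axis=-1)))
    return lam_o * dist + cross_entropy(y, y_hat)

y = np.array([[0.0, 1.0]])        # one-hot ground truth
y_hat = np.array([[0.2, 0.8]])    # current prediction over all species
p_o = np.array([[0.6, 0.4]])      # recorded old-head probabilities

# With lam_o = 0 the distillation term vanishes: plain cross-entropy.
assert np.isclose(total_loss(y, y_hat, p_o, p_o, lam_o=0.0),
                  cross_entropy(y, y_hat))
```

Raising lam_o increases the weight of the distillation term, tying the model more strongly to the old species, which mirrors the behavior observed in the λ_o experiments.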

Experiments
We used the LCF-15 benchmark underwater dataset to evaluate the effectiveness of the proposed approach for fish species classification. The dataset contains images of fish of different colors, textures, positions, scales, and orientations. It originates from the European project Fish4Knowledge (F4k) (www.fish4knowledge.eu accessed on 7 July 2022) [61]. During this five-year project, a large dataset of over 700,000 unconstrained underwater videos with more than 3000 fish species was collected in Taiwan, one of the richest fish biodiversity environments in the world.

LifeClef 2015 Fish (LCF-15) Benchmark Dataset
The LCF-15 is an underwater live fish dataset. The training set consists of 20 annotated videos and more than 22,000 annotated sample images covering 15 different fish species. Figure 3 shows examples of the 15 fish species, and Table 1 gives the distribution of the fish species in the dataset. Each video is manually labeled and agreed on by two specialist experts. The dataset is imbalanced in the number of instances of the different species; for example, the number of instances of the species 'Dascyllus reticulatus' is about 40 times that of the species 'Chaetodon speculum'. The test set has 73 annotated videos. We note that for three fish species, there are no occurrences in the test set (Table 1); this serves to evaluate the method's capability of rejecting false positives.
Compared with other underwater live fish datasets, the LCF-15 dataset provides challenging underwater images and videos marked by more noisy and blurry environments, complex and dynamic backgrounds and poor lighting conditions [31]. We choose this dataset because it contains two categories of species: hard and easy species.
Finally, we note that fish images can be extracted from the videos using the available ground-truth fish bounding boxes.

Learning Strategy for Live Fish Species Classification
The incremental learning of the model is performed in two steps. First, we train the model on images of difficult fish species. Then, we integrate new images of easy fish species and train the model on all images: old and new ones according to the approach described in Section 3.

• Construction of two groups: in order to separate the species into two subsets, difficult and easy, we train the pre-trained network ResNet50 on all species of the LCF-15 dataset with transfer learning. Figure 4 illustrates the confusion matrix. From this confusion matrix, we can group the species into two main groups: a group of species with low precision, the difficult species (AN, AV, CC, CT, MK, NN, PD, ZS), and a group of species with high precision, the easy species (AC, CL, CS, DA, DR, HM, PV).
• Step 1 (difficult species): We first train the model on the first group using a pre-trained ResNet50 model. We want the model to focus on this subset; for this reason, we apply a data augmentation technique, proceeding as follows. We flip each fish sample horizontally to simulate a new sample where the fish swims in the opposite direction; then, we scale each fish image to different scales (tinier and larger). We also crop the images by removing one quarter from each side to eliminate parts of the background. Finally, we rotate fish images by angles of −20°, −10°, 10° and 20° to address rotation-invariant fish recognition. At the end of this training, the model generates the shared parameters θ s and the specific parameters for the first group θ o .
• Step 2 (all species): Then, we add the species of the second group. In order to integrate these new species, we add a number of neurons equal to the number of species in this group to the classification layer. We randomly initialize the weights of these new neurons (parameters θ n ) and keep the weights corresponding to the old species (θ s and θ o ). In this second training, we apply the new loss function to learn the new species while keeping the knowledge learned in the first training.
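The flip, crop and scale augmentations of Step 1 can be sketched with plain NumPy array operations. This is a simplified illustration on a blank image; the rotations by ±10° and ±20° would typically rely on a library routine such as scipy.ndimage.rotate and are omitted here:

```python
import numpy as np

def hflip(img):
    """Mirror the image so the fish 'swims' in the opposite direction."""
    return img[:, ::-1]

def crop_sides(img):
    """Remove one quarter from each side to discard background pixels."""
    h, w = img.shape[:2]
    return img[h // 4: h - h // 4, w // 4: w - w // 4]

def rescale_nn(img, factor):
    """Nearest-neighbour rescaling to produce tinier or larger samples."""
    h, w = img.shape[:2]
    rows = (np.arange(int(h * factor)) / factor).astype(int)
    cols = (np.arange(int(w * factor)) / factor).astype(int)
    return img[np.ix_(rows, cols)]

img = np.zeros((64, 48, 3))  # stand-in H x W x RGB fish crop
assert hflip(img).shape == (64, 48, 3)
assert crop_sides(img).shape == (32, 24, 3)
assert rescale_nn(img, 0.5).shape == (32, 24, 3)
```

Each transform yields an extra training sample per original image, which multiplies the effective size of the difficult-species subset.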

Results
In this section, we present the results of our proposed approach. First, we evaluate the model trained on difficult species (Section 4.3.1). Then, we present the results of the model trained on all species after adding the new ones. For the last model, we analyze the effect of different parameters on its performance (Section 4.3.2). Finally, we compare our approach with state-of-the-art approaches (Section 4.3.3). Figure 5 shows the confusion matrix of the model trained on the first group that contains species with low precision. The model identifies the species CT and MK well, followed by the species AV, CC, NN and PD. The species AN and ZS remain difficult to identify. We obtain an accuracy of 85.26%. If we compare the precision of these species with those of Figure 4, we notice that in this model, the precisions are higher. The aim of our new approach is to maintain these high precisions when we add the other species.

Model Trained on All Species
In order to improve the performance of this model, we evaluate different optimizers and parameters, particularly the loss balance weight λ o and the temperature T.

i. Optimization technique: The choice of the optimizer is a crucial step that greatly influences performance. We first evaluate our approach using different optimizers for the loss function (Equation (3)). For this, we set λ o to 0, and we test different optimization techniques (SGD, Adam, Adamax and RMSprop). Table 2 compares the accuracies obtained with the different optimization techniques on the LCF-15 dataset. We can observe that Adam shows the best results for our problem. Adam is recommended as the default optimizer for most deep learning applications; it inherits the good features of RMSprop and other algorithms. Unlike SGD, which maintains a single learning rate throughout training, the Adam optimizer updates the learning rate for each network weight individually. We use this optimizer for the rest of the work, and we add the distillation loss term.

ii. Effect of parameter λ o : λ o is a loss balance weight between old and new classes. In order to explore the effect of this balance weight on performance, we evaluate our system using different values of λ o . Figure 6 illustrates the impact of λ o on accuracy. It can be seen that the performance gradually decreases as λ o increases. This is because increasing its value favors training the old images over the new images, which ties the system strongly to the old images. The accuracy is relatively good when λ o = 0.5 or 1. We set λ o to 0.5 for the next experiment.

iii. Effect of temperature parameter T: We also evaluate the effect of the temperature parameter T seen in Equation (2). Figure 7 visualizes the effect of this parameter on the performance of our approach. We can observe that the accuracy is better when T ≥ 1. Hinton et al. [60] suggest setting T > 1, which increases the weight of smaller values and encourages the network to better learn similarities among classes. We achieve the best accuracy of 81.83% for T = 2.

iv. Performance analysis: After analyzing the effects of the different parameters, we show results using the following setting: λ o = 0.5 and T = 2 with the Adam optimizer. We achieve an accuracy of 81.83%, which exceeds that of the non-incremental CNN (76.90%). Figure 8 illustrates the confusion matrix of the model trained on all species. We note that the precisions of AN and AV are improved (AN from 43.14% to 44.19% and AV from 87.50% to 100%). The precision of the other difficult species is reduced, but it remains higher than with non-incremental learning (cf. Figure 4); for example, the precision of the NN species is reduced from 85.58% to 60.99%, which is still higher than the 25.20% obtained with non-incremental learning. Similarly, for the species CC, the precision is reduced from 87.50% to 62.50%, which is much better than the 16.67% obtained with non-incremental learning. Thanks to the loss function with knowledge distillation, the model is instructed not to forget too much of the knowledge acquired in the first training. The remaining forgetting of old knowledge is due to similarities between the old and new species and to the fact that the newly added species are more representative. We also note that the precision of some representative fish species is reduced, for example, CL from 97.23% to 87.47% and DR from 93.75% to 82.73%, because the network becomes more tied to the old species through the knowledge distillation loss. We achieve our objective of keeping the model stable when introducing new species while maintaining high performance on the difficult species already learned. Table 3 shows the comparison of our proposed approach with state-of-the-art methods on the LCF-15 benchmark dataset. We note that we implemented the modified AlexNet [24] and FishResNet [28] approaches using the same parameters as provided in their papers. We also implemented the YOLOv3-based approach [19] using spatial information from RGB images.
From Table 3, we observe that our proposed approach based on incremental learning for live fish species classification outperforms the state-of-the-art methods. We note that Salman et al. [31] reached an accuracy of 93.65% by testing their model on 7500 fish images issued from the LCF-15 dataset, but these fish images are not from the original test set provided in the dataset; they did not use the original training and test split of LCF-15. Furthermore, Jalal et al. [19] did not use the original test set provided in the LCF-15 benchmark dataset. Instead, they merged the training and test sets and then took 70% of the samples for training and 30% for testing. The test set of the LCF-15 benchmark dataset is highly blurry compared with the training set, which explains the higher accuracy in their article compared with ours. Table 3. Comparison of fish recognition accuracies of various methods on the LCF-15 dataset.

Conclusions
In this paper, we proposed a new CNN approach based on incremental learning for the live fish classification task in an unconstrained underwater environment. We proposed to first train ResNet50 on the hard fish species and then add the easy fish species using incremental learning to mitigate the forgetting problem. For this, we modified the loss function by adding a knowledge distillation loss. Experiments on the LifeClef 2015 Fish benchmark dataset demonstrated that the incremental precisions are higher than the non-incremental precisions, and that our proposed approach outperforms various state-of-the-art methods for fish species identification.
Our future work will aim to further improve the precision of both old and new species in order to improve the overall performance of the system.

Conflicts of Interest:
The authors declare no conflict of interest.