Photo Identification of Individual Salmo trutta Based on Deep Learning

Abstract: Individual fish identification and recognition is an important step in the conservation and management of fisheries. One of the most frequently used methods involves capturing and tagging fish. However, these processes have been reported to cause tissue damage, premature tag loss, and decreased swimming capacity. More recently, marine video recordings have been extensively used for monitoring fish populations. However, these require visual inspection to identify individual fish. In this work, we propose an automatic method for the identification of individual brown trout, Salmo trutta. We developed a deep convolutional architecture for this purpose. Specifically, given two fish images, multi-scale convolutional features were extracted to capture low-level features and high-level semantic components for an embedding space representation. The extracted features were compared at each scale to capture a representation for individual fish identification. The method was evaluated on a dataset called NINA204, based on 204 videos of brown trout, and on a dataset TROUT39 containing 39 brown trout in 288 frames. The identification method distinguished individual fish with 94.6% precision and 74.3% recall on a NINA204 video sequence with significant appearance and shape variation.


Introduction
There is a need for effective in situ fish monitoring methods with aims to confirm the presence of a single species, track changes in threatened species, or document long-term changes in entire communities [1]. The adoption of technology in such areas is expanding as the use of video recording systems becomes widespread. Video recording has a number of advantages, such as being less labor-intensive [2] and capable of covering large areas of habitat, and it can be used in areas that are difficult to cover by other methods [3,4]. Video recordings also have the advantage that a viewer is able to pause, rewind, or fast-forward the video, thereby increasing accuracy and precision [5]. The use of video methods also has advantages over electrofishing, as the latter can cause injury or death to the fish [6] and has also been shown to alter the reproductive behaviors of fish [7]. The use of video recordings has been applied around the world; examples include the Wenatchee River (USA) [8], Tallapoosa River (USA) [9], Mekong River (Laos) [10], Ongivinuk River (USA) [11], Ebro River (Spain) [12], Uur River (Mongolia) [13], and Gudbrandsdalslågen (Norway) [14]. Robust and reliable systems would provide important information about population counts and movement [15], and therefore, there is a need for further research to enable the use of automatic systems.
Video recordings, including those from fish ladders, are in many cases viewed manually [16], such as in the work of [8,10]. Visual inspection by experts with the goal of classifying or identifying individual fish has disadvantages compared to automatic systems: it depends on an expert observer [16], observer learning can impact accuracy [8], and it is more costly [17] and time-consuming [8]. For continuous video recording, the review time is extensive, and the process is prone to observer fatigue. Event-based recording, where recording starts due to an event, for example, a fish passing a sensor, has been shown to be less labor-intensive and to exhibit fewer recording errors [18]. Nonetheless, there is a need for automatic methods to analyze video recordings efficiently and at a low cost.
Ladder counters using video recording systems have been popular, and videos have been manually inspected to detect escaped farmed Atlantic salmon [19], to monitor the migration of Atlantic salmon Salmo salar and sea trout Salmo trutta morpha trutta [20], and also to estimate population size [21]. Research has been carried out to automate the inspection, including counting the number of fish [22], detecting stocked fish [23], and classifying species [24]. Deep learning has been shown to be a promising method for estimating length, girth, and weight [25,26]. These are valuable tools for monitoring populations and have been shown to provide good precision. However, they do not recognize individual fish.
The identification of individual animals, including fish, has traditionally been carried out with capture-mark-recapture methods, where one inserts a physical mark or tag [27]. For fish, these methods can include visible implant tags, fin-clipping, cold branding, tattoos, and external tag identifiers attached by metal wire, plastic, or string [28][29][30]. These methods have been used successfully for different identification tasks. However, there are drawbacks to using them. In large-scale studies, they become expensive and time-consuming. The methods can also be difficult to use with juvenile or small fish. Tags also have the drawback that they can be lost or destroyed. In addition, the main concern is the physical and behavioral influence tagging has on the marked individuals [31]. In a study by Persat [32], it was stated that fish tagging methods, such as jaw tagging and coded tags, normally do not last longer than 9 months and may cause wounds and infections, increase mortality, and slow the growth of the individual. One should thoroughly consider when 'to tag or not to tag' a fish [33], and guidelines have been made for the surgical implantation of acoustic transmitters [34]. The limitations of traditional capture-mark-recapture methods unveil the need for the non-invasive recognition of individual fish.
In this paper, we propose a deep learning image-based system for the photo recognition of individual brown trout, Salmo trutta. The goal is to be able to match, based on a photo of an individual brown trout, the same brown trout in a set of other images.

Identification Method
Research in deep learning has contributed to advances in a number of applications; those related to this paper include animal tracking [35] and animal recognition [36,37]. Our starting point was similar to recognition in other applications [38], where we extracted features from different layers of the encoder network. An encoder network is a deep neural network that takes an input image and generates a high-dimensional feature vector. The fish recognition model is shown in Figure 1. The input to the identification method was a pair of fish images (Fish A and Fish B in Figure 1). The encoder was composed of stacked convolutional layers. Encoder features were extracted with EfficientNet [39], since networks that perform better on the ImageNet dataset have been shown to learn better transferable representations [40]. The layers at the beginning of the encoder network captured primitive image features, such as edges and textures, while deeper layers captured rich global information. The similarity between the Fish A and Fish B features was computed using the cosine similarity metric φ, which indicated the feature distance between Fish A and Fish B at each scale of the network. FC was a fully connected layer followed by a rectified linear unit activation function (ReLU). The output of the ReLU was passed through a sigmoid function to give a value between 0 and 1, indicating non-identical or identical fish, respectively.
To further explore the spatial features and improve fish recognition performance, the automated identification method needs to incorporate high-level information about the fish and the semantic information in the scene, low-level information about the details in the textures and patterns, as well as an integration of the various contexts extracted at different scales. Bidirectional feature pyramid network (BiFPN) [41] features were extracted with the EfficientNet-B0 encoder network (Figure 1). The BiFPN leveraged a convolutional neural network (CNN) to extract bidirectional feature maps at different resolutions. The feature maps in earlier layers captured the textural and color detail in local regions, while the feature maps in deeper layers captured the semantic information of the whole fish image.
Based on the BiFPN representations of a fish, the problem of fish identification was reduced to a problem of matching low-level and high-level features between the two input images. It is necessary to identify a given fish correctly when it swims from right to left with viewpoint, scale, size, and appearance variations (Figure 2). For this, we employed the cosine similarity φ_{i,j} between the corresponding BiFPN features of the left and right input images (Figure 1). Cosine similarity has been shown to be effective in [42]. We used fish i and j to explain how the distance features were computed. Given the encoder network, BiFPN features at scale s were extracted for fish i, f^s_i, and fish j, f^s_j (Figure 1). The size of each feature f depended on the depth of the network, and we used the last five features for a compact representation. Given the above notation, the cosine distance feature at each scale was given by Equation (1):

φ^s_{i,j} = (f^s_i · f^s_j) / (‖f^s_i‖ ‖f^s_j‖), (1)

where φ^s_{i,j} was the spatial cosine distance at each scale s. We concatenated the corresponding cosine distances from each scale, Φ_{i,j} = {φ^s_{i,j}}_{s=0}^{s=4}, to form the final fused correspondence representation Φ_{i,j}. We applied fully connected layers of size 64 × 4 to encode the correspondence representation Φ_{i,j}, followed by a sigmoid layer. Mathematically, the encoding can be represented by Equation (2):

z_{i,j} = ReLU(W Φ_{i,j} + b), (2)

where W and b are the weights and biases of the fully connected network. The final probability p that the two images in the pair, I_i and I_j, were of the same fish was given by the sigmoid output, as shown in Equation (3):

p(I_i, I_j) = S(z_{i,j}), (3)

where S is a sigmoid function. The output of the network, p(I_i, I_j), lies in [0, 1].
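As an illustration, the per-scale comparison and fusion described by Equations (1)–(3) can be sketched in NumPy. This is a toy stand-in, not the trained network: the feature shapes, the single fused layer, and the weights W and b are simplified assumptions.

```python
import numpy as np

def cosine_similarity(a, b, eps=1e-8):
    """Cosine similarity between two flattened feature maps (Equation (1))."""
    a, b = a.ravel(), b.ravel()
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + eps))

def match_probability(feats_i, feats_j, W, b):
    """Fuse per-scale similarities into a match probability.

    feats_i, feats_j: lists of per-scale feature maps for fish i and j.
    W, b: toy parameters standing in for the FC + ReLU + sigmoid head.
    """
    # Per-scale cosine distances, concatenated into Phi_ij (Equation (1)).
    phi = np.array([cosine_similarity(fi, fj) for fi, fj in zip(feats_i, feats_j)])
    z = np.maximum(W @ phi + b, 0.0)               # FC + ReLU (Equation (2))
    return float(1.0 / (1.0 + np.exp(-z.sum())))   # sigmoid -> p in (0, 1) (Equation (3))

# Toy usage with five scales of random "features"; identical inputs score high.
rng = np.random.default_rng(0)
feats = [rng.standard_normal((8, 8)) for _ in range(5)]
p_same = match_probability(feats, feats, W=np.eye(5), b=np.zeros(5))
```

Comparing an image with itself yields per-scale similarities of 1, so the fused score is close to the upper end of the sigmoid's range.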
We optimized this framework by minimizing the widely used cross-entropy loss over a training set of N pairs, given in Equation (4):

L = −(1/N) Σ_{n=1}^{N} [ q_n log p_n + (1 − q_n) log(1 − p_n) ], (4)

where q_n is the 0/1 label for the n-th input pair, indicating whether or not the pair shows the same fish, and p_n is the predicted probability for that pair.
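The loss in Equation (4) is standard binary cross-entropy over the pair labels; a minimal NumPy version:

```python
import numpy as np

def pairwise_bce_loss(p, q, eps=1e-7):
    """Binary cross-entropy over N match probabilities p and 0/1 labels q (Equation (4))."""
    p = np.clip(np.asarray(p, dtype=float), eps, 1.0 - eps)  # avoid log(0)
    q = np.asarray(q, dtype=float)
    return float(-np.mean(q * np.log(p) + (1.0 - q) * np.log(1.0 - p)))

# Confident correct predictions give a small loss; confident wrong ones a large loss.
low  = pairwise_bce_loss([0.99, 0.01], [1, 0])
high = pairwise_bce_loss([0.01, 0.99], [1, 0])
```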

Dataset
Our dataset was provided by the Norwegian Institute for Nature Research and was the same basis material as used by Myrum et al. [23]. It contained 204 video clips captured in a fish ladder, where 101 videos were of stocked brown trout and 103 videos were of wild brown trout. Each video clip contained only one fish. The videos are referred to as the NINA204 dataset. The video clips were 24 s long with a resolution of 320 × 240 pixels. The videos were of different qualities: they varied in terms of illumination level and illumination uniformity, and they contained distortions such as air bubbles and algae. We annotated the videos with a rectangular bounding box around each fish using the Computer Vision Annotation Tool [43]. As these clips had multiple frames of the same brown trout, we were sure that it was the same individual, providing a ground truth for our analysis. We selected video clips excluding small fish, juvenile fish, and clips where we could not obtain enough images of full fish (excluding images of partial fish). Our dataset contained 49 unique, randomly chosen fish in the training set, with a total of 1943 images, and 48 fish in the test and validation sets, with a total of 1479 images. To train and validate the network, we created matched pairs for a given fish at frame T with {T + ∆t : ∆t ∈ Z} and random unmatched pairs. For all our experiments, ∆t was varied from one to five frames. For robust evaluation, we sampled the test set and created 16,269 unique matched (8K) and unmatched (8K) pairs.
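The matched/unmatched pair construction described above can be sketched as follows. The function name `make_pairs` and the toy frame placeholders are illustrative assumptions; the real pipeline works on cropped video frames.

```python
import random

def make_pairs(tracks, max_dt=5, n_unmatched=None, seed=0):
    """Build labeled training pairs from per-fish frame sequences.

    tracks: dict mapping fish id -> ordered list of frames (placeholders here).
    Matched pairs combine frame T with T + dt (1 <= dt <= max_dt) of the same
    fish; unmatched pairs combine frames of two different fish.
    """
    rng = random.Random(seed)
    matched = [(frames[t], frames[t + dt], 1)
               for frames in tracks.values()
               for t in range(len(frames))
               for dt in range(1, max_dt + 1)
               if t + dt < len(frames)]
    ids = list(tracks)
    n_unmatched = n_unmatched or len(matched)
    unmatched = []
    for _ in range(n_unmatched):
        a, b = rng.sample(ids, 2)                       # two distinct fish ids
        unmatched.append((rng.choice(tracks[a]), rng.choice(tracks[b]), 0))
    return matched + unmatched

pairs = make_pairs({"fish1": ["f1_0", "f1_1", "f1_2"], "fish2": ["f2_0", "f2_1"]})
```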
We also tested our method on a dataset similar to that of Zhao et al. [44], referred to as the TROUT39 dataset. This dataset was labeled in the same way as the previous dataset. Experts from the Norwegian Institute for Nature Research verified that it contained the same individual brown trout, providing a ground truth for our analysis. The dataset contained 39 brown trout with a total of 288 frames taken with a high-definition camera. Note that these images were taken outside the water and varied significantly from the dataset used to train the network.

Implementation Details
Our model was implemented using the PyTorch library on a single NVIDIA GeForce GTX 1080 GPU. Due to the different image sizes in the dataset, we first cropped large boundary margins and resized all images to fixed dimensions with a spatial size of 512 × 512 before feeding them to both encoders, and finally normalized them to [0, 1]. We used Adam [45] as the optimizer with a batch size of 4 and the learning rate α set to 0.0001. The EfficientNet [39] encoder network was initialized with pre-trained ImageNet weights for fine-tuning.
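A minimal NumPy sketch of the preprocessing steps (the margin width `crop` is a hypothetical value; the paper only states that large boundary margins were removed before resizing to 512 × 512 and normalizing to [0, 1], and it does not specify the interpolation, so nearest-neighbor is used here for simplicity):

```python
import numpy as np

def preprocess(img, crop=8, size=512):
    """Crop fixed boundary margins, nearest-neighbor resize to size x size,
    and normalize uint8 pixel values to [0, 1]."""
    img = img[crop:-crop, crop:-crop]            # drop boundary margins
    h, w = img.shape[:2]
    rows = np.arange(size) * h // size           # nearest-neighbor row indices
    cols = np.arange(size) * w // size           # nearest-neighbor column indices
    img = img[rows][:, cols]
    return img.astype(np.float32) / 255.0        # scale to [0, 1]

# A 320 x 240 frame, as in the NINA204 videos.
out = preprocess(np.random.randint(0, 256, (240, 320, 3), dtype=np.uint8))
```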

Training Procedure
To train the network, we started by creating matched and unmatched image pairs. Matched image pairs are images showing the same fish at different times in the video. Unmatched image pairs are randomly generated pairs of two different fish. Matched and unmatched image pairs are labeled as "1" and "0", respectively. Given a sample from the matched and unmatched pairs with its corresponding label, we optimized the loss function defined in Equation (4) and monitored the validation loss. To increase the robustness and reduce the overfitting of our model, we increased the amount of training data by applying random rotations (45°, 90°, 135°) and random vertical and horizontal flips. The network was trained by gradually increasing the difficulty of matched pairs from ∆t = 0 to ∆t = 5, where ∆t is measured in frames. This enabled the network to learn from simple to significant appearance variations between consecutive frames.
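The flip-and-rotate augmentation can be sketched as follows. This is a simplified NumPy stand-in: the 45° and 135° rotations used in the paper require an interpolating rotate (e.g. scipy.ndimage.rotate) and are omitted here, leaving only multiples of 90°.

```python
import random
import numpy as np

def augment(img, seed=None):
    """Random vertical/horizontal flips plus a random multiple-of-90-degree rotation."""
    rng = random.Random(seed)
    if rng.random() < 0.5:
        img = np.flipud(img)                   # random vertical flip
    if rng.random() < 0.5:
        img = np.fliplr(img)                   # random horizontal flip
    return np.rot90(img, k=rng.randrange(4))   # rotate by 0/90/180/270 degrees

aug = augment(np.zeros((512, 512, 3)), seed=1)
```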

Evaluation Metrics
We evaluated the effectiveness of each method using precision (P), recall (R), F1-score (F1), accuracy (A), Likelihood Ratio Positive (LR+) [46], and specificity (Spec) metrics, which are defined as follows:

P = N_TP / (N_TP + N_FP), R = N_TP / (N_TP + N_FN), F1 = 2PR / (P + R),

A = (N_TP + N_TN) / (N_TP + N_TN + N_FP + N_FN), Spec = N_TN / (N_TN + N_FP), LR+ = TPR / FPR,

where N_TP and N_TN are the numbers of true positives and true negatives, and N_FP and N_FN are the numbers of false positives and false negatives, respectively. Furthermore, TPR and FPR represent the true positive rate (i.e., recall) and the false positive rate, defined as N_FP / (N_FP + N_TN), respectively. Accuracy (A) describes the proportion of predictions that agreed with the actual values. Precision (P) represents how well the model was able to predict positive cases, while recall (R) represents how well the model identified actual positives by labeling them as positive. The F1-score is a good measure of the balance between precision and recall. Specificity (Spec) represents the proportion of negatives that were correctly predicted. An accuracy, precision, recall, F1-score, or specificity of 1 was considered perfect, while 0 was the lowest possible. Similarly, an LR+ of 100 would indicate a 100-fold increase in the odds that a pair predicted as matched contains identical fish.
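The metric definitions above can be collected into a small helper; the example confusion-matrix counts below are hypothetical:

```python
def metrics(n_tp, n_tn, n_fp, n_fn):
    """Compute the evaluation metrics from confusion-matrix counts."""
    p    = n_tp / (n_tp + n_fp)                        # precision
    r    = n_tp / (n_tp + n_fn)                        # recall (true positive rate)
    f1   = 2 * p * r / (p + r)                         # harmonic mean of P and R
    acc  = (n_tp + n_tn) / (n_tp + n_tn + n_fp + n_fn)
    spec = n_tn / (n_tn + n_fp)                        # specificity
    fpr  = n_fp / (n_fp + n_tn)                        # false positive rate
    return {"P": p, "R": r, "F1": f1, "A": acc, "Spec": spec, "LR+": r / fpr}

m = metrics(n_tp=90, n_tn=80, n_fp=10, n_fn=20)
```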

Results
We presented the baseline results of the identification method on multiple experimental setups and compared them with histogram of oriented gradients (HOG) [47] and rotation-invariant local binary pattern (LBP) [48] feature-based methods. For each fish, HOG and LBP features were extracted, and the histogram intersection was used to train a linear support vector machine classifier [49]. The HOG features were computed with eight orientations and 32 pixels per block, giving a 2048-dimensional feature vector. Thirty-six rotation-invariant LBP features were computed with a radius of 3 at each pixel, and the resulting texture image was represented with a 512-dimensional histogram. The identification method was evaluated in multiple stages. First, we evaluated the method for geometric variations such as flipping and rotation. To do so, given an image and its transformation, we predicted whether the two brown trout were identical or not. For the NINA204 dataset, an F1-score of 0.974 was obtained for T + 0, and the F1-score decreased to 0.832 for T + 5 (Table 1). Our proposed method performed better than HOG and LBP at all time offsets and in all performance metrics.
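The histogram-intersection similarity used with the HOG and LBP baselines reduces to an element-wise minimum; a minimal sketch (the feature extraction itself, e.g. with scikit-image's `hog` and `local_binary_pattern`, is omitted, and the example histograms are hypothetical):

```python
import numpy as np

def histogram_intersection(h1, h2):
    """Histogram intersection similarity between two feature histograms."""
    return float(np.minimum(h1, h2).sum())

# For normalized histograms, identical inputs intersect fully (similarity 1.0).
h = np.array([0.25, 0.25, 0.5])
sim_same = histogram_intersection(h, h)
sim_diff = histogram_intersection(h, np.array([0.5, 0.25, 0.25]))
```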
Furthermore, we evaluated the identification method on the TROUT39 dataset. The TROUT39 dataset was significantly different from NINA204, as the pictures were taken in open air with a high-definition camera. We obtained an overall performance of 0.592 in the F1-score (Table 2). Sample visual results are given in Table 3.

Table 3. Randomly chosen sample fish in the TROUT39 validation dataset that were correctly recognized. Note that the network was trained using under-water videos, and the results here show the performance of the identification method for out-of-water fish recognition.


Discussion
There is a vast amount of literature related to fish in underwater images, both in freshwater and saltwater. This includes detection and tracking [50], the recognition of species [15], fish behavior [51], quantifying fish habitat [52], and more. However, with regard to the automatic recognition of individual fish, to the best of our knowledge, the only attempts have been made in [44,53]. Zhao et al. [44] proposed two different methods that focused on the head region of brown trout. The first method was based on local density features, where the image was binarized and divided into blocks. For each block, the number of spots was counted. The other proposed method was based on a list of common features, usually called a codebook. The codebook represented different types of information from the input image. Haurum et al. [53] proposed a method for the re-identification of zebrafish using metric learning, where they learned a distance function based on features.
To achieve the non-invasive recognition of fish, the fish need to have unique features, and these features need to remain present throughout the lifetime of the individual. The proposed identification method was robust to transformations, with an F1-score of 0.974. Furthermore, we performed the evaluation from easier (T + 1) to more difficult cases (T + 5) with significant appearance, lighting, and shape variations. For T + 5 examples, the identification method was able to perform fish identification with an F1-score of 0.832. Sample visual results for correctly recognized fish are given in Table 4, and failure cases are shown in Table 5. Fish identification in the wild was difficult, especially under high turbidity and orientation changes (Table 5).
In the TROUT39 dataset, the performance loss compared to the NINA204 dataset was expected, as the training data source was significantly different from the testing data source.
The identification method was able to differentiate pictures of brown trout taken at different time instances (Table 3).

Table 4. Randomly chosen sample fish in the NINA204 dataset that were correctly recognized. The identification method was able to recognize difficult cases with significant variation in shape, turbidity, and appearance.

Although not directly comparable, Zhao et al. [44] reported accuracies of 0.649 and 0.740 on images from the TROUT39 dataset in their publication. This was around the same as the proposed method, which had an accuracy of 0.718 (Table 2). It is important to note that the method from Zhao et al. only used the head region of the brown trout, while our identification method was based on the entire brown trout.

The results indicated that our method is able to perform even when the testing data are significantly different from the training data: the TROUT39 images were taken outside of the water, unlike the NINA204 training data, yet our identification method was able to achieve an accuracy of 0.718.
A comparison to other applications where re-identification was used showed that our accuracies of 0.974 for T + 0 and 0.874 for T + 5 on the NINA204 dataset and 0.718 on TROUT39 were similar to those obtained for other animals. The method used in [54] obtained an accuracy of 0.92 for chimpanzees, while [55] obtained 0.938 for individual lemurs, 0.904 for golden monkeys, and 0.758 for chimpanzees. Our approach was non-invasive, avoiding the limitations of capture-mark-recapture methods. Being able to identify individual fish automatically from photos is desirable due to its lower cost and reduced workload compared to, for example, capture-recapture methods. However, criticism exists with regard to density estimation from camera traps, especially for animals that lack obvious natural markings [56]. This also applies to the proposed identification method, in that it requires fish species with natural markings. Automatic methods, if robust, can be a way to complement visual identification by humans. Identification by observers using camera traps can suffer from interobserver variation [57]. Misidentification due to subjective natural markers has also been stated as a problem [56].
Such a non-invasive system has potential in the monitoring of fish, for example, of migration patterns, or for analyzing behavior. The proposed method could allow one to analyze the behavior of individuals, for example, by detecting the number of times a specific fish passes downstream or upstream in a fish ladder. It could also be useful for monitoring weight growth patterns and health conditions, which are crucial for optimizing cultivation factors such as temperature, fish density, and breeding frequency. It could also be used for population estimation, as it avoids counting the same fish multiple times.
Future work should include an evaluation of this method using additional video datasets. In the NINA204 dataset, images are extracted from a video clip within a time-limited section. Investigating how robust the method is to longer time intervals is a natural extension of this work. The dataset used can be seen as limited compared to what is used for evaluation in other computer vision tasks, and a larger dataset would strengthen the evaluation. Future work may also include the additional evaluation of the identification method using both brown trout and other species, such as European grayling (Thymallus thymallus) or European whitefish (Coregonus lavaretus). Additional work could also be carried out to make the method more robust to different water turbidity levels. One could also study combining deep learning abstract features with traditional handcrafted features.

Conclusions
In this paper, we developed a deep convolutional architecture for the identification of individual fish. We employed a deep multi-scale bidirectional feature pyramid network to capture low-level features and high-level semantic components for an embedding space representation. Based on the pyramid matching strategy, we designed a metric learning feature representation to capture a robust representation for solving the fish identification problem. We demonstrated the effectiveness and promise of our method by reporting extensive evaluations on two different datasets containing brown trout, Salmo trutta. In the NINA204 dataset, comparisons were made within the same environment, while in the TROUT39 dataset, comparisons were made between different environments.