LIFRNet: A Novel Lightweight Individual Fish Recognition Method Based on Deformable Convolution and Edge Feature Learning

Abstract: With the continuous development of industrial aquaculture and artificial intelligence technology, automation and intelligence in aquaculture are becoming increasingly common, and the related technologies are developing ever faster. Individual fish recognition could provide key technical support for fish growth monitoring, bait feeding and density estimation, and also provide strong data support for precision fish farming. However, individual fish recognition faces significant hurdles due to the complexity of the underwater environment, the high visual similarity between individual fish and the real-time requirements of the task. In particular, the complex and changeable underwater environment makes it extremely difficult to detect individual fish and extract their biological features. In view of the above problems, this paper proposes an individual fish recognition method based on a lightweight convolutional neural network (LIFRNet). The proposed method can extract the visual features of underwater moving fish accurately and efficiently and assign each fish unique identity information. It consists of three parts: the underwater fish detection module, the underwater individual fish recognition module and the result visualization module. In order to improve the accuracy and real-time availability of recognition, this paper proposes a lightweight backbone network for fish visual feature extraction. This research constructed a dataset for individual fish recognition (DlouFish), in which the fish were manually sorted and labeled. The dataset contains 6950 images of 384 individual fish. Simulation experiments were carried out on the DlouFish dataset. Compared with YOLOV4-Tiny and YOLOV4, the accuracy of the proposed method in fish detection increased by 5.12% and 3.65%, respectively. Additionally, the accuracy of individual fish recognition reached 97.8%.


Introduction
With the continuous development of aquaculture and the expansion of the farming scale, the old farming model of relying on experience, great effort and the weather has become increasingly inadequate for the needs of current agricultural production and management. The aquaculture industry is gradually transforming from extensive production to factory-based, large-scale and intelligent production. Because of the continuous expansion of the aquaculture scale and the range of farmed species, it is of great significance to effectively obtain and analyze the important information generated in the production process; this is also very important for reducing the risk of aquaculture, improving the economic benefits of enterprises and reducing the labor intensity of employees.
In the process of fishery production, an accurate and real-time grasp of the body length, weight, health status and behavior of farmed fish could provide key analysis data for bait feeding, water quality management, disease control, feed formula management, etc., and in turn provide important data support for production management decision making. However, the acquisition of these data requires the individual recognition of fish. This means that, on the basis of the identified fish species, it is necessary to further identify each individual fish in order to bind the measured growth characteristics to that specific individual. To achieve this goal, people have tried a variety of methods, including placing tags on fish bodies, using RFID technology and implanting tracking devices. However, these methods are highly invasive, can cause great damage to individual fish and have high implementation costs, so they are difficult to popularize widely.
There is no doubt that non-contact biometric feature extraction is an ideal way to achieve individual fish recognition. Due to its poor robustness and generalization, traditional computer vision technology could not meet the actual production needs of individual fish recognition. However, deep learning and artificial intelligence technology have developed rapidly in recent years, and their combination with many fields has shown great technical advantages and application value, which fully proves their effectiveness and practicality. Artificial intelligence and deep learning have been applied successfully in pose estimation, object detection, autonomous driving and target recognition, which makes it possible to use them to solve the individual recognition problem of fish. In particular, large-scale deep learning algorithms have achieved high accuracy in face recognition and have been widely used. As research has deepened, we have found that there are many challenges in solving the problem of individual fish recognition using computer vision technology. Unlike other fields, accurate and real-time individual fish recognition has its own characteristics, and it is not feasible to apply mature technologies from other domains directly. Underwater real-time individual fish recognition faces many new situations and challenges (Figure 1):
Figure 1. Individual fish images taken underwater. They contain extreme occlusion between individuals, chromatic aberration variation, a complex undersea environment and a high degree of visual similarity among fish.
1. Complex underwater environment: The primary difficulty of the underwater recognition task is that individual fish recognition must deal with the complex and changeable underwater environment. Due to poor underwater lighting conditions, the quality of underwater video and image data is lower than that of data obtained under normal conditions. In addition, some water bodies exhibit large chromatic aberration changes, turbid water and interference from many non-fish targets such as algae. This makes it difficult to effectively obtain the visual characteristics of fish, so accurately recognizing individual fish is a great challenge.

2. Serious occlusion between fish: The majority of underwater fish activity occurs in groups. Many fish swim fast and are small in size, and individual fish shield each other. However, individual recognition needs to accurately separate an individual fish from other fish and the surrounding environment, and then extract the visual feature information of its torso. Therefore, it is very difficult to effectively extract the torso features of severely occluded individual fish.

3. High visual similarity between individual fish: Different fish species have unique visual characteristics. However, the visual differences between individual fish of the same species are very small and the visual similarities are strong. Some individual fish are challenging to distinguish directly with the naked eye, so the recognition algorithm must precisely capture the tiny visual variations between individuals.
Therefore, this paper proposes an individual fish recognition method (LIFRNet) based on a lightweight convolutional neural network. The main contributions of this paper are as follows:
1. In the fish detection part, the CBAM attention mechanism module is added, which greatly improves the accuracy of object detection at the cost of a small number of parameters by attending to both the channel and spatial dimensions.
2. In the fish recognition part, we use a combination of 1 × 1 convolution and BN layers to learn the edge features of fish, use deformable convolution, which is better suited to the swimming postures of fish, and use the Mish activation function instead of ReLU to obtain smaller intra-class distances and larger inter-class distances.
3. We construct and annotate the individual fish recognition dataset (DlouFish), which contains a total of 6950 images of 384 individual fish, numbered by individual, to facilitate the feature extraction, training and prediction of underwater individual fish using deep neural networks.
The structure of this paper is as follows: the second section introduces the related work on individual fish recognition; the third section introduces our proposed lightweight convolutional neural network for individual fish recognition; the fourth section presents the results of the simulation experiments; and the fifth section summarizes the proposed work and discusses future work.

Related Work
With the wider application of individual animal recognition, it has attracted more and more attention from experts and scholars and has become a hot research topic in academia and industry. Researchers analyze the characteristics of different animals in a targeted manner, design different methods to realize individual animal recognition and construct different datasets to train and test individual recognition algorithms. In order to fill the gap in northeast tiger data, Li et al. [1] published a dataset of over 8000 video clips of 92 northeast tigers in 2019. Based on this dataset, Liu et al. [2] used a pedestrian re-identification method, extracted features from various body parts of tigers and used a partial pose-guided global learning approach to complete the re-identification of the northeast tiger. In order to increase the precision and speed of cow detection in actual production scenarios, Xu et al. [3] used a facial recognition framework combining Retinaface and ArcFace. The automatic cow recognition technique proposed by Li et al. [4] used the Zernike matrix as a cow feature extractor, followed by linear discriminant analysis on the collected features and a support vector ensemble classification approach for individual cow recognition. Ghosh et al. [5] analyzed the performance of various deep CNN-based models using an identical set of hyperparameters trained end-to-end on a pig breed dataset and a goat breed dataset, respectively. The experimental results showed that MobileNetV2 was the best deep CNN model for goat breed classification and InceptionV3 was the best model for pig breed classification.
In 2016, Villon et al. [6] analyzed and experimented with two different methods, deep learning and SVM classifiers, to detect and identify fish, and discussed their advantages and disadvantages. Practice has proved that, with the improvement of computing power and the arrival of the age of big data, the use of deep learning for fish detection and classification has become a major trend. Tamou et al. [7] used the pre-trained AlexNet network to extract features from the foreground fish images of an available underwater dataset and then used an SVM classifier for classification; combining the AlexNet convolutional neural network with transfer learning realized the automatic classification of fish species, with an accuracy of 99.45% obtained on the Fish Recognition Ground Truth dataset. Blount et al. [8] proposed the Flukebook platform in the field of underwater species recognition, which combines photo algorithms with data management and infrastructure for whales and dolphins. Flukebook was trained on 15 different species to form 37 species-specific recognition processes, and then applied to cetacean photo recognition through ongoing collaboration between computer vision researchers, software engineers and biologists. By enhancing ResNet, enriching the feature information output for identified objects and improving the utilization of feature information, Zhao et al. [9] proposed a composite fish detection framework based on composite backbone networks and enhanced path aggregation networks. Nixon [10] proposed a neural network capable of identifying, categorizing and counting 11 fish species to track the reproductive activity of fish populations, using YOLOv4 and Darknet as the infrastructure and architecture. Based on the spot features on shark skin, Arzoumanian et al. [11] developed a shark feature recognition library and employed feature matching for individual shark recognition.
In contrast to terrestrial animal recognition, underwater species recognition makes it more challenging to train a high-performance recognition model due to the noisy nature of underwater imagery. To address these issues, Kaur et al. [12] proposed the Atmospheric-Light-Enhancement Algorithm (ALE), which includes a preprocessing step for underwater images that acts on the intensity, contrast and sharpness of the object to improve visualization quality. In order to train precise deep neural network fish recognition models from noisy large-scale underwater photos, Zhang et al. [13] introduced AdvFish, a deep adversarial learning framework that applies adaptive perturbation methods to adversarially perturbed images. Deep et al. [14] employed single-image super-resolution approaches to tackle the issue of limited discriminative information in low-resolution images, and deep learning methods to explicitly learn discriminative features from relatively low-resolution images. Syreen et al. [15] proposed the Iterative Grouped Evolution Network (IGCN) to divide all candidate areas into fish and non-fish entities, using a hybrid fusion of optical flow and VGG16 at the first level; on the LifeCLEF 2015 fish dataset, a detection accuracy of 94.05% was achieved. Villon et al. [16] used GoogLeNet to extract features and adopted the Softmax classification method to detect reef fish. Rauf et al. [17] proposed a CNN architecture containing 32 depth layers, which can better obtain valuable features from images to complete fish species recognition; a dataset named Fish-Pak was also provided, including 915 images of six different kinds of fish. Xu et al. [18] applied the YOLO deep learning model to three different datasets recorded at real-world hydropower sites to identify fish in underwater videos, with a mAP score of 0.5392. Petrellis [19] proposed fish morphological feature recognition based on deep learning technology: first, the fish and background were separated by object detection and image segmentation, and then the size of the fish and the positions of key points were measured by aligning landmarks; the accuracy of fish size estimation was 80-91%. Rosales et al. [20] created a fish detector using Faster R-CNN to locate fish; the model achieved a mini-batch accuracy of 99.95% with an RPN mini-batch accuracy of 100%. Jalal et al. [21] combined optical flow and a Gaussian mixture model with a deep neural network to obtain temporal information and identify fish moving freely in the background, thus proposing a method to classify fish in unconstrained underwater video; classification accuracies of 91.64% and 78.8% were achieved on the LifeCLEF 2015 and UWA datasets, respectively. Hossain et al. [22] proposed an automatic monitoring system for marine organisms, which uses GMM background subtraction for detection and the Pyramid Histogram Of visual Words (PHOW) feature with an SVM classifier for classification; on the CLEF 2015 dataset, still images can be classified with an accuracy of 91.7%. Ben Tamou et al. [23] used a transfer learning framework to propose a training loss curve method for targeted data enhancement, and proposed a hierarchical CNN classification method to classify fish first into family levels and then into species categories, achieving 81.53% accuracy on the LifeClef 2015 Fish dataset. Ben Tamou et al. [24] later achieved 81.83% accuracy on the LifeClef 2015 Fish dataset using a new incremental learning strategy to train the network: they first learned difficult species and then learned new species using knowledge distillation to complete the classification task of live fish species in underwater images.
However, the purpose of much existing underwater fish recognition work is to classify fish by species rather than to accurately recognize different individual fish of the same species. At the same time, there are still some problems in the field of individual fish recognition, such as the lack of data resources and the large number of model parameters. In addition, the complexity and uncertainty of the underwater environment lead to a decrease in recognition accuracy.
In response to the above issues, a lightweight-convolutional-neural-network-based individual fish recognition method (LIFRNet) is proposed. The schematic diagram of individual fish recognition is shown in Figure 2.

The Proposed Work
With the continuous development of computer vision technology, biometric technology has been widely used; in particular, face recognition technology has been widely applied in public security, cash payment and identity recognition, among other fields. However, research on the individual recognition of underwater fish is still emerging. Individual fish recognition differs greatly from other biometric recognition. First of all, individual fish recognition is primarily conducted using underwater equipment. Because of the complexity of the underwater environment, the water quality is frequently poor, which results in degraded image data and obscured fish biometric features, and severely hampers the extraction of those features. Secondly, the biological characteristics of some species of fish have only very small differences, and the similarity between individuals is very high; the feature extraction method must capture these small differences to accurately distinguish different individuals. Therefore, in view of the various problems in individual fish recognition, we must design individual recognition algorithms in a targeted manner.
Committed to solving the problems of difficult biological feature extraction, high visual similarity between individual fish and high real-time requirements in individual fish recognition, this paper designs a lightweight real-time individual recognition method for underwater fish: LIFRNet (Figure 3). LIFRNet consists of three parts: the underwater fish object detection module, the individual recognition module and the visualization module. The underwater fish object detection module detects the fish in the data stream in real time and separates individual fish from the surrounding environment. The individual recognition module extracts the biological features of each detected fish and obtains a feature map of fish body features. Finally, the visualization module uses the optimal weights obtained through multiple training iterations to visually identify the fish school and output the results.
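The three-stage flow described above (detect, extract features, visualize identities) can be sketched in plain Python. This is an illustrative sketch only: the function name `run_lifrnet`, the tuple-based inputs and the nearest-neighbor matching against a gallery of known identities are all assumptions for exposition, not the authors' implementation.

```python
import math
from typing import Dict, List, Tuple

Box = Tuple[int, int, int, int]  # x, y, w, h of one detected fish

def run_lifrnet(detections: List[Tuple[Box, List[float]]],
                gallery: Dict[str, List[float]]) -> List[Tuple[Box, str]]:
    """For each detected fish (bounding box + feature vector produced by the
    recognition module), assign the gallery identity with the smallest
    Euclidean feature distance; the visualization stage would then draw
    each box with its assigned identity label."""
    def dist(a: List[float], b: List[float]) -> float:
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    return [(box, min(gallery, key=lambda fid: dist(feat, gallery[fid])))
            for box, feat in detections]

# toy usage: two detections matched against a two-fish gallery
gallery = {"fish_001": [0.0, 1.0], "fish_002": [1.0, 0.0]}
out = run_lifrnet([((5, 5, 40, 20), [0.1, 0.9]),
                   ((60, 8, 38, 19), [0.9, 0.2])], gallery)
print(out)  # [((5, 5, 40, 20), 'fish_001'), ((60, 8, 38, 19), 'fish_002')]
```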

Underwater Fish Detection Module
Through extensive experiments, we found that existing object detection methods are mainly designed to detect objects in normal environments. However, their results are not satisfactory on blurry, low-frame-rate footage and small objects, so these methods cannot be directly applied to the recognition of underwater individual fish. This paper proposes the YOLO-CBAM method for underwater fish object detection. The proposed method uses YOLOV4-Tiny as the main framework and makes targeted improvements according to the characteristics of individual fish recognition. After many repeated experiments, we found that, although YOLOV4-Tiny has a lighter network structure than the ordinary YOLO algorithm, its detection accuracy drops significantly. In order to retain the advantages of the YOLOV4-Tiny network, namely its small number of parameters and fast detection speed, and to further improve the detection ability of the network while keeping it lightweight, this paper further optimizes the backbone network of YOLOV4-Tiny according to the characteristics of individual fish recognition (Figure 4). The algorithm in this paper integrates the convolutional block attention module (CBAM) [25] into the object detection backbone network, so that the network can adaptively focus on the more important parts of images, including blurry image data, and thereby learn the visual characteristics of underwater fish objects. The structure of the CBAM module is shown in Figure 5.

The visual attention mechanism module used in this paper consists of two sub-modules: the channel attention module and the spatial attention module. The spatial attention module obtains weights by locating objects and performing transformations, so it can find the most important parts of the image for learning. The channel attention module obtains the importance of each feature through modeling and can assign different features according to different tasks. In this paper, the attention mechanism and YOLOV4-Tiny are organically integrated, which not only preserves the lightweight quality of the network model but also improves its accuracy to a certain extent. In the feature extraction network, CBAM is added to the two feature layers extracted by the backbone network and to the up-sampled result.
The implementation of the channel attention mechanism is divided into two parts, as shown in Figure 6.Firstly, global max pooling and global average pooling are performed on the feature layer, respectively, and then the shared fully connected layer is used for processing.Then, the two results are added and put into the Sigmoid function to obtain the weights of each channel in the feature layer.Finally, the weights are multiplied by the original input feature layer to generate the input features required by the spatial attention module.The spatial attention mechanism takes the maximum value and average value of the channel of each feature point for the input feature layer, and then stacks the two results and uses a 7 × 7 convolution to adjust the number of channels to 1.After being processed by the Sigmoid function, the weights in the feature layer are obtained, and finally, the weights are multiplied by the original input feature layer to obtain the final generated features (Figure 7). Figure 8 shows the results of the fish object detection of the method proposed in this paper.In Figure 8a, we found that the algorithm only detected two individual fish by YOLO-tiny, and all individual fish are detected by the proposed method.
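As a rough PyTorch sketch (the reduction ratio of 16 in the shared fully connected layer is a common CBAM default, not a value given in the paper), the two attention sub-modules can be written as:

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    # Global max + average pooling -> shared fully connected layers -> Sigmoid
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
        )
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):
        b, c, _, _ = x.shape
        avg = self.mlp(x.mean(dim=(2, 3)))            # global average pooling
        mx = self.mlp(x.amax(dim=(2, 3)))             # global max pooling
        w = self.sigmoid(avg + mx).view(b, c, 1, 1)   # per-channel weights
        return x * w

class SpatialAttention(nn.Module):
    # Channel-wise max + mean -> 7x7 conv to 1 channel -> Sigmoid
    def __init__(self, kernel_size=7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):
        mx, _ = x.max(dim=1, keepdim=True)
        avg = x.mean(dim=1, keepdim=True)
        w = self.sigmoid(self.conv(torch.cat([mx, avg], dim=1)))
        return x * w

class CBAM(nn.Module):
    # Channel attention first, then spatial attention, as described above
    def __init__(self, channels):
        super().__init__()
        self.ca = ChannelAttention(channels)
        self.sa = SpatialAttention()

    def forward(self, x):
        return self.sa(self.ca(x))
```

Each sub-module outputs a re-weighted feature map of the same shape as its input, so CBAM can be dropped after any feature layer without changing tensor dimensions.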
Agriculture 2022, 12, x FOR PEER REVIEW

Individual Fish Recognition Module
The effective detection of individual fish in images is the premise of individual fish recognition. After accurately detecting the individual fish in the images, effectively extracting their visual feature information is the core problem that needs to be resolved in order to achieve accurate individual fish recognition. The visual differences between individuals of the same fish species are small, and the recognition network needs to accurately capture the subtle feature differences between individuals. In order to adapt to the characteristics of individual fish recognition, this paper designs a lightweight, deformable-convolution-based individual fish recognition network structure, as shown in Figure 9.


Among them, the function of the distance calculation is to compare the features of the input pictures in order to calculate the similarity of individual fish. The smaller the value obtained after the distance calculation, the higher the similarity between individual fish; the larger the value, the lower the similarity. After repeated tests with different input images, we found that the distances between images of the same fish were much less than 1, while the distances between different fish remained at 1.4-1.7, which is greater than 1. Therefore, we use 1 as the cut-off point to determine whether two images show the same individual fish: if the resulting value is less than 1, the model considers the input pictures to show the same individual fish; if it is greater than 1, it considers them to show different individual fish. In this way, the task of recognizing underwater individual fish is completed.
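A sketch of this decision rule, assuming the embeddings are L2-normalized and compared with the Euclidean distance (the exact distance metric is not stated in the text; the 1.0 threshold and the 1.4-1.7 range for different fish are from the paragraph above):

```python
import torch
import torch.nn.functional as F

def same_fish(emb_a: torch.Tensor, emb_b: torch.Tensor, threshold: float = 1.0) -> bool:
    """Compare two embedding vectors of individual fish.

    Distances well below 1 were observed for the same individual,
    and around 1.4-1.7 for different individuals, so 1.0 is the cut-off.
    """
    emb_a = F.normalize(emb_a, dim=0)           # project onto the unit hypersphere
    emb_b = F.normalize(emb_b, dim=0)
    distance = torch.dist(emb_a, emb_b).item()  # Euclidean distance
    return distance < threshold
```

For normalized vectors the Euclidean distance lies in [0, 2], which is consistent with the 1.4-1.7 range reported for different individuals.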

Backbone Network
The discussion in this work is focused on making the model more lightweight while improving its capacity to identify fish bodies. We made three improvements to the MobileNetV1 [26] backbone network for the individual recognition module: the convolution kernel, the activation function and the average pooling layer. The improved backbone network is shown in Figure 10. A picture of size 112 × 112 × 3 is used as input, and a 1 × 1 × 512 feature vector is obtained from the neural network. The network has 29 layers: 14 layers of 3 × 3 deformable convolutions, 13 layers of 3 × 3 depthwise separable deformable convolutions, 1 layer of 1 × 1 standard convolution and 1 fully connected layer. In Figure 10, dark blue represents the normal deformable convolution and light purple represents the depthwise separable deformable convolution.
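For reference, the MobileNetV1-style depthwise separable block is the building unit the backbone modifies; this sketch shows the standard, non-deformable form with the usual BN + ReLU pairing (the paper replaces the convolutions with deformable ones and the activation with Mish):

```python
import torch
import torch.nn as nn

def depthwise_separable(in_ch: int, out_ch: int, stride: int = 1) -> nn.Sequential:
    # MobileNetV1-style block: a 3x3 depthwise convolution (one filter per
    # input channel) followed by a 1x1 pointwise convolution mixing channels.
    return nn.Sequential(
        nn.Conv2d(in_ch, in_ch, 3, stride=stride, padding=1, groups=in_ch, bias=False),
        nn.BatchNorm2d(in_ch),
        nn.ReLU(inplace=True),
        nn.Conv2d(in_ch, out_ch, 1, bias=False),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
    )
```

Splitting the 3 × 3 convolution this way is what keeps the backbone lightweight: the block needs roughly `3·3·C_in + C_in·C_out` weights instead of `3·3·C_in·C_out` for a standard convolution.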

Deformable Convolution
The trajectory and pose of the fish body swimming in the water vary depending on the characteristics of fish activities underwater, but the standard convolution kernel can only sample the input feature map at a fixed position, which is weak in generalization ability and poorly adaptable to unknown changes.

This paper uses deformable convolution in place of standard convolution to address this issue. Deformable convolution differs from standard convolution by adding an offset (direction) parameter to each sampling element, allowing the convolution kernel to be extended to a wider range during training. Instead of a regular convolution, a deformable convolution of size 3 × 3 is employed in this study [27]. This allows the convolution kernel to adapt its shape to the actual situation and extract input information more effectively without increasing the number of parameters. A comparison of the sampling positions on fish body images after the addition of deformable convolution is given in Figure 11.

Edge Feature Learning
An image edge is the set of pixels around which the gray level changes discontinuously. Edges widely exist between objects and backgrounds, and between objects; therefore, edges are important features for image segmentation, image understanding and image recognition. In the task of underwater individual fish recognition, changes in light and background environment often lead to occlusion and blurring of the body features of individual fish. In such cases, it is difficult to accurately complete the recognition task using the main body features of the fish. At the same time, individuals of the same species usually have similar trunk characteristics, which also poses a challenge for the recognition task.
By comparing a great number of fish pictures, we found that the mouth, tail, fins and other parts of the fish have distinctive characteristics. As shown in Figure 12, the trunk features of the two fish are similar, and the texture features of the mouth and tail play a key role in distinguishing different individuals. Therefore, we believe that when body features cannot be used for effective recognition, the learning of fish body edge features is particularly important.
In order to improve the network's ability to learn edge features, we added a 1 × 1 standard convolution and discarded the pooling layer commonly used in convolutional neural networks. Although the pooling layer has the advantages of preventing overfitting and downsampling, it reduces the network's ability to learn edge features. The reason is that, in a feature map, although the receptive field sizes of the center point and a corner point are the same, the receptive area of the center point contains complete information from the whole picture, while that of a corner point contains only part of the picture. The weights of these points should therefore be different, but the pooling layer treats them with the same weight [28]. When identifying fish bodies with fuzzy trunk features and similar texture features that can only be distinguished by detailed information, the disadvantages of the pooling layer are further amplified.
The 1 × 1 convolution was first used in the Network in Network technique [29]; its calculation method is the same as that of other convolution kernels, the only difference being the size. The authors concluded that the operation of 1 × 1 convolution + Relu can increase the nonlinearity of the network, thus improving the nonlinear fitting ability and classification effect of the network without increasing the number of network parameters.
In this paper, a 1 × 1 convolution kernel with 512 channels is added after the 7 × 7 convolutional kernel to replace the average pooling layer.Since 1 × 1 convolution is performed on the channels, the correlation information between channels can be extracted.The 1024 channels are linearly combined across channels to 512 channels, thus increasing cross-channel information interaction and reducing computational load.
In addition, we added a BN layer and the Mish activation function.Using this combination, the advantages of the pooling layer in preventing overfitting are retained to the greatest extent, and the model learning speed can be accelerated.The nonlinearity can be greatly increased without losing the resolution of the characteristic map.
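The replacement head described in the preceding paragraphs — a 1 × 1 convolution mapping the 1024 backbone channels to 512, followed by a BN layer and the Mish activation, in place of average pooling — can be sketched as follows (the exact layer ordering is our reading of the text):

```python
import torch
import torch.nn as nn

# Head replacing the global average pooling layer: a 1x1 convolution linearly
# combines the 1024 backbone channels into 512, adding cross-channel
# interaction, then BatchNorm and Mish add nonlinearity without losing the
# spatial resolution of the feature map.
head = nn.Sequential(
    nn.Conv2d(1024, 512, kernel_size=1, bias=False),
    nn.BatchNorm2d(512),
    nn.Mish(inplace=True),
)
```

Unlike average pooling, this head preserves the spatial dimensions of the feature map, so edge detail at every position still contributes to the final embedding.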

Mish Activation Function
The generalization ability and adaptability of the network can be significantly enhanced with the use of an appropriate activation function.Relu6 serves as the network's activation function in the conventional MobileNetV2 network to ensure good numerical resolution, even at low precision [30].However, the issue of neuron death is not resolved by Relu or Relu6.The gradient of the function becomes zero when the input is close to zero or negative, making it unable to learn using backpropagation.
In order to avoid such issues in individual fish recognition networks, the Mish function, a self-regularizing non-monotonic function whose smoothness allows information to penetrate deeper into the neural network, resulting in higher accuracy and stronger generalization, was adopted as the activation function in this paper instead of Relu6 [31]. The Mish function is given in Equation (1) and its curve is shown in Figure 13:

y = x · tanh(ln(1 + exp(x))). (1)
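Equation (1) translates directly to code:

```python
import torch
import torch.nn.functional as F

def mish(x: torch.Tensor) -> torch.Tensor:
    # Mish(x) = x * tanh(ln(1 + exp(x))) = x * tanh(softplus(x))
    return x * torch.tanh(F.softplus(x))
```

For negative inputs the output is a small negative value rather than a hard zero, which is exactly the property that avoids neuron death.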
As seen in Figure 13, when the input value is negative, it is not truncated, as with Relu and Relu6; instead, a small gradient is allowed to flow in order to ensure the flow of information, successfully resolving the issue of neuron death. The Mish function also avoids the gradient saturation issue because it is unbounded above. It was found that the Mish activation function is 0.494% better than Swish and 1.671% better than Relu.

Loss Function
Traditional target recognition is usually treated as a classification problem: the categories are labeled and the results are given by Softmax, with the Softmax loss [32] shown in Equation (2). However, as the dataset expands and the categories change, the model must be retrained. For this type of problem, Deng et al. [33] proposed Arcface loss, based on Softmax loss, to improve inter-class separability while reducing the intra-class distance, as shown in Equation (3).
Specifically, Arcface loss fixes the bias b in Softmax loss to 0 and transforms W_j^T f_i into ∥W_j∥ · ∥f_i∥ cos θ_j by the dot-product identity, where θ_j represents the angle between the weights W_j and the features f_i. After normalization makes ∥W_j∥ = ∥f_i∥ = 1, the normalized prediction depends only on the angle θ_j between the features f_i and the weights W_j. The features are then multiplied by a constant S, so that the learned features are distributed on a hypersphere of radius S. Finally, an angle penalty m is added in the direction of θ_j to increase the inter-class distance and reduce the intra-class distance.
In this paper, Arcface loss is used as the loss function of LIFRNet, and the loss converges to 0.001 after 300 epochs.
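A compact sketch of Arcface loss following the description above; the scale S = 64 and margin m = 0.5 used here are the defaults from the original Arcface paper, as the values used in this work are not stated:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ArcfaceLoss(nn.Module):
    # Additive angular margin loss: normalize features and class weights so
    # logits become cos(theta), add margin m to the target-class angle,
    # scale by s, then apply the usual cross-entropy.
    def __init__(self, feat_dim: int, num_classes: int, s: float = 64.0, m: float = 0.5):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(num_classes, feat_dim))
        self.s, self.m = s, m

    def forward(self, features, labels):
        cos = F.linear(F.normalize(features), F.normalize(self.weight))
        theta = torch.acos(cos.clamp(-1 + 1e-7, 1 - 1e-7))
        target = F.one_hot(labels, cos.size(1)).bool()
        logits = torch.where(target, torch.cos(theta + self.m), cos)
        return F.cross_entropy(self.s * logits, labels)
```

Because the margin is added to the angle itself rather than to the cosine, embeddings of the same fish are pulled together on the hypersphere while different fish are pushed apart, which is what produces the distance gap (<1 vs. 1.4-1.7) exploited at recognition time.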

Real-Time Visualization Module
In the real-time visualization module, LIFRNet integrates the optimal weight of the detection and recognition modules to display individual fish information in real time.In fact, different sides of a fish body have different texture features.Therefore, we treat the different sides of the same fish body as two different fish in the training of the recognition module.In the visualization module, we want the same fish to have unique identity information.Therefore, we propose two solutions: (1) Cameras are installed on different sides of the water.Each camera only recognizes fish swimming in the same direction, and then summarizes the information to obtain accurate fish information.(2) Recode the individual fish information, and use the numbers 1 and 2 as the basis to distinguish different sides of the same fish.The individual fish information is obtained through only one camera.
In actual underwater fish body recognition, we found that the fish swam over a wide range, their swimming posture was irregular and their swimming direction often changed, so we chose the recoding method to complete the visualization of fish body information. The rendering of the visualization module is shown in Figure 14, where a fish swimming head-right has the side code 1, such as Fish_5_1, and a fish swimming head-left has the side code 2, for example, Fish_13_2. The purpose of this is to prevent the texture features of different sides from affecting the training of the model; when the same fish swims past the underwater camera in different directions, we can still intuitively and accurately grasp the fish body information through the real-time visualization module.
In the processes of aquaculture, the real-time visualization module is helpful for the aquaculture personnel to pay attention to the individual information of fish at any time, and adjust the aquaculture strategy according to the actual situation to achieve the goal of precise aquaculture.
After detection and recognition by the two modules, LIFRNet integrates their functions into a real-time visualization module, which can detect and identify individual fish in real time through underwater cameras. This enables aquaculture personnel to monitor individual fish information at any time and adjust their aquaculture strategies according to the actual situation.

The Dataset
In the field of underwater fish recognition, there is no publicly accessible dataset for individual fish recognition. Therefore, we developed a fish recognition dataset (DlouFish), shown in Figure 15, through extensive collection and collation. The dataset consists of 6950 labeled individual fish photographs, numbered by individual, and contains 2100 images of koi, 1850 images of puffer fish, 1800 images of clown fish and 1200 images of grass carp. These images come from the internet and from our own photography. Since underwater reference objects are fuzzy, it was difficult to identify fish bodies from the background environment alone; therefore, we extracted frames from video, manually labeled the identity information of each fish according to the continuity of the video, and built the dataset after shuffling the frames.
We divided the dataset into a training set and a test set at a ratio of 9:1. The dataset includes different kinds of fish bodies, such as brocade carp with obvious patterns and puffer fish with high similarity, and the lighting conditions were quite different. The purpose of this was to improve the learning ability of the model during training and to verify the generalization of the model during testing.
In order to facilitate the analysis of the experimental results, we formulated naming rules for the data in the dataset. The numbering rule is individual fish number + image number. For example, the picture named "000101" represents the first picture of the individual fish with ID number 1, and the image named "001111" represents the 11th picture of the individual fish with ID number 11. The advantage of this rule is that we can intuitively judge whether the predicted individual fish is the same one by using the assigned number.
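Under this rule a label can be split programmatically; this helper assumes the first four digits encode the fish ID and the last two the image number, which is our reading of the two examples given:

```python
def parse_fish_label(name: str):
    """Split a DlouFish filename such as '000101' into (fish_id, image_no).

    Assumed layout (from the examples in the text): the first four digits are
    the individual fish number, the last two the image number.
    """
    return int(name[:4]), int(name[4:6])
```

This makes it trivial to check, at evaluation time, whether two predicted images belong to the same individual.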

Experimental Setup
In this research, the experiments were conducted using the Pytorch framework under Ubuntu 20.04, and the computer GPU was a GeForce RTX 3090Ti. The loss function was Arcface, the optimizer was Adam, momentum was 0.9, the batch size was 64, the initial learning rate was 0.001 and the minimum learning rate was 0.0001. The algorithm evaluation metric was mAP, and the learning rate descent method was step, with 300 epochs of training.
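The setup above corresponds roughly to the following PyTorch configuration. The step interval and decay factor of the scheduler are illustrative guesses (only "step" descent from 1e-3 toward 1e-4 is stated), and `model` is a stand-in module:

```python
import torch

# Stand-in for the LIFRNet recognition network (512-d embeddings, 384 fish).
model = torch.nn.Linear(512, 384)

# Adam with beta1 = 0.9 (the stated momentum), initial LR 1e-3, batch size 64.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, betas=(0.9, 0.999))

# Step decay over 300 epochs; step_size/gamma here are assumptions chosen so
# the LR reaches the stated floor of 1e-4.
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=150, gamma=0.1)
```

Each training epoch would call `optimizer.step()` per batch and `scheduler.step()` once at the end of the epoch.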

Performance Comparison of YOLOV4-Tiny Incorporating Different Attention Mechanisms
In this research, mainstream attention mechanism modules from recent years, including SE [34], ECA [35] and CBAM, were added to YOLOv4-Tiny and compared with the original YOLOv4-Tiny and YOLOv4 [36]. The experimental results (Table 1) show that incorporating an attention mechanism module significantly increases the accuracy of the model. On our DlouFish dataset, compared with the traditional YOLOv4, the accuracy of YOLOv4-Tiny fused with CBAM improved by 3.65%, while its parameter count was nearly ten times smaller than that of YOLOv4. Among models with similar parameter counts, the model we used performed best and achieved an accuracy of 88.6%.
This research used deep convolutional networks to learn the differences between fish visual features for individual fish recognition. Therefore, when the network predicts the same individual fish, the distance between the given pictures should be as small as possible, i.e., the similarity is high; when the network predicts different individual fish, the distance should be as large as possible, i.e., the similarity is low. The performance of LIFRNet in recognizing different individual fish is shown in Table 2.
We adopted two different methods when applying deformable convolution. The first (Method 1) adds a 3 × 3 deformable convolution after the 1 × 1 convolution, together with a BN layer and the Mish activation function, without changing the standard convolutions in the backbone network. The significance of this is to increase the number of network layers by adding a convolution and, at the same time, to allow the activation function to play a bigger role. The other method (Method 2) replaces all 3 × 3 standard convolutions with 3 × 3 deformable convolutions. The experimental results of these two methods are shown in Table 3. The results demonstrate that adding deformable convolution enlarges the distance between distinct fish bodies and has a better overall impact than the other method, which is essentially identical when recognizing the same fish body. However, this has the unintended consequence of dramatically increasing the number of parameters: the original network had 4,231,976 parameters, and this method increased the parameter count by 55.75%.
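As a rough, hypothetical sketch (not the authors' code, and with assumed channel sizes), the parameter cost of the two strategies can be estimated by counting convolution weights; a deformable convolution carries the standard kernel weights plus an offset-prediction convolution that outputs 2·k·k offset channels per spatial position:

```python
# Illustrative parameter accounting for the two deformable-convolution
# strategies described above (channel sizes are assumptions, not the
# actual LIFRNet configuration).

def conv_params(c_in, c_out, k):
    """Weight count of a standard k x k convolution (bias omitted)."""
    return c_in * c_out * k * k

def deform_conv_params(c_in, c_out, k):
    """Standard k x k weights plus an offset-prediction convolution
    that outputs 2*k*k offset channels per spatial position."""
    return conv_params(c_in, c_out, k) + conv_params(c_in, 2 * k * k, k)

# Method 1: keep the backbone convolutions and ADD a 1x1 convolution
# followed by a 3x3 deformable convolution (BN and Mish contribute a
# negligible parameter count).
def method1_extra(c):
    return conv_params(c, c, 1) + deform_conv_params(c, c, 3)

# Method 2: REPLACE each 3x3 standard convolution with a deformable
# one; the per-layer overhead is only the offset-prediction branch.
def method2_extra(c):
    return deform_conv_params(c, c, 3) - conv_params(c, c, 3)

# Sanity check against the figures reported above: a 55.75% increase
# over the original 4,231,976 parameters is roughly 6.59 M parameters.
print(int(4231976 * 1.5575))
```

Under this accounting, even the offset branch alone for a hypothetical 64-channel layer costs `conv_params(64, 18, 3)` = 10,368 weights, which suggests why replacing every 3 × 3 convolution inflates the model so quickly.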
In this research, we finally chose to use deformable convolution in place of standard convolution, without increasing the number of parameters, for the following reasons:
1. YOLOv4-Tiny with a fused CBAM attention mechanism, rather than YOLOv4, is used for object detection, because our goal is a lightweight solution for individual fish recognition.
2. While adding deformable convolution improves the effect, it is not particularly helpful for actual fish detection. The reason is that we set a threshold on the prediction results returned by the network: when the distance is larger than the threshold, the predicted image is deleted from the list of candidates, so the enlarged distance does not affect the recognition accuracy.
3. When recognizing the same fish, there is almost no difference between the two methods; that is, when the distance is less than the threshold, their effect on the recognition accuracy is equal.
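The threshold rule in reason 2 can be sketched as follows (the threshold value and fish identifiers are hypothetical; the paper sets the threshold empirically):

```python
# Hypothetical sketch of the candidate-filtering rule described above:
# predictions whose embedding distance exceeds the threshold are removed
# from the candidate list before the final identity is chosen.

DISTANCE_THRESHOLD = 1.0  # assumed value, chosen only for illustration

def filter_candidates(predictions, threshold=DISTANCE_THRESHOLD):
    """predictions: list of (fish_id, distance) pairs from the network."""
    return [(fid, dist) for fid, dist in predictions if dist <= threshold]

candidates = [("fish_007", 0.31), ("fish_112", 1.42), ("fish_045", 0.88)]
kept = filter_candidates(candidates)
print(kept)  # "fish_112" is dropped; the ranking of the rest is unchanged
```

Because only candidates beyond the threshold are discarded, enlarging the inter-class distance changes which candidates are filtered out, not the ordering of the ones that remain.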

Analysis of Experimental Results in Different Background Environments
In this experiment, we performed background elimination on different pictures of individual fish. This tested the effect of the background environment on the fish similarity distance when the model recognizes different individuals of the same category. The experimental results are shown in Figure 16. They show that the similarity distance of individual fish changed only slightly when we performed background elimination on one of the pictures: the distance became smaller, but the change was extremely small, which means that differences in the background environment have only a slight effect on the recognition ability of the model. When we eliminated the backgrounds of both pictures at the same time, the similarity distance was almost the same as without background elimination. This indicates that differences in background color had only a minimal effect on the model: it focuses more on extracting texture features from the individual fish than on learning features of the background environment.
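As a minimal sketch of the quantity being measured (the embeddings below are invented for illustration), the similarity distance can be read as a Euclidean distance between normalized feature embeddings, so a background change that barely moves the embedding barely moves the distance:

```python
# Toy example (invented embeddings): Euclidean distance between
# L2-normalized feature vectors as the "similarity distance".

def l2_normalize(v):
    norm = sum(x * x for x in v) ** 0.5
    return [x / norm for x in v]

def similarity_distance(a, b):
    a, b = l2_normalize(a), l2_normalize(b)
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

fish_a       = [0.90, 0.10, 0.40]  # embedding with background
fish_a_no_bg = [0.88, 0.12, 0.41]  # same fish, background removed
fish_b       = [0.10, 0.90, 0.30]  # a different individual

# Removing the background shifts the distance only slightly, far less
# than the gap between different individuals.
print(similarity_distance(fish_a, fish_a_no_bg))
print(similarity_distance(fish_a, fish_b))
```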

Analysis of Experimental Results under Different Backbone Networks
In the following part of our research, we conducted experiments with ResNet50 [37], IResNet50 [38], MobileFaceNet, MobileNetV2 and our proposed method. The experimental results are shown in Tables 4 and 5. It can be seen that the distance of our method is smaller when testing the same fish, with the average distance decreased by 0.284, while the distance is larger when recognizing different individuals.
In addition, the ResNet50 network, with a parameter count six times larger than ours, achieved 5.86% lower accuracy. On the DlouFish dataset, our Acc1 reached 97.8%.

Conclusions
In this paper, we proposed a lightweight algorithm for individual fish recognition that can lessen the negative effects of irregular fish swimming and the complex underwater environment. We also constructed and labeled a fish recognition dataset (DlouFish), which contains 6950 images of 384 fish numbered by individual, to fill the dataset gap in the field of underwater live fish recognition. The experimental results demonstrate that the algorithm proposed in this study performs both fish detection and fish recognition with considerably higher accuracy and is capable of handling the underwater fish recognition challenge. In our upcoming studies, we will continue working on underwater object detection and enhance the performance of the model in more difficult environments.

Figure 2. Schematic diagram of individual fish recognition.

Figure 6. Schematic diagram of the structure of the channel attention model.

Figure 7. Schematic diagram of the structure of the spatial attention model.

Figure 9. Schematic diagram of the network structure of the individual fish recognition module.

Figure 8. Performance of individual fish detection with CBAM: (a) the performance of the normal detection method; (b) the performance of the detection method with CBAM.

Figure 10. The network architecture used for the recognition network. Each intermediate tensor is labeled with filter size, channels and stride. Activation and batch normalization layers are inserted after each convolution but are not pictured here.

Figure 11. Standard convolution and deformable convolution: (a) is standard convolution and (b) is deformable convolution. The circle in the figure represents the change of the convolution range.

3.2.3. Edge Feature Learning
Image edges refer to sets of pixels around which the gray level changes discontinuously. Edges widely exist between objects and backgrounds, and between objects. Therefore, edges are important features for image segmentation, image understanding and image recognition. In the task of underwater individual fish recognition, changes of light and background environment often lead to occlusion and blurring of the body features of individual fish. In such cases, it is difficult to accurately complete the recognition task using the main features of the fish body. At the same time, individuals of the same species usually have similar trunk characteristics, which also brings challenges to the recognition task.
By comparing a great number of fish pictures, we found that the mouth, tail, fins and other parts of the fish have distinctive characteristics. As shown in Figure 12, the trunk features of the two fish are similar, and the texture features of the mouth and tail play a key role in distinguishing different individuals. Therefore, we believe that when body features cannot be used for effective recognition, learning the edge features of the fish body is particularly important.
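The edge definition above (pixels whose gray level changes discontinuously) is classically approximated with gradient operators such as Sobel; the toy sketch below only illustrates that definition, since LIFRNet learns its edge features with convolutional layers rather than a fixed operator:

```python
# Illustration of image edges via the Sobel operator (plain Python
# lists; LIFRNet itself uses learned convolutions, not this filter).

SOBEL_X = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]
SOBEL_Y = [[-1, -2, -1], [0, 0, 0], [1, 2, 1]]

def sobel_magnitude(img):
    """Gradient magnitude for the interior pixels of a 2-D gray image."""
    h, w = len(img), len(img[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            gx = sum(SOBEL_X[j][i] * img[y + j - 1][x + i - 1]
                     for j in range(3) for i in range(3))
            gy = sum(SOBEL_Y[j][i] * img[y + j - 1][x + i - 1]
                     for j in range(3) for i in range(3))
            out[y][x] = (gx * gx + gy * gy) ** 0.5
    return out

# A vertical step edge: the gray level jumps from 0 to 255, so the
# gradient magnitude is large exactly at the discontinuity.
img = [[0, 0, 255, 255] for _ in range(4)]
edges = sobel_magnitude(img)
```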

Figure 12. The fish with similar body features and distinct edge point features.

Figure 15. Example images of the DlouFish dataset: (a) different individual fish; (b) the percentage of the number of various classes in the dataset.

Figure 16. Experimental results of different background environments.

Table 1. Accuracy comparison of the incorporation of different attention mechanisms.

Table 2. The effects of different stages of improvement on the recognition effect.

Table 3. The influence of deformable convolution on individual fish recognition.

Table 4. Comparison of fish distance performance in different networks.

Table 5. Performance comparison of different networks.