A Novel Detection Refinement Technique for Accurate Identification of Nephrops norvegicus Burrows in Underwater Imagery

With the evolution of convolutional neural networks (CNNs), object detection in the underwater environment has gained a lot of attention. However, due to the complex nature of the underwater environment, generic CNN-based object detectors still face challenges in underwater object detection. These challenges include image blurring, texture distortion, color shift, and scale variation, which result in low precision and recall rates. To tackle these challenges, we propose a detection refinement algorithm based on spatial-temporal analysis that improves the performance of generic detectors by suppressing false positives and recovering missed detections in underwater videos. In the proposed work, we use state-of-the-art deep neural networks such as Inception, ResNet50, and ResNet101 to automatically classify and detect the burrows of the Norway lobster Nephrops norvegicus in underwater videos. Nephrops is one of the most important commercial species in Northeast Atlantic waters, and it lives in burrow systems that it builds itself on muddy bottoms. To evaluate the performance of the proposed framework, we collected data from the Gulf of Cadiz. The experimental results demonstrate that the proposed framework effectively suppresses false positives and recovers missed detections obtained from generic detectors. The mean average precision (mAP) gained a 10% increase with the proposed refinement technique.


Introduction
Research in underwater image analysis has gained popularity in many applications of marine science. There are various research directions in underwater image analysis, for instance, underwater species classification and detection [1], seafloor image recognition [2], coral reef classification [3], and flora and fauna recognition [4]. Underwater image analysis requires a set of image processing tasks including underwater object detection, classification, visual content recognition, and image annotation of large-scale marine species [5]. Challenges such as turbidity, color variation, and illumination changes make it very difficult for models to automatically detect and classify objects in underwater environments.
There are thousands of species in oceans all over the world. One of the most important commercial species in Europe is the Norway lobster Nephrops norvegicus. Figure 1 shows the Nephrops norvegicus species (hereafter referred to as Nephrops). This species is distributed from 10 m to 800 m depth in NE Atlantic waters and the Mediterranean Sea [6], where the sediment is suitable for constructing burrows. It excavates and inhabits burrow systems mainly in muddy seabed sediments with more than 40 percent silt and clay [7]. These burrow systems have single or multiple openings or holes with characteristic features that distinguish them from the burrows built by other burrowing species [8,9]. At least one opening has a crescent-moon shape and a shallowly descending tunnel. There is often evidence of expelled sediment forming a wide delta-like tunnel opening, and signs such as scratches and tracks are frequently observed. If a burrow system has more than one entrance, the area at the center of the openings is raised. It is assumed that each burrow system is occupied by a single individual. Figure 2 shows the features of the Nephrops burrow system. Nephrops spend most of their time inside their burrows, and their emergence behavior is influenced by several factors: time of year, light intensity, or tidal strength [10]. For this reason, abundance indices obtained from commercial catches or traditional bottom trawl surveys are thought to be poorly representative of the Nephrops population, and they are not considered appropriate [11,12].
The abundance of Nephrops populations is currently monitored by underwater television (UWTV) surveys on many European grounds. The methodology used in UWTV surveys was developed in Scotland in the 1990s and is based on the identification and quantification of the burrow systems over the known area of Nephrops distribution [13]. Nephrops abundance from UWTV surveys is the basis of the assessment and advice for managing these stocks [14].
Videos are recorded using a camera system mounted on a sledge at an angle with respect to the bottom ranging between 37° and 60°, depending on the country [15]. They are reviewed manually by trained experts and quantified following the protocol established by ICES [8,16]. With the recent advancement of artificial intelligence and computer vision technology, many researchers employ AI-based tools to analyze marine species. Some use feature extraction mechanisms to count and identify the species, while others use more advanced techniques [17] such as neural networks. Convolutional neural networks (CNNs) brought a revolution in object detection. Deep convolutional neural networks have achieved tremendous success in the tasks of object detection [18,19], classification [20,21], and segmentation [22,23]. These networks are data-driven and require a large amount of labeled data for training.
In our previous work [24], we developed a deep learning model based on the state-of-the-art Faster R-CNN [19] architectures Inceptionv2 [25] and MobileNetv2 [26] for the detection of Nephrops openings. Those models were trained on Gulf of Cadiz and Irish datasets and achieved good results in detecting the burrows from the image test data. However, when these trained models were tested on a video from the Gulf of Cadiz, their accuracy degraded. We found many false positive (FP) and missed true positive (TP) detections that adversely affected the accuracy of these models.
In this work, we propose a detection refinement mechanism based on spatial-temporal information to recover missed true positive detections and suppress false positive detections. The work presented in [27] used temporal information to track faces and suppress false positive detections. Their approach used low-level tracking to detect faces in natural images, and it does not recover missed detections. In our case, low-level tracking cannot be applied, as we use underwater videos and the objects we detect are not living animals but burrows in the seabed, whose characteristics are very different from those of natural images. Whereas the previous work integrates temporal information only to track faces and suppress false positives, our approach uses both spatial and temporal information to suppress false positives and recover missed detections. Our work is divided into two parts. First, we trained models using the state-of-the-art Faster R-CNN [19] architectures Inceptionv2 [25], ResNet50 [28], and ResNet101 [29] for the detection of Nephrops burrows, and we built a dataset for training and testing the models. Second, we present a spatial-temporal detection refinement algorithm: we detect the burrows in each frame of a video sequence and then exploit the spatial and temporal information across multiple frames to refine the Nephrops burrow detections. The spatial-temporal mechanism suppresses FP burrows and recovers missed TP detections, which leads to better accuracy and allows tracking and counting burrows in a video sequence. Figure 3 shows the result of the detector trained with the Inception model: the blue bounding boxes show the ground truth, while the red bounding boxes show the detections from the Inception model.
Due to variation in camera direction and the appearance of burrows, the detector accumulates FPs and missed detections in some frames. The figure clearly shows the missed detections in the intermediate frames. To address these challenges, we propose a detection refinement approach based on spatial-temporal analysis that enhances the mAP of a generic detector by identifying missed detections, recovering them, and suppressing the false positives. Our approach makes the following contributions:
i. We propose a spatial-temporal filtering (STF) model that uses the spatial and temporal information of the detections in consecutive frames of an input video to suppress false positives and recover missed detections. The proposed method improves the performance of generic detectors (such as Inception and ResNet, in our case).
ii. We evaluate the performance of the proposed framework on our novel dataset. The experimental results demonstrate the effectiveness of the proposed approach.
The rest of the paper is organized as follows: related work is presented in Section 2; Section 3 (Materials and Methods) presents the data collection method and the proposed methodology to refine the detections; the results achieved with the proposed methodology are discussed in Section 4; finally, Section 5 concludes the article.

Related Work
Object detection and classification are challenging computer vision problems, and researchers have developed many methods for these tasks. Existing object detection approaches use either handcrafted feature-based models [30][31][32][33] or deep feature models [34]. Handcrafted feature models use basic features such as shape [35], texture [36][37][38], and edges [35,38] to train a classifier. Convolutional neural networks, on the other hand, automatically learn hierarchical features from the training set. Deep learning replaces handcrafted features and introduces efficient algorithms for object detection and classification, and over the last few years deep learning models have enjoyed tremendous success in these tasks. For this reason, deep learning models are also employed in the detection and classification of underwater species. Although the underwater environment is harder and more challenging than terrestrial scenes, deep learning algorithms perform much better than conventional handcrafted features. State-of-the-art deep learning-based object detectors include the region-based convolutional network (R-CNN) [39], Fast R-CNN [40], and Faster R-CNN [19]. R-CNN uses a deep ConvNet to classify object proposals; it is computationally expensive, as it uses a selective search [41] strategy to generate a large number of object proposals followed by an object proposal classification step. Fast R-CNN improves on R-CNN with a faster training process: it uses multitask learning to update all the network layers and handle the loss, which improves the speed and accuracy of the network. Compared to both methods, Faster R-CNN introduces a region proposal network (RPN) and combines the RPN with Fast R-CNN into a single network.
Li et al. [42] developed a deep learning model for the detection of marine objects. The model detects and recognizes fishes using a deep convolutional network. They applied the Fast R-CNN algorithm to classify twelve different classes of underwater fishes, introduced a dataset of 24,272 images of these classes, and achieved more than 90% detection accuracy. Similarly, Villon et al. [43] applied deep learning algorithms to the Fish4Knowledge dataset to detect and classify fishes. Rathi et al. [44] combined Faster R-CNN with three classification networks (ZF Net, CNN-M, and VGG16) to detect 50 fish and crustacean species from Queensland beaches and estuaries; their regional proposal method consists of a region proposal network coupled with a classifier network. Xu et al. [45] applied the YOLO deep learning model to recognize fishes in underwater videos. They used three different datasets recorded at real-world waterpower sites and achieved an mAP of up to 53.92%. Mandal et al. [46] presented a Faster R-CNN approach to identify fishes and their different species using deep neural networks. Gundam et al. [47] proposed a fish classification technique based on the Kalman filter for the partial automation of fish classification from underwater videos. Jalal et al. [1] proposed a hybrid approach that combines YOLO-based object detection with optical flow and Gaussian mixture models to detect and classify fishes in underwater videos. A similar YOLO-based method to detect and classify fishes was proposed by Sung et al. [48], who used 892 images and achieved a fish classification accuracy of up to 93%. Jager et al. [49] proposed a deep CNN approach based on the AlexNet architecture for the classification of fish species, using the LifeCLEF 2015 dataset. Zhuang et al. [50] proposed a deep learning model based on the SSD detector to automatically identify fishes and their species.
In their approach they used ResNet-10 as a classifier for species identification. Zhao et al. [51] proposed an automatic detection and classification method for fish and other underwater species. The proposed method, called "Composed Fish-Net", is based on a composite backbone and a path aggregation network. The composite backbone is an improvement of ResNet, and the enhanced path aggregation network is designed to improve the semantic information degraded by upsampling. The results show that they achieved an average precision (AP) of 75.2%. Labao et al. [52] proposed a multilevel object detection network that uses R-CNN as its framework; it contains two region proposal networks and seven CNNs connected by long short-term memory (LSTM) units, and it showed improved performance over simple one-stage detection networks. Salman et al. [53] proposed an R-CNN-based two-stage automatic fish detection and localization method. They used fish motion information combined with background and optical flow information to generate candidate fish regions. Their model requires a fixed-size input image, and the candidate region extraction needs substantial disk space as well.
Deep learning models have also been employed to detect marine objects other than fishes, such as plankton and corals, which are two further major components of the underwater marine ecosystem. Plankton form the basis of the aquatic food web. Dieleman et al. [54] used a deep neural network to classify plankton and introduced the inception module for image information extraction. Lee et al. [55] also proposed a deep neural network for plankton classification on a large dataset; their convolutional neural network used three convolutional layers and two fully connected layers. The difficulty of coral classification lies in the variation of color, size, texture, and shape. Shiela et al. [56] introduced a local binary pattern for texture and color coordination and used a neural network with three backpropagation layers for classification. Elawady et al. [57] used a supervised CNN for the classification of corals. Table A1 in Appendix B summarizes the key findings of the papers discussed in this section.

Materials and Methods
In this section, we discuss the proposed methodology for improving the detection of Nephrops burrows. Figure 4 shows the pipeline of the proposed framework. This section also presents the equipment and methods used in data collection in detail. The proposed framework has two sequential stages: object detection in the first stage, and detection refinement in the second. During the first stage, we use state-of-the-art generic detectors, namely Faster R-CNN with Inception, ResNet50, and ResNet101 backbones, to detect the Nephrops burrows. For this purpose, we first divide the input video sequence into temporal segments, each consisting of N frames. We then apply the detectors to each temporal segment to detect Nephrops burrows. The obtained results are passed to the refinement module, which employs spatial-temporal filtering (STF) to recover missed detections and suppress false positive detections. This process improves the mean average precision (mAP) of the results obtained from the detectors.
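As an illustration, the division of an input video into temporal segments can be sketched as follows; the function name and the window size used in the example are assumptions for illustration, as the text does not fix the value of N:

```python
def temporal_segments(num_frames, window_size):
    """Split the frame indices of a video into consecutive temporal
    segments of at most `window_size` frames each."""
    return [list(range(start, min(start + window_size, num_frames)))
            for start in range(0, num_frames, window_size)]

# A 5 min video at 25 frames per second has 7500 frames; an assumed
# window of 25 frames yields 300 one-second segments.
segments = temporal_segments(7500, 25)
```

Each segment is then processed independently by the detector, and the per-frame detections of a segment are handed to the refinement module.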

Nephrops Burrows Detections
To detect and classify the Nephrops burrows, state-of-the-art Faster R-CNN deep learning algorithms, Inceptionv2 [25], ResNet50 [28], and ResNet101 [29], were used to train the models. Figure 5 shows the pipeline of the proposed detection framework.

Data Collection
High-resolution footage was collected using a sledge during the 2018 underwater television (UWTV) survey in the Gulf of Cadiz by marine scientists from the IEO (Instituto Español de Oceanografía), a Spanish research institution devoted to promoting ocean research and knowledge, including government assessment for sustainable fisheries. The sledge is a stainless-steel underwater vehicle equipped with multiple cameras, sensors, lasers, and lights to record the footage. Figure 6 shows the setup of the instruments mounted on the sledge and a sample image; a complete description is presented in Table 1.



Sampling at 70 stations was conducted in the 2018 UWTV survey. A station is a geostatistical location where the Nephrops burrow density is estimated to obtain the Nephrops abundance index over the known survey area using geostatistical analysis. At each station, the sledge was deployed and towed at a constant speed between 0.6 and 0.7 knots to obtain the best possible conditions for counting Nephrops burrows. Once the sledge is stable on the seabed, video footage of 10-12 min at 25 frames per second is recorded, which corresponds to approximately 200 m swept. The vessel position (dGPS) and the position of the sledge, using a HiPAP transponder, are recorded every 1 to 2 s. The distance over ground (DOG) is estimated from the position of the sledge at all stations, and the field of view (FOV) of the video footage is 75 cm, which was confirmed using two line lasers. Out of these 70 stations, we selected seven based on better lighting conditions, high contrast, high density of Nephrops burrows, and better visibility of the burrows. The recorded footage was saved to hard disks for further analysis of Nephrops density.

Image Annotation
The obtained frames were annotated using the Microsoft VoTT [58] tool. We annotated the burrows manually in VoTT and saved the annotations in Pascal VOC format. Each saved XML annotation file contains the image name, the class name (Nephrops), and the bounding box of each object of interest in the image. The annotated frames constitute the ground truth (GT) for model training. To create the datasets for training and testing, from the set of more than 100,000 annotated frames we selected those that contained Nephrops burrows, using only one frame per individual object, chosen to increase the diversity of its appearance, with the aim of creating a small dataset that covers most of the typical cases of Nephrops burrows.
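As a sketch of how such an annotation file can be consumed downstream, the following parses the standard Pascal VOC layout with Python's standard library; the sample values are invented for illustration:

```python
import xml.etree.ElementTree as ET

def parse_voc_annotation(xml_string):
    """Return (class_name, xmin, ymin, xmax, ymax) tuples from a
    Pascal VOC annotation document."""
    root = ET.fromstring(xml_string)
    boxes = []
    for obj in root.iter("object"):
        bb = obj.find("bndbox")
        boxes.append((obj.findtext("name"),
                      int(bb.findtext("xmin")), int(bb.findtext("ymin")),
                      int(bb.findtext("xmax")), int(bb.findtext("ymax"))))
    return boxes

# Hypothetical annotation of a single burrow in one frame.
sample = """<annotation>
  <filename>frame_0001.jpg</filename>
  <object>
    <name>Nephrops</name>
    <bndbox><xmin>120</xmin><ymin>80</ymin><xmax>190</xmax><ymax>140</ymax></bndbox>
  </object>
</annotation>"""
boxes = parse_voc_annotation(sample)
```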

Annotation Validation
Annotating Nephrops burrows is a tedious job that requires considerable experience, because different species build burrows of similar appearance on the seabed. Once all the burrows were annotated, each of them was validated with the advice of marine experts from the IEO institution, Gulf of Cadiz. Only the validated annotations were used for model training.

Prepare Dataset
After validating all the annotations, the dataset was divided into two independent groups, the first for training and the second for testing. Details are given in Table 2. We utilized transfer learning [59] to fine-tune the models in TensorFlow [60]. Inceptionv2 [25] is an architecture with a high degree of accuracy that helps to reduce the complexity of the CNN. Inceptionv2 uses 3 × 3 convolution layers, which improves the performance of the network in terms of computational speed and processing.
ResNet50 [28] is a variant of the ResNet model. It has 48 convolutional layers plus one max-pooling and one average-pooling layer, making it a 50-layer-deep convolutional network. The first convolution uses 64 kernels of size 7 × 7 with stride 2, followed by a 3 × 3 max pool with stride 2. The second stage uses nine layers (three bottleneck blocks of 1 × 1, 64; 3 × 3, 64; and 1 × 1, 256 kernels). The third stage uses 12 layers (four blocks of 1 × 1, 128; 3 × 3, 128; and 1 × 1, 512 kernels). The fourth stage uses 18 layers (six blocks of 1 × 1, 256; 3 × 3, 256; and 1 × 1, 1024 kernels), and the fifth stage uses nine layers (three blocks of 1 × 1, 512; 3 × 3, 512; and 1 × 1, 2048 kernels). Finally, an average pool and a softmax layer complete the network. ResNet50 is a widely used ResNet model.
ResNet101 [29] is a dense convolutional neural network that is 101 layers deep. It shares the stage structure of ResNet50, except that the fourth stage uses 69 layers (23 bottleneck blocks of 1 × 1, 256; 3 × 3, 256; and 1 × 1, 1024 kernels) instead of 18. For our problem, ResNet50 and ResNet101 achieved better accuracy than the other models.
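The stage-wise layer counts above can be checked with simple arithmetic; the tuples below merely restate the counts from the text (the initial 7 × 7 convolution, the four bottleneck stages, and the final fully connected/softmax layer):

```python
# (conv1, conv2_x, conv3_x, conv4_x, conv5_x, final fc/softmax layer)
RESNET50_STAGES = (1, 9, 12, 18, 9, 1)
RESNET101_STAGES = (1, 9, 12, 69, 9, 1)  # only conv4_x differs (23 blocks)

def total_depth(stages):
    """Total number of weight layers in the network."""
    return sum(stages)
```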

Testing
To test our algorithm, we selected another station from the Gulf of Cadiz whose frames were not used in the training dataset. The test video, which is five minutes long and contains 7500 frames, was divided into temporal segments and then passed to our trained models to obtain the Nephrops burrows detections.

Detection Refinements
After detecting the Nephrops burrows, we performed a post-analysis of the obtained results. We found that the detectors produce many FPs and miss many TPs, which degrades accuracy. To recover missed detections and suppress FPs, we propose a detection refinement algorithm that exploits the spatial-temporal information among the consecutive frames of a given temporal segment. The Inception, ResNet50, and ResNet101 models were tested on a video five minutes in length. The proposed detection refinement algorithm takes V, λ, and W as inputs, where V is the video; λ is an IoU (intersection over union) threshold that is later compared with the IoU of detected Nephrops burrows; and W is the size of the temporal window, which determines the number of frames in the window. The models provide a set of TP, FP, and missed detections. The criteria defining TPs and FPs and the working of the proposed detection algorithm are discussed in the next sections.

True Positives (TP)
The algorithm considers a detection a TP if it is continuously detected by the detector within the temporal window and its average IoU across all the frames of the window is greater than or equal to the threshold λ. Consequently, if the detector marks an FP detection as TP and that detection continues to occur in all the consecutive frames, our algorithm considers it a TP.
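A minimal sketch of this criterion, assuming axis-aligned boxes in (xmin, ymin, xmax, ymax) form and interpreting the averaged IoU as the mean overlap between a burrow's boxes in consecutive frames of the window; the exact bookkeeping of the algorithm in Appendix A may differ:

```python
def iou(a, b):
    """Intersection over union of two (xmin, ymin, xmax, ymax) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union else 0.0

def is_true_positive(track, lam):
    """A detection track (one box, or None, per frame of the temporal
    window) is kept as TP when the object appears in every frame and
    the mean IoU between its boxes in consecutive frames is >= lam."""
    if any(box is None for box in track):
        return False
    ious = [iou(track[i], track[i + 1]) for i in range(len(track) - 1)]
    return sum(ious) / len(ious) >= lam
```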

False Positives (FP)
The FP detections are those which are not detected in the consecutive frames and whose average IoU is less than the threshold λ. These detections are also declared as FP in the ground truth dataset. The detectors mark them as TP because of the camera angle (45°) and the position and angle of the burrow.

Missed Detections
The missed detections are those TP detections that are detected in some frames by the detector but missed in some intermediate frames due to the position or visibility of the burrow. Identifying missed detections is very important because, without them, we cannot track a burrow. Recovering the missed detections increases the performance of the models.

Working of Detection Refinement Algorithm
The proposed algorithm is presented in Appendix A and shows the refinement mechanism using the spatial–temporal analysis of the data. The algorithm is divided into two sections, i.e., suppression of false positives and identification of missed detections. Figure 7 shows the basic processing steps of false positive suppression and missed detection identification and recovery.


Suppression of False Positives
The first step towards the refinement of detections is to suppress the FP. Let Fi = {B1, B2, …, Bn} be frame i with n detections obtained with a deep learning model. Let sF be the set of consecutive frames within a temporal window of size W. The algorithm takes Bj of frame Fi as input for refinement and provides a refined output FR. To suppress the FP in the current frame i, we compute the overlap of each detection Bj of the current frame with the detections in the next frames from sF.
The algorithm receives three inputs: an input video with detections V, threshold value λ, and temporal window size W. For each detection b ∈ Bj at frame Fi, we first locate the current detection in the next frame of sF and then compute δk, the IoU of the current detection with the detection in the k-th consecutive frame of sF, using the Compare_Displacement_Vector(fb_Index, fcb_Index) method (k = 1, …, W). Then, δavg = (1/W) ∑ δk is the estimated average within the temporal window. We mark the detection as FP if δavg < λ, and as TP otherwise, thereby suppressing the FP. We process the detections of the whole video V in the same way.
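The suppression pass described above can be sketched as follows (a hedged illustration; the `detection_ious` structure, standing in for the IoU traces produced by Compare_Displacement_Vector, is our assumption):

```python
def suppress_false_positives(detection_ious, lam, W):
    """Mark each detection TP/FP from its IoU trace over the next W frames.
    detection_ious maps a detection id to the list [delta_1 .. delta_W],
    where delta_k is the IoU of the detection with its matched detection in
    the k-th following frame (0.0 when no match is found). A detection is
    kept (TP) when delta_avg = (1/W) * sum(delta_k) >= lambda."""
    refined = {}
    for det_id, deltas in detection_ious.items():
        delta_avg = sum(deltas[:W]) / W
        refined[det_id] = "TP" if delta_avg >= lam else "FP"
    return refined

# e.g. with lambda = 0.3 and W = 4:
result = suppress_false_positives(
    {"b1": [0.6, 0.5, 0.4, 0.5], "b2": [0.1, 0.0, 0.2, 0.1]},
    lam=0.3, W=4)
print(result)  # {'b1': 'TP', 'b2': 'FP'}
```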

Identification of Missed Detections
After suppressing the FP in the previous step, the next step is to identify the detections missed by our detector. For this purpose, we track each detection Bj ∈ Fi. If the detection is found in frame i + 1, we continue to track it up to the temporal window size W. If the current detection is not found in some intermediate frame, we mark it as a missed detection and store its location in the set indexSet.
To calculate the value of the missed detection, we define the Set_BoundingBox_Value() method. We first obtain the location of the missed detection from indexSet. Letting Bj be the current detection and indexSetj the missed detection, we accumulate the detection value from the current frame to the indexSet location and then compute the average, called bBValue_missing. As we maintain the number of frames N between the current detection and the missed detection, we calculate the missed detection value by adding N to bBValue_missing. The missed detection information is then filled in, updating the refined output FR.
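Since the paper does not give the exact formula behind Set_BoundingBox_Value(), the following is only a hedged sketch of the underlying idea: estimating a missed bounding box from the detections observed around the gap.

```python
def recover_missed_box(boxes_before, boxes_after):
    """Hedged sketch of the idea behind Set_BoundingBox_Value(): estimate
    a missed bounding box as the coordinate-wise average of the boxes
    observed around the gap. The paper averages an accumulated detection
    value over the N frames between the current and the missed detection;
    here we illustrate with the surrounding boxes directly."""
    observed = boxes_before + boxes_after
    n = len(observed)
    return tuple(sum(b[i] for b in observed) / n for i in range(4))

# burrow seen at frames i and i + 2 but missed at frame i + 1:
missed = recover_missed_box([(10, 10, 30, 30)], [(14, 10, 34, 30)])
print(missed)  # (12.0, 10.0, 32.0, 30.0)
```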

Experiments and Results
In this section, we evaluate the results of different experiments performed using the proposed detection refinement algorithm. We use three models (Inception, ResNet50, and ResNet101) trained on the Gulf of Cadiz dataset. Each model is trained for up to 100k iterations, and a log is maintained every 10k iterations for evaluation.

Quantitative Analysis
In the quantitative analysis, an annotated video with a frame rate of 25 fps is used to test the Inception, ResNet50, and ResNet101 models. The video is divided into five temporal segments of one minute each; each temporal segment has 1500 frames.
We record the number of detections from each temporal segment for all three models. The detections are then processed through the proposed detection refinement algorithm to identify the TP, FP, and missed detections. Tables A2–A6 in Appendix B show the results obtained in each temporal segment by each model and their corresponding improvement by the proposed detection refinement algorithm. The algorithm is run with W = 8, 12, and 16. For each temporal window, the algorithm is tested with λ = 0.3 and 0.4, and the number of TP, FP, and missed detections and the F1-score (harmonic mean of precision and recall) are reported for each minute of the video.
Table 3 shows the accumulative ground truth (GT), TP, FP, and missed (Miss) detections along with the mean values of precision, recall, and F1-score for each temporal segment. The %Before column gives the results obtained before applying the refinement algorithm, while %After shows the results obtained after applying it. Table 3 shows that ResNet101 gives the best F1-score in every one of the five temporal segments, followed by ResNet50 and Inception. A small IoU threshold of 0.3 is clearly better than 0.4 in terms of precision, recall, and F1-score, because the area surrounding burrows is sometimes not well defined for all three models. The effect of window size W shows a trend of better results for smaller values (mostly, W = 8 is better than W = 12 and W = 16).
We also performed experiments to measure accuracy using mean average precision (mAP) after applying the detection refinement algorithm. We selected two different image sets from the third (image set 1) and fifth (image set 2) temporal segments, each consisting of almost 200 images. Table 4 defines the experiments performed. Figures 8 and 9 show the results of the experiments on image sets 1 and 2, respectively; the graphs compare detections with and without the detection refinement algorithm.
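The precision, recall, and F1 values reported in Table 3 follow the standard definitions, with missed detections acting as false negatives; a minimal sketch (the example counts are illustrative, not taken from the paper's tables):

```python
def precision_recall_f1(tp, fp, missed):
    """Precision, recall, and F1 from TP, FP, and missed-detection counts.
    Missed detections are false negatives, so recall = TP / (TP + missed);
    F1 is the harmonic mean of precision and recall."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + missed) if tp + missed else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# illustrative counts only:
p, r, f1 = precision_recall_f1(tp=80, fp=20, missed=20)
print(round(p, 2), round(r, 2), round(f1, 2))  # 0.8 0.8 0.8
```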
The performance is evaluated after every 10k iterations. The results clearly show that the mAP increases after applying the refinement algorithm for all three models (Inception (a), ResNet50 (b), and ResNet101 (c)) and at every iteration count. Figure 8 shows a larger improvement in mAP than Figure 9, where some improvement is also achieved, partly because image set 1 had obtained a lower mAP before the refinement. Image set 2 has better quality than image set 1, with a clearer appearance of the burrows and fewer camera-movement artifacts. This suggests that mAP is quite sensitive to video quality and that the proposed refinement algorithm compensates for this to some degree.

Qualitative Analysis
In this section, we qualitatively analyze the performance of the proposed detection refinement algorithm by applying it to the results obtained from the Inception, ResNet50, and ResNet101 models. The red bounding boxes on the images shown in this section are the original detections obtained from the models; green bounding boxes are the recovered missed detections after applying the refinement algorithm, and ground truth data are marked with blue bounding boxes. Figure 10 shows a typical example of suppression of FP from the detections obtained from the Inception model. Figure 10a–c shows three frames where all burrow entrances are detected correctly but some FP detections are also obtained; these are suppressed by our proposed algorithm, resulting in correct detections, as shown in Figure 10d–f.

A second rectification performed by the proposed detection refinement algorithm is the identification of missed detections. Figure 11 shows an example of six consecutive frames, before (a–f) and after (g–l) the application of this algorithm.
Figure 11a shows two Nephrops burrow detections, but one detection is missed in (b–e); this is correctly rectified by the algorithm, as shown in the corresponding images (h–k). It can also be seen that the ground truth annotations contain a third object in Figure 10d,f, which is correctly detected by the models but does not appear in Figure 10a–c,e, possibly due to the viewing angle of some frames. Nevertheless, the identification of missed detections has a good impact on the accuracy and precision of the results. A similar approach is followed to rectify the detections from the ResNet50 and ResNet101 models.

Conclusions
Deep learning algorithms performed very well on the Gulf of Cadiz dataset in identifying the burrows of Nephrops norvegicus. We applied Faster R-CNN detectors with Inception, ResNet50, and ResNet101 backbones. To increase the accuracy of the results, a spatial–temporal detection refinement algorithm was proposed and tested. The proposed algorithm suppresses false positive detections and recovers missed true positive detections. When integrated with any of the detectors, the proposed method consistently increased the performance, as measured using mAP. This mechanism helps marine science experts in the assessment of the abundance of this species.
In future work, we plan to use diverse datasets from UWTV surveys conducted in other Nephrops stocks by other countries. We will train YOLO detectors with larger and more diverse datasets. In addition, we plan to track the burrows to estimate the abundance of Nephrops. We also plan to correlate the spatial and morphological distribution of burrow holes to estimate the number of burrow systems present and compare the results with human inter-observer variability studies.