A Deep Learning Approach to Assist Sustainability of Demersal Trawling Operations

Abstract: Bycatch in demersal trawl fisheries challenges their sustainability despite the implementation of various gear technical regulations. A step towards extended control over the catch process can be established through a real-time catch monitoring tool that will allow fishers to react to unwanted catch compositions. In this study, for the first time in the commercial demersal trawl fishery sector, we introduce an automated catch description that leverages a state-of-the-art region-based convolutional neural network (Mask R-CNN) architecture and builds upon a novel in-trawl image acquisition system. The system is optimized for applications in the Nephrops fishery and enables the classification and counting of catch items during fishing operations. The detector robustness was improved with augmentation techniques applied during training on a custom high-resolution dataset obtained during extensive demersal trawling. The resulting algorithms were tested on video footage representing both the normal towing process and haul-back conditions. The algorithm obtained an F-score of 0.79. The resulting automated catch description was compared with the manual catch count, showing low absolute error during towing. Current practices in demersal trawl fisheries are carried out without any indication of the catch composition or of whether the catch enters the fishing gear. Hence, the proposed solution provides a substantial technical contribution to making this type of fishery more targeted, paving the way to further optimization of fishing activities aiming at increasing target catch while reducing unwanted bycatch.


Introduction
Commercial demersal trawl fisheries are defined as mixed due to the high presence of co-habiting species in the catch, resulting in high catch rates of non-target sizes and individuals, referred to as bycatch [1]. In a quota-regulated management system, the commercial species and sizes can also be considered bycatch if the individual vessel does not have quota available for a given species. Thus, the actual bycatch definition depends on the fishery type and the fishing area [2]. To mitigate the catch and subsequent discard of unwanted species and sizes, ambitious management plans such as the EU Common Fisheries Policy landing obligation have been implemented, forcing fishers to declare all catches of listed species and count them against their quota [3]. The management plans are combined with technical regulations aiming at improving the gears' size and species selectivity through mesh size regulations, trawl modifications and bycatch reduction devices. Despite these measures, the catch of unwanted sizes and species still challenges these fisheries [2,4]. Indeed, catch-quota systems such as the landing obligation provide an incentive, and not a tool, to minimize unwanted catches. Additionally, available technical measures are not able to provide information on the ongoing catch; hence, the catch composition can only be discovered when the fishing gear is lifted on board the vessel [5].
Recent developments in underwater imaging systems can help bring traditional demersal trawl fisheries into the digital age by enabling catch monitoring during fishing operations. Such systems are indeed crucial to overcome the challenges the demersal trawl fisheries face. The possibility to monitor the catch inside the trawl during fishing can provide valuable information and act as a decision support tool for fishers [6]. In-trawl camera systems are being introduced in pelagic fisheries [7][8][9][10] and demersal fisheries [6]; however, these systems have so far been used for scientific monitoring purposes only.
The developed catch monitoring methods are associated with extensive storage and manual processing of video recordings. To become an efficient decision support tool, these systems require automated processing of the data. Recently, automated processing of the data obtained by video cameras has become more common in various industries, and fisheries are no exception. Several studies describe automated fish detection and classification, commonly performed with the aid of deep learning models [11][12][13][14][15]. These studies demonstrate that deep learning models for object detection and classification are efficient tools for processing recordings of the catch collected on board as well as underwater. The ability of deep learning to "learn" the object features given the annotated data makes it a powerful tool for solving complex image analysis tasks. In contrast, traditional computer vision approaches require preliminary feature engineering for each specific task, which limits their efficient application to real-world data [16].
However, underwater video recordings in particular are routinely challenged by poor visibility conditions [12,17]. Additionally, in the specific application of a catch monitoring system in demersal trawls, more prominent occlusion conditions can limit the camera field of view due to sediment resuspension during gear towing [18,19]. Thus, the acquisition of poor video recordings in bottom trawl applications can prevent quality data collection and hence hamper automated processing.
In this study, we demonstrate the successful automated processing of the catch based on the data collected during Nephrops-directed demersal trawling using a novel in-trawl image acquisition system, which helps to resolve the limitations caused by sediment mobilization [20]. We hypothesize that the quality of the data collected using the novel system is sufficient for developing an algorithm for automated catch description. With the described method, we aim at closing the transparency gap in demersal trawling operations and enabling fishers to monitor, and hence have better control over, the catch building process during fishing operations. To test the hypothesis, we fine-tune a pre-trained convolutional neural network (CNN), specifically the region-based CNN Mask R-CNN [21], with the aid of several augmentation techniques aiming at improving model robustness by increasing the variability in the training data. The trained detector was then coupled with a tracking algorithm to count the detected objects. The known behavior of fish and Nephrops (Nephrops norvegicus, Linnaeus, 1758) during trawling was considered while tuning the Simple Online and Realtime Tracking (SORT) algorithm [22]. The resulting composite algorithm was tested against two types of videos: one depicting normal towing conditions (having low object occlusion and a stable observation section) and one depicting the haul-back phase, when the camera's occlusion rate is higher and the observation section is less stable. We assessed the performance of the algorithm in classifying demersal trawl catches into four categories and against the total counts per category. The automated catch count was also compared with the actual catch count. The system shows good performance and, when further developed, can help fishers to comply with present management plans, preserving the economic and ecological sustainability of fisheries by enabling skippers to automatically monitor the catch during fishing operations and to react to the presence of unwanted catch by either interrupting the fishing operation or relocating to avoid the bycatch.

Data Preparation
To collect the video footage containing the common commercial species of the demersal trawl fishery, such as Nephrops, cod (Gadus morhua, Linnaeus, 1758) and plaice (Pleuronectes platessa, Linnaeus, 1758), we performed 19 hauls of 1.5 h duration each in Skagerrak, onboard RV "Havfisken". We used a low headline "Nephrops" demersal trawl with a 40 mm mesh size in the codend to sample the entire population entering the gear. To collect data of sufficient quality to enable automated detection and counting of the catch items, we used an in-trawl image acquisition system developed and described in [20]. The essential parts of that system include a camera coupled with lights placed inside a tarpaulin cylinder with a defined optimal color in the aft part of the trawl, and a sediment-suppressing sheet attached to the ground gear of the trawl (Figure 1) [20,23]. The system ensured stable observation conditions without obscuring sediment clouds during demersal trawling and allowed us to collect high-resolution (720 p) frames to train the deep learning model. The camera settings were: 2 ms exposure, which provides control over the shutter speed; 70 gain, which is responsible for digital amplification of the signal from the camera sensors; 4400 K color temperature; and 60 fps frame rate. To select the frames containing the objects of interest, the data were subsampled with the aid of a blob detector [23]. After this step, the dataset was further subsampled by a human supervisor, who selected the frames containing the target objects from the selected categories: Nephrops, round fish, flat fish and other (Figure 2). The Nephrops class contained the frames depicting the target species of the demersal trawl fishery, namely Nephrops itself. The round fish class contained the frames with round fish species, such as cod, hake (Merluccius merluccius, Linnaeus, 1758) and saithe (Pollachius virens, Linnaeus, 1758). The flat fish class was composed of the frames of all flat fish species, for example, plaice and dab (Limanda limanda, Linnaeus, 1758). The other class contained the frames of different organisms, such as non-commercial fish species and invertebrates, for instance, crabs.
The selected frames were manually annotated for the regions of interest, and the resulting labels contained the polygons of individual objects and the class ID. The prepared dataset consisted of 4385 images and was split into training and validation subsets of 88% and 12%, respectively.

Mask R-CNN Training
The architecture of Mask R-CNN was chosen to perform automated detection and classification of the objects [21]. This deep neural network is well established in the computer vision community and builds upon previous CNN architectures (e.g., Faster R-CNN [24]). It is a two-stage detector that uses a backbone network for extracting features from the input image and a region proposal network to output the regions of interest and propose the bounding boxes. We used the ResNet 101-feature pyramid network (FPN) [25] backbone architecture. ResNet 101 contains 101 convolutional layers and is responsible for the bottom-up pathway, producing feature maps at different scales. The FPN then utilizes lateral connections with the ResNet and is responsible for the top-down pathway, combining the extracted features from different scales.
The network heads output the refined bounding boxes of the objects and the class probabilities. In addition, as an extension of Faster R-CNN, a branch consisting of six convolutional layers provides a pixel-wise mask for the detected objects. The mask area can be used to estimate the real size of the object, which opens up the possibility of automating the size estimation of catch items during fishing. Therefore, we chose this architecture keeping in mind the scope of future work. During training, the polygons in the labeled dataset are converted to masks of the objects. We initialized the training routine with pre-trained ImageNet weights [26]. We trained the model using a Tesla V100 with 16 GB RAM, CUDA 11.0 and cudnn v8.0.5.39, and followed the Mask R-CNN Keras implementation [27].

Data Augmentation
To improve the model robustness and to avoid overfitting, we used several image augmentation techniques during the Mask R-CNN training routine. These are instance-level transformations with Copy-Paste (CP) [28], geometric transformations, shifts in color and contrast, blur and the introduction of artificial cloud-like structures [29]. To evaluate the contribution of each of the techniques, we trained a model without any augmentations and considered this model a baseline for further comparisons.
CP augmentation is based on cropping instances from a source image, selecting only the pixels corresponding to the objects as indicated by their masks, and pasting them on a destination image, thus substituting the original pixel values in the destination image with the ones cropped from the source. The source and destination images are subject to geometric transformations prior to CP so that the resulting image contains objects from both images with new transformations that are not present in the original dataset. The authors of CP suggest using random jitter (translation), horizontal flip and scaling. We also add vertical flip and rotation (with the angle in the range [−15°, 15°]). The authors of CP show that large scale variation (10%, 200%), as opposed to standard scale variation (80%, 125%), improves the performance on the COCO dataset with random weights initialization. However, we find that large scale variation generates objects with unrealistic sizes that are not expected to be found with our image acquisition setup. We find that a scale variation between 50% and 150% works best with our dataset and network configuration. We have also explored the use of several source images and performed the training with two, three and five source images. If the source image contains more than one object, the number of objects to be copied and pasted is a random number between one and the number of objects in the source image.
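The pasting step described above can be illustrated with a minimal sketch. It assumes images are plain 2D lists of pixel values with identical shape and that each object has a 0/1 mask; the function names `copy_paste` and `sample_scale` are hypothetical, and the geometric transformations applied before pasting are omitted. It is not the implementation used in the study.

```python
import random


def copy_paste(src_img, src_masks, dst_img, rng=random):
    """Paste a random subset of source instances onto a copy of the destination.

    src_img, dst_img: 2D lists of pixel values with the same shape (assumption).
    src_masks: list of 2D 0/1 lists, one mask per object in the source image.
    Returns the augmented image and the indices of the pasted instances.
    """
    out = [row[:] for row in dst_img]  # do not modify the destination in place
    if not src_masks:
        return out, []
    # Number of instances to paste: random integer from 1 to the object count.
    n_paste = rng.randint(1, len(src_masks))
    chosen = rng.sample(range(len(src_masks)), n_paste)
    for idx in chosen:
        for r, mask_row in enumerate(src_masks[idx]):
            for c, inside in enumerate(mask_row):
                if inside:
                    # Only masked pixels replace the destination values.
                    out[r][c] = src_img[r][c]
    return out, sorted(chosen)


def sample_scale(rng=random, low=0.5, high=1.5):
    """Draw a scale factor from the 50%-150% range reported in the text."""
    return rng.uniform(low, high)
```

In a full pipeline the pasted masks would also be appended to the destination labels, and destination masks would be updated where pasted objects occlude them.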
Although data collection was undertaken using a stable image acquisition system with a tightly attached camera and an artificial light source, the illumination was not always consistent in the images due to trawl movements as well as occasional catch and sediment occlusions of the camera field of view and the light source. To make the model more robust against these changes, we used color space augmentation (referred to as "Color" augmentation) by inducing variations in hue, saturation and brightness. Specifically, the shifts were applied sequentially, starting from hue value variations (−5, 7), followed by saturation shifts (−10, 10) and, finally, the brightness changes (−20, 20). These values were derived experimentally to reflect the typical variation of color and contrast in the dataset (Figure 3). Notwithstanding the high frame rate and the optimized ratio between exposure and gain, a degree of blur was present in the dataset. The common blur sources are the high speed of the objects' passage through the camera field of view and the light scattering from the sediments that can partially occlude the objects. To make the model robust against these variations, we sequentially applied Gaussian blur with a varying sigma (0.0, 3.0) and motion blur with a varying kernel size (5, 15). We refer to this type of tested augmentation as "Blur".
In addition to the mentioned sources of variation in the images, the occasional presence of sediment creates a set of shapes and patterns that may not be present in the training dataset and can cause false positive detections. To account for this, we explored the use of cloud augmentation ("Cloud"), which introduced random clumps of cloud-like patterns with varying sizes and colors that resembled the sediment shapes found during trawling. We set the color range by specifying the color temperature, which was set to vary from 2000 to 6000 K, corresponding to hues ranging from orange to white, approximating the real sediment colors. This type of augmentation produces an overlay, which is blended with the original image, locally changing the color of the objects lying behind the clumps and globally introducing the cloud-like patterns. Prior to the "Color", "Blur" and "Cloud" augmentations, we applied CP and geometric transformations during training.
The final model contained all the augmentation techniques applied to the images during training. The CP augmentation was applied to every training frame, and the augmentations from the imgaug library [29] were applied sequentially with a 40% likelihood of occurrence for each training frame. The order of augmentations applied to the image during training follows the sequence in which the augmentation techniques are described above.

Tracking and Counting
To track the detected objects and obtain the total automatic count of each category, we use an adaptation of the tracking algorithm SORT [22]. It relies on the Kalman filter to update the tracks' locations and assumes a constant velocity model that corresponds to the general motion of the target species (Nephrops) during trawling [30]. However, the round fish species are able to swim together with the towed gear and can escape the camera field of view and re-enter it again, which typically happens when those species travel forwards towards the trawl mouth [31]. These events cause the track to disappear in the upper part of the frame; to solve this, we implement a filter in the top band of the image. If the track disappears in the filter area, corresponding to the top fifth of the image, the total count of the category does not increase.
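The top-band filter can be sketched as follows. This is a simplified illustration under stated assumptions: image coordinates with y = 0 at the top of the frame, each finished track summarized by its class and the y coordinate of its last observed centroid, and a track ending exactly on the band boundary treated as countable. The names `counts_towards_total` and `count_tracks` are hypothetical.

```python
from collections import Counter


def counts_towards_total(last_centroid_y, frame_height, band_fraction=0.2):
    """Return True if a finished track should increase the category count.

    Tracks whose last centroid lies in the top band (top fifth by default)
    are assumed to belong to fish swimming forwards out of view, so they
    are not counted.
    """
    return last_centroid_y >= band_fraction * frame_height


def count_tracks(finished_tracks, frame_height):
    """Count finished tracks per class, skipping those lost in the top band.

    finished_tracks: iterable of (class_name, last_centroid_y) pairs.
    """
    totals = Counter()
    for class_name, last_y in finished_tracks:
        if counts_towards_total(last_y, frame_height):
            totals[class_name] += 1
    return totals
```

For a 720-pixel-high frame the band covers the first 144 rows, so a round fish track ending at y = 50 would be ignored, while one ending at y = 500 would be counted.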
We use the Mahalanobis distance between the centroids of the tracks and the detections as the cost for the assignment problem, which is solved by the Hungarian algorithm [22]. We use a short probationary period, requiring only two consecutive assigned frames for a track to be considered valid. The tracks are terminated after 15 consecutive frames without being assigned any detection. Finally, we use the matching cascade algorithm proposed in [32], giving priority in the assignment problem to tracks that have been lost for fewer frames.
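The track lifecycle rules above (two consecutive assigned frames to confirm, 15 consecutive unassigned frames to terminate, and priority for recently seen tracks) can be sketched as a small state machine. The class `TrackState` and the helper `cascade_order` are hypothetical names, and the handling of unconfirmed tracks that miss a frame is an assumption, since the text does not specify it.

```python
class TrackState:
    """Bookkeeping for one track; Kalman state and class evidence are omitted."""

    MIN_HITS = 2     # consecutive assigned frames needed to confirm a track
    MAX_MISSES = 15  # consecutive unassigned frames before termination

    def __init__(self):
        self.hit_streak = 0
        self.misses = 0
        self.confirmed = False
        self.terminated = False

    def update(self, matched):
        """Advance the track by one frame; matched tells if a detection was assigned."""
        if self.terminated:
            return
        if matched:
            self.hit_streak += 1
            self.misses = 0
            if self.hit_streak >= self.MIN_HITS:
                self.confirmed = True
        else:
            # Assumption: a miss resets the probation streak but keeps the track alive.
            self.hit_streak = 0
            self.misses += 1
            if self.misses >= self.MAX_MISSES:
                self.terminated = True


def cascade_order(tracks):
    """Order live tracks so that those lost for fewer frames are matched first."""
    return sorted((t for t in tracks if not t.terminated), key=lambda t: t.misses)
```

In a matching cascade, the assignment would be solved for groups of tracks in this order, so that a long-occluded track cannot claim a detection that fits a recently updated one.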
Our tracking problem deals with multiple classes, as opposed to SORT. Often, during the first few frames of an object coming into the field of view, it presents fewer distinctive features and the model is not able to assign the correct class. To address this, we allow each track to initially consider all classes before assigning a definitive one. We enable this by introducing an additional attribute to each track, which consists of a vector of length equal to the number of classes. We first define the probability vector, $\bar{p}_i$ (Equation (1)), as the output from the softmax layer of the network, consisting of the likelihoods that object i belongs to each of C classes. An important property of the softmax function is that the sum of the probabilities in $\bar{p}_i$ is equal to 1.
We then define the evidence vector for track i, $\bar{e}_i$, as the cumulative summation of the probability vectors across each timestep k (Equation (2)). Once the track is completed (at timestep k = K), the final confidence score and the class assigned to the track are computed (Equations (3) and (4)). We also use the evidence vector to assist the assignment problem as well as to filter unlikely matches. In the assignment problem, an additional cost is added to the total cost, which we refer to as the class cost (Equation (5)), where j is the jth object considered for assignment to track i. For a given detection-track pair, it is computed as the sum of the track's evidence vector entries belonging to classes different from the object's class. In the filtering stage of the matching cascade, we introduce an additional gate that forbids any assignment that has a class cost higher than a pre-established threshold.
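The verbal definitions of the evidence vector and the class cost can be sketched as follows. The function names are hypothetical. Taking the class with the largest accumulated evidence and dividing that evidence by the number of timesteps K to obtain a confidence score is an assumed reading of the definitions, not a confirmed form of Equations (3) and (4).

```python
def update_evidence(evidence, probs):
    """Add one timestep's softmax probability vector to the track's evidence."""
    return [e + p for e, p in zip(evidence, probs)]


def finalize(evidence, n_steps):
    """Return (class index, confidence) for a completed track.

    Assumption: the class is the argmax of the evidence vector and the
    confidence is that entry averaged over the K = n_steps timesteps.
    """
    best = max(range(len(evidence)), key=lambda c: evidence[c])
    return best, evidence[best] / n_steps


def class_cost(evidence, det_class):
    """Sum of evidence entries for all classes other than the detection's class."""
    return sum(e for c, e in enumerate(evidence) if c != det_class)


# Example: a track that initially looks like class 1, then settles on class 0.
evidence = [0.0, 0.0, 0.0, 0.0]
for probs in ([0.4, 0.6, 0.0, 0.0], [0.7, 0.3, 0.0, 0.0], [0.8, 0.2, 0.0, 0.0]):
    evidence = update_evidence(evidence, probs)
track_class, track_score = finalize(evidence, n_steps=3)
```

Because each probability vector sums to 1, the class cost of a track with K observations lies between 0 and K, which gives a natural scale for choosing the gating threshold.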

Algorithm Evaluation
To evaluate the algorithm performance, we selected two test videos: one with the average catch rate corresponding to typical conditions during towing (1339 s from the haul start), referred to as "Towing", and the other with a higher occlusion rate and less stable observation conditions due to trawl movements at the end of the fishing operation (4100 s from the haul start), referred to as "Haul-back". The first video is a typical example of the data quality and observation conditions during regular demersal trawling, whereas the second video is a stress test of the algorithm. The evaluation sample sizes are 27,000 and 23,100 frames, corresponding to the lengths of the two test videos. The total number of test frames containing Nephrops was 2082; round fish, 19,840; flat fish, 3221; and other, 6113.
The algorithm outputs a set of predicted tracks that we wish to evaluate against a set of ground truth tracks. The ground truth tracks are defined by the frame index where the track first appears in the video and the frame index where the track last appears in the video (start and end indices).
To compare the predicted tracks against the ground truth start and end indices, we construct a binary vector $\bar{g}_i$ for each ground truth (Equation (6)), where m is the number of frames between the start index of the first track and the end index of the last track present in the video and i is the ground truth index. We set the elements of $\bar{g}_i$ to 1 between the start and end indices of the corresponding ground truth. The rest are set to 0. We construct a similar vector $\bar{p}_j$ for the predictions, where n is the number of predicted tracks.
We then calculate the Intersection over Union (IoU) for each pair of $\bar{g}_i$ and $\bar{p}_j$ (Equation (7)). We are interested in solving the assignments between ground truths G and predictions P via maximizing the summed IoU, so we formulate the general assignment problem as a linear program (Equations (8)-(13)) with integer-valued assignment variables for each pair i ∈ G, j ∈ P that are constrained to sum to 1 (Equation (10)). The final definition of IoU enforces a penalty for assigning tracks that have an IoU that is less than or equal to some threshold value $\tau$ ($\tau$ = 0). The solution to Equation (8) yields optimal matches between the ground truth and the predictions. The solver implementation used the GNU Linear Programming Kit (GLPK) simplex method [33]. The matched ground truth tracks and predicted tracks are treated as True Positives (TP), unmatched ground truth tracks correspond to False Negatives (FN) and unmatched predicted tracks correspond to False Positives (FP). The numbers of TP, FN and FP were used to calculate the Precision, Recall and F-score of the algorithm.
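The evaluation procedure can be sketched on a toy example. This sketch replaces the linear program solved with GLPK by an exhaustive search over one-to-one assignments, which gives the same optimum for a handful of tracks but does not scale; it is a stand-in, not the solver used in the study. Tracks are represented by inclusive (start, end) frame indices, whose interval IoU equals the IoU of the corresponding binary vectors. All function names are hypothetical.

```python
from itertools import permutations


def interval_iou(a, b):
    """IoU of two inclusive frame intervals given as (start, end)."""
    inter = max(0, min(a[1], b[1]) - max(a[0], b[0]) + 1)
    union = (a[1] - a[0] + 1) + (b[1] - b[0] + 1) - inter
    return inter / union


def match_tracks(gt, pred, tau=0.0):
    """One-to-one matching that maximizes the summed IoU (brute force).

    Pairs with IoU <= tau are never matched. Returns (gt_index, pred_index) pairs.
    """
    n = max(len(gt), len(pred))
    # Pad with None so that ground truths can stay unmatched.
    padded = list(range(len(pred))) + [None] * (n - len(pred))
    best_score, best_pairs = -1.0, []
    for perm in permutations(padded, len(gt)):
        pairs = [(i, j) for i, j in enumerate(perm)
                 if j is not None and interval_iou(gt[i], pred[j]) > tau]
        score = sum(interval_iou(gt[i], pred[j]) for i, j in pairs)
        if score > best_score:
            best_score, best_pairs = score, pairs
    return best_pairs


def precision_recall_f(gt, pred, tau=0.0):
    """Precision, Recall and F-score from matched and unmatched tracks."""
    tp = len(match_tracks(gt, pred, tau))
    fp, fn = len(pred) - tp, len(gt) - tp
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f_score = (2 * precision * recall / (precision + recall)
               if precision + recall else 0.0)
    return precision, recall, f_score
```

With three ground truth tracks and four predictions of which two overlap a ground truth, the sketch yields TP = 2, FP = 2 and FN = 1, that is, a Precision of 0.5 and a Recall of 2/3.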

Automated and Manual Catch Comparison
The two best performing algorithms were used to predict the total count of the catch items in the two selected test videos to diagnose the progress of the automated count in relation to the video frames. We then applied both algorithms to the other nine videos containing the catch monitoring during the whole fishing operation (haul). The predicted count for the whole haul was then compared with the manual count of the catch captured by the in-trawl image acquisition system and with the actual catch count performed onboard the vessel. We calculated an absolute error ($E$) (Equation (14)) of the predicted catch count to evaluate the algorithm performance in the catch description of the entire haul: $E = N_{GT} - N_{P}$ (14), where $N_{GT}$ denotes the ground truth count and $N_{P}$ corresponds to the count per class predicted by the algorithm.
All Nephrops were identified and counted onboard the vessel. Only the commercial species were counted onboard among the other three classes. Thus, cod and hake were counted onboard in the round fish category; plaice, lemon sole (Microstomus kitt, Walbaum, 1792) and witch flounder (Glyptocephalus cynoglossus, Linnaeus, 1758) were counted in the flat fish class; and squid (Loligo vulgaris, Lamarck, 1798) was counted in the other class.

Training
The selected values for the learning rate varied from 0.0003 to 0.0005 (Table 1). The specific values were chosen to prevent exploding gradients resulting in backpropagation failure. The Keras 'ReduceLROnPlateau' callback was used to halve the learning rate if the validation loss had stopped decreasing for 12 epochs. The lower bound for the learning rate was set to 0.0001. The small value of the learning rate required more training iterations; therefore, the best performing models were trained for more than 60 epochs, with a maximum of 100 epochs. We explored the use of one and two images per batch and, in general, the model performance was observed to be higher with two images per batch, except for the model trained with the blur augmentation. We also experimented with the number of source images providing the instances to be pasted to the destination training image. The number of source images varied from two to five, which provided similar model performance; however, the use of three source images provided the highest scores.
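The learning rate schedule can be illustrated with a plain-Python simulation of the plateau rule. It is a simplified sketch: the starting value of 0.0005 is one of the reported values chosen for illustration, and the function `plateau_schedule` is hypothetical. In Keras, a comparable configuration would presumably be `ReduceLROnPlateau(monitor="val_loss", factor=0.5, patience=12, min_lr=0.0001)`, whose additional options (such as cooldown and minimum improvement) are not modeled here.

```python
def plateau_schedule(val_losses, lr=0.0005, factor=0.5, patience=12, min_lr=0.0001):
    """Return the learning rate in effect after each epoch.

    The rate is multiplied by `factor` whenever the validation loss has not
    improved for `patience` consecutive epochs, and never drops below `min_lr`.
    """
    best = float("inf")
    wait = 0
    history = []
    for loss in val_losses:
        if loss < best:
            best, wait = loss, 0
        else:
            wait += 1
            if wait >= patience:
                lr = max(lr * factor, min_lr)
                wait = 0
        history.append(lr)
    return history


# A validation loss that never improves after the first epoch.
rates = plateau_schedule([1.0] * 37)
```

Starting from 0.0005, the rate drops to 0.00025 after 12 stagnant epochs, to 0.000125 after 24 and is clipped at the 0.0001 floor after 36, which illustrates why the floor is reached only in long training runs.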

Evaluation
As we are interested in the automated description of the total catch, we averaged the resulting F-scores among the four categories and used the average as the major indicator of the algorithms' performance (Figure 4). The first pattern apparent in Figure 4 is the difference in the algorithms' performance between the two test videos. Overall, the algorithms' F-scores for the "Haul-back" video were lower than those for the "Towing" video. For the baseline model, the F-score was 15% lower when tested on the "Haul-back" video compared to the "Towing" video. Among all the studied detectors, the algorithm with the baseline model expectedly showed the lowest F-scores in both video test cases. The highest F-score of 0.79 was reached with the algorithm utilizing Mask R-CNN trained with all augmentations, applied to the "Towing" video. For the "Haul-back" video, the algorithm with Mask R-CNN trained with CP, geometric transformations and cloud augmentation showed a slightly higher F-score than the algorithm with the detector trained with all augmentations.
The detailed table (Table A1) containing the values of the calculated Precision, Recall and F-score for all four categories in the two test videos is presented in Appendix A.
The detection examples obtained using the Mask R-CNN trained with all augmentations as a detector on the "Towing" and "Haul-back" video frames are presented in Figure 5.

Comparison of Automated and Manual Catch Descriptions
The automated count estimated per frame of the test videos was closer to the ground truth count in the case of the "Towing" test video (Figure 6), supporting the algorithms' higher F-scores (Figure 4). During the "Haul-back", the automated count of Nephrops tended towards underestimation by both algorithms, whereas in the case of the round fish and flat fish classes, an opposite trend of overestimation was observed. In the case of the other class, the algorithm based on training with "Cloud" augmentations approximated the real count better than the algorithm with all tested augmentations implemented during training. The manual catch count onboard deviates from the ground truth count in the videos due to catch items avoiding the camera field of view and due to the variations in the class assignment criteria (Table 2). All captured Nephrops, both in the resulting catch and captured by the in-trawl image acquisition system, were counted. In the case of the round fish and flat fish classes, only the commercial species were counted onboard. The criteria for assigning catch items to the round fish and flat fish classes for the purpose of automated detection and counting were based on the object aspect ratio assumption. Thus, in addition to the commercial species counted onboard, a number of non-commercial species contribute to the manual count in the videos. The reason for the mismatch between the manual count of the other class onboard and in the videos is similar: only one species is considered commercial in this class and hence counted onboard. We can conclude that 73% of Nephrops are recorded by the in-trawl image acquisition system. The algorithm based on Mask R-CNN trained with "Cloud" augmentations outputs the count closest to the manual count. The average F-score of this algorithm, estimated for the two test videos, is 0.73 (Table A1). All of the algorithms tend to overestimate the count of the other three classes. Figure 7 reveals the time interval of the fishing operation that corresponds to the largest bias in the automated count.
The largest absolute error of the predicted automated count output by the two best performing algorithms was observed in the video depicting the initialization of the catch process. This time stamp corresponds to the phase of the fishing operation when the trawl gets in contact with the seabed, which causes increased sediment resuspension, the presence of which contributes to the count bias towards false positive detections. During towing, the absolute error in the automated count produced by both algorithms remains low. The video recordings of the catch monitoring during the entire trawling are available as the data supporting the reported results [34].

Discussion
In this study, we have described an automated video processing solution for catch description during commercial demersal trawling. The algorithm is tuned for a dataset collected in the Nephrops-directed mixed species fishery, which was obtained with the aid of the in-trawl observation section enabling sediment-free video footage during demersal trawling. The use of augmentations during training boosted the algorithm performance for both the towing and haul-back phases of the trawling operation. Based on the absolute error estimation of the automated count, we can conclude that the algorithm's performance is challenged in the initialization phase of demersal trawling. However, the error in the automated count remained low during towing, which constitutes the core of the demersal trawling operation. These results indicate the readiness of the proposed solution for at-sea application. Considering that demersal trawling is today more or less a blind process, the system has the potential to transform the traditional demersal trawl fishery into a more informed, targeted and efficient process.

Towards Precision Fishing
The concept of precision fishing implies advanced analytics for big data collected by ubiquitous perception devices [35]. The resulting analysis of the videos collected on-board can provide detailed catch statistics. Today, such on-board monitoring systems are primarily used by managers and scientists to establish and update the regulations in a reactive manner. The demonstrated approach presents the possibility for fishers to utilize this information directly during the fishing process, which makes it an assertive rather than a reactive management tool. The application of the system on a commercial scale offers a win-win solution for both fishers and managers. Using the obtained information regarding the catch composition and amount, fishers can react immediately to the presence of bycatch and thereby make their process more targeted and efficient, which will align ecological and economic sustainability.
The system is developed for commercial trawl fisheries, using the Nephrops-directed trawl fishery as a case study. The amount of bycatch in the mixed demersal trawl fishery targeting Nephrops is higher compared to the mixed fishery targeting fish species [2]. Thus, the proposed solution is expected to have a higher impact when applied to this fishery. Nephrops-directed fisheries operate with low headline demersal trawls [5], where the implementation of monitoring devices is challenged by the smaller gear dimensions and the proximity to the seabed. We have demonstrated that the developed in-trawl observation system and the automated catch description approach are effective in this fishery. Demersal trawl fisheries that target other species experience similar challenges to the Nephrops fishery [1], so we expect that the proposed optical monitoring tool can be adapted for the majority of the demersal trawl fisheries, following further acquisition of species-specific data and labelling. With an increasing demand for seafood, the introduction of novel technology that can improve extraction patterns in the commercial fisheries is crucial for the sustainable use of limited natural resources [35].

Algorithm Performance
The tested algorithms performed worse on the "Haul-back" video compared to the "Towing" video (Figure 4; Table A1). This observation is expected, as changing hydrodynamic conditions alter the background panel position, which may contribute to FP detections of the background as an object due to the reflection of light and irregular curvatures. In addition, some fish species hold in front of the observation section for longer periods of time during the towing phase and first fall through when haul-back is initiated, causing a heavy increase in occlusion due to crowding. However, such conditions are present when the trawl is hauled back; thus, at that point, the decision to terminate the fishing operation has already been made. Our findings indicate that the algorithms are suitable for serving as an automated processing tool for the video stream and as a decision support tool that spares fishers the manual analysis of the videos. The efficiency of the system as a decision support tool relies on high algorithm accuracy. In this study, we have demonstrated a maximum F-score of 0.79 by improving the accuracy of detection (Appendices A and B) and by extending the SORT algorithm with an evidence vector for more accurate class-to-track assignment as well as with cascade matching to reduce erroneous detection-to-track assignments between overlapping objects. The duplicate counts of the objects escaping from the top band of the frame were accounted for by introducing a filter in the top fifth of the frame.
Mask R-CNN has proven to be an efficient tool in related studies of catch registration on the conveyor belt as well as in-trawl catch monitoring in the pelagic fishery [13][14][15]. To our knowledge, we present the first solution for automated catch description in the commercial demersal trawl fishery. It is made possible by a systematic approach to ensuring data quality during towing and by fine-tuning the algorithm to the collected data. We foresee the need for additional fine-tuning of the algorithm for it to be used effectively in different conditions. Once the system is implemented by end users, we expect the detection accuracy to improve as more data are collected and used to update the existing dataset [36].

Algorithm Real-World Application
To implement an effective decision support tool for fishers, the automated data processing needs to be close to real time. The proposed algorithm needs approximately 6000 s to process the "Towing" and "Haul-back" videos, which are 450 s and 385 s long, respectively. Our proposed solution can be optimized to increase the inference speed of Mask R-CNN via NVIDIA TensorRT™. Another option is to consider a different model architecture, such as single-stage detectors, which are much faster but do not provide the pixel-wise mask information that is essential for precise size estimation. At the data acquisition level, the input video stream can be subsampled so that only every n-th frame is processed, in which case the SORT component of the algorithm must be tuned for the resulting reduction in update rate.
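The gap to real time can be made concrete with a short calculation. The sketch below assumes that the reported 6000 s covers both test videos together and that processing cost scales linearly with the number of frames processed; both are assumptions for illustration, and the function names are hypothetical.

```python
import math


def realtime_factor(processing_s, video_s):
    """How many times slower than real time the processing runs."""
    return processing_s / video_s


def min_stride(processing_s, video_s):
    """Smallest n such that processing every n-th frame keeps pace with the video,
    assuming cost is proportional to the number of processed frames."""
    return math.ceil(processing_s / video_s)


video_s = 450 + 385  # "Towing" + "Haul-back" durations in seconds
factor = realtime_factor(6000, video_s)
stride = min_stride(6000, video_s)
print(f"slowdown: {factor:.2f}x, minimum stride: {stride}")
```

Under these assumptions, processing runs roughly seven times slower than real time, so subsampling alone would require skipping most frames, which is why it would likely need to be combined with faster inference.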
The comparison of automated and manual catch counts indicated that the absolute error peaks in the trawling initialization phase (Figure 7). This phase corresponds to 11% of the total duration of the fishing operation. Because it is a routine procedure, the time required to initialize trawling will be similar among operations. Thus, this percentage will decrease with longer trawling and hence have a lower impact on the resulting count accuracy. Additionally, during this phase the trawl is not fully operational: the trawl geometry is unstable while the gear is settling on the seabed, which may reduce the number of catch items entering the gear.
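The effect of haul duration on the share of the initialization phase can be illustrated with a small calculation. The sketch assumes that the initialization time is constant across operations, as argued above; durations are expressed relative to the observed haul, and the function name is hypothetical.

```python
def init_share(base_share, base_duration, new_duration):
    """Share of the initialization phase for a haul of a different duration,
    assuming the absolute initialization time stays constant."""
    init_time = base_share * base_duration
    return init_time / new_duration


# Observed haul: initialization takes 11% of the operation (duration normalized to 1.0).
for relative_duration in (1.0, 2.0, 3.0):
    share = init_share(0.11, 1.0, relative_duration)
    print(f"{relative_duration:.0f}x duration -> {share:.1%} initialization")
```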

Prospective Applications
The application of the Mask R-CNN architecture in combination with a stereo camera also allows automated size estimation of the catch. Automated length estimation of fish with the aid of Mask R-CNN has been shown to be efficient, for example by extrapolating the estimated fish head length to the total length via a modelled ratio [37]. Another study, by Yu et al. [38], demonstrates measurements of the body and caudal peduncle lengths and widths, and of the eye and pupil diameters, of the target fish species. These studies suggest that the total size of a fish can be derived from the sizes of specific body features. Considering that the size of Nephrops is estimated from the carapace length [39] and that most individuals have the carapace visible in the camera field of view, there is an opportunity to register size measurements automatically as well.

Conclusions
The proposed solution is part of a catch monitoring tool developed for the commercial demersal trawl fishery and has the potential to transform these fisheries from a blind to an informed process in which the fisher can automatically obtain the composition and number of species in the catch. The algorithm showed high performance under towing conditions and can therefore be applied for automated data processing and act as a decision support tool for fishers, provided it is adjusted towards near real-time performance. Future work includes embedding the algorithm on portable hardware for practical use and exploring the possibilities for automated catch measurements.
The application of the blur augmentation decreased the F-score by 3% compared to the CP-only augmentation in the case of "Haul-back" and by 4% in the case of "Towing", which indicates that this augmentation type does not fully replicate the blur rate of the dataset. However, the sequential application of all tested augmentations during training resulted in the highest F-score when applied to the "Towing" video.
Another augmentation technique from the imgaug library, "Cloud", in combination with CP resulted in an increase of 1% in the case of the "Towing" video and of 1.5% in the case of the "Haul-back" video. For the latter, the "Cloud" augmentation with CP even resulted in an F-score surpassing that of the detector trained with all applied augmentations. However, for the "Towing" video, the detector trained with only the CP and "Cloud" augmentations yielded a lower F-score than the detector trained with all tested augmentations.
Overall, the major contribution to the improvement in detector performance was achieved through the CP augmentation, which resulted in a higher number of instances per training image. Using synthetic images for training is common when training deep learning models for real-world applications, for example in biomedical fields. For instance, Frid-Adar et al. [40] used synthetic images generated by Generative Adversarial Networks (GANs). The authors explored two types of GANs to synthesize artificial images for liver disease classification. Additionally, the authors observed a positive trend in the resulting performance of the classifier when using a combination of geometric transformations and synthetic data.
In the fisheries domain, Allken et al. [11] observed a similar trend when creating a synthetic dataset from raw images of pelagic fish species, taking a background-only image as the destination and cropping fully visible fish instances from the source images. Before pasting, the fish instances were subjected to flipping, rotation and scaling. Inception3 pretrained on the ImageNet dataset was then used for a classification task and showed the highest accuracy for three fish species after being trained on a synthesized dataset of 15,000 images generated from 70 source images. One significant difference of our approach to synthesizing data using CP is that the instances are cropped and pasted for each image dynamically during training, instead of using statically generated images. This feature adds extra variability to the training set.
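The difference between static and dynamic Copy-Paste can be sketched as follows. This is a minimal NumPy illustration, not the implementation used in the study: the function `copy_paste`, the flip probability, and the toy images are assumptions, and rotation, scaling and label handling are omitted. Calling such a function inside the data loader at every iteration produces a different composite each time an image is drawn.

```python
import numpy as np


def copy_paste(dst_img, src_img, src_mask, rng):
    """Paste the masked instance of src_img onto a copy of dst_img at a random
    position; returns the composite image and the mask of the pasted instance."""
    h, w = dst_img.shape[:2]
    ys, xs = np.nonzero(src_mask)
    y0, y1, x0, x1 = ys.min(), ys.max() + 1, xs.min(), xs.max() + 1
    crop = src_img[y0:y1, x0:x1]
    crop_mask = src_mask[y0:y1, x0:x1].astype(bool)
    if rng.random() < 0.5:  # random horizontal flip of the instance
        crop, crop_mask = crop[:, ::-1], crop_mask[:, ::-1]
    ch, cw = crop_mask.shape
    top = rng.integers(0, h - ch + 1)   # random placement that keeps the
    left = rng.integers(0, w - cw + 1)  # instance fully inside the frame
    out = dst_img.copy()
    region = out[top:top + ch, left:left + cw]  # view into the output image
    region[crop_mask] = crop[crop_mask]         # overwrite only instance pixels
    pasted_mask = np.zeros((h, w), dtype=bool)
    pasted_mask[top:top + ch, left:left + cw] = crop_mask
    return out, pasted_mask


rng = np.random.default_rng(0)
src = np.full((8, 8, 3), 200, dtype=np.uint8)  # toy source image
mask = np.zeros((8, 8), dtype=bool)
mask[2:5, 3:6] = True                          # toy 3x3 instance mask
dst = np.zeros((8, 8, 3), dtype=np.uint8)      # toy background-only image
out, pasted = copy_paste(dst, src, mask, rng)
print(pasted.sum(), "pixels pasted")
```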

Figure 1. Image acquisition system overview. (A) Camera prototype version 2020, Atlas Maridan; (B) an outside view of the in-trawl image acquisition system.

Figure 2. Examples of the four categories used in the dataset: (A) Nephrops; (B) round fish; (C) flat fish; (D) other.

Figure 3. Examples of the augmentation techniques applied during training. (A) Original example images; (B) applied Copy-Paste and geometric transformations with the minimum values (left column) and maximum values (right column); (C) resulting augmentations with Copy-Paste + geometric transformations + color + blur + cloud with minimum (left) and maximum (right) values.

Figure 4. Effect of the augmentations applied during training on the resulting F-scores of the algorithm applied to the two test videos.

Figure 5. Multi-object detection examples obtained from the model trained with all tested augmentations and applied to: (A) the "Towing" test video and (B) the "Haul-back" test video, with a higher rate of occlusions and variation in conditions.

Figure 6. Automated count dynamics per frame of the two test case videos, "Towing" and "Haul-back". All: the algorithm based on Mask R-CNN trained with all tested augmentations applied to the images; Cloud: the algorithm based on Mask R-CNN trained with the "Cloud" augmentation applied to the images; Ground truth: the per-frame ground truth count of objects in the test videos.

Figure 7. Absolute error estimation of the automated catch count output by the two best performing algorithms applied to all consecutive videos of the whole haul duration. All: detector based on Mask R-CNN with all types of tested augmentations applied to the images during training; Cloud: detector based on Mask R-CNN with the "Cloud" augmentation applied to the images during training.

Table 1. Tuned hyperparameter values for each of the augmentation techniques, derived from experiments. CP: Copy-Paste augmentation.

Table 2. Automated (predicted) and manual catch count results per class.