Combined Color Semantics and Deep Learning for the Automatic Detection of Dolphin Dorsal Fins

Abstract: Photo-identification is a widely used non-invasive technique in biological studies for determining whether a specimen has been seen multiple times, relying only on specific unique visual characteristics. This information is essential to infer knowledge about the spatial distribution, site fidelity, abundance or habitat use of a species. Today there is a large demand for algorithms that can help domain experts analyze large image datasets. For this reason, identifying and cropping the relevant portion of an image is a non-negligible problem in any photo-identification pipeline. This paper approaches the problem of automatically cropping cetacean images with a hybrid technique based on domain analysis and deep learning. Domain knowledge is applied to propose relevant regions with the aim of highlighting the dorsal fins, then a binary classification of fin vs. no-fin is performed by a convolutional neural network. Results obtained on real images demonstrate the feasibility of the proposed approach for the automated processing of large datasets of Risso's dolphin photos, enabling its use in more complex large scale studies. Moreover, the results of this study suggest extending this methodology to biological investigations of different species.


Introduction
Nowadays the study of cetaceans is of vital importance as an attempt to understand how marine ecosystems have been changing in recent years and what the main effects of these changes are. Species monitoring is performed through the collection and evaluation of meaningful bio-ecological parameters aimed at estimating, for example, spatial distribution, site fidelity, abundance and migration as well as habitat use [1][2][3][4][5][6][7][8][9][10][11][12][13]. The estimation of these parameters can be greatly facilitated through the use of a non-invasive technique based on automated algorithms and a large data availability: the automatic photo-identification of specimens (photo-ID). Photo-ID is based on the general hypothesis that each individual is unique within its population, showing several specific physical characteristics useful for its identification. Photo-ID is especially encouraged because of its non-invasiveness and its high efficiency. However, given the widespread diffusion of mobile devices and digital cameras able to capture an extremely high number of high quality images, the photo-ID of large amounts of data must be performed with the aid of automated or semi-automated approaches.
In the literature, in the specific case of Risso's dolphins [14,15], the SPIR algorithm has been presented [16][17][18][19] to perform the photo-ID of this species in a fully automated way. The main drawbacks of even state-of-the-art methodologies, related not only to cetaceans but also to the study of other species, are the dependency on the manual intervention of an expert operator [20][21][22] as well as the unavailability of large datasets to be processed [23][24][25]. Moreover, even in the case of [16], where the photo-ID can be automatically performed using SPIR, a major concern is the problem of fin cropping, considering that pictures are generally captured from boats in real-life settings. In that regard, Ref. [26] addresses the problem of improving the assistance in the processing of massive amounts of biological iconic data. The authors assert that "the main bottleneck in processing data from photographic capture-recapture surveys is in object detection for cropping or delineating an area of interest so that matching algorithms can identify the individual". This concept can be easily extended to large scale studies of other species. The work described in this paper addresses the problem of automatically cropping a dorsal fin starting from a full frame image using deep learning models. In recent years, deep learning models have become a powerful standard resource for solving classification and regression problems in many applications [27][28][29][30][31], and are well suited to model the building blocks of an automated photo-ID pipeline. Their strength is the capability to automatically learn complex patterns in multi-dimensional signals (e.g., in images) if trained with a sufficiently high number of samples.
Particular attention is being paid to the selection of specific Convolutional Neural Network (CNN) architectures (e.g., U-NET [32] for biomedical image segmentation) or to the effect of using specific activation layers, as in [33], but, as the authors claim, one of the main difficulties in handling complex CNN architectures is the huge amount of resources needed to perform the computations. Examples of applications to marine biology can be found in [34][35][36][37][38][39][40]. In more detail, in [34] a deep learning approach is employed to determine whether an image pixel is part of the trailing edge of a fin by predicting its probability of belonging to the fin, whilst [38] addresses cetacean identification in images, proposing the use of a Mask R-CNN to automatically detect a region of interest that represents a fin in marine mammal images. However, the authors highlight the difficulties in obtaining good quality images labeled by domain experts, justifying their choice of applying transfer learning starting from a pre-trained complex model. In addition, it is worth highlighting that in full frame images taken from survey boats, even if captured by professionals, the image portion of interest that depicts a cetacean is relatively small and can also be searched for by introducing domain knowledge in a preliminary image pre-processing step, as an alternative to a Mask R-CNN approach.
For this reason, this paper presents a combination of an image pre-processing algorithm and a Convolutional Neural Network classifier, with the aim of approaching the automated crop of dorsal fins from a different point of view. This approach is an alternative to the Mask R-CNN based one, while sharing its underlying idea. In fact, the generation of the region proposals (i.e., the areas where the object of interest is likely to be found) is here delegated to an image pre-processing step, whilst the classification of fin vs. no-fin is performed by a CNN.
The paper is organized as follows: Section 2 gives an insight about the study areas, the dataset and the proposed methodology; the description of the experiments and their outcomes is reported in Section 3; Section 4 concludes the paper.

Study Areas and Dataset
The Gulf of Taranto, situated in the Northern Ionian Sea (Central-eastern Mediterranean Sea), extends from Santa Maria di Leuca to Punta Alice, covering an area of approximately 14,000 km² (see Figure 1). A complex morphology characterizes the basin. A narrow continental shelf cut by several channels identifies the western sector while descending terraces delineate the eastern one, both declining towards the Taranto Valley, a NW-SE submarine canyon system with no clear bathymetric connection to a major river system [41][42][43][44]. This singular morphology involves a complex distribution of water masses with a mixing of surface and dense bottom waters [45] and the occurrence of upwelling currents with high seasonal variability [46][47][48][49]. The second study area lies off Pico Island, one of the nine islands belonging to the Archipelago of the Azores (Portugal) (see Figure 2). The islands are separated by deep waters (ca. 2000 m) with scattered seamounts [50], stretching out over 480 km and overlapping the Mid-Atlantic Ridge. The Gulf Stream, the North Atlantic and Azores currents (and their branches) are responsible for the complex pattern of ocean circulation that characterizes the Azores, and result in waters with high salinity, high temperature and a low nutrient regime [51]. Due to the upwelling of nutrient-rich deep water currents, the runoff from land and the complex and dynamic oceanic circulation patterns, the area constitutes a food-rich oasis in the oligotrophic central North Atlantic. It is a coastal marine habitat where coastal, pelagic and deep-water ecosystems can be found in close vicinity of each other, resulting in a species-rich and highly diverse marine ecosystem [52]. Due to the absence of a continental shelf and the steep marine walls, over 25 cetacean species, including Risso's dolphins, can often be found close to shore [53,54].
The data collection used in this work contains full frame images acquired during our research in the study areas described above. More specifically, the dataset is composed of:
1. ∼10,000 pictures taken in the Gulf of Taranto (Ionian Sea) between 2013 and 2018;
2. ∼14,000 pictures taken near the Azores islands (Atlantic Ocean) in 2018.
The pictures of item 1 have been taken on board a 40 ft catamaran during standardized surveys. In fact, equally spaced random transects have been generated daily, covering about 35 nautical miles in 5 h (with a speed of about 7 knots), only in favorable weather conditions [55] (see Figure 1). All the images have been taken by marine mammal observers on the boat with a Nikon D3300 camera with a Nikon AF-P Nikkor 70-300 mm, f/4.5-6.3G ED lens. The photos have a spatial resolution of 6000 × 4000 pixels and occupy about 90 GB of storage.
The pictures of item 2 have been obtained off Pico island, covering approximately 540 km² during 2018. Risso's dolphins were first located from a land based lookout (38.4078 N and 28.1880 W) using 25 × 80 binoculars (Steiner observer) [56] and encountered during ocean based surveys, using a 5.8 m long zodiac equipped with a 50 HP outboard engine. Examples of the images are shown in Figure 3.

Methodology
Before carrying out photo-ID investigations, the image portions of interest that depict a dorsal fin must be cropped [26]. Figure 4 shows the block diagram of the proposed two-stage solution, which makes clear that a full frame image needs to be pre-processed and cropped in order to be subsequently used in an effective way. The two steps involved are the following:
• image pre-processing using 3D polyhedron-based color segmentation;
• classification based on CNN.
Figure 4. High level block diagram of the proposed approach. A full frame input image is first pre-processed in order to extract regions of interest that may contain a dorsal fin. Then, the classification of fin vs. no-fin is performed using a Convolutional Neural Network (CNN) specifically designed to this end. The CNN block refers to the same Convolutional Neural Network that is used to classify each cropped image.

3D Polyhedron-Based Color Segmentation
The hypothesis behind this approach is straightforward and was inspired by the specific domain of the problem: assuming that images are generally composed of two main elements, sea and cetaceans, dorsal fins can be located by considering only pixels which assume a specific set of a priori fixed colors. The proposed method is based on the identification of $\kappa$ different models consisting of color clusters in the CIE L*a*b* color space, each one representing a specific shooting condition. Each model $m_i$, $i = 1, \dots, \kappa$, is a key-value pair $m_i = (\sigma_i, P_i)$, where the key $\sigma_i$ is a descriptor of the sea and the corresponding value $P_i$ is a descriptor of the dorsal fins. In more detail, each sea descriptor $\sigma_i = \{(L^{(i)}_{\min}, L^{(i)}_{\max}), (a^{(i)}_{\min}, a^{(i)}_{\max}), (b^{(i)}_{\min}, b^{(i)}_{\max})\}$ defines a lower bound and an upper bound for each channel of the Lab color space with the aim of filtering the pixels belonging to the sea. $P_i = \{(L_j, a_j, b_j) \mid j = 1, \dots, N_i\}$ is a set of Lab color triplets that define a 3D polyhedron in the Lab color space and can be used to mask image regions belonging to dorsal fins.
Whenever a full frame image $I$ needs to be segmented to identify the candidate fins, the following steps are performed, as qualitatively shown in Figure 5:
1. Sea color estimation, to identify the best model among the $\kappa$ with a majority voting approach, i.e., the model $m_{i^*}$ that masks the highest number of sea pixels:
$$i^* = \arg\max_{i = 1, \dots, \kappa} \sum_{j,k} \mathbb{1}_{sea(\sigma_i)}(I_{jk}),$$
where $\mathbb{1}$ denotes the indicator function, $I_{jk}$ denotes the pixel of image $I$ at position $(j, k)$ and $sea(\sigma_i)$ is the set of Lab color triplets where the three channel values simultaneously lie within the intervals defined in $\sigma_i$;
2. Dorsal fins region proposal: a binary mask is computed by filtering the image $I$ with the 3D polyhedron $P_i$ of the selected model. Each of the resulting connected components (e.g., according to 8-connectivity) likely contains a dorsal fin.
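The sea color estimation step can be illustrated with a minimal sketch, assuming that the image is available as a list of Lab triplets and that each sea descriptor is a triple of (lower, upper) bounds. The model names and numeric bounds below are hypothetical and chosen only for illustration; they are not the values identified in the experiments, and the function names are not part of the original method.

```python
from typing import Dict, Iterable, Tuple

Lab = Tuple[float, float, float]
Bounds = Tuple[Tuple[float, float], Tuple[float, float], Tuple[float, float]]


def in_sea(pixel: Lab, sigma: Bounds) -> bool:
    """True when all three channels lie within the bounds stored in sigma."""
    return all(lo <= value <= hi for value, (lo, hi) in zip(pixel, sigma))


def select_sea_model(pixels: Iterable[Lab], models: Dict[str, Bounds]) -> str:
    """Return the key of the model whose sea descriptor masks the most pixels."""
    pixels = list(pixels)
    counts = {
        name: sum(in_sea(p, sigma) for p in pixels)
        for name, sigma in models.items()
    }
    return max(counts, key=counts.get)


# Hypothetical sea descriptors: (L bounds, a bounds, b bounds).
models = {
    "light": ((50.0, 80.0), (-30.0, 10.0), (-40.0, -5.0)),
    "dark": ((10.0, 45.0), (-10.0, 10.0), (-30.0, 0.0)),
}
# A toy "image" of four Lab pixels.
pixels = [(60.0, -5.0, -20.0), (65.0, 0.0, -10.0), (20.0, 0.0, -10.0), (30.0, 40.0, 40.0)]
best = select_sea_model(pixels, models)  # "light" masks 2 pixels, "dark" masks 1
```

Ties are resolved here in favor of the first model listed, which is a choice of this sketch rather than something the text specifies.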
Further processing steps are also considered with the aim of improving the results of the segmentation:
• median filtering (for salt and pepper noise reduction), hole filling and selection of connected regions based on their area;
• aspect ratio (width/height) analysis to discard regions with a high aspect ratio, due to their low probability of representing a dorsal fin useful for photo-ID purposes;
• size refinement of single regions based on their centroids and extreme points in order to include only relevant portions of the fins.
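As a rough illustration of the region proposal and of two of the refinement steps above, the following sketch labels the 8-connected components of a binary mask and discards regions by area and aspect ratio. The thresholds `min_area` and `max_aspect` are hypothetical parameters, since the text does not report the values actually used; median filtering, hole filling and size refinement are omitted.

```python
from collections import deque
from typing import List, Tuple

Box = Tuple[int, int, int, int, int]  # (row_min, col_min, row_max, col_max, area)


def connected_regions(mask: List[List[int]]) -> List[Box]:
    """Find 8-connected components of a binary mask with a breadth-first search."""
    rows, cols = len(mask), len(mask[0])
    seen = [[False] * cols for _ in range(rows)]
    regions = []
    for r in range(rows):
        for c in range(cols):
            if not mask[r][c] or seen[r][c]:
                continue
            seen[r][c] = True
            queue = deque([(r, c)])
            r0, c0, r1, c1, area = r, c, r, c, 0
            while queue:
                y, x = queue.popleft()
                area += 1
                r0, c0, r1, c1 = min(r0, y), min(c0, x), max(r1, y), max(c1, x)
                # Visit the 8 neighbours (the centre pixel is already marked as seen).
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < rows and 0 <= nx < cols
                                and mask[ny][nx] and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            regions.append((r0, c0, r1, c1, area))
    return regions


def keep_region(box: Box, min_area: int, max_aspect: float) -> bool:
    """Keep regions that are large enough and not too wide (width/height)."""
    r0, c0, r1, c1, area = box
    aspect = (c1 - c0 + 1) / (r1 - r0 + 1)
    return area >= min_area and aspect <= max_aspect


mask = [
    [1, 1, 0, 0, 0, 0, 0, 0],
    [1, 1, 0, 0, 0, 0, 0, 0],
    [0, 0, 1, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 0],
    [1, 1, 1, 1, 1, 1, 1, 1],
]
regions = connected_regions(mask)
kept = [box for box in regions if keep_region(box, min_area=4, max_aspect=3.0)]
```

In this toy mask the diagonal pixel joins the upper-left blob only because 8-connectivity is used, and the bottom stripe is rejected for its aspect ratio of 8.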
At the end of the procedure, a certain number of proposed regions is available for the initial image I and each of them needs to be classified as fin or no-fin by the deep learning model.
The key hypothesis of the method is that, given the limited domain of the problem, colors are treated as carrying precise semantic information. However, it is possible to show that color semantics is not uniquely determined among pictures: depending on the amount and the type of light characterizing the scene (which depend, in turn, on weather conditions, time of the shot and presence of other elements), the same colors can represent sea in some pictures and fins in others. Overcoming this limitation, hereafter referred to as color semantic ambiguity, is crucial for the development of an efficient segmentation method based on colors. Here, multiple models are considered to properly handle ambiguous cases.

Color Models Update
For each sea color set $(\sigma_i)_{i=1}^{\kappa}$, a semi-automated iterative procedure has been established for the creation of the corresponding polyhedra $(P_i)_{i=1}^{\kappa}$ as well as their subsequent update. The steps are detailed in Algorithm 1.

Algorithm 1: Color models update
Result: Associate a dorsal fin color set $P_i$ to each cluster defined by $\sigma_i$
Input: Sea color sets defining clusters $(\sigma_i)_{i=1}^{\kappa}$, clusters of images $(C_i)_{i=1}^{\kappa}$, a loss function $\ell(\tilde{P}, P^{(I)})$ measuring the segmentation error incurred in masking $I$ with $\tilde{P}$, a threshold $\delta$ representing the minimum accepted improvement of the segmentation error to update a fin color set, a threshold $\epsilon$ representing the target segmentation error
Output: Final color clusters models $M = (\sigma_i, P_i)_{i=1}^{\kappa}$
$M \leftarrow \emptyset$ : color clusters initialization
for $i = 1 : \kappa$ do
    $P_i \leftarrow \emptyset$ : dorsal fin color set initialization
    repeat
        $S \leftarrow$ manually selected picture from the cluster $C_i$
        $S_f \leftarrow$ sea_filter($S$, $\sigma_i$) : filtering of pixels representing the sea
        $\tilde{P} \leftarrow$ manually selected dorsal fins color set from $S_f$
        $F \leftarrow P_i \cup \tilde{P}$ : temporary filter creation/extension
        if loss variation: $\sum_{I \in C_i} \ell(P_i, P^{(I)}) - \sum_{I \in C_i} \ell(F, P^{(I)}) > \delta$ then
            $P_i \leftarrow F$ : create/update the filter
        end
    until overall loss: $\sum_{I \in C_i} \ell(P_i, P^{(I)}) \leq \epsilon$ (if this condition is never reached, then modify the sea color set $\sigma_i$ in order to exclude images causing a large overall loss, compute the new clusters $(C_i)_{i=1}^{\kappa}$ and repeat the procedure for all the updated clusters)
    $M \leftarrow M \cup (\sigma_i, P_i)$
end
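The inner loop of Algorithm 1 for a single cluster can be sketched as follows. This is only an illustration under strong assumptions: the manual selection of pictures and colors is replaced by a pre-supplied list of candidate color sets, and `toy_loss` is a hypothetical stand-in for the segmentation loss, which the text does not define explicitly. The fallback that modifies the sea color set when the target error is never reached is not modeled.

```python
from typing import Callable, FrozenSet, Iterable, Sequence, Tuple

Color = Tuple[int, int, int]
ColorSet = FrozenSet[Color]


def update_fin_colors(
    candidates: Iterable[ColorSet],
    cluster: Sequence[object],
    loss: Callable[[ColorSet, object], float],
    delta: float,
    epsilon: float,
) -> ColorSet:
    """Grow a fin color set, accepting a candidate only if the loss drops by more than delta."""

    def total(colors: ColorSet) -> float:
        return sum(loss(colors, image) for image in cluster)

    fin_colors: ColorSet = frozenset()
    for candidate in candidates:            # stands in for the manual selection step
        extended = fin_colors | candidate   # temporary filter F
        if total(fin_colors) - total(extended) > delta:
            fin_colors = extended           # create/update the filter
        if total(fin_colors) <= epsilon:    # target segmentation error reached
            break
    return fin_colors


def toy_loss(colors: ColorSet, image: Tuple[ColorSet, ColorSet]) -> float:
    """Hypothetical loss: fin colors missed plus sea colors wrongly selected."""
    fin, sea = image
    return len(fin - colors) + len(colors & sea)


A, B, C, D = (40, 0, -5), (45, 1, -6), (60, -10, -30), (50, 2, -4)
# Each toy image is a pair (true fin colors, sea colors).
cluster = [(frozenset({A, B}), frozenset({C})), (frozenset({A, D}), frozenset({C}))]
candidates = [frozenset({A}), frozenset({C}), frozenset({B, D})]
result = update_fin_colors(candidates, cluster, toy_loss, delta=0.5, epsilon=0.0)
```

In this toy run the candidate containing only the sea color C is rejected because it increases the loss, while the other two candidates are accepted.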

Convolutional Neural Network
A binary classification problem is defined in order to filter the results of the segmentation phase. The images obtained are labeled as fin if they actually contain a dorsal fin, and as no-fin otherwise.
The proposed classifier is a Convolutional Neural Network built from scratch (Figure 6), whose structure is inspired by the one implemented in [57] for a binary classification task applied to another domain. The input size of the images is 224 × 224 × 3. The architecture is composed of three blocks of convolutional layers which preserve the input size and extract local features through 3 × 3 filters coupled with the ReLU activation function. Information from such features is then merged in later stages of processing in order to detect higher-order features and ultimately to yield information about the image as a whole. Each block halves the output size by applying max pooling downsampling, with the aim of learning representations that are invariant with respect to rotations and translations [58]. The last three blocks are fully connected layers that use the extracted features to obtain a final binary prediction through a Softmax activation function. The CNN architecture has been designed with the criterion of maximizing the clarity of its structure and minimizing the number of parameters, whilst keeping high efficiency in the classification task. The proposed classifier consists of ∼1.7 million parameters, requiring ∼6.4 MB for the network to be stored and ∼7 MB to store all the intermediate processing steps needed to classify an unknown input (forward pass). These figures are significantly lower than those of state-of-the-art architectures available as off-the-shelf models, for instance GoogleNet, AlexNet, VGGNet or ResNet.
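The figures quoted above can be sanity-checked with a few lines of arithmetic. The sketch below assumes that each pooling stage exactly halves the spatial size and that parameters are stored as 32-bit floats, with storage expressed in units of 2^20 bytes; neither the storage format nor the unit convention is stated in the text, so the result is only an order-of-magnitude check.

```python
from typing import List


def spatial_sizes(input_size: int, n_blocks: int) -> List[int]:
    """Spatial side length after each block, assuming each block halves it."""
    sizes = []
    size = input_size
    for _ in range(n_blocks):
        size //= 2
        sizes.append(size)
    return sizes


def float32_storage(n_parameters: int) -> float:
    """Storage in units of 2**20 bytes, assuming 4 bytes per parameter."""
    return n_parameters * 4 / 2**20


sizes = spatial_sizes(224, 3)         # [112, 56, 28]
storage = float32_storage(1_700_000)  # between 6 and 7, consistent with the reported order of magnitude
```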
Moreover, it is worth noting that the use of 3 × 3 filters causes the receptive field of the third convolutional layer to be of size 7 × 7 with respect to the input layer, which is considered to be a reasonable dimension for the extraction of meaningful features. Using several 3 × 3 filters instead of a single 7 × 7 filter makes this result possible with fewer parameters. Supposing that all the volumes have $C$ channels, the single 7 × 7 convolutional layer would contain $C \times (7 \times 7 \times C) = 49C^2$ parameters, while the three 3 × 3 convolutional layers would only contain $3 \times (C \times (3 \times 3 \times C)) = 27C^2$ parameters, thus reducing the number of parameters involved by about 45%.
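The receptive field and parameter counts in this argument can be reproduced with a short computation. The sketch assumes stride-1 convolutions stacked directly on top of each other and ignores bias terms, as the formulas in the text do; the channel count `C = 64` is an arbitrary illustrative value.

```python
from typing import Optional, Sequence


def receptive_field(kernel_sizes: Sequence[int],
                    strides: Optional[Sequence[int]] = None) -> int:
    """Receptive field of the last layer of a stack of convolutions."""
    strides = strides or [1] * len(kernel_sizes)
    field, jump = 1, 1
    for k, s in zip(kernel_sizes, strides):
        field += (k - 1) * jump  # each layer widens the field by (k - 1) input steps
        jump *= s                # distance between adjacent outputs, in input pixels
    return field


def conv_weights(kernel: int, channels: int, layers: int = 1) -> int:
    """Weights of `layers` convolutions with C input and C output channels (no biases)."""
    return layers * channels * (kernel * kernel * channels)


rf = receptive_field([3, 3, 3])         # 7
C = 64                                  # illustrative channel count
single = conv_weights(7, C)             # 49 * C**2
stacked = conv_weights(3, C, layers=3)  # 27 * C**2
saving = 1 - stacked / single           # 22/49, i.e. about 0.45
```

The ratio does not depend on `C`, since both counts scale with the square of the channel number.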

Experiments and Results
With reference to Figure 4, the first experiment is devoted to the evaluation of the image pre-processing step. $\kappa = 5$ different models, whose names reflect the appearance of the sea (Azure, Blue-Gray, Dark Blue, Light Blue-Green and Gray), have been defined with Algorithm 1 on a small subset of images sampled from the dataset, with the aim of avoiding bias. Table 1 shows the details of the five sea color sets $(\sigma_i)_{i=1}^{\kappa}$ that have been identified in the experiment. The variability of the data is immediately noticeable and reflects the need to define multiple sea models. As pointed out before, these thresholds are highly dependent on the experimental setup used to capture the images as well as on the weather conditions during the acquisition campaigns. It is worth noting at this point that the choice of $(\sigma_i)_{i=1}^{\kappa}$ is a way of ensuring the convergence of Algorithm 1 with respect to the dataset considered and to the supervised evaluation procedure described in the section Color Models Update. The fin color sets $(P_i)_{i=1}^{\kappa}$ obtained are reported in Table 2 in terms of median values and median absolute deviations of the L, a and b coordinates. The corresponding boxplots, shown for each model in Figure 7, show a slight but clear difference in the appearance of the fin for the five models, with different ranges for the three components, especially the b one, which is largest in the case of $m_4$. Moreover, the statistics highlight the presence of outliers for the a and b components in all the models except $m_2$; these outliers correspond to large polyhedra in the Lab space, as shown in Figure 8.
Table 1 (partial): sea color sets. Recoverable entries: [50, 71], [−31, 7], [−40, −9] (row preceding $m_5$); $m_5$ (Gray): [10, 65].
The difference in the shapes of the polyhedra suggests how the color semantic ambiguity affects the solution.
Figure 9 reports a qualitative comparison of the proposed approach with the well-known Otsu-based segmentation of background and foreground, where it is straightforward to notice that the 3D polyhedron-based segmentation clearly outperforms the Otsu-based one. A more detailed comparison of the two methods follows. The Otsu-based approach works as follows: given an image $I$ and two thresholds $t_L$, $t_b$ maximizing the inter-class variance on the histograms of the channels L and b, the segmented image is computed by filtering out the pixels of $I$ at position $(j, k)$ according to these two thresholds. Otsu's segmentation was successfully applied to segment the dorsal fin from the sea in [17]. However, applying this technique to a small subset of the dataset showed that more than 50% of the images had to be discarded, thus making the automatic crop unfeasible. This is due to the fact that the two binary thresholds on the L and b channels are not enough to clearly identify the region of interest that depicts a dorsal fin. Figure 10 shows examples of unusable images obtained with Otsu's algorithm.
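For reference, a single Otsu threshold on one channel can be computed as in the sketch below, which maximizes the between-class variance over all candidate levels. It assumes that the channel has been quantized to integer levels in the range 0 to 255, which is a choice of this illustration; how the two thresholds on L and b are combined into a mask is not reproduced here.

```python
from typing import Sequence


def otsu_threshold(values: Sequence[int], n_levels: int = 256) -> int:
    """Return the level t that maximizes the between-class variance of {v <= t} and {v > t}."""
    hist = [0] * n_levels
    for v in values:
        hist[v] += 1
    total = len(values)
    total_sum = sum(level * count for level, count in enumerate(hist))

    best_t, best_var = 0, -1.0
    w0, sum0 = 0, 0  # pixel count and intensity sum of the lower class
    for t in range(n_levels):
        w0 += hist[t]
        sum0 += t * hist[t]
        w1 = total - w0
        if w0 == 0 or w1 == 0:
            continue  # one class is empty: no valid split at this level
        mean0 = sum0 / w0
        mean1 = (total_sum - sum0) / w1
        between = w0 * w1 * (mean0 - mean1) ** 2  # proportional to the between-class variance
        if between > best_var:
            best_t, best_var = t, between
    return best_t


# A clearly bimodal toy channel: dark values around 20-25, bright values around 200-210.
channel = [20] * 50 + [25] * 50 + [200] * 40 + [210] * 40
t = otsu_threshold(channel)  # falls between the two modes
```

When the histogram is not bimodal, the returned level still exists but no longer separates two meaningful classes, which is the failure mode discussed in the text.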
To overcome this issue, the proposed 3D polyhedron-based color segmentation is based on the creation of $\kappa$ fine-tuned models. The Otsu-based method, instead, has neither models nor parameters to tune, and its effectiveness is limited by a more restrictive hypothesis related to color semantics: for any image $I$, the histograms of the L and b channels are assumed to show a bimodal distribution that can be exploited to effectively segment the sea and the fin. For this reason, the methodology proposed in this paper can achieve good generalization, being able to overcome the color semantic ambiguity, whilst Otsu's segmentation can be effectively used only for a specific subset of the images.
Experiment number 2 is focused on the CNN training and validation, which have been performed using the first part of the dataset, i.e., pictures taken in the Gulf of Taranto (Ionian Sea) between 2013 and 2018. Starting from the images, a total number of 15,228 crops have been identified, subdivided into 4033 fin and 11,195 no-fin, as shown in Figure 11. Data have been manually labeled, and full frame images showing more than one fin have been used to produce multiple cropped fins. A total of 80% of the data has been used as training set, whilst the remaining 20% has been used as validation set. Data in the training set have been augmented following these rules: (a) randomly rotating an image by an angle α in the range [−20, +20] degrees; (b) randomly translating the input image by p pixels in the range [−60, +60] pixels; (c) randomly applying a horizontal flip, with the aim of increasing the number of samples as well as virtually balancing the two classes. The CNN has been trained using the Stochastic Gradient Descent with Momentum method, with a minibatch size of 20, 30 epochs and an initial learning rate of 0.0003. Moreover, the model has been trained five times to implement a k-fold cross validation strategy.
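The data handling described above can be sketched as follows: one function samples the parameters of a random augmentation within the stated ranges, and another produces five train/validation splits, each holding out about 20% of the crops. Several details are assumptions of this sketch and not statements of the original setup: the translation is applied independently on both axes, the flip probability is 0.5, the number of folds is taken to be five, and the splits are not stratified by class.

```python
import random
from typing import Dict, List, Tuple


def sample_augmentation(rng: random.Random) -> Dict[str, object]:
    """Draw one random augmentation within the ranges given in the text."""
    return {
        "rotation_deg": rng.uniform(-20.0, 20.0),
        "shift_px": (rng.randint(-60, 60), rng.randint(-60, 60)),  # assumed: both axes
        "horizontal_flip": rng.random() < 0.5,                     # assumed probability
    }


def k_fold_indices(n_samples: int, k: int,
                   rng: random.Random) -> List[Tuple[List[int], List[int]]]:
    """Shuffle the sample indices and return k (training, validation) index lists."""
    indices = list(range(n_samples))
    rng.shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    splits = []
    for i in range(k):
        validation = folds[i]
        training = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        splits.append((training, validation))
    return splits


rng = random.Random(0)
params = sample_augmentation(rng)
splits = k_fold_indices(15_228, 5, rng)  # 15,228 crops, five folds of 3045 or 3046 crops
```

Each crop appears in exactly one validation fold, so every crop is used once for validation and four times for training across the five networks.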
The CNN training took about 3 h and 20 min for a single network on a workstation equipped with an Intel Core i5-6400T CPU operating at 2.20 GHz, 8 GB RAM and an Nvidia GeForce 930M graphics card with 2 GB of memory, confirming that the proposed model can be trained without extremely powerful hardware. The quantitative results of this experiment are reported in Table 3 as the mean value of the three metrics achieved by the five CNNs. The metrics are evaluated per fin.
The last experiment has been designed to further validate the performance of the CNN classifier using a total number of 20,888 crops processed starting from the pictures taken near the Azores islands (Atlantic Ocean) during 2018. The aim of this experiment is to understand the generalization capabilities of the CNN developed in this work. For this reason, we have computed the Perception based Image Quality Evaluator (PIQE) index [59] on the validation dataset in order to give an overview of the variability of the images with an objective score reference. Figure 12 shows the boxplot of the PIQE scores computed on the validation dataset. The scores vary in the range 11.6928-89.5572, with a median value of 42.3608. The box (first and third quartile) ranges from 35.5100 to 50.2020. According to the quality scale associated with the PIQE, the images have a fair median quality and range from excellent to bad. The quantitative scores of the CNN in terms of Accuracy, Sensitivity and Specificity are reported in Table 4, where the scores clearly decrease with respect to the previous experiment but remain acceptable, supporting the generalization capability of the CNN.
A remark should be made on how we decided whether an input crop was actually a fin or a no-fin image. In fact, we trained 5 different CNNs for cross validation purposes, and we queried them with the following strategy: the prediction is considered robust if four CNNs out of five give the same output. This approach guarantees more robustness in the evaluation of the metrics. The decrease is expected, since the test set contains many images with completely new shooting conditions (due to a different experimental setup and geographical area) with respect to the dataset used to train and validate the classifier.
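The voting strategy and the three metrics can be illustrated with the following sketch. The text does not say how crops without a robust prediction are treated; here they are returned as `None` and left out of the metrics, which is an assumption of this example. Labels use 1 for fin and 0 for no-fin, and the votes are toy values.

```python
from typing import Dict, List, Optional, Sequence


def robust_vote(predictions: Sequence[int], min_agreement: int = 4) -> Optional[int]:
    """Return the class on which at least `min_agreement` models agree, else None."""
    fin_votes = sum(predictions)
    if fin_votes >= min_agreement:
        return 1
    if len(predictions) - fin_votes >= min_agreement:
        return 0
    return None


def binary_metrics(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, float]:
    """Accuracy, sensitivity and specificity; assumes both classes occur in y_true."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    return {
        "accuracy": (tp + tn) / len(y_true),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
    }


# Outputs of five classifiers for five toy crops, with their true labels.
votes = [[1, 1, 1, 1, 0], [0, 0, 0, 0, 0], [1, 1, 0, 0, 1], [1, 1, 1, 1, 1], [0, 1, 0, 0, 0]]
labels = [1, 0, 1, 1, 1]

decisions: List[Optional[int]] = [robust_vote(v) for v in votes]
robust = [(t, d) for t, d in zip(labels, decisions) if d is not None]
metrics = binary_metrics([t for t, _ in robust], [d for _, d in robust])
```

The third crop receives only three concordant votes and is therefore not considered robust in this example.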

Conclusions and Future Works
In this paper, an approach for the automated crop of cetacean dorsal fins in huge image datasets has been presented. The methodology is defined as a deep hybrid model because it is inspired by region proposal networks, but with the main characteristic of clearly splitting the region proposal task (pre-processing) from the classification task, which is delegated to a CNN. The main advantages of this approach are the flexibility in introducing domain knowledge in the processing pipeline (i.e., the definition of the color clusters for a specific dataset) coupled with a lightweight deep learning model that can be trained and deployed on general purpose workstations. In fact, reducing the problem to a binary classification task enables a drastic reduction of the trained model parameters, allowing widespread applicability and adaptability to multiple operative settings, even without expensive and high-performance hardware. Experiments on a high number of images acquired in a real context demonstrate the potential of the proposed approach for the automated photo-identification of individuals on a large scale. The algorithms presented and discussed are part of a more complex and ambitious photo-identification process that involves scientists with different backgrounds and expertise. Finally, a positive consequence of the approach described in this paper is the effective automation of the CNN training, because the cropped images are automatically extracted from a dataset regardless of the number of images involved. This is a non-negligible feature that must be taken into account for effectively enabling large scale studies. Future research will test the deep hybrid approach on other datasets acquired by different operators (professional or not) in different operating conditions, with the aim of understanding if and when a new training of the CNN will be needed.
Author Contributions: Conceptualization, V.R. and R.M.; methodology, V.R. and R.M.; software, G.L. and F.F.; formal analysis, G.L. and F.F.; resources, C.F., K.H. and R.C.; writing-original draft preparation, all the authors contributed; writing-review and editing, all the authors contributed. All authors have read and agreed to the published version of the manuscript.
Funding: This research received no external funding.