2.1. Study Areas and Dataset
The Gulf of Taranto, situated in the Northern Ionian Sea (central-eastern Mediterranean Sea), extends from Santa Maria di Leuca to Punta Alice and covers an area of approximately 14,000 km² (see Figure 1). A complex morphology characterizes the basin: a narrow continental shelf cut by several channels identifies the western sector, while descending terraces delineate the eastern one. Both decline towards the Taranto Valley, a NW-SE submarine canyon system with no clear bathymetric connection to a major river system [41,42,43,44]. This singular morphology involves a complex distribution of water masses, with mixing of surface and dense bottom waters [45] and the occurrence of upwelling currents with high seasonal variability [46,47,48,49].
The second study area is located off Pico Island, one of the nine islands of the Archipelago of the Azores (Portugal) (see Figure 2). The islands are separated by deep waters (ca. 2000 m) with scattered seamounts [50] and stretch over 480 km across the Mid-Atlantic Ridge. The Gulf Stream, the North Atlantic and Azores currents (and their branches) are responsible for the complex pattern of ocean circulation that characterizes the Azores and result in waters with high salinity, high temperature and a low nutrient regime [51]. Due to the upwelling of nutrient-rich deep-water currents, the runoff from land and the complex and dynamic oceanic circulation patterns, the area constitutes a food-rich oasis in the oligotrophic central North Atlantic. Coastal, pelagic and deep-water ecosystems are found in close vicinity of each other in this coastal marine habitat, resulting in a species-rich and highly diverse marine ecosystem [52]. Due to the absence of a continental shelf and the steep marine walls, over 25 cetacean species, including Risso’s dolphins, can often be found close to shore [53,54].
The dataset used in this work contains full-frame images acquired during our research in the study areas described above. More specifically, it is composed of:
1. ∼10,000 pictures taken in the Gulf of Taranto (Ionian Sea) between 2013 and 2018
2. ∼14,000 pictures taken near the Azores islands (Atlantic Ocean) in 2018
The pictures in item 1 were taken on board a catamaran during standardized surveys. Random equally spaced transects were generated daily, covering about 35 nautical miles in 5 h (at a speed of about 7 knots), and surveys were carried out only in favorable weather conditions [55] (see Figure 1). All the images were taken by marine mammal observers on the boat with a Nikon D3300 camera and a Nikon AF-P Nikkor 70–300 mm, G ED lens. The photos have a spatial resolution of pixels and occupy about 90 GB of storage.
The pictures in item 2 were obtained off Pico Island, covering approximately 540 km² during 2018. Risso’s dolphins were first located from a land-based lookout (38.4078° N, 28.1880° W) using binoculars (Steiner observer) [56] and then encountered during ocean-based surveys using a 5.8 m long Zodiac equipped with a 50 HP outboard engine. Examples of the images are shown in Figure 3.
2.2. Methodology
Before carrying out photo-ID investigations, the image portions that depict a dorsal fin must be cropped [26]. Figure 4 shows the block diagram of the proposed two-stage solution: a full-frame image must be pre-processed and cropped before it can be used effectively. The two steps involved are the following:
3D Polyhedron-Based Color Segmentation
The hypothesis behind this approach is straightforward and was inspired by the specific domain of the problem: assuming that images are generally composed of two main elements, sea and cetaceans, dorsal fins can be located by considering only pixels that take one of a set of colors fixed a priori. The proposed method is based on the identification of different models consisting of color clusters in the CIE L*a*b* color space, each one representing a specific shooting condition. Each model is a key-value pair, where the key is a descriptor of the sea and the corresponding value is a descriptor of the dorsal fins. In more detail, each sea descriptor defines a lower bound and an upper bound for each channel of the Lab color space, with the aim of filtering the pixels belonging to the sea. Each fin descriptor is a set of Lab color triplets that define a 3D polyhedron in the Lab color space and can be used to mask image regions belonging to dorsal fins.
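As a minimal sketch of how one such model could be represented and applied, the Python snippet below stores the sea descriptor as per-channel lower and upper bounds and the fin descriptor as a set of Lab triplets. It assumes that the 3D polyhedron is the convex hull of those triplets, which the text does not state, and all names (`sea_mask`, `fin_mask`, `model`, `image_lab`) and numeric values are hypothetical.

```python
import numpy as np
from scipy.spatial import Delaunay


def sea_mask(lab, lower, upper):
    """True where every Lab channel of a pixel lies within [lower, upper]."""
    lab = np.asarray(lab, dtype=float)
    return np.all((lab >= lower) & (lab <= upper), axis=-1)


def fin_mask(lab, fin_colors):
    """True where the pixel color lies inside the polyhedron.

    Assumption: the polyhedron is the convex hull of `fin_colors`.
    """
    lab = np.asarray(lab, dtype=float)
    hull = Delaunay(np.asarray(fin_colors, dtype=float))
    # find_simplex returns -1 for points outside the triangulated hull
    inside = hull.find_simplex(lab.reshape(-1, 3)) >= 0
    return inside.reshape(lab.shape[:-1])


# Hypothetical model: the key is the sea descriptor, the value the fin descriptor.
model = {
    "sea": (np.array([20.0, -30.0, -50.0]), np.array([80.0, 5.0, -5.0])),
    "fin": np.array([
        [10.0, -5.0, -5.0], [12.0, 6.0, -4.0], [11.0, -4.0, 6.0], [9.0, 5.0, 5.0],
        [45.0, -6.0, -5.0], [44.0, 5.0, -6.0], [46.0, -5.0, 5.0], [43.0, 6.0, 6.0],
    ]),
}

# Toy 1x2 image already expressed in Lab: a bluish pixel and a dark gray pixel.
image_lab = np.array([[[50.0, -10.0, -20.0], [30.0, 0.0, 0.0]]])

is_sea = sea_mask(image_lab, *model["sea"])
is_fin = fin_mask(image_lab, model["fin"])
```

In this toy example the first pixel falls within the sea bounds and the second lies inside the fin polyhedron. A real image would first have to be converted from RGB to Lab with an image-processing library.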
Whenever a full-frame image I needs to be segmented to identify the candidate fins, the following steps are performed, as qualitatively shown in Figure 5:
Further processing steps are also considered with the aim of improving the results of the segmentation:
median filtering (for salt-and-pepper noise reduction), hole filling and selection of connected regions based on their area;
aspect ratio (width/height) analysis to discard regions with a high aspect ratio, which are unlikely to represent a dorsal fin useful for photo-ID purposes;
size refinement of single regions based on their centroids and extreme points in order to include only relevant portions of the fins.
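The first two refinement steps above can be sketched with `scipy.ndimage` as follows. This is an illustrative approximation only: the function name `refine_mask` and the thresholds `median_size`, `min_area` and `max_aspect` are assumptions rather than values from the paper, and the final size refinement based on centroids and extreme points is not covered.

```python
import numpy as np
from scipy import ndimage


def refine_mask(mask, median_size=3, min_area=20, max_aspect=3.0):
    """Return bounding boxes (row0, col0, row1, col1) of plausible fin regions.

    All thresholds are hypothetical defaults chosen for illustration.
    """
    mask = np.asarray(mask, dtype=bool)
    # Median filtering removes isolated (salt-and-pepper) pixels.
    smoothed = ndimage.median_filter(mask.astype(np.uint8), size=median_size)
    # Fill holes enclosed by foreground pixels.
    filled = ndimage.binary_fill_holes(smoothed.astype(bool))
    # Label connected regions and inspect each bounding box.
    labels, _ = ndimage.label(filled)
    boxes = []
    for index, region in enumerate(ndimage.find_objects(labels), start=1):
        rows, cols = region
        height = rows.stop - rows.start
        width = cols.stop - cols.start
        area = int(np.sum(labels[region] == index))
        # Keep regions that are large enough and not too elongated.
        if area >= min_area and width / height <= max_aspect:
            boxes.append((rows.start, cols.start, rows.stop, cols.stop))
    return boxes
```

Each returned box can then be used to crop the corresponding portion of the original image before classification.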
At the end of the procedure, a number of proposed regions are available for the initial image I, and each of them needs to be classified as fin or no-fin by the deep learning model.
The key hypothesis of the method is that, given the limited domain of the problem, colors are treated as carrying precise semantic information. However, it is possible to show that color semantics is not uniquely determined across pictures: depending on the amount and type of light characterizing the scene (which depend, in turn, on weather conditions, time of the shot and presence of other elements), the same colors can represent sea in some pictures and fins in others. Overcoming this limitation, hereafter referred to as color semantic ambiguity, is crucial for the development of an efficient color-based segmentation method. Here, multiple models are considered to properly handle ambiguous cases.
Color Models Update
For each sea color set, a semi-automated iterative procedure has been established for the creation of the corresponding polyhedra as well as their subsequent update. The steps are detailed in Algorithm 1.
Algorithm 1: Color models update
Convolutional Neural Network
A binary classification problem is defined to filter the results of the segmentation phase. The images obtained are labeled as fin if they actually contain a dorsal fin, and as no-fin otherwise.
The proposed classifier is a Convolutional Neural Network built from scratch (Figure 6), whose structure is inspired by the one implemented in [57] for a binary classification task in another domain. The input size of the images is . The architecture is composed of three blocks of convolutional layers, which preserve the input size and extract local features through 3×3 filters coupled with the ReLU activation function. Information from these features is then merged in later stages of processing in order to detect higher-order features and ultimately to yield information about the image as a whole. Each block halves the output size by applying max pooling downsampling, with the aim of learning representations that are invariant to rotations and translations [58]. The last three blocks are fully connected layers that use the extracted features to obtain a final binary prediction through a Softmax activation function.
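To illustrate how the spatial size evolves through the three convolutional blocks, the following sketch traces the feature-map dimensions. It assumes 3×3 convolutions with stride 1 and one-pixel zero padding (so that the size is preserved) and 2×2 max pooling with stride 2 (so that it is halved). The 64×64 input used in the usage line is purely hypothetical, and the function names are illustrative.

```python
def conv_output(size, kernel=3, stride=1, padding=1):
    """Output size of a convolution along one spatial dimension."""
    return (size + 2 * padding - kernel) // stride + 1


def pool_output(size, window=2, stride=2):
    """Output size of max pooling along one spatial dimension."""
    return (size - window) // stride + 1


def trace_shapes(height, width, num_blocks=3):
    """List the (height, width) of the feature maps after each block."""
    shapes = [(height, width)]
    for _ in range(num_blocks):
        # The convolution preserves the size, the pooling halves it.
        height = pool_output(conv_output(height))
        width = pool_output(conv_output(width))
        shapes.append((height, width))
    return shapes


# Hypothetical 64x64 input: each block halves both dimensions.
shapes = trace_shapes(64, 64)
```

Under these assumptions, three blocks reduce each spatial dimension by a factor of eight before the fully connected layers.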
The CNN architecture has been designed to maximize the clarity of its structure and minimize the number of parameters, while keeping high efficiency in the classification task. The proposed classifier consists of ∼1.7 million parameters, requiring ∼6.4 MB to store the network and ∼7 MB to store all the intermediate processing steps needed to classify an unknown input (forward pass). These figures are significantly lower than those of state-of-the-art architectures available as off-the-shelf models, for instance GoogLeNet, AlexNet, VGGNet or ResNet.
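The storage figure can be sanity-checked with a short calculation. The sketch below assumes that each parameter is stored as a 32-bit float (4 bytes), which the text does not state explicitly; the function name is illustrative.

```python
def parameter_memory_mib(num_parameters, bytes_per_parameter=4):
    """Memory needed to store the parameters, in MiB (2**20 bytes)."""
    return num_parameters * bytes_per_parameter / 2**20


# About 1.7 million float32 parameters (rounded count from the text).
memory = parameter_memory_mib(1_700_000)
```

With the rounded count of 1.7 million parameters this gives roughly 6.5 MiB, which is of the same order as the reported ∼6.4 MB.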
Moreover, it is worth noting that the use of 3×3 filters causes the receptive field of the third convolutional layer to be of size 7×7 with respect to the input layer, which is considered to be a reasonable dimension for the extraction of meaningful features. Using several 3×3 filters instead of a single 7×7 filter makes this result possible with fewer parameters. Supposing that all the volumes have C channels, the single 7×7 convolutional layer would contain C × (7 × 7 × C) = 49C² parameters, while the three 3×3 convolutional layers would only contain 3 × (C × (3 × 3 × C)) = 27C² parameters, thus reducing the number of parameters involved by almost half.
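The receptive-field and parameter-count argument can be checked numerically. The sketch below assumes stride-1 convolutions stacked directly on top of each other (pooling between layers is ignored) and counts only weights, not biases; the function names and the value C = 32 are illustrative.

```python
def receptive_field(kernel_sizes):
    """Receptive field of stacked stride-1 convolutions w.r.t. the input."""
    field = 1
    for kernel in kernel_sizes:
        # Each stride-1 layer widens the field by (kernel - 1) pixels.
        field += kernel - 1
    return field


def conv_weights(kernel, channels):
    """Weights of one kernel x kernel layer with C input and C output channels."""
    return kernel * kernel * channels * channels


C = 32  # hypothetical number of channels
stacked = 3 * conv_weights(3, C)   # three 3x3 layers: 27 * C**2
single = conv_weights(7, C)        # one 7x7 layer:    49 * C**2
ratio = stacked / single           # 27 / 49, independent of C
```

The ratio 27/49 ≈ 0.55 does not depend on C, so the stacked design covers the same 7×7 receptive field with a little more than half of the weights.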