Weakly Supervised Detection of Marine Animals in High Resolution Aerial Images

Abstract: Human activities in the sea, such as intensive fishing and the exploitation of offshore wind farms, may have a negative impact on the marine megafauna. As an attempt to control such impacts, surveying and tracking of marine animals are often performed at the sites where those activities take place. Nowadays, thanks to high-resolution cameras and to the development of machine learning techniques, tracking of wild animals can be performed remotely and the analysis of the acquired images can be automated using state-of-the-art object detection models. However, most state-of-the-art detection methods require large amounts of annotated data to provide satisfactory results. Since analyzing the thousands of images acquired during a flight survey can be a cumbersome and time-consuming task, we focus in this article on the weakly supervised detection of marine animals. We propose a modification of the patch distribution modeling method (PaDiM), which is currently one of the state-of-the-art approaches for anomaly detection and localization in visual industrial inspection. In order to show its effectiveness and suitability for marine animal detection, we conduct a comparative evaluation of the proposed method against the original version, as well as other state-of-the-art approaches, on two high-resolution marine animal image datasets. On both tested datasets, the proposed method yielded better F1 and recall scores (75% recall/41% precision and 57% recall/60% precision, respectively) when trained on images known to contain no object of interest. This shows the great potential of the proposed approach to speed up marine animal discovery in new flight surveys. Additionally, such a method could be adopted for bounding box proposals to perform faster and cheaper annotation within a fully supervised detection framework.


Introduction
With the ever-growing exploitation of marine natural resources, surveying human activities in the sea has become essential [1]. Activities such as the installation of offshore wind farms and intensive fishing should be closely monitored, as they can have a serious impact on the marine megafauna. For instance, the noise produced during the different phases of an offshore wind farm development, including the site survey, the wind farm construction and the deployment of turbines, can potentially lead to various levels of physical injury and of physiological and behavioral changes in mammals, fish, and invertebrates [2][3][4][5]. In order to ensure that such human activities can take place without harming the marine ecosystem, different surveillance approaches have been adopted in the past years.

Version January 12, 2022 submitted to Journal Not Specified https://www.mdpi.com/journal/notspecified

Nowadays, aerial surveys are among the standard non-invasive approaches for tracking the marine megafauna [6][7][8][9][10]. Those surveys consist of flight sessions over the sea, during which environmental specialists are able to remotely observe the marine animals (e.g., seabirds, mammals, and fish) that emerge at the surface.
In parallel, high-resolution videos and photographs can be captured during the flight and later analyzed.

In this article, our main contributions are twofold: (1) a modification of the unsupervised anomaly detection method PaDiM (patch distribution modeling) [11], which we prove to be better adapted for marine animal detection than the original method; and (2) an evaluation of the proposed method and of other state-of-the-art approaches, namely PaDiM [11], OrthoAD [12], and AnoVAEGAN [13], on two high-resolution marine animal image datasets.

The performance of deep learning methods can even be similar to that of humans for some specific tasks, including pathology detection [14] and animal behavioral analysis [15]. For image classification on large datasets, such as ImageNet [16],

1. Different species can look very much alike. For instance, the dolphins of Figure 1a,b look very similar, but they belong to different species: Delphinus delphis and Stenella coeruleoalba, respectively.

2. The appearance of marine animals changes as they swim deeper in the ocean, leading to ground-truth annotations with different confidence levels. For instance, in the images of Figure 2, the presence of dolphins of the Delphinus delphis species was confirmed by specialists, but lower confidence levels were assigned to those annotations due to their blurry appearance.

3. Depending on the flight altitude, animal instances are so small that they can only be detected through their context. As an example, Figure 3 shows an image captured during a flight session of Ifremer (Institut Français de Recherche pour l'Exploitation de la Mer: https://wwz.ifremer.fr/ (accessed on 6 January 2022)). According to specialists, the bright dots inside the green bounding boxes probably correspond to marine animals, while the ones inside the red box may be sun glitter. We can observe that this analysis is only possible by taking into consideration the proximity of each patch to the sun reflection.

Due to the complexity of detecting marine animals in those various scenarios, research studies in the literature often limit their scope to the detection of a single animal species [7,9] and/or to images with a high density of animal instances [8,10]. For instance, in the early work of [7], the authors tackle the detection of dugongs in aerial images by combining an unsupervised region proposal method with a classification CNN.

On their dataset, whose number of images was not provided, the best precision and recall scores were 27% and 80%, respectively. Similarly to [7], the authors of [9] target the detection of dolphins and stingrays. The best average precision scores were obtained for both species: 30% and 35% for the detection of dolphins and of stingrays, respectively. In [8], both marine and terrestrial birds are targeted. As a novelty, the authors were able to boost the number of birds in their dataset by introducing samples of bird decoys. Using some of the state-of-the-art object detection models, including Faster R-CNN [19] or YOLOv4 [20], an average precision (AP) score of over 95% was reported on a set of positive samples, i.e., samples which contain at least one ground-truth bounding box.

In a more recent work on seabird detection [10], efforts were made to reduce

The sparse distribution of marine animals makes it hard to gather sufficient data to train and test supervised models. Often, less than 5% of the images gathered during a flight survey will contain animals (see Section 5.1). The differences in appearance caused by the variations in animal depth shown in Figure 2 can also make it hard for supervised models to learn class-specific features. To better handle these constraints and to account for different weather conditions, we propose to train an object detector by applying anomaly localization techniques to sea images. By training on sea images without animals, our models require little to no supervision compared to data-intensive supervised approaches.

Reconstruction-based methods train generative models to reconstruct the normal images from the training data by minimizing the reconstruction loss. The intuition is that anomalous samples will be poorly reconstructed and thus easy to detect by comparing the reconstruction with the original image. The most used models are autoencoders (AE) [22], variational autoencoders (VAE) [13,23] and adversarial autoencoders (AAE) [13,24]. Although easy to understand, generative models are sometimes able to reconstruct the anomalous regions as faithfully as the rest of the input image. SPADE [28] compares the testing samples to the normal-only training set using a K-nearest neighbors retrieval on vectors created using a model pre-trained on supervised image classification. PaDiM [11] proposes to model each patch location using a Gaussian distribution and then use the Mahalanobis distance to compute the anomaly scores.

We experiment with both generative and embedding-based methods to reformulate marine animal detection as an anomaly detection problem. Pre-trained deep networks have been shown to be robust image feature extractors [28,29]. Their use in anomaly detection has already given interesting results in state-of-the-art benchmarks [11,12,30,31].
Since the benchmark datasets commonly used for image anomaly detection are different from the datasets available for marine mammal detection, which can be made of thousands of images and involve a strong texture component, we propose to modify and adapt deep feature embedding methods to tackle the marine animal detection problem.

As first proposed in [28], to model the normal training set, the images are first encoded using a ResNet [32] model pre-trained on the ImageNet [16] dataset. To use different semantic levels, activations from three intermediate layers are concatenated to create a feature map, as done in [11,12,28]. Since this feature map is deep, the number of channels is often reduced using either random dimension selection [11] or a semi-orthogonal embedding matrix [12]. In practice, we found that using a semi-orthogonal embedding yields more consistent results, because random dimension selection requires testing multiple dimension subsets in order to find a good combination. The method of [11] then models these normal feature maps using a Gaussian distribution for each patch location. During training, only a single forward pass is necessary to encode the training set and to compute the mean vectors and covariance matrices estimating the Gaussian distributions. Both can be computed online using the formulas in Equations (1) and (2):

\mu_{i,j} = \frac{1}{N} \sum_{k=1}^{N} x_{k,i,j} \qquad (1)

\Sigma_{i,j} = \frac{1}{N-1} \sum_{k=1}^{N} (x_{k,i,j} - \mu_{i,j})(x_{k,i,j} - \mu_{i,j})^{\top} + \epsilon I \qquad (2)

where x_{k,i,j} is the feature vector at location (i, j) of the kth training sample, N is the number of training samples, and \epsilon I is a small regularization term ensuring that \Sigma_{i,j} is invertible.
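Equations (1) and (2) can be illustrated with a minimal pure-Python sketch. The lists below stand in for encoded feature vectors at one patch location across N = 4 training images; all values and function names are made up for illustration, and a real implementation would operate on GPU tensors.

```python
def mean_vector(samples):
    """Mean of a list of d-dimensional feature vectors (Equation (1))."""
    n, d = len(samples), len(samples[0])
    return [sum(x[j] for x in samples) / n for j in range(d)]

def covariance_matrix(samples, mu, eps=0.01):
    """Sample covariance with a small eps*I regularizer (Equation (2))."""
    n, d = len(samples), len(samples[0])
    cov = [[0.0] * d for _ in range(d)]
    for x in samples:
        c = [x[j] - mu[j] for j in range(d)]
        for a in range(d):
            for b in range(d):
                cov[a][b] += c[a] * c[b]
    for a in range(d):
        for b in range(d):
            cov[a][b] /= (n - 1)
        cov[a][a] += eps  # regularization keeps the matrix invertible
    return cov

# Feature vectors observed at one patch location (i, j) across N = 4 images.
patch_features = [[1.0, 2.0], [1.2, 1.8], [0.8, 2.2], [1.0, 2.0]]
mu = mean_vector(patch_features)
cov = covariance_matrix(patch_features, mu)
```

Since both statistics are sums over the training samples, they can indeed be accumulated online, image by image, during the single forward pass mentioned above.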

Once a Gaussian distribution has been estimated for each patch location, the anomaly score s(x_{i,j}) for each patch x_{i,j} of a test image is computed using the Mahalanobis distance:

s(x_{i,j}) = \sqrt{(x_{i,j} - \mu_{i,j})^{\top} \Sigma_{i,j}^{-1} (x_{i,j} - \mu_{i,j})} \qquad (3)

For a Gaussian distribution, the Mahalanobis distance is directly related to the negative log-likelihood of the sample.

To build a spatially invariant anomaly detection pipeline, the anomaly score should not depend on the patch coordinates. A simple modification could be to fit a single Gaussian distribution to every patch sample of each image. However, since there are multiple patch modalities, the data may not fit a Gaussian distribution. This can be confirmed by looking at the statistical moments of the patches: depending on the dimensionality reduction, the skewness and kurtosis of the data are not those of a Gaussian distribution. This is emphasized when using the random dimension downsampling technique proposed in [11]. To use a Gaussian model nonetheless, we propose to transform the patch distribution into a Gaussian distribution using a normalizing flow (NF). A normalizing flow consists of an invertible transformation T(·) of an unknown input distribution x = T^{-1}(z) to a known latent distribution z ∼ p_Z. Using the change of variables formula, we can compute the likelihood of any x:

p_X(x) = p_Z(T(x)) \left| \det \frac{\partial z}{\partial x} \right| \qquad (4)

where \det \frac{\partial z}{\partial x} is the Jacobian determinant of T(·). Therefore, T is built so that its Jacobian determinant is known and fast to compute. Usually, p_Z is taken to be a centered multivariate Gaussian distribution.
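As a small numeric illustration of the Mahalanobis-based scoring described above, the sketch below uses two dimensions with hand-picked statistics (real patch embeddings have on the order of a hundred dimensions, and the covariance inverse would be precomputed):

```python
import math

def mahalanobis(x, mu, cov):
    """Mahalanobis distance of patch vector x under N(mu, cov), 2-D case.

    Illustrative only: the 2x2 covariance is inverted analytically.
    """
    (a, b), (c, d) = cov
    det = a * d - b * c
    inv = [[d / det, -b / det], [-c / det, a / det]]
    diff = [x[0] - mu[0], x[1] - mu[1]]
    # (x - mu)^T cov^{-1} (x - mu)
    m2 = sum(diff[r] * inv[r][s] * diff[s] for r in range(2) for s in range(2))
    return math.sqrt(m2)

mu = [0.0, 0.0]
cov = [[1.0, 0.0], [0.0, 4.0]]  # larger variance along the second axis
# Equal Euclidean distance from the mean, but the second point lies along
# the high-variance axis, so its anomaly score is lower.
print(mahalanobis([2.0, 0.0], mu, cov))  # 2.0
print(mahalanobis([0.0, 2.0], mu, cov))  # 1.0
```

The example shows why the Mahalanobis distance is preferable to the Euclidean one here: directions in which normal patches naturally vary a lot are penalized less.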

Our model is similar to PaDiM estimated with a single Gaussian estimator shared by all patches, but with a learnt, arbitrarily complex transformation of the prior distribution p_X into a Gaussian distribution (see Figure 5). We also experiment with an ensemble approach, using multiple normalizing flows in parallel and taking the maximum log-likelihood over all models for a given patch. This allows each model to specialize in one type of patch. The loss function for the models is the negative log-likelihood described in Equation (5):

\mathcal{L} = -\frac{1}{N} \sum_{k=1}^{N} \log p_X(x_k) \qquad (5)

The intersection over union (IoU) measures the overlap between two bounding boxes and is commonly used in detection tasks.
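The ensemble of normalizing flows described above can be illustrated with toy one-dimensional affine flows; the real model uses MAF layers, and all parameters below are made up. Each flow maps its own patch "mode" to a standard Gaussian, and the score of a patch is the maximum log-likelihood over the ensemble:

```python
import math

LOG2PI = math.log(2 * math.pi)

def affine_flow_loglik(x, shift, scale):
    """Log-likelihood of scalar x under a toy affine flow T(x) = (x - shift) / scale.

    log p_X(x) = log p_Z(z) + log |dz/dx|, with p_Z a standard Gaussian.
    Stands in for a learnt MAF; purely illustrative.
    """
    z = (x - shift) / scale
    return -0.5 * (z * z + LOG2PI) - math.log(scale)

# Ensemble of two flows, each specialized on one patch "mode"
# (e.g., calm water vs. sun glitter); parameters are invented.
flows = [(0.0, 1.0), (5.0, 0.5)]

def ensemble_loglik(x):
    # Maximum log-likelihood over all flows, as in the ensemble approach.
    return max(affine_flow_loglik(x, m, s) for m, s in flows)

# A patch near either mode scores a high likelihood; one far from both
# scores low (low log-likelihood = high anomaly score).
assert ensemble_loglik(0.1) > ensemble_loglik(20.0)
```

Taking the maximum means a patch only needs to be well explained by one specialist model to be considered normal, which mirrors the multi-modal nature of the sea-surface patches.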

The entire box proposal pipeline can be seen in Figure 6.
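The box-proposal step can be sketched as thresholding the anomaly map and extracting one bounding box per connected component. This toy version (pure Python, with a made-up 3 × 4 anomaly map) only stands in for the full-resolution pipeline of Figure 6; the function name is illustrative:

```python
def boxes_from_anomaly_map(amap, threshold):
    """Threshold the anomaly map, then return one (xmin, ymin, xmax, ymax)
    box per 4-connected component of above-threshold cells."""
    h, w = len(amap), len(amap[0])
    seen = [[False] * w for _ in range(h)]
    boxes = []
    for i in range(h):
        for j in range(w):
            if amap[i][j] >= threshold and not seen[i][j]:
                # Flood-fill the component, tracking its spatial extent.
                stack, y0, y1, x0, x1 = [(i, j)], i, i, j, j
                seen[i][j] = True
                while stack:
                    y, x = stack.pop()
                    y0, y1, x0, x1 = min(y0, y), max(y1, y), min(x0, x), max(x1, x)
                    for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                        ny, nx = y + dy, x + dx
                        if 0 <= ny < h and 0 <= nx < w and not seen[ny][nx] \
                                and amap[ny][nx] >= threshold:
                            seen[ny][nx] = True
                            stack.append((ny, nx))
                boxes.append((x0, y0, x1, y1))
    return boxes

amap = [
    [0.1, 0.9, 0.8, 0.1],
    [0.1, 0.9, 0.1, 0.1],
    [0.1, 0.1, 0.1, 0.7],
]
print(boxes_from_anomaly_map(amap, 0.5))  # [(1, 0, 2, 1), (3, 2, 3, 2)]
```

The choice of threshold directly controls the trade-off between recall (low threshold, many proposals) and precision (high threshold, fewer proposals).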

As in [11], we use a Wide-ResNet50 [35] as our encoding backbone. The features are then downsampled to a depth of c = 100 channels using a semi-orthogonal projection matrix, as described in [12]. We use seven masked autoregressive density estimator (MADE) [34] layers in our MAF model. Each has seven hidden units with 130 connections. The Adam [36] optimizer is used with a learning rate of 0.001.
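The semi-orthogonal projection of [12] amounts to reducing each feature vector with a random c × d matrix W whose rows are orthonormal (W Wᵀ = I_c). The Gram-Schmidt construction below is a sketch with a hypothetical helper name, not the authors' implementation, and uses small dimensions instead of the c = 100 stated above:

```python
import random

def semi_orthogonal_matrix(c, d, seed=0):
    """Random c x d matrix with orthonormal rows (c < d), i.e. W W^T = I_c,
    built by modified Gram-Schmidt on Gaussian random vectors."""
    rng = random.Random(seed)
    rows = []
    while len(rows) < c:
        v = [rng.gauss(0, 1) for _ in range(d)]
        for u in rows:  # remove components along previously kept rows
            dot = sum(a * b for a, b in zip(v, u))
            v = [a - dot * b for a, b in zip(v, u)]
        norm = sum(a * a for a in v) ** 0.5
        if norm > 1e-8:  # skip (near-)dependent draws
            rows.append([a / norm for a in v])
    return rows

W = semi_orthogonal_matrix(3, 8)
# Rows are orthonormal: the Gram matrix W W^T is the identity.
gram = [[sum(a * b for a, b in zip(r1, r2)) for r2 in W] for r1 in W]
```

Because W Wᵀ = I, the projection approximately preserves distances between feature vectors, which is why it gives more consistent results than picking random dimensions.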

We compare our results with the PaDiM and OrthoAD methods from [11,12], using the same parameters. We also train an adversarial convolutional variational autoencoder (AnoVAEGAN) similar to [13] to reconstruct normal images. Its anomalies are detected by comparing the image reconstruction with the original image using the structural similarity (SSIM) [37] metric.

To measure the performance of the detection methods, we consider that a detection is positive if the IoU between the prediction box and the ground-truth box is greater than 0.1. The F1 score, recall, and precision can then be measured. They are computed as follows:

\text{precision} = \frac{TP}{TP + FP}, \quad \text{recall} = \frac{TP}{TP + FN}, \quad F1 = \frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}

where TP, FP, and FN denote the numbers of true positives, false positives, and false negatives, respectively. Because the F1 score blends information about both the recall and precision, we use it as our main metric. We also evaluate the classification performance of the models between anomalous and normal images by computing the area under the receiver operating characteristic curve (AUROC). The anomaly score for an image is defined as the maximum anomaly score among all its patches.
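Under the evaluation protocol above (a detection counts as positive when IoU > 0.1), the metrics can be computed as in the following sketch; the greedy one-to-one matching of predictions to ground truths is our simplifying assumption, not necessarily the paper's exact matching rule:

```python
def iou(box_a, box_b):
    """Intersection over union of two (xmin, ymin, xmax, ymax) boxes."""
    ix = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    iy = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    inter = ix * iy
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

def detection_scores(preds, gts, iou_thresh=0.1):
    """Greedily match predictions to ground truths at the IoU threshold,
    then return (precision, recall, F1)."""
    matched, tp = set(), 0
    for p in preds:
        for k, g in enumerate(gts):
            if k not in matched and iou(p, g) > iou_thresh:
                matched.add(k)
                tp += 1
                break
    fp, fn = len(preds) - tp, len(gts) - tp
    precision = tp / (tp + fp) if preds else 0.0
    recall = tp / (tp + fn) if gts else 0.0
    f1 = 2 * precision * recall / (precision + recall) if tp else 0.0
    return precision, recall, f1

preds = [(0, 0, 10, 10), (50, 50, 60, 60)]  # second box is a false positive
gts = [(1, 1, 11, 11)]
print(detection_scores(preds, gts))  # (0.5, 1.0, 0.6666666666666666)
```

Note how the loose 0.1 threshold favors recall: the first prediction matches the ground truth despite only partial overlap.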

The object detection scores for the Semmacape and Kelonia datasets are given in Tables 1 and 2, respectively. For all metrics, higher scores indicate better performance.

On both datasets, the highest F1 scores among all tested approaches were obtained by one of our proposed methods. The most significant improvements were observed on the Semmacape dataset, for which our method provided improvements of 6.1% and of 22.6% in terms of F1 and recall scores, respectively, with respect to the state-of-the-art AnoVAEGAN [13]. On this dataset, the classification of images into anomalous and normal ones is also significantly improved by our method, as attested by an increase of 12.4% in AUROC in comparison to OrthoAD [12]. On the other hand, more modest improvements were observed on the Kelonia dataset: 1.3% and 5.2% in terms of F1 and recall scores, respectively, when compared to OrthoAD [12].

The improvement from using a normalizing flow to transform the embedding vectors is greater on the Semmacape dataset than on the Kelonia dataset. This is due to

greater than our method. This is because large objects can sometimes be counted as

As seen in Figure 9, the methods are not very sensitive to a variation in the number of

By learning normal-image features without full annotations, the proposed approach is able to detect marine animals.

Figure 10. Example predictions (left) and their corresponding anomaly maps (right) from the Semmacape (first 2 rows) and Kelonia (last 2 rows) datasets. Rocks and waves have a higher anomaly score than water, so using an appropriate anomaly threshold is important for the proposed regions to be of interest.
Although not yet on par with supervised methods, this is a first step toward enabling weakly supervised detection of marine animals.