Arctic Vision: Using Neural Networks for Ice Object Classification, and Controlling How They Fail

Abstract: Convolutional neural networks (CNNs) have been shown to be excellent at performing image analysis tasks in recent years. Even so, ice object classification using close-range optical images is an area where their use has barely been touched upon, and how well CNNs perform this classification task is still an open question, especially in the challenging visual conditions often found in the High Arctic. The present study explores the use of CNNs for such ice object classification, including analysis of how visual distortion of optical images impacts their performance and comparisons to human experts and novices. To account for the model's tendency to predict the presence of very few classes for any given image, the use of a loss-weighting scheme pushing a model towards predicting a higher number of classes is proposed. The results of this study show that on clean images, given the class definitions and labeling scheme used, the networks perform better than some humans. At least for some classes of ice objects, the results indicate that the network learned meaningful features. However, the results also indicate that humans are much better at adapting to new visual conditions than neural networks.


Introduction
Computer vision using convolutional neural networks (CNNs) has revolutionized automated image recognition and object detection in recent years. It is by far the most successful technique for image classification and segmentation to date and is used in applications ranging from autonomous vehicles [1][2][3] to the detection of cancer cells [4,5]. With increasing traffic in the Arctic due to the melting of the polar ice, it would be desirable to exploit this powerful technique for navigational assistance to captains, potentially reducing the risk of collisions and damage. However, the Arctic poses different challenges for a machine learning system than other common use cases, such as autonomous driving or recognizing objects in well-lit rooms. First, labeled data are still relatively scarce. Although more and more near-field image data are becoming available from the region, the labeling of such data is extremely costly. The WMO [6] defines 220 different classes of ice objects, many of which are overlapping or difficult to distinguish. For example, the difference between a floeberg and a floebit is size, which is difficult to identify with accuracy in an image. Even if one uses a very small subset of these classes (such as the nine used in this work), the labeling needs to be done by experts, simply because most people are unable to distinguish between classes. Furthermore, the labels have a degree of subjectivity to them, as the viewer's interpretation of the image (relating to, e.g., the scale and color balance of the ice objects) impacts the labels. This is in contrast to the labeling of images used, for example, in autonomous driving or more general object detection, wherein most people are familiar with the domain, the task of labeling can easily be crowdsourced, and the labels are largely objective.
It is well-known that a neural network is dependent on a good training set to learn a meaningful function. However, when the training set is imbalanced, meaning some classes are much more common than others, the network will typically tend to predict the majority classes too often, and never or very seldom predict the minority classes. This behavior is undesired, as it could be very important for the network to detect rare classes (for example, even though brash ice is more common than icebergs, it is much more important to avoid icebergs when navigating). For this reason, multiple methods of dealing with such class imbalances have been proposed, an overview of which can be found in [9].
In general, one can differentiate between two classes of methods for handling class imbalance. The first is data resampling, which works by adding or removing samples from the dataset to balance it. In its simplest form, one can oversample [10][11][12] or undersample [10] the dataset by copying or removing images from it. A more sophisticated method for data resampling is the SMOTE algorithm [13], which synthesizes new samples of minority classes automatically. The second class of methods involves modifying the network or the loss function. A common way to achieve this is through some form of weighted loss [14][15][16], where the importance of different classes can be weighted against each other. The weight can be based only on the class of the sample [14], or on both the class and the prediction [16].
In many cases, these two approaches are interchangeable. Indeed, in the absence of random augmentation of images, oversampling and weighting samples with a given class can yield the same result. However, there are some intricacies connected to both. First, when using random augmentations of images before using them for training, oversampling introduces new images to the network (as the image is sampled more often), possibly enlarging the known input domain by a small margin. Loss weights, on the other hand, can be more flexible than resampling, as they can be dependent not only on information known a priori but also on the results of the training up to that point. Finally, the use of SMOTE or similar techniques actually introduces completely new samples to the training. However, the generation of such samples is not trivial when the input domain is large (e.g., when using images); therefore, these techniques are often not applicable.
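To make the two approaches concrete, the sketch below shows a class-weighted loss and a resampling-based alternative in PyTorch. The class counts, label lists, and the inverse-frequency weighting are illustrative assumptions, not values from this work.

```python
import torch
import torch.nn as nn
from torch.utils.data import WeightedRandomSampler

# Hypothetical per-class positive counts for a nine-class multi-label task.
class_counts = torch.tensor([420., 390., 310., 45., 40., 120., 95., 90., 25.])

# (a) Loss weighting: up-weight the positive term of rare classes.
pos_weight = class_counts.max() / class_counts    # inverse-frequency weights
criterion = nn.BCEWithLogitsLoss(pos_weight=pos_weight)

# (b) Resampling: draw images containing rare classes more often. Here an
# image's sampling weight is the largest class weight among its labels
# (image_labels is a hypothetical list of per-image label index lists).
image_labels = [[0, 2], [3], [1], [5, 8]]
image_weights = [max(float(pos_weight[c]) for c in labels) for labels in image_labels]
sampler = WeightedRandomSampler(image_weights, num_samples=len(image_labels))
```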

Comparisons of Humans and CNNs
Several studies have looked at how CNNs compare to humans on the task of classifying distorted images. In [7], the authors found that neural networks are unable to generalize to kinds of distortions not seen during training, while humans are relatively robust to such changes. Indeed, including some kinds of distortions in the training set did not make the models robust to any noise types other than those included.
Similarly to reference [7], Dodge and Karam [8] found that humans are much more robust to distortions in images than computers. Their results show that even when including some examples of a given distortion in the training set, the CNN performed worse than the human participants.
The two previously mentioned works used relatively similar testing procedures, with images of well-known objects and settings that made the conditions for the human participants as similar to those of the computers as possible (e.g., by limiting the time the participants saw the images), thereby making them fair comparisons.

CNNs for Close-Range Ice Object Detection
In [17], a novel CNN architecture for semantic segmentation of river ice in images from an unmanned aerial vehicle (UAV) was developed. The network consists of two channels: a deep one for extracting multi-scale semantic features, and a shallow one for capturing small-scale targets. Their results show that the model outperforms the state of the art on their task. For a similar task, [18] trained several state-of-the-art networks using very limited data. They showed that even with very little data available, the CNN models outperformed a support vector machine (SVM) trained on the same data. Both of these works are relevant to the present work, but not directly applicable, as ice floating in the ocean looks different from, and has different properties than, ice floating in rivers. Furthermore, an image captured from a UAV will often look different from one taken onboard a vessel, especially with regard to shadows, and a UAV might not always be available in the High Arctic due to the harsh environment.
Kim et al. [12] present the initial results of using CNNs to recognize ice objects floating in the sea from near-field imaging, showing that a neural network can learn to recognize some forms of ice. They present an analysis of the effect of network architecture, and some initial results of the effects of simple distortions of images. Kim et al. [19] also performed image segmentation on ice images. Their results were promising, but not perfect, which they attributed to the small amount of available training data.
Although there is relatively little published work for this specific task, a lot of work has gone into the analysis of ice objects in synthetic aperture radar (SAR) images or using other techniques. For example, [20] identified ice floes in satellite images using mathematical morphology and clustering, and [21,22] used pulse coupled neural networks. Other researchers [23][24][25] used a gradient vector field snake algorithm to analyze ice floe distributions or parameters. Finally, [26] used a combination of image processing and analysis methods to find several parameters of the ice cover, including partial ice type concentration and floe size.
The present work investigates the use of CNNs for floating ice object classification in close-range images, similarly to [12]. However, a more in-depth analysis of how the networks perform, especially in the presence of visual distortions, is given, to provide insights for further development of the technique. Furthermore, a comparison to human experts and novices is given, providing a benchmark for how well the networks perform. The experiments with the human participants differ from the previous works [7,8] in that the experiments reported here measured how humans perform when given their best chance. Specifically, the study imposed no time limit, allowing the participants to inspect the images for as long as they wanted. The results of this work are an important step towards creating systems for automatic data collection, along with navigational aids, for the increasing traffic in the Arctic.

Dataset
The dataset used in this work consists of 738 images containing ice objects. Most were taken in the Arctic, although there are some examples from Antarctica as well. Of the images, 689 were used for training and validation, while 49 were used for testing. The 689 images were further split into a training set (85%) and a validation set (15%). All splits were done randomly; however, to be able to compare the effect of different parameters, the splits were calculated only once and reused for all experiments. Most images were taken from onboard vessels, either from mounted cameras or manually, typically (but not exclusively) from the bridge. There is a variety of image qualities and kinds of ice objects. Most images were taken in good weather conditions with good visibility, although there are exceptions to this. The images were acquired from various sources, including Google and Yandex image searches, publicly available image streams from vessels, and private pictures. The choice of having a large variance in the images, e.g., regarding camera placement and image quality, was made both to enable the model to analyze a larger variety of images and because relatively little data was available.
Before training, each image was labeled as containing any number of the nine ice object classes defined in Table 1. However, as can be seen in Figure 1, there is a huge class imbalance, in that a very large portion of the images belong to one of the classes brash ice, broken ice, or deformed ice. This is likely due to several factors: First, some types of ice objects are simply more common than others. For example, pancake ice is relatively rare, while brash ice and broken ice are common, especially in the marginal ice zone and areas where ships travel. Second, some forms of ice objects, such as deformed ice and icebergs, can be more interesting subjects in a photograph than, e.g., level ice, so tourists in the Arctic or Antarctic are more likely to capture images of them. Regarding tourists, it is also important to remember that very young ice, such as pancake ice, typically forms during early winter, when there is little light and there are few tourists, further increasing the data imbalance.

Table 1. The definition of the ice classes used in this work.

Brash Ice: Accumulations of floating ice made up of fragments not more than 2 m across, the wreckage of other forms of ice.
Broken Ice: Predominantly flat ice cover broken by gravity waves or due to melting decay.
Deformed Ice: A general term for ice that has been squeezed together and, in places, forced upwards (and downwards). Subdivisions are rafted ice, ridged ice, and hummocked ice.
Floeberg: A large piece of sea ice composed of a hummock, or a group of hummocks frozen together, and separated from any ice surroundings. It typically protrudes up to 5 m above sea level.
Floebit: A relatively small piece of sea ice, normally not more than 10 m across, composed of a hummock (or more than one hummock) or part of a ridge (or more than one ridge) frozen together and separated from any surroundings. It typically protrudes up to 2 m above sea level.
Iceberg: A piece of ice of glacier origin, floating at sea.
Ice Floe: Any contiguous piece of sea ice.
Level Ice: Sea ice that has not been affected by deformation.
Pancake Ice: Predominantly circular pieces of ice from 30 cm to 3 m in diameter and up to approximately 10 cm in thickness, with raised rims due to the pieces striking against one another.

Figure 1. Class balance before the data were resampled. The plot shows the percentages of images in each dataset that contained the given class.
Training a neural network with imbalanced data will make it biased towards the majority classes. When the imbalance is as severe as it is here, the network could be expected to never, or very seldom, predict, e.g., pancake ice for a new image. This is undesirable, as the imbalance could just as well be an artifact of the dataset as of the natural world, and we would like the network to base its prediction on the image content rather than on a (possibly incorrect) statistical distribution of the existence of ice objects. For this reason, oversampling was performed, meaning images with minority classes were duplicated in the datasets. Note that oversampling was done after the data splits, to avoid duplicate images in different sets. This led to the class distributions shown in Figure 2. Although still not perfectly balanced, the distributions were much closer to balanced than before, which should help the network avoid making predictions based solely on a statistical distribution.
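A minimal sketch of such post-split oversampling follows. Duplicating images of minority classes up towards the majority count is one simple scheme; the paper does not specify its exact strategy, so the details here are assumptions.

```python
import random
from collections import Counter

def oversample(samples):
    """Duplicate images containing minority classes until each class count
    roughly reaches that of the most common class. `samples` is a list of
    (image, labels) pairs, where labels is a set of class names. Note that
    duplicating a multi-label image raises the count of every class in it,
    so the result is only approximately balanced."""
    counts = Counter(cls for _, labels in samples for cls in labels)
    target = max(counts.values())
    out = list(samples)
    for cls, n in counts.items():
        pool = [s for s in samples if cls in s[1]]
        out.extend(random.choices(pool, k=target - n))
    return out
```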

During training, random image augmentations were performed on the images. Specifically, random flipping along the x-axis, rotation, zoom, brightness, contrast, hue, and saturation adjustments were performed every time an image was used for training. This enlarged the input domain known to the network and meant the effect of oversampling was not simply showing the exact same image to the network multiple times, but instead introducing new, somewhat different, images. Note that this data augmentation is not the same as the distortions mentioned later in this section and the rest of the paper. Those distortions were applied during testing to measure the robustness of the network, as opposed to during training.
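Such a pipeline might look as follows with torchvision transforms; the parameter ranges and the crop size are assumptions, as the paper does not list the exact values used.

```python
from torchvision import transforms

# Sketch of the training-time augmentations described above (hypothetical ranges).
train_augment = transforms.Compose([
    transforms.RandomHorizontalFlip(),                    # random flip along the x-axis
    transforms.RandomRotation(degrees=10),                # random rotation
    transforms.RandomResizedCrop(224, scale=(0.8, 1.0)),  # random zoom
    transforms.ColorJitter(brightness=0.2, contrast=0.2,
                           saturation=0.2, hue=0.05),     # color adjustments
    transforms.ToTensor(),
])
```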

Image Distortions
The visual conditions in the Arctic are often of variable quality, with snow, fog, darkness, and other elements of the area impacting them. To test the robustness of neural networks to such conditions, we employed four semi-realistic image distortions:
• Image blur, which can happen due to snow, rain, or water on the camera lens.
• Brightness decrease, which imitates the visual conditions at night.
• Gaussian noise, which is similar to the effect of using a high ISO on the camera.
• Fog, which frequently occurs in the Arctic and washes out contrast in the image.
Each distortion was applied at three different levels, and an example of their effect is shown in Figure 3. These distortions were only used during testing, so the networks were not subjected to them during training (as opposed to the random augmentations mentioned in the previous section, which were used to diversify the training set).
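For illustration, the four distortions can be implemented along these lines; the per-level parameter values are placeholders, since the paper does not state the settings behind its three levels.

```python
import torch
import torchvision.transforms.functional as TF

# Hypothetical parameters for the three distortion levels.
BLUR_SIGMA = {1: 2.0, 2: 4.0, 3: 8.0}
BRIGHTNESS = {1: 0.6, 2: 0.4, 3: 0.2}    # multiplicative brightness factor
NOISE_STD  = {1: 0.05, 2: 0.10, 3: 0.20}
FOG_ALPHA  = {1: 0.3, 2: 0.5, 3: 0.7}    # blend weight towards white

def distort(img, kind, level):
    """Apply one test-time distortion to a (C, H, W) float tensor in [0, 1]."""
    if kind == "blur":
        return TF.gaussian_blur(img, kernel_size=21, sigma=BLUR_SIGMA[level])
    if kind == "brightness":
        return TF.adjust_brightness(img, BRIGHTNESS[level])
    if kind == "noise":
        return (img + NOISE_STD[level] * torch.randn_like(img)).clamp(0, 1)
    if kind == "fog":
        # Simple fog model: blend the image towards a uniform white veil.
        return (1 - FOG_ALPHA[level]) * img + FOG_ALPHA[level]
    raise ValueError(f"unknown distortion: {kind}")
```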


True Negative Weighted Loss
The data in this work are sparse, meaning most images only contain one or a few of the nine possible classes. During initial trials, it was observed that this led the network to be biased towards predicting very few classes in an image, even after the data oversampling. To discourage this behavior, we propose an adaptation of the loss function used for training.
The goal of this modified loss is to avoid a model that predicts the absence of all, or almost all, classes for many images. This is achieved by introducing a loss weighting scheme, as discussed in Section 1.1.1, weighting the loss values for samples and classes where both the label and the predicted label are 0 (meaning not present in the image) by a weight λ_tn, 0 < λ_tn ≤ 1. Such a prediction is called a true negative prediction, and we call the modified loss the true negative weighted loss, L_tn. Its definition is shown in Equation (1):

L_tn(x, y; θ) = Σ_c w_c(x, y) · L_o(f(x; θ)_c, y_c), with w_c(x, y) = λ_tn if y_c = 0 and f_pred,c(x) = 0, and w_c(x, y) = 1 otherwise. (1)
In the definition, L_tn is the modified loss, x is the model input, y_c is the class label for input x and class c, L_o is the original loss function, λ_tn is the true negative weight, θ denotes the network parameters, and f_pred is a function to get the prediction from a network (where f_pred,c is the prediction for class c). Note that it is not necessary that f_pred = f, where f is the neural network. Indeed, this is rarely the case, and a typical definition of f_pred, which is used in this work, is shown in Equation (2), where σ is the sigmoid function:

f_pred,c(x) = 1 if σ(f(x; θ)_c) > 0.5, and 0 otherwise. (2)

By varying λ_tn, it is possible to control the balance between making correct true negative predictions more certain at the cost of more likely predicting the absence of classes actually in the image (called a false negative prediction), and avoiding a bias towards only making negative predictions.
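As a concrete sketch, with binary cross-entropy as the base loss L_o and the thresholded sigmoid of Equation (2) as f_pred, the loss can be written as a per-element weighting. This is an illustration of the definition, not the authors' code.

```python
import torch
import torch.nn.functional as F

def true_negative_weighted_loss(logits, targets, lambda_tn=0.4):
    """L_tn: element-wise binary cross-entropy (L_o), down-weighted by
    lambda_tn wherever both the label and the prediction are 0."""
    loss = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    preds = (torch.sigmoid(logits) > 0.5).float()     # f_pred from Equation (2)
    tn_mask = (targets == 0) & (preds == 0)           # true negative predictions
    weights = torch.where(tn_mask,
                          torch.full_like(loss, lambda_tn),
                          torch.ones_like(loss))
    return (weights * loss).mean()
```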
It is important to note that even in the extreme case where λ_tn = 0, this loss does not remove all encouragement for the network to correctly predict the absence of any class in an image. The reason for this is that the weights are still modified for the image as long as it is misclassified, thereby pushing the model towards predicting the absence of the class in the image. However, this method simply avoids the model making such predictions ever more certain, which would typically happen at the cost of failing to recognize the class when it is present in an image.

Training Procedure
For all experiments in this work, a pre-trained ResNet34 [27] was used. It was pre-trained on ImageNet [28] and is freely available from the torchvision model zoo [29]. For retraining the networks, the 1-cycle training scheme from [30] was utilized. We used the Adam optimizer [31] with decoupled weight regularization [32]. Images were randomly augmented as described in Section 2.1 and normalized to the mean µ and standard deviation σ in Equation (3), as per the torchvision documentation:

µ = (0.485, 0.456, 0.406), σ = (0.229, 0.224, 0.225). (3)
Before training began, the last fully connected layer in the network was exchanged for the block shown in Figure 4. Training then consisted of two phases: For the majority of the training, the original network was frozen and only the new layers were updated. Following that, all layers were unfrozen and training continued. During this last stage, the learning rate varied per layer, with the learning rate α_i for layer i from the beginning of the network given by Equation (4). Here, layer 0 is the first and layer N is the final layer. Table 2 shows the training parameters used in this work. All networks used those parameters. Two models were trained for each value of the true negative weight, with all other hyperparameters being the same. The average metrics of the two are reported in the results, to avoid overly positive or negative results due to a good or bad initialization of the network parameters.
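The two-phase procedure can be sketched as follows. The head's layer sizes and dropout rates, and the per-group learning rates, are assumptions; Figure 4 gives the structure and Equation (4) the learning rate schedule, but the exact numbers are not reproduced here.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet34

model = resnet34(pretrained=True)  # ImageNet weights from the torchvision model zoo

# Replacement head in the spirit of Figure 4 (sizes and rates are guesses).
model.fc = nn.Sequential(
    nn.BatchNorm1d(512), nn.Dropout(0.5), nn.Linear(512, 512), nn.ReLU(),
    nn.BatchNorm1d(512), nn.Dropout(0.5), nn.Linear(512, 9),  # nine ice classes
)

# Phase 1: freeze the pre-trained backbone and train only the new head.
for name, param in model.named_parameters():
    param.requires_grad = name.startswith("fc")
optimizer = torch.optim.AdamW(model.fc.parameters(), lr=1e-3, weight_decay=1e-2)

# Phase 2: unfreeze everything and give earlier layers smaller learning
# rates via parameter groups (stem layers omitted for brevity).
for param in model.parameters():
    param.requires_grad = True
optimizer = torch.optim.AdamW(
    [{"params": model.layer1.parameters(), "lr": 1e-5},
     {"params": model.layer2.parameters(), "lr": 3e-5},
     {"params": model.layer3.parameters(), "lr": 1e-4},
     {"params": model.layer4.parameters(), "lr": 3e-4},
     {"params": model.fc.parameters(),     "lr": 1e-3}],
    weight_decay=1e-2)
```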
Figure 4. Network head used in all experiments. This is inserted in place of the original fully connected layer to adapt the network to the data. The fully connected layers include a ReLU activation. In the figure, BN is a batch normalization layer [33], FC is a fully connected layer, and Dropout is a dropout layer [34].


Human Experiments
The results of the network classifications are compared with the results of the human classification experiment described in [35]. A recap of the methodology is given here. In the experiment, two participant groups were used: one consisting of eight novices with no prior experience with ice object identification, and one consisting of six experts in the field. Initially, the participants were shown a set of images with their respective classes as a training phase, before starting the classification test. During the test, the participants were first asked to classify a set of non-distorted images. The results of this initial test form a baseline for the human results. After the clean test phase finished, the participants were shown distorted images and asked to classify them. Each image was first shown at its maximum level of distortion. If a participant successfully classified the image (meaning they selected all correct and no incorrect classes) at a given level, that image was recorded as successfully classified at all lower distortion levels as well (similar to the procedure in [8]). If the image was not successfully classified, it was later shown at a lower distortion level. This continued until the participant either classified the image correctly or failed to classify it with no distortion applied. The participants had no time limit when classifying the images and no cap on how many classes they could select. To keep the task for the humans similar to that of the neural networks, the participants were not told about the distortions beforehand. Once they had submitted their classification for a given image, they were not able to change it.
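For clarity, the adaptive distortion-level procedure can be summarized in a short Python sketch; classify_fn stands in for a participant's answer and is purely illustrative.

```python
def run_adaptive_test(classify_fn, image, true_labels, distort, levels=(3, 2, 1, 0)):
    """Start at the maximum distortion level; a fully correct answer at one
    level is recorded as a success at every lower level as well. (In the
    actual experiment, a failed image was shown again later at the next
    lower level rather than immediately.)"""
    success = {level: False for level in levels}
    for i, level in enumerate(levels):
        if classify_fn(distort(image, level)) == true_labels:
            for lower in levels[i:]:
                success[lower] = True
            break
    return success
```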

Results
Multiple performance metrics were used to evaluate the networks in this work. Specifically, we employed the accuracy (acc), balanced accuracy (acc_b), geometric mean (µ_g), precision, recall, and F1 score. The accuracy is the number of correct classifications (both of the presence and absence of classes) divided by the total number of classifications. For problems with many classes compared to the number of classes present in each image, the accuracy tends to be artificially high (as it becomes easy to predict the absence of a class correctly), so the balanced accuracy and geometric mean are often used as better measures of how well the network really performs. Precision denotes how large a fraction of the predicted classes are actually in an image, while recall denotes the fraction of classes in the images that the network manages to predict. The F1 score is a balance of precision and recall, with the two given equal weight. The definitions of the metrics are shown in Table 3.

Table 3. Definitions of all metrics used in this work. tp, fp, tn, and fn are short for true positive, false positive, true negative, and false negative, respectively, the four possible kinds of predictions in a classification task.

Accuracy (acc): (tp + tn) / (tp + fp + tn + fn)
Balanced accuracy (acc_b): (tp / (tp + fn) + tn / (tn + fp)) / 2
Geometric mean (µ_g): sqrt((tp / (tp + fn)) · (tn / (tn + fp)))
Precision: tp / (tp + fp)
Recall: tp / (tp + fn)
F1 score: 2 · precision · recall / (precision + recall)

Figure 5 shows how the test metrics vary with the true negative weight, λ_tn. The metrics shown are the averages of the metrics for the two models trained for each value of λ_tn and were calculated on the oversampled test set. The error bars show the minima and maxima of the two models. As the trends of all the metrics are relatively similar, the rest of this paper uses the F1 score unless otherwise noted, to make the discussion easier to follow.

When looking at the effect of distortions, Figure 7 shows that all distortions negatively affect the models, with blur and noise impacting the models the most. On average, blur and noise degrade the F1 score by 0.33 and 0.29, respectively, from clean to most distorted images, while brightness and fog degrade it by 0.04 and 0.10. The metrics are the means of two models trained with each value of λ_tn, calculated on the oversampled test set.
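For reference, the definitions in Table 3 translate directly into code; this is a plain implementation of the standard formulas, not code from the study.

```python
def classification_metrics(tp, fp, tn, fn):
    """Compute the metrics of Table 3 from the four prediction counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)           # true positive rate
    specificity = tn / (tn + fp)      # true negative rate
    return {
        "acc": (tp + tn) / (tp + fp + tn + fn),
        "acc_b": (recall + specificity) / 2,
        "mu_g": (recall * specificity) ** 0.5,
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
    }
```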
From the results in Figure 5, it is clear that one of the models with true negative weight λ_tn = 0.4 performed the best when analyzing clean images, while for distorted images none of the models performed notably better than the rest. For this reason, that model was used for comparison with the human participants and the in-depth analysis in the discussion. Figure 8 shows how the performances of human experts and novices compare to the model, and Table 4 shows how much each group was affected by the distortions. The data indicate that, given the labeling scheme used in this work, the CNN performs better than both groups of humans on clean images. However, this form of experiment can put the experts at a disadvantage, which is discussed further in Section 4.4. Furthermore, it is clear from the table that humans are more robust to the distortions, with their average degradation being at the same level as or better than the minimum degradation for the computer. As expected, the data show that the experts perform better than the novices.

Figure 7. Effects of distortions on the performances of the models. Each data series is the average of two models trained with the given true negative weight λ_tn.

Table 4. Minimum, maximum, and average degradation of the fraction of images that were successfully classified, from distortion level 0 to 3, for humans and computers.

Figure 8. The F1 score and the fraction of correctly classified images, for novices, experts, and computers. Note that while the datasets used for this testing contained the same images as in the rest of the results, they were split into one test set for each type of distortion. Furthermore, no oversampling of the test set was performed, to keep the results from the computer comparable to the ones from the humans.

Discussion
Based on the results of the last section, a few relevant questions come to mind: What is the effect of applying the true negative weighted loss to the model during training? What do the models see in the images that creates such a gap between their performances on some classes, and how do the distortions affect this? Finally, can we glean any insights into the differences between how humans and computers classify ice images? The rest of this section will address those questions.

Effect of the True Negative Weighted Loss
Varying λ_tn, the true negative weight, in the training loss has little effect on the test metrics of the networks. This can be seen in Figure 5. The figure shows a slight increase for λ_tn = 0.4, but it is uncertain whether this is an indication of that value being superior or an artifact of the specific training run or random initialization of network weights. This hypothesis is further reinforced by the fact that Figure 9 shows a very large variance for the models with λ_tn = 0.4 compared to the rest, indicating that these two models, trained with the exact same hyperparameters, behave very differently from each other.
What varying λ_tn does change, however, is the distribution of false predictions. Figure 9 shows the portion of false predictions that are false positives for varying values of λ_tn, and it is clear from the plot that lower values of λ_tn tend to lead to a higher fraction of false positives.
Therefore, based on the previous observations, it is possible to change the behavior of the network when it fails, largely without affecting its ability to correctly recognize other elements of an image. This gives some general insight into the flexibility of neural networks: although one is unable to increase the number of correct classifications (using this specific method), one can still make the network's failure behavior fit a use case better. For ice object recognition, it is reasonable to assume one would prefer a system that warns about a dangerous ice object a bit too often over a system that misses one. Of course, this is not necessarily true for all ice classes; e.g., it is likely unimportant if the model misses some brash ice from time to time. However, it could be extremely important not to miss icebergs, as colliding with them could be catastrophic.

Figure 9. The portion of false predictions that are false positives, plotted against the true negative weight λ_tn. The score for each value of λ_tn is the average of two models trained with that value. The error bars show the minimum and maximum values of the two models.

What the Network Sees
From Figure 6, we see that the network performs very well for some classes, such as icebergs, floebergs, and level ice, while failing spectacularly on others, e.g., ice floes, floebits, and to a certain degree pancake ice. To understand this difference, it is useful to investigate which areas of the image are important for the network when classifying it. A method for this is Grad-CAM [36], which uses the gradient of the classification scores with respect to the activations of a layer to find areas of interest to the network. In all Grad-CAM images in this work, the activations of the last residual block are used. Figure 10 illustrates the difference between a class the network successfully manages to recognize and one that it does not. Figure 10a-d shows the Grad-CAM images for iceberg activations, while Figure 10e-h shows those for ice floe activations. It is clear that the network indeed looks at the icebergs when classifying them, even though the mountains in Figure 10b fool the network into believing they are icebergs as well. It should be noted that since most of our dataset is from offshore areas in the Arctic, mountains and other shore features are not common in the dataset. Therefore, it is not surprising that the model has some problems with this image from Antarctica.
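A minimal Grad-CAM sketch over the last residual block (layer4) of a torchvision ResNet34 is shown below; the layer choice follows the paper, while the implementation itself is a standard one, not the authors' code.

```python
import torch
import torch.nn.functional as F

def grad_cam(model, image, class_idx):
    """Grad-CAM [36] from the activations of the last residual block
    (layer4) of a torchvision ResNet34."""
    feats = []
    handle = model.layer4.register_forward_hook(
        lambda module, inputs, output: feats.append(output))
    model.eval()
    score = model(image.unsqueeze(0))[0, class_idx]  # class score for one image
    handle.remove()
    acts = feats[0]                                  # (1, C, H, W) feature maps
    grads = torch.autograd.grad(score, acts)[0][0]   # gradients wrt activations
    weights = grads.mean(dim=(1, 2))                 # global-average-pool the gradients
    cam = F.relu((weights[:, None, None] * acts[0]).sum(dim=0))
    return cam / (cam.max() + 1e-8)                  # normalized heatmap in [0, 1]
```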
Regarding the ice floe images, it is immediately apparent that instead of looking at ice floes as a whole, the network focuses only on a few parts, typically around the edges of the floes, or, in cases where there are few or no edges, seemingly random locations. Now, an ice floe is not defined by its boundary, so there is a clear discrepancy between what the network has learned and the real world.
Based on our study, it is currently challenging to understand what makes ice floes more difficult to recognize than other ice objects. One hypothesis is that the network has a problem recognizing ice features that are not located within a small area of an image, as an ice floe has few defining characteristics on a local scale. If this is true, it would be reasonable to assume that the class level ice will exhibit similar behavior. From Figure 6, it seems that the network is actually very successful at classifying level ice. Even so, the Grad-CAM images for level ice activations in Figure 11 show a more nuanced view. Indeed, the images show the same trend towards looking at specific parts of the ice, instead of the ice as a whole. The difference between the two classes seems to be that the network finds a useful descriptor for level ice in local areas, which makes sense because the idea of "levelness" can be applied to patches of varying sizes.
There are a few other plausible reasons for the difficulty with ice floes. One is that nothing in the training set explicitly teaches the network about open water, which often surrounds ice floes in the images. This could lead to models not understanding that the floe is not surrounded by ice, making the distinction between ice floes and, e.g., level ice less apparent. Furthermore, the leading descriptor for an ice floe is its size, and since the dataset contains no scale information, this is likely hard to determine for the models. Finally, ImageNet, which the models were pre-trained on, contains some ice classes, with many misclassifications, especially for a few classes, including ice floes [12]. This can lead to the model being better suited to classify the classes with many correct samples in that dataset (such as icebergs), compared to those not present or with many incorrect samples.

The Effects of Distortions
Figure 7 shows the performances of networks with varying values of the true negative weight at different levels of distortions. It is clear from the plots that a higher level of distortion leads to lower performance, although the severity of the effect varies according to the distortion type. Brightness and fog impact the F1 scores of the models by averages of 0.04 and 0.10, respectively, while blur and noise degrade it by 0.33 and 0.29, respectively. The results do not indicate that a certain value of the true negative weight consistently handles distortions better than the others.

Figure 12 shows Grad-CAM images of the floeberg activations of an image containing a floeberg along with broken ice, with varying levels of distortions. The image was chosen for two reasons: First, because it contains a floeberg, a class the network is largely successful in classifying. Second, because the model failed to correctly classify some distorted versions of the image, so it yields a more interesting analysis. Looking at the images, it is noteworthy that the network notices the floeberg as the important part of the image in all instances, although it also starts looking at the sky in the case of the blurred images. This means that if asked the question, "Where in the image is the floeberg?" as opposed to, "Is there a floeberg in this image?", the network would be largely successful. However, since the Grad-CAM images are normalized, they do not show how strong the signal is, which is the problem here. Indeed, looking at the activations of the network, quite a few of the images would be classified as containing a floeberg (along with broken ice) by lowering the point at which the model marks a prediction as true, as the sketch below illustrates. This indicates that neural networks have a theoretical ability to see even in distorted conditions, although work is needed to exploit this ability.
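A minimal sketch of that thresholding, exposing the decision point of Equation (2) as a parameter:

```python
import torch

def predict(model, images, threshold=0.5):
    """f_pred with an adjustable decision threshold: lowering it below the
    default 0.5 recovers borderline positives at the cost of more false
    positives."""
    with torch.no_grad():
        probs = torch.sigmoid(model(images))   # per-class probabilities
    return probs > threshold                   # boolean multi-label predictions
```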
One problem with the given image was that the model had a tendency to label some of the distorted versions (distorted by fog or noise) as brash ice in addition to the other classes. We hypothesize that these modifications, which add elements to the image instead of making what is already there less clear, can introduce new textures and gradients in the image. This affects the network negatively, as ImageNet-trained neural networks are biased towards texture [37]. For brash ice, such misclassifications would typically not be of much importance; however, the problem points to a serious flaw in this model, one that needs to be addressed in the future.

Difference Between Novices, Experts, and Computers
As can be seen from the plots in Figure 8 and the degradation values in Table 4, both humans and computers are negatively affected by distortions in the images, but humans are much more robust to them than computers. Indeed, the distortion causing the least difficulty for the computers still led to a degradation at about the same level as the average distortion for the humans. This agrees with previous studies comparing humans and computer vision [7,8], and shows the need for more robust computer vision models. However, the distorted Grad-CAM images in Figure 12 indicate that such robustness might be possible.
It appears from the plots that computers outperform humans for clean images, especially when looking at the fraction of images that were classified exactly correctly, while the F1 score is slightly closer. However, it must be mentioned that experts can be at a disadvantage here, as they might have a pre-existing notion of each ice class that does not perfectly match the definitions used in this work. Such a pre-learned bias, along with the possibility that they are accustomed to labeling in either more or less detail than the scheme used here, can lead to lower scores on paper, even though they might perform better in a real situation. Since there is a certain amount of subjectivity in ice object classification (e.g., should one label the small pieces of brash ice typically present in between broken ice?), these scores cannot be seen as an objective measure of how well the experts recognize the ice, but rather as a measure of how much they agree with the labeling scheme used here. Such considerations do not apply to the novices, as they had no pre-learned definitions of the ice classes, so it is fair to say that the network at least outperforms the novices.
A reason for the difference between the correctly classified fraction and the F1 score can be seen from the definition of the F1 score (see Table 3), namely, that humans classify more optimistically than the computer. In other words, humans have a larger tendency to select more classes, leading to a higher score for both true and false positives, and a symmetric decrease in true and false negatives. This hypothesis is confirmed when we see that novices on average classify each non-distorted image with 2.12 labels, and experts with 2.15, while computers use on average 1.86 labels per image. Although we can increase this number by lowering the true negative weight, this did not result in a higher F1 score in our experiments.
Finally, it is worth noting that there is a large difference between the performance on undistorted images in the different partial test sets (i.e., the images with distortion level 0). This suggests that some images are likely inherently more difficult for the different groups to classify. This was expected, especially because each subset was relatively small (to limit the time needed for the humans to perform the test). What is interesting is that the relative difficulty of each set varied between the groups. For humans, the fog dataset was the most difficult by some margin, while the computer seemed to struggle less with it.

Summary and Conclusions
In this work, we have presented an in-depth analysis of the use of CNNs for the classification of ice objects in icy areas. The main contributions of the work can be summarized as follows:
• A loss-weighting scheme for making the trained model more likely to predict that classes are present in an image was introduced. Results show that the scheme works as intended, by avoiding an excess of false negative classifications and thereby the possibility of missing important ice objects in images.
• A demonstration of how CNNs can successfully recognize some ice objects in images using meaningful filters was provided, along with a discussion of why they struggle with some classes.
• A thorough analysis of the effect of semi-realistic image distortions on the classification task was provided. It was shown that even though the network fails to classify an image, it still recognizes the area of importance in the image for the given class.
• Finally, a comparison of the performances of human novices, experts, and computers on the classification task was given. The results indicate that for clean images, the model outperforms human novices, although it is less clear how it compares to experts. Both human participant groups handled distortions better than the network.
These results form a basis for continued work in the area of automatic ice object recognition from near-field imagery, and provide some insights into the workings of CNNs in general. They provide a point to continue working from, towards automated navigational aids and data collection on Arctic vessels. In the future, relevant areas of research to improve the results include making CNNs more robust to visual distortions, improving the accuracy of the models for specific classes (e.g., ice floes), and finding more efficient methods for large-scale data collection. Furthermore, it would be interesting to compare the models in this work with ones trained on datasets that either include scale information directly (e.g., through depth images) or have an identical camera setup for all images.

Acknowledgments: The neural network training and testing was performed on resources provided by UNINETT Sigma2, the National Infrastructure for High Performance Computing and Data Storage in Norway. We thank all human participants in our experiments. Finally, our appreciation goes to the owners of the images used in this paper: Alex Cowan, Knut Vilhelm Høyland, Lauren Farmer, Natalie Lucier, Roger Skjetne, Sergey Dolya, SF Brit, Sveinung Løset, and the SAMCoT Project.

Conflicts of Interest: The authors declare no conflict of interest.