Comparison of Deep Learning Models for the Classification of Noctilucent Cloud Images

Abstract: Optically thin layers of tiny ice particles near the summer mesopause, known as noctilucent clouds, are of significant interest within the aeronomy and climate science communities. The dataset is collected by ground-based optical cameras mounted at various locations in the Arctic during favorable summer periods. In this paper, first, we compare the performance of various deep learning-based image classifiers against a baseline machine learning model trained with the support vector machine (SVM) algorithm to identify an effective and lightweight model for the classification of noctilucent clouds. The SVM classifier is trained with histogram of oriented gradient (HOG) features, and deep learning models such as SqueezeNet, ShuffleNet, MobileNet, and ResNet are fine-tuned on the dataset. The dataset includes images observed from different locations in northern Europe under varied weather conditions. Second, we investigate the most informative pixels for the classification decision on test images. The pixel-level attributions calculated using the guided backpropagation algorithm are visualized as saliency maps. Our results indicate that the SqueezeNet model achieves an F1 score of 0.95. In addition, SqueezeNet is the lightest model used in our experiments, and its saliency maps for a set of test images correspond better with the regions associated with noctilucent clouds.


Introduction
Noctilucent clouds (NLC) are the highest clouds in the Earth's atmosphere. They form in the vicinity of the mesopause, at altitudes of 80-85 km. The extremely cold temperature in this region during the summer permits the formation of tiny ice particles with sizes in the range of 20-150 nm. NLCs can be observed with the naked eye from the ground, typically from latitudes between 50 and 65 degrees and facing north [1]. The mesosphere is highly dynamic: it displays various types of waves and turbulence, which are influenced by the lower atmosphere, as well as variations influenced by solar-terrestrial physics. Observations of the mesosphere are challenging because it lies above the heights that can be reached with balloons and aircraft and below the heights of most satellites. Satellites are, however, used for large-scale remote observations of the ice particles that form the NLC [2]. Observations from the ground with optical cameras and lidars provide more localized images, including spatial patterns. Investigating these clouds helps us to better understand the upper atmosphere and its dynamics caused by several effects in this region [3].
The NLC structures reveal, for instance, the influence of planetary waves [4]. NLC observations are also discussed as an indicator of climate change [5], and some studies suggest an increase in the frequency of NLC occurrence and in NLC brightness throughout 1964-1994, which could arise from increasing water vapor concentration at these altitudes [6,7]. Although the origins of NLCs and the conditions leading to their formation are still being actively investigated, there are various studies on their size, shape, and formation [6]. Local observations, for instance above northern Scandinavia, allow us to compare NLC observations with radar studies. The radar observations are made as polar mesospheric summer echoes (PMSE), which are observed at similar altitudes as the NLC and higher. PMSE form as a result of several processes and require the presence of electrically charged ice particles, turbulence in the neutral component of the atmosphere, and free electrons. NLC, in contrast, depend merely on the size of the ice particles. Despite these differences, it is helpful to have a combined view of PMSE and NLC to investigate the local structures of these clouds. The PMSE and NLC display similar wavy structures, as shown in Figure 1. The PMSE was captured with an EISCAT radar at Tromsø, and the optical image was taken from Kiruna, Sweden (67.84N, 20.41E). The wavy pattern displayed in these NLCs possibly indicates the influence of wave propagation on a scale from a few kilometers to several kilometers [8].
Optical cameras preprogrammed to take images every few minutes during favorable summer periods collect noctilucent cloud images from various locations in the Arctic. Identifying NLC occurrence in images demands an expert's evaluation and is hence a resource-intensive task. In the literature, there are several studies on the analysis of NLC [9][10][11][12][13][14]; however, studies on their classification using deep learning techniques are lacking. In a recent study [8], different feature-extraction strategies are applied to image patches of 50 by 50 pixels to classify these patches into different categories, such as NLC, tropospheric cloud, clear sky, etc. The study compares the performance of LDA with different combinations of image features (mean, standard deviation, HLAC, and HOG) against that of a convolutional neural network (CNN) model. Although the CNN achieves good classification accuracy and outperforms the other methods used in that study, an experimental pipeline built on patches is not common in practical applications.
In this paper, we investigate the possibility of using state-of-the-art deep learning models to classify NLC based on whole images rather than on image patches as in [8]. State-of-the-art CNN architectures trained with transfer learning are compared to a baseline SVM classifier trained with histogram of oriented gradient (HOG) features. In addition to evaluating their performance, we visualize the pixel-level attributions for test images to identify the pixels that contribute most to the classification decision. The main advantages of using whole images instead of patches are: (1) it allows the use of existing state-of-the-art deep learning architectures and their pre-trained weights with transfer learning, and (2) the selected classifier model can be applied directly in real-world settings.
The rest of the article is divided as follows: First, in Section 2, we outline the dataset associated with experiments, methods, and procedures followed in this paper. In Section 3, we explain the results obtained from our proposed method. In Section 4, we discuss the results and, finally, we highlight the conclusions in Section 5.

Dataset
The dataset consists of images captured from three different locations: Kiruna, Sweden (67.84N, 20.41E), Nikkaluokta, Sweden (67.85N, 19.01E), and Moscow, Russia (56.02N, 37.48E). The images cover various weather conditions and contrast levels of NLC activity. A total of 1177 images constitute the original dataset, with 362 belonging to the NLC category and 815 to the other category. The other category includes images with tropospheric clouds, twilight, clear sky, buildings, rain, and various other environmental conditions, as shown in Appendix A, Figure A1. After preprocessing and cropping, a total of 1540 noctilucent and 4075 non-noctilucent images are available. A few additional images from Novosibirsk, Russia, and Scotland are used for testing but not for training.

Convolutional Neural Network
A convolutional neural network (CNN) is a class of deep neural networks widely used for grid-like data, such as images and videos. A convolutional network combines three architectural ideas to ensure some degree of shift and distortion invariance: local receptive fields, shared weights (or weight replication), and, sometimes, spatial or temporal subsampling (through pooling) [15]. A typical convolutional neural network consists of repeated convolutional blocks after the input layer, which together act as a feature extractor. In most cases, each convolutional block consists of three layers: convolution, non-linear activation, and pooling.
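To make the three-layer block concrete, the following is a minimal NumPy sketch of one convolution, activation, and pooling step on a single-channel image. It is an illustration only: the 6 × 6 input, the 3 × 3 edge filter, and the function names are assumptions chosen for the example, not part of the models used in this paper.

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Slide one shared kernel over the image (cross-correlation, no padding)."""
    kh, kw = kernel.shape
    oh, ow = image.shape[0] - kh + 1, image.shape[1] - kw + 1
    out = np.empty((oh, ow))
    for i in range(oh):
        for j in range(ow):
            # Each output value depends only on a local receptive field.
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

def relu(x):
    """Non-linear activation: keep positive responses, zero out the rest."""
    return np.maximum(x, 0.0)

def max_pool2x2(x):
    """Spatial subsampling: maximum over non-overlapping 2 x 2 windows."""
    h, w = x.shape[0] // 2 * 2, x.shape[1] // 2 * 2
    return x[:h, :w].reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

def conv_block(image, kernel):
    """One block: convolution, then activation, then pooling."""
    return max_pool2x2(relu(conv2d_valid(image, kernel)))

# Hypothetical input: a 6 x 6 horizontal intensity ramp.
image = np.arange(36, dtype=float).reshape(6, 6)
# Hypothetical filter responding to left-to-right intensity increases.
kernel = np.array([[-1.0, 0.0, 1.0]] * 3)

features = conv_block(image, kernel)
print(features.shape)  # 6x6 input -> 4x4 after convolution -> 2x2 after pooling
```

Because the same kernel is reused at every position, the number of weights does not grow with the image size, which is the weight-sharing idea mentioned above.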
SqueezeNet: SqueezeNet is a family of CNN architectures that achieves AlexNet-level accuracy with 50 times fewer parameters and a significantly smaller model size. The building block of SqueezeNet, called the fire module, replaces 3 × 3 filters with 1 × 1 filters, decreases the number of input channels to the 3 × 3 filters, and downsamples late in the network so that the convolution layers have large activation maps [23]. With model compression applied, the SqueezeNet model can be as small as 0.5 MB [23].
ShuffleNet: This is an extremely computation-efficient CNN architecture designed specifically for mobile devices with very limited computing power. The architecture maintains accuracy at reduced computation cost by employing pointwise group convolution and channel shuffling [24].
MobileNet: A class of CNN architectures for mobile and embedded vision applications that use depthwise separable convolutions. Two simple global hyperparameters introduced in the architecture allow an efficient trade-off between latency and accuracy [25].
ResNet: ResNet is an effective CNN architecture for training substantially deeper neural networks. The architecture reformulates the layers as learning residual functions with reference to the layer inputs, instead of learning unreferenced functions [22].
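The parameter savings behind two of these designs can be illustrated by counting weights. The sketch below compares a plain 3 × 3 convolution with a fire module (a 1 × 1 squeeze layer followed by parallel 1 × 1 and 3 × 3 expand layers) and with a depthwise separable convolution (a per-channel 3 × 3 filter followed by a 1 × 1 pointwise convolution). The channel sizes are assumptions chosen for illustration, biases are ignored, and the function names are hypothetical.

```python
def conv_weights(c_in, c_out, k):
    """Number of weights in a standard k x k convolution (biases ignored)."""
    return k * k * c_in * c_out

def fire_module_weights(c_in, squeeze, expand1x1, expand3x3):
    """1x1 squeeze layer, then parallel 1x1 and 3x3 expand layers."""
    return (conv_weights(c_in, squeeze, 1)
            + conv_weights(squeeze, expand1x1, 1)
            + conv_weights(squeeze, expand3x3, 3))

def depthwise_separable_weights(c_in, c_out, k):
    """One k x k filter per input channel, then a 1x1 pointwise convolution."""
    return k * k * c_in + conv_weights(c_in, c_out, 1)

# Illustrative layer sizes (assumed): 96 input channels, 128 output channels.
c_in, c_out = 96, 128

plain = conv_weights(c_in, c_out, 3)
fire = fire_module_weights(c_in, squeeze=16, expand1x1=64, expand3x3=64)
separable = depthwise_separable_weights(c_in, c_out, 3)

print(f"plain 3x3 convolution : {plain} weights")
print(f"fire module           : {fire} weights ({plain / fire:.1f}x fewer)")
print(f"depthwise separable   : {separable} weights ({plain / separable:.1f}x fewer)")
```

For these assumed sizes, both alternatives need roughly an order of magnitude fewer weights than the plain layer, which is consistent with the lightweight character of the models described above.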

Support Vector Machine
The support vector machine (SVM) is a supervised machine learning algorithm. In classification, the algorithm places all input feature vectors in a high-dimensional space and finds the hyperplane that separates the examples with different categorical labels [28,29]. The hyperplane is defined by two parameters, a real-valued vector $\mathbf{w}$ of the same dimensionality as the input feature vector $\mathbf{x}$ and a real number $b$:

$$\mathbf{w}\cdot\mathbf{x} - b = 0, \tag{1}$$

where $\mathbf{w}\cdot\mathbf{x} = w^{(1)}x^{(1)} + w^{(2)}x^{(2)} + \dots + w^{(D)}x^{(D)}$, and $D$ is the number of dimensions of the feature vector $\mathbf{x}$.
The goal of the SVM learning algorithm during the training phase is to find the optimal values of the weight and bias terms of the separating hyperplane, $\mathbf{w}^{*}$ and $b^{*}$, respectively. After solving the optimization problem, the predicted label for any input feature vector $\mathbf{x}$ is given by [29]:

$$y = \operatorname{sign}(\mathbf{w}^{*}\cdot\mathbf{x} - b^{*}), \tag{2}$$

where sign is a mathematical operator that takes any real number and returns +1 if the input is positive and −1 if the input is negative.
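The decision rule can be sketched in a few lines, assuming the hyperplane convention w·x − b = 0. The weight vector, bias, and inputs below are hypothetical values chosen for illustration, not parameters learned in this work; mapping a score of exactly zero to +1 is also an assumption, since the sign operator above is only described for positive and negative inputs.

```python
import numpy as np

def svm_predict(x, w, b):
    """Return +1 or -1 depending on which side of the hyperplane x lies."""
    score = np.dot(w, x) - b
    # Assumption: a score of exactly zero (on the hyperplane) is mapped to +1.
    return 1 if score >= 0 else -1

# Hypothetical trained parameters for a 2-dimensional feature vector.
w_opt = np.array([2.0, -1.0])
b_opt = 1.0

label_a = svm_predict(np.array([2.0, 1.0]), w_opt, b_opt)  # score = 4 - 1 - 1 = 2
label_b = svm_predict(np.array([0.0, 1.0]), w_opt, b_opt)  # score = 0 - 1 - 1 = -2
print(label_a, label_b)
```

In the setting of this paper, x would be the HOG feature vector of an image and the two labels would correspond to the NLC and other categories.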

Metrics Used for the Evaluation
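In classification, the F1 score is a measure of accuracy on the test samples that combines precision and recall as their harmonic mean. The precision of the model is the number of true positive test samples divided by the number of all samples predicted as positive (true positive + false positive). The recall of the model, which is also known as the sensitivity, is the number of true positive results divided by the number of all samples that should have been identified as positive (true positive + false negative). With TP, FP, and FN denoting the numbers of true positives, false positives, and false negatives, the precision, recall, and F1 score are given by Equations (3)-(5), respectively [30]:

$$\text{Precision} = \frac{TP}{TP + FP}, \tag{3}$$

$$\text{Recall} = \frac{TP}{TP + FN}, \tag{4}$$

$$F_1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}. \tag{5}$$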
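As a small worked example, the three metrics can be computed directly from the counts of true positives, false positives, and false negatives. The counts used below are hypothetical and serve only to illustrate the arithmetic; they are not results from this study.

```python
def precision(tp, fp):
    """Fraction of samples predicted positive that are truly positive."""
    return tp / (tp + fp)

def recall(tp, fn):
    """Fraction of truly positive samples that were found (sensitivity)."""
    return tp / (tp + fn)

def f1_score(tp, fp, fn):
    """Harmonic mean of precision and recall."""
    p, r = precision(tp, fp), recall(tp, fn)
    return 2 * p * r / (p + r)

# Hypothetical counts for the positive class.
tp, fp, fn = 90, 10, 30

print(f"precision = {precision(tp, fp):.3f}")
print(f"recall    = {recall(tp, fn):.3f}")
print(f"F1        = {f1_score(tp, fp, fn):.3f}")
```

Because the F1 score is a harmonic mean, it always lies between the precision and the recall and is pulled toward the smaller of the two.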

Procedure
First, each high-resolution image (typically of size 2303 × 1690 or 3088 × 2056 pixels) is converted to a lower resolution of 265 × 240 pixels. Next, a total of five cropped images (four corner crops and a single center crop) of size 224 × 224 pixels are obtained from each of these lower-resolution images. A non-noctilucent category, namely the other category, is created by randomly selecting images with no NLC activity from the dataset. The final dataset comprises 1540 NLC and 4075 other category images, respectively. Approximately 23 percent of the images from each category are used for testing. Of the remaining 77 percent, 80 percent are used for training and 20 percent for the validation of the classifier models.
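The cropping step can be sketched as follows. This is a minimal NumPy illustration rather than the preprocessing code used for the experiments; it assumes that 265 × 240 means 265 pixels wide and 240 pixels high, and the function name is hypothetical.

```python
import numpy as np

def five_crop(image, size=224):
    """Return the four corner crops and the center crop of a (H, W, C) image."""
    h, w = image.shape[:2]
    if h < size or w < size:
        raise ValueError("image is smaller than the requested crop size")
    top, left = h - size, w - size  # largest valid offsets
    offsets = [
        (0, 0),                  # top-left corner
        (0, left),               # top-right corner
        (top, 0),                # bottom-left corner
        (top, left),             # bottom-right corner
        (top // 2, left // 2),   # center
    ]
    return [image[y:y + size, x:x + size] for y, x in offsets]

# Placeholder for a downsized RGB image: 240 rows (height) by 265 columns (width).
image = np.zeros((240, 265, 3), dtype=np.uint8)
crops = five_crop(image)
print(len(crops), crops[0].shape)
```

Each downsized image therefore contributes five 224 × 224 samples, which matches the input size expected by the pre-trained architectures.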
To train the SVM classifier, HOG features are extracted from the resized training samples of size 224 by 224 pixels. For a given image, a 6272-dimensional HOG feature vector is obtained by selecting cells of 8 by 8 pixels with eight orientations per cell. The SVM algorithm is trained on a batch of 100 samples (see Figure 2 for the flow diagram). Finally, we obtain pixel attribution maps (saliency maps) associated with the different deep learning models used in the paper for a few selected test images containing noctilucent clouds. The attribution map signifies the contribution of image pixels to the classification decision and is computed with the guided backpropagation algorithm [31].
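The size of the HOG feature vector follows from the cell layout: a 224 × 224 image contains 28 × 28 cells of 8 × 8 pixels, and eight orientation bins per cell give 28 × 28 × 8 = 6272 values. The sketch below is a simplified HOG written for illustration; it assumes unsigned gradient orientations and omits block normalization, and it is not necessarily the implementation used for the experiments.

```python
import numpy as np

def simple_hog(gray, cell=8, bins=8):
    """Per-cell histograms of gradient orientation, weighted by gradient magnitude."""
    gy, gx = np.gradient(gray.astype(float))
    magnitude = np.hypot(gx, gy)
    # Unsigned orientation in [0, pi), an assumption of this sketch.
    angle = np.mod(np.arctan2(gy, gx), np.pi)
    bin_index = np.minimum((angle / np.pi * bins).astype(int), bins - 1)

    ny, nx = gray.shape[0] // cell, gray.shape[1] // cell
    features = np.zeros((ny, nx, bins))
    for b in range(bins):
        # Keep only the magnitudes that fall into orientation bin b.
        weighted = np.where(bin_index == b, magnitude, 0.0)[:ny * cell, :nx * cell]
        # Sum the magnitudes inside every cell.
        features[:, :, b] = weighted.reshape(ny, cell, nx, cell).sum(axis=(1, 3))
    return features.ravel()

# Placeholder grayscale image: a horizontal intensity ramp of size 224 x 224.
gray = np.tile(np.arange(224, dtype=float), (224, 1))
hog_vector = simple_hog(gray)
print(hog_vector.shape)  # 28 * 28 cells * 8 orientations
```

The resulting vector is what the SVM classifier receives as its input feature vector x.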

Results
The comparison of the SVM classifier with various deep learning architectures, according to their estimated model size, precision, recall, and F1 score (for the NLC category), is shown in Table 1. The F1 score is used as the main metric to compare the performance of the image classifier models. The SVM algorithm trained with histogram of oriented gradient (HOG) features achieved an F1 score of 0.55 with a precision of 0.25 and a recall of 0.38. The lightweight deep learning models used in the experiment, SqueezeNet, ShuffleNet, and MobileNet, all achieved the same F1 score of 0.95. Among the deep learning models used in our experiments, SqueezeNet has the smallest size of 21.81 MB. The widely used state-of-the-art image classification architecture ResNet has a larger model size of 81.11 MB.
Table 2 shows the class predicted by the different models for a few selected test images. Our results show that the SVM misclassifies the NLC images in rows 3-5 among the selected test images. On the other hand, the deep learning models show significantly better class predictions. The predicted labels for the test images in rows 3-5 indicate that NLC activity is not detected equally well by all the deep learning models. We also note that for the test image in row 8, the MobileNet and ResNet models misclassify a non-NLC image as NLC.

The trained SqueezeNet model is also tested with a few images from two different locations: Novosibirsk, Russia, and Scotland. These test images feature backgrounds, orientations, and camera settings that were not considered in the training phase. When tested with images from a known location (same as the training dataset), the SqueezeNet model misclassifies nearly 10 percent (35 out of 343) of the noctilucent cloud images; for details, see the confusion matrix in Figure 6a. The same model, when tested with images from the two new locations (not included in the training dataset), missed nearly 38 percent (18 out of 48) of the noctilucent cloud images; for details, see the confusion matrix in Figure 6b. Sample images from the new locations can be seen in Appendix A, Figure A2.
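The quoted percentages follow directly from the reported counts, and the corresponding recall is one minus the miss rate. A short check of the arithmetic, with a hypothetical helper function:

```python
def miss_rate(missed, total):
    """Fraction of NLC images that the classifier failed to detect."""
    return missed / total

known = miss_rate(35, 343)  # test images from the training locations
new = miss_rate(18, 48)     # test images from Novosibirsk and Scotland

print(f"known locations: miss rate {known:.3f}, recall {1 - known:.3f}")
print(f"new locations:   miss rate {new:.3f}, recall {1 - new:.3f}")
```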

Discussion
We employ different state-of-the-art deep learning architectures to detect noctilucent clouds and compare the performance of these models with the baseline machine learning model (SVM classifier). We find that the baseline model trained with histogram of oriented gradient (HOG) features obtained the lowest F1 score of 0.55 for the NLC class. We infer that, although HOG features can be effective for objects with rigid boundaries and sharp contrast, they seem to be less effective for fuzzy structures such as noctilucent clouds. All convolutional neural network models considered in the experiment have a significantly higher F1 score of 0.95. The sensitivity (recall value) of the deep learning models is also significantly higher (0.90-0.92). Furthermore, the saliency maps obtained with the guided backpropagation algorithm for the test images (Figures 3-5) show the robust feature selection capability of the deep learning models. Although all the deep learning models achieved a high F1 score of 0.95, only the saliency maps produced by SqueezeNet, ShuffleNet, and ResNet correspond well with the visual features of noctilucent clouds. The saliency maps in Figures 3-5 are plotted for the top 15% of the pixels contributing to the classification decision (NLC class).
MobileNet obtained the highest recall value of 0.92, but provided saliency maps that differ from the visual understanding of NLC features (please refer to the column for MobileNet in Figures 3-5). The test results in Figure 6 show that the SqueezeNet model performs well on data from seen locations and relatively poorly on data from unseen locations. The model missed a considerable share of the NLC-containing images from new geographical locations (see Figure 6b for more details). To improve the classification of NLC in unseen images from new locations, training datasets should be obtained from locations that are as diverse as possible. Additionally, domain adaptation techniques such as those mentioned in [32] can be explored to develop a model that generalizes well to unseen data.
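The two ingredients of the saliency maps discussed above, the guided backpropagation rule and the selection of the top 15% of contributing pixels, can be sketched on a toy example. The snippet uses a tiny fully connected network with a single ReLU layer instead of the CNNs of this paper, and all weights and inputs are hypothetical. At the ReLU, the guided rule passes a gradient backward only where the forward input was positive and the incoming gradient is also positive.

```python
import numpy as np

def guided_backprop_saliency(x, W1, w2):
    """Guided backpropagation for score = w2 . relu(W1 @ x)."""
    z1 = W1 @ x          # pre-activation of the hidden ReLU layer
    grad_a1 = w2.copy()  # gradient of the score w.r.t. the ReLU output
    # Guided ReLU rule: keep the gradient only where the forward input
    # was positive AND the incoming gradient is positive.
    grad_z1 = grad_a1 * (z1 > 0) * (grad_a1 > 0)
    return W1.T @ grad_z1  # attribution for every input value

def top_fraction_mask(attribution, fraction=0.15):
    """Boolean mask of the pixels with the largest absolute attribution."""
    magnitude = np.abs(attribution)
    threshold = np.quantile(magnitude, 1.0 - fraction)
    return magnitude >= threshold

# Hypothetical 3-pixel input and weights.
x = np.array([1.0, 2.0, 3.0])
W1 = np.array([[1.0, 0.0, -1.0],   # hidden unit 0: negative forward input, blocked
               [0.0, 1.0, 0.0],    # hidden unit 1: negative gradient, blocked
               [1.0, 1.0, 1.0]])   # hidden unit 2: passes the gradient
w2 = np.array([1.0, -1.0, 2.0])
saliency = guided_backprop_saliency(x, W1, w2)
print(saliency)

# Hypothetical 10 x 10 attribution map: keep the 15% strongest pixels.
attribution_map = np.arange(100, dtype=float).reshape(10, 10)
mask = top_fraction_mask(attribution_map)
print(int(mask.sum()))
```

For a CNN, the same rule is applied at every ReLU layer during the backward pass, and the resulting map has the same shape as the input image.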

Conclusions
In this paper, we employ different deep learning-based image classifiers to identify images containing noctilucent clouds. The deep learning models are compared against a machine learning model trained with the support vector machine (SVM) algorithm. The dataset includes optical images captured from different locations in northern Europe under varied weather conditions. The SVM classifier is trained with histogram of oriented gradient (HOG) features, and deep learning models such as SqueezeNet, ShuffleNet, MobileNet, and ResNet are fine-tuned on the dataset. In addition, for a few test images, we investigate the most informative pixels for the classification decision and visualize them as saliency maps. These so-called attribution maps are obtained by employing the guided backpropagation method. Our results show that the SqueezeNet model achieves an F1 score of 0.95 and is the lightest among the deep learning models used in this paper. Additionally, the saliency maps obtained with the SqueezeNet model are better associated with noctilucent cloud features. Based on our experiments and results, we identify the SqueezeNet model as a powerful and lightweight model that can be implemented to identify noctilucent clouds.

Appendix A
Figure A1. Randomly selected images from the training dataset: images within a yellow square contain a small to large concentration of noctilucent cloud and are considered NLC images; the rest belong to the other category.
Figure A2. Randomly selected images of the NLC category from new locations that are used only for testing, not for training or fine-tuning the classifier models.