Deep Convolutional Neural Networks Capabilities for Binary Classification of Polar Mesocyclones in Satellite Mosaics

Polar mesocyclones (MCs) are small marine atmospheric vortices. The class of intense MCs, called polar lows, is accompanied by extremely strong surface winds and heat fluxes and thus largely influences deep ocean water formation in the polar regions. Accurate detection of polar mesocyclones in high-resolution satellite data is a challenging and time-consuming task when performed manually. Existing algorithms for the automatic detection of polar mesocyclones are based on the conventional analysis of cloudiness patterns, and they involve different empirically defined thresholds of geophysical variables. As a result, various detection methods typically reveal very different results when applied to a single dataset. We develop a conceptually novel approach for the detection of MCs based on the use of deep convolutional neural networks (DCNNs). As a first step, we demonstrate that a DCNN model is capable of performing binary classification of 500 × 500 km patches of satellite images regarding the presence of MC patterns in them. The training dataset is based on a reference database of MCs manually tracked in the Southern Hemisphere from satellite mosaics. We use a subset of this database with MC diameters falling in the range of 200-400 km. This dataset is further used for testing several different DCNN setups: a DCNN built "from scratch", a DCNN based on VGG16 pre-trained weights engaging the Transfer Learning technique, and a DCNN based on VGG16 with the Fine Tuning technique. Each of these networks is further applied to both infrared (IR) and a combination of infrared and water vapor (IR + WV) satellite imagery. The best skill (97% in terms of the binary classification accuracy score) is achieved with a model that averages the estimates of an ensemble of different DCNNs.
The algorithm can be further extended to a numerical scheme for automatic identification and tracking, and applied to other atmospheric phenomena that are characterized by a distinct signature in satellite imagery.


Introduction
Polar mesoscale cyclones (MCs) are high-latitude marine atmospheric vortices. Their sizes range from 200 to 1000 km, with lifetimes typically spanning from 6 to 36 h [1]. A specific intense type of mesocyclone, the so-called polar low (PL), is characterized by surface winds of more than 15 m/s and strong surface fluxes. PLs have a significant impact on local weather conditions, causing rough seas. Being relatively small in size (compared to extratropical cyclones), PLs contribute significantly to the generation of extreme air-sea fluxes and initiate intense surface transformation of water masses, resulting in the formation of ocean deep water [2][3][4]. These processes are most intense in the Weddell and Bellingshausen Seas in the Southern Hemisphere (SH) and in the Labrador, Greenland, and Irminger Seas in the Northern Hemisphere (NH).
One potential source of data is reanalyses. However, MCs, being critically important for many oceanographic and meteorological applications, are only partially detectable in different reanalysis datasets, primarily due to inadequate resolution. Studies [3,[5][6][7][8]] have demonstrated a significant underestimation of both the number of mesocyclones and wind speeds by modern reanalyses in contrast with satellite observations of MC cloud signatures and wind speeds. This hints that the spatial resolution of modern reanalyses is still not good enough for reliable and accurate detection of MCs. Press et al. argued that at least 10 × 10 grid points are necessary for effectively capturing an MC [9]. This implies that a 30 km spatial resolution in the model or reanalysis is needed for detecting an MC with a diameter of 300 km. Some studies [5,10] have demonstrated that 80% (64%) of MCs (PLs) in the SH (NH) are characterized by diameters ranging from 200 to 500 km (250 to 450 km for the NH in [10]). The most recent study of Smirnova and Golubkin [11] revealed that only 70% of those could be sustainably represented, even in the very high-resolution Arctic System Reanalysis (ASR) [12]. At the same time, only 53% of the observed MCs characterized by diameters less than 200 km [5] are sustainably represented in the ASR [11]. It was also shown [3,5,6] that both the number of MCs and the associated winds in modern reanalyses are significantly underestimated when compared to satellite observations of MC cloud signatures and satellite scatterometer observations of MC winds.
One might argue for the use of operational analyses for detecting MCs. However, these products are influenced by changes in the numerics of a model, by the replacement of physics parameterization schemes with newly developed ones, and by changes in the performance of the data assimilation system and the amount of assimilated data. This leads to artificial trends at climatological timescales. In several studies, automated cyclone tracking algorithms that were originally developed for mid-latitude cyclones were adapted for MC identification and tracking [13][14][15]. These algorithms were applied to preprocessed (spatially filtered) reanalysis data and delivered climatological assessments of MC activity in reanalyses or revealed directions for their improvement. However, the reported estimates of MC numbers, sizes, and lifecycle characteristics vary significantly among these studies.
Zappa et al. [13] showed that the ECMWF operational analysis makes it possible to detect up to 70% of the observed PLs, which is higher than the ERA40 and ERA-Interim reanalyses (24%, 45%, or 55%, depending on the tracking procedure and the choice of reanalysis [6,13]). A single bandpass filter in conjunction with different combinations of criteria used for the post-processing of the MC tracking results may result in a 30% spread in the number of PLs [13]. Observational satellite-based climatologies of MCs and PLs [5,10,[16][17][18][19]] consistently reveal a mean vortex diameter of 300-350 km. In a number of reanalysis-based automated studies [14,20], the upper limit of MC and PL diameters was set to 1000 km, resulting in mean values between 500 and 800 km. Thus, the estimates of MC sizes derived with automated tracking algorithms are still inconsistent. This inconsistency contrasts with the estimates of midlatitude cyclones' characteristics derived with an ensemble of tracking schemes [21] applied to a single dataset.
Satellite imagery of cloudiness is another data source for the identification and tracking of MCs. These data allow for visual identification of cloud signatures that are associated with MCs. However, the manual procedure requires enormous effort to build a long enough dataset. The pioneering work of Wilhelmsen [22] used ten years of consecutive synoptic weather maps, coastal observational stations, and several satellite images over the Norwegian and Barents Seas to describe local PL activity. Later, in the 1990s, the number of instruments and satellite crossovers increased. This provoked many studies [16,[23][24][25][26][27][28]] evaluating the characteristics of MC occurrence and lifecycle in different regions of both the NH and SH. These studies identified the major MC generation regions, their dominant migration directions, and the cloudiness signature types that are associated with MCs. Increases in the amount of satellite observations allowed for the development of robust regional climatologies of MC occurrence and characteristics. For the SH, Carleton [27] used twice-daily cloudiness imagery of West Antarctica and classified for the first time four types of cloud signatures associated with PLs (comma, spiral, transitional type, and merry-go-round). This classification has been confirmed later in many works, and it is widely used now. Harold et al. [16,26] used daily satellite imagery for building one of the most detailed datasets of MC characteristics for the Nordic Seas (Greenland, Norwegian, Iceland, and Northern Seas). Harold et al. [16,26] also developed a detailed description of the conventional methodology for the identification and tracking of MCs using satellite IR imagery.
There are also several studies regarding polar MC and PL activity in the Sea of Japan. Gang et al. [29] conducted the first long-term (three winter months) research of PLs in the Sea of Japan based on visible and IR imagery from a geostationary satellite with hourly resolution. In the era of multi-sensor satellite observations, Gurvich and Pichugin [30] developed a nine-year climatology of polar MCs based on water vapor, cloud water content, and surface wind satellite data over the Western Pacific. This study reveals a mean MC diameter of 200-400 km as well.
As these examples illustrate, most studies of MC activity are regional [10,17,18,31,32], and they cover relatively short time periods [5] due to the very costly and time-consuming procedure of visual identification and tracking of MCs. Thus, the development of a reliable long-term (multiyear) dataset covering the whole circumpolar Arctic or Antarctic remains a challenge.
Recently, machine learning methods have been found to be quite effective for the classification of different cloud characteristics, such as solar disk state and cloud types. There are studies in which different machine learning techniques are used for recognizing cloud types [33][34][35]. The methodologies employed include deep convolutional neural networks (DCNNs [36,37]), the k-nearest-neighbor classifier (KNN), Support Vector Machines (SVM), and fully-connected neural networks (FCNNs). Krinitskiy [38] used FCNNs for the detection of solar disk state and reported a very high accuracy (96.4%) of the proposed method. Liu et al. [39] applied DCNNs to fixed-size multichannel images to detect extreme weather events and reported a detection success score of 89 to 99%. Huang et al. [40] applied the neural network "DeepEddy" to synthetic aperture radar images for the detection of ocean meso- and submesoscale eddies. Their results are also characterized by high accuracy, exceeding a 96% success rate. However, Deep Learning (DL) methods have never been applied for detecting MCs.
DCNNs are known to demonstrate high skill in classification, pattern recognition, and semantic segmentation when applied to two-dimensional (2D) fields, such as images. The major advantage of DCNNs is the depth of processing of the input 2D field. Similarly to the processing levels of satellite data (L0, L1, L2, L3, etc.), which allow retrieving, e.g., wind speed (L2 processing) from the raw remote measurements (L0), DCNNs deal with multiple levels of subsequent non-linear processing of an input image. In contrast to expert-designed algorithms, the neural network levels of processing (so-called layers) are built in a manner that is common within each specific layer type (convolutional, fully-connected, subsampling, etc.). During the network training process, these layers of a DCNN acquire the ability to extract a broad set of patterns of different scales from the initial data [41][42][43][44]. In this sense, a trained DCNN closely simulates the visual pattern recognition process naturally used by a human operator. There exist several state-of-the-art network architectures, such as "AlexNet" [35], "VGG16" and "VGG19" [45], "Inception" of several subversions [46], "Xception" [47], and residual networks [48]. Each of these networks has been trained and tested using a range of datasets, including the one that is considered a "reference" for further image processing, the so-called ImageNet [49]. The continuous development of DCNNs aims to improve the accuracy of the ImageNet classification. Today, the existing architectures demonstrate high accuracy, with error rates from 2% to 16% [50].
A DCNN by design closely simulates the visual recognition process. IR and WV satellite mosaics can be interpreted as images. Thus, assuming that a human expert detects MCs on these mosaics on the basis of visual perception, the application of a DCNN appears to be a promising approach to this problem. Liu et al. [39] described a DCNN applied to the detection of tropical cyclones and atmospheric rivers in 2D fields of surface pressure, temperature, and precipitation stacked together into "image patches". However, the proposed approach cannot be directly applied to MC detection. This method is skillful for the detection of large-scale weather extremes that are discernible in reanalysis products. However, as noted above, MCs have a poorly observable footprint in the geophysical variables of reanalyses.
In this study, we apply the DL technique [51][52][53] to the satellite IR and WV mosaics distributed by the Antarctic Meteorological Research Center [54,55]. This allows for the automated recognition of MC cloud signatures. Our focus here is exclusively on the capability of DCNNs to perform a binary classification task regarding the presence of MC patterns in patches of satellite imagery of cloudiness and/or water vapor, rather than on DCNN-based MC tracking. This will indicate that a DCNN is capable of learning a hidden representation that is in accordance with the data and the MC detection problem.
The paper is organized as follows. Section 2 describes the source data based on the MC trajectories database [5]. Section 3 describes the development of the MC detection method based on deep convolutional neural networks and the necessary data preprocessing. In Section 4, we present the results of the application of the developed methodology. Section 5 summarizes the paper with conclusions and provides an outlook.

Data
For the training of DCNNs, we use the MCs dataset for the Southern Ocean (SOMC, http://sail.ocean.ru/antarctica/), consisting of 1735 MC trajectories, resulting in 9252 MC locations and associated estimates of MC sizes [5] for the four-month period (June, July, August, September) of 2004 (Figure 1a). The dataset was developed by visual identification and tracking of MCs using 976 consecutive three-hourly satellite IR (10.3-11.3 μm) and WV (~6.7 μm) mosaics provided by the Antarctic Meteorological Research Center (AMRC) Antarctic Satellite Composite Imagery (AMRC ASCI) [54,55]. These mosaics are available online (https://amrc.ssec.wisc.edu/data/) and are composites of geostationary and polar-orbiting satellite observations (GOES East and West, Meteosat, MTSAT, NOAA satellites, METOP, FY-2, Aqua, Terra, etc.). This mosaics dataset is maintained by the AMRC [55]. The SOMC dataset contains the longitudes and latitudes of MC centers at each three-hourly time step of the MC track, as well as the MC diameter and the cloudiness signature type through the MC life cycle [5]. These characteristics were used along with the associated cloudiness patterns of MCs from the initial IR and WV mosaics for training DCNNs.
AMRC ASCI mosaics spatially combine observations from geostationary and polar-orbiting satellites and cover the area to the south of ~40° S with three-hourly temporal and 5 km spatial resolution (Figure 1b,c). While the IR channel is widely used for MC identification [16,17,26,27,32], we additionally employ the WV channel imagery, which provides better accuracy over the ice-covered ocean, where the IR images are potentially incorrect.

Data Preprocessing
For training the models, we first co-located a square (patch) of 100 × 100 mosaic pixels (500 × 500 km) with each MC center location from the SOMC dataset (9252 locations in total) (Figure 2a-d). Since the distance between MCs in multiple systems, such as the merry-go-round pattern, may be comparable to each MC diameter, and to ensure that each patch (i) covers only one MC and (ii) covers it completely, we require that MC diameters fall into the 200-400 km range. Hereafter, we call this set of samples 'the true samples'. The chosen set of true samples includes 67% of the whole population of samples in the SOMC dataset.
We additionally built a set of 'false samples' for DCNN training. False samples were generated from patches that do not contain MC-associated cloudiness signatures (Figure 2e-h) according to the SOMC dataset. Table 1 summarizes the numbers of true and false samples that together make up the source dataset for our further analysis of IR and WV mosaics. The total number of snapshots used (both IR and WV) is 11,189. The true samples are 6177 (55%) of them, and 5012 (45%) are the false samples (see Figure 2). In order to unify the images in the dataset, we normalized them by the maximum and the minimum brightness temperature (in the case of IR) over the whole dataset:

x̂ = (x − min(X)) / (max(X) − min(X)),

where x denotes the individual sample (represented by a matrix of 100 × 100 pixels) and X is the whole dataset of 11,189 IR snapshots. The same normalization was applied to the WV snapshots.
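This min-max normalization can be sketched in a few lines (a minimal illustration with our own variable names; it mirrors the formula above, not the authors' actual code):

```python
import numpy as np

def normalize_dataset(X):
    """Scale a stack of snapshots by the global minimum and maximum
    brightness temperature of the whole dataset, mapping it to [0, 1]."""
    x_min, x_max = X.min(), X.max()
    return (X - x_min) / (x_max - x_min)

# Toy stand-in for the 11,189 IR snapshots of 100 x 100 pixels each
X = np.random.uniform(200.0, 300.0, size=(10, 100, 100))
X_norm = normalize_dataset(X)
assert X_norm.min() == 0.0 and X_norm.max() == 1.0
```

Note that the normalization constants are taken over the full dataset, not per image, so relative brightness differences between snapshots are preserved.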



Formulation of the Problem
We consider MC identification as a binary classification problem. We use the set of true and false samples (Figure 2) as input ("objects" herein). We have developed two DCNN architectures following two conditional requirements: either (i) the object is described by the IR image only, or (ii) the object is described by both IR and WV images. Since the training dataset is almost target-balanced (see Table 1), with an approximately 50/50 ratio of true/false samples, we further use the accuracy score as the measure of classification quality. The accuracy score cannot be used as a reliable quality measure of any machine learning method in the case of an unbalanced training dataset. For example, in the case of a highly unbalanced dataset with a true/false ratio of 95/5, it is easy to achieve a 95% accuracy score by just forcing the model to produce only the true outcome. Thus, balancing the source dataset with false samples is critical for building a reliable classification model.

Justification of Using DCNN
There is a set of best practices commonly used to construct DCNNs for solving classification problems [56]. While building and training DCNNs for MC identification, we applied the technique proposed by LeCun [41]. This technique implies the usage of consecutive convolutional layers that detect spatial data patterns, alternating with subsampling layers, which reduce the sample dimensions. This set of layers is followed by a set of so-called fully-connected (FC) layers representing a neural classifier. The whole model built in this manner represents a non-linear classifier capable of directly predicting a target value for the input sample. A very detailed description of this model architecture can be found in [41]. We will further term the set of FC layers the "FC classifier", and the preceding part containing convolutional and pooling layers the "convolutional core" (see Figures 3 and 4). The outcome of the whole model is the probability of MC presence in the input sample.
While handling multiple concurrent and spatially aligned geophysical fields, it is important to choose a suitable approach. LeCun [41] proposed a DCNN focused on the processing of grayscale images only, meaning just one 2D field. In order to handle multiple 2D fields, they may be stacked together to form a three-dimensional (3D) matrix, by analogy with color images, which have three color channels: red, green, and blue. This approach can be applied when one uses pre-trained networks like AlexNet [36], VGG16 [45], ResNet [48], or similar architectures, because the original purpose of these networks is to classify color images. However, this approach should be exploited carefully when applied to geophysical fields, because the mentioned networks were trained using massive datasets (e.g., ImageNet) of real photographed scenes, which implies specific dependencies lying between the channels (red, green, and blue) within each image. In contrast to the stacking approach applied by Liu et al. [39], we use a separate CNN branch for each channel (IR and WV) to ensure that we are not limiting the overall quality of the whole network (see Figure 4). In the following, we describe in detail each DCNN architecture for both cases: IR + WV (Figure 4) and IR alone (Figure 3).
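The two-branch design can be illustrated with a minimal Keras functional-API sketch: each channel passes through its own convolutional core, and the flattened branch outputs are concatenated before the FC classifier. Layer counts and sizes here are illustrative assumptions, not the configurations actually tuned in this work:

```python
# Sketch of a two-branch convolutional core: IR and WV are processed
# separately, then concatenated into one features vector for the
# fully-connected classifier. All layer sizes are illustrative.
from tensorflow.keras import layers, models

def conv_branch(inp):
    x = layers.Conv2D(16, (3, 3), activation="relu")(inp)
    x = layers.MaxPooling2D((2, 2))(x)
    x = layers.Conv2D(32, (3, 3), activation="relu")(x)
    x = layers.MaxPooling2D((2, 2))(x)
    return layers.Flatten()(x)

ir_input = layers.Input(shape=(100, 100, 1), name="IR")
wv_input = layers.Input(shape=(100, 100, 1), name="WV")
features = layers.concatenate([conv_branch(ir_input), conv_branch(wv_input)])
x = layers.Dense(64, activation="relu")(features)   # FC classifier
output = layers.Dense(1, activation="sigmoid")(x)   # probability of MC presence

model = models.Model(inputs=[ir_input, wv_input], outputs=output)
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
```

Because each branch has its own weights, the network is free to learn channel-specific features rather than inheriting the RGB inter-channel dependencies baked into photo-trained networks.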
Since we consider binary classification, and the source dataset is almost target-balanced (see Table 1), we use as a quality measure the accuracy score (Acc), which is the fraction of objects classified correctly as compared to the ground truth:

Acc = (1/|T|) Σ_{i=1..|T|} [y_i = ŷ_i],

where T denotes the dataset and |T| is its total samples count; y_i is the expert-defined target value (ground truth), ŷ_i is the model decision on whether the i-th object contains an MC, and [·] equals 1 when the condition holds and 0 otherwise.
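For concreteness, the accuracy score defined above amounts to the following computation (a trivial sketch with toy labels):

```python
import numpy as np

def accuracy_score(y_true, y_pred):
    """Fraction of samples whose predicted label matches the ground truth."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    return float((y_true == y_pred).mean())

# Toy example: 8 of 10 patches classified correctly -> Acc = 0.8
y_true = [1, 1, 0, 0, 1, 0, 1, 0, 1, 1]
y_pred = [1, 0, 0, 0, 1, 0, 1, 1, 1, 1]
assert accuracy_score(y_true, y_pred) == 0.8
```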
In addition to the baseline, which is the network proposed in [41], we applied a set of additional approaches commonly used to improve DCNN accuracy and generalization ability (see Appendix A). Specifically, we used Transfer Learning (TL) [57][58][59][60][61][62] with the VGG16 [45] network pre-trained on the ImageNet [49] dataset; Fine Tuning (FT) [63]; Dropout (Do) [64]; and dataset augmentation (DA) [65] (see Appendix A). With these techniques applied in various combinations, we constructed six DCNN architectures, which are summarized in Table 2. All of these architectures are built in a common manner: the FC classifier follows the one-branched (for IR only) or two-branched (for IR + WV) convolutional core. If the convolutional core is one-branched, its output itself is the input data for the corresponding FC classifier. If the convolutional core is two-branched, the concatenation product of the branch outputs is the input data for the corresponding FC classifier. A very detailed description of the constructed architectures is presented in Appendix A. For each DCNN structure, we trained a set of models, as described in detail in Section 3.5. We also applied ensemble averaging (see Appendix A) to each set of models of identical configuration by averaging the probabilities of the true class for each object of the dataset. We term these six ensemble-averaged models the "second-order" models. We also applied per-sample ensemble averaging of all DCNNs that were trained in this work. We term this model the "third-order" model. Each of these models was trained using the method of backpropagation of error (BCE loss, see Appendix A) [66], denoted as "backprop training" in Figures 3 and 4.
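The second-order ensemble averaging can be sketched as follows. The probability threshold p_th is tuned per ensemble (see Table 2); the probabilities below are toy numbers:

```python
import numpy as np

def ensemble_predict(prob_matrix, p_th=0.5):
    """Second-order model: average the per-sample true-class probabilities
    over an ensemble of identically configured DCNNs, then threshold.

    prob_matrix: shape (n_models, n_samples), each entry P(MC present).
    p_th: decision threshold (tuned per ensemble in this work).
    """
    mean_prob = prob_matrix.mean(axis=0)
    return (mean_prob >= p_th).astype(int), mean_prob

# Three models scoring four patches
probs = np.array([[0.9, 0.4, 0.2, 0.6],
                  [0.8, 0.5, 0.1, 0.7],
                  [0.7, 0.3, 0.3, 0.8]])
labels, mean_prob = ensemble_predict(probs, p_th=0.5)
assert labels.tolist() == [1, 0, 0, 1]
```

The third-order model follows the same pattern, with `prob_matrix` stacking the outputs of all trained DCNNs rather than one architecture-specific set.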


Proposed DCNN Architectures
Six DCNNs that we have constructed are able to perform binary classification on satellite mosaic data (IR alone or IR + WV) represented as grayscale 100 × 100 pixel images:

1. CNN #1. This model is built "from scratch", which means we have not used any pre-trained networks. CNN #1 is built in the manner proposed in [35]. We varied the sizes of the convolutional kernels of each convolutional layer from 3 × 3 to 5 × 5. We also varied the sizes of the subsampling layers' receptive fields from 2 × 2 to 3 × 3. For each convolutional layer, we varied the number of convolutional kernels: 8, 16, 32, 64, and 100. The network convolutional core consists of three convolutional layers alternated with subsampling layers. Each pair of convolutional and subsampling layers is followed by a dropout layer. CNN #1 is one-branched, and objects are described by IR 500 × 500 km satellite snapshots only.

2. CNN #2. This model is built "from scratch" with two separate branches, for IR and WV data. The convolutional core of each branch is built in the same manner as the convolutional core of CNN #1, as proposed in [41]. We varied the same structural parameters here, in the same ranges, as for CNN #1.

3. CNN #3. This model is built with the TL approach. We used the VGG16 pre-trained convolutional core to construct this model. None of the VGG16 weights were optimized within this model; only the weights of the FC classifier were trainable. This model is one-branched, and the objects are described by IR 500 × 500 km satellite snapshots only. The CNN #3 structure is shown in Figure 3.

4. CNN #4. This model is two-branched, and each branch of its convolutional core is built with the TL approach, in the same manner as the convolutional core of CNN #3. Input data are IR and WV. None of the VGG16 weights in either of the two branches were optimized; only the weights of the FC classifier were trainable. The CNN #4 structure is shown in Figure 4.

5. CNN #5. This model is built with both the TL and FT approaches. We built the convolutional core of this model using the VGG16 pre-trained network. The VGG16 convolutional core consists of five similar blocks of layers. For CNN #5, we made the last of these five blocks trainable. This model is one-branched, and objects are IR 500 × 500 km satellite snapshots only. The CNN #5 structure is shown in Figure 3.

6. CNN #6. This model is two-branched, and the branches of its convolutional core are built in the same manner as the convolutional core of CNN #5. For CNN #6, we made the last of the five blocks of each VGG16 convolutional core trainable. Input data are IR and WV 500 × 500 km satellite snapshots. The CNN #6 structure is shown in Figure 4.
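The TL/FT setup behind CNN #3-#6 can be sketched with Keras. We load VGG16 with `weights=None` here to avoid the ImageNet download (the models in this work use the ImageNet pre-trained weights), and we assume the grayscale snapshots are replicated to three channels to match the VGG16 input; the FC classifier size is also an illustrative assumption:

```python
from tensorflow.keras import layers, models
from tensorflow.keras.applications import VGG16

# VGG16 convolutional core; this work uses weights="imagenet".
core = VGG16(weights=None, include_top=False, input_shape=(100, 100, 3))

# TL: freeze the whole core (set everything non-trainable).
# TL + FT (CNN #5, #6): additionally unfreeze the fifth block ("block5_*").
for layer in core.layers:
    layer.trainable = layer.name.startswith("block5")

x = layers.Flatten()(core.output)
x = layers.Dense(64, activation="relu")(x)      # FC classifier (illustrative size)
out = layers.Dense(1, activation="sigmoid")(x)  # probability of MC presence
model = models.Model(core.input, out)
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
```

For the pure-TL models (CNN #3, #4), the loop would simply set every core layer non-trainable, leaving only the FC classifier weights to be optimized.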

Computational Experiment Design
The following hyper-parameters are included in each of the six networks:
• Size (number of nodes) of the first layer of the FC classifier (denoted as FC1 in Figures 3 and 4)
• Convolutional kernels count for each convolutional layer (only applies to CNN #1 and CNN #2)
• Sizes of convolutional kernels (only applies to CNN #1 and CNN #2)
• Sizes of receptive fields of subsampling layers (only applies to CNN #1 and CNN #2)

The whole dataset was split into training (8952 samples) and testing (2237 samples) sets, stratified by target value, meaning that each set has the same (55:45) ratio of true/false samples as the whole dataset (i.e., 4924:4028 and 1253:984 samples in the training and testing sets, respectively). We conducted hyper-parameter optimization for each of these DCNNs using the stratified K-fold (K = 5) cross-validation approach. After this optimization, we trained several (typically 14-18) models with the best hyper-parameter configuration on the training set for each architecture (architecture-specific models). Then, from this set of architecture-specific models for each of the six architectures, we excluded the models with the maximal and minimal accuracy scores, as estimated with the cross-validation approach. The remaining architecture-specific models were evaluated on the testing set, which was never seen by these models. We estimated the accuracy score for each individual model and the variance of the accuracy score for each particular architecture with the best hyper-parameter combination (see Table 2).
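The stratified split and cross-validation described above can be reproduced with scikit-learn. The exact tooling used in this work is not stated, so treat the function choices as assumptions; the sample counts are taken from the text:

```python
# Stratified train/test split and stratified 5-fold cross-validation,
# preserving the ~55:45 true/false ratio in every subset.
import numpy as np
from sklearn.model_selection import train_test_split, StratifiedKFold

X = np.arange(11189).reshape(-1, 1)      # sample indices as stand-ins for patches
y = np.array([1] * 6177 + [0] * 5012)    # 55:45 true/false labels

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=2237, stratify=y, random_state=0)
assert len(X_train) == 8952 and len(X_test) == 2237

# Each of the K = 5 folds preserves approximately the same class ratio
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for train_idx, val_idx in skf.split(X_train, y_train):
    assert abs(y_train[train_idx].mean() - y.mean()) < 0.01
```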

With the ensemble-averaging approach, we evaluated the second-order models on the "never seen by the model" testing set. As described in Section 3.3, we estimated the optimal probability threshold p_th for each second-order model and for the third-order model (see Table 2) to obtain the best accuracy score. These scores are treated as the quality measure of each particular architecture.
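The ensemble averaging and the search for the optimal probability threshold p_th can be illustrated with a toy numerical sketch (not the authors' code; `ensemble_accuracy` and all numbers are hypothetical):

```python
import numpy as np

def ensemble_accuracy(member_probs, y_true, thresholds=np.linspace(0.05, 0.95, 91)):
    """Average member probabilities, then pick the threshold maximizing accuracy."""
    p_mean = np.mean(member_probs, axis=0)          # second-/third-order model output
    best_th, best_acc = 0.5, 0.0
    for th in thresholds:
        acc = np.mean((p_mean >= th) == y_true)
        if acc > best_acc:
            best_th, best_acc = th, acc
    return best_th, best_acc

# Toy example: three ensemble "members" scoring four samples labeled [1, 1, 0, 0]
probs = np.array([[0.90, 0.60, 0.40, 0.10],
                  [0.80, 0.55, 0.35, 0.20],
                  [0.85, 0.50, 0.45, 0.15]])
y = np.array([1, 1, 0, 0])
p_th, acc = ensemble_accuracy(probs, y)
```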
Numerical optimization and evaluation of the models were performed at the Data Center of the Far Eastern Branch of the Russian Academy of Sciences (FEB RAS) [67] and on the DL computational resources of the Sea-Air Interactions Laboratory of the Shirshov Institute of Oceanology of the Russian Academy of Sciences (IORAS, https://sail.ocean.ru/). The exploited computational nodes each contain two NVIDIA Tesla P100 16 GB graphics processing units (GPUs). With these resources, the total GPU time of the calculations was 3792 h.

Results
The designed DCNNs were applied to the detection of Antarctic MCs for the period from June to September 2004. A summary of the results of the application of the six models is presented in Table 2. As noted above, each model is characterized by the utilized data source (IR alone or IR + WV; columns "IR" and "WV" in Table 2). These DCNNs are further categorized according to the chosen set of techniques applied in addition to the basic approach (see the Table 2 legend). Table 2 also provides the accuracy scores and probability thresholds estimated, as described in Section 3.5, for the individual, second-, and third-order models of each architecture.
Table 2. Accuracy score of each model with the best hyper-parameter combination. BA: basic approach [41]; TL: Transfer Learning; FT: Fine Tuning; Do: dropout; DA: dataset augmentation. Acc is the accuracy score averaged across models of the particular architecture. Acc_EA is the accuracy score of the ensemble-averaged models with the optimal probability threshold. p_th is the optimal probability threshold value.

Figure 5 demonstrates the four main types of falsely classified objects. For the first and second types, the IR data are missing completely or partially. For the third type, the source satellite data are suspected to be corrupted. These three types of classifier errors originate from the lack or corruption of source data. For the fourth type, the source satellite data are realistic, but the classifier made a mistake. Thus, some of the false classifications are model mistakes, and some are associated with a labeling issue, whereby a human expert could guess at the MC propagation over an area with missing or corrupted satellite data.

As shown in Table 2, CNN #3 and CNN #5 demonstrated the best accuracy among the second-order models on a never-seen subset of objects. The best combinations of hyper-parameters for these networks are presented in Appendix B. Confusion matrices and receiver operating characteristic (ROC) curves for these models are shown in Figure 6a-d; those for all evaluated models are presented in Appendix C. Figure 6 clearly confirms that these two models perform almost equally for the true and the false samples. According to Table 2, the best accuracy score is reached using a different probability threshold for each second- or third-order model.
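Confusion matrices and ROC curves of this kind can be computed with standard scikit-learn utilities; the sketch below uses toy scores, not the paper's predictions.

```python
import numpy as np
from sklearn.metrics import auc, confusion_matrix, roc_curve

y_true = np.array([1, 1, 1, 0, 0, 0])               # toy ground-truth labels
p_hat = np.array([0.9, 0.8, 0.3, 0.4, 0.2, 0.1])    # toy model probabilities

# ROC curve and area under it, swept over all probability thresholds
fpr, tpr, _ = roc_curve(y_true, p_hat)
roc_auc = auc(fpr, tpr)

# Confusion matrix at a fixed threshold; rows: true class, cols: predicted class
cm = confusion_matrix(y_true, p_hat >= 0.5)
```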
A comparison of CNN #1 and CNN #2, on the one hand, with the remaining models, on the other hand, shows that the DCNNs built with the TL technique demonstrate better performance than the models built "from scratch". Moreover, the accuracy score variances of CNN #1 and CNN #2 are higher than those of the other architectures. Thus, models built with the TL approach seem to be more stable, and their generalization ability is better, compared to models built "from scratch".
Comparing the qualities of CNN #1 and CNN #2, we may conclude that the use of an additional data source (WV) results in a significant increase of the model accuracy score. A comparison of the models within each pair of network configurations (CNN #3 vs. CNN #5; CNN #4 vs. CNN #6) demonstrates that the FT approach does not provide a significant improvement of the accuracy score given such a small dataset. It is also evident that averaging over the ensemble members does increase the accuracy score, by 0.6% for CNN #5 up to 2.41% for CNN #1. However, in some cases, these score increases are comparable to the corresponding accuracy standard deviations.
It is also clear from the last row of Table 2 that the third-order model, which averages the probabilities estimated by all trained models CNN #1-6, produces an accuracy of Acc = 97%, which outperforms all scores of the individual models and the second-order ensemble models. The ROC curve and confusion matrices for this model are presented in Figure 6e,f. Figure 7 demonstrates the characteristics of the best model (the third-order ensemble-averaging model) regarding false negatives (FN). Since the testing set is unbalanced with respect to stages, types of cyclogenesis, and cloud vortex types, we present in Figure 7a,c,d the relative FN rates for each separate class in each taxonomy. We also present the testing set distribution of classes for these taxonomies. Note that the scales are different for the reference distributions of classes of the testing set and the distributions of missed MCs. Detailed false-negative characteristics may be found in Appendix D.
A tracking procedure requires a sustained ability of the MC detection scheme to recognize mesocyclone cloud-shape imprints during the whole MC life cycle. Figure 7a demonstrates that the best model classifies mesocyclone imprints almost equally well for the incipient (~4.6% missed) and mature (~4% missed) stages. The fraction of missed MCs in the dissipating stage is lower (~4% missed among MCs in the dissipating stage). As for the distribution of missed MCs with respect to their diameters (see Figure 7b), the histogram demonstrates the fractions of FN objects relative to the whole FN number; the distribution of MC diameters in the testing set is shown in Figure 7b as a reference. There is a peak around the diameter value of 325 km, which does not coincide with any features of the distributions of MC diameters when the testing set is subset by any particular class of any taxonomy. However, since the total number of missed MCs is small, there is no obvious reason to make assumptions on the origin of this feature. The FN rates per cyclogenesis type (Figure 7c) demonstrate the only issue for the orography-induced MCs. This issue is caused by the small total number of MCs of that cyclogenesis type (only 27 in the testing set and only 134 in the training set), so the four that were missed constitute a substantial fraction of it. The same issue is demonstrated by the FN rates per cloud vortex type: since the total number of the "spiral cloud" type in the testing set is relatively small (59 of 1253), the five missed are a substantial fraction of it, as compared to 33 missed of 1006 for the "comma cloud" type.


Conclusions and Outlook
In this study, we present an adaptation of a DCNN method resulting in an algorithm that recognizes MC signatures in preselected patches of satellite imagery of cloudiness and spatially collocated WV imagery. The DCNN technique shows very high accuracy in this problem. The best accuracy score of 97% is reached using the third-order ensemble-averaging model (a six-model ensemble) and the combination of both IR and WV images as input. We assess the accuracy of MC recognition by comparing the identified MCs (true/false: the image does/does not contain an MC) with a reference dataset [5]. We demonstrate that deep convolutional networks are capable of effectively detecting the presence of polar mesocyclone signatures in satellite imagery patches of size 500 × 500 km. We also conclude that the quality of the satellite mosaics is sufficient for performing the task of binary classification regarding MC presence in 500 × 500 km patches, and for performing other similar pattern-recognition tasks, e.g., semantic segmentation of MCs.
Since the satellite-based studies of polar mesocyclone activity conducted in the SH (and in the NH as well) have never reported season-dependent variations of the IR imprints of MC cloud shapes [23,27,68,69], we assume the proposed methodology to be applicable to satellite imagery of polar MCs available for the whole satellite observation era in the SH. In the NH, the direct application of the models trained on the SH dataset is restricted due to the opposite sign of relative vorticity and, thus, a different cloud shape orientation. However, the proposed approach is still applicable, and the only requirement is a dataset of MC tracks from the NH.
It was also shown that the accuracy of MC detection by DCNNs is sensitive to single (IR only) or double (IR + WV) input data usage. The IR + WV combination provides a significant improvement of MC detection and allows a weak DCNN (CNN #2) to detect MCs with higher accuracy than the weak CNN #1 (96.3% vs. 89.3%, correspondingly). DCNN training and hyper-parameter optimization for deep neural networks are time- and resource-consuming. However, once trained, the computational cost of DCNN inference is low; furthermore, the trained DCNN performs much faster than a human expert. Another advantage of the proposed method is the low computational cost of data preprocessing, which allows the processing of satellite imagery in real time or the processing of large amounts of collected satellite data.
We plan to extend the usage of this set of DCNNs (Table 2) to the development of an MC tracking method based on machine learning and using satellite IR and WV mosaics. These efforts would be mainly focused on the optimal choice of the "cut-off" window to be applied to the satellite mosaic. In the case of a sliding-window approach (e.g., running the 500 × 500 km sliding window through the mosaics), the virtual testing dataset of the whole mosaic is highly unbalanced, so a model with a non-zero FPR evaluated on a balanced dataset would produce a much higher FPR. Thus, we expect the sliding-window approach not to be accurate enough for the problem of MC detection. In the future, instead of the sliding window, a U-Net-like [70,71] architecture should be considered with a binary semantic segmentation problem formulation. Since the models applied in this study (specifically, their convolutional cores) are capable of extracting the hidden representation relevant to MC signatures, they may be used as the encoder part of a U-Net-like encoder-decoder neural network for MC identification and tracking. Regarding MC tracking development, an approach proposed in a number of face recognition studies appears promising [72,73]. This approach can be applied in the manner of triplet-based training of the DCNN to estimate a measure of similarity between the signatures of one particular MC in consecutive satellite mosaics.

Appendix A. DCNN Best Practices and Additional Techniques
There is a set of best practices commonly used to construct DCNNs for solving classification problems [56]. Modern DCNNs are built on the basis of consecutive convolutional and subsampling layers performing nonlinear transformations of the initial data (see Figure 2 in [41]). The primary layer type of convolutional neural networks (CNNs) is the so-called convolutional layer, which is designed to extract visual pattern density maps using a discrete convolution operation with K kernels (K typically ranges from 3 to 1000), followed by a nonlinear transformation operation (activation function). An additional layer type is the pooling layer, which performs a subsampling operation with one of the following aggregation functions: maximum, minimum, mean, or others. In current practice, the maximum is used.
Since the LeNet DCNN [41], several studies [41-44] have demonstrated that the usage of consecutive convolutional and subsampling layers results in the skillful detection of various spatial patterns in the input 2D sample. The approach proposed in [41] implies using the output of this stacked set of layers as the input data for a classifier, which in general may be any method suitable for classification problems, such as linear models, logistic regression, etc. LeCun [41] suggested using a neural classifier, and this is now a conventional approach. The advantage of using a neural classifier is the ability to train the whole model at once (so-called end-to-end training).
The whole model built in this manner represents a classifier capable of directly predicting a target value for a sample. We term the set of fully-connected (FC) layers the "FC classifier", and the preceding part containing convolutional and pooling layers the "convolutional core" (see Figures 3 and 4).
When building a DCNN, it is important to account for the data dimensionality during its transformations from layer to layer. The input for a DCNN is an image represented by a matrix of size (h, w, d), where h and w correspond to the image height and width in pixels, and d is the number of levels, the so-called depth (e.g., d = 3 when the levels are the red, green, and blue channels of a color image). For the water vapor or radio-brightness temperature satellite data, d = 1. Convolutional and subsampling layers are described in detail in [41]. Convolutional layers are characterized by their kernel sizes (e.g., 3 × 3, 5 × 5), their kernel number K, and the nonlinear operation used (e.g., tanh in [41]). Subsampling layers are characterized by their receptive field sizes, e.g., 3 × 3, 5 × 5, etc. The output of a convolutional layer with K kernels is the set of so-called feature maps, which is a matrix of size (h, w, K); each individual kernel, followed by the element-wise nonlinear operation, produces one feature map of size (h, w, 1). The following subsampling layer reduces the matrix size depending on the subsampling layer kernel size; typically, this size is (2, 2) or (3, 3). Thus, the subsampling operation reduces the sample size by a factor of 2 or 3, respectively. The output of a convolutional core is a set of abstract feature maps represented by a 3D matrix. This matrix, reshaped into a vector, is passed as the input to the FC classifier (see Figures 3 and 4).
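The dimensionality bookkeeping described above can be traced with a small sketch: a shape-only stand-in for the convolution, a real 2 × 2 max-pooling, and the final reshape to a vector. All sizes here are illustrative, and `conv_same_shape`/`max_pool` are hypothetical helper names.

```python
import numpy as np

def conv_same_shape(x, K):
    """Stand-in for a convolutional layer: (h, w, d) -> (h, w, K) feature maps."""
    h, w, _ = x.shape
    return np.zeros((h, w, K))          # values irrelevant for this shape sketch

def max_pool(x, size=2):
    """2x2 subsampling: (h, w, K) -> (h // 2, w // 2, K) using the maximum."""
    h, w, K = x.shape
    return x[:h - h % size, :w - w % size].reshape(
        h // size, size, w // size, size, K).max(axis=(1, 3))

x = np.zeros((100, 100, 1))             # e.g., one IR channel, so d = 1
f = conv_same_shape(x, K=32)            # -> (100, 100, 32) feature maps
p = max_pool(f)                         # -> (50, 50, 32): size reduced by factor 2
v = p.reshape(-1)                       # "reshaped to vector" for the FC classifier
```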
The FC classifier of all models in this study includes from 2 to 4 hidden FC layers. FC layers are characterized by the number of their basic building blocks (so-called artificial neurons), which transform input data according to their rules (activation function) and parameters (so-called weights) [53,74]. The artificial neuron count of FC1, which is the layer following the convolutional core (see Figures 3 and 4), is chosen from the set {128, 256, 512, 1024}. The size of each following FC layer is half of the preceding one, but not less than 128. The output layer is fully-connected as well and contains one output node. For example, the structure of the FC classifier in terms of the node counts of its layers might be the following: {512; 256; 128; 1}. All FC layers are alternated with dropout layers in order to prevent overfitting of the model. The activation function of all trainable layers is the Rectified Linear Unit (ReLU): σ_ReLU(z) = max(0; z), (A1) except for the output layer, whose activation function is the sigmoid: σ_sigmoid(z) = 1/(1 + e^(−z)), (A2) where z is the weighted sum of the layer inputs computed with the layer's trainable parameters θ.
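A small sketch of the layer-size rule and the two activation functions (A1)-(A2) described above (illustrative only; `fc_layout` is a hypothetical helper name):

```python
import math

def relu(z):
    """Rectified Linear Unit (A1)."""
    return max(0.0, z)

def sigmoid(z):
    """Sigmoid activation of the output node (A2)."""
    return 1.0 / (1.0 + math.exp(-z))

def fc_layout(fc1_size, hidden_count):
    """Layer sizes: each following FC layer is half the preceding one, min 128."""
    sizes, s = [], fc1_size
    for _ in range(hidden_count):
        sizes.append(s)
        s = max(s // 2, 128)
    return sizes + [1]                  # single sigmoid output node

# e.g., fc_layout(512, 3) reproduces the {512; 256; 128; 1} example above
```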
In order to measure the error of the network on each individual sample during the training process, we use the binary cross-entropy as the loss function: L(θ) = −(1/N) Σ_{i=1}^{N} [y_i · log(ŷ_i) + (1 − y_i) · log(1 − ŷ_i)], (A3) where y_i is the expert-defined ground truth for the target value, ŷ_i is the estimated probability of the i-th sample being true, and N is the sample count of the training set or of a training mini-batch. This loss function is minimized in the space of the model weights using the method of backpropagation of error [66], denoted as "backprop training" in Figures 3 and 4. The outcome of the whole model is the probability of each class for the input sample. In the case of binary classification, the FC classifier has one output unit, producing the probability of MC presence for the input sample.
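The binary cross-entropy loss (A3) can be checked numerically on a toy mini-batch (illustrative sketch; `bce` is a hypothetical helper name, and the clipping guards against log(0)):

```python
import numpy as np

def bce(y, y_hat, eps=1e-12):
    """Binary cross-entropy (A3) averaged over a mini-batch of N samples."""
    y_hat = np.clip(y_hat, eps, 1 - eps)    # avoid log(0)
    return -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

y = np.array([1.0, 0.0, 1.0])               # ground-truth labels
y_hat = np.array([0.9, 0.1, 0.8])           # estimated probabilities
loss = bce(y, y_hat)                        # low loss: confident, correct predictions
```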
In addition to the basic approach proposed in [41], a number of techniques may be applied. Using them, one can construct and train DCNNs of various accuracies and various generalization abilities, the latter being characterized by the quality of a model estimated on never-seen test data.

Figure 1. The input for the deep convolutional neural networks (DCNNs). (a) Trajectories of all mesocyclones (MCs) in the Southern Ocean MesoCylones (SOMC) dataset; blue dots mark the point of generation of each MC. Snapshots of satellite mosaics for the SH for the (b) InfraRed (IR) and (c) Water Vapor (WV) channels at 00:00 UTC 02/06/2004. The red/blue squares indicate patches centered over the MCs (red squares) and those having no MC cloudiness signature (blue squares), cut from the mosaics for DCNN training.

Figure 2. Examples (IR only) of true and false samples for DCNN training, testing, and results assessment. 100 × 100 grid point (500 × 500 km) patches of IR mosaics for (a-d) true samples and (e-h) false samples.

Figure 3. Convolutional neural networks (CNN) #3 and CNN #5 structures. The dots along the "reshaped to vector" line denote elements of the convolutional core output reshaped to a vector, which is the fully-connected classifier input data.

Figure 4. CNN #4 and CNN #6 structures. The dots along the "reshaped to vector" and "concatenated features vector" lines denote elements of the convolutional core outputs reshaped to vectors, which, being concatenated into a combined features vector, form the fully-connected classifier input data.

Figure 5. Examples of falsely classified objects, for which (a,b) the IR satellite data are missing or corrupted, (c) the source satellite data are suspected to be corrupted, and (d) the source satellite data are realistic, but the classifier made a mistake.


Figure 6. Confusion matrices and receiver operating characteristic curves for (a,b) CNN #3 and (c,d) CNN #5, both with the ensemble-averaging approach applied (second-order models); and (e,f) the third-order model, the CNN #1-6 averaged ensemble.

Figure 7. False negatives (FN, i.e., missed MCs) in the testing set (never seen by the model) with respect to (a) lifecycle stages; (b) diameters; (c) cyclogenesis types; and (d) types of cloud vortex.
Atmosphere 2018, 9, x FOR PEER REVIEW

Table 1. Total number of true and false samples.

Table A4. False negative rates per MC stage.