Evaluation of Deep Learning-Based Neural Network Methods for Cloud Detection and Segmentation

This paper presents a systematic approach for accurate short-term cloud coverage prediction based on machine learning (ML). Based on a newly built omnidirectional ground-based sky camera system, local training and evaluation data sets were created. These were used to train several state-of-the-art deep neural networks for object detection and segmentation. For this purpose, the camera generated a full hemispherical image every 30 min over two months in daylight conditions with a fish-eye lens. From this data set, a subset of images was selected for training and evaluation according to various criteria. Deep neural networks, based on the two-stage R-CNN architecture, were trained and compared with a U-Net segmentation approach implemented by CloudSegNet. All chosen deep networks were then evaluated and compared according to the local situation.


Introduction and Motivation
Electric power load forecasting has been an integral part of managing electrical energy markets and infrastructure for many decades. Consequently, experiences, regulations, and planning by utilities and independent system operators are the dominant considerations for research and commercial development in this field. The cost of generating power from non-traditional energy sources can be reduced through the integration of solar energy into classical energy supply structures. However, such an integration has its challenges and costs [1,2]. These are mainly caused by the unstable conditions of renewable energy sources such as the dynamic change of sky conditions. Clouds are considered one of the key elements causing fluctuation in solar energy availability [3]. Thus, cloud coverage determines direct and non-direct solar irradiance. Accurate, short-term forecasting of cloud cover is required for a variety of applications, particularly for power generation from photovoltaic solar power plants, as their power output is heavily dependent on sky cloud coverage. The generated power decreases by up to 30 % with a light cloud cover of the sun as compared to cloudless conditions. The yield could decrease by 75 % in the case of sunshine dimmed by dense clouds [4].
The choice of a solar radiation forecast method depends significantly on the periods, which may vary from a few days ahead (intraweek), to a few hours (intraday), or a few minutes (intrahour). Depending on the forecasting application, different time horizons are relevant. The forecasting of the distributed photovoltaic (PV) power generation, which is the focus of this study, requires both intrahour and day-ahead forecasting of solar irradiance [5]. The parameter which is of interest for this study depends on the technology used for power generation. For non-concentrating systems (such as most PV systems), global irradiation (GI) on the inclined surface is required above all.
For different time horizons, however, different approaches are required:
• For relatively long time horizons, of the order of 6 h or more, physics-based models are typically used [6,7].
• Two- to six-hour time horizons use a combination of methods based on observations or predictions of clouds through Numerical Weather Prediction Models (NWPM) and satellite images with information about the optical depth of the cloud and the motion vector of the cloud [6,8].
• For very short time horizons (<30 min), a range of ground-based imaging techniques were developed for GI using the information on cloud positioning and deterministic models [9,10].
The different solar forecasting techniques and their inputs are summarized in Table 1. Numerical weather prediction and up-to-date geostationary satellite-based forecast approaches are restricted in terms of their spatial and temporal resolution and are too imprecise for very short-term forecasts. Therefore, the use of a ground-based sky imager in forecasting is a promising approach, as it provides high temporal and spatial cloud cover resolution [12].

Table 1. Main characteristics and inputs for different solar forecasting approaches [11]. Columns: Approach, Sampling Rate, Spatial Resolution, Forecast Horizon, Application.
Short-term cloud coverage prediction involves two main stages. The first stage comprises the detection and segmentation of clouds in the available images. The results obtained in the first stage are of great importance, as the quality of the actual prediction (the second stage) depends on the most detailed possible representation of the clouds. This work presents a camera-based short-term cloud coverage prediction based on machine learning methods. The main contribution is the comparison and evaluation of deep neural network architectures for the instance segmentation of clouds.

Camera-based Cloud Coverage Prediction
Over the last two decades, many studies have proposed various statistical methods for image processing [13,14]. These include various parametric approaches such as Bayesian model averaging [15], or non-homogeneous regression [16], or combined methods such as quantile mapping [17,18].
In recent times, machine learning methods have become increasingly popular in image processing [19]. The work of Taillardat et al. uses quantile regression forests (QRF) to improve the accuracy of temperature and wind speed forecasts [20]. In [21], an approach based on neural networks to process ECMWF near-surface temperature predictions using QRF as a reference model is presented. Bakker et al. [22] propose several machine learning approaches for the post-processing of numerical weather prediction (NWP) forecasts of solar radiation based on quantile regression, including random forests, gradient boosting, and neural networks.
The detection of clouds in sky imager scenarios is also developing rapidly from classical approaches based on support vector machines and Bayes classifiers, as in [23], to systems employing deep learning techniques. After starting with simple neural structures for remote sensing images, as in [24], current systems are built upon segmentation-based approaches. These rely on encoder-decoder structures, first proposed in [25] and adapted recently for cloud coverage prediction in [26,27]. The importance and influence of image quality for object detection has been incorporated into deep learning approaches only recently, e.g., in [28].
In contrast to basic segmentation-level tasks, the prediction of coverage improves when considering individual cloud objects for tracking and prediction. For this application, instance segmentation methods are the algorithms of choice. Most prominent and, in fact, ubiquitous in computer vision tasks such as pedestrian recognition is the two-stage approach of Mask R-CNN [29], which allows instance segmentation and bounding box prediction for a given set of classes. A third class of deep learning architectures is the so-called transformer networks, originally invented in the context of speech and natural language recognition. Current research focusses on applying transformers to object detection [30] and segmentation tasks [31].

Hardware and Imaging Sky Camera

Sky Camera
The present study used a ground-based sky camera to monitor the sky. It is situated at Offenburg University, where it was built based on the optical systems described in [32][33][34]. It comprises a high-sensitivity CCD-based camera chip combined with a 180° fish-eye lens for full hemispherical imaging. The camera system is combined with additional sensors to measure the actual ground solar irradiance and temperature. The resulting measurement station is shown in Figure 1. Data acquisition was carried out based on a LabVIEW application that stores the captured sky images as an exposure series at a given time interval. The whole hardware setup for image capturing and data storage is described in [35]. The sky imager system was calibrated beforehand based on non-linear distortion models of spherical lenses [36,37].
With classic image processing steps, attempts were made to detect and segment clouds in these images in order to subsequently make a short-term prediction. It turned out that good detection and segmentation of the clouds is essential for later solar irradiance prediction. With the classical approach, based on a sky illumination prediction and adaptive thresholding as presented in [38], an accuracy of 76.7% could be achieved. In this subsequent work, the aim is to evaluate whether neural network-based approaches with deep learning are more suitable for detection and segmentation in terms of computational speed and accuracy.
The ground-based camera system continuously generates a full hemispherical image. Images are selected from this data stream. Present clouds are marked in the images using pixelwise annotation. The classical system is able to work without a sun disc to block solar rays by using HDR images and a solar position prediction. It is therefore not necessary to mark the sun or other objects to compare the neural network approach on equal terms. The labeled images are treated as a small database, separated into training and validation sets, only holding back a small sub-set for testing.

Neural Network-Based Instance Segmentation
Instance segmentation in computer vision has been dominated by deep neural networks since their advent, culminating in the publication of Mask R-CNN. In this work, we compare and evaluate two prominent neural network architectures, namely Mask R-CNN, which was adapted and trained for the given data set, and CloudSegNet, a current state-of-the-art segmentation network already trained on generic cloud data.

Mask R-CNN
Mask R-CNN, although published in a canonical form, allows for variation and adaptation, not only in hyperparameters, but also in more profound ways, such as feature generator architecture, loss functions, or mask sizes.

Framework
Our implementation is based on PyTorch and the Detectron2 archetypes as described in [39]. The structure is highly modular, allowing networks to be adapted and trained for detection and segmentation, the latter either as classic instance segmentation or in its panoptic variant.
In this contribution, we use transfer learning and fine-tuning of a pre-trained version. As the clouds do vary in scale and shape, we employed pyramid networks as a backbone to ensure scale invariance, and the data augmentation stack of PyTorch to substantially increase our image database and emulate variations in brightness and color. The following sections briefly explain the structure and adaptation of the chosen network architecture.

Base RCNN-FPN as Backbone
We employ Feature Pyramid Networks (FPN) [40], trained with a focal loss on the MS COCO data set. The FPN backbone is important for detecting clouds on several scales. The network is an object detector with a multi-task loss to allow for class prediction and bounding box estimation. The whole network is divided into three components. The first, the backbone network, is a basic convolutional neural network that extracts features on different scale levels. The feature maps of several layers are used to ensure scale invariance; the underlying ResNet architecture is reasonably fast to compute.
The classical two-stage approach reuses these features in the Region Proposal Network, which is the second main component of the architecture. The feature maps are used as input and the ROI-align method is used to interpolate regions as possible object proposals for the last main component of the network.
The third stage, the so-called Box Head, consists of fully connected layers that predict the object class and perform a bounding box regression with a multi-task loss that, in the case of the R-CNN-FPN base, builds on the proposed focal loss.
In the post-processing of the detector, non-maximum suppression ensures the efficient pruning of overlapping and wrong object detections. All in all, the RCNN-FPN network produces the typical output of an object detector, namely the most probable class and bounding box. Figure 2 shows a typical output of our system for the detection of different clouds.
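The pruning step described above can be illustrated with a minimal sketch of greedy non-maximum suppression. This is not the Detectron2 implementation; the function names `iou` and `nms`, the box format `(x1, y1, x2, y2)`, and the IoU threshold of 0.5 are assumptions chosen for illustration.

```python
import numpy as np

def iou(box, boxes):
    """IoU between one box and an array of boxes, all as (x1, y1, x2, y2)."""
    x1 = np.maximum(box[0], boxes[:, 0])
    y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2])
    y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area = (box[2] - box[0]) * (box[3] - box[1])
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area + areas - inter)

def nms(boxes, scores, iou_threshold=0.5):
    """Greedy NMS: keep the best-scoring box, drop boxes overlapping it too much."""
    order = np.argsort(scores)[::-1]  # indices sorted by descending score
    keep = []
    while order.size > 0:
        best = order[0]
        keep.append(int(best))
        rest = order[1:]
        if rest.size == 0:
            break
        overlaps = iou(boxes[best], boxes[rest])
        order = rest[overlaps <= iou_threshold]  # survivors for the next round
    return keep

# Hypothetical detections: two overlapping boxes and one separate box.
boxes = np.array([[10, 10, 110, 110],
                  [20, 20, 120, 120],
                  [200, 200, 260, 260]], dtype=float)
scores = np.array([0.9, 0.8, 0.7])
kept = nms(boxes, scores)
```

In this toy example the second box overlaps the first with an IoU of about 0.68 and is therefore suppressed, while the third box is kept.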

Mask Head
Mask R-CNN is the next step in augmenting the base network described above. An additional third head is added to the object detection-based network. This last head is called the mask head and estimates a binary mask based on two subsequent convolutional layers. Training can be performed in one seamless stage, adapting the weights and parameters of all the networks (region proposal, bounding box, class, and mask) simultaneously. This instance segmentation is shown in Figure 3, which slightly adapts the seminal picture in [29] for our case.

CloudSegNet
The second architecture this contribution evaluates is CloudSegNet. This is a classical encoder-decoder neural network. With its initial training set, CloudSegNet focuses on the segmentation of day and night images within a single framework and achieved state-of-the-art results [26]. The network architecture and the associated training data are also open-source [41].

CloudSegNet Architecture
CloudSegNet is a semantic segmentation network specifically designed to segment clouds from the background. Compared with large general-purpose image databases, cloud segmentation involves significantly less texture and structure and fewer classes, so a plain architecture was chosen. The CloudSegNet architecture has the classical encoder-decoder structure used before U-Net. It is therefore comparable to the fully convolutional nets described in [42]. This allows for few layers and thus few parameters to be trained. An overview of the architecture is shown in Figure 4, showing the encoder and decoder layers.

Encoder
The network's encoder block is built upon only three layers; the input size of the image is assumed to be 300 × 300 pixels, limiting the possible resolution. As described in its origins in [43,44], the lower convolution layers encode basic image features, e.g., lines. Later layers combine these into increasingly complex features and can detect clouds in larger receptive fields. The input is condensed into a representation of size 38 × 38 × 8.
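The reduction from 300 × 300 to 38 × 38 is consistent with three successive halvings of the spatial size with rounding up. The following sketch reproduces this arithmetic under that assumption; the helper `downsampled_size` is hypothetical, and the downsampling factor of 2 per layer is not stated in the text above.

```python
import math

def downsampled_size(size, num_layers, factor=2):
    """Spatial size after repeated downsampling, rounding up at each layer
    (assumed behaviour, e.g. pooling with 'same' padding)."""
    for _ in range(num_layers):
        size = math.ceil(size / factor)
    return size

# 300 -> 150 -> 75 -> 38 for three encoder layers
side = downsampled_size(300, num_layers=3)
encoded_shape = (side, side, 8)  # 8 feature channels, as stated in the text
```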

Decoder
The subsequent decoder upsamples the encoded representation using deconvolution operations. Three layers upsample the output back to the original image size, but with only one channel that holds the per-pixel class probability. This output is finally converted to a binary mask by a simple threshold.
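The final thresholding step can be sketched as follows. The threshold value of 0.5 and the helper names `probabilities_to_mask` and `cloud_fraction` are assumptions for illustration, not values taken from the CloudSegNet implementation.

```python
import numpy as np

def probabilities_to_mask(prob_map, threshold=0.5):
    """Convert a per-pixel cloud probability map into a binary mask."""
    return (prob_map >= threshold).astype(np.uint8)

def cloud_fraction(mask):
    """Fraction of pixels labeled as cloud in a binary mask."""
    return float(mask.mean())

# Hypothetical 2 x 2 probability map standing in for the decoder output.
prob_map = np.array([[0.9, 0.2],
                     [0.6, 0.4]])
mask = probabilities_to_mask(prob_map)
```

The mean of the binary mask gives a simple estimate of the cloud-covered share of the image, which is the quantity of interest for coverage prediction.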

Selection of Images
The given camera system has provided sky images for several months, taken at a frequency of one image every 10 min. Since its installation two years ago, a large amount of data has become available that needs to be pre-sorted for the given task. To obtain sensible comparisons, the images were screened, and several situations and weather scenarios were removed in advance. These include insects on the lens, too many raindrops upon the lens, dirt on the lens, a closed cloud cover, and heavy fog.
Examples of the removed images are shown in Figure 5. From the remaining data, 76 images were randomly selected for the training data set and 14 for the test set. The training was performed using k-fold cross-validation, with the aim of minimizing the necessary amount of training data. To achieve a greater variation of the displayed clouds, the time interval between selected recordings was set to at least one hour, and recordings were limited to between 8:00 a.m. and 5:00 p.m.
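A k-fold split of the 76 training images can be sketched as below. The number of folds is not stated in the text, so `k=5` is purely an assumption, as are the function name `k_fold_splits` and the fixed random seed.

```python
import random

def k_fold_splits(num_samples, k, seed=0):
    """Yield (train_indices, val_indices) pairs for k-fold cross-validation."""
    indices = list(range(num_samples))
    random.Random(seed).shuffle(indices)      # reproducible shuffle
    folds = [indices[i::k] for i in range(k)]  # k nearly equal-sized folds
    for i in range(k):
        val = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, val

# 76 training images as in the text; k = 5 is a hypothetical choice.
splits = list(k_fold_splits(76, k=5))
```

Each image serves as validation data in exactly one fold, so all labeled images contribute to both training and validation.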
The overall numbers and characteristics of the image database used for training are summarized in Table 2. If a later contribution uses the segmentation as input, the interval can be easily scaled up. An exemplary image sample is shown in Figure 6.
Figure 5. Examples of the removed images with many raindrops upon the lens, dirt on the lens, a closed cloud cover, and heavy fog.

Marking the Clouds
For instance segmentation, the time-consuming part is the pixel-wise labeling of the training data. Open-source tools were used and a representative segmentation was completed at the pixel level. Examples are shown in Figure 7. For the input, we chose images that were neither rectified nor preprocessed, to allow on the one hand for a comparison with CloudSegNet, and on the other hand for a test of the capability of cloud detection under severe optical distortions. Problems arose in the peripheral areas, where clouds can be labeled only with great difficulty, as shown in Figure 8.
The masks are binary in both cases, but Mask R-CNN also uses additional bounding box information generated from the positive areas.
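Deriving a bounding box from the positive area of a binary mask can be sketched as follows. The sketch assumes one mask per cloud instance and a half-open box format `(x1, y1, x2, y2)`; the helper name `mask_to_bbox` is hypothetical and not part of the labeling tools used.

```python
import numpy as np

def mask_to_bbox(mask):
    """Return the tight box (x1, y1, x2, y2) around all positive pixels,
    with exclusive upper bounds, or None for an empty mask."""
    rows = np.any(mask, axis=1)  # rows containing at least one cloud pixel
    cols = np.any(mask, axis=0)  # columns containing at least one cloud pixel
    if not rows.any():
        return None
    y_idx = np.where(rows)[0]
    x_idx = np.where(cols)[0]
    return (int(x_idx[0]), int(y_idx[0]), int(x_idx[-1]) + 1, int(y_idx[-1]) + 1)

# Hypothetical instance mask with a rectangular cloud region.
mask = np.zeros((6, 8), dtype=np.uint8)
mask[2:4, 3:7] = 1
bbox = mask_to_bbox(mask)
```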

Mask R-CNN
Given the training and test data, the hyperparameters and overall pipeline for Mask R-CNN had to be set up.

Training
For the training, the hyperparameters were adapted to our problem and data set. Using the Adam optimizer [45], the learning rate was scheduled, starting with α = 0.00025. Validation and training data were separated with k-fold cross-validation. Convergence of the training loss could be observed after roughly 10,000 epochs. No further improvement could be achieved by varying the hyperparameters.

Visualization and Qualitative Assessment
After completing the training as described above, the results for the test data set were visually inspected. Results of the network forward pass are shown in Figure 9. On the left side, the input image is shown; the right side depicts results with the object mask and its detection bounding box.
Two possible outcomes are shown in the figure. In the upper half, a successful detection and segmentation of the clouds can be seen. It should be noted that the network is somewhat robust concerning disturbances, as the sun was not falsely detected as a cloud. The lower half of the figure shows a very large cloud that was only detected partially. Another problem is that a large portion of the remaining cloud was not detected at all. Our best solution so far is to massively extend the training data set. The quantitative evaluation follows in subsequent sections.
Figure 9. Visualization of the trained Mask R-CNN network using the test set data (left without a mask, right with mask, successful segmentation above, insufficient segmentation below).

Evaluation
The evaluation was performed with the fine-tuned network on the test set data. As the training loss function is not very helpful in determining the overall quality, we chose the common recall or hit-rate value and the precision or accuracy to assess the quality of the segmentation. As we have a large number of negatives in the image, we calculated the F-score, defined as $F = 2 \cdot \frac{\text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}$, where the precision is the so-called positive predictive value, the quotient of all correctly identified objects (true positives) and all positively classified objects (true positives and false positives). The F-score combines this value with the recall, or sensitivity, which is the quotient of the true positives and the sum of true positives and false negatives (missed objects). We found that the F-score is a superior quality measure compared to the individual measures, clearly indicating the relevance of the results. In addition, we detailed the evaluation in further categories: the cloud segmentation was assessed for bounding box accuracy and pixel-wise segmentation, and separated by cloud size for detection. Clouds covering roughly a third of the input image are called large, those about half that size are medium, and the remaining ones are small. Total area means all results summed up. The detailed results are listed in Table 3.
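For the pixel-wise case, the measures defined above can be computed from a predicted and a ground-truth binary mask as in the following sketch. The function name `segmentation_scores` and the toy masks are assumptions for illustration; they do not reproduce the evaluation code or the numbers in Table 3.

```python
import numpy as np

def segmentation_scores(pred, truth):
    """Return (precision, recall, F-score) for two binary masks."""
    pred = np.asarray(pred, dtype=bool)
    truth = np.asarray(truth, dtype=bool)
    tp = np.sum(pred & truth)    # cloud pixels correctly detected
    fp = np.sum(pred & ~truth)   # sky pixels wrongly marked as cloud
    fn = np.sum(~pred & truth)   # cloud pixels that were missed
    precision = tp / (tp + fp) if tp + fp > 0 else 0.0
    recall = tp / (tp + fn) if tp + fn > 0 else 0.0
    if precision + recall > 0:
        f_score = 2 * precision * recall / (precision + recall)
    else:
        f_score = 0.0
    return float(precision), float(recall), float(f_score)

# Toy example: 3 true positives, 1 false positive, 1 false negative.
truth = np.array([[1, 1, 0, 0],
                  [1, 1, 0, 0]])
pred = np.array([[1, 0, 0, 0],
                 [1, 1, 1, 0]])
precision, recall, f_score = segmentation_scores(pred, truth)
```

Note that true negatives do not enter any of the three measures, which is why the F-score remains informative when most of the image is cloud-free sky.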

CloudSegNet
The CloudSegNet network was used as described in the publication. The network was also fine-tuned with our data set. The framework is based on TensorFlow with Keras; the official repository was used for the setup.

Preparation of the Data Sets and Training
The CloudSegNet network requires the image data in RGB format and the associated ground truth mask is stored as a binary image. We also used data augmentation with rotation, mirroring, and distorting to enlarge the training image data set.
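The key point of such an augmentation is that the image and its ground truth mask must undergo the same geometric transform. The sketch below illustrates this for rotation and mirroring only; the distortion step is omitted, and the restriction to 90° rotations as well as the names `augment_pair` and `all_variants` are assumptions, not the augmentation pipeline actually used.

```python
import numpy as np

def augment_pair(image, mask, quarter_turns=0, mirror=False):
    """Apply the same rotation (multiples of 90 degrees) and optional
    horizontal mirroring to an RGB image and its binary mask."""
    image = np.rot90(image, k=quarter_turns, axes=(0, 1))
    mask = np.rot90(mask, k=quarter_turns, axes=(0, 1))
    if mirror:
        image = np.flip(image, axis=1)
        mask = np.flip(mask, axis=1)
    return image.copy(), mask.copy()

def all_variants(image, mask):
    """Return the 8 rotated and mirrored variants of an image/mask pair."""
    return [augment_pair(image, mask, k, m)
            for k in range(4) for m in (False, True)]

# Toy pair: the image is bright exactly where the mask marks a cloud.
mask = np.zeros((4, 4), dtype=np.uint8)
mask[0, :2] = 1
image = np.stack([mask * 255] * 3, axis=-1)
variants = all_variants(image, mask)
```

Because the toy image is bright exactly where the mask is set, one can check that image and mask stay aligned in every variant.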

Visualization
The results of the trained CloudSegNet were visualized in the same way as for Mask R-CNN, except for the bounding boxes. Exemplary results are shown in Figure 10. The segmentation works well, even for the small database. The upper half shows a near-perfect segmentation; the lower half depicts a failure case in which a bright cloud is misdetected as the sun.

Evaluation
We used the same quality measures and images as for Mask R-CNN. The results with respect to accuracy and F-score are far superior to those of Mask R-CNN. Therefore, we also list the results concerning training progress and complexity. The network could already be used after 500 epochs of fine-tuning, and after 3500 epochs the results have converged. The actual numbers are shown in Table 4.

Figure 10. Visualization of the CloudSegNet network using the evaluation data set (left side: input image, right side: segmentation mask).

Conclusions
The evaluation of two different deep neural network approaches showed promising results, albeit with Mask R-CNN lacking in efficiency. As we also have access to a wholly classical machine learning-based approach from [38], a comparison between the two deep learning methods and the pre-neural network method is shown in Table 5. The semantic segmentation has the highest recall and precision, and therefore also the highest F-score. For cloud movement prediction and tracking, it could be used with an additional post-processing step, as is also needed for the classical approach. Interestingly, the most sophisticated model, Mask R-CNN, performs the worst. As this seems surprising, we attribute it to the lack of training data. CloudSegNet has far fewer parameters to train and is explicitly suited to binary classes, whereas Mask R-CNN performs best with large data sets and many classes. An advantage of Mask R-CNN is the bounding box prediction, which can serve as direct input for the subsequent tracking and prediction of individual clouds. The pixel-wise segmentation, in turn, can be used for coverage prediction. Both algorithms are reasonably fast in evaluation (not training) and outclass the classical approach, which first has to generate HDR images from a small image sequence.
In conclusion, we propose using CloudSegNet for cloud segmentation and detection, but we will try to improve Mask R-CNN with additional data augmentation techniques that increase the amount of training data.
Another important task to look at is the viability for several different classes of clouds, as there could be cirrostratus and misty layers in contrast to the rather well-defined cumulus, cumulonimbus, or altostratus clouds. This will be tackled with advanced matting techniques and deep learning, as presented in [46].

Funding: This research is supported by the Bulgarian National Science Fund in the scope of the project "Exploration the application of statistics and machine learning in electronics" under contract number КΠ-06-Н42/1.