CloudSatNet-1: FPGA-Based Hardware-Accelerated Quantized CNN for Satellite On-Board Cloud Coverage Classification

Abstract: CubeSats, nanosatellites and microsatellites with a wet mass of up to 60 kg, together with the decreasing cost of access to space, have amplified the rapid development of the Earth Observation industry. Acquired image data serve as an essential source of information in various disciplines such as environmental protection, geosciences, and the military. As the quantity of remote sensing data grows, the bandwidth resources for data transmission (downlink) become exhausted. Therefore, new techniques that reduce the downlink utilization of satellites must be investigated and developed. For that reason, we present CloudSatNet-1: an FPGA-based hardware-accelerated quantized convolutional neural network (CNN) for satellite on-board cloud coverage classification. We aim to explore the effects of the quantization process on the proposed CNN architecture. Additionally, the performance of cloud coverage classification across diverse biomes is investigated, and the hardware architecture design space is explored to identify the optimal FPGA resource utilization. The results of this study show that quantization of weights and activations has only a minor effect on model performance. Nevertheless, the memory footprint reduction allows the model to be deployed on the low-cost Xilinx Zynq-7020 FPGA. Using the RGB bands only, up to 90% accuracy was achieved, and when tiles with snow and ice were omitted, the performance increased to 94.4% accuracy with a low false-positive rate of 2.23% for the 4-bit-width model. With the maximum parallelization settings, the hardware accelerator achieved 15 FPS with 2.5 W of average power consumption (a 0.2 W increase over the idle state).


Introduction
Over the last decade, the Earth Observation (EO) industry has experienced a dramatic decrease in the cost of accessing space [1]. With the introduction of CubeSats, nanosatellites and microsatellites with a wet mass of up to 60 kg [2], the rapid development of remote sensing technologies was amplified [3]. As of 2021, more than 1500 CubeSats have been launched [4], and according to [5], the launch rate is expected to reach a thousand satellites per year by 2028. Naturally, as the number of satellites grows, satellite imagery becomes readily available. The harvested data play a significant role in various disciplines such as environmental protection, agriculture engineering, land or mineral resource exploration, geosciences, and military reconnaissance [6,7]. As the amount of acquired remote sensing data grows, the bandwidth resources for data transmission tend to be overloaded. Therefore, new techniques for efficient bandwidth resource management must be investigated and developed.
Several studies estimate that approximately 67% of the Earth's surface is covered with clouds [6,8,9]. Consequently, most remote sensing imagery (RSI) will be contaminated by them, which devalues the quality of the RSI and negatively affects postprocessing [6]. Cloudy conditions impair the ability of satellite sensors to obtain clear views […] bands exclusively was published by [30]. For model training, the Sentinel-2 dataset was used, and 76% accuracy was reported for the model deployed on an ARM-based platform. Another possibility is to use a Forwards Looking Imager instrument, which provides an analysis of the upcoming environment of the satellite. This approach was examined in [31], testing various lightweight CNNs deployed on the FPGA of the Zynq-7020 board. The authors reported a high accuracy of 98%; however, only 100 images were used for testing. Vieilleville et al. [32] investigated the deep neural network (DNN) distillation process in order to reduce the size of a DNN while accommodating efficiency in terms of both accuracy and inference cost. The authors were able to reduce the number of DNN parameters from several million to less than one million with a minimal drop in performance on the image segmentation task.
To sum up, lightweight CNNs provide competitive on-board cloud detection performance in comparison with state-of-the-art deep convolutional neural networks such as CDNetV1 [6], CDNetV2 [10], and CD-FM3SFs [33]. CDNetV1 is a neural network for cloud mask extraction from ZY-3 satellite thumbnails with an accuracy of 96.47% [6]. Its extended version, CDNetV2, focuses on adaptively fusing multi-scale feature maps and remedying high-level semantic information diluted at the decoder layers to improve cloud detection accuracy under cloud-snow coexistence. The authors confirmed the robustness of the proposed method through validation on several other datasets, such as Landsat-8 and GF-1. Lately, Li et al. [33] introduced a lightweight network for cloud detection fusing multiscale spectral and spatial features (CD-FM3SFs) using Sentinel-2A multispectral images. The best accuracy of 98.32% was achieved using a CPU as the computational unit.
To the best of our knowledge, the CloudScout cloud detection method proposed by Giuffrida et al. [34] and later extended by Rapuano et al. [20] is the work most closely related to this study. The method was developed within the ESA Phisat-1 mission, which exploits a hyperspectral camera to distinguish between clear and cloud-covered images. To reduce the bandwidth, the mission set a criterion that only images with less than 70% cloudiness are transmitted to the ground. CloudScout was trained using Sentinel-2 hyperspectral data and achieved 92% accuracy and 1% false positives with a power consumption of 1.8 W when deployed on the reconfigurable Myriad-2 VPU by Intel Movidius [34]. Nevertheless, the authors identified multiple drawbacks of the Myriad-2 design, which is not specifically suitable for the space environment (it is not based on a radiation-tolerant technology) [20]. Therefore, the authors extended their work and proposed an FPGA-based hardware accelerator for the CloudScout CNN. The authors compared the Myriad-2 VPU with two FPGA boards: the Zynq Ultrascale+ ZCU106 development board and the Xilinx Kintex Ultrascale XQRKU060 radiation-hardened board. Results obtained on the Zynq Ultrascale+ ZCU106 show that the FPGA-based solution reduced the inference time by 2.4 times (141.68 ms), but at the cost of 1.8 times greater power consumption (3.4 W) [20]. The inference time estimated for the Xilinx Kintex Ultrascale XQRKU060 board was 1.3 times faster (264.7 ms) in comparison with the Myriad-2 device; however, the power consumption was not reported.
Regarding the presented achievements of the related works and the trends in CubeSat development, we may expect a new era of smart nanosatellites equipped with reconfigurable, programmable hardware accelerators following an on-demand edge computing paradigm at the payload level [3,12,19,20,27–29,31,34]. A common aspect of the presented studies is the employment of multispectral or hyperspectral RSI for the cloud detection system. Generally, the band composition of multi/hyperspectral RSI differs between missions, yet all missions are equipped with an RGB camera. Therefore, a cloud detection system built on RGB bands only may provide better portability across missions, independent of their multi/hyperspectral bands. In addition, RGB cameras are several times cheaper and more convenient for short-term CubeSat missions. To the best of our knowledge, only three studies [20,21,31] have deployed and evaluated a CNN-based cloud detection method on an FPGA-based platform. Hence, in the scope of this study, we present CloudSatNet-1: an FPGA-based hardware-accelerated quantized CNN for satellite on-board cloud coverage classification. More specifically, we aim to:
• explore the effects of quantization introduced to the proposed CNN architecture for cloud coverage classification,
• investigate and optimize the performance of cloud coverage classification by biome diversity and its false-positive identifications,
• explore the hardware architecture design space to identify optimal FPGA resource utilization.
The rest of the paper is organized as follows. Section 2.1 describes the used dataset and its preprocessing. Methodology is described in Section 2.3. In Section 3, the results are summarized. The discussion can be found in Section 4 and the conclusions are drawn in Section 5.

Dataset
For the purpose of this study, the Landsat 8 Cloud Cover Assessment Validation data (L8 biome dataset) [35] was used. The L8 biome dataset offers a balanced cloud distribution and diverse sets of land and water cover, which makes it a suitable source of data for the proposed CNN-based classification model. The L8 biome dataset was acquired by the Landsat 8 Operational Land Imager (OLI) and Thermal Infrared Sensor (TIRS) [36]. Furthermore, data are orthorectified and corrected for terrain relief using Level-1T processing [37].
The L8 biome dataset consists of 96 scenes divided into 8 biomes. The scene size is 185 km by 180 km, and each scene contains 11 multispectral bands with a resolution of 30 m per pixel (except bands 8, 10, and 11, which are not used in this work). Manually annotated cloud coverage is stored as a cloud validation mask: an image whose pixel values encode the level (or class) of cloudiness, interpreted as listed in Table 1. An example scene image (natural color composition) from the L8 biome dataset can be found in Figure 1a, with its respective cloud mask in Figure 1b.
Figure 1. L8 Biome dataset image patch example (a) reconstructed from bands B4, B3, and B2 with its associated multi-class cloud mask (b) [35].
Two cloud mask classes (thin cloud and cloud) are categorized as cloud pixels. From these pixels, the Cloud Cover Assessment (CCA) is computed as the ratio of cloud pixels to all pixels, expressed as a percentage [35]. The average CCA value per scene is 48.35%. The distribution of the CCA values of the L8 biome dataset scenes is shown in Figure 2. Scenes are categorized by their area of capture into the following 8 biome classes of the International Geosphere-Biosphere Programme [38]: Barren (BA), Forest (FO), Grass/Crops (GC), Shrubland (SH), Snow/Ice (SI), Urban (UR), Water (WA), Wetlands (WE). The biomes are distinguishable from each other by their visual properties and exhibit various intensities of cloud-to-terrain contrast, which leads to different challenges for a cloud detection system working with RGB data. For example, biomes with sharp cloud-to-terrain contrast, like Grass/Crops, have a large value of the derivative at the transition between terrain and cloud. The Grass/Crops biome is therefore easy to classify, as cloud borders visibly stand out from the biome's terrain.
On the contrary, other biomes like Snow/Ice have terrain with cloud-like features, which may lead to a large number of false positives in classifier predictions, as their terrain blends with clouds. Examples of image patches for each biome of the L8 biome dataset are shown in Figure 3.
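As an illustration of how the binary cloud label could be derived from a cloud validation mask, the following sketch counts thin-cloud and cloud pixels. The concrete pixel values used for the mask classes are assumptions for the example, not taken from Table 1:

```python
import numpy as np

# Assumed mask encodings for the two cloud classes (illustrative values only;
# the real encodings are defined in Table 1 of the dataset description).
THIN_CLOUD, CLOUD = 192, 255

def cloud_cover_assessment(mask):
    """CCA: ratio of cloud pixels (thin cloud + cloud) to all pixels, in percent."""
    cloud_pixels = np.isin(mask, (THIN_CLOUD, CLOUD))
    return 100.0 * cloud_pixels.mean()

# a toy 2x2 mask: cloud, thin cloud, clear, cloud shadow
mask = np.array([[CLOUD, THIN_CLOUD], [128, 64]])
cca = cloud_cover_assessment(mask)  # 50.0
```

Only the two cloud classes contribute to the CCA; all other mask classes (clear, shadow, fill) count as non-cloud.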

Data Preprocessing
The image patch for each scene is a natural color composite of bands B4 (red), B3 (green), and B2 (blue). Values in patch images are re-scaled from the range 0-65,535 to 0-255 using MinMax normalization. Patch images in the L8 biome dataset are georeferenced. The orbit path of Landsat-8 does not run straight from south to north, and since scene acquisition follows the orbit path of the satellite, the image appears rotated or tilted, as in Figure 4a. Redundant georeferencing information can be neglected when detecting clouds in satellite images. Next, the black (no-data) parts of the image need to be removed, which consists of two steps. First, the image is rotated so that the actual image data are parallel to the whole scene image, as shown in Figure 4b. The rotation uses nearest-neighbor interpolation. Then, the image is cropped to a lower resolution (from approx. 8000 × 8000 to approx. 6400 × 6400) so that only image data are preserved, as illustrated in Figure 4c. The image patch (with dimensions approx. 6400 × 6400 × 3) is cropped into 512 × 512 × 3 tiles, according to the white lines in Figure 4c, omitting tiles at the edge that do not have full resolution. Each patch has a slightly different resolution after cropping, which results in a different number of generated tiles per patch (approx. 140). From 8 biomes, each containing 12 scenes, there are in total 13,525 tiles. The original CCA values for the scenes in Figure 2 do not apply to individual tiles. Generated tiles usually cover either cloudy or cloud-free areas, which yields fewer tiles with balanced cloud coverage (or CCA value) in the final dataset (a trade-off for creating many tiles from fewer image patches). Tiles with CCA ≥ 70% are categorized as cloud tiles and the rest as non-cloud tiles. Each of the 13,525 tiles has been assigned a corresponding binary cloud coverage label.
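The rescaling, tiling, and labeling steps above can be sketched as follows (a minimal NumPy sketch; the rotation and georeference handling are omitted, and the helper names are ours):

```python
import numpy as np

def rescale_to_uint8(patch):
    """MinMax-normalize a composite from the 0-65,535 range to 0-255."""
    patch = patch.astype(np.float32)
    lo, hi = patch.min(), patch.max()
    return ((patch - lo) / max(hi - lo, 1e-6) * 255.0).astype(np.uint8)

def tile_patch(image, tile_size=512):
    """Cut an H x W x 3 patch into non-overlapping tiles,
    omitting partial tiles at the right/bottom edges."""
    h, w = image.shape[:2]
    return [image[y:y + tile_size, x:x + tile_size]
            for y in range(0, h - tile_size + 1, tile_size)
            for x in range(0, w - tile_size + 1, tile_size)]

def binary_label(cca_percent, threshold=70.0):
    """1 = cloud tile (CCA >= 70%), 0 = non-cloud tile."""
    return int(cca_percent >= threshold)
```

Edge tiles without full 512 × 512 resolution are simply never produced by the loop bounds, matching the "omit partial tiles" rule described above.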
To preserve an even distribution of cloud coverage across the train, validation, and test datasets, the tiles from a single image patch are divided into 5 CCA quintiles: 0-20%, 20-40%, 40-60%, 60-80%, and 80-100%. The distribution of the tiles and their CCA values per biome for the full L8 biome dataset is visualized in Figure 5. The tiles from each patch CCA quintile are divided into train, validation, and test datasets in the ratio 2:1:7, with a coherent variation of the biomes and their CCA values, as visualized in Figure 6. In this study, the reliability of the results and the model portability are prominent; therefore, the test dataset is dominant in comparison with the training and validation datasets. Moreover, more than 2700 tiles are considered a satisfactory quantity for model training. Since the variation of the train, validation, and test datasets is coherent, no suppression or advantage of any biome or CCA quintile during model training is expected.
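The stratified 2:1:7 split by CCA quintile described above could be sketched as follows (a hypothetical helper using the standard library only; the actual split in the paper is per image patch):

```python
import random

def quintile(cca):
    """Map a CCA percentage to one of the five 20% bins (100% falls in the last)."""
    return min(int(cca // 20), 4)

def stratified_split(tiles_with_cca, ratios=(2, 1, 7), seed=0):
    """Divide (tile, CCA) pairs into train/validation/test in the ratio
    2:1:7 within each CCA quintile, so all three sets keep a coherent
    cloud-coverage distribution."""
    rng = random.Random(seed)
    buckets = {q: [] for q in range(5)}
    for tile, cca in tiles_with_cca:
        buckets[quintile(cca)].append(tile)
    total = sum(ratios)
    train, val, test = [], [], []
    for bucket in buckets.values():
        rng.shuffle(bucket)
        n_train = len(bucket) * ratios[0] // total
        n_val = len(bucket) * ratios[1] // total
        train += bucket[:n_train]
        val += bucket[n_train:n_train + n_val]
        test += bucket[n_train + n_val:]
    return train, val, test
```

Because the split is performed inside each quintile, the test set ends up roughly seven times larger than the validation set, as the 2:1:7 ratio dictates.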

Methodology
The procedure is divided into three stages. First, the baseline CNN model with floating-point parameters is trained. Then, the weights and activations of the model are quantized and the model is re-trained. The last step is the deployment of the model on the FPGA to achieve high throughput and low power consumption suitable for on-board data processing on a satellite. There are many techniques to reduce the model memory footprint for deployment on the edge, such as pruning or quantization. In this work, the focus is on quantization, which replaces floating-point operations and weight tensors with lower-bit-width equivalents. This is especially useful for FPGAs, where arbitrary-precision data types can be implemented.

Quantized CNN
This section introduces quantization of CNNs and its implementation for the purpose of this study.
Quantization is a neural network optimization technique that has proved highly successful in recent years [39]. Its main focus is reducing the memory footprint and computation time by performing compute operations and storing weight tensors at lower bit widths instead of floating point. This is especially useful for resource-constrained applications. There are two ways to introduce quantization to a neural network: training the network with quantized parameters, or quantizing the parameters after the model has been trained with floating-point precision. The former is called quantization-aware training (QAT); the latter is referred to as post-training quantization (PTQ). PTQ may disturb the model parameters and shift the point to which the model converged during floating-point training. For this reason, QAT is used for the experiments conducted in this study, and training is performed with quantized model parameters. For a more comprehensive review of the current state of quantization in neural networks, refer to the recent survey [39].
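The key mechanism that makes QAT trainable is letting gradients pass through the non-differentiable rounding step. A minimal sketch of this fake-quantization trick (a straight-through estimator in plain PyTorch; this is a generic illustration, not the Brevitas internals):

```python
import torch

class FakeQuant(torch.autograd.Function):
    """Uniform symmetric fake quantizer with a straight-through estimator:
    the forward pass rounds tensors onto a low-bit grid, while the backward
    pass lets gradients flow through the rounding unchanged."""

    @staticmethod
    def forward(ctx, x, bits):
        qmax = 2 ** (bits - 1) - 1
        scale = x.abs().max().clamp(min=1e-8) / qmax
        return torch.round(x / scale).clamp(-qmax - 1, qmax) * scale

    @staticmethod
    def backward(ctx, grad_output):
        return grad_output, None  # identity gradient for x, none for bits

x = torch.randn(32, requires_grad=True)
y = FakeQuant.apply(x, 4)       # at most 16 distinct values for 4 bits
y.sum().backward()              # x.grad is all ones despite the rounding
```

During QAT the network thus "sees" quantized values in the forward pass while the optimizer still receives usable gradients, which is why training can recover much of the accuracy that PTQ would lose.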
The network was implemented using the Brevitas framework. Brevitas is a PyTorch library for QAT of neural networks [40]. At the time of writing, the PyTorch library supports quantization as well, but only allows a reduction from 32-bit floating point to 8-bit integer [41]. Brevitas, in comparison, allows reducing the weight and activation bit widths to as low as 1 bit, which enables the creation of binary neural networks (BNNs) [42]. Another reason the Brevitas library is used is that a model trained with Brevitas can be exported to the FINN framework for dataflow architecture (DFA) acceleration on Xilinx FPGAs [23]. The FINN framework is a compiler for feed-forward DFAs for deep neural network (DNN) inference. When a DFA is used, every layer of the DNN is mapped to its own set of dedicated compute and memory resources [43], which mimics the topology of the DNN. In FINN, performance and resource usage can be controlled with a concept called folding. FINN uses matrix-vector threshold units (MVTUs) for convolutional and fully connected layers. Three parameters can be set: the matrix-multiple vector (MMV) length, the number of processing elements (PE), and the number of single instruction multiple data (SIMD) lanes. Using these parameters, it is possible to control the throughput of the network with respect to the resource utilization of the FPGA.

CloudSatNet-1 Architecture
In the following paragraphs, the proposed CNN architecture and the loss function used during training are described.
The proposed network architecture consists of 10 convolutional layers and 2 fully connected layers; their specific parameters are visualized in Figure 7. Each layer except the last uses the ReLU activation function and has no bias. The network starts with an initial convolutional layer that processes the 512 × 512 × 3 uint8 input and continues with 3 sequences of 3 layers each. The input size was chosen to allow a direct comparison with the CloudScout architecture [34]. The middle layer of each sequence has a lower number of filters, implementing a bottleneck for better generalization properties. After the initial layer and after each sequence, there is batch normalization and max pooling with a kernel size of 4, which leads to an effective reduction of the feature dimensions. The last fully connected layer outputs an unnormalized probability for each class, where the first class represents cloud presence below 70% CCA in the image and the second class signals the presence of clouds above this threshold.
The loss function used for training the model was a modified binary cross-entropy loss with an increased penalty for false-positive (FP) errors, shown in Equation (1). The penalty for FP errors is multiplied by a parameter α, inspired by the approach reported in [34], where the authors showed a decrease in the number of FP errors while keeping accuracy at an acceptable value when α was set to 2.

L(y, ŷ) = −[y · log(ŷ) + α · (1 − y) · log(1 − ŷ)]    (1)

where y is the ground-truth label, ŷ is the predicted output of the network, and α is a hyper-parameter that increases the penalty for FP errors.
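To make the layer layout and the loss concrete, here is a structural sketch in plain PyTorch. The channel widths are illustrative assumptions (the exact values come from Figure 7), and in the actual implementation the Conv2d/Linear layers would be replaced by their Brevitas quantized equivalents:

```python
import torch
import torch.nn as nn

def conv_seq(c_in, c_out, c_mid):
    # three convolutions with a lower-filter bottleneck in the middle,
    # closed by batch norm and 4x4 max pooling, as described in the text
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, 3, padding=1, bias=False), nn.ReLU(),
        nn.Conv2d(c_out, c_mid, 3, padding=1, bias=False), nn.ReLU(),
        nn.Conv2d(c_mid, c_out, 3, padding=1, bias=False), nn.ReLU(),
        nn.BatchNorm2d(c_out), nn.MaxPool2d(4),
    )

class CloudSatNetSketch(nn.Module):
    """10 convolutional + 2 fully connected layers; every layer except
    the last uses ReLU and no bias. Channel widths are made up for the sketch."""

    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, 3, padding=1, bias=False), nn.ReLU(),
            nn.BatchNorm2d(16), nn.MaxPool2d(4),   # 512 -> 128
            conv_seq(16, 32, 16),                  # 128 -> 32
            conv_seq(32, 64, 32),                  # 32  -> 8
            conv_seq(64, 64, 32),                  # 8   -> 2
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(64 * 2 * 2, 32, bias=False), nn.ReLU(),
            nn.Linear(32, 2),                      # unnormalized class scores
        )

    def forward(self, x):
        return self.classifier(self.features(x))

def fp_weighted_bce(logits, y, alpha=2.0):
    """Equation (1): binary cross entropy with the false-positive term
    scaled by alpha; y = 1 marks tiles with CCA >= 70%."""
    p = torch.softmax(logits, dim=1)[:, 1].clamp(1e-7, 1 - 1e-7)
    return -(y * torch.log(p) + alpha * (1 - y) * torch.log(1 - p)).mean()
```

The four 4 × 4 poolings reduce the 512 × 512 input to a 2 × 2 feature map before the fully connected head; with α > 1, a clear tile misclassified as cloudy is penalized more heavily than the opposite error.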

Quantization Process
First, the model is trained with floating-point precision as a baseline. After sufficient accuracy has been achieved, the weight and activation bit widths are progressively reduced and the change in accuracy is observed. To fit the model on the FPGA and achieve high throughput with acceptable accuracy and low power consumption, this paper focuses on hidden-layer bit widths of 4 or lower. In all experiments, the same bit widths are used for weights and activations. The first and last layers of a neural network can be more sensitive to quantization [44-46], so they were quantized to 8 bits. The last fully connected layer also has a quantized bias term. Preliminary experiments showed that it is important to adjust the weight initialization according to the selected weight bit widths.
The proposed architecture contains blocks of convolutional layers followed by batch normalization and ReLU. This sequence has an advantage for hardware implementation, which the FINN framework [23] utilizes, and the use of batch normalization leads to faster convergence [47]. After training, batch normalization becomes a fixed linear transformation during inference. Brevitas does not provide a quantized alternative to the PyTorch batch normalization layer, but the FINN framework supports native PyTorch batch normalization. Since threshold-based quantized activations (ReLU) are used, batch normalization is implemented via successive thresholding in the FINN framework through a process called streamlining [48]. This process shows how integer-only operations can be used for the forward pass of a quantized neural network layer with uniformly quantized weights and activations.
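The absorption of a trained batch-norm layer into activation thresholds can be checked numerically. A small sketch of the idea (our own toy parameters, assuming a positive batch-norm scale γ; the real streamlining transform is described in [48]):

```python
import numpy as np

# A trained batch norm is a fixed affine map y = gamma * (x - mu) / sigma + beta.
gamma, beta, mu, sigma = 1.5, -0.3, 0.2, 0.8
thresholds = np.array([0.0, 1.0, 2.0])  # steps of a threshold-based quantized ReLU

def activation_after_bn(x):
    # apply batch norm, then count how many thresholds each value exceeds
    y = gamma * (x - mu) / sigma + beta
    return (y[:, None] > thresholds).sum(axis=1)

def activation_streamlined(x):
    # fold the affine map into the thresholds instead (valid for gamma > 0)
    t = (thresholds - beta) * sigma / gamma + mu
    return (x[:, None] > t).sum(axis=1)

x = np.linspace(-2.0, 3.0, 101)
assert (activation_after_bn(x) == activation_streamlined(x)).all()
```

Because the comparison against shifted thresholds gives identical integer outputs, the batch-norm arithmetic disappears from the hardware datapath entirely; only the adjusted threshold table remains.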

Selected Hardware
The trained neural network model is deployed on the Z-turn development board equipped with the Xilinx Zynq Z7020 SoC. Thanks to its FPGA, the Zynq is able to provide a platform for computationally intensive processing while meeting the power consumption requirements of the developed CNN. The target frequency is set to 100 MHz. For the power consumption measurements, a J7-t USB safety meter was used.
Xilinx Zynq is an all-programmable System-on-Chip (SoC) that couples a dual-core ARM Cortex-A9 processor with an FPGA based on the Xilinx 7-series FPGA architecture in a single integrated circuit [49]. The ARM Cortex-A9 is connected to the programmable logic via industry-standard AXI interfaces, providing low latency and high bandwidth between the processor and the programmable logic. The FPGA programmable logic consists of 85,000 logic cells, 53,200 Look-Up Tables (LUTs), 106,400 Flip-Flops (FFs), and 4.9 Mb of block RAM (SoC data-sheet at [50]). In addition, it contains 220 digital signal processing (DSP48E1) slices for high-speed arithmetic, embedded into the fabric logic in proximity to the Block RAM components. The processor is capable of running a Linux operating system with the PYNQ library [51], which enables the use of the Python programming language for programming both the processor and the hardware libraries called overlays. Power consumption in the idle state with booted Linux Ubuntu 18.04 was measured to be 2.32 W.

Proposed Workflow
The pipeline used to create the hardware-accelerated CNN consists of the following steps. First, the baseline floating-point model is trained and evaluated to observe standard metrics such as accuracy, recall, precision, and F1 score. Next, QAT is used to train the quantized model, which is evaluated in the same way as the baseline model. In addition, a smaller verification dataset of 380 tiles with the same distribution of tiles across the respective cloud cover ranges is created. Per-tile evaluation is performed on this dataset, and the resulting logits from the last layer are saved for the verification of the model deployed on the FPGA. The quantized model is exported to the ONNX format [52] and transformed to high-level synthesis (HLS) code using the FINN framework. The model is then synthesized using Vivado Design Tools from Xilinx, and the resulting bit file is deployed to the FPGA. An evaluation of the deployed model focused on the hardware accelerator attributes is performed. Per-tile evaluation on the verification dataset is performed, and the resulting logits are statistically compared using a t-test to measure the model distortion caused by deploying the model on the edge. The workflow is summarized in the scheme displayed in Figure 8.
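The statistical comparison of per-tile logits can be sketched as a paired t-test over the 380 verification tiles (standard library only; in practice scipy.stats.ttest_rel computes the same statistic plus a p-value; the toy data below are made up):

```python
import math
import statistics

def paired_t_statistic(host_logits, fpga_logits):
    """Paired t-statistic over per-tile logit differences; values near zero
    indicate no systematic distortion from the FPGA deployment."""
    diffs = [a - b for a, b in zip(host_logits, fpga_logits)]
    n = len(diffs)
    sd = statistics.stdev(diffs)
    return statistics.fmean(diffs) / (sd / math.sqrt(n))

# toy example: the "FPGA" logits differ from the host ones only by
# small, zero-mean perturbations, so the statistic stays near zero
host = [0.01 * i for i in range(380)]
fpga = [v + (0.001 if i % 2 else -0.001) for i, v in enumerate(host)]
t_stat = paired_t_statistic(host, fpga)
```

A paired test is the right choice here because both logit sets come from the same tiles, so each tile serves as its own control.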

Experimental Setup
The main goal is to experiment with the end-to-end development of an FPGA-based hardware-accelerated quantized CNN for on-board cloud cover classification. Therefore, the experiments are divided into three stages: (1) training of the classification model with a focus on observing the impact of quantization on model accuracy; (2) observation of the accuracy of the resulting model on different biomes and removal of the outliers from the dataset; (3) exploration of the hardware architecture design space to identify the configuration with the highest throughput, a pre-defined target throughput, and minimal FPGA resource utilization.
For the model training, the aim is to achieve the highest accuracy and minimize the false-positive rate (FPR) on the test dataset for different bit widths of model weights and activations. Bayesian optimization (summarized in [53]) is used to search the hyper-parameters defined in Table 2. The number of epochs is set to 40, with early stopping when the accuracy on the validation dataset starts to diverge. Overall, 32 runs were conducted for each of 4 configurations with hidden-layer weight and activation bit widths set to 32, 4, 3, and 2. In both training scenarios, model performance is evaluated on the test dataset using accuracy, precision, recall, and F1 metrics, with the addition of FPR. In the second stage, the accuracy achieved on particular biomes is analyzed to identify potential shortcomings in the model performance, and a new set of experiments is conducted based on the achieved results. In the last stage, the hardware architecture design space is explored using the FINN framework. The degree of parallelism in FINN can be defined as P = MMV × PE × SIMD [54]. At the time of writing, FINN only supports MMV set to 1, so only PE and SIMD are used to increase parallelization in the experiments. The layer with the largest number of cycles limits the overall throughput of the network [54]. The estimated number of clock cycles per layer for the proposed architecture is shown in Table 3 in two configurations: one with default folding (no parallelization) and the lowest performance, and one with the maximum folding achievable for the proposed architecture. The first layer is the biggest bottleneck in the network, so the DSP slices were assigned to it, as it requires more resources to compute results with 8-bit inputs (uint8) and 8-bit weights.
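The folding trade-off can be illustrated with a rough cycle model of a convolutional layer lowered to a matrix-vector product (a simplification; the real FINN cycle estimates account for more detail, and the channel counts below are illustrative):

```python
def mvtu_cycles(out_pixels, kernel, in_ch, out_ch, pe=1, simd=1):
    """Approximate cycles for one convolutional layer: each output pixel
    needs (kernel^2 * in_ch / SIMD) * (out_ch / PE) cycles, so raising
    the folding parameters PE and SIMD divides the cycle count."""
    return out_pixels * (kernel * kernel * in_ch // simd) * (out_ch // pe)

CLOCK_HZ = 100_000_000  # the 100 MHz target frequency

# first layer of the network, with an illustrative 16 output channels
base = mvtu_cycles(512 * 512, 3, 3, 16)                    # no parallelization
folded = mvtu_cycles(512 * 512, 3, 3, 16, pe=16, simd=27)  # maximum folding

# throughput is limited by the slowest layer in the dataflow pipeline:
# fps ~= CLOCK_HZ / max(cycles over all layers)
```

This also shows why the first layer is the natural bottleneck: its output-pixel count (512 × 512) is by far the largest in the network, so even fully folded it dominates the per-frame cycle budget.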

Results
The results of cloud coverage classification using the full L8 biome dataset to train and evaluate the proposed CloudSatNet-1 CNN are shown in Table 4. In the upper part of the table, the most accurate models for each analyzed bit width (weights and activations), selected by ACC, are presented. The top models for the 32, 4, and 3-bit widths provide similar classification performance (ACC ≈ 88-90%, FPR ≈ 7-10%). However, the best-performing 2-bit-width model lags behind with ACC = 83.41% and FPR = 17.59%. In the bottom part of Table 4, the top models for each analyzed bit width selected by FPR are shown (models are selected from the top 10 models sorted by ACC). Only a marginal change in classification performance can be observed (1-3%), except for the 32-bit-width model, whose FPR was reduced to 2.25% at the expense of approx. 3% of ACC. For more insight, the dependence of model ACC on FPR (with the FPR value inverted for better readability) can be seen in Figure 9. Optimal solutions, representing a trade-off between ACC and FPR, are highlighted by Pareto fronts. The results of cloud coverage classification for the best-performing 4-bit-width models (selected due to the best accuracy/FPR ratio among the quantized models) per biome using the full L8 biome dataset are shown in Table 5. Models are selected by the highest ACC. The model performed best on the Grass/Crops biome (ACC = 95.91% and FPR = 0.83%). However, the best FPR = 0.49% was achieved on the Forest biome, though with a low ACC = 84.01%. The worst performance (ACC = 69.24% and FPR = 31.11%) was achieved on the Snow/Ice biome. Based on the results of the cloud coverage classification per biome, the hypothesis is made that excluding the Snow/Ice biome from model training (cloud coverage classification on the Snow/Ice biome using natural color composites is irrelevant) will improve the overall model performance (especially FPR). For a better illustration of the problem, examples of FP tiles are presented in Figure 10.
The results of cloud coverage classification using the L8 biome dataset without the Snow/Ice biome to train, validate, and test the proposed CNN are shown in Table 6. In the upper part of the table, the best-performing models selected by ACC are presented. As can be noticed, in comparison with the previous models trained on the full L8 biome dataset, the classification performance improved (ACC ≈ 92-95%, FPR ≈ 2.9-5.7%). In the bottom part of Table 6, the top models selected by FPR are shown (models are selected from the top 10 models sorted by ACC). In the case of the 32 and 2-bit-width models, there is no change in performance. However, the FPR for the 4 and 3-bit-width models is lower, and the 4-bit model outperforms the 32-bit-width one. For a better illustration, the dependence of model ACC on FPR can be seen in Figure 11, where the optimal solutions are highlighted by Pareto fronts.
Finally, the results of the hardware architecture design space exploration are summarized. Table 7 provides an overview of the resource utilization measurements of the quantized models using different bit widths. The maximum and base folding setups were compared together with a folding setup targeting 10 FPS. Even though the FPS changes from 0.9 to 15.5, the average power consumption is stable at around 2.5 W. The parallelization settings and their respective estimated numbers of clock cycles when targeting 10 FPS are reported in Table 8. The results of cloud coverage classification for the best-performing quantized models on FPGA can be seen in Table 9. Classification ACC and FPGA resource utilization are reported for quantized models trained using the full L8 biome dataset and the dataset excluding the Snow/Ice biome. The best-performing model is the quantized 4-bit-width model with the Snow/Ice biome excluded from training and evaluation (ACC = 94.84%).

Quantized Model for Cloud Detection
Based on the results of the best-performing models reported in the upper part of Table 4, increasing the quantization level resulted in a slight overall performance deterioration. Even so, the quantized models achieved results comparable to the 32-bit baseline model (except for the 2-bit model). The decrease in overall accuracy for the 4-bit and 3-bit models is only around 2%, while for the 2-bit model it is more than 6%. However, rather than the highest overall accuracy, this study emphasizes a low FPR (it is more convenient to process a redundant image than to discard a relevant one). Therefore, a balance between ACC and FPR is in demand. For the baseline model and the 3-bit model, the FPR is identically equal to 9.93%. In the case of the 4-bit model, an FPR decrease of almost 3% can be noticed; however, the recall is lower by 10% in comparison with the 32-bit model. The 2-bit model suffers the most from the quantization effect, resulting in a clearly insufficient FPR = 17.59%. More balanced (ACC vs. FPR) results are provided in the bottom part of Table 4, where the best models by FPR from the top 10 models sorted by ACC are reported. The performance of the quantized models remains at almost the same levels, yet the baseline model significantly reduced its FPR to 2.25% while decreasing its accuracy by around 3%. A more readable comparison of the models' performance can be seen in Figure 9. The trend of the trade-off between ACC and FPR across all quantized models together with the baseline is highlighted by Pareto fronts. It can be observed that the baseline model outperforms the quantized ones; however, adequate alternatives to the 32-bit model can be found.
To collect more insight and to improve the overall performance of the proposed cloud detection system, each biome of the L8 biome dataset was investigated separately. We hypothesize that some biomes produce significant noise during the training process due to false cloud-like features (snow, ice, or fog). The 4-bit models were selected to investigate the biomes in the quantized case, and their results are reported in Table 5. The best performance was obtained by the model trained on the Grass/Crops biome, with ACC = 95.91% and a low FPR = 0.83%. Yet, the best FPR = 0.49% and a precision of more than 99% were achieved on the Forest biome. However, this model lags in accuracy due to a low recall = 68.36%, which would result in a high number of undetected cloudy images. This may be caused by the merging of the cloud categories (thin, thick) or by fog, which is a usual false cloud-like feature in the Forest biome [34]. Similarly, the Wetlands biome (also often affected by fog) resulted in a low FPR = 0.94% and high precision = 98.51%, but with a low recall = 68.33%. The Shrubland, Urban, and Water biomes achieved comparable performance, with ACC from 91.73% to 93.89% and FPR from 1.89% to 3.92%. The Barren biome obtained the second-worst performance in terms of FPR = 10.14%. The reason for the high FPR may lie in the nature of the Barren biome, which exaggerates thin-cloud features into thick clouds. The worst performance was reported for the Snow/Ice biome. A low precision of 50.47% and a high FPR = 31.11% make its decisions almost random. Since only the RGB channels were considered, the reason for the misclassification is the inability to distinguish between cloud, ice, and snow. To classify clouds above snow and ice, additional spectral bands capable of resolving altitude would be necessary [6,10,34].
Regarding the previously mentioned results, all biomes, to a certain degree, suffer from the cloud-like features problem. An example is given in Figure 10, where six misclassified cases are presented. The first example of the Snow/Ice biome (A) has CCA = 0%, yet the snow in the image was misclassified as a cloud. The second example of the Snow/Ice biome (B), with CCA = 42%, merged clouds with turbid snow currents. Next, the smooth hilly terrain of the Barren biome (C) stretches the features of thinly dispersed clouds. This resulted in a false-positive image, although the CCA is only 10% in reality. Similarly, the Water biome example (D) with CCA = 1% was misclassified due to the wavy, serpentine features of the shallow water. The last two examples (E, F) in Figure 10 represent cases near the threshold (CCA = 70%). Here, a small number of cloud pixels may flip the CCA over the threshold boundary. In addition, the precise value of the CCA for each tile may differ slightly from the CCA label [35,37].
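The near-threshold behavior can be made concrete: a tile is labeled cloudy when its CCA exceeds the 70% threshold, so on a small tile a handful of pixels can flip the label. A sketch using a hypothetical 10 × 10 binary cloud mask (the tile dimensions are illustrative only):

```python
import numpy as np

THRESHOLD = 70.0  # % CCA above which a tile is labeled "cloudy"

def cca_percent(cloud_mask):
    """Cloud coverage area as a percentage of tile pixels (mask is 0/1)."""
    return 100.0 * cloud_mask.mean()

def is_cloudy(cloud_mask, threshold=THRESHOLD):
    return cca_percent(cloud_mask) > threshold

# a 10 x 10 tile with exactly 70 cloud pixels sits on the boundary;
# flipping a single additional pixel flips the label
tile = np.zeros((10, 10))
tile.flat[:70] = 1
print(is_cloudy(tile))  # False: CCA == 70% is not above the threshold
tile.flat[70] = 1
print(is_cloudy(tile))  # True: 71% > 70%
```

This also explains why labels for tiles such as (E) and (F) are sensitive to small errors in the reference CCA.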
Following the reported results, the Snow/Ice biome is not suitable for cloud detection using the proposed method. Moreover, the problematic coexistence of snow, ice, and clouds in cloud detection systems has also been identified by [6,10,34]. Therefore, we decided to withdraw the Snow/Ice biome from the training, validation, and test datasets, and to perform the experiment without this noisy data. In a real use case, the cloud detection system can permanently omit known areas covered by snow or ice from the analysis. Based on the results reported in Table 6, the assumed improvement of all metrics can be observed. The best-performing baseline model achieved ACC = 94.92% with FPR = 2.81%. The top quantized models obtained comparable accuracy from 94.84% to 92.02%, and FPR from 2.23% to 5.72%. We would like to stress that the 4-bit quantized model performed slightly better in terms of precision (96.82%) and FPR (2.23%) in comparison to the 32-bit model. This makes it a proper quantized substitute for deployment on the FPGA. The results of this analysis confirm our hypothesis that the Snow/Ice biome is naturally prone to false positives when using the RGB channels only.
In Figure 11, accuracy vs. FPR is visualized for the models trained with the Snow/Ice biome excluded. From the elevated position of all models within this figure, it is evident that accuracy increased across the board in comparison to Figure 9. The curves of the Pareto fronts lie closer together and closer to the baseline front, as quantization takes a lower toll on model performance in the absence of visually ambiguous data.
Based on these results, the following observations can be emphasized. Increased quantization did not cause a substantial drop in the evaluation metric scores once the Snow/Ice biome was excluded. The 4-bit model matched or exceeded the baseline's scores in accuracy and FPR. This implies that the representational capacity of the quantized models matches that of the 32-bit baseline in classification problems that do not require high resolution for discerning discriminative features. This statement is in line with the results achieved in other works dealing with quantization [46,55,56].
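The effect of reducing the bit width can be illustrated with a simple symmetric uniform quantizer (a generic sketch of the principle, not the exact Brevitas/FINN scheme used in this work):

```python
import numpy as np

def quantize_symmetric(w, bits):
    """Fake-quantize tensor w to signed `bits`-bit levels (round-to-nearest).

    Values are mapped onto at most 2^(bits-1) - 1 positive and negative
    steps around zero, then scaled back to the original range.
    """
    qmax = 2 ** (bits - 1) - 1          # e.g. 7 for 4-bit, 1 for 2-bit
    scale = np.abs(w).max() / qmax
    q = np.clip(np.round(w / scale), -qmax, qmax)
    return q * scale                    # dequantized values on a coarse grid

w = np.array([-0.9, -0.2, 0.05, 0.4, 0.9])
print(quantize_symmetric(w, 4))  # 15 levels: small rounding error
print(quantize_symmetric(w, 2))  # 3 levels only: coarse, large error
```

The 4-bit grid is fine enough to preserve the discriminative weight structure, while the 2-bit grid collapses most small weights to zero, which is consistent with the performance gap observed between the 4-bit and 2-bit models.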
The most relevant study, CloudScout [20,34], used hyperspectral bands for model training, resulting in a 16-bit model with ACC = 92% and FPR = 1%. Our proposed 4-bit model outperformed this result with an accuracy higher by up to 3%, although with an FPR higher by 1.23%. Considering that our model used the RGB bands only (without the Snow/Ice biome), the presented CloudSatNet-1 method brings promising improvements to on-board cloud detection systems. Another relevant study [21] used a larger training dataset and achieved an ACC of 91%. Nevertheless, when the authors deployed the model on an FPGA, a significant drop in accuracy to 64% occurred. The method introduced in this paper does not encounter a similar issue. The comparison of these methods is summarized in Table 10.

FPGA-Based Hardware Accelerator
The quantized models were deployed in three folding configurations for each bit width setting. Throughput, power consumption, and FPGA resource utilization were measured. Models with maximum folding achieved 15 FPS with an input batch size of 1 and almost 20 FPS with a batch size of 120, which is the maximum batch size that can be loaded into RAM. An increase of the FPS with a higher batch size was expected, and was also confirmed by [3]. Power consumption measured with a USB power meter showed an increase of approximately 0.2 W during inference compared to the idle state. In comparison with related studies, the authors of CloudScout [20,34] reported a throughput of 2.89 FPS and 1.8 W of power consumption using a Myriad VPU with a 512 × 512 × 3 input size, 7 FPS and 3.4 W of power consumption using a Zynq UltraScale+ ZCU106, and 3.77 FPS using an XQRKU060 solution (estimation only). Next, in the study by Reiter et al. [21], the authors reported 358.1 FPS with a much smaller input size of 32 × 32 × 3 and a maximum power consumption of 2.4 W. Regarding these results, the throughput and power consumption of the hardware accelerator achieved in this study are comparable with the current state-of-the-art solutions.
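From the reported figures, one can also derive a rough energy cost per classified tile (a back-of-the-envelope calculation using the measured 15 FPS, 2.5 W average power, and the ≈0.2 W inference overhead over idle):

```python
fps = 15.0       # frames per second at maximum folding, batch size 1
p_total = 2.5    # W, average power during inference
p_idle = 2.3     # W, idle power (inference adds ~0.2 W on top)

energy_per_frame = p_total / fps            # J per classified tile (total board)
marginal_energy = (p_total - p_idle) / fps  # J attributable to inference alone
print(f"{energy_per_frame * 1000:.0f} mJ total, "
      f"{marginal_energy * 1000:.1f} mJ marginal per frame")
```

Roughly 167 mJ of total board energy, of which only about 13 mJ is the marginal cost of the accelerator itself, is well within a typical CubeSat power budget.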
Based on the estimated number of cycles per layer reported in Table 3, it is visible that a bottleneck in the first layer limited the optimal throughput; a change in the network architecture would be required to allow a higher throughput target. It was demonstrated that the network throughput can be controlled to target a specific FPS desired by the needs of the mission. A set of experiments was conducted to target a specific throughput of 10 FPS. The parallelization settings used are reported in Table 8. This approach may be useful when the instrument on the CubeSat does not have a high throughput, e.g., the camera generates data at a lower FPS. It demonstrated the flexibility of the throughput control of the FPGA-based hardware accelerator created by the FINN framework. The differences between the bit widths lie in the FPGA resource utilization, where the 2-bit model in the base folding configuration utilized the lowest number of resources (LUT = 46.27%, FF = 31.41%, BRAM = 29.29%, DSP = 0.45%). This is achieved due to the absence of parallelization and the low memory footprint of the 2-bit weights and activations. It shows the potential to reduce the bit width of weights and activations even further to 1 bit and to experiment with binarized neural networks (BNNs) in the future, enabling a higher throughput and a deeper network on the same FPGA. As presented in Table 7, DSP slices for the first layer were selected to be utilized by Vivado only for the SPEC and max folding configurations in all bit width settings. The memory footprint (BRAM utilization) varies from 1.43 Mb to 3.06 Mb in ascending order of bit width.
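The per-layer cycle estimates in Table 3 follow the usual dataflow-accelerator rule of thumb: a layer's cycle count is roughly its number of MAC operations divided by its PE × SIMD parallelism, and the slowest layer bounds the achievable FPS. A sketch with hypothetical layer shapes and folding factors (not the actual values from Table 3):

```python
def layer_cycles(out_pixels, macs_per_pixel, pe, simd):
    """Approximate cycles for one streaming layer: total work / parallelism."""
    return out_pixels * macs_per_pixel // (pe * simd)

# hypothetical layers: (name, output pixels, MACs per output pixel, PE, SIMD)
layers = [
    ("conv1", 254 * 254, 3 * 3 * 3,  1, 3),
    ("conv2", 125 * 125, 3 * 3 * 32, 4, 8),
    ("fc",    1,         4096 * 128, 8, 8),
]
cycles = {n: layer_cycles(o, m, pe, simd) for n, o, m, pe, simd in layers}

# in a pipelined dataflow design, the slowest layer sets the frame rate
bottleneck = max(cycles, key=cycles.get)
clk_hz = 100e6  # assumed 100 MHz fabric clock
print(bottleneck, "limits FPS to about", clk_hz / cycles[bottleneck])
```

Raising PE or SIMD on the bottleneck layer lowers its cycle count and raises the FPS ceiling, which is exactly the knob used to hit the 10 FPS target in Table 8; conversely, the first layer stays the bottleneck here because its per-pixel parallelism is capped by the three input channels.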

Conclusions
Most remote sensing imagery is contaminated by clouds; hence, a quick and accurate method for removing cloud-covered images running on board the satellite has the potential to significantly save downlink bandwidth. In this study, we introduced CloudSatNet-1, an FPGA-based hardware-accelerated quantized CNN for satellite on-board cloud coverage classification. We can conclude that the quantization of weights and activations has minimal or no effect on the model accuracy. At the same time, the memory footprint reduction allows the model to be deployed and tested on the low-cost FPGA Xilinx Zynq-7020. Using the L8 biome dataset and its RGB bands only, up to 90% accuracy was achieved. Next, we omitted the Snow/Ice biome tiles from the dataset due to their high noise production. The accuracy then increased up to 94.4% with a low FPR = 2.23% for the 4-bit width model. With the maximum parallelization settings, the hardware accelerator achieved 15 FPS with 2.5 W of average power consumption (a 0.2 W increase over the idle state). Additionally, we demonstrated that the throughput can be controlled to target a specific FPS for the proposed classifier. Considering the reported results, the presented novel approach achieved an outcome comparable with the state of the art.
The presented solution has several limitations that we would like to stress. Firstly, the high number of false-positive tiles with terrain containing cloud-like features may in the future be compensated for by an analysis involving multi-spectral bands. Next, the cloud categories from the original L8 biome dataset were merged to form a binary problem. Therefore, this study did not evaluate the results on the original cloud categories of the L8 biome dataset, which might provide more insights into misclassifications. Furthermore, we did not cover the effects of radiation on the cloud detection system, and redundancy will be the subject of future work. This work is the beginning of a greater effort to provide AI-based solutions for space missions that can benefit from them; thus, it is a pilot study in nature. In the future, we aim to improve this solution to provide semantic segmentation of clouds with categorization into respective cloud classes, compensating for the binary decision provided in this study.