A Scalable Computing Resources System for Remote Sensing Big Data Processing Using GeoPySpark Based on Spark on K8s

: As a result of Earth observation (EO) entering the era of big data, a signiﬁcant challenge relating to by the storage, analysis, and visualization of a massive amount of remote sensing (RS) data must be addressed. In this paper, we proposed a novel scalable computing resources system to achieve high-speed processing of RS big data in a parallel distributed architecture. To reduce data movement among computing nodes, the Hadoop Distributed File System (HDFS) is established on nodes of K8s, which are also used for computing. In the process of RS data analysis, we innovatively use the tile-oriented programming model instead of the traditional strip-oriented or pixel-oriented approach to better implement parallel computing in a Spark on Kubernetes (K8s) cluster. A large RS raster layer can be abstracted as a user-deﬁned tile format of any size, so that a whole computing task can be divided into multiple distributed parallel tasks. The computing resources applied by users would be immediately assigned in the Spark on K8s cluster by simply conﬁguring and initializing SparkContext through a web-based Jupyter notebook console. Users can easily query, write, or visualize data in any box size from the catalog module in GeoPySpark. In summary, the system proposed in this study can provide a distributed scalable resources system for assembling big data storage, parallel computing, and real-time visualization.


Introduction
As a result of the development of Earth observation (EO) and sensor technologies, humans' ability to undertake comprehensive observation of the Earth has entered an unprecedented period, and Earth system sciences have entered the era of big data [1,2]. The increasing availability of sensor technology has drastically promoted our ability to collect time-varying geospatial and climate data. The data collection volumes and rates easily overwhelm those of the past. At present, the observation data streaming rate of NASA's current missions is approximately 1.73 GB gigabytes per seconds, and the scale of NASA's climate change data repository is expected to increase to 230 petabytes by the end of 2030 [3]. In fact, remote sensing (RS) observation data, as a typical type of geospatial data, and even those gathered by a single satellite data center, is increasing dramatically at the speed of several terabytes per day [4]. To take advantage of these huge datasets, users face several limitations, such as the limited resources and processing capacities of personal computers [5][6][7]. Therefore, it is difficult to deal with the huge volume of RS data in a traditional computing paradigm.
In summary, the extraction and interpretation of the information from these large RS datasets is the greatest challenge currently faced by Earth system science. Data must be transformed into knowledge, thus breaking the paradox of "big data but little knowledge", development on Jupyter [34]. A Free and Open-Source Software (FOSS) solution was proposed for automatically scaling K8s worker nodes within a cluster to support dynamic workloads [32]. An elastic parallel spatial and temporal adaptive reflection fusion model was implemented in an OpenStack cloud computing environment using Spark user-defined aggregation functions to reduce the number of joining operations based on the K8s operator model [35].
In addition, we also reviewed the existing computational solutions for big EO data management, storage, and access. PipsCloud, which is based on an HPC technology, is a widely used cloud platform in mainland China [19]. Traditionally, the storage nodes of HPC systems are independent from the computing nodes. Therefore, it is necessary to establish a high-speed Gigabit fiber network between computing and storage nodes. However, the capacity of these high-speed links is still less than the aggregate bandwidth of all compute nodes [36]. The processing abstractions, such as Xarray [37], pixel-wise, strip-oriented [18], or DataFrame-oriented [35], adopted by Open Data Cube (ODC), GEE, and other existing platforms, separate the array data from the metadata of the RS data. Regarding the replicability of infrastructure, the majority of the widely used platforms, such as GEE, pipsCloud, Joint Research Center (JRC) Earth Observation Data and Processing Platform (JEODPP) [38], and Sentinel Hub (SH) [39], have little documentation on deploying available applications on the users' own infrastructures. Most of these platforms are closed or proprietary solutions, such as pipsCloud, JEODPP, ODC, SH, and System for Earth Observation Data Access, Processing and Analysis for Land Monitoring (SEPAL), and the datasets included in these platforms satisfy the individual demands. The widely used platform GEE suffers from costly limitations of the maximum duration of each request, the maximum number of simultaneous requests per user, and the maximum execution of operations, and the changes in algorithm implementations also impact the results by running the same script on the same dataset at different dates [8]. Additionally, it cannot be accessed in mainland China. Regarding the computing framework selection, the majority of platforms choose MapReduce, such as GEE, Hadoop-GIS [16], and spatialhadoop [40]. However, MapReduce relies on the Hadoop Distributed File System (HDFS) as the data exchange intermediary, and increases the I/O burden through frequently reading from and writing to the storage. Accordingly, we adopt Spark as a Distributed Parallel Computing (DPC) framework due to its excellent performance based on memory, and propose the tile-oriented programming model to deal with the huge RS layer as small tiles. This enables the limited memories of the cluster to load and process the long time-series RS data in any defined tile area without breaking the spatial relative characteristics of the RS data. We utilize containerization technology to achieve infrastructure replicability and version control of the software environment and algorithms implemented by most of the existing platforms. The system is built on the basis of the tile-oriented programming model, containerization, and a Jupyter notebook as a web portal, and possesses significant potential for processing RS big data in a less costly manner. In addition, our solution reduces the entry barriers for the EO community in cloud computing technologies and RS big data analysis platforms, and provides a simple and easy technical solution and theoretical basis for EO researchers to build their own scalable RS big data platform.
Thus, in this study, we comprehensively investigated the storage, analysis, and visualization process of spatio-temporal RS big data based on Spark on K8s [41] with a Jupyter notebook as a web portal. The main objective of this study was to facilitate RS big data processing through a scalable computing resources system. Specifically, we aimed to: (1) provide an option for researchers to deal with the huge volume of RS big data in a less costly and flexible cloud computing system based on existing hardware resources; (2) compile and assemble Docker images with free, open-access software and packages for public users to easily deploy a RS big data processing platform; (3) improve the efficiency of data loading by reducing the frequency of data movements between computing and storage nodes based on HDFS on K8s; (4) focus on the spatial structure characteristics of RS images to implement the innovative tile-oriented programming model based on the GeoPySpark [42] package; and (5) realize the auto migration of containers without missing data based on CephFS as a storage-class of the K8s cluster. Section 2 of this paper describes the methodology and data sources used in this study. Section 3 describes the detailed hardware and software environment of the system. Section 4 shows the time cost results for several aspects of the system and visualization in multiple scales. Section 5 discusses the main factors affecting the computing efficiency of the system. Section 6 presents the conclusion of the study.

Methodology and Data Sources
In this study, we developed an on-premises scalable computing resources system for RS big data storage, processing, and visualization. Specially, a platform using GeoPySpark based on Spark with a Jupyter notebook on K8s is introduced, which can be deployed on this distributed framework.

The Tile-Oriented Programing Model
Large RS data volumes create significant challenges in data partitioning policy and parallel programming difficulties in scalable computing containers. To address this issue, we incorporated the tile-oriented programming model and Resilient Distributed Datasets (RDDs) [43] to facilitate the design of generic parallel RS algorithms. The tiling of large maps is an long-standing practice. Large paper maps have always been divided into a series of map sheets at various scales [44]. The existing studies have not yet focused on the tile-oriented programming in the parallel computing platform. In our system, the raster layers are abstracted as user-defined tiles of a given size. Tile class contains a NumPy array that represents the cells of the raster in addition to other information regarding the data. Similarly, we can perform any map algebra operations, such as local or focal operations, as a normal NumPy array along with the tile class. As a technique for rendering textures in images, texture tiles meet the subjective criterion of visual acceptability [45]. To better support the visualization of multi-scale RS data, we can retrieve information and queries from the TiledRasterLayer catalog in different tile sizes, such as point, rectangle, and any user-defined box.

The Ditributed Parallel Architecture of System
The architecture of the proposed scalable computing resources system is illustrated in Figure 1. The system consists of three parts, namely, K8s cluster management (Part I), RS data distributed storage (Part II), and scalable computing resources of the containers (Part III). All nodes contained in this big data framework act as both computing nodes and storage nodes to avoid data movement between computing nodes.
The K8s cluster management (Part I) is mainly responsible for resource management, user authentication, service accounting, and configuration management. K8s was selected as the container orchestration platform based on its widespread adoption in the market and the fact that it has become a recognized container orchestration standard [16]. As the foundation of the K8s cluster, CephFS [46] was implemented on low-level nodes to provide a storage-class [47] for the PersistentVolume [48] of K8s. This improves the robustness of the K8s cluster by enabling other running nodes to take over the services provided by nodes that are down. Using Kuboard, a free-use management tool of K8s, users can easily control and monitor the use of hardware, such as CPU, storage, and RAM.
The RS data distributed storage (Part III) is the infrastructure for storing huge RS datasets. In this part, we implemented HDFS on the K8s cluster. To ensure the high availability of the storage system, we deployed two NameNodes (NNs)-one in active status, and the other in standby status-to ensure the high availability of HDFS. The NameNode (NN) is the centerpiece of an HDFS file system. It keeps the directory tree of all files in the file system and tracks where the file data is kept across the cluster. It does not store the data of these files itself. The DataNode (DN) is responsible for storing the actual data in HDFS. In our platform, DataNodes (DNs) are distributed in all VMNs to reduce data movement between nodes. HDFS [49] is a popular cloud storage platform and streaming data access pattern, which can store huge, distributed file datasets, and write once and read many times. It is especially suitable for deployment on lower-cost computers consisting of a cluster because of its high reliability and high-performance characteristics. Therefore, we choose HDFS for data storage due to its outstanding capacity and high-speed read throughput of RS big data [50]. Kubectl provides users with a command-line interface to interact with K8s clusters. Kubelet is the primary "node agent" that runs on each node. A node is a virtual machine. A pod is the smallest deployable computing unit that users can create and manage in K8s. ZooKeeper maintains highly reliable distributed coordination of HDFS. NN and DN represent NameNode and DataNode of HDFS, respectively.
The K8s cluster management (Part I) is mainly responsible for resource management, user authentication, service accounting, and configuration management. K8s was selected as the container orchestration platform based on its widespread adoption in the market and the fact that it has become a recognized container orchestration standard [16]. As the foundation of the K8s cluster, CephFS [46] was implemented on low-level nodes to provide a storage-class [47] for the PersistentVolume [48] of K8s. This improves the robustness of the K8s cluster by enabling other running nodes to take over the services provided by nodes that are down. Using Kuboard, a free-use management tool of K8s, users can easily control and monitor the use of hardware, such as CPU, storage, and RAM.
The RS data distributed storage (Part III) is the infrastructure for storing huge RS datasets. In this part, we implemented HDFS on the K8s cluster. To ensure the high availability of the storage system, we deployed two NameNodes (NNs)-one in active status, and the other in standby status-to ensure the high availability of HDFS. The NameNode (NN) is the centerpiece of an HDFS file system. It keeps the directory tree of all files in the file system and tracks where the file data is kept across the cluster. It does not store the data of these files itself. The DataNode (DN) is responsible for storing the actual data in HDFS. In our platform, DataNodes (DNs) are distributed in all VMNs to reduce data movement between nodes. HDFS [49] is a popular cloud storage platform and streaming data access pattern, which can store huge, distributed file datasets, and write once and read many times. It is especially suitable for deployment on lower-cost computers consisting of a cluster because of its high reliability and high-performance characteristics. Therefore, we choose HDFS for data storage due to its outstanding capacity and highspeed read throughput of RS big data [50].
The scalable computing resources of containers (Part III) consists of a Spark driver container with a Jupyter notebook and Spark executor containers. At the core, K8s with The architecture of the proposed scalable computing resources big data framework. Kubectl provides users with a command-line interface to interact with K8s clusters. Kubelet is the primary "node agent" that runs on each node. A node is a virtual machine. A pod is the smallest deployable computing unit that users can create and manage in K8s. ZooKeeper maintains highly reliable distributed coordination of HDFS. NN and DN represent NameNode and DataNode of HDFS, respectively.
The scalable computing resources of containers (Part III) consists of a Spark driver container with a Jupyter notebook and Spark executor containers. At the core, K8s with containerized Spark is integrated in the system to realize the assignment of scalable computing resources. In our system, we initialize a Spark application by only providing hints about the memory allocation to Spark Driver (SD). The parameters of Spark Executor Container (SEC), such as the number of instances, vcores, and memory, can be dynamically adjusted in each Spark application due to the fast initialization of containerized Spark by the Jupyter notebook console.
All of these containers are created by Docker images, which contain a series of Python libraries for raster data processing. The Jupyter notebook extends the console-based approach to interactive computing, which is a qualitatively new direction. It provides a web-based application suitable for capturing the whole computation process, including developing, documenting, and executing code, in addition to communicating the results. GeoPySpark [42] is a Python language binding library of GeoTrellis. GeoTrellis [51] is a geographic data processing engine for high performance applications, which provides data types for working with raster in the Scala language, in addition to a number of operations to manipulate raster data. GeoPySpark utilizes GeoTrellis to allow reading, writing, and operating of raster data. In GeoPySpark, raster images are represented by the tile class, which contains a NumPy array to represent the raster cells and other information regarding the data. The computing containers are implemented based on Python-3.7.3 and integrated with a Jupyter notebook as the access interface between the system and user-clients. As an indispensable part of the system, we integrated the mainstream Matplotlib [52] package of the Python library for data visualization, which is a comprehensive library for creating static, animated, and interactive visualization in Python. This enables publication-quality figures to be generated in a variety of hardcopy formats and interactive environments across platforms. When installing the Jupyter notebook application using Helm tools [53] in the K8s cluster, three services for running notebook container need to be deployed, namely, the web browser service, Spark driver service, and Spark UI service. The web browser service is exposed through haproxy-ingress [54] to provide a highly available and accessible service to client-side users.
The SparkContext is the entry point to any Spark functionality. When a Spark application is run, a driver program is hosted, which has the main function and imitates the SparkContext. The driver program then runs operations inside the executors on the worker nodes. Only two steps are needed to initialize a new SparkContext: (1) import the Python core libraries of GeoPySpark; (2) configure the parameters and create SparkContext. After creating a new SparkContext, computing jobs can easily be submitted to the cluster through the console cell. In our hundreds of executions, we used the Jupyter notebook as the user interface, which can provide a convenient way to adjust the parameters of Spark executors in the K8s cluster. In the parameter configuration section, we can configure the numbers of executors by defining the parameters of spark.executor.instances, the CPU cores of executors by spark.executor.cores, and each memory size of executors by spark.executor.memory. It is worth noting that the number of tasks is determined by the numbers of executors and vcores, which is equal to the product of the executors and vcores count. The Spark driver service is used to create scalable computing resources for the Spark executors by configuring the three above-mentioned parameters of SparkContext in the K8s cluster. Spark UI is exploited to monitor the running status of multiple tasks submitted by client users.

The Design of Experiments
In order to scientifically evaluate the factors that affect the performance of the system, we utilized one of the most widely used RS vegetation indices-Normalized Difference Vegetation Index (NDVI)-as a testbed to evaluate the feasibility and performance of our system. NDVI is one of the main vegetation indexes reflecting the spectral characteristics of vegetation [55]. It has been widely applied in various fields, such as environment monitoring systems on global or regional scales, analyzing vegetation and land cover dynamics, and extracting information of vegetation phenology [56]. In this work, to avoid more data movement in cluster nodes, we built HDFS storage on the heterogenous nodes, which are also Spark computing workers. In this study, we further explored how the less powerful nodes affect the efficiency of the whole cluster, which is crucial to build a data-intensive parallel computing system. In addition, when using GeoPySpark as the data processing library, it is necessary to carefully study the effect of tile size on tiling, computing, and writing. For raster data with the same computational complexity and size, a good question to guide the allocation of scientific resources is whether extending the computing resources can accelerate the computing progress. Finally, as the system is mainly used to analyze RS data, visualization provides an intuitive presentation for users to quickly understand the information in multiple scales.

The Data Sources of Experiments
To prove the feasibility and robust of the system, three experiments were performed. Moderate Resolution Imaging Spectroradiometer (MODIS) daily series images were analyzed and processed using the proposed system. MODIS is a key instrument onboard Terra (originally known as EOS AM-1) and Aqua (originally known as EOS PM-1) satellites. Terra MODIS and Aqua MODIS view the entire Earth's surface every 1 to 2 days. Thus, the MODIS sensor is suitable for some daily surface monitoring applications such as monitoring vegetation change. Nevertheless, the reflectance bands receive radiation from the cloud layer rather than the land surface due to the cloud effect. Therefore, it is also necessary to solve the problem of discontinuities in temporal and spatial data caused by cloud pollution in the vegetation change application.
In our experiments, we utilized two products of MODIS, MOD09GQ, and MOD35. MOD09GQ [57] provides MODIS band 1-2 daily surface reflectance at 250 m resolution. The MODIS cloud mask (MOD35) [58] is a science data product. It is regularly produced as a standard product of the Earth Observing System (EOS). Its main purpose is to identify scenes where land, ocean, and atmosphere products should be retrieved based upon the amount of obstruction of the surface due to clouds and thick aerosol.
The MODIS NDVI [59] can be calculated by using the surface reflectance of MODIS red and near infrared bands according to: where NIR is the near-infrared band and RED is the red band of MOD09GQ products. The production process of MODIS NDVI is illustrated in Figure 2, and mainly includes Cloud Mask (CM), image preprocessing, Quality Control (QC), and vegetation index calculation.

The Data Sources of Experiments
To prove the feasibility and robust of the system, three experiments were performed. Moderate Resolution Imaging Spectroradiometer (MODIS) daily series images were analyzed and processed using the proposed system. MODIS is a key instrument onboard Terra (originally known as EOS AM-1) and Aqua (originally known as EOS PM-1) satellites. Terra MODIS and Aqua MODIS view the entire Earth's surface every 1 to 2 days. Thus, the MODIS sensor is suitable for some daily surface monitoring applications such as monitoring vegetation change. Nevertheless, the reflectance bands receive radiation from the cloud layer rather than the land surface due to the cloud effect. Therefore, it is also necessary to solve the problem of discontinuities in temporal and spatial data caused by cloud pollution in the vegetation change application.
In our experiments, we utilized two products of MODIS, MOD09GQ, and MOD35. MOD09GQ [57] provides MODIS band 1-2 daily surface reflectance at 250 m resolution. The MODIS cloud mask (MOD35) [58] is a science data product. It is regularly produced as a standard product of the Earth Observing System (EOS). Its main purpose is to identify scenes where land, ocean, and atmosphere products should be retrieved based upon the amount of obstruction of the surface due to clouds and thick aerosol.
The MODIS NDVI [59] can be calculated by using the surface reflectance of MODIS red and near infrared bands according to: where NIR is the near-infrared band and RED is the red band of MOD09GQ products. The production process of MODIS NDVI is illustrated in Figure 2, and mainly includes Cloud Mask (CM), image preprocessing, Quality Control (QC), and vegetation index calculation.  The whole of mainland China was selected as the study area, and the daily NDVI dataset with 250 m spatial resolution and long time series (from 2002 to 2020) was produced by utilizing the proposed scalable computing resources big data system. The production process of the daily NDVI dataset produces almost 2GB of raster data per day, in which the data volume of the MOD35 cloud mask is approximately 332 MB; each quality assessment, for red and near-infrared bands, is almost 665 MB. The whole of mainland China has 24,642 × 14,157 pixels in a spatial resolution of 250 m (approximately 0.0025 • ). The daily NDVI is stored in HDFS format as an individual catalog, with metadata of the TiledRaster-Layer stored separately as a json file. The file stored as a catalog can be easily used to explore data of different scales, such as a point, a square, or any defined box. Map algebra operations are provided by GeoPySpark. Local and focal operations are performed only on the TiledRasterLayer. Therefore, we need transform the RasterLayer to the TiledRasterLayer before any operations can be loaded in raster data. As the tiled format is stored in a catalog, we can explore time-series data information at a point or a local regional area.

Time Cost of Compurting Mechanism
In this distributed parallel system, the time cost for the complete process for calculation contains three parts, namely, data load time (denoted as TL), computing time (denoted as TC), and collecting and writing time (denoted as TW). The total processing time Ttotal of the band-oriented RS algorithm can be formulated as following: Not only does TL depend on the I/O rates of the hard disk, but it also relates to the volume of data moving among the nodes. TC is determined by the tile size (S TILE ), the tiling numbers of the raster layer (denoted as N TILE ), the number of SECs (denoted as N SEC ), and the vcores of each SEC (denoted as V SEC ). Hence, the TC can be represented as Equation (3). Equation (3) clearly shows that the larger N SEC and V SEC , the lower the value of TC.
If a raster layer's columns and rows are represented by N cols and N rows , respectively, N TILE can be expressed as Equation (4). Thus, the smaller S TILE , the more tiles will be generated. NTILE = (Ncols/STILE + 1) × (Nrows/STILE + 1) TW is mainly decided by the tiling numbers of the raster layer (N TILE ), the number of SECs (N SEC ), and the vcores of each SEC (V SEC ). The bigger the value of N TILE , the greater the cost of the collecting time. The greater the values of N SEC and V SEC , the more resources needed by SD to maintain the status of all SECs, which affects the writing time of the calculation result.
The above explanation indicates that TC and TW contradict each other. Therefore, when designing an algorithm for RS datasets, the complexity of the algorithm should be considered. If the computing is intensive, such as in time series data construction, more SEC instances and bigger vcores should be initialized. Otherwise, fewer resources should be applied, thus allowing more resources to be provided to other data processes.

System Environment
The experimental environment consists of six virtual machine nodes. We used the HUAWEI FusionCompute virtualization cloud platform to create these VMNs; four of these were hosted on the FusionComputeV100R006 platform (R006), and two on the V100R005 platform (R005). The physical servers in R006 have Intel(R) Xeon(R) CPU E5-2630 v4 @ 2.20 GHz, Intel Corporation 82599EB 10 Gigabit Dual Port Backplane Connection (network), and 128GB memory on the V100R006 platform. The R005 servers have Intel(R) Xeon(R) CPU E5-2620 v2 @ 2.10 GHz, Intel Corporation I350 Gigabit Network Connection(network), and 64GB memory. Thus, the physical servers in R005 possess a less powerful CPU and slower network rate than those in R006. The hardware configurations of the VMNs in the K8s cluster are listed in Table 1. In total, there were 48 vcores and 144GB memory available in the system.
To monitor cluster performance, the Kuboard and Spark job monitor were used. The operating system (OS) of all nodes was Centos-7.4-x86_64. Docker was selected to provide the container execution environment due to its tight integration with K8s and wide industry adoption [60]. The software of the VMN and Docker images used in the platform is listed in Table 2. To manage the container-based workloads, our container-based orchestration platform consists of six nodes, namely, one master node and five worker nodes in the K8s cluster. It can be seen that our system is heterogeneous with different hardware and different networks.  We built spark-notebook and spark-py Docker images from the openjdk:8-jdk-slim base layer for processing RS data; the packages integrated in each of images are listed in Table 3. The spark-notebook image and the spark-py image was used to create the Spark driver container and Spark executor containers, respectively. These images were pushed to registry.cn-hangzhou.aliyuncs.com [61] (accessed on 28 November 2021) as a public repository. These images can be accessed by any users who wish to analyze RS images using container-oriented programming.

Results
We evaluated the performance of the system through a case study in which the daily distribution of 250 m NDVI within the mainland area of China was estimated using MODIS surface reflectance and cloud product data. Below, we first discuss the efficiency of heterogeneous nodes, and analyze the impact of the tile size. We then evaluate the efficiency of submitting computing jobs with different computing resources, and finally explore how to mine time-series information more effectively from massive amounts of RS data with a given tile size.

Time Cost of Heterogeneous Nodes
In order to evaluate the impact of the less powerful nodes on the performance of the cluster, we tracked the detailed time cost of the multi-task stages of the tiling job through the Spark UI service. We initialized a SparkContext and parameterized six executors with 8 GB memory and 2 vcores in this experiment. We executed the containerized Spark application twenty times and recorded each tiling execution time. All computing nodes were used in the first ten executions, and less powerful nodes were disabled in the final ten executions. As shown in Figure 3, when scheduling all nodes in the K8s cluster, some jobs were inevitably submitted to the less powerful nodes (node5 and node6), and the time cost of tiling varied (Figure 3a). When the less powerful nodes were disabled from scheduling, all executors were initialized in the efficient nodes, and the time cost of tiling was more stable. Taking the execution time of CM data processing process as an example, the highest time cost was over 35 s when the less powerful nodes were included in the computing stages. However, the time cost reduced to 22 s when the less powerful noes were disabled. It should be noted that the time costs of some executions were close to 25 s. The reason for this is that the datasets were stored and computed in different VMNs by tracking the details of the Spark UI service and the HDFS management system. The maximum performance gain was up to 37.1% (

Time Cost in Different Tile Sizes
Because all map algebra operations only work on the TiledRasterLayer, each raster band file needs to be tiled as a TiledRasterLayer after it is loaded as a RasterLayer. Therefore, it is necessary to study the effect of tile size on the efficiency of raster layers' processing. We produced the NDVI in the SparkContext, using six executors, two cores, and 8GB memory. Each given tile size experiment was executed 10 times, and the average execution time of the tiling stages and computing-writing stages were recorded separately. As shown in Figure 4, the average tiling time for different tile sizes was almost the same. Each smaller raster file (i.e., CM data) took 25 s, whereas each larger file (i.e., QC, RED, and NIR data) took almost 30 s under different tile sizes. Therefore, the tile size is not a prominent factor related to the time cost of the tiling stage, when each procedure is in the same SparkContext configuration.

Time Cost in Different Tile Sizes
Because all map algebra operations only work on the TiledRasterLayer, each raster band file needs to be tiled as a TiledRasterLayer after it is loaded as a RasterLayer. Therefore, it is necessary to study the effect of tile size on the efficiency of raster layers' processing. We produced the NDVI in the SparkContext, using six executors, two cores, and 8GB memory. Each given tile size experiment was executed 10 times, and the average execution time of the tiling stages and computing-writing stages were recorded separately. As shown in Figure 4, the average tiling time for different tile sizes was almost the same. Each smaller raster file (i.e., CM data) took 25 s, whereas each larger file (i.e., QC, RED, and NIR data) took almost 30 s under different tile sizes. Therefore, the tile size is not a prominent factor related to the time cost of the tiling stage, when each procedure is in the same SparkContext configuration.

Time Cost in Different Tile Sizes
Because all map algebra operations only work on the TiledRasterLayer, each raster band file needs to be tiled as a TiledRasterLayer after it is loaded as a RasterLayer. Therefore, it is necessary to study the effect of tile size on the efficiency of raster layers' processing. We produced the NDVI in the SparkContext, using six executors, two cores, and 8GB memory. Each given tile size experiment was executed 10 times, and the average execution time of the tiling stages and computing-writing stages were recorded separately. As shown in Figure 4, the average tiling time for different tile sizes was almost the same. Each smaller raster file (i.e., CM data) took 25 s, whereas each larger file (i.e., QC, RED, and NIR data) took almost 30 s under different tile sizes. Therefore, the tile size is not a prominent factor related to the time cost of the tiling stage, when each procedure is in the same SparkContext configuration. Spark has lazy loading behavior for transformations, which means that it does not trigger the computation of the transformation. Rather, it only tracks the requested transformation. When a user writes a transformation to obtain another dataset from an input dataset, it can be written in a way that makes the code readable. Therefore, we further explored the time cost of the computing-writing stage.
We tracked the detailed time cost in the Spark UI monitor service and found that the size of the tile is the main factor affecting the computing-writing efficiency of the cluster, Spark has lazy loading behavior for transformations, which means that it does not trigger the computation of the transformation. Rather, it only tracks the requested transformation. When a user writes a transformation to obtain another dataset from an input dataset, it can be written in a way that makes the code readable. Therefore, we further explored the time cost of the computing-writing stage.
We tracked the detailed time cost in the Spark UI monitor service and found that the size of the tile is the main factor affecting the computing-writing efficiency of the cluster, as shown in Figure 5. The time cost of computing-writing increases almost with the size of tile. The minimum time cost occurred in the second execution when the tile size was 256 pixels, and the maximum time cost appeared in the fourth execution then the tile size was 2048 pixels. The minimum time cost occurred in the case of a smaller tile size, but this does not mean that a smaller tile size is better in computing-writing stages, and the tile size of 512 pixels had a more stable time cost than other tile sizes.
Remote Sens. 2022, 14,521 12 of 20 as shown in Figure 5. The time cost of computing-writing increases almost with the size of tile. The minimum time cost occurred in the second execution when the tile size was 256 pixels, and the maximum time cost appeared in the fourth execution then the tile size was 2048 pixels. The minimum time cost occurred in the case of a smaller tile size, but this does not mean that a smaller tile size is better in computing-writing stages, and the tile size of 512 pixels had a more stable time cost than other tile sizes.

Time Cost of Scalable Computing Resources
As mentioned in Section 3, the system we implemented possesses 48 vcores and 144GB memory. All data required for processing were concentrated in one place and closely coupled with the processing resources to guarantee efficient development and convenient access. All nodes were used not only for distributed storage of data, but also

Time Cost of Scalable Computing Resources
As mentioned in Section 3, the system we implemented possesses 48 vcores and 144GB memory. All data required for processing were concentrated in one place and closely coupled with the processing resources to guarantee efficient development and convenient access. All nodes were used not only for distributed storage of data, but also for data processing. All these resources were used for the OS, K8s management, HDFS on K8s, and computing. Thus, the computing resources that we can lease must not include the resources occupied by the OS, K8s cluster management, and HDFS. Hence, nearly 42 vcores and 100GB memory resources monitored by Kuboard can be leased for computing. According to official documents relating to Spark performance tuning [62], the numbers of executors and cores, and the executor memory, are the three main factors that may affect the performance of Spark-based applications.
In this study, four different configuration sets were designed as follows: (1) 42 executors with 2GB memory and one vcore each as set 1 (42E1C2G); (2) 20 executors with 4GB memory and two vcores each as set 2 (20E2C4G); (3) 10 executors with 6GB memory and three vcores as set 3 (10E3C6G); and (4) six executors with 12GB memory and seven vcores as set 4 (6E7C12G). These four configuration sets can use almost all the computing vcores and free memory of the cluster, but are unable to fully use all of the computing vcores and free memory. For example, we can theoretically assign 10 executors with 8GB memory and four vcores in set 3. This is because nodes 3, 5, and 6 only have a total of eight vcores and 16 GB memory. As mentioned above, some resources must be kept for the basic running of the cluster, such as OS, cluster manager, and HDFS. Therefore, if two executors with 8GB memory and four vcores use resources on one node, nodes 3, 4, or 5, for example, will not have enough resources to lease. Thus, it is worth noting that when designing a Spark on K8s cluster with HDFS, the computing vcores and memory should be as large as the physical machine resources to avoid the waste of resources.
In computing-writing jobs of the daily NDVI calculation, four flatMap stages, three RDD collect stages, and one RDDWriter stage are included. RDD collect and RDDWriter stages result in slow progress, which increases the time cost of the Spark computing framework. The time cost of computing-writing jobs under four different configuration sets (512-pixel tile size) are shown in Figure 6. We found that the time cost increases with the increase in the number of executors under the same algorithm complexity.

Time Cost of Sliding Window Processing
To investigate the performance of the system in spatial computation for RS big data the Sliding Window Algorithm (SWA) was selected for this study.

Time Cost of Sliding Window Processing
To investigate the performance of the system in spatial computation for RS big data, the Sliding Window Algorithm (SWA) was selected for this study. Focal operations, including MEAN, MEDIAN, MODE, SUM, STANDARD_DEVIATION, MIN, MAX, SLOPE, and ASPECT, were performed in GeoPySpark by executing a given operation on a neighborhood throughout each tile in the layer. We chose the MEAN operation and an int16 data type of the TiledRasterLayer in the 3 × 3, 5 × 5, 7 × 7, and 9 × 9 sliding windows (SWs) under the three different computing resources, namely, 20 executors with 4GB memory and two vcores (20E2C4G), 10 executors with 4GB memory and two vcores (10E2C4G), and five executors with 4GB memory and two vcores (5E2C4G). Each given SW experiment was executed five times, and the execution time of processing was recorded, as shown in Figure 7(a1-a3). We also repeated all the tests in a float32 data type of the TiledRasterLayer in different SWs, as illustrated in Figure 7(b1-b3). The int16 and float32 data type of the raster layer had the same grid size and resolution. The results illustrate that the average time costs of the float32 data type are higher than those of the int16 data type by about 9. show the time costs of the int16 data type in 3 × 3, 5 × 5, 7 × 7, and 9 × 9 sliding windows; (b1-b3) show the time costs of the float32 data type in 3 × 3, 5 × 5, 7 × 7, and 9 × 9 sliding windows. The 20E2C4G represents 20 SEC, 2 vcores, and 4GB memory of SparkContext, and 10E2C4G, 5E2C4G also represent different configurations of SparkContext.

Multi-Scale Visualization
In this work, we used the GeoPySpark raster data processing package, which can easily obtain information from the TiledRasterLayer catalog at a point or for any defined box scale. We took the visualization of the NDVI calculation results on 14 June 2020 as an example. The NDVI distribution in the whole study area and the specified tile scale (e.g., a tile having a size of 1024 × 1024 pixels) can be easily visualized in the system, as shown in Figures 8 and 9. Therefore, our system can be used to easily view and visualize the multi-scale information relevant to users through the user interface of the web browser. show the time costs of the int16 data type in 3 × 3, 5 × 5, 7 × 7, and 9 × 9 sliding windows; (b1-b3) show the time costs of the float32 data type in 3 × 3, 5 × 5, 7 × 7, and 9 × 9 sliding windows. The 20E2C4G represents 20 SEC, 2 vcores, and 4GB memory of SparkContext, and 10E2C4G, 5E2C4G also represent different configurations of SparkContext.

Multi-Scale Visualization
In this work, we used the GeoPySpark raster data processing package, which can easily obtain information from the TiledRasterLayer catalog at a point or for any defined box scale. We took the visualization of the NDVI calculation results on 14 June 2020 as an example. The NDVI distribution in the whole study area and the specified tile scale (e.g., a tile having a size of 1024 × 1024 pixels) can be easily visualized in the system, as shown in Figures 8 and 9. Therefore, our system can be used to easily view and visualize the multi-scale information relevant to users through the user interface of the web browser.

Discussion
The results of this study indicate that the tile-oriented programming model can easily implement parallel computing of RS big data without destroying the spatial structure of the raster data. The raster layers can be abstracted as many small-size tiles, and a computing task can be divided into many small tile computing tasks, which can be assigned scalable computing resources (SEC). To investigate the factors that affect the efficiency of the system, we carried out several experiments after considering the hardware and computing paradigm. As a result, we found that the tile size plays an important role in parallel computing. When the tile size became smaller, the time cost was shorter. This is because the greater number of tiles generated can completely utilize the distributed computing resources. However, this does not mean the smaller tile size is better, and the stable time costs appear at a tile size of 512 pixels. The reason for this is that more time is spent on the counting of the HadoopRDDWriter (Resilient Distributed Dataset, RDD) stage for smaller size tiles, and on the flatMap operation of CutTiles stages for larger size tiles. Therefore, the size of the tile portion should not be too small or too big. Our experiments obtained results that were similar to those of previous studies [18,21], but there were still some divergences.
The computing resources is another vital factor that influences the computing efficiency. The number of computing tasks is decided by the number of vcores of SECs. For a simple computing complexity, the greater the number of containers that are leased, the more the driver manager will maintain the computing status of all executors and the greater the data movement in the shuffle progress. Therefore, a larger number of vcores of executors does not provide better results. Due to the implementation of the HDFS storage in the K8s computing nodes, the time costs are slightly inconsistent under the same computing resources, complexity of algorithm, and tile size because of the RS images stored in the same computing nodes.
The complexity of RS data processing algorithms also affects the performance of the computing-writing stage. The average time cost of a band-wise algorithm such as the NDVI algorithm was 187 s under 20E2C4G computing resources, and the average time cost of SW was nearly 50 s under the same computing resources. Because the NDVI algorithm contains four raster bands, the time cost of each band was almost 47 s. Thus, we found that the complex algorithm of spatial computing needs more time than the simple band-wise algorithm, but the difference was not as obvious in the parallel computing paradigm.
The data type is also a factor that can have a considerable effect on the computational time cost and storage space. The storage space of the RS layer having the same grid will double in float32 and quadruple in float64 compared with the int16 data type. Similarly,

Discussion
The results of this study indicate that the tile-oriented programming model can easily implement parallel computing of RS big data without destroying the spatial structure of the raster data. The raster layers can be abstracted as many small-size tiles, and a computing task can be divided into many small tile computing tasks, which can be assigned scalable computing resources (SEC). To investigate the factors that affect the efficiency of the system, we carried out several experiments after considering the hardware and computing paradigm. As a result, we found that the tile size plays an important role in parallel computing. When the tile size became smaller, the time cost was shorter. This is because the greater number of tiles generated can completely utilize the distributed computing resources. However, this does not mean the smaller tile size is better, and the stable time costs appear at a tile size of 512 pixels. The reason for this is that more time is spent on the counting of the HadoopRDDWriter (Resilient Distributed Dataset, RDD) stage for smaller size tiles, and on the flatMap operation of CutTiles stages for larger size tiles. Therefore, the size of the tile portion should not be too small or too big. Our experiments obtained results that were similar to those of previous studies [18,21], but there were still some divergences.
The computing resources is another vital factor that influences the computing efficiency. The number of computing tasks is decided by the number of vcores of SECs. For a simple computing complexity, the greater the number of containers that are leased, the more the driver manager will maintain the computing status of all executors and the greater the data movement in the shuffle progress. Therefore, a larger number of vcores of executors does not provide better results. Due to the implementation of the HDFS storage in the K8s computing nodes, the time costs are slightly inconsistent under the same computing resources, complexity of algorithm, and tile size because of the RS images stored in the same computing nodes.
The complexity of RS data processing algorithms also affects the performance of the computing-writing stage. The average time cost of a band-wise algorithm such as the NDVI algorithm was 187 s under 20E2C4G computing resources, and the average time cost of SW was nearly 50 s under the same computing resources. Because the NDVI algorithm contains four raster bands, the time cost of each band was almost 47 s. Thus, we found that the complex algorithm of spatial computing needs more time than the simple band-wise algorithm, but the difference was not as obvious in the parallel computing paradigm.
The data type is also a factor that can have a considerable effect on the computational time cost and storage space. The storage space of the RS layer having the same grid will double in float32 and quadruple in float64 compared with the int16 data type. Similarly, the computing time will increase when the NDVI value is converted from int16 into the float32 data type by expanding 10,000 times and rounding as an integer. Therefore, the data type may be transformed from float to integer to improve the computing performance and to save storage space when the accuracy of processing result is not (or is slightly) affected within an acceptable range.
The hardware of the physical servers may have an effect on the system. A server with stronger CPU rate will speed up a computing task, and a weaker CPU will slow the task. Therefore, we built the system on several nodes virtualized in heterogeneous physical servers. In this case, the less powerful nodes reduce the efficiency of the cluster, as in the case of the "Cask Effect", and any shortage affects the efficiency of the whole cluster. This also indicates that all computing tasks are balance loaded in distributed SECs according to the task scheduling mechanism of Spark on K8s.

Conclusions
In this study, we explored the state-of-the-art technologies for analysis and processing of RS big data, such as the storage, computing, and visualization of massive raster datasets generated from EOS. To improve the overall efficiency of the system, we adopted several novelty technological strategies. At the stage of loading data, we used HDFS on K8s architecture to avoid a huge volume of RS data movement in computing nodes. Hence, the time cost during the RS data storage is lower in the same node with computing nodes. Although the homogeneous physical server in a whole cluster is very effective, the distributed parallel platforms are usually built on several existing heterogenous servers. At the stage of computing, when the tasks were scheduled in the clusters excluding less powerful nodes, the time cost of the computing-writing stage was more stable. In our system, as mentioned in Section 3, the less powerful nodes hosted by the R005 platform are connected in the platform through a less powerful network. Therefore, the performance of the cluster is lower in the collecting and shuffling operations with huge data transmission among the nodes. Moreover, the tile-oriented programing model is quite efficient for large scales of the RS raster layer. The tile size is a crucial factor that affects the time cost in the computing-writing stages, and tiles cannot be too big or too small. Tiles that are too big cannot completely utilize the parallel computing resources, whereas tiles that are too small need more time to collect the results. The number and vcores of SECs are another key factor that has an impact on the capability in the computing-writing stages. For the band-oriented RS algorithm, such as the NDVI calculation, the complexity of computing is not highly intensive; therefore, increasing the computing resources does not improve the performance. Additionally, we assembled flexible packages of Python in Docker images, such as GDAL, GeoPySpark, and Matplotlib, which provided a convenient environment for RS data analysis and visualization at multiple scales. Therefore, we believe that the system proposed in this study can be easily transplanted to other big data platforms, such as AWS, Azure, Google GCP, Aliyun, and Tencent Cloud, based on open-access Docker images pushed to a public repository. In summary, the system we explored is clearly suitable for RS big data processing.
This work also has some shortcomings that will need to be addressed in the future. Our future work will investigate these issues, such as developing a multi-user analysis system for RS big data, upscaling or downscaling RS data pre-processing based on the pyramid class in GeoPySpark, and providing a cloud Tile Map Service (TMS) directly from both GeoPySpark RDDs and the tile catalog.  Data Availability Statement: The Docker images of the Jupyter notebook and the pyspark executor in this study are available online from registry.cn-hangzhou.aliyuncs.com/guojf/spark-notebook: rf-geopyspark-pyproj [61] and registry.cn-hangzhou.aliyuncs.com/guojf/spark-py:geopyspark [61], respectively. (accessed on 28 November 2021).

Acknowledgments:
We are grateful to LocationTech Labs, who provided the GeoPySpark library for this research. We would also like to thank NASA, who provided Cuprite data for the experimental validation.

Conflicts of Interest:
The authors declare no conflict of interest.

Abbreviations
The