A Distributed Storage and Access Approach for Massive Remote Sensing Data in MongoDB

Abstract: With the rapid development of earth-observation technology, the amount of remote sensing data has increased exponentially, and traditional relational databases cannot satisfy the requirements of managing large-scale remote sensing data. To address this problem, this paper undertakes intensive research of the NoSQL (Not Only SQL) data management model, especially the MongoDB database, and proposes a new approach to managing large-scale remote sensing data. Firstly, based on the sharding technology of MongoDB, a distributed cluster architecture was designed and established for massive remote sensing data. Secondly, for convenience in the unified management of remote sensing data, an archiving model was constructed, and remote sensing data, including structured metadata and unstructured image data, were stored in the above cluster separately, with the metadata stored in the form of a document, and image data stored with the GridFS mechanism. Finally, by designing different shard strategies and comparing the MongoDB cluster with a typical relational database, several groups of experiments were conducted to verify the storage performance and access performance of the cluster. The experimental results show that the proposed method can overcome the deficiencies of traditional methods and scale out the database, making it more suitable for managing massive remote sensing data and able to provide technical support for such management.


Introduction
As a spatial information carrier, remote sensing data play significant roles in many fields, such as environmental monitoring, land resources survey, and disaster assessment, owing to their strong timeliness and large area coverage [1,2]. As earth-observation technologies (including remote sensing and satellite communication) and information technologies develop, a big data era has begun in the remote sensing field, where the amount of remote sensing data is massive and continues to increase exponentially [3,4]. According to statistics, the daily data volume produced by different satellite platforms is increasing at the terabyte level. Landsat8, for example, launched by the National Aeronautics and Space Administration in 2013, generates four hundred global images every day; Sentinel, launched by the European Space Agency, produces thousands of data products every day. Faced with such massive remote sensing data, storing and managing them efficiently is an issue that must be solved.
Remote sensing data consist of unstructured image data and structured descriptive information attached to the image, which is also called metadata; these two files are commonly saved in the same directory. Previous studies have used different methods to store remote sensing data. The most commonly used method combines the file system with a database, with the image data stored in the storage device and the metadata stored in a database table [5]. However, this mode can neither achieve true storage nor reflect the nature of image data, and when the storage path is changed, many items need to be modified in the database, which can reduce the safety and reliability of data to some extent. The second approach that has received much attention is storing image data in a database directly, using the data type "BLOB" [6,7]. Both of these methods are implemented on a relational database management system. However, as the amount of remote sensing data grows rapidly, the weaknesses of a relational database in managing massive spatial data are becoming more and more obvious, including slow reading and writing speed and difficult horizontal expansion [8]. In particular, the low storage and access efficiency under massive data can no longer meet the requirements for data management in a big data era. The last method that some applications adopt is the file system [9,10]. In this case, the image data are organized as files on high-performance storage devices. However, data retrieval and data acquisition are inconvenient with this method, and in many cases, users have to develop specialized programs to realize these functions. Thus, Not Only SQL (NoSQL) databases, typically represented by MongoDB, are receiving more attention because their data management mode can solve the problems of the relational database [11,12].
In addition, the file storage mechanism GridFS within MongoDB can achieve the distributed storage of large binary files [13]. This paper proposes a method to manage remote sensing data based on MongoDB, with the metadata stored in the form of document and the image data stored with GridFS.
Many studies use NoSQL databases to manage the spatial datasets. Li et al. proposed a management strategy based on MongoDB for the frequent modification and complex spatial analysis of large-scale GIS (Geographical Information System) data, and designed experiments to verify the feasibility and effectiveness of their strategy [14]. Xiang et al. flattened a hierarchical R-tree structure into a tabular MongoDB collection to manage planar spatial data, and the results showed that the planar spatial data could be effectively managed by this method [15]. Wang and Hu proposed a cloudizing storage method for unstructured LiDAR point cloud data with the distributed file system GridFS based on MongoDB, and the results showed that the proposed method performed better than the local file system [16]. Current studies mostly concentrate on the management of typical GIS data, such as point, line and surface, and less on remote sensing data, while this paper focuses on constructing a distributed storage strategy for massive remote sensing data to improve the management and service level.
The primary focus of this study is to promote the storage efficiency and access efficiency of remote sensing data by establishing a distributed sharding cluster based on the MongoDB database. In our approach, we propose a distributed storage and access method for remote sensing data based on MongoDB. Firstly, we established a distributed cluster for remote sensing data using the sharding technology of MongoDB. Secondly, we constructed an archiving model to realize the unified management of remote sensing data referring to prevalent international standards and metadata structures. Furthermore, for remote sensing data, metadata and image data are stored separately, with the former stored as documents, and the latter stored with the GridFS mechanism, which is related by the data filename. Finally, with Landsat8 metadata and various satellite data used as experimental data sources, we conducted various experiments to verify the availability and effectiveness of the proposed method. The results show that the method proposed for remote sensing data has higher performance in data storage and access, and can effectively manage massive data. It can also provide guidance for data management strategies and meet the demand for massive data management.

NoSQL and MongoDB
Although relational database management systems occupy an important position in the data management field, large-scale data introduce a new challenge for data storage and management [17]. Bottlenecks exist in the process of managing massive data with traditional relational databases due to their ACID properties, and people have started to search for a better solution. Consequently, NoSQL technology has emerged as a possible solution [18,19]. For the management of massive data in a distributed system, NoSQL databases usually focus more on availability and partition tolerance and settle for eventual consistency, which has led to their adoption in many specialized areas [20]. With its flexible data model, high read and write performance, and powerful expandability under big data, NoSQL is more suitable for the storage and management of massive data.
In terms of storage mode, NoSQL databases can be divided into four categories: key-value databases, document-oriented databases, column-family databases, and graph-oriented databases [21]. Presently, as a document-oriented database, MongoDB is attracting much interest due to its free schema, support for complex queries, and powerful expandability. The fundamental data storage unit of the MongoDB database is a document, and the BSON (Binary JavaScript Object Notation) format, which is similar to the JSON (JavaScript Object Notation) format, is used as the data storage structure. Data are stored in the form of key-value pairs in MongoDB. A document corresponds to a row in a table of a relational database, and many documents compose a collection, which corresponds to a table. Unlike a table, whose structure needs to be specified in advance in relational databases, a collection does not require users to define its schema, and different types of documents can be stored in the same collection. Moreover, for large binary files, MongoDB provides the GridFS mechanism to store them.

Sharding Technology
Faced with the challenge of data processing in the big data era, the MongoDB database provides sharding technology as the solution to scale out [22]. Sharding technology is a database cluster system used to horizontally expand massive data, and the data is split and stored in different data nodes to handle greater data loads. Applying sharding technology can reduce the pressure on single machine performance caused by high data volume and high throughput applications, and improve random access performance under a large data volume. When storing massive data, one server might not be enough to store the data and provide acceptable read-write throughput. By establishing a cluster, the data can be divided and stored on multiple machines, so the database system can store and process more data to meet the processing needs of the large growth of data. Figure 1 presents the overall sharding cluster architecture of MongoDB.
The MongoDB cluster architecture mainly includes three parts: the shard server, the router server, and the config server. The actual data are stored in the shard server, which can be a replica set or a single server. The router server (mongos) is the entry point through which clients access the database, and it locates the shard that each client request should go to. All requests are handled by mongos and forwarded to the corresponding shard server. The config server stores the configuration information of the router and the shards; it is set up first and does not need significant space or resources.
When implementing sharding in a collection, one or more fields must be designated as the basis for splitting the data; this is called the shard key. After the shard key is specified, the data are split into small data chunks, and different chunks are stored on the corresponding shard machines. There are three kinds of shard keys: ascending key, random key, and combined key. The values of an ascending key (such as object id and time) grow steadily over time, so the most recently inserted documents are all assigned to the max chunk. The values of a random key, such as name, MD5, and password, follow no specific pattern, so as more data are inserted, the data can be distributed evenly across different chunks. When there is no suitable single shard key in the database, a combined key can be considered. The selection of the shard key can affect the performance of the system, such as its scalability, so it is very important to choose an appropriate shard key [23].
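The contrast between an ascending key and a random key can be illustrated with a small simulation. This is only a sketch under stated assumptions: the chunk count, the split points, and the helper names `range_chunk` and `hashed_chunk` are hypothetical, and MD5 is used as a stand-in hash rather than as a description of MongoDB's internal routing.

```python
import hashlib
from collections import Counter

NUM_CHUNKS = 4  # hypothetical number of chunks, chosen only for illustration


def range_chunk(value, boundaries):
    """Return the index of the range chunk that holds `value`.

    `boundaries` are sorted upper split points; any value at or above the
    last split point falls into the final ("max") chunk.
    """
    for index, upper in enumerate(boundaries):
        if value < upper:
            return index
    return len(boundaries)


def hashed_chunk(value, num_chunks=NUM_CHUNKS):
    """Return a chunk index derived from an MD5 digest of `value` (stand-in hash)."""
    digest = hashlib.md5(str(value).encode("utf-8")).hexdigest()
    return int(digest, 16) % num_chunks


# Split points fixed by earlier data; every new ascending key exceeds them.
boundaries = [100, 200, 300]
new_keys = range(1000, 2000)

ascending_load = Counter(range_chunk(key, boundaries) for key in new_keys)
hashed_load = Counter(hashed_chunk(key) for key in new_keys)
```

In this toy setting every new ascending key lands in the last chunk, while the hashed keys are spread over all four chunks, which mirrors the behavior described for the two key types.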

GridFS Mechanism
MongoDB supports the storage of binary data through a lightweight file storage specification named GridFS. GridFS is a distributed file storage mechanism used to store large binary files, which splits the large file into many small file chunks, and each file chunk is stored as a document [24]. Under this mechanism, the large file is stored in two collections whose default names are fs.chunks and fs.files, with binary data stored in the fs.chunks collection and the descriptive information stored in the fs.files collection, which achieves the distributed storage of data.
When storing a binary file, if its size is greater than the pre-set chunksize value, the file will be divided into several chunks. Each file corresponds to a document in the fs.files collection, which corresponds to one or more documents in fs.chunks.
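The splitting rule can be made concrete with a short calculation. The sketch below is illustrative only: the function name `chunk_layout` is hypothetical, and the default chunk size of 255 KB is an assumption about the driver configuration rather than a value stated in this paper.

```python
DEFAULT_CHUNK_SIZE = 255 * 1024  # bytes; assumed default, adjust to the actual setting


def chunk_layout(file_length, chunk_size=DEFAULT_CHUNK_SIZE):
    """Return (number_of_chunks, size_of_last_chunk) for a file of `file_length` bytes."""
    if file_length < 0 or chunk_size <= 0:
        raise ValueError("file_length must be >= 0 and chunk_size must be > 0")
    if file_length == 0:
        return 0, 0  # an empty file needs no chunk document
    count = (file_length + chunk_size - 1) // chunk_size  # integer ceiling division
    last = file_length - (count - 1) * chunk_size
    return count, last


# A 100 MB image under the assumed default chunk size.
count, last = chunk_layout(100 * 1024 * 1024)
```

Under these assumptions, a 100 MB image corresponds to one document in fs.files and 402 documents in fs.chunks, the last of which holds 148,480 bytes.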

The Distributed Cluster Architecture
MongoDB is an open-source NoSQL database management system with a feasible schema and powerful scalability, which can provide excellent storage and access capability, especially in the large data management field. Aiming at massive remote sensing data, including unstructured image data and structured metadata, a distributed storage method is proposed in this paper based on MongoDB. This research establishes the distributed storage and access architecture for remote sensing data, which is composed of several physical machines, and there is no duplication of data among the cluster.
To ensure the high availability and consistency of the remote sensing data, this research uses the strategy of "replica sets + sharding" to construct the cluster. That is to say, each shard is also a replica set and consists of a group of mongod processes, where mongod is the process that handles data requests and manages data storage in the MongoDB database. Moreover, the arbiter is a special member of the replica set that does not store data and occupies fewer resources; when the backup nodes cannot connect to the primary node, the arbiter takes part in the election of the new primary node. Figure 2 shows the distributed cluster architecture for remote sensing data.

Archiving Model for Metadata
Remote sensing metadata plays an essential role in correlative studies of earth-observation, and managing the metadata effectively can contribute to the application and sharing of remote sensing data [25,26]. In our research, remote sensing metadata is the descriptive information of the remote sensing image, which is generated to store attribute information. However, the contents of different metadata files are different, which brings difficulties to the unified management of remote sensing metadata. In the Landsat8 metadata file, for example, the "SENSOR_ID" field is the sensor identifier of the satellite, and the "SPACECRAFT_ID" field is the satellite identifier of the image, while in the ZY-3 metadata file, the corresponding fields are SensorID and SatelliteID. Therefore, constructing a unified archiving model for the management of remote sensing data is urgent.
Based on the investigation and survey of current metadata standards containing ISO (International Organization for Standardization) 19115 geographical information metadata standard and CSDGM (Content Standard for Digital Geospatial Metadata), as well as the metadata structures of multiple remote sensing data sources, the archiving model for remote sensing metadata is established, and its fields are determined: SatelliteID, SensorID, ReceiveDate, geographic coordinates, and so forth. These fields are shown in Table 1.

The remote sensing metadata is stored in the form of a document in the MongoDB database. When inserting data into the MongoDB database, if the "_id" field does not exist in the document, MongoDB generates the field automatically. Taking one Landsat8 metadata file as an example, the metadata is converted into the standard storage form through data processing. The storage mode of the remote sensing metadata is as follows:

{
    "_id": ObjectId("5cd393629e11f33380453591"),
    "ImageName": "LC80010762013365LGN01",
    "SatelliteID": "LANDSAT_8",
    "SensorID": "OLI_TIRS",
    "FileStorePath": "E:\entity data\landsat8",
    "DataDownloadURL": "https://earthexplorer.usgs.gov/browse-link/12864/LC80010762013365LGN01",
    "DataOwner": "USGS",
    "DataProvider": "USGS"
}

In the later experiments, the "TopLeftLatitude" field is selected as the shard key to compare the performance of the cluster. When the shard key is specified, remote sensing metadata can be divided and stored on different shards.
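The mapping from source-specific field names to the archiving model can be sketched as a simple lookup. This is a minimal illustration, not the Metadata Extraction Tool itself: the `FIELD_MAPS` table, its source labels, and the function `to_archive_document` are hypothetical, and only the Landsat8 and ZY-3 field pairs named in the text are included.

```python
# Source-specific field names mapped to archiving-model field names.
# The dictionary keys "LANDSAT_8" and "ZY-3" are hypothetical source labels.
FIELD_MAPS = {
    "LANDSAT_8": {"SPACECRAFT_ID": "SatelliteID", "SENSOR_ID": "SensorID"},
    "ZY-3": {"SatelliteID": "SatelliteID", "SensorID": "SensorID"},
}


def to_archive_document(source, raw_metadata, image_name):
    """Build an archiving-model document from one raw metadata record.

    Fields that the mapping does not know about are dropped, so documents
    from different satellites end up with the same field names.
    """
    mapping = FIELD_MAPS[source]
    document = {"ImageName": image_name}
    for raw_key, model_key in mapping.items():
        if raw_key in raw_metadata:
            document[model_key] = raw_metadata[raw_key]
    return document


landsat_raw = {"SPACECRAFT_ID": "LANDSAT_8", "SENSOR_ID": "OLI_TIRS", "CLOUD_COVER": 1.2}
landsat_doc = to_archive_document("LANDSAT_8", landsat_raw, "LC80010762013365LGN01")
```

A document built this way carries the unified "SatelliteID" and "SensorID" fields regardless of which satellite produced the original metadata file.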

Storage and Access of Image Data Based on GridFS
When data are stored as documents in the MongoDB database, each document must not exceed sixteen megabytes. However, with the rapid development of earth-observation technologies, a single remote sensing image has already reached hundreds of megabytes or more, which exceeds this limit, so document storage cannot satisfy the storage demand for remote sensing big data [27]. Because the traditional document-oriented method cannot be adopted for the storage of image data, this paper uses the GridFS file storage mechanism to manage large remote sensing image data.
Under this circumstance, the data are managed in two collections: rs.files and rs.chunks. The keys in the rs.chunks collection include _id, n, data, and files_id. "_id" stands for the unique identifier of the file chunk, "n" stands for the relative position in file of the chunk, "data" stands for the binary data in the chunk, and "files_id" is the same as "_id" in the rs.files collection. The keys in the rs.files collection include _id, length, chunksize, uploadDate, and MD5. "_id" stands for the unique identifier of the document in the rs.files, "length" stands for the number of the bytes, "chunksize" stands for the size of every chunk in bytes, "uploadDate" stands for the date and time that the file is uploaded to GridFS, and "MD5" is the check value of the file, which is calculated by the server.
Taking one of the Sentinel1 image data as an example, the storage structure of the image data in the rs.files collection is as follows.

The image data are stored and accessed in the following steps:
1. Retrieve the image data waiting to be stored according to the specified image name. If data with the same name exists, then finish the operation; if not, start to store the data with the GridFS mechanism.
2. Store the data in two collections: rs.files and rs.chunks. The "rs.files" collection stores the metadata of each image, while the "rs.chunks" collection stores the binary data of each image.
3. The data in the "rs.files" collection usually does not need to be split because its data volume is small, while "files_id" and "n" are selected as a combined shard key to divide the data in the "rs.chunks" collection across different shard nodes.
4. When accessing the image data, the data is retrieved in the "rs.files" collection with the specified query terms, and then the value of "_id" is obtained. Owing to the equal relationship between "_id" in rs.files and "files_id" in rs.chunks, "files_id" is also determined accordingly. Then the image data can be read sequentially through the value of "n".
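The steps above can be sketched with two in-memory lists standing in for the rs.files and rs.chunks collections. This is a simplified illustration, not the implementation used in the experiments: the functions `store_image` and `read_image` are hypothetical, integer ids replace ObjectId values, and the "filename" key is an assumption used to look images up by name.

```python
import hashlib
import itertools

files = []   # stands in for the rs.files collection
chunks = []  # stands in for the rs.chunks collection
_ids = itertools.count(1)  # simple integer ids instead of ObjectId values


def store_image(name, data, chunk_size):
    """Steps 1 to 3: skip if the name exists, else write one files document and its chunks."""
    if any(doc["filename"] == name for doc in files):
        return None  # an image with this name is already stored
    file_id = next(_ids)
    files.append({
        "_id": file_id,
        "filename": name,
        "length": len(data),
        "chunksize": chunk_size,
        "MD5": hashlib.md5(data).hexdigest(),
    })
    for n, start in enumerate(range(0, len(data), chunk_size)):
        chunks.append({
            "_id": next(_ids),
            "files_id": file_id,
            "n": n,
            "data": data[start:start + chunk_size],
        })
    return file_id


def read_image(name):
    """Step 4: find "_id" in files, then join the matching chunks in order of "n"."""
    file_doc = next((doc for doc in files if doc["filename"] == name), None)
    if file_doc is None:
        return None
    parts = sorted(
        (chunk for chunk in chunks if chunk["files_id"] == file_doc["_id"]),
        key=lambda chunk: chunk["n"],
    )
    return b"".join(chunk["data"] for chunk in parts)
```

Sorting by "n" before joining is what makes the read correct even when the chunks of one image are spread over several shard nodes and returned in arbitrary order.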
With the GridFS file storage mechanism, the image data is split into small data chunks and stored in the corresponding nodes. Figure 4 shows the storage mechanism of remote sensing image data.

Experimental Design
The key to the MongoDB distributed cluster is choosing an appropriate shard key. Therefore, to verify the practicability and availability of the proposed method for remote sensing data, several groups of comparison experiments are designed, including remote sensing metadata storage and access under different shard key strategies, and image data storage and access with the GridFS mechanism of MongoDB compared with PostgreSQL.

Experimental Data
The experimental data include remote sensing metadata and image data. In this research, the Metadata Extraction Tool, which was developed using Java programming language, is used to extract the necessary fields and values of the metadata file in the Landsat8 data package and transform them into corresponding fields in the archiving model. There are about 100 million global metadata records from 2013 to 2017, which can be obtained from the USGS website "https://earthexplorer.usgs.gov/". We downloaded the image data from different data centers, and these data include Landsat data, FY data, and Sentinel data. To simplify the expression, the time information and other information after it in the data filename of each data product are replaced with the symbol '*'. Table 2 presents detailed information of each data product.

Experimental Environment
The experimental environment is built on a cluster of three physical computers, and the configuration of each machine is the same: ubuntu-16.04.6 operating system, 16 GB RAM, a 500 GB hard disk, and a 3.20 GHz core CPU. For our experiments, we used MongoDB 4.0.8 and PostgreSQL 11.3 for comparison, which are deployed on the nodes. The configuration of each shard node in the MongoDB cluster is shown in Table 3. The PostgreSQL database is deployed on each node with port number 5432 in the cluster based on the master-slave structure. The node whose IP address is 10.3.102.199 is configured as the master, while the other two nodes are configured as slave 1 and slave 2, and the slave is a replication of the master.

Experimental Principle
For the metadata storage experiment, this paper computes the amount of metadata inserted per second, i.e., the insert speed. For the other experiments, this paper computes the execution time of different commands by running programs. The calculation formulas are as follows.
Formula (1) is applied to calculate the average execution time, while Formula (2) is used to calculate the average insert speed. In Formula (1), T_end represents the end time for executing commands, T_start represents the start time for executing commands, n represents the number of times the same experiment is executed, and T_avg represents the average execution time. To reduce experimental error, each experiment was carried out five times, so n equals five, and the value of T_avg is calculated by averaging the five measurements. In Formula (2), Vol_data represents the data volume, and S_insert represents the insert speed of data.
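One plausible reading of the two formulas, based only on the symbol definitions above, is that T_avg is the mean of (T_end - T_start) over the n runs and that S_insert is Vol_data divided by T_avg. The sketch below encodes that reading; it is an assumption rather than a reproduction of the original formulas, and the function names are hypothetical.

```python
def average_execution_time(start_times, end_times):
    """Mean of (T_end - T_start) over n runs, i.e., the assumed form of Formula (1)."""
    if len(start_times) != len(end_times) or not start_times:
        raise ValueError("start_times and end_times must be non-empty and of equal length")
    n = len(start_times)
    return sum(end - start for start, end in zip(start_times, end_times)) / n


def average_insert_speed(data_volume, avg_time):
    """Vol_data divided by T_avg, i.e., the assumed form of Formula (2)."""
    if avg_time <= 0:
        raise ValueError("avg_time must be positive")
    return data_volume / avg_time


# Five runs (n = 5), each taking 2 seconds, inserting 5000 records per run.
t_avg = average_execution_time([0, 10, 20, 30, 40], [2, 12, 22, 32, 42])
s_insert = average_insert_speed(5000, t_avg)
```

With these illustrative numbers, the average execution time is 2 seconds and the insert speed is 2500 records per second.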

Results and Analysis
In this section, we carry out the experiments with reference to the experimental design in Section 4. The content of this section contains three aspects. The first two are the experimental results of the storage and access performance comparison of the remote sensing data, and the third is the analysis of the results.

Metadata
After the values of the fields in the metadata archiving model are obtained, they are first stored in the MongoDB database to facilitate the other experiments. To study how the shard key strategy influences cluster performance, we compare two shard keys. Considering that, in practical applications, queries on remote sensing data mainly concentrate on a spatial range (latitude and longitude) or an imaging time range, the "TopLeftLatitude" field is chosen as one shard key because it carries spatial information. In addition, the hashed values of "_id" are calculated to establish a second shard key named "_id_hashed". Both the storage experiments and the access experiments are conducted under these two shard keys.

Storage
The storage experiments are conducted by inserting different amounts of metadata into the MongoDB database. The storage mode of the remote sensing metadata is described in detail in Section 3.2. In order to make the experimental results more intuitive and reliable, we have chosen the insert speed under different shard key strategies for comparison in this experiment. The experimental result is shown in Figure 5.

As can be seen from Figure 5, storage performance is affected by the number of metadata records. When "_id_hashed" and "TopLeftLatitude" are chosen as shard keys, the two curves exhibit a similar trend: the insert speed grows at first and then levels off. For the "_id_hashed" shard key, the data are stored randomly and evenly, which guarantees load balancing among the shard nodes in the MongoDB cluster. For "TopLeftLatitude", whose values range from −90 to 90, newly inserted data can be routed to various chunks, and when viewing the data distribution on the shard nodes at this time, we find that the data are distributed evenly among the nodes. However, strictly speaking, storage with "TopLeftLatitude" does not achieve the absolutely random and even distribution of the "_id_hashed" shard key, so the latter yields better storage performance.
In addition, when fewer than 5000 metadata records are inserted, the insert speed under both sharding strategies keeps rising, because the metadata are not yet partitioned at that point and sufficient resources are available in the MongoDB cluster.
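The difference between hashed and ranged placement can be illustrated with a toy simulation. This is only a sketch under stated assumptions, not a model of the experimental cluster: the shard count, the split threshold in documents (real chunks split by size in bytes), the use of md5 in place of MongoDB's hash function, the round-robin dealing of chunks as a stand-in for the balancer, and the synthetic latitudes are all hypothetical.

```python
import bisect
import hashlib
import random

NUM_SHARDS = 3        # hypothetical cluster size
MAX_CHUNK_DOCS = 500  # hypothetical split threshold, counted in documents


def hashed_counts(doc_ids):
    """Hashed sharding: place each record by a hash of its id."""
    counts = [0] * NUM_SHARDS
    for doc_id in doc_ids:
        digest = hashlib.md5(str(doc_id).encode()).hexdigest()
        counts[int(digest, 16) % NUM_SHARDS] += 1
    return counts


def ranged_counts(latitudes):
    """Ranged sharding: keep sorted chunks, split a full chunk at its median,
    then deal the chunks to shards round-robin."""
    chunks = [[]]                   # chunks ordered by key range
    lower_bounds = [float("-inf")]  # lower bound of each chunk
    for lat in latitudes:
        i = bisect.bisect_right(lower_bounds, lat) - 1
        bisect.insort(chunks[i], lat)
        if len(chunks[i]) > MAX_CHUNK_DOCS:
            mid = len(chunks[i]) // 2
            left, right = chunks[i][:mid], chunks[i][mid:]
            chunks[i:i + 1] = [left, right]
            lower_bounds.insert(i + 1, right[0])
    counts = [0] * NUM_SHARDS
    for index, chunk in enumerate(chunks):
        counts[index % NUM_SHARDS] += len(chunk)
    return counts


def imbalance(counts):
    """Ratio of the fullest shard to the emptiest one (1.0 is perfectly even)."""
    return max(counts) / max(1, min(counts))


rng = random.Random(0)
# Synthetic latitudes clustered around 35 degrees, clipped to [-90, 90].
latitudes = [max(-90.0, min(90.0, rng.gauss(35, 20))) for _ in range(10_000)]

by_hash = hashed_counts(range(len(latitudes)))
by_range = ranged_counts(latitudes)
print("hashed:", by_hash, "imbalance:", round(imbalance(by_hash), 3))
print("ranged:", by_range, "imbalance:", round(imbalance(by_range), 3))
```

In this toy model, both strategies spread the records over all shards. Ranged placement depends on where chunk boundaries happen to fall, so it is not guaranteed to be as uniform as hashed placement.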

Access
The access experiments are conducted by retrieving the metadata within a fixed range, with longitude from 75° to 130° and latitude from 20° to 50°; this rectangular area covers many provinces of China. The access performance of the remote sensing metadata under various volumes is tested by executing commands in the program. The query command is:
As Figure 6 shows, the two curves follow different trends. When "_id_hashed" is chosen as the shard key, the access time tends to remain steady as the volume of metadata increases, which means that access performance is only slightly influenced by the metadata volume in this situation.
However, when the "TopLeftLatitude" field is chosen as the shard key, the access time increases along with the volume of metadata, because the metadata are distributed across different shard nodes according to the value of the "TopLeftLatitude" field, and the data are retrieved from among these nodes. In particular, when the metadata volume is 1,000,000, the access time with the "TopLeftLatitude" shard key is 48 times longer than that with "_id_hashed", which indicates that access performance under the "TopLeftLatitude" shard key is affected far more by the metadata volume.
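A rectangular range retrieval of the kind described in this subsection could be expressed as a MongoDB filter document along the following lines. This is a hedged sketch rather than the command used in the experiments: the longitude field name "TopLeftLongitude" is an assumption (only "TopLeftLatitude" is named in the text), and the small `matches` helper exists only so that the filter can be checked locally without a database.

```python
# Bounding box from the text: longitude 75 to 130, latitude 20 to 50 (degrees).
BBOX = {"min_lon": 75, "max_lon": 130, "min_lat": 20, "max_lat": 50}


def bbox_filter(bbox, lat_field="TopLeftLatitude", lon_field="TopLeftLongitude"):
    """Build a filter selecting records whose corner lies inside the box.

    `lon_field` is an assumed name used for illustration.
    """
    return {
        lat_field: {"$gte": bbox["min_lat"], "$lte": bbox["max_lat"]},
        lon_field: {"$gte": bbox["min_lon"], "$lte": bbox["max_lon"]},
    }


def matches(doc, query):
    """Local evaluator for the $gte/$lte subset used above (for checking only)."""
    for field, bounds in query.items():
        value = doc.get(field)
        if value is None or not (bounds["$gte"] <= value <= bounds["$lte"]):
            return False
    return True


query = bbox_filter(BBOX)
inside = {"TopLeftLatitude": 39.9, "TopLeftLongitude": 116.4}   # hypothetical record
outside = {"TopLeftLatitude": 10.0, "TopLeftLongitude": 116.4}  # hypothetical record
print(query)
print(matches(inside, query), matches(outside, query))
```

With a driver such as PyMongo, the same document would be passed to `collection.find(query)`.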

Image Data
To verify the storage and access performance of the MongoDB database for large binary files (namely, the remote sensing image data in this experiment), the relational database PostgreSQL is selected for comparison. PostgreSQL is widely regarded as one of the most powerful open source object-relational database management systems; it supports abundant data types and provides rich interfaces. In this experiment, "files_id" and "n" are selected as the combined shard key to reduce the load pressure on a single shard in the MongoDB cluster. The PostgreSQL table has only two fields: "_id", which is used as the index, and a second field that stores the binary data.
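The role of "files_id" and "n" follows from how GridFS lays out a file: the content is cut into fixed-size chunk documents, each carrying the id of its parent file and its sequence number. The sketch below mimics that layout in plain Python, assuming the default GridFS chunk size of 255 KiB; the file id `scene-0001` and the zero-filled payload are hypothetical stand-ins, and real GridFS ids are typically ObjectId values.

```python
CHUNK_SIZE = 255 * 1024  # default GridFS chunk size in bytes (255 KiB)


def split_into_chunks(files_id, payload, chunk_size=CHUNK_SIZE):
    """Mimic the chunk documents GridFS would store for one file."""
    return [
        {"files_id": files_id, "n": n, "data": payload[offset:offset + chunk_size]}
        for n, offset in enumerate(range(0, len(payload), chunk_size))
    ]


image = bytes(600 * 1024)  # 600 KiB of zero bytes standing in for an image file
chunks = split_into_chunks("scene-0001", image)  # hypothetical file id

for chunk in chunks:
    print(chunk["files_id"], chunk["n"], len(chunk["data"]))

# The combined shard key on the chunks collection, as a key document.
SHARD_KEY = {"files_id": 1, "n": 1}
```

Because every chunk carries both fields, a compound key on ("files_id", "n") lets the chunks of one large file be split across several shards instead of being confined to a single one.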

Storage
With different amounts of remote sensing image data inserted into MongoDB and PostgreSQL, the average storage time of the two databases is computed over multiple experiments with Formula (1) and used for comparison. The experimental result is shown in Figure 7. As can be observed from Figure 7, the time required to store the image data increases with the volume of the remote sensing image data for both MongoDB and PostgreSQL, while the storage performance of MongoDB is relatively more stable. In addition, when storing the same image data file, PostgreSQL consumes more time than MongoDB. The time difference between the two databases is obvious, especially when the inserted image data amount is large. For example, when inserting the data "S3A_OL_1_EFR____2016*", the time that PostgreSQL takes is 1.6 times longer than that of MongoDB. To sum up, MongoDB performs better than PostgreSQL in storing large remote sensing image data.
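Averaging over repeated runs, as done for the storage times above, can be sketched as a small timing harness. The example takes the average to be a plain arithmetic mean, which is an assumption here, and `fake_store` is a hypothetical stand-in for writing one image file to a database.

```python
import time
from statistics import mean


def average_time(operation, repeats=5):
    """Run `operation` several times and return the mean wall-clock time in seconds."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        operation()
        durations.append(time.perf_counter() - start)
    return mean(durations)


def fake_store():
    """Hypothetical stand-in for storing one image file."""
    bytearray(1024 * 1024)  # allocate 1 MiB to give the timer something to measure


avg = average_time(fake_store, repeats=5)
print(f"average storage time: {avg:.6f} s")
```

The same harness applies to access times by passing a read operation instead of a write.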

Access
After all the remote sensing image data are inserted into MongoDB and PostgreSQL, the access experiments are conducted. The average access time of the two databases is computed over multiple experiments with Formula (1) and used for comparison. The experimental results are displayed in Figure 8. Figure 8 shows that, as the remote sensing image data amount grows, the time required to access data grows in both MongoDB and PostgreSQL, with the access time of PostgreSQL growing relatively faster. In addition, when accessing the same image data file, PostgreSQL consumes more time than MongoDB. For instance, when accessing the data "LT05_L1GS_123046_*", the time that PostgreSQL takes is 2.8 times longer than that of MongoDB. Thus, MongoDB performs better than PostgreSQL in accessing large remote sensing image data.
Compared with the method proposed by Wang and Hu [16], which managed LiDAR point cloud data in the MongoDB database with two shard key strategies, "files_id" and "n", we used image data of the same file sizes as in that work to conduct the comparison experiments. Here, we label the method proposed in this paper as "Method 1" and the other as "Method 2". The experimental results are presented in Figure 9. As can be observed from Figure 9, no matter which shard key is applied, Method 1 consumes less time in accessing data than Method 2. For instance, compared with Method 2, the data access time with "files_id" decreases by 60.9%, 52.9%, 39.7%, 31%, 58.6%, and 69.4%, respectively, under different data sizes, which indicates that Method 1 outperforms Method 2.
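The percentage decreases quoted above are presumably relative reductions with respect to Method 2. A minimal sketch of that arithmetic follows; the two timings are hypothetical values chosen only to reproduce a 60.9% reduction and are not measurements from Figure 9.

```python
def percent_reduction(time_method1, time_method2):
    """Relative decrease of Method 1's time with respect to Method 2, in percent."""
    if time_method2 <= 0:
        raise ValueError("baseline time must be positive")
    return (time_method2 - time_method1) / time_method2 * 100


# Hypothetical timings in seconds (not taken from the paper).
t_method1, t_method2 = 3.91, 10.0
reduction = percent_reduction(t_method1, t_method2)
print(round(reduction, 1))
```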

Analysis
Through the above experiments, the following conclusions can be drawn by analyzing the experimental results:
• The MongoDB database uses the WiredTiger storage engine, which stores the data as disk files. When there are sufficient memory resources to cope with the storage and access requests in the cluster, higher performance can be obtained. Moreover, the network bandwidth will also influence the storage and access performance of the MongoDB cluster.
• From the perspective of shard key strategy, the paper chooses two different shard keys to conduct the experiments, though neither of them can guarantee optimal performance in both storage and access. Therefore, when designing the shard key strategy, practicality should be considered.
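To make the memory point concrete, the default size of the WiredTiger internal cache can be estimated from a node's RAM: by default it is the larger of 50% of (RAM − 1 GB) and 256 MB. The sketch below applies that rule to a few hypothetical node sizes; the memory configuration of the experimental cluster is not stated here, so the figures are illustrative only.

```python
GIB = 1024 ** 3
MIB = 1024 ** 2


def default_wiredtiger_cache_bytes(total_ram_bytes):
    """Default cache size: the larger of 50% of (RAM - 1 GB) and 256 MB."""
    return max(int(0.5 * (total_ram_bytes - GIB)), 256 * MIB)


for ram_gib in (1, 4, 16):  # hypothetical node memory sizes
    cache = default_wiredtiger_cache_bytes(ram_gib * GIB)
    print(f"{ram_gib} GiB RAM -> {cache / GIB:.2f} GiB cache")
```

Data that do not fit in this cache must be served from the filesystem cache or from disk, which is one reason why available memory affects storage and access performance.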

Conclusions
Faced with the deficiencies of traditional relational database management systems in managing massive remote sensing data, namely slow reading and writing speed, difficult horizontal expansion, and low query efficiency, this paper proposes a distributed storage and access method for massive remote sensing data using the MongoDB sharding cluster architecture, with structured metadata stored in the form of documents and unstructured image data stored with the GridFS mechanism. The results show that the proposed method can overcome the weak points of traditional methods, scale out the database, and is more suitable for managing massive remote sensing data. For future work, we plan to study the influence of the number of data nodes on the performance of the distributed system.

Patents
We have submitted an application for an invention patent resulting from the work reported in this paper to the National Intellectual Property Administration, PRC, and now this patent is open to the public. The patent name is "A distributed storage method for large-scale remote sensing data based on MongoDB", and the patent application number is 201910585556.4.
Author Contributions: Shuang Wang proposed the main idea and wrote this paper. Shuang Wang, Guoqing Li, and Xiaochuang Yao conceived and designed the paper. Xiaochuang Yao and Lushen Pang provided key guidance on the implementation of the distributed cluster architecture. Yi Zeng and Lianchong Zhang collected the data and participated in the experiment. All authors read and approved this paper.
Funding: This research was funded by the Strategic Priority Research Program of the Chinese Academy of Sciences, grant number XDA19020103.