MHBase: A Distributed Real-Time Query Scheme for Meteorological Data Based on HBase

Abstract: Meteorological technology has evolved rapidly in recent years to provide enormous, accurate and personalized advantages in public service. Large volumes of observational data are generated continuously by technologies such as geographical remote sensing and meteorological radar satellites, which makes data analysis in weather forecasting more precise but also challenges traditional methods of data storage. In this paper, we present MHBase (Meteorological data based on HBase (Hadoop Database)), a distributed real-time query scheme for meteorological data. The calibrated data obtained from terminal devices are partitioned into HBase and persisted to HDFS (the Hadoop Distributed File System). We propose two algorithms (the Indexed Store Algorithm and the Indexed Retrieve Algorithm) that implement a secondary index using HBase coprocessors, which allows MHBase to provide high-performance queries on columns other than the rowkey. Experimental results show that the performance of MHBase can satisfy the basic demands of meteorological business services.


Introduction
With the evolution of Internet technology, the amount of data generated globally is beyond appraisal. In 2010, the global online and offline data size reached 1.2 Zettabytes (1 Zettabyte = 1024 EB = 1024*1024 PB) and grew to four Zettabytes in 2013 [1]. According to predictions by the International Data Corporation (IDC), the total amount of data is expected to reach eight Zettabytes by the end of this year [2], and more than 35 Zettabytes of data will be generated by the end of this decade [3]. Data collected from meteorological fields amount to hundreds of terabytes every year. Collected data include weather information recorded at 10-minute intervals from more than 40,000 stations across the country; these data represent about 30% of the total meteorological data [4]. Weather satellite remote sensors and Doppler radar produce terabytes of data daily and play an important role in real-time weather forecasting. Conventional architecture, based on IOE (IBM, Oracle and EMC), is centralized and dedicated, and requires expensive high-end equipment when dealing with such enormous data.
Open source cloud platforms, represented by Apache Hadoop, use horizontally scaling distributed architectures to make Big Data storage economical. The Hadoop Distributed File System (HDFS) [5] is a distributed file system designed to run on commodity hardware, while HBase (Hadoop Database) [6] is an open source, non-relational, distributed database modeled after Google's BigTable [7] and built on HDFS. The combination of Hadoop and HBase enables reliable and expandable data storage.
It is common to use self-designed architectures to solve specific problems in specialized fields, especially in meteorology. For example, NASA (National Aeronautics and Space Administration) uses Hadoop and Hive in RCMES (the Regional Climate Model Evaluation System) [8] to provide services and analysis based on observational data, and also uses Sqoop [9] to migrate data. In another NASA project, MERRA/AS (Modern Era Retrospective-analysis for Research and Applications Analytic Services) [10], Cloudera's CDH cluster is used to store 80 TB of data to provide data sharing and data analysis services.
In this paper, we propose a platform called MHBase (Meteorological data based on HBase) for meteorologically structured data, in order to satisfy the business requirements of safe storage and to improve the query efficiency of large meteorological data. Our other meteorological operational systems already use Hadoop to store data and rely on other components of the "Hadoop ecosystem", such as Hive and ZooKeeper; thus, HBase is the best choice, as its features are closely combined with Hadoop. Our system builds on existing techniques such as distributed architecture and data partitioning strategy. Several experiments have been performed to show the performance improvement of our optimized method and combined architecture. The reliability of data is ensured by the multi-copy mechanism of HDFS. In accordance with conventional query cases on HBase, we designed the rowkey and realized a secondary index.
The rest of this paper is organized as follows. Section 2 introduces related work and analyses some existing index schemes. In Section 3, we present two algorithms and detail the design and implementation of MHBase. We evaluate performance by experimentation in Section 4. Finally, we present conclusions and future work in Section 5.

Distributed Architecture
Apache Hadoop is a framework that allows distributed processing of massive data across clusters of computers using simple programming models. Its distributed file system, HDFS, is the open source implementation of GFS (the Google File System) [11]. HDFS follows a master-slave structure, in which multiple datanodes are under the control of one namenode. The namenode is mainly responsible for managing metadata about where files are stored and preserves the mappings between files and the corresponding data blocks in memory. Datanodes store the actual data and periodically report block lists back to the namenode. Hadoop is therefore highly fault-tolerant and designed to be deployed on low-cost hardware. It also provides high-throughput access to application data and is extremely suitable for applications that have large data sets.
HBase is an open-source, distributed, non-relational database [12] built on top of HDFS. Data are stored in HBase as (key, value) pairs, where the key represents the row identifier and the value contains the row attributes. Data in HBase are modeled as a multidimensional map in which a value is located by four required parameters, tablename, rowkey, family, and timestamp, and one optional parameter, qualifier. The key-value pair can be expressed as:

value = Map(tablename, rowkey, family, [qualifier], timestamp)

These data are ultimately stored in HDFS, which shields the heterogeneity of the underlying system and makes the cluster more reliable, while the consistency of data is ensured by ZooKeeper [13], a centralized service for maintaining configuration information, naming, and providing distributed synchronization and group services. Figure 1 demonstrates the relationship between Hadoop and HBase. There is a root table named "-ROOT-" and a catalog table named ".META." in HBase. -ROOT- keeps track of where the .META. table is, while the .META. table keeps a list of all regions in the HBase cluster. To guarantee that any entry can be located in at most three lookups, -ROOT- is never split. All these HBase tables are persisted into HDFS. HBase ensures load balancing via both horizontal and vertical data partitioning and, based on that, uses -ROOT- and .META. to locate specific entries.
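To make the multidimensional map concrete, the following is a minimal sketch, not HBase client code, of the logical model value = Map(tablename, rowkey, family, [qualifier], timestamp) described above, with versions kept per cell; all names and data values here are illustrative.

```python
# Illustrative simulation of the HBase logical data model: a nested map
# addressed by (table, rowkey, family, qualifier), with one value per
# timestamp (version) inside each cell.

store = {}

def put(table, rowkey, family, qualifier, timestamp, value):
    """Insert one cell version."""
    store.setdefault((table, rowkey, family, qualifier), {})[timestamp] = value

def get(table, rowkey, family, qualifier, timestamp=None):
    """Return the value at an exact timestamp, or the newest version."""
    versions = store.get((table, rowkey, family, qualifier), {})
    if not versions:
        return None
    if timestamp is None:
        return versions[max(versions)]  # latest version wins
    return versions.get(timestamp)

put("weather", "210044001_2014010100", "w_info", "rainfall", 1, "0.0")
put("weather", "210044001_2014010100", "w_info", "rainfall", 2, "3.2")
```

A `get` without a timestamp returns the most recent version, mirroring HBase's default read behavior.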

Horizontal Partitioning
As shown in Figure 2, a table is divided into HRegions ("region" for short, the smallest unit of distributed storage in HBase) that are stored on different HRegionServers (also called RegionServers (RS)). The rowkey is used to address all columns in one single row, and rows are kept sorted, backed by a Log-Structured Merge Tree (LSM-Tree) [14], to speed up searching. A region grows as the volume of data increases and is split into two equal daughter regions when a particular threshold is reached. One RegionServer is composed of several regions, but one region cannot be placed on multiple RegionServers; that is to say, reads and writes of one data entry will always be served by the same RegionServer.
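The sorted key ranges mean any rowkey maps to exactly one region. A hypothetical sketch (start keys, server names, and widths are our own assumptions, not from the paper) of that routing:

```python
import bisect

# Each region owns a half-open rowkey range [startkey, next_startkey);
# a single rowkey is therefore served by exactly one region and hence
# one RegionServer.

region_start_keys = ["", "100000000", "200000000", "300000000"]  # sorted
region_servers = ["rs1", "rs2", "rs1", "rs3"]  # region index -> hosting RS

def locate_region(rowkey):
    """Find the region whose key range holds rowkey (binary search)."""
    idx = bisect.bisect_right(region_start_keys, rowkey) - 1
    return idx, region_servers[idx]
```

This is the same lookup HBase performs via .META.: a binary search over sorted region start keys.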

Vertical Partitioning
Each HRegion is split into individual Stores in which data under the same family persist (see Figure 3), and each Store contains one MemStore and multiple StoreFiles. Data are first written into the MemStore and then flushed to StoreFiles when a predefined threshold is reached. From a logical viewpoint of the table, data are split horizontally across regions and then divided vertically into multiple Stores according to the families.

Meteorological Data Storage
With huge amounts of data being generated periodically in meteorological fields, the conventional IBM, Oracle and EMC architectures can no longer cope; hence, cloud computing has become the best way to store data. Aji at Emory University presented Hadoop-GIS [15], a scalable and high-performance spatial data warehousing system for running large-scale spatial queries on Hadoop. Hadoop-GIS supports multiple types of spatial queries on MapReduce through a spatial query engine called RESQUE, which utilizes both global and customizable local spatial indexing to achieve efficient query processing. Xie and Xiong [16] implemented a linear quad-tree retrieval structure in HBase, which uses MapReduce both to insert data and create the index and to retrieve the index structure. Chuanrong Zhang from the University of Connecticut introduced a parallel MapReduce approach [17] for improving the query performance of geospatial ontology for disaster response; the approach focuses on parallelizing the spatial join computations of GeoSPARQL queries. All these existing systems work well for batch computation, but MapReduce jobs take too long to satisfy the demands of real-time systems. Although efforts are being made to improve the real-time capability of distributed meteorological systems, the most efficient and effective method is yet to be developed.

Optimizing
As HBase only offers a rowkey-based index [18], it can quickly look up specific records by rowkey thanks to the rowkey dictionary order. The only way to query other columns is to use a filter to perform a full-table scan, which degrades performance. As in an RDBMS, a meteorological application cannot retrieve all its data using only a primary key. Index schemes for HBase have therefore always been a hot topic in open source communities. IHBase [19] is a region-level index structure that uses Bitmaps [20] to organize indexes and is suitable for read-only scenarios. Furthermore, IHBase stores indexes in memory; therefore, indexes must be rebuilt every time a RegionServer crashes or restarts, and this rebuild operation takes much time to complete. Some major companies provide their own enterprise solutions for optimizing HBase. For example, Cloudera [21] integrates Solr with HBase to present a much more flexible query and index model over HBase datasets; however, this system is complicated to deploy and throws exceptions unless the servers are properly tuned. The Intel® Distribution for Apache Hadoop* Software [22] includes almost the entire Hadoop ecosystem, from the base project HBase to the data mining project Mahout. This release enhances the secondary index of HBase, but it must be deployed on nodes that have more than two quad-core CPUs, 16 GB RAM and no RAID disks.
Huawei's recent contribution [23], which uses coprocessors to implement a secondary index, is recognized as the best solution and has attracted the attention of the open source community. Ge proposed CinHBa [24], whose index model is similar to Huawei's contribution, to provide high-performance queries on non-rowkey columns. Lastly, Apache Phoenix [25] has integrated this scheme internally since version 4.1 and also provides long-term technical support. In this paper, we use this idea as a reference to propose a real-time query scheme for meteorological applications.

Proposed Design
In this section, we introduce the design of MHBase from two aspects. First, we discuss the features of data partitioning to identify the main factors in meteorological systems that should be taken into consideration when designing the table. Secondly, we introduce the coprocessor-based index model in detail. The architecture of MHBase is illustrated in Figure 4. Our index model is built as an enhancement on top of HBase, with a server-side component containing two extending instances of Observer to manage the indexes; the main idea of the model is to use an index table to store the values of all indexed columns to realize the secondary index.

Data Partitioning
The Hadoop ecosystem is highly extensible; however, unlike in an RDBMS, data distribution in HBase is determined at design time [26]. HBase can be configured to use different policies for data partitioning, but for the sake of simplicity, we use the default approach and optimize system performance by designing the tables' structure.
During the horizontal partitioning phase, undesirable RegionServer hotspotting will occur at write time if the rowkey starts with "observation time". HBase sorts records lexicographically by rowkey, which allows quick access to individual records; but since the observation time of each new record is always larger than that of existing records, write operations are always directed to the region at the upper bound of the key space, overloading that RegionServer in particular. To circumvent this issue, we salted the rowkey in the user table by prefixing the "station number" in front of the observation time. Meanwhile, the rowkey's length should be fixed in order to improve retrieval performance.
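A minimal sketch of this salted, fixed-width rowkey layout; the field widths, the time format, and the trailing data-type digit are assumptions for illustration (the experiments in Section 4 use a comparable station + time + type layout):

```python
from datetime import datetime

# Fixed-width rowkey: station number first, then observation time, so
# concurrent writes from different stations scatter across regions
# instead of all landing in the highest time range.

STATION_WIDTH = 9            # assumed width, e.g. "210044001"
TIME_FORMAT = "%Y%m%d%H"     # year, month, day, hour

def make_rowkey(station_number, observed_at, data_type="0"):
    """Build a fixed-length rowkey: station + observation time + type."""
    station = str(station_number).zfill(STATION_WIDTH)
    ts = observed_at.strftime(TIME_FORMAT)
    return station + ts + data_type

key = make_rowkey("210044001", datetime(2014, 1, 1, 6))
```

Because every component has a fixed width, any digit position in the rowkey has a known meaning, which also simplifies prefix scans.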
In vertical partitioning, for a logical table in HBase, the data of each family persist in a file called a Store, which is physically separate. Thus, at design time we put related attributes into the same family so that they are stored together on disk. For example, we put regular meteorological parameters such as rainfall, temperature, and clouds into a family named "w_info", while the remaining unrelated information is stored in a separate family. This improves the system's performance because families that are not relevant to the current query are not read [27]. A Store will be split into two when it grows larger than a given threshold. The formula below shows how this threshold is set by the default split policy.
split.threshold = min(max.size, R² × flush.size)

The split threshold is defined as the minimum of: (1) a constant value, the Store's maximum size; and (2) a function of the number of regions on the corresponding RegionServer (R) and the memstore flush size.
At the beginning, the second term is the smaller one, so regions split at a faster pace regardless of max.size. The split threshold then increases with each split until it exceeds max.size, after which the splitting threshold remains constant.
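The growth-then-saturation behavior of the threshold can be sketched directly from the formula; the concrete sizes below (10 GB max Store size, 128 MB flush size) are assumed defaults for illustration, not values stated in the paper:

```python
# split.threshold = min(max.size, R^2 * flush.size): small tables split
# quickly, then the threshold saturates at max.size.

MAX_SIZE = 10 * 1024**3     # hypothetical max Store size (10 GB)
FLUSH_SIZE = 128 * 1024**2  # hypothetical memstore flush size (128 MB)

def split_threshold(num_regions):
    """Split threshold for a table with R regions on one RegionServer."""
    return min(MAX_SIZE, num_regions ** 2 * FLUSH_SIZE)
```

With these assumed sizes, the first region splits at 128 MB, the threshold grows quadratically with the region count, and from R = 9 onwards it stays pinned at the 10 GB maximum.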
Physically, each family is stored in its own Store; therefore, the pace of horizontal partitioning depends on the vertical design. As shown in Figure 5, for a given amount of data, the larger the number of families, the harder it becomes for any Store to reach its splitting threshold. Conversely, if each family contains more attributes, the splitting threshold is reached faster. With the aim of reducing communication cost across different RegionServers, we require that one server host both the regions of the actual table (user regions for short) and those of the corresponding index table (index regions for short), so that only the co-located index regions need to be scanned when querying user regions through an index. To achieve this goal, we used the enhanced load balancer presented by Chrajeshbabu [23] to collocate the user regions and corresponding index regions.

Index Model
This section introduces the optimization of HBase based on coprocessors. There is an index utility, which can be used to manage the life cycle of an index, as well as an application programming interface (API), which serves queries based on secondary indexes. Developers can connect their applications to a cluster and execute query operations through the API, just as they would when using native HBase directly.

Coprocessor
Coprocessors are divided into two categories: the Observer, whose function is similar to a database trigger, and the Endpoint, which is similar to a stored procedure in an RDBMS. A coprocessor can execute custom code in parallel at the region level, much like a lightweight MapReduce, and makes it possible to extend server-side functionality. Furthermore, the Observer provides rich hook functions and, for this reason, some developers have attempted to realize indexes in JIRA reports [28,29]. In this paper, we add IndexMasterObserver and IndexRegionObserver, two extending instances of Observer, to manage the indexes.
IndexMasterObserver is used to manage table-level operations on the master, such as automatically creating and deleting index tables, while IndexRegionObserver performs specific data manipulations at the region level. Table 1 shows the main functions in IndexRegionObserver. When a user region is split and a daughter region is placed onto another RegionServer, the startkey of the upper daughter region remains the original one, but that of the other daughter changes, so additional work has to be done by the coprocessor to rewrite the affected index rowkeys. These rewrite operations are triggered when regions are compacted after a region split.
The main steps are as follows: (1) get the original rowkey row_orig, the split rowkey row_split, and the lengths of their region startkeys len_orig_start and len_split_start; (2) calculate the length len_remain = length(row_orig) − len_orig_start; (3) allocate a new byte[len_split_start + len_remain]; (4) copy the new region startkey and the rest of the original rowkey into the new array.
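The four steps above can be sketched as follows; this is an illustrative rendering of the byte-array rewrite (the function and parameter names are ours), not the paper's Java coprocessor code:

```python
# Replace the old region-startkey prefix of an index rowkey with the
# daughter region's startkey, keeping the remainder of the rowkey intact.

def rewrite_index_rowkey(row_orig: bytes, len_orig_start: int,
                         split_start: bytes) -> bytes:
    # (2) length of the part after the old region startkey
    len_remain = len(row_orig) - len_orig_start
    # (3) allocate the new array: new startkey + remainder
    out = bytearray(len(split_start) + len_remain)
    # (4) copy the new region startkey, then the rest of the original rowkey
    out[:len(split_start)] = split_start
    out[len(split_start):] = row_orig[len_orig_start:]
    return bytes(out)
```

The Java original would use System.arraycopy over byte arrays; the slicing here performs the same copies.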

Algorithms
In this part, we propose two algorithms, the ISA (Indexed Store Algorithm) and the IRA (Indexed Retrieve Algorithm), to overcome the limitation of native HBase, which can only perform a full table scan with filters to retrieve exact rows when querying on columns other than the rowkey. Although adding indexes improves random read performance, we still need to examine and minimize the potential impact on write operations.

Indexed Store Algorithm (ISA)
The idea of the storage algorithm is to use an indexed table to store the values of all indexed columns.The indexes will be updated at the time of data insertion.For one row, a server-side coprocessor will match the input with predefined indexed columns.If found, it generates a new rowkey corresponding to the old one by prefixing the region's address and index name and then puts it into the indexed table.In order to improve the performance, the put requests will be cached and then handled as a batch-put, even if there is only one Put.These puts will eventually be inserted into user table and index table, respectively.
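The index-rowkey construction described above can be sketched as follows. The index names and column identifiers are hypothetical placeholders, and a plain string concatenation stands in for the region address + index name + value + user rowkey layout:

```python
# For each put into the user table, the coprocessor matches the input
# columns against predefined indexed columns and, for every match,
# builds a corresponding index-table rowkey.

INDEXED_COLUMNS = {"w_info:rainfall": "idx_rain",
                   "w_meta:c_Number": "idx_cnum"}  # assumed index names

def build_index_puts(region_startkey, user_rowkey, row):
    """row: {'family:qualifier': value}. Returns index-table rowkeys."""
    index_puts = []
    for column, value in row.items():
        index_name = INDEXED_COLUMNS.get(column)
        if index_name is None:
            continue  # unindexed column: no index entry needed
        index_puts.append(region_startkey + index_name + value + user_rowkey)
    return index_puts
```

In MHBase these generated puts would be cached and written to the index table as one batch alongside the user-table put.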
The indexed store algorithm is shown as follows.

The retrieve algorithm, in turn, is roughly divided into two steps. In the first step, we set a flag depending on whether the selected conditions D contain unindexed columns. In the second step, we take a random seek via the index table for indexed columns, use filters on unindexed columns to obtain result sets, and then merge the result sets according to the flag and the logical relationship OP ('AND'/'OR'). Finally, the coprocessors scan the sets and return the actual results to the client. Note that a full table scan is still required if the predicate OP is 'OR' and the query conditions contain unindexed columns. This case is beyond the scope of this paper because it does not take advantage of indexes; nevertheless, the handling of this circumstance is given in our algorithm.
The indexed retrieve algorithm is shown as follows.
The columns in the query condition must first be checked on the server side. In the first step, the server computes idxcol, the intersection of colList (all qualifiers in query condition D) and indexList (the indexes predefined in the index table). If the size of idxcol is smaller than that of colList, the condition contains unindexed columns, so we set the flag to True. After that, we use IndexRegionObserver, the enhanced coprocessor subclass described above, to traverse idxcol and decide which index each column belongs to. In each iteration, a column is matched with its index and a new key is generated that contains the index name and the field value. Next, IndexRegionObserver creates a scanner prefixed with that key, which determines the start and stop locations on each region, and retrieves each region accordingly.
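The set-combination logic of the IRA can be sketched as below. This is a hedged, in-memory simulation of the algorithm's control flow (the index "scan" is a dictionary lookup standing in for a prefixed region scan), not the server-side implementation:

```python
# IRA sketch: random-seek indexed columns, combine per-condition result
# sets by OP, and fall back to filtering for unindexed columns.

def indexed_retrieve(conditions, op, index_table, indexed_cols, user_table):
    """conditions: {column: value}; op: 'AND' or 'OR'."""
    idxcol = [c for c in conditions if c in indexed_cols]
    flag = len(idxcol) < len(conditions)   # unindexed columns present?
    result_sets = []
    for col in idxcol:
        key = (col, conditions[col])       # stands in for the scanner prefix
        result_sets.append(set(index_table.get(key, [])))
    if op == "AND":
        rows = set.intersection(*result_sets) if result_sets else set()
        if flag:  # filter the survivors on the unindexed columns
            rows = {r for r in rows
                    if all(user_table[r].get(c) == v
                           for c, v in conditions.items() if c not in idxcol)}
    else:  # 'OR' with unindexed columns would need a full scan (see text)
        rows = set.union(*result_sets) if result_sets else set()
    return rows
```

The 'AND' branch mirrors steps (14)-(18) of the listing (intersection, then filtering when the flag is set); the 'OR' branch mirrors the union at step (20).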

Experiments and Analyses
In this section, we conduct a comprehensive assessment of MHBase through experiments. Our experiments are conducted on a four-machine cluster, in which each machine uses default configurations in order to reduce the influence of configuration settings on performance. We deploy MHBase based on Hadoop-1.0.4 with an embedded ZooKeeper as the cluster-management component. One machine hosts the master node and the other three machines are RegionServers, all connected via gigabit Ethernet. Specific machine parameters and configuration settings are listed in Table 2. We take weather data from 1949 to 2014 and choose the seventeen most commonly used attributes from the 120 fields included in observed site information. They are classified into two categories: identity fields and observed fields. In our experiments, we design the rowkey as the combination of "station number", "observation time" (including year, month, day and hour) and "data type". The length of each component of the rowkey is fixed so that the meaning of each digit can be identified conveniently. Other identity fields, such as "country number", "longitude", "latitude" and "station style", are stored in one column family, and we use another column family to store the observed fields. The table structure is described in Table 3. Indexes have been built on 14 of the 17 columns. Our goal is to migrate a business platform from an RDBMS to a cloud environment in order to improve the price/performance ratio; therefore, we compare MHBase and MySQL over different dataset sizes (records).
We selected ten common queries, including querying by "observation time", "country number", etc., and the reported performance for each case is the average of five runs. Figure 6 shows the average time consumed by MySQL and MHBase. We discontinued testing MySQL without indexes because its query time already exceeded seven seconds at five million rows. In the table structure, family "w_meta" holds identity qualifiers, for example: w_meta:c_Number = +086, w_meta:LON = 118°46′40.0″E, w_meta:LAT = 32°03′42.0″N, w_meta:style = GROUND; family "w_info" holds the observed qualifiers maxTEMP, minTEMP, stationPRESS, clouds, windDirection, windSpeed, relativeHumidity, rainfall, observeYear, observeMonth, observeDay, observeHour and imgURL. As Figure 7 shows, the execution time of importing the dataset into HBase and MHBase is stable. Although the performance of MySQL with indexes is excellent on small datasets, it begins to decline as the data scale increases, and its data import speed also goes down significantly. Nevertheless, MHBase retrieves records via coprocessors at region level, and the rowkeys of an HBase table are in lexicographical (alphabetical) order, so its time consumption grows in a roughly linear trend. Note that the performance of MHBase also becomes better than that of MySQL with indexes from 18 million rows onwards, which implies that MHBase is more efficient in dealing with large-scale data. We already know that HBase trades away read efficiency on fields other than the primary key: the only way to search columns other than the rowkey is a full table scan with a filter. As a consequence, we also compare MHBase with HBase filters within 150,000 rows, and the result in Figure 8 shows that filters cost more time because they scan the full HBase table, while the time of MHBase can be regarded as constant. An optimized query-by-filter method is more efficient: create a scanner with a prefix match first, and then filter data on other columns within the range of a start key and a stop key. For example, if
we want all records generated from the station with number "210044001", we can set "210044001" as the start key and "210044001" + "/0xffff" as the stop key and then use filters to retrieve columns other than the rowkey, because our rowkey starts with the station number and observation time. However, this scenario is impractical and lacks generality, so we only use this special query case to illustrate the performance of MHBase. Figure 9 shows the time of the same query case for MHBase and HBase (query by prefix). Prefix-matched queries can locate the range of the target region directly and accurately, after which filters work within that specific range. MHBase, in contrast, uses coprocessors to get records in parallel; the essence of an indexed query is one prefix match plus one rowkey-get operation, so it is slower than retrieving directly by rowkey. The spike between two million and three million rows in the figure occurs because region splitting causes the I/O load to rise. The response times of both methods are close to each other at millions of rows, but the variance of MHBase retrieval times increases. We also use YCSB (the Yahoo! Cloud Serving Benchmark) [30] to conduct stress tests on both MHBase and HBase under different pressures. As shown in Figure 10a, the throughput of MHBase is reduced and is lower than that of native HBase when inserting data with ten concurrent threads; its write performance is not as good because MHBase needs to write to the index table synchronously. In Figure 10b, there is almost no difference in the 100% read case. Generally, both throughputs improve as the pressure (number of threads) increases.
As our experimental results show, with the benefit of indexes, MHBase greatly improves query speed, even if there is a slight performance penalty on write operations, and provides varied, high-performance data access on non-primary keys. MHBase achieves a balance between read and write performance because data import in our meteorological system is offline. On the other hand, we use HDDs and other commodity PC equipment as the base installation instead of expensive commercial equipment such as SSDs or RAID, so the price/performance ratio is improved. In a nutshell, MHBase is able to manage and maintain indexes effectively, making queries more efficient while meeting the requirements of meteorological applications.

Conclusions and Future Work
In this paper, we proposed MHBase, a distributed real-time query scheme for meteorological data based on HBase, which aims to satisfy query performance demands in the face of huge amounts of meteorologically structured data. We explained the influence of the split strategy on different datasets and developed two algorithms (the ISA and the IRA) to implement indexing with coprocessors. Finally, our findings were verified by experimentation and the results were compared. Our design achieves a balance between read and write performance and greatly improves query performance. All these findings are evidence that MHBase is of practical significance in meteorological applications compared to an RDBMS. In our future work, we will apply this solution in domains other than meteorology.

Abbreviations
IHBase: Indexed HBase
HDFS: Hadoop Distributed File System
RDBMS: Relational Database Management System
PAPD: a project funded by the Priority Academic Program Development of Jiangsu Higher Education Institutions

Figure 5. Effect of vertical partitioning on region splits.

Figure 6. Query time on the weather dataset.

Figure 7. The time of data import.

Figure 8. Performance at different data sizes.

Figure 9. Time of index query and prefix query.

Table 1. The main functions in IndexRegionObserver.
Algorithm 2. The Indexed Retrieve Algorithm (IRA)
Input:
  D: the query conditions
  OP: the logical relationship, meaning the result matches all conditions or any one condition
Output:
  F: the key-value entry set
Intermediate:
  T_i: all entries that meet one of the search conditions
  L: intermediate results according to T and OP
Procedure:
(1): set flag = False
(2): get all columns in D as colList
(3): get all predefined indexes in the index table as indexList
(4): get the indexed columns list: idxcol = colList ∩ indexList
(5): if idxcol.size < colList.size then
(6):   set flag = True
(7): end if
(8): for each idxcol_i ∈ idxcol do
(9):   determine the index column of idxcol_i
(10):  generate a new rowkey for idxcol_i
(11):  set the startkey and stopkey of the scanner on the basis of the changed rowkey
(12):  get results by the changed rowkey and add them into T_i
(13): end for
(14): if OP = Equal_All then
(15):   L = ∩_i T_i
(16):   if flag = True then
(17):     filter L using the unindexed columns in D and refresh L
(18):   end if
(19): else if OP = Equal_One then
(20):   L = ∪_i T_i

Table 2. Configurations of the cluster.

Table 3. HBase table structure of the data.