A Cloud-Based Framework for Large-Scale Log Mining through Apache Spark and Elasticsearch

Abstract: The volume, variety, and velocity of different data, e.g., simulation data, observation data, and social media data, are growing ever faster, posing grand challenges for data discovery. An increasing trend in data discovery is to mine hidden relationships among users and metadata from web usage logs to support the data discovery process. Web usage log mining is the process of reconstructing sessions from raw logs and finding interesting patterns or implicit linkages. The mining results play an important role in improving the quality of search-related components, e.g., ranking, query suggestion, and recommendation. Although research has been conducted in the data discovery domain, collecting and analyzing logs efficiently remains a challenge because (1) the volume of web usage logs continues to grow as long as users access the data; (2) the dynamic volume of logs requires on-demand computing resources for mining tasks; and (3) the mining process is compute-intensive and time-intensive. To speed up the mining process, we propose a cloud-based log-mining framework using Apache Spark and Elasticsearch. In addition, a data partition paradigm, LogPartitioner, is designed to solve the data imbalance problem in data parallelism. As a proof of concept, oceanographic data search and access logs are chosen to validate the performance of the proposed parallel log-mining framework.


Introduction
Spatial data portals (SDPs) serve the Earth science community with massive geospatial data [1]. However, as the volume and variety of the data are increasing faster than ever, they pose a great challenge for SDPs to provide reliable and quality service [2]. An emerging trend in data discovery is mining user behaviors from logs for the latent linkages between users and data [3]. Taking the National Aeronautics and Space Administration (NASA) Physical Oceanography Distributed Active Archive Center (PO.DAAC) as an example, a solution was proposed to improve oceanography data discovery and access by mining user behavior data, called the Mining and Utilizing Dataset Relevancy from Oceanographic Dataset (MUDROD) [2]. The MUDROD engine manages logs and converts them into a series of sessions and domain-knowledge-like oceanographic vocabulary linkages based on Elasticsearch [4,5]. Elasticsearch is a component of the ELK stack and provides a solution to automatically index, search, analyze, and visualize logs with Logstash and Kibana. Elasticsearch can search data and ingest logs with simple operations, e.g., filtering Uniform Resource Locator (URL) with the aid of Logstash. Kibana enables users to analyze and visualize logs.
As an evolving open-source project, Elasticsearch provides rich resources and Application Programming Interfaces (APIs) for customizable search functionalities. We can efficiently implement log analysis functionalities using existing resources provided by the ELK stack. However, the ELK stack is oriented toward efficient search rather than high-performance analysis. A straightforward way to improve the performance of a log-mining framework built on Elasticsearch is to adopt a parallel processing framework. Big data frameworks, e.g., Apache Spark and Hadoop, have been widely used to handle big data in recent years. Spark can accelerate the data analysis process with data parallelism and provides in-memory computing capability. To efficiently mine knowledge from web usage logs and make historical logs searchable and visualizable, we propose a cloud-based framework for large-scale log mining using Spark and Elasticsearch. The framework can efficiently reconstruct sessions from offline logs and learn knowledge from user behavior data by making full use of advanced features provided by Spark and Elasticsearch, e.g., data parallelism and full-text indexing. In addition, logs and extracted sessions are searchable on the platform. An information flow graph describes how data flow through the framework, and a log partitioner is designed to further speed up the log-mining process. We use PO.DAAC web usage log mining to demonstrate and evaluate the proposed framework and methods.

Related Work
As an important complementary component of the data discovery engine, log mining discovers implicit knowledge from user behavior data by learning meaningful patterns, trends, and linkages. For example, Google explores user logs to capture what the world is searching for and publishes yearly trend reports (https://trends.google.com/trends/), which showed that the two most popular phrases searched in the United States in 2018 were "world cup" and "Hurricane Florence". Joachims et al. proposed the RankSVM algorithm to improve the ranking performance of a search engine by tracking user behavior on the ranking list. The assumption is that if a user clicked a link with a lower rank, the link is more relevant to the user's search than other links with a higher rank [6]. In the GIScience domain, researchers also pay close attention to log mining. Logs were collected and analyzed to assist search engines to understand queries [7], rank data intelligently [8], analyze a traveler's moving patterns for route recommendation [9], discover semantic relationships among geospatial vocabularies for domain ontology population [5], discover interesting spatiotemporal theme patterns [10], etc.
Sessionization is the process of splitting a collection of logs into multiple sessions, each of which records a user's behavior on a website during a time period. It serves as the first and fundamental step of log mining. However, it is quite a challenge to reconstruct sessions when the user identifier is missing in the raw logs. Multiple methods were proposed to group logs into sessions relying on time interval, referrer, or both [4,11,12]. In previous research, we proposed a method of reconstructing sessions from multiple logs to build a semantic knowledge base [4,5]. The knowledge extracted from logs is valuable, but the speed of the log-mining process needs improvement. For instance, for the small vertical search portal PO.DAAC, the log-mining framework built on Elasticsearch took more than one hour to finish a monthly log analysis. Thus, we propose a cloud-based framework to speed up log mining with Apache Spark and Elasticsearch.
Logs are continually produced in different formats and at different rates, and can uncover useful information through proper and effective analysis. Cloud computing and parallel computation frameworks were introduced to support log analysis, as they can manage and analyze large data produced at a high rate [13,14]. The distributed computing paradigm can provide on-demand storage and computing resources for log mining [15,16]. Parallel computation frameworks, e.g., Apache Spark and Hadoop MapReduce, improve data analysis performance through data parallelism. Lin et al. proposed a unified cloud platform with batch analysis and in-memory computing capacity by combining Hadoop and Spark [16]. Therdphapiyanak et al. applied Hadoop for large-scale log analysis to efficiently detect abnormal traffic from high-volume data [17].
Given the significance of log analysis, several open-source and commercial solutions were designed for log collection, storage, search, analysis, and visualization. The ELK stack, i.e., Elasticsearch, Logstash, and Kibana, automatically collects, indexes, aggregates, and visualizes logs [18]. Prakash et al. efficiently geo-identified the website user traffic through logs using ELK stack [19]. Bagnasco et al. used the Elasticsearch ecosystem to monitor the Infrastructure as a Service (IaaS) and scientific application on the cloud, keeping track of usage and dynamic allocation of resources [20].
Both Spark and Elasticsearch have advantages and disadvantages for log mining. Spark is capable of handling big data, but it is not designed for log analysis. Implementing a log analysis platform from scratch with Spark would require a significant amount of development. Elasticsearch can index logs and search specific records from billions of log records instantly; however, it only provides basic analysis functions, e.g., monthly aggregation and total count. Meaningful hidden patterns cannot be automatically detected by the ELK stack. Furthermore, Elasticsearch is a search engine, not a parallel analysis framework. A trend is to integrate Spark and Elasticsearch to implement an efficient and scalable log management and analysis system. Metha et al. proposed a streaming architecture based on ELK, Spark, and Hadoop to perform anomaly detection on network connection logs in near real time [21]. Li et al. introduced a method to speed up log mining with Elasticsearch and Spark; however, Elasticsearch and Spark work independently in their framework [22]. Specifically, Spark is only used to split logs and distribute them to multiple machines prior to the log-mining process. In this research, we propose a log-mining framework integrating both Elasticsearch and Spark, and design a log partitioner approach to solve the skewed data problem. The framework aims to (1) present a platform which integrates Spark and Elasticsearch to make full use of the functionalities offered by both; and (2) serve as a guide on how to speed up Elasticsearch-based data analysis with Spark.
This paper introduces our research on utilizing cloud, Spark, and Elasticsearch in an interactive and optimized fashion to mine interesting patterns from log files. In addition to the introduction (Section 1) and literature review (Section 2), Section 3 introduces the logs used in the experiment. Section 4 delineates the proposed log-mining framework and the log-partitioning algorithm. Section 5 details the experiment, and Section 6 interprets and evaluates the experimental results. Section 7 reviews the framework and discusses future work.

Data
Logs of both Hypertext Transfer Protocol (HTTP) and File Transfer Protocol (FTP) are used in our research. For a data archive portal, HTTP logs record a user's search and click behavior when he/she visits the website. Generally, each HTTP log record in the combined log format consists of a client Internet Protocol (IP) address, request date/time, the page requested, HTTP code, user agent, and referrer (Figure 2). The combination of client IP and user agent could identify a unique user. The date indicates when the user visited the page [23]. FTP logs track the download operations on all archived datasets. Similarly, the FTP log has IP and time fields (Figure 3). HTTP logs, combined with FTP logs, record a user's activities on the website, e.g., which collection-level metadata were viewed after a user submitted a query and which granule-level data the user downloaded. Through the log-mining process, such knowledge could be mined from the raw HTTP and FTP logs (Figure 1). Other popular types of logs in geospatial data services, such as Open-source Project for a Network Data Access Protocol (OPeNDAP) [24], are not included in our research, but the framework can support these log types with minor modification.
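To make the field layout concrete, the following is a minimal sketch of parsing one HTTP record in the combined log format and deriving the (IP, user agent) pair used to identify a user. The regular expression, the function name `parse_http_log`, and the sample line (including its address, URL, and dataset name) are illustrative assumptions, not taken from the PO.DAAC logs.

```python
import re
from datetime import datetime

# Combined log format:
# host ident authuser [time] "request" status bytes "referrer" "user-agent"
COMBINED_LOG = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)

def parse_http_log(line):
    """Parse one combined-format line into a dict, or return None if it does not match."""
    match = COMBINED_LOG.match(line)
    if match is None:
        return None
    record = match.groupdict()
    record["time"] = datetime.strptime(record["time"], "%d/%b/%Y:%H:%M:%S %z")
    record["status"] = int(record["status"])
    # The text identifies a unique user by client IP combined with user agent.
    record["user"] = (record["ip"], record["agent"])
    return record

# Hypothetical sample line, made up for illustration only.
SAMPLE_LINE = (
    '192.0.2.10 - - [01/Feb/2014:10:15:32 -0800] '
    '"GET /dataset/EXAMPLE_DATASET HTTP/1.1" 200 5120 '
    '"http://example.org/datasetlist?search=sea+level" "Mozilla/5.0"'
)

if __name__ == "__main__":
    print(parse_http_log(SAMPLE_LINE))
```

FTP records could be handled the same way with a pattern matching their own layout, since they share the IP and time fields.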

Proposed Framework Architecture for Log Mining
After an extensive study of the open-source tools supporting log mining, we propose an architecture integrating Elasticsearch with Spark for log mining. Elasticsearch, a component of the ELK stack ecosystem, provides a series of log management functionalities including log ingestion, indexing, querying, statistical analysis, and visualization. Spark is an in-memory parallel computing framework for big data. The goal of this framework is to utilize existing resources and tools to establish a flexible and scalable system for the entire log-mining process. The Elasticsearch ecosystem serves the framework as a log management system, in which logs are indexed for query, visualization, and statistical analyses. Spark accelerates the mining process through parallel computing. Figure 4 shows the system architecture. A cloud infrastructure, e.g., Amazon Web Services (AWS) [25] or OpenStack [26], automatically and efficiently launches virtual resources such as virtual machines and network configurations. Once a virtual machine starts, a series of open-source tools cooperate to manage and analyze logs. The Hadoop Distributed File System (HDFS) stores massive logs in a distributed manner; Elasticsearch provides the capabilities of indexing and searching logs in near real time; Spark accelerates log-mining subtasks through data parallelism. An image can be created in advance to facilitate the installation of log-mining clusters on demand, including an HDFS cluster, a Spark cluster, and an Elasticsearch cluster.
The framework indexes and analyzes multi-source, heterogeneous logs through user identification, crawler detection, session reconstruction, and knowledge extraction. User identification recognizes all users from billions of logs. The IP, or the IP combined with other fields, e.g., the user agent, identifies a user. Crawler detection recognizes crawlers by matching well-known robots and by visiting rate. Session reconstruction splits the logs belonging to one user into small groups, each of which records the user's behavior over a relatively short time. Knowledge extraction summarizes a user's behavior in a session, including the datasets the user clicked or viewed [4]. With the framework, large volumes of raw logs can be efficiently converted to knowledge, e.g., vocabulary linkage or metadata linkage, which can be adopted to improve search engines through better ranking, recommendation, and query understanding [2,5,8].
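As an illustration of the session reconstruction step, the sketch below splits one user's request timestamps into sessions whenever the gap between consecutive requests exceeds a threshold. This is only the time-interval variant; the 30-minute default and the function name `split_sessions` are assumptions made for the example, and the referrer-based rules mentioned in the related work are not modeled.

```python
from datetime import datetime, timedelta

def split_sessions(timestamps, max_gap=timedelta(minutes=30)):
    """Group one user's request times into sessions separated by inactivity gaps.

    max_gap is an assumed threshold, not a value reported in the paper.
    """
    sessions = []
    for ts in sorted(timestamps):
        # Continue the current session if the request is close enough in time
        # to the previous one; otherwise start a new session.
        if sessions and ts - sessions[-1][-1] <= max_gap:
            sessions[-1].append(ts)
        else:
            sessions.append([ts])
    return sessions

if __name__ == "__main__":
    start = datetime(2014, 2, 1, 10, 0, 0)
    visits = [start + timedelta(minutes=m) for m in (0, 5, 50, 55)]
    for number, session in enumerate(split_sessions(visits), 1):
        print(number, [t.strftime("%H:%M") for t in session])
```

Because each user's logs are processed independently, a function of this kind can be applied per user in parallel, which is the property the framework relies on.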

Information Flow
Logs flow through different stages to the final mining results. The raw log collection stage gathers multi-source logs and migrates them into the HDFS. Then, logs are loaded by Spark and indexed into Elasticsearch in parallel. During the ingestion process, Elasticsearch builds a full content index of logs for better access. Then, a list of unique users is recognized from the ingested logs through the Elasticsearch term aggregation operation, which aggregates data based on a search query. For each user in the list, all logs belonging to that user are aggregated into a visiting rate to determine whether the user is a crawler. If the user is a crawler, all his/her logs are discarded to reduce noise. Otherwise, the user's logs are kept for further analysis. The real user list is fed to Spark for identifying sessions and extracting knowledge in parallel. As shown in Figure 5 and described above, data flow between Spark and Elasticsearch during the whole process to fully leverage the advantages of both.
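The user recognition and crawler filtering stages can be sketched as follows. The request body uses the standard Elasticsearch terms aggregation syntax, and the response is read from the usual `aggregations` / `buckets` structure. The field name `ip`, the aggregation name `users`, the bucket limit, the rate threshold, and all function names are assumptions for illustration; no cluster is contacted, and the rule based on well-known robot names is omitted.

```python
def user_count_query(field="ip", max_users=10000):
    """Build a terms-aggregation request body that counts logs per unique value of `field`."""
    return {
        "size": 0,  # return no individual hits, only the aggregation result
        "aggs": {"users": {"terms": {"field": field, "size": max_users}}},
    }

def parse_user_counts(response):
    """Turn a terms-aggregation response into a {user: log_count} map."""
    buckets = response["aggregations"]["users"]["buckets"]
    return {bucket["key"]: bucket["doc_count"] for bucket in buckets}

def drop_crawlers(user_counts, period_minutes, max_rate=30.0):
    """Keep users whose average requests per minute stay at or below max_rate.

    max_rate is an assumed threshold, not a value reported in the paper.
    """
    return {
        user: count
        for user, count in user_counts.items()
        if count / period_minutes <= max_rate
    }

if __name__ == "__main__":
    # Hand-written response in the shape Elasticsearch returns for a terms aggregation.
    mock_response = {
        "aggregations": {
            "users": {
                "buckets": [
                    {"key": "192.0.2.1", "doc_count": 120},
                    {"key": "192.0.2.2", "doc_count": 90000},
                ]
            }
        }
    }
    counts = parse_user_counts(mock_response)
    print(drop_crawlers(counts, period_minutes=60))
```

The resulting map of real users and their log counts is the input a balancing partitioner would need before the work is handed to Spark.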

Log Partitioner
In the information flow, a pattern occurs repeatedly in which all users are recognized by Elasticsearch and partitioned into small groups. This repetitive process for user-level analyses, e.g., constructing sessions from a user's logs, can be performed in parallel. In the process, term aggregation, one of the most beneficial features that Elasticsearch offers, returns the unique terms for a given field along with the count of matching documents on the fly [27]. Given the IP field, term aggregation returns all IPs in the log and the count of logs associated with each IP. Apache Spark divides the IP list into sub-datasets and distributes them to different computing nodes for the same analyses. HashPartitioner, the default partitioner in Spark, assigns IPs to different groups according to the hash value of the IP address. In the log-mining process, the Spark driver splits all users into k partitions, each of which is passed to an executor and run as a task. Although the HashPartitioner splits the user list evenly using the hash code of the IP address, the log size in each task (partition) differs significantly because the partitioner ignores the fact that different users behave differently. Users who interact with the website frequently produce many logs, while those who seldom visit produce few. As a result, the workloads are quite different, which is known as the skewed data problem (Figure 6). Because the runtime of a Spark application depends on the slowest task, skewed data leave most tasks waiting for one slow task, which becomes the bottleneck of the whole process. To solve the problem, we propose a customized partitioner called LogPartitioner to balance logs across Spark tasks.
Appl. Sci. 2019, 9, x FOR PEER REVIEW 6 of 14
As shown in Figures 7 and 8, the LogPartitioner algorithm consists of the following steps:

1. Calculate log records for each user. A user map stores each user and the corresponding log count as (user_1, log_count_1), (user_2, log_count_2), ..., (user_k, log_count_k). The calculation module is implemented with the Elasticsearch term aggregation API, and its performance depends on the cluster configuration and the number of unique IPs in the log. Generally, the runtime increases linearly with the number of logs; thus, the time complexity is approximately $O(n)$.

2. Sort the user map by log count in descending order. The higher a user is ranked in the map, the more frequently the user uses the website. The time complexity of the sort operation is $O(n \log_2 n)$.

3. Split users into k groups using the greedy algorithm [28]. The top k users in the sorted user map are assigned to the k groups in order. Each remaining user is then assigned to the group that currently holds the fewest logs, until all users are assigned to the k groups. A user group map stores the split strategy in the format (user_i, group_1), (user_j, group_1), ..., (user_k, group_n). Note that the greedy algorithm yields a locally optimal grouping; dynamic programming solutions can also be adopted to split users. In addition, to maximize the usage of all computation resources, k is suggested to be equal to, or a multiple of, the total number of Central Processing Unit (CPU) cores managed by Spark. The greedy algorithm is straightforward, and its time complexity is linear in the number of users, $O(n)$, for a fixed k.

4. Load all users into Spark and distribute user data across Spark tasks with a customized partitioner. The partitioner receives the user map and the partition number as input and assigns each user to a partition using the corresponding group identifier instead of the hash value of the user IP. As a result, the logs are nearly evenly distributed across all partitions, which prevents significantly slow tasks and improves the log-mining performance.
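Steps 2 and 3 above can be sketched in a few lines. The version below is an illustrative reading of the greedy strategy, not the authors' implementation: it keeps the groups in a min-heap keyed by current log total, so each assignment costs $O(\log k)$ rather than a scan over all groups. The function names `assign_groups` and `group_loads` and the sample counts are hypothetical.

```python
import heapq

def assign_groups(user_counts, k):
    """Greedily assign users to k groups so that log totals stay balanced.

    user_counts: {user: log_count}, e.g. built from a terms aggregation.
    Returns a {user: group_id} map with group ids in range(k).
    """
    # Step 2: rank users by log count, heaviest first.
    ranked = sorted(user_counts.items(), key=lambda item: item[1], reverse=True)
    # Step 3: min-heap of (current_log_total, group_id); all groups start empty,
    # so the top k users land in groups 0..k-1 in order.
    heap = [(0, group) for group in range(k)]
    heapq.heapify(heap)
    user_group = {}
    for user, count in ranked:
        total, group = heapq.heappop(heap)  # group with the fewest logs so far
        user_group[user] = group
        heapq.heappush(heap, (total + count, group))
    return user_group

def group_loads(user_counts, user_group, k):
    """Sum the log counts that end up in each group."""
    loads = [0] * k
    for user, group in user_group.items():
        loads[group] += user_counts[user]
    return loads

if __name__ == "__main__":
    counts = {"a": 100, "b": 60, "c": 50, "d": 30, "e": 10}
    mapping = assign_groups(counts, k=2)
    print(mapping, group_loads(counts, mapping, k=2))
```

For step 4, a lookup such as `lambda user: mapping[user]` could serve as the partition function of a key-value RDD, for example through PySpark's `RDD.partitionBy(numPartitions, partitionFunc)`, so that each user is routed by group identifier rather than by hash. That wiring is not shown here because it depends on how the user data are keyed in the actual job.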
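
Step 4 replaces the hash of the key with a lookup in the user group map. In PySpark, `RDD.partitionBy(numPartitions, partitionFunc)` accepts such a function and places each key in partition `partitionFunc(key) % numPartitions`. The sketch below builds one from a precomputed group map and checks it locally without Spark. The names `make_log_partitioner` and `user_group`, the sample records, and the hashing fallback for users missing from the map are assumptions for illustration and are not taken from the MUDROD code base.

```python
import zlib


def make_log_partitioner(user_group, num_partitions):
    """Build a key -> partition function from a precomputed user group map."""

    def partition_func(user):
        group = user_group.get(user)
        if group is None:
            # Assumed fallback: users missing from the map are hashed.
            group = zlib.crc32(user.encode())
        return group % num_partitions

    return partition_func


# Hypothetical group map, e.g. the output of a greedy grouping step.
user_group = {"10.0.0.1": 0, "10.0.0.2": 1, "10.0.0.3": 2, "10.0.0.4": 2}
partition_func = make_log_partitioner(user_group, num_partitions=3)

# Local check without Spark: bucket (user, log) pairs the way partitionBy would.
records = [("10.0.0.1", "GET /a"), ("10.0.0.3", "GET /b"), ("10.0.0.4", "GET /c")]
buckets = {}
for user, log in records:
    buckets.setdefault(partition_func(user), []).append(log)
print(buckets)

# With a SparkContext `sc` available, the same function could be passed as:
#   sc.parallelize(records).partitionBy(3, partition_func)
```

Keeping the group map outside the partition function means the grouping strategy can be swapped, for example for a dynamic programming split, without touching the Spark job.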

Experiments
In this research, the NASA Advanced Information System Technologies (AIST) cloud platform was used to set up a testbed for the framework efficiently. An image with the required software, e.g., HDFS, Spark, and Elasticsearch, was prepared in advance so that a log-mining cluster could be created on demand. Whenever the cluster needs more computing resources to process growing logs, a new instance can be started from the preconfigured image. In our experiments, six virtual machines were launched for the log-mining process. Each machine was configured with four CPU cores, 16 GB of memory, and a 2.4-GHz clock speed (Figure 9). Operations provided by the AIST cloud platform make it straightforward to group these virtual machines into a cluster.
The entire 2014 PO.DAAC logs were used to test the performance of the framework. The log volume was >32.8 GB and the logs consisted of >154 million records in total. Figure 10 shows the distribution of logs across the 12 months of 2014. The user traffic changed dramatically from month to month; in February, there were 6,727,710 (~7 million) logs in total and the log volume was >3.5 GB.
Three experiments were conducted to evaluate the performance of the framework. First, the February 2014 monthly log was randomly selected and processed with the HashPartitioner and the LogPartitioner separately to validate the proposed partitioner. Second, the entire 2014 logs were processed by the proposed framework to test performance; the log-mining algorithm implemented only on Elasticsearch with the same procedure was used for comparison. Third, the monthly log was processed with multiple clusters consisting of different numbers of workers to validate the scalability of the framework. Finally, we discuss the evaluation results.

Comparison of the HashPartitioner and the LogPartitioner
To demonstrate the validity of the proposed LogPartitioner, the February 2014 monthly logs were parallelized separately through the HashPartitioner and the LogPartitioner. Table 1 shows the time spent on each log-mining step. Initially, it took more than one hour (~3873 s) to finish the entire mining process with Elasticsearch alone. With the proposed framework running on a cluster of four worker nodes, the processing time decreased from 3873 s to 589 s when the logs were split with the default HashPartitioner. Applying the proposed LogPartitioner further reduced the processing time to 310 s. The cluster had one master node and four worker nodes, on each of which an executor ran tasks with four CPU cores, so 16 tasks were executed in parallel. Figure 11 shows the sum of log processing time on each executor in the crawler detection step. The dark line displays the runtime of executors when users were grouped with the HashPartitioner. Executor 10 worked much longer than the other executors, indicating that most executors had to wait for the slow executor. The light line represents the processing time of all executors after introducing the LogPartitioner. All executors finished their tasks at almost the same time, and no executor needed to wait extensively for the others. Thus, the processing time was significantly reduced by the proposed LogPartitioner.
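
The reported February 2014 runtimes translate directly into speedup factors. The short calculation below uses only the three totals quoted in this section; the variable names are illustrative.

```python
# Total mining time for the February 2014 logs, in seconds (from the text).
elasticsearch_only = 3873
spark_hash = 589   # framework with the default HashPartitioner
spark_log = 310    # framework with the proposed LogPartitioner

speedup_hash = elasticsearch_only / spark_hash
speedup_log = elasticsearch_only / spark_log
gain_over_hash = spark_hash / spark_log

print(f"HashPartitioner vs. Elasticsearch only: {speedup_hash:.1f}x")
print(f"LogPartitioner vs. Elasticsearch only:  {speedup_log:.1f}x")
print(f"LogPartitioner vs. HashPartitioner:     {gain_over_hash:.1f}x")
```

The three factors come out at roughly 6.6x, 12.5x, and 1.9x, so balancing the partitions nearly halves the runtime on the same hardware.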

Figure 11. The sum of task runtime on each executor (axis: runtime in seconds).

Comparison of Performance with and without the Framework
The entire 2014 PO.DAAC logs were processed with a cluster consisting of five worker nodes. For comparison, the logs were processed with and without the proposed framework. With the framework, the log-mining process took about one hour to process the entire logs; with Elasticsearch alone, it took more than 10 h. Monthly processing time was not positively correlated with the log size because some records generated by robots in the raw logs were discarded during the mining process (Figure 12). Figures 13 and 14 show the time spent on each stage of the log-mining process without and with the framework, respectively. The log ingestion time was significantly reduced since logs were read from the file system and written into Elasticsearch in parallel. The crawler detection and session reconstruction steps were also accelerated with the support of the proposed framework.

Scalability of the Proposed Framework
To demonstrate the scalability of our proposed framework, multiple clusters were set up with one master node and different numbers of worker nodes ranging from one to five. On each worker node, an executor performed the log-mining tasks with a maximum of four CPU cores. In experiments 1 and 2, only two CPU cores were in use because the processing speed exceeded the log ingestion speed if all cores were allocated. The partition number was set to 112 in experiment 4 and 120 in all other experiments to make sure the partition number was a multiple of the number of CPU cores. Table 2 shows that the processing time decreased as the worker number increased, and the results remained the same.

Table 2. Experiment settings and results (column headers inferred from the surrounding text).
| Experiment | Worker nodes | Executors | CPU cores | Partitions | Processing time (s) | Results |
| 1 | 1 | 1 | 2 | 120 | 2056 | 1073 |
| 2 | 2 | 2 | 4 | 120 | 1055 | 1073 |
| 3 | 3 | 3 | 12 | 120 | 386 | 1073 |
| 4 | 4 | 4 | 16 | 112 | 317 | 1073 |
| 5 | 5 | 5 | 20 | 120 | 281 | 1073 |
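
Speedup and parallel efficiency relative to the smallest configuration can be derived from Table 2. The following sketch is an illustrative calculation rather than part of the original evaluation. It assumes that the fourth and sixth columns of Table 2 are the CPU cores in use and the processing time in seconds, and it takes experiment 1 (two cores) as the baseline; the tuple layout is a choice made for this example.

```python
# (experiment, CPU cores in use, processing time in seconds), read from Table 2.
runs = [(1, 2, 2056), (2, 4, 1055), (3, 12, 386), (4, 16, 317), (5, 20, 281)]

_, base_cores, base_time = runs[0]
rows = []
for exp, cores, seconds in runs:
    speedup = base_time / seconds     # measured gain relative to experiment 1
    ideal = cores / base_cores        # gain expected under perfect scaling
    efficiency = speedup / ideal
    rows.append((exp, cores, speedup, efficiency))
    print(f"exp {exp}: {cores:2d} cores, speedup {speedup:4.2f}x, "
          f"efficiency {efficiency:.0%}")
```

Under these assumptions the efficiency falls from about 97% at four cores to about 73% at twenty cores, which is the usual pattern when fixed costs such as ingestion and scheduling stop shrinking with added workers.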

Sample Mining Results
Through the log-mining workflow, 11,360 sessions were reconstructed from the entire 2014 raw logs. Figure 15 shows an example of a reconstructed session. The tree structure makes it easy to track which dataset a user clicked or downloaded after performing a query. In this session, the user downloaded granule data from the "recon_sea_level_ost_l4_v1" and "alt_tide_gauge_l4_ost_sla_us_west_coast" collections after searching for "sea surface topography". Such information could be used for calculating vocabulary linkage to maintain a live knowledge base for ranking, recommendation, and query suggestion [2,5]. Figure 16 lists statistical analysis results for a selected month, e.g., the most popular user input keys, the most popular datasets, and session duration.
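
A reconstructed session of this kind can be represented as a small tree in which a query node has the datasets the user acted on as descendants. The sketch below is a hypothetical data structure for illustration only: the `SessionNode` class, the action labels, and the exact nesting are assumptions, and only the query string and the two collection names come from the example session.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class SessionNode:
    action: str                      # assumed labels: "query", "download"
    target: str                      # query text or dataset short name
    children: List["SessionNode"] = field(default_factory=list)

    def add(self, action, target):
        """Attach a child request and return it so calls can be nested."""
        child = SessionNode(action, target)
        self.children.append(child)
        return child

    def downloads(self):
        """Collect every dataset downloaded at or below this node."""
        found = [self.target] if self.action == "download" else []
        for child in self.children:
            found.extend(child.downloads())
        return found


query = SessionNode("query", "sea surface topography")
query.add("download", "recon_sea_level_ost_l4_v1")
query.add("download", "alt_tide_gauge_l4_ost_sla_us_west_coast")

# Query -> downloaded dataset pairs are the raw material for vocabulary linkage.
pairs = [(query.target, dataset) for dataset in query.downloads()]
print(pairs)
```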

Discussion and Conclusions
Log mining is the process of reconstructing sessions from raw web logs and discovering meaningful usage patterns and implicit linkages. The mining result is valuable in improving the data discovery experience. We proposed a cloud-based framework for large-scale log mining by integrating Elasticsearch and Spark. A customized partitioner named LogPartitioner solves the skewed data issue in the parallel log-mining process. The cloud platform supplies the framework with virtual resources on demand. Experiments demonstrated that the processing time of one month of logs decreased from 60+ min to ~5 min with the framework. To share the framework with the public, the research results were published as a GitHub open-source project (https://github.com/apache/incubator-sdap-mudrod).
Several aspects are worth further investigation. First, since the framework cannot analyze real-time logs, a real-time data streaming tool, e.g., Apache Spark Streaming, can be introduced into the framework for near real-time analysis and tracking [29]. Second, the network transfer cost of the mining process should be evaluated experimentally for the case in which the Elasticsearch and Spark clusters are integrated as a single entity. Third, although the framework reported in this article focuses on log data mining, a customized partitioner can also accelerate the analysis of other types of data, e.g., geo-tagged tweets. We will continue this research to improve the framework with more specific performance-improving techniques.