RedEdge: A Novel Architecture for Big Data Processing in Mobile Edge Computing Environments

Abstract: We are witnessing the emergence of new big data processing architectures due to the convergence of the Internet of Things (IoT), edge computing and cloud computing. Existing big data processing architectures are underpinned by the transfer of raw data streams to the cloud computing environment for processing and analysis. This operation is expensive and fails to meet the real-time processing needs of IoT applications. In this article, we present and evaluate a novel big data processing architecture named RedEdge (i.e., data reduction on the edge) that incorporates a mechanism to facilitate the processing of big data streams near the source of the data. The RedEdge model leverages mobile IoT devices, termed mobile edge devices, as primary data processing platforms. However, when computational and battery power resources are unavailable, it offloads data streams to nearby mobile edge devices or to the cloud. We evaluate the RedEdge architecture and the related mechanism within a real-world experiment setting involving 12 mobile users. The experimental evaluation reveals that the RedEdge model can reduce big data streams by up to 92.86% without compromising energy and memory consumption on mobile edge devices.


Introduction
Cloud computing systems provide highly virtualized computing, networking, and storage services on top of massively parallel distributed systems [1][2][3]. However, clouds were initially introduced as utility computing models to fulfill the processing requirements of enterprise applications [1]. The voluminous and high-speed data streams in IoT-based big data systems increase the network traffic of the cloud, which challenges its big data management capabilities [4]. Recent literature provides some evidence that different data reduction methods enable big data reduction in clouds [5]. These data reduction methods are mainly based on: (1) network theory, whereby graph mapping and optimization algorithms reduce high dimensional big data streams into low dimensional datasets [6,7]; (2) compression algorithms, which reduce the volume of network traffic [8,9]; (3) data deduplication, which eliminates redundant and duplicated data [10,11]; (4) feature extraction and data filtration, which reduce the data streams at early stages [12,13]; and (5) data mining and machine learning techniques, which help with data reduction at early stages of big data processing through preprocessing [14,15] and prediction. Existing methods are applied for big data reduction in the context of clouds. However, there exists an opportunity to reduce big data streams even before they enter the cloud.
Considering IoT-cloud communication models and the big data generated by mobile edge devices and applications, cloud-centric big data processing results in increased latency and incremental data transfer cost. In addition, it increases the in-network data movement inside the cloud [16][17][18]. Recently, mobile edge cloud computing (MECC) emerged as a solution to enable the extension of centralized cloud services to the edge of the network through edge servers [19][20][21][22][23]. These edge servers reside at one-hop communication distances from mobile edge devices (see Figure 1); hence, they can meet the real-time needs of IoT applications. However, the decision of where to process data across the layers of MECC depends on many factors, such as the capability of devices in MECC, the availability of these devices, the application profile (e.g., real time) and the data analytic tasks employed by the application. Hence, moving data processing from the cloud to MECC is not a trivial task. Based on the computational capabilities of mobile edge devices [24,25], we envision a novel data processing architecture called RedEdge (i.e., the term is derived from the process of data reduction on the edge) [26]. The RedEdge model employs the mobile edge devices as a data reduction platform. When computational and battery power resources are unavailable at one mobile edge device, the data stream and its processing are offloaded to nearby mobile edge devices within the MECC environment. RedEdge thus transforms MECC into a mobile edge collaborative platform. If the nearby mobile edge devices also lack the required resources, the MECC system offloads data streams to the cloud.

This article contributes a novel big data processing architecture for MECC systems. The RedEdge architecture employs a novel big data reduction technique whereby a data stream-mining algorithm processes the data and uncovers knowledge patterns, stores the resultant data in local storage on mobile edge devices and synchronizes it with cloud data stores. To this end, we propose a middleware architecture that utilizes the computational power of MECC systems and embeds three layers of data reduction in existing big data systems. The first layer reduces the data stream strictly in the same mobile edge device where the data sources reside. The second layer reduces the data stream by forming an ad hoc network of nearby mobile edge devices and enabling collaborative data processing among connected devices. The third layer harnesses the cloud resources in order to reduce the data streams. To assess the performance of the RedEdge architecture, we conducted a real-world experimental study by recruiting 12 graduate students from the University of Malaya, Malaysia, and ran the experiments for 15 days. The experimental evaluation was performed in terms of memory consumption, battery power utilization, latency and reduced bandwidth utilization.
The rest of the article is organized as follows. Related works are presented in Section 2, followed by the problem statement in Section 3 and the RedEdge architecture in Section 4. Section 5 presents the formal modelling and analysis of the RedEdge architecture. The experimental evaluation and a discussion of the overall results are given in Section 6. Finally, the article concludes with Section 7.

Related Work
Traditionally, big data reduction is performed after storing data streams in large-scale clusters and data centres or clouds. A variety of methods are applied for data reduction; these mainly involve methodologies relevant to network theory, compression, data deduplication, dimension reduction, data preprocessing, and data mining and machine learning.
Network theory-based methods convert unstructured, high dimensional and complex data streams into low dimensional structured data [6,7,27,28]. They extract topological structures from data streams and map them onto graph data structures. These methods further apply graph processing techniques to establish and optimize the relationships among mapped data. The optimized structures are represented as scale-free networks, small-world networks and random networks. The network theory-based methods are useful for data reduction; however, they require laborious effort and substantial computational resources in order to find highly optimized datasets.
Compression-based data reduction reduces the overall volume of big data so that it can be handled more easily during in-network data movement in clusters and data centres [8,9,29,30]. However, these methods involve the computational overhead of decompression. Although they preserve the original datasets, compression-based methods cannot improve the quality of big data for data analytics. A few common compression-based big data reduction methods include gzip, parallel compression, anamorphic stretch transform, sketching, compressed sensing, spatio-temporal compression and adaptive compression.
Big data storage in cloud data centres is performed in highly duplicated settings whereby multiple copies of the same datasets are stored in different storage servers in the same rack of servers or different servers across the clusters [10,11,31,32]. The data duplication is performed to meet the service level agreements (SLAs) for high availability; however, this costs extra storage space and computational power for data processing. Therefore, big data systems need to perform cluster level and node level data deduplication in order to eliminate redundant datasets and improve the quality of data for big data analytic algorithms.
Big data reduction during data preprocessing is a suitable method for early reduction [12,33,34,35]. Early data processing helps with reducing the data storage cost, as well as the computational cost that may be incurred at later stages. Existing literature discusses a variety of big data preprocessing methods, such as semantic analysis of big datasets using linked data structures and ontologies, data filtration using URL filtration methods, low memory pre-filters for streaming data and 2D peak detection methods. Although a few works adopted existing conventional methods, there exists a research gap to find new data preprocessing methods for big data reduction.
High dimensionality in big datasets arises due to the emergence of thousands or millions of attributes, and it is the norm rather than the exception in the case of big datasets. Researchers adopted dimension reduction methods in order to deal with the curse of high dimensionality [36][37][38][39][40][41].
The dimension reduction methods process the high dimensional unstructured big datasets and convert them into low dimensional structured datasets. Researchers proposed a few dimension reduction methods such as dynamic quantum clustering, BIGQuic, the map-reduce implementation of k-means clustering algorithms, online feature selection, tensor networks and optimization, feature hashing, critical feature dimension reduction approaches and incremental partial least square methods. Although feasible for big data reduction, the dimension reduction methods require a massive amount of computational resources.
Data mining and machine learning algorithms are another variant of big data reduction methods [14,42,43]. These methods process the data streams by applying supervised, unsupervised, semi-supervised and deep learning models. The data mining and machine learning methods are quite useful due to early knowledge discovery from big data streams. The methods are useful for real-time big data analytics where multiple learning models at different levels of big data systems filter the data streams and uncover the knowledge patterns in parallel.
The literature review reveals that existing big data reduction methods work as cloud-centric approaches [5]. However, our RedEdge architecture adopts the IoT device-centric approach for big data reduction. RedEdge provides support for the deployment of data mining and machine learning-based big data reduction schemes. The architecture extends the capabilities of our previous work presented in [26,44]. This article presents a detailed discussion on the big data reduction strategy. In addition, the article presents formal verification and validation of the overall functionalities using Petri nets. Moreover, a thorough experimental evaluation of the RedEdge architecture is presented in this article.

The Case for a Multi-Layer Far-Edge Computing Architecture for Big Data Reduction
Let us consider the five-layer IoT reference architecture of fog computing systems introduced by Cisco (see Figure 2) [23,45]. The physical layer at the lowest level facilitates data acquisition from mobile edge devices using onboard and offboard sensory and non-sensory data sources. The communication layer at the second level enables connectivity and data transfer from mobile edge devices to fog servers. The data aggregation layer provides functionality to aggregate data streams from connecting devices and performs data filtration operations in order to transfer useful raw data streams to clouds. The analytics layer ensures the availability of data analysis services through cloud service providers. Finally, the application layer provides functionalities to interact with IoT applications. Existing reference architectures have multiple issues at each layer [20,46]. Firstly, the mobile edge devices perform data collection operations and transfer raw data streams to edge servers. This data collection strategy increases the cost of data communication between mobile edge devices and fog servers. Secondly, fog servers are bounded by physical locations; therefore, mobile edge devices need to be in proximity to benefit from cloud services. Thirdly, big data processing and analytic components are provided through centralized services, hence increasing the computational burden in clouds. Fourthly, the IoT applications are built on top of clouds; therefore, fog computing architectures involve high coupling among application components at different layers. In addition, this requires persistent Internet connections to benefit from IoT applications.
Considering these limitations, we propose a new middleware architecture for IoT-based big data applications. The architecture reduces big data streams by enabling maximum resource provisioning near the data sources. The architecture is designed to perform early analytic operations over data streams in order to aggregate knowledge patterns in place of raw data streams.
The RedEdge architecture embeds three layers of data analytics for big data reduction. At the first layer, mobile edge devices perform data analytic operations for local data reduction using onboard computational resources. However, in the case of resource scarcity, the nearby mobile edge devices form an ad hoc network. The mobile edge devices other than the source device provide services for collaborative data reduction and perform the required analytic operations on offloaded data streams. If there is no nearby mobile edge device or the required resources are not available at connected mobile edge devices, the data streams are offloaded to clouds, which initiate the required analytic services for remote data reduction. The reduced data streams are then shared with data aggregation servers either in fog edge servers or in cloud data centres.
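The fallback order described above (local device first, then a nearby peer, then the cloud) can be illustrated with a minimal sketch. The `Device` fields, the memory and battery thresholds and the function names are hypothetical choices made for illustration only; they are not taken from the RedEdge implementation.

```python
from dataclasses import dataclass
from enum import Enum


class ExecMode(Enum):
    LA = "local analytics"
    CA = "collaborative analytics"
    CLA = "cloud-enabled analytics"


@dataclass
class Device:
    # Hypothetical resource snapshot of a mobile edge device.
    name: str
    free_memory_mb: float
    battery_pct: float


def has_resources(dev, need_memory_mb, min_battery_pct):
    """True if the device can host the analytic task (assumed criteria)."""
    return dev.free_memory_mb >= need_memory_mb and dev.battery_pct >= min_battery_pct


def select_mode(source, peers, need_memory_mb, min_battery_pct=20.0):
    """Return (mode, executing device); the device is None for the cloud."""
    # Layer 1: reduce the stream on the source device itself.
    if has_resources(source, need_memory_mb, min_battery_pct):
        return ExecMode.LA, source
    # Layer 2: offload to the first nearby peer with enough resources.
    for peer in peers:
        if has_resources(peer, need_memory_mb, min_battery_pct):
            return ExecMode.CA, peer
    # Layer 3: no capable device nearby, so offload to the cloud.
    return ExecMode.CLA, None
```

For example, a source device with 50 MB of free memory and a peer with 512 MB would yield collaborative analytics for a task that needs 100 MB, while an empty peer list would yield cloud-enabled analytics.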

RedEdge: An Architecture for Big Data Processing in MECC Environments
In this section, we present RedEdge, a novel architecture for big data processing in MECC environments. RedEdge enables big data reduction (see Figure 3) at three layers. These three layers are called the local analytics layer (LA) for onboard data reduction employed by the mobile edge device, the collaborative analytics layer (CA) for data reduction within the ad hoc network of mobile edge devices and the cloud-enabled analytics layer (CLA) for data reduction in clouds.

Components and Operations for LA
At the LA layer, the RedEdge provides five modules for: (1) data acquisition and data adaptation; (2) knowledge discovery; (3) knowledge management; (4) visualization and actuation; and (5) system management.

Data Acquisition and Data Adaptation
The RedEdge applications start execution and run as back-end services in mobile edge devices. Primarily, these applications perform intelligent data collection depending on the application requirements. The data collection strategy also varies between applications. For example, some applications (like environmental monitoring apps) may collect continuous data streams, whereas others may collect data periodically or only in specific situations. The data adaptation strategies help control data rates in RedEdge applications.

Knowledge Discovery
The knowledge discovery module supports the execution of analytic components using onboard computational resources in mobile edge devices. The module provides different algorithms for data preprocessing operations such as noise reduction, outlier analysis, handling missing values and anomaly detection, to name a few. In addition, the data fusion components provide functionalities to fuse data streams from multiple homogeneous and heterogeneous data sources such as onboard and offboard sensors, as well as Internet-enabled social media data streams as used in social IoTs. Moreover, the module provides components for transient storage of fused data streams in mobile edge devices. Furthermore, the module provides a library of different data stream analytic algorithms in order to perform clustering, classification and association rule mining operations.

Knowledge Management
The knowledge patterns generated by the knowledge discovery module differ depending on the data analytics operations of IoT applications. The knowledge management module integrates the relevant knowledge patterns and produces a summarized and globalized view of the overall data. The knowledge integration is made in a way that all processed data can be effectively represented, and the resultant data are stored in local data stores using lightweight databases. However, the challenge of tracking the processed and unprocessed data points introduces complexity in the overall data management process. To address this issue, RedEdge works with the principles of data parallelism where each chunk of raw data is tracked from its acquisition until the integration of knowledge patterns. The data parallelization ensures that each data chunk is processed at least once; however, this approach increases the overall computational complexity.

System Management
The onboard resource dynamics and fast mobility make it difficult to track the locations of mobile edge devices and their available onboard resources. The core functions of the system management module are the adaptation engine, the context monitor and the resource monitor. These functions ensure the robustness of the far-edge computing architecture in different scenarios. The adaptation engine ensures the execution of RedEdge components and data reduction in all three layers. In addition, the resource monitor and the context monitor periodically monitor available resources (memory, storage), locations (frequently-visited locations) and device-usage behaviour (charging, idle).

Visualization and Actuation
The visualization module ensures the local knowledge availability by enabling on-screen visualization. Studies show that local visualization is very useful for real-time applications. However, due to resource constraints and limited screen size, local knowledge visualization does not support a detailed knowledge view. The topic of visualization needs a detailed and thorough study; therefore, it is not covered further in this article. The actuation component is designed to ensure the interaction of mobile edge devices with external environments, which include remote cloud services and nearby peer mobile edge devices. This module ensures the future extensibility of RedEdge to other devices, systems and communication networks.

Components and Operations for CA
The discovery of other mobile edge devices and the available communication interfaces are key requirements for the ad hoc network formation of nearby mobile edge devices. Executing knowledge discovery processes collaboratively and synchronizing the resultant knowledge patterns are also challenging during collaborative data processing. The CA layer of RedEdge handles these issues to ensure seamless and collaborative data reduction.

Discovering Mobile Edge Devices and Communication Interfaces
The device discovery module handles two main issues. First, it discovers the mobile edge devices that may provide data processing services to other mobile edge devices. The source device in the network scans all connected communication interfaces and lists all available mobile edge devices. The adaptation engine in RedEdge maintains and periodically updates a list of mobile edge devices for service utilization. Known devices are given priority over unknown devices, and the list of known devices is updated whenever a new device is connected. This approach helps to seamlessly adapt to new and unknown environments for collaborative data reduction. The second main issue handled by RedEdge at this stage is to adapt and switch between different communication interfaces. It ensures seamless switching between different communication interfaces while maintaining the proximity of devices. This strategy helps to ensure maximum collaboration considering co-movement between different communication areas (such as Wi-Fi networks, public Internet facilities and home networks).

Peer to Peer Network Formation
Once the mobile edge devices are found and the information about their communication interfaces is collected, RedEdge initiates the P2P network formation process. The source device broadcasts the peer request to all proximal mobile edge devices, which collect the information about available onboard computational resources and send this back to the source device. The source device then performs the cost-benefit analysis in order to decide the favourability of data offloading. In the case of favourable data offloading, the source device offloads the data stream to nearby mobile edge devices.

Data Offloading in Mobile Edge Devices
The decision about the favourability of data offloading is quite challenging due to resource dynamics in mobile edge devices and the availability of communication interfaces. Data offloading is treated as a multi-objective problem whose main objectives are minimal energy consumption, reduced bandwidth utilization cost, performance enhancement and maximum data reduction.
Existing studies show that energy consumption differs across communication interfaces and with the distance between devices; therefore, the optimal choice of communication interface is a key element in deciding the favourability of offloading. Following the recommendations given in [47], RedEdge creates a priority list of communication interfaces and switches accordingly. This approach helps to maximize the energy gain while communicating with proximal devices. RedEdge achieves optimal bandwidth utilization by dividing data streams into small and manageable chunks. The data chunk size is carefully determined by calculating the energy cost of local computations and of communication over proximal networks. The chunk size is kept large enough that proximal communication becomes favourable when compared with local computations. However, the size must be kept moderate such that the performance of serving mobile edge devices is not compromised. In addition, the offloading decision depends on the amount of data that needs to be processed in mobile edge devices. When the amount of data is insufficient, data offloading becomes unfavourable and consumes more energy and computational resources than reducing the data using onboard computational resources. Considering these objectives and constraints, RedEdge devises the optimal offloading strategy for collaborative and remote data reduction in MECC systems. Further details about the offloading scheme are presented in [44] for interested readers.
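The trade-off between chunk size, communication energy and local computation energy can be made concrete with a simple linear cost model. This is an illustrative sketch only: the linear energy model, the fixed per-transfer overhead and all parameter names are assumptions, and the actual offloading scheme is the one described in [44].

```python
def offload_is_favourable(chunk_mb, local_j_per_mb, tx_j_per_mb,
                          fixed_tx_j, peer_capacity_mb):
    """Compare the assumed energy cost of local processing and offloading.

    chunk_mb         -- size of the data chunk to reduce
    local_j_per_mb   -- energy (J) to process 1 MB on the source device
    tx_j_per_mb      -- energy (J) to transmit 1 MB to the peer
    fixed_tx_j       -- fixed energy (J) per transfer (connection setup)
    peer_capacity_mb -- largest chunk the serving peer can accept
    """
    # Too little data, or a chunk that would overload the serving peer.
    if chunk_mb <= 0 or chunk_mb > peer_capacity_mb:
        return False
    local_cost = chunk_mb * local_j_per_mb
    offload_cost = fixed_tx_j + chunk_mb * tx_j_per_mb
    return offload_cost < local_cost


def break_even_chunk_mb(local_j_per_mb, tx_j_per_mb, fixed_tx_j):
    """Chunk size above which offloading costs less energy than local work.

    Returns None when transmission is never cheaper per MB.
    """
    if local_j_per_mb <= tx_j_per_mb:
        return None
    return fixed_tx_j / (local_j_per_mb - tx_j_per_mb)
```

Under this model, small chunks are dominated by the fixed transfer overhead, which is one way to read the observation that an insufficient amount of data makes offloading unfavourable, while the peer capacity bound reflects the requirement that the serving device is not compromised.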

Knowledge Discovery and Pattern Synchronization
Once the offloading is completed, the mobile edge devices execute the components from their knowledge discovery modules. However, whether the whole knowledge discovery process or only a part of it is executed at the serving mobile edge device depends on the application design. In the case of complete execution, the source device offloads raw data streams, and the serving mobile edge device executes the complete knowledge discovery process from preprocessing to data mining and summarization of patterns. In the case of partial execution, the source device offloads only preprocessed data in order to lower the overall bandwidth utilization in the ad hoc network. The serving mobile edge device then executes the rest of the knowledge discovery process, and the resultant patterns are synchronized with the source device. If the source device does not receive the results from the serving mobile edge device within a specified time period, the data streams are offloaded to any other available nearby mobile edge device. To lessen the transient storage burden and to reserve the maximum computational power, RedEdge executes a garbage collection process in which mobile edge devices periodically delete all processed raw data streams from the Random Access Memory (RAM) and the device's local storage. Similarly, the source device deletes all processed data streams after receiving the corresponding knowledge patterns.
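The timeout-based fallback between peers can be sketched as follows. The sketch assumes that a peer which fails to answer within the specified period is simply skipped; the `process` callback, its signature and the convention of returning `None` on a timeout are hypothetical and stand in for the actual communication layer.

```python
def collect_patterns(chunk, peers, timeout_s, process):
    """Offload `chunk` to peers in order until one returns patterns.

    `process(peer, chunk, timeout_s)` is a caller-supplied (hypothetical)
    function that returns the knowledge patterns, or None if the peer did
    not answer within `timeout_s` seconds.
    Returns (peer, patterns), or (None, None) if every peer timed out.
    """
    for peer in peers:
        patterns = process(peer, chunk, timeout_s)
        if patterns is not None:
            # Results arrived in time; the source can now delete the chunk.
            return peer, patterns
    # No peer answered in time; the caller may fall back to the cloud.
    return None, None
```

A caller would pass the peers in priority order, so that the stream is re-offloaded to the next available device only when the previous one has not responded.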

Components and Operations for CLA
RedEdge maintains a service repository of available cloud services. The requirements of cloud services vary; therefore, the service repository contains various services for remote data reduction in clouds. The choice of service is solely dependent on the needs of big data systems; however, RedEdge provides an interface to access all available services in the repository. The mobile application offloads the data to the cloud environment together with a request for the required cloud services; the cloud service manager then automatically runs the requested services and completes the task execution.
RedEdge provides seven types of services, which are designed for data uploading, data preprocessing, data fusion, data mining, pattern summarization, knowledge management and pattern synchronization. The data uploading services help with handling offloaded data streams. The raw data streams are uploaded to transient data stores in clouds. The data preprocessing, data fusion and data mining services are executed in order to process raw data streams and uncover new knowledge patterns. The pattern summarization and knowledge management services are used to integrate and summarize knowledge patterns both uploaded by mobile edge devices and produced by cloud services. The summarized knowledge patterns are stored in permanent data stores inside clouds. The pattern synchronization services transfer the knowledge patterns for data aggregation in big data systems.

Formal Modelling, Analysis and Verification
The massive heterogeneity at all three layers of RedEdge introduces an unbounded number of use cases and applications that are practically difficult to analyse and generalize in this research work. Therefore, considering the heterogeneity and complexity of operations in the RedEdge architecture, we formally model the proposed architecture in order to analyse and verify its operations using high level Petri nets (HLPN) [48], the satisfiability modulo theories library (SMT-Lib) [49] and the Z3 solver [50]. The HLPN modelling approach is used in order to analyse the overall feasibility of RedEdge as a data reduction architecture. Traditional Petri net models are more effective in specific use cases, whereas the HLPN modelling approach has the benefit of generalizing the data processing operations at each layer. The basic introduction to HLPN, SMT-Lib and the Z3 solver is provided by [51,52] to aid the readers' understanding; therefore, further discussion on the topic is not made in this article.
Petri nets are used for graphical and mathematical modelling of a system and are applied to a wide range of systems, such as distributed, parallel, concurrent, nondeterministic, stochastic and asynchronous systems. For the formal modelling of RedEdge, we used a variant of the conventional Petri net called high level Petri net (HLPN). The HLPN simulates a system and provides its mathematical properties, which are used to analyse the behaviour of a system.
HLPN is based on a seven-tuple model N = (P, T, F, ϕ, R, L, M0), where P denotes a set of places, T refers to the set of transitions (such that P ∩ T = ∅), F denotes the flow relation (such that F ⊆ (P × T) ∪ (T × P)), ϕ maps places P to data types, R denotes a set of rules for transitions, L is a label on F and M0 represents the initial marking. (P, T, F) provides information about the structure of the net, and (ϕ, R, L) provides the static semantics (i.e., information that does not change throughout the system). In HLPN, places can have tokens of multiple types, which can be a cross product of two or more types. A few mapping examples include ϕ(P1) = Boolean, ϕ(P2) = ID, ϕ(P3) = P(Integer), and ϕ(P1) = Char, where P1, P2 and P3 are the places of HLPN.
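To make the seven-tuple definition concrete, the following minimal sketch represents an HLPN as a plain data structure and checks the two structural constraints stated above (P ∩ T = ∅ and F ⊆ (P × T) ∪ (T × P)). The class, its field names and the type names used in the example net are illustrative assumptions; the sketch is not the model used in the formal verification.

```python
from dataclasses import dataclass


@dataclass
class HLPN:
    places: set        # P
    transitions: set   # T
    flow: set          # F, a set of (source, target) arcs
    phi: dict          # ϕ, maps each place to a data type name
    rules: dict        # R, maps each transition to a rule (callable)
    labels: dict       # L, maps arcs to labels
    m0: dict           # M0, maps places to their initial tokens

    def validate(self):
        """Check the structural constraints of the definition."""
        if self.places & self.transitions:
            raise ValueError("P and T must be disjoint")
        for src, dst in self.flow:
            place_to_trans = src in self.places and dst in self.transitions
            trans_to_place = src in self.transitions and dst in self.places
            if not (place_to_trans or trans_to_place):
                raise ValueError(f"arc {src}->{dst} is not in (P x T) U (T x P)")
        untyped = self.places - set(self.phi)
        if untyped:
            raise ValueError(f"places without a data type: {sorted(untyped)}")
        return True


# A two-place fragment; the place names follow the text, the types are assumed.
net = HLPN(
    places={"Data_Sources", "Loc_Data"},
    transitions={"T1"},
    flow={("Data_Sources", "T1"), ("T1", "Loc_Data")},
    phi={"Data_Sources": "DS_ID", "Loc_Data": "Chunk"},
    rules={"T1": lambda token: True},
    labels={},
    m0={"Data_Sources": ["ds-1"]},
)
```

Calling `net.validate()` succeeds for this fragment, whereas an arc that connects two places directly would be rejected.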
SMT is used for verifying the satisfiability of formulae over the theories under consideration. SMT-Lib provides a common input platform and benchmarking framework that helps with the evaluation of the systems. The usage of SMT is common in many fields, including deductive software verification. This article adopts the Z3 solver, a theorem prover developed at Microsoft Research, together with SMT-Lib. Z3 is an automated satisfiability checker that determines whether a set of formulas is satisfiable in the built-in theories of SMT-Lib. The HLPN model for the RedEdge framework is shown in Figure 4. We identify data types, places and the mapping of data types to places. Data types and their mappings are shown in Tables 1 and 2, respectively. In Figure 4, the rectangular black boxes represent transitions and belong to set T, whereas circles represent places and belong to set P.

Places Mappings
For each data collection phase, the data_sources information is initialized, and the data collection is started. The time series buffered data stream is created using T_stamp, the temporary file name (F_name) and the system generated data sources' ID (DS_ID). RedEdge generates all unique IDs in the system at the time of application deployment. Therefore, these IDs remain constant for as long as the application remains installed on a device. When the collected data file reaches a maximum threshold (i.e., the file size given by the application developer), the data file is stored on the onboard storage. In addition, a Flag_status is maintained for each data file (called the data chunk). The Flag_status shows the current processing status of any data chunk (zero for unprocessed, one for under-processing and −1 for processed). This is done at transition T1, and the transition is mapped to the following rule (see Equation (1)).
Once sufficient data are collected, T2 gathers data from Loc_Data and corresponding attributes (Chunk_ID, DS_ID and Flag_status) from Data_Tab and updates Flag_status to "under-processing" (i.e., one). In addition, T1 periodically cleans processed data from Loc_Data and updates Data_Tab accordingly. The data controlling rule at T2 is mapped as follows (see Equation (2)).
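The chunk bookkeeping described for T1 and T2 can be sketched as a small table keyed by Chunk_ID. The class and method names are hypothetical; only the Flag_status values (0, 1 and −1) and the attribute names Chunk_ID and DS_ID come from the text.

```python
UNPROCESSED, UNDER_PROCESSING, PROCESSED = 0, 1, -1


class DataTab:
    """Tracks the processing status of every stored data chunk."""

    def __init__(self):
        self.rows = {}  # Chunk_ID -> {"ds_id": ..., "flag_status": ...}

    def register(self, chunk_id, ds_id):
        # T1: a chunk reached the size threshold and was stored locally.
        self.rows[chunk_id] = {"ds_id": ds_id, "flag_status": UNPROCESSED}

    def claim_unprocessed(self):
        # T2: pick up all unprocessed chunks and mark them under-processing.
        claimed = [cid for cid, row in self.rows.items()
                   if row["flag_status"] == UNPROCESSED]
        for cid in claimed:
            self.rows[cid]["flag_status"] = UNDER_PROCESSING
        return claimed

    def mark_processed(self, chunk_id):
        # Set once the patterns of this chunk have been produced.
        self.rows[chunk_id]["flag_status"] = PROCESSED

    def clean_processed(self):
        # Periodic cleanup: drop the rows of processed chunks.
        done = [cid for cid, row in self.rows.items()
                if row["flag_status"] == PROCESSED]
        for cid in done:
            del self.rows[cid]
        return done
```

Because every chunk carries exactly one status at a time, a chunk that is already under processing is never claimed twice, and only chunks marked as processed are removed by the cleanup step.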
After the establishment of the amount and type of data to be processed, T3 collects Context_in f o, Conn, Loc_Res, Est_Res and related information and executes the Rule_engine, which runs the execution rules and switches between all three execution modes (i.e., LA, CA or CLA).The rule for the selection of the execution mode is mapped at T3 as follows (see Equation ( 3)).
T4 collects the Exec_mode status, and in the case of LA and CA, data mining tasks are initiated locally.For LA, all data mining tasks are executed using onboard local resources.However, in the case of CA, data mining tasks are offloaded to peer-candidate devices, where each peer-candidate device acts as a standalone data mining platform.In addition, T4 collects U p_data and schedules the data mining tasks accordingly.The Exec_mode at T4 is mapped using the following rule (see Equation ( 4)).
Once the data mining tasks are executed successfully, the Disc_Pattern is evaluated at T5 and is mapped as follows (see Equation ( 5)).
After refinement of Disc_patterns into Intr_patterns, the relevant patterns are summarized and merged at T6 using Equation (6). Depending on the configuration of each application, data patterns are marked as public, private or protected and stored locally at Loc_Pattern.
T7 synchronizes Loc_patterns with cloud data stores (CloudStr). In addition, the Flag_status of each successfully executed Chunk_ID is updated to "processed" (i.e., −1). The synchronization at T7 is mapped using Equation (7). Consequently, whenever the Internet connection is available, none of the LA, CA or CLA modes is enabled, and there are some unsynchronized data patterns (UPs), the UPs are synchronized with their respective counterparts in the cloud. However, data patterns need to be stored in different directories on the mobile edge devices to distinguish synchronized from unsynchronized versions.
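The synchronization condition at T7 can be sketched as below. This is an illustration under simplifying assumptions: in-memory dictionaries stand in for the separate on-device directories and for CloudStr, and the function name sync_patterns and its parameters are hypothetical.

```python
PROCESSED = -1  # Flag_status code for processed chunks, as given in the text


def sync_patterns(unsynced, cloud_store, flag_status, internet_available, active_mode=None):
    """Move unsynchronized patterns (UPs) to the cloud store when the device is online and idle.

    unsynced:    dict mapping chunk ID to its locally stored pattern
    cloud_store: dict standing in for CloudStr
    flag_status: dict mapping chunk ID to its Flag_status code
    Returns the list of chunk IDs that were synchronized.
    """
    if not internet_available or active_mode is not None:
        return []  # offline, or one of LA/CA/CLA is still running
    synced = []
    for chunk_id in list(unsynced):
        cloud_store[chunk_id] = unsynced.pop(chunk_id)
        flag_status[chunk_id] = PROCESSED
        synced.append(chunk_id)
    return synced
```

Keeping the unsynchronized patterns in their own container mirrors the requirement to separate synchronized and unsynchronized versions on the device.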
Lastly, in the case of CLA, the raw data stream is uploaded to CloudStr using the following rule (see Equation (8)).
The formal verification of the HLPN (using the Z3 solver) shows that RedEdge is workable and executes according to the specified properties. We also evaluated our RedEdge model using the PIPE+ editor [53], which provides a graphical interface to develop and analyse HLPN for bounded model checking (BMC) (see Figure 5). The traversal paths in the HLPN are given in the forward and backward incidence matrices generated using PIPE+ (see Tables 3 and 4).

The results show that all places in RedEdge are reachable when moving forward (see Table 3). Similarly, all places, except φ(Cloud Str), are reachable in reverse order (see Table 4). The φ(Cloud Str) is made irreversible to eliminate the loop in the data processing cycle.
BMC handles the state space explosion problem by exploring a limited number of states. BMC is therefore applied over the finite set of transitions (M) using a linear temporal logic (LTL) formula (f) and a given upper bound "k". BMC searches for an execution path of length at most "k" that satisfies the LTL formula. First, a logic formula φk is constructed from M, f and k and checked using the constraint solver. If f is satisfied over a path of maximum length "k" in M_k, then φk is said to be satisfiable. In existential BMC, it is very hard to find the upper bound for "k"; therefore, the negated safety property is used for validation. Under the negated safety property, the model is deemed safe as long as f is not satisfiable. The HLPN model was translated into logic formulas and evaluated for satisfiability. The tokens are distributed over different places in the various markings of each HLPN state. The detailed theory of HLPN mapping in the SMT context is presented in [54] for interested readers.
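The role of the bound "k" and of the negated safety property can be illustrated with a toy example. The sketch below does not build the SMT formula φk or call a constraint solver; it swaps in an explicit enumeration of firing sequences up to length k, which is feasible only for very small nets. The three-place net, the transition names and the function names are hypothetical simplifications.

```python
def enabled(marking, pre):
    """A transition is enabled when every input place holds enough tokens."""
    return all(marking.get(place, 0) >= n for place, n in pre.items())


def fire(marking, pre, post):
    """Return the marking obtained by consuming pre tokens and producing post tokens."""
    new_marking = dict(marking)
    for place, n in pre.items():
        new_marking[place] -= n
    for place, n in post.items():
        new_marking[place] = new_marking.get(place, 0) + n
    return new_marking


def bounded_check(initial, transitions, violates, k):
    """Search firing sequences of length <= k for a marking where violates() holds.

    Returns the firing sequence leading to such a marking, or None if no
    violation exists within the bound (the property holds up to depth k).
    """
    frontier = [(initial, [])]
    seen = {frozenset(initial.items())}
    for _ in range(k + 1):
        next_frontier = []
        for marking, path in frontier:
            if violates(marking):
                return path
            for name, (pre, post) in transitions.items():
                if enabled(marking, pre):
                    successor = fire(marking, pre, post)
                    key = frozenset(successor.items())
                    if key not in seen:
                        seen.add(key)
                        next_frontier.append((successor, path + [name]))
        frontier = next_frontier
    return None


# Toy net: one token flows from Data_Sources to Loc_Data to Cloud_Str.
initial = {"Data_Sources": 1, "Loc_Data": 0, "Cloud_Str": 0}
transitions = {
    "T1": ({"Data_Sources": 1}, {"Loc_Data": 1}),
    "T7": ({"Loc_Data": 1}, {"Cloud_Str": 1}),
}
```

Calling bounded_check with a predicate that encodes the negation of a safety property returns a counterexample path when the property can be violated within k firings, and None otherwise.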
In general, during safety (reachability) analysis, PIPE+ generated 3072 states and 38,592 arcs, creating a state space explosion problem when φ(Data_Sources) and φ(Exec Mode) are enabled with one token on each place. The simulation results (see Table 5) show that all places are reachable and that the model satisfies the safety property. The minimum threshold at which all places are reachable is 36 firings with five replications. The maximum threshold is 10,000 firings with 15 replications. The simulation results with minimum and maximum thresholds are presented in terms of the average number of tokens produced at each place and the acceptable margin of error during each execution cycle. Consequently, we argue that RedEdge is workable and that all places are reachable using the specified rules.

Performance Evaluation of the Proposed Data Reduction Strategy
To validate and demonstrate the effectiveness of the RedEdge architecture and its data reduction strategy, we developed and tested an application based on the RedEdge architecture in a real-world setting. In this section, we present the outcomes of this experimentation in terms of the effectiveness of the big data reduction strategy (including battery power consumption, memory consumption and latency). We compare our results with raw data stream uploading and present the potential savings in battery power and memory overhead that can be achieved by the RedEdge architecture.

Big Data Reduction in Participatory Sensing Application
In a smart city scenario, participatory sensing applications aid in collecting data streams from citizens and sensing systems deployed on roads, railway tracks, shopping and parking areas and countless other places in the cities. Let us consider an example of a citizen sensing application for a smart city, whereby the city administration wants to improve the quality of the leisure time that citizens spend in public parks, sporting places and shopping malls. The city government asks the citizens to share information about their physical activities and locations in order to improve public facilities. Conventionally, the applications installed on citizens' mobile phones collect the sensing information (i.e., readings from accelerometers, GPS, nearby Wi-Fi, etc.) and transfer the raw data streams to the cloud for analysis. We use this scenario in order to evaluate the RedEdge architecture.

System Development Platform and Real-World Experiment Settings
We selected multiple application development platforms in order to evaluate the performance of the RedEdge architecture. For local data reduction components, we developed data acquisition and adaptation, knowledge discovery, knowledge management and system management modules using Android SDK and Java 8. For collaborative data reduction, we integrated the AllJoyn framework for device discovery, P2P network formation and data offloading. However, the data reduction modules were implemented using Android SDK and Java 8. For cloud-based data reduction, we developed multi-threaded cloud services and deployed them in a cloud environment using Google Compute Engine. We selected three classifiers, namely J48, naive Bayes and random forest, as the underlying knowledge discovery techniques employed by the participatory sensing application.
The experiments were performed in two phases. In the first phase, we recruited 12 graduate students to collect data streams in order to develop the learning models. In the second phase, we deployed the learning models in the mobile edge devices for activity prediction and fed in the raw data streams in order to evaluate RedEdge. The performance of RedEdge was evaluated in terms of battery power consumption and memory utilization during raw data uploading and during data reduction using RedEdge.

Results of the Real-World Experiment
Mobile edge devices operate in resource-constrained environments; therefore, power consumption and memory utilization during data reduction were the main considerations during the evaluation. We integrated a software-based open source power profiling tool in RedEdge in order to measure the power consumed by application components. Since the nature of streaming data in big data systems varies according to the application requirements, we configured RedEdge accordingly. The accelerometer and GPS receiver collect the data stream at a constant rate (i.e., 100 readings per second for the accelerometer and one GPS reading every 5 s). However, we generated data chunks of different sizes using time intervals between 5 s and 300 s so that we could measure the effect of both volume and velocity on the performance of RedEdge.
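To give a sense of the chunk sizes implied by these settings, the following sketch counts the readings collected per chunk at the stated sampling rates. The function name readings_per_chunk is hypothetical, and the calculation assumes that both sensors sample without gaps over the whole interval.

```python
ACCEL_RATE_HZ = 100  # accelerometer readings per second (from the text)
GPS_PERIOD_S = 5     # one GPS reading every 5 s (from the text)


def readings_per_chunk(interval_s):
    """Return (accelerometer readings, GPS readings) collected in one chunk interval."""
    accel = ACCEL_RATE_HZ * interval_s
    gps = interval_s // GPS_PERIOD_S
    return accel, gps
```

Under these assumptions, the shortest interval (5 s) yields 500 accelerometer readings and one GPS reading, while the longest (300 s) yields 30,000 accelerometer readings and 60 GPS readings.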
Figure 6 shows the power consumption comparison of data uploading strategies. Initially, the raw data streams were uploaded to mobile edge devices, whereby the average battery power consumption for each data chunk remained around 16 mW (milliwatts). However, due to mobility constraints and switching among different networks, the average power overhead on the mobile edge device sometimes increased by about 3 mW. The maximum power consumed during raw data uploading to mobile edge devices was 19 mW. In comparison, during raw data uploading to clouds, the mobile edge device consumed less power, with an average consumption of around 11 mW. The RedEdge architecture improves on both, whereby the cost of uploading knowledge patterns remained around 1.33 mW on average. The experiment revealed that power consumption for knowledge transfer was almost 12 times lower than for raw data transfer to mobile edge devices and almost eight times lower than for raw data transfer to the cloud.
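The reported ratios follow directly from the average power figures. A minimal check, using a hypothetical helper named times_lower:

```python
def times_lower(baseline_mw, reduced_mw):
    """Return how many times lower the reduced power draw is than the baseline."""
    return baseline_mw / reduced_mw


peer_ratio = times_lower(16, 1.33)   # raw upload to mobile edge devices vs. knowledge upload
cloud_ratio = times_lower(11, 1.33)  # raw upload to the cloud vs. knowledge upload
```

With the averages above, peer_ratio is approximately 12.0 and cloud_ratio approximately 8.3.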
Although RedEdge minimized the power consumption for data transfer, there remains an energy overhead for data processing. The results presented in Figure 7 reveal that RedEdge consumed more power while processing data streams in mobile edge devices as compared with data processing using mobile edge devices in ad hoc networks and clouds. The average battery power consumption during data processing in the mobile edge device, the serving mobile edge device and the cloud was 468 mW, 61 mW and 367 mW, respectively. However, the power consumption does not significantly impact the performance of mobile edge devices. For example, with a 2000-mAh (milliampere-hour) battery operating at 3.7 volts, the mobile edge device can last for around 16 h in LA mode, 140 h in CA mode and around 20 h in CLA mode. The battery time was calculated using Equation (9). Here, "P" represents the power.
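As a rough cross-check of these battery lifetimes, a common first-order estimate divides the stored energy (capacity times voltage) by the average power draw. This estimate is an assumption made for illustration and is not necessarily identical to Equation (9); the function name battery_hours is hypothetical.

```python
def battery_hours(capacity_mah, voltage_v, power_mw):
    """Estimate runtime in hours as stored energy (mWh) divided by average power (mW)."""
    energy_mwh = capacity_mah * voltage_v
    return energy_mwh / power_mw


la_hours = battery_hours(2000, 3.7, 468)   # local processing
ca_hours = battery_hours(2000, 3.7, 61)    # processing on a serving peer device
cla_hours = battery_hours(2000, 3.7, 367)  # processing in the cloud
```

This simple estimate gives about 15.8 h for LA and 20.2 h for CLA, in line with the figures above, and about 121 h for CA, which is somewhat below the reported 140 h, so the original calculation may account for factors that this estimate ignores.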
Figure 8 shows the memory consumption during raw data uploading and knowledge transfer in mobile edge devices and clouds. The mobile edge devices consumed 29 MB and 27 MB of total memory during raw data transfer to mobile edge devices and clouds, respectively. However, the memory consumption dropped to 15 MB during knowledge pattern transfer to the cloud. Although we achieved a memory gain of nearly 50%, RedEdge introduces an overhead of memory consumption for the LA, CA and CLA modes. Figure 9 shows the memory overhead of the RedEdge architecture. The results reveal that RedEdge consumed on average 25 MB in LA mode, 27 MB in CA mode and 28 MB in CLA mode. Interestingly, the memory consumption during data processing using RedEdge does not significantly differ from that of raw data uploading to mobile edge devices and clouds. Therefore, the memory consumption overhead of RedEdge does not degrade the performance of mobile edge devices.
Conventionally, big data systems collect the data streams, perform data indexing and storing operations inside clouds and perform data processing at later stages. The process from raw data acquisition to uncovering knowledge patterns involves latency in big data applications. We calculated the latency overhead caused by RedEdge (see Figure 10). The local data reduction in mobile edge devices creates a delay of about 1200 milliseconds (ms). Alternatively, collaborative data reduction introduces an average delay of 2393 ms (about 2.4 s), and remote data reduction brings a latency of 5675 ms (about 5.7 s).
The aim of data reduction was achieved in this study. The experimental analysis reveals that out of 4.32 GB of data acquired and processed using RedEdge, the architecture reduced the raw data stream to 315.98 MB of knowledge data stream. In total, the reduced data stream accounts for 7.14% of the overall data, which shows the significance of the research. Although RedEdge incurs a small battery power (see Figure 7) and memory (see Figure 9) overhead, the achieved benefits outweigh the incurred cost. The RedEdge architecture introduces the following benefits by enabling early big data reduction:

• The architecture enables controlling the velocity of incoming data streams in big data systems. The data acquisition and adaptation module of RedEdge enables setting the speed of data collection according to the application requirements and provides mechanisms to acquire data streams from multiple data sources.

• The value of big data matters more than blindly collecting data streams in cloud data centres. The knowledge discovery module of RedEdge enables improving the quality of big data streams. The module provides functionality to convert raw data streams into knowledge patterns, hence improving the quality of collected data streams. For example, in our use case application, the conversion of raw sensor readings into meaningful activities improves the quality of data streams.

• Handling a voluminous amount of big data is quite challenging and requires laborious efforts in order to perform data deduplication, data indexing, storage, retrieval and data cleaning operations for big data analytics. The three-level data reduction facilitates reducing the sheer volume of big data in order to ease the big data management operations. For example, our use-case application reduced the data volume about 13 times as compared with raw data transmission in cloud data centres.

• Conventionally, big data systems do not provide the local view of knowledge patterns near the data sources [55]. The visualization and actuation module of RedEdge ensures local knowledge availability in order to control the data sharing by mobile users.

• The architecture reduced big data streams near the data sources, hence lowering the bandwidth utilization cost. The cost is incurred in terms of data plans consumed by individual users, as well as the bandwidth utilization during in-network data movement in cloud data centres.

• The data reduction near the data sources is highly beneficial in order to reduce the operational cost of big data systems. Governments and enterprises do not need to purchase extra data storage and data processing facilities. Alternatively, the cloud service providers can lower the operational cost due to lower storage and processing requirements.
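The share of data remaining after reduction can be checked from the reported volumes (4.32 GB of raw data reduced to 315.98 MB). The sketch below assumes 1 GB = 1024 MB, which reproduces the reported 7.14%; the function name reduced_share_percent is hypothetical.

```python
MB_PER_GB = 1024  # assumption: binary gigabytes


def reduced_share_percent(raw_gb, reduced_mb):
    """Return the reduced data stream as a percentage of the raw data volume."""
    return 100.0 * reduced_mb / (raw_gb * MB_PER_GB)


remaining = reduced_share_percent(4.32, 315.98)  # share that is still uploaded
saved = 100.0 - remaining                        # share removed before upload
```

Rounded to two decimals, remaining is 7.14 and saved is 92.86, which matches the reduction of up to 92.86% reported for RedEdge.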

Conclusions and Future Work
Data reduction near the data sources is a viable alternative to conventional methods of data reduction in big data systems. This research shows that data reduction inside mobile edge devices lowers the communication and computational burden in existing IoT-cloud communication models. To this end, the proposed RedEdge architecture contributes by utilizing mobile edge devices as the primary data mining platforms and further reduces the data stream in clouds before big data aggregation. The RedEdge architecture improves big data systems in terms of volume, velocity and value by reducing the data streams by 92.86% before big data storage. RedEdge indirectly improves the big data management and in-network data movement operations at later stages of big data processing models. In the future, we aim to propose energy- and memory-efficient load balancing methods in order to utilize mobile edge devices as data reduction platforms to reduce the overall latency of sharing big data streams in clouds. In addition, we understand that security and privacy are grave concerns with such approaches [56,57]. Hence, one important direction of this proposed work is to develop novel privacy and security mechanisms that can support distributed big data analytics in mobile edge cloud computing environments.
Mem: An integer type representing maximum memory in the mobile edge device
Storage: An integer type representing maximum storage in the mobile edge device
App_ID: A string type representing application_id in the mobile edge device
Avlb_Loc_storage: An integer type representing available local storage in the mobile edge device
Avlb_SD_card: An integer type representing available storage on the SD-card in the mobile edge device
Wifi: A string type representing availability and connectivity through Wi-Fi
GSM: A string type representing availability and connectivity status through GSM

Table 2. Places and mappings.

Table 5. Simulation results of the RedEdge HLPN model.