5.2.1. Analytical Tasks in Continuous Data Streams
The data ingestion task deployed at an edge node retrieves parking data in an infinite loop that triggers the task every 5 s. Each raw streaming data tuple is treated as a parking space event and is sent to the closest edge node. The parking data streams consist of a set of out-of-order tuples containing attributes in the format:
Parking Event: a specific parking event containing 4 attributes {spot_id, length, startTime, vehicle_id}, described in Table 3.
Spot Entity: a parking spot entity where the parking event is happening. It contains 3 attributes {lat, long, spot_name}, described in Table 3.
The raw data tuples obtained after the data ingestion task are forwarded to the data cleaning task, which consists of a sequence of operations including assessment, detection, repairing, and validation. The assessment process detects and identifies errors, redundancies, missing values, incomplete tuples, and inaccurate data fields. The tuples are then re-organized, replaced, repaired, or removed by dynamically applying adaptive integrity constraints to ensure data quality. Finally, the accuracy of the cleaned tuples is validated before they are passed to the next analytical task.
The attributes of a cleaned data tuple are later grouped into two new data fields (Parking Event and Spot Entity). Each new data tuple thus becomes a vector of 7 attributes {spot_id, length, startTime, vehicle_id, lat, long, spot_name}.
We have implemented an automated script that applies adaptive integrity constraints to handle missing attribute values and missing tuples, to remove duplicate tuples and redundant attributes, and to repair incorrect attribute values. The cleaned tuples are then transferred to the data filtering task as illustrated in Figure 5.
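The cleaning steps described above can be sketched as follows. This is a minimal illustration and not the authors' script: the function names clean_tuple and clean_stream, the numeric encoding of startTime and length (assumed here to be integer seconds), and the specific range checks are all assumptions made for the example.

```python
from typing import Iterable, Iterator, Optional

# The seven attributes a cleaned tuple is expected to carry.
REQUIRED = ("spot_id", "length", "startTime", "vehicle_id", "lat", "long", "spot_name")


def clean_tuple(raw: dict) -> Optional[dict]:
    """Return a repaired copy of one raw tuple, or None if it must be removed."""
    # Drop redundant attributes: keep only the seven expected fields.
    tup = {k: raw[k] for k in REQUIRED if raw.get(k) is not None}
    # Remove incomplete tuples (any required value missing).
    if len(tup) != len(REQUIRED):
        return None
    try:
        # Repair values that arrived with the wrong type (e.g. as strings).
        tup["length"] = int(tup["length"])
        tup["startTime"] = int(tup["startTime"])
        tup["lat"] = float(tup["lat"])
        tup["long"] = float(tup["long"])
    except (TypeError, ValueError):
        return None
    # Validate against simple, hypothetical integrity constraints.
    if tup["length"] <= 0:
        return None
    if not (-90 <= tup["lat"] <= 90 and -180 <= tup["long"] <= 180):
        return None
    return tup


def clean_stream(raw_tuples: Iterable[dict]) -> Iterator[dict]:
    """Yield cleaned tuples, skipping duplicates of the same parking event."""
    seen = set()
    for raw in raw_tuples:
        tup = clean_tuple(raw)
        if tup is None:
            continue
        # Assumption: spot, vehicle and start time identify one parking event.
        key = (tup["spot_id"], tup["vehicle_id"], tup["startTime"])
        if key in seen:
            continue
        seen.add(key)
        yield tup
```

In a real deployment the constraint set would be adapted over time rather than hard-coded as it is here.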
The data filtering task automatically derives a subset of the original data using a set of criteria or extraction (filtering) operations. After the data filtering task finishes, the extracted data are transferred to the data contextualization task, which creates new attributes and attaches them to the original data tuples
T using a contextualization operation
as in Equation (
The data contextualization task has been implemented at the edge to handle the current incoming data tuples and at the fog node to handle the outdated data tuples, as described in Figure 5. A function was implemented to interpret the status (occupied or empty) of a parking spot whenever a driver parks a car.
Whenever a tuple arrives at the edge, we create an event label, Occupied, and attach it to the original tuple to mark that the parking spot is in use.
We compute the endTime from the startTime and the parking duration length. The parking duration is the one paid for by the customer.
We also add the arrival time, edge_arrivingTime, whenever a tuple arrives at an edge node.
After the contextualization task at the edge has been executed, three new attributes, event, endTime, and edge_arrivingTime, are attached to the original tuple. Each contextualized tuple then contains a vector of 10 attributes {spot_id, length, startTime, vehicle_id, lat, long, spot_name, event, endTime, edge_arrivingTime}. This new contextualized tuple is transmitted to the fog node, where a new attribute, fog_arrivingTime, is added to register the ingestion time. At the fog, this Occupied data tuple is duplicated for two main purposes: (1) one copy of the Occupied data tuple is transmitted to the accumulated data streams for further analytical tasks; (2) the other copy temporarily resides in the in-memory database for deducing other events.
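The contextualization at the edge and the fog can be illustrated with a short sketch. The function names below are hypothetical, and the example assumes that startTime is a numeric timestamp in seconds and that length is a duration in seconds; the source does not state the units.

```python
import time
from typing import Optional


def contextualize_at_edge(tup: dict, now: Optional[float] = None) -> dict:
    """Attach event, endTime and edge_arrivingTime to a cleaned 7-attribute tuple."""
    out = dict(tup)  # copy, so the original tuple is left untouched
    out["event"] = "Occupied"  # a tuple arriving at the edge marks a spot in use
    # Assumption: startTime and length share the same unit (seconds).
    out["endTime"] = tup["startTime"] + tup["length"]
    out["edge_arrivingTime"] = time.time() if now is None else now
    return out


def contextualize_at_fog(tup: dict, now: Optional[float] = None) -> dict:
    """Attach fog_arrivingTime to register the ingestion time at the fog node."""
    out = dict(tup)
    out["fog_arrivingTime"] = time.time() if now is None else now
    return out
```

Passing the now argument explicitly keeps the functions deterministic, which is convenient when replaying recorded streams.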
In this smart parking application, the outdated and the current incoming Occupied data tuples are the key elements for determining the status of a parking event whenever a driver parks a car. We aim to infer whether an Empty event or an Occupied event is occurring at a specific parking spot.
The Empty event is also computed at a fog node, as shown in Figure 6. The computation consists of the following steps:
When a contextualized tuple with an Occupied status arrives at the fog, it is treated as an outdated tuple and retained in the database (RethinkDB) until a new tuple for the same parking spot arrives. To detect changes in our real-time database, we have implemented an ad hoc query in the ReQL language that continuously monitors incoming tuples, as follows.
Description | ReQL Statement |
Monitor the change feed for any new object changes on a table | r.db('spdb').table('raw_historical_table').changes.run(conn).each{|change| p(change)} |
The new tuple with an Empty status is initially computed by mirroring some static attributes from the incoming Occupied tuple, namely {spot_id, lat, long, spot_name, edge_arrivingTime}. Then, the startTime of the Empty tuple is set to the endTime of the outdated Occupied tuple, while the endTime of the Empty tuple is set to the startTime of the incoming Occupied tuple. The length of the Empty tuple is then computed by subtracting its startTime from its endTime. Finally, the fog_arrivingTime of the Empty tuple is attached at the end of the Empty tuple creation task. The following query retrieves the outdated Occupied tuple that temporarily resides in RethinkDB for this task.
Description | ReQL Statement |
Query the outdated "Occupied" tuple that temporarily resides in RethinkDB | r.db('spdb').table('raw_historical_table').without('id', 'edge_arrivingTime', 'fog_arrivingTime').filter({"spot_id": str(item['spot_id']), "event": "Occupied"}).order_by(r.desc('startTime')).limit(1).run(conn) |
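The derivation of an Empty tuple from an outdated and an incoming Occupied tuple can be sketched as a pure function, independent of the database. The name make_empty_tuple is hypothetical, and setting vehicle_id to None is an assumption: the source lists vehicle_id among the attributes of every tuple but does not say what value an Empty tuple carries.

```python
def make_empty_tuple(outdated: dict, incoming: dict, fog_time: float) -> dict:
    """Build the Empty tuple covering the gap between two Occupied tuples.

    outdated: the previous Occupied tuple stored for this parking spot
    incoming: the Occupied tuple that has just arrived for the same spot
    fog_time: arrival time registered at the fog node
    """
    # Mirror the static attributes from the incoming tuple.
    static = ("spot_id", "lat", "long", "spot_name", "edge_arrivingTime")
    empty = {k: incoming[k] for k in static}
    empty["event"] = "Empty"
    # The spot becomes empty when the previous parking ends ...
    empty["startTime"] = outdated["endTime"]
    # ... and stays empty until the next vehicle starts parking.
    empty["endTime"] = incoming["startTime"]
    empty["length"] = empty["endTime"] - empty["startTime"]
    empty["vehicle_id"] = None  # assumption: no vehicle during an Empty event
    empty["fog_arrivingTime"] = fog_time
    return empty
```

With out-of-order tuples the computed length can be negative; how such cases are handled is not described in the text, so the sketch leaves the value unchanged.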
Once the data query task and the Empty event creation task at the fog are completed, all outdated Occupied tuples, current incoming tuples, and new Empty tuples contain a vector of 11 attributes corresponding to {spot_id, length, startTime, vehicle_id, lat, long, spot_name, event, endTime, edge_arrivingTime, fog_arrivingTime}. These event data tuples are transmitted to the data summarization task at the fog and to the data prediction task in the cloud for further analytics, as indicated in the lifecycle in Figure 5.
5.2.2. Analytical Tasks in Accumulated Data Streams
As mentioned in Section 3, the streaming descriptive statistics task can be implemented using frequency measurement, central tendency measurement, dispersion or variation measurement, and position measurement. We chose the first approach and implemented the analytical task using frequency measurement for the smart parking application. The aim of this task is to show how often parking events occur by reporting the parking frequency at each spot_id, grouped by vehicle_id. We also analyze driver parking behavior by computing the parking usage of each vehicle. At the edge, the data stream can be configured to accumulate at different time granularities (e.g., every 10 min).
The data aggregation task is executed at the fog to count how many times each parking spot was occupied every hour, day, or month. We have implemented a Python script to trigger the data aggregation task. For example, after each hour, a set of individual summaries is produced, in which each summary Q contains 4 main attributes {spot_id, lat, long, parking_frequency}. The aggregated data of this task are pushed to the data clustering task for further analytics.
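A minimal sketch of the hourly frequency count is given below. It is not the authors' script: the function name aggregate_window, the use of startTime to assign a tuple to a window, and the numeric timestamps in seconds are assumptions made for illustration.

```python
from collections import Counter
from typing import Iterable, List


def aggregate_window(tuples: Iterable[dict], window_start: float,
                     window_seconds: float = 3600) -> List[dict]:
    """Count Occupied events per parking spot inside one time window."""
    window_end = window_start + window_seconds
    counts = Counter()
    geo = {}
    for t in tuples:
        if t["event"] != "Occupied":
            continue  # only occupations contribute to the parking frequency
        if not (window_start <= t["startTime"] < window_end):
            continue  # the event started outside this window
        counts[t["spot_id"]] += 1
        geo[t["spot_id"]] = (t["lat"], t["long"])
    # One summary per spot with the four attributes named in the text.
    return [
        {"spot_id": spot, "lat": geo[spot][0], "long": geo[spot][1],
         "parking_frequency": n}
        for spot, n in sorted(counts.items())
    ]
```

Changing window_seconds to 86400 gives a daily count under the same assumptions.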
The aim of the data clustering task is to demonstrate how an incident or event can be diagnosed at the fog in near real time. To detect an occurrence, we build an algorithm based on the Hierarchical Agglomerative Clustering (HAC) [20] approach to cluster the temporal dimensions of the incoming aggregated data. We chose to implement this unsupervised learning method at the fog because it works independently and automatically, without human intervention. The HAC method starts by taking a chunk of the data stream and placing each data tuple into its own singleton cluster. Then, it merges the current pair of mutually closest clusters to form a new cluster. Finally, it repeats this merge step until a single cluster remains, which comprises the entire chunk of the data stream.
The input of our clustering algorithm is a set of aggregated data tuples in which each data point contains 4 features {spot_id, lat, long, parking_frequency}. The aggregated data tuples are pushed to the fog every hour. At the fog, we configure a user-defined weekly window. At the end of each time window, we trigger a data restructuring function that sorts the data so that each parking spot has not only its geo-information but also its parking frequency for each hour of the one-week window. Then, we apply Principal Component Analysis (PCA) to select the best attributes to feed to the clustering algorithm. The clustering algorithm is executed as shown in Algorithm 1.
Algorithm 1: Data clustering implementation for aggregated data based on Agglomerate Hierarchical Clustering approach |
![Sensors 19 03594 i001]() |
There are many criteria to measure the distance between two clusters, u and v, such as single linkage, complete linkage, average linkage, weighted linkage, centroid linkage, or median linkage. In our algorithm, we use Ward linkage since it can efficiently handle noise. In this case, the distance between a newly merged cluster and any other cluster is measured by the following equation.
$$d(u \cup v, s) = \sqrt{\frac{|u|+|s|}{T}\,d(u,s)^2 + \frac{|v|+|s|}{T}\,d(v,s)^2 - \frac{|s|}{T}\,d(u,v)^2}, \qquad T = |u|+|v|+|s|$$
where u and v are the two joined clusters, and s is any other cluster; |u|, |v|, and |s| are the sizes of clusters u, v, and s, respectively.
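The agglomerative procedure with a Ward criterion can be sketched in plain Python as follows. This is an illustrative, unoptimized version and not Algorithm 1 itself: it uses the equivalent centroid form of the Ward merge cost (the increase in within-cluster sum of squares), stops at a requested number of clusters instead of building the full hierarchy, and the function names are hypothetical.

```python
from typing import List, Sequence


def centroid(points: Sequence[Sequence[float]]) -> List[float]:
    """Mean vector of a non-empty list of points."""
    dim = len(points[0])
    return [sum(p[i] for p in points) / len(points) for i in range(dim)]


def ward_cost(a: Sequence[Sequence[float]], b: Sequence[Sequence[float]]) -> float:
    """Increase in within-cluster sum of squares if clusters a and b merge."""
    ca, cb = centroid(a), centroid(b)
    sq_dist = sum((x - y) ** 2 for x, y in zip(ca, cb))
    return len(a) * len(b) / (len(a) + len(b)) * sq_dist


def ward_hac(points: Sequence[Sequence[float]], n_clusters: int) -> List[List[int]]:
    """Merge singleton clusters pairwise until n_clusters remain.

    Returns the clusters as lists of indices into `points`.
    """
    if n_clusters < 1:
        raise ValueError("n_clusters must be at least 1")
    clusters = [[i] for i in range(len(points))]  # every point starts alone
    while len(clusters) > n_clusters:
        best = None
        # Exhaustive search for the pair with the smallest Ward merge cost.
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                cost = ward_cost([points[k] for k in clusters[i]],
                                 [points[k] for k in clusters[j]])
                if best is None or cost < best[0]:
                    best = (cost, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]  # j > i, so index i is unaffected
    return clusters
```

The exhaustive pair search is cubic in the number of points, which is acceptable only for small weekly chunks; a library implementation would normally be preferred in practice.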
Recently, Reference [21] proposed the Adaptive Random Forest (ARF) algorithm for making predictions on data streams. In our smart parking application, we have implemented the data prediction task for continuously incoming data tuples in the cloud based on this ARF algorithm. According to the data life-cycle in Figure 5, the contextualized data streams created by the data contextualization task become the input data for the data prediction task. From the contextualized data stream, we receive a sequence of contextualized tuples pushed from the fog, each corresponding to {spot_id, length, startTime, vehicle_id, lat, long, spot_name, event, endTime, edge_arrivingTime, fog_arrivingTime}. For each tuple, we use the attribute event = {Occupied | Empty} as the predictive target label when the tuple is input to the ARF algorithm. It is worth noting that the ARF algorithm assumes that the tuples of the input data stream are independent and identically distributed (iid). In our contextualized data stream, each data tuple is independent: it neither influences nor is influenced by other tuples. Also, the data contextualization task deduces the event when each tuple arrives at the fog node. Therefore, the ground-truth target label corresponding to the other attributes in a tuple is always available before the next tuple is presented to the learning algorithm.
Algorithm 2: Data prediction implementation using the Adaptive Random Forest over the contextualized data stream in the cloud |
![Sensors 19 03594 i002]() |
Algorithm 2 illustrates the procedure for implementing the ARF algorithm in the cloud to predict the event from the incoming contextualized data stream. Unlike the batch random forest algorithm, where all data instances are available for training, stream learning performs training incrementally as each new data tuple becomes available. While growing trees over the current incoming data tuple, Algorithm 2 detects when concept drift occurs in a tree and replaces that tree with its respective background tree. The performance P of the ARF model is computed according to a loss function that evaluates the difference between the set of expected target labels and the predicted ones.
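The incremental training and evaluation loop can be illustrated with a test-then-train (prequential) sketch. Two assumptions are made loudly here: the learner below is a trivial majority-class stand-in and not an Adaptive Random Forest, and accuracy (a 0-1 loss) is used as the performance measure, since the text does not name the loss function. The names MajorityClassLearner and prequential_accuracy are hypothetical.

```python
from collections import Counter
from typing import Iterable


class MajorityClassLearner:
    """Stand-in incremental learner; NOT an Adaptive Random Forest."""

    def __init__(self) -> None:
        self.counts = Counter()

    def predict_one(self, x: dict):
        # Predict the most frequent label seen so far (None before any training).
        if not self.counts:
            return None
        return self.counts.most_common(1)[0][0]

    def learn_one(self, x: dict, y) -> None:
        self.counts[y] += 1


def prequential_accuracy(stream: Iterable[dict], model, target: str = "event") -> float:
    """Test on each tuple first, then train on it; return the running accuracy."""
    correct = total = 0
    for tup in stream:
        x = {k: v for k, v in tup.items() if k != target}
        y = tup[target]  # ground truth is available before the next tuple
        correct += int(model.predict_one(x) == y)  # test first
        total += 1
        model.learn_one(x, y)  # then train on the same tuple
    return correct / total if total else 0.0
```

Any incremental classifier that exposes the same two methods could be substituted for the stand-in without changing the evaluation loop.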