1. Introduction
Despite the increasing popularity of Internet of Things (IoT) systems, we find that most IoT literature is concerned either with investigating such systems as sources of data and information (e.g., Ref. [1]), often to be used as validation datasets for machine learning algorithms that detect different classes of properties in these datasets (e.g., Refs. [4]), or with the classical software engineering aspect of these systems, which aims to achieve specific IoT-based solutions to all sorts of problems in different application domains (e.g., Refs. [6]). There is, in fact, a lack of understanding of the relationship between these two views: the data-based model of IoT systems and their formal models. Understanding one form of this relationship, and its usefulness, is the focus of our research here, which combines these two views.
In this paper, we propose a novel methodology in which abstract data representing the normal behaviour of a communicating system, derived by statically analysing its formal specification, are used afterwards to detect and filter out real data representing abnormal behaviour of some implementation of the same system at runtime. We call such abstract data the data blueprint of the system, against which data from its runtime behaviour are compared. We demonstrate that even the simplest of static analyses, in this case one that produces a data blueprint based merely on what data should be communicated, suffices to simplify anomaly detection for the system, which would otherwise require the application of complex machine learning methods.
We apply our new method to one popular IoT protocol, namely the MQ Telemetry Transport (MQTT) protocol [
7], specifically to the QoS 0 mode of this protocol. We first model the protocol in the formal language of the π-calculus [8], and then use an abstract interpretation of this language that captures message communications to provide the data blueprint for the protocol. We then compare this to some actual message communications captured during a study in the literature, and we demonstrate that our data blueprint can easily detect anomalies in these communications.
The rest of the paper is structured as follows. In Section 2, we discuss a few examples of literature related to our approach. In Section 3, we give an overview of the formal language used for modelling MQTT, the π-calculus. In Section 4, we define a name-substitution abstract semantics for the π-calculus, which allows us to trace where messages are communicated to. In Section 5, we demonstrate how the data blueprint generated from the abstract semantics is used to detect anomalies for the case of an example study from the literature. Finally, in Section 6, we conclude the paper and provide directions for future expansion of the ideas within.
2. Related Work
There is much literature [
14] regarding formally specifying and verifying IoT systems, networks, protocols, and standards, usually for the purpose of understanding the properties of such systems. We give here only a few examples. The author in Ref. [
9] proposes the use of the probabilistic model checker, PRISM [
10], as a framework for specifying and checking the functional properties of IoT systems. The work uses probabilities, as quantitative concepts, to be modelled in the formal language. Interaction and interoperability in IoT systems are studied in Ref. [
11] using the Tree Query Logic of Ref. [
15]. This allows for a multi-layered view of an IoT system to be constructed, and hence the properties verified at the different layers. In Ref. [
12], the authors use Event-B [
16] to model and verify the properties of IoT communication protocols, such as the presence of duplicate channels, persistent sessions, and message ordering, in MQTT [
7], MQTT-SN [
17], and CoAP [
18]. Other efforts in this area focus on more specific goals, for example, the verification of security properties through the discovery of vulnerability surfaces, as in Refs. [
14]. Neither these works nor others address the question of how formal specifications can be used to benefit the detection of anomalous behaviour in recorded data.
One of the research works close to ours is that of Ref. [
19], where training datasets related to musical songs are used to construct formal specifications in the form of automata machines capable of playing similar genres of music. In some sense, this is almost the reverse of our approach, which uses formal specifications to understand the dataset itself. Similarly, the work of Ref. [
20] proposes the formalisation of metadata specifications in order to discover datasets (more robustly) matching those metadata, more specifically, the discovery of geospatial datasets. Both of these works are related to ours, but they have different motivations and, therefore, different approaches.
Formal verification techniques have also been used to verify big data-related applications, tools, platforms, and technologies. Although these are not strictly similar in their approach to ours, since they fall under the verification of system specifications category, we mention a few examples here for completeness. In Ref. [
21], the authors propose stochastic model checking using UPPAAL [
22] to model the execution of big data applications, in particular, to model the execution strategies in Apache Spark [
23]. In addition, in Ref. [
24], formal verification is used to evaluate the execution time of Apache Spark applications, using a combination of directed acyclic graphs and constraint linear temporal logic. In Ref. [
25], the authors propose an extension of the
i* requirements engineering methodology [
26], called BiStar, more adapted for modelling the requirements of big data applications. The integrity property of BiStar is then modelled and verified using bigraphs [
To some extent, one may also consider slightly relevant the most recent work of Ref. [
28], which extensively covers the relationship between formal verification in computer science and stochastic analysis in mathematics, with application to big datasets related to 3D road networks and household power consumption. Finally, we should mention the works of Refs. [
31], who provide extensive reviews of the applications of formal methods to machine learning and the relationship between the two areas.
3. Theory
We specify systems using the formal language of the π-calculus [8], as this is a simple and easy language to formulate, using the following syntax:
The syntax states informally that processes are defined as one of the following: an inactive process, $0$, incapable of performing any further activity; a guarded process, $\pi.P$, which performs an action and continues with the residue $P$; the creation of a new name, $(\nu x)P$, with the scope restricted to $P$; the parallel composition of two processes, $P \mid Q$; a non-deterministic choice between two processes, $P + Q$; and finally, replication, $!P$, which can spawn any number of copies of $P$. The actions, $\pi$, are defined using names $x, y \in \mathcal{N}$, in terms of input actions, $x(y)$, output actions, $\overline{x}\langle y \rangle$, or silent unobservable actions, $\tau$. We refer to the set of the free names of a process as $fn(P)$ and that of the bound names as $bn(P)$, and we assume that initially, the bound names of a process are selected such that no two bound names are the same ($\alpha$-conversion). We write .
The standard semantics of processes is given using the classical structural and operational relations, shown in
Figure 1, which determine how a process can change its shape and evolve through communications.
We describe first the rules of the structural relation ≡. Rule – state that is a commutative monoid. Rules and state that the non-deterministic choice is commutative and associative over . Rule (also known as scope extrusion) allows scope restriction on a newly created name to be removed from the left-hand process, if the name is different from any of its free names. Rule removes name creation from a null process, and rule allows the order of name restriction to be swapped on the same process. Finally, rule allows a replication to spawn a copy of its process.
On the other hand, the rules of the operational relation are explained as follows. Rules – allow a guarded process to fire its action, making it observable to the context. Rule is the most important rule, as it allows communications to take place between matching input and output actions. The result is the replacement of the input action parameter with the message carried by the output action. Both processes will continue with their residual part. Rule is similar to rule , except it deals with bound outputs, where the scope of the bound output message is then restricted to the two residual processes. Rules and state that internal communications can also occur under parallel composition and non-deterministic choice composition. In the latter case, the process with the internal communication will continue as the main process, removing the option of the other inactive process. Rule states that name restriction has no impact on the observation of an output action, if that name is different from the names of the channel and message being observed. Rule turns a free name output into a bound name output by restricting the scope of the output message. Rule states that an input action is possible under name restriction, as long as the restriction is not on the name of the communication channel. If the restricted name is the same as the input parameter, then α-conversion is necessary to distinguish the two names. Finally, rule states that silent internal actions can take place under any name restriction.
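The communication rule described above can be illustrated with a minimal Python sketch. It is not the semantics of Figure 1: it covers only inaction, input and output prefixes, and the class and function names (`Nil`, `Input`, `Output`, `substitute`, `communicate`) are hypothetical. It also relies on the stated assumption that bound names are distinct from each other and from the free names, so no renaming is needed to avoid capture.

```python
from dataclasses import dataclass
from typing import Optional, Tuple, Union


@dataclass(frozen=True)
class Nil:
    """The inactive process."""


@dataclass(frozen=True)
class Output:
    """Output prefix: send `message` on `channel`, then continue as `cont`."""
    channel: str
    message: str
    cont: "Process"


@dataclass(frozen=True)
class Input:
    """Input prefix: receive on `channel`, binding `param` in `cont`."""
    channel: str
    param: str
    cont: "Process"


Process = Union[Nil, Output, Input]


def substitute(p: Process, old: str, new: str) -> Process:
    """Replace free occurrences of `old` by `new` in p."""
    if isinstance(p, Nil):
        return p
    channel = new if p.channel == old else p.channel
    if isinstance(p, Output):
        message = new if p.message == old else p.message
        return Output(channel, message, substitute(p.cont, old, new))
    if p.param == old:
        # `old` is rebound by this input, so the continuation is left untouched.
        return Input(channel, p.param, p.cont)
    return Input(channel, p.param, substitute(p.cont, old, new))


def communicate(sender: Output, receiver: Input) -> Optional[Tuple[Process, Process]]:
    """One communication step: both residues, or None if the channels differ."""
    if sender.channel != receiver.channel:
        return None
    return sender.cont, substitute(receiver.cont, receiver.param, sender.message)


# An output of m on c meets an input c(y) whose residue outputs d on y.
step = communicate(Output("c", "m", Nil()), Input("c", "y", Output("y", "d", Nil())))
print(step)
```

After the step, the receiver's residue outputs on `m` rather than on `y`, which is the replacement of the input parameter by the communicated message.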
In addition to the above semantics, in Ref. [
32], we defined a non-standard name-substitution semantics, which when abstracted using an approximation function, yields a meaning represented by the abstract environment
, which maps each abstract name to a set of abstract names that can replace that name during a process’s interpretation.
represents the set of abstract names. Unlike
is defined such that it is finite, and therefore,
is also finite. The resulting semantic domain,
, consists of all such possible environments and guarantees termination for an abstract interpretation computed over it, such as that defined in Ref. [
32]. The bottom element of such domain,
, is the empty environment where
. In Ref. [
32], we gave an abstract interpretation of the π-calculus using the semantic domain
, and we showed this to be safe with regard to the name-substitution semantics, similar to other analyses we had defined for different variations of the π-calculus in Refs. [
35]. We next review this abstract interpretation in more detail.
4. An Abstract Semantics
Let us examine in more detail the rules of the abstract interpretation of the π-calculus, which captures the property of name substitutions in an abstract manner. These rules are shown in Figure 2 below.
We explain these informally as follows. Rules and , for null processes and input actions, respectively, do not change the value of the environment, since no communications take place in these rules. Rule deals with output actions, where the meaning of a process guarded by an output action is given as the union of two environments. The first environment reflects all possible communications between the output action and matching input actions in . A communication takes place whenever the sets of values of the two communicating channels have similar name values or share a common name value from previous substitutions. This means that the names of the channels are the same (i.e., the channels are free or restricted names) or that there must have been similar name values substituting both channels earlier in the interpretation (i.e., the channel names are closed under input actions). The effect of the communication is reflected by adding the value of the message to the value of the substituted input parameter in . The second environment is an unchanged , reflecting the option that no communication may take place. Rules – are straightforward. Rule removes a silent action and reinterprets the rest of the process. Rule does the same for the case of a newly created name, given that all bound names are distinct. Rules and distribute the two sides of parallel composition and choice to the rest of the processes in . The rule for replicated processes, , attaches subscripts to bound names and tags of the spawned processes according to the copy number of each process. This is necessary to ensure that these names and tags remain distinct throughout the semantics. The rule uses a least fixed-point calculation for a special function, , which starts by adding a single copy of P and then increments until the least fixed point is reached.
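The treatment of output actions can be approximated by the following Python sketch. It is a simplification made here for illustration and not the rule of Figure 2: the environment is a dictionary from input parameters to sets of names, and the helper names `value_of` and `interpret_output`, as well as the way parallel inputs are passed in as (channel, parameter) pairs, are assumptions.

```python
def value_of(phi, input_params, name):
    """Names that may stand for `name` during interpretation.

    Assumption: an input parameter is represented by the names recorded for
    it in phi, while a free or restricted name represents only itself.
    """
    if name in input_params:
        return set(phi.get(name, set()))
    return {name}


def interpret_output(phi, input_params, channel, message, parallel_inputs):
    """Abstractly fire an output of `message` on `channel`.

    parallel_inputs is a list of (input_channel, input_parameter) pairs for
    the input actions running in parallel. A new environment is returned and
    phi itself is left unmodified.
    """
    result = {name: set(values) for name, values in phi.items()}
    out_values = value_of(phi, input_params, channel)
    for in_channel, parameter in parallel_inputs:
        # A communication is possible when the channels share a name value.
        if out_values & value_of(phi, input_params, in_channel):
            result.setdefault(parameter, set()).update(
                value_of(phi, input_params, message))
    # Starting from a copy of phi makes the union with the unchanged
    # environment implicit: entries are only ever added.
    return result


# Hypothetical system: an output of m on c, in parallel with c(y) whose
# residue outputs d on y, in parallel with an input m(z).
params = {"y", "z"}
phi1 = interpret_output({}, params, "c", "m", [("c", "y")])
phi2 = interpret_output(phi1, params, "y", "d", [("m", "z")])
print(phi1, phi2)
```

The second call shows the case where channels match only through an earlier substitution: `y` was instantiated by `m`, so the output on `y` can reach the input on `m`.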
The abstraction function,
, constrains the number attached to a new copy of a bound name created during the replication process over some maximum permitted number of copies,
. Hence,
, we can define
For example, if we set
, and we apply the renaming mechanism to the structural semantics relation, we would get the following congruence:
where every copy after the third copy is still approximated to being the third copy.
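The capping behaviour of the abstraction can be sketched in a few lines of Python. This is only an illustration: the function names `cap_copy_index` and `abstract_name`, and the `root_index` string encoding of numbered copies, are assumptions made here rather than notation from the definitions above.

```python
def cap_copy_index(index: int, k: int) -> int:
    """Map the index of a spawned copy to its abstract index, never above k."""
    if index < 1 or k < 1:
        raise ValueError("copy indices and k must be positive")
    return min(index, k)


def abstract_name(root: str, index: int, k: int) -> str:
    """Abstract name for the index-th copy of a bound name (hypothetical encoding)."""
    return f"{root}_{cap_copy_index(index, k)}"


# With k = 3, the fourth and later copies collapse onto the third one.
copies = [abstract_name("x", i, 3) for i in range(1, 6)]
print(copies)
```

Because the set of abstract indices is bounded by k, the set of abstract names stays finite however many copies a replication spawns.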
To simplify our analysis in the next section, we define a variation of
, which, instead of having number-distinguished copies of the same name, contains the same name multiple times. Such a
multiset version, written as
, can be defined as follows:
where a copy of a name,
, is replaced by the original root name,
x, up to the maximum
copy of
x occurring in
, where
is either an input parameter or a newly created name. We now arrive at the definition of a
data blueprint for some process.
Definition 1. Define the data blueprint, , for a process, P, as follows:
Hence, the blueprint environment only reflects the communicated messages, ignoring the input parameters these instantiate.
5. Example: MQTT Data Analysis
The MQ Telemetry Transport (MQTT) protocol [
7] is described as a lightweight broker-based publish/subscribe messaging protocol that was designed to allow devices with small processing power and storage, such as those which the IoT is composed of, to communicate over low-bandwidth and unreliable networks. The publish/subscribe message pattern [
36], on which MQTT is based, provides for one-to-many message distribution with three varieties of delivery semantics, based on the level of quality of service expected from the protocol. These include the “exactly once” delivery semantics, the “at least once” delivery semantics and the “at most once” delivery semantics. We focus here on the “at most once” semantics, as this is the one most relevant to our example later.
In the “at most once” case, the protocol is configured to deliver messages with the best effort of the underlying communication network, and given that many networks are unreliable, there would be no guarantee that the MQTT messages will be delivered. This protocol, also termed the
QoS 0 protocol, is represented by the following flow of messages and actions:
The protocol defines the message communications between
Clients, i.e., end-devices responsible for generating data from their domain (the data source), and
Servers, i.e., brokers responsible for collecting source data from clients, based on specific topics, and publishing this data to interested subscribers. We can define the
-calculus model of the QoS 0 protocol as shown in
Figure 3.
In this model, the protocol consists of three top-level processes: Client, Server and Subscriber. The
Client process represents any IoT device, which after connecting to the
Server, will always publish some data by sending them in a
Publish message, and so on. The
Server process, on the other hand, after acknowledging the connection message sent by the
Client, will always receive the published data and send these to the
Subscriber process. Finally, the
Subscriber process is always waiting to receive the published data. The model is abstract, but sufficient to capture anomalies in data, as we show in
Section 5.2.
5.1. The QoS 0 Protocol Data Blueprint
Applying a non-uniform abstract interpretation for some value of k, e.g., k = 3, reveals the following
where its
equivalent is defined as follows:
From this, we can obtain the following data blueprint for the QoS 0 protocol:
In this case, we see that there are three copies of the Publish message, and a single copy of each of the Connect and Connack messages. We consider this analysis as providing the data blueprint corresponding to the normal behaviour of any system running an MQTT-based network in the QoS 0 mode, up to the choice of k. The choice of k depends on the trade-off between precision and efficiency: larger values of k produce analyses with higher precision, but these take longer to run. For k = 1 (i.e., a uniform analysis), it is impossible for any attack relying on the repetition or multiplicity of messages to look anomalous with regard to the resulting data blueprint. Therefore, we avoid the case of a uniform analysis.
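The effect of the choice of k on repetition-based attacks can be seen in a small Python sketch. Both traces below are hypothetical and are not taken from any dataset, and `abstract_trace` is an illustrative helper that models the multiset abstraction simply as per-message counts capped at k.

```python
from collections import Counter


def abstract_trace(trace, k):
    """Count each message type in the trace, capping every count at k."""
    return Counter({msg: min(n, k) for msg, n in Counter(trace).items()})


# Hypothetical traces: one connection followed by publishing, and one where
# the connection handshake is repeated four times.
benign = ["Connect", "Connack"] + ["Publish"] * 5
repeated_connects = ["Connect", "Connack"] * 4 + ["Publish"] * 5

for k in (1, 3):
    same = abstract_trace(benign, k) == abstract_trace(repeated_connects, k)
    print(f"k = {k}: traces indistinguishable after abstraction: {same}")
```

With k = 1 every message type collapses to a single occurrence, so the repeated handshake disappears; with k = 3 the two traces abstract to different multisets.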
Property 1 (Normal Behaviour).
We characterise a protocol run as being normal,
if and only if, for some abstract dataset, , representing a run of the protocol, we have that: for some approximation number, k.
5.2. A Case of Intrusion Detection
Let us now review one example from the recent literature, a study presented in Ref. [37] as a typical example of the application of machine learning algorithms to analysing dataset features and extracting interesting information. We use this case study to demonstrate how our concept of data blueprints renders the analysis of intrusions straightforward, avoiding all the complexities associated with a machine learning-based approach, which reveals no better knowledge here than what is revealed by our approach.
In Ref. [
37], the authors used Wireshark to sniff packets (i.e., messages) exchanged over an MQTT network, both under normal and attack circumstances. This resulted in the following cases.
5.2.1. The Normal Case
The analysis of Ref. [37] (§4.A) captures the following sequence of messages for the normal case scenario:
1. Connect Command
2. Connect Ack
3. Publish Message
…
24. Publish Message
Applying our abstraction
, which limits the instances of the communicated messages to 3, we would obtain the following abstract representation of the above dataset:
We can see here that
, and therefore, it indicates normal behaviour of the protocol according to Property 1.
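This check can be reproduced with a short Python sketch, under several assumptions that should be read with care. The blueprint is encoded as the multiset stated in Section 5.1 for k = 3, the short message names are an encoding chosen here, Property 1 is read as "no abstract count exceeds the blueprint's count", and the elided entries 4 to 23 of the capture are assumed to be Publish messages as well.

```python
from collections import Counter

K = 3
# Blueprint for k = 3 as stated in Section 5.1 (names are an encoding chosen here).
BLUEPRINT = Counter({"Connect": 1, "Connack": 1, "Publish": 3})


def abstract_trace(trace, k):
    """Count each message type in the trace, capping every count at k."""
    return Counter({msg: min(n, k) for msg, n in Counter(trace).items()})


def is_normal(trace, blueprint, k):
    """Assumed reading of Property 1: no abstract count exceeds the blueprint's."""
    abstracted = abstract_trace(trace, k)
    # A Counter returns 0 for unknown message types, so those are flagged too.
    return all(n <= blueprint[msg] for msg, n in abstracted.items())


# Connect, Connack, then Publish messages 3 to 24 (entries 4-23 are assumed).
normal_capture = ["Connect", "Connack"] + ["Publish"] * 22

print(abstract_trace(normal_capture, K))
print(is_normal(normal_capture, BLUEPRINT, K))
```

Under these assumptions the 22 Publish messages collapse to three, and the abstracted capture coincides with the blueprint.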
5.2.2. First Attack Scenario
Now, let us consider the second dataset generated in Ref. [
37] (§4.B), which contains the following sequence of messages:
1. Connect Command
2. Connect Ack
…
17. Connect Command
18. Connect Ack
19. Publish Message
…
24. Publish Message
Abstracting this dataset using
, gives us the following abstract environment:
In this case, we can clearly see that
and therefore there is some anomalous behaviour as captured by the dataset of [
37] (§4.B).
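The verdict for this dataset does not depend on the elided entries, which the following Python sketch illustrates. It uses only the six messages printed explicitly in the listing (positions 1, 2, 17, 18, 19 and 24), with the blueprint of Section 5.1 for k = 3 encoded as a multiset; the short message names and the reading of anomaly as "some capped count exceeds the blueprint's count" are assumptions.

```python
from collections import Counter

K = 3
# Blueprint for k = 3 as stated in Section 5.1 (names are an encoding chosen here).
BLUEPRINT = Counter({"Connect": 1, "Connack": 1, "Publish": 3})

# Only the entries that appear explicitly in the listing; elided ones are omitted.
listed = ["Connect", "Connack", "Connect", "Connack", "Publish", "Publish"]

# Capped counts of the listed entries are a lower bound on the full abstraction.
lower_bound = Counter({m: min(n, K) for m, n in Counter(listed).items()})
violations = {m: n for m, n in lower_bound.items() if n > BLUEPRINT[m]}
print(violations)
```

Since capped counts can only grow as more messages are added, the second Connect and Connack already place the dataset outside the blueprint, whatever the elided entries contain.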
5.2.3. Second Attack Scenario
Finally, let us consider the dataset generated in Ref. [
37] (§4.C), which gives us the following sequence of messages:
1. Connect Ack
2. Connect Command
3. Connect Ack
4. Connect Command
5. Connect Ack
6. Publish Message
7. Connect Command
8. Connect Ack
9. Connect Command
10. Connect Ack
11. Publish Message
12. Connect Command
13. Connect Ack
14. Connect Command
15. Connect Ack
16. Publish Message
17. Connect Command
18. Connect Ack
19. Connect Command
When abstracted for
, this dataset yields the following environment:
Again here, we can see that
, and therefore the dataset of [
37] (§4.C) reveals abnormal behaviour in the protocol run.
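Because this listing is given in full, the comparison can be replayed directly in Python. The sketch below is an illustration under assumptions: the mapping from the captured labels to the short names Connect, Connack and Publish, and the encoding of the blueprint of Section 5.1 (k = 3) as a multiset, are choices made here.

```python
from collections import Counter

K = 3
# Blueprint for k = 3 as stated in Section 5.1 (names are an encoding chosen here).
BLUEPRINT = Counter({"Connect": 1, "Connack": 1, "Publish": 3})
SHORT = {"Connect Command": "Connect", "Connect Ack": "Connack",
         "Publish Message": "Publish"}

# The 19 messages of the listing above, in order.
capture = [
    "Connect Ack", "Connect Command", "Connect Ack", "Connect Command", "Connect Ack",
    "Publish Message", "Connect Command", "Connect Ack", "Connect Command", "Connect Ack",
    "Publish Message", "Connect Command", "Connect Ack", "Connect Command", "Connect Ack",
    "Publish Message", "Connect Command", "Connect Ack", "Connect Command",
]

raw = Counter(SHORT[m] for m in capture)
abstracted = Counter({m: min(n, K) for m, n in raw.items()})
# Counter subtraction keeps only the positive differences.
excess = abstracted - BLUEPRINT
print(raw, abstracted, excess)
```

The raw capture contains eight Connect, eight Connack and three Publish messages; after capping at three, the Connect and Connack counts each exceed the blueprint by two, while the Publish count matches it.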
6. Conclusions
We presented in this paper a new method for detecting anomalies in datasets representing system behaviour, using a direct comparison with an analysis of the system's formal specification. We highlight that this method is different from, and more robust than, fuzzy methods that use machine learning algorithms, as it relies directly on formal specification and verification methods in detecting anomalous data. Such methods provide verifiable evidence for the presence of anomalies, unlike the fuzzy methods, which, by their nature, can only hint at such anomalies. Furthermore, we applied our method to a case of IoT systems that use the MQTT protocol, and particularly to a recent study in the literature that presents multiple normal and abnormal datasets for this protocol. One of the shortcomings of our approach, at the current level of information that the analysis produces, is that it does not indicate which kind of attack the detected abnormal state corresponds to, only that there is something abnormal (i.e., possibly some kind of attack).
We plan in the future to extend this method with more definitions of the semantic properties underlying the formal language, in order to obtain more variations of normality in datasets and hence extend the ability to detect more anomalous data. Because of the generality of this method, we also plan to apply it to other systems and protocols.