1. Introduction
The advancement of sensor monitoring technologies and low-cost solutions, together with the introduction of the Internet of Things (IoT) into everyday life, has resulted in the capture of huge volumes of data [1]. Data streams are huge, continuous, unbounded sequences of data that are generated at a rapid rate and have a dynamic distribution. Data stream mining is an active research area that recently emerged to extract knowledge from these enormous amounts of continuously created data. Cooperative intelligent transport systems (C-ITSs) with networked vehicles are poised to transform the future of mobility. This transformation is facilitated by the flow of messages between vehicles via vehicle-to-vehicle (V2V) communication and between vehicles and transportation infrastructure via vehicle-to-infrastructure (V2I) communication. Cooperative awareness messages (CAMs) provide real-time information about individual vehicles. Nonetheless, due to the novelty of the concept, the impact of C-ITS services on road networks has yet to be fully measured and analysed [2].
Anomalies are “patterns in data that do not conform to a well-defined notion of normal behaviour” [3]. They are classified into three types: point anomalies, contextual anomalies, and collective anomalies [3]. Point anomalies, or “outliers”, are individual data elements that are inconsistent or anomalous in relation to all other data elements [4]. Contextual anomalies are data elements that are considered unusual in a certain context. Collective anomalies are groups or sequences of connected data elements that are out of sync with the rest of the dataset. For example, heavy traffic on a highway during business hours is usual, yet the same traffic behaviour is contextually anomalous after midnight [1]. Contextual attributes (such as time of day, season, and location) and behavioural attributes define each data instance when seen in context. The early identification of anomalies can decrease event risks, such as accidents and traffic jams. The majority of these occurrences may be attributed to driver error or poor road conditions. Road users and authorities benefit from identifying the location, time, and frequency of these road abnormalities.
Traffic incidents are non-recurring events that may cause traffic congestion and travel time delays. An incident is “an unexpected event that temporarily disrupts the flow of traffic on a segment of a roadway” [5]. To lessen the impact and duration of incidents, it is critical to understand the frequency of occurrences by spotting variations from usual traffic patterns. Road incidents/anomalies include car wrecks, vehicle breakdowns, debris on the road, and vehicles stalled in the middle of the road. Two forms of traffic irregularity are traffic jams and road management anomalies [6]. Short-term traffic disturbances may persist from minutes to several hours, inducing a decline in traffic velocity or an increase in traffic density. Resolving long-term traffic management anomalies is a challenging task that may require considerable time and effort. Deviations in traffic can be examined through either local traffic anomalies or group traffic anomalies. For local traffic anomalies, the road network is divided into separate segments, and each segment is analysed for individual abnormalities. For group traffic anomalies, an irregularity detected in one portion of a road network will influence adjacent segments and is assessed by analysing the causal connections between them.
The detection of incidents in [5] relies on actual Global Positioning System (GPS) data collected from vehicle tracks. The road network is segmented by road type, date, time, and the predominant weather conditions. Segments that exhibit a significantly lower average speed than the designated normal speed are regarded as abnormal and are extracted. The problem with this technique is that the segmentation process is affected by the precision of polygonal line coordinates, and the accuracy range of GPS influences the differentiation of incidents from typical traffic congestion. To identify long-term abnormal traffic zones in large urban centres, ref. [7] proposes long-term traffic anomaly detection (LoTAD). The method divides the road network into sections by utilising bus line data and an actual bus trajectory dataset, which results in temporal and spatial segments known as TS segments. Anomalies in bus lines are detected through the computation of an anomaly index, utilising the average velocity and average stop time as trajectory features. The data obtained from the atypical areas can provide valuable input for future urban traffic planning. These kinds of anomalies can be detected with the tool proposed in [8].
The filter–discovery–match (FDM) method [9] is proposed for determining accident locations. It involves dividing a roadway network into sections and creating speed vectors using the average speed. Actual incident records are used to determine the specific sections of road where an incident took place. Subsequently, the speed vectors of vehicles passing through those sections during the incident time are extracted. The regular velocity direction of the road sections is determined by computing the average velocity of the vehicles that crossed those sections within a specific time period and were not impacted by any traffic disruptions. The velocity disparities between the incident speed vectors and the regular speed vectors for each segment are used to determine the candidate speed patterns. Through thorough experimentation using both real taxi data and simulated data, it was found that FDM achieved a lower mean time-to-detect (MTTD) than other existing techniques.
A comprehensive body of research has been conducted to create diverse anomaly detection algorithms spanning several categories, namely classification, nearest-neighbour, clustering, statistical, information-theoretic, spectral, and graph-based approaches [3,10]. Histogram-based outlier score (HBOS) [11] operates on the premise of feature independence and computes outlier scores via the construction of histograms for individual features. Swift computation is achieved without the need for data labelling. Time is essential in computing, especially in C-ITS, where an enormous volume of data must be analysed to identify irregularities. Deviations from typical road traffic data are perceived by analysing the intricate attributes of the constructed histograms to spot anomalies [12]. Two categories of histograms can be constructed: static bin-width and dynamic bin-width histograms. To achieve the uniform weighting of every feature, the maximum height of each bin is standardised to one by normalising the histograms, and the quantified results are inverted, so that abnormal instances receive a higher score and normal instances receive a lower score. This also minimises the impact of floating-point precision errors that can lead to imbalanced distributions and elevated scores. For each instance x, the HBOS is determined by the height of the bins in which the instance falls (Equation (1)):

$$ \mathrm{HBOS}(x) = \sum_{i=1}^{d} \log\!\left(\frac{1}{hist_i(x)}\right) \quad (1) $$

where d denotes the number of features, x is the vector of features, and hist_i is the density estimate of feature i.
HBOS scoring produces numerical values that indicate the degree of “outlierness” of each data point in relation to the rest of the dataset. The final stage is thresholding, where a decision label is assigned to each element, indicating whether it is a regular instance or an anomaly, depending on the threshold parameter Th. Different statistical deviation measures, such as the standard deviation, the median absolute deviation (MAD), quantiles, and streaming analysis with a defined window, can be utilised to establish the value of Th. For example, a score exceeding three times the standard deviation can be deemed an anomaly. Another method is to sort the scores so that a top-k algorithm returns the k most anomalous observations.
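To make the scoring and thresholding steps concrete, the following minimal sketch (our illustration, not code from the cited works) applies the PyOD library's HBOS detector to synthetic data and flags scores above three standard deviations:

```python
# Minimal HBOS sketch (illustrative; assumes the PyOD library).
import numpy as np
from pyod.models.hbos import HBOS

rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 4))       # hypothetical n x d feature matrix
X[:10] += 6.0                        # inject a few synthetic point anomalies

detector = HBOS(n_bins=10)           # static bin-width histograms
detector.fit(X)
scores = detector.decision_scores_   # higher score = more anomalous

# Thresholding: deem anomalous any score above mean + 3 standard deviations.
th = scores.mean() + 3 * scores.std()
labels = (scores > th).astype(int)   # 1 = anomaly, 0 = regular instance
print(f"{labels.sum()} anomalies flagged out of {len(X)} instances")
```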
Anomalies can also be identified by assuming that the data follow a specific probability distribution and categorising data points with a low probability density as anomalous. In an elliptical distribution, the Mahalanobis distance between each point and the mean is calculated, with points exceeding a predetermined threshold being categorised as anomalies. Due to its ability to resist outlying observations, the minimum covariance determinant (MCD) [13] serves as a highly dependable means of identifying anomalies in multivariate contexts. The dataset is presented as an n × p matrix, where n refers to the number of instances and p to the number of features. The initial stage in obtaining the MCD estimator involves calculating the determinant of the sample covariance matrix. A subset of h data points, wherein n/2 ≤ h ≤ n, is selected from the full sample of n points in a way that minimises the generalised variance of the subset h. The MCD estimator defines the following location and scatter estimates [14]:

- $\hat{\mu}_{MCD}$, the mean of the h observations with the smallest possible determinant of the sample covariance matrix;
- $\hat{\Sigma}_{MCD}$, the associated covariance matrix multiplied by a consistency factor c.

The mean and the covariance matrix are used to calculate the robust distance of a point x, defined as [14]

$$ RD(x) = \sqrt{(x - \hat{\mu}_{MCD})^{T}\, \hat{\Sigma}_{MCD}^{-1}\, (x - \hat{\mu}_{MCD})} \quad (2) $$

where $\hat{\mu}_{MCD}$ is the MCD estimate of location and $\hat{\Sigma}_{MCD}$ is the MCD covariance estimate. MCD selects the portion of the data with the closest distribution to eliminate anomalies, as they tend to be far from the bulk of the data. This minimises the masking effect caused by atypical observations [15].
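As an illustration (ours, not taken from [13,14,15]), the robust distances can be computed with scikit-learn's MinCovDet estimator, flagging points whose squared distance exceeds the usual chi-squared cut-off:

```python
# MCD robust-distance sketch (illustrative; assumes scikit-learn and SciPy).
import numpy as np
from scipy.stats import chi2
from sklearn.covariance import MinCovDet

rng = np.random.default_rng(0)
X = rng.multivariate_normal([0, 0], [[1, 0.6], [0.6, 1]], size=500)
X[:5] += 8.0                                   # synthetic multivariate outliers

n, p = X.shape
mcd = MinCovDet(support_fraction=0.75).fit(X)  # h = 0.75 * n observations
rd2 = mcd.mahalanobis(X)                       # squared robust distances RD(x)^2

# Flag points whose squared robust distance exceeds the 97.5% chi-squared
# quantile with p degrees of freedom.
outliers = rd2 > chi2.ppf(0.975, df=p)
print(f"{outliers.sum()} anomalies flagged out of {n} instances")
```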
The isolation forest (IForest) [16] algorithm is utilised to uncover anomalies in high-dimensional data. It is a non-parametric method that demonstrates favourable results when applied to normally distributed, unbiased data that contain minimal noise [4]. Its suitability for anomaly detection in C-ITS data lies in the fact that the data lack a prior distribution and remain unlabelled. The IForest model is composed of a collection of unique, random isolation trees (itrees) that are divided into nodes through recursive partitioning. IForest's scoring stage computes an anomaly score for each data observation within the dataset. The outlier score is calculated based on the distance between the leaf and the root. The final outcome is obtained by taking the mean of the path lengths of the individual data points over the different itrees within the isolation forest. Given an instance x, the anomaly score is defined as

$$ s(x, n) = 2^{-\frac{E(h(x))}{c(n)}} \quad (3) $$

where E(h(x)) is the average path length of sample x over the t itrees and c(n) is the average path length of an unsuccessful search in a binary search tree (Equation (4)):

$$ c(n) = 2H(n-1) - \frac{2(n-1)}{n}, \qquad H(i) = \ln(i) + \gamma \quad (4) $$

where γ is Euler's constant. Based on the anomaly score s, the following conclusions can be made [17]:
- If instances return an s(x, n) extremely close to 1, they are anomalies;
- If instances have an s(x, n) less than 0.5, they are deemed normal instances;
- If all instances return an s(x, n) of 0.5, there is no differentiation between normal and anomalous instances.
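The scoring rule above can be written out directly; the short sketch below (our illustration, not the authors' code) implements c(n) and s(x, n):

```python
# Isolation-forest scoring rule: s(x, n) = 2 ** (-E(h(x)) / c(n)).
import numpy as np

EULER_GAMMA = 0.5772156649

def c(n: int) -> float:
    """Average path length of an unsuccessful BST search over n instances:
    c(n) = 2 * H(n - 1) - 2 * (n - 1) / n, with H(i) = ln(i) + gamma."""
    if n <= 1:
        return 0.0
    return 2.0 * (np.log(n - 1) + EULER_GAMMA) - 2.0 * (n - 1) / n

def anomaly_score(avg_path_length: float, n: int) -> float:
    """Near 1 -> anomaly; below 0.5 -> normal; all near 0.5 -> no separation."""
    return 2.0 ** (-avg_path_length / c(n))

# A short average path over 256 samples yields a high score (~0.82), while a
# path equal to c(256) yields exactly 0.5 (indistinct).
print(anomaly_score(3.0, 256), anomaly_score(c(256), 256))
```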
Robust random cut forest (RRCF) [18], a variation of the isolation forest designed for streaming data, incorporates concept drift and tree evolution to generate an isolation score. The tree structure is impacted by the degree to which a new point alters the anomaly score. Consequently, the sensitivity of RRCF is reduced when the sample size is decreased. A robust random cut data structure is utilised as a summary or sketch of the input stream. While detecting anomalies, RRCF maintains the original distances between all pairs of data points. The LSCP (locally selective combination in parallel outlier ensembles) detector [19] builds a small region surrounding a test instance, utilising the consensus of its nearest neighbours in randomly selected feature subspaces. It employs an average-of-maximum technique, in which a homogeneous set of base detectors is fitted to the training data before a pseudo ground truth is generated for each instance by picking the maximum outlier score. It locates and combines the best detectors in the region and investigates both global and local data relationships. Its strength is that it can quantify the magnitude of local outliers.
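A minimal LSCP sketch (assuming the PyOD library's implementation and a homogeneous list of LOF base detectors that differ only in their locality parameter k) might look as follows:

```python
# LSCP ensemble sketch (illustrative; assumes PyOD).
import numpy as np
from pyod.models.lof import LOF
from pyod.models.lscp import LSCP

rng = np.random.default_rng(1)
X_train = rng.normal(size=(500, 4))
X_test = np.vstack([rng.normal(size=(95, 4)),
                    rng.normal(6.0, 1.0, size=(5, 4))])  # 5 injected outliers

# Homogeneous base detectors differing only in their neighbourhood size.
base_detectors = [LOF(n_neighbors=k) for k in (10, 20, 30, 40)]

# LSCP defines a local region around each test instance and combines the base
# detectors that perform best there against the pseudo ground truth.
ensemble = LSCP(base_detectors, local_region_size=30, random_state=1)
ensemble.fit(X_train)
scores = ensemble.decision_function(X_test)  # higher = more anomalous
print(scores[-5:])                           # injected outliers score high
```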
The local outlier factor (LOF) [20] measures how much a sample's density deviates from that of its neighbours on a localised level. The anomaly score is determined based on the object's isolation from its surroundings, giving it a localised significance. The distance to the k-nearest neighbours determines the locality, which is used to estimate the local density. The initial step is to compute the k-distance between a point p and its k-th neighbour. The distance can be measured in various ways, though the Euclidean distance is frequently utilised (Equation (6)):

$$ d(p, q) = \sqrt{\sum_{i=1}^{m} (p_i - q_i)^2} \quad (6) $$

Given a dataset D and a positive integer k, the k-nearest neighbours of p are the data points q whose distance to p is not greater than k-distance(p) (Equation (7)):

$$ N_k(p) = \{\, q \in D \setminus \{p\} \mid d(p, q) \le k\text{-distance}(p) \,\} \quad (7) $$

The reachability distance of data point p with respect to data point o is defined using Equation (8):

$$ \mathit{reach\text{-}dist}_k(p, o) = \max\{\, k\text{-distance}(o),\; d(p, o) \,\} \quad (8) $$

The next step is the estimation of the local reachability density (lrd), which is inversely proportional to the average reachability distance of p to its k nearest neighbours (Equation (9)):

$$ \mathit{lrd}_k(p) = \left( \frac{1}{|N_k(p)|} \sum_{o \in N_k(p)} \mathit{reach\text{-}dist}_k(p, o) \right)^{-1} \quad (9) $$

The LOF is then calculated as the mean ratio of the lrd of point p to the lrds of its neighbouring points (Equation (10)). If a point is considerably distant from its surrounding points in relation to their proximity to one another, it is deemed an anomalous point:

$$ \mathit{LOF}_k(p) = \frac{1}{|N_k(p)|} \sum_{o \in N_k(p)} \frac{\mathit{lrd}_k(o)}{\mathit{lrd}_k(p)} \quad (10) $$
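For illustration (not the authors' code), scikit-learn's LocalOutlierFactor computes LOF_k with k = n_neighbors; values well above 1 indicate local outliers:

```python
# LOF sketch (illustrative; assumes scikit-learn).
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(7)
X = np.vstack([
    rng.normal(0.0, 0.5, size=(200, 2)),  # dense cluster
    rng.normal(5.0, 2.0, size=(200, 2)),  # sparse cluster
    [[1.5, 1.5]],                         # point just outside the dense cluster
])

lof = LocalOutlierFactor(n_neighbors=20)
labels = lof.fit_predict(X)                  # -1 = outlier, 1 = inlier
lof_scores = -lof.negative_outlier_factor_   # ~1 = normal, >> 1 = local outlier
print(lof_scores[-1], labels[-1])            # last point scores well above 1
```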
LOF primarily excels at identifying outliers within a local context: if a point is proximal to a cluster with an extremely high density without belonging to it, it is classified as an anomaly. The interpretation of LOF is challenging, as it is presented in the form of a ratio. There is no specific threshold at which a point is considered an outlier; the identification of an anomaly is influenced by both the problem at hand and the individual analysing it.

Streaming data refers to an ongoing influx of information that is potentially unending and can be regarded as a time series featuring multiple variables. The limitless influx of incoming data creates circumstances in which the data can change over time, culminating in a scenario where modelling behaviour with more recent data holds greater relevance than using older data [21]. The stream data model may be described as a potentially unbounded sequence (Equation (5)):

$$ S = \{x_1, x_2, \ldots, x_t, \ldots\} \quad (5) $$

where $x_t$ is the instance arriving at time step t.
Algorithms specifically created for handling data streams are capable of managing enormous volumes of data. The fundamental concept of processing data streams is that instances are assessed just once upon arrival and then discarded to make room for succeeding instances. The algorithm analysing the stream cannot dictate the order of the encountered instances, so its model must be adjusted incrementally with each inspection. The “anytime property” is another desirable characteristic, entailing that the model is readily available for use at any point during training. There are three primary challenges in identifying anomalies in data streams: limited memory capacity, imbalanced datasets, and concept drift [22]. Streaming anomaly detection techniques adapt readily to real-world applications, owing to their high speed and limited memory requirements [23]. However, cutting-edge stream detection techniques are frequently geared towards detecting one particular sort of anomaly.
To prioritise fast processing and efficient storage in streaming situations, anomaly detection algorithms must be able to detect anomalies swiftly and adeptly. In ref. [21], the stream outlier miner (STORM) algorithm was proposed for detecting distance-based outliers. Two versions of STORM, namely exact-STORM and approx-STORM, have been proposed to answer outlier queries under the sliding window model. If memory can hold the complete window, the outliers are determined using exact-STORM. If memory is scarce and the window cannot be accommodated, approx-STORM is employed to estimate the anomalies using efficient approximations that offer statistical guarantees. STORM considers the time-based characteristics of an individual data point in a data stream: every datum remains within the sliding window for a specific duration.
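The sliding-window logic can be illustrated with a deliberately simplified sketch (ours, not the published STORM implementation): a point is reported as a distance-based outlier if fewer than k of the points currently in the window lie within radius R of it:

```python
# Simplified sliding-window, distance-based outlier detection (illustrative).
from collections import deque
import numpy as np

def stream_outliers(stream, window_size=200, radius=1.0, k=5):
    """Yield (index, is_outlier) for each arriving instance."""
    window = deque(maxlen=window_size)  # oldest instances expire automatically
    for i, x in enumerate(stream):
        x = np.asarray(x, dtype=float)
        if len(window) >= k:
            dists = np.linalg.norm(np.asarray(window) - x, axis=1)
            is_outlier = int((dists <= radius).sum()) < k
        else:
            is_outlier = False          # warm-up: too few points to judge
        yield i, is_outlier
        window.append(x)                # inspect each instance once, then store

rng = np.random.default_rng(3)
data = rng.normal(size=(1000, 2))
data[500] = [9.0, 9.0]                  # synthetic anomaly mid-stream
flagged = [i for i, out in stream_outliers(data) if out]
print(flagged)                          # expected to include index 500
```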
Detecting outliers is a subjective task that heavily relies on the problem domain, the data traits, and the kinds of anomalies present; hence, the effectiveness of detection algorithms varies widely [1,24]. Certain anomaly detection algorithms may successfully identify particular subspaces, whereas others may exhibit low detection capabilities [25]. To minimise errors comprehensively, merging the knowledge domains of every algorithm is crucial [26]. Data points that fall outside the usual range for the whole dataset are known as global outliers, whereas local outliers can lie within the normal range of the entire dataset but exceed the normal range of nearby data points [27].
The idea of data locality was initially introduced by the authors of [28] and subsequently improved in [29] to facilitate dynamic classifier selection in local regions of the training points. Dynamic strategies for selecting and merging base classifiers have yielded superior outcomes compared with static approaches that merely aggregate base classifier outputs through voting. Ensemble learning methods combine the forecasts of multiple base models to produce results that are more stable and reliable. For a reliable anomaly detection ensemble that produces consistent and unbiased overall accuracy, it is preferable to incorporate a variety of base detectors and methodically integrate their results to create a robust detector.
Anomaly detection ensembles use parallel or sequential combination structures to improve accuracy by combining outcomes from multiple detectors. Parallel combination structures aim to minimise variance, while serial combination structures aim to mitigate bias [30]. Including all base detector outcomes in an ensemble may diminish its effectiveness, as different detectors may fail to identify specific anomalies, especially in unsupervised learning scenarios. Unsupervised anomaly detection algorithms aim to identify deviations in unlabelled datasets automatically, based on certain assumptions. The performance of a model can be evaluated based on the various features that exist within a dataset, and detection rates differ because specialised models accommodate diverse observational characteristics. Using a collection of complementary skills within an ensemble yields a stronger outcome than relying solely on one detector [31]. Other studies have also considered anomaly detection, as in [32,33,34].
This study focuses on contextual anomalies. The notion of context originates from the structure of datasets, wherein two distinct sets of attributes characterise each data instance [3]:
- Contextual attributes: These are utilised to establish the context (or neighbourhood) of a particular instance. Contextual features of a location in spatial datasets include its longitude and latitude. In time series data, time serves as a contextual attribute that determines the position of an instance within the entirety of the sequence.
- Behavioural/indicator attributes: These attributes have a direct correlation with the anomaly detection process, as they establish the anomalous behaviour. In a spatial dataset describing the mean precipitation levels within a specific country, the proportion of rainfall observed at a given site would be a behavioural attribute.
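To ground these definitions in the C-ITS setting, the toy example below (hypothetical field names, not the ETSI CAM schema) splits a CAM-like record into the two attribute sets:

```python
# Illustrative split of a CAM-like record (hypothetical fields).
record = {
    "timestamp": "2021-06-01T00:12:03Z",  # contextual: position in the sequence
    "latitude": 50.8503,                  # contextual: where the vehicle is
    "longitude": 4.3517,
    "speed_kmh": 12.0,                    # behavioural: what the vehicle does
    "heading_deg": 87.0,
}

contextual = {k: record[k] for k in ("timestamp", "latitude", "longitude")}
behavioural = {k: record[k] for k in ("speed_kmh", "heading_deg")}
# 12 km/h is unremarkable globally, but in the context of a free-flowing
# motorway segment at this time of night it may signal an incident.
```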
Our approach involves utilising data to actively detect anomalies through unsupervised methods that target local contextual anomalies. We propose an enhancement of an ensemble anomaly detector, called enhanced locally selective combination in parallel outlier ensembles (ELSCP). ELSCP is tailored to streaming scenarios by leveraging a pipeline framework that transforms the data into a stream and passes it to ELSCP using a reference window model that implements a sliding-window approach. This adaptation enables the handling of information as a continuous flow, allowing us to assess how effective our algorithm is in a streaming environment. We use hypothesis testing to identify unusual patterns in vehicle movement on the road. The primary assumption of our analysis is that “normal instances are far more frequent than anomalies”. The central hypothesis is that “if vehicles change their speed abruptly at a specific point, then an incident has occurred”. We seek to investigate the following questions:
- (a) What is the significance of data associations in anomaly detection, especially in a constrained road network?
- (b) How can a balance between variance and bias be achieved in ensemble learning?
- (c) How can we improve the detection rate of anomalies in CAM data streams?
- (d) Can enhancing the LSCP algorithm improve the identification of anomalies in CAM data streams?
- (e) How can the adapted technique be applied to real-world problems?
We propose the following contributions:
- We define and investigate the issue of completely unsupervised anomaly ensemble construction;
- We propose a robust ensemble-based methodology for the detection of anomalies from data streams in the C-ITS context;
- We evaluate the proposed technique using a dataset of CAM messages generated in the C-ITS environment and compare its performance with state-of-the-art techniques in the streaming context.
This paper is structured as follows: Section 2 presents the data generation and pre-processing steps, which correspond to the materials used; it also introduces our anomaly-detection approach, called enhanced LSCP (ELSCP), and the performance indicators that we used. Section 3 presents the experimental results. Section 4 is dedicated to the discussion and limitations, with Section 5 giving the conclusion and future work.