Time Series Clustering: A Complex Network-Based Approach for Feature Selection in Multi-Sensor Data



Introduction
The primary goal of the industrial Internet of Things (IoT) has been to link operations and information technology for insight into production dynamics. This flexibility relies on a layer of technologies made of distributed networks of physical devices embedded with sensors, edge computers, and actuators used to identify, collect, and transfer data among multiple environments. Such IoT-based cyber-physical systems (CPS) establish a direct integration of engineering systems into digital, computer-based ones, where measurement and sensing technologies play an essential role in capturing the real world. The collected data are then the main ingredient to lift efficiency, accuracy, and economic benefits, with the added merit of minimal human intervention [1,2].
A consequence of such a transformation is that the sensor layer, customarily used to measure, is now the means to map the actual status of the process into the cyber-world, to derive process information, collect it in databases, and use it as a basis for models that can be adapted (ideally by self-optimization) to real situations. In this vein, the CPS provides a holistic view of the engineering systems and enables a bi-directional physical-to-digital interaction via multimodal interfaces [3,4]. Unlike the classic concepts derived from control theory, CPS forms the basis to describe even complex interactions and thus to anticipate process deviations and to interpret and predict system behavior.

The rest of the paper is organized as follows. In Section 2, we describe the background and related works. In Section 3, we present the unsupervised FSS method. In Section 4, we describe the case study. In Section 5, we report experimental results to support the proposed approach. In Section 6, we summarize the present work and draw some conclusions.

Background
In this section we review the main approaches used for FSS and time series clustering, together with some complex network analysis tools that represent the constituent parts of the proposed method.

Feature Subset Selection
A possible classification of Feature Subset Selection (FSS) methods distinguishes embedded, wrapper, and filter approaches.
The embedded methods [27] are usually related to learning algorithms, such as decision trees or neural networks, that perform feature selection as part of their training process. The wrapper methods [28], instead, use a predetermined learning algorithm to evaluate the goodness of feature subsets. Since a training and evaluation phase is required for every candidate subset, the computational complexity of these methods is large. Differently, the filter methods [29] do not rely on learning algorithms, but select features based on measures such as correlation. By combining filter and wrapper methods, it is possible to evaluate features through a predetermined learning algorithm without the computational complexity of wrappers, thanks to an initial feature filtering [30].
In the context of filter methods, it was found that clustering-based methods outperform traditional feature selection methods, and they also remove more redundant features with high accuracy [21,22]. Within filter methods, clustering algorithms are used to group features according to their similarity. From a dimensionality reduction perspective, one or more representative variables must then be identified for each cluster. For example, in [31], the cluster center is used as representative, while, in [32], a single optimal variable is selected based on the R^2 correlation coefficient.
FSS approaches can be further categorized into supervised and unsupervised. The former guide feature selection by data class labels, while the latter rely on intrinsic properties of the data. Recently, unsupervised FSS methods have been attracting ever-growing interest due to the widespread occurrence of unlabeled datasets in many applications [33,34]. For this reason, the present paper focuses only on unsupervised methods.

Time Series Clustering
Conventional clustering algorithms (partitional, hierarchical, and density-based) are unable to capture temporal dynamics and sequential relationships among data [35]. For these reasons, tremendous research efforts have been devoted to identifying time series-specific algorithms.
The most common approaches involve modifying the conventional clustering algorithms to adapt them to deal with raw time series or to convert time series into static data (feature vectors or model parameters) so that conventional clustering algorithms can be directly applied. The former class of approaches includes direct methods, also called raw data-based [36][37][38][39], while the latter refers to indirect methods that can be distinguished between model-based [40][41][42] and feature-based [43][44][45].
In addition, according to the way clustering is performed, the algorithms can be grouped into whole time-series clustering and subsequence clustering, a valid alternative to reduce the computational costs by working separately on time series segments. Detailed reviews of clustering algorithms for time series can be found in [46][47][48].
It has been recently demonstrated that network approaches can provide novel insights for the understanding of complex systems [23,49], outperforming classical methods in the ability to capture arbitrary clusters [50]. In particular, the weakness of conventional techniques resides in the use of distance functions which allow finding clusters of a predefined shape. In addition, they identify only local relationships among neighbor data samples, being indifferent to long distance global relationships [50]. Examples of network approaches for time series clustering can be found in the literature, making use of DTW and hierarchical algorithms [51] and community detection algorithms [50].

Evaluation Metrics for Unsupervised FSS
To evaluate an unsupervised FSS method for time series, two main indicators are widely used: redundancy and information gain.
In particular, the redundancy of information among a set of time series y = (y_1, y_2, ..., y_N) is quantified by the metric W_I, which is defined as

$$W_I(y) = \sum_{i=1}^{N-1} \sum_{j=i+1}^{N} MI(y_i, y_j) \quad (1)$$

where MI(y_i, y_j) is the mutual information between time series y_i and y_j [16]. A low value of W_I is associated with a set of time series that are maximally dissimilar from each other. It is also possible to consider the rate of variation of this metric, represented by the redundancy reduction ratio (RRR):

$$RRR = \frac{W_I(y) - W_I(\bar{y})}{W_I(y)} \quad (2)$$

where \bar{y} denotes the set of time series retained after feature selection. The information gain, instead, is computed in terms of the Shannon entropy H [52], which reads as

$$H(X) = -\sum_{i} p(x_i) \log p(x_i) \quad (3)$$

where X is the data matrix associated to the set of time series y, every row being a sample of the observations and every column a different time series, x_i is the ith row of such a matrix, and p(x_i) its probability of occurrence. The information gain is computed as the variation of entropy between the original time series y and the time series after feature selection \bar{y}. If the rate of variation is considered, it is possible to define the information gain ratio (IGR):

$$IGR = \frac{H(\bar{X}) - H(X)}{H(X)} \quad (4)$$

where X and \bar{X} are the data matrices associated to y and \bar{y}.
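As a concrete, deliberately simple illustration of these metrics, the following sketch estimates W_I and the RRR with a histogram-based mutual information estimator; the binning scheme and the NumPy implementation are our own choices, not those of [16]:

```python
import numpy as np

def mutual_information(a, b, bins=16):
    """Histogram estimate of the mutual information MI(a, b), in nats."""
    joint, _, _ = np.histogram2d(a, b, bins=bins)
    pxy = joint / joint.sum()
    px = pxy.sum(axis=1, keepdims=True)      # marginal of a
    py = pxy.sum(axis=0, keepdims=True)      # marginal of b
    nz = pxy > 0                             # avoid log(0)
    return float(np.sum(pxy[nz] * np.log(pxy[nz] / (px @ py)[nz])))

def redundancy(Y, bins=16):
    """W_I: sum of pairwise mutual information over the rows of Y."""
    n = Y.shape[0]
    return sum(mutual_information(Y[i], Y[j], bins)
               for i in range(n) for j in range(i + 1, n))

def rrr(Y_full, Y_selected, bins=16):
    """Redundancy reduction ratio between the full and the selected set."""
    w_full = redundancy(Y_full, bins)
    return (w_full - redundancy(Y_selected, bins)) / w_full
```

The same histogram machinery can be reused for the entropy terms of the IGR; finer bins trade bias for variance in the MI estimate.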

Complex Network Analysis
In this section, we illustrate Complex Network Analysis (CNA) methods used in the present study, namely visibility graphs, local network measures, and community detection algorithms.

Visibility Graph
The visibility graph algorithm is a method to transform time series into complex network representations. This concept was originally proposed in the field of computational geometry for the study of mutual visibility between sets of points and obstacles, with applications such as robot motion planning [53]. The idea was extended to the analysis of univariate time series [24], making it possible to map a time series into a network that inherits several properties of the time series itself. Moreover, visibility graphs are able to capture hidden relations, merging complex network theory with nonlinear time series analysis [23].
In particular, the visibility graph algorithm maps a generic time series segment y t n = ((s t n ) 1 , (s t n ) 2 , . . . , (s t n ) L ) in a graph by considering a node (or vertex) for every observation (s t n ) i for i = 1, . . . , L, where L is the number of observations in the segment. The edges of the graph, instead, can be generated using two different algorithmic variants: the natural visibility graphs and the horizontal visibility graphs.
The natural visibility graph algorithm [24] generates an edge with unitary weight between two nodes if their corresponding observations in the series are connected by a straight line that is not obstructed by any intermediate observation. Formally, two nodes a and b (with a < b) have visibility if their corresponding observations (s_n^t)_a = (t_a, v_a) and (s_n^t)_b = (t_b, v_b) satisfy

$$v_c < v_b + (v_a - v_b)\,\frac{t_b - t_c}{t_b - t_a} \quad (5)$$

for every intermediate observation (s_n^t)_c = (t_c, v_c) such that a < c < b. Here, t_a and t_b represent the timestamps of the two samples, while v_a and v_b are the actual observed values. A computationally more efficient algorithmic variant is the horizontal visibility graph [54,55], based on a simplified version of Equation (5).
Visibility graphs can be enhanced by considering their weighted version [56], where the weight between any pair of mutually visible nodes (s_n^t)_a = (t_a, v_a) and (s_n^t)_b = (t_b, v_b) reads as

$$w_{ab} = \sqrt{(t_b - t_a)^2 + (v_b - v_a)^2} \quad (6)$$

i.e., the Euclidean distance between the two observations. A schematic illustration of the weighted visibility graph construction is shown in Figure 1.
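A minimal sketch of the weighted natural visibility mapping, assuming the edge weight is the Euclidean distance between the two visible observations (one plausible weighting choice) and using NetworkX for the graph representation; the brute-force pairwise check is written for clarity rather than speed:

```python
import numpy as np
import networkx as nx

def weighted_natural_visibility_graph(values, times=None):
    """Map a time series segment to a weighted natural visibility graph.

    Two samples (t_a, v_a) and (t_b, v_b) are linked if every intermediate
    sample (t_c, v_c) lies strictly below the line joining them; the edge
    weight is here taken as the Euclidean distance between the two points.
    """
    v = np.asarray(values, dtype=float)
    t = np.arange(len(v), dtype=float) if times is None else np.asarray(times, float)
    g = nx.Graph()
    g.add_nodes_from(range(len(v)))
    for a in range(len(v)):
        for b in range(a + 1, len(v)):
            # natural visibility criterion
            visible = all(
                v[c] < v[b] + (v[a] - v[b]) * (t[b] - t[c]) / (t[b] - t[a])
                for c in range(a + 1, b)
            )
            if visible:
                w = float(np.hypot(t[b] - t[a], v[b] - v[a]))
                g.add_edge(a, b, weight=w)
    return g
```

Adjacent samples are always mutually visible (the intermediate-point condition is vacuous), so the chain of consecutive observations is always part of the graph.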

Local Network Measures
Networks can be composed of many nodes and edges, making the analysis and the unveiling of hidden relationships very challenging. For this reason, global and local network measures are used, respectively, to extract synthetic topological information from the whole network and to study the role that nodes play in its structure. The latter purpose can be tackled using different centrality measures. The first historically proposed one is the degree centrality [25], which allows detecting the most influential nodes within the network. This measure is based on the simple concept that the centrality of a vertex in a network is closely related to the total number of its connections. In particular, the weighted degree centrality of node i in a graph reads as

$$(k_n^t)_i = \sum_{j=1}^{L} w_{ij} \quad (7)$$

where L is the number of nodes, w_ij is the weight of the edge connecting nodes i and j, and k_n^t = ((k_n^t)_1, (k_n^t)_2, ..., (k_n^t)_L) is also called the degree sequence of the graph. Another measure is the eigenvector centrality [57], which is used for determining elements that are related to the most connected nodes. The betweenness centrality [58], instead, is able to highlight which nodes are more likely to lie on the network communication paths, and, finally, the closeness centrality [58] measures how quickly information can spread from a given node.
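The weighted degree (also called node strength) is directly available in NetworkX; a toy example on a three-node graph:

```python
import networkx as nx

# Weighted degree (strength) of every node: k_i = sum_j w_ij.
g = nx.Graph()
g.add_weighted_edges_from([(0, 1, 2.0), (1, 2, 3.0), (0, 2, 1.0)])
strength = dict(g.degree(weight="weight"))
# node 1 touches the edges of weight 2.0 and 3.0, so its strength is 5.0
```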

Community Detection
Since the last century, networks have been widely used to represent and analyze a large variety of systems, from social networks [59,60] to time series. One of the drivers has been the growing interest in detecting groups of nodes (communities) that are strongly connected by high concentrations of edges (relations), sharing common properties or playing similar roles in the network [61][62][63]. In particular, nodes that are central in a community may strongly influence the control and stability of the group, while boundary nodes are crucial in terms of mediation and exchanges between different communities [64].
Many community detection methods have been proposed to date and a possible classification includes traditional, modularity-based, spectral, and statistical inference algorithms.
Traditional methods include graph partitioning [65,66], which selects groups of predefined size by minimizing the number of inter-group edges; distance-based methods [67], where a distance function built from local network measures is minimized; and hierarchical algorithms [68], which produce multiple levels of groupings by evaluating a similarity measure between vertices. Modularity-based methods [69][70][71], instead, try to maximize the Newman-Girvan modularity [72], which evaluates the strength of a division into communities. One of the most popular methods is the Louvain method [26]. This algorithm follows a bottom-up approach in which groups of nodes are iteratively created and then aggregated into larger clusters. In particular, nodes are initially considered as independent communities, and the best partition is identified by moving single nodes between communities in search of a local maximum of the modularity. Then, a new network is constructed by modeling clusters as graph vertices and by computing edge weights as the sum of the connection weights between adjacent nodes belonging to different communities. These steps are repeated until convergence, corresponding to a maximum of modularity.
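A small sketch of the Louvain method using the implementation shipped with NetworkX (available from version 2.8); the toy graph, two triangles joined by a single bridge edge, is purely illustrative:

```python
import networkx as nx
from networkx.algorithms.community import louvain_communities

# Two dense blocks joined by a single weak bridge: Louvain should
# recover them as two separate communities.
g = nx.Graph()
g.add_edges_from([(0, 1), (0, 2), (1, 2),      # block A
                  (3, 4), (3, 5), (4, 5),      # block B
                  (2, 3)])                     # bridge
parts = louvain_communities(g, seed=42)        # list of node sets
```

Fixing `seed` makes the (otherwise randomized) node-sweep order, and hence the partition, reproducible.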
Another category of community detection methods are the spectral algorithms [73], which detect communities by using the eigenvectors of matrices such as the Laplacian matrix of the graph. Finally, statistical inference algorithms [74,75] aim at extracting properties of the graph based on hypotheses involving the connections between nodes.

Visualization
Community detection algorithms are typically integrated with exploratory network tools in order to improve the network visualization [76]. These tools become essential to give insight into the network structure, by revealing hidden structural relationships that may otherwise be missed. As described in [77], there is a large variety of specialized exploratory network layouts (e.g., force-directed, hierarchical, circular, etc.) based on different criteria. Among them, force-directed layouts are extensively applied in the identification of communities with denser relationships owing to their capability to capture the modular aspect of the network.
An example of a force-directed layout is the Fruchterman-Reingold algorithm [78], which treats nodes as mass particles and edges as springs between them; the goal is to minimize the energy of this physical system in order to find the optimal configuration. This process is influenced only by the connections between nodes; thus, in the final configuration, the position of a node cannot be interpreted on its own, but has to be compared to the others.
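In NetworkX, the Fruchterman-Reingold layout is exposed as `spring_layout`; a brief sketch (the karate club graph is just a stock example graph):

```python
import networkx as nx

# Force-directed (Fruchterman-Reingold) layout: connected nodes attract,
# all nodes repel; the result is a 2-D position for every node.
g = nx.karate_club_graph()
pos = nx.spring_layout(g, seed=7)   # dict: node -> (x, y) coordinates
```

The `pos` dictionary can be passed directly to `nx.draw` for plotting; fixing `seed` makes the layout reproducible.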

Evaluation Metrics
Community detection algorithms can be evaluated through several external or internal indicators, either customary for clustering or specially designed for community detection in networks [79]. External indicators, which evaluate the clustering results against ground truth data, include purity [80], the Rand index [81], and normalized mutual information [82]. On the other hand, internal indices, which rely on information intrinsic to the clustering and are specifically designed for networks, are the modularity [72] and the partition density [83].

Methods
This section discusses the proposed method, starting from the problem of time series clustering up to the task of unsupervised FSS. Given a set of N time series y = (y_1, y_2, ..., y_N), the main steps of the proposed clustering approach are summarized below.

a. Remove time series noise through a low-pass filter.
b. Segment every time series y_n into consecutive non-overlapping intervals s_n^1, s_n^2, ..., s_n^T of fixed amplitude L, where T is the number of segments extracted for each time series.
c. Transform every signal segment s_n^t (t = 1, ..., T and n = 1, ..., N) into a weighted natural visibility graph G_n^t.
d. Construct a feature vector k_n^t = ((k_n^t)_1, (k_n^t)_2, ..., (k_n^t)_L) for each visibility graph G_n^t, where (k_n^t)_i is the degree centrality of the ith node in the graph and k_n^t is the degree sequence of the graph.
e. Define a distance matrix D^t for every tth segment (t = 1, ..., T), where the entry d_ij^t is the Euclidean distance between the degree centrality vectors k_i^t and k_j^t. Every matrix gives a measure of how different every pair of time series is in the tth segment.
f. Compute a global distance matrix D covering the full time period T, where the entry (i, j) is computed as d_ij = (1/T) Σ_{t=1}^T d_ij^t, averaging the contributions of the individual distance matrices associated to every segment.
g. Normalize D between 0 and 1, making it possible to define a similarity matrix S = 1 − D, which measures how similar every pair of time series is over the full time period.
h. Build a weighted graph C considering S as an adjacency matrix.
i. Cluster the original time series by applying a community detection algorithm to the graph C and visualize the results through a force-directed layout.

Figure 2 illustrates the flowchart of the methodology. After the initial stages of data filtering (Step a) and time series segmentation (Step b), every signal is transformed into the network domain (Step c) using natural weighted visibility graphs. The natural variant was preferred to the horizontal one because it captures properties of the original time series in greater detail, avoiding simplified visibility conditions. The weighted variant, on the other hand, is used to magnify the spatial distance between observations that have visibility, thus replacing binary edges with weighted edges in the visibility graph.
Since we used natural weighted visibility graphs to map time series into networks, for the extraction of a feature vector from each signal segment (Step d) we considered the weighted degree centrality sequence of the network, as suggested in [84], because it is able to fully capture the information content of the original time series [25,85].
Then, after the construction of the segment distance matrices D^t and the normalized global similarity matrix S together with its graph representation C (Steps e-h), we used the modularity-based Louvain method for community detection (Step i), since it is extremely fast and performs well in terms of modularity.
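The construction of the per-segment distance matrices, their averaging into the global matrix D, and the conversion to the similarity matrix S (Steps e-g) can be sketched as follows; the array layout (segments x series x nodes) is our own convention:

```python
import numpy as np

def similarity_matrix(K):
    """Steps e-g of the pipeline, assuming K[t][n] is the degree sequence
    (length-L feature vector) of series n in segment t.

    Returns S = 1 - D_norm, where D averages the per-segment Euclidean
    distances between degree sequences and is normalized to [0, 1].
    """
    K = np.asarray(K, dtype=float)          # shape (T, N, L)
    T, N, _ = K.shape
    D = np.zeros((N, N))
    for t in range(T):
        # pairwise Euclidean distances between the N degree sequences
        diff = K[t][:, None, :] - K[t][None, :, :]
        D += np.sqrt((diff ** 2).sum(axis=-1))
    D /= T                                  # average over segments
    if D.max() > 0:
        D /= D.max()                        # normalize to [0, 1]
    return 1.0 - D
```

The resulting S can be fed directly to `networkx.from_numpy_array` to obtain the similarity graph C of Step h.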
To achieve a modular visualization of the detected clusters and their mutual connections, we used a force-directed algorithm, namely the Fruchterman-Reingold layout, as a graphical representation.
Finally, for specific unsupervised FSS purposes, we considered a representative parameter for each cluster. Such parameters were identified based on their importance within the communities, by considering the signals with highest total degree centrality in their respective groups.
Every part of the proposed approach was developed in Python 3.6 [86], using the Numpy [87] and NetworkX [88] packages.

Case Study
This section deals with the case study considered for the application of the proposed method: an internal combustion engine used in industrial cogeneration, i.e., combined heat and power (CHP).
The CHP system consists of a four-stroke ignition engine (P = 1032 kW) fueled with vegetable oil, coupled to a three-phase synchronous generator. The electricity produced is used to meet the self-consumption of the plant and the production surplus is fed into the grid.
The heat recovery takes place both from the engine cooling water circuit and from the exhaust gases. In particular, the heat exchanged with the engine cooling water (T = 65-80 °C) is used both to meet part of the plant heating requirement and to preheat the fuel before the injection phase. The return water from the plant is cooled by a ventilation system consisting of four fans (P = 15 kW).
The exhaust gases, after being treated in a catalyst, are conveyed into a boiler of 535 kW thermal power, which is used to produce steam at about 90 °C for different production lines. A schematic representation of the system is shown in Figure 3.
The system is equipped with a sensor network for condition monitoring and control purposes that samples 90 physical quantities at different points every minute. The data used for the case study span from 25 June 2014 to 5 May 2015. The early preprocessing phase involved the removal of the parameters that were constantly flat and of the cumulative signals, thus reducing the number of starting parameters to 78. The final list of monitored CHP plant variables considered for the analysis is reported in Table 1. In the preprocessing phase, the outliers caused by sensor errors were also removed. To deal with unusual dynamics linked to system shutdowns, observations where the active power of the system was zero were filtered out. Afterwards, we resampled the data every 15 min to filter constant signal intervals and reduce the amount of measurements processed by the algorithm. The resulting data matrix used as input for the analysis had 30,240 rows and 78 columns. Finally, we built time series segments including 24 h of observations to capture the typical daily cycle of the plant.
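The preprocessing sequence described above can be sketched with pandas; the column names and the DataFrame layout are illustrative, not those of the actual plant database:

```python
import numpy as np
import pandas as pd

def preprocess(df, power_col="active_power", resample_rule="15min"):
    """Preprocessing sketch: drop constantly flat signals, filter out
    shutdown periods (zero active power), and resample to 15-minute
    averages. Column names are hypothetical examples.
    """
    df = df.loc[:, df.nunique() > 1]    # remove constantly flat parameters
    df = df[df[power_col] > 0]          # drop shutdown observations
    return df.resample(resample_rule).mean().dropna()
```

Outlier removal (not shown) would typically be applied per column before resampling, e.g., with a rolling z-score filter.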

Results
This section provides a detailed discussion about the experimental results obtained by the proposed approach, followed by a comparison with two traditional time series clustering methods. Figure 4 shows the plot of the 78 standardized signals during a representative period of about two months. Data were extracted during a total measuring time of almost 11 months. The total dataset was then analyzed by applying the method described in Section 3. After the application of a low-pass filter for noise removal, Steps b-d of the workflow, time series were segmented into non-overlapping intervals s t n , then mapped into natural visibility graphs G t n , and finally feature vectors were extracted in terms of degree sequences k t n . Afterwards, in Steps e-g, a global distance matrix D was computed by combining the contribution of all the distance matrices D t , followed by the definition of the similarity between all the pairs of time series. The resulting similarity matrix S is shown in Figure 5.

As per Step h, the similarity matrix S is represented in the form of a weighted graph, also called the similarity graph C, where each node corresponds to a specific signal and the edge weights quantify pairwise similarities between time series. To carry out the community detection phase, only the most important edges were taken into account. In particular, we performed edge pruning by filtering out the pairwise similarities lower than the second quartile (median) of their probability distribution.
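The edge pruning step can be sketched as follows, interpreting the threshold as the median (0.5 quantile) of the edge-weight distribution, which is our reading of the "second quantile":

```python
import numpy as np
import networkx as nx

def prune_edges(g, q=0.5):
    """Keep only edges whose weight is at least the q-th quantile of
    the edge-weight distribution (q=0.5 keeps edges above the median)."""
    weights = np.array([d["weight"] for _, _, d in g.edges(data=True)])
    threshold = np.quantile(weights, q)
    pruned = g.copy()
    pruned.remove_edges_from(
        [(u, v) for u, v, d in g.edges(data=True) if d["weight"] < threshold]
    )
    return pruned
```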
Then, as for Step i, by means of the Louvain's algorithm (see Section 2.2.3), we identified 12 different communities within the filtered similarity graph, which globally cover 70 parameters. Table 2 provides the detail of the variables contained in each cluster with reference to the parameter IDs presented in Table 1. The eight signals shown in Figure 6 were not clustered since they were characterized by independent dynamics. This subset includes engine lube oil parameters, i.e., carter pressure, sump level, and pressure; generator parameters, i.e., power factor and reactive power; parameters in the fuel primary storage, i.e., tank level and pressure; and parameters in the exhaust gas treatment, i.e., urea tank level. Time series clustering results are illustrated with reference to the functional groups shown in the block diagram in Figure 3.
Most of the fuel parameters were grouped into two distinct homogeneous clusters (see Figure 7). Fuel temperatures from the primary storage to the output of the tanks 1 and 2 are included in Cluster 1 (Figure 7a), while Cluster 2 (Figure 7b) groups the fuel levels in the two tanks. Engine sensor signals fall, together with other strictly related parameters, into two distinct clusters (see Figure 8). In particular, Cluster 3 ( Figure 8a) includes all the cylinder temperatures and the exhaust temperatures, while Cluster 4 (Figure 8b) includes the casing temperatures, the supercharger temperatures, and the temperatures monitored at the engine auxiliaries, e.g., cooling water, lube oil, and intercooler subsystems. Cluster 4 also contains some parameters by the heat exchange with the engine cooling, such as water inlet temperatures of the process heat circuit and inlet fuel temperature.
All the parameters of the high temperature heat recovery circuit (process steam demand) were, instead, separated into two distinct groups (see Figure 9). In detail, Cluster 5 (Figure 9a) includes the thermal power and hot water flow rate, monitored at the boiler inlet, while, in Cluster 6 (Figure 9b), all the specific steam parameters are grouped together, such as steam flow rate, pressure, and thermal power, as well as the temperature of the condensed water. As mentioned above, low temperature heat circuit sensor signals, measured at the plant inlet, are part of Cluster 4 together with other engine and auxiliaries signals (see Figure 8b), while the water temperatures at the plant outlet and the delta in-out temperature are in Cluster 7 (see Figure 10).  The two principal properties of the electric power supply, frequencies and voltages, were divided into two clusters (see Figure 11). Notably, in Figure 11a, it is possible to note how the engine speed was included in Cluster 8 together with the generator and grid frequencies. On the other and, Cluster 9 includes all the generator and grid voltages.  Other electrical parameters, such as powers and currents, have instead been divided into three different clusters (see Figure 12). In particular, Cluster 10 ( Figure 12a) and Cluster 11 ( Figure 12b) distinguish, respectively, the generator powers from the generator currents, while Cluster 12 (Figure 12c) groups together the grid powers and currents. The latter refer only to Phase 2 current, because the Phase 1 and 3 currents were removed in the preprocessing phase due to sensor malfunctions.
The clustering results show that the proposed approach is independent of the nature of the monitored parameters and of their functionality within the system.
For example, Clusters 1, 2, 7, 9, 10, and 11 (Figures 7a,b, 10a, 11b, and 12a,b) include only homogeneous variables (e.g., temperatures) belonging to the same functional area (e.g., engine). Among those, it is interesting to note how the parameters within Cluster 2, i.e., the fuel levels in the tanks for primary storage, seem very different from the Euclidean point of view, but the method identified a similarity in their global trends.
On the other hand, Clusters 5, 6, and 12 (Figures 9a,b, and 12c) represent some examples of communities populated by heterogeneous physical parameters recorded in the same functional area.
Finally, particular interest derives from the hidden relationships identified between parameters characteristic of different functional areas. Examples are Cluster 3 (Figure 8a), which includes the cylinder and exhaust temperatures; Cluster 4 (Figure 8b), which groups together the temperatures referring to the engine external casing, the engine auxiliaries, the heat recovery and fuel pre-heating systems, and the inlet fuel; and Cluster 8 (Figure 11a), which is composed of the engine speed together with the frequencies related to both the generator and the grid.
After the identification of the clusters, exploratory network analysis was used to render a graphical representation of their degree of similarity (the higher the similarity between nodes, the smaller their spatial distance), thus improving the cluster visualization. The Fruchterman-Reingold layout applied to the similarity graph C, after edge pruning, provided the results shown in Figure 13.
The force-directed layout gives evidence of a central core of strongly connected parameters, which includes most of the fuel temperatures in the storage area (Cluster 1), all the temperatures of cylinders and exhaust (Cluster 3), all the process low temperature parameters (Cluster 7), and most of the generator and grid parameters (Clusters 8-11). Notably, only two parameters of Cluster 1 lie outside the central core, namely T29 and T34, measuring, respectively, the fuel temperature in the primary storage and in tank 2 (the latter being a backup tank). It is also possible to notice how the temperatures of the engine cooling water (T25-T27) and lube oil (T43) subsystems represent a key group in bridging the central core to the other variables of Cluster 4. Similarly, the steam parameters in the high temperature heat recovery (Cluster 6), although not directly included in the central core, appear to be strictly connected to it. As expected, no significant correlation is present between the fuel levels inside the tanks (Cluster 2), the power and flow rate of the hot water at the boiler inlet (Cluster 5), the grid power and currents (Cluster 12), and the rest of the network.

To improve the interpretation of the results by adding quantitative information to the exploratory analysis, we calculated the cumulative percentage distribution of the average degree centrality of each cluster (see Figure 14). The bar chart in Figure 14 attributes a specific ranking to the clusters according to their average contribution to the degree centrality of the network. Overall, the results confirm the considerations made so far in relation to the core communities (Clusters 1, 3, 7, and 8-11), the boundary communities (Clusters 4 and 6), and the communities totally unrelated to the network (Clusters 2, 5, and 12).
As for the communities included in the central core, it is possible to distinguish the roles they play in the network. In detail, Cluster 8, which groups the engine speed and the generator and grid frequencies together, is the most influential on the control and stability of the global system, followed by Cluster 3, which includes the cylinder and exhaust gas temperatures.
Finally, after cluster identification and analysis, FSS was performed by selecting in each cluster the representative signal as the one with the highest degree contribution within its group. Table 3 shows the selected variables associated to each cluster, together with their degree centrality in the similarity graph and their percentage contribution to the sum of the degree centralities within the reference cluster.
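The representative-selection rule, picking the node with the highest weighted degree inside each community of the similarity graph, can be sketched as:

```python
import networkx as nx

def cluster_representatives(g, communities):
    """For each community (a set of nodes), pick the node with the
    highest weighted degree in the similarity graph as representative."""
    strength = dict(g.degree(weight="weight"))
    return [max(community, key=lambda n: strength[n])
            for community in communities]
```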
The representative parameters shown in the table are visually confirmed by the force-directed layout in Figure 13. For example, variable T0 (condensate temperature) appears to be the most influential node of Cluster 6 (process high temperature user parameters), having a high number of connections not only with variables of its own cluster, but also with those belonging to the central core of strongly connected signals. Another example is parameter T43 (oil temperature) with respect to Cluster 4 (parameters strictly related to the engine).
As reported in the case study, the data matrix considered as input for the analysis has 30,240 × 78 dimensions. After the application of the proposed method, by considering the 12 representative cluster variables, listed in Table 3, together with the 8 independent signals shown in Figure 6, we obtained a final data matrix of size 30,240 × 20, thus reducing the dimensionality by 74.4%.

Performance Metrics
An exhaustive evaluation of the proposed method can be obtained by appropriate measures of clustering partitioning and FSS information content.
The lack of ground truth data in the present condition monitoring application precluded the evaluation of the clustering results through classical external indices. Since we used a modularity-based method for community detection, modularity was identified as the most appropriate internal metric: it quantifies the goodness of the communities on a scale from −1 to 1. We obtained a modularity index of 0.72, representative of good quality results.
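The modularity of a given partition can be computed directly with NetworkX; a toy example on a graph with two obvious communities:

```python
import networkx as nx
from networkx.algorithms.community import modularity

# Modularity ranges in [-1, 1]; well-separated communities score
# close to 1, while a random partition scores near 0.
g = nx.Graph([(0, 1), (0, 2), (1, 2),      # community A (triangle)
              (3, 4), (3, 5), (4, 5),      # community B (triangle)
              (2, 3)])                     # single bridge edge
q = modularity(g, [{0, 1, 2}, {3, 4, 5}])
```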
Since the proposed approach belongs to the category of unsupervised FSS methods, the final evaluation was performed in terms of redundancy reduction ratio (RRR) and information gain ratio (IGR), defined, respectively, in Equations (2) and (4).
The proposed method was compared with standard approaches for time series clustering (see Table 4). In particular, a raw data-based method was considered, which uses the Euclidean distance as the time series similarity measure and a partitioning algorithm, namely K-Means, for grouping variables. In addition, we included a feature-based method in the comparison, which extracts statistical parameters characteristic of each time series (i.e., average, median, standard deviation, skewness, and kurtosis) and then applies the K-Means algorithm for clustering.

Table 4. Comparison of the FSS performances between the proposed approach and two standard methods: a raw data-based method and a feature-based one. The evaluation considers the redundancy reduction ratio (RRR) and information gain ratio (IGR) indices.

Table 4 shows that the time series clustering approach is particularly efficient in terms of FSS, allowing a total redundancy reduction of 29.05% in the starting dataset while obtaining, at the same time, a global information gain of 10.60%.
It is also worth noting that both performance metrics are better than those obtained with the standard approaches considered. In particular, the proposed method outperforms the raw data-based clustering approach in terms of both RRR and IGR, with overall improvements of 19.53% and 2.21%, respectively. Compared with the feature-based method, the proposed approach again provides better results, with increases of 8.09% and 2.70% in the RRR and IGR indices, respectively.
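The feature extraction step of the feature-based baseline can be sketched as follows. This is an illustrative stand-alone implementation of the five statistics named above (the subsequent K-Means step is omitted for brevity); the conventions used for skewness and kurtosis (population moments, excess kurtosis) are assumptions, as the baseline's exact definitions are not specified here.

```python
# Sketch of the feature-based baseline: summarise each time series with five
# statistics (mean, median, standard deviation, skewness, excess kurtosis).
import statistics

def feature_vector(series):
    n = len(series)
    mu = statistics.fmean(series)
    med = statistics.median(series)
    sd = statistics.pstdev(series)
    if sd == 0:
        skew, kurt = 0.0, 0.0  # constant series: moments undefined, use 0
    else:
        skew = sum((x - mu) ** 3 for x in series) / (n * sd ** 3)
        kurt = sum((x - mu) ** 4 for x in series) / (n * sd ** 4) - 3.0
    return [mu, med, sd, skew, kurt]

print(feature_vector([1.0, 2.0, 3.0, 4.0, 5.0]))
```

Each series is thus mapped to a fixed-length vector, on which K-Means can operate regardless of the series length.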

Conclusions
With the advent of Industry 4.0, the increasing availability of sensor data is driving the rapid development of models and techniques able to handle them. In particular, data-driven AI models are becoming essential for analyzing complex systems based on large data streams.
State-of-the-art models are prone to overfitting and suffer performance losses when variables are highly correlated with each other. Many FSS methods have been introduced to address these problems; notably, clustering-based methods for unsupervised FSS have been shown to outperform traditional approaches in terms of accuracy.
The complexity of the nonlinear dynamics associated with data streams from sensor networks makes standard clustering methods unsuitable in this context. For these reasons, in this paper we propose a new time series clustering approach for unsupervised FSS that exploits several complex network tools. In particular, we mapped time series segments into the network domain through natural weighted visibility graphs, extracted their degree sequences as feature vectors to define a similarity matrix between signals, used a community detection algorithm to identify clusters of similar time series, and selected a representative parameter for each cluster based on the variables' degree contributions.
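The first step of the pipeline, the natural visibility graph mapping, can be sketched as follows. Under the natural visibility criterion, two samples (a, y[a]) and (b, y[b]) are linked if every intermediate sample lies strictly below the straight line joining them. The edge weight used below (the absolute slope of that line) is an illustrative assumption; the weighting scheme of the paper's natural weighted visibility graphs may differ.

```python
# Sketch: natural visibility graph of a time series segment.
# Weight = absolute slope of the visibility line (an assumed weighting).

def natural_visibility_edges(y):
    edges = []
    n = len(y)
    for a in range(n):
        for b in range(a + 1, n):
            # (a, b) are connected iff every sample c between them lies
            # strictly below the line through (a, y[a]) and (b, y[b]).
            visible = all(
                y[c] < y[b] + (y[a] - y[b]) * (b - c) / (b - a)
                for c in range(a + 1, b)
            )
            if visible:
                edges.append((a, b, abs((y[b] - y[a]) / (b - a))))
    return edges

series = [1.0, 3.0, 2.0, 4.0]
edges = natural_visibility_edges(series)
print(sorted((a, b) for a, b, _ in edges))  # → [(0, 1), (1, 2), (1, 3), (2, 3)]
```

The degree sequence of this graph is then what the pipeline uses as the feature vector of the segment. The double loop makes the O(L²) cost per segment, mentioned later in the paper, explicit.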
The analysis of the results highlights two advantages of the proposed method. The first is the ability to group both homogeneous and heterogeneous physical parameters, even when they relate to different functional areas of the system. This is achieved by capturing time series similarities that are not necessarily linked to the Euclidean distance between signals. From the FSS perspective, the approach, by retaining 12 representative variables for the identified clusters and 8 independent signals that were not clustered, reduced the dimensionality of the dataset by 74.4%.
Second, as an additional benefit beyond FSS, the method allows hidden relationships between system components to be discovered, enriching the information about the role of each signal within the network.
Since the construction of a natural weighted visibility graph has time complexity O(L²), where L is the number of samples in a time series interval, the proposed approach was intended as an offline filtering tool. In particular, since the visibility graph construction is the bottleneck of the algorithm, the global time complexity is of the order O(TL²), where T is the number of consecutive non-overlapping segments. Running the algorithm on a dataset spanning 11 months with time windows of 24 h took approximately 15 min. The idea is to consider the whole available dataset in order to identify the overall most relevant signals by averaging the contributions of all intervals. The resulting reduction in the dimensionality of the data streams opens the possibility of simplifying the condition monitoring system and its data.
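The segmentation into T consecutive non-overlapping windows, which gives the O(TL²) total cost, amounts to the following (window length L is illustrative):

```python
# Split a series into consecutive non-overlapping windows of length L;
# any trailing partial window is dropped.
def segment(series, L):
    return [series[i:i + L] for i in range(0, len(series) - L + 1, L)]

print(len(segment(list(range(100)), 24)))  # → 4 full windows of 24 samples
```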
If, instead, a real-time tool for FSS or time series clustering is of interest, the proposed algorithm could be integrated into sensor network now-casting models; e.g., on a sliding window of 24 h the algorithm runs in less than 3 s.