Matrix Profile-Based Approach to Industrial Sensor Data Analysis Inside RDBMS

Abstract: Currently, big sensor data arise in a wide spectrum of Industry 4.0, Internet of Things, and Smart City applications. In such subject domains, sensors tend to have a high frequency and produce massive time series in a relatively short time interval. The data collected from the sensors are subject to mining in order to make strategic decisions. In the article, we consider the problem of choosing a Time Series Database Management System (TSDBMS) to provide efficient storing and mining of big sensor data. We overview InfluxDB, OpenTSDB, and TimescaleDB, which are among the most popular state-of-the-art TSDBMSs, and represent different categories of such systems, namely native, add-ons over NoSQL systems, and add-ons over relational DBMSs (RDBMSs), respectively. Our overview shows that, at present, TSDBMSs offer a modest built-in toolset to mine big sensor data. This leads to the use of third-party mining systems and unwanted overhead costs due to exporting data outside a TSDBMS, data conversion, and so on. We propose an approach to managing and mining sensor data inside RDBMSs that exploits the Matrix Profile concept. A Matrix Profile is a data structure that annotates a time series through the index of and the distance to the nearest neighbor of each subsequence of the time series and serves as a basis to discover motifs, anomalies, and other time-series data mining primitives. This approach is implemented as a PostgreSQL extension that allows an application programmer both to compute matrix profiles and mining primitives and to represent them as relational tables. Experimental case studies show that our approach surpasses the above-mentioned out-of-TSDBMS competitors in terms of performance since it assumes that sensor data are mined inside a TSDBMS at no significant overhead costs.


Introduction
Currently, big sensor data arise in a wide spectrum of Industry 4.0 [1], Internet of Things (IoT) [2], Smart City [3], and Smart Home [4] applications. In such subject domains, sensors tend to have high frequency (for example, tens of times per second) and produce time series of tens of millions to billions of elements in a relatively short time interval. The data obtained from sensors should be stored permanently and subjected to mining to extract hidden knowledge and make strategic decisions. In performing these tasks, a Time Series Database Management System (TSDBMS) plays a critically important role to provide an application programmer with means and tools to efficiently process and analyze such amounts of sensor data.
In this article, we consider the problem of choosing an appropriate TSDBMS for managing and mining big sensor data. TSDBMSs differ from both relational DBMSs (RDBMSs) and NoSQL systems. A TSDBMS does not need a transaction mechanism since sensor data are collected but not deleted or modified. Also, a TSDBMS should provide both efficient operations for adding new atomic values that arrive in streaming mode and efficient mining operations where a time series is considered as a whole.
At present, despite the widespread use of NoSQL systems, RDBMSs remain the basic work tools to store and manipulate data in a wide spectrum of subject domains. This claim is supported by the statistics of the DB-Engines.com portal (see DBMS popularity broken down by database model, https://db-engines.com/en/ranking_categories, accessed on 12 April 2021) pointing out that relational DBMSs hold up to 75 percent of the market. By its nature, an RDBMS does not provide built-in data mining functions, whereas the development of in-RDBMS data mining methods is a topical issue [5]. Indeed, if we consider an RDBMS only as a data repository, this may lead to significant overheads when exporting large data volumes outside the RDBMS, changing the data format, and importing the results of the analysis back into the RDBMS. In-RDBMS data mining methods cover SQL-based implementations of various problems (for example, association rules [6,7], clustering [8,9], graph mining [10,11], and others) as well as multipurpose libraries and frameworks [12][13][14][15]. However, to the best of our knowledge, there are no developments related to in-RDBMS time-series mining.
The article contributes as follows. We present an approach to managing and mining sensor data inside an RDBMS, based on the Matrix Profile concept. Matrix Profile [16] is a data structure that annotates a time series through the index of and the distance to the nearest neighbor of each subsequence of the series. Sensor data and Matrix Profile structures that support the mining are represented as relational tables. Sensor data are mined inside an RDBMS at no significant overhead costs for exporting and importing data. We implemented this approach as a PostgreSQL extension. Our experiments on real cases show that our approach surpasses out-of-TSDBMS competitors in terms of performance.
The rest of the article is organized as follows. Section 2 provides a brief overview of the most popular state-of-the-art TSDBMSs. Section 3 presents an approach to managing and mining sensor data inside an RDBMS. Section 4 describes the experimental evaluation of our approach. Section 5 contains a summary of the results obtained and directions for further research.

Classification of Time Series DBMSs
In [17], the authors employ the following four-group classification of TSDBMSs. The first group includes systems that provide time series storage based on a third-party RDBMS or NoSQL system. The systems of the second group provide time series storage independently. The third group consists of RDBMSs providing tools to store and process time series. Finally, the fourth group is represented by commercial systems regardless of their basic data model, an underlying third-party RDBMS or NoSQL system to store time series, etc.
An add-on TSDBMS is implemented on top of a third-party system that provides the TSDBMS with a database engine and a data storage system. Depending on the data model exploited by the basic system, one can distinguish between an add-on TSDBMS over a NoSQL system or an RDBMS.
Below, we give a more detailed overview of three systems, namely InfluxDB, OpenTSDB, and TimescaleDB, which are currently the most popular representatives of the TSDBMS categories listed above, according to the DB-Engines.com portal (see DB-Engines Ranking of Time Series DBMS, https://db-engines.com/en/ranking/time+series+dbms, accessed on 12 April 2021).
To illustrate the capabilities of TSDBMS, we use the toy subject domain of an Industry 4.0 application, as shown in Figure 1. This example simulates a manufacturing line of metal products, which comprises two machines with sensors installed on them to provide predictive maintenance of the line. The first machine is equipped with a temperature sensor and an acoustic emission sensor to control the heating of the metal and capture waves occurring as a result of changes in its structure (cracks, corrosion, etc.), respectively. Each sensor outputs a single value. On the second machine, a vibration accelerometer is installed to monitor machine vibrations; it outputs three values (vibration acceleration along the X, Y, and Z axes).

InfluxDB, Native Time Series DBMS
InfluxDB is a free TSDBMS implemented in the Go programming language and distributed as an executable file for major operating systems and hardware platforms. InfluxDB supports command line and HTTP interfaces, as well as client libraries and plugins [37].

Organization of Data Storage
In InfluxDB, data are represented as a two-dimensional table called measurement. One of the measurement columns contains timestamps. The other columns belong to one of two categories: field or tag. Each field column stores time-series data and consists of field keys and field values. Each tag column is the metadata of a field and consists of tag keys and tag values. Fields are not indexed but indexes can be created for tags.
There is no explicit database schema in InfluxDB. The concepts of series and points are supported. A series is a named set of data that shares a measurement, a set of tags, and field keys. A point is a data element consisting of the following components: a measurement, a set of tags, a set of fields, and a timestamp. A point is uniquely identified by its series and timestamp. Figure 2 gives an example of the creation of a sensor database for the subject domain described in Figure 1. The database contains three measurements: acoustic, temperature, and accelerometer, which represent the data from the respective sensors. Each measurement has a machine tag to indicate the device on which the sensor is installed. The measurement fields, namely val for acoustic and temperature, and x, y, and z for accelerometer, reflect the values measured by the respective sensor. The specified measurements are created simultaneously with the insertion of data points into the database since InfluxDB does not support explicit schema definition.
At the physical level, InfluxDB employs the LSM (Log-structured Merge-tree) data structure [38], which provides fast data access in workloads with frequent INSERT queries. Also, InfluxDB supports automatic data compression to minimize the amount of data stored.

Query Language
InfluxDB provides the InfluxQL query language with SQL-like syntax. Figure 3 depicts an example of a query that finds the minimum value of the temperature sensor installed on the first machine.
As for mining tools, InfluxQL provides time series prediction through the Holt-Winters method [39]. Figure 4 shows an example of the prediction of the values of a temperature sensor.
InfluxDB supports continuous queries, which run automatically at a specified frequency. Figure 5 depicts an example of a continuous query that runs hourly and computes the minimum value of temperature sensor readings during an hour.

OpenTSDB, NoSQL-Based Time Series DBMS
OpenTSDB is a free TSDBMS implemented in the Java programming language. It serves as an add-on over NoSQL column family store systems, such as HBase or Cassandra [26]. OpenTSDB supports command line and HTTP interfaces, as well as client libraries and plugins [37].

Organization of Data Storage
OpenTSDB treats an element of a time series as a collection of a real value, a unique time series identifier (a metric in the original terminology), a timestamp, and a non-empty set of tags. A tag is a character string to store metadata.
OpenTSDB inherits the way the data are organized from the underlying NoSQL system, which uses a collection of system tables as data storage, namely tsdb to manage data from time series, and tsdb-uid, tsdb-tree, and tsdb-meta to manage service data. A tuple of the tsdb table contains the following attributes: a time series element, a timestamp, and a foreign key that references the table tsdb-uid and associates this tuple with a specific time series. The tsdb-uid table stores metric names and time-series tag values. In OpenTSDB, the tsdb-tree table allows for the definition and maintenance of a semantic hierarchy of stored time series similar to the file structure in an operating system. The tsdb-meta table allows for storing additional user-defined time-series information (for example, text annotations). Figure 6 gives an example of the creation of a database for the subject domain from Figure 1, where OpenTSDB serves as an add-on over the HBase system. Initially, HBase runs a standard script to create system tables for data storage. Next, a database is created containing five metrics: acoustic, temperature, accelerometer.x, accelerometer.y, and accelerometer.z, which represent the data from the respective sensors. Each of these metrics has a machine tag to indicate the device the sensor is installed on. The specified metrics are created simultaneously with the insertion of data points into the database through the put command since OpenTSDB does not support explicit schema definition. The put command is followed by the metric name, timestamp, data point value, and tags.

Query Language
In OpenTSDB, database queries are written in the JSON language. A query describes a directed acyclic graph (execution graph) whose nodes define data sources and transformation operations. The queries support arithmetic and logical expressions, filters, grouping, and others, as well as statistical and analytical functions, such as downsampling (decreasing the sampling rate of a series), interpolation (imputation of missing values of a series), and so forth. Figure 7 shows an example of a query that performs grouping and computes the minimum value of temperature sensor data received in the last hour. The query graph consists of two nodes, namely temperature_node, which reads data from the temperature metric, and groupby_node, which groups data by machine tag and finds the minimal value.
Figure 8 presents an example of a query that computes the total sum of data from the temperature_node metric grouped in 5-minute time intervals. In the resulting empty groups, the system automatically fills the gaps with the sum calculated by linear interpolation (LERP) [40]. If there is not enough real data for LERP, the system outputs an undefined value NaN. Apart from LERP, OpenTSDB does not provide built-in time series mining tools. However, OpenTSDB allows third-party extensions that provide such functionality (for example, the R2Time [41] library, implemented in the R programming language).

TimescaleDB, Relational-Based Time Series DBMS
TimescaleDB is a free TSDBMS implemented in the C programming language and distributed as an extension of PostgreSQL. TimescaleDB cooperates with a PostgreSQL instance and normally supports the same operations as PostgreSQL.

Organization of Data Storage
In TimescaleDB, time-series data are stored and processed in hypertables. A hypertable specifies a named set of time series and a way to split the data of the specified series into physically stored relational tables. The partitioning information is used further for parallel processing of the specified tables in PostgreSQL. Figure 9 shows the hypertables of data for the subject domain given in Figure 1. Here, each hypertable stores the time series of sensor data from the corresponding machine. The machine1 hypertable has the following attributes to store data from sensors installed on the first machine: a timestamp, a sensor value, and a sensor type (an acoustic emission sensor or a temperature sensor). The machine2 hypertable defines a way to store the vibration accelerometer data along the X, Y, and Z axes from the sensor installed on the second machine. Both hypertables define a time series split into disjoint subsequences corresponding to periods of one month. Additionally, the machine1 hypertable stores the data from different sensors in separate tables.
The sensor database is created as follows. Initially, a new database and its tables are produced. Next, we employ TimescaleDB as a PostgreSQL extension and convert the tables into hypertables using the system function that specifies the partitioning method. Finally, sensor data are inserted into the hypertables through the regular SQL command INSERT.

Query Language
In TimescaleDB, data from hypertables are retrieved through the ordinary SQL SELECT command with a wide range of standard features, such as subqueries, sorting, grouping, and others. Additionally, in TimescaleDB, SQL is extended by facilities to perform statistical analysis of time series, namely calculating the median, the moving average, and percentiles, building histograms, grouping in time intervals of a given length, and so on.
Also, TimescaleDB supports continuous aggregates, which are views to automatically compute and materialize the results of a specified query in the background. Continuous aggregates are similar to materialized views in PostgreSQL but, unlike the latter, they are updated automatically as data are inserted or modified. Figure 11 provides an example of a continuous aggregate that calculates the average value of temperature sensor data and groups these values in one-hour periods. TimescaleDB does not provide an application programmer with built-in time series mining functions. However, it inherits from PostgreSQL the ability to integrate with third-party libraries that implement data mining functions inside the DBMS (for example, Apache MADlib [12]).

Matrix Profile Concept
The recently proposed Matrix Profile (MP) [16] is a data structure that annotates a time series through the index of and the distance to the nearest neighbor of each subsequence of the series. MP naturally allows one to detect both motifs and anomalies in a time series and can serve as a basis for discovering more sophisticated time-series mining primitives, such as semantic segments [42], chains (evolving patterns) [43], snippets (typical patterns) [44], and others.
Currently, MP and accompanying algorithms are extensively employed to mine sensor data in various Industry 4.0 applications. In [45], the authors describe how MP can be used to detect meter-swapping and prevent electricity theft. In [46], the authors present an MP-based approach to automatically labeling the system events in the synchrophasor data. In [47], MP is applied to discover discords in an energy time series for large commercial building load modeling. In [48], MP is applied to discover motifs and track the operational status of a machine through its vibration sensor data. In [49], the authors employ MP in Industrial IoT production maintenance.
Below, we follow [16,46], give basic definitions regarding MP, and thereafter describe an approach to embedding MP functionality into PostgreSQL.
Sensor data that undergo collection and mining are a form of time series.

Definition 1.
A time series $T$ is a chronologically ordered sequence of real-valued numbers: $T = (t_1, \ldots, t_n)$, $t_i \in \mathbb{R}$, where $n$ is the length of the series.
Time series data mining primarily addresses the discovery of patterns among fixed-length, relatively short segments of a time series, or subsequences.

Definition 2.
A subsequence $T_{i,m}$ of a time series $T$ is its subset of $m$ successive elements that starts at the $i$-th position: $T_{i,m} = (t_i, \ldots, t_{i+m-1})$, $1 \le i \le n-m+1$.
For further definitions, we specify all the subsequences of a given time series by sliding an $m$-length window throughout the time series (without physically extracting them).

Definition 3.
An all-subsequence set $S^A_T$ of a time series $T$ is an ordered set of all possible $m$-length subsequences contained in $T$. Specifically, these subsequences are sorted in ascending order with respect to their starting elements: $S^A_T = \{T_{1,m}, T_{2,m}, \ldots, T_{n-m+1,m}\}$.
Next, for a given subsequence, we compute its distance to each element in an all-subsequence set.

Definition 4.
A distance profile $D^T_i$ is a vector of the distances between a given query subsequence $T_{i,m}$ and each subsequence $T_{j,m}$ in an all-subsequence set: $D^T_i = (\mathrm{dist}(T_{i,m}, T_{1,m}), \ldots, \mathrm{dist}(T_{i,m}, T_{n-m+1,m}))$, where $\mathrm{dist}(\cdot, \cdot)$ denotes the Euclidean distance between z scores of two subsequences.
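To make Definition 4 concrete, the following minimal Python sketch computes a distance profile by brute force. It is an illustration only, not the algorithm used in MPPostgres: the function names (znorm, dist, distance_profile), the 0-based indexing, and the convention that a constant subsequence maps to all-zero z scores are assumptions introduced here.

```python
import math

def znorm(seq):
    """Return the z scores of a sequence (zero mean, unit standard deviation)."""
    m = len(seq)
    mean = sum(seq) / m
    std = math.sqrt(sum((x - mean) ** 2 for x in seq) / m)
    if std == 0.0:
        # Assumption of this sketch: a constant subsequence maps to all zeros.
        return [0.0] * m
    return [(x - mean) / std for x in seq]

def dist(a, b):
    """Euclidean distance between the z scores of two equal-length subsequences."""
    za, zb = znorm(a), znorm(b)
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(za, zb)))

def distance_profile(ts, i, m):
    """Distances from the query T_{i,m} to every T_{j,m} (0-based i and j)."""
    query = ts[i:i + m]
    return [dist(query, ts[j:j + m]) for j in range(len(ts) - m + 1)]

ts = [0.0, 1.0, 0.0, 1.0, 0.0, 5.0, 0.0, 1.0]
profile = distance_profile(ts, i=0, m=3)
print(profile)
```

The profile has n - m + 1 entries, and the entry at the query position itself is zero, which is why trivial neighbors have to be excluded before searching for a nearest neighbor.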
Using the distance profile, we aim to find the nearest neighbor of each subsequence of a time series, excluding trivial neighbors, that is, the subsequence itself and the subsequences that largely overlap it.
Now, the matrix profile may be formally defined as follows.

Definition 6.
A matrix profile $P_T$ of a time series $T$ is a vector of distances between each subsequence $T_{i,m}$ and its nearest nontrivial neighbor: $P_T = (nn_{nt}(D^T_1), \ldots, nn_{nt}(D^T_{n-m+1}))$, where $nn_{nt}(D^T_i)$ denotes the minimal distance between $T_{i,m}$ and its nontrivial neighbors.
A matrix profile is accompanied by a supplemental structure to locate the nearest nontrivial neighbors.

Definition 7.
A matrix profile index $I_T$ of a time series $T$ is a vector that stores the index of the nearest nontrivial neighbor of each subsequence: $I_T = (I_1, \ldots, I_{n-m+1})$, where $I_i = j$ such that $\mathrm{dist}(T_{i,m}, T_{j,m}) = nn_{nt}(D^T_i)$.
The matrix profile can be considered as metadata to annotate the corresponding time series. For example, the maximal value of the profile corresponds to an anomalous subsequence (a discord [50]), whereas the minimal values correspond to the best motif subsequence pair in the series.
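The definitions of the matrix profile and its index can be illustrated with a brute-force Python sketch. It runs in quadratic time and is meant only to show the data structure, not the parallel algorithm of [51]. The function names, the 0-based indexing, and the rule that subsequences whose start positions differ by less than excl (by default m) are trivial neighbors are assumptions of this sketch.

```python
import math

def znorm(seq):
    """Return the z scores of a sequence; constant input maps to zeros (assumption)."""
    m = len(seq)
    mean = sum(seq) / m
    std = math.sqrt(sum((x - mean) ** 2 for x in seq) / m)
    if std == 0.0:
        return [0.0] * m
    return [(x - mean) / std for x in seq]

def dist(a, b):
    """Euclidean distance between the z scores of two equal-length subsequences."""
    za, zb = znorm(a), znorm(b)
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(za, zb)))

def matrix_profile(ts, m, excl=None):
    """Brute-force matrix profile P and matrix profile index I (0-based)."""
    if excl is None:
        excl = m  # assumption: start positions closer than m are trivial neighbors
    n_sub = len(ts) - m + 1
    profile = [math.inf] * n_sub
    index = [-1] * n_sub
    for i in range(n_sub):
        for j in range(n_sub):
            if abs(i - j) < excl:
                continue  # skip trivial neighbors
            d = dist(ts[i:i + m], ts[j:j + m])
            if d < profile[i]:
                profile[i], index[i] = d, j
    return profile, index

# Toy series: the pattern 1, 3, 2, 5 occurs at positions 0 and 6.
ts = [1, 3, 2, 5, 0, 4, 1, 3, 2, 5, 7, 0, 9, 1, 6, 2]
P, I = matrix_profile(ts, m=4)
motif = P.index(min(P))    # one member of the best motif pair
discord = P.index(max(P))  # start of the most anomalous subsequence
print("motif pair:", motif, I[motif], "discord:", discord)
```

In the toy series, the repeated pattern yields a zero entry in the profile, and the index vector points from each occurrence to the other.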
The matrix profile and the matrix profile index can be generalized to the case of two time series when the query subsequence and the all-subsequence set are taken from different time series. In this case, we call $P_{T_1,T_2}$ and $I_{T_1,T_2}$ the join matrix profile and the join matrix profile index, respectively. In general, $P_{T_1,T_2} \neq P_{T_2,T_1}$ and $I_{T_1,T_2} \neq I_{T_2,T_1}$.
The matrix profile is a domain-agnostic technique where an application programmer specifies only one parameter (the subsequence length). Among other advantages, the matrix profile allows for parallel computing [51] and gradual updating [45].

Embedding Matrix Profile Management into RDBMS
Embedding MP functionality into an RDBMS has two basic merits. By encapsulating time series data mining inside the DBMS, we avoid the overheads of exporting and importing large data. Moreover, being computed and stored in the database once, a matrix profile and related time series data mining primitives can then be used repeatedly.
To embed MP functionality into RDBMS, we represent a sensor database as depicted in Figure 12. Time series data are represented by the Time Series Directory (TSD) table, a set of univariate time series tables, and a set of multivariate time series tables. The TSD table stores information on each time series, namely its dimension, length, and the name of the table. Each univariate time series is implemented as a table with columns representing a serial number, a timestamp, and a sensor value itself. A multivariate time series is implemented as a table with a similar structure where each dimension is represented as a real-valued column. The MPD table stores metadata on the matrix profiles managed within the system and references corresponding time series in the TSD table (one or two foreign keys for an ordinary or join matrix profile, respectively). Each MP table, in the same manner as a JMP table, stores one matrix profile of the stored time series for a given subsequence length, using to this end one real-valued column for the distance to the nearest neighbor subsequence and one integer-valued column for the index thereof (that is, the MP table also stores a matrix profile index). Each DP table stores all the distance profiles of a time series for a given subsequence length, using for this purpose two integer-valued columns for the indices of subsequences and one real-valued column for the distance between them. The naming of Matrix Profile data objects reflects their indirect connection to Time series data objects and the subsequence length. For example, let us store acoustic sensor data in the ts_acoustic table and let us mine these data along 128-length subsequences. Then the sensor database contains the mp_acoustic_128 table and the dp_acoustic_128 table to store the Matrix Profile and the Distance Profiles of the time series, respectively.
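The table layout described above can be sketched as follows. This is a hedged illustration, not the actual MPPostgres schema: SQLite (through Python's standard sqlite3 module) stands in for PostgreSQL so that the example is self-contained, and all column names except nnDist, as well as the helper mp_table_name, are assumptions introduced here. Only the table names ts_acoustic, mp_acoustic_128, and dp_acoustic_128 and the naming rule come from the text.

```python
import sqlite3

# SQLite stands in for PostgreSQL in this sketch.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE tsd (              -- Time Series Directory
    id         INTEGER PRIMARY KEY,
    table_name TEXT NOT NULL UNIQUE,
    dimension  INTEGER NOT NULL,
    ts_length  INTEGER NOT NULL
);
CREATE TABLE ts_acoustic (      -- one univariate time series
    seq_no INTEGER PRIMARY KEY, -- serial number
    tstamp TEXT NOT NULL,       -- timestamp
    val    REAL NOT NULL        -- sensor value
);
CREATE TABLE mpd (              -- metadata on stored matrix profiles
    id         INTEGER PRIMARY KEY,
    table_name TEXT NOT NULL UNIQUE,
    subseq_len INTEGER NOT NULL,
    ts1_id     INTEGER NOT NULL REFERENCES tsd(id),
    ts2_id     INTEGER REFERENCES tsd(id)  -- set only for a join matrix profile
);
CREATE TABLE mp_acoustic_128 (  -- matrix profile and matrix profile index
    seq_no  INTEGER PRIMARY KEY,
    nnDist  REAL NOT NULL,      -- distance to the nearest neighbor
    nnIndex INTEGER NOT NULL    -- index of the nearest neighbor
);
CREATE TABLE dp_acoustic_128 (  -- all distance profiles
    i    INTEGER NOT NULL,
    j    INTEGER NOT NULL,
    dist REAL NOT NULL,
    PRIMARY KEY (i, j)
);
""")

def mp_table_name(ts_table, m):
    """Derive the matrix profile table name from a series table name (hypothetical helper)."""
    base = ts_table[3:] if ts_table.startswith("ts_") else ts_table
    return f"mp_{base}_{m}"

# Register the series and its matrix profile in the two directory tables.
conn.execute("INSERT INTO tsd (id, table_name, dimension, ts_length) VALUES (1, 'ts_acoustic', 1, 0)")
conn.execute("INSERT INTO mpd (table_name, subseq_len, ts1_id) VALUES (?, 128, 1)",
             (mp_table_name("ts_acoustic", 128),))
tables = {row[0] for row in conn.execute("SELECT name FROM sqlite_master WHERE type = 'table'")}
print(sorted(tables))
```

Keeping the subsequence length in the table name lets several matrix profiles of the same series coexist, one per length.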
The Time series data mining (TSDM) API provides an application programmer with tools to discover discords and motifs and compute matrix profiles. Similar to TimescaleDB, our API (called MPPostgres) is implemented as a PostgreSQL extension consisting of PL/pgSQL functions that return tables (see Figure 13). PL/pgSQL (Procedural Language/PostgreSQL) is a fully featured programming language that is supported by PostgreSQL and allows much more procedural control than traditional SQL.
In MPPostgres, a discord table representation includes the following items: the index of the discord in the time series, the distance to its nearest neighbor, and the discord itself as an array. A motif is represented as a tuple with the following attributes: the left and right parts of the motif, their indices, and the distance between them. A PL/pgSQL function result can be stored in the database as a table or exploited as a view (virtual table). The API can be complemented with functions to search for other mining primitives (chains, snippets, etc.).
Discords and motifs are computed through SQL queries that select the rows of the MP table with the maximal and minimal values of the distance to the nearest neighbor (the nnDist column), respectively. The PL/pgSQL functions to compute matrix profiles are implemented as wrappers over the in-memory algorithm [51]. Note that such a PL/pgSQL function is called only if the respective MP table has not been created yet.
MPPostgres can seamlessly work in tandem with TimescaleDB on top of PostgreSQL, as the example in Figure 14 shows (cf. Figure 10; note that, for the sake of simplicity, we have left only one sensor). After the typical steps (create a database in PostgreSQL, employ TimescaleDB, transform tables into hypertables, and insert sensor data into hypertables), we employ MPPostgres to discover discords and motifs based on the matrix profile and display the findings.

Experimental Case Studies
We embedded MP functionality into PostgreSQL v. 13.2 and evaluated the proposed approach in experiments conducted on a workstation (CPU: Intel Xeon Gold 6254 @4 GHz, RAM: 64 GB, HDD: 1 TB). In the experiments, we considered three cases of real Industry 4.0 applications related to mining big sensor data and implemented them based on the proposed approach. In these cases, we assumed that sensor data were stored and mined inside PostgreSQL and assessed the performance (running time) of our approach. We also compared our results with those of solutions based on InfluxDB and OpenTSDB, assuming that the sensor data were first exported outside these TSDBMSs, then converted to an appropriate format and mined with third-party tools [52], and finally, the results were imported back into those TSDBMSs. We keep in mind that out-of-TSDBMS time series data mining is potentially faster than the in-TSDBMS one, but the absence of export-import overhead costs in our approach allows us to expect an overall advantage.

Electricity Theft Detection
In [45], the authors exploit the matrix profile concept to detect electricity theft. They simulate a meter-swapping event in a dataset of household electric power demand collected from twenty households [53] by swapping the traces of two time series (chosen randomly) starting at a specific date. Next, to discover the swapped pair, they divide each time series into two sections: the "Head", before the selected date, and the "Tail", after that date. Finally, among all possible pairs of houses, the resulting pair $\{\mathrm{House}_i, \mathrm{House}_j\}$ should have the minimal swap-score, which is computed as
$\mathrm{SwapScore}(\mathrm{House}_i, \mathrm{House}_j) = \frac{\min(P_{\mathrm{Head}(\mathrm{House}_i),\mathrm{Tail}(\mathrm{House}_j)})}{\min(P_{\mathrm{Head}(\mathrm{House}_i),\mathrm{Tail}(\mathrm{House}_i)}) + \varepsilon}$,
where $\varepsilon$ denotes the machine epsilon (an upper bound on the relative error due to rounding in floating-point arithmetic). Figure 15 shows how we implemented that case in MPPostgres (here and below, we use pseudo-code with a PL/pgSQL-like syntax, where the EXEC_QUERY statement runs a query with a specified character string, and an italicized variable name denotes the value of the respective variable, rather than the query text that is actually written). Note that all "Head"s and "Tail"s are represented by views (virtual tables), so we do not have overheads caused by physically extracting them for further processing.
Figure 16 shows experimental results associated with the electricity theft detection case. It can be seen that all the competitors perform matrix profile computations of the same duration. MPPostgres requires a short preprocessing step to extract data from tables through SQL queries. Also, all the solutions perform auxiliary computations (finding the minimum, computing SwapScore, etc.), where MPPostgres is slightly ahead of its rivals. Finally, data export (extraction of data and its transfer to a third-party application) introduces significant overheads in out-of-TSDBMS solutions and is absent in our approach.
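The swap-score computation can be sketched in Python as follows. This is a minimal illustration under stated assumptions rather than the implementation of Figure 15: the join matrix profile minimum is obtained by brute force, the three toy "houses" are synthetic series invented for this example, and the names join_profile_min and swap_score are hypothetical.

```python
import itertools
import math
import sys

def znorm(seq):
    """Return the z scores of a sequence; constant input maps to zeros (assumption)."""
    m = len(seq)
    mean = sum(seq) / m
    std = math.sqrt(sum((x - mean) ** 2 for x in seq) / m)
    if std == 0.0:
        return [0.0] * m
    return [(x - mean) / std for x in seq]

def dist(a, b):
    """Euclidean distance between the z scores of two equal-length subsequences."""
    za, zb = znorm(a), znorm(b)
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(za, zb)))

def join_profile_min(t1, t2, m):
    """Minimum of the join matrix profile P_{t1,t2}, computed by brute force."""
    return min(dist(t1[i:i + m], t2[j:j + m])
               for i in range(len(t1) - m + 1)
               for j in range(len(t2) - m + 1))

def swap_score(head_i, tail_i, tail_j, m):
    """SwapScore(House_i, House_j) with the machine epsilon in the denominator."""
    eps = sys.float_info.epsilon
    return join_profile_min(head_i, tail_j, m) / (join_profile_min(head_i, tail_i, m) + eps)

# Synthetic (head, tail) traces; the tails of houses A and B are swapped.
a, b, c = [0.0, 1.0, 2.0] * 4, [0.0, 0.0, 3.0] * 4, [0.0, 2.0, 1.0] * 4
houses = {"A": (a, b), "B": (b, a), "C": (c, c)}

scores = {(i, j): swap_score(houses[i][0], houses[i][1], houses[j][1], m=3)
          for i, j in itertools.permutations(houses, 2)}
best = min(scores, key=scores.get)
print("suspected swap:", best, "score:", scores[best])
```

For the swapped pair, the head of one house matches the tail of the other exactly, so the numerator vanishes and the pair receives the minimal score.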

Detection of Active Electricity Consumption
The authors of [47] apply the matrix profile concept for modeling electricity consumption in three buildings of an academic campus located in Zurich, Switzerland. The authors exploit the Building Data Genome (BDG) dataset [54], where the reference buildings represent an office, a classroom, and a laboratory, identified through their nicknames: Travis, Tracy, and Teri, respectively. Among other things, the authors determine for each building the periods of active electricity consumption by discovering the top three discords as maximal points in the matrix profile of the consumer-side building electric power time series, where the subsequence length parameter corresponds to one week (seven days of hourly readings and a total of 168 points).
The implementation of the case in MPPostgres is shown in Figure 17. To find the discords, we simply call the discoverDiscords function of the TSDM API three times, specifying the respective parameters. The remaining code shows how the above-mentioned function is implemented. The rows of the resulting table are formed through the following loop. Using an SQL query, we select the rows with top-K maximal distance to the nearest neighbor from the MP table and take the values of the index and the distance fields for a row of the resulting table. At last, using the index found, we select all the points of the respective discord subsequence by an SQL query. Figure 18 depicts the experimental results associated with the case of active electricity consumption in buildings (for illustrative purposes, we replicated the BDG dataset, so that it corresponds to data for 32 years). We can observe a situation similar to the previous case. All the competitors compute the matrix profile equally fast. According to their nature, out-of-TSDBMS solutions introduce significant overhead costs on data export, in contrast to our approach. MPPostgres also outperforms the rivals at the discord discovery step since it exploits the built-in database indexing by table primary key when finding top-K values and retrieving rows from tables. It is worth noting that in the typical scenario, when the matrix profile has already been computed and saved in the MP table, our approach would outperform our rivals even more significantly.
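The top-K discord loop described above can be sketched as follows. The sketch is not the code of Figure 17: SQLite (through Python's sqlite3 module) stands in for PostgreSQL, the table names ts_demo and mp_demo_3, the column names other than nnDist, and the function name discover_discords are assumptions, and the matrix profile values are hand-picked toy numbers rather than computed or measured results.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # SQLite stands in for PostgreSQL here
conn.executescript("""
CREATE TABLE ts_demo   (seq_no INTEGER PRIMARY KEY, val REAL NOT NULL);
CREATE TABLE mp_demo_3 (seq_no INTEGER PRIMARY KEY, nnDist REAL NOT NULL,
                        nnIndex INTEGER NOT NULL);
""")
# Toy series of eight points and a hand-made matrix profile for m = 3.
conn.executemany("INSERT INTO ts_demo VALUES (?, ?)",
                 [(i + 1, 10.0 + i) for i in range(8)])
conn.executemany("INSERT INTO mp_demo_3 VALUES (?, ?, ?)",
                 [(1, 0.5, 4), (2, 2.0, 5), (3, 0.1, 6),
                  (4, 0.5, 1), (5, 1.5, 2), (6, 0.1, 3)])

def discover_discords(conn, k, m):
    """Return (index, distance, points) for the k largest nnDist values."""
    top = conn.execute(
        "SELECT seq_no, nnDist FROM mp_demo_3 ORDER BY nnDist DESC LIMIT ?",
        (k,)).fetchall()
    result = []
    for seq_no, nn_dist in top:
        # Fetch the m points of the discord subsequence by its start index.
        points = [row[0] for row in conn.execute(
            "SELECT val FROM ts_demo WHERE seq_no BETWEEN ? AND ? ORDER BY seq_no",
            (seq_no, seq_no + m - 1))]
        result.append((seq_no, nn_dist, points))
    return result

discords = discover_discords(conn, k=2, m=3)
print(discords)
```

Both queries filter or order on the primary key or a single column, which is where the database indexing mentioned in the text can help.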

Tracking the Operational Status of an Industrial Machine
The authors of [48] track the operational status of an industrial two-mode (high and low speed) exhaust fan through the matrix profile computed for a time series collected by a vibration sensor installed on the fan. While collecting the data, the authors deliberately control the fan blowing duration and speed. Switching the fan modes affects the vibrations, while the motifs in the time series indicate the moments when the machine status changes. The authors discover motifs through the matrix profile and compute the total duration of time intervals when the fan is in "high-speed" and "low-speed" mode, as well as the ratio of such intervals. The experimental results show that the ratio computed through the matrix profile is almost identical to ground truth data.
The authors of [48] do not provide the vibration dataset. In our study, we conducted an experiment similar to the one described in [48] and collected vibration data from a small-sized crushing machine, which is located at the South Ural State University (Chelyabinsk, Russia) and is used for training engineers. The collected time series consists of more than 240 thousand points corresponding to a 3-minute time interval of machine work. During the process, we started up the crushing machine eight times, so that the machine status changed sixteen times, from "crushing stops" to "crushing starts" and back. As in the original case, the ratio of time intervals when the machine works in different modes, computed in our experiments through motif discovery, almost coincides with ground truth data.
The implementation of the described case in MPPostgres is shown in Figure 19. After computing the matrix profile of an input time series, we use an SQL query to retrieve top-K motifs as rows of the MP table with top-K minimal distance to the nearest neighbor of the respective subsequence. Then, we scan the time series from left to right through the indices of motifs found while computing the length of the interval between the previous and current motif and alternately adding the result to a total duration that indicates when the machine works in the first or the second mode. Finally, after similar one-time computations over the remaining part of the input time series, we obtain the resulting ratio.
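The scanning step can be illustrated with a short Python sketch. It is not the code of Figure 19: the function name mode_durations, the assumption that the series starts in the first mode, and the toy motif positions are introduced here for illustration only.

```python
import math

def mode_durations(motif_indices, n):
    """Alternately accumulate interval lengths between consecutive motif positions.

    motif_indices: start positions of the motifs that mark status changes.
    n: length of the time series.
    Returns (total of mode 1, total of mode 2, their ratio).
    """
    totals = [0, 0]
    mode = 0   # assumption: the series starts in the first mode
    prev = 0
    for idx in sorted(motif_indices):
        totals[mode] += idx - prev  # interval between previous and current motif
        mode = 1 - mode             # the machine status changes at each motif
        prev = idx
    totals[mode] += n - prev        # remaining part of the series
    ratio = totals[0] / totals[1] if totals[1] else math.inf
    return totals[0], totals[1], ratio

# Toy example: four status changes in a series of 100 points.
first, second, ratio = mode_durations([10, 30, 50, 90], n=100)
print(first, second, ratio)
```

In the toy example, the first mode covers the intervals 0 to 10, 30 to 50, and 90 to 100, and the second mode covers the remaining intervals.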
Figure 20 shows the experimental results obtained in the crushing machine operational status tracking case. As in the previous cases, MPPostgres outperforms the out-of-TSDBMS rivals since the latter are forced to export the data before processing it. Moreover, in the typical scenario when the matrix profile has already been computed and saved in the MP table, MPPostgres would show even higher performance.

Conclusions
In this article, we addressed the problem of choosing an appropriate tool for managing and mining big sensor data. Currently, such data arise in a wide spectrum of Industry 4.0 and Internet of Things applications, such as predictive maintenance, smart cities and factories, digital twins, and others. In these applications, sensors operate at a high frequency and produce time series containing up to tens of millions of elements in a relatively short time interval. The data collected from the sensors are subject to mining in order to make strategic decisions. To do this efficiently, we need specific Time Series DBMSs (TSDBMSs), which differ from relational DBMSs (RDBMSs) and NoSQL systems.
We suggest distinguishing native and add-on TSDBMSs. A native TSDBMS is a standalone development with an original query language, a database engine, and a data storage system. An add-on TSDBMS is implemented on top of a third-party system that provides the TSDBMS with a database engine and a data storage system. We briefly described the most popular representatives of the above-mentioned categories: InfluxDB (native TSDBMS), OpenTSDB (add-on over a NoSQL system), and TimescaleDB (add-on over a relational DBMS). Our overview showed that the above-mentioned systems provide an application programmer with a modest built-in toolset to mine time series data. This leads to unwanted overhead costs since we should use third-party mining systems that need to connect to a sensor database server first, then export the data outside the TSDBMS, convert them to an appropriate format before mining, and finally, import the results back into the TSDBMS.
We presented an approach to managing and mining sensor data inside an RDBMS based on the Matrix Profile concept [16]. The Matrix Profile annotates a time series through the index of and the distance to the nearest neighbor of each subsequence of the time series. The Matrix Profile serves as a basis to discover motifs and discords in time series. Sensor data are stored as a set of univariate and multivariate time series tables, and their metadata are provided by the Time Series Directory table. Moreover, the sensor database contains tables to store the Matrix Profile data and the Matrix Profile Directory table to store metadata. We implemented this approach as a PostgreSQL extension, which provides an application programmer with a set of user-defined functions (UDFs) to compute a matrix profile and time series data mining primitives and to represent them as relational tables.
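To make the definition concrete, the following brute-force Python sketch computes, for each subsequence, the index of and the distance to its nearest neighbor. It is only an illustration of the data structure, not the algorithm used in the extension. The function names, the z-normalized Euclidean distance, and the exclusion zone of half the subsequence length (used to skip trivial matches with overlapping neighbors) are assumptions of this sketch.

```python
import math
from typing import List, Sequence, Tuple


def z_normalize(seq: Sequence[float]) -> List[float]:
    """Shift to zero mean and scale to unit standard deviation."""
    m = len(seq)
    mean = sum(seq) / m
    std = math.sqrt(sum((x - mean) ** 2 for x in seq) / m)
    if std == 0.0:
        return [0.0] * m  # constant subsequence: map to all zeros
    return [(x - mean) / std for x in seq]


def naive_matrix_profile(ts: Sequence[float], m: int) -> List[Tuple[int, float]]:
    """Return (nearest-neighbor index, distance) for every subsequence of length m.

    Quadratic in the number of subsequences; suitable for small examples only.
    """
    n_sub = len(ts) - m + 1
    subs = [z_normalize(ts[i:i + m]) for i in range(n_sub)]
    exclusion = m // 2  # assumed exclusion zone around each subsequence
    profile = []
    for i in range(n_sub):
        best_j, best_d = -1, math.inf
        for j in range(n_sub):
            if abs(i - j) <= exclusion:
                continue  # skip the subsequence itself and its trivial matches
            d = math.dist(subs[i], subs[j])
            if d < best_d:
                best_j, best_d = j, d
        profile.append((best_j, best_d))
    return profile
```

In this representation, the rows with the smallest distances correspond to motifs, and the rows with the largest distances correspond to discords.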
To evaluate the suggested approach, we performed an experimental study of three cases of real Industry 4.0 applications related to mining big sensor data, namely electricity meter-swapping detection, determining active electricity consumption in buildings, and tracking the operational status of machines. We assessed the performance of our approach against solutions based on InfluxDB and OpenTSDB. Our approach surpassed these out-of-TSDBMS competitors since it mines sensor data inside a TSDBMS and thus avoids the significant overhead costs of data export, conversion, and so on.
In further studies, we plan to enhance our TSDBMS extension of PostgreSQL with other sophisticated time series primitives, such as semantic segments, evolving and typical patterns, and others.