Revealing Driver’s Natural Behavior—A GUHA Data Mining Approach

Abstract: We investigate the applicability and usefulness of the GUHA data mining method and its computer implementation LISp-Miner for driver characterization based on digital vehicle data on gas pedal position, vehicle speed, and other signals. Three analytical questions are assessed: (1) Which measured features, also called attributes, distinguish each driver from all other drivers? (2) Comparing one driver separately in pairs with each of the other drivers, which are the most distinguishing attributes? (3) Comparing one driver separately in pairs with each of the other drivers, which attribute values show significant differences between drivers? The analyzed data consist of 94,380 measurements and contain clear and understandable patterns to be found by LISp-Miner. In conclusion, we find that the GUHA method is well suited for such tasks.


Introduction
In this paper, we investigate the applicability and usefulness of the GUHA data mining method [1][2][3] and its computer implementation LISp-Miner [4] in the context of digital forensics. The focus in forensics is the identification of individuals based on different kinds of evidence found at a crime scene. We investigate whether digital in-vehicle data can be used to describe individuals' natural driving behavior. Our research is related to the publication [5], where drivers were classified by machine learning techniques in order to identify them in a forensic scenario of a hit-and-run accident.
In total, ten drivers traveled between Korea University and the SANGAM World Cup Stadium in the surroundings of Seoul (South Korea) on three different road types, and in-vehicle data were collected. The total driving time per individual was between 121 and 184 min. For reasons explained in [5], we do not use all the features available in the data for our analysis. Moreover, due to the nature of the GUHA method, the time dimension in the data is not relevant in our analysis.

The GUHA Method Briefly
Data mining is about finding interesting relations, associations, and structures in given data. At present, there are several data mining methods developed for various types of data and related problems; many of them are based on statistics, either classical or Bayesian, and neural networks are used as well as machine learning approaches. GUHA (an abbreviation for General Unary Hypotheses Automaton) [1] differs from all the other data mining methods in that it is based on a well-defined logical formalism; dependencies in the data are labelled by truth values TRUE or FALSE (i.e., supported or unsupported by the data). However, the truth value of a statement is determined in a rather unusual way. In a GUHA context, 'data' is a flat matrix with rows and columns; its cells can, in principle, contain any form of symbols, but in practice the data must be converted to a binary form before the data mining process can take place.
The GUHA method has several computer implementations, the most advanced of which is the LISp-Miner software, developed and maintained at the University of Economics, Prague [4] and freely downloadable from https://lispminer.vse.cz/ (accessed on 30 July 2021). GUHA is not a black box method. To be able to use the LISp-Miner software, we need to have at least a rough idea of what the data is about, so that we can ask questions, called analytic questions, about the data. For example (relevant in this study), 'What are the characteristics in driving style that distinguish driver A from all other drivers?' is an analytic question. LISp-Miner finds all the dependencies relevant to the questions that are in the data, even though this may be a time-consuming process. The ability of the LISp-Miner software to analyze such questions is based on the specific logical language of GUHA, the central part of which is generalized quantifiers such as 'ϕ is often followed by ψ', 'ϕ and ψ are quite equivalent', 'ϕ occurs much more often than average when ψ is present', and 'ϕ and ψ almost always exclude each other'. Here, ϕ and ψ are logic statements describing the data; continuing our previous example, ϕ could mean 'Driver A' and ψ could stand for 'Vehicle speed is very high and horizontal acceleration is high'. Within the user-specified boundary conditions, LISp-Miner checks four-fold contingency tables of the form

        ψ    ¬ψ
  ϕ     a     b
  ¬ϕ    c     d

where m = a + b + c + d is the number of rows in the (Boolean) data matrix, and
• a is the number of objects satisfying both ϕ and ψ,
• b is the number of objects satisfying ϕ but not ψ,
• c is the number of objects not satisfying ϕ but satisfying ψ,
• d is the number of objects satisfying neither ϕ nor ψ.
For example, there is a sort of equivalence between the statements ϕ and ψ, denoted by v(ϕ ≈ ψ) = TRUE, if (a + d)/m > p, where p ∈ (0, 1], and a + d ≥ Base. It is important to note that '≈' is a generalized quantifier, not a logical connective. The closer the value of the parameter p is to 1, the more confidently the statements ϕ and ψ appear only simultaneously in the data. Clearly, if b = c = 0, the relation is classical equivalence. The higher the value of the parameter Base, the more significant the dependence. In practice, LISp-Miner goes through up to hundreds of thousands of such contingency tables but prints only those labelled TRUE. However, due to its strong logical and combinatorial basis, LISp-Miner does not go through all possible contingency tables but only those relevant to the question. This speeds up the calculation considerably.
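As an illustration, the four-fold table and the equivalence-style quantifier above can be sketched in a few lines of Python. This is only a toy re-implementation for clarity, not LISp-Miner code; the function and parameter names are ours:

```python
def contingency(phi, psi):
    """Four-fold table (a, b, c, d) for two equal-length Boolean columns."""
    a = sum(1 for f, s in zip(phi, psi) if f and s)          # phi and psi
    b = sum(1 for f, s in zip(phi, psi) if f and not s)      # phi, not psi
    c = sum(1 for f, s in zip(phi, psi) if not f and s)      # not phi, psi
    d = sum(1 for f, s in zip(phi, psi) if not f and not s)  # neither
    return a, b, c, d

def equivalence_quantifier(phi, psi, p=0.9, base=100):
    """v(phi ~ psi) = TRUE iff (a + d)/m > p and a + d >= Base."""
    a, b, c, d = contingency(phi, psi)
    m = a + b + c + d
    return (a + d) / m > p and (a + d) >= base
```

With p close to 1, the quantifier only fires when ϕ and ψ are (almost) always present together or absent together in the data.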
To date, another ten different quantifiers have been implemented in the LISp-Miner software; we will introduce them later in the appropriate sections. By appropriately adjusting the parameters p and Base, we find associations that are few in number but extremely important, if they exist in the data.

Presentation of the Analyzed Data
The analyzed data were obtained as follows. In total, 10 drivers traveled between Korea University and the SANGAM World Cup Stadium in the surroundings of Seoul (South Korea) on three different road types. The number of features recorded was 51, at 1 s time intervals. Total driving time per individual was between 121 and 184 min (cf. [5]).
The key research task is to distinguish each driver individually from this data set. In this study, the starting point of our data mining analysis is a raw data set with 94,380 rows and the following 12 columns: • Drivers A, . . . , J (10 in all),


Data Preprocessing
Because LISp-Miner analysis is based on binary data, we processed the raw data as follows. Apart from driver identification, all the values in cells are numeric and linearly ordered. For simplicity, we divided each column into seven equally long sections and, to illustrate the results, we colored them as follows: extralow, verylow, lower, average, higher, veryhigh, extrahigh.
Each driver is handled separately, which produces ten new columns. Thus, there are 11 × 7 + 10 = 87 columns in the input data in LISp-Miner, and so the analyzed data is a 94,380 × 87 Boolean data matrix. There are no empty cells in this data. The task is to characterize each of the ten drivers with a maximum of 77 different characteristics or combinations of these characteristics. In GUHA logic terminology, these 77 columns are called attributes or (unary) predicates; for example, the output of a LISp-Miner procedure could be ϕ ≈ ψ, where ≈ is a specified generalized quantifier, assuming of course that the data supports such an association. In GUHA language, such closed formulas are called hypotheses. This explains the name GUHA: an automaton that produces hypotheses from given data. Thus, LISp-Miner produces hypotheses, i.e., associations that the data supports.
Further, we divided the data thus obtained into two parts. The rows with a sequence number not divisible by four (75% of all the rows) form the model set, and the remainder (25%) the test set. We first examined the model set and selected a few hypotheses that are most strongly supported by the data; then we performed the same analysis on the test set and examined whether the same strongest hypotheses can be found among these results. Finally, we report some of the strongest hypotheses supported by both the model set and the test set.
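A minimal sketch of this preprocessing in Python (the helper names are ours, and the 1-based row numbering for the split is an assumption; the actual binarization was performed inside LISp-Miner):

```python
# The seven equal-width categories used to discretize each numeric column.
CATEGORIES = ["extralow", "verylow", "lower", "average",
              "higher", "veryhigh", "extrahigh"]

def seven_bins(values):
    """Equal-width discretization of one numeric column into 7 categories."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / 7 or 1.0  # guard against constant columns
    return [CATEGORIES[min(int((v - lo) / width), 6)] for v in values]

def model_test_split(rows):
    """Rows whose sequence number is not divisible by four form the 75%
    model set; the remaining rows form the 25% test set."""
    model = [r for i, r in enumerate(rows, start=1) if i % 4 != 0]
    test = [r for i, r in enumerate(rows, start=1) if i % 4 == 0]
    return model, test
```

Each category (and each driver indicator) then becomes one Boolean column of the 94,380 × 87 input matrix.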

Analytical Questions and the Most Relevant Answers to Them
In this section, we present the three key analytical questions we posed to the LISp-Miner software and some of the answers we received. Of the results, we have selected only the most significant.

The First Analytic Question: Which Hypotheses Distinguish Each Driver from All Other Drivers
It is natural to use the Above Average quantifier of the 4ft-Miner procedure in LISp-Miner, because the hypotheses it produces answer the question: in terms of which combination of attributes in ψ does driver ϕ differ most clearly from the other drivers? Here, the truth definition v(ϕ ≈ ψ) = TRUE is based on the conditions a/(a + b) ≥ (1 + p)(a + c)/m, where p > 0, and a ≥ Base in the related contingency table; v(ϕ ≈ ψ) = FALSE otherwise. For example, if p = 4 and the above two conditions hold, then the statement ψ is at least 5 (= 1 + p) times more common for driver ϕ than it is on average. More generally, the larger p, the more clearly the combination of attributes in ψ distinguishes driver ϕ from the other drivers.
It is natural that the value a (called support) must be large enough; otherwise, the result would have low general significance. Moreover, to make the distinguishing attributes as clear as possible, the value of the parameter p must be as large as possible. On the other hand, a large p lowers the attainable support, so a compromise must be made with the threshold value Base. After a few experiments, we came to the values p = 5-10 and Base ≥ 100. In Figure 1, there is a screenshot from the front page of LISp-Miner when retrieving attributes describing driver D, where p ≥ 5.2.
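Under the truth condition above, the Above Average quantifier can be mimicked in a few lines (a sketch with our own naming, not the 4ft-Miner implementation):

```python
def above_average(a, b, c, d, p=4.0, base=100):
    """TRUE iff a/(a+b) >= (1+p) * (a+c)/m and a >= Base, i.e. psi is at
    least (1+p) times more frequent among the phi-rows than on average."""
    m = a + b + c + d
    return a / (a + b) >= (1 + p) * (a + c) / m and a >= base
```

For instance, with a = 200, b = 200, c = 100, d = 2000 (hypothetical frequencies), the relative frequency of ψ among the ϕ-rows is 0.5 against an overall average of 0.12, so the quantifier fires for p = 3 (0.5 ≥ 4 × 0.12) but not for p = 4 (0.5 < 5 × 0.12).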

We observe that computer performance may sometimes constrain the computation. If we limit the number of attributes in ψ to 12, LISp-Miner goes through about 25 million contingency tables; for this, a standard desktop computer takes about 25-30 min. For example, when examining the attributes characteristic of driver G, LISp-Miner went through 25,556,692 contingency tables, of which 101 fulfilled the required boundary conditions; these are the related hypotheses. This took 29 min 28 s, see Figure 2. Moreover, there are several almost identical hypotheses (in the sense of overlapping attributes) among the 101 outputs produced by LISp-Miner; for example, driver G is associated with the attributes on the first two lines of Figure 2. We do not report all the hypotheses but only the most apparent ones; the choice is more or less random. The results are shown in Figure 3.
For example, the property 'Pressure(extra high) & Acceleration Speed Lateral(average)' is more than 11 times (exactly 1 + 10.1) more common for driver A than for all the other drivers combined. In the test set, there are at least (Base ≥) 35 rows (in fact, a = 40) where this property is associated with driver A. Moreover, there are only 7 (= b) rows where driver A is present but this property is not. Thus, the share (40 out of 47) is 85%; in the group of the rest of the drivers, the share is only 8%.
It can also be seen from Figure 3 that the properties describing drivers G, H, and I are exactly the same in both the model set and the test set. The results for the other drivers are also very similar in both sets. Moreover, it is noteworthy that all the original 11 variables in the raw data appear in the results at least once.
LISp-Miner includes a Bayesian statistics-based tool to assess the reliability of the results. The result is presented as a graph of its distribution (see Figure 4): the more tapered the distribution (the graph on the left-hand side), the more reliable the result. The theoretical basis for this interpretation is explained in the publication [6].

The Second Analytic Question: Comparing One Driver Separately in Pairs with Each of the Other Drivers (up to 3), Which Are the Most Distinguishing Attributes
There are several ways in the LISp-Miner software to perform this kind of task. One of them is the SD4ft-Miner procedure [7], a handy tool for finding remarkable differences between two separate subsets of the data. The truth value depends on two contingency tables, and there are six possible quantifiers. The simplest one is based on the condition a/(a + b) − A/(A + B) ≥ p, where a and b refer to the contingency table defined by the first set, and A and B refer to the contingency table defined by the second set, respectively. Obviously, 0 ≤ p ≤ 1; the closer the value is to 1, the more different the sets are with respect to that property. Conditions for the related Base values can also be set.
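This simplest SD4ft-style condition is easy to state directly; the following sketch (our own naming, not the SD4ft-Miner implementation) checks it on the frequencies of two contingency tables:

```python
def sd4ft_difference(a, b, A, B, p=0.2):
    """TRUE iff the relative frequency of psi among the phi-rows of the
    first subset exceeds that of the second subset by at least p:
    a/(a+b) - A/(A+B) >= p."""
    return a / (a + b) - A / (A + B) >= p
```

For example, with the four-fold frequencies a = 2660, b = 776 and A = 1371, B = 804 reported later for drivers F and A, the difference is 0.774 − 0.630 ≈ 0.14, so the quantifier fires for p = 0.1 but not for p = 0.2.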
As an example, we examine which (up to 3) attributes distinguish driver F from the other drivers. Base values are 1% of the total (both in the model set and in the test set) and p ≥ 0.2.
In addition, we limit the number of attributes to a maximum of three. In Figure 5, we have collected 1 to 3 such hypotheses (those where the value of the parameter p is highest) that are produced both in the model set and the test set.

Figure 5 is to be understood as follows. For example, driver A differs significantly in (at least) three different ways from driver F; two of them are related to Acceleration Pedal Value, and one to Vehicle Speed (lines 1, 2, and 3 in Figure 5). Indeed (see the first line), there are a = 4936 rows in the model data where ϕ is driver F and ψ is Acceleration Pedal Value(extra low . . . lower), and b = 3323 rows where ϕ is present but ψ is not; thus, a/(a + b) = 0.598, corresponding to 60% of the cases. If, on the other hand, ϕ is driver A, there are B = 4323 rows where ϕ is present but ψ is not, and A/(A + B) = 0.277, corresponding to 28% of the cases. This gives a/(a + b) − A/(A + B) = 0.32 > p. It is noteworthy that of all the 11 original measurement classes, only 6 occur in Figure 5, and of all the 77 predicates, i.e., the columns that possibly characterize drivers, only 33 emerge. The statistical significance of the results was tested by Fisher's exact test. In all the above 16 cases, the difference is statistically significant if the limit p = 0.01 for significance is used. Thus, it can, with high probability, be considered certain that the results produced by LISp-Miner are not due to chance but actually describe the differences between driver F and the other drivers.

The Third Analytic Question: Comparing One Driver Separately in Pairs with Each of the Other Drivers, Which Attribute (up to 3) Values Show Significant Differences between Drivers
The Ac4ft-Miner procedure is well suited for investigating such an issue. The key idea is as follows: keep some background factors constant but change some others and see where this change leads. The definition of truth is the same as in the SD4ft-Miner context. Consider, as an example, the effects on Fuel Consumption, Vehicle Speed, or Vehicle SpeedRN of a driver change from F to X, when one of the other attributes remains constant. In Figure 6, we summarize some of the most significant results produced on the model set that are also produced on the test set; the parameter Base = 1% of all the rows in the related (model or test) set and p ≥ 0.1.
Figure 6 is to be understood as follows. For example, comparing drivers F and A (the first line in Figure 6), the model set contains a = 2660 rows where ψ is Acceleration Speed Longitudinal(lower . . . higher) & Fuel Consumption(very low . . . average) and ϕ is driver F. The corresponding b value (ϕ is present but ψ is not) is 776; thus, a/(a + b) = 0.774, corresponding to 77% of the cases. If, on the other hand, ϕ is driver A, the corresponding figures are A = 1371 and B = 804; therefore, A/(A + B) = 0.630, corresponding to 63% of the cases. This gives a/(a + b) − A/(A + B) = 0.14. The statistical significance of the results was tested by Fisher's exact test. In all the 10 cases, the difference is statistically significant if the limit p = 0.01 for significance is used. The results can therefore be considered statistically reliable.

Figure 6. Hypotheses that distinguish driver F in pairs from other drivers (by Ac4ft-Miner).
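The significance check is reproducible from the four-fold frequencies alone. The sketch below is our own self-contained one-sided Fisher exact test via the hypergeometric distribution; in practice, a library routine such as scipy.stats.fisher_exact would normally be used instead:

```python
from math import exp, lgamma

def log_comb(n, k):
    """Logarithm of the binomial coefficient C(n, k), via lgamma."""
    return lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)

def fisher_one_sided(a, b, A, B):
    """One-sided Fisher exact test on the table [[a, b], [A, B]]:
    probability, under independence, of a table at least as extreme
    as the observed one (a as large or larger)."""
    n = a + b + A + B
    row1, col1 = a + b, a + A
    log_denom = log_comb(n, col1)
    p = 0.0
    for x in range(a, min(row1, col1) + 1):
        if col1 - x > A + B:  # skip infeasible tables
            continue
        p += exp(log_comb(row1, x) + log_comb(A + B, col1 - x) - log_denom)
    return p
```

Applied to the frequencies above (a = 2660, b = 776, A = 1371, B = 804), the resulting p-value is far below 0.01, in line with the reported significance.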

Observations
We list some of the strengths and weaknesses of GUHA and the LISp-Miner software that we have identified during this study. To begin with, LISp-Miner has the right tools to analyze large data sets; we have not even used every possible one here. The problem we solved was clear-cut, so formulating the analytic questions was easy. In general, this may not always be so simple; using LISp-Miner and interpreting the results requires practice. There are plenty of statistically significant differences in the data that differentiate the drivers; we have listed only the most significant, definitely not all of them. If the data size increases or there is a need to study really small subsets (say, less than 0.05% of the total data), the computation time grows; however, this problem can be alleviated by using more powerful computers. LISp-Miner is freely downloadable software [7]; in any case, since GUHA logic is a well-defined logic, users can write their own software for their own needs if they do not want to use LISp-Miner.

Conclusions
In this work, we have investigated the suitability of the GUHA data mining method for finding statistically significant features that differentiate between drivers in data on drivers' driving behavior. The result is unequivocally positive; since those differences exist in the data, LISp-Miner, the GUHA method implementation, finds them. Indeed, using the GUHA approach, it is possible to characterize drivers by combinations of feature values very specific to them. This differs from the approach of using classification models, which largely target mean values and typical ranges to characterize classes. Rare values or value combinations are usually treated as outliers; some algorithms are even promoted because of their robustness towards outliers. This was found to be one important reason for weak results in an attempt using one-class classification (cf. [8]). In the forensic context, a combination of classical classification and an approach based on individual patterns in driving behavior could help to obtain more reliable and explainable results. Another approach worth investigating, instead of time series analysis and machine learning, is to take a closer look at specific patterns in the data. Such patterns are value combinations that are rare, occurring, e.g., only for a few seconds each hour, but with one person only.