WINkNN: Windowed Intervals' Number kNN Classifier for Efficient Time-Series Applications

Our interest is in time-series classification regarding cyber-physical systems (CPSs) with an emphasis on human-robot interaction. We propose an extension of the k-nearest-neighbor (kNN) classifier to time-series classification using intervals' numbers (INs). More specifically, we partition a time-series into windows of equal length and, from each window's data, we induce a distribution which is represented by an IN. This preserves the time dimension in the representation. All-order data statistics, represented by an IN, are employed implicitly as features; moreover, parametric non-linearities are introduced in order to tune the geometrical relationship (i.e., the distance) between signals and, consequently, tune classification performance. In conclusion, we introduce the windowed IN kNN (WINkNN) classifier whose application is demonstrated comparatively on two benchmark datasets regarding, first, electroencephalography (EEG) signals and, second, audio signals. The results by WINkNN are superior in both problems; in addition, no ad hoc data preprocessing is required. Potential future work is discussed.


Introduction
A cyber-physical system (CPS) has been defined as a device with sensing as well as reasoning capacities [1,2]. Strategic initiatives regarding CPSs include "Industrie 4.0" in Germany, the "Industrial Internet of Things (IIoT)" in the United States, and "Society 5.0" in Japan [3]. CPSs typically focus on multidisciplinary applications in healthcare, agriculture and food supply, manufacturing, energy and critical infrastructures, transportation, logistics, security, and, lately, education [4]. Our interest here is in CPSs regarding human-robot interaction.
There is a need for supporting CPSs with mathematical models that involve both sensory data and structured software data toward improving CPS performance during their interaction with humans. However, a widely acceptable mathematical modelling framework is currently missing. In response, the lattice computing (LC) paradigm has been introduced [1,5] for hybrid mathematical modelling based on mathematical lattice theory that unifies rigorously numerical data and non-numerical data; the latter may include (lattice ordered) logic values, sets, symbols, and trees/graphs. More specifically, the cyber and physical components of a CPS are modelled involving non-numerical and numerical data, respectively, in any combination. An additional advantage of lattice computing is its capacity to calculate with semantics represented by a lattice (partial) order relation.
Human-robot interaction is, typically, driven by behavioral patterns including auditory and/or visual signals; visual stimuli are especially common in human-robot interaction applications. Section 4 presents experiments and results as well as a discussion. Finally, Section 5 concludes by summarizing the contribution of this work; potential future work is also delineated.
Level-0; The Lattice (R, ≤) of Real Numbers: Assume the lattice (R, ≤) of real numbers. The greatest lower bound of two numbers x and y is the smallest of the two, denoted by x∧y, whereas the least upper bound of two numbers is the greatest of the two, denoted by x∨y. A positive valuation function v: R→R in lattice (R, ≤) is any strictly increasing function. For instance, choosing v(x) = x, it follows that the distance between two real numbers x and y equals d(x,y) = v(x∨y) − v(x∧y) = x∨y − x∧y = |x−y|. Extending to the N-dimensional Euclidean space (R^N, ≤), with v(x) = x in every dimension, from Equation (1) there follows the Minkowski metric d(x,y;p) = ((d_1(x_1,y_1))^p + ··· + (d_N(x_N,y_N))^p)^(1/p), known in the literature as the L_p metric. In particular, the L_1 metric d(x,y;1) = |x_1 − y_1| + ··· + |x_N − y_N| is known as the Hamming distance; the L_2 metric d(x,y;2) = ((d_1(x_1,y_1))^2 + ··· + (d_N(x_N,y_N))^2)^(1/2) is known as the Euclidean distance; furthermore, the L_∞ metric equals d(x,y;∞) = max{|x_1 − y_1|, ..., |x_N − y_N|}.
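As an illustrative sketch (not the paper's implementation), the Level-0 valuation-based distance and the resulting Minkowski (L_p) metric can be coded as follows; the function names are ours, and v(x) = x is the default valuation:

```python
# Sketch of the Level-0 metric: with a positive valuation v (here v(x) = x),
# d(x, y) = v(x v y) - v(x ^ y) = |x - y|; per-dimension distances are then
# combined into the Minkowski (L_p) metric of Equation (1).

def d_level0(x, y, v=lambda t: t):
    """Metric induced by a strictly increasing positive valuation v."""
    return v(max(x, y)) - v(min(x, y))

def minkowski(x, y, p, v=lambda t: t):
    """L_p metric over per-dimension valuation-based distances."""
    if p == float("inf"):
        return max(d_level0(xi, yi, v) for xi, yi in zip(x, y))
    return sum(d_level0(xi, yi, v) ** p for xi, yi in zip(x, y)) ** (1.0 / p)
```

For instance, `minkowski([0, 0], [3, 4], 2)` recovers the Euclidean distance 5.0, while `p = 1` gives the Hamming distance and `p = inf` the max metric.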
Level-1; The Lattice (I_1, ≤) of Intervals: In the lattice (R, ≤) of real numbers, any strictly decreasing function is a dual isomorphic function. Consider the partially ordered lattice (I_1, ≤) of intervals in the lattice (R, ≤). Given (a) a positive valuation function v: R→R and (b) a dual isomorphic function θ: R→R in (R, ≤), a positive valuation v_1: R×R→R in the corresponding lattice (R×R, ≥×≤) of generalized intervals is defined by v_1([a, b]) = v(θ(a)) + v(b). Therefore, a metric distance in (R×R, ≥×≤) is given by Equation (2): d_1([a, b], [c, d]) = v_1([a∧c, b∨d]) − v_1([a∨c, b∧d]) = [v(θ(a∧c)) + v(b∨d)] − [v(θ(a∨c)) + v(b∧d)]. In conclusion, the metric d_1(.,.), given by Equation (2), is valid in the sublattice (I_1, ≤), which is embedded in the superlattice (R×R, ≥×≤).
Level-2; The Lattice (F_1, ≤) of Intervals' Numbers (INs) [36]: In particular, it has been shown that (F_1, ≤) is a metric lattice [38] whose metric function D_1: F_1 × F_1 → R_0^+ is given by Equation (3): D_1(F, G) = ∫_0^1 d_1(F_h, G_h) dh, where the integrable function d_1: I_1 × I_1 → R_0^+ is defined by Equation (2). For L = F_1^N there follows a Minkowski metric given by Equation (1).
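The Level-1 and Level-2 metrics can be sketched as follows, assuming an IN is stored in its interval-representation as a list of L intervals; discretizing the integral of the IN metric by an average over h-levels, and the default choices v(x) = x and θ(x) = −x, are our illustrative assumptions:

```python
# Sketch of Equations (2) and (3): d1 is the metric between two intervals,
# obtained from the positive valuation v_1([a, b]) = v(theta(a)) + v(b)
# applied to their lattice join and meet; D1 approximates the IN metric by
# averaging d1 over the L h-levels of the interval-representation.

def d1(ab, cd, v=lambda t: t, theta=lambda t: -t):
    """Equation (2): metric between intervals [a, b] and [c, d]."""
    (a, b), (c, d) = ab, cd
    join = (min(a, c), max(b, d))   # lattice join of the two intervals
    meet = (max(a, c), min(b, d))   # lattice meet (a generalized interval)
    v1 = lambda iv: v(theta(iv[0])) + v(iv[1])  # positive valuation v_1
    return v1(join) - v1(meet)

def D1(F, G, **kw):
    """Equation (3), discretized: average d1 over the L h-levels."""
    assert len(F) == len(G)
    return sum(d1(fh, gh, **kw) for fh, gh in zip(F, G)) / len(F)
```

With the default v and θ, d1 reduces to |a − c| + |b − d|, i.e., the familiar interval distance used throughout the lattice computing literature.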
Parametric functions v(.) and θ(.) may introduce tunable non-linearities whose parameters can be estimated optimally. In this work, toward reducing complexity during optimization, we consider functions v(.) and θ(.) with only two parameters per function; in particular, we consider exclusively linear parametric functions v(.) and θ(.):
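A minimal sketch of such linear parametric functions follows; the parameter names (v1, v0, th1, th0, for slope and position, respectively) are ours, chosen to match the two-parameters-per-function description above:

```python
# Linear parametric tunable non-linearities: v(x) = v1*x + v0 must be
# strictly increasing (v1 > 0), whereas theta(x) = th1*x + th0 must be
# strictly decreasing (th1 < 0). One such pair is tuned per window per
# channel during GENETIC optimization.

def make_v(v1, v0):
    assert v1 > 0, "v must be strictly increasing"
    return lambda x: v1 * x + v0

def make_theta(th1, th0):
    assert th1 < 0, "theta must be strictly decreasing"
    return lambda x: th1 * x + th0
```

Each (v, θ) pair produced this way parameterizes the interval metric of Equation (2), and hence the IN metric of Equation (3).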

Methods and Datasets
This section deals with data representation issues as well as the windowed IN kNN, or WINkNN for short, classifier. It also presents the datasets to be employed in the next section.

Intervals' Numbers (INs) and Time-Series Representation Based on INs
Probability and possibility distributions have been studied comparatively in the literature [39,40]. Furthermore, a possibilistic interpretation has been proposed for an IN [38] followed by a probabilistic interpretation [37]. In conclusion, an IN was established as a mathematical object, which can be interpreted either probabilistically or possibilistically [30].
Recall that an IN can be represented either as a function F(x) = ∨_{h∈[0,1]} {h : x ∈ F_h}, namely the membership-function-representation, or, equivalently, as a set of intervals F_h, h ∈ [0, 1], namely the interval-representation. We point out the potential of an IN to represent time-series big data; more specifically, the big data here regards the (time-series) samples. For instance, consider the EEG time-series signal shown in Figure 1a. Figure 1b shows the corresponding distribution function, regarding the time-series sample values, induced by algorithm distrIN [2]. Figure 1c displays the membership-function-representation of IN F; note that the latter matches the distribution function in Figure 1b. Finally, Figure 1d shows the equivalent interval-representation of IN F using L = 32 intervals, where the reason for using L = 32 is explained in [30]. In conclusion, using only the numbers that define L = 32 intervals, we can represent a distribution of samples; in this sense, an IN with fairly few numbers can represent the distribution of orders of magnitude more numbers. In all previous works, a whole time-series signal has been represented by a single IN, e.g., [2,5]. However, the latter representation suppresses the "time dimension," which constitutes significant information content. To recover the "time dimension," two alternatives have been examined here, namely Alternative 1 (AL1) and Alternative 2 (AL2), as explained in the following.
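The induction of an IN from samples can be sketched with quantiles, in the spirit of (but not necessarily identical to) algorithm distrIN [2]; both the quantile construction and the helper names below are our illustrative assumptions:

```python
# Quantile-based sketch of inducing the interval-representation of an IN:
# the interval at level h spans the central (1 - h) probability mass of the
# samples, so h near 0 covers the whole sample range and h = 1 collapses to
# the median. L = 32 intervals follows the choice explained in [30].

def quantile(samples, q):
    """Linear-interpolation quantile of a list of numbers, 0 <= q <= 1."""
    s = sorted(samples)
    pos = q * (len(s) - 1)
    lo, hi = int(pos), min(int(pos) + 1, len(s) - 1)
    frac = pos - lo
    return s[lo] * (1 - frac) + s[hi] * frac

def induce_in(samples, L=32):
    """Interval-representation: L intervals F_h at h = 1/L, 2/L, ..., 1."""
    levels = [(k + 1) / L for k in range(L)]
    return [(quantile(samples, h / 2), quantile(samples, 1 - h / 2))
            for h in levels]
```

For example, `induce_in(range(101), L=4)` yields four nested intervals shrinking toward the median 50.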
First, Alternative AL1 computes the cumulative sum (over time) of the samples' magnitude, normalized to [0, 1]. Figure 2 displays two INs induced from two EEG signals in two different classes, respectively, by Alternative AL1. Although time has been taken into account, there is no significant difference between the two INs in Figure 2 because the large number of involved samples resulted in very similar IN shapes. Hence, the distance between the two INs (computed by Equation (3)) is negligible. The latter explains the poor classification results recorded in experiments we have carried out using AL1. Therefore, Alternative AL1 was abandoned.

Second, Alternative AL2 recovers (partly) the time dimension by partitioning the original time-series signal into consecutive windows and then computing one IN per window. More specifically, the time-series signal is partitioned into a number N of windows of equal length; then, from all the data in a window, one IN is induced by algorithm distrIN [2]. For instance, from the EEG signal in Figure 1a, using N = 10 windows, the ten INs of Figure 3 were induced. Such a (partial) recovery of time has improved classification performance as demonstrated below.
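Alternative AL2 can be sketched as follows; `induce_in` stands for any IN-induction routine such as algorithm distrIN [2], and the handling of leftover samples (assigned to the last window) is our assumption:

```python
# Sketch of Alternative AL2: partition a time-series into N windows of
# (approximately) equal length and induce one IN per window, yielding an
# N-tuple of INs that partly preserves the "time dimension."

def windowed_ins(samples, N, induce_in):
    """Represent a time-series as an N-tuple of INs (one IN per window)."""
    w = len(samples) // N
    windows = [samples[i * w:(i + 1) * w] for i in range(N - 1)]
    windows.append(samples[(N - 1) * w:])  # last window takes the remainder
    return [induce_in(win) for win in windows]
```

A whole subject is then represented by n_ch such N-tuples, one per channel, which is exactly the representation consumed by the WINkNN classifier below.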

The Datasets
Two different datasets were employed regarding, first, electroencephalography (EEG) data and, second, audio data, as explained in the following.

Electroencephalography (EEG) Dataset
We considered the SEED dataset [41,42], which contains EEG signals collected from 15 subjects (7 males and 8 females, aged between 22 and 24 years) while watching movie clips designed to elicit three types of emotions, namely "positive," "neutral," and "negative." More specifically, each subject watched five clips for each type of emotion. The experiment was repeated three times. Therefore, the dataset included 3 × 5 × 15 = 225 trials per emotion. Note that the recording device was an EEG cap conforming to the international 10-20 system (for 62 channels); hence, for each trial (i.e., for each movie clip) there were 62 signals corresponding to the 62 channels, respectively. An EEG signal was sampled at 200 Hz, resulting in time-series with 37,001 to 47,001 samples each. Our objective was to recognize the emotional state, namely "positive," "neutral," or "negative," of individuals using EEG signals.
Of the 62 channels recorded by the EEG cap's electrodes, not all were used in our emotion recognition computational experiments. Efficient emotion recognition calls for prior channel selection in order to reduce computational complexity. More specifically, channel selection depends on the brain region involved in processing, given the stimulus. Previous studies have identified appropriate channel groups [42,43]. For comparison purposes, the five groups of channels proposed in [44], including four channels per group, were used in this work. More specifically, Table 1 shows the channel groups used in our classification experiments. In addition to the aforementioned five channel groups, the set-union of the channels in the five groups was also considered, resulting in nine channels, namely FP1, FT8, T7, T8, TP7, PO7, FC2, FPZ, F8. We point out that each signal was decomposed into delta (δ), theta (θ), alpha (α), beta (β), and gamma (γ) frequency bands [45].

Table 1. Channel groups used in the classification experiments: …, T7, T8 | FT8, T7, T8, TP7 | FT8, T7, T8, PO7 | FP1, FC2, T8, TP7 | FP1, FPZ, F8, T7.


Audio Dataset
The objective here is spoken word classification using the Spoken Digits dataset [46], which contains 2000 recordings of English utterances of the ten digits "zero" to "nine" by four native English speakers. This dataset includes 200 signals per digit. Moreover, note that an audio signal was sampled at 8 kHz, resulting in time-series with 1148 to 18,262 samples each. Minimal silence was ensured by trimming each audio signal both at the beginning and at the end.
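The silence trimming mentioned above can be sketched as follows; the amplitude threshold is an illustrative assumption of ours, not a value specified in [46]:

```python
# Minimal silence-trimming sketch: leading and trailing samples whose
# magnitude stays below an (assumed) amplitude threshold are removed,
# keeping only the span between the first and last "loud" sample.

def trim_silence(signal, threshold=0.01):
    """Strip leading/trailing samples with |amplitude| < threshold."""
    idx = [i for i, s in enumerate(signal) if abs(s) >= threshold]
    if not idx:
        return []  # the whole signal is below the threshold
    return signal[idx[0]:idx[-1] + 1]
```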

The Windowed Intervals' Number kNN (WINkNN) Classifier
We assume a dataset including n_trn and n_tst subjects for training and testing, respectively. In particular, the data regarding a single subject includes n_ch time-series (i.e., channels). A time-series is optimally partitioned, as described below, into N parts or, equivalently, windows of equal length.
Algorithm 1 describes training by the windowed intervals' number kNN (WINkNN) classifier. In particular, an optimal set of parameters is computed in Algorithm 1 using the GENETIC optimization detailed in the next section. Note that Algorithm 1 implements "leave-one-out" training per individual chromosome. Next, Algorithm 2 details testing by the WINkNN classifier.

Algorithm 1 (WINkNN training). Let k be the (integer) parameter of the kNN classifier; let n_cl be the total number of classes; let n_trn be the number of subjects for training; let n_ch be the number of channels per subject; let N be the number of windows per channel; a time-series in a channel is represented by an element F ∈ F_1^N; let N_G be the number of generations of the GENETIC algorithm;
1. m = 1;
2. while m ≤ N_G do
3. for i = 1 to n_trn do
4. Initialize vector ClassCounter(1, ..., n_cl) to zero;
5. for l = 1 to n_ch do
6. Calculate the set kWinners(i,l) = kargmin_{j∈{1,...,n_trn}, j≠i} D(F_i,l, F_j,l) that holds the indices of the k nearest neighbors to F_i,l ∈ F_1^N for j ≠ i;
7. Using the class labels of the k winners in the set kWinners(i,l), update (i.e., increase the corresponding entries of) vector ClassCounter(1, ..., n_cl);
8. end for
9. Assign subject i to the class with the largest entry of ClassCounter(1, ..., n_cl);
10. end for
11. m = m + 1;
12. end while

Algorithm 2 (WINkNN testing). Let k be the (integer) parameter of the kNN classifier; let n_cl be the total number of classes; let n_ch be the number of channels per subject; let N be the number of windows per channel; a time-series in a channel is represented by an element F ∈ F_1^N; consider a subject F_0 with F_0,l ∈ F_1^N, l ∈ {1, ..., n_ch}, for classification;
1. Initialize vector ClassCounter(1, ..., n_cl) to zero;
2. for l = 1 to n_ch do
3. Calculate the set kWinners(l) = kargmin_{j∈{1,...,n_trn}} D(F_0,l, F_j,l) that holds the indices of the k nearest neighbors to F_0,l ∈ F_1^N;
4. Update (i.e., increase the corresponding entries of) vector ClassCounter(1, ..., n_cl) by considering the class labels of the k winners in the set kWinners(l);
5. end for
6. Assign subject F_0 to the class with the largest entry of ClassCounter(1, ..., n_cl);
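The per-channel voting of the testing procedure can be sketched in code as follows; the distance D between N-tuples of INs and the training set are assumed given, and the names here are ours:

```python
# Sketch of the WINkNN voting scheme: for each channel, the k nearest
# training subjects (under a distance D between per-channel representations)
# vote for their class labels; the subject is assigned to the class that
# accumulates the most votes across all channels (the ClassCounter vector).

from collections import Counter

def winknn_classify(subject, train, D, k):
    """subject: list of n_ch per-channel representations.
    train: list of (channels, label) pairs with the same n_ch channels."""
    votes = Counter()                           # plays the role of ClassCounter
    for l in range(len(subject)):               # one vote round per channel
        dists = sorted((D(subject[l], ch[l]), lab) for ch, lab in train)
        for _, lab in dists[:k]:                # the k nearest neighbors
            votes[lab] += 1
    return votes.most_common(1)[0][0]           # class with the most votes
```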

GENETIC Optimization
As mentioned above, the distance between two INs is calculated based on the two parametric functions ν(x) and θ(x) per window per channel. The parameter values of the aforementioned functions affect the metric distance, and as a result, the classification performance changes. To determine an optimal set of parameters that maximizes classification performance, a GENETIC algorithm was employed as detailed next.
During an initialization step, a population of n_r chromosomes was considered. Each chromosome represented the set of parameters for classification. More specifically, for linear ν(x) and θ(x) functions, there were two parameters per function (v_1/v_0 and θ_1/θ_0, respectively) that define slope and position. To enhance flexibility in tuning, we considered different pairs of functions ν(x) and θ(x) per window per channel. For example, to classify EEG signals with 9 channels, where each signal was partitioned into N = 5 windows, we used 2 × 2 × 5 × 9 = 180 parameters. Moreover, an additional parameter was used for the k value of the kNN classifier. Figure 4 shows the structure of a chromosome for n_ch channels and N windows per (time-series) signal.
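The chromosome-length arithmetic above can be made explicit; the helper name is ours:

```python
# Arithmetic sketch of the chromosome layout: two parameters each for v and
# theta, per window per channel (2 * 2 * N * n_ch genes), plus one extra
# gene for the k value of the kNN classifier.

def chromosome_length(n_ch, N):
    return 2 * 2 * N * n_ch + 1

# The EEG example in the text: 9 channels, N = 5 windows -> 180 function
# parameters plus one gene for k.
```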

An initial population of n_r chromosomes (i.e., individuals) was randomly generated under constraints, namely, the parametric functions ν(x) and θ(x) must be monotonically increasing and decreasing, respectively. Then, the fitness (i.e., cost) function of the GENETIC algorithm was computed as follows. Given n_cl classes, the result of a classification cycle, regarding all the n_tst subjects for testing, is an n_cl × n_cl confusion matrix B including classification percentages. In conclusion, the following fitness function was used per chromosome: Q_r = 100 − (B_11 + B_22 + ... + B_{n_cl,n_cl})/n_cl, where r ∈ {1, ..., n_r}. The cost function Q_r indicates how well a set of parameters defines distances that improve classification. Initially, all individuals are evaluated and sorted according to cost. To speed up data processing, parallel processing was applied for the evaluation of the individuals. Next, the individual selection procedure was applied to replace the parental population using tournament selection followed by genetic crossover and mutation; in particular, crossover was applied with probability cxpb = 0.7, whereas mutation was applied with probability mutpb = 0.02.
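The fitness function Q_r can be written directly from the confusion matrix:

```python
# Fitness (cost) of one chromosome: Q_r = 100 - (B_11 + ... + B_ncl,ncl)/n_cl,
# where B is the n_cl x n_cl confusion matrix of classification percentages.
# Lower cost means higher average per-class classification accuracy.

def fitness(B):
    n_cl = len(B)
    return 100.0 - sum(B[i][i] for i in range(n_cl)) / n_cl
```

A perfect classifier (100% on every diagonal entry) yields Q_r = 0, the minimum of the cost.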

Experiments and Results
Two different sets of experiments have been conducted to demonstrate the capacity of the proposed WINkNN classifier regarding, first, an emotion classification task using EEG signals and, second, a word classification task using auditory signals. In each experiment, a single subject was left out for testing; eventually, every subject was left out exactly once for testing. In the aforementioned manner we estimated the generalization capacity of the WINkNN classifier. For both datasets, during the GENETIC optimization, the parameter values n r = 200 and N G = 50 were selected empirically.

Electroencephalography (EEG) Signal Classification
An optimal number N of windows was estimated in the first place. In particular, before any optimization, classification experiments were carried out for an increasing number of 2, 3, 4, 5, 10, 15, and 20 windows. Table 2 shows the corresponding classification results. It was observed that when signals were partitioned into more than 5 windows, there was no significant improvement in classification performance; moreover, the computational cost increased rapidly beyond 5 windows. In conclusion, we decided to use N = 5 windows. Table 3 displays the results obtained by the WINkNN classifier comparatively with the results by alternative classification methods from the literature. It can be seen that the classification performance of our proposed WINkNN classifier is comparable to that of more complex (therefore more computationally expensive) classification techniques that make use of the entire set of available cap electrodes and, in some cases, it is superior, especially when a small number of channels is used. Even though alternative approaches are based on feature extraction, our method operates on raw data; in other words, no ad hoc feature extraction has been carried out whatsoever. The windowed IN approach is also superior to the authors' previous work where a single IN was used to represent the entire signal [5]. It was also observed during the experiments that WINkNN was robust in terms of the consistency of classification results. Despite the fact that EEG signals stemming even from the same subject in response to the same stimuli are not repeatable, the aforementioned consistency of the WINkNN classifier was remarkable and is attributed to the representation of a data distribution by an IN.


Discussion
The use of a single IN to represent a time-series ignores the "time dimension" of a time-series signal; more specifically, it considers solely the distribution of the signal samples. Nevertheless, with the proposed WINkNN classifier, the "time dimension" is (partly) recovered toward improving classification performance. The classification results here have confirmed a consistent improvement in classification performance because of the partition of a time-series into more than one window.
It is observed that the improvement is larger for the experiments with the audio signals than with the EEG signals. This was explained by the fact that the information content in audio signals is more dependent on time, as opposed to the EEG signals which are significantly affected by a number of concurrent brain activations. Note also that not only EEG signals are subject-specific, but also the data in the trials, even per subject, are not repeatable, resulting in a decrease of classification performance for the EEG signals in comparison to audio signals.
Partitioning the signals into N time windows multiplies the total number of parameters of the WINkNN classifier. The latter increases the search space of the GENETIC algorithm and slows convergence to an optimal set of parameters. However, significant benefits are retained with N time windows, including: (a) substantial data reduction, compared to the raw data, because of the employment of "N" INs per channel; hence, the proposed techniques emerge as promising for big data applications; (b) time-consuming, ad hoc feature extraction is not necessary; instead, all-order statistics are used implicitly as features for tunable classification using the optimizable, parametric WINkNN classifier; (c) fairly heavy computational overhead is required only once, during training.
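The data-reduction claim in point (a) can be quantified with illustrative numbers; the sample count below is an assumption of ours, chosen within the EEG sample range reported earlier (37,001 to 47,001 samples):

```python
# Illustrative data-reduction arithmetic: an EEG channel of ~40,000 samples
# represented by N = 5 INs of L = 32 intervals each needs only
# 5 * 32 * 2 = 320 numbers (two endpoints per interval).

samples, N, L = 40_000, 5, 32
numbers_per_channel = N * L * 2        # numbers in the windowed IN representation
reduction = samples / numbers_per_channel
```

That is, the windowed IN representation is roughly two orders of magnitude smaller than the raw channel data in this illustration.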
We remark that recently published classification techniques from the literature have reported higher classification accuracy than the WINkNN classifier on the SEED (EEG) dataset, but at the expense of considerably more computation time, using all 62 channels as well as all frequency bands; moreover, ad hoc features had to be defined [14].

Conclusions
The interest of this work was in time-series classification regarding human-robot interaction. Our proposed novelty was to represent "time" by an N-tuple of INs instead of by a single IN. A novel windowed IN-based kNN, namely WINkNN, classifier was introduced and applied to two different benchmark classification datasets, namely the SEED dataset and the Spoken Digit dataset regarding the recognition of emotional states based on EEG signals and the recognition of spoken digits based on audio signals, respectively. Extensive computational experiments have demonstrated, comparatively, a superior performance of the WINkNN classifier.
The proposed WINkNN classifier used raw (time-series) data; in other words, no ad hoc feature extraction was necessary. The latter is considered to be a major advantage toward transferring IN-based techniques to a wide range of applications. Moreover, the performance of the proposed method can be optimized, parametrically. The capability of the WINkNN classifier was attributed (a) to the capacity of an IN to represent statistics of all-orders; the latter are the "features" engaged implicitly, (b) to the capacity to optimize, parametrically, the distance function, and (c) to the consideration of time.
Because of its capacity for representing a distribution of samples, an IN can potentially represent big data. Hence, IN-based techniques emerge promising in big data applications. Potential future work may include various time-series applications to different domains such as medicine, physics, economics, agriculture, and elsewhere. The implicit employment of "all-order statistics" should also be studied. Further future work will pursue an engagement of INs with alternative architectures such as neural networks etc.