A Many-Objective Simultaneous Feature Selection and Discretization for LCS-Based Gesture Recognition

Discretization and feature selection are two relevant techniques for dimensionality reduction. The first transforms a set of continuous attributes into discrete ones, and the second removes irrelevant and redundant features; together, they often yield more concise and specific data. In this paper, we propose to deal simultaneously with optimal feature subset selection, discretization, and classifier parameter tuning. As an illustration, the proposed problem formulation has been addressed using a constrained many-objective optimization algorithm based on dominance and decomposition (C-MOEA/DD) and a limited-memory implementation of the warping longest common subsequence algorithm (WarpingLCSS). In addition, the discretization subproblem has been addressed using a variable-length representation, along with a variable-length crossover, to overcome the need to specify the number of elements defining the discretization scheme in advance. We conduct experiments on a real-world benchmark dataset; compare two discretization criteria as the discretization objective, namely Ameva and ur-CAIM; and analyze recognition performance and reduction capabilities. Our results show that our approach outperforms previously reported results by up to 11% and achieves an average feature reduction rate of 80%.


Introduction
Gestures are composed of multiple body-part motions and can form activities [1]. Hence, gesture recognition offers a wide range of applications, including, inter alia, fitness training, human-robot and human-computer interaction, security, and sign language recognition. Likewise, gesture recognition is employed in ambient assisted living systems to tackle burgeoning and worrying public healthcare problems, such as autonomous living for people with dementia or Parkinson's disease. Although a large amount of work has been conducted on image-based sensing technology, camera and depth sensors are limited to the environment in which they are installed. Moreover, they are sensitive to obstructions in the field of vision, variation in luminous intensity, reflection, etc. In contrast, wearable sensors and mobile devices are more suitable for monitoring ambulatory activities and physiological signals.
In a supervised context, a wide range of action or gesture recognition techniques has been explored using wearable sensors. k-Nearest Neighbor (k-NN) might be the most straightforward classifier to utilize, since it does not learn a model but searches for the closest data points in the training set using a given distance function. Even though conventional k-NN achieves good performance, it lacks the ability to deal with the following problems: low attribute- and sample-noise tolerance, high-dimensional spaces, large training set requirements, and imbalanced data. Yu et al. [2] recently proposed a random subspace ensemble framework based on hybrid k-NN to tackle these problems, but the classifier has not yet been applied to a gesture recognition task. The Hidden Markov Model (HMM) is the most traditional probabilistic method used in the literature [3,4]. However, computing the transition probabilities necessary for learning the model parameters requires a large amount of training data. HMM-based techniques may also be unsuitable for hard real-time (synchronized clock-based) systems due to their latency [5]. Since data sets are not necessarily large enough for training, the Support Vector Machine (SVM) is a classical alternative [6][7][8]. SVM is, nevertheless, very sensitive to the selection of its kernel type and the parameters related to the latter. There are also novel dynamic Bayesian networks often used for sequence analysis, such as recurrent neural networks (e.g., LSTMs) [9] and deep learning approaches [10], which should become more popular in the coming years.
Dynamic Time Warping (DTW) is one of the most widely used similarity measures for matching two time-series sequences [11,12]. Although DTW is often reproached for being slow, Rakthanmanon et al. [13] demonstrated that it can be quicker than Euclidean distance search algorithms and even suggested that the method can spot gestures in real time. However, the recognition performance of DTW is affected by the strong presence of noise, caused either by the segmentation of gestures during the training phase or by gesture execution variability.
The longest common subsequence (LCSS) method is a precursor to DTW. It measures the closeness of two sequences of symbols as the length of the longest subsequence common to both. Like DTW, it can deal with sequences of different lengths, which is why it is often used as an alignment method. In [14], LCSS was found to be more robust in noisy conditions than DTW. Indeed, since all elements are paired in DTW, noisy elements (i.e., unwanted variations and outliers) are also included, while they are simply ignored in LCSS. Although some image-based gesture recognition applications can be found in [15][16][17], not much work has been conducted using non-image data. In the context of crowd-sourced annotations, Nguyen-Dinh et al. [18] proposed two methods, entitled SegmentedLCSS and WarpingLCSS. In the absence of noisy annotations (mislabeling or inaccurate identification of the start and end times of each segment), the two methods achieve recognition performance on three data sets similar to that of DTW- and SVM-based methods, and surpass them in the presence of mislabeled instances. Extensions have recently been proposed, such as a multimodal system based on WarpingLCSS [19], S-SMART [20], and a limited-memory, real-time version for resource-constrained sensor nodes [21]. Although the parameters of these LCSS-based methods should be application-dependent, they have so far been determined empirically, and the lack of a design procedure (parameter-tuning method) has been pointed out.
In designing mobile or wearable gesture recognition systems, the temptation to integrate many sensing units for handling complex gestures often conflicts with key real-life deployment constraints, such as cost, power efficiency, weight limitations, memory usage, privacy, or unobtrusiveness [22]. The redundant or irrelevant dimensions introduced may even slow down the learning process and affect recognition performance. The most popular dimensionality reduction approaches include feature extraction (or construction), feature selection, and discretization. Feature extraction aims to generate a set of features from the original data with a lower computational cost than using the complete list of dimensions. A feature selection method selects a subset of features from the original feature list. Feature selection is an NP-hard combinatorial problem [23]. Although numerous search techniques can be found in the literature, they fail to avoid local optima and require a large amount of memory or very long runtimes. Alternatively, evolutionary computation techniques have been proposed for solving the feature selection problem [24]. Since the abovementioned LCSS technique directly utilizes raw or filtered signals, there is no evidence on whether we should favour feature extraction or selection. However, these LCSS-based methods impose the transformation of each sample from the data stream into a sequence of symbols. Therefore, feature selection coupled with a discretization process could be employed. Similar to feature selection, discretization is also an NP-hard problem [25,26].
In contrast to the feature selection field, few evolutionary discretization algorithms have been proposed in the literature [25,27]. Indeed, evolutionary feature selection algorithms have the disadvantage of high computational cost [28], while convergence (getting close to the true Pareto front) and diversity of solutions (a set of solutions as diverse as possible) remain two major difficulties [29].
Evolutionary feature selection methods focus on maximizing classification performance and minimizing the number of dimensions. Although it is not yet clear whether removing some features can lead to a decrease in the classification error rate [24], a multiple-objective problem formulation could bring trade-offs. The attribute discretization literature aims to minimize the complexity of the discretization scheme and to maximize classification accuracy. In contrast to feature selection, these two objectives appear to be conflicting in nature [30].
Multi-objective optimization algorithms based on particle swarm optimization (a heuristic method) can provide optimal solutions. However, an increase in the number of features enlarges the solution space and thus decreases search efficiency [31]. Therefore, Zhou et al. [31] noted that particle swarm optimization may converge to a local optimum on high-dimensional data. Some variants have been suggested, such as a competitive swarm optimization operator [32] and multiswarm comprehensive learning particle swarm optimization [33], but tackling many-objective optimization remains a challenge [29].
Moreover, particle swarm optimization can fall into a local optimum, as it needs a reasonable balance between convergence and diversity [29]. Similar observations hold for filter and wrapper methods [34] (more details about filter and wrapper methods can be found in [31,34]). Yang et al. [29] suggested reducing the computational burden with a competition mechanism using a new environment selection strategy to maintain population diversity. Additionally, since mutual information can capture nonlinear relationships within a filter approach, Sharmin et al. [35] used mutual information as a selection criterion (joint bias-corrected mutual information) and then suggested adding simultaneous forward selection and backward elimination [36].
Deep neural networks such as CNNs [37] are able to learn and select features. As an example, hierarchical deep neural networks have been combined with a multiobjective model to learn useful sparse features [38]. Due to the huge number of parameters, deep learning approaches need a large quantity of balanced samples, which is sometimes not available in real-world problems [34]. Moreover, as a deep neural network is a black box (non-causal and non-explainable), evaluating its feature selection ability is difficult [37].
Currently, feature selection and data discretization are still studied individually, and many-objective formulations of them have not been fully explored [39]. To the best of our knowledge, no study has tried to solve the two problems simultaneously using evolutionary techniques in a many-objective formulation. The contributions of this paper are summarized as follows:

1. We propose a many-objective formulation to deal simultaneously with optimal feature subset selection, discretization, and parameter tuning for an LM-WLCSS classifier. This problem was solved using the constrained many-objective evolutionary algorithm based on dominance (minimization of the objectives) and decomposition (C-MOEA/DD) [40].

2. Unlike many discretization techniques requiring a prefixed number of discretization points, the proposed discretization subproblem exploits a variable-length representation [41].

3. To agree with the variable-length discretization structure, we adapted the recently proposed rand-length crossover of the random variable-length crossover differential evolution algorithm [42].

4. We refined the template construction phase of the microcontroller-optimized Limited-Memory WarpingLCSS (LM-WLCSS) [21] using an improved algorithm for computing the longest common subsequence [43]. Moreover, we altered the recognition phase by reprocessing the samples contained in the sliding window in charge of spotting a gesture in the stream.

5. To tackle multiclass gesture recognition, we propose a system encapsulating multiple LM-WLCSS classifiers and a light-weight classifier for resolving conflicts.
The main hypothesis is as follows: using a constrained many-objective evolutionary algorithm based on dominance and decomposition, an optimal feature subset can be found. The rest of the paper is organized as follows: Section 2 states the constrained many-objective optimization problem definition, exposes C-MOEA/DD, highlights some discretization works, presents our refined LM-WLCSS, and reviews multiple fusion methods based on WarpingLCSS. Our solution encoding, operators, objective functions, and constraints are presented in Section 3. Subsequently, we present the decision fusion module. The experiments are described in Section 4, along with the methodology and the corresponding evaluation metrics (two for effectiveness, including Cohen's kappa, and one for reduction). Finally, our system is evaluated and the results are discussed in Section 5.

Preliminaries and Background
In this section, we first briefly provide some basic definitions on the constrained many-objective optimization problem. We then describe a recently proposed optimization algorithm based on dominance and decomposition, entitled C-MOEA/DD. Additionally, we review evolutionary discretization techniques and successors of the well-known class-attribute interdependence maximization (CAIM) algorithm. Afterward, we expose some modifications to the different key components of the limited-memory implementation of the WarpingLCSS. Finally, we review some fusion methods based on WarpingLCSS that tackle the multi-class gesture problem and recognition conflicts.

Constrained Many-Objective Optimization
Since artificial intelligence and engineering applications tend to involve more than two or three objective criteria [40], the concept of many-objective optimization problems must be introduced beforehand. Literally, they involve many objectives in a conflicting and simultaneous manner. Hence, a constrained many-objective optimization problem may be formulated as follows:

minimize F(x) = [f_1(x), . . . , f_m(x)]^T
subject to g_j(x) ≥ 0, j = 1, . . . , J,
h_k(x) = 0, k = 1, . . . , K,
x ∈ Ω,

where x = [x_1, . . . , x_n]^T is an n-decision-variable candidate solution taking its values in the bounded space Ω. A solution respecting the J inequality constraints (g_j(x) ≥ 0) and the K equality constraints (h_k(x) = 0) is qualified as feasible. These constraints are included in the objective functions and are detailed in our proposed method in Section 3.3. F : Ω → R^m associates a candidate solution to the objective space R^m through m conflicting objective functions. The obtained results are thus alternative solutions and have to be considered equivalent, since no information is given regarding the relevance of one over the others.
A solution x_1 is said to dominate another solution x_2, written x_1 ≺ x_2, if and only if f_i(x_1) ≤ f_i(x_2) for all i ∈ {1, . . . , m} and f_j(x_1) < f_j(x_2) for at least one j ∈ {1, . . . , m}.
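As an illustrative sketch (not part of the original formulation), the Pareto dominance test for minimization follows directly from this definition; the function name `dominates` is ours:

```python
def dominates(f1, f2):
    """Return True if objective vector f1 Pareto-dominates f2 under
    minimization: f1 is no worse in every objective and strictly
    better in at least one."""
    return (all(a <= b for a, b in zip(f1, f2))
            and any(a < b for a, b in zip(f1, f2)))
```

Note that two solutions may be mutually non-dominated, which is what makes the obtained alternatives equivalent in the absence of preference information.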

C-MOEA/DD
MOEA/DD is an evolutionary algorithm for many-objective optimization problems, drawing its strength from MOEA/D [44] and NSGA-III [45]. As it combines both dominance-based and decomposition-based approaches, it achieves an effective balance between the convergence and diversity of the evolutionary process. Decomposition is a popular method of breaking down a multiple-objective problem into a set of scalar optimization subproblems. Here, the authors use the penalty-based boundary intersection approach, but they highlight that any approach could be applied. Subsequently, we briefly explain the general framework of MOEA/DD and expose the modifications required to solve constrained many-objective optimization problems.
At first, a procedure generates N solutions to form the initial parent population and creates a weight vector set, W, representing N unique subregions in the objective space. As the current problem does not exceed six objectives, only the one-layer weight generation algorithm is used. The T closest weights for each weight vector are also extracted to form a neighborhood set of weight vectors, E. The initial population, P, is then divided into several non-domination levels using the fast non-dominated sorting method employed in NSGA-II.
In the MOEA/DD main loop, a common process is applied for each weight vector in E until the termination criterion is reached. It consists of randomly choosing k mating parents in the neighboring subregions of the weight vector considered. When no solution exists in the selected subregions, the parents are randomly selected from the current population. These k solutions are then altered using genetic operators. For each offspring, an intricate update mechanism is applied to the population.
First, the associated subregion of the offspring is identified. The considered offspring is then merged with the population in a temporary container, P'. Next, the non-domination level structure of P' is updated. It is worth noting that an ingenious method is employed to avoid a full non-dominated sorting of P'. Since the population must preserve its size throughout the run of MOEA/DD, three cases may arise. When all solutions are non-dominated, the worst solution of the most crowded weight vector is deleted from the population; this function has been denominated LocateWorst. When there are multiple non-domination levels, the deletion of one solution depends on the number of solutions within the last non-domination level, F_l. On the one hand, if F_l contains only one solution, the density of the associated subregion is investigated so as not to incorrectly alter the population diversity, and LocateWorst is called in the case where the subregion contains only one element. On the other hand, when the most crowded subregion associated with the solutions in F_l contains more than one element, the solution owning the largest scalarized value within it is deleted. Otherwise, LocateWorst is called so as not to delete isolated subregions.
Since MOEA/DD is designed to solve unconstrained many-objective optimization problems, Li et al. [40] also provided an extension for handling constrained many-objective optimization problems, which requires three modifications. First, a constraint violation value, CV(x), henceforth accompanies each solution x. It is determined as follows:

CV(x) = Σ_{j=1}^{J} ⟨g_j(x)⟩ + Σ_{k=1}^{K} |h_k(x)|,

where ⟨α⟩ returns the absolute value of α if α < 0 and returns 0 otherwise. Second, while the abovementioned update procedure is maintained for feasible solutions, the survival of infeasible ones is dictated by their association with an isolated subregion. More precisely, a second chance of survival is granted to these infeasible solutions, and the solution with the largest CV, or the one that is not associated with an isolated subregion, is eliminated from the next population. Finally, the selection-for-reproduction procedure becomes a binary tournament, where two solutions are randomly picked and the solution with the smallest CV is favoured, with a random choice applied in the case of equality.
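The constraint violation aggregation above can be sketched as follows; the representation of constraints as callables and the function name `constraint_violation` are our assumptions for illustration:

```python
def constraint_violation(x, inequality, equality):
    """Aggregate constraint violation CV(x): each inequality constraint
    g_j(x) >= 0 contributes |g_j(x)| only when violated (g_j(x) < 0),
    and each equality constraint h_k(x) = 0 contributes |h_k(x)|."""
    cv = sum(abs(g(x)) for g in inequality if g(x) < 0)
    cv += sum(abs(h(x)) for h in equality)
    return cv
```

A feasible solution thus has CV(x) = 0, and the binary tournament described above simply compares these scalar values.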

Discretization
The discretization process aims to transform a set of continuous attributes into discrete ones. Although there is a substantial number of discretization methods in the literature, Garcia et al. [26] recently carried out extensive testing of the 30 most representative and recent discretization techniques in supervised classification. Amongst the best-performing algorithms, FUSINTER, ChiMerge, CAIM, and Modified Chi2 obtained the highest average accuracies; Zeta and MDLP can be added to this list if Cohen's kappa metric is considered. In the authors' taxonomy, the evaluation measures for comparing solutions are broken down into five families: information, statistics, rough sets, wrapper, and binning. Subsequently, we review a few evolutionary approaches to solving discretization problems and successors of CAIM.
In [46], a supervised method called Evolutionary Cut Points Selection for Discretization (ECPSD) was introduced. The technique exploits the fact that boundary points are suitable candidates for partitioning numerical attributes. Hence, a complete set of boundary points for each attribute is first generated. A CHC model [47] then searches for the optimal subset of cut points while minimizing the inconsistency. Later on, the evolutionary multivariate discretizer (EMD) was proposed on the same basis [27]. The inconsistency was substituted with the aggregate classification error of an unpruned version of C4.5 and a Naive Bayes classifier. Additionally, a chromosome length reduction algorithm was added to cope with large numbers of attributes and instances in datasets. However, the selection of the most appropriate discretization scheme relies on a weighted sum of the objective functions with a user-defined parameter. This approach is thus limited, even though varying the parameters of a parametric scalarizing approach may produce multiple different Pareto-optimal solutions. In [25], a multivariate evolutionary multi-objective discretization (MEMOD) algorithm was proposed. It is an enhanced version of EMD, where the CHC has been replaced by the well-known NSGA-II and the chromosome length reduction algorithm hereafter exploits all Pareto solutions instead of the best one. The following objective functions have been considered: the number of currently selected cut points, the average classification error produced by a CART and a Naive Bayes classifier, and the frequency of the selected cut points. As previously exposed, CAIM stands out due to its performance amongst the classical techniques. Some extensions have been proposed, such as the Class-Attribute Contingency Coefficient [48], the Autonomous Discretization Algorithm (Ameva) [49], and ur-CAIM [30]. Ameva has been successfully applied in activity recognition [50] and fall detection for older people [51].
The technique is designed to achieve a lower number of discretization intervals without prior user specification and maximizes a contingency coefficient based on the χ² statistic. The Ameva criterion is formulated as follows:

Ameva(k) = χ²(k) / (k(l − 1)),

where k and l are the number of discrete intervals and the number of classes, respectively. The ur-CAIM discretization algorithm enhances CAIM for both balanced and imbalanced classification problems. It combines three class-attribute interdependence criteria in the following manner:

ur-CAIM = CAIM_N · CAIR · (1 − CAIU),

where CAIM_N denotes the CAIM criterion scaled into the range [0, 1]. CAIR and CAIU stand for Class-Attribute Interdependence Redundancy and Class-Attribute Interdependence Uncertainty, respectively. In the ur-CAIM criterion, the CAIR factor has been adapted to handle unbalanced data.
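As a minimal sketch of how the Ameva criterion can be evaluated for a candidate discretization scheme, the following computes it from a class-interval contingency table using the standard χ² identity χ² = N(−1 + Σ n_ij²/(n_i· n_·j)); the function name and table layout are our assumptions:

```python
def ameva(table):
    """Ameva criterion from a contingency table with rows = classes (l)
    and columns = discretization intervals (k): chi2(k) / (k * (l - 1)).
    Higher values indicate a better class-attribute interdependence."""
    l = len(table)                      # number of classes
    k = len(table[0])                   # number of intervals
    n = sum(sum(row) for row in table)  # total sample count
    row_tot = [sum(row) for row in table]
    col_tot = [sum(table[i][j] for i in range(l)) for j in range(k)]
    chi2 = n * (-1 + sum(
        table[i][j] ** 2 / (row_tot[i] * col_tot[j])
        for i in range(l) for j in range(k)
        if row_tot[i] and col_tot[j]))
    return chi2 / (k * (l - 1))
```

A scheme that perfectly separates the classes into distinct intervals maximizes the χ² term while the k(l − 1) denominator penalizes schemes with many intervals.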

Limited-Memory Warping LCSS Gesture Recognition Method
SegmentedLCSS and WarpingLCSS, introduced by [18], are two template matching methods for online gesture recognition using wearable motion sensors, based on the longest common subsequence (LCS) algorithm. Aside from being robust against human gesture variability and noisy gathered data, they are also tolerant to noisy labeled annotations. On three datasets (10-17 classes), both methods outperform DTW-based classifiers with and without the presence of noisy annotations. WarpingLCSS has a runtime about one order of magnitude smaller than that of SegmentedLCSS. In return, an application-specific penalty parameter has to be set. Since each method is a binary classifier, a fusion method must be established, which will be discussed and illustrated in detail later.
A recently proposed variant of the WarpingLCSS method [21], labeled LM-WLCSS, allows the technique to run on a resource-constrained sensor node. A custom 8-bit Atmel AVR motion sensor node and a 32-bit ARM Cortex M4 microcontroller were successfully used to illustrate the implementation of this method in three different everyday life applications. On the assumption that a gesture may last up to 10 s and given a sample rate of 10 Hz, the chips are capable of recognizing, simultaneously and in real time, 67 and 140 gestures, respectively. Furthermore, the extremely low power consumption required to recognize one gesture (135 µW) might suggest an ASIC (Application-Specific Integrated Circuit) implementation.
In the following subsections, we review the core components of the training and recognition processes of an LM-WLCSS classifier, which is in charge of recognizing a particular gesture. All streams of sensor data, acquired using multiple sensors attached to the sensor node, are pre-processed using a specific quantization step that converts each sample into a symbol. The resulting strings form a training data set essential for selecting a proper template and computing a rejection threshold. In recognition mode, each newly gathered sample is quantized and transmitted to the LM-WLCSS and then to a local maximum search module, called SearchMax, which finally outputs whether a gesture has occurred. Figure 1 describes the entire data processing flow.

Quantization Step (Training Phase)
At each time t, a quantization step assigns an n-dimensional vector, representing one sample from all connected sensors, to a symbol. In other words, a prior data discretization technique is applied to the training data, and the resulting discretization scheme is used as the basis of a data association process for all incoming new samples. Specifically to the LM-WLCSS, Roggen et al. [21] applied the K-means algorithm and the nearest neighbor rule. Despite the fact that K-means is widely employed, it suffers from the following disadvantages: the algorithm does not guarantee the optimality of the solution (the positions of the cluster centers), and the assessed number of clusters must be assumed to be the optimum. In this paper, we investigate the use of the Ameva and ur-CAIM coefficients as discretization evaluation measures in order to find the most suitable discretization scheme. The nearest neighbor algorithm is preserved, with the squared Euclidean distance selected as the distance function. More formally, the quantization step is defined as follows:

Q_c(x(t)) = argmin_i d(x(t), L_ci),

where Q_c(·) assigns to the sample x(t) the index of the discretization point L_ci closest to it within the discretization scheme L_c associated with the gesture class c, and d(·,·) is the squared Euclidean distance. The stream is thereby converted into a succession of discretization point indices.
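The nearest-neighbour quantization above can be sketched as follows; the function name `quantize` and the representation of the scheme as a list of points are our assumptions:

```python
def quantize(sample, scheme):
    """Map an n-dimensional sample to the index of the closest
    discretization point in the scheme L_c, using the squared
    Euclidean distance as the distance function."""
    def sqdist(a, b):
        return sum((u - v) ** 2 for u, v in zip(a, b))
    return min(range(len(scheme)), key=lambda i: sqdist(sample, scheme[i]))
```

Each incoming sample is thus replaced by a small integer index, which is what the LM-WLCSS matching stage consumes.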

Template Construction (Training Phase)
Let s_ci denote sequence i, i.e., quantized gesture instance i, belonging to the gesture class training data set S_c. Hence, S_c ⊂ S, where S is the training data set. In the LM-WLCSS, the template construction of a gesture class c simply consists of choosing the first motif instance in the gesture class training data set. Here, we adopt the existing template construction phase of the WarpingLCSS. A template s_c, representing all gestures from class c, is therefore the sequence that has the highest LCS among all other sequences of the same class:

s_c = argmax_{s_ci ∈ S_c} Σ_{s_cj ∈ S_c, j ≠ i} l(s_ci, s_cj),

where l(·,·) is the length of the longest common subsequence.
The LCS problem has been extensively studied; a naive solution has an exponential complexity of O(2^n). A major improvement, proposed in [52], is achieved by dynamic programming with a runtime of O(nm), where n and m are the lengths of the two compared strings. In [43], the authors suggested three new algorithms that improve the work of [53], using a van Emde Boas tree, a balanced binary search tree, or an ordered vector. In this paper, we use the ordered vector approach, since its time and space complexities are O(nL) and O(R), where n and L are the lengths of the two input sequences and R is the number of matched pairs of the two input sequences.
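For reference, the classic O(nm) dynamic program and the template selection rule above can be sketched as follows (a two-row LCS variant rather than the ordered vector approach of [43]; function names are ours):

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two symbol
    sequences, via the classic O(n*m) dynamic program."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, 1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]

def choose_template(sequences):
    """Template selection: pick the sequence with the highest total
    LCS length against the other sequences of the same class."""
    return max(sequences,
               key=lambda s: sum(lcs_length(s, t) for t in sequences
                                 if t is not s))
```

The two-row formulation keeps memory at O(m) while preserving the O(nm) runtime.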

Limited-Memory Warping LCSS
The LM-WLCSS instantaneously produces a matching score between the stream and a template s_c. When a symbol of the stream matches the template, i.e., the ith sample of the stream and the jth sample of the template are alike, a reward R_c is given. Otherwise, the current score is the maximum of the two following cases: (1) a mismatch between the stream and the template, and (2) a repetition in the stream or in the template. An identical penalty D, the normalized squared Euclidean distance d(·,·) between the two considered symbols weighted by a fixed penalty P_c, is applied in each case. Distances are retrieved from the quantizer, since a pairwise distance matrix between all symbols in the discretization scheme has already been built and normalized. In the original LM-WLCSS, the choice between the different cases is controlled by a tolerance ε. Here, this behavior has been nullified due to the exploration capacity of the metaheuristic to find an adequate discretization scheme. Hence, modeled on the dynamic computation of the LCS score, the matching score M_c(j, i) between the first j symbols of the template s_c and the first i symbols of the stream W stems from the following formula:

M_c(j, i) = M_c(j − 1, i − 1) + R_c, if W(i) = s_c(j);
M_c(j, i) = max{ M_c(j − 1, i − 1) − D, M_c(j − 1, i) − D, M_c(j, i − 1) − D }, otherwise,

where D = P_c · d(W(i), s_c(j)). It is easily seen that the higher the score, the more similar the pre-processed signal is to the motif. Once the score reaches a given acceptance threshold, an entire motif has been found in the data stream. By updating a backtracking variable, B_c, with the case of the recurrence that was selected, the algorithm enables retrieving the start time of the gesture.
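The recurrence can be sketched as a streaming score update, keeping only the previous column of M_c; this is an illustrative reconstruction, not the original microcontroller implementation, and the function name and signature are our assumptions:

```python
def wlcss_scores(stream, template, dist, reward=1.0, penalty=1.0):
    """Streaming WLCSS-style score: on a symbol match, add the reward;
    otherwise take the best of the mismatch/repetition cases, each
    penalized by D = penalty * dist(stream symbol, template symbol).
    Returns the matching score of the full template after each sample."""
    m = len(template)
    prev = [0.0] * (m + 1)   # scores for stream position i - 1
    out = []
    for sym in stream:
        cur = [0.0]
        for j in range(1, m + 1):
            d = penalty * dist(sym, template[j - 1])
            if sym == template[j - 1]:
                cur.append(prev[j - 1] + reward)
            else:
                cur.append(max(prev[j - 1] - d, prev[j] - d, cur[j - 1] - d))
        out.append(cur[m])
        prev = cur
    return out
```

Only O(|s_c|) memory is needed per template, which is what makes the method suitable for resource-constrained nodes.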

Rejection Threshold (Training Phase)
The computation of the rejection threshold, ω_c, requires computing the LM-WLCSS scores between the template and each gesture instance (except the chosen template) contained in gesture class c. Let µ^(c) and σ^(c) denote the resulting mean and standard deviation of these scores. It follows that

ω_c = µ^(c) − h_c · σ^(c),

where h_c is a positive real number in the range ]0, µ^(c)/σ^(c)[.

Searchmax (Recognition Phase)
A SearchMax function is called after every update of the matching score. It aims to find a peak in the matching score curve, representing the presence of a motif, using a sliding window without the necessity of storing that window. More precisely, the algorithm first detects an ascent of the score by comparing its current and previous values. In this case, a flag is set, a counter is reset, and the current score is stored in a variable called Max. For each following value that is below Max, the counter is incremented. When Max exceeds the pre-computed rejection threshold, ω_c, and the counter is greater than the size of the sliding window, WF_c, a motif has been spotted. The original LM-WLCSS SearchMax algorithm has been kept in its entirety. WF_c therefore controls the latency of the gesture recognition and must be smaller than the length of the gesture to be recognized.
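The SearchMax logic described above can be sketched as follows, operating offline on a list of scores for clarity rather than one sample at a time; the function name and return convention (indices at which a detection is flagged) are our assumptions:

```python
def search_max(scores, threshold, window):
    """Sketch of SearchMax: follow the matching score, remember the
    latest peak reached during an ascent, and flag a detection once
    the peak exceeds the rejection threshold and the score has stayed
    below it for more than `window` samples."""
    detections = []
    peak, counter, rising = None, 0, False
    prev = None
    for i, s in enumerate(scores):
        if prev is not None and s > prev:
            rising = True
            peak, counter = s, 0     # new candidate peak, reset counter
        elif rising:
            counter += 1             # score below the stored peak
            if peak is not None and peak > threshold and counter > window:
                detections.append(i)
                rising, peak, counter = False, None, 0
        prev = s
    return detections
```

The `window` parameter plays the role of WF_c: a larger value delays detections but filters out spurious local maxima.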

Backtracking (Recognition Phase)
When a gesture has been spotted by SearchMax, its start time is retrieved using a backtracking variable. The original implementation as a circular buffer with a maximal capacity of |s_c| · WB_c has been maintained, where |s_c| and WB_c denote the length of the template s_c and the length of the backtracking variable B_c, respectively. However, we add an additional behavior. More precisely, because of the time required by SearchMax to detect a local maximum, WF_c elements are skipped before the backtracking algorithm is applied. The current matching score is then reset, and the symbols of the WF_c previous samples are reprocessed. Since only references to the discretization scheme L_c are stored, re-quantization is not needed.

Fusion Methods Using WarpingLCSS
WarpingLCSS is a binary classifier that matches the current signal with a given template to recognize a specific gesture. When multiple WarpingLCSS instances are considered in tackling a multi-class gesture problem, recognition conflicts may arise. Multiple methods have been developed in the literature to overcome this issue. Nguyen-Dinh et al. [18] introduced a decision-making module, where the class with the highest normalized similarity between the candidate gesture and each conflicting class template is output. This module has also been exploited for the SegmentedLCSS and LM-WLCSS. However, storing the candidate detected gesture and reprocessing as many LCSS computations as there are gesture classes might be difficult to integrate on a resource-constrained node. Alternatively, Nguyen-Dinh et al. [19] proposed two multimodal frameworks to fuse data sources at the signal and decision levels, respectively. The signal fusion combines (sums) all data streams into a single-dimension data stream. However, considering all sensors with equal importance might not yield the best configuration for a fusion method. The classifier fusion framework aggregates the similarity scores from all connected template matching modules, each processing the data stream from one unique sensor, into a single fusion spotting matrix through a linear combination based on the confidence of each template matching module. When a gesture belongs to multiple classes, a decision-making module resolves the conflict by outputting the class with the highest similarity score. The behavior on interleaved spotted activities is, however, not well documented. In this paper, we decided to make the final decision using a light-weight classifier.

Proposed Method
In this section, we present an evolutionary algorithm for feature selection, discretization, and parameter tuning for an LM-WLCSS-based method. Unlike many discretization techniques requiring a prefixed number of discretization points, the proposed algorithm exploits a variable-length structure in order to find the most suitable discretization scheme for recognizing a gesture using LM-WLCSS. In the remaining part of this paper, our method is denoted by MOFSD-GR (Many-Objective Feature Selection and Discretization for Gesture Recognition).

Solution Encoding and Population Initialization
A candidate solution x integrates all key parameters required to enable data reduction and to recognize a particular gesture using the LM-WLCSS method.
As previously noted, the sample at time t is an n-dimensional vector x(t) = [x_1(t) . . . x_n(t)], where n is the total number of features characterizing the sample. Focusing on a small subset of features could significantly reduce the number of sensors required for gesture recognition, save computational resources, and lessen costs. Feature selection has been encoded as a binary-valued vector p_c = {p_j}_{j=1}^n ∈ {0, 1}^n, where p_j = 0 indicates that the corresponding feature is not retained, whereas p_j = 1 signifies that the associated feature is selected. This type of representation is widespread across the literature.
Amongst the abovementioned LM-WLCSS parameters, only the SearchMax window length WF_c, the penalty P_c, and the coefficient h_c of the threshold have been included in the solution representation.

1. WF_c controls the latency of the recognition process, i.e., the time required to announce that a gesture peak is present in the matching score. WF_c is a positive integer uniformly chosen in the interval [WF_c^lower, WF_c^upper] = [5, 15]. By fixing the reward R_c to 1, the penalty P_c is a real number uniformly chosen in the range [0, 1]; otherwise, gestures that differ from the selected template would be hardly recognizable.

2. The coefficient h_c of the threshold is strongly correlated with the reward R_c and the discretization scheme L_c. Since it cannot easily be bounded, its value is locally investigated for each solution.

3. The backtracking variable length WB_c allows us to retrieve the start-time of a gesture. Although too short a length decreases the recognition performance of the classifier, its choice could reduce the runtime and memory usage on a constrained sensor node. Since its length is not a major performance limiter in the learning process and it can easily be rectified during the deployment of the system, it was fixed to three times the length of the longest gesture occurrence of class c in order to reduce the complexity of the search space.
Hence, the decision vector x can be formulated as follows: x = (p c , L c , P c , WF c , h c ).
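As a minimal illustration, the decision vector x = (p_c, L_c, P_c, WF_c, h_c) could be encoded and randomly initialized as below; the dictionary container, the [K_lower, K_upper] bounds on the scheme length, and the unit-scaled training space are assumptions for the sketch.

```python
import random

# Illustrative encoding of the decision vector x = (p_c, L_c, P_c, WF_c, h_c).
# Names and bounds follow the text; the container itself is an assumption.
def random_solution(n_features, k_lower, k_upper, wf_bounds=(5, 15)):
    n_points = random.randint(k_lower, k_upper)  # variable-length scheme L_c
    return {
        "p": [random.randint(0, 1) for _ in range(n_features)],  # feature mask p_c
        "L": [[random.random() for _ in range(n_features)]       # discretization
              for _ in range(n_points)],                         # points in [0,1]^n
        "P": random.random(),              # penalty P_c in [0, 1] (reward R_c = 1)
        "WF": random.randint(*wf_bounds),  # SearchMax window length WF_c
        "h": None,                         # h_c: investigated locally per solution
    }
```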

Operators
In C-MOEA/DD, selected solutions produce one or more offspring using genetic operators. In this paper, for each selected pair of parent solutions {x_1, x_2}, a crossover generates two children {x'_1, x'_2} that are mutated afterwards. These two operators are explained in the following subsections.

Crossover Operation
The classical uniform crossover is used for the selected-feature vector. In this paper, we adapted the recently proposed rand-length crossover of the random variable-length crossover differential evolution algorithm [42] to cross over two discretization schemes. More precisely, offspring lengths are first randomly and uniformly selected from the range [K^lower, K^upper], where x_i^{L_c} indicates the discretization scheme (to be used for the gesture class c) associated with the solution x_i and |·| denotes the number of elements in this discretization scheme. For the current value of i ∈ [1, min_{i∈{1,2}} |x_i^{L_c}|], three cases might occur. When both parent solutions contain a discretization point at index i, the simulated binary crossover (SBX) is applied to each dimension of the two points. When one of the parent discretization schemes is too short, both children inherit from the parent having the longest discretization scheme. Otherwise, a new discretization point is uniformly chosen in the training space for each child solution. All newly created discretization points are randomly assigned to the child solutions. The pseudo-code of the rand-length crossover for discretization schemes is given in Algorithm 1.
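A sketch of the rand-length crossover for discretization schemes might look as follows; the child-length range and the sampling of brand-new points uniformly in [0, 1]^dim are assumptions, and `sbx_pair` is the standard SBX formula rather than the exact code of Algorithm 1.

```python
import random

def sbx_pair(a, b, eta=30.0):
    """Simulated binary crossover on two scalars (standard SBX form)."""
    u = random.random()
    beta = (2 * u) ** (1 / (eta + 1)) if u <= 0.5 else (1 / (2 * (1 - u))) ** (1 / (eta + 1))
    return 0.5 * ((1 + beta) * a + (1 - beta) * b), 0.5 * ((1 - beta) * a + (1 + beta) * b)

def rand_length_crossover(L1, L2, k_lower, k_upper, dim, eta=30.0):
    """Sketch of the variable-length crossover described above.

    L1, L2 are parent discretization schemes (lists of dim-dimensional points).
    Child lengths are drawn uniformly from [k_lower, k_upper] (assumed range).
    """
    len1, len2 = (random.randint(k_lower, k_upper) for _ in range(2))
    c1, c2 = [], []
    for i in range(max(len1, len2)):
        if i < len(L1) and i < len(L2):      # both parents have point i: SBX
            p1, p2 = [], []
            for d in range(dim):
                v1, v2 = sbx_pair(L1[i][d], L2[i][d], eta)
                p1.append(v1); p2.append(v2)
        elif i < len(L1) or i < len(L2):     # one parent too short: inherit
            longest = L1 if len(L1) > len(L2) else L2
            p1, p2 = list(longest[i]), list(longest[i])
        else:                                # neither: sample a new point
            p1 = [random.random() for _ in range(dim)]
            p2 = [random.random() for _ in range(dim)]
        if i < len1: c1.append(p1)
        if i < len2: c2.append(p2)
    return c1, c2
```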
Since the LM-WLCSS penalties are encoded as real values, the SBX operator is also applied to the decision variable P_c. In contrast, SearchMax window lengths are integers; thus, we incorporate the weighted average normally distributed arithmetic crossover (NADX) [54]. It induces a greater diversity than the uniform crossover and SBX operators while still proposing values near and between the parents. Although the length of the backtracking variable has been fixed, the NADX operator could be considered for it as well.
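We are not reproducing the exact NADX definition from [54] here; the following is only a plausible illustration of an integer crossover that samples near and between the parents, with clamping to the WF_c bounds added as an assumption.

```python
import random

def nadx_int(a, b, low, high):
    """Hedged sketch of a normally distributed arithmetic crossover for
    integers (the exact NADX operator is defined in [54]; this only
    illustrates sampling near and between the two parent values)."""
    w = random.random()                        # random weight for the average
    mean = w * a + (1 - w) * b                 # weighted arithmetic mean
    spread = abs(a - b) / 2 or 0.5             # spread tied to parent distance
    child = round(random.gauss(mean, spread))
    return max(low, min(high, child))          # clamp to the WF_c bounds
```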
When the selected features, the discretization schemes, the LM-WLCSS penalties, or the SearchMax window lengths of the child solutions differ from those of the parent solutions, their threshold coefficients h_c must be reset to undefined, because the LM-WLCSS classifier resulting from the solution is altered.

Mutation Operation
All decision variables are modified with equal probability. The uniform bit-flip mutation operator is applied to the selected-feature binary vector. Each discretization point in the discretization scheme is also altered with equal probability. Specifically, when a discretization point has been selected for modification, all of its features are mutated using the polynomial mutation operator. For all remaining decision variables, polynomial mutation is applied whether they are encoded as integers or real numbers.
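The polynomial mutation mentioned above, in its basic real-coded form (without the boundary-aware refinement some implementations use), can be sketched as:

```python
import random

def polynomial_mutation(x, low, high, eta_m=20.0):
    """Basic polynomial mutation on a single real decision variable."""
    u = random.random()
    span = high - low
    if u < 0.5:
        delta = (2 * u) ** (1 / (eta_m + 1)) - 1          # perturb downward
    else:
        delta = 1 - (2 * (1 - u)) ** (1 / (eta_m + 1))    # perturb upward
    return min(high, max(low, x + delta * span))          # keep within bounds
```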

Objective Functions
The quality of a candidate solution is measured by the objective functions. In order to find the best solution for recognizing a particular gesture using LM-WLCSS, five functions have been considered, where T_c is the set of distinct discretization points in the elected template s_c, |T_c| is the number of distinct elements in the latter, and [·] denotes the Iverson bracket. Let us first define the basic terms generated by a confusion matrix: tp (true positives) is the number of correctly identified samples, fp (false positives) refers to the incorrectly identified samples, tn (true negatives) is the number of correctly rejected samples, and fn (false negatives) refers to the incorrectly rejected samples. In (13), f_1 measures how well the trained binary classifier performs on the testing dataset. Although accuracy is widely acknowledged, it cannot be used as the exclusive recognition performance indicator, since the classifier could have exactly zero predictive power [55]. We alternatively selected the F1 score, defined as the harmonic mean of precision and recall, where precision = tp/(tp + fp) and recall = tp/(tp + fn). The objective function f_2, in (14), comes directly from the template construction during the training phase of the binary classifier. It is the average sum of the longest common subsequence between the elected template s_c and the other quantized gesture instances in the gesture class training dataset. The higher the score is, the better the template represents the gesture class c.
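The f_1 objective can be computed directly from the confusion-matrix counts; the zero-division guards are an implementation assumption:

```python
def f1_score(tp, fp, fn):
    """F1 as the harmonic mean of precision and recall (objective f_1)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```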
The Ameva criterion, determined by the objective function f_3 in (15), expresses the quality of the discretization scheme component of the solution. Its highest values are attained when all samples from a specific class are quantized to a unique discretization point (the other discretization points having no associated samples). Additionally, the criterion favours a low number of discretization points. Since there are only two classes in this problem, i.e., the samples from the gesture class c represent the positive class and all other examples are negatives, it is possible to encounter similarities between the gesture executions of both classes. As a result, negative examples might be quantized into the same discretization points that define the class template s_c, and the Ameva criterion might try to create unnecessary discretization points. To overcome this issue, a constraint on the template, defined in (18), imposes that the latter must be defined by at least three distinct discretization points. Additionally, in (16), the objective function f_4 counters this conflicting situation and measures heterogeneity through the normalized entropy of the elected template s_c, bounded within [0, 1]. A lower frequency of appearance of a discretization point in the template is thus penalized. The Ameva criterion may be interchanged with ur-CAIM or any other discretization criterion.
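A sketch of the two discretization-quality measures follows. The Ameva value is computed from a class-by-interval contingency table using the published formula Ameva(k) = χ²(k)/(k(ℓ−1)); the table layout and the entropy convention for single-symbol templates are assumptions.

```python
from math import log

def ameva(table):
    """Ameva criterion for a contingency table table[i][j] = count of class i
    samples quantized to discretization point j (assumed layout).
    Higher is better; the k in the denominator favours few points."""
    l, k = len(table), len(table[0])
    n = sum(sum(row) for row in table)
    row_tot = [sum(row) for row in table]
    col_tot = [sum(table[i][j] for i in range(l)) for j in range(k)]
    chi2 = n * (sum(table[i][j] ** 2 / (row_tot[i] * col_tot[j])
                    for i in range(l) for j in range(k)
                    if row_tot[i] and col_tot[j]) - 1)
    return chi2 / (k * (l - 1))

def normalized_entropy(template):
    """Objective f_4: normalized entropy of symbols in the elected template,
    in [0, 1]; templates with a single distinct symbol map to 0 (assumed)."""
    counts = {}
    for s in template:
        counts[s] = counts.get(s, 0) + 1
    n, k = len(template), len(counts)
    if k < 2:
        return 0.0
    return -sum((c / n) * log(c / n) for c in counts.values()) / log(k)
```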
In (17), the last objective function indicates the average number of selected features in the current solution, as we need to reduce the number of features.
Algorithm 2 presents the pseudo-code of the evaluation procedure of a candidate solution x. First and foremost, a quantizer Q_c is created using the discretization scheme L_c and the feature selection vector p_c. An LM-WLCSS classifier can thus be trained on the training dataset. Although the objective function f_5 is completely independent of the classifier construction, an infeasible solution may be encountered due to the negativity of the rejection threshold ω_c, as stated in (19). Otherwise, the evaluation procedure continues, and the objective function f_3 follows from the elected class template T_c and the rejection threshold. As previously mentioned, the decision variable h_c must be locally investigated; its search range is bounded because a high amplitude of the coefficients can nullify the rejection threshold. For each coefficient value, the previously constructed LM-WLCSS classifier is not retrained. Only updating the SearchMax threshold, clearing the circular buffer (variable B_c), and resetting the matching score are necessary. The best value obtained for the objective function f_1 (i.e., the best obtained classifier performance) and its associated h_c are preserved, and the evaluated solution x and objective vector F(x) are updated accordingly.
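The local search over h_c described above can be sketched as a simple grid search; the callback decomposition (`set_threshold`, `evaluate_f1`) and the candidate grid are assumptions, since the text only states that the threshold is updated, the buffer cleared, and the score reset between trials.

```python
def tune_h(candidate_hs, evaluate_f1, set_threshold):
    """Hedged sketch of the local investigation of h_c.

    `candidate_hs` is an assumed grid of coefficient values. For each one,
    only the SearchMax threshold is updated (the classifier is NOT
    retrained); `set_threshold` also clears B_c and resets the matching
    score, and `evaluate_f1` measures the resulting F1. The best pair wins.
    """
    best_h, best_f1 = None, -1.0
    for h in candidate_hs:
        set_threshold(h)        # update threshold, clear B_c, reset score
        f1 = evaluate_f1()
        if f1 > best_f1:
            best_h, best_f1 = h, f1
    return best_h, best_f1
```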

Multi-Class Gesture Recognition System
Whenever a new sample x(t) is acquired, the required subset of the vector is transmitted to each corresponding trained LM-WLCSS classifier to be specifically quantized and instantaneously classified. Each binary decision, forming a decision vector d(t), is sent to a decision fusion module to eventually yield which gesture has been executed. Among all of the aggregation schemes for binarization techniques, we decided to deliberate on the final decision through a light-weight classifier, such as a neural network, decision tree, or logistic regression. Figure 2 illustrates the final recognition flow.
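As an illustration of the decision fusion step, the light-weight classifier below is a simple lookup table from decision vectors to gesture labels; it merely stands in for the neural network, decision tree, or logistic regression mentioned above and is not the classifier actually used in the experiments.

```python
from collections import Counter, defaultdict

class FusionTable:
    """Minimal stand-in for the light-weight fusion classifier: maps each
    observed binary decision vector d(t) to the gesture label that most
    often accompanied it in the training stream (e.g., Z_3)."""

    def fit(self, decision_vectors, labels):
        votes = defaultdict(Counter)
        for d, y in zip(decision_vectors, labels):
            votes[tuple(d)][y] += 1
        # Majority label per decision vector; global majority as fallback.
        self.table = {d: c.most_common(1)[0][0] for d, c in votes.items()}
        self.default = Counter(labels).most_common(1)[0][0]
        return self

    def predict(self, d):
        return self.table.get(tuple(d), self.default)
```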

Experiments
In this section, we describe the experimental framework. First, we present the Opportunity dataset [56] as a benchmark for gesture recognition and dimensionality reduction. This dataset, available on the UCI machine learning repository (https://archive.ics.uci.edu/ml/datasets/opportunity+activity+recognition, accessed on 15 September 2021), aims to provide a benchmark for human activity recognition algorithms or for specific stages of the activity recognition chain, such as dimensionality reduction, signal fusion, and classification. It includes multiple runs of a scripted two-part scenario performed by several subjects equipped with on-body sensors in a simulated studio flat, wherein numerous ambient and object sensors have been integrated. The raw sensor readings have 243 dimensions. The first part consists of activities of daily living, allowing a look at four abstraction levels of activity recognition. The second one, denominated 'drill run', focuses on repeated instances of daily gestures.

Benchmark Dataset
The different approaches used in the literature to report classification results on this particular benchmark are reviewed. Finally, we detail the key points of our experimental setup, such as the dataset partitioning required by our approach to avoid biases, general parameter settings, and performance metrics.

Experimental Setup
Three main ways have been adopted in the gesture recognition literature to report classification results on the Opportunity dataset. First, in [57,58], the proposed method was tested on the challenging task B2 [58], where recognition performance must be reported on the testing set composed of ADL4 and ADL5 for Subjects 2 and 3. According to the challenge, the authors are free to include any remaining subsets in the training set. Missing values, due to packet loss, have been replaced by linear interpolation. All on-body sensors have been exploited, resulting in an input space with 113 dimensions. Secondly, [58] also reported gesture recognition performances for each of the four subjects using the identical data preparation provided by the UCI repository. Although these datasets have 113 dimensions, the methods used for handling missing data may reduce this number. Chen et al. [59] conducted a similar experimentation, but all types of sensors were included, i.e., 243 dimensions. Finally, in [18], a five-fold cross-validation was performed on the 'drill run' subset of the Opportunity dataset using accelerometers on arms (in k-fold cross-validation, a dataset D is split into k mutually exclusive subsets, where the size of each fold is approximately equal; one partition D_t, with t ∈ {1, 2, . . . , k}, is used for testing the classifier performance, and the remainder of the dataset, i.e., D \ D_t, constitutes its training dataset; this process is repeated k times). Based on the same model validation technique, [19] evaluated the proposed methods on the 'drill run' of each subject using a five-fold cross-validation. The experiments only employed 17 3D-sensors, and raw signals were down-sampled. In that work and the aforementioned one, there is no mention of methods for handling missing data.
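The k-fold partitioning described parenthetically above can be sketched as:

```python
def k_fold_indices(n, k):
    """Plain k-fold partition: n sample indices split into k mutually
    exclusive folds of approximately equal size."""
    folds, start = [], 0
    for t in range(k):
        size = n // k + (1 if t < n % k else 0)  # spread the remainder
        folds.append(list(range(start, start + size)))
        start += size
    return folds
```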
In our proposed method, the whole training data stream must be quantized for each solution, since the selected dimensions and discretization scheme vary. Due to the enormous number of Euclidean distance searches induced and limited experiment time requirements, we favour smaller datasets. Hence, for the sake of comparison, we reproduced the experiments of Nguyen-Dinh et al. [19] but without down-sampling the raw signals. All 51 dimensions were scaled to unit size. We used the default method for handling missing values provided by the UCI repository. For each subject, Table 1 summarizes the number of repetitions (#inst) per gesture and their average length (avg) with standard deviation (SD). It follows that gestures have strong variability, especially 'CleanTable', 'DrinkfromCup', and 'ToggleSwitch', and the number of instances is inconstant. Additionally, this input dataset noticeably contains a very large portion of 'null class' samples (40%). In this paper, we performed a five-fold cross-validation. The proposed framework for building a multi-class gesture recognition system based on LM-WLCSS, however, requires the partitioning of each training dataset, Z = D \ D_t, into three mutually exclusive subsets, Z_1, Z_2, and Z_3, to avoid biased results. Z_1 represents the training dataset used for all the base-level classifiers and contains 70% of Z. The remaining data are equally split over Z_2 and Z_3. Recognition performance is maximized over the test set Z_2. Once each binary classifier has been trained, predictions on the stream Z_3 are obtained, transforming all incoming multi-modal samples into a succession of decision vectors. This newly created dataset allows us to resolve conflicts by training a light-weight classifier. Finally, the final performance of the system is assessed using the testing dataset D_t.
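The partitioning of a training fold Z into Z_1, Z_2, and Z_3 can be sketched as below; the contiguous split (which preserves the stream order used by the template matching) and the even division of the remaining 30% are our reading of the text.

```python
def split_training_stream(z_indices):
    """Partition a training fold Z into Z1 (70%, base-level classifiers),
    Z2 (~15%, recognition-performance maximization), and Z3 (~15%, fusion
    classifier training), following the proportions given above.
    A contiguous split is used to preserve the stream order (assumption)."""
    n = len(z_indices)
    n1 = int(0.70 * n)
    n2 = (n - n1) // 2
    return z_indices[:n1], z_indices[n1:n1 + n2], z_indices[n1 + n2:]
```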
For our method, the C-MOEA/DD parameters remain identical to the original paper [40]; hence, the penalty parameter in PBI is θ = 5, the neighborhood size is T = 20, and the probability used to select in the neighborhood is δ = 0.9. For the reproduction procedure, the crossover probability is p_c = 1.0, and the distribution index for the SBX operators is η_c = 30. As stated before, mutation of a decision variable of a solution may occur with an equal probability p_m = 1/6, and when this decision variable is a vector, each element also has an equal probability of being altered. The polynomial mutation distribution index was fixed at η_m = 20. In this problem, we fixed the population size at 210, and the stopping criterion is reached when the number of evaluations exceeds 100,000.

Evaluation Metrics
The effectiveness of the proposed many-objective formulation is evaluated from the two following perspectives:

1. Effectiveness: Work based on WarpingLCSS and its derivatives mainly uses the weighted F1-score, F_w, and its variant, F_w^NoNull, which excludes the null class, as primary evaluation metrics. F_w can be estimated as follows:

F_w = Σ_c 2 (N_c / N_total) × (precision_c × recall_c) / (precision_c + recall_c), (20)

where N_c and N_total are, respectively, the number of samples contained in class c and the total number of samples. Additionally, we considered Cohen's kappa. This accuracy measure, standardized to lie on a −1 to 1 scale, compares an observed accuracy Obs_Acc with an expected accuracy Exp_Acc, where 1 indicates perfect agreement, and values below or equal to 0 represent poor agreement. It is computed as follows:

kappa = (Obs_Acc − Exp_Acc) / (1 − Exp_Acc).

2. Reduction capabilities: Similar to Ramirez-Gallego et al. [60], the reduction in dimensionality is assessed using a reduction rate. For feature selection, it designates the amount of reduction in the feature set size (in percentage). For discretization, it denotes the number of generated discretization points.
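The two effectiveness metrics can be sketched directly from their definitions; the per-class input format is an assumption.

```python
def weighted_f1(per_class):
    """F_w as in (20): per_class maps class c -> (N_c, precision_c, recall_c)."""
    n_total = sum(n for n, _, _ in per_class.values())
    return sum(2 * (n / n_total) * p * r / (p + r)
               for n, p, r in per_class.values() if p + r > 0)

def cohen_kappa(obs_acc, exp_acc):
    """Cohen's kappa from observed and expected accuracy."""
    return (obs_acc - exp_acc) / (1 - exp_acc)
```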

Results and Discussion
The validation of our simultaneous feature selection, discretization, and parameter tuning for LM-WLCSS classifiers is carried out in this section. The results on recognition performance and dimensionality reduction effectiveness are presented and discussed. The computational experiments were performed on an Intel Core i7-4770k processor. On all four subjects of the Opportunity dataset, Table 2 shows a comparison between the best results provided by Nguyen-Dinh et al. [19], using their proposed classifier fusion framework with a sensor unit, and the classification performance obtained by MOFSD-GR Ameva and MOFSD-GR ur-CAIM. Our methods consistently achieve better F_w and F_w^NoNull scores than the baseline. Although the use of Ameva brings an average improvement of 6.25%, the F1 scores on Subjects 1 and 3 are close to the baseline. The current multi-class problem is decomposed using a one-vs.-all decomposition, i.e., there are m binary classifiers in charge of distinguishing one of the m classes of the problem. The learning datasets for the classifiers are thus imbalanced. As shown in Table 2, the choice of ur-CAIM corroborates the fact that this method is suitable for imbalanced datasets, since it improves the average F1 scores by over 11%.
Table 2. Average recognition performances on the Opportunity dataset for the gesture recognition task, either with or without the null class.
Figure 3 illustrates the feature reduction rates produced by MOFSD-GR Ameva and MOFSD-GR ur-CAIM across all 17 gestures of the Opportunity dataset. The following analyses are made.

1.
The ur-CAIM criterion consistently leads to a better reduction rate (close to 80% on average). Therefore, from a design point of view, the effectiveness of sensors, and their ideal placements, for recognizing a specific activity can be better identified.

2.
The Ameva criterion achieves a more stable standard deviation in the reduction rate across all subjects than the ur-CAIM criterion.

3.
Since MOFSD-GR Ameva achieves a better recognition rate than the baseline, its implied reduction capabilities are still acceptable (>40%).
Figures 3 and 4 depict the number of discretization points yielded by the two discretization strategies across all 17 gestures of the Opportunity dataset. From the results, the following assessment can be made.

1.
As intended by the nature of Ameva, MOFSD-GR Ameva yields a small number of cut points, close to the constraint (18) imposing that the template be made of at least three distinct discretization points. However, this advantage seems to limit the exploration capacity of C-MOEA/DD, since only half of the original features are discarded.

2.
In contrast, MOFSD-GR ur-CAIM tends to generate larger discretization schemes than MOFSD-GR Ameva. Since the ur-CAIM criterion aggregates two conflicting objectives (CAIM aims to generate a lower number of cut points, whereas the pair CAIR and CAIU advocates a larger number), compromises are made. Tables 3 and 4 present more detailed results. They recapitulate the average, µ, and standard deviation, SD, of the number of cut points (#dp) produced and features selected (#d) by MOFSD-GR Ameva and MOFSD-GR ur-CAIM, respectively. Please note that no substantive conclusions could be drawn from the intersections between the sets of features selected for (1) a particular subject, (2) a particular gesture, and (3) a particular gesture and fold, due to the one-vs.-all decomposition approach used for this multi-class problem.

Limitation of the Study
More experimental comparisons against other recent methods, or applications to different activity datasets such as the Nurse Care Activity Recognition Challenge [61], could be added to this paper to further demonstrate the effectiveness of the proposed algorithm. Moreover, other performance metrics could be investigated, such as the F-measure or the feature reduction rate. However, such metrics cannot determine the overall performance of a feature selection algorithm considering both feature selection and discretization. In such a case, other proposed metrics (e.g., score, Pareto optimality, and stability) can be employed for an improved analysis.
An optimal solution considers the constraints (both Equations (18) and (19) in our proposed method) and may therefore be a local solution for the given set of data and the problem formulated in the decision vector (11). Convergence of this solution toward a near-global optimum for minimization under the constraints given in Equations (12) to (19) still needs to be proven. Our approach could be compared with other recent algorithms, such as convolutional neural networks [37], fuzzy c-means [62], genetic algorithms [63], particle swarm optimisation [64], and artificial bee colony [28]. However, some difficulties arise before comparing and analysing the results: (1) near-optimal solutions for all algorithms represent a compromise and are difficult to demonstrate, and (2) simultaneous feature selection and discretization involves many objectives.

Conclusions and Future Works
In this paper, we proposed an evolutionary many-objective optimization approach for simultaneously dealing with feature selection, discretization, and classifier parameter tuning for a gesture recognition task. As an illustration, the proposed problem formulation was solved using C-MOEA/DD and an LM-WLCSS classifier. In addition, the discretization sub-problem was addressed using a variable-length structure and a variable-length crossover to overcome the need to specify the number of elements defining the discretization scheme in advance. Since LM-WLCSS is a binary classifier, the multi-class problem was decomposed using a one-vs.-all strategy, and recognition conflicts were resolved using a light-weight classifier. We conducted experiments on the Opportunity dataset, a real-world benchmark for gesture recognition algorithms. Moreover, a comparison between two discretization criteria, Ameva and ur-CAIM, as the discretization objective of our approach was made. The results indicate that our approach provides better classification performance (an 11% improvement) and stronger reduction capabilities than what is obtainable in the similar literature, which employs experimentally chosen parameters, k-means quantization, and hand-crafted sensor unit combinations [19].
In our future work, we plan to investigate search space reduction techniques, such as boundary points [27], and other discretization criteria, along with their decomposition when conflicting objective functions arise. Moreover, efforts will be made to test the approach more extensively, either with other datasets, other LCS-based classifiers, or deep learning approaches. A mathematical analysis using a dynamical system, such as a Markov chain, will be carried out to prove and explain the convergence of the proposed method toward an optimal solution. The backtracking variable length, WB_c, is not a major performance limiter in the learning process. In this sense, it would be interesting to see additional experiments showing the effects of several values of this variable on the recognition phase and, ideally, how it affects the NADX operator.
Our ultimate goal is to provide a new framework to efficiently and effortlessly tackle the multi-class gesture recognition problem.