In this section, we explain our data analysis approach, including feature calculation, feature selection, and classifier selection. We evaluate our approach on an individual basis (user-dependent) and on the dataset as a whole (user-independent). All validation is carried out with the leave-session-out scheme.

#### 5.1. Feature Extraction

We first used pyAudioAnalysis (an open-source Python library for audio signal analysis) [63] to investigate an initial set of features, such as: zero-crossing rate, energy, entropy of energy, spectral features (centroid, spread, entropy, flux), roll-off, Mel frequency cepstral coefficients, chroma vector, and chroma deviation; the library also offers fast plotting capabilities. For a more detailed analysis, we then switched to Tsfresh (Time Series Feature Extraction based on Scalable Hypothesis tests) version 0.16.0 [64], also a Python library, released under the MIT license.

We used the Tsfresh library to extract 754 time-series features per DMA input (11 in total), for a total of 8294 features. For feature selection, Tsfresh provides a feature selector based on the vector of p-values, where a smaller p-value means a higher probability of rejecting the null hypothesis. To select the threshold for the p-value, the library uses the Benjamini-Yekutieli (BY) procedure [65]. In summary, the BY procedure: (1) sorts the p-values from lowest to highest (step-up) and (2) selects a small group of them, where the boundary between the selected features is set by the condition ${P}_{\left(k\right)}\le \frac{k}{m\cdot c\left(m\right)}\alpha$; where ${P}_{\left(k\right)}$ is the k-th sorted p-value, k is the last p-value to be declared valid for a given $\alpha$ (rejecting the null hypotheses), m is the total number of hypotheses/features, and $c\left(m\right)$ is a constant defined as $c\left(m\right)=1$ when the features are independent or positively correlated, and as $c\left(m\right)={\sum}_{i=1}^{m}\frac{1}{i}$ when there is an arbitrary dependency (the selected case). This relationship can be visualized as a simple graph with the sorted p-values as the dependent variable ("y") and the rank $k=1,\dots,m$ as the independent variable ("x"), compared against a line through the origin with slope $\frac{\alpha}{m\cdot c\left(m\right)}$.
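The BY step-up selection described above can be sketched as follows. This is a minimal NumPy illustration written for this text, not the Tsfresh internals; the function name is ours.

```python
import numpy as np

def benjamini_yekutieli(p_values, alpha=0.05):
    """Select features via the Benjamini-Yekutieli step-up procedure
    (arbitrary-dependency case, c(m) = sum_{i=1}^{m} 1/i)."""
    p = np.asarray(p_values, dtype=float)
    m = p.size
    order = np.argsort(p)                      # sort p-values ascending
    c_m = np.sum(1.0 / np.arange(1, m + 1))    # harmonic correction term c(m)
    # BY boundary line evaluated at ranks k = 1..m
    thresholds = np.arange(1, m + 1) * alpha / (m * c_m)
    below = p[order] <= thresholds
    if not below.any():                        # no hypothesis rejected
        return np.zeros(m, dtype=bool)
    # step-up: k is the LARGEST rank whose sorted p-value is under the line
    k = np.max(np.nonzero(below)[0])
    selected = np.zeros(m, dtype=bool)
    selected[order[: k + 1]] = True            # keep the k smallest p-values
    return selected
```

The returned boolean mask marks the features whose null hypothesis is rejected at level $\alpha$.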

The signals from the different microphone array combinations in Figure 1 were used as input to the feature extractor (Tsfresh), followed by standardization ($mean=0$, unit variance) and feature reduction. The reduction was done by applying the Benjamini-Yekutieli technique per volunteer and then selecting the sixteen features most commonly retained across volunteers, presented in the list below. These sixteen features were then fed to a second round of extraction for each DMA (11 in total), giving a maximum number of extracted features equal to $DMAs\times 16=176$. These sixteen were extracted for both cases: user-dependent and user-independent tests.
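The "most commonly retained" step could look like the following sketch, assuming the per-volunteer BY-selected feature names are already available as lists; the function name and data layout are ours, not the paper's code.

```python
from collections import Counter

def top_common_features(per_volunteer_selected, n_keep=16):
    """Given the BY-selected feature names for each volunteer,
    keep the n_keep features that were selected most often."""
    counts = Counter()
    for names in per_volunteer_selected:
        counts.update(set(names))              # count each feature once per volunteer
    return [name for name, _ in counts.most_common(n_keep)]
```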

The sixteen retained features are:

- F1: $80\%$ quantile
- F2: $10\%$ quantile
- F3: Absolute FFT coefficient $\#94$
- F4: Absolute FFT coefficient $\#38$
- F5: Absolute FFT coefficient $\#20$
- F6: p-value of linear trend
- F7: Standard error of linear trend
- F8: Energy ratio by chunks (num-segments = 10, segment-focus = 1)
- F9: Energy ratio by chunks (num-segments = 10, segment-focus = 8)
- F10: Autocorrelation with lag = 2
- F11: c3 $=\mathbb{E}[{L}^{2}{\left(X\right)}^{2}\cdot L\left(X\right)\cdot X]$ with lag = 3
- F12: Count below mean
- F13: Minimum r-value of linear trend (chunk-length = 10)
- F14: Largest fixed point of dynamics (poly-order = 3, #quantiles = 30)
- F15: Ratio beyond r-sigma (r = 1.5)
- F16: Mean change quantiles with absolute difference (qH = 1.0, qL = 0.0)

Three of the most relevant features in the list above are connected with the quantile definition (features F1, F2, and F16). A quantile is the value below which a defined percentage of the data is expected to lie. For example, F1 implies that a crucial feature of our dataset is a distinct value below which 80% of the data lies, in simple words, an upper threshold. Next in number of appearances, we find the FFT (Fast Fourier Transform) coefficients (F3-F5) and the linear least-squares regression (linear trend, features F6, F7, and F13). Here, the linear regression assumes the signal to be uniformly sampled (true for our case). Among the linear trend characteristics, our focus is on the p-value with the null hypothesis "the slope is equal to zero", the correlation coefficient (r-value), and the standard error of the estimation (stderr).
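A rough sketch of these quantile and linear-trend features using NumPy and `scipy.stats.linregress`; the function name is ours, and Tsfresh's own implementations may differ in detail.

```python
import numpy as np
from scipy.stats import linregress

def quantile_and_trend_features(x):
    """Quantile thresholds (F1, F2) and linear-trend statistics (F6, F7):
    the signal is regressed against its sample index, which assumes
    uniform sampling as stated in the text."""
    x = np.asarray(x, dtype=float)
    trend = linregress(np.arange(x.size), x)   # null hypothesis: slope == 0
    return {
        "quantile_80": np.quantile(x, 0.80),   # 80% of the samples lie below this
        "quantile_10": np.quantile(x, 0.10),
        "trend_pvalue": trend.pvalue,
        "trend_stderr": trend.stderr,
    }
```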

Next is the energy ratio by chunks (F8 and F9). The procedure to extract this from our signal is as follows: first, the signal is divided into segments; second, the ratio is calculated as the sum of squares of the selected segment divided by the sum of squares of the entire signal. In our feature list, the signal was split into ten pieces and the ratio was calculated for pieces one and eight.
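The chunk procedure above can be sketched as follows; this minimal version assumes 0-based chunk indexing, which may differ from Tsfresh's convention.

```python
import numpy as np

def energy_ratio_by_chunks(x, num_segments=10, segment_focus=1):
    """Sum of squares of one chunk divided by the sum of squares of the
    whole signal; segment_focus indexes the chunk (0-based here)."""
    x = np.asarray(x, dtype=float)
    chunks = np.array_split(x, num_segments)   # split into near-equal segments
    return np.sum(chunks[segment_focus] ** 2) / np.sum(x ** 2)
```

By construction, the ratios of all segments sum to one.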

Furthermore, we have the autocorrelation with lag = 2 (F10), meaning the correlation between values two samples apart. In addition, with ${r}_{sigma}=r\times std\left(x\right)$ and $r=1.5$, feature F15 is the ratio of values that are more than ${r}_{sigma}$ away from the mean of the signal. A higher-order autocovariance calculation is the equation $C3=\frac{1}{n-2\,lag}{\sum}_{i=0}^{n-2\,lag}{x}_{i+2\cdot lag}^{2}\cdot {x}_{i+lag}\cdot {x}_{i}$ (F11), where $lag$ is the separation between samples; it is a measure of the non-linearity of the data [66].
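Minimal NumPy sketches of these three statistics, following the definitions above; these are our own formulations, not Tsfresh's code.

```python
import numpy as np

def autocorrelation(x, lag):
    """Correlation between values `lag` samples apart (F10 uses lag = 2)."""
    x = np.asarray(x, dtype=float)
    x_c = x - x.mean()
    return np.sum(x_c[: x.size - lag] * x_c[lag:]) / (x.size * x.var())

def ratio_beyond_r_sigma(x, r=1.5):
    """Fraction of samples farther than r * std(x) from the mean (F15)."""
    x = np.asarray(x, dtype=float)
    return np.mean(np.abs(x - x.mean()) > r * x.std())

def c3(x, lag=3):
    """Higher-order autocovariance (F11): mean of x_{i+2*lag}^2 * x_{i+lag} * x_i."""
    x = np.asarray(x, dtype=float)
    n = x.size
    return np.mean(x[2 * lag:] ** 2 * x[lag:n - lag] * x[: n - 2 * lag])
```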

As feature F14, we have the largest fixed point of dynamics. To understand the nature of complex systems, the field of stochastic modeling employs differential equations. Still, there are many theories to describe the dynamics. One of them is to consider the process as a Langevin process, governed by Equation (4) (for the first-order differential). A simple version would be to model the time series as a function of the state variable x and time t through ${D}^{\left(1\right)}\left(x\left(t\right)\right)$ (the deterministic part of the dynamics), $\sqrt{{D}^{\left(2\right)}}=constant$ (the stochastic force), and a Gaussian white noise factor $N$ ($mean=0$, $variance=1$). The gathered data reconstruct the Langevin equation without knowing the system dynamics. The term fixed point refers to points where the drift coefficient vanishes, ${D}^{\left(1\right)}\left({x}_{fixed\,point}\right)=0$, and its derivative is used to simplify the stability analysis of the data: a negative derivative means a stable fixed point and a positive one an unstable point. Another simplification is applied when setting ${D}^{\left(1\right)}\left(x\left(t\right)\right)$ as a polynomial whose coefficients come from the Friedrich procedure; for details on how the reconstruction is done, please refer to [67,68]. Our point is to explain the usefulness of this dynamic modeling for classifying our data. In conclusion, the largest fixed point of dynamics (F14) is the maximum value of ${x}_{fixed\,point}$ at which the drift coefficient is zero.
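A simplified sketch of how the largest fixed point could be estimated from data: a Friedrich-style polynomial is fitted to a binned finite-difference estimate of the drift, and its largest real root is returned. This is an illustration with names of our choosing, not Tsfresh's exact routine.

```python
import numpy as np

def max_fixed_point(x, poly_order=3, n_quantiles=30):
    """Estimate the drift D1(x) from the time series, fit a polynomial to it,
    and return the largest real root, i.e. the largest x* with D1(x*) = 0."""
    x = np.asarray(x, dtype=float)
    dx = np.diff(x)                                  # finite-difference drift samples
    edges = np.quantile(x[:-1], np.linspace(0, 1, n_quantiles + 1))
    centers, drift = [], []
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (x[:-1] >= lo) & (x[:-1] < hi)
        if mask.any():
            centers.append(x[:-1][mask].mean())      # bin centre in state space
            drift.append(dx[mask].mean())            # average drift in the bin
    coeffs = np.polyfit(centers, drift, poly_order)  # Friedrich-style polynomial drift
    roots = np.roots(coeffs)
    real = roots[np.isreal(roots)].real              # fixed points: D1(x*) = 0
    return real.max()
```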

The final feature is mean change quantiles (F16), a procedure where a range is limited by qH (upper quantile) and qL (lower quantile). Subsequently, within those boundaries, the mean of the absolute changes of the signal is computed. With qH = 1 and qL = 0, the mean change is computed over the entire signal.
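A minimal sketch of this computation; the function is our own, and Tsfresh's `change_quantiles` offers additional options beyond what is shown here.

```python
import numpy as np

def change_quantiles_mean_abs(x, q_low=0.0, q_high=1.0):
    """Mean absolute change of the signal, restricted to samples that lie
    inside the [q_low, q_high] quantile corridor (F16)."""
    x = np.asarray(x, dtype=float)
    lo, hi = np.quantile(x, [q_low, q_high])
    inside = (x >= lo) & (x <= hi)
    # keep only changes whose two endpoints both lie inside the corridor
    valid = inside[:-1] & inside[1:]
    changes = np.abs(np.diff(x))[valid]
    return changes.mean() if changes.size else 0.0
```

With q_low = 0 and q_high = 1, the corridor covers the whole signal, matching the F16 setting in the list.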

In summary, we have extracted a total of 16 features per DMA pair (11 pairs, making it 176 features).

#### 5.2. Classifier Selection

With the 176 selected features, we proceeded to find the best classifier architecture to map them onto the facial actions. In [48], there is evidence that the SVM (Support Vector Machine) is a good option, in particular for avoiding overfitting. Others [69,70] have also achieved excellent results by using SVMs with mechanomyography signals. In addition to the SVM option, we also decided to experiment with the standard Matlab^{®} classifiers.

We retained 33% of the training set as a hold-out for classifier fine-tuning and started by testing the default settings of KNN (K-nearest neighbors), SVM, and ensemble classifiers (Bootstrap Aggregation (Bagging) and Subspace). The best-performing candidates were then fine-tuned to obtain the optimal hyperparameters. The performance metric was "accuracy", defined as $\frac{TP+TN}{TP+TN+FP+FN}$, where TP = true positives, TN = true negatives, FP = false positives, and FN = false negatives.
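In scikit-learn terms, the hold-out comparison could be sketched as follows. The data here are a synthetic stand-in with 176 features; the study itself ran this stage with Matlab's classifiers, so this is only an illustration of the procedure.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# Stand-in data with 176 features, as in the text; the real features come from Tsfresh.
X, y = make_classification(n_samples=400, n_features=176, random_state=0)
# 33% hold-out from the training data for comparing default classifier settings
X_tr, X_ho, y_tr, y_ho = train_test_split(X, y, test_size=0.33, random_state=0)

candidates = {
    "KNN": KNeighborsClassifier(),
    "SVM": SVC(),
    "Bagging": BaggingClassifier(random_state=0),
}
# Accuracy on the hold-out set for each default classifier
scores = {name: accuracy_score(y_ho, clf.fit(X_tr, y_tr).predict(X_ho))
          for name, clf in candidates.items()}
```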

We used grid search for hyperparameter optimization [71]. Grid search is an exhaustive search over a defined subset of the hyperparameter space. In the SVM case, the kernel parameter can be used to assess whether our data are linearly or non-linearly separable, while overfitting is controlled by the error penalty parameter (C). Using grid search, we tested two kernel types, one linear and the other polynomial. For the linear kernel, we searched over regularization parameter values C = [0.001, 0.01, 1, 10]; for the polynomial kernel, C = [7, 8, 9, 10, 12, 15, 20], degree options = [1, 2, 3], and $\gamma$ set to [$4/{n}_{features}$, $16/{n}_{features}$, $1/{n}_{features}$, $1/(4\cdot {n}_{features})$, $1/(16\cdot {n}_{features})$], where ${n}_{features}=Top16_{Features}\times DMA_{Combinations}$. In the user-independent case, the Gaussian kernel was added with C = [3, 5, 6, 7, 8, 9]. The grid search selection was validated with 10-fold cross-validation, and the performance metric was "recall", defined as $\frac{TP}{TP+FN}$, where TP = true positives and FN = false negatives.
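The search could be expressed with scikit-learn's GridSearchCV roughly as follows. The data are a synthetic stand-in; the per-kernel value lists follow the text, while the remaining details (sample count, dataset) are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

n_features = 176                      # Top-16 features x 11 DMA combinations
X, y = make_classification(n_samples=200, n_features=n_features, random_state=0)

# Grid mirroring the user-dependent search described in the text
param_grid = [
    {"kernel": ["linear"], "C": [0.001, 0.01, 1, 10]},
    {"kernel": ["poly"], "C": [7, 8, 9, 10, 12, 15, 20],
     "degree": [1, 2, 3],
     "gamma": [4 / n_features, 16 / n_features, 1 / n_features,
               1 / (4 * n_features), 1 / (16 * n_features)]},
]
# 10-fold cross-validation with recall as the selection metric, as in the text
search = GridSearchCV(SVC(), param_grid, scoring="recall", cv=10)
search.fit(X, y)
best = search.best_params_
```

For the user-independent case, an extra dictionary with `"kernel": ["rbf"]` and C = [3, 5, 6, 7, 8, 9] would be appended to `param_grid`.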

Accordingly, we focused on the SVM in Python using the scikit-learn library version 0.23.1 and compared the results with the standard Matlab classifiers as a baseline.