Data Classification Methodology for Electronic Noses Using Uniform Manifold Approximation and Projection and Extreme Learning Machine

Abstract: The classification and use of robust methodologies in sensor array applications of electronic noses (ENs) remain an open problem. Among the several steps used in the developed methodologies, data preprocessing improves the classification accuracy of this type of sensor. Data preprocessing methods, such as data transformation and data reduction, enable the treatment of data with anomalies, such as outliers and features that do not provide quality information; in addition, they reduce the dimensionality of the data, thereby facilitating the tasks of a machine learning classifier. To help solve this problem, in this study, a machine learning methodology is introduced to improve signal processing and develop methodologies for classification when an EN is used. The proposed methodology involves a normalization stage to scale the data from the sensors, using both the well-known min-max approach and the more recent mean-centered unitary group scaling (MCUGS). Next, a manifold learning algorithm for data reduction is applied using uniform manifold approximation and projection (UMAP). The dimensionality of the data at the input of the classification machine is reduced, and an extreme learning machine (ELM) is used as the machine learning classifier. To validate the EN classification methodology, three EN datasets were used. The first dataset was composed of 3600 measurements of 6 volatile organic compounds performed by employing 16 metal-oxide gas sensors. The second dataset was composed of 235 measurements of 3 different qualities of wine, namely, high, average, and low, as evaluated by using an EN sensor array composed of 6 different sensors. The third dataset was composed of 309 measurements of 3 different gases obtained by using an EN sensor array of 2 sensors. A 5-fold cross-validation approach was used to evaluate the proposed methodology. A test set consisting of 25% of the data was used to validate the methodology with unseen data.
The results showed a fully correct average classification accuracy of 1 when the MCUGS, UMAP, and ELM methods were used. Finally, the effect of varying the number of target dimensions in the data reduction was assessed, and the dimensionality yielding the highest average classification accuracy was determined.


Introduction
An electronic nose (EN), or e-nose, is an electronic device that is used as an artificial olfactory system. In general terms, it is used to identify gases having a wide spectrum of odor patterns by utilizing their interaction with a sensor array. As a system, it takes advantage of its various properties for classification tasks and is composed of an array of sensors, a data acquisition system, and a pattern recognition approach [1]. These components, including both software and hardware, require odor capturing by a sensor array, signal conditioning, signal processing, data gathering, data preprocessing, and pattern recognition techniques to perform odor classification [2]. Different approaches to model the response of the chemical sensors in electronic noses have been developed [3,4]. Firstly, deterministic models have been developed to reproduce the sensing mechanism in metal-oxide gas sensors based on the adsorption and desorption processes carried out in the sensing material [5][6][7]. Some physicochemical parameters have been used to improve the selectivity of a metal-oxide gas sensor. For example, a time-domain characterization technique was developed in [8], seeking to improve the discrimination ability of an electronic nose by fitting a fully analytical model of the transient electric resistance response of the sensor. Secondly, numerical approaches have been developed to simulate the chemical reaction in the sensing material upon contact with the analyte studied [9,10]. Finally, stochastic models [11,12] have also been developed to reproduce the sensor response and have been linked to the process of classifying substances using electronic noses.
Pattern recognition in an EN enables the treatment of signals based on the identification of features that can be used in classification tasks. The signal processing methodology of an EN is composed of different stages, including the application of dimensionality reduction techniques, such as feature extraction or feature selection, to identify the best information, suppress outliers and noises in the collected signals, and decrease the false detection/classification rates [13]. Although a dimensionality reduction technique can filter redundant information by denoising, some useful information in the original data may be lost, thus affecting the final results of the classification. In addition, the classification method can automatically remove the useless components during the learning process of the sample [14].
Although several studies have been conducted with this aim, as described in the following section, this remains an open research topic because of the option to include new techniques to improve the classification process. To solve this problem, a classification methodology for classifying data from an EN-type sensor is proposed in this study. This methodology consists of several stages developed for EN-type sensor arrays using machine learning and signal processing techniques. Uniform manifold approximation and projection (UMAP) is used as a type of manifold learning to achieve dimensionality reduction, and an extreme learning machine (ELM) is used as a classifier. The methodology is evaluated by using three different EN datasets, and it achieves excellent results.
The overall contributions of this study can be summarized as follows:

1. A methodology for classifying data obtained by using a sensor array was developed for ENs. A crucial part of this methodology is data preprocessing, including a data arrangement using an unfolding procedure, min-max or mean-centered unitary group scaling (MCUGS) for dealing with the different magnitudes of the sensors (data normalization), and data reduction by applying the supervised UMAP manifold learning algorithm.

2. The data classification process in the methodology is performed through an ELM classifier, which is a fast and accurate method for discriminating the different classes. A two-stage validation/verification evaluation process is carried out. First, in the training stage, 5-fold cross-validation (CV) is used on 75% of the data to prevent overfitting; second, a testing set formed with the remaining 25% of the data is used to verify the performance of the methodology with unseen data.

3. The methodology was validated using three different datasets for classification: (i) Six distinct gases (dataset #1), (ii) three different qualities of wine (dataset #2), and (iii) three classes of gases (dataset #3). The average classification accuracy was used as a performance measure of the proposed process. The high average accuracy achieved in both the training and testing sets indicated the effectiveness of the proposed methodology.

4. The methodology can be used in imbalanced multi-class classification problems, as evidenced on datasets #2 and #3, which exhibited an imbalanced number of samples per class.
The remainder of this paper is organized as follows. Section 2 presents a brief review of the related studies on the various uses of ENs and the available methods established in the literature for data analysis and processing of this type of sensor. Section 3 describes the methodology needed to improve the classification of data derived from EN sensors, along with the classification methodology applied and its step-by-step processing. Next, in Section 4, three EN datasets used to validate the classification methodology are described and a brief overview of the data acquisition and transformation methods used in the proposed EN data classification methodology is presented. Next, the data reduction stage using UMAP is described in Section 5. The data classification stage using the ELM method is then detailed in Section 6. Section 7 outlines the validation/verification procedure and describes the training/test data split. The classification performance measures are then defined. In Section 8, the experiment results and a discussion are provided. In addition, the main results after the application of the developed methodology for the three datasets are presented, including the normalization, dimensionality reduction, confusion matrix, average classification performance metrics, and tuning parameters for each method. Finally, some concluding remarks are provided in Section 9.

Related Work
Data preprocessing can improve the classification of different analytes in data from several sensors, including the sensors used by ENs. Different linear and nonlinear methods can be used in the data reduction stage. Although principal component analysis (PCA) [15] is used as a linear method, in most cases, data have nonlinear characteristics that cannot be identified through a linear method. For this reason, different nonlinear data reduction methods, which are members of the group of manifold learning algorithms [16,17], can be used to deal with the data reduction stage. In recent years, different studies related to the use of manifold learning algorithms in ENs have been conducted. For instance, the modified unsupervised discriminant projection [18] and Laplacian eigenmaps (LEs) [19] were used as data reduction methods for the rapid determination of the freshness of Chinese mitten crab; similarly, the supervised locality-preserving projection was used to process the feature matrix before applying it to the classifier to improve the performance of an EN [20].
Different perspectives in the literature have been developed to deal with classification problems in ENs [21]. One such perspective is the machine learning approach, which can be divided into supervised, semi-supervised, and unsupervised methods. It delivers qualitative answers by which different types of analytes are identified [13]. Different supervised learning algorithms have been used to solve classification tasks in ENs, including support vector machines (SVMs) [22], artificial neural networks [23], and a new kernel discriminant analysis [14]. In addition, the k-nearest neighbor (KNN) algorithm with one neighbor and the Euclidean distance has proven to be efficient in solving multi-class problems and obtaining high classification rates [24].
There are a significant number of datasets available in the literature related to the classification of EN signals. One such dataset was used in the study by Vergara et al. (2012) [25], who measured six volatile organic compounds during a 3-year period under strongly controlled operating conditions and using a series of 16 metal-oxide gas sensors [26]. In that study, the authors used a classifier ensemble as a supervised learning method to treat the sensor drift problem in ENs.
Leon-Medina et al. [21] developed an EN machine learning classification methodology. In their study, four nonlinear data reduction algorithms were compared in an EN classification task: A kernel PCA, LEs, locally linear embedding, and Isomap. A KNN algorithm was used as the classification model. The results when applying these algorithms reached a classification accuracy of 98.33% after performing a holdout CV.
In 2019, Vallejo et al. [27] reviewed some models for classification and regression tasks usually determined by human perception, such as smell, and they called these measurement techniques soft metrology. In their study, the classification process in soft metrology is composed of several stages such as a database construction and preprocessing, construction of an effective representation space, model choice training and validation, and model maintenance.
Different preprocessing techniques for data reduction in big data have been developed; for example, wavelet packet decomposition (WPD) is used in [28] for selecting features that have maximal separability according to the Fisher distance criterion. The WPD method was compared with fast Fourier transform and autoregressive (AR) models for data transformation. The best classification accuracy was obtained using the WPD method, reaching a value of 90.8%.
Several studies on the use of machine learning in sensor arrays such as ENs have recently been conducted. Some drift problems in ENs are normally found owing to the deterioration of the sensors over time. A cross-domain discriminative subspace learning (CDSL) method was developed in [29]. This domain adaptation-based odor recognition method allows solving recognition tasks in systems of master and slave ENs. In this way, two different configurations of the source and target domains were solved. The maximum average accuracies reached by the CDSL method for the two configurations of the source and target domains were 78.96% and 80.17%, respectively.
In 2014, Zhang and Zhang [30] developed the domain adaptation ELM (DAELM) method to solve the drift problem. This method learns a robust classifier by leveraging a limited number of labeled data from a target domain for drift compensation in ENs. The developed DAELM method was validated using a dataset of an EN sensor array captured during a 36-month period, and two different settings were validated using this method, reaching average accuracies of 91.8% under both settings. In 2020, Kumar and Ghosh [31] developed a system identification approach for data transformation and reduction in sensor arrays. The equivalent circuit parameter method was compared against a discrete wavelet transform and the neighborhood component analysis, which achieved a significant reduction in the number of features used in the machine learning tasks. The best average prediction accuracy was reached by the neural network regression method, with a value of 98%. A novel bio-inspired neural network [32] processes the raw data of an EN without any signal preprocessing, feature selection, or reduction. This significantly simplifies the data processing procedure in ENs. The results showed a classification accuracy of 100% in the task of recognizing seven classes of Chinese liquors. It is worth mentioning that although a classification accuracy of 100% was reached, it was tested only on the signals acquired with the electronic nose developed in [32]; in addition, the proposed olfactory neural network presents difficulties with respect to parameter tuning.
A portable EN instrument was developed in [33] for qualitative discrimination among different gas samples. A multivariate data analysis of this instrument was conducted using a PCA. Good results were obtained in the qualitative and quantitative tests conducted using this instrument for three industrial gases: Acetone, chloroform, and methanol. Krutzler et al. [34] showed that the determination of gas concentration is improved through the implementation of sensors with smaller resistance variations when the reference library from one sensor is also used for other sensor elements. In 2012, Brudzewski et al. [35] used a nonlinear classifier in the form of a Gaussian kernel SVM to classify EN data. A total of 11 classes of coffee brand mixtures were correctly classified with an average error of 0.21% using a differential nose system containing two arrays of semiconductor sensors composed of pairs of the same sensors. An incremental-learning fuzzy model for the classification of black tea using an EN is described in Tudu et al. [36]. With the use of the approach they developed, a universal computational model can evolve incrementally to automatically include the newly presented patterns in the training dataset without affecting the class integrity of the previously trained system. After a 10-fold CV procedure, an average classification rate of 80% with a standard deviation of 4.969% was obtained.
In 2014, Zhang and Tian [37] developed a multilayer perceptron (MLP) based on a multiple-input, single-output approach for an estimation of the simultaneous concentration of multiple types of chemicals in an EN. The lowest average mean square error of prediction was 3.33%, obtained by a particle swarm optimization method based on the bacterial chemotaxis backpropagation (BP) method. A study describing the extension of neuromorphic methods for artificial olfaction is presented in [38]. The neuromorphic engineering aims to overcome data processing challenges and reduce the output latency. The neuromorphic approach was applied in [39] for gas recognition of three target gases, namely, ethanol, methane, and carbon monoxide, with a 100% classification accuracy. A hybrid approach using both convolutional and recurrent neural networks (CRNNs), based on the long short-term memory module, was proposed in [40]. A deep learning method is well-suited for extracting the valuable transient feature contained at the very beginning of the response curve. The reported accuracy dramatically outperformed the previous algorithms, including gradient tree boosting, random forest (RF), SVM, KNN, and linear discriminant analysis. The CRNN approach reached an average classification accuracy of 98.28% when 4-s signals were used.
In 2019, Liu et al. [41] proposed a multi-task model based on the BP neural network (MBPNN) for EN classification problems. The approach they developed was compared against RF and SVM. Their study showed that the EN is effective for the classification and evaluation of organic green tea. The MBPNN method was satisfactorily used in two datasets of ENs for classification tasks, reaching accuracies of 99.83% and 99.67%. A multi-task learning-long-short-term memory (MLSTM) recurrent network was proposed in [42] for gas classification and concentration estimation (regression) in ENs. The best classification accuracies obtained with the MLSTM method for each of the three datasets presented were 99.99%, 95.17%, and 100%. Cheng et al. [43] proposed a solution to the problem of dynamically growing odor datasets while both the training samples and number of classes increase over time. The solution uses a deep nearest class mean (DNCM) model based on a deep learning framework and the nearest class mean method. The results showed that the DNCM model is extremely efficient for incremental odor classification, particularly for new classes with only a few training examples. The best average recognition accuracy was 78.09%, obtained using the DNCM method.
In 2020, an AR process associated with an observation of zero-mean Gaussian noise was used to solve drift issues in EN-type sensor arrays [44]. In this approach, each sensor response passes through a separate Kalman filter, and a regression technique is used to predict the sensor response. In 2018, Zhang et al. [45] developed a subspace learning methodology for classifying data originating from a sensor array. A sliding window-based smooth filter was employed for denoising and feature selection, a local discriminant preservation projection approach was applied for dimensionality reduction, and a kernel ELM was used as a classifier. The best results showed a classification accuracy of 98.22% on a training set using 5-fold CV.
In summary, some limitations of the existing methods in the literature that motivated the methodology proposed in this study are the low flexibility of the models, the need to rebuild the dimensionality reduction model each time new data are obtained, and the inability to perform a quick classification. To address these limitations, the methodology for EN signal classification developed in this study includes: (i) Verification on three different EN datasets, (ii) the use of the supervised variant of the UMAP method for data reduction, which allows mapping new data to a low-dimensional space using the model saved during training, and (iii) the use of the ELM classifier algorithm, with its extremely fast learning speed and good generalization performance.

Methodology Description
This study presents a methodology for classifying signals acquired by EN systems that can yield high classification rates. The classifier is constructed using four stages (see Figure 1). The first stage is data acquisition, where the EN (as a multi-sensor system) collects data in a three-dimensional matrix: The rows and columns of the matrix contain the data acquired during each experiment/test per time instant for a specific sensor, whereas the third dimension stores the information of each sensor. In the second stage, these data are first properly transformed to account for the fact that the data captured by each sensor, or their associated extracted features, can present significant differences in their magnitudes. Owing to the different nature of the considered data, two different normalization approaches have been considered: The min-max normalization and MCUGS. Later, with the use of an unfolding process, the three-dimensional matrix containing the data is stored as a two-dimensional matrix. The third stage concerns data reduction, where the UMAP method is applied to the high-dimensional EN-transformed data to discard features that are irrelevant to the classification problem. After data reduction, the low-dimensional data serve as the input for training a machine learning classifier. In this case, the ELM algorithm is used owing to its capabilities, such as better generalization ability, robustness, controllability, and fast learning. Once the classifier is set, given a new experimental sample, the classification algorithm can predict its class. However, before the classifier is used to predict new class samples, it is crucial to evaluate its performance. This evaluation is a challenging task because the available data must be used both to define the classifier and to estimate its performance. Here, a two-stage validation/verification evaluation is considered (see Figure 2).
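To make the role of the ELM concrete, the following minimal sketch (an illustrative implementation, not the authors' code; the class and parameter names are ours) shows why training is fast: the hidden-layer weights are drawn at random and fixed, so only the output weights must be computed, via a single least-squares solve.

```python
import numpy as np

class ELMClassifier:
    """Minimal single-hidden-layer extreme learning machine (sketch)."""

    def __init__(self, n_hidden=100, seed=0):
        self.n_hidden = n_hidden
        self.rng = np.random.default_rng(seed)

    def fit(self, X, y):
        self.classes_ = np.unique(y)
        # Random input weights and biases: generated once, never trained.
        self.W = self.rng.standard_normal((X.shape[1], self.n_hidden))
        self.b = self.rng.standard_normal(self.n_hidden)
        H = np.tanh(X @ self.W + self.b)                 # hidden-layer outputs
        T = (y[:, None] == self.classes_).astype(float)  # one-hot targets
        # Output weights: least-squares solution via the pseudo-inverse.
        self.beta = np.linalg.pinv(H) @ T
        return self

    def predict(self, X):
        H = np.tanh(X @ self.W + self.b)
        return self.classes_[np.argmax(H @ self.beta, axis=1)]
```

Because only `beta` is fitted, training reduces to one matrix pseudo-inversion, which is the source of the fast learning speed mentioned above.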
During the validation step, a standard 5-fold CV is performed by using a training set containing 75% of the total data (see [15]). This allows for the suitable tuning of the number of extracted features of the UMAP method and of the hyperparameters of the ELM classifier using a GridSearch algorithm, while providing standard estimates of the overall performance of the model. However, although validation performance measures are usually used in the literature to assess the quality of a classification strategy, an extra verification step is added to avoid overfitting. Once the UMAP model and the ELM classifier are defined, the strategy is tested using the remaining 25% of the data.

The key point here is that these test data are completely independent of the training step because they are involved neither in the supervised feature extraction step nor in the tuning of the classifier parameters. Consequently, two different performance measures are obtained, one for the training set (associated with the final hyperparameters after applying the CV GridSearch) and one for the test set.
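The two-stage evaluation above can be sketched with scikit-learn. As stated assumptions, PCA stands in for the supervised UMAP reducer and a KNN classifier for the ELM (so the sketch runs with scikit-learn alone), and the iris data stand in for the unfolded, normalized EN matrix:

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline

X, y = load_iris(return_X_y=True)  # stand-in for the EN feature matrix

# Verification split: 25% of the data is held out and never touched
# during model selection.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

# Validation: 5-fold CV grid search over the reducer dimension and the
# classifier hyperparameters, using only the 75% training portion.
pipe = Pipeline([("reduce", PCA()), ("clf", KNeighborsClassifier())])
grid = {"reduce__n_components": [2, 3], "clf__n_neighbors": [1, 3, 5]}
search = GridSearchCV(pipe, grid, cv=5, scoring="accuracy")
search.fit(X_train, y_train)

train_acc = search.best_score_           # CV estimate on the training set
test_acc = search.score(X_test, y_test)  # verification on unseen data
```

The fitted pipeline can then map new samples to the low-dimensional space and classify them without refitting the reducer, which mirrors the reuse of the saved UMAP model described above.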

Datasets Description, Data Acquisition, and Transformation
The contributions of this study include the approaches to data preprocessing: How the data are arranged, scaled, transformed, and reduced. In this section, data preprocessing is presented in adequate detail, along with the three different datasets used for the validation of the proposed approach. More precisely, the preprocessing is divided into three phases: Data acquisition, data transformation (feature extraction and min-max normalization for dataset #1, and MCUGS for datasets #2 and #3), and data reduction (Appendix A). Figure 1 illustrates these three phases.

Dataset #1
The first dataset was created by Vergara et al. [25] and is composed of 3600 measurements of six volatile organic compounds (acetaldehyde, ethanol, toluene, ammonia, ethylene, and acetone) under strongly controlled operating conditions recorded from 16 metal-oxide semiconductor gas sensors manufactured by Figaro, Inc. (figarosensor.com, accessed on 10 October 2021).

Data Acquisition
The EN employed to collect dataset #1 used a test chamber linked to a computer-controlled continuous flow system [21]. The total flow rate through the detection chamber was set to 200 mL/min. Synthetic dry air was used as the background for all measurements to keep the humidity level constant at 10% relative humidity (measured at 25 ± 1 °C). The operating temperature of the sensors was set to 400 °C. The dataset was obtained under laboratory conditions with a controlled atmosphere, which favors those materials of the sensors that interact with the gas in a reversible manner [25].
The experiments were carried out using an EN for measuring the behavior of 6 analytes during a 36-month period. The current study considered the 10th batch of the measurements, corresponding to the 36th month. More precisely, batch number 10 corresponded to the last batch of recordings in the study by Vergara et al. [25]. This batch contained 3600 measurements from the analytes captured during month 36 of the authors' study and was collected 5 months after batch 9. During this 5-month period between the collection of measurements from batch 9 and batch 10, the sensors were turned off and their response capacity was affected owing to a lack of operating temperature [46], making the data collected during the 10th batch extremely relevant for validating the classification strategies. As previously stated, there were a total of 3600 experiments, divided into six classes. Table 1 describes the number of samples per class (gas) in this dataset. The duration, T, of the experiments was not constant, spanning at least 300 s (comprising three phases of measurement: Baseline measurements injecting only pure air, test gas measurements injecting the gas, and a recovery phase) under a sampling frequency of 100 Hz. Therefore, during each experiment, at least 300 s × 100 Hz = 30,000 resistance measurements per sensor were acquired. These collected data were arranged into a three-dimensional matrix with a size of n × M × N, where n = 3600 indicates the total number of experiments, M > 30,000 represents the number of time measures, and N = 16 shows the number of sensors. For the data transformation, the sensor measures were grouped into n × M matrices, X 1 raw , X 2 raw , . . . , X N raw (see Figure 3).
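The grouping of the three-dimensional matrix into per-sensor matrices can be sketched as follows (toy dimensions stand in for the real ones, n = 3600, M > 30,000, N = 16):

```python
import numpy as np

n, M, N = 4, 10, 3  # toy sizes standing in for the real dimensions
data = np.arange(n * M * N, dtype=float).reshape(n, M, N)

# One n-by-M matrix per sensor: row i holds the time profile of
# sensor k during experiment i.
X_raw = [data[:, :, k] for k in range(N)]
```

Each `X_raw[k]` then plays the role of the matrix X k raw used in the data transformation stage.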

Data Transformation
Feature Extraction: Feature extraction usually plays an important role in data preprocessing for chemo-sensory applications, transforming the raw sensor responses while preserving the most meaningful portion of the information contained in the original sensor signal. For this dataset, two distinct types of features were considered: Two features describing the steady-state response of the sensor and six features describing its transient dynamics. Therefore, from the M > 30,000 resistance measurements per sensor stored in the columns of matrix X^k_raw, k = 1, . . . , N, only F = 8 features were extracted. Specifically, for a given experiment i = 1, . . . , n and sensor k = 1, . . . , N, let r^k_i = row_i(X^k_raw) ∈ R^M be the ith row of matrix X^k_raw, collecting the whole time profile of the resistance measures for the given sensor associated with the given experiment. Note that the jth component of this vector, r^k_{ij}, j = 1, . . . , M, corresponds to the resistance acquired at time t = j/100 ∈ [0, T]. Therefore, X^k_raw = (r^k_{ij}) is the matrix having the raw resistance value r^k_{ij} as its (i, j)th entry. The two features corresponding to the steady-state response of the sensor are the difference between the maximum and minimum/baseline resistance measurements and its normalized version,

∆R^k_i = max(r^k_i) − min(r^k_i)   and   ∆R^k_i / min(r^k_i),

where the min(·) and max(·) functions return the minimum and maximum values of a vector, respectively. The second set of six features, reflecting the sensor dynamics of the increasing/decaying transient portions of the sensor response, is computed by first introducing the exponential moving average (EMA) of the signal. Given a smoothing factor α ∈ [0, 1], the EMA of the sensor resistance with respect to the smoothing factor α is a new vector ema_α(r^k_i) ∈ R^M defined as ema_α(r^k_i)_1 = 0 and, for l = 2, . . . , M, as:

ema_α(r^k_i)_l = (1 − α) · ema_α(r^k_i)_{l−1} + α · (r^k_{i,l} − r^k_{i,l−1}),

where ema_α(r^k_i)_l is the lth component of the EMA signal.
The maximum and minimum values of the EMA signal, M^k_{i,α} = max(ema_α(r^k_i)) and m^k_{i,α} = min(ema_α(r^k_i)), respectively, are characteristic features of the rising and decaying portions of the sensor response. Therefore, the second set of six features is obtained by computing the minimum and maximum values of the EMA signals for α = 0.001, 0.01, and 0.1.
Thus, each vector r^k_i ∈ R^M is mapped into a vector r̂^k_i ∈ R^8:

r̂^k_i = (∆R^k_i, ∆R^k_i/min(r^k_i), M^k_{i,0.001}, m^k_{i,0.001}, M^k_{i,0.01}, m^k_{i,0.01}, M^k_{i,0.1}, m^k_{i,0.1}).

By computing the abovementioned F = 8 features for each of the n experiments and each of the N = 16 sensors in the prerecorded time series, we can map the n × M matrices X^1_raw, X^2_raw, . . . , X^N_raw to the n × F matrices X^1, X^2, . . . , X^N, where row_i(X^k) = r̂^k_i. Figure 4 illustrates a reference time profile of the sensor resistance where the three phases (baseline, test gas, and recovery) can be clearly distinguished, along with the associated EMA signals and extracted features.

Data Normalization: Both because the nature of the eight features is extremely diverse (steady-state raw and normalized, and dynamic) and because different types of sensors are used to acquire the signals, significant differences are appreciated in the magnitudes of the data. To make the data comparable, we apply a data normalization procedure. For each sensor k = 1, . . . , N and feature j = 1, . . . , F, let

min^k_j = min(col_j(X^k))   and   max^k_j = max(col_j(X^k))

be the minimum and maximum values of feature j computed by sensor k, respectively, where X^k = (x^k_{ij}), i.e., x^k_{ij} is the (i, j)th entry of matrix X^k and col_j(X^k) is the jth column of matrix X^k. The features x^k_{ij} in these matrices are scaled to lie between m = −1 and M = 1 by applying the min-max normalization:

x̂^k_{ij} = m + (M − m) · (x^k_{ij} − min^k_j) / (max^k_j − min^k_j) = 2 · (x^k_{ij} − min^k_j) / (max^k_j − min^k_j) − 1.

Therefore, the elements x^k_{ij} of matrix X^k, k = 1, . . . , N, are scaled to create a new matrix X̂^k = (x̂^k_{ij}). For simplicity, the min-max normalized matrices X̂^k are renamed simply as X^k.
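Under the EMA definition above, the computation of the F = 8 features for one sensor time profile can be sketched as follows (the function names are illustrative, not the authors' code):

```python
import numpy as np

def ema(r, alpha):
    # ema_a(r)_1 = 0; ema_a(r)_l = (1 - a) * ema_a(r)_{l-1} + a * (r_l - r_{l-1})
    y = np.zeros_like(r, dtype=float)
    for l in range(1, len(r)):
        y[l] = (1 - alpha) * y[l - 1] + alpha * (r[l] - r[l - 1])
    return y

def extract_features(r):
    """Map one resistance profile r (length M) to the F = 8 features."""
    dR = r.max() - r.min()
    feats = [dR, dR / r.min()]        # steady-state features
    for alpha in (0.001, 0.01, 0.1):  # dynamic (transient) features
        e = ema(r, alpha)
        feats += [e.max(), e.min()]
    return np.array(feats)

# Synthetic baseline/rise/decay profile, for illustration only.
r = np.concatenate([np.ones(50), np.linspace(1.0, 5.0, 50),
                    np.linspace(5.0, 1.0, 50)])
features = extract_features(r)  # 2 steady-state + 6 EMA-based values
```

Applying `extract_features` to every row of every X k raw yields the n × F feature matrices described above.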

Dataset #2
In 2019, Rodriguez Gamboa et al. [47,48] developed a wine-quality detection system using an EN. The EN wine-quality detection system consists of an array of six metal-oxide gas sensors. These sensors, manufactured by Hanwei Sensors, were chosen because of their high sensitivity to alcohol and low sensitivity to benzine, high sensitivity to methane and natural gas, and high sensitivity to the liquefied petroleum gases isobutane and propane. The detection system uses a deep MLP neural network with online detection through a rising window method to process an early portion of the sensor signals.

Data Acquisition
The total number of experiments on dataset #2 was 235, which was divided into three classes (high-quality, average-quality, and low-quality wine). Table 2 lists the number of samples per class in this dataset and the ranges of the acetic acid detected and the volatile acidity according to the wine spoilage thresholds. Each experiment in the data acquisition had a duration of T = 180 s using a sample frequency of 18.5 Hz. The collected data were arranged, as in dataset #1, into a three-dimensional matrix of size n × M × N, where n = 235 indicates the total number of experiments, M = 180 s × 18.5 Hz = 3330 represents the number of time measures, and N = 6 indicates the number of sensors. For data transformation purposes, sensor measures were also grouped into n × M matrices X 1 raw , X 2 raw , . . . , X N raw (see Figure 3).

Data Transformation
Feature Extraction: Data transformation for dataset #1 in Section 4.1 is devised as a two-step procedure: (i) Feature extraction and (ii) data normalization. In the original study [47], a similar approach was conducted for the feature extraction and selection of dataset #2, where 23 features captured the dynamic and static behavior of each sensor. However, in the current study, because the raw data are available [48], the full set of sensor values were maintained. Therefore, the number of considered features coincided with the number of time samples, F = M.
Mean-Centered Unitary Group-Scaling: This method, which has recently been applied to the detection and classification of structural damage in wind turbines [15], is considered as an alternative data normalization procedure. MCUGS is formulated as a two-step normalization technique: the columns are first modified by subtracting the mean of each column (column-wise scaling), and the mean-centered data are then scaled by the standard deviation of the centered matrix. More precisely, for each sensor k = 1, . . . , N and time instant j = 1, . . . , M, we define

$$\mu_j^k = \frac{1}{n}\sum_{i=1}^{n} r_{ij}^k$$

to be the mean of all measures acquired by sensor k at time instant j, where $X^k_{\mathrm{raw}} = (r_{ij}^k)$. Therefore, the mean-centered matrices, $X^k = (x_{ij}^k)$, are defined as

$$x_{ij}^k = r_{ij}^k - \mu_j^k.$$

Then, because the overall mean of the centered matrix is zero, the standard deviation of the mean-centered matrix $X^k$ is computed as follows:

$$\sigma^k = \sqrt{\frac{1}{nM}\sum_{i=1}^{n}\sum_{j=1}^{M} \left(x_{ij}^k\right)^2}.$$

Therefore, matrix $X^k$ is scaled by dividing all the data by the standard deviation $\sigma^k$,

$$\hat{x}_{ij}^k = \frac{x_{ij}^k}{\sigma^k},$$

to create the MCUGS matrix $\hat{X}^k = (\hat{x}_{ij}^k)$. As described in Section 4.1, for simplicity, the normalized matrices $\hat{X}^k$ are renamed simply as $X^k$.
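A minimal NumPy sketch of the two MCUGS steps above (column-wise mean centering followed by scaling with the overall standard deviation of the centered matrix; the function name is ours):

```python
import numpy as np

def mcugs(X_raw):
    """Mean-centered unitary group scaling of an (n, M) sensor matrix."""
    X = X_raw - X_raw.mean(axis=0)     # subtract each column's mean mu_j
    sigma = np.sqrt((X ** 2).mean())   # std. dev. over all n*M entries
    return X / sigma

X_hat = mcugs(np.random.rand(235, 3330))
```

After this scaling, each column of the result has zero mean and the matrix as a whole has unit standard deviation.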

Dataset #3
The third dataset used to validate the methodology for classifying EN signals was used by Yin et al. [49], who proposed a temperature-modulated EN system using two sensors from Figaro, Inc. The reference models of these two sensors are TGS2620 and TGS2602, which were selected because of their wider detection range and higher sensitivity for the monitoring of multiple gases.

Data Acquisition
The total number of experiments in dataset #3 was 309, divided into three classes (carbon monoxide (CO), formaldehyde (HCHO), and nitrogen dioxide (NO2)). Table 3 shows the number of samples per class in this dataset. Each experiment in the data acquisition had a duration of T = 50 s using a sample frequency of 100 Hz. The collected data were arranged, as in datasets #1 and #2, into a three-dimensional matrix of size n × M × N, where n = 309 represents the total number of experiments, M = 50 s × 100 Hz = 5000 indicates the number of time measures, and N = 2 represents the number of sensors. For data transformation purposes, sensor measures were also grouped into n × M matrices X^1_raw and X^2_raw (see Figure 3).

Data Transformation: MCUGS
To obtain the scaled matrices X̂^1 and X̂^2, we normalize the raw matrices X^1_raw and X^2_raw using the MCUGS, as described in Section 4.2.2. These matrices are renamed simply as X^1 and X^2. For uniformity with the notation in Section 4.4, as in dataset #2, let F = M.

Data Unfolding
Before the data reduction, the normalized matrices $X^k \in \mathbb{R}^{n \times F}$, k = 1, . . . , N, are concatenated horizontally so that the data are collected into a single n × D matrix X as follows:

$$X = \left[ X^1 \mid X^2 \mid \cdots \mid X^N \right] \in \mathbb{R}^{n \times D},$$

where D = F · N. Each row of matrix X is a D-dimensional vector that contains the data associated with a particular experiment. The ith row of matrix X, collecting the information of all sensors for the ith experiment, is

$$\mathbf{x}_i = \left( x_{i1}^1, \ldots, x_{iF}^1, x_{i1}^2, \ldots, x_{iF}^2, \ldots, x_{i1}^N, \ldots, x_{iF}^N \right) \in \mathbb{R}^{D}.$$

For clarity, and as a summary, the dimensions of matrix X are 3600 × 128 for dataset #1, 235 × 19,980 for dataset #2, and 309 × 10,000 for dataset #3. The rows of matrix X (for each of the different datasets) are the input samples for the data reduction technique described in Appendix A. The new samples, with a reduced dimension, are the input of the data-driven classification methodology described in Section 6. The process of data arrangement, normalization, and unfolding in the case of datasets #2 and #3 is illustrated in Figure 5.
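The horizontal concatenation can be sketched as follows (stand-in data; the dimensions follow dataset #2):

```python
import numpy as np

n, F, N = 235, 3330, 6
X_sensors = [np.random.rand(n, F) for _ in range(N)]  # normalized X^k

# Unfold: concatenate the N matrices horizontally into an (n, D) matrix.
X = np.hstack(X_sensors)
D = F * N  # 19,980 for dataset #2
```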

Data Reduction: UMAP
A proper selection of the data reduction approach and the number of target dimensions can have a significant impact on the classification power [50]. Dimensionality reduction should be used not only for visualization or as a preprocessing step for extremely high-dimensional data but also as a general preprocessing technique, to increase the classification accuracy, for datasets that contain many variables unrelated to the target parameter. However, the difficult selection of both the dimensionality reduction technique and the reduced number of dimensions should be based directly on their effect on the classification power, measured using performance metrics such as accuracy [21].
When the input patterns lie on or near a low-dimensional submanifold of the input space, the structure of the dataset may be highly nonlinear, and therefore, linear methods are bound to fail [51]. To solve this issue, researchers have developed some manifold learning algorithms that serve as data reduction methods. This type of algorithm can be classified into two categories: Graph-based and kernel methods. The graph-based methods construct a sparse graph in which the nodes represent the input patterns and the edges represent the neighborhood relations [51]. Some graph-based methods include UMAP, Isomap, LEs, and locally linear embedding. The kernel-based methods are based on a kernel trick, which is used to generalize a linear method in statistical learning to nonlinear settings [52]. One example of a linear method that can be generalized to a nonlinear version is PCA, which can be generalized to its kernel version, i.e., a kernel PCA [53].
A detailed explanation of the UMAP method is beyond the scope of this study; see [54] for further details. To describe the background and motivation of the proposed methodology, we provide a practical computational description of the UMAP method in Appendix A.
A brief description of the main UMAP parameters used in this study is provided in Table 4. It is important to note that, in this study, any parameter that is not defined is set to its default value.

Supervised UMAP and Metric Learning
Two extensions of the standard UMAP method briefly outlined in Appendix A have been recently introduced (see [55,56]). The key ideas of these extensions are highlighted in this section.
The first extension is useful when a dimensionality reduction technique is used for classification purposes. In this case, a supervised manifold learning algorithm is available, where class labels are applied during the feature extraction process to better describe the low-dimensional data in terms of class information. In short, this extension of the UMAP method takes the labels of each sample and, after a metric space is considered to describe the categorical distance, includes this information in the similarity function p_ij. The relative weight of the metrics provided by the data and by the labels in the final similarity function is controlled using the hyperparameter target_weight ∈ [0, 1]: a value of 0 favors the data, whereas a value of 1 puts almost all of the weight onto the label distance.
The second extension is crucial for classifying new unseen data. The original UMAP method is a nonparametric algorithm that transforms the full data matrix X ∈ R^(n×D) into a reduced low-order d-dimensional manifold described by matrix Y ∈ R^(n×d) by minimizing the cross-entropy with a stochastic gradient descent (SGD) method. However, if new data become available, it is not possible to directly recover their associated d reduced features for use in the classification step. Fortunately, an extension of the UMAP method [57] incorporates a transformation mapper that makes it possible to transform new unseen data, albeit more slowly than some other transformers. Although this simple approach is used in the present study, parametric UMAP, which introduces a deep neural network during the optimization procedure to learn the parametric relationship between the original data, X, and the final embedding, Y, can also be considered.

Data Classification: Extreme Learning Machine (ELM)
An ELM is a single-hidden-layer feedforward neural network proposed by Huang et al. [58]. The input weights and the biases of the hidden layer nodes of the ELM are generated at random, and the output weights are determined analytically from the training data without iterative adjustment. Unlike traditional neural networks, such as a backpropagation (BP) neural network, the ELM has a simpler structure and a higher training speed, and its approximation capability is comparable to that of traditional neural networks [59].
A detailed description of the ELM algorithm is beyond the scope of this study. For more information on it, please refer to [60]. However, to present the background and motivation for the proposed methodology, a summary of the ELM method is provided in Appendix B. The recap is based on [59].
A brief description of the main ELM parameters used in this study is provided in Table 5. It is important to note that, in this study, all undefined parameters are set to their default values.
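The ELM idea summarized above (random input weights and biases, output weights solved in closed form by least squares) can be sketched in a few lines of NumPy; this is a minimal illustration, not the implementation [62] used in the study:

```python
import numpy as np

def elm_fit(X, T, n_hidden=60, seed=0):
    """Train a minimal ELM: X is (n, d) input, T is (n, L) one-hot targets."""
    rng = np.random.default_rng(seed)
    W = rng.standard_normal((X.shape[1], n_hidden))  # random input weights
    b = rng.standard_normal(n_hidden)                # random biases
    H = np.tanh(X @ W + b)                           # hidden-layer output
    beta = np.linalg.pinv(H) @ T                     # least-squares output weights
    return W, b, beta

def elm_predict(X, W, b, beta):
    return (np.tanh(X @ W + b) @ beta).argmax(axis=1)

# Toy usage: two well-separated classes.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 0.5, (50, 4)), rng.normal(3.0, 0.5, (50, 4))])
y = np.repeat([0, 1], 50)
T = np.eye(2)[y]
W, b, beta = elm_fit(X, T, n_hidden=20)
acc = (elm_predict(X, W, b, beta) == y).mean()
```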

Validation/Verification: Train/Test Data Split and Performance Measures
Assessing the performance of a machine learning tool on unseen data is crucial. Here, a two-stage approach is proposed, consisting of an initial validation step followed by a final verification on completely unseen data. Therefore, this study split the data into training and test sets with a 3:1 ratio: 75% of the data were used for the training/validation of the model, and the 25% of test data, which remained hidden during the model training and performance evaluation stage, were used for the final performance verification of the presented approach. Moreover, in the training and validation step, a 5-fold CV procedure was used to avoid overfitting. That is, the training dataset containing 75% of the data was, in turn, split into five folds (or subsets), each containing 15% of the total data, to generate five different validation models (each taking the data from four folds as the local training data for the algorithm and the remaining fold as the local test data for the performance evaluation) (see Figure 2). For a fixed number of extracted UMAP features d and fixed values of the ELM hyperparameters, the UMAP method was applied to the entire training dataset, and the five ELM classifiers were trained using the local training data. These classifiers were validated using the associated local test datasets.
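With scikit-learn, the split and 5-fold scheme can be sketched as follows (stand-in data sized as dataset #2; stratification of the split is an assumption on our part):

```python
import numpy as np
from sklearn.model_selection import train_test_split, StratifiedKFold

X = np.random.rand(235, 40)      # stand-in feature matrix
y = np.arange(235) % 3           # stand-in class labels

# 3:1 split: 75% for training/validation, 25% kept hidden for verification.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

# 5-fold CV on the training part: each fold holds 15% of the total data.
folds = list(StratifiedKFold(n_splits=5, shuffle=True,
                             random_state=0).split(X_tr, y_tr))
```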
To define the performance/accuracy measure during the validation step, because we are dealing with a multiclass classification problem where each sample from the local test dataset has a known class label, we incorporate the correctness/incorrectness of the predicted classes into a multiclass confusion matrix (see [15]). Specifically, for a problem with l classes or labels {C_1, C_2, . . . , C_l}, the number of samples belonging to class C_i that have been classified as belonging to class C_j, herein denoted by C_ij, can be summarized in the confusion matrix, as shown in Table 6 for the specific case of l = 3.

Table 6. Multi-class confusion matrix. The colored cells correspond to the true positives (green), true negatives (cyan), false negatives (orange), and false positives (magenta) associated with the C_3 class.

Then, for a given class C_i, denote by tp_i, tn_i, fn_i, and fp_i the number of samples that, with respect to class C_i, are true positives, true negatives, false negatives, and false positives, respectively, namely:

$$tp_i = C_{ii}, \quad fn_i = \sum_{j \neq i} C_{ij}, \quad fp_i = \sum_{j \neq i} C_{ji}, \quad tn_i = \sum_{j \neq i} \sum_{k \neq i} C_{jk}.$$

This allows for introducing average performance measures, such as the average accuracy

$$\text{average accuracy} = \frac{1}{l} \sum_{i=1}^{l} \frac{tp_i + tn_i}{tp_i + tn_i + fn_i + fp_i}, \qquad (2)$$

which consider the imbalance problem present in EN dataset #2 and the existing differences between the samples of the wine classes (see [61] for more details). As shown in Figure 2, the 5-fold CV produces five different confusion matrices, providing five performance measures that are averaged to obtain the final validation performance measure. In addition, although the performance measures are nonlinear, the common practice of computing a total performance measure, by first adding the five confusion matrices associated with each fold and then computing the performance of the summed confusion matrix, is also followed.
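A small NumPy sketch of these per-class counts and the resulting average accuracy (the function names are ours):

```python
import numpy as np

def per_class_counts(C, i):
    """tp, tn, fn, fp for class i from an l x l confusion matrix C,
    where C[i, j] counts class-i samples classified as class j."""
    tp = C[i, i]
    fn = C[i, :].sum() - tp
    fp = C[:, i].sum() - tp
    tn = C.sum() - tp - fn - fp
    return tp, tn, fn, fp

def average_accuracy(C):
    l = C.shape[0]
    total = 0.0
    for i in range(l):
        tp, tn, fn, fp = per_class_counts(C, i)
        total += (tp + tn) / (tp + tn + fn + fp)
    return total / l

# Toy confusion matrix: one class-1 sample misclassified as class 2.
C = np.array([[10, 0, 0],
              [0, 9, 1],
              [0, 0, 10]])
acc = average_accuracy(C)
```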

Experimental Results and Discussion
The results of the developed manifold learning classification methodology for ENs are shown in this section for the three described datasets. These results were obtained using the available online implementations of the UMAP [55] and ELM [62] methods.

The EN used to acquire dataset #1 is composed of 16 sensors; specifically, four Figaro sensors of each of the following types were used: TGS2600, TGS2602, TGS2610, and TGS2620. After conducting the experiments, acquiring the signals of each sensor, and selecting the corresponding eight features per sensor, we applied the min-max normalization. The data were then unfolded to obtain the two-dimensional data matrix X ∈ R^(3600×128), where each row contains the information regarding all 16 sensors for a particular experiment (see Section 4.4). Figure 6 shows the 128 feature vectors for a particular experiment before and after applying the min-max normalization. As shown in the figure, before the normalization, the non-normalized steady-state feature, ΔR_ik (first feature per sensor), stands out among the other features, i.e., both the normalized steady-state feature (second feature per sensor) and the dynamic features (features 3-8). After the values are scaled to lie between −1 and 1, the data are comparable and ready to be used for classification.

Figure 9 shows the data collected in matrix X ∈ R^(309×10,000) before and after applying the MCUGS normalization. The non-preprocessed data contain the measures for each sensor (5000 values for 2 sensors) for all 309 experiments. It can be seen in the figure that, in the MCUGS signals, the differences between the three classes are accentuated, making the signals better suited for classification purposes.

Validation Step: Tuning the UMAP/ELM Parameters

As mentioned in Section 3, a 5-fold CV using a training set containing 75% of the total data was applied to properly tune the number of extracted features of the UMAP method and the hyperparameters of the ELM classifier using a GridSearch algorithm (see Figure 2). A standard method for finding the optimal parameters applies a simple grid search combined with a CV to evaluate the accuracy of the model for each set of candidate hyperparameters. However, a simultaneous exhaustive grid search over the entire grid of the joined UMAP and ELM hyperparameters is prohibitively expensive because numerous parameters must be tuned (see Tables 4 and 5). Therefore, a simplified strategy is used. First, the set of hyperparameters to be tuned is reduced by fixing some of them to their default values (the UMAP metric is set to Euclidean, n_epochs is set to None, and spread is set to 1); therefore, only five parameters have to be tuned: the UMAP n_neighbors k, the UMAP min_dist, the UMAP n_components d, the number of hidden layer nodes of the ELM d̃, and the ELM activation function g. Then, the five-dimensional search space is separated into five one-dimensional spaces, and sequential one-dimensional grid searches are carried out while keeping the values of the other hyperparameters fixed. Although this approach is generally not optimal, because hyperparameters are typically interdependent, in the current scenario it provides excellent classification results, and the use of more involved techniques is not considered (see [63]). The one-dimensional searches are conducted using the GridSearchCV function of scikit-learn [64], where the accuracy of the model for each set of parameters is measured using the average accuracy (see Equation (2)). Table 7 presents the sequential GridSearch conducted for datasets #1 and #3 along with the final selected parameters.
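The sequential one-dimensional strategy can be sketched with scikit-learn's GridSearchCV; here a k-nearest-neighbors classifier stands in for the UMAP-ELM pipeline, and the parameter grids are illustrative:

```python
from sklearn.datasets import make_blobs
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier  # stand-in estimator

X, y = make_blobs(n_samples=150, centers=3, n_features=10, random_state=0)

# Tune one hyperparameter at a time, keeping the others fixed at their
# current best values (instead of one exhaustive multi-dimensional grid).
best = {"n_neighbors": 5, "weights": "uniform"}
for name, grid in [("n_neighbors", [3, 5, 7, 9, 11]),
                   ("weights", ["uniform", "distance"])]:
    search = GridSearchCV(KNeighborsClassifier(**best), {name: grid},
                          cv=5, scoring="accuracy")
    search.fit(X, y)
    best[name] = search.best_params_[name]
```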
Because the results are reported for all three selected activation functions (tanh, tribas, and hardlim), the GridSearch strategy is not applied to determine the optimal ELM activation function. To determine the other parameters, the ELM activation function is set to its default (tanh). The results obtained for dataset #2 are not detailed because a perfect classification was obtained for all tested parameters. The final parameters for dataset #2 were set to n_neighbors = k = 16, min_dist = 0.1, n_components = d = 8, and d̃ = 100.
As can be seen in Table 7, the first parameter tuned is the number of neighbors k of the UMAP neighboring graph, with the remaining parameters fixed. Figure 10 shows the effectivity index associated with the average accuracy, defined as ρ = 1 − average accuracy, obtained when varying k from 6 to 55 for datasets #1 and #3. Because a perfect classification corresponds to a unitary average accuracy, the best results are obtained when ρ is close to zero. As can be seen in the figure, a jump in the quality of the classifier is obtained when using 16 neighbors for dataset #1 and 6, 34, or 35 neighbors for dataset #3, where the average accuracies obtained are 0.99901 and 0.99711, respectively. It can also be seen that, for parameter values larger than 40, an increase in the accuracy can be obtained (even yielding perfect classification results); however, considering the trade-off between computational cost and accuracy, and because the accuracy is also improved by tuning the remaining parameters, k = 16 (dataset #1) and k = 6 (dataset #3) are the best options. The results for dataset #2 are not shown because all values of k yielded a perfect classification (ρ = 0); in this case, k = 16 is considered.
After the optimal number of neighbors k is fixed, GridSearch is applied with respect to the min_dist parameter, which controls the minimum distance between the points in the low-dimensional representation (with the same remaining parameters fixed). Figure 11 shows the effectivity index obtained when varying min_dist within the interval [0.1, 0.9] for datasets #1 and #3.

Figure 11. Average accuracy versus the minimum distance between points in the low-dimensional UMAP representation min_dist for datasets #1 and #3 (semi-logarithmic plot in the vertical axis). The vertical axis represents the effectivity index. The optimal selected parameters are highlighted in red. The accuracy range for dataset #1 is from 99.839% to 99.901%, and for dataset #3, from 96.536% to 99.711%.
Again, the results for dataset #2 are not shown because a perfect classification was obtained for all values, taking min_dist = 0.1. Regarding dataset #1, the highest average accuracy of 0.99901 is obtained for values of min_dist of up to 0.6, and the default value of 0.1 is maintained. It is worth noting that the results in this case are not overly sensitive to changes in the parameter value. Finally, for dataset #3, changing the initially prescribed value of min_dist = 0.5 provides worse results, and therefore, the value is maintained.
The final hyperparameter of the UMAP method to be tuned is n_components, i.e., the dimension of the low-dimensional space, d (the number of retained features used for classification). It is worth noting that this parameter, which is the dimension of the feature vector fed into the ELM classifier, is key to the accuracy of the final classification results (see [65]). Figure 12 shows the values of the effectivity index when varying n_components within the interval [2, 20] for datasets #1 and #3. For dataset #3, a jump in quality is obtained for n_components = 8. Regarding dataset #1, excellent results are obtained for nearly all parameter values, and the value d = 8 is selected. As can be seen in the figure, the supervised version of the UMAP method obtains excellent accuracies for the training dataset even for an extremely low number of retained features (an accuracy of 0.99901 is obtained even for d = 2 in dataset #1, and an accuracy of 0.98845 is obtained for d = 4 in dataset #3, instead of 0.99711 in the case of d = 8). However, because the classifier must classify new unseen data, the value d = 8 is retained in both cases as a trade-off between computational cost and accuracy.

Finally, a GridSearch was conducted to set the number of hidden layer nodes d̃ of the ELM method. It is well known that the number of hidden nodes is a key factor in achieving a good performance of the ELM method. The optimal value of this parameter usually depends on the size of the training set, n, the number of input features, d, and the number of output classes, L. Figure 13 shows the sensitivity analysis of the accuracy of the combined UMAP-ELM strategy with respect to the number of hidden nodes of the ELM classifier. The selected parameters are d̃ = 60 and d̃ = 50 for datasets #1 and #3, respectively, with associated final accuracies of 99.901% and 99.711%.
As depicted in the figure, the combined approach allows excellent accuracies to be obtained even for an extremely low number of hidden nodes, as opposed to the standard large number of hidden nodes required in other existing approaches (see, for instance, [45]). It is also worth noting that, for both datasets, all reported accuracies are above 90% (excluding the value d̃ = 1 for dataset #1, where an accuracy of 88% was reached). Although it has been reported that the selection of the UMAP and ELM hyperparameters has a significant impact on the performance, it is worth noting that the combined approach presented here is extremely robust. Figures 10-13 show the results of the sensitivity analysis of the proposed combined UMAP-ELM strategy, showing the accuracies with respect to the variation of the parameters. It can be seen in the figures that, for the first three parameters studied, the accuracies are all above 95%, and regarding the number of hidden nodes, all values ranging from 20 to 150 also ensure an accuracy of 95%.

Classification Results
This section shows the validation and verification performance measures obtained for the three considered datasets (see Figure 2).

Dataset #1
This dataset contains 3600 samples from six different analytes, where each sample has 128 initial features. After normalizing and unfolding all of the data, we applied a 3:1 partition of the total samples: 2700 samples were selected for the initial training (450 for each gas), and 900 samples were kept for the final testing (150 for each gas). The training samples were used in the supervised UMAP manifold learning algorithm to reduce the number of features used for classification to only d = 8 (see Table 7 for the details of the UMAP hyperparameters used). Figure 14 illustrates the first three features extracted using the UMAP method in dataset #1 for the training samples and the test samples, separately. As can be seen in the figure, the supervised UMAP successfully clusters most samples of the training set, even though a clear visual separation cannot be achieved. A careful look at the eight features provided by the UMAP method yields results similar to those shown in Figure 14: most of the samples seem to be clustered; however, some overlapping remains, with no clear visual separation. Therefore, the use of a machine learning classifier algorithm is crucial to properly classify all of the samples. When the learned UMAP mapper is used to extract the features of the test dataset, most new points also follow the same pattern, even though the samples do not cluster as clearly. The classification results obtained by using a 5-fold CV on the training dataset with the hyperparameters reported in Table 7 are summarized in the total added confusion matrix shown in Table 8. This table also shows the classification results on the test dataset for the same values of the parameters.
As can be seen in the table, all the training samples from acetaldehyde, ammonia, and acetone are perfectly recognized and only eight samples from among the 2700 are misclassified (two ethylene samples, three ethanol samples, and three toluene samples), yielding an accuracy of 99.90%. Regarding the classification of the 900 samples of the test data, in this case, the toluene and ethylene samples are perfectly recognized and only 17 samples are misclassified, yielding an accuracy of 99.37%. Tables 9 and 10 show all the performance metrics associated with both the validation and verification steps, and the effect of the choice of the activation function of the ELM classifier.
As expected, because the parameters of the methods were trained using these data, excellent performance metrics were obtained when the combined UMAP-ELM classifier was applied to the training set with the hyperbolic tangent (tanh) activation function. However, the same parameters also provided good results for the other two activation functions (tribas and hardlim). In addition, the accuracies obtained in the verification step with the unseen test dataset were excellent.

Dataset #2

This dataset contains 235 samples of wine of three different qualities, where each initial sample has 19,980 features. After the MCUGS was applied and all of the data were unfolded, a 3:1 partition of the total samples was performed: 176 samples were selected for the initial training (32, 38, and 106 for the high-, average-, and low-quality samples, respectively), and 59 samples were kept for the final testing (11, 13, and 35 for the high-, average-, and low-quality samples, respectively). As mentioned in Section 8.2, the combined UMAP-ELM strategy provides a perfect classification on the training set for all the selected parameters when using the tanh activation function. Note that, although some imbalance in the data is present, this does not lead to a poor predictive performance of the model for the minority classes. Figure 15 illustrates the first three features extracted using the UMAP method in dataset #2 for both the training and test samples. As can be seen in the figure, the capabilities of the UMAP method for clustering the data are evident, given the perfect separation of the wine-quality classes in both the training and test datasets. This clustering explains the perfect classification results obtained using the ELM classifier. The performance measures obtained for dataset #2 using three different activation functions in the ELM method are given in Tables 11 and 12.
As mentioned, the tanh activation function allows both the training and test datasets to be perfectly classified. The results for the tribas and hardlim functions were also excellent, yielding only some minor misclassifications (see the associated confusion matrices shown in Table 13). The current combined UMAP-ELM approach improves the classification results presented in the original study [47]. That study also involved a feature extraction method along with the use of two different classifiers (an SVM and a deep MLP neural network) and presented the results of a 5-fold validation in the training and verification test stages. The reported average accuracies for the test stages were 97.34% for the conventional SVM approach, where 69 features were selected for the classification, and 97.68% for the improved approach, where 300 features were selected for the classification. In this case, the use of the combined UMAP-ELM methods allows perfect classifications to be obtained, thereby improving the aforementioned results.

Dataset #3
This dataset consists of 309 nearly balanced samples (92, 100, and 112 samples of carbon monoxide (CO), formaldehyde (HCHO), and nitrogen dioxide (NO2), respectively) split into 231 training samples (72, 75, and 84, respectively) and 78 test samples (24, 25, and 29, respectively). Each sample contained 10,000 initial features, which were reduced to 8 using the supervised UMAP method. Figure 16 illustrates the first three features extracted using the UMAP method in dataset #3 for both the training and test samples. As can be seen in the figure, the supervised UMAP seems to properly cluster nearly all the samples of the training dataset; however, when the test data were projected, a slight overlapping occurred. The small ambiguities present in the separation of the training dataset were amplified when projecting the test dataset. It is worth noting that, in this case, the characteristics of the features did not allow as easy a clustering as in dataset #2 and that the supervised UMAP method had only 231 samples to train on; in contrast, in dataset #1, where the global structure was quite complex, 2700 samples were used for training. The classification results obtained using the hyperparameters reported in Table 7 are shown in Table 14 for both the validation and verification steps. In the validation step, the combined UMAP-ELM classifier reached an average accuracy of 99.71% with only one misclassified HCHO sample. However, when applied to the test set, the accuracy decreased to 88.89%. As mentioned above, although this is still a good classification result, the decrease in accuracy with respect to the results obtained for datasets #1 and #2 could be due to the very few samples available for training. Although these samples were sufficient for clustering the training set using the supervised UMAP method, when the test data were projected, the main features did not allow the data to be accurately separated, and the ELM classifier introduced some classification errors.
Finally, Tables 15 and 16 show all the performance validation and verification measures for the three activation functions used. As with the previous datasets, excellent accuracies were obtained in all cases, with the tribas option being slightly worse. In all three cases, a similar decrease in accuracy was obtained for the test dataset.

The current results can be partially compared with those reported in the original study [49], where a subsampling strategy extracting 10, 20, 30, and 40 subsampled features from the original 10,000 features was used instead of the UMAP method. A full 2D GridSearch was then applied to find the optimal parameters of the regularized ELM classifier; specifically, the number of hidden nodes was tuned together with the penalty regularization parameter. All the data were used for training, and therefore, only performance validation measures were obtained. Perfect per-class classifications were reported for HCHO and CO, whereas a 99.92% accuracy was reported for NO2. The per-class average accuracies (computed, for each class, as (tp_i + tn_i)/(tp_i + tn_i + fn_i + fp_i); see Equation (2)) can also be obtained by employing our approach, as shown in Table 14 (training set); we obtain an average accuracy of 100% for CO and 99.56% (= 230/231 × 100) for HCHO and NO2, where the total number of samples is 231 instead of 309. Three aspects are worth pointing out. First, as can be seen in Figure 10 and Table 17, a perfect classification for the training set could be obtained using the newly combined UMAP-ELM approach, for instance, taking k = 43; however, in the present approach, overfitting is avoided by adding the verification step. Second, reference [49] does not report how the average per-class accuracies are computed. Using the aforementioned standard per-class accuracy, a classification with only one error in one of the five folds (four retained for training and one for testing) yields an approximate per-class accuracy of 308/309 = 0.9967, from which it is difficult to reproduce the 99.92% accuracy reported for comparison with the present results. Finally, it is worth noting that, in [49], despite the larger number of extracted features, the number of hidden nodes used was kept to d̃ = 3. In future studies, the use of the regularized ELM method instead of the standard non-regularized ELM method can be considered to determine whether a penalty regularization parameter allows the number of hidden nodes to be decreased without introducing overfitting during the validation step.

Concluding Remarks
In this study, a data classification methodology for ENs was presented. It involved several stages, namely, data unfolding, data reduction by applying the UMAP algorithm, and machine learning classification using the ELM method, and excellent results were achieved, as evidenced by the high values of the classification performance measures. Three different EN datasets were used to validate the application of the developed methodology. The first dataset was composed of six different gases, the second dataset described three different classes of wine, and the third dataset consisted of measurements of three different gases.
The main conclusions of this work are as follows:
• Data transformation and unfolding: A data transformation procedure (feature extraction and min−max normalization for dataset #1 and MCUGS for datasets #2 and #3) was conducted to correct the differences in the signal magnitudes acquired by the EN sensor array. In addition, a data unfolding step was employed to arrange the acquired data in a two-dimensional matrix.
• Data reduction: The supervised UMAP algorithm was used as a data reduction method. Its capability to maximize the inter-class distance and minimize the intra-class distance was demonstrated. The UMAP data reduction procedure allows a new representation of the data to be obtained in a low-dimensional space, facilitating the task conducted by the ELM classifier.
• Data classification: The ELM classifier algorithm was used owing to its high training speed and satisfactory behavior in classification tasks. Proof of this is the high average accuracies obtained on the training and test sets for each of the three datasets tested.
• Validation/verification and hyperparameter tuning: A data splitting procedure was used, which assigned 75% of the data to the training set and the remaining 25% to the test set. First, the training procedure performed a 5-fold CV to avoid overfitting, and the hyperparameters of the UMAP and ELM methods were tuned. Second, the remaining unseen 25% of the data were used in the verification stage. The results on this test set showed a good behavior of the developed methodology, because the UMAP algorithm mapped the new data correctly to the low-dimensional space and the previously trained ELM classifier performed a successful classification.
• Various average classification performance measures were calculated. These measures summarize the multi-class classification problem in a single number. High values of these classification performance measures were obtained for the training and test sets in each of the three EN datasets used.
• Two of the three datasets exhibited imbalanced data. The classification results nevertheless showed high average accuracies, and in some cases, the values reached 100% for the training set.
The overall contributions of this paper, and the elements that differentiate it from other similar works, can be summarized as follows:
• The combination of the methods into a single methodology and its validation with three different datasets.
• The use of min−max or mean-centered unitary group scaling (MCUGS) to deal with the different magnitudes of the sensors. This is the first time that the MCUGS normalization method has been applied to the analysis of data from EN sensors, and only the second time that it has been used in the literature.
• The use of 5-fold cross-validation (CV) to prevent overfitting, which provides an alternative way to validate the results compared with traditional validation procedures in data analysis.
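As an illustration of the two normalization options listed above, the following sketch implements column-wise min−max scaling and a plausible reading of MCUGS. The group-wise scale used here (the Frobenius norm of each mean-centered sensor block) is an assumption for illustration, not necessarily the exact factor used in the methodology:

```python
import numpy as np

def min_max_scale(X):
    """Column-wise min-max normalization to [0, 1]."""
    mn, mx = X.min(axis=0), X.max(axis=0)
    return (X - mn) / (mx - mn)

def mcugs(X, groups):
    """Hedged sketch of mean-centered unitary group scaling (MCUGS):
    each column is mean-centered, then all columns belonging to the same
    sensor (group) are divided by a single group-wise scale so that every
    group contributes with a comparable magnitude. The group scale assumed
    here is the Frobenius norm of the centered group block."""
    Xc = X - X.mean(axis=0)
    out = Xc.copy()
    for g in set(groups):
        idx = [j for j, gj in enumerate(groups) if gj == g]
        norm = np.linalg.norm(Xc[:, idx])
        if norm > 0:
            out[:, idx] = Xc[:, idx] / norm
    return out
```

Either function takes the unfolded two-dimensional data matrix as input; `groups` assigns each column to the sensor that produced it.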
In future studies, the performance of the methodology will be evaluated using different methods of data normalization and different classification algorithms with an active learning approach.

Conflicts of Interest:
The authors declare no conflict of interest.


Appendix A. Practical Computational Description of the UMAP Method
This section presents a practical computational description of the UMAP method in terms of weighted graphs. Let X ∈ R^(n×D) be a matrix collecting all the samples of a specific dataset, where n denotes the total number of samples and D denotes the total number of features describing the data. The aim is to determine a reduced low-order d-dimensional manifold, d ≪ D, described by the transformed matrix Y ∈ R^(n×d) that retains most of the information of the original matrix X. In particular, in the present study, the aim is to obtain a transformed matrix Y that provides an accurate classifier when used as the input of the machine learning algorithm. The UMAP transformed matrix Y is computed in two phases. In the first phase, namely, graph construction, a weighted k-neighbor graph is constructed, and after proper transformations are applied to the edges of the graph, the data X are represented by the adjacency matrix of this graph. In the second phase, a low-dimensional representation of the previous graph preserving its desired characteristics, i.e., a graph layout, is built. Specifically, let X ∈ R^(n×D) be the input dataset,

X = (x_1, x_2, . . . , x_n)^T,  (A1)

where x_i = row_i(X)^T ∈ R^D; in addition, consider a metric d : R^D × R^D → [0, +∞).
To fix ideas, let d be the Euclidean distance, which is defined as

d(x_i, x_j) = sqrt( Σ_{l=1}^{D} ((x_i)_l − (x_j)_l)^2 ),

where (x_i)_l denotes the lth entry of vector x_i, which coincides with x_il, i.e., the (i, l)th entry of matrix X. Then, given an input hyperparameter, k, for each point x_i, i = 1, . . . , n, we compute the set {x_{i_1}, x_{i_2}, . . . , x_{i_k}} of the k-nearest neighbors of x_i under the metric d and define the parameters ρ_i and σ_i such that

ρ_i = min{ d(x_i, x_{i_j}) : 1 ≤ j ≤ k, d(x_i, x_{i_j}) > 0 },

Σ_{j=1}^{k} exp( −max(0, d(x_i, x_{i_j}) − ρ_i)/σ_i ) = log_2(k)

(see Figure A1). Note that ρ_i is the distance between point x_i and its nearest neighbor, whereas σ_i is a smoothed normalization factor.

Figure A1. Five nearest neighbors of point x_i for a particular set of points in R^3, where, for simplicity, d_j = d(x_i, x_{i_j}) and w_j = w(x_i, x_{i_j}). The metric is the Euclidean distance.
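The parameters ρ_i and σ_i defined above can be computed numerically. The following sketch (an illustrative bisection, not the reference UMAP implementation) recovers ρ_i as the smallest positive neighbor distance and solves the smoothed normalization condition for σ_i:

```python
import numpy as np

def rho_sigma(dists, k):
    """Given the distances from a point x_i to its k nearest neighbors,
    compute rho_i (distance to the nearest neighbor at positive distance)
    and sigma_i, the smoothing factor solving
    sum_j exp(-max(0, d_j - rho) / sigma) = log2(k), by bisection."""
    dists = np.sort(np.asarray(dists, dtype=float))
    pos = dists[dists > 0]
    rho = pos[0] if pos.size else 0.0
    target = np.log2(k)
    lo, hi = 1e-12, 1e3
    for _ in range(100):  # bisection: the sum is increasing in sigma
        mid = 0.5 * (lo + hi)
        s = np.sum(np.exp(-np.maximum(0.0, dists - rho) / mid))
        if s > target:
            hi = mid
        else:
            lo = mid
    return rho, 0.5 * (lo + hi)
```

Since the left-hand side grows monotonically from 1 (as σ → 0) toward k (as σ → ∞), and 1 ≤ log_2(k) ≤ k for k ≥ 2, the bisection always converges.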
Then, the weighted directed k-neighbor graph is defined as G = (V, E, w), where the vertices V are the points x_1, . . . , x_n, the set of directed edges is E = {(x_i, x_{i_j}) : 1 ≤ j ≤ k, 1 ≤ i ≤ n}, and the weights are

w(x_i, x_{i_j}) = exp( −max(0, d(x_i, x_{i_j}) − ρ_i)/σ_i ), j = 1, . . . , k,

(see Figure A2). It is worth noting that the definitions of ρ_i and σ_i ensure that p_{i|i_j} = w(x_i, x_{i_j}) ∈ (0, 1], that at least one edge has a weight of 1, that non-existent edges are set to have a weight of 0, and that all weights associated with the outgoing edges of every node sum to log_2(k). The weight p_{i|i_j} associated with the edge joining x_i and x_{i_j} represents the similarity between the high-dimensional points x_i and x_{i_j}, measured using an exponential probability distribution, and it can also be seen as the probability that an edge joining x_i and x_{i_j} exists.

Figure A2. Weighted directed three-neighbor graph for a particular set of points in R^2. For clarity, only some weights p_{i|i_1} = w(x_i, x_{i_1}) are shown. The metric is the Euclidean distance.

The last step of the graph construction phase is to recover an undirected graph from G. The weights associated with the new undirected edges (x_i, x_j) are given by the triangular conorm, or probabilistic sum, of the directed weights, namely,

p_ij = p_{i|j} + p_{j|i} − p_{i|j} · p_{j|i},  (A2)

which can be interpreted as the probability that at least one of the two directed edges (from x_i to x_j and from x_j to x_i) exists. The adjacency matrix P ∈ R^(n×n) of this undirected graph, containing the weight p_ij in its (i, j)th position, i, j = 1, . . . , n, is the starting point of the UMAP procedure. Note that if P_dir ∈ R^(n×n) is the adjacency matrix of graph G, then from Equation (A2),

P = P_dir + P_dir^T − P_dir • P_dir^T,

where '•' is the Hadamard (or pointwise) matrix product. It is worth noting that, in this initial step, the data matrix X ∈ R^(n×D) is associated with a larger, although sparse, matrix P ∈ R^(n×n), whose entries are probability-like measures of the distances between samples.

The aim of the second phase of the UMAP procedure is to compute a low-dimensional representation of the undirected graph preserving its main desired characteristics. That is, the goal is to compute a new transformed matrix Y ∈ R^(n×d),

Y = (y_1, y_2, . . . , y_n)^T,  (A3)

where y_i = row_i(Y)^T ∈ R^d contains only d features, and a new undirected graph with associated weights defined by a matrix Q ∈ R^(n×n) that minimizes the cross-entropy between the two graphs, defined as follows:

C(P, Q) = Σ_{i<j} [ p_ij log(p_ij/q_ij) + (1 − p_ij) log((1 − p_ij)/(1 − q_ij)) ],

where p_ij and q_ij are the weights of the undirected graphs given by the (i, j)th entries of the adjacency matrices P and Q (see Figure A3). Since C(P, Q) = 0 for q_ij = p_ij, the goal of UMAP is to find a low-dimensional representation of the data that mimics the high-dimensional similarities p_ij.
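The probabilistic-sum symmetrization in Equation (A2) has a direct matrix form. A minimal sketch, assuming the directed adjacency matrix P_dir is stored as a dense NumPy array:

```python
import numpy as np

def symmetrize(P_dir):
    """Triangular-conorm (probabilistic sum) union of the directed weights:
    P = P_dir + P_dir^T - P_dir * P_dir^T, with '*' the Hadamard product."""
    return P_dir + P_dir.T - P_dir * P_dir.T
```

The result is symmetric by construction, and each entry p_ij = p_i|j + p_j|i − p_i|j p_j|i stays in [0, 1] whenever the directed weights do.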
Figure A3. Weighted undirected graph in the low-dimensional space for a particular set of points.

Note that the values of p_ij, which are computed from X, are fixed, and that the weights q_ij are computed from the d features of the samples stored in Y. Therefore, after the constant terms in the objective function are removed, the problem at hand translates into finding Y ∈ R^(n×d) by minimizing

− Σ_{i<j} [ p_ij log(q_ij) + (1 − p_ij) log(1 − q_ij) ].  (A4)

To use the stochastic gradient descent (SGD) method on the previous optimization problem, we consider a smooth function describing the similarities between pairs of points in the low-dimensional space R^d,

q_ij = (1 + a ||y_i − y_j||^(2b))^(−1),

where a and b can be user-defined positive parameters or can be set automatically by solving a nonlinear least-squares fitting problem, once the minimum distance between points in the low-dimensional representation (min_dist hyperparameter) and the effective scale of the embedded points (spread hyperparameter) are fixed. Specifically, if not previously given, a and b are found by fitting the real function (1 + a x^(2b))^(−1) to the exponential decay function

Ψ(x) = 1, if x < min_dist; exp(−(x − min_dist)/spread), otherwise,

for x ∈ [0, 3 · spread]. For instance, for the reference values min_dist = 0.1 and spread = 1, we have a = 1.58 and b = 0.90 (see [66] for details). Therefore, after an initial guess Y_ini is computed, an iterative procedure is applied to find the minimum value in Equation (A4). The initial guess for the SGD method is obtained from the eigenvectors of the symmetric normalized graph Laplacian associated with P (see [67]). Specifically, denote by D ∈ R^(n×n) the degree matrix of the undirected graph P (D = diag(D_ii), where D_ii = Σ_{j=1}^{n} p_ij) and by L = D^(−1/2)(D − P)D^(−1/2) the normalized graph Laplacian; then, Y_ini is the matrix containing the d eigenvectors of L associated with its smallest nonzero eigenvalues.
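The nonlinear least-squares fit of a and b described above can be sketched with SciPy's curve_fit (an illustration under the stated min_dist and spread defaults, not the umap-learn source; the bounds are an assumption added for numerical robustness):

```python
import numpy as np
from scipy.optimize import curve_fit

def fit_ab(min_dist=0.1, spread=1.0):
    """Fit the smooth similarity kernel (1 + a*x^(2b))^(-1) to the piecewise
    exponential decay used by UMAP, over x in [0, 3*spread]."""
    x = np.linspace(0.0, 3.0 * spread, 300)
    # Target: 1 below min_dist, exponential decay with the given spread above it
    y = np.where(x < min_dist, 1.0, np.exp(-(x - min_dist) / spread))
    kernel = lambda x, a, b: 1.0 / (1.0 + a * x ** (2.0 * b))
    (a, b), _ = curve_fit(kernel, x, y, p0=[1.0, 1.0], bounds=(0.1, 10.0))
    return a, b
```

For the reference values min_dist = 0.1 and spread = 1, the fit lands near the a ≈ 1.58 and b ≈ 0.90 quoted above.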

Appendix B. Practical Computational Description of Extreme Learning Machine
Let Y ∈ R^(n×d) be a matrix collecting the n samples given in Equation (A3) to be classified, where each specific sample is given by y_i = row_i(Y)^T = (y_i1, y_i2, . . . , y_id)^T ∈ R^d and d indicates the number of relevant features selected by the UMAP algorithm. In addition, let {C_1, C_2, . . . , C_L} be the classes/labels into which the samples must be classified, where L denotes the total number of classes. The ELM classifier takes as input a particular sample, y_i ∈ R^d, and returns the output vector o_i = (o_i1, o_i2, . . . , o_iL)^T ∈ R^L containing the raw scores of the class membership for each of the L classes. By default, the ELM method uses binarized {−1, 1} class labels, and thus, if sample y_i belongs to class C_{c_i}, we expect that (o_i)_j = o_ij ≈ −1 + 2δ_{j,c_i}, i.e., we expect the c_i-th component of the vector to be close to 1 and all the other components to be close to −1. The class prediction is then computed by selecting the maximum component of o_i:

ĉ_i = arg max_{j=1,...,L} o_ij.

In general, every layer of a neural network that takes as input a feature vector x_in ∈ R^(n_in) and returns an output vector x_out ∈ R^(n_out) is characterized by: (1) an activation function g : R → R; (2) n_out thresholds or biases b_j ∈ R, j = 1, . . . , n_out, collected in a vector b ∈ R^(n_out); and (3) n_out weighting vectors w_j ∈ R^(n_in), j = 1, . . . , n_out, collected in the weighting matrix W = (w_1, w_2, . . . , w_{n_out})^T ∈ R^(n_out×n_in). Thus, the output vector is computed as

x_out = g∘(W x_in + b),

where g∘ is the element-wise activation function that applies the activation function g to each component of its vector argument. Based on this definition, it is easy to recover the usual alternative expression for the jth component of the output vector:

(x_out)_j = g(w_j · x_in + b_j) = g( Σ_{s=1}^{n_in} w_js (x_in)_s + b_j ).
In particular, the ELM neural network consists of a hidden layer that converts the initial samples y_i ∈ R^d into the hidden feature vectors h_i ∈ R^d̃ and an output layer that converts the hidden vectors h_i ∈ R^d̃ into the output vectors o_i ∈ R^L (see Figure A4). The hidden layer consists of a weight matrix W = (w_1, w_2, . . . , w_d̃)^T ∈ R^(d̃×d), a bias vector b ∈ R^d̃, and a user-defined activation function g, whereas the output layer consists of a weight matrix B = (β_1, β_2, . . . , β_L)^T ∈ R^(L×d̃), a zero bias vector, and the identity activation function. Therefore, after composing the two layers, we obtain the output vector

o_i = B (g∘(W y_i + b)) = B h_i.

Once the neural network parameters (g, W, b, B) are set, it is straightforward to provide the class prediction for any given sample.
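The two-layer composition above can be sketched as follows (an illustrative forward pass; tanh is an assumed activation, standing in for the sigmoid and tribas options used in the paper):

```python
import numpy as np

def layer(x_in, W, b, g=np.tanh):
    """Generic layer: x_out = g(W @ x_in + b), with g applied element-wise."""
    return g(W @ x_in + b)

def elm_forward(y, W, b, B, g=np.tanh):
    """ELM forward pass: hidden layer with activation g, then a linear output
    layer (identity activation, zero bias); the predicted class is the argmax
    of the raw scores o = B @ h."""
    h = layer(y, W, b, g)   # hidden features, length d_tilde
    o = B @ h               # raw class scores, length L
    return int(np.argmax(o))
```

Here `W` and `b` are the (random) hidden-layer parameters and `B` the trained output weights.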
As mentioned before, the training step of the ELM neural network is not comparable to that of a traditional neural network, because the weights and biases of the hidden layer are randomly assigned at the beginning of the learning process and remain unchanged during the entire training process. In particular, once the number of hidden features d̃ is set, the hidden weights and biases are randomly drawn from the standardized normal distribution:

w_js, b_j ∼ N(0, 1), j = 1, . . . , d̃, s = 1, . . . , d.

Figure A4. The ELM neural network consists of a hidden layer that converts the initial samples y_i ∈ R^d into the hidden feature vector h_i ∈ R^d̃ and an output layer that converts the hidden vector h_i ∈ R^d̃ into the output vector o_i ∈ R^L.

Therefore, only the weights B of the output layer are computed using the training dataset. To illustrate the computation of B, for ease of presentation and without loss of generality, let us assume that all the data Y ∈ R^(n×d) are used to train the neural network (in practice, only part of the total data is used to train the ELM classifier, and Y ∈ R^(n×d) should be replaced by Y ∈ R^(n_train×d)).
Because the class of each training sample is known, the raw scores of the class memberships are stored in the true output matrix

T = (t_1, t_2, . . . , t_n)^T ∈ R^(n×L),

where t_i = row_i(T)^T ∈ R^L corresponds to the raw scores of the ith sample (containing 1 in the actual class and −1 elsewhere). Specifically, if the ith sample belongs to class C_{c_i}, then (t_i)_j = 1 if j = c_i and −1 otherwise. Moreover, because the hidden layer is already known and is described by (g, W, b), we can compute all the hidden features of the input samples, h_i = g∘(W y_i + b) ∈ R^d̃, and collect them in the matrix

H = (h_1, h_2, . . . , h_n)^T ∈ R^(n×d̃),

where the vectors h_i = row_i(H)^T are now placed in the rows of matrix H. Then, the output of the ELM neural network is

O = H B^T,

where recall that o_i = B h_i. The output weights are then computed by minimizing the distances between the true outputs T and the predictions O, namely, by finding a least-squares solution B̂ of the linear system H B^T = T:

B̂ = arg min_{B ∈ R^(L×d̃)} ||H B^T − T||_F,  (A8)

where the Frobenius norm of a matrix is ||A||_F = sqrt(trace(A^T A)). Note that the system of equations H B^T = T represents n × L restrictions to be met, that matrix B^T has L × d̃ degrees of freedom, and that, in general, the ELM can only fit the training samples with zero error if the number of hidden nodes coincides with the total number of samples, d̃ = n, in which case B̂^T = H^(−1) T. In the general case in which d̃ ≠ n (note that the number of hidden nodes is usually much smaller than the number of samples), the output weights are computed by solving the least-squares problem in Equation (A8), yielding

B̂^T = H† T,  (A9)

where H† is the Moore–Penrose generalized inverse of H (see [68]). To further describe this key computation of the training step, although the Moore–Penrose pseudo-inverse is usually computed using a singular value decomposition of the matrix, in the usual case in which d̃ < n and H has full column rank, we have

B̂^T = (H^T H)^(−1) H^T T.

The training steps of the ELM are summarized in Algorithm A1, which shows that the hidden layer parameters are randomly generated, independently of the dataset, and that the weights associated with the output layer are the only parameters that need to be trained. Equation (A9) reveals that the weights are computed explicitly, without the use of an iterative procedure, and therefore, the ELM has a much higher training speed than the BP learning algorithm.
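The closed-form training step in Equations (A8) and (A9) amounts to a single pseudo-inverse. A minimal sketch (hypothetical helper name, with NumPy's pinv standing in for an explicit SVD-based computation):

```python
import numpy as np

def elm_output_weights(H, labels, n_classes):
    """Closed-form ELM training: given hidden features H (n x d_tilde) and
    integer class labels, build the binarized {-1, 1} target matrix T and
    solve H @ B.T = T in the least-squares sense via the Moore-Penrose
    pseudo-inverse, B.T = pinv(H) @ T."""
    n = H.shape[0]
    T = -np.ones((n, n_classes))
    T[np.arange(n), labels] = 1.0     # 1 in the actual class, -1 elsewhere
    return (np.linalg.pinv(H) @ T).T  # B, shape n_classes x d_tilde
```

Class predictions for the training set are then `np.argmax(H @ B.T, axis=1)`; no iterative weight updates are involved, which is the source of the ELM's training speed.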