Multi-label Classification with Optimal Thresholding for Multi-composition Spectroscopic Analysis

In this paper, we implement multi-label neural networks with optimal thresholding to identify gas species in a multi-gas mixture in a cluttered environment. Applied to infrared absorption spectroscopy and tested on synthesized spectral datasets, our approach outperforms conventional binary relevance - partial least squares discriminant analysis when the signal-to-noise ratio and training sample size are sufficient.

Multi-label classification methods are commonly grouped into two categories: problem transformation and algorithm adaption. Problem transformation algorithms transform a multi-label problem into one or more single-label problems. After the transformation, existing single-label classifiers can be used to make predictions, and the combined outputs are transformed back into multi-label representations. One of the simplest problem transformation methods is binary relevance (BR), which splits a multi-label problem into one binary problem for each label [12], [13]. Because it assumes label independence, it ignores the correlations between labels. If this assumption fails, label powerset (LP) and classifier chains (CC) are known transformation alternatives: LP maps each subset of the original labels into one class of a new single-label problem [14], and CC passes label correlation information along a chain of classifiers [15]. In contrast, algorithm adaption methods modify existing single-label classifiers to produce multi-label outputs. For instance, extensions of decision trees [16], AdaBoost [9], and k-nearest neighbors (KNN) [17] are all designed to deal with multi-label classification problems. Restricted Boltzmann machines [18], feedforward neural networks (FNN) [19], [20], convolutional neural networks (CNN) [21], [22], and recurrent neural networks (RNN) [23] are employed to characterize label dependency in image processing or to find feature representations in text classification. These adaptive methods can identify multiple labels simultaneously and efficiently without being repeatedly trained for sets of labels or chains of classifiers.
Our application of multi-label learning for spectroscopic analysis adopts FNN with optimal thresholding (FNN-OT), which is an adaptive FNN model inspired by [19], [20]. It will be compared with other problem transformation and algorithm adaption models that are extended from PLS and FNN. In this article, we will train all the models with simulated spectroscopic datasets and compare their results. It will be shown that for most evaluation metrics the adaptive FNN model has the best performance.

II. DATASET
To synthesize the datasets, we first selected the single-gas spectra of C₂H₆, CH₄, CO, H₂O, HBr, HCl, HF, N₂O, and NO from the HITRAN database [24]. The gas spectra were down-sampled to 1,000 pixels equally spaced between 1 µm and 7 µm in wavelength. Secondly, the gas concentrations were randomly generated so that the concentration of each gas is uniformly distributed between 0 and 10 µM. Thirdly, because gases can be partially correlated in real scenarios, we verify our model under partially correlated components by introducing a highly positive correlation between some gases so that their concentrations retain a pre-set correlation. The generation of uniformly distributed random variables with a target correlation matrix is discussed in Appendix A. Further, in order to test the validity of our classification model, we modified the concentration matrix such that each gas appears in only 50% of the gas mixture samples. Using the concentration matrix, the absorption spectrum of each gas mixture was synthesized using the Beer-Lambert law, assuming that the gas mixture was contained in a 10 cm long sensing region and that the light source has uniform intensity across the target wavelengths. Lastly, artificial Gaussian noise with a pre-set signal-to-noise ratio (SNR) was added to the light intensity in order to obtain a closer-to-reality spectrum.
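The synthesis steps above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the absorption coefficients are random placeholders rather than HITRAN data, the Beer-Lambert law is written with a natural exponential, the 50% presence rule is approximated by an independent random mask, and SNR is taken as the ratio of mean signal power to noise variance. None of these details is specified in the text, and the function name `synthesize_spectra` is hypothetical.

```python
import numpy as np

def synthesize_spectra(absorptivity, concentrations, path_length_cm, snr_db, rng):
    """Return noisy transmitted intensities for a batch of gas mixtures.

    absorptivity:   (n_gases, n_pixels) placeholder absorption coefficients
    concentrations: (n_samples, n_gases) gas concentrations
    """
    # Beer-Lambert law: absorbance adds linearly over the gases in the mixture.
    absorbance = path_length_cm * concentrations @ absorptivity
    # Uniform source intensity, normalized to 1 (assumption).
    intensity = np.exp(-absorbance)
    # Assumed SNR definition: mean signal power divided by noise variance.
    noise_power = np.mean(intensity ** 2) / 10 ** (snr_db / 10)
    noise = rng.normal(0.0, np.sqrt(noise_power), intensity.shape)
    return intensity + noise

rng = np.random.default_rng(0)
n_samples, n_gases, n_pixels = 200, 9, 1000

# Placeholder coefficients; real spectra would come from a spectral database.
absorptivity = rng.uniform(0.0, 0.01, (n_gases, n_pixels))

# Uniform concentrations between 0 and 10, then remove each gas from
# roughly half of the samples to create the presence/absence labels.
concentrations = rng.uniform(0.0, 10.0, (n_samples, n_gases))
present = rng.random((n_samples, n_gases)) < 0.5
concentrations = concentrations * present
labels = present.astype(int)

spectra = synthesize_spectra(absorptivity, concentrations, 10.0, 30.0, rng)
```

Here `labels` plays the role of the multi-label classification target and `spectra` the 1,000-pixel input before any pre-processing.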
In this article, we used 12 datasets: two for each pre-set SNR of 0 dB, 10 dB, 20 dB, 30 dB, 40 dB, and 50 dB, representing the uncorrelated and the highly correlated case, respectively. In the uncorrelated cases, the nine gas labels are mutually independent. In the highly correlated cases, the nine gases are evenly divided into three subsets. Gas labels within the same subset are highly correlated, and labels from different subsets are independent (Appendix A).

III. ALGORITHM
In single-label learning, a typical approach to classifying an instance is to rank the probabilities (or scores) of all classes and choose the class with the highest probability as the prediction. For multi-label problems, the same ranking system can be used to compute scores for all labels; a threshold is then determined, and all labels whose scores are higher than the threshold are assigned to the sample. This label score-label prediction framework is the foundation of adapting neural networks for multi-label learning. In the FNN-OT model, the scores of all labels are calculated for ranking purposes, and a threshold decision model is employed to assign a set of labels to the sample in the label prediction step. The whole process of FNN-OT is shown in Fig. 1a. Spectral signals are first pre-processed by principal component analysis (PCA). The output principal components are the input features of an FNN model, which produces one output score for each gas. These output scores are the input of the subsequent optimal thresholding (OT) step. For every sample in the training set, its threshold is determined by OT as illustrated in Fig. 1c; the mechanism is explained in Section III-C. The output scores and thresholds then serve as the input and output variables of a new FNN model, which is used to calculate thresholds for testing samples.

A. Feedforward neural networks
FNN has outstanding performance with large-scale datasets [20]. As shown in Fig. 1b, a typical FNN is formed by an input layer, an output layer, and one or more hidden layers in between. Each layer has a number of active neurons (circles without a cross in Fig. 1b) that use the neuron outputs from the previous layer as input and produce output for the neurons in the next layer. In our case of multi-label learning, a simple FNN model with one hidden layer can achieve a state-of-the-art result with great computational efficiency [20]. To get the output score $s$ from the input feature set $x$, our FNN can be written as [25]:
$$h = f_h\left(W^{(1)} x + b^{(1)}\right), \qquad s = f_s\left(W^{(2)} h + b^{(2)}\right),$$
where $h$ is a hidden layer that lies between the input and output layers, $f_h$ is the rectified linear unit (ReLU) activation function of the hidden layer, $f_s$ is the sigmoid function of the output layer, and $W^{(1)}$, $W^{(2)}$, $b^{(1)}$, $b^{(2)}$ are the parameters that need to be trained from data. In our model, the loss function $f_L(s, y)$ is defined as the cross entropy of the label score $s$ and the classification target $y$, which can be expressed as:
$$f_L(s, y) = -\sum_{i=1}^{L} \left[ y_i \log s_i + (1 - y_i) \log (1 - s_i) \right],$$
where $L$ is the number of labels.
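As an illustration of the one-hidden-layer network and its cross-entropy loss, the following NumPy sketch runs a single forward pass. The layer sizes, the random weights, and the small clipping constant `eps` are assumptions made for the example; the text reports using TensorFlow, which is not reproduced here.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, W1, b1, W2, b2):
    """One hidden ReLU layer followed by a sigmoid output layer."""
    h = relu(W1 @ x + b1)       # hidden layer
    s = sigmoid(W2 @ h + b2)    # one score per label, each in (0, 1)
    return h, s

def cross_entropy(s, y, eps=1e-12):
    """Cross entropy summed over labels; eps guards against log(0)."""
    s = np.clip(s, eps, 1.0 - eps)
    return -np.sum(y * np.log(s) + (1 - y) * np.log(1 - s))

rng = np.random.default_rng(0)
n_features, n_hidden, n_labels = 20, 16, 9   # illustrative sizes only

W1 = rng.normal(0.0, 0.1, (n_hidden, n_features))
b1 = np.zeros(n_hidden)
W2 = rng.normal(0.0, 0.1, (n_labels, n_hidden))
b2 = np.zeros(n_labels)

x = rng.normal(size=n_features)              # stands in for the PCA features
y = np.array([1, 0, 0, 1, 0, 1, 0, 0, 0])    # three of nine gases present

h, s = forward(x, W1, b1, W2, b2)
loss = cross_entropy(s, y)
```

Training would adjust `W1`, `b1`, `W2`, and `b2` to reduce `loss` over the training set, for example with the Adam optimizer mentioned later in the text.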
In our model, we adopted dropout, a widely used method for mitigating overfitting in neural networks [25]. It randomly drops a fraction of the neurons during training, and the weights of the remaining neurons are trained by back-propagation [25]. The retention probability $p = (p_1, p_2)$ is the dropout hyperparameter tuned for our model, where $p_1$ and $p_2$ are the probabilities of retaining units in the input layer and the hidden layer, respectively. The retention probabilities set for the FNN-OT model are the ones that result in the minimum loss. Dropout is implemented through two diagonal matrices of Bernoulli random variables, $P^{(1)}$ and $P^{(2)}$, with parameters $p_1$ and $p_2$, applied to the input and hidden layers.
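A minimal sketch of how Bernoulli retention masks might enter a training-time forward pass is given below. Multiplying a vector by a diagonal 0/1 matrix is the same as an element-wise mask, which is what the code uses. The layer sizes are illustrative, and how activations are rescaled at test time is not stated in the text, so it is left out.

```python
import numpy as np

def forward_with_dropout(x, W1, b1, W2, b2, p1, p2, rng):
    """Training-time forward pass with retention probabilities p1 and p2."""
    # Diagonal of P1: each input unit is kept with probability p1.
    mask_in = (rng.random(x.shape) < p1).astype(float)
    h = np.maximum(W1 @ (mask_in * x) + b1, 0.0)
    # Diagonal of P2: each hidden unit is kept with probability p2.
    mask_hid = (rng.random(h.shape) < p2).astype(float)
    s = 1.0 / (1.0 + np.exp(-(W2 @ (mask_hid * h) + b2)))
    return s, mask_in, mask_hid

rng = np.random.default_rng(0)
n_features, n_hidden, n_labels = 50, 32, 9   # illustrative sizes only
W1 = rng.normal(0.0, 0.1, (n_hidden, n_features))
b1 = np.zeros(n_hidden)
W2 = rng.normal(0.0, 0.1, (n_labels, n_hidden))
b2 = np.zeros(n_labels)
x = rng.normal(size=n_features)

# Retain 95% of the inputs and 20% of the hidden units on this pass.
s, mask_in, mask_hid = forward_with_dropout(x, W1, b1, W2, b2, 0.95, 0.2, rng)
```

A fresh pair of masks is drawn on every call, so each training step updates a different thinned sub-network.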

B. Principal component analysis
In both training and testing, the 1,000-pixel absorbance spectra are pre-processed with principal component analysis (PCA), and the principal components (PCs) are the input $x$ of the FNN model. PCA is a commonly used preprocessing method for spectroscopic datasets. It is conventionally employed to reduce the feature dimension by transforming the original input variables into a smaller set of uncorrelated PCs that preserve the highest explained variance [26]. As shown in Fig. 2a, at high SNR, PCA is an efficient technique for dimension reduction because a small number of PCs is sufficient to preserve most of the variance. However, when the SNR drops below 30 dB, the variance of the original data is almost evenly projected onto the PCs, and PCA is no longer efficient for dimension reduction. In a preliminary 10-fold test on the SNR = 40 dB dataset, the mean Hamming loss is higher when the number of PCs is smaller than the number of original pixels (blue line in Fig. 2b). However, as shown in the same plot, when PCA is adopted in conjunction with dropout (blue markers), the Hamming loss is significantly reduced compared to the models that adopt only PCA (yellow markers), only dropout (red marker), or neither of them (purple marker). Therefore, in this article, we adopt PCA for all SNRs not only for dimension reduction but also for Hamming loss reduction.
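The PCA pre-processing step can be illustrated with a singular value decomposition. This sketch fits the projection on training data and reuses it for new data; the toy data, the number of components, and the helper names `pca_fit` and `pca_transform` are assumptions for the example.

```python
import numpy as np

def pca_fit(X, n_components):
    """Fit PCA on X (n_samples, n_features) using an SVD of the centered data."""
    mean = X.mean(axis=0)
    _, singular_values, Vt = np.linalg.svd(X - mean, full_matrices=False)
    # Fraction of the total variance carried by each principal component.
    explained_ratio = singular_values ** 2 / np.sum(singular_values ** 2)
    return mean, Vt[:n_components], explained_ratio[:n_components]

def pca_transform(X, mean, components):
    """Project data onto the principal components fitted on the training set."""
    return (X - mean) @ components.T

rng = np.random.default_rng(0)
# Toy data: two latent factors spread over 50 features plus weak noise,
# mimicking a high-SNR case where a few PCs carry most of the variance.
latent = rng.normal(size=(300, 2))
mixing = rng.normal(size=(2, 50))
X_train = latent @ mixing + 0.01 * rng.normal(size=(300, 50))

mean, components, explained_ratio = pca_fit(X_train, n_components=2)
features = pca_transform(X_train, mean, components)
```

With stronger noise added to `X_train`, `explained_ratio` spreads out over many components, which corresponds to the low-SNR behavior described above.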

C. Optimal thresholding
Once we obtain the output score $s$ for a specific instance, we need to find a threshold $t_i$ to convert the $i$-th label score $s_i$ in $s$ into the $i$-th label prediction $\hat{y}_i$ in $\hat{y}$. Here, $\hat{y}_i$ can be expressed by an indicator function, $\hat{y}_i = \mathbb{1}(s_i > t_i)$. That is, if the $i$-th gas component label has a score higher than $t_i$, the prediction is 1, and it is 0 otherwise, representing the existence or nonexistence of that gas component in the spectrum.
For binary classification problems in single-label learning, the sigmoid activation function of the output layer results in output scores between 0 and 1, and those output scores are often interpreted as probabilities of the two possible classes. For each sample in the testing set, the predicted class is the one with a probability (output score) of more than 0.5, so the classifier can be viewed as an FNN model with a threshold $t = 0.5$. As shown in our results section, mislabeling a gas species present at extremely low concentration as absent from the sample occurs more frequently than mislabeling an absent gas species as present. This results in an imbalance between recall and precision. To re-balance recall and precision for a higher $F_1$, adopting an optimal threshold $t$ for each label in each instance is desirable. For samples in the training set, the method of determining $t$ is illustrated in Fig. 1c. Suppose we have obtained output scores for all nine labels of a gas mixture. Three of them (blue ones) have the ground truth value 1 (the gas species exists in the sample), and the remaining labels, in red, are 0 (the gas species is absent from the sample). We then calculate the $F_1$ scores for the three candidates $t_1$, $t_2$, and $t_3$ of $t$ (dashed lines), and the candidate with the highest $F_1$ score, which is $t_2$ in this example, is the $t$ we need. In our model, we use the output scores to calculate the candidates of $t$. For each sample, the nine output scores are sorted in increasing order: $s_1 \le s_2 \le \dots \le s_9$. Since the sigmoid function is used in the output layer, all output scores lie between 0 and 1, which yields ten threshold candidates.
In order to systematically obtain thresholds for all instances in the testing set, we assume that the threshold $t$ is determined by the label scores $s$ and that their relationship can be captured by a second FNN model, which maps $s$ to an estimated threshold $\hat{t}$ through a hidden layer $h_t$ with the ReLU activation function $f_h$; its weights, such as $W_t^{(2)}$, are the parameters that need to be estimated. We use the instances in the training set to train this FNN model, with the mean squared error between $t$ and $\hat{t}$ as the loss function.
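The per-sample threshold search can be sketched as follows. The exact candidate formula is not given in the surviving text, so this example assumes that the ten candidates are the midpoints between adjacent sorted scores, with 0 and 1 as the outer bounds. It also assumes that ties are resolved in favor of the lowest candidate and that $F_1$ equals 1 when a sample has no true and no predicted positives. All three choices are assumptions made for illustration.

```python
import numpy as np

def f1_for_sample(y_true, y_pred):
    """F1 for one sample; defined as 1.0 when nothing is present or predicted."""
    tp = np.sum((y_true == 1) & (y_pred == 1))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    denom = 2 * tp + fp + fn
    return 1.0 if denom == 0 else 2 * tp / denom

def threshold_candidates(scores):
    """Assumed candidates: midpoints between adjacent sorted scores, padded by 0 and 1."""
    padded = np.concatenate(([0.0], np.sort(scores), [1.0]))
    return (padded[:-1] + padded[1:]) / 2

def optimal_threshold(scores, y_true):
    """Return the candidate with the highest F1 (lowest candidate on ties)."""
    candidates = threshold_candidates(scores)
    f1_values = [f1_for_sample(y_true, (scores > t).astype(int)) for t in candidates]
    return candidates[int(np.argmax(f1_values))]

# Nine label scores for one training sample; three gases are truly present.
scores = np.array([0.90, 0.20, 0.10, 0.45, 0.05, 0.40, 0.15, 0.30, 0.25])
y_true = np.array([1, 0, 0, 1, 0, 1, 0, 0, 0])

t_opt = optimal_threshold(scores, y_true)
f1_fixed = f1_for_sample(y_true, (scores > 0.5).astype(int))
f1_opt = f1_for_sample(y_true, (scores > t_opt).astype(int))
```

In this toy sample, a fixed threshold of 0.5 misses the two present gases with scores 0.40 and 0.45, while the selected threshold recovers them. The pairs of scores and selected thresholds collected over the training set are what the second FNN would then be trained on.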

D. Evaluation metrics
To evaluate our models, we use micro-averaged recall, precision, and $F_1$ as our figures of merit [10]. In our context, a true negative (TN) is the case in which the absence of a certain gas from a sample is correctly predicted. Similarly, a true positive (TP) is the case in which an existing gas is marked as present in a sample. A false negative (FN) is the case in which the classifier fails to identify an existing gas, and a false positive (FP) is a false alarm in which the classifier identifies a non-existent gas.
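For concreteness, the sketch below computes micro-averaged precision, recall, and $F_1$ by pooling TP, FP, and FN over all samples and labels, using the standard definitions. It also returns the Hamming loss used elsewhere in the text, taken here as the fraction of wrongly predicted label entries. The function name `micro_metrics` and the zero-division convention are assumptions.

```python
import numpy as np

def micro_metrics(Y_true, Y_pred):
    """Micro-averaged metrics for 0/1 label matrices of shape (n_samples, n_labels)."""
    tp = np.sum((Y_true == 1) & (Y_pred == 1))
    fp = np.sum((Y_true == 0) & (Y_pred == 1))
    fn = np.sum((Y_true == 1) & (Y_pred == 0))
    # Return 0.0 when a denominator is empty (a convention chosen for this sketch).
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    f1 = 2 * tp / (2 * tp + fp + fn) if (2 * tp + fp + fn) > 0 else 0.0
    hamming_loss = (fp + fn) / Y_true.size
    return precision, recall, f1, hamming_loss

Y_true = np.array([[1, 0, 1],
                   [0, 1, 0]])
Y_pred = np.array([[1, 0, 0],
                   [0, 1, 1]])

precision, recall, f1, hamming_loss = micro_metrics(Y_true, Y_pred)
```

In this small example there are two true positives, one false positive, and one false negative, so precision, recall, and $F_1$ all equal 2/3 and the Hamming loss is 1/3.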

IV. RESULTS
A. Hyperparameter tuning
In our research, we use TensorFlow to implement our FNN-OT model and Adam as our optimizer. As a first step, we tune hyperparameters such as the dropout rate and the training sample size of the FNN-OT model on the SNR = 30 dB dataset.
1) Dropout: In order to tune the hyperparameters for dropout, a grid search was conducted on the retention probabilities $p = (p_1, p_2)$ of the input and hidden layers. A typical choice of retention rate is 0.8 for the input layer and 0.5 for the hidden layer [25], whereas a preliminary search on our datasets shows that the optimal choice of $p$ is around (0.95, 0.2).
2) Training sample size: To determine the number of training samples that are sufficient for our models, we plotted the learning curve as shown in Fig. 3. Here, FNN-OT is compared with PLS-BR (Appendix B) and FNN with 0.5 threshold.
As shown in the learning curve plot, we change the number of samples in the training set while keeping the 20,000-sample testing set intact. Both the training (solid lines) and testing (dashed lines) Hamming losses are plotted as a function of the number of training samples. At an SNR of 30 dB, without dropout (blue markers), our FNN-OT model displays large variance and low bias, as the training loss is almost 0 while the testing loss is above 0.1 even when the training sample size is around 100,000. This is a clear indication of overfitting for training sets of fewer than 100,000 samples. In contrast, with dropout (red markers), the overfitting issue is resolved, and both the training and testing losses converge to around 0.05 at around 100,000 training samples. In comparison, PLS-BR (green markers) does not display overfitting at the aforementioned sample size. However, its converged training and testing losses are higher (> 0.1) than those of our FNN-OT model with dropout, indicating that our model outperforms this conventional technique. Overall, the plot shows that around 100,000 samples are sufficient to train our FNN-OT model with dropout.

B. Performance comparison of mutually independent gas data
The parameters of the PCA and FNN models are trained on the 80,000-sample training sets and deployed on the 20,000-sample test sets.
We first compare our models using the datasets in which all gas components are mutually independent. Fig. 4 presents the micro-averaged precision, recall, and $F_1$ score at six different SNRs. As expected, all models perform better at higher SNRs. When the SNR is 0 dB, all three classifiers fail to identify gases, because a micro-$F_1$ score of 0.5 is only as good as a random guess. Across all SNRs, FNN-OT yields better precision, recall, and $F_1$ than the conventional PLS-BR, indicating that it is a superior approach for gas identification. Fig. 4 also illustrates that all three models display higher precision than recall. This is because most mislabeling occurs when the concentration of a gas species is too low to produce a detectable signal above the noise background; in that case, all models mistakenly predict the absence of that gas and produce an FN. However, as is evident from Fig. 4b, selecting an optimal threshold significantly reduces the occurrence of FNs and increases recall without significantly reducing precision, resulting in a better $F_1$ score. This justifies the adoption of FNN-OT.
The advantages of FNN-OT are further confirmed by comparing the minimum detectable concentrations of the nine gases in Fig. 5. As shown, both FNN-OT and FNN consistently achieve lower minimum detectable concentrations than PLS-BR at all SNRs, while in general FNN-OT outperforms FNN.

C. Performance comparison for highly correlated gas data
We further apply our models to the cases in which the gases are correlated. As shown in Fig. 6, when the SNR is above 20 dB, the performance of the three models is similar to that in the uncorrelated case, and FNN-OT outperforms the other two. Furthermore, at SNR = 0 dB or 10 dB, FNN-OT significantly outperforms both the other two models and its own results in the uncorrelated case, because FNN-OT can collaboratively identify gas species by exploiting their correlation, whereas FNN and PLS-BR cannot.

V. CONCLUSIONS
In conclusion, by selecting optimal thresholds, FNN-OT outperforms conventional PLS-BR and FNN in two aspects. FNN-OT can dynamically select a threshold to reduce FN events. In addition, FNN-OT is capable of utilizing correlation among the components to enhance its classification capability. Both of these unique features make FNN-OT a favorable choice for spectroscopic analysis in cluttered environments.

APPENDIX A GENERATION OF CORRELATED UNIFORMLY DISTRIBUTED RANDOM VARIABLES
To test our models with highly correlated gas labels, we construct a correlation matrix of all nine gases. To simplify our model, we evenly divided the nine gases into three subsets and generated highly correlated, uniformly distributed random concentrations for the three gases in each subset. Firstly, we generated the covariance matrix $\Sigma$ of the nine variables, which has to be symmetric positive semi-definite. Let $L$ be a matrix composed of blocks $L_{ij}$, where the $L_{ij}$ are 3 × 3 random matrices with element values uniformly distributed in (0, 1). Then $\Sigma = L L^T$ is symmetric positive semi-definite. With the covariance matrix, one may easily obtain the corresponding multivariate normal random numbers $X_i$ through, e.g., MATLAB's mvnrnd command. To generate uniformly distributed random numbers $Y_i$ from the above multivariate normal random numbers $X_i$, we used the approach in [27]. The procedure is as follows: define $x_i^j$ $(j = 1, \dots, N_i)$ as the $j$-th random number of $X_i$ $(i = 1, 2)$, where $N_i$ is the total number of samples in the random variable $X_i$. First compute the cumulative distribution function $P^i_{\mathrm{cdf}}$ of $X_i$ according to
$$P^i_{\mathrm{cdf}}(x) = \frac{1}{N_i} \sum_{j=1}^{N_i} \mathbb{1}\left(x_i^j < x\right).$$
Here $\mathbb{1}(x_i^j < x)$ is an indicator function that returns 1 if the condition in the bracket holds and 0 otherwise. Consequently, the uniformly distributed random variable $Y_i$ $(i = 1, 2)$ can be constructed according to
$$y_i^j = P^i_{\mathrm{cdf}}\left(x_i^j\right).$$
Fig. 7 shows the validity of the procedure. Here, the joint distributions of two partially correlated, normally distributed random variables $X_1$ and $X_2$ with correlation coefficients of 0.1 and 0.9 are plotted in subplots (a) and (c), respectively. The distributions of the corresponding transformed, uniformly distributed random variables $Y_1$ and $Y_2$ are shown in subplots (b) and (d), with the correlation coefficient values retained after the transformation. Fig. 7(e) further plots the correlation coefficients of $Y$ against the coefficients of $X$. As shown, the transformed correlation coefficients are almost identical to the coefficients of their original pairs.
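The transformation from correlated normal samples to correlated uniform samples can be sketched with NumPy in place of MATLAB's mvnrnd. This is an illustration for two variables with an assumed covariance matrix; the empirical CDF value of each sample is simply the fraction of samples strictly below it, which is its rank divided by the sample count. The function name `correlated_uniform` and the final scaling to concentrations are assumptions.

```python
import numpy as np

def correlated_uniform(cov, n_samples, rng):
    """Draw correlated normal samples and map them through their empirical CDF."""
    X = rng.multivariate_normal(np.zeros(cov.shape[0]), cov, size=n_samples)
    # Rank of each value within its column = number of strictly smaller samples
    # (continuous samples have no ties), so rank / N is the empirical CDF value.
    ranks = X.argsort(axis=0).argsort(axis=0)
    Y = ranks / n_samples
    return X, Y

rng = np.random.default_rng(0)
cov = np.array([[1.0, 0.9],
                [0.9, 1.0]])   # illustrative target correlation of 0.9

X, Y = correlated_uniform(cov, 5000, rng)
corr_X = np.corrcoef(X, rowvar=False)[0, 1]
corr_Y = np.corrcoef(Y, rowvar=False)[0, 1]

# Scale the unit-interval samples to a 0 to 10 concentration range.
concentrations = 10.0 * Y
```

Each column of `Y` is uniform on the unit interval by construction, and `corr_Y` stays close to `corr_X`, in line with the behavior described for Fig. 7.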

APPENDIX B PARTIAL LEAST SQUARES METHOD
Our model is compared with the conventional PLS-BR, a multi-label classifier adapted from PLS. It utilizes BR to split the multi-label task into several single-label classification problems. BR decomposes the learning of the output labels into a set of binary classification tasks, one per label, where each single model is learned independently, using only the information of that particular label and ignoring the information of all other labels [28]. BR has several advantages: the base learner can be selected from any of the binary learning methods, the complexity is linear in the number of labels, and it can optimize several loss functions. The main disadvantage of BR is that it assumes that all labels are independent and ignores the correlations between them.
PLS is a widely used quantitative technique in advanced spectral analysis [29]. In order to predict the output $Y$ from the feature $X$, PLS describes the common structure of $X$ and $Y$ by combining PCA and multivariate regression [30]. Similar to PCA, PLS decomposes $X$ and $Y$ as follows:
$$X = T P^T + E, \qquad Y = U Q^T + F,$$
where $T$ and $U$ are projections of $X$ and $Y$, $P^T$ and $Q^T$ are the transposes of the orthogonal loading matrices, and $E$ and $F$ are residual matrices. A regression between $T$ and $U$ is then performed following the standard multivariate regression procedure.
PLS itself is not designed for classification, so an extension of PLS called partial least squares discriminant analysis (PLS-DA) is adopted to classify categorical outputs. PLS-DA has been successfully used to classify milk and lubricants based on spectroscopic datasets in [31], [32], and [33]. In binary classification cases ($y = 0$ or $1$), PLS-DA creates two dummy variables, $y_1$ (for $y = 0$) and $y_2$ (for $y = 1$), for the label $y$, and then calculates the PLS regression scores for $y_1$ and $y_2$. If $y_1$ has the higher score, $y$ is classified as 0. Otherwise, the predicted class of $y$ is 1.
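The combination of binary relevance with the two-dummy-variable decision rule can be illustrated as follows. To keep the example short and dependency-free, ordinary least squares is used in place of PLS regression, so this is not PLS-DA itself but only a sketch of the surrounding logic: one independent model per label, two dummy targets per model, and a comparison of their regression scores. All function names and the toy data are hypothetical.

```python
import numpy as np

def fit_dummy_regression(X, y):
    """Regress the dummy pair (y1 for y == 0, y2 for y == 1) on X with an intercept.

    Ordinary least squares is a stand-in for PLS regression here.
    """
    dummies = np.column_stack([1 - y, y]).astype(float)
    X_design = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(X_design, dummies, rcond=None)
    return coef

def predict_dummy_regression(X, coef):
    """Classify as 0 if the y1 score is higher, and as 1 otherwise."""
    X_design = np.column_stack([np.ones(len(X)), X])
    scores = X_design @ coef
    return np.where(scores[:, 0] > scores[:, 1], 0, 1)

def fit_binary_relevance(X, Y):
    """Binary relevance: one independent binary model per label column."""
    return [fit_dummy_regression(X, Y[:, k]) for k in range(Y.shape[1])]

def predict_binary_relevance(X, models):
    return np.column_stack([predict_dummy_regression(X, m) for m in models])

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))
Y = (X[:, :3] > 0).astype(int)   # toy labels, each tied to one feature

models = fit_binary_relevance(X, Y)
Y_pred = predict_binary_relevance(X, models)
accuracy = np.mean(Y_pred == Y)
```

Because every label is fitted separately, none of the per-label models can use information about the other labels, which is the limitation of BR noted above.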