Optimal Feature Search for Vigilance Estimation Using Deep Reinforcement Learning

: A low level of vigilance is one of the main reasons for trafﬁc and industrial accidents. We conducted experiments to evoke the low level of vigilance and record physiological data through single-channel electroencephalogram (EEG) and electrocardiogram (ECG) measurements. In this study, a deep Q-network (DQN) algorithm was designed, using conventional feature engineering and deep convolutional neural network (CNN) methods, to extract the optimal features. The DQN yielded the optimal features: two CNN features from ECG and two conventional features from EEG. The ECG features were more signiﬁcant for tracking the transitions within the alertness continuum with the DQN. The classiﬁcation was performed with a small number of features, and the results were similar to those from using all of the features. This suggests that the DQN could be applied to investigating biomarkers for physiological responses and optimizing the classiﬁcation system to reduce the input resources.


Introduction
One of the main causes of traffic and industrial accidents is a low level of attention from fatigue, which greatly reduces work efficiency and increases the risk of accidents, and thus, a system automatically detecting the low level of alertness is a highly desirable. The research to estimate drowsiness or lack of vigilance mainly investigates physiological signals and video streaming information (eye blinking and yawning) [1,2]. Video images obtained from a camera could detect with high accuracy, drowsiness, and with low accuracy, vigilance. However, at night time or when wearing sunglasses, it would be difficult to capture the faces of users using the camera. For these cases, physiological signals could help to determine the states of the drowsiness or low level of vigilance. The objective of this study was establishing automatic assessment of vigilance monitoring system with physiological signals using deep reinforcement learning algorithm. The previous studies of alertness estimation mainly utilized the electroencephalogram (EEG), electrocardiogram (ECG), electrooculogram (EOG), and electromyogram (EMG) signals [3][4][5][6][7][8]. A vigilance monitoring system based on physiological signals could be implemented by utilizing EEG, ECG, and EOG signals.
EEG signals are widely utilized in practical situations, including brain computer interfacing (BCI), epileptic seizure prediction, drowsiness detection, and vigilance state analysis [3,[9][10][11]. EEG signal analysis is more challenging than other physiological signals because it contains several artifacts and high levels of noises such as EOG, EMG, and ECG signals [12]. The vigilance state study also utilizes the EEG signals and mainly uses the four EEG frequency components [13]. Four frequency components are extracted to make up the features of an EEG signal in the delta (δ: 0-4 Hz), theta (θ: 4-8 Hz), alpha (α: [8][9][10][11][12][13], and beta (β: 13-20 Hz) bands [14]. A high amplitude in the delta band corresponds to deep sleep, and a high amplitude in the theta band corresponds to drowsiness [13]. Akerstedt et al. used features from frequency components in the theta and alpha bands to predict sleep and drowsiness states [7]. A decreasing power in the alpha frequency band and increasing power in the theta frequency band may be a meaningful drowsiness indicator [6]. Additionally, combinations of theta, alpha, and beta frequency components, such as (θ + α)/β, β/α, have been used to extract features for detecting the drowsy and fatigued state. Studies on detecting low level of alertness with the EEG signal mainly used frequency-domain features and rarely considered time-domain features [10]. The area, normalized decay, line length, mean energy, average peak amplitude, average valley amplitude, and normalized peak number are the main time-domain features of an EEG signal. ECG signals are widely utilized to predict arrhythmia, coronary artery disease, and paroxysmal atrial fibrillation [15][16][17]. Chui et al. [18] and Sahayadhas et al. [19] considered ECG signals for estimating the drowsiness condition by extracting heart rate variability (HRV) information, which includes timeand frequency-domain analysis.
Recently, deep learning methods have been applied to healthcare data as replacements for conventional feature extraction methods, and have yielded better performances [20,21]. Deep learning automatically detects specific patterns in physiological signals measured using EEG, ECG, and EMG [22]. In healthcare studies, the convolution neural network (CNN) and autoencoder approach have been widely used to extract features from physiological signals [23,24]. Hwang et al. proposed ultrashort-term segmentation of the ECG signal to estimate stressful conditions [25]. While deep learning approaches have outperformed conventional methods for healthcare data analytics, a critical limitation has been encountered. Although deep learning algorithms automatically extract features in a data-driven manner, they are still a black box that fails to provide physiological meaning. Data analysis should yield interpretable outcomes and explain physiological phenomena corresponding to high-level states, such as cognitive or physical conditions [26]. In the present study, relatively more interpretable conventional algorithms and deep learning methods were used to analyze the states, and were compared to explain the outcomes of deep learning approaches.
Reinforcement learning is an area of machine learning where an agent defined in a specific environment recognizes its current state and acts to maximize a future reward among selectable behaviors. Reinforcement learning is suitable for solving sequential decision-making problems. It has recently outperformed humans in many fields (e.g., Go, Atari game) and has also been used for deep learning optimization [27][28][29]. In the present study, features were extracted with conventional and deep learning methods and fed to a reinforcement learning agent to find the optimal feature set. The aim of this study was to find the best features of EEG and ECG for the assessment of vigilance through reinforcement learning, and thus obtain higher accuracy than conventional supervised learning algorithms.

Related Work
Traditionally, the studies to monitor the vigilance state have utilized several physiological signals, such as EEG, ECG, EOG, and EMG [6,7]. Duta et al. applied the deep learning method to EEG signal for the first time [3]. Looney et al. implemented a wearable monitoring system utilizing a one channel ear EEG signal. For the convenience of users, research about building a vigilance monitoring system using single channel EEG has been increased. In recent vigilance state studies, cameras were used to calculate eye closing time to determine vigilance state. In the experiment, subjects were given vigilance tasks, and if the response time was longer than 500 ms, it was labeled as lack of vigilance [30]. Meanwhile, several methods using HRV parameter and CNN analysis for raw-ECG signals have been applied [25].
Lee et al. [31] extracted the HRV parameters and used classifiers such as a support vector machine (SVM), k-nearest neighbors (KNN), a random forest (RF), and a convolutional neural network (CNN) for the driver vigilance monitoring. Janisch et al. [32,33] utilized a reinforcement learning algorithm to solve the sequential decision-making problem, wherein an agent chooses a feature to see or makes a class prediction at each step. This idea has been widely applied in several studies on topics such as fast disease diagnosis, facial recognition, and effective medical testing [34][35][36][37]. Neural architecture search (NAS), one of the automatic deep learning optimization frameworks using reinforcement learning, has recently been applied to the EEG analysis [38]. Section 2 outlines the materials and methods for the experiment using a vigilance task, the feature extraction, and the proposed reinforcement learning algorithm. Section 3 elaborates upon the classification results of the reinforcement learning and conventional algorithms. Section 4 discusses the experimental results and concludes the paper.

Experiments of Vigilance Task
Several studies have conducted experiments to detect a low level of vigilance with multi-channel EEG signals [3,5]. In this paper, the vigilance task experiment was conducted to detect the low level of vigilance with single channel EEG and ECG. A vigilance task consists of preparation and reaction time tasks [3].

•
Preparation: A subject was seated in front of a screen and equipped with EEG and ECG electrodes.

•
Reaction time task: The subject was asked to press a push-button switch when a stimulus (3 × 3 mm black square) appeared on the screen. The stimulation was presented in 1 s and the visual stimulation interval was random, from 5 to 15 s.
Eleven subjects aged between 20 and 30 years participated in the experiments. The participants were instructed to avoid consuming alcohol, drugs, or caffeine before the experiment, and their sleep time was limited to less than 3 h. The experiment was conducted at 13:00 and the total experiment time was 20 min. If the subject did not execute given task due to lack of alertness, the recorded data (10 s) were labeled with low level of vigilance. This experiment was approved by KoNIBP (Korea national institute for bioethics policy) public institutional review board (IRB) in 2017 (P01-201707-13-003). The EEG signal was measured with QEEG-8 (Laxtha, Daejeon, Korea), and an electrode was placed on the FpZ site (frontal) based on the 10-20 International Standard of Electrode Placement [39]. All electrodes were referenced to earlobes, and the ground was the right mastoid. The ECG signal was measured according to the Einthoven lead-I configuration with an ECG amplifier (Biopac MP36 system, Biopac System, Inc., Goleta, CA, USA). Both EEG and ECG signals were sampled at 1000 Hz. A photodetector sensor was used to synchronize the time between the visual stimulus and push-button switch response.

Feature Extraction for Assessment of Vigilance
The recorded raw EEG and ECG signals were processed with a fifth-order Butterworth filter (0.5-50 Hz) to remove ambient noise and powerline interference [40]. Previous vigilance estimation studies have applied 1, 2, 5, and 10 s segmentations of EEG signals. Among these, 5 and 10 s segmentations are widely used to detect a low level of vigilance [41]. In the present study, a 10 s segmentation of the EEG and ECG signals was simulated to predict the low level of vigilance. A 10 s EEG signal was decomposed into four different frequency components: the delta (0-4 Hz), theta (4-8 Hz), alpha (8)(9)(10)(11)(12)(13) and beta (13-20 Hz) rhythms. The power spectral densities (PSDs) θ, α, and β were calculated for the theta, alpha, and beta components, respectively. Then, the most commonly used features were extracted: (θ + α)/β, α/β, (θ + α)/(α + β), and θ/β [13]. The time-domain features of the EEG signal were also calculated, as given in Table 1. Area was the area under the wave for the given time-series data. Normalized decay is chance corrected fraction of data that has a positive or negative derivative. Line length is the sum of distance between all consecutive data. Mean energy is mean energy of time interval. Root mean square is the square root of the mean of the data points squared. Average peak amplitude is log base 10 of mean squared amplitude of the K peaks. Average valley amplitude is log base 10 of mean squared amplitude of the V valleys [10]. In the equations, x i is the ith sample in a window with a length of W (window length), and I(x) is a Boolean function. k(i) is the index of the ith peak in the time series data, which is defined when the sign of the difference between consecutive data sample (x i−1 and x i ) changes from positive to negative. Similarly, v(i) is the index of the ith valley and is determined by the sign change of the difference from negative to positive.

Feature Equation
Conventionally, the HRV has been utilized as a feature of ECG signals [19,25]. However, it is not easy to extract meaningful HRV parameters with short-term, 10 s ECG data. A deep learning approach was recently applied to extracting ECG features and yielded higher accuracy than state-of-art feature extraction methods. The convolutional neural network (CNN) is a deep learning method that enables short-term 10 s ECG analysis [15,16,25]. In the simulations of the present study, the features of the ECG signal were extracted with the CNN algorithm. Figure 1 displays the architecture of the ECG feature extraction process.
Hwang et al. tried to extract features of ECG signals by using the representations of hidden layers for a 1D CNN [25]. They designed the structure of convolutional filters and pooling windows based on the physiological characteristics of the ECG. They introduced an optimal model for analyzing ECG with a CNN-LSTM, which was called Deep-ECGNet. In order to acquire ECG features with the same data segmentation length (10 s) as EEG, Deep-ECGNet was applied in this study. To extract the ECG features, we used pretrained weights of Deep-ECGNet, which was trained using our ECG data. The exact Deep-ECGNet parameters used in the implementation are summarized in Table 2. In the structure of Deep-ECGNet, the length of the convolutional filter and that of the pooling layer were both set to 800 (0.8 s) to fully cover the shapes of the ECG sequences [25]. In the present model, 128 channels were produced empirically and by 1D convolution; increasing the kernel size length up to 1000 improved the classification accuracy compared to that of the Deep-ECGNet configuration. The activation function was designed with rectified linear units (ReLUs). The dropout rate was 0.1, and the early stopping method was utilized to avoid overfitting [42,43]. Sequential patterns of the ECG time series signals were recognized with a long short-term memory (LSTM) model, which is a type of recurrent neural network (RNN) [44]. After the RNN stage, a fully connected layer of multilayer perceptrons (MLPs) was used as ECG features [45]. The output number of the fully connected layer was set as equal to the number of EEG features (i.e., 11) so that the reinforcement learning would initially select features with the same probability. For the final classification, the softmax activation function was used to classify whether low level of vigilance or alertness state. Figure 2 illustrates all processes for the EEG and ECG feature extractions.

Redefinition of the Problem with Reinforcement Learning
After the feature extraction, reinforcement learning was applied to solve the sequential decision-making process. For each step, an agent picks a feature for classification in an episode [32]. The features extracted from 10 s EEG and ECG signals were combined into one dataset. That was used for the reinforcement learning environment data sample D. Environment picks a random sample from data sample D and an agent sequentially chooses a high value features for classification in an episode. For reinforcement learning, the Markov decision processes (MDPs) need to be defined first; these involve the state, action, state transition probability matrix, reward, and discount factor [46,47]. A standard reinforcement learning configuration and the environment E can be represented as partially observable Markov decision processes (POMDPs). Unlike conventional MDPs, POMDPs provide only limited information on the environment [48]. In this study, the POMDPs were defined by the state space S(s ∈ S), set of actions A(a ∈ A), reward function r(s, a), and transition function t(s, a). The state s is defined as s = (x, y,F ), whereF is the currently selected feature. (x, y) ∈ D is a sample drawn from the dataset D. x i is the value of features f i ∈ F = { f 1 , f 2 , f 3 , . . . f n }. n is the number of features. y is the class label. An agent only receives the observed state s, which consists of a set of pairs (x i , f i ) without the label y.
An agent chooses an action at each step in the action space a ∈ A(A = A c ∪ A f i ), where A c is an action classifying the low and high level of vigilance and A f is an action taking a new feature not selected previously. In this study, the feature set consisted of 11 EEG features and 11 ECG features. The agent was given a random EEG or ECG feature from the environment and decided to select more features to classify the low level of vigilance state or not. When the agent chose A c , the episode was terminated automatically, and the agent received a reward. The reward function is defined as ifa ∈ A c and a = y −1 ifa ∈ A c and a = y, where λ is a feature cost factor initially set as 1 and varying from 0 to 1. Whenever the agent picked the action A f , it received a −λ reward from the environment. Therefore, a higher value of λ made the agent pick the action A f . The transition function is given by where T is the terminating state.

Deep Q-Learning
In this study, deep Q-learning was used to learn the optimal policy π; this combines a deep convolutional neural network with the Q-learning algorithm. Deep Q-learning estimates by using the state-action value function Q with a neural network architecture. Before the deep learning model was developed, Q-learning was often applied to solve simple problems. However, as the number of states increases, updating the Q-function with a finite Q-table becomes impossible [47]. To solve this, the state action value function Q is approximated by an artificial neural network (ANN), whose optimal function Q * is defined as [28] where γ is the discount factor that determines the importance of the future reward, s is the state of the next time step, and a represents all possible actions of the next time step s [47]. The episode is guaranteed to terminate because of the terminating state T in the transition function of Equation (2). Thus, an episode has fewer transition steps than other reinforcement learning applications such as the Atari game and Go [28,29,49]. A Q-network designed with a neural network structure is generally used as a function approximator to estimate an action-value function with the weight θ in the neural network. The Q-network is trained by minimizing the mean square error (MSE) loss function L(θ), which iterates for each training step i: where Q is the target Q-network that holds its weight, and θ − is its weight. After the main Q-network (Q) updates, the target Q-network (Q) updates its weights. The experience replay method is used to overcome the strong correlation problem among samples [50]. Random samples from a replay memory D break the high correlation among samples, which yields efficient learning. For every time step, the environment E stores the experience e t = (s t , a t , r t , s t+1 ). In this study, the buffer size of the replay memory was set to 200,000.

Training Algorithm
The previous section describes the basic concept of the deep Q-network. This section describes the implementation of the environment and training procedures. The POMDP environment E was implemented with the deep Q-network algorithm, where the initial states 0 was randomly drawn from the training dataset, and the observed state s consisted of (x (i) , f i ). For the neural network input, the observed state s was converted to the tuple (x, m), wherex is a mask vector of the original x that contains acquired feature values and m is a mask vector that denotes the acquired feature index [32]. The mask m consisted of (0, 1). When the agent picked the ith feature, the mask vector was assigned the value of 1. The main purpose of the reinforcement learning model is to solve a sequential decision-making problem. Thus, the agent learned the optimal policy π to find the optimal features from the feature set and classify the low level of vigilance and awake conditions. To find the optimal feature combinations among all features in an episode, the agent should not pick the same feature. This was realized by using a mask vector. The mask vector m contained the history of all agent actions in an episode. The vectorsx and m were given as follows: The agent was designed to have the structure of a three-layer artificial neural network (ANN), where the input layer was the observed state s = (x, m) and the output layer represented Q-values, including Q f eature (s t , a t ) and Q class (s t , a t ). The overall learning procedure of the agent is illustrated in Figure 3.  The neural network that represents the Q-network had the inputs of the feature vector x and mask vector m. It was built from three fully connected layers with the ReLU activation function and an output layer with the linear function to produce the Q-value. The feature vector x had 22 dimensions, including the number of feature sets and the dimension of the mask vector m. Before the Q-network weight θ was updated, the replay memory should store enough episodes, where the agent selects a random action with the probability and stores its transition (s, a, r, s ) to the replay memory. Initially, 2000 transitions episodes were stored in the replay memory buffer. The environment E initialized the initial state s 0 without any feature information. After enough transition data were accumulated in the replay memory, the Q-network was updated. When an episode was terminated, y j in Equation (5) was determined by the environment, where the target Q-network and soft target update methods were applied [29,51]. The target Q-values were constrained to changing slowly, which greatly improved the stability of the learning process. The weight θ − of the target Q-networkQ was kept to be updated. If the agent selected the ith feature in the feature set, the environment assigned the ith feature value to the feature vector x and assigned 1 to the ith component in the vector m. The training and environment simulation algorithm procedures are described in Algorithm 1.
For every time step, the agent interacted with the environment E . In the action space A = {A c , A f }, the agent selected the A c or A f action to classify the state or select a feature. When the agent chose the A f action, the environment added the chosen feature to the set of selected features and created the mask m i = 1. By using the mask vector, the environment forced the agent to not choose the same action or feature by subtracting λc(a) from the Q value. The environment simulation algorithm procedures are described in Algorithm 2.
To evaluate the classification performance of the reinforcement learning algorithm, the accuracy, precision, recall, and F1 score were calculated as follows: where TP is a true positive, TN is a true negative, FN is a false negative, and TN is a true negative. Set y j = r j if episode terminates j+1 r j + γmax a Q (s t+1 , a ; θ − ) otherwise Clip y j from -1 to 0; Perform a gradient descent on (y j − Q(s j , a j ; θ)) 2 ; Every C steps, update weight θ − ← (1 − ρ)θ + ρθ − ; end end Algorithm 2: Environment.simulation.
if a ∈ A c then Add a to selected features; Mask m i = 1; Return (x * m, −λc(a)). end

Classification for Assessment of Vigilance
The total EEG and ECG data were divided into five segments for the fivefold validation method [52]. Four segments were used as the training set, and the other segment was used as the test set. This process was repeated five times so that each segment could be the test set. The cost factor λ in Equation (1) was set to 0.05 because it yielded the best classification accuracy in this experiment. The accuracy was decreased when the cost factor was 0.1 and the number of selected features was increased when the cost factor was 0.01. The cost factor for maximum accuracy while minimizing the number of selected features was 0.05. If the λ value is high, the classification accuracy should be low, and the number of selected features should increase because the agent receives a negative reward for the feature selection action A f . During the training process, the -greedy policy was used; this was not applied during the validation and testing processes. Figure 4 illustrates the accuracy of the deep Q-network (DQN) agent and the number of selected features. Figure 4a shows the DQN agent classification performance of subject 10, as the learning steps were increased during the training session, and Figure 4b illustrates the number of features chosen by the DQN agent. The means of the TP, TN, FP, and FN results across the five-fold cross-validation segments were calculated.  Table 3 presents the classification results of the 11 subjects. The DQN agent classification accuracy was more than 90%, although the results for Subjects 2 and 5 had an accuracy of less than 80%.

Comparison with Conventional Methods
A benchmark test was performed to compare the reinforcement learning algorithm with several conventional supervised learning classifiers: linear discriminant analysis (LDA), the multilayer perceptron classifier (MLP), the support vector machine (SVM), and the random forest classifier. The LDA classifier is widely used for EEG classification problems [53] and finds a linear combination of features to separate different classes. The MLP is a basic deep learning model that optimizes the loss function by using gradient descent [45]. The SVM is a supervised learning method that is often used for regression and classification problems. With a kernel trick, an SVM can be used to solve both linear and nonlinear problems. Random forest is an ensemble machine learning algorithm that is widely used for physiological data analysis. Figure 5 compares the scatter plots of the reinforcement learning algorithm and conventional supervised learning algorithms. Each dot represents the classification accuracy of a subject; the red dots above the gray line indicate that the DQN agent had a higher classification accuracy than the other conventional supervised learning classifiers. The green dots indicate that the classification rates were the same, and the blue dots indicate that the DQN agent had lower accuracy than the conventional supervised learning classifiers. Figure 5a shows that the DQN agent yielded better accuracy than the SVM for 10 subjects, and the SVM was better for only one subject. MLP and LDA had better classification accuracy than the reinforcement learning algorithm for one subject, and the result was the same for one subject. The random forest showed the best performance among the conventional supervised learning classifiers; it had better accuracy for three subjects compared to the DQN agent. The Mann-Whitney rank test was performed to compare the accuracy, precision, recall, and F1 score values for the DQN agent with the others [54]. Figure 6 shows the p-values of the four cases. In particular, the reinforcement learning algorithm significantly outperformed LDA with a p-value of less than 0.05. Figure 7 illustrates that classification precision, recall, and F1 score values for the 11 subjects. The reinforcement learning (i.e., DQN) showed the best accuracy, precision, recall, and F1 score. Among the 11 subjects, two subjects had lower classification accuracies of 78% and 60%. Excluding these two outliers, the DQN agent showed a stable accuracy, precision, recall, and F1 score.

Feature Study
In order to investigate the significant features of the physiological response to the low level of vigilance state, the total feature set consisting of 11 EEG and 11 ECG features was fed to the DQN agent, which selected the best features for vigilance estimation. During the training process, the number of features selected by the DQN agent for all 11 subjects was counted, including all episodes during the fivefold cross-validation.
In Figure 7, the blue bars denote the total numbers of selected EEG features, and the green bars denote the total numbers of selected ECG features. In total, 268 EEG features and 273 ECG features were selected. On average, 2.48 EEG features (±1.23, standard deviation (SD)) and 2.45 ECG features (±1.134, SD) were selected. Among all ECG and EEG features, the 10th Deep-ECGNet feature and 11th ECG feature were selected most often. Among the EEG features, the mean energy was selected most often, followed by the frequency feature (θ + α)/(α + β). In order to evaluate the significance of the selected features by the DQN agent, only some of the selected features were tested for the estimation of the vigilance state. Each episode represented trial data that were segmented into 10 s intervals. The DQN agent selected the best feature set for an episode based on its Q value, and the selected features were fed into the conventional classifiers. Because the DQN agent produced different Q values depending on the dataset of each subject, every subject had their own optimal feature set of two or three features. Figure 8a-d illustrates the box plots of the accuracy, precision, recall, and F1 score produced by the conventional classifiers with the optimal features. These results are comparable with those in Figure 7 obtained using all of the features, which suggests that using the DQN agent to select a few features could make the vigilance monitoring system lighter and more efficient by requiring fewer resources.

Discussion and Conclusions
A portable and wireless system was developed to monitor the low level of vigilance of a user. Because of the limited resources in terms of memory, processing, and battery power, optimizing the significant input or features is crucial to reducing the memory size and computation of the processor, and thus realizing low energy consumption. This study used the reinforcement algorithm of DQN to optimize the feature set of ECG and EEG responses to the state of low level vigilance. Figure 8 shows that selecting a feature set including only two or three features could yield a similar performance to that of using all of the features. This approach can be applied to most physiological and neuroscience research, where investigating the biomarkers for a physiological or cognitive condition is key to understanding the phenomenon. It was demonstrated that the reinforcement algorithm detects low level of vigilance using the optimized features more efficiently than conventional supervised learning classifiers, as shown in Figures 6 and 8. In addition to the DQN used in this paper, asynchronous advantage actor critic (A3C), trust region policy optimization (TRPO), and proximal policy optimization (PPO) algorithms are utilized in the state-of-the-art reinforcement learning algorithm. For the optimization of the deep learning structure and parameters, neural architecture search (NAS) is widely used [27,[55][56][57]. If the structure and the parameters of DQN are optimized through the NAS algorithm, the accuracy of the vigilance assessment will be increased. We left the parameter optimization and design of the new deep learning structure to future work.
In this study, the DQN algorithm with the POMDP environment was applied to classify the low level of vigilance and optimize the feature set. The optimal feature selection process suggested that both ECG and EEG can be used to estimate the vigilance, although the two ECG features were selected more often, as shown in Figure 8. This implies that both the autonomous system (measured by ECG) and central nervous system (measured by EEG) are responsible for the drowsy condition, while the autonomous system has a slightly stronger influence. Thus, both ECG and EEG sensors should be used for the development of a vigilance state detection system. These results demonstrated that the deep learning approach of Deep-ECGNet can be a state-of-the-art algorithm for extracting features from a short-term ECG signal. This is supported by the results of Hwang et al. [25], who demonstrated its performance for monitoring stress in an experiment. Although the DQN agent uses few optimal features, it had a higher classification accuracy than conventional classifiers considered in this study (i.e., LDA, MLP, SVM, and random forest). DQN can also suggest significant features of EEG and ECG corresponding to a low level of vigilance, which can be used to investigate biomarkers for the physiological state.