Using Psychophysiological Sensors to Assess Mental Workload in Web Browsing

Knowledge of the mental workload induced by a Web page is essential for improving the user’s browsing experience. However, continuously assessing the mental workload during a browsing task is challenging. To address this issue, this paper leverages the correlation between stimuli and physiological responses, which are measured with high-frequency, non-invasive psychophysiological sensors during very short time windows. An experiment was conducted to identify levels of mental workload through the analysis of pupil dilation measured by an eye-tracking sensor. In addition, a method was developed to classify real-time mental workload by appropriately combining different signals (electrodermal activity (EDA), electrocardiogram, photoplethysmography (PPG), electroencephalogram (EEG), temperature and eye gaze) obtained with non-invasive psychophysiological sensors. The results show that the Web browsing task involves on average four levels of mental workload. Also, by combining EEG with PPG and EDA, the accuracy of the classification reaches 95.73 %.


Introduction
Although Web applications are often justified in terms of increasing the productivity of human tasks, they sometimes have the opposite effect, interrupting, reducing the performance of, or increasing the mental workload of the user [1][2][3][4]. A typical task in which this phenomenon may occur is Web browsing. In this task, the user fixes her/his gaze on and between Web elements, i.e., graphic or textual areas of a Web page, such as news, commercial advertisements, and menus [5][6][7].
In cognitive psychology, mental workload refers to the total amount of perceived mental effort used for learning or processing new information [8][9][10][11].
An important factor in measuring the effectiveness of a Web page is the user's browsing experience. It has been shown that the higher the level of the user's browsing experience is, the lower the mental workload [3,4,12]. Every Web page has both an intrinsic and an extrinsic mental workload [3,13,14,15]. The former is related to the natural effort required to absorb new information, to the process of learning to navigate around the page, and to the process of becoming accustomed to the design of the page. The latter consists of the mental workload caused by the inclusion of unnecessary details or external interruptions, such as font styles that convey no meaning, commercial advertisement pop-ups, and irritating recommendations, which may have a negative effect on the user's browsing experience.
Continuously assessing, at any moment, the mental workload involved in browsing tasks entails measuring it either when the user fixes her attention on a Web element or when her gaze switches from one element to another. This assessment of mental workload can enhance the user's browsing experience in many ways: for instance, avoiding extrinsic mental workload by automatically identifying the most suitable moments to proactively deliver content to the user or preventing irritating intrusions from the environment; reducing intrinsic mental workload by keeping the Web page support interventions on stand-by and adapting graphic user interfaces in real time; and evaluating the likelihood of the user's abandonment, frustration or technostress, among other benefits. In addition, instantaneous classification of mental workload into intrinsic or extrinsic to the Web elements of a Web page would make it possible to detect short time windows of reduced cognitive burden to activate the delivery of different types of recommendations in a timely, unobtrusive manner, such as contextual news in newspaper portals or commercial advertisement pop-ups on various Web sites. Moreover, it may be possible to enhance search tasks, for instance, for restaurants, flight tickets, or retail products, by providing relevant feedback to the search engine based on the user's cognitive status [6].
To realize the above requirements, it is essential to address the challenge of automatically assessing the mental workload in a continuous fashion while the user is engaged in browsing, that is, in real time, with high frequency and using very short time windows.
Many studies have focused on classifying mental workload in general by capturing and processing data using ever less invasive psychophysiological sensors [16][17][18][19][20]. This method is founded on the empirical demonstration of the correlation existing between psychological stimuli and physiological responses triggered by the nervous system. Moreover, mental workload has been shown to vary frequently within a short time span [21,22].
Although considerable research has been devoted to assessing mental workload on the scale of hours and minutes by using data extracted from psychophysiological sensors, less attention has been paid to time windows lasting seconds or less, such as when a user fixes her gaze on a Web element. Indeed, Bailey et al. [23] have recently shown that moments of reduced mental workload occur while the user's attention is transiting from one task to another. However, this was shown only for coarse-grained tasks, such as selecting a travel route among alternatives presented in a graphic interface or classifying a list of emails into various categories [23].
In this paper, the capabilities of psychophysiological sensors are leveraged to research the possibility of assessing mental workload in real time during a browsing task. This paper thus attempts to answer the following research questions:
• RQ1: Is it possible to identify levels with regard to a user's mental workload within very short time windows (order of milliseconds) based on psychophysiological signals recorded during a Web browsing task?
• RQ2: Is it possible to accurately classify in real time a user's mental workload, both when her gaze is fixed on a Web element and when her gaze is transiting from one Web element to another, by combining different non-invasive psychophysiological sensors?
In addition, based on the findings of Bailey et al. [23], this paper attempts to prove the following hypothesis:
• H1: Mental workload is significantly smaller when the user's attention is switching from one Web element to another than when she is focused on a Web element.
To answer these research questions and prove the stated hypothesis, an experiment was conducted in which 61 users performed a normal Web browsing task in front of a computer screen while their psychophysiological responses were measured by different sensors and recorded in a database. The gold standard with regard to answering RQ1 is pupil diameter because several previous studies have shown that, under controlled illumination conditions, this psychophysiological response is a valid and reliable indicator of mental workload [23][24][25][26][27][28][29]. Using clustering methods, this paper shows that, by processing the pupil dilation response, four levels of mental workload can be identified per user on average. However, measuring pupil dilation with an eye tracker is not a realistic and practical method to classify mental workload, for example, in the open air, because it requires constant and controlled illumination conditions. Thus, in this paper, more practical and less invasive sensors are assessed to measure other psychophysiological responses, such as heart rate (HR), electrodermal activity (EDA), body temperature, and electrocardiogram (ECG). The electroencephalogram (EEG) sensor is also assessed because there have been important advances in the construction of portable EEGs and in algorithms to reduce motion-related artifacts [30,31]. It is expected that before long, there will be EEG devices that only capture brain waves from the areas of the brain relevant to the assessment of mental workload, making them less invasive [32].
This paper shows that, using all the sensors and efficiently processing their signals using artificial neural networks, mental workload can be classified as proposed in RQ2, with 68.94 % accuracy, 66.62 % recall, and 76.92 % precision. However, using all the sensors and a multi-layer perceptron, it is possible to achieve 88.46 % accuracy, 88.84 % recall, and 88.85 % precision.
Ultimately, the best performance is obtained by combining EDA, HR, and EEG, achieving 95.73 % accuracy, 94.25 % recall, and 95.6 % precision in the classification of mental workload. Furthermore, the hypothesis that mental workload is significantly smaller when the user's attention is switching from one Web element to another than when she is focused on a Web element is confirmed (test statistic = 1.7829; p-value = 0.00184 < 0.05).
The contributions of this paper include (i) identifying different levels of mental workload required for Web browsing through the processing and analysis of pupil dilation measured by an eye-tracking sensor; (ii) developing a method for appropriately combining non-invasive psychophysiological sensors to classify real-time mental workload in small time windows with high accuracy (mean = 99.1 %, SD = 0.2772 %) based on the behavior of the user's gaze in a Web browsing task; and (iii) leaving open the possibility of using gaze shifts from one Web element to another as the most appropriate time to provide the user with recommendations, for example.
This paper is organized as follows. Section 2 provides the background required to understand this research. Section 3 presents the related literature. The experiment conducted is described in Section 4, as well as the data processing and the machine learning methods applied to the data. The results are presented in Section 5 and are discussed in Section 6, while Section 7 concludes the paper.

Assessment Methods
Cognitive resources are assets used by cognition to think, remember, make decisions, solve problems, or coordinate movements, such as perception, attention, short- and long-term memory, and motor control [33,34]. According to Navon et al. [35], these resources underlying human learning and information processing are limited [36].
Wickens [9], in his multiple resource theory, suggests that these resources can be used in parallel for multiple tasks, using several resources at once. However, when task demand is high, the resources allocated to that task are not available for another task if the same mental resources are required at the same stage of processing. Excessive use, moreover, can cause a state of overload known as cognitive resource depletion [37]. This overload means that the brain is unable to process new information, resulting in processing and/or execution errors [38].
Mental workload results from the different levels of resource demand, depending on the parallel tasks that the person is performing [8,9,21,22]. Excessive resource demand can cause distraction, increase errors, generate stress and frustration, and reduce the ability to undertake mental planning, problem solving, or decision-making [39,40]. One example is the distraction caused by unwelcome advertisements on a Web page while the user is browsing. In this case, the intermingling of the browsing task with the intrusion of commercial advertisements forces the user to divide attention and allocate cognitive resources to the new stimulus.
Traditionally, mental workload has been assessed in different situations using subjective methods [16] based on surveys, self-perception scales, or think-aloud protocols [41][42][43]. These methods are applied after the user has already finished the task, and the assessment of the mental workload depends on the user's final perception [44]. Therefore, these methods are constrained by the reporting bias introduced by relying on past memories and by the problem of ecological validity that arises from observing responses to hypothetical scenarios rather than behaviors in a real setting [45]. In addition, the static nature of these methods makes them unfit for real-time evaluation. The most widespread example of this method is the NASA Task Load Index, which measures the mental and physical performance, as well as the effort and frustration, of the user [46].
Performance-based methods have also been used, which measure indicators generated during task execution, such as the percentage of correct responses or execution time [3,16,17]. In this method, the user needs to be engaged in only one task. Its major restriction is the difficulty of assessing mental workload in near real time.
The attempts to find objective indicators to measure mental workload in real time are based on collecting contextual information, which can be captured mainly using psychophysiological sensors [47][48][49]. Indeed, there is ample empirical evidence in psychophysiology showing that some physiological responses are directly related to psychological factors such as stress, mental workload, and emotions [50][51][52]. That is, there is a correlation between the physiological responses triggered by the nervous system and psychological stimuli.
Psychophysiological responses are controlled by the autonomic nervous system (ANS), which regulates and coordinates bodily processes such as digestion, temperature, blood pressure, and many aspects of emotional behavior [53]. These actions occur independently of the conscious control of the individual. The ANS includes the sympathetic nervous system (SNS) and parasympathetic nervous system (PNS). The SNS controls actions required in emergency situations, such as stress and movement. It can cause heart rate acceleration, pupil dilation, increased blood flow to the muscles, sweating, and muscle tension. The PNS controls the functions related to rest, repair, and relaxation of the body. The responses elicited by this system include a decrease in heart rate and blood pressure, stimulation of the digestive system, and pupillary contraction, among others [50,51].

Psychophysiological measurements
There are different types of methods to measure psychophysiological responses elicited complementarily by the SNS and PNS [54]. For instance, the device for tracking gaze is the eye tracker. It consists of a camera on the computer screen that works according to the "corneal-reflection / pupil-center" method [55]. It also allows the measurement of the variation of the pupil diameter.
Pupillography measures changes in pupil size, which can be attributed to both parasympathetic inhibition, which explains the first dilation phase, and sympathetic activation, which explains the subsequent dilation phase [56,57]. Although pupil dilation can be triggered by a light reflex caused by changes in environment illumination or by a proximity or accommodation reflex to improve visual focus, it can also be caused by a psychosensory reflex associated with the cognitive or emotional engagement of the person while exposed to any sensory stimulus [58]. In contrast to the changes caused by the two previous reflexes, changes in pupil size in this case are subtler, so a high-precision device or eye tracker is required for their detection [59].
The eye tracker is also used for tracking the eye to determine gaze position or movements within a scene, including two relevant measurements:
• Fixations: moments during which the gaze is relatively fixed or focused. They occur because sharp vision is only possible within a small area in the human eye called the fovea. It is useful to determine when eye fixation occurs because, in most cases, it coincides with attention.
Moreover, other studies have concluded that theta and delta bands are sensitive to stimuli involving difficult manipulation.
EDA is a psychophysiological response that can be assessed by measuring changes in the electrical properties of the skin. Skin conductivity varies with changes in skin moisture (sweat) and may reveal changes in the SNS. EDA is also known as galvanic skin response (GSR), and it is inexpensive to assess, easily captured, and robust. It is measured by attaching one or two electrodes, usually to the fingers or toes. It is an indicator of psychological and physiological arousal. In addition, it serves to identify emotional states. EDA has two components: (1) a phasic component that changes rapidly and is related to external stimuli or a non-specific activity and (2) a tonic component or base signal that varies slowly and sets basic skin conductance. A classic behavior is that when arousal increases, there is an increase in sweat gland activity, decreasing electrical resistance, and thus increasing conductivity.
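The tonic/phasic split described above can be illustrated with a minimal sketch. It assumes that a centered moving average is an acceptable estimate of the slowly varying tonic level and that the residual approximates the phasic component; the function names and the 4-second window are hypothetical choices for illustration, not the decomposition used in this study.

```python
def moving_average(signal, window):
    """Centered moving average; near the edges only available samples are used."""
    half = window // 2
    smoothed = []
    for i in range(len(signal)):
        lo = max(0, i - half)
        hi = min(len(signal), i + half + 1)
        smoothed.append(sum(signal[lo:hi]) / (hi - lo))
    return smoothed


def split_eda(signal, fs_hz, tonic_window_s=4.0):
    """Split a skin-conductance trace into (tonic, phasic) lists.

    Assumption: the tonic level is approximated by a moving average over
    `tonic_window_s` seconds, and the phasic part is what remains.
    """
    window = max(1, int(tonic_window_s * fs_hz))
    tonic = moving_average(signal, window)
    phasic = [x - t for x, t in zip(signal, tonic)]
    return tonic, phasic


if __name__ == "__main__":
    # Flat baseline with one short conductance burst (synthetic data).
    trace = [2.0] * 40 + [2.5, 3.0, 2.5] + [2.0] * 40
    tonic, phasic = split_eda(trace, fs_hz=10)
    print(max(phasic))
```

A longer window yields a smoother tonic estimate at the cost of leaking slow phasic responses into it, so the window length would need tuning against real recordings.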
The cardiovascular system is particularly interesting for psychophysiology because it is highly sensitive to neurological processes and psychological factors such as stress. It is regulated by the ANS, which produces patterns of electrical activity that are fundamental for psychophysiological measurements [50]. Several studies associate changes in cardiac activity with psychological phenomena, such as mental work, perception, attention, problem solving, and signal detection [60].
An ECG is used to measure the electrical activity of the heart, using at least three electrodes attached to the chest. The electrodes collect the necessary data with regard to the electric waves that describe the cardiac cycle, based on which the HR or its variation (HRV) is obtained.
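As a hedged illustration of how HR and HRV can be derived from the cardiac cycle, the sketch below starts from R-peak timestamps that are assumed to have been detected already (peak detection itself is out of scope). SDNN and RMSSD are two standard time-domain HRV indices; the text does not state which HRV measure was used, so their choice here is an assumption.

```python
import math


def rr_intervals(r_peak_times_s):
    """Intervals (in seconds) between consecutive R peaks."""
    return [b - a for a, b in zip(r_peak_times_s, r_peak_times_s[1:])]


def mean_heart_rate_bpm(rr):
    """Mean heart rate in beats per minute from RR intervals in seconds."""
    return 60.0 / (sum(rr) / len(rr))


def sdnn_ms(rr):
    """Sample standard deviation of the RR intervals, in milliseconds."""
    mean = sum(rr) / len(rr)
    variance = sum((x - mean) ** 2 for x in rr) / (len(rr) - 1)
    return 1000.0 * math.sqrt(variance)


def rmssd_ms(rr):
    """Root mean square of successive RR differences, in milliseconds."""
    diffs = [b - a for a, b in zip(rr, rr[1:])]
    return 1000.0 * math.sqrt(sum(d * d for d in diffs) / len(diffs))


if __name__ == "__main__":
    peaks = [0.00, 0.80, 1.62, 2.40, 3.22]  # synthetic R-peak times (s)
    rr = rr_intervals(peaks)
    print(mean_heart_rate_bpm(rr), sdnn_ms(rr), rmssd_ms(rr))
```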
The human body constantly exchanges heat with the environment as part of the process of self-regulation to maintain homeostasis (internal balance of the body). Body temperature increases and decreases in relation to the energy exchanged. The regulation of blood flow to the skin and thermal radiation is considered a function of the ANS [61]. Studies conducted in this field, according to Genno et al. (1997) [62], suggest that skin temperature has potential as a psychophysiological measure of the individual.

Literature Review
This paper focuses on the measurement of mental workload while the user browses a Web site in front of her personal computer. However, the literature in this regard is scant. Thus, to start studying the measurement of mental workloads in various domains and to help understand the methodology associated with this type of research, this section focuses on two main points: the assessment of mental workload using psychophysiological sensors in general and the measurement of mental workload in Web environments.

Assessment of Mental Workload with Psychophysiological Sensors
A relevant study for this paper is that by Bailey et al. [23], who develop psychophysiological measures to assess the effect of interruptions on the performance of a person executing a task. They establish that interruption involves considerable negative effects, such as increased time to complete the task [63], a wider range of errors [64], additional efforts in decision-making [65] and mood changes such as increased frustration and anxiety [66][67][68]. For example, when an interruption occurs at a random time while performing a major task, the time to completion can increase by up to 30 %, up to twice as many errors can be committed, and user displeasure doubles, in contrast to when the interruption occurs at a pre-programmed time. Therefore, Bailey et al. empirically find that interruptions may have a lower cost if they occur at a time of low mental workload, hypothesizing that this may occur at the boundaries between subtasks when executing the general task [69]. As a test method, they assess mental workload by pupil dilation in three different tasks that include respective subtasks. The first task consists of assessing two different routes between two cities on a monitor; the user must measure the distance and cost of the routes, tabulate the data, and, finally, discriminate and choose the shortest and most economical route. In the second task, the user must edit a document and correct spelling at three levels of complexity (editing a word, editing two words, and editing a complete sentence). The third task entails classifying nine emails involving explicit issues (low complexity) and ambiguous issues (high complexity) into four categories. Each of these scenarios is applied to 24 people (seven women) between 19 and 50 years of age. The main conclusions of the study are as follows: (i) mental workload varies during the execution of the three tasks, (ii) the mental workload decreases when performing subtasks compared to the general task, and (iii) different subtasks demand different levels of mental workload based on their complexity.
Other studies focus on training classifiers to process psychophysiological signal data in a time window in order to predict whether the load associated with a specific task is high or low [70]. For example, Haapalainen et al. [17] measure the mental workloads of basic tasks such as the resolution of problems on a monitor, visual perception, and cognitive speed by using an eye-tracking device, EEG, ECG, heat flow, and heart rate measurements. As a result, they find that ECG and heat flow together distinguish between tasks of high and low cognitive demand with 80 % precision.
Fritz et al. [16] seek to verify whether psychophysiological sensors are useful in measuring the difficulty of a computer code comprehension task with various levels of difficulty. The tasks are performed by software developers, who are monitored using an eye tracker and an electroencephalogram. Fritz et al. use the Beta/(Alpha + Theta) ratio based on the evidence that beta increases with task execution, theta is suppressed, and alpha is blocked. The models obtained classify task difficulty with 85 % accuracy.
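A band-power ratio of this kind can be sketched as follows. The example assumes the ratio has the form beta / (alpha + theta) and uses conventional band limits (theta 4-8 Hz, alpha 8-13 Hz, beta 13-30 Hz); both the limits and the function names are assumptions for illustration rather than details taken from Fritz et al. A plain discrete Fourier transform is used so the sketch stays dependency-free; a real pipeline would more likely use an FFT-based spectral estimate.

```python
import cmath
import math


def band_power(signal, fs_hz, low_hz, high_hz):
    """Sum of squared DFT magnitudes for bins with low_hz <= f < high_hz."""
    n = len(signal)
    mean = sum(signal) / n
    centered = [x - mean for x in signal]  # remove the DC offset
    power = 0.0
    for k in range(1, n // 2 + 1):
        freq = k * fs_hz / n
        if low_hz <= freq < high_hz:
            coeff = sum(
                x * cmath.exp(-2j * math.pi * k * i / n)
                for i, x in enumerate(centered)
            )
            power += abs(coeff) ** 2
    return power


def beta_over_alpha_theta(signal, fs_hz):
    """Assumed index: beta power divided by (alpha + theta) power."""
    theta = band_power(signal, fs_hz, 4.0, 8.0)
    alpha = band_power(signal, fs_hz, 8.0, 13.0)
    beta = band_power(signal, fs_hz, 13.0, 30.0)
    if alpha + theta == 0.0:
        raise ValueError("no power in the alpha and theta bands")
    return beta / (alpha + theta)


if __name__ == "__main__":
    fs = 128  # Hz, synthetic one-second signal
    sig = [
        2.0 * math.sin(2 * math.pi * 20 * i / fs)  # beta component
        + math.sin(2 * math.pi * 10 * i / fs)      # alpha component
        + math.sin(2 * math.pi * 6 * i / fs)       # theta component
        for i in range(fs)
    ]
    print(beta_over_alpha_theta(sig, fs))
```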
Shi et al. [71] assess stress and arousal levels by measuring EDA for increasing levels of difficulty. The experiment consists of a transition interface in which the participants must respond to the requirements in three scenarios: (1) using gestures and speaking, (2) only speaking, and (3) only using gestures. The difficulty varies depending on level of visual complexity, number of entities, number of distractors, time limit, and number of actions to complete. The results indicate that there is a significant increase in the EDA signal as task difficulty increases.
Nourbakhsh et al. [72] confirm the effectiveness of EDA in discriminating between the difficulty of eight arithmetic tasks with four levels of difficulty. In addition, as an extension of the previous study, Nourbakhsh et al. measure mental workload using EDA changes and the number of blinks obtained from an eye-tracking device. The experiment is the same as in the previous study. This time, by combining both sensors, 75 % precision is achieved for the lowest level of difficulty.
Xu et al. [73] show that mental workload can be measured by pupil dilation even if illumination changes. The experiment consists of arithmetic tasks that vary in difficulty depending on the number of digits.
In Ikehara et al. [18], an eye-tracking device, a pressure sensor for the mouse, an EDA sensor, and a pulse oximeter (for measuring HR and level of oxygen in the blood) are used. The experiment consists of selecting on a screen the fractions whose value is less than 1/3. There are two levels of difficulty in the experiment. The results indicate that EDA and pupil dilation have the greatest statistical significance in terms of detecting task difficulty.
Using an elastic neural network, Hogervorst et al. [19] find that the best performance is obtained when EEG is combined with pupil dilation (91 % accuracy) and when EEG is combined with peripheral physiology (89 %); with EEG alone, they obtain 86 % accuracy. In addition, using only the measurement of the electrode located in the Pz position (central parietal area of the head), they obtain 88 % accuracy.

Assessment of Mental Workload in Web Environments
Although the study of users' cognitive responses during Web browsing is an intriguing area, it remains little explored. Indeed, one of the few studies on the topic is that by Albers [3], who examines how mental workload theory applies to the design of Web sites using the tapping test method, which measures mental workload by focusing on performance. As in all the examples using this approach, the tapping test adds a secondary task to the main one, measuring the performance of the participant to determine the level of mental workload induced. In this case, the main task is to browse two Web sites sequentially (with implicit mental workload controlled by design) and answer questions aloud in relation to the Web pages, while the secondary task is to keep tapping rhythmically once per second. As mental workload increases, tapping begins to slow down and lose the rhythm, even losing it completely when there is cognitive overload. However, implementing a secondary task as required by this method prevents the creation of a realistic scenario for the user and does not allow real-time measurement.
The most recent research regarding the observation of Web users' experience involves the measurement of their behavior as a reaction to different stimuli, such as notifications, and allows us to predict the user's response according to Navalpakkam & Churchill [74]. By comparing mouse pointer movement to eye tracking, they are able to determine a more user-friendly layout for a Web site, which improves the effectiveness of the notification. Finally, they conclude that gaze and mouse movement patterns contain important information in terms of assessing the user's status, determining if they are distracted from the assigned task or striving to fulfill it. The correlation between eye movements and mouse pointer movement predicts a Web user's different psycho-emotional states. They also conclude that the user is more likely to pay attention to notifications when they vary in position on the Web site rather than when they are fixed.
As summarized in Table 1, the measurement of mental workload using psychophysiological signals has been tested for a varied set of tasks. In addition, studies have investigated how mental workload is related to the design of a Web page. However, the abovementioned research provides no evidence regarding assessment of mental workload while browsing a Web site using multiple psychophysiological measures. There is also no reference to time overhead to determine how feasible it is to implement real-time measurement.
Partial. Time windows average length for classification of 23.7 s.

Participants
The initial experimental group includes 61 participants (19 women and 42 men), aged between

Psychophysiological Sensors
Psychophysiological sensors have the advantage that measurements do not depend on the user's perception and are not under the control of the user.
Fulfills. Eye tracker, EEG, ECG, heat flux and HR.
[14] Partial. Sliding time windows of sizes from 5 seconds to 60 seconds, sliding 5 seconds between intervals. | Fulfills | Fails. Comprehension tasks of computer code. | Fulfills. Eye tracker, EEG.
[34], [37] Partial. Three silent reading tasks were performed. Each task consisted of four text slides and each slide was presented for 30 seconds. | Fulfills | Fails. Arithmetic tasks. | Partial. EDA and blink.
Fulfills | Fails. Select the fraction whose value is less than 1/3.
[18] Fulfills. The average overall duration of the limits was 550 ms. | Fulfills | Partial. Measures the cost of interruptions in tasks such as: choosing a route, correcting spelling and classifying emails. | Partial. Only the pupillary dilation.
[3] Not applicable. | Fails | Fulfills. Measure the cognitive load on a website. | Fails. Measurement of mental workload by performance: tapping test.
[36] Fails. The participants interacted with each page for about 100-120 seconds. | Fulfills | Fulfills. Study the design of websites in a way that improves the effect of a notification. | Partial. Compare the tracking of the mouse with eye tracking.
To measure the EDA and HR signals, the Shimmer GSR+ unit sensor was used with a sampling frequency of 120 Hz. The position of the electrodes for measuring the EDA was the palm area of the proximal phalanx of the index and ring fingers of the left hand [79]. The optical sensor that functions as a photoplethysmograph (PPG) was attached to the lobe of the right ear [80]. The Shimmer Bridge Amplifier + unit sensor with a sampling frequency of 50 Hz was used to measure body temperature.
The sensor was applied under the right armpit. This sensor was synchronized with the EDA and pulse sensors using a base provided by Shimmer together with Consensys software.
The BITalino BioMedical Development All-in-One Board with a sampling frequency of 1000 Hz was used to measure the ECG. The configuration of the three electrodes followed the lead II standard [81,82]. Before applying the electrodes, the skin was prepared by wiping it with alcohol to remove grease and impurities to reduce noise. In addition, an ECG gel was used. OpenSignals evolution software provided by the manufacturer was used [83].
To measure the EEG, the Emotiv EPOC EEG sensor with a sampling frequency of 128 Hz was used. The sensor was attached to the head, positioning the reference sensors first. To improve the conduction of the electrical signals of the brain, each electrode was previously hydrated. To capture the data and verify that the sensor was properly applied, the Emotiv Xavier Testbench software provided by the manufacturer was used.
The Tobii T120 Eye Tracker with a sampling frequency of 120 Hz was used to measure pupil dilation and for eye tracking. Tobii Studio software was used for calibration and to perform data collection [84].
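Because the devices above sample at different rates (50, 120, 128 and 1000 Hz), combining their signals requires a common time base. The sketch below shows one possible approach, linear interpolation onto a shared timeline, under the assumption that all streams already share a clock origin. The text only states that the Shimmer units were synchronized through the vendor base and software, so this alignment step and its function names are illustrative assumptions.

```python
from bisect import bisect_left


def uniform_times(duration_s, fs_hz):
    """Sample timestamps (seconds) for a stream of the given rate."""
    return [i / fs_hz for i in range(int(duration_s * fs_hz))]


def resample_linear(times, values, target_times):
    """Linearly interpolate (times, values) at target_times.

    Targets outside the recorded range are clamped to the nearest
    recorded value instead of being extrapolated.
    """
    out = []
    for t in target_times:
        if t <= times[0]:
            out.append(values[0])
        elif t >= times[-1]:
            out.append(values[-1])
        else:
            j = bisect_left(times, t)  # first index with times[j] >= t
            t0, t1 = times[j - 1], times[j]
            w = (t - t0) / (t1 - t0)
            out.append(values[j - 1] + w * (values[j] - values[j - 1]))
    return out


if __name__ == "__main__":
    # Synthetic 50 Hz temperature stream aligned to a 120 Hz timeline.
    temp_t = uniform_times(1.0, 50)
    temp_v = [36.5 + 0.1 * t for t in temp_t]
    eye_t = uniform_times(1.0, 120)
    aligned = resample_linear(temp_t, temp_v, eye_t)
    print(len(aligned), aligned[:3])
```

Upsampling a slow signal such as temperature this way adds no information; for down-sampling the 1000 Hz ECG, a low-pass filter before decimation would normally be preferable to plain interpolation.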

Experimental Procedure
A fictitious Web site was created whose basic configuration is shown in Figure 2. This layout of the Web elements was maintained throughout the experiment. The elements within the Web site were seven news headings with their respective representative image, four rectangular advertisements, a typical navigation bar with a menu, the logo of the page in the upper left corner, and a bar at the bottom of the page. Each participant was tested individually at the laboratory. A physically isolated experimental room was used to keep the experimental configuration and the environment constant for all participants. In addition, the room did not receive any sunlight, to avoid the effects of infrared light on measurements and to maintain constant illumination conditions that do not affect pupil diameter measurements [85].
As soon as the participant arrived in the experimental room, the experiment was explained to her, and she was asked to read and sign the informed consent, as well as to complete a questionnaire collecting her basic anonymous information. The participant was seated in front of the screen, and the sensors were connected in the following order: ECG, axillary temperature, EEG, EDA, and PPG; then the eye tracker was calibrated with the help of the participant (Figure 1).
Prior to the tests, each user underwent a relaxation period consisting of viewing three four-minute videos of landscapes with background instrumental music. Then, the participant was asked to take deep breaths for one minute with eyes closed and with soft background instrumental music. This procedure aimed to eliminate the Hawthorne effect (a modification in the behavior of the subjects due to their awareness of being studied) and physiological effects similar to the "white coat" effect in measured signals [86]. Next, the participant was asked to maintain a fixed posture, sitting in front of the computer, without moving the head or the left hand, where the sensors were connected. The instructions were that the user could freely browse the Web site for as long as they wanted and indicate when they wanted to finish. Finally, all sensors were removed from the participant, and she was asked not to disclose the experimental procedure to others.

Time Window Definition
Bailey et al. [23] show that mental workload decreases during transitions between subtasks. For this paper, the analysis of each Web element is considered a specific subtask and the passage between them as the transition period between subtasks. Thus, in this study, mental workload is assessed during two time windows:
• Active window: Time during which the user fixes her gaze on a specific area of interest (AoI), which may correspond to a news headline, an advertisement, or the menu bar of the Web site.
• Transition window: Time that elapses while the user is not fixing her gaze on any of the areas of interest. It can be a transition between two elements or towards the same element.
As illustrated in Figure 3, the red rectangles represent the studied AoIs; the blue circles represent fixations, which size varies in accordance with the fixation time and the blue lines represent the saccades.Thus, the time a fixation is into an AoI pertains to an active window.The time between two fixations, such as fixation one and fixation two, pertains to a transition window.
Note that the transition window between fixation two and four add the fixation three, which does not fall into any AoI.
To discriminate between types of windows, the data file exported from the Tobii Studio program generates a column showing the AoI that the participant is inspecting for each sample.It discriminates between 3 values: when the user is not looking at the screen -inactive -, when the user is looking at a certain AoI -active window -, and when the user's gaze is directed outside the AoIwhich is considered a transition window.
A long minimum time of 500 milliseconds is set to define a valid time window.This is based on the research of Loyola et al. [87], who assesses the identification of key Web elements in a Web site using eye tracking.This time span is selected to avoid possible contamination of the pupil signal by the analysis of a previous object.Time windows below the threshold are not considered for analysis and are therefore deleted.When the same Web element is analyzed before and after a deleted window, the two segments are joined, generating a window of greater length.
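The window rule above can be illustrated with a short sketch. It assumes the windows have already been cut from the per-sample AoI column; the `Window` record, its field names, and the function name are hypothetical and are not part of the Tobii Studio export. Windows shorter than 500 ms are dropped, and active windows on the same AoI that become neighbours after a deletion are joined.

```python
from dataclasses import dataclass
from typing import List, Optional

MIN_WINDOW_MS = 500  # minimum valid window length stated in the text


@dataclass
class Window:
    kind: str            # "active" or "transition" (hypothetical labels)
    aoi: Optional[str]   # AoI name for active windows, None for transitions
    start_ms: float
    end_ms: float

    @property
    def duration_ms(self) -> float:
        return self.end_ms - self.start_ms


def filter_and_merge(windows: List[Window], min_ms: float = MIN_WINDOW_MS) -> List[Window]:
    """Drop windows shorter than min_ms, then join neighbouring active
    windows on the same AoI (they were separated only by a deleted window)."""
    kept = [w for w in windows if w.duration_ms >= min_ms]
    merged: List[Window] = []
    for w in kept:
        last = merged[-1] if merged else None
        if last and last.kind == "active" and w.kind == "active" and last.aoi == w.aoi:
            # Same Web element before and after a deleted window: extend it.
            merged[-1] = Window("active", w.aoi, last.start_ms, w.end_ms)
        else:
            merged.append(w)
    return merged
```

In this sketch only active windows are joined, since the text describes the joining rule for Web elements; how adjacent transition windows should be treated is left open.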

Data Preprocessing
The data exported from Tobii Studio contain the diameter of the left pupil, the diameter of the right pupil (both in millimeters), and a validity code for the capture of each pupil ranging from 0 (high reliability) to 4 (the eye was not detected). On average for all participants and considering only valid windows, the reliability of the capture of the left pupil is 0.2469, and that of the right pupil is 0.22036; these are reliable values to validate the capture of pupil diameter data. As these values are an average for all the participants, the pupil data with the highest level of reliability are selected for each sample [16].
Next, signal distortion artifacts, such as saccades and blinks, are eliminated. A column in the extracted data shows whether the sample is a fixation or a saccade, and this information is used to filter out saccades. Furthermore, the values within detected blinks are replaced by linear interpolation. In addition, a Blackman window with a cut-off frequency of 2 Hz is applied as a low-pass filter.
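A minimal sketch of these two steps follows, using only the Python standard library. It assumes blink samples are marked as `None`, that the low-pass filter is a windowed-sinc FIR filter shaped by a Blackman window, and that the number of taps and the sampling rate are free parameters; none of these details are specified in the text, and the function names are hypothetical.

```python
import math
from typing import List, Optional


def interpolate_blinks(samples: List[Optional[float]]) -> List[float]:
    """Linearly interpolate runs of None (blink samples) between valid neighbours."""
    if all(s is None for s in samples):
        raise ValueError("no valid samples to interpolate from")
    out = list(samples)
    n = len(out)
    i = 0
    while i < n:
        if out[i] is None:
            j = i
            while j < n and out[j] is None:
                j += 1
            left = out[i - 1] if i > 0 else None
            right = out[j] if j < n else None
            for k in range(i, j):
                if left is None:        # gap at the start: hold the first valid value
                    out[k] = right
                elif right is None:     # gap at the end: hold the last valid value
                    out[k] = left
                else:
                    frac = (k - i + 1) / (j - i + 1)
                    out[k] = left + frac * (right - left)
            i = j
        else:
            i += 1
    return out


def blackman_lowpass_kernel(cutoff_hz: float, fs_hz: float, num_taps: int = 61) -> List[float]:
    """Windowed-sinc low-pass FIR kernel shaped by a Blackman window (unit DC gain)."""
    fc = cutoff_hz / fs_hz              # normalised cut-off in cycles per sample
    mid = (num_taps - 1) / 2
    kernel = []
    for n in range(num_taps):
        x = n - mid
        sinc = 2 * fc if x == 0 else math.sin(2 * math.pi * fc * x) / (math.pi * x)
        window = (0.42 - 0.5 * math.cos(2 * math.pi * n / (num_taps - 1))
                  + 0.08 * math.cos(4 * math.pi * n / (num_taps - 1)))
        kernel.append(sinc * window)
    total = sum(kernel)
    return [k / total for k in kernel]


def fir_filter(signal: List[float], kernel: List[float]) -> List[float]:
    """Centred convolution; edges are handled by repeating the end samples."""
    half = len(kernel) // 2
    padded = [signal[0]] * half + list(signal) + [signal[-1]] * half
    return [sum(k * padded[i + j] for j, k in enumerate(kernel))
            for i in range(len(signal))]
```

A call such as `fir_filter(interpolate_blinks(pupil), blackman_lowpass_kernel(2.0, fs))` would chain the two steps, where `fs` is the eye tracker's sampling rate.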
The raw temperature data yield body temperature values in degrees Celsius. The processing of this signal consists of using a low-pass filter with a cut-off frequency of 1 Hz, as concluded based on the data collection in Haapalainen et al. [17].
The EEG signal is subject to a wide variety of artifacts and noise [89,90]. Among the elements that cause artifacts are blinking, oculomotor activity, head movements, facial expressions that add noise due to the muscle electrical signal, and movement of the electrodes, among others. To eliminate the effect of head swinging, a high-pass filter with a cut-off frequency of 0.5 Hz is used. In addition, a low-pass filter with a cut-off frequency of 40 Hz is used to eliminate noise from the electrical grid (50-60 Hz). To eliminate outliers and decrease the effect of the blinking artifact, a Hampel filter is used [91].
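For reference, a Hampel filter replaces a sample by the local median when it deviates from that median by more than a multiple of the scaled median absolute deviation. The sketch below shows the general technique; the window half-width and the threshold are illustrative assumptions, since the text does not report the values used.

```python
from statistics import median
from typing import List


def hampel_filter(x: List[float], half_window: int = 3, n_sigmas: float = 3.0) -> List[float]:
    """Replace outliers by the median of a sliding window (Hampel identifier)."""
    k = 1.4826  # scales the MAD to the standard deviation for Gaussian data
    out = list(x)
    for i in range(len(x)):
        lo = max(0, i - half_window)
        hi = min(len(x), i + half_window + 1)
        window = x[lo:hi]
        med = median(window)
        mad = median([abs(v - med) for v in window])
        if abs(x[i] - med) > n_sigmas * k * mad:
            out[i] = med  # sample flagged as an outlier
    return out
```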

Feature Extraction
Feature extraction is performed based on time windows. Since the signals have different scales, they must be standardized before characteristics are extracted so that they are comparable, as proposed by Guyon et al. [92]. To perform standardization, the classical $(x - \mu)/\sigma$ form is used, where $x$ is the vector corresponding to the signal, and $\mu$ and $\sigma$ are the mean and the standard deviation of the signal, respectively. Table 2 shows a summary of the characteristics, following which the obtained characteristics are presented in more detail.
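As a small illustration of the standardization step, the sketch below computes the z-score of a signal. The use of the population standard deviation is an assumption; the text does not state whether the population or the sample estimator is used.

```python
from statistics import mean, pstdev
from typing import List


def standardize(x: List[float]) -> List[float]:
    """Return (x - mean) / standard deviation for every sample of the signal."""
    mu = mean(x)
    sigma = pstdev(x)  # population standard deviation (assumption)
    if sigma == 0:
        raise ValueError("constant signal cannot be standardized")
    return [(v - mu) / sigma for v in x]
```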
Because it has been proven that pupillary response is an important indicator of the mental effort required to solve a task, it is selected as the gold standard by which to cluster windows and generate labels for cognitive levels.There are clustering cases in the literature regarding the development of Web tasks such as the study of Loyola et al. [87].The selected characteristics are the mean and variance of the pupil diameter of the eye that displays greater reliability in its measurement.
Based on the findings of Nourbakhsh [72] and Shi et al. [71], the following characteristics are extracted from the processed EDA signal: accumulated normalized data, mean as a function of normalized time, and spectral power without the normalized continuous component. Equation (2) shows the calculation of the normalized EDA signal. Each point in time is added, where $i$ corresponds to the participant, $j$ to the task, and $m$ is the total number of tasks; $m = 3$ in this case:

$$\mathrm{EDA}_{N}(i,j,t) = \frac{\mathrm{EDA}(i,j,t)}{\frac{1}{m}\sum_{k=1}^{m}\sum_{t'}\mathrm{EDA}(i,k,t')} \qquad (2)$$

Therefore, the data for each participant are normalized by dividing the task signal by the mean value of all the tasks for the subject. Then, the accumulated EDA characteristics are calculated as shown in Equation (3), and mean EDA is calculated according to Equation (4), where $T$ is the total time for all the tasks:

$$\mathrm{EDA}_{\mathrm{acc}}(i,j) = \sum_{t}\mathrm{EDA}_{N}(i,j,t) \qquad (3)$$

$$\mathrm{EDA}_{\mathrm{mean}}(i,j) = \frac{1}{T}\sum_{t}\mathrm{EDA}_{N}(i,j,t) \qquad (4)$$

The following characteristics are extracted from the phasic component obtained: number of peaks, maximum modulus, and average of the phasic component of the window [16].
Based on the proposal by Haapalainen et al. [17], the following characteristics are selected for the ECG signal: median, mean, and variance of the ECG median absolute deviation (ECG_MAD), calculated using Equation (5):

$$\mathrm{ECG\_MAD} = \mathrm{median}\left(\left|x_i - \mathrm{median}(x)\right|\right) \qquad (5)$$

where $x_i$ are the samples of the ECG signal $x$. The characteristics of the heart rate obtained from the PPG signal are selected based on the time domain characteristics used in Betella [93]. These are the mean, standard deviation, and root mean square of HR. Based on the proposal by Haapalainen et al. [17], the median and mean of the temperature are selected.
For the EEG signal, there are two main approaches: event-related potential (ERP) analysis and time-frequency signal analysis. The latter is selected because it is more closely related to the psychophysiological and structural processes of the brain [89]. It is used to study emotional-cognitive states in particular and is more advisable when studying a limited period or a relatively small amount of data, as is the case for the time-window study of this paper [94]. Among the different ways of analyzing the EEG signal in time-frequency are frequency bands with the Fourier transform, Morlet wavelets, and the Hilbert transform. All three show similar results according to Cohen [95]. Thus, the Hilbert transform (HT) is selected, which has the advantage of greater control over frequency filtering. Equation (6) shows this transform:

$$z(t) = x(t) + i\,(h * x)(t) \qquad (6)$$

where $h(t) = 1/(\pi t)$, $*$ denotes convolution, $x(t)$ is the EEG signal, and $z(t)$ is the resulting analytic signal. Before applying this transform, a bandpass filter between 2 and 15 Hz is used to center the study on the theta (4-8 Hz) and alpha (8-12 Hz) frequency bands. These are related to states of mental activity and relaxation, respectively, where theta increases and alpha is suppressed when there is mental workload [94]. A complex signal called the "analytic signal" is then obtained, from which two characteristics are extracted. This is performed for each of the 14 channels of the EEG signal.
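The analytic signal can be sketched for a discrete window as follows. This is a frequency-domain construction (zeroing the negative frequencies and doubling the positive ones) written with a plain O(N²) discrete Fourier transform so that it needs only the standard library; it is not the authors' implementation. The text does not name the two characteristics taken from the analytic signal, so the amplitude and phase returned by the hypothetical `envelope_and_phase` helper are only the quantities usually read from it.

```python
import cmath
import math
from typing import List, Tuple


def dft(x: List[complex], inverse: bool = False) -> List[complex]:
    """Plain discrete Fourier transform (O(N^2)), adequate for short windows."""
    n = len(x)
    sign = 1 if inverse else -1
    out = []
    for k in range(n):
        s = sum(x[t] * cmath.exp(sign * 2j * math.pi * k * t / n) for t in range(n))
        out.append(s / n if inverse else s)
    return out


def analytic_signal(x: List[float]) -> List[complex]:
    """Analytic signal z = x + i*H(x), built by suppressing negative frequencies."""
    n = len(x)
    if n == 0:
        return []
    spectrum = dft([complex(v) for v in x])
    gain = [0.0] * n
    gain[0] = 1.0                       # keep the DC component once
    if n % 2 == 0:
        gain[n // 2] = 1.0              # keep the Nyquist component once
        for k in range(1, n // 2):
            gain[k] = 2.0               # double the positive frequencies
    else:
        for k in range(1, (n + 1) // 2):
            gain[k] = 2.0
    return dft([g * s for g, s in zip(gain, spectrum)], inverse=True)


def envelope_and_phase(z: List[complex]) -> Tuple[List[float], List[float]]:
    """Instantaneous amplitude and phase of an analytic signal."""
    return [abs(v) for v in z], [cmath.phase(v) for v in z]
```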

Clustering
Clustering is performed per participant to determine how many levels of mental workload the user presents, based on the measurement of pupil diameter, so that the database can be labeled once these levels are known. Following Loyola et al. [87], the k-means method is used. Because an overestimation or underestimation of the number K of clusters affects the quality of the clustering, the optimal number of clusters is sought. The K value is tested from two onwards to obtain two curves.
The index of Calinski & Harabasz (CH) and the within-group sum of squares (WSS), an internal measure of cohesion, are selected to this end [96][97][98]. The stopping rule is the value of K closest to the area where the two curves intersect. Figure 4 shows an example of this methodology for participant 59, where the intersection occurs at K = 3. Visually, the grouping can be validated by considering Figure 5.
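The two quantities behind these curves can be sketched for a given labeling. The code assumes each time window is a point (for example, mean and variance of pupil diameter) and that cluster labels have already been produced by k-means; it uses the standard definitions of WSS and of the CH index, with hypothetical function names.

```python
from typing import Dict, Hashable, List, Sequence

Point = Sequence[float]


def _centroid(points: List[Point]) -> List[float]:
    return [sum(c) / len(points) for c in zip(*points)]


def _sq_dist(a: Point, b: Point) -> float:
    return sum((u - v) ** 2 for u, v in zip(a, b))


def _group(points: List[Point], labels: List[Hashable]) -> Dict[Hashable, List[Point]]:
    groups: Dict[Hashable, List[Point]] = {}
    for p, label in zip(points, labels):
        groups.setdefault(label, []).append(p)
    return groups


def wss(points: List[Point], labels: List[Hashable]) -> float:
    """Within-group sum of squares: squared distances to each cluster centroid."""
    total = 0.0
    for members in _group(points, labels).values():
        c = _centroid(members)
        total += sum(_sq_dist(p, c) for p in members)
    return total


def calinski_harabasz(points: List[Point], labels: List[Hashable]) -> float:
    """CH index: between-group dispersion over within-group dispersion.
    Undefined (division by zero) when every cluster is a single repeated point."""
    groups = _group(points, labels)
    n, k = len(points), len(groups)
    if k < 2 or n <= k:
        raise ValueError("CH index needs 2 <= K < N")
    overall = _centroid(list(points))
    bss = sum(len(m) * _sq_dist(_centroid(m), overall) for m in groups.values())
    return (bss / (k - 1)) / (wss(points, labels) / (n - k))
```

Evaluating both functions for K = 2, 3, ... yields the two curves whose crossing region is used as the stopping rule.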
The Jaccard coefficient obtained using the bootstrap method is used as an external criterion for validating clusters, which assesses how stable the cluster is [96,97]. Values between 0.6 and 0.75 indicate that the group is measuring a pattern in the data, but there is no certainty as to which points should be grouped. Groups with stability values above approximately 0.85 can be considered highly stable (real clusters). There are participants who present well-defined clusters with Jaccard coefficients very close to one, and others with values far from acceptable.
On average, the coefficients are over 0.75, so clustering is accepted as valid.For example, for the clusters in Figure 5, the Jaccard indices are 0.6288 (cluster 1), 0.9024 (cluster 2), and 0.8517 (cluster 3).
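One common way to obtain such a bootstrap stability value is sketched below, under the assumption that each bootstrap run re-clusters a resample of the windows, compares the original cluster (restricted to the resampled windows) with its best-matching new cluster, and averages the result over runs. The text does not spell out the procedure, so this is illustrative only, and the function names are hypothetical.

```python
from typing import List, Set, Tuple


def jaccard(a: Set[int], b: Set[int]) -> float:
    """Jaccard coefficient |A ∩ B| / |A ∪ B| of two sets of window ids."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def mean_stability(original: Set[int],
                   runs: List[Tuple[Set[int], List[Set[int]]]]) -> float:
    """Average, over bootstrap runs, of the best Jaccard match for one cluster.
    Each run is (ids drawn in the resample, clusters found on that resample)."""
    scores = []
    for resampled_ids, clusters in runs:
        restricted = original & resampled_ids
        scores.append(max(jaccard(restricted, c) for c in clusters))
    return sum(scores) / len(scores)
```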
Considering all the valid participants, the number of clusters varies between three and six levels of mental workload, and on average, there are approximately four levels of mental workload validated with acceptable cohesion indices (RQ1).

Classes with very few elements, sometimes just one or two time windows, are deleted, since it would not make sense to classify them: they would appear only in the training set and not in the test set.
Two classification models are applied: an artificial neural network combined with recursive feature elimination (NN-RFE) and a multi-layer perceptron (MLP) with two hidden layers. Each classification result is the average obtained from executing each algorithm 100 times.
Next, the implementation of each model is described.
An artificial neural network with one hidden layer is implemented, with all the artificial neurons completely interconnected and trained with the backpropagation algorithm. To calculate the number of neurons in the hidden layer ($h$), the heuristic method of the geometric pyramid rule is used with the expression $h = \sqrt{m \cdot n}$, which consists of calculating the square root of the product of the number of inputs ($m$), that is, the number of characteristics, and the number of outputs ($n$), that is, the number of classes. Therefore, the number of neurons in the hidden layer changes per participant.
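As a worked example of the geometric pyramid rule, the sketch below computes the hidden-layer size for a given number of inputs and outputs. Rounding to the nearest integer is an assumption, since the text does not say how fractional values are handled, and the input counts used in the usage note are hypothetical.

```python
import math


def hidden_neurons(num_inputs: int, num_outputs: int) -> int:
    """Geometric pyramid rule: h = sqrt(m * n), rounded to the nearest integer."""
    return max(1, round(math.sqrt(num_inputs * num_outputs)))
```

For instance, a participant with 16 selected characteristics and 4 classes would get `hidden_neurons(16, 4)`, that is, 8 hidden neurons.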
This classifier is combined with the RFE method for feature selection.
There is evidence in the literature of good results from the random forest and recursive feature elimination (RF-RFE) method for the selection of characteristics when classifying mental fatigue from EEG signals [99]. This method combines recursive elimination with a random forest, that is, a set of decision trees that assesses features and generates a ranking following a score criterion. This method of feature selection is implemented with the Caret and Random Forest packages in R. The neural network is executed using Matlab's Neural Network Toolbox with the toolbox's Neural Net Pattern Recognition tool (nprtool). It is executed once the characteristics have been obtained, per participant, with the RFE method. Table 3 shows the characteristics selected for six participants as an example.
To test another way of improving the artificial neural network without using feature selection, a different neural network configuration is tested: MLP. For the implementation of the MLP neural network, the H2O package in R is used [100]. The programmed neural network has two hidden layers with 100 neurons each, with a rectified linear activation function, as used by Hinton [101]. The key, according to Hinton, to reducing overfitting is to include a 50 % dropout for each layer, which prevents artificial neurons from co-adapting to training data. Thus, each neuron in the hidden layers is omitted at random from the network with a probability of 0.5. In addition, another method added to avoid model overfitting is $L_1$ and $L_2$ regularization as a linear combination, as shown in Equation (7). For this, the objective function for the artificial neural network is defined as $L(W, B \mid j)$, where $W$ represents the weight matrix and $B$ the column of bias vectors for each training example $j$.
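The two overfitting controls described above can be illustrated independently of any library. The sketch below is not H2O code: it shows a penalized objective in which the base loss is combined linearly with an $L_1$ term (sum of absolute weights) and an $L_2$ term (sum of squared weights), and a random dropout mask with drop probability 0.5. The function names are hypothetical, and any additional scaling a particular library may apply to the penalties is not covered.

```python
import random
from typing import List, Optional


def regularized_loss(base_loss: float, weights: List[float],
                     lambda1: float, lambda2: float) -> float:
    """Base loss plus a linear combination of L1 and L2 penalty terms."""
    l1_penalty = sum(abs(w) for w in weights)
    l2_penalty = sum(w * w for w in weights)
    return base_loss + lambda1 * l1_penalty + lambda2 * l2_penalty


def dropout_mask(num_neurons: int, p_drop: float = 0.5,
                 rng: Optional[random.Random] = None) -> List[int]:
    """0/1 mask: each hidden neuron is omitted with probability p_drop."""
    rng = rng or random.Random()
    return [0 if rng.random() < p_drop else 1 for _ in range(num_neurons)]
```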

Statistical Analysis
Based on Bailey (2008) [23], who showed a decrease in mental workload between subtask boundaries, the hypothesis that there is a decrease in mental workload in the transition time windows between the analysis time windows of one Web element and another is proposed.To verify the hypothesis, the mean pupil diameter within each window is selected as our gold standard.
The objective is to determine whether mean pupil diameter varies depending on whether it is measured in an active window or in a transition window. An analysis of variance with repeated measures (ANOVA-RM) is performed since the factors to be studied are within-subjects. For the analysis, the complete universe of windows of all the participants is considered.
As a result, a p-value of 0.00184 is obtained at a 95 % confidence level, so the null hypothesis is rejected. In addition, as shown in Table 4, mean pupil diameter in the transition windows is smaller than in the active windows. Therefore, it is concluded for the data as a whole that the difference between mean pupil diameter in the active windows and the transition windows is statistically significant and that the diameter is smaller in the transition windows (H1).

Evaluating Psychophysiological Sensors
To assess the performance of each sensor, the MLP neural network, which obtains the best results with all the sensors, is selected as the supervised learning model. Then, combinations of the three best-performing sensors are tested: EEG, EDA, and PPG (HR). As shown in Table 9, the combinations with EEG provide the best results. The combination with the highest performance is EDA, PPG (HR), and EEG, with 95.73 %.
An important difference between the EEG sensor and the others is that it allows the extraction of a greater number of characteristics because the 14 electrodes contribute two characteristics each, for a total of 28 characteristics. This factor may explain the superior performance of this sensor compared to the rest. Therefore, it is concluded that it is possible to obtain good classification results for this experimental design with fewer than five sensors, even with the EEG alone. The temperature sensor and the ECG can thus be discarded.

Evaluating Time Overhead
For mental workload classification in Web site browsing to lead to a real application, processing must be sufficiently fast, given that the time windows considered have a minimum length of 500 milliseconds.Table 10 shows the classification time for the algorithms used that yielded the best results.
The results show that, for the artificial neural network with RFE, the classification time is very small, 0.0083 seconds on average, which ensures that this model can be implemented in real time with an acceptable classification mean of 68.94 % (see Table 5). On the other hand, for the model based on MLP, the effect of its parameters, given by the 100 neurons in each of the two hidden layers, increases the processing time to an average of 0.1 seconds. This model can therefore be implemented in real time as well, with a classification average of 99.1 % accuracy for all sensors and a reasonable 63.36 % for more portable sensors (PPG and EDA, see Table 9).

Discussion
The results of the statistical analysis determine that pupil diameter in the transition time windows is statistically and significantly lower than in the active windows of Web element analysis.
Given the proven correlation between pupil dilation and mental workload, it is determined that there is a decrease in mental workload in the time windows between the analysis of one Web element and another (H1).
A possible application of the demonstration of this hypothesis (H1) is the generation of recommendation systems that support the user during Web browsing according to her interest, that is, when she is not cognitively overloaded with new content. This is applicable, for example, to retail applications or advertising.
Regarding the assessment of the psychophysiological sensors to estimate mental workload, with the exception of the EEG, the signals of the sensors used do not provide an appropriate level of classification by themselves for this type of task, although the combinations of signals with EEG stand out, obtaining very good results.One of the reasons may be the time constant of each signal; that is, signals such as skin temperature or conductivity take longer to react compared to electrical signals from the brain.
However, despite the fact that the combination of EDA and PPG (HR) does not provide better results than EEG alone, a reasonable level of accuracy (63.36 %) is obtained for its use in practice, even before portable EEG technology is available.The advantage of the EDA and PPG sensors is that they are non-invasive, portable, cheaper, and easily integrated into a board that transmits via Bluetooth or other wireless means to a gateway, such as a smartphone, from a wearable such as a smartwatch, a wristband, or a textile [88].In addition, considering that the time overhead of the classification in each time window is very small (on average 0.1056 s, with a standard deviation of 0.0091 s), EDA and PPG are considered feasible alternatives for the first practical applications of the real-time assessment of mental workload in users browsing Web pages.

Conclusion
The study of human behavior and physiology when performing human-computer interaction activities is complex due to the multiple factors that affect each person's performance and behavior in this class of tasks. This research assesses the behavior of a user in the simple task of browsing freely through a fictitious Web page created specifically for this study, using psychophysiological sensors.
It is shown that for the complete data set, that is, considering the complete universe of windows of all the participants, pupil diameter, as a measure of mental workload, is significantly lower in the transition windows than in the active windows, with a p-value of 0.00184 at a 95 % confidence level. Therefore, patterns of low mental workload states are identified, and the hypothesis (H1) is verified: it is indeed possible to measure mental workload in Web browsing activities and, moreover, the mental workload of the user decreases in the transition from the analysis of one Web element to another while browsing freely.
The unsupervised model of k-means analysis as a data mining technique is applied to the mean and variance of pupil dilation, based on which the Web browsing task involves on average four levels of mental workload.Thus, it is concluded that there are several mental workload states that can be determined (RQ1).
To classify levels of mental workload, the MLP neural network is used, which obtains a result of 88.46 % accuracy on average (RQ2). In addition, the electroencephalogram is the sensor that obtains the best results, classifying with 88.78 % accuracy. If the EEG is combined with the PPG and EDA, the accuracy of the classification rises to 95.73 %. In terms of future lines of research, it is proposed to use the data to study Web users' mood behavior together with their cognitive behavior. In addition, it is proposed to focus the research on the EEG sensor, which showed superior performance, using other analytical approaches, such as wavelets and/or ERP, to determine the most relevant brain areas involved.

The participants were between 19 and 35 years old (mean age = 23.8 years, SD = 3.2 years), all engineering students at the University of Chile, recruited through the institutional news Web application. None of them suffered from cardiovascular diseases or was taking medications that could have affected their normal behavior. All of them were familiar with browsing tasks. Each session had a duration of approximately 60 min. The final experimental group is composed of 53 people. Eight participants were discarded due to various problems during signal measurement and processing. This research has the approval of the Research Ethics Committee at the Faculty of Physical and Mathematical Sciences at the University of Chile. In addition, all of the participants read an informed consent form and agreed to sign it. The consent form contained information about the procedure, the purpose of the experiments, voluntary participation, the right to decline to participate at any moment, how to access the research results, and the researchers' information.

Figure 2. Example of a dummy Web page for the experiment.

Figure 3. Example of active window and transition window.
The EDA raw data provide the values of the electrical resistance of the skin in kilohms [kΩ]. To reduce noise and eliminate motion artifacts, two procedures are performed: first, each participant is given a strict instruction not to move the hand or fingers where the electrodes are attached, and second, the signal is filtered with a low-pass filter with a cut-off frequency of 5 Hz. Furthermore, on the recommendation of the literature [88], capture resolution is reduced without risk of data loss: the EDA signal, measured with a sampling frequency of 120 Hz, is reduced to 10 samples per second. The phasic component is extracted by applying a median filter with a window width of -4/+4 and subtracting the average of the current sample [88]. This component allows the detection of peaks of the EDA signal. With slow transitions, the phasic component does not show major variations.

Regarding the electrocardiogram, the raw data yield values that must be transformed to millivolts [mV]. The processing of this signal consists of using a low-pass filter with a cut-off frequency of 100 Hz and applying the fast Fourier transform to obtain the characteristic shape.

The raw data of the PPG yield signal values in millivolts [mV]. From this signal, it is possible to obtain the HR. First, the PPG signal is processed using a low-pass filter with a cut-off frequency of 16 Hz with a Blackman window, obtaining a cleaner signal. Then, HR is obtained via the following steps: first, the peaks are found; second, the time between consecutive peaks is computed ($\Delta t$ in [milliseconds/pulse]); third, this interval is converted from milliseconds to seconds and from [seconds/pulse] to [pulses/second], which is then multiplied by 60 to convert to [beats/minute]. This is summarized in Equation (1):

$$\mathrm{HR} = \frac{60}{\Delta t / 1000}\ \text{[beats/minute]} \qquad (1)$$
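The conversion from peak times to heart rate can be sketched as follows. The sketch assumes the PPG peaks have already been detected and are given as timestamps in milliseconds; the function name is hypothetical.

```python
from typing import List


def heart_rate_bpm(peak_times_ms: List[float]) -> List[float]:
    """Heart rate in beats per minute from consecutive PPG peak times in ms."""
    intervals_ms = [b - a for a, b in zip(peak_times_ms, peak_times_ms[1:])]
    # 60 s/min divided by the interval in seconds equals 60000 / interval in ms.
    return [60000.0 / dt for dt in intervals_ms]
```

For example, peaks 800 ms apart correspond to 75 beats per minute.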

Figure 4. Optimal number of clusters according to the intersection method of CH and WSS curves for participant 59.

Figure 5. Optimal grouping of time windows according to their level of cognitive load for the participant 59.
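The regularized objective function of Equation (7) is

$$L'(W, B \mid j) = L(W, B \mid j) + \lambda_1 R_1(W, B \mid j) + \lambda_2 R_2(W, B \mid j) \qquad (7)$$

where the values of $\lambda_1$ and $\lambda_2$ are parameters that weight the relative contribution of the penalty terms $R_1$ and $R_2$ (the $L_1$ and $L_2$ rules, respectively) in relation to the objective function $L(W, B \mid j)$. The values of $\lambda_1$ = 10 and $\lambda_2$ = 10 are determined as recommended in the H2O manual [100].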

Table 1. Related work analysis.

...tion, they are becoming less intrusive and allow tasks to be performed in various scenarios, giving greater ecological validity to the experiments. They also allow real-time data capture [16,50].

Two characteristics are extracted from pupil dilation, six from EDA, two from body temperature, three from ECG, three from PPG-HR, and two from each of the 14 EEG channels.

Table 3. Selected features with the RFE method for different participants.

Table 5 shows that the worst accuracy obtained with the NN-RFE is 45.24 % (participant 48) with five classes, and the best result is 95.24 % (participant 23) with two classes. The classification mean including all 53 participants is 68.94 % accuracy with a standard deviation of 11.54 %. The results of the classification according to the number of final classes are also analyzed. As shown in Table 6, there is a tendency for the classification percentage to decrease as the number of classes increases. In particular, an accuracy of 90.61 % is obtained for the classification of two classes and 73.34 % for three classes, and acceptable values are obtained for four and five classes.

Regarding the MLP, Table 7 shows that the worst accuracy obtained is 72.16 % (participant 17) with four classes, and the best result is 99.9 % (participant 9) with three classes. The classification mean including all 53 participants is 88.46 % with a standard deviation of 7.94 % for the accuracy measure (RQ2). The classification analysis is performed according to the number of final classes. As shown in Table 8, the trend observed for NN-RFE is maintained, such that the higher the number of classes is, the lower the classification percentage, but with a break in the case of four and five classes.

Table 4. Standardized means of pupillary diameter for transition and active windows.

Table 5. Results of classification using NN-RFE.

Table 6. Average classification using NN-RFE by quantity of classes.

Table 7. Results of classification using MLP.

Table 8. Average classification using MLP by quantity of classes.

Table 9 shows the results of assessing the performance of each sensor separately. The sensor with the best performance is EEG, with 88.78 % accuracy in the classification, slightly superior to the classification using all the sensors. The other sensors separately have a very low level of classification accuracy.

Table 9. Summary of sensor classification results for MLP with 100 neurons in each hidden layer and 400 epochs.

Table 10. Testing time for models.