Datasets for Cognitive Load Inference Using Wearable Sensors and Psychological Traits

This study introduces two datasets for multimodal research on cognitive load inference and personality traits. Unlike other datasets in Affective Computing, which disregard participants' personality traits or focus only on emotions, stress, or cognitive load elicited by one specific task, the participants in our experiments performed seven different tasks in total. In the first dataset, 23 participants played a smartphone game at varying difficulty levels (easy, medium, and hard). In the second dataset, 23 participants performed six psychological tasks on a PC, again at varying difficulty levels. In both experiments, the participants completed personality trait questionnaires and reported their perceived cognitive load using NASA-TLX after each task. Additionally, the participants' physiological response was recorded using a wrist device measuring heart rate, beat-to-beat intervals, galvanic skin response, skin temperature, and three-axis acceleration. The datasets allow multimodal study of individuals' physiological responses in relation to their personality and cognitive load. Various analyses of the relationships between personality traits, subjective cognitive load (i.e., NASA-TLX), and objective cognitive load (i.e., task difficulty) are presented. Additionally, baseline machine learning models for recognizing task difficulty are presented, including a multitask learning (MTL) neural network that outperforms single-task neural networks by simultaneously learning from the two datasets. The datasets are publicly available to advance the field of cognitive load inference using commercially available devices.


Introduction
Affective Computing is the study and development of systems that can recognize and process human affective states [1]. While sensor-based recognition of human physical activity has reached a certain level of maturity, e.g., most mobile devices are nowadays capable of counting steps from acceleration sensors, the recognition of human mental states, e.g., stress, mental health, and cognitive load, remains challenging. Yet, the demand for advancing Affective Computing research is rising: through an improved understanding of its human users, Affective Computing promises to push the frontiers of human-computer interaction (HCI) and to enable much-needed new services directly related to psychological states, e.g., mobile healthcare [2]. One of the main impediment

Related Work
Cognitive load represents an important aspect modulating human behavior, and a timely and reliable assessment of a person's cognitive load would enable a range of new and improved applications in areas spanning from game-based learning, over simulator-based driving training, to considerate pervasive human-computer interaction. Yet, the concept remains intangible and is, thus, difficult to grasp and measure. In this section, we provide an overview of theoretical postulates behind the concept of cognitive load and recent efforts in measuring cognitive load. In addition, having in mind the nature of this paper, we also provide a brief survey of the existing open datasets in the field of Affective Computing.

Cognitive Load: From Theory to Measurements
Paas and van Merrienboer define cognitive load as "a multidimensional construct representing the load that performing a particular task imposes on the learner's cognitive system" [23]. As such, cognitive load is dependent on the task, the participant, and the interaction between the two. For instance, tasks may be objectively more or less demanding, people can have different cognitive capacities, and certain tasks can be easier for those who are skilled in similar tasks. This multi-dimensionality of cognitive load makes its measurement a rather challenging feat.
Cognitive load measurement methods often rely on data about the subjective perception of the task difficulty, performance data using primary and secondary task techniques, and psycho-physiological data [24]. Subjective data are gathered using surveys (e.g., NASA-TLX [25]) completed by a user at the end of a task. However, subjective post hoc measurements are impractical in real-world applications, as they require explicit querying of users. Cognitive load measurement through secondary task performance requires a user to attend to a simple secondary task (for instance, reacting to a slowly changing screen background color) while solving the primary task [26]. These techniques, too, are invasive and, in numerous situations such as while driving, not suitable for in situ cognitive load inference.
Instead, physiological techniques for cognitive load measurement rely on signals, stemming from heart beat activity [27], breathing [28], heat flux [29], brain activity [30], and eye movement [31,32]. Changes in these signals are a result of our autonomic nervous system's reaction to increased cognitive load. In [29], Haapalainen et al. used elementary cognitive tasks (ECTs), a well-established tool in educational psychology [33], to elicit different levels of cognitive engagement and monitor users' eye movement, heart and brain activity, and skin conductance while the users solved the ECTs. The authors demonstrated that two extreme levels of task difficulty ("easy" vs. "difficult") can be discriminated with 80% accuracy using heat flux, ECG features, and person-specific data, i.e., personalized models. The method, however, requires that the users are static and strapped with specialized sensors. In a study with developers engaged in real-world programming tasks, Zuger et al. used physiological sensors to infer human interruptibility. The study shows that EEG signals, eye blinks, skin conductance, heart rate, and inter-beat interval features correlate with interruptibility, which, in turn, negatively correlates with a user's mental load [34].
Recent advancements in sensing technology enable less intrusive forms of vital sign monitoring and get us closer towards unintrusive cognitive load inference [35]. Gjoreski et al. used commercially available Empatica wristbands and acquired signals related to heart rate variability, blood volume pulse, GSR, skin temperature, and acceleration, while exposing users to varying levels of stress [11]. The study demonstrates that off-the-shelf equipment can be used for reliable (up to 92% accuracy achieved in the study) stress detection. While a separate concept, stress may be related to cognitive load, and an earlier study by Setz et al. has already shown that the same GSR sensor can be used to discriminate between the two phenomena [36]. Researchers have also attempted to unobtrusively measure cognitive load in specific environments. For instance, Novak et al. used MS band to infer cognitive load in a simulated driving environment [37]. The authors argue that cheap wearables may provide enough information about physiological signals to enable binary ("engaged in a task" vs. "not engaged in a task") classification of the cognitive load, yet are unlikely suitable for inferring the actual level of cognitive load. Schaule et al. used the same wristbands and an N-back task to elicit different levels of cognitive load among office workers [38].
Nevertheless, all the above work treats users as equals, whereas their (physiological) reactions to mental burden might be highly individual. In this paper, we rely on the current tendencies of unobtrusive wearable-based cognitive load monitoring, yet for the first time we introduce personality traits, an important user-level factor impacting cognitive load expression.

Open Datasets for Affective Computing
Open datasets are a staple of reproducible and verifiable science and may often catalyze significant research activity. Table 1 presents an overview of publicly available datasets in Affective Computing. We particularly focus on datasets that encompass data originating from physiological sensors such as EEG, electrooculogram (EOG), electrocardiogram (ECG), electromyogram (EMG), blood volume pulse (BVP), electrodermal activity (EDA), respiration rate (RESP), eye tracker, magnetoencephalography (MEG), skin temperature (TEMP), acceleration (ACC), beat-to-beat intervals (RR sensor), and pulse oximetry (SpO2) sensors.
Six datasets (Ascertain [39], Amigos [40], DEAP [41], Mahnob [42], Decaf-movies, and Decaf-music [43]) are emotion recognition datasets in which the participants watched affective multimedia in short sessions, e.g., with a duration of 50 to 80 seconds, and rated their experience after each affective session using psychological questionnaires. In all these datasets, the affective multimedia are short movie or music video clips designed to induce certain affective states (e.g., fear, surprise, joy, etc.). While the participants were watching the affective multimedia, their physiological response was recorded using a variety of devices. The Emotions dataset differs from the previous datasets as it contains data from a single participant over three weeks, standing in contrast to the studies that examine many participants over a short recording interval. Laughter is another slightly different dataset, which aims at laughter recognition using non-invasive wearable devices. Three datasets (Driving-workload [44], Driving-stress [45], and Driving-distractions [46]) were collected in studies where the main task is driving. In Driving-workload, the participants drove a predefined route including different sections (e.g., crowded vs. free highway) and rated their mental workload afterwards by watching a video recording of the driving session. Similarly, in the Driving-stress dataset, the participants drove different sections and rated the perceived stress level.
In addition, this study introduced "a computed stress level", which was calculated based on the situation on the road (e.g., number of cars, pedestrians, and signs). The Driving-distractions dataset is a driving-simulator study that analyzes the behavior of the drivers under different types of stressors (physical, emotional, cognitive, and none), and it can be used for development of machine learning models for monitoring driving distractions [47].
Three datasets (Stress-math [11], WESAD [48], and Non-EEG [49]) were collected in studies focused on psychological stress. In the Stress-math dataset, the participants solved simple mathematical questions under time and evaluation pressure. The goal of this study was to induce and recognize psychological stress. In the WESAD dataset, the participants experienced both emotional and stress stimuli. More specifically, WESAD contains three sessions for each participant: a baseline session (a neutral reading task), an amusement session (watching a set of funny video clips), and a stress session (being exposed to the Trier Social Stress Test [50]). Similarly, Non-EEG is a dataset recorded during three different stress conditions including physical, cognitive, and emotional stressors.
Unlike the already available datasets in Affective Computing, this study introduces two new datasets that enable cognitive load monitoring with a wrist device in combination with personality traits. The Snake dataset is a labeled dataset of cognitive load measurements in which participants played a smartphone game. The CogLoad dataset is the first dataset that allows analysis of the cognitive load induced by six different tasks in relation to the physiological responses of individuals and their personality traits. To the best of our knowledge, the only other vaguely related dataset that includes personality traits is Amigos, which focuses on human emotions.

Personality Traits, Physiological Responses, and Wearables
Research on the relationship between personality traits and physiological responses is not new and spans multiple domains, commonly research on stress [51,52], aversive stimuli [53,54], and medical issues [55,56]. Most research, however, has not been conducted in order to produce datasets ready for analysis, especially in machine learning. Furthermore, most research is conducted with immovable and expensive instruments for measuring physiological responses. Research with inexpensive wearables for sensing physiological responses that also includes personality assessment and analysis is rare. Likely reasons are that the market for such wearables is still young and that the potential of personality traits as input data for ML models remains underappreciated. The limited research that includes wearables and personality traits has so far mostly focused on emotions [57,58] and stress [59]. We are not aware of research on cognitive load in a similar capacity to ours.

CogLoad
In the conducted experiments, the participants solved cognitive tasks of varying difficulty. The experiments were performed in a quiet, normal-temperature room with one participant at a time. At the beginning of each session, the participants were placed in a comfortable chair in front of a computer monitor and were presented with brief information regarding the experiments. Next, a wrist device (MS band) was put on their left wrist, and the rest of the experimental session was recorded in the same chair without any restrictions regarding the participants' hand gestures. Thus, the experimental setup simulated sedentary work on a computer in an office.
The experimental scenario consisted of Part 1 and Part 2. Part 1 was dedicated to assessing the participants' cognitive capacity and personality type. For assessing cognitive capacity, the participants solved two N-back tasks [60], i.e., a 2-back and a 3-back task, with a three-minute rest after each of them. For assessing the personality type, the participants completed the Hexaco personality questionnaire, which provided information about the participants' honesty-humility, emotionality, extraversion, agreeableness, conscientiousness, and openness to experience [61].
In Part 2, the participants were presented with six primary tasks. For each task, three variations of a randomly selected primary cognitive-load task were presented to the participant. The variations differed in difficulty (easy, medium, and difficult) and thus in the expected cognitive load. After each of the three variations, the participants filled in the NASA-TLX questionnaire to assess the subjective cognitive load posed by the tasks. This questionnaire, the most common means of measuring cognitive load, contains a set of questions that, if administered immediately after the task, allows post hoc analysis of the cognitive load [25]. The NASA-TLX questions assess mental demand, physical demand, temporal demand, performance, effort, and frustration.
Additionally, in parallel with the primary tasks, a secondary task was presented to fill in the participant's free cognitive resources. The secondary task consisted of a square that appeared fully transparent at a random location on the PC screen and then gradually increased in opacity. The participant's goal was to react, i.e., to click on the appearing square as soon as they noticed it. The opacity of the square when clicked was intended to reflect the participant's engagement in the primary task, since more engaged users were expected to notice the square later, when it is darker [26]. The assumption is that increased engagement corresponds to a higher cognitive load devoted to the primary task.
The software, developed by Haapalainen et al. [29] in their study on psycho-physiological measures for assessing cognitive load, was used to display the primary tasks. The software displays the following tasks: Gestalt Completion test-where the participant is asked to identify incomplete drawings; Hidden Pattern test-where the participant has to decide whether a model image is hidden in other comparison images; Finding A's test-where the participant has to find the letter 'a' in presented words; Number Comparison test-where the participant has to decide whether or not two displayed numbers are the same; Pursuit test-where the participant has to visually track irregularly curved overlapping lines from numbers on the left side of a rectangle to letters on the opposite side; and Scattered X's test-where the participant has to find the letter 'x' on screens containing random letters at random placements. More details about the technical implementation can be found in Novak's thesis [20], while we present the statistical properties of the dataset in Section 4.1.

Snake
A specific version of the game Snake (https://en.wikipedia.org/wiki/Snake_(video_game_genre)) was implemented on Android smartphones. The implemented version allowed varying the difficulty by changing the speed of the game. Twenty-three participants played the game at three difficulty levels: easy, medium, and difficult. Each level lasted at least two minutes. Immediately after completing each difficulty level, the participants answered a questionnaire to determine the perceived difficulty. With 50% probability, the difficulty levels were presented either in ascending order (easy, medium, difficult) or in descending order. The questionnaire included the six categories of the NASA Task Load Index (NASA-TLX), plus two general questions about how challenging and how fun the game seemed to the user; all questions were answered on 7-point Likert scales. For assessing the personality type, the participants completed the Hexaco personality questionnaire.
To assess the participants' physiological response, the MS band wrist device was used. The data output included heart rate, RR intervals, GSR, TEMP, and ACC data. The HR, GSR, and TEMP were sampled at 1 Hz, the ACC was sampled at 8 Hz, and the RR intervals were recorded upon detection (e.g., for 60 beats per minute, the frequency would be 1 Hz). Additionally, the screen-tapping speed was recorded. The data were transmitted via Bluetooth from the wrist device to the smartphone and then to a server. More details about the technical implementation can be found in Knez's thesis [19], while statistical properties of the dataset are described in Section 4.1.

Psychological and Behavior Analysis
Multi- and interdisciplinary efforts in computer science that combine heterogeneous data to understand and predict targets related to complex cognitive phenomena, especially with machine learning, are bearing fruit: physiological and psychological data have been found to interact in beneficial ways. Descriptive and similar statistics on psychological data, the norm in the psychological, behavioral, and cognitive sciences, therefore have a place in primarily computer-science fields as well.
This section presents various statistical analyses of demographic, psychological, and cognitive load data from the two datasets, and uses them to discuss the reasons for various correlations and other factors relating to performance and cognitive load results. This mostly serves as a baseline demonstration of how demographic and psychological data can be exploited. Section 5 discusses more advanced analyses and interpretations. A detailed interpretation of the presented statistics is provided in Section 6.1.

Personality-Descriptive Analysis of the Datasets
The CogLoad dataset includes 23 randomly selected participants, sampled in Slovenia. The participants' mean age was 29.51 (SD = 10.10), and their highest attained education levels were as follows: a high school diploma in 7 cases (30.43%), a bachelor's degree in 6 cases (26.09%), a master's degree in 6 cases (26.09%), and a doctoral degree in 4 cases (17.39%). Twenty-two participants were right-handed, and one was left-handed. All participants wore the MS band device on their left wrist. The Snake dataset includes 23 randomly selected participants (16 men and 7 women), sampled in Slovenia. The participants' mean age was 24.91 (SD = 12.05). The Hexaco personality questionnaire was administered to each of the participants in both datasets.
The personality analysis (descriptive statistics and correlations) we present here comes from the Hexaco questionnaire, which is based on six factor-level (higher level) scales or dimensions, each separated into lower facet-level scales. The six factor-level scales with multiple facet-level scales include the following:
People that rank high on honesty-humility do not pursue personal gain to the others' detriment, they follow the rules, they do not seek large material wealth, and do not judge people by their social status. On the opposite side of the spectrum, people that rank low on honesty-humility are prone to manipulating people, breaking rules, seeking material wealth over other goals, and feeling more important than others.
People that rank high on emotionality are extremely fearful of physical dangers, they are very prone to feel anxious when under stress, they constantly seek external support, and are very empathetic. On the opposite side of the spectrum, people that rank low on emotionality easily overcome fear of physical dangers, they do not worry a lot even when under stressful duress, they quickly find internal support for their matters, and they detach from others emotionally.
People that rank high on extraversion have high self-esteem, they are confident, they are often leadership material, they feel comfortable at social events, and they are enthusiastic. On the opposite side of the spectrum, people that rank low on extraversion are self-conscious, they cannot manage being the center of attention, they do not enjoy social gatherings, and they are generally less optimistic.
People that rank high on agreeableness quickly forgive people, they do not judge people, they have no problems cooperating with other people, and they manage their anger well. On the opposite side of the spectrum, people that rank low on agreeableness often hold grudges towards others for long periods of time, they are fast to criticize, they are not easily convinced they are wrong, and they react with anger in many situations.
People that rank high on conscientiousness are great at organizing their time and space, they can plan well towards their short-, medium-, and long-term goals, they are precise and can be perfectionists, and they always take time to think on their courses of action. On the opposite side of the spectrum, people that rank low on conscientiousness do not bother with having or respecting schedules, they prefer leisure to challenge, they are quickly satisfied in whatever they do, and they act spontaneously and without thought.
People that rank high on openness to experience are fascinated by aesthetics, be it in art or nature, they are extremely eager to learn, they use imagination in every aspect of their lives, and they are attracted to that which is out of the norm. On the opposite side of the spectrum, people that rank low on openness to experience are not interested in aesthetics, they do not pursue knowledge, they lack creativity, and they are fine with conforming.
For the CogLoad dataset, factor-level and facet-level scales were calculated from the questionnaire answers. For the Snake dataset, only factor-level scales were calculated. Table 2 shows the mean (M) and the standard deviation (SD) of our sample from the CogLoad and Snake datasets. No division into further groups (sex, age, education, handedness) was performed due to the low N. The table also shows M and SD of 100,318 self-reports from [62] for comparison purposes ('L&A (2016)' label in the table).

Personality, TLX, and Objective Cognitive Load Analysis
The data on psychological traits, TLX scores, and objective cognitive load were used for this analysis. Due to the high variation in the 95% confidence interval scores, all correlations are presented in the tables. We are aware that, commonly, only correlations with a minimum absolute value of 0.3 are presented, since such a correlation denotes a medium or stronger correlation [63], while a correlation below 0.3 is considered weak. The Spearman correlation was used for the presented scores for higher robustness. Table 3 presents the correlations of medium and above strength between personality traits and selected dimensions of the TLX scores for the CogLoad dataset, with 95% confidence intervals in parentheses. The label 'TLX_physical_demand' represents the score on the questions "How much physical activity was required?" and "Was the task easy or demanding, slack or strenuous?". Emotionality is a factor-level trait, while dependence, fearfulness, and anxiety are emotionality's facet-level traits. Table 4 presents correlations between the TLX scores and objective cognitive load measures for the CogLoad dataset, with 95% confidence intervals in parentheses. The label 'time_on_task' represents the time a participant spent on a task; 'num_correct' represents the number of correct answers; 'level' represents the difficulty level of the task; 'TLX_mean' represents the average of all TLX scores; 'TLX_effort' represents the score on the question "How hard did you have to work (mentally and physically) to accomplish your level of performance?"; 'TLX_temporal_demand' on "How much time pressure did you feel due to the pace at which the tasks or task elements occurred?" and "Was the pace slow or rapid?"; 'TLX_mental_demand' on "How much mental and perceptual activity was required?" and "Was the task easy or demanding, simple or complex?"; 'TLX_frustration' on "How irritated, stressed, and annoyed versus content, relaxed, and complacent did you feel during the task?"; and 'TLX_performance' on "How successful were you in performing the task? How satisfied were you with your performance?". Table 5 presents correlations between personality traits and the objective cognitive load measures for the Snake dataset, with 95% confidence intervals in parentheses. The label 'Points' represents the number of points the participant scored while playing the Snake game. Table 6 presents correlations between the TLX scores and objective cognitive load measures for the Snake dataset, with 95% confidence intervals in parentheses. The label 'subjective diff' represents the subjective score of how difficult the game was; 'level' represents the game's difficulty level; 'click per second' represents the number of clicks the participant made during the measuring time; 'gsr' represents the galvanic skin response; 'hr' represents the heart rate; and 'TLX_effort' represents the score on the question "How hard did you have to work (mentally and physically) to accomplish your level of performance?".
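As an illustration of the correlation procedure, a Spearman correlation with an approximate 95% confidence interval can be computed via the Fisher z-transform. This is a sketch under our assumptions: the paper does not state its exact CI procedure, and the 1.06/sqrt(n-3) standard error used here is one common approximation for Spearman's rho; the function name is hypothetical.

```python
import numpy as np
from scipy import stats

def spearman_with_ci(x, y, alpha=0.05):
    """Spearman's rho with an approximate (1 - alpha) CI via the Fisher z-transform.

    Assumption: SE ~ 1.06 / sqrt(n - 3), a common approximation for Spearman's rho.
    """
    rho, _ = stats.spearmanr(x, y)
    n = len(x)
    z = np.arctanh(rho)                       # Fisher z-transform of rho
    se = 1.06 / np.sqrt(n - 3)                # approximate standard error
    zcrit = stats.norm.ppf(1 - alpha / 2)     # e.g., 1.96 for a 95% CI
    lo, hi = np.tanh(z - zcrit * se), np.tanh(z + zcrit * se)
    return rho, (lo, hi)
```

With small samples such as N = 23, the resulting intervals are wide, which is why all correlations, not only those above 0.3, are reported in the tables.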

Machine Learning Analysis
In this section, we present a suite of machine learning modeling approaches that connect the data sensed by the Microsoft Band wristband with the outcome, i.e., the experienced level of cognitive load. Having in mind the susceptibility of subjective metrics of cognitive load to interpretation (potentially modulated by a participant's personality), here we focus on the objective/designed difficulty of a task and binary easy/hard classification as explained in Section 5.3.

Preprocessing, Segmentation, and Feature Extraction
We initially re-sampled all the data to a sampling frequency of 1 Hz. Next, the last 30 s of each task were used to extract features; thus, one segment represents one task. For each segment, statistical features were extracted from each input signal, i.e., heart rate, RR intervals, GSR, and TEMP, and from their first differentials. The statistical features included the mean, standard deviation, skewness, kurtosis, mean of the first derivative, mean of the second derivative, 25th and 75th percentiles, inter-quartile range, difference between the minimum and the maximum values, and coefficient of variation.
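The statistical feature extraction described above can be sketched as follows (a minimal illustration assuming 1 Hz signals stored as arrays; the function and feature names are our own, not the paper's):

```python
import numpy as np
from scipy import stats

def statistical_features(x):
    """The statistical features listed above for a 1-D signal segment."""
    x = np.asarray(x, dtype=float)
    d1 = np.diff(x)                      # first derivative
    d2 = np.diff(x, n=2)                 # second derivative
    q25, q75 = np.percentile(x, [25, 75])
    return {
        "mean": x.mean(),
        "std": x.std(),
        "skew": stats.skew(x),
        "kurtosis": stats.kurtosis(x),
        "mean_d1": d1.mean(),
        "mean_d2": d2.mean(),
        "q25": q25,
        "q75": q75,
        "iqr": q75 - q25,
        "range": x.max() - x.min(),
        "coef_var": x.std() / x.mean(),
    }

def segment_features(signals, seg_len=30):
    """Features from the last `seg_len` samples of each signal and its first differential."""
    feats = {}
    for name, sig in signals.items():
        seg = np.asarray(sig, dtype=float)[-seg_len:]
        for k, v in statistical_features(seg).items():
            feats[f"{name}_{k}"] = v
        for k, v in statistical_features(np.diff(seg)).items():
            feats[f"{name}_diff_{k}"] = v
    return feats
```

Applied to the four signals and their differentials, this yields 11 features per signal view, i.e., 88 statistical features per segment before the GSR- and HRV-specific features are added.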
Additional features were extracted from the GSR signal using Skin Conductance Response (SCR) analysis. This type of feature/analysis has proven useful for detecting stressful conditions in driving scenarios [45] and in real-life situations [11]. The GSR signal was first preprocessed using a sliding mean filter, and then fast-acting (phasic, i.e., GSR responses) and slow-acting (tonic) components were extracted. The fast-acting component was used to calculate the number of responses in the signal, the responses per minute in the signal, and the sum of the responses. The slow-acting component was used to calculate the mean value of the first differentials of the tonic component, and the difference between the tonic component and the overall signal.
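A minimal sketch of this tonic/phasic decomposition is given below; the window sizes and the response threshold are illustrative assumptions, as the paper does not report its exact filter parameters, and the function name is ours:

```python
import numpy as np

def scr_features(gsr, fs=1, win=5, resp_thresh=0.01):
    """Tonic/phasic GSR features (illustrative window sizes and threshold)."""
    gsr = np.asarray(gsr, dtype=float)
    # Sliding-mean filter as a simple smoother.
    smoothed = np.convolve(gsr, np.ones(win) / win, mode="same")
    # Slow-acting (tonic) component: heavier smoothing; fast-acting = residual.
    tonic = np.convolve(smoothed, np.ones(4 * win) / (4 * win), mode="same")
    phasic = smoothed - tonic
    # Responses: upward crossings of the phasic component above the threshold.
    rising = (phasic[:-1] <= resp_thresh) & (phasic[1:] > resp_thresh)
    n_resp = int(rising.sum())
    minutes = len(gsr) / (fs * 60)
    return {
        "n_responses": n_resp,
        "responses_per_min": n_resp / minutes if minutes else 0.0,
        "sum_responses": float(phasic[phasic > resp_thresh].sum()),
        "tonic_diff_mean": float(np.diff(tonic).mean()),
        "tonic_signal_diff": float((gsr - tonic).mean()),
    }
```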
Activation of the sympathetic nervous system triggered by cognitive load leads to more equidistant heart beats. On the other hand, the rest periods between the tasks reverse this process, and the heart beats become more irregular, as "A healthy heart is not a metronome" [64]. Heart rate variability (HRV) analysis is commonly used to quantify the dynamics of the RR intervals. The RR signal was filtered by removing the outliers, i.e., the RR intervals outside the interval [0.7*median, 1.3*median], where the median is segment-specific. Next, the following HRV features were calculated: the mean heart rate, the standard deviation of the RR intervals, the standard deviation of the differences between adjacent RR intervals, the square root of the mean of the squares of the successive differences between adjacent RR intervals, the percentage of the differences between adjacent RR intervals that are greater than 20 ms, the percentage of the differences between adjacent RR intervals that are greater than 50 ms, and Poincare plot indices (SD1 and SD2) [65].
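The outlier filtering and the listed HRV features can be sketched as follows (the derivation of SD1/SD2 from SDSD and SDNN is a standard identity, but the function name and implementation details are our own):

```python
import numpy as np

def hrv_features(rr_ms):
    """HRV features from a segment of RR intervals (in milliseconds)."""
    rr = np.asarray(rr_ms, dtype=float)
    med = np.median(rr)
    rr = rr[(rr >= 0.7 * med) & (rr <= 1.3 * med)]   # segment-specific outlier removal
    diffs = np.diff(rr)
    sdnn = rr.std(ddof=1)                            # SD of RR intervals
    sdsd = diffs.std(ddof=1)                         # SD of successive differences
    rmssd = np.sqrt(np.mean(diffs ** 2))             # root mean square of differences
    # Poincare indices, derived from SDSD and SDNN.
    sd1 = np.sqrt(0.5) * sdsd
    sd2 = np.sqrt(max(2 * sdnn ** 2 - 0.5 * sdsd ** 2, 0.0))
    return {
        "mean_hr": 60000.0 / rr.mean(),
        "sdnn": sdnn,
        "sdsd": sdsd,
        "rmssd": rmssd,
        "pnn20": 100.0 * np.mean(np.abs(diffs) > 20),
        "pnn50": 100.0 * np.mean(np.abs(diffs) > 50),
        "sd1": sd1,
        "sd2": sd2,
    }
```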

Normalization, Feature Selection, and Model Learning
To analyze the inter-participant and inter-session influence, experiments were performed without normalization, with session-specific min-max normalization, and with session-specific standardization. When min-max normalization is used, each feature is scaled between 0 and 1 by subtracting the minimum value and then dividing by the difference between the maximum and the minimum values. When standardization is used, each feature is mean-centered by subtracting the mean value and then dividing by the standard deviation.
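The two session-specific normalization schemes can be sketched as follows (hypothetical helper names; each function is applied to the feature matrix of a single session):

```python
import numpy as np

def minmax_normalize(X):
    """Scale each feature (column) to [0, 1] within a session."""
    X = np.asarray(X, dtype=float)
    mn, mx = X.min(axis=0), X.max(axis=0)
    rng = np.where(mx - mn == 0, 1, mx - mn)   # guard against constant features
    return (X - mn) / rng

def standardize(X):
    """Mean-center each feature and divide by its standard deviation."""
    X = np.asarray(X, dtype=float)
    sd = np.where(X.std(axis=0) == 0, 1, X.std(axis=0))
    return (X - X.mean(axis=0)) / sd
```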
Additionally, experiments were performed with and without feature selection. In general, all feature selection methods can be divided into wrapper methods, ranking methods (also known as filter methods), and combinations of the two. The wrapper methods (e.g., based on ROC metrics [66]) produce better results compared to the ranking methods (e.g., information entropy [67]), but they induce a heavy computational burden. In this study, a ranking method based on mutual information [68] was used because it is very efficient to compute. Mutual information is a measure that estimates the dependency between two random variables. The features were ranked using mutual information values between the features and the class values estimated on the training data, and only the top-ranked 50 features were used to build models.
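Assuming scikit-learn's mutual information estimator, the ranking step might look like this (a sketch with a hypothetical function name; the paper does not specify its implementation):

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif

def top_k_by_mutual_info(X_train, y_train, k=50, random_state=0):
    """Rank features by mutual information with the class and keep the top k.

    Fit on training data only, so the test session never leaks into selection.
    Returns the selected column indices.
    """
    mi = mutual_info_classif(X_train, y_train, random_state=random_state)
    return np.argsort(mi)[::-1][:k]
```

The returned indices are then used to slice both the training and the test feature matrices before model learning.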
These ML algorithms learn one model for each training dataset. An ML approach capable of learning models for several datasets (ML tasks) in parallel using a shared representation is multi-task learning (MTL) [76]. The idea is to use what is learned from one dataset to help learn the other tasks better. More specifically, in single-task neural networks, the backpropagation algorithm minimizes a single loss function, and a single neuron provides the final output. MTL, on the other hand, involves minimizing a joint loss function (e.g., a weighted sum of the binary cross-entropies of all tasks) and learning shared representations over all tasks (see Figure 1). The specific MTL architecture was similar to the MLP architecture: it contains two shared hidden layers of 512 units, one task-specific layer of 32 units, and two task-specific sigmoid units that output the final predictions.
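The joint loss described above, i.e., a weighted sum of per-task binary cross-entropies, can be sketched in plain numpy (a minimal illustration with hypothetical function names; equal task weights are an assumption):

```python
import numpy as np

def binary_cross_entropy(y_true, y_pred, eps=1e-7):
    """Mean binary cross-entropy for one task."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.clip(np.asarray(y_pred, dtype=float), eps, 1 - eps)
    return -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

def mtl_joint_loss(task_targets, task_preds, weights=None):
    """Joint MTL loss: weighted sum of per-task binary cross-entropies
    (equal weights by default)."""
    if weights is None:
        weights = [1.0] * len(task_targets)
    return sum(w * binary_cross_entropy(t, p)
               for w, t, p in zip(weights, task_targets, task_preds))
```

In the actual networks, this joint loss is minimized by backpropagation through the shared and task-specific layers, so gradients from both datasets shape the shared representation.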
For both the MTL and the MLP architecture, ReLU activation units [77] were used in the hidden layers, which speeds up training compared to other activation functions (e.g., tanh). To avoid overfitting, L2 regularization and dropout were used. The training of the networks was fully supervised, backpropagating the gradients through all the layers. The parameters were optimized by minimizing the binary cross-entropy loss function using the Adam optimizer. The models were trained with a learning rate of 10^−4 and a decay of 10^−4. The batch size was set to 32, and the number of training epochs was set to 50.
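The described MTL network can be sketched as follows (a minimal PyTorch sketch under stated assumptions: the input size of 50 selected features and the dropout rate are illustrative, L2 regularization is expressed through the optimizer's weight_decay, and the paper does not specify its original implementation):

```python
import torch
import torch.nn as nn

class MTLNet(nn.Module):
    """Hard-parameter-sharing MTL: two shared 512-unit layers,
    then a 32-unit layer and a sigmoid output per task."""
    def __init__(self, n_features=50, n_tasks=2):
        super().__init__()
        self.shared = nn.Sequential(
            nn.Linear(n_features, 512), nn.ReLU(), nn.Dropout(0.5),
            nn.Linear(512, 512), nn.ReLU(), nn.Dropout(0.5),
        )
        # One task-specific head per dataset (CogLoad, Snake).
        self.heads = nn.ModuleList([
            nn.Sequential(nn.Linear(512, 32), nn.ReLU(),
                          nn.Linear(32, 1), nn.Sigmoid())
            for _ in range(n_tasks)
        ])

    def forward(self, x):
        z = self.shared(x)                 # shared representation
        return [head(z) for head in self.heads]

model = MTLNet()
# The joint loss would be a weighted sum of per-task binary cross-entropies;
# L2 regularization is applied here via the optimizer's weight_decay.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4, weight_decay=1e-4)
```

An MLP baseline corresponds to the same sketch with a single head and a single-term loss.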

Experimental Setup
Leave-one-session-out evaluation was used in all ML experiments: the data of one session were used as test data, and the rest of the data were used for training and tuning the ML models. In the CogLoad dataset, there is only one session per participant, thus the models are participant-independent. In the Snake dataset, there is more than one session for some participants, thus the models are participant-dependent.
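Leave-one-session-out evaluation corresponds to scikit-learn's LeaveOneGroupOut with the session identifier as the group label (an illustrative sketch on random data; the classifier choice and the session layout are our own assumptions):

```python
import numpy as np
from sklearn.model_selection import LeaveOneGroupOut
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 10))
y = rng.integers(0, 2, size=60)
sessions = np.repeat(np.arange(6), 10)   # 6 sessions, 10 segments each

accuracies = []
for train_idx, test_idx in LeaveOneGroupOut().split(X, y, groups=sessions):
    # Train on all sessions but one, test on the held-out session.
    clf = RandomForestClassifier(random_state=0).fit(X[train_idx], y[train_idx])
    accuracies.append(clf.score(X[test_idx], y[test_idx]))
mean_acc = float(np.mean(accuracies))
```

With one session per participant, as in CogLoad, this scheme is equivalent to leave-one-participant-out, which is why those models are participant-independent.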
For each ML algorithm, parameter tuning was performed using the following procedure: parameter settings were randomly sampled from distributions predefined by an expert. Next, models were trained with the sampled parameters and evaluated using internal k-fold cross-validation on the training data. The best-performing model from the internal cross-validation was then used to classify the test data. This tuning procedure was performed once per session in the specific experimental dataset. Additionally, the whole evaluation was repeated five times to account for the randomness present in the learning (e.g., Random Forest) and the tuning (e.g., the random parameter sampling) of the ML models.
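This tuning procedure (random sampling of parameter settings, scored by internal k-fold cross-validation on the training data) can be expressed with scikit-learn's RandomizedSearchCV (a sketch; the parameter distributions and data below are illustrative, not the expert-defined ones used in the study):

```python
import numpy as np
from sklearn.model_selection import RandomizedSearchCV
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X_train = rng.normal(size=(80, 10))     # illustrative training fold
y_train = rng.integers(0, 2, size=80)

search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions={"n_estimators": [50, 100, 200],
                         "max_depth": [3, 5, None]},
    n_iter=5,            # number of randomly sampled parameter settings
    cv=5,                # internal k-fold cross-validation (k = 5)
    random_state=0,
)
search.fit(X_train, y_train)
# search.best_estimator_ would then classify the held-out session.
```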
For the CogLoad dataset, the ML task was the classification of rest vs. task segments. For the Snake dataset, the ML task was the classification of easy vs. hard segments; the rest periods were not recorded in the Snake dataset, thus rest vs. task classification is not possible. Additionally, for the Snake dataset, the segments with medium difficulty were excluded from the ML analysis, following the studies by Rissler et al. [78] and Maier et al. [79], in which only the top 20% and the bottom 20% of the data points were considered for the classification task and the data points in between were discarded. Table 7 presents the size of the experimental datasets after the labeling; each instance represents a 30-second segment labeled with "High" or "Low" difficulty.
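The top/bottom-20% labeling scheme borrowed from [78,79] can be sketched as a percentile threshold (an illustrative sketch; the function name and the fraction parameter are our own):

```python
import numpy as np

def label_extremes(scores, frac=0.20):
    """Keep only the lowest and highest `frac` of data points,
    labeling them "Low" / "High"; points in between are discarded
    (their entry in the returned keep-mask is False)."""
    lo, hi = np.quantile(scores, [frac, 1 - frac])
    keep = (scores <= lo) | (scores >= hi)
    labels = np.where(scores >= hi, "High", "Low")
    return keep, labels
```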
The averaged results for a binary classification problem are presented in Table 8. All models were dataset-specific, except for the MTL model, which is a joint model for the two datasets. The last three columns present the accuracy of the ML models built using selected features in combination with raw features (without any normalization), normalized features (min-max normalization), and standardized features. The three columns before that present the accuracy of each ML model built using all features in combination with raw features, normalized features, or standardized features.

Discussion

Results Discussion
The discussion examines the relationships between personality, cognitive load measures, and physiological data. To the best of our knowledge, this is the first research that examines and interprets such results. The examination focuses on correlations of at least medium strength (±0.3 as the threshold). Table 3 shows significant correlations between personality traits and physical demand as measured by TLX for the CogLoad dataset. Emotionality and its three facet-level traits (dependence, fearfulness, and anxiety) all significantly positively correlate with subjective physical demand. Since emotionality describes a response to stressful and demanding situations as well as to physical danger, the positive correlation is sensible: people who rank high in emotionality also find tasks physically more demanding, and vice versa. Table 4 shows significant correlations between the TLX scores and objective cognitive load measures for the CogLoad dataset. The correlations show that people who spend more time on tasks find them more demanding, put in more effort, and get more frustrated, but also feel they performed better the more time they spent on them. The negative correlation between the number of correct answers and perceived performance, however, is unusual: it suggests that the more people felt they did well, the worse they actually performed. Whether this is due to chance or a measurement problem is unclear, but it should be noted. Deeper psychological profile construction and comparison could yield possible answers for this correlation; it could be that people who score low on humility overwhelmingly report higher performance scores but are very susceptible to cognitive load. That many TLX scores significantly correlate with the task difficulty level is also expected, as this mostly confirms that the difficulty levels are reasonably set.
Table 5 shows significant correlations between personality traits and objective cognitive load measures for the Snake dataset. People higher in emotionality have higher heart rates while solving tasks (and vice versa); people higher in agreeableness have higher skin temperature while solving tasks, which is the opposite of what was expected. Agreeable people have an ability to control their temper, and since body temperature rises when temper is not controlled (which can be a response to stress), the correlation should be negative, not positive. This is another result that should be noted for future investigation.
Otherwise, more open and more agreeable people score better. More open people are more skilled in solving complex tasks, which makes this result sensible. More agreeable people are more in control of their frustration, which could result in more points, as they have an easier time staying focused on the task. Table 6 shows significant correlations between the TLX scores and other cognitive load measures for the Snake dataset. The results are mostly sensible: subjective difficulty correlates positively with the TLX scores, except for perceived performance, which is again sensible, as higher difficulty means worse perceived performance. A similar interpretation holds for the objective level of difficulty as well as for clicks per second (as more clicks are usually needed in tasks with higher difficulty) and their significant correlations. More puzzling are the remaining correlations with temperature, galvanic skin response, and heart rate. The more demanding people perceive tasks to be and the more they get frustrated, the lower their skin temperature. As discussed before, both should be positively correlated with the TLX scores; the same goes for heart rate. The only explanation, if the correlations reflect causation, can be found in more thorough psychological profiles: it may be that our participants' profiles are such that demanding situations make them focus, thus lowering heart rate, galvanic skin response, and temperature. Interpreting correlations is always a difficult, sometimes questionable practice. Here, another presupposition is made before interpretation: that psychological and cognitive data are grounded in physiological and neural phenomena. Regardless, the discussion shows that there are relationships between such heterogeneous data. Table 8 shows the ML results. It can be seen that, in general, the models performed better on the Snake dataset than on the CogLoad dataset.
This is because the CogLoad models are person-independent, while the Snake models are person-dependent. The highest accuracy of 82.3% on the Snake dataset was achieved by the XGB algorithm in combination with feature selection and feature standardization. The highest accuracy of 68.2% on the CogLoad dataset was achieved by the Bagging algorithm in combination with feature selection and feature standardization. Another observation is that the ensemble models (e.g., RF, Bagging, and XGB) performed better than the single-model algorithms, as ensemble models are more robust to noise. Finally, it should be noted that our ML modeling was successful only for the two-class version of the cognitive load inference problem (i.e., discerning between low and high load). A more fine-grained low/medium/high load inference proved prohibitively difficult for our algorithms and is therefore not discussed in this paper.
Regarding the proposed MTL approach, it is interesting to note that for the dataset that contains more instances (the CogLoad dataset), the MTL and the MLP performed similarly. However, for the smaller dataset, the MTL approach consistently outperformed the MLP approach. This may indicate that combining similar datasets using MTL is useful when the target dataset is small.

Related-Work Discussion
A direct comparison with results from the related work is not possible because of the many differences in the experimental setups: different datasets, sensors, preprocessing steps, ML methods, classification tasks, evaluation procedures, etc. To provide some insight, Table 9 presents the F1-scores achieved in studies on emotion recognition. These studies analyze participants' physiological changes induced by subtle stimuli (e.g., a video), which is similar to our study. All datasets are balanced, i.e., the majority class is close to 50%, and all studies perform binary classification tasks (e.g., low vs. high arousal), which means that F1-scores and accuracy provide similar numbers. It can be seen that our results are comparable to the related work. Moreover, it can be seen that building accurate ML models to recognize changes induced by subtle stimuli is challenging, and even more so when only a single wrist device is used. This was also confirmed by Maier et al. [79] in their study on detecting optimal user experience using a wrist device while participants played the game Tetris: their state-of-the-art deep neural network achieved an accuracy of 67.5% on a binary classification problem (high vs. low flow). Haapalainen et al. [29] achieved an average accuracy of 80% on a binary classification problem ("easy" vs. "difficult" tasks) using personalized ML models and a combination of heat flux and ECG features derived from specialized sensor equipment. The person-dependent models in this study achieved similar results using only a wrist device. The study revealed the influence of the task type and the chosen cognitive load metric on the models' accuracy. However, classifying task difficulty with an accuracy over 80%, on an ML task where the majority class is close to 50%, using person-independent models and unobtrusive sensors remains an open research question.
This was also confirmed in our previous study related to the CogLoad dataset, where both task difficulty and TLX scores were used as ground-truth for ML models [80].

Real-Life Applications and Limitations
There are many use cases in which the presented datasets and models could help improve meaningful life outcomes. Lohani et al. [81] presented an overview of the psychophysiological measures that can be utilized to assess cognitive states while driving, including EEG, optical imaging, heart rate and HRV, blood pressure, GSR, ECG, thermal imaging, and pupillometry. Another use case is measuring the workload of pilots. For example, Mohanavelu et al. [82] analyzed HRV features for measuring the cognitive workload of 20 fighter aircraft pilots in a flight simulator environment. The statistical analysis in their study revealed a strong significant difference between workload levels with respect to HRV parameters. Johannessen et al. [83] analyzed cognitive load in five physician team leaders during trauma resuscitation; eye-tracking, GSR, and heart rate measures were captured during trauma resuscitations in a real-world setting. Fritz et al. [84] used psychophysiological measures to assess task difficulty in software development. They conducted a study with 15 professional programmers to see how well an eye-tracker, a GSR sensor, and an EEG sensor could be used to predict whether developers would find a task difficult. Jimenez-Molina et al. [85] explored PPG, EEG, temperature, and pupil dilation sensors to assess the mental workload of 61 participants during web browsing. They evaluated Multinomial Logistic Regression, SVM, and MLP models using a 70%:30% train-test split. The best signal modality was EEG, with an accuracy of 70%, while the rest of the modalities achieved an accuracy of around 35%.
The size of the datasets used in our study is comparable to that in related studies on cognitive load [82][83][84][86][87][88]. However, the findings should be confirmed in a larger study with more participants in order to draw general conclusions. Finally, the secondary task used in the CogLoad dataset may be problematic for participants with vision problems, and any individual differences here could have skewed the results. In future similar studies, vision should be taken into account.

Conclusions and Future Work
This study presented two datasets of multimodal data sensed with a commodity wearable device while the participants were exposed to varying cognitive load. To the best of our knowledge, these are the first datasets that include such rich sensor data augmented with information on the personality traits of the participants. The experimental setup in which the datasets were collected included a variety of cognitive tasks performed on a smartphone and on a PC. We also presented an analysis of the psychological data in relation to the subjective cognitive load (NASA-TLX) and the objective cognitive load measures, revealing potentially significant relationships. For example, we found that people who rank high in emotionality find tasks physically more demanding and have higher heart rates during task solving (and vice versa). In addition, there was evidence that people who score low on humility may report higher performance scores, but are very susceptible to cognitive load. Furthermore, we presented baseline ML models for recognizing task difficulty. The person-independent models on the CogLoad dataset achieved an accuracy of 68.2%, while the person-dependent models on the Snake dataset achieved an accuracy of 82.3%. These results are in line with related work that uses more sophisticated lab-based measurement equipment. The proposed multi-task learning (MTL) neural network outperformed the single-task neural network (a multi-layer perceptron; MLP) by simultaneously learning from the two datasets. The datasets will be made publicly available to advance the field of cognitive load inference using commercially available devices.
Our next step will be to build ML models that combine both the psychological and physiological data for inferring cognitive load [89]. Personality grouping reveals differences between people on a more fundamental level, and these differences can be expressed physiologically. Grouping can be done either through unsupervised learning, i.e., clustering, or through expert techniques (e.g., forming groups on dominant dimensions). Finding 'noisy' participants is important as well: about one-sixth of participants give false answers to psychological questionnaires [90], and in our data, for example, such individuals could be filtered out through the honesty-humility trait score. Making separate models for different groups is, therefore, viable as well. This should improve our current results and strengthen our vision for more interdisciplinary research on cognitive phenomena.