This section describes the methods used to develop the experimental sessions, which represent the first step in developing an eating behaviour assistance system. We present the design and implementation of the data collection, the use of the experimentation software to perform different tasks, the calorie calculation, the labelling of the tasks performed by the subjects, a pipeline to classify HL and LL tasks, and finally the aggregation of all the data to visualise the effect of cognitive load tasks on food choice.
  3.1. Data Collection Setup
Before data collection, we obtained approval for the study from the ethics committee of the University of Potsdam, which was granted in March 2021. We then recruited subjects by advertising the study on the university campus and in student dormitories. Interested subjects contacted us by email and were asked to fill out a form with their demographic information and to answer health-related questions. Based on their answers, healthy subjects without diabetes, hypertension, or mental disorders were recruited. Participating subjects were sent an information sheet to brief them on the data collection, and they signed the informed consent form before the start of the study. On the day of their participation, we asked subjects to fast for 3 h before the study and not to consume food or coffee. We also informed them that they would be offered snacks after the experiment, as they would have been fasting for more than 4 h by the end of the experiment.
We recruited 14 subjects for the study. However, data from only 12 subjects were analysed, as data collection from two subjects failed due to technical issues with sensors. The 12 subjects averaged  years of age,  cm in height, and  kg in weight and were from either Asia or Europe.
Subjects were asked to wear the Shimmer3 GSR+ (Shimmer Research Ltd., Dublin, Ireland) device. In our study, this device was used to collect PPG, GSR, accelerometer, and temperature sensor data from the non-dominant hand at a sampling rate of 51.2 Hz. PPG sensors measure heart activity. GSR measures the electrical properties of the skin, which change with sweating driven by Sudomotor Nerve Activity (SMNA). At the beginning of the experiment, subjects were asked to shake their non-dominant hand three times to synchronise the different sensor data streams with the experimentation software, which was used to display stimuli for different tasks.
  3.2. Experimentation Software for Lab Task
To conduct the experimental sessions, we developed experimentation software using PsychoPy version 2021.1.4, a cross-platform Python package [26]. We used the experimentation software to develop the sequences of cognitive tasks, stimuli, questionnaires, and instructions for the study and present them to subjects without interruption from the experimenter. We scheduled two data collection experimental sessions in a laboratory setting on two consecutive days to induce HL and LL tasks. In the first experimental session, subjects were randomly assigned to HL or LL sessions to avoid eating bias. The two experimental sessions included baseline video, voice recordings, cognitive load tasks, and questionnaires implemented in the software version 1.1.0. The sequence of tasks to be performed by the subject was displayed by the software, and the sequence is shown in Figure 3.
In the beginning, there was a ten-minute baseline video to relax the subjects in both cognitive sessions. Then, subjects had to record various speech prompts designed to elicit affective states. Subsequently, subjects had to close their eyes for one minute before answering affective states questionnaires (Positive and Negative Affect Schedule (PANAS)). The closing of the eyes served as a baseline for the acquisition of the EEG data. However, EEG data analysis is beyond the scope of this paper.
After answering the questionnaire, subjects started the cognitive load tasks. In the HL session, subjects performed three tasks that induced HL. First, they performed a Reading Span task [37] at their own pace, which took approximately 7 to 10 min. In this task, subjects were asked to read logical and illogical sentences and to press the right arrow key on the keyboard if the sentence was logical or the left arrow key if it was illogical [37,38]. In addition, subjects were instructed to memorise the numbers displayed between readings and to enter them in the software after the round. Subjects performed three sets of this Reading Span task.
The second task was the Stroop task, which took approximately 5–8 min. Subjects performed three types of Stroop tasks. The first type was a self-paced task in which subjects read the colour name printed in a mismatched colour and pressed the space bar to advance to the next word [39]. Figure 2 demonstrates the Stroop task. In the second Stroop task, subjects had to read the colour name printed in mismatched colours under time pressure [38]. In the third Stroop task, in addition to the displayed colour name, subjects were presented with high- and low-pitched tones and were asked to count the high-pitched tones and enter the count at the end of the round [38].
The final HL task was the dual-N-back task. Here, subjects had to simultaneously recall a sequence of spoken letters and positions of a square and recognise whether the letter or position of the square matched the one that had previously appeared [40]. If the subject’s performance at the first level, the 1-back task (in which the subject had to recall only the square and letter displayed or announced before the current square or letter), was more than 80% correct, the difficulty level was increased to 2-back, 3-back, and so on. Figure 4 demonstrates the 2-back task. The numbers written in red at the top left of Figure 4 indicate the order of the blocks. In the first block, the blue square is in the third position of the first row. In the second block, the blue square is in the last row. In the third block, that is, two blocks later, the square returns to the position it had in the first block. Since this position matches the square’s position in the first block, the subject would be required to press the left mouse button during the third block to indicate a correct 2-back match. The N-back task was played for about 10 min and was implemented by opening the Brain Workshop software version 4.8.4 [41].
After each of the three tasks during the HL session, subjects had to answer the National Aeronautics and Space Administration Task Load Index (NASA-TLX) questionnaire. NASA-TLX is a six-dimensional subjective assessment tool for measuring workload [42,43,44]. After the final NASA-TLX questionnaire, subjects again answered the affective states questionnaires and recorded the voice prompts about affective states, followed by a 20-min break to eat. Finally, subjects were asked to close their eyes for one minute and then open them again to answer further affective states questionnaires.
During the LL session, subjects followed the same procedure as in the HL session, except that the three cognitive tasks (Reading Span, Stroop, and N-back) were excluded. Instead, they played Bejeweled 2, a computer game involving matching and sequencing jewel shapes [45]. This game is considered a relaxing LL task and was played for approximately 25 min. The aim of the game is to combine at least three similar jewel shapes to score points. The game was accessed from the experimentation software via Steam Software (version 2021.05.18, Valve Corporation, Bellevue, WA, USA).
  3.3. Labels and Analysis
Timestamps generated by the experimentation software for each task were used to label the data. We also calculated the NASA-TLX scores for the HL and LL sessions, as described in the work of Hart et al. [42,43] and Rubio et al. [44]. NASA-TLX quantifies the workload for each task, and these scores were used to analyse the classification results. We also calculated the PANAS scores for positive and negative affective states as described in the work of Watson et al. [46]. Positive affect reflects emotions such as interest, excitement, strength, enthusiasm, pride, alertness, inspiration, determination, attention, and being active. In contrast, negative affect reflects negative emotions such as stress, upset, guilt, fear, hostility, irritability, shame, nervousness, and jitteriness. PANAS scores can range from 10 to 50. Lower scores indicate lower levels of positive/negative affect, and higher scores indicate higher levels of positive/negative affect.
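As an illustration of the PANAS scoring described above, the following minimal sketch sums ten positive-affect and ten negative-affect items, each rated from 1 to 5, so that each subscale lies between 10 and 50. The item numbering assumes the standard 20-item PANAS layout, and the function name `panas_scores` is hypothetical; it is not taken from the study's software.

```python
# Assumed 1-based item numbering of the standard 20-item PANAS.
POSITIVE_ITEMS = (1, 3, 5, 9, 10, 12, 14, 16, 17, 19)
NEGATIVE_ITEMS = (2, 4, 6, 7, 8, 11, 13, 15, 18, 20)


def panas_scores(responses):
    """Return (positive_affect, negative_affect) for one questionnaire.

    `responses` maps an item number (1..20) to a rating from 1 to 5.
    """
    if set(responses) != set(range(1, 21)):
        raise ValueError("expected ratings for items 1..20")
    if any(not 1 <= rating <= 5 for rating in responses.values()):
        raise ValueError("ratings must lie between 1 and 5")
    positive = sum(responses[i] for i in POSITIVE_ITEMS)
    negative = sum(responses[i] for i in NEGATIVE_ITEMS)
    return positive, negative


# Lowest and highest possible questionnaires.
lowest = panas_scores({i: 1 for i in range(1, 21)})
highest = panas_scores({i: 5 for i in range(1, 21)})
```

Here `lowest` and `highest` reproduce the bounds of the 10 to 50 range stated in the text.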
We performed a Shapiro–Wilk test to examine the normality of the PANAS and NASA-TLX scores collected in both sessions. Depending on the Shapiro–Wilk test results, we then applied the t-test and the Friedman test to assess whether the differences between and within groups were significant.
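The text does not state exactly how the tests were paired with the normality results, so the sketch below shows only one plausible reading: for three or more repeated measurements on the same subjects, paired t-tests are used when every condition passes the Shapiro–Wilk test, and the Friedman test otherwise. The dispatch rule, the significance level, the function names, and the toy scores are all assumptions.

```python
from itertools import combinations

from scipy import stats


def is_normal(sample, alpha=0.05):
    """Shapiro-Wilk check: True when normality is not rejected at `alpha`."""
    return stats.shapiro(sample).pvalue > alpha


def within_subject_test(measures, alpha=0.05):
    """Compare three or more repeated measurements on the same subjects.

    `measures` maps a condition name to a list of scores, one per subject,
    in the same subject order. The choice rule below is an assumption.
    """
    if len(measures) < 3:
        raise ValueError("the Friedman test needs at least three conditions")
    if all(is_normal(scores, alpha) for scores in measures.values()):
        # Parametric branch: one paired t-test per pair of conditions.
        p_values = {
            (a, b): stats.ttest_rel(measures[a], measures[b]).pvalue
            for a, b in combinations(measures, 2)
        }
        return {"test": "paired_t", "p_values": p_values}
    # Non-parametric branch: a single Friedman test across all conditions.
    result = stats.friedmanchisquare(*measures.values())
    return {"test": "friedman", "p_values": {"all": result.pvalue}}


# Made-up, strongly skewed scores for twelve hypothetical subjects.
before = [1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 60]
during = [score + 1 for score in before]
after = [score + 2 for score in before]
outcome = within_subject_test({"before": before, "during": during, "after": after})
```

Because the toy scores contain an extreme outlier, the normality check fails and the sketch falls back to the Friedman test.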
  3.5. Cognitive Load Classification
In this section, we explain the pipeline to perform binary classification of the cognitive load states of the subjects.
Synchronisation and Segmentation: First, we synchronised the Shimmer3 GSR+ data with the label output of the software using the magnitude of the acceleration data. The magnitude of the acceleration was high at the beginning of the experiment due to hand tapping. Starting from this point, we assigned the timestamps of the labels generated by PsychoPy to the sensor data. We then segmented the data into baseline, questionnaires, voice, eating, and cognitive load.
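The synchronisation step can be sketched as follows, assuming that the first sample whose acceleration magnitude exceeds a threshold marks the hand movement and that this instant corresponds to a known reference time in the software labels. The threshold value, the label format `(name, start, end)`, and both function names are hypothetical.

```python
import math


def first_movement_time(timestamps, acc_xyz, threshold):
    """Return the timestamp of the first sample whose magnitude exceeds `threshold`.

    For raw data the threshold must lie above gravity (about 9.81 m/s^2).
    """
    for t, (x, y, z) in zip(timestamps, acc_xyz):
        if math.sqrt(x * x + y * y + z * z) > threshold:
            return t
    raise ValueError("no sample exceeds the threshold")


def align_labels(labels, movement_time, software_reference=0.0):
    """Shift (name, start, end) labels from software time to sensor time."""
    offset = movement_time - software_reference
    return [(name, start + offset, end + offset) for name, start, end in labels]


# Made-up example: the movement appears in the third sample.
timestamps = [0.0, 0.1, 0.2, 0.3]
acc_xyz = [(0, 0, 9.8), (0, 0, 9.8), (5, 5, 15), (0, 0, 9.8)]
movement = first_movement_time(timestamps, acc_xyz, threshold=12.0)
aligned = align_labels([("baseline", 0.0, 600.0)], movement)
```

Once the labels are expressed in sensor time, each segment can be cut out of the sensor streams by its start and end timestamps.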
Data Preprocessing: To remove the gravitational component and the high-frequency components of the accelerometer data, we applied a fourth-order Butterworth bandpass filter with cut-off frequencies of 0.5 Hz and 6 Hz to the signal. Then, the magnitude of the acceleration was calculated from the three axes.
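A minimal sketch of this accelerometer preprocessing with SciPy is given below. The cut-off frequencies, the filter order, and the sampling rate come from the text; the use of second-order sections with zero-phase filtering (`sosfiltfilt`) is an assumption, since the text does not say how the filter was applied, and the test signal is synthetic.

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt

FS = 51.2  # Shimmer3 GSR+ sampling rate in Hz, as stated in the text


def filtered_acc_magnitude(acc_xyz, fs=FS, low=0.5, high=6.0, order=4):
    """Bandpass-filter each axis and return the per-sample magnitude."""
    acc_xyz = np.asarray(acc_xyz, dtype=float)
    sos = butter(order, [low, high], btype="bandpass", fs=fs, output="sos")
    # Assumption: zero-phase (forward-backward) filtering along the time axis.
    filtered = sosfiltfilt(sos, acc_xyz, axis=0)
    return np.linalg.norm(filtered, axis=1)


# Synthetic signal: a 2 Hz movement on x and constant gravity on z.
t = np.arange(0, 60, 1 / FS)
acc = np.column_stack([
    np.sin(2 * np.pi * 2.0 * t),
    np.zeros_like(t),
    np.full_like(t, 9.81),
])
magnitude = filtered_acc_magnitude(acc)
```

In this synthetic case the constant gravity component on the z axis is removed, while the 2 Hz movement inside the pass band is retained.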
To calculate heart rate from the PPG signal, the raw PPG signal was bandpass-filtered between 0.5 and 8.0 Hz with a fourth-order Butterworth filter using the Neurokit2 Python package [48]. The peaks in the PPG signal were then identified as local maxima by comparing adjacent values, using the SciPy Python package. We then calculated the heart rate from the distance between successive peaks [49]. If the distance between two successive peaks differed from the previous distance by more than 10%, these peaks were not considered heartbeats.
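The peak-based heart rate calculation can be sketched as follows. This is an illustration on a synthetic signal, not the study's code: it assumes the signal has already been bandpass-filtered, and it reads the 10% rule as rejecting any peak-to-peak interval that differs from the preceding interval by more than 10%. The function name is hypothetical.

```python
import numpy as np
from scipy.signal import find_peaks


def heart_rate_from_ppg(ppg, fs, tolerance=0.10):
    """Return heart rates (beats per minute) from accepted peak intervals."""
    # find_peaks with no extra arguments returns plain local maxima,
    # found by comparing each sample with its neighbours.
    peaks, _ = find_peaks(ppg)
    intervals = np.diff(peaks) / fs  # seconds between successive peaks
    kept = []
    previous = None
    for interval in intervals:
        if previous is None or abs(interval - previous) <= tolerance * previous:
            kept.append(interval)
        previous = interval  # always compare against the preceding interval
    return 60.0 / np.array(kept, dtype=float)


# Synthetic pulse wave at 1.2 Hz, i.e. about 72 beats per minute.
fs = 51.2
t = np.arange(0, 30, 1 / fs)
ppg = np.sin(2 * np.pi * 1.2 * t)
heart_rates = heart_rate_from_ppg(ppg, fs)
```

On real recordings a minimum peak distance or prominence would usually be needed as well; those parameters are omitted here because the text does not mention them.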
To derive features from the raw GSR signal, we separated the tonic and phasic components using the CVXEDA algorithm, as these components have different time scales and different relationships to the triggering stimuli [50]. The tonic component represents slow drifts of the baseline skin conductance [50]. The phasic component reflects the short-term response to a stimulus [50]. Features derived from these components can discriminate between cognitive load states. Using the same algorithm, we also derived the SMNA, whose features were likewise used to discriminate between load levels. This algorithm is based on Bayesian statistics, convex mathematical optimisation, and sparsity [50].
Windowing and Artefact Removal: The preprocessed signals were segmented into 2 min windows with 75% overlap. We chose 2 min windows because preliminary testing showed that shorter window lengths produced lower classification accuracy and longer ones did not improve the accuracy, which was also observed in previous studies [51,52]. We developed an algorithm to remove artefacts from the PPG, GSR, and temperature sensor data. In this algorithm, thresholds were applied to the accelerometer magnitude and the heart rate to identify artefacts. The thresholds were applied as follows:
- If the mean accelerometer magnitude for a window was greater than 0.5 m/s², we discarded all the other signal windows for the same time index. A higher magnitude in the accelerometer signal indicated hand movement, which induces artefacts in the other signals recorded by the Shimmer3 GSR+ device.
- A threshold was applied to check whether there were more than two unique heart rates across each 2 min window throughout the complete signal. If there were no unique heart rates, the complete window was discarded and was not used for the study.
- Further, to include the remaining windows, we checked whether the heart rate was greater than 35 beats per minute per window, as the resting state heart rate ranges from 40 to 109 beats per minute [53].
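The windowing and the three checks listed above can be sketched as follows. With 2 min windows and 75% overlap, consecutive windows start 30 s apart. The thresholds come from the text; comparing the mean heart rate of a window with the 35 beats per minute limit, and the function names, are assumptions.

```python
def window_starts(n_samples, fs=51.2, window_s=120.0, overlap=0.75):
    """Return (window_length, step, start_indices) in samples."""
    window = int(round(window_s * fs))
    step = int(round(window * (1.0 - overlap)))
    return window, step, list(range(0, n_samples - window + 1, step))


def keep_window(acc_magnitude, heart_rates, acc_threshold=0.5, min_hr=35.0):
    """Apply the three artefact checks to one window."""
    if not acc_magnitude or not heart_rates:
        return False
    # 1. Too much hand movement in this window.
    if sum(acc_magnitude) / len(acc_magnitude) > acc_threshold:
        return False
    # 2. Require more than two unique heart rate values.
    if len(set(heart_rates)) <= 2:
        return False
    # 3. Assumed reading: the mean heart rate must exceed the lower limit.
    return sum(heart_rates) / len(heart_rates) > min_hr


# Four minutes of data at 51.2 Hz give five overlapping 2 min windows.
window, step, starts = window_starts(n_samples=2 * 6144)
```

A window rejected by `keep_window` would be dropped from the PPG, GSR, and temperature streams at the same time index.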
Feature Extraction and Selection: We extracted 125 features from the GSR, PPG, and temperature sensor signal windows that were free of artefacts. From the filtered PPG signal, we extracted 46 features as described in the work of Xiao et al. [51]. In total, 4 PPG features were derived from Neurokit2 in the frequency domain, and another 11 PPG features were derived from heart rate. We derived the mean and standard deviation within a window from temperature sensor data. We extracted 43 features from GSR in the time domain using the EDA explorer [54], and 19 features were derived from the tonic, phasic, and SMNA.
In this study, we implemented a hybrid feature selection method. First, we applied a filter method that ranks the features by their Analysis of Variance (ANOVA) F-value, implemented with the SelectKBest class of the Scikit-Learn Python package. Then we used the wrapper method Recursive Feature Elimination with Cross-Validation (RFECV) with RF to reduce the features further. This combination was used because excluding redundant features reduces time and computational complexity and can improve model accuracy [55,56].
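A minimal sketch of such a hybrid selection with scikit-learn is shown below: `SelectKBest` with `f_classif` (the ANOVA F-value) acts as the filter, followed by `RFECV` wrapped around a random forest. The value of `k`, the number of trees, the number of folds, and the synthetic data are assumptions chosen only to keep the example small.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFECV, SelectKBest, f_classif
from sklearn.pipeline import Pipeline


def build_selector(k=10, cv=3, random_state=0):
    """Filter step (ANOVA F-value) followed by a wrapper step (RFECV with RF)."""
    forest = RandomForestClassifier(n_estimators=25, random_state=random_state)
    return Pipeline([
        ("anova", SelectKBest(score_func=f_classif, k=k)),
        ("rfecv", RFECV(forest, step=1, cv=cv, scoring="accuracy")),
    ])


# Synthetic stand-in for the window-level feature matrix.
X, y = make_classification(
    n_samples=80, n_features=20, n_informative=5, random_state=0
)
selector = build_selector()
X_selected = selector.fit_transform(X, y)
```

To avoid information leakage, a selector of this kind would normally be fitted on the training folds only and then applied to the held-out data.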
Model Training: We normalised the selected features before using them to train and test the RF, SVM, and Gaussian Naive Bayes (GNB) classifiers. These models were chosen based on previous work [14,15,30]. We implemented a nested cross-validation hyperparameter optimisation with a grid search method to train these models. The hyperparameters used for RF, SVM, and GNB are listed in Table 1. We performed this optimisation to tune the models and avoid overfitting. In this optimisation, the data from the twelve subjects were split into training and test sets. The data from one subject were used as test data, and the data from the remaining eleven subjects were used for training (leave-one-subject-out, LOSO). Using the data from the eleven subjects, we performed 11-fold cross-validation. In each fold, one subject formed the validation dataset, and the remaining ten subjects formed the training dataset. We performed the hyperparameter optimisation using grid search to configure the model on the ten training subjects. The best model from the grid search cross-validation was evaluated with the last fold. This procedure was repeated eleven times, and the final cross-validation value was calculated as the mean of all eleven values. Therefore, each of the twelve subjects was tested with a different model because the hyperparameters were different for each model.
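The nested leave-one-subject-out procedure can be sketched with scikit-learn as below. The outer loop holds out one subject for testing, and the inner grid search runs leave-one-subject-out cross-validation on the remaining subjects. The parameter grid, the use of `StandardScaler` for normalisation, and the synthetic data with four subjects are assumptions; the values in Table 1 are not reproduced here.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, LeaveOneGroupOut
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


def nested_loso(X, y, groups, param_grid):
    """Return the test accuracy per held-out subject."""
    scores = {}
    for train_idx, test_idx in LeaveOneGroupOut().split(X, y, groups):
        model = make_pipeline(
            StandardScaler(), RandomForestClassifier(random_state=0)
        )
        # Inner loop: one validation subject per fold among the training subjects.
        search = GridSearchCV(
            model, param_grid, cv=LeaveOneGroupOut(), scoring="accuracy"
        )
        search.fit(X[train_idx], y[train_idx], groups=groups[train_idx])
        subject = int(groups[test_idx][0])
        scores[subject] = search.score(X[test_idx], y[test_idx])
    return scores


# Synthetic data: 4 hypothetical subjects with 10 windows each.
rng = np.random.default_rng(0)
y = np.tile([0, 1], 20)
X = rng.normal(size=(40, 4))
X[:, 0] += 2.0 * y  # make one feature informative
groups = np.repeat(np.arange(4), 10)
grid = {"randomforestclassifier__n_estimators": [10, 20]}  # hypothetical grid
scores = nested_loso(X, y, groups, grid)
```

Because the grid search is rerun for every held-out subject, each subject can end up being tested with a model that has different hyperparameters, which matches the behaviour described in the text.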
Data Aggregation: To explore the impact of cognitive load on food choice, we analysed the cognitive load classification results, food choices during both sessions, and the negative affect (from PANAS) experienced during the HL session as it impacted the subjects significantly. The collected data were aggregated and subjected to hierarchical clustering using agglomerative clustering techniques [57,58].
Before conducting the clustering analysis on the food consumption data measured in grams and millilitres, we normalised the values to a scale of 0 to 1 to account for variations in measurement units. This step ensured that the clustering results were not biased by differences in the scales of different food items. We excluded calorie consumption from the analysis due to the inclusion of water, which has zero calories. To facilitate the analysis, we converted the cognitive load classification accuracy results into binary labels. Accuracies above 80% were labelled as 1, indicating high accuracy, while accuracies below 80% were labelled as 0, representing low accuracy. Similarly, participants who reported higher negative affect during the load period compared to before load and after eating were labelled as 1, while the remaining participants were labelled as 0. This labelling approach allowed us to examine the connections between cognitive load, food choice, and negative affect.
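The aggregation steps above can be sketched as follows, using made-up values for four hypothetical subjects rather than study data. Food quantities are min-max normalised per item, accuracies are binarised at 80%, and the combined table is passed to agglomerative clustering. The number of clusters, the Ward linkage, and the treatment of an accuracy of exactly 80% as 0 are assumptions.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering


def min_max_normalise(values):
    """Scale each column to the range 0 to 1 (constant columns become 0)."""
    values = np.asarray(values, dtype=float)
    span = values.max(axis=0) - values.min(axis=0)
    span[span == 0] = 1.0  # avoid division by zero for constant columns
    return (values - values.min(axis=0)) / span


def binarise_accuracy(accuracy, threshold=0.80):
    """Label accuracies above the threshold as 1 and all others as 0."""
    return (np.asarray(accuracy) > threshold).astype(int)


# Made-up values: one food item in grams and one drink in millilitres.
food = np.array([[120.0, 250.0], [30.0, 500.0], [110.0, 200.0], [20.0, 450.0]])
accuracy = np.array([0.91, 0.72, 0.85, 0.64])
negative_affect_flag = np.array([1, 0, 1, 0])

features = np.column_stack([
    min_max_normalise(food),
    binarise_accuracy(accuracy),
    negative_affect_flag,
])
clusters = AgglomerativeClustering(n_clusters=2, linkage="ward").fit_predict(features)
```

In this toy table the first and third subjects fall into one cluster and the second and fourth into the other, which illustrates how subjects with similar food choice, accuracy label, and negative affect label are grouped together.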