Examining Mental Workload in a Spatial Navigation Transfer Game via Functional near Infrared Spectroscopy

The goal of this study was to examine the effects of task-related variables, such as difficulty level, problem scenario, and experiment week, on the performance and mental workload of 27 healthy adults during problem solving in the spatial navigation transfer (SNT) game. The study reports task performance measures, namely total time spent on a task (TT) and reaction time (RT); neurophysiological measures obtained with functional near-infrared spectroscopy (fNIRS); and a subjective rating scale for self-assessment of mental workload (NASA TLX). Several within-subject repeated-measures factorial ANOVA models were developed to test the main hypothesis. The results revealed a number of interaction effects for the dependent measures of TT, RT, fNIRS, and NASA TLX. The results showed (1) a decrease in TT and RT across the three difficulty levels from Week 1 to Week 2; (2) an increase in TT and RT for high and medium cognitive load tasks compared with low cognitive load tasks in both Week 1 and Week 2; and (3) an overall increase in oxygenation from Week 1 to Week 2. These findings confirmed that both behavioral performance and mental workload were sensitive to task manipulations.


Introduction
In the human brain, the prefrontal cortex (PFC) plays an important role in executive functions involving cognitive processes such as problem solving, decision making, planning, and working memory [1][2][3]. To understand the complex nature of these cognitive processes within the PFC, it is important to examine the mental workload imposed by various cognitive tasks. Mental workload has been defined as the cognitive and psychological effort required of a person to complete a given task [4,5]. Previous studies indicated that mental workload is sensitive to task-related features, including the difficulty level of the task, the order in which tasks are presented (e.g., blocked vs. random practice), and the type of task (e.g., learning vs. transfer task) [6][7][8].
Some researchers have described mental workload as part of the broader cognitive load theory (CLT) [9]. CLT, developed in the 1980s, informs instructional design by proposing instructional strategies that are based on human cognitive architecture and can be used to enhance cognitive processes [10][11][12]. CLT aims to optimize the use of working memory resources to enhance learning, particularly for complex learning tasks that can impose a heavy cognitive load on the memory system [13][14][15][16]. According to CLT, working memory load can be influenced by three types of cognitive load: intrinsic, extraneous, and germane [13][14][15]. Intrinsic cognitive load reflects the nature of the task itself and depends on the number of elements that must be processed simultaneously in working memory. This number, in turn, depends on the element interactivity of the learning task, that is, the extent to which the constituent elements of a task can or cannot be learned in isolation. Extraneous cognitive load is associated with the instructional strategies used to present the task to the learner (e.g., visual, audio) and is not directly related to the intrinsic nature of the task. Germane cognitive load is directly associated with learning and refers to the cognitive resources devoted to learning.
While subjective rating scales are widely used to assess cognitive load [17,18], Sweller and colleagues [10] called for further research into physiological techniques for measuring cognitive load (e.g., functional magnetic resonance imaging (fMRI)). In particular, functional near-infrared spectroscopy (fNIRS) [19,20] is a promising noninvasive and portable technology for monitoring PFC activation during complex cognitive tasks in natural settings [21][22][23][24]. Previous studies demonstrated that fNIRS can detect changes in mental workload associated with varying task difficulty [22,25,26].
In this study, we used fNIRS to assess mental workload during general problem solving in the spatial navigation transfer (SNT) game. The SNT game was designed around complex cognitive and visuospatial tasks involving analogical reasoning, spatial navigation, planning, and decision making, all of which have been associated with executive functions in which the PFC plays a role [1][2][3][27]. The game allowed intrinsic cognitive load to be manipulated by varying the difficulty level of the analogous tasks, and extraneous cognitive load to be manipulated by varying the contextual (surface) features of the task, such as the problem scenario. Previous studies provide evidence that fNIRS can be successfully deployed to monitor hemodynamic changes during complex cognitive and visuospatial tasks such as air traffic control [26] and maze navigation [28].
The goal of this study was to examine the effects of task-related variables on performance and mental workload during problem solving in the SNT game, a serious game designed as a task environment for studying cognitive load. The task-related variables were the difficulty level of the task (Level), with three levels (low, medium, and high cognitive load); the problem scenario (Problem), with two levels (School Bus and Ambulance); and the week in which the task was presented (Week), with two levels (Week 1 and Week 2). The study examined task performance measures, namely total time spent on a task (TT) and reaction time (RT); neurophysiological measures obtained with fNIRS; and a subjective rating scale for self-assessment of mental workload (NASA TLX) [17]. The overarching hypothesis was that Week, Problem, and Level would show main effects and (three-way and/or two-way) interaction effects on mental workload as measured by oxygenation (OXY (µV)) and on perceived mental workload as measured by NASA TLX. Before testing this main hypothesis, behavioral performance, in terms of reaction time and total time, was also analyzed for main effects of and interactions among Week, Problem, and Level.

Spatial Navigation Transfer (SNT) Game Tasks
The SNT game was designed and developed by the authors of this study using CLT as an analytical framework. Specifically, the game allowed two types of cognitive load to be manipulated: intrinsic cognitive load, by varying the difficulty levels of the tasks, and extraneous cognitive load, by varying contextual features of the game such as the problem scenarios.
The game contained two well-structured problems represented by two analogous scenarios, the School Bus problem and the Ambulance problem (see Figure 1), which were similar at a deep, structural level (e.g., navigating the map to reach certain destinations while accounting for constraints) but differed at a surface, contextual level (e.g., driving a school bus in the daytime vs. driving an ambulance at night). Each problem scenario included nine analogous tasks at three levels of difficulty: low, medium, and high cognitive load. The difficulty level of each task in both scenarios was defined by the number of constraints, such as a limited amount of fuel, limited time, a limited number of seats, and increasing traffic. In both scenarios, participants were required to perform a number of cognitive and visuospatial tasks. For instance, in the School Bus scenario, participants assumed the role of a school bus driver whose task was to navigate the game map to collect students from different locations and drive them to school while accounting for limited seating and fuel. In the Ambulance scenario, participants assumed the role of an ambulance driver whose task was to collect patients from different locations on the game map and take them to the hospital while accounting for limited time and fuel. The tasks at each level were preceded by detailed instruction pages describing the tasks and the types of constraints they contained. The game had been pilot tested with six participants representing the target population; those who participated in the pilot were not included in this study.
The pilot testing allowed for the validation of the difficulty levels of the tasks, as well as modifications of certain game design features (e.g., clarifying the in-game instructions and feedback, editing the visuals, etc.).

Functional Near Infrared Spectroscopy (fNIRS)
Participants' brain activity during SNT gameplay was assessed using functional near-infrared spectroscopy (fNIRS). As a field-deployable, non-invasive optical brain monitoring technology, fNIRS measures cerebral hemodynamics in response to sensory, motor, or cognitive activation. Compared with other neuroimaging modalities, it offers greater flexibility for deployment in dynamic environments: it allows continuous, localized monitoring of cortical activity for physiological and psychological assessment while requiring a small, portable footprint, a low initial investment, and near-zero running costs [19,20,22,23]. A commercially available research-grade fNIRS system (fNIR Devices LLC, Potomac, MD) was used for the neurophysiological assessment. This 16-channel system covers the whole forehead and measures changes in oxygenated (HbO) and deoxygenated hemoglobin (HbR) concentration over the prefrontal cortex, the region associated with higher cognitive functions such as attention, working memory, decision making, and problem solving. The sensor has 4 light-emitting diodes (LEDs) and 10 photodetectors, yielding an array of 16 channels sampled at 2 Hz (see Figure 2).

Participants and Experiment Protocol
A total of 27 healthy, right-handed adults (13 males and 14 females) aged 18 to 40 years (most in the 26-30 age group) volunteered to participate in this study. The volunteers were graduate students majoring in education, neuroscience, psychology, or business at a four-year private university in the northeastern US (see Appendix A for the selection criteria). IRB Protocol Number: 1806006395.
Prior to the experiment, all participants signed written informed consent forms approved by the Drexel University Human Subjects Institutional Review Board. They then completed a demographic questionnaire administered with the Qualtrics online survey tool, which collected information about age, gender, and disciplinary area (reported earlier in this section). One week after completing the survey, participants were invited to the experiment lab at the university to play the SNT game in two sessions across two weeks. At the beginning of the first session, participants received a verbal introduction to the experiment protocol and the SNT game. In both sessions, participants were asked to follow the in-game instructions, stay calm, and minimize their movements during the experiment. Next, the fNIRS sensor pad was placed on the participant's forehead (see Figure 3). Participants then logged in to the game with a unique user ID, which was used to match each participant's survey responses with the experimental data collected in the two sessions. The game was played online on a 14-inch laptop, and participants used a mouse to navigate it.
To determine each participant's baseline for the fNIRS measures, a 15-s resting period was recorded while the participant was relaxed and still. The game then started with a tutorial that provided step-by-step instructions on how to navigate the game and collected baseline information on the participant's behavioral performance in the game context (see Appendix B). The tutorial presented the Farm problem, which is analogous to the Ambulance and School Bus problems: the task was to navigate the map, collect milk from different locations, and bring it back to the farm (see Appendix C).
In each week, participants completed 18 tasks presented in randomized blocks of three tasks (T1, T2, T3) (see Figure 4). The three tasks within a block were of the same difficulty level. After each block, participants were shown the NASA TLX screen and asked to rate the perceived mental workload of the tasks. If a participant failed a task, they were prompted to repeat it once; after a second failure, they could not repeat the task and were prompted to proceed to the next one (see Figure 5). In Week 2, the tasks were presented in a different order than in Week 1. In addition, some task features within each level (e.g., the number of collectables, the amount of fuel, the location of the collectables) were manipulated to make the Week 2 tasks slightly more challenging than their Week 1 counterparts. For example, in the School Bus scenario, Week 2 tasks had more collectables; in the Ambulance scenario, Week 2 tasks allowed less in-game time. This slight manipulation meant that any transfer of learning would occur between highly similar rather than identical tasks, since contextual similarity is a necessary precondition for transfer [29][30][31].
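The block structure and retry rule described above can be sketched as follows. This is a minimal illustration, not the SNT game's code: it assumes that each block groups the three tasks sharing both a problem scenario and a difficulty level, that only the block order is randomized, and that `play_task` and `rate_tlx` are hypothetical callbacks standing in for gameplay and the NASA TLX screen.

```python
import random
from collections import defaultdict

MAX_ATTEMPTS = 2  # a failed task may be repeated once, then the player moves on


def build_blocks(tasks, rng):
    """Group (problem, level, task_id) tuples into blocks and shuffle the block order.

    Assumption: a block holds the three tasks that share a problem scenario
    and a difficulty level; the task order inside a block is kept as given.
    """
    grouped = defaultdict(list)
    for problem, level, task_id in tasks:
        grouped[(problem, level)].append(task_id)
    blocks = list(grouped.items())
    rng.shuffle(blocks)  # randomized block order (e.g., a different seed per week)
    return blocks


def run_session(blocks, play_task, rate_tlx):
    """Play every block with at most one retry per failed task and one TLX rating per block.

    play_task(problem, level, task_id, attempt) -> bool and
    rate_tlx(problem, level) -> float are hypothetical callbacks.
    """
    log = []
    for (problem, level), task_ids in blocks:
        for task_id in task_ids:
            for attempt in range(1, MAX_ATTEMPTS + 1):
                passed = play_task(problem, level, task_id, attempt)
                log.append(("task", problem, level, task_id, attempt, passed))
                if passed:
                    break  # no retry needed
        log.append(("tlx", problem, level, rate_tlx(problem, level)))
    return log


if __name__ == "__main__":
    tasks = [(p, lvl, f"T{i}")
             for p in ("School Bus", "Ambulance")
             for lvl in ("low", "medium", "high")
             for i in (1, 2, 3)]
    blocks = build_blocks(tasks, random.Random(1))
    log = run_session(blocks,
                      play_task=lambda p, lvl, t, a: a == 2,  # fail first try, pass retry
                      rate_tlx=lambda p, lvl: 50.0)
    print(len(tasks), len(blocks), len(log))  # prints: 18 6 42
```

With two problems, three levels, and three tasks per level, the sketch yields the 18 tasks per week stated in the text, arranged as six blocks; a run in which every task needs its retry logs 36 attempts plus 6 TLX ratings.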
The first step in processing the fNIRS signals was noise removal, to separate out artifacts due to movement and systemic changes. Noise sources can include head movements, physiological signals such as respiration and heart rate, and instrument-related and environmental noise [20,27]. After inspecting the data, saturated channels were rejected, and the raw light intensity was then filtered with a low-pass finite impulse response (FIR) filter with a cut-off frequency of 0.1 Hz to suppress higher-frequency physiological and non-physiological noise.
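The low-pass filtering step can be sketched with a windowed-sinc FIR design. The text specifies only the filter type (FIR), the 0.1 Hz cut-off, and the 2 Hz sampling rate; the Hamming window, the 61-tap length, the zero-delay centered convolution, and the edge padding below are our assumptions, not the study's settings.

```python
import math


def lowpass_fir_taps(cutoff_hz, fs_hz, num_taps):
    """Hamming-windowed sinc low-pass FIR design (linear phase, odd length)."""
    if num_taps % 2 == 0:
        raise ValueError("use an odd number of taps so the filter has an integer delay")
    fc = cutoff_hz / fs_hz  # cutoff in cycles per sample
    m = num_taps - 1
    taps = []
    for n in range(num_taps):
        k = n - m // 2
        ideal = 2 * fc if k == 0 else math.sin(2 * math.pi * fc * k) / (math.pi * k)
        window = 0.54 - 0.46 * math.cos(2 * math.pi * n / m)  # Hamming window
        taps.append(ideal * window)
    gain = sum(taps)
    return [t / gain for t in taps]  # normalize to unity gain at 0 Hz


def apply_fir(signal, taps):
    """Centered (delay-compensated) convolution; edges are padded by repeating end samples."""
    half = len(taps) // 2
    padded = [signal[0]] * half + list(signal) + [signal[-1]] * half
    # The taps are symmetric, so this sliding dot product equals convolution.
    return [sum(t * padded[i + j] for j, t in enumerate(taps))
            for i in range(len(signal))]


# Usage (hypothetical): filter one channel's raw intensity sampled at 2 Hz.
taps = lowpass_fir_taps(cutoff_hz=0.1, fs_hz=2.0, num_taps=61)
```

At 2 Hz the cardiac component (around 1 Hz) sits near the Nyquist frequency, well inside the stopband of this design, while slow hemodynamic changes below the cut-off pass essentially unchanged. A longer filter would sharpen the transition band at the cost of more edge samples affected by padding.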
Next, relative changes in the concentrations of oxygenated (oxy-Hb, HbO) and deoxygenated hemoglobin (deoxy-Hb, HbR) were calculated with the modified Beer-Lambert law, using the baseline recorded at the beginning of the experimental data collection [1]. Blood oxygenation (OXY) was calculated as HbO − HbR and used in the subsequent statistical analyses. Mean OXY was computed for four regions of interest: two on the right (right dorsolateral, RDLPFC; right anterior, RANTPFC) and two on the left (left dorsolateral, LDLPFC; left anterior, LANTPFC). Previous studies that used average OXY as a dependent measure in complex cognitive tasks demonstrated its sensitivity to task difficulty and established its validity for measuring cortical changes associated with cognitive workload, e.g., [32,33]. Moreover, in an fNIRS study by Liang and colleagues [1], medium effect sizes were reported for OXY and oxy-Hb (d = 0.44-0.76) and small effect sizes for deoxy-Hb and total hemoglobin (HbT) (d = 0.26-0.28), suggesting that OXY and oxy-Hb were more sensitive to workload changes than the other two biomarkers. Different regions within the PFC have also shown varying sensitivity to workload changes in complex cognitive tasks [1,25,28,32]. For instance, the fNIRS study by Izzetoglu and colleagues [32] showed that the left anterior medial PFC was more sensitive to workload changes in cognitive tasks involving a virtual flight simulator. Prior work also associates the left DLPFC with reasoning, problem solving, and recognizing specific features, and the right DLPFC with planning and decision making [34][35][36]. Given that the SNT game tasks involved cognitive processes that draw on PFC functions in both hemispheres, our statistical analyses included measures from both the left (LDLPFC, LANTPFC) and the right (RDLPFC, RANTPFC) regions.
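The conversion from light intensity to OXY can be sketched for a single channel measured at two wavelengths. This is an illustration of the textbook modified Beer-Lambert law, not the vendor's processing pipeline: the extinction coefficients, source-detector distance, differential pathlength factor (DPF), and the channel-to-region map are all hypothetical inputs that the text does not specify.

```python
import math
from statistics import mean


def delta_od(intensity, baseline_intensity):
    """Change in optical density relative to the resting baseline."""
    return -math.log10(intensity / baseline_intensity)


def mbll_two_wavelengths(d_od_1, d_od_2, extinction, distance_cm, dpf):
    """Solve the modified Beer-Lambert law for (dHbO, dHbR).

    extinction = ((eps_HbO_l1, eps_HbR_l1), (eps_HbO_l2, eps_HbR_l2)) holds
    hypothetical coefficients at two wavelengths; distance_cm is the
    source-detector separation and dpf the differential pathlength factor.
    """
    (a, b), (c, d) = extinction
    path = distance_cm * dpf
    y1, y2 = d_od_1 / path, d_od_2 / path
    det = a * d - b * c
    if det == 0:
        raise ValueError("extinction matrix is singular")
    d_hbo = (y1 * d - b * y2) / det  # Cramer's rule for the 2x2 system
    d_hbr = (a * y2 - c * y1) / det
    return d_hbo, d_hbr


def oxy(d_hbo, d_hbr):
    """Oxygenation biomarker as defined in the text: OXY = HbO - HbR."""
    return d_hbo - d_hbr


def roi_means(oxy_by_channel, roi_channels):
    """Average OXY over each region's channels; rejected channels are simply absent."""
    result = {}
    for roi, channels in roi_channels.items():
        kept = [oxy_by_channel[ch] for ch in channels if ch in oxy_by_channel]
        result[roi] = mean(kept) if kept else float("nan")
    return result
```

In use, `roi_channels` would map names such as "LDLPFC" to channel indices; any such mapping here would be hypothetical, since the text does not list which of the 16 channels belong to each region. Dropping saturated channels from `oxy_by_channel` makes the region averages skip them automatically.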

Statistical Analyses
The independent variables were experiment week (Week), with two levels (Week 1 and Week 2); problem scenario (Problem), with two levels (School Bus and Ambulance); and difficulty level (Level), with three levels (low, medium, and high cognitive load). The dependent variables were performance, operationalized as reaction time (RT) and total time (TT); mean change in oxygenation (OXY) for four regions of interest within the PFC (right dorsolateral = RDLPFC, left dorsolateral = LDLPFC, right anterior = RANTPFC, and left anterior = LANTPFC); and perceived mental workload as measured by the NASA TLX score.
Both the behavioral and fNIRS data were analyzed with the NCSS statistical software. The data were first screened for missing values, and the assumptions of factorial ANOVA were checked, including outliers, normality, and sphericity. Normality was tested with the Shapiro-Wilk test, potential outliers were checked with descriptive statistics and normality plots, and sphericity was tested with Mauchly's test. To address variability issues, z-scores (standardized scores) were computed and plotted for each dependent measure using $z = (V - \bar{X}_g)/SD_g$, where $V$ is the variable score (e.g., TT, RT, LDLPFC), $\bar{X}_g$ is the grand mean for each problem, and $SD_g$ is the grand standard deviation for each problem. The z-scores were used in all subsequent statistical analyses (see Appendix D).
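The per-problem standardization can be sketched in a few lines. This is an illustration rather than the NCSS procedure: it assumes long-format records with hypothetical keys such as "problem" and "TT", and it takes $SD_g$ to be the sample standard deviation (n − 1 denominator), which the text does not state.

```python
from statistics import mean, stdev


def standardize_by_problem(rows, value_key, group_key="problem"):
    """Return z = (V - grand mean) / grand SD, computed separately for each problem.

    rows: list of dicts such as {"problem": "School Bus", "TT": 41.2} (hypothetical keys).
    Assumption: SD_g is the sample standard deviation (n - 1 denominator).
    """
    values = {}
    for row in rows:
        values.setdefault(row[group_key], []).append(row[value_key])
    stats = {g: (mean(v), stdev(v)) for g, v in values.items()}
    return [(row[value_key] - stats[row[group_key]][0]) / stats[row[group_key]][1]
            for row in rows]
```

Computing the mean and SD within each problem puts School Bus and Ambulance scores on a common scale, so differences in their raw ranges do not dominate the ANOVA; if the population SD was intended instead, `statistics.pstdev` would replace `stdev`.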
To test the hypothesis, a separate 3 × 2 × 2 repeated-measures factorial ANOVA (Level × Problem × Week) was applied to each dependent variable: RT, TT, the OXY biomarker for each of the four regions of interest (LDLPFC, RDLPFC, LANTPFC, RANTPFC), and the TLX score. The within-subject factors Week, Problem, and Level were treated as fixed effects and Subject as a random effect. A Greenhouse-Geisser correction was applied when the sphericity assumption was violated. The significance level was set at α = 0.05. Tukey HSD post hoc tests were conducted to follow up significant interaction effects. Partial eta-squared (η²) is reported as the effect size; Cohen's [37] benchmarks of small (η² = 0.01), medium (η² = 0.06), and large (η² = 0.14) effects were used to help interpret the calculated effect sizes [21].
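To show what the Greenhouse-Geisser correction does, the epsilon estimate can be sketched from the covariance matrix of a within-subject factor's levels. This is the textbook Box/Greenhouse-Geisser formula written from scratch for illustration; NCSS may compute it differently in detail, and the function names are our own.

```python
from statistics import mean


def sample_covariance(data):
    """k x k sample covariance of an n x k matrix (rows = subjects, columns = levels)."""
    n, k = len(data), len(data[0])
    means = [mean(row[j] for row in data) for j in range(k)]
    return [[sum((row[i] - means[i]) * (row[j] - means[j]) for row in data) / (n - 1)
             for j in range(k)] for i in range(k)]


def greenhouse_geisser_epsilon(cov):
    """Greenhouse-Geisser epsilon: eps = tr(C)^2 / ((k - 1) * tr(C C)).

    C is the double-centered covariance matrix. eps = 1 under sphericity
    and has a lower bound of 1 / (k - 1).
    """
    k = len(cov)
    row_means = [sum(r) / k for r in cov]
    col_means = [sum(cov[i][j] for i in range(k)) / k for j in range(k)]
    grand = sum(row_means) / k
    c = [[cov[i][j] - row_means[i] - col_means[j] + grand for j in range(k)]
         for i in range(k)]
    trace = sum(c[i][i] for i in range(k))
    trace_sq = sum(x * x for row in c for x in row)  # tr(C C) for symmetric C
    if trace_sq == 0:
        raise ValueError("no variability in the repeated-measures contrasts")
    return trace ** 2 / ((k - 1) * trace_sq)


def corrected_df(df_effect, df_error, eps):
    """Greenhouse-Geisser corrected degrees of freedom; the F ratio itself is unchanged."""
    return df_effect * eps, df_error * eps
```

For the three-level Level factor with 27 participants, a violation of sphericity would shrink the nominal F(2, 52) degrees of freedom to F(2ε, 52ε) before the p value is looked up. For a two-level factor such as Week or Problem, ε is always 1, so no correction is needed.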

Behavioral Results
For the dependent variable of TT, the three-way ANOVA revealed two-way interaction effects of Week and Level, F(2, 52) = 14.71, p ≤ 0.001, η² = 0.36, and of Problem and Level, F(2, 52) = 93.28, p ≤ 0.001, η² = 0.78 (see Figure 6). There were also main effects of Week, F(1, 26) = 95.34, p ≤ 0.001, η² = 0.79, and Level, F(2, 52) = 289.81, p ≤ 0.001.
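The partial eta-squared values reported for TT can be cross-checked from each F statistic and its degrees of freedom. The sketch below is our illustration, not the NCSS computation. It relies on the standard identity that follows from F = (SS_effect/df_effect)/(SS_error/df_error), and it labels the result with the Cohen benchmarks used in this study.

```python
def partial_eta_squared(f_value, df_effect, df_error):
    """Partial eta-squared recovered from an F ratio and its degrees of freedom.

    Since F = (SS_effect / df_effect) / (SS_error / df_error),
    eta_p^2 = SS_effect / (SS_effect + SS_error) = F*df_effect / (F*df_effect + df_error).
    """
    num = f_value * df_effect
    return num / (num + df_error)


def cohen_label(eta_sq):
    """Cohen's benchmarks as used in the text: 0.01 small, 0.06 medium, 0.14 large."""
    if eta_sq >= 0.14:
        return "large"
    if eta_sq >= 0.06:
        return "medium"
    if eta_sq >= 0.01:
        return "small"
    return "below small"


# F statistics and degrees of freedom reported for TT.
reported = {
    "Week x Level": (14.71, 2, 52),
    "Problem x Level": (93.28, 2, 52),
    "Week": (95.34, 1, 26),
}
for effect, (f, df1, df2) in reported.items():
    eta = partial_eta_squared(f, df1, df2)
    print(f"{effect}: eta_p^2 = {eta:.2f} ({cohen_label(eta)})")
# prints 0.36, 0.78, and 0.79, each labeled "large"
```

The recovered values match the reported effect sizes. Because a Greenhouse-Geisser correction scales both degrees of freedom by the same epsilon, applying it would not change this ratio.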

Perceived Mental Workload Results
For the dependent measure of perceived mental workload, as measured by the NASA TLX instrument, the three-way ANOVA revealed a two-way interaction effect of Problem and Level, F

[Figure: Line plot (left) shows the interaction effect of Problem and Level on average perceived mental workload scores (NASA TLX); error bars represent the standard error of the mean. Bar chart (right) shows average NASA TLX scores and standard deviations (SD) for low, medium, and high cognitive load tasks in the School Bus (left) and Ambulance (right) problem scenarios.]

Figure 23.
Bar charts represent the average standardized z-scores for oxygenation for perceived mental workload as measured by NASA TLX and standard deviation (SD) for low cognitive load, medium cognitive load, and high cognitive load tasks for School Bus (left) and Ambulance (right) problem scenarios for Week 1 and Week 2.

Results Summary
The results of the statistical analyses revealed a number of interaction effects and main effects for the behavioral, fNIRS, and NASA TLX data. Table 1 below summarizes these interaction effects and main effects for each dependent measure.

Discussion
This study used a hybrid approach to assessing mental workload [4,22] that combined (1) task performance measures, namely total time spent on a task (TT) and reaction time (RT); (2) neurophysiological measures obtained with functional near-infrared spectroscopy (fNIRS); and (3) a subjective rating scale for self-assessment of mental workload (NASA TLX). The results confirmed interaction effects between Week, Problem, and Level on performance as measured by reaction time and total time, with associated effect sizes for the behavioral measures ranging from η² = 0.32 to 0.92, indicating large effects [37]. Overall, performance on the two analogous problem-solving tasks improved from Week 1 to Week 2 across all three difficulty levels. This finding suggests that participants were able to retain their problem-solving skills and transfer them from Week 1 to Week 2. Moreover, participant performance was sensitive to task difficulty, with longer reaction times and total times for medium and high cognitive load tasks than for low cognitive load tasks. These findings are consistent with previous studies showing that task features, such as working memory demands and semantic content, affect behavioral performance and transfer during analogical problem solving [6,38,39].
The results confirmed the main hypothesis that there are interaction effects between Week, Problem, and Level on mental workload as measured by OXY (µV). Several significant interaction effects were revealed for each of the four regions of interest. Furthermore, the associated effect sizes for the OXY biomarker ranged from η² = 0.13 to 0.26, indicating medium to large effects [37]. The results corroborate findings from previous fNIRS studies of mental workload changes during cognitive tasks, which showed that mental workload was sensitive to task difficulty, with small to moderate effect sizes for the OXY biomarker [1,25]. In this study, all four brain regions of interest were sensitive to workload changes. These findings parallel previous functional brain imaging studies showing that different regions within the PFC are sensitive to task manipulations during various cognitive tasks [1,6-8,27,33]. In particular, for both the Ambulance and School Bus scenarios, bilateral activation was observed in both the left (LDLPFC, LANTPFC) and the right (RDLPFC, RANTPFC) PFC. This finding is not surprising given that the SNT game tasks involved the cognitive processes of problem solving, analogical reasoning, spatial navigation, planning, and decision making, which, as discussed earlier, are associated with both hemispheres of the PFC. It is also consistent with the fNIRS study by Liang and colleagues [1], which used Tower of Hanoi tasks to measure mental workload changes across four regions of interest in the PFC and reported bilateral activation across the left and right PFC regions, as reflected by the OXY and oxy-Hb biomarkers.
The findings from the behavioral and neural data analyses contribute to the existing body of research within cognitive load theory (CLT), which informs educational practice by proposing solutions based on human cognitive architecture [10]. Specifically, this investigation in a controlled serious gaming environment provided empirical evidence, at both the behavioral and neural levels, that the subjects were sensitive to the intrinsic cognitive load imposed by task difficulty, as well as to the extraneous load imposed by the two different problem scenarios (i.e., Ambulance and School Bus). For the School Bus problem, an overall monotonic increase across the behavioral and neural measures was observed as the difficulty level of the tasks increased. For the Ambulance problem, however, the changes in the behavioral and neural measures in response to increased task difficulty were less consistent than for the School Bus problem. This inconsistency could be explained by differences in the contextual features of the problem scenarios (e.g., driving a school bus during the day to pick up students in the School Bus problem vs. driving an ambulance at night to pick up patients in the Ambulance problem). These contextual differences, although not directly related to the intrinsic nature of the tasks (i.e., intrinsic cognitive load), could potentially impose different extraneous loads on the subject. Further research using fNIRS could provide a more in-depth understanding of how various contextual manipulations of analogous complex cognitive tasks affect mental workload and behavioral performance across varying difficulty levels.
One inconsistency between the neural and behavioral data relates to the overall increase in average oxygenation in each of the four regions of interest, in contrast to the overall decrease in the behavioral measures (TT and RT) from Week 1 to Week 2 for both problem scenarios. One possible explanation is that the positive transfer at the behavioral level, manifested by reduced total time and reaction time from Week 1 to Week 2, came at the cost of increased mental workload. This finding contradicts previous fNIRS studies reporting that improved performance (e.g., a decrease in the total time required to complete a task) was accompanied by reduced PFC activity during the later stages of learning [1,4,28]. It is possible that making the tasks within each level slightly more challenging in Week 2 (e.g., adding one collectable or providing slightly less fuel) required more cognitive effort on the part of the subjects, even though at the behavioral level they performed better in Week 2 than in Week 1 on both the School Bus and Ambulance problems (as indicated by decreases in TT and RT). Furthermore, the NASA TLX results for both problems were consistent with the behavioral measures, indicating that within each level the subjects perceived the Week 2 tasks as less mentally demanding than the Week 1 tasks. These findings suggest that a compensatory mechanism might be at play, allowing for improved performance at the expense of increased mental workload. This explanation is supported by the compensatory control model proposed by Hockey [40], which states that effective performance under high demands is accompanied by high levels of physiological activation. According to this model, protecting performance incurs compensatory costs such as increased mental effort, reflecting a trade-off between maintaining the primary task goals and the amount of mental effort invested in the task.
Future empirical research is needed to confirm this explanation and to provide a deeper understanding of the various compensatory mechanisms involved in maintaining performance under varying task demands.

Conclusions
This study investigated mental workload during the SNT game, which involves complex cognitive tasks with varying workload demands. A number of significant interaction effects were found across the behavioral measures (TT and RT), the neural measure of OXY (µV) for each of the four regions of interest within the PFC (LDLPFC, RDLPFC, LANTPFC, RANTPFC), and the NASA TLX measure of perceived mental workload. The results confirmed the main hypothesis that there are interaction effects between Week, Problem, and Level on mental workload as measured by OXY (µV). The current study presents a promising application of fNIRS for measuring hemodynamic changes within the PFC under different workload conditions in the context of serious gaming.

Informed Consent Statement:
Informed consent was obtained from all subjects involved in the study.

Data Availability Statement:
The data presented in this study are available on request from the corresponding author. The data are not publicly available due to privacy restrictions.

Conflicts of Interest:
fNIR Devices, LLC manufactures the optical brain imaging instrument, fNIRS, and licensed IP and know-how from Drexel University. Kurtulus Izzetoglu was involved in the technology development and thus offered a minor share in the startup firm, fNIR Devices, LLC. The remaining authors declare no conflict of interest.

Appendix A
The selection criteria for participating in the experiment are listed below:

Appendix C
Figure A1. Screenshots of the tutorial with the instruction page.

Appendix D
Table A2. Week 1 standardized Z scores for School Bus and Ambulance problem scenarios.
Note. Means and standard deviations (SD) of standardized z scores were computed for the variables of reaction time (Z_RT), total time (Z_TT), and average oxygenation for the left dorsolateral prefrontal cortex (Z_LDLPFC), right dorsolateral prefrontal cortex (Z_RDLPFC), left anterior prefrontal cortex (Z_LANTPFC), and right anterior prefrontal cortex (Z_RANTPFC) across three difficulty levels (low cognitive load, medium cognitive load, and high cognitive load) for two problem scenarios (School Bus and Ambulance) for Week 2.
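The note above describes means and SDs of standardized z scores per difficulty level and problem scenario. This section does not state whether scores were standardized per subject or pooled across all observations, so the sketch below assumes pooled standardization of each variable (z = (x - mean) / SD, using the sample SD) followed by averaging within each problem-by-level cell. The record keys `problem`, `level`, and `RT` and both function names are hypothetical.

```python
import statistics
from collections import defaultdict


def z_scores(values):
    """Standardize values: z = (x - mean) / SD, using the sample SD (n - 1)."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    if sd == 0:
        raise ValueError("cannot standardize a constant variable")
    return [(x - mean) / sd for x in values]


def summarize_by_condition(records, variable):
    """Mean and SD of the z-scored `variable` for each (problem, level) cell.

    Assumption: `records` is a list of dicts with hypothetical keys
    'problem', 'level', and the raw measure named by `variable` (e.g. 'RT');
    standardization is pooled over all records before grouping.
    """
    z = z_scores([r[variable] for r in records])
    cells = defaultdict(list)
    for record, z_value in zip(records, z):
        cells[(record["problem"], record["level"])].append(z_value)
    # A single observation has no sample SD, so report 0.0 for such cells.
    return {
        cell: (statistics.mean(vals), statistics.stdev(vals) if len(vals) > 1 else 0.0)
        for cell, vals in cells.items()
    }
```

Because pooled z scores have a grand mean of zero, a negative cell mean indicates a below-average value for that condition (e.g., a shorter reaction time), which is how such bar charts are typically read.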