Assessing Visual Attention Using Eye Tracking Sensors in Intelligent Cognitive Therapies Based on Serious Games

This study examines the use of eye tracking sensors as a means to identify children’s behavior in attention-enhancement therapies. For this purpose, a set of data collected from 32 children with different attention skills is analyzed during their interaction with a set of puzzle games. The authors of this study hypothesize that participants with better performance may have quantifiably different eye-movement patterns from users with poorer results. The use of eye trackers outside the research community may help to extend their potential with available intelligent therapies, bringing state-of-the-art technologies to users. The use of gaze data constitutes a new information source in intelligent therapies that may help to build new approaches that are fully-customized to final users’ needs. This may be achieved by implementing machine learning algorithms for classification. The initial study of the dataset has proven a 0.88 (±0.11) classification accuracy with a random forest classifier, using cross-validation and hierarchical tree-based feature selection. Further approaches need to be examined in order to establish more detailed attention behaviors and patterns among children with and without attention problems.


Introduction
In recent years, the use of video game-related content in areas such as education, therapy and training has risen sharply. Several studies suggest that the future of pedagogy will inevitably be linked to combining play and learning in order to promote creativity in future generations [1]. The boom in serious games harnesses the potential of video games and devotes it fully to the enhancement of specific abilities, skills and aptitudes in children and adults.
Moreover, the design and development of new adaptive serious games whose content changes based on user interaction make therapies, training and education more customized. These techniques provide systems with an efficient way of learning based on the users themselves, providing them with customized and personal experiences, which may increase their potential effects [2].
One of the most widely-used forms of adaptive intervention consists of helping students to complete some educational activities when they have specific difficulties proceeding on their own [3].
The purpose of this study is to explore the use of eye tracking sensors to evaluate the behavior of children in attention-related cognitive therapies based on serious games to determine the utility of eye-related data as an input biofeedback signal for attention improvement therapies.
Eye movements are a natural information source for proactive systems that analyze user behavior, where the goal is to infer implicit relevance feedback from gaze [4]. Moreover, following the eye-mind hypothesis put forth by Just and Carpenter in 1980, there is a close link between the direction of the human gaze and the focus of attention [5], provided that the visual environment in front of the eyes is pertinent to the task under study [6]. Eye tracking sensors collect information about the location and duration of an eye fixation within a specific area on a computer monitor.
In this study, typically developing children aged between 8 and 12 years and with different attention skills are asked to solve a set of puzzles while their gaze patterns and interaction are recorded using an eye-tracking sensor. The recorded eye information includes the location of gaze fixation on the computer screen, the duration of fixations and saccades (the path of the eye movements), along with interaction information regarding performance during the exercise. We hypothesize that participants with better performance in the proposed exercises will demonstrate patterns of eye movements that are quantifiably different from those of individuals with a weaker performance. Identification of these differences would be especially advantageous for teachers and psychologists, as this study may provide new insight into strategies for the improvement of attention skills. Moreover, the authors would like to study the relation between gaze patterns and the degree of expertise. This will be done by determining whether there are any differences between the first approach to an exercise and the subsequent ones.
This article is organized as follows: First, the use of eye tracking sensors in the field of serious games is reviewed and placed in context. Subsequently, the Materials and Methods Section discusses the form and function of the data collected from the eye tracking sensor. Next, the collected data and the approaches to data analysis are examined. Finally, the manuscript concludes with a discussion of possibilities for further research into the use of eye tracking sensors and their data as a biofeedback input to intelligent therapies.

Literature Review
The observation of eye-movements is not a new area of research within psychology-related fields, having been studied in depth over the last few decades [7][8][9].
Research using eye tracking sensors affords a unique opportunity to test aspects of theories about multimedia learning concerning processing during learning [10]. Moreover, the use of this approach may help in understanding where players focus their attention during game play [11], as well as how they confront unfamiliar games and software [12].
However, it was not until recently that researchers began to analyze and introduce eye tracking sensors and techniques in serious games and computer games [13][14][15]. Games that can be controlled solely through eye movement would be accessible to persons with decreased mobility or control. Moreover, the use of eye tracking data can change the interaction with games, producing new input experiences based on visual attention [15].
Eye tracking devices have been used in the design of educational games, in terms of assessing usability based on user gaze behaviors when interacting with the game [16,17]. El-Nasr and Yan used eye tracker sensors to analyze attention patterns within an interactive 3D game environment, so as to improve game level design and graphics [18].
Kickmeier-Rust et al. focused on assessing the effectiveness and efficiency of serious games. For this purpose, they assessed these variables with gaze data and gaze paths, in order to obtain interaction strategies in specific game situations [19]. Sennersten and Lindley also evaluated the effectiveness of virtual environments in games through the analysis of visual attention using eye tracking data [20]. Johansen et al. discussed the efficiency of eye tracker sensors in assessing users' behavior during game play [21].
Józsa and Hamornik used recorded eye tracking data to evaluate learning curves in university students while using a seven hidden differences puzzle game. They used these data to assess similarities and differences in information acquisition strategies considering gender- and education-dependent characteristics [22]. Dorr et al. conducted a similar study concluding that expert and novice players use different eye movement strategies [23]. Muir et al. used eye-tracking data to capture user attention patterns and to present results on how those patterns were affected by existing user knowledge, attitude towards getting help and performance while using the educational game, Prime Club [3].
Radoslaw et al. used eye tracker sensors for assessing render quality in games. They argued that gaze-dependent rendering was especially important when immersed in serious games, where players in virtual environments played a primary role [24]. Smith and Graham and Hillaire et al. concluded that use of an eye tracker increases video game immersion, altering the game play experience [25,26].
Chang et al. developed the game WAYLA as a means to evaluate the potential to offer new interaction experiences based on eye tracking and visual attention. These authors took advantage of the popularity and arrival of more affordable eye tracker sensors [27].
Li and Zhang used eye-movement analysis to assess patients' mental engagement in a rehabilitation game. Therapists use this feedback to adjust rehab exercises to users' needs [28]. Continuing with the health-related field, Lin et al. developed an eye-tracking system for eye motion disability rehabilitation as a joystick-controlled game [29]. Vickers et al. developed a framework that integrated automatic modification of game tasks, interaction techniques and input devices according to a user ability profile [30].
Walber et al. presented EyeGrab, a game for image classification controlled by the players' gaze. The main purpose of this game was to collect eye tracking data to enrich image context information [31].
Other studies, such as those conducted by Nacke et al., evaluated the use of eye tracker sensors as an alternative way of controlling interaction with games, obtaining favorable outcomes where this challenge results in positive affection and feelings of flow and immersion [32]. Ekman et al. go one step further, discussing the limitations of using pupil-based interaction and providing suggestions for using pupil size as an input modality [33]. Table 1 shows the experimental conditions for the most relevant articles included in this section.

Materials and Methods
This section presents the methodology used in this study along with participants' characteristics and the selection procedure.

Participants
The process for assessing attention was performed with a group of typically developing children. This process relies on data recorded with an eye tracking sensor. Participants were aged between 8 and 12 years, with an average age of 10.0 (SD = 1.34). Thirty-two randomly-selected participants (13 girls and 19 boys) were selected from a group of 83 volunteers by their teachers. This sample size was considered adequate for the purpose of the outlined pilot study [34].
These children live in the Basque Country, Spain, have not been diagnosed with any attention-related disorder and speak Spanish as their mother tongue. All of the participants were recruited from the Colegio Vizcaya School.
Since they were mature minors, the approval of parents or guardians was requested prior to conducting the study. This approval consisted of an informed consent following receipt of a detailed description of the study, distributed via the school's regular newsletter.

Materials
All participants in the study completed the same assessment, which consisted of a puzzle exercise with four different levels of difficulty. Users have to connect each of the four slices presented in the exercise with its corresponding part in the main image. As Figure 1 displays, all of the participants were presented with the same image for each level, and all of the elements in the user interface appeared in the same part of the screen at each level. The main image and the slices appeared in the middle of the screen, occupying the whole display from left to right. The question stem appeared in the upper middle part of the screen. The button to advance to the next level appeared at the lower middle part of each screen. The consistent layout of the screen was intended to minimize wide eye movements.
The settings for the different levels are outlined in Table 2. All of the users had a maximum pre-set time of 50 seconds to complete each of the levels. However, if they finished the level before the time ended, they could go on to the next exercise. Depending on the level, the displayed image was labeled as easy, medium or hard. Only Level 1 is displayed in color. Hard images have very similar slices and are more complicated to complete.

Devices and Technologies
All of the data for this study were collected on the same device, which was located at the children's school outside the laboratory environment. These conditions were considered appropriate due to the nature of the system.
The set of puzzles was developed in Python [35]. The results obtained and the necessary parameters were stored in a SQLite database [36]. The user interface and user interaction were developed using PyQt4 [37]. Fixation heat maps were produced based on the implementation developed by jjguy [38]. The classification process was implemented using the Scikit-learn library for machine learning in Python [39].
The puzzles were displayed on a 19-inch Lenovo monitor connected to an Acer Aspire Timeline X laptop running Ubuntu 12.04. All of the text in the different exercises was displayed as black text against a light-grey background following normal grammatical conventions in Spanish. Images were inserted as JPEG digital pictures scaled from their original versions. Response selection and any changes were stored by monitoring the user interaction, and eye movements were recorded with a Tobii X1 Light eye tracker sensor. Figure 2 shows the study setting while one of the participants was interacting with the system. The eye tracker is a non-invasive sensor with remote function. Participants were not required to remove their glasses or contact lenses during the tests. Accuracy under ideal conditions is 0.5 deg of the visual angle, while the sampling rate in this study was typically 28-32 Hz. As Figure 2 displays, the Tobii X1 Light sensor was located beneath the computer monitor with the headrest fastened to the front edge of the desk, monitoring the participant's head. The laptop was located behind the monitor, without interfering with the participants' field of vision.
A typical experimental trial including calibration lasted less than 20 min for each participant.

Experimental Procedure
Prior to this study, participants' teachers independently completed the EDAH scale, a questionnaire for the evaluation of ADHD in children between 6 and 12 years old [40]. Farré and Narbona designed this scale based on their experience with the adapted Conners questionnaire [41]. The EDAH measures the main characteristics of ADHD and the behavioral problems that may coexist with attentional deficit. This questionnaire was used to ensure that participants did not exhibit any ADHD-related behavior.
After completing the exercises, participants themselves were asked to fill in a usability questionnaire. The usability of the system was evaluated by a user satisfaction test based on the System Usability Scale [42]. This questionnaire consists of 10 items, which were evaluated by using a Likert scale ranging from 1, strongly agree, to 5, strongly disagree. Through feedback from this questionnaire, researchers will be able to continue to adapt the system to users' final needs.
Before completing the usability questionnaire, participants were seated in front of the eye tracking sensor to permit data collection. Users were seated opposite the center of the monitor, after adjusting the seating position to their height. Once they were aligned with the screen, the calibration process started, which took between 2 and 5 min per child. This calibration entails a visual target that moves around the screen, which participants were asked to follow with their gaze for a period of time. The calibration grid has 5 positions, one in each corner of the screen and one in the screen's center. The calibration points appeared one after the other in the same order for all participants, starting from the top left corner.
Prior to the start of the exercises, participants were told in which kind of tasks they were taking part. They were also introduced to the eye tracking technology, and the sensor functionality was explained. Participants used the system and filled in the questionnaire in a controlled environment, with a researcher observing and keeping track of all of the behavioral aspects of the study, but not interfering in the experimental setting.

Data Analysis, Processing and Classification
Gaze data recorded during the exercises were processed and analyzed in order to identify the set of features that may help to build a classifier, as shown in Figure 3. This section explains in detail the different steps involved in the data analysis and feature identification process, so as to contribute to the core of intelligent therapies based on visual attention and user interaction.

Eye Fixation Parameters
The analysis of fixations and saccadic movements during the performance of certain tasks is related to attention in various ways. Several studies support this hypothesis [43][44][45], concluding that oculomotor mechanisms rely on attention for some aspects of eye movement control [46].
During the performance of the study, raw gaze data were recorded with the eye tracking sensor. These raw gaze data were stored as .xml files in the system, with information related to the level of the exercise that was currently running.
Listing 1 shows the stored gaze data for each participant and exercise. These data consist of the (x, y) coordinates recorded by the eye tracking sensor, the timestamp in which they were perceived, the pupil size for each eye and the exercise; the level and the mode the coordinates belong to were also stored for matching the raw gaze data with other interaction recordings.
These raw data were used for analysis and processing so as to obtain meaningful information about eye fixation locations, fixation durations, saccades and saccadic durations. Fixations are the periods of time when the eyes remain fairly still and new information is acquired from the visual array [9], while saccades are the eye movements themselves. During saccades, no information is retrieved by the brain, since vision is suppressed under most normal circumstances [47]. In order to detect the saccades and fixations, some processing techniques need to be applied to the raw data file. These steps are based on the Tobii I-VT fixation filter algorithm [48], were all implemented in the Python programming language and are outlined in Figure 4.
Figure 4. Raw data processing [48].
As Figure 4 shows, the first step in the processing algorithm is to apply the gap fill-in interpolation function. This step consists of filling in data where data are missing due to tracking problems that are not related to participants' behavior (such as blinks or when the user looks away from the screen). In order to distinguish between tracking problems and users' behavior, a max gap length is set, which limits the maximum length of the gap to be filled in. Following Tobii's white paper for the I-VT fixation filter and the value used by Komogortsev, this value was set at 75 ms [48,49].
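The gap fill-in step can be sketched as follows. This is a minimal illustration and not the authors' implementation: the `(timestamp_ms, x, y)` sample layout, the function name `fill_gaps`, the use of linear interpolation and the choice to measure the gap between the two valid samples that border it are all assumptions. Only the 75 ms maximum gap length comes from the text.

```python
from typing import List, Optional, Tuple

# Hypothetical sample layout: (timestamp_ms, x, y); x and y are None when tracking was lost.
Sample = Tuple[float, Optional[float], Optional[float]]

MAX_GAP_MS = 75.0  # maximum gap length stated in the text


def fill_gaps(samples: List[Sample], max_gap_ms: float = MAX_GAP_MS) -> List[Sample]:
    """Linearly interpolate missing gaze points in gaps no longer than max_gap_ms."""
    filled = list(samples)
    n = len(filled)
    i = 0
    while i < n:
        if filled[i][1] is not None:
            i += 1
            continue
        # Find the end of this run of missing samples.
        j = i
        while j < n and filled[j][1] is None:
            j += 1
        # A gap is only filled when valid samples exist on both sides of it.
        if i > 0 and j < n:
            t0, x0, y0 = filled[i - 1]
            t1, x1, y1 = filled[j]
            # Assumption: gap length is the time between the bordering valid samples.
            if t1 - t0 <= max_gap_ms:
                for k in range(i, j):
                    t = filled[k][0]
                    w = (t - t0) / (t1 - t0)
                    filled[k] = (t, x0 + w * (x1 - x0), y0 + w * (y1 - y0))
        i = j
    return filled
```

Longer gaps, such as blinks or glances away from the screen, are left untouched so that they can later be labeled as gaps rather than as fixations.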
After the gaps are filled in, the noise reduction function is applied. This function is based on a low-pass filter, which aims to smooth out the noise. The third step is the velocity calculator, which relates each sample with its velocity, in terms of visual angle (degrees per second). In order to reduce the impact of noise, the velocity for each sample is calculated as the average velocity of a period of time, taking as the central data input the current sample. This is done using a window length of 20 ms, which, according to the literature, has been found to handle a reasonable level of noise without distorting the signal [48].
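A windowed velocity calculation along these lines might look like the sketch below. It assumes that gaze positions have already been converted to degrees of visual angle and that a flat (Euclidean) distance is an acceptable approximation for small angles; the function name and the fallback to neighbouring samples are hypothetical choices, not part of the source. The fallback matters here because, at the 28-32 Hz sampling rate reported for this study, consecutive samples are roughly 33 ms apart, so a 20 ms window centred on a sample contains only that sample.

```python
import math
from typing import List, Optional, Tuple

# Hypothetical sample layout: (timestamp_ms, x_deg, y_deg), already in visual angle.
Sample = Tuple[float, float, float]


def angular_velocities(samples: List[Sample], window_ms: float = 20.0) -> List[Optional[float]]:
    """Return one average velocity (deg/s) per sample, or None if it cannot be computed."""
    half = window_ms / 2.0
    n = len(samples)
    velocities: List[Optional[float]] = []
    for i, (t, _, _) in enumerate(samples):
        first, last = i, i
        # Extend the window to every sample within +/- half a window of the current one.
        while first > 0 and t - samples[first - 1][0] <= half:
            first -= 1
        while last < n - 1 and samples[last + 1][0] - t <= half:
            last += 1
        # Assumption: with a single sample in the window, use the immediate neighbours.
        if first == last:
            first, last = max(i - 1, 0), min(i + 1, n - 1)
        dt_s = (samples[last][0] - samples[first][0]) / 1000.0
        if dt_s <= 0:
            velocities.append(None)
            continue
        dist = math.hypot(samples[last][1] - samples[first][1],
                          samples[last][2] - samples[first][2])
        velocities.append(dist / dt_s)
    return velocities
```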
The I-VT classifier applied to the signal is based on the one described by Komogortsev et al. [49] and outlined in the Tobii white paper [48]. The classifier determines which samples belong to a saccade, fixation or gap, based on a velocity threshold and the angle velocities calculated in the previous step. It also groups together consecutive samples using the same classification. The velocity threshold is set to 30 deg/s [48,50].
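The thresholding and grouping described above can be illustrated with a short sketch. The label strings, the function names and the treatment of a velocity exactly equal to the threshold (counted as a saccade here) are assumptions; the 30 deg/s threshold is the value given in the text.

```python
from typing import List, Optional, Tuple

VELOCITY_THRESHOLD = 30.0  # deg/s, as stated in the text


def classify_samples(velocities: List[Optional[float]],
                     threshold: float = VELOCITY_THRESHOLD) -> List[str]:
    """Label each sample as 'fixation', 'saccade' or 'gap' from its angular velocity."""
    labels = []
    for v in velocities:
        if v is None:
            labels.append("gap")       # no valid velocity: tracking was lost
        elif v < threshold:
            labels.append("fixation")  # slow movement
        else:
            labels.append("saccade")   # assumption: v == threshold counts as a saccade
    return labels


def group_runs(labels: List[str]) -> List[Tuple[str, int, int]]:
    """Group consecutive identical labels into (label, first_index, last_index) runs."""
    runs: List[Tuple[str, int, int]] = []
    for i, label in enumerate(labels):
        if runs and runs[-1][0] == label:
            runs[-1] = (label, runs[-1][1], i)
        else:
            runs.append((label, i, i))
    return runs
```

Each run labeled as a fixation is a candidate fixation that still has to pass the merging and minimum-duration steps.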
The merge fixations function aims to merge adjacent fixations that have been split up. This is done taking into account two different thresholds: the max-time between fixations, which is set to 75 ms [48] and is lower than the normal blink duration [49,51,52], and the max-angle between fixations, which is set at 0.5 deg [48,49,53,54,55]. Once all of the fixations have been identified, the shorter ones are removed. For the purposes of this analysis, 100 ms was set as the lower limit for fixation duration. This value was chosen based on the work of McConkie et al., who concluded that 60 ms must pass before current visual information becomes available to the visual cortex for processing [56]. R. Tai et al. arrived at the lower limit of 100 ms by adding 30 ms, which is the time that elapses, at the end of a fixation, between the moment a command to move the eyes is sent and the onset of that saccade. They also allowed 10 ms for the processing of any currently-observed stimuli, arriving at the 100-ms threshold (60 + 30 + 10 ms) [57].
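The merging and filtering rules can be sketched as below. The `(start_ms, end_ms, x_deg, y_deg)` tuple layout and the function names are hypothetical, and the position of a merged fixation is taken as the duration-weighted mean of its parts, which is an assumption rather than something stated in the text. The three thresholds (75 ms, 0.5 deg, 100 ms) are the ones given above.

```python
import math
from typing import List, Tuple

# Hypothetical fixation layout: (start_ms, end_ms, x_deg, y_deg).
Fixation = Tuple[float, float, float, float]


def merge_fixations(fixations: List[Fixation],
                    max_time_ms: float = 75.0,
                    max_angle_deg: float = 0.5) -> List[Fixation]:
    """Merge consecutive fixations that are close in both time and visual angle."""
    merged: List[Fixation] = []
    for start, end, x, y in sorted(fixations):
        if merged:
            p_start, p_end, px, py = merged[-1]
            close_in_time = start - p_end <= max_time_ms
            close_in_space = math.hypot(x - px, y - py) <= max_angle_deg
            if close_in_time and close_in_space:
                # Assumption: the merged position is the duration-weighted mean.
                d_prev, d_cur = p_end - p_start, end - start
                total = d_prev + d_cur
                if total > 0:
                    px = (px * d_prev + x * d_cur) / total
                    py = (py * d_prev + y * d_cur) / total
                merged[-1] = (p_start, end, px, py)
                continue
        merged.append((start, end, x, y))
    return merged


def discard_short(fixations: List[Fixation], min_duration_ms: float = 100.0) -> List[Fixation]:
    """Keep only fixations that last at least min_duration_ms."""
    return [f for f in fixations if f[1] - f[0] >= min_duration_ms]
```

Merging is applied before the duration filter so that a fixation split by a brief tracking loss is not discarded as two short fragments.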
After all of the processing functions have been applied to the current data, a new gaze data file is created with all of the fixations for the current exercise and participant. As shown in Listing 2, fixation data have a similar structure to raw data. The stored fixation data save all of the fixations recorded during the exercise, along with the current activity information, user data and the duration, start time, end time and position of each fixation.

Outlier Detection Process
Once the processing stage is over, the fixation data are used to determine the outliers among the recorded data. This process is outlined in Figure 5.
As Figure 5 shows, a fixation heat map is created for each file. The fixation count heat map shows the accumulated number of fixations for each puzzle level and for each participant. Each fixation made adds a value to the color map at the location of the fixation [58].
The alpha layers of the images stored are then analyzed as a measure to identify the location and amount of fixations and saccades. All of the images are the same size and dimensions. The alpha information per image is stored in order to be processed by the median absolute deviation (MAD) algorithm for outlier detection implemented in Python.
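The median absolute deviation is a measure of scale based on the median of the absolute deviations from the median of the distribution [59]. The formula is shown in Equation (1), where $x_i$ denotes the observed values:

$$\mathrm{MAD} = \operatorname{median}_i\left(\left|x_i - \operatorname{median}_j(x_j)\right|\right) \quad (1)$$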
Moreover, the heat maps were analyzed taking into account users' overall performance during the entire study, so as to have another feature to determine outlier detection.
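A MAD-based outlier check of the kind described in this section can be sketched as follows. The input is assumed to be one scalar summary per heat map (for example, a total alpha value), and the cutoff of 3 deviations is a hypothetical choice; neither detail is given in the text.

```python
from statistics import median
from typing import List, Sequence


def mad(values: Sequence[float]) -> float:
    """Median of the absolute deviations from the median."""
    m = median(values)
    return median([abs(v - m) for v in values])


def mad_outliers(values: Sequence[float], cutoff: float = 3.0) -> List[int]:
    """Return indices of values further than cutoff * MAD from the median.

    The cutoff is a hypothetical default, not a value taken from the study.
    """
    m = median(values)
    scale = mad(values)
    if scale == 0:
        return []  # no spread around the median: nothing can be flagged reliably
    return [i for i, v in enumerate(values) if abs(v - m) / scale > cutoff]


# Hypothetical per-heat-map summary values; the last one is far from the rest.
alpha_totals = [10, 12, 11, 13, 12, 40]
print(mad(alpha_totals), mad_outliers(alpha_totals))
```

Because both the centre and the scale are medians, a single extreme recording does not inflate the threshold the way it would with a mean and standard deviation.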

Classification
This section outlines the first steps taken in the classification process. The aim of this part is to assess the feasibility of using a set of combined features to evaluate user performance. These features are related to user interaction, timing and visual attention, as well as image-related data obtained directly from the heat maps.
This part explains the theoretical insights taken in this process. Please refer to the same section in the Results part for the mathematical outcomes of this process.

Feature Identification
Feature selection is a determining factor when classifying patterns. Features need to be insensitive to noise and separated from each other. Their main purpose is to objectively describe certain aspects, in this case of the attention and performance process in intelligent therapies aimed at children.
A collection of 34 features was selected based on image characteristics and user performance related to the current exercise. Features were selected based on the recorded data. The authors, in conjunction with the multidisciplinary team taking part in this project, took into consideration performance variables, as well as gaze pattern recordings. The subset of selected features for analysis from the pilot phase is outlined in Table 3. Heat maps were divided into 9 quadrants in order to obtain detailed data about the location and density of fixations per participant and level.
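The division of a heat map into 9 regions can be illustrated by counting fixations on a 3 x 3 grid. This is a sketch only: the function name, the pixel coordinate convention (origin at the top left) and the screen size used in the example are assumptions, and the actual features in Table 3 may have been derived differently.

```python
from typing import List, Sequence, Tuple


def grid_counts(fixations: Sequence[Tuple[float, float]],
                width: float, height: float,
                rows: int = 3, cols: int = 3) -> List[List[int]]:
    """Count fixations per cell of a rows x cols grid laid over the screen."""
    counts = [[0] * cols for _ in range(rows)]
    for x, y in fixations:
        # Fixations outside the screen area are ignored.
        if 0 <= x < width and 0 <= y < height:
            row = int(y * rows / height)
            col = int(x * cols / width)
            counts[row][col] += 1
    return counts


# Hypothetical fixation positions on a hypothetical 900 x 900 pixel area.
example = [(10, 10), (450, 450), (460, 440), (899, 899), (900, 100)]
for row in grid_counts(example, width=900, height=900):
    print(row)
```

Flattening the resulting grid gives nine per-region counts for each participant and level, which can then sit alongside the performance variables in a feature vector.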
The selected features were chosen for further analysis and consideration, so as to determine if they are suitable for use in an automatic classifier, capable of discerning the users' performance based on their interaction and gaze patterns.

Feature Selection
Feature selection creates a subset of features, improving their predictive performance and constructing patterns more efficiently. This helps to avoid multidimensionality, which may otherwise have an adverse effect on the decision making process [60].
Several techniques were used in this process. In order to assess the success rate of the classifier while obtaining the most accurate set of features, a set of different ensemble classifiers was used and compared with a traditional decision tree classifier.
• Sequential search: This process works by selecting the best features based on univariate statistical tests [39]. Within this category, the select k-best feature selection algorithm was applied. This process removes all but the k highest scoring features.
• L1-based feature selection: This was applied to assess the feasibility of discarding the features with zero coefficients. This is a means of reducing the dimensionality of data [39].
• Hierarchical feature selection: In these feature selection processes, the set of features is divided into smaller subsets until only one remains in each node [61]. Tree-based estimators were applied to compute feature importance, so as to discard the irrelevant ones [39].
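To make the select k-best idea concrete, the sketch below scores each feature with a one-way ANOVA F statistic and keeps the k highest scoring ones. It is written with the standard library only and is not the Scikit-learn implementation used in the study; the scoring function, the function names and the toy data are all assumptions.

```python
from statistics import mean
from typing import Dict, List, Sequence


def f_score(values: Sequence[float], labels: Sequence[int]) -> float:
    """One-way ANOVA F statistic of a single feature across the class labels."""
    groups: Dict[int, List[float]] = {}
    for v, label in zip(values, labels):
        groups.setdefault(label, []).append(v)
    grand = mean(values)
    n, k = len(values), len(groups)
    ss_between = sum(len(g) * (mean(g) - grand) ** 2 for g in groups.values())
    ss_within = sum((v - mean(g)) ** 2 for g in groups.values() for v in g)
    if ss_within == 0:
        return float("inf") if ss_between > 0 else 0.0
    return (ss_between / (k - 1)) / (ss_within / (n - k))


def select_k_best(X: Sequence[Sequence[float]], y: Sequence[int], k: int) -> List[int]:
    """Return the column indices of the k features with the highest F score."""
    n_features = len(X[0])
    scores = [f_score([row[j] for row in X], y) for j in range(n_features)]
    ranked = sorted(range(n_features), key=lambda j: scores[j], reverse=True)
    return sorted(ranked[:k])


# Toy data (hypothetical): feature 0 separates the classes, feature 1 does not.
X = [[1.0, 5.0, 2.0], [1.2, 3.0, 2.5], [0.9, 4.0, 2.2],
     [3.0, 4.0, 2.9], [3.1, 5.0, 3.2], [2.9, 3.0, 2.4]]
y = [0, 0, 0, 1, 1, 1]
print(select_k_best(X, y, k=2))
```

Each feature is scored on its own, so this kind of filter is fast but cannot detect features that are only informative in combination.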

Classifier Performance Analysis
Ensemble learning algorithms work by running a base learning algorithm multiple times and combining the resulting hypotheses by voting [62]. Ensemble learning has received increasing interest recently, since it is more accurate and more robust to noise than single classifiers [63,64].
This article compares the performance capabilities of 3 different ensemble algorithms when they are applied to the real dataset recorded in this study. The aim of this experiment is to assess the feasibility of building a classifier able to determine user performance using an adequate set of features of a different nature recorded during the therapy.
All of the classifiers were evaluated using cross-validation. The studied classifiers were:
• Random forest: This classifier is defined as a combination of tree predictors. Each tree depends on the values of a random vector sampled independently and with the same distribution for all trees [65]. Using the random selection of features yields error rates that compare favorably to AdaBoost [66], but are more robust with noise handling [65].
• Extremely randomized trees: A tree-based ensemble method for supervised classification and regression. It is a strongly randomized attribute selection method. This algorithm is accurate and computationally efficient [67].
• AdaBoost: This algorithm is an iterative procedure that tries to approximate the Bayes classifier by combining several weaker classifiers. A score is assigned to each classifier, and the final classifier is defined as the linear combination of the classifiers from each stage [68].
Moreover, a regular decision tree classifier was applied in order to assess the potential and improvement in accuracy, if any, of the previously mentioned tree-based ensemble methods.

Results
The recordings for the results explained in this section were taken during the month of May, 2014, at the Colegio Vizcaya school in Biscay, Spain.

Analysis of User Performance: Outcome Scores and Response Times
Although the present study is focused on the use of gaze data to analyze performance in attention-related cognitive therapies, we feel that it is also important to address commonly-used measurements to categorize user performance in this type of exercise: outcome scores and response times. These measures might be quite general in some cases where they show only a vague impression of the user's performance.
Participants' responses were recorded through the system implemented in Python. Their overall number of correct responses, as well as their number of correct responses per level are shown in Table 4.
The overall mean of correct responses is 11.937 (SD = 2.20) out of a possible score of 16. When the results are examined by levels, there are some differences in performance between the first two levels, which participants considered much easier, and the last two, which they found more difficult.
Users had a maximum of 50 seconds to complete each exercise. However, they were able to finish the level before time ran out. Considering the response times, i.e., total time spent on test questions, the data show that the majority of participants used most of the available time at all of the levels.
Levels can be segmented into two groups according to difficulty. The first two are considered the easiest ones, while the last two are trickier. There is a pattern across the two groups: users tend to perform more slowly in Levels 1 and 3 than in Levels 2 and 4. This may be because they tend to be more careful with novel exercises or when the difficulty suddenly changes. Figure 6 shows the overall performance of participants, regarding total time versus correct answers. As displayed in Figure 6, users tend to respond correctly to more than half of the possible answers, while using 75% or more of the available time. When analyzing the group with the weakest performance, with a number of total correct answers below 10, 75% of the participants in this group have a higher performance time. Further analysis of user performance will be outlined in the following sections.
The mean number of correct items (11.93 out of 16) and the standard deviation (SD = 2.20) were used to obtain the threshold for the weakest performers. Subtracting one standard deviation from the mean gives 9.73; since the study needs an integer threshold, this value was rounded up to 10. Participants with scores lower than this threshold were classified as the weakest performers. A total of four participants matched this criterion, so they were paired with the four best performers to obtain two balanced groups for further analysis. In order to address the research question stated in the Introduction, the four best performers (users with IDs 20, 25, 28 and 43) and the weakest four (users with IDs 15, 16, 29 and 36) will be analyzed.
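The threshold arithmetic can be checked with a few lines of Python. The mean and standard deviation are the values reported in the text; the function names and the participant scores in the example are hypothetical and are not taken from the study data.

```python
import math
from typing import Dict, List

MEAN_CORRECT = 11.93  # mean number of correct items, from the text
SD_CORRECT = 2.20     # standard deviation, from the text


def weakest_threshold(mean_score: float, sd_score: float) -> int:
    """One standard deviation below the mean, rounded up to an integer."""
    return math.ceil(mean_score - sd_score)


def weakest_performers(scores: Dict[str, int], threshold: int) -> List[str]:
    """Return the IDs whose score is strictly below the threshold."""
    return sorted(pid for pid, score in scores.items() if score < threshold)


threshold = weakest_threshold(MEAN_CORRECT, SD_CORRECT)
# Hypothetical scores out of 16, for illustration only.
example_scores = {"A": 9, "B": 10, "C": 15}
print(threshold, weakest_performers(example_scores, threshold))
```

With a strict comparison, a participant scoring exactly 10 is not placed in the weakest group.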

Fixation Heat Maps
Fixations were analyzed for each of the participants. Fixations were defined as a gaze longer than 100 ms. In order to address the research question stated in the Introduction, the most accurate and the weakest performers were selected for further analysis.
Fixations were displayed as heat maps, which were created based on the entire time participants took for each level. Red spots indicate higher levels of fixation, with yellow and green indicating decreasing amounts of fixations. Areas without color were not fixated upon. The most accurate performers are displayed in Figure 7, while the four with the weakest performance are displayed in Figure 8. When comparing the heat maps of both groups, there are some differences between the number, density and clustering of fixations. In Figure 7, where the total score results of the participants are 15 correct answers out of 16 possible ones for every case, the number of fixations is lower than for the participants with a weaker performance. Not only is it lower among participants, it also seems to decrease when analyzing the intra-level gaze behavior for each of them.
It is important to bear in mind that an overall lower number of total fixations suggests less time spent viewing specific areas of the assessment item.
Regarding Figure 8, where the total score for these participants ranges between six and nine correct answers out of 16 possible ones, the fixation density is higher for all the cases, except for the participant with ID 29.

Quantitative Analysis
This section includes a quantitative analysis of the data regarding various features, such as the number of fixations per level, their average duration and the gender and age of the selected subgroups of participants, in order to analyze the feasibility of establishing some behavioral patterns. Table 5 displays the four participants with the best results. When processing the number of fixations, we observe that they decrease in number with the progression of the levels for all of the users, as displayed in Figure 9a. Since the exercise has the same visual layout for every level, this may be related to their having achieved a certain degree of expertise with each new level.
Table 6 displays the four participants with the weakest performance results. When processing the number of fixations, we find no specific relationship among them, as displayed in Figure 9b, which may be related to the lack of appropriate techniques for solving the puzzle task. Further analysis of the results was made on the best vs. weakest performers' data. Due to the small number of users included in this further analysis, a Mann-Whitney non-parametric test was applied. The results of the test are outlined in Table 7.
Table 7 shows that there are some significant differences (p ≤ 0.05) in performance between groups. These differences appeared in the number of fixations in Levels 2 and 4 and globally. Moreover, there are other significant differences for fixation average time (global), time (Level 4) and the number of correct answers (Levels 3 and 4). However, these results may not be enough to conclude that there are consistent differences regarding the level of expertise of the participants.
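For two groups of four participants, the Mann-Whitney test can be computed exactly by enumerating every way of splitting the eight pooled values into two groups of four (70 splits). The sketch below does this with the standard library; the exact two-sided variant, the function names and the example values are assumptions, since the text does not state which variant of the test was used.

```python
from itertools import combinations
from typing import Sequence


def u_statistic(a: Sequence[float], b: Sequence[float]) -> float:
    """Mann-Whitney U for sample a: pairs with a > b count 1, ties count 0.5."""
    return sum(1.0 if x > y else 0.5 if x == y else 0.0 for x in a for y in b)


def exact_two_sided_p(a: Sequence[float], b: Sequence[float]) -> float:
    """Exact two-sided p-value obtained by enumerating all group assignments."""
    pooled = list(a) + list(b)
    n_a, n_b = len(a), len(b)
    mean_u = n_a * n_b / 2.0
    observed = abs(u_statistic(a, b) - mean_u)
    extreme = total = 0
    for idx in combinations(range(len(pooled)), n_a):
        chosen = set(idx)
        group_a = [pooled[i] for i in idx]
        group_b = [pooled[i] for i in range(len(pooled)) if i not in chosen]
        total += 1
        # Count splits at least as far from the expected U as the observed one.
        if abs(u_statistic(group_a, group_b) - mean_u) >= observed:
            extreme += 1
    return extreme / total


# Hypothetical fixation counts for two groups of four participants.
best = [280, 295, 300, 310]
weakest = [390, 420, 450, 510]
print(u_statistic(best, weakest), exact_two_sided_p(best, weakest))
```

With four participants per group, the smallest two-sided exact p-value is 2/70 (about 0.029), reached only when the two groups do not overlap at all, so any significant result at p ≤ 0.05 under this variant implies complete separation of the groups on that feature.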
R. Tai et al. [57] and Chi et al. [69] found that fixation duration data did not produce clear and consistent differences regarding the level of expertise of the participants, which agrees with the results obtained in this section.
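As an illustration of the group comparison described in this section, the following minimal sketch computes a Mann-Whitney U statistic with an exact two-sided p-value obtained by enumerating all group assignments, which is feasible for very small groups. The function names and the fixation counts are hypothetical and are not taken from Table 7; the exact-enumeration variant is an assumption, since the text does not state how the p-values were computed.

```python
from itertools import combinations


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # ranks are 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def mann_whitney_exact(group_a, group_b):
    """Return (U statistic of group_a, exact two-sided p-value)."""
    n_a, n_b = len(group_a), len(group_b)
    ranks = average_ranks(list(group_a) + list(group_b))
    mean_u = n_a * n_b / 2  # expected U under the null hypothesis

    def u_stat(indices):
        return sum(ranks[i] for i in indices) - n_a * (n_a + 1) / 2

    u_obs = u_stat(range(n_a))
    # Enumerate every way of assigning n_a of the pooled ranks to group A.
    all_u = [u_stat(c) for c in combinations(range(n_a + n_b), n_a)]
    extreme = sum(abs(u - mean_u) >= abs(u_obs - mean_u) - 1e-12 for u in all_u)
    return u_obs, extreme / len(all_u)


# Hypothetical fixation counts for two groups of four (NOT data from Table 7).
best = [41, 38, 45, 36]
weakest = [62, 70, 58, 66]
u, p = mann_whitney_exact(best, weakest)
print(u, round(p, 4))  # prints: 0.0 0.0286
```

With two groups of four and complete separation, as in this toy example, only 2 of the 70 possible assignments are as extreme as the observed one, so 2/70 ≈ 0.029 is the smallest two-sided exact p-value attainable with groups of this size.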

Classification
This section outlines the first steps taken in the classification process. The aim of this part is to assess the feasibility of using a set of combined features to evaluate user performance. These features are related to user interaction, timing and visual attention, as well as image-related data obtained directly from the heat maps.
This part reports the numerical outcomes of this process. Please refer to Section 3.5.3 for the theoretical background.
In order to assess the optimal number of features for classification, a recursive feature elimination process with cross-validation was applied. Table 8 displays the relation between the number of features and the classifier's accuracy. The number of features depends on the feature selection algorithm applied; these algorithms were outlined in Section 3.5.3.2.
Table 8. Performance comparison of feature selection algorithms using selected classifiers.
The accuracy results displayed in Table 8 were obtained by applying a cross-validation process of 100 iterations to all of the available data. In each iteration, the user data were divided as follows: 60% of the data for training and 40% for testing the classifier.
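The evaluation protocol described above (100 random splits, 60% training and 40% testing) can be sketched as follows. This is a dependency-free illustration under stated assumptions: a simple nearest-centroid rule stands in for the classifiers of Table 8, the data are synthetic, and all function names are hypothetical. Only the splitting and scoring loop is meant to mirror the text.

```python
import random
from statistics import mean, stdev


def nearest_centroid_fit(rows, labels):
    """Return one mean feature vector (centroid) per class label."""
    centroids = {}
    for label in set(labels):
        members = [r for r, l in zip(rows, labels) if l == label]
        centroids[label] = [mean(col) for col in zip(*members)]
    return centroids


def nearest_centroid_predict(centroids, row):
    """Predict the label whose centroid is closest (squared Euclidean distance)."""
    def dist(centroid):
        return sum((a - b) ** 2 for a, b in zip(row, centroid))
    return min(centroids, key=lambda label: dist(centroids[label]))


def repeated_holdout(rows, labels, n_iter=100, train_frac=0.6, seed=0):
    """Accuracy on the held-out part for each of n_iter random splits."""
    rng = random.Random(seed)
    indices = list(range(len(rows)))
    n_train = int(round(train_frac * len(rows)))
    scores = []
    for _ in range(n_iter):
        rng.shuffle(indices)  # new random partition in every iteration
        train, test = indices[:n_train], indices[n_train:]
        model = nearest_centroid_fit([rows[i] for i in train],
                                     [labels[i] for i in train])
        hits = sum(nearest_centroid_predict(model, rows[i]) == labels[i]
                   for i in test)
        scores.append(hits / len(test))
    return scores


# Synthetic stand-in for 32 participants: two features, two classes.
data_rng = random.Random(42)
rows = [[data_rng.gauss(2.0 * (i % 2), 1.0), data_rng.gauss(0.0, 1.0)]
        for i in range(32)]
labels = [i % 2 for i in range(32)]
scores = repeated_holdout(rows, labels)
print(f"{mean(scores):.2f} (±{stdev(scores):.2f})")
```

Reporting the mean and standard deviation over the 100 held-out accuracies gives a figure in the same form as the accuracy values quoted in this article. With 32 samples, each iteration tests on only 13 of them, so a sizeable spread between iterations is to be expected.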
With this setting, feature selection seems to be beneficial for building any of the analyzed classifiers. In contrast, when employing all of the available features, the accuracy rate falls below 0.80 for the decision tree and AdaBoost classifiers. The select K-best features algorithm improves the classifiers' accuracy, especially when using 22 features. The L1-based algorithm displays good accuracy results for all of the ensemble methods, but falls below 0.80 for the decision tree classifier. The tree-based hierarchical algorithm gives good accuracy results with a limited number of features, ranging between 10 and 14 depending on the classifier employed.
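To make the select K-best idea concrete, the sketch below ranks each feature independently by a univariate score and keeps the k highest-scoring ones. The one-way ANOVA F statistic used for ranking is an assumption made for illustration; the scoring function actually used in this study is the one described in Section 3.5.3.2. The function names and the toy data are hypothetical.

```python
from statistics import mean


def anova_f(column, labels):
    """One-way ANOVA F statistic of one feature (needs at least two classes)."""
    groups = {}
    for value, label in zip(column, labels):
        groups.setdefault(label, []).append(value)
    grand = mean(column)
    n, k = len(column), len(groups)
    # Variance of class means around the grand mean (between-class spread).
    between = sum(len(g) * (mean(g) - grand) ** 2
                  for g in groups.values()) / (k - 1)
    # Variance of samples around their own class mean (within-class spread).
    within = sum((v - mean(g)) ** 2
                 for g in groups.values() for v in g) / (n - k)
    if within == 0:
        return float("inf") if between > 0 else 0.0
    return between / within


def select_k_best(rows, labels, k):
    """Return the sorted column indices of the k features with the highest F."""
    columns = list(zip(*rows))
    scores = [anova_f(col, labels) for col in columns]
    ranked = sorted(range(len(columns)), key=lambda j: scores[j], reverse=True)
    return sorted(ranked[:k])


# Toy data: features 0 and 2 separate the classes, feature 1 does not.
rows = [[1, 5, 10], [2, 3, 11], [1, 4, 12],
        [8, 4, 30], [9, 5, 31], [8, 3, 29]]
labels = [0, 0, 0, 1, 1, 1]
print(select_k_best(rows, labels, 2))  # prints: [0, 2]
```

Because each feature is scored on its own, this kind of filter is cheap to compute but ignores interactions between features, which is one reason to compare it against the L1-based and tree-based alternatives.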
The authors compared the accuracy of the selected ensemble classifiers with the overall performance of a decision tree classifier. Since the data did not follow a normal distribution, a Mann-Whitney analysis was used. The results of comparing the performance of every ensemble classifier (with all features) with the decision tree classifier (with all features) are displayed in Table 9. As Table 9 shows, the use of ensemble classifier methods significantly improves the overall performance of the classifier, regardless of the number of features employed. When using all of the available features, the best classifier for the recorded data is the random forest.
Regarding the difference in intra-classifier performance, Table 10 displays the Mann-Whitney analysis of the different feature selection algorithms, comparing their performance with the accuracy obtained with the all-features approach. Table 10 illustrates that, for almost all of the settings analyzed in this article, the use of a smaller set of features significantly improves the overall accuracy of all of the ensemble classifiers and the decision tree. After carrying out all of the detailed experimental tests based on the recorded data, it can be concluded that accurate classification of user performance according to interaction and visual attention is possible.

Discussion and Conclusions
In this section, we answer the research questions outlined in the Introduction and put forth new thoughts and trends about the present and future of assessing visual attention using eye tracking sensors in serious games.
According to the literature, several theories link eye movements with attentional processes [5,6] and with cognitive processes such as reading, visual search and scene perception. However, regarding intelligent therapies, eye movements do not always tell the whole story about the attentional process [70]. These resources should be complemented with other interaction records, as well as with relevant data about the participant. The more information the system has, the more accurately it can be customized to users' final needs.
In the Introduction, we hypothesized that participants with better performance may demonstrate patterns of eye movements quantifiably different from those of individuals with weaker performance. Although some differences were found during the exercises, it is necessary to extend or replicate the study in order to draw stronger conclusions.
A comparison of the fixation duration data did not produce clear and consistent differences corresponding to the level of performance. These results corresponded with those related to the expertise level found by R. Tai et al. [57] and Chi et al. [69].
Regarding Figure 8, which shows the fixation heat maps for the weaker performers, fixation density is higher in all cases except for the participant with ID 29. Moreover, the fixation density in Figure 7 decreases as participants progress through new levels. These findings agree with R. Tai et al., who found an inverse relationship between the number of fixations and saccades and the participants' degree of expertise [57].
When analyzing performance data, there are some differences between the two groups into which the puzzle levels are classified according to difficulty. Table 4 shows the performance results. When the exercise type or level of challenge changes, users tend to spend more time on the exercise and take longer to think. When the tasks are repeated, the ability level increases and the time to complete them drops. This may be related to the acquisition of specific problem-solving skills, which become more accurate with repetition. Further studies need to be carried out on users' ability and performance in repetitive tasks.
Intelligent therapies that dynamically adapt themselves to users' needs and performance based on their interaction with the system have been proven to be efficient in terms of improvement comparisons [71]. A good set of collected data may provide improved means for obtaining adapted and efficient intelligent cognitive data. Researchers should be very careful with the selected and recorded features. Several different approaches need to be followed in order to obtain the most accurate set of performance data.
Moreover, a deeper analysis of the timing per exercise may also be of interest. As a future approach, the instruction-reading stage will be separated from the performance of the exercise, so that we can obtain explicit performance timing with and without the reading stage. This could give further information about whether there are any differences between the first performance of an exercise and the subsequent ones. This new approach may also help in the further assessment of attention in the performance and instruction-reading stages.
In the literature, several published studies link pupil size with cognitive processes [72][73][74], although this pupillary response is slow [75]. Current eye trackers measure pupil size and report it as an additional parameter, so it is easy to analyze this feature during the performance of tasks. This parameter was not analyzed in this study, and it may be an interesting additional feature in future research on this topic.
In recent years, the popularity of eye trackers has increased, and there are some open-source projects offering tools for gaze data analysis [76][77][78][79][80], while some manufacturers offer low-cost devices, such as the EyeTribe [81]. There are also several DIY approaches for building custom eye trackers [82][83][84]. The accuracy of these systems may sometimes be slightly inferior to that of high-end eye trackers, but they may be a viable solution for use outside the laboratory setting [85]. The use of eye trackers outside the research community may help to extend their potential with available intelligent therapies, bringing state-of-the-art technologies to users.
Future work may extend the design and development of the system so that the tool, along the lines of the current exercises, also covers new capabilities, such as working memory or processing speed.
Moreover, future lines should include the design and development of a robust classifier, with the selected features outlined in Section 3.5.3.1. The initial study of the classification capabilities of ensemble methods with the available user data has produced positive results, especially when implementing a feature selection algorithm beforehand (see Section 4.4 for further information about the ensemble classifiers' performance). Other classifiers need to be studied and tested to determine whether they may be more accurate, alone or in combination. This approach will help to create an autonomous system able to discern user involvement based on visual attention and performance records.
Finally, some directions for the future are to replicate this study:
- with a greater number of users;
- with users with and without attention-related problems;
- developing a bilingual or trilingual tool that allows the study to be replicated in other areas in Spain and abroad where reported diagnoses of attention-related problems are significantly different from those in the Basque Country, Spain.
The use of gaze data constitutes a new information source in intelligent therapies that may help to build new approaches that are completely customized to final users' needs. Further studies need to be carried out in order to establish more detailed attention behaviors and patterns among children with and without attention problems. The replication of this study, along with the extension of the current system with new exercises, may help to build personalized performance profiles per user. These profiles may help in creating new customized therapies, while providing a new degree of information to the children themselves, therapists, psychologists, teachers and family.