Visual Performance Assessment of Videos—A Case Study of the Game “Spot the Difference”

“Spot the Difference” is a well-known game in which players must find subtle differences between two almost identical pictures. If “Spot the Difference” is designed for videos, how do videos differ from pictures? If performance on videos is measured with an eye tracker, what scan paths will be produced? In this study, we explored this game using a video to conduct a visual performance evaluation. Twenty-five subjects were recruited in a full-factorial experiment to investigate the effect of background (with background, without background), video type (animation, text), and arrangement (left-to-right, top-to-bottom) on searching, eye tracking performance, and visual fatigue. The results showed that the video type had a significant effect on the accuracy and subjective visual fatigue, with the accuracy and subjective visual fatigue for animation being better than for text. The results also indicated that the arrangement had a significant effect on the number of fixations, where the top-to-bottom arrangement brought a higher number of fixations. The background had a significant effect on accuracy and subjective visual fatigue, where the accuracy and subjective visual fatigue without a background were better than with a background. For the analysis of the scan path, a denser scan path was found in text than in animation, in the top-to-bottom arrangement than in the left-to-right arrangement, and without a background than with a background. In the future, game manufacturers can use the results of this research to design different “Spot the Difference” videos. When designing a simple game, an animation without a background and with a left-to-right arrangement is recommended. When designing a difficult game, the opposite settings should be used.


Introduction
Visual searching is a daily task that is conducted in various environments. As a classic visual search game, "Spot the Difference" is usually played in the form of finding differences in pictures. Players must find subtle differences in two almost identical pictures. This game is designed to test the player's concentration and ability to search. Past studies have investigated the functions of "Spot the Difference" pictures, such as improvements in learners' interaction [1], the visual distractor in memory retention [2], the vernacular-oriented methodology for investigating dialect lexis [3], and the detection of visual deficits for readers [4]. However, if "Spot the Difference" is designed for videos, how do videos differ from pictures? If performance on videos is measured with an eye tracker, what scan paths will be produced? How will the complexity of the background affect the players' performance when searching for targets? These questions were the main motivation for carrying out this study. According to the literature, the more complex the external image information is, the longer the gaze duration is [5][6][7]. Caroux et al. [8] showed that a more complex background has a more negative influence on the performance of visual searching tasks. Huang and Pashler [9] also pointed out that the difficulty of the visual search test affects the search efficiency. According to Marlen et al. [10], the background complexity and the clear or fuzzy description of the target affect the search speed and accuracy: the higher the complexity and description ambiguity are, the slower and harder it is to find the target. The current research uses videos to present the game "Spot the Difference", yet it is difficult to divide the videos into simple and complex groups based on the background colors during the design process.
In this study, to test the absence of a background, the background picture is presented in a monochromatic system; meanwhile, when there is a background, the background picture is presented in a multi-color system. The videos are divided into two types: videos without a background (one color) and videos with a background (multiple colors).
In addition, when people watch videos, different types of videos may have different impacts. Text represents the information in a verbal code, such as prose, expository text, or verbal instruction in written or spoken formats [11]. Lin and Chin [12] investigated three kinds of reading contents: science-, literary-, and magazine-related. In contrast, pictures are defined as being associated with the represented object by similarity or common structural properties. When testing performance, Lin and Hsieh [13] found that readers paid more attention to texts than pictures, and few fixations fell on pictures for English as a foreign language (EFL) beginners. Levie and Lentz [14] concluded that pictures increased readers' interest. Peeck [15] indicated that pictures improve readers' processing of difficult text. However, Sanchez and Wiley [16] found that pictures may cause the seductive details effect, and they showed how pictures can easily affect people's reading performance. For videos, Giacaman et al. [17] indicated that visual analogy videos helped students improve their learning. Mubarak et al. [18] showed that the interaction with videos for students improves the quality and outcomes of online courses. Dashtipour et al. [19] concluded that videos include text, audio, and visual features, resulting in a better performance in sentiment analysis. Based on these comparisons of texts and pictures, this study uses videos as the reading medium, and investigated two video types: text and animation.
Layout style can also affect a subject's search direction. According to the literature, people subjectively prefer to choose the editing method in which an image is placed on the left side of the text [20]. In addition, significant differences in eyeball movements and attention shifts occur when viewing images versus reading text. When reading text messages, the eyeball moves up and down or left and right along the line of text, with alternate saccades and fixations. The saccade amplitude and average time of fixation for reading text is smaller and shorter than when viewing images, and the eye movement for reading text follows a left-to-right and top-to-bottom pattern. No such pattern occurs when viewing images [21]. Previous similar studies have investigated the difference between vertical and horizontal arrangements of text on PDAs and computer screens [22,23], concluding that performance when reading vertical text is not always inferior to that when reading a horizontal format. Lin [24] investigated two types of phonetic symbols, including both vertical and horizontal arrangements, and found that the use of a vertical arrangement saves more time when sending Chinese text messages. Based on these factors, we aim to examine the differences between vertical and horizontal arrangements in this study.
In summary, background, video type, and arrangement are three critical factors affecting people when watching "Spot the Difference" videos. The purpose of this study is to assess the effects of these three factors on searching, eye tracking performance, and visual fatigue.

Participants
Twenty-five college students (mean age = 20.75 years; standard deviation = 0.52 years) took part in this experiment. All of the participants had a corrected visual acuity better than 0.8 [25] after examination with a TOPCON AGP7 vision tester.

Apparatus and Materials
A Panasonic 50 inch 3D plasma TV (TH-P50ST60W, Panasonic Corporation, Osaka, Japan) was selected as the visual display unit. A Mangold Vision eye tracker with a sampling rate of 60 Hz was used to record each participant's visual performance while watching the videos. Windows Movie Maker was the software used to edit the videos. In order to understand the impact of background on search performance, we used Animaker to design an animation video and a text video. We then inserted landscape photos into the videos with a background, while the original videos had no background. In addition, we inserted five targets that needed to be searched for in each video. To investigate the effect of the difference in arrangement on visual performance, the OpenShot software was used to construct two different arrangements, namely top-to-bottom and left-to-right. For the videos arranged top-to-bottom, the video with targets was put on the top; for the videos arranged left-to-right, the video with targets was put on the right. At the same time, in order to reduce the learning effect, the sequence of video presentation was randomized. In this study, the participants watched a 1 min video in each experimental combination, meaning they watched videos for 8 min in total across the eight combinations.

Experimental Design
Three independent variables were evaluated in this study: background, video type, and arrangement. The two background levels were with a background versus without a background. The two video types were animation and text (see Figures 1 and 2). The arrangement was classified as either left-to-right or top-to-bottom (see Figures 1 and 2). A full factorial design was used, where all independent variables were set as within-subject factors. This study comprised 25 participants, and each participant completed the eight experimental combinations in random order.
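The full factorial design described above can be sketched in a few lines of Python. This is only an illustration: the factor names and levels come from the text, while the seeding scheme, the function names, and the participant numbering from 1 to 25 are assumptions made for the example and are not taken from the study.

```python
import itertools
import random

# Factor levels as named in the text.
FACTORS = {
    "background": ("with background", "without background"),
    "video_type": ("animation", "text"),
    "arrangement": ("left-to-right", "top-to-bottom"),
}


def all_combinations():
    """Return the 2 x 2 x 2 = 8 treatment combinations as dicts."""
    names = list(FACTORS)
    return [dict(zip(names, levels))
            for levels in itertools.product(*FACTORS.values())]


def presentation_order(participant_id, base_seed=0):
    """Return a shuffled copy of the eight combinations for one participant.

    The per-participant seed is a hypothetical scheme; it only makes the
    random order reproducible in this sketch.
    """
    rng = random.Random(base_seed * 1000 + participant_id)
    order = all_combinations()
    rng.shuffle(order)
    return order


# One randomized order per participant (25 participants, numbered 1..25 here).
schedule = {pid: presentation_order(pid) for pid in range(1, 26)}
```

Because every factor is within-subject, each participant's order contains all eight combinations exactly once; only the sequence differs between participants.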

Tasks and Procedure
The experiment procedure was as follows:
1. The experimental process and steps were explained to the participants prior to the experiment.
2. The experiment began once the eye tracker had completed its six-point calibration.
3. The participants were asked to watch the video and fill out the subjective visual fatigue questionnaire for each treatment.
4. The participants were asked to complete eight treatment combinations.
5. The participants took a 5 min break.
6. Steps 2-5 were repeated until all experimental combinations were completed.
The experimental duration was about 60 min for each participant.

Dependent Variables and Data Analysis
Three dependent variables were analyzed: accuracy, number of fixations, and subjective visual fatigue. Accuracy was the number of differences found divided by the total number of differences (five). A fixation is the period between two saccades during which the eyes are relatively stationary and virtually all visual input occurs [26]; the number of fixations is the count of such periods. Subjective visual fatigue was determined using Heuer et al.'s [27] questionnaire, which contained six items, each answered on a 10-point scale. In addition, the scan path, which is the visual path followed by participants' eyes while observing the videos, was examined as a further eye tracking index.
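To make the fixation count concrete, the following sketch counts fixations in a stream of gaze samples with a dispersion-threshold (I-DT) rule. The text does not state which algorithm the eye tracker software applies, so I-DT is a stand-in chosen for illustration, and the thresholds (30 pixels of dispersion, 100 ms minimum duration) are assumed values; only the 60 Hz sampling rate comes from the study.

```python
def count_fixations(samples, max_dispersion=30.0, min_duration_ms=100.0,
                    rate_hz=60.0):
    """Count fixations in a list of (x, y) gaze samples using an I-DT rule.

    max_dispersion and min_duration_ms are assumed thresholds, not values
    reported in the study.
    """
    # Number of samples that span the minimum fixation duration.
    min_len = max(1, round(min_duration_ms * rate_hz / 1000.0))

    def dispersion(window):
        xs = [p[0] for p in window]
        ys = [p[1] for p in window]
        return (max(xs) - min(xs)) + (max(ys) - min(ys))

    count = 0
    start = 0
    n = len(samples)
    while start + min_len <= n:
        end = start + min_len
        if dispersion(samples[start:end]) <= max_dispersion:
            # Grow the window while the points stay close together.
            while end < n and dispersion(samples[start:end + 1]) <= max_dispersion:
                end += 1
            count += 1
            start = end  # continue after this fixation
        else:
            start += 1  # slide past a saccade sample
    return count
```

Samples that never stay within the dispersion limit for the minimum duration are treated as saccades and do not add to the count.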
Repeated-measures analyses of variance (ANOVAs) were conducted for accuracy, number of fixations, and subjective visual fatigue. The least significant difference (LSD) test was used to identify significant differences among the levels of the independent variables. All statistical analyses were performed using the Statistical Product and Service Solutions (SPSS) software.
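For a within-subject factor with only two levels, the repeated-measures F statistic for its main effect equals the squared paired t statistic computed on each participant's scores averaged over the other factors. The sketch below uses that equivalence; it is an illustration in plain Python and not the SPSS procedure used in the study, and the function name and input layout are assumptions.

```python
import math
from statistics import mean, stdev


def main_effect_f(level_a, level_b):
    """Return (F, (df1, df2)) for a two-level within-subject factor.

    level_a[i] and level_b[i] are participant i's scores at the two levels,
    already averaged over the other factors.
    """
    if len(level_a) != len(level_b) or len(level_a) < 2:
        raise ValueError("need paired scores from at least two participants")
    diffs = [a - b for a, b in zip(level_a, level_b)]
    n = len(diffs)
    sd = stdev(diffs)
    if sd == 0:
        raise ValueError("all paired differences are identical")
    # Paired t statistic; its square is F with (1, n - 1) degrees of freedom.
    t = mean(diffs) / (sd / math.sqrt(n))
    return t * t, (1, n - 1)
```

With 25 participants, such a main effect would be tested on 1 and 24 degrees of freedom. Because each factor has only two levels, a pairwise comparison between the levels addresses the same difference as the main effect.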

Results and Discussion
The ANOVA results for each dependent variable are shown in Table 1.

Accuracy
The ANOVA results for accuracy indicated that background (F = 11.599, p < 0.05), video type (F = 10.101, p < 0.05), the interaction between video type and arrangement (F = 11.300, p < 0.05), and the three-way interaction among background, video type, and arrangement (F = 63.325, p < 0.01) were significant factors. As shown in Figure 3, accuracy without a background was better in the top-to-bottom animation and the left-to-right text, while accuracy with a background was superior in the left-to-right animation and the top-to-bottom text. In addition, text with a background in the left-to-right arrangement showed the lowest accuracy. Caroux et al. [8] found that a complex background produced an inferior performance in visual searching. Lin [24] also showed that the horizontal arrangement of phonetic symbols brought about an inferior performance in sending Chinese text messages. A possible reason for this outcome may be that the average scanning distance from left to right is longer than that from top to bottom in a video, especially one with a background, resulting in a low accuracy. In order to see the characters in text clearly, a top-to-bottom arrangement generated a short scanning distance and resulted in a high accuracy. For the interaction between video type and arrangement, as shown in Figure 4, the animation with a left-to-right arrangement and the text with a top-to-bottom arrangement had a better accuracy. It can be inferred that a left-to-right arrangement is suitable for animations while a top-to-bottom arrangement is appropriate for texts. Participants may be used to scanning objects on screens from the left to the right in their daily lives, resulting in a high accuracy for left-to-right animation. As previously mentioned, a short scanning distance was found for top-to-bottom text and also resulted in high accuracy. Finally, similar to the results in the three-way interaction, the left-to-right text in Figure 2 showed the lowest accuracy.
The results also indicated that the background was a significant factor. The accuracy of the video without a background (M = 0.604) was significantly higher than that of the video with a background (M = 0.518). Huang and Pashler [9] indicated that the difficulty of a visual search affected efficiency. Marlen et al. [10] also found that high visual complexity and high verbal ambiguity resulted in a low accuracy and slow response time on visual search tasks. In this study, the white color was set for the without background context and scenery was set for the with background context. Compared to the monochromatic color, scenery with multiple colors could produce more visual stimuli and result in a low accuracy.
The results further indicated that video type was significant, with the accuracy of the text (M = 0.520) being significantly lower than that of the animation (M = 0.602). As participants have to spend more time in watching and understanding the meaning of texts, it makes sense that the accuracy of texts is lower than that of animation. The results from the interaction between video type and arrangement also showed that watching animation (M = 0.676) produced a higher accuracy than watching text (M = 0.488) in the left-to-right arrangement.

Number of Fixations
As shown in Table 1, the ANOVA results for the number of fixations indicated that background (F = 4.160, p < 0.05), video type (F = 4.391, p < 0.05), arrangement (F = 6.823, p < 0.05), and the interaction between background and arrangement (F = 5.832, p < 0.05) were significant factors. For the interaction between background and arrangement, as shown in Figure 5, no significant difference was found between the two arrangements with a background. However, a significant difference (p < 0.05) was found between the two arrangements without a background, where the left-to-right arrangement without a background had a smaller number of fixations. One possible reason for this may be that the participants are used to watching left-to-right screens when watching less complex backgrounds, resulting in a lower number of fixations. On the other hand, the top-to-bottom arrangement generated a short scanning distance due to its convenient search from the top down. Although it induced more fixations, participants gave positive comments on it.
The results also indicated that the background was a significant factor. The number of fixations of the video without a background (M = 1694.954) was significantly lower than that with a background (M = 1815.030). Caroux et al. [8] showed that the structure of the background (lateral or radial) had an impact on the duration of fixations. The participants also reported that they had to concentrate on complex images, which resulted in more fixations. Lin and Chen [28] also concluded that the use of more visual cues (such as 3D images) induces a greater number of fixations than the use of fewer visual cues (such as 2D images).
The results also indicated that the video type was a significant factor. The number of fixations of the animation (M = 1676.024) was significantly lower than that of the text (M = 1833.960). Lin and Hsieh [13] indicated that the total fixation counts and average fixation duration of texts were greater than those of pictures, whether for EFL beginners or intermediate level readers. In this study, the participants also reported that reading texts required more concentration and resulted in more fixations.
The results further demonstrated that the arrangement was a significant factor. The number of fixations of the left-to-right arrangement (M = 1668.364) was significantly lower than that of the top-to-bottom arrangement (M = 1841.620). The reason for this may be that participants are used to watching left-to-right screens, which results in fewer fixations. In addition, participants also considered the top-to-bottom arrangement convenient to scan. Although it induced more fixations, the participants made positive comments about it. Lin [24] showed that a vertical phonetic arrangement has better operation performance than a horizontal one in sending Chinese text messages.

Subjective Visual Fatigue
For subjective visual fatigue, Table 1 shows that background (F = 7.664, p < 0.05), video type (F = 5.827, p < 0.05), and the interaction between video type and arrangement (F = 5.486, p < 0.05) were significant factors. For the interaction between video type and arrangement, as shown in Figure 6, the subjective visual fatigue for the top-to-bottom arrangement was higher in texts than in animations. The reason for this may be that participants are used to reading from left to right; therefore, not much difference was found compared to the top-to-bottom arrangement. In addition, the top-to-bottom text induced more visual fatigue than the animation. As the participants had to watch texts more closely to distinguish differences, it is reasonable to conclude that watching texts induces more visual fatigue than watching animation.
The results also indicated that background was a significant factor. The subjective visual fatigue of the without background context (M = 17.03) was significantly lower than that of the with background context (M = 19.75). Antes [7] demonstrated that more difficult search tasks produced higher visual fatigue. Lin and Chen [28] also found that more visual cues (3D images) induce more visual fatigue (the scores of the simulator sickness questionnaire) than fewer visual cues (2D images). The participants also reported that they had to concentrate on complex images, which resulted in a high visual fatigue.
The results also indicated that the video type was a significant factor. The subjective visual fatigue of animation (M = 17.76) was significantly lower than that of the text (M = 19.13). In general, reading texts produces more visual fatigue than watching animation. In this study, the participants also reported that reading texts required more concentration and resulted in a high visual fatigue.

Scan Path
As the scan paths of the two arrangements differed, we compare them in this section. As shown in Figure 7, participants in the top-to-bottom arrangement mainly conducted the search from the top down, and the scan paths were more concentrated than in the left-to-right arrangement. As mentioned in Section 3.2, the top-to-bottom arrangement was convenient to scan due to the short scanning distance, which may be the reason the scan paths were more concentrated. On the other hand, participants mainly conducted the search from left to right in Figure 8, and the scan paths were more scattered, with some of them falling outside the frame. The reason for this may be that the scanning distance of the left-to-right arrangement is longer than that of the top-to-bottom arrangement, which makes it easier for the gaze to deviate and leads to fixation points appearing outside the frame.
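The qualitative comparison above (concentrated versus scattered paths, fixations outside the frame) could be quantified with two simple measures. The sketch below is hypothetical: the study reports scan paths only as figures, and the function names, the pixel coordinate convention, and the frame rectangle are assumptions introduced for illustration.

```python
import math


def scan_path_length(fixations):
    """Sum of Euclidean distances between consecutive fixation points."""
    return sum(math.dist(p, q) for p, q in zip(fixations, fixations[1:]))


def share_outside_frame(fixations, frame):
    """Fraction of fixations outside frame = (left, top, right, bottom)."""
    if not fixations:
        return 0.0
    left, top, right, bottom = frame
    outside = sum(1 for x, y in fixations
                  if not (left <= x <= right and top <= y <= bottom))
    return outside / len(fixations)
```

Under these assumptions, a longer path together with a larger share of fixations outside the video frame would correspond to the more scattered pattern described for the left-to-right arrangement.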

Conclusions
This study investigated the effects of background (with background, without background), video type (animation, text), and arrangement (left-to-right, top-to-bottom) on accuracy, number of fixations, and subjective visual fatigue. The results showed that background had significant effects on accuracy and subjective visual fatigue, where the accuracy and subjective visual fatigue without a background were better than those with a background. The results also indicated that video type had significant effects on accuracy and subjective visual fatigue, where the accuracy and subjective visual fatigue for animation were better than for text. Finally, arrangement had a significant effect on the number of fixations, where the top-to-bottom arrangement produced a greater number of fixations.
The main contribution of this study is to provide suggestions for game manufacturers when designing a "Spot the Difference" game. Using left-to-right animations without a background produces better search performance. In the top-to-bottom arrangement, the hot zones of the gaze are concentrated (not out of the search range) and produce more fixations. This information can be used as a reference when designing games in the future. To design an easier game, it is essential to avoid text videos (especially in the left-to-right arrangement) so as not to reduce players' accuracy.
A limitation of the current study is that participants were asked to watch the videos for only 1 min because of the large amount of eye tracking data produced. In future studies, the background colors could be adjusted to create color differences in order to understand the impact of different color combinations on search performance. Furthermore, animation and text could be integrated into a single video to investigate the impact on search performance. Finally, the length of the video, the number of target objects, and more types of games could be considered.