Fuzzy System-Based Target Selection for a NIR Camera-Based Gaze Tracker

Gaze-based interaction (GBI) techniques have been a popular subject of research in the last few decades. Among other applications, GBI can be used by persons with disabilities to perform everyday tasks and as a game interface, and it plays a pivotal role in the human-computer interface (HCI) field. While gaze tracking systems have shown high accuracy in GBI, detecting a user's gaze for target selection is a challenging problem that needs to be considered while using a gaze detection system. Past research has used the blinking of the eyes for this purpose, as well as dwell time-based methods, but these techniques are either inconvenient for the user or require a long time for target selection. Therefore, in this paper, we propose a fuzzy system-based target selection method for near-infrared (NIR) camera-based gaze trackers. The results of experiments, including tests of the usability and on-screen keyboard use of the proposed method, show that it outperforms previous methods.


Introduction
The field of gaze-based interaction (GBI) has witnessed a significant growth in recent years, in response to certain long-standing challenges in gaze-tracking research. GBI helps persons with disabilities communicate with other people or devices. In 1982, Bolt first showed that the gaze can facilitate human-computer interface (HCI) implementation [1]. Using gaze input to perform tasks or computer operations eventually became popular among users with motor disabilities. Mauri et al. proposed the idea of using computer-assistive technologies, i.e., joysticks, trackballs, virtual keyboards, and eye tracking devices, to interact with computers [2]. These technologies, however, do not work well for severely disabled people who cannot use certain parts of their body. Alternative communication devices such as electromyogram (EMG), electroencephalogram (EEG), and electro-oculogram (EOG) [3,4] are better options, but are expensive and unavailable to most people. They may also be frustrating for users as they require electrodes to be attached to the body. Hence, camera-based gaze detection methods are a desirable alternative. Gaze tracking methods on the 2D monitor of a desktop computer have been widely studied [5][6][7]. However, there are some limitations to these methods, i.e., the inability to control 3D space, and significant degradation in accuracy due to variations in the Z-distance between the monitor and a user's eyes. Therefore, to control home appliances in 3D space, a non-wearable gaze tracking system has been proposed [8]. Increasing the accuracy of gaze detection has primarily been emphasized in past studies on gaze detection, but few have tackled target selection using a detection system. To the best of our knowledge, past research used methods based on the clicks of a mouse or a keyboard, the dwell time of the gaze position, and target selection by number of eye blinks. 
However, these methods have limitations in terms of selection speed and user convenience. A detailed summary of past research on target selection in gaze trackers is provided in the next section.
The main contributions of this study are as follows:
- First, a new and improved method based on the Chan-Vese algorithm is proposed for detecting the pupil center with its boundary, as well as the glint center.
- Second, we use three features: the change in pupil size measured by template matching (to capture pupil accommodation), the change in gaze position during a short dwell time, and Gabor filtering-based texture information of the monitor image at the gaze target. A fuzzy system is then used with these three features as inputs, and the decision concerning the user's target selection is made through defuzzification.
- Third, an optimal input membership function for the fuzzy system is obtained based on the maximum entropy criterion.
- Fourth, through comparative experiments using an on-screen keyboard against a previous dwell time-based method, the performance and usability of our method were verified in a real gaze-tracking environment.
In Table 1, we summarize the comparison of the proposed method with existing methods.

Table 1. Comparison of previous and proposed methods.

Category Method Advantages Disadvantages
Category: Gaze-based methods (single modality)
Method: Use a single modality, i.e., eye blink [9,22], dwell time [12][13][14], antisaccades [15], on- and off-screen buttons [16], key strokes [19,20], face frowning [21], eyebrow raises [22], and smiling [23]
Advantages:
- A single sensing modality is used, so the design is simpler than that of methods based on multiple modalities
- Fewer input data items are required compared to multiple modalities
Disadvantages:
- Lower accuracy of target selection than methods based on multiple modalities
- Complete dependence on a single modality; minor errors in the sensed data can badly affect overall results
- Some single-modality methods are not feasible for users with high levels of motor disability, especially those who can only move their eyes [19][20][21][22][23]

Category: Gaze-based methods (multiple modalities)
Method: Use multiple modalities, i.e., pupil accommodation and dwell time [24]
Advantages:
- Higher accuracy of intentional object selection compared to methods based on a single modality
Disadvantages:
- There is room for further enhancement of detecting the user's gaze for target selection
- Incorrect detection of pupil size and of the pupil and corneal-glint centers can significantly affect accuracy

Category: Visual saliency-based methods

The remainder of this paper is organized as follows: in Section 3, our proposed system and methodology are introduced. The experimental setup is explained and the results are presented in Section 4. Section 5 contains our conclusions and a discussion of some ideas for future work.

Overview of Proposed Method
In the proposed method, images of the eye were captured using an image acquisition system based on a commercial web camera, i.e., a Logitech (Lausanne, Switzerland) C600 [58] with a universal serial bus (USB) interface, and near-infrared (NIR) light was used as the source of illumination, with 8 × 8 NIR light-emitting diodes (LEDs) for our gaze tracker. The use of an NIR illuminator served three important purposes [59,60]: first, it minimized the impact of varying ambient light conditions; second, it made the boundary of the pupil distinct; third, as NIR light is barely visible, it minimizes interference while the device is used in applications. In detail, NIR light of shorter wavelength (below 800 nm) tends to make the iris darker (compared to NIR light of wavelength greater than 800 nm). The boundary between the pupil and the iris in the image then becomes less distinct, and the error associated with locating the pupil boundary increases. On the other hand, NIR light of longer wavelength (above 800 nm) has the opposite tendency of making the iris brighter; the boundary between the pupil and the iris in the image therefore becomes more distinct, and the error associated with locating the pupil boundary is reduced. However, the camera image sensor usually becomes less sensitive to light as the wavelength increases, which means that the image captured with light of wavelength longer than 900 nm becomes darker, making correct detection of the pupil boundary difficult. Considering all these factors, an NIR illuminator of 850 nm was adopted in our gaze tracking system. Images of 1600 × 1200 pixels were captured at a rate of 30 frames per second (fps). Larger eye images were required to analyze pupil size variations during the user's gaze for target selection; thus, we used a zoom lens to obtain these images. For pupil accommodation, we need a gaze tracking system that can measure changes in pupil size.
Unfortunately, not all commercial gaze tracking systems provide this function [61][62][63][64]. Therefore, we implemented our own system. Figure 1 shows the flowchart of our proposed system. An image of the user's face is first captured by our gaze tracking camera, and the search area of the left or right eye is defined (as shown in Figure 2b). Within this area, the pupil center with its boundary and the glint center are found (see the details in Section 3.2). The bright spot on the corneal surface caused by the NIR light is referred to as the glint. In the initial step, the user is instructed to observe four positions on the monitor for user-dependent calibration. Pupil size is then measured based on the accurate boundary of the pupil region (see the details in Section 3.3). The features are then calculated to detect the user's gaze for target selection. Feature 1 (F1) represents pupil accommodation, i.e., it is measured by template matching with the graph of change in pupil size with respect to time (see the details in Section 3.3). The change in gaze position calculated over a short dwell time is referred to as feature 2 (F2) (see the details in Section 3.4). Furthermore, the texture information of the image on the monitor at the gaze target is measured by Gabor filters and is used as feature 3 (F3) (see the details in Section 3.5). These three feature values are combined using a fuzzy system, and the user's gaze for target selection can be detected based on the fuzzy output (see the details in Section 3.6).

Detection of Pupil and Glint Centers
The correct detection of the pupil region and the glint center is a prerequisite for obtaining accurate feature values for detecting the user's gaze for target selection. In our research, the pupil and glint regions (with geometric centers) were detected as shown in Figure 3, which corresponds to Steps 2-4 of Figure 1.
The approximate pupil area is first extracted as follows: to find dark pixels in the acquired image, binarization based on histogram thresholding is performed (Step 1 in Figure 3). Morphological operations and median filtering are executed to remove noise (Step 2 in Figure 3), and the approximate pupil area is detected (Step 3 of Figure 3) based on the circular Hough transform (CHT) [65]. The CHT is a basic technique for detecting circular objects in an image; the concepts of voting and local maxima, i.e., an accumulator matrix, are used to select candidate circles. The ROI including the eye is defined in the input image on the basis of the approximated pupil region (Step 4 of Figure 3). The accurate boundaries of the pupil and the glint are then located within the ROI using the Chan-Vese algorithm [66] based on an adaptive mask (Steps 5-10 of Figure 3). This algorithm depends on global properties, i.e., gray-level intensities, contour lengths, and regional areas, rather than local properties such as gradients. The main idea behind the Chan-Vese algorithm is to use an active contour model to evolve a curve from a given image u_o to detect objects in the image. In the classical active contour and snake models [67], an edge detector that depends on the gradient of image u_o is used to stop the evolving curve on the boundary of the required object. By contrast, the Chan-Vese algorithm tries to separate the image into regions based on intensities. We minimize the fitting terms and add some regularizing terms, such as the length of the curve C and/or the area of the region inside C. We minimize the energy function using the level set φ(x, y) formulation [66], where C = ∂ω is the curve, ω ⊂ Ω, and Ω is the planar image domain; c_in and c_out represent the average intensities of u_o inside and outside curve C (the regions ω and Ω\ω), respectively.
The energy function F(c_in, c_out, C) is defined by:

F(c_in, c_out, C) = µ·Length(C) + λ_in ∫_ω |u_o(x, y) − c_in|² dx dy + λ_out ∫_{Ω\ω} |u_o(x, y) − c_out|² dx dy    (2)

In the above, λ_in, λ_out, and µ are positive constants, and u_o(x, y) is the given input image. To enhance processing speed and accuracy, we propose an enhanced Chan-Vese algorithm with an adaptive mask based on the approximate pupil region (Steps 9 and 10 of Figure 3); that is, we use the constraint that (x, y) of Equation (2) belongs to the approximated pupil area. The mask parameters change according to pupil size and location. The CHT provides the rough radius and center of the pupil area in Step 3 of Figure 3. Based on this information, an accurate pupil boundary can be obtained by the Chan-Vese algorithm with the adaptive mask, as shown in Steps 10 and 11 of Figure 3. Moreover, the rough size and location of the glint are detected by histogram thresholding and the CHT, as shown in Steps 5 and 6 of Figure 3. Based on this information, an accurate glint boundary can be obtained by the Chan-Vese algorithm with the adaptive mask, as shown in Steps 8 and 11 of Figure 3 (here, we use the constraint that (x, y) of Equation (2) belongs to the approximated glint area). Using the boundaries of the pupil and the glint, their geometric centers are determined, as shown in Step 12 of Figure 3. Figure 2 shows the resulting images of the procedure. Based on these detection results of the pupil and glint areas, two features used as inputs to the fuzzy system are calculated, as shown in Sections 3.3 and 3.4.
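The refinement step can be sketched with scikit-image's stock Chan-Vese implementation. This is a simplified stand-in for the paper's enhanced algorithm: the adaptive mask is approximated by restricting the segmentation to an ROI around the rough CHT circle, and the function name and margin are illustrative assumptions.

```python
import numpy as np
from skimage.segmentation import chan_vese

def refine_pupil(eye_gray, cx, cy, r, margin=1.5):
    """Segment the pupil inside an ROI and return its mask and pixel count."""
    half = int(r * margin)
    y0, y1 = max(cy - half, 0), cy + half
    x0, x1 = max(cx - half, 0), cx + half
    roi = eye_gray[y0:y1, x0:x1].astype(float)
    roi = (roi - roi.min()) / (roi.max() - roi.min() + 1e-12)
    # Chan-Vese separates the ROI into two regions by intensity;
    # lambda1/lambda2 weight the inside/outside fitting terms and
    # mu penalizes the contour length, as in Equation (2).
    seg = chan_vese(roi, mu=0.25, lambda1=1.0, lambda2=1.0)
    # The pupil is the darker region; flip the mask if needed.
    if roi[seg].mean() > roi[~seg].mean():
        seg = ~seg
    pupil_size = int(seg.sum())  # pixel count inside the boundary
    return seg, pupil_size
```

The pixel count returned here is exactly the pupil-size measure used for feature 1 in Section 3.3.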

Calculating Feature 1 (Change in Pupil Size w.r.t. Time)
As the first feature used to detect the user's gaze for target selection, the change in pupil size with respect to time is measured by template matching. The authors of [24] measured pupil size based on the lengths of the major and minor axes of an ellipse fitted along the pupil boundary. However, the actual shape of the pupil is not elliptical, as shown in Figure 2e, and, according to [68], pupil detection based on ellipse fitting can yield an incorrect pupil size. Therefore, in our research, pupil size is measured by counting the total number of pixels inside the pupil boundary detected by the enhanced Chan-Vese method. A graph representing the change in pupil size with respect to time is then constructed, and a moving average filter consisting of three coefficients is applied to it to reduce noise.
As mentioned above, the speed of image acquisition of our gaze tracking camera was 30 frames per second. Hence, an image frame was captured in 33.3 ms (1/30 s). To increase the speed of target selection, we used a window of size 10 frames (approximately 333 ms) to measure pupil dilation and constriction over time. Even with a short time window, our method can detect the user's gaze for target selection because three features are simultaneously used.
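The smoothing and windowing just described can be sketched as follows; the function names are illustrative, and the 3-tap filter and 10-frame window follow the values stated above.

```python
import numpy as np

def smooth_pupil_sizes(sizes):
    """Apply the 3-coefficient moving-average filter to the pupil-size graph."""
    kernel = np.ones(3) / 3.0
    return np.convolve(np.asarray(sizes, float), kernel, mode="valid")

def windows(series, size=10):
    """Yield consecutive 10-frame windows (approx. 333 ms each at 30 fps)."""
    for start in range(len(series) - size + 1):
        yield series[start:start + size]
```

Each 10-frame window produced here is what the template matching of the next paragraphs operates on.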
Past research has shown that cognitive tasks can affect changes in pupil size [69,70]. Based on this concept, the size of the pupil usually decreases when the gaze is used for target selection, and we can define the shapes resulting from changes in pupil size as shown in Figure 4. Using these shapes, template-based matching is performed with a given window size. Template-based matching measures the similarity between the graph of dilation and constriction in pupil size with respect to time and the expected graph of pupil behavior during the user's gaze for target selection (the template graphs shown in Figure 4). Template-based matching compares these two graphs using the sum of squared differences (SSD). The SSD and the final template matching score are obtained by the following equations:

SSD_i = Σ_j (p_j − q_ij)²    (3)

Template matching score = min(SSD_1, SSD_2, SSD_3)    (4)

In the above, p_j is the jth value of pupil size in the input graph, and q_ij is the jth value of pupil size in the ith template graph. We used three template graphs, as shown in Figure 4. The starting position of the template matching window was detected based on changes in gaze position, i.e., changes in the horizontal and vertical gaze directions. The template matching score of Equation (4) decreases when the user's gaze is employed for target selection and increases in other cases. The template matching score is used as feature 1. Since the maximum and minimum values of pupil size vary between individuals, the pupil size values were normalized by min-max scaling before template matching, as shown in Figure 4. Although one previous study [24] used a similar concept to measure the change in pupil size through peakedness, that method has the disadvantage that the accurate position of the peak must be detected in advance in the pupil-size graph, which can be affected by noise in the input data.
In contrast to this, our method does not need to detect the position of the peak, and is more robust against noise in the input data (see Section 4).
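The feature-1 computation can be sketched as follows: min-max normalize the pupil-size window, compute the SSD against each template graph, and take the minimum as the score. The template shape used in the usage example is a simple placeholder, not one of the paper's actual curves.

```python
import numpy as np

def minmax(v):
    """Min-max scale a series to [0, 1] (constant series maps to zeros)."""
    v = np.asarray(v, float)
    rng = v.max() - v.min()
    return (v - v.min()) / rng if rng > 0 else np.zeros_like(v)

def template_matching_score(window, templates):
    """Feature 1: minimum over templates of the sum of squared differences."""
    p = minmax(window)
    ssds = [np.sum((p - minmax(q)) ** 2) for q in templates]
    return min(ssds)
```

A window whose shape matches a constriction template yields a score near zero, while a flat or noisy window yields a larger score, matching the behavior of Equations (3) and (4).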

Calculating Feature 2 (Change in Gaze Position within Short Dwell Time)
The detected pupil center and glint center (explained in Section 3.2) are used to calculate the gaze position, i.e., feature 2. Initial user calibration is performed to calculate the gaze position. For calibration, each user is instructed to examine four positions close to the corners of the monitor [24]. From this, four pairs of pupil centers and glint centers are obtained, as shown in Figure 5.

The position of the center of the pupil is compensated for by that of the glint center, which can reduce the effect of head movement on the variation in gaze position. A geometric transform matrix can be calculated with these four pairs of detected pupil centers and glint centers [24]. This matrix defines the relationship between the region occupied by the monitor and that occupied by the movable pupil, as shown in Figure 6.
Figure 6. Relationship between the pupil movable region (the quadrangle defined by (Px0, Py0), (Px1, Py1), (Px2, Py2), and (Px3, Py3)) in the eye image and the monitor region (the quadrangle defined by (Mx0, My0), (Mx1, My1), (Mx2, My2), and (Mx3, My3)).
Then, the geometric transform matrix is calculated using Equation (5), and the position of the user's gaze (Gx, Gy) is calculated by Equation (6):

Mx = a·Px + b·Py + c·Px·Py + d,  My = e·Px + f·Py + g·Px·Py + h    (5)

Gx = a·Px + b·Py + c·Px·Py + d,  Gy = e·Px + f·Py + g·Px·Py + h    (6)

In Equation (5), the eight coefficients a-h are obtained from the four calibration pairs of glint-compensated pupil centers (Pxi, Pyi) and monitor corners (Mxi, Myi) (i = 0, ..., 3) of Figure 6; Equation (6) then maps the glint-compensated pupil center (Px, Py) of the current frame to the gaze position. As the two positions are obtained from the left and the right eyes, the final gaze position is calculated by averaging these two gaze positions. The gaze position does not usually move when the user's gaze is employed for target selection. Therefore, the Euclidean distance (∆zi) between the gaze positions of the given (xi, yi) and the previous (xi−1, yi−1) image frames is calculated as shown in Equation (7). Then, feature 2 (the change in gaze position within a short dwell time) is calculated from the estimated starting (time) position of the gaze, i.e., S of Equation (8), over a specified short dwell time (the window size W of Equation (8)); feature 2 becomes smaller when the user's gaze is employed for target selection:

∆zi = √((xi − xi−1)² + (yi − yi−1)²)    (7)

F2 = Σ (i = S to S + W) ∆zi    (8)

Although a previous study [24] used a similar concept to measure changes in gaze position, that method selects the larger of the two changes in gaze position along the horizontal and vertical directions. For example, if the changes in gaze position are 3 and 4 along the horizontal and vertical directions, respectively, feature 2 measured by that method is 4 instead of 5 (√(3² + 4²)). By contrast, our method measures the change in gaze position along both the horizontal and the vertical directions, so feature 2 is 5 in this case.
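The feature-2 accumulation of Equations (7) and (8) can be sketched as follows; the function name is illustrative, and `start` and `window` stand in for S and W.

```python
import numpy as np

def gaze_movement_feature(gaze_points, start, window):
    """Sum of Euclidean gaze displacements over `window` frames from `start`."""
    pts = np.asarray(gaze_points, float)
    seg = pts[start:start + window + 1]
    # Per-frame delta-z_i of Equation (7): full 2D Euclidean distance,
    # not just the larger of the horizontal/vertical changes.
    deltas = np.linalg.norm(np.diff(seg, axis=0), axis=1)
    return float(deltas.sum())
```

For the example in the text, a single displacement of (3, 4) pixels contributes 5 to the feature, and a perfectly held gaze contributes 0.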

Calculating Feature 3 (the Texture Information of Monitor Image at Gaze Target)
Baddeley et al. and Yanulevskaya et al. found that edge frequencies are strong contributors to the location of the user's gaze, and have higher correlations to it than other factors [71,72]. Based on this concept, we extract edge-based texture information from the expected gazing location using a Gabor filter [73], and use this information as feature 3 to detect the user's gaze for target selection. The region where the object of interest is located has a larger amount of texture than regions where it is not. A 2D Gabor filter in the spatial domain is defined as follows:

g(x, y) = (1/(2π·σx·σy)) · exp(−(1/2)·(x²/σx² + y²/σy²) + 2πjWx)    (9)

where g(x, y) is the Gabor function along the x- and y-axes, σx and σy are the standard deviations of the function along the x- and y-axes, respectively, and W is the radial frequency of the sinusoid. We consider only the real part of the Gabor filter for fast processing. To obtain the Gabor wavelet, g(x, y) in Equation (9) is used as the mother Gabor wavelet. The Gabor wavelet is then obtained by scaling and rotating g(x, y) as shown in Equation (10):

g_mn(x, y) = a^(−m)·g(x', y'),  x' = a^(−m)·(x·cos θ + y·sin θ),  y' = a^(−m)·(−x·sin θ + y·cos θ)    (10)

where θ is the filter orientation expressed by θ = nπ/K, K is the number of filter orientations (n is an integer), a^(−m) is the filter scale, and m = 0, ..., P, where P + 1 is the number of scales. In order to extract accurate texture, we use 16 Gabor wavelet filters with K = 4 and P = 3, as shown in Figure 7. The average magnitude obtained using the Gabor wavelet filters within the ROI is used as feature 3. For this, the ROI for the application of the Gabor filter is defined based on the user's gaze, as shown in Figure 8. The magnitude of texture in the ROI varies according to the gaze position, and the gaze target shows a high value of feature 3 due to the complex texture of the monitor image.
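The 16-filter bank (K = 4 orientations, P + 1 = 4 scales) can be sketched with scikit-image's Gabor kernels. The base frequency and scale factor below are illustrative assumptions, not the paper's exact values, and only the real part of each kernel is used, as in the text.

```python
import numpy as np
from scipy.signal import fftconvolve
from skimage.filters import gabor_kernel

def texture_feature(roi_gray, K=4, P=3, a=2.0, base_freq=0.25):
    """Feature 3: mean Gabor-response magnitude over the 16-filter bank."""
    roi = np.asarray(roi_gray, float)
    mags = []
    for m in range(P + 1):        # scales a^(-m), m = 0..P
        for n in range(K):        # orientations theta = n*pi/K
            kern = gabor_kernel(frequency=base_freq * a ** (-m),
                                theta=n * np.pi / K)
            resp = fftconvolve(roi, np.real(kern), mode="same")
            mags.append(np.abs(resp).mean())
    return float(np.mean(mags))
```

A textured ROI (e.g., around on-screen text or icons) produces a higher feature value than a flat region of the monitor image, which is the discriminative behavior feature 3 relies on.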

Fuzzy Logic System for Detecting User's Gaze for Target Selection
To detect the user's gaze for target selection, our method uses a fuzzy system with three input features, i.e., features 1-3 (explained in Sections 3.3-3.5), as shown in Figure 9. As explained in those sections, features 1 and 2 are smaller, whereas feature 3 is larger, when the user's gaze is employed for target selection. Through normalization based on min-max scaling, these three features are made to range from 0 to 1. To make the three features consistent with one another, features 1 and 2 are recalculated by subtracting them from the maximum value (1). Therefore, features 1-3 are all larger when the user's gaze is employed for target selection, and they are used as inputs to the fuzzy logic system. Based on the output of the fuzzy system, we can determine whether the user is gazing at the selected target.
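The conditioning step described above can be sketched as follows; `prepare_fuzzy_inputs` is a hypothetical helper name, and normalizing each feature over its observed range is an assumption about how the min-max scaling is applied.

```python
import numpy as np

def prepare_fuzzy_inputs(f1_raw, f2_raw, f3_raw):
    """Min-max scale each feature to [0, 1] and flip features 1 and 2
    so that all three grow larger during a gaze for target selection."""
    def norm(v):
        v = np.asarray(v, float)
        rng = v.max() - v.min()
        return (v - v.min()) / rng if rng > 0 else np.zeros_like(v)
    return 1.0 - norm(f1_raw), 1.0 - norm(f2_raw), norm(f3_raw)
```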


Explanation of Fuzzy Membership Functions
Figure 10 shows the membership functions for the three input features 1-3. The input values are classified into two classes in the membership function: low (L) and high (H). In general, these two classes are not separable, and the membership functions are therefore defined to have overlapping areas, as shown in Figure 10. With a small number of input data items, we obtained the distributions of features 1-3 shown in Figure 10 and, based on the maximum entropy criterion [74], designed the input membership functions. For fairness, these data were not used in any of the experiments reported in Section 4.
We first define the rough shape of the input membership functions as linear, considering processing speed and the complexity of the problem, because such functions have been widely used in fuzzy applications [75][76][77]. The defined input membership functions are given in Equations (11) and (12), where i = 1, 2, and 3; µ_L_feature_i(x) is the L membership function of feature i, whereas µ_H_feature_i(x) is its H membership function. From these, Equations (13) and (14) are obtained, in which m_L_feature_i(x) is the L (data) distribution of feature i (the non-gazing data of Figure 10), whereas m_H_feature_i(x) is the H (data) distribution of feature i (the gazing data for target selection of Figure 10). Based on Equations (13) and (14), the entropy can be calculated as follows:

H(a_L_i, b_L_i, p_L_i, q_L_i, a_H_i, b_H_i, p_H_i, q_H_i) = −p_L_feature_i · log p_L_feature_i − p_H_feature_i · log p_H_feature_i    (15)

where i = 1, 2, and 3. Based on the maximum entropy criterion [74], the optimal parameters (a_L_i, b_L_i, p_L_i, q_L_i, a_H_i, b_H_i, p_H_i, q_H_i) of feature i are chosen as those that maximize the entropy H(a_L_i, b_L_i, p_L_i, q_L_i, a_H_i, b_H_i, p_H_i, q_H_i).
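A heavily simplified sketch of the maximum-entropy design is given below: for candidate breakpoints of two linear membership functions, estimate p_L and p_H from labeled non-gazing and gazing samples and keep the candidate maximizing H = −p_L·log p_L − p_H·log p_H. The specific linear shapes (`mu_low`, `mu_high`) and the way p_L and p_H are estimated are assumptions, since Equations (11)-(14) define them in the paper.

```python
import numpy as np

def mu_low(x, b):
    """Assumed L shape: 1 up to breakpoint b, then falling linearly to 0 at 1."""
    return np.clip((1.0 - x) / (1.0 - b), 0.0, 1.0)

def mu_high(x, a):
    """Assumed H shape: 0 up to breakpoint a, then rising linearly to 1 at 1."""
    return np.clip((x - a) / (1.0 - a), 0.0, 1.0)

def entropy_for(a, b, non_gazing, gazing):
    """Entropy of Equation (15) for one candidate parameter pair."""
    p_l = mu_low(np.asarray(non_gazing, float), b).mean()
    p_h = mu_high(np.asarray(gazing, float), a).mean()
    s = p_l + p_h
    p_l, p_h = p_l / s, p_h / s  # normalize to probabilities
    eps = 1e-12
    return float(-p_l * np.log(p_l + eps) - p_h * np.log(p_h + eps))

def best_params(non_gazing, gazing, grid=np.linspace(0.05, 0.95, 19)):
    """Grid search for the breakpoints that maximize the entropy."""
    return max(((a, b) for a in grid for b in grid),
               key=lambda ab: entropy_for(ab[0], ab[1], non_gazing, gazing))
```

The entropy of a two-class distribution is bounded by log 2, reached when p_L = p_H, which is what the maximization drives the breakpoints toward.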
From this, we can obtain the input membership functions of features 1-3. These membership functions are used to convert input values into degrees of membership. In order to determine whether the user's gaze for target selection occurs, the output value is also described in the form of a linear function from the membership functions, as in Figure 11 that shows the three functions of L, M, and H. Using these output membership functions, the fuzzy rule table, and a combination of the defuzzification method with the MIN and MAX rules, the optimal output value can be obtained (see details in Section 3.6.3)
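The entropy-based tuning described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the linear L/H membership shapes, the candidate parameter grid, and the way p_L and p_H are estimated from the pooled data are all simplifying assumptions.

```python
import math

def mu_L(x, a, b):
    """Assumed linear L membership: 1 below a, ramping down to 0 at b (a < b)."""
    if x <= a:
        return 1.0
    if x >= b:
        return 0.0
    return (b - x) / (b - a)

def mu_H(x, a, b):
    """Assumed linear H membership, complementary to mu_L."""
    return 1.0 - mu_L(x, a, b)

def entropy(p_L, p_H):
    """H = -p_L log p_L - p_H log p_H, as in Equation (15)."""
    return -sum(p * math.log(p) for p in (p_L, p_H) if p > 0)

# Illustrative feature samples (non-gazing vs. gazing); the real
# distributions come from Figure 10 in the paper.
non_gazing = [0.10, 0.15, 0.20, 0.25, 0.30]
gazing = [0.60, 0.65, 0.70, 0.80, 0.90]
data = non_gazing + gazing

# Grid-search (a, b) to maximize the entropy; p_L and p_H are here taken
# as the normalized total memberships of the pooled data (one plausible
# reading of Equations (13) and (14)).
best_h, best_ab = -1.0, None
for a in (0.2, 0.3, 0.4):
    for b in (0.5, 0.6, 0.7):
        mL = sum(mu_L(x, a, b) for x in data)
        mH = sum(mu_H(x, a, b) for x in data)
        p_L, p_H = mL / (mL + mH), mH / (mL + mH)
        h = entropy(p_L, p_H)
        if h > best_h:
            best_h, best_ab = h, (a, b)
```

The maximum achievable entropy is log 2 (p_L = p_H = 0.5), so the search favors parameters that balance the two membership masses over the data.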

Fuzzy Rules Based on Three Input Values
As explained in Section 3.6.1 and Figure 10, the values of features 1-3 are larger when the user's gaze is employed for target selection. We define the output of our fuzzy system as "H" when it is, and as "L" when it is not. Based on this, we define the fuzzy rules shown in Table 2. Using the three normalized input features, six corresponding values can be acquired from the input membership functions, as shown in Figure 12. The proposed method defines which of L and H can be used as inputs for the defuzzification step using the eight rules in Table 2. The MIN or MAX method is commonly used for this purpose. In the MIN method, the minimum value is selected from each combination set of three members and used as input for defuzzification; in the MAX method, the maximum value is selected instead. For example, for a combination set of (0.25(H), 0.00(L), 0.68(H)), the MIN method selects the minimum value (0.00) as input, whereas the MAX method selects the maximum value (0.68).

Then, on the basis of the fuzzy logic rules from Table 2, the values selected by the MIN or MAX method are obtained as listed in Table 3. We refer to these values as "inference values" (IVs). As shown in Table 3, these IVs are used as inputs for defuzzification to obtain the output. The MIN and MAX rules were compared in our experiments.
Several defuzzification methods can be applied in fuzzy systems. We considered five of them, i.e., first of maxima (FOM), last of maxima (LOM), middle of maxima (MOM), center of gravity (COG), and bisector of area (BOA) [76-79]. In each defuzzification method excluding COG and BOA, the maximum values of the IVs are used to calculate the output value. The maximum IVs were IV1(L) and IV2(M), as shown in Figure 13a. In the FOM method, as the name implies, the first value after defuzzification is selected as the optimal weight value, represented as w1 in Figure 13a. The last defuzzification value, i.e., w3, is the optimal weight value selected by the LOM method. The MOM method obtains the optimal weight value as the middle of the optimal weight values of FOM and LOM, i.e., w_MOM = (w1 + w3)/2. The output scores of the COG and BOA methods are calculated differently from those of the other defuzzification methods. The COG method, also known as center-of-area defuzzification, calculates the output score based on the geometrical center of the region formed by all IVs, counting overlapping areas only once.
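The MIN/MAX rule evaluation can be sketched concretely. The eight-entry rule table below is hypothetical (the actual mapping is the paper's Table 2), but the snippet reproduces the worked example: for the combination (H, L, H) with memberships (0.25, 0.00, 0.68), MIN selects 0.00 and MAX selects 0.68.

```python
# Hypothetical rule table standing in for the paper's Table 2:
# keys are (feature1, feature2, feature3) classes, values are output classes.
RULES = {
    ('L', 'L', 'L'): 'L', ('L', 'L', 'H'): 'L',
    ('L', 'H', 'L'): 'L', ('H', 'L', 'L'): 'L',
    ('L', 'H', 'H'): 'M', ('H', 'L', 'H'): 'M',
    ('H', 'H', 'L'): 'M', ('H', 'H', 'H'): 'H',
}

def infer(memberships, combine='MIN'):
    """memberships: one {'L': deg, 'H': deg} dict per feature (six values total).
    For each of the eight rule combinations, take the MIN (or MAX) of the three
    member degrees, then aggregate per output class with max -> the IVs."""
    pick = min if combine == 'MIN' else max
    ivs = {'L': 0.0, 'M': 0.0, 'H': 0.0}
    for combo, out in RULES.items():
        degs = [memberships[i][c] for i, c in enumerate(combo)]
        ivs[out] = max(ivs[out], pick(degs))
    return ivs

# Example memberships: feature 1 -> (L 0.75, H 0.25), feature 2 -> (L 0.00,
# H 1.00), feature 3 -> (L 0.32, H 0.68); the (H, L, H) combination gives the
# paper's example set (0.25(H), 0.00(L), 0.68(H)).
memberships = [{'L': 0.75, 'H': 0.25}, {'L': 0.00, 'H': 1.00}, {'L': 0.32, 'H': 0.68}]
ivs_min = infer(memberships, 'MIN')
ivs_max = infer(memberships, 'MAX')
```

With these illustrative memberships the MIN rule yields IVs of 0.32 (L), 0.68 (M), and 0.25 (H) under the assumed table.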
As shown in Figure 13b, regions R1, R2, and R3 are defined based on all IVs: region R1 is the quadrangle defined by connecting the four points (0, IV1(L)), (w1, IV1(L)), (0.5, 0), and (0, 0); R2 is the quadrangle defined by connecting the four points (w2, IV2(M)), (w3, IV2(M)), (1, 0), and (0, 0); and R3 is the quadrangle defined by connecting the four points (w4, IV3(H)), (1, IV3(H)), (1, 0), and (0.5, 0).
Finally, the optimal weight value of the fuzzy system (w5) is calculated from the center of gravity of regions R1, R2, and R3 (considering the overlapping regions of R1, R2, and R3 only once), as shown in Figure 13b. The BOA is calculated by a vertical line (w6) that divides the region defined by the IVs into two sub-regions of equal area, which is why it is called the bisector method. Following this, our fuzzy system determines that the user gazes at the position for target selection if the output score of the fuzzy system is greater than a predefined threshold.
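The COG and BOA computations can be sketched numerically. The output membership shapes below (L ramping down over [0, 0.5], M a triangle peaked at 0.5, H ramping up over [0.5, 1]) are assumptions standing in for Figure 11, and the integration is a simple Riemann sum rather than the closed-form geometry of Figure 13b.

```python
def out_L(w):
    # Assumed L output function: 1 at w = 0, linear down to 0 at w = 0.5.
    return max(0.0, 1.0 - 2.0 * w)

def out_M(w):
    # Assumed M output function: triangle peaking at w = 0.5.
    return max(0.0, 1.0 - 2.0 * abs(w - 0.5))

def out_H(w):
    # Assumed H output function: 0 at w = 0.5, linear up to 1 at w = 1.
    return max(0.0, 2.0 * w - 1.0)

def aggregated(w, ivs):
    """Clip each output function at its IV and combine by max, so that
    overlapping regions are counted only once (as in the COG description)."""
    return max(min(ivs['L'], out_L(w)),
               min(ivs['M'], out_M(w)),
               min(ivs['H'], out_H(w)))

def defuzzify_cog(ivs, n=2001):
    """Center of gravity of the aggregated region via a Riemann sum."""
    num = den = 0.0
    for k in range(n):
        w = k / (n - 1)
        m = aggregated(w, ivs)
        num += w * m
        den += m
    return num / den if den else 0.5

def defuzzify_boa(ivs, n=2001):
    """Bisector of area: the w that splits the aggregated area into halves."""
    ws = [k / (n - 1) for k in range(n)]
    ms = [aggregated(w, ivs) for w in ws]
    half, run = sum(ms) / 2.0, 0.0
    for w, m in zip(ws, ms):
        run += m
        if run >= half:
            return w
    return 1.0
```

For instance, with IVs (L: 0, M: 0, H: 1) the aggregated region is just the H ramp, whose centroid is 5/6 and whose area bisector sits slightly to its right.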

Experimental Results
The performance of the proposed method for detecting the user's gaze for target selection was measured through experiments performed with 15 participants, where each participant attempted two trials for each of three target objects, i.e., a teddy bear, a bird, and a butterfly, displayed at nine positions on a 19-inch monitor, as shown in Figure 14. That is, three experiments were performed with these objects, each displayed at nine positions on the screen. We collected 270 data items (15 participants × 2 trials × 9 gaze positions) of gazing for target selection, i.e., true positive (TP) data, and the same number of non-gazing data items, i.e., true negative (TN) data, for each of the three experiments. Most participants were graduate students, and some were faculty or staff members of our university's department. They were selected considering the variation in eye characteristics with age, gender, and nationality, and all participated voluntarily. Of the 15, five wore glasses and four wore contact lenses; the remaining six wore neither. The participants' ages ranged from the 20s to the 40s (mean age 29.3 years). Nine participants were male and six were female. Participants of different nationalities were involved in our experiments: one Mongolian, one Tanzanian, two Pakistanis, four Vietnamese, and seven Koreans. Before the experiments, we gave all participants a sufficient explanation of the experiments and obtained written informed consent from each of them.

In the first experiment, we compared the accuracy of our method in detecting the boundaries of the pupil and the glint, as well as the center of each, with that of a previous method [24]. As shown in Figure 15b,c, the boundary and center of the pupil detected by our method were closer to the ground truth than those calculated by the previous method.
Moreover, as shown in Figure 15d,e, the boundary and center of the glint detected by our method were closer to the ground truth than those of the previous method. In this experiment, the boundary and center of the ground truth were manually chosen. Further, for all images, we measured detection errors based on the Euclidean distance between the center of the ground truth and the centers detected by our method and the previous method [24]. As shown in Table 4, our method yielded higher accuracy (lower error) in detecting the centers of the pupil and the glint than the previous method.
Table 4. Average Euclidean distance between the center of the ground truth and the centers detected by our method and the previous method (unit: pixels).

Pupil/Glint      Method                 Average Euclidean Distance
Pupil center     Proposed method        2.1
                 Previous method [24]   4.6
Glint center     Proposed method        2.5
                 Previous method [24]   4.7
In the second experiment, the accuracy of detecting the user's gaze for target selection on TP and TN data, according to various defuzzification methods, was compared in terms of equal error rate (EER). Two types of errors, type I and type II, were considered. Incorrectly classifying TP data as TN data was defined as a type I error, and incorrectly classifying TN data as TP data was defined as a type II error. As explained in the previous section, our system determines that the user's gaze is employed (TP) if the output score of the fuzzy system is higher than the threshold; otherwise, it determines that the user's gaze is not employed (TN). Therefore, type I and II errors depend on the threshold: if the threshold is increased, the type I error increases and the type II error decreases, whereas with a smaller threshold, the type I error decreases and the type II error increases. At the threshold where the type I and II errors are most similar, the EER is calculated by averaging the two errors.
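The EER computation described above amounts to a threshold sweep. The sketch below assumes per-sample fuzzy output scores (the score values shown are illustrative) and averages the type I and type II errors at the threshold where they are closest.

```python
def equal_error_rate(tp_scores, tn_scores):
    """Sweep candidate thresholds; type I error = TP samples scored <= threshold
    (gaze missed), type II error = TN samples scored > threshold (false
    selection). Returns the average of the two errors (in %) at the threshold
    where they are most similar."""
    best_gap, best_eer = None, None
    for thr in sorted(set(tp_scores + tn_scores)):
        type1 = 100.0 * sum(s <= thr for s in tp_scores) / len(tp_scores)
        type2 = 100.0 * sum(s > thr for s in tn_scores) / len(tn_scores)
        gap = abs(type1 - type2)
        if best_gap is None or gap < best_gap:
            best_gap, best_eer = gap, (type1 + type2) / 2.0
    return best_eer

# Illustrative scores: well-separated classes give EER = 0, overlap raises it.
separable = equal_error_rate([0.7, 0.8, 0.9, 0.95], [0.1, 0.2, 0.3, 0.4])
overlapping = equal_error_rate([0.4, 0.6, 0.7], [0.3, 0.5, 0.65])
```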

Experiment 1 (Bear)
In the first experiment, a teddy bear was used as the target object, as shown in Figure 14b. The classification results of TP and TN data according to the five defuzzification methods using the MIN and MAX rules are listed in Tables 5 and 6, respectively. As indicated in these tables, the smallest EER of classification (approximately 0.19%) was obtained by the COG with the MIN rule. Figures 16 and 17 show the receiver operating characteristic (ROC) curves for the classification results of TP and TN data according to the various defuzzification methods using the MIN or MAX rules, respectively. The ROC curve represents the change in type I error (%) according to the increase in 100 − type II error (%). When both type I and II errors were small, the accuracy of a method was regarded as high; therefore, the closer the ROC curve is to (0, 100) (a type I error of 0% and a "100 − type II error" of 100%) on the graph, the higher the accuracy. As shown in these figures, the accuracy of classification by the COG with the MIN rule was higher than that obtained by the other defuzzification methods.

Experiment 2 (Bird)
In the second experiment, a bird was used as the target object, as shown in Figure 14b. The classification results of TP and TN data according to the five defuzzification methods using the MIN and MAX rules are listed in Tables 7 and 8, respectively. As indicated in these tables, the smallest EER of classification (approximately 0%) was obtained by the COG with the MIN rule.
Table 7. Type I and II errors with EER using the MIN rule (unit: %).

Figures 18 and 19 show the ROC curves for the classification results of TP and TN data according to the various defuzzification methods using the MIN or MAX rules. As shown, the accuracy of classification by the COG with the MIN rule was higher than that of the other defuzzification methods.

Experiment 3 (Butterfly)
In the third experiment, a butterfly was used as the target object, as shown in Figure 14b. The classification results of TP and TN data according to the five defuzzification methods using the MIN and MAX rules are listed in Tables 9 and 10, respectively. As indicated in these tables, the smallest EER of classification (approximately 0.19%) was obtained by the COG with the MIN rule.

Figures 20 and 21 show the ROC curves for the classification results of TP and TN data according to the various defuzzification methods using the MIN or MAX rules. As shown, the accuracy of classification by the COG with the MIN rule was higher than that of the other defuzzification methods.
The above results were verified by comparing the proposed method, which uses three features (change of pupil size w.r.t. time measured by template matching, change in gaze position within a short dwell time, and the texture information of the monitor image at the gaze target), with the previous method [24], which uses two features (change of pupil size w.r.t. time measured by peakedness, and change in gaze position within a short dwell time). In addition, we compared the accuracy of the proposed method with that of a method using three features (change of pupil size w.r.t. time measured by peakedness, change in gaze position within a short dwell time, and the texture information of the monitor image at the gaze target). For convenience, we call the last method "Method A". The only difference between the proposed method and "Method A" is that the change in pupil size w.r.t. time is measured by template matching in the proposed method but by peakedness in "Method A".
In all cases, the ROC curves with the highest accuracy among the various defuzzification methods with the MIN or MAX rules are shown. As shown in Figure 22, the accuracy of the proposed method was always higher than that of "Method A" and the previous method [24] in all three experiments, i.e., bear, bird, and butterfly.
In the next experiment, we compared the accuracy of the proposed method, "Method A", and the previous method [24] when noise was included in the input data. All features of the proposed and previous methods are affected by the accuracy of pupil size and gaze position detection, which is in turn affected by the performance of the gaze tracking system. Thus, we added Gaussian random noise to the detected pupil size and gaze position.
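This noise-injection step can be sketched as follows; the noise levels and the sample measurements are illustrative choices, not the paper's actual parameters.

```python
import random

random.seed(42)  # fixed seed so the sketch is reproducible

def add_gaussian_noise(values, sigma):
    """Add zero-mean Gaussian noise (standard deviation sigma) to a list of
    measurements, e.g., detected pupil sizes or gaze coordinates."""
    return [v + random.gauss(0.0, sigma) for v in values]

pupil_sizes = [30.0, 30.5, 31.2, 33.0, 35.1]   # illustrative sizes (pixels)
gaze_x = [840.0, 845.0, 841.5, 838.0, 842.0]   # illustrative gaze x (pixels)

noisy_sizes = add_gaussian_noise(pupil_sizes, sigma=0.5)
noisy_gaze = add_gaussian_noise(gaze_x, sigma=5.0)
```

The perturbed measurements are then fed through the same feature extraction and fuzzy classification pipeline to measure the degradation in EER.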
As shown in Figure 23, noise had a stronger effect on the previous method [24] and "Method A" than on the proposed method in all three experiments. A notable decrease in accuracy, with an increase in EER, was observed for the latter two methods when noise caused incorrect detection of pupil size and gaze position.
In the next experiments, we compared the usability of our system with the conventional dwell time-based selection method [13,14]. We asked the 15 participants to rate the convenience and interest of performing the target selection task with the proposed method and the dwell time-based method using a questionnaire (convenience: 5 = very convenient, 4 = convenient, 3 = normal, 2 = inconvenient, 1 = very inconvenient; interest: 5 = very interesting, 4 = interesting, 3 = normal, 2 = uninteresting, 1 = very uninteresting).
To ensure that our results were unaffected by participant learning and physiological state, such as fatigue, we gave each participant a 10-min rest between experiments. Based on [13,14], the dwell time for target selection was set at 500 ms. That is, when the change in the user's gaze position for feature 2 was lower than the threshold, and this state was maintained for longer than 500 ms, target selection was activated.
The average scores are shown in Figure 24, which shows that our method scored higher than the conventional dwell time-based method in terms of both convenience and interest. We also performed a t-test [80] to verify that user convenience with the proposed method was statistically higher than with the conventional dwell time-based method. The t-test was performed on two independent samples: user convenience with our system (µ = 3.8, σ = 0.5) and with the conventional dwell time-based method (µ = 2.8, σ = 0.7). The calculated p-value was approximately 3.4 × 10−4, smaller than the significance level of 0.01 (99%). Hence, the null hypothesis of the t-test, i.e., that there is no difference between the two independent samples, was rejected. Therefore, we can conclude that there was a significant difference, at the 99% level, in user convenience between our proposed method and the dwell time-based method.
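For reference, the t statistic behind such a comparison can be recomputed from the reported summary statistics. The sketch below uses Welch's form (the paper does not state whether Welch's or Student's t was used), so the exact value is an approximation of the paper's analysis.

```python
import math

def welch_t(m1, s1, n1, m2, s2, n2):
    """Welch's t statistic for two independent samples, computed from summary
    statistics (means m, standard deviations s, sample sizes n)."""
    return (m1 - m2) / math.sqrt(s1 ** 2 / n1 + s2 ** 2 / n2)

# Convenience ratings reported in the study: proposed (3.8, 0.5) vs.
# dwell time-based (2.8, 0.7), n = 15 participants per condition.
t_convenience = welch_t(3.8, 0.5, 15, 2.8, 0.7, 15)
```

This gives t roughly 4.5, comfortably above the approximate two-tailed 0.01 critical value of about 2.8 for the applicable degrees of freedom (around 25), consistent with the reported rejection of the null hypothesis.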
A t-test analysis of user interest was also performed and yielded similar results: user interest with our proposed system (µ = 4.2, σ = 0.5) was higher than with the dwell time-based system (µ = 2.6, σ = 0.9). The calculated p-value was 6.74 × 10−6, i.e., smaller than the significance level of 0.01 (99%). Therefore, we concluded that there was a significant difference in user interest between our system and the dwell time-based system. The average scores for convenience and interest were higher for the proposed method because it is more natural than a conventional dwell time-based method.
To analyze the effect size of the difference between the two groups, we performed Cohen's d analysis [81]. It classifies the effect as small if d is within the range 0.2-0.3, medium if it is about 0.5, and large if it is greater than or equal to 0.8. We calculated the value of Cohen's d for convenience and interest. For user convenience, it was 1.49, which falls in the large-effect category; hence, there was a major difference between the two groups. For user interest, the calculated Cohen's d was approximately 2.14, also in the large-effect category. Hence, we concluded that both user convenience and user interest showed a large effect in the difference between our proposed method and the dwell time-based method.
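Cohen's d can likewise be recomputed from the summary statistics. The pooled-standard-deviation form below is the textbook definition; applied to the convenience ratings it gives d of roughly 1.64 rather than the reported 1.49, which suggests the paper used a slightly different pooling, so this block is a sketch of the standard formula only.

```python
import math

def cohens_d(m1, s1, m2, s2, n1=15, n2=15):
    """Cohen's d using the pooled standard deviation of two independent
    samples (means m, standard deviations s, sample sizes n)."""
    pooled = math.sqrt(((n1 - 1) * s1 ** 2 + (n2 - 1) * s2 ** 2) / (n1 + n2 - 2))
    return (m1 - m2) / pooled

# Ratings reported in the study (mean, standard deviation).
d_convenience = cohens_d(3.8, 0.5, 2.8, 0.7)
d_interest = cohens_d(4.2, 0.5, 2.6, 0.9)
```

Both values land well above the 0.8 cutoff for a large effect, matching the paper's conclusion even though the exact magnitudes differ slightly from the reported ones.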
To confirm the practicality of our method, we performed additional experiments using an on-screen keyboard based on our method, where each participant typed words through our system on an on-screen keyboard displayed on a monitor, as shown in Figure 25. All 15 subjects from the previous experiments participated, and a 19-inch monitor with a resolution of 1680 × 1050 pixels was used. Twenty sample words were selected based on frequency of use, as shown in Table 11 [82]: the, and, that, have, for, not, with, you, this, but, his, from, they, say, her, she, will, one, all, would. As shown in Table 11, the upper-left words "the" and "and" are more frequently used than the lower-right words "all" and "would". If the user's gaze detected by our method was associated with a specific button, and our method determined that the gaze was employed for target selection, the corresponding character was selected and displayed, as shown in Figure 25.
We performed a t-test to prove that our method is statistically better than the conventional dwell time-based method [13,14], as shown in Figures 26 and 27.
To ensure that our results were unaffected by participant learning and physiological state, such as fatigue, we gave each participant a 10-min rest between experiments. Based on [13,14], the dwell time for target selection was set at 500 ms. In all cases, our gaze detection method was used for fair comparison, and selection was performed using either our method or the dwell time-based method. We conducted our statistical analysis using four performance criteria, i.e., accuracy, execution time, interest, and convenience.
As shown in Figure 26a, the t-test was performed using two independent samples: the user's typing accuracy with our system (µ = 89.7, σ = 7.4) and with the conventional dwell time-based method (µ = 67.5, σ = 6.5). The calculated p-value was approximately 1.72 × 10−9, smaller than the significance level of 0.01 (99%). Hence, the null hypothesis of the t-test, i.e., that there is no difference between the two independent samples, was rejected. Therefore, we concluded that there was a significant difference, at the 99% level, in accuracy between the proposed method and the dwell time-based method.
Similarly, a t-test analysis of the average execution time for typing one character was performed, as shown in Figure 26b. This test yielded similar results: the average execution time for typing one character with our system was µ = 0.67, σ = 0.013, versus µ = 2.6, σ = 0.17 with the dwell time-based system. The calculated p-value was 2.36 × 10−16, i.e., smaller than the significance level of 0.01 (99%). Therefore, we concluded that there was a significant difference in the execution time for typing one character on a virtual keyboard between our system and the dwell time-based system. Although the dwell time for target selection was set at 500 ms in the dwell time-based method, participants often did not wait this long and moved their gaze position, which increased the average execution time for typing one character, as shown in Figure 26b.
A t-test analysis of the execution time for typing one word was also conducted, as shown in Figure 26c. We compared our proposed system (µ = 2.9, σ = 0.42) with the dwell time-based system (µ = 8.7, σ = 0.87). The calculated p-value was 5.7 × 10−16, i.e., smaller than the significance level of 0.01 (99%). Therefore, we concluded that there was a significant difference in the execution time for typing one word on a virtual keyboard between our system and the dwell time-based system.
As shown in Figure 27, the t-test analysis for user interest compared our proposed system (µ = 4.0, σ = 0.4) with the dwell time-based system (µ = 2.4, σ = 0.9). The calculated p-value was 6.1 × 10−7, i.e., smaller than the significance level of 0.01 (99%). Therefore, we concluded that there was a significant difference in user interest between our system and the dwell time-based system. Similarly, we performed a t-test with respect to user convenience: with our system (µ = 3.9, σ = 0.5) and the conventional dwell time-based method (µ = 2.9, σ = 0.8), the calculated p-value was approximately 3.4 × 10−4, smaller than the significance level of 0.01 (99%). Hence, the null hypothesis of the t-test, i.e., that there is no difference between the two independent samples, was rejected. Therefore, we concluded that there was a significant difference, at the 99% level, in user convenience between our proposed method and the dwell time-based method. As shown in Figure 27, the average scores for convenience and interest were higher for the proposed method because it is more natural than a conventional dwell time-based method.
Similarly, we calculated the value of Cohen's d for convenience, interest, average execution time for typing, and accuracy. For user convenience and interest, the values were 1.49 and 2.34, respectively, which lie in the large-effect category; hence, the difference between the two groups was substantial. For the average execution times for typing one character and one word, and for accuracy, the calculated values of Cohen's d were approximately 15.92, 8.51, and 3.19, respectively, which also lie in the large-effect category. Hence, from the p-values and Cohen's d, we concluded that user convenience, interest, average execution time, and accuracy were significantly different between the proposed and the dwell time-based methods [13,14].
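The effect sizes above follow from Cohen's d with a pooled standard deviation; for equal group sizes the pooled SD reduces to the root mean square of the two SDs. A minimal sketch for the convenience scores:

```python
import math

def cohens_d(mean1, std1, mean2, std2):
    """Cohen's d with the equal-n pooled standard deviation."""
    pooled = math.sqrt((std1 ** 2 + std2 ** 2) / 2)
    return (mean1 - mean2) / pooled

# Convenience scores: proposed (mu=3.9, sigma=0.5) vs. dwell time (mu=2.9, sigma=0.8)
d = cohens_d(3.9, 0.5, 2.9, 0.8)
print(round(d, 2))  # ≈ 1.5, matching the large-effect value reported above
```

Values of d above 0.8 are conventionally classed as large effects, which is why all the reported values fall in the large-effect category.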
In our experiment, the screen resolution was 1680 × 1050 pixels on a 19-inch monitor, and the z-distance between the monitor and the user's eyes ranged from 60 to 70 cm. Considering the accuracy (about ±1°) of gaze detection in our system, the minimum distance between two objects on the monitor screen should be about 2.44 cm (70 cm × tan(2°)), which corresponds to approximately 82 pixels.
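The minimum separation implied by a given angular accuracy is simple trigonometry; the sketch below reproduces the figures above, deriving the pixel density from the stated 2.44 cm ≈ 82 pixel correspondence.

```python
import math

z_cm = 70.0            # farthest z-distance between eye and monitor
accuracy_deg = 1.0     # ±1° gaze accuracy, so a 2° total angular spread
px_per_cm = 82 / 2.44  # pixel density implied by the correspondence above

# Lateral distance on screen subtended by the full angular spread
min_dist_cm = z_cm * math.tan(math.radians(2 * accuracy_deg))
min_dist_px = min_dist_cm * px_per_cm
print(round(min_dist_cm, 2), round(min_dist_px))  # ≈ 2.44 cm, ≈ 82 pixels
```

Using the farthest z-distance (70 cm) gives the conservative bound: objects separated by at least this distance remain distinguishable anywhere in the stated working range.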

Conclusions
In this study, we proposed a method for detecting the user's gaze for target selection using a gaze tracking system based on an NIR illuminator and a camera. The pupil center, pupil boundary, and glint center were detected more accurately based on an enhanced Chan-Vese algorithm. We employed three features: the change in pupil size with respect to time measured by template matching, the change in gaze position within a short dwell time, and the texture information of the monitor image at the gaze target. These features were used as inputs to a fuzzy system whose optimal input membership functions were designed based on the maximum entropy criterion, and the user's gaze for target selection was determined through defuzzification methods. Performance was evaluated by comparing the defuzzification results using the EER and ROC curves. We verified from the results that the COG method using the MIN rule is suitable in terms of accuracy for different objects with varying amounts of texture. In future work, we will investigate methods to enhance performance by combining our proposed features with various physiological ones, such as electrocardiogram (ECG) data, electroencephalogram (EEG) data, or skin temperature (SKT).
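To make the MIN-rule inference and COG defuzzification concrete, the following is a minimal, hypothetical sketch of a Mamdani-style pipeline. The triangular membership functions, normalized feature values, and decision threshold are illustrative placeholders, not the tuned functions designed in this paper.

```python
def tri(x, a, b, c):
    """Triangular membership function rising from a, peaking at b, falling to c."""
    if x <= a or x >= c:
        return 0.0
    return (x - a) / (b - a) if x < b else (c - x) / (c - b)

# Normalized feature values (placeholders): pupil-size change,
# gaze-position change within a short dwell time, and texture score.
features = [0.8, 0.7, 0.9]

# MIN rule: the firing strength of the "target selected" rule is the
# minimum membership across the three inputs.
mu = min(tri(f, 0.0, 1.0, 2.0) for f in features)

# Center-of-gravity (COG) defuzzification over a discretized output set,
# with the output membership function clipped at the firing strength mu.
xs = [i / 100 for i in range(101)]
clipped = [min(mu, tri(x, 0.0, 1.0, 2.0)) for x in xs]
cog = sum(x * m for x, m in zip(xs, clipped)) / sum(clipped)

selected = cog > 0.5  # hypothetical decision threshold
```

The clipping-then-centroid step is what distinguishes COG from simpler defuzzifiers such as mean-of-maximum: the whole shape of the clipped output set, not just its peak, determines the crisp selection score.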