
A Two-step Approach for Interest Estimation from Gaze Behavior in Digital Catalog Browsing

by
Kei Shimonishi
1 and
Hiroaki Kawashima
2
1
Kyoto University, Japan
2
University of Hyogo, Japan
J. Eye Mov. Res. 2020, 13(1), 1-17; https://doi.org/10.16910/jemr.13.1.4
Submission received: 12 October 2019 / Published: 1 April 2020

Abstract

While eye gaze data contain promising clues for inferring the interests of viewers of digital catalog content, viewers often dynamically switch their focus of attention. As a result, a direct application of conventional behavior analysis techniques, such as topic models, tends to be affected by items or attributes of little or no interest to the viewer. To overcome this limitation, we need to identify “when” the user compares items and to detect “which attribute types/values” reflect the user’s interest. This paper proposes a novel two-step approach to addressing these needs. Specifically, we introduce a likelihood-based short-term analysis method as the first step of the approach to simultaneously determine comparison phases of browsing and detect the attributes on which the viewer focuses, even when the attributes cannot be directly obtained from gaze points. Using probabilistic latent semantic analysis, we show that this short-term analysis step greatly improves the results of the subsequent step. The effectiveness of the framework is demonstrated in terms of the capability to extract combinations of attributes relevant to the viewer’s interest, which we call aspects, and also to estimate the interest described by these aspects.

Introduction

Estimating the real-time interest of users browsing a digital catalog opens up a variety of application possibilities, including online recommendation of items that might better fit their needs and automated assistance, such as offering a new viewpoint for their choice (Misu et al., 2011; Reusens, Lemahieu, Baesens, & Sels, 2017; Walker et al., 2004). Bringing such systems into reality requires the development of a representation of user interest and a method for estimating it.
Consider a situation in which a user is browsing a digital catalog containing items, each with multiple attributes, and selects one item. According to the concept of means-end chains (Collen & Hoekstra, 2001; Gutman, 1982), it is desirable to estimate user interest not only in individual items and attributes but also in the user’s personal values because such values are often linked to the basic reason for a choice.
To represent user values, we assume that “each value can be associated with a subset of attributes.” For example, the value “health” has strong relevance to “low calorie” and “fiber rich” attributes. A model of a user’s internal process for this situation can be illustrated in Figure 1, which is based on the means-end chain concept. By assuming that personal values can be defined as certain aspects of items in a content domain, we introduce aspects as the representation of personal values.
In the course of automated inference of user interests in this setting, a set of aspects needs to be prepared beforehand to represent personal values. In fact, the successful estimation of user interest depends on the appropriateness of the prepared aspects; meanwhile, appropriate aspects depend on many factors, such as the content domain, the user’s characteristics, and the task being performed (Parnell et al., 2013). As will be discussed in the related work section, one approach to the analysis is to use interviews, but the quality of the data collected depends on the interviewer’s skills; moreover, interviews are costly.
In this paper, we therefore investigate another, data-driven approach: obtaining a set of aspects from users’ behavior by assuming that “users share several common aspects related to the same content domain.” While this assumption may not always be valid, it is still useful, at least for actual applications of decision support.

Research objective 

This paper addresses two problems related to using eye gaze data collected during digital catalog browsing: (1) data-driven extraction of aspects that describe user interest and (2) estimation of the user interest.
The analysis of sensory information (e.g., GPS coordinates, click streams) with machine-learning techniques, such as topic models, is a growing trend for identifying an association between a user’s behavior and the user’s internal state (Bobadilla, Ortega, Hernando, & Gutiérrez, 2013). Among these techniques, eye tracking is a promising approach to closely exploring a user’s internal states during decision making (S. W. Shi, Wedel, & Pieters, 2013; Chen, Wang, & Wu, 2016).
Our preliminary experiments revealed two important observations related to the limitation of directly applying topic models to gaze data:
  • Observation 1. Users frequently switch their browsing states, e.g., from “simply grasping information about items” to “actively comparing items based on their interest”;
  • Observation 2. Users do not always take into account all the attributes of displayed items but rather focus on a subset of them.
In a large-scale analysis of users’ click histories on websites, Das et al. found that user clicks are noisier than their explicit ratings and purchase activities (Das, Datar, Garg, & Rajaram, 2007). That is, a user’s browsing behaviors (e.g., eye gazes and clicks) are not always closely associated with the user interest. In fact, eye gazes can be much noisier than clicks because gaze data capture a wide range of the human decision-making processes behind clicks. Additionally, each gaze point is only a slice of the user’s information processing in contrast to click activities, which involve more explicit decisions to obtain additional information. Therefore, it is likely that users focus on only some of the attributes of items on which gaze points are located.
As a result, the aspects obtained from the direct application of topic models tend to be affected by items or attributes of no interest to the user. To overcome this limitation, we need to identify “when” the user compares items and to detect “which attribute types/values” reflect the user’s interest.
This research aims at providing a novel approach to obtaining a set of aspects from user gaze data collected during content browsing by extracting the dynamic changes in the user’s “focus” on attributes. It also aims at determining how the automatically obtained aspects can be used to represent and estimate the user interest.

Contributions 

This paper proposes a novel two-step approach to the analysis of eye gaze data for aspect learning and interest estimation as described in the research objective above (the flow of the approach is summarized in Figure 2).
In the proposed framework, a user’s gaze behavior is interpreted as the sequence of items at which the user looked (the bottom right part of Figure 2). The sequence of attribute values for each attribute type is retrieved from the sequence of items (the bottom left part of Figure 2). As the first step, we apply a likelihood-based short-term analysis to the sequences of attribute values at which the user looked (the red-highlighted area in Figure 2) to detect the distinctive gaze behavior that occurs while the user is actively comparing items. At the same time, the attribute values on which the user focuses during these distinctive periods are extracted; we refer to them as the attributes-of-focus (AOF for short). An example of AOF detection is shown in Figure 2 by the red-circled period, during which the user focuses only on “low calorie” and on no other attribute. As the second step, we apply a probabilistic generative model to the AOF (the green-highlighted area in Figure 2), a variant of the probabilistic latent semantic analysis (pLSA) model (a topic model), to obtain the aspects and to estimate the user interest described by the aspects.
The generative model used in the second step is an extension of our previous work (Shimonishi, Kawashima, Yonetani, Ishikawa, & Matsuyama, 2013), which took all attribute values of items into account. In contrast, the basic idea now is to use the AOF behind the gaze behavior, obtained in the first step, as the “observation” of the subsequent probabilistic generative model; accordingly, the generative model is modified to handle the AOF. While gaze behavior data are much noisier than the intentional actions (e.g., explicit ratings) used in traditional topic models, the first step acts as a filter that distinguishes meaningful gaze data from the original data.

Organization of this paper 

In the following two sections, we briefly review related work and introduce the details of the two-step approach: a likelihood-based short-term analysis for AOF detection and a generative model for aspect learning and interest estimation. We then evaluate our framework, discuss limitations, and conclude in the subsequent sections.

Related Work

Analysis of values behind decision making 

The means-end chain model (Gutman, 1982) is one of the well-known methods for analyzing value-oriented behaviors in decision making (Keeney, 1992; Parnell et al., 2013) and exploring consumer motivations (Arsil, Li, & Bruwer, 2016; Zanoli & Naspetti, 2002). In this model, consumer decision making consists of options (means), consequences, and values (ends), and the model explains how a product can achieve the desired end states. Note that, in this paper, we use the term “value” also to describe “desired consequence” for simplicity, whereas the different levels are distinguished in means-end chain models. To apply the means-end theory, an interview-based method, such as laddering, is widely utilized in marketing research (Arsil et al., 2016; Reynolds & Gutman, 1988; Zanoli & Naspetti, 2002). In laddering methods, interviewers directly ask users about the values driving their decision making. Since user responses are diverse, the use of laddering to obtain appropriate values faces several challenges (Veludo, Ikeda, & Campomar, 2006). These challenges include the need to elicit information about the user’s values and the need to control the dialog; the results therefore strongly depend on the interviewer’s skills (Reynolds & Gutman, 1988). In addition, many interviews are needed for each decision domain to obtain a sufficiently large value set.
This difficulty is also seen in the analytic hierarchy process (AHP) (Saaty, 1980), which is an organization technique for group decision making. The AHP structures a decision problem similarly to the means-end chain model. Participants in the decision-making process first discuss criteria for the problem and then construct a hierarchical structure consisting of a decision goal, alternatives, and the criteria. Once a hierarchical structure for decision making is constructed, the alternatives can be sorted with respect to the weights of the criteria. However, the quality of the construction step depends on the skill and knowledge of the participants.

Estimation of internal states behind user behavior 

Estimation of a user’s internal states from the user’s behavior is attracting researchers’ attention thanks to improvements in automated data acquisition technologies (e.g., sensors and the Web). For example, search logs on the Web contain rich information from which one can infer a user’s internal states (Athukorala, Medlar, Oulasvirta, Jacucci, & Glowacka, 2016; He, Qvarfordt, Halvey, & Golovchinsky, 2016; Martin-Albo, Leiva, Huang, & Plamondon, 2016; Uetsuji, Yanagimoto, & Yoshioka, 2015).
While the inference of internal states requires the representation of the states (i.e., state space), it is often difficult to manually prepare the state space itself since appropriate representation depends on the situation. Unsupervised machine learning techniques, such as topic models (latent factor models) (Iwata, Watanabe, Yamada, & Ueda, 2009; Jin, Zhou, & Mobasher, 2004; Ni, Lu, Quan, Wenyin, & Hua, 2012; Uetsuji et al., 2015; Y. Shi, Larson, & Hanjalic, 2014), are widely studied as promising techniques for finding a representation of a user’s internal states from his or her decision-making behavior. Iwata et al. (Iwata et al., 2009), for example, proposed a model for estimating temporal changes in consumer interest and item trends and tracking time-varying item trends by analyzing item purchase logs using a variant of the dynamic topic model.

Internal and external factors of gaze behavior 

Human visual attention is affected by both internal and external factors (Orquin & Loose, 2013; Bruce & Tsotsos, 2009; Kollmorgen, Nortmann, Schröder, & König, 2010), which direct a user’s goal-oriented and stimulus-oriented attention, respectively.
Goal-oriented attention is driven internally by one’s goals, so the resultant gaze behavior is affected by the task being engaged even if the same stimulus is presented (Yarbus, 1967; Borji & Itti, 2014). In fact, a number of studies have been conducted on cognitive-state estimation from eye gaze. The applications developed include inferring a user’s knowledge levels (Cole, Gwizdka, Liu, Belkin, & Zhang, 2013), a user’s cognitive ability to read graphs (Steichen, Conati, & Carenini, 2014), and a user’s engagement in conversations (Ishii, Nakano, & Nishida, 2013). A user’s gaze behavior has also been used to estimate his or her preference from content browsing (Brandherm, Prendinger, & Ishizuka, 2008; Hirayama, Jean-Baptiste, Kawashima, & Matsuyama, 2010). For example, Brandherm et al. developed an approach for estimating a user’s preferred target in displayed content on the basis of the frequency and duration of gazing at targets (Brandherm et al., 2008).
Stimulus-oriented attention is directed by external factors, such as the visual saliency of a scene (Itti, Koch, & Niebur, 1998). Therefore, the effects of external factors also need to be considered when analyzing gaze behavior during choice. For example, the effect of spatial position is known to be large, especially after short-duration presentation of a visual target (e.g., an image) (Tatler, Baddeley, & Gilchrist, 2005). Furthermore, external factors themselves can even change the decision results. For example, Milosavljevic et al. reported that salient targets tend to be chosen when the decision time is short or when a cognitive load exists (Milosavljevic, Navalpakkam, Koch, & Rangel, 2012).

Gaze behavior and decision phase  

User decision making consists of several phases (Russo & Leclerc, 1994; Schaffer, Kawashima, & Matsuyama, 2016), such as browsing a catalog to acquire information (screening phase) and comparing items to evaluate them (comparison phase). Note that the comparison phase is expected to contain more clues to a user’s values than the screening phase. To identify the decision phases, a short-term (segment-wise) analysis of gaze region sequences has been proposed (Ishikawa, Kawashima, & Matsuyama, 2015; Schaffer et al., 2016), in which a tri-gram of gaze region sequences is treated as the unit of analysis for extracting layout-related gaze features to classify the decision phases (browsing states).
In this work, we focus on analyzing comparison behavior rather than rapid choice behavior by explicitly discriminating the screening and comparison phases. Under this experimental design, we analyze gaze behavior in the comparison phase, where the effect of visual saliency is reduced thanks to the preceding screening phase, as will be explained in the evaluation section.

Methods

Two-step approach to estimating user interests 

We first introduce the representation of content and user interest used to describe the decision-making situation for a digital catalog. Let ℐ = {I1, … , IN} be a set of items in a digital catalog and 𝒜 be a set of attribute types common to all the items, where every item takes one attribute value from the set of attribute values 𝒱(a) = {V1(a), … , VKa(a)} for each attribute type a ∈ 𝒜 (Ka is the number of possible values of a). For example, “calorie” is an attribute type, while “high” and “low” are its values.
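To make this notation concrete, here is a minimal sketch of the content representation in Python (the item names and the second attribute type are hypothetical, chosen to match the food example discussed earlier):

```python
# A toy catalog in the notation above: a set of items, a set of
# attribute types, and one attribute value per type for each item.
# Item names and values are illustrative, not from the paper's data.
attribute_values = {
    "calorie": ["low", "middle", "high"],   # 𝒱(a) for a = "calorie"
    "fiber":   ["poor", "rich"],            # 𝒱(a) for a = "fiber"
}
items = {
    "I1": {"calorie": "low",  "fiber": "rich"},
    "I2": {"calorie": "high", "fiber": "poor"},
    "I3": {"calorie": "low",  "fiber": "poor"},
}
```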
To represent the user interest, we introduce aspects of items to describe possible reasons for comparison. Let 𝒞 = {C1, … , CR} be a set of R aspects of items in a content domain, where these aspects are assumed to depend not on the user or displayed item set but on the content domain, as mentioned in the introduction section. We also assume that each aspect can be characterized by its association with attribute values. For example, the aspect “healthy” in the food content domain is relevant to “low calorie” and “fiber rich.” We represent the user interest during a browsing session s as a distribution θ(s) = (θ1(s), … , θR(s)) over the aspects, where θr(s) = P(Cr|s) denotes the weight placed on aspect Cr.
The main idea of our approach is the explicit use of the AOF, i.e., the attributes upon which the user actually focused. Although each item has attribute values for all attribute types a ∈ 𝒜, only a subset of them is attended to in a decision-making session. We therefore introduce a designated step to extract the AOF before applying a probabilistic generative model for interest estimation using unsupervised learning.
Our approach thus consists of two steps (Figure 2). As the first step, we apply a likelihood-based short-term analysis to the sequences of gaze targets, i.e., the regions of interest (ROIs). In this step, the AOF are extracted by detecting periods of distinctive gaze behavior, which are characterized by gaze-target patterns biased away from neutral browsing. As will be shown later, this step is simple but highly effective for the second step: learning aspects and estimating user interest from gaze behavioral data using a generative model. As a concrete model for the second step, we propose the probabilistic interest-driven attention focusing (pIAF) model, which extends pLSA (Hofmann, 1999) to suit our situation. The two steps are explained in turn in the following subsections.
Whether the AOF can be read directly from gaze points on a catalog depends on the content design, and in general they cannot. For example, when a user browses a catalog of food pictures, gaze points alone do not reveal which visual attributes the user is focusing on. The proposed method requires only item-level ROIs (i.e., it does not need to observe the AOF directly) and is therefore applicable to a wide range of content designs.

AOF detection by short-term analysis  

To detect the AOF (the red-highlighted area in Figure 2), we follow an anomaly detection approach: We first model neutral-browsing behavior and then use the likelihood of the model computed at each window position (temporal interval). Since the likelihood tells us how probable an observation is under the model, biased (i.e., distinctive) gaze behavior can be detected when the value falls below a predetermined threshold.
Figure 3. Flow of AOF detection using short-term analysis.
[The equations defining the neutral-browsing model and the windowed likelihood appeared here in the original; they could not be recovered from the extraction.]
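Since the equations themselves were lost, the following is only a minimal sketch of the mechanism the text describes, under two assumptions of ours: that the neutral-browsing model is a uniform multinomial over attribute values, and that the AOF are the over-represented values in a flagged window. The function name is ours; the window size and threshold follow the values reported later in the results section (l = 5, threshold 0.03).

```python
import numpy as np
from scipy.stats import multinomial

def detect_aof(value_seq, K, l=5, thresh=0.03):
    """Slide a length-l window over a sequence of attribute-value
    indices (0..K-1). A window whose likelihood under the uniform
    neutral-browsing multinomial falls below `thresh` is flagged as
    distinctive, and the over-represented values in it are reported
    as the attributes-of-focus (AOF)."""
    p_neutral = np.full(K, 1.0 / K)        # neutral browsing: uniform
    detections = []
    for t in range(len(value_seq) - l + 1):
        counts = np.bincount(value_seq[t:t + l], minlength=K)
        if multinomial.pmf(counts, n=l, p=p_neutral) < thresh:
            focus = np.flatnonzero(counts > l / K)   # biased values
            detections.append((t, focus))
    return detections
```

In the paper's framework this analysis runs on the attribute-value sequence of each attribute type derived from the gaze-target (ROI) sequence.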

Interest estimation using a generative model 

Modeling human behavior as a generative process using a probabilistic model is one approach to analyzing the internal states behind behavioral data: the model parameters can be estimated from data, and the internal states can be inferred as latent parameters (Iwata et al., 2009; Y. Shi et al., 2014). Borrowing this concept, we propose the pIAF model to learn the aspects and to estimate user interest (the green-highlighted area in Figure 2). Figure 4 gives an overview of the model.
[The generative process of the pIAF model is specified here in the original; the equations could not be recovered from the extraction.]
Regarding the observation model ℎ, one can directly apply a categorical distribution as in the original pLSA (Hofmann, 1999), and in fact, our previous work followed this option (Shimonishi et al., 2013). However, such a model is unnatural because it lets the user’s focus take only a single attribute value at a time. In actual situations, a user jointly considers “multiple” attribute values across a “partial” subset of the attribute types. Therefore, the observation model ℎ should represent a joint distribution over the attribute values constituting the AOF.
To take both the multiplicity and partiality of users’ focus into account, we have extended the observation model to incorporate the concept of users’ attention resource (Goldstein, Vanhorn, Francis, & Neath, 2011). Specifically, we consider a multinomial distribution on “all the attribute values” as the observation model and introduce the number of attribute values of simultaneous focus, nt, as a parameter of the attention resource at time t. Here, ℎ(ft; nt, pr) is derived as
h(ft; nt, pr) = (nt! / ∏k ft,k!) ∏k pr,k^(ft,k),
where ft,k denotes how many times attribute value Vk appears in the AOF ft (so that Σk ft,k = nt) and pr,k is the multinomial parameter associating aspect Cr with Vk.
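A direct transcription of this observation model (a minimal sketch; the function name and the count-vector encoding are ours, and SciPy's multinomial distribution supplies the pmf):

```python
import numpy as np
from scipy.stats import multinomial

def h(f_counts, n_t, p_r):
    """Observation model h(f_t; n_t, p_r): multinomial probability of
    the AOF, where f_counts[k] is how often attribute value V_k occurs
    in f_t (summing to n_t) and p_r is aspect C_r's distribution over
    all attribute values."""
    return multinomial.pmf(f_counts, n=n_t, p=p_r)

# e.g., an AOF containing V_1 and V_3 once each (n_t = 2):
# h([1, 0, 1, 0], 2, [0.4, 0.1, 0.4, 0.1]) == 2 * 0.4 * 0.4 == 0.32
```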
The input and output parameters of the learning algorithm are summarized as follows (learning setting):
  • Input: a training set of AOF sequences collected from multiple sessions;
  • Output: {pr}r (the degree of association between aspects and attribute values) and {θ(s)}s (the users’ interests in training sessions).
Meanwhile, the user’s interest θ(s) in the observed eye gaze data during session s can also be estimated once the aspect parameters are learned (Figure 4 (b)). Note that {pr}r is given in this case, and therefore the input and output parameters of the estimation algorithm are summarized as follows (inference setting):
  • Input: {pr}r and the AOF sequence {ft}t observed during the session;
  • Output: θ(s) (the user’s interest during the session).
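As a minimal sketch of this inference setting, the standard pLSA-style fold-in EM can be used: {pr}r is held fixed and only θ(s) is updated (variable names are ours; h is the observation model sketched above):

```python
import numpy as np

def estimate_interest(aof_counts, n_values, P, n_iter=50):
    """Estimate theta^(s) for one session with the aspect parameters
    P = [p_1, ..., p_R] held fixed (pLSA-style fold-in EM).
    aof_counts: one count vector per detected AOF in the session;
    n_values: the corresponding n_t for each AOF."""
    R = len(P)
    # Likelihood of each AOF under each aspect (T x R matrix).
    H = np.array([[h(f, nt, P[r]) for r in range(R)]
                  for f, nt in zip(aof_counts, n_values)])
    theta = np.full(R, 1.0 / R)
    for _ in range(n_iter):
        gamma = theta * H                        # E-step: responsibilities
        gamma /= gamma.sum(axis=1, keepdims=True)
        theta = gamma.mean(axis=0)               # M-step: update interest
    return theta
```

Taking argmaxr of the returned vector then yields the most probable aspect, which is how the engaged task is estimated in the evaluation section.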

Evaluation 

We evaluated the proposed two-step approach from two perspectives. First, we investigated how the use of AOF detection affects the aspects obtained from eye gaze data compared with a method not using AOF detection. Then, we evaluated the accuracy of user interest estimation from gaze data for each decision-making session using the obtained aspects.
To focus on evaluating the basic effectiveness of the proposed framework, we conducted an experiment in a controlled decision-making situation rather than in an actual situation, which involves a variety of decision-making factors. Specifically, during each session, we asked the participant to select one item from a set of items in a digital catalog in accordance with a particular requirement. We expected that the participants would compare several options with some bias regarding the attribute values of interest. Since each of the requirements can be characterized by several attribute values, the given tasks serve as the ground truth for quantitative evaluation of both the aspect learning and the interest estimation.

Participants

We conducted the experiment with the help of 37 participants (18 male and 19 female university students, ranging in age from 19 to 34, with a mean of 22.3 and a standard deviation of 2.9).

Design 

The importance of decision making depends on the content domain and affects the user’s behavior. For example, if the decision will greatly affect the user’s life (e.g., choosing a house), the user will examine the options more seriously and carefully than for less-important decisions (e.g., deciding what to eat for lunch). We therefore used the content domain of choosing a laptop computer, which is assumed to have moderate importance.
As shown in Figure 5, the participants were asked to select a laptop computer (“PC”, hereinafter) from 12 PCs displayed on a screen. An eye tracker (Tobii X120) under the display was used to measure the participant’s eye movements; the freedom of head movement was 300 × 220 × 300 mm, and the accuracy was 0.5 degrees. A sampling rate of 60 Hz (less than the maximum rate of 120 Hz) was used, as we needed patterns of fixations rather than saccades. Each item region contained a written description of the attributes of the PC along with a picture to help the participants remember the position of the PC in the content. Each PC had five attribute types: price, screen size, CPU score, memory capacity, and weight. Each attribute type could take one of three values (e.g., low, middle, high), and the content for each session was prepared so that each attribute value was shared by 4 of the 12 PCs. To reduce the effect of content differences among trials, we prepared three types of content that differed only in the pictures of the PCs; that is, the sets of attribute values were the same in each content.
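A sketch of how such a balanced content set could be generated (our illustration; the actual stimuli, and the exclusion of attribute combinations satisfying multiple tasks described below, are not reproduced):

```python
import random

ATTR_TYPES = ["price", "screen size", "CPU score", "memory", "weight"]

def make_content(seed=0):
    """Generate 12 PCs such that, for every attribute type, each of
    the three values (0=low, 1=middle, 2=high) is shared by exactly
    4 of the 12 PCs, matching the balanced design described above."""
    rng = random.Random(seed)
    columns = {}
    for a in ATTR_TYPES:
        col = [v for v in (0, 1, 2) for _ in range(4)]  # 4 PCs per value
        rng.shuffle(col)
        columns[a] = col
    return [{a: columns[a][i] for a in ATTR_TYPES} for i in range(12)]
```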
In each session, we gave a participant a task to select a PC that fulfills a specified requirement (situation and purpose). For simplicity, we hereinafter use “task” to also denote “requirement.” An example task is as follows:
“Please assume that you will use your primary PC at home to watch movies and play games. Which PC do you think is the best for that situation?” We assumed that receiving a particular task would keep the participant’s interest θ(s) constant during the session.
Three tasks common to all participants were prepared. Each participant completed three sessions corresponding to the three tasks, with the task order counterbalanced in a Latin square design. We expected that aspects obtained from eye gaze data would be related to the three tasks. In particular, we expected the participant’s interest θ(s) to take one of three states, (1, 0, 0), (0, 1, 0), or (0, 0, 1), depending on the task, where θr = 1 for the aspect Cr (r ∈ {1, 2, 3}) that corresponds to the specified task.
Each of these three tasks is implicitly related to several attribute values. For instance, a PC for playing games or watching movies requires a high CPU score and large memory capacity. We refer to these values as task-related attribute values. Although the visual appearance conveys various types of information, the information actually obtained depends on the participant. We therefore ensured that all the task-related attribute values were included in the text descriptions to equalize the amount of information given to the participants (Figure 5). Table 1 summarizes the three tasks and task-related attribute values. The aim of this experiment was to determine the capability of our framework to learn aspects common to multiple participants. We therefore designed tasks whose solutions could be determined easily and, to some extent, uniquely.
We expected the participants to interpret each task as a set of attributes and then compare the options that satisfied the requirements. The participants’ knowledge affects not only the decision process (Karimi, Papamichail, & Holland, 2015) but also this interpretation. To make the participants’ knowledge roughly equal, we briefly explained the meaning of the attribute types before explaining the tasks. Although this setting may seem too controlled, it is reasonable for our aim, which was not to determine the accuracy of detecting attribute values looked at by the participants but to determine the effect of AOF detection on aspect learning.
To elicit the participants’ comparison behavior, we set the number of items that met the task to two so that the participants could not uniquely decide on one PC for the specified task. We randomized the positions of the PCs in the content to reduce the effect of the spatial layout.
The total number of sessions in this experiment was 111 (37 participants × 3 tasks).

Procedure 

Each participant was first asked to sit facing the display and to position his or her face on the chin rest (see Figure 5). After calibrating the eye tracker, we explained the content and procedure of the experiment. The procedure in each session consisted of four steps:
  • Step 1. The participant was given a task (first two columns in Table 1).
  • Step 2. The accuracy of the eye tracker’s calibrated parameters was confirmed.
  • Step 3. The content was displayed, and the participant was asked to select one PC.
  • Step 4. After the participant reported having made a selection, we asked which PC the participant had selected.
In Step 3, to explicitly separate the screening phase from the comparison phase (Russo & Leclerc, 1994; S. W. Shi et al., 2013), each item was first displayed in turn at intervals of three seconds. Then all items were displayed together, and eye gaze was measured. Separating the screening phase in this way may have reduced the effect of spatial position (see also the related work section), enabling us to assume that the participants browsed the content uniformly during neutral browsing.
While items with a certain combination of attribute values (e.g., {low price, high CPU score, large memory capacity}) would satisfy more than one task, we did not include items with such combinations because once such an item appeared in a session, it could affect the comparison behaviors in the subsequent sessions due to the familiarity of the attribute combination.
Because we did not limit the decision-making time, the participants browsed the content as long as they needed and had enough time to compare items before deciding on one.

Results 

Figure 6 shows an example of a gaze trajectory when a participant compared PCs by considering their attribute values. As can be seen in this figure, each participant mainly looked at and compared keywords of attribute values rather than pictures. Regarding the difference between tasks, we confirmed from the obtained gaze data that both the physical duration and the number of items looked at differed little among the three tasks.
Examples of AOF detected from actual eye gaze data are shown in Figure 7. The window size l and the threshold used to determine the AOF were set to 5 and 0.03, respectively. These parameters were determined experimentally to satisfy two constraints: at least one AOF was extracted from every session except very short ones, and only the attribute values shared by the two items were extracted as AOF when a participant compared only two items within a short-term window. The task-related attribute values of this session were a high CPU score and large memory capacity (V3(CPU) and V3(memory); the notation was introduced in the methods section, and subscript 3 denotes the highest value for that attribute type). Note that “Time” in the figure is based on the switching of gaze targets (i.e., ROIs). In this example, we can see that the participant first focused on a high CPU score (V3(CPU)) and then compared items with not only a high CPU score but also large memory capacity (V3(CPU) and V3(memory)) and a mid-range price (V2(price)).
The compared items coincided with those we expected to be compared for the given task. The attribute value of mid-range price (V2(price)) was also detected because the compared items commonly had that value, although it was not included in the task. Similar choice behavior was seen in many other sessions with other participants. That is, the participants first focused on one task-related attribute value and then narrowed down the options by adding the other task-related attribute values to the focus. However, the participants did not always focus on attribute values related to their interests in the last couple of fixations. The total number of detected AOF at each (normalized) temporal position is depicted in Figure 8. In this plot, each session was divided into ten segments with respect to temporal position, and the total number of AOF in each segment was calculated. This figure shows that, although the participants compared PCs mainly in the latter part of a session, the amount of comparison decreased toward the very end.
Figure 8. Total number of detected AOF at each (normalized) temporal position of a session.
The results of the learned aspects without and with the AOF detection step are shown in Figure 9 (b) and (c), respectively. They can be compared qualitatively with the task-related attribute values shown in Figure 9 (a) in terms of the degree of association with the attribute values. The results without the AOF detection step (Figure 9 (b)) were obtained from original attribute-value sequences (e.g., Figure 7 (a)) while those with the AOF detection step (Figure 9 (c)) were obtained from sequences of detected AOF (e.g., Figure 7 (b)). In both cases, the number of aspects was set to three (R = 3), the same as that of the tasks, so that aspects corresponding to the given tasks were obtained.
These results show that the learned aspects were more distinct from one another when the AOF detection step was applied and that the attribute values highly associated with each aspect seem to be similar to the task-related attribute values for all three tasks. In addition, from the result with the different setting of the number of aspects, R = 4 (Figure 9 (d)), we can see that task-related attribute values were still obtained successfully as aspects C1 to C3 despite the existence of another aspect, C4.
For quantitative evaluation of the effectiveness of the AOF detection step, the similarities between the learned aspects and the given tasks were calculated. The similarities were defined by the cosine similarity of two parameter vectors:
sim(pr, q) = (pr · q) / (‖pr‖ ‖q‖),
where pr is the learned parameter vector of aspect Cr and q is the vector of task-related attribute values for the task.
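In code, with pr a learned aspect's parameter vector and q the task's attribute-value vector (treating q as an indicator vector over attribute values is our assumption):

```python
import numpy as np

def cosine_similarity(p_r, q):
    """Cosine similarity between a learned aspect's parameter vector
    and a task's attribute-value vector."""
    p_r, q = np.asarray(p_r, float), np.asarray(q, float)
    return float(np.dot(p_r, q) / (np.linalg.norm(p_r) * np.linalg.norm(q)))
```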
Figure 9. Task-related attribute values and learned aspects without and with the AOF detection step using eye gaze data. In (b) and (c), the number of aspects was set to three (the same as that of the tasks), while it was set to four in (d). The size and color of each dot both depict the value of the multinomial parameter pr,k, which represents the degree to which aspect Cr is associated with attribute value Vk.
Figure 10. Similarities between task-related attribute values and learned aspects without and with the AOF detection step.
To evaluate the accuracy of interest estimation, we conducted task estimation using the learned aspects, with the given task as the ground truth. Table 2 shows the results of task estimation based on the maximum probability of the estimated participants’ interest, argmaxr θr(s) = argmaxr P(Cr|s). The interest θ(s) shown in Figure 9 (c) was estimated at the same time as aspect learning; that is, parameter estimation for the learning setting with Eq. (3) was used here, since the main point of this evaluation is to confirm the effect of AOF detection on aspect learning and interest estimation. Note, on the other hand, that our approach can also be used to estimate user interest from newly observed gaze data with the inference setting. The accuracy of the task estimation was 83.8%. In 4 of the 111 sessions, the AOF detection step did not detect the participant’s comparison behavior as biased gaze behavior because the participant decided quickly; the duration of those sessions was shorter than the analysis window of the AOF detection step (l = 5).
These results show the effectiveness of the AOF detection step both in learning aspects of items and in estimating the participants’ engaged tasks (i.e., their interests) from their gaze behavior. In particular, the results shown in Figure 9 indicate that considering the participant’s “focus” on attributes (i.e., the attributes-of-focus) is crucial to analyzing the participant’s comparison behavior.

Discussion

The results presented above demonstrate the effectiveness of introducing the AOF detection step for learning aspects from gaze behavioral data and its effectiveness in estimating user interests reflected in gaze-target patterns. In this section, we discuss the limitations of our framework that arise from the assumptions made about the user internal model and observable gaze behavior.

Dynamics of user interests 

User interest is affected by many factors, including relatively stable individual preferences and temporary interests elicited by external information (e.g., novel information). However, our model assumes that a user’s interest is constant during a content browsing session. Although this assumption is useful for determining the basic capabilities of the proposed model and algorithm for learning aspects, which is the main focus of this paper, dynamic changes of user interest in actual situations should be addressed in order to put the proposed method into practical use.
One approach to doing this is to introduce temporal segments by exploiting the results of the AOF detection step. By dividing an AOF sequence into segments at the points where the attribute values in the AOF change, we can consider a user’s interest to be constant within each segment, as sketched below. The proposed learning and estimation methods could then be applied to these segments, i.e., the parameter vector θ could be assumed to be piecewise constant. While this segment-based model may cause a shortage of training data, extending our pLSA-based model with a Bayesian approach such as latent Dirichlet allocation (LDA) (Blei, Ng, & Jordan, 2003) is a possible solution.
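A minimal sketch of this segmentation (our illustration; the paper does not specify an algorithm):

```python
def segment_by_aof_change(aof_seq):
    """Split a session's AOF sequence into maximal runs with identical
    attribute-value sets, so that the interest vector theta can be
    treated as piecewise constant, one value per segment.
    aof_seq: a list of AOF, each an iterable of attribute values."""
    segments, start = [], 0
    for t in range(1, len(aof_seq)):
        if set(aof_seq[t]) != set(aof_seq[t - 1]):
            segments.append((start, t))    # half-open interval [start, t)
            start = t
    if aof_seq:
        segments.append((start, len(aof_seq)))
    return segments
```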

Number of aspects 

Since our method finds aspects on the basis of unsupervised learning as do topic models, the number of aspects needs to be given. In fact, the number of aspects was set to be the same as that of the tasks for the sake of evaluating the learned aspects in terms of the similarity to the given tasks. However, in practice, the number of aspects can vary depending on the user and the situation (e.g., the content and the task), so it is difficult to determine it beforehand in general.
As shown by the results in Figure 9 (d), our approach is robust to the setting of the number to some extent. Moreover, the use of standard hyper-parameter estimation of unsupervised learning, e.g., non-parametric Bayes, should be effective in determining the appropriate number of aspects. However, the “interpretation” of learned aspects is desirable for many applications, such as speech dialog systems using aspects for probing questions (Misu et al., 2011), and needs to be investigated.

Temporal patterns of gaze targets  

The temporal patterns of gaze targets convey useful information for estimating user interest during decision making. For example, if a user “re-fixates” on a target, he or she is probably specifically focusing on the target and comparing it with the alternatives (Schaffer et al., 2016). Although these temporal patterns are considered indirectly in the AOF detection step to identify bias in the user’s gaze behavior within a short-term window, explicit modeling of gaze-transition patterns needs to be incorporated for a natural interpretation of user gaze behavior.

Information of gaze duration

Because we are particularly interested in user comparison behavior, changes in the gaze target (i.e., gaze-target transition) were used for the definition of time t in the proposed approach. This is suitable for treating the number of times a target is looked at as the weight placed on the target’s importance. However, this approach cannot be used to examine how carefully the user looks at each target. For example, Sugano et al. showed with random forests that duration is the feature that contributes the most to estimating the user’s interests (Sugano, Ozaki, Kasai, Ogaki, & Sato, 2014). Taking physical duration information into account may enable us to examine the degree of user focus on a gaze target and may increase the accuracy of the interest estimation.

Effect of visual saliency and content design 

While we simply assumed that users uniformly browse displayed items on a screen during neutral browsing, visual saliency, such as salient regions and the position of items in a catalog, affects user gaze behavior (e.g., center bias (Borji, 2012)). Therefore, taking into account the effect of visual saliency in the neutral-browsing model may increase the accuracy of AOF detection. For example, a saliency map (Itti et al., 1998) can be used to determine the parameters of the multinomial distribution used in the neutral-browsing model as the weights on each attribute of items.
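For instance, the uniform neutral-browsing distribution assumed in our earlier AOF-detection sketch could be replaced by saliency-derived weights (a hypothetical helper; saliency_of stands for any per-ROI score, e.g., averaged from an Itti-Koch saliency map):

```python
import numpy as np

def neutral_params(rois, saliency_of):
    """Multinomial parameters for the neutral-browsing model, weighted
    by the visual saliency of each ROI instead of being uniform."""
    w = np.array([saliency_of(r) for r in rois], dtype=float)
    return w / w.sum()
```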
In our experiment, the positions of the items were randomized to reduce the effect of item location and thereby focus on aspect-driven comparison behavior. However, the layouts in actual catalog content are not randomized but structured (e.g., similar items are arranged close together). This affects content browsing and is actually useful for analyzing browsing states in a decision-making process (Schaffer et al., 2016). Content-design information therefore also needs to be considered for interest estimation.

Conclusion

This paper addressed the problem of finding a representation of user interest and estimating it from eye gaze data in digital-catalog browsing using a data-driven approach. By introducing aspects as approximate representations of user values when choosing items, we aimed at obtaining a set of aspects from eye gaze data using unsupervised learning of the pIAF model, a probabilistic generative model of attributes-of-focus (AOF). The main contribution of this paper is the introduction of an AOF detection step to overcome the problem of such data-driven learning and estimation being strongly affected by item information and attributes of no interest to the user. We evaluated the validity of this approach with actual eye gaze data and found that it constructs distinctive aspects highly correlated with user decision goals, in contrast to an approach without the AOF detection step. Future work includes investigating ways to overcome the limitations discussed in the previous section and applying the proposed method to interactive systems that can proactively assist user decision making.

Ethics and Conflict of Interest

The authors declare that the contents of the article are in agreement with the ethics described in http://biblio.unibe.ch/portale/elibrary/BOP/jemr/ethics.html and that there is no conflict of interest regarding the publication of this paper.

Acknowledgments

This work was supported by JSPS KAKENHI Grant Numbers JP15J06965, JP26280075, and JP19H04226, and by JST PRESTO Grant Number JPMJPR14D1.

Appendix

Derivation of the equation for maximum likelihood estimation
The joint probability of an aspect-of-focus and an attribute-of-focus during session s is derived as follows by assuming the conditional independence of ft and s given ct:
P(ft, ct = Cr | s) = P(ct = Cr | s) P(ft | ct = Cr) = θr(s) h(ft; nt, pr).
Hence, the probability that an attribute-of-focus ft is observed during session s is given by
P(ft | s) = Σr θr(s) h(ft; nt, pr).     (6)
When we have S sessions of decision making, the parameters to be estimated are {θ(s)}s, the users’ interests, and {pr}r, the multinomial parameters that characterize the aspects. Here, the probability of an attribute-of-focus ft during session s is given by Eq. (6), where parameter nt is given as described in the methods section. Given the parameters θ(s) and {pr}r, the likelihood of a sequence of attributes-of-focus {ft}t during session s is computed as
P({ft}t | θ(s), {pr}r) = ∏t Σr θr(s) h(ft; nt, pr).
By computing the likelihood of all sessions, we can obtain Eq. (3).
The model parameters can therefore be estimated by solving the following optimization problem derived by maximizing the logarithm of Eq. (3):
max over {θ(s)}s and {pr}r of Σs Σt log Σr θr(s) h(ft; nt, pr),
subject to Σr θr(s) = 1 for every session s and Σk pr,k = 1 for every aspect Cr.
This problem can be solved using an expectation-maximization algorithm similar to the parameter estimation of pLSA.
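A sketch of that EM in the same style as the fold-in code in the methods section (standard pLSA-style updates under the reconstructed likelihood; variable names are ours, and h is the observation model sketched earlier):

```python
import numpy as np

def learn_aspects(sessions, R, K, n_iter=100, seed=0):
    """Jointly estimate aspect parameters {p_r} (R x K) and per-session
    interests {theta^(s)} by EM. Each session is a pair (aof_counts,
    n_values) as in estimate_interest; K is the number of attribute
    values."""
    rng = np.random.default_rng(seed)
    P = rng.dirichlet(np.ones(K), size=R)            # random init, R x K
    thetas = [np.full(R, 1.0 / R) for _ in sessions]
    for _ in range(n_iter):
        acc = np.zeros((R, K))
        for s, (F, n) in enumerate(sessions):
            H = np.array([[h(f, nt, P[r]) for r in range(R)]
                          for f, nt in zip(F, n)])   # T x R likelihoods
            gamma = thetas[s] * H                    # E-step
            gamma /= gamma.sum(axis=1, keepdims=True)
            thetas[s] = gamma.mean(axis=0)           # M-step: theta^(s)
            acc += gamma.T @ np.asarray(F, float)    # expected value counts
        acc += 1e-9                                  # avoid zero parameters
        P = acc / acc.sum(axis=1, keepdims=True)     # M-step: p_r
    return P, thetas
```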

References

  1. Arsil, P., E. Li, and J. Bruwer. 2016. Using means-end chain analysis to reveal consumers’ motivation for buying local foods: An exploratory study. Gadjah Mada International Journal of Business 18, 3: 285–300. [Google Scholar] [CrossRef]
  2. Athukorala, K., A. Medlar, A. Oulasvirta, G. Jacucci, and D. Glowacka. 2016. Beyond relevance: Adapting exploration/exploitation in information retrieval. In Proceedings of the 21st international conference on intelligent user interfaces. pp. 359–369. [Google Scholar] [CrossRef]
  3. Blei, D. M., A. Y. Ng, and M. I. Jordan. 2003. Latent Dirichlet allocation. Journal of Machine Learning Research 3: 993–1022. [Google Scholar]
  4. Bobadilla, J., F. Ortega, A. Hernando, and A. Gutiérrez. 2013. Recommender systems survey. Knowledge-Based Systems 46: 109–132. [Google Scholar] [CrossRef]
  5. Borji, A. 2012. Boosting bottom-up and top-down visual features for saliency estimation. In Proceedings of computer vision and pattern recognition. pp. 438–445. [Google Scholar] [CrossRef]
  6. Borji, A., and L. Itti. 2014. Defending Yarbus: Eye movements reveal observers’ task. Journal of Vision 14, 29: 1–22. [Google Scholar] [CrossRef]
  7. Brandherm, B., H. Prendinger, and M. Ishizuka. 2008. Dynamic Bayesian network based interest estimation for visual attentive presentation agents. In Proceedings of the 7th international joint conference on autonomous agents and multiagent systems. Vol. 1, pp. 191–198. [Google Scholar]
  8. Bruce, N. D. B., and J. K. Tsotsos. 2009. Saliency, attention, and visual search: An information theoretic approach. Journal of Vision 9, 5: 1–24. [Google Scholar] [CrossRef]
  9. Chen, L., F. Wang, and W. Wu. 2016. Inferring users’ critiquing feedback on recommendations from eye movements. In Proceedings of 24th international conference on case-based reasoning research and development. pp. 62–76. [Google Scholar] [CrossRef]
  10. Cole, M. J., J. Gwizdka, C. Liu, N. J. Belkin, and X. Zhang. 2013. Inferring user knowledge level from eye movement patterns. Information Processing & Management 49, 5: 1075–1091. [Google Scholar] [CrossRef]
  11. Collen, H., and J. Hoekstra. 2001. Values as determinants of preferences for housing attributes. Journal of Housing and the Built Environment 16, 3: 285–306. [Google Scholar] [CrossRef]
  12. Das, A. S., M. Datar, A. Garg, and S. Rajaram. 2007. Google news personalization: Scalable online collaborative filtering. In Proceedings of the 16th international conference on world wide web. pp. 271–280. [Google Scholar] [CrossRef]
  13. Goldstein, E. B., D. Vanhorn, G. Francis, and I. Neath. 2011. Cognitive psychology: Connecting mind, research, and everyday experience, 3rd ed. Belmont, CA: Wadsworth/Cengage Learning. [Google Scholar]
  14. Gutman, J. 1982. A means-end chain model based on consumer categorization processes. Journal of Marketing 46, 2: 60–72. [Google Scholar] [CrossRef]
  15. He, J., P. Qvarfordt, M. Halvey, and G. Golovchinsky. 2016. Beyond actions: Exploring the discovery of tactics from user logs. Information Processing & Management 52, 6: 1200–1226. [Google Scholar] [CrossRef]
  16. Hirayama, T., D. Jean-Baptiste, H. Kawashima, and T. Matsuyama. 2010. Estimates of user interest using timing structures between proactive content display updates and eye movements. IEICE Transactions on Information & Systems E-93D, 6: 1470–1478. [Google Scholar] [CrossRef]
  17. Hofmann, T. 1999. Probabilistic latent semantic analysis. In Proceedings of the fifteenth conference on uncertainty in artificial intelligence. pp. 289–296. [Google Scholar]
  18. Ishii, R., Y. I. Nakano, and T. Nishida. 2013. Gaze awareness in conversational agents: Estimating a user’s conversational engagement from eye gaze. ACM Transactions on Interactive Intelligent Systems 3, 2: 11:1–11:25. [Google Scholar] [CrossRef]
  19. Ishikawa, E., H. Kawashima, and T. Matsuyama. 2015. Using designed structure of visual content to understand content-browsing behavior. IEICE Transactions on Information & Systems E-98D, 8: 1526–1535. [Google Scholar] [CrossRef]
  20. Itti, L., C. Koch, and E. Niebur. 1998. A model of saliency-based visual attention for rapid scene analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence 20, 11: 1254–1259. [Google Scholar] [CrossRef]
  21. Iwata, T., S. Watanabe, T. Yamada, and N. Ueda. 2009. Topic tracking model for analyzing consumer purchase behavior. In Proceedings of the 21st international joint conference on artificial intelligence. pp. 1427–1432. [Google Scholar]
  22. Jin, X., Y. Zhou, and B. Mobasher. 2004. Web usage mining based on probabilistic latent semantic analysis. In Proceedings of the tenth ACM SIGKDD international conference on knowledge discovery and data mining. pp. 197–205. [Google Scholar] [CrossRef]
  23. Karimi, S., K. N. Papamichail, and C. P. Holland. 2015. The effect of prior knowledge and decision-making style on the online purchase decision-making process: A typology of consumer shopping behaviour. Decision Support Systems 77: 137–147. [Google Scholar] [CrossRef]
  24. Keeney, R. L. 1992. Value-Focused Thinking. A Path to Creative Decision Making. Cambridge: Harvard University Press. [Google Scholar]
  25. Kollmorgen, S., N. Nortmann, S. Schröder, and P. König. 2010. Influence of low-level stimulus features, task dependent factors, and spatial biases on overt visual attention. PLoS Computational Biology 6, 5: 1–20. [Google Scholar] [CrossRef] [PubMed]
  26. Martin-Albo, D., L. A. Leiva, J. Huang, and R. Plamondon. 2016. Strokes of insight: User intent detection and kinematic compression of mouse cursor trails. Information Processing & Management 52, 6: 989–1003. [Google Scholar] [CrossRef]
  27. Milosavljevic, M., V. Navalpakkam, C. Koch, and A. Rangel. 2012. Relative visual saliency differences induce sizable bias in consumer choice. Journal of Consumer Psychology 22, 1: 67–74. [Google Scholar] [CrossRef]
  28. Misu, T., K. Sugiura, T. Kawahara, K. Ohtake, C. Hori, H. Kashioka, H. Kawai, and S. Nakamura. 2011. Modeling spoken decision support dialogue and optimization of its dialogue strategy. ACM Transactions on Speech and Language Processing 7, 3: 10:1–10:18. [Google Scholar] [CrossRef]
  29. Ni, X., Y. Lu, X. Quan, L. Wenyin, and B. Hua. 2012. User interest modeling and its application for question recommendation in user-interactive question answering systems. Information Processing & Management 48, 2: 218–233. [Google Scholar] [CrossRef]
  30. Orquin, J. L., and S. M. Loose. 2013. Attention and choice: A review on eye movements in decision making. Acta Psychologica 144, 1: 190–206. [Google Scholar] [CrossRef]
  31. Parnell, G. S., D. W. Hughes, R. C. Burk, P. J. Driscoll, P. D. Kucik, B. L. Morales, and L. R. Nunn. 2013. Invited review-survey of value-focused thinking: Applications, research developments and areas for future research. Journal of Multi-Criteria Decision Analysis 20, 1-2: 49–60. [Google Scholar] [CrossRef]
  32. Reusens, M., W. Lemahieu, B. Baesens, and L. Sels. 2017. A note on explicit versus implicit information for job recommendation. Decision Support Systems 98: 26–35. [Google Scholar] [CrossRef]
  33. Reynolds, T. J., and J. Gutman. 1988. Laddering theory, method, analysis, and interpretation. Journal of Advertising Research 28, 1: 11–31. [Google Scholar]
  34. Russo, J. E., and F. Leclerc. 1994. An eye-fixation analysis of choice processes for consumer nondurables. Journal of Consumer Research 21, 2: 274–290. [Google Scholar] [CrossRef]
  35. Saaty, T. L. 1980. The analytic hierarchy process: planning, priority setting, resource allocation. London; New York, NY: McGraw-Hill International Book Co. [Google Scholar]
  36. Schaffer, E., H. Kawashima, and T. Matsuyama. 2016. A probabilistic approach for eye-tracking based process tracing in catalog browsing. Journal of Eye Movement Research 9, 7: 1–14. [Google Scholar] [CrossRef]
  37. Shi, S. W., M. Wedel, and G. M. R. Pieters. 2013. Information acquisition during online decision making: A model-based exploration using eye-tracking data. Management Science 59, 5: 1009–1026. [Google Scholar] [CrossRef]
  38. Shi, Y., M. Larson, and A. Hanjalic. 2014. Collaborative filtering beyond the user-item matrix: A survey of the state of the art and future challenges. ACM Computing Surveys 47, 1: 3:1–3:45. [Google Scholar] [CrossRef]
  39. Shimonishi, K., H. Kawashima, R. Yonetani, E. Ishikawa, and T. Matsuyama. 2013. Learning aspects of interest from gaze. In Proceedings of the 6th workshop on eye gaze in intelligent human machine interaction: Gaze in multimodal interaction. pp. 41–44. [Google Scholar] [CrossRef]
  40. Steichen, B., C. Conati, and G. Carenini. 2014. Inferring visualization task properties, user performance, and user cognitive abilities from eye gaze data. ACM Transactions on Interactive Intelligent Systems 4, 2: 11:1–11:29. [Google Scholar] [CrossRef]
  41. Sugano, Y., Y. Ozaki, H. Kasai, K. Ogaki, and Y. Sato. 2014. Image preference estimation with a data-driven approach: A comparative study between gaze and image features. Journal of Eye Movement Research 7, 3: 1–9. [Google Scholar] [CrossRef]
  42. Tatler, B. W., R. J. Baddeley, and I. D. Gilchrist. 2005. Visual correlates of fixation selection: Effects of scale and time. Vision Research 45, 5: 643–659. [Google Scholar] [CrossRef]
  43. Uetsuji, K., H. Yanagimoto, and M. Yoshioka. 2015. User intent estimation from access logs with topic model. Procedia Computer Science 60: 141–149. [Google Scholar] [CrossRef]
  44. Veludo, T. M., A. A. Ikeda, and M. C. Campomar. 2006. Laddering in the practice of marketing research: barriers and solutions. Qualitative Market Research: An International Journal 9, 3: 297–306. [Google Scholar] [CrossRef]
  45. Walker, M. A., S. J. Whittaker, A. Stent, P. Maloor, J. Moore, M. Johnston, and G. Vasireddy. 2004. Generation and evaluation of user tailored responses in multimodal dialogue. Cognitive Science 28, 5: 811–840. [Google Scholar] [CrossRef]
  46. Yarbus, A. L. 1967. Eye movements and vision. New York, NY: Plenum. [Google Scholar]
  47. Zanoli, R., and S. Naspetti. 2002. Consumer motivations in the purchase of organic food: A means-end approach. British Food Journal 104, 8: 643–653. [Google Scholar] [CrossRef]
Figure 1. Example situation of users’ choice behavior during content browsing and user’s internal hierarchical structure based on means-end chain concept.
Figure 2. Flow of the two-step approach for interest estimation from the user’s gaze behavior: the first step detects the user’s attributes-of-focus by applying a likelihood-based short-term analysis; the second step estimates the user interest by applying a probabilistic generative model.
Figure 4. Illustration of interest-driven attention focusing model.
Figure 5. Experimental environment and displayed content of a catalog. Each item region consisted of a picture and descriptions of the PC attributes in text format. (Descriptions in this figure are translations from the original language, Japanese.)
Figure 6. Example of a gaze trajectory on the content. The descriptions in this figure are translations from the original Japanese, as in Figure 5.
Figure 7. Example of detected AOF. (a) The original sequence of attribute values of browsed items; (b) detected AOF.
Table 1. Three tasks and task-related attribute values.
Table 2. Results of estimation of participants’ interest (engaged task). The number in each cell of column r (r = 1, 2, 3) shows the number of sessions estimated as task r. “No comparison” means the AOF detection step could not detect a bias in gaze behavior during the session.
