Glimpse: A Gaze-Based Measure of Temporal Salience

Temporal salience considers how visual attention varies over time. Although visual salience has been widely studied from a spatial perspective, its temporal dimension has been mostly ignored, despite arguably being of utmost importance to understand the temporal evolution of attention on dynamic contents. To address this gap, we proposed Glimpse, a novel measure to compute temporal salience based on the observer-spatio-temporal consistency of raw gaze data. The measure is conceptually simple, training free, and provides a semantically meaningful quantification of visual attention over time. As an extension, we explored scoring algorithms to estimate temporal salience from spatial salience maps predicted with existing computational models. However, these approaches generally fall short when compared with our proposed gaze-based measure. Glimpse could serve as the basis for several downstream tasks such as segmentation or summarization of videos. Glimpse’s software and data are publicly available.


Introduction
Visual salience (or saliency) refers to the ability of an object, or part of a scene, to attract our visual attention. The biological basis for this phenomenon is well known [1]: salience emerges in parallel processing of retinal input at lower levels in the visual cortex [2]. Concepts other than salience, such as surprise [3], have been found to explain human gaze in dynamic natural scenes. While the concept of spatial salience has been extensively investigated for static contents such as natural images [4,5] and graphic displays [6,7], the temporal salience of dynamic contents such as videos remains largely unexplored. Spatial salience predicts where attention is allocated in the image domain, whereas temporal salience predicts when attention happens and how it varies over time.
The importance of temporal salience to gain valuable insights about a video structure has been recently noted, and a mouse click-based interaction model was proposed to annotate datasets in the absence of an eye tracker [8]. However, the approach requires significant manual work and recommends several passes over the same video to ensure a low intra-observer variability and obtain more reliable estimates.
In this work, we investigated how to automatically estimate temporal salience in videos using eye-tracking data. Our main hypothesis was that when gaze coordinates are spatio-temporally consistent across multiple observers, it is a strong indication of visual attention being allocated at a particular location within a frame (spatial consistency) and at a particular time span (temporal consistency). In other words, uninteresting dynamic contents are expected to induce non-homogeneous, randomly located gaze points (low temporal salience), whereas truly attention-grabbing contents would concentrate similar gaze points from different observers for some time span (high temporal salience).
Our approach, named GLIMPSE (gaze's spatio-temporal consistency from multiple observers), is illustrated in Figure 1. As can be observed, there is a general agreement To the best of our knowledge, GLIMPSE is the first method that addresses the problem of computing temporal salience from gaze data, without requiring explicit human annotation effort, nor model training. Additionally, because eye-tracking data are not always available, a secondary research contribution we made in this paper was exploring whether frame-level spatial salience maps, as predicted by existing computational models, can be used to produce reasonable estimates of temporal salience according to our method. The idea is similar as before: spatio-temporal consistency in the spatial salience map across time might provide cues for estimating temporal salience. This alternative is highly relevant because, if proven effective, it would pave the way for a more agile computation of temporal salience, without having to recruit human participants.
In sum, the key contributions of this paper are: (1) a measure of temporal salience in dynamic scenes based on the notion of observer-spatio-temporal gaze consistency; (2) analysis and evaluation of the proposed measure; (3) an exploration of heuristic measures of temporal salience derived from computational models of spatial salience; and (4) software and data to allow others to build upon our work.

Related Work
Our work was mostly related to research on eye-tracking applications in dynamic scenes, such as segmentation, summarization, and compression of videos. We review those here and also relate to recent tools that have been used for annotation of temporal salience datasets.

Downstream Applications
Many video summarization approaches rely on predicting frame-level importance scores [9,10], which are task dependent and therefore biased towards a particular summarization goal, whereas temporal salience is a more generic concept that could in turn be tailored to more specific or higher level tasks. Since eye gaze is known to provide cues on the underlying cognitive processes [11,12], it can be expected to be particularly useful for this kind of video-processing task. However, despite being used in some computer vision problems [13][14][15][16], its general use has been limited.
Gaze data in first-person wearable systems can aid in temporal video segmentation [17], and its computational prediction has been studied [18]. The gaze data of the wearer of an egocentric camera have been used to score the importance of the frames, as the input to a fast-forward algorithm [19]. In these cases, however, gaze is available only from a single user [17,19] (the wearer), instead of the (multiple) watchers of a video, as considered in our work.
An alternative to estimating the intrinsic salience of the visual contents is to analyze the observers' attention, as in a recent work [20], which found that the eye movements of students watching instructional videos were similar. It has also been found that gaze location may vary upon repeated viewings of the same video [21]. In the scope of behavioral biometrics, the fusion of mouse and eye data has been proposed for improved user identification [22].

Handling Temporal Information
Low-level conspicuity maps can be used to derive a temporal attention curve [23], to subsequently extract keyframes through a clustering-based temporal segmentation based on the visual similarity of neighboring frames. A similar approach has been proposed [24], but including camera motion as a visual feature, plus audio and linguistic cues. These approaches are arguably difficult to use in real-time applications.
Salience maps derived from gaze data can be used for video compression by preserving higher visual quality at salient spatial regions. Based on the notion of the temporal consistency of attention (i.e., spatial salient regions in neighboring frames are likely to overlap), a temporal propagation of the salience map can be performed [25]. Although the temporal concept is indirectly considered, salience is used as a purely spatial concept.
The recently introduced concept of multi-duration salience [26] includes a notion of time, but still for defining spatial maps for static contents. Salience in dynamic scenes is related to but conceptually different from salience in static images [27]. Specific methods for the dynamic case have been studied [28][29][30][31][32][33] and, very recently, unified image-video approaches [34] proposed, but only in the context of spatial salience. For gaze prediction, temporal features are found to be of key importance in rare events, so spatial static features can explain gaze in most cases [35]. At the same time, features derived from deep learning models exploiting temporal information have been found to benefit gaze estimation over using static-only features [36].

Annotation Tools
Finally, researchers have sought different annotation approaches for understanding and predicting visual attention, mostly focused on static images [37]. Crowdsourcing techniques such as the Restricted Focus Viewer [38] or BubbleView [39] have emerged as a poor man's eye tracker [40] to collect data at a large scale, where the computer display is blurred and the user has to move or click their mouse in order to see a small region in focus [8,41]. User-unknown limited mouse clicks [8] require the same participant to watch the same contents several times. This brings more reliable annotation, but challenges its scalability in terms of the length or number of videos. A comprehensive review of user interfaces for predicting stimulus-driven attentional selection [42] and a comparison of recent methodologies [43] are representative of alternatives to eye-based data.

Novelty and Relevance of GLIMPSE
It is important to highlight the novelty of both the problem addressed in this work (temporal quantification of the visual salience of dynamic contents) and the proposed approach (measure based on observer-spatio-temporal consistency of gaze data) with respect to these previous works. Specifically, some existing approaches consider time-varying contents (videos), but only estimate spatial salience maps, without providing a scalar salience score as a function of time. Furthermore, when the temporal dimension is considered, it is only for the purpose of improving the quality of the estimated spatial salience maps. Even the recent concept of multi-duration salience, it is still based on the notion of spatial maps and for static contents. Additionally, based on gaze data, GLIMPSE is fundamentally different from explicit human-annotation-based approaches.
Because of all these reasons, GLIMPSE is the first of its kind, to the best of our knowledge. As happens with new problems and approaches, this novelty prevents us from quantitatively assessing its performance (no prior ground-truth exists yet), but it also represents a unique opportunity to provide the scientific community with a reference quantification of temporal salience in terms of both ready-to-use measures computed on a particular video dataset and software to measure temporal salience for other dynamic contents with available gaze data. We believe this will significantly facilitate research progress on the problem of temporal visual salience estimation and its applications.

Measure Description
We illustrate GLIMPSE with videos, the paradigmatic example of time-varying visual contents. For a given video, let g(o, t) = (x, y) be the gaze position of observer o ∈ {1, . . . , N} at time (or frame number) t ∈ {1, . . . , T}, for N observers along the T frame long video. There are four variables involved: two spatial coordinates (x, y), the temporal domain t, and the observer o. Our goal is to compute a temporal salience score s(t) ∈ R for each frame t from the (implicit) four-dimensional function f (x, y, o, t).
The idea is to capture the spatio-temporal consistency of the observers' gaze points. This entails some notion of the distance and dispersion of such points distributions: the closer they are, both in space and time, the higher the consistency. After some exploration, we were eventually inspired by Ripley's K function [44], a measure of spatial homogeneity that has been used, for example, in ecology [45] and bio-geography [46]. Based on this measure, we formulated temporal salience as: where d ij is the pairwise Euclidean distance between the ith and jth points in the set P t of n gaze points from all the observers within a temporal window of length 2θ t + 1 centered at t, i.e., and 1[p] is the indicator function, which is one when predicate p is true and zero otherwise.
We used θ s to denote the spatial scale, which is a distance threshold. Thus, Equation (1) accounts for the number of paired gaze points that are close enough, in a normalized way, so that s(t) ∈ [0, 1]. The larger s(t) is, the higher the spatio-temporal and inter-observer consistency, which in our problem translates to higher temporal salience. This definition of s(t) is interesting because, besides being rather natural and relatively simple, it implicitly captures an aggregation measure without the need for an explicit clustering, which would be more computationally expensive as well. Note that in this definition of s(t), there is no need to keep track of which gaze points belong to which observer: all gaze points within the specified temporal window can be considered as a "bag of gaze points". Furthermore, importantly, the gaze points are processed in raw form, i.e., without computing gaze features, nor classifying gaze points into fixations or saccades.

Evaluation
We tested GLIMPSE with the publicly available SAVAM dataset [25]. Details on the eye tracker, videos, and users watching those videos are given in Table 1. Since we were interested in frame-level data and the frame rate of the videos was smaller than the eye tracker's sampling frequency, gaze positions within a frame were averaged. The data were recorded using a binocular system, so we arbitrarily chose the left eye as the input source. The (x, y) gaze coordinates were normalized to [0, 1] by dividing them by the frame's width W and height H, respectively. This makes GLIMPSE independent of the video frame size and facilitates setting a meaningful distance threshold θ s across studies.

Analysis of Hyperparameters
We first studied the effect of the spatial θ s (distance threshold) and temporal θ t (time window) parameters of GLIMPSE. As shown in Figure 2, for a fixed θ t , a too permissive or a too strict distance threshold θ s leads to salience estimates that are either nearly always too low or too high, which are essentially uninformative. ....  For intermediate values of θ s , the score profile s(t) is similar, but larger values of θ s produce generally higher salience scores. In addition, some particular values induce better discrimination between peaks and valleys. Regarding the effect of the temporal window θ t for the same spatial scale θ s , an increase in θ t produces a smoothing effect on salience estimates. We empirically set θ s = 0.1 and θ t = 5 as reasonable values, according to earlier pilot experiments.

Convergence Analysis
Now, for fixed values of the hyperparameters (θ s = 0.1, θ t = 5), we conducted a convergence analysis to see how many observers would be required to get salience profiles as close as possible to those obtained with all the observers, which would be the best case scenario but also the most expensive overall, as it requires more human participants.
Let s k (t) be the salience scores produced for 1 ≤ k ≤ N observers. We wanted to compare s k (t) to s N (t) for a given video of length T. To this end, the length-normalized Euclidean distance d between two salience scores, s 1 (t) and s 2 (t), is defined as: We took p(k) = min(p max , ( N k )) random samples of size k observers out of the N observers available in the SAVAM dataset. A conservative upper bound of p max = 400 was set so that not all possible combinations were computed (for example, for N = 58 observers, there are as many as 30,856 different combinations of k = 3 observers) and computed the mean of d k,N = d(s k , s N ) for each of the samples.
As shown in Figure 3, convergence happens quickly, which means that a reliable temporal salience can be obtained with much fewer observers, suggesting thus that GLIMPSE is quite scalable. The confidence intervals are very small and hence not shown in the figure.
A very similar trend was observed for the rest of the SAVAM videos. As a reference for the scale of the distance, it is worth looking at Figure 4 (discussed below), which shows the related profiles of signals s k (t) and s N (t), together with the corresponding distances d k,N . It can be noted in Figure 3 that d k,N is particularly low for k > 5 for video v36, which has a (almost constant) low salience score along the whole video; see Figure 5b. This result is particularly relevant since it can be expected that for low-attention contents, less observers are required to get reliable estimates of temporal salience.

Effect of the Number of Observers
Now, we use Figure 4 to illustrate temporal salience scores s k (t) for a varying number of observers k. Two main observations are worth mentioning: First, it can be seen that s k (t) ≥ s k (t) for k < k ; i.e., the fewer the observers, the more overestimated the salience score tends to be, and therefore, s N (t) represents a conservative lower bound. Second, the convergence of s k (t) to s N (t) with k is quite apparent, reinforcing the fact that it happens with very few observers, as noticed before in Figure 3.

Qualitative Assessment
Similar to Figure 1, which illustrates GLIMPSE for video v22 in the SAVAM dataset, further examples are provided in Figure 5 that highlight the behavior of the proposed measure.
In v30, at t ≈ 200 the woman in the background grabs the attention of many observers, with the corresponding increase in the salience score; see Figure 5a. Afterwards, at t ≈ 240, attention spreads across the woman, the back of the boy, and other scene regions, resulting in lower salience scores. Then, at t ≈ 300 and t ≈ 380, attention is highly consistent around the boy's face and at the women in the background, respectively, and so, s(t) exhibits local peaks at those times.
In v36, the tree leaves are moving with the wind all the time, with no particular region drawing the observers' attention; see Figure 5b. Consequently, gaze locations are not homogeneous, and accordingly, the salience score is very low and flat overall.
The more dynamic contents in v43 produce higher peaks and more variations in the temporal salience than in other examples; see Figures 1 and 5. The high score at t ≈ 80 aligns with the appearance of the girl's face. The valley at t ≈ 173 can be explained by a scene change, where observers' gaze points diverge. An eye-catching car maneuver draws the attention of the observers around t ≈ 330 and, after a viewpoint change, again at t ≈ 415.
The qualitative results reported in [8] (Figure 3) include the salience scores for 250 frames of videos v30 and v36 ( Figure 5). Like our approach, their salience scores in video v30 are higher in accordance to relevant video events. However, GLIMPSE differs in when peaks and valleys happen in the salience signal, as well as the overall salience scores, in absolute terms. For v36, their scores are essentially flat, as with GLIMPSE. However, their scores are close to 0.5 in some parts, whereas GLIMPSE predicts much lower scores overall (about 0.1), which arguably reflects better the "monotonous" content of this video.

Comparison with Downstream Applications
Since currently there is no ground-truth or alternative approaches for temporal salience computation, a fair quantitative comparison is not possible. However, as a reference, we compared GLIMPSE with two approaches of video-related problems, namely a popular temporal segmentation approach, Kernel-based temporal segmentation (KTS) [47], and a recent memory-based frame-wise visual interestingness estimation method [48] (VisInt). The input to KTS were the 2048-dimensional activations prior to the last fully connected layer of InceptionV3 [49] from the video frames downsampled to a 400 × 225 resolution. The input to VisInt was the video frames resized to 320 × 320 resolution, as per the default choice in the authors' software (https://github.com/wang-chen/interestingness last accessed on 28 April 2021).
The salience score s(t) from GLIMPSE was compared with a reference signal r(t). For KTS, r(t) = 1 for frames temporally close to the detected scene change points, and r(t) = 0 otherwise. For VisInt, r(t) is the interestingness score. We tested several of VisInt's writing rates (γ w > 0) to the visual memory system during online learning, where higher γ w implies decreasing interest in new visual inputs earlier.
On the one hand, we compared GLIMPSE with KTS using the precision and recall metrics, since the KTS signal is binary, defined as precision = I/S and recall = I/R, with I = ∑ t min(s(t), r(t)), R = ∑ t r(t), and S = ∑ t s(t). Both metrics are defined in [0, 1], with lower values representing a poor match between the compared signals. On the other hand, we compared GLIMPSE with VisInt using Spearman's ρ and Kendall's τ, since both are continuous signals, and they were recommended in similar contexts [50]. Both rank correlation metrics are defined in [−1, 1], with values close to zero denoting weak or no correlation.
Results across SAVAM videos revealed low precision and recall, as shown in Table 2a, and essentially no correlation; see Table 2b. This means that GLIMPSE differs from other segmentation-or "importance"-like scoring approaches. In particular, it can be observed in Figure 6 that interestingness peaks from VisInt tend to agree (v30, v43) with some scene change points detected by KTS, but GLIMPSE is not biased by these changes. On the one hand, VisInt produced a flat signal in v36, which rightfully corresponded to the homogeneous contents of that video, but it also did so in v22, thus missing the subtle image changes corresponding to people moving in the hall ( Figure 1) and that GLIMPSE aptly captured. On the other hand, KTS may produce non-meaningful scene changes (v22, v36) and did not align with attention-grabbing moments, as detected by GLIMPSE (v30 and v43).  In sum, these experiments highlighted how existing scoring techniques for detecting key events rely on low-level visual cues and tend to produce suboptimal results at best. In contrast, being based on the cognitively rich human gaze, GLIMPSE was able to robustly estimate the temporal evolution of attention in a semantically meaningful way.
In terms of computational efforts, asymptotic costs (Table 3) indicated that GLIMPSE and VisInt, being online algorithms, depend linearly on the length of the video T, whereas KTS has a quadratic dependency and might scale poorly to long videos. The cost for KTS did not include the part of extracting the frame features. GLIMPSE had a quadratic term for the number of gaze points n within a temporal window, which can be in the order of a few hundred (e.g., for N = 58 observers and θ t = 5 frames in our experiments). Since gaze points are very low-dimensional (simply 2D), computing the pair-wise distances is very efficient. Once gaze points were available, GLIMPSE was really fast, since it did not depend on either the size of the frames or the video length, unlike VisInt, which had video frames as the input, or KTS, which usually deals with long frame feature vectors. Actual running times (Table 4) highlighted how efficient GLIMPSE was: about one order of magnitude faster than KTS (even without feature extraction) and more than two orders of magnitude faster than VisInt. These statistics corresponded to times measured for the first 10 videos (v01-v10) in the SAVAM dataset (avg. number of frames per video: 444.0 ± 15.8), using an AMD Ryzen 5 processor (3550H series) @ 2.1 GHz with 8 GB of RAM and a built-in NVIDIA GeForce GTX 1650 GPU with 4 GB of memory.

Summary
GLIMPSE provides a consistent quantification of temporal salience, with good convergence behavior in terms of the number of observers required to achieve temporal scores similar to those of many more observers. This is particularly interesting, since with GLIMPSE, it is not necessary to recruit many users who can provide eye-tracking data: with as few as three observers, we can expect an average error as small as 1%. Additionally, our qualitative experiments showed that GLIMPSE produced temporal salience estimates that were well aligned with key attention-grabbing events in the videos, unlike other downstream video applications (temporal segmentation, interestingness estimation), which have different purposes. This also suggested that this kind of gaze-based measure cannot be easily replaced by existing low-level algorithms relying only on purely visual cues. We concluded that GLIMPSE contributes to understanding how salience evolves in dynamic scenes, which can enable or assist several downstream applications such as the ones discussed in Section 2.

Experiments with Computational Salience Models
Since GLIMPSE provides a consistent and reliable reference of temporal salience, we investigated whether temporal salience can be alternatively estimated from spatial salience maps predicted by computational models. In the literature, these models have been shown to correlate reasonably well with human fixations [51], but it is still unknown whether they can be used to derive reliable temporal salience scores. We explored this possibility by considering several existing computational models of spatial salience (Section 5.1); some heuristic scoring algorithms (Section 5.2) that map spatial salience in the 2D image domain to 1D salience scores in the time domain; and then comparing their output (Section 5.3) when GLIMPSE is taken as a (ground-truth) reference.

Models
We considered three computational models of spatial salience, each representing a family of approaches. Classic computational models such as Itti et al. [4] approached human visual attention by heuristically defining conspicuity maps that rely on locally distinctive features (e.g., color, intensity, etc.), whose combination results in a bottomup salience map. Graph-based visual salience (GBVS) [52] is a popular model that was reported to outperform classic methods and has been tested for combining salience maps and eye fixations for visualization purposes [53]. Therefore, GBVS was the first model we selected.
Recently, deep convolutional neural nets have been proposed to predict salience maps as their output [54]. Alternatively, "salience maps" of the deepest layers in neural networks are explored not for attention modeling, but mainly for visualization and explanatory purposes [55,56]. We tested two of such deep learning models: the multiduration model [26], which predicts how the duration of each observation affects salience, and the temporally-aggregating spatial encoder-decoder network (TASED) [32], which was proposed as a video-specific salience model.
We note that the multiduration model [26] makes predictions for horizons of 0.5, 3, and 5 s. Since we observed that the resulting salience maps were not very different for our purposes, we used the 3 s horizon, which corresponds to the intermediate value.
In all cases, we refer to S(x, y; t) as the spatial salience at position (x, y) and at frame t. Notice that this notation for the 2D spatial map S is different from s(t), which we use to refer to the 1D temporal salience.

Scoring Algorithms
The goal of the scoring algorithms proposed here is to produce a temporal salience score s(t) from the spatial salience maps S(x, y; t). We observed that some computational models tended to produce very noisy salience maps, while others estimated very clean salience maps. We also remark that the data variability that arises naturally with gaze points from multiple observers was lacking most of the time in the computed salience maps. These issues can be (partially) addressed differently via the following strategies: MUTUALINFO Comparing neighboring salience maps. The similarity of salience maps that are close in time should be able to capture the temporal consistency even when the spatial salience is noisy or spread out. This can be quantified by the (average) mutual information I computed over a temporal window: where θ t = 5 in our experiments, as discussed in Section 3.
MAXVALUE Using the maximum spatial salience score. When the salience map is clean and does not vary substantially over time, the spatio-temporal consistency can be unusually high. Therefore, instead, its global maximum can be a rough indication of how salient the corresponding frame is: SPREAD Quantifying the spread of the salience map. The spatial distribution of a salience map S(x, y) is a measure of spatial consistency. To quantify this, the salience centroid (x c , y c ) is first computed through weighted averages for each spatial coordinate: and then, the salience map is weighted with a 2D Gaussian kernel G σ (x, y) centered at (x c , y c ): The Gaussian's bandwidth σ dictates how tolerant it is to spread deviations (the lower σ, the more strict), similar to the role that θ s has in Equation (1). We set σ = W θ s 2 as a function of the salience map size (width W) and the side length of the Gaussian window as = 2 2σ + 1, following official implementations in computer vision toolboxes (see, e.g., https://mathworks.com/help/images/ref/imgaussfilt.html last accessed on 27 April 2021).
POINTS Generating point hypotheses. The fact that some salience maps are noisy can be leveraged as a way to generate multiple point hypotheses and thus naturally induce some variability in the data, somehow mimicking what happens when dealing with actual gaze points from several observers. The procedure is illustrated in Figure 7 and summarized as follows: 1.
The salience map S was thresholded to get a binarized map B.

2.
The centroids {C i } of the regions (connected components) of the binary salience map B were computed. 3.
The Ripley-based measure (Equation (1)) was used as is, simply by replacing the gaze points in Equation (2) by these centroids {C i }, also over a temporal window θ t . Figure 7. The POINTS scoring algorithm works by hallucinating "gaze hypotheses" points from salience maps, in this case computed by GBVS for the SAVAM v30 (dolphin).

Results
We compared the results of different combinations of computational model and scoring algorithm to produce estimates of temporal salience. We use the Model/SCORING notation to denote each combination. For example, GBVS/MUTUALINFO indicates that the spatial salience maps produced by the GBVS computational model were compared with the mutual information as the scoring algorithm.

Quantitative Assessment
We compared the salience scores computed by a salience map model s map against the reference salience scores s gaze computed by GLIMPSE with gaze points, using θ s = 0.1 and θ t = 5, as above. We computed the average Jaccard index, also known as the intersection over union (IoU): which is defined in [0, 1] and has meaningful semantics [57]. IoU is widely used in computer vision for various tasks such as object detection [58]. Since different metrics capture different aspects of the compared signals, we also computed Spearman's ρ, ∈ [−1, 1], which accounts for non-linear correlations [59]. We observed very similar results with other similarity and correlation measures; therefore, we only report IoU and ρ for brevity's sake. Finally, we included s(t) = λ as a straightforward baseline method, where λ ∈ [0, 1] is a constant score of temporal salience. Note that, being constant, correlation measures cannot be computed for these baselines. It can be observed in Figure 8 that, overall, the performance of the computational models was rather modest. However, taking into account their limitations, in some cases, these models produced reasonable estimates. For example, for some videos and some algorithms, the IoU was as high as 0.8. As expected, there was no single best combination of a computational model and a scoring algorithm. Rather, some combinations outperformed others in some cases. Focusing on the scoring algorithms, MUTUALINFO tended to perform sub-optimally when compared to most of the other combinations. SPREAD, in combination with the salience maps produced by both deep learning models (TASED and multiduration), achieved the highest performance. Interestingly, the POINTS scoring algorithm in combination with the otherwise noisy salience maps obtained with GBVS provided a very effective procedure: GBVS/POINTS closely followed multiduration/SPREAD and TASED/SPREAD.
The baseline method with λ = 0.25 achieved the highest performance in terms of IoU, and only the three best performing algorithms outperform the baseline method with λ = 0.5. There are two important aspects that constitute a good s(t) signal: one is the absolute values, which should be close to the expected temporal salience score; the other is the relative changes, which should capture when (and how much) the temporal salience increases and decreases. The simple baseline, with a properly guessed λ, might be good in the first aspect, but ignores completely the second aspect. Since the IoU metric focuses more on the absolute aspect, a better way of capturing the relative aspect would be necessary in order to compare different approaches. Regarding Spearman's ρ, all methods had a positive, but low correlation, with those using TASED salience maps performing slightly better.
These experiments suggested that, by considering the temporal signals s map (t) globally, the computational models behaved poorly and hardly matched s gaze (t). As our qualitative analysis below illustrates, the temporal salience scores derived from computational models aligned relatively well with gaze-based scores only locally, i.e., at some temporal segments at some videos. As a result, these isolated locally good performances were eventually dismissed with the (globally-averaging) metrics such as r, ρ, and IoU.

Qualitative Assessment
We now discuss how the temporal salience scores computed from spatial salience maps relate to the gaze-based scores. We first focus on multiduration/SPREAD, which was the best performing method according to IoU. For the lowest IoU, the salience scores differed notably; see Figure 9a. For the intermediate IoU values, as can be seen in Figure 9b, although the overall score curves were different in absolute terms, there were interesting matching patterns, some of which are marked with green background regions, but also others where even the reverse patterns were observed, some of which are marked with red background regions. For the highest IoU, the curves may not only be similar in some patterns, but also close in absolute values; see Figure 9c. Regarding the other computational methods, their behavior was more diverse, but some general patterns could also be identified. For instance, both GBVS/MUTUALINFO and GBVS/POINTS tended to overestimate the salience scores, which hardly aligned with GLIMPSE's. TASED/SPREAD performed badly in some videos such as v10, but had good matching patterns in the videos in Figure 9b, which was in agreement with its higher IoU and could be easily noticed around frame t = 200. Interestingly, in some cases (e.g., v38), TASED/SPREAD exhibited a better behavior than multiduration/SPREAD, which suggested that some of the computational methods may complement one another.

Summary
Computational methods of spatial salience, combined with our scoring algorithms, may be used to estimate temporal salience through some notion of the spatio-temporal consistency of predicted attention. When compared to the reference scores estimated with GLIMPSE, limited performance was observed. However, interesting matching patterns could be noticed, which suggested that further work is needed for improving the underlying computational model, the scoring strategy, or both. Overall, it can be argued that the best performing computational models are deep-learning based (multiduration and TASED) using the SPREAD and POINTS scoring algorithms.

Discussion, Limitations, and Future Work
The quantification of temporal salience in dynamic scenes such as videos is an overlooked research problem. Arguably, temporal salience may even be more important than spatial salience in these cases [8]. We proposed GLIMPSE, a novel measure based on the observer-spatio-temporal consistency of gaze points. We showed that GLIMPSE is conceptually simple and has interesting properties. Crucially, it relies solely on raw gaze data, without analyzing the video contents at all. GLIMPSE only has two hyperparameters, the spatial (θ s ) and temporal (θ t ) scales, which are easily understandable. A potential limitation of our measure is that some domain knowledge may be required to help fine-tune such hyperparameters. For example, in some applications, it may be desirable to smooth the resulting scores with higher θ t or emphasize the peaks/valleys with lower θ s .
One direction to improve GLIMPSE would be to include video content analysis. This might help, for example, to automatically and dynamically set the spatial scale θ s as a function of the size of the relevant object(s) being attended. Furthermore, in our comparison of GLIMPSE to the temporal salience estimated from spatial salience maps, we used heuristic scoring algorithms, which, being hand-crafted, may miss uncovering relevant visual patterns for more reliable and robust estimates. Therefore, a natural next step is to train a sequential deep neural model using GLIMPSE's as the supervisory signal and taking as the input the raw image contents, possibly aided with either precomputed spatial salience maps, or learned end-to-end. This would provide stronger insights into how predictable the gaze-based temporal salience score is from visual-only contents.
Besides the raw gaze data used in this work, the duration of eye fixations could be considered as well, since users typically process information during fixation events [60], so we hypothesize that longer fixations should correlate with higher temporal salience. Comparing scan-paths from multiple observers [61] might be an interesting complementary mechanism for quantifying the temporal attention.
Touching on another promising research line, creating new datasets with ground-truth labels of temporal salience scores is extremely costly, but certainly would facilitate progress in this problem and related topics. GLIMPSE could be used in this regard, allowing for reliable benchmarking tasks. Another avenue for future work is developing some downstream applications with GLIMPSE such as video segmentation, compression, summarization, or frame-rate modulation.
Looking forward into the future, we believe GLIMPSE will contribute to the realization of calm technology [62], where user interaction happens unconsciously. In this context, one could use GLIMPSE to automatically build annotated datasets of temporal salience with little effort. Considering recent work that has enabled webcams as affordable eyetracking devices [63] with interesting applications [20], we envision a remote or co-located environment where participants just watch videos at their own pace while their gaze data are collected in the background, aggregated, and processed in a glimpse.

Conclusions
GLIMPSE is a novel measure of temporal salience, based on the observer-spatiotemporal consistency of unprocessed eye-tracking data. The measure is conceptually simple and requires no explicit training. Importantly, the estimated salience scores converge quickly with the number of observers, so GLIMPSE does not need a large number of participants to derive consistent results. GLIMPSE is computationally efficient, which also lends itself as a suitable method for real-time, on-line computation.
We showed that GLIMPSE provides consistent estimates of visual attention over time, which could be used in several downstream tasks with video contents. Additionally, we explored scoring algorithms for temporal estimation from computational models of spatial salience. When compared to GLIMPSE as a reference, they were found to have limited performance.
Ultimately, this paper lays the groundwork for future developments of eye-tracking applications that can make sense of when visual attention is allocated in dynamic scenes. Critically, the distribution of the peaks and valleys of the temporal scores tends to align semantically with salient and human-explainable video events, making our method a sensible approach to produce a consistent reference of temporal salience.