Article

Inferring Visual Biases in UAV Videos from Eye Movements

1 Univ Rennes, CNRS, IRISA, 263 Avenue Général Leclerc, 35000 Rennes, France
2 Univ Rennes, INSA Rennes, CNRS, IETR—UMR 6164, 35000 Rennes, France
* Author to whom correspondence should be addressed.
Drones 2020, 4(3), 31; https://doi.org/10.3390/drones4030031
Submission received: 3 June 2020 / Revised: 24 June 2020 / Accepted: 30 June 2020 / Published: 4 July 2020

Abstract

Unmanned Aerial Vehicle (UAV) imagery has been gaining a lot of momentum lately. Indeed, information gathered from a bird's point of view is particularly relevant for numerous applications, from agriculture to surveillance services. We study visual saliency here to verify whether there are tangible differences between this imagery and more conventional content. We first describe typical and UAV content based on their human saliency maps in a high-dimensional space, encompassing saliency map statistics, distribution characteristics, and other specifically designed features. Thanks to a large amount of eye tracking data collected on UAV videos, we stress the differences between typical and UAV videos, but more importantly within UAV sequences. We then design a process to extract new visual attention biases in UAV imagery, leading to the definition of a new dictionary of visual biases. Finally, we conduct a benchmark on two different datasets, whose results confirm that the 20 defined biases are relevant as a low-complexity saliency prediction system.


1. Introduction

Visual attention is the mechanism developed by the Human Visual System (HVS) to cope with the large quantity of information it is presented with. This capacity to sort out important information and discard the rest can be represented through visual salience representations. Saliency maps show the statistical distribution of human eye positions when viewing visual stimuli. Many factors influence where we look, such as the visual content, the task at hand, and some systematic tendencies. We investigate the latter point in this paper in the specific context of Unmanned Aerial Vehicle (UAV) videos.
When watching natural scenes on a screen, observers tend to look at the center irrespective of the content [1,2]. This center bias is attributed to several effects such as the photographer bias [2], the viewing strategy, the central orbital position [3], the re-centering bias, the motor bias [4,5], the screen center bias [1], and the fact that the center is the optimal location to access most visual information at once [1]. It is usually represented as a centered isotropic Gaussian stretched to the video frame aspect ratio [6,7]. Other biases are present in gaze deployment, mainly due to differences in context, i.e., observational task [8], population sample [9], psychological state [10], or content type [11,12,13]. Biases are studied through various modalities, such as reaction times for a task [8,9], specifically designed architectures [9], or hand-crafted features describing salience or fixations [11]. Note that, except for the center bias, we are not aware of visual attention biases that take the form of saliency patterns.
UAV imagery is gaining momentum with the multiplication of potential applications it offers, e.g., new delivery services, journalism [14], and security and surveillance systems [15,16,17]. This imaging differs from conventional content in various respects. For instance, the photographer bias is not the same due to the special training and visual routines required to control the aerial sensor [18,19]. Captured images and videos show objects from a new and unfamiliar bird's-eye perspective, with new camera and object motions, among other differences [20]. We thus wonder whether such variations impact the visual exploration of UAV content.
A first element of an answer is proposed in previous works [21,22], which put into question the presence of the center bias in UAV videos. We also claim that existing saliency models fail to predict saliency in UAV content [21] because they are trained on unrepresentative ground truths. We thus aim to show a difference in attentional allocation between human saliency maps of typical and UAV sequences. To do so, we define a set of handcrafted features to represent UAV human saliency maps in a high-dimensional space and to discriminate them.
We then propose to dig further into the question to understand and identify visual attention biases in UAV videos. Our aim is to produce a dictionary of biases that can be fed as priors to static or dynamic saliency prediction models. This is a widespread solution to improve saliency models, and it can take several forms. For instance, in [23], the dynamic two-streamed network includes a handmade dictionary of 2D centered Gaussians which provides different versions of the center bias. To the best of our knowledge, this is the first time that a dictionary of biases is empirically extracted for UAV videos.
Our bias extraction follows a pipeline similar to a recent attempt to analyze saliency patterns in movie shots to explore dynamic storytelling in cinematography [24]. Their pipeline includes the definition of features (saliency maps predicted by Deep Gaze II [25]), a dimension reduction operation through a principal component analysis, and finally clustering using K-means.
Regarding the UAV ecosystem, deep learning is a mainstream tool for numerous applications, e.g., automatic navigation of UAVs [19,26,27], object [27] or vehicle tracking [28,29,30], and (moving) object detection [31,32] under real-time constraints [33]. Some works combine both object detection and tracking [34,35,36,37] or implement the automation of aerial reconnaissance tasks [38]. However, only a minority of works benefit from Regions of Interest (ROI) [27,39], sometimes in real time [40], which is a first step towards considering visual attention. We believe that saliency, and in particular a dictionary of biases, will enable the enhancement of current solutions, even in real-time applications.
In the remainder of the paper, we describe in Section 2 the materials and methodology. It includes a description of conventional and UAV video datasets in Section 2.1, as well as the handcrafted features extracted from human saliency and fixation maps to represent visual salience in a high-dimensional space in Section 2.2. We then verify the clear distinction between conventional and UAV videos saliency-wise using data analyses in Section 2.3. Interestingly, the T-distributed Stochastic Neighbor Embedding (t-SNE) [41] representation we use reveals differences within UAV saliency patterns. On this basis, biases in UAV videos are designed in Section 2.4. A benchmark is presented in Section 2.5 on two datasets—EyeTrackUAV1 and EyeTrackUAV2—to evaluate the relevance and efficiency of biases based on typical saliency metrics. Results are provided in Section 3 and discussed in Section 4. Finally, Section 5 concludes on the contributions of this work.

2. Materials and Methods

2.1. Datasets

Visual salience is an active field that tackles various issues such as segmentation [42], object recognition [43,44,45], object and person detection [46,47], salient object detection [48,49], tracking [50,51], compression [52,53], and retargeting [54]. Accordingly, numerous eye tracking datasets have been created for typical images [6,55,56,57,58]. To a lesser extent, it is possible to find datasets on videos [23,59,60,61]. Finally, datasets on specific imaging (e.g., UAV videos) are attracting growing attention and are being developed [20,22,62].
In this study, we consider only gaze information collected in free-viewing conditions. Observers can explore and freely appreciate the content.

2.1.1. Typical Videos

Among available datasets, DHF1K [23] is the largest one for dynamic visual saliency prediction. It is a perfect fit for developing saliency models as it includes 1000 (640 × 360) video sequences covering a wide range of scene variations. A 250 Hz eye tracking system recorded the gaze of 17 observers per stimulus under free-viewing conditions. We used the videos available in the training set (i.e., 700 videos), together with their fixation annotations. We computed the human saliency maps following the process described in [22] to obtain maps comparable with those of the UAV video dataset.

2.1.2. UAV Videos

This study investigates the 43 videos of EyeTrackUAV2 (30 fps, 1280 × 720 and 720 × 480) [22], the largest and latest public gaze dataset for UAV content. Gaze information was recorded at 1000 Hz from 30 participants watching content extracted from the datasets DTB70 [63], UAV123 [64], and VIRAT [65]. It presents a wide range of scenes, in terms of viewing angle and distance to the scene, among other factors. Besides, EyeTrackUAV2 was created to provide both free-viewing and surveillance-task conditions. It indeed contains content suited to object detection and tracking, as well as content with no salient object. We used fixation and human saliency maps generated from binocular data under free-viewing conditions, as recommended in the paper.
Last, in order to assess the external validity of the results, we include the EyeTrackUAV1 dataset [20] in the benchmark of biases. This dataset includes 19 sequences coming from UAV123 (1280 × 720, 30 fps). Gaze information was recorded from 14 subjects under free-viewing conditions at 1000 Hz. Overall, the dataset comprises eye tracking information on 26,599 frames, which represents 887 s of video.

2.2. Definition and Extraction of Hand-Crafted Features

Extracting features from gaze information has a two-fold role. First, it enables the representation of saliency maps in a high-dimensional space. Thanks to this representation, we expect to discriminate types of imaging and show the importance of developing content-wise solutions. Second, we also expect to find specific characteristics that would reveal discrepancies in gaze deployment between conventional and UAV content. Those specific characteristics could be used to approximate content-wise biases; note that in that case only features that can parameterize biases will be kept. We extract statistics for each frame based on the four representations described below.
  • Human Saliency Maps (HSMs)
  • Visual fixation features
  • Marginal Distributions (MDs) of visual saliency
  • 2D K-means and Gaussian Mixture Models (GMM) of fixations

2.2.1. Human Saliency Maps

Saliency maps are 2D topographic representations highlighting areas in the scene that attract one's attention. HSMs result from the convolution of the observers' fixations with a Gaussian kernel representing the foveal part of our retina [66]. In the following, a human saliency map at time t is defined as $I_t : \Omega \subset \mathbb{R}^2 \rightarrow \mathbb{R}^+$, where $\Omega = [1 \dots N] \times [1 \dots M]$ with N and M the resolution of the input. Moreover, let $p_i$ be the discrete probability of the pixel $i = (x, y)$ to be salient. The probability $p_i$ is then defined as follows, such that $0 \leq p_i \leq 1$ and $\sum_{i \in \Omega} p_i = 1$:
$p_i = \frac{I_t(i)}{\sum_{j \in \Omega} I_t(j)}$ (1)
From human saliency maps, we extract overall spatial complexity features and short-term temporal features as described below.
Energy: The energy of a pixel is the sum of the vertical and horizontal gradient absolute magnitudes. A Sobel filter of kernel size 5 is used as derivative operator. We define as features the average and standard deviation (std) of $E_t$ over all pixels of a frame.
$E_t(i) = \left| \frac{\partial I_t(i)}{\partial x} \right| + \left| \frac{\partial I_t(i)}{\partial y} \right|$ (2)
High energy mean would indicate several salient regions or shape-wise complex areas of interest, whereas low energy would indicate more simple-shaped or single gaze locations.
Entropy: Shannon entropy is then defined as follows.
$s_E = -\sum_{i \in \Omega} p_i \log(p_i)$ (3)
A high value of entropy would mean that the saliency map contains a lot of information, i.e., it is likely that saliency is complex. A low entropy indicates a single zone of salience.
Short-term temporal gradient: We used a temporal gradient to characterise the difference over time in visual attention during a visualization. This gradient is computed as follows, with $G_{t_{max}}(i) = 0$,
$G_t(i) = \left| I_t(i) - I_{t+1}(i) \right|$ (4)
A large gradient indicates a large change between $I_t$ and $I_{t+1}$.
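The three descriptors above can be computed directly from the saliency maps. Below is a minimal sketch in Python, assuming maps are available as 2D float arrays and that OpenCV and NumPy are used; function and variable names are illustrative, not the original implementation.

import cv2
import numpy as np

def hsm_features(I_t, I_next=None):
    """Energy, entropy and short-term temporal gradient of a human saliency map."""
    # Energy: sum of absolute horizontal and vertical Sobel gradients (kernel size 5)
    gx = cv2.Sobel(I_t, cv2.CV_64F, 1, 0, ksize=5)
    gy = cv2.Sobel(I_t, cv2.CV_64F, 0, 1, ksize=5)
    energy = np.abs(gx) + np.abs(gy)

    # Normalize the map into a discrete probability distribution p_i (Equation (1))
    p = I_t / (I_t.sum() + 1e-12)
    # Shannon entropy (Equation (3)), ignoring zero-probability pixels
    entropy = -np.sum(p[p > 0] * np.log(p[p > 0]))

    # Short-term temporal gradient (Equation (4)); zero for the last frame of a sequence
    grad = np.abs(I_t - I_next) if I_next is not None else np.zeros_like(I_t)

    return energy.mean(), energy.std(), entropy, grad.mean()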

2.2.2. Visual Fixation Features

We retrieve fixations from eye positions using a two-step spatial and temporal thresholding, the Dispersion-Threshold Identification (I-DT) algorithm [20,67]. The spatial and temporal thresholds were set to 0.7 degrees of visual angle and 80 ms, respectively, according to [68].
The fixation number can be representative of the number of salient objects in the scene. Their position may also indicate congruence between subjects [11,69]. We thus include the number of fixations per frame as features as well as the number of clusters derived from fixation positions.
The number of clusters is computed through two off-the-shelf clustering techniques [70,71], namely the Density-Based Spatial Clustering of Applications with Noise (DBSCAN) [72] and its hierarchical extension (HDBSCAN) [73]. Regarding settings, we set the minimum number of points per cluster to 2. Other parameters were left to their default values in the Python libraries (hdbscan and sklearn), such as the use of the Euclidean distance.
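A minimal Python sketch of the cluster-count features, using the sklearn and hdbscan libraries mentioned above; the minimum cluster size of 2 matches the setting described here, while everything else (including the DBSCAN neighborhood radius) is left to library defaults and is therefore only illustrative.

import numpy as np
from sklearn.cluster import DBSCAN
import hdbscan

def fixation_cluster_counts(fixations):
    """Number of fixations and of fixation clusters (DBSCAN / HDBSCAN) for one frame.
    fixations: (n, 2) array of fixation positions."""
    db_labels = DBSCAN(min_samples=2).fit_predict(fixations)
    hdb_labels = hdbscan.HDBSCAN(min_cluster_size=2).fit_predict(fixations)
    # Label -1 marks noise points and is not counted as a cluster
    n_db = len(set(db_labels) - {-1})
    n_hdb = len(set(hdb_labels) - {-1})
    return len(fixations), n_db, n_hdb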

2.2.3. Marginal Distributions

A marginal distribution describes a part of a multi-variable distribution. In our context, MDs represent vertical and horizontal salience, i.e., the sum of relative saliency of pixels over columns ($p_v$) and rows ($p_h$), respectively. MDs are computed as follows.
$p_h(x) = \sum_{y \in \{1 \dots M\}} p_{(x,y)}, \qquad p_v(y) = \sum_{x \in \{1 \dots N\}} p_{(x,y)}$ (5)
Several illustrations of horizontal and vertical marginal distributions of UAV HSMs are provided in Figure 1 to qualitatively support the claim that the center bias does not necessarily apply to UAV content. Accordingly, several descriptors of marginal distributions bring knowledge on horizontal and vertical saliency, which enables us to study whether the center bias applies to UAV videos. Moreover, these parameters may be of prime importance when designing UAV parametric biases based on experimental data. From horizontal and vertical MDs, we extract the following features.
Moments: empirical mean, variance, skewness, and kurtosis. Moments characterise the shape of a distribution. According to Fisher and Pearson's works [74], we express the $k^{th}$ moment as follows,
$\mu_k = \frac{1}{n} \sum_{x=0}^{n} \left( p_{md}(x) - \bar{p} \right)^k$ (6)
where $p_{md} \in \{p_h, p_v\}$ is the marginal distribution, $n \in \{N, M\}$ is the number of samples in $p_{md}$, and $\bar{p}$ is the empirical mean of the distribution.
Median and geometric mean: For a Gaussian distribution, the mean is equal or close to the geometric mean and the median. These two values thus help further characterize marginal distributions, as well as indicate how far marginal distributions are from a single 1D Gaussian distribution.
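The marginal-distribution features can be sketched as follows with NumPy and SciPy. The helper reads Equation (6) literally, i.e., it treats the values of each marginal distribution as samples; this reading, the axis convention, and the small offset used for the geometric mean are our assumptions.

import numpy as np
from scipy import stats

def md_features(p):
    """Features of both marginal distributions of a normalized saliency map p
    (p.sum() == 1): mean, variance, skewness, kurtosis, median, geometric mean."""
    feats = {}
    for name, md in (("rows", p.sum(axis=1)), ("columns", p.sum(axis=0))):
        feats[name] = (
            md.mean(),                # empirical mean
            md.var(),                 # variance (2nd central moment)
            stats.skew(md),           # skewness (3rd standardized moment)
            stats.kurtosis(md),       # kurtosis (4th standardized moment)
            np.median(md),            # median
            stats.gmean(md + 1e-12),  # geometric mean (offset avoids zeros)
        )
    return feats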

2.2.4. 2D K-Means and Gaussian Mixture Models

Single and multiple clusters are formed from fixation positions. We use K-means and GMMs to approximate the features of these groups. On the one hand, a single cluster brings knowledge about general behavior. On the other hand, having several batches allows for a more precise depiction of the data. Besides, the group with the largest variance is particularly interesting for expressing saliency, as opposed to a single cluster. We thus ran both algorithms to approximate a single cluster as well as multiple classes. In the latter scenario, we characterize only the most representative batch.
K-means [75] is the most used clustering algorithm in machine learning due to its accuracy, simplicity, and low complexity. We define as a feature the center position of a cluster. Two features are thus extracted, one from a single batch, the second from the most representative group—in terms of fixation count—of a multi-cluster run. Regarding implementation details, the HDBSCAN algorithm gives the number of classes, and all runs follow a random initialization.
Gaussian Mixture Models: Similarly to K-means, we consider a single Gaussian and the Gaussian in the GMM that accounts for the highest variance. Extracted features are their means, covariance matrices, and variance (i.e., the extent of fixation points covered by the Gaussian).
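A minimal scikit-learn sketch of the single-cluster and dominant-cluster features; k is the number of classes returned by HDBSCAN as described above, and taking the component with the largest weight as the most representative Gaussian is one possible reading, so the names and choices here are illustrative.

import numpy as np
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture

def cluster_features(fixations, k):
    """Single-cluster and dominant-cluster features from fixation positions.
    fixations: (n, 2) array; k: number of classes given by HDBSCAN."""
    k = max(k, 1)

    # K-means with random initialization: single center and center of the largest group
    center_single = KMeans(n_clusters=1, init="random", n_init=10).fit(fixations).cluster_centers_[0]
    km = KMeans(n_clusters=k, init="random", n_init=10).fit(fixations)
    labels, counts = np.unique(km.labels_, return_counts=True)
    center_main = km.cluster_centers_[labels[np.argmax(counts)]]

    # GMM: single Gaussian, and the dominant component of a k-component mixture
    g1 = GaussianMixture(n_components=1).fit(fixations)
    gk = GaussianMixture(n_components=k).fit(fixations)
    main = int(np.argmax(gk.weights_))

    return {
        "kmeans_single_center": center_single,
        "kmeans_main_center": center_main,
        "gmm_single": (g1.means_[0], g1.covariances_[0]),
        "gmm_main": (gk.means_[main], gk.covariances_[main], gk.weights_[main]),
    }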

2.2.5. Overall Characteristics

We extract 38 characteristics per frame. Table 1 summarizes the features describing human saliency maps. When considering an entire sequence, the means and standard deviations of these components are extracted temporally, i.e., over the entire set of frames. This gives 76 features per video, independently of the number of frames in a sequence. As a side note, all features relative to a position have been normalized to be independent of dataset resolutions.

2.3. Visualization and Classifications

In this section, we want to verify whether attention in typical and UAV content exhibits different patterns. We already know that some sequences of EyeTrackUAV2, mainly originating from the VIRAT dataset, are likely to present a center bias and thus have HSMs similar to those of conventional videos. On the contrary, we expect other sequences, originating from DTB70 and UAV123, to present HSMs with different biases, as claimed in [21,22].
Before exploring and generating biases, we visualize data in a low-dimensional space, thanks to the t-SNE technique [41]. This process maps elements into a 2D space according to the distribution of distances in the high-dimensional space. It is a common technique to classify and discriminate elements [76,77]. The t-SNE is advantageous for two reasons: it enables visualization, and most importantly, it comprehensively considers all dimensions in its representation. In our context, the t-SNE shows distance differences between saliency maps based on extracted features and discriminates content-based peculiarities.
If the t-SNE algorithm clusters typical HSMs separately from UAV HSMs, per sequence and frame, this attests that there are different visual attention behaviors in UAV imaging. Otherwise, there is no point in performing biases identification. This is a crucial step towards content-wise biases creation as we first verify there are distinctions between conventional and UAV HSMs and thus justify the existence of such biases. Second, we can determine clusters in which saliency characteristics are similar and learn biases based on the t-SNE 2D distance measure.

2.3.1. Discrimination between Salience in Typical and UAV Content

As recommended in [78], we paid close attention to the hyperparameter selection of the t-SNE method. We therefore tested several perplexity values (namely 2, 5, 10, 15, 20, 30, and 50) for each t-SNE run. To stay in line with previous settings, the number of components is 2 and distances are Euclidean. Each t-SNE run starts with a random initialization, runs with a learning rate of 200, and stops after 1000 iterations.
The process was applied on all UAV content of EyeTrackUAV2 and the first 400 videos of the DHF1K dataset. This represents an overall number of 275,031 frames, 232,790 for typical and 42,241 for UAV content. We considered all features when dealing with frames, while we computed the temporal average and std of feature values when dealing with sequences. In addition, we ran the t-SNE on Marginal Distribution (MD) parameters only, to verify whether they can serve as a separability criterion. If so, they can help to reconstruct biases.
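A minimal sketch of one t-SNE run with scikit-learn, using the settings stated above; the feature matrix X (one row per frame or per sequence) is assumed to have been built beforehand, and 1000 iterations is the library default.

from sklearn.manifold import TSNE

def embed(X, perplexity):
    """2D t-SNE embedding of the feature matrix X (Euclidean distance, random
    initialization, learning rate 200, default 1000 iterations)."""
    tsne = TSNE(n_components=2, perplexity=perplexity, metric="euclidean",
                init="random", learning_rate=200)
    return tsne.fit_transform(X)

# One embedding per tested perplexity value
# embeddings = {p: embed(X, p) for p in (2, 5, 10, 15, 20, 30, 50)}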

Results

All results follow similar tendencies over all tested parameters. They are described and illustrated in Figure 2 with perplexity 30. In Figure 2a, we added thumbnails of the first frame of each sequence for readability. Colors were attributed to the points, or to the thumbnails, depending on the dataset the elements originate from. In Figure 2a, we can observe the spread of HSMs in the space, per sequence and considering all features. We note that UAV videos (colored) are located moderately apart from conventional videos, especially those coming from UAV123 (red) and DTB70 (green). Furthermore, most UAV VIRAT videos (blue) are closer to typical sequences than other UAV content. These sequences possibly share saliency characteristics with certain conventional videos. This supports previous claims that the majority of VIRAT HSMs present a center bias. It is also striking that traditional videos are separated into two large clusters. This could be further explored, though not in this paper, as it falls outside the scope of our study.
Figure 2b depicts how HSMs spread spatially for each frame, based on all extracted features. We can observe a huge set of close clusters toward the center of the space, containing mainly DHF1K frame representations, with a fair proportion of VIRAT points and some UAV123 maps. At the borders, we can observe a lot of small batches, formed mainly by UAV contents. These clusters are more distant to their neighbors than those in the center of the space. That is, it makes sense to study various biases in UAV videos as there are numerous batches of UAV content representations clearly distinct from other groups.
Figure 2c presents the dispersion of HSMs, per frame and based on marginal features only. As for the previous observations (see Figure 2b), we observe a distinction between representations of typical and UAV content, with UAV clusters at the borders and more distant from other batches. One difference is that we find more VIRAT points within the central groups.
Based on the three illustrations, and considering the generalization of these results over other values of perplexity and features sets, we can proceed with the identification of biases in UAV content. It is also reasonable to rely only on marginal distributions characteristics to generate these biases.

2.3.2. Clustering of UAV Sequences

In this section, the focus is set on UAV sequences only as we aim to derive the attentional biases. In other words, from t-SNE representations, we aim to identify clusters of different gaze deployment habits in UAV visualizations. Initially, biases are regarded as generic behaviors that are found consistently in visual attention responses. Here, we specifically want to extract prevailing and UAV-specific biases.
In pilot tests, we tried generating a dictionary of biases via a systematic variation of the parameters of an anisotropic 2D Gaussian distribution. We also created a dictionary with the human means of the 43 sequences of EyeTrackUAV2. Results, comprising the usual similarity metrics of the saliency field applied to biases and HSMs, showed weak performance. These studies emphasized the need to derive salience patterns from HSMs directly. Additionally, the parameters on which we cluster HSMs must be usable for the construction of parametric biases. This excludes some of the previous handcrafted features, such as statistics on HSMs and fixation features, which are hardly useful for generating parametric biases. Besides, we want biases free from any constraints, which excludes 2D K-means and GMM predictions that require prior knowledge. Thus, we focus on marginal distribution features to classify UAV sequences. Moreover, the previous section has emphasized their relevance.
We thus performed a t-SNE with a low perplexity to derive our clusters, based only on the MD parameters of EyeTrackUAV2 sequences. Perplexity can be seen as a loose measure of the effective number of neighbors [41]. We thus selected a perplexity of 5 to obtain about eight clusters. It is a compromise between being content-specific and deriving meaningful, prominent patterns. We also ran several iterations of t-SNE, with perplexities 2, 10, and 20, to ensure the robustness of the obtained classification.
In line with previously defined settings, we relied on the 2D Euclidean distance in the t-SNE representation to compute the similarity between samples. We computed a confusion matrix based on a hierarchical clustering algorithm (dendrogram) implemented with the Ward criterion [79,80]. The Ward variance minimization algorithm relies on the similarity between cluster centroids. Hierarchical clustering is highly beneficial in our context as it is less constrained than K-means. There is no need for prior knowledge, such as the number of classes, the number of elements in groups, or even whether batches are balanced.
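A minimal SciPy sketch of this clustering step, applied to the 2D t-SNE coordinates of the sequences; the threshold of 150 is the Ward dissimilarity used in the next subsection, and names are illustrative.

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist, squareform

def ward_clusters(tsne_coords, threshold=150.0):
    """Hierarchical (Ward) clustering of sequences from their 2D t-SNE coordinates."""
    # Linkage matrix built with the Ward variance-minimization criterion
    Z = linkage(tsne_coords, method="ward")
    # Sequences whose Ward dissimilarity stays under the threshold form one cluster
    labels = fcluster(Z, t=threshold, criterion="distance")
    # Pairwise Euclidean distances in the t-SNE space, e.g., for a distance matrix plot
    dist_matrix = squareform(pdist(tsne_coords))
    return labels, Z, dist_matrix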

Results

Following the tree architecture, sequences showing a Ward dissimilarity under 150 form clusters. We ultimately determine seven classes from which to learn content-wise biases. Results are compiled in Figure 3 and Table 2. Figure 3a presents the distance matrix between sequence HSMs. Figure 3b shows the t-SNE spatial dispersion of sequences and outlines the seven groups. Human Mean (HM) thumbnails allow verifying the similarity between sequence HSMs within clusters.
Overall, most classes show homogeneous human means. For instance, in group II, there is a recurrent vertical and thin salient pattern. The most heterogeneous batch is cluster IV, for which the averaging process used to generate HMs may not be an optimal temporal representation. Finally, Table 2 describes each cluster, its number of sequences, and the number of frames it comprises. To compensate for the hardly readable axes in Figure 3a, Table 2 follows the order of sequences used to form the confusion matrix.

2.4. Extrapolation of Biases

This section introduces the process designed to generate parametric saliency biases for UAV imaging. First, we need to define which properties of HSM ground truth will be exploited to parameterize and generate biases. From the set of such features, we need to derive generic and prominent patterns. The entire process is described below and is summarized in Figure 4. At this point, sequences have been clustered based on their saliency marginal distribution features.

2.4.1. Extraction of MDs Statistics

The idea of extracting biases is rooted in the prevalence of the center bias in the salience of generic audiovisual stimuli. The center bias is often represented by a centered isotropic Gaussian stretched to the video frame aspect ratio [6,7]. Here, we drop the isotropic and centered constraints on the Gaussian. Moreover, the stretching may not apply to UAV content, which, for instance, often presents smaller objects. Under these considerations, biases are parameterized by a set of two mean and standard deviation parameters, representing the Gaussian center and spread, respectively. Such 1D Gaussian parameters are derived from the MDs of the ground-truth human saliency maps. That is, we compute the mean and std values of the marginal distributions, horizontally and vertically, for each HSM frame of each sequence, i.e., $\mu_r$, $\sigma_r$, $\mu_c$, and $\sigma_c$, respectively. We will refer to these values as MD statistics, to distinguish them from MD features; a minimal computation sketch is given after the list below.
Doing so, we assume that MDs follow 1D-Gaussian distributions horizontally and vertically. As we can see in Figure 1, this claim is reasonable if there is a behavior similar to single-object tracking (i.e., Figure 1a,b). However, it is more questionable when there is no object of interest (i.e., Figure 1c). Still, we believe this choice is sound for the following reasons.
  • Averaging, and to a lesser extent taking the standard deviation, over the HSM columns and rows acts as a filter on unrepresentative behaviors.
  • Several areas of interest are seldom observed within a single HSM, and when this happens, the congruence between observers is often quite low. For instance, such events may occur when there is a shift of attention from one object to another. Biases do not aim to capture such behaviors. Besides, if there is no object of interest, we expect some biases to present center-bias-like patterns to deal with it.
  • We plan to use biases as a dictionary, and in particular to combine the derived salience patterns. This should compensate for some of the inaccuracies introduced during this process.
  • Last but not least, we are dealing with video saliency, i.e., a frame visualization lasts about 0.3 s. Assuming a 1D Gaussian marginal distribution for an HSM, we make a rather small error. However, pooling these errors could prove significant. Accordingly, the significance of the error depends more on the strategy for extracting bias parameters from MD statistics. We discuss this issue in the following section.
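As mentioned before the list, a minimal sketch of the per-frame MD statistics follows. It reads the mean and std as the position-weighted mean and standard deviation of each marginal distribution; this weighted reading is our assumption, consistent with the pixel-coordinate peak values reported in the next section.

import numpy as np

def md_statistics(hsm):
    """Per-frame MD statistics (mu_r, sigma_r, mu_c, sigma_c) of a saliency map,
    assuming each marginal distribution is close to a 1D Gaussian."""
    p = hsm / (hsm.sum() + 1e-12)   # normalized saliency map
    values = []
    for axis in (1, 0):             # axis=1 sums over columns (row marginal), then rows
        md = p.sum(axis=axis)
        md = md / (md.sum() + 1e-12)
        pos = np.arange(md.size)
        mu = np.sum(pos * md)                           # position-weighted mean
        sigma = np.sqrt(np.sum(md * (pos - mu) ** 2))   # position-weighted std
        values += [mu, sigma]
    return values                   # [mu_r, sigma_r, mu_c, sigma_c]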

2.4.2. Local Maxima of MDs Statistics Distributions

The identification of bias parameters from MD statistics is critical as it could introduce or exacerbate errors in prior predictions. To study patterns in MDs, the stress is laid on clustering, which serves the examination of prominent patterns within saliency-wise similar content. Consequently, the investigation of bias patterns relies on the distributions of MD statistics over each cluster. Figure 5 presents the obtained distributions, named MDS-Ds.
It is clear that usual statistics over MDS-Ds, such as the mean, will not make meaningful, prevailing, and accurate bias parameters. The notions of preponderance, precision, and significance directly call for the computation of local maxima in the distributions. Local maxima, also referred to as peaks, embed the aspect of likelihood, can be computed precisely, and give a value that is directly useful for bias generation.
To conduct a robust computation of local maxima, we estimate the probability density function of MDS-Ds using a Gaussian kernel density estimation (KDE). Extrapolating the distribution has two main advantages. First, we conduct a bias parameter extraction based on likelihood, which particularly fits the aim of biases. Second, probability density functions are less prone to noise, which could have dramatic effects on extrema extraction, especially the direct comparison of neighboring values.
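A minimal sketch of this peak-extraction step with SciPy; the evaluation grid and the use of scipy.signal.find_peaks are illustrative choices rather than the original implementation.

import numpy as np
from scipy.stats import gaussian_kde
from scipy.signal import find_peaks

def local_maxima(samples, grid_size=1000):
    """Local maxima of the estimated density of one MDS-D
    (e.g., all per-frame values of mu_c within a cluster)."""
    kde = gaussian_kde(samples)                   # Gaussian kernel density estimation
    grid = np.linspace(samples.min(), samples.max(), grid_size)
    density = kde(grid)
    peaks, _ = find_peaks(density)                # indices of local maxima
    return grid[peaks]                            # candidate bias parameters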
We found it important to include all local maxima, even if it means including less relevant future bias parameters. Besides, there is no strong basis for setting a threshold on the meaningfulness of a peak. For instance, $II_{\sigma_c}$ shows a peak at 241, and similarly $III_{\mu_c}$ presents a maximum at 880, which would have been rejected with a strictly set threshold despite their likely importance. Accordingly, a filtering step after bias creation is needed to select the most relevant biases (see Section 3.1).
In order to create a comprehensive and exhaustive bank of biases, all possible combinations of MDS-D parameters are considered in the process. It means that in cluster I, there will be $\#\mu_r \times \#\sigma_r \times \#\mu_c \times \#\sigma_c = 3 \times 2 \times 3 \times 2 = 36$ patterns. Table 3 reports the values of the peaks extracted from the distributions. It also indicates how many biases were generated per cluster.

2.4.3. Reconstruct Parametric 2D-Anisotropic Gaussian Biases

To construct a 2D anisotropic Gaussian pattern, we need four parameters: the coordinates of the center of the Gaussian, and the horizontal and vertical standard deviations. A bias is thus defined by the 2D Gaussian of horizontal mean $\mu_c$ and std $\sigma_c$, and vertical mean $\mu_r$ and std $\sigma_r$. It forms a 1280 × 720 image resulting from the outer product of the 1D Gaussian distributions formed horizontally and vertically (see Equations (7) and (8)). Bicubic interpolation is used to resize biases when dealing with VIRAT sequences, which have a smaller resolution.
$Bias(x, y, \mu_r, \sigma_r, \mu_c, \sigma_c) = G(x, \mu_r, \sigma_r)^T \, G(y, \mu_c, \sigma_c)$ (7)
with $G(x, \mu, \sigma) = \frac{1}{\sigma \sqrt{2\pi}} \exp\left( -\frac{(x - \mu)^2}{2 \sigma^2} \right)$ (8)
where $x = 1 \dots 720$, $y = 1 \dots 1280$, and $T$ is the transpose operator. The patterns are computed for each cluster and for each combination of bias parameters. The dictionary reaches an overall size of 296 patterns, a subset of which is presented in Figure 6.
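A minimal NumPy sketch of Equations (7) and (8); the final normalization to [0, 1] is an assumption for display and comparison purposes, not part of the definition above.

import numpy as np

def gaussian_1d(x, mu, sigma):
    """Equation (8): 1D Gaussian density."""
    return np.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * np.sqrt(2 * np.pi))

def make_bias(mu_r, sigma_r, mu_c, sigma_c, height=720, width=1280):
    """Equation (7): 2D anisotropic Gaussian bias, the outer product of a vertical
    (rows) and a horizontal (columns) 1D Gaussian."""
    g_rows = gaussian_1d(np.arange(1, height + 1), mu_r, sigma_r)
    g_cols = gaussian_1d(np.arange(1, width + 1), mu_c, sigma_c)
    bias = np.outer(g_rows, g_cols)   # (720, 1280) saliency pattern
    return bias / bias.max()          # normalization (assumed, for visualization)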

2.5. Biases Analyses

We conduct several analyses to verify that the extracted biases are meaningful and representative of the human deployment of visual attention in UAV videos. First, we examine the relative similarity between biases and the EyeTrackUAV2 dataset, overall and per cluster. Results sort biases by efficiency and lead to the selection of a reduced and reliable dictionary. A qualitative analysis of the obtained dictionary follows.
Then, we aim to situate the efficiency of biases when compared to baselines and hand-crafted features saliency models. Additionally, we verify the external validity of the performance of extracted biases on EyeTrackUAV1. We thus conduct the benchmark on both EyeTrackUAV1 and EyeTrackUAV2.
Finally, we explore an additional function of the dictionary: improving static handcrafted-features saliency models by filtering their results with biases. The results of these analyses will clarify the potential uses of the dictionary.

2.5.1. Benchmark Metrics

All along this analysis, we employ typical saliency metrics. We decided to exclude metrics involving fixations because fixation extraction processes may interfere with the results; we want to stay as free as possible from a potential dataset bias. Accordingly, we included three state-of-the-art saliency metrics recommended in the MIT benchmark [66,81]: Pearson's Correlation Coefficient (CC), Similarity (SIM), and the Kullback-Leibler divergence (KL).
CC: The range of the correlation metric goes from −1 to 1, 1 representing a perfect fit between the data.
SIM: Similarity measures a perfect fit between the data histograms as 1, and no similarity as 0.
KL: This metric gives a dissimilarity score, emphasising the error made during the prediction. It thus favors patterns with large variance, which reduce the amount of error made. The score ranges from 0 to infinity (+∞); the lower the score, the better.
Each metric is computed on every frame. Frame scores are then averaged over sequences, clusters, or the entire dataset. Although not optimal regarding temporal considerations, it is the most widely used practice to evaluate dynamic salience [23,82,83,84,85,86].
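For reference, a minimal sketch of the three distribution-based metrics as they are commonly implemented for the MIT benchmark; maps are assumed to be non-negative 2D arrays, and the small epsilon is an implementation detail, not part of the metric definitions.

import numpy as np

EPS = 1e-12

def cc(pred, gt):
    """Pearson's correlation coefficient between two saliency maps."""
    p = (pred - pred.mean()) / (pred.std() + EPS)
    g = (gt - gt.mean()) / (gt.std() + EPS)
    return np.mean(p * g)

def sim(pred, gt):
    """Similarity: sum of the minimum of the two normalized distributions."""
    p = pred / (pred.sum() + EPS)
    g = gt / (gt.sum() + EPS)
    return np.minimum(p, g).sum()

def kl(pred, gt):
    """Kullback-Leibler divergence of the prediction from the ground truth."""
    p = pred / (pred.sum() + EPS)
    g = gt / (gt.sum() + EPS)
    return np.sum(g * np.log(g / (p + EPS) + EPS))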

2.5.2. Benchmark Multi-Comparisons

Biases aim to perform better than a conventional Center Bias (CB) and would ideally challenge the prediction accuracy of handcrafted-features saliency models (HCs). This section introduces the CB, HM, and HC stimuli compared with biases in the benchmark.
The center bias is a centered isotropic Gaussian stretched to the video frame aspect ratio [7]. CB is a popular prior used in typical imaging. It thus sets a perfect baseline against which to measure biases.
Human means are the temporally averaged saliency maps over an entire sequence. This gives 19 HMs for EyeTrackUAV1 and 43 for EyeTrackUAV2.
Handcrafted-features saliency models: Based on the work in [21], the selected HCs cover the range of prediction accuracy of the most typical hand-crafted-features saliency models, BMS [87], GBVS [88], and RARE2012 [89] being the most predictive models, and SIM [90] and SUN [91] the least.

2.5.3. Handcrafted Features Saliency Models Filtered with Biases

Filtering HCs with biases provides new insights into the predictive power of biases and the information they bring. To sustain a low-complexity constraint, we define the filtering operation as the Hadamard product between the bias and the prediction map of an HC (see Equation (9)).
$FilteredHC = HC \odot Bias$ (9)
Moreover, we measure the efficiency of this process through the gain, which is the difference between the score of the filtered HC and that of the original HC or bias.
$Gain_X = score(FilteredHC) - score(X)$ (10)
where $X \in \{HC, Bias\}$ and score is the outcome of the CC, SIM, or KL measure.
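A minimal sketch of Equations (9) and (10); hc_map, bias, and gt are 2D arrays of the same resolution, and the metric functions are those sketched in the previous subsection (illustrative names).

import numpy as np

def filter_hc(hc_map, bias):
    """Equation (9): Hadamard (element-wise) product of an HC prediction and a bias."""
    return hc_map * bias

def gain(filtered_score, reference_score):
    """Equation (10): gain of the filtered HC over the original HC or the bias alone."""
    return filtered_score - reference_score

# Example with the CC metric sketched above:
# filtered = filter_hc(hc_map, bias)
# gain_hc = gain(cc(filtered, gt), cc(hc_map, gt))
# gain_bias = gain(cc(filtered, gt), cc(bias, gt))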

3. Results

3.1. A First Quantitative Analysis: Selection of Biases

In this section, we sort biases per cluster to select the best-performing ones and build a reliable dictionary. We selected three of the most predictive patterns per cluster, under the constraints that they outperform the CB and that they are dissimilar enough. Efficiency was considered for the three metrics simultaneously, favoring biases achieving the best results for all three of them. Sorting reduces the number of saliency patterns in the dictionary to 3 × 7. From now on, we call biases the 21 patterns selected to form the dictionary.
Table 4 reports the parameters of the biases and their respective metric results on EyeTrackUAV2, overall and per cluster. We highlight the results that stand out from the others, per cluster.
All biases, without exception, perform best on their own cluster. This confirms the interest of using these biases as content-based priors and is highly promising for the future use of the dictionary of biases in saliency studies.
Regarding the expectations we had for biases, the patterns of cluster IV show low prediction scores. They are always below our expectations, though higher than the CB. We went through the content to understand this behavior. It turns out that the videos of class IV present a lot of movement, and meteorological conditions interfere with two sequences. A single image, such as a bias, cannot capture patterns related to extreme camera and object movements or other impacting conditions.
On the contrary, biases from clusters V, VI, and VII exceed our expectations. These classes’ biases go beyond the purpose set for them. In addition to outperforming the CB per cluster, they are more reliable overall.
Clusters I, II, and III reach the expected efficiency, with a special mention for III, which exceeds our expectations for the SIM and KL metrics.
Regarding overall results, some biases do not show a better predictive power than the CB. On the other hand, as already mentioned, all saliency patterns outperform the CB cluster-wise. This confirms that the clustering and bias generation processes designed to derive content-specific behavior are sound. This is also supported by the fact that content from classes II, III, and IV is hardly predicted by biases other than their own.
To further discuss our clustering strategy, the results hint at pooling classes V, VI, and VII. This would be reasonable because the three groups have the same parent in the hierarchical dendrogram established in Section 2.3.2. However, combining all sequences would have an impact on the diversity of the patterns in the dictionary.

3.2. A Qualitative Analysis

Figure 6 illustrates the selected saliency patterns. Most of the extracted biases of the last three clusters, especially VI and VII, can be seen as variations of a center bias more in line with UAV content peculiarities. For instance, objects are smaller and may accordingly necessitate precise saliency areas. This is reflected in salience patterns with centered 2D Gaussians with a low std in at least one direction.
More than half of the biases present a 2D Gaussian with a high longitudinal std, which resembles human behavior towards 360° content [92]. Observers may treat high-altitude UAV content as they would omnidirectional content.
In clusters II and III, and to a lesser extent in class IV, saliency is specifically located at the top of the content. We relate this to the UAV sequences themselves: the position of objects of interest, here in the upper part of the content, strongly influences visual attention. An upward camera movement may strengthen this effect. However, this latter factor alone seems less significant, as other videos with such camera motion show distinct patterns (such as truck3).
Note that bias II-3 is highly representative of the videos' peculiarities. Indeed, we can observe a vertical Gaussian accounting for the vertical road depicted in most videos of this cluster.
Finally, in classes II, V, VI, and VII, and to a lesser extent in III, biases are centered width-wise. Moreover, at least one bias per cluster presents this characteristic. To the best of our knowledge, it is the first general behavior exhibited by analyses of gaze deployment in UAV imaging.

3.3. Biases Benchmark

In this section, we aim to situate the efficiency of biases when compared to baselines and handcrafted features saliency models. We also provide an evaluation of the true power of prediction of biases, both on their own and coupled with saliency prediction models. Additionally, we verify the external validity of the performance of extracted biases on EyeTrackUAV1. We thus conduct the benchmark on both EyeTrackUAV1 and EyeTrackUAV2.

3.3.1. Results on EyeTrackUAV2

CB, HMs, and HCs

The results obtained by biases are presented in Table 4, and those of the baselines and HCs in Table 5. Biases not only compete with the CB but also surpass it, per cluster. Some of them exhibit a center-bias-like pattern better suited to UAV peculiarities. The biases extracted from clusters V, VI, and VII present less dispersed 2D Gaussians. These patterns seem to cope with the high distance between the UAV and the scene, i.e., the relatively small size of objects of interest and a longitudinal exploration of content.
HMs are particularly unsuitable for clusters II and IV. Moreover, the struggle to predict the saliency of these two classes is a revealing recurrent issue, stressing the importance of biases. HM predictions are low for I and III and mild for the last three classes. For obvious reasons, the HMs of VIRAT, showing center-bias-like maps, achieve the highest scores. When compared to HMs, biases III-1, IV-1, V, VI, and VII show superior overall prediction efficiency. Cluster-wise, biases achieve a better score, including on cluster IV. Biases, with no exception, are more relevant for salience prediction than HMs, which is a fine achievement. Besides, no HM performs particularly well on its own cluster. One reason for this could be that HMs are too specific to their content to generalize over a cluster, let alone over an entire dataset.
Now that biases outperform baseline maps, we turn to static saliency models that use handcrafted features to make predictions. Such comparisons measure whether biases capture relevant behaviors that reach the efficiency of more elaborate and refined models.
Over the entire dataset, biases from clusters VI and VII, and to a lesser extent V and III, exceed the scores of HCs. On the contrary, biases extracted from classes I, II, and IV hardly compete. That makes sense, since the former clusters exhibit patterns similar to the center bias, which has a strong power of generalization over a dataset, while the latter ones extracted more content-related behaviors, which poorly express general salience. Cluster-wise, V, VI, and VII show better scores for biases. To a lesser extent, biases of I, II, and III outdo SIM and SUN. Only biases from IV have a prediction power far worse than HCs.
The takeaway message is that several priors have a high overall prediction power, able to compete with more complex HCs. Biases are efficient content-wise and surpass at least SIM and SUN. This implies they go beyond the envisioned baselines and carry relevant saliency information. It is thus highly interesting to use biases as low-complexity saliency predictions.

Handcrafted Models Filtered with Biases

Table 6 gives the obtained results. Overall, filtering exacerbates the results obtained before. Predictions on clusters V, VI, and VII are improved and present the highest gains. Biases I-3, III-all, and IV-1 are slightly less impactful filters. II-all and IV-3 only make mild improvements. Unsurprisingly, filtering with II-all and IV-3 does not perform well overall. Their performance on specific sequences, though, justifies their use. For instance, II-all presents the highest gains in SIM and CC for all HCs on the sequences DTB70 ManRunning2, UAV123 car9, and car11, among others.
This brings us to note that biases are particularly efficient on the sequences which constitute their cluster. Moreover, for at least one of these videos, the obtained gain is the highest when compared to other biases. Besides, an impressive outcome is that this improvement is also present on the other sequences.
The improved saliency results demonstrate that biases bring new saliency knowledge to HCs, which benefits the efficiency of predictions. The increase in accuracy is nevertheless rather low. Table 6 presents the overall gain per model, which is representative of the range obtained sequence-wise. Obviously, there are more benefits to using biases on less accurate models, such as SIM and SUN. Doing so achieves an improvement of about 0.1 in CC. It is less interesting to use filtering on better models, with an overall gain approaching 0.01 in CC for RARE2012 and BMS. This difference is more significant than expected. Finally, the filtered SUN does not reach the accuracy of the unfiltered BMS. Thus, the advantage of using biases as filters is not encouraging. This does not call into question the meaning of and interest in the dictionary, but it rules out its use as a bank of filters.
To summarize, filtering with biases restricts the salience spatially. Such a constraint is often beneficial. Overall, filtering exacerbates bias effects, given several good general biases (i.e., patterns from V, VI, and VII) and several specific patterns (i.e., I-all, II-all, and VI-3). Accuracy results are directly linked to the complexity of the models used: complex models show smaller improvements than less elaborate predictors. Under the conditions of this study, using the dictionary as a bank of filters does not seem optimal. Still, the results prove that biases bring new and advantageous saliency information on their own. This supports using the dictionary as a set of priors.

3.3.2. Results on EyeTrackUAV1

In order to verify the external validity of the bank of biases, we carry out the same study as above on another dataset, namely EyeTrackUAV1.

Biases, CB, HMs and HCs

Biases on EyeTrackUAV1 match our expectations overall and sequence-wise, especially when compared to the results of [21]. The results—presented in Table 7 and, for CC on sequences, in Table 8—confirm that biases from clusters III, V, VI, and VII are good generic predictors, while those from I, II, and IV are more content-centric. For instance, II-all are efficient for person18. II-3 is particularly efficient for car13, perhaps because it draws a typical road pattern. Biases from I are accurate predictors on car2, I-3 being particularly applicable to building5. Last, IV reaches high scores for boat6 and car8, among others. Hence, some sequences are better represented by specific biases. These outcomes confirm the previous analyses on biases and recall the importance of clustering. As a last point, KL continues to favor biases with high dispersion, namely IV-1, VI-1, and VII-3. Note that, based on its very poor effect on both datasets, IV-2 is removed from the dictionary.
CB is less predictive than at least one bias for each content. Moreover, biases V-2 and VII outdo the CB overall. This confirms the validity and interest of this dictionary.
The HMs of their respective sequences surpass biases, the CB, and on most occasions HCs. However, such performance is restricted to their own sequence. On other content, they barely compete with the best biases sequence-wise. There are only two exceptions: the HMs of car10 and person3. They present patterns highly concentrated towards the center of the content (similar to II-3 and VII-2).
HC results highly depend on the content. For instance, they exceed HMs and biases on boat6, car8, person14, person18, person20, person3, and truck1. Yet, they are less reliable than biases or HMs on bike3, boat8, car10, car13, and wakeboard10. We could not relate the behavior of HCs and biases on these contents to the annotations of the database.
Regarding EyeTrackUAV1 contents, person13 is very hard to predict for all tested maps. In a different vein, wakeboard10 is noticeable: its HM reaches a CC score of 0.84, while biases and HCs hardly reach CC scores of 0.3. Accordingly, additional patterns could be added in the future to deal with sequences similar to these two.
To sum up, biases show improvements for specific contents that are not well predicted by HCs. Overall, biases reach the set expectations, confirming the external validity of the dictionary.

Handcrafted Models Filtered with Biases

The scores obtained when filtering HCs with biases on EyeTrackUAV1 are reported in Table 9. The filters with the greatest dispersion—IV-1, VI-1, and VII-3—are the only ones to show a gain in KL for all HCs. This is counterbalanced by the SIM results, which report progress for all biases on SIM and SUN; only biases from II do not improve BMS, GBVS, and RARE2012. Based on CC scores, we can sort biases from most to least performing: VII, VI, IV-1, and V, followed by I-2 and I-3, III, and IV-3, and finally I-1 and the II biases.
Even if biases show a gain for at least two metrics, this gain is rather insignificant. Thus, the results confirm those obtained on EyeTrackUAV2: biases convey meaningful saliency information; however, combining them with HCs as defined here is not optimal.

4. Discussion

We would like to further discuss several points we dealt with throughout this paper.
First, we justified the assumption of a 1D Gaussian interpolation for marginal distributions. In the future, it could be interesting to learn different approximations of the distributions and see the variations occurring in the extracted biases. Several options can be implemented, such as learning a GMM or an alpha-stable Lévy distribution [93,94]. The latter distributions model skewed distributions, which under particular conditions reduce to a Gaussian. Similarly, the extraction of local maxima on the distributions of MD parameters could be replaced; the parameters of a GMM or of a mixture of alpha-stable Lévy distributions are good alternatives. However, the difference made by such sophisticated implementations is possibly insignificant, which explains our choice to keep this study technically sound and rather simple. Further improvements can be brought in the future.
We also could have investigated a local extrema extraction to compute MD statistics. However, statistics would have been irrelevant in the case of low congruence between subjects. Assuming a 1D-Gaussian is less of a compromise than this solution.
We would like to stress that metrics for dynamic saliency rely on the average over the sequence. Consequently, scores may not be representative of the power of prediction of every frame. We also used the averaging process prior to t-SNE to compute features for sequences. This strengthens the distinction observed between types of imagery and within UAV content. Future studies are needed to evaluate if other strategies, such as the median or a combination of means and std of the distribution of scores, are more pertinent.
Finally, regarding the study on HC filtering, other ways to improve HCs can be envisioned. For instance, a frame-level competition between HCs and biases may provide better prediction scores. We decided not to include such a study here so as not to obscure the true message of our paper: proving that we designed a relevant dictionary of biases.

5. Conclusions

Understanding people's visual behavior towards multimedia content has been the object of extensive research for decades. The extension of visual attention considerations to UAV videos is, however, underexplored, because UAV imagery is still in its infancy and temporal content adds complexity.
Previous studies identified that UAV videos present peculiarities that modify gaze deployment during content visualization. We wondered whether differences between this imagery and typical videos can be observed saliency-wise. If so, it appears reasonable to further study UAV salience patterns and to extract biases that are representative of them. Ultimately, a dictionary of biases forms a reliable low-complexity saliency system. Its potential spans from real-time applications embedded on UAVs to providing the dictionary as priors to saliency prediction schemes.
The distinction between conventional and UAV content is developed through the extraction of valuable features describing human saliency maps (HSMs) in a high-dimensional space. We designed hand-crafted features based on fixation dispersion, the spatial and temporal structure of human saliency, vertical and horizontal marginal distributions, and finally GMMs. They were computed over each frame of the datasets DHF1K (400 videos) for typical videos and EyeTrackUAV2 (43 videos) for the unconventional imagery. We then ran a machine learning algorithm for dimension-reduction visualization, namely the t-SNE, to observe differences between visual attention in the two types of content.
Our results justify why learning salience on conventional imaging to predict UAV imaging is neither accurate nor relevant. The separation between both types of imaging based on the extracted characteristics pleads for dedicated processing. In addition to finding distinctions between types of content, the t-SNE revealed differences within UAV content. We thus applied the t-SNE on EyeTrackUAV2 alone to identify classes that represent sets of similar saliency patterns in sequences. We based this study on marginal distributions, which prove useful for generating parametric content-specific biases from these clusters. A hierarchical clustering algorithm, specifically a dendrogram with a Ward criterion, sorts the sequences and, after thresholding, exhibits seven mainly balanced clusters.
For each class, we proceed by extracting the parameters needed to generate biases. From horizontal and vertical marginal distributions, we compute means and standard deviations, assuming they follow a 1D Gaussian distribution. We gather all resulting statistics over time and per cluster and compute their distributions. Such distributions express the likelihood of the center coordinates and std of a 2D Gaussian in saliency patterns. Accordingly, we want to extract only prevailing values and implement the extraction of local maxima to define bias parameters.
We then carry out the creation of biases. We first reconstruct two 1D Gaussians that stand for the horizontal and vertical distributions. By multiplying them, we obtain the bias pattern. Biases are generated for all combinations of the previous parameters, leading to 296 patterns. This number is reduced by keeping only the most meaningful biases per cluster in terms of CC, SIM, and KL that outdo the center bias. The ensuing dictionary contains 21 patterns, downloadable on our dedicated web page (https://www-percept.irisa.fr/uav-biases/).
A qualitative analysis hints that videos shot at high altitude may present saliency patterns with high longitudinal variability. Besides, it seems that UAV saliency is centered width-wise. Moreover, some biases show center-bias-like patterns better suited to UAV content, as they have a reduced std in at least one dimension. We think this copes with the small size of objects in UAV videos. Quantitatively, biases surpass others on their respective cluster and achieve fair scores. All of the above is in line with our expectations for single-image saliency patterns.
Finally, we conducted a benchmark over EyeTrackUAV2 to compare bias efficiency against baselines and handcrafted-features saliency models (HCs). The study was extended to EyeTrackUAV1 to check the external validity of the dictionary and our conclusions. Overall, the dictionary presents some patterns with a high power of generalization, which, overall and per cluster or sequence, surpass the CB, sometimes human means (HMs), and, even more interestingly, HCs. Content-specific patterns proved accurate and useful in appropriate contexts.
We further explored the potential of biases by using the dictionary as a set of filters. The outcome is that biases bring knowledge that is different from HCs. Note that the achieved gain is relatively low, particularly for the most-performing HCs. We thus recommend the use of biases on their own and not as filters. A future application that seems highly relevant is to feed dynamic saliency schemes with this dictionary of priors.
Overall, this study justifies the need for specializing salience to the type of content one deals with, especially UAV videos. The main outcome of the paper is the creation of a dictionary of biases, forming a low-complexity saliency system that is sound, relevant, and effective. It lays the ground for further advancements toward dynamic saliency prediction in specific imaging.

Author Contributions

Conceptualization, investigation, formal analysis, and writing—original draft, A.-F.P.; Supervision, L.Z. and O.L.M.; Writing—review and editing, A.-F.P., L.Z., and O.L.M. All authors have read and agreed to the published version of the manuscript.

Funding

The presented work is funded by the ongoing research project ANR ASTRID DISSOCIE (Automated Detection of SaliencieS from Operators’ Point of View and Intelligent Compression of DronE videos) referenced as ANR-17-ASTR-0009.

Conflicts of Interest

The authors declare no conflict of interest.

Figure 1. Human Saliency Maps and their Marginal distributions of frame 165 for four sequences of EyeTrackUAV2.
Figure 2. t-SNE learnt on all features and marginal distribution characteristics, on sequence and frame data (perplexity value 30). We can see that most points representing UAV content are in areas of the space not covered by typical points, and that there are various clusters of UAV points.
Figure 3. Cluster selection from t-SNE.
Figure 4. Pipeline describing the biases generation process. First, we compute vertical and horizontal marginal distributions of a frame, from which we extract means (μ) and standard deviations (σ). Variations of these four features are studied through their distributions, over all frames of sequences in a cluster. The peaks of these distributions are extracted as biases parameters. Eventually, by combining values of peaks, we create a dictionary of biases. This process is applied on each cluster individually.
Figure 5. Examples of MDs statistic distributions and the extracted peak values. Statistics are means (μ) and standard deviations (σ) per columns (c) or rows (r). Associated values can be found in Table 3.
Figure 6. Examples of a bias extracted per cluster.
Table 1. Number of features extracted per frame and per sequence.
Description Level | Source Data | Features | nb Features
Frame | Saliency maps | Energy mean and std, entropy, temporal gradient mean and std | 5
Frame | Fixations | Number of fixations, DBSCAN and HDBSCAN number of clusters | 3
Frame | Marginal distribution | First to fourth degree moments, geometric mean and median of marginal distributions | 12
Frame | Gaussian mixture models | Center coordinates of K-mean algorithm (with one and HDBSCAN number of clusters), together with means and covariances of the most representative 2D Gaussian of 1- and HDBSCAN-GMM, and proportion of variance explained by these Gaussians | 18
Sequence | All | Mean and std of all precedent characteristics over time | 76
Table 2. Clusters, their sequences, and the number of sequences and frames per cluster.
Cluster I (6 seq., 4891 frames): Soccer1, 09152008flight2tape1_3_crop2, truck3, 09152008flight2tape3_3_crop1, car13, car15
Cluster II (7 seq., 10694 frames): ManRunning2, car4, wakeboard8, car3, car9, car1, car2
Cluster III (2 seq., 613 frames): Girl1, Walking
Cluster IV (7 seq., 8932 frames): ManRunning1, car14, 09162008flight1tape1_1_crop2, 09152008flight2tape1_5_crop1, 09152008flight2tape2_1_crop4, car7, 09162008flight1tape1_1_crop1
Cluster V (8 seq., 9549 frames): 09152008flight2tape1_3_crop1, 09152008flight2tape2_1_crop2, 09152008flight2tape1_3_crop3, 09152008flight2tape1_5_crop2, StreetBasketball1, building4, bike3, 09152008flight2tape2_1_crop1
Cluster VI (5 seq., 2435 frames): building1, car11, person22, building3, truck2
Cluster VII (8 seq., 5127 frames): Girl2, Basketball, bike2, building2, truck4, car12, Soccer2, 09152008flight2tape2_1_crop3
Table 3. Position (in pixel) of local maxima in distributions of statistic values. These peaks are the future parameters necessary for biases generation. Highest peaks are highlighted in bold.
Cluster | I | II | III | IV | V | VI | VII
μr | 236, 291, 416 | 102, 231, 292 | 245, 340 | 217, 491, 523, 570 | 238, 335, 361, 388, 449, 510 | 405 | 259, 343
σr | 99, 170 | 115, 163 | 103, 177 | 107, 175 | 115, 171 | 82, 94, 153 | 83, 149
μc | 91, 332, 640 | 210, 284, 643, 896 | 519, 577, 880 | 131, 376, 705, 1038 | 409, 695 | 712, 772 | 415, 683
σc | 176, 307 | 85, 241, 279 | 223, 346 | 174, 282 | 150, 267, 358 | 207, 309 | 158, 315
Number of biases | 36 | 72 | 24 | 64 | 72 | 12 | 16
Table 4. Parameters of biases and saliency metric results of biases on EyeTrackUAV2. Bold numbers achieve the highest score per cluster.
Bias | μr | σr | μc | σc | CC↑ (overall) | SIM↑ (overall) | KL↓ (overall) | CC↑ (cluster) | SIM↑ (cluster) | KL↓ (cluster)
CenterBias (I) | – | – | – | – | 0.305 | 0.349 | 1.553 | 0.104 | 0.238 | 2.423
I_1 | 416 | 99 | 332 | 307 | 0.122 | 0.258 | 3.335 | 0.309 | 0.325 | 2.215
I_2 | 416 | 170 | 332 | 307 | 0.100 | 0.261 | 2.110 | 0.278 | 0.292 | 1.727
I_3 | 416 | 99 | 640 | 176 | 0.282 | 0.333 | 3.526 | 0.291 | 0.307 | 3.627
CenterBias (II) | – | – | – | – | 0.305 | 0.349 | 1.553 | 0.201 | 0.163 | 2.687
II_1 | 102 | 163 | 643 | 241 | 0.083 | 0.253 | 2.790 | 0.299 | 0.229 | 2.097
II_2 | 102 | 115 | 643 | 241 | 0.022 | 0.203 | 6.507 | 0.285 | 0.235 | 2.374
II_3 | 102 | 163 | 643 | 85 | 0.095 | 0.207 | 10.033 | 0.306 | 0.258 | 4.601
CenterBias (III) | – | – | – | – | 0.305 | 0.349 | 1.553 | 0.147 | 0.284 | 1.846
III_1 | 245 | 103 | 577 | 346 | 0.196 | 0.304 | 2.912 | 0.270 | 0.355 | 1.540
III_2 | 245 | 103 | 519 | 346 | 0.177 | 0.296 | 2.990 | 0.265 | 0.357 | 1.587
III_3 | 245 | 103 | 880 | 346 | 0.194 | 0.303 | 3.044 | 0.272 | 0.323 | 1.622
CenterBias (IV) | – | – | – | – | 0.305 | 0.349 | 1.553 | -0.004 | 0.132 | 3.676
IV_1 | 217 | 175 | 705 | 282 | 0.231 | 0.320 | 1.672 | -0.020 | 0.147 | 3.518
IV_2 | 491 | 107 | 131 | 174 | -0.075 | 0.095 | 14.913 | 0.096 | 0.132 | 14.501
IV_3 | 217 | 175 | 376 | 282 | 0.079 | 0.256 | 2.282 | -0.007 | 0.146 | 4.836
CenterBias (V) | – | – | – | – | 0.305 | 0.349 | 1.553 | 0.306 | 0.300 | 1.635
V_1 | 449 | 171 | 695 | 150 | 0.260 | 0.325 | 3.310 | 0.469 | 0.387 | 1.613
V_2 | 388 | 171 | 695 | 150 | 0.297 | 0.344 | 3.083 | 0.453 | 0.383 | 1.608
V_3 | 449 | 115 | 695 | 150 | 0.253 | 0.315 | 4.354 | 0.484 | 0.404 | 2.236
CenterBias (VI) | – | – | – | – | 0.305 | 0.349 | 1.553 | 0.379 | 0.417 | 1.181
VI_1 | 405 | 153 | 712 | 309 | 0.314 | 0.340 | 1.527 | 0.474 | 0.442 | 1.009
VI_2 | 405 | 94 | 712 | 207 | 0.299 | 0.344 | 3.024 | 0.470 | 0.477 | 1.912
VI_3 | 405 | 94 | 712 | 309 | 0.288 | 0.337 | 2.553 | 0.464 | 0.466 | 1.388
CenterBias (VII) | – | – | – | – | 0.305 | 0.349 | 1.553 | 0.430 | 0.352 | 1.327
VII_1 | 343 | 83 | 683 | 158 | 0.301 | 0.341 | 4.606 | 0.592 | 0.488 | 2.522
VII_2 | 343 | 149 | 683 | 158 | 0.314 | 0.355 | 2.718 | 0.521 | 0.412 | 2.024
VII_3 | 343 | 149 | 683 | 315 | 0.335 | 0.351 | 1.460 | 0.502 | 0.355 | 1.287
Table 5. Results of HC models overall and per cluster, together with average HM results, for EyeTrackUAV2. Each cell reports CC↑ / SIM↑ / KL↓. Stressed results show the best result per metric and per cluster.
Model | Overall | I | II | III | IV | V | VI | VII
BMS | 0.266 / 0.313 / 1.550 | 0.350 / 0.285 / 1.664 | 0.448 / 0.226 / 2.091 | 0.462 / 0.329 / 1.448 | 0.381 / 0.267 / 1.937 | 0.194 / 0.254 / 1.816 | 0.436 / 0.407 / 1.050 | 0.376 / 0.292 / 1.647
GBVS | 0.327 / 0.337 / 1.447 | 0.236 / 0.265 / 1.827 | 0.360 / 0.187 / 2.222 | 0.439 / 0.350 / 1.341 | 0.274 / 0.228 / 2.140 | 0.274 / 0.272 / 1.707 | 0.350 / 0.387 / 1.156 | 0.284 / 0.274 / 1.663
RARE2012 | 0.298 / 0.324 / 1.598 | 0.366 / 0.311 / 1.589 | 0.376 / 0.213 / 2.158 | 0.377 / 0.345 / 1.466 | 0.347 / 0.267 / 2.026 | 0.236 / 0.266 / 1.881 | 0.438 / 0.415 / 1.043 | 0.240 / 0.285 / 1.774
SIM | 0.124 / 0.270 / 1.760 | 0.261 / 0.249 / 1.798 | 0.198 / 0.125 / 2.703 | 0.312 / 0.270 / 1.697 | 0.191 / 0.179 / 2.352 | 0.119 / 0.219 / 2.014 | 0.210 / 0.344 / 1.295 | 0.037 / 0.216 / 2.050
SUN | 0.155 / 0.280 / 1.741 | 0.197 / 0.236 / 1.907 | 0.113 / 0.114 / 2.906 | 0.213 / 0.243 / 1.812 | 0.152 / 0.172 / 2.427 | 0.143 / 0.221 / 2.009 | 0.280 / 0.362 / 1.219 | 0.033 / 0.217 / 2.123
HM | 0.220 / 0.306 / 3.054 | 0.117 / 0.224 / 4.535 | 0.087 / 0.129 / 4.407 | 0.085 / 0.232 / 3.444 | 0.005 / 0.119 / 6.670 | 0.239 / 0.274 / 2.826 | 0.300 / 0.375 / 2.715 | 0.329 / 0.320 / 2.058
Table 6. Saliency metric gain (gainHC) for HC models filtered with biases and CB, per cluster, on EyeTrackUAV2. Results showing a gain (positive for CC and SIM, negative for KL) are highlighted for readability.
CC ↑SIM ↑KL ↓CC ↑SIM ↑KL ↓CC ↑SIM ↑KL ↓CC ↑SIM ↑KL ↓CC ↑SIM ↑KL ↓CC ↑SIM ↑KL ↓CC ↑SIM ↑KL ↓
I-1II-1III-1IV-1V-1VI-1VII-1
BMS−0.052−0.0211.567−0.080−0.0180.9890.0080.0221.2000.0590.048−0.0620.0760.0471.5690.1410.068−0.2070.0760.0442.929
GBVS−0.113−0.0381.695−0.130−0.0351.1360.262−0.0041.386−0.0100.0230.1310.0110.0261.7160.0520.045−0.0240.0000.0133.133
RARE2012−0.080−0.0291.585−0.107−0.0301.0520.2530.0021.282−0.0030.0290.0420.0210.0291.6360.0600.051−0.1150.0140.0203.002
SIM0.020−0.0081.5180.0060.0010.9220.2250.0421.1050.1450.065−0.1550.1640.0651.4690.2170.082−0.2950.1910.0752.778
SUN0.004−0.0111.531−0.020−0.0090.9560.2270.0341.1380.1140.056−0.1160.1320.0551.5150.1850.074−0.2520.1570.0622.833
I-2II-2III-2IV-2V-2VI-2VII-2
BMS−0.041−0.0100.323−0.165−0.0764.705−0.0070.0141.273−0.297−0.20013.1800.1000.0621.3670.0960.0571.3130.1090.0701.018
GBVS−0.099−0.0230.433−0.217−0.0934.7790.248−0.0111.454−0.348−0.21213.1860.0240.0351.5460.0150.0271.5280.0260.0391.230
RARE2012−0.066−0.0150.345−0.187−0.0884.7220.244−0.0031.347−0.310−0.20513.1090.0330.0391.4540.0340.0341.4020.0370.0441.121
SIM0.0110.0000.293−0.063−0.0524.6290.2050.0341.183−0.180−0.17213.1060.1970.0831.2520.1960.0811.1850.2100.0930.891
SUN−0.003−0.0040.305−0.090−0.0624.6530.2100.0261.215−0.208−0.17813.0990.1620.0721.2990.1620.0701.2420.1750.0810.942
I-3II-3III-3IV-3V-3VI-3VII-3
BMS0.0770.0451.801−0.114−0.0798.3000.0010.0191.339−0.072−0.0180.5120.0530.0302.6200.1000.0540.8370.1440.076−0.252
GBVS−0.0020.0161.998−0.168−0.0998.4000.264−0.0031.499−0.133−0.0360.652−0.0150.0062.7670.0150.0271.0330.0510.048−0.042
RARE20120.0190.0231.875−0.138−0.0918.2920.246−0.0021.431−0.094−0.0260.5480.0020.0112.6740.0340.0340.9160.0570.055−0.146
SIM0.1780.0701.676−0.006−0.0568.2010.2140.0391.251−0.005−0.0030.4580.1520.0532.5070.1860.0730.7330.2330.092−0.352
SUN0.1450.0581.733−0.038−0.0668.2220.2230.0321.274−0.026−0.0090.4810.1200.0422.5550.1560.0640.7780.1990.083−0.310
CB
BMS0.0970.035−0.162
GBVS0.0340.023−0.069
RARE20120.0240.028−0.116
SIM0.1560.042−0.208
SUN0.1310.038−0.187
Table 7. Overall metric results of biases, CB, HMs, and HCs on the dataset EyeTrackUAV1. Highlighted bias results outperform CB; stressed HC scores outdo the best bias.
Metric | I-1 | I-2 | I-3 | II-1 | II-2 | II-3 | III-1 | III-2 | III-3 | IV-1 | IV-2 | IV-3 | V-1 | V-2 | V-3 | VI-1 | VI-2 | VI-3 | VII-1 | VII-2 | VII-3 | CB | HMs | BMS | GBVS | RARE2012 | SIM | SUN
CC↑ | 0.103 | 0.104 | 0.221 | 0.127 | 0.070 | 0.154 | 0.215 | 0.204 | 0.157 | 0.221 | 0.058 | 0.130 | 0.201 | 0.249 | 0.175 | 0.220 | 0.208 | 0.190 | 0.277 | 0.283 | 0.268 | 0.229 | 0.223 | 0.383 | 0.373 | 0.335 | 0.216 | 0.197
SIM↑ | 0.175 | 0.174 | 0.235 | 0.190 | 0.160 | 0.202 | 0.218 | 0.215 | 0.201 | 0.215 | 0.061 | 0.186 | 0.220 | 0.240 | 0.212 | 0.209 | 0.227 | 0.210 | 0.268 | 0.257 | 0.225 | 0.184 | 0.228 | 0.248 | 0.245 | 0.256 | 0.187 | 0.187
KL↓ | 3.739 | 2.430 | 3.771 | 2.854 | 5.523 | 7.647 | 2.985 | 3.025 | 3.229 | 2.030 | 13.562 | 2.398 | 3.317 | 3.057 | 4.396 | 2.051 | 3.587 | 3.236 | 4.394 | 2.724 | 1.914 | 2.186 | 3.856 | 1.811 | 1.803 | 1.868 | 2.151 | 2.208
Table 8. CC results of biases, CB, and HCs per sequence on the dataset EyeTrackUAV1. Highlighted bias results outperform CB; the highest HC scores are also stressed.
CC ↑Bike3Boat6Boat8Building5Car10Car13Car2Car4Car6Car7Car8Group2Person13Person14Person18Person20Person3Truck1Wakeboard10
I-10.0670.0380.1640.1720.1020.1160.180−0.0390.1490.0660.1470.0860.0620.140−0.0460.1510.1040.2320.058
I-20.0420.0940.1270.1180.1060.1430.175−0.0550.1520.0690.1550.0840.0690.155−0.0130.1410.1090.2190.090
I-30.2920.0780.4350.3440.3310.2140.2270.1450.3160.1970.1690.2770.0960.1530.0390.1550.3060.2290.197
II-10.1280.2800.059−0.1260.1550.2250.0610.2040.1440.0570.1670.1350.0940.0800.2430.0320.1710.0890.204
II-20.0510.234−0.040−0.1570.0650.1810.0280.1680.0880.0110.1240.0620.0640.0320.220−0.0060.0830.0020.117
II-30.1480.2980.151−0.0380.2400.3000.0160.1490.1980.0820.1490.1630.0880.0700.2040.0230.2560.1060.323
III-10.2420.3790.222−0.0780.3190.2140.1000.1810.2020.1330.2220.2440.1440.1900.2280.1420.3180.3110.360
III-20.2140.3780.204−0.0820.3010.2050.0980.1490.1960.1250.2220.2260.1400.1890.2200.1450.3030.2980.344
III-30.2480.2220.189−0.0690.2490.1280.0560.2210.1200.0900.1290.1970.0930.1180.1590.0830.2280.2550.256
IV-10.3020.3150.246−0.0150.3160.2590.1270.2860.2320.1430.2160.2620.1340.1580.2420.1080.3040.2490.314
IV-2−0.120−0.073−0.080−0.026−0.101−0.044−0.045−0.103−0.073−0.067−0.033−0.067−0.0280.005−0.102−0.006−0.0820.018−0.074
IV-30.0750.2820.095−0.0800.1660.1680.1060.0080.1510.0660.2020.1210.1060.1450.1720.1380.1770.1860.192
V-10.3000.0810.3620.2860.3130.2160.1360.2570.2780.2020.1030.2770.0720.1090.0900.1010.2710.1590.204
V-20.3510.1600.4270.2490.3910.2660.1420.3110.3150.2190.1510.3420.1090.1480.1380.1230.3580.2260.300
V-30.2750.0220.3550.3290.2760.1710.1420.1990.2620.1850.0890.2450.0500.0940.0410.1050.2280.1310.128
VI-10.3880.1030.3720.3200.3230.1960.1860.1910.2600.2180.1480.2700.0870.1540.0820.1570.2590.2580.200
VI-20.3740.0500.4320.3200.3270.1700.1650.1810.2630.2050.1320.2840.0740.1340.0460.1420.2770.2150.167
VI-30.3610.0360.3870.3330.2830.1430.1730.1210.2300.1840.1340.2380.0680.1340.0230.1600.2280.2330.130
VII-10.3820.1650.5720.1760.4590.2410.1350.2350.2960.2200.1860.3890.1300.1990.1100.1600.4640.3410.397
VII-20.3770.2320.4710.2110.4450.2980.1580.3100.3380.2290.1960.3780.1400.1840.1680.1450.4260.2910.382
VII-30.4140.2300.4110.2120.3950.2470.1960.2270.2930.2210.2200.3230.1330.2070.1520.1870.3500.3510.312
CB0.3400.2100.2800.2090.2990.2530.2030.1940.2590.1940.2040.2410.2140.1840.1580.1760.2400.2820.218
BMS0.1810.6530.2630.3950.3750.1070.2870.3560.3370.2590.3190.5530.3020.4660.4260.4100.7610.4380.388
GBVS0.2410.5420.3340.3300.4060.1600.1590.2510.3590.2670.3440.5660.3640.4740.4090.4120.6790.3650.420
RARE20120.2460.6180.3400.3800.2470.1060.0960.1660.3220.2040.3000.4750.2210.4000.3760.3490.6950.4910.338
SIM0.1000.4170.3000.3390.1740.0350.0900.0290.1400.1280.1980.2990.1470.2360.2760.3010.3500.3660.170
SUN0.1270.3490.2350.3480.0950.0770.1110.0250.1310.0950.1970.2490.1230.1620.2970.2390.3670.3550.152
Table 9. Saliency metric gain (gainHC) for HC models filtered with biases and CB, overall, on EyeTrackUAV1. Results showing a gain (positive for CC and SIM, negative for KL) are highlighted for readability.
CC ↑SIM ↑KL ↓CC ↑SIM ↑KL ↓CC ↑SIM ↑KL ↓CC ↑SIM ↑KL ↓CC ↑SIM ↑KL ↓CC ↑SIM ↑KL ↓CC ↑SIM ↑KL ↓
I-1I-1III-1CBV-1VI-1VII-1
BMS−0.0580.0201.467−0.0370.0375.6590.0300.0680.9040.0740.0310.6500.0490.0811.1080.0810.066−0.1970.0580.1112.429
GBVS−0.0890.0111.575−0.0650.0205.747−0.0130.0451.0570.0290.0230.778−0.0040.0571.2670.0180.045−0.0210.0080.0812.601
RARE2012−0.0200.0221.461−0.0160.0265.6860.0330.0520.9900.0510.0300.7190.0520.0661.1830.0680.056−0.1090.0620.0832.515
SIM−0.0240.0231.4350.0020.0415.5300.0920.0690.7960.1160.0320.5910.0950.0801.0080.1190.065−0.2850.1410.1212.251
SUN−0.0020.0231.4130.0220.0405.5050.1020.0660.7970.1040.0310.5860.1000.0741.0110.1190.060−0.2680.1440.1122.257
I-2II-2III-2IV-1V-2VI-2VII-2
BMS−0.0050.0280.103−0.145−0.0113.4030.0250.0660.9330.0600.069−0.1710.0750.1000.9010.0220.0741.4410.0940.1160.611
GBVS−0.0390.0220.198−0.168−0.0243.477−0.0160.0441.0800.0090.0460.0040.0190.0711.082−0.0350.0461.6390.0380.0830.811
RARE20120.0250.0330.114−0.109−0.0223.4830.0320.0511.0130.0470.053−0.0600.0680.0781.0040.0300.0551.5220.0830.0900.732
SIM0.0020.0260.087−0.0800.0033.6860.0830.0660.8320.1100.070−0.2700.1380.1000.7770.0800.0801.3190.1660.1160.467
SUN0.0230.0260.068−0.0570.0033.7240.0950.0640.8300.1130.066−0.2590.1390.0920.7850.0880.0731.3320.1670.1080.477
I-3II-3III-3IV-3V-3VI-3VII-3
BMS0.0390.0871.613−0.0540.045−0.178−0.0340.0361.180−0.0020.0420.126−0.0010.0622.2140.0170.0571.0390.1030.083−0.287
GBVS−0.0130.0601.794−0.0810.029−0.088−0.0770.0181.320−0.0330.0290.240−0.0550.0382.372−0.0380.0341.2190.0420.057−0.094
RARE20120.0520.0691.680−0.0280.029−0.132−0.0250.0211.2650.0180.0370.1780.0150.0472.2640.0240.0441.1020.0830.068−0.179
SIM0.0990.0921.4900.0140.054−0.2200.0200.0451.0660.0210.0390.0850.0510.0672.1150.0660.0610.9400.1610.080−0.395
SUN0.1080.0851.4920.0340.051−0.2130.0300.0411.0760.0460.0390.0710.0600.0612.1150.0730.0560.9550.1570.075−0.379

