1. Introduction
Wearable sensors with egocentric (first-person) cameras, such as smart glasses, are receiving increasing attention from the computer vision and clinical science communities [
1,
2]. The technology has been applied to many real-world applications, such as action recognition [
3], that are traditionally implemented using cameras in third-person view. Egocentric videos can also be employed jointly with conventional third-person action recognition videos to improve recognition performance [
4]. Conventional third-person video clips provide a global view of high-level appearances, while egocentric ones provide a more explicit view of monitored people and objects by describing human interactive actions and reflecting the subjective gaze selection of a smart glasses wearer.
For both first-person and third-person video clips, one of the key steps in action recognition is feature extraction [
5,
6,
7,
8,
9]. Local feature descriptors (LFDs) are commonly employed to describe a local property of video actions (e.g., image patches) when constructing a discriminative visual representation using the bag of visual words (BoVW) framework [
10]. Typical examples of LFDs include the histogram of oriented gradients (HOG) [
11], histogram of optical flow (HOF) [
12], motion boundary histogram (MBH) [
13], and histogram of motion gradients (HMG) [
14]. Among these LFDs, with the exception of HOG, gradients over time for consecutive video frames provide useful information, as the magnitude of gradients becomes large around regions of abrupt intensity changes (e.g., edges and corners). This property has enabled the development of feature extraction that is more informative (in terms of shape, object, etc.) compared with the flat regions in video frames. In addition, the gradients (for HOG and HMG) or optical flows (for HOF and MBH) over neighboring pixels in an individual frame represent spatial information. The temporal and spatial information combine to make LFDs effective approaches to feature extraction.
LFDs often use bins to aggregate the gradient information or its variations and extensions. Briefly, by partitioning the angular space over $2\pi$, the gradient space is partitioned into multiple subspaces, each referred to as a bin, to summarize the information carried by pixels using a weighted summation operation. The weight of each pixel in a bin is determined by its angular distance to the bin, while the magnitude information is derived from the gradient information. By representing visual attention, a saliency map can effectively distinguish pixels in a frame [
15] to provide key information for daily activity recognition. Thus, saliency maps can be used to enhance the extraction of LFDs.
This paper proposes an algorithm that integrates saliency maps into LFDs to improve the effectiveness of video analysis; the algorithm was developed on the basis of the hypothesis that the most important interactive human actions are in the foreground of video frames. In particular, the proposed work uses saliency map information to further adjust the weights applied to the bin strength calculation, such that visually more important information is considered by LFDs with higher weights. The contributions (in meeting the objectives) of this paper are mainly twofold: (1) the proposal of a saliency map-enhanced LFD extraction approach (i.e., SMLFD), which works with HOG, HMG, HOF, MBHx, and MBHy for video analysis, and (2) the development of an egocentric action recognition framework based on the proposed SMLFD for memory aid systems, which is the secondary objective of this work. The proposed work was evaluated using a publicly available dataset, and the experimental results demonstrate the effectiveness of the proposed approach.
The remainder of the paper is organized as follows.
Section 2 introduces related work.
Section 3 presents the details of the proposed method for egocentric action recognition.
Section 4 reports the experiments and the results of the proposed method. The conclusion is drawn in
Section 5.
2. Related Work
In the research of visual action recognition, feature extraction from RGB videos has been intensively explored [
5,
9,
16]. Prior to the feature extraction phase, regions of interest (ROIs) can be detected to significantly improve the efficiency of action recognition. Spatiotemporal interest points (STIPs) [
17] with Dollar’s periodic detector [
18] have been commonly employed to locate ROIs. Early local descriptors were used for feature extraction by extending their original counterparts in the image domain [
13,
19,
20]. The 3D versions of the scale-invariant feature transform (SIFT) [
19] and HOF [
12], together with speeded-up robust features (SURF), have been proposed to improve the performance of visual action recognition [
21,
22,
23]. To discretize gradient orientations, the HOG feature descriptor has been frequently used in the process of extracting low-level (i.e., local) features. Additionally, the work reported in [
24] extracted mid-level motion features using the local optical flow for the action recognition task. A higher-level visual representation was also applied in [
25] for recognizing human activities. Recently, local features were suggested to be concatenated with improved dense trajectories (iDT) [
8] and deep features [
26]. More details can be found in [
27].
Several methods have been used to compute saliency maps for salient object detection and fixation prediction. When predicting fixation, the image signature [
15] significantly enhances the efficiency of saliency map calculations [
28]. The image signature was originally proposed for efficiently predicting the location of the human eye fixation and has been successfully applied for third-person [
29] and first-person [
5] action recognition. The work in [
29] fused multiple saliency prediction models to calculate saliency maps for better revealing some visual semantics, such as faces, moving objects, etc. In [
5], a histogram-based local feature descriptor family was proposed that utilized the concept of gaze region of interest (GROI). Prior to the feature extraction stage, the ROI was obtained by expanding from the gaze point to the point with the maximum pixel value in the calculated frame-wise saliency map. The extracted sparse features were then employed for egocentric action recognition. The work reported in [
5,
6,
15] showed that the saliency map is robust to noise in visual action recognition, and it is able to cope with self-occlusion, which occurs in various first- and third-person visual scenes.
Once the visual features are extracted, the encoding process, an essential part of achieving classification efficiency, is required to obtain a unique representation. There are three feature encoding types: voting-based, reconstruction-based, and supervector-based. Voting-based encoding methods (e.g., [
30]) allow each descriptor to directly vote for the codeword using a specific strategy. Reconstruction-based encoding methods (e.g., [
31]) employ visual codes to reconstruct the input descriptor during the decoding process. Supervector-based encoding methods [
6,
7,
32] usually yield a visual representation with high dimensionality via the aggregation of high-order statistics. The vector of locally aggregated descriptors (VLAD) [
7] and the Fisher vector (FV) [
32] are widely employed supervector-based encoding schemes due to their competitive performance in visual action recognition. After noting the appearance of redundancy in datasets, the authors in [
6] proposed saliency-informed spatio-temporal VLAD (SST-VLAD) and FV (SST-FV) to speed up the feature encoding process and enhance the classification performance. The objectives were achieved by selecting a small number of videos from the dataset according to the ranked spatio-temporal video-wise saliency scores.
A set of preprocessing (after the feature extraction phase) and postprocessing (after the feature encoding stage) techniques are also commonly applied to boost the performance of both first- and third-person action recognition. Additionally, dimensionality reduction techniques [
33,
34] (e.g., principal component analysis (PCA), linear discriminant analysis (LDA), autoencoder, fuzzy rough feature selection, etc.) and normalization techniques (e.g., RootSIFT,
ℓ1,
ℓ2, PN) and their combinations (e.g., PN
ℓ2 [
5,
6],
ℓ1PN, etc.) have been integrated into action recognition applications. Several popular classifiers have also been explored for recognition applications, such as linear and nonlinear support vector machine (SVM) [
14,
35] and artificial neural networks (ANNs) [
5,
9,
36]. They are usually coupled with different frame sampling strategies (e.g., dense, random, and selective sampling) [
37]. The recent work reported in [
35,
38] combined multiple feature descriptors and pooling strategies in the encoding phase, leading to improved performance.
3. Saliency Map-Based Local Feature Descriptors for Egocentric Action Recognition
The framework of the proposed saliency map-based egocentric action recognition is illustrated in
Figure 1, which highlights the proposed SMLFD feature extraction approach itself, as detailed in
Section 3.3. This framework was developed according to the principle of the BoVW approach. In the illustrative diagram, SMHMG is used as a representative example to demonstrate the workflow of the proposed approach, as shown in the bottom row of
Figure 1. In particular, the SMHMG feature extraction approach takes two consecutive frames in a 3D video matrix
$\mathcal{V}$ as the input. Then, the motion information between the two consecutive frames (
$\mathcal{TD}$) is captured via a temporal derivative operation with respect to time
t. This is followed by the calculation of gradients in the spatial
x and
y directions. From this, the magnitudes and orientations of every pixel in the frame are calculated (frame-wise, jointly denoted as
$\mathcal{M}$ and
$\mathcal{O}$), and the corresponding saliency map
$\mathcal{S}$ is generated. Then, the magnitude responses of the pixels are aggregated into a limited number of evenly divided directions over $2\pi$, i.e., bins. In this illustrative example, eight bins are used. This is followed by a weighting operation using saliency information to generate a histogram of bins for each block. The final feature representation is then generated by concatenating all the block information (18 blocks in this illustrative example) into a holistic histogram feature vector. The key steps of this process are presented in the following subsections. The processes of feature encoding [
5,
6,
7,
32], pre- and postprocessing [
8,
14,
35,
39], and classification [
5,
6,
9,
14] are omitted here as they are not the focus of this work, but these topics have been intensively studied and are available in the literature.
3.1. Video Representation
An egocentric video clip is usually formed by a set of video frames, and each frame is represented as a two-dimensional array of pixels. Thus, each video clip $\mathcal{V}$ can be viewed as a three-dimensional array of pixels, with the x- and y-axes representing the plane of the video frame, and the t-axis denoting the timeline. An egocentric video clip is denoted by $\mathcal{V}\in {\mathbb{R}}^{m\times n\times f}$, where $m\times n$ represents the resolution of each video frame, and f represents the total number of frames. Histogram-based LFDs use each pair of consecutive frames $\mathcal{F}{r}_{j}$ and $\mathcal{F}{r}_{j+1}$ $(1\le j\le f-1)$ to capture the motion information along the timeline (except HOG, which does not consider the temporal information) and use the neighboring pixels in every frame $\mathcal{F}{r}_{i}$ $(1\le i\le f)$ to extract the spatial information. In this work, all LFDs, including HOG, HMG, HOF, MBHx, and MBHy, were enhanced by using the saliency map, yielding SMHOG, SMHMG, SMHOF, SMMBHx, and SMMBHy, respectively. Briefly, HOG, HMG, HOF, and MBH represent videos using the in-frame gradient only, in-and-between-frame gradient, in-frame optical flow and between-frame gradient, and the imaginary and real parts of the optical flow gradient and between-frame gradient, respectively.
3.2. Local Spatial and Temporal Information Calculation
SMHOG and SMHMG: SMHOG calculates spatial gradient information for each input video frame
$\mathcal{F}{r}_{i}$. By extending SMHOG, SMHMG performs an efficient temporal gradient calculation between each pair of neighboring frames (
$\mathcal{TD}$) prior to the entire SMHOG process using Equation (
1).
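The temporal gradient step can be illustrated with a minimal sketch. It assumes a plain forward difference between consecutive frames, which may differ from the exact form of Equation (1); the function name `temporal_derivative` is hypothetical.

```python
import numpy as np

def temporal_derivative(video):
    """Forward difference along the time axis of an (m, n, f) video array.

    Assumption: the temporal derivative is approximated by subtracting
    frame j from frame j+1. The result has shape (m, n, f-1), and slice j
    holds the motion information between frames j and j+1.
    """
    video = np.asarray(video, dtype=float)
    return video[:, :, 1:] - video[:, :, :-1]
```

For a clip with f frames this yields f-1 difference images, matching the index range of j used for the motion-based descriptors.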
SMHOF and SMMBH: The gradients in SMHOF are implemented using the Horn–Schunck (HS) [
40] optical flow method; the calculated flow vector
$\overrightarrow{{\mathcal{OF}}_{j,j+1}}$ is also used in SMMBHx and SMMBHy. Because
$\overrightarrow{{\mathcal{OF}}_{j,j+1}}$ is a complex-typed vector, SMMBHx and SMMBHy use its imaginary (
${\mathcal{IF}}_{j,j+1}$) and real (
${\mathcal{RF}}_{j,j+1}$) parts, respectively.
The gradients in the
x and
y directions for SMHOG, SMHMG, SMMBHx, and SMMBHy are summarized below:
where
$k=i$ and
$i\in [1,f]$ for SMHOG, and
$k=j$ and
$j\in [1,f-1]$ for the others. Note that the above derivative operations are usually implemented in practice using a convolution with a Haar kernel [
41]. Then, the magnitude
${\mathcal{M}}_{k}$ and orientation
${\mathcal{O}}_{k}$ of the temporal and spatial information about the pixels in each frame are calculated as:
where $|\cdot|$ and
$\angle(\cdot)$ denote the magnitude and orientation of the complex-typed vector, respectively.
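The magnitude and orientation step can be sketched as follows. The sketch assumes the usual Euclidean magnitude and arctangent orientation for a pair of x and y gradients, and the modulus and argument for a complex-typed flow field; both function names are hypothetical, and orientations are mapped to the range from 0 to $2\pi$ so that they can be quantized into bins afterwards.

```python
import numpy as np

def gradient_magnitude_orientation(gx, gy):
    """Per-pixel magnitude and orientation from x and y gradients.

    Assumption: magnitude is sqrt(gx^2 + gy^2) and orientation is
    atan2(gy, gx) wrapped into [0, 2*pi).
    """
    gx = np.asarray(gx, dtype=float)
    gy = np.asarray(gy, dtype=float)
    magnitude = np.hypot(gx, gy)
    orientation = np.mod(np.arctan2(gy, gx), 2.0 * np.pi)
    return magnitude, orientation

def flow_magnitude_orientation(flow):
    """Magnitude and orientation of a complex-typed optical flow field."""
    flow = np.asarray(flow, dtype=complex)
    return np.abs(flow), np.mod(np.angle(flow), 2.0 * np.pi)
```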
3.3. Saliency MapInformed Bin Response Generation
The orientations of temporal and spatial information are evenly quantized into
b bins in the range of $[0, 2\pi]$, i.e.,
${\mathcal{B}}_{q}=2\pi \cdot q/b$,
$q\in \{0,1,\cdots ,b-1\}$, to aggregate the gradient information of pixels. The original LFD methods assign the two closest bins to each pixel on the basis of its gradient orientation, and the bin strength of each of these is calculated as the weighted summation of the magnitude of the bin’s partially assigned pixels. Given pixel
p in frame
$\mathcal{F}{r}_{k}$ with gradient orientation
${o}_{p}$ and magnitude
${m}_{p}$, which are calculated using Equation (
3), suppose that the two neighboring bins are
${\mathcal{B}}_{q}$ and
${\mathcal{B}}_{q+1}$. Then, the weights of pixel
p relative to
${\mathcal{B}}_{q}$ and
${\mathcal{B}}_{q+1}$ are calculated as
${w}_{pq}=b({\mathcal{B}}_{q+1}-{o}_{p})/2\pi $ and
${w}_{p(q+1)}=b({o}_{p}-{\mathcal{B}}_{q})/2\pi $, respectively. From this, the contributions of pixel
p to bins
${\mathcal{B}}_{q}$ and
${\mathcal{B}}_{q+1}$ are calculated as
${w}_{pq}\ast {m}_{p}$ and
${w}_{p(q+1)}\ast {m}_{p}$, respectively.
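As a small worked illustration of these weights, the sketch below computes the two neighboring bins and their weights for a single pixel. The function name `bin_weights` is hypothetical, and wrapping the last bin around to bin 0 is an assumption made here so that orientations close to $2\pi$ still have two neighbors.

```python
import math

def bin_weights(o_p, b):
    """Return (q, q_next, w_q, w_q_next) for orientation o_p and b bins.

    Bin centres are B_q = 2*pi*q/b. The weights follow
    w_q = b*(B_{q+1} - o_p)/(2*pi) and w_{q+1} = b*(o_p - B_q)/(2*pi),
    so they always sum to 1.
    """
    width = 2.0 * math.pi / b
    pos = (o_p % (2.0 * math.pi)) / width   # orientation in units of bin width
    q = int(pos) % b                        # lower neighbouring bin
    w_q_next = pos - int(pos)               # b*(o_p - B_q)/(2*pi)
    w_q = 1.0 - w_q_next                    # b*(B_{q+1} - o_p)/(2*pi)
    return q, (q + 1) % b, w_q, w_q_next    # assumed wrap-around to bin 0
```

For example, with eight bins an orientation of $\pi/8$ lies halfway between bins 0 and 1, so each bin receives half of the pixel's magnitude.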
Given that a saliency map represents the visual attractiveness of a frame, the saliency values are essentially a fuzzy distribution of each pixel regarding its visual attractiveness. Therefore, the saliency membership ${\mu}_{Attractiveness}(p)$ of each pixel p indicates its importance to the video frame from the perspective of human visual attention. On the basis of this observation, this work further distinguished the contribution of each pixel to its neighboring bins by introducing the saliency membership to the bin strength calculation. In particular, the weights ${w}_{pq}$ and ${w}_{p(q+1)}$ of pixel p relative to its neighboring bins ${\mathcal{B}}_{q}$ and ${\mathcal{B}}_{q+1}$ are updated by the aggregation of its saliency membership value; that is, the original weights ${w}_{pq}$ and ${w}_{p(q+1)}$ are updated to ${w}_{pq}\ast {\mu}_{Attractiveness}(p)$ and ${w}_{p(q+1)}\ast {\mu}_{Attractiveness}(p)$. Accordingly, the contributions of pixel p to bins ${\mathcal{B}}_{q}$ and ${\mathcal{B}}_{q+1}$ are updated as ${w}_{pq}\ast {\mu}_{Attractiveness}(p)\ast {m}_{p}$ and ${w}_{p(q+1)}\ast {\mu}_{Attractiveness}(p)\ast {m}_{p}$, respectively.
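The saliency-informed bin response described above can be sketched for one block of pixels as follows. This is a minimal illustration rather than the authors' implementation: the function name and the default of eight bins are assumptions, and the last bin is assumed to wrap around to bin 0.

```python
import numpy as np

def saliency_weighted_histogram(magnitude, orientation, saliency, b=8):
    """Histogram of b bins where each pixel contributes w * mu(p) * m_p.

    magnitude, orientation, saliency: arrays of the same shape holding
    m_p, o_p (radians) and the saliency membership mu(p) in [0, 1].
    """
    magnitude = np.asarray(magnitude, dtype=float)
    saliency = np.asarray(saliency, dtype=float)
    width = 2.0 * np.pi / b
    pos = np.mod(np.asarray(orientation, dtype=float), 2.0 * np.pi) / width
    lower = np.floor(pos)
    q = lower.astype(int) % b               # lower neighbouring bin
    w_next = pos - lower                    # weight towards bin q+1
    w_q = 1.0 - w_next                      # weight towards bin q
    contribution = saliency * magnitude     # mu(p) * m_p
    hist = np.zeros(b)
    # np.add.at accumulates correctly when several pixels hit the same bin.
    np.add.at(hist, q.ravel(), (w_q * contribution).ravel())
    np.add.at(hist, ((q + 1) % b).ravel(), (w_next * contribution).ravel())
    return hist
```

Setting every saliency value to 1 recovers the original, unweighted bin strengths, which makes the effect of the saliency membership easy to compare.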
There are multiple ways available in the literature for the calculation of saliency maps. This work adopted the approach reported in [
15], which is a pixel-based saliency membership generation approach proposed according to the hypothesis that the most important part or parts of an image are in the foreground. The pseudocode of the algorithm is illustrated in Algorithm 1. The saliency map of a frame is denoted by
S, which collectively represents the saliency value
${\mu}_{Attractiveness}(p)$ of every pixel
p in the frame. Briefly, each frame
$\mathcal{F}{r}_{i}$ is first converted to three color channels, Red, Green, and Blue (RGB), denoted by
${\mathcal{F}}_{i}^{R},{\mathcal{F}}_{i}^{G},{\mathcal{F}}_{i}^{B}$, as shown in Line 1 of the algorithm. Then, the video frame
${\mathcal{F}}_{i}$ in RGB is converted to the CIELAB space with L, A, and B channels [
42], as indicated by Line 2. This is followed by reconstructing the frame using the discrete cosine transform (DCT). The
$D(\cdot)$ operation and the inverse DCT
${D}^{-1}[\cdot]$ operation in Line 3 distinguish the foreground and background in a fuzzy way. From this, in Line 4, the mean values of the pixels in the L, A, B channels are computed and denoted as
$\overline{{\mathcal{F}}_{i}}$; the surface is smoothed via a Gaussian kernel
${\kappa}_{g}$ with the entrywise Hadamard product (i.e., ∘) operation, with output
$\dot{{\mathcal{F}}_{i}}$. The final saliency map
S is then obtained by normalizing each value in
$\dot{{\mathcal{F}}_{i}}$ to the range of
$[0,1]$.
Algorithm 1: Saliency Membership Calculation Procedure.
Input: $\mathcal{F}{r}_{i}$: the $i$th 2D video frame, $\mathcal{F}{r}_{i}\in {\mathbb{R}}^{m\times n}$
Output: S: saliency membership of the $i$th frame
Procedure getSaliencyMap$(\mathcal{F}{r}_{i})$
1: Construct the RGB video frames: $\mathcal{F}{r}_{i}\to {\mathcal{F}}_{i}^{R},{\mathcal{F}}_{i}^{G},{\mathcal{F}}_{i}^{B}$;
2: Convert ${\mathcal{F}}_{i}$ in the RGB frames to LAB frames: $\tilde{{\mathcal{F}}_{i}}=Rgb2Lab({\mathcal{F}}_{i})$;
3: Reconstruct the frame $\widehat{{\mathcal{F}}_{i}}$ by $\widehat{{\mathcal{F}}_{i}}={D}^{-1}\left[sgn(D(\tilde{{\mathcal{F}}_{i}}))\right]$;
4: Calculate the LAB channel average $\overline{{\mathcal{F}}_{i}}$ using $\overline{{\mathcal{F}}_{i}}=(\widehat{{\mathcal{F}}_{i}^{L}}+\widehat{{\mathcal{F}}_{i}^{A}}+\widehat{{\mathcal{F}}_{i}^{B}})/3$;
5: Smooth the 2D single channel $\dot{{\mathcal{F}}_{i}}$ by $\dot{{\mathcal{F}}_{i}}={\kappa}_{g}\ast (\overline{{\mathcal{F}}_{i}}\circ \overline{{\mathcal{F}}_{i}})$;
6: Compute S by normalizing the $\dot{{\mathcal{F}}_{i}}$, $0<{\mu}_{Attractiveness}^{i}(p)\le 1,\forall p\in \mathcal{F}{r}_{i}$;
7: return S;
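Lines 3 to 6 of Algorithm 1 can be sketched in NumPy as below. This is a hedged illustration under several assumptions: the input is taken to be already in CIELAB (Lines 1 and 2 are skipped), the Gaussian width `sigma` is a hypothetical parameter, borders are zero-padded during smoothing, and a min-max normalization is used, which maps the smallest value to exactly 0. All helper names are hypothetical, and each frame side should be at least as long as the Gaussian kernel (13 samples for the default `sigma`).

```python
import numpy as np

def dct_matrix(n):
    """Orthonormal DCT-II matrix C: C @ x is the DCT of x, C.T @ y inverts it."""
    k = np.arange(n)[:, None]
    i = np.arange(n)[None, :]
    c = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * i + 1) * k / (2 * n))
    c[0, :] = np.sqrt(1.0 / n)
    return c

def gaussian_kernel(sigma):
    """Normalized 1D Gaussian kernel truncated at about three sigma."""
    radius = max(1, int(3 * sigma))
    x = np.arange(-radius, radius + 1)
    g = np.exp(-(x ** 2) / (2.0 * sigma ** 2))
    return g / g.sum()

def smooth(img, sigma):
    """Separable Gaussian blur with same-size output (zero-padded borders)."""
    g = gaussian_kernel(sigma)
    rows = np.apply_along_axis(lambda r: np.convolve(r, g, mode="same"), 1, img)
    return np.apply_along_axis(lambda c: np.convolve(c, g, mode="same"), 0, rows)

def get_saliency_map(frame_lab, sigma=2.0):
    """Saliency membership for an (m, n, 3) frame assumed to be in CIELAB."""
    frame_lab = np.asarray(frame_lab, dtype=float)
    m, n, channels = frame_lab.shape
    cm, cn = dct_matrix(m), dct_matrix(n)
    recon = np.empty_like(frame_lab)
    for ch in range(channels):
        coeffs = cm @ frame_lab[:, :, ch] @ cn.T        # Line 3: D(.)
        recon[:, :, ch] = cm.T @ np.sign(coeffs) @ cn   # Line 3: inverse DCT of sgn
    mean = recon.mean(axis=2)                           # Line 4: channel average
    smoothed = smooth(mean * mean, sigma)               # Line 5: Hadamard square, blur
    lo, hi = smoothed.min(), smoothed.max()             # Line 6: normalize
    if hi == lo:
        return np.ones_like(smoothed)
    return (smoothed - lo) / (hi - lo)
```

The DCT is built from an explicit matrix to keep the sketch dependency-free; a library DCT routine would be the natural choice for full-resolution frames.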

The extra computational complexity of the proposed approach, compared with the original versions, mainly lies in the calculation of the saliency map, as the saliency information is integrated into the proposed approach by a simple multiplication operation. The computational cost of the saliency map approach used in this work is generally moderate, as it is only a fraction of the cost of other saliency algorithms [
15].
4. Experiments and Results
As described in this section, the publicly available video dataset UNNGazeEAR [
5] was utilized to evaluate the performance of the proposed methods, with the support of a comparative study in reference to the GROILFD approach [
5]. The UNNGazeEAR dataset consists of 50 video clips in total, including five egocentric action categories. The length of the videos ranges from 2 to 11 seconds, with 25 frames per second. The sample frames are shown in
Table 1. All the experiments were conducted using an HP workstation with an Intel® Xeon™ E5-1630 v4 CPU @ 3.70 GHz and 32 GB RAM.
4.1. Experimental Setup
The parameters used in [
5,
6,
9,
14,
35,
39] were also adopted in this work. Specifically, the BoVW model and PCA were employed to select 72 features. The value of the normalization parameter of PN
ℓ2 was fixed to 0.5. A backpropagation neural network (BPNN) was used for classification after the features were extracted. The BPNN was trained using the scaled conjugate gradient (SCG) algorithm with a maximum of 100 training epochs. The number of neurons in the hidden layer was fixed to 20, and the training ratio was set to 70%. The performance metric was the mean accuracy of 100 independent runs.
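The normalization step with the parameter value 0.5 can be sketched as follows, assuming that PN denotes power normalization (a signed power applied element-wise) followed by ℓ2 scaling. The function name and the small `eps` guard against division by zero are hypothetical.

```python
import numpy as np

def power_l2_normalize(v, alpha=0.5, eps=1e-12):
    """Signed power normalization followed by l2 normalization.

    Assumption: each entry x becomes sign(x) * |x|**alpha, after which
    the vector is scaled to unit Euclidean length.
    """
    v = np.asarray(v, dtype=float)
    v = np.sign(v) * np.abs(v) ** alpha
    return v / max(np.linalg.norm(v), eps)
```

With `alpha=0.5`, large entries of an encoded feature vector are compressed more strongly than small ones before the vector is passed to the classifier.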
4.2. Experiments with Different Resolutions, Block Sizes, and Feature Encoding Methods
In this experiment, SMLFD (including SMHOG, SMHMG, SMHOF, SMMBHx, and SMMBHy) was applied for comparison with GROILFD [
5] (covering GROIHOG, GROIHMG, GROIHOF, GROIMBHx, and GROIMBHy). Videos with the same downscaled resolution (by a factor of 6), as reported in [
5], were used in this experiment as the default. Thus, each video in the dataset has a uniform resolution of 320 × 180 pixels.
Firstly, the values of the feature extraction time
${T}^{E}$ were studied.
Table 2 and Table 6 show that the speed of extracting SMLFD was boosted by at least 40-fold (denoted as 40×) relative to extraction at the original resolution. When using a block size of 16-by-16 spatial pixels by 6 temporal frames (denoted as [16 × 16 × 6]) for the feature extraction, SMHOG, SMHMG, SMHOF, SMMBHx, and SMMBHy were boosted by 40×, 41×, 43×, 41×, and 41×, respectively. When using a block size of [32 × 32 × 6], SMHOG, SMHMG, SMHOF, SMMBHx, and SMMBHy were boosted by 41×, 43×, 45×, 43×, and 43×, respectively. Because GROILFD extracts sparse features, SMLFD feature extraction needs more computational time than GROILFD. However, the VLAD and FV feature encoding methods consume similar amounts of time for feature encoding.
In terms of accuracy, the performance of SMLFD was improved significantly compared with that using the original resolution. Furthermore, SMLFD consistently outperformed GROILFD with the downscaled dataset, because SMLFD constructed dense features. To investigate the trade-off between performance and time complexity, the features encoded with a smaller number of visual words were studied.
Table 2 shows that SMLFD generally had better accuracy when using smaller block sizes (i.e., [4 × 4 × 6], [8 × 8 × 6], and [16 × 16 × 6]), while GROILFD performed better when using larger ones. This indicates that SMLFD is a better candidate for low-resolution videos.
4.3. Experiments with Varying Number of Visual Words for Encoding
In this experiment, the effect of varying the number of visual words used in the VLAD and FV feature encoding schemes was investigated. Specifically, half of the visual words were used to compare with the setting in the experiment reported in
Section 4.2. The impact of feature dimensions was also investigated when using PCA in the preprocessing phase. The results are shown in
Figure 2. The experiment used 3, 6, 9, 18, 24, 36, 48, 60, and 72 feature dimensions. SMLFD outperformed GROILFD when using a small number of visual words for feature encoding. The best performances that were achieved by SMLFD are summarized in
Table 3. As shown in
Figure 3, SMHOG exceeded other approaches in most cases when using block sizes of
$[4\times 4\times 6]$ and
$[8\times 8\times 6]$. However, SMHMG outperformed others in most cases for a block size of
$[16\times 16\times 6]$.
Table 4 shows that the proposed SMLFD outperformed GROILFD. SMLFD achieved its peak performance when using a smaller block size (i.e., [4 × 4 × 6]), while GROILFD reached its best performance when using a larger block size (i.e., [16 × 16 × 6] with VLAD and [8 × 8 × 6] with FV). This again indicates that SMLFD and GROILFD represent families of local dense feature descriptors and sparse feature descriptors, respectively.
4.4. Experiment Using the Memory Aid Dataset
In this experiment, the models trained on the extracted SMLFD features were applied to an untrimmed video stream that was published in [
5].
Table 5 compares the SMLFD-trained models with those trained using GROILFD features. In general, SMLFD performed as well as GROILFD: each of them had three superior results and four equivalent results. In particular, both SMHOG and GROIHOG achieved 100% accuracy using FV feature encoding. The three higher results yielded by GROILFD were produced by GROIHOG, GROIMBHx, and GROIMBHy, all under VLAD feature encoding.
4.5. Comparison with Original Resolution
The proposed SMLFD family of local feature descriptors can achieve comparable results to GROILFD. In this experiment, the proposed SMLFD feature descriptors were investigated by adopting all videos from the dataset with the original resolution of 1920 × 1080 pixels. Given the comparison results in
Table 6, GROILFD clearly outperformed SMLFD in terms of accuracy and required time for feature extraction. The reason for this is threefold: (1) GROILFD only processes a single connected interest region with the noise removed from the frame, whereas SMLFD deals with all the pixels in each frame; thus, GROILFD is a faster family of feature extraction approaches; (2) SMLFD extends LFD by introducing an additional real-time operation of calculating the frame-wise saliency membership, which is highly sensitive to the resolution of the video; (3) SMLFD is not efficient in suppressing irrelevant background and foreground noise. To conclude, SMLFD is highly scale-variant and GROILFD is a better candidate for videos with high resolution.
4.6. Discussion
Egocentric videos are usually nonstationary due to the camera motion of the smart glasses or other data-capturing devices; the experimental results of using the proposed local features indicate the ability of the proposed approach to cope with this challenge during video action recognition. The proposed feature descriptors also have their own limitations. For instance, the computational cost of the proposed approach is generally higher than that of the originals. Also, the performance of the proposed approach closely depends on the accuracy of the calculated saliency map, and poorly generated saliency maps may significantly limit the effectiveness of the proposed approach. A good selection of approaches for saliency map calculation is available in the literature, and their effectiveness in supporting the proposed approach requires further investigation.