1. Introduction
Deep learning technologies are increasingly being adapted for use on devices with limited computational resources. In scenarios such as edge computing or real-time image recognition, factors such as model size and inference speed become critical. While large-scale models are powerful, they are often impractical for terminal deployment due to their heavy resource demands, which can significantly hinder real-time processing. In automotive applications, for example, where cameras are used to detect driver distraction or fatigue, compact and efficient models are essential for ensuring responsiveness and system integration within hardware constraints. Gaze tracking plays a critical role in applications such as driver monitoring and assistive technologies, enabling systems to interpret human attention with precision [1]. While many lightweight models aim to reduce the computational load for edge deployment, they often do so at the expense of accuracy. For example, a comprehensive survey [2] highlights that although deep-learning-based appearance gaze estimation models have significantly advanced accuracy under controlled conditions, they still struggle in unconstrained settings featuring head pose variation, occlusion, and lighting changes. These limitations point to a growing need for models that can perform robustly in real-world scenarios while remaining efficient. In this work, we propose an improved gaze-tracking framework that balances speed and accuracy through architectural optimization. Our method builds upon the Look to Coordinate Space (L2CS) gaze estimation model by integrating GhostNet as the backbone and enhancing it with a custom-designed depthwise convolutional block. Through a comparative study of lightweight architectures, including MobileNets, ConvNets and ResNeXts, we identified L2CS as the most effective method for capturing fine-grained eye movement. The resulting model achieves real-time inference while reducing parameters and FLOPs by nearly 50%, at the cost of only a marginal 1.5% increase in angular error. This demonstrates its practicality for deployment on resource-constrained devices.
Modern terminal devices, including smartphones, surveillance systems [3,4], IoT devices, drones, robots and in-vehicle systems [5,6,7], are increasingly expected to perform AI tasks locally with real-time responsiveness [8], minimal power consumption, optimized system performance and improved privacy. To meet these requirements, model compression [9,10] and acceleration techniques are necessary to reduce computational cost and model size without compromising prediction accuracy [11].
Previous studies in lightweight deep learning have primarily focused on minimizing parameter counts, often at the expense of model accuracy. Thus, achieving an optimal balance between model efficiency and performance remains a central challenge. Recent research has introduced various strategies to overcome this issue, including developing inherently lightweight architectures, as well as compression and optimization techniques.
In the context of gaze tracking, appearance-based gaze estimation models aim to predict gaze direction directly from images of the face or eyes. These methods demonstrate strong adaptability across different image resolutions and environmental conditions, rendering them ideal for mobile and embedded systems. However, they still encounter significant challenges in unconstrained environments where factors such as variable head poses and lighting conditions can distort the appearance of the eyes and reduce the reliability of predictions. Advances in deep learning techniques and the availability of large-scale datasets have led to significant progress in developing improved gaze estimation solutions [12,13].
One notable lightweight architecture is GhostNet, which optimizes convolutional operations by reducing the generation of redundant features. GhostNet addresses the inefficiency whereby conventional convolution layers often produce excessive features that are computationally expensive yet contribute little to model performance. By introducing the cost-effective GhostBlock, GhostNet significantly reduces training and inference overheads. Its lightweight design has been successfully applied in various domains. For example, it has been used to accelerate GIVTED-Net for medical image segmentation [14], enhance YOLOv4 for UAV-based inspection of power transmission lines [15], and improve real-time emotion recognition systems by reducing FLOPs while maintaining expressive feature extraction [16].
Inspired by these successes, this study incorporates GhostNet’s convolutional efficiency techniques into the L2CS gaze-tracking model. Specifically, we have redesigned certain convolutional blocks and revised the fully connected layers to reduce parameter count while preserving spatial attention to subtle eye movements.
2. Related Works
This study focuses on improving gaze-tracking performance using lightweight deep learning techniques. The large-scale, publicly available Gaze360 dataset [17] is used for training and evaluation purposes. Integrating lightweight models with targeted architectural optimizations achieves a reduction of nearly 50% in training and inference parameters while preserving high predictive accuracy. Specifically, the L2CS gaze-tracking model [18] is selected as the baseline due to its strong angular prediction capabilities and is then optimized using GhostNet [19] for improved computational efficiency. Additionally, depthwise convolutions [20] are incorporated to refine grouped convolutions [21], enabling efficient inference on both CPU and GPU platforms [22].
2.1. L2CS Network
The L2CS (Look to Coordinate Space) network is based on the ResNet-50 [23] architecture and uses a multi-loss framework combining regression and classification branches to improve the accuracy of gaze angle estimation. This dual-branch design uses parallel convolutional pathways to perform both gaze regression and discrete gaze classification simultaneously, enabling the network to correct the angular errors that are inherent in pure regression models.
The regression branch estimates continuous gaze angles, while the classification branch divides the angular space into bins and classifies the gaze direction accordingly. This hybrid strategy results in more robust performance when head poses and eye appearances vary. Two distinct loss functions are employed to supervise each gaze angle dimension: mean squared error (MSE) for regression and cross-entropy for classification. These are defined as follows:
$$\mathcal{L}_{\mathrm{MSE}} = \frac{1}{N}\sum_{i=1}^{N}\left(y_i - \hat{y}_i\right)^2$$
where $y_i$ is the ground-truth gaze angle, $\hat{y}_i$ is the predicted angle, and $N$ is the number of samples.
$$\mathcal{L}_{\mathrm{CE}} = -\frac{1}{N}\sum_{i=1}^{N}\log p_i$$
where $p_i$ denotes the predicted probability for the correct gaze angle class.
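The following is a minimal PyTorch sketch of this combined supervision, assuming the network exposes a continuous angle prediction and per-bin classification logits for each gaze angle; the bin width and loss weighting are illustrative placeholders rather than the exact L2CS settings.

```python
import torch
import torch.nn.functional as F

def combined_gaze_loss(pred_angle, bin_logits, gt_angle, bin_width=4.0, alpha=1.0):
    """Combine MSE on continuous angles with cross-entropy on binned angles.

    pred_angle: (N,) predicted continuous gaze angle in degrees
    bin_logits: (N, num_bins) classification logits over angular bins
    gt_angle:   (N,) ground-truth gaze angle in degrees
    bin_width, alpha: illustrative bin size and regression weight
    """
    mse = F.mse_loss(pred_angle, gt_angle)

    # Map the continuous ground truth to a discrete bin index for classification.
    num_bins = bin_logits.shape[1]
    gt_bin = torch.clamp((gt_angle + num_bins * bin_width / 2) // bin_width,
                         0, num_bins - 1).long()
    ce = F.cross_entropy(bin_logits, gt_bin)

    return ce + alpha * mse
```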
Recent gaze-tracking models such as GazeLSTM and HG-Net have also demonstrated promising performance on challenging datasets. However, L2CS offers the best balance between computational cost and gaze estimation accuracy, which will be validated by the experimental results presented later. Specifically, L2CS achieves lower MAE than GazeLSTM and comparable performance to HG-Net while maintaining significantly fewer FLOPs and parameters. These characteristics make it a strong candidate for further optimization and justify its selection as the baseline architecture in this study.
2.2. ResNet-50
ResNet-50 is the main backbone in L2CS owing to its balance of representational power and computational efficiency. With 50 layers, ResNet-50 can capture both low- and high-level features, enabling precise estimation of subtle eye movements. Compared to deeper variants, such as ResNet-101 or ResNet-152, ResNet-50 is more resource-efficient, making it ideal for use in embedded or real-time systems.
The residual learning framework of ResNet effectively mitigates vanishing and exploding gradient issues during training, thereby improving convergence and generalization. Furthermore, its modular architecture facilitates easy adaptation and extension, allowing for enhancements specific to the task at hand, such as custom blocks for gaze tracking.
2.3. GhostNet
GhostNet introduces an efficient convolutional strategy comprising three stages: standard convolution, ghost feature generation and feature map concatenation. Firstly, a reduced number of intrinsic feature maps are generated through standard convolution with a predefined ratio. Next, ghost features are created by applying lightweight 3 × 3 convolutions to these intrinsic maps. Finally, the intrinsic and ghost features are concatenated to form the complete output feature map.
This mechanism enables GhostNet to maintain the same output dimensionality as standard convolution while reducing the number of parameters and the computational overhead by around 50%. This architecturally efficient approach has been successfully applied in various fields, including medical image segmentation, real-time UAV inspection systems, and emotion recognition applications.
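The following is a minimal PyTorch sketch of the three-stage Ghost module described above; the channel split and kernel sizes follow the common GhostNet formulation with ratio S, and are illustrative rather than taken verbatim from the implementation used in this work.

```python
import math
import torch
import torch.nn as nn

class GhostModule(nn.Module):
    """Ghost module sketch: a reduced standard convolution produces intrinsic
    feature maps, cheap depthwise convolutions generate 'ghost' features from
    them, and the two sets are concatenated to the full output width."""

    def __init__(self, in_channels, out_channels, ratio=2, kernel_size=1, dw_size=3):
        super().__init__()
        self.out_channels = out_channels
        init_channels = math.ceil(out_channels / ratio)   # intrinsic maps
        new_channels = init_channels * (ratio - 1)        # ghost maps

        # Stage 1: standard convolution with a reduced number of output channels.
        self.primary = nn.Sequential(
            nn.Conv2d(in_channels, init_channels, kernel_size,
                      padding=kernel_size // 2, bias=False),
            nn.BatchNorm2d(init_channels),
            nn.ReLU(inplace=True),
        )
        # Stage 2: cheap 3x3 depthwise convolution applied to the intrinsic maps.
        self.cheap = nn.Sequential(
            nn.Conv2d(init_channels, new_channels, dw_size,
                      padding=dw_size // 2, groups=init_channels, bias=False),
            nn.BatchNorm2d(new_channels),
            nn.ReLU(inplace=True),
        )

    def forward(self, x):
        intrinsic = self.primary(x)
        ghost = self.cheap(intrinsic)
        # Stage 3: concatenate intrinsic and ghost features, trim to the target width.
        return torch.cat([intrinsic, ghost], dim=1)[:, :self.out_channels]
```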
2.4. Depthwise Separable Convolution
Depthwise separable convolution is a key operation in lightweight convolutional neural networks (CNNs) that significantly reduces computational complexity. It consists of a depthwise convolution, which applies a single filter to each input channel, followed by a pointwise (1 × 1) convolution that combines these features across channels. This operation is more computationally efficient than standard convolution yet still effective in capturing spatial patterns, making it ideal for mobile and edge computing tasks.
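A minimal PyTorch sketch of this factorization is shown below; the BatchNorm/ReLU placement follows common practice and is not prescribed by the text.

```python
import torch.nn as nn

def depthwise_separable_conv(in_channels, out_channels, kernel_size=3, stride=1):
    """Factor a standard convolution into a per-channel (depthwise) convolution
    followed by a 1x1 (pointwise) convolution that mixes channels."""
    return nn.Sequential(
        # Depthwise: one k x k filter per input channel (groups = in_channels).
        nn.Conv2d(in_channels, in_channels, kernel_size, stride=stride,
                  padding=kernel_size // 2, groups=in_channels, bias=False),
        nn.BatchNorm2d(in_channels),
        nn.ReLU(inplace=True),
        # Pointwise: 1x1 convolution combining features across channels.
        nn.Conv2d(in_channels, out_channels, kernel_size=1, bias=False),
        nn.BatchNorm2d(out_channels),
        nn.ReLU(inplace=True),
    )
```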
2.5. Group Convolution
Group convolution, originally proposed in AlexNet, was designed to enable parallel training on multiple GPUs by partitioning feature maps and filters into distinct groups. While this approach significantly reduces computation and parameter count, it limits information sharing between groups, potentially constraining the model’s learning capacity. Later enhancements, such as depthwise and dilated convolutions, were developed to address this issue.
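A small illustration of the trade-off, using PyTorch's groups argument with arbitrary example channel sizes: grouping shrinks the weight count in proportion to the number of groups, but each group only sees a slice of the input channels.

```python
import torch.nn as nn

def conv_params(in_ch=64, out_ch=64, k=3, groups=1):
    """Count the weights of a k x k convolution split into `groups` filter groups."""
    conv = nn.Conv2d(in_ch, out_ch, k, groups=groups, bias=False)
    return sum(p.numel() for p in conv.parameters())

print(conv_params(groups=1))   # 36,864 weights (standard convolution)
print(conv_params(groups=4))   # 9,216 weights (4 groups, no cross-group mixing)
print(conv_params(groups=64))  # 576 weights (depthwise: one filter per channel)
```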
2.6. Squeeze-and-Excitation Networks
The Squeeze-and-Excitation (SE) module is a channel attention mechanism which explicitly models the interdependencies between feature channels. It consists of three phases: Squeeze, Excitation and Feature Recalibration.
2.6.1. Squeeze
Global average pooling is applied to compress the spatial dimensions and create a channel-wise descriptor:
$$z_c = \frac{1}{H \times W}\sum_{i=1}^{H}\sum_{j=1}^{W} u_c(i, j)$$
where $z_c$ is the descriptor for channel $c$, and $u_c$ is the corresponding channel in the input feature map $U \in \mathbb{R}^{H \times W \times C}$.
2.6.2. Excitation
The channel descriptors are passed through a gating mechanism comprising two fully connected layers with ReLU and Sigmoid activations:
$$s = \sigma\!\left(W_2\,\delta\!\left(W_1 z\right)\right)$$
where $W_1 \in \mathbb{R}^{\frac{C}{r}\times C}$ and $W_2 \in \mathbb{R}^{C\times\frac{C}{r}}$ are learned weight matrices, $\delta$ is the ReLU function, and $\sigma$ is the Sigmoid activation. The reduction ratio $r$ controls model complexity.
2.6.3. Feature Recalibration
Each channel is rescaled using the computed attention weights:
$$\tilde{x}_c = s_c \cdot u_c$$
where $s_c$ is the attention weight computed for channel $c$.
This rescaled feature map is then forwarded to the subsequent layers. The SE module is lightweight and modular, and it can be easily integrated into existing CNN architectures to enhance performance with minimal overhead.
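A minimal PyTorch sketch of the SE module as described above follows; the default reduction ratio of 16 is a common convention rather than a value taken from this paper.

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Squeeze-and-Excitation: global pooling -> two FC layers -> channel rescaling."""

    def __init__(self, channels, reduction=16):
        super().__init__()
        self.squeeze = nn.AdaptiveAvgPool2d(1)           # (N, C, H, W) -> (N, C, 1, 1)
        self.excite = nn.Sequential(
            nn.Linear(channels, channels // reduction),  # W1, followed by ReLU
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),  # W2, followed by Sigmoid
            nn.Sigmoid(),
        )

    def forward(self, x):
        n, c, _, _ = x.shape
        z = self.squeeze(x).view(n, c)        # channel descriptors (squeeze)
        s = self.excite(z).view(n, c, 1, 1)   # per-channel attention weights (excitation)
        return x * s                          # feature recalibration
```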
2.7. Recent Transformer-Based Approaches
In recent years, transformer-based or hybrid CNN–transformer models have been applied to gaze estimation tasks, especially in real-world scenarios such as retail and human–computer interaction. One such approach utilizes a deep gaze estimation framework that combines convolutional backbones with transformer modules to enhance spatial attention and temporal modeling. While these methods show promising accuracy, they generally incur higher computational and memory costs, making them less suitable for deployment on resource-constrained edge devices [24].
These limitations further motivate the development of lightweight alternatives. In contrast to transformer-heavy models, our proposed method focuses on CNN-based architectural optimization using GhostBlocks, which achieves substantial FLOPs and parameter reduction while maintaining competitive accuracy for real-time applications.
3. Materials
This study uses the Gaze360 dataset, which is publicly available and was jointly released by the Toyota Research Institute and the Massachusetts Institute of Technology. The dataset comprises annotated 3D gaze labels collected from 238 participants in both indoor and outdoor settings. It encompasses a wide range of head poses, gaze angles and subject distances. This diversity and comprehensive coverage make the dataset particularly suitable for developing robust gaze estimation models under unconstrained conditions.
3.1. Dataset Characteristics
The key features of the Gaze360 dataset include:
The dataset supports the estimation of gaze in all directions, including directions in which the eyes are not directly visible to the camera (e.g., backward or peripheral gaze), providing a comprehensive representation of eye movement in natural settings.
It includes annotated pitch and yaw angles of the head, enabling the development of models that can compensate for variations in head movement and improve the robustness of gaze prediction.
The dataset includes 238 individuals of different ages, genders and ethnic backgrounds, ensuring that trained models generalize well across demographic groups.
The recorded data includes a wide range of gaze angles, from direct frontal gaze to extreme lateral and vertical gaze directions, which are critical for training models in real-world applications like driver monitoring or assistive technologies.
In this work, we utilized a subset of the Gaze360 dataset consisting of 112,251 facial images, which we split into 84,902 training images (75.6%), 11,318 validation images (10.1%), and 16,031 test images (14.3%). This partitioning follows the same practice as prior work on L2CS to ensure consistent evaluation across models.
3.2. Data Collection Protocol
To capture a wide range of gaze and head pose variations, the data collection setup uses a Ladybug5 360-degree panoramic camera mounted on a tripod at the center of the recording area. An operator holds a target board marked with a central crosshair and instructs the participant, positioned between one and three meters from the camera, to focus continuously on the target.
During recording, the operator moves the board dynamically in various directions and elevations around the participant and the camera to simulate a wide spectrum of gaze orientations. This setup enables the system to accurately annotate gaze vectors in relation to both head position and environmental context. The panoramic imaging ensures that gaze direction is captured across the entire sphere of visual space, including peripheral and occluded regions.
This flexible yet controlled recording protocol ensures that the dataset contains high-quality, densely distributed gaze annotations that are suitable for training and evaluating appearance-based gaze estimation models in real-world conditions.
5. Experimental Results
This section presents the evaluation results of the proposed lightweight gaze-tracking model. It highlights the model's performance in terms of computational efficiency and accuracy, and compares it with existing approaches. As shown in Table 1, the proposed model achieves a significant reduction in computational cost: the FLOPs fall from 1.653 billion to 0.861 billion, a reduction of 47.9%, and the parameter count drops from 0.239 million to 0.122 million, a 48.74% decrease, demonstrating the effectiveness of the applied model compression techniques. Despite these reductions, the MAE increases by only 1.5%, from 10.7° to 10.87°, a reasonable compromise given the computational gains.
To validate the model further, we compared it with several gaze-tracking methods trained on the Gaze360 dataset in recent years. As shown in Table 2, the proposed model significantly improves inference speed while maintaining accuracy comparable to other state-of-the-art methods, confirming its suitability for real-time applications.
While the 1.5% increase in MAE may appear marginal in general terms, its practical implications depend on specific application requirements. For instance, in driver-monitoring systems, an angular error tolerance within 3–5 degrees is typically sufficient to distinguish between on-road and off-road gaze. The proposed model’s MAE of 10.87° remains consistent with other state-of-the-art methods trained on the Gaze360 dataset while delivering substantial reductions in computational complexity. This trade-off between accuracy and efficiency makes the proposed model suitable for real-time deployment in resource-constrained environments.
Figure 6 shows the results of gaze tracking, visualizing two test images under different lighting conditions and viewing angles. The green arrow represents the ground truth gaze, while the red and yellow arrows show predictions from the proposed model and the original L2CS model, respectively.
Table 3 summarizes the corresponding pitch and yaw errors, demonstrating that the proposed model retains strong accuracy across varying conditions.
We also compared the baseline L2CS architecture with other lightweight models, including MobileNets, ConvNets and ResNeXts. As shown in Table 4, L2CS outperforms the others in terms of mean absolute error (MAE) due to its gaze-specific architectural design. While other models excel in general vision tasks, the tailored structure of L2CS better captures subtle eye movements, making it an ideal foundation for further optimization.
When exploring the addition of attention mechanisms, we incorporated a Squeeze-and-Excitation (SE) block into our model. However, as shown in Table 5, the SE block did not provide any significant improvement in accuracy and was therefore excluded from the final design to avoid unnecessary overhead.
Furthermore, we examined the impact of the feature reduction ratio when generating primary features in GhostBlocks. As shown in Table 6, a larger ratio leads to fewer intrinsic feature maps and faster computation but results in a higher error rate. A reduction ratio of S = 2 was therefore adopted to strike the best balance between performance and speed.
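For context, the standard cost analysis of a Ghost module (a generic derivation under the assumptions stated below, not a result reproduced from this paper) indicates that the theoretical speed-up over an ordinary convolution is approximately the ratio $S$:
$$r_S = \frac{n \cdot h' \cdot w' \cdot c \cdot k^2}{\frac{n}{S} \cdot h' \cdot w' \cdot c \cdot k^2 + (S-1)\cdot\frac{n}{S} \cdot h' \cdot w' \cdot d^2} = \frac{c \cdot k^2}{\frac{1}{S}\, c \cdot k^2 + \frac{S-1}{S}\, d^2} \approx \frac{S \cdot c}{c + S - 1} \approx S$$
where $n$ and $c$ are the numbers of output and input channels, $k$ and $d$ are the kernel sizes of the primary and cheap operations (assumed comparable), and $h' \times w'$ is the output resolution. For $S = 2$ this predicts roughly a twofold reduction, consistent with the roughly 48% FLOPs decrease reported in Table 1.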
Finally, we evaluated the effect of applying the proposed optimization strategy to different convolutional blocks within the ResNet-based backbone. In Table 7, the labels “1st Block” to “4th Block” correspond to the four sequential bottleneck block groups (Stages 1 to 4) in ResNet, as illustrated in Figure 1. The results show that applying GhostBlock only to the first block (Stage 1) leads to a modest reduction in FLOPs and slightly increased MAE. As more blocks are replaced (e.g., Stages 1–3), the MAE temporarily degrades, likely due to partial feature inconsistency introduced in the early and middle layers. However, when all four blocks (Stages 1–4) are replaced, the model achieves the best performance in both efficiency and accuracy. This confirms that uniformly applying architectural improvements across all stages is essential for maintaining feature consistency and enhancing overall performance.
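The following is a sketch of how such stage-wise replacement can be expressed, assuming a torchvision ResNet-50 backbone; the compact GhostModule mirrors the Section 2.3 sketch, and the attribute names (layer1–layer4, conv2) follow torchvision's implementation rather than the authors' exact code.

```python
import math
import torch
import torch.nn as nn
from torchvision.models import resnet50

class GhostModule(nn.Module):
    """Compact Ghost-style 3x3 replacement (mirrors the Section 2.3 sketch)."""
    def __init__(self, in_ch, out_ch, stride=1, ratio=2):
        super().__init__()
        self.out_ch = out_ch
        init_ch = math.ceil(out_ch / ratio)
        self.primary = nn.Conv2d(in_ch, init_ch, 3, stride=stride,
                                 padding=1, bias=False)
        self.cheap = nn.Conv2d(init_ch, init_ch * (ratio - 1), 3,
                               padding=1, groups=init_ch, bias=False)

    def forward(self, x):
        p = self.primary(x)
        return torch.cat([p, self.cheap(p)], dim=1)[:, :self.out_ch]

def ghostify_stages(model, stage_names=("layer1", "layer2", "layer3", "layer4")):
    """Swap the 3x3 convolution (conv2) of every bottleneck in the chosen
    ResNet stages for a Ghost-style module; everything else is left untouched."""
    for name in stage_names:
        for block in getattr(model, name):
            old = block.conv2
            block.conv2 = GhostModule(old.in_channels, old.out_channels,
                                      stride=old.stride[0])
    return model

# Replacing only Stage 1 vs. all four stages, as compared in Table 7.
partial = ghostify_stages(resnet50(weights=None), stage_names=("layer1",))
full = ghostify_stages(resnet50(weights=None))
```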
In addition, we conducted a supplementary experiment to examine the impact of different group settings in the linear operation of GhostBlocks. Under the same configuration of S = 2 and full block replacement, we tested various group numbers for the grouped convolution. The results showed that setting the group number to 1 provides the best balance between accuracy and efficiency. As shown in Table 8, the results validate our design choice, as group = 1 is equivalent to a depthwise convolution and maintains strong gaze estimation performance without introducing excessive computational complexity.
To further evaluate the real-time performance of the proposed model in a practical setting, we conducted a live inference test in an indoor environment using a Logitech C922 Pro Stream webcam configured at 720p resolution. The experiment was carried out on a desktop system equipped with an Intel Core i7-12700K CPU, 32 GB of DDR4 RAM, and an NVIDIA GeForce RTX 3060 Ti GPU running Ubuntu 20.04. Under continuous frame input, the baseline L2CS model achieved 35.2 FPS, while the proposed lightweight model reached 45.6 FPS. These results demonstrate that the reduction in theoretical computational cost translates into a tangible improvement in practical inference speed.
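A minimal sketch of this kind of live measurement loop is shown below, assuming OpenCV and a TorchScript export of the model; the file name, camera index, input size, and preprocessing are placeholders rather than the exact pipeline, which would also include face detection and cropping before gaze estimation.

```python
import time
import cv2
import torch

# Hypothetical model file: substitute the trained lightweight gaze model.
model = torch.jit.load("lightweight_gaze_model.pt").eval().cuda()

cap = cv2.VideoCapture(0)                        # webcam (e.g., Logitech C922)
cap.set(cv2.CAP_PROP_FRAME_WIDTH, 1280)
cap.set(cv2.CAP_PROP_FRAME_HEIGHT, 720)

frames, start = 0, time.time()
with torch.no_grad():
    while frames < 300:                          # measure over a fixed frame budget
        ok, frame = cap.read()
        if not ok:
            break
        # Placeholder preprocessing: resize, scale to [0, 1], NCHW tensor on GPU.
        face = cv2.resize(frame, (224, 224))
        x = torch.from_numpy(face).permute(2, 0, 1).float().div(255).unsqueeze(0).cuda()
        _ = model(x)
        frames += 1

print(f"Average FPS: {frames / (time.time() - start):.1f}")
cap.release()
```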
6. Conclusions
This paper proposes an efficient, lightweight gaze-tracking model designed for use on terminals and edge devices. By integrating GhostNet into the convolutional architecture of the L2CS model, redundant feature maps are effectively eliminated during convolution operations, reducing computational cost without significantly compromising accuracy. Through careful analysis of the feature representations at each layer, the proposed model selectively retains informative features while discarding non-essential ones using GhostNet’s lightweight convolutional mechanism. Consequently, compared to the original L2CS model (16.527 × 10^8 FLOPs, 2.387 × 10^5 parameters), the proposed model (8.610 × 10^8 FLOPs, 1.224 × 10^5 parameters) achieves a 47.9% reduction in FLOPs and a 48.74% decrease in the number of parameters, with only a 1.5% increase in mean angular error. These results demonstrate that the proposed optimization strategy successfully balances model accuracy, computational efficiency and deployment feasibility. This confirms the method’s suitability for real-time gaze estimation tasks on platforms with limited resources, such as in-vehicle systems, mobile devices, and embedded vision modules.
However, it is worth noting that the model exhibits slight performance degradation under challenging conditions such as extreme lighting, occlusion, and rapid head movement. Moreover, the current study evaluates performance solely on the Gaze360 dataset, which may not fully capture the diversity of real-world settings. To address this, future work will include cross-dataset evaluations using MPIIGaze to further validate generalization capability. We also plan to incorporate temporal cues from sequential frames and explore domain adaptation techniques to enhance robustness across varied environments and hardware constraints.