End-to-End Swallowing Event Localization via Blue-Channel-to-Depth Substitution in RGB-D: GRNConvNeXt-Modified AdaTAD with KAN-Chebyshev Decoder

Lai, Derek Ka-Hei; Zhao, Zi-An; Tam, Andy Yiu-Chau; Li, Jing; Zhang, Jason Zhi-Shen; Wong, Duo Wai-Chi; Cheung, James Chung-Wai

doi:10.3390/ai6110276

Open AccessArticle

End-to-End Swallowing Event Localization via Blue-Channel-to-Depth Substitution in RGB-D: GRNConvNeXt-Modified AdaTAD with KAN-Chebyshev Decoder

by

Derek Ka-Hei Lai

¹,

Zi-An Zhao

¹,

Andy Yiu-Chau Tam

¹

,

Jing Li

¹,

Jason Zhi-Shen Zhang

²,

Duo Wai-Chi Wong

^1,*

and

James Chung-Wai Cheung

^1,3,*

¹

Department of Biomedical Engineering, Faculty of Engineering, The Hong Kong Polytechnic University, Hong Kong

²

Department of Mathematics, School of Science, Hong Kong University of Science and Technology, Hong Kong

³

Research Institute of Smart Ageing, The Hong Kong Polytechnic University, Hong Kong

^*

Authors to whom correspondence should be addressed.

AI 2025, 6(11), 276; https://doi.org/10.3390/ai6110276

Submission received: 18 September 2025 / Revised: 15 October 2025 / Accepted: 20 October 2025 / Published: 22 October 2025

(This article belongs to the Special Issue Artificial Intelligence in Biomedical Engineering: Challenges and Developments)

Download

Browse Figures

Versions Notes

Abstract

Background: Swallowing is a complex biomechanical process, and its impairment (dysphagia) poses major health risks for older adults. Current diagnostic methods such as videofluoroscopic swallowing (VFSS) and fiberoptic endoscopic evaluation of swallowing (FEES) are effective but invasive, resource-intensive, and unsuitable for continuous monitoring. This study proposes a novel end-to-end RGB–D framework for automated swallowing event localization in continuous video streams. Methods: The framework enhances the AdaTAD backbone through three key innovations: (i) finding the optimal strategy to integrate depth information to capture subtle neck movements, (ii) examining the best adapter design for efficient temporal feature adaptation, and (iii) introducing a Kolmogorov–Arnold Network (KAN) decoder that leverages Chebyshev polynomials for non-linear temporal modeling. Evaluation on a proprietary swallowing dataset comprising 641 clips and 3153 annotated events demonstrated the effectiveness of the proposed framework. We analysed and compared the modification strategy across designs of adapters, decoders, input channel combinations, regression methods, and patch embedding techniques. Results: The optimized configuration (VideoMAE + GRNConvNeXtAdapter + KAN + RGD + boundary regression + sinusoidal embedding) achieved an average mAP of 83.25%, significantly surpassing the baseline I3D + RGB + MLP model (61.55%). Ablation studies further confirmed that each architectural component contributed incrementally to the overall improvement. Conclusions: These results establish the feasibility of accurate, non-invasive, and automated swallowing event localization using depth-augmented video. The proposed framework paves the way for practical dysphagia screening and long-term monitoring in clinical and home-care environments.

Keywords:

temporal action detection; computer vision; dysphagia screening; swallowing detection; Kolmogorov–Arnold Networks

1. Introduction

Dysphagia, characterized by difficulty in swallowing, is a significant geriatric syndrome affecting more than half of older adults with dementia and is associated with a 13-fold increase in mortality risk [1,2]. This condition impairs the natural swallowing process, which occurs approximately 200–1000 times daily in healthy individuals [3], and can lead to severe complications including aspiration pneumonia, which is the third leading cause of injury deaths in older people [4]. The prevalence of dysphagia ranges from 25% in adults to 60% in residential care facilities [5,6], imposing substantial economic burdens with median hospitalization charges exceeding $30,000 for aspiration pneumonia cases [7]. Current diagnostic approaches using videofluoroscopic swallowing (VFSS) and fiberoptic endoscopic evaluation of swallowing (FEES) are considered gold standards but present significant limitations including invasiveness, patient discomfort, radiation exposure, high costs, and requirements for specialized personnel [8]. These constraints make these assessments unsuitable for routine screening or continuous monitoring.

As dysphagia represents a gradual aging process that deteriorates progressively with cognitive decline and neurological disorders [9], continuous monitoring becomes essential to identify high-risk stages for timely intervention. Wearable sensor-based screening systems, accompanied by machine learning or deep learning, enable continuous monitoring using sensors such as inertial measurement units (IMUs) with accelerometers and gyroscopes, electromyography (EMG) electrodes, acoustic sensors, microphones, nasal airflow sensors, and strain sensors [10,11,12,13]. Although these systems show promise as alternatives to traditional methods, they face significant challenges, including poor signal quality, susceptibility to motion artifacts, and patient compliance issues, particularly among older adults with dementia [10,11,12,13]. In contrast, computer vision represents an emerging trend that addresses these limitations without requiring direct physical contact.

Computer vision approaches have emerged predominantly through conventional RGB camera systems for non-contact dysphagia monitoring [11]. Sakai et al. [14] developed a machine learning-based screening test using image recognition from iPad cameras to assess sarcopenia dysphagia. Similarly, Yamamoto et al. [15] employed a compact 3D camera to detect lip motion patterns and quantify swallowing dynamics during bolus flow in elderly participants. However, the performance of RGB-based systems has been moderate [14]. This moderate performance can be attributed to the inherent limitation that swallowing processes involving hyoid bone movement, laryngeal elevation, and subtle soft tissue deformation in the neck region are often too subtle for visual observation through conventional RGB cameras alone. RGB-D (depth) cameras offer significant advantages by providing three-dimensional (3D) spatial information that can capture subtle surface deformations and movement patterns invisible to conventional cameras. Lai et al. [16] successfully demonstrated this approach using RGB-D cameras to create a comprehensive dataset of swallowing activities and achieved an accuracy (F1-score) of 92% using Transformer X3D.

Despite its promising potential to address contact-based sensor limitations, computer vision for dysphagia screening faces significant technical challenges that hinder its transition to real-world monitoring. Current computer vision approaches, whether utilizing RGB or RGB-D modalities, are predominantly constrained to pre-designed experimental protocols, such as the Comprehensive Assessment Protocol for Swallowing (CAPS) [17], that focus on controlled swallowing and non-swallowing (such as reading and dry-swallowing) tasks rather than naturalistic eating behaviors. These studies typically require manual temporal segmentation and windowing of video sequences, with researchers manually clipping specific swallowing events from longer recordings for analysis. The absence of automated event detection in continuous video streams severely limits the potential for ubiquitous screening scenarios, such as capturing and analyzing entire meal-taking processes without human oversight.

Recent advances in computer vision have increasingly focused on temporal action localization (TAL) models, which aim to precisely identify and temporally segment specific actions within continuous video streams. TAL represents a fundamental paradigm shift from traditional frame-based classification approaches toward comprehensive temporal understanding, enabling systems to automatically detect when actions occur and determine their precise temporal boundaries without manual intervention [18]. The evolution of TAL approaches has progressed through several distinct paradigms, from early two-stage methods to sophisticated end-to-end methods. One-step approaches aim to directly predict actions at the frame level without generating proposals, offering simplicity but often struggling with long-range temporal dependencies. A notable example is the Convolutional–De-Convolutional (CDC) network, introduced by Shou et al. [19], which places CDC filters on top of 3D ConvNets that are effective for abstracting action semantics but reduce temporal length. Two-step approaches involve generating temporal proposals and then classifying them. The paradigm is exemplified by the Structural Segment Network that models the temporal structure of each action instance via a structured pyramid and introduces a decomposed discriminative model with an action classifier and completeness detector [20].

An end-to-end architecture for TAL is typically realized by integrating a feature extractor with a localization head. A prominent example of this paradigm is ActionFormer, a state-of-the-art single-stage detector [21]. It leverages a Transformer-based encoder to process multiscale feature representations, employing local self-attention to efficiently model temporal context, followed by a lightweight decoder that classifies each temporal moment and regresses the starting and ending boundaries of the action [21]. Conventionally, ActionFormer operates on features pre-extracted by an offline backbone, most notably I3D (Inflated 3D ConvNet) [22]. I3D adapts 2D convolutional kernels for 3D spatiotemporal data by “inflating” them, enabling it to effectively learn motion patterns from video and produce powerful, generic action features [22]. To create a more cohesive end-to-end system, recent work like AdaTAD (Adapter Tuning for Temporal Action Detection) has explored merging ActionFormer with adaptable backbones such as VideoMAE (Video Masked Autoencoder) [23,24]. VideoMAE, a self-supervised model pre-trained via a masked autoencoding strategy, excels at learning robust and instance-specific representations [24]. The key innovation of AdaTAD is its adaptive framework, which allows the VideoMAE backbone to be efficiently fine-tuned alongside the ActionFormer detector for the specific target dataset. However, this ambition of full end-to-end fine-tuning introduces significant practical limitations. The computational demands and memory requirements of jointly training these sophisticated architectures are substantial, and attempting to update all parameters of a large pre-trained backbone like VideoMAE risks the catastrophic forgetting of its valuable, generalized representations.

This challenge has driven the development and adoption of Parameter-Efficient Fine-Tuning (PEFT) approaches as a solution [25]. In fact, the AdaTAD framework is a prime example of this strategy in action [23]. Instead of full fine-tuning, PEFT methods enable the targeted adaptation of pre-trained models by updating only a small subset of parameters or by inserting lightweight, trainable modules (e.g., adapters) into the frozen backbone. This approach allows the model to specialize in the downstream task while preserving the integrity of the majority of its pre-trained knowledge.

Most existing TAL models are designed and pre-trained exclusively on three-channel RGB inputs, making it challenging to integrate modalities such as depth without significant architectural modifications or retraining, especially in specialized domains like swallowing event localization. Prior studies have explored substituting depth for one of the RGB channels in three-channel architectures. Vandrol et al. [26] tested RGD and RDB configurations for YOLOv8 weed detection, with RDB outperforming RGB in mean Average Precision (mAP). In addition, Liu et al. [27] used an RGD input with ResNet-50 for vehicle detection, reporting an 86% average precision versus 81% for RGB, validated on the Waymo dataset. These findings highlight the depth substitution’s potential. Additionally, we attempted to accommodate all RGB-D channels by early and late fusion, despite their performance being worse than the RGD combination. The poor performance of early fusion likely arises from the mismatch between the four-channel input and the VideoMAE backbone’s pretrained three-channel RGB structure, compounded by suboptimal weight initialization from the pretrained model, which struggles to adapt to the added depth data, leading to noisy feature extraction.

2. Materials and Methods

2.1. Overview

This study introduces a novel, end-to-end framework for localizing swallowing events in continuous RGB-D video streams. Our approach is built upon the AdaTAD architecture, which couples a VideoMAE feature extractor with an ActionFormer-based detector. We systematically enhance this baseline by exploring modifications across five key architectural components:

(a): Temporal Feature Adapter: We conduct a comparative analysis of five Parameter-Efficient Fine-Tuning (PEFT) adapters: the original BottleneckAdapter, InvertedConvNeXtAdapter, GRNConvNeXtAdapter, Adapter+, and Compacter, to optimize temporal feature learning while the VideoMAE backbone remains frozen.
(b): Decoder Head: We replace the standard Multilayer Perceptron (MLP) decoder in the detection head with a Kolmogorov–Arnold Network (KAN) that utilizes Chebyshev polynomials, hypothesizing it can better model the non-linear dynamics of swallowing.
(c): Input Modality: To leverage depth information, we evaluate a channel substitution strategy (RGD, RDB, DGB) and compare its performance against standard RGB and traditional early/late fusion RGB-D methods.
(d): Regression Method: We compare two strategies for boundary prediction: centerness-based regression and direct boundary regression.
(e): Patch Embedding: We assess the impact of different positional encoding techniques, specifically comparing the standard sinusoidal positional encoding with Rotary Positional Encoding (RoPE).

The methodology unfolds in two main phases. First, we perform an initial ablation study on the public THUMOS14 dataset to benchmark the performance of different adapter and decoder combinations. Second, we transition to our proprietary swallowing dataset, where we evaluate the complete set of modifications to identify the optimal configuration for domain-specific swallowing event detection. Model performance is quantified using mean Average Precision (mAP) at various temporal Intersection over Union (tIoU) thresholds, with the final proposed model’s efficacy validated against baseline models.

2.2. Data Acquisition and Labelling

Data was collected from 136 (32 male and 104 female; mean age: 85 ± 7.39) older adults at day centers and care homes. Half of the participants (n = 68) were clinically diagnosed with dysphagia and followed a dysphagia diet at International Dysphagia Diet Standardization Initiative (IDDSI) level 4 or below. The other 68 participants had no dysphagia. Individuals with a history of neck surgery, tracheostomy, or feeding tube use were excluded. Additionally, participants with cognitive impairments that hindered their ability to understand, respond to, or comply with study requirements were deemed ineligible. The mean age of participants was 85 ± 7.4 years. The study was conducted in accordance with the Declaration of Helsinki, and the protocol was approved by the Ethics Committee of the Hong Kong Polytechnic University Institutional Review Board (Reference Number: HSEARS20230302009) on 20 March 2023.

During the data acquisition phase, movements of the lower face and neck, including the lips, mandible, and throat, were recorded during swallowing and non-swallowing tasks using an RGB-D camera (Intel RealSense D435i, Intel Corp., Santa Clara, CA, USA). As illustrated in Figure 1, participants were seated in a neutral position with an eye-level marker on a table serving as a fixed reference point to ensure consistent posture. The camera, set to 30 frames per second with a resolution of 480 × 848 pixels, was positioned on a table approximately 35 cm from the target anatomical regions and angled at 45° relative to the horizontal plane to optimize visibility of the neck and mandible. All data were stored on a connected computer. We adapted the Comprehensive Assessment Protocol for Swallowing (CAPS) [17], but restricted the testing boluses to IDDSI levels 0 to 4 to reduce choking risks associated with more solid textures, in accordance with the recommendations from our occupational therapist. Participants performed tasks that included swallowing foods of varying textures or non-swallowing activities, such as coughing or speaking, with each video capturing the same action repeated five times consecutively.

Following data collection, an occupational therapist conducted a thorough manual review of the footage to identify and isolate clips with clear depictions of swallowing and non-swallowing actions. The therapist manually marked the timeframes for swallowing and non-swallowing events in each video based on observations of the RGB and depth (D) video data, clipping the footage to focus on these events. Videos were excluded if the camera view was obstructed, if involuntary participant movements occurred during feeding, or if the precise temporal onset of the swallowing event could not be determined. This curation process resulted in a final dataset of 641 complete video clips for subsequent analysis.

2.3. Data Processing

RGB data were utilized directly without additional processing. For depth data, we applied the processing pipeline recommended by the Intel RealSense SDK to optimize data quality. Initially, depth data were transformed into the disparity domain using a Depth-to-Disparity transform, which facilitated subsequent filtering. To reduce spatial noise, an edge-preserving spatial filter [28] was applied in the disparity domain. Temporal consistency was enhanced, and noise further minimized, through a temporal filter that leveraged information from previous frames in the disparity domain. Finally, the processed data were converted back to the depth domain using a Disparity-to-Depth transform, preparing them for subsequent analysis. Depth data were clipped at 1.0 m to suppress irrelevant background disparity and focus on the near-field neck region, then rescaled to 8-bit to match the dynamic range of RGB inputs. This choice standardized channel ranges while preserving salient geometric features. Empirically, we observed a negligible impact of alternative clipping thresholds (e.g., 0.8 or 1.2 m) or higher quantization, indicating robustness of the normalization strategy.

2.4. Baseline Model Architecture

We developed an end-to-end TAL architecture through our modifications to the AdaTAD framework, as a baseline model illustrated in Figure 2. For feature extraction, we leverage VideoMAE (Vision Transformer ViT) as the foundational backbone and pretrain it on the Kinetic-400 dataset [29].

To enhance training efficiency and reduce memory usage, we implemented a parameter-efficient fine-tuning (PEFT) approach that involves freezing the encoder of VideoMAE, thereby preserving its powerful pre-trained representations, and strategically tuning a feature adapter [30]. This adapter is crucial for capturing the dynamic temporal characteristics inherent in swallowing events.

To optimize the model for our data, we explored modifications on different parts, which include exploring the best adapter, comparing detector heads using MLP with KAN and finding the best input channel combination. Further, after finding the best adapter, we fine-tuned the adapter by searching for the optimal depth, scaling factor, and with addition of residual connections.

We adopted AdaTAD for its ability to capture long-range dependencies via local self-attention and hierarchical temporal pyramids, which scale efficiently compared to recurrent–convolutional hybrids. While recurrent–convolutional systems remain viable [31,32], their sequential recurrence limits parallelism and increases training time. Our adoption of AdaTAD harnesses transformer efficiency and enables parameter-efficient fine-tuning, achieving superior performance across numerous public dataset benchmarks and demonstrating superior results in many public dataset benchmarking tools. Nevertheless, we will retain ActionFormer with I3D feature extraction as the baseline for model comparison.

2.5. Adapter Exploration

We compare five recently proposed adapter designs, which include (a) Bottleneck Adapter, (b) InvertedConvNeXt Adapter, (c) GRNConvNeXt Adapter, (d) Adapter+, and (e) Compacter.

(a): Bottleneck Adapter. This module follows the compress–process–expand paradigm originally proposed for ResNet architectures and used in the AdaTAD baseline as the temporal-informative adapter [23]. It begins with a downscaling convolution that reduces the channel dimension to one quarter, followed by a SiLU (Sigmoid Linear Unit) activation (Figure 3a). A depthwise 1D convolution with kernel size 3 then captures local temporal patterns. Channel-wise interactions are introduced via a subsequent pointwise (1 × 1) 1D convolution. The output of these operations is added to the downscaled features through a residual connection. An up-projection layer restores the original channel dimension, and a second residual connection adds the result to the original input, enabling efficient temporal modeling and stable optimization.
(b): The InvertedConvNeXt Adapter, illustrated in Figure 3b, is inspired by the ConvNeXt block design [33]. It begins with a down-projection layer that reduces the feature dimensionality, followed by a depth-wise 1D convolution with a kernel size of 7, which efficiently captures long-range temporal dependencies across channels. Layer normalization is applied to stabilize training. Next, a patch-wise convolution expands the feature dimensionality fourfold, enhancing representational capacity. A GELU activation introduces non-linearity, and a second patch-wise convolution projects the features back to their original dimension. Finally, an up-projection layer restores the full representation, and a residual connection adds the adapter’s input to its output, improving gradient flow and enabling more effective learning.
(c): GRNConvNeXt Adapter. Inspired by ConvNeXt V2 [34], this variant integrates a Global Response Normalization (GRN) layer after the patch-wise convolutions (depicted in Figure 3c). GRN adaptively normalizes features based on their global response, improving information flow, generalization, and the modeling of long-range dependencies while maintaining stable training. The remainder of the block follows an inverted bottleneck structure, including an initial down-projection, depthwise convolution, GELU activation, and up-projection, with a residual connection to preserve input information.
(d): Adapter+ [35]. Following the Houlsby-style adapter (Figure 3d), Adapter+ removes internal normalization and employs channel-wise scaling to finely modulate the adapter’s contribution with negligible overhead.
(e): Compacter [36]. For the final variant, we implement Compacter (Figure 3e), which replaces the down- and up-projection layers with two low-rank parameterized hypercomplex multiplication (LPHM) layers. Instead of learning full weight matrices, each LPHM layer constructs its weight matrix as a sum of Kronecker products between shared “slow” weights and adapter-specific “fast” rank-one weights. This reduces parameter complexity from O(kd) in standard adapters to O(k + d). Concretely, A matrices are shared across layers to capture general adaptation knowledge, while B matrices are parameterized in low rank to model layer-specific adaptations.

Following feature extraction, the processed features are fed into the detection component, which is based on the ActionFormer framework (Figure 2). It begins with two convolutional layers to project the input features before passing them to the transformer encoder. Its encoder consists of seven transformer blocks, each employing local attention mechanisms to efficiently capture temporal dependencies. Notably, the last five transformer blocks apply 2× downsampling, progressively reducing temporal resolution to facilitate multi-scale temporal modeling. This architectural design creates a hierarchical feature pyramid across multiple temporal scales. Both the classification and regression heads operate on these multi-level features, enabling precise temporal action localization by capturing and integrating information at different temporal resolutions.

2.6. Decoder Selection

We specifically focus on tuning the decoder within ActionFormer to optimize its performance for event localization. Apart from the baseline multilayer perceptron (MLP) layer, we examine a more recent, powerful alternative based on the KAN theorem. The KAN theorem establishes that any multivariate continuous function can be represented as a finite sum of univariate continuous functions, providing a powerful theoretical foundation for modeling complex, high-dimensional relationships [37]. We explored the use of KAN in the decoder based on the hypothesis that it can better capture the inherent spatial and temporal coherence present in video data. This continuity property may enable the KAN-based decoder to exploit underlying structures in the data more effectively, potentially leading to improved model performance in temporal action localization.

To detail the core transformation, the input is first normalized to the range [−1, 1] by applying a hyperbolic tangent activation (1).

A = t a n h (X)

(1)

The main mechanism lies in the generation of Chebyshev polynomials as feature expansions [38]. For the normalized input A, the process proceeds as follows: Let k = [0, 1, …, D] represent the set of polynomial degrees considered. For each value of k, the Chebyshev polynomial of order k is computed using the identity (2).

θ = \arccos (c l a m p (A, - 1 + ϵ, 1 - ϵ)), P_{k} = c o s (k \cdot θ)

(2)

where ϵ is a small positive value to ensure numerical stability. This step expands the channel dimension of the input from its original value to (D + 1) times larger, producing a polynomial feature space P.

A standard convolutional operation with learnable weights W is then applied to P, as shown in (3):

Z = C o n v (P; W)

(3)

Finally, the features are passed through a normalization layer, and optionally a dropout layer, to improve training stability and generalization (4).

Y = D r o p o u t (N o r m (Z))

(4)

This module ultimately replaces the traditional 1D convolutional layer of MLP in the model, serving as the core component for classification and regression tasks.

For inference, the model processes input video streams in fixed windows of 768 frames, segmenting each window into 48 non-overlapping temporal chunks, each comprising 16 consecutive frames. These chunks are individually encoded by the backbone network, after which their corresponding feature maps are concatenated along the temporal dimension. To ensure temporal alignment with the original input, the concatenated features are subjected to spatial pooling and then temporally interpolated back to the full 768-frame resolution. The resulting unified feature map is subsequently fed into a streamlined ActionFormer detection head, which outputs precise temporal boundaries for each action instance within the sequence.

For training, the model is optimized using the AdamW algorithm, employing a learning rate schedule that combines an initial linear warm-up phase with subsequent cosine decay. The learning rates are set to 1 × 10⁻⁴ for the detection head and 2 × 10⁻⁴ for the adapter modules, reflecting the differential adaptation needs of these components.

2.7. Regression Head Selection

The regression head is responsible for precisely localizing the temporal boundaries of a detected swallowing event. To identify the most effective approach for this task, we compared two distinct regression strategies. The first is centerness-based regression, which predicts a single value for each temporal location within a potential action. This “centerness” score quantifies how close a given point in time is to the center of the action instance, effectively down-weighting predictions from locations near the start or end boundaries, which are often less reliable. The second strategy is boundary-based regression, which, for each temporal position, directly predicts two values: the distance to the action’s starting boundary (onset) and the distance to its ending boundary (offset). Both regression heads share the same bottleneck architecture as the classification head—comprising three 1D convolutional layers with kernel sizes of 1, 3, and 1—and operate across all levels of the temporal feature pyramid to enable multi-scale boundary prediction. Our experiments systematically evaluated both methods to determine which provided more accurate temporal localization for swallowing events.

2.8. Data Input Strategy

Since the pretrained model weights were learned on the three-channel Kinetics-400 RGB video dataset, directly using a four-channel input (RGBD) would cause dimensionality mismatch. To overcome this, we substituted each individual channel with the depth channel, creating RGD, RDB, and DGB inputs. Recognizing that the blue channel may offer less salient information in certain imaging scenarios, we hypothesized that incorporating depth would provide more informative spatial cues for localizing swallowing events. The performance of the RGD input was then compared to the standard RGB configuration.

In addition to substitution strategies, we implemented early and late fusion baselines. Early fusion applied a convolutional layer to project the four-channel RGBD input into a three-channel representation, preserving compatibility with pre-trained backbones. Late fusion processed RGB and depth streams independently and fused them in feature space before the detection head, mimicking the dimensionality of the RGB backbone output. These baselines provide fair comparisons for evaluating the effectiveness of our substitution strategy.

2.9. Patch Embedding Method

Sinusoidal Positional Encoding: In the context of transformer-based models applied to image processing tasks, such as image classification or object detection, sinusoidal positional encoding serves as a mechanism to incorporate spatial positional information into the model. Transformers, originally designed for sequential data like text, lack an inherent understanding of the two-dimensional spatial arrangement of pixels or patches in images. Sinusoidal positional encoding addresses this by assigning each image patch or pixel to a unique positional vector, generated using fixed, periodic mathematical functions. These vectors create a structured pattern that allows the model to distinguish the relative positions of patches, such as whether one patch is to the left or above another. This is particularly critical in vision transformers, where the model must capture spatial relationships to understand image content effectively. However, the fixed nature of sinusoidal encodings limits their adaptability to diverse image sizes or task-specific spatial patterns, potentially constraining performance in complex visual tasks.

Rotary Positional Encoding (RoPE): Rotary Positional Encoding (RoPE) represents an advanced approach to embedding positional information in transformer models for image processing, offering improved flexibility over traditional sinusoidal encodings. In vision tasks, RoPE integrates positional information by applying a rotation transformation to the feature representations of image patches or pixels, effectively encoding their relative spatial relationships. Unlike sinusoidal encodings, which rely on adding separate positional vectors, RoPE modifies the attention mechanism itself by rotating the query and key vectors based on their positions in the image grid. This rotation-based approach preserves the relative distances between patches, enabling the model to better capture spatial dependencies, such as the arrangement of objects in an image. RoPE’s ability to generalize across varying image sizes and its computational efficiency make it particularly effective for tasks like image segmentation or scene understanding, where precise spatial awareness is essential, often outperforming sinusoidal encodings in modern vision transformer architectures.

2.10. Model Training

Training used AdamW (lr 1 × 10⁻⁴ for detection head, 2 × 10⁻⁴ for adapters; weight decay 0.05). The backbone was frozen. A linear warmup over 5 epochs transitioned to cosine decay. Gradient clipping was applied at 1.0. Maximum training was 100 epochs, with convergence reached by epoch 60. Mixed precision was enabled, batch size was 2, and 20 data loader workers were used. EMA tracking and static graph optimization were applied. Experiments ran on NVIDIA RTX 4090 (Nvidia Corporation, Santa Clara, CA, USA), with 24 GB VRAM. BatchNorm was retained for compatibility with pre-trained weights, despite small batch sizes.

2.11. Model Evaluation

Before adapting our framework to the specific domain of swallowing events, we conducted a comprehensive evaluation and ablation study on the THUMOS14 dataset [39]. This dataset, widely recognized for its action recognition and temporal localization challenges, serves as an ideal benchmark for validating the core components of our modified ActionFormer. The primary objective of this phase is to assess and select the optimal configurations for both the temporal feature adapter (BottleneckAdapter, InvertedConvNeXtAdapter, and GRNConvNeXtAdapter) and the decoder (MLP vs. KAN). I3D (Inflated 3D ConvNet) [22], trained over 35 epochs, served as the baseline model for comparative analysis.

Following the successful benchmarking on THUMOS, we adapted and trained the model on our proprietary RGB-D swallowing dataset. A total of 641 video clips were collected, each with a mean duration of approximately 27 s (Table 1). Some clips appeared lengthy because they included both the preparatory and recovery phases of the event. Patients with dysphagia may require extra time to clear any residue, which contributed to the extended duration. These clips contained 3153 annotated episodes encompassing both swallowing and non-swallowing events. Model evaluation was conducted using a training-testing split of approximately 80% and 20%, respectively. I3D (inflated3D ConvNet) and the original AdaTAD served as the baseline models for comparative analysis.

Performance was assessed using the mean Average Precision (mAP) metric, computed at multiple temporal Intersection over Union (tIoU) thresholds of 0.1 to 0.5 on THUMOS and 0.1 to 0.7 on domain adaptation, as well as by averaging the mAP scores across these thresholds. For each tIoU threshold, the Average Precision (AP) for each action class was determined by matching predicted action segments to ground truth segments; a prediction was considered correct if its tIoU with a ground truth segment exceeded the threshold. Precision and recall were evaluated at various confidence score thresholds, with AP calculated as the area under the resulting precision-recall curve.

3. Results

3.1. Benchmarking and Initial Ablation

We evaluated our proposed model configurations against several baselines to assess the effectiveness of architectural modifications using the THUMOS dataset (Figure 4). Baseline systems included the ActionFormer (I3D + RGB) with its original detection head, as well as a variant where the original 1D convolutional (MLP) detection head was replaced with a KAN. For AdaTAD, we compared the original architecture (VideoMAE + BottleneckAdapter + MLP) with our modified versions and also examined variants in which the adapter module was removed to align with baseline settings.

Across all experiments, replacing the convolution (MLP) decoder with a KAN decoder consistently degraded performance. For example, the KAN-equipped ActionTransformer achieved an average mAP of 68.7%, compared to higher scores achieved by MLP-based counterparts. In contrast, our modified AdaTAD models with MLP decoders achieved substantially stronger results, with average mAPs ranging from 74% to 75%, outperforming the baselines, including the original AdaDAT. Among adapter designs, both the proposed InvertedConvNeXtAdapter (83.3% mAP) and GRNConvNeXtAdapter (83.5% mAP) slightly outperformed the original BottleneckAdapter (82.5% mAP), indicating that the new adapters provide modest yet consistent gains over the established design. Since ICNA and GRNNA have similar performance, to facilitate further evaluation, we will omit InvertedConvNeXtAdapter in Section 3.2.

3.2. Domain Adaptation and Model Performance

Overall, VideoMAE substantially outperformed I3D as the feature extractor, with average mAP improving from 61.55% (I3D + MLP + RGB baseline) to 81.65% when paired with the GRNConvNeXtAdapter and boundary regression. For the decoder, the Kolmogorov–Arnold Network (KAN) offered moderate gains over the baseline MLP in domain-specific settings, improving the average from 81.27% (MLP) to 82.50% (KAN).

Among adapter designs, the GRNConvNeXtAdapter yielded the highest performance (82.78%), consistently surpassing the BottleneckAdapter (82.50%), Adapter+ (81.91%), and Compacter (77.76%).

The RGD substitution achieved 82.78% average mAP, exceeding RGB (81.91%), DGB (81.50%), and both early and late fusion baselines (78.60–81.27%), but slightly worse than RDB (82.93%). Although the average gain over RGB is modest (~0.9 pp), it was consistent across thresholds, with greater benefit at stricter tIoUs (0.5–0.7). Importantly, our dataset spanned two hostel environments with differing ambient illumination, demonstrating robustness across real-world variability without environment-specific retraining. Compared with early and late fusion, RGD matched or exceeded performance while avoiding dual-stream complexity and additional GPU memory cost.

For regression strategies, boundary-based regression consistently outperformed centerness, achieving 82.50% compared to 81.27%. Finally, for patch embeddings, sinusoidal positional encoding yielded a higher average mAP (82.50%) than rotary embeddings (RoPE, 81.83%).

Taken together, the configuration comprised VideoMAE + GRNConvNeXtAdapter + KAN decoder + RGD input + boundary regression + sinusoidal embedding, achieving an average mAP of 82.50%, representing a clear improvement over the baseline I3D + RGB + MLP model (61.55%). The detailed results are summarized in Figure 5

3.3. Adapter Fine-Tuning

After identifying the GRNConvNeXtAdapter as the most effective temporal adapter, we performed a systematic fine-tuning to determine its optimal depth, MLP scaling factor, and the use of residual connections. Figure 6 reports the detailed performance across temporal IoU thresholds (0.1–0.7). Several key trends emerged.

First, increasing the adapter depth improved mAP up to 4 layers, with average performance peaking at 83.25% (Depth = 4, Scale = 2, Residual = No). Beyond 4 layers, deeper settings (6 layers) generally reduced mAP to approximately 81–82%, suggesting over-parameterization and optimization instability. Second, enlarging the scaling factor of the MLP improved representational capacity: mAP rose from 82.53% (Scale = 0.5) to 83.25% (Scale = 2), while further expansion to 4× led to reduced or inconsistent gains (≈82%). Third, enabling residual connections enhanced robustness at stricter thresholds (tIoU ≥ 0.6), yielding up to 74.47% at tIoU 0.6 and 61.86% at tIoU 0.7 (Depth = 4, Scale = 4, Residual = Yes), compared to 72.05% and 59.07% under the corresponding non-residual setting.

We evaluated the Kolmogorov–Arnold Network (KAN) decoder with Chebyshev degree D = 3. Compared to the MLP head, KAN offered small but consistent gains on the swallowing dataset (82.50% vs. 81.27% average mAP). The trade-off was modestly higher runtime and memory footprint, though still feasible for research-scale inference. Extended experiments varying D showed stable results, with D = 3 providing a good balance between accuracy and efficiency. We also conducted supplementary comparisons between five different KAN basis functions (including Chebyshev, B-splines, Fourier, Legendre, and Radial Basis Function), with two polynomial degrees D = 3 and D = 4, which are provided in Figure S1.

Overall, the best-performing configuration was achieved with Depth = 4, Scale = 2, with or without residuals, reaching an average of 83.25% (No residual) and 82.95% (With residual), outperforming both shallower or deeper variants as well as alternative adapters such as BottleneckAdapter, Adapter+, and Compacter.

In summary, we systematically evaluated the impact of different feature extractors, adapter modules, decoders, input channel configurations, regression heads, and patch embedding strategies on swallowing event localization. Subsequent to the best model architectural combination, we further fine-tune the adapter configurations based on depth, scaling factor, and the addition of a residual connection. Comprehensive results are presented in Figure 6, while Table 2 offers a direct comparison between the baseline AdaTAD framework and the optimized model.

4. Discussion

We present a novel end-to-end RGB-D framework for swallowing event localization that addresses the limitations of purely RGB-based methods. Our core contributions include the proposal of GRNConvNeXtAdapter, which outperforms other existing adapters, and the RGD input strategy, which replaces the noisy blue channel with depth information to capture subtler 3D spatial cues critical for swallowing detection. Experimental results demonstrated significant performance improvements resulting from both the adapter enhancement and the use of RGB-D input. While the KAN decoder did not show any advantage over the existing decoder on the benchmark THUMOS dataset, it outperformed in our proprietary dataset, highlighting its potential in domain-specific applications.

The superior performance of the GRNConvNeXtAdapter can be attributed to its integration of the Global Response Normalization (GRN) layer from ConvNeXt V2, which addresses a critical architectural limitation compared to the BottleneckAdapter. While both adapters enhance temporal feature adaptation, the GRN layer in GRNConvNeXtAdapter employs a three-step mechanism—global L2-norm feature aggregation, divisive normalization for inter-channel competition, and feature calibration—that effectively prevents feature collapse and promotes channel diversity [34]. This mechanism enables adaptive re-calibration of feature channels based on their global response, making them more discriminative for complex temporal patterns inherent in swallowing event detection. Additionally, the GRNConvNeXtAdapter benefits from the larger 7 × 1 depth-wise convolution kernel that captures longer-range temporal dependencies more effectively than traditional bottleneck designs [34]. The combination of enhanced inter-channel feature competition through GRN and improved temporal modeling creates a more robust architecture that learns balanced and informative representations across all feature channels, leading to superior swallowing event localization performance without additional computational overhead.

Though we hypothesized that RGD should have the best performance over other channel combinations, results reveal that RDB marginally outperforms RGD. We still select RGD because the performance difference is almost negligible, but this input combination is strategically advantageous for future integration with video photoplethysmography (vPPG), where preserving the red and green channel is crucial to provide the strongest plethysmography signals due to haemoglobin absorption characteristics, while the blue channel is typically the most susceptible to noise and provides minimal physiological information [40]. Similarly, in digital imaging systems, the blue channel suffers from lower sensor sensitivity and greater light scattering [41,42], making it a prime candidate for replacement. By substituting the blue channel with depth information, the RGD approach provides the model with spatial cues that enable accurate capture of soft tissue deformations and movement patterns in the neck region, critical features that are often invisible or poorly represented in conventional 2D RGB imagery.

Although the improvement of RGD over others is numerically small, it is consistent and particularly evident at stricter thresholds. Depth complements RGB by exposing geometric cues (e.g., soft-tissue deformation, laryngeal elevation) not reliably captured by color channels alone. Scientifically, we treat swallowing dynamics as a latent variable weakly expressed in passive RGB-D video through subtle motion, reflectance, and depth cues, and learn a task-specific mapping that aggregates these cues across space and time to infer a relative surrogate in real time without external excitation or specialized hardware. Compared with instrumented modalities (e.g., ultrasound [43] and thermography [44]), which can provide calibrated or subsurface information but requires devices, calibration, and controlled acquisition, our approach favors simplicity, cost, and deployability while remaining limited to line-of-sight evidence. Trade-offs include reliance on task-specific training, non-absolute outputs, and no guaranteed access to subsurface properties beyond what correlates with visible/depth cues.

Initial benchmarks on the THUMOS dataset showed that KANs offered no clear advantage over MLP, likely due to KANs’ higher computational overhead and the relatively simple or general nature of video data in THUMOS. However, in our swallowing event localization experiments using specialized RGBD data, KANs outperformed MLPs, likely because the complex feature space of RGBD data better leverages KANs’ advanced representational capacity. The compositional structure of KANs and their flexible, learnable activations appear especially adept at capturing the subtle, localized non-linear dynamics of swallowing events, making their theoretical strengths practically beneficial in this domain despite computational challenges.

Directly training a fully end-to-end RGB-D model is hindered by the mismatched characteristics of RGB and depth data—different resolutions, noise profiles, and bit depths—and the enormous computational and memory demands of processing two high-dimensional video streams simultaneously. As a result, many workflows resort to a pseudo end-to-end pipeline, with RGB-D features extracted or pre-trained separately before being fused in a downstream detector, which prevents joint optimization of multimodal representations. In this study, we followed the parameter-efficient fine-tuning strategy of AdaTAD by freezing a pre-trained backbone and inserting lightweight adapters that we modify, thereby preserving learned knowledge, drastically reducing resource requirements, and enabling effective adaptation to RGD swallowing event localization without the prohibitive costs of full end-to-end retraining. Additionally, we attempted to accommodate all RGB-D channels by early and late fusion, despite their performance being worse than the RGD combination. The poor performance of early fusion likely arises from the mismatch between the four-channel input and the VideoMAE backbone’s pretrained three-channel RGB structure, compounded by suboptimal weight initialization from the pretrained model, which struggles to adapt to the added depth data, leading to noisy feature extraction.

In our study, we evaluated model performance using mAP at tIoU thresholds ranging from 0.1 to 0.7, which align with the conventional range of 0.3 to 0.7 used in some other localization studies [45,46]. This choice was deliberate, catering to both temporal accuracy and precise temporal boundary delineation. Our decision to modify the AdaTAD architecture was driven by its demonstrated strength in achieving high temporal accuracy, which is well-suited for detecting the core dynamics of swallowing events, such as hyoid bone elevation, typically occurring in the central portion of the video sequence. In contrast, models like TALLFormer [47], which emphasize precise temporal boundary prediction, often demand significantly greater computational resources, making them less practical for our application. Given that the primary features of swallowing, such as laryngeal elevation and hyoid movement, are most prominent in the middle of the event, precise boundary localization is less critical.

Annotation transparency is central to clinical adoption. We defined swallowing onset as the first frame of rapid laryngeal elevation and offset as the return to baseline hyoid position. An occupational therapist performed primary annotations, which were checked by research staff, and disagreements were resolved by consensus. While formal inter-annotator agreement statistics were not computed, a limitation of the present work, the protocol ensured clear onset/offset identification.

Swallowing is inherently biomechanical, governed by coordinated soft tissue and skeletal dynamics. Future research should integrate finite-element (FEM) simulations to provide motion priors or synthetic sequences that improve interpretability and generalization. Such simulations could align predicted trajectories with physiological expectations, thereby enhancing both performance and clinical credibility.

The choice of preprocessing also warrants consideration. Depth clipping to 1.0 m and rescaling to 8-bit were pragmatic design decisions that standardized dynamic range and suppressed irrelevant background disparity. Empirically, these steps preserved salient spatial features, with negligible observed impact on performance. Cross-validation was not conducted due to computational limitations and the substantial annotation effort required for our dataset; instead, an 80/20 stratified split was employed, consistent with standard practice in temporal action detection.

Our experiments also provide insight into decoder trade-offs. The Kolmogorov–Arnold Network (KAN) with Chebyshev degree D = 3 yielded gains over MLP on the swallowing dataset, albeit at modest runtime and memory overheads. While KANs did not outperform MLPs in general-purpose video benchmarks, their compositional flexibility appears advantageous in specialized biomedical settings. Exploring alternative KAN variants, such as Fourier- or spline-based formulations, may further expand their utility.

Finally, qualitative analysis highlights that while temporal saliency visualization methods (e.g., Grad-CAM) are less informative for dense action localization tasks, case-level inspection remains valuable. We observed scenarios where RGD corrected errors made by RGB alone, particularly under variable lighting and subtle tissue motion. These examples reinforce the role of depth in providing robust cues that generalize across naturalistic monitoring conditions. In home-based deployment scenarios, however, numerous uncontrolled factors, such as ambient lighting variability and sensor placement, can degrade performance. A generalized AI model is therefore needed to enhance robustness across heterogeneous environments. Optimization of medical devices is crucial for the efficiency of healthcare treatments [48], and future research should focus on building adaptive systems capable of handling these dynamic, real-world conditions.

There were several limitations to be noted. Due to strict computational power constraints, batch normalization was applied with an extremely small batch size of two, which is suboptimal. This is due to the setting results in unstable batch statistics that introduce noise and hinder effective training and generalization [49]. A promising remedy in future work is video-specific patch/token compression (e.g., token merging or learned patch compression) to reduce visual tokens per frame, lowering memory usage and enabling larger BN batches or normalization methods less sensitive to batch size. Additionally, our use of KAN was limited to a Chebyshev polynomial-based variant, yet various KAN architectures exist [50], each with strengths tailored to different data characteristics (e.g., Fourier KANs [51], B-spline KANs [37]). Currently, selecting the optimal KAN type lacks systematic guidelines and often relies on empirical testing, increasing computational demands [50]. Advancing automated or adaptive KAN selection strategies or developing theoretical insights into basis function suitability for specific tasks represents a crucial direction for future research. Finally, we recognize that results can be better presented with a detailed error diagnostic tool to understand model performance qualitatively. We will include the approach proposed by Alwassel et al. [52] in our future study.

5. Conclusions

This study introduced an end-to-end RGB–D framework for swallowing event localization that overcomes key limitations of RGB-only approaches in dysphagia monitoring. We enhanced the AdaTAD backbone through four directions: (1) an RGD input strategy replacing the noisy blue channel with depth information to capture subtle soft-tissue deformations, (2) a GRNConvNeXtAdapter incorporating Global Response Normalization for improved temporal feature adaptation, (3) a Kolmogorov–Arnold Network (KAN) decoder leveraging Chebyshev polynomials for non-linear temporal pattern recognition, and (4) a boundary regression head and sinusoidal patch embedding method to further optimize model precision

Comprehensive evaluation on our proprietary swallowing dataset (641 clips, 3153 events) demonstrated substantial improvements. The optimized configuration (VideoMAE + GRNConvNeXtAdapter + KAN + RGD + boundary regression + sinusoidal embedding) achieved an average mAP of 83.25%, compared with 61.55% for the baseline I3D + RGB + MLP model. Ablation studies confirmed that each component—depth substitution, adapter design, regression strategy, and embedding choice—contributed incrementally to the overall performance.

These findings establish the feasibility of precise, automated temporal localization of swallowing events in continuous video streams without manual segmentation. By combining architectural innovations with a principled modality substitution, this framework advances the development of non-invasive, real-world dysphagia screening tools, supporting future deployment in clinical and home-care monitoring.

Supplementary Materials

The following supporting information can be downloaded at: https://www.mdpi.com/article/10.3390/ai6110276/s1, Figure S1: Performance of ablated modified AdaTAD compared with different KAN functions and degree of polynomial.

Author Contributions

Conceptualization: D.W.-C.W. and J.C.-W.C.; Data curation, D.K.-H.L., Z.-A.Z., A.Y.-C.T., J.L. and J.Z.-S.Z.; Formal analysis, D.K.-H.L., Z.-A.Z., A.Y.-C.T., J.L. and J.Z.-S.Z.; Funding acquisition, J.C.-W.C.; Investigation, D.K.-H.L., Z.-A.Z., A.Y.-C.T., J.L. and J.Z.-S.Z.; Methodology, D.W.-C.W. and J.C.-W.C.; Project administration, D.W.-C.W. and J.C.-W.C.; Software, D.K.-H.L., Z.-A.Z. and A.Y.-C.T.; Supervision, D.W.-C.W. and J.C.-W.C.; Validation, D.K.-H.L. and A.Y.-C.T.; Visualization, Z.-A.Z. and A.Y.-C.T.; Writing—original draft, D.K.-H.L. and Z.-A.Z.; Writing—review & editing, D.W.-C.W. and J.C.-W.C. All authors have read and agreed to the published version of the manuscript.

Funding

This study was supported by the Health and Medical Research Fund from the Health Bureau, Hong Kong (reference number: 19200461 and 21221871).

Institutional Review Board Statement

The study was conducted in accordance with the Declaration of Helsinki, and the protocol was approved by the Ethics Committee of the Hong Kong Polytechnic University Institutional Review Board (Reference Number: HSEARS20230302009) on 20 March 2023.

Informed Consent Statement

Informed consent was obtained from all subjects involved in the study.

Data Availability Statement

The datasets generated and analyzed during the current study are available from the corresponding author upon reasonable request. Due to privacy concerns and ethical considerations, the RGB-D data containing identifiable facial information of patients cannot be publicly disclosed.

Acknowledgments

We thank Kowloon Home for the Aged Blind from The Hong Kong Society for the Blind, Kei Oi Neighborhood Elderly Centre and Li Ka Shing Care and Attention Home for the Elderly from the Hong Kong Sheng Kung Hui for facilitating subject recruitment and facilities to conduct the experiment.

Conflicts of Interest

The authors declare no conflicts of interest.

References

Putri, A.R.; Chu, Y.-H.; Chen, R.; Chiang, K.-J.; Banda, K.J.; Liu, D.; Lin, H.-C.; Niu, S.-F.; Chou, K.-R. Prevalence of swallowing disorder in different dementia subtypes among older adults: A meta-analysis. Age Ageing 2024, 53, afae037. [Google Scholar] [CrossRef]
Altman, K.W.; Yu, G.-P.; Schaefer, S.D. Consequence of dysphagia in the hospitalized patient: Impact on prognosis and hospital resources. Arch. Otolaryngol.—Head Neck Surg. 2010, 136, 784–789. [Google Scholar] [CrossRef] [PubMed]
Lear, C.S.; Flanagan, J., Jr.; Moorrees, C. The frequency of deglutition in man. Arch. Oral Biol. 1965, 10, 83-IN15. [Google Scholar] [CrossRef] [PubMed]
Kramarow, E.; Warner, M.; Chen, L.-H. Food-related choking deaths among the elderly. Inj. Prev. 2014, 20, 200–203. [Google Scholar] [CrossRef]
De Sire, A.; Ferrillo, M.; Lippi, L.; Agostini, F.; de Sire, R.; Ferrara, P.E.; Raguso, G.; Riso, S.; Roccuzzo, A.; Ronconi, G. Sarcopenic dysphagia, malnutrition, and oral frailty in elderly: A comprehensive review. Nutrients 2022, 14, 982. [Google Scholar] [CrossRef]
Bhattacharyya, N. The prevalence of dysphagia among adults in the United States. Otolaryngol.—Head Neck Surg. 2014, 151, 765–769. [Google Scholar] [CrossRef]
Wu, C.-P.; Chen, Y.-W.; Wang, M.-J.; Pinelis, E. National trends in admission for aspiration pneumonia in the United States, 2002–2012. Ann. Am. Thorac. Soc. 2017, 14, 874–879. [Google Scholar] [CrossRef]
de Castro, M.A.F.; Dedivitis, R.A.; de Matos, L.L.; Baraúna, J.C.; Kowalski, L.P.; de Carvalho Moura, K.; Partezani, D.H. Endoscopic and videofluoroscopic evaluations of swallowing for dysphagia: A systematic review. Braz. J. Otorhinolaryngol. 2025, 91, 101598. [Google Scholar] [CrossRef]
Feng, H.-Y.; Zhang, P.-P.; Wang, X.-W. Presbyphagia: Dysphagia in the elderly. World J. Clin. Cases 2023, 11, 2363. [Google Scholar] [CrossRef]
Lai, D.K.-H.; Cheng, E.S.-W.; Lim, H.-J.; So, B.P.-H.; Lam, W.-K.; Cheung, D.S.K.; Wong, D.W.-C.; Cheung, J.C.-W. Computer-aided screening of aspiration risks in dysphagia with wearable technology: A Systematic Review and meta-analysis on test accuracy. Front. Bioeng. Biotechnol. 2023, 11, 1205009. [Google Scholar] [CrossRef]
Wong, D.W.-C.; Wang, J.; Cheung, S.M.-Y.; Lai, D.K.-H.; Chiu, A.T.-S.; Pu, D.; Cheung, J.C.-W.; Kwok, T.C.-Y. Current Technological Advances in Dysphagia Screening: Systematic Scoping Review. J. Med. Internet Res. 2025, 27, e65551. [Google Scholar] [CrossRef] [PubMed]
So, B.P.-H.; Chan, T.T.-C.; Liu, L.; Yip, C.C.-K.; Lim, H.-J.; Lam, W.-K.; Wong, D.W.-C.; Cheung, D.S.K.; Cheung, J.C.-W. Swallow detection with acoustics and accelerometric-based wearable technology: A scoping review. Int. J. Environ. Res. Public Health 2022, 20, 170. [Google Scholar] [CrossRef] [PubMed]
Yao, K.-Y.; Lai, D.K.-H.; Lim, H.-J.; So, B.P.-H.; Chan, A.C.-H.; Yip, P.Y.-M.; Wong, D.W.-C.; Dai, B.; Zhao, X.; Wong, S.H.D.; et al. 2H-MoS2 lubrication-enhanced MWCNT nanocomposite for subtle bio-motion piezoresistive detection with deep learning integration. Mater. Des. 2025, 253, 113861. [Google Scholar] [CrossRef]
Sakai, K.; Gilmour, S.; Hoshino, E.; Nakayama, E.; Momosaki, R.; Sakata, N.; Yoneoka, D. A machine learning-based screening test for sarcopenic dysphagia using image recognition. Nutrients 2021, 13, 4009. [Google Scholar] [CrossRef]
Yamamoto, Y.; Sato, H.; Kanada, H.; Iwashita, Y.; Hashiguchi, M.; Yamasaki, Y. Relationship between lip motion detected with a compact 3D camera and swallowing dynamics during bolus flow swallowing in Japanese elderly men. J. Oral Rehabil. 2020, 47, 449–459. [Google Scholar] [CrossRef]
Lai, D.K.-H.; Cheng, E.S.-W.; So, B.P.-H.; Mao, Y.-J.; Cheung, S.M.-Y.; Cheung, D.S.K.; Wong, D.W.-C.; Cheung, J.C.-W. Transformer Models and Convolutional Networks with Different Activation Functions for Swallow Classification Using Depth Video Data. Mathematics 2023, 11, 3081. [Google Scholar] [CrossRef]
Lim, H.-J.; Lai, D.K.-H.; So, B.P.-H.; Yip, C.C.-K.; Cheung, D.S.K.; Cheung, J.C.-W.; Wong, D.W.-C. A comprehensive assessment protocol for swallowing (CAPS): Paving the way towards computer-aided dysphagia screening. Int. J. Environ. Res. Public Health 2023, 20, 2998. [Google Scholar] [CrossRef]
Hu, K.; Shen, C.; Wang, T.; Xu, K.; Xia, Q.; Xia, M.; Cai, C. Overview of temporal action detection based on deep learning. Artif. Intell. Rev. 2024, 57, 26. [Google Scholar] [CrossRef]
Shou, Z.; Chan, J.; Zareian, A.; Miyazawa, K.; Chang, S.-F. CDC: Convolutional-de-convolutional networks for precise temporal action localization in untrimmed videos. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 5734–5743. [Google Scholar]
Zhao, Y.; Xiong, Y.; Wang, L.; Wu, Z.; Tang, X.; Lin, D. Temporal action detection with structured segment networks. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2914–2923. [Google Scholar]
Zhang, C.; Wu, J.; Li, Y. ActionFormer: Localizing Moments of Actions with Transformers. arXiv 2022, arXiv:2202.07925. [Google Scholar] [CrossRef]
Carreira, J.; Zisserman, A. Quo vadis, action recognition? A new model and the kinetics dataset. In Proceedings of the the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 6299–6308. [Google Scholar]
Liu, S.; Zhang, C.-L.; Zhao, C.; Ghanem, B. End-to-end temporal action detection with 1b parameters across 1000 frames. In Proceedings of the the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; pp. 18591–18601. [Google Scholar]
Tong, Z.; Song, Y.; Wang, J.; Wang, L. Videomae: Masked autoencoders are data-efficient learners for self-supervised video pre-training. Adv. Neural Inf. Process. Syst. 2022, 35, 10078–10093. [Google Scholar]
Han, Z.; Gao, C.; Liu, J.; Zhang, J.; Zhang, S.Q. Parameter-efficient fine-tuning for large models: A comprehensive survey. arXiv 2024, arXiv:2403.14608. [Google Scholar]
Vandrol, J.; Perren, J.; Koller, A. Effect of Depth Band Replacement on Red, Green and Blue Image for Deep Learning Weed Detection. Sensors 2025, 25, 161. [Google Scholar] [CrossRef]
Liu, Z.; Farrell, J.; Wandell, B.A. ISETAuto: Detecting Vehicles with Depth and Radiance Information. IEEE Access 2021, 9, 41799–41808. [Google Scholar] [CrossRef]
Gastal, E.S.L.; Oliveira, M.M. Domain transform for edge-aware image and video processing. ACM Trans. Graph. 2011, 30, 1–12. [Google Scholar] [CrossRef]
Kay, W.; Carreira, J.; Simonyan, K.; Zhang, B.; Hillier, C.; Vijayanarasimhan, S.; Viola, F.; Green, T.; Back, T.; Natsev, P. The kinetics human action video dataset. arXiv 2017, arXiv:1705.06950. [Google Scholar] [CrossRef]
Lin, Z.; Geng, S.; Zhang, R.; Gao, P.; De Melo, G.; Wang, X.; Dai, J.; Qiao, Y.; Li, H. Frozen clip models are efficient video learners. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 25–27 October 2022; pp. 388–404. [Google Scholar]
Pratticò, D.; Laganà, F.; Oliva, G.; Fiorillo, A.S.; Pullano, S.A.; Calcagno, S.; Carlo, D.D.; Foresta, F.L. Integration of LSTM and U-Net Models for Monitoring Electrical Absorption with a System of Sensors and Electronic Circuits. IEEE Trans. Instrum. Meas. 2025, 74, 1–11. [Google Scholar] [CrossRef]
Prattico, D.; Laganá, F.; Oliva, G.; Fiorillo, A.S.; Pullano, S.A.; Calcagno, S.; Carlo, D.D.; Foresta, F.L. Sensors and Integrated Electronic Circuits for Monitoring Machinery on Wastewater Treatment: Artificial Intelligence Approach. In Proceedings of the 2024 IEEE Sensors Applications Symposium (SAS), Naples, Italy, 23–25 July 2024; pp. 1–6. [Google Scholar]
Liu, Z.; Mao, H.; Wu, C.-Y.; Feichtenhofer, C.; Darrell, T.; Xie, S. A convnet for the 2020s. In Proceedings of the the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 11976–11986. [Google Scholar]
Woo, S.; Debnath, S.; Hu, R.; Chen, X.; Liu, Z.; Kweon, I.S.; Xie, S. Convnext v2: Co-designing and scaling convnets with masked autoencoders. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 16133–16142. [Google Scholar]
Steitz, J.-M.O.; Roth, S. Adapters Strike Back. arXiv 2024, arXiv:2406.06820. [Google Scholar] [CrossRef]
Karimi Mahabadi, R.; Henderson, J.; Ruder, S. Compacter: Efficient Low-Rank Hypercomplex Adapter Layers. arXiv 2021, arXiv:2106.04647. [Google Scholar] [CrossRef]
Liu, Z.; Wang, Y.; Vaidya, S.; Ruehle, F.; Halverson, J.; Soljačić, M.; Hou, T.Y.; Tegmark, M. Kan: Kolmogorov-arnold networks. arXiv 2024, arXiv:2404.19756. [Google Scholar]
SS, S.; AR, K.; KP, A. Chebyshev polynomial-based kolmogorov-arnold networks: An efficient architecture for nonlinear function approximation. arXiv 2024, arXiv:2405.07200. [Google Scholar]
Idrees, H.; Zamir, A.R.; Jiang, Y.-G.; Gorban, A.; Laptev, I.; Sukthankar, R.; Shah, M. The thumos challenge on action recognition for videos “in the wild”. Comput. Vis. Image Underst. 2017, 155, 1–23. [Google Scholar] [CrossRef]
Kumar, M.; Veeraraghavan, A.; Sabharwal, A. DistancePPG: Robust non-contact vital signs monitoring using a camera. Biomed. Opt. Express 2015, 6, 1565–1588. [Google Scholar] [CrossRef] [PubMed]
Wyatt, P.J. Differential light scattering and the measurement of molecules and nanoparticles: A review. Anal. Chim. Acta X 2021, 7, 100070. [Google Scholar] [CrossRef] [PubMed]
Liu, P.T.; Ruan, D.B.; Yeh, X.Y.; Chiu, Y.C.; Zheng, G.T.; Sze, S.M. Highly responsive blue light sensor with amorphous indium-zinc-oxide thin-film transistor based architecture. Sci. Rep. 2018, 8, 8153. [Google Scholar] [CrossRef]
Allen, J.E.; Clunie, G.M.; Winiker, K. Ultrasound: An emerging modality for the dysphagia assessment toolkit? Curr. Opin. Otolaryngol. Head Neck Surg. 2021, 29, 213–218. [Google Scholar] [CrossRef]
de Almeida, A.N.S.; de Souza Ferreira, S.L.; Balata, P.M.M.; da Cunha, D.A.; Pernambuco, L.; da Silva, H.J. Thermography in complementary assessments of head and neck muscles: A scoping review. J. Oral Rehabil. 2022, 49, 1188–1196. [Google Scholar] [CrossRef]
Liberatori, B.; Conti, A.; Rota, P.; Wang, Y.; Ricci, E. Test-time zero-shot temporal action localization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 17–21 June 2024; pp. 18720–18729. [Google Scholar]
Reka, A.; Borza, D.L.; Reilly, D.; Balazia, M.; Bremond, F. Introducing Gating and Context into Temporal Action Detection. In Proceedings of the European Conference on Computer Vision, MiCo Milano, Italy, 8–13 September 2024; pp. 322–334. [Google Scholar]
Cheng, F.; Bertasius, G. TALLFormer: Temporal Action Localization with a Long-memory Transformer. arXiv 2022, arXiv:2204.01680. [Google Scholar] [CrossRef]
Laganà, F.; Faccì, A.R. Parametric optimisation of a pulmonary ventilator using the Taguchi method. J. Electr. Eng. 2025, 76, 265–274. [Google Scholar] [CrossRef]
Keskar, N.S.; Mudigere, D.; Nocedal, J.; Smelyanskiy, M.; Tang, P.T.P. On large-batch training for deep learning: Generalization gap and sharp minima. arXiv 2016, arXiv:1609.04836. [Google Scholar]
Somvanshi, S.; Javed, S.A.; Islam, M.M.; Pandit, D.; Das, S. A survey on kolmogorov-arnold network. ACM Comput. Surv. 2025, 58, 1–35. [Google Scholar] [CrossRef]
Zhang, J.; Fan, Y.; Cai, K.; Wang, K. Kolmogorov-Arnold Fourier Networks. arXiv 2025, arXiv:2502.06018. [Google Scholar] [CrossRef]
Alwassel, H.; Caba Heilbron, F.; Escorcia, V.; Ghanem, B. Diagnosing Error in Temporal Action Detectors. arXiv 2018, arXiv:1807.10706. [Google Scholar] [CrossRef]

Figure 1. Experimental Setup for Data Collection. The participant was seated in a neutral position, with an RGB-D camera placed on a table approximately 35 cm from the target anatomical regions (lips, mandible, and throat) and angled at 45° relative to the horizontal plane. The left side of the figure shows an illustrative screenshot displaying both the RGB and depth (D) channel outputs, with the depth visualization colorized using a jet colormap.

Figure 2. Architecture of The Baseline AdaTAD. The model comprises a feature extractor utilizing the VideoMAE-B backbone with a parameter-efficient fine-tuning (PEFT) encoder, followed by different temporal feature adapters. The detector, adapted from the original ActionFormer, includes a classification head and a regression head, implemented as a multilayer perceptron (MLP) via 1D convolution.

Figure 3. Temporal feature adapter configurations: (a) BottleneckAdapter; (b) InvertedConvNeXtAdapter; (c) GRNConvNeXtAdapter; (d) Adapter+; (e) Compacter.

Figure 4. Performance of ablated modified AdaTAD compared with baseline models in the benchmark THUMOS dataset. BNA: BottleneckAdapter; GRNNA: GRNConvNeXtAdapter; I3D: Inflated 3D ConvNet; ICNA: InvertedConvNeXtAdapter; KAN: Kolmogorov–Arnold Network; mAP: mean average precision; MLP: multilayer perceptron via 1D convolution; tIoU: Intersection over Union; VMAE: Video Masked Autoencoder.

Figure 5. Performance of ablated modified AdaTAD compared with baseline models in the swallowing dataset. BNA: BottleneckAdapter; GRNNA: GRNConvNeXtAdapter; I3D: Inflated 3D ConvNet; ICNA: InvertedConvNeXtAdapter; KAN: Kolmogorov–Arnold Network; mAP: mean average precision; MLP: multilayer perceptron; TFA: temporal feature adapter; tIoU: temporal Intersection over Union; VMAE: Video Masked Autoencoder. RGBD_E: RGB-D with early fusion; RGBD_L: RGB-D with late fusion.

Figure 6. Performance of ablated fine-tuned GRNConvNeXtAdapter compared with the modified model in the swallowing dataset. mAP: mean average precision; tIoU: temporal Intersection over Union.

Table 1. Dataset composition and descriptive statistics for the RGB-D swallowing video, stratified by training and testing set.

Data Metric	Training	Testing	Total
No. of video clips	512 (79.88%)	129 (20.12%)	641
Total duration (min)	229.6	55.17	284.71
Mean (SD) duration (s)	26.90 (20.92)	25.66 (20.02)	26.65 (20.75)
Max duration (s)	150.80	122.22	–
Min duration (s)	4.74	7.04	–
Annotated swallowing events	1091	260	1351
Annotated non-swallowing events	1427	375	1802
Total No. of events	2518 (79.86%)	635 (20.14%)	3153

SD: standard deviation.

Table 2. Comparison between the baseline AdaTAD framework and our optimized model. The inference is performed on a randomly initialized four-channel video consisting of 720 frames with a resolution of 480 × 640.

Data Metric	Baseline	Optimized Model
Total Parameters	49.5 M	74.4 M
GPU Memory Used	4251.5 MB	4346.5 MB
Inference Speed	304.0 fps	219.1 fps
Throughput	0.40 samples/s	0.29 samples/s

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Lai, D.K.-H.; Zhao, Z.-A.; Tam, A.Y.-C.; Li, J.; Zhang, J.Z.-S.; Wong, D.W.-C.; Cheung, J.C.-W. End-to-End Swallowing Event Localization via Blue-Channel-to-Depth Substitution in RGB-D: GRNConvNeXt-Modified AdaTAD with KAN-Chebyshev Decoder. AI 2025, 6, 276. https://doi.org/10.3390/ai6110276

AMA Style

Lai DK-H, Zhao Z-A, Tam AY-C, Li J, Zhang JZ-S, Wong DW-C, Cheung JC-W. End-to-End Swallowing Event Localization via Blue-Channel-to-Depth Substitution in RGB-D: GRNConvNeXt-Modified AdaTAD with KAN-Chebyshev Decoder. AI. 2025; 6(11):276. https://doi.org/10.3390/ai6110276

Chicago/Turabian Style

Lai, Derek Ka-Hei, Zi-An Zhao, Andy Yiu-Chau Tam, Jing Li, Jason Zhi-Shen Zhang, Duo Wai-Chi Wong, and James Chung-Wai Cheung. 2025. "End-to-End Swallowing Event Localization via Blue-Channel-to-Depth Substitution in RGB-D: GRNConvNeXt-Modified AdaTAD with KAN-Chebyshev Decoder" AI 6, no. 11: 276. https://doi.org/10.3390/ai6110276

APA Style

Lai, D. K.-H., Zhao, Z.-A., Tam, A. Y.-C., Li, J., Zhang, J. Z.-S., Wong, D. W.-C., & Cheung, J. C.-W. (2025). End-to-End Swallowing Event Localization via Blue-Channel-to-Depth Substitution in RGB-D: GRNConvNeXt-Modified AdaTAD with KAN-Chebyshev Decoder. AI, 6(11), 276. https://doi.org/10.3390/ai6110276

Article Menu

End-to-End Swallowing Event Localization via Blue-Channel-to-Depth Substitution in RGB-D: GRNConvNeXt-Modified AdaTAD with KAN-Chebyshev Decoder

Abstract

1. Introduction

2. Materials and Methods

2.1. Overview

2.2. Data Acquisition and Labelling

2.3. Data Processing

2.4. Baseline Model Architecture

2.5. Adapter Exploration

2.6. Decoder Selection

2.7. Regression Head Selection

2.8. Data Input Strategy

2.9. Patch Embedding Method

2.10. Model Training

2.11. Model Evaluation

3. Results

3.1. Benchmarking and Initial Ablation

3.2. Domain Adaptation and Model Performance

3.3. Adapter Fine-Tuning

4. Discussion

5. Conclusions

Supplementary Materials

Author Contributions

Funding

Institutional Review Board Statement

Informed Consent Statement

Data Availability Statement

Acknowledgments

Conflicts of Interest

References

Share and Cite

Article Metrics

Article Access Statistics

Further Information

Guidelines

MDPI Initiatives

Follow MDPI