Multimodal Model for Automated Pain Assessment: Leveraging Video and fNIRS

Vianto, Jo; Divakaran, Anjitha; Yang, Hyungjeong; Yeom, Soonja; Kim, Seungwon; Kim, Soohyung; Shin, Jieun

doi:10.3390/app15095151

Open AccessArticle

Multimodal Model for Automated Pain Assessment: Leveraging Video and fNIRS

by

Jo Vianto

¹

,

Anjitha Divakaran

¹

,

Hyungjeong Yang

^1,*

,

Soonja Yeom

²

,

Seungwon Kim

¹

,

Soohyung Kim

¹

and

Jieun Shin

³

¹

Department of Artificial Intelligence Convergence, Chonnam National University, Gwangju 61186, Republic of Korea

²

School of ICT, University of Tasmania, Sandy Bay, TAS 7005, Australia

³

Department of Psychology, Chonnam National University, Gwangju 61186, Republic of Korea

^*

Author to whom correspondence should be addressed.

Appl. Sci. 2025, 15(9), 5151; https://doi.org/10.3390/app15095151

Submission received: 20 February 2025 / Revised: 29 April 2025 / Accepted: 29 April 2025 / Published: 6 May 2025

Download

Browse Figures

Versions Notes

Abstract

Pain assessment is a challenging task for clinicians due to its subjective nature, particularly in individuals with communication difficulties, cognitive impairments, or severe disabilities. Traditional methods such as the Visual Analogue Scale (VAS), Numerical Rating Scale (NRS), and Verbal Rating Scale (VRS) rely heavily on patient feedback, which can be inconsistent and subjective. To address these limitations, developing objective and reliable pain assessment tools that incorporate advanced technologies, such as multimodal data integration from video and fNIRS, is important for improving clinical outcomes. However, challenges such as noise susceptibility in fNIRS signals must be carefully addressed to realize their full potential. Recent studies have explored automatic pain assessment using machine learning and deep learning techniques, which require high-quality data that can accurately represent pain categories. In response to the introduction of a new dataset in the AI4Pain Challenge, we proposed a multimodal neural network model utilizing attention-based fusion to improve overall accuracy (MMAPA). Our model leverages video and fNIRS modalities as well as manually extracted statistical features. We also implemented fNIRS signal preprocessing and artifact noise filtering, which significantly improved performance on both the fNIRS and statistical feature branches. On the hidden test set, our model achieved an accuracy of 51.33%, outperforming the official baseline of 43.33%. To evaluate generalizability, we further tested our method on the BioVid Heat Pain Database, where our fusion model achieved the highest accuracy in the 10-fold cross-validation setting, outperforming PainAttNet and unimodal variants. These results highlight the effectiveness of our multimodal attention-based approach in improving pain classification performance across datasets.

Keywords:

attention-based fusion; brain–computer interface; deep learning; fNIRS; multimodal model; pain assessment

1. Introduction

One of the tasks that clinicians find difficult is pain assessment [1]. The challenges in pain assessment vary across individuals, age groups, and circumstances, especially for those with communication difficulties. The subjective nature of pain makes it challenging for clinicians to assess patient behavior when determining pain levels. Additionally, certain conditions, such as cognitive impairments or severe disability, can further complicate accurate pain evaluation. Developing objective and reliable pain assessment tools that incorporate advanced technologies, such as multimodal data integration from video and biosignals (e.g., functional near-infrared spectroscopy (fNIRS)), is important for improving clinical outcomes and patient care. While biosignals offer valuable insights into brain and physiological responses associated with pain [2,3], they can also be susceptible to noise and artifacts [4]. Therefore, effective preprocessing and multimodal fusion strategies are essential to fully leverage biosignal information in automated pain assessment. Breivik et al. attempted to scale the level of pain based on facial expression [5]. The same study also investigated methods for assessing pain in individuals with communication difficulties.

The Visual Analogue Scale (VAS)/Graphic Rating Scale, Numerical Rating Scale (NRS), and Verbal Rating Scale (VRS) are examples of instruments for pain intensity measurement based on patient feedback. In the VAS assessment, a graphical line represents “no pain at all” to “pain as bad as it could be”. In NRS assessment, individuals are asked to indicate which number in between two limits can represent the level of pain that the individual is having. In the VRS assessment, patients are asked to describe the level of pain they experience [2]. Although pain level assessment is possible through VAS, NRS, and VRS, these methods rely on subjective patient feedback.

Numerous studies have been conducted to evaluate pain without relying on patient self-reports. In addition to self-report methods, observational scales are commonly used, and most consider facial expression, vocalization, and body language as important characteristics [6]. However, these scales are still manually performed by the observer; therefore, to overcome this drawback, Werner et al. performed automatic pain assessment on facial features activity using machine learning methods [7]. In contrast, Alghamdi et. al. in [8] conducted similar tasks automatically from facial images using deep learning. Other researchers have utilized modalities such as audio, electrodermal activity, and respiration rate in their studies [9].

Conventional methods, such as VAS, NRS, and VRS, are highly demanding in terms of manual feedback from patients. Relying on self-reporting can result in inconsistencies in pain assessment, as it is influenced by individual variations in patients’ perceptions, communication skills, and inclination to provide accurate pain reports [5]. The research in [7] aimed to automate this task primarily by relying on machine learning models. However, machine learning models usually require domain knowledge and expertise for manual feature extraction. Significant advancements in deep learning can provide opportunities for improved pain evaluation.

Deep learning often outperforms traditional machine learning and conventional methods, particularly in tasks involving large datasets and complex patterns, such as image and speech recognition. This advantage is due to its ability to automatically learn and extract features from raw data, reducing the need for manual feature engineering, as shown in [8]. In recent years, deep learning and the availability of public datasets have played a crucial role. They have made improvements in various fields like computer vision and healthcare. Diverse models have been developed to accomplish difficult tasks, such as natural language processing models and generative models.

Datasets serve as fundamental building blocks for researchers, enabling the effective training and evaluation of models. The Fourth International Workshop on the Automated Assessment of Pain (AAP 2024) introduced a new dataset for pain assessment, featuring both video and functional near-infrared spectroscopy (fNIRS) modalities. This presents significant challenges in integrating and synchronizing diverse data types to ensure an accurate and meaningful pain assessment. Additionally, they conducted the AI4Pain challenge [10], encouraging participants to evaluate pain levels using the provided dataset.

The development of a multimodal model that effectively integrates and synchronizes the diverse data types from video and functional near-infrared spectroscopy (fNIRS) modalities presents several challenges. First, the data from these two modalities have different characteristics and sampling rates, necessitating sophisticated alignment and fusion techniques. Video data, typically high-dimensional and temporally rich, require extensive preprocessing, including frame extraction, facial feature detection, and expression analysis [11]. In contrast, fNIRS data, which capture hemodynamic responses, require careful filtering and noise reduction to ensure the signal quality [12]. Combining these modalities to create a cohesive representation of pain signals is a complex task. In addition to the AI4Pain dataset, we also evaluated our method on the BioVid Heat Pain Database [3,13], which contains multimodal recordings of pain responses to calibrated heat stimuli. Including this dataset allows us to assess the generalizability and robustness of the proposed method across different data sources and experimental settings.

We propose a novel MMAPA multimodal deep learning model for the AI4Pain Challenge datasets. To achieve robust performance, we employed a transformer-based [14] model for feature extraction on fNIRS modality and multimodality fusion. We demonstrate a preprocessing technique that works effectively on fNIRS modality. Statistical feature extraction of the fNIRS signal was also performed, making it one of the modalities in our model. The video data, raw fNIRS signals, and fNIRS statistical features were fed into MARLIN [15], a multi-layer perceptron, and EEG Conformer [16], respectively. This comprehensive approach ensures that each modality is effectively leveraged, enhancing the overall accuracy and reliability of pain assessment. Our model demonstrates the potential for integrating diverse data types to significantly improve automated pain evaluation.

Our main contributions are summarized as follows: (1) We proposed a novel multimodality model for pain assessment using the AI4Pain dataset. (2) We demonstrated a preprocessing technique for fNIRS signals on the AI4Pain dataset. (3) We conducted extensive experiments and analyses for each modality to demonstrate the performance of the proposed model and benchmark it against multiple well-known state-of-the-art models.

The remainder of this paper is organized as follows. Section 2: Related Works reviews existing studies on Automated Pain Assessment. Section 3: Methods describes the proposed methodology, including preprocessing steps, feature extraction, fNIRS and video encoders, and fusion techniques. Section 4: Experiments and Results outlines the experimental setup, datasets, and evaluation metrics used to assess our approach. This section also presents the results and discussion, comparing our model’s performance with existing methods. Section 5: Discussion discusses MMAPA’s real-time performance, ethical concerns, and challenges in accurately detecting the “No Pain” category. Finally, Section 6: Conclusion summarizes our findings and suggests potential directions for future research. Our models and codes can be accessed at https://sdsuster.github.io/ai4painmm (accessed on 28 April 2025).

2. Related Works

Facial expressions have become a popular feature in pain assessment tasks. When experiencing pain, our nervous system responds by exhibiting several expressions, such as closed eyes, lowered brows, raised cheeks, raised upper lip, and parted lips [6]. Nowadays, data-driven models have become popular, enabling clinicians to automatically assess the level of pain a patient is experiencing. Reference [7] employed random forest and support vector machines in their research to automatically assess the level of pain a patient is experiencing. Facial feature positions, such as those of the eyes, mouth, and brows, were extracted. Their method aims to detect changes in facial expressions when a patient is experiencing pain. References [8,17] directly classified the level of pain from two images using CNN. The main advantage of CNNs is their ability to emphasize and learn from images without relying heavily on manual feature extraction techniques. This technique enhances model performance. However, these studies on pain assessment focused exclusively on facial expressions instead of employing a multimodal approach, potentially limiting the comprehensiveness of their evaluations. However, these models often lack temporal context, making it difficult to assess ongoing or changing pain levels accurately.

The fNIRS modality is often highlighted for its versatility and non-invasive nature [4]. It has been proven to be able to detect depression in [18]. They proposed a domain and frequency feature extraction method using AlexNet. They used three types of features: raw data, statistical features, and channel correlation. These features were extracted from raw fNIRS data. However, they conducted separate experiments for each feature modality. They found that channel correlation features yielded the highest accuracy. Similar to the previous study, Jinlong et al. [19] conducted a study using fNIRS to distinguish healthy controls from patients with major depression. Other tasks utilized the fNIRS modality such as specific finger tapping or rest prediction [20], motor execution or rest prediction [21], and mental workload prediction [22]. Although these studies highlight the potential of fNIRS in diverse contexts, many early studies relied on handcrafted features and did not fully leverage advanced preprocessing and multimodal integration strategies that can address challenges like signal noise and variability.

Several pain assessment tools have been developed using fNIRS. Using raw data, [23] attempted to mitigate automatic feature extraction issues in their experiments. Raw signals were then fitted and evaluated using the Bi-directional Long Short-Term Memory (Bi-LSTM) model. Another study [24] used the window-sliding method to extract local features from the fNIRS raw signal. However, recent studies have shown that traditional machine learning methods are often outperformed by deep learning approaches, which are better suited for modeling the complex, non-linear, and temporal patterns present in physiological signals such as video.

Limited research has explored the integration of video and fNIRS modalities, particularly with the AI4Pain dataset, which focuses on multimodal pain assessment. Two notable publications in this area stand out. The first study leverages the AI4Pain dataset by applying transfer learning from a pre-trained VGGNet to extract features from the video modality. These features are then used with classifiers such as ANN, majority voting, and LSTM for pain classification. However, a notable limitation of this study is that it does not utilize the fNIRS modality, which could further improve performance through multimodal fusion. The second research proposed the Twins-PainViT framework. It comprises two models, PainViT–1 and PainViT–2, which were designed for pain assessment using video and fNIRS signals. PainViT–1 extracts embeddings from both modalities, and these embeddings are visualized as waveform diagrams, which are then fused into a single image. PainViT–2 processes this unified visual representation to complete the pain assessment, utilizing a hierarchical structure with token mixing and cascaded attention mechanisms for efficient feature extraction and fusion. These studies highlight the potential of combining video and fNIRS data to enhance pain detection and analysis, advancing research in this domain.

Lu et al. [25] proposed PainAttnNet, which is a transformer-based deep learning framework designed for pain intensity classification using a single modality of physiological signals. Their methodology consists of a multi-stage unimodal architecture: a Multiscale Convolutional Network (MSCN) first captures both short- and long-range dependencies using parallel convolutional branches with varying receptive fields; this is followed by a Squeeze-and-Excitation ResNet (SE-ResNet) to enhance salient features through channel-wise attention. Finally, a transformer encoder incorporating temporal convolution and multi-head self-attention is used to model temporal dependencies. By relying solely on physiological signals (such as electrodermal activity), the model achieves competitive performance on the BioVid dataset, outperforming several prior state-of-the-art approaches. However, its unimodal nature limits its ability to exploit complementary information from other modalities, such as video or audio, which can be critical in pain assessment tasks.

EEG Conformer is a compact convolutional neural network model originally designed for electroencephalogram (EEG) modality, which is aimed at feature extraction and classification tasks [26]. They claimed that the model worked across several tasks. It also works with a very limited data size. EEG and fNIRS signals have similar characteristics, consisting of temporal and spatial information. However, EEG has high-resolution temporal and low-resolution spatial information and vice versa [16]. In our experiment, EEG Conformer will be utilized on the fNIRS modality to evaluate how well it works for pain assessment.

The EEG Conformer is a combination of CNN for feature extraction and self-attention modules inspired by the transformer model. The model uses convolution to extract temporal and spatial features and then performs global temporal feature encapsulation using a self-attention mechanism. Multi-head attention was utilized in their experiments, which demonstrated improvements compared to its baseline, EEGNet. Multi-head attention, introduced in [14], enhances the model’s ability to project queries, keys, and values. The core concept of multi-head attention is to enhance the model’s ability to focus on different parts of the input sequence simultaneously, capturing a richer representation of the data.

A multimodal fusion strategy is a process of integrating information from multiple modalities to improve the performance of the model. Qiu et al. [27] proposed a multi-level progressive learning method to fuse EEG-fNIRS modalities. Time-domain and frequency-domain features of both modalities were combined before feature selection using the Atom Search Optimization algorithm. EEG and fNIRS have similar characteristics in terms of data representation; however, video and fNIRS have significantly different data representations. Video data are represented in a higher-dimensional space compared to fNIRS. To address this issue, Gkikas and Tsiknakis [12] proposed a fusion method for video and physiological signals (e.g., ECG), introducing a video preprocessing method and employing transformer-based architectures with hierarchical attention modules for effective multimodal fusion. Although this study used ECG instead of fNIRS, due to their similar representations, fNIRS and ECG could be interchangeable.

3. Methods

The AI4Pain model is a multimodal deep learning framework designed to assess pain by integrating facial video and fNIRS signals (see Figure 1). It consists of three key branches: a pre-trained MARLIN encoder for extracting spatiotemporal features from facial video, an EEG Conformer module for capturing temporal dynamics in raw fNIRS signals, and a fully connected network for manually extracted statistical features from fNIRS. These modality-specific features are fused using a custom attention-based fusion module to enhance cross-modal interactions. The combined representation is then used to classify pain levels, enabling robust and interpretable automated pain assessment.

3.1. fNIRS Signal Preprocessing

fNIRS requires careful preprocessing to reduce noise caused by artifacts such as motion and ambient light, which otherwise limit its effectiveness. These artifacts typically manifest as low-frequency fluctuations that closely resemble genuine brain activation patterns, requiring more sophisticated artifact removal and preprocessing strategies to ensure clean signals. Traditional high-pass or Butterworth band-pass filters often fall short in eliminating such noise due to the overlapping frequency components between artifacts and physiological signals.

To address this challenge, we developed a preprocessing strategy (Algorithm 1) that includes a K-means-based noisy channel identification step. Specifically, we first compute the standard deviation of the raw signal for each individual channel across time. This statistic serves as a robust indicator of overall signal variability: clean channels tend to exhibit moderate and consistent fluctuations, while noisy channels often display excessive or irregular variations.

These per-channel standard deviation values are then fed into a K-means clustering algorithm with

k = 2

, which partitions the channels into two groups—representing low-variance (clean) and high-variance (noisy) channels. Channels assigned to the high-variance cluster are considered noisy and are excluded from subsequent processing steps.

The remaining channels are then filtered using a Butterworth band-pass filter with low and high cutoff frequencies of 0.002 Hz and 0.1 Hz, respectively, to retain relevant hemodynamic components. Finally, the signals are normalized using z-score normalization. The output of this full denoising and preprocessing pipeline is illustrated in Figure 2.

Figure 2. As a result of noise filtering, the signal became clearer, and outliers were removed. Sampling of 50 Hz is used in the experiment.

Algorithm 1 Artifact Noise Filtering Algorithm

Require: $n \geq 0$
Ensure: $y = x^{n}$
  1:
for each $c h a n n e l$ in $c h a n n e l s$ do
  2:
   if $c h a n n e l$ is noisy then
  3:
        $s i g n a l \leftarrow l o a d_s i g n a l ()$
  4:
        $X \leftarrow d i f f (s i g n a l)$ {diff is equal to $s i g n a l [i + 1] - s i g n a l [i]$ }
  5:
        $s t t d \leftarrow s t a n d a r d_d e v i a t i o n (X)$
  6:
        $X [X > = s t t d] = 0$ {Set $X [i]$ to 0 if $X [i] \geq s t t d$ }
  7:
   end if
  8:
end for
  9:
return y

3.2. Manual Feature Extractions

Manual statistical features were extracted from preprocessed fNIRS signals, offering several benefits. These features capture essential aspects of fNIRS data, such as the mean, variance, and kurtosis, providing a compact and interpretable representation. This is particularly valuable when dealing with complex brain signals, as statistical features can highlight relevant patterns and trends that may be critical for accurate classification or regression tasks. Additionally, manual feature extraction can mitigate the curse of dimensionality and enhance model performance and robustness, particularly in situations where data are limited or noisy. Overall, this approach ensures that the most informative aspects of the fNIRS signals are effectively utilized, facilitating better understanding and improved outcomes in various neuroimaging studies and applications. A total of 11 features per channel were extracted with 24 channels for oxyhemoglobin (

Δ

HbO2) and deoxyhemoglobin (

Δ

HHb). The features used in our experiment were inspired by those extracted in [18], as shown in Table 1. Compared to [18], all features in our study were extracted exclusively from the on-event period, which refers to the time intervals during which specific pain-related stimuli were applied.

The extracted manual features are then encoded through two fully connected layers to capture their essential characteristics. These encoded features are fused with features derived from the fNIRS signal, specifically the outputs from the EEG Conformer model, as presented in Figure 3. This fusion of manually extracted and Conformer-extracted features enables a more comprehensive representation of the underlying data, facilitating improved performance in subsequent tasks.

Figure 3. EEG Conformer is employed to encode spatial and temporal information through convolution networks and global features through multi-head attention module [26]. Refer to Table 2 for the detail of convolution layers.

Table 1. Extracted manual features from fNIRS signals.

Feature	Number	Description
Mean (Average)	1	Average signal amplitude
Peak Value (Peaks)	1	Maximum signal value
Valley Value (Valleys)	1	Minimum signal value
Variance	1	Signal variability
Integral	1	Total signal magnitude
Skewness (Linear)	1	Signal asymmetry around the mean
Kurtosis (Quadratic Term)	1	Signal sharpness or outlier level
Total	7

3.3. MARLIN Encoder for Video Encoding

A pre-trained MARLIN encoder was used for video representation learning in our study. This model was trained on YouTubeFace (YTF) [15] using a masked autoencoding strategy. The MARLIN encoder is built upon a Vision Transformer (ViT) backbone that processes only the visible spatiotemporal patches of the input video, effectively learning spatial and temporal patterns while ignoring masked regions. This selective encoding encourages the model to extract robust and semantically meaningful features. In the MARLIN framework, the encoder outputs latent embeddings from these visible patches, which are later used for reconstruction by the decoder. Notably, the encoder captures dynamics across time, making it suitable for analyzing facial expressions. In our experiment, we employed the MARLIN-based model with a feature dimension of

s \times 768

, where s denotes the length of the video in seconds. The encoder output served as a compact and informative representation for downstream pain estimation tasks.

3.4. fNIRS Encoding Using EEG Conformer

The fNIRS modality is divided into preprocessed raw signals and statistical features. The preprocessed raw signals are then fed into the EEG Conformer to extract both spatial and temporal information, following the architecture proposed in [26]. The “Conv” layers consist of a temporal and a spatial convolution layer. Both temporal and spatial information are encoded into tokens, which are subsequently fed into a multi-head attention module. This module is designed to learn global temporal dependencies, enhancing feature extraction. The architectural details are provided in Table 2.

Table 2. EEG Conformer convolution layers architecture.

Layer	In	Out	Kernel	Stride
Temporal Conv	1	k	$(25, 1)$	$(1, 1)$
Spatial Conv	k	k	$(1, 24)$	$(1, 1)$
Avg Pool	k	k	$(15, 1)$	$(5, 1)$
Rearrange	$(k, 1, m) \to (m, k)$

3.5. Attention-Based Fusion

Our fusion module consists of two single-attention heads, referred to as Attention Left and Attention Right. The fusion box processes two embedding inputs: left embedding and right embedding. The left embedding serves as the query for the left attention head and as the key and value for the right attention head, while the right embedding acts as the query for the right attention head and as the key and value for the left attention head. The outputs from both attention heads are subsequently combined using addition.

This attention fusion module is designed to handle only two modalities at a time. Therefore, to integrate video data, statistical features, and fNIRS signals, a stepwise approach is required. First, we fuse the statistical features with the fNIRS signals, as they are closely related. The resulting fused representation is then combined with the video modality using the same fusion process. The detailed architecture of our fusion model is shown in Figure 4.

4. Experiment and Results

4.1. Datasets

The integration of artificial intelligence (AI) and sensing technologies offers a promising opportunity to enhance pain recognition, which is a critical aspect of healthcare and affective computing. The proposed Grand Challenge aims to leverage fNIRS and facial video analysis to advance automated pain assessment. By combining these sensing modalities, the challenge seeks to capture both neurophysiological and behavioral aspects of pain, providing a more comprehensive and objective approach that has not been previously explored. Participants are invited to develop novel algorithms for pain recognition, contributing to the creation of a multimodal sensing dataset that can serve as a benchmark and resource for future research in this field [10].

The AI4Pain dataset, consisting of 65 subjects, was divided into 41 for training, 12 for validation, and 12 for testing. In each experiment, transcutaneous electrical nerve stimulation (TENS) electrodes were placed on the forearm and the back of the hand to deliver controlled pain stimuli. TENS was used to induce both High Pain and Low Pain conditions in randomized order across 12 repetitions per subject. Each trial began with a 60-s baseline recording, which was followed by a 10-s stimulation phase and a 40-s rest period. The baseline segment was labeled as No Pain. The resulting class distribution was No Pain:Low Pain:High Pain = 1:12:12. The classification task involves distinguishing among these three pain levels using multimodal signals. The fNIRS modality includes 24 channels with each channel consisting a pair of signals:

Δ

HbO2 and

Δ

HHb.

In addition to AI4Pain, we also utilize the BioVid Heat Pain Database (Part A) [3,13], which is a publicly available benchmark dataset for pain recognition. BioVid includes recordings from 87 participants exposed to calibrated heat pain stimuli administered via a thermode applied to the forearm. Each subject underwent a personalized pain calibration procedure to determine five thermal thresholds (T0–T4), where T0 represents no pain and T4 corresponds to the maximum tolerable pain level. Participants received 20 stimuli per pain level, each lasting four seconds, with randomized ordering and rest periods in between to avoid adaptation. The dataset comprises synchronized multimodal recordings, including high-resolution video of the upper body and face, electrodermal activity (EDA), electromyogram (EMG), and electrocardiogram (ECG). Following prior studies such as PainNet [25] and AI4Pain [10], we adopt a three-class classification scheme by grouping the five pain levels into No Pain (T0), Low Pain (T1–T2), and High Pain (T3–T4). While both datasets are employed, our primary experiments and model evaluation are conducted using the AI4Pain dataset due to its relevance to the Grand Challenge setting.

4.2. Implementation Details

Our model, illustrated in Figure 1, was implemented using the TensorFlow-Keras framework. We employed the Adam optimizer with a learning rate of 0.0001 and trained the model with a batch size of 64. The cross-entropy loss function was used for training. Although we initially set the maximum number of epochs to 300, as the validation loss increased while training loss decreased, indicating overfitting, we applied early stopping based on the validation loss, halting training when no improvement was observed. Our experiments on the AI4Pain dataset involved two types of dataset treatments. In the first treatment (denoted as B-1), we used the original class ratio without additional preprocessing. In the second treatment (denoted as B-6), we divided the baseline “No Pain” samples into six parts to investigate the impact of class imbalance on the model’s performance. Additionally, we validated our model using the original class ratio to ensure the robustness of our findings. Due to the relatively small dataset size, a dropout rate of 0.25 was applied to the multi-head attention modules. For data sampling, we extracted the first

8 \times 32

frames from each video and the corresponding first 8 s of fNIRS signals, using a stride of 5. Our experiments showed that these sampling parameters had minimal impact on the overall model performance.

4.3. Test Accuracy Against Competitors

The AI4Pin test labels are not publicly available. To evaluate our model’s performance, we submitted our predicted labels to the challenge organizers for testing. The organizers allow a maximum of two submissions for each team. In one of our submissions, our model achieved an accuracy of 51.33%. This score was the highest among all competitors in the challenge, positioning our model at the top in terms of performance. A detailed comparison of our model’s results with other methods can be found in Table 3. After careful consideration, we submitted the video + fNIRS model for the final evaluation, as it achieved the highest validation accuracy in our internal tests. However, this accuracy was not significantly higher than that of the model that combined video, fNIRS, and manually engineered features. Although the simpler model performed well, the marginal difference suggests that additional manual features did not enhance its effectiveness in this case.

Based on the results presented in Table 4, our proposed method demonstrates strong performance, particularly under the 10-fold cross-validation setting. In this protocol, a larger portion of the dataset is used for training, while a smaller portion is reserved for testing in each fold. This setup contrasts with LOOCV, which typically exposes models to higher variance due to the minimal size of the test set. Notably, our multimodal attention-based fusion model (MMAPA – G + V) achieved the highest accuracy in the 10-fold setting, outperforming both unimodal baselines and the established PainAttNet benchmark. This improvement suggests that our model is better suited for scenarios where more training data are available, indicating stronger fitting capacity and better generalization under constrained test conditions.

The results also reveal that the video-only modality consistently underperforms compared to the GSR-based model on the BioVid dataset. This observation indicates that physiological signals, particularly GSR, provide more reliable and discriminative features for pain detection in this context. The effectiveness of GSR likely stems from its direct association with autonomic nervous system responses to nociceptive stimuli. Furthermore, the integration of GSR and video through our attention-based fusion mechanism allows the model to capture complementary patterns across modalities. Overall, our model exhibits not only superior accuracy but also enhanced robustness and generalizability across different validation schemes.

4.4. Quantitative Analysis

To analyze and benchmark our model, we conducted extensive experiments. We evaluated our model performance using accuracy and F1 score metrics. The baseline does not have any information about F1 score metric provided. Since we have limited trials submitting test results, only test accuracy on the “video + fNIRS” model was submitted. Table 3 shows a performance comparison of pain detection models using different modalities: fNIRS, video, manual assessments, and their combinations. It includes results for baseline models and the proposed models (referred to as “MMAPA-Ours*”) under two conditions, B-6 and B-1. The baseline models using fNIRS, video, and their combination achieved validation accuracies of 43.20%, 40.00%, and 40.20%, respectively, with corresponding test accuracies of 43.30%, 40.10%, and 41.70%.

The proposed models significantly outperformed the baseline models in validation accuracy. Under condition B-6, the fNIRS-based model achieved a validation accuracy of 58.67%, while the video-based model reached 61.67%. Manually extracted statistical features and combined modalities (video + fNIRS, video + manual, fNIRS + manual, and video + fNIRS + manual) showed varying accuracies from 58.33% to 61.00%. Similar trends were observed under condition B-1, with the highest validation accuracy of 62.67% achieved by both the video + fNIRS and video + manual modalities.

In terms of F1 scores, the models generally performed well in detecting “Low Pain” and “High Pain” levels, although they struggled significantly with the “No Pain” category, often showing a score of 0.00%. We attempted to mitigate this imbalance by applying both weighted loss and synthetic sample generation. While these methods improved the detection of “No Pain”, they ultimately degraded the model’s ability to predict “Low Pain” and “High Pain” categories, resulting in a net decrease in overall accuracy. This phenomenon occurred because both the challenge validation and test sets maintained a heavily imbalanced distribution (approximately 1:6:6 for No Pain, Low Pain, and High Pain), which limited the effectiveness of balancing strategies. In contrast, when evaluated on the balanced BioVid dataset, our model exhibited robust performance across all classes. Overall, the results highlight that the proposed models, particularly those utilizing multiple modalities, provide enhanced accuracy and better overall performance in pain detection compared to baseline models, demonstrating the effectiveness of integrating various data sources for this task.

To better understand the model’s performance across pain levels, we analyzed the confusion matrix (Figure 5). The diagonal elements represent correctly classified samples, while off-diagonal entries indicate misclassifications. The model failed to correctly classify any samples from the No Pain class with all instances being misclassified as “Low Pain” or “High Pain”. This is likely due to a significant class imbalance in the dataset, where “No Pain” samples are underrepresented compared to the other classes. In contrast, the “Low Pain” class achieved strong performance, with 116 out of 144 samples correctly classified, and 28 misclassified as High Pain. The High Pain class also showed moderate performance, with 72 correct classifications and 74 misclassified as Low Pain. These results highlight the model’s sensitivity to class distribution and its tendency to confuse High Pain with Low Pain, suggesting the need for class-balancing strategies, loss weighting to improve the recognition of underrepresented classes, or a more class-balanced sample dataset.

Although biosignal modalities such as GSR demonstrated stronger performance compared to fNIRS in the BioVid experiments, this difference is primarily due to the nature of the physiological responses captured. Peripheral biosignals, such as GSR and ECG, tend to provide faster and more direct autonomic nervous system responses to acute pain stimuli [3], whereas cortical signals captured by fNIRS reflect slower hemodynamic changes that are often more susceptible to noise and motion artifacts [4]. Despite these challenges, fNIRS remains valuable in clinical contexts where direct brain activity monitoring is essential. Therefore, while certain biosignals may outperform others depending on the dataset characteristics and response types, each modality offers distinct advantages depending on the application scenario.

Moreover, the observed gap between validation and test accuracy could be partially attributed to overfitting. Given the limited training set size, the model may have captured validation-specific patterns that do not generalize well to the test set. This overfitting risk is further amplified by the inherent class imbalance across both splits, suggesting that larger and more diverse datasets would be beneficial for improving generalization performance.

In addition to accuracy and F1 analysis, we also evaluated model performance using ROC curves and AUC scores (Figure 6). On the BioVid dataset, the model achieved a strong AUC of 0.88, whereas on the AI4Pain dataset, AUC values ranged from 0.54 to 0.61, reflecting the challenges posed by subtle signal differences and class imbalance.

4.5. Effectiveness of Artifact Noise Reduction and Vizualiation

The effectiveness of artifact noise reduction is highlighted in Table 3, which focuses exclusively on models utilizing the fNIRS modality. In the table, the column labeled “Val Acc w/o AF” presents the validation accuracy achieved by our model when trained without any artifact noise filtering. This baseline result allows for a clear comparison of how much the filtering process impacts overall model performance. Notably, artifact noise filtering significantly improves accuracy. This improvement is particularly pronounced in the “(Video + fNIRS) B-1” model, where the application of noise filtering increased the overall accuracy by 2.67%. This demonstrates the critical role of effective noise reduction in enhancing model performance, especially in multimodal systems where signal quality can vary significantly between sources.

In addition to our noise filtering approach, we also compared MMAPA with traditional noise reduction techniques such as ICA, as summarized in Table 5. While ICA-based preprocessing improved the fNIRS model accuracy to 58.00%, our proposed MMAPA method achieved a slightly higher validation accuracy of 58.67% and a marginally better F1 score. This comparison highlights that while classical techniques like ICA can mitigate some artifacts, integrating task-specific multimodal processing and attention mechanisms, as seen in MMAPA, further enhances robustness and classification performance.

To identify and mitigate noisy channels, we employed the K-means clustering algorithm, which enabled us to differentiate between noisy and non-noisy signals. As shown in Figure 7, the centroids for the noisy and non-noisy channels were calculated as 29.77 and 1.24, respectively. The threshold between the two clusters was determined to be 15.5. Any channel with a standard deviation of gradient below 15.5 was classified as non-noisy, while channels with a standard deviation exceeding this value were deemed noisy. This systematic classification enabled us to process and filter the noise from a significant portion of the channels, totaling 393 channels. The K-means-based approach for noise detection proved to be an effective method for isolating noisy channels, allowing for more precise artifact noise filtering and contributing to the overall enhancement of model accuracy. The use of this noise reduction technique is essential for improving the reliability of models trained with fNIRS data, as it reduces the impact of sensor and environmental interference.

4.6. Comparison of EEG Conformer with SOTA Models

Table 6 shows our model performance using the EEG Conformer compared with other state-of-the-art models (Bi-LSTM and Bi-GRU). Most methods struggle significantly with the “No Pain” category. In the B-6 experiment, both Bi-LSTM and Bi-GRU have slightly low accuracy due to their sensitivity in detecting “No Pain”. Their high “No Pain” F1 score retracted accuracy on “Low Pain” and “High Pain”. It also show an accuracy trade-off between “No Pain” and other classes. Both Bi-LSTM and Bi-GRU tend to be overfitted easily compared to our model. Due to this issue, the model struggles to achieve the best validation accuracy. Bi-GRU also shows less effectiveness compared to other methods.

Additionally, in the B-1 experiment, the EEG Conformer model maintains its performance with a validation accuracy of 58.67% but shows a slight decrease in the mean F1 score to 39.59%. The Bi-LSTM and Bi-GRU models demonstrate improved accuracy with the smaller batch size, where Bi-LSTM (48 units) achieves the highest validation accuracy of 59.00% and a mean F1 score of 40.35%. To evaluate the proposed Algorithm 1, we validated the accuracy of fNIRS data (both unimodal and multimodal) that was preprocessed using only a Butterworth band-pass filter. The results in Table 6 show that the proposed Algorithm 1 can improve model accuracy.

4.7. EEG Conformer Parameter Effectiveness

In this ablation study on EEG Conformer parameter effectiveness, we evaluate the impact of different K values and attention dimensions on validation accuracy. The results show that for K values of 20, 30, and 40, the validation accuracies are 58.33%, 57.67%, and 58.67%, respectively, with the best performance at K = 40. However, for attention dimensions of 156, 16, and 56, the validation accuracies are 59%, 58.33%, and 59%, respectively, with no significant improvement as the attention dimensions increase, indicating potential overfitting. Thus, we selected 16 attention dimensions to avoid overfitting and optimize efficiency. For detailed results, see Table 7.

4.8. Visual Explanation Using Grad-CAM

We employed Grad-CAM (Gradient-weighted Class Activation Mapping) [34] to visualize which regions of the input video frames most influenced the model’s predictions. Grad-CAM computes the gradients of the target class score with respect to the convolutional feature maps and uses them to generate class-discriminative localization maps. These maps highlight the spatial regions that the model focuses on when making its decision.

In our experiments, we applied Grad-CAM to the output of the MARLIN encoder to generate attention heatmaps for the pain classification task. As illustrated in Figure 8, the model consistently showed moderate attention to facial regions such as the eyes and mouth corners when classifying High Pain inputs. These regions became more pronounced and visually distinguishable when the subject exhibited clear facial expressions associated with pain. However, the overall Grad-CAM activations were relatively abstract and lacked sharp localization, suggesting that the model’s attention was distributed rather than highly focused. In contrast, for No Pain inputs, the model displayed diffuse or minimal activation, often spreading attention across the entire face without clearly defined focal points. This indicates that the absence of distinctive facial features in neutral expressions may hinder the model’s ability to confidently localize informative regions, contributing to lower discriminative capacity for this class.

4.9. Effectiveness of Fusion Methods

Figure 9 illustrates the effectiveness of our attention-fused approach compared to other methods in terms of accuracy. Based on the experiment conducted under the “B-1” condition across all modalities, our attention-based fusion approach outperforms all modality combinations except for “fNIRS + manual”. While our method demonstrates superior performance, the differences in accuracy between various modality combinations remain relatively small. This suggests that while fusion strategies can enhance performance, the choice of modality pairing also plays a crucial role in achieving optimal results.

4.10. Model Complexity and Speed

Table 8 presents a comparison of model complexity and inference speed across different modalities. The MMAPA model using only fNIRS data contains significantly fewer parameters (0.49 M) compared to the video-based models (over 86 M parameters). Despite this substantial reduction in model size, the fNIRS-only model achieves competitive accuracy while offering a remarkable inference speed of 3620 samples per second. In contrast, video-based models achieve only around seven samples per second due to their much larger architectures. These results suggest that fNIRS-based models represent a more efficient and practical choice, particularly for real-time or resource-constrained applications. The lightweight nature and high inference speed of the MMAPA (fNIRS) model make it highly suitable for deployment on edge devices, enabling fast and scalable pain detection without the need for heavy computational infrastructure.

5. Discussion

Our proposed MMAPA framework is designed with potential real-time deployment in mind. During validation, we observed that the average inference speed for the fNIRS-only model was approximately 3620 samples per second, while video-based models achieved only around seven samples per second. These results suggest that the fNIRS-based MMAPA variant is highly capable of supporting real-time pain assessment even on resource-constrained devices. In contrast, the large computational overhead of video-based models highlights the need for the further optimization of video embeddings. Future work should explore more lightweight yet effective video feature extraction methods to enhance real-time applicability without sacrificing accuracy. Additionally, deployment on lower-power edge devices could benefit from techniques such as model pruning, quantization, or knowledge distillation, which we leave as future directions.

From an ethical standpoint, we acknowledge the risk of bias when applying MMAPA to underrepresented groups such as elderly or non-Western populations, particularly given the demographic limitations of the AI4Pain and BioVid datasets. Broader validation across diverse cohorts is necessary before clinical adoption to ensure fairness and generalizability.

Regarding model failure analysis, MMAPA consistently struggled in detecting the “No Pain” category, often misclassifying these instances as “Low Pain” or “High Pain”. This challenge likely arises from the more subtle facial and physiological cues present when no pain is experienced, which is a phenomenon similarly noted in prior studies. Due to dataset restrictions, we could not publicly share qualitative examples, but internal error analysis confirmed the difficulty in distinguishing the “No Pain” class reliably.

6. Conclusions

Our proposed method demonstrated a high validation accuracy. Its validation accuracy reached up to 62.7% for both the video + fNIRS and the video + manual statistical features configurations. For the final submission, we chose to report results using the video + fNIRS modality, as it showed the best performance in internal validation. However, our model achieved only 51.33% accuracy on the hidden test set. This performance gap can be primarily attributed to the high class imbalance in the test dataset, where the “No Pain” category was significantly underrepresented. A simple up-sampling strategy to generate more “No Pain” samples proved ineffective for our model. Moreover, since the challenge was designed to reward the highest overall accuracy, we prioritized optimizing for general performance and did not conduct extensive experiments specifically aimed at improving the “No Pain” classification.

The fNIRS modality faced challenges due to artifact noise, which is a known issue that requires rigorous preprocessing for optimal signal quality. Our artifact filtering approach partially mitigated these effects, although some residual noise may have impacted the final model performance. This noise is typically introduced by ambient lighting, subject movement, or improper sensor placement, resulting in signal contamination and information loss. As a result, physiological patterns became harder to distinguish, making it more difficult for the model to learn meaningful features from the fNIRS data. Without thorough preprocessing—such as the artifact filtering discussed in this study—fNIRS data remain highly susceptible to noise, limiting its standalone effectiveness in pain classification tasks.

In contrast, our experiments on the BioVid Heat Pain database demonstrated that the GSR modality provided more discriminative features for pain detection than the video modality alone. The fusion of GSR and video using our attention-based model achieved the highest accuracy in the 10-fold cross-validation setting, outperforming both unimodal baselines and prior models such as PainAttNet. These results support the value of physiological signals in automatic pain recognition and highlight the effectiveness of our multimodal attention-based fusion approach in capturing complementary information.

These findings underscore the complexity of pain classification and highlight the challenge of achieving balanced sensitivity across all pain categories. Future work should explore more advanced fusion strategies, semi-supervised or self-supervised learning to leverage underrepresented classes such as “No Pain” and the inclusion of additional datasets or modalities (e.g., EEG or EMG) to further enhance generalization. Testing the model in real-world clinical scenarios would also be a valuable next step to evaluate its practical applicability.

Author Contributions

Conceptualization, J.V., H.Y., S.Y., S.K. (Seungwon Kim), J.S. and S.K. (Soohyung Kim); methodology, J.V., H.Y., S.Y., S.K. (Seungwon Kim), J.S. and S.K. (Soohyung Kim); software, J.V. and A.D.; original draft preparation, J.V.; draft review, editing and language error checking, A.D. and S.Y.; supervision and project administration were performed by H.Y. All authors have read and agreed to the published version of the manuscript.

Funding

This work was partly supported by the Institute of Information & Communications Technology Planning & Evaluation (IITP)-Innovative Human Resource Development for Local Intellectualization program grant funded by the Korea government (MSIT) (IITP-2025-RS-2022-00156287, 25%), the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIT) (RS-2023-00208397, 25%), the Institute of Information & communications Technology Planning & Evaluation (IITP) under the Artificial Intelligence Convergence Innovation Human Resources Development grant funded by the Korea government (MSIT) (IITP-2023-RS-2023-00256629, 25%), and the MSIT (Ministry of Science and ICT), Korea, under the ITRC (Information Technology Research Center) support program (IITP-2025-RS-2024-00437718, 25%) supervised by the IITP (Institute for Information & Communications Technology Planning & Evaluation).

Data Availability Statement

3rd party data restrictions apply to the availability of these data. Data were obtained from the AI4Pain Grand Challenge 2024 and are available at https://sites.google.com/view/ai4pain (accessed on 3 April 2024) with the permission of the challenge organizers. Data from the BioVid Heat Pain Database were obtained from the Neuro-Information Technology Group at Otto von Guericke University Magdeburg and are available at https://www.nit.ovgu.de/BioVid.html (accessed on 10 April 2025) upon request.

Acknowledgments

GenAI was used to assist of grammar editing and English quality enhancing.

Conflicts of Interest

The authors declare no conflicts of interest.

References

Rojas, R. Multimodal physiological sensing for the assessment of acute pain. Front. Pain Res. 2023, 4, 1150264. [Google Scholar] [CrossRef]
Haefeli, M.; Elfering, A. Pain assessment. Eur. Spine J. 2006, 15, S17–S24. [Google Scholar] [CrossRef]
Walter, S.; Gruss, S.; Limbrecht-Ecklundt, K.; Traue, H.C.; Werner, P.; Al-Hamadi, A.; Walter, S.; Diniz, N. The BioVid Heat Pain Database: Data for the Advancement and Systematic Validation of an Automated Pain Recognition System. In Proceedings of the 2013 IEEE International Conference on Cybernetics (CYBCO), Lausanne, Switzerland, 13–15 June 2013. [Google Scholar]
Quaresima, V.; Ferrari, M. Functional Near-Infrared Spectroscopy (fNIRS) for Assessing Cerebral Cortex Function During Human Behavior in Natural/Social Situations: A Concise Review. Organ. Res. Methods 2019, 22, 46–68. [Google Scholar] [CrossRef]
Breivik, H. Assessment of pain. Br. J. Anaesth. 2008, 101, 17–24. [Google Scholar] [CrossRef]
Werner, P.; Lopez-Martinez, D.; Walter, S.; Al-Hamadi, A.; Gruss, S.; Picard, R. Automatic Recognition Methods Supporting Pain Assessment: A Survey. IEEE Trans. Affect. Comput. 2022, 13, 530–552. [Google Scholar] [CrossRef]
Werner, P.; Al-Hamadi, A.; Limbrecht-Ecklundt, K.; Walter, S.; Gruss, S.; Traue, H. Automatic Pain Assessment with Facial Activity Descriptors. IEEE Trans. Affect. Comput. 2017, 8, 286–299. [Google Scholar] [CrossRef]
Alghamdi, T.; Alaghband, G. Facial Expressions Based Automatic Pain Assessment System. Appl. Sci. 2022, 12, 6423. [Google Scholar] [CrossRef]
Schwenker, F.; Scherer, S. Multimodal Pattern Recognition of Social Signals in Human-Computer-Interaction; Lecture Notes in Computer Science; Springer International Publishing: Cham, Switzerland, 2017; Volume 10183. [Google Scholar] [CrossRef]
Fernandez Rojas, R.; Hirachan, N.; Joseph, C.; Seymour, B.; Goecke, R. The AI4Pain Grand Challenge 2024: Advancing Pain Assessment with Multimodal fNIRS and Facial Video Analysis. In Proceedings of the 2024 12th International Conference on Affective Computing and Intelligent Interaction, Glasgow, UK, 16–18 September 2024. [Google Scholar]
Benavent-Lledo, M. A Comprehensive Study on Pain Assessment from Multimodal Sensor Data. Sensors 2023, 23, 9675. [Google Scholar] [CrossRef]
Gkikas, S. Multimodal automatic assessment of acute pain through facial videos and heart rate signals utilizing transformer-based architectures. Front. Pain Res. 2024, 5, 1372814. [Google Scholar] [CrossRef]
Werner, P.; Al-Hamadi, A.; Niese, R.; Walter, S.; Gruss, S.; Traue, H.C. Towards Pain Monitoring: Facial Expression, Head Pose, a New Database, an Automatic System and Remaining Challenges. In Proceedings of the British Machine Vision Conference (BMVC), Bristol, UK, 9–13 September 2013. [Google Scholar]
Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I. Attention Is All You Need. arXiv 2023, arXiv:1706.03762. [Google Scholar]
Cai, Z.; Ghosh, S.; Stefanov, K.; Dhall, A.; Cai, J.; Rezatofighi, H.; Haffari, R.; Hayat, M. MARLIN: Masked Autoencoder for Facial Video Representation LearnINg. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 17–24 June 2023; pp. 1493–1504. [Google Scholar]
Liu, Z.; Shore, J.; Wang, M.; Yuan, F.; Buss, A.; Zhao, X. A systematic review on hybrid EEG/fNIRS in brain-computer interface. Biomed. Signal Process. Control 2021, 68, 102595. [Google Scholar] [CrossRef]
Semwal, A.; Londhe, N. Automated Facial Expression Based Pain Assessment Using Deep Convolutional Neural Network. In Proceedings of the 3rd International Conference on Intelligent Sustainable Systems (ICISS 2020), Thoothukudi, India, 3–5 December 2020; pp. 366–370. [Google Scholar] [CrossRef]
Wang, R.; Hao, Y.; Yu, Q.; Chen, M.; Humar, I.; Fortino, G. Depression Analysis and Recognition Based on Functional Near-Infrared Spectroscopy. IEEE J. Biomed. Health Inf. 2021, 25, 4289–4299. [Google Scholar] [CrossRef] [PubMed]
Chao, J. FNIRS Evidence for Distinguishing Patients with Major Depression and Healthy Controls. IEEE Trans. Neural Syst. Rehabil. Eng. 2021, 29, 2211–2221. [Google Scholar] [CrossRef] [PubMed]
Khan, H.; Noori, F.; Yazidi, A.; Uddin, M.; Khan, M.; Mirtaheri, P. Classification of individual finger movements from right hand using fnirs signals. Sensors 2021, 21, 7943. [Google Scholar] [CrossRef]
Hamid, H.; Naseer, N.; Nazeer, H.; Khan, M.; Khan, R.; Khan, U. Analyzing Classification Performance of fNIRS-BCI for Gait Rehabilitation Using Deep Neural Networks. Sensors 2022, 22, 1932. [Google Scholar] [CrossRef]
Khalil, K.; Asgher, U.; Ayaz, Y. Novel fNIRS study on homogeneous symmetric feature-based transfer learning for brain–computer interface. Sci. Rep. 2022, 12, 3198. [Google Scholar] [CrossRef]
Rojas, R.; Romero, J.; Lopez-Aparicio, J.; Ou, K. Pain Assessment Based on fNIRS Using Bi-LSTM RNNs. In Proceedings of the International IEEE/EMBS Conference on Neural Engineering (NER 2021), Baltimore, MD, USA, 24–27 April 2023; pp. 399–402. [Google Scholar] [CrossRef]
Fernandez Rojas, R.; Huang, X.; Ou, K.L. Toward a functional near-infrared spectroscopy-based monitoring of pain assessment for nonverbal patients. J. Biomed. Opt. 2017, 22, 106013. [Google Scholar] [CrossRef]
Lu, X.; Wu, Z.; Lu, B.; Yang, G. PainAttnNet: Multiscale Convolutional Transformer for Pain Intensity Classification Using Biosignals. arXiv 2023, arXiv:2303.06845. [Google Scholar]
Song, Y.; Zheng, Q.; Liu, B.; Gao, X. EEG Conformer: Convolutional Transformer for EEG Decoding and Visualization. IEEE Trans. Neural Syst. Rehabil. Eng. 2023, 31, 710–719. [Google Scholar] [CrossRef]
Qiu, L.; Zhong, Y.; He, Z.; Pan, J. Improved classification performance of EEG-fNIRS multimodal brain-computer interface based on multi-domain features and multi-level progressive learning. Front. Hum. Neurosci. 2022, 16, 973959. [Google Scholar] [CrossRef]
Prajod, P.; Schiller, D.; Don, D.W.; André, E. Faces of Experimental Pain: Transferability of Deep Learned Heat Pain Features to Electrical Pain. arXiv 2024, arXiv:2406.11808. [Google Scholar]
Gkikas, S.; Tsiknakis, M. Twins-PainViT: Towards a Modality-Agnostic Vision Transformer Framework for Multimodal Automatic Pain Assessment using Facial Videos and fNIRS. arXiv 2024, arXiv:2407.19809. [Google Scholar]
Gouverneur, P.; Li, F.; Adamczyk, W.M.; Szikszay, T.M.; Luedtke, K.; Grzegorzek, M. Comparison of feature extraction methods for physiological signals for heat-based pain recognition. Sensors 2021, 21, 4838. [Google Scholar] [CrossRef] [PubMed]
Shi, H.; Chikhaoui, B.; Wang, S. Tree-Based Models for Pain Detection from Biomedical Signals. In Proceedings of the International Conference on Smart Homes and Health Telematics, Ames, IA, USA, 28 June–2 July 2008; pp. 183–195. [Google Scholar]
Hochreiter, S.; Schmidhuber, J. Long Short-Term Memory. Neural Comput. 1997, 9, 1735–1780. [Google Scholar] [CrossRef]
Chung, J.; Gulcehre, C.; Cho, K.; Bengio, Y. Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling. arXiv 2014, arXiv:1412.3555. [Google Scholar]
Selvaraju, R.R.; Cogswell, M.; Das, A.; Vedantam, R.; Parikh, D.; Batra, D. Grad-CAM: Visual Explanations from Deep Networks via Gradient-Based Localization. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 618–626. [Google Scholar]

Figure 1. The MMAPA framework design of our model consists of two sections: feature extraction and a fusion module. Facial video representation vectors are extracted from videos using MARLIN Encoder [15]. At the end, all features were fused using an attention-based layer.

Figure 4. Attention fuse box architecture. Our fusion method receives 2 embedded vectors as inputs. The first dense layer is the neck section, which is useful to shape. In our fusion method, 512 units of any dense layer were utilized. Q: query, K: key, V: value.

Figure 5. Vizualiation of confusion matrix MMAPA (Video + fNIRS).

Figure 6. ROC curves for BioVid (left) and AI4Pain (right). The high AUC on BioVid (0.88) demonstrates strong separability, whereas there are lower AUCs on AI4Pain (0.54–0.61).

Figure 7. Vizualiation of K-means result.

Figure 8. Grad-CAM visualizations highlighting the model’s attention regions for High Pain and No Pain video inputs. Background images source: BioVid Database [3].

Figure 9. Validation accuracy comparison of baseline and proposed fusion methods using various data modalities, highlighting improvements through our fusion techniques. The orange bar represents “B-1”, and the blue bar represents “B-6”.

Table 3. Model comparison against baseline and other methods.

Methods (Modality)	Val Acc w/oAF	Val Accuracy	Val F1 Score	Test Accuracy
Baseline (fNIRS)	-	43.20%	-	43.30%
Baseline (Video)	-	40.00%	-	40.10%
Baseline (Video + fNIRS)	-	40.20%	-	41.70%
Simple ANN + Voting (Video) [28]	-	-	-	49.00%
LSTM (Video) [28]	-	-	-	43.00%
Twins-PainViT [29]	-	-	-	46.76%
MMAPA-Ours (fNIRS)_B-6	59.00%	58.67%	41.71%	-
MMAPA-Ours (Video)_B-6	-	61.67%	41.83%	-
MMAPA-Ours (Manual)_B-6	-	59.00%	39.88%	-
MMAPA-Ours (Video + fNIRS)_B-6	59.33%	60.33%	39.70%	-
MMAPA-Ours (Video + Manual)_B-6	-	59.33%	40.29%	-
MMAPA-Ours (fNIRS + Manual)_B-6	58.00%	58.33%	38.93%	-
MMAPA-Ours (Video + fNIRS + Manual)_B-6	59.33%	61.00%	41.42%	-
MMAPA-Ours (fNIRS)_B-1	57.00%	58.67%	39.59%	-
MMAPA-Ours (Video)_B-1	-	61.67%	41.79%	-
MMAPA-Ours (Manual)_B-1	-	59.00%	40.13%	-
MMAPA-Ours (Video + fNIRS)_B-1	60.00%	62.67%	41.76%	51.33%
MMAPA-Ours (Video + Manual)_B-1	-	62.67%	42.63%	-
MMAPA-Ours (fNIRS + Manual)_B-1	59.00%	59.67%	40.66%	-
MMAPA-Ours (Video + fNIRS + Manual)_B-1	61.00%	62.33%	42.08%	-

^B-1 Original class ratio. ^B-6 Upsampled “No Pain” class for balancing. ^{Val Acc w/oAF} Validation accuracy without artifact noise filtering proposed in Algorithm 1.

Table 4. Comparison of model performance with PainAttNet on the BioVid dataset using two evaluation protocols. LOOCV refers to Leave-One-Out Cross-Validation. GSR denotes the galvanic skin response modality, while G + V represents the fusion of GSR and video inputs. The left subtable reports LOOCV accuracy (a), and the right subtable presents 10-fold cross-validation accuracy (b). Our proposed method, MMAPA, consistently outperforms PainAttNet across different modalities and evaluation settings. We performed benchmark on

T_{0}

vs.

T_{4}

scenario (baseline vs. highest pain in BioVid dataset).

Table 4. Comparison of model performance with PainAttNet on the BioVid dataset using two evaluation protocols. LOOCV refers to Leave-One-Out Cross-Validation. GSR denotes the galvanic skin response modality, while G + V represents the fusion of GSR and video inputs. The left subtable reports LOOCV accuracy (a), and the right subtable presents 10-fold cross-validation accuracy (b). Our proposed method, MMAPA, consistently outperforms PainAttNet across different modalities and evaluation settings. We performed benchmark on

T_{0}

vs.

T_{4}

scenario (baseline vs. highest pain in BioVid dataset).

(a)
Model	LOOCV Accurracy
MLP [30]	84.22%
XGBoost [31]	85.23%
PainAttNet [25]	85.34%
MMAPA-Ours (GSR)	89.22% ± 11.38%
MMAPA-Ours (Video)	73.33% ± 15.71%
MMAPA-Ours (G + V)	87.22% ± 12.18%
(b)
Model	10-Fold Accurracy
PainAttNet [25]	84.09% ± 7.51%
MMAPA-Ours (GSR)	84.68% ± 5.38%
MMAPA-Ours (Video)	67.86% ± 5.29%
MMAPA-Ours (G + V)	85.75% ± 5.21%

Table 5. Comparison between MMAPA and ICA/PCA noise reduction methods.

Method	Accuracy	F1 Score
ICA	58.00%	39.34%
MMAPA (Ours)	58.67%	39.59%

Table 6. Validation accuracy and F1 scores of various neural network models using fNIRS signal.

Methods	Val Accuracy	Val F1 Score
Methods	Val Accuracy	No Pain	Low Pain	High Pain	Mean
EEG Conformer_B-6	58.67%	0.00%	65.47%	59.66%	41.71%
Bi-LSTM (48 units)_B-6 [32]	50.00%	10.53%	47.01%	59.55%	39.03%
Bi-GRU (48 units)_B-6 [33]	49.67%	0.00%	54.04%	51.25%	35.10%
Bi-LSTM (96 units)_B-6 [32]	45.33%	7.41%	46.87%	51.03%	35.10%
Bi-GRU (96 units)_B-6 [33]	50.67%	5.71%	55.56%	51.26%	37.51%
EEG Conformer_B-1	58.67%	0.00%	63.61%	55.17%	39.59%
Bi-LSTM (48 units)_B-1 [32]	59.00%	0.00%	59.66%	61.38%	40.35%
Bi-GRU (48 units)_B-1 [33]	55.00%	0.00%	57.98%	54.09%	37.36%
Bi-LSTM (96 units)_B-1 [32]	57.00%	0.00%	59.80%	56.64%	38.81%
Bi-GRU (96 units)_B-1 [33]	57.00%	0.00%	60.38%	55.56%	57.52%

Note: All Bi-LSTM and Bi-GRU models consist of 2 layers with N units each of Bi-LSTM or Bi-GRU.

Table 7. EEG Conformer accuracy within different parameters. (a) Accuracy comparison on different K values and (b) accuracy comparison on different attention dims. (*) indicates the setting used in our final experiments.

(a)
K Value	Validation Accurracy
20	58.33%
30	57.67%
40 *	58.67%
(b)
Attention Dims	Validation Accurracy
16 *	58.33%
56	59.00%
156	58.67%

Table 8. Model complexity and inference speed (measured on an NVIDIA RTX 2080 Ti).

Model	Accuracy	Parameters	Inference Speed (Samples/s)
MMAPA (fNIRS)	58.67%	0.49 M	3620
MMAPA (Video)	61.67%	86.23 M	7.5
MMAPA (Video + fNIRS)	62.67%	86.73 M	7.35

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Vianto, J.; Divakaran, A.; Yang, H.; Yeom, S.; Kim, S.; Kim, S.; Shin, J. Multimodal Model for Automated Pain Assessment: Leveraging Video and fNIRS. Appl. Sci. 2025, 15, 5151. https://doi.org/10.3390/app15095151

AMA Style

Vianto J, Divakaran A, Yang H, Yeom S, Kim S, Kim S, Shin J. Multimodal Model for Automated Pain Assessment: Leveraging Video and fNIRS. Applied Sciences. 2025; 15(9):5151. https://doi.org/10.3390/app15095151

Chicago/Turabian Style

Vianto, Jo, Anjitha Divakaran, Hyungjeong Yang, Soonja Yeom, Seungwon Kim, Soohyung Kim, and Jieun Shin. 2025. "Multimodal Model for Automated Pain Assessment: Leveraging Video and fNIRS" Applied Sciences 15, no. 9: 5151. https://doi.org/10.3390/app15095151

APA Style

Vianto, J., Divakaran, A., Yang, H., Yeom, S., Kim, S., Kim, S., & Shin, J. (2025). Multimodal Model for Automated Pain Assessment: Leveraging Video and fNIRS. Applied Sciences, 15(9), 5151. https://doi.org/10.3390/app15095151

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers. See further details here.

Article Menu

Multimodal Model for Automated Pain Assessment: Leveraging Video and fNIRS

Abstract

1. Introduction

2. Related Works

3. Methods

3.1. fNIRS Signal Preprocessing

3.2. Manual Feature Extractions

3.3. MARLIN Encoder for Video Encoding

3.4. fNIRS Encoding Using EEG Conformer

3.5. Attention-Based Fusion

4. Experiment and Results

4.1. Datasets

4.2. Implementation Details

4.3. Test Accuracy Against Competitors

4.4. Quantitative Analysis

4.5. Effectiveness of Artifact Noise Reduction and Vizualiation

4.6. Comparison of EEG Conformer with SOTA Models

4.7. EEG Conformer Parameter Effectiveness

4.8. Visual Explanation Using Grad-CAM

4.9. Effectiveness of Fusion Methods

4.10. Model Complexity and Speed

5. Discussion

6. Conclusions

Author Contributions

Funding

Data Availability Statement

Acknowledgments

Conflicts of Interest

References

Share and Cite

Article Metrics

Article Access Statistics

Further Information

Guidelines

MDPI Initiatives

Follow MDPI